Experience with Fault Injection Experiments for FMEA
L. Grunske et al.
Software: Practice and Experience 2011; 41:1233–1258
SUMMARY
Failure Modes and Effects Analysis (FMEA) is a widely used system and software safety analysis
technique that systematically identifies failure modes of system components and explores whether these
failure modes might lead to potential hazards. In practice, FMEA is typically a labor-intensive team-based
exercise, with little tool support. This article presents our experience with automating parts of the FMEA
process, using a model checker to automate the search for system-level consequences of component
failures. The idea is to inject runtime faults into a model based on the system specification and check
if the resulting model violates safety requirements, specified as temporal logical formulas. This enables
the safety engineer to identify if a component failure, or combination of multiple failures, can lead to a
specified hazard condition. If so, the model checker produces an example of the events leading up to the
hazard occurrence which the analyst can use to identify the relevant failure propagation pathways and
co-effectors. The process is applied to three medium-sized case studies modeled with Behavior Trees.
Performance metrics for SAL model checking are presented. Copyright © 2011 John Wiley & Sons, Ltd.
KEY WORDS: behavior trees; failure modes and effects analysis; fault injection experiments; model
checking
1. INTRODUCTION
The development of safety critical systems demands assurance that the system does not pose harm
for people or the environment, even if one or more of the system’s components fail [1, 2]. The
related assurance process is known as hazard analysis. The goal of this process is to give evidence
that a system design avoids hazardous states in the presence of component failures. This involves
identifying cause–consequence relationships between component failure modes and hazardous
conditions at the system level. The process is typically conducted as an alternating forward and
backward search, between causes and effects [3]. Backward search tries to identify all the relevant
causes for each hazard; a commonly used technique is Fault Tree Analysis (FTA) [4, 5]. Forward
search identifies the consequences of already identified failure modes and is most commonly
performed by a systematic technique known as Failure Modes and Effects Analysis (FMEA) [6],
which is the main topic of this paper.
Correspondence to: Lars Grunske, Swinburne University of Technology, Faculty of ICT, Hawthorn, VIC 3122, Australia.
Saad Zafar was at Griffith University when working on this project.
The remainder of this article is organized as follows. Section 2 introduces the preliminaries, namely the Behavior Tree notation and model checking with the SAL tool suite. In Section 3, we introduce our overall approach for tool-supported FMEA. Section 4 documents in a step-by-step fashion how this approach is applied to a system, namely the industrial metal press. We report on the results for two further systems, the mine pump and the ambulatory infusion pump (AIP), in Section 5. Section 6 discusses the practicability and usefulness of using this approach, and issues of scalability. Finally, we conclude and discuss future work in Section 7.
2. PRELIMINARIES
Figure 1. Different node types of the BT syntax: (a) state realization; (b) selection; (c) guard; (d) internal input event; (e) internal output event; (f) external input event; and (g) external output event.
(d–e) an internal event modeling communication and data flow between components within the
system, where B specifies an event; the control flow can pass the internal input event
node when the event occurs (the message is sent), otherwise it is blocked until it occurs;
(f–g) an external event modeling communication and data flow between the system and its
environment, where B specifies an event; the control flow can pass the external input event
node when the event occurs (the message is sent), otherwise it is blocked until it occurs.
A node can also be labeled with one or more flags, used to indicate control flow. In Figure 1
Flag is used as a placeholder for a set of flags. Flags often refer to a matching node: that is, a
node of the same type and with the same component name and behavior. A flag can be either:
1. a reversion ˆ in the case where the node is a leaf node, indicating that the control flow loops
back to the matching node;
2. a reference node =>, indicating that the flow continues from the matching node;
3. a synchronization point =, where the control flow waits until all other threads with a matching
synchronization point have reached the synchronization point; or
4. killing of a thread −−, which kills any thread that starts with the matching node.
Nodes also have tags, which enable tracing back to requirements identifiers in the original
system functional requirements specification. Since tags have no consequences for the semantics
of the formal model they will largely be omitted from the models given below.
Control flow in the system is modeled by either a normal or a branching edge. Figure 2 shows
the different types of normal edges. As an example, we use state realization nodes in the figure,
however, there are no restrictions on the node types. A normal arrowed edge models sequential flow
between two steps (Figure 2(a)). If two nodes are drawn together (i.e. with no edge in between, see Figure 2(c)), the two steps occur together atomically, i.e. simultaneously to all intents and purposes, at this level of system description.
Figure 3 shows the two types of branching edges using two branches only. In general, however,
a branching can involve more than two branches. Branching behavior is either concurrent or
alternative. Concurrent branching (Figure 3(a)) models threads running in parallel (Figure 3(a)
only depicts the first node of each thread). As an example the threads in the figure start with a
guard node. The branches, however, can start with any node type.
In alternative branching (Figure 3(b)), the control flow follows only one of the branches.
Alternative branches can either comprise selections only (for example, as shown in Figure 3(b)) or
no selections at all. Alternative branching over selections operates as a non-deterministic choice
over the branches with a satisfied selection condition Bi. If none of the selections is satisfied, the thread terminates.
Figure 3. Branching behavior in the BT syntax: (a) concurrent flow and (b) alternative flow.
When translating BT models into SAL code, each state realization node and each internal/external output node is translated into the update of a SAL action; the actions are automatically labeled A1, A2, etc. Each selection and event node is translated into the guard of an action. Additionally, a
‘program-counter’ variable pc is introduced into (the guard and update of) each action in order to
control the sequence in which transitions are to be fired. Internal events are modeled by Boolean
variables which are set to true when the input is sent (by an internal output node) and are set
to false either if the input is consumed (by the matching internal input node) or if another step
has been taken and the event has not been consumed. This reflects that events are transient and
might be missed. External input events become SAL input variables which are not controlled
by the model and behave arbitrarily. To render the transition relation of the SAL model total (a
prerequisite for model checking) we add an ELSE action which is enabled if no other action is.
That is, all paths in the model are infinite.
Our SAL translator includes an option for giving priority to internal actions over external actions.
An external action is one involving an external input or output event in the BT model; all other
actions are internal actions. An example of generated SAL code (in the context of FMEA) is
shown in Figure 8. More detail about the translation is given in [16] and a formal definition of the
translational semantics can be found in [23].
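As an illustration of this translation scheme, the following is a minimal, hand-written SAL sketch of a single sensor thread. The identifiers are ours, and the code actually generated by the translator (such as that shown in Figure 8) differs in detail:

  press_sketch: CONTEXT =
  BEGIN
    SensorState: TYPE = {high, low};

    top_sensor: MODULE =
    BEGIN
      INPUT atTop : BOOLEAN            % external input event, not controlled by the model
      LOCAL pc1 : [0..1]               % 'program counter' controlling the order of the nodes
      LOCAL sensor : SensorState
      LOCAL sensorHighMsg : BOOLEAN    % internal event: message to the controller
      INITIALIZATION
        pc1 = 0; sensor = low; sensorHighMsg = FALSE
      TRANSITION
      [
        A1: pc1 = 0 AND atTop -->      % event node: can only fire when the event occurs
              sensor' = high;          % state realization node
              pc1' = 1
        []
        A2: pc1 = 1 -->
              sensorHighMsg' = TRUE;   % internal output node: send message to the controller
              pc1' = 0                 % reversion back to the start of the thread
        []
        ELSE -->                       % keeps the transition relation total
      ]
    END;
  END

Here the pc1 variable enforces the order of the BT nodes within the thread, and the ELSE action ensures that every state has a successor, as required by the model checker.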
From the SAL tool suite we use a symbolic model checker for LTL. This tool checks if a system,
modeled in the SAL language, satisfies a given property, specified in LTL. In the case where the
property is violated the tool outputs a counter-example, i.e. a sequence of states that leads to the
violation. Symbolic model checkers, such as the SAL tool, use a graph structure, called Binary
Decision Diagrams (BDDs) [31], to represent the model internally. This representation is symbolic,
which means that the tool does not store the information about states and state transitions explicitly.
Therefore, the generation of a counter-example in the case of a violation is a separate step that is
triggered once the model checking algorithm reports a violation of the checked property. Moreover,
specific to symbolic model checking is that the runtime of the checker is not always proportional to the size of the state space of the model. Instead, the size of the BDD representation gives an indication of the size of the model that is processed. Some models can be represented more succinctly than others, depending on factors such as the interdependencies between the variables. These factors
will influence how we can interpret the statistical results from our experiments.
Reachability of hazardous states can easily be modeled in LTL using operators for next state X,
always G, until U, and eventually F. For example, a simple traffic-light system needs to ensure
the following requirement: if the button is pressed then eventually the light will turn green. This
requirement can be formalized in LTL as follows:

G((button = pressed) ⇒ F(light = green))
The temporal operator G ensures that the desired behavior is always the case (in every possible run of the system), whereas F expresses that at some state in the future the desired change will happen. The symbol ⇒ denotes implication. In some cases, however, it might not be sufficient
if the desired state is reached eventually since this can be at any time in the future. If we want
to know that it only takes a fixed number of steps to reach the goal the formula has to be more
specific. The following formula gives an example:

G((button = pressed) ⇒ X(X(light = green)))
This formula models that after the button has been pressed it takes two steps to set the light to green. Since the next operator X models that something occurs in the next state, the combination of two next operators (i.e. X(X(. . .))) models that something occurs in the state after the next state, i.e. in two steps.
Another example of a typical LTL formula is given by the following requirement: The car does not go before the light turns green. This can be formalized using the until operator U:

G((car = stop) U (light = green))
This formula states that it is always the case (G) that the car is in state stop until (U) the light
is in state green. Implicitly, this also requires that the light will be green eventually. That is, a
behavior in which the light will never be green violates this formula. Note that until is useful for
formalizing properties where something occurs after a number of internal steps, where the exact
number of steps may be indeterminate (e.g. because it includes steps of components that have no
bearing on the property in question).
Apart from temporal operators, LTL supports propositional logical operators, such as ∧ (and),
∨ (or), ¬ (negation), and ⇒ (implication).
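In SAL, such properties are attached to a module as named theorems and checked with the symbolic model checker (sal-smc). A minimal, self-contained sketch with an illustrative traffic module (the module and variable names are ours, not taken from the case-study models):

  traffic_sketch: CONTEXT =
  BEGIN
    Colour: TYPE = {red, green};

    traffic: MODULE =
    BEGIN
      INPUT buttonPressed : BOOLEAN
      LOCAL light : Colour
      INITIALIZATION light = red
      TRANSITION
      [
        buttonPressed AND light = red --> light' = green
        []
        ELSE -->
      ]
    END;

    % the first example formula from the text, stated as a SAL theorem
    lightLiveness: THEOREM traffic |- G(buttonPressed => F(light = green));
  END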
[Figure (process overview): the BT model M and its failure views Fv1, Fv2, Fv3 are translated to SAL code, the safety conditions are formalized as LTL formulae, and cause–consequence analysis is performed by model checking (MC).]
a set of LTL formulae. After (manually) identifying the components’ failure modes (4) the user
can create the (BT) failure views of the model (5) using the support provided by the BTE tool.
This completes the inputs required for the checking process and the model checker automatically
checks the safety requirements (6) (a plain box indicates full automation of the step). The output
of the model checker is analyzed manually by the user and failure consequences are identified (7).
Countermeasures can be chosen (8) and integrated into the BT model (9) using the BTE tool. The
process then repeats the analysis.
As noted above, our approach goes beyond the traditional FMEA in supporting effects analysis
of multiple failures. Failure views corresponding to multiple failures can be constructed with the
aid of the BTE tool. However, it is left to the analyst to decide how best to handle the potential
combinatorial explosion in the number of failure views to be considered. One approach would be to
do a breadth-first sweep, pruning the experiment model at cut sets (i.e. combinations of component failures that give rise to hazards). Another approach to reduce the number of failure views for
multiple failure modes is described in [7]. In this approach each failure mode is associated with a
likelihood and multiple failure modes are only selected where the combined likelihood is greater
than a predefined threshold. Obviously, one assumption for this approach is that all failure modes
are statistically independent.
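Under this independence assumption, the combined likelihood of a candidate combination of failure modes f1, . . . , fk is simply the product of the individual likelihoods, so the selection criterion of [7] can be written as follows (notation ours):

p(f1) × p(f2) × · · · × p(fk) ≥ p_threshold

where p(fi) denotes the likelihood associated with failure mode fi and p_threshold the predefined threshold.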
[Figure 6: BT threads of the sensors, with internal input events > atTop < and > atBottom <, state realizations [high] and [low], and internal output events < sensor_high > and < sensor_low >.]
For example, Figure 6 shows how an omission failure can be injected into the BT model of
a sensor—in this case, where a sensor has failed stuck high: i.e. it is not able to realize its state
low thereafter and does not send the low message. The corresponding state realization node in
the BT model has to be eliminated, and also any corresponding output nodes with which other
components are notified about the sensor’s change to state low. Figure 6 shows the BT thread with
two shaded nodes, the state realization and the internal output, which are to be deleted to generate
the BT of the failure view.
The BTE tool (see Section 3.1) provides support for generating a failure view in the following
way: Each node in the BT model can be annotated by a set of view identifiers to specify the views
the node is part of (including the fault-free view). To generate a new failure view V the user simply
specifies which of the existing nodes will be in the view and annotates them with the identifier
for V, then adds new nodes into the BT model as needed, with appropriate transitions. Menu options in the tool facilitate this process, for example by annotating all nodes with V in a single step, so that the user simply needs to remove the annotation from the nodes corresponding to removed behavior when creating the failure view for an omission failure.
As noted above, the BT approach to system modeling makes fault injection relatively straightfor-
ward compared to the more traditional models such as Statecharts [13]. This is even more evident
when it comes to injection of multiple independent component failure modes, since independent
functions map to different parts of the tree, meaning that interference does not need to be explicitly
considered. Of course, failure of one component may mean that another component is misled into
behaving incorrectly—a so-called command failure—but such consequences ‘emerge’ out of the
resulting BT model, rather than having to be added explicitly to the model by the analyst, as
happens in most other approaches.
[Figure 7: four failure views (1–4) and the failure-view transition matrix with transitions 1 → 2, 1 → 3, 2 → 4 and 3 → 4.]
Figure 7 shows an example of how the user specifies the failure-view transition matrix
(as supported in the BTE tool). In the example the user intends to analyze a double-point of
failure of the system. For this purpose four different failure views have been generated by the user
(1–4): 1 represents the fault-free view; 2 and 3 represent single-failure views, where two different
failure modes are present (say, f1 of component C1 and f2 of component C2, respectively); and 4 where both f1 and f2 are present. The failure-view transition matrix now specifies the different
orders in which the failure views may occur: Transition 1 → 2 corresponds to the failure of
component C1 occurring first, for example, and transition 2 → 4 corresponds to the subsequent
failure of component C2. The user also intends to investigate the reverse order: first failure of
component C2 occurring (transition 1 → 3) followed by a failure of component C1 (transition
3 → 4). Because the model checker considers all possible paths through the corresponding BTs,
the analysis covers the occurrence of the two failures of the components at any possible time
during system operation.
This approach gives the analyst a lot of control over which fault combinations are considered.
For example, it is possible to investigate the effects of stepwise degradation of a component’s
behavior in this way, by using separate failure views for the different levels of (faulty) behavior
and transitions between them. It is also possible to examine what would happen if one component
fault was corrected before another one, to check if hazards occur during maintenance for example.
The translation of the extended model, consisting of the individual BT models and the failure-
view transition matrix, into SAL proceeds in essentially the same manner as for individual BT
models, as explained in Section 2.2 above. The main change is that the guard in each action
is extended with details of the views in which the action is applicable. Thus for example,
actions A1 and A2 are enabled in both views, since they represent behavior that was present
in both the fault-free BT and the failure view. Actions A3 and A4 are enabled only during
fault-free behavior: the sensor correctly changes state (sensor’=low) and transmits a message
(sensorLowMsg’=true). The program counter variable is updated (pc1’=1) to implement
the reversion in action A4. Action A5 is enabled only when the fault is present; the sensor does not
change state nor sends a message. Instead, the behavior progresses back to the beginning (action
A1 is re-enabled). The last action, A6, represents the view transition, from the normal view to the
failure view; this may occur at any time during the operation of the system.
This SAL code is generated automatically according to the rules defined in [23]. This functionality, including customization of the translation, is available to the user via a menu option in the BTE tool (see Section 3.1).
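The structure described above can be illustrated with a small hand-written sketch. The action labels loosely follow the description of Figure 8, but the view name, variable names and exact bodies are ours; the generated code differs in detail:

  press_views: CONTEXT =
  BEGIN
    View: TYPE = {normal, stuckHigh};
    SensorState: TYPE = {high, low};

    sensor_with_views: MODULE =
    BEGIN
      INPUT notAtTop : BOOLEAN            % external event: plunger has left the top
      LOCAL view : View                   % currently active (failure) view
      LOCAL pc1 : [1..3]
      LOCAL sensor : SensorState
      LOCAL sensorLowMsg : BOOLEAN
      INITIALIZATION
        view = normal; pc1 = 1; sensor = high; sensorLowMsg = FALSE
      TRANSITION
      [
        A1: pc1 = 1 AND notAtTop -->          % enabled in both views
              pc1' = 2
        []
        A3: view = normal AND pc1 = 2 -->     % fault-free view only: realize state low
              sensor' = low; pc1' = 3
        []
        A4: view = normal AND pc1 = 3 -->     % fault-free view only: notify the controller
              sensorLowMsg' = TRUE; pc1' = 1  % reversion back to the matching node
        []
        A5: view = stuckHigh AND pc1 = 2 -->  % omission failure: no state change, no message
              pc1' = 1
        []
        A6: view = normal -->                 % view transition: the fault may occur at any time
              view' = stuckHigh
        []
        ELSE -->
      ]
    END;
  END

A transition matrix with several views, as in Figure 7, simply adds one such view-transition action per allowed transition.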
for example, the changes in the plunger state typically take much longer than internal transitions
in the control system components (see Section 4).
The prioritization feature is built into the translation step as an option and can be seen as
an alternative interpretation of the model. But since it eliminates some behaviors, it might lead
to unsound results. Therefore, we use this facility only in those cases where we have found a
violation of the property without the prioritization but are not happy with the counter-example.
The prioritization of internal communication then filters out undesired counter-examples and can
lead to a more interesting counter-example.
In the following we apply our approach for tool-supported FMEA to a case study, namely the
industrial metal press§. The industrial metal press is a system for compressing sheets of metal
into body parts for vehicles. When the press is turned on, a plunger begins rising, with the aid
of an electric motor. The operator may then load the metal into the press. When the plunger
reaches the top of the press, the operator can then push and hold a button, which turns off the
motor, causing the plunger to fall. For the safety of the operator, the button is located a safe
distance away from the falling plunger. When the plunger reaches the bottom, thereby compressing
the metal, the motor is automatically turned on. The plunger begins rising again, repeating the
cycle.
The operator may abort the fall of the plunger by releasing the button, but only if the plunger has not yet reached the ‘point of no return’, also referred to as the PONR. It is dangerous to turn on
the motor when the plunger is falling beyond this point, as the momentum of the plunger could
cause the motor to explode, exposing the operator to flying parts. Figure 9 shows the physical
architecture of the press. See [10] for more detail.
§ All details of this case study, as well as the full model, generated SAL code and counter-examples, are available via http://www.itee.uq.edu.au/∼dccs/FMEA.
[Figure 9: physical architecture of the press, with top, PoNR and bottom sensors and the motor connected to the PLC via serial and parallel interfaces, plus hydraulic clutches. Below it, part of the BT model of the press, with requirement tags R1–R6 covering the plunger, button, power, controller, motor and operator behaviors.]
sensor messages. Upon receiving a message from a sensor it sets an internal attribute that captures
the current status of the sensor. This internal representation of the sensors and the motor is
then responsible for triggering the controller’s action (e.g. sending a message to switch on the
motor) which is captured in the thread for the controller’s main behavior on the left in Figure 11.
This architecture closely models the communication between the sensors and the controller in the real system and will reproduce problems that can be encountered during the system’s runtime.
The top sensor (modeled in the thread shown on the right-hand side of Figure 11) gets its input from the plunger component being at the top. On receiving this message, the top sensor changes into state high and outputs a message to the controller. After receiving a message that the plunger is no longer at the top, the top sensor realizes the state low and again notifies the controller about this state change.
Figure 11. Left: part of the controller thread; right: the top sensor thread.
The plunger changes state in response to messages it receives from the motor and external
events it receives from the physical environment. For example, it transitions from state At bottom
to Rising belowPONR after receiving a message that the motor is on (via internal event motor On);
when the environment signals that the plunger has now passed the PONR (via external input event
risingAbovePONR), the plunger changes to its state Rising abovePONR; and finally after receiving
an external event indicating that the top has been reached, it changes to At top. Upon receiving
a message that the motor is off, it transitions to state Falling slow, and then to Falling fast upon
passing the PONR (indicated by means of external input event fallingFast), and so on.
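A simplified, hand-written sketch of this behavior in SAL is given below. The state and variable names are ours; in the actual model the messages and external events are handled through the translation scheme of Section 2.2 rather than as plain inputs:

  plunger_sketch: CONTEXT =
  BEGIN
    PlungerState: TYPE = {at_bottom, rising_belowPONR, rising_abovePONR,
                          at_top, falling_slow, falling_fast};

    plunger: MODULE =
    BEGIN
      INPUT risingAbovePONR, atTop, fallingFast, atBottom : BOOLEAN  % external input events
      INPUT motorOnMsg, motorOffMsg : BOOLEAN                        % internal events from the motor
      LOCAL state : PlungerState
      INITIALIZATION state = at_bottom
      TRANSITION
      [
        state = at_bottom        AND motorOnMsg      --> state' = rising_belowPONR
        []
        state = rising_belowPONR AND risingAbovePONR --> state' = rising_abovePONR
        []
        state = rising_abovePONR AND atTop           --> state' = at_top
        []
        state = at_top           AND motorOffMsg     --> state' = falling_slow
        []
        state = falling_slow     AND fallingFast     --> state' = falling_fast
        []
        state = falling_fast     AND atBottom        --> state' = at_bottom
        []
        ELSE -->
      ]
    END;
  END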
Further details of the model are available via http://www.itee.uq.edu.au/∼dccs/FMEA/.
Note that other designs are possible which satisfy the functional requirement given above.
For example, in the design studied here, the controller sends a message to switch the motor on
as soon as it receives a message from the bottom sensor that the plunger is at the bottom. In
an alternative design, the controller might instead be looking for a signal that the plunger has
passed the PONR before it checks the bottom sensor state. In the fault-free case the two system
designs will result in exactly the same press behavior, in the sense of how the plunger responds to
operator commands. But in the presence of component faults the two designs have quite different
behaviors.
Note that condition 3 goes beyond a simple static condition of the system, such as particular
component state combinations, and instead relates to a dynamic property, namely that one compo-
nent (the plunger) should not transition to another state before a certain event takes place (namely,
the motor comes on). This is an example where the richness of temporal logic is needed.
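For illustration only, a property of this shape can be written with the until operator; the following sketch is ours and may differ from the formalization of condition 3 actually used in the study:

G((plunger = AtBottom) ⇒ ((plunger = AtBottom) U (motor = On)))

that is, whenever the plunger is at the bottom, it remains there until the motor comes on.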
operator has pushed the button, the motor is switched off, but before the plunger starts to fall (and
drops below the top sensor position), the operator releases the button again. This situation violates
the formalized safety condition 1.
Although such behavior is strictly possible in a real system, its likelihood is low, and the system
would recover by aborting operation once the plunger starts to fall, so that the associated risk is
probably negligible. Since at this stage of the analysis we are more interested in checking for higher-likelihood, higher-risk outcomes, and less interested in detailed timing issues, we opt to use the
prioritization mechanism described in Sections 2.2 and 3.6 above. This has the effect of giving
priority to system-internal behavior over external input. External input events are used in our model
to represent (a) the physical movement of the plunger (i.e. after some time the model receives
the external events risingAbovePONR, atTop, fallingFast, and atBottom, respectively) and (b) the
arbitrary behavior of the operator (i.e. external input events are pushedButton and releasedButton).
These actions consume (some unspecified amount of) time. Internal events in contrast model the
communication between the system components which is almost instantaneous. Therefore, it seems
a reasonable assumption that internal events are faster than external events.
When the prioritization option is used, the model reacts to external input events only once all
internal events have been recognized and reacted to. This has the effect of ruling out behaviors
such as the plunger falling from the top to PONR faster than the controller can communicate with
the motor, or the user releasing the button before the plunger starts to fall (see counter-example
described above).
Technically speaking, the model with prioritization is an under-approximation of the original model, since some behavior is removed. This can of course be dangerous from a safety analysis viewpoint, since it may inadvertently dismiss unsafe behaviors such as race conditions, and it should therefore be used with a great degree of caution. On the other hand, it is a useful technique for checking issues other than timing-related ones, as will be demonstrated below.
When the prioritization feature is enabled within our translation tool and the checks of the
safety conditions are rerun on the fault-free model, this time all four conditions are satisfied. For
the following FMEA, we also use this technique: thus all the results have to be interpreted with
respect to this assumption of prioritized internal actions.
Further experiments were conducted applying our approach in the same fashion as outlined in
Section 4 to two medium-sized systems, namely the mine pump and the AIP. As a result, we
have gained further insight into the practicality and performance results of the proposed method.
As with the press system we have performed an FMEA investigating single-failure modes and
double-failure modes. We did not apply any heuristics to minimize the number of experiments
in double-failure mode; instead, in order to investigate scalability, we applied a brute-force approach that considers every possible combination. The aim was to show how the approach can be applied in
practice and to study its performance.
In the following subsections we briefly describe the case studies and report on the findings for
the FMEA process¶. The last subsection provides further discussion on the performance of our
approach for all three systems.
¶ Please note that the full description of the models used, the hazard conditions, as well as the results of the FMEA can be accessed via http://www.itee.uq.edu.au/∼dccs/FMEA/.
Our model contains internal actions (for which action = internal) which are not observable and thus not relevant in measuring the safety of the system.
To implement the required condition three steps are needed, each of which must occur
immediately.
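Schematically, such a condition takes the form of nested next-state operators, where φ and ψ1, ψ2, ψ3 are placeholders for the triggering condition and the three steps (the concrete conditions are given in the full model on the project website):

G(φ ⇒ X(ψ1 ∧ X(ψ2 ∧ X(ψ3))))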
We found violations of all three requirements when exercising the different single-failure views. In total, 4 of the 12 single-failure modes investigated led to a violation of at least one safety requirement. Interestingly, combining single-failure modes into double-failure modes only revealed violations that had already been detected in a corresponding single-failure mode.
3. As soon as the pump operation is interrupted the drug volume must be re-calculated.
To model the interruption of the pump we utilize the next state operator within the antecedent
of the implication. If the pump is pumping and in the next state it is not pumping, then it
has been interrupted and the drug volume must be calculated. This can be modeled using the
following formula:
G((pump = pumping) ∧ X(¬(pump = pumping)) ⇒ X(aipVol = calculateVol))
However, running the model checker showed us that this condition is violated in the case where the pump is stopped for other reasons. Obviously, in this case the volume cannot be re-calculated since the system as a whole is down. This result hints at a different safety condition, namely how the system can be shut down safely when the power supply breaks down, which is not investigated further here. To check, however, that the system operates
appropriately under normal conditions, we have to modify the formula by extending the
antecedent of the implication: if the pump is pumping and in the next state has stopped
pumping and the controller is still running ((aipPump = running)∧(aip = on)), then in the
next state the drug volume must be re-calculated. The resulting formula reads as follows:

G((pump = pumping) ∧ X(¬(pump = pumping) ∧ (aipPump = running) ∧ (aip = on)) ⇒ X(aipVol = calculateVol))
We have investigated eight single-failure modes, and could identify violations of two of the safety requirements. When checking the double-failure modes we did not find any new violations that were not already found in single-failure mode. However, two of the violations that were found in single-failure mode were eliminated when checked in combination with a second failure mode.
the number of nodes in the BTs is similar. However, the mine pump is a more complex model in that it contains more threads and also more transitions. Most of the behavior in the AIP’s controller thread is modeled atomically, which rules out interleaving behavior between the sensor and actuator threads, thus reducing the complexity of the model’s state space.
These interpretations are also reflected in Table II which compares the complexity of the model
checking experiments of the three models. The number of BDD variables reflects the complexity
of the internal representation of the models in the model checker. The number of visited states
shows the size of the state space that is investigated. This figure does not necessarily correlate
with the number of BDD variables since some state spaces can be represented more compactly
than others. This mismatch can be observed, for instance, between the mine pump and the AIP
model. The number of iterations indicates after how many steps (of applying the transition relation
to the states) a fixpoint could be found and all reachable states had been visited. Obviously, the more
iterations are necessary, the longer the computation takes. However, for some models one iteration
(or step) is a simpler and faster computation than for others. Although the AIP model requires
more iterations, the overall execution time is less than for the mine pump. These figures can vary
from experiment to experiment (i.e. between no-failure, single-point-of-failure, and double-point-
of-failure experiments). The numbers given here are taken from the no-failure experiments.
Columns four and five of Table II summarize the average total execution times for single-point-of-failure experiments and double-point-of-failure experiments, respectively. The experiments were executed on a Dell PowerEdge 2600 Server with a 2.4 GHz Intel Xeon processor and 1 GB of RAM, running RedHat Linux 8.0 as the operating system. The numbers in Table II are quite similar for the press model and the AIP model. That is, the failures have relatively little impact on the complexity of these models. This is different for the mine pump model, where the double-point-of-failure experiments ran longer than the single-point-of-failure experiments. This suggests that for the AIP model, as well as for the press model, the time to compute the satisfiability of the requirements is relatively small compared to the time for building the initial model, and therefore the different runtimes for computing the fixpoint do not become apparent in the same way as for the mine pump.
The last column of Table II shows the total time it took to run all double-failure mode experiments (figures are in minutes). It can be seen that for larger models it will be beneficial to apply heuristics to reduce the number of experiments run at higher failure-mode orders. For the mine pump, for example, it took over 4 days to finalize the checks in double-failure mode. While each experiment run is itself relatively fast, the sum of many runs can be quite large.
The total execution times for each of the model checking experiments (single-failure views vs double-failure views) are presented in Figure 13. (Please be aware that the diagrams have different timescales on the vertical axis.) The results are ordered according to the safety requirements being checked. An analysis of the performance statistics shows a fairly homogeneous distribution across all runs. Variation in the execution times results mainly from the complexity of the system model. In addition, the structure of the formalized safety requirements has an effect on the model-checking performance. For example, for the metal press the third safety requirement has a higher execution time than the other safety conditions; this theorem is defined by three nested temporal operators, one of which is the until operator, resulting in a more complex model checking computation. Finally, if a safety condition is violated, the execution time may increase if a long counter-example needs to be generated.
Figure 13. Total execution time of the case studies with single-failure views and double-failure views. [Six plots: metal press, mine pump and AIP, each for single and double failures; vertical axis: time in seconds, horizontal axis: experiment number.]
6. DISCUSSION
the functional requirements and thus simplifies the validation (since the model obviously satisfies
the requirements). A comparison with other formal system modeling techniques is given in [39].
On the other hand, the modeling notation cannot prevent errors in the model (and the require-
ments) or wrong assumptions which might undermine the results of the proposed FMEA process.
Obviously, modeling has to be done with care and assumptions made have to be documented when
presenting the findings. We find, however, that the use of a model checker is of great help in
debugging the model and validating that it does behave according to the modeler’s intention.
7. CONCLUSIONS AND FUTURE WORK

In this paper, we have presented an approach to automating FMEA based on the BT language. This
process uses fault injection experiments and model checking to explore whether component failures
can lead to violations of hazard conditions, formalized as temporal logical formulas. Technically,
we are able to provide tool support for the generation of fault injection experiments for multiple
runtime failures. We argue that the BT language makes fault injection relatively simple, in contrast
to other methods.
We applied our approach to three case studies: the metal press, the mine pump, and the AIP.
For each of the case studies we have formalized the hazard conditions and applied our process
of tool-supported FMEA to investigate single-failure and double-failure modes. As a result, we
could identify several relationships between failure modes and possible hazards, pointing to weaknesses in the design that should be resolved or mitigated. Furthermore, the performance
results for the fault injection experiments are presented and discussed.
In future work, we would like to extend our tool with a feature that animates the counter-
examples directly in the BT that is used for the fault injection experiment. This would help the user
to better visualize the series of steps between occurrence of the fault and occurrence of the hazard.
We also plan to investigate generation of multiple counterexamples [42, 43], to improve coverage
of the different ways that the cause–consequence relationship between a fault and a hazard can
manifest itself.
In this work we only consider un-timed BTs as the foundation for our fault injection experiments. To investigate the consequences of timing failures, timed BTs [44] and timed fault injection experiments [45] would be an alternative that allows for a more detailed FMEA. However,
although theoretically possible, the scalability of the approach remains an open problem, owing to the complexity of the underlying real-time model checking.
Another stream of research is to use the recently developed probabilistic extension of BTs [46] to perform a probabilistic FMEA [47, 48] using a stochastic model checker. This enables the analyst to quantify the cause–consequence relationships that are revealed, and allows the analyst to define tolerable hazard rates for each hazard condition and to experiment with different component failure
rates, for example, to derive quantified component safety requirements. Furthermore, based on a
probabilistic FMEA it is possible to include dependence on the environment and the deployment
of software components to hardware nodes.
REFERENCES
1. Leveson NG. Safeware: System Safety and Computers. Addison-Wesley: Reading, MA, 1995.
2. Lutz RR. Software engineering for safety: a roadmap. ICSE—Future of SE Track. ACM Press: New York, 2000;
213–226.
3. Lutz RR, Woodhouse RM. Requirements analysis using forward and backward search. Annals of Software
Engineering 1997; 3:459–475.
4. Vesely WE, Goldberg FF, Roberts NH, Haasl DF. Fault Tree Handbook. U.S. Nuclear Regulatory Commission,
1996.
5. IEC 61025. IEC (International Electrotechnical Commission) Fault-Tree-Analysis (FTA), 1990.
6. IEC 60812. IEC (International Electrotechnical Commission), Functional safety of electrical/electronical/
programmable electronic safety/related systems, Analysis Techniques for System Reliability—Procedure for
Failure Mode and Effect Analysis (FMEA), 1991.
7. Price CJ, Taylor N. Automated multiple failure FMEA. Reliability Engineering and System Safety 2002;
76(1):1–10.
8. Papadopoulos Y, Walker M, Parker D, Lonn H, Törngren M, Chen D, Johansson R, Sandberg A. Semi-automatic
FMEA supporting complex systems with combinations and sequences of failures. SAE Journal of Passenger
Cars-Mechanical Systems 2009; 2(1):791–802.
9. Walker M, Papadopoulos Y. Qualitative temporal analysis: Towards a full implementation of the Fault Tree
Handbook. Control Engineering Practice 2009; 17(10):1115–1125.
10. Atchison B, Lindsay P, Tombs D. A case study in software safety assurance using formal methods. Technical
Report, University of Queensland, SVRC 99-31, 1999.
11. Reese JD, Leveson NG. Software deviation analysis. Proceedings of the 19th International Conference on Software
Engineering. ACM Press: New York, 1997; 250–261.
12. Bieber P, Castel C, Seguin C. Combination of fault tree analysis and model checking for safety assessment of
complex system. Proceedings of the Fourth European Dependable Computing Conference (EDCC-4) (Lecture
Notes in Computer Science, vol. 2485), Grandoni F (ed.). Springer: Berlin, 2002; 19–31.
13. Schneider F, Easterbrook S, Callahan J, Holzmann G. Validating requirements for fault tolerant systems using
model checking. Proceedings of the Third International Conference on Requirements Engineering. IEEE Computer
Society: Colorado Springs, CO, U.S.A., 1998; 4–13.
14. Bozzano M, Villafiorita A. Improving system reliability via model checking: The FSAP/NuSMV-SA safety
analysis platform. International Conference on Computer Safety, Reliability, and Security (SAFECOMP 2003)
(Lecture Notes in Computer Science, vol. 2788). Springer-Verlag: Berlin, 2003; 49–62.
15. Cichocki T, Górski J. Failure mode and effect analysis for safety-critical systems with software components.
Proceedings of the 19th International Conference on Computer Safety, Reliability and Security, SAFECOMP
2000 (Lecture Notes in Computer Science, vol. 1943), Koornneef F, van der Meulen M (eds.). Springer: Berlin,
2000; 382–394.
16. Grunske L, Lindsay PA, Yatapanage N, Winter K. An automated failure mode and effect analysis based on high-
level design specification with Behavior Trees. Proceedings of the Fifth International Conference on Integrated
Formal Methods (IFM 2005) (Lecture Notes in Computer Science, vol. 3771), Romijn J, Smith G, van de Pol J
(eds.). Springer: Berlin, 2005; 129–149.
17. Heimdahl MPE, Choi Y, Whalen MW. Deviation analysis: A new use of model checking. Automated Software
Engineering 2005; 12(3):321–347.
18. Powell D. Requirements evaluation using Behavior Trees—Findings from industry. Industry Track of Australian
Conference on Software Engineering (ASWEC 2007). Available at: http://www.behaviorengineering.org/docs/
ASWEC07 Industry Powell.pdf [15 March 2010].
19. Boston J. Behavior trees—How they improve engineering behaviour. Sixth Annual Software & Systems Engineering
Process Group Conference (SEPG), Melbourne, Australia, 2008. Available at: http://www.behaviorengineering.org
[15 March 2010].
20. Dromey RG. From requirements to design: Formalizing the key steps. International Conference on Software
Engineering and Formal Methods. IEEE Computer Society: Washington, DC, U.S.A., 2003; 2–13.
21. Behavior Engineering website. Available at: http://www.behaviorengineering.org [15 March 2010].
22. Wen L, Dromey RG. From requirements change to design change: A formal path. International Conference on
Software Engineering and Formal Methods (SEFM 2004). IEEE Computer Society: Washington, DC, U.S.A.,
2004; 104–113.
23. Grunske L, Winter K, Yatapanage N. Defining the abstract syntax of visual languages with advanced graph
grammars—A case study based on Behavior Trees. Journal of Visual Language and Computing 2008;
19(3):343–379.
24. Papacostantinou P, Tran T, Lee P, Phillips V. Implementing a Behaviour Tree analysis tool using Eclipse
development frameworks. Experience Report, Proceedings of the Australian Software Engineering Conference
(ASWEC08). IEEE Computer Society: Washington, DC, U.S.A., 2008; 61–66.
25. Hall A. Seven myths of formal methods. IEEE Software 1990; 7(5):11–19.
26. Berry DM. Formal methods: The very idea—Some thoughts about why they work when they work. Science of
Computer Programming 2002; 42(1):11–27.
27. Colvin R, Hayes IJ. A semantics for Behavior Trees. ACCS Technical Report ACCS-TR-07-01, ARC Centre for
Complex Systems, April 2007.
28. de Moura L, Owre S, Rueß H, Rushby J, Shankar N, Sorea M, Tiwari A. SAL 2. International Conference
on Computer-Aided Verification, (CAV 2004) (Lecture Notes in Computer Science, vol. 3114), Alur R, Peled D
(eds.). Springer: Berlin, 2004; 496–500.
29. Emerson EA. Temporal and modal logic. Handbook of Theoretical Computer Science, vol. B, van Leeuwen J
(ed.). Elsevier Science Publishers: Amsterdam, 1990.
30. Back R, von Wright J. Trace refinement of action systems. Concurrency Theory (CONCUR ’94) (Lecture Notes
in Computer Science, vol. 836), Jonsson B, Parrow J (eds.). Springer: Berlin, 1994; 367–384.
31. Bryant RE. Graph-based algorithms for boolean function manipulation. IEEE Transactions on Computers 1986;
C-35(8):677–691.
32. BTE. Genetic software engineering tools. Available at: http://www.sqi.gu.edu.au/gse/tools [15 March 2010].
33. Dwyer MB, Avrunin GS, Corbett JC. Patterns in property specifications for finite-state verification. Proceedings
of the 1999 International Conference on Software Engineering (ICSE’99). Association for Computing Machinery:
New York, 1999; 411–421.
34. Bitsch F. Safety patterns—The key to formal specification of safety requirements. International Conference on
Computer Safety, Reliability and Security (SAFECOMP 2001) (Lecture Notes in Computer Science, vol. 2187).
Springer: Berlin, 2001; 176–189.
35. Bondavalli A, Simoncini L. Failure Classification with respect to Detection. Esprit Project Nr 3092 (PDCS:
Predictably Dependable Computing Systems), 1990.
36. Winter K, Yatapanage N. The metal press case study. Available at: http://www.itee.uq.edu.au/∼dccs/FMEA [15
March 2010].
37. Burns A, Lister A. A framework for building dependable systems. The Computer Journal 1991; 34(2):173–181.
38. Winter K, Hayes IJ, Colvin R. Integrating requirements: The behavior tree philosophy. Proceedings of the Eighth
IEEE International Conference on Software Engineering and Formal Methods (SEFM). IEEE Computer Society:
Washington, DC, U.S.A., 2010.
39. Lindsay PA. Behavior trees: From systems engineering to software engineering. Proceedings of the Eighth
IEEE International Conference on Software Engineering and Formal Methods (SEFM). IEEE Computer Society:
Washington, DC, U.S.A., 2010.
40. Tang Z, Dugan JB. BDD-based reliability analysis of phased-mission systems with multimode failures. IEEE
Transactions on Reliability 2006; 55(2):350–360.
41. Yatapanage N, Winter K. Slicing Behavior Tree models for verification. Proceedings of the Sixth IFIP International
Conference on Theoretical Computer Science (TCS 2010), IFIP AICT 323, 2010.
42. Aljazzar H, Hermanns H, Leue S. Counterexamples for timed probabilistic reachability. Proceedings of the Third
International Conference on Formal Modeling and Analysis of Timed Systems (FORMATS 2005) (Lecture Notes
in Computer Science, vol. 3829), Pettersson P, Yi W (eds.). Springer: Berlin, 2005; 177–195.
43. Han T, Katoen JP, Damman B. Counterexample generation in probabilistic model checking. IEEE Transactions
on Software Engineering 2009; 35(2):241–257.
44. Grunske L, Winter K, Colvin R. Timed behavior trees and their application to verifying real-time systems. 18th
Australian Software Engineering Conference (ASWEC 2007). IEEE Computer Society: Washington, DC, U.S.A.,
2007; 211–222.
45. Colvin R, Grunske L, Winter K. Timed behavior trees for failure mode and effects analysis of time-critical
systems. Journal of Systems and Software 2008; 81(12):2163–2182.
46. Colvin R, Grunske L, Winter K. Probabilistic timed Behavior Trees. Proceedings of the Sixth International
Conference on Integrated Formal Methods (IFM 2007) (Lecture Notes in Computer Science, vol. 4591), Davies J,
Gibbons J (eds.). Springer: Berlin, 2007; 156–175.
47. Grunske L, Colvin R, Winter K. Probabilistic model-checking support for FMEA. Proceedings of the Fourth
International Conference on the Quantitative Evaluation of Systems (QEST 2007). IEEE Computer Society:
Washington, DC, U.S.A., 2007; 119–128.
48. Aljazzar H, Fischer M, Grunske L, Kuntz M, Leitner-Fischer F, Leue S. Safety analysis of an airbag system
using probabilistic FMEA and probabilistic counter examples. Sixth International Conference on the Quantitative
Evaluation of Systems. IEEE Computer Society: Washington, DC, U.S.A., 2009; 299–308.