SAFECOMP 2012
These proceedings are based on the authors' copies of the camera-ready submissions.
The document is structured according to the official Springer LNCS 7612
conference proceedings:
– DOI: 10.1007/978-3-642-33678-2
– ISBN: 978-3-642-33677-5
Preface
Since 1979, when the first SAFECOMP conference was organized by the Tech-
nical Committee on Reliability, Safety and Security of the European Workshop
on Industrial Computer Systems (EWICS TC7), the SAFECOMP conference
series has always been a mirror of current trends and challenges in highly critical
systems engineering.
The key theme of SAFECOMP 2012 is "virtually safe – making system safety traceable". This ambiguous theme addresses two important aspects of critical
systems. On the one hand, systems are always claimed to be virtually safe, which
often means they are safe unless certain very rare events happen. However, recent accidents, such as Fukushima, have shown that these assumptions often do not hold. As a consequence, we must reconsider what acceptable and residual risk should mean. The second meaning of the theme addresses the question
of making system safety understandable. Safety cases and arguments are often based on a deep understanding of the system and its behavior. Displaying such dynamic behavior in a visual way, or even in a virtual reality scenario, might help readers understand the arguments better and find flaws more easily.
SAFECOMP has always seen itself as a conference connecting industry and
academia. To account for this, we introduced separate categories for industrial
and academic papers. More than 70 submissions from authors in 20 countries
were reviewed and the best 33 papers were selected for presentation at the con-
ference and publication in this volume. In addition, three invited talks, given by Prof. Jürgen Leohold (CTO of Volkswagen), Prof. Marta Kwiatkowska (Oxford University) and Prof. Hans Hansson (Mälardalen University), were included in the conference program. Safety, security and reliability together form a very broad topic, which touches many different domains of application. In 2012, we decided to co-locate five scientific workshops, which focus on different current topics ranging
from critical infrastructures to dependable cyber-physical systems. The SAFE-
COMP workshops are not included in this volume but in a separate SAFECOMP
LNCS volume.
As Program Chairs, we want to extend a very warm thank you to all 60 members
of the international program committee. The comprehensive reviews provided
the basis for the productive discussions at the program committee meeting held
in May in Munich, which was hosted by Siemens. We also want to thank the local
organization team at the Otto-von-Guericke-Universität Magdeburg (OVGU),
the local chairs Gunter Saake, Michael Schenk and Jana Dittmann, the Center
for Digital Engineering (CDE) and the Virtual Development and Training Center
(VDTC).
Finally, we wish you interesting reading of the articles in this volume. On behalf of EWICS TC7, we also invite you to join the SAFECOMP community and hope you will join us at the 2013 SAFECOMP conference in Toulouse.
Program Committee
Sponsors
Organization Team
– Augustine, Marcus
– Fietz, Gabriele
– Gonschorek, Tim
– Güdemann, Matthias
– Köppen, Veit
– Lipaczewski, Michael
– Ortmeier, Frank
– Struck, Simon
– Weise, Jens
can be addressed together more efficiently. This would allow reasoning about
the design and safety aspects of parts of the systems (components) in relative
isolation, without consideration of their interfaces and emergent behaviour, and
then dealing with these remaining issues in a more structured manner without having to revert to the current holistic practices. The majority of research on such
compositional aspects has concentrated on the functional properties of systems
with a few efforts dealing with timing properties. However, much less work has
considered non-functional properties, including dependability properties such as
safety, reliability and availability.
This keynote provides an introduction to component-based software devel-
opment and how it can be applied to development of safety-relevant embed-
ded systems, together with an overview and motivation of the research being
performed in the SafeCer and SYNOPSIS projects. Key verification and safety
argumentation challenges will be presented and solutions outlined.
Sensing Everywhere: Towards Safer and More
Reliable Sensor-enabled Devices
(Invited Talk)
Marta Kwiatkowska
and aims to establish quantitative properties, for example, calculating the probabil-
ity or expectation of a given event. Tools such as the probabilistic model checker
PRISM [6] are widely used to analyse safety, dependability and performabil-
ity of system models in several application domains, including communication
protocols, sensor networks and biological systems.
The lecture will give an overview of current research directions in automated
verification for sensor-enabled devices. This will include software verification for
TinyOS [7], aimed at improving the reliability of embedded software written in
nesC; as well as analysis of sensor network protocols for collective decision mak-
ing, where the increased levels of autonomy demand a stochastic games approach
[8]. We will outline the promise and future challenges of the methods, includ-
ing emerging applications at the molecular level [9] that are already attracting
attention from the software engineering community [10].
Acknowledgement. This research has been supported in part by the ERC grant VERIWARE and the Oxford Martin School.
References
1. Sankaranarayanan, S., Fainekos, G.: Simulating insulin infusion pump risks by
in-silico modeling of the insulin-glucose regulatory system. In: Proc. CMSB’12.
LNCS, Springer (2012) To appear.
2. Jiang, Z., Pajic, M., Moarref, S., Alur, R., Mangharam, R.: Modeling and verifi-
cation of a dual chamber implantable pacemaker. In: TACAS. (2012) 188–203
3. Kroeker, K.L.: The rise of molecular machines. Commun. ACM 54 (2011) 11–13
4. U.S. Food and Drug Administration: List of Device Recalls
5. Kwiatkowska, M.: Quantitative verification: Models, techniques and tools. In: Proc.
6th joint meeting of the European Software Engineering Conference and the ACM
SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE),
ACM Press (2007) 449–458
6. Kwiatkowska, M., Norman, G., Parker, D.: PRISM 4.0: Verification of probabilistic
real-time systems. In Gopalakrishnan, G., Qadeer, S., eds.: Proc. 23rd International
Conference on Computer Aided Verification (CAV’11). Volume 6806 of LNCS.,
Springer (2011) 585–591
7. Bucur, D., Kwiatkowska, M.: On software verification for TinyOS. Journal of Systems and Software 84 (2011) 1693–1707
8. Chen, T., Forejt, V., Kwiatkowska, M., Parker, D., Simaitis, A.: Automatic verifica-
tion of competitive stochastic systems. In: Proc. 18th International Conference on
Tools and Algorithms for the Construction and Analysis of Systems (TACAS’12).
Volume 7214 of LNCS., Springer (2012) 315–330
9. Lakin, M., Parker, D., Cardelli, L., Kwiatkowska, M., Phillips, A.: Design and
analysis of DNA strand displacement devices using probabilistic model checking.
Journal of the Royal Society Interface 9 (2012) 1470–1485
10. Lutz, R.R., Lutz, J.H., Lathrop, J.I., Klinge, T., Henderson, E., Mathur, D.,
Sheasha, D.A.: Engineering and verifying requirements for programmable self-
assembling nanomachines. In: ICSE, IEEE (2012) 1361–1364
Table of Contents
Session I: Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Which automata for which safety assessment step of satellite FDIR? . . . . . . 235
L. Pintard, C. Seguin, and J. Blanquart
1 Introduction
Evidence-based safety arguments, i.e., safety cases, are increasingly being considered
in emerging standards [10] and guidelines [3], as an alternative means for showing
that critical systems are acceptably safe. The current practice for demonstrating safety,
largely, is rather to satisfy a set of objectives prescribed by standards and/or guide-
lines. Typically, these mandate the processes to be employed for safety assurance, and
the artifacts to be produced, e.g., requirements, traceability matrices, etc., as evidence
(that the mandated process was followed). However, the rationale connecting the rec-
ommended assurance processes, and the artifacts produced, to system safety is largely
implicit [7]. Making this rationale explicit has been recognized as a desirable enhancement for "standards-based" assurance [14], and also in feedback received [4] during our own ongoing safety case development effort.
In effect, there is a need in practice to bridge the gap between the existing means,
i.e., standards-based approaches, and the alternative means, i.e., argument-based ap-
proaches, for safety assurance. Due to the prevalence of standards-based approaches,
conventional systems engineering processes place significant emphasis on producing
a variety of artifacts to satisfy process objectives. These artifacts show an apprecia-
ble potential for reuse in evidence-based argumentation. Consequently we believe that
automatically assembling a safety argument (or parts of it) from the artifacts, to the
extent possible, is a potential way forward in bridging this gap.
In this paper, we describe a lightweight methodology to support the automatic as-
sembly of (preliminary) safety cases. Specifically, the main contribution of our paper
2 Context
The experimental Swift UAS being developed at NASA Ames comprises a single air-
borne system, the electric Swift Unmanned Aerial Vehicle (UAV), with duplicated
ground control stations and communication links. The development methodology adopts NASA-mandated systems engineering procedures [15] and is further constrained
by other relevant standards and guidelines, e.g., for airworthiness and flight safety [13],
which define some of the key requirements on UAS operations. To satisfy these require-
ments, the engineers for the Swift UAS produce artifacts (e.g., requirements specifica-
tions, design documents, results for a variety of analyses, tests, etc.) that are reviewed at
predefined intervals during development. The overall systems engineering process also
includes traditional safety assurance activities as well as range safety analysis.
Our general approach for safety assurance includes argument development and uncer-
tainty analysis. Fig. 1 shows a data flow among the different processes/activities dur-
ing the development and safety assurance of the Swift UAS, integrating our approach
for safety argumentation.1 As shown, the main activities in argument development are
claims definition, evidence definition/identification, evidence selection, evidence link-
ing, and argument assembly. Of these, the first four activities are adapted from the
six-step method for safety case construction [8].
The main focus of this paper is argument development2; in particular, we consider
the activity of argument assembly, which is where our approach deviates from existing
methodologies [2], [8]. It reflects the notion of “stitching together” the data produced
from the remaining activities to create a safety case (in our example, fragments of ar-
gument structures for the Swift UAS) containing goals, sub-goals, and evidence linked
through an explicit chain of reasoning.
We distinguish this activity to account for (i) argument design criteria that are likely
to affect the structure of the overall safety case, e.g., maintainability, compliance with
safety principles, reducing the cost of re-certification, modularity, and composition of
arguments, and (ii) automation, e.g., in the assembly of heterogeneous data in the overall
1 Note that the figure only shows some key steps and data relevant for this paper, and is not a comprehensive representation. Additionally, the figure shows neither the iterative and phased nature of the involved activities nor the feedback between the different processes.
2 Uncertainty analysis [5] is out of the scope of this paper.
[Figure 1 diagram not reproduced. The Swift UAS development methodology supplies development artifacts (e.g., requirements, design) and the software verification methodology supplies SW verification artifacts (e.g., proofs, tests) to the system/software safety argumentation activities: safety claims and (sub-)claims definition, evidence definition/identification, evidence selection, evidence linking (yielding trustworthy evidence and items of evidence), and argument assembly (yielding argument structures, argument modules, and Swift UAS safety case versions). Argument design criteria, sources of uncertainty, and uncertainty measurements feed the uncertainty analysis and confidence assessment.]
Fig. 1. Safety assurance methodology showing the data flow between the processes for safety
analysis, system development, software verification, and safety argumentation.
safety case, including argument fragments and argument modules created using manual,
automatic, and semi-automatic means [6].
Safety argumentation, which is phased with system development, is applied starting
at the level of the system and then repeated at the software level. Consequently, the
safety case produced itself evolves with system development. Thus, similar to [11], we
may define a preliminary, interim, and operational safety case reflecting the inclusion
of specific artifacts at different points in the system lifecycle. Alternatively, we can also
define finer-grained versions, e.g., at the different milestones defined in the plan for system certification.3
The goal of a lightweight version of our methodology (Fig. 1) is to give systems engi-
neers a capability to (i) continue to maintain the existing set of artifacts, as per current
practice, (ii) automatically generate (fragments of) a safety case, to the extent possible,
rather than creating and maintaining an additional artifact from scratch, and (iii) provide
different views on the relations between the requirements and the safety case.
Towards this goal, we characterize the processes involved and their relationship
to safety cases. In this paper, we specifically consider a subset of the artifacts, i.e.,
tables of (safety) requirements and hazards, as an idealization4 of the safety analysis and
development processes. Then, we transform the tables into (fragments of) a preliminary
safety case for the Swift UAS, documented in the Goal Structuring Notation (GSN) [8].
Subsequently, we can modify the safety case and map the changes back to (extensions
of) the artifacts considered, thereby maintaining both in parallel.
3 Airworthiness certification in the case of the Swift UAS.
4 We consider idealizations of the processes, i.e., the data produced, rather than a formal process description, since we are mainly interested in the relations between the data so as to define and automate the transformations between them.
Hazards Table (excerpt):
ID       | Hazard                                        | Cause / Mode                                            | Mitigation | Safety Requirement
HR.1.3   | Propulsion system hazards                     |                                                         |            |
HR.1.3.1 | Motor overheating                             | Insufficient airflow; failure during operation          | Monitoring | RF.1.1.4.1.2
HR.1.3.7 | Incorrect programming of KD motor controller  | Improper procedures to check programming before flight | Checklist  | RF.1.1.4.1.9

System Requirements Table (excerpt):
ID         | Requirement                                                               | Source | Verification Method | Allocation   | Verification Allocation
RS.1.4.3   | Critical systems must be redundant                                        | AFSRB  |                     | RF.1.1.1.1.3 |
RS.1.4.3.1 | The system shall provide independent and redundant channels to the pilot | AFSRB  |                     |              |

Functional Requirements Table (excerpt): ID | Requirement | Source | Verification Method | Allocation | Verification Allocation
Fig. 2. Tables of hazards, system and functional requirements for the Swift UAS (excerpts).
Thus, in an SRT, (i) a source is one or more hazards or external artifacts, (ii) an allocation is a set of functional requirements or a set of artifacts, and (iii) a verification allocation is a set of artifacts. In an FRT, by contrast, (i) a source is a hazard, external artifact, or system requirement, (ii) an allocation is a set of artifacts, and (iii) a verification allocation links to a specific artifact that describes the result of applying a particular verification method.
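To make the structure of these tables concrete, the following sketch (our own illustrative encoding in Python, not the authors' tooling) represents hazard-table and requirements-table rows as dataclasses whose fields mirror the column headers of Fig. 2.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class HazardRow:
    """One entry of the hazards table (HT), cf. Fig. 2."""
    id: str                                   # e.g. "HR.1.3.1"
    hazard: str                               # e.g. "Motor overheating"
    causes: List[str] = field(default_factory=list)
    mitigations: List[str] = field(default_factory=list)
    safety_requirement: Optional[str] = None  # e.g. "RF.1.1.4.1.2"

@dataclass
class RequirementRow:
    """One entry of a system (SRT) or functional (FRT) requirements table."""
    id: str                                   # e.g. "RS.1.4.3"
    requirement: str
    sources: List[str] = field(default_factory=list)     # hazards, external artifacts, or system requirements
    allocation: List[str] = field(default_factory=list)  # requirements or artifacts, depending on the table
    verification_method: Optional[str] = None
    verification_allocation: List[str] = field(default_factory=list)

# Example rows, taken from the excerpts in Fig. 2:
hr_1_3_1 = HazardRow("HR.1.3.1", "Motor overheating",
                     causes=["Insufficient airflow", "Failure during operation"],
                     mitigations=["Monitoring"],
                     safety_requirement="RF.1.1.4.1.2")
rs_1_4_3 = RequirementRow("RS.1.4.3", "Critical systems must be redundant",
                          sources=["AFSRB"], allocation=["RF.1.1.1.1.3"])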
Given the base sets and the definitions 1 – 3, we can now define:
snapshots, however, so we want to define a notion of partial safety case. Here, we ignore se-
mantic concerns and use a purely structural definition. Assuming finite, disjoint sets of
goals (G), strategies (S), evidence (E), assumptions (A), contexts (K) and justifications
(J), we give the following graph-theoretic definition:
Definition 5. A partial safety case, S, is a tuple ⟨G, S, E, A, K, J, sg, gs, gc, sa, sc, sj⟩ with the functions
– sg : S → P(G), the subgoals to a strategy
– gs : G → P(S) ∪ P(E), the strategies of a goal or the evidence to a goal
– gc : G → P(K), the contexts of a goal
– sa : S → P(A), the assumptions of a strategy
– sc : S → P(K), the contexts of a strategy
– sj : S → P(J), the justifications of a strategy
We say that g′ is a subgoal of g whenever there exists an s ∈ gs(g) such that g′ ∈ sg(s). Then, define the descendant goal relation, g ≻ g′, iff g′ is a subgoal of g or there is a goal g″ such that g ≻ g″ and g′ is a subgoal of g″. We require that the relation ≻ is a strict partial order, i.e., that the subgoal structure is acyclic.
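A direct, hypothetical transcription of Definition 5 into executable form may help fix the intent; the sketch below (our own rendering) represents the six node sets and six functions as Python sets and dicts, and computes the subgoal and descendant relations defined above.

from dataclasses import dataclass, field
from typing import Dict, Set

@dataclass
class PartialSafetyCase:
    """Structural rendering of Definition 5: sets G, S, E, A, K, J and
    the functions sg, gs, gc, sa, sc, sj, each encoded as a dict of sets."""
    G: Set[str] = field(default_factory=set)   # goals
    S: Set[str] = field(default_factory=set)   # strategies
    E: Set[str] = field(default_factory=set)   # evidence
    A: Set[str] = field(default_factory=set)   # assumptions
    K: Set[str] = field(default_factory=set)   # contexts
    J: Set[str] = field(default_factory=set)   # justifications
    sg: Dict[str, Set[str]] = field(default_factory=dict)  # strategy -> subgoals
    gs: Dict[str, Set[str]] = field(default_factory=dict)  # goal -> strategies or evidence
    gc: Dict[str, Set[str]] = field(default_factory=dict)  # goal -> contexts
    sa: Dict[str, Set[str]] = field(default_factory=dict)  # strategy -> assumptions
    sc: Dict[str, Set[str]] = field(default_factory=dict)  # strategy -> contexts
    sj: Dict[str, Set[str]] = field(default_factory=dict)  # strategy -> justifications

    def subgoals(self, g: str) -> Set[str]:
        """g' is a subgoal of g iff some strategy s in gs(g) has g' in sg(s)."""
        return {g2 for s in self.gs.get(g, set()) if s in self.S
                for g2 in self.sg.get(s, set())}

    def descendants(self, g: str) -> Set[str]:
        """Transitive closure of the subgoal relation (assumed acyclic)."""
        seen: Set[str] = set()
        todo = [g]
        while todo:
            for g2 in self.subgoals(todo.pop()):
                if g2 not in seen:
                    seen.add(g2)
                    todo.append(g2)
        return seen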
Definition 6. We say that a partial safety case, S = ⟨G, S, E, A, K, J, sg, gs, gc, sa, sc, sj⟩, extends a requirements specification, R = ⟨HT, RT_s, RT_f⟩, if there is an embedding (i.e., injective function), ι, on the base sets of R in S, such that:
– ι(H ∪ C ∪ R_s ∪ R_f) ⊆ G
– ι(V ∪ M) ⊂ S
– ⟨r, so, al, vm, va⟩ ∈ (RT_s ∪ RT_f) ⇒ ι(so) ∈ gc(ι(r)), ι(vm) ∈ gs(ι(r)), and ι(va) ⊆ sg(ι(vm)) ∩ E
– x ≤ x′ ⇒ ι(x) ≻ ι(x′)
Whereas goal contexts may be derived from the corresponding requirements sources,
strategy contexts, assumptions and justifications are implicit and come from the map-
ping itself, e.g., as boilerplate GSN elements (see Fig. 3 for an example of a boilerplate assumption element). Note that we do not specify the exact relations between the indi-
vidual elements, just that there is a relation.
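Continuing the sketch, a hypothetical checker for Definition 6 over the PartialSafetyCase structure above can make the embedding conditions concrete; injectivity of ι and the hierarchy condition x ≤ x′ are omitted for brevity, and all names are our own.

def extends(case, iota, base_goals, base_strategies, table_rows):
    """Sketch of Definition 6: verify that the embedding iota maps the base
    sets of R into the safety case consistently with the table rows."""
    ok = all(iota[x] in case.G for x in base_goals)              # iota(H u C u Rs u Rf) in G
    ok = ok and all(iota[x] in case.S for x in base_strategies)  # iota(V u M) in S
    for (r, so, _al, vm, va) in table_rows:                      # rows of RTs u RTf
        ok = ok and iota[so] in case.gc.get(iota[r], set())
        ok = ok and iota[vm] in case.gs.get(iota[r], set())
        ok = ok and all(iota[v] in case.sg.get(iota[vm], set()) & case.E
                        for v in va)
    return ok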
The structure of the tables, and the mapping defined for each table, induces two patterns
of argument structures. In particular, the pattern arising from the transformation of the
HT can be considered as an extension of the hazard-directed breakdown pattern [12].
Thus, we argue over each hazard in the HT and, in turn, over the identified hazards in a
hierarchy of hazards. Consequently, each defined goal is further developed by argument
over the strategies implicit in the HT, i.e., over the causes and mitigations.
Similarly, the pattern induced by transforming the SRT and FRT connects the argu-
ment elements implicit in the tables, i.e., requirements (goals), and verification methods
and verification allocations (strategies), respectively. Additionally, it includes strategies
arising due to both the hierarchy of requirements in the tables, and the dependencies
between the tables. Specifically, for each requirement, we also argue over its allocation,
e.g., the allocation of a functional requirement to a system requirement, and its chil-
dren, i.e., lower-level requirements. The links between the tables in the requirements
specification define how the two patterns are themselves related and, in turn, how the
resulting safety case fragments are assembled.
One choice in the transformation is to create goals and strategies that are not marked
as undeveloped (or uninstantiated, or both, as appropriate), i.e., to assume that the
completeness and sufficiency of all hazards, their respective mitigations, and all re-
quirements and their respective verification methods, is determined prior to the trans-
formation, e.g., as part of the usual quality checks on requirements specifications. An
alternative is to highlight the uncertainty in the completeness and sufficiency of the
hazards/requirements tables, and mark all goals and strategies as undeveloped. We pick
the second option, i.e., in the transformation described next, all goals, strategies, and
evidence that are created are undeveloped except where otherwise indicated.
8 E. Denney and G. Pai
We give the transformation in a relational style, where the individual tables are processed in a top-to-bottom order, while no such order is required among the tables.
(R2) (a) the Source column forms the context for the created goal/sub-goal. Ad-
ditionally, if the source is a hazard, i.e., (an ID of) an entry {Hazard} in
the HT, then the created goal is the same as the sub-goal that was created
from the Safety Requirement column of the HT, as in step (H2)(c).
(b) the Allocation column is either a strategy or a context element, depending
on the content. Thus, if it is
i. an allocated requirement (or its ID), then create and attach a strategy "Argument over allocated requirement"; the sub-goal of this strategy is the allocated requirement.8
ii. an element of the physical architecture, then create an additional con-
text element for the goal.
(c) the Verification method column, if given, creates an additional strategy "Argument by {Verification Method}", an uninstantiated sub-goal connected to this strategy,9 and an item of evidence whose content is the entry in the column Verification allocation.
We now state (without proof), that the result of this transformation is a well-formed
partial safety case that extends the requirements specification.
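As an illustration of rule (R2), the following sketch (our own rendering, reusing the RequirementRow encoding sketched earlier; the node-dictionary fields and the ID-prefix heuristic for recognizing allocated requirements are assumptions, not the authors' implementation) maps one requirements-table row to GSN-like nodes.

def transform_requirement_row(row) -> list:
    """Sketch of rule (R2): map one SRT/FRT row to GSN-like nodes.
    All created nodes are undeveloped by default (the option chosen above)."""
    goal = {"kind": "goal", "id": row.id,
            "text": f"{row.requirement} holds", "undeveloped": True}
    nodes = [goal]
    # (R2)(a): each source becomes a context of the created goal.
    for so in row.sources:
        nodes.append({"kind": "context", "of": row.id, "text": so})
    # (R2)(b): an allocation is a strategy (allocated requirement) or a
    # context (architecture element); the prefix test is a hypothetical heuristic.
    for al in row.allocation:
        if al.startswith(("RS.", "RF.")):
            nodes.append({"kind": "strategy", "of": row.id,
                          "text": "Argument over allocated requirement",
                          "subgoal": al})
        else:
            nodes.append({"kind": "context", "of": row.id, "text": al})
    # (R2)(c): verification method -> strategy, uninstantiated sub-goal,
    # and evidence taken from the verification allocation.
    if row.verification_method:
        sid = f"{row.id}/by-{row.verification_method}"
        nodes.append({"kind": "strategy", "id": sid, "of": row.id,
                      "text": f"Argument by {row.verification_method}"})
        nodes.append({"kind": "goal", "of": sid,
                      "text": "{To be instantiated}", "uninstantiated": True})
        for va in row.verification_allocation:
            nodes.append({"kind": "evidence", "of": sid, "text": va})
    return nodes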
5 Illustrative Example
Fig. 3 shows a fragment of the Swift UAS safety case, in the GSN, obtained by apply-
ing the transformation rules (Section 4.4) to the HT and FRT (Fig. 2), and assembling
the argument structures. Note that a similar safety case fragment (not shown here) is
obtained when the transformation is applied to the SRT and FRT.
We observe that (i) the argument chain starting from the top-level goal G0, to the
sub-goals G1.3 and G2.1 can be considered as an instantiation of the hazard-directed
breakdown pattern, which has then been extended by an argument over the causes and the respective mitigations in the HT, (ii) the argument chains starting from these sub-goals to the evidence E1 and E2 reflect the transformation from the FRT and, again, instantiate a specific pattern of argument structures, and (iii) when
each table is transformed, individual fragments are obtained which are then joined based
on the links between the tables (i.e., requirements common to either table). In general,
the transformation can produce several unconnected fragments. Here, we have shown
one of the two that are created.
The resulting partial safety case can be modified, e.g., by including additional con-
text, justifications and/or assumptions, to the goals, sub-goals, and strategies. In fact, a
set of allowable modifications can be defined, based on both a set of well-formedness
rules, and the activities of argument development (Fig. 1). Subsequently, the modifica-
tions can be mapped back to (extensions of) the requirements specification.
Fig. 4 shows an example of how the Claims definition and Evidence linking activ-
ities (Fig. 1) modify the argument fragment in Fig. 3. Specifically, goal G2 has been
8 This will also be an entry in the Requirements column of the FRT.
9 A constraint, as per [8], is that each item of evidence be preceded by a goal, for well-formedness.
[Figure 3 GSN diagram not reproduced. Top-level goal G0 ("[Propulsion System Hazards] HR.1.3 is mitigated", context C0.1) is developed by strategy S0 ("Argument over identified hazards") under assumption A0.1 ("Hazards have been completely and correctly identified to the extent possible") into goals G1 ("[Motor overheating] HR.1.3.1 is mitigated", context C1.1) and G2 ("[Incorrect programming of KD motor controller] HR.1.3.7 is mitigated", context C2.1). These are developed via strategies S1 and S2.1 ("Argument over identified causes", with assumptions A1.1 and A2.1 on completeness of the causes), then via strategies S2 ("Argument by [Monitoring]") and S3 ("Argument by [Checklist]"), into sub-goals G1.3 ("[CPU/Autopilot system must be able to monitor engine and motor controller temperature] holds", contexts C1.3.1 RF.1.1.4.1.2, C1.3.2 HR.1.3.1, C1.3.3 Engine Systems) and G2.1 ("[Engine software will be checked during pre-deployment checkout] holds", contexts C2.1.1 RF.1.1.4.1.9, C2.1.2 HR.1.3.7, C2.1.3 Pre-deployment checklist). Strategies S6 and S7 ("Argument by [Checklist]") develop these into uninstantiated goals G6.1 and G7.1 ("{To be instantiated}"), supported by evidence E1 ("Pre-flight checklist") and E2 ("Pre-deployment checklist").]
Fig. 3. Fragment of the Swift UAS safety case (in GSN) obtained by transformation of the hazards
table and the functional requirements table.
further developed using two additional strategies, StrStatCheck and StrRunVerf, result-
ing in the addition of the sub-goals GStatCheck and GRunVerf respectively. Fig. 5 shows
the corresponding updates (as highlighted rows and italicized text) in the HT and SRT
respectively, when the changes are mapped back to the requirements specification. Par-
ticularly, the strategies form entries in the Mitigation column of the HT, whereas the
sub-goals form entries in the Safety Requirement and Requirement columns of the HT
and the SRT respectively. Some updates will require a modification (extension) of the
tables, e.g., addition of a Rationale column reflecting the addition of justifications to
strategies. Due to space constraints, we do not elaborate further on the mapping from
safety cases to requirements specifications.
6 Conclusion
There are several points of variability for the transformations described in this paper,
e.g., variations in the forms of tabular specifications, and in the mapping between these
[Figure 4 diagram not reproduced: goal G2 ("[Incorrect programming of KD motor controller] HR.1.3.7 is mitigated", context C2.1) extended with the strategies StrStatCheck and StrRunVerf and the sub-goals GStatCheck and GRunVerf.]
Fig. 4. Addition of strategies and goals to the safety case fragment for the Swift UAS.
Hazards Table (updated excerpt; new entries are shown highlighted/italicized in the original):
ID       | Hazard                                        | Cause / Mode                                            | Mitigation                 | Safety Requirement
HR.1.3   | Propulsion system hazards                     |                                                         |                            |
HR.1.3.1 | Motor overheating                             | Insufficient airflow; failure during operation          | Monitoring                 | RF.1.1.4.1.2
HR.1.3.7 | Incorrect programming of KD motor controller  | Improper procedures to check programming before flight | Checklist                  | RF.1.1.4.1.9
         |                                               |                                                         | Static checking (new)      | GStatCheck (new)
         |                                               |                                                         | Runtime Verification (new) | GRunVerf (new)

System Requirements Table (updated excerpt):
ID         | Requirement                                                               | Source   | Verification Method | Allocation   | Verification Allocation
RS.1.4.3   | Critical systems must be redundant                                        | AFSRB    |                     | RF.1.1.1.1.3 |
RS.1.4.3.1 | The system shall provide independent and redundant channels to the pilot | AFSRB    |                     |              |
GStatCheck | Software checks that programmed parameter values are valid (new)         | HR.1.3.7 |                     |              |
GRunVerf   | Software performs runtime checks on programmed parameter values (new)    | HR.1.3.7 |                     |              |
Fig. 5. Updating the requirements specification tables to reflect the modifications shown in Fig. 4.
forms to safety case fragments. We emphasize that the transformation described in this
paper is one out of many possible choices to map artifacts such as hazard reports [9] and
requirements specifications to safety cases. Our main purpose is to place the approach
on a rigorous foundation and to show the feasibility of automation.
We are currently implementing the transformations described in a prototype tool10;
although the transformation is currently fixed and encapsulates specific decisions about
the form of the argument, we plan on making this customizable. We will also imple-
ment abstraction mechanisms to provide control over the level of detail displayed (e.g.,
perhaps allowing some fragments derived from the HT to be collapsed).
We will extend the transformations beyond the simplified tabular forms studied here,
and hypothesize that such an approach can be extended, in principle, to the rest of the
data flow in our general methodology so as to enable automated assembly/generation
of safety cases from heterogeneous data. In particular, we will build on our earlier work
on generating safety case fragments from formal derivations [1]. We also intend to
clarify how data from concept/requirements analysis, functional/architectural design,
10 AdvoCATE: Assurance Case Automation Toolset.
Acknowledgements. We thank Corey Ippolito for access to the Swift UAS data. This
work has been funded by the AFCS element of the SSAT project in the Aviation Safety
Program of the NASA Aeronautics Mission Directorate.
References
1. Basir, N., Denney, E., Fischer, B.: Deriving safety cases for hierarchical structure in model-
based development. In: 29th Intl. Conf. Comp. Safety, Reliability and Security. (2010)
2. Bishop, P., Bloomfield, R.: A methodology for safety case development. In: Proc. 6th Safety-
critical Sys. Symp. (Feb 1998)
3. Davis, K.D.: Unmanned Aircraft Systems Operations in the U.S. National Airspace System.
FAA Interim Operational Approval Guidance 08-01. (Mar 2008)
4. Denney, E., Habli, I., Pai, G.: Perspectives on Software Safety Case Development for Un-
manned Aircraft. In: Proc. 42nd Annual IEEE/IFIP Intl. Conf. on Dependable Sys. and Net-
works. (Jun 2012)
5. Denney, E., Pai, G., Habli, I.: Towards measurement of confidence in safety cases. In: Proc.
5th Intl. Symp. on Empirical Soft. Eng. and Measurement. pp. 380–383 (Sep 2011)
6. Denney, E., Pai, G., Pohl, J.: Heterogeneous aviation safety cases: Integrating the formal and
the non-formal. In: Proc. 17th IEEE Intl. Conf. Engineering of Complex Computer Systems.
(Jul 2012)
7. Dodd, I., Habli, I.: Safety certification of airborne software: An empirical study. Reliability Eng. and Sys. Safety 98(1), pp. 7–23 (2012)
8. Goal Structuring Notation Working Group: GSN Community Standard Version 1 (Nov
2011). http://www.goalstructuringnotation.info/
9. Goodenough, J.B., Barry, M.R.: Evaluating Hazard Mitigations with Dependability Cases.
White Paper (Apr 2009). http://www.sei.cmu.edu/library/abstracts/
whitepapers/dependabilitycase_hazardmitigation.cfm/
10. International Organization for Standardization (ISO): Road Vehicles-Functional Safety. ISO
Standard 26262 (2011)
11. Kelly, T.: A systematic approach to safety case management. In: Proc. Society of Automotive
Engineers (SAE) World Congress (Mar 2004)
12. Kelly, T., McDermid, J.: Safety case patterns – reusing successful arguments. In: Proc. IEE
Colloq. on Understanding Patterns and Their Application to Sys. Eng. (1998)
13. NASA Aircraft Management Division: NPR 7900.3C, Aircraft Operations Management
Manual. NASA (Jul 2011)
14. Rushby, J.: New challenges in certification for aircraft software. In: Proc. 11th Intl. Conf. on
Embedded Soft. pp. 211–218 (Oct 2011)
15. Scolese, C.J.: NASA Systems Engineering Processes and Requirements. NASA Procedural
Requirements NPR 7123.1A. (Mar 2007)
A Pattern-based Method for Safe Control
Systems Exemplified within Nuclear Power
Production
1 Introduction
This article presents a pattern-based method, referred to as SaCS (Safe Con-
trol Systems), facilitating the development of conceptual designs for safety-critical
systems. Intended users of SaCS are system developers, safety engineers and
HW/SW engineers.
The method interleaves three main activities each of which is divided into
sub-activities:
S Pattern Selection – The purpose of this activity is to support the concep-
tion of a design with respect to a given development case by: a) selecting
SaCS patterns for requirement elicitation; b) selecting SaCS patterns for
establishing design basis; c) selecting SaCS patterns for establishing safety
case.
C Pattern Composition – The purpose of this activity is to specify the in-
tended use of the selected patterns by: a) specifying composite patterns; b)
specifying composition of composite patterns.
I Pattern Instantiation – The purpose of this activity is to instantiate the
composite pattern specification by: a) selecting pattern instantiation order;
and b) conducting stepwise instantiation.
2 Success Criteria
[Figure 1 (the SaCS pattern selection map) not reproduced. The map groups basic patterns into a Process Assurance part (including Establish Concept Safety, HAZID, HAZOP, Hazard Identification, FMEA, Hazard Analysis, FTA, ETA, CCA, Risk Analysis, Establish System Safety Requirements, DAL Classification (aviation), SIL Classification (railway), I&C Functions Categorisation (nuclear), Safety Management, Quality Management, Process Quality Evidence, Process Compliance Evidence, and Assessment Evidence) and a Product Assurance part (including Technical Safety, Variable Demand for Service (nuclear), Trusted Backup, and Online Evaluation), connected by numbered selection points (1) to (15).]
The selection of SaCS basic patterns is performed by the use of the pattern selection map illustrated in Figure 1.1 Selection starts at selection point (1).
Arrows provide the direction of flow through the selection map. Pattern selection
ends when all selection points have been explored. A choice specifies alternatives
where more than one alternative may be chosen. The patterns emphasized with
a thick line in Figure 1 are used in this article.
1 Not all patterns in Figure 1 are yet available in the SaCS language; they are indicated for illustration purposes.
[Figure 2 (pfd Variable Demand for Service) not reproduced: a problem diagram relating a Machine, a Plant (with Sensors and Actuators), and an Environment, with the parameters Obj and Plnt and the output artefact Req.]
When Variable Demand for Service is instantiated, the Req artefact indicated
in Figure 2 is produced with respect to the context given by Obj and Plnt. In
Section 4.1 we selected the pattern as support for eliciting requirements for a
PWR system upgrade with respect to goal G1. The parameter Obj is then bound
to G1, and the parameter Plnt is bound to the specification of the PWR system
that the ALF upgrade is meant for.
Assume that the instantiation of Variable Demand for Service according to
its instantiation rule provides a set of requirements where one of these is defined
as: “FR.1: ALF system shall activate calibration of the control rod patterns when
the need to calibrate is indicated”.
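The following sketch is our own reading of this mechanism, not the SaCS tooling: a basic pattern is modeled as a named instantiation rule over explicit parameters, and binding Obj to G1 and Plnt to the PWR specification yields a Req artefact containing FR.1.

from dataclasses import dataclass
from typing import Callable, Dict, Tuple

@dataclass
class BasicPattern:
    """A SaCS basic pattern rendered as a parameterized producer (sketch)."""
    name: str
    parameters: Tuple[str, ...]                         # e.g. ("Obj", "Plnt")
    rule: Callable[[Dict[str, str]], Dict[str, list]]   # instantiation rule

def instantiate(pattern: BasicPattern, **bindings: str) -> Dict[str, list]:
    """Apply a pattern's instantiation rule once all parameters are bound."""
    missing = [p for p in pattern.parameters if p not in bindings]
    if missing:
        raise ValueError(f"unbound parameters: {missing}")
    return pattern.rule(bindings)

vdfs = BasicPattern(
    "Variable Demand for Service", ("Obj", "Plnt"),
    lambda b: {"Req": [f"FR.1: ALF system shall activate calibration of the "
                       f"control rod patterns when the need to calibrate is "
                       f"indicated (w.r.t. {b['Obj']}, {b['Plnt']})"]})

artefacts = instantiate(vdfs, Obj="G1", Plnt="PWR specification")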
[Composite pattern fragment "Fragment 2" (cmp Design) not reproduced: it relates the Functional Requirements composite and the Trusted Backup pattern via «delegates» and «satisfies» connectors over the artefacts [[FR.1]] and [[S={ALF Dgn}]].]
The satisfies combinator in Figure 4 indicates that ALF Dgn (that is, the instantiation of S) satisfies the requirement FR.1 provided as output from instantiation of the Functional Requirements composite. The Functional Requirements composite is detailed in Figure 3. A pattern reference to a composite is indicated by the letter "C" in the lower compartment of a solid-drawn oval.
[Figure 5 diagram not reproduced: activities Identify Target, Confer Laws and Regulations, Confer Risk Analysis, and Define Requirements, relating the artefacts ToA, Risks, and Req.]
Assume that when Risk analysis is instantiated on the basis of inputs pro-
vided by the instantiation of its successors, the following risk is identified: “R.1:
Erroneously adapted control function”. The different process requirement pat-
terns follow the same format; details on how they are instantiated are only given
with respect to the pattern Establish System Safety Requirements.
Figure 5 is an excerpt of the Establish System Safety Requirements pattern. It describes a UML activity diagram with some SaCS-specific annotations. The pattern provides the analyst with a means for elaborating upon the problem of establishing safety requirements (represented by Req) based on inputs on the risks (represented by Risks) associated with a given target (represented by ToA).
Assume that Establish System Safety Requirements is instantiated with the
parameter Risks bound to the risk R.1, and the parameter ToA bound to the
ALF Dgn design (see Section 5.2). The instantiation according to the instanti-
ation rule of the pattern might then give the safety requirements: “SR.1: ALF
shall disable the adaptive controller during the time period when controller pa-
rameters are configured” and “SR.2: ALF shall assure that configured parameters
are correctly modified before enabling adaptable control”.
[Composite pattern fragment "Fragment 3" (cmp Safety Requirements) not reproduced: I&C Function Categorisation (Nuc-M) «address»es [[Risks]] and «delegates» to Establish System Safety Requirements, yielding [[SR.1, SR.2]] and [[Req={SR.1, SR.2}]].]
selection point (15), as support for deriving a safety case demonstrating that the
safety requirements SR.1 and SR.2 (defined in Section 6.2) are satisfied.
[The Assessment Evidence basic pattern (scd diagram) not reproduced: it relates a Case:Claim to a Sub:Claim via an S:Strategy with a Justification j, with parameters ToD and Cond and evidence Ev.]
[Figure 9 (the ALF Pattern Solution composite) not reproduced: it combines the Requirements, Functional Requirements, Safety Requirements, and Safety Case composites via «delegates», «address», «satisfies», «demonstrates», and «refers» connectors, binds [[Obj]] to G1 and [[Plnt]] to PWR, and produces the artefacts [[ALF Req={FR.1, SR.1, SR.2}]], [[ALF Dgn]], and [[ALF Case]].]
8 Combine Fragments
The composite ALF Pattern Solution illustrated in Figure 9 specifies how the
different composite patterns defined in the previous sections are combined.
Figure 9 specifies the patterns used, the artefacts provided as a result of pattern instantiation, and the relationships between patterns and pattern artefacts, expressed by the use of operators. The composite specification of Figure 9 may be refined by
successive steps of the SaCS method, e.g. by extending the different constituent
composite patterns with respect to the goals G2-G3 of Section 3.
9 Conclusions
In this paper we have exemplified the application of the SaCS-method on the
load following mode control application.
We claim that the conceptual design is easily instantiated from several SaCS
basic patterns within a case-specific SaCS composite (Figure 9). Each basic
pattern has clearly defined inputs and outputs and provides guidance on instan-
tiation through defined instantiation rules. Combination of instantiation results
from several patterns is defined by composition operators. The conceptual design
is built systematically in manageable steps (exemplified in Section 4 to Section
8) by instantiating pieces (basic patterns) of the whole (composite pattern) and merging the results. The conceptual design (fully described in [5]) is consistent with our definition: the required triple is provided by the artefacts ALF Req, ALF Dgn, and ALF Case (as indicated in Figure 9) and uniquely specifies the load following case.
Future work includes expanding the set of basic patterns, detailing the syntax of the pattern language, and evaluating the SaCS method on further cases from other domains.
Acknowledgments. This work has been conducted and funded within the
OECD Halden Reactor Project, Institute for energy technology (IFE), Halden,
Norway.
References
1. Alexander, C., Ishikawa, S., Silverstein, M., Jacobson, M., Fiksdahl-King, I., Angel,
S.: A Pattern Language: Towns, Buildings, Construction. Oxford University Press
(1977)
2. Buschmann, F., Henney, K., Schmidt, D.C.: Pattern-Oriented Software Architecture:
On Patterns and Pattern Languages. Vol. 5, Wiley (2007)
3. Gamma, E., Helm, R., Johnson, R., and Vlissides, J.: Design Patterns: Elements of
Reusable Object-Oriented Software. Addison-Wesley (1995)
4. GSN Working Group: GSN Community Standard, version 1.0 (2011)
5. Hauge, A.A., Stølen, K.: A Pattern Based Method for Safe Control Conceptualisa-
tion Exemplified Within Nuclear Power Production, HWR-1029, Institute for energy
technology, OECD Halden Reactor Project, Halden, Norway (to appear)
6. IEC: Nuclear Power Plants – Instrumentation and Control Important to Safety –
Classification of Instrumentation and Control Functions. IEC-61226, International
Electrotechnical Commission (2009)
7. Jackson, M.: Problem Frames: Analyzing and Structuring Software Development
Problems. Addison-Wesley (2001)
8. Lokhov, A.: Technical and Economic Aspects of Load Following with Nuclear Power
Plants. Nuclear Development Division, OECD NEA (2011)
9. The Commission of the European Communities: Commission Regulation (EC) No
352/2009 on the Adoption of Common Safety Method on Risk Evaluation and As-
sessment, 352/2009/EC (2009)
10. Object Management Group: Unified Modeling Language Specification, version
2.4.1 (2011)
Risk Assessment for Airworthiness Security
1 Introduction
The increasing complexity of aircraft networked systems exposes them to three ad-
verse effects likely to erode flight safety margins: intrinsic component failures, design
or development errors and misuse. Safety1 processes have been capitalizing on expe-
rience to counter such effects, and standards were issued to provide guidelines for the safety assessment process and development assurance, such as ARP-4754 [1], ARP-
4761 [2], DO-178B [3] and DO-254 [4]. But the segregation of safety-critical systems from the Open World tends to become thinner due to the high integration level of airborne networks: use of Commercial Off-The-Shelf equipment (COTS), Internet access for passengers as part of the new In-Flight Entertainment (IFE) services, transition from Line Replaceable Units to field-loadable software, evolution from voice ground-based to datalink satellite-based communications, more autonomous navigation with e-Enabled aircraft, etc. Most of the challenging innovations to offer new services, ease
1 Please note that safety deals with intrinsic failures of a system or a component (due to ageing or design errors), whereas security deals with the external threats that could cause such failures. Security being a brand new field in aeronautics, instead of building a process from scratch, the industry is trying to approximate the well-known safety process, which has reached a certain level of maturity through its 50 years of experience.
air traffic management, reduce development and maintenance time and costs, are not
security-compatible. They add a fourth adverse effect, increasingly worrying certifica-
tion authorities: vulnerability to deliberate or accidental attacks (e.g. worms or viruses
propagation, loading of corrupted software, unauthorized access to aircraft system
interfaces, on-board systems denial of service). De Cerchio and Riley quote in [5] a
short list of registered cyber security incidents in the aviation domain. As a matter of
fact, EUROCAE2 and RTCA3 are defining new airworthiness security standards: ED-202 [6] provides guidance to achieve security compliance objectives based on the methods of the future ED-203 [7].4
EU and US5 certification authorities are addressing requests to aircraft manufactur-
ers so they start dealing with security issues. However, ED-203 has not been officially
issued and existing risk assessment methods are not directly applicable to the aero-
nautical context: stakes and scales are not adapted, they are often qualitative, and they depend on security managers' expertise. Also, an important stake in aeronautics is cost
minimization. On the one hand, if security is handled after systems have been imple-
mented, modifications to insert security countermeasures, re-development and re-
certification costs are overwhelming: "fail-first patch-later" [8] IT security policies are
not compatible with aeronautic constraints. It is compulsory that risk assessment be introduced at an early design step of the development process. On the other hand, security
over-design must be avoided to reduce unnecessary development costs: risk needs to
be quantified in order to rank what has to be protected in priority.
This paper introduces a simple quantitative risk assessment framework which is: compliant with the ED-202 standard, suitable for aeronautics, adaptable to different points of view (e.g. at aircraft level for the airframer, at system level for the system provider), and taking into account safety issues. This methodology is in strong interaction with
safety and development processes. Its main advantage is to allow the identification of
risks at an early design step of development V-cycle so that countermeasures are con-
sistently specified before systems implementation. It provides means to justify the
adequacy of countermeasures to be implemented in front of certification authorities.
The next chapter gives an overview of risk assessment methods; the third one depicts our six-step risk assessment framework, illustrated by a simple study case in chapter 4; the last one concludes on the pros and cons of our method and outlines future objectives.
Many risk assessment methodologies aim at providing tools to comply with ISO security norms such as ISO/IEC 27000, 31000, 17799, 13335, 15443, 7498, 73 and 15408
(Common Criteria [9]). For example, MAGERIT [10] and CRAMM [11] deal with
governmental risk management of IT against for example privacy violation.
2 European Organization for Civil Aviation Equipment
3 Radio Technical Commission for Aeronautics
4 ED-203 is under construction; we refer to the working draft [7], whose content may be prone to change.
5 Respectively EASA (European Aviation Safety Agency) and FAA (Federal Aviation Administration)
NIST800-30 [12] provides security management steps to fit into the system develop-
ment life-cycle of IT devices. Others, such as OCTAVE [13], aim at ensuring enterprise security by evaluating risk to avoid financial losses and brand reputation damage. The previously stated methods are qualitative, i.e. no scale is given to compare identified risks with one another. MEHARI [14] proposes a set of checklists and evaluation
grids to estimate natural exposure levels and impact on business. Finally, EBIOS [15]
shows an interesting evaluation of risks through the quantitative characterization of a
wide spectrum of threat sources (from espionage to natural disasters) but scales of
proposed attributes do not suit the aeronautic domain.
Risk is commonly defined as the product of three factors: Risk = Threat × Vulner-
ability × Consequence. Quantitative risk estimations combine these factors with more
or less sophisticated models (e.g. a probabilistic method of risk prediction based on
fuzzy logic and Petri Nets [16] vs. a visual representation of threats under a pyramidal
form [17]). Ortalo, Deswarte and Kaaniche [18] defined a mathematical model based
on Markovian chains to define METF (Mean Effort to security Failure), a security
equivalent of MTBF (Mean Time Between Failure). Contrary to the failure rate used
in safety, determined by experience feedback and fatigue testing on components, se-
curity parameters are not physically measurable. To avoid subjective analysis,
Mahmoud, Larrieu and Pirovano [19] developed an interesting quantitative algorithm
based on computation of risk propagation through each node of a network. Some of
the parameters necessary for risk level determination are computed by using network
vulnerability scanning. This method is useful for an a posteriori evaluation, but it is
not adapted to an early design process as the system must have been implemented or
at least emulated.
Ideally, a security assessment should guarantee that all potential scenarios have been exhaustively considered. Threat scenarios are useful to express needed protection means and to set security tests for final products. This part describes our six-step risk assessment methodology, summarized in Figure 1, with a dual threat scenario identification inspired by safety tools and an adaptable risk estimation method.
[Figure 1 diagram not reproduced. Step 4, risk estimation, combines a likelihood level (derived from attacker capability and asset exposure) with the safety impact and asks whether the risk is acceptable; if not, exposure criteria are to be reduced, security objectives are set, a Security Level (SL) is assigned, and step 5 specifies security countermeasures.]
Fig. 1. Risk assessment and treatment process: the figure differentiates input data for the securi-
ty process as coming either from the development process or from a security knowledge basis.
Primary Assets. According to ED-202, assets are "those portions of the equipment
which may be attacked with adverse effect on airworthiness". We distinguish two
types of assets: primary assets (aircraft critical functions and data) that are performed
or handled by supporting assets (software and hardware devices that carry and process
primary assets). In the PRA, the system architecture is still undefined; only primary assets need to be identified.
Threats. Primary assets are confronted with a generic list of Threat Conditions (TCs), themselves leading to Failure Conditions (FCs). Examples of TCs include: misuse,
confidentiality compromise, bypassing, tampering, denial, malware, redirection, sub-
version. FCs used in safety assessment are: erroneous, loss, delay, failure, mode
change, unintended function, inability to reconfigure or disengage.
6 http://cve.mitre.org/
Let us call f() the evaluation function performed by the security analyst to assign the corresponding severity degree a_i to each attribute x_i for a given threat scenario: a_i = f(x_i). Attacker capability is expressed by the normalized sum of the values assigned to all attributes of the set X (see equation 1); exactly the same reasoning is made to express the "asset exposure" E over the attributes y_j of the set Y.

A = ( Σ_{i=1}^{n} f(x_i) ) / ( Σ_{i=1}^{n} max(x_i) ),   i = 1 … n,  j = 1 … m     (1)
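A minimal sketch of this computation, under the assumption (equation 1 is only partially recoverable in this excerpt) that the normalization divides the summed attribute scores by the maximum achievable total; the attribute scales follow tables 3 and 4 below.

def normalized_score(values, max_value):
    """Normalized sum in [0, 1]: the assigned attribute values f(x_i),
    summed and divided by the maximum achievable total (assumed form)."""
    return sum(values) / (max_value * len(values))

# Attacker capability A over attributes X1..X5 (table 3, each scored 0..3):
A = normalized_score([2, 1, 3, 2, 0], max_value=3)
# Asset exposure E over attributes Y1..Y5 (table 4, each scored 0..4):
E = normalized_score([3, 2, 1, 2, 0], max_value=4)
# A and E are then mapped through the intervals of table 1 (not reproduced
# in this excerpt) to obtain a likelihood level.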
Risk acceptability (safety impact vs. likelihood, excerpt):

LIKELIHOOD \ SAFETY IMPACT | No Effect  | Minor        | Major        | Hazardous    | Catastrophic
pV: Frequent               | Acceptable | Unacceptable | Unacceptable | Unacceptable | Unacceptable

* = assurance must be provided that no single vulnerability, if attacked successfully, would result in a catastrophic condition
Security Requirements. For each unacceptable threat scenario, a set of security objectives is established. These are translated into security requirements using the Secu-
rity Functional Requirements (SFR) classes of Common Criteria part 2 in order to
have an initial template to express security requirements in a formal way. Indeed,
Common Criteria provide a classification of requirements patterns where inter-
dependencies between them are already traced.
7 The DAL denotes the rigor dedicated to the design and development of a system according to its criticality in terms of safety impact; it sets objectives to provide assurance to certification authorities that the developed system performs its intended functions safely. For example, a DAL A system will receive the maximum care, as a failure would have a catastrophic impact, whereas a DAL E system will have no design constraints, as a failure would not have any consequence on safety of flight. Design and development rules are given by standards DO-178B for software and DO-254 for hardware.
procedures quality. To do so, we have mapped each SL with Common Criteria EALs
(Evaluation Assurance Levels). Each EAL is linked to a set of assurance families
themselves composed of SARs (Security Assurance Requirements). Assurance re-
quirements aim at establishing accurate development rules so that security functions correctly perform their intended purpose and so that means to maintain security during development, maintenance, and operational use are taken into account.
4 Study Case
4.1 Scope
Let us consider the Weight and Balance (WBA) function, which ensures 3D stability control of the aircraft's center of gravity. It determines flight parameters (e.g.: quantity of
kerosene to be loaded, takeoff run and speed, climbing angle, cruising speed, landing
roll) and requires interactions with ground facilities. Figure 2 depicts the interactions required by the WBA function: check-in counters furnish the number and distribution of passengers in the aircraft. A ground agent enters the weight of bulk freight loaded in the aft hold. Weight data is sent directly via data link to the ground WBA calculation tool to compute flight parameters. On the ground, the flight crew imports the flight parameters, which are directly loaded into the Flight Management System (FMS).
Fig. 3. Top-down approach to threat scenario identification: from feared event to potential causes
ately wrong weight data on freight laptop” (TS2) and “intruder modifies flight param-
eters by accessing directly to FMS” (TS3).
To summarize, for each threat scenario, attacker capability and asset exposure are evaluated using a set of attributes and scales (respectively, tables 3 and 4 for this study case). The values A and E are obtained via equation 1 and mapped through the intervals of table 1 to determine likelihood. The obtained likelihood level, combined on table 2 with the safety impact of a successful attack attempt, allows deciding on risk acceptability. Results are gathered in table 5.
Attacker capability attributes and their value scales (3 to 0):

Attributes                      | 3           | 2          | 1           | 0
X1: Elapsed time for the attack | minutes     | hours      | <day        | >day
X2: Attacker expertise          | "misuser"   | layman     | proficient  | expert
X3: Attacker system knowledge   | public      | restricted | sensitive   | critical
X4: Equipment used              | none        | domestic   | specialized | dedicated
X5: Attacker location           | off-airport | airport    | cabin       | cockpit
Asset exposure attributes and their value scales (4 to 0):

Attributes          | 4            | 3              | 2                  | 1       | 0
Y1: Asset location  | off-aircraft | cabin          | maint. facility    | cockpit | avionic bay
Y2: Class8 of asset | class 1      | class 2        | class 3            |         |
Y3: DAL             | DAL E        | DAL D          | DAL C              | DAL B   | DAL A
Y4: Vulnerabilities | large public | limited public | not public         | unknown | none at all
Y5: Countermeasure  | none         | organizational | technical on asset |         | >2 on chain

8 class 1: Portable Electronic Device (PED); class 2: modified PED; class 3: installed equipment under design control.
5 Conclusion
This paper justifies the need to develop an efficient risk assessment method to build secure architectures for digital aircraft. We aim at introducing security considerations at an early design step of the development, allowing a certain degree of freedom to use attributes that best fit the scope of analysis. Criteria taxonomy rules are to be improved by practice to make procedures as systematic and accurate as possible.
However, the exhaustiveness of threat scenario identification cannot be proved nor guaranteed. Readjustments will have to be made to comply with future ED-203 modi-
fications. This methodology has been tested on various examples and then applied on
a real case of security certification. It has been agreed by the certification authority, provided that intrusion test results validate the coherence of the identified threat scenarios and possibly reveal new vulnerabilities.
References
1. SAE International (Society of Automotive Engineers, Inc.): Certification Considerations
for Highly-Integrated Or Complex Aircraft Systems (ARP-4754). USA (1996)
2. SAE International (Society of Automotive Engineers): Guidelines and methods for
constructing the safety assessment process on civil airborne systems and equipment (ARP-
4761). USA (1996)
3. Radio Technical Commission for Aeronautics (RTCA SC-167) and European Organization
for Civil Aviation Electronics (EUROCAE WG-12): Software considerations in airborne
systems and equipment certification (DO-178B/ED-12). Washington, USA (1992)
4. European Organization for Civil Aviation Electronics (EUROCAE WG-46) and Radio
Technical Commission for Aeronautics (RTCA SC-180): Design assurance guidance for
airborne electronic hardware (DO-254/ED-80). Paris, France (2000)
1 Introduction
The radical change in the energy market towards renewable energy production that has been initiated by policy makers causes a high demand for wind turbines to be
built. Because of concerns regarding noise emissions and scenic impacts, there
are ongoing plans to place more and more wind turbines into offshore wind parks.
In Germany, 24 wind parks in the North Sea have been approved so far ([1], [2]).
However, the construction of many of these wind parks is delayed.
As such a huge change in a short time can only be realized by a large number of companies constructing multiple facilities concurrently, many new players rush into the offshore wind energy market. Not all of these companies have extensive experience in the maritime or offshore sector or are familiar with the required safety assessment procedures. Implementing the necessary practices and processes is a highly complex task. Not supporting their adoption could be a delaying factor for the energy transition. Recent events ([3], [4]) have shown that
* This work was partially supported by the European Regional Development Fund (ERDF) within the project Safe Offshore Operations (SOOP), http://soop.offis.de/.
** This work was partially supported by the European Commission, funding the Large-scale integrating project (IP) proposal under the ICT Call 7 (FP7-ICT-2011-7) Designing for Adaptability and evolutioN in System of systems Engineering (DANSE) (No. 287716).
nevertheless these assessments have to take place to protect personnel and environment. In many aspects, offshore scenarios fulfil the criteria of a System of Systems (SoS). Maier [5], for instance, uses five characteristics that, depending on their strength, impose different challenges for offshore operations:
• Operational independence of the elements: To a small extent, the people and systems may act independently during an offshore operation. However, they are mostly directed by guidelines and supervisors.
• Managerial independence of the elements: The systems (e.g. construction ships) are to some extent not dependent on other systems during an operation. Nevertheless, complete independence is not given.
• Evolutionary development: While the first offshore operations had prototypical
character, new technological possibilities as well as political/legal constraints lead
to an evolution of the operations, its procedures, and the involved systems.
• Emergent behavior: As of today, there has (to our knowledge) not been a sys-
tematical investigation of the interaction of people and systems during offshore
operations. To perform such an analysis, a model that consistently integrates the
behavior of the involved systems is a preferable solution.
• Geographic distribution: There might be a large geographic distribution as the
guidance authorities for an operation might reside onshore whereas the operation
itself takes place offshore. Further, the geographic distributions of involved systems
and people offshore has a strong impact on the efficiency.
Therefore, we consider the collection of all systems and persons involved in typi-
cal offshore operations as an SoS.
The SOOP project1 aims at supporting the planning and execution of safe offshore
operations for the construction and maintenance of offshore wind turbines. A
special focus is set on the behavior of the persons involved. To analyze an
operation, a model-based approach is used, including modeling the behavior of the
involved persons, as described in [6]. Thus, a conceptual model is built and
maintained that describes the interaction of systems and persons as well as the
evolution of the system. The architecture of the system, and thus the conceptual
model, will change over time as new needs may arise during the project
implementation. Another aspect of the SOOP project is the identification and
mitigation of possible risks during the planning process while also taking the
situation into consideration. The results will also be used for an online assistant
system that monitors the mission (e.g. the location of crew and cargo, cf. [7]) and
warns if a hazardous event is emerging. This is intended as a further way to avoid
risks during an offshore operation.
In this paper, we focus on the risk assessment aspects of the project. We discuss
our current approach to performing those steps and present the methods we developed
for an improved risk identification process. After introducing some terms and
definitions, we first give an overview of current hazard identification and risk
assessment approaches. Later, we show how the conceptual model can be used to guide
the hazard identification and formalization process and how it can be used for
modeling the relevant scenarios and risk mitigation possibilities.
1 http://soop.offis.de/
Oil and gas companies have collected a lot of experience in the offshore sector.
Safety assessments have been performed in this area for a long time and a large
knowledge base exists. Nevertheless, these experiences cannot be applied directly
to offshore wind turbine operations: although some similarities exist, most of the
risks differ substantially. For example, there are many risks regarding fire and
explosion on oil and gas rigs, as both handle ignitable compounds; these are not
primary risks for offshore wind turbines. Neither are there risks such as blowouts
or leakage. Despite these differences, some operations are common to both types of
offshore activity. Therefore, Vinnem [10] has been taken into consideration as a
reference for the planning of oil-rig-related operations. It describes the state of
the art in risk analysis in the domain. In detail, it addresses the steps of
Quantitative Risk Analysis (QRA), a type of risk assessment that is frequently
applied to oil- and gas-related offshore operations. Its approach is based on the
standards IEC 61508 [9]
(which is also the basis for ISO 26262) and IEC 61511 [11]. The steps involved
in the QRA approach are depicted in fig. 1, which we have extended with the
shaded box (along with annotations of our developed methods). They include
identifying possible hazards, assessing their risks and developing risk mitigation
measures if the identified risk is not tolerable. We further describe these steps
when introducing our modified approach in sec. 4.
In order to identify all possible hazards, Vinnem further introduces HAZID
(HAZard IDentification) as an analysis technique, which suggests which steps need
to be performed and which sources should be taken into consideration when
identifying hazards. These sources include checklists, previous studies, and
accident and failure statistics. Performing the approach requires a lot of manual
work and demands experienced personnel. For newly planned operations, this is a
time-consuming and expensive process. In addition, the HAZID process is not well
defined, not structured, and has no source that completely lists the relevant
potential hazards or risks. To improve this, we introduce a guided way of
identifying hazards, described in sec. 4.
A further approach is Formal Safety Analysis, which is also used for offshore
safety assessment [12]. It is based on assigning risks (also identified using
HAZID) to three levels: intolerable, as low as is reasonably practicable (ALARP),
and negligible. Risks assigned to the ALARP level are only accepted if it is shown
that serious hazards have been identified and that the associated risks are below a
tolerance limit and reduced “as low as is reasonably practicable”. Because this
concept does not rely on quantification but rather uses an argumentative method for
assessing risks, the analysis might not be complete and requires a lot of manual
expert effort. For this reason, it offers no concepts of interest for use with our
model-based approach.
Of particular interest is the current automotive standard, as, in contrast to the
processes in the offshore sector, those in the automotive sector are more time- and
cost-efficient. This is due to the strong competition between manufacturers in this
industry, the large number of units sold, the short innovation cycles, and a
high-volume market with many different car configurations. To achieve a
cost-efficient risk assessment process, a specialized approach, defined in
ISO 26262 [8], is used by the automotive companies. In contrast to the offshore
sector, the automotive industry also considers controllability as a factor in the
risk assessment, as described in sec. 2. There also exist approaches for
model-based safety assessment, for instance those developed in the ESACS and ISAAC
projects (cf. [13, 14]) in the aerospace domain.
The most important difference between the risk assessment approach of the
automotive sector and the one used in the offshore sector is that the automotive
approach includes a third risk assessment factor: the controllability of hazardous
situations. We borrow this concept as a further assessment factor in our approach;
it supports risk mitigation by introducing measures that raise the awareness of a
risk, thus allowing its prevention or a reduction of its impact. Considering this
parameter enables us to include human factors in our analysis, that is, the ability
of humans to react to a hazardous event in a way that lowers its impact or even
prevents it. Further, the mission assistant developed in the SOOP project
(cf. sec. 1), which might alert the personnel to potential hazards and thus allows
avoiding them or mitigating their impact, can also be incorporated. The modified
approach with added controllability (marked by shading) can be seen in fig. 1. We
have also added information about how our methods are integrated with the QRA
approach in the boxes on the right side.
Fig. 1: Overview of the risk assessment steps. Our methods to support them are
annotated in boxes. Enhanced version of the QRA approach from [10]; enhancements
are marked by shading.
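To illustrate how the borrowed controllability factor refines the risk picture, the
following minimal sketch classifies hazardous events in the style of the ISO 26262
ASIL table; the parameter scales and the additive encoding of that table are our
own illustrative assumptions, not part of the approach described in this paper.

```python
# Minimal sketch of an ISO 26262-style risk classification that adds
# controllability (C) to severity (S) and exposure/frequency (E). The
# additive encoding of the ASIL table is an illustrative shortcut.

def asil(severity: int, exposure: int, controllability: int) -> str:
    """severity in 1..3 (S1-S3), exposure in 1..4 (E1-E4),
    controllability in 1..3 (C1-C3)."""
    score = severity + exposure + controllability
    # The highest combination (S3, E4, C3) yields ASIL D; each step
    # down in any parameter lowers the integrity level by one.
    return {10: "ASIL D", 9: "ASIL C", 8: "ASIL B", 7: "ASIL A"}.get(score, "QM")

print(asil(3, 4, 3))  # ASIL D: severe, frequent, uncontrollable
print(asil(3, 4, 1))  # ASIL B: same event, but usually controllable
```

In this encoding, a hazardous event that a trained crew can usually control (C1) is
rated two integrity levels below the same event with no controllability (C3), which
mirrors the effect the mission assistant is intended to exploit.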
A further concept originating in the automotive sector and used in our approach is
that of Hazardous Events. Their usage enables us to further differentiate hazards
by specifying the situations in which they occur. This allows us to assess the
impact of a hazard in a specific situation, as the impact might be
dependent on the operational situation. A hazard might have a more severe con-
sequence in some situations than in others. In a few situations, a hazard may
even have no relevant impact at all.
In the following sections, we introduce every step of our approach as well as
the supporting methods.
[Figure: Excerpt from the Generic Hazard List matrix (not reconstructible from the
text extraction). Rows list process steps and environmental conditions (e.g.,
Entering/Moving/Navigating, Approaching, Watching, Starting, Using, Stepping Over,
Wind); columns list potential hazards (Person injured, Grounding, Loss of oil,
Capsizing, Collision, Sinking, Fire, ...). A cell containing an entry ID (e.g.,
1.1, 6.1, 15.9) marks the combination as relevant and refers to a detailed list
entry; '---' marks combinations judged not relevant. Fragments of the accompanying
prose were lost in extraction.]
2 http://www.imca-int.com/documents/publications.html
[Figure: Two exemplary entries of the Generic Hazard List (reconstructed layout):]

Entry 17.9 - Slipping off while using a ladder during wet weather
Involved Actors: Person, Safeguarding
Locations: Ladder
Potential Source: ---
Description: A person uses a ladder during wet weather. The person slips off and falls down.
Possible Reasons: Sudden weather change, wrong evaluation of situation, faulty safeguarding
Promotive Factors: Missing maintenance, missing training, missing safety equipment
Preventive Factors: Not letting persons use the ladder during wet weather, improved safeguarding
Related Entries: 17.10
Further Information: ---

Entry 17.10 - Falling into water because of slipping off a ladder during wet weather
Involved Actors: Person
Locations: Ladder
Potential Source: 17.9
Description: A person uses a ladder during wet weather. The person slips off and falls down. The person lands in the water.
Possible Reasons: Sudden weather change, wrong evaluation of situation, faulty safeguarding
Promotive Factors: Missing maintenance, missing training, missing safety equipment
Preventive Factors: Not letting persons use the ladder during wet weather, improved safeguarding
Related Entries: ---
Further Information: ---
Identifying Hazardous Events: To assess the risks associated with the identified
hazards, the hazards have to be documented. As we use Hazardous Events, which we
introduced in sec. 2 and sec. 4, we extended the documentation with these events as
well as with all their dependencies and the dependencies of the hazards. The
documentation is realized as a list of all events and conditions. They are
distinguished by a type and linked by a dependency structure through a column that
lists the causes of each event. An excerpt of possible table content is depicted in
fig. 4, showing the previously identified hazards with some of their causes and two
resulting hazardous events. By parsing the Causes column, a fault tree can be
generated automatically. Furthermore, the dependency structure can be used to
formalize the hazardous events listed in the table, which is useful for analyzing
the scenario, as we will demonstrate in the next section.
ID Type Description Causes Comments Consequence Probability Controllability
7 Hazard Falling into water 18; 39; 75, 62; 18, 75; 63, 75; 64, 75; 65, 75; 0.01
39, 75; 92; 93
43 Environmental Condition Fog 0.2
44 Environmental Condition Rain 0.2
45 Environmental Condition Slickness 43; 44 0.4
76 Failure Missing safeguarding 3; 4; 11; 33; 35 0.04
86 Fault Slipping off the ladder 45; 54; 55; 56; 57; 71; 84; 84, 85; 85 0.06
91 Human Failure Not at least one hook hooked in 3; 4; 33; 35; 78 0.02
92 Failure Falling down the ladder 76, 86; 86, 91 0.05
103 Operational Situation Temperate water 0.3
104 Operational Situation Cold water 0.7
108 Hazardous Event Falling into cold water 104, 7 0,8 0.71 0.2
109 Hazardous Event Falling into temperate water 103, 7 0,3 0.31 0.4
Fig. 4: Excerpt from the list of events. The Hazardous Events are marked.
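The Causes column in fig. 4 appears to encode alternative cause combinations, e.g.
"76, 86; 86, 91" for event 92. Assuming that ';' separates alternative cause sets
(OR) and ',' joins causes that must coincide (AND), which is our reading of the
table rather than a documented format, the automatic fault tree generation
mentioned above could look as follows:

```python
# Sketch of fault tree generation from the Causes column of fig. 4.
# Assumed convention: ';' separates alternative cause sets (OR),
# ',' combines causes that must occur together (AND).
def parse_causes(cell: str):
    """'76, 86; 86, 91' -> [(76, 86), (86, 91)] (OR over AND-sets)."""
    if not cell.strip():
        return []  # basic event without further causes
    return [tuple(int(c) for c in alt.split(",")) for alt in cell.split(";")]

# Excerpt of fig. 4 (causes of the lower-level events omitted for brevity):
EVENTS = {
    92: ("Falling down the ladder", "76, 86; 86, 91"),
    76: ("Missing safeguarding", ""),
    86: ("Slipping off the ladder", ""),
    91: ("Not at least one hook hooked in", ""),
}

def print_fault_tree(eid: int, indent: str = "") -> None:
    desc, cell = EVENTS[eid]
    alternatives = parse_causes(cell)
    gate = "OR" if len(alternatives) > 1 else ("AND" if alternatives else "BASIC")
    print(f"{indent}{eid}: {desc} [{gate}]")
    for alt in alternatives:
        for cause in alt:
            print_fault_tree(cause, indent + "  ")

print_fault_tree(92)
```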
Fig. 5 only shows the correct process of stepping over and climbing up the
ladder. In order to detect potential hazards in the scenario, a formalization of
the hazardous events can be used.
In order to minimize the risk of an offshore scenario, a risk mitigation process
has to take place for hazardous events that have a high risk quantification value.
This can be realized by developing measures to prevent certain faults, thus
lowering the frequency of occurrence of a hazardous event. Another way to minimize
the risk is to raise the controllability of the hazardous event. To achieve this,
the awareness of potential hazards has to be raised so that a proper reaction to
the hazardous event can take place, or other technical measures have to be
considered.
5 Conclusion
In this paper, we have presented a new method to improve risk assessment for
construction and maintenance operations for offshore wind energy parks. We
have based our approach on existing techniques used for risk assessment of off-
shore and maritime operations. To extend the approach, we adopted methods
from the automotive domain for improving and optimizing the risk assessment
process by adding the factor Controllability (as defined in ISO 26262) to com-
plement the existing factors Frequency and Consequence for forming the risk
picture. Taking controllability into account enables a better distinction between
the possible outcomes of a hazardous event (a concept also taken from ISO 26262),
thereby improving the correct assessment of the actual risk. Furthermore, it
improves the ability to evaluate the effectiveness of mitigation measures.
This is especially important with respect to the SOOP project in which we plan
to develop an assistant system that is intended to raise awareness of developing
critical situations and suggest mitigation measures in case a hazard has actually
happened (cf. sec. 1).
One of the central improvements is our Generic Hazard List that we are
specifically developing to systematically identify potential hazards and their
causes in offshore operations. The idea of a generic list of hazards was adopted
from a similar concept, originating from automotive projects. In our approach,
we have taken this tool and further improved it to not only address the topic of
hazard identification, but also to include the possible causes leading to a hazard.
Once this data had been captured, we were able to formalize the dependency
relations between faults, errors, failures, and hazards.
Our next step is to model the behavior of the system, the environment,
and participating persons. This behavior model is the necessary precondition
for further model-based safety analysis of the system. Techniques to be used
include those developed during the ESACS and ISAAC projects which enable
model-based FMEA (Failure Mode and Effect Analysis) and FTA (Fault Tree
Analysis) (cf. [13, 14]).
References
1. Bundesamt für Seeschifffahrt und Hydrographie: Genehmigung von Offshore
Windenergieparks. http://www.bsh.de/de/Meeresnutzung/Wirtschaft/Windparks/
index.jsp (2012)
2. Bundesministerium für Umwelt, Naturschutz und Reaktorsicherheit: Entwicklung
der Offshore-Windenergienutzung in Deutschland/Offshore wind power deploy-
ment in Germany. (2007)
3. Brandt, J.: Stundenlange Suche nach dem Vermissten auf See. Ostfriesen-Zeitung
(26.01.2012)
4. Vertikal.net: Fatal accident in Harwich. http://www.vertikal.net/en/news/story/
10145/ (21.05.2010)
5. Maier, M.W.: Architecting Principles for Systems-of-Systems. TiAC White Paper
Repository, http://www.infoed.com/Open/PAPERS/systems.htm (20.12.2010)
6. Lenk, J.C., Droste, R., Sobiech, C., Lüdtke, A., Hahn, A.: Towards Cooperative
Cognitive Models in Multi-Agent Systems. In: International Conference on Ad-
vanced Cognitive Technologies and Applications. (2012)
7. Wehs, T., Janssen, M., Koch, C., von Cölln, G.: System Architecture for Data Com-
munication and Localization under Harsh Environmental Conditions in Maritime
Automation. In: IEEE 10th International Conference on Industrial Informatics.
(2012)
8. International Organization for Standardization: ISO/DIS 26262 - Road vehicles -
Functional safety. (2011)
9. International Electrotechnical Commission: IEC 61508. (2010)
10. Vinnem, J.E.: Offshore Risk Assessment. 2nd edn. Springer (2007)
11. International Electrotechnical Commission: IEC 61511. (2003)
12. Wang, J.: Offshore safety case approach and formal safety assessment of ships.
Journal of Safety Research 33(1) (2002) 81 – 115
13. Peikenkamp, T., Cavallo, A., Valacca, L., Böde, E., Pretzer, M., Hahn, E.M.: To-
wards a Unified Model-Based Safety Assessment. In: Proceedings of SAFECOMP.
(2006) 275–288
14. Åkerlund, O., et al.: ISAAC, a framework for integrated safety analyses of func-
tional, geometrical and human aspects. ERTS (2006)
15. Reuß, C.: Automotive Generic Hazard List. Technische Universität Braunschweig
(2009)
16. Beisel, D., Reuß, C., et al.: Approach of an Automotive Generic Hazard List.
In: Proceedings of European Safety and Reliability, ESREL. (2010)
17. IMO: Lessons Learned from Casualties for Presentation to Seafarers.
http://www.imo.org/blast/mainframe.asp?topic_id=800 (2004/2006/2010)
18. Latvala, T., Biere, A., Heljanko, K., Junttila, T.: Simple is better: Efficient bounded
model checking for past LTL. In: VMCAI. Volume 3385 of LNCS, Springer (2005)
380–395
Risk Analysis and Software Integrity Protection for
4G Network Elements in ASMONIA
Manfred Schäfer
Abstract. The paper gives an insight into current ASMONIA research work on risk
analysis, security requirements, and defence strategies for 4G network elements. It
extends the 3GPP security architecture for 4G networks, in particular when these
are part of critical infrastructures. Based on identified requirements, it focuses
on enhanced protection concepts, aiming to improve the implementation security of
threatened elements in 4G networks through attack-resistant mechanisms for
integrity protection, covering attacks against a system during boot and execution
time. The concepts concentrate on generic mechanisms that can be applied to 4G
network elements and complement other methods researched in ASMONIA. The paper
describes infrastructure aspects of software integrity in mobile networks and
provides proposals for the implementation of verification and enforcement processes
inside self-validating target systems. The proposals are based on typical exemplary
systems, relying on Linux and QEMU/KVM.
1 Introduction
The overall goal of ASMONIA [1] is the development of security concepts for 4G
mobile network infrastructures, satisfying relevant security requirements. Research
areas comprise integrity protection, attack and malware detection for network
elements (NEs) and devices, exploitation of attack-resilient and flexible cloud
computing techniques, as well as collaborative information exchange mechanisms.
ASMONIA is partially funded by the German Federal Ministry of Education and
Research within its IT Security Research work program. This paper primarily
addresses specific security aspects related to NEs in 4G networks, providing an
insight into ongoing work. Reflecting essential identified security requirements,
it concentrates on concepts for software (SW) integrity protection. Correlative
aspects of integrity protection for devices are examined in accompanying ASMONIA
work, e.g., [22], and are not further considered in the context of this paper.
they successively become susceptible to various, partly well-known attacks. While
relevant security requirements have already been established by 3GPP
standardization, room is left for their fulfilment, in particular for
attack-resilient design and implementation. Taking these aspects into account,
below we identify specific challenges and resulting demands, and we elaborate
proposals for typical system architectures and execution environments.
2.1 Security Requirements for 3GPP Access Network
Reflecting the peculiarities of the 4G network architecture, relevant security
requirements have been stated in 3GPP standardization. In the context of this
paper, this particularly affects elements in access networks, such as eNB and HeNB
([20], [21]), which terminate security relations between user equipment and
backhaul. While [21] explicitly demands integrity of SW, the setup of secure
environments, and the exclusive use of authorized software, [20] postulates
principles of secure SW updates, secure boot, and autonomous validation, implying
reliable capabilities for self-validation. As 3GPP restricts its focus to
interoperability and compatibility aspects (e.g., the use of TR-069 in [20]),
considerable freedom is left for interpretation and implementation. This pertains
to the implementation of measures against malicious physical access – in the
implicit knowledge that, due to exposure in public areas (and even personal
possession, as in the HeNB case), particular protection must be provided, e.g., for
stored keys. Likewise, no explicit requirements exist for the runtime to assure
that a securely booted system withstands attacks while operating over weeks or even
months.
2.2 ASMONIA-related Requirements
The 4G risk analysis made in ASMONIA [3] provides a very detailed insight into and
assessment of threats and risk influence parameters regarding individual NEs, also
beyond the access network. It substantiates why and which risks are ranked
particularly high, suggesting particular security thoroughness in access networks.
Accordingly, system compromise and insider attacks are among the most relevant
threats, which in many cases manifest themselves in unauthorized SW manipulations
that have to be made as difficult as possible. Such risks can be lowered
significantly by SW integrity and sufficient access protection, ensuring that only
authorized firmware and SW execute. In [2] we consider requirements arising from
the need to protect the SW integrity mechanisms themselves, particularly when they
are implemented in a target system relying on sensitive functionalities for
validation and enforcement. In addition to compliance with 3GPP (e.g., proof of
origin; autonomous and self-validation; secure boot process; authorized SW
updates), enhanced implementation security and in particular runtime protection are
motivated by the typical conditions of 4G systems (e.g., long-lasting operation
after boot; very long product life cycles and restricted possibilities for costly
manual repair once systems like eNBs are installed in the field; avoidance of
class-break risks; preventing systems from remaining unnoticed in a vulnerable
state; etc.). In addition, special requirements arise from the obviating and
warning nature of the ASMONIA cooperation and the underlying protection, implying
mechanisms for reporting and for hardening of network elements. It is argued that
the complexity of today's systems and in particular 'execution environments'
requires a
but requiring physical access at the board level. It has been shown that even
'Dynamic Root of Trust (DRTM)' communication can be hijacked, and simpler tricks
also seem possible (e.g., blocking LPC communication when exiting DRTM-measured
code would keep a TPM in its previous state, which would then be attested; even
worse, TPM-sealed secrets would then be accessible to malicious code). Even if DRTM
has interesting capabilities, it is restricted to a few CPUs and thus cannot be
applied in general security concepts, as these would depend on decisions related to
a system's board design and to the selectable processing units. Summarizing, we may
state that (with respect to physical attacks) the TPM itself may be secure, but due
to its design particularities it does not provide chip-level security for the
assurance of trustworthy boot processes. Nevertheless, an indisputable advantage of
TPMs is the hardware (HW) protection of private keys, which could be integrated
into authentication processes, reliably preventing a device identity from being
cloned.
When looking for alternatives [19], we found that applying PKI-based paradigms for
SW signing would remove most organisational and cryptographic limitations of TCG's
hash-centric approaches and would reduce efforts in the operator network.
Signatures are well suited for approaches applicable to very different use cases.
Unfortunately, supporting general-purpose HW is missing at the time being, and even
if CPU vendors are going to implement related mechanisms, these may not be
compatible with each other and thus may cause portability problems in concepts for
secure HW design. Due to these findings, and facing the requirements mentioned
above, (CPU/OS-independent) HW-based concepts are advisable, particularly for use
cases requiring higher attack resilience against unauthorized physical access.
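As an illustration of the signature-based paradigm favoured here, the following
sketch verifies a detached RSA/SHA-256 signature over a SW image against a vendor
certificate using the Python 'cryptography' package; the file names, the algorithm
choice, and the omitted chain validation are assumptions made for this sketch, not
ASMONIA specifics.

```python
# Illustrative PKI-based SW validation: check a detached signature over
# an image against the vendor certificate's public key. RSA/SHA-256 and
# all file names are assumptions made for this sketch.
from cryptography import x509
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding
from cryptography.exceptions import InvalidSignature

def image_is_authentic(image_path: str, sig_path: str, cert_path: str) -> bool:
    with open(cert_path, "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())
    # Trust in the certificate itself must be established separately by
    # validating its chain against the device's trust anchor (omitted).
    with open(image_path, "rb") as f:
        data = f.read()
    with open(sig_path, "rb") as f:
        signature = f.read()
    try:
        cert.public_key().verify(signature, data,
                                 padding.PKCS1v15(), hashes.SHA256())
        return True
    except InvalidSignature:
        return False
```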
• The NE vendor (manufacturer), which creates the product, provides means and
control mechanisms for integrity protection, and sets the policies for validation.
The manufacturer also ensures that products are equipped with the components
required to reliably support validation and enforcement mechanisms.
• Service and repair entities, which belong to the NE vendor organization or act on
behalf of the vendor. Over a product's lifecycle the same protection mechanisms are
used, and these entities need to be authorized, for instance, to modify protected
firmware.
• Component suppliers, which may create and deliver hardware or software components
assembled as part of the product. As such components must be trusted, an associated
integrity protection method must be applied before the system is composed. This
method can be independent of or aligned with the vendor's SW-IP system, but must be
compliant with the objectives for integrity protection.
• Delivery services, which are used to cache and distribute products to the
recipients. In case code signing paradigms are used, delivery may be reduced to a
logistic service, which does not need additional security except for
confidentiality purposes where required, e.g., to protect SW containing secrets.
Otherwise, the major obligation is to check integrity at the point of acceptance
(depending on the use case).
• Mobile network operators, which use and maintain the products. Verification and
enforcement are typically done inside the operator network, possibly to a large
extent or even completely within the products. An operator may protect its own
configuration data (or even vendor-provided SW, as, e.g., explicitly intended for
HeNBs). The scheme allows major efforts to be imposed on the vendor side, while
efforts in the operator-side infrastructure (a highly standardized environment) can
be kept small.
• The ASMONIA overlay network, providing analytic and collaborative services.
Essentially, it aggregates and processes anomaly messages (e.g., logs and alerts
from failed SW-IP validation), but it may also store SW and validation data (e.g.,
signatures, certificates, trusted reference values) for update, revocation, and
recovery scenarios. The ASMONIA network also connects operators with each other.
The overlay network may need protection mechanisms of its own, but such security is
widely independent of the mechanisms provided by the NE vendor and is not
considered further in this paper.
4.2 System Reference Architecture
When looking into a target system (i.e., an NE product), various (partly
alternative) components are required to securely anchor integrity protection
mechanisms during boot and runtime. As today's systems are vulnerable at multiple
layers, numerous defence strategies have to be applied in parallel, taking very
different attack vectors into account.
processes, which will then be executed by a system CPU, starting with the trusted
firmware. SW updating is supported by validating any new code before it is
installed into the system; the code is re-checked at each start-up.
In case of firmware validation failures, the AFCP may select fall-back strategies,
such as mapping to a previous image or even to an immutable pristine boot image
corresponding to the initial factory settings. In contrast to the passive nature of
TPM designs, the AFCP enables autonomy, removes restrictions and hassles with LPC
bus communication, and, particularly in long-term scenarios, profits from SW
integrity protection schemes using PKI control. Moreover, as the system is enabled
for self-validation, difficulties such as those with remote attestation are no
longer relevant, avoiding the effort of complex provisioning and validation
processes as well as additional entities in the operator infrastructure. Note that
resilience against physical attacks widely depends on the implementation, but
assuming a realization via integrated circuits, it might be comparable with other
security HW. Note also that, if requested, a TPM could be integrated as 'security
HW', e.g., for the protection and secure usage of private keys and secrets, but it
is actually not needed for the secure boot logic.
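Schematically, the fall-back behaviour described above can be pictured as a
priority loop over boot candidates; the image names and helper functions below are
placeholders of our own, not AFCP internals.

```python
# Schematic of the AFCP fall-back strategy: boot the first image that
# passes validation, ending with an immutable pristine factory image.
# All names are illustrative placeholders.
BOOT_CANDIDATES = ["firmware-current", "firmware-previous", "firmware-pristine"]

def select_boot_image(verify) -> str:
    """verify(image) -> bool is the HW-anchored signature check."""
    for image in BOOT_CANDIDATES:
        if verify(image):
            return image          # first valid image wins
        report_anomaly(image)     # e.g., alert towards the ASMONIA overlay
    raise SystemError("no valid boot image, halting in a safe state")

def report_anomaly(image: str) -> None:
    print(f"validation failed for {image}")  # stand-in for real reporting
```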
Runtime SW Integrity Protection
Once a system is running, control of SW integrity is handled via the system CPU,
relying on a securely booted kernel or hypervisor (HV), and depending on the system
configuration and usage scenarios. The underlying strategy is to reliably block
invalid (unauthorized) resources at load time via system call interception. This
facilitates event-triggered integrity checks on any data in the file system. If no
HV is present, this can be processed via Linux Security Module (LSM) hooks built
into a Linux kernel, as described in principle by the DigSig approach [10], [11].
In case a hypervisor is available, SW integrity checks can be executed via the
virtualization layer, making it possible to control resources for arbitrary (Linux)
guests, even without modifying them individually. However, this comes with some
restrictions, as the HV has limited knowledge about the semantics of file
operations initiated from guests. On the other hand, it is well separated from
attacks against SW-IP mechanisms coming from a guest. In the following we describe
a flexible and harmonized solution, which makes use of DigSig principles built into
a (hardened) Linux kernel. It can be applied to a standalone OS (protecting itself)
as well as to a hypervisor (e.g., if built on Linux, as with QEMU/KVM-based
virtualization) to protect itself and likewise its individual guests. Adaptations
to the guest kernels or to QEMU (emulation of block devices) would also be possible
(and might in some cases further enhance flexibility and security), but due to the
complexity of such an approach this is not considered here. Instead, in the
following we assume that neither QEMU nor a virtualized OS is adapted to the
solution.
To broaden the scope of protection, we first substitute the native ELF-based
approach proposed in DigSig (which is restricted to binaries) by a separate signed
database maintaining signature objects or hashes. By trust transition, the signed
database makes it possible to apply the same cryptographic mechanisms and
management principles (i.e., PKI control via certificates, signed objects, trust
anchor, CRLs, etc.) as provided through the AFCP. In addition, it allows the
integration of arbitrary file types (not only ELF files), e.g., documentation,
scripts, or any (configuration) data shipped with a distribution. Fig. 4 below
describes a suitable solution, including caching mechanisms similar to the initial
approach, but applicable to any file resource. The integrity of the (remotely)
signed database is validated using the mechanisms introduced above. Via signed
policies, individual files or directories (/tmp, ...) can be excluded from being
checked, or dependencies between individual files can be expressed.
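A possible shape of the load-time check against such a signed database is sketched
below, assuming the database maps file paths to SHA-256 digests and has itself been
validated once via the mechanisms above; this is an illustrative reconstruction,
not the implemented kernel code.

```python
# Sketch of event-triggered load-time checking against a signed hash
# database (itself validated once at start-up via PKI, as described).
import hashlib

class SignedHashDB:
    def __init__(self, entries: dict, excluded=("/tmp",)):
        self.entries = entries    # path -> expected SHA-256 hex digest
        self.excluded = excluded  # signed policy: paths exempt from checks

    def may_load(self, path: str) -> bool:
        if any(path.startswith(prefix) for prefix in self.excluded):
            return True           # excluded by signed policy
        expected = self.entries.get(path)
        if expected is None:
            return False          # unknown resource: block
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        return digest == expected # block on any mismatch

# At interception time (LSM hook or HV layer), enforcement amounts to
# denying the open/exec request whenever may_load() returns False.
```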
launched against sensitive functionality and data within the kernel it is running
in, we consider three countermeasures, which are illustrated in Fig. 6:
First, we propose to apply a mitigation approach based on Mandatory Access Control
(MAC), such as available with SeLinux [16] or RSBAC [15] for Linux systems.
Associated policies should be established to protect the kernel against attacks via
hijacked functions in kernel and user space, in particular attacks violating
control-flow integrity (e.g., exploits running on stack or heap, or return-to-libc
attacks). Such a mitigation approach might be implemented in a standalone OS or
applied to the HV or to the guest (as shown on the left side of Fig. 6) to protect
the SW-IP mechanisms and data residing in kernel space.
Second, an effective, proactive countermeasure is known from the SecVisor approach
[14], which aims at preventing the injection of malicious code into the (guest's)
kernel memory by applying DMA and W⊕X protection via IOMMU virtualization and
dynamic access control on memory pages during transitions between user and kernel
mode. While SecVisor needs a (small) special HV layer, conventional hardening
mechanisms could also be applied, e.g., kernel security patches (non-executable
pages, address space layout randomization, etc.) as described for PAX [17] or
ProPolice [18]. Effects on SW-IP need to be considered carefully (as hashed code
may be affected). Such hardening mechanisms could be combined with a MAC approach
without interfering with each other.
Both types of approaches harden an OS or, respectively, a guest kernel, but
indirectly also an HV, by mitigating or minimizing risks arising from SW-IP-
targeted attacks via a guest's kernel on top.
granularity of signatures, which for reasons of portability and computational
complexity are assigned to 'entire files' and do not take into account the internal
fragments and structure of individual block devices. Otherwise, to enable the
handling of block devices, the SW creation processes as well as QEMU would have to
be adapted, which causes additional effort and implementation difficulties. Thus,
it is much easier and more effective to allow only NFS-routed resources (where
entire files are 'seen') and to validate and control requested resources in the HV
via NFS interception, while the usage of block devices is strictly barred via the
virtualization layers (i.e., QEMU).
Such a solution can beneficially be applied to virtualized NE platforms that
receive their resources via an HV, provided locally or even from remote NFS storage
through the network. An advantage is that arbitrary guests can be secured without
modifications to their kernel code to implement SW-IP mechanisms individually.
Still, some dedicated hardening may be required, depending on the attack vectors
that must be prevented. For NEs running within secured domains and mainly executing
transport and routing functionality, user-level applications are less relevant, and
thus additional individual hardening measures may be dispensable.
When using encrypted NFS to remote servers (e.g., via ssh or TLS), special care has
to be taken that the HV (not the guest) terminates the security relations;
otherwise file operations are not visible and SW-IP mechanisms cannot be applied.
For embedded systems, where only restricted or even no virtualization support can
be established, suitable combinations of the above building blocks can be selected.
5 Conclusion
By extending integrity protection into the runtime, the methods presented allow the
implementation of attack-resistant mechanisms for event-triggered SW integrity
protection in file-based systems. Due to the proposed protection paradigms and
mechanisms (relying on signatures and PKI on the vendor side and, if requested, on
a flexible HW-based foundation of trust in target systems, as well as on
Linux-based implementation concepts), the SW-IP approaches in this paper are well
adapted to the security requirements for NEs in a mobile network. In particular,
the SW-IP paradigms can be applied to a variety of products and use cases for
integrity protection, while efforts in the operator network can be kept minimal.
They also allow the re-use of an established SW signing infrastructure for several
scenarios.
The feasibility of the runtime SW-IP approach has been shown by a proof of concept,
where skeleton components were implemented in a Fedora 'vanilla' kernel with
KVM/QEMU-based virtualization. Still, such a solution leaves some aspects open, for
example how to validate sensitive code and data during execution in memory.
Further, an extended attack analysis examining attacks tailored against such a
solution has not yet been done. However, related research work is within the scope
of the ASMONIA protection concepts and is currently under study.
The author acknowledges the suggestions and assistance of the ASMONIA consortium,
the support of the 'Hochschule der Medien' (Prof. J. Charzinski, Julius Flohr;
Stuttgart, Germany) with a proof-of-concept implementation, and would also like to
thank the colleagues at Nokia Siemens Networks for valuable ideas, discussions, and
comments contributing to this work.
6 References
1. Official ASMONIA Project web-page, http://www.asmonia.de/index.php?page=1
2. A. Egners, M. Schäfer, S. Wessel; ASMONIA Deliverable D2.1 “Evaluating Methods to
assure System Integrity and Requirements for Future Protection Concepts.”, April 2011
3. A. Egners, E. Rey, P. Schneider, S. Wessel; ASMONIA Deliverable D5.1, “Threat and
Risk Analysis for Mobile Communication Networks and Mobile Terminals”, March 2011
4. TCG, TPM Main Specification, Parts 1-3, Specification Version 1.2, Level 2,
Revision 103, July 2007; https://www.trustedcomputinggroup.org/specs/TPM/
5. K. Kursawe, D. Schellekens, and B. Preneel, "Analyzing trusted platform communication,"
In ECRYPT Workshop, CRASH - CRyptographic Advances in Secure Hardware, 2005,
https://www.cosic.esat.kuleuven.be/publications/article-591.pdf
6. E. Sparks, “A Security Assessment of Trusted Platform Modules”, Computer Science
Tech. Report TR2007-597, Department of Computer Science Dartmouth College, 2007
7. J. Winter, K. Dieterich, “A Hijacker’s Guide to the LPC bus”, Pre-Proceedings of the 8th
European Workshop on Public Key Infrastructures, Services, and Applications, EuroPKI
'11, pp. 126 ff.; www.cosic.esat.kuleuven.be/europki2011/pp/preproc.pdf
8. B. Kauer, “OSLO: Improving the security of Trusted Computing”, SS'07 Proceedings of
16th USENIX Security Symposium on USENIX Security Symposium, 2007
9. UEFI: Unified Extensible Firmware Interface, UEFI SPEC 2.3.1, 2011, www.uefi.org
10. A. Apvrille, D. Gordon, et al., “Ericsson DigSig: Run-time Authentication of Binaries at
Kernel Level”, Proceedings of the 18th Large Installation System Administration Confer-
ence (LISA'04), pp. 59-66, Atlanta, November 14-19, 2004
11. A. Apvrille, D. Gordon, “DigSig novelties”, Libre Software Meeting (LSM 2005), Secu-
rity Topic, July 4-9 2005, Dijon, France, disec.sourceforge.net/docs/DigSig-novelties.pdf
12. KVM and QEMU see: http://www.linux-kvm.org/page/Documents
13. Network file system (NFS) vers. 4, RFC 3530, http://tools.ietf.org/html/rfc3530, 2003
14. A. Seshadri, M. Luk, N. Qu, and A. Perrig, "SecVisor: A Tiny Hypervisor to Provide Life-
time Kernel Code Integrity for Commodity OSes", Proceed. of the ACM Symposium on
Operating Systems Principles (SOSP 2007), Stevenson, WA. October, 2007.
15. RSBAC, Rule Set Based Access Control, see: http://www.rsbac.org/
16. SeLinux, Security Enhanced Linux, see: http://www.nsa.gov/research/selinux/docs.shtml
17. PAX, see: http://pax.grsecurity.net/docs/index.html
18. ProPolice / Stack Smashing Protector, see: http://www.trl.ibm.com/projects/security/ssp/,
http://www.x.org/wiki/ProPolice
19. M. Schäfer, W.D. Moeller, ‘Tailored Concepts for Software Integrity Protection in Mobile
Networks’, International Journal on Advances in Security, in vol. 4 no 1 & 2, August 2011
20. 3GPP TS 33.320, http://www.3gpp.org/ftp/Specs/archive/33_series/33.320/33320-b30.zip
21. 3GPP TS 33.401, http://www.3gpp.org/ftp/Specs/archive/33_series/33.401/33401-b10.zip
22. S. Wessel, F. Stumpf, “Page-based Runtime Integrity Protection of User and Kernel
Code”, Proceedings of {EuroSec}'12, 5th European Workshop on System Security, April
2012
Applying Industrial-strength Testing Techniques to
Critical Care Medical Equipment
Christoph Woskowski
1 Introduction
New technologies have a hard time to gain a foothold in embedded systems develop-
ment for a couple of reasons. A lot of time and money has to be invested in reducing
size, average cost per unit and power consumption of a new hardware device until it is
applicable for special purpose embedded systems with limited power source. When
Applying Industrial-strength Testing Techniques to Critical Care Med-
ical Equipment 63
the development is finally finished, the next hardware generation is already available
for the non-embedded market.
Another reason is that, despite the huge progress in creating small and extremely
powerful microcontrollers and in increasing the power density and lifetime of
batteries, there are still many limitations, especially for the software part of
embedded systems development, mainly in available processing power and memory.
A third reason is that the intended use and application area of the device under
development can be a further source of restrictions. Regulations and the obligation
to prove the effectiveness of risk mitigation techniques for safety-critical
medical devices have an impact on the hardware options as well as on the choice of
compiler, programming language, operating system, and test environment.
The bottom line is that modern object-oriented (OO) languages and the paradigms
they incorporate, such as inheritance, encapsulation, and information hiding, are
still uncommon in embedded systems [1] - mainly for debatable performance reasons
[2, 3, 4, 5]. Consequently, architectural and design patterns (e.g., event-driven
architecture, publish-subscribe) as well as related but non-OO concepts (e.g.,
component-based development), which in turn promote testability, can only be found
in a few niche projects.
Other methods for introducing those higher-level concepts, like textual and
graphical domain-specific languages or UML profiles in conjunction with C code
generation, can only emulate object orientation. Even today, most programs for
embedded systems are written in more or less monolithic ANSI C code [1] or assembly
language. If unit testing is not applied from the very beginning of a project,
reaching adequate test coverage for such a legacy code base at a later stage is
very time-consuming and costly.
How can such a system be adequately tested, given its limited external debugging
and tracing interfaces, its inherent complexity, and the multitude of possible
internal states? And what can be done to improve the situation using
industrial-strength techniques?
This paper aims to provide possible solutions by presenting two case studies from a
software engineering point of view. The first case study describes the lessons
learned while introducing system tests at a late development stage of a regulated
medical system. In the second case study, the application of design for testability
and unit testing from the beginning, in combination with keyword-driven and
hardware-supported testing, is documented in detail. This second part of the
article is set against the background of a medical device development in which the
software part is classified as critical (conforming to IEC 62304 [6]).
can be carried out in a manual manner. Since some regressions are normally
discovered during testing, extra time needs to be planned for bug fixing and
re-testing. Unit tests exist for critical parts of the software but are carried out
locally by the developers and are not part of a continuous integration process. The
unit test coverage does not allow the necessary software refactoring to be
performed safely in favor of the new features to be implemented. Even apparently
small and local changes may or may not break remote parts of the system.
The external bus interface enables searching, monitoring, and controlling of
external devices. From a testing point of view, this is an input/output port which
can be used to asynchronously insert test input and to inject errors (e.g., corrupt
bus communication frames). On the other hand, the output of this port needs to be
traced for expected and unexpected system reactions.
Since most system functions are available and triggered by user input via the touch
screen, the main focus of the system tests has to be on performing touch input and
handling graphical user interface (GUI) output. As a result, most keywords of the
testing DSL are vocabulary of the graphical user interface domain. The challenge is
to automatically perform touch input and to validate GUI output by recognizing the
elements displayed on the screen.
There are a couple of possible solutions for triggering the touch recognition of
the screen. An electromechanical or robotic "finger" is very expensive and
error-prone. The injection of a touch event via serial input annuls safety measures
- like the diverse touch recognition of the functional and the monitoring processor
- and thus prevents effective system testing. In the end, the problem is solved by
utilizing a microcontroller to directly inject the voltage corresponding to a touch
of the screen at specific coordinates.
Recognizing GUI output is even more difficult, since the underlying structures like
windows and buttons are not accessible from outside the software. Image recognition
using a camera gets very complex when there are many colors, shapes, texts, and
numbers in different fonts and sizes on the screen. As a compromise between benefit
and cost, the device software is exploited: triggered via the debug serial port, it
calculates a checksum of the whole screen or a part of it and returns the result
over the same interface. With only a few volatile views, and explicitly excluding
very frequently changing parts like animations, this solution works well for
recognizing screens that have been recorded before.
To be able to recognize and reproduce deviations indicated by a failed checksum
check, a picture of the screen is taken at the same time. The test protocol
contains this proof, so a successful or failed test run is well documented and
reproducible.
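The interplay of checksum request, baseline comparison, and photographic proof
might look as follows; the serial command syntax and the helper names are invented
for this sketch, not taken from the project.

```python
# Sketch of the checksum-based screen validation: request a checksum of
# a screen region via the debug serial port, compare it with a recorded
# baseline, and take a webcam picture as proof on mismatch. The command
# syntax and helper names are illustrative assumptions.
import serial  # pyserial

def screen_matches(port: serial.Serial, region: str, baseline: str) -> bool:
    port.write(f"CHKSUM {region}\n".encode())   # hypothetical debug command
    actual = port.readline().decode().strip()
    if actual != baseline:
        capture_webcam_image(f"mismatch_{region}.png")  # proof for the protocol
        return False
    return True

def capture_webcam_image(filename: str) -> None:
    ...  # left to the concrete test stand, e.g., OpenCV's VideoCapture
```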
have no external interfaces are generally only testable by injecting errors using a
hardware or software exploit.
From a developer's point of view, the necessary safety net provided by test
coverage grows continuously during system test implementation. Still, since new
features are being added at the same time, software development only benefits after
system test coverage reaches a certain level.
For project management, external auditors, and other stakeholders, the visibility
of automated system tests is very high as soon as the infrastructure is available
and the first tests are running. Generally, a considerable return on investment is
only possible:
• for large and medium-sized projects that are continued for a longer period of
time after system test automation,
• for regulated projects with an obligation to prove the effectiveness of risk
mitigation techniques,
• and for projects that start a product line.
• one microcontroller board, emulating the setup of the final main board,
• one I/O board, emulating parts of the final periphery using slide switches and
potentiometers,
• and one test board, also called the "test box".
The latter facilitates the emulation of sensors that are not available during
earlier development stages and the evaluation of signals that will trigger actors
in the final setting. The test box also contains a web server and is programmable
and configurable via a Representational State Transfer (REST) interface [16]
(cf. the sketch below). This way, the test board executes even complex sequences
and macros by consecutively triggering several sensors, for example to respond with
valid data during the startup self-test of the system under development.
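Driving such a test box could look as follows; the endpoint paths, payloads, and
the address are invented for illustration, as the actual REST API is not documented
here.

```python
# Sketch of controlling the programmable test box via REST, e.g. to
# emulate sensors during the startup self-test. Endpoints, payloads,
# and the address are invented for illustration.
import requests

TESTBOX = "http://testbox.local"  # placeholder address

def emulate_sensor(channel: str, value: float) -> None:
    r = requests.put(f"{TESTBOX}/sensors/{channel}", json={"value": value})
    r.raise_for_status()

def run_macro(name: str) -> None:
    # a stored sequence that consecutively triggers several sensors
    requests.post(f"{TESTBOX}/macros/{name}/run").raise_for_status()

emulate_sensor("pressure-1", 1.013)
run_macro("startup-selftest-ok")
```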
The general-purpose programmable test box is deployed for developer tests,
integration tests, and system tests running on the target hardware. Two test stands
based on COTS [17] testing hardware (National Instruments) provide an additional
option, particularly for executing in-depth system tests. This equipment is
especially useful for monitoring and generating complex signals and waveforms. It
is also possible to inject faulty or disturbing signals directly into circuits and
communication lines in order to test error detection and recovery mechanisms. Exact
timing measurements enable the verification of safety-critical reactions like an
error stop and putting the system into a safe state.
Unit Testing. The horizontal decomposition of the software architecture into
modular units promotes unit testing at the component level - as opposed to the
implementation of file-based unit tests. This approach results in a well-defined
interface for each component. Clients of that interface perform operations solely
through it; thus the internal structure and data of a component are hidden
(information hiding). By replacing the real environment of a specific component
with generated mock objects for all the interfaces it uses, the component can be
tested standalone, without external influences (a sketch of this pattern is given
below).
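The pattern can be pictured with Python's unittest.mock standing in for the
project's generated mock objects (the project itself uses its own framework for
embedded targets); the component and its interfaces are invented for this sketch.

```python
# Analogue of component-level unit testing against mocked interfaces,
# with unittest.mock standing in for the generated mock objects of the
# project's framework. The component and its interfaces are invented.
from unittest import TestCase, mock, main

class Alarm:
    """Component under test; depends only on two interface objects."""
    def __init__(self, sensor_if, buzzer_if):
        self.sensor, self.buzzer = sensor_if, buzzer_if

    def poll(self) -> None:
        if self.sensor.read_pressure() > 2.0:   # internal logic stays hidden
            self.buzzer.on()

class AlarmTest(TestCase):
    def test_overpressure_triggers_buzzer(self):
        sensor = mock.Mock(**{"read_pressure.return_value": 2.5})
        buzzer = mock.Mock()
        Alarm(sensor, buzzer).poll()            # exercised via its interface only
        buzzer.on.assert_called_once()

if __name__ == "__main__":
    main()
```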
When implementing unit tests for any component of the system at hand, preferably
the operations of its external interfaces are called. Apart from that, the unit
test framework used in this project also supports exporting and invoking
component-internal methods. This way, the very high test coverage required by the
device's criticality (up to 95 per cent condition/decision coverage) is achievable.
By introducing a hardware abstraction layer, the vertical decomposition in terms of
layering confers independence of the hardware platform. Instead of developing
exclusively on the target, using an IDE with a cross-compiler and a (limited)
hardware debugger, logical parts and modules representing concerns of the device
domain can be implemented in a comfortable PC development environment with advanced
debugging facilities. This also holds for (unit) test development and coverage
pre-analysis. In a first step, those tests can run against a PC simulation. The
conclusive condition/decision coverage still has to be measured by running the
tests on the target hardware.
Integration tests are implemented based on the NUnit1 framework [18] and grouped
into smoke/sanity tests and short- and long-running tests. Like the unit tests
mentioned above, they also run on the build server, triggered by any source code
change. The long-running tests are part of the nightly build. The smoke tests are
used by every developer before committing any changes, to verify that no apparently
unrelated parts of the system are broken. The platform independence of the event
bus permits the identical tests to run against the simulation and the target
hardware.
The source code that performs serialization and deserialization of the various
events, as well as the routing table used to transmit events from one module to
another, is generated by a code generator. The corresponding textual, Xtext2-based
[19] domain-specific language is used to describe all events, their associated
payloads, the interfaces they belong to, and the modules that implement and/or use
those interfaces. The same DSL framework applies to the specification of deployment
scenarios and to the definition of finite state machines that describe the behavior
of module-layer elements. Parts of the development model formed by textual DSLs are
also used in the testing context. The library containing the application
programming interface that is deployed for sending and receiving events as part of
the integration test framework is also generated from the DSL event description.
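As a toy analogue of that generation step (the real project uses an Xtext grammar
and generates C code), the sketch below shows how a declarative event description
can yield a routing table and a dispatch function; all names are invented.

```python
# Toy analogue of the event-description DSL: events with typed payloads
# and a routing table derived from interface declarations. The real
# system generates C code from an Xtext grammar; all names are invented.
EVENTS = {
    "PumpStarted": [("pump_id", "u8"), ("rate_ml_h", "u16")],
    "AlarmRaised": [("code", "u16")],
}
ROUTING = {  # event -> modules consuming the interface it belongs to
    "PumpStarted": ["logger", "gui"],
    "AlarmRaised": ["gui", "safety_monitor"],
}

def route(event: str, payload: dict) -> None:
    """Stand-in for the generated dispatch over the event bus."""
    assert set(payload) == {name for name, _ in EVENTS[event]}
    for module in ROUTING[event]:
        print(f"{module} <- {event} {payload}")

route("AlarmRaised", {"code": 42})
```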
System Testing. The system testing framework, implemented specifically for the
critical embedded device at hand, benefits from experience and knowledge gained in
previous testing projects. As already introduced in the first case study, a
keyword-based testing DSL is used to formally specify test cases based on system
requirements and system use cases. The goal is to reach significant requirements
coverage by employing test automation, all the while reducing the number of
remaining manual tests.
All project-specific data - for instance all requirements, use cases, and test
cases, as well as traceability information for impact and coverage analysis - are
stored using commercial project management software (Microsoft Team Foundation
Server, TFS) [20]. Since the dictionary of the keyword-driven testing language
primarily contains expressions that can also be found in the formal test case
specification, it seems obvious to use the same repository for both the formal and
the machine-readable aspect of a test case. Another advantage of this approach is
the already existing linkage between requirement, use case, and test case.
Therefore, traceability from the concrete processable test script to the
corresponding use cases and requirements (and vice versa) is ensured, and so is
proving complete coverage of the system requirements.
The in-house developed testing framework [Fig. 1] automatically fetches the
machine-readable test cases from the project-specific repository. A parser then
processes the single test steps and generates a runnable test script, which can be
stored on the test server. The corresponding script interpreter finally executes
the test script against the native target, reads back the results, and stores them
for further analysis.
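A minimal interpreter for such keyword scripts might look as follows; the keywords,
the step format, and the target API are stand-ins of our own, not the project's
testing DSL.

```python
# Sketch of a keyword-driven script interpreter: map each keyword of
# the testing DSL to an action on the target. Keywords, step format,
# and the target API are invented stand-ins.
KEYWORD_HANDLERS = {
    "TOUCH":         lambda tgt, x, y: tgt.touch(int(x), int(y)),
    "EXPECT_SCREEN": lambda tgt, screen_id: tgt.verify_screen(screen_id),
    "WAIT":          lambda tgt, seconds: tgt.wait(float(seconds)),
}

def run_test_case(target, steps) -> None:
    """steps: lines such as 'TOUCH 120 340' fetched from the repository."""
    for line in steps:
        keyword, *args = line.split()
        KEYWORD_HANDLERS[keyword](target, *args)  # KeyError on unknown keyword

# run_test_case(device, ["TOUCH 120 340", "WAIT 1.5", "EXPECT_SCREEN start"])
```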
1 NUnit is an open-source (unit) testing framework for Microsoft .NET languages.
2 Xtext is an open-source framework for developing domain-specific languages.
A webcam permits taking a picture of the current screen contents of the device
under test at any time. As in the first case study, the device software is also
exploited to calculate a checksum of displayed screen areas.
Every test run is extensively documented and contains the test steps, the tracing
and debug output of the device under test, as well as the acquired checksums of
device screen areas, synchronized with the corresponding webcam screenshots. The
test server holds a history of all test runs and thus makes it possible to detect
volatile tests.
4 Related Work
Some aspects of the presented procedure for testing safety-critical medical devices
are also covered by previous work. Pajunen et al. [11] propose integrating a
model-based graphical user interface test generator with a keyword-driven test
automation framework. Gupta et al. [21] present a similar approach. Both papers
focus on automated testing of web applications, but the abstraction of high-level
GUI operations and user actions using a keyword-based language is much more general
than this. The extension of the keyword-driven concept to testing GUI-oriented
systems in the embedded device domain is demonstrated by the article at hand.
Peischl et al. [22] focus on tool integration in software testing, emphasizing the
necessity of integrating heterogeneous tools to create a test automation flow. This
concept (among others) is realized by the system testing framework presented above.
R. Graves [17] analyzes the effects of using COTS hardware for constructing a test
system, namely an avionics system testbed. His paper highlights the effective use
of COTS hardware for testing purposes. As illustrated above by the second case
study, this also applies to combining a general-purpose programmable test box based
on COTS components with commercial test hardware.
Using a REST interface for controlling and monitoring embedded devices – e.g.
the test box mentioned above - is accepted practice and suggested by several articles.
Lelli et al. [16] focus on using REST for managing instruments and devices shared on
the grid while evaluating performance and compatibility with conventional web ser-
vices. Jari Kleimola [23] highlights the simplicity of a RESTful interface, which results in small-footprint implementations well suited to low-power embedded devices.
On the other hand Guinard et al. [24] emphasize the ability of REST to integrate dif-
ferent devices for building up heterogeneous web mashups.
5 Conclusion
6 References
1. Nahas, M., Maaita, A.: Choosing Appropriate Programming Language to Implement Soft-
ware for Real-Time Resource-Constrained Embedded Systems. Embedded Systems - The-
ory and Design Methodology, Kiyofumi Tanaka (Ed.), ISBN: 978-953-51-0167-3, InTech
(2012)
2. Dutt S., Jamwal S., Devanand: Object Oriented Vs Procedural Programming in Embedded
Systems. In International Journal of Computer Science & Communication (IJCSC) (2010)
3. Chatzigeorgiou, A., Stephanides, G.: Evaluating Performance and Power of Object-
Oriented Vs. Procedural Programming in Embedded Processors. In Ada-Europe (2002)
1 Introduction
Industry and academia struggle to improve the safety of road vehicles. The innovations
often employ embedded systems. However, malfunctions in safety-critical embedded
systems may lead to new hazards (potential sources of harm). To reduce the risk of
such malfunctions, safety-critical embedded systems must be developed according to
a safety standard. Recently a standard for functional safety, IEC61508, was adapted to
the context of road vehicles, resulting in ISO26262 [1], which addresses development
of safety-critical electronic systems (Items). Development steps and processes are
specified according to five Automotive Safety Integrity Levels (ASILs), namely Qual-
ity Management (QM) and ASIL A-D. The ASIL for an Item is determined by con-
sidering the severity and probability for each hazard, as well as a driver’s ability to
compensate (controllability). For high ASIL items, the standard requires stringent
1 This work has resulted from FUSS, a subproject of DFEA2020, partially sponsored by the FFI council of VINNOVA (Swedish Research Agency).
2 The authors thank Erik Hesslow, an ISO26262 safety expert from Mecel AB (partner in the FUSS/DFEA2020 project), for reviewing the work and providing valuable comments.
measures for risk minimization. In contrast to IEC61508, besides many other aspects,
ISO26262 imposes qualification requirements on software tools used in the develop-
ment process, which also includes verification and validation tools. While tools exist,
they may not have been qualified or developed considering safety requirements. Con-
sequently, to be ISO26262-compliant, existing tools must be qualified, and in each
future version, re-qualified. Similar to IEC 61508, ISO26262 allows decomposition of high ASIL safety requirements into two requirements of the same or a lower ASIL, combined with redundancy, monitoring or another safety-enhancing concept. Since development to a lower ASIL typically requires less effort, decomposition is an attractive possibility.
However, the decomposition must be implemented by independent components and
affects the system architecture. To demonstrate fulfillment of the original require-
ments, there shall be traceability to and from the decomposed requirements.
As seen from above, ISO26262-compliant development includes specification of safety requirements (including determination of the ASIL), decomposition of safety re-
quirements, requirement traceability and testability, qualification of software tools,
verification and validation. This paper provides an example of how these steps can be
performed. The aim is to help minimize pitfalls in transition to ISO26262.
The next section reviews prior work. Section 3 presents requirements elicitation
and traceability. Section 4 discusses testability, leading up to Section 5 which is about
testing tool qualification. Section 6 presents a verification and validation strategy.
These concepts are illustrated in a case study in Section 7.
2 Prior work
Previous publications on ISO26262 include introductions to the standard [2] [3] [4]
[5], guides to successful application [4], experience reports [5], studies on the impact
on Item development [6], considerations regarding the development process and as-
sessment [3] [7] and adapting model-based development workflows to the standard
[8]. Dittel and Aryus [2] pointed out the need for support tools and methods. Hillen-
brand, et al. [6] discussed impact on the electric and electronic architecture, as well as
management of safety requirements. They found challenges: time-consuming activities and a lack of support tools and proven workflows [8].
Support tools for ISO26262-compliant development are considered in [9] [10] [11]
[12]. Makartetskiy, Pozza and Sisto [9] review two tools, Medini and Edona, for sys-
tem level modeling, handling the documents and checks against the standard regula-
tions. They stress that to bring a shared view of safety among companies, both a
standard and tools are required. Hillenbrand et al. [10] provide an FMEA tool with
features to support work with ISO26262. Schubotz [11] addresses the gap between the standard and companies’ internal development processes with a concept to plan, link, track and evaluate standard-required activities together with documentation. Palin,
Ward, Habli and Rivett [12] argue that a safety case consisting of collected work
products, as ISO26262 allows, lacks an explicit argumentation for safety. They pre-
sent a template for a proper safety case using goal structuring notation.
Qualification methods for software tools used in development are addressed in [13]
[14]. Conrad, Munier and Rauch [13] present a reference development workflow us-
ing model-based design, with checks in every development step that compare the requirements against the model and the test results against the model. In this way, tool confidence is achieved through a high probability of detecting tool errors. In [13] the tools are quali-
fied for such use that strictly follows the reference development workflow. The refer-
ence workflow approach to tool qualification is criticized by Hillebrand, et al. [14],
since it is tailored to specific tools and creates a dependency on the tool vendor.
While it is good practice to keep the same tool version throughout a development
project, various projects use different tool versions. This can be a source for confu-
sion. A “tool” may be a flow consisting of several tools and each tool in the tool flow
may have to undergo qualification. In [14] tool classification is addressed to avoid
unnecessary effort in tool qualification.
Robinson-Mallett and Heers [15] report that hardware-in-the-loop (HIL) test plat-
forms require special consideration, and the model-based approaches to tool qualifica-
tion do not apply. HIL test platforms provide a test environment that closely resem-
bles the intended operation environment of the Item and can be more complex than
the sum of electronic components in a car. Consequently, qualification of a HIL plat-
form is a challenge. In our previous work in [16] and [17], a testing tool qualification
method for HIL platforms is presented to reduce the qualification effort. The method
includes a monitor and fault injection. Our work in [16] and [17] focuses on development of a semi-automatic qualification process for the HIL tool, but does not consider traceability and testability of the Item requirements within the HIL tool.
The papers listed above have identified the need for a best practice and the need to
develop and qualify tools. Previous papers on ISO 26262 have not discussed require-
ments traceability of safety-critical systems in the context of decomposition, nor for
verification and validation. For non-safety-critical complex computer-based systems,
however, Arkley and Riddle [18] discuss requirement traceability, motivating the
need for a traceable development contract. Further, in the context of aerospace indus-
try, Andersen and Romanski [19] discuss development of safety-critical avionics
software, including verification, validation and assessment, and emphasize im-
portance of requirement traceability. None of the previous papers, however, has addressed propagation of safety requirements into the tool qualification. To address
tool qualification, requirements traceability, verification and validation in the context
of safety-critical systems and ISO 26262, this paper provides an example of how such
tasks can be performed, illustrated by a case study.
assessment step. Hazards are identified and categorized leading to an ASIL assign-
ment and a set of safety goals. A safety goal is an abstract, top-level safety require-
ment, which is proposed to overcome the hazardous situation that can arise from mal-
functioning of the Item, mitigating the risk that this situation brings. To fulfill the
safety goals, more detailed safety requirements are defined, each with a corresponding
ASIL. Thus, a functional safety concept is formed, consisting of all the safety re-
quirements and the steps taken to ensure safety. The safety requirements govern all
subsequent steps of the safety lifecycle. A typical problem in any large project is that
an individual requirement does not explain the reasoning behind its formulation and
so the importance of a safety requirement can be misunderstood. To clarify relations
between requirements and their reasons, requirement traceability is ensured by link-
ing each requirement to safety goals, corresponding tests, design decisions, etc.
[Figure: safety lifecycle overview with the phases Item definition, Product development, Production planning and Production.]
ments with “fulfills” or “fulfilled by” relations and the names of all other related re-
quirements. An example is given in Table 1.
[Figure: Item definition, Hazard analysis and risk assessment, Safety goals and Safety case.]
Table 1. Relations between TSR42 and other requirements (often represented by “links” be-
tween requirements in requirement management tools)
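The relations of Table 1 can be represented and checked mechanically along the following lines; the sketch assumes a simple dictionary encoding of the links, and the related requirement FSR7 is an invented name:

# Requirement links as in Table 1, with a consistency check that every
# "fulfills" relation has a matching "fulfilled by" counterpart.
fulfills = {"TSR42": {"FSR7"}}        # TSR42 fulfills FSR7 (FSR7 is hypothetical)
fulfilled_by = {"FSR7": {"TSR42"}}    # FSR7 is fulfilled by TSR42

def consistent(fulfills, fulfilled_by):
    return all(req in fulfilled_by.get(target, set())
               for req, targets in fulfills.items() for target in targets)

assert consistent(fulfills, fulfilled_by)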
4 Testability
The claim that design and implementation fulfill the requirements shall be verified. Verification is the task of determining the completeness and correct specification of the requirements, as well as the correctness of the implementation that is to fulfill them. Verification constitutes the right-hand side of Fig. 2 and is performed for all integration steps of the system design, including implementation in software and hardware. To be verified, the requirements should be testable. To ensure testability, a semi-formal representation that is compatible with a definition of testability is utilized. We present two representations to illustrate aspects of testability.
For the first representation, we define a requirement Ri as a logical expression Li:
<Object X> shall <Action Y> [applied to] <Subject Z>. The requirement is mapped
onto Object X which performs Action Y onto Subject Z. Testability of Ri is the property that the logical expression Li can be verified. We suggest that, to fulfill testability, the requirement has to consist of the object, the action and the subject, and that the object, the action and the subject must be identifiable within and present in the system. When these conditions hold, the requirement can be verified, and is testable.
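A minimal sketch of such a pattern check, assuming a plain-text grammar for '<Object> shall <Action> <Subject>' (the concrete tooling is not prescribed by this paper):

import re

# Requirement pattern: <Object X> shall <Action Y> [applied to] <Subject Z>.
PATTERN = re.compile(r"^(?P<object>\S+) shall (?P<action>.+?) (?P<subject>\w+)\.?$")

def testable(requirement, system_elements):
    # testable if object, action and subject can be identified, and object and
    # subject are present in the system
    m = PATTERN.match(requirement)
    return (m is not None
            and m.group("object") in system_elements
            and m.group("subject") in system_elements)

system = {"MCU", "watchdog"}
print(testable("MCU shall include a logical watchdog.", system))  # True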
Consider the following “good” and “bad” examples of safety requirements, some that fulfill, and some that do not fulfill the requirement pattern and, thus, shall be changed:
R1: We shall ensure presence of error correction codes (ECC) in the system for
correction of single-event upsets (SEUs).
R2: The MCU (microcontroller) shall include a logical watchdog.
In R1, neither Object nor Subject is clear, only Action is present, i.e., ensuring
presence of ECC codes, and the requirement is not testable. In R2, the elements are
clearly identifiable and physically present in the system. Thus, this requirement is
testable. However, this requirement will have to be detailed further to identify watch-
dog properties, relevant MCU software and monitoring strategy.
For the second representation of requirements, consider that although object, action and subject are obligatory attributes of requirements, it is often important to identify conditions under which the requirements are applicable. R3 is an example requirement that is designed to prevent over-heating of a component.
R3: The MCU shall not enable a power supply to the central CPU if the ambient temperature is above 95 °C.
In R3 there is an example of another important property of requirements: the presence of measurable quantitative parameters. Such parameters define the operational intervals and the applicability of requirements, i.e., as in R3, “above 95 °C”. However, R3 is not easily refutable: the test necessary to check that the requirement is fulfilled would be unbounded. Therefore, it is good practice either to formulate requirements such that they are easily refutable or to give a set of appropriate measurement conditions for the test, e.g., checking the behavior at a finite set of ambient temperatures just below and above 95 °C.
Requirement elicitation with respect to requirement testability and how it leads to
testing tool qualification can be shown in several steps (see Fig. 2 for work products):
Define safety goals: Safety goals cannot be tested since they are usually very ab-
stract. Note, however, that safety goals and functional safety requirements shall be
validated by studying behavior of the whole system, to ensure that the correct system
has been developed and potentially dangerous behavior successfully avoided.
Define safety requirements: Many functional safety requirements cannot be verified due to lack of technical details. At this step, however, it is usually clear which testing tools will be needed. Thus, selection and classification of testing tools can be done, resulting in input to SW tool criteria evaluation reports.
Refine safety requirements: By considering system properties, decomposition of requirements is performed. Requirements are also evaluated on their feasibility by
performing requirements reviews, design studies and testability assessments. This will
result in a verification strategy, part of which will be adaptation of the test tool.
Detailed safety requirements: Verification is possible only for technical safety
requirements, which are the most detailed safety requirements. In this step, it is neces-
sary to derive test cases and clearly demonstrate requirement testability. Several itera-
tions of requirement elicitation may be needed. Testing tool qualification is performed, resulting in input to SW tool qualification reports.
Implementation: Here, verification activities are fully executed on implementa-
tion releases with testing tools providing test reports for the respective requirements.
Safety case: Test cases, test reports and tool qualification reports will provide in-
puts to the safety case, for demonstration of fulfillment of the requirements.
tion of the testing tool should complement each other to ensure that the risk of test
escapes in the safety-critical component is minimized. Note also that if the tools are used for testing of decomposed requirements, i.e., ASIL B(D), the ASIL level of independence, in this case ASIL D, shall often be considered as the ASIL level in qualification of these software tools.
The results of qualification of a testing tool and verification against safety require-
ments of the safety-critical automotive component will be reflected in a work product
called the safety case (see Fig. 2), which will include arguments that safety is
achieved using qualification and verification work products, including testing tool
analysis report, testing tool qualification report, integration and verification plan, test
cases (for the respective requirements) and respective test reports.
It should be noted that verification includes more than testing against requirements.
A complete verification process includes activities such as fault injection experiments, tests in the operational environment of the Item and EMC tests.
7 Case Study
In this section, we provide an example where we apply the concepts discussed in the
previous sections, in particular decomposition, traceability and testability of require-
ments, as well as testing tool qualification and fault injection based verification.
while driving at high speed on a curvy road, resulting in the highest probability, E4.
The highest severity, S3, applies, since the result may be that the car departs from the
road at high speed with risk of critical injury. The controllability is modest, C2, since
an obscured view is comparable to loss of headlights at night, which is categorized as
C2 in [1] (Part 3, Table B-4). Consequently, the hazard corresponds to ASIL C, the
second highest ASIL ([1] Part 3, Table 4).
[Fig. 3 (figure): overview of the decomposition. ECU1 handles Washer Liquid Spray Activate/Enable and the Windshield Wiper Angle; ECU2 handles Windshield Wiper Activate/Enable and can Override the spray enable signal.]
We formulate a safety goal SG1: “A malfunction should not obscure the driver’s view with washer liquid”. For SG1, we formulate two functional safety requirements,
FSR1 and FSR2, to enforce a safe state “washer liquid spray disabled” upon control-
ler failure. The two requirements correspond to the two possible failure modes.
FSR1: The controller should not spray washer liquid if the windshield wiper fails.
FSR2: The controller should not spray washer liquid for an extended duration.
We found a decomposition to fulfill both FSR1 and FSR2. An overview is given in
Fig. 3. The ECUs perform mutual checking of each other’s operation as is described
by the technical safety requirements TSR1.1 and TSR2.1.
TSR1.1: ECU1 shall disable the washer liquid spray if the windshield wiper angle
does not change.
TSR2.1: ECU2 shall override the washer liquid spray if the washer liquid spray is
enabled for >1s.
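Hedged sketches of the two monitoring rules follow; the signal names and sampling scheme are assumptions for illustration, not the implemented design:

# TSR2.1 sketch: ECU2 overrides the washer liquid spray if the enable signal
# from ECU1 stays active for more than one second.
class SprayOverrideMonitor:
    LIMIT_S = 1.0  # from TSR2.1

    def __init__(self):
        self._enabled_since = None

    def step(self, spray_enable, now_s):
        if not spray_enable:
            self._enabled_since = None
            return False                                   # no override needed
        if self._enabled_since is None:
            self._enabled_since = now_s
        return now_s - self._enabled_since > self.LIMIT_S  # assert override

# TSR1.1 sketch: ECU1 disables the spray if the wiper angle does not change.
def wiper_stuck(angle_samples, tolerance=0.0):
    return max(angle_samples) - min(angle_samples) <= tolerance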
ISO26262 allows decomposition from an ASIL C requirement to two requirements
with ASIL A(C) and ASIL B(C) respectively, if they are independent, e.g. correspond
to independent ECUs. ECU1 is controlling the washer liquid spray based on the driv-
er’s activation, while monitoring the windshield wiper angle. Thus ECU1 is to fulfill
TSR1.1. ECU2 fulfills TSR2.1 and is responsible for controlling the windshield wiper
based on the driver’s activation and sensor input of the windshield wiper angle. ECU2
also monitors the washer liquid spray enable signal from ECU1 such that it can over-
ride that signal if necessary. We choose to assign ASIL B(C) to ECU2 and ASIL
A(C) to ECU1 since ECU2 controls the windshield wipers. A malfunction of the
windshield wipers can potentially lead to ASIL B hazards. Take, for example, a sce-
nario in which the windshield is suddenly splashed with dirt which has been stirred up
by another vehicle on a wet and dirty road. Visibility is suddenly reduced. Malfunc-
tion of the wipers in this situation will not allow cleaning of the windshield. Although
the situation is fairly controllable (C2), the probability of this situation is second
highest (E3), and the vehicle may drive into oncoming traffic, leading to high severity (S3) if the driver loses control. Thus, ASIL B should be assigned ([1] Part 3, Table 4).
The traceability of these requirements across the decomposition is implemented in
Table 2 as described in Section 3. Testability is achieved by representing the technical
safety requirements according to a semi-formal pattern (see Section 4) and by using
quantitatively measurable parameters. A testing tool is required, as is discussed next.
tool and thereby measure the monitor’s ability to detect unexpected behavior in the
testing tool. Through these experiments we identify three cases. The first case corre-
sponds to discovering a “bug” in the testing tool. In this case, the decision about
changing the monitor is deferred until the “bug” is corrected. In the second case, it is
discovered that the monitor is insufficient and requires a change and a change man-
agement to ASIL D(D) is performed, followed by further fault injection experiments.
In this case, the fault injection experiments must be adjusted. In [16] we describe a
semi-automatic procedure for adjusting the fault injection experiments. In the third
case, the monitor is able to detect all injected faults and no change to the monitor is
required. The relative frequency of the three cases depends on the type of testing tool
changes. We expect that the third case, which requires no changes to the monitor, will
be common enough to motivate the decomposition by its reduction in effort.
In the case study, we have seen two different applications of requirement decomposi-
tion, explicit requirement traceability and thorough management of requirement testa-
bility including testing tool qualification. Furthermore, we believe that the fault injec-
tion experiments applied to verify the testing tool monitor can be adapted also to other
software components and tools as an appropriate and time-saving verification method.
8 Conclusion
This paper addresses development of safety-critical embedded systems for use in road
vehicles according to ISO26262. Since the standard is new and introduces develop-
ment steps such as requirement decomposition and software tool qualification, we
have argued that this can lead to many manual steps and consequential pitfalls. For
example, software tool qualification can become a bottleneck in the development
process. To mitigate such pitfalls we have reviewed the important concepts of requirement decomposition, traceability, testability, verification and validation. We have shown the application of these concepts in a case study involving two requirement decompositions, testing tool qualification using a monitor, and fault injection experiments.
The chosen approach will increase efficiency of the development process of Items
with high ASIL levels, avoiding unnecessary bottlenecks and potential pitfalls that
might lead to hard-to-solve problems and compromise safety.
References
4 Otto von Guericke Universität Magdeburg, Lehrstuhl Integrierte Automation, Universitätsplatz 2, 39104 Magdeburg, Germany
1 [email protected], 2 [email protected], 3 [email protected], 4 [email protected]
Abstract. This paper suggests methods, and a tool chain for model based
specification, verification, and test generation for a safety fieldbus profile. The
basis of this tool chain is the use of a UML-profile as a specification notation,
a simple high level Petri net model called “Safe Petri Net with Attributes”
(SPENAT) and analysis methods found in Petri net theory. The developed
UML-profile contains UML class diagrams and UML state machines for speci-
fication modeling. Verification and developed test generation methods are
shown to be applicable after mapping the specification model onto SPENAT.
The practical use of this tool chain is exemplarily demonstrated for a safety
fieldbus profile.
1 Introduction
More and more safety-relevant applications are being handled within industrial automation. The IEC 61508 standard describes requirements for functional safety, and microprocessor-based device solutions for safety-relevant applications are faced with this standard. This forces device manufacturers to engage third-party partners such as TÜV and IFA, which verify the development process and the development result. This results in a resource overhead for the device manufacturer. Therefore, these manufacturers are looking for methods and tools to automate some activities in order to decrease the overhead.
The paradigm of model based system development (see e.g. [1]) is generally accepted for handling the increasing complexity of system and device development.
One usage of model based techniques is within the development of safety relevant
fieldbus profiles in the industrial communication area. A fieldbus profile specifies the
common use of communication services and interacting variables of selected device
classes. These profiles serve as a basis for automation device development and are
subject to certification tests in the framework of the related communication market
organizations - the so-called user organizations. Devices which have successfully
passed the tests can work interoperably if the coverage of test cases meets the neces-
sary requirements. Additionally, the profile specification is part of a general quality
process both within the user organization as well as the device manufacturer.
Model based specifications resulting from profile development processes support several quality assurance activities. One activity is the verification of syntactic and semantic correctness with regard to the specified requirements. Another is the generation of test cases with high specification coverage based on the profile specification model.
To support formal verification and test generation from model based specifications
a simple and intuitively understandable new Petri net model (“Safe Petri Net with
Attributes” - SPENAT) was developed based on safe place transition nets (PT nets).
Thanks to the simplicity of SPENAT a wide spectrum of existing and future modeling
notations should be supported and usable for verification and test generation. The
mapping of a UML State Machine to an ESPTN, the predecessor model of the
SPENAT, is described in detail in [6].
In this paper methods for model based specification, verification, and test generation are introduced. All methods are implemented in a tool chain. The practical usage of this tool chain will be demonstrated on an existing safety fieldbus profile.
This paper is structured as follows. Section 2 addresses fieldbus profiles and their
model based specification and section 3 introduces SPENAT, and discusses its verifi-
cation and test generation. The methods introduced are implemented on a tool chain
in section 4 and a case study is carried out for a PROFIsafe PA profile in section 5.
Finally, section 6 concludes the paper and gives an outlook of future research.
Device profiles usually provide variables and/or application functions with related
input and output variables, parameters, and commands. In some cases, functions can
be aggregated to function blocks. The variables, parameters and commands (called
variables within this paper) are data to be communicated. The variables are dedicated
to modules and sub-modules which provide addressing and data type information for
the related communication services.
UML [14] nowadays is well established in the domain of embedded systems. Au-
tomation devices are seen as such systems. Class diagrams and state machine dia-
grams are the only UML languages which are used in the context of device and pro-
file models. Class diagrams are used to describe the device structure which consists of
3.1 Motivation
The SPENAT notation is built upon safe place transition nets (p/t nets) [10] and concepts of high level Petri nets [3], [4], [5], [10]. With SPENAT it is possible to use external and parameterized signals/events as transition triggers (in contrast to STG [11], SIPN [15], IOPT [16]). Thanks to this feature it is much easier to model the required behavior of an open and reactive system with a Petri net. Also, the mapping of existing models onto a Petri net should be possible in an easy and intuitive way.
An example of the declaration of a Petri net reacting to external parameterized signals is presented in Fig. 1. This Petri net has two transitions, where transition t2 can only fire after transition t1, and the guard of t2 depends implicitly on the value of the parameter x of the trigger event of t1.
If transition t2 of the Petri net of Fig. 2 fires, it is clear that the parameter x of the external event ev1(int x) must be 1. This value is a result of the guard of t1 (msg.x<2), the effect of t1 (y=msg.x), and the guard of t2 (y>0). The keyword msg is a reference to the respective trigger event of the transition. In this case the value 1 is the only value of parameter x of the trigger event ev1(int x) for which t2 can fire afterwards: for any value x>=2, transition t1 cannot fire at all (see guard msg.x<2), and for any value x<=0 the SPENAT of Fig. 1 would be in a deadlock after t1 has fired, since the guard of t2 (y>0) could never be satisfied.
[Fig. 1 and Fig. 2 (figures): SPENAT examples with the declarations 'event ev1(int x); event ev2(); int y=0;', control places p1, p2, p3, and transition inscriptions 'ev1(x)[msg.x<2]/y=msg.x+1;' (t1) and 'ev2()[y>0]/;' (t2).]
Based on Fig. 2 the essential properties of SPENAT can easily be identified. The
connection of a data place to a transition is always implemented by a loop, so every
data place which is a predecessor of a transition is always a successor of the same
transition. Whether a data place is connected (by a loop) to a transition is determined
by the transition inscription (guard and effect). With this property and the fact that
data places are part of the initial marking, data places are always marked. This re-
striction allows a more simplified analysis of SPENAT. Also, the declaration of the
data places is not mandatory for the graphical declaration of SPENAT (see Fig. 2).
The syntax and semantics of the inscription of a SPENAT transition are essentially equivalent (see Fig. 2) to the syntax and semantics of a transition of a UML State Machine (USM [14]). However, a transition of a SPENAT can have more than one predecessor, which is not possible for a USM transition. A SPENAT transition fires if all
predecessors are marked, if its (external) event (its trigger event) appears, and if its
guard is evaluated as ‘true’. If a transition fires, all specified actions associated with
the effect of the transition are executed.
[Fig. 3 (figure): a SPENAT before and after the firing of transition t1, with the declaration 'event ev1(int x);' and control places p1, p2.]
In Fig. 3 a SPENAT with a data place y is presented before and after the firing of a transition (t1). The places p1 and p2 are control places. The initial marking M0 of this SPENAT is characterized by the set M0={(p1,•),(y,0)}. The marking M1={(p2,•),(y,2)} is induced by the firing of t1 based on M0. A marking set M contains all current colored tokens. Here, a colored token represents a pair of place and value (color, see [3]). The element • is used as the type and value of control places. More than one colored token cannot be used in the current marking set for one place, so a SPENAT is safe.
If a transition of SPENAT fires, all colored tokens representing a predecessor of
the transition will be removed from the current marking set and for each successor a
new colored token is produced and added to the current marking set.
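The firing rule just described can be stated compactly; the sketch below assumes that markings are represented as sets of (place, value) pairs, as in the example of Fig. 3:

# Sketch of the SPENAT firing rule: remove the colored tokens of all
# predecessors, then add one new colored token per successor.
def fire(marking, pre, post):
    assert pre <= marking, "transition not enabled: a predecessor is unmarked"
    return (marking - pre) | post

M0 = {("p1", "•"), ("y", 0)}
# t1 consumes the control token on p1 and the data token (y, 0) (data places
# are connected by a loop) and produces a token on p2 plus the new data token.
M1 = fire(M0, pre={("p1", "•"), ("y", 0)}, post={("p2", "•"), ("y", 2)})
print(M1)  # {('p2', '•'), ('y', 2)}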
the construction of a minimal prefix. In [11] this algorithm was parallelized and fur-
ther optimized. Also, in [11] the dependence on general place transition nets was re-
moved. Thus, the algorithm for prefix construction is now applicable to high-level Petri nets as well.
Values of attributes of a SPENAT can depend on values of external event parameters. In order to represent a marking of a SPENAT during the prefix calculation algorithm, the classic (value based) marking representations of (colored) Petri nets are not suitable. However, the marking of the SPENAT attributes can be expressed by a set of con-
able. However, the marking of SPENAT attributes can be expressed by a set of con-
straints. A marking of SPENAT can then be represented by a marking set for the con-
trol places and by the identified constraints for the data places. With this marking
representation and the results of [11] the known algorithms for the prefix creation can
also be used for the prefix construction for SPENAT. However, the method for the
extension of the prefix with new events has to be adapted because of the use of con-
straints as marking representation for a data place. Now a new event can only be add-
ed to the prefix if the identified constraints are satisfiable. Also, the identification of
the cutoff events has to be adapted. Now it is necessary to check if two events pro-
duce the same marking of control places as well as the same constraints on a semantic
level.
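A minimal sketch of the satisfiability check used when extending the prefix, brute-forcing a small integer domain in place of a real constraint solver (an assumption made purely for illustration):

from itertools import product

# A new event may only be added to the prefix if the accumulated constraints
# over the event parameters are satisfiable.
def satisfiable(constraints, variables, domain=range(-16, 17)):
    for values in product(domain, repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(c(env) for c in constraints):
            return True
    return False

# constraints collected for event e3 in Fig. 4: e1.x < 10 together with e1.x > 5
cs = [lambda env: env["e1.x"] < 10, lambda env: env["e1.x"] > 5]
print(satisfiable(cs, ["e1.x"]))  # True (e.g. e1.x = 6)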
In Fig. 4 a SPENAT with its prefix is presented. The events of the prefix are inscribed with the constraints for the data places. The parameters of the trigger events are associated with the respective prefix events for a better overview. The constraint
e1.x<10 of the event e3 is valid for the parameter x of the trigger event ev1(int x) of
transition t1 represented by e1 in this process. The prefix contains the cutoff events e2,
e7, and e8. All cutoff events correspond to the marking of event e1. Furthermore, a
deadlock can be identified within the prefix seen in Fig. 4. This deadlock is a result of
the execution sequence t1t3t5 represented by the local configuration {e0,e1,e3,e5} as-
signed to the prefix event e5 (not added here).
In general, the complete prefix of the unfolding of a Petri net is a compact representation of the state space and is well suited for the verification of interesting properties like deadlocks and reachability, and the satisfiability of LTL formulas by methods
of model checking. In [8] and [11] methods for formal verification based on prefix are
presented. These methods are also applicable to the verification of SPENAT.
[Fig. 4 (figure): a SPENAT (events ev1(int x), ev2(int x), ev3(), ev4(int x), ev5(); attributes int y1=0, y2=0; transition inscriptions such as 'ev3()[y1>5]/', 'ev4(int x)[x>5 and x<10]/y1=x;' and 'ev5()[y2<3]') together with its complete prefix starting from e0(ε); the prefix events are inscribed with constraints and assignments such as e1.x<10, e1.x>5, y1=e1.x, y2=e1.x.]
The identified test cases specify a (concurrent) message exchange between the test
object (System Under Test – SUT) and the tester or test system. This is an abstract
sequence-based description of the stimuli and the expected responses of the test ob-
ject. This abstract representation of the test cases must be transformed in an under-
standable and executable format for the test system. Furthermore, the realized level of
abstraction during the modeling of the required test object behavior must be respected
in order to get automatically executable test specifications as a result of the test gen-
eration process.
Data types, events, and/or signals, modelled within the profile model at an abstract
level, have to be mapped to usable structures of the target test notation of the used test
tool. Therefore, rules are necessary in order to automate this test formatting. For the
formatting in the standardised test notation TTCN-3 [12] and in a proprietary test
notation based on C#, suitable rules were developed and are implemented.
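A sketch of one such formatting rule for the C#-based notation is given below; the CALL/MSG.TEXT layout follows Fig. 7, while the generator function itself is hypothetical:

# Format an abstract test case (a sequence of (event, parameter, value) steps)
# into the C#-based test notation of Fig. 7.
def format_test_case(name, steps):
    lines = [f'MSG.TEXT("{name}");', "{", '  CALL("ev_Init");']
    for event, param, value in steps:
        lines.append(f'  CALL("{event}",{param}="{value}");')
    lines.append("}")
    return "\n".join(lines)

steps = [("ev_InspectionWriteRequest", "value", "S2"),
         ("ev_WriteResponse", "response", "OK")]
print(format_test_case("tc_1", steps))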
4 Tool chain
The previously described methods are implemented with established tools and prototypical implementations (see Fig. 5) in a tool chain. For the modeling, the established
UML modeling tool Rhapsody by IBM is used. The modeling of the profile specifica-
tion is done using Rhapsody and our definition of the UML profile.
With the available API of Rhapsody, the model of the profile specification can be extracted. Thereby, two different test generators can operate on this profile model. The test generator for dynamic specification elements searches for UML state ma-
chines and generates, for these models of the expected behavior, suitable test cases
with Petri net techniques (see left branch of Fig. 5) as described in this paper. The test
generator for the parameter testing looks for special stereotypes within the profile
model indicating the parameter classes (see right branch of Fig. 5). This test generator
is not discussed in this paper.
The results of the two test generators are abstract test cases at the same level as the specification model. These abstract test cases have to be transformed into suitable test
notations in order to automatically execute the tests with suitable test tools. This
transformation is implemented for TTCN-3 and a C# based test notation for the test tool isDEET. The test execution and test verdict identification can be realized with isDEET.
The UML-profile previously described was used within a project to describe the
structure and the required behavior based on the specification of the PROFIsafe PA
profile (see extracts in Fig. 6). The required behavior was modeled with a UML State
Machine. Then, model verification was done in a first step to guarantee that the model
was free of errors. In particular, safety critical properties like deadlock freeness and
reachability of all states and all transitions were checked.
Based on the verified model of the specification of the PROFIsafe PA profile, a
test suite with high coverage (coverage criteria “round trip path”) was generated. The
generated test cases were transformed from the abstract sequence based format to the
input format of the used test tool (based on C#). Variables necessary to influence the
state machine and get the state machine status are additionally used for this transfor-
mation. For this transformation some rules had to be implemented by an adapter in
order to handle the actual communication between the test tool and the test object. A
SIEMENS device (“SITRANS P”) was successfully used as a test object. Except for
the implementation of the transformation rules by an adapter for the test cases of the
test tool, all activities were executed automatically.
Fig. 6. (a) Three selected parameters and (b) The state machine of the PROFIsafe profile [13]
A tool developed at ifak Magdeburg is used to verify the specification model and to generate test cases from this verified specification model. It allows transferring a UML state machine into a SPENAT, and it creates the complete prefix of the unfolding of this SPENAT. Based on this prefix, the verification (deadlock and reachability analysis) and the test generation are done, and different structural coverage criteria can be chosen. For the highest possible coverage criterion, “round trip path”, 52 test cases are generated.
[Fig. 7 (figure): on the left, the abstract test case as a sequence diagram between MTC and SUT (ev_InspectionWriteRequest(value=S2) answered by ev_WriteResponse(response=OK), then value=S3 / OK, value=S4 / OK, and finally value=S3 / INVALID_RANGE); on the right, the formatted C# test script:]
//tc_1
MSG.TEXT("tc_1");
{
  CALL("ev_Init");
  CALL("ev_InspectionWriteRequest",value="S2");
  CALL("ev_WriteResponse",response="OK");
  CALL("ev_InspectionWriteRequest",value="S3");
  CALL("ev_WriteResponse",response="OK");
  CALL("ev_InspectionWriteRequest",value="S4");
  CALL("ev_WriteResponse",response="OK");
  CALL("ev_InspectionWriteRequest",value="S3");
  CALL("ev_WriteResponse",response="INVALID_RANGE");
}
Fig. 7. Abstract and formatted test case for state machine testing
Fig. 7 shows an example of a test case as a sequence diagram. The test case will run
through all four states of the PROFIsafe state machine starting in state “S1”. State
“S2” is initiated by a write request on the inspection parameter and confirmed with a
positive response. The transitions to “S3” and then to “S4” take place in the same manner. Finally, an attempt to execute a transition from state “S4” to state “S3” is made. According to the state machine, this is an illegal transition and a negative response is returned by the SUT.
All generated abstract test cases are transformed into an executable test notation and
afterwards run as a combined test suite on the test system. The test tool creates a re-
port of the success or failure of the executed test cases. The testing of parameter and
state machine test cases for the PROFIsafe profile was successful. The result of the
test suite confirms, on the one hand, the correctness of the device with regard to the profile: functionality and behavior comply with the profile and its requirements. On the other hand, a successful validation of the method for test case generation and of the transformation into the test notation is shown with the established test device used.
6 Conclusion
In this paper an approach to model based specification, verification, and test generation for safety fieldbus profiles was introduced. The essential methods and tools
ranging from model based fieldbus profile specification to the test execution are de-
scribed. Here, UML was used for the fieldbus profile specification, and Petri net
methods were employed for the model verification and test generation. The developed
Petri net model “Safe Petri Net with Attributes” (SPENAT) was used for the mapping
of the UML model and the application of the Petri net methods. The practical use of
these methods was demonstrated with an existing safety relevant UML profile
(PROFIsafe PA profile) for fieldbus devices in the PROFIBUS and PROFINET do-
main.
In the future, further existing methods of formal verification from the Petri net area should be used to verify SPENAT models. Especially model checking algorithms
should be applied for the SPENAT analysis. Additionally, the method for test case
generation should be more configurable. One goal is to have more possibilities for
controlling the test generation process. The description of distributed (cooperative)
systems with communicating SPENAT components is an ongoing future research
aspect. The verification and generation of tests based on domain specific models will gain increasing importance in the future for distributed and cooperative systems.
7 References
1. Schätz, B.; Pretschner, A.; Huber, F.; Philipps, J.: Model-Based Development of Embed-
ded Systems. Advances in Object-Oriented Information Systems - OOIS. Springer Verlag,
2002.
2. Frenzel, R.; Wollschlaeger, M.; Hadlich, T.; Diedrich, C.: Tool support for the develop-
ment of IEC 62390 compliant fieldbus profiles, Emerging Technologies and Factory Au-
tomation (ETFA), IEEE Conference, 2010
3. Jensen, K.: Coloured Petri Nets: Modeling and Validation of Concurrent Systems, Spring-
er-Verlag, Berlin, 2009
4. Best, E.; Fleischhack, H.; Fraczak, W.; Hopkins, R.; Klaudel, H.; Pelz, E.: A Class of
Composable High Level Petri Nets. ATPN'95. Springer-Verlag, 1995
5. ISO/IEC 15909-1: Software and system engineering – High-level Petri nets – Part 1: Concepts, definitions and graphical notation. 2004
6. Krause, J.; Herrmann, A.; Diedrich, Ch.: Test case generation from formal system specifi-
cations based on UML State Machines. atp - International 01/2008, Oldenbourg-Verlag,
2008.
7. Esparza, J.; Römer, S.; Vogler, W.: An Improvement of McMillan’s unfolding algorithm.
Formal Methods in Systems Design 20. Springer-Verlag, 2002
8. Heljanko, K.: Combining Symbolic and Partial Order Methods for Model Checking 1-safe
Petri Nets, PhD thesis. Helsinki, Helsinki University of Technology, 2002
9. McMillan, K. L.: Using Unfoldings to avoid the State Explosion Problem in the Verifica-
tion of Asynchronous Circuits. Proceedings of the 4th International Conference on Com-
puter–Aided Verification (CAV) ’92. Volume 663 of LNCS. Montreal, Springer–
Verlag, 1992
10. Girault, C.; Valk, R.: Petri Nets for Systems Engineering: A Guide to Modelling, Verifica-
tion, and Applications, Berlin, Heidelberg, New York, 2003. Springer-Verlag
11. Khomenko, V: Model Checking Based on Prefixes of Petri Net Unfoldings. University of
Newcastle, 2003
12. ETSI: Testing and Test Control Notation, 2009, available at http://www.ttcn3.org/
13. PNO, 2009. PROFIBUS Specification: PROFIsafe for PA Devices. V1.01.
14. Object Management Group: Unified Modeling Language 2.2 Superstructure Specification, 2009. http://www.uml.org/. Accessed 08.01.2010
15. Frey, G.: Design and formal Analysis of Petri Net based Logic Controllers, Dissertation,
Aachen, Shaker Verlag, 2002
16. Gomes, L.; Barros, J.; Costa, A.; Nunes, R.: The Input-Output Place-Transition Petri Net
Class and Associated Tools, Proceedings of the 5th IEEE International Conference on In-
dustrial Informatics (INDIN’07), Vienna, Austria, 2007
Quantification of Priority-OR Gates in Temporal Fault
Trees
{e.e.edifor@2007.,martin.walker@,n.a.gordon@}hull.ac.uk
Abstract. Fault Tree Analysis has been used in reliability engineering for many
decades and has seen various modifications to enable it to analyse fault trees
with dynamic and temporal gates so it can incorporate sequential failure in its
analysis. Pandora is a technique that analyses fault trees logically with three
temporal gates (PAND, SAND, POR) in addition to Boolean gates. However, it
needs extending so it can probabilistically analyse fault trees. In this paper, we
present three techniques to probabilistically analyse one of its temporal gates –
specifically the Priority-OR (POR) gate. We employ Monte Carlo simulation,
Markov analysis and Pandora’s own logical analysis in this solution. These
techniques are evaluated and applied to a case study. All three techniques are
shown to give essentially the same results.
Keywords: Safety, Fault Trees, Dynamic Fault Trees, Markov Chains, Monte
Carlo, Pandora
1 Introduction
Emerging complexity in modern technological systems brings with it new risks and
hazards. Most of today's systems will feature multiple modes of operation and many
offer some level of robustness built into the design. Nowhere is this truer than in the
field of safety-critical systems: those with the most serious consequences should they
fail. Frequently, such systems will make use of fault tolerance strategies with redun-
dant components, parallel architectures, and the ability to fall back to a degraded state
of operation without failing completely. However, such complexity also poses new
challenges for systems analysts, who need to understand how such systems behave
and estimate how reliable and safe they really are.
Fault Tree Analysis (FTA) is a classic technique for safety and reliability engineer-
ing that has stood the test of time – constantly being modified to meet changing re-
quirements. Fault trees (FTs) are graphical models based on Boolean logic which
depict how hazards of a system can arise from the combinations of basic failure
events in the system as well as any other contributing factors [1]. FTA begins with an
undesired event known as the 'top event', which is typically a system failure. The
analysis then decomposes this failure first into logical combinations of basic events,
dora's definitions. A mathematical model for multiple POR gates is derived from first
principles. All techniques are applied to a case study and the results are discussed.
1.1 Notation
Pandora symbols in order of precedence (lowest first):
+ logical OR
. logical AND
| Priority-OR
< Priority-AND
& Simultaneous-AND
¬ logical NOT
Other notation:
⊲ Non-Inclusive Before
λi failure rate of event i
t time
[Equations (1)-(7): lost in this copy.]
[Equations (8)-(11): lost in this copy; (11) gives the probability of the POR gate derived from Pandora's definitions.]
Thus by calculating the probabilities of these two cases, we can determine the
probability of the POR gate as a whole by using the principle of inclusion-exclusion
(and where (1-b) is the probability of event b not occurring, i.e., NOT(b)):
[Equations (13)-(18): lost in this copy; (16) gives the probability of the POR gate derived from Markov analysis.]
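Since the numbered equations are lost in this copy, the two-case calculation can be reconstructed as follows, assuming independent events a and b with constant failure rates λa and λb (a reconstruction; the original numbering is not preserved). The first case is that a occurs by t while b never occurs within the lifetime, the second that both occur with a first; their sum gives the POR probability:

\begin{align*}
\Pr\{a \le t,\ b > t\} &= \bigl(1 - e^{-\lambda_a t}\bigr)\, e^{-\lambda_b t} \\
\Pr\{a < b \le t\} &= \bigl(1 - e^{-\lambda_b t}\bigr) - \tfrac{\lambda_b}{\lambda_a + \lambda_b}\bigl(1 - e^{-(\lambda_a + \lambda_b)t}\bigr) \\
\Pr\{a\,|\,b\}(t) &= \tfrac{\lambda_a}{\lambda_a + \lambda_b}\bigl(1 - e^{-(\lambda_a + \lambda_b)t}\bigr)
\end{align*}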
3. Keep count if the POR expression [lost in this copy] is TRUE.
4. The above steps are repeated for a specified number of times (called trials). The probability is finally evaluated by dividing the counts kept in step 3 by the total number
of trials. This gives the percentage of simulations in which the POR gate became true,
and thus an estimation of its probability.
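A minimal sketch of this simulation for a single POR gate with exponentially distributed events follows; the failure rates, lifetime and trial count are illustrative only:

import math
import random

def por_probability_mc(lam_a, lam_b, t, trials=800_000):
    # estimate Pr{a|b}(t): a occurs by time t and b has not occurred before a
    count = 0
    for _ in range(trials):
        time_a = random.expovariate(lam_a)
        time_b = random.expovariate(lam_b)
        if time_a <= t and time_a < time_b:   # step 3: keep count if POR is TRUE
            count += 1
    return count / trials                     # step 4: fraction of the trials

lam_a, lam_b, t = 1e-4, 2e-4, 10_000
exact = lam_a / (lam_a + lam_b) * (1 - math.exp(-(lam_a + lam_b) * t))
print(por_probability_mc(lam_a, lam_b, t), exact)  # the two values should agree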
[Equations (19)-(20): lost in this copy; (19) concerns the probability of the Non-Inclusive Before operator a ⊲ b, and (20) is Merle's closed-form solution Pr{a ⊲ b}(t) = (λa / (λa + λb)) · (1 - e^(-(λa+λb)t)).]
It is evident that Merle’s algebraic solution (20) is exactly the same as the formulae
from Pandora’s definitions (11) and Markov analysis (16).
For any POR MCSQ with the expression e1|e2|...|en and constant failure rates λ1, λ2, ..., λn respectively, the probability of this MCSQ is derived as:
[derivation steps (21)-(22): lost in this copy]
Pr{e1|e2|...|en}(t) = (λ1 / (λ1 + λ2 + ... + λn)) · (1 - e^(-(λ1+λ2+...+λn)t)) (23)
3 Case Study
Fig. 2 above depicts a redundant fuel distribution system for a maritime propulsion
system. There are two primary fuel flows: Tank1 provides fuel for Pump1 which
pumps it to Engine1, and Tank2 provides fuel for Pump2 which pumps it to Engine2.
Flow to each engine is monitored by two sensors, Flowmeter1 and Flowmeter2. In the
event that flow to an engine stops, the Controller activates the standby Pump3 and
adjusts the fuel flow accordingly using valves. If the Controller detects an omission of
flow through Flowmeter1, it activates Pump3 and opens Valve1, diverting fuel from
Pump1 to Pump3; if it detects an omission of flow through Flowmeter2, Valve2 is
opened instead. In either case, either Valve3 or Valve4 is opened accordingly by the
Controller to provide fuel flow to the appropriate fuel-starved engine.
Pump3 can therefore be used to replace either Pump1 or Pump2 in the event of
failure, but not both. Engine failure will ensue if it receives no fuel. Although the ship
can function in a degraded capacity on one engine, failure of either engine is still
considered a potentially serious failure.
Important failure logic for the engines is described below. Failure modes are ab-
breviated as follows:
Valve1 initially provides flow from Tank1 to Pump1, but when activated, provides
flow from Tank1 to Pump3. Similarly, Valve2 initially provides flow to Pump2, but
when activated by the Controller, provides flow to Pump3. Either valve can become
stuck (failure modes V1 and V2 respectively), which will prevent the redirection of
flow if it happens before the valve is opened by the Controller. Thus omission of flow
to Pump3 from Valve1 is caused by 'V1 < ActivationSignalV1'. Failure to re-
ceive the control signal from the Controller will also cause a lack of flow to Pump 3.
Each primary pump takes fuel from its assigned fuel tank and pumps it to an en-
gine. Omission of flow from a pump can be caused by a lack of fuel flowing to the
pump or because the pump itself has failed (failure modes P1 and P2 for Pump1 and
Pump2 respectively). The flowmeters monitor the flow of fuel to each engine from
Pump1 and Pump2 and provide feedback to the Controller. If a sensor fails, it may not
provide any feedback, meaning that an omission of flow goes undetected and the
standby pump may not be activated. This is represented by S1 and S2.
The Controller is responsible for monitoring the feedback from the two sensors and
activating Pump3 if one of the two primary pumps fail. In this case, it sends signals to
the valves, diverting the flow of fuel to Pump3. It can also fail itself (failure mode
CF), in which case Pump3 may not be activated when needed. Once the Controller
has activated Pump3, a subsequent failure of the Controller has no effect on the sys-
tem, i.e., 'ActivationSignalV1 | CF' (or V2, V3, V4 for other valves).
Valves 3 and 4 direct the flow from Pump3 to either Engine1 or Engine2. Valve3 is
activated at the same time as Valve1 by Activate-Ctrl.UseTank1, whereas Valve4 is
activated at the same time as Valve2 by Activate-Ctrl.UseTank2. Like Valves 1 & 2,
both may get stuck closed (failure modes V3 and V4); however, unlike Valves 1 & 2,
they are only either open or closed. By default, they are closed.
Pump3 is the standby pump in the system. Once activated, it replaces one of the
two primary pumps, directing flow from one fuel tank to the appropriate engine
(Tank1 ==> Engine1 or Tank2 ==> Engine2). It has the same failure modes as the
other Pumps, but because it is a cold spare, it is assumed not to be able to fail until it
is activated. Input to Pump3 can come from either Valve1 or Valve2.
The engines provide propulsion for the ship. Each engine takes fuel from a differ-
ent fuel tank and can take its fuel from either its primary pump or Pump3. The order
in which the pumps fail determines which engine fails; for example, if Pump1 fails
first, then Engine1 can continue to function as long as Pump3 functions, but if Pump2
fails first, then Engine1 will be wholly reliant on Pump1. This is expressed by the
logical expressions below, where 'O-' denotes an omission of output. For simplicity,
internal failure of the engines themselves is left out of the scope of this analysis.
The expanded fault tree expressions for the failure of each engine are as follows:
E1 = (P1+P1|P2|CF|V1)|(P2+P2|P1|CF|V2).(V3 + P3 + V1<P1|P2|CF
+ V1&P1|P2|CF + S1<P1|P2 + CF<P1|P2 + S1&P1|P2
+ CF&P1|P2 + V2<P2|P1|CF + V2&P2|P1|CF + S2<P2|P1
+ CF<P2|P1 + S2&P2|P1 + CF&P2|P1)+ (P2+P2|P1|CF|V2)
<(P1+P1|P2|CF|V1) + (P2+P2|P1|CF|V2)&(P1+P1|P2|CF|V1)
E2 = (P2+P2|P1|CF|V2)|(P1+P1|P2|CF|V1).(V4 + P3 + V1<P1|P2|CF
+ V1&P1|P2|CF + S1<P1|P2 + CF<P1|P2 + S1&P1|P2 + CF&P1|P2
+ V2<P2|P1|CF + V2&P2|P1|CF + S2<P2|P1 + CF<P2|P1
+ S2&P2|P1 + CF&P2|P1) + (P1+P1|P2|CF|V1)
<(P2+P2|P1|CF|V2) + (P1+P1|P2|CF|V1)&(P2+P2|P1|CF|V2)
Minimisation of the fault trees for E1 and E2 gives the following MCSQs:
4 Evaluation
To verify the accuracy of POR gate quantifications discussed, terms of the MCSQs
from the case study were modelled in Isograph Reliability Workbench 11.0 (IRW), a
popular software package for reliability engineering. In Isograph, the RBDFTET (Re-
liability Block Diagram Fault Tree Effect Tree) lifetime and an accuracy indicator of
1 were used [20]. Tables 2 to 4 give the results of the terms in the MCSQ for Engine 1
in the case study. Since E1 and E2 are caused by the same events just in the opposite
sequences, their unavailability is the same, and thus results for E2 are the same as in
Tables 2-4. The column headed 'Algebraic' indicates results from Markov analysis
and Pandora. Results are obtained for varying values of system life time (in hours).
Pr (P1<P2) for all three methods was evaluated using Fussel’s formula [8] while
the top event probability was calculated using the Esary-Proschan [13] formula. The
precedence order for evaluating temporal terms is stated in ‘Notation’.
From Tables 2-4, it can be observed that the solutions obtained by the algebraic expression are close to those obtained by Isograph when the lifetime is small. However, with increasing lifetime, the results diverge. This may be attributed to the numerical integration method [20] that Isograph uses. Unfortunately, Isograph does not have any representation for the POR gate, and having to model it ‘from scratch’ every time can be cumbersome and (human) error-prone, as it consists of four states with three transitions between them (using Markov chains) or four gates with two basic events (using FTA). Modelling the entire system this way would be far worse.
It is also clear that the results from the algebraic expression and Monte Carlo (800000 trials) are much more similar. Both results tend to converge with increasing lifetime, although results for MCSQs with more than one temporal gate (S1<P1|P2 and CF<P1|P2) instead diverge considerably as the system lifetime t increases. Further research is being carried out to determine the cause of this behaviour.
It is widely known that as the size of a fault tree increases, Markov models become increasingly prone to human error and are crippled by the state explosion problem [17]. Markov models are usually limited to exponential failure and repair distributions [15,17], although some efforts have been made to extend them to other distributions [19]. Merle's, Markov's and Pandora's methods are efficient because they are algebraic expressions generated from first principles. Even though Merle's method is restricted to a cut sequence with only one POR gate, Pandora's has been extended to two or more. Modelling multiple POR scenarios with Markov chains can be very cumbersome and error-prone. Unlike in Monte Carlo simulation, however, all three techniques are restricted to exponential failure distributions.
It can also be observed that some of the Monte Carlo results are zero. This is due to the use of small, realistic constant failure rates. Monte Carlo simulation depends heavily on the sample size, which greatly impacts accuracy and computational time: the smaller the sample size, the less accurate the result, and vice versa. The ready availability of high computing power means that such compromises are rarely necessary. Moreover, unlike Markov analysis, which starts to break down when presented with complex fault trees, Monte Carlo is very efficient in handling complex situations [15,17].
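To make the Monte Carlo quantification concrete, the following minimal Python sketch estimates Pr(X POR Y) over a lifetime t for two independent, exponentially distributed basic events. The rates, trial count and function names are illustrative; the closed form is included only as a cross-check for this simple two-event case, not as the general MCSQ formula.

import math
import random

def por_monte_carlo(lx, ly, t, trials=800_000, seed=42):
    # Estimate Pr(X POR Y) in [0, t]: X fails within the lifetime and
    # Y has not failed before X (Y failing later, or never, both count).
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        tx = rng.expovariate(lx)  # failure time of X
        ty = rng.expovariate(ly)  # failure time of Y
        if tx <= t and tx < ty:
            hits += 1
    return hits / trials

def por_closed_form(lx, ly, t):
    # Exact value for this two-event case:
    # integral over 0..t of lx*exp(-lx*s) * exp(-ly*s) ds
    s = lx + ly
    return (lx / s) * (1.0 - math.exp(-s * t))

# Example: por_monte_carlo(1e-4, 5e-4, 10_000) approaches
# por_closed_form(1e-4, 5e-4, 10_000) as the trial count grows.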
5 Conclusion
Fault Tree Analysis has been used in reliability engineering for many decades now.
Modifications to it have evolved over time to provide new capabilities, such as the
introduction of dynamic or temporal semantics, allowing them to analyse sequential
failure logic. One such technique is Pandora which introduces new temporal gates to
enable the logical analysis of temporal fault trees. In this paper, we have presented
three new ways of probabilistically analysing one of Pandora’s temporal gates –
Priority-OR. These techniques, Monte Carlo, Pandora analysis, and Markov analysis
have been evaluated against an algebraic solution from Merle and applied to a case
study. The three novel techniques produce very similar results. Results have been
discussed. A mathematical model for more than one POR gate has been derived.
6 Acknowledgements
This work has been developed in conjunction with the EU FP7 project MAENAD.
7 References
1. Vesely W E, Stamatelatos M, Dugan J B, et al (2002) Fault tree handbook with aerospace
applications. NASA office of safety and mission assurance, Washington DC
2. Merle G, Roussel J (2007) Algebraic modelling of fault trees with priority AND gates.
IFAC Workshop on Dependable Control of Discrete Systems p175-180
3. Dugan J. B, Bavuso S. J, Boyd M. A (1992) Dynamic fault-tree for fault-tolerant computer
systems. IEEE Transactions on Reliability 41(3):363–76
4. Merle G (2010) Algebraic modelling of dynamic fault trees, contribution to qualitative and
quantitative analysis. Dissertation, École Normale Supérieure de Cachan
5. Tang Z, Dugan J B (2004) Minimal cut set/sequence generation for dynamic fault trees. In:
Reliability And Maintainability Symposium (RAMS), Los Angeles, 26-29 Jan 2004
6. Walker M, Papadopoulos Y (2008) Synthesis and analysis of temporal fault trees with PANDORA: the time of Priority AND gates. Nonlinear Analysis: Hybrid Systems 2(2):368-382
7. Walker M D (2009) Pandora: A Logic for the Qualitative Analysis of Temporal Fault
Trees. Dissertation, University of Hull
8. Fussell J B, Aber E F, Rahl R G (1976) On the quantitative analysis of Priority-AND failure logic. IEEE Transactions on Reliability R-25(5):324-326
9. Walker M, Papadopoulos Y (2009) Qualitative temporal analysis: Towards a full implementation of the Fault Tree Handbook. Control Engineering Practice 17:1115-1125
10. Vesely W E, Goldberg F F, Roberts N H and Haasl D F (1981) Fault Tree Handbook. US
Nuclear Regulatory Commission. Washington DC
11. Andrews J A (2000) To Not or Not to Not. In: Proceedings of the 18th International Sys-
tem Safety Conference, Fort Worth, Sept 2000. p 267-275.
12. Sharvia S, Papadopoulos Y (2008) Non-coherent modelling in compositional fault tree
analysis. In: The International Federation of Automatic Control, Seoul, 6-11 July 2008
13. Esary D, Proschan F (1963) Coherent Structures with Non-Identical Components. Tech-
nometrics 5(2):191-209.
14. Department of Defense (1998) Military handbook: electronic reliability design handbook. Webbooks. http://webbooks.net/freestuff/mil-hdbk-338b.pdf. Accessed June 27, 2011
15. Pukite J, Pukite P (1998) Modelling for reliability analysis. Wiley-IEEE Press, New York
16. Weisstein, Eric W (2011) Monte Carlo Method, MathWorld. http://mathworld.wolfram.
com/MonteCarloMethod.html. Accessed 01 August 2011
17. Rao D K et al (2008) Dynamic fault tree analysis using Monte Carlo simulation in prob-
abilistic safety assessment, Reliability Engineering and System Safety. 94(4):872-883
18. Rocco C M, Muselli M (2004) A machine learning algorithm to estimate minimal cut and
path sets from a Monte Carlo simulation. In: Proceedings Probabilistic Safety Assessment
and Management (PSAM7/ESREL'04). Springer, Berlin, 2004, p 3142-3147
19. Manian R, Dugan J B, Coppit D, Sullivan K J (1998) Combining various solution techniques for dynamic fault tree analysis of computer systems. In: Third IEEE International High-Assurance Systems Engineering Symposium, IEEE Computer Society, p 21-28
20. Isograph Limited (2011) Reliability Workbench Version 11 User Guide, p 392
Cross-Level Compositional Reliability Analysis
for Embedded Systems*
Abstract. Ever-shrinking device structures are one of the main reasons for the growing inherent unreliability of embedded system components. As a remedy, various means to increase the reliability of complex embedded systems at several levels of abstraction are available, and their efficient application is a key factor for the successful design of reliable embedded systems. While analysis approaches exist that evaluate these techniques and their advantages and disadvantages at particular levels, an overall system analysis, which has to work cross-level, is still lacking. This paper introduces a framework for cross-level reliability analysis that enables a seamless and flexible combination of various reliability analysis techniques across different levels of abstraction. For this purpose, the proposed framework provides mechanisms for (a) the composition and decomposition of the system during analysis and (b) the connection of different levels of abstraction by adapters that convert and abstract analysis results. As a case study, the framework extends and combines three analysis approaches from the MPSoC domain: (I) a BDD-based reliability analysis that considers redundancies in the system structure, (II) an analytical behavioral model that considers computational activity, and (III) a temperature simulator for processor cores. This makes it possible to capture thermal reliability threats at transistor level in an overall system analysis. The approach is seamlessly integrated in an automatic Electronic System Level (ESL) tool flow.
1 Introduction
A major threat to the reliability of today's and future embedded system components is their steadily shrinking device structures. These small device structures are susceptible to, e.g., environmental changes and fluctuations like cosmic rays and manufacturing tolerances. This poses a major challenge for the (automatic) design of embedded systems at system level, because there needs to be an awareness that the system is composed of unreliable components whose unreliability itself is subject to uncertainties. Considering reliability-increasing techniques at the system level like temporal/spatial redundancy or self-healing,
* Supported in part by the German Research Foundation (DFG) as associated project TE 163/16-1 of the priority program Dependable Embedded Systems (SPP 1500).
there exists a significant gap between the level of abstraction where the faults occur, i.e., switching devices like CMOS transistors at gate level, and higher levels like architecture level or system level, where the interplay of hardware and software is the main focus. This gap exists not only between the cause of faults and the techniques to compensate them, but also between the techniques and their efficiency, e.g., what is the effect of hardening techniques at circuit level like Razor [3] on the applications at system level?
Today, there exists a smorgasbord of reliability analysis techniques for both the relatively low levels of abstraction that focus on technology and the system-level analysis techniques that focus on the applications.1 However, there currently exists no well-defined holistic and cross-level analysis technique that collects knowledge at the lower levels of abstraction by combining different analysis techniques and provides proper data for the analysis at higher levels of abstraction by performing abstraction and conversion. Moreover, means to enhance a system's reliability are not for free and typically deteriorate other design objectives like monetary costs, latency, or energy consumption. The outlined lack of suitable cross-level analysis techniques thus prevents an adequate system-wide cost-benefit analysis during the design of embedded systems.
This work introduces a Cross-level Reliability Analysis (CRA) framework that combines various reliability analysis techniques across different levels of abstraction. It aims at closing the mentioned gaps and enables an efficient Design Space Exploration (DSE) [16] during which a system-wide cost-benefit analysis of reliability-increasing techniques becomes possible. For this purpose, the framework uses two basic concepts: (I) decomposition and composition tackle the growing complexity when considering lower levels of abstraction; (II) adapters combine different analysis techniques by converting between required and provided data like temperature, radiation, and reliability-related measures. To give evidence of the benefits of CRA, a case study investigates an embedded 8-core MPSoC platform where a redundant task layout at system level is used as a means to cope with thermal effects at device level. This is achieved by seamlessly combining three analysis techniques in CRA, from system level down to a temperature simulator of the cores.
2 Related Work
Up to now, several system-level reliability analysis approaches have been pre-
sented for embedded systems and were integrated into design space exploration
techniques to automatically design reliable hardware/software systems. How-
ever, they typically rely on simplistic analysis approaches based on constant
failure rates for both, permanent and transient errors and series-parallel sys-
tem structures: An approach that unifies fault-tolerance via checkpointing and
power management through dynamic voltage scaling is introduced in [24]. In [7,
1 This work focuses on reliability issues for applications mapped to an embedded system platform architecture and treats reliability of the (software) functionality itself as an orthogonal aspect.
3 Compositional Reliability Analysis
This work targets the system-level design of embedded MPSoCs and distributed embedded systems that typically consist of several processor cores connected by a communication infrastructure such as buses, networks-on-chip, or field buses. The main challenge is to analyze and increase the reliability of applications executed on the MPSoC by propagating the effects of faults and of introduced reliability-increasing techniques at lower levels of abstraction up to the system level. The work at hand addresses this challenge by a cross-level Compositional Reliability Analysis (CRA) whose mathematical concept, as introduced in the following, is directly reflected within a framework by means of a class and interface structure. An important concept of the developed CRA is that for each relevant error model at a specific level of abstraction, an appropriate reliability analysis shall be applicable and seamlessly integrated into the CRA. As a result, the developed concepts become independent of an actual error model
Fig. 1. An abstract view on the two basic mechanisms that Compositional Reliability Analysis (CRA) provides: A Compositional Reliability Node (CRN) applies an individual analysis technique X at a specific reliability level of abstraction and delivers specific measures O, such as error rates, temperature, or aging over time. Within each reliability level of abstraction, composition and decomposition are applied to tame complexity. Different reliability levels of abstraction are connected using adapters that perform refinement, data conversion, and abstraction.
since it aims at abstracting from the actual source of unreliability during upscaling, i.e., the propagation of data from lower levels to higher levels by means of abstraction and data conversion, in the CRA. After an introduction of the CRA concept in this section, the following section provides a concrete example of it. A schematic view of such a compositional reliability analysis and the required mechanisms is shown in Fig. 1.
As depicted, CRA is agnostic of design levels like Electronic System Level (ESL), Register Transfer Level (RTL), or circuit level, but introduces reliability levels of abstraction. A reliability level of abstraction in CRA involves one or even a range of design levels where the same error models and, especially, their causes are significant. Consider, e.g., the effects of electromigration due to temperature T on a processing unit as a cause of permanent defects. Analyzing these effects properly requires (I) awareness of the processor's workload and, hence, of the task binding and scheduling defined at system level, down to (II) the power consumption and the actual floorplan of the processor at circuit level. To realize a holistic cross-level analysis, CRA features three important aspects: (a) individual analysis techniques are encapsulated in Compositional Reliability Nodes (CRNs) at reliability levels of abstraction, (b) composition C and decomposition D are applied to tame system complexity, and (c) formerly incompatible reliability levels of abstraction and, hence, analysis techniques are connected by adapters A. CRA combines CRNs and adapters in a tree-based fashion into a flexible and holistic system-wide analysis.
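To illustrate the class and interface structure mentioned above, a minimal Python sketch follows; the names (CRN, Adapter, evaluate) and the dictionary-based data exchange are our illustrative assumptions, not the framework's actual API:

from abc import ABC, abstractmethod

class CRN(ABC):
    """Compositional Reliability Node: encapsulates one analysis
    technique X at one reliability level of abstraction."""
    @abstractmethod
    def analyse(self, subsystem, inputs):
        """Return the delivered measures O over time (e.g. R(t), T(t))."""

class Adapter(ABC):
    """Connects two reliability levels of abstraction (cf. Fig. 1)."""
    @abstractmethod
    def refine(self, subsystem):
        """Produce the inputs required by the lower-level analysis."""
    @abstractmethod
    def abstract(self, lower_outputs):
        """Convert/abstract lower-level results for the level above."""

def evaluate(crn, subsystem, lower=None):
    """Tree-based evaluation: decompose into subsystems, analyse them
    at the lower level via an adapter, and feed the abstracted results
    to the upper CRN. 'lower' is an optional (adapter, child_crn,
    subsystems) triple."""
    inputs = {}
    if lower is not None:
        adapter, child_crn, subsystems = lower
        for s in subsystems:                      # decomposition
            request = adapter.refine(s)           # e.g. beta' -> power trace
            out = child_crn.analyse(s, request)   # e.g. HotSpot -> T(t)
            inputs.update(adapter.abstract(out))  # e.g. T(t) -> R_r(t)
    return crn.analyse(subsystem, inputs)         # composition at this level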
2 The (sub)system model S is assumed to be a complete system specification that contains the required information for all considered design levels of abstraction, from, e.g., the task binding down to the floorplan of the allocated processors.
3 Based on a given reliability function, all well-known reliability-related measures like the Mean-Time-To-Failure (MTTF) $\mathrm{MTTF} = \int_0^\infty R(t)\,dt$ or the Mission Time (MT) $MT(p) = R^{-1}(p)$ can be determined.
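As a small numerical companion to footnote 3 (a sketch assuming a sampled, monotone decreasing reliability function; not part of the framework itself):

import math

def mttf(R, t_max, n=100_000):
    """Approximate MTTF = integral of R(t) from 0 to infinity by the
    trapezoidal rule, truncated at t_max (valid if R(t_max) ~ 0)."""
    h = t_max / n
    total = 0.5 * (R(0.0) + R(t_max))
    for i in range(1, n):
        total += R(i * h)
    return total * h

def mission_time(R, p, t_hi, tol=1e-9):
    """MT(p) = R^{-1}(p) by bisection; assumes R(0) >= p >= R(t_hi)."""
    lo, hi = 0.0, t_hi
    while hi - lo > tol * t_hi:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if R(mid) > p else (lo, mid)
    return 0.5 * (lo + hi)

# Sanity check with R(t) = exp(-l*t): MTTF ~ 1/l and MT(p) ~ -ln(p)/l
# mttf(lambda t: math.exp(-1e-4 * t), t_max=2e5)            # ~ 1e4
# mission_time(lambda t: math.exp(-1e-4 * t), 0.9, 2e5)     # ~ 1053.6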
$S_1, \dots, S_n$ with $S = \bigcup_{i=1}^{n} S_i$ is feasible if the arising error for each output value $o \in O$ is smaller than a specified maximum $\epsilon_o$.
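Read as code, this feasibility check is straightforward (a sketch; the mappings and names are illustrative):

def decomposition_feasible(error, eps):
    """error: output value o -> error introduced by analysing the
    subsystems S_1..S_n separately; eps: o -> allowed maximum.
    The decomposition is feasible iff every error stays below eps."""
    return all(error[o] < eps[o] for o in error)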
4 Case Study
A dedicated CRA that consists of the three mentioned analysis techniques is depicted in Fig. 2. At the highest level, $X^{I}$ is a BDD-based reliability analysis as introduced, e.g., in [4, 6]. It delivers the reliability function of the complete 8-core MPSoC, i.e., $O^{I} = \{R(t)\}$, and requires the reliability function of each component (core) $r \in R$ in the system, i.e., $\forall r \in R: R_r(t) \in I^{I}$. At the intermediate level, $X^{II}$ is a behavioral analysis approach termed Real-Time Calculus (RTC) [19]. Given a binding of tasks to cores and their schedule (activation), RTC delivers upper and lower bounds for the workload on each processing unit. For this analysis, $I^{II} = \emptyset$ indicates that no additional input is required at the second level. The important part of level II, however, is $O^{II} = \{\beta_r'\}$, with $\beta_r'$ being a service curve that describes the remaining service (computational capacity) provided by the resource and, hence, enables deriving the current workload. On level III, $X^{III}$ is a temperature simulation using HotSpot [14]. HotSpot is capable of delivering a temperature profile, i.e., $O^{III} = \{T_r(t)\}$, and requires a power trace of the respective component, $I^{III} = \{P_r(t)\}$. Note that, for the sake of simplicity, the additional data required from the system model S at the different levels of abstraction as well as the bypass of $R(t)$, $R_r(t)$ between levels I and II is omitted, see Fig. 2.
Fig. 2. An example of a concrete CRA that considers three reliability levels of abstraction: At the highest, i.e., system level, reliability functions R(t) are derived by means of a BDD-based structural analysis. At the next level, the MPSoC is decomposed into subsystems, i.e., two processing units in this example. To gather the desired reliability function for each processing unit, a lower reliability level of abstraction is chosen, where the operating temperature T(t) is a significant cause of failures. The adapter A between levels II and III requires the workload on a processing unit derived by Real-Time Calculus (RTC) on level II. It then refines it into a power trace and passes it to the temperature simulation at level III. The delivered temperature profile is converted into a reliability function of the processor and passed back to level II. Finally, the individual reliability functions from level II are passed to the BDD-based approach at level I.
idle and running: whenever the processing unit executes a task, the processor is in running mode, and idle otherwise. Given the remaining service $\beta'$ from $O^{II}$, a power trace is generated, with the concrete power consumption values for idle and running being provided by the core's specification in S.
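A possible reading of this conversion as code (a simplified sketch; the slot-based representation of $\beta'$ and the power values are our assumptions, not the framework's actual data model):

def power_trace(busy_slots, p_idle, p_running):
    """Map per-slot utilisation derived from the remaining service
    beta' (True = the core executes a task) to a two-mode power trace."""
    return [p_running if busy else p_idle for busy in busy_slots]

# e.g. power_trace([True, True, False, True], p_idle=0.3, p_running=2.5)
# -> [2.5, 2.5, 0.3, 2.5]   (values in watts, purely illustrative)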
The conversion transformer $T_\uparrow$ requires deciding on a particular fault model. Here, electromigration is investigated. The effect of electromigration under constant temperature T on the Mean-Time-To-Failure (MTTF) is modeled as

$\mathrm{MTTF_{EM}} = \frac{A}{J^n}\, e^{\frac{E_a}{KT}}, \qquad (2)$

where A is a material-dependent constant, J is the current density, $E_a$ is the activation energy for mobile ion diffusion, and K is Boltzmann's constant.
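For the constant-temperature case, eq. (2) translates directly into code; the kelvin and electronvolt units are our assumptions, and the parameter values would be taken from [1]:

import math

K_BOLTZMANN = 8.617e-5  # Boltzmann's constant in eV/K (assumed units)

def mttf_em(A, J, n, Ea, T):
    """Eq. (2): MTTF_EM = (A / J^n) * exp(Ea / (K * T)) for a constant
    temperature T in kelvin and activation energy Ea in eV."""
    return (A / J ** n) * math.exp(Ea / (K_BOLTZMANN * T))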
The respective values are obtained from [1]. To take into account temperature profiles T(t), the method proposed in [22] is adopted: the total simulation time
Fig. 3. Three possible levels for a corrective postprocessing: P RTC takes part at level II but does not have access to temperature profiles and thus corresponds to no postprocessing. P RoT takes part in the adapter and works solely on temperature profiles as well as basic system model information. P HS takes part at level III and corresponds to a complete simulation of both cores. The latter corresponds to an ineffective decomposition, and its error is assumed to be zero.
(Figure residue: a thermal model of two neighbouring cores with temperatures T1, T12, T2, distances d1, d2 and thermal coupling F1→2 between Core 1 and Core 2; a chart comparing the errors of P HS, P RoT and P RTC; and the axes of the plots in Fig. 6, 'Consolidated MTTF gain using CRA' versus 'Expected MTTF gain by [4]'.)
Fig. 6. The expected gain in MTTF by means of spatial task redundancy for the H.264 encoder only (left) and the complete H.264 (right). While [4] neglects the negative effects of spatial redundancy, CRA shows that this may lead to an overestimation of the gain in MTTF of up to 30% (H.264 encoder only) and 20% (H.264). Note that for implementations below the dotted line, [4] expects an enhanced reliability; as revealed by CRA, their spatial redundancy is in fact downgrading the system-wide MTTF.
spatial redundancy predominates over the negative effects due to the increased workload and, hence, wear-out. Moreover, there also exist several implementations where the spatial redundancy actually downgrades the MTTF of the overall system, i.e., the negative effects dominate the positive effects. In summary, the case study gives strong evidence that only a holistic analysis approach is capable of providing sufficient information with respect to all trade-offs and significant effects.
5 Conclusion
This paper introduces a flexible framework for cross-level Compositional Reliability Analysis (CRA) that enables a seamless integration of various reliability analysis techniques across different levels of abstraction. The framework provides mechanisms for (a) the composition and decomposition of the system during analysis and (b) the connection of different levels of abstraction by adapters. As a case study, CRA combines three analysis approaches from the MPSoC domain: (I) a BDD-based approach to consider redundancy in the system structure, (II) an analytical behavioral model to consider computational activity, and (III) a temperature simulator. The experimental results highlight the flexibility of the approach with respect to both the integration of different techniques across levels and the mechanisms to trade off accuracy against computational complexity. Moreover, the need for holistic and cross-level analysis as enabled by CRA is shown by investigating the error of existing work that, e.g., neglects the negative effects of redundancy at system level on component wear-out at circuit level.
References
1. JEDEC Solid State Technology Association: Failure mechanisms and models for semiconductor devices. JEDEC Publication JEP122-F (2010)
2. Eles, P., Izosimov, V., Pop, P., Peng, Z.: Synthesis of fault-tolerant embedded
systems. In: Proc. of DATE ’08. pp. 1117–1122 (2008)
3. Ernst, D. et al.: Razor: A low-power pipeline based on circuit-level timing specu-
lation. In: Microarchitecture ’03. pp. 7–18 (2003)
4. Glaß, M., Lukasiewycz, M., Reimann, F., Haubelt, C., Teich, J.: Symbolic system
level reliability analysis. In: Proc. of ICCAD ’10. pp. 185–189 (2010)
5. Gu, Z., Zhu, C., Shang, L., Dick, R.: Application-specific MPSoC reliability op-
timization. IEEE Trans. on Very Large Scale Integration Systems 16(5), 603–608
(2008)
6. Israr, A., Huss, S.: Specification and design considerations for reliable embedded
systems. In: Proc. of DATE ’08. pp. 1111–1116 (2008)
7. Izosimov, V., Pop, P., Eles, P., Peng, Z.: Synthesis of fault-tolerant schedules with
transparency/performance trade-offs for distributed embedded systems. In: Proc.
of DAC ’04. pp. 550–555 (2004)
8. Leon, A.S., Tam, K.W., Shin, J.L., Weisner, D., Schumacher, F.: A Power-Efficient
High-Throughput 32-Thread SPARC Processor. IEEE Journal of Solid-State Cir-
cuits 42(1), 7–16 (2007)
9. Lukasiewycz, M., Glaß, M., Reimann, F., Teich, J.: Opt4J - A Modular Framework
for Meta-heuristic Optimization. In: Proc. of GECCO ’11. pp. 1723–1730 (2011)
10. McGregor, J., Stafford, J., Cho, I.: Measuring component reliability. In: Proceed-
ings of 6th ICSE Workshop on Component-based Software Engineering (2003)
11. Reussner, R., Schmidt, H., Poernomo, I.: Reliability prediction for component-
based software architectures. Systems & Software 66(3), 241–252 (2003)
12. Sander, B., Schnerr, J., Bringmann, O.: ESL power analysis of embedded processors
for temperature and reliability estimations. In: Proc. of CODES+ISSS ’09. pp.
239–248 (2009)
13. Schnable, G., Comizzoli, R.: CMOS integrated circuit reliability. Microelectronics
Reliability 21, 33–50 (1981)
14. Skadron, K., Stan, M., Huang, W., Velusamy, S., Sankaranarayanan, K., Tarjan,
D.: Temperature-aware microarchitecture. In: ACM SIGARCH Computer Archi-
tecture News. vol. 31, pp. 2–13 (2003)
15. Stathis, J.: Reliability limits for the gate insulator in CMOS technology. IBM
Journal of Research and Development 46(2-3), 265–286 (2002)
16. Streichert, T., Glaß, M., Haubelt, C., Teich, J.: Design space exploration of reliable
networked embedded systems. J. on Systems Architecture 53(10), 751–763 (2007)
17. Ting, L., May, J., Hunter, W., McPherson, J.: AC electromigration characterization and modeling of multilayered interconnects. In: 31st Annual International Reliability Physics Symposium. pp. 311-316 (1993)
18. Tosun, S., Mansouri, N., Arvas, E., Kandemir, M., Xie, Y.: Reliability-centric high-
level synthesis. In: Proc. of DATE ’05. pp. 1258 – 1263 (2005)
19. Wandeler, E., Thiele, L.: Real-Time Calculus (RTC) Toolbox,
http://www.mpa.ethz.ch/Rtctoolbox
20. Wei, B., Vajtai, R., Ajayan, P.: Reliability and current carrying capacity of carbon
nanotubes. Applied Physics Letters 79, 1172–1174 (2001)
21. Wirthlin, M., Johnson, E., Rollins, N., Caffrey, M., Graham, P.: The reliability of
FPGA circuit designs in the presence of radiation induced configuration upsets.
In: Proc. of FCCM ’03. pp. 133–142 (2003)
22. Xiang, Y., Chantem, T., Dick, R.P., Hu, X.S., Shang, L.: System-Level Reliability
Modeling for MPSoCs. In: Proc. of CODES+ISSS ’10. pp. 297–306 (2010)
23. Xie, Y., Li, L., Kandemir, M., Vijaykrishnan, N., Irwin, M.: Reliability-aware co-
synthesis for embedded systems. VLSI Signal Processing 49(1), 87–99 (2007)
24. Zhang, Y., Dick, R., Chakrabarty, K.: Energy-aware deterministic fault tolerance
in distributed real-time embedded systems. In: Proc. of DATE ’05. pp. 372–377
(2005)
25. Zhu, C., Gu, Z., Dick, R., Shang, L.: Reliable multiprocessor system-on-chip syn-
thesis. In: Proc. of CODES+ISSS ’07. pp. 239–244 (2007)
IT-forensic automotive investigations on the example of
route reconstruction on automotive system and
communication data
2 Basics
This section provides relevant basics from the IT forensics domain as well as an overview of the spectrum of research on automotive IT security.
2.2 The spectrum of automotive IT security and data sources for IT forensics
Motivated by the existing threats to automotive IT, which have been discussed at the beginning of this article, research activities on the application of IT security concepts to automotive IT systems and their individual characteristics are receiving increasing attention from the academic and industrial community. Especially in view of the restricted maintenance and update capabilities of vehicular embedded systems, a suitable overall concept will be characterised by the fact that, in addition to preventive measures (update verification, device authentication or tampering protection at device level), it will also feature measures and processes of detection (recognition of indications of active attacks) and reaction (recovery to safe system states, initiation and support of incident investigations).
As in the desktop IT domain, an IT forensic investigation can profit from measures al-
ready installed before an incident (strategic preparation). Two exemplary approaches,
which could serve as additional data sources for IT forensics, are:
• Permanent logging: Logs of selected information from vehicle usage can be useful for multifaceted applications (e.g. automatic driver's logbooks or flexible insurance models). If they are recorded in a forensically sound manner, e.g. by a forensic vehicle data recorder [10], such log files can be provided securely to the respective users. The application cases of such a system can include the logging of information which might be useful for the investigation of future incidents (e.g. accidents or manipulations).
• Event-triggered logging: Another type of data source could be an automotive Intrusion Detection System (IDS) [11], which monitors the operating state of automotive IT systems for potential IT security violations. Indications of respective incidents can be detected either signature- or anomaly-based, followed by an appropriate reaction [12]. While the spectrum of potential reactions can range up to active operations such as a controlled stopping of the vehicle, such major interventions should only be taken in justified emergency cases, based on a sufficient reliability of detection [13]. In case of less critical or only vaguely detected incidents, the IDS can also decide on the initiation or intensification of data logging for a certain amount of time (see the sketch after this list). Since an IDS should also ensure the confidentiality, integrity and authenticity of logged information, it would be an option to connect it with a forensic vehicle data recorder (see above and [10]).
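The reaction logic described in the last bullet can be pictured with a small Python sketch; the thresholds, names and confidence scale are purely illustrative assumptions, not taken from [12, 13]:

def react(severity, detection_confidence):
    """Pick an IDS reaction: active interventions only for critical,
    reliably detected incidents; otherwise (intensified) logging."""
    if severity == "critical" and detection_confidence > 0.99:
        return "controlled_stop"    # justified emergency case only
    if detection_confidence > 0.5:
        return "intensify_logging"  # less critical or vague detection
    return "log_event"              # default: record the indication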
Following the process model introduced in section 2.1, the investigation of automotive incidents (e.g. in the chosen hit-and-run scenario) should also reflect the introduced phases. This section presents a compact, conceptual overview of exemplary steps.
3.1 Overview on the application of the process steps in the automotive context
The Strategic Preparation (SP), which is conducted ahead of a suspected specific incident, includes, next to the acquisition of technical specifications, wiring schemes etc., the provision of components supporting a subsequent forensic investigation, such as forensic vehicle data recorders or IDS components, in the IT system, i.e. the car. Their installation (together with the necessary rights management and the initialisation of the cryptographic key management) could be executed by the vehicle owners themselves (e.g. car fleet managers) in the medium term. In the long term, however, this installation could also be executed by car manufacturers once the acceptance rate of such components grows and the benefits are realised by potential buyers. With the start of investigative proceedings after a suspected incident, the Operational Preparation (OP) is initiated, involving the fundamental decision-making about the course of action, such as the kind, extent and manner of the gathering of volatile and non-volatile data or the set of appropriate tools. Potentially incident-relevant data containing traces is collected during the Data Gathering (DG), e.g. using diagnostic protocols (diagnostic trouble codes, DTC) or direct access to individual electronic control units (ECUs), e.g. data contained in non-volatile portions such as flash memory. This data gathering needs to be executed with authenticity- and integrity-assuring mechanisms in place (organisational and/or technical means, e.g. cryptographic algorithms). The subsequent Data Investigation (DI) involves the usage of mostly (semi-)automatic tools (e.g. to reconstruct deleted data content, extract timestamps etc.). In the Data Analysis (DA), those single results are put into context, involving their correlation according to time in the incident (e.g. timelining) and/or causal relationship (e.g. why-because analysis). Each individual step of the forensic proceedings, starting from the Strategic Preparation (SP), is comprehensively recorded (e.g. input parameters, input data, used tools and settings, output data, environmental data) in the process-accompanying Documentation (DO). This data and the derived information from all steps are distilled into the closing report during the final Documentation (DO). Some of the phases can be revisited during the forensic investigation, e.g. when results point to promising data sources not yet acquired.
used in these tests are integrated devices for vehicles of an international manufacturer from Germany.
For scenarios like the one selected for this article, it would be useful to include geographic information in the set of proactively logged data comprising CAN bus messages. To gather such information, components placed as strategic preparation can implement this in two different ways:
• Geographic information is already accessible in the car (e.g. if GPS coordinates are placed on the internal bus system by an existing electronic control unit)
• The respective information can (or shall) not be acquired from external devices and has to be determined by the logging device itself (e.g. installation of a GPS receiver in such a component)
At the same time, this choice is a compromise between costs and the reliability of the logged data.
Fig. 2: Route information (street names) on the instrument cluster (left) and the CAN bus (right)
Using the example of a real, integrated navigation system and its electronic integration into vehicles of a major international vehicle manufacturer from Germany, this could be implemented as follows. During operation, the navigation system displays the current route information (direction, street names etc.) also on the instrument cluster (see left part of Fig. 2). This is for both comfort and safety reasons, because this way it is visible directly in front of the driver, not distracting him from maintaining a frontal view of the traffic. To accomplish this, the respective information is transmitted over the internal vehicle CAN bus in clear text (see the log excerpt in the right part of Fig. 2). A logging component placed in the context of strategic preparation (e.g. a forensic vehicle data recorder or an automotive IDS) could securely log this information (i.e. preserving confidentiality, integrity and authenticity of the log files) and enable access to it in case of future incident investigations. In the chosen hit-and-run scenario, this data could provide significant indications of the presence or absence of the driver at the accident scene.
In addition to the data collected before an incident (by measures installed as strategic preparation), further information can be acquired from other data sources after an incident, corresponding to the classical IT forensics approach.
Looking at the selected target of route reconstruction, potential evidence can be searched for on the navigation system, for example. The common approach from desktop IT forensics of performing a complete low-level block-wise image (dump) of non-volatile mass storage devices is more difficult on embedded devices (due to their heterogeneous architecture, components and restricted interfaces). However, access to such systems can be attempted using debug interfaces of a controller type identified beforehand (left part of Fig. 3). Subsequently (or, as a simplified alternative), information can be acquired using graphical user interfaces (Fig. 3, right part), although information deleted by the user can usually not be reconstructed this way. In such a scenario, organisational measures (e.g. the four-eyes principle) have to ensure the authenticity and integrity of the acquired information.
The acquired data should also be critically assessed regarding its evidentiary value, since in many cases displayed information may have been manipulated by the users (e.g. the system time).
Fig. 3: Data Gathering via debug interfaces (left side) and/or GUIs (right side)
Regarding the Data Analysis (DA) phase, this section covers the analysis of (completely or selectively recorded) bus communication, which can be acquired by measures of strategic preparation (e.g. a forensic vehicle data recorder or the logging functions of an automotive IDS component).
Looking at the route reconstruction scenario, street names or GPS coordinates could be available in the log files (as illustrated above for the strategic preparation), which would make it a trivial case. However, even assuming that no such explicit information is available in the log files (maybe because the navigation system was not in use at the time in question), further possibilities remain for a subsequent route reconstruction. The following two subsections introduce a manual and a semi-automated approach.
During a test ride in the German city of Magdeburg, the CAN bus communication from the infotainment subnetwork was logged and subsequently evaluated.
One rudimentary approach for manual route reconstruction only requires the identification and evaluation of the speed signal. Since the semantic structure of the bus communication is usually kept secret by the manufacturers, IT forensic personnel either have to perform their own analyses or can resort to the results of respective analyses performed and published by internet communities [14]. During the manual analysis of the log file recorded in the performed test ride, an integer signal could be identified as a potential candidate for the speed signal: a continuous value between 0 and 8206. Including the known fact of an urban trip, a round scale factor of 150 can be assumed, which would correspond to a maximum speed of 54.7 km/h (in Germany, the standard urban speed limit is 50 km/h). The reconstructed velocity plot is illustrated in the upper part of Fig. 4.
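The scaling and a subsequent distance estimate are simple to script; the following Python sketch assumes a hypothetical 'timestamp id raw' log format and a hypothetical CAN ID, with only the scale factor of 150 taken from the analysis above:

SPEED_SCALE = 150.0  # raw units per km/h, assumed from the 0..8206 range

def velocity_profile(log_lines, speed_id=0x1A0):
    """Yield (timestamp_s, speed_kmh) pairs for the candidate signal.
    The record layout 'timestamp id raw' and the ID are hypothetical."""
    for line in log_lines:
        ts, can_id, raw = line.split()
        if int(can_id, 16) == speed_id:
            yield float(ts), int(raw) / SPEED_SCALE

def distance_m(profile):
    """Trapezoidal integration of speed over time, in metres."""
    total, prev = 0.0, None
    for ts, v in profile:
        if prev is not None:
            t0, v0 = prev
            total += (v0 + v) / 2.0 / 3.6 * (ts - t0)  # km/h -> m/s
        prev = (ts, v)
    return total

Applied to the recorded log, such a script would reproduce the velocity plot of Fig. 4; integrating between reconstructed stops yields the lengths of the individual route segments.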
(Fig. 4 residue: the reconstructed velocity plot and route segment lengths of 399 m, 281 m and 719 m.)
Comparable strategies for tracking vehicle positions are already implemented in some integrated navigation systems. The system data evaluated by the respective algorithms usually includes the vehicle's speed and, partly, steering angles or gyrometer values. With reference to the local map material, these strategies are used to correct the vehicle's position under different conditions (e.g. in tunnels or during technical GPS reception problems) as well as to reduce power consumption (by reducing the frequency of GPS calculations). Using this "temporary solution", some systems are even able to keep up flawless operation for several hundreds of kilometres. This functionality, already present in several devices, can also be utilised by IT forensic personnel for route reconstruction purposes. To accomplish this, three main steps are required:
1. The device has to be started offline, i.e. without GPS reception and without a bus connection to a real car. In the lab this is usually easy to achieve by identifying and connecting the pins for the power supply and the ignition signal. It could also be done without dismounting the device from a car by temporarily disconnecting only the vehicle bus and the GPS antenna.
2. The device has to be configured with the suspected starting position. Some devices have a dedicated system menu dialogue for this purpose, as shown in Fig. 5 (by specifying a nearby intersection, its distance and the current orientation).
3. To perform the actual route reconstruction, the device has to be provided with suitable signals (as listed above) to simulate a trip done without working GPS reception.
Fig. 5: Semi-automated route reconstruction – step 2 (configuring the suspected starting position)
In our test, step 3 could not be completed for the device from Fig. 5, because this older device did not pick up the speed information from the CAN bus but expected it as an analogue signal. While a D/A conversion would also be feasible with suitable hardware, digital feeding of the speed signal could successfully be implemented in a setup with the newer device shown in Fig. 6. This device uses the speed information present on the CAN bus and can be provided with the respective signals (e.g. directly taken from the acquired bus communication) via a suitable bus interface. In general, a navigation system suitable for such an analysis does not necessarily have to be compatible with the bus protocol of the source vehicle. If it is not, it can be attempted to convert the required input values to the expected data format, temporal resolution etc.
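Feeding the acquired frames to such a device could, for instance, be scripted with the python-can package; the channel name, the use of SocketCAN and the (timestamp, id, payload) record layout are assumptions of this sketch:

import time
import can

def replay(records, channel="can0"):
    """Replay logged (timestamp_s, can_id, payload_bytes) records onto
    a bench CAN bus, preserving the original inter-frame timing."""
    bus = can.interface.Bus(channel=channel, interface="socketcan")
    try:
        t0 = records[0][0]
        start = time.monotonic()
        for ts, can_id, payload in records:
            delay = (ts - t0) - (time.monotonic() - start)
            if delay > 0:
                time.sleep(delay)
            bus.send(can.Message(arbitration_id=can_id, data=payload,
                                 is_extended_id=False))
    finally:
        bus.shutdown()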
Some snapshots from the route reconstruction (step 3) are shown in Fig. 6. One issue we encountered with the second device is that it does not evaluate the steering angle present on the infotainment CAN bus but determines the current angle with an integrated gyrometer sensor. The provision of orientation information from the outside is a bit trickier in this case. Without opening the device (e.g. to intercept the sensor connection), this could be performed by an automated rotation of the device according to the available log information (in this case: steering angle / velocity). In our setup we simply simulated this by manually turning the device.
Fig. 6: Semi-automated route reconstruction – step 3: test setup and test in progress
• Is the route realistic? This is probable if it belongs either to the fastest or to the shortest connections between the starting and destination address. It is less probable if it contains closed loops.
• Are the speed/location mappings realistic? During the reconstruction, the "virtual" vehicle should slow down on sharp curves and stop at certain points with a certain probability (e.g. STOP signs, traffic lights). In single cases it may also slow down or stop on straight road segments (e.g. to wait for passengers or other cars), but a significantly increased number of such events would make the assumption of the route (or, respectively, the chosen starting point) more improbable.
Recent studies revealed that the owners/drivers themselves are still the most frequent protagonists of such incidents, trying to "optimise" their car. However, the relevance of third parties as the initiating source of IT-based attacks on automotive systems might also increase in the future. Since safety and security threats can arise as (direct or indirect) implications in both cases, IT forensic investigations can help to identify and fix the exploited vulnerabilities (e.g. by providing software updates for current systems or including design fixes in future ones). This makes IT forensics an essential part of the life cycle chain to improve the dependability and trustworthiness of automotive IT systems.
Currently, the concept still has some practical boundaries, since obtaining respective findings from current vehicles is typically a difficult task due to several multifaceted restrictions. The amount of vehicular information accessible via the still mostly proprietary / manufacturer-dependent diagnostic protocols is usually very limited. On the other hand, the extraction and analysis of complete memory dumps (e.g. from flash memory) out of heterogeneous embedded devices of different manufacturers currently demands considerable effort. The alternative, comparatively convenient option of accessing potentially incident-relevant information via existing graphical user interfaces (such as the GUI of the navigation system) is only possible for a small fraction of automotive systems and has only restricted reliability (e.g. due to editing/deletion features available to the users).
Acknowledgements
The authors of this article would like to thank Thomas Rehse for his support and valuable input created in the context of his master's thesis [15] at the Otto-von-Guericke University of Magdeburg, Germany.
References
1. SPIEGEL Online International: Autopsy Shows Haider Was Intoxicated, Web Article from
October 15th, 2008, http://www.spiegel.de/international/europe/0,1518,584382,00.html, last
access: March 2nd, 2012.
2. Nilsson, D.K.; Larson, U.E.: Conducting Forensic Investigations of Cyber Attacks on Auto-
mobile In-Vehicle Networks. In: Networking and Telecommunications: Concepts, Method-
ologies, Tools and Applications, pp. 647-660, IGI Global, ISBN 978-1-60566-986-1, 2010.
3. Biermann, M.; Hoppe, T.; Dittmann, J.; Vielhauer, C.: Vehicle Systems: Comfort & Security
Enhancement of Face/Speech Fusion with Compensational Biometrics; In: MM&Sec'08 -
Proceedings of the Multimedia and Security Workshop 2008, 22.-23. September, Oxford, UK,
ACM, pp.185-194; ISBN 978-1-60558-058-6, 2008.
4. Dittmann, J.; Hoppe, T.; Kiltz, S.; Tuchscheerer, S.: Elektronische Manipulation von Fahr-
zeug- und Infrastruktursystemen: Gefährdungspotentiale für die Straßenverkehrssicherheit;
Wirtschaftsverlag N. W. Verlag für neue Wissenschaft, ISBN 978-3869181158, 2011.
5. Grance, T.; Kent, K.; Kim, B.: Computer Security Incident Handling Guide, NIST Special Publication 800-61, National Institute of Standards and Technology, 2004.
6. Casey, E.: Digital Evidence and Computer Crime. Academic Press, ISBN 0-12-163104-4, 2004.
7. Federal Office for Information Security: Leitfaden IT-Forensik, Version 1.0.1 / March 2011,
https://www.bsi.bund.de/ContentBSI/Themen/Internet_Sicherheit/Uebersicht/ITRevision/IT-
Forensik/it-forensik.html, 2011.
8. Kiltz, S.; Hoppe, T.; Dittmann, J.; Vielhauer C.; Video surveillance: A new forensic model for
the forensically sound retrieval of picture content off a memory dump; In Proceedings of In-
formatik2009 - Digitale Multimedia-Forensik, pp 1619–1633, 2009.
9. Kiltz, S.; Hildebrandt, M.; Dittmann, J.: Forensische Datenarten und -analysen in automotiven
Systemen; In: Patrick Horster, Peter Schartner (Hrsg.), D·A·CH Security 2009; Syssec; Bo-
chum; 19./20. Mai 2009, ISBN: 978-3-00027-488-6, 2009.
10. Hoppe, T.; Holthusen, S.; Tuchscheerer, S.; Kiltz, S.; Dittmann, J.: Sichere Datenhaltung im
Automobil am Beispiel eines Konzepts zur forensisch sicheren Datenspeicherung; In: Sicher-
heit 2010, LNI P-170, ISBN 978-3-88579-264-2. pp. 153-164, 2010.
11. Hoppe, T.; Kiltz, S.; Dittmann, J.: Applying Intrusion Detection to Automotive IT – Early In-
sights and Remaining Challenges; In: Journal of Information Assurance and Security (JIAS),
ISSN: 1554-1010, Vol. 4, Issue 6, pp. 226-235, 2009.
12. Hoppe, T.; Exler, F.; Dittmann, J.: IDS-Signaturen für automotive CAN-Netzwerke; In: Peter
Schartner, Jürgen Taeger (Hrsg.), D·A·CH Security 2011; Syssec; ISBN: 978-3-00-034960-7,
pp. 55-66; 2011.
13. Müter, M.; Hoppe, T.; Dittmann, J.: Decision Model for Automotive Intrusion Detection Sys-
tems; In: Automotive - Safety & Security 2010; Shaker Verlag, Aachen, ISBN 978-3-8322-
9172-3, pp. 103-116, 2010.
15. Rehse, T.: Semantische Analyse von Navigationsgeräten und Abgleich von Daten aus dem
Fahrzeugbussystem mit dem Ziel der Rekonstruktion von Fahrtrouten für den IT-forensischen
Nachweis. Master thesis, Otto-von-Guericke-University of Magdeburg, 2011.
Towards an IT Security Protection Profile for Safety-
related Communication in Railway Automation
Abstract. Some recent incidents have shown that the vulnerability of IT systems in railway automation has possibly been underestimated so far. Fortunately, almost only denial-of-service attacks have been successful so far, but due to several trends, such as the use of commercial IT and communication systems or privatization, the threat potential could increase in the near future. However, up to now, no harmonized IT security requirements for railway automation exist. This paper defines a reference communication architecture which aims to separate IT security and safety requirements as well as certification processes as far as possible, and discusses the threats and IT security objectives, including typical assumptions in the railway domain. Finally, examples of IT security requirements are stated and discussed based on the approach advocated in the Common Criteria, in the form of a protection profile.
1 Introduction
What distinguishes railway systems from many other systems is their inherently distributed and networked nature, with tens of thousands of kilometres of track length for large operators. Thus, it is not economical to completely protect this infrastructure against physical access and, as a consequence, railways are very vulnerable to physical denial-of-service attacks leading to service interruptions.
Another distinguishing feature of railways is the long lifespan of their systems and components. Current contracts usually demand support for over 25 years, and history has shown that many systems, e.g. mechanical or relay interlockings, last much longer. IT security analyses have to take such long lifespans into account. Nevertheless, it should also be noted that at least some of the technical problems are not railway-specific, but are shared by other sectors such as Air Traffic Management [5].
Publications and presentations related to IT security in the railway domain are increasing. Some are particularly targeted at the use of public networks such as Ethernet or GSM for railway purposes [2], while others, at least rhetorically, pose the question "Can trains be hacked?" [3]. As mentioned above, some publications give detailed security-related recommendations [4]. While harmonized safety standards for railway automation were elaborated more than a decade ago, up to now no harmonized IT security requirements for railway automation exist.
This paper starts with a discussion of the normative background, then defines a reference communication architecture which aims to separate IT security and safety requirements as well as certification processes as far as possible, and discusses the threats and IT security objectives, including typical assumptions in the railway domain. Finally, examples of IT security requirements are stated and discussed based on the approach advocated in the Common Criteria, in the form of a protection profile.
2 Normative Background
The purely safety-related aspects of electronic hardware are covered by EN 50129 [7]. However, security issues are taken into account by EN 50129 only insofar as they affect safety; for example, denial-of-service attacks often do not fall into this category. Questions such as intrusion protection are only covered by one requirement in Table E.10 (protection against sabotage exists). However, EN 50129 provides a structure for a safety case which explicitly includes a subsection on protection against unauthorized access (both physical and informational). Other security objectives could also be described in that structure.
On the other hand, industrial standards on information security exist. Here we can name the following standards:
• ISO/IEC 15408 [8] provides evaluation criteria for IT security, the so-called Common Criteria [13-15]. This standard is solely centered on information systems and has, of course, no direct relation to safety systems.
• ISA 99 [9] is a set of 12 standards currently being elaborated by the Industrial Automation and Control System Security Committee of the International Society for Automation (ISA). This standard is not railway-specific and focuses on industrial control systems. It addresses different hierarchical levels, starting from concepts and going down to components of control systems.
A more comprehensive overview of existing information security standards is presented in [10]. From these standards it can be learnt that, for information security, not only technical aspects of concrete technical systems need to be taken into account, but also circumstances, organization, humans, etc. Certainly, not all elements mentioned in the general information security standards can and need to be used for a railway system.
How is the gap between information security standards for general systems and railways to be bridged? The bridge is provided by the European Commission Regulation on common safety methods No. 352/2009 [11]. This Commission Regulation mentions three different methods to demonstrate that a railway system is sufficiently safe:
a) by following existing rules and standards (application of codes of practice),
b) similarity analysis, i.e. showing that the given (railway) system is equivalent to an existing and used one,
c) explicit risk analysis, where the risk is assessed explicitly and shown to be acceptable.
We assume that, from the process point of view, security can be treated just like safety, meaning that threats would be treated as particular hazards. Using the approach under a), the Common Criteria [8] or ISA 99 [9] may be used in railway systems, but a particular tailoring would have to be performed due to different safety requirements and application conditions. By this approach, a code of practice that is approved in other areas of technology and provides a sufficient level of security can be adapted to railways. This ensures a sufficient level of safety.
However, application of the general standards [8] or [9] requires tailoring them to the specific needs of a railway system. This is necessary to cover the specific threats associated with railway systems and possible accidents, and to take into account other specific risk-reducing measures already present in railway systems, such as the use of specifically trained personnel.
As a basis for our work, the Common Criteria [8] have been selected, as ISA 99 was not finalized in spring 2011, when this work started. The use of the Common Criteria may enable the reuse of systems for railway applications that have already been assessed and certified for other areas of application. This is especially relevant as an increasing number of commercial off-the-shelf (COTS) products are being used and certified against the Common Criteria. With this approach, a normative base has been developed by the German standardization committee DKE [17], based on the Common Criteria and a specific protection profile tailored for railways, considering railway-specific threats and scenarios and yielding a set of IT security requirements. Assessment and certification of such a system can be carried out by independent expert organizations. Safety approval in Germany could then be achieved via the governmental organizations Federal German Railways Office (Eisenbahn-Bundesamt, EBA) for railway aspects and the Federal German Office for Security in Information Technology (Bundesamt für Sicherheit in der Informationstechnik, BSI) for IT security aspects.
3 Reference Architecture
Based on this onion skin model a reference model for communication (see Figure
2) has been chosen, in which the RST applications are in a zone A or B. It is assumed
that, if communication between the zones were through a simple wire (as a model for
RST in itself is safe but not necessarily secure. Often there is a misconception in the railway world that, by having safe signaling technology, security issues do not have to be taken care of. In this section, we will discuss the security threats which aim directly at signaling applications.
4.2 Threats
There is a common notion in the Common Criteria that threats are directed towards the three major IT security aspects: confidentiality, integrity and availability. One approach might be to analyze threats on this very high level. However, in our case experience has shown that only availability can be used directly as a threat; the other aspects need to be detailed further to derive security objectives.
In railway signaling, the starting point is EN 50129, where the safety case explicitly demands addressing the aspect of unauthorized access (physical and/or non-physical). In general, threats can be described on a higher system level.
The threats can be categorized into threats which are to be taken care of by the TOE and threats which have to be dealt with by the safety system or the environment. Some threats regarding communication issues can be taken from EN 50159. This standard explores in detail the security issues inherent to communication networks. Threats taken from this standard are often defined on a lower level and are not discussed in this paper.
The threats have been listed using the following structure:
t.<attack>{.<initiator>.<further properties>}
where t stands for threat and initiator for the initiator of the attack, typically a user, an attacker or a technical problem, such as a software error. As the security profile is generic, in most cases there has been no further detailing.
It is not prudent to list all threats in this paper; we will only list threats at the highest level. The lower levels give more properties, e.g. regarding the particular types and means of an attack. We will name the initiators taken into account. The following threats have been used for the security profile. They are threats that have to be controlled by the IT system:
• t.availability: Authorized users cannot obtain access to their data and re-
sources.
• t.entry: Persons who should not have access to the system may enter the sys-
tem. The initiator of such a threat could be an attacker who masks him-
self/herself as an authorized user.
• t.access: Authorized users gain access to resources which they are not enti-
tled to according to the IT security policy. The initiator is an authorized user.
The system is manipulated by negligence or operating errors.
• t.error: An error in part of the system leads to vulnerability in the IT security
policy. An error can also be the result of a failure. The initiator of such a
threat can be an attacker.
• t.crash: After a crash, the IT system is no longer able to correctly apply the
IT security policy.
• t.repudiation: Incidents which are IT security-related are not documented or
cannot be attributed to an authorized user.
4.3 Assumptions
The identification of threats depends on assumptions. As threats usually arise at the
system boundary, the assumptions are related to the boundary and the environment.
Some important assumptions are:
• a.entry: At least some parts of the system are in areas which are accessible
for authorized persons only.
• a.protection: All system parts of the IT security system are protected directly
against unauthorized modifications or there are (indirect) organizational
measures which allow effective protection. This includes protection against
elementary events.
• a.user: Users are correctly and sufficiently trained. They are considered
trustworthy. This does not mean that users are expected to work error-free;
their interactions with the system are therefore logged.
4.4 Objectives
In order to protect against threats, security objectives are defined. For the sake of
brevity, we can demonstrate this process only for one example in Table 1:
In general, it is possible to show that the security objectives cover the threats com-
pletely, but the argument for each threat relies on expert opinion and does not give a
formal proof.
Those portions of a TOE that must be relied on for correct enforcement of the func-
tional security requirements are collectively referred to as the TOE security function-
ality (TSF). The TSF consists of all hardware, software, and firmware of a TOE that
is either directly or indirectly relied upon for security enforcement.
The Common Criteria, Part 2 [13], defines an extensive list of security functions
and requirements in a formalized language. Thus, the next step is to try to satisfy the
security objective by a subset of the security functions. As a countercheck, a walk-
through of all functions was performed. Table 2 shows an overview of the functional
classes and the selected IT security functions as specified in the Common Criteria part
2.
Class FMT specifies a large number of generic configuration and management re-
quirements, but leaves freedom to implement particular role schemes.
Classes FPT, FRU and FTA deal with protection of the TOE and the TSF them-
selves. The requirements covered include self-testing and recovery as well as preser-
vation of a secure state which is very similar to requirements from EN 50129:
“FPT_FLS.1: The TSF shall preserve a secure state when the following types of fail-
ures occur: [assignment: list of types of failures in the TSF].” It was decided to apply
this generic requirement rigorously to any failure of the TSF.
Finally, as a plausibility check, coverage of the security objectives by the security
requirements is evaluated (see Table 3 for an example).
6 Summary
This paper has defined a reference communication architecture, which aims to sepa-
rate IT security and safety requirements as far as possible, and discussed the threats
and IT security objectives including typical assumptions in the railway domain. Ex-
amples of IT security requirements have been stated and discussed based on the ap-
proach advocated in the Common Criteria, in the form of a protection profile [17].
The goal is to use COTS security components which can be certified according to the
Common Criteria, also in the railway signaling domain, instead of creating a new
certification framework. The work presented is still ongoing (the public consultation
ends September 2012), in particular with respect to approval of the protection profile
and practical experience.
7 References
1. http://www.nextgov.com/nextgov/ng_20120123_3491.php?oref=topstory, accessed on
February, 7, 2012
2. Stumpf, F.: Datenübertragung über öffentliche Netze im Bahnverkehr – Fluch oder Se-
gen?, Proc. Safetronic 2010, Hanser, München
3. Katzenbeisser, S.: Can trains be hacked?, 28th Chaos Communication Congress, Hamburg,
2011
4. Thomas, M.: Accidental Systems, Hidden Assumptions and Safety Assurance, in: Dale, C.
and Anderson, T, (eds.) Achieving System Safety, Proc. 20th Safety-Critical Systems Sym-
posium, Springer, 2012
5. Johnson, C.: CyberSafety: CyberSecurity and Safety-Critical Software Engineering, in:
Dale, C. and Anderson, T, (eds.) Achieving System Safety, Proc. 20th Safety-Critical Sys-
tems Symposium, Springer, 2012
6. EN 50159 Railway applications, Communication, signaling and processing systems –
Safety related communication in transmission systems, September 2010
7. EN 50129 Railway applications, Communication, signaling and processing systems –
Safety-related electronic systems for signaling, February 2003
8. ISO/IEC 15408 Information technology — Security techniques — Evaluation criteria for
IT security, 2009
9. ISA 99, Standards of the Industrial Automation and Control System Security Committee of
the International Society for Automation (ISA) on information security, see
http://en.wikipedia.org/wiki/Cyber_security_standards
10. BITKOM / DIN Kompass der IT-Sicherheitsstandards Leitfaden und Nachschlagewerk 4.
Auflage, 2009
11. Commission Regulation (EC) No. 352/2009 of 24 April 2009 on the adoption of a com-
mon safety method on risk evaluation and assessment as referred to in Article 6(3)(a) of
Directive 2004/49/EC of the European Parliament and of the Council
12. Common Criteria for Information Technology Security Evaluation, Version 3.1, revision
3, July 2009. Part 1: Introduction and general model
13. Common Criteria for Information Technology Security Evaluation, Version 3.1, revision
3, July 2009. Part 2: Functional security components
14. Common Criteria for Information Technology Security Evaluation, Version 3.1, revision
3, July 2009. Part 3: Assurance security components
15. Wickinger, T.: Modern Security Management Systems (in German), Signal&Draht, No. 4,
2001
16. DB AG: European Patent Application EP 2 088 052 A2, 2000
17. DIN V VDE V 0831-102: Electric signaling systems for railways – Part 102: Protection
profile for technical functions in railway signaling (in German), Draft, 2012
Towards Secure Fieldbus Communication
1 Introduction
Industrial automation systems use fieldbus communication for real-time dis-
tributed control of systems such as water supply, energy distribution, or man-
ufacturing. Security for fieldbus communication was not considered to be an
important issue, since these systems were typically deployed in closed environ-
ments. However, since fieldbus installations become more and more automated
and cross-linked, security becomes more and more important. For example, the
cables of the fieldbus-connections in a wind park used to connect the wind tur-
bines with a central control system can be accessed by an adversary since it is
not possible to protect the whole area. If wireless fieldbuses [3] are used, attacks
such as eavesdropping on the communication are even easier for an adver-
sary. Thus, security mechanisms have to be applied. To enable the compliance
with real-time requirements as well as to provide transparent security to higher
layers, security mechanisms have to be integrated into the fieldbus layer.
These security mechanisms have to protect the confidentiality of the field-
bus communication to prevent an adversary from eavesdropping to get sensitive
information such as the temperature profile of a beer brewing process, which
2 Related Work
Since most fieldbus systems have been used in closed systems, only a few ap-
proaches are designed to provide security, e.g., [14] which is based on IEEE
802.15.4 [13]. Here, Block Ciphers (BCs) in CCM mode are used, and messages are
padded to full block length. This is a major disadvantage when many short tele-
grams are transmitted, as in typical fieldbus communication. Adding security
mechanisms such as IPsec for Internet Protocol (IP)-based fieldbuses is discussed
in [20]. Introducing security mechanisms at higher levels is also discussed in [21].
Secure industrial communication using Transmission Control Protocol/Internet
Protocol (TCP/IP) is addressed in [4], where the necessary reaction times of
automation fieldbuses cannot be reached.
In the area of Building Automation Control (BAC), an approach for secure
fieldbus communication is presented in [18] using Data Encryption Standard
(DES) and Hashed MAC (HMAC) with SHA-1 on smartcards. In [7], the secu-
rity of wireless BAC networks is discussed. BAC networks have smaller band-
width and the presented solutions are not fast enough for general fieldbuses in
automation, where the data rate is much higher and the real-time constraints
tighter than in BAC applications.
3 Protocol Description
In this section, we describe our proposed protocol in detail. We first discuss the
requirements we address. Then we describe our scheme to combine a SC with a
MAC. Finally, we describe the protocol steps in detail.
where real-time constraints do not apply. Hybrid techniques similar to this one
are widely in use.
Furthermore, our protocol ensures authenticity, integrity, freshness, and con-
fidentiality of the fieldbus communication assuming an active attacker attack-
ing the fieldbus communication. Availability is not considered, since protection
against an active attacker is usually not possible.
An important design principle of our protocol is the exchangeability of the
used Stream Cipher (SC) and MAC primitives and adaptability of security lev-
els. If an underlying primitive becomes insecure during the long life-time of
automation systems, easy substitution is required to fix those systems.
The generic SC and MAC scheme (cf. Figure 1) uses two distinct parts of the
output of only one SC. One part is used for encryption, the other as input of
a MAC scheme. We assume that the underlying MAC construction and the SC
are secure and have deterministic runtime.
The inputs of the scheme are
– payloads pl(0..n), all of the same fixed length (|pl(i)| = |pl(j)| ∀ 0 ≤ i, j ≤ n),
– a key k and
– an initialization vector iv.
In this section, we describe the protocol steps in the two phases of our protocol,
i.e., initialization and operational phase.
[Fig. 1 (schematic): the SC, initialized by s(0) = f_init(iv, k) and advanced by
s(i+1) = f(s(i)), produces the stream s(0..n), which is partitioned into
otp_mac(0..n) and otp_enc(0..n); Enc maps pl(0..n) to c(0..n), and the MAC
over c(0..n) yields mac(0..n).]
Initialization Phase This phase has no real-time requirements; therefore the use
of asymmetric cryptography is possible. During this phase, trust is established,
parameters, such as cipher choice, key- and MAC-length, are negotiated and
keys are exchanged. The key-exchange of the communicating parties has to be
triggered in advance by one party knowing the network topology, which is usually
the master. A Diffie-Hellman key-exchange using trust anchors for authentication
can be used as key-exchange protocol.
The SCs of every party are initialized using the exchanged key (and other
parameters) and iv := 0, resulting in the same state s(0) of the SCs.
States of a Participant Each party has to keep a state per communication rela-
tionship consisting of:
– secret key k,
– current iv,
– current state s of the SC,
– fixed payload length |pl|,
– fixed MAC length |mac|, and
[Fig. 2 (schematic): regular operation between Device A and Device B, both in
state s(i); each partitions the next stream into otp_enc(i) and otp_mac(i) and
advances to s(i+1); the payload pl(i) is transmitted as c(i)||mac(i) and
recovered as pl(i) at the receiver.]
Regular Operation The regular operation is sketched in Figure 2. All parties share
the same state of the SC. Since each payload has the same length, the execution
of the security algorithms consumes the same amount of cipher-stream for each
payload. Given the actual state s(i) of the SC, each successor state s(j) (j > i)
and the corresponding otp_mac and otp_enc are computable in advance without
knowing the payloads (cf. Figure 1).
Each time a payload pl is passed to the security layer, the cipher-stream is
first used as otp_enc for encryption and afterwards as otp_mac for integrity protec-
tion. A ciphertext is built with the encryption algorithm Enc (which computes
c := otp_enc ⊕ pl). Then the MAC secures the authenticity and integrity of the ci-
phertext consuming otp_mac. The telegram transmitted over the fieldbus consists
of the ciphertext concatenated with the mac.
The receiver uses the same cipher-stream, thus resulting in the same state
as the sender. It first checks the correctness of the mac with the verification
algorithm Vrf. If the mac is verified successfully, the ciphertext is decrypted (by
using bitwise XOR with the same otp_enc the sender had used to encrypt the
ciphertext). The resulting payload is then passed to the application.
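To make these steps concrete, the following minimal Python sketch mirrors the sender
and receiver processing described above. It is an illustration only: instead of
Grain-128a and the Toeplitz-based MAC used later in the paper, a hash-derived
keystream and a truncated HMAC serve as stand-ins with the same interface, and the
payload and key-derivation details are our assumptions.

import hmac, hashlib

PL_LEN = 8     # fixed payload length |pl| in bytes (assumption)
MAC_LEN = 10   # fixed MAC length |mac| = 80 bit, as chosen for the prototype

def next_stream(key: bytes, iv: int, i: int) -> bytes:
    # Stand-in for the SC: f_init/f collapsed into one keyed function of
    # (iv, telegram index i); a real implementation would step Grain-128a's
    # internal state instead of rehashing.
    return hashlib.sha256(key + iv.to_bytes(12, "big") + i.to_bytes(8, "big")).digest()

def protect(key: bytes, iv: int, i: int, pl: bytes) -> bytes:
    s = next_stream(key, iv, i)
    otp_enc, otp_mac = s[:PL_LEN], s[PL_LEN:PL_LEN + 16]
    c = bytes(p ^ o for p, o in zip(pl, otp_enc))        # c := otp_enc XOR pl
    mac = hmac.new(otp_mac, c, hashlib.sha256).digest()[:MAC_LEN]  # MAC over c
    return c + mac                                       # telegram = c || mac

def verify_and_decrypt(key: bytes, iv: int, i: int, telegram: bytes) -> bytes:
    c, mac = telegram[:PL_LEN], telegram[PL_LEN:]
    s = next_stream(key, iv, i)
    otp_enc, otp_mac = s[:PL_LEN], s[PL_LEN:PL_LEN + 16]
    expected = hmac.new(otp_mac, c, hashlib.sha256).digest()[:MAC_LEN]
    if not hmac.compare_digest(mac, expected):           # Vrf: check mac first
        raise ValueError("MAC verification failed")
    return bytes(x ^ o for x, o in zip(c, otp_enc))      # decrypt only if valid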
Regularly, the initialization vector is changed due to a specific schedule (e.g.,
every fixed number of telegrams). If the states of the communication parties
are no longer synchronized, mechanisms for resynchronization are required. At
a data rate of 100 Mbit/s of the fieldbus, the SC Grain-128a, which we used
in our implementation, can be used with one initialization vector over a typical
operational uptime, so renewing the initialization vector is only necessary if
synchronization is lost.
4 Security Discussion
In this section, we discuss the security of our proposed protocol, i.e., how it pro-
tects the integrity, confidentiality, authenticity, and freshness of the transmitted
messages.
[Figure (schematic): after a telegram loss, Device A and Device B leave the common
state s(i) and Device B ends up in a deviating state s*_B.]
[Figure (schematic): resynchronization after several losses, when no real-time
operation is possible and windowing fails; Device B increments iv_B, reinitializes
s_B := f_init(iv_B) = s*(0), takes otp*_mac from the next stream and announces
syncLost||iv_B protected by mac* := MAC_otp*_mac(syncLost||iv_B).]
In the following, we assume that the used stream cipher and the MAC are
each secure, i.e., an adversary can neither decrypt messages encrypted with
the stream cipher nor forge valid MACs without knowing the cryptographic
key. Since the basis of our protocol is the proposed generic construction, we
show that using this construction is also secure. First, we show that given any
set of pairs (c(0..n), mac(0..n)), where c(i) = pl(i) ⊕ otp_enc(i) and mac(i) =
MAC_otp_mac(i)(c(i)) for payloads pl(i), 0 ≤ i ≤ n, an adversary cannot get infor-
mation about any pl(i) by eavesdropping these pairs. Second, we show that an
adversary is not able to forge a valid pair (x, mac(i)) = (x, MAC_otp_mac(i)(x)) for
any arbitrary binary string x.
In the first case, the generic construction loses its security properties if
two different messages are ever encrypted with the same cipher-stream. Thus,
otp_enc(i) has to be different for each pl(i). To achieve this, the SC changes the
state for each transmitted telegram. The initial state s(0) is calculated using the
initialization vector iv and key k: s(0) = f_init(iv, k), and all subsequent states
are calculated according to s(i+1) = f(s(i)). Each state results in a different
output which is partitioned into otp_mac(i) and otp_enc(i). When the scheme is
reinitialized, a new iv is used by incrementing the old one. Assuming the size
of the iv is carefully chosen to prevent overflows, an iv is only used once. Thus,
a different otp_enc(i) is always used for each pl(i). This reduces the security of
the scheme to the security of the used stream cipher. Since we assumed that
the stream cipher is secure and protects the confidentiality of the transmitted
messages, this also holds for the generic scheme.
In the second case, the security of the MAC is solely based on the secrecy of
the used key since we assumed that the used MAC construction is secure. Thus,
an adversary can only forge a valid (x, mac) pair if he can derive the key k or
the correct otp_mac(x). However, since we assumed the applied stream cipher is
secure, an adversary can obtain neither of them by analyzing eavesdropped
pairs (c(0..n), mac(0..n)).
Thus, an adversary cannot successfully eavesdrop on telegrams or inject new
telegrams. Furthermore, an adversary cannot successfully replay telegrams, since
the freshness is ensured by changing the internal state after each correctly verified
telegram. Likewise, the dropping of telegrams is detected and a resynchronization
is initiated.
In our implementation, we have used Grain-128a [9] with 128 bit key and 96 bit
initialization vector as underlying stream-cipher of the generic scheme. The
Grain-128a cipher is based on the well-analyzed cipher Grain [10] which can
be easily implemented in hardware, provides high performance, and has deter-
ministic runtime. As MAC, we have chosen the Toeplitz matrix based approach,
which is easy to implement in hardware and also has deterministic runtime. We
chose a MAC length of 80 bit which provides a reasonable security level for most
applications. The telegrams are embedded as process data in regular EtherCAT
telegrams [12].
We have developed a corresponding prototype software implementation in C
[16]. The implementation is currently not optimized for speed. In future appli-
cations it might be possible to run the security stack in hardware in order to
reach higher performance. The master and slave were both running on the same
Microsoft Windows XP Professional SP3, Intel Core2Duo [email protected] GHz,
2 GB RAM machine during the runtime measurements. This configuration re-
sembles widely used IPCs. The slave controller is a Beckhoff FC1100 card [1].
For the proof-of-concept implementation, EtherCAT was used in a synchronous
mode, triggering the slave application to run once on each incoming datagram.
The master cannot be executed in hard real-time (this is only possible with pro-
gramming languages defined in [11]). As a replacement for the missing real-time
capabilities, the multimedia timer of Microsoft Windows was used to achieve a
de-facto cycle-time of 1 ms.
The first measurements show that the security layer generates only negligible
overhead: at a cycle-time of 10 ms, the non-secure slave application runs in 7 µs,
compared to the execution time of 8 µs of the slave application with enabled
security layer. The execution times for resynchronization are not significantly
longer. Those measurements do not consider the transmission-time overhead of
the MAC. More extensive measurements, also at shorter cycle-times, will be part
of future work.
6 Conclusion
In this paper, we presented a protocol to secure the fieldbus communication of
automation systems while maintaining real-time requirements. The basis of the
protocol is a generic scheme which combines a stream cipher with a MAC to
ensure integrity, confidentiality, authenticity, and freshness of transmitted mes-
sages using only one key for cipher and MAC to facilitate key management.
The scheme relies solely on symmetric primitives, which are much more efficient
than asymmetric primitives, to support the use in resource-constrained systems
as well as to enable small cycle times for real-time communication. We chose a
stream cipher since they typically execute at a higher speed than block ciphers
and have lower hardware complexity. The security of our protocol relies on the
security of the used stream cipher and MAC construction. By adjusting the key
length, the protocol can be adapted according to the application requirements.
Our proof-of-concept implementation and the first results of our performed per-
formance analysis have shown the feasibility of our approach. As future work,
we plan to implement the protocol in hardware and perform more detailed per-
formance analyses. Another future topic is to provide exchangeability of SC and
MAC in the prototype.
References
1. Beckhoff Automation GmbH: FC1100 — PCI EtherCAT slave card (2011)
2. Bellare, M., Rogaway, P., Wagner, D.: The EAX Mode of Operation. In: Fast
Software Encryption, LNCS, vol. 3017, pp. 389–407. Springer (2004)
3. Brühne, M.: IEEE 802.11n und WLAN-Controller – Lohnt der Einsatz auch in
der Industrie. In: SPS/IPC/DRIVES : Elektrische Automatisierung, Systeme und
Komponenten (2011)
4. Damm, M., Leitner, S.H., Mahnke, W.: Security. In: OPC Unified
Architecture, pp. 1–51. Springer (2009)
5. Dworkin, M.: Recommendation for Block Cipher Modes of Operation: The CCM
Mode for Authentication and Confidentiality. NIST Special Publication 800-38C,
NIST - Computer Security Resource Center (2007)
6. Ferguson, N., Whiting, D., Schneier, B., Kelsey, J., Lucks, S., Kohno, T.: Helix:
Fast Encryption and Authentication in a Single Cryptographic Primitive. Fast
Software Encryption pp. 330–346 (2003)
7. Granzer, W., Reinisch, C., Kastner, W.: Future Challenges for Building Automa-
tion: Wireless and Security. In: Proc. IEEE Int Industrial Electronics (ISIE) Symp.
pp. 4415–4467 (2010)
8. Ågren, M., Hell, M., Johansson, T.: On Hardware-Oriented Message Authentica-
tion with Applications towards RFID. In: Lightweight Security Privacy: Devices,
Protocols and Applications (LightSec). pp. 26–33. IEEE Computer Society (2011)
9. Ågren, M., Hell, M., Johansson, T., Meier, W.: A New Version of Grain-128 with
Authentication. In: Symmetric Key Encryption Workshop. European Network of
Excellence in Cryptology II (2011)
10. Hell, M., Johansson, T., Meier, W.: Grain – A Stream Cipher for Constrained
Environments. International Journal of Wireless and Mobile Computing, Special
Issue on Security of Computer Network and Mobile Systems. 2(1), 86–93 (2006)
11. IEC: IEC 61131-3, Programmable controllers — Part 3: Programming languages
(2003), ed. 2
12. IEC: IEC 61158, Industrial communication networks — Fieldbus specifications
(2010), ed. 2
13. IEEE: IEEE 802.15.4, Wireless Medium Access Control (MAC) and Physical Layer
(PHY) Specifications for Low-Rate Wireless Personal Area Networks (WPANs)
(2006)
14. ISA: ISA100.11a Wireless systems for industrial automation: Process control and
related applications (2011)
15. Menezes, A.J., Vanstone, S.A., Oorschot, P.C.V.: Handbook of Applied Cryptog-
raphy (Discrete Mathematics and Its Applications). CRC Press, Inc., 5th printing
edn. (1996)
16. Microsoft Corporation: Microsoft Visual Studio 2010 Ultimate (2010), version
10.0.4.0129.1 SP1Rel
17. Rogaway, P., Bellare, M., Black, J.: OCB: A block-cipher mode of operation for
efficient authenticated encryption. ACM Trans. Inf. Syst. Secur. 6, 365–403 (2003)
18. Schwaiger, C., Treytl, A.: Smart Card Based Security for Fieldbus Systems. In:
Proc. IEEE Conf. Emerging Technologies and Factory Automation ETFA ’03.
vol. 1, pp. 398–406 (2003)
19. Szilagyi, C., Koopman, P.: Flexible Multicast Authentication for Time-Triggered
Embedded Control Network Applications. In: DSN. pp. 165–174. IEEE (2009)
20. Treytl, A., Sauter, T., Schwaiger, C.: Security Measures for Industrial Fieldbus
Systems – State of the Art and Solutions for IP-based Approaches. In: Proc. IEEE
Int Factory Communication Systems Workshop. pp. 201–209 (2004)
21. Treytl, A., Sauter, T., Schwaiger, C.: Security Measures in Automation Systems –
a Practice-Oriented Approach. In: Proc. 10th IEEE Conf. Emerging Technologies
and Factory Automation ETFA. vol. 2, pp. 847–855 (2005)
22. Whiting, D., Schneier, B., Lucks, S., Muller, F.: Phelix Fast Encryption and Au-
thentication in a Single Cryptographic Primitive. Tech. rep., ECRYPT Stream
Cipher Project Report 2005/027 (2005)
23. Wirt, K.T.: ASC – A Stream Cipher with Built–In MAC Functionality. World
Academy of Science, Engineering and Technology 29 (2007)
24. Wolf, M., Weimerskirch, A., Paar, C.: Security in Automotive Bus Systems. In:
Proceedings of the Workshop on Embedded Security in Cars (escar)’04 (2004)
25. Wolf, M., Weimerskirch, A., Wollinger, T.: State of the Art: Embedding Security
in Vehicles. EURASIP Journal on Embedded Systems (2007)
26. Wu, H., Preneel, B.: Differential-Linear Attacks against the Stream Cipher Phelix.
eSTREAM, ECRYPT Stream Cipher Project, Report 2006/056
27. Zoltak, B.: VMPC-MAC: A Stream Cipher Based Authenticated Encryption
Scheme. Fast Software Encryption (2004)
Extracting EFSMs of web applications for
formal requirements specification
1 Introduction
The continually increasing amount of critical corporate and personal information
that is managed and exchanged by web applications has heightened the need for
approaches that verify the safety and security of web applications. Model checking
and model-based testing have proven to be sound approaches to reducing the
time and effort associated with security testing of complex systems, but to apply
such methods a model of an application and a formal specification are required.
While source code is the most accurate description of the behavior of a web
application, this description is expressed in low-level program statements
and is hardly suitable for a high-level understanding of an application’s intended
behavior. Moreover, formal methods of quality assurance are hardly applicable to
the source code, as it is hard or almost impossible to formulate functional require-
ments against it. The goal of the presented research is to propose a method for the automatic
extraction of an extended finite state machine (EFSM) that would describe a given
web application and would be suitable for writing formal specification requirements
and for their automatic verification using existing model checking tools.
A web application can be described as a number of states and transitions
between these states. A transition between states occurs when an action is
triggered by the user of the application or by some server event (timer event,
push notification, etc.). Depending on a number of parameters, the target
state of a transition for the same action may vary. Such trigger conditions may
reside both on the client side of the application (Javascript variables and user
form inputs) and on the server side (database values, user permissions, etc.).
The research aims to propose solutions for a number of tasks:
1. Discover as many different application states as possible.
2. Discover all variables and factors that define a transition’s target state.
3. An algorithm that would reveal under which conditions a transition in the ex-
tracted model could be made.
4. An algorithm to measure similarity of the web application states.
5. A tool that would automatically build a human-readable EFSM of the web
application provided discovered states, transitions, conditions for these tran-
sitions and similarity measure function for states.
Extracted models of the web applications can then be used as the basis for
automated model-based test generation, for writing better-quality requirements for
design and implementation, and for applying model checking to verify a model against
the application requirements.
This paper is organized as follows. Section II contains a brief overview of the
related works and tools. In Section III details on the state discovery algorithm
are given. The model extraction method, simplification algorithms and the simi-
larity measure are introduced in Section IV. Section V describes an approach to
discover factors that define transitions’ target states. An overview of the developed
proof-of-concept tool and case studies are presented in Section VI. Section VII
concludes.
2 Related work
Web applications are usually tested by manually constructed test cases using
unit testing tools with capture-replay facilities such as Selenium [1] and Sahi [2].
These tools provide functionality for recording the GUI actions performed by
a user in test scripts, for running a suite of tests, and for visualizing test re-
sults. However, even with such tool support, testing remains a challenging and
time-consuming activity because each test case has to be constructed manually.
Moreover any change in the web application’s structure may require manual
revision of all the test cases or creation of a new test suite.
Several techniques that propose automatic model extraction and verification
of existing web applications have been presented in the literature. In [3] the
authors survey 24 different modelling methods used in web site verification and
testing. There is little research on the problem of extracting models from
existing web applications in order to support their maintenance and evolution [4–
6]. The common drawback of such approaches is that they aim to create a model,
useful for proper understanding of dynamic application behavior, but not a for-
mal model that can be verified against given requirements.
In [7] the authors propose a runtime enforcement mechanism that restricts
the control flow of a web application to a state machine model specified by
the developer, and use model checking to verify temporal properties on these
state machines. This approach implies manual development of a state model by
developer, which is time consuming and error prone, especially for complex web
applications.
In [8] the authors present an approach to model an existing web application as a
communicating finite automata model, based on user-defined properties to
be validated. Manual property definition is required in order to build a specific
model, while in our approach we automatically build the model first, which
gives the user a convenient way to formally specify the requirements. Also,
models retrieved in this approach could contain up to thousands of states and
transitions, which is not a human-readable representation and is therefore
not suitable for analysis.
An attempt to automate verification and model extraction is made in [9], but
they focus only on page transitions in web applications and are limited only to
web applications that have been developed using a web application framework,
such as Struts configuration files and Java Server Page templates. Our approach
supports a much wider range of web applications, due to the support of both
Java Server Page applications as well as Ajax Web applications that consist
of a single page whose elements are updated in response to callbacks activated
asynchronously by the user or by a server message.
The work most similar to our approach is described in [10]. The paper proposes a
state-based testing approach, specifically designed to exercise Ajax Web appli-
cations. Test cases are derived from the state model based on the notion of
semantically interacting events. The approach is limited to single page applica-
tions and Ajax callbacks. Our approach handles all the possible state changes,
which include page transitions and Javascript page elements manipulation in the
event handlers triggered by user actions, as well as Ajax callbacks. Handling a
Web application in whole makes it possible to apply our approach to real world
applications and to achieve more accurate models.
Research on model checking of web applications [12–14] concentrates mostly
on the model checking process, but not on the model extraction. Model ex-
traction is critically important for complex real world web applications because
straightforward model extraction would generate huge models with hundreds
of states and transitions for complex applications, which would be practically
useless. Creating the model manually is error-prone and time consuming. This
paper describes an approach that simplifies automatically extracted state and
transition information and generates human-readable models, which could sig-
4.1 Filter out DOM elements page state does not depend on
The DOM tree is traversed and all nodes of the following types are filtered
out: link, script, meta. Nodes of these types do not directly affect the page
state that the user could see or the set of actions that the user can make. Also
we propose to ignore text values of the elements, but compare only the DOM
structures. All the element attributes are ignored except the style attribute. The
style attribute may not be completely ignored as CSS could directly affect user’s
page perception: elements (including controls) could be made invisible or could
be disabled using CSS styles. For example an element could be present in the
source code but be unreachable for the user until he correctly fills some text
inputs. These two page states should be considered different as they support
different sets of possible user actions. For example, due to this step the following
nodes would be considered similar (a code sketch of the comparison follows the examples):
– <p class=’big-text’>text</p>;
– <p style=’color: red’>other text</p>.
As only DOM structures are compared, nodes with different tags are considered
different. The following two nodes would be considered different
even if in practice they could look similar:
– <span class=’big-text’>same text</span>;
– <div class=’big-text’>same text</div>.
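As an illustration of this comparison, the following Python sketch (using lxml, and
assuming that, of the style attribute, only visibility-affecting properties matter,
as motivated above) normalizes DOM nodes before comparing them; the function names
are ours, not the tool’s.

from lxml import html

IGNORED_TAGS = {"link", "script", "meta"}

def visible(node) -> bool:
    # Only style properties that hide or disable an element are relevant here.
    style = (node.get("style") or "").replace(" ", "").lower()
    return "display:none" not in style and "visibility:hidden" not in style

def normalize(node):
    # Keep tag names, visibility and child structure; ignore text values and
    # all other attributes.
    children = tuple(normalize(c) for c in node
                     if isinstance(c.tag, str) and c.tag not in IGNORED_TAGS)
    return (node.tag, visible(node), children)

def similar(a, b) -> bool:
    return normalize(a) == normalize(b)

# The two <p> nodes above are similar; a <span> vs. a <div> would not be:
assert similar(html.fragment_fromstring("<p class='big-text'>text</p>"),
               html.fragment_fromstring("<p style='color: red'>other text</p>"))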
A web application often contains links that lead to other web sites. These can
be informational links for the user (e.g. on Google’s search results page), links to
partner web sites or advertisement banners. It depends on the specific web site
whether these external elements on the web page are part of the web application’s
business logic or whether they are unimportant and can be filtered out. For web sites
with advertisement blocks, the case study showed that taking these external
elements into account while comparing states leads to a multiple-state
duplication problem. The same page of the application can be represented by
different DOM trees, as its source code is generated on the server side
and from time to time it may contain one advertisement banner, two banners
or none of them. There is no way to automatically detect whether an element is
important to the application logic or not. Therefore, for some web applications
it is reasonable to exclude all external elements, while for others this step
should be omitted.
External dependencies are elements that depend on the external web sites
and are detected by the following set of rules:
nodes are replaced by one node, x_1. Due to the “collapse” step, pages like a mail
inbox or a task list, which differ only in the number of similar items, become
similar and the extracted model makes much more sense.
The “collapse” step is implemented by the following algorithm (a code sketch follows the list):
1. Traverse the DOM tree, starting from its leaves.
2. For a given node, fetch the list of its child nodes list_c.
3. Check all possible pairs x_i, x_j ∈ list_c and, if
similar(x_i, x_j) == True, remove node x_j.
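A compact Python rendering of this algorithm, reusing the similar() predicate
sketched in Sect. 4.1 and operating on lxml elements (an illustration, not the
tool’s code):

def collapse(node):
    # 1. Start from the leaves.
    for child in list(node):
        collapse(child)
    # 2./3. Among the children, remove every node similar to an earlier one.
    kept = []
    for child in list(node):
        if any(similar(child, k) for k in kept):
            node.remove(child)
        else:
            kept.append(child)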
It should be noted that there are more complex cases where this algorithm
does not work, for example, if a repeating page item consists not of one
DOM subtree but of a sequence of DOM subtrees. The following listing gives an
example of a list of repeating items that should be collapsed, but that the
proposed algorithm is not able to handle:
<h1>Title1</h1>
<p>text1</p>
<h1>Title2</h1>
<p>text2</p>
...
<h1>TitleN</h1>
<p>textN</p>
The pattern discovery algorithms will be used to handle such situations.
6 Case study
The proof-of-concept tool was developed using the Python 2.7 programming lan-
guage and the Selenium [1] and Graphviz [16] frameworks. The current version of the tool is
capable of automated web application analysis. The tool currently provides
a console interface and produces output in the form of an XML file, describ-
ing the extracted model in the form of an FSM, and a PNG image. The XML
in a reasonably short execution time (10 minutes). State models containing 80-
200 states and transitions between them are useless in practice, as they are
not human-readable and it is impossible to write down any adequate formal re-
quirements using them. For the TadaList.com and m.VK.com applications, the
proposed simplification algorithms were able to produce models that contain
fewer than 20 states. Such models are human-readable and would be useful for
developers and QA specialists. For more complex web sites, the models contain more
states and a manual review of the produced models is advisable. While the cor-
rectness of the models and completeness with respect to the source code could
not be proven, the models could be verified against specification requirements
or used to generate test suites with high state coverage.
7 Conclusion
In this paper we have presented an approach to extract an extended finite state
model of an existing web application. Such a model is suitable for writing
formal specification requirements. The extracted finite state model’s XML repre-
sentation can be automatically converted into the Promela format and serve as
input to the Spin model checker. Properties to be verified can be expressed
as Linear Temporal Logic (LTL) formulas. Navigational requirements, which are
often an important concern for web application developers, can be con-
veniently formulated in LTL. There are examples of common requirements that
would be useful to check for most of the applications: “On all paths from page
References
1. Antawan Holmes, Marc Kellogg, Automating Functional Tests Using Selenium,
AGILE 2006: 270–275
2. Web test automation tool. http://sahi.co.in/w/sahi
3. Alalfi, M.H., Cordy, J.R., Dean, T.R.: Modelling methods for web application veri-
fication and testing: state of the art. Softw. Test., Verif. Reliab. (2009) 265–296
4. Hassan AE, Holt RC. Architecture recovery of web applications. Proceedings of the
24th ICSE, ACM Press: New York, NY, USA, 2002; 349–359.
5. Antoniol G, Di Penta M, Zazzara M. Understanding Web Applications through
Dynamic Analysis. Proceedings of the IWPC 2004; 120–131.
6. Di Lucca GA, Di Penta M. Integrating Static and Dynamic Analysis to improve
the Comprehension of Existing Web Applications. Proceedings 7th IEEE WSE:
Washington, DC, USA, 2005; 87–94.
7. Sylvain Hallé, Taylor Ettema, Chris Bunch, Tevfik Bultan: Eliminating navigation
errors in web applications via model checking and runtime enforcement of navigation
state machines. ASE 2010: 235–244
8. Haydar, M.: Formal Framework for Automated Analysis and Verification of Web-
Based Applications. ASE 2004: 410–413
9. Atsuto Kubo, Hironori Washizaki, Yoshiaki Fukazawa, ”Automatic Extraction and
Verification of Page Transitions in a Web Application,” APSEC 2007: 350–357
10. Marchetto, A., Tonella, P., Ricca, F.: State-Based Testing of Ajax Web Applica-
tions. ICST 2008: 121–130
11. Zakonov A., Stepanov O., Shalyto A.A. GA-Based and Design by Contract Ap-
proach to Test Generation for EFSMs. IEEE EWDTS 2010: 152–155.
12. Y. Huang, F. Yu, C. Hang, C. Tsai, D.T. Lee, and S. Kuo, ”Verifying Web Appli-
cations Using Bounded Model Checking”, DSN 2004: 199–208.
13. Homma, Kei and Izumi, Satoru and Abe, Yuki and Takahashi, et al.”Using the
Model Checker Spin for Web Application Design”, SAINT 2010: 137–140
14. Homma, K. Izumi, S. Takahashi, K. Togashi, A., et al ”Modeling Web Applications
Design with Automata and Its Verification”, ISADS 2011: 103–112
15. Document Object Model by the World Wide Web Consortium,
http://www.w3.org/DOM/
16. Kaufmann, M., Wagner, D. (eds.): Drawing Graphs: Methods and Models.
Springer, 2001. 326 pages
An ontological approach to systematization
of SW-FMEA
1 Introduction
Failure Modes and Effects Analysis (FMEA) is an analysis method used in the
development of industrial systems subject to dependability and safety require-
ments, for the identification of failure modes, their causes and effects, in order
to determine actions reducing the impact of failure events [18]. The analysis is
carried out since the initial phases of the development process, when mitigating
actions can be more easily taken.
FMEA was first standardized by the US Department of Defense in the MIL-
STD-1629A standard [33], then it was extended to many other industrial contexts
[Diagram: owl classes Operation Mode (datatype property CriticalityValue) and
Failure Effect (datatype property SeverityLevel), linked via the object properties
is_in_OP and has_FE (cardinality 1..* each).]
Fig. 1. UML representation of the intensional part of the ontology modeling the con-
cepts involved in the Top Level Functional FMEA.
delivered service deviates from the correct implementation of the system function
[1]. Each Failure Mode is associated with the Failure Effects produced on the
system, classified through Severity Levels. Failure modes have their own Severity
Level, determined as the highest severity level of their Failure Effects. The num-
ber, the labeling and the characteristics of the severity levels are defined by the
standards adopted in the specific context [4,6,26] and involve the dependability
attributes (Reliability, Availability, Maintainability, Safety) for the considered
application.
Each functional requirement is characterized by its own Criticality Value
which, in our reference context, is defined as a function of both the severity levels
of associated failure modes and the criticality values of associated operation
modes.
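The severity propagation rule just stated can be rendered directly; in this Python
sketch the numeric severity scale is an illustrative assumption, since the actual
levels are defined by the adopted standards, and the criticality function for
requirements is context-defined and therefore omitted.

from dataclasses import dataclass
from typing import List

@dataclass
class FailureEffect:
    severity_level: int  # higher value = more severe (illustrative scale)

@dataclass
class FailureMode:
    effects: List[FailureEffect]

    @property
    def severity_level(self) -> int:
        # The highest severity level of the associated Failure Effects.
        return max(e.severity_level for e in self.effects)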
The second phase of the process is carried out when the SW architecture is
already designed.
[Diagram: owl classes Method and Failure Effect (datatype property SeverityLevel),
linked via the object properties has_method and has_FE (1..*).]
Fig. 2. UML representation of the intensional part of the ontology modeling the con-
cepts involved in the Detailed Functional FMEA.
Fig. 2 shows the involved ontological concepts which mostly model structural
SW elements, represented by class Item. An item can be the entire CSCI, a SW
Component, a SW Module, or a Method. The model is hierarchically organized
from the entire CSCI to the methods, i.e. the smallest SW part with precise func-
tionalities. The CSCI is made of SW components, which are physically organized
in SW modules, containing methods written in some programming language,
which in our reference context are C and Assembly.
Structural elements are associated with the implemented functionalities. Each
Functionality is associated with one or more Failure Modes, which are in turn
associated with their Failure Effects, in the same way as in the Top Level Func-
tional FMEA.
Note that a SW Component has an attribute, SW-DAL, which represents
the required level of assurance to be attained in its development. The indirect
association between Functional Requirements and SW Components allows the
identification of the most critical functional requirement implemented by the
component. The component’s SW-DAL is taken as the level corresponding to
the criticality value of the identified functional requirement.
[Diagram: owl class Functional Requirement (datatype property CriticalityValue) is
linked via is_implemented_by (through Usage Degree, 1..*) and has_used_component (1)
to SW Component (datatype property SW-DAL); SW Component and SW Module are linked via
has_module/is_module_of; SW Module is linked via is_developed_with and has to
Prevention Mitigation Accountability and SW Parameter Accountability, which refer via
has_PM (1) and has_PR (1) to Prevention Mitigation and SW Parameter.]
Fig. 3. UML representation of the intensional part of the ontology modeling the con-
cepts involved in the SW-DAL Evaluation.
The most common activity of FMEA is the production of usually large work-
sheets, whose structure depends on the standard chosen. The format of a row in
the FMEA worksheet as defined in our reference context is reported in Fig. 4.
Fig. 4. The format of a row in the FMEA worksheet as defined in our reference context.
Listing 1.1. A SPARQL query producing a result set comprising the values for the
construction of the SW-FMEA worksheet.
structural and non structural parameters. For example, considering two relevant
parameters in our reference context (i.e. McCabe’s cyclomatic complexity and
testing coverage), the corresponding set of predicates relative to the SW-DAL
associated with the criticality value c of a functional requirement f takes the
following form:
P_{c,f} = {CC < 5, TC = “all edges”}
where CC stands for McCabe’s cyclomatic complexity and TC stands for the
testing coverage. CC, TC, 5, and “all edges” are instances of the classes SW Pa-
rameter, Prevention Mitigation, SW Parameter Accountability, and Prevention
Mitigation Accountability, respectively (Fig. 3).
contributing to the realization of a functional requirement f , and not implement-
ing a functional requirement with a criticality value greater than c, are realized
with a cyclomatic complexity less than 5 and are tested with all edges cover-
age, then the required SW-DAL is attained and f is considered to be rigorously
implemented.
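The attainment check just described can be sketched as follows; the component
fields and the predicate encoding are illustrative assumptions, not the tool’s
data model.

def dal_attained(components, predicates) -> bool:
    # f is rigorously implemented iff every contributing component
    # satisfies every predicate in P_{c,f}.
    return all(pred(comp) for comp in components for pred in predicates)

P_c_f = [
    lambda comp: comp["cyclomatic_complexity"] < 5,     # CC < 5
    lambda comp: comp["test_coverage"] == "all edges",  # TC = "all edges"
]

components_of_f = [{"cyclomatic_complexity": 4, "test_coverage": "all edges"}]
print(dal_attained(components_of_f, P_c_f))  # True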
The tool provides an easy to use interface guiding the user through the pro-
cess outlined in Sect. 2. During the Top Level Functional FMEA, instances of
concepts illustrated in Sect. 2.2, are derived from artifacts and documents as
described in Sect. 3 and inserted in the ontological model through the tool in-
terface. For example, Fig. 5 shows the screenshot of the interface for entering
Functional Requirements instances. Once data have been loaded, severity lev-
Fig. 5. Screenshot showing the tool interface provided to enter a new instance of Func-
tional Requirement in the ontological model.
5 Conclusions
of both the approach and the tool, showing improvements on the SW-FMEA
practices.
References
14. IEEE Computer Society. IEEE Recommended Practice for Software Design De-
scriptions (Std 1016 - 1998) . Technical report, IEEE, 1998.
15. International Electrotechnical Commission. IEC-60812 Analysis techniques for
system reliability - Procedure for failure mode and effects analysis (FMEA), 1985.
16. M. Kifer, G. Lausen, and J. Wu. Logical foundations of object-oriented and frame-
based languages. Journal of the Association for Computing Machinery, 42:741–843,
1995.
17. B. H. Lee. Using FMEA models and ontologies to build diagnostic models. Artif.
Intell. Eng. Des. Anal. Manuf., 15:281–293, September 2001.
18. N. Leveson. Safeware: system safety and computers. Addison-Wesley, 1995.
19. R. R. Lutz and R. M. Woodhouse. Requirements analysis using forward and
backward search. Annals of Software Engineering, 3:459–475, 1997.
20. D. L. McGuinness and F. van Harmelen. OWL Web Ontology Language.
http://www.w3.org/TR/owl-features/, February 2004.
21. National Aeronautics and Space Administration. NASA Software Safety Guidebook
NASA-GB-8719.13 - NASA TECHNICAL STANDARD, March 2004.
22. Object Management Group. Ontology Definition Metamodel v1.0, 2009.
23. H. Pentti and H. Atte. Failure Mode and Effects Analysis of software-based au-
tomation systems - STUK-YTO-TR 190. VTT Industrial Systems - STUK, August
2002.
24. E. Prud’hommeaux and A. Seaborne. SPARQL query language for RDF.
http://www.w3.org/TR/rdf-sparql-query/, January 2008.
25. PTC Product Development Company. Windchill FMEA (formerly Relex FMEA)
official website. http://www.ptc.com/product/windchill/fmea.
26. Radio Technical Commission for Aeronautics. DO-178B, Software Considerations
in Airborne Systems and Equipment Certification, 1992.
27. E. S. Raymond. The New Hacker’s Dictionary. The MIT Press, Cambridge, 1991.
28. D. J. Reifer. Software Failure Modes and Effects Analysis. Reliability, IEEE
Transactions on, R-28(3):247 –249, aug. 1979.
29. ReliaSoft. XFMEA official website. http://www.reliasoft.com/xfmea/.
30. R. A. Sahner, K. S. Trivedi, and A. Puliafito. Performance and reliability analy-
sis of computer systems: an example-based approach using the SHARPE software
package. Kluwer Academic Publishers, Norwell, MA, USA, 1996.
31. E. Sirin, B. Parsia, B. C. Grau, A. Kalyanpur, and Y. Katz. Pellet: A practical
OWL-DL reasoner. J. Web Sem., 5(2):51–53, 2007.
32. Society of Automotive Engineers. SAE J-1739 Potential Failure Mode and Effects
Analysis in Design (Design FMEA) and Potential Failure Mode and Effects Anal-
ysis in Manufacturing and assembly Processes (Process FMEA) Reference Manual,
1994.
33. United States Department of Defense. MIL-STD-1629A, Procedures for Performing
a Failure Mode, Effects and Criticality Analysis. Technical report, USDoD, 1980.
34. United States Department of Defense. MIL-STD-498, Military Standard For Soft-
ware Development And Documentation. Technical report, USDoD, 1994.
Online Black-box Failure Prediction for Mission
Critical Distributed Systems
1 Introduction
Context and Motivation. Distributed mission critical systems such as air traf-
fic control, battlefield or naval command and control systems consist of several
applications distributed over a number of nodes connected through a LAN or
WAN. The applications are constructed out of communicating software compo-
nents that are deployed on those nodes and may change over time. The dynamic
nature of applications is principally due to (i) the employed policies for resilience
to software or hardware failures, (ii) the adopted load balancing strategies or (iii)
the management of newcomers. In such complex real-time systems, failures may
happen with potentially catastrophic consequences for their entire functioning.
The industrial trend is to face failures by using, during operational system life,
supervision services that are not only capable of detecting and certifying a
failure, but also of predicting and preventing it through an analysis of the overall
system behavior. Such services shall have a minimum impact on the supervised
system and possibly no interaction with the operational applications. The goal
is to plug in a “ready-to-use observer” that acts at run time and is both non-
intrusive and black-box, i.e., it considers nodes and applications as black boxes.
In mission critical systems, a large amount of data deriving from communica-
tions among applications transits on the network; thus, the ”observer” can focus
on that type of data only, in order to recognize many aspects of the actual in-
teractions among the components of the system. The motivation to adopt this
non-intrusive and black-box approach is twofold. Firstly, applications change
and evolve over time: grounding failure prediction on the semantics of the appli-
cations’ communications would require a deep knowledge of the specific system
design, a proven field experience, and a non-negligible effort to keep aligned the
supervision service to the controlled system. Secondly, interactions between the
service and system to be monitored might lead to unexpected behaviors, hardly
manageable as fully unknown and unpredictable.
Contribution. In this paper we introduce the design, implementation and ex-
perimental evaluation of a novel online, non-intrusive and black-box failure pre-
diction architecture we named CASPER that can be used for monitoring mission
critical distributed systems. CASPER is (i) online, as the failure prediction is
carried out during the normal functioning of the monitored system, (ii) non-
intrusive, as the failure prediction does not use any kind of information on the
status of the nodes (e.g., CPU, memory) of the monitored system; only infor-
mation concerning the network to which the nodes are connected is exploited
as well as that regarding the specific network protocol used by the system to
exchange information among the nodes (e.g., SOAP, GIOP); and (iii) black-box,
as no knowledge of the application’s internals and of the application logic of the
system is analyzed. Specifically, the aim of CASPER is to recognize any devi-
ation from normal behaviors of the monitored system by analyzing symptoms
of failures that might occur in the form of anomalous conditions of specific per-
formance metrics. In doing so, CASPER combines, in a novel fashion, Complex
Event Processing (CEP) [1] and Hidden Markov Models (HMM) [2]. The CEP
engine computes at run time the performance metrics. These are then passed
to the HMM in order to recognize symptoms of an upcoming failure. Finally,
the symptoms are evaluated by a failure prediction module that filters out as
many false positives as possible and provides at the same time a failure pre-
diction as early as possible. Note that we use HMM rather than other more
complex dynamic bayesian networks [3] since it provides us with high accu-
racy, with respect to the problem we wish to address, through simple and low
complexity algorithms. We deployed CASPER for monitoring a real Air Traffic
Control (ATC) system. Using the network data of such a system in the presence
of both steady state performance behaviors and unstable state behaviors, we
first trained CASPER in order to stabilize HMM and tune the failure prediction
module. Then we conducted an experimental evaluation of CASPER that aimed
to show its effectiveness in timely predicting failures in the presence of memory
and I/O stress conditions.
Related work. A large body of research is devoted to the investigation of
approaches to online failure prediction. [4] presents an error monitoring-based
failure prediction technique that uses Hidden Semi-Markov Model (HSMM) in
order to recognize error patterns that can lead to failures. This approach is
event-driven as no time intervals are defined: the errors are events that can
be triggered anytime. [5] describes two non-intrusive data driven modeling ap-
proaches to error monitoring: one based on a Discrete Time Markov Model, and
[Fig. 1. Fault, symptoms, failure: the timeline from fault activation to failure,
with time-to-prediction and time-to-failure.]
[Fig. 2. The modules of the CASPER failure prediction architecture: CASPER observes
the communication network connecting hosts 1..N, consults a knowledge base, and
triggers failure avoidance actions upon a fault and prediction alert.]
[Fig. 3. Hidden Markov Models graph used in the system state inference component:
a hidden process over the states Safe and Unsafe1..UnsafeK emitting the symbols
σ1..σM.]
The knowledge base concerning the possible safe and unsafe system states of the
monitored system is composed by the parameters of the HMM. This knowledge is
built during an initial training phase. Specifically, the parameters are adjusted by
means of a training phase using the max likelihood state estimators of the HMM
[2]. During the training, CASPER is fed concurrently by both recorded network
traces and a sequence of pairs <system-state,time>. Each pair represents the
fact that at time <time> the system state changed in <system-state>4 .
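As a sketch of how such labeled training data can be assembled offline (the names
and data layout below are our assumptions, not CASPER’s interfaces):

def label_symbols(symbols_with_time, state_changes):
    # symbols_with_time: [(time, symbol), ...] emitted per clock cycle;
    # state_changes: [(time, system_state), ...] sorted by time, with the
    # first entry at or before the first symbol.
    labeled, idx = [], 0
    for t, sym in symbols_with_time:
        while idx + 1 < len(state_changes) and state_changes[idx + 1][0] <= t:
            idx += 1
        labeled.append((sym, state_changes[idx][1]))
    return labeled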
3.2 Tuning of CASPER parameters
CASPER has been specialized to monitor a real Air Traffic Control system.
ATC systems are composed of middleware-based applications running over a
collection of nodes connected through a Local Area Network (LAN). The ATC
system that we monitored is based on CORBA [17] middleware. CASPER inter-
cepts GIOP messages produced by the CORBA middleware and extracts
information from them in order to build a representation of the system at run
time. In this section we describe how the events are represented starting from
the GIOP messages and how the performance metrics representing the system
state are computed.
[Footnote 4] As the training is offline, the sequence of pairs <system-state,time> can be created
offline by the operator using network traces and system log files.
evaluate the former in terms of N_tp (number of true positives): the system state is
unsafe and the inferred state is “system unsafe”; N_tn (number of true negatives):
the system state is safe and the inferred state is “system safe”; N_fp (number of
false positives): the system state is safe but the inferred state is “system unsafe”;
and N_fn (number of false negatives): the system state is unsafe but the inferred
state is “system safe”. Using these parameters, we compute the following metrics
that define the accuracy of CASPER:
Precision: p = N_tp / (N_tp + N_fp)
Recall (TP rate): r = N_tp / (N_tp + N_fn)
F-measure: F = 2 × (p × r) / (p + r)
FP rate: f.p.r. = N_fp / (N_fp + N_tn)
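In code, these metrics are a direct transcription of the definitions above (a
trivial helper, shown for completeness):

def accuracy_metrics(n_tp: int, n_tn: int, n_fp: int, n_fn: int):
    precision = n_tp / (n_tp + n_fp)
    recall = n_tp / (n_tp + n_fn)  # TP rate
    f_measure = 2 * precision * recall / (precision + recall)
    fp_rate = n_fp / (n_fp + n_tn)
    return precision, recall, f_measure, fp_rate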
stress, i.e., memory and I/O stress) in which one of the services of the ATC
system fails. These traces are taken from the ATC system’s testing environment.
During the training phase, the performance metrics computation component
produces a symbol at each CASPER clock cycle. Thanks to the set of pairs
<system-state,time>, we are able to represent the emitted symbols in case of
safe and unsafe system states. Figure 4 illustrates these symbols. Each symbol
is calculated starting from a combination of three values. In this case, we have 6
possible values for each performance metric; the number of different symbols is
therefore 6 × 6 × 6 = 216. Observing Figure 4, we can notice that the majority of
the emissions belong to the interval [0, 2] for the Round Trip Time, and [0, 1] for
the Number of Requests Without Reply and the Message Rate. Starting from the
symbols represented in Figure 4, the HMM-based component builds its knowledge base.
Tuning of CASPER parameters: clock period and number of symbols.
After the training of the HMM, CASPER requires a tuning phase to set the clock
period and the number of symbols in order to maximize the accuracy (F-measure,
precision, recall and false positive rate) of the symptoms detection module
output. This tuning phase is done by feeding the system with a recorded network
trace (different from the one used during the training). We can see that the best
choice of the clock period is 800 milliseconds. CASPER tries 4 different values
of clock (100ms, 300ms, 800ms, 1000ms) and computes the F-measure for each
value and for each possible number of symbols. A clock period of 800 milliseconds
yields a higher F-measure value than the other clock values for most of the symbol
counts considered; thus, CASPER sets the clock period to 800 milliseconds. Once
this clock period is fixed, the second parameter to define is the number of symbols.
Figure 5 shows the precision, recall, F-measure and false positive rate of the
symptoms detection module as the number of symbols varies. CASPER considers the
maximum difference between the F-measure and the false positive rate in order to
choose the ideal number of symbols (ideally, the F-measure is equal to 1 and the
f.p.r. to 0). As shown in Figure 5, considering 216 symbols (6 values per
performance metric) we obtain F = 0.82 and f.p.r. = 0.12, which is the best
situation in case of memory stress.
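Read as pseudocode, this tuning phase is a small grid search. The sketch below mirrors it under the assumption of a replay_trace callback standing in for the real CASPER pipeline; that callback and all parameter names are hypothetical.

# Grid search over clock period and symbol count, keeping the setting
# that maximises F-measure minus false positive rate on the tuning trace.
def tune(replay_trace, clocks=(100, 300, 800, 1000),
         symbol_counts=(64, 125, 216, 343, 512, 729, 1000)):
    best, best_score = None, float("-inf")
    for clock_ms in clocks:
        for n_symbols in symbol_counts:
            f, fpr = replay_trace(clock_ms, n_symbols)  # returns (F, f.p.r.)
            if f - fpr > best_score:
                best, best_score = (clock_ms, n_symbols), f - fpr
    return best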
Tuning of CASPER parameters: window size. The window size is the
only parameter that has to be tuned by the operator according to the tradeoff
discussed in Section 3.2. We experimentally noticed that during fault-free
executions the system state inference still produced some false positives.
However, the probability that there exists a long sequence of false positives in
steady state is very low. Thus, we designed the failure prediction module to
recognize sequences of consecutive clock cycles whose inferred state is not safe.
Only if the sequence is longer than a certain threshold does CASPER raise a
prediction. The length of these sequences multiplied by the clock period (set to
800ms) is the window size. The problem is then to set a reasonable threshold in
order to avoid false positive predictions during steady state. Figure 6 illustrates
the number of false positives as the window size varies. From this figure it can
be noted that the window size has to be set to at least 16 seconds in order not
to incur any false positives. Let us remark that the window size also corresponds
to the minimum time-to-prediction. All the results presented below are thus
obtained with this window size.
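The prediction rule itself reduces to detecting a sufficiently long run of consecutive unsafe inferences. A minimal sketch of ours, assuming the inferred states arrive as a sequence of "safe"/"unsafe" labels:

# Raise a prediction only after a run of consecutive "unsafe" inferences
# longer than the threshold derived from the window size,
# e.g. 16 s / 0.8 s clock = 20 cycles.
def predictor(inferred_states, window_s=16.0, clock_s=0.8):
    threshold = int(window_s / clock_s)
    run = 0
    for cycle, state in enumerate(inferred_states):
        run = run + 1 if state == "unsafe" else 0
        if run >= threshold:
            yield cycle  # prediction raised at this clock cycle
            run = 0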
Fig. 4. Symbols emitted by the performance metrics computation component in case of a recorded trace subject to memory stress (axes: Round Trip Time, Number of Requests Without Reply, Message Rate).
Fig. 5. Performance of the symptoms detection module varying the number of possible symbols (64–1000) in case of a recorded trace subject to memory stress (curves: F-Measure, Precision, Recall, False Positive Rate).
[Figure residue: Fig. 6 plots the percentage of false positives against the window size; Fig. 7 shows the inferred vs. real system state (safe/unsafe) over time in seconds, with the prediction, the service failure, the time-to-prediction and the time-to-failure marked.]
tests for each type of fault5 . In the second type, we observed the accuracy of
CASPER when monitoring the ATC system in operation for 24 hours. These types
of experiments and their related results are discussed in order as follows. As a
first test, we injected a memory stress in one of the nodes of the ATC system
until a service failure occurred. Figure 7 shows the anatomy of this failure in one
test. The ATC system runs with some false positives until the memory stress starts
at second 105. The sequence of false positives starting at second 37 is not
sufficiently long to create a false prediction. After the memory stress starts, the
failure prediction module outputs a prediction at second 128; thus, the time-to-
prediction6 is 23s. The failure occurs at second 335, so the time-to-failure is
207s, which is satisfactory with respect to the ATC system recovery requirements.
A failure caused by I/O stress happens 408 seconds after the start of the stress
(at second 190) and was predicted at second 222, after 32 seconds of stress, i.e.,
376 seconds before the failure. In general, we obtained that in the 10 tests we
carried out, the time-to-failure in case of memory stress varied in the range
[183s, 216s] and the time-to-prediction in the range [20.8s, 27s]. In case of I/O
stress, in the 10 tests, the time-to-failure varied in the range [353s, 402s]
whereas the time-to-prediction varied in the range [19.2s, 24.9s]. Finally, we
performed a 24h test deploying CASPER on the network of the ATC system in
operation. In these 24 hours the system exhibited steady-state performance
behavior. CASPER did not produce any false positive over the whole day. Figure 8
depicts a portion of 400 seconds of this run.
just connecting to the monitored system without any human intervention. This
will make CASPER a complete “plug-and-play” failure prediction system. The
advantage of the online training solution is that CASPER can analyze a huge
amount of network data. The disadvantage is that the training phase can last
a long time, as CASPER does not have any external clue concerning the safe
or faulty system state.
References
1. Esper: Esper project web page (2011) http://esper.codehaus.org/.
2. Rabiner, L., Juang, B.: An introduction to hidden Markov models. IEEE ASSP
Magazine 3(1) (Jan 1986) 4–16
3. Murphy, K.: Dynamic Bayesian Networks: Representation, Inference and Learning.
PhD thesis, UC Berkeley, Computer Science Division (2002)
4. Salfner, F.: Event-based Failure Prediction: An Extended Hidden Markov Model
Approach. PhD thesis, Department of Computer Science, Humboldt-Universität
zu Berlin, Germany (2008)
5. Hoffmann, G.A., Salfner, F., Malek, M.: Advanced Failure Prediction in Complex
Software Systems. Technical Report 172, Berlin, Germany (2004)
6. Yu, L., Zheng, Z., Lan, Z., Coghlan, S.: Practical online failure prediction for
Blue Gene/P: Period-based vs event-driven. In: Proc. of IEEE/IFIP DSN-W 2011.
(2011) 259–264
7. Williams, A.W., Pertet, S.M., Narasimhan, P.: Tiresias: Black-box failure predic-
tion in distributed systems. In: Proc. of IEEE IPDPS 2007, Los Alamitos, CA,
USA (2007)
8. Tan, Y., Gu, X., Wang, H.: Adaptive system anomaly prediction for large-scale
hosting infrastructures. In: Proc. of ACM PODC 2010, New York, NY, USA, ACM
(2010) 173–182
9. Aguilera, M.K., Mogul, J.C., Wiener, J.L., Reynolds, P., Muthitacharoen, A.: Per-
formance debugging for distributed systems of black boxes. In SIGOPS Oper. Syst.
Rev. 37 (2003) 74–89
10. Fu, S., Xu, C.-Z.: Exploring event correlation for failure prediction in coalitions
of clusters. (2007)
11. Daidone, A., Di Giandomenico, F., Bondavalli, A., Chiaradonna, S.: Hidden
markov models as a support for diagnosis: Formalization of the problem and syn-
thesis of the solution. In: SRDS 2006, Leeds, UK (2006) 245–256
12. Gu, X., Papadimitriou, S., Yu, P.S., Chang, S.P.: Online failure forecast for fault-
tolerant data stream processing. In: ICDE 2008. 1388–1390
13. Avizienis, A., Laprie, J.C., Randell, B., Landwehr, C.E.: Basic concepts and taxon-
omy of dependable and secure computing. IEEE Trans. Dependable Sec. Comput.
1(1) (2004) 11–33
14. Hood, C., Ji, C.: Proactive network-fault detection. In IEEE Transactions on
Reliability 46(3) (1997) 333 –341
15. Thottan, M., Ji, C.: Properties of network faults. In: NOMS 2000. 941–942
16. Baldoni, R., Lodi, G., Mariotta, G., Montanari, L., Rizzuto, M.: On-
line Black-box Failure Prediction for Mission Critical Distributed Sys-
tems. Technical report (2012) http://www.dis.uniroma1.it/~midlab/articoli/
MidlabTechReport3-2012.pdf.
17. Object Management Group: CORBA. Specification, Object Management Group (2011)
18. IBM: System S Web Site (2011) http://domino.research.ibm.com/comm/
research_projects.nsf/pages/esps.index.html.
On the Impact of Hardware Faults - An Investigation of
the Relationship between Workload Inputs and Failure
Mode Distributions
1 Introduction
likely that future microprocessors will exhibit an increasing rate of incorrect program
executions caused by hardware related errors.
A cost-effective way of reducing the risk that such incorrect program executions
cause unacceptable or catastrophic system failures is to introduce a layer of
software-implemented error handling mechanisms. Numerous software techniques for
detecting and masking hardware errors have previously been proposed in the
literature [2, 3]. The effectiveness of these techniques is often evaluated, or
benchmarked, by means of fault injection experiments that measure their ability
to detect or mask single or multiple bit errors (bit flips) in CPU registers and
main memory [4]. Bit flipping is used to emulate the effect of single event upset
(SEU) errors caused by ionizing particles. The error coverage of software-
implemented error handling techniques often depends on the input processed by the
target system. Thus, to assess the variability in error coverage, it is essential
to conduct fault injection experiments with different inputs [5, 6].
This paper presents the results of extensive fault injection experiments with four
programs where single bit errors were injected in CPU registers and main memory of
the target systems. The aim of the study is to investigate how error coverage varies for
different inputs. We conducted experiments with programs protected by triple-time
redundant execution with forward recovery [7], and programs without software-
implemented hardware fault tolerance (SIHFT). In addition, we propose a technique
for identifying input sets that are likely to cause the measured error coverage to vary.
The remainder of the paper is organized as follows. We describe the target work-
loads in Section 2 and the TTR-FR mechanism in Section 3. The fault injection exper-
imental setup is described in Section 4. The analysis of the extensive fault injections
conducted on the workloads with/without TTR-FR mechanism is presented in Section
5. Based on the obtained results, we present the input selection approach in Section 6.
2 Target Workloads
In this section, we present the four workloads used in our set of experiments:
secure hash algorithm (SHA), cyclic redundancy check (CRC), quick sort (Qsort),
and binary string to integer converter (BinInt). SHA is a cryptographic hash
function which generates a 160-bit message digest. We use the SHA-1 algorithm,
which is adopted in many security protocols and applications such as SSL, SSH and
IPsec. The CRC that we use is a software implementation of the 32-bit CRC
polynomial, which is mostly used to calculate end-to-end checksums. Qsort is a
recursive implementation of the well-known quick sort algorithm, which is also
used as a target program for fault injection experiments in [6, 8]. Finally,
BinInt converts an ASCII binary string, 1s and 0s, into its equivalent integer
value.
Even though the implementation of our workloads can be found in the MiBench
suite [9], we only take CRC and BinInt from this suite. For the quick sort algorithm,
the MiBench implementation uses a built-in C function named qsort whose source
code is not available. This prevents us from performing detailed analysis. Further-
more, the MiBench implementation of SHA uses dynamic memory allocation which
Table 1. The input space for CRC (left table) and SHA (right table) execution flows
Table 2. The input space for Qsort (left table) and BinInt (right table) execution flows
1 http://www.dil.univ-mrs.fr/~morin/DIL/tp-crypto/sha1-c
The workloads are executed on a Freescale MPC565 microcontroller, which uses the
PowerPC architecture. Faults are injected into the microcontroller via a Nexus
debug interface using Goofi-2 [10], a tool developed in our research group. This
environment allows us to inject faults, bit flips, into instruction set
architecture (ISA) registers and main memory of the microcontroller. Ideally, the
fault model adopted for this evaluation should reflect real faults, i.e., it
should account for both multiple and single bit flips. However, there is no
commonly agreed model for multiple bit flips. Thus, we adopt the single bit flip
model, as has been done in other studies [11, 2, 3, 10].
The faults are injected in the main memory (stack, data, etc.) and all CPU
registers used by the execution flows. The registers include the general purpose
registers, the program counter register, the link register, the integer exception
register, and the condition register. As the machine code of our workloads is
stored in a Flash memory, it cannot be subjected to fault injection. We define a
fault in terms of a time-location pair, where the location is a randomly selected
bit in the memory word or CPU register, while the time corresponds to the
execution of a given machine instruction (i.e., a point in the execution flow).
Indeed, we make use of the pre-injection analysis [8] included in Goofi-2. In
this way, the fault injection takes place on a register or memory location just
before it is read by the executing instruction. A fault injection experiment
consists of injecting one fault and observing its impact on a workload. A fault
injection campaign is a series of fault injection experiments with a given
execution flow.
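These definitions can be paraphrased as a loop over (time, location) pairs. The sketch below only illustrates them; the target interface is hypothetical, since in reality Goofi-2 drives the MPC565 through the Nexus debug port.

# One fault injection experiment: pick a (time, location) pair produced by
# the pre-injection analysis, flip one random bit of that register or
# memory word just before the instruction at `time` reads it, and classify
# the outcome.
import random

def one_experiment(target, preinjection_pairs):
    time, location = random.choice(preinjection_pairs)
    bit = random.randrange(32)           # random bit in the 32-bit word
    target.run_until(time)               # stop just before the read
    target.flip_bit(location, bit)       # inject the single bit flip
    return target.resume_and_classify()  # outcome class of this experiment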
5 Experimental Results
In this section, we present the outcomes of fault injection campaigns conducted on the
4 workloads. We carried out 9 campaigns per workload which resulted in a total of 36
campaigns for the basic version and 36 campaigns for the TTR-FR version. The cam-
paigns consist of 25000 experiments, except for the CRC campaigns, which comprise
12000 experiments each. Each experiment outcome is classified according to the following scheme:
– No Impact (NI): errors that do not affect the output of the execution flow.
– Detected by Hardware (DHW): errors that are detected by the hardware exceptions.
– Time Out (TO): errors that cause violation of the timeout2.
– Value Failure (VF): erroneous output with no indication of failure (silent data corruption).
– Detected by Software (DSW): errors that are detected by the software detection mechanisms.
– Corrected by Software (CSW): errors that are corrected by the software correction mechanisms.
When presenting the results, we also refer to the coverage (COV) as the probability
that a fault does not cause a value failure, which is calculated in equation (1):
COV = 1 - #VF/N (1)
Here N is the total number of experiments, and #VF is the total number of
experiments that resulted in value failure. In addition to the experiments
classified as detected by hardware, the coverage includes no impact and timeout
experiments. No impact experiments can be the result of the internal robustness of
the workload; therefore they contribute to the overall coverage of the system.
Experiments that result in timeout are detected by Goofi-2. In a real application,
watchdog timers are used to detect these types of errors.
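Equation (1) is straightforward to compute from a campaign's outcome counts; a small sketch of ours with illustrative numbers:

# COV = 1 - #VF / N, where N is the total number of experiments.
def coverage(outcomes):
    n = sum(outcomes.values())
    return 1 - outcomes.get("VF", 0) / n

print(coverage({"NI": 715, "DHW": 200, "TO": 30, "VF": 55}))  # 0.945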
2 The timeout value is approximately 10 times larger than the execution time of the workload.
input and the percentage of value failure”. The results of the ANOVA in Table 4
allow us to reject H0 with a confidence of 95%. The reason behind this correlation
is that when the length of the input increases, the number of reads from registers
and memory locations increases as well. Therefore, there are more possibilities to
inject faults that result in value failure. Obviously, as the value failure
increases linearly with the length, the coverage decreases linearly (Table 3).
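A hedged re-creation of this kind of test regresses the value-failure percentage on input length with SciPy; the VF values below are the CRC rows of Table 3, while the lengths are placeholders for the actual input sizes.

# Test for a linear relation between input length and value failure:
# reject H0 (no relation) when the p-value of the slope is below 0.05.
from scipy import stats

lengths = [1, 2, 3, 4, 5, 6, 7, 8, 9]                       # placeholders
vf = [6.1, 17.9, 24.3, 34.3, 35.5, 39.6, 39.8, 41.2, 41.9]  # CRC VF (%)
res = stats.linregress(lengths, vf)
print(res.slope, res.pvalue)   # significant at 95% if pvalue < 0.05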
Table 3. Failure distribution of all the execution flows (values are in percentage)

Execution flow | NI | VF | DHW | TO | COV
CRC-1 | 42.7 | 6.1 | 48.2 | 3.0 | 93.9
CRC-2 | 32.9 | 17.9 | 46.7 | 2.4 | 82.1
CRC-3 | 28.3 | 24.3 | 45.8 | 1.6 | 75.7
CRC-4 | 20.8 | 34.3 | 44.0 | 0.8 | 65.7
CRC-5 | 20.3 | 35.5 | 43.6 | 0.6 | 64.5
CRC-6 | 17.1 | 39.6 | 43.0 | 0.3 | 60.4
CRC-7 | 16.6 | 39.8 | 43.4 | 0.2 | 60.2
CRC-8 | 15.7 | 41.2 | 42.7 | 0.4 | 58.8
CRC-9 | 16.0 | 41.9 | 41.8 | 0.3 | 58.1
SHA-1 | 18.9 | 38.8 | 41.0 | 1.4 | 61.2
SHA-2 | 17.8 | 40.1 | 41.0 | 1.1 | 59.9
SHA-3 | 17.6 | 40.8 | 40.6 | 1.0 | 59.2
SHA-4 | 16.8 | 42.1 | 39.7 | 1.4 | 57.9
SHA-5 | 15.9 | 43.1 | 39.4 | 1.6 | 56.9
SHA-6 | 11.5 | 47.1 | 39.5 | 1.9 | 52.9
SHA-7 | 11.4 | 47.7 | 39.3 | 1.6 | 52.3
SHA-8 | 10.7 | 48.8 | 38.8 | 1.7 | 51.2
SHA-9 | 10.7 | 49.1 | 38.4 | 1.8 | 50.9
Qsort-1 | 37.1 | 12.7 | 46.8 | 3.5 | 87.3
Qsort-2 | 32.8 | 17.1 | 46.9 | 3.2 | 82.9
Qsort-3 | 31.3 | 17.7 | 47.7 | 3.3 | 82.3
Qsort-4 | 31.7 | 18.1 | 46.8 | 3.9 | 81.9
Qsort-5 | 26.5 | 23.0 | 47.2 | 3.3 | 77.0
Qsort-6 | 29.0 | 20.7 | 46.0 | 4.3 | 79.3
Qsort-7 | 29.3 | 20.9 | 46.3 | 3.5 | 79.1
Qsort-8 | 27.2 | 22.1 | 46.6 | 4.2 | 77.9
Qsort-9 | 25.4 | 24.2 | 46.5 | 4.0 | 75.8
BinInt-1 | 44.1 | 3.5 | 49.9 | 2.5 | 96.5
BinInt-2 | 34.9 | 20.6 | 41.5 | 3.0 | 79.4
BinInt-3 | 34.7 | 20.6 | 41.6 | 3.1 | 79.4
BinInt-4 | 34.5 | 20.5 | 42.0 | 2.9 | 79.5
BinInt-5 | 35.3 | 21.2 | 40.5 | 3.0 | 78.8
BinInt-6 | 35.1 | 21.0 | 40.8 | 3.1 | 79.0
BinInt-7 | 34.8 | 21.5 | 40.5 | 3.2 | 78.5
BinInt-8 | 36.7 | 20.4 | 40.0 | 3.0 | 79.6
BinInt-9 | 35.5 | 20.9 | 40.5 | 3.1 | 79.1
Qsort and BinInt exhibit a non-linear variation of the value failure with,
respectively, the number of sorted elements and the input length (Table 4). For
Qsort, this can be explained by considering that, in addition to the number of
sorted elements, the position of these elements impacts Qsort’s behaviour. This
causes a different number of element comparisons and recursive calls to the core
function. This effect is particularly evident for Qsort-4 and Qsort-5. Even though
both have 50% of the input elements sorted, there is a difference of 4.85
percentage points between their value failures. Although there is no linear
correlation for Qsort, it is notable that the average value failure of the first
five execution flows, which have more sorted elements, is 4.22 percentage points
lower than that of the next four execution flows. BinInt, however, is a small
program with an input space from 0 to 32 characters; for such a small application,
these inputs do not cause a significant variation in the failure distribution.
Results in Table 4 show that the proportion of failures detected by the hardware
exceptions is almost constant for a given workload (the coefficient is 0.019 for SHA,
0.04 for CRC, and 0.02 for BinInt). Analogously, the proportion of experiments clas-
sified as timeout is almost constant for all the workloads.
It is worth noting that the startup code may vary in different systems. We therefore
show the trend of value failures with/without the startup block in Fig. 1. We can see
that the trends in the two diagrams are similar which is due to the fact that the startup
code consists of significantly fewer lines of code compared to the other blocks.
Fig. 1. The percentage of value failures for different execution flows of each workload
relative size of the core function is smaller compared to the other programs. This re-
sulted in only around 57% of the injections in this function, while in the other work-
loads more than 96% of faults were injected in the core function. This can explain the
higher percentage of value failures in Qsort compared to the other workloads.
In order to evaluate the robustness of the voter, we conducted exhaustive fault
injections (i.e., we injected all possible faults) in the voter of each workload,
see Table 5b. It is notable that even though the TTR-FR mechanism decreases the
percentage of value failure, the voter is one of the main contributors to the
occurrence of value failures.
The average percentage of errors detected by the hardware exceptions does not
vary significantly between the versions extended with TTR-FR and those without
this mechanism for SHA, CRC, and BinInt, while it differs by about 5% for Qsort.
Table 5. Average failure distributions for workloads with TTR-FR (values are in percentage).
6 Input Selection
As we demonstrate in this paper, the likelihood for a program to exhibit a value fail-
ure due to bit flips in CPU registers or memory words depends on the input to the
program. Thus, when we assess the error sensitivity of an executable program by fault
injection, it is desirable to perform experiments with several inputs.
In this section, we describe a method for selecting inputs such that they are likely
to result in widely different outcome distributions. The selection process consists of
three steps. First, the fault-free execution flows for a large set of inputs are profiled
using assembly code metrics. We then use cluster analysis to form clusters of similar
execution flows. Finally, we select one representative execution flow from each clus-
ter and subject the workload to fault injection. We validate the method by showing
that inputs in the same clusters indeed generate similar outcome distributions, while
inputs in different clusters are likely to generate different outcome distributions.
6.1 Profiling
We adopt a set of 47 assembly metrics corresponding to different access types
(read, write) to registers and memory sections, along with various categories of
assembly instructions. Specifically, we group the PowerPC instruction set into 6
categories as shown in Table 6. For each group, we define the percentage of
execution as the number of times that the instructions of that category are
executed out of the total number of executed instructions. These 6 metrics are a
suitable representative of the metric set for our workloads. Therefore, these
metrics are used as a signature for the fault-free run of each execution flow, to
be used in the clustering algorithm.
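As an illustration of this profiling step, the sketch below computes such a signature from a fault-free instruction trace; the opcode-to-category map is a made-up excerpt, not the actual grouping of Table 6.

# Percentage of executed instructions per category, used as the signature
# of one execution flow.
from collections import Counter

CATEGORY = {"add": "arithmetic", "lwz": "load", "stw": "store",
            "b": "branch", "cmpw": "compare", "rlwinm": "logical"}

def signature(trace):
    counts = Counter(CATEGORY.get(op, "other") for op in trace)
    total = sum(counts.values())
    return {cat: 100.0 * n / total for cat, n in counts.items()}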
6.2 Clustering
Cluster analysis divides the input set (the execution flows, in our case) into
homogeneous groups based on the signatures of the execution flows. We adopted
hierarchical clustering [12] due to the fact that, unlike other clustering
techniques (e.g., K-means), it does not require preliminary knowledge of the
number of clusters. Thus, we can validate a posteriori whether the execution flows
are clustered as expected. The hierarchical clustering adopted in this work
evaluates the distance between two clusters according to the centroid method [12].
A similar approach is used in [13].
the choice of the clustering method, since we also obtained identical results with
other methods such as average and Ward [12].
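This clustering step maps naturally onto SciPy's hierarchical clustering with centroid linkage; a sketch of ours, assuming the signatures are collected in a flows-by-metrics array:

# Cluster execution-flow signatures and cut the dendrogram into a chosen
# number of clusters; one representative flow per cluster is then
# subjected to fault injection.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

signatures = np.random.rand(36, 6)                 # placeholder signatures
Z = linkage(signatures, method="centroid")         # centroid method
clusters = fcluster(Z, t=4, criterion="maxclust")  # e.g. cut into 4 clusters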
7 Related Work
Numerous works [14, 15, 16, 11] have assessed the effectiveness of hardware
detection mechanisms in the presence of different fault models (such as pin-level
injection, stuck-at byte, and bit flipping) while executing different workloads.
In addition, an emerging research trend focuses on the implementation of
software-implemented hardware fault tolerance mechanisms for detecting/correcting
errors. Different implementations of software mechanisms at the source level
[2, 7] as well as at the assembly level [3, 4, 17] have been assessed. These
studies targeted a large variety of workloads and fault tolerance mechanisms
without investigating their behavior under different inputs. In dependability
benchmarking, workloads are executed with realistic stimuli, i.e., inputs that
come from the application domain. In this area, the study [18] investigates the
dependability of an automotive engine control system targeted with transient
faults. The system under study is totally different from ours and no input
selection approach is proposed. To the best of our knowledge, there is little
literature investigating the effects of transient faults under workload
variations. In [5], matrix multiplication and selection sort are fed with three
and two inputs, respectively. The fault model includes zero-a-byte, set-a-byte
and two-bit compensation, which differs from ours. The authors of [6] also
estimated the error coverage for quicksort and shellsort, both executed with 24
different inputs. In addition, we study assembly-level metrics with respect to
the failure distribution (Section 6). While in performance benchmarking some
studies [19] explore the correlation between metrics and performance factors
(e.g., power consumption), in the dependability field there has been no
investigation of this area.
We investigated the relationship between the inputs of a set of workloads and the
failure mode distribution. The experiments, carried out on an embedded system,
demonstrate that for CRC and SHA the length of the input is linearly correlated
with the percentage of value failure. Even though Qsort and BinInt do not show
such a relationship, it is still notable that the input affects the failure
distribution. Results illustrate that the percentage of faults detected by the
hardware exceptions is workload dependent, i.e., it is not affected by the input.
Additionally, a simple software-implemented hardware fault tolerance mechanism,
TTR-FR, can successfully increase the coverage, on average, to more than 97%,
regardless of the input. As similar inputs (e.g., inputs of the same length)
result in a similar failure distribution, we devised an approach to reduce the
number of fault injections. Although the approach seems promising for workloads
with a linear relation between the input property (e.g., length) and the failure
distribution, additional metrics might be required for other workloads. Looking
forward, we would like to improve the confidence in our findings by extending the
study with other workloads, fault tolerance mechanisms, fault models and
different compiler optimizations.
References
1. Borkar, S.; "Designing reliable systems from unreliable components: the challenges of
transistor variability and degradation," IEEE Micro, vol. 25, no. 6, pp. 10-16, 2005.
2. Rebaudengo, M.; Sonza Reorda, M.; Violante, M.; "A new approach to software-
implemented fault tolerance," Journal of Electronic Testing: Theory and Applications, vol.
20, no.4, pp. 433-437, 2004.
3. Reis, G.A.; et. al.; "SWIFT: Software implemented fault tolerance," Int. Symp. on Code
generation and optimization (CGO'05), pp. 243-254, 2005.
4. Skarin, D.; Karlsson, J.; "Software implemented detection and recovery of soft errors in a
brake-by-wire System," 7th European Dependable Computing Conf. (EDDC-07), pp. 145-
154, 2008.
5. Segall, Z.; et al.; "FIAT-fault injection based automated testing environment," 18th Int.
Symp. on Fault-Tolerant Computing (FTCS-18), pp. 102-107, 1988.
6. Folkesson, P.; Karlsson, J.; "Considering workload input variations in error coverage esti-
mation," 3rd European Dependable Computing Conf. (EDDC-03), pp. 171-190, 1999.
7. Alexandersson, R.; Karlsson, J.; "Fault injection-based assessment of aspect-oriented im-
plementation of fault tolerance," 41st Int. Dependable Systems & Networks Conf. (DSN),
pp. 303-314, 2011.
8. Barbosa, R.; Vinter, J.; Folkesson, P.; Karlsson, J.; "Assembly-level pre-injection analysis
for improving fault injection efficiency," 5th European Dependable Computing Conf.
(EDDC’05), pp. 246-262, 2005.
9. Mibench Version 1, [Online] http://www.eecs.umich.edu/mibench/
10. Skarin, D.; Barbosa, R.; Karlsson, J.; "GOOFI-2: A tool for experimental dependability as-
sessment," 40th Int. Dependable Systems & Networks Conf. (DSN), pp. 557-562, 2010.
11. Carreira, J.; Madeira, H.; Silva, J.G.; "Xception: A technique for the experimental evalua-
tion of dependability in modern computer system," IEEE Trans. Soft. Eng., vol. 24, no. 2,
pp. 125-136, 1998.
12. Jain, A.; Murty, M.; Flynn, P.; "Data clustering: a review," ACM Computing Surveys
(CSUR), vol. 31, no. 3, pp. 264-323, 1999.
13. Natella, R.; Cotroneo, D.; Duraes, J.; Madeira, H.; "On fault representativeness
of software fault injection," IEEE Trans. Soft. Eng., in press (PrePrint), 2011.
14. Kanawati, G.A.; Kanawati, N.A.; Abraham, J.A.; "FERRARI: a tool for the validation of
system dependability properties," 22nd Int. Symp. on Fault-Tolerant Computing (FTCS-
22), pp. 336-344 1992.
15. Madeira,H.; Rela, M.; Moreira, F.; Silva J.G; "RIFLE: A general purpose pin-level fault
injector" 1st European Dependable Computing Conf. (EDDC-01), pp. 199-216, 1994.
16. Arlat, J.; et al..; "Comparison of physical and software-implemented fault injection tech-
niques," IEEE Trans. on Computers, vol. 52, no. 9, pp. 1115-1133, 2003.
17. Martinez-Alvarez, A.; et. al; "Compiler-Directed soft error mitigation for embedded sys-
tems," IEEE Trans. on Dependable and Secure Computing, vol.9, no.2, pp. 159-172, 2012.
18. Ruiz, J.C.; Gil, P.; Yeste, P.; de Andrés, D.; Dependability benchmarking for computer
systems. John Wiley & Sons, Inc., 2008, ch. "Dependability Benchmarking of automotive
control system."
19. Eeckhout, L.; Sampson, J.; Calder, B.; "Exploiting program microarchitecture independent
characteristics and phase behavior for reduced benchmark suite simulation," IEEE Int.
Workload Characterization Symp., pp. 2-12, 2005.
Formal Development and Assessment of a
Reconfigurable On-board Satellite System
1 Introduction
Fault tolerance is an important characteristic of on-board satellite systems.
One of the essential means to achieve it is redundancy. However, the use of
(hardware) component redundancy in spacecraft is restricted by weight and
volume constraints. Thus, the system developers need to perform a careful cost-
benefit analysis to minimise the use of spare modules yet achieve the required
level of reliability.
Despite such an analysis, Space Systems Finland has recently experienced a
double-failure problem with a system that samples and packages scientific data
in one of the operating satellites. The system consists of two identical modules.
When one of the first module's subcomponents failed, the system switched to the
use of the second module. However, after a while a subcomponent of the spare
also failed, so it became impossible to produce scientific data. To avoid losing
the entire mission, the company devised a solution that relied on the healthy
subcomponents of both modules and a complex communication mechanism to
restore system functioning. Obviously, a certain amount of data was lost
before the repair was deployed. This motivated our work on exploring proactive
solutions for fault tolerance, i.e., planning and evaluating scenarios
implementing a seamless reconfiguration using fine-grained redundancy.
In this paper we propose a formal approach to the modelling and assessment of
on-board reconfigurable systems. We generalise the ad hoc solution created by
Space Systems Finland and propose an approach to the formal development and
assessment of fault tolerant satellite systems. The essence of our modelling
approach is to start from abstractly modelling the functional goals that the
system should achieve to remain operational, and to derive a reconfigurable
architecture by refinement in the Event-B formalism [1]. The rigorous refinement
process allows us to establish precise relationships between component failures
and goal reachability. The derived system architecture should not only satisfy
the functional requirements but also achieve its reliability objective. Moreover,
since the reconfiguration procedure requires additional inter-component
communication, the developers should also verify that system performance remains
acceptable. Quantitative evaluation of the reliability and performance of
probabilistically augmented Event-B models is performed using the PRISM model
checker [8].
The main novelty of our work is in proposing an integrated approach to the
formal derivation of reconfigurable system architectures and the probabilistic
assessment of their reliability and performance. We believe that the proposed
approach facilitates early exploration of the design space and helps to build
redundancy-frugal systems that meet the desired reliability and performance
requirements.
the overall system. However, even though none of the DPUs can accomplish G
on its own, it might be the case that the operational components of both DPUs
can together perform the entire set of tasks required to reach G. This observation
allows us to define the following dynamic reconfiguration strategy.
Initially DPU_A is active and assigned to reach the goal G. If one of its
components fails, resulting in a failure to execute one of the four scientific
tasks (let it be task_j), the spare DPU_B is activated and DPU_A is deactivated.
DPU_B performs task_j and the consecutive tasks required to reach G. It becomes
fully responsible for achieving the goal G until one of its components fails. In
this case, to remain operational, the system performs a dynamic reconfiguration.
Specifically, it reactivates DPU_A and tries to assign the failed task to its
corresponding component. If such a component is operational, then DPU_A continues
to execute the subsequent tasks until it encounters a failed component. Then the
control is passed to DPU_B again. Obviously, the overall system stays operational
until two identical components of both DPUs have failed.
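The strategy can be illustrated with a toy simulation (our own sketch, not the authors' Event-B model): each DPU has n task-specific components, a failed task is handed over to the other module's corresponding component, and the goal becomes unreachable once both components for some task have failed.

# Toy model of the dynamic reconfiguration strategy between DPU_A and DPU_B.
def reachable(ok_a, ok_b):
    # G stays reachable while, for every task, at least one of the two
    # corresponding components is operational.
    return all(a or b for a, b in zip(ok_a, ok_b))

def run(n, failures):
    ok = {"A": [True] * n, "B": [True] * n}
    active = "A"
    for dpu, j in failures:              # (module, component index) events
        ok[dpu][j] = False
        if not reachable(ok["A"], ok["B"]):
            return "goal unreachable"
        if dpu == active:                # hand control to the other module
            active = "B" if active == "A" else "A"
    return "operational, active = " + active

print(run(4, [("A", 1), ("B", 3), ("A", 3)]))  # third failure is fatal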
We generalise the architecture of the DPU by stating that essentially a system
consists of a number of modules, and each module consists of n components:

C = C_a ∪ C_b, where C_a = {a_comp_j | j ∈ 1..n ∧ n ∈ N1} and C_b is defined analogously.

Each module relies on its components to achieve the tasks required to accomplish
G. The introduction of redundancy allows us to associate not a single but several
components with each task. We reformulate the goal reachability property
as follows: a goal remains reachable while there exists at least one operational
component associated with each task. Formally, it can be specified as:

M |= □O_s, where O_s ≡ ∀t ∈ T · (∃c ∈ C · Φ(t, c) ∧ O(c)),

and O is a predicate over the set of components C such that O(c) evaluates to
TRUE if and only if the component c is operational.
On the other hand, the system performance is a reward-based property that can
be measured by the number of successfully achieved goals within a certain time
period.
To quantitatively verify these quality attributes we formulate the following
CSL (Continuous Stochastic Logic) formulas [6]:

P=? { G ≤ t  O_s }  and  R{|goals|}=? { C ≤ t }.

The formulas above are specified using the PRISM notation. The operator P is used
to refer to the probability of an event occurrence, G is an analogue of □, R is
used to analyse the expected values of rewards specified in a model, while C
specifies that the reward should be cumulated only up to a given time bound.
Thus, the first formula is used to analyse how likely it is that the system
remains operational as time passes, while the second one is used to compute the
expected number of achieved goals cumulated by the system over t time units.
In this paper we rely on modelling in Event-B to formally define the architec-
ture of a dynamically reconfigurable system, and on the probabilistic extension
of Event-B to create models for assessing system reliability and performance.
The next section briefly describes Event-B and its probabilistic extension.
decisions. In particular, we can add new events, split events as well as replace
abstract variables by their concrete counterparts, i.e., perform data refinement.
When data refinement is performed, we should define gluing invariants as a part
of the invariants of the refined machine. They define the relationship between the
abstract and concrete variables. The proof of data refinement is often supported
by supplying witnesses – the concrete values for the replaced abstract variables
and parameters. Witnesses are specified in the event clause with.
The consistency of Event-B models, i.e., verification of well-formedness and
invariant preservation, as well as the correctness of refinement steps, is
demonstrated by discharging the relevant proof obligations generated by the Rodin
platform [11]. The platform provides automated tool support for proving.
Goal Decomposition. The aim of our first refinement step is to define the
goal execution flow. We assume that the goal is decomposed into n tasks and
can be achieved by a sequential execution of one task after another. We also
assume that the id of each task is defined by its execution order. Initially, when
the goal is assigned, none of the tasks is executed, i.e., the state of each task
is “not defined” (designated by the constant value ND). After the execution,
the state of a task might be changed to success or failure, represented by the
constants OK and NOK correspondingly. Our refinement step is essentially a data
refinement that replaces the abstract variable goal with the new variable task
that maps the id of a task to its state, i.e., task ∈ 1..n → {OK, NOK, ND}.
We omit showing the events of the refined model (the complete development
can be found in [13]). They represent the process of sequential selection of one
task after another until either all tasks are executed, i.e., the goal is reached,
or the execution of some task fails, i.e., the goal is not achieved.
Correspondingly, the guards ensure that either goal reaching has not commenced yet
or the execution of all previous tasks has been successful. The body of the events
nondeterministically
changes the state of the chosen task to OK or NOK. The following invariants
define the properties of the task execution flow:

∀l · l ∈ 2..n ∧ task(l) ≠ ND ⇒ (∀i · i ∈ 1..l−1 ⇒ task(i) = OK),
∀l · l ∈ 1..n−1 ∧ task(l) ≠ OK ⇒ (∀i · i ∈ l+1..n ⇒ task(i) = ND).

They state that the goal execution can progress, i.e., a next task can be chosen
for execution, only if none of the previously executed tasks failed and the
subsequent tasks have not been executed yet.
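For illustration, the two invariants can be checked directly on a concrete task-state list; a sketch of ours with the model's constants rendered as strings:

# inv1: once task l is decided (not ND), all earlier tasks are OK.
# inv2: once task l is not OK, all later tasks are still ND.
def invariants_hold(task):
    n = len(task)
    inv1 = all(task[i] == "OK" for l in range(1, n) if task[l] != "ND"
               for i in range(l))
    inv2 = all(task[i] == "ND" for l in range(n - 1) if task[l] != "OK"
               for i in range(l + 1, n))
    return inv1 and inv2

print(invariants_hold(["OK", "OK", "NOK", "ND"]))  # True
print(invariants_hold(["OK", "ND", "OK", "ND"]))   # False: gap in the flow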
From the requirements perspective, the refined model should guarantee that
the system-level goal remains achievable. This is ensured by the gluing invariants
that establish the relationship between the abstract goal and the tasks:

task[1..n] = {OK} ⇒ goal = reached,
(task[1..n] = {OK, ND} ∨ task[1..n] = {ND}) ⇒ goal = not reached,
(∃i · i ∈ 1..n ∧ task(i) = NOK) ⇒ goal = failed.
Finally, let us remark that the goal-oriented style of the reliability and
performance analysis has significantly simplified the assessment of the
architectural alternatives of the DPU. Indeed, it allowed us to abstract away from
the configuration of the input and output buffers, i.e., to avoid modelling the
circular buffer as part of the analysis.
In our future work we are planning to further study the properties of dynamic
reconfiguration. In particular, it would be interesting to investigate
reconfiguration in the presence of parallelism and complex component
interdependencies.
References
1. Abrial, J.R.: Modeling in Event-B. Cambridge University Press (2010)
2. BepiColombo: ESA Media Center, Space Science, online at
http://www.esa.int/esaSC/SEMNEM3MDAF_0_spk.html
3. Caporuscio, M., Di Marco, A., Inverardi, P.: Model-Based System Reconfiguration
for Dynamic Performance Management. J. Syst. Softw. 80, 455–473 (2007)
4. de Castro Guerra, P.A., Rubira, C.M.F., de Lemos, R.: A Fault-Tolerant Software
Architecture for Component-Based Systems. In: Architecting Dependable Systems.
pp. 129–143. Springer (2003)
5. Goldsby, H., Sawyer, P., Bencomo, N., Cheng, B., Hughes, D.: Goal-Based Mod-
eling of Dynamically Adaptive System Requirements. In: ECBS 2008. pp. 36–45.
IEEE Computer Society (2008)
6. Grunske, L.: Specification Patterns for Probabilistic Quality Properties. In: ICSE
2008. pp. 31–40. ACM (2008)
7. Kelly, T.P., Weaver, R.A.: The Goal Structuring Notation – A Safety Argument
Notation. In: DSN 2004, Workshop on Assurance Cases (2004)
8. Kwiatkowska, M., Norman, G., Parker, D.: PRISM 4.0: Verification of Probabilistic
Real-time Systems. In: CAV’11. pp. 585–591. Springer (2011)
9. van Lamsweerde, A.: Goal-Oriented Requirements Engineering: A Guided Tour.
In: RE’01. pp. 249–263. IEEE Computer Society (2001)
10. de Lemos, R., de Castro Guerra, P.A., Rubira, C.M.F.: A Fault-Tolerant Architec-
tural Approach for Dependable Systems. Software, IEEE 23, 80–87 (2006)
11. Rodin: Event-B Platform, online at http://www.event-b.org/
12. Space Engineering: Ground Systems and Operations – Telemetry and Telecom-
mand Packet Utilization: ECSS-E-70-41A. ECSS Secretariat, 30.01.2003, online at
http://www.ecss.nl/
13. Tarasyuk, A., Pereverzeva, I., Troubitsyna, E., Latvala, T., Nummila, L.: Formal
Development and Assessment of a Reconfigurable On-board Satellite System. Tech.
Rep. 1038, Turku Centre for Computer Science (2012)
14. Tarasyuk, A., Troubitsyna, E., Laibinis, L.: Quantitative Reasoning about Depend-
ability in Event-B: Probabilistic Model Checking Approach. In: Dependability and
Computer Engineering: Concepts for Software-Intensive Systems, pp. 459–472. IGI
Global (2011)
15. Tarasyuk, A., Troubitsyna, E., Laibinis, L.: Formal Modelling and Verification of
Service-Oriented Systems in Probabilistic Event-B. In: IFM 2012. pp. 237–252.
Springer (2012)
16. Warren, I., Sun, J., Krishnamohan, S., Weerasinghe, T.: An Automated Formal Ap-
proach to Managing Dynamic Reconfiguration. In: ASE 2006. pp. 18–22. Springer
(2006)
17. Wermelinger, M., Lopes, A., Fiadeiro, J.: A Graph Based Architectural Reconfig-
uration Language. SIGSOFT Softw. Eng. Notes 26, 21–32 (2001)
Impact of Soft Errors in a Jet Engine Controller
1 Introduction
Digital control systems for turbo-jet engines have been in operational use for almost
30 years. These systems are known as Full Authority Digital Engine Control systems,
or FADEC systems. To ensure aircraft safety, FADEC systems must be highly relia-
ble and fault-tolerant. A basic requirement is that a failure of a single hardware unit
should never cause the engine to deliver inadequate thrust.
Most FADEC systems are provided with two redundant control channels config-
ured as a primary/backup pair. Recently designed FADEC systems are typically
equipped with two electronic channels, while older designs often use a single elec-
tronic channel with a hydro-mechanical backup. Regardless of whether the backup
channel is electronic or hydro-mechanical, it is essential that the primary electronic
channel is provided with highly efficient error detection mechanisms so that a fail-
over to the backup channel is performed immediately if the primary channel should
fail.
One of the key challenges in designing a dual channel FADEC system is to provide
the electronic channels with error detection mechanisms that can effectively detect
hardware errors occurring in the microprocessor that executes the control software.
These mechanisms must ensure that the FADEC does not exhibit critical failures. A
critical failure occurs when the FADEC generates erroneous actuator commands that
cause a significant change in the engine thrust.
There are two main design options available for detecting microprocessor faults in
a FADEC control channel. One is to execute the control program on two lock-stepped
microprocessors (or cores). This solution achieves very high detection coverage since
the errors are detected by comparing the outputs of the two processors. The other
option is to use a single microprocessor monitored by a watchdog timer and various
software-implemented assertions and reasonableness checks. The latter solution has
been successfully used in several FADEC systems, including the one that controls
the RM12 engine produced by Volvo Aero.
However, many existing FADEC systems were designed for microprocessors pro-
duced during the 1980’s and 1990’s. These microprocessors were manufactured in
circuit technologies that are less sensitive to cosmic-ray induced soft errors and aging
faults than current technologies are. It is expected that technology and voltage scaling
will make future circuit technologies increasingly sensitive to these kinds of faults as
well as process variations [1]. It is therefore an open question whether the classical
design with a single microprocessor provides sufficient detection coverage for future
FADEC systems.
This paper presents the results of a fault injection study aiming to provide
insights into the error sensitivity of a single-processor control channel with
respect to microprocessor faults that manifest as transient bit errors in the
instruction set architecture registers of the processor. Such errors can be caused
by both transient and intermittent transistor-level faults, including cosmic-ray-
induced soft errors [2], electromagnetic interference [3], intermittent faults
caused by process variations [4], and aging effects such as NBTI [5], hot-carrier
injection [6] and gate-oxide breakdown [7]. We conducted the fault injection
experiments with an engineering prototype of a single-processor control channel
based on the Freescale MC68340 microprocessor. We injected single-bit faults in
the instruction set architecture (ISA) registers of the processor while it was
executing a program controlling a software model of the Volvo Aero RM12 engine.
To perform the experiments, we developed a fault injection tool called JETFI
(JET Engine Fault Injection tool) [8].
The remainder of this report is organized as follows. Section 2 explains the basic
operation of the RM12 jet engine including the main engine parameters, which we use
to describe the failure modes of the engine. Section 2 also includes an overview of the
main functions of the FADEC system. Section 3 describes the experimental system
and the fault injection procedure. The results of our experiments are presented in Sec-
tion 4. A summary is provided in Section 5 and Conclusions and Future Work are
given in Section 6.
The RM12 jet engine is a two-spool mixed flow turbofan engine shown in Fig. 1.
Its principle of operation is as follows. The intake delivers the air flow required by the
engine. The fan and the low-pressure (LP) turbine are connected by the LP shaft and
the compressor and the high-pressure (HP) turbine are connected by the HP shaft. (An
engine with two shafts is a two-spool engine.) The compressor delivers compressed
air to the burner. When the air-fuel mixture burns, it releases energy causing a high
temperature gas flow that powers the high- and low-pressure turbines.
The high-pressure turbine powers the compressor, while the low-pressure turbine
powers the fan. When the hot gas flow has passed through the low-pressure turbine, it
is further expanded and accelerated through the exhaust nozzle producing thrust. The
FADEC system controls the thrust of the engine using five actuators. These control
the guide vanes of the fan (FVG) and the compressor (CVG), the fuel mass flows to
the burner (WFM) and the afterburner (WFR) and the area of the exhaust nozzle (A8).
The pilot modulates thrust by changing the angle of a Power Lever (PLA). Besides
from the demanded thrust, the control system needs six more inputs. These are the
inlet temperature TT1, the LP shaft speed NL, the compressor inlet temperature TT25,
the HP shaft speed NH, the compressor discharge pressure PS3 and the LP turbine
exhaust gas temperature TT5. A comprehensive description of how to control the
RM12 engine is found in [9].
3 Experimental System
This section describes the main elements of our experimental set-up. We provide
an overview of the set-up in Section 3.1. The error detection mechanisms are de-
scribed in Section 3.2, while the JETFI fault injection tool is described in Section 3.3.
[Figure: experimental setup. A host computer is connected via RS232 to the “FADEC board” (computer board executing the FADEC software) and to the “Engine board” (computer board simulating the RM12 engine); the two boards are also connected to each other via RS232.]
The FADEC control software executes in a cyclic control loop with prescheduled
control tasks. It consists of 29 subsystems. A subsystem is a set of control tasks with
the same execution rate. The execution rate varies from 200 Hz for Subsystem 1 down
to 1 Hz for Subsystem 29. Subsystem 1 performs demanding control activities such as
positioning of the guide vanes, fuel flow metering and setting the exhaust nozzle area.
The other subsystems perform a variety of other control tasks and trim functions.
The Engine board executes simulation models of the engine, sensors, actuators and
the hydro-mechanical control system. We use a linearized model of the RM12 engine
to minimize execution time. We believe the accuracy of this model is sufficient
for the purpose of our experiments. The execution times would have been much
longer had we used a more accurate non-linear engine model.
The Engine board emulates a use case where the Power Lever Angle (PLA)
increases from 55º to 75º during one second of real-time execution. (Flight idle
is at 28º and max dry thrust, i.e., without afterburner, is at 100º.)
EDM | Description
WDM | A timer which must be reset periodically to prevent it from tripping, i.e., signaling that an error has occurred.
Hardware exceptions | Hardware EDMs supported by the Motorola 68340 processor.
Software exceptions | Software checks generated automatically by MATRIXx or by the programmer using the exception clause in the Ada language. They detect erroneous execution, erroneous calculations and other errors.
Software assertions | Range checks on engine parameters.
Hardware Exceptions
The Motorola 68340 processor supports 256 hardware exception vectors numbered
in the range 0 to 255. The exceptions that were triggered in the fault injection experi-
ments are Bus error, Address error, Illegal instruction, Line 1111 Emulator and For-
mat error, see Table 2.
Software Exceptions
A software exception is a general check concerning calculations and program exe-
cution. The FADEC software implemented software exceptions are shown in Table 3.
Software Assertions
The software assertions perform range checks on engine parameters and are based
on physical limitations of the jet engine and its environment. The software assertions
shown in Table 4 can detect engine failures, errors in data from sensors and wrap-
around signals from actuators (torque motor currents).
S/W assertion | Failure condition | Possible cause | Effect when not detected
TT1 out of range | The reading from the TT1 sensor is not within range. | Sensor or input data failure. | Low engine thrust and even fan surge1.
NH overspeed (HP shaft) | The measured speed of the compressor and high-pressure turbine is too high. | Overspeed of the HP shaft, failure in the input data. | There is a risk for engine disintegration.
NL sensor loss | Missing pulses in the pulse train from the NL sensor. | NL sensor failure detected by h/w. | Fan overspeed. Possible engine damage.
A8 or WFM LVDT/TM failure (actuators) | The relationship between the demanded current and the position change of the actuator does not match. | Sensor, actuator or mechanical failure of the actuation hardware. | A missed detection will result in a low or high engine thrust.
PS3 fails high | Out-of-range failure of the compressor discharge pressure. | Sensor failure. | Incorrect fuel flow and erroneous thrust.
Flame out | Engine speed and turbine exhaust temperature decrease below allowed limits. | Erroneous fuel metering. | The engine may flame out, if this occurs.

1 Fan surge causes an abrupt reversal of the airflow through the engine.
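As an illustration of this assertion style, a range check on TT1 might look as follows; the limits are invented for the example and are not taken from the RM12 documentation.

# Software assertion: range check on an engine parameter; tripping the
# check signals a sensor or input data failure.
TT1_RANGE = (-60.0, 150.0)   # hypothetical inlet temperature limits (deg C)

def assert_tt1(tt1_reading):
    lo, hi = TT1_RANGE
    if not lo <= tt1_reading <= hi:
        raise AssertionError("TT1 out of range: sensor or input data failure")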
4 Results
This section presents the results of our fault injection experiments. Section 6.1 de-
scribes how we classify the outcomes of the experiments. Section 6.2 describes the
results from five fault injection campaigns denoted A to F.
Category | Description
Detected error | An error detected by the watchdog monitor (WDM), a hardware or software exception, or a software assertion.
No effect | Nothing can be observed that is different from a fault-free experiment. The injected error is either overwritten or remains in the system but does not have any impact on
 | No effect | Watchdog | Hardware Exception | Software Exception | Software Assertion | Non-crit. failure (undetected) | Critical failure (undetected)
No. of faults | 715 | 32 | 200 | 12 | 13 | 15 | 4
Rel. freq. (%) | 72.2±2.8 | 3.2±1.1 | 20.2±2.5 | 1.2±0.7 | 1.3±0.7 | 1.5±0.8 | 0.4±0.4
The number of non-effective faults relative to the total number of injected faults
is quite normal compared to other studies [11-13]. The distribution of
experimental outcomes for the effective faults is also typical, with hardware
exceptions as the primary error detection mechanism.
ly. The fault injection was directed to the input reading part of the scheduler.
The results of Campaign B are shown in Table 8. Most faults had no effect at all:
non-effective faults amounted to 86.1%, whereas the corresponding number for
Campaign A is 72.2%. This part of the code thus proved to be less sensitive to
faults than other parts.
No effect | Watchdog | Hardware Excep. | Software Excep. | Software Assertion | Non-crit. failure (undetected) | Critical failure (undetected)
1983 (86.1%) | 2 (0.9%) | 267 (11.6%) | 0 (0%) | 20 (0.9%) | 10 (0.4%) | 2 (0.1%)
No effect | Watchdog | Hardware Excep. | Software Excep. | Software Assertion | Non-crit. failure (undetected) | Critical failure (undetected)
2465 (68.8%) | 97 (2.7%) | 739 (20.6%) | 29 (0.8%) | 103 (2.9%) | 86 (2.4%) | 65 (1.8%)
No effect | Watchdog | Hardware Excep. | Software Excep. | Software Assertion | Non-crit. failure (undetected) | Critical failure (undetected)
11278 (78.7%) | 308 (2.2%) | 1640 (11.4%) | 0 (0.0%) | 780 (5.4%) | 328 (2.3%) | 2 (0.01%)
Table 10. Results from fault injections in the PC register (Campaign E).

No effect | Watchdog | Hardware Excep. | Software Excep. | Software Assertion | Non-crit. failure (undetected) | Critical failure (undetected)
7 (11.0%) | 17 (26.6%) | 32 (50.0%) | 0 (0%) | 2 (3.1%) | 5 (7.8%) | 1 (1.5%)
It proved hard to find an address where fault injection in the Status Register
was effective. Most instructions change the contents of the Status Register but
only a few use its contents, for example branch instructions. The probability is
therefore high that a fault in the SR is overwritten. One example of an outcome
from an effective experiment was a software exception. No further experiments
were performed.
5 Summary
The effects of soft errors in a prototype FADEC controller have been evaluated
by injecting single bit-flip faults in the controller’s microprocessor while
simulating a jet engine during an acceleration sequence. Of all experiments, 67%
were non-effective. The distribution of the remaining 33% of effective errors is
shown in Table 11.
Campaign | No. of eff. exp. | Watchdog | Hardware Excep. | Software Excep. | Software Assertion | Non-crit. failure (undetected) | Critical failure (undetected)
A (Random) | 276 | 11.6% | 72.5% | 4.3% | 4.7% | 5.4% | 1.4%
B (Scheduler) | 321 | 6.9% | 83.2% | 0% | 6.2% | 3.1% | 0.6%
C (Subsys 1) | 1119 | 8.7% | 66.0% | 2.6% | 9.2% | 7.7% | 5.8%
D (Subsys 1) | 3058 | 10.1% | 53.6% | 0% | 25.5% | 10.7% | 0.1%
E (PC reg.) | 57 | 29.8% | 56.1% | 0% | 3.5% | 8.8% | 1.8%
Average | - | 13.4% | 66.3% | 1.4% | 9.8% | 7.2% | 1.9%
Ranked by efficiency (in descending order), the error detection mechanisms are:
1) Hardware exceptions
2) Watchdog monitor and software assertions
3) Software exceptions
The results of our fault injection experiments provide valuable insights into the rela-
tive effectiveness of the error detection mechanisms included in our FADEC proto-
type. They show that the hardware exceptions included in the MC68340 processor
obtained the highest error coverage, on average 66.3%. This result is consistent with
results obtained in several other fault injection studies. The results also show that the
watchdog timer and the software assertions were quite effective, obtaining average
coverage values of 13.4% and 9.8%, while the software exceptions detected merely
1.4% of the effective errors on average. Another important observation is that most of
the undetected failures were non-critical. However, the percentage of critical failures,
which varied between 0 and 6%, was higher than desirable. In particular, the high
percentage of critical failures observed for errors injected into Subsystem 1 in Cam-
paign C suggests that the code for that subsystem needs to be provided with additional
error detection mechanisms.
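For reference, the per-mechanism percentages in Table 11 follow directly from the raw outcome counts by normalizing over the effective errors only. A small sketch reproducing the Campaign A row (counts taken from the Campaign A table above):

    # Effective errors of Campaign A, by outcome class.
    outcomes = {"watchdog": 32, "hw_exception": 200, "sw_exception": 12,
                "sw_assertion": 13, "noncritical_failure": 15, "critical_failure": 4}
    effective = sum(outcomes.values())  # 276 effective errors
    shares = {k: round(100 * v / effective, 1) for k, v in outcomes.items()}
    print(effective, shares)  # 276; hw_exception -> 72.5, watchdog -> 11.6, ...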
Our future work will focus on development and evaluation of software-
implemented error detection techniques that can complement the ones we have evalu-
ated in this paper. Techniques that we plan to investigate include selective time-
redundant execution of sensitive code portions, software implemented control flow
checking and new types of software assertions. Our aim is to reduce the likelihood of
critical failure to below 0.01%. To this end, we plan to extend the JETFI tool to sup-
port test port-based fault injection and pre-injection analysis. We also plan to port our
experimental setup to a new hardware platform with faster CPUs, so that we can run
larger fault injection campaigns and thereby increase the confidence in our
experimental results. In addition, we also intend to evaluate our FADEC prototype with
respect to multiple bit errors.
Acknowledgements
This work has partially been supported by the research project Reliable Jet Engines
funded by the NFFP programme.
References
1. Chandra, V., Aitken, R.: Impact of Technology and Voltage Scaling on the Soft Error Sus-
ceptibility in Nanoscale CMOS. IEEE International Symposium on Defect and Fault Tol-
erance of VLSI Systems (DFTVS '08), pp.114-122, Oct. 2008.
2. Ibe, E., Taniguchi, H., Yahagi, Y., Shimbo, K., Toba, T.: Impact of scaling on neutron-
induced soft error in SRAMs from a 250 nm to a 22 nm design rule. IEEE Transactions on
Electron Devices, vol. 57, no. 7, pp. 1527–1538, Jul. 2010.
3. Benso, A., Di Carlo, S., Di Natale, G., Prinetto, P.: A watchdog processor to detect data
and control flow errors. 9th IEEE International On-Line Testing Symposium, pp. 144-148,
Jul. 2003.
4. Jahinuzzaman, S.M., Sharifkhani, M., Sachdev, M.: Investigation of process impact on soft
error susceptibility of nanometric SRAMs using a compact critical charge model. 9th In-
ternational Symposium of Quality Electronic Design, 2008.
5. Islam, A.E., Kufluoglu, H., Varghese, D., Mahapatra, S., Alam, M.A.: Recent issues in
negative-bias temperature instability: Initial degradation, field dependence of interface trap
generation, hole trapping effects and relaxation. IEEE Trans. Electron Devices, vol. 54, no.
9, pp. 2143–2154, Sep. 2007.
6. Kufluoglu, H., Alam, M.A.: A Computational Model of NBTI and Hot Carrier Injection
Time-Exponents for MOSFET Reliability. Journal of Computational Electronics, vol. 3,
No. 3-4, pp.165–169, Oct. 2004.
7. Cannon, E.H., KleinOsowski, A.J., Kanj, R., Reinhardt, D.D., Joshi, R.V.: The impact of
aging effects and manufacturing variation on SRAM soft-error rate. IEEE Transactions on
Device and Materials Reliability, Vol. 8, no. 1, pp. 145-152, Mar. 2008.
8. Hannius, O., Karlsson, J.: JETFI – A Fault Injection Tool for Assessment of Error Han-
dling Mechanisms in Jet-engine Control Systems. Technical Report 2012:06, ISSN 1652-
926X, Chalmers University of Technology, 2012.
9. Härefors, M.: A study in jet engine control - control structure selection and multivariable
design. Ph.D. Thesis, Chalmers University of Technology, Sweden, 1999.
10. Ward, D.K., Andrews, S.F., McComas, D.C., O’Donnell, J.R.: Use of the MATRIXx inte-
grated toolkit on the Microwave Anisotropy Probe Attitude Control System. NASA’s
Goddard Space Flight Center.
http://lambda.gsfc.nasa.gov/product/map/team_pubs/aas99.pdf
11. Autran, J. L., Roche, P., Sauze, S., Gasiot, G., Munteanu, D., Loaiza, P., Zampaolo, M.,
Borel, J.: Real-Time Neutron and Alpha Soft-Error Rate Testing of CMOS 130nm SRAM:
Altitude versus Underground Measurements. Proc. International Conference On IC Design
and Technology (ICICDT), pp. 233-236, Grenoble, 2008.
12. Autran, J. L., Roche, P., Sauze, S., Gasiot, G., Munteanu, D., Loaiza, P., Zampaolo, M.,
Borel, J.: Altitude and Underground Real-Time SER Characterization of CMOS 65 nm
SRAM. IEEE Transactions on Nuclear Science, vol. 56, no. 4, Aug. 2009.
13. Normand, E.: Single Event Upset at Ground Level. IEEE Transactions on Nuclear Science,
vol. 43, no. 6, Dec. 1996.
14. Normand, E.: Single Event Effects in Avionics. IEEE Transactions on Nuclear Science,
vol. 43, no. 2, Apr. 1996.
Which Automata for Which Safety Assessment Step of Satellite FDIR?
1 Introduction
Space systems are becoming more and more autonomous and complex, which
considerably increases the difficulties related to their validation. This is particularly true
for the FDIR functions – Failure Detection, Isolation and Recovery – which are an
essential and critical part of space systems, intended to prevent mission interruption
or loss.
This introduces several interacting phenomena. For example, a fault occurs while the
system is in a given state (or behaviour mode) and propagates according to this
state. A recovery action is taken when a fault is detected and may modify the
state of the system (for instance, a recovery action may switch off the power
supply of an electronic device). As a consequence, the initial fault propagation
may be modified, interrupted, or may activate other potentially existing and
previously hidden faults (passive faults).
Today, system specifications are generally produced in textual form and
result from an intellectual process supported by analyses largely made by hand.
This raises several issues in terms of the correctness of the specifications as well
as of the implementation with respect to the specifications.
Moreover, the development process is long and complex, and it deals with
heterogeneous concepts: architecture, physical laws, software, etc.
It appears impossible to validate all the needed concepts through a single
approach: hierarchy; different operating modes (nominal, degraded,
safe, etc.); reactive software computing monitoring and recovery actions; and even
physical laws (the environment, fault propagation, etc.).
We therefore propose an FDIR validation approach in the following three steps, so
that each one focuses on complementary validation objectives and consequently
requires different minimal validation means:
– Architectural and limited behavioural automata for the validation of the
design principles of the overall FDIR system.
– Detailed behavioural automata for the unitary validation of the detailed
software specifications.
– Continuous and non-continuous detailed behavioural automata for the uni-
tary validation of the detailed specification of the physical devices.
2 Context
2.1 System Class and Case Study
We consider safety critical systems containing both physical devices and software
controllers. To illustrate our purpose, we present the thermal system of a
satellite such as Venus Express from Astrium Satellites. This system and its
FDIR are quite simple, but they are representative of all the issues we want to study.
The thermal system aims at keeping the temperature of some satellite areas
within a predefined range of values. It is made of a primary and a backup
heating line (see the architecture given in Fig. 1). A complete heating line
is composed of 15 devices, and the heating system of a satellite can manage up
to 13 different lines.
Let us now give more details about what can be achieved with the selected
formalisms and tools.
equations; we need only to reason about some intervals of temperature values
(frost, cold, etc.). So we discretized the temperature values and introduced
nominal cooling and heating actions that increase or decrease the temperature.
Events can be associated with probability laws to complete the transition
specification. For instance, the Dirac(0) law characterizes instantaneous events
that shall be triggered as soon as their guard is true; in our case study, these
characterize the reconfigurations that are automatically triggered after failure
detection. Exponential laws are associated with stochastic events.
Finally, the assertions specify the values of the output flows according to the
constraints satisfied by the input flows in the current functioning mode.
The system model is made of several instances of generic nodes like the
heater. For analysis purposes, the model may also contain special nodes, the
observers of the undesired events of the system. They monitor some component
flows to detect whether an undesired situation is reached.
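To make the discretization concrete, the following sketch (in Python rather than AltaRica, with invented names) shows a temperature variable reduced to four intervals, nominal heating and cooling actions, and an observer flagging the undesired situation:

    from enum import Enum

    class Temp(Enum):
        FROST, COLD, NOMINAL, HOT = range(4)

    LEVELS = list(Temp)

    def heat(t: Temp) -> Temp:   # nominal heating: one interval up
        return LEVELS[min(t.value + 1, Temp.HOT.value)]

    def cool(t: Temp) -> Temp:   # nominal cooling: one interval down
        return LEVELS[max(t.value - 1, Temp.FROST.value)]

    def observer(t: Temp) -> bool:
        """Observer node: detects the undesired event 'temperature out of range'."""
        return t in (Temp.FROST, Temp.HOT)

    # A heater controller oscillating between COLD and NOMINAL never
    # triggers the observer:
    t = Temp.NOMINAL
    for _ in range(10):
        t = heat(t) if t == Temp.COLD else cool(t)
        assert not observer(t)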
At this stage, the controller logic that was previously sketched is detailed. More-
over, the resulting control laws are discretized to provide a detailed software
specification. So, the resulting specification deals not only with the management
of the operational modes but also with arithmetic and discrete time. The vali-
dation of this detailed specification is usually achieved by simulation and fault
injection.
We propose to explore the use of synchronous languages to formalize such
a detailed software specification and the use of model-checking techniques to
verify the compliance of the detailed specification with the preliminary one. The
formalism is more expressive and the model of the software component is much
more detailed than previously. However, the proof seems to remain tractable for
controllers of reasonable size with the technologies we used.
The theory of hybrid automata has been studied since the 1990s. In general, a hybrid
automaton is a (possibly infinite) state machine whose variables take values
in ℝ. The key concepts were defined in [12].
To be able to compute with the models, we cannot use this theory directly; indeed,
the reachability problem cannot be decided in all cases. A restriction of hybrid
automata – linear hybrid automata – is decidable [14].
A hybrid automaton H is linear if the two following restrictions are met [12]:
1. The initial, invariant, flow, and jump conditions of H are boolean combina-
tions of linear inequalities.
2. If X is the set of variables of H, then the flow conditions of H contain free
variables from Ẋ only.
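As an illustration (our own example, not taken from the paper), a simple thermostat satisfies both restrictions and is therefore a linear hybrid automaton:

    \begin{align*}
    &\text{mode } \mathit{On}:  &&\text{invariant } x \le 23, &&\text{flow } \dot{x} = 2,\\
    &\text{mode } \mathit{Off}: &&\text{invariant } x \ge 18, &&\text{flow } \dot{x} = -1,\\
    &\text{jump } \mathit{On} \to \mathit{Off}: x \ge 22, &&
     \text{jump } \mathit{Off} \to \mathit{On}: x \le 19, &&
     \text{init: } \mathit{On} \wedge x = 20.
    \end{align*}

All initial, invariant, flow and jump conditions are boolean combinations of linear inequalities, and the flow conditions constrain only the derivative of x.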
Hybrid automata are the most expressive theory that we use in this paper, but
there are fewer tools to edit and compute the models easily. Moreover, the theory is
restricted to linear hybrid automata. Some recent tools provide a user-friendly
graphical editor, like SpaceEx, but most of them are only textual. The situation
is similar on the analysis side: these tools “only” perform reachability analysis,
which consists in computing all the states that can be reached from an initial
state. Thus, it is possible to verify a property in each state and to see whether a
forbidden state is reached. The difference between tools lies mainly in the algorithm
which computes reachability. Few tools have been created to compute hybrid
automata. There is a very interesting overview of the hybrid tools existing before 2006
in [3], but we also want to mention SpaceEx [6], a successor of PHAVer, whose
purpose is to propose a real interface for designing hybrid automata and computing
their reachability. We decided to use HyTech [10, 11, 13], one of the first tools
developed for this purpose.
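At its core, reachability analysis is a fixed-point computation over sets of states; the tools above perform it symbolically over polyhedral state sets, but the idea can be sketched on an explicit state space (illustrative only, with invented state names):

    def reachable(init, successors):
        """All states reachable from `init`; `successors(s)` yields one-step successors."""
        reached, frontier = set(init), set(init)
        while frontier:
            new = {t for s in frontier for t in successors(s)} - reached
            reached |= new
            frontier = new
        return reached

    # Safety check: no forbidden state is reachable from the initial states.
    succ = {"init": {"heating"}, "heating": {"nominal"}, "nominal": set()}
    assert "unsafe" not in reachable({"init"}, lambda s: succ.get(s, set()))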
Thus, if the two expressions are verified, it is proved that the system will not
raise false alarms.
This result is interesting because we can prove a property formally and exhaustively,
whereas today, in automatic control, such verifications are generally performed
with tools such as Matlab/Simulink, which can model the behaviour
of these continuous and non-continuous systems more accurately but whose
verifications are mostly simulations. Even if a worst case is tested to determine the
parameters of a system, it is possible to miss a non-trivial case that leads the system
to an unsafe mode. An approach with HyTech seems better in this respect, but it is
difficult to handle complex systems and to model the behaviour as accurately as is
possible with Matlab/Simulink.
6 Conclusion
The paper reports how three kinds of automata were used to assess comple-
mentary features of satellite FDIR. The experiments showed that basic mode
automata are sufficient to handle efficiently the assessment of the overall prelimi-
nary specifications of the FDIR.
More detailed specifications need to be addressed with dedicated tools. Here
we chose to address the verification of the control software with a widely used
synchronous language. The experiments showed that the proof of the specification
is tractable on such a limited case, and fruitful.
The control specification was tested in an open-loop way. It would be inter-
esting to compare AltaRica models of the physical components with this spec-
ification to close the loop. Indeed, AltaRica expressiveness is compatible with
Lustre data-flow (see [18], a translator from AltaRica to Lustre).
Finally, the use of hybrid automata seems promising for rigorous validation
(see [16]). However, this kind of tool is much less mature than the two others
and so applications remain limited.
7 Acknowledgements
The study reported in this paper has been partly funded by CNES (contract
115205). The authors wish to thank CNES personnel and in particular Raymond
Soumagne for their contribution and support. The authors wish also to thank
the anonymous referees for their valuable comments and suggestions to improve
this final version of the paper.
References
1. A. Arnold, G. Point, A. Griffault, and A. Rauzy. The altarica formalism for de-
scribing concurrent systems. Fundamenta Informaticae, 40(2):109–124, 1999.
2. P. Bieber, C. Bougnol, C. Castel, J.P. Christophe Kehren, S. Metge, and C. Seguin.
Safety assessment with altarica. Building the Information Society, 156/2004:505–
510, 2004.
3. L.P. Carloni, R. Passerone, and A. Pinto. Languages and tools for hybrid systems
design, volume 1. Now Pub, 2006.
4. P. Caspi, D. Pilaud, N. Halbwachs, and J.A. Plaice. Lustre: A declarative language
for programming synchronous systems. In Conference Record of the 14th Annual
ACM Symp. on Principles of Programming Languages, 1987.
5. F.X. Dormoy. Scade 6: a model based solution for safety critical software devel-
opment. In Proceedings of the 4th European Congress on Embedded Real Time
Software (ERTS08), pages 1–9, 2008.
6. G. Frehse, C. Le Guernic, A. Donzé, S. Cotton, R. Ray, O. Lebeltel, R. Ripado,
A. Girard, T. Dang, and O. Maler. Spaceex: Scalable verification of hybrid systems.
In Computer Aided Verification, pages 379–395. Springer, 2011.
7. A. Griffault and A. Vincent. Vérification de modèles altarica. MAJECSTIC:
Manifestation des jeunes chercheurs STIC, Marseille, 2003.
8. N. Halbwachs. Synchronous programming of reactive systems. Number 215.
Springer, 1993.
9. N. Halbwachs, P. Caspi, P. Raymond, and D. Pilaud. The synchronous data flow
programming language lustre. Proceedings of the IEEE, 79(9):1305–1320, 1991.
10. T. Henzinger, P. Ho, and H. Wong-Toi. A user guide to hytech. Tools and algo-
rithms for the construction and analysis of systems, 1019/1995:41–71, 1995.
11. T. Henzinger and P.H. Ho. Hytech: The cornell hybrid technology tool. Hybrid
Systems II, 999/1995:265–293, 1995.
12. T.A. Henzinger. The theory of hybrid automata. In Logic in Computer Science,
1996. LICS’96. Proceedings., Eleventh Annual IEEE Symposium on, pages 278–
292. IEEE, 1996.
13. T.A. Henzinger, P.H. Ho, and H. Wong-Toi. Hytech: A model checker for hybrid
systems. International Journal on Software Tools for Technology Transfer (STTT),
1(1):110–122, 1997.
14. Thomas A. Henzinger, Peter W. Kopke, Anuj Puri, and Pravin Varaiya. What’s
decidable about hybrid automata? Journal of Computer and System Sciences,
57(1):94 – 124, 1998.
15. S. Humbert, JM Bosc, C. Castel, P. Darfeuil, Y. Dutuit, and C. Seguin.
Méthodologie de modélisation altarica pour la sûreté de fonctionnement d’un
système de propulsion hélicoptère incluant une partie logicielle. In Proceedings
of Lambda Mu, volume 15, 2006.
16. S. Mazzini, S. Puri, F. Mari, I. Melatti, and E. Tronci. Formal verification at
system level. In DASIA 2009: ESA SP-669, 2009.
17. G. Point. AltaRica: Contribution à l’unification des méthodes formelles et de la
sûreté de fonctionnement. PhD thesis, Université de Bordeaux, 2000.
18. G. Point, A. Griffault, et al. On the partial translation of lustre programs into the
altarica language and vice versa. 2006.
A novel modelling pattern for establishing failure models
and assisting architectural exploration in an automotive
context
1 Introduction
With the introduction of the ISO 26262 standard [1] for functional safety of E/E
(Electric/Electronic) systems in road vehicles, a number of problems have arisen that
must be solved in the industry in order to be compliant with the prescriptions of the
standard. We address the entire reference life cycle of ISO 26262, looking at what
support is needed at the different stages. In particular we investigate the support
needed when deciding both what communication topology and what communication
technology to use for a given distributed system. One particular challenge of such
systems is designing them to handle asymmetric failures. If unanticipated and
unhandled, these failures cause inconsistency in the system and potentially unsafe
situations. An asymmetric failure can be caused when data is distributed, e.g. due to a
fault in the communication bus induced by external disturbance. Some receivers will
have correct data while others have e.g. no data or data that is incorrect or corrupt. An
example of a mitigation mechanism for asymmetric failures is a membership
agreement protocol [2].
According to the ISO 26262 reference life cycle, there are at least four architectural
decision points when performing a design. The first one is part of the Functional
Safety Concept (in part three of the standard); the second one is part of the Technical
Safety Concept (in part four); the remaining two relate to the hardware and software
designs respectively (in parts five and six). In each of these steps, all safety
requirements shall be identified and allocated onto the elements of that architecture.
Our concern in this paper is the safety requirements that address inconsistency
between different architectural elements. Such inconsistencies can be due to
asymmetric failures.
We use the service brake as an example item to illustrate the problem (see Fig. 1).
Here, requirements to address asymmetric failures appear already in the safety goals. For
example, the safety goals could state that a certain integrity level is needed to avoid
asymmetric braking of the four wheels of the vehicle. Then in every architectural
design step, we may introduce distributed realization of the functionality, such as
functionality in multiple distributed cooperating processes. This introduces the need
for safety requirements on the integrity to avoid unsafe effects of these distributions,
e.g. inconsistency of data. For example, if the Functional Safety Concept has a
general functional architecture with two blocks for distributed calculation of the four
braking forces, we need to add functional safety requirements to avoid unsafe
inconsistencies between these two blocks. In the Technical Safety Concept the
general system design is decided including the topology of electronic control units
(ECU) and communication links between them. In the hardware design, the
technology to realize each ECU and communication link is then decided. Depending
on which topology is chosen and which bus technology is used for the
communication links, the safety requirements dealing with asymmetric failures differ.
In this paper we present a guide to support the architectural choices and the
formulation of the corresponding safety requirements related to asymmetric failures.
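The distinction can be made concrete with a toy model of how the four wheel nodes perceive one distributed brake-force value (names and values invented for illustration):

    # Each wheel node's view of one broadcast brake-force result
    # (None = nothing received).
    symmetric_loss  = {"FL": None, "FR": None, "RL": None, "RR": None}
    asymmetric_loss = {"FL": 42,   "FR": None, "RL": 42,   "RR": 42}

    def is_asymmetric(views: dict) -> bool:
        """Asymmetric: non-faulty nodes have different perceptions of the result."""
        return len(set(views.values())) > 1

    assert not is_asymmetric(symmetric_loss)
    assert is_asymmetric(asymmetric_loss)   # FR misses the value -> inconsistency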
Fig. 1. [Figure: the service brake example item with its wheel nodes.]
In part eight of ISO 26262 there is a collection of requirements regarding the safety
requirements applicable at all the different phases of the reference life cycle. It is
stated in clause six that for integrity levels ASIL C and ASIL D, it is highly
recommended to use semi-formal methods for the safety requirements. This means
that the safety requirements are not adequately expressed by using only free-text
natural language. That is why there is a need for at least some patterns to guide the
system designer when identifying safety requirements including their allocations onto
a system architecture. In this paper we address this need with respect to asymmetric
failures. We show how the safety requirements can be described by semi-formal or
formal representation in architecture description languages (ADL) such as
AUTOSAR [3], EAST-ADL [4] and AADL [5, 6]. Then we introduce a dedicated
modelling pattern (Compute and Distribute Result, CDR) suited for supporting
decisions on bus topology and bus technology, and for identifying the corresponding
safety requirements. This CDR pattern is not dependent on any given ADL.
Furthermore, we give an example of the analysis of a FlexRay bus with the CDR pattern.
The outcome is a failure model for the FlexRay bus based on analysis according to the
pattern.
2 Related Work
EAST-ADL [4] is an architecture description language for automotive embedded
systems. It represents engineering information from abstract feature models to the
concrete system architecture. AUTOSAR elements [3] represent the realization of the
EAST-ADL model and form a consistent model together with the more abstract
EAST-ADL elements.
Safety extensions to EAST-ADL make it possible to capture ISO 26262
information in a model-based manner [7, 8]. Hazards and safety goals are associated
with the abstract features, while functional and Technical Safety Concepts relate to the
system architecture solution on analysis and design level. Hardware and software
safety requirements relate to AUTOSAR elements on implementation level. The
EAST-ADL system model and also the AUTOSAR elements can be extended with
error models that formalise error propagation and the failures on interfaces within and
on system borders.
Fig. 2. One compute and distribute result (CDR) operation from producer to consumer.
Fig. 3. Consecutive CDR operations. Each independent chain of operations is denoted with
a capital letter such as A.
A CDR operation ends with the delivery to the consumer, i.e. the model encompasses
the producer (process), but not the consumer (process). In a concrete system, a producer
may be a process executing on a node and the consumer may be a process on the same
or another node. In an allocated system there are several components involved between
the producer and the consumer in the distribution of the result, see Fig. 2. Examples are
middleware software (sender and receiver services), the processor and the network
hardware. A failure of any of these intermediate components leads to failure of the
CDR operation.
An advantage of modelling system operation in terms of CDR operations is that
details in the path from producer to consumer are abstracted away. A fault in an
intermediate component leads to failure of the operation, e.g. the result is not
received at all consumers. This simplification removes the need to keep track of nodes
and of the propagation of errors such as an erroneous result. For example, a process in a
node can act as consumer in one CDR operation and then act as producer in the next
CDR operation. Individual component failures, such as failures in incoming or
outgoing links, do not have to be handled separately, but are rather handled as a
failure of the entire CDR operation. A CDR operation is a one-shot operation, i.e. it
concerns the distribution of one single result. Continued service from a system can
consist of consecutive staggered CDR operations, see Fig. 3. In this example there are
three independent chains of CDR operations, A, B and C, where each realises a service.
With the CDR pattern (or guideline), a system can be analysed and its failure
model organised. In EAST-ADL terminology, CDR failure
models describe a combination of FDA (Functional Design Architecture) and HDA
(Hardware Design Architecture) failures for a specific function. The pattern can be
applied to an unallocated FDA to explore the failures of different implementations and
thus to compare them. Although all involved elements affect the failure model, the
focus in this paper is the choice of communication network. The CDR pattern and its
related models are further investigated in [12].
Receive omission failure - The expected result or frame is not received by a node.
Signalled failure - The affected component in the system, e.g. a process or
middleware, is unable to perform its service and instead sends an error frame.
Blocking failure - A node jams other traffic on the network, e.g. by sending too
many frames, untimely frames or disturbance.
Addressing failure - A corruption in the frame affects the source or destination
address, e.g. frame masquerading.
Insertion failure - A spurious unexpected frame is received, e.g. frame commission
failure.
Repetition failure - An old result or frame is repeated.
Sequence failure - Frames are received in the wrong sequence.
The symmetry aspect of a failure determines whether or not all nodes in the system
experience the failure identically, i.e. whether it is symmetric or asymmetric. A
symmetric failure is perceived identically by all non-faulty nodes. An asymmetric
failure occurs when different nodes have different perceptions of the failure, e.g. a
message is omitted at some but not all nodes.
Detectability determines whether or not the failure can be detected by the receiving
node; a failure is either detectable or undetectable. An undetectable failure
implies that individual assessment based only on incoming frames does not suffice to
resolve the failure. For example, a corrupt message can usually be detected with a
CRC, whereas data from an erroneous producer is not necessarily detectable.
Persistence, the timing characteristic of a fault or failure, does not apply to the
CDR operation failure model, since a CDR operation is a “single shot” operation. A
fault is therefore always regarded as a single occurrence during one operation and
leads to one CDR failure. However, from the perspective of sequential CDR
operations, persistence is applicable: a single fault can cause multiple CDR
failures. The persistence of a fault can be transient – a single occurrence of the
fault, intermittent – recurring presence of the fault, or permanent – continuous
presence of the fault. A transient fault will only affect a single CDR operation while a
permanent fault affects consecutive CDR operations. A fault leads to a failure which
is either temporary or permanent.
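The taxonomy above can be summarized as a small data structure; the sketch below (illustrative, not part of the paper's pattern definition) records a CDR failure together with its symmetry and detectability aspects:

    from dataclasses import dataclass
    from enum import Enum

    class FailureType(Enum):
        RECEIVE_OMISSION = "receive omission"
        SIGNALLED = "signalled"
        BLOCKING = "blocking"
        ADDRESSING = "addressing"
        INSERTION = "insertion"
        REPETITION = "repetition"
        SEQUENCE = "sequence"

    @dataclass(frozen=True)
    class CDRFailure:
        """One failure of a single-shot CDR operation."""
        kind: FailureType
        symmetric: bool    # perceived identically by all non-faulty nodes?
        detectable: bool   # resolvable from incoming frames alone?

    # An omission seen by only some consumers: asymmetric but detectable.
    f = CDRFailure(FailureType.RECEIVE_OMISSION, symmetric=False, detectable=True)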
[Figure: EAST-ADL error model for the brake-by-wire example. The error models
Pedal_CDR, LocBrake_CDR and GlobalBrk_CDR within FDA_BBW are annotated with
safety constraints: an AsymmetricForceDeviation constraint (ASILValue=D, Deviation=10)
and OmissionForce constraints (ASILValue=C, Deviation=28). Below, the functional
design architecture FDA_BBW (brake pedal and brake actuator local device managers,
brake controller, basic software and hardware functions) is allocated onto the hardware
design architecture HDA_BBW (pedal sensor, ECUs ABS_1 to ABS_4, and the four brake
actuators BrakeFrontLeft, BrakeFrontRight, BrakeRearLeft, BrakeRearRight).]
The asymmetric force deviation must not exceed 10. What these five constraints
illustrate is that small asymmetric deviations (10) are more critical than large symmetric
deviations (28). The figure also shows the hardware architecture, and suggests that the
CDR pattern is used for the failure model to assess different allocation choices.
6 Acknowledgements
This work was partially supported by the European Commission, through the
KARYON project, call FP7-ICT-2011-7, grant agreement number 288195; and
through the MAENAD project, call FP7-2010-ICT-GC, grant agreement number
260057.
References
[1] ISO, "International Standard 26262 Road vehicles -- Functional safety," ed,
2011.
[2] C. Bergenhem and J. Karlsson, "A Process Group Membership Service for
Active Safety Systems Using TT/ET Communication Scheduling," in
Dependable Computing (PRDC 2007). 13th Pacific Rim International
Symposium on, Melbourne, Australia, 2007, pp. 282-289.
[3] AUTOSAR, "AUTOSAR, An industry-wide initiative to manage the
complexity of emerging Automotive E/E-architectures," in Vehicle
Electronics to Digital Mobility: The Next Generation of Convergence. vol.
SAE/P-387, ed, 2004, pp. 325-332.
[4] P. Cuenot, et al., "Engineering Support for Automotive Embedded Systems -
Beyond AUTOSAR," ATZautotechnology 2009.
[5] "Architecture Analysis & Design Language (AADL) Version 2," in AS-5506,
ed: SAE, 2009.
[5] "Architecture Analysis & Design Language (AADL) Version 2," in AS-5506,
ed: SAE, 2009.
[6] P. H. Feiler and A. E. Rugina, "Dependability Modeling with the
Architecture Analysis and Design Language (AADL)," Carnegie Mellon
Software Engineering Institute, 2007.
[7] R. Johansson, et al., "A road-map for enabling system analysis of
AUTOSAR-based systems," presented at the Proceedings of the 1st
Workshop on Critical Automotive applications: Robustness & Safety
(CARS), Valencia, Spain, 2010.
[8] D. Chen, et al., "Integrated Fault Modelling for Safety-Critical Automotive
Embedded Systems," Springer IE&I elektrotechnik und informationstechnik,
vol. 128, pp. 196-202, 2011.
[9] Y. Papadopoulos, "Safety-Directed System Monitoring Using Safety
Cases," Ph.D. thesis, University of York, UK, 2000.
[10] AltaRica, a language designed to model both functional and
dysfunctional behaviours of critical systems, 2012.
[11] A. Avizienis, et al., "Basic concepts and taxonomy of dependable and secure
computing," Dependable and Secure Computing, IEEE Transactions on, vol.
1, pp. 11-33, 2004.
[12] C. Bergenhem and J. Karlsson, "A General System, Processing and Failure
Model for Fault Tolerance related Protocols in Safety-Critical Distributed
Real-Time Systems, TR 08-18," Chalmers University of Technology Dept.
of Computer Science and Engineering, Gothenburg, Sweden TR 08-18,
2008.
[13] Flexray, FlexRay Communications System Protocol Specification Version
2.1: FlexRay Consortium, 2005.
[14] A. Ademaj, "Slightly-off-specification failures in the time-triggered
architecture," presented at the Proceedings of the Seventh IEEE International
High-Level Design Validation and Test Workshop(HLDVT), 2002.
[15] T. M. Forest, "The FlexRay Communication Protocol and Some Implications
for Future Applications," in SAE Convergence, 2006.
Reviewing Software Models
in Compliance with ISO 26262
Keywords: Model review, ISO 26262, functional safety, quality assurance pro-
cess, model-based development, model architecture, modeling guidelines
1 Introduction
In the automotive domain, the approach for developing embedded software has
changed in recent years. Executable graphical models are now used at all stages of
development: from the initial design phase to the implementation. Model-based de-
sign is now recognized in process standards such as the ISO 26262 standard for the
automotive domain and even in the DO-178C standard for the avionics sector. Soft-
ware models are used for verifying functional requirements as an executable specifi-
cation, and also as so-called implementation models used for controller code genera-
tion. The models are designed with common graphical modeling languages, such as
Simulink and Stateflow from The MathWorks [6] in combination with automatic code
generation with TargetLink by dSPACE [5] or the Real-Time Workshop/Embedded
Coder by The MathWorks. Model-based development provides an efficient approach
for the development of software for embedded systems. Figure 1 illustrates the model-
based development process in a simplified way. In the first stage, the system to be
built is modeled with a graphical modeling language. This model is created on the
basis of the textual requirements and is therefore often called the functional model at
this stage. Since the functional model focuses on the design of the control function and
on checking the functional behavior with regard to the requirements, it cannot directly
be used as a basis for production code creation. Implementation details which are the
prerequisite for automatic code generation are not considered here. Therefore the
functional model needs to be manually revised by implementation experts with
respect to the requirements of the embedded target system (e.g. function parts are
distributed to different software units, arithmetic is adapted to the fixed-point target).
Furthermore, it is often necessary to restructure the functional model with respect to a
planned software design. The result of this manual conversion is the so-called
implementation model. Finally, the implementation model is automatically translated to
source code by a code generator.
Fig. 1. [Figure: simplified model-based development process. The textual requirements
specification is refined into an executable specification (functional model, floating-point
arithmetic), then into an implementation model (fixed-point arithmetic), from which C code
and finally object code are generated. Two proofs accompany the process: the model
behavior is equivalent to the requirements, and the code behavior is equivalent to the model
and the requirements.]
The ISO 26262 standard is an adaptation of the IEC 61508 functional safety standard to
the development of safety-related automotive electric/electronic systems. The standard
considers functional safety aspects of the entire development process. It provides an
approach for identifying and classifying hazardous situations (risks) during the risk
analysis. As an outcome of the risk analysis and risk assessment, so-called Automotive
Safety Integrity Levels (ASILs) are assigned to all HW and SW parts that could
influence a hazardous situation. The standard defines for each ASIL specific measures
that must be applied in order to achieve an acceptable residual risk. The process for
developing safety-related software in compliance with ISO 26262 is based on the
V-Model of software development. For each development phase, requirements (1) on
the development activities and (2) on the work products are defined, and (3) obligations
on the corresponding quality assurance methods are imposed. The standard recognizes
model-based development as a meaningful method to increase the quality of the
software to be developed. In more detail, the seamless utilization of software models, as
the central artifact in model-based development, “facilitate highly consistent and
efficient development”1.
Figure 3 shows the phase model of software development proposed by the standard.
The standard specifies a two-part strategy to ensure functional safety.
The design phases on the left-hand side of the V-Model include reviews that aim at
ascertaining compliance with the overall safety goals and at ensuring that the
requirements will be correctly implemented. The testing phases on the right-hand side
ensure that the safety requirements on the work products are fulfilled. It is obvious that
this approach is not a novelty. In particular, the testing process recommended by the
1
ISO 26262-1, Annex B.1
The ISO 26262 standard requires reviews for the work products that are created in
the three design phases (ref. Fig. 2). The (1) safety requirements review investigates
the safety requirements of the software in order to ensure their compliance and
consistency with the overall safety requirements. Moreover, the hardware-software
interface specification, which is refined during the specification phase, is validated to
ensure consistency with the overall system design. The (2) SW architecture review
concludes the software architecture design phase. It aims at ensuring compliance of
the software architecture with the software safety requirements defined in the
preceding phase. Furthermore, the compliance of the architecture with the architecture
design principles imposed by the standard is approved. The (3) SW unit review
investigates the software unit design and implementation and has to show the
fulfillment of the software safety requirements and the compliance of the software unit
with the hardware-software interface specification. Moreover, it has to be verified that
the unit design is correctly implemented and that the implementation complies with the
coding and modeling guidelines, respectively. The standard explicitly requires evidence
that the source code fulfills the safety requirements. An indispensable precondition
for all three reviews imposed by ISO 26262 is the traceability of the software safety
requirements.
Review procedures focused on the verification of requirements specifications, e.g.
Fagan inspections [8], can be adapted to perform model reviews. The general objectives
of model reviews are: (1) to check whether or not the textually specified functional
requirements are realized in the model; (2) to ensure that relevant modeling guidelines
are fulfilled (e.g. naming conventions, structuring, modularization); (3) to check that a
number of selected quality criteria such as portability, maintainability and testability
are met; (4) to check that the implementation model meets the requirements for the
generation of safe code (e.g. robustness) and efficient code (e.g. resource
optimizations). To handle the complexity of this task, model reviews are often guided
by an in-house set of specific modeling and review guidelines. These are commonly
summarized in a review checklist. During the model review, a series of findings with
suggestions and comments on individual model parts are gathered and recorded with a
reference to the affected model elements. These references enable the developer to
track which parts of the model must be revised.
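A finding record of this kind could be structured as follows (a sketch with invented field names, not a prescribed format):

    from dataclasses import dataclass

    @dataclass
    class ReviewFinding:
        """One finding recorded during a model review."""
        model_element: str     # reference to the affected model element
        checklist_item: str    # violated guideline or review-checklist item
        comment: str           # suggestion for the developer
        resolved: bool = False

    finding = ReviewFinding(
        model_element="BrakeController/Subsystem3",
        checklist_item="naming conventions",
        comment="signal name does not follow the in-house prefix rules",
    )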
Compliance with modeling guidelines is important to increase the comprehensibil-
ity (readability) of the model, to facilitate maintenance, to ease testing, reuse, and
extensibility, and to simplify the exchange of models between OEMs and suppliers.
Our best-practice approach for model reviews presented in this paper is restricted to
the second and third design phases of the ISO reference model (ref. Fig. 2). We
provide an approach for the SW architecture review that concludes the software
architectural design phase, and for the SW unit review that concludes the SW design
and implementation phase. Our experience has shown that the development approach
has no influence on the safety requirements review: in contrast to the other two
reviews, safety requirements reviews do not need to take aspects specific to model-
based development into consideration. For this reason, we do not address safety
requirements reviews to their full extent in this paper. In the following, we present an
overview of our review approach.
We aim to conduct the reviews required by ISO 26262 as effectively and efficiently as
possible. We achieve this goal by different means. First, we identify appropriate
review points for the different phases of the development cycle and select appropriate
review objects. Second, we seek efficient tool support to automate review
tasks. Third, we define entrance criteria (pre-conditions) for review objects that
ensure minimum standards for review objects in order to avoid time-consuming
manual reviews.
Fig. 2. [Figure: review points in the development process. The review objects are the
architectural design specification and the implementation model: the architecture review
and the design review each combine ISO 26262 requirements with best practice. The
process runs from the requirements specification via the executable specification
(floating-point arithmetic) and the implementation model (fixed-point arithmetic) to
generated C code and object code.]
conventions for signal names, subsystems, etc. are checked; (5) architectural design
patterns can be checked (e.g. every SW unit is realized as an atomic component and
has an INMAP and an OUTMAP for signal conditioning). The pre-condition check
phase already focuses on some aspects of the architectural design review, namely
checking adherence to design guidelines (hierarchical structure, size of components
and interfaces). Once the model is reworked, the pre-condition check procedure is
carried out again, until specific quality values are reached; these are collected and
assessed with another quality assurance tool suite, the Model Quality Assessment
Center (MQAC) [13].
When reviewing this artifact, we put the focus on the first two review goals. It must be
checked that all software safety requirements are considered. The review must be
carried out jointly by the persons responsible for the functional model and the software
architecture in order to avoid model designs that lead to costly, i.e. manual,
transformations of the functional model into the implementation model. Especially the
evaluation of the compatibility with the target hardware requires expert knowledge,
since this goal requires estimations with respect to the target architecture although only
limited data is available. We postpone checking the adherence to design guidelines on
the architectural level until the functional model is available. We analyze the model for
an inappropriate distribution of functionality that might lead to very complex
components which are prone to errors. We must also review these architectural design
guidelines on the implementation model, because in the functional model information
on the mapping onto source code components is not available. We exploit the fact that
this analysis can be carried out automatically by applying M-XRAY [11]. This tool
measures the complexity of models and analyzes the model hierarchy and function
distribution. The first automatic review allows us to identify model parts which can
result in overly complex code structures very early in the development process.
Because the analysis investigates the model architecture, the second automatic review
of the implementation model is necessary in order to verify the software architecture
and achieve compliance with ISO 26262.
Checking the software safety requirements can often be facilitated by tool support. We
conduct the review by investigating the implementation model in parallel with the
generated source code. This approach also eases the understanding of the generated
code.
5 Lessons learned
Our experience has shown that three conditions are extremely important for conducting
the reviews required by ISO 26262: (1) the traceability of requirements, (2) the
documentation of the model and (3) compliance with modeling guidelines. The effort
required for the documentation of the model and the traceability of requirements pays
off. The same is true for the adherence to uniform guidelines. These conditions reduce
the time required to understand the functionality of a model, which is necessary to
verify compliance with the safety requirements. Moreover, they expand the group of
possible reviewers because they ease the review process for project-external personnel.
It is realistic to establish them because these conditions can be checked automatically
with tools that can be easily integrated into the development process and employed by
the model developer. The investment is also justified by the fact that they help to
increase the overall quality of the models.
In the SW unit design phase we analyzed the model manually for compliance with
the software safety requirements. Here, it is important that the stated requirements can
easily be traced down to the model (in this case by textual IDs noted in the model).
Whether a model part is compliant with a specified safety requirement often cannot
easily be answered. The reason is that even if the realization of a safety requirement is
present on the model level, e.g. a function has been implemented in a redundant way,
it has to be verified in the subsequent testing phase whether this approach has been
implemented correctly. As a result, we decided that it is sufficient to identify the
‘logical’ realization of the safety requirement, which must be verified with so-called
safety test cases afterwards. The biggest challenge is then to show that safety-relevant
subsystems are decoupled from, and not functionally dependent on, the other model
parts which are not safety-relevant. If this cannot be shown, the whole model has to be
assigned the highest ASIL level. This review procedure includes a manual check of all
incoming interface signals.
6 Conclusion
The new ISO 26262 is the first development standard in the automotive sector that
recognizes model-based development as a paradigm that improves the quality of the
software to be developed. Model reviews are regarded as an important quality
assurance method to check the compliance of the software model with the safety
requirements. Several review tasks can be solved automatically because efficient tool
support is available. However, it must be noted that ISO 26262 requires validation
that the generated source code fulfills the safety requirements. In this paper, we
provide an approach for conducting the required reviews for model-based
development. Our approach is based on a combination of automatic and manual
reviews. We identified the artifacts that should be reviewed in order to achieve the
required review goals as efficiently as possible. Although many review tasks can be
solved automatically, a manual review is always required because the fulfillment of
the safety requirements cannot be checked automatically. The biggest challenge in
model design and model review according to ISO 26262, however, is to ensure that
the safety-related software functions (units) are decoupled from the non-safety-related
SW units and that they are not functionally dependent on them.
Software Architecture of a safety-related Actuator in Traffic Management Systems
1 Introduction
The increasing traffic density in urban and inter-urban areas, as well as the desire to
increase road safety, has resulted in a multitude of measures. One possibility is the use
of traffic management systems. Such a system does not decrease traffic per se, but
supports the distribution of traffic in a more efficient way. Furthermore, it informs
and guides drivers about upcoming dangerous situations like traffic jams. Both
optimizing traffic distribution and supporting road safety are accomplished by
displaying aspects or textual information to drivers. The aspects are mostly shown by
means of actuators – so-called Variable Message Signs (VMS). A typical VMS
includes a graphical part, where speed limits or warning signs are displayed,
supplemented by a text part showing “traffic jam” or “accident”.
According to European and national standards such as EN 50556 [3] or the German
standard VDE 0832 [2], actuators within traffic control systems have to fulfill a
number of requirements relating to the hardware, the software, the application, the
integration within an overall system and the engineering process. The requirements on
the software and its process are very similar to the generic international standard
IEC 61508 [5]. The objective of this paper is to present a software architecture of a VMS
that meets the requirements of EN 50556 in general and of VDE 0832 in the case of
traffic management systems in particular.
The remainder of the paper is structured as follows: Section 2 gives an
overview of the domain of traffic control and discusses relevant parts of the standards
VDE 0832 and EN 50556, respectively. Section 3 presents a hazard analysis using the
HAZOP method. Section 4, in turn, introduces the safety-related software
architecture derived from the safety and domain-specific requirements. Section 5 then
validates the software architecture by going through a typical use case of a VMS;
the main steps within the architecture are highlighted. Finally, Section 6
concludes by summarizing the key facts of the work done.
2 Related Work
This section is split into two subsections. The first subsection gives information on
the domain of traffic management systems in general and on actuators used in the
systems with focus on VMS. The second subsection highlights challenges and the key
facts of national standard VDE 0832 and European standard EN 50556 on road traffic
signals.
Outstations are responsible for data processing and autonomous control jobs. In the
following, the aforementioned control entities are subsumed by the term Higher Order
Control Unit (HOCU).
The VMS is composed of a graphical part and optionally a text part below the
graphical one, as shown in [1]. LED modules equipped with LEDs of different colors
(white, red, yellow or green) are used to show different aspects. The LED modules
include an LED driver and a hardware watchdog. The modules are connected in series,
with the first module connected via dedicated cables to a microcontroller. The
controller runs software that includes functionality to receive commands via various
protocols from a HOCU, to process the commands, and to execute them by
activating and deactivating aspects, respectively.
Standard. The German standard VDE 0832 consists of seven parts (100, 110, 200,
300, 310, 400, 500) and defines requirements on the development, construction,
validation and maintenance of road traffic signals. In contrast to generic standards like
IEC 61508, this standard relates to a defined product.
Part 100 of the standard is identical to the German version of EN 50556,
including a national foreword. Part 400 is a prestandard and deals with the integration
of VMS in traffic management systems; additionally, it specifies requirements on
VMS. Finally, the prestandard part 500 gives requirements on safety-related software
of road traffic control systems. This part refers to IEC 61508 and its
requirements.
According to VDE 0832, part 400, three failure modes must be considered in case
of a VMS:
1. Aspect unintentionally switched off: Due to a fault, the aspect is no longer shown
although it was not switched off intentionally by the user.
2. Corrupted aspect shown: Due to a fault, the aspect is not shown as defined. Either
too many or too few LEDs are switched on.
3. Aspect unintentionally switched on: Due to a fault, an aspect is shown although it
was not switched on by the user.
3 Hazard Analysis
Before starting with the hazard analysis, the scope of the VMS has to be specified,
as addressed e.g. in the IEC 61508-1 life-cycle model. The equipment under control
to be looked at is the VMS. It shall display a “red cross”, “green arrow down”,
“yellow arrow left” and “yellow arrow right”. The application area of the VMS shall
be roads with two-way traffic. Consequently, safety class ‘D’ is required according
to VDE 0832, part 400. The basic overall architecture of the VMS is identical to the
one presented in Section 2.1. The sign controller runs safety-related software (see Fig.
2).
The scope of the hazard analysis relates to hazards causing harm to the user.
Therefore, the display as the interface to the user (e.g., a driver on the motorway) is of
interest. Beyond the scope of the analysis is the data exchange by means of a protocol
between a HOCU and a VMS, because various approaches and implementations are
already available [8-10].
Since the impact of failures on the environment shall be investigated, a hazard and
operability (HAZOP) study is an appropriate approach. A HAZOP study according to
Def-Standard 00-58 [6] is a well-defined method to analyze a system at its boundaries.
The HAZOP includes pre-defined keywords that are applied to specify the deviation
from the expected result, the cause of the deviation and the (negative) effect on the
system or environment.
In Table 1 the HAZOP of the function “Switch on ‘red cross’” is presented in a
general way. The content of the column “effect” is linked to one of the three failure
modes mentioned in Section 2.2. The column “possible cause” includes the typical
faults that have to be addressed by safety measures.
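To give the flavour of such an analysis, the sketch below applies a few HAZOP guide words to the function and maps each deviation to one of the three failure modes of Section 2.2 (an illustrative mapping, not a reproduction of the paper's Table 1):

    # Guide word -> (deviation of "switch on 'red cross'", VDE 0832 failure mode)
    hazop = {
        "NO":         ("aspect is not shown at all", 1),
        "MORE":       ("too many LEDs are switched on -> corrupted aspect", 2),
        "LESS":       ("too few LEDs are switched on -> corrupted aspect", 2),
        "OTHER THAN": ("a different aspect than 'red cross' is shown", 3),
    }
    for word, (deviation, failure_mode) in hazop.items():
        print(f"{word:10s} -> {deviation} (failure mode {failure_mode})")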
[Table: communication failures (repetition, loss, incorrect sequence, data corruption,
delay) and the safety measures addressing them (acknowledgement, sequence number,
data protection/watchdog, time stamp, time out/echo); each failure is covered by two of
these measures.]
• Switch on/off aspect: This function switches on an aspect such as a “red cross” or
switches off the current aspect respectively. The trigger for this function is sent
from a HOCU via the communication protocol. Additionally, in case of a failure
detected by a proof-test, the function “switch off” is used as well.
• Cable test: In order to switch aspects on or off, a formatted byte stream has to be
sent to the drivers on the LED modules. The connection from the microcontroller
to the LED modules, as well as the connection between two LED modules, is
provided by cables. To detect a broken cable, a predefined data stream is
periodically sent from and received by the microcontroller. If the sent and received
data streams match, the cable is not broken. Otherwise no data is received, resulting
in a stop of communication by the microcontroller and the fail-safe state of the
display, since the hardware watchdog on the LED modules (cf. Section 4.3) is no
longer triggered.
• Check of display status: As outlined in [7], VMS are manifold in their design and
hence are adapted to specific customer needs by means of configuration
parameters. A wrongly set parameter might lead to displaying a different aspect than
expected. Consequently, this proof-test periodically reads back data from each
driver on the LED boards and compares the received data with the actual one. If
the data sent and received match, the right aspect is shown. Otherwise, the fail-safe
state is entered.
• LED test: The status of all LEDs is checked periodically to detect a broken LED or
a broken driver. In most cases a predefined failure limit (e.g., more than 5 or 10
broken LEDs) is specified. Only if the limit is exceeded is the fail-safe state of the
display entered.
If the software watchdog is not triggered by at least one task, the software is most
likely in a deadlock. In that case a software reset of the microcontroller is performed.
• Memory access integrates all features to access the volatile (i.e., internal and
external RAM) and the non-volatile memory. Two types of event-triggered proof-
tests are included in the module to ensure integrity of permanently and temporarily
stored data.
o Proof-test of the volatile memory at startup: After a reset of the
microcontroller, the memory is checked with the help of a memory
test to detect static faults leading to corruption of data. Typical tests
are GALPAT or the Checkerboard test differing in their diagnostic
coverage as mentioned in [11]. If a fault is detected, the microcontroller is switched to the fail-safe state (i.e., an endless loop).
o Proof-test during operation: Safety-critical data is protected by checksums. Every time data is read, the checksum is recalculated and compared with the stored one. If the calculated and the stored checksum are equal, the integrity of the data is granted. If not, the data is discarded or the microcontroller is reset. In case of three resets in a row due to the same event, the fail-safe state is entered (a minimal sketch follows below).
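A minimal sketch of this proof-test, including the three-resets rule, follows; zlib.crc32 merely stands in for whatever checksum the real implementation uses, and the reset bookkeeping is an illustrative assumption:

import zlib

MAX_RESETS = 3
reset_counts = {}   # event -> number of resets in a row (persistent in reality)

def protected_read(data: bytes, stored_checksum: int, event: str) -> bytes:
    """Recalculate the checksum on every read and compare it with the stored one."""
    if zlib.crc32(data) == stored_checksum:
        reset_counts[event] = 0          # integrity of the data is granted
        return data
    reset_counts[event] = reset_counts.get(event, 0) + 1
    if reset_counts[event] >= MAX_RESETS:
        # Three resets in a row due to the same event: fail-safe state
        # (an endless loop in the firmware).
        raise SystemExit("fail-safe state")
    raise RuntimeError("checksum mismatch: discard data or reset")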
5 Safety Assessment
The objective of this section is to prove that the safety and safety integrity
requirements are met. Consequently, it is demonstrated that adequate safety measures
are implemented and hazards mentioned in Table 1 are sufficiently addressed. For that
reason, the use-case “switch on ‘red cross’” is taken as an example.
It is assumed that the VMS is installed on a gantry and is connected to a HOCU via
a dedicated communication line. The protocol used to exchange messages is TLS
[10]. The VMS includes predefined aspects “red cross”, “green arrow”, “orange arrow
right” and “orange arrow left”. Every aspect has a unique ID.
An operator is triggering the command “switch on ‘red cross’” at the central
station. The message with the ID and the command “switch on” is received by the
VMS (see Fig. 3, input “operator”) and stored temporarily in a buffer in the volatile
memory. The buffer is protected by a checksum provided by the memory access
module.
After processing the message in the protocol stack successfully, the ID of the
aspect “red cross” and the command “switch on” is sent to the display control via a
queuing mechanism. Again, data in the queue is protected with a checksum. The next
step is to prove the availability of the aspect and the integrity of the corresponding data: the data to show the chosen aspect is stored in the non-volatile memory and protected by a CRC. If the CRC check succeeds, the last result of the time-triggered LED test is evaluated to check the status of the LEDs. Only if the failure limit is not exceeded and the cable test returns a positive result is the corresponding data byte stream to display the required aspect sent to the LED drivers on the modules to activate the desired LEDs.
A further step is to check the display status. Thus, data of each driver on the LED
modules is read back. That data is compared to the stored one in the non-volatile
memory. If the comparison returns a positive result, the right aspect “red cross” is
shown. Finally, a response message is transmitted to the HOCU including the result of
the execution of the command returned by the display control to the protocol stack.
If a fault is detected during memory access, the data is discarded and a response message is sent. Faults on the display result in a fail-safe state where the display is switched off and a proper message is sent to the HOCU.
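Putting the walkthrough together, the sketch below mirrors the guarded sequence for "switch on 'red cross'"; every callable is a hypothetical stand-in for the corresponding module of the architecture, not the actual implementation:

def switch_on_aspect(aspect_id, nvm, display, hocu):
    data = nvm.read_aspect(aspect_id)        # CRC-protected read from NVM
    if data is None:                         # CRC check failed:
        hocu.respond(aspect_id, ok=False)    # discard data, report failure
        return
    if not (display.led_test_ok() and display.cable_test_ok()):
        display.enter_fail_safe()            # display is switched off
        hocu.respond(aspect_id, ok=False)
        return
    display.send_to_led_drivers(data)        # activate the desired LEDs
    if display.read_back_drivers() != data:  # check of display status
        display.enter_fail_safe()
        hocu.respond(aspect_id, ok=False)
        return
    hocu.respond(aspect_id, ok=True)         # the right aspect is shown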
The aforementioned safety measures and the presented flow of actions are an efficient solution that ensures the safety integrity of the VMS (see Table 3) according to the requirements given by VDE 0832, part 400.
Table 3. Safety measures per integrity aspect:
– Integrity of display: LED test; check of display status
– Integrity of communication line: cable test with hardware watchdog mechanism
– Data integrity: checksum; startup memory test
– Software integrity: software watchdog
6 Conclusion
The importance of traffic management systems as a measure to increase road safety and to distribute traffic load is constantly growing. Increasingly, these systems are responsible for functions directly or indirectly affecting people's safety and health. Therefore, the products used are expected to be developed in a way that reduces harm to the user to an acceptable minimum. Additionally, societal trends, especially in Europe, result in a demand for higher safety standards.
The paper presented a general software architecture for actuators, such as VMS, in traffic management systems that provide safety-critical functionality. The architecture fulfills the requirements stated by standard VDE 0832 and European standard EN 50556, respectively. Therefore, the presented approach is a proper solution for designing safety-related software to be used in VMS.
7 References
1. H. Kulovits, C. Stoegerer, W. Kastner: System Architecture for Variable Message Signs.
In Proceedings of 10th IEEE Conference on Emerging Technologies and Factory
Automation (ETFA), Vol. 2, pp. 903-909 (2005)
2. DIN/VDE: Road Traffic Signal Systems. VDE 0832, part 100-500 (2008-2011)
3. EN: Road traffic signal systems. EN 50556, CENELEC (2011)
4. W. Reed: Safety critical software in traffic control systems. IEE Colloquium on Safety
Critical Software in Vehicle and Traffic Control (2002)
5. IEC: Functional safety of electric/electronic/programmable electronic safety-related
systems. IEC 61508-1 to -7, Edition 2 (2010)
6. Ministry of Defense: HAZOP Studies on Systems Containing Programmable Electronics,
Part 1, Requirements. Standard 00-58, Issue 2 (2000)
7. T. Novak, Ch. Stoegerer. The right degree of configurability for safety-critical embedded
software in Variable Message Signs. Computer Safety, Reliability, and Security,
Proceedings of the 29th International Conference SAFECOMP 2010, LNCS 6351, pp.
418-430 (2010)
8. T. Novak, T. Tamandl. Architecture of a safe node for a fieldbus system. In Proceedings of
the 5th IEEE International Conference on Industrial Informatics (INDIN), Vol. 1, pp. 101-
106 (2007)
9. BG-PRÜFZERT, Fachausschuss Elektrotechnik. GS-ET-26 – Bussysteme für die
Übertragung sicherheitsrelevanter Nachrichten. Prüfgrundsätze, Köln, Deutschland
(2002)
10. Bundesanstalt fuer Strassenwesen (BASt). Technische Lieferbedingung für Streckenstationen (TLS) (2002)
11. H. Hölscher, J. Rader. Microcomputers in Safety Technique, An Aid to orientation for
Developer and Manufacturer. TÜV Rheinland, ch.7.22 (1986)
12. IEC. Telecontrol equipment and systems - Part 5: Transmission protocols. IEC 60870-5-1
(1990)
13. BG-PRÜFZERT, Fachausschuss Elektrotechnik. GS-ET-26 – Bussysteme für die
Übertragung sicherheitsrelevanter Nachrichten. Prüfgrundsätze, Köln, Deutschland,
(2002)
14. T. Tamandl, P. Preininger, T. Novak, P. Palensky. Testing Approach for Online Hardware
Self Tests in Embedded Safety Related Systems. In Proceedings of the 12th IEEE
considered during this automatic architecture generation, given the high impact of
safety on systems design, addressing safety requirements in later stages becomes ei-
ther impossible or very expensive.
Traditional reliability calculations are operational rather than denotational, i.e., they define a procedure for calculating failure probabilities for a given architecture. A denotational calculus instead defines a set of constraints depending on the architectural decision variables, from which the resulting failure probability is derived “automatically”. An additional problem with the classical reliability algebra is that its
calculations are combinatorial and highly non-linear. For example, [10] presents an
approach for architecture optimization for aircraft roll systems. The design problem of
a flight-control system on a large fly-by-wire airliner is to find combinations of actua-
tor(s), power circuit(s), computer(s) for each movable surface, so as to fulfill the con-
straints imposed by the safety regulations, while keeping the resulting system weight
as low as possible. Instead of a direct formulation of the safety problem suitable for
the optimization algorithm, they use a black-box function for the safety assessment of
potential solutions. This function provides an accurate evaluation of the safety re-
quirements but is rather costly, so the optimization algorithm had to be tailored so that
the safety assessment is not done for all possible solutions. The same limitation re-
mains for architectural design using evolutionary algorithms (see, e.g., [15]). Another
approach is to define a Constraint Programming model [16], which is less scalable than
the best Mixed Integer Linear Programming (MILP) solvers available today, such as
Cplex [17] and Gurobi [18].
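To make the contrast concrete, the denotational view can be sketched as a mathematical program over architectural decision variables (an illustrative shape only; the concrete constraints for our algebra are developed in Section 4):

$$\min_{x \in \{0,1\}^n} \; w^{\top} x \quad \text{s.t.} \quad F_r(x) \le R_r \;\; \forall r,$$

where $x$ encodes the architectural decisions, $w^{\top} x$ a cost such as the total system weight, and $F_r(x)$ the failure probability relevant to safety requirement $r$ that is induced, via the constraints, by the chosen architecture.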
In this paper we present an approximate reliability algebra and its MILP formula-
tion supported by most optimization solvers. This reliability calculus has the follow-
ing features:
The paper is organized as follows. In the next section we briefly describe the place of
safety analysis in the whole design process, show which part we address and describe
the gap versus existing methods. In Section 3, our modeling approach for safety re-
quirements is shown. In Section 4, the approximate reliability algebra is presented
where safety requirements are transformed to a set of algebraic equations. In Section
5 we check bounds of the approximation and in Section 6 the proposed algebra is
implemented on a set of representative examples. Section 7 summarizes the paper and
provides directions for future research.
2 System Safety
System safety uses systems theory and systems engineering approaches to prevent
foreseeable accidents and to minimize the result of unforeseen ones. It is a planned,
– the critical system functions, i.e., functions which may cause a hazard when lost or malfunctioning,
– the safety requirements for these functions, i.e., the maximum allowed failure probabilities, and
– the demands, if any, for additional safety functions in order to achieve acceptable levels of risk for the system.
In this work we focus on the system architecture development and the preliminary
system safety assessment.
A minimal cut set allows the direct calculation of the overall failure probability for a given failure case.
To automate the activities and also to extend and complement the classical safety
analysis techniques, a variety of formal safety analysis techniques has been proposed.
One of the most prominent examples is the AltaRica [3][4] language. AltaRica mod-
els formally specify the behavior of systems when faults occur. These models can be
assessed by means of complementary tools such as fault tree generator and model
checker. This allows analysts to model the failure behavior of a system as design
work progresses from the system architecture to the implementation level.
With the increased acceptance of Model based Systems Engineering (MBSE) 1 as
the new systems engineering paradigm, it seems natural to combine MBSE and Mod-
el-based Safety Analysis (MBSA). One possibility is to automatically extract minimal
cut sets directly from detailed design models bypassing Fault Tree generation alto-
gether. This approach [6] allows truly automated analysis of detailed design models
thus minimizing both the possibility of safety analysis errors and the cost of the anal-
ysis.
Another possibility is to derive models suitable for safety analysis from the system
development models. [7] and [14] provide examples for deducing analyzable AltaRica
code from UML/SysML models.
For our purpose, we extended the already existing meta model [13] for functional
and systems architecture modeling with concepts from the safety domain as depicted
by Fig. 1. Textual safety requirements are formalized as failure cases with attributes
that define the maximum allowed probability for a failure case to occur. The failure
case in turn is defined by its relation to one or more functions which have to fail in
order for the failure case to occur.
Note, that the function(s) that the failure case is related to serve as a starting point
for recursively propagating that failure in the functional architecture. The basic as-
sumption here is that a function fails when one or more of the functions that it needs
input from fail. Additional modifications on the relation between the failure case and
the function allow the definition of how this propagation is done, e.g. propagation
without restrictions, propagation up to a certain depth or no propagation at all.
The functional architecture, consisting of functions and data exchanges via virtual
links is mapped to the physical architecture, consisting of components that are in-
stances of component classes and connectors between these components. A virtual
link is mapped to a number of components and connectors, e.g. network switches and
cables.
1 The International Council on Systems Engineering defines MBSE as “the formalised application of modelling to support system requirements, design, analysis, verification and validation activities beginning in the conceptual design phase and continuing throughout development and later life cycle phases” [11].
The idea of the approximate reliability algebra is to take the failure probability of each component into account raised to the correct power, since the failure probabilities of the most critical components, i.e., the ones with the highest failure probability, dominate the overall failure probability. For example, for a critical component without redundancy, the power should be 1. We now define variables describing the redundancy of functions and of the corresponding components that implement the functions and/or transfer data between them through input-output functional links. Let $p_l$ be the number of redundant virtual paths for function/functional link $l$ and $p_{cl}$ be the number of redundant virtual paths of function/functional link $l$ in which component $c$ participates. We assume $p_{cl} \in \{0, 1, p_l\}$, i.e., component $c$ can either not participate at all in virtual paths of $l$, or participate in just one path, or in all paths. This is the usual case where there is no redundancy ($p_l = 1$) for reliable components and there are independent channels ($p_{cl} = 1$) for unreliable components. When components participate in $1 < p_{cl} < p_l$ redundant virtual paths, there is no guarantee on the degree of redundancy.
We define $n_{cr}$ to be the degree of redundancy of component $c$ for reliability requirement $r$ – the number of remaining virtual paths for reliability requirement $r$ in case of failure of component $c$. Formally,

$$n_{cr} = \min_{l \in LR_r} \left( p_l - p_{cl} \right),$$

where $LR_r$ is the set of functions and functional links affecting reliability requirement $r$.
We define $F_r$ to be the approximation of the failure probability for requirement $r$:

$$F_r = \sum_{c \in C} f_c^{\,n_{cr} + 1},$$

where $f_c$ is the failure probability of component $c$, and $C$ is the set of all components. Then the reliability requirement becomes

$$F_r \le s_r R_r,$$

where $R_r$ is the maximum allowed failure probability for requirement $r$ and $s_r$ is a safety factor (cf. Section 5).
Let us consider the following small example, shown in Fig. 2, to demonstrate the
approximate calculations and compare them with the classical one.
We have two functions F1 and F2 where function F1 is an input for F2 by the func-
tional link between them. These functions are implemented, without redundancy, by
components C1 and C8. The components C2 to C7 implement the functional link FL1
with redundancy 2. Therefore, the number of remaining redundant paths in case of a component's failure is equal to zero for components C1 and C8 and equal to one for the rest of the components. If all components have the same failure probability f,
the approximate failure probability is $F_r = 2f + 6f^2$. Both calculations are of the same order and, for small $f$, both values will be the same and equal to $2f$.
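As a numeric cross-check of this example (a sketch only: it assumes the two redundant paths are the serial chains C2-C3-C4 and C5-C6-C7, and uses the approximation $F_r = \sum_c f_c^{\,n_{cr}+1}$ as reconstructed above):

def approx_failure(f: float) -> float:
    # C1 and C8 have degree of redundancy 0 (power 1); C2..C7 have
    # degree 1 (power 2), so F_r = 2*f + 6*f**2.
    return 2 * f + 6 * f ** 2

def classical_failure(f: float) -> float:
    path_fail = 1 - (1 - f) ** 3                 # one serial path of 3 components
    link_fail = path_fail ** 2                   # both redundant paths fail
    system_ok = (1 - f) ** 2 * (1 - link_fail)   # C1, C8 and the link all work
    return 1 - system_ok

for f in (1e-2, 1e-4, 1e-6):
    Fr, Br = approx_failure(f), classical_failure(f)
    print(f"f={f:.0e}  approx={Fr:.3e}  classical={Br:.3e}  Ar={Fr / Br:.4f}")
# Both values tend to 2f as f shrinks; restricted to the link alone the
# ratio is 6f^2 / 9f^2 = 2/3 = 2/m_l with m_l = 3, which matches
# Corollary 2 below for p_l = 2.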
The reliability approximation ratio Ar for requirement r is defined as the ratio between
approximated, Fr, and reference, Br, failure probabilities for a given system D:
Ar(D)>1 means that the approximated failure probability is larger than the refer-
ence failure probability, i.e. the approximation is conservative. Ar(D)<1 means that
the approximated failure probability is smaller than the reference failure probability,
i.e., the approximation is optimistic. For typical systems, the most optimistic approx-
imation is obtained when all failure probabilities are equal, i.e., the minimal $A_r$ is obtained when $f_c = f$ for all components $c$ (see Lemma 1 below).
The lower bound on the reliability approximation ratio is also the lower bound on the safety factor $s_r$. If one selects a safety factor $s_r$ less than or equal to the bound, the resulting architecture D is guaranteed to satisfy the safety requirement $R_r$. In practice, if the safety factor is larger than the lower bound (e.g., $s_r = 1$), then the resulting failure probabilities might exceed the requirements by no more than the ratio between the used safety factor and the bound.
In this paper we assume that the failure probabilities are very small by considering
only terms with the lowest power of failure probabilities. In architectures where each
participating component either appears in all redundant paths or in a single redundant
path, Lemma 1 defines the lower bound on the reliability approximation ratio. In the
first case the bound is approximately one, while in the second case it is inversely proportional to the number of components in the longest redundant path raised to the power of the number of redundant paths minus one. The proof is based on standard reliability
calculations for parallel composition of redundant paths.
Lemma 1
i) Let D be a system with multiple functional links. If there is at least one component that participates in all redundant paths of some link, then the reliability approximation ratio is approximately one:

$$A_r(D) \approx 1.$$

ii) If each of the components participates in a single redundant path of a single functional link, then the lower bound on the reliability approximation ratio is as follows:

$$A_r(D) \ge \min_{l} \frac{p_l}{m_l^{\,p_l - 1}},$$

where $p_l$ and $m_l$ are the number of redundant paths and the number of components in the longest redundant path for a functional link $l$, respectively. Moreover, the bound is reached for the corresponding functional link when all components have the same failure probability and all redundant paths have the same number of components.
Corollary 2
For a single functional link, in the common cases where the number of redundant paths is two or three, the reliability approximation ratio is bounded below by $2/m_l$ or $3/m_l^2$, respectively.
6 Application examples
The described method has been applied to several examples that are representative of
real problems in the aerospace industry:
Fig. 3 provides the functional architecture and the allocation of functions to compo-
nents whereas Fig. 4 and Fig. 6 depict concrete implementations of the NAS including
a mapping of the virtual links that connect the components to the network switches.
Three failure cases have been defined for the NAS as Fig. 5 shows.
Table 1 compares the safety evaluation results obtained using classical methods with the ones calculated using the approximate reliability algebra, under the assumption that all components in the NAS have the same failure rate f.
The Fire Warning System (FWS) is a simple system consisting of a power supply, a
fuse, a fire detector and a fire warning lamp. Fig. 7 shows the functional architecture
and the allocation of functions to components as well as two alternative implementa-
tions of the system. No virtual links are defined as the data is transferred via direct
connections. The main challenge is the consideration of loss and spurious failures, as there are safety requirements regarding the absence of false alarms.
Fig. 7. FWS
Two failure cases are defined for the FWS as shown by Fig. 8.
Table 2 compares the safety evaluation results obtained using classical methods with the ones calculated using the approximate reliability algebra.
In this work we proposed approximate reliability calculations that can be used with most optimization solvers. The approximation has a theoretical bound on potentially over-optimistic results and was found to be very accurate for a large set of examples shown in the paper and in additional projects. This approximate algebra cannot and is not
designed to replace the proper safety analysis using specialized tools required by the
certification authorities but can significantly improve the design space exploration
phase for driving the optimization to valid designs, from the safety perspective. For
future research we can suggest a relaxation of the assumption $p_{cl} \in \{0, 1, p_l\}$ and further exploration of the approach for different failure modes.
8 References
1. Leveson, N. G.: Safeware: System Safety and Computers. Addison-Wesley, Reading, MA
(1995).
2. Leveson, N. G.: Software Safety: Why, What, and How. In: ACM Computing Surveys,
Vol. 18, no. 2, pp. 125-163, ACM, New York (1986).
3. Arnold, A., Griffault, A., Point, G., Rauzy, A.: The AltaRica formalism for describing
concurrent systems. In: Fundamenta Informaticae, Vol. 40, no. 2, pp. 109-124, IOS Press,
Amsterdam (1999).
4. Bieber, P., et. al.: Safety assessment with Altarica. In: Building the Information Society.
505-510, Springer (2004).
5. Haskins, C. Ed.: Systems Engineering Handbook: A guide for system life cycle processes
and activities. INCOSE (2006).
6. Bozzano, M. et.al. ESACS: An integrated methodology for design and safety analysis of
complex systems. In: Proceedings of ESREL 2003, pp. 237-245, Balkema Publisher
(2003).
7. David, P., Idasiak, V., Kratz, F.: Towards a better interaction between design and depen-
dability analysis: FMEA derived from UML/SysML models. In: Proceedings of ESREL
2008 and 17th SRA-Europe Annual Conference, Valencia, Spain (2008).
8. David, P., Idasiak, V., Kratz, F.: Reliability study of complex physical systems using
SysML. In: Journal of Reliability Engineering and System Safety, Vol. 95, no. 4, pp. 431-
450 (2010).
9. Vesely, W., et al.: Fault Tree Handbook with Aerospace Applications. NASA Office of
Safety and Mission Assurance, Washington DC (2002).
10. Bauer, C., et. al.: Flight-control system architecture optimization for fly-by-wire airliners.
In: Journal of Guidance Control and Dynamics, Vol. 30, no. 4, pp. 1023-1029, AIAA
(2007).
11. Tabbara, A., Sangiovanni-Vincentelli, A.: Function/architecture optimization and co-design
of embedded systems. Springer (2000).
12. Todinov, M.T.: Risk-based reliability analysis and generic principles for risk reduction. El-
sevier Science, Amsterdam (2007).
13. Verma, A. K., Ajit, S., Karanki, D. R.: Reliability and Safety Engineering. Springer
(2010).
14. Helle, P., Strobel, C., Mitschke, A., Schamai, W., Rivière, A. and Vincent, L.: Improving
systems specifications a method proposal. CSEM, 2008.
15. Li R., Etemaadi R., Emmerich M.T.M., Chaudron M.R.V., Automated Design of Software
Architectures for Embedded Systems using Evolutionary Multiobjective Optimization. VII
ALIO/EURO, 2011.
16. Condat H., Strobel C., and Hein A., Model-based automatic generation and selection of
safe architectures. INCOSE, 2012.
17. http://www-01.ibm.com/software/integration/optimization/cplex-optimizer/
18. http://www.gurobi.com/
On the formal verification of systems of synchronous
software components?
1 Introduction
State-of-the-art safety-critical systems are often composed of other distributed (component) systems (systems of systems, SoS). While industrial standards for the development of safety-critical software systems highly recommend formal model-based methods, the application of those methods to SoS still remains a challenge where scalability to real industrial applications is concerned.
In this paper we report on work in progress concerning the development of an ap-
proach to modeling and verification of SoS that is innovative for the industrial practice
and addresses the scalability problem. In our approach the nodes of a distributed sys-
tem consist of controllers performing specialized tasks in hard real time by operating
cyclically and in a synchronous way. For such a controller the model-based approach
of SCADE3 is an attractive solution providing code generation and good support for
model simulation and (formal) verification. But for a distributed system, a synchronous
implementation is neither realistic nor desirable. Hence, we focus on the model-based
? This work was developed during the course of the project “Verifikation von Systemen synchroner Softwarekomponenten” (VerSyKo) funded by the German ministry for education and research (BMBF).
3 SCADE is developed and distributed by Esterel Technologies: www.esterel-technologies.com
Thivolle and Garavel [12] explain the basic idea of combining a synchronous language and an
asynchronous formalism, and they also show how synchronous components can be in-
tegrated into an asynchronous verification tool and demonstrate this with one simple
example. In these works components are not abstracted by contracts as in our approach
but synchronous models are directly integrated in an asynchronous formalism. How-
ever, the results from our case study clearly indicate that component abstraction is nec-
essary.
A different approach follows Milner’s result [26] that asynchrony can be encoded
in a synchronous process calculus, see e. g. [20, 22, 28], and the tool Model build [5, 6]
as well as the Polychrony workbench [25]. A disadvantage of these approaches is that
asynchrony and non-determinism are not built-in concepts in the underlying formalisms
and so verification tools may not be optimized for asynchronous verification.
Other approaches extend synchronous formalisms in order to deal with some de-
gree of asynchrony, e. g., Multiclock Esterel [29] or Communicating Reactive State
Machines [30, 31]. Again, components are not abstracted in these approaches, and ac-
cording to [12]: “such extensions are not (yet) used in industry”.
Using contracts as specifications for parts of a program or system is also not a new
idea; see for example work on rely/guarantee logic [23]. Abstracting system compo-
nents by contracts appears recently, for example, in [15, 14] and in [7]. The former
work uses component contracts in the form of time-annotated UML statecharts. So this
approach does not deal directly with synchronous components or GALS systems. In ad-
dition, component contracts cannot be specified by LTL formulas as in our framework.
The latter work [7] describes a way to use contracts to specify the behaviour and inter-
actions of hardware components. The focus is on the verification of contracts while our
work also considers formal verification of system-level verification goals of composed
contract-systems.
Alur and Henzinger [3] treat the semantics of asynchronous component systems; their reactive modules can be used to give a semantics to our GALS system specifications. Reactive modules are also the basis for the tool Mocha [1], which uses Alternating Temporal Logic (ATL) as a specification language for system requirements. In
our approach the specification language for contracts and global verification goals is
separate from the synchronous language in which components are implemented. So our framework is more flexible: it allows the synchronous language for components to be exchanged easily, and it also allows the analysis tools used for formal verification to be changed.
Clarke et al. describe an automatic approach to generating contract abstractions [9].
We did not apply this technique in our framework (yet) because we believe there are
several difficulties with this approach: it can only generate abstractions with the same
expressive power as regular languages, while our approach can also handle LTL abstrac-
tions. Also, the number of iterations needed for finding the abstraction might outweigh
the performance gains of the abstraction itself. But this still needs to be investigated
systematically in the future.
To sum up, the various ingredients (contracts for abstraction, synchronous verifica-
tion, GALS systems) of our work are well-established in the literature. However, to the
best of our knowledge these ingredients have not been brought together in this form for
the verification of GALS systems of synchronous SCADE models, and it is this gap we
intend to fill with our work.
and a timer variable c is created which is initialized with time t. Each time a synchronous component (or rather its PROMELA process) makes a step, the timer c is
decremented by the amount of time that has passed since the last step was performed by
a (possibly different) synchronous component. For this translation to be sound, until[t]-
formulas must not be nested on the right-hand side.
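A minimal sketch of this timer bookkeeping may help; the class below is an illustrative Python rendering of the idea, whereas the actual translation emits PROMELA code:

class UntilTimer:
    """Models the timer c created for an until[t] formula."""

    def __init__(self, t: float, now: float = 0.0):
        self.c = t              # remaining time budget, initialized with t
        self.last_step = now    # global time of the previous step

    def on_component_step(self, now: float) -> bool:
        # Decrement c by the amount of time that has passed since the last
        # step performed by a (possibly different) synchronous component.
        self.c -= now - self.last_step
        self.last_step = now
        return self.c >= 0      # True while the until[t] deadline still holds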
The translation of the network of contracts in a GTL specification to UPPAAL works similarly. However, it is not possible in general to translate all verification goals, since the logical language supported by UPPAAL is based on (timed) CTL [2]. If we restrict ourselves to so-called safety properties, the translation to CTL is both sound and simple. While clearly inferior in expressive power, this logical class of formulas is sufficient for many practical purposes. Currently, the translation of safety properties has to be done manually.
4.3 Detection of False Negatives
We implemented a third transformation from GTL that can be used to validate verification results for an abstract GALS model CG. Suppose we have CG ⊭ Φ for a verification goal Φ and the formal verification produces the failure trace π. If each component comes with a SCADE implementation, we can check whether this is a real failure trace or a false negative as follows: using the GTL specification one generates the concrete GALS model MG in PROMELA. This is done by composing the SCADE models of the components (together with the scheduler) by integrating the C-code generated from them. By using SPIN to simulate MG on the inputs from π we can verify whether π is a trace of MG.6
If so, we have found a real error, and one or several component implementations
need to be corrected. To support this process one can project the global failure trace π
on a local trace πM for each component M, which can be used in the ensuing analysis: from each πM one can generate a SCADE simulator script which can be used to correct the SCADE models of the components.
If the simulation finds that π is not a legal trace of the concrete GALS model MG ,
then our verification result is a false negative, and one needs to analyze the contracts for
weaknesses or inconsistencies.
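In outline, the validation step looks as follows; every helper passed in (generate_concrete_model, simulate, project) is a hypothetical placeholder for the SPIN- and SCADE-based tool steps described in the text:

def check_counterexample(generate_concrete_model, simulate, project,
                         gtl_spec, components, pi):
    """pi is the global failure trace produced for the abstract model CG."""
    # Compose the concrete GALS model M_G from the C-code generated
    # out of the SCADE component models (plus the scheduler).
    MG = generate_concrete_model(gtl_spec, components)
    # Replay the inputs of pi on M_G via SPIN simulation.
    if simulate(MG, pi):
        # Real error: project pi onto a local trace per component; from
        # each local trace a SCADE simulator script can be generated to
        # correct the component models.
        return [project(pi, M) for M in components]
    # Otherwise the verification result is a false negative: the
    # contracts must be analyzed for weaknesses or inconsistencies.
    return None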
For a concrete example of a false negative let us consider our mutex example. If we
omit line 23 of the client contract, which prevents clients from staying forever in their
critical section, the global verification of the second verification goal in Fig. 1 will yield
a failure trace: a client remains forever in the critical section. However, this does not
happen in the concrete model; the SCADE model of the client (not shown here) will
leave its critical section after at most 5 cycles.
To evaluate the GTL transformation to back-end formalisms, we use the mutex example
from Sec. 3 as a benchmark.
To this end a sequence of GTL files is generated by increasing the number N of
client processes that compete for the critical section. The server will (only) grant access
to the critical section to one requesting client (chosen non-deterministically), if no client
is currently in the critical section.
6 Again, this simulation is possible since CG operates on the complete concrete interfaces specified by each component M.
The state space of the system grows exponentially with increasing N . The property
that is model-checked for all instances is the previously mentioned classic safety condi-
tion: at no point in time more than one client is in the critical section. Since this is true,
the complete state space has to be analyzed.
The GTL representation is transformed to a model representation for SPIN and
UPPAAL, respectively. For SPIN, a verifier is (gcc-) compiled and executed with run-
time options -a and -m9999999k. This guarantees exhaustive and maximally deep
search. Other than that, none of the numerous optimization options of the tools are
activated. We use the newest available (64bit-)releases of the tools.
Fig. 3 displays the time and memory consumption with increasing number N of
clients. Unmapped N correspond to out-of-memory situations. After an initial offset,
the resource usage shows a steady slope on the logarithmic scale, which corresponds to
the exponential growth of the state space. Both SPIN and UPPAAL follow mainly the
same slope, but maintain roughly constant distance, which corresponds to a constant
factor. The time plot shows this better than the memory plot, since the latter operates
with a basic offset of allocated main memory (up to N = 5, due to option -m).
Surprisingly, this factor is rather large: ≈ 53 for time without compiler optimiza-
tions (≈ 23 with full optimization) and ≈ 87 for memory usage. Possibly, UPPAAL
profits substantially from the fact that only the reachable states have to be allocated at
all, while SPIN does provide (hash-compressed) memory for the full state space. More
details can be found in [27].
[Fig. 3: two plots on logarithmic scales, time [s] and memory [MB], versus the number of clients N = 2..11, with one curve each for Spin-6.1.0 and Uppaal64-4.1.7.]
Fig. 3. Time and memory consumption for exhaustive search for N clients; measured on a 2.80 GHz Intel Xeon CPU with 24 GB of main memory
6 Case study
In the previous section we showed that our approach works on small academic exam-
ples. To see whether our method scales up to realistic systems we are currently working
on an industrial case study—a level crossing system from the railway domain.
The level crossing consists of several components (traffic lights, supervision sig-
nals, barriers etc.). An overview of the architecture is given in Fig. 4. The components
have been implemented as synchronous SCADE models, and are of medium complexity: failures, recovery and supervision aspects are implemented in each component. A
detailed informal description of the requirements of the level crossing system and its
overall system architecture can be found in [32]. The implementation can be found at
the VerSyKo project web page.7 A main global requirement of the level crossing system
is to protect the road traffic from the train traffic and vice versa. Without abstraction, the state space of the system is too large to be handled by model checkers like SPIN: an experiment integrating the C-code generated from the SCADE models with the model checker SPIN yielded a state space that was too large. This outcome validates our expectation that it is necessary to reduce the state space by providing abstractions of the local synchronous components using contracts.
As a next step, we have formulated contracts for each of the components of the
level crossing system and used SCADE Design Verifier to prove the contracts correct. Unfortunately, for the level crossing controller, SCADE Design Verifier did not succeed in verifying our contract. The reason for this is yet unclear, but omitting one of the three automata from the contract yielded a verifiable contract. We suspect that the third automaton encodes a property that cannot be handled by the induction heuristics implemented in SCADE Design Verifier. However, the Debug Strategy of SCADE Design Verifier yielded no counterexamples unrolling the model up to depth 80. The results
of the contract verification can be seen in Table 1, which also shows the complexity
of both the SCADE model (estimated from the C-code generated from it) and the associated contract. The detection points in Fig. 4 do not appear because they are mere
sensors without controller software.
Component                  Model complexity   Contract complexity   Verification time
                           (no. of states)    (no. of states)       (s)
traffic light              5.92 · 10^10       6                     3.292
supervision signal         1.54 · 10^4        4                     5.054
barrier                    2.46 · 10^5        5                     3.385
axle counter               2.88 · 10^3        3                     4.103
level crossing controller  2.36 · 10^138      32                    543.599^1
^1 Verified up to depth 80 using bounded model checking.
Table 1. Contract verification times
For a first experiment with global verification we have formulated the main require-
ment mentioned above as a verification goal in GTL. Since we have not completed an
automatic translation of GTL verification goals into UPPAAL’s query language, we did
7 See http://www.versyko.de.
not experiment with global verification using UPPAAL yet. Using SPIN resulted, as
expected from our benchmark in the previous section, in complexity problems.
Directions for future work include: (a) exploring possible alternatives to SCADE Design Verifier for local verification; an approach using bounded model checking with an SMT-solver, similar to the KIND [24] model checker for LUSTRE, will be investigated; (b) further investigations using bounded model checking for global verification will be made on our case study, in particular, the formalization of other requirements as global verification goals and the formulation of appropriate contracts for them; (c) from the point of view of the applicability of our approach, a systematic methodology for finding suitable abstractions of components and formulating good contracts is highly desirable. At the moment this is a creative process that needs expertise both with the system under investigation and with the formal verification methods used in our framework. Concerning this point, it should be investigated to what extent the CEGAR approach of [9] is applicable for automatic derivation of contracts.
Acknowledgments. We are grateful to ICS AG for providing the industrial case study,
and to Axel Zechner and Ramin Hedayati for fruitful discussions and their support in formulating the contracts.
References
1. de Alfaro, L., Alur, R., Grosu, R., Henzinger, T., Kang, M., Majumdar, R., Mang, F., Meyer-
Kirsch, C., Wang, B.: Mocha: Exploiting modularity in model checking (2000), http://
www.cis.upenn.edu/˜mocha
2. Alur, R., Courcoubetis, C., Dill, D.L.: Model-checking in dense real-time. Information and
Computation 104(1), 2–34 (1993)
3. Alur, R., Henzinger, T.: Reactive modules. FMSD 15, 7–48 (1999)
4. André, C.: Semantics of S.S.M (safe state machine). Tech. Rep. UMR 6070, I3S Laboratory,
University of Nice-Sophia Antipolis (2003)
5. Baufreton, P.: SACRES: A step ahead in the development of critical avionics applications
(abstract). In: Proc. of HSCC. LNCS, vol. 1569. Springer-Verlag, London, UK (1999)
6. Baufreton, P.: Visual notations based on synchronous languages for dynamic validation
of GALS systems. In: CCCT’04 Computing, Communications and Control Technologies.
Austin (Texas) (August 2004)
7. Bouhadiba, T., Maraninchi, F.: Contract-based coordination of hardware components for the
development of embedded software. In: Proc. of COORDINATION. pp. 204–224. LNCS,
Springer-Verlag, Berlin, Heidelberg (2009)
8. Chapiro, D.M.: Globally-asynchronous locally-synchronous systems. Ph.D. thesis, Stanford
University (1984)
9. Clarke, E., Grumberg, O., Jha, S., Lu, Y., Veith, H.: Counterexample-guided abstraction re-
finement. In: Emerson, E., Sistla, A. (eds.) Computer Aided Verification, LNCS, vol. 1855,
pp. 154–169. Springer Berlin / Heidelberg (2000)
10. Dajani-Brown, S., Cofer, D., Bouali, A.: Formal verification of an avionics sensor using
SCADE. In: Lakhnech, Y., Yovine, S. (eds.) Proc. FORMATS/FTRTFT. LNCS, vol. 3253,
pp. 5–20. Springer-Verlag (2004)
11. Doucet, F., Menarini, M., Krüger, I.H., Gupta, R., Talpin, J.P.: A verification approach for
GALS integration of synchronous components. ENTCS 146, 105–131 (January 2006)
12. Garavel, H., Thivolle, D.: Verification of GALS systems by combining synchronous lan-
guages and process calculi. In: Proc. of SPIN. LNCS, vol. 5578, pp. 241–260 (2009)
13. Gastin, P., Oddoux, D.: Fast LTL to Büchi automata translation. In: Berry, G., Comon, H.,
Finkel, A. (eds.) Computer Aided Verification. LNCS, vol. 2102, pp. 53–65 (2001)
14. Giese, H., Tichy, M., Burmester, S., Schäfer, W., Flake, S.: Towards the compositional veri-
fication of real-time uml designs. SIGSOFT Softw. Eng. Notes 28, 38–47 (September 2003)
15. Giese, H., Vilbig, A.: Separation of non-orthogonal concerns in software architecture and
design. Software and System Modeling 5(2), 136–169 (2006)
16. Günther, H.: Bahnübergangsfallstudie: Verifikationsbericht. Tech. rep., Institut für Theo-
retische Informatik, Technische Universität Braunschweig (February 2012), Available at
http://www.versyko.de
17. Günther, H., Hedayati, R., Löding, H., Milius, S., Möller, O., Peleska, J., Sulzmann, M.,
Zechner, A.: A framework for formal verification of systems of synchronous components.
In: Proc. MBEES’12 (2012), available at http://www.versyko.de
18. Günther, H., Milius, S., Möller, O.: On the formal verification of systems of synchronous
software components (extended version) (May 2012), available at www.versyko.de
19. Halbwachs, N., Lagnier, F., Raymond, P.: Synchronous observers and the verification of
reactive systems. In: Proc. of AMAST’93. pp. 83–96. Workshops in Computing, Springer-
Verlag, London, UK (1994)
20. Halbwachs, N., Mandel, L.: Simulation and verification of asynchronous systems by means
of a synchronous model. In: Proc. of IFIP. pp. 3–14. IEEE Computer Society, Washington,
DC, USA (2006)
21. Halbwachs, N., Raymond, P.: Validation of synchronous reactive systems: from formal veri-
fication to automatic testing. In: Proc. of ASIAN (December 1999)
22. Jahier, E., Halbwachs, N., Raymond, P., Nicollin, X., Lesens, D.: Virtual execution of AADL
models via a translation into synchronous programs. In: Proc. of EMSOFT. pp. 134–143.
EMSOFT ’07, ACM, New York, NY, USA (2007)
23. Jones, C.B.: Specification and design of (parallel) programs. In: Proc. IFIP Congress. pp.
321–332 (1983)
24. Kahsai, T., Tinelli, C.: PKind: A parallel k-induction based model checker. In: Barnat, J.,
Heljanko, K. (eds.) PDMC. EPTCS, vol. 72, pp. 55–62 (2011)
25. Le Guernic, P., Talpin, J.P., Le Lann, J.L.: Polychrony for system design. Journal of Cir-
cuits, Systems and Computers (2002), special issue on Application-Specific Hardware De-
sign. World Scientific
26. Milner, R.: Calculi for synchrony and asynchrony. Theoret. Comput. Sci. 25(3) (July 1983)
27. Möller, M.O.: Benchmark Analysis of GTL-Backends using Client-Server Mutex (2012),
http://www.verified.de/en/publications/, Verified Systems International
GmbH, Doc.Id.: Verified-WHITEPAPER-001-2012, Issue 1.2.
28. Mousavi, M.R., Le Guernic, P., Talpin, J., Shukla, S.K., Basten, T.: Modeling and validating
globally asynchronous design in synchronous frameworks. In: Proc. of DATE. pp. 10384–.
DATE ’04, IEEE Computer Society, Washington, DC, USA (2004)
29. Rajan, B., Shyamasundar, R.: Multiclock Esterel: a reactive framework for asynchronous
design. In: Proc. of IPDPS. pp. 201–209 (2000)
30. Ramesh, S.: Communicating reactive state machines: Design, model and implementation. In:
Proc. IFAC Workshop on Distributed Computer Control Systems. Pergamon Press (Septem-
ber 1998)
31. Ramesh, S., Sonalkar, S., D’silva, V., Chandra R., N., Vijayalakshmi, B.: A toolset for mod-
elling and verification of GALS systems. In: Alur, R., Peled, D. (eds.) CAV, LNCS, vol.
3114, pp. 385–387. Springer Berlin / Heidelberg (2004)
32. Sulzmann, M., Zechner, A., Hedayati, R.: Anforderungsdokument für die Fallstudie
Bahnübergangssicherungsanlage. Tech. rep., ICS AG (2011)
33. Contract specification and domain specific modelling language for GALS systems, an ap-
proach to system validation. Tech. rep., ICS AG, Verified Systems International GmbH, TU
Braunschweig (2011), Available at www.versyko.de
A Systematic Approach to Justifying Sufficient
Confidence in Software Safety Arguments?
1 Introduction
2 Safety Cases
The safety of safety-critical systems is of great concern. Many such systems are
reviewed and approved or certified by regulatory agencies. For example, medical
devices sold in the United States are regulated by the U.S. Food and Drug Ad-
ministration (FDA). Some of these medical devices, such as infusion pumps, can-
not be commercially distributed before receiving an approval from the FDA [18]. This means that manufacturers of such systems are expected not only to achieve safety but also to convince regulators that it has been achieved [20].
Recently, safety cases have become a popular and accepted way of communicating ideas and information about safety-critical systems among the system stakeholders. Manufacturers submit safety cases, presenting a clear, comprehensive and defensible argument supported by evidence, to the regulators to show that their products are acceptably safe to operate in the intended context [13]. There are different approaches to structuring and presenting safety cases.
The Goal Structuring Notation (GSN) [13] is one of the description techniques
that have proven useful for constructing safety cases. In this work,
we use the GSN notation in presenting safety cases. There is often commonal-
ity among the structures of arguments used in safety cases. This commonality
motivates the definition of the concept of argument patterns [13], which is an
approach to support the reuse of arguments among safety cases.
A new approach for creating clear safety cases was introduced in [11]. This
new approach basically separates the major components of the safety cases into
safety argument and confidence argument. A safety argument is limited to arguments and evidence that directly target system safety, for example, claiming that a specific hazard is sufficiently unlikely to occur and supporting this claim by testing results as evidence. A confidence argument is given separately to justify the sufficiency of confidence in this safety argument, for example, by questioning the confidence in the given testing results (e.g., is that testing exhaustive?).
exhaustive?). These two components are given explicitly and separately. They
are interlinked so that justification for having sufficient confidence in individual
aspects of the safety component is clear and readily available but not confused
with the safety component itself. This separation reduces the size of the core
safety argument. Consequently, this new structure is believed to facilitate the
development and reviewing processes for safety cases.
3 Related Work
There exists a widely used method for systematically constructing safety arguments, often referred to as the “Six-Step” method [12]. Although this method has been used successfully in constructing many safety arguments, it does not explicitly consider the confidence of the constructed safety arguments [10]. In [16, 19], lists of major factors that should be considered in determining the confidence in arguments are defined, together with questions to be considered when determining the sufficiency of each factor. We were inspired by this work and focus on one of these factors (i.e., the trustworthiness).
Argument patterns for confidence are given in [11]. Those patterns are defined
based on identifying and managing the assurance deficits to show sufficient con-
fidence in the safety argument. It is necessary to identify the assurance deficits
as completely as practicable. However, it is not quite clear how to do that. This
motivates us to take a step back to reasonably identify the assurance deficits.
Then the list of the recognized assurance deficits can be used in instantiating
the confidence pattern given in [11]. The constructed confidence arguments can
be used in the appraisal process for assurance arguments (e.g., [6, 14]).
There are attempts to quantitatively measure confidence in safety cases, such as [5, 7]. We believe that qualitative reasoning about the existence of confidence is more consistent with the inherent subjectivity of safety cases.
4 Proposed Approach
The best practice for supporting the top-claim of safety arguments (i.e., the
system is acceptably safe) is to show that the identified system hazards are
adequately mitigated. We refer to this type of argument as a contrapositive ar-
gument, since it refutes attempts to show that the system is unsafe. To build
this argument, one should first determine what could go wrong with this system
(i.e., identify the system hazards). Similarly, the top claim for a confidence ar-
gument is usually that sufficient confidence exists in an element E of the safety
argument. Such a claim can be supported by a contrapositive argument showing
that the identified assurance deficits associated with E are adequately miti-
gated [11]. Extending the analogy, one should first determine the uncertainties
associated with the element (i.e., identify the assurance deficits). Following sys-
tematic approaches helps in effectively identifying system hazards [1]. We believe
that following a systematic approach would also help in effectively identifying
assurance deficits.
The proposed systematic approach to identifying the assurance deficits re-
sults in the construction of positive confidence arguments. A positive argument is
a direct argument that relies on the properties of the actions taken in the devel-
opment (e.g., a well-established development process has been followed, a trusted
tool has been used, etc.). This distinguishes our confidence arguments from the
contrapositive ones discussed above. We stress that the intent of our work is not
to replace contrapositive arguments, but to aid in the identification of deficits
that can then be argued over using a contrapositive argument. However, note
that if no deficits are identified through the construction of a positive argument,
the resulting argument can be used as the requisite confidence argument.
We propose a common characteristics map to provide guidelines for the systematic construction of positive confidence arguments. Using the map, claims in the
positive confidence arguments can be decomposed until every goal is supported
by sufficient positive evidence. If all branches in the positive confidence arguments are supported by convincing evidence, then all assurance deficits are mitigated. For each goal in the resulting confidence arguments that cannot
be solved with sufficient evidence, an assurance deficit is identified and needs to
be addressed. After identifying the assurance deficits in this way, the confidence
pattern [11] can be instantiated to demonstrate that the recognized assurance
deficits are managed.
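The decomposition guided by the map can be pictured as a simple recursion over categories and their characteristics; the map contents and the evidence set below are illustrative assumptions, not the actual common characteristics map:

CHARACTERISTICS = {            # category -> categories of derived concerns
    "process": ["human factor", "tool", "mechanism"],
    "tool": ["tool assumptions", "tool limitations"],
}

def find_deficits(concern, category, evidence, deficits):
    """Decompose a concern until evidence is found; collect deficits."""
    if concern in evidence:               # goal supported by evidence
        return
    children = CHARACTERISTICS.get(category, [])
    if not children:                      # leaf goal without evidence:
        deficits.append(concern)          # an assurance deficit is found
        return
    for child in children:                # decompose along characteristics
        find_deficits(concern + "/" + child, child, evidence, deficits)

deficits = []
find_deficits("model creation", "process",
              evidence={"model creation/human factor"}, deficits=deficits)
print(deficits)   # e.g. unsupported tool assumptions/limitations, mechanism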
Mapping the confidence concerns. Elements belonging to the same category have
similar concerns about their trustworthiness, which need to be reasoned about in
a confidence argument. Table 1 illustrates this similarity with an example that
compares concerns regarding the outcomes of formal verification and testing, as examples of evidence cited by software safety arguments. The first column gives
a generalization for the next two columns. For example, the used tool is a gener-
alization that covers the used formal verification tool and the used testing tool.
The formal verification results and the testing results can be categorized as pro-
cess results as shown in the first row. We call this set of concerns characteristics
of the category. Arguments over the characteristics of a category are to support
sufficient confidence in the trustworthiness of elements that belong to this cate-
gory. When we start addressing a particular concern C from the characteristics
set, it may, in turn, give rise to a set of derived trustworthiness concerns, which
correspond to the category exhibited by C. We illustrate the notions of concern,
category of concerns, and characteristics of a category in Figure 1. For example,
suppose we are addressing concern A1 that falls into the category A. Its derived
concerns are B1 and C3, that fall into categories B and C, respectively. More-
over, every concern in A will have derived concerns in B and C. We then say
that B and C are the characteristics of A. Several concrete examples of concerns
and their categories are given below.
[Fig. 1: concerns C1, C2 and C3 within a category and their characteristics; all concerns in a category give rise, via a 1-to-many relation, to derived concerns in the same categories; possible alternatives are marked.]
[Figure: part of the common characteristics map used to construct a positive confidence argument for safety arguments; the concerns for a process include the human factor involved in the process, the used language, the used tool (the way of using the tool, the tool assumptions and limitations) and the used mechanism (applicability of the mechanism assumptions and limitations, the main steps of applying the mechanism); the goal decomposition continues along these branches.]
[Figure: layers of the map instance, covering the creation process for the GPCA timed automata model and its validation results (layer 2), and, further down (layer 5), the expertise of the person and of the reviewer, the UPPAAL assumptions and limitations, the manual transformation step, the time notion addition and the timeout behavior addition.]
Fig. 4. The common characteristics map instance for the GPCA timed automata model
but if it is derived from a created artifact then it will be in layer 2. In the first case, a goal will be created for it claiming the existence of sufficient confidence in the trustworthiness of the process results. In the second case, a strategy will be created with an argument by the process (e.g., argument by validation). Solid shapes and arrows in Figure 5 show part of the developed positive confidence argument for the GPCA timed automata model.
[Fig. 5, GSN fragment: the top goal G:Trustworthiness ("Sufficient confidence exists in the trustworthiness of the GPCA timed automata model") is addressed by the strategies S:Creation (argument over the trustworthiness in the GPCA timed automata creation process) and S:Validation (argument by validation). S:Creation is refined by S:TAHumanFactor (argument over the human factor involved in the use of the UPPAAL timed automata) and S:TimedAutomata (argument over the UPPAAL timed automata), with further strategies S:Transformation (argument over the transformation steps) and S:SemanticDiffs (argument over the semantic differences between the Simulink/Stateflow and the UPPAAL timed automata). Leaf goals claim sufficient confidence in the trustworthiness of the expertise of the developer who used the UPPAAL timed automata description language, of the expertise of the reviewer who reviewed the description, and of the handling of the semantic differences between the Simulink/Stateflow and the UPPAAL timed automata.]
Fig. 5. Part of the positive confidence argument for the GPCA timed automata model
The decomposition of the confidence argument nodes continues until every claim is supported with evidence. The dotted shapes and arrows in Figure 5 show the elements that require further decomposition. Decomposition for
G:Relation is required to support the claim about the trustworthiness in the
relation between the GPCA Simulink/Stateflow model and the GPCA timed
automata model. As the GPCA Simulink/Stateflow model was transformed into the GPCA timed automata model, this decomposition is given by the two strategies S:Transformation and S:SemanticDiffs. Any claim in the confidence argument that cannot be supported by evidence identifies an assurance deficit. For example, although we transformed the GPCA Simulink/Stateflow model into an equivalent GPCA timed automata model, we do not have evidence to show this equivalence at the semantic level. So the claim at G:SemanticDiffs is not supported and an assurance deficit is identified here.
For the identified assurance deficits, a contrapositive argument about their mitigation needs to be constructed using the confidence pattern defined in [11]. In our case, exhaustive conformance testing between the GPCA Simulink/Stateflow model and the GPCA timed automata model may be a reasonable mitigation. We also have to instantiate the confidence argument for the trustworthiness of conformance testing from the common characteristics map.
7 Discussion
Observations. The proposed common characteristics map is not complete, so it should not be used blindly. The generated confidence arguments may require additional elements. In particular, generated goals and strategies may need contexts, assumptions, and/or justifications. For example, a justification node, stating that the GPCA timed automata model was developed from the GPCA Simulink/Stateflow model using a careful transformation process [15], should be connected to goal G:DevelopmentMechanism in Figure 5. Note that if any context or assumption is added, then an argument about sufficient confidence in it should also be considered.
Nodes in the map instance cannot be omitted at will during confidence argument construction. Otherwise, confidence in the trustworthiness of the element under concern is questionable, and that identifies a potential assurance deficit. For example, if tool assumptions are not known, the tool assumptions node indicates a weakness that should be addressed. However, not every possible derived concern has to be present. If we decide to omit a branch in the instantiation, we have to supply appropriate justification.
Limitations. The common characteristics map presented in this paper covers only the trustworthiness factor. However, similar maps can be constructed for other factors such as appropriateness. To do this, we need to identify categories of appropriateness concerns and their characteristics. We leave this as future work. The common characteristics mechanism is not an automatic approach, i.e., it requires human interaction and decisions (e.g., which nodes can be ignored with justification and which parts should be added, as mentioned above).
While the structure of the argument is directly derived from the map in-
stance, the created goals and strategies still need to be formulated correctly. For
example, goal G:TADeveloper in Figure 5 is derived from the node expertise of
the person in Figure 4. The statement of the goal in G:TADeveloper is formed as
a proposition following the rules given in [12].
8 Conclusions
It is important to identify the assurance deficits and manage them to show suffi-
cient confidence in the safety argument. In this paper, we propose an approach to
systematically construct confidence arguments and identify the assurance deficits
in software safety arguments. Although the proposed mechanism does not guar-
antee to identify all assurance deficits, it helps to identify deficits that may have
been overlooked otherwise. Similarly, following systematic hazard identification
mechanisms does not guarantee that all hazards are identified.
The paper focuses on constructing positive confidence arguments with the
help of a proposed map. However, the map can also be used in the reviewing
process to help regulators identify gaps in submitted confidence arguments.
Our preliminary experience of applying the proposed approach has revealed
that the common characteristics mechanism yields the expected benefits in ex-
ploring important uncovered assurance deficits in software safety arguments.
References
1. Federal Aviation Administration. FAA System Safety Handbook, Chapter 8: Safety
Analysis/Hazard Analysis Tasks, 2000.
2. R. Alexander, T. Kelly, Z. Kurd, and J. Mcdermid. Safety Cases for Advanced
Control Software: Safety Case Patterns. Technical report, University of York,
2007.
3. A. Ayoub, B. Kim, I. Lee, and O. Sokolsky. A Safety Case Pattern for Model-Based
Development Approach. In NFM2012, pages 223–243, Virginia, USA, 2012.
4. G. Behrmann, A. David, and K. Larsen. A tutorial on UPPAAL. In Formal
Methods for the Design of Real-Time Systems, LNCS, pages 200–237, 2004.
5. R. Bloomfield, B. Littlewood, and D. Wright. Confidence: Its Role in Dependability
Cases for Risk Assessment. In Dependable Systems and Networks, 2007. DSN ’07.
37th Annual IEEE/IFIP International Conference on, pages 338 –346, 2007.
6. L. Cyra and J. Górski. Expert Assessment of Arguments: A Method and Its
Experimental Evaluation. In Computer Safety, Reliability, and Security, 27th In-
ternational Conference, SAFECOMP, 2008.
7. E. Denney, G. Pai, and I. Habli. Towards Measurement of Confidence in Safety
Cases. In International Symposium on Empirical Software Engineering and Mea-
surement (ESEM’11), Washington, DC, USA, 2011. IEEE Computer Society.
8. The Generic Patient Controlled Analgesia Pump Model.
http://rtg.cis.upenn.edu/gip.php3.
9. I. Habli and T. Kelly. Achieving Integrated Process and Product Safety Arguments.
In the 15th Safety Critical Systems Symposium (SSS’07). Springer, 2007.
10. R. Hawkins and T. Kelly. Software Safety Assurance – What Is Sufficient? In 4th
IET International Conference of System Safety, 2009.
11. R. Hawkins, T. Kelly, J. Knight, and P. Graydon. A New Approach to creating
Clear Safety Arguments. In 19th Safety Critical Systems Symposium (SSS’11),
pages 3–23. Springer London, 2011.
12. T. Kelly. A six-step Method for Developing Arguments in the Goal Structuring
Notation (GSN). Technical report, York Software Engineering, UK, 1998.
13. T. Kelly. Arguing safety – a systematic approach to managing safety cases. PhD
thesis, Department of Computer Science, University of York, 1998.
14. T. Kelly. Reviewing Assurance Arguments – A Step-by-Step Approach. In Work-
shop on Assurance Cases for Security - The Metrics Challenge, Dependable Systems
and Networks (DSN), 2007.
15. B. Kim, A. Ayoub, O. Sokolsky, P. Jones, Y. Zhang, R. Jetley, and I. Lee. Safety-
Assured Development of the GPCA Infusion Pump Software. In EMSOFT, pages
155–164, Taipei, Taiwan, 2011.
16. C. Menon, R. Hawkins, and J. McDermid. Defence standard 00-56 issue 4: Towards
evidence-based safety standards. In Safety-Critical Systems: Problems, Process and
Practice, pages 223–243. Springer London, 2009.
17. Ministry of Defence (MoD) UK. Defence Standard 00-56 Issue 4: Safety Man-
agement Requirements for Defence Systems, 2007.
18. U.S. Food and Drug Administration, Center for Devices and Radiological Health.
Guidance for Industry and FDA Staff - Total Product Life Cycle: Infusion Pump
- Premarket Notification [510(k)] Submissions, April 2010.
19. R. Weaver. The Safety of Software - Constructing and Assuring Arguments. PhD
thesis, Department of Computer Science, University of York, 2003.
20. F. Ye. Contract-based justification for COTS component within safety-critical ap-
plications. PhD thesis, Department of Computer Science, University of York, 2005.
Determining Potential Errors in Tool Chains
Strategies to Reach Tool Confidence According to ISO 26262
M. Wildmoser, J. Philipps, and O. Slotosch
{wildmoser,philipps,slotosch}@validas.de
Keywords: ISO 26262, Tool Chain Analysis, Tool Qualification, HAZOP, potential tool failure, potential tool error
1 Introduction
The use of software to control technical systems (machinery, aircraft, cars) carries risks, in that software defects may endanger life and property. Safety standards, such as the recent ISO 26262 [1] for the automotive domain, aim to mitigate these risks through a combination of demands on organization, structure and development methods. These standards and the practices they encode can also be seen as a sign of maturity, a shift from an anything-goes attitude of programming to a more disciplined engineering approach to software development.
As in any discipline, with growing maturity more emphasis is put not only on the way of working, but also on the tools used. In safety standards, we can observe a similar development. Earlier standards put little emphasis on tool use, perhaps roughly demanding an argument that each tool be "fit for use", mainly for tools used in generating or transforming code or in testing software or systems. Recent standards, such as ISO 26262, take a more holistic viewpoint. Not only the tools, but also their use in the development process must be analyzed, risks identified and counter-measures employed. In this line of thinking, requirement management tools and other supporting tools also fall within the scope of such an analysis.
While there is a certain body of work on tool validation issues [2, 3] and some work on specific tool use patterns to reduce the need for tool validation [4], there is so far little work that considers the holistic approach of establishing tool confidence according to ISO 26262. A notable exception is [5], where the process of tool evaluation is explained in detail. The contribution of this paper is a strengthening of [5]: we follow a similar approach, but concentrate on determining potential tool failures as a basis for obtaining correct tool confidence levels.
In practice, the determination of the TCLs of a real development tool chain has to deal with dozens of tools and hundreds of use cases, potential tool failures and counter-measures for these. Rigorous bookkeeping is needed to obtain comprehensible and consistent results. In addition, if not done in a systematic way, the determination of potential tool failures will largely depend on the experience, biased view and chosen approach of the person carrying out the analysis.
This paper proposes strategies for the systematic determination of potential tool errors and judges their comprehensiveness, uniformity, adequateness of results and scalability. This judgment is not merely theoretical but is backed up by the practical experience gained by the authors from applying the proposed strategies in a large-scale project in which the entire development tool chain for a product family of an automotive supplier was evaluated. The paper also introduces a tool called Tool Chain Analyzer that has been built to support tool chain evaluation.
This paper is structured as follows. In the next section, based on a simple example, we give an overview of the tool evaluation process. Section 3 contains the main contribution: we propose and discuss two strategies for identifying potential tool failures. Section 4 briefly introduces a tool to support these strategies. Section 5 concludes.
2 Tool Evaluation
Tool evaluation is about determining the TCL for all tools used in the development process of a safety-related product. The ISO 26262 defines what a tool evaluation report must contain, but leaves the process for tool evaluation largely open. The process we currently follow for tool evaluation consists of the following steps:
1. Define list of tools
2. Identify use cases
3. Determine tool impact
4. Identify potential tool failures
5. Identify and assign measures for tool failure detection and prevention
6. Compute tool confidence level for each use case and tool
First, we create a list of all tools used in the development process. Then, by studying the development process and by interviewing tool users, we identify and write down the use cases for each tool (why? who? when? what? how?). For each use case we then determine the tool impact (TI1, TI2) by answering two questions:
1. Can a tool failure inject a safety-related fault into the product?
2. Can a tool failure lead to the non-detection of a safety-related fault in the product?
Only if both questions can be answered with "No" does the tool have no impact (TI1). For every use case with impact (TI2) the potential tool failures need to be identified. For each potential tool failure we look for existing measures for detection or prevention in the development process. If such measures are found, we assign them to the corresponding potential tool failure together with an estimated tool error detection level (TD1–TD3). From the TI and TD we finally determine the TCL according to the tables in ISO 26262 (see Fig. 1).
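As an illustration, this final step is a simple table lookup. The sketch below (the function name is ours) encodes the determination table of ISO 26262 part 8: TI1 always yields TCL1, while for TI2 the TCL follows the detection level.

    def tcl_from_ti_td(ti, td):
        """Tool confidence level from tool impact (1 or 2) and tool error
        detection level (1..3), following the ISO 26262-8 table."""
        if ti == 1:
            return 1                   # no impact: TCL1, no qualification needed
        return {1: 1, 2: 2, 3: 3}[td]  # TI2: TCL follows the detection level

    assert tcl_from_ti_td(2, 3) == 3   # impacting use case with poor detection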
To give a short example (see Fig. 2) assume a tool chain consisting of the tools Zip,
Diff and Ls, which are used for release packaging.
In this tool chain we have four use cases Zip / contract, Zip / extract, Diff / compare,
and Ls / compare. Each use case has its own set of inputs and outputs, e.g. Zip / con-
tract takes a “File Tree” as input and delivers an “Archive” as output. Since the “Ar-
chive” contains product files the use cases Zip / contract and Zip / extract have tool
impact (TI2) as they might inject faults into the product.
These use cases need to be analyzed for potential tool failures, e.g. “File Content
Corrupted” in use-case Zip / contract and appropriate checks for these failures need
to be assigned if possible, e.g. “Diff File Trees” in use-case Diff / compare. Note that
in this tool chain the tools are not only sources for tool failures but can also act as
sinks for tool failures by providing measures for failure detection or prevention. The
effectiveness of these measures is expressed by the assigned TD level, which is omit-
ted in the figure above.
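The example can also be written down as a small use-case model. The structure below is merely our illustration of the bookkeeping involved (field names invented, and the TI values for Diff and Ls assumed from their role as detection sinks), not the format of any particular tool.

    # Illustrative model of the example tool chain: each use case lists its
    # inputs, outputs and the tool impact determined by the two questions.
    USE_CASES = [
        {"tool": "Zip",  "case": "contract", "in": ["File Tree"],
         "out": ["Archive"],   "ti": 2},  # may inject faults into the product
        {"tool": "Zip",  "case": "extract", "in": ["Archive"],
         "out": ["File Tree"], "ti": 2},
        {"tool": "Diff", "case": "compare", "in": ["File Tree", "File Tree"],
         "out": ["Report"],    "ti": 1},  # assumed: acts only as a detection sink
        {"tool": "Ls",   "case": "compare", "in": ["File Tree"],
         "out": ["Listing"],   "ti": 1},  # assumed likewise
    ]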
3 Determining Potential Tool Failures
This section defines terminology, goals and strategies for determining potential tool failures. In analogy to Laprie's fault/error/failure concept [6] and the ISO 26262 [1] vocabulary, we define the terms as follows: a tool fault is a defect within the tool; a tool error is a deviation of the tool's internal state, caused by the activation of a fault; a tool failure is a deviation of the tool's delivered output from the correct output.
We also distinguish between concrete and potential tool errors and failures:
• Concrete tool error/failure: a specific tool error/failure, e.g. Zip v7.3 corrupts file contents larger than 4 GB in compression method "Ultra".
• Potential tool error/failure: an abstract tool error/failure, e.g. File Content Corruption.
The aim of tool evaluation and qualification is to counter Murphy's law: anything that can go wrong will go wrong. Tool evaluation requires the determination of the potential tool failures.
A determination strategy for potential tool failures is comprehensive if for every concrete tool failure it is able to determine a subsuming potential tool failure; in other words, if it is able to reach all concrete tool failures. If a strategy is not comprehensive, the TCL might be inadequate.
A potential tool failure determination is uniform if the same methods and the same levels of rigor are applied to all use cases of all tools. Using unbalanced methods and levels of rigor is a typical sign of poor process quality.
The determined potential tool failures should also have an appropriate level of abstraction. If this level is too high, no counter-measures can be found; if it is too low, unnecessary effort is introduced.
Finally, the strategy should be scalable in the sense that the effort spent on tool evaluation should be acceptable and should not grow drastically with the complexity of the analyzed tool chain, as determined by the number of tools, use cases, artifacts and data flow dependencies.
On the other hand, tool failures are caused by errors in tool functions. For example, an error occurring in a function f3 might lead to a wrong mapping of input part D to output part d. In order to describe a tool failure one can take two angles of view:
• describe what goes wrong inside the tool, e.g. an error in function f3;
• describe what is wrong in the produced outputs, e.g. a wrong part d in the output file.
In the first description technique one refers to the internals of the tool, that is, the functions needed to accomplish the use case, whereas in the second one refers to properties or structure of the output data. Both descriptions may in addition refer to the properties of tool inputs that trigger the error. Note that both descriptions refer to the same tool failure; they merely characterize it from different views.
These two views give rise to two tool failure determination strategies: analyze what can go wrong in a tool, or analyze what can be wrong with the artifacts.
The first strategy systematically refines the abstract error "Tool Failure" by analyzing the functions in the tool. We call this strategy Function-based failure determination. The second strategy only looks at the output data of tools, and we thus call it Artifact-based failure determination. By going along the structure of the output data one can systematically refine the abstract error "Artifact broken" into more concrete potential tool failures, e.g. output part d broken.
To map potential tool errors to a set of resulting tool failures one can apply FMEA-like inductive thinking: what can happen if this function produces this tool error? If one traces this question along the dependencies of functions one can derive the corresponding potential tool failures.
In the Zip example from the previous section we can decompose the use-case Zip /
contract into the following functions: Iterating Files, Loading Files, Transforming
Files, Writing Files. Each of these functions brings along its own set of typical poten-
tial tool errors, which can be consolidated (see Fig. 4) and then transformed to tool
failures.
Sometimes the potential tool failures that come out of such an analysis are too
coarse. A way out is often to further decompose the artifacts by their structure or
some other properties and then to confront each part/property with the guide words
again.
In our case we can decompose the artifact “Archive” into the parts “File Content”
and “File Properties”. By doing this we do not end up with the potential tool failure
“File Corrupted”, but with two finer potential tool failures “File Content Corrupted”
and “File Properties Corrupted”, which can now be detected by different measures,
e.g. “Diff File Trees” and “Compare ls -l” (see Fig. 2).
To apply the function-based strategy one has to transform tool errors to tool failures. The internal dependencies between functions are often not known to the public. If so, we have to think pessimistically and assume that every tool error will corrupt all output artifacts. Even the log files, which often provide a chance to detect tool errors, might be affected. Finding effective counter-measures often requires an analysis of the artifacts, as is done in the artifact-based strategy right away. On the other hand, if sufficient details on a tool's internals are available, a rigorous function-based strategy pays off when it turns out that a tool has TCL2 or TCL3 and requires qualification. Knowing the hardly detectable internal tool errors and the critical functions is very valuable for writing a tailored qualification plan or building a qualification kit.
4 Tool Support for Tool Evaluation: Tool Chain Analyzer
ISO 26262 requires all tools used in development to be analyzed. Tool chains in industrial practice are rather large and may lead to many potential errors to manage. Assigning detection or prevention measures to all these errors, subsuming errors and maintaining their relation to use cases, standard functions and artifacts is a complex task, which needs tool support. Within the research project RECOMP, Validas has built such a tool called Tool Chain Analyzer (TCA).
Fig. 6. Tool Chain Analyzer with generated figures and tables (MS Word)
The TCA (see Fig. 6 and Fig. 7) allows the user to model a tool chain structure in terms of tools having use cases which read or write artifacts. The TCA also allows the user to model the confidence in a tool chain in terms of tool failures, checks and restrictions as shown above (see Fig. 2), together with their TD. From the tool chain model and the confidence model the TCA automatically computes the TCL of all tools in the following way: First, the TCA obtains the TD for each potential failure by taking the TD with the lowest number (highest probability) that the user has assigned for one of the assigned counter-measures.
Second, the TCA computes the TD for a use case by taking the worst TD for any
potential failure identified for this use case. Third, by combining the TD for a use case
with the TI of this use case according to the ISO 26262 table the TCA derives a TCL
for this use case. Finally, the TCL of a tool is the worst TCL for any use case.
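This computation lends itself to a few lines of code. The sketch below is our own illustration, not TCA source; it reuses tcl_from_ti_td from the earlier sketch and takes the lowest-numbered TD as the best one.

    def td_of_failure(measure_tds):
        """TD of a potential failure: best (lowest-numbered) assigned measure."""
        return min(measure_tds) if measure_tds else 3    # no measure: worst TD

    def tcl_of_tool(use_cases):
        """use_cases: list of (ti, failures); each failure is the list of TD
        levels of its assigned counter-measures."""
        tcl = 1
        for ti, failures in use_cases:
            if ti == 1:
                continue                                  # TI1 stays at TCL1
            td = max(td_of_failure(m) for m in failures)  # worst failure wins
            tcl = max(tcl, tcl_from_ti_td(ti, td))        # worst use case wins
        return tcl

    # Zip with one impacting use case whose failure is checked by a TD1 measure:
    print(tcl_of_tool([(2, [[1]]), (1, [])]))             # -> 1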
The TCA also offers numerous plausibility checks for the tool chain and confidence model. For example, if a detection measure from tool B is assigned to a potential failure of tool A, then there must be a data flow, in terms of input/output artifacts, from tool A to tool B; otherwise the assignment of this detection measure is invalid.
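Such a check is essentially a reachability test over the data flow between tools. A minimal sketch (names ours, assuming writes/reads map each tool to its output/input artifact sets):

    def reaches(writes, reads, src, dst):
        """True if an artifact path leads from tool src to tool dst."""
        seen, frontier = {src}, [src]
        while frontier:
            tool = frontier.pop()
            for other in writes:
                if other not in seen and writes[tool] & reads[other]:
                    seen.add(other)
                    frontier.append(other)
        return dst in seen

    writes = {"Zip": {"Archive"}, "Diff": set()}
    reads  = {"Zip": {"File Tree"}, "Diff": {"Archive", "File Tree"}}
    print(reaches(writes, reads, "Zip", "Diff"))   # -> True: assignment is valid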
The TCA can also generate an MS Word report, which contains detailed tables and figures for each identified potential tool failure, such that the computed TCL becomes plausible, comprehensible and checkable by review. The structure of this Word report is designed such that it can be directly used as part of the tool criteria evaluation report required by ISO 26262.
5 Conclusion
Tool evaluation is a critical step that comes before tool qualification. Besides deter-
mining the TCL the tool evaluation often identifies ways to rearrange or extend the
existing work flow such that tool qualification becomes obsolete.
We have applied both the function-based and the artifact-based identification
strategies for potential tool failures in a large scale industrial project using the TCA.
Our experience is that the function-based strategy tends to yield failure descriptions that are too detailed or overlapping. While having very detailed tool failure descriptions is useful for tool qualification, it is unnecessary for tool evaluation, which only requires failures at a level of abstraction at which counter-measures can be identified and assigned.
Acknowledgment
This work has been supported by ARTEMIS and the German Federal Ministry of
Research and Education (BMBF) within project RECOMP under research grant
1IS10001A.
References
1. International Organization for Standardization: ISO 26262 Road Vehicles – Functional Safety. 1st Edition, 2011-11-15
2. Stürmer I. and Conrad M.: Code Generator Certification: a Testsuite-Oriented Approach. In:
Proceedings of Automotive-Safety & Security, 2004.
3. Schneider S., Slotosch O.: A Validation Suite for Model-based Development
Tools. In: Proceedings of the 10th International Conference on Quality Engineering in Soft-
ware Technology, CONQUEST 2007.
4. Beine M.: A Model-Based Reference Workflow for the Development of Safety-Critical
Software. In: Embedded Real Time Software and Systems, 2010.
5. Hillebrand J., Reichenpfader P., Mandic I., Siegl H., and Peer C.: Establishing Confidence
in the Usage of Software Tools in Context of ISO 26262. In: Flammini F., Bologna S.,
Vittorini V. (eds.) SAFECOMP 2011. LNCS, vol. 6894. Springer, Heidelberg (2011)
6. Laprie J.C. (ed.): Dependability: Basic Concepts and Terminology. Book. Springer-Verlag,
1992. ISBN 0-387-82296-8
Safety-Focused Deployment Optimization in
Open Integrated Architectures
1 Introduction
2 Related Work
The OMG defines deployment in [6] as a five step process encompassing installa-
tion, configuration, planning, preparation and launch. Our technique addresses
the optimization of the mapping of applications into the platform topology, which
relates to the planning step in the OMG process. In order to narrow down our
scope to Open Integrated Architectures and safety, we tailored the deployment
problem. A meta-model capturing the resulting deployment problem is shown
in Fig. 1. The depicted model is described in the following paragraphs: Charac-
teristics specific to the architecture are described in section 3.1, safety specific
characteristics are described in section 3.2. Section 3.3 introduces an example.
[Fig. 1 (figure): meta-model of the tailored deployment problem. A PlatformTopology consists of Platforms (typed by PlatformType, with a PlatformServiceDemand) that are connected by ComChannels (typed by ChannelType) and divided into Partitions; an ASWCNetwork consists of ASWCs (with intLevel : IntLevel and compLevel : CompLevel) connected by Signals (with intLevel : IntLevel); a solution MappingSet contains ASWCMappings of ASWCs to Partitions and SignalMappings of Signals to channels.]
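Transcribed into code, the tailored deployment problem might look as follows. This is a sketch with invented field names that mirrors the meta-model of Fig. 1, not an excerpt from the authors' implementation.

    from dataclasses import dataclass, field

    @dataclass
    class ASWC:                   # application software component
        name: str
        int_level: int            # integrity level
        comp_level: int           # complexity level

    @dataclass
    class Signal:
        name: str
        int_level: int
        sender: str               # names of the communicating ASWCs
        receiver: str

    @dataclass
    class Deployment:             # the "MappingSet" solution of Fig. 1
        aswc_to_partition: dict = field(default_factory=dict)  # ASWC -> Partition
        signal_to_channel: dict = field(default_factory=dict)  # Signal -> ComChannel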
The concept of safety integrity levels (SIL), or comparable concepts like develop-
ment assurance level (DAL) [9], is used in safety standards across most domains.
Integrity levels categorize hazards according to the level of risk they pose and
tailor the safety standards in such a way that the risk reduction provided by
the system is appropriate. The higher the integrity level, the stricter and more
numerous are the requirements made by the standard. As a consequence, the
integrity level significantly regulates the development costs of a system.
During system development, it is common to allocate integrity levels to com-
ponents, if a component has failures that may lead to a hazard. Simply tagging
a component with an integrity level can be regarded as a simplification, as it
abstracts from the specific failure leading to the hazard that has to be avoided
or controlled. Still, standards specify deployment rules that are based upon in-
tegrity levels, which is why we assign integrity levels (IntLevel) to ASWCs.
The same is true for signals. We assign integrity levels to signals, meaning
that there is at least one failure mode related to the transmission of the signal
(like corruption, delay, insertion, masquerading, etc.) that may lead to a hazard
that poses the corresponding level of risk.
As a prerequisite for our approach, we assume that there is an initial speci-
fication of a functional network of applications, including an integrity level clas-
sification of each ASWC, identified by an early safety analysis. Furthermore, we
assume that the required integrity level of the platform solely results from the
integrity levels of the allocated ASWCs.
[Figure: platform 1 (an ECU with partitions part 1.1 and part 1.2) and platform 2 (partitions part 2.1 and part 2.2) connected by channel CH 1; among the depicted ASWCs are uncritical 1 and uncritical 2 (criticality QM, complexity high), exchanging the signal u1 (criticality QM).]
Fig. 2. The running example: ASWCs are characterized by three strings, from top to bottom: name, criticality, complexity. Signals are characterized by name and criticality.
4 Deployment Evaluation
This chapter introduces two metrics to evaluate, from the safety perspective, a solution for a specific deployment problem. The metrics implement a cost function that is minimized by the optimization algorithm presented in chapter 5. In particular, the metrics evaluate negative effects caused by two core characteristics of integrated architectures. The cohesion metric is presented in section 4.1 and focuses on the aspect of shared computational resources, as the metric evaluates the costs of interferences between ASWCs. The coupling metric is presented in section 4.2 and evaluates the costs caused by safety mechanisms that protect against communication failures. In addition to the quantitative evaluation with the metrics, we allow for the specification of constraints to limit the deployment solution space. These constraints are introduced in section 4.3. Finally, section 4.4 introduces a mechanism to parameterize the metrics and the transformation of the cost functions and the constraints into a fitness function for the GA.
Let P be the set of partitions in the platform topology and maxintLevel(part) be the maximum integrity level amongst the applications in part. Then, cohesion is calculated as:

coh(P) = \sum_{part \in P} \sum_{aswc \in part} dc(aswc, maxintLevel(part))    (3)
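A direct transcription of equation (3) is sketched below, assuming a cost function dc(aswc, level) that returns the cost of hosting aswc in a partition whose maximum integrity level is level; the toy dc used here is ours, and the actual parameterization is discussed in section 4.4.

    # Sketch of the cohesion metric. P maps each partition to its ASWCs;
    # an ASWC is a (name, integrity_level) pair in this toy model.
    def cohesion(P, dc):
        cost = 0
        for part in P.values():
            max_level = max(level for _, level in part)
            cost += sum(dc(aswc, max_level) for aswc in part)
        return cost

    # Toy dc: lower-integrity software in a higher-integrity partition is costly.
    dc = lambda aswc, max_level: max_level - aswc[1]
    P = {"part 1.1": [("medium", 2), ("comparator", 3)]}
    print(cohesion(P, dc))   # -> 1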
Fig. 3 shows two solutions for deploying the running example side by side.
The left one shows a deployment yielding no cohesion costs, since there are only
equally critical ASWCs in each partition. The deployment shown on the right
yields a much worse cohesion, since both uncritical, complex components are
deployed to the same partition as the critical comparator component.
[Figure: two candidate deployments of the running example, showing ASWCs such as medium (SIL B), low (SIL B) and comparator (SIL C), and the signals s 1.1, s 1.2, s 1.3, s 2.1, s 2.2 and s 2.3 (all SIL B) on the channels channel 1.1, channel 1.2, channel 2.1, channel 2.2 and CH 1.]
Fig. 3. Two example deployments illustrating the cohesion metric. The deployment of an ASWC to a partition is indicated by equal fill color and pattern of the respective shapes. The deployment of signals is not indicated.
Fig. 4 shows two solutions for deploying the running example side by side.
The deployment shown on the left side yields low coupling costs, since only
the signal s 2.3 is deployed to an inter-platform channel. The deployment shown
on the right side, however, requires the inter-platform communication of five
additional signals, which results in much higher coupling costs.
4.3 Constraints
This section introduces two constraints that allow the designer to restrict the
deployment solution space. Whereas the aspects evaluated in the previous two
sections have quantifiable effects on system development, solutions that violate
constraints are infeasible and will, therefore, be discarded.
The first constraint allows the designer to specify fixed mappings of an ASWC
to a platform or certain platform types. Restricting the mapping of an ASWC to
a specific platform is, for example, necessary if the ASWC is an I/O conditioning
component that must run on the platform that is hard-wired to the respective
sensor or actuator. Restricting the mapping of an ASWC to a platform type is
necessary if the ASWC requires specific resources that only this platform type provides.
[Figure: the same two deployments as in Fig. 3, extended with the platform level (platform 1 with part 1.1 and part 1.2, platform 2 with part 2.1 and part 2.2, connected by CH 1) and with the deployment of the signals.]
Fig. 4. Two example deployments illustrating the coupling metric. The deployment of ASWCs is indicated as in Fig. 3. The deployment of a signal is indicated as follows: locally exchanged signals are shown with a dotted line; signals deployed to the respective intra-platform channel are shown with a dashed line; signals deployed to the inter-platform channel CH 1 are shown with a solid line.
The second constraint is used to represent dissimilarity relations between typically two or three ASWCs, which means that the corresponding ASWCs have to be developed heterogeneously to avoid systematic common cause failures. This also means that the platforms the ASWCs are deployed to must not have systematic common cause failures either. Consequently, the dissimilarity constraint is violated if the type of the host platforms of at least two dissimilar ASWCs is the same.
Most of the existing deployment evaluation approaches allow specifying the above-mentioned or comparable constraints. There are, however, many other possible ways of restricting the deployment solution space. Every objective function can, for example, be used to implement a constraint if the user defines a pass/fail criterion using a minimum or a maximum threshold (e.g. workload >66%).
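Both constraints reduce to simple predicates over a candidate mapping, as the following sketch with invented field names shows:

    # Sketch: constraint checks over a candidate deployment.
    def fixed_mapping_ok(mapping, fixed):
        """mapping: ASWC -> platform; fixed: ASWC -> set of allowed platforms."""
        return all(mapping[a] in allowed for a, allowed in fixed.items())

    def dissimilarity_ok(mapping, platform_type, groups):
        """groups: iterable of sets of dissimilar ASWCs; violated as soon as
        two ASWCs of a group land on platforms of the same type."""
        for group in groups:
            types = [platform_type[mapping[a]] for a in group]
            if len(set(types)) < len(types):
                return False
        return True

    mapping = {"ch1": "platform 1", "ch2": "platform 2"}
    ptype = {"platform 1": "type A", "platform 2": "type B"}
    print(dissimilarity_ok(mapping, ptype, [{"ch1", "ch2"}]))   # -> True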
5 Deployment Optimization
In this section we present a deployment optimization algorithm based on the
introduced fitness function and a GA. As stated before, the focus of our work
lies on the presented metrics and not on the selection of this specific optimization
algorithm. We used a GA to test and evaluate our metrics, because they were
integrated in a larger scale optimization running a GA as well. Other techniques
like linear programming, however, are also suitable for deployment optimization.
A GA is a stochastic search algorithm using techniques adopted from natural evolution to find near-optimal solutions for complex optimization problems [5]. The optimization process starts with a number of typically randomized solutions,
338 B. Zimmer, M. Trapp, P. Liggesmeyer, J. Höfflinger, and S. Bürklen
the so-called initial population. After initialization, each member of the popula-
tion is evaluated for its fitness, and then a new population is reproduced from
the old population using techniques like crossover and mutation. Members with
a higher fitness are more likely to participate in the reproduction than members
with a low fitness. After the new population is generated, it is evaluated for its
fitness which is followed by another reproduction of the next new population.
This optimization loop typically terminates after a fixed number of cycles or when one individual reaches a sufficient predefined fitness.
To be able to use standard algorithms like crossover and mutation, solutions of a specific problem have to be represented by so-called chromosomes. A chromosome is divided into several genes, each gene representing a distinct part of a potential solution. In our case, the intuitive chromosome for a specific
deployment problem with k ASWCs and n signals would be an array of k genes
representing ASWC mappings, concatenated with n genes representing signal
mappings.
Nevertheless, we decided to include only the ASWC mappings in the chromosome and let the GA optimize these alone. This is because the signal mapping highly depends on the ASWC mapping, and we are able to calculate the optimal signal mapping directly once the ASWC mapping is determined. For a specific deployment problem with k ASWCs and m partitions,
Each of the k genes is represented by an integer between 1 and m, where ga = b
denotes that ASWC a is assigned to partition b. This results in a slightly adapted
version of the aforementioned GA optimization loop, since we have to add the
signal mappings to the ASWC mappings generated by the GA for our fitness
function to work. The resulting loop consists of three steps: (1) calculate the
fitness for each individual, (2) reproduce a new set of ASWC mappings, (3) cal-
culate optimal signal mappings for each individual. The optimization stops if
the fitness improvement within the last 30 generations was below 5%.
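The encoding and the adapted loop can be sketched as follows. This is plain Python rather than the JGAP-based implementation; fitness is assumed to decode a chromosome, add the optimal signal mappings and return a strictly positive cost to be minimized, and reproduce stands in for the GA's selection, crossover and mutation operators.

    import random

    def random_chromosome(k, m):
        """k genes, gene j in 1..m: ASWC j is assigned to partition g_j."""
        return [random.randint(1, m) for _ in range(k)]

    def optimize(k, m, fitness, reproduce, pop_size=50):
        population = [random_chromosome(k, m) for _ in range(pop_size)]
        best, stale = None, 0
        while stale < 30:                        # stop: <5% gain over 30 gens
            ranked = sorted(population, key=fitness)
            if best is None or fitness(ranked[0]) < 0.95 * fitness(best):
                best, stale = ranked[0], 0       # significant improvement
            else:
                stale += 1
            population = reproduce(ranked)       # selection, crossover, mutation
        return best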
The optimal signal mapping can be determined in a straightforward fashion, since the costs of individual signal mappings do not influence each other. First, we check the deployment of the receiver and sender ASWC of each signal. If both are in the same partition, no channel is needed; if both are on the same platform, we deploy the signal to the local channel. If both are hosted on different platforms, we search for all available channels connecting the respective platforms. In case there is no such channel, we flag the ASWC mapping as invalid. In case there is more than one channel, we search for the channel that yields the lowest costs and deploy the signal accordingly.
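This rule translates directly into code. The sketch below uses invented structures, with cost(channel, signal) standing in for the coupling cost of section 4.2.

    # Sketch: optimal mapping of a single signal, given the ASWC mapping.
    def map_signal(signal, part_of, platform_of, channels, cost):
        """part_of/platform_of: ASWC -> partition/platform; channels maps a
        platform pair to the list of inter-platform channels connecting it."""
        src, dst = signal["from"], signal["to"]
        if part_of[src] == part_of[dst]:
            return None                          # same partition: no channel
        if platform_of[src] == platform_of[dst]:
            return "local"                       # intra-platform channel
        options = channels.get((platform_of[src], platform_of[dst]), [])
        if not options:
            raise ValueError("invalid ASWC mapping: no connecting channel")
        return min(options, key=lambda ch: cost(ch, signal))  # cheapest channel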
Optimizing the running example presented in Fig. 2, the GA converged on average within 2.3 seconds. Optimizing an example based on a real-world system consisting of 27 ASWCs, 51 signals, 13 platforms and 2 channels, the GA terminated on average within 18.5 seconds. All measurements were taken on a commercially available mobile CPU running at 2.40 GHz. The GA is implemented using the Java Genetic Algorithms Package (JGAP) and a Java-based implementation of our fitness function.
References
1. Website of the AUTOSAR standard, http://www.autosar.org/
2. ARINC: ARINC 653 P1-2, Avionic application software standard interface, Part
1 - Required services (2005)
3. Bastarrica, M.C., Caballero, R.E., Demurjian, S.A., Shvartsman, A.A.: Two op-
timization techniques for component-based systems deployment. In: Proc. of the
13th Int. Conf. on Software & Knowledge Engineering. pp. 153–162 (2001)
4. Boone, B., de Turck, F., Dhoedt, B.: Automated deployment of distributed software
components with fault tolerance guarantees. In: Proc. of the 6th Int. Conf. on
Software Engineering Research, Management and Applications. pp. 21–27 (2008)
5. Goldberg, D.E.: Genetic algorithms in search, optimization, and machine learning.
Addison-Wesley (1989)
6. OMG: Deployment and configuration of component-based distributed applications
specification (April 2006)
7. Pinello, C., Carloni, L., Sangiovanni-Vincentelli, A.: Fault-tolerant distributed de-
ployment of embedded control software. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems 27(5), 906–919 (2008)
8. Purao, S., Jain, H., Nazareth, D.: Effective distribution of object-oriented applica-
tions. Communications of the ACM 41, 100–108 (1998)
9. RTCA: DO-178B – Software Considerations in Airborne Systems and Equipment Certification (1993)
10. RTCA: DO-297 – Integrated Modular Avionics (IMA) – Development guidance
and certification considerations (2005)
11. Sangiovanni-Vincentelli, A., Di Natale, M.: Embedded system design for automo-
tive applications. IEEE Computer 40(10), 42–51 (2007)
12. UK MoD: Def Stan 00-74: ASAAC standards part 1: Standards for software
13. Zimmer, B., Bürklen, S., Knoop, M., Höfflinger, J., Trapp, M.: Vertical safety
interfaces – Improving the efficiency of modular certification. In: Computer Safety,
Reliability, and Security, pp. 29–42. Springer Berlin / Heidelberg (2011)
Qualifying Software Tools, a Systems Approach
1 Introduction
been of less importance in the past, since related safety issues could then be han-
dled by restrictions on the processes used by the operators2 manually handling
these interactions. However, modern development environments are introducing
an increased support for automated tool integration, decreasing the possibility
for operators to monitor and act on safety issues due to tool integration.
In this paper we present nine safety goals and propose a method for qualifying software tools as parts of tool chains, which together highlight the hitherto obscured safety issues caused by tool integration and allow for more precision when identifying software in need of qualification during certification. The safety goals and method build upon systems thinking [2] to approach a modern development environment as a whole, i.e. they do not analyze parts of development environments in isolation (which would risk failing to take the relationships between the parts into account).
In Section 2 we present the relevant State of the Art to orient the reader
within the field of software tool qualification. In Section 3 we describe the domain
of tool integration and the possibilities to take a systems approach to allow for the
qualification of software tools as parts of tool chains. This is followed in Section
4 by a summary of a technical report in which we explored the relationship
between tool integration and safety through a detailed case study. The results
from this report, after being put into a system context in Section 4 through a
mapping into safety goals, allow us to propose a method for qualifying software
tools as parts of tool chains in Section 5. Conclusions are found in Section 6.
Several modern safety standards such as IEC 61508:2010 [3] (and domain adaptations of this standard such as IEC 61511:2003 [4], ISO 26262:2011 [5] and EN 50128:2001 [6]) and DO-178C [7] increasingly deal with the issue of software tool qualification (noticeable both when comparing between standards and when comparing between different versions of the same standard). This has led to a large effort by the scientific community on analyzing tools, categorizing tools, discussing how to certify tools, etc. ([8] gives examples related to DO-178B).
Much of the effort seems to be focused on finding a good combination of tools and then enabling them to communicate in a reliable way, while the safety implications of the tool integration are not explicitly discussed (see [9] for an example).
This is not surprising, since either the safety standards themselves or the
discussion in relation to them try to limit qualification efforts to avoid software
associated primarily with tool integration (see Subsection 3.1 for a detailed dis-
cussion on this subject). This means that there is a limited number of approaches
on how to benefit from the fact that tools are deployed in tool chains. In fact,
we could only identify one such approach, which suggests the use of reference
2 This text uses the word operators in the generic sense, i.e. to indicate anyone who at some point of time is involved in the day-to-day matters of a development effort (such as engineers, tool chain developers, managers, etc.).
tool chains 3 [1]. There has been little effort to analyze the implications on safety
due to tool integration, leaving methods and metrics for evaluating different
approaches to tool integration for qualification purposes largely unknown.
Tool qualification guidelines can take different forms, such as requiring tools
to be suitable and of a certain quality [6], or requiring the development of relevant
tools to fulfill the same objectives as the development of the products handled
by the standard itself [7], etc. The many different forms are not surprising, since
the standards state different objectives for qualifying tools. Nevertheless, from
such standards as ISO 26262:2011 and DO-178C one can deduce generic safety
goals for software tool qualification.
In practice one can discern two approaches to tool integration in relation to these safety goals.
– The approach that allows a stricter limitation of the qualification effort, exemplified by DO-178C. In DO-178C the objective of the qualification effort is only to ensure that tools provide confidence at least equivalent to that of the relevant process(es) they eliminate, reduce or automate [7].
– The approach that strives towards generic applicability, exemplified by ISO 26262:2011. One could interpret this standard to indicate that almost everything in the development environment has to be qualified [11] (including all tool integration mechanisms6).
focused on separate tools, since they are not necessarily directly associated with
any of the actions of a particular tool (if the output of a certain tool is later
transformed, that action does not necessarily have to be associated with the tool
itself during qualification). Furthermore, the control of these sources through
processes and methods is aggravated by the difficulty users usually have with
comprehending automation [12]. This leads to the problem that as development
environments scale up and become more integrated, the first approach becomes
less sufficient for ensuring safety.
In the second approach the qualification effort becomes tremendous in a
modern development environment [11]. If one keeps to this approach the likely
outcome is sweeping generalizations of the danger posed by most parts of the
development environments at hand. This leads to the problem that as develop-
ment environments scale up and become more integrated, the second approach
also becomes less sufficient for ensuring safety.
The reason for these mutually exclusive approaches sharing the same prac-
tical problem is that they stem from the miscomprehension that the increased
integration of software tools through automation is nothing more than the intro-
duction of more software. To overcome this miscomprehension we need a third
approach which analyzes the result of the increased integration of tools as a
whole (i.e. as highly integrated tool chains) and not simply as a collection of
separate parts. We need to take a systems approach to tool integration.
7 We recommend [2] for an overview of systems theory, [12] for how systems theory relates to safety and [13] as an introduction to hierarchy theory.
8 In systems theory an emergent property at one level of organization cannot be predicted even by a thorough understanding of the parts of lower levels. Such a property is meaningless at lower levels, i.e. it is irreducible. The opposite are composable properties, which can be shown to originate from one or several parts clearly discernible at lower levels. Checkland uses the genetic coding of DNA as an example of an emergent property [2], since it cannot be explained at the level of the bases (i.e. at the chemistry level any arrangement of nucleobases is allowed, yet not all are meaningful at the higher level of biology). A related composable property is the inability of humans to cope with too much ultraviolet light, since this can be directly attributed to the way two adjacent thymine or cytosine nucleobases may form pyrimidine dimers (which cause skin cancer).
– The top level of our hierarchy consists of the management of the development
effort. This management level imposes constraints on the next level, the
operator level, by for instance establishing a safety culture.
– The operator level consists of the separate operators. It imposes constraints
on the next level, the tool chain level, through for instance the processes
actually used during development.
– The tool chain level consists of the tool chains used during development. It
imposes constraints on the next level, the tool level, by for instance specifying
the order of the engineering tools.
– The tool level consists of the tools and any supporting software used during
development (such as tool integration mechanisms). At the tool level safety
is ensured by software qualification, constrained by the requirements on this
activity by higher levels (as mentioned in Subsection 3.1).
(See [15] for a detailed account of a tool chain that exhibits the safety issues discussed below, an account which could not be included here due to space limitations.) STPA defines
generic risks for inadequate control or enforcement of safety constraints that can
lead to hazards. We first translated these generic risks into risks associated with
tool integration by use of a reference model for tool integration that takes tool
integration related to platform, control, data, presentation and process into ac-
count (the details of this reference model are given in [16]). The risks associated
with these aspects of tool integration were then further translated into program-
matic risks (i.e. risks associated with the specific tool chain we were analyz-
ing). Based on the programmatic risks, STPA provides guidance on identifying their causes through general causal factors defined for the various parts of a generic control loop. We replaced this generic control loop with context-specific ones,
for instance one through which operators and tool integration mechanisms (the
controllers) can introduce hazards in the end product (the controlled process)
through development artifacts (the actuators), hazards which can be identified
through verification and validation activities (the sensors). Analysis of the sub-
sequently identified causes allowed us to define nine safety-related characteristics
of tool chains [15].
To put the characteristics in the context of certification we below map each
characteristic to a safety goal for which assurance could be required. Addition-
ally, we also prepare for the discussion on assurance in Section 5 by grouping
the safety goals into subcategories. First we divide all safety goals according
to whether they are fully composable or emergent [17]; secondly we divide the
fully composable safety goals according to whether they support manual tool
integration or automate it.
Examples of the risks identified in [15] are provided below, together with the safety goals that mitigate the causes of said risks. These risks are often critical research areas in themselves, but discussing them in detail is outside the scope of this paper.
The safety goals in this subsection are fully composable to subgoals at the tool
level and support manual tool integration (i.e. ensuring them will require an
effort to ensure relevant properties of distinct parts at the tool level and the
proper behavior of involved operators).
Traceability for Completeness and Consistency (TCC). Relevant parts of a
tool chain shall support the possibility to trace between artifacts to ensure that
those artifacts are consistent and complete in regard to each other. An associated
risk is the limitation of feedback on which parts of the development artifacts
correspond to parts of the end product where hazards have been introduced.
Well Defined Data Semantics (WDDS). Relevant parts of a tool chain shall
use unambiguous data semantics. An associated risk is inconsistencies in de-
velopment artifacts due to uncertainty on the part of operators regarding the
semantics of other engineering domains.
Qualifying Software Tools, a Systems Approach 347
The safety goals of this subsection are also fully composable to subgoals at the
tool level, but their implementation will not involve operators (i.e. they only
require an effort to ensure relevant properties of distinct parts at the tool level).
Automated Transformations of Data (ATD). Relevant parts of a tool chain
shall support the automated transfer of data between tools. An associated risk
is the incorrect reproduction of artifacts by tired or untrained operators.
Possibility to Automate Tool Usage (PATU). Relevant parts of a tool chain
shall support the automation of tool usage. An associated risk is the introduction
of errors in development artifacts due to untrained operators.
The safety goals in this subsection are emergent at the tool chain level. These
safety goals cannot be fully ensured through the properties of different parts at
the tool level, since they depend on the interaction of parts at that level.
Data Integrity (DI). A tool chain shall ensure that the data used reflects the
current state of the development. An associated risk is that obsolete or erroneous
versions of development artifacts lead to the incorrect reproduction of data. DI
can manifest itself locally (for instance when a file is corrupted by a file system
service or a database updates the wrong parameters), but also emerge from how
tools are used or interact with each other (for instance when a process engine
chooses the wrong version of a data artifact).
Data Mining (DM). A tool chain shall ensure that it is possible to (1) extract
all the data necessary to handle all safety goals correctly during development and
(2) present this data in a human-understandable form. An associated risk is that
operators are not aware that project deadlines are forcing the premature release
of artifacts and fail to take mitigating action. Which data needs to and can
be extracted emerge from the dynamic interaction between tools (for instance
through different sequences of service calls that determine what data can be
gathered and when).
(a) An engineering tool has not been qualified by the tool vendor.
(b) The relevant safety standard differs between tool vendors and tool users
and requires additional efforts in regard to engineering tools if the vendor
and user are different (an example of such a standard is DO-178C).
(c) The actual use cases, workflow or development environment are different
from those assumed during pre-qualification, which means that tools
and/or tool integration mechanisms will have to be (re)qualified by the
tool user.
These four steps allow qualification of a development environment to be dis-
tributed, but one can of course envision the tool user performing all these steps.
What is important is that these steps do not have the problems mentioned in
the introduction to this section. They both allow (1) a stronger focus on the
relevant parts of tool chains in regard to safety and (2) a clear separation of the
engineering tool qualification stipulated by safety standards (step 1, 4.a and 4.b
are consistent with safety standards such as ISO 26262:2011, IEC 61508:2010
and DO-178C) and extra efforts to ensure safety goals relevant to tool integra-
tion. They also have the additional benefits of allowing comparisons between
different setups for mitigating safety issues already after step 2 (giving an early
indication of the effort required) and favoring early planning in regard to the
development environment (helping to avoid fragmentation of the development
environment into several islands of automation [16]).
portion of the goal behavior that is partially composable [17]. We have focused on
the safety implications of modern, highly integrated development environments,
i.e. the way increased automation of tool integration hides what occurs to the
operators, rendering efforts (such as processes) at high levels of organization
less effective. Based on this new problem, we below define relevant, partially
composable safety goals of the emergent safety goals. Another focus could define
more parts of the development environment as important to qualify for DI and
DM. Partially composable safety goals therefore have to be defined during step
2 of a qualification effort, based on any considerations specific to the domain of
the development effort.
Data Integrity A tool chain shall ensure that development artifacts cannot
become corrupted or handled inappropriately without detection in automated,
unsupervised process activities. This safety goal can be partially decomposed
into the requirement for qualified tool integration mechanisms for verification
of version handling and avoidance of data corruption (for instance a qualified
version control system that can notify operators of such things as file corruption).
Data Mining A tool chain shall ensure that data regarding safety-related de-
velopment artifacts that is not handled by operators directly is automatically
gathered, analyzed and presented to operators. This safety goal can be partially
decomposed into the requirement for qualified tool integration mechanisms for
data mining and analysis (for instance a qualified project dashboard).
6 Conclusions
The implications on safety due to tool integration are largely ignored, even
though it leads to practical problems for the approaches stipulated by modern
safety standards in regard to software qualification. Based on previously defined
safety-related characteristics of tool chains we are able to identify nine safety
goals for ensuring safety in regard to tool integration. Based on these safety
goals, we suggest a systems approach to qualification for dealing with software
tools as reusable entities deployed in the context of different tool chains.
This approach allows for a stronger focus on the relevant parts of tool chains
in regard to safety, solving the problems that current safety standards have with
either stipulating a too wide or a too narrow qualification effort in regard to tool
integration. The approach also provides a clear separation of the engineering tool
qualification stipulated by current safety standards and extra efforts to ensure
safety goals relevant to tool integration, allowing for combining the approach
with said standards.
Important issues in need of validation and elaboration in future publications
include quantifying the effects of the method on cost efficiency by distributing
the software qualification effort, allowing a stronger focus on software which
actually has implications for end product safety and a stronger focus on early
planning.
References
1. Conrad et al., “Qualifying software tools according to ISO 26262,” in Dagstuhl-
Workshop MBEES: Modellbasierte Entwicklung eingebetteter Systeme VI, 2010,
pp. 117–128.
2. P. Checkland, Systems Thinking, Systems Practice. John Wiley & Sons Ltd.,
1985.
3. BS/IEC 61508:2010, Functional Safety of Electrical/Electronic/Programmable
Electronic Safety-Related Systems, International Electrotechnical Commission Std.
4. BS/IEC 61511:2003, Functional safety - Safety instrumented systems for the pro-
cess industry sector, International Electrotechnical Commission Std.
5. ISO 26262:2011, Road vehicles - Functional safety, International Organization for
Standardization Std., 2011.
6. BS/EN 50128:2001, Railway applications - Communications, signalling and pro-
cessing systems - Software for railway control and protection systems, CENELEC,
European Committee for Electrotechnical Standardization Std., 2001.
7. DO-178C, Software Considerations in Airborne Systems and Equipment Certifica-
tion, Special Committee 205 of RTCA, Inc. Std., 2011.
8. Kornecki et al., “Certification of software for real-time safety-critical systems: state
of the art,” Innovations in Systems and Software Engineering, vol. 5, pp. 149–161,
2009.
9. Gönczy et al., “Tool support for engineering certifiable software,” Electronic Notes
in Theoretical Computer Science, vol. 238, pp. 79–85, 2009.
10. Certification Specifications for Very Light Rotorcraft, CS-VLR, European Aviation
Safety Agency Std., 2008.
11. Hamann et al. (2011, April) ISO 26262 release just ahead - remaining problems
and proposals for solutions, SAE 2011 world congress & exhibition.
12. N. Leveson, Engineering a Safer World, Systems Thinking Applied to Safety
(Draft). MIT Press, 2011.
13. Ahl et al., Hierarchy Theory, A Vision, Vocabulary, and Epistemology. Columbia
University Press, 1996.
14. Asplund et al., “Tool integration, from tool to tool chain with ISO 26262,” in SAE
2012 World Congress & Exhibition, 2012.
15. F. Asplund, “Safety and tool integration, a system-theoretic process analysis,”
KTH Royal Institute of Technlogy, Tech. Rep., 2012.
16. Asplund et al., “Tool integration beyond Wasserman,” in Advanced Information
Systems Engineering Workshops: CAiSE 2011 International Workshops, London,
UK, June 20-24, 2011, Proceedings, 2011, pp. 270–281.
17. J. A. Black, “System safety as an emergent property in composite systems,” Ph.D.
dissertation, Carnegie Mellon University, Carnegie Institute of Technology, 2009.
Adapting a Software Product Line Engineering
Process for Certifying Safety Critical Embedded
Systems
1 Introduction
The development of safety critical systems often requires certification under pre-
established standards, and this poses several challenges to software engineers:
2 Related Work
Several works regarding avionics systems certification have been published [11,16,1].
Schoitsch et al. [15] present an integrated architecture for embedded systems de-
velopment that includes a complete test bench to ease validation and verification
(V&V) according to existing standards. It provides guidance for the integration
of external tools, for example to format input data for V&V tools using model
transformation. Abdul-Baki et al. [1] also use the avionics domain to illustrate
their approach to V&V, which comprises specification-based testing, analysis
tools and associated processes in the context of a system to avoid collisions.
The system is specified using a formal language, and the generated system is
compliant with DO-178B. Both works are, however, mainly concerned with single
systems, i.e., a software component that is part of an aircraft system.
SPLs require a different approach to certification, as the complete validation
of a SPL demands considerable effort. Indeed, validating all combinations of
features that could produce a particular SPL instance is often an impossible
(or at least a very difficult) task. Several approaches are beginning to emerge
to support SPL certification, for example the work by Hutchesson and
McDermid [8,9], which aims at the development of high-assurance SPLs based
on model transformations. Model-based transformations are applied to
instantiate variability models of a large avionics SPL, ensuring that the
transformation does not introduce errors. The approach can be useful during
certification, as it is able to generate some of the evidence required for
certification. However, integration and system testing are still needed.
We did not find related work focusing on associating certification
requirements with a SPL development process, which is the key idea of our
approach. The closest we could find was the work proposed by Habli and
Kelly [7], where a notation was proposed to capture safety case variations
and to trace these variations to an architectural model. This allows the
explicit definition of the possible behaviours of a derived product and of how
the safety case deals with the variation. Nothing was mentioned about the
process itself or the decisions made to achieve particular certification levels
or standards.
3 SPL Certification
The motivation for this work originates from a real-world problem regarding
a UAV SPL. A UAV is an aerial vehicle in which no human operator is
physically present; it flies autonomously or is controlled by a remote
pilot on the ground. UAVs are often thought of as part of an Unmanned Aircraft
System (UAS), where other elements such as the payload, ground control station
and communications links [6] are considered.
Tiriba [3] is a family of small, electric-powered UAVs designed
for civilian applications, especially agriculture, environmental monitoring and
security. Examples of applications include the detection of crop diseases, to-
pographical surveys, traffic monitoring, urban planning and transmission line
inspection. Tiriba was initially designed as a single system, but the demand for
several slightly different versions motivated the development of its product line
assets based on a SPL process named ProLiCES [4].
Avionics software is mainly regulated by DO-178B [14] and its updated
version, DO-178C (http://www.rtca.org). Both were developed by the Radio
Technical Commission for Aeronautics (RTCA) and have been adopted by many
government entities to support the certification process for airborne systems.
The ANAC (National Agency for Civil Aviation, http://www.anac.gov.br) in
Brazil is also considering their adoption for the development and use of UAVs,
although prior certification is not yet required. Therefore, this work is being
done in anticipation of the future release of these rules.
After the deployment of several members of the Tiriba family, we started a
new project, named Sarvant, involving a heavy and more complex UAV system.
The resulting software is estimated to be ten times larger than Tiriba's, as new
features have to be included to provide a more reliable aircraft with better
performance and safety. Both Sarvant and Tiriba products will need certification
in the near future.
From the development process point of view, we face two major problems to
achieve certification: 1) standards such as DO-178B do not require a specific
process to be followed, but define a set of activities with corresponding
evidence that they were performed properly. Thus, the organization should
adapt its processes
in accordance with the standards; and 2) when a SPL is developed to leverage
software reuse, standards such as DO-178B do not provide any recommenda-
tion to certification agencies about how to certify reusable assets. Only concrete
products are considered for certification.
At the same time, organizations do not want to waste resources on
activities/artefacts that are not required, or for parts of a system that do
not need a high certification level.
The metamodel represents the SPL feature model. A feature can be categorized
according to its presence in final products, giving rise to Mandatory, Optional,
and Alternative features. There are relationships among features, as for exam-
ple, a feature that requires the presence of another feature, or a feature that,
if included, excludes the presence of another feature. The model also allows the
mapping between features and certification levels, as stated in items I2 to I5 in
Section 3.2. Finally, the Product is modelled and linked to both: the Features
that it contains and to the Certification level that it received from the certifi-
cation agencies. As each product can be composed of several components, each
of which may require a different certification level, our metamodel can
represent components as sub-products.
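The metamodel itself is presented as a figure in the original; the following rough Python transcription of the concepts just described is ours (all class and attribute names are assumptions), covering feature categories, requires/excludes relationships, the feature-to-certification-level mapping, and products with components as sub-products:

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class FeatureKind(Enum):
    MANDATORY = "mandatory"
    OPTIONAL = "optional"
    ALTERNATIVE = "alternative"

class CertLevel(Enum):
    # DO-178B software levels, A (most critical) to E.
    A = "A"
    B = "B"
    C = "C"
    D = "D"
    E = "E"

@dataclass
class Feature:
    name: str
    kind: FeatureKind
    target_level: CertLevel                       # mapping stated in items I2 to I5
    requires: list = field(default_factory=list)  # features this one requires
    excludes: list = field(default_factory=list)  # features this one excludes

@dataclass
class Product:
    name: str
    features: list
    achieved_level: Optional[CertLevel] = None      # granted by a certification agency
    components: list = field(default_factory=list)  # sub-products, possibly with
                                                    # different certification levels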
ProLiCES was proposed by our research group. Like most other SPL processes,
ProLiCES has a two-path parallel life cycle, composed of domain and application
engineering. The software engineer can choose which path is followed first, depending on
the context. A concrete product can be created first (application engineering),
and then the SPL is developed extractively or reactively (domain engineering)
based on one or more products. Alternatively, the SPL can be developed in a
proactive approach, i.e., a domain analysis is conducted to design the SPL based
on possible products to be created later.
The following modifications were made to the ProLiCES domain engineering
phase to make it compliant with our proposal:
– DE1 - to address issue I1, during feature modelling, group features into sub-
groups (or components), such that each subgroup is highly cohesive and the
instantiation of its composing features derives a well-defined part of a prod-
uct, to which we can assign a desired certification level, e.g., the feature
abnormal flight termination controller encompassing the feature parachute
should have level A;
– DE2 - in accordance with issue I6, establish an initial set of mandatory
activities that will be required in the application engineering, independently
of the certification level to be obtained, e.g. economic feasibility analysis,
planning, etc. This set can be further changed according to the feedback
obtained during application engineering;
– DE3 - in accordance with issue I6, look at the components defined in step
DE1 and establish an initial set of activities and artefacts that are required
to obtain the certification level for each component, according to the certifi-
cation standard being used. DO-178B shows several tables with indications
of processes and corresponding objectives to be met that can be mapped
to the SPL development process. This mapping can also be further changed
according to the feedback obtained during application engineering (activities
and/or artefacts not considered important for certain certification levels
can be considered important for others). Figure 3 illustrates this mapping;
– DE4 - to address issues I2 to I5, consider giving weights to activities/artefacts
that could help to guarantee quality attributes. As some artefacts can be
completely or partially implemented depending on the certification level,
the check marks in Figure 3 could be replaced by weights according to the
importance of a particular activity/artefact to a specific certification level
(see the sketch after this list). For example, a requirements document should
be considered totally implemented when rigorous templates are followed, or
partially implemented when less formal text documents are presented;
– DE5 - considering that the feature model contains different types of feature
(e.g. the classification mentioned in Section 3.2), analyse the relationships
(dependencies) between features and the targeted certification level for each
component, based on issues I2 to I5. A mapping table is produced, but is
not shown due to space restrictions.
– AE1 - the mapping table created during steps DE2 to DE4 should be used
during the application engineering process. Mandatory activities that are
always executed, independently of the certification level, should be done in
the conventional way. For example, most requirements engineering activities
are always required, even though with less rigour in some cases.
– AE2 - determine the certification level for each product component (products
are instantiated from the feature model, so a pre-definition of components
can be done in domain engineering too).
– AE3 - during feature selection, observe mapping tables created in step DE5,
which can help decision making. Features required to achieve the correspond-
ing certification level should be included. Features that work for or against
certain certification levels should be analysed to identify advantages and
drawbacks in the particular context.
– AE4 - throughout the development, only execute mandatory activities that
are required for the component with the maximum certification level among
all the product components. Optional activities are executed only when the
respective component's certification level requires them. For example, if
system requirements verification results are only needed for certification
levels A to C, then the requirements analysis could be isolated for each
component, so that this verification is done only for components with levels
A to C, while other components skip this activity.
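To make steps DE3, DE4 and AE4 concrete, here is a minimal, hypothetical rendering of the mapping table: each activity carries a weight per certification level (1.0 fully required, 0.5 partial, 0 not required), and application engineering derives the activities demanded by a product's components. The table contents are invented for illustration; the real mapping would come from the objective tables of DO-178B.

# Hypothetical weights of activities per certification level (step DE4).
MAPPING = {
    "requirements_document":     {"A": 1.0, "B": 1.0, "C": 1.0, "D": 0.5, "E": 0.5},
    "requirements_verification": {"A": 1.0, "B": 1.0, "C": 1.0, "D": 0.0, "E": 0.0},
    "mcdc_coverage_analysis":    {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0, "E": 0.0},
}

def required_activities(component_levels):
    # Step AE4: an activity is driven by the most demanding component
    # that needs it; activities with weight 0 everywhere are skipped.
    result = {}
    for activity, weights in MAPPING.items():
        weight = max(weights[level] for level in component_levels.values())
        if weight > 0:
            result[activity] = weight
    return result

# A product with a level-A flight termination component and a level-D payload:
print(required_activities({"flight_termination": "A", "payload_manager": "D"}))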
5 Conclusion
References

Failure Modes, Functional Resonance and Socio-Technical Hazards:
Limitations of FMEA in Healthcare Settings

1 Introduction
Since the publication of the Institute of Medicine report “To err is human” in
1999 [1], the safety of patients has received unprecedented attention. Researchers
and healthcare organisations have turned to high-risk industries [27] such as com-
mercial aviation for inspiration about appropriate theories and methods through
which patient safety could be improved. For example, learning from past expe-
rience through incident reporting systems and Root Cause Analysis are now
standard practices throughout the National Health Service (NHS) in the UK,
triggered by the influential Department of Health Report, “An organisation with
a memory” [2]. The report led to the foundation of the National Patient Safety
Agency (NPSA) and the development of the National Reporting and Learning
System (NRLS), a national system to collect patient safety incidents and to share
relevant learning throughout the NHS. In addition to such reactive approaches,
healthcare policy makers have recognised the need for proactive assessments
of threats to patient safety. In particular, the use of Failure Mode and Effects
Analysis (FMEA) is now recommended widely in healthcare as an appropriate
tool for proactive safety analysis. For example, the Joint Commission in the US
(the organisation that accredits hospitals) requires from participating or-
ganisations evidence that they carry out at least one proactive assessment of a
high-risk process every year [3], FMEA being the recommended approach. The
US Department of Veterans Affairs (VA) has developed an FMEA version tai-
lored to healthcare, Health Care Failure Mode and Effects Analysis (HFMEA)
[4]. During the past few years FMEA has been used in healthcare to assess the
risks associated with, for example, organ procurement and transplantation [5],
intravenous drug infusions [6], and communication in emergency care [7].
As healthcare organisations gain experience in using FMEA, documented
evidence is becoming available of some of the problems that
practitioners experience with the application of the method. Habraken and col-
leagues carried out a large evaluation of HFMEA in the Netherlands [8]. While
they concluded that the method might be useful in Dutch healthcare, they re-
marked that practitioners commonly felt that the method was very time con-
suming, the identification of failure modes was poorly supported and the risk
assessment part was very difficult to carry out. FMEA was also used as part of the
Health Foundation’s Safer Patient Initiative in the UK, and a study evaluating
the perceptions of participating healthcare professionals found that participants
felt that while the structured nature of the process was beneficial, there were
negative aspects that may prevent the useful adoption of the method in the NHS,
including the time required to perform the analysis and the subjective nature of
the risk evaluation [9].
This paper addresses some of the difficulties related to the use of FMEA in
healthcare settings by investigating the application of an alternative, comple-
mentary methodology in order to conduct a proactive safety analysis. It argues
that some issues of adopting FMEA can be eased by combining diverse method-
ologies in order to assess vulnerabilities in complex socio-technical settings. This
paper is organised as follows. Section 2 summarises some of the research findings
around communication and handover failures in emergency care. Section 3 de-
scribes the application of FMEA for the identification of vulnerabilities related
to communication and handover within a specific emergency care pathway. It
then discusses the suitability of FMEA to assess risks in healthcare settings, and
investigates its possible combination with an alternative approach, the Func-
tional Resonance Analysis Method (FRAM). It also argues that
taking into account socio-technical hazards could be useful in order to overcome
limitations of analytical approaches that tend to narrow the scope of analysis.
Section 4 draws some concluding remarks.
For the purpose of our case study, the emergency care pathway consists of the
Ambulance Service bringing a patient to hospital (typically two paramedics in an
ambulance), the Emergency Department (ED), and hospital departments that
receive patients from the ED – in the UK often a Clinical Decision Unit (CDU)
or Medical Assessment Unit (MAU). As part of the FMEA process, staff work-
ing within the pathway were invited to participate in a process mapping session
in order to describe the pathway for the subsequent risk analysis. Participants
included doctors, paramedics, ED and MAU nurses. Figure 1 shows the result-
ing process description for highly critical patients (resuscitation patients). Such
simple, sequential process maps are commonly used in healthcare. The figure
shows steps in the process and information that is produced or communicated
(shown with background colour). The process in terms of communication and
handover consists essentially of a pre-alert by the Ambulance Service that a pa-
tient is about to be brought in (for highly critical patients), preparatory activities
within the ED, a handover between paramedic and the ED team, completion of
documentation and the negotiation of the onward transfer of the patient out
of the ED. A similar process description was produced for patients that have
severe but less critical injuries ("majors" cases), the main differences being that
there is no pre-alert and that the paramedics hand over to the triage nurse or
nurse coordinator rather than to the resuscitation team.
Following the process mapping activity described above, two further meetings
were organised to identify failure modes and to perform the risk analysis.
Table 2 explains the categories for assessing the likelihood of occurrence and
the severity of the consequences that were used.
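Table 2 itself is not reproduced here. As an illustration only, FMEA-style risk assessment typically combines ordinal likelihood and severity ratings into a single score; the five-point scales below are an assumption, not the study's actual categories:

LIKELIHOOD = {"rare": 1, "unlikely": 2, "possible": 3, "likely": 4, "frequent": 5}
SEVERITY = {"negligible": 1, "minor": 2, "moderate": 3, "major": 4, "catastrophic": 5}

def risk_score(likelihood, severity):
    # Ordinal risk score: the product of the two ratings, used to rank
    # failure modes for attention.
    return LIKELIHOOD[likelihood] * SEVERITY[severity]

# A failure judged fairly regular and potentially fatal ranks near the top:
score = risk_score("likely", "catastrophic")   # 20 out of a maximum of 25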
A major risk identified relates to the failure of the pre-alert when the ambu-
lance crew is unable to establish a communication link with the ED, for example
because they are in an area where there is no mobile phone reception or – in
very rare cases – due to unreliability of the ED communication equipment. Par-
ticipants felt that this happened fairly regularly and that patients may die if
upon arrival critical team members such as airway management specialists were
unavailable. Another major risk relates to the failure of the handover between
the paramedic and the resuscitation team, when the team are starting to treat
the patient before the paramedic has had the chance to complete the handover.
This is a frequent occurrence, since ED staff are keen to start treatment of crit-
ically ill patients as quickly as possible. However, in some cases this may lead
to a situation where medication is given that has already been given by the
paramedic on scene or in the ambulance. Factors that contribute to this failure
include the perceived need to act quickly, a sense of hierarchy that may prevent
the paramedic from challenging the senior ED doctor, and high levels of stress.
The aim of approaches such as FMEA is the identification of single failures that
carry high risk. This is reasonable and the method has been applied successfully
in industrial settings for decades. FMEA requires assessment of the worst credible
consequences of any particular failure. This is difficult in all but very simple
systems, but it is even more complicated in healthcare, typically a complex socio-
technical system with a lot of uncertainty arising from contextual factors and
the patient condition. There is a risk of overlooking the limitations of FMEA
by over-relying on it, while excluding other possible complementary approaches.
When asked about assessing the severity of the consequences of a particular
failure mode as part of an FMEA exercise, participants will usually reply that
this depends on the condition of the patient and other contextual factors. If
the condition of the patient is sufficiently critical, even minor failures may lead
to death. The problem with FMEA in such settings is that it assumes fairly
immediate cause and effect links and does not by itself encourage consideration
and differentiation of contextual factors. In the FMEA example above, clinicians
often contextualised the consequences of a particular failure mode by adding
statements such as “if we have a trauma patient”, or “when a patient comes in
and their airway is difficult to manage”. But even with this additional patient-
related information, it was difficult to establish the worst credible effect, since
single failures rarely kill patients, but usually have the potential to do so in
conjunction with other circumstances.
FMEA works well for technical systems and there is also scope for its ap-
plication in healthcare. However, the particular way of looking at a system and
of representing risk that is inherent in the method needs to be properly under-
stood by people applying it in healthcare. The method can be applied usefully
when these characteristics are taken into account, and when the method is com-
plemented by other approaches. This highlights some of the problems of using
FMEA in healthcare. The complexity and richness of the domain expose the lim-
itations of FMEA. Combining FMEA with complementary methodologies that
extend technical approaches could address such limitations. The next section
uses FRAM to identify vulnerabilities that may result from the propagation of
variation rather than from single failures.
An alternative approach has been described by Hollnagel [28] based on the con-
cept of functional resonance. Functional resonance is defined as the detectable
signal that emerges from the unintended interaction of the everyday variability
of multiple signals. The variability is mainly due to the approximate adjustments
of people, individually and collectively, and of organisations that are the basis
of everyday functioning. Each system function has a normal, weak variability.
The other functions constitute the environment for this particular function, and
their variability can be represented as random noise. However, on occasion the
pooled variability of the environment may lead to a situation of resonance, i.e. to
a detectable signal that emerges from the unintended interaction of the normal
variability of many signals. The Functional Resonance Analysis Method (FRAM)
proposes to model the functions of a system with six aspects, namely input, out-
put, time, resources, control and preconditions (I: Input, O: Output, T: Time,
R: Resources, C: Control, P: Precondition). The application of FRAM then tries
to establish the variability of the output of functions and the propagation of this
variability. System-level failures may emerge not from individual failures,
but from the propagation and resonance of variability. We have modelled for
simplicity only five steps of the above emergency care pathway as functions: (1)
Provide pre-alert to emergency department, (2) Prepare emergency department,
(3) Bring patient to the emergency department, (4) Hand over relevant infor-
mation to emergency department team, (5) Treat patient. FRAM prompts the
analyst to consider, for each function, how variability may arise in its six
aspects and how it may propagate to downstream functions.
In this case, if the output of function 1 (pre-alert) is late or does not take
place, this may lead to an increase in the variability of the output of function 2.
Likewise, if team members arrive late or are unavailable (resource), then vari-
ability may increase. If on the other hand, team members arrive on time and
the function is completed before the patient arrives, then variability may be
dampened. In this way, a more complex model allows the analyst to consider
the propagation and the possible dampening or reinforcing effect of variability
without the need to relate the observed effect causally to failures of any kind.
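As a rough sketch of this idea (our illustration, not part of FRAM or of the study's tooling), each function can be represented with couplings on its six aspects, and output variability estimated with a crude propagation rule in which a function's output is at least as variable as its most variable upstream supplier:

from dataclasses import dataclass, field

@dataclass
class FramFunction:
    # A FRAM function; 'upstream' maps an aspect (input, time, resources,
    # control, precondition) to the function supplying it. Variability is
    # a coarse ordinal value assumed for illustration:
    # 0 = dampened, 1 = normal, 2 = increased.
    name: str
    upstream: dict = field(default_factory=dict)
    own_variability: int = 1

    def output_variability(self):
        incoming = [f.output_variability() for f in self.upstream.values()]
        return max([self.own_variability] + incoming)

prealert = FramFunction("Provide pre-alert", own_variability=2)  # late or missing
prepare = FramFunction("Prepare ED", upstream={"input": prealert})
handover = FramFunction("Hand over information", upstream={"precondition": prepare})
print(handover.output_variability())  # 2: variability propagates downstream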
This type of reasoning could provide some further insights into the severity classification
derived by the application of FMEA. For example, the application of FMEA to
the pre-alert provided estimates that not receiving a pre-alert could lead to the
death of the patient. However, using FRAM, practitioners were able to structure
their reasoning about what happens when the pre-alert is not perfect and provide
insights into how the dynamics of the system may be affected. This is, of course,
different and complementary to the assessment of the worst credible outcome.
4 Conclusions
The application of FMEA in healthcare is useful in order to understand some of
the potential vulnerabilities of healthcare processes, but in practice it is difficult
to determine the consequences of failures as these depend on the context and the
patient’s condition. The combination of FMEA with other methods could be a
promising way of analysing risk in socio-technical systems. In this paper we have
described the additional application of FRAM to analyse a healthcare process.
FRAM focuses on variability and possible situations of resonance rather than
on failures and cause-effect links. FRAM provided insights into how the system
dynamics are affected by small variations in system functions. While practitioners
felt that FRAM added useful new insights, further work is required to deter-
mine how the findings generated by diverse methods should be integrated in a
systematic way for proactive risk analysis.
Acknowledgements. This project was funded by the National Institute for Health
Research Health Services and Delivery Research (NIHR HS&DR) programme
(project number 10/1007/26). Visit the HS&DR website for more information.
The views and opinions expressed therein are those of the authors and do not
necessarily reflect those of the HS&DR programme, NIHR, NHS or the Depart-
ment of Health.
References
1. Kohn, L.T., et al. (Eds.): To Err is Human: Building A Safer Health System. Insti-
tute of Medicine (1999)
2. Department of Health: An organisation with a memory (2000)
3. JCAHO: Comprehensive Accreditation Manual for Hospitals: The Official Hand-
book (CAMH) (2002)
4. DeRossier, J., et al.: Using Health Care Failure Mode And Effects Analysis. The
Joint Commission Journal on Quality Improvement 27(5):248–267 (2002)
5. Steinberger, D.M., et al.: Use of failure mode and effects analysis for proactive iden-
tification of communication and handoff failures in organ procurement and trans-
plantation. Progress in Transplantation 19(3):208–215 (2009)
6. Apkon, M., et al.: Design of a safer approach to intravenous drug infusions: failure
mode and effects analysis. Qual. Saf. Health Care 13(4):265–271 (2004)
7. Redfern, E., et al.: Identifying vulnerabilities in communication in the emergency
care department. Emerg. Med. J. 26:653–657 (2009)
8. Habraken, M.M., et al.: Prospective risk analysis of health care processes: a system-
atic evaluation of the use of HFMEA in Dutch health care. Ergonomics 52(7):809–
819 (2009)
9. Shebl, N., et al.: Failure Mode and Effects Analysis: views of hospital staff in the
UK. J. Health Serv. Res. Policy 17(1):37–43 (2012)
10. Institute of Medicine: Crossing the Quality Chasm: A New Health System for the
21st Century. National Academy Press, Washington DC (2001)
11. British Medical Association: Safe Handover: Safe Patients. BMA, London (2004)
12. Joint Commission: Strategies to improve hand-off communication: implementing a
process to resolve strategies. Jt Comm. Perspect Patient Safety 5(7):11 (2005)
13. Apker, J., Mallak, M.A., Gibbson, S.C.: Communicating in the "gray zone": per-
ceptions about emergency physician hospitalist handoffs and patient safety. Acad.
Emerg. Med. 14(10):884–894 (2007)
14. Jeffcot, S.A., Ibrahim, J.E., Cameron, P.A.: Resilience in healthcare and clinical
handover. Qual. Saf. Health Care 18(4):256–260 (2009)
15. Raduma-Tomàs, M.A., et al.: Doctors’ handovers in hospitals: a literature review.
BMJ Qual. Saf. 20:128–133 (2011)
16. Wong, M.C., Yee, K.C., Turner, P.: Clinical Handover Literature Review. eHealth
Services Research Group, University of Tasmania, Australia (2008)
17. Bost, N., et al.: Clinical handover of patients arriving by ambulance to the emer-
gency department — A literature review. Int. Emerg. Nursing 18(4):210–220 (2010)
18. Cohen, M.D., Hilligoss, P.B.: The published literature on handoffs in hospitals:
deficiencies identified in an extensive review. Qual. Saf. Health Care 19(6):493–497
(2010)
19. Patterson, W.S., Wears, R.L.: Patient handoffs: standardised and reliable measure-
ment tools remain elusive. Joint Commission Journal on Quality and Patient Safety
36(2):52–61 (2010)
20. Cook, R.I., Render, M., Woods, D.D.: Gaps in the continuity of care and progress
on patient safety. BMJ 320(7237):791–794 (2010)
21. Horwitz, L.I., et al.: Transfers of patient care between house staff on internal
medicine wards. Arch. Intern. Med. 166(11):1173–1177 (2006)
22. Solet, D.J., et al.: Lost in translation: challenges and opportunities in physician-
to-physician communication during patient handoffs. Acad. Med. 80(12):1094–1099
(2005)
23. Ye, K., et al.: Handover in the emergency department: deficiencies and adverse
effects. Emerg. Med. Australas. 19(5):433–441 (2007)
24. Anderson, S., Felici, M.: Classes of socio-technical hazards: Microscopic and macro-
scopic scales of risk analysis. Risk Management, Palgrave Macmillan, 11(3-4):208–
240 (2009)
25. Anderson, S., Felici, M.: Emerging Technological Risk: Underpinning the Risk of
Technology Innovation, Springer (2012)
26. Anderson, S., et al.: From Hazards to Resilience in Socio-Technical Healthcare
Systems. In Hollnagel, E., Rigaud, E., Besnard, D. (Eds.), Proceedings of the
fourth Resilience Engineering Symposium, pp. 15–21 (2011)
27. Perrow, C.: Normal Accidents: Living with High-Risk Technologies. Princeton, NJ:
Princeton University Press (1999)
28. Hollnagel, E.: The Functional Resonance Analysis Method. Ashgate (2012)
A STAMP Analysis on the China-Yongwen Railway
Accident
School of Reliability and System Engineering, Beihang University, Beijing 100191, PR China.
1 Introduction
On July 23, 2011, a severe railway accident occurred in the suburbs of Wenzhou,
Zhejiang Province, China. Train D301 from Beijing to Fuzhou rear-ended train
D3115 from Hangzhou to Fuzhou on a viaduct there at 20:30 China Standard Time
(CST). Cars of both trains derailed, and four cars fell off the viaduct. The
accident caused 40 fatalities and 172 injuries [1].
Accident models explain why accidents occur and lie at the foundation of accident
investigation and prevention. Traditionally, accidents have been viewed as resulting
from a linear chain of events. In this case, however, many more factors were involved,
including environmental factors, component failures, design flaws, human errors, and
organizational factors; the interactions between system components were complex, and
the relationships between events were non-linear. Traditional accident models are
limited in their ability to handle these important factors.
A systems-theoretic approach to understanding accident causation allows more
complex relationships between events (e.g., feedback and indirect relationships) to be
considered and also provides a way to look more deeply at why the events occurred
[2]. A new accident model called Systems-theoretic Accident Modeling and Processes
(STAMP) has been developed by Leveson. In STAMP, accidents are conceived as
resulting from inadequate enforcement or violation of safety-related constraints on the
design, development, and operation of the system. STAMP uses hierarchical struc-
tures to model socio-technical systems considering technical, human and organiza-
tional factors. In this paper, we will use the STAMP accident model to analyze the
China-Yongwen railway accident and propose some improvement measures to pre-
vent similar accidents in the future.
In STAMP, the most basic concept is not an event but a constraint. Safety is viewed
as a control problem. Accidents are conceived as resulting from inadequate control
and enforcement of safety constraints. Safety constraints are enforced by the system
hierarchical control structure. Accidents occur when the hierarchical control structure
cannot adequately maintain the constraints [2].
Each hierarchical level of the control structure represents a control process and
control loop with actions and feedback. Fig. 1 shows a basic control process in the
railway safety control structure. The control processes that enforce the safety con-
straints must limit system behavior to the safe states implied by the safety constraints.
According to control theory, effective control of a system requires four conditions:
(1) the controller must have a goal or goals, e.g., to maintain the safety constraints.
(2) The controller must be able to affect the state of the system in order to keep the
process operating within predefined limits or safety constraints despite internal or
external disturbances. (3) The controller must contain a model of the system. The
process model is used by the human or automation controller to determine what
control actions are needed, and it is updated through various forms of feedback.
(4) The controller must be able to ascertain the state of the system from information
about the process state provided by feedback [2].
Corresponding to the four conditions, ways for constraints to be violated in the
control process can be classified as the following control flaws: (1) Unidentified ha-
zards. Hazards and the safety constraints to prevent them are not identified and pro-
vided to the controllers. (2) Inadequate enforcement of constraints. The control ac-
tions do not adequately enforce the constraints because of inadequate control algo-
rithms, inadequate execution of control actions or inadequate coordination among
multiple controllers. (3) Inconsistent process models. The process models used by the
automation or human controllers (mental models refer to humans) become inconsis-
tent with the process and with each other. (4) Inadequate or missing feedback. The
controller is unable to ascertain the state of the system and update the process models
because feedback is missing or inadequate [3].
Fig. 1. A basic control process in the railway safety control structure: the train
controllers (holding a model of the controlled process) issue commands through
actuators to the physical train operation process, and receive measurements through
sensors and displays; the process has inputs and outputs and is subject to external
disturbances.
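A minimal sketch of such a control loop (our illustration of the STAMP view, not an artifact of the investigation): the controller's process model is updated only through feedback, so when feedback is missing the model diverges from the actual process and the issued commands stop enforcing the safety constraint.

class Controller:
    def __init__(self):
        # Believed state of the controlled process; correct only as long
        # as feedback keeps arriving.
        self.process_model = {"section_occupied": False}

    def update(self, feedback):
        # Control flaw (4): with missing feedback (None) the model is
        # never corrected and drifts away from the actual process state.
        if feedback is not None:
            self.process_model.update(feedback)

    def command(self):
        # The control algorithm enforces the constraint against the
        # believed state, not the actual one.
        return "stop" if self.process_model["section_occupied"] else "proceed"

# Feedback works: the model matches the process and the command is safe.
ok = Controller()
ok.update({"section_occupied": True})
assert ok.command() == "stop"

# Feedback lost (e.g. a failed data acquisition loop): the stale model
# leads to an unsafe command although the section is in fact occupied.
broken = Controller()
broken.update(None)
assert broken.command() == "proceed"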
So the process that leads to accidents can be understood in terms of flaws in con-
trol structures to enforce constraints during design, implementation, manufacturing,
and operation. Therefore, to analyze an accident, the system hierarchical control
structure must be examined to determine why the control process for each component
at each hierarchical level was inadequate to maintain the safety constraints [2]. The
procedure of STAMP-based accident analysis can be described as follows: (1) Identi-
fy the system hazards and related safety constraints involved in the accident. (2) Con-
struct the system hierarchical safety control structure and identify safety constraints
for each controller. (3) Identify the inadequate actions that violated the constraints and
the control flaws in the control structure. In the following sections, we will use the
STAMP approach to analyze the China-Yongwen railway accident and propose some
improvement measures.
signals, switch to the visual driving mode and continue driving. The driver confirmed
this with the train dispatcher.
At 20:12, D301 stopped at Yongjia Station (the station before Wenzhou South Sta-
tion) waiting for the signals (it was 36 minutes behind schedule). At 20:14, D3115 left
Yongjia Station.
At 20:17, the train dispatcher instructed the D3115 driver to switch to the visual
driving mode and drive at a speed of less than 20 km/h when the passing signal was red.
At 20:21, because of the track circuit failure, the Automatic Train Protection (ATP)
system on D3115 activated the automatic braking function. D3115 stopped in the
faulted 5829AG track section. From 20:21 to 20:28, the D3115 driver failed 3
times to drive in visual mode due to abnormal track circuit codes.
From 20:22 to 20:27, the D3115 driver called the train dispatcher 6 times and
the watchman in Wenzhou South Station called the D3115 driver 3 times, but all
calls failed due to communication problems.
At 20:24, D301 left Yongjia station heading for Wenzhou South Station.
At 20:26, the train dispatcher asked the watchman in Wenzhou South Station about
D3115's information, the watchman replied: "D3115 is close to the faulted track
section but the driver is out of reach; I will continue to contact him."
At 20:27, the watchman reached the driver of D3115, and the driver reported:
"The train is 3 block sections from Wenzhou South Station, but I failed to switch
to visual driving mode due to abnormal track signals. I cannot reach the train
dispatcher because the communication system has no signal, and I will try again."
From 20:28 to 20:29, the driver of D3115 called the dispatcher twice but both
failed. At 20:29:26, D3115 succeeded in starting the train by switching to the visual
driving mode.
At 20:29:32, the engineer in Wenzhou South Station called the D301 driver and
said: "D301, you must be careful, train D3115 is in the same block section" (the
call was cut off unfinished). At the same time, D301 entered the faulted track
section. The driver of D301 saw the slowly moving D3115 and applied the
emergency brake.
At 20:30:05, D301, travelling at a speed of 99 km/h, collided with D3115,
travelling at a speed of 16 km/h.
Fig. 2. The hierarchical safety control structure of the Yongwen railway line: the
Ministry of Railways (legislation, policies, regulations, standards, training,
supervision) controls the Shanghai railway bureau, which in turn controls the train
dispatcher, the watchman in Wenzhou South Station and the workers of the signaling
maintenance branch; the dispatcher and the station interact with the train control
center equipment and, via commands and vehicle integrated control, with the drivers;
the onboard ATP and the drivers control the trains, with reports, signals, displays
and measurements flowing back as feedback.
To understand the role each component played in the accident, the contribution
to the accident of each component is described in terms of the safety constraints, the
inadequate control actions and the control flaws. For human controllers, the mental
model flaws and context in which decisions were made are considered.
Analysis on the physical process. The safety constraint at this level is that the Chi-
nese Train Control System (CTCS) must keep the trains free from collisions. The
CTCS installed on the Yongwen railway line is CTCS-2. (CTCS has four levels, and
CTCS-2 is installed on Chinese 200 km/h to 250 km/h high speed lines). It has two
subsystems: ground subsystem and onboard subsystem. The ground subsystem in-
cludes track circuit, Global System for Mobile communications-Railways (GSM-R)
and station train control center. The station control center enforces track circuit en-
coding, passing signals control in block sections and confirmation of movement au-
thorities. The GSM-R is a wireless communication network used by the drivers, train
dispatchers and station staff to communicate with each other. The track circuits en-
force railway occupation and track integrity monitoring, and continually transmit
track information to the vehicle as a movement authority. The onboard subsystem is
the ATP system. The ATP system controls the operation of the train according to
the signals provided by the ground subsystem. When the ATP receives no signal or
abnormal signals, it adopts automatic braking to stop the train. If the train needs
to move on, the driver has to wait for 2 minutes before turning the ATP into visual
driving mode to travel at less than 20 km/h. In visual driving mode, if normal
signals are received, the ATP automatically converts back into full monitoring mode.
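The mode logic just described can be paraphrased as a small state machine. This is a simplification for illustration (mode names and the two-minute wait follow the text; the real CTCS-2 ATP is considerably more involved):

from enum import Enum

class Mode(Enum):
    FULL_MONITORING = 1
    BRAKED = 2   # stopped by automatic braking
    VISUAL = 3   # visual driving mode, at most 20 km/h

def next_mode(mode, signal_ok, waited_2min, driver_requests_visual):
    # Transition rules as described in the text (simplified).
    if mode is Mode.FULL_MONITORING and not signal_ok:
        return Mode.BRAKED            # no or abnormal signal: brake to stop
    if mode is Mode.BRAKED and waited_2min and driver_requests_visual:
        return Mode.VISUAL            # may proceed at no more than 20 km/h
    if mode is Mode.VISUAL and signal_ok:
        return Mode.FULL_MONITORING   # normal signals: back to full monitoring
    return mode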
In the accident, the train control center didn't get the information that the 5829
block section was occupied by D3115 because the data acquisition loop lost its power
after the lightning strike. The passing signal in the 5829 section was green and no
occupation code was sent to the ATP on D301 while the section was occupied by D3115. The
data sent to the computer of the train control center after the fuse blew had been
collected before the failure, and the computer kept controlling the passing signal
and track circuit coding based on the outdated data. The control flaws in the
control process are
as follows: (1) the information of track occupation is vital to the whole train safety
control structure, but wrong feedback information was provided to the computer due
to the power loss of the data acquisition loop (inadequate or missing feedback). (2)
The design of the equipment must enforce the safety constraints in the face of an
environmental disturbance or a component failure. In the accident, the data
acquisition loop in the train control center had only a single power supply and lost
its power after the lightning strike. The driving unit of the data acquisition loop
kept sending the computer
data collected before the failure. The computer kept accepting the outdated data and
controlling the passing signals and track circuits coding based on the outdated data
(inadequate control algorithm).
In the accident, the ATP on D3115 automatically stopped the train in the faulted
5829AG track section and the ATP on D301 didn’t take any actions to prevent the
train from entering the block section occupied by D3115. The control flaws in
the control process are: (1) the code sent to the ATP on D3115 by the 5829AG track
circuit was abnormal because of the communication failure between the track circuit
and the control center (inadequate feedback). (2) The process model of the ATP con-
troller in D301 was inconsistent with the actual process. The ATP on D301 received
green signals, which mean there was no train in the forward 3 sections. But in fact, the
train D3115 was moving slowly in the front section (inconsistent process models).
Fig. 3 summarizes the roles of the train control center and the ATP systems in the
accident.
Fig. 3. The roles of the train control center and the ATP systems in the accident:

Train control center:
– Safety constraints: (1) acquire correct and fresh data about train position, speed
and track occupation; (2) control the passing signals in block sections and send
movement authorities to the ATP correctly based on the acquired data.
– Inadequate control actions: (1) didn't get the information that the 5829 block
section was occupied by D3115; (2) the passing signal in the 5829 section was wrong;
(3) sent movement authorities to the ATP on D301 while the section was occupied by
D3115.
– Control flaws: (1) false track occupation information was provided to the computer
due to a component failure (inadequate or missing feedback); (2) the design of the
equipment could not ensure the correctness and freshness of data in the face of a
lightning strike and component failure (inadequate control algorithm).

ATP on board:
– Safety constraints: (1) correctly control the operation of the train according to
the signals provided by the ground system; (2) prevent the train from entering a
block section occupied by another train.
– Inadequate control actions: (1) the ATP on D3115 stopped the train near the faulted
5829AG track section; (2) the ATP on D301 did not take any actions to prevent the
train from entering the block section occupied by D3115.
– Control flaws: (1) the code sent to the ATP on D3115 by the 5829AG track circuit
was abnormal (inadequate feedback); (2) the process model of the ATP on D301 was
inconsistent with the actual process (inconsistent process models).

(The control center provides passing signals and movement authorities to the onboard
ATP; the track circuits provide track integrity and occupation feedback.)
Analysis on the operations. The D3115 driver was informed that there was a "red
band" failure near Wenzhou South Station, and he was told to switch to the visual
driving mode and continue driving if the ATP system automatically stopped the train
as a result of missing signals. But when this happened, the D3115 driver failed 3
times to drive in visual mode due to abnormal track circuit codes. He also failed to
report to the dispatcher due to communication failure. The driver of D301 didn’t
know that D3115 was trapped in the 5829 section and didn’t take any actions to pre-
vent the train from entering the section. The control flaws in the control
processes are: (1) the train has two controllers, the ATP and the driver. The ATP
stopped the D3115 automatically due to abnormal signals, but for the same reason, the
driver failed to drive the train in visual mode (inadequate coordination among control-
lers). (2) The D3115 driver tried to report to the dispatcher but failed due to commu-
nication flaw (inadequate execution of control actions). (3) The display on the D301
ATP system indicated there was no train in the forward 3 sections. And the driver
wasn’t informed by the dispatcher or Wenzhou South Station about the situation of
D3115. So his mental model thought there was no train ahead and controlled the train
according to this process model. Because of an inadequate feedback in the control
loop, the mental model of the driver became inconsistent with the actual process. The
usual performance of the driver was no longer safe (inconsistent process models).
The dispatcher turned Wenzhou South Station into the emergency control mode
after receiving the report of the "red band" problem, according to regulations. But he
did not look further into the problem: he didn't look into the situation of the
maintenance by the signaling branch and didn't know the passing signal was wrong.
Moreover, he didn't monitor the situation of D3115 carefully. Before the collision, the
watchman in Wenzhou South Station reported to the dispatcher: "D3115 is close to
the faulted track section but the driver is out of reach. I will continue to contact him."
But the dispatcher didn’t take any measures. When D3115 was moving slowly and
D301 entered the same section, he didn’t give a warning to the D301 driver. There
may be several reasons for the mistakes. First, high work intensity increased the
likelihood of neglecting to monitor the situation of D3115. Accord-
ing to the investigation report, during 20:17 to 20:24, the dispatcher had confirmed
the equipment conditions in stations along the line, learned the information about
another train, and done the reception and departure work of 8 trains. Second, it may
be a result of inadequate training: the dispatcher did not take his responsibility
seriously, his safety consciousness was weak, and he underestimated the importance
of the problem.
Third, the feedback to the dispatcher was missing due to the communication failure
between the driver and the dispatcher.
Before the collision, Wenzhou South Station was in the emergency control
mode. In this mode, the railway station was responsible for implementing the
"Vehicle Integrated Control" with the passing trains and confirming the safety
information with the drivers in standard phrases using wireless communication
equipment, according to the regulations. But the watchman in Wenzhou South Station
didn't implement the "Vehicle Integrated Control" with the D301 driver. In addition, the
watchman had failed 3 times to reach the D3115 driver. When he finally reached the
driver, the driver reported: "The train is three sections from Wenzhou South Station.
I failed to drive in visual driving mode due to abnormal track signals. I cannot reach
the train dispatcher because GSM-R has no signal and I will try again." However, the
watchman didn’t report the situation of D3115 to the dispatcher in time. The watch-
man didn’t perform his duty correctly due to inadequate training or weak safety con-
sciousness.
Informed of the failure of track circuits, the workers of the signaling branch
replaced some track circuit transmitters without reporting it or taking the equipment
out of service, and turned the code sent by the 5829AG to green, which violated the
railway signal maintenance regulations. The workers of the signaling branch didn't
perform their duties correctly due to inadequate training or weak safety
consciousness.
Fig. 4 summarizes the role of the operators involved in the accident.
Fig. 4. The role of the operators involved in the accident:

Train dispatcher:
– Safety constraints: (1) send control commands to railway stations and drivers;
(2) monitor the operation of trains and the occupation of tracks.
– Inadequate control actions: (1) didn't look into the situation of the maintenance
of the signal system; (2) didn't carefully monitor the situation of D3115.
– Context in which decisions were made: high work intensity.
– Mental model flaws: (1) inadequate training led to weak safety consciousness and
poor understanding of job responsibility; (2) the feedback was missing.

Drivers:
– Safety constraints: (1) operate the trains safely according to operation procedures
and control commands issued by the train dispatcher; (2) report the operation
information and problems to the dispatcher and relevant railway stations.
– Inadequate control actions: (1) the driver of D3115 failed 3 times to drive in the
visual driving mode; (2) the driver of D3115 failed to report to the dispatcher;
(3) the driver of D301 didn't take any actions to prevent the train from entering
the block section occupied by D3115.
– Context in which decisions were made: the trains were behind schedule.
– Mental model flaws: (1) inadequate coordination between the D3115 driver and the
ATP; (2) the mental model of the D301 driver became inconsistent with the actual
process because of missing feedback.
The Shanghai railway bureau did not provide adequate training to the staff to ensure
they were competent to carry out their responsibilities. Weak safety consciousness
was a widespread problem among the staff.
The Ministry of Railways has primary responsibility for enforcing the legislation,
regulations, and policies that apply to the construction and operation of railway
systems. On the operation side, the Ministry of Railways didn't provide adequate
inspection and supervision of the safety management of the Shanghai railway bureau,
and the existing problems in the Shanghai railway bureau were ignored. On the
construction side, the Ministry of Railways is also responsible for the technical
review and certification of the equipment to be used on Chinese railways. In the
accident, the Ministry of Railways carried out illegal operations in the technical
review and certification process, with the result that the flawed train control
center equipment was used in Wenzhou South Station. In addition, the Ministry of
Railways didn't establish explicit rules for the technical review process.
With rapid economic growth, China has strived to develop high speed railways.
The Chinese government has invested billions of dollars in the rapid expansion
of the high-speed railway network in recent years. China's high speed railways now
stretch across more than 10,000 km, expanding to 13,000 km by the end of 2012, and
are planned to reach about 16,000 km by 2020 according to the Chinese railway
network planning programs. But with a culture of seeking quick success and instant
benefits in the Chinese government, the Ministry of Railways pursued construction
speed rather than safety in railway construction. China's initial high speed trains
were imported or built under technology transfer agreements with foreign companies.
Chinese engineers then absorbed foreign technology in building indigenous train sets
and signal systems. The type of the defective station train control center equipment
in Wenzhou South Station is LKD-T1. It was designed by the Beijing National Railway
Research & Design Institute of Signal & Communication and manufactured by Shanghai
Railway Communication Co., Ltd. The design was flawed and field testing was not
performed. The documentation of the design was incomplete and non-standard.
Nevertheless, the Ministry of Railways let it pass the technical review and put it
into use in just a few months.
Ministry of Railways:
– Safety constraints: (1) make regulations and standards on the safe operation of
trains; (2) provide oversight of the execution of regulations and standards;
(3) implement the technical review and certification of the equipment to be used
on Chinese railways.
– Inadequate control actions: (1) inadequate inspection and supervision of the
safety management of the Shanghai railway bureau; (2) did not establish explicit
rules for technical review; (3) illegal operation in the technical review and
certification process.
– Context in which decisions were made: (1) rapid expansion of Chinese high speed
railways; (2) the technology of the signaling system was imported and the system
was redesigned by local companies; (3) a culture of seeking quick success and
instant benefits.
– Mental model flaws: insufficient attention was paid to safety.

Fig. 5. The role of the Ministry of Railways and the Shanghai railway bureau in the
accident
Rationale: In the accident, the driver failed several times to drive the train in
visual mode. The ATP system relies too much on the signals. When the ATP system stops
the train due to abnormal signals, the driver should have priority to control the
train (inadequate coordination among controllers). But giving the driver priority to
control a train may give rise to new hazards; in this situation, new safety
constraints may need to be added and further measures taken to keep the risk at an
acceptable level.
(3) More effective communication channels between drivers and dispatchers
should be added in the control structure.
Rationale: In this accident, the D3115 driver failed to report to the dispatcher
because the GSM-R wireless network had no signal (missing feedback). Communication
between drivers and dispatchers is critical to safety, and alternative communication
channels should be established to ensure smooth communication.
(4) The Shanghai railway bureau must improve safety management, enforce regulations
and standards more effectively, and provide sufficient supervision of job
responsibility fulfillment and regulation compliance. It must provide adequate
training to the staff to ensure they are competent to carry out their
responsibilities, and it must emphasize railway safety culture.
Rationale: The personnel involved in the accident didn't fulfill their duties
adequately and violated the regulations. Weak safety consciousness was a widespread
problem among the staff (inadequate execution of control actions).
(5) The safety and reliability of new technology and equipment should be empha-
sized to ensure the safe development of high speed railways in China.
Rationale: The Ministry of Railways pursued the advancement of technology and
the speed of railways in the development of high speed railways in China
(insufficient attention was paid to safety).
4 Conclusions
In this paper, by analyzing the safety control structure and the role each component
played in the China–Yongwen railway accident that happened on July 23, 2011, we
acquired a better understanding of the accident and proposed improvement measures
to prevent similar accidents in the future.
The results of the STAMP analysis are consistent with the accident investigation
report. But instead of just identifying the causal factors and who is to be punished,
the STAMP analysis provides a more comprehensive view of the accident. The STAMP
model used in this paper is effective in modeling complex socio-technical systems.
The use of STAMP provides the ability to examine the entire socio-technical system
to understand the role each component played in the accident. Modeling the entire
control structure helped in identifying the different views of the accident process
held by designers, operators, managers, and regulators, and the contribution of each
to the loss. STAMP leads to a more comprehensive understanding of the accident by
incorporating environmental factors, component failures, design flaws, human errors,
and social and organizational factors in the model. The modeling also helped us to
understand the relationships among these factors. The heart of a STAMP analysis lies
in identifying the safety constraints necessary to maintain safety and acquiring
information on the ways those constraints were violated. This information can be
used to improve the system design and prevent similar accidents.
References
1. The state investigation team of the China–Yongwen railway accident. The investigation
report on the “7.23” Yongwen line major railway accident (in Chinese).
http://www.chinasafety.gov.cn/newpage/Contents/Channel_5498/2011/1228/160577/content_160577.htm (2011)
2. Leveson, N.G. A New Accident Model for Engineering Safer Systems. Safety Science
42(4): 237-270 (2004)
3. Leveson, N.G., Daouk, M., Dulac, N., & Marais, K. Applying STAMP in Accident
Analysis. Workshop on Investigation and Reporting of Incidents and Accidents (2003)
4. Leveson, N.G. Model-Based Analysis of Socio-Technical Risk. Technical Report,
Engineering Systems Division, Massachusetts Institute of Technology.
http://sunnyday.mit.edu/papers/stpa-tech-report.doc (2002)
5. Qureshi, Z.H. A Review of Accident Modeling Approaches for Complex Socio-
Technical Systems. Proc. 12th Australian Conference on Safety-Related Programmable
Systems, Adelaide, Australia (2007)
6. Nelson, P.S. A STAMP analysis of the LEX COMAIR 5191 accident. Thesis, Lund
University, Sweden (2008)
7. Ouyang, M., Hong, L., Yu, M.-H., Fei, Q. STAMP-based analysis on the railway
accident and accident spreading: Taking the China–Jiaoji railway accident for
example. Safety Science, 48(5): 544-555 (2010)
Efficient Software Component Reuse in Safety-Critical
Systems – An Empirical Study
1 Introduction
Safety-critical systems are systems which may, should they fail, harm people and/or
the environment – such as vehicles, power plants, and machines. To develop such
systems, one must demonstrate that potential hazards have been analyzed, and that all
prescribed activities listed in an applicable safety standard have been performed.
There are generally applicable safety standards, such as the IEC-61508, and domain-
specific standards, such as IEC-61511, ISO-15998, ISO-26262, RTCA DO-178B/C,
EN50126/8/9, ISO-13849, and IEC-62061. In daily development work, achieving
a sufficient level of safety boils down to adhering to the relevant standard(s).
These standards are based on an assumed top-down approach to system construc-
tion. Each system must be analyzed for its specific hazards and risks in its specific
environment, and the system requirements must be traced throughout the development
to design decisions, to implementation units, to test cases, and to final validation. The
standards’ general approach to the inclusion of pre-existing software components in a
system is to present them as being an integrated part of the development project, and
let them undergo the same close scrutiny as newly developed software for the specific
system (which is inefficient). The standards in general provide very little guidance for
potential developers of software components intended for reuse in several safety-
critical systems, with the main exceptions being the recently issued ISO-26262 and the
advisory circular AC20-148 complementing RTCA DO-178B.
For a reusable component to be included in a safety-critical system, the component
developer needs to not only comply with the relevant standard throughout the life
cycle, but also ensure that the integrator saves effort by reusing the component. In
safety-critical systems, the actual implementation is just a small part of the “component”
being reused, and savings are lost if the integrator has to re-perform much or all
of the safety-related work (e.g. verification, traceability, adaptation of documentation).
This paper takes an overall view and intends to identify the most important challenges,
as perceived by practitioners, and to provide some guidance on how to address
these challenges. The five specific challenges considered are those identified in
(Åkerholm & Land, 2009).
2 Research Method
The purpose of the study is to collect valuable experience; the extent to which the
suggested practices improve efficiency has not been independently validated. First,
four open-ended interviews were performed (see section 2.1). Second, as action
research, we used an industrial project (see section 2.2), applying some of the findings
from the interviews. All observations were compiled qualitatively, and the synthesized
result is presented here, with the source of each observation indicated in the text.
3 Related Work
activities during its development. Among the few attempts to describe reuse of
software in safety-critical systems from a practical, industrial point of view, the most
notable are descriptions of components pre-certified according to AC20-148
(Lougee, 2004) (Khanna & DeWalt, 2005) (Wlad, 2006); these describe some of the
potential benefits of reusing pre-certified software rather than providing guidance on
how to develop a reusable software component efficiently, as we do in this paper.
Common challenges of software reuse (Karlsson, 1995) also hold true for reuse of
safety-critical software components; for example, there are various methods and
practices addressing the need to design a system with potential components in mind
(Land, Blankers, Chaudron, & Crnković, 2008) (Land, Sundmark, Lüders, Krasteva,
& Causevic, 2009). In general, there are more data and experiences on development
with reusable components than on development of reusable components (Land,
Sundmark, Lüders, Krasteva, & Causevic, 2009), while the present study takes a
broad perspective and includes both.
Literature on modularized safety argumentation provides several promising research
directions, such as how to extend e.g. fault tree analysis (Lu & Lutz, 2002) and state-
based modeling (Liu, Dehlinger, & Lutz, 2005) to cover product lines, which should in
principle also work for composition of component models. A bottom-up, component-
based process is described in (Conmy & Bate, 2010), where internal faults in an
FPGA (e.g. bit flips) are traced to its outputs and potential system hazards. Such
analyses should be applicable to components being developed for reuse, leading to
a description at the component interface level, e.g. of the component’s behavior in the
presence of internal faults. In the direction of modularized safety arguments, there are
initiatives related to GSN (Goal Structuring Notation) (Despotou & Kelly, 2008) and
safety contracts (Bate, Hawkins, & McDermid, 2003).
This section contains the observations made in the study, based both on the interviews
and the development project, formulated as concrete practices the component devel-
oper should perform. The section is organized according to the five challenges listed
in (Åkerholm & Land, 2009) and in the introduction of the present paper.
The component developer should follow the relevant safety standard as closely as
possible with regard to e.g. terminology and required documents. According to the
experience of interviewee #2, companies unnecessarily create a problem when they use
an internal project terminology and then provide a mapping to the standard.
Interviewee #4, on the other hand, describes such a mapping from the platform’s
terminology to that of the standards it is certified against; however, since the same
assessor (i.e. the same individual person at the certification authority) is appointed
for all standards, this poses no major obstacles.
Still, the safety standards assume that the documentation structure results from a
top-down system construction, and a component will need to specify for which part of
this structure it provides (some of) the required documentation, and how this should be
integrated into the system’s documentation structure. When we followed the structure
outlined in (Åkerholm & Land, 2009) in our project, we observed that the documentation
interface is highly dependent on the technical content, because design decisions on one
level are treated as requirements on the level below. When defining a component for
reuse, there are some specific challenges involved, such as the perhaps not obvious
distinction between the architecture and the requirements of the component; in the
project we realized that the documentation needs to distinguish these more clearly than
we did at the outset. Hazard and risk analysis for the component needs to be performed
backwards and documented as a chain of assumptions rather than as a chain of
consequences (for example, stating the conditions the environment is assumed to
guarantee, instead of tracing each component failure to a system-level consequence);
this needs to be documented very clearly to make the hazard analysis and safety
argumentation of the system as straightforward as possible. Further research is needed
to provide more detailed suggestions on how to structure the component documentation
in order to provide an efficient base for integration.
Practice I: Follow the requirements of the standard(s) on documentation structure
and terminology as closely as possible. Two important parts of a component’s docu-
mentation interface are the component requirements and the component hazard and
risk analysis, which should aim for easy integration into the system’s design and haz-
ard/risk analysis.
Identification of Configuration Interface. A reusable component should have a
modular design and configuration possibilities, so that “hot spots”, where future
changes are anticipated, can be identified and isolated (Interviewee #3; see also e.g.
(Lougee, 2004)). Knowledge of the specific differences between customers and systems
is required; interviewee #3 describes their operating system’s support for different
CPUs, its ability to let the integrator add specific initialization code, and its support
for statically modifying the memory map. With configurability come requirements on
verification and testing of a specific configuration of a component in a specific system
(interviewees #1 and #3). In our industrial project, we clearly separated the user-
configurable data from other data in the system by setting up a file structure where only
two files with a strict format are user modifiable. We used mechanisms provided by
the source code language both to provide an easy-to-use configuration interface and
to be able to statically include this data into the program with appropriate static checks
(see also section 4.3 for construction of adaptable test suites).
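To make this concrete, the following is a minimal sketch, assuming a purely illustrative
message-table configuration (the names CFG_NUM_MESSAGES, CFG_CYCLE_TIME_MS, and
msg_table are ours, not the project’s), of how ANSI-C macros and static data can form
such a configuration interface with a compile-time consistency check:

/* Sketch of a user-modifiable configuration file: plain ANSI-C macros and
   static data, so the component can include the configuration statically. */
#define CFG_NUM_MESSAGES  4u     /* user-editable: number of bus messages */
#define CFG_CYCLE_TIME_MS 10u    /* user-editable: task cycle time in ms  */

static const unsigned msg_table[] = { 0x100u, 0x101u, 0x1F0u, 0x1F1u };

/* C89-compatible static check: the array bound becomes -1, and compilation
   fails, if the table size and the declared message count disagree. */
typedef char cfg_check_msg_count[
    (sizeof(msg_table) / sizeof(msg_table[0]) == CFG_NUM_MESSAGES) ? 1 : -1];

Keeping the check next to the data means that an inconsistent user configuration is
rejected at compile time rather than discovered during verification.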
Interviewee #3 describes that with configuration variables which are read from
non-volatile memory during startup, the integrator needs to show that the parameters
cannot change between startups. See section 4.2 on deactivation of dead code.
Practice II: Create a modular design where known points of variability can be easily
expressed as configuration settings which are clearly separated and easy to understand
for the user.
1 http://www.autosar.org/, http://www.sysml.org/, http://www.omgmarte.org/
There will remain a number of activities for the integrator, related to the context and
environment of the component in a specific system. The challenge for the component
developer is to aid the integrator in these activities by providing the component with
certain information and artefacts. In the studies, we identified what can be labeled an
“analysis interface” and adaptable test suites as two important means for this.
Identification of Analysis Interface. Data coupling analysis, control coupling
analysis, and timing analysis are examples of activities that can only be performed by
the integrator, when the complete software is available (AC20-148). However, some
analyses may in principle be partially performed at the component level, or some
useful data or evidence may be constructed at the component level. In spite of research
on composing system properties from component properties (see e.g. (Hissam,
Moreno, Stafford, & Wallnau, 2003) (Larsson, 2004) and the TIMMO project,
http://www.timmo.org/), the challenge remains to identify such analysis interfaces,
including assertions that need to be made by the component developer, properties that
need to be specified, and how to use these automatically in a system-level analysis. In
the study, interviewees #3 and #4 mentioned timing issues as especially important.
With a simple application design, and certain component information, it may be
sufficient to perform timing measurements of the integrated system, given that the
component developer makes assertions on the behavior of the component. The current
state of practice includes, according to interviewees #3 and #4, component assertions
that the function calls are non-blocking, or information that the component disables
interrupts, which is valuable for the integrator’s more detailed timing analysis. Also, a
specification of input data which will cause the longest path through a function to be
executed, and/or the path that includes the most time-consuming I/O operations, is
useful for finding upper bounds on the timing within a specific system and on a
specific hardware.
Practice VIII: Provide information on the component’s (non-)blocking behavior,
disabling and enabling of interrupts, and input data which is likely to cause the upper
bounds on timing, to facilitate the integrator’s system level analysis of e.g. timing.
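As an illustration of what such an analysis interface could look like, the following
hypothetical sketch (the record layout and all names are our assumptions, not a format
from the standards or the interviews) packages the component developer’s assertions
in a machine-readable form that the integrator can feed into a system-level timing
analysis:

/* Hypothetical analysis-interface record delivered with the component. */
typedef struct {
    int         calls_are_nonblocking; /* 1: no API call blocks on I/O or locks    */
    int         disables_interrupts;   /* 1: component masks interrupts internally */
    unsigned    max_irq_off_us;        /* asserted upper bound on interrupt-off time */
    const char *worst_case_input;      /* input driving the longest execution path  */
} analysis_interface_t;

static const analysis_interface_t component_analysis_if = {
    1,   /* all calls are non-blocking */
    1,   /* interrupts are disabled briefly */
    25u, /* at most 25 microseconds with interrupts off */
    "full message table with all CRC checks failing (worst-case error path)"
};

An integrator could then, for example, add max_irq_off_us to the worst-case latency
budget of its own interrupt handlers, and use worst_case_input to drive timing
measurements on the target hardware.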
Adaptable Test Suites. The component of our project is delivered with a module
test suite which automatically adapts itself to the configuration. The component
configuration is made through macro definitions and by filling static data structures
with values; the test suite is conditionally compiled based on the same macros, and it
uses the actual values of the data structures to identify e.g. boundary values to use in
testing, and of course to determine the expected correct output. The test suite includes
all necessary fault injection in order to always achieve sufficient code coverage (for
SIL 3 according to IEC-61508).
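The following self-contained sketch illustrates the principle; CFG_NUM_MESSAGES,
msg_table, and lookup_message are illustrative stand-ins for the project’s actual
configuration and component code, not taken from it. The test program is compiled
against the same configuration data as the component and derives its boundary values
and expected results from it:

#include <assert.h>
#include <stdio.h>

/* Normally supplied by the user-editable configuration files. */
#define CFG_NUM_MESSAGES 4u
static const unsigned msg_table[] = { 0x100u, 0x101u, 0x1F0u, 0x1F1u };

/* Stand-in for a component function under test: map a message id to its index. */
static int lookup_message(unsigned id)
{
    unsigned i;
    for (i = 0; i < CFG_NUM_MESSAGES; i++)
        if (msg_table[i] == id)
            return (int)i;
    return -1;
}

int main(void)
{
    unsigned i;
    /* Boundary values come from the configuration itself, so the same test
       source remains valid for any configuration the integrator chooses. */
    assert(lookup_message(msg_table[0]) == 0);
    assert(lookup_message(msg_table[CFG_NUM_MESSAGES - 1u])
           == (int)(CFG_NUM_MESSAGES - 1u));
    assert(lookup_message(0xFFFFu) == -1); /* id outside the configured table */
    for (i = 0; i < CFG_NUM_MESSAGES; i++)
        assert(lookup_message(msg_table[i]) == (int)i);
    printf("all configuration-derived tests passed\n");
    return 0;
}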
The creation of a module test suite on this higher level of abstraction forced us to
reason about many boundary values, possible overflow in computations, and similar
border conditions. It also helped us identify user errors, such as what would happen
if the component is configured with empty lists or inconsistent configuration
parameters. In addition, the resulting number of actual tests executed on a single
configuration is significantly higher than we would otherwise have created, which also
increases our confidence in the component, although strictly speaking the number of
test cases can never in itself be an argument for testing being sufficient. Thus, as a side
effect, this greatly helped us to design for testability and to design good test suites.
The main purpose of providing adaptable test suites is that the integrator can easily
perform module tests on the specific configuration used in the system. To verify the
configuration mechanisms and the test suite itself, we created a number of
configurations and re-executed the tests with very little effort (a matter of minutes).
This increased our confidence, not only in the component itself but in that we are
saving a significant amount of effort for integrators. (However, the integrator must
learn and understand how to run the test suite correctly for a specific configuration,
and how to interpret the test output, including verification of the code coverage
reached.) Another benefit is that some changes (e.g. addition of messages on the bus)
can be made late in the development process and easily re-tested.
The test suite is written in ANSI-C and is therefore as portable as the component
itself, but the fault injection mechanism and the code coverage analysis rely on
external tools and therefore somewhat restrict the integrator’s freedom. To account for
this, we have designed the test suite and test environment so that adapting the suite to
another tool set should not require too much effort.
Practice IX: Deliver an adaptable test suite with the component, so that the inte-
grator can (re-)perform configuration-specific testing with little effort.
metrics), and that it is used as intended and its safety manual has been followed
(interviewee #4; our project). Interviewee #2 in particular stresses that the objective
when arguing safety is to perform the argumentation in relation to the system hazards;
if a fault analysis (e.g. a fault tree analysis) shows that a component does not contribute
to a specific hazard, the tracing may stop there.
Practice X: Specify component requirements and functional interface, so that a de-
tailed traceability analysis is not required when integrated into a system. This includes
providing a safety manual with assumptions and rules for component usage.
Standardization of Traceability Tools. Often traceability is managed manually, as
tables in electronic documents, and even if a traceability tool is used, there are
problems in sharing the same database; it is also likely that the component developer
uses a different tool than its customers (interviewee #1). This is a challenge for
standardization and tool developers, rather than for component developers or integrators.
Meeting the Requirements on System Hazard and Risk Analysis. Normally, the
system hazards are used, with their estimated frequency, consequence, etc., to
determine the SIL level (or similar; the standards have different classifications), which
influences all downstream activities. When developing a component for reuse, the risk
analysis is instead performed backwards: a target market is selected, and the component
is developed according to common requirements and a SIL level which it is believed
that integrators will require. It is only in a system context that safe external behaviors
in case of detected failures can be determined (e.g. to shut down the unit immediately,
apply a brake, or notify the operator; it may or may not be safe and desirable to
first wait for X more messages in case the communication recovers; etc.). It is always
the responsibility of the system developer to focus the argumentation around the
hazards and to show that they cannot conceivably occur. “A general software component
does not have safety properties, but a quality stamp. Only in a specific context do
safety properties exist.” (interviewee #2)
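One way to realize this division of responsibility, shown here as an illustrative sketch
(the names and the callback design are our assumptions, not something prescribed by
the standards or described by the interviewees), is for the component to detect failures
but delegate the choice of safe external behavior to an integrator-supplied handler:

/* The component detects faults; the integrator, who knows the system
   hazards, registers the system-specific safe reaction. */
typedef enum { FAULT_CRC_ERROR, FAULT_TIMEOUT, FAULT_BAD_CONFIG } fault_t;

typedef void (*safe_state_handler_t)(fault_t fault);

static safe_state_handler_t safe_state_handler = 0;

/* Called by the integrator at startup, e.g. with a handler that shuts the
   unit down, applies a brake, or notifies the operator. */
void component_set_safe_state_handler(safe_state_handler_t handler)
{
    safe_state_handler = handler;
}

/* Called internally when the component detects a failure it cannot handle;
   the safe external behavior is decided in the system context. */
static void component_report_fault(fault_t fault)
{
    if (safe_state_handler != 0)
        safe_state_handler(fault);
}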
In the case study we performed some analysis based on assumed, realistic values
for usage and disturbances, to demonstrate that an average system or application using
our component also meets the hardware requirements at the target SIL level. (It may
be noted that there are no major differences in the requirements on development of
software between SIL 2 and SIL 3 according to IEC-61508; the requirements on
hardware, however, typically impose more expensive solutions including redundancy etc.)
Practice XI: Lacking a definition of system hazards, identify component error-
handling, fault tolerance mechanisms, and behaviors that are common to many
systems, as independently as possible of the specific system hazards. (See also Practice
III.)
This section lists the goals that were mentioned by the interviewees in the study, and
describes some of their considerations in meeting these goals.
Goal 1: Saving Effort, Time and Money for the Integrator. Since component
development is carried out according to the standard, and much of the required
documentation and evidence is created, the integrator may potentially save the
corresponding effort (interviewee #1; see section 4.1). However, for interviewees #3
and #4, the effort savings for the integrator are not as significant. Interviewee #4
shared his experience of a component not developed according to the required safety
standards, which brought a significant additional cost for constructing the required
evidence and documenting it. Interviewee #3 states that with AC20-148, the effort
spent by the certification authority is decreased, since only changes to the component
have to be investigated.
Goal 2: Reducing Risk for the Integrator. Interviewees #1, #3, and #4 all state
that with a pre-certified component (or a certifiable component which has been used
in another, certified, system), the confidence is high that the component will not cause
any problems during system certification. Interviewee #3 specifically mentions that
the customers using the component from his organization do so because the component
is pre-certified according to AC20-148 and thus is a low-risk choice.
Practice XII: If the main goal is to present a component as risk-reducing, the
component developer should consider certifying the component. If the main goal is to
save effort for the integrator, it may be sufficient to develop it according to a standard,
and to address effort savings in the ways outlined in this paper.
3 http://www.safecer.eu
5.1 Acknowledgements
We want to thank the interviewees for sharing their experiences with us. This study
was partially funded by The Knowledge Foundation (KKS), CrossControl, ARTEMIS
Joint Undertaking, and Vinnova.
6 References
1. Mikael Åkerholm and Rikard Land, “Towards Systematic Software Reuse in Certifiable
Safety-Critical Systems”, in RESAFE - International Workshop on Software Reuse and
Safety, Falls Church, VA, 2009.
2. Scott A. Hissam, A. G. Moreno, Judith Stafford, and Kurt C. Wallnau, “Enabling Predict-
able Assembly”, Journal of Systems & Software, vol. 65, no. 3, 2003.
3. Magnus Larsson, “Predicting Quality Attributes in Component-based Software Systems”,
Ph.D. Thesis, Mälardalen University, 2004.
4. Jeffrey Voas, “Why Is It So Hard to Predict Software System Trustworthiness from Soft-
ware Component Trustworthiness?”, in 20th IEEE Symposium on Reliable Distributed
Systems (SRDS'01), 2001.
5. Hoyt Lougee, “Reuse and DO-178B Certified Software: Beginning With Reuse Basics”,
Crosstalk – the Journal of Defense Software Engineering, December, 2004.
6. Varun Khanna and Mike DeWalt, “Reusable Sw components (RSC) in real life”, in
Software/CEH conference, Norfolk, VA, 2005.
7. Joe Wlad, “Software Reuse in Safety-Critical Airborne Systems”, in 25th Digital Avion-
ics Systems Conference, 2006.
8. Even-André Karlsson, Software Reuse: A Holistic Approach. John Wiley & Sons
Ltd., 1995. ISBN 0-471-95819-0.
9. Rikard Land, Laurens Blankers, Michel Chaudron, and Ivica Crnković, “COTS Selection
Best Practices in Literature and in Industry”, in Proceedings of 10th International Con-
ference on Software Reuse (ICSR), Beijing, China, 2008.
10. Rikard Land, Daniel Sundmark, Frank Lüders, Iva Krasteva, and Adnan Causevic, “Re-
use with Software Components – A Survey of Industrial State of Practice”, in 11th Inter-
national Conference on Software Reuse (ICSR), Falls Church, VA, USA, 2009.
11. Dingding Lu and Robyn R. Lutz, “Fault Contribution Trees for Product Families”, in 13th
International Symposium on Software Reliability Engineering (ISSRE’02), 2002.
12. Jing Liu, Josh Dehlinger, and Robyn Lutz, “Safety Analysis of Software Product Lines
Using State-Based Modeling”, in 16th IEEE International Symposium on Software Relia-
bility Engineering (ISSRE’05), 2005.
13. Philippa Conmy and Iain Bate, “Component-Based Safety Analysis of FPGAs”, IEEE
Transactions on Industrial Informatics, vol. 6, no. 2, 2010.
14. George Despotou and T. Kelly, “Investigating The Use Of Argument Modularity To
Optimise Through-Life System Safety Assurance”, in 3rd IET International Conference
on System Safety (ICSS), Birmingham, 2008.
15. I. Bate, R. Hawkins, and J. McDermid, “A Contract-based Approach to Designing Safe
Systems”, in 8th Australian Workshop on Safety Critical Systems and Software
(SCS'03), 2003.
16. As-2 Embedded Computing Systems Committee, “Architecture Analysis & Design Lan-
guage (AADL)”, Document Number AS5506, 2009.
Author Index