
Fault Detection and Fault Tolerant Control

for Distributed Systems.


A general framework

Ph.D. Thesis

Andrea Paoli

Tutor: Prof. Claudio Bonivento


Coordinator: Prof. Alberto Tonielli

University of Bologna

XVI Ciclo

A.A. 2000 – 2003




Keywords:

fault tolerant control, fault diagnosis, reliability, distributed systems, output regulation theory,
discrete event systems.

Eng. Andrea Paoli


CASY - DEIS - University of Bologna
Viale Pepoli 3/2, 40136 Bologna.
Phone: +39 051 2093874, Fax: +39 051 2093870
Email: [email protected]
URL: http://www-lar.deis.unibo.it/people/apaoli

This thesis has been written in LaTeX.

Copyright © 2004 by Andrea Paoli. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording or any information storage and retrieval system, without permission in writing from the author.

Acknowledgments:
This work has been partially funded by the EC-Project IFATIS (Intelligent Fault Tolerant Control
in Integrated Systems), sponsored by the European Commission in the IST programme 2001 of
the 5th EC framework programme (IST-2001-32122).

This work has been partially funded by MIUR (Ministero dell’istruzione, dell’università e della
ricerca).
HAEC AUTEM ITA FIERI DEBENT, UT HABEATUR
RATIO FIRMITATIS, UTILITATIS, VENUSTATIS.

MARCUS VITRUVIUS POLLIO,

DE ARCHITECTURA, 3.2.

... all these [buildings] should be built considering


principles of strength, utility and beauty ...

(the Vitruvian triad)


Contents

Preface 9

1 Distributed systems and fault tolerance: an overview 15


1.1 Introduction to diagnosis and fault-tolerant control . . . . . . . . . . . . . . . . . 15
1.2 Faults and fault tolerance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3 Fault-tolerant control architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.4 An overview on fault diagnosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.4.1 Model-Based Diagnosis in the control community . . . . . . . . . . . . 20
1.4.2 Model-Based Diagnosis in the Artificial Intelligence framework . . . . 22
1.5 Some ideas on controller re-design . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.6 Why distributed systems? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.6.1 Functional requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
1.6.2 Temporal requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
1.6.3 Dependability requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.7 Properties of distributed real-time systems . . . . . . . . . . . . . . . . . . . . . . 32
1.7.1 Composability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
1.7.2 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.7.3 Dependability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.8 Failures, errors and faults in distributed systems: nomenclature . . . . . . . . . . 35
1.8.1 Failures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
1.8.2 Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.8.3 Faults . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.9 Fault-tolerance in distributed systems . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.9.1 State of the art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.9.2 A node as a unit of failure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.10 Bibliographic notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

2 Fault tolerant architecture for distributed systems 41


2.1 The distributed nature of Fault Tolerant Control System . . . . . . . . . . . . . . . 41
2.1.1 Fault Tolerant Control/Measure modules . . . . . . . . . . . . . . . . . . . 43
2.1.2 Resource monitor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.1.3 Global (and group) resource and reconfiguration manager . . . . . . . . . 47
2.1.4 Configuration of the hierarchical structure . . . . . . . . . . . . . . . . . . 48
2.2 Failure analysis tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.2.1 Fault Tree Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48


2.2.2 Structural Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49


2.3 Supervision level hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.4 Design of the Group/Global Resource Reconfiguration Manager . . . . . . . . . . 58
2.4.1 Resource sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.4.2 Interlaced reconfiguration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.5 A pilot plant: the two tank system . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.5.1 Fault scenario and diagnostic algorithms . . . . . . . . . . . . . . . . . . . 74
2.5.2 System decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
2.5.3 An overview of possible working modes . . . . . . . . . . . . . . . . . . . 79
2.5.4 Working Mode Decision Logic . . . . . . . . . . . . . . . . . . . . . . . . . 82
2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

3 Reliability of complex diagnosis systems 85


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.2 Statistic Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.3 A Framework for Reliability of Diagnosis Systems . . . . . . . . . . . . . . . . . . 88
3.3.1 Step 1. Description of Elementary Cells (definition of the item) . . . . . . . 88
3.3.2 Step 2. Computation of the Reliability Function for the Elementary Cells
(definition of the required function) . . . . . . . . . . . . . . . . . . . . . . . 89
3.3.3 Step 3. Computation of the Propagation Reliability of Elementary Cells . 92
3.3.4 Step 4. Computation of diagnostic reliability indices. . . . . . . . . . . . . 93
3.4 The common rail benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

4 A discrete event approach to system monitoring 101


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.2 Preliminary notions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.3 A motivating example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.4 The notions of diagnosability and safe diagnosability . . . . . . . . . . . . . . . . 106
4.5 Necessary and sufficient conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.5.1 Necessary and sufficient conditions for diagnosability . . . . . . . . . . . 109
4.5.2 Necessary and sufficient conditions for safe diagnosability . . . . . . . . . 111
4.6 Safe time-diagnosability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.7 Active approach to safe diagnosability . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.7.1 Review of the Active Diagnosis Problem . . . . . . . . . . . . . . . . . . . 120
4.7.2 Formulation and solution procedure to the active safe diagnosis problem 122
4.8 An approach for fault tolerant supervision . . . . . . . . . . . . . . . . . . . . . . 127
4.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

5 Implicit fault tolerant control systems 129


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.2 The Induction Motor model and the Indirect Field Oriented Controller . . . . . . 130
5.2.1 The induction motor model . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.2.2 Control objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.2.3 A Global Indirect Field Oriented Controller . . . . . . . . . . . . . . . . . . 132
5.3 The implicit fault tolerant controller . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.3.1 Fault scenario and faulty model . . . . . . . . . . . . . . . . . . . . . . . . 137
5.3.2 Reconfiguration strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

5.3.3 Embedding an internal model of the fault . . . . . . . . . . . . . . . . . . . 141


5.3.4 Adaptive frequency estimation . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.3.5 Fault Detection and Isolation . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.4 Experimental and simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . 149
5.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

Conclusions and future works 157

A Introduction to discrete event systems theory 161


A.1 Discrete event systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
A.2 Operations on Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
A.3 Representation of languages: automata . . . . . . . . . . . . . . . . . . . . . . . . 163
A.3.1 Operations on automata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
A.3.2 Observer automata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
A.4 Regular languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
A.5 Supervisory control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
A.6 Uncontrollability problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
A.6.1 Dealing with uncontrollable events . . . . . . . . . . . . . . . . . . . . . . 170
A.6.2 Realization of supervisors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
A.7 Unobservability problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

B An experimental setup for FTC algorithms test 173


B.1 The experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
B.2 The power stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

C Implicit fault tolerant control of a n-dof robot manipulator 181


C.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
C.2 Problem statement and preliminary positions . . . . . . . . . . . . . . . . . . . . . 182
C.2.1 Tracking control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
C.2.2 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
C.3 Canonical internal model unit design . . . . . . . . . . . . . . . . . . . . . . . . . . 185
C.4 Adaptive internal model unit design . . . . . . . . . . . . . . . . . . . . . . . . . . 187
C.5 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
C.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

Bibliography 193

Index 199

Curriculum vitae 205


Preface

Automated systems are vulnerable to faults such as defects in sensors or actuators and failures in controllers or in the control loop, which can cause undesired reactions and consequences such as damage to technical parts of the plant, to personnel or to the environment. The main objective of the Fault Detection and Isolation (FDI) research area, widely addressed from several points of view in recent years (see, among others, [28], [80], [38]), is to study methodologies for identifying and exactly characterizing possible incipient faults arising in predetermined parts of the plant. This is usually achieved by designing a dynamical system (filter) which, by processing input/output data, is able to detect the presence of an incipient fault and possibly to isolate it precisely, generating the so-called residual signals (see, among others, [70], [69] for the linear case and [85] for the nonlinear case).
This is an important problem to deal with, since faults in sensors, actuators and components result in increased operating costs, off-specification production, line shut-downs and possible detrimental environmental impact. As highlighted in [28], we use the term fault detection to describe the problem of making a binary decision either that something has gone wrong or that everything is fine. This task can appear trivial but, to be useful, faults should be detected early, before they become serious. As the reader can imagine, the uncertainty on the correct behavior of the plant and the disturbances and noise affecting the measurements make early and robust fault detection difficult to achieve reliably. The next step is the so-called fault isolation, in which the source of the fault is determined (cf. [28], [38]). We will refer to the FDI problem as the combined problem of fault detection and isolation, while with the term fault diagnosis system we refer to the procedure used to detect and isolate faults and to assess their severity.
The FDI mechanism can be achieved using a replication of hardware; this technique, known in the literature as hardware redundancy, relies on comparing the outputs of identical hardware for consistency. Following a different approach, known in the literature as analytical redundancy, the FDI task can be accomplished using analytical and functional information about the system or, in other words, using a mathematical model of the plant (model-based FDI).
Once a fault has been detected and isolated, the next natural step is to reconfigure the control law in order to tolerate the fault, namely to guarantee pre-specified performances for the faulty system. In this framework, the FDI phase is usually followed by the design of a Fault Tolerant Control (FTC), namely by the design of a reconfiguring unit which, on the basis of the information provided by the FDI filter, adjusts the controller in order to achieve prescribed performances also for the faulty system. Many strategies can be followed at this stage. Mechanical reconfiguration, such as switching between redundant hardware or mechanical parameters (see [98] and [25]), adaptive robust control (see [27]) and supervisory switching control (see [5]) are just a few examples of different strategies which can be pursued to deal with faulty plants. The basic idea behind these approaches is that of explicit FTC, in the sense that they are based on an "explicit" control reconfiguration which follows the "explicit" fault estimation performed by the FDI unit.
It is worth stressing that in general the design of a reliable FDI/FTC unit has to tackle important problems which characterize real applications, such as the presence of unknown parameters or neglected dynamics in the model of the system, unknown disturbances acting on the system, the lack of a reliable fault model, etc. All these and other factors can affect the performance of the FDI and FTC scheme, in the sense that faults could be detected even when not really present (false alarms) or, in a more critical and dangerous scenario, faults may not be detected at all (or detected with unacceptable delay). Clearly, the design of an FDI/FTC control unit with pre-specified performances can be better carried out if the designer has real data concerning the disturbances and uncertainties characterizing the plant. This, in general, allows a design suitably tailored to the specific application and leads to simpler and more reliable FDI/FTC units.
A further aspect worth mentioning is the effect that a closed-loop controller can have on the reliability of the overall FDI/FTC control system. As a matter of fact, most of the present literature on the design of FDI units deals with "open loop" systems, in the sense that the design of the FDI filter is usually addressed without taking into account the presence of a feedback loop. A feedback regulator designed to possess robustness with respect to exogenous disturbances may, in some circumstances, also mask the effect of an incipient fault, improving in this way the fault tolerance of the system but also jeopardizing the effectiveness of the FDI phase. These considerations suggest that, in general, the design of a reliable closed-loop FDI/FTC control system amounts to an integrated design of closed-loop regulators, aiming at enforcing the desired specifications, and explicit FDI and FTC units.
It is clear from this description that the classical approach to FDI and FTC relies upon a "certainty equivalence" idea extensively used in the context of adaptive control, since it is based on the explicit estimation of unknown time-varying signals/parameters (in this specific case the faults) by the FDI filter and the subsequent explicit reconfiguration of the controller in the presence of faults.
These methodologies have been developed mainly for centralized dynamical systems with a limited number of faults. On the contrary, most of today's advanced applications have a distributed nature (see e.g. the automotive field). In distributed large-scale control systems, faults affect local subsystems, but their effects propagate throughout the whole system. This fact introduces new problems to deal with, and the countermeasures have to make use of the structural properties of this class of systems.
From this brief discussion emerges the need for a unified framework for fault tolerant control of large scale systems.
This work is a collection of results towards a unified framework for Fault Tolerant Control in Distributed Control Systems. It starts with a survey chapter, Chapter 1, which illustrates concepts, definitions and classical results about Fault Detection and Isolation and Fault-Tolerant Control and introduces basic concepts of distributed computer system architectures. In Chapter 2 a novel architecture for Fault Tolerant Control is presented and some design guidelines are given. This chapter presents the work developed within the EC-Project IFATIS (Intelligent Fault Tolerant Control in Integrated Systems), partly funded by the European Commission in the IST programme 2001 of the 5th EC framework programme (IST-2001-32122). Chapter 3 deals with a specific step of the design procedure presented in the previous chapter: reliability prediction. More in detail, a procedure to evaluate the reliability of a complex distributed diagnosis system in the framework of fault tolerant control is illustrated, starting from classical reliability concepts. In Chapter 4 a "high-level" diagnosis and reconfiguration engine in the framework of fault-tolerant control of discrete event systems is presented, while in Chapter 5 a "low-level" fault-tolerant control scheme is presented. More in detail, the idea of implicit fault tolerance is introduced and applied to the control of induction motors in faulty conditions. Finally, concluding remarks and some ideas on new directions to explore are given.
Appendix A briefly reports some concepts and results about Discrete Event Systems which should help the reader in understanding some crucial points of Chapter 4, while Appendix B gives a technical overview of the experimental set-up for fault tolerant control of induction motors, entirely designed in the Laboratory of Automation and Robotics of the University of Bologna. Appendix C deals with an application of the implicit fault tolerant control scheme, presented in Chapter 5, to an n-dof fully actuated robot manipulator affected by actuator faults.

* * *

First of all I would like to thank my supervisor, Prof. Claudio Bonivento, who led me into this field three years ago and who has placed his trust in me since the first day of this period. A sincere thanks goes to Prof. Lorenzo Marconi, for his valuable suggestions and for the always profitable discussions.
I cannot exempt myself from thanking Prof. Alberto Isidori, for his priceless suggestions and teaching. A special thanks goes to Prof. Stéphane Lafortune of the University of Michigan, for the continuous encouragement during my stay in Ann Arbor and his stay in Bologna, for his teachings and for believing in me.
I will always remember with pleasure the stimulating discussions with Prof. Demosthenis Teneketzis and Prof. Jessy W. Grizzle; thank you for transmitting to me your love for science.
A special thanks goes to all the guys of the EECS group of the University of Michigan: Chadi, Doron, Kurt, Olivier, Pierre, Shaika. You made my stay in Ann Arbor unforgettable.
Thanks to the other members of my working group, Luca Gentili and Marta Capiluppi; you are not only good friends, but also valuable colleagues: it has been a pleasure to work with you. Of course thanks to all the students who carried out their master's theses under my supervision; many of the ideas included in this work arose from discussions with you. A special thanks goes to Davide Bagnara for the support given in the experimental activity presented in this work.
How can I forget all the guys of the Laboratory of Automation and Robotics? Alessandro, Andrea, Cristian, Fabio, Luigi, Marcello, Marco, Nicola, Raffaella, with your friendship you made these years priceless.
Thanks to all my friends and in particular to Emiliano, for his true friendship and his help during all the difficult moments, and to Luisa, with whom I shared the American experience.
Last but not least, a special, heartfelt thanks goes to my family, for always trusting in me and for always encouraging me.

Bologna, 15th of March, 2004 Andrea


Fault Detection and Fault Tolerant
Control for Distributed Systems.
A general framework

Chapter 1
Distributed systems and fault tolerance:
an overview

In this chapter the fundamental concepts of Fault Detection and Fault Tolerant Control are given. More in detail, the concepts of errors, faults and failures are introduced, and some ideas on model-based fault detection and isolation are sketched, both in the automatic control area and in the artificial intelligence area. Moreover, the classical fault-tolerant architecture is illustrated, stressing the ideas of fault accommodation and fault reconfiguration. To conclude, distributed architectures for computer control systems are discussed and an overview of the state of the art of fault tolerance in distributed real-time systems is given.

1.1 Introduction to diagnosis and fault-tolerant control


In large systems, every component has been designed to provide a certain function and the
overall system works satisfactorily only if all components provide the service they are designed
for. Therefore, a fault in a single component usually changes the performance of the overall
system. In order to avoid production deteriorations or damage to machines and humans, faults
have to be found as quickly as possible and decisions that stop the propagation of their effects
have to be made. These measures should be carried out by the control equipment. If they are
successful, the system function is satisfied also after the appearance of a fault, possibly after a
short time of degraded performance. The control algorithm adapts to the faulty plant and the
overall system satisfies its function again.
Manufacturing systems consist of many different machine tools, robots and transportation
systems all of which have to correctly satisfy their purpose in order to ensure an efficient and
high-quality production. The economy and everyday life depend on the function of large power distribution networks and transportation systems, where faults in a single component have major effects on the availability and performance of the system as a whole. Mobile communication
provides another example where networked components interact so heavily that component

faults have far-reaching consequences. These are just some examples of how a fault is something that changes the behavior of a technological system in such a way that the system no longer satisfies its purpose.
Fault-tolerant control concerns the interaction between a given system (plant) and a controller that includes not only the usual feedback or feedforward control law, but also the decision-making layer that determines the control configuration. This layer analyzes the behavior of the plant to identify faults and changes the control law to hold the closed-loop system in a region of acceptable performance. A fault-tolerant controller has the ability to react to the existence of a fault by adjusting its activities to the faulty behavior of the plant.
Generally, the way to make a system fault-tolerant consists of two steps:

1. Fault diagnosis: The existence of a fault has to be detected and the fault identified.

2. Control re-design: The controller has to be adapted to the faulty situation so that the overall
system continues to satisfy its goal.

1.2 Faults and fault tolerance


A fault in a dynamical system is a deviation of the system structure or the system parameters
from the nominal situation. Examples are the blocking of an actuator, the loss of a sensor or the
disconnection of a system component. In all these situations, the set of interacting components
of the plant or the interface between the plant and the controller are changed by the fault.
These faults yield deviations of the dynamical input/output (I/O) properties of the plant from
the nominal ones and, hence, change the performance of the closed-loop system which further
results in a degradation or even a loss of the system function.
Consider a system in closed loop with a controller. The fault is denoted by f . We will
refer to F as the set of all faults for which the function of the system should be retained. The
faultless case is included in the fault set F and is denoted by f0 . For the performance of the

U ×Y

Figure 1.1: System behavior: a graphical interpretation.

overall system it is important the output y(t) that the plant generates if it gets the input u(t).
The pairs (u, y) are called input/output pairs (I/O pairs) and the set of all possible pairs of
trajectories u and y that may occur for a given plant defines the behavior B¹.
As sketched in fig. 1.1, the behavior B is a subset of the space U × Y of all possible combinations of input and output signals. The dot A in the figure represents a single I/O pair that may occur for the given system; on the contrary, C represents a pair that is not consistent with the
system dynamics. Consider as an example a static system, which is described by the equation

y(t) = ku(t) ,

where k is the static gain. The input and the output are elements of the set R. The behavior is
given by
B = {(u, y) : y = k · u} .
For a dynamical system the I/O pairs have to include the whole time functions u(t) and y(t)
that represent the input and output signals.
When a fault occurs, it changes the behavior of the system: for example, in fig. 1.2 the system behavior moves from the white set to the grey set.

Figure 1.2: System subject to fault.

If a common input u is applied to the faultless and the faulty system, the two systems answer with the different outputs YA and YB, respectively. The points A = (U, YA) and B = (U, YB) differ and lie in the white or the grey set, respectively. This change in the system behavior makes the detection and isolation of the fault possible, unless the I/O pair lies in the intersection B0 ∩ Bf.
Models represent constraints that the signals U and Y satisfy in order to be relevant for the plant (see [86]). Depending upon the kind of system considered, the constraints can have the form of algebraic relations, differential or difference equations, automata tables or behavioral relations of automata. For a given input U the model yields the corresponding output Y.

¹ An interesting branch of modeling theory is the so-called behavioral approach to physical modelling. This theory states that when we accept a mathematical law as the description of a phenomenon we view it as an exclusion law: a mathematical model expresses the opinion that some things can happen, while others cannot. On the basis of this assertion, a mathematical law (model) selects a certain subset from a universum of possibilities. The interested reader is referred to [86] and [104].

In continuous-variable systems described by an analytical model (e.g. a differential equation), faults are usually described as additional external signals or as parameter deviations. The first form is called an additive fault and is represented in the model by an unknown input that enters the model equation as an addend. The second form is called a multiplicative fault: in this case the system parameters are scaled depending on the fault size.

Also disturbances and model uncertainties change the plant behavior. Disturbances are usually represented by unknown input signals that are added to the system output, while model uncertainties change the model parameters in a similar way as multiplicative faults.
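As an illustrative sketch (a generic linear state-space form, assumed here only to make the two classes concrete; it is not the model used later in the thesis), additive and multiplicative faults can be written as

ẋ(t) = A x(t) + B u(t) + E f(t),   y(t) = C x(t) + F f(t)   (additive fault: f enters as an unknown input),

ẋ(t) = (A + ΔA(f)) x(t) + (B + ΔB(f)) u(t),   y(t) = C x(t)   (multiplicative fault: the parameters are scaled by the fault size).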
Faults are often classified as follows:

• Plant faults: faults that change the dynamical I/O properties of the system.

• Sensor faults: the plant properties are not affected, but the sensor readings have substantial
errors.

• Actuator faults: the plant properties are not affected, but the influence of the controller on
the plant is modified.

A remark is necessary concerning the distinction of the notions of fault and failure. A fault
causes a change in the characteristics of a component such that the mode of operation or per-
formance of the component is changed in an undesired way. Hence the required specifications
on the system performance are no longer met. However, a fault can be worked around by
fault-tolerant control so that the faulty system remains operational. The notion of a failure, as
defined in [12], describes the inability of a system or a component to accomplish its function.
Assume that the system performance can be described by two variables y1 and y2 (see fig. 1.3). In the region of required performance, the system satisfies its function. During its time of operation the system should remain in this region. The controller makes the nominal system remain in this region despite disturbances and uncertainties of the model. The controller may even hold the system in this region if small faults occur.

Figure 1.3: System subject to fault: regions of required, degraded and unacceptable performance and region of danger in the (y1, y2) plane.



The region of degraded performance shows where the faulty system is allowed to remain,
although in this region the performance may be considerably degraded. A fault brings the
system from the region of the required performance into the region of degraded performance.
The fault-tolerant controller should be able to act in order to prevent a further degradation of
the performance towards the region of unacceptable performance or to the region of danger
and it should move the system back into the region of required performance.
The region of unacceptable performance must be avoided by means of the fault-tolerant
controller. This region lies between the region of acceptable performance in which the system
should remain and the region of danger, which the system should never reach. A safety system
interrupts the operation of the plant to avoid danger for the system and its environment if the
outer threshold of the region of unacceptable performance is exceeded. The safety system and
the fault-tolerant controller work in separate regions of the signal space and satisfy comple-
mentary aims. In many applications, they represent two separate parts of the control system
and are usually implemented in separate units. This separation makes it possible to design
fault-tolerant controllers without the need to meet safety standards.

1.3 Fault-tolerant control architecture


A classical architecture of fault-tolerant control is depicted in fig. 1.4. The two blocks “diag-
nosis” and “controller re-design” carry out the two steps of fault-tolerant control previously
mentioned. The diagnostic block is a filter that uses the measured input and output and tests
their consistency with the plant model, giving as output a characterization of the fault with
sufficient accuracy for the controller re-design.
The re-design of the controller may result in new controller parameters, but it may also result in a new control configuration.

Figure 1.4: A fault-tolerant architecture (supervision level with diagnosis and controller re-design blocks, execution level with controller and plant).

Fault-tolerant control extends the usual feedback controller by a supervisor, which includes the diagnostic and the controller re-design blocks. If a
fault f occurs, the supervision level makes the control loop fault-tolerant. The diagnostic block
identifies the fault and the controller re-design block adjusts the controller to the new situation.
Fault tolerance can also be accomplished without the structure given in fig. 1.4, by means of well-established control methods:

• robust control: a fixed controller is designed in order to meet the closed-loop specifications while tolerating changes of the plant dynamics. This implies that the controlled system satisfies its goals also under faulty conditions. Fault tolerance is obtained without changing the controller parameters; it is, therefore, called passive fault tolerance. However, the theory of robust control has shown that robust controllers exist only for a restricted class of changes that may be caused by faults. Further, a robust controller works suboptimally for the nominal plant because its parameters are obtained as a trade-off between performance and robustness.

• adaptive control: the controller parameters are adapted to changes of the plant parameters caused by some fault (active fault tolerance). However, the theory of adaptive control shows that this principle is efficient only for plants that are described by linear models with slowly varying parameters. These restrictions are usually not met by systems under the influence of faults, which typically have a nonlinear behavior with sudden parameter changes.

1.4 An overview on fault diagnosis


1.4.1 Model-Based Diagnosis in the control community
The first task of fault-tolerant control concerns the detection and identification of existing faults.
Consider a dynamical system with input u and output y which is subjected to some fault f . The
system behavior depends on the fault f ∈ F . The diagnostic system takes as input the I/O pair
(U, Y ) and has to solve the following problem: for a given I/O pair (U, Y ), find the fault f
(Diagnostic Problem).
The diagnostic problem has to be solved under real-time constraints by exploitation of the
information included in a dynamical model and in the time evolution of the signals. For fault-
tolerant control, the location and the magnitude of the fault should be estimated in order to
address in an appropriate way the controller re-design phase.
Different names are used to distinguish the diagnostic steps:

• Fault detection: Decide whether or not a fault has occurred. This step determines the time
at which the system is subject to some fault.

• Fault isolation: Find in which component a fault has occurred. This step determines the
location of the fault.

• Fault identification and Fault estimation: Identify the fault and estimate its magnitude. This
step determines the kind of fault and its severity.

In order to be able to detect a fault, the measurement information (U, Y ) alone is not sufficient,
but a reference, which describes the nominal plant behavior, is necessary. This reference is
given by an explicit model of the plant, which describes the relation between the possible input
sequences and output sequences. This model is a representation of the plant behavior B. This
idea is called consistency-based diagnosis and can be explained in a very simple way. Consider again fig. 1.1 and assume that the current I/O pair is represented by point A in the figure. If the system is faultless (and the model is correct), then A lies in the set B. However, if the system is faulty, it generates a different output Ŷ for the given input U. If the new I/O pair (U, Ŷ) is represented by point C, which lies outside B, then the fault is detectable. In other words, the
principle of consistency-based diagnosis is to test whether or not the measurement (U, Y ) is


consistent with the nominal system behavior.
Consider the situation in fig. 1.5 and assume that the system behavior is known for the
faults f0 , f1 and f2 . The corresponding behaviors B0 , B1 and B2 are different, but they overlap,
i.e. there are I/O pairs that may occur for more than one fault. Then the diagnostic algorithm
has to distinguish between the different faults.


Figure 1.5: System subject to more than one fault.

If the I/O pair is represented by the points A, C or D in fig. 1.5, the faults detected are
f0 , f1 or f2 , respectively. If, however, the measurement sequences are represented by point
B, the system may be subjected to one of the faults f0 or f1 . The ambiguity of the diagnostic
result is caused by the system and not by the diagnoser, because the system generates the same
information for both faults. The question of whether or not a certain fault can be detected
concerns the diagnosability or fault detectability of the system (for more details see [12]).
For continuous-variable systems, which are usually described by differential equations or transfer functions, the principle of consistency-based diagnosis can be cast into the scheme
shown in fig. 1.6. The model is used to determine, for the measured input sequence U , the
model output sequence Ŷ . The consistency of the system with the model can be checked at
every time t by determining the difference

r(t) = y(t) − ŷ(t) ,

which is called a residual. In the faultless case, the residual vanishes or is close to zero. A non-
vanishing residual indicates the deviation between measurement and calculated values based
on system models and, hence, the existence of a fault. Diagnostic algorithms for continuous-
variable systems generally consist of two components:
1. Residual generation: The model and the I/O pair are used to determine residuals, which
describe the degree of consistency between the plant and the model behavior.

2. Residual evaluation: The residual is evaluated in order to detect, isolate and identify faults.

Figure 1.6: Consistency-based diagnosis scheme: plant, model and residual evaluation.

In both steps, model uncertainties, disturbances and measurement noise have to be taken into
account.
Fault detection and isolation employs analytical redundancy: the residual is found by using
more than one way for determining the variable y. The sensor value y is compared with the
analytically computed value ŷ. This procedure is used to avoid physical redundancy.
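As a minimal sketch of these two components, residual generation and threshold-based evaluation (assuming an illustrative first-order discrete-time model; all parameters and signals are hypothetical):

# Minimal residual generation and evaluation for a first-order discrete-time model
#   x[k+1] = a*x[k] + b*u[k],  y[k] = x[k]
# All parameters and signals below are illustrative assumptions.

def simulate(a, b, u_seq, x0=0.0):
    """Return the output sequence predicted by the model."""
    x, y_seq = x0, []
    for u in u_seq:
        y_seq.append(x)
        x = a * x + b * u
    return y_seq

a_nom, b_nom = 0.9, 1.0          # nominal model parameters
u_seq = [1.0] * 20               # measured input sequence U
# measured outputs: here generated by a "faulty" plant with b = 0.5 (actuator fault)
y_meas = simulate(a_nom, 0.5, u_seq)

y_model = simulate(a_nom, b_nom, u_seq)           # residual generation: model output ŷ
residuals = [y - yh for y, yh in zip(y_meas, y_model)]

THRESHOLD = 0.2                                   # residual evaluation: simple threshold test
fault_detected = any(abs(r) > THRESHOLD for r in residuals)
print("fault detected:", fault_detected)

In practice the threshold has to account for the model uncertainties, disturbances and measurement noise mentioned above.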
It is important to remark that the behavior of a dynamical system does not only depend on
the input but also on the initial state. Inconsistencies may result from a deviation of the initial
state of the model. As the initial state of the system is usually immeasurable, every diagnostic
problem includes a state observation or state estimation problem. Moreover, the disturbance d
that influences the plant is usually immeasurable. As it influences the plant behavior, it has to
be taken into account in the consistency check.
The reader interested in going into more details about fault detection and isolation problem
can refer to [38], [12] and [80] where an exhaustive survey on this topic is given by authors.

1.4.2 Model-Based Diagnosis in the Artificial Intelligence framework


Model-Based Diagnosis (MBD) is also an emerging area of Artificial Intelligence (AI) that seeks to develop algorithms which can perform diagnostic tasks on complex systems without human intervention. MBD techniques have been used to automatically diagnose faults on-board
spacecraft [106], [8], in automotive systems [89] and also to determine optimal placement of
sensors during design [68].
The AI approach to MBD puts the emphasis on reasoning engines that perform diagnostic tasks via on-line reasoning, inferring a system's global behavior from the automatic combination of local models of its components. For example, an MBD engine can be provided with a schematic of a circuit and the values of some of its inputs and outputs, and it can determine from only that information whether the circuit is malfunctioning, which components might be faulty and what additional information needs to be considered to identify the faulty components with relative certainty.
The component library used by an MBD engine describes the laws which govern the behav-
ior of the components. A resistor obeys Ohm’s law, a multiplier component obeys the constraint
that its output is the product of its inputs. Once provided with a component model library, the MBD
engine should be able to diagnose any system constructed out of known components.
MBD models are compositional: the model of a combination of two systems is directly
constructed from the models of the constituent systems. Consider the system illustrated in
fig. 1.7.

Figure 1.7: A, B, C, D and E are input terminals, F and G are output terminals, X, Y and Z
are internal probe points, M1 , M2 and M3 are multipliers and A1 and A2 are adders.

To model this system, the model library would include:

(MULTIPLIER M i j k) → k = i × j
(ADDER A i j k) → k = i + j

An MBD engine would be provided a structural description which might simply be character-
ized as:
(MULTIPLIER M1 A C X)
(MULTIPLIER M2 B D Y)
(MULTIPLIER M3 C E Z)
(ADDER A1 X Y F)
(ADDER A2 Y Z G).
The inputs to the diagnostic engine would be observations from the system: A = 3, B = 2, C = 2, D = 3, E = 3, F = 10, G = 12. From this information the MBD engine would determine that the following sets of components could be faulted: {A1}, {M1}, {A2, M2}, {M2, M3}, and any of their supersets. In addition, the most informative place to measure next is X because it distinguishes between two single faults.
This example illustrates some of the basic properties of MBD:

• A system model is provided in terms of components and their interconnections.

• The component models describe how each component behaves.

• A domain-independent reasoning engine calculates the diagnoses from the model.



• The system may have multiple dependent or independent faults.

• MBD can propose additional measurements to differentiate among diagnoses.

• MBD does not do any precomputation and works entirely online.

Consider the example system. Given inputs A = 3, B = 2, C = 2, D = 3, and E = 3,


by simple calculation (i.e., the inference procedure), X = 6, Y = 6, and F = X + Y = (A × C) + (B × D) = 12. Intuitively, a symptom is any difference between a prediction made by the inference procedure and an observation. Since F is measured to be 10,

“F is observed to be 10, not 12”

is a symptom. Symptoms drive diagnostic reasoning. Each symptom indicates that one or more components may be faulted. Intuitively, a conflict is a set of components which underlie a symptom.
Consider the symptom “F is observed to be 10, not 12”. The prediction that F = 12 de-
pends on the correct operation of A1 , M1 , and M2 , i.e., if A1 , M1 , and M2 were correctly func-
tioning, then F = 12. Since F is not 12, at least one of A1 , M1 , and M2 is faulted. Thus the set
{A1 , M1 , M2 } is a conflict for the symptom. The set {A1 , A2 , M1 , M2 }, and any other superset of
{A1 , M1 , M2 } are conflicts as well; however, no subsets of {A1 , M1 , M2 } are necessarily conflicts
since all the components in the conflict were needed to predict the value at F .
A diagnosis is a particular hypothesis for how the system differs from its model. For exam-
ple “A2 and M2 are broken” is a diagnosis which explains the two symptoms observed for the
example system. The size of the initial diagnosis space is exponential in the number of system
components. Any component could be working or faulty, thus the diagnosis space for the system initially consists of 2^5 = 32 diagnoses. Ultimately, the goal of diagnosis is to identify, and refine, the set of diagnoses consistent with the observations thus far. For more details on these concepts the reader is referred to [33]. These diagnostic concepts can be defined more formally
using First-Order Logic within the framework of [34].
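As a minimal sketch of this consistency-based reasoning, the following hand-coded version of the polybox example above (an illustrative assumption, not an actual MBD engine) enumerates the minimal diagnoses by checking which sets of working components are consistent with the observations:

# Hand-coded consistency-based diagnosis for the circuit of fig. 1.7
# (illustrative sketch only; a real MBD engine reasons from the component library).
from itertools import combinations, product

COMPONENTS = ["M1", "M2", "M3", "A1", "A2"]
obs = {"A": 3, "B": 2, "C": 2, "D": 3, "E": 3, "F": 10, "G": 12}   # observed terminals

def consistent(ok):
    """True if some assignment of the internal wires X, Y, Z satisfies the
    constraints of all components assumed to be working (those in `ok`)."""
    for x, y, z in product(range(0, 21), repeat=3):   # small search space suffices here
        if all([
            ("M1" not in ok) or x == obs["A"] * obs["C"],
            ("M2" not in ok) or y == obs["B"] * obs["D"],
            ("M3" not in ok) or z == obs["C"] * obs["E"],
            ("A1" not in ok) or obs["F"] == x + y,
            ("A2" not in ok) or obs["G"] == y + z,
        ]):
            return True
    return False

# Enumerate minimal diagnoses: the smallest fault sets that restore consistency.
diagnoses = []
for size in range(len(COMPONENTS) + 1):
    for faulty in combinations(COMPONENTS, size):
        ok = set(COMPONENTS) - set(faulty)
        if not any(set(d) <= set(faulty) for d in diagnoses) and consistent(ok):
            diagnoses.append(faulty)

print(diagnoses)   # [('M1',), ('A1',), ('M2', 'M3'), ('M2', 'A2')]

The printed sets coincide with the single and double faults listed above; every superset of one of them is also a (non-minimal) diagnosis.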

1.5 Some ideas on controller re-design


Controller re-design considers the problem of changing the control structure and the control law after a fault has occurred in the plant. The aim of this step is to change the control law in order to satisfy the requirements on the closed-loop system despite the faulty behavior of the plant.
Consider the situation in fig. 1.8(a). The faultless plant has the behavior B0 and the controller the behavior Bc². Since the I/O pairs of the closed-loop system are consistent with both the
plant and the controller, the behavior of the closed-loop system is given by the intersection
B0 ∩ Bc .
Also the control specifications can be formulated in the behavioral setting, by describing all those I/O pairs that meet the requirements. The set of these I/O pairs is denoted by Bspec. If B0 ∩ Bc ⊂ Bspec, the closed-loop system satisfies the performance specifications. When the plant becomes faulty, it changes its behavior, which is now given by the set Bf. Hence, the closed-loop system behavior changes to become Bf ∩ Bc, which may no longer be a subset of Bspec (see the case in fig. 1.8(b)). Hence, the controller has to be re-designed in order to restrict the behavior of the faulty system to the set Bspec.
² Extending the concept of behavior to the controller, the set Bc describes the I/O pairs (U, Y) that satisfy the control law.

(a) Behavior of the faultless closed loop system. (b) Behavior of the faulty closed loop system.

Figure 1.8: Control of a system in healthy/faulty condition.
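In this notation the re-design task can be restated compactly (a display-form summary of the conditions already given above):

B0 ∩ Bc ⊂ Bspec   (faultless case: the closed loop meets the specification),
Bf ∩ Bc ⊄ Bspec   ⟹   find a new controller behavior Bc′ such that Bf ∩ Bc′ ⊂ Bspec.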

There may be faults for which the behavior Bf does not overlap with Bspec . If this is the case
a new control configuration has to be chosen, which changes the signals under consideration
and, hence, the behavior of the plant. There may be faults for which no controller can make
the closed-loop system satisfy the specification and the system has to be shut off. Hence, the
question whether a fault-tolerant controller exists is not a property of the controller or the
control re-design method, but a property of the plant subject to faults³.

³ An illustrative example of an unsolvable fault-tolerant control problem is a plant whose unstable modes become uncontrollable or unobservable due to a fault: then no controller exists which stabilizes the faulty plant.

Figure 1.9: A fault accommodation architecture.

Two principal ways of controller re-design have to be distinguished: fault accommodation and reconfiguration.


• fault accommodation: Fault accommodation means adapting the controller parameters to the dynamical properties of the faulty plant. The input and output of the plant used in the control loop remain the same as in the faultless case (see fig. 1.9) and the set U × Y of input and output sequences is not changed. A simple way of fault accommodation is based on predesigned controllers, selected off-line for a specific fault; the re-design step then simply switches among the different control laws (a minimal sketch of such a switching scheme is given after this list). This step is quick and can meet strong real-time constraints. However, the controller re-design has to be made for all possible faults before the system is put into operation and all resulting controllers have to be stored in the control software (for more details see [12]).

Figure 1.10: A control reconfiguration architecture.

• control reconfiguration: if fault accommodation is impossible, the complete control loop has to be reconfigured. Reconfiguration includes the selection of a new control configuration in which alternative input and output signals are used. The selection of these signals depends upon the existing faults, and a new control law has to be designed on-line (see fig. 1.10). Control reconfiguration is necessary after severe faults have occurred that lead to serious structural changes of the plant dynamics. The necessity of control reconfiguration is particularly obvious if sensor or actuator faults are considered. If these components fail completely, the fault leads to a break-down of the control loop; alternative actuators or sensors have to be found which are not affected by the fault and which have similar interactions with the plant, so that a reasonably selected controller is able to satisfy the performance specifications on the closed-loop system (for more details see [12]).
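As a minimal sketch of the predesigned-controller switching idea mentioned in the fault accommodation item above (gains, fault labels and control laws are purely illustrative assumptions):

# Fault-accommodation supervisor: switch among predesigned control laws
# according to the fault label reported by the diagnosis block.
# All gains and fault labels below are illustrative assumptions.

CONTROLLERS = {
    "f0": lambda e: 2.0 * e,        # nominal controller (faultless case)
    "f1": lambda e: 0.8 * e,        # predesigned for fault f1 (e.g. reduced actuator gain)
    "f2": lambda e: 1.5 * e + 0.1,  # predesigned for fault f2 (e.g. sensor offset)
}

def control_action(diagnosed_fault, y_ref, y):
    """Select the predesigned control law for the diagnosed fault and compute u."""
    law = CONTROLLERS.get(diagnosed_fault, CONTROLLERS["f0"])
    return law(y_ref - y)

u = control_action("f1", y_ref=1.0, y=0.4)   # the diagnosis block reports fault f1
print(u)                                      # 0.48, computed by the law predesigned for f1

The point is only that the re-design step reduces to a table lookup and a switch, which is what makes this form of accommodation fast enough for strong real-time constraints.
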

Fault-tolerant control makes intelligent use of the redundancies included in the system and in
the information about the system in order to increase the availability of the system. It utilizes an
analytic redundancy, which is cheaper than duplicating all vulnerable components. Of course
no method can guarantee a complete description of all possible faults of a system. Hence, no
100% fault tolerance is possible. However, for many applications, complete fault tolerance is
not necessary.

1.6 Why distributed systems?


From a functional point of view, there is practically no difference whether a task is implemented
using a centralized or decentralized architecture. Of course a decentralized architecture has to
be preferred for the implementation of hard real-time systems⁴.
The most important property of distributed systems is composability, which means that system properties follow from subsystem properties. Moreover, distributed systems are scalable; scalability means that there are no limits to the extensibility of a system and that, at the same time, the complexity of reasoning about the proper operation of any system function does not depend on the system size. However, the property which makes distributed systems perfect candidates for investigation in this work is dependability: by designing a distributed system it is possible to implement well-defined error-containment regions, achieving in this way fault tolerance.
Distributed systems follow the architectural principle known as form follows function, asserted by Vitruvius in 27 B.C. in his ten books on architecture⁵, which states that the function of an object should determine its physical form. In a distributed architecture it is feasible to encapsulate a logical function and the associated computer hardware into a single unit: a node. A real-time distributed system consists of a set of nodes interconnected by a real-time communication network. Viewed from a higher level, a node can be replaced by an abstraction of its functional and temporal properties, hiding the irrelevant details of the implementation. If there is a one-to-one mapping between functions and nodes, the cause of a malfunction can be immediately diagnosed and the faulty node pinpointed. On the other hand, it is possible to foresee which functions will be affected in case of an error in a node. Such a simple analysis is impossible in centralized systems, where there does not exist a mapping between resources (hardware, operating system, firmware, software) and functions.

Figure 1.11: Real-time distributed application (operator cluster with man-machine interface, computational cluster, and controlled object cluster with instrumentation interface).

⁴ A real-time system is a system in which the correctness of the system behavior depends not only on the logical results of the computations but also on the physical instant at which these results are produced (deadline). Real-time systems in which there exists at least one firm deadline which could produce a "catastrophe" if missed are called hard real-time.

⁵ In his treatise "De Architectura", Marcus Vitruvius Pollio wrote down on ten scrolls everything he knew about architecture. He presented this work, known today as "Ten Books on Architecture", to Emperor Augustus in the hope of changing what he perceived as a rampant lack of professionalism and educational rigor in the practice of architecture.

Fig. 1.11 shows the hardware architecture of a distributed application. As the reader can see, a distributed application

can be decomposed into a set of clusters: the operator cluster, the computational cluster and the
controlled object cluster. Generally the computational cluster is implemented as a distributed
computer system and it has the structure shown in fig. 1.12: a set of nodes interconnected by
a real-time communication system.

Figure 1.12: Distributed computer system (nodes interconnected by a real-time communication system).

Considering a single node, it can be partitioned into at least two subsystems, the local communication controller and the host computer (fig. 1.13). The set
of all the communication controllers of the nodes within a cluster forms the real-time commu-
nication system of the cluster. The interface between the communication controller within a
node and the host computer of the node is called communication network interface (CNI).
The purpose of the real-time communication system is to transport messages from the CNI of the sender node to the CNI of the receiver node within a predictable time interval, with
a small latency time and with high reliability. The communication system must ensure that
the contents of the messages are not corrupted. From the point of view of the host computer,
the details of the protocol logic and the physical structure of the communication network are
hidden behind the CNI.

Figure 1.13: Structure of a node (host computer, communication network interface, communication controller, communication network).

It is easy to understand that the communication system is a critical

resource of a distributed system, since the loss of communication means the loss of all global
system services. There are different alternatives available for the design and implementation
of the communication systems: a single channel system (bus or ring) or a multiple channel
system (mesh network). Communication reliability can be increased by message retransmis-
sion in case of a failure, or by replicating messages so that the loss of a message can be masked. If communication channels are replicated, the permanent loss of a channel can be tolerated. It is not the purpose of this work to go into more detail about the design, implementation and problems of real-time distributed systems; the interested reader can learn more in [54] and the references therein.

1.6.1 Functional requirements


The functional requirements of a distributed system are concerned with the functions that it
must perform. They are grouped into data collection requirements, direct digital control re-
quirements, and man-machine interaction requirements.

Data collection
A controlled object changes its state as a function of time; if we freeze it, we can describe the
current state of the controlled object by reading the values of its state variables at that moment.
Normally we are interested in a subset of state variables that is significant for our purpose (real-
time entity). Each real-time entity is in the sphere of control of a subsystem, i.e. it belongs to a subsystem that has the authority to change its value. Outside its sphere
of control the value of a real-time entity can be observed but not changed. The first functional
requirement of a real-time distributed system is the observation of the real-time entities in a
controlled object and the collection of these observations (real-time image). Since the state of the
controlled object is a function of time an image is only temporally accurate for a limited amount
of time. The first step in observation is signal conditioning, i.e. all the processing steps needed
to obtain meaningful measured data of a real-time entity from the raw sensor data. After
signal conditioning the measured data must be checked for plausibility and related to other
measured data to detect possible faults in sensors. Another important requirement in data
collection is alarm monitoring, i.e. the monitoring of the real-time entities to detect abnormal
process behaviors. The computer system must detect and display these alarms and assist the
operator in identifying the primary event which was the initial cause of these alarms.

Direct digital control


Real-time computer systems must calculate set points for the actuators and control the con-
trolled object directly (Direct digital control). Control applications are highly regular, consisting
of an infinite sequence of control periods, each one starting with the sampling of real-time en-
tities, followed by the execution of the control algorithm and subsequently by the output of the
control action to the actuator. Obviously the design of a proper control algorithm must include
the compensation of random disturbances that could perturb the controlled object.

Man-machine interaction
The real-time computer system must inform the operator of the current state of the controlled
object and must assist the operator in controlling the machine. This is achieved via the man-
machine interface. This critical subsystem contains an extensive data logging and data reporting
subsystem. The reader interested in this important topic is referred to [37].

1.6.2 Temporal requirements


The most stringent temporal demands for real-time systems have their origin in the require-
ments of the control loop. Consider a controlled object in equilibrium and suppose we change

the desired set-point according to a step function. There are two important parameters that
characterize the step response functions we obtain from our controlled object (see fig. 1.14):
the object delay dobj after which the measured variable begins to rise and the rise time drise (ap-
proximately the time after which the measured variable reaches the new equilibrium point).
The controlling computer must sample the measured variable to detect deviations from the de-

Figure 1.14: Delay and rise time in a step response.

sired value. The constant amount of time between two sample points is called sampling period
(dsample ). We expect the digital system to behave like a continuous system and this implies
the sampling period to be less than one tenth of the rise time. The computer compares the mea-
sured value with the set point selected by the operator, calculates the error term and from
this computes the new value of the control variable by the control algorithm. The amount of
time consecutive to the sample point, after which the controlling computer will output the new
control value is called computer delay (dpc ). The computer delay should be smaller than the sam-
pling period. The difference between the maximum and the minimum values of the delay is
called the jitter of the delay (∆dpc ). The dead time (td ) is the time interval between the observation of
the real-time entity and the start of a reaction of the controlled object due to a computer action
based on this observation. As the reader can easily deduce, it is the sum of the controlled object
delay, which is in the sphere of control of the controlled object and it is thus determined by its
dynamics, and the computer delay, which is determined by the computer implementation. To
reduce the dead time in a control loop and to improve the stability of the control loop, these
delays should be as small as possible. The scenario explained above is pictured in fig. 1.15.
Hard real-time systems are by definition safety-critical. Hence it is important that any error
within the control system (loss or corruption of a message, failure of a node etc.) is detected
within a short time with a high probability. The required error-detection latency must be in the
same order of magnitude as the sampling period of the fastest critical control loop. In this case
it is possible to perform some corrective action or bring the system into a safe-state before the
consequences of an error can cause any severe system failure.
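To make these rules of thumb concrete, the following sketch checks an invented timing budget against the constraints discussed above; all numerical values (rise time, delays, sampling period) are assumptions made only for illustration.

# Hypothetical timing budget of one control loop (all values in milliseconds).
# The numbers are invented; real values come from the plant dynamics and from
# the worst-case execution and communication times of the implementation.
d_rise = 800.0                   # rise time of the controlled object
d_obj = 120.0                    # object delay
d_sample = 50.0                  # chosen sampling period
d_pc_min, d_pc_max = 4.0, 9.0    # best-case / worst-case computer delay

jitter = d_pc_max - d_pc_min     # jitter of the computer delay
dead_time = d_obj + d_pc_max     # worst-case dead time t_d

checks = {
    "d_sample <= d_rise / 10 (quasi-continuous behavior)": d_sample <= d_rise / 10,
    "d_pc_max < d_sample (output ready before next sample)": d_pc_max < d_sample,
}

print(f"jitter = {jitter} ms, worst-case dead time = {dead_time} ms")
for rule, satisfied in checks.items():
    print(f"{rule}: {'OK' if satisfied else 'VIOLATED'}")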

1.6.3 Dependability requirements


With dependability requirements we mean the “meta-functional” attributes of a computer
system related to the quality of service a system delivers to its users during an extended interval
of time (see [57]). The measures of dependability that are of importance are described in the
following.


Figure 1.15: Time requirements in real-time systems.

Reliability
The reliability of a system R(t) is the probability that a system will provide the specified service
until time t, given that the system was operational at time t = t0 . If a system has a constant
failure rate λ (failures per hour), then the reliability at time t is given by

R(t) = e^{-λ·(t - t0)} .

The inverse of the failure rate (1/λ) is called mean time to failure (MTTF). All these concepts will
be reviewed and applied to distributed systems in chapter 3. Moreover the reader is referred
to [9], [59] and [46] to go into more details.
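A minimal numerical sketch of these definitions, assuming a constant (and purely illustrative) failure rate:

import math

lam = 1e-4      # constant failure rate in failures per hour (invented value)
t0 = 0.0        # instant at which the system is known to be operational

def reliability(t: float) -> float:
    """R(t) = exp(-lambda * (t - t0)) for a constant failure rate."""
    return math.exp(-lam * (t - t0))

mttf = 1.0 / lam    # mean time to failure

print(f"MTTF = {mttf:.0f} h")
for t in (100.0, 1_000.0, 10_000.0):
    print(f"R({t:.0f} h) = {reliability(t):.4f}")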

Safety
With the term safety we refer to critical failure modes. In such a failure mode the cost of the
failure can be orders of magnitude higher than the utility of the system during normal operation.
Safety-critical real-time systems must have a failure rate with regard to critical failure modes
that conforms to ultrahigh reliability requirements (in the order of 10^-9 failures per hour or lower). Similar failure rates are required in flight-
control systems, train-signaling systems, nuclear plant monitoring systems etc.

Maintainability
With the term maintainability we mean the measure of the time required to repair a system
after the occurrence of a failure. Maintainability is measured by the probability M (d) that the
system is restored within a time interval d after the failure. As for reliability, a constant repair
rate µ (repairs per hour) and a mean time to repair (MTTR) are introduced to define a quantitative
maintainability measure.

Availability
Availability is a measure of the delivery of correct service with respect to the alternation of
correct and incorrect service and is measured by the fraction of time that the system is ready to

provide the service. In a system with constant failure and repair rates, the reliability (MTTF),
maintainability (MTTR) and availability (A) are related by:

A = MTTF / (MTTF + MTTR) .
A high availability can therefore be obtained either by a long MTTF or by a short MTTR.
The sum MTTF+MTTR is called mean time between failures (MTBF); the situation is sketched
in fig. 1.16.


Figure 1.16: Relationship between MTTF, MTTR and MTBF.
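Continuing with the same kind of illustrative numbers, availability and MTBF follow directly from MTTF and MTTR:

mttf = 10_000.0     # mean time to failure in hours (1/lambda, invented value)
mttr = 4.0          # mean time to repair in hours (1/mu, invented value)

availability = mttf / (mttf + mttr)
mtbf = mttf + mttr

print(f"A = {availability:.6f}  (fraction of time the service is delivered)")
print(f"MTBF = {mtbf} h")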

Security
The security attribute is concerned with the ability of a system to prevent unauthorized access to
information or services. This attribute is usually associated with large databases, but during the
last few years this issue has become important also in real-time systems. There are difficulties
in defining a quantitative security measure.

1.7 Properties of distributed real-time systems


1.7.1 Composability
Large systems are built by integrating a set of well-specified and tested subsystems. It is im-
portant that properties that have been established at the subsystem level are maintained during
system integration. Such a constructive approach to system design is possible if and only if the
architecture supports composability. An architecture is said to be composable with respect to a
specified property if the system integration will not invalidate this property once the property
has been established at the subsystem level. In other words in a composable architecture the
system properties follow from the subsystem properties.
In a distributed real-time system, the integration effect is achieved by interactions among
different nodes. Therefore the communication system has a central role in determining the
composability of a distributed architecture with respect to the temporal properties. There ex-
ist two possibilities to implement temporal control: if a communication system transports event
messages (event-triggered communication system), the temporal control is external to the com-
munication system. It is within the sphere of control of the host computers to decide when

a message must be sent. In a time-triggered communication system temporal control resides


within the communication system and it is not dependent on the application in the nodes.

1.7.2 Scalability
Evolving requirements are the rule in large distributed systems. Existing functions must be
changed or new functions added over the lifetime of the system. A scalable architecture is open
to such changes, and does not have any limit on its extensibility. Only distributed architectures
provide the necessary framework for unlimited growth:

1. Nodes can be added within the given capacity of the communication channel, introduc-
ing additional processing power to the system.

2. If the communication capacity within a cluster is fully utilized, or if the processing power
of a node has reached its limit, a node can be transformed into a gateway node to open
a way to a new cluster (see fig. 1.17). The interface between the original cluster and the
gateway node can remain unchanged.


Figure 1.17: Transparent expansion of a node into a new cluster.

Of course a key point in designing large scalable systems is the complexity. Large systems can
be built if the effort required to understand the system operation (i.e. the complexity of the system) remains under control as
the system grows. The complexity is related to the number of parts and the number and type
of interactions among the parts that must be considered to understand a particular function of
the system. In a scalable system the effort required to understand any function should remain
constant and independent from the system size. The only difference with a small system should
be in the number of different functions that a large system can provide. In other words the effort
needed to understand all functions of a large system grows with the system size.

1.7.3 Dependability
Implementing a dependable real-time service requires distribution of functions to achieve effective
fault containment and fault tolerance so that the service can continue despite the occurrence
of faults. In [65] the authors define a responsive system as a system which has all three attributes:
distribution, fault tolerance and real-time performance.
A fault tolerant system must be structured into partitions that act as error-containment regions
so that the consequences of faults that occur in one of these partitions can be detected, corrected

or masked before these consequences corrupt the rest of the system. An error-containment
region must implement a well-specified service; this service should be provided to the outside
world through a small interface, so that an error in the service can be detected at this interface.
In a large distributed computer system it comes natural to regard a complete node as an
error-containment region and to perform the error detection at the node’s message interface
to the communication system. With this in mind it is easy to understand that implementing
error-containment regions in centralized systems is a hard task because system resources are
multiplexed over many services.
Not all the faults in a large distributed system are equally critical. An interesting classifica-
tion arising from the aircraft industry is the following:
1. Catastrophic: Fault that prevents continued safe operation of the system and can be the
cause of an accident.

2. Hazardous: Fault that reduces safety margin of the redundant system to an extent that
further operation of the system is considered critical.

3. Major: Fault that reduces the safety margin to an extent that immediate maintenance must
be performed.

4. Minor: Fault that has only a small effect on the safety margin. From the safety point of
view it is sufficient to repair the fault at the next scheduled maintenance.

5. No effect: Fault that has no effect on the safety margin.


The key concern in this categorization is the remainder of the safety margin after the occurrence
of a primary fault. The consequence of a fault is an error, which is damage to the system state.
An error in a non-safety critical subsystem must be detected before it can propagate into a safety
critical subsystem. In this sense the issue of error containment plays a crucial role. Usually
distributed architectures (e.g. the one shown in fig. 1.18) support both critical (shaded) and
non-critical (unshaded) system functions. The architecture must ensure that a fault in a non-
critical node cannot affect the correct operation of a critical system function. A key-word to


Figure 1.18: Critical (shaded) and non-critical (unshaded) system functions.

implement fault-tolerance in distributed systems is replication. A node must represent a unit


of failure preferably with a simple failure mode (e.g. fail-silence); all inner failure modes of a
node are mapped into a single external failure mode. Under this hypothesis node failures can

be masked by providing actively replicated nodes which are supposed to show a deterministic
behavior (replica determinism, i.e. replicated nodes visit the same states at about the same time).

1.8 Failures, errors and faults in distributed systems: nomenclature


The aim of this section is to give a short overview of basic concepts of fault tolerant computing.
Most of these concepts are taken from [57]; for more details the interested reader is referred to this
text. The key points are the following three terms: fault, error and failure (see fig. 1.19).


Figure 1.19: Faults, errors and failures: faults and errors are states while failures are events.

1.8.1 Failures
Whenever the service of a system, as seen by the user, deviates from the agreed specification,
the system is said to have failed. A failure is an event that denotes a deviation between the actual
service and the specified or intended service, occurring at a particular point in real time. It can
be classified using the following criteria:
• Failure nature: we distinguish between value failures and timing failures. A value failure
means that an incorrect value is presented at the system-user interface, while a timing
failure means that a value is presented outside the specified interval of real-time.
• Failure perception: we distinguish between consistent failures and inconsistent failures. In a
consistent failure scenario all the users see the same wrong result. For example a consis-
tent failure scenario is when a subsystem either produces correct results or no results at
all; we will call this scenario a fail-silent failure scenario. In an inconsistent failure situation
different users may perceive different false results (sometimes these failures are also called
two-faced failures, malicious failures or Byzantine failures).
• Failure effect: we distinguish between benign failures and malign failures. A benign failure
can only cause failure costs that are of the same order of magnitude as the loss of the
normal utility of the system. Malign failures can cause a catastrophe such as the crash
of an airplane. We call safety-critical applications those applications where malign failures
can occur.
• Failure oftenness: we distinguish between permanent failures and transient failures. A per-
manent failure is a failure after which the system ceases to provide a service until an
explicit repair action has eliminated the cause of the failure. If a system continues to op-
erate after the failure we call it a transient failure. A frequently occurring transient failure
is called an intermittent failure.

1.8.2 Errors
System failures can be traced to an incorrect internal state. We call such an incorrect internal
state an error. Therefore an error is an unintended state. If the error exists only for a short
interval of time and then disappears without an explicit repair action it is called a transient
error while if it persists until an explicit repair action removes it, we call it a permanent error.
Transient errors form the predominant error class in many computer systems. In a fault
tolerant architecture every error must be confined to an error containment region to avoid the
propagation of the error throughout the system. It is the aim of the error detection interfaces to
protect the boundaries of the error containment region.

1.8.3 Faults
The cause of an error and hence the indirect cause of a failure is called a fault. Faults can be
classified using the following criteria:

• Fault nature: a fault that has its origin in a chance event (e.g. random break of a wire)
is called a chance fault. If a fault can be traced to an intentional action by someone (e.g. a
Trojan horse introduced by a programmer in order to break the security of a system) the
fault is called an intentional fault.

• Fault perception: a fault can be caused either by some physical phenomenon or by an


error in the design (e.g. some mistakes in the specifications). In the first case the fault is
called physical fault, while in the second case it is called design fault.

• Fault boundaries: it is useful to distinguish whether a fault is caused by a deficiency


within the system (internal fault) or by some external disturbances (external faults).

• Fault origin: faults that have their origin in the incorrect development of the system (de-
velopment faults) must be distinguished from faults that are related to system operation
(operation faults).

• Fault persistence: if the fault exists only for a short interval of time and then disappears
by itself it is called a transient fault. On the contrary, if it persists we call it a permanent
fault.

1.9 Fault-tolerance in distributed systems


1.9.1 State of the art
Fault tolerance is a key issue in safety-critical real-time systems because a single component
failure may lead to a catastrophic system failure. As stated in the previous section, a node is an
appropriate unit of failure, i.e. it implements a self-contained function so that the established
architecture principle “form follows function” can be maintained even in a failure scenario.
The node implementation must map all internal node faults into simple external failure modes.
Node failure detection and isolation is then performed “just outside” the node
at the interface level and a set of replica-determinate nodes is grouped together to form a fault
tolerant unit that masks a failure of one of its nodes.
The designer of a safety-critical system has two options to implement the necessary fault
tolerance:

(i) At the architecture level, transparent to the application code. This type of tolerance is
called systematic fault tolerance: the architecture must provide replica determinism so that
fault tolerance can be achieved by the temporal or spatial replication of computations to
detect and mask the faults specified in the fault hypothesis.

(ii) At the application level, within the application code. We will call this type of fault toler-
ance application-specific fault tolerance: it mixes the normal processing functions with the
error-detection and fault-tolerance functions at the application level.

The first problem in achieving fault tolerance for distributed systems is error detection. An
error is a discrepancy between the intended correct state and the current state of a system.
The goal of the fault-tolerant system is to detect and mask or repair errors before they show
up as failures at the system user service interface. Error detection requires that along with the
information about the current state knowledge about the intended state of a system is available.
This knowledge can arise from two different sources: from some a priori knowledge or from the
comparison between redundant computational channels.
The more is known a priori about the properties of correct states and the temporal patterns
of correct behavior of a computation, the more effective are the error detection techniques.
This means that if a subsystem has to be flexible in the temporal domain and in the value
domain, then error detection based on a priori knowledge is hardly possible. Techniques for
error detection based on a priori knowledge are:

• Syntactic knowledge about the code space: parity bits, error-detecting codes in memory, cyclic
redundancy checks (CRC, an extra field in a message for the purpose of detecting value errors)
in data transmission, check digits at the man-machine interface. Such codes are very effective
in detecting the corruption of a value stored in memory or in the transmission of a value over
a computer network (a minimal CRC example is sketched after this list).

• Assertions and acceptance tests: application specific knowledge about the restricted ranges
and the known interrelationship of the values of the entities can be used to detect addi-
tional errors that are undetectable by syntactic methods.

• Activation patterns of computations: knowledge about the regularity in the activation pat-
tern of a computation can be used to detect errors in the temporal domain.

• Worst case execution time of tasks: to detect task errors in the temporal domain.
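The sketch below illustrates the first bullet above: a CRC-32 is appended to a message and recomputed on reception, so that a corrupted bit is detected at the interface. It is only a toy example of syntactic redundancy, not a real communication protocol.

import zlib

def protect(payload: bytes) -> bytes:
    """Append a CRC-32 field to the payload (syntactic redundancy)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check(frame: bytes) -> bool:
    """Recompute the CRC on reception and compare it with the transmitted one."""
    payload, received_crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    return zlib.crc32(payload) == received_crc

frame = protect(b"set-point=42.0")
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]   # one bit flipped in transit

print(check(frame))       # True: frame accepted
print(check(corrupted))   # False: value error detected at the interface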

On the other hand, there are many different possible combinations of hardware, software
and time redundancy that can be used to detect different types of errors by performing the
computation twice.

1.9.2 A node as a unit of failure


As stressed previously, a node is a self-contained unit that provides a function across a small
well-defined external interface. A failure of a node thus corresponds to the failure of the func-
tion of the node. From an architectural point of view a node should display simple failure
modes; in the optimal case it should exhibit only fail-silent failures, i.e. the node is either op-
erational or not. If this is the case the fault-tolerance mechanism at the architectural level must
perform two major tasks:

1. Membership service: to detect a node failure and to report this node failure consistently to
all operating nodes of the cluster within a short latency.

2. Redundancy management: to mask the node failure by active redundancy and to reintegrate
repaired nodes into cluster as soon as they become available again.

From the node point of view, a node must detect all internal failures within a short latency
and must map these failures to a single external failure mode (preferably a fail-silent failure
mode). After an exception has been detected, control is transferred to an exception handler.
After the exception handler has terminated, control is either resumed from the point of excep-
tion or the task is terminated. The purpose of a fault-tolerant unit (FTU) is to mask the failures


Figure 1.20: Fault-tolerant unit consisting of two fail-silent nodes.

of a node. If a node implements the fail-silent abstraction then the duplication of nodes is suf-
ficient to tolerate a single node failure. A fail-silent node either produces correct values or does
not produce any results at all. For example in a time-triggered architecture an FTU that consists
of two fail-silent nodes produces either zero, one or two correct result messages. If it produces
no messages it has failed. If it produces one or two messages it is operational. The receiver
must discard redundant result messages (see fig. 1.20).
If the node does not implement fail-silence, but can exhibit value errors at the host/network
interface then triple modular redundancy must be implemented. In this case we must assume
that the behavior of the nodes is replica determinate. More in detail, the FTU must consist of
three nodes and a voter. The voter decides and masks errors in one step comparing the three
independently computed results and then selecting the result that has been computed by the
majority (see fig. 1.21 in which a two out of three triple modular redundant configuration is
shown).
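A minimal sketch of the two-out-of-three voting logic of such an FTU; the tolerance and the replica values are invented, and a real voter would of course also deal with timing and with missing messages.

from itertools import combinations

def vote(results, tol=1e-6):
    """Return a value computed by at least two of the replicas.

    If no two replicas agree, the fault hypothesis is violated and an
    exception is raised instead of propagating an arbitrary value.
    """
    for i, j in combinations(range(len(results)), 2):
        if abs(results[i] - results[j]) <= tol:
            return results[i]
    raise RuntimeError("no two replicas agree: unmaskable failure")

print(vote([10.01, 10.01, 10.01]))   # all replicas agree
print(vote([10.01, 73.20, 10.01]))   # the value error of one replica is masked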
If no assumptions can be made about the failure behavior of a node (Byzantine failure)
then four nodes are required to form a fault-tolerant unit. These four nodes must execute a
Byzantine-resilient agreement protocol to agree on a malicious failure of a node (see [82]).


Figure 1.21: Fault-tolerant unit consisting of three nodes and voters.

1.10 Bibliographic notes


Fault detection issues have been studied since the late 70’s; [49] and [81] are two of the earliest
descriptions of the field. A nice overview of the current state of the art for continuous-variable
systems, for which diagnostic methods are mainly based on state observation, on the parity
space approach and on parameter estimation techniques is contained in [80] and [45]. The
monograph [94] gives a thorough introduction into fault diagnosis by means of identification
techniques.
Fault accommodation methods have been developed in the 1990s based on robust and
adaptive control. Surveys of these methods are given in [79], [13], [11] and [88]. A major
impulse for the development of new methods has been given by the COSY-benchmark prob-
lem published in [111]. Solutions to these problems have been obtained by alternative
methods; see, among others, [15], [19] and [21]. A recent survey text on all these concepts is [12].
An important reference text on real-time systems is [95] where the authors discuss the key con-
cepts and requirements of real-time computing. A good survey about distributed systems is
contained in [54] and [71]. Moreover the International Federation of Automatic Control (IFAC)
organizes a periodic Workshop on Distributed Computer Control System (WDDCS) and the
IEEE Computer Society organizes each year the IEEE Distributed Systems Symposium. Pro-
ceedings of these workshops are a valuable source of information. Concerning the topic of
fault-tolerance in distributed systems, a good introduction to the problem can be found in [58],
[53], [51] and [32].
Chapter 2
Fault tolerant architecture for distributed
systems

In this chapter a possible distributed architecture for fault tolerant con-
trol of complex systems will be presented. The key feature of this archi-
tecture is modularity. Modularity is obtained following functional
reasoning. Different functions which can be degraded by faults are as-
sociated with fault tolerant (control/measure) modules that are able to
detect the fault and counteract its effect in order to assure cer-
tain performances in the execution of the required function. Each module
is provided with a reconfiguration manager, namely a supervisor which
orchestrates the reconfiguration within the module. All the modules are
orchestrated together via high level supervisors (global/group resource
and reconfiguration managers) whose aim is also to allocate the functions
optimally over the physical resources, considering that these have a limited
capacity and can also be affected by faults. Some ideas on the use of
classical failure analysis tools to design the proposed architecture will be
presented.

2.1 The distributed nature of Fault Tolerant Control System


The aim of this section is to present the modular architecture for fault tolerant control of distributed
systems developed in the framework of the European Project IFATIS (Intelligent Fault Tolerant
Control in Integrated Systems) (see also [4, 66, 16]). This architecture is sketched in fig. 2.1. It
consists of three modular levels: level 0 is the plant level, which is interconnected through
interfaces to the bus and to level 1, which is the control level, consisting of control functions,
monitoring and reconfiguration functions. Above this level there are two supervision levels
(levels 2 and 3) whose aim is to monitor the performances of the system and to manage physical
resources. From a functional point of view, large control systems consisting of plants and controllers can
be divided into partial processes (or process units) as functional units. Such partial processes



Figure 2.1: IFATIS modular architecture for fault tolerant control of distributed systems


Figure 2.2: IFATIS architecture. Interface signals have the following meaning: A actuating
information (mode dependent), C cross communication between FTC/FTM modules, D diag-
nosis results, m actual and acceptable modes, M mode decisions and resource allocations, N
resource needs and urgencies for all possible modes, R resource specific information, S sensor
information (mode dependent).

fulfil a specified purpose within the whole control system and run on a set of physical equip-
ment components. The set of plant components and controller modules might be regarded as
resources which can be allocated to the partial processes. Fig. 2.2 shows the overall structure of
a fault tolerant system, as proposed in [64].
Each partial process consists of plant functions, running on plant components, and (fault-
tolerant) control functions, running on controller modules. To run, they need allocation of
resources, i.e. plant components and controller modules. This allocation is dynamic, dependent
on mode and reconfiguration decisions of higher levels (e.g. group and global resource and
reconfiguration managers). These decisions can change sensor and actuator information, which
is exchanged between plant functions and control functions.

2.1.1 Fault Tolerant Control/Measure modules


A Fault Tolerant Control/Measure (FTC/M) module in fig. 2.2 stands for a possible hierarchical
structure of FTC/M modules. Each module is of the form described in the section below. It has
a specific function to achieve. Each function can be performed in different modes depending on
how well it is realized (nominal working mode or degraded working modes). This hierarchical
structure of FTC/M modules receives sensor information (S) and issues actuator information
(A). It communicates data (C) and diagnostic information (D) to the FTC/M modules associ-
ated to other partial processes. It decides on the working mode of its associated partial process
thanks to the monitoring system embedded into the hierarchical structure (see below). Yet, this

decision may be modified by the group resource and reconfiguration manager (GRRM) (see
arrow pointing downward from GRRM in fig. 2.2) which has a global view on a set of partial
processes and resources, and decides on resource allocation and working mode of each partial
process.

Figure 2.3: Fault Tolerant Control Module

An FTC function (fault tolerant control function) is a functional unit. The details of an FTC
function are shown in fig. 2.3. It consists of the following parts:

• Control function (measurement, regulatory or logic or sequential control, HMI, etc.).

• Function monitor for evaluation of the partial process quality and for partial process spe-
cific fault diagnosis. The latter can be based on sensor and actuator signals, analytical or
knowledge based process models, and quality evaluation.

• Local reconfiguration and mode control. From diagnostic results of the function moni-
tor, local reconfiguration decisions can be derived, e.g. change of control parameters or
algorithms, switching to redundant sensors or to estimated states, change to a degraded
mode. Such decisions can be made locally only if the change of resource needs has no
significant influence on other partial processes.

• Resource needs. Each partial process has specific resource needs and urgencies, which
are different for each of its modes.

An FTM function (fault tolerant measurement function) is also a functional unit which has the
same structure as the FTC function except that the controller function is replaced with an esti-
mator (see fig. 2.4).
The inputs/outputs and the structure of the FTC and FTM modules (respectively pictured
in fig. 2.3 and fig. 2.4) are listed below; a small data-structure sketch of these interfaces follows the two lists.
FTC interfaces:

• Cin : control signals, reference signals, estimates, sensor outputs (measurements), quality
of measurements (or estimates);

Figure 2.4: Fault Tolerant Measure Module

• Din : diagnostic results coming from other FTC/FTM modules;

• Cout : control signals (references to inner loop, or actuator commands);

• Dout : fault model, confidence and performance index;

• Min : working modes in (coming from upper levels);

• Mout : working modes out (present working mode and list of admissible working modes);

• N : resource needs and urgencies (for different modes).

FTM interfaces:

• Cin : data in, sensor outputs, estimates from other cells, quality of measurements (or esti-
mates);

• Din : diagnostic results from other FTC/FTM modules;

• Cout : data out, estimates and tolerance on estimates;

• Dout : fault model, confidence and performance index;

• Min : working modes in (coming from upper levels);

• Mout : working modes out (present working mode and list of admissible working modes);

• N : resource needs and urgencies (for different modes).
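A possible, purely illustrative encoding of these interfaces as a plain data structure; the field names mirror the signals listed above, while the concrete types are assumptions made for the sketch. An FTM interface would look the same, with estimates in place of control signals.

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class FTCInterface:
    """Signals exchanged by a Fault Tolerant Control module (cf. fig. 2.3)."""
    c_in: Dict[str, float] = field(default_factory=dict)   # references, measurements, estimates
    d_in: Dict[str, Any] = field(default_factory=dict)     # diagnostics from other FTC/FTM modules
    c_out: Dict[str, float] = field(default_factory=dict)  # actuator commands or inner-loop references
    d_out: Dict[str, Any] = field(default_factory=dict)    # fault model, confidence, performance index
    m_in: Optional[str] = None                              # working mode imposed by upper levels
    m_out: List[str] = field(default_factory=list)          # present and admissible working modes
    needs: Dict[str, float] = field(default_factory=dict)   # resource needs/urgencies per mode

ftc = FTCInterface(m_out=["nominal", "degraded"], needs={"nominal": 1.0, "degraded": 0.6})
print(ftc.m_out, ftc.needs)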

For the design of an FTC system, the global control task is separated into different functions
(e.g. temperature control, temperature reference determination, temperature estimation, etc.).
Each function is characterized by:

• a general objective,

• a set of possible working modes; each working mode is specified by:



– the resource needs;


– the urgency;
– the required performance;
– the conditions for transition from that working mode to other working modes.


Figure 2.5: Temperature control loop

Example 2.1 Let us consider the closed loop system sketched in fig. 2.5. The control objective is to
maintain a constant temperature (measured by a sensor) in a room. The temperature is controlled by
a hot water flow in a heater via a control valve. We assume we have a redundant sensor which is less
precise than the operating one. The function is characterized by:
• General objective: maintain the room temperature at a fixed value T ;

• Working modes:

– Mode 1: Nominal mode (no fault, initial mode);


– Mode 2: Faulty sensor, the operating sensor is replaced with the less precise sensor;
– Mode 3: Faulty actuator;
– Mode 4: Faulty actuator (clog in the valve).

For each mode, we have to specify the resource needs, the urgency, the required performance, the
conditions for transition from the considered working mode to other working modes. Mode 2 could be
illustrated by the following:
• The resource needs: nominal needs.

• The urgency: a given urgency.

• The required performance: to maintain a temperature T with an error smaller than 10% (instead
of an acceptable error of 1% in Mode 1).

• The conditions for transition from Mode 2 to Mode 1: if the less precise sensor has been replaced
by a sensor with the same characteristics as the initial one, then we have a transition to the Mode
1.
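A possible table-driven encoding of the working modes and transitions of Example 2.1; the event names and the performance figures below are hypothetical labels invented for the sketch, of the kind a function monitor or local reconfiguration manager could produce.

# Working modes of the temperature-control function of Example 2.1.
MODES = {
    1: {"description": "nominal", "max_error": 0.01},
    2: {"description": "degraded: backup (less precise) sensor", "max_error": 0.10},
    3: {"description": "faulty actuator", "max_error": None},
    4: {"description": "faulty actuator (clogged valve)", "max_error": None},
}

# (current mode, event) -> new mode; event names are hypothetical.
TRANSITIONS = {
    (1, "sensor_fault"): 2,
    (2, "sensor_replaced"): 1,
    (1, "actuator_fault"): 3,
    (1, "valve_clogged"): 4,
}

def next_mode(mode: int, event: str) -> int:
    """Return the new working mode, keeping the current one if the event is irrelevant."""
    return TRANSITIONS.get((mode, event), mode)

mode = 1
for event in ("sensor_fault", "sensor_replaced", "valve_clogged"):
    mode = next_mode(mode, event)
    print(f"{event} -> mode {mode}: {MODES[mode]['description']}")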

2.1.2 Resource monitor

A resource monitor is associated to a physical resource (plant or controller module) and moni-
tors its state. It cannot be related to a partial process, because partial processes can be allocated
to different resources. Sometimes, a resource monitor uses normal sensor and actuator infor-
mation together with models, and hence it can be allocated to different computing resources,
just like a FTC function. But sometimes resource monitors contain special hardware equipment
or hardware associated test software, giving resource specific information (e.g. in computers:
self-test software, parity check, watchdog). In that case, the resource monitor contains parts
which have a fixed allocation to a physical resource and hence cannot freely be allocated.

2.1.3 Global (and group) resource and reconfiguration manager

There are two higher levels in the above mentioned hierarchical structure. The top level is the
global resource and reconfiguration manager and the lower level is the group resource and
reconfiguration manager. The group resource and reconfiguration manager is responsible for a
defined subset of partial processes, whereas the global resource and reconfiguration manager
is responsible for all the group resource and reconfiguration managers.

Figure 2.6: Group/Global Resource and Reconfiguration Manager

The inputs and outputs of these two higher levels are the same and are:

• Din : resource diagnostic information;

• Min : list of acceptable and present working modes (from lower levels);

• N : needs of FTM and FTC modules or lower level GRRM modules;

• Mout : working modes to be imposed on lower levels.

Given the urgency of the different functions, the role of the GRRM module is to determine, from the
available resources, how best to use them in order to achieve the most urgent functions (if all
functions cannot be realized).
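As a caricature of this allocation task, the sketch below greedily grants a limited scalar resource budget to the functions in decreasing order of urgency; all numbers are invented, and a real GRRM would solve a richer optimization over several resource types and working modes.

requests = [
    {"function": "temperature control",    "need": 3.0, "urgency": 10},
    {"function": "temperature estimation", "need": 2.0, "urgency": 7},
    {"function": "data logging",           "need": 4.0, "urgency": 2},
]
budget = 6.0   # available resource units (invented)

granted = []
for req in sorted(requests, key=lambda r: r["urgency"], reverse=True):
    if req["need"] <= budget:        # most urgent functions are served first
        granted.append(req["function"])
        budget -= req["need"]

print("granted:", granted, "| leftover budget:", budget)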

2.1.4 Configuration of the hierarchical structure


Global specifications:
The global specifications should include the functionalities of the system and the performance
criteria associated to each functionality. Fault tolerance of the device or the process should
be specified. This amounts to indicating which functionalities should be achieved in each faulty
mode. Statistical information on the faults, when available, could be used to evaluate the need
for assuring tolerance with respect to specific faults.
Specifications for FTC/M modules:
Specifications at FTC/M modules levels include:
• the model of the considered device or sub-process in healthy and faulty behavior;

• potential working modes, associated performance measures and urgency;

• acceptable needs (namely upper bound on the needs that might be tolerated for that mod-
ule);

• statistical information on faults;

• fault tolerance to be achieved.


The combination of performance measures and models should allow one to deduce perfor-
mance criteria on the FDI system in terms of detection delay and time for reconfiguration, for
instance. Fault tolerance requirements will be used in order to deduce which faults should be
isolated to achieve reconfiguration. Acceptable needs are upper bounds on the needs that can
be dedicated to the module. The urgency of the different working modes should be specified
according to the importance of the functionality associated to the FTC/FTM module.
Specifications for the GRRM modules:
• Urgency of the functions associated to each FTC/FTM module linked to the considered
GRRM;

• Needs of the functions associated to each FTC/FTM module linked to the considered
GRRM.
The objective to be optimized for resource distribution among the different functions, according
to urgencies and technical feasibility, should also be specified.

2.2 Failure analysis tools


In this section failure analysis tools, such as fault tree analysis and structural analysis, are re-
viewed to stress their usefulness (see [12], [46]).

2.2.1 Fault Tree Analysis


A fault tree is a structure by which a particular system failure mode can be expressed in terms
of combinations of components failure modes and operator actions (see [2]). The system failure
mode to be considered is named the top event and the fault tree is developed in branches below
this event showing its causes. This development process ends when component failure events
(basic events) are encountered. Fault tree analysis can be carried out by providing information

on the probabilities of the basic events, and it is used to identify the causal relationships leading to a
specific system failure mode. Each fault tree considers only one of the many possible system
top failure modes. Therefore more than one fault tree can be associated to the same system.
A fault tree diagram contains two basic elements: gates and events. Gates allow or inhibit the
passage of combinations of faults up the tree and show the relationships between the events
needed for the occurrence of a higher level event. The three basic gate types used in the fault
tree are the OR gate, the AND gate and the NOT gate. These gates are used to combine events
according to the Boolean operations of union, intersection and complement.
The analysis of the fault tree diagram produces two types of results: qualitative and quan-
titative. Qualitative analysis identifies the combinations of the basic events which cause the
top event, possibly using Boolean logic. Quantitative analysis will result in prediction of the
system performances in terms of probability of failure or frequency of failure. Once the top
event is specified, the fault tree is developed by determining the immediate, necessary and suf-
ficient causes for its occurrence: these are not the component level causes of the event, but the
immediate causes of the event. The immediate, necessary and sufficient causes of the top event
are then treated as sub-top events and the process then determines their immediate, necessary
and sufficient causes. In this way the tree is developed refining resolution until the limit of res-
olution is reached and the tree is complete. To identify the immediate, necessary and sufficient
causes of events some guidelines are:

• classify an event into more elementary events;

• identify distinct causes for an event;

• find all necessary causes for an event;

• identify a component failure event where possible.

For further information and examples, the reader is referred to [2] and references therein.
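To make the quantitative side concrete, the sketch below evaluates a toy fault tree made of AND/OR gates over basic events, under the usual assumption of statistically independent basic events; the tree and the probabilities are invented for illustration.

import math

# A node is either a basic-event probability or a tuple ("AND" | "OR", [children]).
def probability(node):
    """Probability of the event rooted at `node`, assuming independent basic events."""
    if isinstance(node, (int, float)):
        return float(node)
    gate, children = node
    probs = [probability(child) for child in children]
    if gate == "AND":
        return math.prod(probs)
    if gate == "OR":
        # P(A or B or ...) = 1 - prod(1 - P_i) for independent events
        return 1.0 - math.prod(1.0 - p for p in probs)
    raise ValueError(f"unknown gate {gate!r}")

# Toy top event: (main sensor fails AND backup sensor fails) OR actuator fails.
tree = ("OR", [("AND", [1e-3, 5e-3]), 1e-4])
print(f"P(top event) = {probability(tree):.2e}")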

2.2.2 Structural Analysis


Structural analysis is the analysis of the structural properties of models, i.e. the analysis of
those properties which are independent of the actual values of the parameters. It represents
the links between variables and parameters of the operating model; these links are indepen-
dent of the form in which the operating model is expressed. The analysis of the structure is
performed through a graph representing the above-mentioned links. The structural approach con-
stitutes a general framework for the design of FDI systems and in particular it allows us to
generate, compute, evaluate and implement the residual based FDI procedures. On the other
hand, structural analysis is an important tool to design and evaluate the impact of recon-
figurations on the structure of the system. From the point of view of structural analysis, the
system model is considered as a set of constraints which apply to a set of variables, among
which a subset has known values. This set of variables has been selected to describe the evo-
lution of the process; the system sensors, together with the control variables, give the subset of
those variables whose values are known. The set of constraints is given by the models of the
blocks which constitute the system. The term constraint refers to the fact that a technological
unit imposes some relations between the values of the variables: they cannot take any possible
value in the variables space, but only those values which are compatible with the physical laws
applied to that technological unit. The structure of the model is a digraph whose incidence
matrix represents the links between the variables (known and unknown) and the constraints.

The knowledge we have about the system contains not only the model of the plant, but also
the model of the measurement. These two models can be represented by their structural graph.
When the model of the measurements is taken into account, the structure of the plant is only
relative to unknown variables, while the structure of the measurements shows the relations be-
tween known and unknown variables. Considering the parameters of the model as variables
(whose nominal value might be known or unknown), the structural model of the system may
be generalized as a set of constraints which apply to a set of variables and parameters, among
which a subset has known values. Analytical redundancy relations are obtained by applying
graph theory to the digraph representing the structural model of the plant.
Faults can be divided into two classes, non-structural faults (namely those faults that cause
changes just in the mathematical expression of constraints, e.g. parametric faults) and struc-
tural faults (whose effect is to change the set of constraints of the system). An example of a
structural fault can be a resistor whose resistance value changes from R ≠ 0 to Rf = 0; in this
sense the constraint of the component changes from V − Ri = 0 into V = 0. For this kind
of failures, structural observability and controllability can be assessed by the analysis of the
nominal/faulty structural graph (here the faulty structural graph is considered the structural
graph of the system in nominal condition with the constraint of the faulty component substi-
tuted by the new constraint).
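A minimal sketch of a structural model as an incidence matrix between constraints and variables, using the resistor example above; the measurement constraint and the variable names are assumptions added only to make the example self-contained.

# Which variables appear in which constraints (the structure, not the equations).
variables = ["V", "i", "V_m"]          # V_m is an assumed measured (known) variable

nominal = {
    "c1 (component): V - R*i = 0": {"V", "i"},
    "m1 (sensor): V_m - V = 0": {"V", "V_m"},
}

faulty = dict(nominal)
faulty["c1 (component): V - R*i = 0"] = {"V"}   # structural fault: constraint becomes V = 0

def incidence(model):
    """Incidence matrix: one row per constraint, 1 where the variable appears."""
    return [[1 if v in vars_ else 0 for v in variables] for vars_ in model.values()]

for name, model in (("nominal", nominal), ("faulty", faulty)):
    print(name, incidence(model))
# In the faulty structure the current i no longer appears in any constraint,
# i.e. it is no longer structurally observable from the known variables.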
In this framework it is possible to introduce the concept of the estimation redundancy de-
gree of an observable variable (which expresses in mathematical terms the idea of an over-con-
strained observed variable) and the control redundancy degree of a controllable variable, which
expresses in mathematical terms the idea of an over-constrained controllable variable. With this
in mind, structural analysis, as asserted in [12], is a valuable tool to study the monitorable
part of the system (namely the over-constrained part), but also to design reconfiguration actions
which can be expressed in terms of the faulty structural graph. In conclusion, we can assert that
the structural analysis of a system can constitute a valuable framework for the problem formu-
lation and solution approaches in many other steps of the FDI system design procedure, such
as:

• analysis of the local redundancies of the system, in order to detect FDI possibilities;

• determination of those extra sensors whose implementation would increase the FDI pos-
sibilities;

• determination of the computational sequences whose result is a residual;

• introduction of computational constraints in order to take into account robustness or com-


putational homogeneity constraints;

• analysis of the structure of the residuals in order to evaluate the detectability and isola-
bility of the faults;

• analysis of the structure of the residuals for the implementation of the FDI algorithms;

• design of the reconfiguration actions;

• evaluation of the effect of the reconfiguration on the system structure.

For further information and examples on structural analysis, we refer the reader to [80],
[12], [13], [36], [96] and references therein.

2.3 Supervision level hierarchy


The structure of the distributed fault tolerant system proposed within IFATIS has been re-
viewed in previous sections and is sketched in fig. 2.2. In this figure it is easy to recognize
a two-level supervisory unit, composed of the global and group Resource and Reconfiguration
Managers, orchestrating the switching between the different working modes of the low level fault
tolerant cells and allocating functions over physical resources in an optimal way.
The partial processes and the FTC/M modules may have, in general, a complicated struc-
ture composed of several controlled dynamics achieving different functionalities, several
diagnosers able to detect a number of different faults, and different local reconfiguration units
able to manage the optimal reconfiguration strategy. The main goal of the managers, on the ba-
sis of the information coming from the low level, is to manage the new working modes of the
FT cells in order to meet global specifications given by the functionalities and the performance crite-
ria associated to each functionality and by fault tolerance criteria (namely which functionality should
be achieved in each faulty mode), while fulfilling global constraints which are basically imposed by
limited resources.
In this framework it is possible to figure out three levels in which the reconfiguration of the
system can be performed:

• Module level reconfiguration: performed by LRMs to achieve module fault tolerance


only on the basis of module diagnosis information, without considering resource diag-
nosis information and resource needs.

• Group RRM level reconfiguration (in-group re-allocation): in case the reconfiguration


of a certain process has an impact on the resource allocation within the same Group, then the
Group RRM is given the responsibility of orchestrating the resource allocation.

• Global RRM level reconfiguration (cross-groups re-allocation): when the resource re-
allocation, due to module level reconfiguration or resource failure, involves more than
one group, the Global RRM is given the responsibility of orchestrating the resource
allocation.

The FTC/M module supervising each partial process is totally autonomous in setting the
new working modes, as it is supposed to be completely isolated from the other partial
processes (or to have only minor coupling, which can be dealt with through the cross-process information) in
terms of fault propagation. The management of the new WM is delegated to the Group RRM only
in the case it involves a change of resources or produces conflicts between functions.
It is worth looking at a partial process and at the linked FTC/M module as composed of sev-
eral partial sub-processes and FTC/M sub-modules hierarchically structured. The hierarchical
structure of the FTC/M module is clearly motivated by the fact of dealing with complicated
partial processes which can be characterized by several faults and strong coupling effects.
Reconfiguration and mode decisions are made by local reconfiguration and mode control (LRM)
blocks inside FTC/FTM modules. In this regard, different FT Modules can be grouped into a
unique higher level module (supervised by a unique LRM), if they are isolated from the context
as far as the effect of a certain subset of faults is concerned. Starting from these considerations,
in this section we want to give some guidelines to design this hierarchical structure (see also
[66, 16]).
Generally speaking the overall task to be accomplished by the supervisor can be divided into
three steps:

[Figure 2.7 shows the Local Reconfiguration Manager as three cascaded blocks: an FDI Unit (fault estimation obtained by processing all the diagnostic signals issued by the local function monitors), a Working Mode Generator (event generation on the basis of the occurred fault and of the objectives to be achieved by the fault tolerant system), and a Decision Logic; inputs include the diagnosis signals with their confidence indices, the heuristic description of the module, the objectives and resource limitations, and the actual working modes, while outputs include the new working mode and the resource need requests.]

Figure 2.7: Internal structure of Local Reconfiguration Managers

1. Fault Detection and Identification: on the basis of the diagnostic signals issued by the FT
Modules, the supervisor is required to compute an estimation of the (possibly) occurred
fault in the set of partial processes supervised by it. To this end, there are two fundamen-
tal elements which must be taken into account:

• Fault Effect Propagation between partial processes: the supervisor, which has a
global perspective of all the partial processes belonging to the supervised group,
has to identify possible false alarms generated by the local function monitors which,
on the contrary, have just a local perspective of the process. For instance, the ac-
tivation of certain diagnostic signals from a local function monitor can be due not
to a real local fault but, indeed, to the effect of a fault in a different partial process
which propagates within the group. In this respect all the residual signals issued by
the local function monitors must be jointly elaborated by the supervisor in order to
identify the real fault. Possibly, in this elaboration, the response of a local function
monitor may be changed.
• Confidence level of each diagnostic signal: all the diagnostic signals processed by
the supervisor are characterized by a confidence level (see [4]) expressing the qual-
ity of the diagnosis performed by local function monitors. The supervisor is required
to suitably take into account this information, by comparing it with analogous infor-
mation received by other processes, in order to generate a reliable fault estimation.

This first task can be thought of as a phase in which an objective estimation of the fault oc-
curred in a specific group is generated. The adjective objective stresses the fact that the
fault estimation is not filtered (altered) by global specifications/constraints which charac-
terize the complex system, but it precisely represents what is going on objectively in the
plant at a certain time. As we shall see later, the objective fault estimation is then further
processed, taking into account other factors such as global specifications and constraints,
in order to generate the events which determine the new working modes. In order to suc-
cessfully carry out this phase a complete knowledge of the partial processes composing
the group, in terms of how the effects of faults and reconfigurations propagates through-
out the group, is needed. Roughly speaking, if one interprets the diagnosis signals issued
by local function monitors as residual signals, then this phase of FDI amounts to inverting

the residual matrix linked to a particular group.

2. Events Generation: the second task embedded in the supervisor elaboration regards the events generation, namely the generation of requests of WM changes. These, in general, result from a joint elaboration of the information provided by the FDI unit (namely the objective information about the occurred fault) and of the performance criteria to be achieved. In other words, the detection of certain faults may or may not generate reconfiguration requests, depending on the particular specification to be met. In some cases the objective response of the FDI unit may be bypassed, since a reconfiguration (meant to accommodate fault effects) may be in contrast with the performance criteria. In the simplest scenario this task is implemented by using a look-up table which, on the basis of the performance and fault tolerance criteria requested for particular functionalities, on the actual working mode, and on the faults detected by the previous task, yields the desired event.

3. Working Mode Setting: The request of a new reconfiguration issued by the event gener-
ation block, is then processed by the final decision logic whose aim is to set the new
working modes. The main goal of this unit is to set the new WM configuration on the
basis of the requests generated by the previous block and taking into account possible
resource limitations which may characterize the fault tolerant system. This part is im-
plemented using the theoretical machinery of the discrete event systems (DES) and the
theory of the supervision of DES (see Appendix A).
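
To fix ideas on how the event generation and the working mode setting interact, the following Python sketch mimics the look-up-table mechanism described above; all names (working modes, faults, the resource-availability test) are hypothetical placeholders and not part of the architecture specification.

\begin{verbatim}
# Minimal sketch (hypothetical names): look-up-table based event generation
# followed by a resource-aware working-mode decision.

# (current working mode, estimated fault) -> requested working mode
EVENT_TABLE = {
    ("wm_nominal", "f_sensor"):   "wm_degraded_sensing",
    ("wm_nominal", "f_actuator"): "wm_reduced_performance",
}

def generate_event(current_wm, fault_estimate, performance_criteria):
    """Translate the objective fault estimate into a reconfiguration request,
    possibly filtering it out when it conflicts with the global criteria."""
    request = EVENT_TABLE.get((current_wm, fault_estimate))
    if request is not None and request in performance_criteria.get("forbidden", set()):
        return None     # reconfiguration bypassed: it would violate the specifications
    return request

def decide_working_mode(request, current_wm, resources_available):
    """Grant the requested working mode only if the available resources allow it."""
    if request is None:
        return current_wm
    return request if resources_available(request) else current_wm

# usage example
request = generate_event("wm_nominal", "f_sensor", {"forbidden": set()})
print(decide_working_mode(request, "wm_nominal", lambda wm: True))
# -> wm_degraded_sensing
\end{verbatim}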

A possible structure for Local reconfiguration and mode control (LRM) at each level of the
architecture is represented in fig. 2.7.

Figure 2.8: Functionality tree.

Achieving fault tolerance means to preserve functionality of the system even in case of faults. A fault tolerant control system is a system which is
able to detect faults in the system and recover functionalities in order to assure pre-specified
performances.
For this reason the starting point for the analysis of the FT system is the functionality tree,
i.e. a tree in which the global objectives of the system (the root of the tree) is divided into sub-

objectives (local functionalities) until we reach some elementary objectives. An example of a


functionality tree is given in fig. 2.8.
When a fault affects the system, it causes a failure, i.e. the termination or degradation of the ability of an item to perform its required function (its functionality is compromised). We refer to the effect by which a failure is observed on the failed system as its failure mode. From a functional point of view a failure mode represents a loss of functionality: due to a fault we are no longer able to accomplish certain objectives.
As stated above, a fault tree is a structure by which a particular system failure mode can be
expressed in terms of combinations of components failure modes and operator actions. Hence
from a functional point of view the fault tree can be built starting from the functionality tree
simply by “complementing” it: instead of being interested in functions, we are interested now
in loss of functions. As in the functionality tree the global objectives were satisfied if all the
local objectives were satisfied (note the AND operators between events in fig. 2.8), in the fault
tree the loss of a functionality is due to the loss of one or more local functionalities. Faults are
linked to the loss of elementary functionalities. An example of a fault tree built starting from
the functionality tree pictured in fig. 2.8 is sketched in fig. 2.9.
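
The "complementation" of the functionality tree can be made concrete with a minimal data-structure sketch in Python (the node names are placeholders, not those of fig. 2.8): an AND node of the functionality tree becomes an OR node of the fault tree, since the loss of any sub-functionality causes the loss of the parent one.

\begin{verbatim}
# Sketch: deriving a fault tree from a functionality tree by "complementation".
# Each node is (kind, name, children); an AND of sub-functionalities becomes
# an OR of the corresponding losses.
functionality_tree = ("AND", "global objective",
                      [("AND", "local objective A",
                        [("LEAF", "elementary objective a1", []),
                         ("LEAF", "elementary objective a2", [])]),
                       ("LEAF", "local objective B", [])])

def to_fault_tree(node):
    kind, name, children = node
    new_kind = "OR" if kind == "AND" else kind
    return (new_kind, "loss of " + name, [to_fault_tree(c) for c in children])

print(to_fault_tree(functionality_tree))
\end{verbatim}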

Figure 2.9: Fault tree.

Let us start from the diagnosis problem in order to identify a modular structure for the diagnosis algorithms. Detecting a fault means using analytical and/or hardware redundancies in order to reveal its effect. Having in mind our previous discussion about failures and failure modes, we can conclude that detecting a fault means revealing its failure mode or, from a functional point of view, revealing a loss of functionality. This means that by specifying a detection algorithm (i.e. a certain number of residual signals) we specify which failure modes we are able to reveal. A way to identify which events in the fault tree are observable thanks to the system structure is structural analysis, which, as explained previously, represents the

links between variables and parameters of the operating model1 .


Once we have identified the losses of functionality that are detectable, we can easily identify a map between the detectable events and the FDI units in the supervisors. Each of these units will be present within the module which is designed to recover the functionality whose loss we are able to detect.

Figure 2.10: Example of functionality tree.

In fig. 2.10 an example of a functionality tree is presented. From this graph it is possible to build the fault tree sketched in fig. 2.11(a). From the tree it is possible to see that three faults (f1, f2 and f3) can affect the system, leading to losses of local (and global) functionalities. The figure highlights which losses of functionality can be observed via three residual signals (r1, r2 and r3). It is now immediate to define the residual matrix:

\[
R \;=\; \begin{array}{c|ccc}
      & r_1 & r_2 & r_3 \\ \hline
f_1   &  0  &  1  &  1  \\
f_2   &  1  &  1  &  1  \\
f_3   &  0  &  0  &  1
\end{array}\;.
\]

This diagnosis structure can be decomposed into three modular parts:

\[
R_1 \;=\; \begin{array}{c|c}
      & r_1 \\ \hline
f_2   &  1
\end{array}\;, \qquad
R_2 \;=\; \begin{array}{c|cc}
      & r_1 & r_2 \\ \hline
f_1   &  0  &  1  \\
f_2   &  1  &  1
\end{array}\;, \qquad
R_3 \;=\; \begin{array}{c|ccc}
      & r_1 & r_2 & r_3 \\ \hline
f_1   &  0  &  1  &  1  \\
f_2   &  1  &  1  &  1  \\
f_3   &  0  &  0  &  1
\end{array}\;.
\]

This means that at the lowest level in the structure we can detect fault f2 using residual r1 , at
the intermediate level it is possible to detect f1 using residual r2 and at the higher level in the
structure it is possible to detect fault f3 using residual r3 . In this way we have obtained a mod-
ular detection of the faults going from the lower levels to the higher levels of the supervision
hierarchy. The situation is also illustrated in fig. 2.11(b).
1 As the reader can figure out, the idea of observable (and hence diagnosable) failure modes is strictly linked with the over-constrained observable variables present in the structure of the system. For this reason structural analysis is a perfect candidate tool to identify them.
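
To make the use of the residual matrix concrete, the following Python sketch isolates a fault by matching the pattern of triggered residuals against the rows (fault signatures) of the matrix R above; the threshold value is a hypothetical tuning parameter.

\begin{verbatim}
# Sketch: fault isolation by matching the triggered-residual pattern
# against the rows (fault signatures) of the residual matrix R above.
R = {                     # fault -> expected signature over (r1, r2, r3)
    "f1": (0, 1, 1),
    "f2": (1, 1, 1),
    "f3": (0, 0, 1),
}

def isolate(residuals, threshold=0.1):
    """residuals: observed values of (r1, r2, r3)."""
    pattern = tuple(int(abs(r) > threshold) for r in residuals)
    return [fault for fault, signature in R.items() if signature == pattern]

print(isolate((0.0, 0.5, 0.4)))   # ['f1']
print(isolate((0.0, 0.0, 0.0)))   # []  (nominal: no signature matched)
\end{verbatim}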

(a) Example of fault tree for diagnosis. (b) Supervisor structure for diagnosis.

Figure 2.11: Example of the use of fault tree analysis to design a distributed diagnosis system.

Reconfiguring a system after the occurrence of a fault means recovering the functionality affected by the fault. For this reason we can start again from the fault tree to identify a modular architecture also for the reconfiguration system. As we defined observable events in the fault tree, we can define controllable events in the sense that we can act on some degree of freedom in order to recover the functionality described in the event2. For example, we can change some parameters of the controller, or change the structure of the controller, if the functionality we are interested in is the control of a variable. Another option is switching between redundant hardware or, in case of severe faults, changing our local or even global objectives in order to make them pursuable by the degraded system.
Now that we have identified which events in the fault tree are controllable, we can identify a map between these and the request generator and the working mode decision logic in the supervisor. These units will be present within the modules which are designed to recover those functionalities. Following this procedure we have enriched the supervision hierarchy with the units dedicated to reconfiguration, obtaining a complex modular hierarchy dedicated to the supervision of the distributed system.
As an example, consider again the fault tree in fig. 2.11(a) and assume that the functionalities controllable via reconfiguration actions are those illustrated in fig. 2.12(a). From this figure it is easy to sketch the supervision structure illustrated in fig. 2.12(b). It is easy to see that while f2 and f3 can be reconfigured locally, simply by recovering the local functionality that the faults have corrupted, this is not possible for f1: to accommodate this fault we need to act at the level of the global objectives.
Merging the supervision hierarchy for diagnosis illustrated in fig. 2.11(b) with the hierarchy for reconfiguration illustrated in fig. 2.12(b), we obtain the supervision hierarchy shown in fig. 2.13.

2 As the reader can figure out, the idea of reconfigurable failure modes is strictly linked with the over-constrained controllable variables present in the structure of the system. Again, structural analysis can help us to identify them. For further details the reader is referred to [12].

(a) Example of fault tree for reconfiguration. (b) Supervisor structure for reconfiguration.

Figure 2.12: Example of the use of fault tree analysis to design a distributed reconfiguration
system.

Figure 2.13: The modular structure obtained to supervise system in fig. 2.10.

2.4 Design of the Group/Global Resource Reconfiguration Manager


The three-level architecture of the supervisor, namely the local, group and global levels, accounts for possible situations where the switching between different modes can be implemented just by having information on a certain subset (or group) of FT cells, with no global information needed. In particular it is possible to figure out three levels at which the reconfiguration of the system can be performed:

• WM setting at Module level (low-impact reconfiguration): in some cases, when the resource
needs induced by a new reconfiguration (namely by setting a new WM) have no significant
influence on the other processes, the decision of setting a new WM can be taken by the lo-
cal reconfiguration and mode control (embedded in the FT module) without demanding
the decision to higher order levels.

• WM setting at Group RRM level (in-group re-allocation): in case the reconfiguration of a


certain process has impact in the resource allocation within the same group and/or the
change of WM in a certain module has to be joined to a change of WM to a different
module within the same group, then the Group RRM is demanded the responsibility of
orchestrating the new WM switching.

• WM setting at Global RRM level (cross-groups re-allocation): there may be cases in which
a certain WM change can be performed only by revising the allocation of the FT algo-
rithms within the available resources. This, for instance, may happen in case:

– the resource monitor detects a fault in the specific resource (loss of a computer or
plant resources) and all the functionalities running on it must be moved on a differ-
ent resource;
– the resource needs of certain functionalities are not anymore compatible with the
specific resource (for instance after a reconfiguration) and the computational burden
associated to the algorithm must be spread in others resources.

In this case the only possibility is to demand the WM decision logic to the Global RRM.

From this perspective, the global RRM activates just in case new re-allocations are needed be-
tween resources and functions linked to different groups, while all the working mode manage-
ment which does not require re-allocation between different groups is demanded to the Group
RRM. For this reason the information processed by the Global RRM is not made of performance indexes (which do not influence re-allocation policies) but just of diagnostic results. On the other hand, the Group RRM accesses both the diagnostic results and the performance indexes issued by the FT Modules, since the working mode switching policy relies on both pieces of information.
The architecture of the supervisor is tailored in order to leave to the Group and Global
RRM just the task of managing reconfigurations in each group which are necessary for the best
exploitation of the available resources, without demanding to it the task of reconfiguring FT
modules for achieving global functionalities which, indeed, is left to each Local RRM.
As far as the Group and Global RRM are concerned, they are composed just of a Decision Logic Unit which processes all the events generated by the Request Generation blocks linked to the different local RRMs and manages the WM changes involving reallocations of resources linked to different groups. Moreover, the state of the Group and Global RRM can change due to commands issued by external operators; in other words, the Group/Global RRM is also dedicated to interfacing the external world with the distributed system.

We address in this section the problem of designing the WM decision logic unit of the group and global RRM, by showing how the theory of discrete event systems can be successfully used (see Appendix A, [26] and [108]). To this end, the first step is to precisely identify all the specifications which are behind the design of the group-global RRM. This is done in the following:

• Group selection: of course the first basic information regards how the partial processes
have been grouped each others, namely how the overall system has been divided in
groups of functional units.

• FTC/M as DES: this task amounts to describing each FT Module (both Control and Measurement) as a Discrete Event System by specifying states, events and transitions between
states according to the occurred events. More precisely:

– States: the states capture information about the specific working modes of the partial
process; these, besides the reconfigured states, comprise also states associated to
faults (namely, a fault has occurred but no reconfiguration action has been taken yet).
– Events: these are exogenous events, which can be controllable/observable for the
supervisor, inducing transitions between different states and which can be catego-
rized in the following two classes: change of WM, i.e. events (which will turn out
to be controllable for the supervisor) which induce transitions between states due to a change of the working mode, and occurred faults, i.e. events (which will turn out
to be not controllable but observable for the supervisor) which induce transitions
between states due to faults which have been detected by local function monitors.

• Resource and Resource monitor as DES: this task amounts to describing local resources and
the associate monitors as Discrete Event Systems by specifying states and events inducing
transitions between states as follow:

– States: in the simplest case the states describing the status of resources reduce to
three: idle (namely the resource is capable to run additional functionalities), busy
(no other functionalities can be located on that resources) and fault (the resource
monitor has detected a local fault, for instance of a computer). As better highlighted
in the following, the busy state can in general split in several sub-states expressing
different cases in which the resource can be busy.
– Events: the exogenous events can be divided in change of WM inducing transitions
between idle and busy states (these are controllable events for the supervisor to be
designed which, on the basis of the model of the specific resource and the allocation
policy followed in the past, commute between the two states) and occurred faults,
i.e. non-controllable but observable events arising whenever the local resource mon-
itor detects a fault in the supervised resource.

• Reconfiguration specifications: list of different WMs for each partial process linked to a
specific fault situation. This represents the list of possible counteraction after failures to
achieve fault tolerance.

• WM Urgency: this information is needed whenever a reconfiguration must be actuated in


presence of limited resources. It expresses which functionality is the most urgent in order
to meet global specification. The urgency information depends, in general, on the actual
WM.

• WMs/Resources map: this represents an offline planning on how the different functional-
ities, achieved in all possible WMs, can be allocated in the available resources. In this
description a fundamental information is to identify re-allocation in different resources
within the same group (whose management is demanded to the group RRM) or eventu-
ally between different groups (whose management is demanded to the global RRM).

Starting from this information, it is possible to automatically design the decision logic unit which supervises each group. The composition of the discrete models of the functional units and physical resources linked to the same group yields a group automaton which captures all the information about the feasible WMs according to the actual WM and to the resource availability. From this group automaton, the Group RRM can be designed following the supervision
theory on the basis of performance specifications and reconfiguration requirements. Moreover
by composition of all the supervised group automata, it is possible to obtain a global discrete
model of the system on which the Global RRM can be designed, again taking advantage of the
supervision theory for Discrete Event systems, starting from global specifications.
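
The composition step can be illustrated with a minimal Python sketch of the synchronous (parallel) composition of two automata, the basic operation used here to build the group model; the FT module and resource automata below are toy placeholders, not the models of a specific plant.

\begin{verbatim}
# Sketch: synchronous (parallel) composition of two finite automata, the basic
# operation used to build the group automaton from the FT module and resource
# DES models.  States and events below are toy placeholders.

def parallel(A, B):
    """A, B: (states, initial, events, delta) with delta[(state, event)] = next.
    Shared events must occur simultaneously; private events interleave."""
    _, iA, eA, dA = A
    _, iB, eB, dB = B
    shared = eA & eB
    states, delta, frontier = {(iA, iB)}, {}, [(iA, iB)]
    while frontier:
        a, b = frontier.pop()
        for e in eA | eB:
            if e in shared and ((a, e) not in dA or (b, e) not in dB):
                continue                      # shared event blocked by one component
            na = dA.get((a, e), a if e not in eA else None)
            nb = dB.get((b, e), b if e not in eB else None)
            if na is None or nb is None:
                continue                      # private event not enabled
            delta[((a, b), e)] = (na, nb)
            if (na, nb) not in states:
                states.add((na, nb))
                frontier.append((na, nb))
    return states, (iA, iB), eA | eB, delta

# toy FT module and resource models
FTC = ({"wm1", "F1", "wm2"}, "wm1", {"f1", "chg2"},
       {("wm1", "f1"): "F1", ("F1", "chg2"): "wm2"})
RES = ({"idle", "busy"}, "idle", {"chg2"}, {("idle", "chg2"): "busy"})

G_states, _, _, _ = parallel(FTC, RES)
print(sorted(G_states))   # the reachable states of the group automaton
\end{verbatim}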
In the following these considerations will be further developed by proposing possible case
studies. The idea is to identify elementary case studies which are distinctive as far as the de-
cision logic design is concerned, and that suitably composed allow one to deal with arbitrary
complex situations. To this end the following is proposed:

1. Resource sharing: this case aims to collect all the applications in which the different func-
tionalities are allocated not always in the same resource but real time re-allocations are
possible. This may happen due to particularly heavy reconfigurations (which may re-
quire re-allocation) or due to loss of resources. Specifically the resource sharing can be
further specialized in:

(a) cross-group resource sharing: in this case the re-allocation of certain functionalities
involves resources which are not necessarily belonging to the same group. Hence the
re-allocation management can not be supervised by the group RRM but is demanded
to the global RRM.
(b) in-group resource sharing: in this case the re-allocation is between resources which
are belonging to the same group. Hence the re-allocation supervision can be per-
formed by the Group RRM.

2. Interlaced reconfigurations: by interlaced reconfigurations we mean reconfigurations of a certain module induced not by a locally occurred fault but rather by a change of WM in a
different module. In general interlaced reconfigurations may happen both at group level
(namely a change of WM in a certain module is induced by a fault happened in a different
module within the same group) and at system level (if the two interlaced FTMs are linked
to different groups). However, it must be stressed that, following the idea expressed in
the previous section of avoiding to collect in different groups functionalities which are
strongly interlaced, it is more interesting to consider reconfiguration occurring at group
level.

Of course the two cases are not mutually exclusive since it is possible to figure out complex
fault tolerant systems in which interlaced reconfiguration are present within shared resources.
In the following the previous two case studies are presented, by showing how the design of the
decision logic supervision can be achieved.

2.4.1 Resource sharing


In this section we will show how to manage resources shared between different functional units, e.g. hardware redundancy policies. Units sharing the same resource can belong to the same group (in this case we talk about in-group resource sharing) or to different groups (cross-group resource sharing). In the first case the sharing policy should be managed by the Group RRM; in the second one by the Global RRM.

Cross-group resource sharing

Consider an FTC system composed by two FTC modules (namely FTC1 and FTC2) and an
FTM module. These functional units share two physical resources (R1 and R2 namely). FTC1
is always allocated on R1, FTC2 is always allocated on R2, while FTM can be allocated both on
R1 and R2.
FTC1, FTC2 and FTM can be affected by three failures, f1 , f2 and f3 respectively. Each func-
tional unit can work in a nominal mode (working modes wm1 , wm3 and wm5 respectively)
or in a reconfigured mode (working modes wm2 , wm4 and wm6 respectively). Reconfigured
working modes should be issued by the supervisor after a failure is detected. When a FTC unit
works in a nominal mode it use just a part of the physical resource, leaving enough space on
it to allocate the working mode (nominal or reconfigured) of the FTM. When a FTC unit works
in the reconfigured mode this uses the whole resource. When this situation occurs the FTM
process have to be allocated on the other resource; this is possible only if FTM is working in
nominal condition.

Group selection: The system can be decomposed into two groups: the first group (G1) includes the resource R1 and the functional units FTC1 and FTM (when allocated on R1); the second group is composed of the resource R2 and of FTC2 and FTM (when allocated on R2).

Figure 2.14: Cross-group reconfiguration.
FTC/M as DES: In fig. 2.15 the automata modelling the functional units are reported. Each state WMi represents a different working mode for the functional unit, while states Fj represent the j-th faulty situation. Since the FTM can be allocated on both resources, there will be two models (FTM/i, i=1,2) of it, each of them representing the FTM as seen from the i-th resource. In these automata there exist coordination states and transitions representing the stand-by state of the unit on the resource. The events labelled wmi are the controllable and observable events representing changes of working mode (and hence transitions between states in the automata). Events labelled fj are the diagnosable and uncontrollable events representing possible failures. The transitions to the stand-by state (namely events sbi, i=1,2) are the controllable events used to allocate the unit on the resources.

Figure 2.15: Cross-group reconfiguration: automata modelling functional units.

Resource and Resource monitor as DES: In fig. 2.16 the automata modelling the physical resources are reported.

Figure 2.16: Cross-group reconfiguration: automata modelling physical resources.

Here the Ij (j=1,2) states stand for idle states, meaning that new working modes
can be allocated on the j-th resource, while states Bij (i=1,2 ; j=1,2) represent the busy situation
for the j-th resource (no other functionalities can be allocated on the resources); more in detail
state B1j represents the situation “FTCj in reconfigured working mode” while state B2j repre-
sents the situation “FTCj in nominal working mode and FTM allocated onto the j-th resource”.
All the events have the meaning explained above.
Reconfiguration specifications: Working modes wm1 , wm3 and wm5 represent nominal working
modes for FTC1, FTC2 and FTM respectively. When a failure f1 happens in FTC1 then the
working mode issued should be wm2 ; analogously wm4 and wm6 are the working modes that

should be issued in FTC2 and FTM when failures f2 and f3 are detected respectively.
WM Urgency: In the example we will consider two different working mode urgencies orders,
leading to different supervisory strategy:
1. (max. urgency) wm2 → wm4 → wm6 (min. urgency);

2. (max. urgency) wm6 → wm2 → wm4 (min. urgency).


The first situation expresses the fact that the reconfiguration after a failure in FTC1 is the most
important, while in the second case the most important reconfiguration is after a failure in
FTM.
WMs/Resources map: As explained above FTC1 is always allocated on R1, FTC2 is always al-
located on R2, while FTM can be allocated both on R1 and R2 depending on the state of the
resource. When an FTC unit works in a nominal mode it uses just a part of the physical resource, leaving enough space on it to allocate the working mode (nominal or reconfigured) of the FTM. When an FTC unit works in the reconfigured mode it uses the whole resource.
The model of group Gi (i=1,2) can be achieved by the automata parallel composition:

\[
G_i = \mathrm{FTC}i \parallel \mathrm{FTM}/i \parallel \mathrm{R}i\,.
\]

In fig. 2.17 the FSM modelling group 1 is reported; the specifications regarding changes of working mode after a failure are designed on the basis of this machine, and the supervisor Si is built to satisfy these specifications. In fig. 2.18 the models of the controlled groups G1 and G2 are represented.

Figure 2.17: Cross-group reconfiguration: automata modelling group G1.

Now the global machine on which to design the Global RRM can be obtained as:

\[
G = (G1/S1) \parallel (G2/S2)\,.
\]

Figure 2.18: Cross-group reconfiguration: group supervisors.



For the sake of simplicity we report neither the global machine nor the specifications designed on it, but we explain how the global supervisor works in relation to the reconfiguration priority order. Suppose that the WM urgency order is the one explained in case 1, namely: (max. urgency) wm2 → wm4 → wm6 (min. urgency).
At start-up the global supervisor allocates FTM onto R2 (sb1 wm52) so that, in case of failure f1 (the most severe), an immediate reconfiguration of FTC1 (wm2) is possible. In the following the supervision actions due to some combinations of failures are reported.

f1 → wm2
f3 → wm62
f2 → sb2 wm51 wm4
f2 → sb2 wm51 wm4 → f3 → wm61 → f1 → sb1 wm3 wm52 wm2 .

Suppose now that the WM urgency order is the one explained in case (2), namely: (max. urgency) wm6 → wm2 → wm4 (min. urgency). At start-up the global supervisor allocates FTM onto R2 so that, in case of failure f1 (more severe than f2), an immediate reconfiguration of FTC1 (wm2) is possible. In the following the supervision actions due to some combinations of failures are reported.

f2 → sb2 wm51 wm4 → f3 → wm61 → f1 → ε

where ε is the empty string, meaning that the supervisor takes no action. In fact, to reconfigure FTC1 (wm2) it would have to move FTM from R1 to R2, stopping its reconfiguration (wm6).
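
The role of the urgency order in the example above can be rendered with a small hypothetical sketch (not generated from the actual supervisor automaton): a reconfiguration request is granted only if the working modes it would displace are less urgent than the requested one.

\begin{verbatim}
# Sketch (hypothetical): grant a reconfiguration request only when the working
# modes it would displace have lower urgency than the requested one.
def grant(request, displaced, urgency_order):
    """urgency_order: list of working modes from most to least urgent."""
    rank = {wm: i for i, wm in enumerate(urgency_order)}
    if all(rank[request] < rank[wm] for wm in displaced):
        return request      # the reconfiguration is carried out
    return ""               # empty string: the supervisor takes no action

# case 1: wm2 is the most urgent -> displacing wm6 is allowed
print(grant("wm2", ["wm6"], ["wm2", "wm4", "wm6"]))   # 'wm2'
# case 2: wm6 is the most urgent -> the request for wm2 is refused
print(grant("wm2", ["wm6"], ["wm6", "wm2", "wm4"]))   # ''
\end{verbatim}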

In-group resource sharing


Consider three FTC modules and three resources. In nominal conditions FTC1 and FTC2 share
the same resource (R1). If FTC2 is reconfigured it works on R2 leaving R1 to FTC1 (in nominal
or reconfigured condition). If FTC1 is reconfigured it works on R1 and FTC2 needs to move to
R2 (in nominal or reconfigured condition). FTC3 works always on R3 (see fig. 2.19). Here the
groups are assumed to be

\[
G1 = \mathrm{FTC1} \parallel \mathrm{FTC2} \parallel \mathrm{R1} \parallel \mathrm{R2}\,, \qquad
G2 = \mathrm{FTC3} \parallel \mathrm{R3}\,.
\]

This case can be approached as the previous one; the only difference is that the resource alloca-
tion policy is managed by the group supervisor.

Figure 2.19: In-group resource sharing. Here n stands for nominal working mode, while r stands for reconfigured working mode.

2.4.2 Interlaced reconfiguration


A meaningful example of a Fault Tolerant System with interlaced reconfigurations is given by the heating system proposed in fig. 2.20. The proposed set-up is composed of three cylindrical tanks. Two tanks (1 and 2) are used for pre-heating liquids supplied by two pumps driven by DC motors. The liquid temperature is adjusted in these two tanks by means of two electrical resistors. A third tank allows the mixing of the two liquids coming from the two pre-heating tanks. The system instrumentation includes 4 actuators and 6 sensors:

• Actuators: P1 , P2 and Q1 , Q2 are respectively powers delivered by the two resistors and
the input flow-rates provided by the two pumps.

• Sensors: T1 , T2 and T3 are the temperature measurements in the three tanks, H1 and H2
are the level measurements in tank 1 and 2, while Q3 is the flow-rate at the output of the
mixing tank.

Figure 2.20: The 3-tanks heating system.

The objectives of the system are to adjust the flow-rate Q3 and the fluid temperature T3 according to the reference variables $Q_3^*$ and $T_3^*$. For that purpose a very simple strategy consists of controlling the temperature and flow-rate variables in the two pre-heating tanks. According to this point of view, the global system is decomposed into three subsystems, respectively two pre-heating systems and a mixing one. The FTC/FTM decomposition is such that to each function a subsystem is associated: SS1 and SS2 represent the pre-heating subsystems while SS3 is the

Faults     Symbol      Type
Sensor     ∆H1         Bias - abrupt change
Actuator   ∆Q1         Saturation
Actuator   δQ1         Drift - incipient change
Actuator   P1 = 0      Loss of actuator - abrupt change

Table 2.1: Fault scenario for the 3-tanks heating system.

mixing subsystem. Two Fault Tolerant Control modules (FTC1 and FTC2) are associated to SS1
and SS2 . Since no control action is performed for mixing, only a Fault Tolerant Measurement
module (FTM) is associated to SS3 . The FTC/FTM implementation is represented in fig. 2.21
while the fault scenario is reported in table 2.1.

Figure 2.21: FTC/FTM implementation of the 3-tanks heating system.

According to the severity of the faults, the task of the Global Resource and Reconfiguration Manager is either to define new local objectives or
to synthesize a new global objective by submitting new functions to all the subsystems of the
plant. This is dealt with by dividing the possible situations in six different scenarios depending
on the occurred fault and its severity and proposing possible reconfiguration strategies (see
fig. 2.22). In the following the points, highlighted above to express the supervisor specifica-
tions, are covered again by showing how they specialize for the present case.

Resource and Group selection: In order to enrich the proposed example a little, we suppose that two resources, R1 and R2, are available for running the control and estimation algorithms implemented to achieve the desired functionalities. Furthermore, the second resource R2 can be affected by a fault (loss of computation capability) and hence all the functionalities running on it must eventually be moved to the other resource. In this respect the double resource can be motivated by the need for hardware redundancy. As far as the group selection is concerned, the analysis of the system shows that the modules FTC1 and FTC2 are strongly interlaced (a change of WM in one of the modules is always joined to a similar change in the other) and hence a wise choice is to group them together. Following this idea a possible choice is to identify Group1 as FTC1, FTC2, R1 and Group2 as FTM, R2.

Figure 2.22: Reconfiguration strategy for the 3-tanks heating system.



FTC/M as DES: FTC1 can work in a nominal mode (wm1 ) or in two reconfigured modes (wm2
and wm3 ), representing respectively scenario 3 and scenario 4 remedial actions (see fig. 2.22).
Since the processes described in FTC2 are not affected by faults and possible reconfiguration of
FTC2 always follows reconfiguration of FTC1, the same WMs (wm1 , wm2 , wm3 ) are also asso-
ciated to FTC2. The Discrete Event system which governs FTC1 and FTC2 can be described as
follows:
• if the system is in scenario 1 or scenario 2, all the modules work in the nominal mode;

• if the system is in scenario 3, both the FTC modules should move to wm2 (new local
trajectory computation) and then move back to wm1 ;

• if the system is in scenario 4, both the FTC modules should move to wm3 (new global and
local trajectory computation) and then move back to wm1 .
Note that wm2 and wm3 are associated to tasks which involve just new reference computation, and this is why the FTC modules are moved back to the control working mode wm1 after the reconfiguration is carried out. For the sake of simplicity, we will assume that faults cannot occur when the working mode is wm2 or wm3. A pictorial sketch of the DES modelling FTC1 and FTC2 is presented in fig. 2.23. Note that the events which govern the transitions between the different states are faults (observable but not controllable) and WM changes (controllable).
As far as the description of FTM is concerned, it is supposed that it runs on R2 in case
the latter is working properly, and must be moved into R1 in case R2 is affected by faults.
This is precisely the case of resource sharing (and specifically cross-group resource sharing)
which have been treated in the previous subsection. In particular it has been shown that the
resource sharing involving FTM can be described by two DES, denoted by FTM/1 and FTM/2,
describing the functionality FTM running on R1 and on R2 respectively. Consequently, also
the nominal WM (wmn ) is split in two nominal WMs (wmn1 , wmn2 ) and two additional states,
denoted by sb1 and sb2, are introduced to denote if the functionality FTM is in standby on R1
or on R2. Resource R1 is assumed to be always available (no fault on it) and with enough room
to allocate all the processes. A sketch of the DESs describing FTM is shown in fig. 2.23.
Resources and Resource Monitors as DES: While the DES describing R1 and the associated resource monitor is trivial, as R1 is not affected by faults and is never busy, the discrete event description of R2 has two states, idle and fault, with a single non-controllable event which is the local fault. A sketch of this DES is again shown in fig. 2.23.
Reconfiguration specifications: first, we specify the controllable events which must be managed by the Global RRM and those which must be managed by the Group RRM.
Global RRM events: these are the events which are associated to re-allocation between re-
sources linked to different groups. In the present case just the events (sb1 , sb2 ) and (wmn1 ,
wmn2 ) must be managed by the Global RRM in case the resource R2 fails and the FTM func-
tionality must be moved on R1.
Group RRM events: these are all the other controllable events, namely (wm1 , wm2 , wm3 ) which
regards reconfiguration of FTC1 and FTC2.
The reconfiguration specifications can be given in terms of group and global specifications. As far as the group specification is concerned, the decision logic unit supervising the first group can be designed on the basis of the specifications expressed in fig. 2.24. Note that, in this reconfiguration policy, the priority of wm3 with respect to wm2 (due to the fact that the fault $P_1 = 0$ has been labelled as more severe than $\Delta Q_1$) has been respected, in the sense that in case of concurrent faults $P_1 = 0$ and $\Delta Q_1$ the priority is given to wm3. Note that the second group does not need a Group RRM since all its local reconfigurations are interlaced with those of the first group (change of WM

due to a reference computation) or are managed by the global RRM (loss of resource R2). As far as the global specifications are concerned, these are described in fig. 2.25, where it is clear that just the reconfiguration due to a loss of resource R2 is managed.
WM urgencies: The fault severity has been specified in tab. 2.1. In particular the most urgent
fault to be dealt with is given by P1 (loss of actuator).
WMs/Resources map: both the resources are able to run all the algorithms involved in the Fault
Tolerant Systems. In case the resource R1 is working properly, it is assumed that FTC1 and
FTC2 are located on R1 and FTM on R2. In case a fault is detected on R2 by the local resource
monitor then also the functionality FTM is moved on R1.

Figure 2.23: DES modeling of the 3-tank system.

The previous information represents all that is needed in order to design the Group and Global decision logic units. As far as the design of the Group 1 RRM is concerned, the first step regards the computation of the overall DES modelling the group. This can be easily achieved by considering the parallel interconnection of all the DESs describing FTC1, FTC2, FTM/1 and R1, namely
\[
G1 = \mathrm{FTC1} \parallel \mathrm{FTC2} \parallel \mathrm{FTM}/1 \parallel \mathrm{R1}\,.
\]
Hence the supervisor of G1 is built on the basis of the group specifications H presented above,
following the known supervision theory of the Discrete Event Systems (see Appendix A).
Once the Group Supervisor Machine has been designed, the first step for the design of the Global decision logic unit is to compute the DES modelling the supervised group. This can be done by computing the parallel interconnection between G1 and the designed supervisor, and then computing the supervisor machine on the basis of the global specification expressed above (fig. 2.25). It is worth stressing that all these steps, which starting from the specifications given above lead to the global and group decision logic machines, are automatically computable by means of standard procedures.

Figure 2.24: Group specifications for 3-tanks system.

Figure 2.25: Global specifications for 3-tanks system.

Fig. 2.26 shows the behavior of the Decision Logic Unit which supervises the whole system. In particular, the table shows all the possible strings (changes of WM) generated by the supervisor according to the actual WM and to the occurred event: for each state (in bold) the number of transitions and the possible events followed by the new state are shown. By exploration of this table it is possible to check that for each state there exists just one control action imposed by the supervisor, while no possible failures (which are uncontrollable events) are disabled. This means that the supervision strategy is unique.

2.5 A pilot plant: the two tank system


In this section we apply the design procedure explained above to a benchmark plant (see fig. 2.27) composed of two pumps with flow rates Q1 and Q2 and two tanks containing the liquid. The two tanks are connected through two redundant pipes with valves. The output flows of the two tanks are mixed through two other valves. The system has two level sensors (L1 and L2) and flow-rate sensors for Q1, Q2, Q12, QF1 and QF2.
The linearized equations of the system around an equilibrium position are
\[
\begin{aligned}
S\dot L_1 &= Q_1 + Q_{12} - Q_{F1} \\
S\dot L_2 &= Q_2 - Q_{12} - Q_{F2} \\
Q_{12} &= \frac{L_1 - L_2}{R_{12}}
\end{aligned}
\tag{2.1}
\]
with
\[
Q_{F1} = \frac{L_1}{R_1}
\tag{2.2}
\]

Figure 2.26: Behavior of the Decision Logic Unit 3-tanks system.



Figure 2.27: 2-tanks representation.



\[
Q_{F2} = \frac{L_2}{R_2}
\tag{2.3}
\]
where R12 is the throttling of valve V12 , R1 is the throttling of valve V1 , R2 is the throttling of
valve V2 and S is the section of the tank. This means that if the throttling is ∞ the valve is
closed.
The mathematical model of the system is then
\[
\begin{aligned}
S\dot L_1 &= -\frac{L_1}{R_1} + \frac{L_1 - L_2}{R_{12}} + Q_1 \\
S\dot L_2 &= -\frac{L_2}{R_2} - \frac{L_1 - L_2}{R_{12}} + Q_2\,.
\end{aligned}
\tag{2.4}
\]
We consider, as control inputs of the system, the variables Q1 , Q2 and, as additional input,
R12 . The controlled outputs of the system are:

\[
y_1 = L_1 + L_2\,, \qquad y_2 = \frac{L_1}{L_2}\,.
\tag{2.5}
\]
We want these two outputs to follow two desired set points, denoted respectively as $y_1^*$ and $y_2^*$.
According to (2.2) and (2.3), these outputs are proportional to the total flow rate at the output
of the two tanks and to the flow rate ratio between the two tanks (as R1 and R2 are assumed
constant). In particular note that we can rewrite the desired set points y1∗ and y2∗ as desired set
points $L_1^*$ and $L_2^*$ for the measured levels L1 and L2. Specifically, from (2.5) we have
\[
L_1^* = y_2^*\, L_2^*
\]
and thus
\[
L_1^* = \frac{y_1^*\, y_2^*}{1 + y_2^*}\,, \qquad L_2^* = \frac{y_1^*}{1 + y_2^*}\,.
\tag{2.6}
\]
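
The conversion (2.6) from output set points to level set points amounts to a couple of lines of code; the numerical values in the usage example below are hypothetical.

\begin{verbatim}
# Conversion (2.6): from output set points (y1*, y2*) to level set points.
def level_setpoints(y1_star, y2_star):
    L2_star = y1_star / (1.0 + y2_star)
    L1_star = y2_star * L2_star
    return L1_star, L2_star

L1_star, L2_star = level_setpoints(0.6, 2.0)        # hypothetical set points
print(round(L1_star, 3), round(L2_star, 3))         # -> 0.4 0.2
\end{verbatim}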

2.5.1 Fault scenario and diagnostic algorithms


We consider the following fault scenario:
• Fault on pump 2: pump 2 is stuck at a constant value $Q_2 = \bar Q_2$, namely it injects a constant flow into tank 2. This fault leads to a loss of controllability as it changes one of the structural constraints in eq. (2.4).

• Fault on the connection valve V12: the valve is stuck at a constant value $R_{12} = \bar R_{12}$, which can be ∞ (namely the valve fails in stuck-closed mode) or a finite positive constant.

• Leakage fault on tank 2: due to a hole in the tank, the dynamics of L2 is corrupted by a term $\delta Q_{F2}(L_2) = L_2/\delta$, where δ is the section of the hole. In other words there is an undesired outgoing flow from tank 2.

• Fault on level sensor of tank 2: the measure L2m of level L2 in tank 2 is corrupted by a
constant bias δL2 , i.e. L2m = L2 + δL2 .

• Fault on flow-rate sensor of tank 2: the measure Q2m of flow-rate Q2 in tank 2 is corrupted
by a constant bias δQ2 , i.e. Q2m = Q2 + δQ2 .

Diagnosis algorithms
The aim of this section is to give some guidelines about the generation of residual signals to detect and isolate the different faults of the scenario illustrated above. The first residual that can be generated is based on a test on the control loop of pump 2; in fact the flow Q2 from pump 2 is a controlled variable, hence it can be considered known as an internal state $Q_2^\star$ of our controller (in steady state, $Q_2^\star$ is equal to the reference for the flow from pump 2). Moreover, the flow Q2 is measured with a flow sensor. Considering the signal
\[
r_1(t) = Q_2^\star - Q_{2m}\,,
\tag{2.7}
\]
$r_1(t)$ is equal to zero in nominal conditions, while it is different from zero in case pump 2 is stuck or the measure $Q_{2m}$ is not correct. In other words the signal $r_1$ is a residual for fault $\bar Q_2$ and for fault $\delta Q_2$.
Let us assume that the valve V12 is monitored with a hardware electrical test (for example an electrical signal consistency test). This means that this test will result in a signal $r_2(t)$ which is able to detect a fault on valve V12 (more specifically, it is able to detect the situation in which the connection valve is stuck); i.e. $r_2(t)$ is a residual for the fault $\bar R_{12}$.
Consider now equation 2.3. It states that, in nominal conditions, the outgoing flow from tank 2 depends on the level of liquid L2 in tank 2 and on the throttling R2 of the outgoing valve. In case of leakage in tank 2, relation 2.3 clearly does not hold anymore, since there is also an outgoing flow due to the leakage. This means that the relation becomes
\[
\frac{L_2}{R_2} = Q_{F2} + \delta Q_{F2}\,.
\]
With this in mind it is immediate to see that the signal
\[
r_3(t) = \frac{L_{2m}}{R_2} - Q_{F2}
\tag{2.8}
\]
is equal to zero in nominal conditions, while it is different from zero in case of leakage in tank 2 or in case of a misreading of the level sensor; in other words $r_3(t)$ is a residual for faults $\delta Q_{F2}$ and $\delta L_2$.
As far as the detection of the sensor bias δL2 is concerned, we have two possible situations
depending on the R12 status. As a matter of fact, in the case R12 < ∞ we have
\[
Q_{12} = \frac{L_1 - L_2}{R_{12}}\,.
\]
Thus, since L1 and Q12 are measurable, it is possible to estimate L2 as

L̂2 = L1 − Q12 R12 . (2.9)

From this it is possible to generate a residual signal $r_4'(t)$ sensitive to $\delta L_2$ as
\[
r_4'(t) = |L_{2m} - \hat L_2|
\tag{2.10}
\]
and use the value of $L_{2m} - \hat L_2$ to reconstruct the sensor bias and thus to estimate the level value.
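
In the case R12 < ∞ this reduces to a purely algebraic test. A direct transcription of (2.9)–(2.10), with a hypothetical detection threshold and hypothetical numerical values in the usage example:

\begin{verbatim}
# Algebraic residual r4' (case R12 < infinity) and sensor-bias reconstruction.
def r4_prime(L1, Q12, L2m, R12, threshold=0.01):
    L2_hat = L1 - Q12 * R12          # level estimate (2.9)
    residual = abs(L2m - L2_hat)     # residual (2.10)
    bias_hat = L2m - L2_hat          # reconstructed bias, usable to correct L2m
    return residual > threshold, bias_hat

alarm, bias = r4_prime(L1=0.30, Q12=0.001, L2m=0.25, R12=80.0)
print(alarm, round(bias, 3))         # hypothetical numbers -> True 0.03
\end{verbatim}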
On the other hand in the case R12 = ∞ (valve closed), it turns out that it is possible to estimate
L2 by means of the observer
\[
\dot{\hat L}_2 = -\frac{1}{S R_{12}}\, L_{1m} - \frac{1}{S}\left(\frac{1}{R_2} - \frac{1}{R_{12}}\right)\hat L_2 + \frac{1}{S}\, Q_{2m} + K(L_{2m} - \hat L_2)
\]

\[
\begin{array}{c|ccccc}
      & \bar Q_2 & \bar R_{12} & \delta Q_{F2} & \delta L_2 & \delta Q_2 \\ \hline
r_1   & 1 & 0 & 0 & 0 & 1 \\
r_2   & 0 & 1 & 0 & 0 & 0 \\
r_3   & 0 & 0 & 1 & 1 & 0 \\
r_4'  & 0 & 0 & 0 & 1 & 0 \\
r_4'' & 0 & 0 & 0 & 1 & 1 \\
r_5   & 0 & 0 & 1 & 0 & 1
\end{array}
\]

Table 2.2: Residual matrix for the two-tanks system

where K is a suitably chosen constant output-injection gain which, by defining the error variable
\[
\tilde L_2 = \hat L_2 - L_{2m}
\tag{2.11}
\]
yields the following error dynamics
\[
\dot{\tilde L}_2 = -\left[\frac{1}{S}\left(\frac{1}{R_2} - \frac{1}{R_{12}}\right) + K\right]\tilde L_2 + \left(\frac{1}{R_2} - \frac{1}{R_{12}}\right)\delta L_2 + \delta Q_2\,.
\]

From this it is possible to generate a residual signal $r_4''(t)$ sensitive to $\delta L_2$ and to $\delta Q_2$ as
\[
r_4''(t) = |\tilde L_2|\,.
\tag{2.12}
\]
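
A possible discrete-time implementation of the observer-based residual $r_4''$ (forward-Euler integration of the observer above, with purely hypothetical values for S, R2, R12, K and the sampling time) is sketched below.

\begin{verbatim}
# Sketch: forward-Euler implementation of the observer-based residual r4''.
# All numerical values are hypothetical and used only for illustration.
S, R2, R12, K, dt = 0.0154, 120.0, 80.0, 2.0, 0.1

def make_r4_second():
    L2_hat = 0.0
    def step(L1m, L2m, Q2m):
        nonlocal L2_hat
        dL2_hat = (-1.0 / (S * R12)) * L1m \
                  - (1.0 / S) * (1.0 / R2 - 1.0 / R12) * L2_hat \
                  + Q2m / S + K * (L2m - L2_hat)
        L2_hat += dt * dL2_hat
        return abs(L2_hat - L2m)     # residual r4''(t) = |L2_hat - L2m|
    return step

r4_second = make_r4_second()
for _ in range(5):
    print(round(r4_second(L1m=0.30, L2m=0.20, Q2m=0.001), 4))
\end{verbatim}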

Note that it makes sense to consider the two algorithms used to estimate the sensor fault and to reconstruct the level measure as characterized by different reliability factors. As a matter of fact, the algorithm to be run in the case R12 < ∞ is expected to have a higher confidence level with respect to the one to be used if R12 = ∞, as it is based on algebraic physical relations. This, however, should be validated by simulation and experimental results.
Consider now the flow balance in tank 2 in steady-state conditions:
\[
Q_{F2} = Q_2 + Q_{12} + \delta Q_{F2}\,;
\]
by hypothesis the measures of $Q_{F2}$ and $Q_{12}$ are available and reliable, hence it is possible to use this analytical redundancy to generate a last residual $r_5(t)$, sensitive just to the fault on the flow-rate sensor ($\delta Q_2$) and to the leakage $\delta Q_{F2}$, as
\[
r_5(t) = Q_{F2m} - Q_{12m} - Q_{2m}\,.
\tag{2.13}
\]

All these considerations lead to the residual matrix in table 2.2. It is easy to verify just by
inspection of table 2.2 that with this set of residual signals the system is fully detectable and
isolable with respect to the set of faults considered.
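
The detectability and isolability claim can be checked mechanically: every fault signature (a column of table 2.2) must be nonzero and all signatures must be pairwise distinct, as in the following sketch (the signatures are transcribed from the table).

\begin{verbatim}
# Check on table 2.2: each fault is detectable (nonzero signature) and the
# fault set is isolable (all signatures pairwise distinct).
SIGNATURES = {                       # fault -> (r1, r2, r3, r4', r4'', r5)
    "Q2_stuck":  (1, 0, 0, 0, 0, 0),
    "R12_stuck": (0, 1, 0, 0, 0, 0),
    "dQF2":      (0, 0, 1, 0, 0, 1),
    "dL2":       (0, 0, 1, 1, 1, 0),
    "dQ2":       (1, 0, 0, 0, 1, 1),
}

detectable = all(any(sig) for sig in SIGNATURES.values())
isolable = len(set(SIGNATURES.values())) == len(SIGNATURES)
print(detectable, isolable)          # -> True True
\end{verbatim}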

2.5.2 System decomposition


From the discussion presented in the previous section, the measures of the level and of the incoming flow in tank 2 are of crucial importance both for the control and the diagnosis tasks. For this reason it comes natural to design two fault tolerant measurement modules (FTM) to manage, at tank 2 level, the measures of L2 and Q2 respectively. The aim of these two modules is the detection and isolation of faults $\delta L_2$ and $\delta Q_2$ and, in case of fault, the estimation of the bias superimposed on the corrupted measure and the estimation of the real value with a certain level of confidence.

Hence the signals issued by the two FTMs are the estimate of the measured variable with its confidence level (in the nominal case, the sensor reading with a confidence close to 100%) and the diagnostic signals concerning the two sensors.
Concerning the leakage fault on tank 2 ($\delta Q_{F2}$), this is a fault which influences the local functionality of tank 2. From a mathematical point of view, this fault introduces a change in the parameters of the system. In fact, from (2.4) it turns out that in this failure mode the system dynamics modifies as
\[
\begin{aligned}
S\dot L_1 &= -\frac{L_1}{R_1} + \frac{L_1 - L_2}{R_{12}} + Q_1 \\
S\dot L_2 &= -\frac{L_2}{R_2} - \frac{L_1 - L_2}{R_{12}} + Q_2 + \delta Q_{F2}(L_2)
\end{aligned}
\]
namely
\[
\begin{aligned}
S\dot L_1 &= -\frac{L_1}{R_1} + \frac{L_1 - L_2}{R_{12}} + Q_1 \\
S\dot L_2 &= -\frac{L_2}{R_2} - \frac{L_1 - L_2}{R_{12}} + Q_2 + \frac{L_2}{\delta}\,.
\end{aligned}
\tag{2.14}
\]
As far as this fault is concerned, it is hence possible to isolate and accommodate it simply by reconfiguring the control over tank 2. For this reason we also propose the introduction of a fault tolerant control module at tank 2 level. This FTC module should be dedicated to isolating and accommodating the leakage fault without changing global objectives and performances.
Let us now consider the fault on the pump of tank 2 (namely $Q_2 = \bar Q_2$). It is easy to understand that this fault is crucial: in fact it corrupts the global functionality of the whole system, leading to a loss of controllability. In other words, accommodating this fault will imply a change of the global objectives. For this reason it will be considered at a higher level; more in detail, an FTC module will be designed at the whole-system level in order to isolate and accommodate the effect of the fault on pump 2.
The last fault to consider is the fault on valve V12. As explained previously, the detection of this fault is mainly at hardware level. Moreover, from figure 2.27 it is easy to see that also the accommodation can be managed at hardware level thanks to the hardware redundancy $V_{12}$–$V_{12}'$. For this reason the isolation of the fault will be demanded to a resource monitor, while the accommodation will be managed by a global resource manager at the top level.
In view of all these considerations we propose the decomposition of the system sketched in
figure 2.28.

Internal structure of supervisors


Each FTC/FTM will contain a local reconfiguration manager (LRM), namely a supervisor dedicated to the logical isolation and accommodation of faults on the basis of the information provided by the local diagnosers (with their confidence), of the global performance indexes and of external commands. As explained, local supervisors can be composed of a logical FDI unit, which performs logical isolation of faults by elaborating the residual signals and their confidences, and an event generator unit that generates requests of working mode changes on the basis of the information from the FDI unit and of the performance requests from the higher levels. The last component of the local reconfiguration manager is the logical unit that orchestrates the changes of the local working modes according to the requests of the event generator. Clearly, a local supervisor may contain all these three units or just some of them.

Figure 2.28: 2-tanks decomposition in FTC and FTM modules.

Figure 2.29: Internal structure of supervisors.



For our system, the internal structure of the supervisors is depicted in fig. 2.29. The lowest-level
supervisors are the LRM of FTC2 and the LRMs of FTM(L2 ) and FTM(Q2 ). As explained in
the previous section, FTC2 manages the leakage fault on tank 2, isolating and accommodating
this fault. For this reason all three units are present within its supervisor to orchestrate the local working modes. On the contrary, in the LRMs of FTM(L2) and FTM(Q2) just the FDI units are present, to manage the isolation of the sensor faults on the basis of the estimation of the measures. In this case, since no explicit reconfiguration is necessary, neither an event generator nor a decision logic unit is needed.
All the diagnostic signals given by the low level FDI units are sent to the higher level LRM.
This one manages the isolation and accommodation of the fault on pump 2. Since, as explained
previously, the accommodation of this fault requires a change in global objectives, all signals
from lower levels are required. An event generator unit within this supervisor decides, when-
ever the pump 2 is stuck, on the basis of the status of the low level FTC and FTM, which
reconfiguration action should be issued. For this reason in this LRM all the three units must be
present.
A different discussion concerns the global resource manager (GRRM). This global supervisor has two different tasks: the first one is to manage hardware reconfigurations on the basis of the information from the resource monitors; the second task is to enable/disable possible local working modes on the basis of their resource needs, of the resource status and of the performance requirements from the external operator. For this reason it contains only a decision logic unit, which manages reconfigurations in case of loss of performance or hardware reallocation, and decides performance priorities according to external commands.

2.5.3 An overview of possible working modes


Nominal working modes
Let us first of all consider nominal operating conditions. Assume that the throttling R12 of valve V12 is constant and not controlled. Using the flow-rates Q1 and Q2 as control variables, from eq. 2.4 it is immediate to see that the system is controllable and that the control objectives specified above can be achieved. A particular case is the one characterized by R12 = ∞, i.e. the two tanks are decoupled. In this case the model of the system becomes:

\[
\begin{aligned}
S\dot L_1 &= -\frac{L_1}{R_1} + Q_1 \\
S\dot L_2 &= -\frac{L_2}{R_2} + Q_2\,.
\end{aligned}
\tag{2.15}
\]
It is possible to achieve the control objectives using two PI controllers, controlling the levels L1 and L2 by means of Q1 and Q2 respectively, in order to track the references specified in relations 2.6. We will refer to this working mode as WM0.
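
A minimal simulation sketch of WM0 (two independent PI loops on the decoupled model (2.15)) is reported below; the plant and controller parameters are hypothetical and only meant to illustrate the structure of the loop.

\begin{verbatim}
# Sketch of WM0: two independent PI loops on the decoupled model (2.15).
# All numerical values (S, R1, R2, gains, sampling time) are hypothetical.
S, R1, R2, dt = 0.0154, 100.0, 120.0, 0.1
Kp, Ki = 0.05, 0.02

def simulate_wm0(L1_star, L2_star, steps=2000):
    L = [0.0, 0.0]                   # levels L1, L2
    I = [0.0, 0.0]                   # integrator states of the two PI controllers
    R = [R1, R2]
    ref = [L1_star, L2_star]
    for _ in range(steps):
        for i in range(2):
            e = ref[i] - L[i]
            I[i] += Ki * e * dt
            Q = max(0.0, Kp * e + I[i])              # pump flow rate (non-negative)
            L[i] += dt * (-L[i] / R[i] + Q) / S      # Euler step of S*Ldot = -L/R + Q
    return L

print([round(x, 3) for x in simulate_wm0(0.4, 0.2)])   # levels approach [0.4, 0.2]
\end{verbatim}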
In case R12 < ∞ the model of the system is represented by equations 2.4. The system is
coupled and must be controlled in order to achieve

\[
L_1 = L_1^*\,, \qquad L_2 = L_2^*
\]
where $L_1^*$ and $L_2^*$ are given in equations 2.6. To satisfy these objectives we can implement two control architectures: use an optimal control strategy or, considering that $Q_{12} = \frac{L_1 - L_2}{R_{12}}$

is measured, decouple the system through a feed-forward action and control the system in a
decoupled way. We will refer to this working mode as WM0I .
These two nominal working modes represent two different functionalities of the system,
but they can also be used in order to augment the detectability of the system with respect to
the faults considered.
Reconfiguration of Q2 in case of decoupled tanks
First of all consider the fault on pump 2, i.e. the case in which pump 2 is stuck at a constant value such that $Q_2 = \bar Q_2$. If R12 = ∞ the system is represented by equations 2.15 and the two tanks are decoupled. In this case we lose a control variable ($Q_2 = \bar Q_2 = $ const.), i.e. one degree of freedom; we must then decide which of the two control objectives represented by eq. 2.5 we want to satisfy and reconfigure the reference trajectory for L1 accordingly. Since the incoming flow in tank 2 is constant, the level in this tank will stabilize to a constant value $\bar L_2$ which can be measured. Hence
• if we prefer to satisfy
\[
L_1 + L_2 = y_1^*
\]
then we must compute the new trajectory $L_1^*$ as
\[
L_1^* = y_1^* - \bar L_2\,.
\]
We will call this working mode WM1.

• if we prefer to satisfy
\[
\frac{L_1}{L_2} = y_2^*
\]
then we must compute the new trajectory $L_1^*$ as
\[
L_1^* = y_2^*\, \bar L_2\,.
\]
We will call this case WM2.


In both cases we do not need an estimate of $\bar Q_2$ because we use the measure $\bar L_2$ of $L_2$. It is important to notice that both WM1 and WM2 are feasible if and only if $\bar L_2 < L_{2max}$, where $L_{2max}$ is the maximum level of tank 2 that avoids overflow. In these two cases the control law does not need to be changed.
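
The reference reconfiguration of WM1/WM2 (together with the feasibility check on the overflow level) reduces to a few lines; L2max and the numbers in the usage example are hypothetical.

\begin{verbatim}
# Sketch: reference reconfiguration for WM1/WM2 (pump 2 stuck, decoupled tanks).
def reconfigure_L1_star(y1_star, y2_star, L2_bar, L2_max, prefer="y1"):
    if L2_bar >= L2_max:
        return None                          # neither WM1 nor WM2 is feasible
    if prefer == "y1":                       # WM1: keep L1 + L2 = y1*
        return y1_star - L2_bar
    return y2_star * L2_bar                  # WM2: keep L1 / L2 = y2*

print(round(reconfigure_L1_star(0.6, 2.0, L2_bar=0.25, L2_max=0.62), 3))   # WM1 -> 0.35
print(round(reconfigure_L1_star(0.6, 2.0, L2_bar=0.25, L2_max=0.62,
                                prefer="y2"), 3))                          # WM2 -> 0.5
\end{verbatim}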
Reconfiguration of Q2 in case of coupled tanks
If R12 < ∞ the system is represented by eq. 2.4. Let us consider R12 not controlled, i.e. $R_{12} = \bar R_{12} = $ const. We must decide which of the two control objectives we want to satisfy. Let us suppose we want to satisfy
\[
L_1 + L_2 = y_1^*\,;
\]
using the measure of $L_1$, let us define the new state variable
\[
\mathbf{L}_1 = L_1 - R_{12}\bar Q_2\,.
\tag{2.16}
\]

The equations
\[
\begin{aligned}
S\dot L_1 &= -\frac{L_1}{R_1} + \frac{L_1 - L_2}{R_{12}} + Q_1 \\
S\dot L_2 &= -\frac{L_2}{R_2} - \frac{L_1 - L_2}{R_{12}} + Q_2\,,
\end{aligned}
\]

become
\[
\begin{aligned}
S\dot{\mathbf L}_1 &= -\frac{\mathbf L_1}{R_1} - \frac{R_{12}\bar Q_2}{R_1} + \frac{\mathbf L_1}{R_{12}} + \bar Q_2 - \frac{L_2}{R_{12}} + Q_1 \\
S\dot L_2 &= -\frac{L_2}{R_2} - \frac{\mathbf L_1}{R_{12}} + \frac{L_2}{R_{12}}\,.
\end{aligned}
\]
Defining
\[
d = \frac{1}{S}\left(\frac{1}{R_1} - \frac{1}{R_{12}}\right) R_{12}\bar Q_2
\]
as an unknown term, it is possible to write the system as
\[
\begin{bmatrix} \dot{\mathbf L}_1 \\ \dot L_2 \end{bmatrix}
=
\begin{bmatrix}
-\frac{1}{S}\left(\frac{1}{R_1} - \frac{1}{R_{12}}\right) & -\frac{1}{S R_{12}} \\[1mm]
-\frac{1}{S R_{12}} & -\frac{1}{S}\left(\frac{1}{R_2} - \frac{1}{R_{12}}\right)
\end{bmatrix}
\begin{bmatrix} \mathbf L_1 \\ L_2 \end{bmatrix}
+
\begin{bmatrix} \frac{1}{S} \\ 0 \end{bmatrix}
\left[\,Q_1 - d\,\right].
\tag{2.17}
\]

Defining now
\[
\bar y_1^* = y_1^* - R_{12}\bar Q_2
\]
and the control error as
\[
e = y - \bar y_1^* = L_2 + L_1 - y_1^*
\]
where $y = L_2 + \mathbf{L}_1$, it is easy to see that it is possible to design an error-feedback integral control law for system 2.17 (the state of the obtained system is not fully available, since a state feedback would require an estimate of $\bar Q_2$) by which the objective $y_1^*$ is enforced without estimating $\bar Q_2$. We will call this working mode WM3.
Similarly, by forcing the objective $y_2^*$ instead of $y_1^*$ and following a procedure similar to the one above, we can define a working mode WM4 in which, using an error-feedback integral control law for the obtained system, the objective $y_2^*$ is enforced without estimating $\bar Q_2$.
Consider the same change of variables expressed by 2.16. In the same situation, if we estimate $\bar Q_2$ the term d is no longer unknown, and it is possible to use a feed-forward action. Moreover, in this case we know all the state variables, because we know $L_1$, $L_2$ and $\bar Q_2$, so a state feedback is possible, with better performance than the error feedback. If we choose to achieve objective $y_1^*$ the working mode will be denoted by WM5, while if objective $y_2^*$ is satisfied the working mode is WM6.
Now let us suppose to control R12 . In this way we increase the degrees of freedom of
control, i.e. it is possible to satisfy again the two objectives, because we lose the control variable
Q2 but we add the control variable R12. The value $R_{12}^*$ of $R_{12}$ at each instant can be computed
12
starting from system 2.17, computing the steady-state of L2 and using the constraints y1∗ and
y2∗ . Define
ξ = L1 + L2 − y ∗1 ;
from
1 1 R12 R2
L̇2 = − L1 − L2
SR12 S R12 R2
we obtain µ ¶
˙ξ = − 1 (ξ − L2 + y ∗ ) − 1 R12
− 1 L2 .
1
SR12 SR12 R2
Computing the zero dynamic as ξ˙ = ξ = 0, we obtain the steady-state for L∗2 compatible with
the satisfaction of control objective y1∗ :
y1∗ − R12 Q2
L∗2 = ³ ´. (2.18)
1 − RR122
− 1

Introducing now the constraint given by the objective $y_2^*$, we have
$$L_2^* = \frac{y_1^*}{1 + y_2^*}\,. \qquad (2.19)$$
Forcing the steady state of $L_2$ obtained from relation 2.18 to be equal to 2.19, we have that
$$\frac{y_1^*}{1 + y_2^*} = R_2\,\frac{y_1^* - R_{12}\bar{Q}_2}{2R_2 - R_{12}}\,. \qquad (2.20)$$
From 2.20 it is possible to compute the value that $R_{12}$ should assume in case of $Q_2 = \bar{Q}_2$ in order to satisfy both objectives $y_1^*$ and $y_2^*$:
$$R_{12}^* = \frac{y_1^*(1 - y_2^*)R_2}{y_1^* - \bar{Q}_2 R_2 (1 + y_2^*)}\,.$$
If the value $R_{12}^*$ is feasible (i.e. within the range of values of $R_{12}$), we can control the opening of the valve to this value. We will refer to this working mode as WM7.
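A minimal sketch of this computation is given below: it evaluates $R_{12}^*$ from the expression above and checks whether it lies within the admissible range of the valve. Function name, argument names and numbers are illustrative assumptions.

    def compute_R12_star(y1_star, y2_star, Q2_bar, R2, R12_min, R12_max):
        """Value the valve resistance should take (WM7) when pump 2 is stuck at Q2_bar.

        Returns R12* if it is feasible, None otherwise (the supervisor then
        falls back on WM5/6 or WM3/4).
        """
        den = y1_star - Q2_bar * R2 * (1.0 + y2_star)
        if den == 0.0:
            return None
        R12_star = y1_star * (1.0 - y2_star) * R2 / den
        return R12_star if R12_min <= R12_star <= R12_max else None

    # Purely illustrative numbers
    print(compute_R12_star(y1_star=0.6, y2_star=0.5, Q2_bar=1e-4,
                           R2=1000.0, R12_min=200.0, R12_max=2000.0))   # ~666.7 (feasible)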
Reconfiguration of R12
Consider now the fault on valve $V_{12}$. If the valve is stuck closed, i.e. $R_{12} = \infty$, we can switch to valve $V_{12}^I$ and use $R_{12}^I$, thanks to the parallel hardware redundancy. This working mode is denoted with WM8.
On the contrary, if the valve is stuck open, i.e. $R_{12} < \infty$, we have again the system 2.4 and we can use the optimal control law of WM0$^I$. The estimation of $R_{12}$ comes in this case from
$$Q_{12} = \frac{L_1 - L_2}{R_{12}}\,;$$
this working mode will be denoted by WM9.
Reconfiguration of leakage δQF 2
A leakage fault on tank 2 implies a parametric change in the model of the system, as shown in eq. 2.14. This parametric change can be tolerated locally by making use of a robust control law. In this sense it is possible to define a reconfigured working mode WM10 in which the local control law on $Q_2$ is switched from the nominal control law to the robust one.
Reconfiguration of sensor faults δQ2 and δL2
As explained in previous sections, sensor faults are managed by FTMs which estimate the variables of interest, hence no explicit reconfiguration is needed for these two faults.

2.5.4 Working Mode Decision Logic


In this section we will briefly explain how the decision logic works to set the allowed working mode sequences. As explained previously, the low level LRM in FTC2 (see fig. 2.29) is required to accommodate the leakage fault locally, hence it will only be able to force the working mode WM10. A more complicated scenario has to be taken into account when we consider the reconfiguration managed by the higher level LRM, which is dedicated to setting the proper working mode in case pump 2 gets stuck.
When the nominal working mode is WM0 the valve V12 is closed and cannot be opened, so the allowed reconfigured WMs are only WM1 and WM2, to which the system switches if a fault on pump 2 occurs. The choice between WM1 and WM2 depends on the objective we decide to satisfy.
On the contrary, when the nominal working mode is WM0$^I$ (i.e. the valve V12 is open), if a fault on pump 2 occurs the allowed WMs are:

• WM7, if we can control R12 and the required steady state for R12 is within a feasible range. This reconfiguration strategy is to be preferred because it allows satisfying both objectives y1* and y2*. It is important to stress that this WM requires a reliable estimate of Q2, hence if the measure of this variable has a low confidence this strategy should not be chosen.

• WM5/6, if we can estimate Q2 with a high confidence; these strategies allow satisfying just one of the control objectives. Moreover, if the estimate of Q2 (from the FTM) has a low confidence, this strategy should not be chosen.

• WM3/4, if we cannot estimate Q2 with satisfactory accuracy; these strategies allow satisfying one of the control objectives with lower dynamic performance.

The Global Resource and Reconfiguration Manager decides which control objective must be satisfied according to the performance objectives, decides whether the confidence level of the estimates is sufficiently high, and assigns priorities among WM3/4, WM5/6 and WM7 as explained above and sketched below. Moreover, the GRRM manages the fault on valve V12 using a hardware redundancy policy.
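The priority policy described above can be condensed into a simple selection rule. The following sketch is only a hypothetical rendering of this decision logic (function name, confidence threshold and interface are assumptions and do not correspond to the actual supervisor code).

    def select_working_mode(valve_open, R12_star_feasible, q2_confidence,
                            conf_threshold=0.9, prefer_first_objective=True):
        """Return the working mode to force after pump 2 gets stuck."""
        if not valve_open:
            # Nominal WM0: valve V12 closed and not usable, only WM1/WM2 allowed.
            return "WM1" if prefer_first_objective else "WM2"
        # Nominal WM0': valve V12 open.
        if R12_star_feasible and q2_confidence >= conf_threshold:
            return "WM7"                                         # both objectives satisfied
        if q2_confidence >= conf_threshold:
            return "WM5" if prefer_first_objective else "WM6"    # state feedback
        return "WM3" if prefer_first_objective else "WM4"        # error feedback

    print(select_working_mode(valve_open=True, R12_star_feasible=False,
                              q2_confidence=0.95))               # -> WM5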

2.6 Conclusions
In this chapter a possible distributed architecture for fault tolerant control of complex systems has been introduced. This architecture is modular, and the different modules are divided following functional reasoning. The different functions which can be degraded by faults are associated with fault tolerant (control/measure) modules that are able to detect the fault and counteract its effect in order to assure certain performances in the execution of the required function. Each module is provided with a reconfiguration manager, namely a supervisor which orchestrates the reconfiguration within the module. All the modules are orchestrated together via high level supervisors, the (global/group) resource and reconfiguration managers, whose aim is also to allocate the functions optimally over the physical resources, considering that these have a limited capacity and can also be affected by faults.
In this chapter some ideas were also presented on the use of classical failure analysis tools to design the structure of FTC/M modules within the IFATIS architecture. More in detail, we have shown how tools like Fault Tree Analysis (FTA) and Structural Analysis (SA) are useful to divide the system into functionalities and sub-functionalities. To each of them an FTC/M module is associated, achieving modularity and hierarchy. These failure analysis tools have proved to be useful in order to design the structure of the local supervisor of each module. Moreover, it has been shown what information is required to design the local and global supervisors, and how the theoretical machinery of discrete event system supervision can be used in order to design the dynamical part of the supervisors.
Chapter 3
Reliability of complex diagnosis systems

This chapter deals with the description of a procedure to evaluate the reliability of a complex diagnostic system. By a four-step procedure it will be shown how to compute a reliability function associated to the complex system, starting from the statistical description of the possible faults and from the reliability of the diagnostic subsystems devoted to fault detection. The example of the common rail system, taken from the automotive field, is used to illustrate the procedure.

3.1 Introduction
The aim of this chapter is to introduce a framework and a procedure for the reliability com-
putation of complex diagnosis systems using statistical tools. It enriches the tools available
to engineers for analysis and design of diagnosis systems. Generally speaking a possible ap-
proach to the design of complex diagnosis and reconfiguration systems is illustrated in [10]
(see also [50]) where the whole design procedure is divided in two parts regarding respectively
the analysis of the diagnostic system and the design of diagnosis and reconfiguration algo-
rithms. The first step of the analysis procedure is the fault modeling. A Failure Modes and
Effects Analysis (FMEA) is normally used to describe the system in terms of failures modes
and functional discrepancies. Then a fault propagation analysis investigates how direct fault
effects propagate through the functional system. This analysis leads to a severity assessment of
the possible faults and to a reverse propagation analysis which makes it possible to find where and how to detect and stop faults. The last step of the analysis procedure amounts to selecting remedial actions for all the faults considered in the design and represents a key phase for the effectiveness of the whole diagnostic system. As far as the design part is concerned, it is usually divided into three steps which involve respectively the detector design, namely the design of the
Fault Detection and Isolation (FDI) algorithms, the effector design (namely the design of the
reconfiguration procedures) and the supervisor design.
Our goal is to enrich the general framework described above with a procedure for the reli-
ability computation of the complex diagnostic system. Into this framework one can cast [109, 110]
where Markov Chains are used to perform reliability analysis of Fault Tolerant Control (FTC)

systems. For the purpose of this work, a complex diagnostic system is thought of as a system composed of a number of elementary subsystems affected by a variety of possible faults, and a number of diagnostic algorithms which are simultaneously running in order to generate residual signals which are sensitive to one or more faults. In this framework we are interested in developing a procedure that quantifies the reliability of the overall diagnosis system in terms of the capability of not generating false alarms and missed diagnoses, starting from the elaboration of the residual signals generated by the diagnostic algorithms.
It must be noted that different factors influence the reliability of a complex system where
several faults can arise at a certain time and where several diagnosis algorithms are simultane-
ously running to detect faults. A first factor is that not all the faults may have the same impact
on the reliability of the whole system as they are usually characterized by different occurrence
rates and different severity levels. A second factor is represented by the features of each di-
agnosis algorithm running in the diagnostic system which, since it is designed using different
strategies such as hardware or analytical redundancy and different techniques, is character-
ized by its own reliability in terms of capability of detecting occurred faults and avoiding false
alarms. A further aspect which is distinctive in complex systems is given by the presence of
fault propagation phenomena intended as the possibility that the occurrence of a specific fault
can generate different failures or spoil the features of diagnosis algorithms. The framework
proposed is able to capture all these aspects as it is based on a statistical description of the ex-
ogenous faults and of the diagnostic algorithms and on the use of propagation rules. This goal
is achieved by suitably adapting the mathematical tools presented in [9] to the case of diagnos-
tic systems. In particular the reliability computation is based on a four steps procedure which,
starting from the statistical description of the exogenous faults, the reliability of diagnosis sub-
systems and the description of the complex system, yields a measure of the reliability of the
overall system intended as capability of detecting faults by processing the available residual
signals without generating false alarms.
The procedure proposed in this chapter can be a useful tool both for the off-line design of
the diagnostic system and for the on-line design of the FDI unit. As far as the usefulness of the proposed approach for the off-line design of the diagnostic system is concerned, it is worth noting that the
procedure for predicting the reliability of the system given the reliability of the physical com-
ponents and of the diagnostic algorithms lends itself to be used as a tool for identifying the op-
timal dimensioning of the physical components subject to faults and of the diagnostic algorithms
in order to achieve a prescribed reliability for the complex system. To this respect the proce-
dure here presented can be seen as an interesting tool for considering reliability as a criterion
which underlies the design of a diagnostic system, by solving the typical tradeoff regarding the
”quality” of the physical components and of the diagnostic algorithms composing the complex
system and their ”costs” in terms of economical impact, computational burden, etc. This fea-
ture will be formalized by providing, as an outcome of the statistical analysis, a Hazard matrix
(see [46]) well-suited for checking reliability specifications.
On the other hand the proposed approach represents an interesting tool also for the on-
line design of the FDI unit. As a matter of fact, as clarified throughout the chapter, one of
the sub-products of the statistical analysis presented is to generate a statistical residual matrix
namely a matrix which has as many rows as the number of possible faults, as many columns
as the number of residual signals and whose element in the i-th row and j-th column is a real
number representing the probability that if the j-th residual signal arises then the i-th fault
happened. In this regard the analysis here presented is also useful in order to implement an on-line FDI unit based on statistical considerations, as it makes it possible not only to detect and possibly isolate faults from the joint analysis of the residual signals, but also to come out with
a measure of the probability of the right detection and isolation. This feature, in the cases in
which the deterministic isolation of faults is not possible due to a small number of independent
residual signals with respect to the possible faults, allows for a statistical isolation of the faults
determining which fault more likely happened.

3.2 Statistic Tools


The goal of this section is to briefly present the statistical tools used for reliability analysis. A complete treatment of this subject can be found in [9].
Let us assume that n statistically independent items are put into operation at time t = 0 under the same conditions, and that at time t a number $\bar{\nu}(t)$ of those items have not yet failed. It is easy to see that $\bar{\nu}(t)$ is a decreasing step function. The time instants $t_1, \ldots, t_n$ at which $\bar{\nu}(t)$ changes its value are the observed failure-free operating times of the n items.
For this reason the failure-free operating time of one of the n items is a random variable $\tau$ whose realizations are the values $t_1, \ldots, t_n$ described above.
The expression
$$\hat{E}(\tau) = \frac{t_1 + \ldots + t_n}{n}$$
is the empirical expected value of $\tau$.
The function
$$\hat{R}(t) = \frac{\bar{\nu}(t)}{n}$$
is the empirical reliability function and converges to the reliability function $R(t)$ for $n \to \infty$.
The empirical failure rate is defined as
$$\hat{\lambda}(t) = \frac{\bar{\nu}(t) - \bar{\nu}(t + \delta t)}{\bar{\nu}(t)\,\delta t}$$
which is the ratio of the items failed in the interval $(t, t + \delta t]$ to the items that have not yet failed at time $t$. It is very simple to see that
$$\hat{\lambda}(t) = \frac{\hat{R}(t) - \hat{R}(t + \delta t)}{\hat{R}(t)\,\delta t}\,.$$
For $n \to \infty$ and $\delta t \to 0$, $\hat{\lambda}(t)$ tends to the failure rate
$$\lambda(t) = -\frac{dR(t)}{dt}\,\frac{1}{R(t)}\,.$$
The equation above shows that the failure rate $\lambda(t)$ fully determines the reliability function; in fact $R(0) = 1$ yields
$$R(t) = e^{-\int_0^t \lambda(x)\,dx}\,.$$
In many applications the failure rate can be assumed to be nearly constant for $t \geq 0$:
$$\lambda(t) = \lambda$$
so that
$$R(t) = e^{-\lambda t}$$
which implies that the failure-free operating time $\tau$ is exponentially distributed.
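The quantities introduced in this section can be evaluated directly from observed failure-free operating times. The following Python sketch (purely illustrative data and function names) computes the empirical reliability function and the constant-failure-rate model R(t) = e^{-λt}.

    import math

    def empirical_reliability(failure_times, t):
        """R_hat(t): fraction of the n items still working at time t."""
        n = len(failure_times)
        return sum(1 for ti in failure_times if ti > t) / n

    def reliability_const_rate(lam, t):
        """R(t) = exp(-lambda * t) for a constant failure rate lambda."""
        return math.exp(-lam * t)

    # Illustrative data: observed failure-free operating times of n = 6 items
    times = [120.0, 340.0, 460.0, 500.0, 890.0, 1300.0]
    lam_hat = 1.0 / (sum(times) / len(times))    # 1 / empirical expected value of tau
    print(empirical_reliability(times, 480.0))   # 0.5
    print(reliability_const_rate(lam_hat, 480.0))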
3.3 A Framework for Reliability of Diagnosis Systems


As enlightened in [9], reliability is a characteristic of an item, expressed by the probability that
the item will perform its required function under given conditions for a stated time interval.
With this in mind, it is obvious that to make sense, a numerical statement of reliability must be
accompanied by the definition of the item, the required function, the operating conditions and
the time interval considered. The goal of this section is to describe a framework for the analysis
of the reliability of complex diagnosis systems. This means that we will introduce the item and
its required function here considered.
We regard as complex diagnosis system a set S = {Si , i = 1, . . . , n} of subsystems function-
ally interconnected in an arbitrary way, supervised by another set D = {Dj , j = 1, . . . , r} of
diagnosis subsystems, with S and D affected by a set of faults F which will be better specified
later. The sets S and D are supposed to be interconnected by logical and functional intercon-
nections depending on the specific application. Each subsystem Si ∈ S can be affected by a
certain number $n_i$ of faults which will be denoted by $f_j^{S_i}$, with $j = 1, \ldots, n_i$. Moreover, each subsystem $D_j \in D$ is supposed to generate a residual signal $y_j$ sensitive to one or more faults, and it is supposed to be affected itself by two different kinds of failures denoted by $f_1^{D_j}$ and $f_2^{D_j}$. The first ($f_1^{D_j}$) models a bad diagnosis by $D_j$ (namely it models false alarms issued by $D_j$), while the second ($f_2^{D_j}$) models a missed diagnosis by $D_j$ (namely it models the possibility that $D_j$ does not detect a fault). Then the set $F$ collects $N = \sum_{i=1}^{n} n_i$ faults $f_j^{S_i}$, $i = 1, \ldots, n$, $j = 1, \ldots, n_i$, on the set $S$ and $2r$ failures $(f_1^{D_j}, f_2^{D_j})$ on the set $D$.
The information on how the $N$ faults acting on $S$ influence the residual signals $y_k$, $k = 1, \ldots, r$, generated by $D$, is contained in the so-called residual matrix. The latter is defined as the matrix
with N rows and r columns whose element in the i-th row and j-th column is one if the i-th
fault affects the j-th residual and zero otherwise.
In summary, the complex system is considered as $n$ "functional" subsystems influenced by $N$ different faults, and $r$ diagnosis subsystems influenced by $2r$ failures representing a wrong detection behavior. The situation is sketched in Figure 3.1 (a).
The procedure proposed for the computation of reliability of the complex systems relies on
four different steps which are described in the following.

3.3.1 Step 1. Description of Elementary Cells (definition of the item)


The first step in setting the framework for the computation of reliability of the complex system
is to divide the latter into a number of Elementary cells, suitably interconnected with each other, for which the computation of the specific reliability turns out quite simple. To define what an elementary cell is we make use of the concept of failure mode. A failure mode is the effect of a fault on the system, i.e. what we can observe. The distinction between fault and failure mode is a classical way to manage multiple faults in systems analysis, because it makes it possible to consider modes separately instead of a single fault affecting different subsystems. The definition of the elementary cell is the following (see also Figure 3.1 (b)).

Definition 3.1 (Elementary Cell). An Elementary cell is defined as

• a subsystem Si ∈ S, subject to a single failure mode fjSi ;

• a diagnosis subsystem Dk , affected by (f1Dk , f2Dk ), able to detect (and not necessarily to isolate)
fjSi ,
functionally interconnected with each other.

Figure 3.1: (a) Structure of the complex diagnosis system. (b) Structure of the elementary cell EC(Si , Dk , fjSi , (f1Dk , f2Dk )).

We stress the fact that in the definition of the elementary cell the subsystem Si is allowed to
be influenced just by a single failure mode even if, in the description given above, more faults
can affect a single subsystem. In our context this means that if Si is affected by ni different
faults, it generates ni different elementary cells. Similarly, the diagnosis subsystem Dj in an elementary cell is required to detect the presence of a single failure mode fjSi . Hence, in case Dj
is sensitive to kj > 1 different failures, it generates kj different elementary cells.
It is easy to realize that, according to the definition of elementary cells given above, the num-
ber of elementary cells which are generated from the complex diagnosis system is equal to the
number of ones in the residual matrix. In the following we shall denote the elementary cell as
EC(Si , Dk , fjSi , (f1Dk , f2Dk )).
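In practice the set of elementary cells can be enumerated mechanically from the residual matrix, one cell per nonzero entry. The sketch below assumes a simple data layout (faults indexing the rows and residuals indexing the columns, as in this section) and is only illustrative.

    def elementary_cells(residual_matrix, fault_names, residual_names):
        """One elementary cell EC(S_i, D_k, f_j, (f1_Dk, f2_Dk)) per '1' in the matrix.

        residual_matrix[i][k] == 1 iff fault i affects residual k
        (N rows = faults, r columns = residuals).
        """
        cells = []
        for i, row in enumerate(residual_matrix):
            for k, entry in enumerate(row):
                if entry:
                    cells.append((fault_names[i], residual_names[k]))
        return cells

    # Toy example: 3 faults, 2 residuals
    M = [[1, 0],
         [1, 1],
         [0, 1]]
    print(elementary_cells(M, ["f1", "f2", "f3"], ["r1", "r2"]))
    # -> [('f1', 'r1'), ('f2', 'r1'), ('f2', 'r2'), ('f3', 'r2')]: 4 cells = number of ones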

3.3.2 Step 2. Computation of the Reliability Function for the Elementary Cells (definition of the required function)
The next step is the computation of the reliability function (see [9]) for each single elementary
cell which, in this second step, is still assumed to be isolated from the context. To do this we need to define what the required function of the item is (i.e. the required function of the elementary cell). The functioning of an elementary cell is the following (see figure 3.2): if no fault fjSi on the component has occurred, then the cell is in a nominal healthy state (H in figure 3.2) which by assumption is safe and reliable (S-R). If the fault fjSi occurs, then the cell transits into a faulty state (F in figure 3.2), which is assumed to be unsafe (NS). At this point, if the detection is missed (f2Dk ) the cell remains in the unsafe state, while if the detection is performed (¬f2Dk ), an opportune reconfiguration action is taken based on the detected fault and the system goes into a reconfigured state (R in figure 3.2) which is assumed to be safe, but not reliable (S-NR). It is now obvious that if a false alarm occurs (f1Dk ) the cell transits from the state H to the state R,
which means that the cell remains safe but becomes not reliable. The meaning of the states in terms of safety and reliability is summarized in table 3.1.

Figure 3.2: Functional model of the elementary cell EC(Si , Dk , fjSi , (f1Dk , f2Dk )).

                Reliable      Not reliable
    Safe        H             R
    Not safe    ×             F

Table 3.1: Meaning of the states in figure 3.2 in terms of safety and reliability.

With this in mind, the function required

to our diagnostic cell will be to remain safe (we will refer to the reliability function of this
required function as the “safe state reliability” RS (t)) or remain reliable (we will refer to the
reliability function of this required function as the “reliable state reliability” RR (t)). Obviously
the condition of remaining reliable is more restrictive than the condition of remaining safe,
since in the first one we do not allow false alarms. It is important to stress that in the sequel the
hypothesis of statistically independent failure modes will be adopted.
In this sense, the two reliability functions of the elementary cell EC(Si , Dk , fjSi , (f1Dk , f2Dk )) will
be in principle affected by the level of occurrence of the exogenous fault fjSi and of the faults
(f1Dk , f2Dk ) (namely by the capability of the local diagnoser Dk of detecting the occurred fault
fjSi and of not generating false alarms). In order to describe this we introduce the following
definition which attempts to qualify, from a statistical point of view, the faults affecting the elementary cell.

Definition 3.2 (Statistical Description of the Faults). The fault occurrence for an elementary cell
EC(Si , Dk , fjSi , (f1Dk , f2Dk ))
is described by the triple $(\lambda_j^{S_i}, \lambda_1^{D_k}, \lambda_2^{D_k})$, which collects the failure rates of the faults $f_j^{S_i}$, $f_1^{D_k}$ and $f_2^{D_k}$ respectively.

We now proceed to present in a more formal way what the "safe state" and the "reliable state" are for a generic cell. This is highlighted in the next definition.

Definition 3.3 (Reliable State and Safe State for an Elementary Cell). The elementary cell

EC(Si , Dk , fjSi , (f1Dk , f2Dk ))


is in the safe state if and only if:

iS ) Si is not affected by the fault fjSi , or

iiS ) Si is affected by the fault fjSi and Dk is not affected by the fault f2Dk (missed diagnosis) (¬f2Dk ).

The elementary cell is in the reliable state if:

iR ) Si is not affected by the fault fjSi and Dk is not affected by the fault f1Dk (false alarm) (¬f1Dk ).

These two definitions mean the following: the system is safe if no fault happens, if a fault occurs but a remedial action is performed, or even if the diagnoser generates a false alarm (and hence a useless reconfiguration is performed); in case a fault happens or a false alarm occurs, the system becomes not reliable.
Now that the required function for an elementary cell has been defined, we can present the rule for the computation of the reliability functions, the latter depending on the failure rates of the faults acting on the cell (see [9]). We will assume that the failure rates of the failure modes are constant. This means that we can work in calendar time, without the need of maintaining information about the age of each system element in addition to the system age.
Proposition 3.1 (Computation of the Reliability Functions of the Elementary Cell)¹ Under the assumption of constant failure rates, the safe state reliability function of the elementary cell
EC(Si , Dk , fjSi , (f1Dk , f2Dk ))
is computed as
$$R^S_{ikj}(t) = R_j^{S_i}(t) + \left(1 - R_j^{S_i}(t)\right) R_2^{D_k}(t) = e^{-\lambda_j^{S_i} t} + \left(1 - e^{-\lambda_j^{S_i} t}\right) e^{-\lambda_2^{D_k} t} \qquad (3.1)$$
where $R_j^{S_i}(t)$ is the reliability function of subsystem $S_i$ with respect to the fault $f_j^{S_i}$, and $R_1^{D_k}(t)$ and $R_2^{D_k}(t)$ are the reliability functions of the diagnosis subsystem $D_k$ with respect to false alarms and missed diagnosis respectively.
The reliable state reliability function of the elementary cell
EC(Si , Dk , fjSi , (f1Dk , f2Dk ))
is computed as
$$R^R_{ikj}(t) = R_j^{S_i}(t)\, R_1^{D_k}(t) = e^{-\lambda_j^{S_i} t}\, e^{-\lambda_1^{D_k} t}\,. \qquad (3.2)$$
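Equations (3.1) and (3.2) can be transcribed directly into code; the sketch below evaluates the two reliability functions of an isolated elementary cell (function names and failure-rate values are illustrative).

    import math

    def cell_safe_reliability(lam_fault, lam_missed, t):
        """Eq. (3.1): R_S(t) = e^{-lam_f t} + (1 - e^{-lam_f t}) e^{-lam_2 t}."""
        r_f = math.exp(-lam_fault * t)
        return r_f + (1.0 - r_f) * math.exp(-lam_missed * t)

    def cell_reliable_reliability(lam_fault, lam_false_alarm, t):
        """Eq. (3.2): R_R(t) = e^{-lam_f t} * e^{-lam_1 t}."""
        return math.exp(-lam_fault * t) * math.exp(-lam_false_alarm * t)

    # Illustrative failure rates (per hour) and mission time
    print(cell_safe_reliability(lam_fault=1e-5, lam_missed=1e-6, t=1000.0))
    print(cell_reliable_reliability(lam_fault=1e-5, lam_false_alarm=5e-6, t=1000.0))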

The computation of the reliability functions highlighted in the previous proposition does not take into account the effect of fault propagation between different elementary cells; in other words, the reliability functions are computed as if the elementary cell were isolated. The propagation phenomenon clearly affects the reliability of the overall complex system and is considered in the next step.

1
The same results given in Proposition 3.1 can be obtained enriching the automaton pictured in fig. 3.2 with
probability of occurrence of events, i.e. with failure rates presented in Definition 3.2. The stochastic automaton
obtained is a Markov Chain. Applying standard analysis tools for Markov chains (see e.g. [26]) it is possible to find
again equations 3.1 and 3.2.

3.3.3 Step 3. Computation of the Propagation Reliability of Elementary Cells


This step is devoted to evaluate the effect of possible propagation of faults on the reliability of
the whole system. A key source of information for carrying out this phase is given by the Fault
Propagation Analysis and in particular by the analysis of the Fault Propagation Tree, see [59] and
[46].
A fault $f_j^{S_i}$ can propagate from the elementary cell EC($S_i$, $D_k$, $f_j^{S_i}$, ($f_1^{D_k}$, $f_2^{D_k}$)) to the elementary cell EC($S_q$, $D_\ell$, $f_m^{S_q}$, ($f_1^{D_\ell}$, $f_2^{D_\ell}$)), in the sense that its occurrence can modify the failure rates both of the fault $f_m^{S_q}$ (which roughly means that the fault $f_m^{S_q}$ can be generated by the occurrence of $f_j^{S_i}$) and of the faults $f_1^{D_\ell}$ and $f_2^{D_\ell}$ (which roughly means that the occurrence of $f_j^{S_i}$ can alter the diagnostic properties of $D_\ell$). For the sake of simplicity we assume that a fault $f_j^{S_i}$ propagates to a different cell just in case the subsystem $S_i$ is affected by $f_j^{S_i}$ and the diagnosis subsystem $D_k$ is not able to detect such a fault (namely $f_j^{S_i}$ is affecting $S_i$ and $f_2^{D_k}$ is affecting $D_k$, hence the cell is in the unsafe state). Applying suitably-defined composition rules it is possible to modify the reliability functions $R^S_{q,\ell,m}$ and $R^R_{q,\ell,m}$ associated to the elementary cell EC($S_q$, $D_\ell$, $f_m^{S_q}$, ($f_1^{D_\ell}$, $f_2^{D_\ell}$)) according to the safe state reliability function $R^S_{ikj}$ associated to EC($S_i$, $D_k$, $f_j^{S_i}$, ($f_1^{D_k}$, $f_2^{D_k}$)). This is specified in the following proposition.

Proposition 3.2 (Propagation Reliability). If a fault $f_j^{S_i}$ of the elementary cell
EC($S_i$, $D_k$, $f_j^{S_i}$, ($f_1^{D_k}$, $f_2^{D_k}$))
propagates to the elementary cell
EC($S_q$, $D_\ell$, $f_m^{S_q}$, ($f_1^{D_\ell}$, $f_2^{D_\ell}$))
then the safe state reliability function of the latter can be computed as
$$R^S_{q,\ell,m}(t) = \delta^S R_m^{S_q}(t) + \left(1 - \delta^S R_m^{S_q}(t)\right) \delta_2^D R_2^{D_\ell}(t) \qquad (3.3)$$
and
$$R^R_{q,\ell,m}(t) = \delta^S R_m^{S_q}(t)\, \delta_1^D R_1^{D_\ell}(t) \qquad (3.4)$$
where

a) $\delta^S = R^S_{ikj}(t)$ in case the fault $f_j^{S_i}$ propagates to the fault $f_m^{S_q}$, and $\delta^S = 1$ otherwise;

b) $\delta_1^D = R^S_{ikj}(t)$ in case the fault $f_j^{S_i}$ propagates to the fault $f_1^{D_\ell}$, and $\delta_1^D = 1$ otherwise;

c) $\delta_2^D = R^S_{ikj}(t)$ in case the fault $f_j^{S_i}$ propagates to the fault $f_2^{D_\ell}$, and $\delta_2^D = 1$ otherwise.
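The propagation rules of Proposition 3.2 translate into the following sketch, a direct transcription of equations (3.3) and (3.4); argument names and numerical values are illustrative assumptions.

    def propagated_safe_reliability(R_m_Sq, R2_Dl, delta_S=1.0, delta2_D=1.0):
        """Eq. (3.3): safe state reliability of the receiving cell.

        delta_S (resp. delta2_D) equals R_S of the propagating cell when the
        fault propagates to f_m^Sq (resp. f_2^Dl), and 1 otherwise.
        """
        return delta_S * R_m_Sq + (1.0 - delta_S * R_m_Sq) * delta2_D * R2_Dl

    def propagated_reliable_reliability(R_m_Sq, R1_Dl, delta_S=1.0, delta1_D=1.0):
        """Eq. (3.4): reliable state reliability of the receiving cell."""
        return delta_S * R_m_Sq * delta1_D * R1_Dl

    # Illustrative values: the propagating cell has safe state reliability 0.999
    print(propagated_safe_reliability(R_m_Sq=0.995, R2_Dl=0.994, delta_S=0.999))
    print(propagated_reliable_reliability(R_m_Sq=0.995, R1_Dl=0.993, delta_S=0.999))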

Now it is possible to derive two statistical residual matrices, both of which have as many rows as the number of possible faults arising in the complex system and as many columns as the number of residuals generated by the local diagnosis units. In the first one, called the reliable state statistical residual matrix ($M_R$), the element in the i-th row and j-th column is a real number representing the probability that if the j-th residual signal arises then the i-th fault has occurred. In other words, the element in the i-th row and j-th column represents the reliability of the j-th diagnostic test with respect to both missed diagnosis and false alarms regarding the i-th fault. Concerning the second matrix, called the safe state statistical residual matrix ($M_S$), the element in the i-th row and j-th column is a real number representing the probability that if the i-th fault has occurred then the j-th residual signal arises. In other words, the element in the i-th row and j-th column represents the reliability of the j-th diagnostic test with respect to missed diagnosis.
These matrices are the natural generalization of the classical deterministic residual matrix (in which the elements are 0 or 1 depending on whether a residual signal is affected or not by a fault), in that they also incorporate the statistical modeling of components and diagnostic algorithms. From such a description it is clear that these matrices can be simply obtained from the residual matrix by substituting for the "ones" in the (i, j) cell the propagated safe state reliability of the cell involving the i-th fault detected by the j-th residual for $M_S$, and the propagated reliable state reliability of the cell involving the i-th fault detected by the j-th residual for $M_R$.
It is interesting to note that, starting from the information contained in the statistical resid-
ual matrix, it is possible to introduce the concept of statistical isolation of two (or more) faults.
As a matter of fact even in case two (or more) faults are indistinguishable by processing the
residual signals, the comparison of the reliability indices associated to the indistinguishable
faults makes it possible to conclude which fault more likely happened.

3.3.4 Step 4. Computation of diagnostic reliability indices.


As a last step, all the elementary cells must be integrated to compute the reliability of the whole system with respect to a fault. This phase is carried out by first considering the so-called reliability block diagram of elementary cells, in which all the elementary cells associated to each subsystem $S_i$ are put in series and the reliability function of $S_i$ is computed as the product of all the reliability functions of the elementary cells (computed according to the previous steps and in particular taking into account possible fault propagation). In other words, it is possible to compute the reliability of the diagnosis algorithm with respect to each single fault simply by elaborating the statistical residual matrices $M_S$ and $M_R$. To this end denote by $M^S_{ij}$ and $M^R_{ij}$ the nonzero values in the i-th row and j-th column of $M_S$ and $M_R$ respectively. Then the safe state reliability index $R^S_{F_i}$ of the diagnostic algorithm with respect to the fault $F_i$ is defined as
$$R^S_{F_i} = \prod_{j=1}^{n} M^S_{ij}\,,$$
and the reliable state reliability index $R^R_{F_i}$ of the diagnostic algorithm with respect to the fault $F_i$ is defined as
$$R^R_{F_i} = \prod_{j=1}^{n} M^R_{ij}\,.$$
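The two indices amount to a row-wise product over the nonzero entries of the statistical residual matrices, as in the sketch below (the data layout, one row per fault and one column per residual, and the numbers are illustrative).

    def fault_reliability_index(stat_matrix, fault_index):
        """Product of the nonzero entries in the row of the given fault.

        Works for both M_S (safe state index R^S_Fi) and M_R (reliable
        state index R^R_Fi).
        """
        index = 1.0
        for value in stat_matrix[fault_index]:
            if value != 0.0:
                index *= value
        return index

    # Toy 2-fault / 3-residual example
    M_S = [[0.9993, 0.0, 0.9999],
           [0.0,    0.9996, 0.0]]
    print(fault_reliability_index(M_S, 0))   # 0.9993 * 0.9999
    print(fault_reliability_index(M_S, 1))   # 0.9996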

The two statistical residual matrices can be further elaborated in order to obtain a measure of reliability for the overall complex system against design specifications. In fact the two matrices can be used to fill the so-called Hazard Matrix, which is a matrix whose rows collect the classes of fault effects (each class collecting effects which have the same severity) and whose columns report the rate of occurrence of those events. A "cross" in the i-th row and j-th column element means that the probability of occurrence of an event of the i-th class of severity is that specified by the j-th column. Such a matrix is then suited to impose a specification involving reliability, as it amounts to identifying forbidden and allowed regions (see figure 3.3).

Figure 3.3: An example of hazard analysis using the hazard matrix.

Consider the case in which the rows of the hazard matrix are collected in three classes: the safe and reliable class (S-R), the safe and not reliable class (S-NR), and the not safe class (NS-NR). It is easy to see that if a fault happens its effect is in the not safe class, hence its hazard is high; but if it is detected and a remedial action is taken, then the effect of the fault is moved into the safe but not reliable class. On the other hand, if a false alarm happens (hence a remedial action is taken even if no fault has occurred), its effect can also be classified in the safe but not reliable class.
With all these considerations in mind, it is easy to think of a fault as having two effects: the effect of the reconfigured fault (which is in the safe but not reliable class) and the effect of the non-reconfigured fault (which is classified as unsafe). In this sense, considering a fault characterized by a safe state reliability function $R_i^S$ and a reliable state reliability function $R_i^R$, the probability of the non-reconfigured fault effect is
$$O_i^{(NS)} = k\left(1 - R_i^S\right)$$
while the probability of the reconfiguration effect is
$$O_i^{(S-NR)} = k\left(1 - R_i^R\right),$$
where $k$ is a positive number representing a scaling factor. We can use this information to improve our complex diagnostic system: the reliability indices can be augmented by changing the physical component (augmenting the reliability $R_{f_j}$ of the component) or by making the diagnoser more reliable. The first option is more expensive, but it lets us move both crosses in the hazard matrix towards lower occurrence rates. The second option is of course less expensive, but we need to trade off between false alarms and missed diagnosis: making the diagnoser more sensitive augments its reliability with respect to missed diagnosis and hence moves the cross in the unsafe class towards a lower occurrence rate. As a drawback, it decreases the reliability of the diagnoser with respect to false alarms and hence can move the cross in the safe but not reliable class towards higher occurrence rates.
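The two occurrence figures entered in the hazard matrix follow directly from the indices defined above; the sketch below computes them for a single fault (the scaling factor k and the numerical values are illustrative assumptions).

    def hazard_occurrences(R_S, R_R, k=1.0):
        """Occurrence of the not-safe effect and of the safe-but-not-reliable effect."""
        O_not_safe = k * (1.0 - R_S)            # fault occurred and was not detected
        O_safe_not_reliable = k * (1.0 - R_R)   # fault reconfigured or false alarm
        return O_not_safe, O_safe_not_reliable

    # Illustrative indices for one fault
    print(hazard_occurrences(R_S=0.99986, R_R=0.9664))
    # -> roughly (1.4e-4, 3.4e-2): ~100 ppm unsafe, ~10^4 ppm unreliable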
3.4 The common rail benchmark


The so called common rail is a fuel high-pressure injection system for diesel engines with direct
injection. The main component of the system is a shared high-pressure storage (rail). Differ-
ently from other systems with directly driven block or individual pumps, pressure generation
and fuel injection control are decoupled in the Common-Rail fuel injection system. A scheme of
this system is sketched in fig. 3.4. The aim of this system is to inject fuel into the engine at a higher pressure than classical systems. In this way the fuel is pulverized into smaller particles and a better mix between air and diesel is possible, leading to a more powerful combustion. Moreover, the common rail is able to keep the fuel pressure constant along the rail connected with the injectors. This makes it possible to achieve optimal performance independently of the operating point. In the sequel we will explain the system functionalities with reference to fig. 3.4.
The system is composed of two diesel circuits. The low pressure circuit, by means of the low pressure pump contained in the fuel tank, brings diesel from the tank to a filter (3) and then to a high pressure pump (4). This filter contains a filtering element, a heater, a temperature sensor and a "water in diesel" sensor. The high pressure diesel circuit is used to guarantee the fuel pressure from the high pressure pump to the injectors. In other words the high pressure pump takes diesel from the filtering box and increases its pressure through the rail. Along this circuit, after the high pressure pump, there is a shut-off valve (5) used for safety and fault detection purposes. The rail (6) is equipped with a pressure sensor (7) and a pressure regulator (8) to keep the fuel at constant pressure. Connected to the rail there are four injectors (9) which pulverize high pressure diesel into the engine cylinders. All the devices are under the control of a central unit (1).

4
6
8 7

9
1
2
High Pressure
Low Pressure

Figure 3.4: The common rail system.

The common rail system is composed of 13 components. Due to the electro-mechanical nature of these components, they are subject to both electrical and mechanical faults; in the system there is no hardware redundancy. Using an FMEA-like analysis it is possible to isolate the
following fault scenario. The low pressure pump (LPP) is an electro-mechanical component
and can be subject to a functional failure fLPP ; the temperature sensor (TS), the water in diesel
sensor (H2 OS) and the high pressure sensor (HPS) are subject to functional failures denoted
with fTS , fH2 OS and fHPS respectively. The electrical heater (HE) can be affected by a failure fHE ,
while the shut off valve (SOV) can fail due to the functional failure fSOV . The high pressure
pump (HPP) is a critical component: its mechanical nature makes this pump easily affected by
functional failures denoted by fHPP . The pressure control (PR) is actuated via an electrome-
chanical actuator which can be affected by a failure fPR . The rail (R) can have leakage problems
denoted by fR . Finally the four injectors (EI) are subject to functional failures fiEI , i = 1, 2, 3, 4.
The following hypotheses are made in order to simplify the analysis problem:

H1 just single failures are possible;

H2 the central control unit (CCU) is considered a safe component (no failures are possible);

H3 the common rail system is considered isolated, i.e. external components are not affected
by failures.

The fault detection phase is performed via 6 electrical/analytical tests that generate 11 residual signals, by which it is possible to detect and isolate all the faults.

1. internal losses test: it consists of testing the closure capability of the four injectors with
respect to the combustion cylinder during the cut-off phase (when the accelerator is re-
leased). This means to test the torque generated by the engine when the injectors are
closed (which should be zero). Using the information about engine rotation it is also pos-
sible to detect which injector does not work properly. For this reason this test generates 4
residual signals (ri , i = 1, 2, 3, 4), each of them sensitive to a fault on a specific injector;

2. external losses test: the aim of this test is to identify fuel losses by checking the pressure inside a closed volume when the injectors are closed. This test generates a residual signal r5 which is sensitive to a failure in one of the injectors, a leakage in the rail, a failure on the shut-off valve and a failure on the pressure regulator;

3. high pressure sensor test: the control unit gives to the pressure regulator specific trajectories
to track and tests the information from the high pressure sensor. In this way a residual
signal r6 can be generated to detect failures on the high pressure sensor and on the pres-
sure regulator;

4. temperature sensor test: similarly to the case of the high pressure sensor test, the control unit gives the heater specific temperature trajectories to follow and tests the information from the temperature sensor. The residual signal r7 generated by this test is able to detect failures on the temperature sensor and on the heater;

5. shut-off valve test: a command issued by the control unit to the low pressure pump gen-
erates a specific mechanical response of the shut-off valve and hence a specific response
on the high pressure circuit which can be monitored using the high pressure sensor. This
test generates a residual r8 which is sensitive to failures on the low pressure pump, on the
shut-off valve and on the high pressure sensor.

6. electrical test on sensors: an electrical test on the output of the sensors generates 3 residual
signals (r9 , r10 and r11 ) which are used to detect possible failures on these components.
      fLPP  fTS  fH2OS  fHPS  fHE  fSOV  fHPP  fPR  f1EI  f2EI  f3EI  f4EI  fR
r1 0 0 0 0 0 0 0 0 1 0 0 0 0
r2 0 0 0 0 0 0 0 0 0 1 0 0 0
r3 0 0 0 0 0 0 0 0 0 0 1 0 0
r4 0 0 0 0 0 0 0 0 0 0 0 1 0
r5 0 0 0 0 0 1 0 1 1 1 1 1 1
r6 0 0 0 0 0 0 1 1 0 0 0 0 0
r7 0 1 0 0 1 0 0 0 0 0 0 0 0
r8 1 0 0 1 0 1 0 0 0 0 0 0 0
r9 0 0 1 0 0 0 0 0 0 0 0 0 0
r10 0 1 0 0 0 0 0 0 0 0 0 0 0
r11 1 0 0 0 0 0 0 0 0 0 0 0 0

Table 3.2: Residual matrix for the common rail system.

The residual matrix obtained is reported in tab. 3.2. It is immediate to see that both detection and isolation of all the considered failures are possible.
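This claim can be checked mechanically: every fault must have a nonzero signature (detectability) and no two faults may share the same signature (isolability). The sketch below encodes Table 3.2 with residuals as rows and faults as columns and performs this check; the encoding is an assumption made for illustration.

    def detectable_and_isolable(residual_matrix):
        """residual_matrix[k][i] == 1 iff residual r_{k+1} is affected by fault i.

        Returns (all_detectable, all_isolable) for the fault signatures
        (the columns of the matrix).
        """
        n_faults = len(residual_matrix[0])
        signatures = [tuple(row[i] for row in residual_matrix) for i in range(n_faults)]
        all_detectable = all(any(sig) for sig in signatures)
        all_isolable = len(set(signatures)) == n_faults
        return all_detectable, all_isolable

    # Columns: fLPP fTS fH2OS fHPS fHE fSOV fHPP fPR f1EI f2EI f3EI f4EI fR
    M = [[0,0,0,0,0,0,0,0,1,0,0,0,0],   # r1
         [0,0,0,0,0,0,0,0,0,1,0,0,0],   # r2
         [0,0,0,0,0,0,0,0,0,0,1,0,0],   # r3
         [0,0,0,0,0,0,0,0,0,0,0,1,0],   # r4
         [0,0,0,0,0,1,0,1,1,1,1,1,1],   # r5
         [0,0,0,0,0,0,1,1,0,0,0,0,0],   # r6
         [0,1,0,0,1,0,0,0,0,0,0,0,0],   # r7
         [1,0,0,1,0,1,0,0,0,0,0,0,0],   # r8
         [0,0,1,0,0,0,0,0,0,0,0,0,0],   # r9
         [0,1,0,0,0,0,0,0,0,0,0,0,0],   # r10
         [1,0,0,0,0,0,0,0,0,0,0,0,0]]   # r11
    print(detectable_and_isolable(M))   # (True, True)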
The residual matrix of the system is composed of 11 rows and 13 columns, with 21 elements set to one. As previously explained, this implies that in the reliability analysis we will consider 21 elementary cells, each of them characterized by intrinsic reliability functions computed starting from the occurrence rates of faults, false alarms and missed diagnoses according to equations (3.1) and (3.2).

      fLPP   fTS    fH2OS  fHPS   fHE    fSOV   fHPP   fPR    f1EI   f2EI   f3EI   f4EI   fR
      0.999  0.999  0.994  0.998  0.999  0.998  0.989  0.999  0.989  0.989  0.989  0.989  0.977

Table 3.3: Reliability data for components of the common rail system (DATA SET 1).

                      r1     r2     r3     r4     r5     r6     r7     r8     r9     r10    r11
    False alarms      0.994  0.994  0.995  0.995  0.994  0.992  0.991  0.994  0.993  0.993  0.994
    Missed diagnosis  0.994  0.994  0.995  0.995  0.994  0.992  0.993  0.994  0.993  0.993  0.994

Table 3.4: Reliability data for diagnostic tests (DATA SET 1).

Following an FPA-like procedure we studied the fault propagations between the elementary cells. For the sake of clarity just the most severe propagations are
considered in the analysis, i.e.:
• a fault on the temperature sensor propagates to a fault on the thermal heater;

• a fault on the low pressure pump propagates to a mechanical fault on the valve (stuck closed) and to a missed diagnosis failure of the test based on this valve;

• a fault on the water in diesel sensor propagates to a mechanical failure on the high pressure pump, to a failure on the injectors due to low lubrication and to false alarms of the high pressure sensor test;

• a fault on the pressure controller may propagate to failures on the rail and on the injectors due to the excessive pressure, and to false alarms on the residuals generated via the internal losses test.
The reliability of the physical components with respect to the failures described above is reported in table 3.3, while the reliability of the diagnostic tests with respect to missed diagnosis and false alarms is shown in table 3.4². Using this information, we are able to compute both the safe state statistical residual matrix in tab. 3.5, and the reliable state statistical residual matrix in tab. 3.6, just substituting for each element set to 1 in the residual matrix the propagated safe state and reliable state reliability of the corresponding elementary cell, computed using equations (3.3) and (3.4).
      fLPP  fTS  fH2OS  fHPS  fHE  fSOV  fHPP  fPR  f1EI  f2EI  f3EI  f4EI  fR
r1 0 0 0 0 0 0 0 0 0.9993 0 0 0 0
r2 0 0 0 0 0 0 0 0 0 0.9993 0 0 0
r3 0 0 0 0 0 0 0 0 0 0 0.9994 0 0
r4 0 0 0 0 0 0 0 0 0 0 0 0.99994 0
r5 0 0 0 0 0 0.9999 0 0.9999 0.9993 0.9993 0.9993 0.9993 0.9986
r6 0 0 0 0 0 0 0.9991 0.9999 0 0 0 0 0
r7 0 0.9999 0 0 0.9999 0 0 0 0 0 0 0 0
r8 0.9999 0 0 0.9999 0 0.9999 0 0 0 0 0 0 0
r9 0 0 0.9996 0 0 0 0 0 0 0 0 0 0
r10 0 0.9999 0 0 0 0 0 0 0 0 0 0 0
r11 0.9999 0 0 0 0 0 0 0 0 0 0 0 0

Table 3.5: Statistical safe residual matrix for the common rail system. (DATA SET 1)

      fLPP  fTS  fH2OS  fHPS  fHE  fSOV  fHPP  fPR  f1EI  f2EI  f3EI  f4EI  fR
r1 0 0 0 0 0 0 0 0 0.983 0 0 0 0
r2 0 0 0 0 0 0 0 0 0 0.983 0 0 0
r3 0 0 0 0 0 0 0 0 0 0 0.984 0 0
r4 0 0 0 0 0 0 0 0 0 0 0 0.984 0
r5 0 0 0 0 0 0.992 0 0.993 0.983 0.983 0.983 0.983 0.971
r6 0 0 0 0 0 0 0.981 0.991 0 0 0 0 0
r7 0 0.99 0 0 0.99 0 0 0 0 0 0 0 0
r8 0.993 0 0 0.992 0 0.992 0 0 0 0 0 0 0
r9 0 0 0.987 0 0 0 0 0 0 0 0 0 0
r10 0 0.992 0 0 0 0 0 0 0 0 0 0 0
r11 0.993 0 0 0 0 0 0 0 0 0 0 0 0

Table 3.6: Statistical reliable residual matrix for the common rail system. (DATA SET 1)

From these matrices it is possible to perform the hazard analysis. The result obtained is that
in the mean life of the systems, the probability of being in an unsafe state is equal to 1.38 · 10−4 ,
which means that the probability of not detecting a fault is around 100 ppm. Moreover the
probability of being in an unreliable state is 3.36 · 10−2 , which means that the probability of
performing a false alarm or a bad diagnosis is in the order of magnitude of 10000 ppm.
Suppose now that there exists a gap between specifications in terms of hazard and results
obtained. This suggests that it is necessary to use higher quality components and/or more
reliable diagnosers. A first remark is that the water in diesel sensor is a crucial component: since all the analytical redundancies are based on this sensor reading, a higher reliability is required for this component. Another crucial component is the high pressure pump. This pump is a mechanical system moved directly by the engine, which raises the fuel pressure thanks to three pistons. The need for a more reliable pump suggested choosing an electrical component instead of a mechanical one. The new reliability of the physical components with respect to the failures is reported in table 3.7, while the reliability of the diagnostic tests is not changed from the values in table 3.4. Using this new set of physical components, we obtain a new safe state statistical residual matrix, shown in tab. 3.8, and a new reliable state statistical residual matrix, shown in tab. 3.9.
The new hazard analysis brought the following new results. In the mean life of the system,
the probability of being in an unsafe state is equal to 1.38 · 10−5 , which means that the prob-
ability of not detecting a fault is around 10 ppm. On the contrary the probability of being in
² Data regarding the reliability or failure rate over the mean life of the physical components are reported by the component producers, while data regarding the reliability over the mean life of the diagnostic tests concerning false alarms and missed diagnosis can be obtained via simulation methods (e.g. through the Monte Carlo method) or via experimental tests on circulating vehicles.
      fLPP  fTS  fH2OS  fHPS  fHE  fSOV  fHPP  fPR  f1EI  f2EI  f3EI  f4EI  fR
0.999 0.999 0.999 0.998 0.999 0.998 0.999 0.999 0.989 0.989 0.989 0.989 0.977

Table 3.7: Reliability data for components of the common rail system (DATA SET 2).
      fLPP  fTS  fH2OS  fHPS  fHE  fSOV  fHPP  fPR  f1EI  f2EI  f3EI  f4EI  fR
r1 0 0 0 0 0 0 0 0 0.9999 0 0 0 0
r2 0 0 0 0 0 0 0 0 0 0.9999 0 0 0
r3 0 0 0 0 0 0 0 0 0 0 0.9999 0 0
r4 0 0 0 0 0 0 0 0 0 0 0 0.9999 0
r5 0 0 0 0 0 0.9999 0 0.9999 0.9999 0.9999 0.9999 0.9999 0.9998
r6 0 0 0 0 0 0 0.9999 0.9999 0 0 0 0 0
r7 0 0.9999 0 0 0.9999 0 0 0 0 0 0 0 0
r8 0.9999 0 0 0.9999 0 0.9999 0 0 0 0 0 0 0
r9 0 0 0.9999 0 0 0 0 0 0 0 0 0 0
r10 0 0.9999 0 0 0 0 0 0 0 0 0 0 0
r11 0.9999 0 0 0 0 0 0 0 0 0 0 0 0

Table 3.8: Statistical safe residual matrix for the common rail system. (DATA SET 2)

an unreliable state is 1.8 · 10−2 , which means that the probability of performing a false alarm or a bad diagnosis is in the order of magnitude of 10000 ppm. It is important to note that the probability of false alarms does not depend on the reliability of the physical components; hence, by enhancing the latter, we have decreased the probability of being in an unsafe state by a factor of 10, while we did not obtain any significant improvement in the probability of being in an unreliable state. For this reason, as a third analysis, we decided to also increase the reliability of the diagnosers. The third analysis has been performed considering more reliable diagnostic tests in terms of missed diagnosis and false alarms, as shown in tab. 3.10. The statistical residual matrices obtained by this third analysis are shown in tables 3.11 and 3.12.
With the new set of diagnosers we obtained a probability of being in an unsafe state equal to 5 · 10−6 , which means that the probability of not detecting a fault is around 1 ppm. On the contrary, the probability of being in an unreliable state is 6 · 10−3 , which means that the probability of performing a false alarm or a bad diagnosis is in the order of magnitude of 600 ppm. By enhancing the reliability of the diagnosers we obtained a decrease of both the probability of being in an unsafe state and the probability of being in an unreliable state by a factor of 10.

3.5 Conclusions
The main contribution of this chapter is the introduction of a procedure to evaluate the reliability of complex diagnosis systems. By a four-step procedure it has been shown how to compute a reliability function, associated to the capability of the whole system to remain in a safe state or in a reliable state with respect to each fault, starting from a statistical description of the exogenous faults and from the reliability of the diagnosis subsystems devoted to fault detection.

      fLPP  fTS  fH2OS  fHPS  fHE  fSOV  fHPP  fPR  f1EI  f2EI  f3EI  f4EI  fR
r1 0 0 0 0 0 0 0 0 0.992 0 0 0 0
r2 0 0 0 0 0 0 0 0 0 0.992 0 0 0
r3 0 0 0 0 0 0 0 0 0 0 0.993 0 0
r4 0 0 0 0 0 0 0 0 0 0 0 0.993 0
r5 0 0 0 0 0 0.992 0 0.993 0.988 0.988 0.988 0.983 0.992
r6 0 0 0 0 0 0 0.999 0.991 0 0 0 0 0
r7 0 0.99 0 0 0.99 0 0 0 0 0 0 0 0
r8 0.993 0 0 0.992 0 0.993 0 0 0 0 0 0 0
r9 0 0 0.991 0 0 0 0 0 0 0 0 0 0
r10 0 0.992 0 0 0 0 0 0 0 0 0 0 0
r11 0.993 0 0 0 0 0 0 0 0 0 0 0 0

Table 3.9: Statistical reliable residual matrix for the common rail system. (DATA SET 2)
r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11
False alarms 0.999 0.999 0.999 0.999 0.999 0.998 0.996 0.998 0.996 0.996 0.996
Missed diagnosis 0.999 0.999 0.999 0.999 0.999 0.998 0.998 0.999 0.996 0.996 0.996

Table 3.10: Reliability data for diagnostic tests (DATA SET 3).
      fLPP  fTS  fH2OS  fHPS  fHE  fSOV  fHPP  fPR  f1EI  f2EI  f3EI  f4EI  fR
r1 0 0 0 0 0 0 0 0 0.99999 0 0 0 0
r2 0 0 0 0 0 0 0 0 0 0.99999 0 0 0
r3 0 0 0 0 0 0 0 0 0 0 0.99999 0 0
r4 0 0 0 0 0 0 0 0 0 0 0 0.99999 0
r5 0 0 0 0 0 0.99999 0 0.99999 0.99999 0.99999 0.99999 0.99999 0.99998
r6 0 0 0 0 0 0 0.99999 0.99999 0 0 0 0 0
r7 0 0.99999 0 0 0.99999 0 0 0 0 0 0 0 0
r8 0.99999 0 0 0.99999 0 0.99999 0 0 0 0 0 0 0
r9 0 0 0.999999 0 0 0 0 0 0 0 0 0 0
r10 0 0.99999 0 0 0 0 0 0 0 0 0 0 0
r11 0.99999 0 0 0 0 0 0 0 0 0 0 0 0

Table 3.11: Statistical safe residual matrix for the common rail system. (DATA SET 3)
      fLPP  fTS  fH2OS  fHPS  fHE  fSOV  fHPP  fPR  f1EI  f2EI  f3EI  f4EI  fR
r1 0 0 0 0 0 0 0 0 0.998 0 0 0 0
r2 0 0 0 0 0 0 0 0 0 0.999 0 0 0
r3 0 0 0 0 0 0 0 0 0 0 0.999 0 0
r4 0 0 0 0 0 0 0 0 0 0 0 0.998 0
r5 0 0 0 0 0 0.999 0 0.999 0.997 0.998 0.998 0.983 0.999
r6 0 0 0 0 0 0 0.999 0.999 0 0 0 0 0
r7 0 0.997 0 0 0.997 0 0 0 0 0 0 0 0
r8 0.998 0 0 0.997 0 0.997 0 0 0 0 0 0 0
r9 0 0 0.995 0 0 0 0 0 0 0 0 0 0
r10 0 0.997 0 0 0 0 0 0 0 0 0 0 0
r11 0.995 0 0 0 0 0 0 0 0 0 0 0 0

Table 3.12: Statistical reliable residual matrix for the common rail system. (DATA SET 3)

An example taken from the automotive field has been used to illustrate the effectiveness of the procedure both as an analysis tool and as a design tool to improve the performance of the system in terms of reliability.
Chapter 4
A discrete event approach to system
monitoring

The problem of achieving fault tolerant supervision of discrete event systems is considered from the viewpoint of safe and timely diagnosis of
unobservable faults. To this end, the new property of safe diagnosability
is introduced and studied. Standard definitions of diagnosability of dis-
crete event systems deal with the problem of detecting the occurrence of
unobservable fault events using model-based inferencing from observed
sequences of events. In safe diagnosability, it is required in addition that
fault detection occur prior to the execution of a given set of forbidden
strings in the failed mode of operation of the system. For instance, this
constraint could be required to prevent local faults from developing into
failures that could cause safety hazards. If the system is safe diagnosable,
reconfiguration actions could be forced upon the detection of faults prior
to the execution of unsafe behavior, thus achieving the objective of fault
tolerant supervision. Necessary and sufficient conditions for safe diag-
nosability are derived. In addition, the problem of explicitly considering
safe diagnosability in controller design, termed “active safe diagnosis
problem”, is formulated and solved. A brief discussion of safe diagnos-
ability for timed models of discrete-event systems is also provided.

4.1 Introduction
This chapter addresses the problem of Fault Detection and Isolation (FDI) in the framework of
discrete event dynamic systems (DES). The main objective in FDI is to develop methodologies
for identifying and exactly characterizing possible incipient faults arising in the operation of a
dynamic system. This research problem has received considerable attention in the last several
years due among other factors to the increasing requirements on safety imposed on today’s
complex technological systems. In particular, many methodologies have been developed for
faults that are naturally modeled, and thus diagnosed, using a “higher-level” discrete-event


model of the system under consideration; see, e.g., [91, 92, 47, 14, 87, 97, 75, 55, 52, 7, 41, 31, 62]
for a representative sample of this work including references to successful industrial appli-
cations. In this work, we adopt the so-called “Diagnoser Approach” to fault diagnosis of DES
introduced in [91, 92] and surveyed in [56]. In this approach, the DES model includes “normal”
as well as “failed” behavior for a given set of faults modeled as unobservable events (i.e., not
directly measured by the system sensors). Diagnosis is the process of detecting on-line the oc-
currence of these faults using model-based inferencing driven by the observed event sequence.
This is achieved by the use of a special type of automaton, called the diagnoser, which is built
from the system model. In addition, the diagnoser can also be used to analyze (off-line) the
diagnosability properties of the system according to the formal definition introduced in [91]
that will be recalled later in this chapter. The problem of taking diagnosability into account in
system design and using control actions to alter the diagnosability properties of a given system
is considered in [90].
In this work, we are concerned with the diagnosis of faults that may lead to violations of
critical safety requirements if they are not detected and identified in a timely manner. To this
end, the new property of safe diagnosability is introduced and studied. Assume that a given sys-
tem is diagnosable (according to [91]). In safe diagnosability it is required in addition that fault
detection occur prior to the execution of a given set of forbidden strings in the failed mode of
operation of the system. For instance, this constraint could be required to prevent local faults
from developing into failures that could cause safety hazards. We view safe diagnosability as
a first necessary step in order to achieve fault tolerant supervision of DES. If the system is safe
diagnosable, reconfiguration actions could be forced upon the detection of faults prior to the
execution of unsafe behavior, thus achieving the objective of fault tolerant supervision. Our
main contributions are: (i) formal definition of the notion of safe diagnosability; (ii) deriva-
tion of implementable necessary and sufficient conditions to test for safe diagnosability; (iii)
formulation and solution of the problem of active safe diagnosis, where the requirement of safe
diagnosability is explicitly taken into account in controller design; (iv) discussion of the exten-
sion of safe diagnosability to timed models of DES.
A preliminary and partial version of these results is contained in [76, 77, 78].

4.2 Preliminary notions


Consider a system G modeled as a Finite State Machine (FSM) (or Finite-state Automaton)

G = (X, Σ, δ, x0 ) (4.1)

where X is the state space, Σ is the set of the events, δ is the partial transition function and x0
is the initial state of the system. The behavior of the system is described by the prefix-closed
language L = L(G) generated by G defined as

L(G) = {s ∈ Σ∗ s.t. δ(x0 , s) is defined} .

L(G) is a subset of Σ∗ , where Σ∗ denotes the Kleene closure of the set Σ, i.e., the set of all finite strings of elements of Σ.
The event set Σ is partitioned as:

Σ = Σo ∪ Σuo

where Σo represents the set of observable events (their occurrence can be observed) and Σuo
represents the set of unobservable events. Moreover, some of the events are controllable (it is

possible to prevent their occurrence) while the rest are uncontrollable. Thus the event set can
also be partitioned as:
Σ = Σc ∪ Σuc .
We associate with Σo the (natural) projection P , P : Σ → Σo , defined in the usual manner. We
refer the reader to table 4.1 and to Appendix A, for any notation used in the sequel but not
defined in this section.
Let Σf ⊆ Σ denote the set of failure events which can occur in the system. We assume
without loss of generality that:
Σf ⊆ Σuo
since an observable failure can be trivially diagnosed and that

Σf ⊆ Σuc

since a “controllable” fault can be a priori avoided.


We partition the set of failure events into disjoint non empty sets corresponding to different
failures types
Σf = Σf 1 ∪ Σf 2 ∪ . . . ∪ Σf m . (4.2)
Let Πf denote this partition; when we write that a failure of type Fi has happened we will mean
that some event in the set Σf i has occurred.
Whenever we say that there exists a trace s of arbitrarily long length we mean the following:

∀ n ∈ N (∃ s ∈ Σ* s.t. ‖s‖ > n) .


We define Xo as the set of states in X that are entered by at least one observable transition,
together with x0 . Let L(G, x) denote the set of all traces that originate from state x of G. We
define
Lo (G, x) = {s ∈ L(G, x) s.t. s = uσ , u ∈ Σ*uo , σ ∈ Σo } (4.3)
and
Lσ (G, x) = {s ∈ Lo (G, x) s.t. sf = σ} . (4.4)
In other words Lo (G, x) denotes the set of all traces that originate from state x and end at
the first observable event, while Lσ (G, x) denotes those traces in Lo (G, x) that end with the
particular observable event σ.
Given the FSM G in (4.1), we define the nondeterministic FSM G′ as

G′ = (Xo , Σo , δG′ , x0 ) (4.5)

where Xo , Σo and x0 are defined as previously and the transition relation of G′ is given by

δG′ ⊆ (Xo × Σo × Xo )

and defined as follows:

(x, σ, x′ ) ∈ δG′ if δ(x, s) = x′ for some s ∈ Lσ (G, x) . (4.6)

It is easy to verify that L(G′ ) = P (L) where:

P (L) = {t s.t. t = P (s) for some s ∈ L} .


Nomenclature              Meaning
ε                         the empty trace
s̄                         the prefix-closure of trace s ∈ Σ*
‖s‖                       length of trace s
L/s                       the post language of L after s, i.e., L/s = {t ∈ Σ* s.t. st ∈ L}
P : Σ* → Σ*o              projection P , i.e.,
                              P (ε) = ε ;
                              P (σ) = σ if σ ∈ Σo ;
                              P (σ) = ε if σ ∈ Σuo ;
                              P (sσ) = P (s)P (σ) , s ∈ Σ* , σ ∈ Σ
PL−1 : Σ*o → Σ*           inverse projection, i.e., PL−1 (y) = {s ∈ L s.t. P (s) = y}
sf                        final event of trace s
Ψ(Σf i )                  set of all traces that end in a failure event belonging to the
                          class Σf i , i.e., Ψ(Σf i ) = {sσf ∈ L s.t. σf ∈ Σf i }
σ ∈ s (σ ∈ Σ, s ∈ Σ*)     σ is an event in the trace s
Σf i ∈ s                  σf ∈ s for some σf ∈ Σf i

Table 4.1: Some nomenclature for the chapter.
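To fix ideas, the following minimal Python sketch encodes an FSM of the form (4.1) as plain dictionaries and implements the natural projection P of Table 4.1. The encoding, the example automaton and the function names are illustrative assumptions and are not part of the formal development.

# Minimal dictionary-based encoding of an FSM G = (X, Sigma, delta, x0):
# delta maps (state, event) pairs to successor states.
G = {
    "x0": "1",
    "delta": {("1", "a"): "2", ("2", "f1"): "3", ("3", "b"): "1"},
}
OBSERVABLE = {"a", "b"}      # Sigma_o; the failure event f1 is unobservable

def generated_language(G, max_len):
    """Enumerate the traces of L(G) up to length max_len."""
    traces, frontier = [()], [((), G["x0"])]
    for _ in range(max_len):
        new_frontier = []
        for trace, x in frontier:
            for (state, ev), nxt in G["delta"].items():
                if state == x:
                    new_frontier.append((trace + (ev,), nxt))
        traces += [t for t, _ in new_frontier]
        frontier = new_frontier
    return traces

def project(trace, observable=OBSERVABLE):
    """Natural projection P of Table 4.1: erase unobservable events."""
    return tuple(ev for ev in trace if ev in observable)

if __name__ == "__main__":
    for s in generated_language(G, 3):
        print(s, "->", project(s))

The same dictionary encoding is reused, under the same caveat, in the sketches accompanying the later sections of this chapter.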

4.3 A motivating example


Let us revise the procedure of designing a supervisor for a system modelled as a DES. The situ-
ation is that of a given DES whose behavior must be modified by feedback control in order to
achieve a given set of specifications. Let us assume that the given DES is modelled by an automa-
ton G; G is obtained by interconnection (product or parallel composition) of a set of interacting
components whose models are G_1^nom , . . . , G_n^nom :

G = ( ∥_{i=1}^{n} G_i^nom ) .

Automaton G models the uncontrolled behavior of the DES; this behavior is not satisfactory and
must be modified by control; this means (see [26]) "restricting the behavior of G to a subset of its
generated language L(G)". For this reason we design a supervisor S nom such that, defining

Gnom := ( ∥_{i=1}^{n} G_i^nom ) ∥ S nom ,

L(Gnom ) satisfies a desired set of specifications.


Usually at this point possible faults on components are considered, and the models G_i^nom are
enhanced to include the most likely faults and the subsequent faulty behavior (see [92]). Hence,
instead of the nominal models G_i^nom, we consider models G_i^{n+f} that embed also the faulty behavior
of the considered component. Considering the generated languages we have that

L(G_i^{n+f}) ⊃ L(G_i^nom)

and, considering the event sets,

Σ_i^{n+f} = Σ_i^nom ∪ Σf i   where   Σf i ⊂ Σ_{i,uo}^{n+f} ∩ Σ_{i,uc}^{n+f} .

This means that the real controlled behavior of the system is described by

Gn+f = ( ∥_{i=1}^{n} G_i^{n+f} ) ∥ S nom .

By simple observation of Gn+f , we can distinguish between a nominal part and a faulty part.
In the nominal part there are no undesired actions, because they have been prevented by the
control; however, undesired sequences of actions can arise in the faulty part, due to the nominal
supervision of a faulty behavior. Hence, to achieve fault tolerance we not only need to be
able to detect the occurred faults, but also to prevent undesired sequences of actions. For this
reason, in the next section we introduce a new property of discrete event systems that guarantees
the possibility of detecting the occurred fault before an illegal string is executed: the notion
of safe diagnosability.

Example 4.1 Consider the simple system sketched in Fig. 4.1 consisting of a tank, a pump and a valve.
Let us assume that the only failure mode for the valve is stuck closed. The tank is equipped with a level
sensor, while the pipe can be equipped by either a flow sensor or a pressure sensor. The flow sensor reads
two possible values: F meaning that there is a flow in the pipe and N F meaning that there is no flow in
the pipe. Similarly, the pressure sensor can read two possible values: P meaning that there is pressure
in the pipe and N P meaning that there is no pressure in the pipe.
Suppose that we control the system to achieve the following behavior:

• if the level sensor says that the tank is full (event Level) the controller must respond by opening
the valve and starting the pump;

• if the level sensor says that the tank is not full (event N ot Level) the controller must respond by
stopping the pump and closing the valve.

Figure 4.2 shows the FSMs G_i^{n+f} modelling the pump, the valve and the controller. The readings of the
flow and pressure sensors are linked with the states of the valve and the pump as depicted in Table 4.2.

Figure 4.1: The system (tank with level sensor, valve, pump, flow and pressure sensors on the pipe, and controller).

          PON        POFF
VO        F - NP     NF - NP
VC        NF - P     NF - P

Table 4.2: Flow and pressure sensor readings table (VO/VC: valve open/closed; PON/POFF: pump on/off; each entry gives the flow and pressure readings).

Let us assume that the situation "valve closed and pump on" is risky and should be avoided. The FSM
modelling the controlled behavior of the system, denoted by Gn+f , is shown in Fig. 4.3; Gn+f is obtained
by performing the parallel composition of the models in Fig. 4.2 and then incorporating the sensor map
(Table 4.2) according to the procedure in [92]. It is simple to identify the part of the automaton modelling

Figure 4.2: DES models G_i^{n+f} of the system components: the valve (states VC, VO and SC for stuck closed), the pump (states POFF, PON) and the controller (states C1–C7).

the nominal behavior and the part that models the behavior of the supervised system after the occurrence
of the fault; in this part there is a legal state (state 12) where the dangerous situation "valve closed
and pump on" occurs. This means that, due to the possible failure of the valve, the system can enter this
risky state. This situation should be avoided by detecting the failure before the system goes into state 12.
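The way the sensor map of Table 4.2 refines the event labels of the controlled model, yielding the composite labels of Fig. 4.3, can be sketched in a few lines of Python. The dictionary and the helper function below are illustrative assumptions and only reproduce the readings of Table 4.2.

# Sketch of the sensor-map refinement used to build G^{n+f} (Fig. 4.3): every
# command event is augmented with the flow/pressure readings implied by the
# current valve/pump configuration (Table 4.2).
SENSOR_MAP = {
    # (valve, pump) -> (flow reading, pressure reading)
    ("VO", "PON"):  ("F",  "NP"),
    ("VO", "POFF"): ("NF", "NP"),
    ("VC", "PON"):  ("NF", "P"),
    ("VC", "POFF"): ("NF", "P"),
}

def refine_event(command, valve_state, pump_state):
    """Return the composite label <command, flow, pressure>."""
    flow, pressure = SENSOR_MAP[(valve_state, pump_state)]
    return (command, flow, pressure)

if __name__ == "__main__":
    # nominal case: the valve really opens, then the pump starts
    print(refine_event("start", "VO", "PON"))   # ('start', 'F', 'NP')
    # valve stuck closed: same command, different readings
    print(refine_event("start", "VC", "PON"))   # ('start', 'NF', 'P')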

4.4 The notions of diagnosability and safe diagnosability


Let L be a prefix-closed language which satisfies the following hypotheses:

A1) The language L is live. This means that there is a transition defined at each state x ∈ X
(the system cannot reach a point at which no event is possible).

A2) There does not exist in L any cycle of unobservable events, i.e.:

∃ no ∈ N s.t. ∀ ust ∈ L , s ∈ Σ*uo ⇒ ‖s‖ ≤ no . (4.7)

The liveness assumption (A1) is made for the sake of simplicity while assumption (A2) ensures
that observations occur with some regularity; in other words we require that the system does
not generate arbitrarily long sequences of unobservable events.
A language L is diagnosable if it is possible to detect with a finite delay occurrences of
failures of any type using the record of observed events. Formally,
Definition 4.1 (Diagnosability [91]) A prefix-closed language K satisfying hypothesis (A1) and (A2)
is said to be diagnosable with respect to the projection P and with respect to the partition Πf if the
following holds:
(∀i ∈ Πf ) (∃ni ∈ N) (∀s ∈ Ψ(Σf i )) (∀t ∈ L/s) (‖t‖ ≥ ni ⇒ D) (4.8)
where the diagnosability condition D is:
ω ∈ PL−1 [P (st)] ⇒ Σf i ∈ ω . (4.9)

Figure 4.3: DES model Gn+f of controlled system behavior; states 1–9 form the nominal part, states 10–17 (reached through the valve failure event f) form the faulty part, and state 12 is the risky "valve closed and pump on" configuration.



The above definition means the following. Let s be any trace generated by the system
that ends in a failure event from the set Σf i and let t be any sufficiently long continuation of
s. Condition D requires that every trace belonging to the language that produces the same
record of observable events as the trace st should contain in it a failure event from the set
Σf i . Or, in other words, diagnosability requires that every failure event leads to observations
distinct enough to enable unique identification of the failure type with a finite delay. For more
information the reader is referred to [91].

Remark 4.1 Given a non live language L it is possible to extend it to a new language L^live by adding a
new event "Stop" to Σ, where Stop ∈ Σo ∩ Σuc , and by defining

L^live = L ∪ ⋃_{i≥0} { s Stop^i s.t. s ∈ L , L/s = ∅ } . (4.10)

It is obvious that L^live is live and that L is diagnosable iff L^live is diagnosable. In other words, checking
for the diagnosability of L is equivalent to checking for the diagnosability of L^live . See [90] for further
details.

Consider now the case where we want to avoid that, after a failure of type Fi (i = 1 . . . m)
occurs, the system executes a forbidden string from a given finite set Φi , where Φi ⊆ Σ* . For
example, this could be required to prevent local faults from developing into failures that can
cause safety hazards. This condition is strictly linked to the definition of diagnosability, because
if an illegal string in the set Φi is possible in the language of our system, then it is necessary
to detect the failure before an illegal string is executed. The elements of the set Φi capture
sequences of events that become illegal after the occurrence of a fault of type Fi . This situation
can be formalized by defining the "illegal language" Kfi (i = 1 . . . m) as follows:

Kfi = {u ∈ L/s s.t. s ∈ Ψ(Σf i ) and ∃ v ∈ Φi s.t. v is a substring of u} . (4.11)

In other words, Kfi contains all the possible continuations after a fault of type Fi which have a
forbidden string from Φi as substring.

Definition 4.2 (Safe Diagnosability) A prefix-closed language L satisfying hypothesis (A1) and
(A2) is said to be safe diagnosable with respect to the projection P , the partition Πf and the forbid-
den languages Kfi (i = 1 . . . m) if the following conditions hold:

SC1) diagnosability condition:

(∀i ∈ Πf ) (∃ni ∈ N) (∀s ∈ Ψ(Σf i )) (∀t ∈ L/s) (‖t‖ ≥ ni ⇒ D)

where the diagnosability condition D is:

ω ∈ PL−1 [P (st)] ⇒ Σf i ∈ ω .

SC2) safety condition:

(∀i ∈ Πf ) (∀s ∈ Ψ(Σf i )) (∀t ∈ L/s) such that ‖t‖ = ni , let tc , with ‖tc ‖ = ntc , be the shortest prefix of t such that D
holds; then
t̄c ∩ Kfi = ∅ (4.12)

Roughly speaking, this definition says that a language is safe diagnosable if it is diagnosable
and, after a failure, the shortest continuation that assures the detection does not contain any
illegal string. A graphical interpretation of this property is sketched in Fig. 4.4. In that figure, a
non safe diagnosable language L is represented: the string t′c is the shortest continuation after
the string s′fi for which condition D holds, and t′c contains an element of the forbidden language
Kfi .

Remark 4.2 We note that we have chosen to approach the problem of safe diagnosability in the language-
theoretic framework of [90]-[91] for fault diagnosis rather than the temporal logic framework of [52] in
order to better capture the safety requirement that has to be enforced after a fault has been diagnosed with
certainty. Thus, as will become apparent in the following sections, we are able to employ the "diagnoser
automata" of the Diagnoser Approach of [91] to verify safe diagnosability (cf. Section 4.5) and build safe
diagnosable systems (cf. Section 4.6).

Figure 4.4: A non safe diagnosable language L: the string s″fi t″c satisfies the conditions for safe
diagnosability, while the string s′fi t′c does not.

4.5 Necessary and sufficient conditions


4.5.1 Necessary and sufficient conditions for diagnosability
The diagnoser denoted by Gd = (Qd , Σo , δd , q0 ) is an FSM built from the system G = (X, Σ, δ, x0 ).
This machine is used to perform diagnostics when it observes on-line the behavior of G. The
construction procedure of the diagnoser can be found in [91].
First, define the set of failure labels

∆f = {F1 , F2 , . . . , Fm }

where |Πf | = m, and the complete set of possible labels

∆ = {N } ∪ 2^{∆f } . (4.13)

Here N is to be interpreted as meaning normal and Fi , i ∈ {1, . . . , m}, as meaning that a
failure of type Fi has occurred. Recalling the definition of Xo , define

Qo = 2^{Xo × ∆} . (4.14)

The diagnoser for G is the FSM

Gd = (Qd , Σo , δd , q0 ) (4.15)

where Qd , Σo , δd and q0 have the usual interpretation.


The initial state of the diagnoser is defined to be {x0 , {N }}. The transition function δd of
the diagnoser is constructed in a similar manner to the transition function of an observer with
an additional aspect that includes attaching failure labels to the states and propagating these
labels from state to state. More precisely:

q2 = δd (q1 , σ) ⇔ q2 = R(q1 , σ) (4.16)

where R : Qo × Σo → Qo is the so-called range function defined as

R(q, σ) = ⋃_{(x,ℓ)∈q} ⋃_{s∈Lσ (G,x)} {(δ(x, s), LP (x, ℓ, s))} (4.17)

and LP : Xo × ∆ × Σ* → ∆ is the label propagation function defined as:

LP (x, ℓ, s) = {N }                          if ℓ = {N } ∧ ∀i (Σf i ∉ s)
LP (x, ℓ, s) = {fi : fi ∈ ℓ ∨ Σf i ∈ s}      otherwise.            (4.18)

The state space Qd is the resulting subset of Qo composed of the states of the diagnoser that
are reachable from q0 under δd . Since the state space Qd of the diagnoser is a subset of Qo , a
state qd of Gd is of the form:
qd = {(x1 , ℓ1 ), . . . , (xn , ℓn )}
where xi ∈ Xo and ℓi ∈ ∆.
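A hedged Python sketch of one diagnoser transition, i.e., of the range function R in (4.17) together with the label propagation function LP in (4.18), is given below; it reuses the dictionary encoding of the earlier sketch, and all names are illustrative. Termination of the unobservable search relies on assumption (A2).

# One diagnoser step: given a diagnoser state q (a set of pairs (x, labels))
# and an observed event sigma, compute R(q, sigma) of (4.17) using the label
# propagation function LP of (4.18).
FAULT_CLASSES = {"f1": "F1"}      # unobservable failure event -> failure type

def reach_to_first_observable(delta, x, observable):
    """The traces of L_sigma(G, x) in (4.4): an unobservable prefix followed by
    one observable event.  Terminates under assumption (A2), i.e., no cycles of
    unobservable events."""
    results, stack = [], [(x, ())]
    while stack:
        state, trace = stack.pop()
        for (src, ev), dst in delta.items():
            if src != state:
                continue
            if ev in observable:
                results.append((trace + (ev,), dst))
            else:
                stack.append((dst, trace + (ev,)))
    return results

def label_propagation(labels, trace):
    """LP of (4.18): keep previous fault labels and add the types of the
    failure events occurring in trace; otherwise keep the normal label N."""
    new = set(labels) - {"N"}
    new |= {FAULT_CLASSES[ev] for ev in trace if ev in FAULT_CLASSES}
    return frozenset(new) if new else frozenset({"N"})

def diagnoser_step(delta, observable, q, sigma):
    """R(q, sigma): the diagnoser state reached after observing sigma."""
    next_q = set()
    for x, labels in q:
        for trace, dst in reach_to_first_observable(delta, x, observable):
            if trace[-1] == sigma:
                next_q.add((dst, label_propagation(labels, trace)))
    return frozenset(next_q)

Starting from the initial state {(x0, {N})} and applying diagnoser_step to the observed event sequence reproduces, state by state, the on-line diagnosis performed by Gd.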
We recall some definitions and a key result from [91].

Definition 4.3 A state q ∈ Qd is said to be Fi -certain if

∀ (x, ℓ) ∈ q , fi ∈ ℓ .

Definition 4.4 A state q ∈ Qd is said to be Fi -uncertain if

∃ (x, ℓ), (y, ℓ′ ) ∈ q such that fi ∈ ℓ and fi ∉ ℓ′ .

Definition 4.5 A set of states x1 , x2 , . . . , xn ∈ X is said to form a cycle in G if

∃s ∈ L(G, x1 ) such that s = σ1 σ2 . . . σn and δ(xl , σl ) = x(l+1) mod n , l = 1, 2, . . . n .

Definition 4.6 A set of Fi -uncertain states q1 , . . . , qn ∈ Qd is said to form an Fi -indeterminate cycle if

1. q1 , . . . , qn form a cycle in Gd with

   δd (ql , σl ) = ql+1 , l = 1, . . . , n − 1 ,    δd (qn , σn ) = q1

   where σl ∈ Σo , l = 1, . . . , n.

2. ∃ (x_l^k , ℓ_l^k ), (y_l^r , ℓ̃_l^r ) ∈ ql , l = 1, . . . , n, k = 1, . . . , m, r = 1, . . . , m′ , such that

   (a) fi ∈ ℓ_l^k , fi ∉ ℓ̃_l^r for all l, k and r;

   (b) the sequences of states {x_l^k }, l = 1, . . . , n, k = 1, . . . , m, and {y_l^r }, l = 1, . . . , n,
       r = 1, . . . , m′ , form cycles in G′ with:

       (x_l^k , σl , x_{l+1}^k ) ∈ δG′ ,  l = 1, . . . , n − 1 , k = 1, . . . , m
       (x_n^k , σn , x_1^{k+1} ) ∈ δG′ ,  k = 1, . . . , m − 1
       (x_n^m , σn , x_1^1 ) ∈ δG′
       (y_l^r , σl , y_{l+1}^r ) ∈ δG′ ,  l = 1, . . . , n − 1 , r = 1, . . . , m′
       (y_n^r , σn , y_1^{r+1} ) ∈ δG′ ,  r = 1, . . . , m′ − 1
       (y_n^{m′} , σn , y_1^1 ) ∈ δG′

An Fi -indeterminate cycle in Gd indicates the presence in L of two traces s1 and s2 of arbi-


trarily long length, such that they both have the same observable projection and s1 contains a
failure event from the set Σf i , while s2 does not.

Theorem 4.1 [91] A language L is diagnosable with respect to the projection P and the failure partition
Πf on Σf if and only if its diagnoser Gd has no Fi -indeterminate cycle for each failure type Fi .

Remark 4.3 Testing diagnosability using diagnosers can be done in polynomial time in the cardinality
of the state space of the diagnoser, since it involves detection of special kinds of cycles. The state space of
the diagnoser is however exponential in the state space of the system G in the worst case.
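A first, partial step of the test of Theorem 4.1 can be sketched in code: the fragment below, which reuses the diagnoser encoding of the previous sketch, only looks for cycles made entirely of Fi-uncertain states in Gd. Such cycles are candidate indeterminate cycles; the complete test of [91] must in addition exhibit the corresponding cycles in G′ required by Definition 4.6. The helpers and the encoding are illustrative assumptions.

# Candidate F_i-indeterminate cycles: cycles of the diagnoser G_d whose states
# are all F_i-uncertain.  A diagnoser state q is a frozenset of (x, labels).
def is_certain(q, fi):
    return all(fi in labels for _, labels in q)

def is_uncertain(q, fi):
    return any(fi in labels for _, labels in q) and \
           any(fi not in labels for _, labels in q)

def has_uncertain_cycle(states, transitions, fi):
    """True iff the subgraph of F_i-uncertain states of G_d contains a cycle
    (checked by trying to topologically exhaust the subgraph)."""
    uncertain = {q for q in states if is_uncertain(q, fi)}
    succ = {q: set() for q in uncertain}
    indegree = {q: 0 for q in uncertain}
    for (q_src, _event), q_dst in transitions.items():
        if q_src in uncertain and q_dst in uncertain and q_dst not in succ[q_src]:
            succ[q_src].add(q_dst)
            indegree[q_dst] += 1
    queue = [q for q in uncertain if indegree[q] == 0]
    removed = 0
    while queue:
        q = queue.pop()
        removed += 1
        for r in succ[q]:
            indegree[r] -= 1
            if indegree[r] == 0:
                queue.append(r)
    return removed < len(uncertain)

If has_uncertain_cycle returns False for every failure type, the language is certainly diagnosable; if it returns True, the additional conditions of Definition 4.6 still have to be checked on G′.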

4.5.2 Necessary and sufficient conditions for safe diagnosability


Consider a diagnosable language L and the FSM

G = (X, Σ, δ, x0 )

generating L. For each set Φi (i = 1 . . . m) of forbidden strings, define, following the procedure
in [60], the recognizer of Φi as an automaton

Si = (X(Φi ) ∪ {x0 }, Σ, ξi , (x0 , N B))

where
X(Φi ) = {(s, π(s)) s.t. (s ∈ t̄) ∧ (t ∈ Φi )} (4.19)
and where π : Σ* → {N B, Bi }, (i = 1 . . . m), is defined as

π(s) = Bi    if s ∈ Φi ;
π(s) = N B   otherwise.          (4.20)

The transition function ξi : (X(Φi ) ∪ {x0 }) × Σ → X(Φi ) ∪ {x0 } is defined as:

ξi ((s, π(s)), σ) = (s, π(s))   if s ∈ Φi
ξi ((s, π(s)), σ) = (t, π(t))   if s ∉ Φi          (4.21)

where t is the longest suffix of sσ belonging to X(Φi ) and π(·) is defined as in (4.20); moreover

ξi ((x0 , N B), σ) = (ε, N B)    if σ = fi
ξi ((x0 , N B), σ) = (x0 , N B)  otherwise.        (4.22)

Next, consider the automaton


Gm = G × S1 × . . . × Sm = (Xm , Σ, δm , xm0 ) . (4.23)
A state z of Gm is of the form:
z = [x, (s1 , π(s1 )), . . . , (sm , π(sm ))]
where x ∈ X and (si , π(si )) ∈ X(Φi ) ∪ {x0 }, i = 1 . . . m. We will say that state z ∈ Xm is
a Bi -state if in the list of z there exists π(si ) = Bi for i ∈ [1 . . . m]; otherwise, z is called an
N B-state.
By construction of Si , i = 1 . . . m, we have that L(Si ) = Σ* and therefore L(Gm ) = L(G)
(= L). The need for a bijective map between the set of forbidden strings and a set of forbidden
states in the automaton generating the language under observation motivates the operation in
Eqn. 4.23. In fact, the automaton Gm satisfies the following property.

Proposition 4.1 Consider the subset of states of Gm

Θi = {z ∈ Xm s.t. z is a Bi -state} ;

then
(u ∈ τ (z)) ⇒ (u = vw) , with v ∈ Ψ(Σf i ) and w ∈ Kfi ,
where the map τ : X → Σ* is defined as
τ (x) = {s ∈ Σ* s.t. x = δ(x0 , s)}
or, in other words, τ (x) is the set of all traces in the language generated by G leading from the initial
state x0 to state x.

Proof. By definition, if z ∈ Xm is a Bi -state, this means that z = [x, . . . , (si , π(si )), . . .] with
π(si ) = Bi and si ∈ Φi (see (4.20)). Due to the definition of Gm (4.23), z can be reached only
by strings which contain si as a substring. Since si ∈ Φi , from definition (4.11), we get that
(∀u ∈ τ (z)) (u = vw) with v ∈ Ψ(Σf i ) and w ∈ Kfi . □

Remark 4.4 In the worst case the state space of Gm is the Cartesian product of the state spaces of G, S1 , . . . , Sm . In
practice it is expected that the illegal strings in the respective Φi sets will have short lengths and will
result in splitting of localized parts of the state space of G when performing the product in (4.23); see
Examples 4.2 and 4.3 at the end of this section.
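A small Python sketch of a recognizer Si of (4.19)–(4.22) for a single forbidden string is given below: the state stores the longest suffix of the executed continuation that is still a prefix of the forbidden string, and becomes a Bi-state once the forbidden string has appeared as a substring. This KMP-style realization, and all its names, are illustrative assumptions; in particular the waiting state (x0, NB) of (4.22), which starts the matching only after the failure event fi, is omitted for brevity.

# Recognizer sketch for a single forbidden string phi (cf. (4.19)-(4.22)):
# the state is the longest suffix of the executed continuation that is a
# prefix of phi; the state is absorbing (a B_i-state) once phi has occurred.
def build_recognizer(phi):
    """Return a step function over prefixes of phi (encoded as event tuples)."""
    prefixes = [tuple(phi[:k]) for k in range(len(phi) + 1)]
    def step(state, event):
        if state == tuple(phi):          # already a B_i-state: stay there
            return state
        candidate = state + (event,)
        # longest suffix of candidate that is still a prefix of phi
        while candidate not in prefixes:
            candidate = candidate[1:]
        return candidate
    return step

if __name__ == "__main__":
    phi = ("alpha", "beta")              # forbidden continuation after f1
    step = build_recognizer(phi)
    state = ()
    for ev in ("gamma", "alpha", "beta"):
        state = step(state, ev)
        print(ev, "->", state, "(B-state)" if state == phi else "")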
Given a diagnosable language L and an automaton
G = (X, Σ, δ, x0 )
generating L, we call safe-diagnoser with respect to the forbidden sets Φi , the diagnoser
Gsd = (Qsd , Σo , δds , q0s ) (4.24)
of automaton Gm defined in (4.23), with respect to the projection P and to the failure partition
Πf on Σf . We can now present our result for testing the property of safe diagnosability.

Theorem 4.2 Consider a diagnosable language L and an automaton

G = (X, Σ, δ, x0 )

generating L. L is safe diagnosable with respect to the projection P , the failure partition Πf and the
forbidden languages Kfi if and only if in the safe diagnoser (4.24):
(TC1) There does not exist a state q ∈ Qsd that is Fi -uncertain with a component of the form (x, ℓ) such
that fi ∈ ℓ and x ∈ Θi ;

(TC2) There does not exist a pair of states q, q′ ∈ Qsd such that: (i) q is Fi -certain with a component of
the form (x, ℓ) such that fi ∈ ℓ and x ∈ Θi ; (ii) q′ is Fi -uncertain; and (iii) q = δds (q′ , e) with
e ∈ Σo .

Proof.
Sufficiency:
(TC1) By contradiction, suppose that there exists q ∈ Qsd Fi -uncertain:

(x, ℓ), (y, ℓ′ ) ∈ q s.t. (fi ∈ ℓ) (fi ∉ ℓ′ )

with x ∈ Θi . Let s ∈ Σ*o be such that q = δds (q0s , s); this means that

∃ u, v ∈ PL−1 (s) s.t. (x = δm (xm0 , u)) (Σf i ∈ u) (y = δm (xm0 , v)) (Σf i ∉ v) .

Due to the fact that Σf i ∈ u, we get that u = u1 u2 with u1 ∈ Ψ(Σf i ). Since the system is
diagnosable, there exists n ∈ N such that if t ∈ L/u and ‖t‖ = n then condition D holds.
Consider the string
t′ = u1 u2 t = u1 tc (tc = u2 t) .
Since
Θi ∋ x = δm (xm0 , u1 u2 ) ,
from Proposition 4.1, it turns out that u2 ∈ Kfi or, in other words, since tc = u2 t, that t̄c ∩ Kfi ≠ ∅,
which violates the hypothesis that L is safe diagnosable.

(TC2) Again by contradiction, suppose that there exists q ∈ Qsd Fi -certain:

∀ (x, ℓ) ∈ q, fi ∈ ℓ

such that ∃ (x, ℓ′ ) ∈ q with x ∈ Θi . Moreover, suppose that there exists q′ ∈ Qsd Fi -uncertain
such that q = δds (q′ , e) with e ∈ Σo . Let s ∈ Σ*o be such that q′ = δds (q0s , s); this means that

∃ u, v ∈ PL−1 (s) s.t. (x1 = δm (xm0 , u)) (Σf i ∈ u) (x2 = δm (xm0 , v)) (Σf i ∉ v) .

From the fact that Σf i ∈ u we get that u = u1 u2 with u1 ∈ Ψ(Σf i ). Since the system is
diagnosable, there exists n ∈ N such that if t ∈ L/u and ‖t‖ = n then condition D holds; it is
straightforward to see that
t = PL−1 (e) = e .
Consider the string
t′ = u1 u2 e = u1 tc (tc = u2 e) .
From the fact that
q = δds (q0s , se)

we get that
Θi ∋ x = δm (xm0 , u1 u2 e) = δm (xm0 , u1 tc ) .
From Proposition 4.1, we conclude that tc ∈ Kfi , which violates the hypothesis that L is a safe
diagnosable language.

Necessity:
Consider a language L that is diagnosable with respect to the projection P and to the failure
partition Πf on Σf . Suppose that L is not safe diagnosable with respect to the set of forbidden
languages Kfi (i = 1 . . . m). This means that for some i ∈ Πf , ∃ s ∈ Ψ(Σf i ) and tc ∈ L/s such that tc is
the shortest continuation of s for which the diagnosability condition D holds and

t̄c ∩ Kfi ≠ ∅ .

Consider the case where there exists a proper prefix t1 of tc such that t1 ∈ Kfi . From Proposition 4.1,

x = δm (xm0 , st1 ) ∈ Θi .

Moreover, since for t1 condition D does not hold, we get from the definition of the diagnoser Gsd (see [91])
that
q = δds (q0s , P (st1 ))
is Fi -uncertain and (x, ℓ) ∈ q with fi ∈ ℓ, which contradicts (TC1).

For the case where
tc ∈ Kfi ,
from Proposition 4.1,
x = δm (xm0 , stc ) ∈ Θi .
Moreover, since tc is the first continuation of s for which D holds, we get from the definition of
the diagnoser Gsd (see [91]) that
q = δds (q0s , P (stc ))
is Fi -certain and (x, ℓ) ∈ q with fi ∈ ℓ. But tc = t′ e, where e ∈ Σo and t′ is a prefix of tc for which
D does not hold. This means that
q′ = δds (q0s , P (st′ ))
is Fi -uncertain and
q = δds (q′ , e) ,
which means that q is the first Fi -certain state in Gsd after an uncertain state, contradicting (TC2). □
Fig. 4.5 sketches the idea behind conditions (TC1) and (TC2) and behind the algorithmic
test proposed in Theorem 4.2.
Remark 4.5 Once it has been established that a language L is diagnosable, testing conditions (TC1)
and (TC2) for safe diagnosability is linear in the cardinality of the state space of the safe diagnoser Gsd .
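As noted in Remark 4.5, once the safe diagnoser has been built, checking (TC1) and (TC2) amounts to a linear scan of its states and transitions. The following sketch, which reuses the is_certain and is_uncertain helpers of the earlier sketch and assumes that every state component x carries its Bi/NB marking from Gm, is illustrative only and does not refer to any existing tool.

# Test of Theorem 4.2 on the safe diagnoser: states are frozensets of pairs
# (x, labels); b_states is the set Theta_i of B_i-states of G_m.
def violates_tc1(states, fi, b_states):
    """(TC1) fails if some F_i-uncertain state has a component (x, labels)
    with fi in labels and x a B_i-state."""
    return any(is_uncertain(q, fi) and
               any(fi in labels and x in b_states for x, labels in q)
               for q in states)

def violates_tc2(transitions, fi, b_states):
    """(TC2) fails if the first F_i-certain state entered from an
    F_i-uncertain one contains a component (x, labels) with fi in labels
    and x a B_i-state."""
    for (q_src, _event), q_dst in transitions.items():
        if is_uncertain(q_src, fi) and is_certain(q_dst, fi) and \
           any(fi in labels and x in b_states for x, labels in q_dst):
            return True
    return False

def safe_diagnosable(states, transitions, fault_types, b_states_of):
    """Given diagnosability, L is safe diagnosable iff neither condition is
    violated for any failure type."""
    return all(not violates_tc1(states, fi, b_states_of[fi]) and
               not violates_tc2(transitions, fi, b_states_of[fi])
               for fi in fault_types)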

Example 4.2 Consider the language generated by the automaton G in Fig. 4.6. We assume that the
forbidden string after the failure f1 is Φ = {β} and therefore the illegal language is Kf = {αβ}. In this
example there is a state, namely state 6, that can be reached both by an illegal string s′ = (αβγ)n αf1 αβ
and by a legal string s″ = (αβγ)n αβf1 τ .
The recognizer S for the set Φ is shown in Fig. 4.7, and the modified automaton Gm = G × S is
shown in Fig. 4.8(left). For the sake of simplicity, we have kept the same state names and indicated the

Figure 4.5: Conditions (TC1) and (TC2) and their influence on the safe diagnoser. On the left
are shown, from top to bottom, a string that violates (TC1), a string that violates (TC2) and a
string that violates neither (TC1) nor (TC2). On the right is shown what happens in these three
cases in the safe diagnoser. The label UC stands for uncertain state, while the label C stands for
certain state.

additional information next to the state. The effect of the product between S and the original model G is
the splitting of state 6 into two states: a B-state and an N B-state.
The safe diagnoser Gsd is shown in Fig. 4.8 (right) where, for the sake of clarity, only the information
about B-states and N B-states has been reported and not the full name of the state. It is immediately
seen that both conditions (TC1) and (TC2) are satisfied, and therefore the language generated by system
G is safe diagnosable with respect to the failure f1 and the forbidden string β.

Example 4.3 Consider again the simple system sketched in Fig. 4.1. The only failure mode considered
is that the valve is stuck closed. The tank is equipped with a level sensor, while the pipe can be equipped
by either a flow sensor or a pressure sensor. The flow sensor reads two possible values: F meaning that
there is a flow in the pipe and N F meaning that there is no flow in the pipe. Similarly, the pressure
sensor can read two possible values: P meaning that there is pressure in the pipe and N P meaning that
there is no pressure in the pipe. We control the system to achieve the following behavior:

• if the level sensor says that the tank is full (event Level) the controller must respond by opening
the valve and starting the pump;

• if the level sensor says that the tank is not full (event N ot Level) the controller must respond by
stopping the pump and closing the valve.

The readings of the flow and pressure sensors are linked with the states of the valve and the pump as
depicted in Table 4.2. As explained above, the situation "valve closed and pump on" is risky and should
be avoided; namely, Φ = {start}. The FSM modelling the controlled behavior of the system is depicted
again in Fig. 4.9. There is a state in the admissible behavior (namely state 12) where the dangerous
situation "valve closed and pump on" occurs. This means that, due to the possible failure of the valve,

Figure 4.6: An automaton G (states 0–6 with events α, β, γ, τ and the unobservable failure event f1 ).

Figure 4.7: The recognizer S (states x0 -NB, ε-NB and β-B).

Figure 4.8: The modified automaton Gm (left) and the safe diagnoser Gsd (right).

the system can enter this risky state. This situation should be avoided by detecting the failure before the
system goes into state 12. Here, for the sake of simplicity, all the analysis will be done on the automaton G and
not on the modified automaton Gm , since these two have the same structure and the B-state to care about
is state 12.
Figure 4.9: DES model of controlled system behavior (the automaton Gn+f of Fig. 4.3, repeated here for convenience).

Let us assume that to perform the diagnosis only the flow sensor is used, i.e., disregard (or delete)
all pressure information (P,NP) in the event names in Fig. 4.9; then build the corresponding (safe)
diagnoser Gsd (flow) shown in Fig. 4.10. Upon inspection of Gsd (flow) we can see that diagnosability
holds but safe diagnosability is violated at state 12F, entered from state (3N,11F): this is a violation of
(TC2). Intuitively if only the flow sensor is used to detect the valve failure, then we must wait until
the pump is switched on to see if there is flow or not; this diagnoses the failure, but violates the safety
condition.
Repeat the above process but this time using only the pressure sensor; the corresponding (safe) diag-
noser Gsd (pressure) is shown in Fig. 4.11. Upon inspection of Gsd (pressure) we conclude that the system

Figure 4.10: Non safe diagnosable case: the safe diagnoser Gsd (flow) built using only the flow sensor.

equipped with the pressure sensor is diagnosable and safe diagnosable.

4.6 Safe time-diagnosability


The temporal behavior of a system is often very useful in detecting and isolating faults. For
instance, in manufacturing assembly lines, the time elapsed between the triggering of adjacent
motion sensors can provide useful information about the status of the system component be-
tween them. To this end, the notion of diagnosability recalled in Section 4.4 was extended to a
special class of timed automata in [30, 29], resulting in the definition of time diagnosability and
the identification of necessary and sufficient conditions for testing this property. The timed
automaton model used in [30, 29] is essentially the same as the modeling formalism for timed
discrete event systems proposed and studied in [22]. (This model has many similarities with
the widely-used timed automaton model of Alur and Dill [1] in computer science.)
In time diagnosability, the diagnosability condition D in Definition 4.1 is required to hold
after a bounded time interval, as opposed to a bounded number of events in the case of untimed
discrete-event models (cf. the ni integers in Definitions 4.1 and 4.2). The approach adopted in
[30, 29] for testing time diagnosability is to transform the timed automaton (termed “activity
machine” in these works) into an untimed automaton where tick events are used to model the
discrete evolution of time. This “time unfolding” operation is in fact at the heart of the methods
developed to analyze and control timed discrete-event models, cf. [1, 22]. It is shown in [29]
that time diagnosability can be verified by testing the diagnoser of the untimed automaton
obtained by time-unfolding the timed automaton system model.
The notion of safe diagnosability introduced in this chapter can be extended to timed au-
tomata models in a straightforward manner, resulting in the notion of safe time-diagnosability,
by following the approach in [30, 29]. We briefly discuss this extension, assuming the reader is

Figure 4.11: Safe diagnosable case: the safe diagnoser Gsd (pressure) built using only the pressure sensor.

familiar with [30, 29]. In the general case, one may wish to model the unsafe behavior pertaining
to safe time-diagnosability in terms of forbidden timed strings of events. Similarly to [30, 29],
safe time-diagnosability would be defined as in Definition 4.2 above, except that in condition
SC1 the requirement for condition D to hold would be in terms of a bounded time interval
(i.e., a bounded number of tick events in the timed traces generated by the timed automaton)
and in condition SC2 the safety condition would be in terms of the illegal timed languages cor-
responding to the forbidden timed strings. Safe time-diagnosability would be tested by first
composing the timed automaton system model Gt with the set of recognizers Sit corresponding
to the forbidden timed strings, as is done in equation 4.23 for untimed models. Note that the
timed automaton model of [22, 30, 29] is closed under the operations of parallel composition
and product, when these operations are defined to combine time intervals by intersection; see
[29, 22] for further details. At that point, the resulting Gtm would be time-unfolded to obtain
Gm , the safe diagnoser of Gm would be constructed, and safe time-diagnosability would be
tested by examining the structure of this safe diagnoser, namely indeterminate cycles and con-
ditions (TC1) and (TC2) of Theorem 4.2. The respective operations of product of timed models
and time-unfolding having taken care of proper marking of states (Bi or N B) and unfolding of
tick events, the safety requirement in safe time-diagnosability can be addressed using the same
untimed conditions TC1 and TC2 as before.

A special case of interest to the above general case is when timing information is not re-
quired to specify the sets of forbidden strings. In this case, one could time-unfold Gt first,
and then compose the resulting G with the recognizers Si to obtain Gm , exactly as is done in
equation 4.23.

4.7 Active approach to safe diagnosability


Since the diagnosability of a system depends both on the projection P that generates output
event sequences and the partition Πf on the failure events, it is possible to alter the diagnos-
ability properties of a system by changing the observable event set Σo , which is equivalent to
altering the set of sensors on the system (see [35]). An alternate approach (developed in [90]) to
the design of a diagnosable system is to design the system controller in such a way that it not
only satisfies other specified control objectives but it also results in a diagnosable system; in
other words the behavior of a non diagnosable system is restricted, by appropriate control, to
obtain a diagnosable system. The problem solved in [90] is called the Active Diagnosis Problem
(ADP).
In this section we address the problem of building safe diagnosable systems and answer
the question: given a diagnosable system with multiple failure modes, and given a set of safe diagnostic
requirements, how do we ensure that the system satisfies these requirements? The solution to this
problem could be achieved by determining an optimal, feasible set of sensors that will meet the
requirements (as seen in the example 4.3). However, in the spirit of the ADP, we address this
problem by restricting the behavior of a non safe diagnosable system, by appropriate control,
to obtain a safe diagnosable system.

4.7.1 Review of the Active Diagnosis Problem


Supervisory control theory deals with the design of controllers for a given DES that ensure the
controlled system meets certain qualitative specifications. These specifications define the legal
language of the system. A supervisor for a DES is an external agent that, based on its partial
view of the system, dynamically enables or disables the controllable events of the system in
order to ensure that the resulting closed loop language lies within the legal language. For an
introduction of the basic ideas of supervisory control theory we refer the reader to Appendix
A and to [26, 108] and the references therein.
Given a system G, a partial observation supervisor for G is a map

SP : P [L(G)] → 2^{Σc } × Σuc . (4.25)

The resulting closed loop system is denoted by SP /G. A realization of the supervisor SP for
G which achieves L(SP /G) = K is given by the pair (R, φ), where R = (XR , Σo , δR , xR,0 ) is a
recognizer for P (K) and φ : XR → 2^{Σc } × Σuc .
In addition to Assumptions (A1) and (A2) on the liveness and the absence of cycles of unob-
servable events that we made previously on the system, we now make the following additional
assumption:
(A3) Σc ⊆ Σo , i.e., no unobservable event can be prevented from occurring by control.
Under Assumption (A3) the above map φ(x) can simply be given by the active event set of R
at state x. In this case the supervisor SP may simply be realized by the FSM R, the feedback
map φ(·) being implicit in the transition structure of R.
Given a language M over the alphabet Σ and a language K ⊆ M we denote by K ↑C the
supremal controllable sublanguage of K with respect to M and Σuc ⊆ Σ. Likewise we denote
by K ↑CO the supremal controllable and observable sublanguage of K with respect to M , P and
Σuc ⊆ Σ (Assumption (A3) is sufficient to ensure the existence of K ↑CO ; see [26]). With abuse
of notation, if H is the FSM that generates K, we denote by H ↑C the FSM that generates K ↑C .
The active diagnosis problem is formulated as follows:

Active Diagnosis Problem (ADP): Given the regular, live, language L generated by the sys-
tem G and given a regular diagnosable language K ⊆ L such that every live sublan-
guage of K is diagnosable, find a partial observation supervisor SP for G such that
L(SP /G) = Lact where
(C1) Lact ⊆ K;
(C2) Lact is diagnosable;
(C3) Lact is as large as possible.
A brief description of the solution procedure proposed in [90] for the ADP is now given.

Initialization
Step 0.1 Obtain an FSM Glegal that generates K.

Step 0.2 Build the diagnoser G_d^{legal} corresponding to Glegal .

Step 0.3 Let i = 0, Hd (0) = G_d^{legal} and Md (0) = L(Hd (0)).
i-th Iteration:
Step 1 Compute the supremal controllable sublanguage Md↑C (i) of Md (i) with respect to P (L)
and Σuc ∩ Σo . Let Hd↑C (i) denote the FSM generating Md↑C (i).

Step 2 Compute M (i) = PL−1 [Md↑C (i)]. Let the FSM H(i) be such that L(H(i)) = M (i).
Step 3 Extend M (i) to the live language M live (i). Let H live (i) denote the FSM that generates
M live (i).
Step 4 Build the diagnoser Hdlive (i) corresponding to H live (i).
Step 5 If M live (i) is diagnosable then Lact = M (i) and the corresponding Sp is realized by the
FSM Hd↑C (i).
Step 6 If M live (i) is not diagnosable then

1. Obtain H̃d by eliminating from Hd↑C (i) all states q such that there is a transition de-
fined at q in Hdlive (i) due to the Stop event and this transition is part of an indeter-
minate cycle in Hdlive (i) or it leads to a state that is part of an indeterminate cycle in
Hdlive (i).
2. Let Hd (i + 1) = Accessible(H̃d ) and Md (i + 1) = L(Hd (i + 1)).
3. Increment i to i + 1.
Theorem 4.3 [90] The iterative procedure presented for solving the active diagnosis problem converges
in a finite number of iterations. M (i) at convergence is the supremal controllable, observable, and
diagnosable sublanguage of K and is a regular language. The supervisor Sp that achieves the closed loop
language M (i) can be realized by Hd (i), the diagnoser corresponding to the generator H(i) of M (i).
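The control flow of the iterative procedure reviewed above can be summarized by the Python skeleton below. All the language and automaton operations (supremal controllable sublanguage, inverse projection, live extension, diagnoser construction, indeterminate-cycle test, pruning) are placeholders attached to an assumed toolbox object called des; none of these names refer to an actual library, and the sketch only mirrors Steps 0–6.

def solve_adp(G_legal, P_of_L, Sigma_uc_obs, des):
    """Sketch of the ADP iteration of [90]; returns (H, H_d) with L_act = L(H)."""
    H_d = des.diagnoser(G_legal)                                      # Steps 0.2-0.3
    while True:
        H_d_supC = des.supremal_controllable(H_d, P_of_L, Sigma_uc_obs)  # Step 1
        H = des.inverse_projection(H_d_supC)                             # Step 2
        H_live = des.live_extension(H)                                   # Step 3
        H_d_live = des.diagnoser(H_live)                                 # Step 4
        if not des.has_indeterminate_cycle(H_d_live):                    # Step 5
            return H, H_d_supC        # L_act = L(H); supervisor realized by H_d_supC
        # Step 6: remove the states whose Stop-transition enters or belongs to an
        # indeterminate cycle of H_d_live, keep the accessible part, and iterate.
        H_d = des.accessible(des.prune_stop_states(H_d_supC, H_d_live))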
An exhaustive presentation of the class of languages Kl ⊆ L (l ≥ 0) which can be used
as initial conditions for the ADP can be found in [90]. In that work the reader can also find
an algorithmic method to generate these languages for the case where the diagnoser Gd is such
that no two indeterminate cycles in Gd are interleaved or nested (i.e., no two indeterminate cycles
share a common state in a manner such that it is possible for the diagnoser to keep alternating
between these two cycles).

4.7.2 Formulation and solution procedure to the active safe diagnosis problem
The active safe diagnosis problem is formulated as follows

Active Safe Diagnosis Problem (ASDP): Given the regular, live, language L generated by the
system G, given a set of forbidden strings after failures Φ = ⋃_{i=1}^{m} Φi , and given a regular
diagnosable language K ⊆ L such that every live sublanguage of K is diagnosable, find
a partial observation supervisor SP for G such that L(SP /G) = Lsaf e where:

(C1) Lsaf e ⊆ K;
(C2) Lsaf e is safe diagnosable with respect to Φ;
(C3) Lsaf e is as large as possible.

Solution procedure:

(a) Solve the ADP with respect to K and L; let M be the supremal diagnosable, controllable,
observable sublanguage of K and H the FSM that generates M .

(b) Build the FSM S that recognizes the set of strings Φ and the machine

Hs = H × S .

Build the diagnoser Hsd of Hs and check the safety property. If Hs is safe then M is also
the solution to the ASDP.
If Hs is not safe then there exist some states in Hsd that violate conditions (TC1) or (TC2).

(c) Define

Ξ = { s ∈ Σ*o s.t. in Hsd there exists a transition due to s which ends in a state that violates (TC1) or (TC2) } .

Build the new language

Mnew = M \ ( PL−1 [Ξ] ∩ M ) ,

its live extension Mnew^{live} , the FSMs Hnew and Hnew^{live} generating respectively the languages Mnew
and Mnew^{live} , and the diagnoser Hd,new of Hnew and its live extension Hd,new^{live} .

(d) Solve the ADP again using as initial condition

L = M ,   K = Mnew

and obtain M′ , H′ , Hd′ .

(e) The solution to the ASDP is M′ .
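The ASDP procedure is then a thin wrapper around the ADP iteration; the sketch below reuses the solve_adp skeleton and the assumed des toolbox of the previous sketch, with placeholder operation names only.

def solve_asdp(G_legal, P_of_L, Sigma_uc_obs, recognizers, des):
    """Sketch of steps (a)-(e): returns the generator of M' and its diagnoser."""
    H, H_d = solve_adp(G_legal, P_of_L, Sigma_uc_obs, des)   # step (a): M = L(H)
    H_s = des.product(H, *recognizers)                        # step (b)
    H_sd = des.safe_diagnoser(H_s)
    bad = des.states_violating_tc1_tc2(H_sd)
    if not bad:
        return H, H_d                                         # M is already safe
    Xi = des.observed_strings_entering(H_sd, bad)             # step (c)
    # M_new = M \ (P_L^{-1}[Xi] intersected with M)
    M_new = des.remove_strings(H, des.inverse_projection(Xi))
    # steps (d)-(e): ADP again with L = M and K = M_new, giving M'
    return solve_adp(M_new, P_of_L, Sigma_uc_obs, des)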

Theorem 4.4 Given the regular, live, language L generated by the system G, given a set of forbidden
strings after failures Φ = ⋃_{i=1}^{m} Φi , and given a regular diagnosable language K ⊆ L such that every
live sublanguage of K is diagnosable, the solution procedure given for the ASDP returns M′ , which is the
supremal observable, controllable, safe diagnosable sublanguage of K. Moreover, the supervisor Sp that
achieves the closed loop language M′ can be realized by Hd′ , the diagnoser corresponding to the generator
H′ of M′ .

Proof. Step (a) is proved by Theorem 4.3 to converge in a finite number of iterations and to give at
convergence the supremal controllable, observable and diagnosable sublanguage of K, namely
M . If this language is also safe (step (b)) then we are done and M is the supremal controllable,
observable and safe diagnosable sublanguage of K. Otherwise, from Theorem 4.2, we know that
in the safe diagnoser of the generator H of M there must be some Bi -states with label Fi in
the list of an uncertain state (TC1) or in the list of the first certain state after an uncertain state
(TC2). The set Ξ collects the strings in Σ*o for which the safe diagnoser enters these states. Note
that from Theorem 4.2 we know that the inverse projection in the language M of those strings violates
the definition of safe diagnosability; hence we need to remove them from the language (step
(c)), thus obtaining the new language Mnew .
Mnew is a sublanguage of M and so of K, but we cannot say anything about its liveness
(and so about its diagnosability) nor about its controllability and observability properties; however,
we can start again to iterate the solution procedure for the ADP (step (d)). Since each live
sublanguage of Mnew is also a live sublanguage of M , we are still in the hypotheses of Theorem
4.3, so the new iteration of the ADP solution is assured to converge in a finite number of steps
to the supremal diagnosable, observable and controllable sublanguage of Mnew , namely M′ .
If it is possible to prove that after these iterations the safety condition is still preserved, then
we have proved that M′ is the supremal safe diagnosable, observable and controllable sublanguage
of K. Moreover, from Theorem 4.3, the supervisor Sp that achieves the closed loop language
M′ can be realized by Hd′ , the diagnoser corresponding to the generator H′ of M′ , and the
live extension of this diagnoser, namely Hd′,live , can be used to perform on-line diagnosis on
the closed loop system.
The fact that M′ ⊆ Mnew ⊆ M with M′ and M diagnosable languages, and the definition of
the operator delay(· ; ·) (given in [90]), imply that

∀ s ∈ M′ : delay(s, M′ ) = n′tc ≤ delay(s, M ) = ntc ,

and ntc = n′tc if the maximum delay in detecting a failure event in M occurs along a trace t
which is also contained in M′ . This fact, joint with the definition of safe diagnosability, leads
to the conclusion that, under the hypotheses of the ASDP, the property of safe diagnosability is
preserved under string deletions (see Fig. 4.12); hence M′ is safe diagnosable. □

Figure 4.12: Safety condition (SC2) is preserved under string deletion.

Example 4.4 Consider the system G and the corresponding diagnoser Gd in Fig. 4.13. Let

Σf = Σf 1 = {σf 1 } ,   Σuo = Σf ∪ {σuo } ,   Σuc = Σf ∪ {δ} ,   Φ = {τ } .

The diagnoser Gd has a cycle of F1 -uncertain states with the corresponding event sequence βγδ. Cor-
responding to this cycle in the diagnoser, there are two cycles in G: the first involves states 3–5, which
appear with an F1 label in the diagnoser, and the second involves states 11–14, which appear with an N
label in the cycle in the diagnoser; moreover, state 14 is reached via an unobservable event (σuo ). Thus
the cycle in Gd is an F1 -indeterminate cycle and the system G is not diagnosable.
The ADP has been solved using K = K1 and the result is shown in Fig. 4.14. For more details on
the steps of the solution procedure the reader is referred to [90].

Figure 4.13: System G and diagnoser Gd .

Building the safe diagnoser for the solution Lact of the ADP (represented in Fig. 4.15, where
for the sake of clarity only the information about B-states is shown), it is immediate to see that Lact is not
safe diagnosable: in fact, a B-state (namely 10F1) appears in the first certain state after an uncertain state,
violating condition (TC2). We can see that

Ξ = {αβγτ }

and
PL−1 [Ξ] ∩ Lact = {ασf 1 βγτ } .
Fig. 4.16 shows the solution Lsaf e of the ASDP, i.e., the supremal controllable, observable, safe di-
agnosable sublanguage of K ⊆ L. Fig. 4.17 shows the partial observation supervisor Sp such that
L(G/Sp ) = Lsaf e ; we would like to stress that the live extension of Sp can be used to perform online
diagnosis on the closed loop system.

Figure 4.14: Solution Lact of the ADP for system G with K = K1 , and its live extension.

Figure 4.15: Safe diagnoser for Lact and its live extension.

Figure 4.16: Solution Lsaf e of the ASDP for system G with K = K1 and Φ = {τ }, and its live
extension.

Figure 4.17: Diagnoser for Lsaf e and its live extension.



4.8 An approach for fault tolerant supervision


At this point, we wish to revisit the motivating discussion of Section 4.3 regarding the problem
of fault tolerant supervision. The DES G is supervised with a nominal supervisor S nom in order
to achieve some desired specifications for the resulting automaton Gnom = G ∥ S nom . When
possible faults are accounted for and the system model is enhanced, the controlled behavior is
described by the automaton Gn+f instead of Gnom . Gn+f contains a nominal part and faulty
parts. We ask two questions:

1. Is L(Gn+f ) diagnosable?

2. Is L(Gn+f ) safe diagnosable?

Let us assume the worst case situation of two negative answers. In this case, we perform the
solution procedure of the ASDP described in Section 4.7. The result is a safe diagnosable system
denoted by Gn+sf and such that
L(Gn+f ) ⊇ L(Gn+sf ) .
We know that Gn+sf satisfies the property that after a failure fi and prior to any undesired
string v ∈ Φi (i = 1 . . . n), there exists an observable event σ ∈ Σo whose observation leads to
the detection and isolation of the fault.
We propose to approach fault tolerant supervision by enhancing Gn+sf and using σ as a
trigger to “force” reconfiguration events that will lead the system to compensate (to the extent
possible) for the effect of the detected fault. We do not specify such reconfiguration, as it is
problem-dependent. The system model Gn+sf is refined as follows. We introduce the new
events ri , nri ∈ Σc (i = 1 . . . n), which are assumed to be controllable and are used to force
reconfigurations. In Gn+sf , we insert the pair of events ri and nri after the above event σ
and just before the above string v, as shown in Fig. 4.18. After the event ri we model the
desired system reconfiguration, depending on the specific application¹. Prior to fault detection,
all nri events are enabled while all ri events are disabled. When we observe event σ (that
leads to the detection of the fault), we disable nri and enable ri , thereby “forcing” the desired
reconfiguration².
The above strategy is sound since the safe diagnosability property ensures that along any
string s ∈ L(Gn+sf ) there is an observable event σ ∈ Σo between fi and the undesired string
v ∈ Φi after which the diagnoser is Fi -certain and where we can “force” the reconfiguration. In
a general sense, this strategy is conceptually similar to so-called “explicit” approaches to fault
tolerant control in the literature on continuous-variable systems (cf. [12]).
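The triggering mechanism just described can be sketched as an online supervision loop: the diagnoser is updated at every observed event and, as soon as it becomes Fi-certain, the event nri is disabled and ri is enabled. The sketch below reuses the diagnoser_step and is_certain helpers of the earlier sketches; the event names and the encoding are illustrative assumptions.

def fault_tolerant_supervision(observations, delta, observable, q0, fault_types):
    """Online loop: yields, after each observation, the diagnoser state and the
    current set of enabled reconfiguration-related events."""
    enabled = {"nr_" + fi for fi in fault_types}   # before detection: nr_i enabled
    q = q0
    for sigma in observations:
        q = diagnoser_step(delta, observable, q, sigma)
        for fi in fault_types:
            if is_certain(q, fi):                  # fault F_i detected and isolated
                enabled.discard("nr_" + fi)
                enabled.add("r_" + fi)             # "force" the reconfiguration
        yield sigma, q, frozenset(enabled)

Safe diagnosability guarantees that, along every faulty string, the certainty is reached before any forbidden string in Φi can be executed, so the loop above always has the opportunity to force ri in time.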

Figure 4.18: Reconfiguration of a safe diagnosable system.

¹ Events ri can be used also to freeze the state of the diagnoser or eventually to reset the diagnoser itself.
² In Example 4.3, the reconfiguration could for instance be realized by trying to open and close the valve a certain
number of times and, if the valve is still blocked, by shutting down the system.

4.9 Conclusion

In this chapter we have shown that the starting point towards fault tolerant supervision of
DES is the problem of safe failure diagnosis. Starting from the standard definition of diagnos-
ability of discrete event systems, which deals with the problem of detecting the occurrence of an
unobservable event using the available observations on the system, the problem of performing
the detection before the system executes a forbidden string was introduced. This idea resulted

in the new notion of safe diagnosability for discrete event systems and in necessary and suffi-
cient conditions to test this language property. Moreover, the problem of explicitly taking into
account safe diagnosability as a requirement in system design was addressed and solved.
This work is intended as a starting point for the problem of fault tolerant supervision of dis-
crete event systems, namely, the design of a reconfiguration unit which, on the basis of the in-
formation provided by the diagnoser, "adjusts" the supervisor in order to achieve a prescribed
performance in the case of faulty behavior. In such a framework, reconfiguration actions on the
system should be enabled by the supervisor only when a failure occurrence has been detected,
and the forbidden actions (namely the set Φ) can be considered as those strings of events after
which reconfiguration is no longer effective.
To this aim, in the last section of this chapter we have presented a simple modelling tool which
makes use of the property of safe diagnosability and permits reconfiguration actions on the
faulty plant.
Chapter 5
Implicit fault tolerant control systems

In this chapter we propose an innovative way of dealing with the design


of fault tolerant control systems. We show how the nonlinear output reg-
ulation theory can be successfully adopted in order to design a regulator
able to offset the effect of all possible faults which can occur and, in doing
so, also to detect and isolate the occurred fault. The regulator is designed
by embedding the (possibly nonlinear) internal model of the fault. This idea
is applied to the design of a fault tolerant controller for induction motors
in the presence of both rotor and stator mechanical faults. In Appendix C this
control architecture will be applied to achieve fault tolerance in the tracking
control of an n-dof fully actuated robot manipulator.

5.1 Introduction
The most common approach in dealing with the fault tolerant control (FTC) problem is to split the
overall design into two distinct phases. The first phase addresses the so-called "Fault Detection
and Isolation" (FDI) problem, which consists in designing a dynamical system (filter) which,
by processing input/output data, is able to detect the presence of an incipient fault and to iso-
late it from other faults and/or disturbances. Once the FDI filter has been designed, the second
phase usually consists in the design of a supervisory unit which, on the basis of the information
provided by the FDI filter, reconfigures the control so as to compensate for the effect of the fault
and to fulfill performance constraints. In general, the latter phase is carried out by means of a
parameterized controller which is suitably updated by the supervisory unit on the basis of the
information provided by the FDI filter.
It is clear from this description that the classical approach to FDI and FTC relies upon a “cer-
tainty equivalence” idea extensively used in the context of adaptive control, since it is based on
the explicit estimation of unknown time varying signals/parameters (in the specific case the
faults) by the FDI filter and the subsequent explicit reconfiguration of the controller in the presence of
faults.
Our aim is to follow a different approach to fault tolerant control. Specifically, we address the
case in which the faults affecting the controlled system can be modeled as functions (of time)


within a finitely-parametrized family of such functions. Then, we design a controller which


embeds an internal model of this family, whose purpose is to generate a supplementary control
action which compensate for the presence of any of such faults, regardless their entity. In other
words the control reconfiguration does not pass through an explicit FDI design but, indeed,
is achieved by a proper design of a dynamic controller which is implicitly fault tolerant to all
the possible faults whose model is embedded in the regulator. This idea will be pursued us-
ing the theoretical machinery of the (nonlinear) output regulation theory (see [23]) under the
assumption that the side-effects generated by the occurrence of the fault can be modeled as
an exogenous signal generated by an autonomous “neutrally stable" system (the so-called “ex-
osystem"). As opposite to certainty equivalence-based adaptive control, the distinctive feature
of output regulation theory relies upon the design of a controller which, by embedding an in-
ternal model of exosystem, is able to offset the effect of any “exosystem-generated" signal with
no need of explicit estimation of such signal. Thus, in a certain sense, the main idea pursued
in this paper is to propose a fault tolerant control design which is to the classical FDI/FTC
approach as the output regulation theory is to certainty equivalence-based adaptive control. It
is interesting to see that, in this framework, the Fault Detection and Isolation phase, which is
usually the starting point in the design of a Fault Tolerant Control systems, is postponed to that
of control reconfiguration since it can be carried out by testing the state of the internal model
unit which automatically activates to offset the presence of the fault. The approach outlined
above is specialized to the design of a fault tolerant control system for Induction Motors (IM).
As with all magnetic rotating machines, the IM is subject to rotor and stator failures caused by
a combination of thermal, electrical, mechanical, magnetic and environmental stresses. Due
to these stresses the IM can operate in a failure condition whose effects show up as spurious
harmonic currents arising in the stator circuit (see [6], [103]), with frequencies which are directly
related to the kind of fault (in general stator or rotor fault) and amplitude and phase which
depend on the severity of the fault.
This allows us to see the fault tolerant problem in the perspective outlined above and to cast the
problem as an output regulation problem. More in detail, we show how an indirect field oriented
controller (IFOC) (see [83], [74] and the references therein) can be "augmented" with a dynamic
unit designed so as to compensate for the unknown spurious harmonic currents arising in the sta-
tor circuit in the presence of rotor or stator faults. In this way a controller which is implicitly fault
tolerant to all the faults belonging to the model embedded in the regulator is obtained. With
respect to the work presented in [20], where just a semiglobal result was discussed, this work
presents the more interesting design of a global stabilizer.

5.2 The Induction Motor model and the Indirect Field Oriented Con-
troller

5.2.1 The induction motor model

In this subsection we briefly review the model of the induction motor. For a more exhaustive
treatment on how this model can be derived the interested reader can refer to [67].
Under assumptions of linear magnetic circuits and balanced operating conditions, the equiv-
alent two-phase model of the symmetrical IM, represented in an arbitrary rotating two-phase
5.2. The Induction Motor model and the Indirect Field Oriented Controller 131

reference frame (d − q), is (see [67])

ẋ = f (x, ω0 ) + B(u + V ) + dTL


y = Cx (5.1)
²̇0 = ω0 ²0 (0) = 0

in which x = (ω, Ψd , Ψq , id , iq )T is the state vector, u = (ud , uq )T is a control vector and y =


(ω, id , iq )T is a vector of measurable variables. The state variables are defined as follows: ω is
the rotor speed, (Ψd , Ψq ) are the components of the rotor flux, (id , iq ) are the components of the
T
stator current vector. The input TL is the unknown load torque and V = (Vd , Vq ) represents
an exogenous input which is zero in case the IM works in un-faulty mode while is a bounded
(unknown) signal in the presence of faults (see the treatment in subsection 5.3.1). The variable
²0 represents the angular position of the rotating (d − q) reference frame with respect to the
a-axis of the fixed stator reference frame (a − b), and its rate of change ω0 is viewed as an
additional control. The relation between the original (a − b) and transformed (d − q) variables
is given by
xdq = e−j²0 xab
xab = ej²0 xdq
where · ¸
cos ²0 sin ²0
e−j²0 =
− sin ²0 cos ²0
where xyz stands for two-dimensional vectors in the (y − z) reference frame. The vector-valued
function f (x, ω0 ) and the constant matrices B and d are, respectively,
 
µ (Ψd iq − Ψq id )
 −αΨd + (ω0 − ω) Ψq + αLm id 
 
 − (ω0 − ω) Ψd − αΨq + αLm iq 
f (x, ω0 ) =  
 αβΨd + βωΨq − γid + ω0 iq 
−βωΨd + αβΨq − ω0 id − γiq

− J1
   
0 0
 0 0 
 0 
 
 
 0 0 
B=  d= 0 
 
 1
 σ 0 
  
 0 
0 σ1 0
where the positive constants in the model are related to electrical and mechanical parameters
of IM as follow
L2m
µ ¶ µ ¶
Lm 3 Lm Rr Rs
σ = Ls 1 − , β= , µ= , α= , γ= + αLm β
Ls Lr σLr 2 JLr Lr σ

with J the rotor inertia, Rs , Rr , Ls , Lr the stator/rotor resistances and inductances respectively,
Lm the magnetizing inductance.

5.2.2 Control objectives


General specifications for speedqcontrolled electric drives consider two outputs: rotor speed ω
and rotor flux amplitude |Ψ| = Ψ2d + Ψ2q . These variables are supposed to track two reference
132 Implicit fault tolerant control systems

signals denoted respectively by ω ? (t) and Ψ? (t), which are assumed to be smooth functions of
time. A further control objective consists of having Ψq (t) asymptotically decaying to zero; this
property is known as steady-state flux decoupling. Hence, given ω ? (t) and Ψ? (t), the problem
consists in the design of a dynamic output feedback controller of the form
ν̇ = α(ν, y, ω ? , Ψ? ) u = β(ν, y, ω ? , Ψ? ) ω0 = π(ν, y, ω ? , Ψ? )
such that for all initial states x(0) ∈ R5 and for all possible constant torque load TL , the trajec-
tories of the closed loop system are bounded and
lim |ω(t) − ω ? (t)| = 0 lim |Ψd (t) − Ψ? (t)| = 0 lim Ψq (t) = 0 .
t→∞ t→∞ t→∞

5.2.3 A Global Indirect Field Oriented Controller


In this subsection we review the design of an indirect field oriented controller (IFO) for the
induction motor able to meet the control objectives outlined above. Although the control struc-
ture strongly follows that presented in [83], we review the entire design procedure following
a different approach in order to “pave the way” for the design of the fault compensation unit.
Indeed this allows us to shed some light onto some important features of the IFO controller,
not shown in [83], which turn out to be crucial for the rest of the chapter.

The idea in [83] is to design the control inputs (ud , uq ) and ω0 in order to force the overall
system to behave as the cascade connection of two subsystems, called the flux subsystem and the
speed subsystem. More specifically we consider first a subsystem associated with the state vari-
ables (Ψd , Ψq , id ) and we show that a suitable choice of the control input ud and of the (d, q)
reference frame rate ω0 allows us to achieve some stability properties (better specified later)
regardless of the behavior of the other state variables (ω, iq ), provided that ω(t) is bounded. Then
we turn our attention on the subsystem associated with the remaining state variables (ω, iq )
and we design the control input uq in order to achieve the prescribed stability properties.

To begin the analysis of the flux subsystem, define error variables as


Ψ̃d := Ψd − Ψ? Ψ̃q := Ψq (5.2)
and set
1
ĩd := id − i?d where i?d :=
(αΨ? + Ψ̇? ) . (5.3)
αLm
Trivial computations show that, choosing the control input ud and the rate of the (d − q) refer-
ence frame ω0 as
ud = σ(−Kd ĩd + γid − ω0 iq − αβΨ? + i̇?d ) + σufc fc
d := ūd + σud
1 (5.4)
ω0 = ω + (αLm iq + βω ĩd )
Ψ?
in which Kd is a design parameter and ufc d an additional control input (which will be spent for
fault compensation), the dynamics of the state variables (Ψd , Ψq , id ), in the new coordinates
(5.2)-(5.3), are described by the following equations (which define the flux subsystem)
˙
Ψ̃ d = −αΨ̃d + sω Ψ̃q + αLM ĩd
˙
Ψ̃q = −sω Ψ̃d − αΨ̃q − βω ĩd (5.5)
ĩ˙ = βαΨ̃ + βω Ψ̃ − K ĩ + v
d d q d d d
5.2. The Induction Motor model and the Indirect Field Oriented Controller 133

in which sω := ω0 − ω is the so-called slip angular frequency and vd := ufc d + Vd is viewed as an


exogenous input.
It can be easily proven that, regardless of what ω(t) and sω (t) are, an high value of Kd ren-
ders such a system input-to-state stable with respect to the input vd , with a linear gain function
whose coefficient can be made arbitrarily small by increasing Kd . This is formalized in the next
proposition where the notion of a-L∞ bound, proposed in [100], is used1 . For convenience,
here and in the following, we denote the state vector of (5.5) as
T
xf := (Ψ̃d , Ψ̃q , ĩd ) .

Proposition 5.1 Set Kd = k K̄d , where K̄d is a positive fixed constant. Then there exists a number
k1? > 0, independent of ω(t) and sω (t), such that for all k ≥ k1? the state of system (5.5) satisfies an
a-L∞ bound with respect to the input vd without restriction on the input and on the initial state and
linear gain functions. In particular, there exists numbers γf > 0 and γf0 > 0 such that, for each x0f ∈ R3
and each measurable vd , the solution of (5.5) with x0f (0) = x0f exists for all t ≥ 0 and satisfies
γ
kxf k∞ ≤ max{γf0 kx0f k, f kvd k∞ }
γf k . (5.6)
kxf ka ≤ kvd ka
k

Proof. The result follows as a trivial application of the Young’s inequality, considering the
T
candidate ISS-Lyapunov function V := xf xf , as the dependence on w(t) and on sω (t) is skew-
symmetric. /
We now take into account the remaining state variables of (5.1) to define the speed subsystem.
To this end, define the additional error variable

ω̃ := ω − ω ?

and set
1
ĩq := iq − i?q where i?q = (−Kω ω̃ + T̂L + ω̇ ? ).
µΨ?
In the expression of i?q above, Kω is a positive constant and T̂ is an auxiliary state variable of the
controller, introduced to offset the unknown load torque TL , whose dynamics will be specified
later. Moreover, let i̇?q1 denote the known part of the derivative of i?q , which is given by

1 h i Ψ̇?
˙
i̇?q1 := Kω (Kω ω̃ − µΨ?
ĩq ) + Ψ̂ + ω̈ ?
− ? i?q .
µΨ? Ψ
With this notation in mind we design the control law for the input uq in the following way
h i
uq = σ −(Kq − γ)ĩq + ω0 id + βωΨ? + i̇?q1 − 1? (Ψ̇? ĩq + Kξ ξ) + σufc fc
q := ūq + σuq
Ψ
ξ˙ = Kη Ψĩq (5.7)
˙
T̂L = −KT ω̃

where Kq and Kξ are design parameters, Kη and KT are fixed (though arbitrary) positive con-
stants and ufc
q is a new control input, used for fault compensation, to be fixed later.
1
It is important to say that, to our purposes, the notion of a-L∞ bound is equivalent to that of input-to-state
stability. We use this, instead of the more commonly used notion of ISS, since in the proof of the next propositions
it is convenient to use the small gain theorem of [100], which is expressed in terms of a-L∞ bounds.
134 Implicit fault tolerant control systems

The subsystem thus obtained (which defines the speed subsystem), having set

η := Ψ? ĩq , T̃L := T̂L − TL ,

is described by equations of the form

ẋs = As xs + Asf (xf , t)xs + Bs (xf , t)xf + Bv vq (5.8)


¡ ¢T
in which xs := T̃L ω̃ ξ η and
   
0 KT 0 0 0 0 0 0
 −1 −Kω 0 µ  Ψ̃d 
 −1 −K ω 0 µ 
As =  Asf (xf , t) = ?  0 (5.9)
  
 0 0 0 Kη 
 Ψ  0 0 0 

K
− µω 0 −Kξ −Kq − Kµω a42 0 Kω
   
0 0 0 0
 b21 b22 0   0 
Bs (xf , t) = 
 0
 Bv =   (5.10)
0 0   0 
b41 b42 0 1
with
Kω2 1
a42 := −βΨ?2 − , b21 := (TL + ω̇ ? ) , b22 := −µ(ĩd + i?d ) ,
µ Ψ?

b41 := −βω ? Ψ? + (TL + ω̇ ? ) , b42 := αβΨ? − Kω i?d − Kω ĩd .
µΨ?
As for vd in the case of the flux subsystem, here vq := ufc q + Vq is viewed as an exogenous
input. The next result states that, if Kq and Kξ are sufficiently large, system (5.8), viewed as
a system with state xs and inputs (xf , vq ), is input-to-state stable with a linear gain function
whose coefficient can be made arbitrarily small, provided that the input xf is sufficiently small.

Proposition 5.2 Set Kq = k K̄q and Kξ = k K̄ξ , where K̄q and K̄ξ are positive fixed constants. Then
there exist numbers k2? > 0 and ∆ > 0, such that for all k ≥ k2? the state of system (5.8) satisfies
an a-L∞ bound with respect to the inputs (xf , vq ) without restriction on the initial state, restrictions
(∆, ∞) on the inputs (xf , vq ), and linear gain functions. In particular there exist numbers γs > 0 and
γs0 > 0 such that for each x0s ∈ R4 and each measurable (xf , vq ), the solution of (5.8) with xs (0) = x0s
exists for all t ≥ 0 and satisfies

kxs k∞ ≤ max{γs0 kx0s k, γs max{kxf k∞ , 1 kvq k∞ }}


k
kxs ka ≤ γs max{kxf ka , 1 kvq ka }.
k

Proof. The proof is a consequence of the small gain theorem. In particular it is worth partition-
T T
ing the state variable xs as xs = (x1s , x2s ), with x1s = (T̃L , ω̃) and x2s = (ξ, η) , and considering
the 4-dimensional system (5.8) as the feedback interconnection of the 2-dimensional subsystem

ẋ1s = A11 11
£ ¤ 1 £ 12 12
¤ 2 1
s + Asf (xf , t) xs + As + Asf (xf , t) xs + Bs (xf , t)xf (5.11)

with the 2-dimensional subsystem

ẋ2s = A22 22
£ ¤ 2 £ 21 21
¤ 1 2 2
s + Asf (xf , t) xs + As + Asf (xf , t) xs + Bs (xf , t)xf + Bv vq (5.12)
5.2. The Induction Motor model and the Indirect Field Oriented Controller 135

where the matrices Aij i 1


k and Bk easily obtained from (5.9) and (5.10). The state xs of the first sub-
system can be shown to satisfy an a-L∞ bound without restriction on the initial state, nonzero
restriction on the inputs (x2s , xf ) and linear gain function. As a matter of fact, since A11
s is Hur-
T
witz, consider the candidate ISS-Lyapunov function V = x1s P1 x1s with P1 > 0 solution of
T
P1 A11 11
s + As P1 = −I , (5.13)

and note that there exist positive constants a and b0 , b1 such that

kA11
sf (xf , t)k ≤ akxf k, kA12
sf (xf , t)k ≤ akxf k kBs1 (xf , t)k ≤ b0 + b1 kxf k . (5.14)

Taking derivatives of V along the solution of (5.11) we get


T
= −kx1s k2 + 2x1s P1 A11 1
£ ¡ 12 12
¢ 2 1
¤
V̇ sf (xf , t)xs + As + Asf (xf , t) xs + Bs (xf , t)xf
≤ −kx1s k2 + 2kP1 kkx1s k akx1s kkxf k + kA12 2 2 2 .
¡ ¢
s kkxs k + akxs kkxf k + b0 kxf k + b1 kxf k

Now fix ∆0 > 0 so that ∆0 akP1 k < 1/4 and note that
1
kxf k < ∆0 ⇒ V̇ ≤ − kx1s k2 + (`1 )kx1s kkx2s k + (`2 + `3 ∆0 )kx1s kkxf k
2
for some positive numbers `i , i = 1, . . . , 3. From this and lemma 3.3 in [100] it turns out that
the state x1s of (5.11) satisfies an a-L∞ bound without restriction on the initial state, restrictions
(∞, ∆0 ) on the inputs (x2s , xf ) and linear gain functions. In particular there exists γs0 > 0 such
that
kx1s ka ≤ γs0 max{kx2s ka , kxf ka } .
In a similar way, it can be shown that the state x2s of (5.12) satisfies an a-L∞ bound without
restriction on the initial state, nonzero restriction on the inputs (x1s , xf ) and linear gain function,
whose coefficient can be arbitrarily lowered by increasing k. To this end, observe that the real
part of the eigenvalues of the matrix A22 s depends linearly on 1/k. In view of this there exist
symmetric positive definite matrices P2 and Q such that (see [105])
T
P2 A22 22
s + As P2 = −2`kP2 − Q (5.15)
T
for some positive `. Consider now the candidate ISS Lyapunov function V = x2s P2 x2s whose
derivative along (5.12) satisfies (observe that bounds like (5.14) also hold for A22 21 2
sf , Asf and Bs )
T
≤ −2`kkP2 kkx2s k2 − x2s Qx2s + kx2s k q0 kx2s kkxf k + q1 kx1s kkxf k + q2 kx1s k+
¡

+q3 kxf k + q4 kxf k2 + q5 kvq k
¢

for some positive numbers qi , i = 0, . . . , 5, and fix ∆00 > 0 such that q0 ∆00 < kQk. Hence

kxf k < ∆00 V̇ ≤ −2`kkP2 kkx2s k2 + kx2s k q1 + q2 ∆00 kx1s k + (q3 + q4 ∆00 )kxf k + q4 kvq k
£ ¤

from which it is easy to conclude, again by lemma 3.3 in [100], that the state x2s of (5.11) satisfies
an a-L∞ bound without restriction on the initial state, restrictions (∞, ∆00 , ∞) on the inputs
(x1s , xf , vq ) and linear gain functions. In particular there exists γs00 > 0 such that

γs00
kx1 ka ≤ max{kx1s ka , kxf ka , kvq ka } .
k
136 Implicit fault tolerant control systems

In view of this, a simple application of the small gain theorem 1 in [100] proves the claim of the
proposition with k2? ≥ γs00 max{1, 1/γs0 , γs0 }, ∆ ≤ min{∆0 , ∆00 } and γs ≥ 2 max{γs0 , γs00 , γs0 γs00 }. /
The previous two propositions are building blocks for proving the next concluding propo-
sition which states that a sufficiently large value of Kd , Kq and Kξ renders the overall system
(5.5)-(5.8) input-to-state stable with respect to the inputs (vd , vq ) with arbitrarily large restric-
tions on the inputs and arbitrarily small linear gains.

Proposition 5.3 Let M be an arbitrary positive number and set, as in the previous proposition, Kd =
k K̄d , Kq = k K̄q , Kξ = k K̄ξ where K̄d , K̄q and K̄ξ are fixed positive numbers. Then there exists
k3? > max{k1? , k2? } such that for all k ≥ k3? the state x := (xf , xs ) ∈ R7 of the overall system (5.5)-
(5.8) satisfies an a-L∞ bound without restriction on the initial state, restrictions (M, ∞) on the inputs
(vd , vq ) and linear gain functions. In particular there exist numbers γ > 0 and γ 0 > 0 such that for
each x0 ∈ R7 and each measurable (vd , vq ) such that kvd k∞ ≤ M , the solution of (5.5)-(5.8) with
x(0) = (xf (0), xs (0)) = x0 exists for all t ≥ 0 and satisfies
γ
kxk∞ ≤ max{γ 0 kx0 k, max{kvd k∞ , kvq k∞ }}
γ k
kxka ≤ max{kvd ka , kvq ka } .
k

Proof. The idea is to act on k in order to force xf to fulfill the restriction ∆ on an interval
[T ? , ∞) and to see the overall system as cascade connection of two ISS systems. For this we
need first to make sure that the solution exists on any interval of the form [0, T ? ], i.e. that the
system does not have finite escape time. To this end note that, if solutions (hence, in partic-
ular, ω(t) and sω (t)) are defined on a interval [0, T ? ], kxf (t)k is bounded by the fixed quantity
γ
max{γf0 kx0f k, f kvd k∞ }. On the other hand, the growth of the right-hand side of (5.8) is affine
k
in kxs k, with coefficients only depending on bounds on kxf (t)k and kvq (t)k. Thus, on the in-
terval [0, T ? ], kxs (t)k is guaranteed to be bounded by an exponentially growing function which
only depends on kx0f k, kx0s k, kvd k∞ , kvd k∞ . As a consequence, solutions exist on any interval of
the form [0, T ? ], i.e. no finite escape time can occur.
Now note that, by proposition 1, kvd k∞ < M and k ≥ k1? imply that
γf
kxf ka ≤ M.
k

Hence, fixing k3? > γf M/∆, it turns out that there exists T ? > 0 such that kxf (t)k < ∆ for all
t ≥ T ? , namely the input xf of the speed subsystem fulfills the restriction ∆ in finite time. From
this the claim of the proposition follows by standard cascade arguments as a consequence of
proposition 1 and 2. In particular the fact that the asymptotic gain is proportional to 1/k follows
from gain composition. /
The previous result implies that, in case vd and vq are identically zero, the overall closed
loop system given by (5.5)-(5.8) is globally asymptotically stable namely the control objective
specified in section 5.2.2 are achieved for every initial state x(0) of the induction motor (5.1).
The control structure, given by (5.4), (5.7), turns out to be a classical indirect field oriented con-
troller, as proposed in [83]. In case vd and vq are not zero, the previous analysis has shown
that the IFO controller is “robust” with respect to exogenous inputs matched with the current
dynamics, as it achieves input-to-state stability with a linear gain proportional to 1/k.
Before describing how this IFO controller can be enriched with an fault tolerant unit we con-
clude this subsection with few remarks on some important points of the above analysis.
5.3. The implicit fault tolerant controller 137

Remark 5.1 Note that the controller (5.7) includes two integral actions, provided by the state variables
T̂L and ξ. The purpose of the first one is to offset the constant unknown load torque TL . The presence of
the other integral action is justified to achieve “steady state robustness” with respect to the parameter α
(see for more details [83]).

Remark 5.2 It is interesting to note that the choice of the control law as in (5.4) and (5.7) puts the
system in a cascade form, with the flux subsystem feeding the speed subsystem. This is achieved by
forcing the variables ω(t) and sω (t) in the flux dynamics (5.5) to appear as entries of a skew symmetric
matrix. This has permitted, in the above analysis, to consider the flux subsystem as an autonomous
system feeding the speed dynamics (see the proof of the last proposition).

Remark 5.3 In principle the above IFO controller does not allow for uncertainties on the rotor param-
eter α (which is typically an highly uncertain parameter), as this is explicitly used in (5.3)-(5.4) for
achieving the cascade structure. However it must be stressed that, in practice, this drawback does not
prejudice the effectiveness of the controller, as demonstrated by the experimental results described in [83]
where it is shown that high uncertainties on α can be tolerated (see also the first remark).

5.3 The implicit fault tolerant controller


5.3.1 Fault scenario and faulty model
In this section we briefly review how the model of the IM modifies in presence of faults. The
fault scenario considered in this paper addresses faults on the rotor as well as on the stator of
the induction motor, which could be caused by failures of both electrical and mechanical nature.
Several research efforts have focused in the last years on the study of the effect of faults in the
induction motor, see besides others [103], [6], [107], [99], [42], [43]. All these works, mainly
developed within the domain of electrical engineering, are based on approximate mathemati-
cal description of the faulty motor and on appropriate experimental validation. As a matter of
fact the presence of faults in general introduces asymmetries in the motor circuit which make
a precise and rigorous modeling an impossible task.
With reference to [103], the most common faults can be summarized in the following two
classes:

• Rotor asymmetries, mainly due to broken bars or dynamic eccentricity;

• Stator asymmetries, mainly due to short circuit or static eccentricity.

Following the theory in [103], it turns out that the presence of mechanical and electrical faults
generates asymmetries in the IM, yielding some slot harmonics in the stator winding. In the
a − b reference frame, it is possible to model this effect by adding a sinusoidal corruption term
to the stator currents values. Specifically, letting iuf uf
a (t) and ib (t) denote the values of the stator
f f
current in the absence of faults and ia (t) and ib (t) the corresponding values in the presence of
faults, the latter can expressed in the form

ifa (t) = iuf


a (t) + A sin(²e (t) + φ)
(5.16)
ifb (t) = iuf
b (t) + A cos(²e (t) + φ) ,

in which Z t
²e (t) = 2π fe (τ )dτ , (5.17)
0
138 Implicit fault tolerant control systems

where fe (t) is a function which depends on the specific fault. For example, faults caused by
rotor asymmetries (typically due to broken bars or dynamic eccentricity) yield harmonic com-
ponents at frequencies
fe = frbb = (1 ± 2ksω )f (5.18)
in which (see section 5.2) sω = ω0 − ω is the slip angular frequency, f is the supply frequency
and k is a positive integer. On the other hand, faults generated by stator asymmetries (typically
induced by short circuit or static eccentricity) generate harmonic components at the frequency

fe = fssc = f . (5.19)

As far as the amplitude A and the phase φ in (5.16) are concerned, they depend on the entity
of the rotor or stator asymmetry and then can not be considered known since depend on the
specific fault severity.
Similarly, once the variables are expressed in the (d − q) reference frame, it turns out that the
stator currents in presence of (stator or rotor) asymmetries change into

ifd = iuf
d + A sin(²e (t) + ²0 (t) + φ)
(5.20)
ifq = iuf
q + A cos(²e (t) + ²0 (t) + φ)

where ²0 , introduced in the previous section, denotes the angular position of the (d − q) ref-
erence frame. Few assumptions are made in the following in order to simplify relation (5.20).
First of all, for sake of simplicity, we concentrate on the case in which the possible frequencies
which characterize the sinusoidal additive terms in (5.20) are constants. This corresponds to
assume that the following two hypotheses hold:

(a) the reference angular velocity ω ? is constant and a possible fault is allowed to arise only
when the steady-state has been reached;

(b) the frequencies which characterize the faulty current are “frozen" to the steady state val-
ues, namely to the values obtained when the state of the system assumes the steady state
value. This is equivalent to assume that the frequencies are not dependent on the ac-
tual state of the plant but only on the reference to be tracked and on the parameters of
the system. These assumptions, which play a crucial role in the analysis which follows,
will be removed in the experimental and simulation results where it will be shown how
the proposed Fault Tolerant controller works properly even in case of state-dependent
frequencies.

As a matter of fact, under these assumptions, it is easy to realize that (bearing in mind (5.17),
(5.18), (5.19) and the definition of the slip sω )

²e (t) + ²0 (t) = 2πf t + (ω ? + s?ω )t + ²?0 := Ω1 t + ²?0 (5.21)

as far as faults concerning stator asymmetries are concerned, and

²e (t) + ²0 (t) = 2π(1 ± 2ks?ω )f t + (ω ? + s?ω )t + ²?0 := Ω2,±k t + ²?0 (5.22)

as far as faults concerning rotor asymmetries are concerned. In the previous expressions s?ω
denotes the (constant) steady state reached by sω which turns out to be

αLm T̂
s?ω := (5.23)
µΨ?2
5.3. The implicit fault tolerant controller 139

while ²?0 denotes the (unknown) position of the reference frame once the fault occurs.
As final hypothesis about the effect of rotor asymmetries, we assume that there exists a (possi-
bly large) finite integer N > 0 such that all the components with frequencies Ω2,±k , k > N , are
negligible with respect to the first components.
These assumptions allow us to express the deviation of the stator currents values in presence
of faults with respect to the un-faulty values as
N
X
ifd = iuf
d + A sin(Ω1 t + φ) + Ak sin(Ω2,k t + φk ) + A−k sin(Ω2,−k t + φ−k )
k=1 (5.24)
XN
ifq = iuf
q + A cos(Ω1 t + φ) + Ak cos(Ω2,k t + φk ) + A−k cos(Ω2,−k t + φ−k )
k=1

namely as the superposition of (2N + 1) harmonic components whose frequencies depend on


the specific fault (the first term being due to stator asymmetries, while the last 2N terms are
due asymmetries) and with amplitudes and phases depending on the fault severity.
Note that in a realistic scenario frequencies, amplitudes and phases of harmonic components
are all uncertain quantities. As a matter of fact, by definition of Ω1 in (5.21) and of Ω2,±k in (5.22)
it is clear that the frequencies depend upon the induction motor parameters and in particular
on the rotor parameter α whose value, as also specified in the last remark of subsection 5.2.3,
is in general highly uncertain. Moreover, phases and amplitudes of the fault harmonics are
depend on the fault severity and on the reference frame position when the fault occurs. Clearly
all this quantities can not be assumed to be a priori known.
We proceed now with some manipulations, with the purpose of expressing in a more con-
venient way the effect of the fault on the currents dynamics. To this end define
¡ ¢
$ := Ω1 Ω2,1 Ω2,−1 . . . Ω2,N Ω2,−N

and the exosystem


ẇ = S($)w w ∈ R4N +2 (5.25)
with
 
Sr,1 0 · · · 0
0 Sr,2 · · · 0
µ ¶ µ ¶
Ss 0 0 Ω1
 
S($) = Ss = Sr = 
 
0 Sr −Ω1 0 .. .. .. .. 
 . . . . 
0 0 ··· Sr,N

and  
0 Ω2,k 0 0
 −Ω2,k 0 0 0 
Sr,k =
 0
 .
0 0 Ω2,−k 
0 0 −Ω2,−k 0
With this in mind it is clear that (5.24) can be expressed as

ifd = iuf
d + Qd w
ifq = iuf
q + Qq w .

with ¡ ¢
Qd := ¡ 1 0 1 0 · · · 1 0 ¢
Qq := 0 1 0 1 ··· 0 1 .
140 Implicit fault tolerant control systems

In this setting the uncertainty on the amplitude and phase of the additive sinusoidal terms
reflects in that on the initial state w(0) of the exosystem (5.25) whose structure is uncertain as
the vector $ of frequencies is uncertain.
Bearing in mind the dynamics of the rotor currents in the normal (i.e. in the absence of faults)
operative conditions, it is also simple to get the IM dynamics after the occurrence of a fault.
As a matter of fact, taking derivatives of (5.24) it is readily seen that the model of the IM in
presence of faults is given by (5.1) with an exogenous input V equal to
µ ¶ µ ¶
Vd −γQd w + Qd Sw + ω0 Qq w
V = = . (5.26)
Vq −ω0 Qd w − γQq w + Qq Sw

or, in a more compact form, µ ¶


Γd
V = w = Γw
Γq
where µ ¶
−γ `1 Γ1 · · · ΓN
Γ :=
−`1 −γ Γ̄1 · · · Γ̄N
with
¡ ¢ ¡ ¢
Γk := −γ `2,k −γ `3,k Γ̄k := −`2,k −γ −`3,k −γ k = 1, . . . , N

and
`1 := ω ? + s?ω + Ω1 `2,k := ω ? + s?ω + Ω2,k `3,k := ω ? + s?ω + Ω2,−k .
Note that, with the above formalism, both electrical or mechanical or simultaneous faults are
allowed, with the first two components of the exosystem state which take into account for sta-
tor faults, while the last 4N for rotor faults.
To conclude this section it is worth to anticipate that in the next part of the chapter we will
assume that that initial state of the exosystem (5.25), which as stressed above depends on the
specific fault and on its severity, is unknown but ranges within a known, but otherwise arbitrar-
ily large, compact set, denoted W. As the vector of frequencies $ is concerned, we will study
first the case in which this is perfectly known (which corresponds to require perfect knowl-
edge of the IM parameters) and then the case in which this is uncertain. In the latter case, we
will assume the knowledge of a compact set, denoted F, to which the vector $ is supposed to
belong.

5.3.2 Reconfiguration strategy


As stressed in the introduction of this chapter, the idea behind implicit fault tolerant control is
that of designing a control unit able to automatically offset the effect of the faults, without need
of an explicit fault detection and isolation process and consequent explicit reconfiguration. This
objective will be pursued for the induction motor by means of the control scheme sketched in
fig. 5.1.
In this scheme, the IFO controller is enriched with a fault compensation unit which, reading
the current tracking error (ĩd , ĩq ), generates additional control inputs (ufc fc
d , uq ) able to offset the
effect of any fault belonging to the classes described above. The operation of the fault tolerant
unit is not supervised by any higher level unit as it automatically corrects the control action
for achieving fault tolerance. On the contrary, the role of the supervisory unit shown in the
scheme of fig. 5.1 is that of performing fault detection and isolation by reading the state of the
5.3. The implicit fault tolerant controller 141

Vd Vq
? ?
ud flux subsys uq
- - speed subsys ¾
(xf ) (xs )

id (ω, iq )
? ?
¾ ūd
m IFO contr. ūq - m
(ξ, T̂L ) ¾
6 6
(Ψ? , ω ? )
ĩd ĩq
?
ufc
d ufc
q
fault comp
sat ¾ - sat
(ζ)
ζ
?
supervisor faults
-
FDI

Figure 5.1: The implicit fault tolerant controller structure

fault compensation unit. In this perspective, the FDI phase is postponed to that of FTC as it is
carried out by looking at the unit which is possibly already compensating the fault.
It is important to stress that the desired goal is to realize the fault compensation unit as a sort
of plug-and-play device, whose design is as much independent as possible by the previously
designed controller, whose purpose is to achieve the prescribed control objectives in the un-
faulty operation mode. As a matter of fact, the only feature required to the controller designed
in section 5.2.3 is that the closed-loop system is input-to-state stable with suitable restrictions
with respect to a matched control input and with a sufficiently small linear gain function.
For sake of clarity we address separately the design of the fault tolerance control unit in the
two cases in which the frequency of the disturbance is known (namely when the matrix S in
(5.25) is known) and that in which it is not.

5.3.3 Embedding an internal model of the fault


We start from a result of [72], which permits a suitable parametrization of the pair (S, Γ) in-
troduced in the section 5.3.1. As a matter of fact, in [72] it is shown that if S is a n × n matrix
having all eigenvalues on the imaginary axis, if F is a n×n Hurwitz matrix, if Γ is a 1×n vector
such that S, Γ is observable, and G is a n × 1 vector such that F, G is controllable, the unique
solution T of the Sylvester equation

T S − F T = GΓ (5.27)

is nonsingular. From this, it is immediate to realize that

T ST −1 = (F + GΦ) with Φ := ΓT −1 , (5.28)

namely the pair (S, Γ) is similar, via the change of coordinates induced by T , to the pair (F +
GΦ, Φ). Using this result twice, it is seen that the two components Vd , Vq of the exogenous input
142 Implicit fault tolerant control systems

V can be thought of as generated by a pair of systems of the form


w̄˙ d = (Fd + Gd Φd )w̄d Vd = Φd w̄d in which w̄d := Td w
w̄˙ q = (Fq + Gq Φq )w̄q Vq = Φq w̄q in which w̄q := Tq w .
Indeed, we can rewrite these two as one single system
w̄˙ = (F + GΦ)w̄ V = Φw̄ with w̄ := T w . (5.29)
in which
µ ¶ µ ¶ µ ¶ µ ¶ µ ¶
w̄d Fd 0 Gd 0 Φd 0 Td
w̄ = , F = , G= , Φ= , T = .
w̄q 0 Fq 0 Gq 0 Φq Tq
Let satλ (·) denote the (piece-wise differentiable) saturation function
satλ (s) := sgn(s) min{|s|, λ}
and let dznλ (·) denote the (piece-wise differentiable) dead-zone function
dznλ (s) := satλ (s) − s .
For a 2-vector-valued argument v we set, with a mild abuse of notation,
µ ¶ µ ¶
satλ (v1 ) dznλ (v1 )
satλ (v) = , dznλ (v) = .
satλ (v2 ) dznλ (v2 )
With this notation in mind, set
ufc
µ ¶ µ ¶
ĩq fc q
ĩ := u :=
ĩd ufc
d

and note that the dynamics of ĩ can be written in the compact form as

ĩ˙ = a(xf , xs , t) + ufc + ΦT w


where a(xf , xs , t) is a 2-vector-valued function which can readily be obtained from (5.5) and
(5.8).
Choose now, as fault compensation unit, the following controller (with saturated output)

ζ̇ = (F + GΦ)(ζ − Gĩ) − GK ĩ + Gdznλ (Φζ − ΦGĩ)


(5.30)
ufc = satλ (Φζ − ΦGĩ)
where λ is a positive design parameter and K := diag(Kq , Kd ).
It turns out that the fault compensation unit (5.30) is able to automatically offset the effect
of the fault V = ΦT w for all the allowed initial conditions w(0), if the gain coefficient k which
characterizes the indirect field oriented controller (see proposition 5.3) and the amplitude λ of
the saturation function are sufficiently large. In particular the amplitude of the saturation func-
tion, which corresponds to the maximum amplitude of the output of the fault compensation
unit, must be compatible with maximum amplitude of the fault effect, which is given by
VM := max kΓwk∞ (5.31)
w(0) ∈ W
where W represents the compact set where the initial condition w(0) is supposed to lie. Bearing
in mind the previous discussion we are able to proof the following proposition.
5.3. The implicit fault tolerant controller 143

Proposition 5.4 Consider system (5.5), (5.8) with dynamic feedback control law given by (5.4), (5.7)
(5.30), with Kd = k K̄d , Kq = k K̄q , Kξ = k K̄ξ , where K̄d , K̄q and K̄ξ are fixed positive numbers. Let
λ, the output saturation level of (5.30), be any number such that λ ≥ VM . Then there exists a number
k ? > 0 such that, for all k ≥ k ? , k(xf (t), xs (t))k and kζ(t) + T w(t)k asymptotically (and locally
exponentially) converge to zero for all xf (0) ∈ R3 , xs (0) ∈ R4 , ζ(0) ∈ Rn and all w(0) ∈ W.

Proof. Consider the change of variables

ζ → χ := ζ + T w − Gĩ . (5.32)

In the new coordinates (it suffices to consider here only the dynamics of ĩ instead of the whole
dynamics of (xf , xs ) )

ĩ˙ = a(xf , xs , t) + satλ (Φχ − ΦT w) + ΦT w (5.33)


χ̇ = F χ − Gρ(xf , xs , t)

where ρ(xf , xs , t) = a(xf , xs , t) + K ĩ. Now fix M ≥ 2λ, let k3? and γ be the lower bound for
the gain k and, respectively, the gain coefficient determined in proposition 3 and note that if
k ? ≥ k3? , since kvd k ≤ 2λ ≤ M , the restriction (equal to M ) of the flux/speed subsystem is
fulfilled for all t ≥ 0. As a consequence, since also kvd k ≤ M , we deduce that xf (t), xs (t) exist
for all t and
γ
k(xf , xs )ka ≤ M . (5.34)
k
Observe now (compare with (5.5) and (5.8)) that there exist constants `1 , `2 , independent of k,
such that
kρ(xf , xs , t)k ≤ `1 k(xf , xs )k + `2 k(xf , xs )k2 (5.35)
for all t ≥ 0. In fact, the term K ĩ in ρ(xf , xs , t) cancels out the terms in a(xf , xs , t) which depend
of k. This, the fact that the matrix F is Hurwitz and (5.34) imply (assuming without loss of
generality that γM/k ≤ 1) that χ(t) exists for all t and

kχka ≤ qkρ(xf , xs , ·)ka ≤ q(`1 + `2 k(xf , xs )ka )k(xf , xs )ka ≤ q(`1 + `2 )k(xf , xs )ka (5.36)

for some positive q. Consider now the function satλ (Φχ − ΦT w) + ΦT w. Since λ ≥ kΦT wk∞ it
turns out that there exists a continuous positive and bounded ϕ(t) such that

ksatλ (Φχ − ΦT w) + ΦT wk ≤ ϕ(t)satλ (Φχ) ≤ Lkχk (5.37)

where L is a positive constant. Hence by the small gain theorem 1 in [100] we conclude that if

k > γq(`1 + `2 )L

the overall system is globally asymptotically stable. In particular local exponential stability
follows from the fact the linear approximation is Hurwitz.
Asymptotic convergence of kζ(t) + T w(t)k to zero trivially follows from the fact that χ(t) and ĩ
converge to zero. /
The statement of the previous result highlights two key features of the fault tolerant unit
(5.30): the first is that for all possible faults belonging to the classes specified in subsection 5.3.1
(regardless the fault severity) the state (xf , xs ) of the flux/speed subsystem converges to zero,
namely the control objectives are achieved. This means that the control is fault tolerant. The
144 Implicit fault tolerant control systems

second result is that the state of the fault compensation unit ζ converges to −T w(t), namely
the state of the exogenous signal is asymptotically reconstructed. This means, as specified in
subsection 5.3.1, that the specific fault and its severity can be precisely isolated and evaluated.

We conclude this subsection with some remarks which shed further light to the result of the
previous proposition.

Remark 5.4 It is worth noting that the result of proposition 5.4 remains true also in case the control
action ū is not provided by the IFO controller specified in section 5.2.3 but is generated by a different
controller. As a matter of fact it is easy to realize, with an eye to the proof of the previous proposition,
that the key feature required for the controller which generates the input ū, is the property of rendering
the corresponding closed-loop system input-to-state stable with respect to the matched exogenous input
ufc + V , with sufficiently large restrictions (compatible with the fault effect to compensate) and with
sufficiently small linear gain (to enforce the small gain condition on which the stability proof relies).
Any controller yielding such property is suitable for implementing the structure sketched in fig. 5.1.

Remark 5.5 It is interesting to stress the key role of the saturation function introduced in the output
of the fault compensation unit. On one hand, its presence allows us to fulfill the restriction on the
input ufc + V which characterizes the system under the IFO controller and hence to render the use of
the small gain theorem of [100] possible, so as to obtain asymptotic stability. On the other hand, the
saturation plays a role in decoupling the system consisting of the IM and the IFO controller from the
dynamics of the fault compensation unit. In this regard, note that a key point in the proof of the previous
proposition is the fact that the state (xf , xs ) can be rendered asymptotically small (see (5.34)), which
in turn implies that the (quadratic) function a(xs , xf , t) in (5.35) can be asymptotically dominated by a
linear function (see (5.36)). This fact, which is crucial to enforce the small gain condition, holds precisely
because of the presence of a saturation function, which renders the bound (5.34) fulfilled independently
of the χ-dynamics.

5.3.4 Adaptive frequency estimation


In this subsection we address the case in which the vector $ of frequencies which characterize
the fault in the dynamics of the current are not perfectly known, but are supposed to a range
on fixed (though arbitrarily large) compact set F. The theory presented in this section takes
a major source of inspiration from the general theory about regulation with adaptive internal
model of [93], but proposes however some non trivial modifications which can be seen an
interesting development of the work [93].
Before entering into the details of the analysis, it is worth noting that the lack of knowledge
of the frequencies to be compensated is a real problem in the implementation of the above
controller. As a matter of fact, looking at the definitions of Ω1 in (5.21) and Ω2,±k in (5.22), it is
immediately acknowledged that the frequencies of the fault depend on the steady-state value
of the slip variable sω which, in turn, depends on the rotor parameter α and the latter, as also
stressed before, is in general a parameter characterized by an high level of uncertainties.
Since the control law (5.30) cannot be implemented, because Φ depends on the uncertain vector
$, we use instead a control law of the form

ζ̇ = (F + GΦ̂)(ζ − Gĩ) − GK ĩ + Gdznλ (Φ̂ζ − Φ̂Gĩ)


(5.38)
ufc = satλ (Φ̂ζ − Φ̂Gĩ)
5.3. The implicit fault tolerant controller 145

in which µ ¶ µ ¶
ζd Φ̂d 0
ζ= , Φ̂ =
ζq 0 Φ̂q
is an estimate updated according to the following dynamics

˙
Φ̂d = dzn` (Φ̂d ) − ρĩd ζdT
(5.39)
˙
Φ̂q = dzn` (Φ̂q ) − ρΨ? ĩq ζqT

in which ` and ρ are positive design parameter. As in the previous section, dealing with the
case of known frequencies, it is possible to prove that the indirect field oriented controller (5.4),
(5.7) with the adaptive fault compensation unit (5.38), (5.39) is able, for sufficiently large value
of λ, ` and k, to achieve the control objectives while offsetting the effect V of any fault.
The next proposition provides the desired result.

Proposition 5.5 Consider system (5.5), (5.8) with dynamic feedback control law given by (5.4), (5.7)
(5.38) and (5.39), with Kd = k K̄d , Kq = k K̄q , Kξ = k K̄ξ where K̄d , K̄q and K̄ξ are fixed positive
numbers. Suppose that the vector $ and the initial state w(0) range within fixed compact sets F and,
respectively, W . Let ` and ρ be arbitrary constants with ` (the amplitude of the dead-zone functions
which characterize the adaptation law (5.39)) such that

` ≥ ΦM := max{kΦd k, kΦq k}.


$∈F

Then there exist λ > 0 and k ? > 0 such that, for all k ≥ k ? , k(xf (t), xs (t))k and kζ(t) + T w(t)k
asymptotically (and locally exponentially) converge to zero for all xf (0) ∈ R3 , xs (0) ∈ R4 , ζ(0) ∈ Rn ,
Φ̂(0) ∈ Rn and all w(0) ∈ W, $ ∈ F.

Proof. For convenience the proof is divided in two parts. In the first part it is shown that suf-
ficiently large values of λ and k guarantee that the trajectories are bounded and the saturation
function enters the linear region in finite time. Then, in the second part, a Lyapunov argument
is used to show that the fault tolerance is achieved, namely the state (xf , xs ) asymptotically
decays to zero.
Consider again the change of variable (5.32). The input v = (vd , vq ) to (5.5) – (5.8), in the new
coordinates, read as
v = satλ (Φ̂χ − Φ̂T w) + ΦT w . (5.40)
Since kΦT wk ≤ VM (with VM defined in (5.31)) and |satλ (s)| ≤ λ for all s, assuming without loss
of generality that λ ≥ VM ≥ 1, it turns out that kvk∞ ≤ 4λ. This implies that (see proposition
5.3), if k is large enough, xf (t), xs (t) exist for all t and

4γλ 4γ
k(xf , xs )ka ≤ = 0 taking k = k0 λ . (5.41)
k k
Even though the control law has changed, from (5.30) to (5.38), the dynamics of χ still has the
form given by the second equation of (5.33). Bearing in mind the fact that F is Hurwitz, the
bound (5.35), and assuming without loss of generality k > 1, the estimate (5.41) also implies
that χ(t) exists for all t and there exists a δ > 0, not dependent on λ, such that

δ
kχka ≤ .
k0
146 Implicit fault tolerant control systems

Hence, by definition of χ,

δ + 4kGkγ
kζka ≤ kχ + T w + Gĩka ≤ kT wk∞ + kχ + Gĩka ≤ kT wk∞ + ≤m.
k0
In the previous expression m is a positive constant not dependent on k 0 (assuming k 0 ≥ 1) and
not dependent on λ.
We now focus on the Φ̂ dynamics (5.39). Since |sat` (s)| ≤ ` for all s ∈ R and

T 4γm
kĩζ ka ≤ k(xf , xs )ka kζka ≤
k0

it is easy to realize that Φ̂ is bounded and the following asymptotic estimate holds

4ργm
kΦ̂ka < ` + ≤ 2` (5.42)
k0
where the last inequality holds provided k 0 ≥ `/4ργm. This means that the argument of the
saturation function in (5.40) satisfies an asymptotic estimate of the form
µ ¶
δ
kΦ̂χ − Φ̂T wka < 2` + kT wk∞ ≤ n
k0

in which n is a positive number not dependent on λ and k 0 (as k 0 ≥ 1). In view of this, we
choose λ ≥ n. This implies that there exists a T ? > 0 such that for all t ≥ T ?

satλ (Φ̂χ(t) − Φ̂(t)T w(t)) + ΦT w(t) = Φ̂χ(t) − Φ̂(t)T w(t) + ΦT w(t)


= Φ̃(t)ζ(t) + Φχ(t) − Φ̃(t)Gĩ(t)

where
Φ̃ := Φ̂ − Φ . (5.43)
This completes the first part of the proof. Note that the previous discussion, in addition to
proving that the saturation function enters in finite time the linear region, has also shown that
the states of the flux and speed subsystems can be rendered arbitrarily small by increasing k
(see (5.41)) and that the estimate Φ̂ is bounded by a positive number (see (5.42)).
We proceed now to prove that (xf (t) and xs (t) asymptotically decay to zero. The overall system
consists of the flux subsystem (5.5), of the speed subsystem (5.8), with

v = Φ̃ζ + Φχ − Φ̃Gĩ ,

of the fault compensation unit, whose dynamics in the χ-coordinates is described by the second
equation in (5.33), and of the adaptation law (5.39) which in the new error coordinates (5.43)
reads as
˙ T
Φ̃ d = dzn` (Φ̃d + Φd ) − ρĩd ζd
(5.44)
˙
Φ̃ = dzn (Φ̃ + Φ ) − ρηζ T .
q ` q q q

We construct the Lyapunov function for the whole system at different stages. First, consider the
T
flux subsystem and the Lyapunov function Vf = 21 xf xf . Taking derivatives along (5.5), simple
computations show that for a large value of k

V̇f ≤ −n1 kxf k2 − (k K̄d − n2 )ĩ2d + ĩd Φ̃d ζd + n3 kĩd kkχk


5.3. The implicit fault tolerant controller 147

for some positive numbers n1 , n2 , n3 . Consider now the speed subsystem (5.8) and let, in
analogy to the analysis carried out in the proof of proposition 5.2, set xs = (x1s , x2s ), where
T
x1s := (T̃L , ω̃) and x2s := (ξ, η). For this system consider the Lyapunov function Vs = x1s P1 x1s +
T
1 2
2 xs x2s with P1 defined in (5.13). Bearing in mind the estimate (5.41) and noting that there
exists positive numbers a, b0 , b1 such that

kAsf (xf , t)k ≤ akxf k kBs (xf , t)k ≤ b0 + b1 kxf k ,

it is easy to obtain that, for a sufficiently large value of k, there exists a T1? such that for all
t ≥ T1?
V̇s ≤ −q1 kxs k2 − (k K̄q − q2 )η 2 + q3 kxs kkxf k + η Φ̃q ζq + q4 kηkkχk
for some positive numbers qi , i = 1, . . . , 4. Consider now the χ-dynamics and define Vχ =
T
χ Pf χ
T
Pf F + F Pf = −I .
Recalling (5.35) and (5.41), it is easy to conclude that for a sufficiently large k, there exists T2?
such that for all t ≥ T2?

V̇χ (t) ≤ −kχ(t)k2 + `1 kχ(t)kk(xf (t), xs (t))k

for some positive `1 .


T T
Finally define VΦ := Φ̃d Φ̃d /2ρ + Φ̃q Φ̃q /2ρ so that, differentiating along (5.44),
T T
V̇Φ = Φ̃q dznT T
` (Φ̃q + Φq )/ρ + Φ̃d dzn` (Φ̃d + Φd )/ρ − Φ̃d ĩd ζd − Φ̃q ηζq .

For all ` ≥ |a|, the graph of the function dzn` (s + a) is (second-quadrant)/(fourth quadrant),
and therefore
s dzn` (s + a) ≤ 0, for all s ∈ R.
Hence, since by hypothesis ` ≥ ΦM ,
T T
V̇Φ ≤ −Φ̃d ĩd ζd − Φ̃q ηζq .

Define now the Lyapunov function for the whole system as

W (xf , xs , χ, Φ̃) := Vf (xf ) + ²1 Vs (xs ) + ²2 Vχ (χ) + VΦ (Φ̃)

with
q1 n 1 n 1 + q1
²1 ≤ ²2 ≤ .
q22 `1
A simple application of the Young’s inequality shows that for sufficiently large k

Ẇ (t) ≤ −r1 k(xf (t), xs (t))k − r2 kχ(t)k for all t ≥ max{T1? , T2? },

for some positive r1 and r2 . This, by the LaSalle theorem, implies that the trajectories of the
closed loop system converge toward the largest invariant set characterized by xf = 0, xs = 0
and χ = 0. This concludes the second part of the proof. /
Few remarks to comment the previous proposition are now presented.
148 Implicit fault tolerant control systems

Remark 5.6 The adaptation law chosen in (5.39) differs from the that proposed in [93] for the presence of
the dead-zone functions dzn` (·) which is motivated by the fact that the output of the fault compensation
unit is saturated. As a matter of fact it is interesting to note, with an eye to the proof of the previous
proposition, that the role of dzn` (·) consists in keeping the estimate Φ̂ bounded with a bound which is not
dependent on k (see (5.42) and the analysis just before). This indeed is crucial to show that in finite time
the saturation function which characterizes the output of the fault compensation unit definitely enters
the linear region.

Remark 5.7 Note that the proposition is not conclusive about the asymptotic convergence of matrix Φ̂
to Φ. In this respect it can be easily proved (following an analysis similar to that in [93]) that Φ̂ converges
to a fixed matrix which is, in general, different from Φ unlike the case in which all the frequencies $ are
excited by the fault (namely if the initial condition w(0) is such that the signals T w(t) contains all the
frequencies in $). This in general is not true as it happens only in case of simultaneously stator and
rotor asymmetries.

5.3.5 Fault Detection and Isolation


According to the controller structure sketched in fig. 5.1, the fault detection and isolation can be
performed by checking the state of the fault compensation unit which automatically offsets the
fault effect. The detection phase, whose purpose is to identify the occurrence of some fault, can
be easily carried out by comparing kζ(t)k with a suitably tuned threshold. As a matter of fact
by proposition 5.4 (or equivalently proposition 5.5 in the adaptive scenario) ζ(t) asymptotically
converge on −T w which is zero in the un-faulty case and different from zero when a fault
occurs.
Also the isolation procedure, which consists in the identification of the occurrence of a specific
fault, is possible in this setup, as indicated in the following example. In particular consider the
problem of distinguishing two kinds of fault regarding stator or rotor asymmetries. As shown
in subsection 5.3.1, stator asymmetries are symptoms of short circuits or static eccentricity while
rotor asymmetries are due to broken bars or dynamic eccentricity. Moreover, again according
to the modeling presented in subsection 5.3.1, it is clear that faults due to stator asymmetries
are represented by the first two components of the exogenous state w(t) while faults due to
rotor asymmetries are represented by the last N − 2 components of w(t). In view of this, setting

ŵ(t) := T −1 ζ(t)

it is possible to define residual signals as:


( ° ° ( ° °
1 if ° ŵ(t)|1,2 ° ≥ Tfs 1 if ° ŵ(t)|3,N ° ≥ Tfr
° ° ° °
rfs (t) = rfr (t) =
0 otherwise 0 otherwise

where Tfs and Tfr are two positive thresholds.2 The two residual signals rfs and rfr correspond
to faults due to stator asymmetries and, respectively, faults due to dynamics asymmetries.
It is important to note that the previous isolation strategy can not be implemented as such
in case the vector $ of exogenous frequencies is not perfectly known. As a matter of fact
in such a case the matrix T solution of (5.27), which depends on $, is not known and the
exogenous variable estimate ŵ can not be computed. In this scenario a more sophisticated
isolation algorithm should be identified by using signal processing algorithms.
2
With the notation s|1,j we denote the components of the vector s from the i-th to the j-th.
5.4. Experimental and simulation results 149

5.4 Experimental and simulation results


The effectiveness of the adaptive implicit fault tolerant control algorithm described in sections
5.3.3 and 5.3.4 in presence of mechanical faults has been tested in the experimental activity
carried out at Laboratory of Automation and Robotics (LAR) of University of Bologna.
Specifically the experimental setup consists of an induction motor, on which the mechanical
fault is implemented as described later, connected to a brush-less motor, acting as a load torque
generator, by means of an adaptive joint able to compensate for angular, axial and radial off-
sets. The induction motor is a commercial asynchronous three phases 1.1KW motor with 50Hz
and 380 − 410V power supply whose electrical and mechanical parameters are shown in tab.
5.1. The experimental setup is then completed with a power inverter with 540 V DC-link volt-
age and a control board based on a DSP TMS320C32 designed and developed within LAR. The
sampling time for controller implementation is set to 400 µsec. The DSP performs data acqui-
sition, generates speed and flux references, implements the control algorithm and generates
the PWM inverter commands. The control board is connected to a standard PC used for DSP
programming and for data acquisition and displaying.
The stator current and the motor angular speed, processed by the adaptive algorithm are ac-
quired by means of commercial Hall-type sensors that output a 0-10V signal which is propor-
tional to the instantaneous value of a AC current signal (0-50A) and by a two poles commercial
resolver (6V, 10KHZ, with a transformation rate 0.28 ± 10%) with an encoder simulation out-
put (1024 ppr). In appendix B the interested reader can find a more technical description of the
set-up used to perform experimental activity.

Description Parameter Value Units


Stator inductance Ls 0.663 H
Rotor inductance Lr 0.663 H
Mutual inductance Lm 0.627 H
Stator resistance Rs 6.25 Ω
Rotor resistance Rr 6.26 Ω
Rotor inertia J 0.0024 Kg m2
Number of pole pairs np 2 −

Table 5.1: Parameters of the induction motor adopted for the experimental activity.

The induction motor has been damaged by introducing a mechanical fault. Specifically five
of the 28 rotor bars have been holed in order to simulate a broken bar rotor failure, see fig. 5.2.
The diameter of each hole is 4 mm.
In all the experimental results which will be presented in the following, we have fixed a
constant speed reference ω ? = 100 rad/sec and a constant load torque TL = 6 Nm applied by
means of the brush-less motor. Furthermore the IFO controller described in section 5.2.3 has
been tuned following the constructive procedure illustrated in [84] and the control parameters
thus obtained are shown in tab. 5.2.

Kω Kd Kq Kξ Kη KT
120 300 300 500 90 7200

Table 5.2: IFO controller gains adopted for experimental activities.

As far as the Fault Compensation Unit described in sections 5.3.3 and 5.3.4 is concerned, we
150 Implicit fault tolerant control systems

Figure 5.2: The rotor of the induction motor with five bars broken by means of two holes each
bar.

have fixed λ, the amplitude of the saturation function, and `, the amplitude of the deadzone
function, respectively to λ = 2000 and ` = 2500. These large values have been chosen in order
to test the fault tolerance even in case of very severe failures. Furthermore the tuning of the
fault compensation unit (5.38)-(5.39) has been completed taking ρ = 5 and the controllable pair
(F, G) as
−10
   
0 0 0 0 0 1 0
 0
 −20 0 0 0 0  
 0 2 
 
 0 0 −30 0 0 0    3 0 
F = G =  0 4  .
 
 0
 0 0 −40 0 0    
 0 0 0 0 −50 0   5 0 
0 0 0 0 0 −60 0 6

The figures fig. 5.3, fig. 5.4 and fig. 5.5 report respectively the steady state current tracking
error ĩd (t), the steady state current tracking error ĩq (t) and the steady state speed tracking
error ω̃(t) in case just the IFO controller is present in the loop (upper plots) and in case the
IFO controller is enriched with the fault compensation unit (lower plots). It is worth to note
that the presence of the failure due to the broken bars, if not compensated by means of the
fault tolerance unit, generates a quite large steady state tracking error for the current id (in
particular ĩd (t) shows a mean value equal to 0.1 and oscillations of amplitude 0.4) which in
turn yields large deviation of the speed tracking error. In this respect the effectiveness of the
fault compensation unit in compensating the effect of the faults and recovering the control
objectives is quite evident in these figures.
In order to stress how the information extracted by the state behavior of the fault compen-
sation unit can be useful in order to perform detection of the occurred (and compensated) fault
(see section 5.3.5), we have reported in fig. 5.6 the behavior of the outputs ufc fc
d and uq when the
implicit fault compensation algorithm is run in presence of a un-faulty induction motor (up-
per plot) and of the induction motor with broken bars (lower plots). In this respect it is quite
evident that the behavior of these additional control inputs yields a robust information about
the presence of the fault which can be used for fault detection and isolation. In particular, as
stressed in section 5.3.5, the comparison of ufc fc
d and uq with a fixed (or suitably adapted) thresh-
old can be used in order to detect if the fault compensation unit is working to compensate for
oscillations and hence if a fault is occurred.

To conclude this section we present few simulation results aiming to test the performances
of the algorithm also in presence of electrical faults which, at the state of the art, are not yet
5.4. Experimental and simulation results 151

0.6

0.4

0.2

ĩd (t) 0

-0.2
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0.6

0.4

0.2

ĩd (t) 0

-0.2
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

sec

Figure 5.3: Experimental results. Current tracking error ĩd (t) in case of rotor failure with the
IM controlled via the standard IFO controller (upper plot) and using the implicit FT controller
(lower plot).

0.4

0.2

ĩq (t) 0

-0.2

-0.4
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0.4

0.2

ĩq (t) 0

-0.2

-0.4
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

sec

Figure 5.4: Experimental results. Current tracking error ĩq (t) in case of rotor failure with the
IM controlled via the standard IFO controller (upper plot) and using the implicit FT controller
(lower plot).
Figure 5.5: Experimental results. Speed tracking error ω̃(t) in case of rotor failure with the IM controlled via the standard IFO controller (upper plot) and using the implicit FT controller (lower plot).

implemented in the experimental setup described above. The same electrical and mechanical values presented in table 5.1 have been adopted as nominal parameters, and the simulation results have been obtained assuming parametric uncertainties up to 20%. Moreover the same tuning of the IFO controller and of the Fault Compensation Unit used for the experimental activity has been adopted for the simulations.
According to the theory presented in section 5.3.1, to simulate a stator and rotor failure the stator currents have been corrupted as in (5.24), taking N = 1 and assuming φ = φ1 = φ−1 = 0 and Ω, Ω1 and Ω−1 computed using (5.21) and (5.22). The amplitudes A, A1 and A−1 have been set to zero or to a positive number depending on the kind of simulated fault (stator or rotor). Finally the stator currents and angular speed, which are processed by the control algorithm, have been corrupted with Gaussian white noise with zero mean and standard deviations of 0.2 A and 0.3 rad/s respectively.
In all the experiments presented in the following, the induction motor is simulated in the un-faulty condition up to t = 2 s and in different faulty scenarios from t = 2 s onwards. Moreover the fault compensation unit is initially not activated and, with the only exception of the first experiment, it is inserted in the control loop at time t = 1.2 s.
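Purely as a sketch of how such a simulation scenario can be set up (equations (5.21), (5.22) and (5.24) are not reproduced here, and all names and numerical values below are placeholders), the corruption of the measured signals could be implemented as follows.

```python
import numpy as np

def corrupt_measurements(i_s, omega, t, A=0.0, A1=0.0, Am1=0.0,
                         Omega=2 * np.pi * 50, Omega1=2 * np.pi * 45,
                         Omega_m1=2 * np.pi * 55, rng=None):
    """Superimpose a stator-fault harmonic (amplitude A) and rotor-fault
    harmonics (amplitudes A1, Am1) on the measured stator current, and add
    Gaussian measurement noise to current and speed."""
    if rng is None:
        rng = np.random.default_rng(0)
    ripple = (A * np.sin(Omega * t) + A1 * np.sin(Omega1 * t)
              + Am1 * np.sin(Omega_m1 * t))
    i_meas = i_s + ripple + rng.normal(0.0, 0.2, np.shape(t))      # [A]
    omega_meas = omega + rng.normal(0.0, 0.3, np.shape(t))         # [rad/s]
    return i_meas, omega_meas
```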
Fig. 5.7 reports the behavior of the flux and speed tracking errors when a stator fault characterized by A = 0.1 is simulated and the fault compensation unit is not present in the control loop. It is quite evident that the presence of the stator fault, for instance due to an electrical short circuit, does not allow the achievement of the control objectives. Figs. 5.8 and 5.9 show the same tracking errors in the case of a rotor fault (with A1 = A−1 = 0.1) and of a stator fault (with A = 0.1) respectively, with the fault compensation unit present in the loop. In these figures one can recognize the transient at t = 2 s due to the occurred fault. Note how in this case fault tolerance is achieved by the fault compensation unit. Finally fig. 5.10
Figure 5.6: Experimental results. Left plots: the output σu^fc_d(t) of the fault compensation unit with an un-faulty IM (upper plot) and with the faulty IM (lower plot). Right plots: the output σu^fc_q(t) of the fault compensation unit with an un-faulty IM (upper plot) and with the faulty IM (lower plot).
Figure 5.7: Simulation results. From upper to lower: speed tracking error ω̃(t), the d flux
tracking error Ψ̃d (t) and the q flux Ψq (t) using a standard IFO controller when a rotor fault
occurs at time t = 2 s.

Figure 5.8: Simulation results. From upper to lower: speed tracking error ω̃(t), the d flux
tracking error Ψ̃d (t) and the q flux Ψq (t) using the augmented fault tolerant IFO controller
when a rotor fault occurs at time t = 2 s.

Figure 5.9: Simulation results. From upper to lower: speed tracking error ω̃(t), the d flux
tracking error Ψ̃d (t) and the q flux Ψq (t) using the augmented fault tolerant IFO controller
when a stator fault occurs at time t = 2 s.

and fig. 5.11 report the behaviors of the first and third components of the vector ŵ defined in section 5.3.5, which represents the estimate of the internal state of the exosystem (5.25), in the presence of rotor and stator faults respectively. The same figures also show, overlapped, the behaviors of the signals r_fs(t) and r_fr(t) defined in section 5.3.5, representing the residual signals sensitive to stator and rotor faults respectively (the thresholds T_fs and T_fr have been fixed at the value T_fs = T_fr = 5). It is interesting to note that, after an initial transient, only the component of ŵ(t) which belongs to the part of the internal model devoted to compensating the specific class of fault (rotor or stator) presents a steady state value different from zero, and hence the residual signals r_fs(t) and r_fr(t) allow for perfect isolation of the occurred fault.
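A minimal sketch of this isolation logic (assuming, purely for illustration, that the first pair of internal-model states is associated with the stator harmonic and the second pair with the rotor harmonics, and using placeholder thresholds) is the following.

```python
import numpy as np

def isolate_fault(w_hat, T_fs=5.0, T_fr=5.0):
    """Residual-based isolation from the internal-model state estimate."""
    r_fs = np.linalg.norm(w_hat[0:2])   # residual sensitive to stator faults
    r_fr = np.linalg.norm(w_hat[2:4])   # residual sensitive to rotor faults
    if r_fs > T_fs and r_fr <= T_fr:
        return "stator fault"
    if r_fr > T_fr and r_fs <= T_fs:
        return "rotor fault"
    return "no fault" if (r_fs <= T_fs and r_fr <= T_fr) else "undetermined"
```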

5.5 Conclusions
In this chapter we have presented a new idea for fault tolerant control system design, illustrated through the design of a fault tolerant control unit for Induction Motors. We have shown how an Indirect Field Oriented controller, processing the currents and the angular velocity of the IM in order to enforce desired flux and speed profiles, can be “enriched” with an internal model of the fault in order to achieve fault tolerance and also fault detection and isolation. The design of the internal model unit can be considered independent from that of the stabilizing IFO unit, as only the current gains may possibly need to be re-tuned. We have shown how the internal model unit can be designed in order to have “global” tracking of the desired references and “semi-global” tolerance to possible faults. Experimental and simulation results have been presented in order to show the effectiveness of the approach.
Figure 5.10: Simulation results. Rotor fault affecting the IM at t = 2 s. Upper plot: stator failure signal ŵ1(t) (dash-dot line) and signal r_fs(t) (continuous line). Lower plot: rotor failure signal ŵ3(t) (dash-dot line) and signal r_fr(t) (continuous line).
Figure 5.11: Simulation results. Stator fault affecting the IM at t = 2 s. Upper plot: stator failure signal ŵ1(t) (dash-dot line) and signal r_fs(t) (continuous line). Lower plot: rotor failure signal ŵ3(t) (dash-dot line) and signal r_fr(t) (continuous line).
Conclusions and future works

The main aim of this thesis, the conclusive work of a three-year research period focused on the areas of Fault Diagnosis and Fault Tolerant Control, is to present a collection of results which should lead towards a unified framework for Fault Tolerant Control of Distributed Control Systems. The first key point was: why distributed systems? As stated in this work, from a functional point of view, there is practically no difference whether a task is implemented using a centralized or a decentralized architecture.
The most important properties of distributed systems are composability (which means that system properties follow from subsystem properties), scalability (which means that there are no limits to the extensibility of a system and that, at the same time, the complexity of reasoning about the proper operation of any system function does not depend on the system size) and dependability. This last property means that when designing a distributed system it is possible to implement well-defined error-containment regions, achieving in this way fault tolerance. This idea, and the importance that large distributed systems have nowadays in applications (see e.g. the automotive field), make distributed systems perfect candidates for investigation in research regarding fault tolerance.
This work starts with a survey chapter which illustrates concepts, definitions and classical results about Fault Detection and Isolation and Fault-Tolerant Control and introduces basic concepts of distributed computer system architectures. Starting from these concepts, a novel architecture for Fault Tolerant Distributed Control Systems has been introduced. This architecture has been developed in order to accomplish the task of fault tolerance in a modular and hierarchical way. This means that we want to achieve fault tolerance starting from each single function of the system, composing the fault tolerant modules just as the single functions compose themselves to generate the global function of the complex system. For this reason functionality analysis tools and failure analysis tools have been shown to be the right tools to design the exoskeleton of this architecture.
Within this architecture three levels of operation can be identified: a low level (basic functionalities level) at which physical faults are detected and reconfigured; an intermediate level that supervises this substratum using cross information between different functionalities; and a high level which is dedicated to the allocation of different functionalities on common hardware in an optimal way and that acts also as a high level interface (communication cluster) between human operator and plant.
After this discussion we have presented results that can be mapped onto the three levels of operation introduced above. More specifically, we have presented an algorithm to predict the reliability of the diagnostic level. The procedure presented is used to evaluate the reliability of a complex distributed diagnosis system and gives as output an index which represents the probability that, if a fault happens, the diagnosis system is able to detect it without generating false alarms.
We have then presented a “high-level” diagnosis and reconfiguration engine in the framework of discrete event systems, i.e. an automaton which is able to diagnose the occurrence of a fault event before the system executes some dangerous operation. This property amounts to a kind of “robustness on demand”, in the sense that if the system is safe diagnosable we can always detect the fault before the system becomes unreconfigurable and hence “force” reconfiguration events that will lead the system to compensate (to the extent possible) for the effect of the detected fault.
Concerning the “low level”, an innovative fault-tolerant control scheme has been developed. The idea is that of achieving implicit fault tolerance of basic functionalities, without an explicit estimation of faults and explicit reconfiguration actions. This idea has been applied to control in a reliable way an induction motor in case of mechanical faults and a robot manipulator in case of actuator faults (see Appendix C).
Future developments of this work include the study of complex modeling tools that are able to highlight both the distributed nature of complex systems and the stochastic nature of faults. We have to deal with horizontal modularity and vertical modularity. Discrete event system theory can help in describing such systems. For example, horizontal modularity can be achieved by modeling subsystems as automata and interconnecting them with some communication protocols. An interesting development will be to extend the results regarding diagnosability and safe diagnosability to such systems. Considering vertical modularity, preliminary investigations have shown that the statecharts modeling formalism of Harel (see [48]) can be applied to study the diagnosability property in a more computationally efficient manner.
Another future development will be to enrich the deterministic information provided by automata with statistical information. Doing so, it is possible to introduce the notions of statistical diagnosability (some recent efforts in this sense are presented in [101]) and statistical safe diagnosability. In this last case forbidden strings would be tolerated after a fault if their probability is below a certain threshold.
Another very interesting research problem is to investigate methods to fuse information from the high level with information from the low level in order to solve the FTC problem. In this sense we are interested in exploring the problem of diagnosis of hybrid automata (see [63]).
Finally, the topic of reconfiguration of systems at a high level of abstraction is also of great importance. In this direction it is our intention to address problems such as system redundancy, probing for possible faults and controller reconfiguration using the automata formalism.
All this effort will go in the direction of finding a unified tool to study the phenomena linked with the distributed nature of the system and the stochastic nature of faults, and of using analysis tools in this new setting to refine the proposed framework.
Appendixes

Appendix A
Introduction to discrete event systems
theory

In this appendix some notions about discrete event dynamical systems (DEDS) are reported, including basic definitions about discrete event systems (DES), the theory of automata and languages and some important results about feedback supervision. For a more complete and formal treatment of these topics the interested reader is referred to [26], [108] and references therein.

A.1 Discrete event systems


A Discrete event system (DES) is a dynamical system in which the state space is naturally de-
scribed by a discrete set, and the state transitions are only observed at discrete points in time.
We associate the state transitions with “events”; an event can be identified with a specific action
taken. This action can be spontaneous (dictated by nature) or it may be the result of several con-
ditions which are suddenly met. The symbol e denotes an event; considering a system affected
by different types of events, we will assume we can define an event set E, whose elements are
all these events.
Discrete event systems satisfy the following two properties:

1. The state space is a discrete set.

2. The state transition mechanism is event driven.

Definition A.1 A discrete event system is a discrete-state, event-driven system, i.e. its state depends
entirely on the occurrence of asynchronous discrete events over time.

With this in mind, the behavior of a DES can be described in terms of event sequences of the
form e1 , e2 , · · · , en . A more formal way to study the logical behavior of DES is based on the
theories of languages and automata.


The starting point is the fact that any DES has an underlying event set E associated with it. The set E is thought of as the alphabet of a language and event sequences are thought of as strings (words) in that language. A string consisting of no events is called the empty string and is denoted by ε. The length of a string s, denoted by |s|, is the number of events contained in it, counting multiple occurrences of the same event.

Definition A.2 A language defined over a set E is a set of finite-length strings from events in E.

The key operation involved in building strings, and thus languages, from a set of events E is concatenation: the string abb is the concatenation of the string ab and the event b. The empty string ε is the identity element of concatenation. Let E* denote the set of all finite strings of elements of E, including the empty string ε; the (·)* operation is called Kleene-closure. A language over an event set E is a subset of E*.
If tuv = s, the following nomenclature can be defined:

• t is called a prefix of s,

• u is called a substring of s,

• v is called a suffix of s.

Observe that both ε and s are prefixes, substrings and suffixes of s.

A.2 Operations on Languages


The usual set operations, such as union, intersection, difference and complement with respect
to E ? are applicable to languages, since languages are sets. In addition it is possible to introduce
the following operations:

• Concatenation: Let La, Lb ⊆ E*, then

La Lb := {s ∈ E* : (s = sa sb) and (sa ∈ La) and (sb ∈ Lb)} .

A string is in La Lb if it can be written as the concatenation of a string in La with a string in Lb.

• Prefix-closure: Let L ⊆ E*, then

$\overline{L}$ := {s ∈ E* : ∃t ∈ E* (st ∈ L)} .

The prefix-closure of L is the language $\overline{L}$ consisting of all the prefixes of all the strings in L. In general L ⊆ $\overline{L}$.

• Kleene-closure: Let L ⊆ E*, then:

L* := {ε} ∪ L ∪ LL ∪ LLL ∪ · · · .

An element of L* is formed by the concatenation of a finite number of elements of L.
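The set operations above are straightforward to compute for finite languages; the following minimal sketch (strings are represented as Python strings over single-character events) is only meant to make the definitions concrete.

```python
def concatenate(La, Lb):
    """Concatenation of two finite languages given as sets of strings."""
    return {sa + sb for sa in La for sb in Lb}

def prefix_closure(L):
    """All prefixes (including the empty string '') of all strings in L."""
    return {s[:i] for s in L for i in range(len(s) + 1)}

# small usage example over the alphabet E = {'a', 'b'}
La, Lb = {"a", "ab"}, {"b"}
print(concatenate(La, Lb))   # {'ab', 'abb'}
print(prefix_closure(La))    # {'', 'a', 'ab'}
```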



A.3 Representation of languages: automata


An automaton is a device that is capable of representing a language according to well-defined
rules.

Definition A.3 A deterministic automaton, denoted by G, is a six-tuple

G = (X, E, f, Γ, x0 , Xm )

where

• X is the set of states

• E is the finite set of events associated with transitions in G

• f : X × E → X is the transition function: f (x, e) = y means that there is a transition labelled


by event e from state x to state y; in general f is a partial function on its domain

• Γ : X → 2^E is the active event function: Γ(x) is the set of all the events e for which f(x, e) is defined

• x0 is the initial state

• Xm ⊆ X is the set of marked states.

The automaton is said to be deterministic because f is a function over X × E. In contrast


the transition function of a nondeterministic automaton is defined by means of a relation over
X × E × X 1.
An automaton generates a language defined in the following way.

Definition A.4 The language generated by G = (X, E, f, Γ, x0, Xm) is

L(G) := {s ∈ E* : f(x0, s) is defined} . (A.1)

The language marked by G = (X, E, f, Γ, x0, Xm) is

Lm(G) := {s ∈ E* : f(x0, s) ∈ Xm} . (A.2)

In other words, a string s is in L(G) if and only if it corresponds to an admissible path in the state transition diagram. Note that in the above definitions we work with an extension of the transition function defined over X × E* as:

f(x, ε) := x
f(x, se) := f(f(x, s), e) for s ∈ E* and e ∈ E .
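A minimal computational rendering of these definitions (a deterministic automaton with a partial transition function, together with membership tests for L(G) and Lm(G)) could look as follows; the encoding chosen here, a dictionary over (state, event) pairs, is just one convenient possibility.

```python
class Automaton:
    """Minimal deterministic automaton G = (X, E, f, x0, Xm); f is a partial
    function stored as a dictionary {(state, event): next_state}."""
    def __init__(self, X, E, f, x0, Xm):
        self.X, self.E, self.f, self.x0, self.Xm = X, E, f, x0, Xm

    def run(self, s):
        """Extended transition function f(x0, s); returns None if undefined."""
        x = self.x0
        for e in s:
            x = self.f.get((x, e))
            if x is None:
                return None
        return x

    def generates(self, s):      # s in L(G)
        return self.run(s) is not None

    def marks(self, s):          # s in Lm(G)
        return self.run(s) in self.Xm

# two-state example: 'a' moves 0 -> 1, 'b' loops on 1, state 1 is marked
G = Automaton({0, 1}, {"a", "b"}, {(0, "a"): 1, (1, "b"): 1}, 0, {1})
assert G.generates(["a", "b"]) and G.marks(["a"]) and not G.generates(["b"])
```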

Two automata are said to be equivalent if they generate and mark the same languages, i.e.

Definition A.5 Automata G1 and G2 are said to be equivalent if

L(G1 ) = L(G2 ) and Lm (G1 ) = Lm (G2 ) .


1
Or equivalently a function from X × E to 2X .

In general the following property holds:

Lm(G) ⊆ $\overline{L_m(G)}$ ⊆ L(G) .

It can happen that an automaton G reaches a state x where Γ(x) = ∅ but x ∉ Xm. This is called a deadlock, because no further event can be executed. If deadlock happens, then necessarily $\overline{L_m(G)}$ will be a proper subset of L(G), since any string in L(G) that ends at state x cannot be a prefix of a string in Lm(G).
Consider now the case where there is a set of unmarked states in G that forms a strongly connected component, but with no transitions going out of the set. If the system enters this set, then we get a so-called livelock. If livelock is possible, then again $\overline{L_m(G)}$ will be a proper subset of L(G).

Definition A.6 An automaton G is said to be blocking if

$\overline{L_m(G)}$ ⊂ L(G)

and nonblocking when

$\overline{L_m(G)}$ = L(G) .

In other words if an automaton is blocking either deadlock and/or livelock can happen.
Suppose now that an event e at state x may cause transitions to more than one state. In this case f(x, e) is represented by a set of states. In addition, we may want to allow the label ε in the state transition diagram, i.e. we allow transitions between distinct states to have the empty string as label². These two changes lead to the definition of a nondeterministic automaton.

Definition A.7 A nondeterministic automaton, denoted by Gnd, is a six-tuple

Gnd = (X, E ∪ {ε}, fnd, Γ, x0, Xm)

where

• fnd is a function fnd : X × (E ∪ {ε}) → 2^X, that is, fnd(x, e) ⊆ X whenever it is defined

• the initial state may itself be a set of states: x0 ⊆ X.

A.3.1 Operations on automata


• Accessible part: From the definition of L(G) and Lm(G), we can delete from G all states that are not accessible (or reachable) from x0 by some string in L(G) without affecting the languages generated and marked by G. When we delete a state we also delete all the transitions that are attached to that state. We denote this operation by Ac(G) (taking the accessible part):

Ac(G) := (Xac, E, fac, x0, Xac,m)
Xac := {x ∈ X : ∃s ∈ E* (f(x0, s) = x)}
Xac,m := Xm ∩ Xac
fac := f |_{Xac × E → Xac} .
²These transitions may represent events that cause a change in the internal state but are not observable by an outside observer.

• Coaccessible part: A state x of G is said to be coaccessible to Xm if there is a string in Lm(G) that goes through x, i.e. there is a path in the state transition diagram of G from state x to a marked state. We denote the operation of deleting all the states of G that are not coaccessible by CoAc(G):

CoAc(G) := (Xcoac, E, fcoac, x0,coac, Xm)
Xcoac := {x ∈ X : ∃s ∈ E* (f(x, s) ∈ Xm)}
x0,coac := x0 if x0 ∈ Xcoac, undefined otherwise
fcoac := f |_{Xcoac × E → Xcoac} .

The CoAc operation may shrink L(G), but does not affect Lm(G).

• Trim operation: An automaton that is both accessible and coaccessible is said to be trim. We define the Trim operation as:

Trim(G) = CoAc[Ac(G)] = Ac[CoAc(G)] .

• Complement: Consider a trim automaton G = (X, E, f, Γ, x0, Xm) that marks the language L ⊆ E*; we can define a complement automaton Gcomp that will mark the language E* \ L.

• Product: Consider the two automata

G1 = (X1, E1, f1, Γ1, x01, Xm1) and G2 = (X2, E2, f2, Γ2, x02, Xm2).

The product of G1 and G2 is the automaton

G1 × G2 = Ac(X1 × X2, E1 ∩ E2, f, Γ1×2, (x01, x02), Xm1 × Xm2)

where

f((x1, x2), e) = (f1(x1, e), f2(x2, e)) if e ∈ Γ1(x1) ∩ Γ2(x2), and is undefined otherwise,

and thus Γ1×2(x1, x2) = Γ1(x1) ∩ Γ2(x2). This means that in the product the transitions of the two automata must always be synchronized on common events (in E1 ∩ E2). It is easy to verify that:

L(G1 × G2) = L(G1) ∩ L(G2)    Lm(G1 × G2) = Lm(G1) ∩ Lm(G2)

• Parallel composition: Consider the two automata

G1 = (X1, E1, f1, Γ1, x01, Xm1) and G2 = (X2, E2, f2, Γ2, x02, Xm2).

The parallel composition of G1 and G2 is the automaton

G1 ∥ G2 = Ac(X1 × X2, E1 ∪ E2, f, Γ1∥2, (x01, x02), Xm1 × Xm2)

where

f((x1, x2), e) =
  (f1(x1, e), f2(x2, e))  if e ∈ Γ1(x1) ∩ Γ2(x2)
  (f1(x1, e), x2)         if e ∈ Γ1(x1) \ E2
  (x1, f2(x2, e))         if e ∈ Γ2(x2) \ E1
  undefined               otherwise .


In the parallel composition a common event can only be executed if the two automata both execute it simultaneously: the two automata are synchronized on common events. To characterize the language generated, we define the projection

Pi : (E1 ∪ E2)* → Ei*   for i = 1, 2

as follows:

Pi(ε) = ε
Pi(e) = e if e ∈ Ei,  Pi(e) = ε if e ∉ Ei
Pi(se) = Pi(s)Pi(e)   for s ∈ (E1 ∪ E2)*, e ∈ (E1 ∪ E2).

In other words, given two event sets where one is a subset of the other, this kind of projection (called natural projection) erases from a string formed over the larger set the events that do not belong to the smaller one. We can also introduce the corresponding inverse maps (inverse projections)

Pi⁻¹ : Ei* → 2^{(E1 ∪ E2)*}

defined as:

Pi⁻¹(t) = {s ∈ (E1 ∪ E2)* : Pi(s) = t} .

In other words, given a string in the smaller event set, the inverse projection returns the set of all strings in the larger event set that project to the given string. The projections and their inverses are extended to languages simply by applying them to all the strings in the language. Note that

Pi[Pi⁻¹(L)] = L

but in general

L ⊆ Pi⁻¹[Pi(L)] .
Returning to the parallel composition of automata, it is now easy to prove that

L(G1 ∥ G2) = P1⁻¹[L(G1)] ∩ P2⁻¹[L(G2)]    Lm(G1 ∥ G2) = P1⁻¹[Lm(G1)] ∩ P2⁻¹[Lm(G2)] .
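As a small illustration of the composition rules above, the following sketch computes the parallel composition of two automata encoded as in the previous sketch (states, events, a partial transition function stored in a dictionary, initial state and marked states); the product automaton would be obtained by restricting the event set to E1 ∩ E2 and keeping only the synchronized moves.

```python
from itertools import product as cartesian

def parallel(G1, G2):
    """Parallel composition G1 || G2; G1 and G2 are tuples
    (X, E, f, x0, Xm) with f a dict {(state, event): next_state}."""
    (X1, E1, f1, x01, Xm1), (X2, E2, f2, x02, Xm2) = G1, G2
    E, f = E1 | E2, {}
    for x1, x2 in cartesian(X1, X2):
        for e in E:
            y1, y2 = f1.get((x1, e)), f2.get((x2, e))
            if e in E1 and e in E2:                      # common event
                if y1 is not None and y2 is not None:
                    f[((x1, x2), e)] = (y1, y2)          # synchronized move
            elif e in E1 and y1 is not None:
                f[((x1, x2), e)] = (y1, x2)              # private move of G1
            elif e in E2 and y2 is not None:
                f[((x1, x2), e)] = (x1, y2)              # private move of G2
    Xm = {(x1, x2) for x1 in Xm1 for x2 in Xm2}
    return set(cartesian(X1, X2)), E, f, (x01, x02), Xm
```

Taking the accessible part of the result (as in the definition above) completes the construction.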

A.3.2 Observer automata


It is always possible to transform a nondeterministic automaton Gnd into an equivalent deter-
ministic one. We will call the resulting equivalent deterministic automaton the observer Gobs
corresponding to the nondeterministic automaton. The procedure to build the automaton can
be found in [26]. It is important just to recall the properties of the observer:

1. Gobs is a deterministic automaton.

2. L(Gobs ) = L(Gnd )

3. Lm (Gobs ) = Lm (Gnd )

We previously motivated the use of ε-transitions in a nondeterministic automaton as events that occur in the system modelled by the automaton but cannot be observed from outside. Those events are considered as unobservable events; in other words, the event set is partitioned into two disjoint parts:

E = Eo ∪ Euo

where Eo is the set of observable events and Euo is the set of unobservable events. Treating unobservable events as ε-transitions and building the observer corresponding to the resulting nondeterministic automaton, it is easy to prove that the observer satisfies the following properties:

• L(Gobs ) = P [L(G)]

• Lm (Gobs ) = P [Lm (G)]

• The state of Gobs that is reached after a string t ∈ P [L(G)] will contain all the states of G
that can be reached after any strings in

P −1 (t) ∩ L(G) .

where P denotes the natural projection from E* to Eo*, defined as follows:

P(ε) = ε
P(e) = e if e ∈ Eo,  P(e) = ε if e ∉ Eo
P(se) = P(s)P(e)   for s ∈ E*, e ∈ E.

In other words, the state of Gobs is the union of all the states of G that are consistent with the
observable events that have occurred so far. In this sense the state of Gobs is an estimate of the
current state of G.
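The subset construction behind the observer can be sketched as follows (same automaton encoding as in the previous sketches); the function returns the observer states as frozensets of states of G, which is exactly the state-estimate interpretation described above.

```python
def observer(X, E, f, x0, Eo):
    """Subset construction for the observer of G under the observable event
    set Eo; unobservable events are treated as silent moves."""
    Euo = E - Eo

    def ur(states):                       # unobservable reach of a state set
        stack, closure = list(states), set(states)
        while stack:
            x = stack.pop()
            for e in Euo:
                y = f.get((x, e))
                if y is not None and y not in closure:
                    closure.add(y); stack.append(y)
        return frozenset(closure)

    start = ur({x0})
    states, frontier, f_obs = {start}, [start], {}
    while frontier:
        S = frontier.pop()
        for e in Eo:
            nxt = ur({f[(x, e)] for x in S if (x, e) in f})
            if nxt:
                f_obs[(S, e)] = nxt
                if nxt not in states:
                    states.add(nxt); frontier.append(nxt)
    return states, Eo, f_obs, start
```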

A.4 Regular languages


Any language can be marked by an automaton: simply build the automaton as a possibly infinite tree whose root is the initial state and where the nodes at layer n of the tree are entered by the strings of length n, or by the prefixes of length n of the longer strings. The state space is the set of nodes of the tree and a state is marked if and only if the string that reaches it from the root is an element of the language. Such a tree automaton will have an infinite state space if the cardinality of the language is infinite. Of course there exist infinite languages that can be represented by finite-state automata³, but there exist also infinite languages that cannot be represented by finite-state automata. A classical example is the language L = {aⁿbⁿ : n ≥ 0}; see [26] for more details.

Definition A.8 A language is said to be regular if it can be marked by a finite-state automaton. We


will denote the class of regular languages by R.

It is easy to prove the following theorems:

Theorem A.1 The class of languages representable by nondeterministic finite-state automata is exactly
the same as the class of languages representable by deterministic finite-state automata: R.

Theorem A.2 If L1 and L2 are in R, then the following languages are also in R:

1. $\overline{L_1}$

2. L1*
³The simplest case is the language L = E*, which can be represented by a single-state automaton.

3. Lc := E* \ L1
4. L1 ∪ L2
5. L1 ∩ L2
6. L1 L2 .

Theorem A.3 A language is regular if and only if it can be represented by a regular expression, i.e. by means of the operations of Kleene-closure, union and concatenation.

A.5 Supervisory control


The situation considered in this section is that of a given DES whose behavior must be modified by feedback control in order to achieve a given set of specifications. Consider an automaton G that models the uncontrolled behavior of the DES; this behavior is not satisfactory and must be modified by control, in the sense that it must be restricted to a subset of L(G). In this framework, we consider sublanguages of L(G) that represent the legal behavior for the controlled system. In this paradigm, the supervisor S observes some (or all) of the events that G executes and tells G which events in its current active set are allowed next. In other words, S has the capability of disabling some feasible events of G, exerting in this way a feedback control action on G.
Consider a DES modelled by the generated language L and the marked language Lm, both defined over the event set E, and consider the case of a prefix-closed L. These two languages are generated and marked by an automaton

G = (X, E, f, Γ, x0 , Xm ) .

We want to design a supervisor S that interacts with G in a feedback manner as explained


previously. Let E be partitioned into two disjoint subsets:

E = Ec ∪ Euc

where
• Ec is the set of controllable events, i.e. those events that can be prevented from happening
(disabled) by supervisor S;
• Euc is the set of uncontrollable events, i.e. those events that cannot be prevented from
happening by supervisor S.
Assume for the moment that all the events in E executed by G are observed by S. A supervisor
S is a function
S : L(G) → 2E
such that for each s ∈ L(G),
S(s) ∩ Γ(f (x0 , s))
is the set of enabled events that G can execute at its current state f (x0 , s). In view of this we will
say that supervisor S is admissible if for all s ∈ L(G)

Euc ∩ Γ(f (x0 , s)) ⊆ S(s)

i.e. S is not allowed to ever disable a feasible uncontrollable event.



Definition A.9 The language generated by S/G is defined recursively as follows


1. ² ∈ L(S/G)
2. [(s ∈ L(S/G)) and (sσ ∈ L(G)) and (σ ∈ S(s))] ⇐⇒ [(sσ ∈ L(S/G))] .
The language marked by S/G is defined as follows:
Lm (S/G) := L(S/G) ∩ Lm (G) .
Definition A.10 The DES S/G is blocking if
L(S/G) ≠ $\overline{L_m(S/G)}$
and nonblocking if
L(S/G) = $\overline{L_m(S/G)}$ .
Consider now the situation where the supervisor does not observe all the events that G exe-
cutes, i.e. the event set E is partitioned into two disjoint subsets:
E = Eo ∪ Euo
where
• Eo is the set of observable events, i.e. those events that can be seen by supervisor S;
• Euo is the set of unobservable events, i.e. those events that cannot be seen by supervisor S.
In this case the feedback loop includes a natural projection P between G and the supervisor S
in the sense that the supervisor cannot distinguish between two strings s1 and s2 that have the
same projection, and will issue the same control action: SP[P(s1)] = SP[P(s2)]. We define a partial-observation supervisor as a function

SP : P[L(G)] → 2^E .
This means that the control action can change only after the occurrence of an observable event, i.e. when P(s) changes.
Let us take the string t = t′σ (with σ ∈ Eo). SP(t) is the control action that applies to all strings in L(G) that belong to P⁻¹(t′){σ} and to the unobservable continuations of these strings. However, SP(t) may disable unobservable events and thus prevent some of these unobservable continuations. We define

Lt = P⁻¹(t′){σ}(SP(t) ∩ Euo)* ∩ L(G) .

In words, Lt contains all the strings in L(G) that are subject to the control action SP(t). Now remember that a supervisor is admissible if it does not disable uncontrollable events. Hence SP is admissible if, for all t = t′σ ∈ P[L(G)],

Euc ∩ [ ⋃_{s∈Lt} Γ(f(x0, s)) ] ⊆ SP(t) .

Definition A.11 The language generated by SP /G is defined recursively as follows


1. ² ∈ L(SP /G)
2. [(s ∈ L(SP /G)) and (sσ ∈ L(G)) and (σ ∈ SP [P (s)])] ⇐⇒ [(sσ ∈ L(SP /G))] .
The language marked by S/G is defined as follows:
Lm (SP /G) := L(SP /G) ∩ Lm (G) .

A.6 Uncontrollability problem


A.6.1 Dealing with uncontrollable events
In the following, the existence result for supervisors in the presence of uncontrollable events is presented.

Theorem A.4 (Controllability theorem) Consider a DES G = (X, E, f, Γ, x0) where Euc ⊆ E is the set of uncontrollable events. Let K ⊆ L(G), where K ≠ ∅, be the admissible language. Then there exists a supervisor S such that L(S/G) = K if and only if the controllability condition holds, i.e.:

KEuc ∩ L(G) ⊆ K .

Proof. See [26].

Remark A.1 The controllability condition in the controllability theorem is intuitive and can be paraphrased as: “if you cannot prevent it, then it should be legal”.

Definition A.12 (controllability) Let K and M = $\overline{M}$ be languages over the event set E. Let Euc be a subset of E. K is said to be controllable with respect to M and Euc if and only if

KEuc ∩ M ⊆ K .
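For finite languages this condition can be checked by brute force; the sketch below does exactly that (strings over single-character events, K and M given as finite sets, prefix closures computed explicitly) and is meant only to make the definition concrete.

```python
def prefix_closure(L):
    return {s[:i] for s in L for i in range(len(s) + 1)}

def is_controllable(K, M, E_uc):
    """Check the controllability condition: no string of the prefix-closure
    of K, extended by an uncontrollable event, may leave the closure while
    remaining inside M."""
    Kbar = prefix_closure(K)
    Mbar = prefix_closure(M)          # M is assumed prefix-closed already
    return all(s + e not in Mbar or s + e in Kbar
               for s in Kbar for e in E_uc)

# toy example over E = {a, b}, with b uncontrollable
M = {"", "a", "ab"}                        # plant language (prefix-closed)
print(is_controllable({"a"}, M, {"b"}))        # False: 'ab' cannot be disabled after 'a'
print(is_controllable({"a", "ab"}, M, {"b"}))  # True
```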

Theorem A.5 (Nonblocking Controllability theorem) Consider the DES G = (X, E, f, Γ, x0), where Euc ⊆ E is the set of uncontrollable events. Consider the language K ⊆ Lm(G), where K ≠ ∅ is the admissible language. Then there exists a nonblocking supervisor S for G such that Lm(S/G) = K and L(S/G) = $\overline{K}$ if and only if:

1. K is controllable with respect to L(G) and Euc, i.e.:

KEuc ∩ L(G) ⊆ K ,

2. K is Lm(G)-closed, i.e.

K = $\overline{K}$ ∩ Lm(G) .

Proof. See [26].

A.6.2 Realization of supervisors


Let us assume that the language K ⊆ L(G) is controllable; then from the controllability theorem we know that the supervisor S defined by

S(s) = [Euc ∩ Γ(f(x0, s))] ∪ {σ ∈ Ec : sσ ∈ K}

results in

L(S/G) = K .

We now need to build a convenient representation of the function S. Consider an automaton R that marks the language K:

R = (Y, E, g, ΓR, y0, Y)

where R is trim and


Lm (R) = L(R) = K.
If we connect R to G by the product operation, the result R×G is exactly the behavior we desire
for S/G:
L(R × G) = L(R) ∩ L(G)
= K ∩ L(G)
= K = L(S/G)

Lm (R × G) = Lm (R) ∩ Lm (G)
= K ∩ Lm (G)
= L(S/G) ∩ Lm (G) = L(S/G) .
Note that R is defined to have the same event set as G, then R k G = R × G. We will call R the
standard realization of S.

A.7 Unobservability problem


Consider now the feedback loop in the case of partial event observation. In other words we
have to deal with the presence of unobservable events in addition to the presence of uncon-
trollable events. Clearly unobservable events impose further limitations on the controlled be-
haviors that can be achieved with P -supervisors. As we did for controllability, we need to
introduce the concept of observability. Intuitively observability means “if you cannot differentiate
between two strings, then these strings should require the same control action”, or equivalently “if you
must disable an event after observing a string, then by doing so you should not disable any string that
appears in the desired behavior”. This idea can be formalized as follows:

Definition A.13 (observability) Let K and M = $\overline{M}$ be languages over the event set E. Let Ec be a subset of E. Let Eo be another subset of E, with P the corresponding natural projection from E* to Eo*. K is said to be observable with respect to M, P and Ec if, for all s ∈ K and for all σ ∈ Ec,

(sσ ∉ K) and (sσ ∈ M) ⇒ P⁻¹[P(s)]σ ∩ K = ∅ .

Theorem A.6 (Controllability and observability theorem) Consider the DES G = (X, E, f, Γ, x0), where Euc ⊆ E is the set of uncontrollable events and Eo ⊆ E is the set of observable events. Let P be the natural projection from E* to Eo*. Consider the language K ⊆ Lm(G), where K ≠ ∅ is the admissible language. Then there exists a nonblocking P-supervisor SP for G such that Lm(SP/G) = K and L(SP/G) = K if and only if:

1. K is controllable with respect to L(G) and Euc.

2. K is observable with respect to L(G), P and Ec.

3. K is Lm(G)-closed.

Proof. See [26].


The reader interested in further results regarding supervision of uncontrollable and unob-
servable languages is referred to [26] and references therein.
Appendix B
An experimental setup for FTC
algorithms test

In this appendix the experimental setup designed at the Laboratory of Automation and Robotics (LAR) of the University of Bologna to create a testing environment for fault diagnosis algorithms and fault tolerant control algorithms for electrical motors is described. For further information the reader is referred to [3], where it is possible to find the technical documentation of the setup and an exhaustive user manual.

B.1 The experimental setup


In this appendix the experimental setup designed to create a testing environment for fault diagnosis algorithms and fault tolerant control algorithms for electrical motors is described. The idea developed is sketched in fig. B.1. The system is constituted by the interconnection of two subsystems: an asynchronous motor with a power inverter with 540 V DC-link voltage and a control board based on a DSP TMS320C32, and a brush-less motor with its driver, acting as a load torque generator. The interconnection of the two mechanical systems is made by means of an adaptive joint. The DSP performs data acquisition, generates the speed and flux references for the induction motor and the torque reference for the brush-less motor, implements the control algorithm and generates the PWM inverter commands. The control board is connected to a standard PC used for DSP programming and for data acquisition and displaying. The control board and the standard PC realize a so-called rapid prototyping station, i.e. a system by means of which it is possible to design and test control algorithms; in fact the station performs the task of a virtual oscilloscope to monitor the controlled system and is interfaced with a power converter.
The induction motor is a commercial asynchronous three-phase 1.1 kW motor with 50 Hz, 380-410 V power supply, whose electrical and mechanical parameters are shown in tab. B.1. The induction motor has been damaged by introducing a mechanical fault: specifically, five of the 28 rotor bars have been holed in order to simulate a broken-bar rotor failure. The diameter

Description Parameter Value Units


Stator inductance Ls 0.663 H
Rotor inductance Lr 0.663 H
Mutual inductance Lm 0.627 H
Stator resistance Rs 6.25 Ω
Rotor resistance Rr 6.26 Ω
Rotor inertia J 0.0024 Kg m2
Number of pole pairs np 2 −

Table B.1: Parameters of the induction motor adopted for the experimental activity.

of each hole is 4 mm.

The stator currents and the motor angular speed are acquired by means of commercial Hall-type sensors, which output a 0-10 V signal proportional to the instantaneous value of an AC current signal (0-50 A), and by a two-pole commercial resolver (6 V, 10 kHz, with a transformation ratio of 0.28 ± 10%) with an encoder simulation output (1024 ppr).
The brush-less motor is controlled by its commercial driver in order to track a torque reference signal. It is used to simulate an unknown torque load and is interconnected with the DSP board in order to generate torque load references via software and hence simulate particular operating conditions. The mechanical coupling of the two mechanical systems is made by two components:
• an adaptive joint able to compensate for angular, axial and radial offsets;

• a saddle to align the two motor axes.


The mechanical interconnection of the systems is shown in fig. B.2.
The electronic interface between the two systems is realized via the signal generating the torque reference for the brush-less motor (an analog output from the DSP board) and the outputs of the induction motor driver (encoder signal and current sensor signals). The remaining analog outputs of the DSP board are used to manage the power driver of the induction motor. In this way the system is completely observable and controllable via software.

B.2 The power stage


The power stage of the induction motor is constituted by a three-phase commutation power converter and a control board for the interface with the prototyping station. This board embeds duty circuits (the servo-assisted supply circuit and the opto-isolated interface for the encoder) and auxiliary circuits (voltage clampers and high-voltage conditioning circuits). The power stage is sketched in fig. B.3.
Downstream of the supply terminals there are the power and pre-charge relays, controlled by the DSP board via a digital output. The high voltage comes from the network through a rectifier to an opto-isolated amplifier that measures the supply voltage and outputs the measurement on the analog port of the board, while a voltage clamper limits the maximum voltage through a hysteresis comparator. On the analog port are therefore available the bus voltage measurement, the current measurements of two phases and the temperature of the dissipation system.
Fig. B.4 also represents the electrical and electronic interconnection between the brush-less driver and the prototyping station. The interface is constituted by the encoder signal from
Figure B.1: The experimental setup.


Figure B.2: Mechanical interconnection.

the brush-less motor to the power stage, the load torque reference signal from the DSP board to the brush-less driver, and command signals from the DSP board to the brush-less driver. These signals are conditioned by means of a dedicated board.
The design has been completed with some measures to solve electromagnetic compatibility problems and with operator safety systems, in order to protect the human operator from the dangers of the high voltage.
Some pictures of the realized system are shown in fig. B.5 and fig. B.6.
Figure B.3: Power stage of the setup.


Figure B.4: Interconnection between the brush-less motor system and the setup.

Figure B.5: An overview of the experimental setup.

Figure B.6: Some pictures of details of the system: (a) the mechanical interconnection, (b) the adaptive joint, (c) the induction motor, (d) the brush-less motor.


Appendix C
Implicit fault tolerant control of a n-dof
robot manipulator

In this appendix an implicit fault tolerant control scheme is specialized for an n-dof fully actuated mechanical manipulator subject to various sinusoidal torque disturbances acting on the joints. More in detail, we show how a standard tracking controller can be “augmented” with an internal model unit designed so as to compensate the unknown spurious torque harmonics. In this way the controller is proved to be globally and implicitly fault tolerant to all the faults belonging to the model embedded in the regulator. Moreover, by simply testing the state of the internal model, we will show how to perform fault detection and isolation. The results illustrated in this appendix are also reported in [17].

C.1 Introduction
Chapter 5 addressed the case in which the faults affecting the controlled system can be modelled as functions (of time) within a finitely-parameterized family of such functions. A controller which embeds an internal model of this family is then designed in order to generate a supplementary control action which compensates for the presence of any of such faults, regardless of their entity. The idea is pursued using the theoretical machinery of (nonlinear) output regulation theory (see [23]) under the assumption that the side-effects generated by the occurrence of the fault can be modelled as an exogenous signal generated by an autonomous “neutrally stable” system (the so-called “exosystem”). In this framework, the Fault Detection and Isolation phase is postponed to that of control reconfiguration, since it can be carried out by testing the state of the internal model unit which automatically activates to offset the presence of the fault. This approach has been successfully applied to control induction motors in faulty conditions.
In this appendix the approach outlined above is specialized to the design of a fault tolerant control system for an n-dof fully actuated mechanical robot subject to various sinusoidal torque disturbances acting on the joints (see [61]). We will show how this framework can be cast as an output regulation problem. More in detail, we show how a standard tracking robot controller (see [40], [102], [39]) can be “augmented” with an internal model unit designed so as to compensate the unknown spurious torque harmonics. In this way the controller is proved to be globally and implicitly fault tolerant to all the faults belonging to the model embedded in the regulator.

C.2 Problem statement and preliminary positions

In this section we are going to introduce the model of an n-degree-of-freedom fully actuated robot manipulator and state the FTC-FDI problem. Usually the joint actuators are modelled as pure torque sources; however, they can be subject to some asymmetries (e.g. due to some electrical or mechanical faults) that cause the rise of spurious harmonics in the electrical variables and hence in the generated torques. Therefore, in the following, we will model these effects as sinusoidal signals superimposed on the controlled torque signals.
We will then show how it is possible to cast this problem in the framework illustrated in [18]. In order to point out that a pre-existing controller can be augmented, without modification, with the designed FTC-FDI module (internal model unit) able to overcome the disturbance and, moreover, to isolate it, in this section a simple tracking controller is also considered.
The regulation scheme developed is depicted in figure C.1.
The regulation scheme developed is depicted in figure C.1.

Trajectory
Exosystem Generator

q ? (t)
v(t) ?
? - (t)
p
- Nominal ν(t) - ?
i -i - n-dof - i
+
controller Robot q(t), p(t)
6
τ (t) q̄(t), p̄(t)

ξ(t) Internal
FDI
¾ Model ¾
Logic Unit

Fault Estimation
-

Figure C.1: FTC controller scheme

Consider an n-degree-of-freedom fully-actuated robot manipulator with generalized coordinates q = (q1, ..., qn)ᵀ. If p = M(q)q̇ = (p1, ..., pn)ᵀ are the generalized momenta, with M(q) the inertia matrix, symmetric and positive definite for all q, an explicit port-Hamiltonian representation of this system can be obtained by defining the whole state x := (q, p)ᵀ and the Hamiltonian function as the total energy of the system (sum of kinetic energy and potential energy)

$$H(q,p) := \frac{1}{2} p^T M^{-1}(q)\, p + P(q)$$

and, finally,
µ ¶ µ ¶ µ ¶
0 In 0 0 0
J= R= G=
−In 0 0 D(q) In

with D(q) = DT (q) ≥ 0 taking into account the dissipation effects. The input is an effort
representing the input torques and the output is a flow representing the joint velocities. These
considerations lead to the following model

∂H
 
· ¸ ·µ ¶ µ ¶¸ · ¸
q̇ 0 In 0 0  ∂q  0
= −   + ν
ṗ −In 0 0 D(q)  ∂H  In
∂p (C.1)
∂H
 
£ ¤  ∂q 
y = 0 In  
 ∂H 
∂p

This system will be affected by an external torque ripple v(t) acting through the control input channel (i.e. the torque actually applied to the system will be the sum of the control torque and the external disturbance, ν + v(t)), and the problem addressed here is to compensate this disturbance while detecting and isolating this (unknown) disturbance.
It is worth pointing out again that the design of the internal model unit does not affect a pre-existing regulator designed to carry out a particular task. To remark on this feature, in the following we will introduce a simple control scheme whose aim is to make the manipulator track a known trajectory. This tracking controller is developed following [39], but the same results can be obtained using a simpler controller as well.

C.2.1 Tracking control

Firstly, a preliminary torque input able to compensate the potential energies (such as gravity) is designed:

$$\nu = \frac{\partial P(q)}{\partial q} + \nu' \tag{C.2}$$

Let us define the desired trajectory for the generalized coordinates and the generalized momenta as (q⋆(t), p⋆(t)); this trajectory, to be realizable, has to satisfy p⋆(t) = M(q⋆)q̇⋆(t). To define new error variables, let us consider the following change of coordinates

$$\bar{q} = q - q^\star(t) \qquad \bar{p} = p - M(q)\dot{q}^\star(t) \tag{C.3}$$

Differentiating the new error coordinates we obtain:

$$\dot{\bar{q}} = M^{-1}(q)\bar{p}$$

$$\begin{aligned}
\dot{\bar{p}} &= -\frac{\partial H}{\partial q} - D(q)\frac{\partial H}{\partial p} + \nu' - \frac{d}{dt}\big(M(q)\dot{q}^\star(t)\big) \\
&= -\frac{1}{2} p^T \frac{\partial M^{-1}}{\partial q}\, p - D M^{-1}(q)\, p + \nu' - \frac{d}{dt}\big(M \dot{q}^\star(t)\big) \\
&= -\frac{1}{2}\big(\bar{p} + M\dot{q}^\star(t)\big)^T \frac{\partial M^{-1}}{\partial q}\big(\bar{p} + M\dot{q}^\star(t)\big) - D M^{-1}\big(\bar{p} + M\dot{q}^\star(t)\big) + \nu' - \frac{d}{dt}\big(M\dot{q}^\star(t)\big) \\
&= -\frac{1}{2}\bar{p}^T \frac{\partial M^{-1}(q)}{\partial q}\,\bar{p} - D(q)M^{-1}(q)\bar{p} + \nu' - \Pi(q, \dot{q}^\star(t), \ddot{q}^\star(t))
\end{aligned} \tag{C.4}$$
Defining a new Hamiltonian function as

$$H' = \frac{1}{2}\bar{p}^T M^{-1}(q)\bar{p}$$

it is possible to write (C.4) again as a port-Hamiltonian system:

$$\dot{\bar{q}} = \frac{\partial H'}{\partial \bar{p}} \qquad
\dot{\bar{p}} = -\frac{\partial H'}{\partial \bar{q}} - D(q)\frac{\partial H'}{\partial \bar{p}} + \nu' - \Pi(q, \dot{q}^\star(t), \ddot{q}^\star(t)) \tag{C.5}$$

It is now possible to obtain perfect asymptotic tracking by designing the control torque in order to cancel the “bad” term Π(·), to shape the energy of the error system so as to have a minimum at the origin¹, and to add some damping in order to make this minimum globally attractive:

$$\nu' = \Pi(q, \dot{q}^\star(t), \ddot{q}^\star(t)) + D M^{-1}(q)\bar{p} - \bar{q} - M^{-1}(q)\bar{p} + \tau \tag{C.6}$$

where τ is an additional control torque that will be used in the following section in order to compensate the presence of additional torque disturbances.
The whole error system (C.5) with the controller (C.6) can be written as

$$\dot{\bar{q}} = \frac{\partial \bar{H}}{\partial \bar{p}} \qquad
\dot{\bar{p}} = -\frac{\partial \bar{H}}{\partial \bar{q}} - \frac{\partial \bar{H}}{\partial \bar{p}} + \tau \tag{C.7}$$

where the new Hamiltonian is defined by

$$\bar{H} = \frac{1}{2}\bar{p}^T M^{-1}(q)\bar{p} + \frac{1}{2}\bar{q}^T\bar{q} \tag{C.8}$$

Remark C.1 It is worth remarking that this kind of control strategy is very similar to the classical tracking control obtained by inversion of the model and the introduction of simple proportional and derivative terms (see e.g. [40], [102], [73]).

¹Note that q̄ = 0 means that tracking is achieved, since q → q⋆(t).
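As a purely illustrative sketch (not the implementation used in this work), the nominal tracking law (C.2), (C.6) can be coded as follows once the model terms M(q), ∂P/∂q, D(q) and Π(·) are available as user-supplied functions; all names are placeholders and the additional torque τ of (C.6) is omitted.

```python
import numpy as np

def tracking_torque(q, p, q_ref, dq_ref, ddq_ref, M, dP_dq, D, Pi):
    """Nominal tracking controller of section C.2.1 (sketch):
    gravity compensation + cancellation of the trajectory-dependent term Pi
    + energy shaping (-q_bar) + damping injection (-M^{-1} p_bar)."""
    Mq = M(q)
    q_bar = q - q_ref
    p_bar = p - Mq @ dq_ref
    nu_prime = (Pi(q, dq_ref, ddq_ref)
                + D(q) @ np.linalg.solve(Mq, p_bar)   # cancel dissipation term
                - q_bar                               # energy shaping
                - np.linalg.solve(Mq, p_bar))         # damping injection
    return dP_dq(q) + nu_prime                        # total commanded torque
```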

C.2.2 Problem statement


It is now possible to state the input disturbance suppression problem by introducing into the model of a controlled n-degree-of-freedom robot manipulator like (C.7), (C.8) the exogenous torque disturbance v(t):

$$\dot{\bar{q}} = \frac{\partial \bar{H}}{\partial \bar{p}} \qquad
\dot{\bar{p}} = -\frac{\partial \bar{H}}{\partial \bar{q}} - \frac{\partial \bar{H}}{\partial \bar{p}} + \tau + v(t) \,. \tag{C.9}$$

In (C.9), v(t) is a torque disturbance belonging to the class of signals generated by the linear, neutrally stable autonomous system (exosystem)

$$\dot{z} = Sz \qquad v(t) = -\Gamma z \tag{C.10}$$

with z ∈ ℝ^{2k}, Γ ∈ ℝ^{m×2k} a known matrix and S defined by

$$S = \mathrm{diag}\{S_1, \ldots, S_k\} \tag{C.11}$$

with

$$S_i = \begin{bmatrix} 0 & \omega_i \\ -\omega_i & 0 \end{bmatrix} \qquad \omega_i > 0 \quad i = 1, \ldots, k \tag{C.12}$$

and z(0) ∈ Z, with Z ⊆ ℝ^{2k} a bounded compact set.
and z(0) ∈ Z, with Z ⊆ R2k bounded compact set.
In this discussion the matrix S is firstly considered perfectly known, and then, in section C.4,
this hypothesis is removed (as in [93]): the dimension 2k of matrix S will be still known but
all characteristic frequencies ωi will be unknown but ranging within known compact sets, i.e.
ωimin ≤ ωi ≤ ωimax .
In this set up the lack of knowledge of the exogenous disturbance reflects into the lack of
knowledge of the initial state w(0) of the exosystem and, in section C.4, also of the charac-
teristic frequencies. For instance, in the next section, any v(t) obtained by linear combination
of sinusoidal signals with known frequencies but unknown amplitudes and phases will be con-
sidered, while, in section C.4, the frequencies will be unknown too.
All those assumptions allows us to cast the problem of disturbance suppression as a problem
of output regulation (see [24], [44]) that will be complicated by the lack of knowledge of the
matrix S (see [93]), and suggests to look for a controller which embeds an internal model of the
exogenous disturbances augmented by an adaptive part in order to estimate the characteristic
frequencies of the disturbances.

Remark C.2 Note again that the whole design method introduced in the following can easily be applied to general mechanical systems described as port-Hamiltonian systems like (C.9). Hence it is straightforward to consider this method suitable for a generic mechanical system already regulated to accomplish a certain task with a classical control strategy (see [73] for a survey of passivity based control strategies applied to port-Hamiltonian systems).

C.3 Canonical internal model unit design


In this section we are going to briefly design a canonical internal model unit able to overcome
external torque disturbances (i.e. exogenous sinusoidal torque ripples). The main hypothe-
sis here (that will be removed in the next section) is that the exogenous matrix S is perfectly

known.
As previously announced, the regulator to be designed will embed the internal model of the
exogenous disturbance: this internal model unit is designed according to the procedure pro-
posed in [72] (canonical internal model). Given any Hurwitz matrix F and any matrix G such
that (F, G) is controllable, denote by Y the unique matrix solution of the Sylvester equation

Y S − F Y = GΓ

and define Ψ := ΓY⁻¹.
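The pair (Y, Ψ) can be computed numerically; the following sketch relies on SciPy's Sylvester solver (scipy.linalg.solve_sylvester solves AX + XB = Q, so YS − FY = GΓ is passed as (−F)Y + YS = GΓ). The numerical values in the example are arbitrary placeholders for a single harmonic.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def internal_model_gains(S, Gamma, F, G):
    """Solve Y S - F Y = G Gamma for Y and return Psi = Gamma Y^{-1}."""
    Y = solve_sylvester(-F, S, G @ Gamma)      # (-F) Y + Y S = G Gamma
    return Y, Gamma @ np.linalg.inv(Y)

# single harmonic (k = 1) of known frequency, scalar disturbance channel
omega = 2 * np.pi
S = np.array([[0.0, omega], [-omega, 0.0]])
Gamma = np.array([[1.0, 0.0]])                 # v = -Gamma z
F = np.diag([-1.0, -2.0])                      # any Hurwitz matrix
G = np.array([[1.0], [1.0]])                   # (F, G) controllable
Y, Psi = internal_model_gains(S, Gamma, F, G)
```

A unique solution Y exists since the spectra of F and S are disjoint (F Hurwitz, S with purely imaginary eigenvalues).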
Let us introduce the internal model unit as

$$\dot{\xi} = (F + G\Psi)\xi + N(\bar{p}, \bar{q}) \tag{C.13}$$

and set the control law as

$$\tau = \Psi\xi + \tau_{st} \tag{C.14}$$

where N(q̄, p̄) and τst are additional terms that will be designed later.
Defining the change of coordinates

$$\chi = \xi - Yz - G\bar{p} \tag{C.15}$$

system (C.9), (C.15) becomes

$$\left\{\begin{aligned}
\dot{\bar{q}} &= M(q)^{-1}\bar{p} \\
\dot{\bar{p}} &= -\frac{\partial \bar{H}}{\partial \bar{q}} - \frac{\partial \bar{H}}{\partial \bar{p}} + \Psi\xi + \tau_{st} - \Psi Y z \\
\dot{\chi} &= (F + G\Psi)\xi + N(\bar{p}, \bar{q}) - YSz - G\dot{\bar{p}}
\end{aligned}\right. \tag{C.16}$$

Choosing τst = −ΨGp̄, a simple computation shows that the p̄-dynamics become

$$\dot{\bar{p}} = -\frac{\partial \bar{H}}{\partial \bar{q}} - \frac{\partial \bar{H}}{\partial \bar{p}} + \Psi\chi \tag{C.17}$$
Concentrating on the χ-dynamics, it is possible to design

$$N(\bar{q}, \bar{p}) = -\Psi^T\bar{p} - FG\bar{p} - G\left(\frac{\partial \bar{H}}{\partial \bar{q}} + \frac{\partial \bar{H}}{\partial \bar{p}} + \Psi G\bar{p}\right)$$

and write the last equation of (C.16) as

$$\dot{\chi} = F\chi - \Psi^T\bar{p} \tag{C.18}$$
Consider now the first equation of (C.16) together with (C.17) and (C.18). This new system identifies a port-Hamiltonian system described by:

$$\dot{x} = [J(x) - R(x)]\frac{\partial H_x(x)}{\partial x} \tag{C.19}$$

with x = (q̄, p̄, χ)ᵀ, the Hamiltonian Hx(x) defined by

$$H_x(x) = \frac{1}{2}\bar{p}^T M(q)^{-1}\bar{p} + \frac{1}{2}\bar{q}^T\bar{q} + \frac{1}{2}\chi^T\chi ,$$

and the skew-symmetric interconnection matrix J(x) and the damping matrix R defined by:

$$J(x) = \begin{bmatrix} 0 & I & 0 \\ -I & 0 & \Psi \\ 0 & -\Psi^T & 0 \end{bmatrix} \qquad
R = \begin{bmatrix} 0 & 0 & 0 \\ 0 & I & 0 \\ 0 & 0 & -F \end{bmatrix}$$

Proposition C.1 Consider the controlled n-degree-of-freedom robot manipulator (C.9) with Hamiltonian (C.8), affected by the torque disturbances generated by (C.10), (C.11), (C.12). The additional control law generated by the internal model unit

$$\left\{\begin{aligned}
\dot{\xi} &= (F + G\Psi)\xi - \Psi^T\bar{p} - FG\bar{p} - G\bar{q} - \frac{1}{2}G\,\bar{p}^T\frac{\partial M^{-1}(q)}{\partial q}\bar{p} - GM^{-1}(q)\bar{p} - G\Psi G\bar{p} \\
\tau &= \Psi\xi - \Psi G\bar{p}
\end{aligned}\right. \tag{C.20}$$

asymptotically assures the input disturbance suppression (fault tolerance with respect to the torque ripple, i.e. (q̄, p̄) → (0, 0) as t → ∞) and the convergence of the state of the internal model to the fault signal (fault detection, i.e. ξ → Yz).

Proof. Considering Hx(x) as a Lyapunov function, the proof is immediate since (remembering that F is an arbitrary Hurwitz matrix)

$$\dot{H}_x \le -M^{-1}(q)\|\bar{p}\|^2 + F\|\chi\|^2$$

and by the LaSalle invariance principle the system asymptotically converges to lim_{t→∞}(p̄, χ) = (0, 0). Moreover, from the first and second equations of (C.20) it is possible to conclude that also lim_{t→∞} q̄(t) = 0, and the proposition is proved. □

C.4 Adaptive internal model unit design


In this section we are going to introduce the main result, designing an adaptive internal model
unit able to overcome external torque disturbances (i.e. exogenous sinusoidal torque ripples).
As previously announced, in this section the perfect knowledge of the exogenous matrix S is
not assumed, as only the dimension 2k of the matrix is assumed to be known; this means that,
for instance, any v(t) obtained by linear combination of sinusoidal signals with unknown fre-
quencies, amplitudes and phases can be modelled.
For this reason it is now impossible to implement the “classical” internal model control introduced in (C.13), which depends on Ψ; hence let us design a canonical adaptive internal model

$$\dot{\xi} = (F + G\hat{\Psi})\xi + N(\bar{p}, \bar{q}) \qquad \dot{\hat{\Psi}}_i^T = \varphi_i(\xi, \bar{p}, \bar{q}) \tag{C.21}$$

calling Ψ̂ᵢᵀ, with i = 1, ..., n, every column of the matrix Ψ̂ᵀ ∈ ℝ^{2k×n}. Moreover, let us set the control law as

$$\tau = \hat{\Psi}\xi + \tau_{st}$$

where N(q̄, p̄) and τst are additional terms that will be designed later. The adaptation law φ(ξ, p̄, q̄) will be designed in order to assure that asymptotically the internal model unit will
188 Implicit fault tolerant control of a n-dof robot manipulator

provide a torque able to overcome all disturbances.


Defining the change of coordinates

χ = ξ − Y z − Gp̄
Ψ̃ᵀᵢ = Ψ̂ᵀᵢ − Ψᵀᵢ ,    i = 1, . . . , n                                (C.22)

where Ψᵀᵢ represents the i-th column of Ψᵀ, system (C.9), (C.21) becomes


q̄˙ = M(q)⁻¹ p̄
p̄˙ = −∂H̄/∂q̄ − ∂H̄/∂p̄ + Ψ̂ξ + τst − ΨY z
χ̇ = (F + GΨ̂)ξ + N(p̄, q̄) − Y Sz − Gp̄˙                                (C.23)
Ψ̂˙ᵀᵢ = ϕᵢ(ξ, p̄, q̄) ,    i = 1, . . . , n

Note that

p̄˙ = −∂H̄/∂q̄ − ∂H̄/∂p̄ + Ψ̂(ξ − Y z) + τst + Ψ̃Y z
   = −∂H̄/∂q̄ − ∂H̄/∂p̄ + Ψ̂(ξ − Y z − Gp̄) + Ψ̂Gp̄ + τst + Ψ̃(ξ − χ − Gp̄) ;
∂ q̄ ∂ p̄

choosing τst = −Ψ̂Gp̄ + τst0 it is possible to write

p̄˙ = −∂H̄/∂q̄ − ∂H̄/∂p̄ + Ψ̂χ + Ψ̃ξ − Ψ̃χ − Ψ̃Gp̄ + τst0 .
Choosing now τst0 = ĀM⁻¹(q)p̄, with Ā such that A = Ā − I is Hurwitz, we obtain

p̄˙ = −∂H̄/∂q̄ + A ∂H̄/∂p̄ + Ψχ + Ψ̃(ξ − Gp̄) .                           (C.24)
Considering every single element of the vector p̄, it is possible to write (from now on the superscript i denotes the i-th element of the vector at hand)
p̄˙ⁱ = ( −∂H̄/∂q̄ + A ∂H̄/∂p̄ + Ψχ )ⁱ + (ξ − Gp̄)ᵀ Ψ̃ᵀᵢ                    (C.25)
with i = 1, · · · , n.
Concentrate now on the χ-dynamics in order to suitably design the update term N(q̄, p̄):
χ̇ = (F + GΨ̂)ξ + N(p̄, q̄) − F Y z − GΓz − G( −∂H̄/∂q̄ − ∂H̄/∂p̄ + Ψ̂ξ + τst − Γz )
  = F χ + F Gp̄ + N(p̄, q̄) + G ∂H̄/∂q̄ + G ∂H̄/∂p̄ − Gτst
Choosing

N(p̄, q̄) = −F Gp̄ − G ∂H̄/∂q̄ − G ∂H̄/∂p̄ + Gτst                          (C.26)
we obtain
χ̇ = F χ = F χ − Ψᵀp̄ + Ψᵀp̄ .                                          (C.27)

As all dynamics of (C.23) have been investigated, it is now possible to design an adaptation
law for Ψ̂ᵀ: assume then

ϕᵢ(ξ, p̄, q̄) = −(ξ − Gp̄) p̄ⁱ ,    i = 1, . . . , n .


With this in mind, it is immediate to write the Ψ̃ᵀᵢ-dynamics as

Ψ̃˙ᵀᵢ = Ψ̂˙ᵀᵢ − Ψ̇ᵀᵢ = −(ξ − Gp̄) p̄ⁱ ,    i = 1, . . . , n .              (C.28)

Consider now the first equation of (C.23) with all (C.25), (C.27) and (C.28). This new system
(with a small abuse of notation in order to obtain a more compact and readable formulation)
identifies an interconnection described by:
ẋ = [J(x) − R(x)] ∂Hx(x)/∂x + Λ(x)                                    (C.29)
with

x = ( q̄   p̄   χ   Ψ̃ᵀ )ᵀ ,
the Hamiltonian Hx (x) defined by
Hx(x) = ½ p̄ᵀM(q)⁻¹p̄ + ½ q̄ᵀq̄ + ½ χᵀχ + ½ ∑ᵢ₌₁ⁿ Ψ̃ᵢ Ψ̃ᵀᵢ

the skew-symmetric interconnection matrix J(x) defined by:


 
        ⎡  0          I        0        0     ⎤
J(x) =  ⎢ −I          0        Ψ    (ξ − Gp̄)ᵀ ⎥
        ⎢  0        −Ψᵀ        0        0     ⎥
        ⎣  0   −(ξ − Gp̄)       0        0     ⎦

the positive-semidefinite damping matrix R defined as:

     ⎡ 0    0     0    0 ⎤
R =  ⎢ 0   −A     0    0 ⎥
     ⎢ 0    0    −F    0 ⎥
     ⎣ 0    0     0    0 ⎦

and Λ(x) defined by:


Λ(x) = ( 0   0   Ψᵀp̄   0 )ᵀ .

Proposition C.2 Consider the controlled n-degree of freedom robot manipulator (C.9) with Hamilto-
nian (C.8), affected by the torque disturbances generated by (C.10), (C.11), (C.12).
The additional control law generated by the adaptive internal model unit:
 −1

 ξ˙ = (F + GΨ̂)ξ − F Gp̄ + G 1 p̄T ∂M (q) p̄ + Gq̄ − GΨ̂Gp̄ + GAM −1 (q)p̄
2 ∂q





˙ (C.30)

 Ψ̂ = −(ξ − Gp̄)T p̄




τ = Ψ̂ξ − Ψ̂Gp̄ + AM −1 (q)p̄ .


assures asymptotically the input disturbance suppression (fault tolerance with respect to torque ripple,
i.e. (q̄, p̄) → (0, 0) as time t → ∞) and the convergence of the state of the adaptive internal model to the
fault signal (fault detection, i.e. ξ → Y z).

Proof. Consider for system (C.29) (obtained by connecting (C.9) with (C.30)) the following Lyapunov function:
V = Hx (x) (C.31)
Easy computations (recalling the skew-symmetry of the interconnection matrix J(x)) show that there exist two real numbers ηA ∈ R⁻, ηF ∈ R⁻ (depending on the design matrices A and F) and a real number ηΨ ∈ R such that

V̇ ≤ ηA ‖p̄‖² + ηF ‖χ‖² + χᵀΨᵀp̄
  ≤ ηA ‖p̄‖² + ηF ‖χ‖² + ηΨ ‖p̄‖‖χ‖ .                                  (C.32)
Using a Young's inequality argument we can write

V̇ ≤ ηA ‖p̄‖² + ηF ‖χ‖² + (ηΨ/2) ε ‖p̄‖² + (ηΨ/(2ε)) ‖χ‖² ,            (C.33)

which holds for any ε > 0. Now, choosing ε = −ηA/ηΨ, we obtain
V̇ ≤ (ηA/2) ‖p̄‖² + ( ηF − ηΨ²/(2ηA) ) ‖χ‖²

hence, choosing the matrix F such that

ηF < ηΨ²/(2ηA) ,

we have that V̇ ≤ 0.
The system asymptotic behavior as t → ∞, for any bounded torque disturbance generated by (C.10) with arbitrary bounded initial conditions z(0) ∈ Z, is then characterized by LaSalle's invariance principle: V̇ = 0 implies that asymptotically, as t → ∞, p̄ → 0 and χ → 0 (and consequently limt→∞ ξ = Y z). From the first equation of (C.23) it is possible to point out that q̄˙ → 0, i.e. q̄ tends to a constant vector q̄∞.
Again, from (C.28) it is possible to state that Ψ̃˙ → 0, hence Ψ̃ tends to a constant matrix Ψ̃∞.
The asymptotic behavior of the system (according to V̇ = 0) is then characterized by (C.24): as t → ∞ the following holds:

0 = −q̄∞ + Ψ̃∞ Y z                                                     (C.34)

As Y is a full-rank (hence invertible) matrix and z is a vector whose elements are all sinusoidal signals, the only solution to (C.34) is q̄∞ = 0 and Ψ̃∞ = 0. Hence we proved that

lim_{t→∞} x(t) = 0 .          □

Remark C.3 The fault detection and isolation phase can be performed by checking the state of the fault compensation unit, which automatically offsets the fault effect. In this framework the detection phase can be easily carried out by comparing ‖ξ(t)‖ with a suitably tuned threshold; in fact, as proved in Proposition C.2, ξ(t) asymptotically converges to Y z(t), which is zero in the un-faulty case and different from zero when a fault occurs.


Also the isolation procedure is possible in this framework. According to the modelling presented, it is clear that faults acting on different actuators are represented by different components of the exogenous signal Γz(t). In view of this, setting

ẑ(t) := Y⁻¹ ξ(t) ,

it is possible to define residual signals as

ri(t) = 1  if ‖Γi ẑ(t)‖ ≥ Ti ,     ri(t) = 0  otherwise     (i = 1 . . . n)

where Ti , (i = 1 . . . n) are n positive thresholds and Γi is the i-th row of matrix Γ.
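As a small illustration, the detection/isolation logic of this remark reduces to a few lines of code once ξ(t), Y and Γ are available; the snippet below is a hedged sketch in which the thresholds Ti are hypothetical tuning values.

```python
# Hedged sketch of the residual logic of Remark C.3. The thresholds T_i are
# hypothetical tuning values; xi_t is the internal-model state at time t and
# Y, Gamma come from the internal-model construction.
import numpy as np

def residuals(xi_t, Y, Gamma, T):
    """r_i(t) = 1 if ||Gamma_i zhat(t)|| >= T_i, with zhat(t) = Y^{-1} xi(t)."""
    zhat = np.linalg.solve(Y, xi_t)
    return np.array([int(np.abs(gamma_i @ zhat) >= T_i)
                     for gamma_i, T_i in zip(Gamma, T)])

# Example use (values hypothetical): r = residuals(xi_t, Y, Gamma, T=[0.5, 0.5]);
# r[i] == 1 flags a torque-ripple fault on the i-th actuator.
```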

C.5 Simulation results


In order to check the performance of the regulator designed above, a number of tests have been carried out, simulating the response of a 2-dof fully actuated mechanical robot subject to various sinusoidal torque disturbances acting on both joints. The results of these tests are shown in fig. C.2(a), fig. C.2(b) and fig. C.3. In particular, the disturbance ripple to be rejected was a sinusoidal signal δ(t) = V sin(Ω t) with amplitude V = 10 Nm and (unknown) frequency Ω = 0.4 rad/sec, occurring at time t = 100 sec and affecting only the first joint torque.
The internal model unit was connected at time t = 200 sec, while the frequency adaptation unit was connected only at time t = 300 sec. In fig. C.2(a) it is possible to clearly see the action of the adaptation unit, which completely removes the effect of the disturbance ripple on the tracking error of the first joint angle; moreover, it is worth pointing out that the adaptation unit makes it possible to obtain a perfect detection of the fault occurring on the first joint. In fig. C.3, the upper plot represents the disturbance ripple, while the last two plots show the behavior of the controlled torques and how the adaptation unit makes it possible to clearly see that the fault affects only the control torque acting on the first joint. The adaptation unit is then disconnected at time t = 500 sec for 200 sec, in order to point out that the effect of the exogenous ripple is always present and that a classical internal model designed for a wrong frequency is not able to completely overcome this disturbance.
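A script along the lines of the hedged sketch below reproduces a reduced version of the staged scenario just described on a simplified single-joint, unit-inertia model (not the 2-dof manipulator used to produce the figures): the ripple starts at t = 100 s, the internal model, initialized with the gain Ψ̂ of a wrong frequency, is connected at t = 200 s, and the adaptation law is switched on at t = 300 s. All gains, frequencies, amplitudes and the choice Ā = 0 (i.e. A = −I) are illustrative assumptions.

```python
# Hedged single-joint (M(q) = I) sketch of the staged test: ripple from t = 100 s,
# internal model connected at t = 200 s with a wrong-frequency gain, adaptation
# switched on at t = 300 s. Gains, frequencies, amplitudes and the choice
# Abar = 0 (so A = -I) are illustrative assumptions, not thesis data.
import numpy as np
from scipy.linalg import solve_sylvester

def psi_for(omega, F, G, Gamma):
    """Internal-model gain Psi = Gamma Y^{-1} for a given ripple frequency."""
    S = np.array([[0.0, omega], [-omega, 0.0]])
    Y = solve_sylvester(-F, S, G @ Gamma)          # Y S - F Y = G Gamma
    return Gamma @ np.linalg.inv(Y)

F = np.diag([-3.0, -4.0])                          # Hurwitz design matrix
G = np.array([[1.0], [1.0]])
Gamma = np.array([[1.0, 0.0]])
V_amp, omega_true = 10.0, 0.4                      # true (unknown) ripple
Psi_hat = psi_for(1.0, F, G, Gamma)                # internal model tuned to a wrong frequency

dt, T_end = 0.005, 500.0
qbar, pbar, xi = 0.0, 0.0, np.zeros(2)
log = []

for k in range(int(T_end / dt)):
    t = k * dt
    d = -V_amp * np.sin(omega_true * (t - 100.0)) if t >= 100.0 else 0.0
    im_on, adapt_on = t >= 200.0, t >= 300.0

    tau = (Psi_hat @ (xi - G.flatten() * pbar)).item() if im_on else 0.0
    dqbar = pbar                                   # unit inertia: dHbar/dq = qbar, dHbar/dp = pbar
    dpbar = -qbar - pbar + tau + d
    if im_on:
        # xi-dot = (F + G Psi_hat) xi - F G pbar - G (qbar + pbar) - G Psi_hat G pbar
        dxi = ((F + G @ Psi_hat) @ xi - (F @ G).flatten() * pbar
               - G.flatten() * (qbar + pbar) - (G @ Psi_hat @ G).flatten() * pbar)
    else:
        dxi = np.zeros(2)
    # adaptation law: Psi_hat-dot^T = -(xi - G pbar) pbar
    dPsi_hat = -pbar * (xi - G.flatten() * pbar) if adapt_on else np.zeros(2)

    qbar += dt * dqbar
    pbar += dt * dpbar
    xi = xi + dt * dxi
    Psi_hat = Psi_hat + dt * dPsi_hat              # broadcasts over the single row
    log.append((t, qbar))

log = np.array(log)
for t_mark in (150.0, 250.0, 450.0):
    window = np.abs(log[np.abs(log[:, 0] - t_mark) < 25.0, 1])
    print(f"max |qbar| around t = {t_mark:5.0f} s : {window.max():.3f}")
```

Qualitatively, the printed values mirror the figures: a large tracking ripple before the internal model is connected, a reduced but nonzero ripple while the internal model runs with the wrong frequency, and a much smaller error once the adaptation law has converged.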

C.6 Conclusions
The main result presented in this appendix is an adaptive internal model unit designed in order to compensate unknown spurious torque harmonics that degrade the performance of an n-dof fully actuated mechanical robot. We have shown how a standard robot tracking controller can be “augmented” with an internal model unit to achieve global implicit fault tolerance with respect to all the faults belonging to the model embedded in the regulator. We have also shown how it is possible to perform fault detection and isolation simply by testing the state of the internal model.
(a) Tracking error for θ1, angle position of the first joint      (b) Tracking error for θ2, angle position of the second joint

Figure C.2: Tracking error of the joint positions.

Figure C.3: From upper to lower plot: disturbance torque ripple acting on the first joint, controlled torque τ1 on the first joint and controlled torque τ2 on the second joint.
Bibliography

[1] R. Alur and D.L. Dill. A theory of timed automata. Theoretical Computer Science, (126):183–
235, 1994.

[2] J.D. Andrews and T.R. Moss. Reliability and Risk Assessment. Professional Engineering
Publishing, 2002.

[3] D. Bagnara. Configurazione sperimentale per la diagnosi automatica di motori ad induzione [Experimental setup for the automatic diagnosis of induction motors]. Master thesis, Faculty of Engineering, University of Bologna, 2001-2002.

[4] L. El Bahir, R. Gros, M. Kinnaert, C. Parloir, and J.Yamé. Final report on WP2. IFATIS
deliverable D2-5, January 2003.

[5] J. Balakrishnan and K.S. Narendra. Adaptive control using multiple models. IEEE trans-
actions on automatic control, 42(2), 1997.

[6] A. Bellini, F. Filippetti, G. Franceschini, and C. Tassoni. Closed-loop control impact on the
diagnosis of induction motor faults. IEEE Transactions on Industry Applications, 36(5):1318
– 1328, 2000.

[7] A. Benveniste, E. Fabre, C. Jard, and S. Haar. Diagnosis of asynchronous discrete event
systems, a net unfolding approach. Proceedings of the Workshop on discrete event systems,
2002.

[8] D.E. Bernard, G.A. Dorais, C. Fry, E.B. Gamble Jr., B. Kanefsky, J. Kurien, W. Millar,
N. Muscettola, P.P. Nayak, B. Pell, K. Rajan, N. Rouquette, B. Smith, and B.C. Williams.
Design of the remote agent experiment for spacecraft autonomy. Proceedings of IEEE
Aerospace, 1998.

[9] A. Birolini. Reliability Engineering: Theory and Practice. Springer-Verlag, 1999.

[10] M. Blanke. Aims and means in the evolution of fault tolerant control. Proceedings of the
European Science Foundation COSY workshop, Rome, 1995.

[11] M. Blanke. Aims and means in the evolution of fault tolerant control. In Proceedings of the
European science foundation COSY workshop, Roma, September 1995.

[12] M. Blanke, M. Kinnaert, J. Lunze, and M. Staroswiecki. Diagnosis and fault-tolerant control.
Springer-Verlag, 2003.

193

[13] M. Blanke, R. I. Zamanabadi, and S. A. Bøgh. Fault tolerant control systems: a holistic
view. Control Engineering Practice, 5(5), 1997.

[14] R. Boel and J. Van Schuppen. Decentralized failure diagnosis for discrete-event systems
with constrained communication between diagnosers. Proceedings of the Workshop on dis-
crete event systems, 2002.

[15] S.A. Bogh. Fault Tolerant Control Systems - a Development Method and Real-Life Case Study.
PhD thesis, Aalborg University, Department of Control Engineering, December 1997.
0908-1208.

[16] C. Bonivento, M. Capiluppi, L. Marconi, and A. Paoli. System analysis and decomposi-
tion methods. IFATIS deliverable D6-3, November 2003.

[17] C. Bonivento, L. Gentili, and A. Paoli. Implicit fault tolerant control of a robot manipula-
tor. submitted to CDC, 2004.

[18] C. Bonivento, A. Isidori, L. Marconi, and A. Paoli. Implicit fault tolerant control: Appli-
cation to induction motors. Automatica, 40(3):355–371, 2004.

[19] C. Bonivento, A. Paoli, and L. Marconi. Fault-tolerant control for a ship propulsion sys-
tem. In Proceedings of the ECC, Porto, Portugal, 2001.

[20] C. Bonivento, A. Paoli, and L. Marconi. A fault-tolerant strategy for induction motors.
40th IEEE Conference on Decision and Control, Orlando, 2001.

[21] C. Bonivento, A. Paoli, and L. Marconi. Fault-tolerant control for a ship propulsion sys-
tem. Control engeneering practice, 11(10), 2002.

[22] B. A. Brandin and W. M. Wonham. Supervisory control of timed discrete-event systems.


IEEE Transactions on Automatic Control, 39(2):329–342, February 1994.

[23] C. I. Byrnes, F. Delli Priscoli, A. Isidori, and W. Kang. Structurally stable output regula-
tion of nonlinear systems. Automatica, 33(2):369 – 385, 1997.

[24] C.I. Byrnes, F. Delli Priscoli, and A. Isidori. Output regulation of uncertain nonlinear systems.
Birkhäuser, Boston, 1997.

[25] F. Caliskan and R. Vepa. A real-time reconfiguration algorithm for aircraft flight control.
Proceedings of conference on Aerospace Vehicle Dynamics and Control, Cranfield Institute of
Technology, 1994.

[26] C.G. Cassandras and S. Lafortune. Introduction to discrete event systems. Kluwer Academic
Publisher, 1999.

[27] P.R. Chandler, M. Pachter, and M. Mears. System identification for adaptive and recon-
figurable control. Journal of guidance, control and dynamics, 18(3), 1995.

[28] J. Chen and R.J. Patton. Robust model based fault diagnosis for dynamic systems. Kluwer
academic publishers, Boston, 1999.

[29] Y.-L. Chen and G. Provan. Modeling and diagnosis of timed discrete event systems -
A factory automation example. Technical Report SC-PP-96-72, Rockwell Science Center,
Thousand Oaks, CA, September 1996.

[30] Y.-L. Chen and G. Provan. Modeling and diagnosis of timed discrete event systems - a
factory automation example. In Proceedings of the 1997 American Control Conference, pages
31–36, Albuquerque, NM, June 1997.
[31] M.O. Cordier and L. Rozé. Diagnosing discrete-event systems : extending the “diagnoser
approach” to deal with telecommunication networks. Journal on Discrete Event Dynamic
Systems, 12(2):43 – 81, 2002.
[32] F. Cristian. Understanding fault-tolerant distributed systems. Comm. ACM, 34(2):57–78,
1991.
[33] J. de Kleer and J. Kurien. Fundamentals of model-based diagnosis. Proc. of the SAFEPRO-
CESS’03, 2003.
[34] J. de Kleer, A. Mackworth, and R. Reiter. Characterizing diagnoses and systems. Artificial Intelligence, 56(2-3), 1992.
[35] R. Debouk, S. Lafortune, and D. Teneketzis. On an optimization problem in sensor selec-
tion. International Journal of Control, 12(4):417 – 445, 2002.
[36] P. Declerk and M. Staroswiecki. Characterisation of the canonical components of a struc-
tural graph for fault detection in large scale industrial plants. Proc. European Control
Conference, Grenoble, 1991.
[37] R.E. Ebert. User interface design. Prentice Hall, Englewood Cliffs, N.J., 1994.
[38] P. M. Frank. Fault diagnosis in dynamic systems using analytical and knowledge based
redundancy: a survey and some new results. Automatica, 26(3), 1990.
[39] K. Fujimoto, K. Sakurama, and T. Sugie. Trajectory tracking control of port-controlled
hamiltonian systems via generalized canonical transformations. Automatica, 39(12):2059–
2069, 2003.
[40] M. Takegaki and S. Arimoto. A new feedback method for dynamic control of manipula-
tors. ASME Journal of Dynamic Systems Measurement and Control, 102, 1981.
[41] E. Garcia, F. Morant, and R. Blasco-Giménez. Centralized modular diagnosis and the
phenomenon of coupling. Proceedings of the Workshop on discrete event systems, 2002.
[42] G. Gentile, N. Rotondale, F. Filippetti, G. Franceschini, M. Martelli, and C. Tassoni. Anal-
ysis approach of induction motor stator faults to on-line diagnostics. In Proceedings of
ICEM90, 1990.
[43] G. Gentile, N. Rotondale, M. Martelli, and C. Tassoni. Harmonic analysis of induction
motors with stator faults. Electrical Machines Power Systems, (22):215 – 231, 1994.
[44] L. Gentili and A. J. van der Schaft. Regulation and input disturbance suppression for
port-controlled hamiltonian systems. 2nd IFAC Workshop LHMNLC, Seville, Spain,
2003.
[45] J. Gertler. Fault detection and diagnosis in engineering systems. Marcel Dekker, 1998.
[46] B.E. Goldberg, K. Everhart, R. Stevens, N. Babbit III, P. Clemens, and L. Stout. System
engineering toolbox for design-oriented engineers. Reference publication 1358, NASA,
1994.

[47] C. Hadjicostis. Probabilistic fault detection in finite-state machines based on state occu-
pancy measurements. Proceedings of the 41st IEEE conference on decision and control, 2002.

[48] D. Harel. Statecharts: a visual formalism for complex system. Science of computer pro-
gramming, 8:231–374, 1987.

[49] D.M. Himmelblau. Fault detection and diagnosis in chemical and petrochemical processes.
Elsevier, 1978.

[50] R. Izadi-Zamanabadi. Fault-tolerant supervisory control - system analysis and logic de-
sign. Ph.D. thesis, Department of Control Engineering, Aalborg University, 1999.

[51] P. Jalote. Fault tolerance in distributed systems. Prenctice Hall, Englewood Cliffs, N.J., 1994.

[52] S. Jiang and R. Kumar. Failure diagnosis of discrete event systems with linear-time tem-
poral logic fault specifications. Proceedings of the American Control Conference, 2002.

[53] B. Johnson. Design and analysis of fault-tolerant digital systems. Addison Wesley, Reading, Mass., USA, 1989.

[54] H. Kopetz. Real-time systems: design principles for distributed embedded applications. Real-
time systems. Kluwer academic publishers, London, 1997.

[55] R. Kumar and S. Jiang. Diagnosis of repeated failures in discrete event systems. Proceed-
ings of the 41st IEEE conference on decision and control, 2002.

[56] S. Lafortune, D. Teneketzis, M. Sampath, R. Sengupta, and K. Sinnamohiden. Failure


diagnosis of dynamic systems: an approach based on discrete event systems. Proceedings
of the American Control Conference, 2001.

[57] J.C. Laprie. Dependability: basic concepts and terminology. Springer Verlag, Vienna, Austria, 1992.

[58] P.A. Lee and T. Anderson. Fault tolerance: principles and practice. Springer Verlag, Vienna, Austria, 1990.

[59] E. Lewis. Introduction to reliability engineering. John Wiley and Sons, 1997.

[60] F. Lin, A.F. Vaz, and W.M. Wonham. Supervisor specification and synthesis for discrete
event system. International Journal of Control, 48(1):321 – 332, 1998.

[61] A. De Luca and R. Mattone. Actuator failure detection and isolation using generalized
momenta. ICRA, Taipei, Taiwan, 2003.

[62] J. Lunze. State observation and diagnosis of discrete-event systems described by stochas-
tic automata. Journal on Discrete Event Dynamic Systems, 11(4):319 – 369, 2001.

[63] N. Lynch, R. Segala, and F. Vaandrager. Hybrid I/O automata. Information and computa-
tion, (185):105–157, 2003.

[64] U. Maier and M. Colnaric. Some basic ideas for intelligent fault tolerant control systems
design. IFAC World Congress 2002, Barcelona, 2002.

[65] M. Malek. Responsive computing. Real-time systems. Kluwer academic publishers, Lon-
don, 1994.

[66] L. Marconi, C. Bonivento, A Paoli, and R. Costi. System description. IFATIS deliverable
D6-2, August 2002.
[67] R. Marino, S. Peresada, and P. Valigi. Adaptive input-output linearizing control of induc-
tion motors. IEEE Transactions on Automatic Control, 38(2):208 – 221, 1993.
[68] J. Mauss, V. May, and M. Tatar. Towards model-based engineering: Failure analysis with
mds. Workshop on Knowledge-Based Systems for Model-Based Engineering, European Confer-
ence on AI, 2000.
[69] M.A. Massoumnia. A geometric approach to the synthesis of failure detection filters. IEEE Transactions on Automatic Control, 31(3), 1986.
[70] M.A. Massoumnia, G.C. Verghese, and A.S. Willsky. Failure detection and identification. IEEE Transactions on Automatic Control, 34(3), 1989.
[71] S. Mullender. Distributed systems. Addison Wesley, Reading, Mass., USA, 1995.
[72] V.O. Nikiforov. Adaptive non-linear tracking with complete compensation of unknown
disturbance. European Journal of Control, (4):132 – 139, 1998.
[73] R. Ortega. Some applications and recent results on passivity based control. 2nd IFAC
Workshop on Lagrangian and Hamiltonian Methods for Nonlinear Control, Seville,
Spain, 2003.
[74] R. Ortega, P.J. Nicklasson, and G. Espinosa. On speed control of induction motors. Auto-
matica, 32(3):455 – 460, 1996.
[75] D.N. Pandalai and L.E. Holloway. Template languages for fault monitoring of timed
discrete event processes. IEEE Transactions on Automatic Control, 45(5):868 – 882, 2000.
[76] A. Paoli and S. Lafortune. Safe diagnosability of discrete event systems. Technical Re-
port CGR-03-02, System Science and Engineering Division, Department of Electrical En-
gineering and Computer Science, The University of Michigan, 2003.
[77] A. Paoli and S. Lafortune. Safe diagnosability of discrete event systems. Proceedings of
cdc 2003, Maui, Hawaii, 2003.
[78] A. Paoli and S. Lafortune. Safe diagnosability for fault tolerant supervision of discrete
event systems. Submitted to Automatica, March 2004.
[79] R.J. Patton. Fault-tolerant control: the 1997 situation. In Proceedings of the IFAC symposium
on fault detection and safety for technical processes, Hull, 1997.
[80] R.J. Patton, P.M. Frank, and R.N. Clark. Issues of fault diagnosis for dynamical systems.
Springer-Verlag, 2000.
[81] L. Pau. Failure diagnosis and performance monitoring. Marcel Dekker, 1981.
[82] M. Pease, R. Shostak, and L. Lamport. Reaching agreement in the presence of faults.
Journal of the ACM, 27(2):228–234, 1980.
[83] S. Peresada and A. Tonielli. High-performance robust speed-flux tracking controller for
induction motor. International Journal on Adaptive Control and Signal Processing, 14(2-3):177
– 200, 2000.

[84] S. Peresada, A. Tonielli, and R. Morici. High-performance indirect field-orientated


output-feedback control of induction motors. Automatica, 35(6):1033 – 1047, 1999.

[85] C. De Persis and A. Isidori. A geometric approach to nonlinear fault detection and isola-
tion. IEEE Transactions on Automatic Control, 45(6), 2001.

[86] J. W. Polderman and J. C. Willems. Introduction to mathematical systems theory: a behavioral


approach. Texts in applied mathematics. Springer, 1998.

[87] G. Provan. Distributed diagnosability properties of discrete event systems. Proceedings of


the American Control Conference, 2002.

[88] H.E. Rauch. Autonomous control reconfiguration. IEEE control systems magazine, 1995.

[89] M.P. Sachenbacher and R. Weber. Advances in design and implementation of OBD func-
tions for diesel injection systems based on a qualitative approach to diagnosis. Proceedings
of the SAE 2000 World Congress, 2000.

[90] M. Sampath, S. Lafortune, and D. Teneketzis. Active diagnosis of discrete event systems.
IEEE Transactions on Automatic Control, 43(7):908 – 929, 1998.

[91] M. Sampath, R. Sengupta, and S. Lafortune. Diagnosability of discrete event systems.


IEEE Transactions on Automatic Control, 40(9):1555 – 1575, 1995.

[92] M. Sampath, R. Sengupta, S. Lafortune, K. Sinnamohideen, and D.C. Teneketzis. Failure


diagnosis using discrete-event models. IEEE Transactions on Control Systems Technology,
4(2):105 – 124, 1996.

[93] A. Serrani, A. Isidori, and L. Marconi. Semiglobal nonlinear output regulation with adap-
tive internal model. IEEE Transactions on Automatic Control, 46(8):1178 – 1194, 2001.

[94] S. Simani, C. Fantuzzi, and R.J. Patton. Model-based fault diagnosis in dynamical systems
using identification techniques. Springer Verlag, 2002.

[95] J.A. Stankovic and K. Ramamritham. Hard real-time systems. IEEE Press, 1988.

[96] M. Staroswiecki, S. Attouche, and M.L. Assas. A graphic approach for reconfigurability
analysis. 10th Int. Workshop on principle of diagnosis, Loch Awe, 1999.

[97] R. Su and W.M. Wonham. Probabilistic reasoning in distributed diagnosis for qualitative
systems. Proceedings of the 41st IEEE conference on decision and control, 2002.

[98] K. Suyama. Reliable observer based control using vector valued decision by majority.
Proceedings of cdc 1999, USA, 1999.

[99] S. Williamson and K. Mirzoian. Analysis of cage induction motors with stator winding
faults. IEEE Transactions on Power Application Systems, 104:1838 – 1842, 1985.

[100] A. Teel. A nonlinear small gain theorem for the analysis of control systems with satura-
tions. IEEE Transactions on Automatic Control, 41(9):1256 – 1270, 1996.

[101] D. Thorsley and D. Teneketzis. Diagnosability of stochastic discrete-event sys-


tems. Proceedings of the CDC, 2003.

[102] A.J. van der Schaft. L2 -gain and Passivity Techniques in Nonlinear Control. Springer-Verlag,
London, UK, 1999.

[103] P. Vas. Parameter estimation, condition monitoring and diagnosis of electrical machines. Oxford
science publications, 1994.

[104] J. C. Willems. Paradigms and puzzles in the theory of dynamical systems. IEEE Transac-
tions on automatic control, 36(3):259–294, 1991.

[105] J.C. Willems. Stability Theory of Dynamical Systems. Nelson, 1970.

[106] B.C. Williams and P.P. Nayak. A model-based approach to reactive self-configuring sys-
tems. Proc. of the First National Conf. on Artificial Intelligence, 1996.

[107] S. Williamson and A. C. Smith. Steady-state analysis of three-phase cage motors with
rotor bar and end ring faults. Proceedings of the Institution of Electrical Engineers, 129(3):93 –
100, 1982.

[108] W. M. Wonham. Notes on control of discrete event systems. ECE 1636F/1637S


2002-2003. Systems Control Group, Dept. of ECE, University of Toronto, URL:
www.control.utoronto.ca/people/profs/wonham/wonham.html.

[109] N. E. Wu. Reliability of fault tolerant control systems: Part i. 40th IEEE Conference on
Decision and Control, Orlando, 2001.

[110] N. E. Wu. Reliability of fault tolerant control systems: Part ii. 40th IEEE Conference on
Decision and Control, Orlando, 2001.

[111] R. I. Zamanabadi and M. Blanke. Ship propulsion system as a benchmark for fault toler-
ant control. Control engeneering practice, 7(2), 1999.
Index

Symbols Byzantine failure, 38


Fi -certain, 110 Byzantine failures, 35
Fi -indeterminate cycle, 110
Fi -uncertain, 110 C
n-dof fully actuated mechanical robot, 181 calendar time, 91
cascade connection, 132, 136
A central unit, 95
a-L∞ bound, 133 certainty equivalence, 10, 129
Accessible part, 164 chance fault, 36
Active Diagnosis Problem, 120 change of coordinates, 141
active event function, 163 clusters, 28
active fault tolerance, 20 Coaccessible part, 165
active safe diagnosis problem, 122 common rail, 95
activity machine, 118
communication network interface, 28
Actuator faults, 18
Complement, 165
adaptation law, 145
complex diagnosis system, 88
adaptive control, 20, 129
complex diagnosis systems, 85
adaptive fault compensation unit, 145
complexity, 33
adaptive internal model, 144
composability, 27, 32
adaptive robust control, 9
computer delay, 30
additive faults, 17
Concatenation, 162
admissible, 168
concatenation, 162
affine, 136
consistency-based diagnosis, 20
alarm monitoring, 29
consistent failures, 35
alphabet, 162
constraints, 17
analytical redundancy, 9
Control function, 44
application-specific fault tolerance, 37
Control re-design, 16
Artificial Intelligence, 22
Control reconfiguration, 26
Availability, 31
controllability, 170
B Controllability and observability theorem, 171
bad diagnosis, 88 controllability condition, 170
behavior, 17 Controllability theorem, 170
behavioral approach, 17 controllable, 170
benign failures, 35 controllable events, 168
blocking, 164, 169 cross-group resource sharing, 60
broken bars, 137 cross-groups re-allocation, 51, 58
bus, 28 cycle, 110


D failure mode, 88
damping, 184 failure modes, 54
dead time, 30 Failure Modes and Effects Analysis, 85
dead-zone function, 142 failure rate, 87
deadlock, 164 false alarms, 86, 88
dependability, 27 fault, 16, 35, 36
dependability requirements, 30 fault accommodation, 25
dependable real-time service, 33 fault compensation unit, 140
design fault, 36 Fault detection, 20
detectability, 21 Fault Detection and Isolation, 9
detection phase, 148 Fault diagnosis, 16
deterministic automaton, 163 fault diagnosis system, 9
development faults, 36 Fault estimation, 20
Diagnosability, 106 Fault identification, 20
diagnosability, 21 Fault isolation, 20
diagnoser, 109 fault propagation, 86
Diagnostic Problem, 20 Fault Propagation Analysis, 92
diesel engines, 95 fault propagation analysis, 85
Direct digital control, 29 Fault Propagation Tree, 92
Discrete event system, 161 Fault Tolerant Control/Measure (FTC/M) mod-
distributed system, 27 ule, 43
disturbances, 18 fault tolerant unit, 36
dynamic eccentricity, 137 fault tree, 48, 54
fault-isolation, 9
E
Fault-Tolerant control, 16
Elementary cells, 88
fault-tolerant unit, 38
empirical failure rate, 87
faulty motor, 137
empirical reliability function, 87
faulty state, 89
empty string, 162
empty trace, 104 filter, 9
equivalent, 163 finite escape time, 136
error, 34–37 Finite State Machine, 102
error-containment regions, 33 flux subsystem, 132
error-detection latency, 30 forbidden timed strings, 119
estimator, 44 form follows function, 27
event, 161 FTC function, 44
event set, 161 FTC interfaces, 44
event-triggered, 32 FTM function, 44
exclusion law, 17 FTM interfaces, 45
exosystem, 130, 140 Function monitor, 44
explicit FTC, 10 functional requirements, 29
external faults, 36 functionality tree, 53
external losses, 96
G
F gateway, 33
fail-silent failure, 35 global resource and reconfiguration manager,
failure, 18, 35 47
failure analysis tools, 48 Global RRM level reconfiguration, 51
failure free operating time, 87 globally asymptotically stable, 136

group resource and reconfiguration manager, Kleene-closure, 162


47
Group RRM level reconfiguration, 51 L
Group selection, 59 language, 162
language generated, 163
H language marked, 163
Hamiltonian, 189 LaSalle theorem, 147
hard real-time systems, 27 legal behavior, 168
hardware redundancy, 9 legal language, 120
hazard analysis, 94 livelock, 164
Hazard Matrix, 93 liveness assumption, 106
Hazard matrix, 86 load torque, 133
healthy state, 89 local exponential stability, 143
heating system, 66 Local reconfiguration and mode control, 44
high pressure pump, 95 local redundancies, 50
high pressure sensor test, 96 loss diagnosis, 88
low pressure pump, 95
I low-impact reconfiguration, 58
IFATIS, 41
illegal language, 108 M
implicitly fault tolerant, 130 maintainability, 31
in-group re-allocation, 51, 58 malicious failures, 35
In-group resource sharing, 65 malign failures, 35
in-group resource sharing, 60 man-machine interface, 29
inconsistent failures, 35 marked states, 163
indirect field oriented controller, 130, 132 Markov Chains, 85
Induction Motors, 130 mean time between failures, 32
injectors, 95 mean time to failure, 31
input-to-state stable, 133 mean time to repair, 31
integral action, 137 Mechanical reconfiguration, 9
intentional fault, 36 Membership service, 38
Interlaced reconfiguration, 66 mesh network, 28
Interlaced reconfigurations, 60 missed diagnosis, 86
intermittent failure, 35 model based FDI, 9
internal fault, 36 Model Based-Diagnosis, 22
internal losses, 96 model of Alur and Dill, 118
internal model, 130 model uncertainties, 18
Intra-group resource sharing, 61 modified automaton, 114
inverse projection, 166 Module level reconfiguration, 51
isolation procedure, 148 Montecarlo method, 98
ISS Lyapunov function, 135 multiplicative fault, 18
ISS-Lyapunov function, 133, 135
N
item, 88
natural projection, 166
J node, 27
jitter, 30 non live language, 108
nonblocking, 164, 169
K Nonblocking Controllability theorem, 170
Kleene closure, 102 nondeterministic automaton, 164

O reliability block diagram, 93


object delay, 30 reliability computation, 85
observability, 171 reliability function, 87
observable, 171 reliable state reliability, 90
observable events, 102, 169 reliable state reliability function, 91
observer, 166 reliable state statistical residual matrix, 92
operation faults, 36 replication, 34
output regulation, 130, 181 required function, 88
residual, 9, 21
P
Residual evaluation, 21
Parallel composition, 165
Residual generation, 21
partial observation supervisor, 120
residual matrix, 88
partial-observation supervisor, 169
resource monitor, 47
passive fault tolerance, 20
Resource needs, 44
passivity based control, 185
Resource sharing, 60
permanent error, 36
responsive system, 33
permanent failures, 35
ring, 28
permanent fault, 36
rise time, 30
physical fault, 36
robust control, 20
Plant faults, 18
rotor, 130
plug-and-play device, 141
Rotor asymmetries, 137
port-Hamiltonian, 182
rotor flux, 131
port-Hamiltonian systems, 185
rotor speed, 131
post language, 104
prefix, 162
S
prefix-closed language, 102
Safe Diagnosability, 108
Prefix-closure, 162
safe diagnoser, 124
prefix-closure, 104
safe state reliability, 90
pressure regulator, 95
safe state reliability function, 91
pressure sensor, 95
safe state statistical residual matrix, 92
primary event, 29
safe time-diagnosability, 118
Product, 165
safe-state, 30
projection, 103, 104, 166
safety, 31
R safety-critical applications, 35
rail, 95 sampling period, 30
range function, 110 saturation function, 142, 144
real-time entity, 29 scalability, 27
real-time image, 29 scalable architecture, 33
reconfigured state, 89 security, 32
redundancy, 37 Sensor faults, 18
Redundancy management, 38 severity assessment, 85
region of danger, 19 sharing policy, 61
region of degraded performance, 19 short circuit, 137
region of required performance, 18 shut-off valve test, 96
region of unacceptable performance, 19 signal conditioning, 29
regular, 167 sinusoidal corruption term, 137
regular expression, 168 slip angular frequency, 133
reliability, 10, 31, 88 slot harmonics, 137

small gain theorem, 134, 136, 143 U


speed subsystem, 132, 134 ultrahigh reliability, 31
sphere of control, 29 uncontrollable events, 168
spurious harmonic currents, 130 unobservable events, 166, 169
standard realization, 171
state estimation, 22 V
state observation, 22 value failures, 35
static eccentricity, 137 voter, 38
statistical description, 86 W
statistical residual matrix, 86, 92 water in diesel sensor, 95
stator, 130
Stator asymmetries, 137 Y
steady state robustness, 137 Young’s inequality, 133, 147, 190
steady-state flux decoupling, 132
strings, 162
Structural analysis, 49
structural faults, 50
substring, 162
suffix, 162
supervisor, 19, 104
Supervisory control, 120
supervisory switching control, 9
supremal controllable and observable sub-
language, 120
supremal controllable sublanguage, 120
Sylvester equation, 141, 186
systematic fault tolerance, 37

T
temperature sensor test, 96
temporal behavior, 118
temporal logic framework, 109
tick events, 118
time diagnosability, 118
time unfolding, 118
time-triggered, 33
timed automata, 118
timing failures, 35
torque disturbances, 187
transient error, 36
transient failures, 35
transient fault, 36
transition function, 163
Trim operation, 165
triple modular redundancy, 38
two tank system, 71
two-faced failures, 35
two-phase model, 130
Curriculum vitae

Andrea Paoli was born on 5th September 1975. He took his degree on 12th July 2000 with an experimental thesis developed at the Laboratory of Automation and Robotics (LAR) of the University of Bologna. Post lauream, he began his research period at LAR, developing the subject of his thesis with the support of a scholarship. In February 2001 he began his PhD (XVI cycle) and his research activity within the Department of Electronics, Computer Science and Systems (DEIS) under the supervision of Prof. Claudio Bonivento.
Since 2001 he has been active within the European Project IFATIS (Intelligent Fault Tolerant Control in Integrated Systems) (IST-2001-32122). From September 2002 through January 2003 he was an invited scholar at the Department of Electrical Engineering and Computer Science of the University of Michigan in Ann Arbor (U.S.A.), where he began his collaboration with Prof. Stéphane Lafortune.
He is a member of the Center for Research on Complex Automated Systems (CASY) “Giuseppe Evangelisti”. His research interests include fault detection, isolation and identification of failures in complex dynamical systems, fault tolerant control and supervision, and failure diagnosis of complex systems using discrete event systems.
His theoretical background also includes output regulation and robust global stabilization of nonlinear systems using a geometrical approach.

