Fault Detection and Fault Tolerant Control
Ph.D. Thesis
Andrea Paoli
University of Bologna
XVI Ciclo
fault tolerant control, fault diagnosis, reliability, distributed systems, output regulation theory,
discrete event systems.
Copyright © 2004 by Andrea Paoli. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means,
electronic or mechanical, including photocopy, recording or any information storage and re-
trieval system, without permission in writing from the author.
Acknowledgments:
This work has been partially funded by the EC-Project IFATIS (Intelligent Fault Tolerant Control
in Integrated Systems), sponsored by the European Commission in the IST programme 2001 of
the 5th EC framework programme (IST-2001-32122).
This work has been partially funded by MIUR (Ministero dell’istruzione, dell’università e della
ricerca).
Haec autem ita fieri debent, ut habeatur
ratio firmitatis, utilitatis, venustatis.
Preface
Automated systems are vulnerable to faults, such as defects in sensors or actuators and
failures in controllers or in the control loop, which can cause undesired reactions and
consequences such as damage to technical parts of the plant, to personnel or to the
environment. The main objective of the Fault Detection and Isolation (FDI) research area,
widely addressed from several points of view in recent years (see, among others, [28], [80],
[38]), is to study methodologies for identifying and exactly characterizing possible incipient
faults arising in predetermined parts of the plant. This is usually achieved by designing a
dynamical system (filter) which, by processing input/output data, is able to detect the
presence of an incipient fault and possibly to isolate it precisely by generating the
so-called residual signals (see, among others, [70], [69] for the linear case and [85] for the
nonlinear case).
This is an important problem to deal with, since faults in sensors, actuators and components
result in increased operating costs, off-specification production, line shut-down and possible
detrimental environmental impact. As highlighted in [28], we use the term fault detection to
describe the problem of making a binary decision: either something has gone wrong or everything
is fine. This task may appear trivial but, to be useful, faults should be detected early,
before they become serious. As the reader can imagine, the uncertainty about the correct
behavior of the plant and the disturbances and noise affecting the measurements make early and
robust fault detection difficult to achieve reliably. The next step is the so-called fault
isolation, in which the source of the fault is determined (cf. [28], [38]). We will refer to
the FDI problem as the combined problem of fault detection and isolation, while with the term
fault diagnosis system we refer to the procedure used to detect and isolate faults and to
assess their severity.
The FDI mechanism can be realized by replicating hardware; this technique, known in the
literature as hardware redundancy, relies on comparing the outputs of identical hardware for
consistency. Using a different approach, known in the literature as analytical redundancy, the
FDI task can be accomplished using analytical and functional information about the system or,
in other words, using a mathematical model of the plant (model-based FDI).
Once a fault has been detected and isolated, the next natural step is to reconfigure the
control law in order to tolerate the fault, namely to guarantee pre-specified performance for
the faulty system. In this framework, the FDI phase is usually followed by the design of a
Fault Tolerant Control (FTC), namely by the design of a reconfiguring unit which, on the basis
of the information provided by the FDI filter, adjusts the controller in order to achieve the
prescribed performance also for the faulty system. Many strategies can be followed at this
stage. Mechanical reconfiguration, such as switching between redundant hardware or mechanical
parameters (see [98] and [25]), adaptive robust control (see [27]) and supervisory switching
control (see [5]) are just a few examples of the strategies which can be pursued to deal with
faulty plants. The basic idea behind these approaches is that of explicit FTC, in the sense
that they are based on an “explicit” control reconfiguration which follows the “explicit”
fault estimation performed by the FDI unit.
It is worth stressing that, in general, the design of a reliable FDI/FTC unit has to tackle
important problems which characterize real applications, such as the presence of unknown
parameters or neglected dynamics in the model of the system, unknown disturbances acting on
the system, the lack of a reliable fault model, etc. All these and other factors can affect
the performance of the FDI and FTC scheme, in the sense that faults could be detected even if
not really present (false alarms) or, in a more critical and dangerous scenario, faults may
not be detected at all (or may be detected with unacceptable delay). Clearly, the design of an
FDI/FTC control unit with pre-specified performance is easier if the designer has real data
concerning the disturbances and uncertainties characterizing the plant. This, in general,
allows a design tailored to the specific application and leads to simpler and more reliable
FDI/FTC units.
A further aspect worth mentioning is the effect that a closed loop controller can have on the
reliability of the overall FDI/FTC control system. As a matter of fact, most of the present
literature on the design of FDI units deals with “open loop” systems, in the sense that the
design of the FDI filter is usually addressed without taking into account the presence of a
feedback loop. A feedback regulator designed to be robust to exogenous disturbances may, in
some circumstances, also mask the effect of an incipient fault, improving in this way the
fault tolerance of the system but also compromising the effectiveness of the FDI phase. These
considerations suggest that, in general, the design of a reliable closed loop FDI/FTC control
system amounts to an integrated design of closed loop regulators, aiming at enforcing the
desired specifications, and explicit FDI and FTC units.
It is clear from this description that the classical approach to FDI and FTC relies upon a
“certainty equivalence” idea extensively used in the context of adaptive control, since it is
based on the explicit estimation of unknown time varying signals/parameters (in this specific
case the faults) by the FDI filter and on the subsequent explicit reconfiguration of the
controller in the presence of faults.
These topics have mainly been developed considering centralized dynamical systems with a
limited number of faults. By contrast, most of today's advanced applications have a
distributed nature (see e.g. the automotive field). In distributed large scale control
systems, faults affect local subsystems, but their effects propagate throughout the system.
This introduces new problems to deal with, and the countermeasures have to make use of the
structural properties of this class of systems.
This brief discussion highlights the need for a unified framework for fault tolerant control
of large scale systems.
This work is a collection of results towards a unified framework for Fault Tolerant Control
in Distributed Control Systems. It starts with a survey, Chapter 1, which illustrates concepts,
definitions and classical results about Fault Detection and Isolation and Fault-Tolerant
Control and introduces basic concepts in distributed computer system architectures. In
Chapter 2 a novel architecture for Fault Tolerant Control is presented and some design
guidelines are given. This chapter presents the work developed within the EC-Project IFATIS
(Intelligent Fault Tolerant Control in Integrated Systems), partly funded by the European
Commission in the IST programme 2001 of the 5th EC framework programme (IST-2001-32122).
Chapter 3 deals with a specific step of the design procedure presented in the previous
chapter: reliability prediction. In more detail, a procedure to evaluate the reliability of a
complex distributed diagnosis system in the framework of Fault Tolerant Control is
illustrated, starting from classical reliability
First of all I would like to thank my supervisor, Prof. Claudio Bonivento, who introduced me
to this field three years ago and who has placed his trust in me since the first day of this
period. Sincere thanks go to Prof. Lorenzo Marconi, for his valuable suggestions and for the
always fruitful discussions.
I cannot fail to thank Prof. Alberto Isidori, for his priceless suggestions and teaching.
Special thanks go to Prof. Stéphane Lafortune of the University of Michigan, for his
continuous encouragement during my stay in Ann Arbor and his stay in Bologna, for his teaching
and for believing in me.
I always remember with pleasure the stimulating discussions with Prof. Demosthenis Teneketzis
and Prof. Jessy W. Grizzle: thank you for passing on to me your love for science.
Special thanks go to all the guys of the EECS group of the University of Michigan: Chadi,
Doron, Kurt, Olivier, Pierre, Shaika. You made my stay in Ann Arbor unforgettable.
Thanks to the other members of my working group, Luca Gentili and Marta Capiluppi: you are
not only good friends but also valuable colleagues, and it has been a pleasure to work with you.
Of course, thanks to all the students who completed their master's theses under my
supervision: many of the ideas included in this work arose from discussions with you. Special
thanks go to Davide Bagnara for the support given in the experimental activity presented in
this work.
How can I forget all the guys of the Laboratory of Automation and Robotics? Alessandro,
Andrea, Cristian, Fabio, Luigi, Marcello, Marco, Nicola, Raffaella: with your friendship you
made these years priceless.
Thanks to all my friends, and in particular to Emiliano for his true friendship and his help
during all the difficult moments, and to Luisa, with whom I shared the American experience.
Last but not least, special, heartfelt thanks go to my family, for always trusting in me and
for always encouraging me.
Chapter 1
Distributed systems and fault tolerance:
an overview
faults have far reaching consequences. These are just some examples of how a fault is
something that changes the behavior of a technological system such that the system no longer
satisfies its purpose.
Fault-tolerant control concerns the interaction between a given system (plant) and a
controller that includes not only the usual feedback or feedforward control law, but also the
decision making layer that determines the control configuration. This layer analyzes the
behavior of the plant to identify faults and changes the control law to hold the closed-loop
system in a region of acceptable performance. A fault-tolerant controller has the ability to
react to the existence of the fault by adjusting its activities to the faulty behavior of the
plant.
Generally, the way to make a system fault-tolerant consists of two steps:
1. Fault diagnosis: The existence of a fault has to be detected and the fault identified.
2. Control re-design: The controller has to be adapted to the faulty situation so that the overall
system continues to satisfy its goal.
[Figure 1.1: the behavior B as a subset of the signal space U × Y.]
overall system, what matters is the output y(t) that the plant generates when it receives the
input u(t). The pairs (u, y) are called input/output pairs (I/O pairs), and the set of all
possible pairs of trajectories u and y that may occur for a given plant defines the behavior B¹.
As sketched in fig. 1.1, the behavior B is a subset of the space U × Y of all possible
combinations of input and output signals. The dot A in the figure represents a single I/O pair
that may occur for the given system; by contrast, C represents a pair that is not consistent
with the system dynamics. Consider as an example a static system, which is described by the
equation
y(t) = ku(t) ,
where k is the static gain. The input and the output are elements of the set R. The behavior is
given by
B = {(u, y) : y = k · u} .
For a dynamical system the I/O pairs have to include the whole time functions u(t) and y(t)
that represent the input and output signals.
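The static example can be made concrete with a short sketch. The function name `in_behavior`, the numerical gain and the tolerance `tol` are assumptions introduced here for illustration only; the tolerance is there because measured data are never exact.

```python
def in_behavior(u: float, y: float, k: float, tol: float = 1e-9) -> bool:
    """Return True if the I/O pair (u, y) belongs to B = {(u, y) : y = k*u}."""
    return abs(y - k * u) <= tol


k = 2.0               # hypothetical static gain
pair_a = (1.5, 3.0)   # plays the role of point A: consistent with y = k*u
pair_c = (1.5, 4.0)   # plays the role of point C: not consistent with the system

print(in_behavior(*pair_a, k), in_behavior(*pair_c, k))
```

For a dynamical system the same membership test would have to be applied to whole trajectories rather than to single values.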
When a fault occurs it changes the behavior of the system; for example in fig. 1.2, instead of
the white set, the system behavior is moved to the grey set. If a common input U is applied to
the faultless and the faulty system, then the two systems answer with the different outputs YA
and YB, respectively. The points A = (U, YA) and B = (U, YB) differ and lie in the white and
the grey set, respectively. This change in the system behavior makes the detection and
isolation of the fault possible, unless the I/O pair lies in the intersection B0 ∩ Bf.
[Figure 1.2: faultless behavior B0 and faulty behavior Bf in U × Y, with the I/O pairs A and B.]
Models represent constraints that the signals U and Y must satisfy in order to be relevant for
the plant (see [86]). Depending on the kind of system considered, the constraints can have the
form of algebraic relations, differential or difference equations, automata tables or
behavioral relations of automata. For a given input U the model yields the corresponding
output Y.
In continuous-variable systems described by an analytical model (e.g. a differential
equation), faults are usually described as additional external signals or as parameter
deviations. Faults of the first form are called additive faults and are represented in the
model by an unknown input that enters the model equation as an addend. Faults of the second
form are called multiplicative faults: in this case the system parameters are scaled depending
on the fault size.
¹ An interesting branch of modeling theory is the so-called behavioral approach to physical
modelling. This theory states that when we accept a mathematical law as the description of a
phenomenon we view it as an exclusion law: a mathematical model expresses the opinion that
some things can happen, while others cannot. On this basis a mathematical law (model) selects
a certain subset from a universum of possibilities. The interested reader is referred to [86]
and [104].
Disturbances and model uncertainties also change the plant behavior. Disturbances are usually
represented by unknown input signals that are added to the system output, while model
uncertainties change the model parameters in a similar way to multiplicative faults.
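The difference between additive and multiplicative faults can be illustrated on a toy model. The sketch below assumes a first-order discrete-time plant x[k+1] = a·x[k] + b·u with output y[k] = x[k]; the model, the parameter values and the names `additive_fault` and `gain_scale` are all hypothetical and are not taken from the thesis.

```python
def simulate(a, b, u, steps, additive_fault=0.0, gain_scale=1.0):
    """Simulate x[k+1] = a*x[k] + gain_scale*b*u with y[k] = x[k] + additive_fault.

    additive_fault enters the output equation as an addend (additive fault);
    gain_scale scales the parameter b (multiplicative fault).
    """
    x = 0.0
    outputs = []
    for _ in range(steps):
        outputs.append(x + additive_fault)
        x = a * x + gain_scale * b * u
    return outputs


nominal = simulate(0.5, 1.0, 1.0, 60)
additive = simulate(0.5, 1.0, 1.0, 60, additive_fault=0.3)
multiplicative = simulate(0.5, 1.0, 1.0, 60, gain_scale=0.5)

print(nominal[-1], additive[-1], multiplicative[-1])
```

With these values the additive fault shifts the steady-state output by the fault size, while the multiplicative fault changes the steady-state gain of the plant.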
Faults are often classified as follows:
• Plant faults: faults that change the dynamical I/O properties of the system.
• Sensor faults: the plant properties are not affected, but the sensor readings have substantial
errors.
• Actuator faults: the plant properties are not affected, but the influence of the controller on
the plant is modified.
A remark is necessary concerning the distinction of the notions of fault and failure. A fault
causes a change in the characteristics of a component such that the mode of operation or per-
formance of the component is changed in an undesired way. Hence the required specifications
on the system performance are no longer met. However, a fault can be worked around by
fault-tolerant control so that the faulty system remains operational. The notion of a failure, as
defined in [12], describes the inability of a system or a component to accomplish its function.
Assume that the system performance can be described by two variables y1 and y2 (see
fig. 1.3). In the region of required performance, the system satisfies its function. During
its time of operation the system should remain in this region. The controller makes the
nominal system remain in this region despite disturbances and uncertainties of the model. The
controller may even hold the system in this region if small faults occur.
[Figure 1.3: regions of required performance, degraded performance, unacceptable performance
and danger in the (y1, y2) plane; a fault moves the system outward and recovery moves it back.]
The region of degraded performance shows where the faulty system is allowed to remain,
although in this region the performance may be considerably degraded. A fault brings the
system from the region of the required performance into the region of degraded performance.
The fault-tolerant controller should be able to act in order to prevent a further degradation of
the performance towards the region of unacceptable performance or to the region of danger
and it should move the system back into the region of required performance.
The region of unacceptable performance must be avoided by means of the fault-tolerant
controller. This region lies between the region of acceptable performance in which the system
should remain and the region of danger, which the system should never reach. A safety system
interrupts the operation of the plant to avoid danger for the system and its environment if the
outer threshold of the region of unacceptable performance is exceeded. The safety system and
the fault-tolerant controller work in separate regions of the signal space and satisfy comple-
mentary aims. In many applications, they represent two separate parts of the control system
and are usually implemented in separate units. This separation makes it possible to design
fault-tolerant controllers without the need to meet safety standards.
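The division of labour between the controller, the fault-tolerant controller and the safety system can be sketched as a simple classification of the performance point (y1, y2). The thesis gives no numerical boundaries for the regions, so the concentric discs and the radii below are purely hypothetical, as are the function names.

```python
import math

# Hypothetical radii around the nominal operating point (0, 0).
R_REQUIRED, R_DEGRADED, R_UNACCEPTABLE = 1.0, 2.0, 3.0


def region(y1: float, y2: float) -> str:
    """Classify a performance point into one of the four regions."""
    distance = math.hypot(y1, y2)
    if distance <= R_REQUIRED:
        return "required"
    if distance <= R_DEGRADED:
        return "degraded"
    if distance <= R_UNACCEPTABLE:
        return "unacceptable"
    return "danger"


def responsible_unit(y1: float, y2: float) -> str:
    """Name the unit expected to act for a given performance point."""
    current = region(y1, y2)
    if current == "required":
        return "controller"
    if current == "danger":
        # Outer threshold of the unacceptable region exceeded: plant is stopped.
        return "safety system"
    return "fault-tolerant controller"


print(region(1.5, 0.0), responsible_unit(3.0, 3.0))
```

The point of the sketch is only that the two units act on disjoint parts of the signal space, which is what allows them to be implemented separately.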
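[Figure 1.4: architecture of fault-tolerant control: a supervision level (diagnosis and
controller re-design) above an execution level (controller and plant), with signals yref, u,
y, fault f and disturbance d.]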
troller by a supervisor, which includes the diagnostic and the controller re-design blocks. If a
fault f occurs, the supervision level makes the control loop fault-tolerant. The diagnostic block
identifies the fault and the controller re-design block adjusts the controller to the new situation.
Fault tolerance can also be accomplished without the structure given in fig. 1.4 by means of
well established control methods.
• robust control: a fixed controller is designed in order to meet the closed loop
specifications while tolerating changes of the plant dynamics. This implies that the
controlled system satisfies its goals under faulty conditions. Fault tolerance is obtained
without changing the controller parameters. It is, therefore, called passive fault tolerance.
However, the theory of robust control has shown that robust controllers exist only for a
restricted class of changes that may be caused by faults. Further, a robust controller is
suboptimal for the nominal plant because its parameters are obtained as a trade-off between
performance and robustness.
• adaptive control: the controller parameters are adapted to changes of the plant parame-
ters caused by some fault (active fault tolerance). However, the theory of adaptive control
shows that this principle is particularly efficient only for plants that are described by lin-
ear models with slowly varying parameters. These restrictions are usually not met by
systems under the influence of faults, which typically have a nonlinear behavior with
sudden parameter changes.
• Fault detection: Decide whether or not a fault has occurred. This step determines the time
at which the system is subject to some fault.
• Fault isolation: Find in which component a fault has occurred. This step determines the
location of the fault.
• Fault identification and Fault estimation: Identify the fault and estimate its magnitude. This
step determines the kind of fault and its severity.
In order to be able to detect a fault, the measurement information (U, Y ) alone is not sufficient,
but a reference, which describes the nominal plant behavior, is necessary. This reference is
given by an explicit model of the plant, which describes the relation between the possible input
sequences and output sequences. This model is a representation of the plant behavior B. This
idea is called consistency-based diagnosis and can be explained in a really simple way. Consider
again fig. 1.1 and assume that the current I/O pair is represented by point A in the figure. If
the system is faultless (and the model is correct) then A lies in the set B. However, if the system
is faulty, it generates a different output Ŷ for the given input U . If the new I/O pair (U, Ŷ ) is
represented by point C, which is outside of B then the fault is detectable. In other words the
[Figure 1.5: behaviors B0, B1 and B2 in U × Y, with the I/O pairs A, B, C and D.]
If the I/O pair is represented by the points A, C or D in fig. 1.5, the faults detected are
f0 , f1 or f2 , respectively. If, however, the measurement sequences are represented by point
B, the system may be subjected to one of the faults f0 or f1 . The ambiguity of the diagnostic
result is caused by the system and not by the diagnoser, because the system generates the same
information for both faults. The question of whether or not a certain fault can be detected
concerns the diagnosability or fault detectability of the system (for more details see [12]).
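The ambiguity described above can be reproduced with a small sketch of consistency-based diagnosis. Each fault case is given a static model, and an I/O pair is assigned every fault whose model it satisfies. The three models, the tolerance and the name `consistent_faults` are assumptions chosen only so that the behaviors of f0 and f1 intersect.

```python
# Hypothetical static models: f0 is the faultless case, f1 and f2 are two faults.
MODELS = {
    "f0": lambda u: 2.0 * u,
    "f1": lambda u: 1.0 * u,
    "f2": lambda u: 0.5 * u + 1.0,
}


def consistent_faults(u, y, models=MODELS, tol=1e-6):
    """Return the labels of all fault cases whose model is consistent with (u, y)."""
    return [label for label, model in models.items() if abs(y - model(u)) <= tol]


print(consistent_faults(1.0, 2.0))   # pair lying only in the behavior of f0
print(consistent_faults(0.0, 0.0))   # pair lying in the behaviors of f0 and f1
print(consistent_faults(1.0, 5.0))   # pair consistent with no modelled case
```

A pair that returns more than one label corresponds to point B in fig. 1.5: no diagnoser can resolve it, because the system produces the same data under both faults.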
For continuous-variable systems, which are usually described by differential equations or
transfer functions, the principle of consistency-based diagnosis can be cast into the scheme
shown in fig. 1.6. The model is used to determine, for the measured input sequence U, the
model output sequence Ŷ. The consistency of the system with the model can be checked at
every time t by determining the difference
r(t) = y(t) − ŷ(t) ,
which is called a residual. In the faultless case, the residual vanishes or is close to zero.
A non-vanishing residual indicates a deviation between the measurements and the values
calculated from the system model and, hence, the existence of a fault. Diagnostic algorithms
for continuous-variable systems generally consist of two components:
1. Residual generation: The model and the I/O pair are used to determine residuals, which
describe the degree of consistency between the plant and the model behavior.
2. Residual evaluation: The residual is evaluated in order to detect, isolate and identify faults.
[Figure 1.6: consistency-based diagnosis of a continuous-variable system: the plant (input u,
fault f, disturbance d, output y) and the model (output ŷ) feed the residual evaluation.]
In both steps, model uncertainties, disturbances and measurement noise have to be taken into
account.
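The two components can be sketched for a toy plant. The first-order model, the sensor offset that appears at sample 20 and the fixed threshold are all assumptions made for this illustration; in particular the threshold is a crude stand-in for the allowance that model uncertainties, disturbances and measurement noise would require in practice.

```python
def generate_residuals(u_seq, y_seq, a, b, x0=0.0):
    """Residual generation: r[k] = y[k] - y_hat[k], with y_hat from x[k+1] = a*x[k] + b*u[k]."""
    x_hat = x0
    residuals = []
    for u, y in zip(u_seq, y_seq):
        residuals.append(y - x_hat)
        x_hat = a * x_hat + b * u
    return residuals


def evaluate_residuals(residuals, threshold):
    """Residual evaluation: first sample index where |r| exceeds the threshold, else None."""
    for k, r in enumerate(residuals):
        if abs(r) > threshold:
            return k
    return None


# Hypothetical plant data: the sensor reading gets an offset of 0.4 from sample 20 on.
a, b = 0.5, 1.0
u_seq = [1.0] * 40
x, y_seq = 0.0, []
for k, u in enumerate(u_seq):
    y_seq.append(x + (0.4 if k >= 20 else 0.0))
    x = a * x + b * u

residuals = generate_residuals(u_seq, y_seq, a, b)
alarm_index = evaluate_residuals(residuals, threshold=0.1)
print(alarm_index)
```

Here the model and the initial state are known exactly, so the residual is zero until the fault appears; with an unknown initial state the model output would first have to be aligned with the plant by a state estimator.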
Fault detection and isolation employs analytical redundancy: the residual is found by using
more than one way for determining the variable y. The sensor value y is compared with the
analytically computed value ŷ. This procedure is used to avoid physical redundancy.
It is important to remark that the behavior of a dynamical system depends not only on the
input but also on the initial state. Inconsistencies may result from a deviation of the
initial state of the model. As the initial state of the system usually cannot be measured,
every diagnostic problem includes a state observation or state estimation problem. Moreover,
the disturbance d that influences the plant usually cannot be measured either. As it
influences the plant behavior, it has to be taken into account in the consistency check.
The reader interested in more details about the fault detection and isolation problem can
refer to [38], [12] and [80], where exhaustive surveys on this topic are given.
certainty.
The component library used by an MBD engine describes the laws which govern the behavior of
the components. A resistor obeys Ohm’s law; a multiplier component obeys the constraint that
its output is the product of its inputs. Once provided with a component model library, the MBD
engine should be able to diagnose any system constructed out of known components.
MBD models are compositional: the model of a combination of two systems is directly
constructed from the models of the constituent systems. Consider the system illustrated in
fig. 1.7.
Figure 1.7: A, B, C, D and E are input terminals, F and G are output terminals, X, Y and Z
are internal probe points, M1 , M2 and M3 are multipliers and A1 and A2 are adders.
(MULTIPLIER M i j k) → k = i × j
(ADDER A i j k) → k = i + j
An MBD engine would be provided with a structural description, which might simply be written
as:
(MULTIPLIER M1 A C X)
(MULTIPLIER M2 B D Y)
(MULTIPLIER M3 C E Z)
(ADDER A1 X Y F)
(ADDER A2 Y Z G).
The inputs to the diagnostic engine would be observations from the system: A = 3, B = 2,
C = 2, D = 3, E = 3, F = 10, G = 12. From this information the MBD engine would determine
that the following sets of components could be faulted: {A1}, {M1}, {A2, M2}, {M2, M3} and any
of their supersets. In addition, the most informative place to measure next is X because it
distinguishes between the two single faults.
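The candidate sets listed above can be reproduced by brute force. The sketch below is not an MBD engine: it assumes that a faulted component places no constraint on its output, and it searches the internal values X, Y and Z over the small integer range 0 to 20, which is enough for these observations. The names `consistent` and `minimal_diagnoses` are introduced here for illustration.

```python
from itertools import combinations, product

COMPONENTS = ["A1", "A2", "M1", "M2", "M3"]
OBS = {"A": 3, "B": 2, "C": 2, "D": 3, "E": 3, "F": 10, "G": 12}


def consistent(faulty, values=range(0, 21)):
    """True if some X, Y, Z satisfy the constraints of every component not in `faulty`."""
    for x, y, z in product(values, repeat=3):
        checks = {
            "M1": x == OBS["A"] * OBS["C"],
            "M2": y == OBS["B"] * OBS["D"],
            "M3": z == OBS["C"] * OBS["E"],
            "A1": OBS["F"] == x + y,
            "A2": OBS["G"] == y + z,
        }
        if all(ok for name, ok in checks.items() if name not in faulty):
            return True
    return False


def minimal_diagnoses():
    """Enumerate candidate fault sets by increasing size, keeping only minimal ones."""
    found = []
    for size in range(len(COMPONENTS) + 1):
        for candidate in combinations(COMPONENTS, size):
            candidate_set = set(candidate)
            if any(known <= candidate_set for known in found):
                continue  # a superset of a known diagnosis is not minimal
            if consistent(candidate_set):
                found.append(candidate_set)
    return found


print([sorted(d) for d in minimal_diagnoses()])
```

The enumeration visits all 2^5 candidate sets in the worst case, which is workable here only because the circuit has five components.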
This example illustrates some of the basic properties of MBD:
is a symptom. The symptoms drive diagnostic reasoning. Each symptom indicates that one or
more components may be faulted. Intuitively, a conflict is a set of components which
underlie a symptom.
Consider the symptom “F is observed to be 10, not 12”. The prediction that F = 12 de-
pends on the correct operation of A1 , M1 , and M2 , i.e., if A1 , M1 , and M2 were correctly func-
tioning, then F = 12. Since F is not 12, at least one of A1 , M1 , and M2 is faulted. Thus the set
{A1 , M1 , M2 } is a conflict for the symptom. The set {A1 , A2 , M1 , M2 }, and any other superset of
{A1 , M1 , M2 } are conflicts as well; however, no subsets of {A1 , M1 , M2 } are necessarily conflicts
since all the components in the conflict were needed to predict the value at F .
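Conflicts and candidate diagnoses are linked by a standard result of consistency-based diagnosis: the minimal diagnoses are the minimal sets of components that intersect every minimal conflict. The sketch below illustrates this for the example circuit. The first conflict is the one derived above; the second, {A1, A2, M1, M3}, is the one usually derived for this circuit from the value at G and is an assumption here, since the passage above spells out only the first. The function name is hypothetical.

```python
from itertools import combinations


def minimal_hitting_sets(conflicts):
    """Enumerate, smallest first, the minimal sets that intersect every conflict."""
    universe = sorted(set().union(*conflicts))
    found = []
    for size in range(1, len(universe) + 1):
        for candidate in combinations(universe, size):
            candidate_set = set(candidate)
            if any(known <= candidate_set for known in found):
                continue  # already covered by a smaller hitting set
            if all(candidate_set & conflict for conflict in conflicts):
                found.append(candidate_set)
    return found


conflicts = [
    {"A1", "M1", "M2"},        # conflict for the symptom at F (from the text)
    {"A1", "A2", "M1", "M3"},  # assumed second conflict, for the symptom at G
]

print([sorted(h) for h in minimal_hitting_sets(conflicts)])
```

Under that assumption the hitting sets coincide with the four candidate sets quoted for the example.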
A diagnosis is a particular hypothesis for how the system differs from its model. For example
“A2 and M2 are broken” is a diagnosis which explains the two symptoms observed for the
example system. The size of the initial diagnosis space is exponential in the number of system
components. Any component could be working or faulty, thus the diagnosis space for the system
initially consists of 2^5 = 32 diagnoses. Ultimately, the goal of diagnosis is to identify, and
refine, the set of diagnoses consistent with the observations thus far. For more details on
these concepts the reader is referred to [33]. These diagnostic concepts can be defined more
formally using First-Order Logic within the framework of [34].
[Figure: behaviors Bspec, Bc, B0 and Bf in U × Y. (a) Behavior of the faultless closed loop
system. (b) Behavior of the faulty closed loop system.]
There may be faults for which the behavior Bf does not overlap with Bspec . If this is the case
a new control configuration has to be chosen, which changes the signals under consideration
and, hence, the behavior of the plant. There may be faults for which no controller can make
the closed-loop system satisfy the specification and the system has to be shut off. Hence, the
question whether a fault-tolerant controller exists is not a property of the controller or the
control re-design method, but a property of the plant subject to faults³. Two principal ways of
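[Figure: fault accommodation: the diagnosis block drives a fault accommodation block that
adapts the controller parameters in the loop of controller and plant (signals yref, u, y,
fault f, disturbance d).]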
of input and output sequences is not changed. A simple way of fault accommodation
is based on predesigned controllers, selected off-line for a specific fault. The re-design
step then simply sets the switch among the different control laws. This step is quick and
can meet strong real-time constraints. However, the controller re-design has to be made
for all possible faults before the system is put into operation and all resulting controllers
have to be stored in the control software (for more details see [12]).
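Fault accommodation with predesigned controllers amounts to a table lookup at run time. The sketch below assumes proportional controllers and invents the fault labels and gain values; it only shows why the re-design step is fast and why every fault case has to be anticipated off-line.

```python
# Hypothetical bank of proportional gains designed off-line, one per fault case.
CONTROLLER_BANK = {
    "no_fault": 2.0,
    "actuator_loss": 4.0,
    "sensor_bias": 1.5,
}


def select_gain(diagnosed_fault: str) -> float:
    """Re-design step: pick the controller predesigned for the diagnosed fault."""
    try:
        return CONTROLLER_BANK[diagnosed_fault]
    except KeyError:
        # A fault that was not anticipated off-line cannot be accommodated this way.
        raise ValueError(f"no predesigned controller for fault {diagnosed_fault!r}") from None


def control(y_ref: float, y: float, diagnosed_fault: str) -> float:
    """Proportional control law with the gain selected for the current fault case."""
    return select_gain(diagnosed_fault) * (y_ref - y)


print(control(1.0, 0.5, "no_fault"), control(1.0, 0.5, "actuator_loss"))
```

The lookup takes constant time, which is what makes this scheme compatible with strong real-time constraints.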
[Figure: control reconfiguration: after diagnosis, a new control configuration replaces the
nominal controller with a new controller acting on the plant through new signals (y'ref, u',
y').]
Fault-tolerant control makes intelligent use of the redundancies included in the system and in
the information about the system in order to increase the availability of the system. It uses
analytical redundancy, which is cheaper than duplicating all vulnerable components. Of course
no method can guarantee a complete description of all possible faults of a system. Hence, no
100% fault tolerance is possible. However, for many applications, complete fault tolerance is
not necessary.
ware architecture of a distributed application. As the reader can see, a distributed
application can be decomposed into a set of clusters: the operator cluster, the computational
cluster and the controlled object cluster. Generally the computational cluster is implemented
as a distributed computer system and has the structure shown in fig. 1.12: a set of nodes
interconnected by a real-time communication system. A single node can be partitioned into at
least two subsystems, the local communication controller and the host computer (fig. 1.13).
The set of all the communication controllers of the nodes within a cluster forms the real-time
communication system of the cluster. The interface between the communication controller
within a node and the host computer of the node is called the communication network interface
(CNI).
⁴ A real-time system is a system in which the correctness of the system behavior depends not
only on the logical results of the computations but also on the physical instant at which
these results are produced (deadline). Real-time systems in which there exists at least one
firm deadline which could produce a “catastrophe” if missed are called hard real-time.
⁵ In his treatise “De Architectura”, Marcus Vitruvius Pollio wrote down on ten scrolls
everything he knew about architecture. He presented this work, known today as “Ten Books on
Architecture”, to Emperor Augustus in the hope of changing what he perceived as a rampant lack
of professionalism and educational rigor in the practice of architecture.
The purpose of the real-time communication system is to transport messages from the CNI of
the sender node to the CNI of the receiver node within a predictable time interval, with a
small latency and with high reliability. The communication system must ensure that the
contents of the messages are not corrupted. From the point of view of the host computer, the
details of the protocol logic and the physical structure of the communication network are
hidden behind the CNI. It is easy to understand that the communication system is a critical
resource of a distributed system, since the loss of communication means the loss of all global
system services. There are different alternatives for the design and implementation of the
communication system: a single channel system (bus or ring) or a multiple channel system (mesh
network). Communication reliability can be increased by retransmitting messages in case of a
failure, or by replicating messages so that the loss of a message can be masked. If
communication channels are replicated, the permanent loss of a channel can be tolerated. It is
not the purpose of this work to go into the details of the design, implementation and problems
of real-time distributed systems, but the interested reader can learn more in [54] and
references therein.
[Figure: host computer and communication network.]
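The benefit of replicating messages can be quantified under a simple assumption that is not made in the text: each copy is lost independently with the same probability. The function name and the numbers below are illustrative only, and correlated losses (for instance a permanent channel failure) would make the result optimistic.

```python
def delivery_probability(p_loss: float, copies: int) -> float:
    """Probability that at least one of `copies` independently sent copies arrives."""
    if not 0.0 <= p_loss <= 1.0 or copies < 1:
        raise ValueError("p_loss must be in [0, 1] and copies must be at least 1")
    return 1.0 - p_loss ** copies


# Hypothetical loss probability of 10% per transmission.
for copies in (1, 2, 3):
    print(copies, delivery_probability(0.1, copies))
```

The same expression applies to retransmission with a bounded number of attempts, at the cost of additional latency.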
Data collection
A controlled object change its state as a function of time; if we freeze it we can describe the
current state of the controlled object by reading the values of its state variables at the moment.
Normally we are interested in a subset of state variables that is significant for our purpose (real-
time entity). Each real-time entity is in the sphere of control of a subsystem6 . Outside its sphere
of control the value of a real-time entity can be observed but not changed. The first functional
requirement of a real-time distributed system is the observation of the real-time entities in a
controlled object and the collection of these observations (real-time images). Since the state of the
controlled object is a function of time, an image is only temporally accurate for a limited amount
of time. The first step in observation is signal conditioning, i.e. all the processing steps needed
to obtain meaningful measured data of a real-time entity from the raw sensor data. After
signal conditioning the measured data must be checked for plausibility and related to other
measured data to detect possible faults in sensors. Another important requirement in data
collection is alarm monitoring, i.e. the monitoring of the real-time entities to detect abnormal
process behaviors. The computer system must detect and display these alarms and assist the
operator in identifying the primary event which was the initial cause of these alarms.
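The data collection steps described above can be illustrated with a minimal sketch. The class and function names, the accuracy interval and the thresholds are hypothetical and are introduced here only for illustration; they are not taken from [54].

```python
from dataclasses import dataclass


@dataclass
class RealTimeImage:
    """Observation of a real-time entity (illustrative structure, not from the source)."""
    name: str
    value: float
    t_observed: float         # time of the observation, in seconds
    accuracy_interval: float  # assumed time span during which the image stays accurate


    def is_temporally_accurate(self, now: float) -> bool:
        # The image is usable only for a limited time after the observation.
        return 0.0 <= now - self.t_observed <= self.accuracy_interval


def is_plausible(value: float, low: float, high: float) -> bool:
    """Plausibility check: the measured value must lie in a physically sensible range."""
    return low <= value <= high


def sensors_agree(a: float, b: float, tolerance: float) -> bool:
    """Relate two measurements of the same entity to reveal a possible sensor fault."""
    return abs(a - b) <= tolerance
```

For instance, an image observed at t = 10.0 s with an accuracy interval of 0.5 s is still usable at t = 10.3 s but not at t = 10.6 s.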
Man-machine interaction
The real-time computer system must inform the operator of the current state of the controlled
object and must assist the operator in controlling the machine. This is achieved via the man-
machine interface. This critical subsystem contains an extensive data logging and data reporting
subsystem. The reader interested in this important topic is referred to [37].
the desired set-point according to a step function. There are two important parameters that
characterize the step response functions we obtain from our controlled object (see fig. 1.14):
the object delay $d_{obj}$, after which the measured variable begins to rise, and the rise time $d_{rise}$ (approximately
the time after which the measured variable reaches the new equilibrium point).
Figure 1.14: Delay and rise time in a step response.
[Plot of the measured variable against real time, with the set point, the 10% and 90% levels, and the intervals $d_{obj}$ and $d_{rise}$ marked.]
The controlling computer must sample the measured variable to detect deviations from the desired
value. The constant amount of time between two sample points is called sampling period
($d_{sample}$). We expect the digital system to behave like a continuous system, and this requires
the sampling period to be less than one tenth of the rise time. The computer compares the measured
value with the set point selected by the operator, calculates the error term and from
this computes the new value of the control variable by the control algorithm. The amount of
time after the sample point at which the controlling computer outputs the new
control value is called computer delay ($d_{pc}$). The computer delay should be smaller than the sampling
period. The difference between the maximum and the minimum values of the delay is
called the jitter of the delay ($\Delta d_{pc}$). The dead time ($t_d$) is the time interval between the observation of
the real-time entity and the start of a reaction of the controlled object due to a computer action
based on this observation. It is the sum of the controlled object
delay, which is in the sphere of control of the controlled object and is thus determined by its
dynamics, and the computer delay, which is determined by the computer implementation. To
reduce the dead time in a control loop and to improve the stability of the control loop, these
delays should be as small as possible. The scenario explained above is pictured in fig. 1.15.
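The timing rules stated above can be collected in a short check. This is only a sketch: the function name and the choice of using the maximum computer delay as the worst case for the dead time are assumptions made here, and all times are expressed in milliseconds.

```python
def check_loop_timing(d_obj, d_rise, d_sample, d_pc_min, d_pc_max):
    """Evaluate the timing parameters of a control loop (all times in milliseconds)."""
    return {
        # The sampling period should be less than one tenth of the rise time.
        "sampling_ok": d_sample < d_rise / 10,
        # The computer delay should be smaller than the sampling period.
        "computer_delay_ok": d_pc_max < d_sample,
        # Jitter: difference between maximum and minimum computer delay.
        "jitter": d_pc_max - d_pc_min,
        # Dead time: object delay plus computer delay (worst case assumed here).
        "dead_time": d_obj + d_pc_max,
    }
```

With an object delay of 200 ms, a rise time of 2000 ms, a sampling period of 100 ms and a computer delay between 20 ms and 30 ms, both rules are satisfied, the jitter is 10 ms and the worst-case dead time is 230 ms.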
Hard real-time systems are by definition safety-critical. Hence it is important that any error
within the control system (loss or corruption of a message, failure of a node etc.) is detected
within a short time with a high probability. The required error-detection latency must be of the
same order of magnitude as the sampling period of the fastest critical control loop. It is then
possible to perform some corrective action or bring the system into a safe state before the
consequences of an error can cause any severe system failure.
[Figure 1.15. Recovered labels: observation, $d_{sample}$, $\Delta d_{pc}$, output to the actuator, $t_d$, $d_{rise}$.]
Reliability
The reliability of a system $R(t)$ is the probability that a system will provide the specified service
until time $t$, given that the system was operational at time $t = t_0$. If a system has a constant
failure rate of $\lambda$ (see footnote 7), then the reliability at time $t$ is given by
$$R(t) = e^{-\lambda (t - t_0)}.$$
The inverse of the failure rate ($1/\lambda$) is called mean time to failure (MTTF). All these concepts will
be reviewed and applied to distributed systems in chapter 3. Moreover the reader is referred
to [9], [59] and [46] for more details.
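As a small numerical illustration of the reliability formula for a constant failure rate, consider the following sketch. The function names are chosen here for illustration, and the failure rate is expressed in failures per hour.

```python
import math


def reliability(t, failure_rate, t0=0.0):
    """R(t) = exp(-lambda * (t - t0)) for a constant failure rate lambda (per hour)."""
    if t < t0:
        raise ValueError("t must not precede t0")
    return math.exp(-failure_rate * (t - t0))


def mttf(failure_rate):
    """Mean time to failure: the inverse of the constant failure rate."""
    return 1.0 / failure_rate
```

Note that at t - t0 = MTTF the reliability has already dropped to 1/e, i.e. to about 37%.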
Safety
The term safety concerns critical failure modes. In such a failure mode the cost of the
failure can be orders of magnitude higher than the utility of the system during normal operation.
Safety critical real-time systems must have a failure rate with regard to critical failure modes
that conforms to ultrahigh reliability requirements (see footnote 8). Similar failure rates are required in
flight-control systems, train-signaling systems, nuclear plant monitoring systems etc.
Maintainability
With the term maintainability we mean the measure of the time required to repair a system
after the occurrence of a failure. Maintainability is measured by the probability $M(d)$ that the
system is restored within a time interval $d$ after the failure. As for reliability, a constant repair
rate $\mu$ (repairs per hour) and a mean time to repair (MTTR) are introduced to define a quantitative
maintainability measure.
7 Failures per hour.
8 If the failure rate of a system is required to be of the order of $10^{-9}$ failures per hour or lower, we speak of a system with an ultrahigh reliability requirement.
Availability
Availability is a measure of the delivery of correct service with respect to the alternation of
correct and incorrect service and is measured by the fraction of time that the system is ready to
provide the service. In a system with constant failure and repair rates, the reliability (MTTF),
maintainability (MTTR) and availability (A) are related by:
$$A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}.$$
A high availability can therefore be obtained either by a long MTTF or by a short MTTR.
The sum MTTF+MTTR is called mean time between failures (MTBF); the situation is sketched
in fig. 1.16.
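The relations among MTTF, MTTR, MTBF and availability can be checked numerically with the sketch below. The function names are illustrative. The closed form used for the maintainability M(d) is not given in the text: it is the standard exponential model that follows from assuming a constant repair rate, and it is included here as an assumption.

```python
import math


def availability(mttf_hours, mttr_hours):
    """A = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def mtbf(mttf_hours, mttr_hours):
    """Mean time between failures: MTTF + MTTR."""
    return mttf_hours + mttr_hours


def maintainability(d_hours, repair_rate):
    """Assumed exponential model: M(d) = 1 - exp(-mu * d) for a constant repair rate mu."""
    return 1.0 - math.exp(-repair_rate * d_hours)
```

For example, a system with MTTF = 999 h and MTTR = 1 h has an MTBF of 1000 h and an availability of 0.999; halving the MTTR raises the availability without changing the MTTF.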
[Figure 1.16. Recovered labels: down, MTBF.]
Security
The security attribute is concerned with the ability of a system to prevent unauthorized access to
information or services. This attribute is usually associated with large databases, but during the
last few years this issue has become important also in real-time systems. There are difficulties
in defining a quantitative security measure.
1.7.2 Scalability
Evolving requirements are the rule in large distributed systems. Existing functions must be
changed or new functions added over the lifetime of the system. A scalable architecture is open
to such changes, and does not have any limit on its extensibility. Only distributed architectures
provide the necessary framework for unlimited growth:
1. Nodes can be added within the given capacity of the communication channel, introduc-
ing additional processing power to the system.
2. If the communication capacity within a cluster is fully utilized, or if the processing power
of a node has reached its limit, a node can be transformed into a gateway node to open
a way to a new cluster (see fig. 1.17). The interface between the original cluster and the
gateway node can remain unchanged.
[Figure 1.17. Recovered labels: nodes (N) of the original cluster, gateway node (G), nodes (N) of the new cluster.]
Of course a key point in designing large scalable systems is complexity. Large systems can
be built if the effort required to understand the system operation (see footnote 9) remains under control as
the system grows. The complexity is related to the number of parts and the number and type
of interactions among the parts that must be considered to understand a particular function of
the system. In a scalable system the effort required to understand any function should remain
constant and independent of the system size. The only difference with a small system should
be in the number of different functions that a large system can provide. In other words, the effort
needed to understand all functions of a large system grows with the system size.
1.7.3 Dependability
Implementing a dependable real-time service requires distribution of functions to achieve effective
fault containment and fault tolerance so that the service can continue despite the occurrence
of faults. In [65] the authors define a responsive system as a system which has all three of these attributes:
distribution, fault tolerance and real-time performance.
9 i.e. the complexity of the system.
A fault tolerant system must be structured into partitions that act as error-containment regions
so that the consequences of faults that occur in one of these partitions can be detected, corrected
or masked before these consequences corrupt the rest of the system. An error-containment
region must implement a well-specified service; this service should be provided to the outside
world through a small interface, so that an error in the service can be detected at this interface.
In a large distributed computer system it is natural to regard a complete node as an
error-containment region and to perform the error detection at the node’s message interface
to the communication system. With this in mind it is easy to understand that implementing
error-containment regions in centralized systems is a hard task because system resources are
multiplexed over many services.
Not all the faults in a large distributed system are equally critical. An interesting classification
coming from the aircraft industry is the following:
1. Catastrophic: Fault that prevents continued safe operation of the system and can be the
cause of an accident.
2. Hazardous: Fault that reduces the safety margin of the redundant system to an extent that
further operation of the system is considered critical.
3. Major: Fault that reduces the safety margin to an extent that immediate maintenance must
be performed.
4. Minor: Fault that has only a small effect on the safety margin. From the safety point of
view it is sufficient to repair the fault at the next scheduled maintenance.
[Figure residue. Recovered labels: nodes (N), node interfaces, RT communication system.]
be masked by providing actively replicated nodes which are supposed to show a deterministic
behavior (replica determinism; see footnote 10).
Figure 1.19: Faults, errors and failures: faults and errors are states while failures are events.
1.8.1 Failures
Whenever the service of a system, as seen by the user, deviates from the agreed specification,
the system is said to have failed. A failure is an event that denotes a deviation between the actual
service and the specified or intended service, occurring at a particular point in real time. It can
be classified using the following criteria:
• Failure nature: we distinguish between value failures and timing failures. A value failure
means that an incorrect value is presented at the system-user interface, while a timing
failure means that a value is presented outside the specified interval of real time.
• Failure perception: we distinguish between consistent failures and inconsistent failures. In a
consistent failure scenario all the users see the same wrong result. For example, a consistent
failure scenario is when a subsystem either produces correct results or no results at
all; we will call this scenario a fail-silent failure scenario. In an inconsistent failure situation
different users may perceive different false results (sometimes these failures are also called
two-faced failures, malicious failures or Byzantine failures).
• Failure effect: we distinguish between benign failures and malign failures. A benign failure
can only cause failure costs that are of the same order of magnitude as the loss of the
normal utility of the system. Malign failures can cause a catastrophe such as the crash
of an airplane. We call safety-critical applications those applications where malign failures
can occur.
• Failure oftenness: we distinguish between permanent failures and transient failures. A permanent
failure is a failure after which the system ceases to provide a service until an
explicit repair action has eliminated the cause of the failure. If a system continues to operate
after the failure we call it a transient failure. A frequently occurring transient failure
is called an intermittent failure.
10 Replicated nodes visit the same states at about the same time.
36 Distributed systems and fault tolerance: an overview
1.8.2 Errors
System failures can be traced to an incorrect internal state. We call such an incorrect internal
state an error. Therefore an error is an unintended state. If the error exists only for a short
interval of time and then disappears without an explicit repair action it is called a transient
error, while if it persists until an explicit repair action removes it, we call it a permanent error.
Transient errors form the predominant error class in many computer systems. In a fault
tolerant architecture every error must be confined to an error containment region to avoid the
propagation of the error throughout the system. It is the aim of the error detection interfaces to
protect the boundaries of the error containment region.
1.8.3 Faults
The cause of an error, and hence the indirect cause of a failure, is called a fault. Faults can be
classified using the following criteria:
• Fault nature: a fault that has its origin in a chance event (e.g. random break of a wire)
is called a chance fault. If a fault can be traced to an intentional action by someone (e.g. a
Trojan horse introduced by a programmer in order to break the security of a system) the
fault is called an intentional fault.
• Fault origin: faults that have their origin in the incorrect development of the system (development
faults) must be distinguished from faults that are related to system operation
(operation faults).
• Fault persistence: if the fault exists only for a short interval of time and then disappears
by itself it is called a transient fault. On the contrary, if it persists we call it a permanent
fault.
(i) At the architecture level, transparent to the application code. This type of tolerance is
called systematic fault tolerance: the architecture must provide replica determinism so that
fault tolerance can be achieved by the temporal or spatial replication of computations to
detect and mask the faults specified in the fault hypothesis.
(ii) At the application level, within the application code. We will call this type of fault toler-
ance application-specific fault tolerance: it mixes the normal processing functions with the
error-detection and fault-tolerance functions at the application level.
The first problem in achieving fault tolerance for distributed systems is error detection. An
error is a discrepancy between the intended correct state and the current state of a system.
The goal of the fault-tolerant system is to detect and mask or repair errors before they show
up as failures at the system user service interface. Error detection requires that, along with the
information about the current state, knowledge about the intended state of a system is available.
This knowledge can arise from two different sources: from some a priori knowledge or from the
comparison between redundant computational channels.
The more is known a priori about the properties of correct states and the temporal patterns
of correct behavior of a computation, the more effective are the error detection techniques.
This means that if a subsystem has to be flexible in the temporal domain and in the value
domain, then error detection based on a priori knowledge is hardly possible. Techniques for
error detection based on a priori knowledge are:
• Syntactic knowledge about the code space: parity bits, error-detecting codes in memory, cyclic
redundancy check11 (CRC) in data transmission, check digits at the man-machine inter-
face. Such codes are very effective in detecting the corruption of a value stored in memory
or in the transmission of a value over a computer network.
• Assertions and acceptance tests: application specific knowledge about the restricted ranges
and the known interrelationship of the values of the entities can be used to detect addi-
tional errors that are undetectable by syntactic methods.
• Activation patterns of computations: knowledge about the regularity in the activation pat-
tern of a computation can be used to detect errors in the temporal domain.
• Worst case execution time of tasks: to detect task errors in the temporal domain.
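Two of the techniques listed above, syntactic knowledge about the code space and acceptance tests, can be sketched as follows. The function names, the four byte frame layout and the numeric limits are hypothetical choices made for this illustration; the CRC is the 32 bit one provided by the Python standard library.

```python
import zlib


def even_parity_bit(data: bytes) -> int:
    """Parity bit that makes the total number of 1 bits even."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones % 2


def frame_with_crc(payload: bytes) -> bytes:
    """Append a CRC-32 checksum (4 bytes, big endian) to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def crc_ok(frame: bytes) -> bool:
    """Recompute the checksum at the receiver and compare it with the received one."""
    payload, received = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received


def acceptance_test(value: float, previous: float, low: float, high: float,
                    max_step: float) -> bool:
    """Application-specific check: restricted range and bounded change between samples."""
    return low <= value <= high and abs(value - previous) <= max_step
```

A frame in which a single bit has been flipped during transmission is rejected by `crc_ok`, while a syntactically valid but physically impossible value can only be caught by a check such as `acceptance_test`.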
On the other hand, there are many different possible combinations of hardware, software
and time redundancy that can be used to detect different types of errors by performing the
computation twice.
1. Membership service: to detect a node failure and to report this node failure consistently to
all operating nodes of the cluster within a short latency.
2. Redundancy management: to mask the node failure by active redundancy and to reintegrate
repaired nodes into the cluster as soon as they become available again.
From the node point of view, a node must detect all internal failures within a short latency
and must map these failures to a single external failure mode (preferably a fail-silent failure
mode). After an exception has been detected, control is transferred to an exception handler.
After the exception handler has terminated, control is either resumed from the point of exception
or the task is terminated. The purpose of a fault-tolerant unit (FTU) is to mask the failures
of a node. If a node implements the fail-silent abstraction then the duplication of nodes is sufficient
to tolerate a single node failure. A fail-silent node either produces correct values or does
not produce any results at all. For example in a time-triggered architecture an FTU that consists
of two fail-silent nodes produces either zero, one or two correct result messages. If it produces
no messages it has failed. If it produces one or two messages it is operational. The receiver
must discard redundant result messages (see fig. 1.20).
[Figure 1.20. Recovered labels: FTU, two service providers, each with an error detector.]
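The behavior of a receiver facing an FTU made of fail-silent nodes can be sketched as follows. Representing a silent node by `None` and the function name are assumptions made for this illustration.

```python
from typing import Optional, Sequence


def ftu_output(results: Sequence[Optional[float]]) -> Optional[float]:
    """Combine the messages of fail-silent replicas.

    Each replica delivers either a correct value or nothing (None).
    Returns None when no message arrived, i.e. when the FTU has failed.
    """
    delivered = [r for r in results if r is not None]
    if not delivered:
        return None          # zero messages: the FTU has failed
    return delivered[0]      # one or two messages: keep one, discard the redundant one
```

Since a fail-silent node never delivers a wrong value, any delivered message can be used and no comparison among the replicas is needed.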
If the node does not implement fail-silence, but can exhibit value errors at the host/network
interface, then triple modular redundancy must be implemented. In this case we must assume
that the behavior of the nodes is replica determinate. In more detail, the FTU must consist of
three nodes and a voter. The voter decides and masks errors in one step by comparing the three
independently computed results and then selecting the result that has been computed by the
majority (see fig. 1.21, in which a two out of three triple modular redundant configuration is
shown).
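A two out of three voter can be sketched in a few lines. The sketch assumes replica determinism, so that correct replicas deliver exactly equal results and an exact comparison is meaningful; the function name and the use of `None` to signal the absence of a majority are illustrative choices.

```python
from collections import Counter


def majority_vote(results):
    """Two out of three voter: return the majority value, or None if there is none."""
    if len(results) != 3:
        raise ValueError("a TMR voter expects exactly three results")
    value, count = Counter(results).most_common(1)[0]
    return value if count >= 2 else None
```

A single erroneous replica is outvoted by the other two, so the error is masked without a separate detection step.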
If no assumptions can be made about the failure behavior of a node (Byzantine failure)
then four nodes are required to form a fault-tolerant unit. These four nodes must execute a
Byzantine-resilient agreement protocol to agree on a malicious failure of a node (see [82]).
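The three cases discussed above (two, three and four nodes to tolerate a single failure) can be summarized in a small helper. The generalization to f simultaneous failures (f + 1, 2f + 1 and 3f + 1 nodes respectively) is not stated in the text; it is the usual result from the fault tolerance literature and is included here as an assumption. The function and mode names are hypothetical.

```python
def nodes_required(failure_mode: str, faults: int = 1) -> int:
    """Number of replicated nodes in an FTU that tolerates `faults` faulty nodes."""
    if faults < 1:
        raise ValueError("faults must be at least 1")
    rules = {
        "fail_silent": lambda f: f + 1,      # duplication suffices for one failure
        "value": lambda f: 2 * f + 1,        # majority voting, e.g. TMR for one failure
        "byzantine": lambda f: 3 * f + 1,    # Byzantine-resilient agreement
    }
    return rules[failure_mode](faults)
```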
1.10. Bibliographic notes 39
42 Fault tolerant architecture for distributed systems
[Recovered labels: Level 2, Group Resource and Reconfiguration Management (performance and resource monitoring; intelligent resource and reconfiguration manager); Level 1; PLANT.]
Figure 2.1: IFATIS modular architecture for fault tolerant control of distributed systems
2.1. The distributed nature of Fault Tolerant Control System 43
Figure 2.2: IFATIS architecture. Interface signals have the following meaning: A actuating
information (mode dependent), C cross communication between FTC/FTM modules, D diagnosis
results, m actual and acceptable modes, M mode decisions and resource allocations, N
resource needs and urgencies for all possible modes, R resource specific information, S sensor
information (mode dependent).
fulfil a specified purpose within the whole control system and run on a set of physical equip-
ment components. The set of plant components and controller modules might be regarded as
resources which can be allocated to the partial processes. Fig. 2.2 shows the overall structure of
a fault tolerant system, as proposed in [64].
Each partial process consists of plant functions, running on plant components, and (fault-
tolerant) control functions, running on controller modules. To run, they need allocation of
resources, i.e. plant components and controller modules. This allocation is dynamic, dependent
on mode and reconfiguration decisions of higher levels (e.g. group and global resource and
reconfiguration managers). These decisions can change sensor and actuator information, which
is exchanged between plant functions and control functions.
decision may be modified by the group resource and reconfiguration manager (GRRM) (see
arrow pointing downward from GRRM in fig. 2.2) which has a global view on a set of partial
processes and resources, and decides on resource allocation and working mode of each partial
process.
An FTC function (fault tolerant control function) is a functional unit. The details of an FTC
function are shown in fig. 2.3. It consists of the following parts:
• Function monitor for evaluation of the partial process quality and for partial process spe-
cific fault diagnosis. The latter can be based on sensor and actuator signals, analytical or
knowledge based process models, and quality evaluation.
• Local reconfiguration and mode control. From diagnostic results of the function moni-
tor, local reconfiguration decisions can be derived, e.g. change of control parameters or
algorithms, switching to redundant sensors or to estimated states, change to a degraded
mode. Such decisions can be made locally only if the change of resource needs has no
significant influence on other partial processes.
• Resource needs. Each partial process has specific resource needs and urgencies, which
are different for each of its modes.
An FTM function (fault tolerant measurement function) is also a functional unit which has the
same structure as the FTC function except that the controller function is replaced with an estimator
(see fig. 2.4).
The inputs/outputs and the structure of the FTC and FTM modules (respectively pictured
in fig. 2.3 and fig. 2.4) are:
FTC interfaces:
• $C_{in}$: control signals, reference signals, estimates, sensor outputs (measurements), quality
of measurements (or estimates);
• $C_{out}$: control signals (references to inner loop, or actuator commands);
• $D_{out}$: fault model, confidence and performance index;
• $M_{out}$: working modes out (present working mode and list of admissible working modes);
FTM interfaces:
• $C_{in}$: data in, sensor outputs, estimates from other cells, quality of measurements (or estimates);
• $M_{out}$: working modes out (present working mode and list of admissible working modes);
For the design of an FTC system, the global control task is separated into different functions
(e.g. temperature control, temperature reference determination, temperature estimation, etc.).
Each function is characterized by:
• a general objective,
[Figure 2.5. Recovered labels: room, heater, water to recycle, control valve, temperature sensor, heating resistance, controller, desired temperature.]
Example 2.1 Let us consider the closed loop system sketched in fig. 2.5. The control objective is to
maintain a constant temperature (measured by a sensor) in a room. The temperature is controlled by
a hot water flow in a heater via a control valve. We assume we have a redundant sensor which is less
precise than the operating one. The function is characterized by:
• General objective: maintain the room temperature at a fixed value T ;
• Working modes:
For each mode, we have to specify the resource needs, the urgency, the required performance, the
conditions for transition from the considered working mode to other working modes. Mode 2 could be
illustrated by the following:
• The resource needs: nominal needs.
• The required performance: to maintain a temperature T with an error smaller than 10% (instead
of an acceptable error of 1% in Mode 1).
• The conditions for transition from Mode 2 to Mode 1: if the less precise sensor has been replaced
by a sensor with the same characteristics as the initial one, then we have a transition to Mode 1.
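The mode-dependent performance requirement of Example 2.1 can be sketched as follows. Only the two error bounds (1% in Mode 1, 10% in Mode 2) and the transition from Mode 2 to Mode 1 are taken from the example; the function names and the representation of the modes as integers are assumptions, and no other transition is modeled.

```python
# Required relative error on the temperature T for each working mode (from Example 2.1).
REQUIRED_RELATIVE_ERROR = {1: 0.01, 2: 0.10}


def performance_ok(mode: int, setpoint: float, measured: float) -> bool:
    """Check the required performance of the given working mode."""
    return abs(measured - setpoint) <= REQUIRED_RELATIVE_ERROR[mode] * abs(setpoint)


def next_mode(mode: int, precise_sensor_restored: bool) -> int:
    """Transition from Mode 2 to Mode 1 once an equivalent precise sensor is installed."""
    if mode == 2 and precise_sensor_restored:
        return 1
    return mode
```

With a desired temperature of 20 and a measured temperature of 21.5, the requirement of Mode 2 is met while that of Mode 1 is not.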
A resource monitor is associated with a physical resource (plant or controller module) and monitors
its state. It cannot be related to a partial process, because partial processes can be allocated
to different resources. Sometimes, a resource monitor uses normal sensor and actuator infor-
mation together with models, and hence it can be allocated to different computing resources,
just like a FTC function. But sometimes resource monitors contain special hardware equipment
or hardware associated test software, giving resource specific information (e.g. in computers:
self-test software, parity check, watchdog). In that case, the resource monitor contains parts,
which have a fixed allocation to a physical resource and hence cannot freely be allocated.
There are two higher levels in the above mentioned hierarchical structure. The top level is the
global resource and reconfiguration manager and the lower level is the group resource and
reconfiguration manager. The group resource and reconfiguration manager is responsible for a
defined subset of partial processes, whereas the global resource and reconfiguration manager
is responsible for all the group resource and reconfiguration managers.
The inputs and outputs of these two higher levels are the same and are:
• Min : list of acceptable and present working modes (from lower levels);
Given the urgency of the different functions, the role of the GRRM module is to determine, from the
available resources, how best to use them in order to achieve the most urgent functions (if not all
functions can be realized).
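One simple way to picture this role is a greedy allocation that serves functions in decreasing order of urgency. This is only an illustrative sketch: the greedy rule, the single scalar resource and all names are assumptions made here, and they do not reproduce the optimization actually performed by the GRRM.

```python
def allocate_by_urgency(functions, capacity):
    """Grant resources to the most urgent functions first.

    functions: list of (name, urgency, need) tuples; a higher urgency is served first.
    capacity:  total amount of the (single, scalar) resource available.
    Returns the list of granted function names and the remaining capacity.
    """
    granted = []
    remaining = capacity
    for name, _urgency, need in sorted(functions, key=lambda f: f[1], reverse=True):
        if need <= remaining:
            granted.append(name)
            remaining -= need
    return granted, remaining
```

With a capacity of 6 and three functions needing 4, 2 and 3 units in decreasing order of urgency, the first two are granted and the least urgent one is left out.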
• acceptable needs (namely upper bound on the needs that might be tolerated for that mod-
ule);
• Needs of the functions associated to each FTC/FTM module linked to the considered
GRRM.
The objective to be optimized for resource distribution among the different functions, according
to urgencies and technical feasibility.
on the basic events probabilities and it is used to identify the causal relationships leading to a
specific system failure mode. Each fault tree considers only one of the many possible system
top failure modes. Therefore more than one fault tree can be associated to the same system.
A fault tree diagram contains two basic elements: gates and events. Gates allow or inhibit the
passage of combinations of faults up the tree and show the relationships between the events
needed for the occurrence of a higher level event. The three basic gate types used in the fault
tree are the OR gate, the AND gate and the NOT gate. These gates are used to combine events
as the Boolean operations of union, intersection and complement.
The analysis of the fault tree diagram produces two types of results: qualitative and quantitative.
Qualitative analysis identifies the combinations of the basic events which cause the
top event, possibly using Boolean logic. Quantitative analysis results in a prediction of the
system performance in terms of probability of failure or frequency of failure. Once the top
event is specified, the fault tree is developed by determining the immediate, necessary and suf-
ficient causes for its occurrence: these are not the component level causes of the event, but the
immediate causes of the event. The immediate, necessary and sufficient causes of the top event
are then treated as sub-top events and the process then determines their immediate, necessary
and sufficient causes. In this way the tree is developed refining resolution until the limit of res-
olution is reached and the tree is complete. To identify the immediate, necessary and sufficient
causes of events some guidelines are:
For further information and examples, the reader is referred to [2] and references therein.
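A quantitative evaluation of a small fault tree can be sketched as follows, under the assumption that the basic events are statistically independent and that no basic event appears in more than one branch. The tree structure and all names are hypothetical and serve only to show how the three gate types combine probabilities.

```python
import math


def p_and(*probabilities):
    """AND gate (intersection): all input events must occur."""
    return math.prod(probabilities)


def p_or(*probabilities):
    """OR gate (union): at least one input event occurs."""
    return 1.0 - math.prod(1.0 - p for p in probabilities)


def p_not(probability):
    """NOT gate (complement)."""
    return 1.0 - probability


def top_event_probability(p_sensor_a, p_sensor_b, p_controller):
    """Hypothetical tree: top = (sensor A fails AND sensor B fails) OR controller fails."""
    return p_or(p_and(p_sensor_a, p_sensor_b), p_controller)
```

The qualitative counterpart of this computation is the observation that the top event of this hypothetical tree has two minimal causes: the joint failure of both sensors, or the failure of the controller alone.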
The knowledge we have about the system contains not only the model of the plant, but also
the model of the measurement. These two models can be represented by their structural graph.
When the model of the measurements is taken into account, the structure of the plant is only
relative to unknown variables, while the structure of the measurements shows the relations be-
tween known and unknown variables. Considering the parameters of the model as variables
(whose nominal value might be known or unknown), the structural model of the system may
be generalized as a set of constraints which apply to a set of variables and parameters, among
which a subset has known values. Analytical redundancy relations are obtained by applying
graph theory to the digraph representing the structural model of the plant.
Faults can be divided into two classes: non-structural faults (namely those faults that cause
changes just in the mathematical expression of constraints, e.g. parametric faults) and structural
faults (whose effect is to change the set of constraints of the system). An example of
a structural fault is a resistor whose resistance value changes from $R \neq 0$ to $R_f = 0$; in this
case the constraint of the component changes from $V - Ri = 0$ into $V = 0$. For this kind
of fault, structural observability and controllability analysis can be performed on the
nominal/faulty structural graph (here the faulty structural graph is the structural
graph of the system in nominal condition with the constraint of the faulty component substituted
by the new constraint).
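For the resistor example, a residual can be obtained directly from the nominal constraint when both the voltage and the current are measured. That both variables are known, as well as the function names and the threshold, are assumptions made for this sketch.

```python
def resistor_residual(v_measured: float, i_measured: float, r_nominal: float) -> float:
    """Residual of the nominal constraint V - R*i = 0, evaluated on measured values."""
    return v_measured - r_nominal * i_measured


def fault_detected(residual: float, threshold: float) -> bool:
    """Flag a fault when the residual leaves a band around zero (threshold is assumed)."""
    return abs(residual) > threshold
```

In nominal conditions the residual is zero up to noise. If the resistor is short-circuited, the constraint becomes V = 0 and the residual equals $-Ri$, which is nonzero whenever a current flows.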
In this framework it is possible to introduce the concept of the estimation redundancy degree
of an observable variable (which expresses in mathematical terms the idea of an over-constrained
observed variable) and the control redundancy degree of a controllable variable, which
expresses in mathematical terms the idea of an over-constrained controllable variable. With this
in mind, the structural analysis, as asserted in [12], is a valuable tool to study the monitorable
part of the system (namely the over-constrained part), but also to design reconfiguration actions
which can be expressed in terms of the faulty structural graph. In conclusion, we can assert that
the structural analysis of a system can constitute a valuable framework for the problem formulation
and solution approaches in many other steps of the FDI system design procedure, such
as:
• analysis of the local redundancies of the system, in order to detect FDI possibilities;
• determination of those extra sensors whose implementation would increase the FDI pos-
sibilities;
• analysis of the structure of the residuals in order to evaluate the detectability and isola-
bility of the faults;
• analysis of the structure of the residuals for the implementation of the FDI algorithms;
For further information and examples on structural analysis, we refer the reader to [80],
[12], [13], [36], [96] and references therein.
2.3. Supervision level hierarchy 51
• Global RRM level reconfiguration (cross-groups re-allocation): when the resource re-allocation,
due to module level reconfiguration or resource failure, involves more than
one group, the Global RRM is given the responsibility of orchestrating the resource
allocation.
The FTC/M module supervising each partial process is totally autonomous in setting the
new working modes, as it is supposed to be completely isolated from the other partial
processes (or to have only minor coupling, which can be dealt with through the cross-process information) in
terms of fault propagation. The management of the new working mode (WM) is delegated to the Group RRM only
when it involves a change of resources or produces conflicts of functions.
It is worth viewing a partial process and the linked FTC/M module as composed of several
partial sub-processes and FTC/M sub-modules, hierarchically structured. The hierarchical
structure of the FTC/M module is motivated by the need to deal with complicated
partial processes which can be characterized by several faults and strong coupling effects.
Reconfiguration and mode decisions are made by local reconfiguration and mode control (LRM)
blocks inside FTC/FTM modules. In this regard different FT Modules can be grouped in a
unique higher level module (supervised by a unique LRM), if they are isolated from the context
as far as the effect of a certain subset of faults is concerned. Starting from these considerations,
in this section we want to give some guidelines to design this hierarchical structure (see also
[66, 16]).
Generally speaking, the overall task to be accomplished by the supervisor can be divided into
three steps:
52 Fault tolerant architecture for distributed systems
1. Fault Detection and Identification: on the basis of the diagnostic signals issued by the FT
Modules, the supervisor is required to compute an estimate of the fault that has (possibly)
occurred in the set of partial processes it supervises. To this end, there are two fundamental
elements which must be taken into account:
• Fault Effect Propagation between partial processes: the supervisor, which has a
global perspective of all the partial processes belonging to the supervised group,
has to identify possible false alarms generated by the local function monitors which,
on the contrary, have just a local perspective of the process. For instance, the ac-
tivation of certain diagnostic signals from a local function monitor can be due not
to a real local fault but, indeed, to the effect of a fault in a different partial process
which propagates within the group. In this respect all the residual signals issued by
the local function monitors must be jointly processed by the supervisor in order to
identify the real fault. In this processing, the response of a local function
monitor may possibly be changed.
• Confidence level of each diagnostic signal: all the diagnostic signals processed by
the supervisor are characterized by a confidence level (see [4]) expressing the quality
of the diagnosis performed by the local function monitors. The supervisor is required
to take this information suitably into account, comparing it with the analogous information
received from other processes, in order to generate a reliable fault estimate.
This first task can be thought of as a phase in which an objective estimate of the fault that has
occurred in a specific group is generated. The adjective objective stresses the fact that the
fault estimate is not filtered (altered) by the global specifications/constraints which characterize
the complex system, but represents precisely what is actually going on in the
plant at a certain time. As we shall see later, the objective fault estimate is then further
processed, taking into account other factors such as global specifications and constraints,
in order to generate the events which determine the new working modes. To carry out
this phase successfully, complete knowledge of the partial processes composing
the group, in terms of how the effects of faults and reconfigurations propagate throughout
the group, is needed. Roughly speaking, if one interprets the diagnosis signals issued
by local function monitors as residual signals then this phase of FDI amounts to invert
2. Events Generation: the second task embedded in the supervisor concerns
events generation, namely the generation of requests for WM changes. These, in general,
result from a joint processing of the information provided by the FDI unit (namely
the objective information about the fault that has occurred) and of the performance criteria to be
achieved. In other words, the detection of certain faults may or may not generate reconfiguration
requests, depending on the particular specification to be met. In some cases the
objective response of the FDI unit may be bypassed, since a possible reconfiguration (needed
to accommodate fault effects) may conflict with the performance criteria. In the simplest
scenario this task is implemented by a look-up table which, on the basis of the
performance and fault tolerance criteria requested for a particular functionality, of the
current working mode, and of the faults detected by the previous task, yields the desired event.
3. Working Mode Setting: the request for a new reconfiguration issued by the event generation
block is then processed by the final decision logic, whose aim is to set the new
working modes. The main goal of this unit is to set the new WM configuration on the
basis of the requests generated by the previous block, taking into account possible
resource limitations which may characterize the fault tolerant system. This part is
implemented using the theoretical machinery of discrete event systems (DES) and the
theory of supervision of DES (see Appendix A).
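As a minimal sketch of the look-up table mentioned for the events generation step, the Python fragment below maps a triple (requested criterion, current working mode, detected fault) to a requested event. All names and table entries are hypothetical and only illustrate the mechanism; an entry equal to None models the case in which the objective response of the FDI unit is bypassed because a reconfiguration would conflict with the performance criteria.

```python
from typing import Dict, Optional, Tuple

# Key: (criterion, current working mode, detected fault) -> requested event.
# All entries are hypothetical and serve only to illustrate the mechanism.
Key = Tuple[str, str, str]

EVENT_TABLE: Dict[Key, Optional[str]] = {
    ("fault_tolerance", "wm1", "f1"): "req_wm2",
    ("fault_tolerance", "wm1", "f2"): "req_wm3",
    # Performance has priority here: the fault is detected but no
    # reconfiguration is requested.
    ("performance", "wm1", "f1"): None,
    ("performance", "wm1", "f2"): "req_wm3",
}


def generate_event(criterion: str, working_mode: str, fault: str) -> Optional[str]:
    """Return the requested WM-change event, or None if no request is issued."""
    return EVENT_TABLE.get((criterion, working_mode, fault))
```

A combination that is not listed in the table also yields None, i.e. no request is generated for it.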
A possible structure for the local reconfiguration and mode control (LRM) at each level of the
architecture is represented in fig. 2.7. Achieving fault tolerance means preserving the functionality
of the system even in case of faults. A fault tolerant control system is a system which is
able to detect faults in the system and recover functionalities in order to ensure pre-specified
performance.
For this reason the starting point for the analysis of the FT system is the functionality tree,
i.e. a tree in which the global objectives of the system (the root of the tree) is divided into sub-
Let us start from the diagnosis problem in order to identify a modular structure for the diagnosis
algorithms. Detecting a fault means using analytical and/or hardware redundancies in
order to reveal its effect. Having in mind our previous discussion about failures and failure
modes, we can conclude that detecting a fault means revealing its failure mode or, from a functional
point of view, revealing a loss of functionality. This means that if we specify a detection
algorithm (i.e. a certain number of residual signals) we specify which failure modes we are
able to reveal. A way to identify which events in the fault tree are observable thanks to the
system structure can be the structural analysis, which, as explained previously, represents the
Fig. 2.10 presents an example of a functionality tree. From this graph it is possible to
build the fault tree sketched in fig. 2.11(a). The tree shows that three faults
(f1, f2 and f3) can affect the system, leading to the loss of local (and global) functionalities.
The figure highlights which losses of functionality can be observed via three residual
signals (r1, r2 and r3). It is now immediate to define the residual matrix:
\[
R = \begin{array}{c|ccc}
    & r_1 & r_2 & r_3 \\ \hline
f_1 & 0 & 1 & 1 \\
f_2 & 1 & 1 & 1 \\
f_3 & 0 & 0 & 1
\end{array}
\]
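\[
R_1 = \begin{array}{c|c}
    & r_1 \\ \hline
f_2 & 1
\end{array},
\qquad
R_2 = \begin{array}{c|cc}
    & r_1 & r_2 \\ \hline
f_1 & 0 & 1 \\
f_2 & 1 & 1
\end{array},
\qquad
R_3 = \begin{array}{c|ccc}
    & r_1 & r_2 & r_3 \\ \hline
f_1 & 0 & 1 & 1 \\
f_2 & 1 & 1 & 1 \\
f_3 & 0 & 0 & 1
\end{array}.
\]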
This means that at the lowest level in the structure we can detect fault f2 using residual r1, at
the intermediate level it is possible to detect f1 using residual r2, and at the highest level in the
structure it is possible to detect fault f3 using residual r3. In this way we have obtained a modular
detection of the faults, going from the lower levels to the higher levels of the supervision
hierarchy. The situation is also illustrated in fig. 2.11(b).
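The modular use of the residual matrix can be sketched in a few lines of Python. The fragment below is only an illustration under two assumptions that are not stated in the text: residuals are binary, and a single fault is present at a time. The function names level_matrix and isolate are hypothetical. The first function extracts the sub-matrix seen at a given level of the hierarchy (using only the residuals r1, ..., r_level), the second matches a fired residual pattern against the fault signatures.

```python
from typing import Dict, List, Tuple

# Residual matrix of the example: one row per fault, columns (r1, r2, r3).
R: Dict[str, Tuple[int, int, int]] = {
    "f1": (0, 1, 1),
    "f2": (1, 1, 1),
    "f3": (0, 0, 1),
}


def level_matrix(level: int) -> Dict[str, Tuple[int, ...]]:
    """Sub-matrix seen at a hierarchy level (1 = lowest), using r1..r_level.

    Faults whose row is all zero on those residuals are not visible there.
    """
    rows = {fault: sig[:level] for fault, sig in R.items()}
    return {fault: sig for fault, sig in rows.items() if any(sig)}


def isolate(residuals: Tuple[int, int, int]) -> List[str]:
    """Faults whose full signature matches the fired residuals."""
    return [fault for fault, sig in R.items() if sig == residuals]
```

With these definitions, level_matrix(1), level_matrix(2) and level_matrix(3) reproduce the rows of R1, R2 and R3 respectively. Since the three signatures are distinct, isolate returns at most one fault under the single-fault assumption.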
¹ As the reader can figure out, the idea of observable (and hence diagnosable) failure modes is strictly linked
with the over-constrained observable variables present in the structure of the system. For this reason the structural
analysis is a perfect candidate tool to identify them.
(a) Example of fault tree for diagnosis. (b) Supervisor structure for diagnosis.
Figure 2.11: Example of the use of fault tree analysis to design a distributed diagnosis system.
Reconfiguring a system after the occurrence of a fault means recovering the functionality affected
by the fault. For this reason we can start again from the fault tree to identify a modular
architecture also for the reconfiguration system. As we defined observable events in the fault tree,
we can define controllable events, in the sense that we can act on some degree of freedom in
order to recover the functionality described in the event². For example, we can change some
parameters in the controller or change the structure of the controller, if the functionality we are
interested in is the control of a variable. Another option is switching between redundant
hardware or, in case of severe faults, we can choose to change our local or even global objectives
in order to make them achievable by the degraded system.
Now that we have identified which events in the fault tree are over-controllable, we can
identify a map between these and the request generator and the working mode decision logic in
the supervisor. These units will be present within the modules which are designed to recover
those functionalities. Following this procedure we have enriched the supervision hierarchy with
the units dedicated to reconfiguration, obtaining a complex modular hierarchy dedicated
to the supervision of the distributed system.
As an example, consider again the fault tree in fig. 2.11(a) and suppose that the functionalities
controllable via reconfiguration actions are those illustrated in fig. 2.12(a). From this figure it is
easy to sketch the supervision structure illustrated in fig. 2.12(b). It is easy to see that, while f2
and f3 are reconfigurable locally, simply by recovering the local functionality that the faults have
corrupted, this is not possible for f1: to reconfigure this fault we need to act at the level of the
global objectives.
Now, considering the supervision hierarchy for diagnosis illustrated in fig. 2.11(b) and the
supervision hierarchy for reconfiguration illustrated in fig. 2.12(b), and merging the two, we
obtain the supervision hierarchy shown in fig. 2.13.
² As the reader can figure out, the idea of reconfigurable failure modes is strictly linked with the over-constrained
controllable variables present in the structure of the system. Again the structural analysis can help us to identify
them. For further details the reader is referred to [12].
(a) Example of fault tree for reconfiguration. (b) Supervisor structure for reconfiguration.
Figure 2.12: Example of the use of fault tree analysis to design a distributed reconfiguration
system.
Figure 2.13: The modular structure obtained to supervise the system in fig. 2.10.
• WM setting at Module level (low-impact reconfiguration): in some cases, when the resource
needs induced by a new reconfiguration (namely by setting a new WM) have no significant
influence on the other processes, the decision to set a new WM can be taken by the local
reconfiguration and mode control (embedded in the FT module) without delegating
the decision to higher levels.
• WM setting at Global RRM level (cross-groups re-allocation): there may be cases in which
a certain WM change can be performed only by reviewing the allocation of the FT algorithms
within the available resources. This, for instance, may happen when:
– the resource monitor detects a fault in the specific resource (loss of a computer or
plant resources) and all the functionalities running on it must be moved to a different
resource;
– the resource needs of certain functionalities are no longer compatible with the
specific resource (for instance after a reconfiguration) and the computational burden
associated with the algorithm must be spread over other resources.
In this case the only possibility is to delegate the WM decision logic to the Global RRM.
From this perspective, the global RRM is activated only when new re-allocations are needed between
resources and functions linked to different groups, while all the working mode management
which does not require re-allocation between different groups is delegated to the Group
RRM. For this reason the information processed by the Global RRM consists not of performance
indexes (which do not influence re-allocation policies) but only of diagnostic results. On the other
hand, the Group RRM accesses both diagnostic results and performance indexes issued by the
FT Modules, since the working mode switching policy relies on both kinds of information.
The architecture of the supervisor is tailored so as to leave to the Group and Global
RRM only the task of managing the reconfigurations in each group which are necessary for the best
exploitation of the available resources, without delegating to them the task of reconfiguring FT
modules for achieving global functionalities, which is instead left to each Local RRM.
As far as the Group and Global RRM is concerned, it is composed only of a Decision Logic
Unit which processes all the events generated by the Requests Generation blocks linked to
the different local RRMs and manages the WM changes involving re-allocations in resources
linked to different groups. Moreover, the state of the Group and Global RRM can change due
to external commands issued by external operators; in other words, the Group/Global RRM is
also dedicated to interfacing the external world with the distributed system.
2.4. Design of the Group/Global Resource Reconfiguration Manager 59
We address in this section the problem of designing the WM decision logic unit of the group and
global RRM, showing how the theory of discrete event systems can be successfully used (see
Appendix A, [26] and [108]). To this end the first step is to identify precisely all the specifications
which are behind the design of the group-global RRM. This is done in the following:
• Group selection: the first basic piece of information concerns how the partial processes
have been grouped together, namely how the overall system has been divided into
groups of functional units.
• FTC/M as DES: this task amounts to describing each FT Module (either Control or Measurement)
as a Discrete Event System by specifying states, events and transitions between
states according to the events that have occurred. More precisely:
– States: the states capture information about the specific working modes of the partial
process; these, besides the reconfigured states, also comprise states associated with
faults (namely, a fault has occurred but no reconfiguration has been performed
yet).
– Events: these are exogenous events, which can be controllable/observable for the
supervisor, inducing transitions between different states. They can be categorized
in the following two classes: changes of WM, i.e. events (which will turn out
to be controllable for the supervisor) which induce transitions between states due to
a change of the working mode, and occurred faults, i.e. events (which will turn out
to be not controllable but observable for the supervisor) which induce transitions
between states due to faults which have been detected by local function monitors.
• Resource and Resource monitor as DES: this task amounts to describing local resources and
the associated monitors as Discrete Event Systems by specifying states and events inducing
transitions between states as follows:
– States: in the simplest case the states describing the status of a resource reduce to
three: idle (namely the resource is capable of running additional functionalities), busy
(no other functionalities can be allocated on that resource) and fault (the resource
monitor has detected a local fault, for instance of a computer). As better highlighted
in the following, the busy state can in general be split into several sub-states expressing
different cases in which the resource can be busy.
– Events: the exogenous events can be divided into changes of WM, inducing transitions
between idle and busy states (these are controllable events for the supervisor to be
designed which, on the basis of the model of the specific resource and the allocation
policy followed in the past, switches between the two states), and occurred faults,
i.e. non-controllable but observable events arising whenever the local resource monitor
detects a fault in the supervised resource.
• Reconfiguration specifications: list of the different WMs for each partial process linked to a
specific fault situation. This represents the list of possible counteractions after failures to
achieve fault tolerance.
• WMs/Resources map: this represents an offline planning of how the different functionalities,
achieved in all possible WMs, can be allocated on the available resources. In this
description a fundamental piece of information is whether a re-allocation takes place between
resources within the same group (in which case its management is delegated to the group RRM)
or possibly between different groups (in which case its management is delegated to the global RRM).
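To make the DES descriptions in the list above more concrete, the following Python sketch encodes the simplest resource model (idle, busy, fault) as a deterministic automaton. It is only an illustration: the class Automaton and the event names alloc, free and res_fault are hypothetical, with alloc/free standing for the controllable changes of WM and res_fault for the uncontrollable but observable fault event.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class Automaton:
    """Deterministic finite automaton used as a DES model."""
    states: Set[str]
    initial: str
    # (state, event) -> next state
    transitions: Dict[Tuple[str, str], str]
    controllable: Set[str] = field(default_factory=set)

    def events(self) -> Set[str]:
        """All events appearing on some transition."""
        return {event for (_, event) in self.transitions}

    def step(self, state: str, event: str) -> str:
        """Next state, or KeyError if the event is not enabled in `state`."""
        return self.transitions[(state, event)]


# Simplest resource model: idle, busy, fault.
# Event names are hypothetical; "alloc"/"free" stand for WM changes.
resource = Automaton(
    states={"idle", "busy", "fault"},
    initial="idle",
    transitions={
        ("idle", "alloc"): "busy",
        ("busy", "free"): "idle",
        ("idle", "res_fault"): "fault",
        ("busy", "res_fault"): "fault",
    },
    controllable={"alloc", "free"},
)
```

The same structure can hold an FT Module model, with working-mode and faulty states in place of idle, busy and fault.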
Starting from this information, it is possible to design automatically the decision logic unit
which supervises each group. The composition of the discrete models of the functional units
and physical resources linked to the same group yields a group automaton which captures all
the information about the feasible WMs according to the current WM and to the resource
availability. From this group automaton, the Group RRM can be designed following the supervision
theory on the basis of performance specifications and reconfiguration requirements. Moreover,
by composition of all the supervised group automata, it is possible to obtain a global discrete
model of the system on which the Global RRM can be designed, again taking advantage of the
supervision theory for Discrete Event Systems, starting from global specifications.
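The composition of discrete models mentioned above is the standard parallel (synchronous) composition of automata: shared events must be executed jointly by all the components that own them, while private events interleave. The sketch below is a minimal Python illustration restricted to two components and to the reachable part of the product; the tuple-based model representation and the tiny example (a functional unit with states WM1, F, WM2 and a resource with states I, B) are our own assumptions, not the models of the thesis figures.

```python
from typing import Dict, Set, Tuple

Trans = Dict[Tuple[str, str], str]       # (state, event) -> next state
Model = Tuple[str, Trans, Set[str]]      # (initial state, transitions, alphabet)


def parallel(a: Model, b: Model):
    """Synchronous (parallel) composition restricted to reachable states.

    Shared events must be executed jointly; private events interleave.
    """
    (a0, ta, ea), (b0, tb, eb) = a, b
    init = (a0, b0)
    trans: Dict[Tuple[Tuple[str, str], str], Tuple[str, str]] = {}
    seen, frontier = {init}, [init]
    while frontier:
        x, y = frontier.pop()
        for e in sorted(ea | eb):
            nx = ta.get((x, e)) if e in ea else x
            ny = tb.get((y, e)) if e in eb else y
            if nx is None or ny is None:
                continue  # event blocked by a component that owns it
            trans[((x, y), e)] = (nx, ny)
            if (nx, ny) not in seen:
                seen.add((nx, ny))
                frontier.append((nx, ny))
    return init, trans, ea | eb


# Hypothetical example: the reconfigured mode wm2 needs the resource.
unit: Model = ("WM1", {("WM1", "f"): "F", ("F", "wm2"): "WM2"}, {"f", "wm2"})
res: Model = ("I", {("I", "wm2"): "B"}, {"wm2"})
init, trans, alphabet = parallel(unit, res)
```

In the example the shared event wm2 can only occur when both the unit (in state F) and the resource (in state I) enable it, which is how resource availability restricts the feasible working modes in the group automaton.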
In the following these considerations will be further developed by proposing possible case
studies. The idea is to identify elementary case studies which are distinctive as far as the decision
logic design is concerned and which, suitably composed, allow one to deal with arbitrarily
complex situations. To this end the following is proposed:
1. Resource sharing: this case collects all the applications in which the different functionalities
are not always allocated on the same resource, but real-time re-allocations are
possible. This may happen due to particularly heavy reconfigurations (which may require
re-allocation) or due to loss of resources. Specifically, resource sharing can be
further specialized in:
(a) cross-group resource sharing: in this case the re-allocation of certain functionalities
involves resources which do not necessarily belong to the same group. Hence the
re-allocation cannot be supervised by the group RRM but is delegated
to the global RRM.
(b) in-group resource sharing: in this case the re-allocation is between resources which
belong to the same group. Hence the re-allocation can be supervised
by the Group RRM.
Of course the two cases are not mutually exclusive, since it is possible to conceive complex
fault tolerant systems in which interlaced reconfigurations are present within shared resources.
In the following the previous two case studies are presented, showing how the design of the
decision logic supervision can be achieved.
Consider an FTC system composed of two FTC modules (namely FTC1 and FTC2) and an
FTM module. These functional units share two physical resources (namely R1 and R2). FTC1
is always allocated on R1, FTC2 is always allocated on R2, while FTM can be allocated on either
R1 or R2.
FTC1, FTC2 and FTM can be affected by three failures, f1, f2 and f3 respectively. Each functional
unit can work in a nominal mode (working modes wm1, wm3 and wm5 respectively)
or in a reconfigured mode (working modes wm2, wm4 and wm6 respectively). Reconfigured
working modes should be issued by the supervisor after a failure is detected. When an FTC unit
works in nominal mode it uses just a part of the physical resource, leaving enough space on
it to allocate the working mode (nominal or reconfigured) of the FTM. When an FTC unit works
in the reconfigured mode it uses the whole resource. When this situation occurs the FTM
process has to be allocated on the other resource; this is possible only if FTM is working in
nominal condition.
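The allocation constraint just described can be sketched as a simple feasibility check. The fragment below is only an illustration: the function name and its boolean arguments are hypothetical, and the requirement that the other FTC unit be in nominal mode (so that the other resource has room for FTM) is our reading of the text rather than a rule stated explicitly. The function only tests whether the reconfiguration is directly feasible in the current configuration.

```python
def can_reconfigure_ftc(ftm_here: bool, ftm_nominal: bool,
                        other_ftc_reconfigured: bool) -> bool:
    """Whether an FTC unit can switch directly to its reconfigured mode.

    ftm_here: FTM is currently allocated on the same resource as the FTC unit.
    ftm_nominal: FTM is working in its nominal mode.
    other_ftc_reconfigured: the FTC unit on the other resource already
        occupies that resource entirely (assumption, see lead-in).
    """
    if not ftm_here:
        return True  # the whole resource is available to the FTC unit
    # FTM must first be moved away: only possible in nominal mode and
    # only if the other resource still has room for it.
    return ftm_nominal and not other_ftc_reconfigured
```

When the check fails, a supervisor may still reach the desired configuration through a longer string of events, for instance by first bringing less urgent units back to their nominal modes.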
Group selection: The system can be decomposed into two groups: the first group (G1) includes
the resource R1 and the functional units FTC1 and FTM (when allocated on R1); the second
group (G2) is composed of the resource R2 and of FTC2 and FTM (when allocated on R2).
FTC/M as DES: In fig. 2 the automata modelling the functional units are reported. Each state
WMi represents a different working mode of the functional unit, while states Fj represent the
j-th faulty situation. Since the FTM can be allocated on both resources, there are two
models of it (FTM/i, i=1,2), each representing the FTM as seen from the i-th resource. In
these automata there exist coordination states and transitions representing the stand-by
state of the unit on the resource. The events labelled wmi are the controllable and observable
events representing changes of working mode (and hence transitions between states in the
automata). Events labelled fj are the diagnosable and uncontrollable events representing possible
failures. The transitions to the stand-by state (namely events sbi, i=1,2) are the controllable
events used to allocate the unit on the resources.
Resource and Resource monitor as DES: In fig. 3 the automata modelling the physical
resources are reported. Here the states Ij (j=1,2) stand for idle states, meaning that new working modes
can be allocated on the j-th resource, while states Bij (i=1,2; j=1,2) represent the busy situation
for the j-th resource (no other functionalities can be allocated on the resource); in more detail,
state B1j represents the situation “FTCj in reconfigured working mode”, while state B2j represents
the situation “FTCj in nominal working mode and FTM allocated on the j-th resource”.
All the events have the meaning explained above.
Reconfiguration specifications: Working modes wm1 , wm3 and wm5 represent nominal working
modes for FTC1, FTC2 and FTM respectively. When a failure f1 happens in FTC1 then the
working mode issued should be wm2 ; analogously wm4 and wm6 are the working modes that
should be issued in FTC2 and FTM when failures f2 and f3 are detected respectively.
WM Urgency: In the example we will consider two different working mode urgency orders,
leading to different supervisory strategies:
1. (max. urgency) wm2 → wm4 → wm6 (min. urgency);
2. (max. urgency) wm6 → wm2 → wm4 (min. urgency).
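An urgency order of this kind can be represented simply as a ranked list. The sketch below shows, under that assumption, how a set of pending WM requests could be sorted from the most to the least urgent; the function name order_requests is hypothetical, and the fragment only illustrates the ordering itself, not the supervisor synthesis, which relies on the DES supervision theory.

```python
from typing import Iterable, List

# Case 1 of the example: wm2 is the most urgent, wm6 the least.
URGENCY_CASE_1 = ["wm2", "wm4", "wm6"]


def order_requests(pending: Iterable[str], urgency: List[str]) -> List[str]:
    """Sort pending WM requests from most to least urgent."""
    rank = {wm: i for i, wm in enumerate(urgency)}
    return sorted(pending, key=lambda wm: rank[wm])
```

For instance, with concurrent requests for wm6 and wm2, order 1 serves wm2 first, whereas an order that ranks wm6 above wm2 would serve wm6 first.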
\[ G_i = \mathrm{FTC}_i \parallel \mathrm{FTM}/i \parallel R_i . \]
Fig. 2.17 reports the FSM modelling group 1; the specifications regarding changes of working
mode after a failure are designed on the basis of this machine, and the supervisor Si is built to
satisfy these specifications. Fig. 2.18 shows the models of the controlled groups G1 and G2.
The global machine on which the Global RRM is designed is then obtained as:
\[ G = (G_1/S_1) \parallel (G_2/S_2) . \]
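The notation Gi/Si denotes the group Gi under the control of its supervisor Si. A minimal way to picture this is to let the supervisor disable some controllable events in some states of the plant and to keep only the remaining transitions. The Python sketch below makes the simplifying assumption that the supervisor is a pure state feedback (its state coincides with the plant state), which is not the general case treated by the supervision theory; the function name supervise and the small example are hypothetical.

```python
from typing import Dict, Set, Tuple

Trans = Dict[Tuple[str, str], str]  # (state, event) -> next state


def supervise(trans: Trans, controllable: Set[str],
              disabled: Dict[str, Set[str]]) -> Trans:
    """Transitions of the closed loop G/S.

    `disabled[state]` lists the events the supervisor forbids in `state`.
    Only controllable events may be disabled.
    """
    for state, events in disabled.items():
        illegal = events - controllable
        if illegal:
            raise ValueError(
                f"cannot disable uncontrollable events: {sorted(illegal)}")
    return {
        (state, event): target
        for (state, event), target in trans.items()
        if event not in disabled.get(state, set())
    }


# Hypothetical plant: from I, the WM change wm2 and the fault f1 are possible.
plant: Trans = {("I", "wm2"): "B", ("I", "f1"): "F", ("F", "wm2"): "B"}
closed_loop = supervise(plant, {"wm2"}, {"I": {"wm2"}})
```

In the example the supervisor forbids wm2 before a fault has occurred, so the closed loop only allows wm2 after f1. The fault event f1 is uncontrollable and can never be disabled.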
For the sake of simplicity we report neither the global machine nor the specifications designed
on it, but we explain how the global supervisor works in relation to the reconfiguration
priority order. Suppose that the WM urgency order is the one explained in case 1,
namely: (max. urgency) wm2 → wm4 → wm6 (min. urgency).
At start-up the global supervisor allocates FTM on R2 (sb1 wm52) so that, in case of failure
f1 (the most severe), an immediate reconfiguration of FTC1 (wm2) is possible. The supervision
actions due to some combinations of failures are reported below.
f1 → wm2
f3 → wm62
f2 → sb2 wm51 wm4
f2 → sb2 wm51 wm4 → f3 → wm61 → f1 → sb1 wm3 wm52 wm2 .
Suppose now that the WM urgency order is the one explained in case (2), namely: (max. urgency)
wm6 → wm2 → wm4 (min. urgency). At start-up the global supervisor allocates FTM
on R2 so that, in case of failure f1 (more severe than f2), an immediate reconfiguration
of FTC1 (wm2) is possible. The supervision actions due to some
combinations of failures are reported below.
where ε is the empty string, meaning that the supervisor takes no action. In fact, to reconfigure
FTC1 (wm2) it would have to move FTM from R1 to R2, stopping its reconfiguration (wm6).
\[ G_1 = \mathrm{FTC}_1 \parallel \mathrm{FTC}_2 \parallel R_1 \parallel R_2 \]
\[ G_2 = \mathrm{FTC}_3 \parallel R_3 . \]
This case can be approached as the previous one; the only difference is that the resource alloca-
tion policy is managed by the group supervisor.
Figure 2.19: In-group resource sharing. Here n stands for nominal working mode, while r
stands for reconfigured working mode.
• Actuators: P1, P2 and Q1, Q2 are, respectively, the powers delivered by the two resistors and
the input flow-rates provided by the two pumps.
• Sensors: T1 , T2 and T3 are the temperature measurements in the three tanks, H1 and H2
are the level measurements in tank 1 and 2, while Q3 is the flow-rate at the output of the
mixing tank.
The objectives of the system are to adjust the flow-rate Q3 and the fluid temperature T3 according to the
reference variables Q3* and T3*. For that purpose a very simple strategy consists of controlling
the temperature and flow-rate variables in the two pre-heating tanks. According to this
point of view, the global system is decomposed into three subsystems, namely two pre-heating
systems and a mixing one. The FTC/FTM decomposition is such that a subsystem is associated with
each function: SS1 and SS2 represent the pre-heating subsystems while SS3 is the
mixing subsystem. Two Fault Tolerant Control modules (FTC1 and FTC2) are associated to SS1
and SS2 . Since no control action is performed for mixing, only a Fault Tolerant Measurement
module (FTM) is associated to SS3 . The FTC/FTM implementation is represented in fig. 2.21
while the fault scenario is reported in table 2.1. According to the severity of the faults, the task
of the Global Resource and Reconfiguration Manager is either to define new local objectives or
to synthesize a new global objective by submitting new functions to all the subsystems of the
plant. This is dealt with by dividing the possible situations in six different scenarios depending
on the occurred fault and its severity and proposing possible reconfiguration strategies (see
fig. 2.22). In the following, the points highlighted above to express the supervisor specifications
are revisited, showing how they specialize to the present case.
Resource and Group selection: To enrich the proposed example slightly, we suppose
that two resources, R1 and R2, are available for running the control and estimation algorithms
implemented to achieve the desired functionalities. Furthermore, the second resource R2 can
be affected by a fault (loss of computation capability), in which case all the functionalities running
on it must be moved to the other resource. In this respect the double resource could
be motivated by the need for hardware redundancy. As far as the group selection is
concerned, the analysis of the system shows that the modules FTC1 and FTC2 are strongly
interlaced (a change of WM in one of the modules is always joined to a similar change in the
other) and hence a wise choice can be to group them. Following this idea, a possible choice is to
identify Group1 as FTC1, FTC2, R1 and Group2 as FTM, R2.
FTC/M as DES: FTC1 can work in a nominal mode (wm1 ) or in two reconfigured modes (wm2
and wm3 ), representing respectively scenario 3 and scenario 4 remedial actions (see fig. 2.22).
Since the processes described in FTC2 are not affected by faults and possible reconfiguration of
FTC2 always follows reconfiguration of FTC1, the same WMs (wm1 , wm2 , wm3 ) are also asso-
ciated to FTC2. The Discrete Event system which governs FTC1 and FTC2 can be described as
follows:
• if the system is in scenario 1 or scenario 2, all the modules work in the nominal mode;
• if the system is in scenario 3, both the FTC modules should move to wm2 (new local
trajectory computation) and then move back to wm1 ;
• if the system is in scenario 4, both the FTC modules should move to wm3 (new global and
local trajectory computation) and then move back to wm1 .
Note that wm2 and wm3 are associated with tasks which involve just a new reference computation,
and this is why the FTC modules are moved back to the control working mode wm1 after the reconfiguration
is carried out. For the sake of simplicity, we will assume that faults cannot occur when the working
mode is wm2 or wm3. A pictorial sketch of the DES modelling FTC1 and FTC2 is presented
in fig. 2.23. Note that the events which govern the transitions between the different states are
faults (observable but non controllable) and the WM changes (controllable).
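The behaviour described above for FTC1 and FTC2 can be written down as a small transition table. The following Python sketch is only a possible reading of the text, since the automaton of fig. 2.23 is not reproduced here: the state names F3 and F4 and the fault event names fault_s3 and fault_s4 (faults leading to scenario 3 and scenario 4) are hypothetical, while wm1, wm2 and wm3 are the controllable WM changes.

```python
# (state, event) -> next state; state and fault-event names are hypothetical.
FTC_TRANSITIONS = {
    ("WM1", "fault_s3"): "F3",   # fault leading to scenario 3
    ("WM1", "fault_s4"): "F4",   # fault leading to scenario 4
    ("F3", "wm2"): "WM2",        # new local trajectory computation
    ("F4", "wm3"): "WM3",        # new global and local trajectory computation
    ("WM2", "wm1"): "WM1",       # back to the control working mode
    ("WM3", "wm1"): "WM1",
}
CONTROLLABLE = {"wm1", "wm2", "wm3"}


def run(events, state="WM1"):
    """Replay an event string; raises KeyError on a disabled event."""
    for event in events:
        state = FTC_TRANSITIONS[(state, event)]
    return state
```

No fault transition leaves WM2 or WM3, which encodes the simplifying assumption that faults cannot occur while a new reference is being computed.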
As far as the description of FTM is concerned, it is supposed that it runs on R2 when
the latter is working properly, and must be moved to R1 when R2 is affected by faults.
This is precisely the case of resource sharing (and specifically cross-group resource sharing)
which has been treated in the previous subsection. In particular it has been shown that the
resource sharing involving FTM can be described by two DES, denoted by FTM/1 and FTM/2,
describing the functionality FTM running on R1 and on R2 respectively. Consequently,
the nominal WM (wmn) is also split into two nominal WMs (wmn1, wmn2), and two additional states,
denoted by sb1 and sb2, are introduced to denote whether the functionality FTM is in standby on R1
or on R2. Resource R1 is assumed to be always available (no fault on it) and to have enough room
to allocate all the processes. A sketch of the DESs describing FTM is shown in fig. 2.23.
Resources and Resource Monitors as DES: While the DES describing R1 and the associated
resource monitor is trivial, as R1 is not affected by faults and is never busy, the discrete event
description of R2 has two states, idle and fault, with a single non controllable event, which is
the local fault. A sketch of this DES is shown again in fig. 2.23.
Reconfiguration specifications: first, we specify the controllable events which must be
managed by the Global RRM and those which must be managed by the Group RRM.
Global RRM events: these are the events associated with re-allocation between resources
linked to different groups. In the present case only the events (sb1, sb2) and (wmn1,
wmn2) must be managed by the Global RRM, when the resource R2 fails and the FTM functionality
must be moved to R1.
Group RRM events: these are all the other controllable events, namely (wm1, wm2, wm3), which
concern the reconfiguration of FTC1 and FTC2.
The reconfiguration specifications can be given in terms of group and global specifications. As
far as the group specification is concerned, the decision logic unit supervising the first group can be designed
on the basis of the specifications expressed in fig. 2.24. Note that, in this reconfiguration policy,
the priority of wm3 with respect to wm2 (due to the fact that the fault P1 has been labelled
as more severe than DQ1) has been respected, in the sense that in case of concurrent faults P1
and DQ1 the priority is given to wm3. Note that the second group does not need a Group RRM,
since all the local reconfigurations are interlaced with those of the first group (change of WM
due to a reference computation) or are managed by the global RRM (loss of resource R2). As
far as the global specifications are concerned, these are described in fig. 2.25, where it is clear that only
the reconfiguration due to a loss of resource R2 is managed.
WM urgencies: The fault severity has been specified in tab. 2.1. In particular the most urgent
fault to be dealt with is given by P1 (loss of actuator).
WMs/Resources map: both the resources are able to run all the algorithms involved in the Fault
Tolerant Systems. In case the resource R1 is working properly, it is assumed that FTC1 and
FTC2 are located on R1 and FTM on R2. In case a fault is detected on R2 by the local resource
monitor then also the functionality FTM is moved on R1.
The previous information is all that is needed in order to design the Group and
Global decision logic units. As far as the design of the Group 1 RRM is concerned, the first step
is the computation of the overall DES modelling the group. This can be easily achieved
by considering the parallel interconnection of all the DESs describing FTC1, FTC2, FTM/1 and
R1, namely
G1 = FTC1 ∥ FTC2 ∥ FTM/1 ∥ R1 .
Hence the supervisor of G1 is built on the basis of the group specifications H presented above,
following the known supervision theory of Discrete Event Systems (see Appendix A).
Once the Group Supervisor Machine has been designed, the first step for the design of the
Global decision logic unit is to compute the DES modelling the supervised group. This can be
done by computing the parallel interconnection between G1 and the designed supervisor; the
global supervisor machine is then computed on the basis of the global specification expressed in
point (d) above. It is worth stressing that all the steps which, starting from the
specifications given above, lead to the global and group decision logic machines are
automatically computable by means of standard procedures.
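As an illustration of the parallel interconnection used above, the following is a minimal Python sketch of the parallel composition of two deterministic automata. The dictionary layout, the toy FTM automaton, its state names and the event label f_R2 are assumptions made only for this example and are not the models of fig. 2.23; shared events are executed jointly, while private events interleave.

```python
def parallel(a, b):
    """Parallel composition of two deterministic automata.

    Each automaton is a dict with keys:
      'init'   : initial state
      'events' : set of event labels (its alphabet)
      'delta'  : dict mapping (state, event) -> next state
    Only the reachable part of the product is built.
    """
    events = a['events'] | b['events']
    init = (a['init'], b['init'])
    delta, frontier, seen = {}, [init], {init}
    while frontier:
        xa, xb = state = frontier.pop()
        for e in events:
            # A component moves only on events of its own alphabet.
            na = a['delta'].get((xa, e)) if e in a['events'] else xa
            nb = b['delta'].get((xb, e)) if e in b['events'] else xb
            if na is None or nb is None:
                continue  # event blocked by one of the components
            nxt = (na, nb)
            delta[(state, e)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return {'init': init, 'events': events, 'delta': delta, 'states': seen}


# Hypothetical toy models: resource R2 (idle/fault) and a simplified FTM
# that is moved to R1 (event wmn1) after the fault of R2 (event f_R2).
R2 = {'init': 'idle', 'events': {'f_R2'},
      'delta': {('idle', 'f_R2'): 'fault'}}
FTM_toy = {'init': 'on_R2', 'events': {'f_R2', 'wmn1'},
           'delta': {('on_R2', 'f_R2'): 'moving',
                     ('moving', 'wmn1'): 'on_R1'}}

G = parallel(R2, FTM_toy)
```

Composing more than two automata, as in the definition of G1, amounts to applying the same function repeatedly, since the operation is associative.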
2.5 A pilot plant: the two tank system
Fig. 2.26 shows the behavior of the Decision Logic Unit which supervises the whole system.
In particular the table shows all the possible strings (changes of WM) generated by the
supervisor according to the current WM and to the occurred event: for each state (in bold) the
number of transitions and the possible events, each followed by the new state, are shown. By
exploring this table it is possible to check that for each state there exists just one control
action imposed by the supervisors, while no possible failures (which are uncontrollable events)
are disabled. This means that the supervision strategy is unique.
\[
\begin{aligned}
S\dot L_1 &= Q_1 + Q_{12} - Q_{F1}\\
S\dot L_2 &= Q_2 - Q_{12} - Q_{F2}\\
Q_{12} &= \frac{L_1 - L_2}{R_{12}}
\end{aligned}
\tag{2.1}
\]
with
\[ Q_{F1} = \frac{L_1}{R_1} \tag{2.2} \]
\[ Q_{F2} = \frac{L_2}{R_2} \tag{2.3} \]
where R12 is the throttling of valve V12 , R1 is the throttling of valve V1 , R2 is the throttling of
valve V2 and S is the section of the tank. This means that if the throttling is ∞ the valve is
closed.
The mathematical model of the system is then
\[
\begin{aligned}
S\dot L_1 &= -\frac{L_1}{R_1} + \frac{L_1 - L_2}{R_{12}} + Q_1\\
S\dot L_2 &= -\frac{L_2}{R_2} - \frac{L_1 - L_2}{R_{12}} + Q_2 .
\end{aligned}
\tag{2.4}
\]
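To make model (2.4) concrete, the following Python sketch evaluates its right-hand side with the signs exactly as written above. The numerical values of S, R1, R2, Q1 and Q2 are hypothetical and chosen only for illustration; passing R12 = ∞ models the closed connection valve, in which case the inter-tank flow vanishes and the two tanks decouple.

```python
import math


def two_tank_rhs(L1, L2, Q1, Q2, S, R1, R2, R12):
    """Right-hand side of model (2.4), returned as (dL1/dt, dL2/dt).

    R12 = math.inf represents the closed connection valve.
    """
    q12 = (L1 - L2) / R12            # inter-tank flow Q12, see (2.1)
    dL1 = (-L1 / R1 + q12 + Q1) / S
    dL2 = (-L2 / R2 - q12 + Q2) / S
    return dL1, dL2


# Hypothetical parameters (not taken from the pilot plant).
S, R1, R2 = 2.0, 4.0, 5.0
Q1, Q2 = 0.5, 0.2

# Valve closed: L1 = R1*Q1 and L2 = R2*Q2 is an equilibrium of the model.
closed = two_tank_rhs(R1 * Q1, R2 * Q2, Q1, Q2, S, R1, R2, math.inf)

# Valve open (R12 = 4): the same levels are no longer an equilibrium.
coupled = two_tank_rhs(2.0, 1.0, Q1, Q2, S, R1, R2, 4.0)
```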
We consider, as control inputs of the system, the variables Q1 , Q2 and, as additional input,
R12 . The controlled outputs of the system are:
\[
\begin{aligned}
y_1 &= L_1 + L_2\\
y_2 &= \frac{L_1}{L_2} .
\end{aligned}
\tag{2.5}
\]
We want these two outputs to follow two desired set points, denoted respectively by y1∗ and y2∗ .
According to (2.2) and (2.3), these outputs are proportional to the total flow rate at the output
of the two tanks and to the flow rate ratio between the two tanks (as R1 and R2 are assumed
constant). In particular note that we can rewrite the desired set points y1∗ and y2∗ as desired set
points L∗1 and L∗2 for the measured levels L1 and L2 . Specifically, from (2.5) we have
\[ L_1^* + L_2^* = y_1^*, \qquad \frac{L_1^*}{L_2^*} = y_2^* \]
and thus
\[ L_1^* = \frac{y_1^*\, y_2^*}{1 + y_2^*}, \qquad L_2^* = \frac{y_1^*}{1 + y_2^*} . \tag{2.6} \]
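The conversion (2.6) from output set points to level set points can be checked numerically with a short Python sketch. The function name and the numerical values are hypothetical and serve only as an example.

```python
def level_setpoints(y1_ref, y2_ref):
    """Level set points (L1*, L2*) from the output set points, eq. (2.6)."""
    L2_ref = y1_ref / (1.0 + y2_ref)
    L1_ref = y2_ref * L2_ref
    return L1_ref, L2_ref


# Hypothetical set points: total level 3 and level ratio 2.
L1_ref, L2_ref = level_setpoints(3.0, 2.0)
```

With these values the sum of the two levels returns y1∗ and their ratio returns y2∗, as required by (2.5).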
• Fault on the connection valve V12 : the valve is stuck at a constant value R12 = R̄12 , which
can be ∞ (namely the valve fails in stuck-closed mode) or a constant finite positive value.
• Leakage fault on tank 2: due to a hole in the tank the dynamics of L2 is corrupted by a term
δQF 2 (L2 ) = L2 /δ, where δ is the section of the hole. In other words there is an undesired
outgoing flow from tank 2.
• Fault on level sensor of tank 2: the measure L2m of level L2 in tank 2 is corrupted by a
constant bias δL2 , i.e. L2m = L2 + δL2 .
• Fault on flow-rate sensor of tank 2: the measure Q2m of flow-rate Q2 in tank 2 is corrupted
by a constant bias δQ2 , i.e. Q2m = Q2 + δQ2 .
Diagnosis algorithms
The aim of this section is to give some guidelines about the generation of residual signals to
detect and isolate the different fault scenarios illustrated above. The first residual that can
be generated is based on a test on the control loop of pump 2; in fact the flow Q2 from pump 2
is a controlled variable, hence it can be considered known as an internal state Q2∗ of our
controller (in steady state, Q2∗ is equal to the reference for the flow from pump 2).
Moreover the flow Q2 is measured with a flow sensor. The signal r1 (t), obtained by comparing
Q2∗ with the measurement Q2m , is equal to zero in nominal conditions, while it is different from
zero in case pump 2 is stuck or the measure Q2m is not correct. In other words the signal r1 is
a residual for fault Q2 and for fault δQ2 .
Let us assume that the valve R12 is monitored with a hardware electrical test (for example
an electrical signal consistency test). This means that this test will result in a signal r2 (t)
which is able to detect a fault on valve R12 (more specifically, it is able to detect the
situation in which the connection valve is stuck); i.e. r2 (t) is a residual for the fault R12 .
Consider now equation 2.3. It states that in nominal conditions the outgoing flow
from tank two depends on the level of liquid in tank two, L2 , and on the throttling of the
outgoing valve, R2 . In case of leakage in tank two, it is clear that relation 2.3 does not hold
anymore, since there is also an outgoing flow due to the leakage. This means that the relation
becomes:
\[ \frac{L_2}{R_2} = Q_{F2} + \delta Q_{F2} . \]
With this in mind it is immediate to see that the signal
\[ r_3(t) = \frac{L_{2m}}{R_2} - Q_{F2} \tag{2.8} \]
is equal to zero in nominal conditions, while it is different from zero in case of leakage in
tank two or in case of a misreading of the level sensor; in other words r3 (t) is a residual for
faults δQF 2 and δL2 .
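A minimal Python sketch of how residual (2.8) could be evaluated and thresholded is given below. The threshold value, the helper names and the numbers are assumptions introduced for the example; the text does not specify how the residual is compared with zero in the presence of noise.

```python
def residual_r3(L2m, QF2, R2):
    """Residual (2.8): measured level over throttling minus measured outflow."""
    return L2m / R2 - QF2


def fired(residual, threshold=1e-3):
    """Hypothetical decision rule: the residual fires above a fixed threshold."""
    return abs(residual) > threshold


# Hypothetical operating point: R2 = 4, true level L2 = 2, outflow QF2 = 0.5.
R2_valve, L2_true, QF2_meas = 4.0, 2.0, 0.5

r3_nominal = residual_r3(L2_true, QF2_meas, R2_valve)        # healthy sensor
r3_biased = residual_r3(L2_true + 0.4, QF2_meas, R2_valve)   # bias dL2 = 0.4
```

In the nominal case the residual stays at zero, while a sensor bias of 0.4 produces a residual of about 0.1, well above the assumed threshold.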
As far as the detection of the sensor bias δL2 is concerned, we have two possible situations
depending on the status of R12 . In the case R12 < ∞ we have
\[ Q_{12} = \frac{L_1 - L_2}{R_{12}} . \]
Thus, since L1 and Q12 are measurable, it is possible to estimate L2 as
\[ \hat L_2 = L_{1m} - R_{12}\, Q_{12} . \]
From this it is possible to generate a residual signal r4′ (t) sensitive to δL2 as
\[ r_4'(t) = L_{2m} - \hat L_2 \]
and use the value of L2m − L̂2 to reconstruct the sensor bias and thus to estimate the level value.
On the other hand in the case R12 = ∞ (valve closed), it turns out that it is possible to estimate
L2 by means of the observer
\[
\dot{\hat L}_2 = -\frac{1}{S R_{12}} L_{1m}
 - \frac{1}{S}\left(\frac{1}{R_2} - \frac{1}{R_{12}}\right)\hat L_2
 + \frac{1}{S} Q_{2m} + K\,(L_{2m} - \hat L_2)
\]
where K is a suitably defined output constant, which, by defining the error variable
From this it is possible to generate a residual signal r4′′ (t) sensitive to δL2 and to δQ2 as:
Note that it makes sense to consider the two algorithms used to estimate the sensor faults and
to reconstruct the level measure as characterized by different reliability factors. As a matter of
fact it is expected that the algorithm to be run in the case R12 < ∞ has a higher confidence
level than the one to be used if R12 = ∞, as it is based on algebraic physical
relations. This, however, should be validated by simulation and experimental results.
Consider now the flow balance in tank 2 in steady state condition:
QF 2 = Q2 + Q12 + δQF 2 ;
by hypothesis the measures of QF 2 and Q12 are given and reliable, hence it is possible to use
this analytical redundancy to generate a last residual r5 (t) which is sensitive just to a fault on
the flow sensor Q2 and δQF 2 as:
All these considerations lead to the residual matrix in table 2.2. It is easy to verify, just by
inspection of table 2.2, that with this set of residual signals the system is fully detectable and
isolable with respect to the set of faults considered.
Hence the signals outgoing from the two FTMs are the estimate of the measure with its
confidence level (in the nominal case, the measure from the sensor with a confidence close to
100%) and the diagnosis signals concerning the two sensors.
Concerning the leakage fault on tank 2 (δQF 2 ), this is a fault which influences the local
functionality of tank 2. From a mathematical point of view, this fault introduces a change in the
parameters of the system. In fact from (2.4) it turns out that in this failure mode the system
dynamics becomes
\[
\begin{aligned}
S\dot L_1 &= -\frac{L_1}{R_1} + \frac{L_1 - L_2}{R_{12}} + Q_1\\
S\dot L_2 &= -\frac{L_2}{R_2} - \frac{L_1 - L_2}{R_{12}} + Q_2 + \delta Q_{F2}(L_2)
\end{aligned}
\]
namely
\[
\begin{aligned}
S\dot L_1 &= -\frac{L_1}{R_1} + \frac{L_1 - L_2}{R_{12}} + Q_1\\
S\dot L_2 &= -\frac{L_2}{R_2} - \frac{L_1 - L_2}{R_{12}} + Q_2 + \frac{L_2}{\delta} .
\end{aligned}
\tag{2.14}
\]
As far as this fault is concerned, it is hence possible to isolate and accommodate it simply
by reconfiguring the control of tank two. For this reason we also propose the introduction of
a fault tolerant control at tank two level. This FTC module should be dedicated to isolating and
accommodating the leakage fault without changing global objectives and performances.
Let us now consider the fault on the pump of tank 2 (namely Q2 = Q̄2 ). It is easy to
understand that this fault is crucial: it corrupts the global functionality of the whole
system, leading to a loss of controllability. In other words, reconfiguring after this fault
implies a change of global objectives. For this reason it will be considered at a higher level;
more in detail, an FTC module will be designed at the whole system level in order to isolate and
accommodate the effect of the fault on pump 2.
The last fault to consider is the fault on valve V12 . As explained previously, the detection of
this fault is mainly at hardware level. Moreover from figure 2.27 it is easy to see that the
accommodation too can be managed at hardware level thanks to the hardware redundancy V12 -V′12 .
For this reason the isolation of the fault will be delegated to a resource monitor, while the
accommodation will be managed by a global resource manager at the top level.
In view of all these considerations we propose the decomposition of the system sketched in
figure 2.28.
The internal structure of the supervisor is depicted in figure 2.29. The lowest level
supervisors are the LRM of FTC2 and the LRMs of FTM(L2 ) and FTM(Q2 ). As explained in
the previous section, FTC2 manages the leakage fault on tank 2, isolating and accommodating
this fault. For this reason all the three units are present within its supervisor to orchestrate
local working modes. On the contrary, in the LRMs of FTM(L2 ) and FTM(Q2 ) just the FDI units
are present, to manage the isolation of faults on sensors on the basis of the estimation of the
measures. In this case, since no explicit reconfiguration is necessary, neither an event
generator nor a decision logic unit is needed.
All the diagnostic signals given by the low level FDI units are sent to the higher level LRM.
This one manages the isolation and accommodation of the fault on pump 2. Since, as explained
previously, the accommodation of this fault requires a change in global objectives, all signals
from lower levels are required. An event generator unit within this supervisor decides, when-
ever the pump 2 is stuck, on the basis of the status of the low level FTC and FTM, which
reconfiguration action should be issued. For this reason in this LRM all the three units must be
present.
A different discussion should be addressed on the global resource manager (GRRM). This
global supervisor has two different tasks: the first one is to manage hardware reconfigurations
on the basis of information from resources monitors; the second task is to enable/disable possi-
ble local working modes on the basis of their resource needs, resource status and performances
requirements from the external operator. For this reason it presents only a decision logic unit
that manages reconfigurations in case of loss of performances and hardware reallocation, and
decides performance priorities according to external commands.
\[
\begin{aligned}
S\dot L_1 &= -\frac{L_1}{R_1} + Q_1\\
S\dot L_2 &= -\frac{L_2}{R_2} + Q_2 .
\end{aligned}
\tag{2.15}
\]
It is possible to achieve control objectives using two PI controllers, controlling levels L1 and L2
using respectively Q1 and Q2 in order to track references as specified in relations 2.6. We will
refer to this working mode as WM0.
In case R12 < ∞ the model of the system is represented by equations 2.4. The system is
coupled and must be controlled in order to achieve
L1 = L∗1
L2 = L∗2
where L∗1 and L∗2 are given by equations 2.6. To satisfy these objectives we can implement
two control architectures: use an optimal control strategy or, considering that
Q12 = (L1 − L2 )/R12 is measured, decouple the system through a feed-forward action and control
the two tanks in a decoupled way. We will refer to this working mode as WM0I .
These two nominal working modes represent two different functionalities of the system,
but they can also be used in order to augment the detectability of the system with respect to
the faults considered.
Reconfiguration of Q2 in case of decoupled tanks
First of all consider the fault on pump 2, namely the case in which pump 2 is stuck at a
constant value, Q2 = Q̄2 . If R12 = ∞ the system is represented by equations 2.15, hence the
two tanks are decoupled. In this case we lose a control variable (Q2 = Q̄2 = const.), i.e. one
degree of freedom, so we must decide which of the two control objectives in eq. 2.5 we want to
satisfy and reconfigure the trajectory of L1 accordingly. Since the incoming flow in tank 2 is
constant, the level in this tank will also settle to a constant value L̄2 , which can be
measured. Hence
• if we prefer to satisfy
L1 + L2 = y1∗
then we must compute the new trajectory L∗1 as
L∗1 = y1∗ − L̄2 ;
• if we prefer to satisfy
L1 / L2 = y2∗
then we must compute the new trajectory L∗1 as
L∗1 = y2∗ L̄2 .
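The two reconfiguration rules for the decoupled case can be summarized in a few lines of Python. The function name, the string labels used to select the objective and the numerical values are hypothetical; L2_bar denotes the measured constant level reached by tank 2 when pump 2 is stuck.

```python
def reconfigured_L1_ref(L2_bar, objective, y1_ref=None, y2_ref=None):
    """New set point for L1 when pump 2 is stuck and the tanks are decoupled.

    objective = 'sum'   keeps L1 + L2 = y1*  ->  L1* = y1* - L2_bar
    objective = 'ratio' keeps L1 / L2 = y2*  ->  L1* = y2* * L2_bar
    """
    if objective == 'sum':
        return y1_ref - L2_bar
    if objective == 'ratio':
        return y2_ref * L2_bar
    raise ValueError("objective must be 'sum' or 'ratio'")


# Hypothetical values: tank 2 has settled at L2_bar = 1.5.
L1_for_sum = reconfigured_L1_ref(1.5, 'sum', y1_ref=4.0)
L1_for_ratio = reconfigured_L1_ref(1.5, 'ratio', y2_ref=2.0)
```

Only one of the two objectives can be met at a time, which reflects the loss of one degree of freedom discussed above.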
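\[ \bar L_1 = L_1 - R_{12}\,\bar Q_2 . \tag{2.16} \]
The equations
\[
\begin{aligned}
S\dot L_1 &= -\frac{L_1}{R_1} + \frac{L_1 - L_2}{R_{12}} + Q_1\\
S\dot L_2 &= -\frac{L_2}{R_2} - \frac{L_1 - L_2}{R_{12}} + \bar Q_2 ,
\end{aligned}
\]
become
\[
\begin{aligned}
S\dot{\bar L}_1 &= -\frac{\bar L_1}{R_1} - \frac{R_{12}\bar Q_2}{R_1} + \frac{\bar L_1}{R_{12}}
 + \bar Q_2 - \frac{L_2}{R_{12}} + Q_1\\
S\dot L_2 &= -\frac{L_2}{R_2} - \frac{\bar L_1}{R_{12}} + \frac{L_2}{R_{12}} .
\end{aligned}
\]
Defining
\[ d = \frac{1}{S}\left(\frac{1}{R_1} - \frac{1}{R_{12}}\right) R_{12}\,\bar Q_2 \]
as an unknown term, it is possible to write the system as
\[
\begin{bmatrix} \dot{\bar L}_1 \\ \dot L_2 \end{bmatrix}
=
\begin{bmatrix}
-\frac{1}{S}\left(\frac{1}{R_1} - \frac{1}{R_{12}}\right) & -\frac{1}{S R_{12}} \\
-\frac{1}{S R_{12}} & -\frac{1}{S}\left(\frac{1}{R_2} - \frac{1}{R_{12}}\right)
\end{bmatrix}
\begin{bmatrix} \bar L_1 \\ L_2 \end{bmatrix}
+
\begin{bmatrix} 1 \\ 0 \end{bmatrix}
\left(\frac{Q_1}{S} - d\right) .
\tag{2.17}
\]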
Defining now
\[ \bar y_1^* = y_1^* - R_{12}\,\bar Q_2 \]
and the control error as
\[ e = y - \bar y_1^* = L_2 + L_1 - y_1^* \]
where y = L2 + L̄1 , it is easy to see that it is possible to design an error-feedback integral
control law for system 2.17 (the state of the obtained system is not available, since a state
feedback would require an estimate of Q̄2 ), by which the objective y1∗ is enforced without
estimating Q̄2 . We will call this working mode WM3.
Similarly, by enforcing objective y2∗ instead of y1∗ and following the same procedure as above,
we can define a working mode WM4 in which, using an error-feedback integral control law
for the obtained system, the objective y2∗ is enforced without estimating Q̄2 .
Consider the same change of variables expressed by 2.16. In the same situation, if we
estimate Q̄2 the term d is no longer unknown, hence it is possible to use a feed-forward action.
Moreover in this case we know all the state variables, because we know L1 , L2 and Q̄2 , so
a state feedback is possible, with better performance than the error feedback. If we choose
to achieve objective y1∗ the working mode will be denoted by WM5, while if objective y2∗ is
satisfied the working mode is WM6.
Now let us suppose that R12 can be controlled. In this way we increase the degrees of freedom of
the control, i.e. it is again possible to satisfy both objectives, because we lose the control
variable Q2 but we add the control variable R12 . The value R12∗ of R12 at each moment can be
computed starting from system 2.17, computing the steady state of L2 and using the constraints
y1∗ and y2∗ . Define
\[ \xi = \bar L_1 + L_2 - \bar y_1^* ; \]
from
\[ \dot L_2 = -\frac{1}{S R_{12}}\,\bar L_1 - \frac{1}{S}\,\frac{R_{12} - R_2}{R_{12} R_2}\, L_2 \]
we obtain
\[
\dot L_2 = -\frac{1}{S R_{12}}\left(\xi - L_2 + \bar y_1^*\right)
 - \frac{1}{S R_{12}}\left(\frac{R_{12}}{R_2} - 1\right) L_2 .
\]
Imposing the zero dynamics condition ξ̇ = ξ = 0 together with L̇2 = 0, we obtain the steady
state L∗2 compatible with the satisfaction of control objective y1∗ :
\[ L_2^* = \frac{y_1^* - R_{12}\,\bar Q_2}{1 - \left(\frac{R_{12}}{R_2} - 1\right)} . \tag{2.18} \]
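As a numerical sanity check of (2.18), the Python sketch below evaluates the steady-state level and, as an assumption that is not taken from the text, solves for the throttling R12 that makes this steady state coincide with the set point L2∗ = y1∗/(1 + y2∗) of (2.6). The closed form used in R12_for_both_objectives is obtained here simply by equating the two expressions; the function names and all numerical values are hypothetical.

```python
def L2_steady(y1_ref, Q2_bar, R12, R2):
    """Steady-state level of tank 2 from (2.18)."""
    return (y1_ref - R12 * Q2_bar) / (1.0 - (R12 / R2 - 1.0))


def R12_for_both_objectives(y1_ref, y2_ref, Q2_bar, R2):
    """Throttling R12 for which (2.18) equals L2* = y1*/(1 + y2*).

    Derived for this sketch by equating (2.18) and (2.6); the result must
    still be checked against the feasible range of the valve.
    """
    den = y1_ref / R2 - Q2_bar * (1.0 + y2_ref)
    if den == 0.0:
        raise ValueError("no finite R12 satisfies both objectives")
    return y1_ref * (1.0 - y2_ref) / den


# Hypothetical values: y1* = 3, y2* = 0.5, R2 = 1, pump 2 stuck at Q2_bar = 1.
R12_star = R12_for_both_objectives(3.0, 0.5, 1.0, 1.0)
L2_star = L2_steady(3.0, 1.0, R12_star, 1.0)
```

A negative or otherwise out-of-range value of R12_star would indicate that the two objectives cannot be met simultaneously with the available valve.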
From 2.20 it is possible to compute the value that R12 should assume in case Q2 = Q̄2 in
order to satisfy both objectives y1∗ and y2∗ :
On the contrary, when the nominal working mode is WM0I (i.e. the valve V12 is open), if a
fault on pump 2 occurs the allowed WMs are:
• WM7 if we can control R12 and the required steady state for R12 is within a feasible range.
This reconfiguration strategy is to be preferred because it allows both objectives y1∗ and y2∗
to be satisfied. It is important to stress that this WM requires a reliable estimate of Q2 ,
hence if the measure of this variable has a low confidence this strategy should not be chosen.
• WM5/6 if we can estimate Q2 with a high confidence; these strategies allow just
one of the control objectives to be satisfied; moreover, if the measure of Q2 (from the FTM)
has a low confidence these strategies should not be chosen.
• WM3/4 if we cannot estimate Q2 with a satisfactory performance; these strategies allow
one of the control objectives to be satisfied with lower dynamic performances.
The Global Resource and Reconfiguration Manager decides which control objective we
must satisfy according to the performance objectives, decides whether the confidence level of
the estimates is sufficiently high and assigns priority among WM3/4, WM5/6 and WM7 as explained
above. Moreover the GRRM manages the fault on valve V12 using a hardware redundancy policy.
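The priority rule among WM7, WM5/6 and WM3/4 can be sketched as a small decision function in Python. The function name, its arguments and in particular the numerical confidence threshold are assumptions introduced for this example; the text only states that the confidence must be sufficiently high.

```python
def select_working_mode(can_control_r12, r12_feasible, q2_confidence,
                        objective, conf_threshold=0.8):
    """Pick a working mode after a fault on pump 2 with valve V12 open.

    objective is 'y1' or 'y2' and is used only when a single control
    objective can be kept (WM3/WM5 keep y1*, WM4/WM6 keep y2*).
    conf_threshold is a hypothetical tuning value.
    """
    reliable_q2 = q2_confidence >= conf_threshold
    if can_control_r12 and r12_feasible and reliable_q2:
        return 'WM7'                      # both objectives are kept
    if reliable_q2:
        return 'WM5' if objective == 'y1' else 'WM6'
    return 'WM3' if objective == 'y1' else 'WM4'


# Hypothetical situations.
mode_a = select_working_mode(True, True, 0.95, 'y1')
mode_b = select_working_mode(False, False, 0.95, 'y2')
mode_c = select_working_mode(True, True, 0.30, 'y1')
```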
2.6 Conclusions
In this chapter a possible distributed architecture for fault tolerant control of complex systems
has been introduced. This architecture is modular and the different modules are divided following
functional reasoning. Different functions which can be degraded by faults are associated with
fault tolerant (control/measure) modules that are able to detect the fault and counteract its
effect in order to assure certain performances in the execution of the required function. Each
module is provided with a reconfiguration manager, namely a supervisor which orchestrates
the reconfiguration within the module. All the modules are orchestrated together via high
level supervisors, the (global/group) resource and reconfiguration managers, whose aim is also to
allocate the functions optimally over the physical resources, considering that these have a
limited capacity and can also be affected by faults.
In this chapter some ideas were presented on the use of classical failure analysis tools to
design the structure of FTC/M modules within the IFATIS architecture. More in detail we
have shown how tools like Fault Tree Analysis (FTA) and Structural Analysis (SA) are useful to
divide the system into functionalities and sub-functionalities. To each of them an FTC/M module
is associated, achieving modularity and hierarchy. These failure analysis tools have proved to
be useful in order to design the structure of the local supervisor of each module. Moreover it
has been shown what information is required to design the local and global supervisors and how
the theoretical machinery of discrete event system supervision can be used in order to design
the dynamical part of the supervisors.
Chapter 3
Reliability of complex diagnosis systems
3.1 Introduction
The aim of this chapter is to introduce a framework and a procedure for the reliability com-
putation of complex diagnosis systems using statistical tools. It enriches the tools available
to engineers for analysis and design of diagnosis systems. Generally speaking a possible ap-
proach to the design of complex diagnosis and reconfiguration systems is illustrated in [10]
(see also [50]) where the whole design procedure is divided in two parts regarding respectively
the analysis of the diagnostic system and the design of diagnosis and reconfiguration algo-
rithms. The first step of the analysis procedure is the fault modeling. A Failure Modes and
Effects Analysis (FMEA) is normally used to describe the system in terms of failures modes
and functional discrepancies. Then a fault propagation analysis investigates how direct fault
effects propagate through the functional system. This analysis leads to a severity assessment of
the possible faults and to a reverse propagation analysis which makes it possible to find where
and how to detect and stop faults. The last step of the analysis procedure consists in selecting
remedial actions for all the faults considered in the design and represents a key phase for the
effectiveness of the whole diagnostic system. As far as the design part is concerned, it is usually
divided in three steps which involve respectively the detector design, namely the design of the
Fault Detection and Isolation (FDI) algorithms, the effector design (namely the design of the
reconfiguration procedures) and the supervisor design.
Our goal is to enrich the general framework described above with a procedure for the
reliability computation of the complex diagnostic system. Works that can be cast in this
framework are [109, 110], where Markov Chains are used to perform reliability analysis of Fault
Tolerant Control (FTC)
systems. For the purpose of this work a complex diagnostic system is regarded as a system
composed of a number of elementary subsystems affected by a variety of possible faults and a
number of diagnostic algorithms which run simultaneously in order to generate residual
signals which are sensitive to one or more faults. In this framework we are interested in
developing a procedure that quantifies the reliability of the overall diagnosis system in terms
of its capability of not generating false alarms and missed diagnoses, starting from the
processing of the residual signals generated by the diagnostic algorithms.
It must be noted that different factors influence the reliability of a complex system where
several faults can arise at a certain time and where several diagnosis algorithms are simultane-
ously running to detect faults. A first factor is that not all the faults may have the same impact
on the reliability of the whole system as they are usually characterized by different occurrence
rates and different severity levels. A second factor is represented by the features of each di-
agnosis algorithm running in the diagnostic system which, since it is designed using different
strategies such as hardware or analytical redundancy and different techniques, is character-
ized by its own reliability in terms of capability of detecting occurred faults and avoiding false
alarms. A further aspect which is distinctive in complex systems is given by the presence of
fault propagation phenomena intended as the possibility that the occurrence of a specific fault
can generate different failures or spoil the features of diagnosis algorithms. The framework
proposed is able to capture all these aspects as it is based on a statistical description of the ex-
ogenous faults and of the diagnostic algorithms and on the use of propagation rules. This goal
is achieved by suitably adapting the mathematical tools presented in [9] to the case of
diagnostic systems. In particular the reliability computation is based on a four-step procedure
which, starting from the statistical description of the exogenous faults, the reliability of
diagnosis subsystems and the description of the complex system, yields a measure of the
reliability of the
overall system intended as capability of detecting faults by processing the available residual
signals without generating false alarms.
The procedure proposed in this chapter can be a useful tool both for the off-line design of
the diagnostic system and for the on-line design of the FDI unit. As far as the usefulness of the
proposed approach for the off-line design of the diagnostic system, it is worth noting that the
procedure for predicting the reliability of the system given the reliability of the physical com-
ponents and of the diagnostic algorithms lends itself to be used as a tool for identifying the op-
timal dimensioning of the physical components subject to faults and of the diagnostic algorithms
in order to achieve a prescribed reliability for the complex system. In this respect the
procedure presented here can be seen as an interesting tool for considering reliability as a
criterion which underlies the design of a diagnostic system, by solving the typical tradeoff
between the "quality" of the physical components and of the diagnostic algorithms composing the
complex system and their "costs" in terms of economical impact, computational burden, etc. This
feature will be formalized by providing, as outcome of the statistical analysis, a Hazard matrix
(see [46]) well-suited for checking reliability specifications.
On the other hand the proposed approach represents an interesting tool also for the on-
line design of the FDI unit. As a matter of fact, as clarified throughout the chapter, one of
the sub-products of the statistical analysis presented is to generate a statistical residual matrix
namely a matrix which has as many rows as the number of possible faults, as many columns
as the number of residual signals and whose element in the i-th row and j-th column is a real
number representing the probability that if the j-th residual signal arises then the i-th fault
happened. In this regard the analysis presented here is also useful in order to implement an
on-line FDI unit based on statistical considerations, as it makes it possible not only to detect
and possibly isolate faults from the joint analysis of the residual signals but also to come out
with a measure of the probability of the right detection and isolation. This feature, in the
cases in which the deterministic isolation of faults is not possible because the number of
independent residual signals is small with respect to the number of possible faults, allows for
a statistical isolation of the faults, determining which fault more likely happened.
3.2 Statistic Tools
\[ \lambda(t) = \lambda \]
so that:
\[ R(t) = e^{-\lambda t} \]
which implies that the failure-free operating time τ is exponentially distributed.
• a diagnosis subsystem Dk , affected by (f1Dk , f2Dk ), able to detect (and not necessarily to isolate)
fjSi ,
3.3 A Framework for Reliability of Diagnosis Systems
Figure 3.1: (a) Structure of the complex diagnosis system. (b) Structure of the elementary cell
EC(Si , Dk , fjSi , (f1Dk , f2Dk )).
We stress the fact that in the definition of the elementary cell the subsystem Si is allowed to
be influenced just by a single failure mode even if, in the description given above, more faults
can affect a single subsystem. In our context this means that if Si is affected by ni different
faults, it generates ni different elementary cells. Similarly the diagnosis subsystem Dj in an
elementary cell is required to detect the presence of a single failure mode fjSi . Hence, in
case Dj is sensitive to kj > 1 different failures, it generates kj different elementary cells.
It is easy to realize that, according to the definition of elementary cells given above, the num-
ber of elementary cells which are generated from the complex diagnosis system is equal to the
number of ones in the residual matrix. In the following we shall denote the elementary cell as
EC(Si , Dk , fjSi , (f1Dk , f2Dk )).
3.3.2 Step 2. Computation of the Reliability Function for the Elementary Cells (definition
of the required function)
The next step is the computation of the reliability function (see [9]) for each single elementary
cell which, in this second step, is still assumed isolated from the context. To do this we need
to define the required function of the item (i.e. the required function of the elementary
cell). The elementary cell works as follows (see figure 3.2): if no fault fjSi has occurred on
the component, then the cell is in a nominal healthy state (H in figure 3.2), which
is assumed to be safe and reliable (S-R). If the fault fjSi occurs, then the cell moves to a
faulty state (F in figure 3.2), which is assumed to be unsafe (NS). At this point, if the
detection is missed (f2Dk ) the cell remains in the unsafe state, while if the detection is
performed (¬f2Dk ), a suitable reconfiguration action is taken based on the detected fault and
the system goes to a reconfigured state (R in figure 3.2), which is assumed to be safe but not
reliable (S-NR). Similarly, if a false alarm occurs (f1Dk ) the cell moves from the state H to
the state R, which means that the cell remains safe but is no longer reliable. The meaning of
the states in terms of safety and reliability is summarized in table 3.1. With this in mind, the
function required
Figure 3.2: Functional model of the elementary cell EC(Si , Dk , fjSi , (f1Dk , f2Dk )).
Table 3.1: Meaning of the states in figure 3.2 in terms of safety and reliability.
of our diagnostic cell will be to remain safe (we will refer to the reliability function of this
required function as the “safe state reliability” RS (t)) or to remain reliable (we will refer
to the reliability function of this required function as the “reliable state reliability”
RR (t)). Obviously the condition of remaining reliable is more restrictive than the condition of
remaining safe, since in the former we do not allow false alarms. It is important to stress that
in the sequel the hypothesis of statistically independent failure modes will be adopted.
In this sense, the two reliability functions of the elementary cell
EC(Si , Dk , fjSi , (f1Dk , f2Dk )) will in principle be affected by the occurrence rate of the
exogenous fault fjSi and of the faults (f1Dk , f2Dk ) (namely by the capability of the local
diagnoser Dk of detecting the occurred fault fjSi and of not generating false alarms). In order
to describe this we introduce the following definition, which attempts to qualify, from a
statistical point of view, the faults affecting the elementary cell.
Definition 3.2 (Statistical Description of the Faults). The faults occurrence for an elementary cell
respectively.
We now proceed in presenting in a more formal way what is the “safe state” and the “reliable
state” for a generic cell. This is highlighted in the next definition.
Definition 3.3 (Reliable State and Safe State for an Elementary Cell). The elementary cell
These two definitions mean the following: the system is safe if no fault happens, if a fault
occurs but a remedial action is performed, or even if the diagnoser generates a false alarm (and
hence a useless reconfiguration is performed); in case a fault happens or a false alarm
occurs the system is no longer reliable.
Now that the required function for an elementary cell has been defined we can present the
rule for the computation of the reliability function, the latter depending on the failure rates
of the faults acting on the cell (see [9]). We will assume that failure rates for failure modes
are constant. This means that we can work in calendar time, without the need of maintaining
information about the age of each system element rather than the system age.
Proposition 3.1 (Computation of the Reliability Functions of the Elementary Cell) (footnote 1)
Under the assumption of constant failure rates, the safe state reliability function of the
elementary cell is computed as:
\[
R^{S}_{ikj}(t) = R_j^{S_i}(t) + \left(1 - R_j^{S_i}(t)\right) \cdot R_2^{D_k}(t)
             = e^{-\lambda_j^{S_i} t} + \left(1 - e^{-\lambda_j^{S_i} t}\right) \cdot e^{-\lambda_2^{D_k} t}
\tag{3.1}
\]
where $R_j^{S_i}(t)$ is the reliability function of subsystem $S_i$ with respect to the fault
$f_j^{S_i}$, and $R_1^{D_k}(t)$ and $R_2^{D_k}(t)$ are the reliability functions of the diagnosis
subsystem $D_k$ with respect to false alarm and missed diagnosis respectively.
The reliable state reliability function of the elementary cell is computed as
\[
R^{R}_{ikj}(t) = R_j^{S_i}(t) \cdot R_1^{D_k}(t)
             = e^{-\lambda_j^{S_i} t} \cdot e^{-\lambda_1^{D_k} t}
\tag{3.2}
\]
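As a minimal numerical sketch of equations (3.1) and (3.2), the following Python function evaluates both reliability functions for one isolated elementary cell. The function name, argument names and the numerical rates are assumptions chosen for illustration; they do not come from the thesis.

```python
import math


def cell_reliabilities(lam_fault, lam_false_alarm, lam_missed, t):
    """Safe and reliable state reliability of one isolated elementary cell.

    lam_fault       -- constant failure rate of the fault f_j^{S_i}
    lam_false_alarm -- constant rate of diagnoser false alarms (f_1^{D_k})
    lam_missed      -- constant rate of missed diagnoses (f_2^{D_k})
    t               -- time, in the unit used for the rates
    """
    r_fault = math.exp(-lam_fault * t)              # R_j^{S_i}(t)
    r_false_alarm = math.exp(-lam_false_alarm * t)  # R_1^{D_k}(t)
    r_missed = math.exp(-lam_missed * t)            # R_2^{D_k}(t)
    safe = r_fault + (1.0 - r_fault) * r_missed     # equation (3.1)
    reliable = r_fault * r_false_alarm              # equation (3.2)
    return safe, reliable


# Hypothetical rates (per hour) and time horizon (hours), for illustration only.
safe, reliable = cell_reliabilities(1e-5, 6e-6, 6e-6, t=1000.0)
print(f"safe state reliability:     {safe:.6f}")
print(f"reliable state reliability: {reliable:.6f}")
```

With any choice of non-negative rates the safe state value is at least as large as the reliable state value, since a detected fault or a false alarm leaves the cell safe but not reliable.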
The computation of the reliability functions given in the previous proposition does not
take into account the effect of fault propagation between different elementary cells; in
other words, the reliability functions are computed as if the elementary cell were isolated. The
propagation phenomenon clearly affects the reliability of the overall complex system and is
considered in the next step.
Footnote 1: The same results given in Proposition 3.1 can be obtained by enriching the automaton
pictured in fig. 3.2 with the probability of occurrence of events, i.e. with the failure rates
presented in Definition 3.2. The stochastic automaton obtained is a Markov chain. Applying standard
analysis tools for Markov chains (see e.g. [26]) it is possible to recover equations 3.1 and 3.2.
EC(Si , Dk , fjSi , (f1Dk , f2Dk )). This is specialized in the following definition.
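then the safe state reliability function of the latter can be computed as
\[
R^{S}_{q,\ell,m}(t) = \delta^{S} \cdot R_m^{S_q}(t)
  + \left(1 - \delta^{S} \cdot R_m^{S_q}(t)\right) \cdot \delta_2^{D} \cdot R_2^{D_\ell}(t)
\tag{3.3}
\]
and
\[
R^{R}_{q,\ell,m}(t) = \delta^{S} \cdot R_m^{S_q}(t) \cdot \delta_2^{D} \cdot R_1^{D_\ell}(t)
\tag{3.4}
\]
where
a) $\delta^{S} = R^{S}_{ikj}(t)$ in case the fault $f_j^{S_i}$ propagates to the fault $f_m^{S_q}$
and $\delta^{S} = 1$ otherwise;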
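The structure of equations (3.3) and (3.4) can be sketched in Python as follows. This is only an illustration: the propagation factors are passed in as plain numbers (`delta_s` and `delta_d` are hypothetical parameter names), both default to 1 (no propagation), and the same diagnoser factor is used in both equations, as written above.

```python
def propagated_reliabilities(r_fault, r_false_alarm, r_missed,
                             delta_s=1.0, delta_d=1.0):
    """Propagated safe and reliable state reliability of one elementary cell.

    r_fault       -- R_m^{S_q}(t), reliability with respect to the fault
    r_false_alarm -- R_1^{D_l}(t), diagnoser reliability w.r.t. false alarms
    r_missed      -- R_2^{D_l}(t), diagnoser reliability w.r.t. missed diagnosis
    delta_s       -- propagation factor acting on the component (1 = none)
    delta_d       -- propagation factor acting on the diagnoser (1 = none)
    """
    effective_fault = delta_s * r_fault
    safe = effective_fault + (1.0 - effective_fault) * delta_d * r_missed  # (3.3)
    reliable = effective_fault * delta_d * r_false_alarm                   # (3.4)
    return safe, reliable


# Without propagation the expressions reduce to equations (3.1) and (3.2).
isolated = propagated_reliabilities(0.989, 0.994, 0.994)
# Hypothetical upstream cell with safe state reliability 0.999 propagating here.
propagated = propagated_reliabilities(0.989, 0.994, 0.994, delta_s=0.999)
print(isolated, propagated)
```

A propagation factor below 1 lowers both values, which reflects the statement that propagation reduces the reliability of the overall system.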
Now it is possible to derive two statistical residual matrices, both of which have as many rows
as the number of possible faults arising in the complex system and as many columns as the
number of residuals generated by the local diagnosis units. In the first one, called the reliable
state statistical residual matrix ($M^R$), the element in the i-th row and j-th column is a real
number representing the probability that, if the j-th residual signal arises, then the i-th fault
has occurred. In other words, the element in the i-th row and j-th column represents the
reliability of the j-th diagnostic test with respect to both missed diagnosis and false alarms
regarding the i-th fault. Concerning the second matrix, called the safe state statistical residual
matrix ($M^S$), the element in the i-th row and j-th column is a real number representing the
probability that, if the i-th fault has occurred, then the j-th residual signal arises. In other
words, the element in the i-th row and j-th column represents the reliability of the j-th
diagnostic test with respect to missed diagnosis.
These matrices seem to be the natural generalization of the classical deterministic residual
matrix (in which the elements are 0 or 1 depending on whether a residual signal is affected
or not by a fault), so that the description also includes the statistical modeling of components
and diagnostic algorithms. From such a description it is clear that these matrices can be
obtained from the residual matrix simply by substituting for the “ones” in the (i, j) cell the
propagated reliable state reliability of the cell involving the i-th fault detected by the j-th
residual for $M^R$, and the propagated safe state reliability of the cell involving the i-th fault
detected by the j-th residual for $M^S$.
It is interesting to note that, starting from the information contained in the statistical
residual matrix, it is possible to introduce the concept of statistical isolation of two (or more)
faults. As a matter of fact, even when two (or more) faults are indistinguishable by processing
the residual signals, comparing the reliability indices associated with the indistinguishable
faults makes it possible to conclude which fault more likely happened.
and the reliable state reliability index $R^{R}_{F_i}$ of the diagnostic algorithm with respect to
the fault $F_i$ is defined as
\[
R^{R}_{F_i} = \prod_{j=1}^{n} M^{R}_{ij} .
\]
The two statistical residual matrices can be further elaborated in order to obtain a measure
of reliability for the overall complex system for design specification. In fact the two matrices
can be used to fill the so-called Hazard Matrix, which is a matrix whose rows collect the classes
of fault effects (each class collecting effects which have the same severity) and whose columns
report the rate of occurrence of that event. A “cross” in the i-th row and j-th column element
means that the probability of occurrence of an event of the i-th class of severity is that specified
by the j-th column. Such a matrix is then suited to impose a specification involving reliability,
as it amounts to identifying forbidden and allowed regions (see figure 3.3). Consider the case in
which the rows of the hazard matrix are collected in three classes: safe and reliable class (S-R),
safe and not reliable class (S-NR), and not safe class (NS-NR). It is easy to see that if a fault
happens its effect is in the not safe class, hence its hazard is high, but if it is detected and a
remedial action is taken, then the effect of the fault is moved into the safe but not reliable class.
On the other hand, if a false alarm happens (hence a remedial action is taken even if no fault has
happened), its effect can also be classified in the safe but not reliable class.
With all these considerations in mind, it is natural to think of a fault as having two effects:
the effect of the reconfigured fault (which is in the safe but not reliable class) and the effect
of the not reconfigured fault (which is classified as unsafe). In this sense, considering a fault
characterized by a safe state reliability function $R_i^S$ and a reliable state reliability
function $R_i^R$, the probability of the not reconfigured fault effect is
\[
O_i^{(NS)} = k \left(1 - R_i^{S}\right)
\]
and the probability of the reconfigured fault effect is
\[
O_i^{(S-NR)} = k \left(1 - R_i^{R}\right),
\]
where k is a positive number representing a scaling factor. We can use this information to
improve our complex diagnostic system; in fact the reliability indexes can be augmented by changing
the physical component (augmenting the reliability of the component $R_{f_j}$) or by making the
diagnoser more reliable. The first option is more expensive, but lets us move the two crosses in
the hazard matrix, reducing the occurrence rate. The second option is of course less expensive, but
we need to trade off between false alarms and missed diagnosis: making the diagnoser more
sensitive means augmenting its reliability with respect to missed diagnosis and hence moves
the cross in the unsafe class towards lower occurrence rates. As a drawback, it decreases the
reliability of the diagnoser with respect to false alarms and hence it can move the cross in the
safe but not reliable class towards higher occurrence rates.
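The trade-off just described can be made concrete with a small Python sketch. The helper names are hypothetical, the scaling factor is taken as k = 1, propagation is ignored, and the "more sensitive" diagnoser values are invented for illustration (the baseline values are those of one injector cell in Tables 3.3 and 3.4).

```python
def isolated_cell(r_fault, r_false_alarm, r_missed):
    """Equations (3.1) and (3.2) evaluated on reliabilities at a fixed time."""
    safe = r_fault + (1.0 - r_fault) * r_missed
    reliable = r_fault * r_false_alarm
    return safe, reliable


def hazard_occurrences(r_safe, r_reliable, k=1.0):
    """Occurrence measures of the two effects of one fault."""
    unsafe = k * (1.0 - r_safe)                 # not reconfigured effect, O_i^(NS)
    safe_not_reliable = k * (1.0 - r_reliable)  # reconfigured effect, O_i^(S-NR)
    return unsafe, safe_not_reliable


# Baseline diagnoser: same reliability for false alarms and missed diagnosis.
baseline = hazard_occurrences(*isolated_cell(0.989, 0.994, 0.994))
# Hypothetical more sensitive diagnoser: fewer missed diagnoses, more false alarms.
sensitive = hazard_occurrences(*isolated_cell(0.989, 0.990, 0.998))

print("baseline  (unsafe, safe-not-reliable):", baseline)
print("sensitive (unsafe, safe-not-reliable):", sensitive)
```

In this sketch the more sensitive diagnoser lowers the unsafe occurrence while raising the safe but not reliable one, which is the movement of the two crosses discussed in the text.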
3.4 The common rail benchmark
[Figure residue: numbered component labels (1, 2, 4, 6, 7, 8, 9) and the circuit labels “High Pressure” and “Low Pressure”.]
following fault scenario. The low pressure pump (LPP) is an electro-mechanical component
and can be subject to a functional failure fLPP ; the temperature sensor (TS), the water in diesel
sensor (H2 OS) and the high pressure sensor (HPS) are subject to functional failures denoted
with fTS , fH2 OS and fHPS respectively. The electrical heater (HE) can be affected by a failure fHE ,
while the shut off valve (SOV) can fail due to the functional failure fSOV . The high pressure
pump (HPP) is a critical component: its mechanical nature makes this pump easily affected by
functional failures denoted by fHPP . The pressure control (PR) is actuated via an electrome-
chanical actuator which can be affected by a failure fPR . The rail (R) can have leakage problems
denoted by $f_R$. Finally the four injectors (EI) are subject to functional failures $f_{EI}^i$, i = 1, 2, 3, 4.
The following hypotheses are made in order to simplify the analysis problem:
H2 the central control unit (CCU) is considered a safe component (no failures are possible);
H3 the common rail system is considered isolated, i.e. external components are not affected
by failures.
Fault detection phase is performed via 6 electrical/analytical tests that generate 11 residual
signals by which it is possible to detect and isolate all the faults.
1. internal losses test: it consists of testing the closure capability of the four injectors with
respect to the combustion cylinder during the cut-off phase (when the accelerator is released).
This means testing the torque generated by the engine when the injectors are
closed (which should be zero). Using the information about engine rotation it is also possible
to detect which injector does not work properly. For this reason this test generates 4
residual signals (r_i, i = 1, 2, 3, 4), each of them sensitive to a fault on a specific injector;
2. external losses test: the aim of this test is to identify fuel losses by checking the pressure
inside a closed volume when the injectors are closed. This test generates a residual signal r5
which is sensitive to a failure in one of the injectors, a leakage in the rail, a failure on the
shut-off valve and a failure on the pressure regulator;
3. high pressure sensor test: the control unit gives the pressure regulator specific trajectories
to track and tests the information from the high pressure sensor. In this way a residual
signal r6 can be generated to detect failures on the high pressure sensor and on the pressure
regulator;
4. temperature sensor test: similarly to the case of the high pressure sensor test, the control
unit gives the heater specific temperature trajectories and tests the information from
the temperature sensor. The residual signal r7 generated from this test is able to detect
failures on the temperature sensor and on the heater regulator;
5. shut-off valve test: a command issued by the control unit to the low pressure pump generates
a specific mechanical response of the shut-off valve and hence a specific response
on the high pressure circuit which can be monitored using the high pressure sensor. This
test generates a residual r8 which is sensitive to failures on the low pressure pump, on the
shut-off valve and on the high pressure sensor;
6. electrical test on sensors: an electrical test on the output of the sensors generates 3 residual
signals (r9, r10 and r11) which are used to detect possible failures on these components.
f_LPP f_TS f_H2OS f_HPS f_HE f_SOV f_HPP f_PR f_EI^1 f_EI^2 f_EI^3 f_EI^4 f_R
r1 0 0 0 0 0 0 0 0 1 0 0 0 0
r2 0 0 0 0 0 0 0 0 0 1 0 0 0
r3 0 0 0 0 0 0 0 0 0 0 1 0 0
r4 0 0 0 0 0 0 0 0 0 0 0 1 0
r5 0 0 0 0 0 1 0 1 1 1 1 1 1
r6 0 0 0 0 0 0 1 1 0 0 0 0 0
r7 0 1 0 0 1 0 0 0 0 0 0 0 0
r8 1 0 0 1 0 1 0 0 0 0 0 0 0
r9 0 0 1 0 0 0 0 0 0 0 0 0 0
r10 0 1 0 0 0 0 0 0 0 0 0 0 0
r11 1 0 0 0 0 0 0 0 0 0 0 0 0
Table 3.2: Residual matrix for the common rail system.
The residual matrix obtained is reported in tab. 3.2. It is immediate to see that both detection
and isolation of all the considered failures are possible.
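The detection and isolation claim can be checked mechanically on the residual matrix of tab. 3.2. The following Python sketch (variable names are illustrative) treats each column as the signature of one fault: a fault is detectable if its signature is not all zeros, and all faults are isolable if the signatures are pairwise distinct.

```python
FAULTS = ["fLPP", "fTS", "fH2OS", "fHPS", "fHE", "fSOV", "fHPP", "fPR",
          "fEI1", "fEI2", "fEI3", "fEI4", "fR"]

# Rows r1..r11 of tab. 3.2, one column per fault in the order of FAULTS.
RESIDUAL_MATRIX = [
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],  # r1
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],  # r2
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],  # r3
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],  # r4
    [0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1],  # r5
    [0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],  # r6
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # r7
    [1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],  # r8
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # r9
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # r10
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # r11
]


def fault_signatures(matrix):
    """Return one tuple per fault: the column of the residual matrix."""
    return [tuple(row[j] for row in matrix) for j in range(len(matrix[0]))]


signatures = fault_signatures(RESIDUAL_MATRIX)
all_detectable = all(any(sig) for sig in signatures)      # no all-zero column
all_isolable = len(set(signatures)) == len(signatures)    # columns pairwise distinct
number_of_cells = sum(sum(row) for row in RESIDUAL_MATRIX)

print(all_detectable, all_isolable, number_of_cells)
```

The count of non-zero entries is the number of elementary cells used in the reliability analysis that follows.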
The residual matrix of the system is composed of 11 rows and 13 columns with 21 elements
set to one. As previously explained, this fact implies that in reliability analysis we will
consider 21 elementary cells, each of them characterized by intrinsic reliability functions
computed starting from the occurrence rates of faults, false alarms and missed diagnosis according
to equations (3.1) and (3.2).
f_LPP f_TS f_H2OS f_HPS f_HE f_SOV f_HPP f_PR f_EI^1 f_EI^2 f_EI^3 f_EI^4 f_R
0.999 0.999 0.994 0.998 0.999 0.998 0.989 0.999 0.989 0.989 0.989 0.989 0.977
Table 3.3: Reliability data for components of the common rail system (DATA SET 1).
                 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11
False alarms     0.994 0.994 0.995 0.995 0.994 0.992 0.991 0.994 0.993 0.993 0.994
Missed diagnosis 0.994 0.994 0.995 0.995 0.994 0.992 0.993 0.994 0.993 0.993 0.994
Table 3.4: Reliability data for diagnostic tests (DATA SET 1).
Following an FPA-like procedure we studied the fault propagations between the elementary cells.
For the sake of clarity just the most severe propagations are considered in the analysis, i.e.:
• a fault on the temperature sensor propagates to a fault on the thermic heater;
• a fault on the low pressure pump propagates to a mechanical fault on the valve (stuck
closed) and to a loss of diagnosis failure of the test based on this valve;
• a fault on the water in diesel sensor propagates to a mechanical failure on the high pressure
pump, to a failure on the injectors due to low lubrication and to false alarms of the high
pressure sensor test;
• a fault on the pressure controller may propagate to failures on the rail, to failures on the
injectors due to the excessive pressure and to false alarms on the residuals generated via the
internal losses test.
The reliability of physical components with respect to the failures described above is
reported in table 3.3, while the reliability of diagnostic tests with respect to missed diagnosis
and false alarms is shown in table 3.4 (footnote 2). Using this information, we are able to compute
both the safe state statistical residual matrix in tab. 3.5 and the reliable state statistical
residual matrix in tab. 3.6, simply by substituting for each element set to 1 in the residual
matrix the propagated safe state and reliable state reliability of the corresponding elementary
cell, computed using equations (3.3) and (3.4).
f_LPP f_TS f_H2OS f_HPS f_HE f_SOV f_HPP f_PR f_EI^1 f_EI^2 f_EI^3 f_EI^4 f_R
r1 0 0 0 0 0 0 0 0 0.9993 0 0 0 0
r2 0 0 0 0 0 0 0 0 0 0.9993 0 0 0
r3 0 0 0 0 0 0 0 0 0 0 0.9994 0 0
r4 0 0 0 0 0 0 0 0 0 0 0 0.99994 0
r5 0 0 0 0 0 0.9999 0 0.9999 0.9993 0.9993 0.9993 0.9993 0.9986
r6 0 0 0 0 0 0 0.9991 0.9999 0 0 0 0 0
r7 0 0.9999 0 0 0.9999 0 0 0 0 0 0 0 0
r8 0.9999 0 0 0.9999 0 0.9999 0 0 0 0 0 0 0
r9 0 0 0.9996 0 0 0 0 0 0 0 0 0 0
r10 0 0.9999 0 0 0 0 0 0 0 0 0 0 0
r11 0.9999 0 0 0 0 0 0 0 0 0 0 0 0
Table 3.5: Statistical safe residual matrix for the common rail system. (DATA SET 1)
f_LPP f_TS f_H2OS f_HPS f_HE f_SOV f_HPP f_PR f_EI^1 f_EI^2 f_EI^3 f_EI^4 f_R
r1 0 0 0 0 0 0 0 0 0.983 0 0 0 0
r2 0 0 0 0 0 0 0 0 0 0.983 0 0 0
r3 0 0 0 0 0 0 0 0 0 0 0.984 0 0
r4 0 0 0 0 0 0 0 0 0 0 0 0.984 0
r5 0 0 0 0 0 0.992 0 0.993 0.983 0.983 0.983 0.983 0.971
r6 0 0 0 0 0 0 0.981 0.991 0 0 0 0 0
r7 0 0.99 0 0 0.99 0 0 0 0 0 0 0 0
r8 0.993 0 0 0.992 0 0.992 0 0 0 0 0 0 0
r9 0 0 0.987 0 0 0 0 0 0 0 0 0 0
r10 0 0.992 0 0 0 0 0 0 0 0 0 0 0
r11 0.993 0 0 0 0 0 0 0 0 0 0 0 0
Table 3.6: Statistical reliable residual matrix for the common rail system. (DATA SET 1)
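The substitution procedure can be sketched in Python for the reliable state matrix. This is a simplified illustration, not the full procedure: all propagation factors are assumed equal to 1, so each non-zero entry is just the product in equation (3.2) of a component reliability (Table 3.3) and a false alarm reliability (Table 3.4). The dictionary names are illustrative.

```python
COMPONENT_RELIABILITY = {  # Table 3.3 (DATA SET 1)
    "fLPP": 0.999, "fTS": 0.999, "fH2OS": 0.994, "fHPS": 0.998, "fHE": 0.999,
    "fSOV": 0.998, "fHPP": 0.989, "fPR": 0.999, "fEI1": 0.989, "fEI2": 0.989,
    "fEI3": 0.989, "fEI4": 0.989, "fR": 0.977,
}
FALSE_ALARM_RELIABILITY = {  # Table 3.4, first row (DATA SET 1)
    "r1": 0.994, "r2": 0.994, "r3": 0.995, "r4": 0.995, "r5": 0.994, "r6": 0.992,
    "r7": 0.991, "r8": 0.994, "r9": 0.993, "r10": 0.993, "r11": 0.994,
}
AFFECTED_FAULTS = {  # positions of the ones in the residual matrix (tab. 3.2)
    "r1": ["fEI1"], "r2": ["fEI2"], "r3": ["fEI3"], "r4": ["fEI4"],
    "r5": ["fSOV", "fPR", "fEI1", "fEI2", "fEI3", "fEI4", "fR"],
    "r6": ["fHPP", "fPR"], "r7": ["fTS", "fHE"],
    "r8": ["fLPP", "fHPS", "fSOV"],
    "r9": ["fH2OS"], "r10": ["fTS"], "r11": ["fLPP"],
}


def reliable_state_matrix():
    """Non-zero entries of the reliable state matrix, without propagation."""
    return {
        (residual, fault): COMPONENT_RELIABILITY[fault] * FALSE_ALARM_RELIABILITY[residual]
        for residual, faults in AFFECTED_FAULTS.items()
        for fault in faults
    }


M_R = reliable_state_matrix()
for (residual, fault), value in sorted(M_R.items()):
    print(f"{residual:>3} {fault:<6} {value:.3f}")
```

Under this simplifying assumption the products appear to agree, to three decimals, with the non-zero entries reported in Table 3.6; the propagation factors of equations (3.3) and (3.4) would enter as additional multiplicative terms.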
From these matrices it is possible to perform the hazard analysis. The result obtained is that,
over the mean life of the system, the probability of being in an unsafe state is equal to
$1.38 \cdot 10^{-4}$, which means that the probability of not detecting a fault is around 100 ppm.
Moreover the probability of being in an unreliable state is $3.36 \cdot 10^{-2}$, which means that
the probability of performing a false alarm or a bad diagnosis is in the order of magnitude of
10000 ppm.
Suppose now that there exists a gap between the specifications in terms of hazard and the results
obtained. This suggests that it is necessary to use higher quality components and/or more
reliable diagnosers. A first remark is that the water in diesel sensor is a crucial component:
since all the analytical redundancies are based on this sensor reading, a higher reliability is
required for this component. Another crucial component is the high pressure pump. This pump is a
mechanical system moved directly by the engine, which raises the fuel pressure thanks to three
pistons. The need for a more reliable pump suggested choosing an electrical component instead
of a mechanical one. The new reliability of physical components with respect to the failures is
reported in table 3.7, while the reliability of diagnostic tests is not changed from the values in
table 3.4. Using this new set of physical components, we obtain a new safe state statistical
residual matrix shown in tab. 3.8, and a new reliable state statistical residual matrix shown in
tab. 3.9.
Footnote 2: Data regarding reliability or failure rate over the mean life of physical components
are reported by component producers, while data regarding the reliability over the mean life of
diagnostic tests concerning false alarms and missed diagnosis can be obtained via simulation
methods (e.g. through the Monte Carlo method) or via experimental tests on circulating vehicles.
f_LPP f_TS f_H2OS f_HPS f_HE f_SOV f_HPP f_PR f_EI^1 f_EI^2 f_EI^3 f_EI^4 f_R
0.999 0.999 0.999 0.998 0.999 0.998 0.999 0.999 0.989 0.989 0.989 0.989 0.977
Table 3.7: Reliability data for components of the common rail system (DATA SET 2).
f_LPP f_TS f_H2OS f_HPS f_HE f_SOV f_HPP f_PR f_EI^1 f_EI^2 f_EI^3 f_EI^4 f_R
r1 0 0 0 0 0 0 0 0 0.9999 0 0 0 0
r2 0 0 0 0 0 0 0 0 0 0.9999 0 0 0
r3 0 0 0 0 0 0 0 0 0 0 0.9999 0 0
r4 0 0 0 0 0 0 0 0 0 0 0 0.9999 0
r5 0 0 0 0 0 0.9999 0 0.9999 0.9999 0.9999 0.9999 0.9999 0.9998
r6 0 0 0 0 0 0 0.9999 0.9999 0 0 0 0 0
r7 0 0.9999 0 0 0.9999 0 0 0 0 0 0 0 0
r8 0.9999 0 0 0.9999 0 0.9999 0 0 0 0 0 0 0
r9 0 0 0.9999 0 0 0 0 0 0 0 0 0 0
r10 0 0.9999 0 0 0 0 0 0 0 0 0 0 0
r11 0.9999 0 0 0 0 0 0 0 0 0 0 0 0
Table 3.8: Statistical safe residual matrix for the common rail system. (DATA SET 2)
The new hazard analysis brought the following new results. Over the mean life of the system,
the probability of being in an unsafe state is equal to $1.38 \cdot 10^{-5}$, which means that the
probability of not detecting a fault is around 10 ppm. On the contrary the probability of being in
an unreliable state is $1.8 \cdot 10^{-2}$, which means that the probability of performing a false
alarm or a bad diagnosis is in the order of magnitude of 10000 ppm. It is important to note that the
probability of false alarms does not depend on the reliability of physical components; hence,
enhancing the latter, we have decreased the probability of being in an unsafe state by a factor
of 10, while we did not obtain any significant improvement in the probability of being in an
unreliable state. For this reason, as a third analysis, we decided to increase also the reliability
of the diagnosers. The third analysis has been performed considering more reliable diagnostic tests
in terms of missed diagnosis and false alarms, as shown in tab. 3.10. The statistical residual
matrices obtained by this third analysis are shown in tables 3.11 and 3.12.
With the new set of diagnosers we obtained a probability of being in an unsafe state equal
to $5 \cdot 10^{-6}$, which means that the probability of not detecting a fault is around 1 ppm. On
the contrary the probability of being in an unreliable state is $6 \cdot 10^{-3}$, which means that
the probability of performing a false alarm or a bad diagnosis is in the order of magnitude of 600
ppm. Enhancing the reliability of the diagnosers we obtained a decrease of both the probability
of being in an unsafe state and of being in an unreliable state by a factor of 10.
f_LPP f_TS f_H2OS f_HPS f_HE f_SOV f_HPP f_PR f_EI^1 f_EI^2 f_EI^3 f_EI^4 f_R
r1 0 0 0 0 0 0 0 0 0.992 0 0 0 0
r2 0 0 0 0 0 0 0 0 0 0.992 0 0 0
r3 0 0 0 0 0 0 0 0 0 0 0.993 0 0
r4 0 0 0 0 0 0 0 0 0 0 0 0.993 0
r5 0 0 0 0 0 0.992 0 0.993 0.988 0.988 0.988 0.983 0.992
r6 0 0 0 0 0 0 0.999 0.991 0 0 0 0 0
r7 0 0.99 0 0 0.99 0 0 0 0 0 0 0 0
r8 0.993 0 0 0.992 0 0.993 0 0 0 0 0 0 0
r9 0 0 0.991 0 0 0 0 0 0 0 0 0 0
r10 0 0.992 0 0 0 0 0 0 0 0 0 0 0
r11 0.993 0 0 0 0 0 0 0 0 0 0 0 0
Table 3.9: Statistical reliable residual matrix for the common rail system. (DATA SET 2)
                 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11
False alarms     0.999 0.999 0.999 0.999 0.999 0.998 0.996 0.998 0.996 0.996 0.996
Missed diagnosis 0.999 0.999 0.999 0.999 0.999 0.998 0.998 0.999 0.996 0.996 0.996
Table 3.10: Reliability data for diagnostic tests (DATA SET 3).
f_LPP f_TS f_H2OS f_HPS f_HE f_SOV f_HPP f_PR f_EI^1 f_EI^2 f_EI^3 f_EI^4 f_R
r1 0 0 0 0 0 0 0 0 0.99999 0 0 0 0
r2 0 0 0 0 0 0 0 0 0 0.99999 0 0 0
r3 0 0 0 0 0 0 0 0 0 0 0.99999 0 0
r4 0 0 0 0 0 0 0 0 0 0 0 0.99999 0
r5 0 0 0 0 0 0.99999 0 0.99999 0.99999 0.99999 0.99999 0.99999 0.99998
r6 0 0 0 0 0 0 0.99999 0.99999 0 0 0 0 0
r7 0 0.99999 0 0 0.99999 0 0 0 0 0 0 0 0
r8 0.99999 0 0 0.99999 0 0.99999 0 0 0 0 0 0 0
r9 0 0 0.999999 0 0 0 0 0 0 0 0 0 0
r10 0 0.99999 0 0 0 0 0 0 0 0 0 0 0
r11 0.99999 0 0 0 0 0 0 0 0 0 0 0 0
Table 3.11: Statistical safe residual matrix for the common rail system. (DATA SET 3)
f_LPP f_TS f_H2OS f_HPS f_HE f_SOV f_HPP f_PR f_EI^1 f_EI^2 f_EI^3 f_EI^4 f_R
r1 0 0 0 0 0 0 0 0 0.998 0 0 0 0
r2 0 0 0 0 0 0 0 0 0 0.999 0 0 0
r3 0 0 0 0 0 0 0 0 0 0 0.999 0 0
r4 0 0 0 0 0 0 0 0 0 0 0 0.998 0
r5 0 0 0 0 0 0.999 0 0.999 0.997 0.998 0.998 0.983 0.999
r6 0 0 0 0 0 0 0.999 0.999 0 0 0 0 0
r7 0 0.997 0 0 0.997 0 0 0 0 0 0 0 0
r8 0.998 0 0 0.997 0 0.997 0 0 0 0 0 0 0
r9 0 0 0.995 0 0 0 0 0 0 0 0 0 0
r10 0 0.997 0 0 0 0 0 0 0 0 0 0 0
r11 0.995 0 0 0 0 0 0 0 0 0 0 0 0
Table 3.12: Statistical reliable residual matrix for the common rail system. (DATA SET 3)
3.5 Conclusions
The main contribution of this chapter is the introduction of a procedure to evaluate the
reliability of complex diagnosis systems. By a four-step procedure it has been shown how to compute
a reliability function associated with the capability of the whole system to remain in a safe state
or in a reliable state with respect to each fault, starting from a statistical description of the
exogenous faults and from the reliability of the diagnosis subsystems devoted to fault detection.
An example taken from the automotive field has been used to illustrate the effectiveness of the
procedure as an analysis tool but also as a design tool to improve the performance of the system
in terms of reliability.
Chapter 4
A discrete event approach to system monitoring
4.1 Introduction
This chapter addresses the problem of Fault Detection and Isolation (FDI) in the framework of
discrete event dynamic systems (DES). The main objective in FDI is to develop methodologies
for identifying and exactly characterizing possible incipient faults arising in the operation of a
dynamic system. This research problem has received considerable attention in the last several
years due among other factors to the increasing requirements on safety imposed on today’s
complex technological systems. In particular, many methodologies have been developed for
faults that are naturally modeled, and thus diagnosed, using a “higher-level” discrete-event
model of the system under consideration; see, e.g., [91, 92, 47, 14, 87, 97, 75, 55, 52, 7, 41, 31, 62]
for a representative sample of this work including references to successful industrial appli-
cations. In this work, we adopt the so-called “Diagnoser Approach” to fault diagnosis of DES
introduced in [91, 92] and surveyed in [56]. In this approach, the DES model includes “normal”
as well as “failed” behavior for a given set of faults modeled as unobservable events (i.e., not
directly measured by the system sensors). Diagnosis is the process of detecting on-line the oc-
currence of these faults using model-based inferencing driven by the observed event sequence.
This is achieved by the use of a special type of automaton, called the diagnoser, which is built
from the system model. In addition, the diagnoser can also be used to analyze (off-line) the
diagnosability properties of the system according to the formal definition introduced in [91]
that will be recalled later in this chapter. The problem of taking diagnosability into account in
system design and using control actions to alter the diagnosability properties of a given system
is considered in [90].
In this work, we are concerned with the diagnosis of faults that may lead to violations of
critical safety requirements if they are not detected and identified in a timely manner. To this
end, the new property of safe diagnosability is introduced and studied. Assume that a given sys-
tem is diagnosable (according to [91]). In safe diagnosability it is required in addition that fault
detection occur prior to the execution of a given set of forbidden strings in the failed mode of
operation of the system. For instance, this constraint could be required to prevent local faults
from developing into failures that could cause safety hazards. We view safe diagnosability as
a first necessary step in order to achieve fault tolerant supervision of DES. If the system is safe
diagnosable, reconfiguration actions could be forced upon the detection of faults prior to the
execution of unsafe behavior, thus achieving the objective of fault tolerant supervision. Our
main contributions are: (i) formal definition of the notion of safe diagnosability; (ii) deriva-
tion of implementable necessary and sufficient conditions to test for safe diagnosability; (iii)
formulation and solution of the problem of active safe diagnosis, where the requirement of safe
diagnosability is explicitly taken into account in controller design; (iv) discussion of the exten-
sion of safe diagnosability to timed models of DES.
Preliminary and partial versions of these results are contained in [76, 77, 78].
4.2 Preliminary notions
\[
G = (X, \Sigma, \delta, x_0) \tag{4.1}
\]
where $X$ is the state space, $\Sigma$ is the set of events, $\delta$ is the partial transition
function and $x_0$ is the initial state of the system. The behavior of the system is described by
the prefix-closed language $L = L(G)$ generated by $G$.
$L(G)$ is a subset of $\Sigma^*$, where $\Sigma^*$ denotes the Kleene closure of the set $\Sigma$,
i.e., the set of all finite strings of elements of $\Sigma$.
The event set $\Sigma$ is partitioned as:
\[
\Sigma = \Sigma_o \cup \Sigma_{uo}
\]
where $\Sigma_o$ represents the set of observable events (their occurrence can be observed) and
$\Sigma_{uo}$ represents the set of unobservable events. Moreover, some of the events are
controllable (it is possible to prevent their occurrence) while the rest are uncontrollable. Thus
the event set can also be partitioned as:
\[
\Sigma = \Sigma_c \cup \Sigma_{uc} .
\]
We associate with $\Sigma_o$ the (natural) projection $P$, $P : \Sigma^* \to \Sigma_o^*$, defined
in the usual manner. We refer the reader to table 4.1 and to Appendix A for any notation used in
the sequel but not defined in this section.
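To fix ideas, an automaton with a partial transition function and the finite part of its generated language can be sketched in Python as follows. The class name, the dictionary encoding of the transition function and the three-state toy automaton are assumptions made for illustration only.

```python
class Automaton:
    """Deterministic automaton G = (X, Sigma, delta, x0) with partial delta."""

    def __init__(self, transitions, initial):
        self.delta = dict(transitions)  # (state, event) -> next state
        self.x0 = initial

    def run(self, trace):
        """Return the state reached by `trace`, or None if it is not in L(G)."""
        state = self.x0
        for event in trace:
            key = (state, event)
            if key not in self.delta:
                return None
            state = self.delta[key]
        return state

    def language(self, max_len):
        """All traces of L(G) of length at most `max_len`, as tuples of events."""
        traces = {()}
        frontier = [((), self.x0)]
        for _ in range(max_len):
            next_frontier = []
            for trace, state in frontier:
                for (source, event), target in self.delta.items():
                    if source == state:
                        next_frontier.append((trace + (event,), target))
            traces.update(trace for trace, _ in next_frontier)
            frontier = next_frontier
        return traces


# Toy automaton: 1 --a--> 2 --u--> 3 --b--> 1.
G = Automaton({(1, "a"): 2, (2, "u"): 3, (3, "b"): 1}, initial=1)
L3 = G.language(3)
print(sorted(L3))
```

The set returned by `language` is prefix-closed by construction, since every trace is obtained by extending a shorter trace that is already in the set.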
Let Σf ⊆ Σ denote the set of failure events which can occur in the system. We assume
without loss of generality that:
Σf ⊆ Σuo
since an observable failure can be trivially diagnosed and that
Σf ⊆ Σuc
where $X_o$, $\Sigma_o$ and $x_0$ are defined as previously and the transition relation of $G'$ is given by
\[
\delta_{G'} \subseteq (X_o \times \Sigma \times X_o)
\]
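Nomenclature                          Meaning
$\varepsilon$                         the empty trace
$\bar{s}$                             the prefix-closure of trace $s \in \Sigma^*$
$\|s\|$                               length of trace $s$
$L/s$                                 the post language of $L$ after $s$, i.e., $L/s = \{t \in \Sigma^* \text{ s.t. } st \in L\}$
$P : \Sigma^* \to \Sigma_o^*$         projection $P$, i.e.,
                                      $P(\varepsilon) = \varepsilon$
                                      $P(\sigma) = \sigma$ if $\sigma \in \Sigma_o$
                                      $P(\sigma) = \varepsilon$ if $\sigma \in \Sigma_{uo}$
                                      $P(s\sigma) = P(s)P(\sigma)$, $s \in \Sigma^*$, $\sigma \in \Sigma$.
$P_L^{-1} : \Sigma_o^* \to \Sigma^*$  inverse projection, i.e., $P_L^{-1}(y) = \{s \in L \text{ s.t. } P(s) = y\}$
$s_f$                                 final event of trace $s$
$\Psi(\Sigma_{fi})$                   set of all traces that end in a failure event belonging to the
                                      class $\Sigma_{fi}$, i.e.,
                                      $\Psi(\Sigma_{fi}) = \{s\sigma_f \in L \text{ s.t. } \sigma_f \in \Sigma_{fi}\}$
$\sigma \in s$, $\sigma \in \Sigma$ and $s \in \Sigma^*$   $\sigma$ is an event in the trace $s$
$\Sigma_{fi} \in s$                   $\sigma_f \in s$ for some $\sigma_f \in \Sigma_{fi}$
Table 4.1: Nomenclature.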
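The operators listed in the nomenclature can be illustrated on a small finite language. In the following Python sketch traces are tuples of event names; the function names, the toy language and the choice of observable events are assumptions made for illustration.

```python
def project(trace, observable):
    """Natural projection P: erase the unobservable events of a trace."""
    return tuple(event for event in trace if event in observable)


def post_language(language, s):
    """L/s: the continuations t such that st belongs to the language."""
    return {t[len(s):] for t in language if t[:len(s)] == s}


def inverse_projection(language, observation, observable):
    """P_L^{-1}(y): the traces of the language whose projection is y."""
    return {s for s in language if project(s, observable) == observation}


# Toy prefix-closed language; "f" and "u" are unobservable.
L = {(), ("a",), ("a", "f"), ("a", "f", "b"), ("a", "u"), ("a", "u", "c")}
OBSERVABLE = {"a", "b", "c"}

print(project(("a", "f", "b"), OBSERVABLE))
print(post_language(L, ("a",)))
print(inverse_projection(L, ("a",), OBSERVABLE))
```

Here the observation `("a",)` is compatible with three traces of the language, one of which contains the unobservable event `f`; this ambiguity is what the diagnosability condition is concerned with.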
4.3 A motivating example
\[
G = \left( \parallel_{i=1}^{n} G_i^{nom} \right) .
\]
Automaton $G$ models the uncontrolled behavior of the DES; this behavior is not satisfactory and
must be modified by control; this means (see [26]) “restricting the behavior of G to a subset of its
generated language L(G)”. For this reason we design a supervisor $S^{nom}$ such that defining
\[
L(G_i^{n+f}) \supset L(G_i^{nom})
\]
\[
\Sigma_i^{n+f} = \Sigma_i^{nom} \cup \Sigma_{fi}
\quad \text{where} \quad
\Sigma_{fi} \subset \Sigma_{i,uo}^{n+f} \cap \Sigma_{i,uc}^{n+f} .
\]
This means that the real controlled behavior of the system is described by
\[
G^{n+f} = \left( \parallel_{i=1}^{n} G_i^{n+f} \right) \parallel S^{nom} .
\]
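The parallel composition used in these expressions can be sketched in Python as follows. Automata are encoded as (transitions, initial state, event set) triples; this encoding, the nominal valve and pump models and the small supervisor are hypothetical and serve only to illustrate the standard synchronous composition rule (shared events are executed jointly, private events by their owner alone).

```python
def parallel(a, b):
    """Parallel composition of two automata, restricted to reachable states."""
    (t1, x1_0, e1), (t2, x2_0, e2) = a, b
    initial = (x1_0, x2_0)
    transitions, seen, stack = {}, {initial}, [initial]
    while stack:
        x1, x2 = stack.pop()
        for event in e1 | e2:
            # A component moves only if the event belongs to its event set.
            n1 = t1.get((x1, event)) if event in e1 else x1
            n2 = t2.get((x2, event)) if event in e2 else x2
            if n1 is None or n2 is None:
                continue  # an owner of the event cannot execute it here
            target = (n1, n2)
            transitions[((x1, x2), event)] = target
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return transitions, initial, e1 | e2


def states(automaton):
    """Reachable states of an automaton returned by `parallel`."""
    transitions, initial, _ = automaton
    return {initial} | set(transitions.values())


VALVE = ({("VC", "open"): "VO", ("VO", "close"): "VC"}, "VC", {"open", "close"})
PUMP = ({("POFF", "start"): "PON", ("PON", "stop"): "POFF"}, "POFF", {"start", "stop"})
# Hypothetical supervisor: the pump may start only after the valve is opened.
SUPERVISOR = ({("s0", "open"): "s1", ("s1", "start"): "s2",
               ("s2", "stop"): "s1", ("s1", "close"): "s0"},
              "s0", {"open", "close", "start", "stop"})

plant = parallel(VALVE, PUMP)
closed_loop = parallel(plant, SUPERVISOR)
print(len(states(plant)), len(states(closed_loop)))
```

In this nominal sketch the supervised system never reaches the combination valve closed and pump on, whereas the uncontrolled plant does; a fault model would add transitions that the nominal supervisor does not account for.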
By simple observation of $G^{n+f}$, we can distinguish between a nominal part and a faulty part.
Of course in the nominal part there are no undesired actions, because the control prevents them,
but these undesired sequences of actions can arise in the faulty part due to a nominal
supervision of a faulty behavior. Hence to achieve fault tolerance we need not only to be
able to detect the occurred faults, but also to prevent undesired sequences of actions. For this
reason in the next section we introduce a new property of discrete event systems that assures
the possibility of detecting the occurred fault before an illegal string is performed: the notion
of safe diagnosability.
Example 4.1 Consider the simple system sketched in Fig. 4.1, consisting of a tank, a pump and a valve.
Let us assume that the only failure mode for the valve is stuck closed. The tank is equipped with a level
sensor, while the pipe can be equipped with either a flow sensor or a pressure sensor. The flow sensor reads
two possible values: F, meaning that there is a flow in the pipe, and NF, meaning that there is no flow in
the pipe. Similarly, the pressure sensor can read two possible values: P, meaning that there is pressure
in the pipe, and NP, meaning that there is no pressure in the pipe.
Suppose that we control the system to achieve the following behavior:
• if the level sensor says that the tank is full (event Level) the controller must respond by opening
the valve and starting the pump;
• if the level sensor says that the tank is not full (event Not Level) the controller must respond by
stopping the pump and closing the valve.
[Figure 4.1: system layout with the tank, the level sensor, the flow sensor, the pressure sensor, the valve, the pump and the controller.]
      PON      POFF
VO    F - NP   NF - NP
VC    NF - P   NF - P
Table 4.2: Sensor map.
Let us assume that the situation “valve closed and pump on” is risky and should be avoided. The FSM
modelling the controlled behavior of the system, denoted by $G^{n+f}$, is shown in Fig. 4.3; $G^{n+f}$ is
obtained by performing the parallel composition of the models in Fig. 4.2, and then incorporating the
sensor map (Table 4.2) according to the procedure in [92]. It is simple to identify the part of the
automaton modelling the nominal behavior and the part that models the behavior of the supervised system
after the occurrence of the fault; in the latter part there is a legal state (state 12) in which the
dangerous situation “valve closed and pump on” occurs. This means that due to the possible failure of the
valve, the system can enter this risky state. This situation should be avoided by detecting the failure
before the system goes into state 12.
[Figure 4.2: component models. (Valve): states VC, VO and SC, labels open, close and F. (Pump): states POFF and PON, labels start and stop. (Controller): labels Level and No Level.]
4.4 The notions of diagnosability and safe diagnosability
The liveness assumption (A1) is made for the sake of simplicity, while assumption (A2) ensures
that observations occur with some regularity; in other words we require that the system does
not generate arbitrarily long sequences of unobservable events.
A language L is diagnosable if it is possible to detect with a finite delay occurrences of
failures of any type using the record of observed events. Formally,
Definition 4.1 (Diagnosability [91]) A prefix-closed language L satisfying hypotheses (A1) and (A2)
is said to be diagnosable with respect to the projection P and with respect to the partition $\Pi_f$ if the
following holds:
\[
(\forall i \in \Pi_f)\,(\exists n_i \in \mathbb{N})\,(\forall s \in \Psi(\Sigma_{fi}))\,(\forall t \in L/s)\,
(\|t\| \ge n_i \Rightarrow D) \tag{4.8}
\]
where the diagnosability condition D is:
\[
\omega \in P_L^{-1}[P(st)] \Rightarrow \Sigma_{fi} \in \omega . \tag{4.9}
\]
[Figure 4.3: automaton $G^{n+f}$ modelling the controlled behavior of the system, with failure transitions labelled f and event labels such as <Level,NF,P>, <open,NF,NP>, <open,NF,P>, <start,F,NP> and <start,NF,P>.]
The above definition means the following. Let s be any trace generated by the system
that ends in a failure event from the set Σf i and let t be any sufficiently long continuation of
s. Condition D requires that every trace belonging to the language that produces the same
record of observable events as the trace st should contain in it a failure event from the set
Σf i . Or, in other words, diagnosability requires that every failure event leads to observations
distinct enough to enable unique identification of the failure type with a finite delay. For more
information the reader is referred to [91].
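Condition D can be illustrated by a brute-force check on a finite set of traces. The following Python sketch is only a bounded illustration on toy languages (which are finite and therefore do not satisfy the liveness assumption); it is not the diagnoser-based test of [91], and all names are hypothetical.

```python
def project(trace, observable):
    """Natural projection: erase the unobservable events of a trace."""
    return tuple(event for event in trace if event in observable)


def condition_d(language, st, observable, failure_class):
    """Condition D: every trace that looks like `st` contains a failure of the class."""
    observation = project(st, observable)
    lookalikes = [w for w in language if project(w, observable) == observation]
    return all(any(event in failure_class for event in w) for w in lookalikes)


def detection_delay(language, s, observable, failure_class):
    """Smallest n such that D holds for all listed continuations of s with length >= n."""
    continuations = [t[len(s):] for t in language if t[:len(s)] == s]
    longest = max(len(t) for t in continuations)
    for n in range(longest + 1):
        long_enough = [t for t in continuations if len(t) >= n]
        if all(condition_d(language, s + t, observable, failure_class)
               for t in long_enough):
            return n
    return None


OBSERVABLE = {"a", "b", "c"}
FAILURES = {"f"}
# After the failure f the system shows b, after the normal event u it shows c.
L_OK = {(), ("a",), ("a", "f"), ("a", "f", "b"), ("a", "u"), ("a", "u", "c")}
# Here both branches show b, so the failure can never be told apart.
L_BAD = {(), ("a",), ("a", "f"), ("a", "f", "b"), ("a", "u"), ("a", "u", "b")}

print(detection_delay(L_OK, ("a", "f"), OBSERVABLE, FAILURES))
print(detection_delay(L_BAD, ("a", "f"), OBSERVABLE, FAILURES))
```

In the first language one further event after the failure is enough to single it out, while in the second no continuation in the list ever satisfies D.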
Remark 4.1 Given a non live language L it is possible to extend it to a new language Llive by adding a
new event “Stop” to Σ, where Stop ∈ Σo ∩ Σuc and by defining
\[ L^{live} = L \,\cup \bigcup_{i \ge 0} \{\, s\,Stop^{i} \ \text{s.t.}\ L/s = \emptyset \,\} . \tag{4.10} \]
It is obvious that Llive is live and that L is diagnosable iff Llive is diagnosable. In other words, checking
for the diagnosability of L is equivalent to checking for the diagnosability of Llive . See [90] for further
details.
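At the level of a generator, the live extension of Remark 4.1 can be sketched as follows. This is an illustrative assumption on how to realize the construction on a deterministic automaton (a Stop self-loop at every deadlock state); the dictionary encoding and the state names are hypothetical.

```python
from typing import Dict

Automaton = Dict[str, Dict[str, str]]  # state -> {event: next state}


def live_extension(g: Automaton, stop: str = "Stop") -> Automaton:
    """Return a copy of g in which every deadlock state gets a Stop self-loop,
    so that a deadlocking trace s can be continued as s Stop^i for any i >= 0.
    Assumes every state of g appears as a key of the dictionary."""
    ext = {x: dict(out) for x, out in g.items()}
    for x, out in ext.items():
        if not out:
            out[stop] = x
    return ext


# Hypothetical toy generator with a deadlock in state "2".
g = {"0": {"a": "1"}, "1": {"b": "2"}, "2": {}}
g_live = live_extension(g)
```

The original automaton is left untouched, and the extended one has at least one event defined at every state.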
Consider now the case where we want to avoid that after a failure of type Fi (i = 1 . . . m)
occurs, the system executes a forbidden string from a given finite set Φi , where Φi ⊆ Σ* . For
example, this could be required to prevent local faults from developing into failures that can
cause safety hazards. This condition is strictly linked to the definition of diagnosability, because
if an illegal string in the set Φi is possible in the language of our system, then it is necessary
to detect the failure before an illegal string is executed. The elements of the set Φi capture
sequences of events that become illegal after the occurrence of a fault of type Fi . This situation
can be formalized by defining the “illegal language” Kfi (i = 1 . . . m) as follows:
In other words, Kfi contains all the possible continuations after a fault of type Fi which have a
forbidden string from Φi as substring.
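The membership test described in words above (a continuation after the fault is illegal when it contains a forbidden string as a substring) can be sketched as follows. The function names are hypothetical, forbidden strings are assumed to be non-empty, and the caller is assumed to pass only the part of the trace that follows the fault.

```python
from typing import Iterable, Sequence


def contains_substring(trace: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if `pattern` occurs as a contiguous block of events in `trace`."""
    n, m = len(trace), len(pattern)
    return any(tuple(trace[k:k + m]) == tuple(pattern) for k in range(n - m + 1))


def is_illegal_continuation(continuation: Sequence[str],
                            forbidden: Iterable[Sequence[str]]) -> bool:
    """True if the continuation after the fault contains a forbidden string."""
    return any(contains_substring(continuation, phi) for phi in forbidden)


# Data in the spirit of Example 4.2: the forbidden string is beta.
forbidden = [("beta",)]
illegal = is_illegal_continuation(("alpha", "beta"), forbidden)
legal = is_illegal_continuation(("alpha",), forbidden)
```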
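Definition 4.2 (Safe Diagnosability) A prefix-closed language L satisfying hypotheses (A1) and
(A2) is said to be safe diagnosable with respect to the projection P , the partition Πf and the forbidden
languages Kfi (i = 1 . . . m) if the following conditions hold:
(SC1) Diagnosability condition: L is diagnosable in the sense of Definition 4.1, i.e., (4.8) holds with the condition D given by
\[ \omega \in P_L^{-1}[P(st)] \;\Rightarrow\; \Sigma_{fi} \in \omega . \]
(SC2) Safety condition: (∀i ∈ Πf )(∀t ∈ L/s) such that ‖t‖ = ni , let tc , ‖tc ‖ = ntc , be the shortest prefix of t such that D
holds, then
\[ t_c \cap K_{f_i} = \emptyset \tag{4.12} \]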
4.5. Necessary and sufficient conditions 109
Roughly speaking, this definition says that a language is safe diagnosable if it is diagnosable
and after a failure, the shortest continuation that assures the detection does not contain any
illegal string. A graphical interpretation of this property is sketched in Fig. 4.4. In that figure, a
non safe diagnosable language L is represented: the string t′c is the shortest continuation after
string s′fi for which condition D holds and t′c contains an element of the forbidden language
Kfi .
Remark 4.2 We note that we have chosen to approach the problem of safe diagnosability in the language-
theoretic framework of [90]-[91] for fault diagnosis rather than the temporal logic framework of [52] in
order to better capture the safety requirement that has to be enforced after a fault has been diagnosed with
certainty. Thus, as will become apparent in the following sections, we are able to employ the "diagnoser
automata" of the Diagnoser Approach of [91] to verify safe diagnosability (cf. Section 4.5) and build safe
diagnosable systems (cf. Section 4.6).
Figure 4.4: A non safe diagnosable language L: the string s″fi t″c satisfies the conditions for safe
diagnosability, while the string s′fi t′c does not.
\[ \Delta_f = \{F_1, F_2, \ldots, F_m\} \]
\[ \Delta = \{N\} \cup 2^{\Delta_f} . \tag{4.13} \]
\[ Q_o = 2^{X_o \times \Delta} . \tag{4.14} \]
Gd = (Qd , Σo , δd , q0 ) (4.15)
The state space Qd is the resulting subset of Qo composed of the states of the diagnoser that
are reachable from q0 under δd . Since the state space Qd of the diagnoser is a subset of Qo , a
state qd of Gd is in the form:
qd = {(x1 , ℓ1 ), . . . , (xn , ℓn )}
where xi ∈ Xo and ℓi ∈ ∆.
We recall some definitions and a key result from [91].
∀(x, ℓ) ∈ q , fi ∈ ℓ .
δd (ql , σl ) = ql+1 l = 1, . . . , n − 1
δd (qn , σn ) = q1
where σl ∈ Σo , l = 1, . . . , n.
\[ (x_n^k, \sigma_n, x_1^{k+1}) \in \delta_{G'} \qquad k = 1, \ldots, m-1 \]
\[ (x_n^m, \sigma_n, x_1^1) \in \delta_{G'} \]
\[ (y_l^r, \sigma_l, y_{l+1}^r) \in \delta_{G'} \qquad l = 1, \ldots, n-1 \qquad r = 1, \ldots, m' \]
Theorem 4.1 [91] A language L is diagnosable with respect to the projection P and the failure partition
Πf on Σf if and only if its diagnoser Gd has no Fi -indeterminate cycle for each failure type Fi .
Remark 4.3 Testing diagnosability using diagnosers can be done in polynomial time in the cardinality
of the state space of the diagnoser, since it involves detection of special kinds of cycles. The state space of
the diagnoser is however exponential in the state space of the system G in the worst case.
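To make the structure of the diagnoser states concrete, the following sketch builds the reachable part of a diagnoser by a subset construction with label propagation. It is a simplified illustration and not the construction of [91] in full generality: it assumes a single failure type (labels "N" and "F" instead of the full label set ∆), unobservable failure events, and a hypothetical toy plant.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Transitions = Dict[str, List[Tuple[str, str]]]  # state -> [(event, next state)]
Pair = Tuple[str, str]                          # (plant state, label "N" or "F")


def unobservable_reach(trans: Transitions, pair: Pair,
                       observable: Set[str], failures: Set[str]) -> Set[Pair]:
    """Pairs reachable from `pair` through unobservable events only.
    The label becomes "F" as soon as a failure event is crossed."""
    seen = {pair}
    stack = [pair]
    while stack:
        x, label = stack.pop()
        for event, y in trans.get(x, []):
            if event in observable:
                continue
            nxt = (y, "F" if label == "F" or event in failures else "N")
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def build_diagnoser(trans, x0, observable, failures):
    """Breadth-first construction of the reachable diagnoser states."""
    q0 = frozenset({(x0, "N")})
    delta = {}
    states = {q0}
    queue = deque([q0])
    while queue:
        q = queue.popleft()
        succ: Dict[str, Set[Pair]] = {}
        for pair in q:
            for y, label in unobservable_reach(trans, pair, observable, failures):
                for event, z in trans.get(y, []):
                    if event in observable:
                        succ.setdefault(event, set()).add((z, label))
        for event, pairs in succ.items():
            nq = frozenset(pairs)
            delta[(q, event)] = nq
            if nq not in states:
                states.add(nq)
                queue.append(nq)
    return states, delta, q0


# Hypothetical toy plant: f is the unobservable failure; a, b, c, d are observable.
plant = {
    "0": [("a", "1"), ("f", "3")],
    "1": [("b", "2")],
    "2": [("d", "2")],
    "3": [("a", "4")],
    "4": [("c", "5")],
    "5": [("d", "5")],
}
states, delta, q0 = build_diagnoser(plant, "0", {"a", "b", "c", "d"}, {"f"})
q1 = delta[(q0, "a")]
```

In this toy plant the diagnoser state reached after observing a contains both an N-labelled and an F-labelled component (an uncertain state), and the next observation (b or c) resolves the uncertainty. The number of diagnoser states here is small, but, as noted in Remark 4.3, it can grow exponentially with the size of the plant.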
G = (X, Σ, δ, x0 )
generating L. For each set Φi (i = 1 . . . m) of forbidden strings, define, following the procedure
in [60], the recognizer of Φi as an automaton
where
X(Φi ) = {(s, π(s)) s.t. (s ∈ t̄) ∧ (t ∈ Φi ) } (4.19)
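and where π : Σ* → {N B, Bi }, (i = 1 . . . m) is defined as
\[ \pi(s) = \begin{cases} B_i & \text{if } s \in \Phi_i ; \\ NB & \text{otherwise.} \end{cases} \tag{4.20} \]
where t is the longest suffix of sσ belonging to X(Φi ) and π(·) is defined as in (4.20); moreover
\[ \xi_i((x_0, NB), \sigma) = \begin{cases} (\varepsilon, NB) & \text{if } \sigma = f_i \\ (x_0, NB) & \text{otherwise} \end{cases} \tag{4.22} \]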
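The behaviour of the recognizer can be sketched in code as follows. This is a hedged illustration based only on the fragments above: the recognizer waits in (x0, NB) until the fault occurs, then tracks the longest suffix of the observed continuation that is a prefix of a forbidden string, and labels it through π. Keeping the recognizer in a B state once a forbidden string has been completed is an assumption of this sketch, and all function names are hypothetical.

```python
from typing import Callable, Iterable, Sequence, Tuple

State = Tuple[object, str]  # (tracked suffix or X0, label "NB" or "B")
X0 = "x0"                   # recognizer state before the fault occurs


def make_recognizer(forbidden: Iterable[Sequence[str]],
                    fault: str) -> Callable[[State, str], State]:
    words = {tuple(w) for w in forbidden}
    # All prefixes of forbidden strings, including the empty string.
    prefixes = {w[:k] for w in words for k in range(len(w) + 1)} | {()}

    def pi(s) -> str:
        return "B" if s in words else "NB"

    def step(state: State, event: str) -> State:
        s, label = state
        if s == X0:  # before the fault: only the fault event moves the recognizer
            return ((), "NB") if event == fault else (X0, "NB")
        if label == "B":  # assumption of this sketch: B states are absorbing
            return state
        w = s + (event,)
        # Longest suffix of w that is a prefix of some forbidden string.
        t = next(w[k:] for k in range(len(w) + 1) if w[k:] in prefixes)
        return (t, pi(t))

    return step


def run(step, trace) -> State:
    state: State = (X0, "NB")
    for event in trace:
        state = step(state, event)
    return state


# Data in the spirit of Example 4.2: fault f1, forbidden string beta.
step = make_recognizer([("beta",)], "f1")
after_illegal = run(step, ["alpha", "f1", "alpha", "beta"])
after_legal = run(step, ["alpha", "beta", "f1", "tau"])
```

With these data, a β executed before the fault is ignored, while a β executed after the fault drives the recognizer into a B state.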
Proof. By definition, if z ∈ Xm is a Bi -state, this means that z = [x, . . . , (si , π(si )), . . .] with
π(si ) = Bi and si ∈ Φi (see (4.20)). Due to the definition of Gm (4.23), z can be reached only
by strings which contain si as substring. Since si ∈ Φi , from definition (4.11), we get that
(∀u ∈ τ (z)) (u = vw) (v ∈ Ψ(Σf i )) (w ∈ Kfi ). □
Remark 4.4 In the worst case the state space of Gm is the Cartesian product of the state spaces of
G, S1 , . . . , Sm . In practice it is expected that the illegal strings in the respective Φi sets will have
short lengths and will result in splitting of localized parts of the state space of G when performing the
product in (4.23); see Examples 4.2 and 4.3 at the end of this section.
Given a diagnosable language L and an automaton
G = (X, Σ, δ, x0 )
generating L, we call safe-diagnoser with respect to the forbidden sets Φi , the diagnoser
Gsd = (Qsd , Σo , δds , qos ) (4.24)
of automaton Gm defined in (4.23), with respect to the projection P and to the failure partition
Πf on Σf . We can now present our result for testing the property of safe diagnosability.
G = (X, Σ, δ, x0 )
generating L. L is safe diagnosable with respect to the projection P , the failure partition Πf and the
forbidden languages Kfi if and only if in the safe diagnoser (4.24):
(TC1) There does not exist a state q ∈ Qsd that is Fi -uncertain with a component of the form (x, ℓ) such
that fi ∈ ℓ and x ∈ Θi ;
(TC2) There does not exist a pair of states q, q′ ∈ Qsd such that: (i) q is Fi -certain with a component of
the form (x, ℓ) such that fi ∈ ℓ and x ∈ Θi ; (ii) q′ is Fi -uncertain; and (iii) q = δds (q′, e) with
e ∈ Eo .
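Conditions (TC1) and (TC2) are structural tests on the safe diagnoser and can be checked by a direct scan of its states and transitions. The sketch below assumes a single failure type (labels "N" and "F"), diagnoser states encoded as sets of (plant state, label) pairs, and a set theta playing the role of Θi; the encoding and the small data set, loosely inspired by the flow-sensor case discussed in Example 4.3, are hypothetical.

```python
def is_certain(q) -> bool:
    """All components carry the failure label."""
    return all(label == "F" for _, label in q)


def is_uncertain(q) -> bool:
    """Both faulty and non-faulty components are present."""
    labels = {label for _, label in q}
    return "F" in labels and "N" in labels


def has_unsafe_component(q, theta) -> bool:
    """Some component (x, F) has x in theta."""
    return any(label == "F" and x in theta for x, label in q)


def tc1_violations(states, theta):
    """Uncertain states containing an unsafe faulty component."""
    return [q for q in states if is_uncertain(q) and has_unsafe_component(q, theta)]


def tc2_violations(delta, theta):
    """Transitions from an uncertain state into a certain, unsafe state."""
    return [(qp, e, q) for (qp, e), q in delta.items()
            if is_uncertain(qp) and is_certain(q) and has_unsafe_component(q, theta)]


# Hypothetical fragment: an uncertain state followed by a certain, unsafe one.
q_unc = frozenset({("3", "N"), ("11", "F")})
q_cert = frozenset({("11", "F")})
q_bad = frozenset({("12", "F")})
theta = {"12"}

delta_late = {(q_unc, "<start,NF>"): q_bad}    # detection only when the unsafe state is entered
delta_early = {(q_cert, "<start,NP>"): q_bad}  # failure already diagnosed before
```

In the first fragment the first certain state after an uncertain one already contains the unsafe component, which is the pattern excluded by (TC2); in the second fragment the unsafe state is entered from a certain state, so neither condition is violated.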
Proof.
Sufficiency:
(TC1) By contradiction, suppose that there exists q ∈ Qsd Fi -uncertain:
with x ∈ Θi . Let s ∈ Σ*o be such that q = δds (q0s , s); this means that
Due to the fact that Σf i ∈ u, we get that u = u1 u2 with u1 ∈ Ψ(Σf i ). Since the system is
diagnosable, there exists n ∈ N such that if t ∈ L/u and ‖t‖ = n then condition D holds.
Consider the string
t′ = u1 u2 t = u1 tc (tc = u2 t) .
Since
Θi ∋ x = δm (xm0 , u1 u2 ) ,
from Proposition 4.1, it turns out that u2 ∈ Kfi or, in other words, since tc = u2 t, then tc ∩ Kfi ≠ ∅,
which violates the hypothesis that L is safe diagnosable.
∀(x, ℓ) ∈ q, fi ∈ ℓ
such that ∃(x, ℓ′) ∈ q with x ∈ Θi . Moreover, suppose that there exists q′ ∈ Qsd Fi -uncertain
such that q = δds (q′, e) with e ∈ Eo . Let s ∈ Σ*o be such that q′ = δds (q0s , s); this means that
∃u, v ∈ P_L^{-1}(s) s.t. (x1 = δm (xm0 , u)) (Σf i ∈ u) (x2 = δm (xm0 , v)) (Σf i ∉ v) .
From the fact that Σf i ∈ u we get that u = u1 u2 with u1 ∈ Ψ(Σf i ). Since the system is
diagnosable there exists n ∈ N such that if t ∈ L/u and ‖t‖ = n then condition D holds; it is
straightforward to see that
t = P_L^{-1}(e) = e .
Consider the string
t′ = u1 u2 e = u1 tc (tc = u2 e) .
From the fact that
q = δds (q0s , se)
we get that
Θi ∋ x = δm (xm0 , u1 u2 e) = δm (xm0 , u1 tc ) .
From Proposition 4.1, we conclude that tc ∈ Kfi , which violates the hypothesis that L is a safe
diagnosable language.
Necessity:
Consider a language L that is diagnosable with respect to the projection P and to the failure
partition Πf on Σf . Suppose that L is not safe diagnosable with respect to the set of forbidden
languages Kfi (i = 1 . . . m). This means that for i ∈ Πf , ∃s ∈ Ψ(Σf i ) and tc ∈ L/s such that tc is
the shortest continuation of s for which the diagnosability condition D holds and
tc ∩ Kfi ≠ ∅ .
Consider the case where there exists a prefix t1 of tc such that t1 ∈ Kfi . From Proposition 4.1
x = δm (xm0 , t1 ) ∈ Θi .
Moreover, since for t1 D does not hold, we get from the definition of the diagnoser Gsd (see [91])
that
q = δds (q0s , P (st1 ))
is Fi -uncertain and (x, ℓ) ∈ q with fi ∈ ℓ (TC1).
Example 4.2 Consider the language generated by the automaton G in Fig. 4.6. We assume that the
forbidden string after the failure f1 is Φ = {β} and therefore the illegal language is Kf = {αβ}. In this
example there is a state, namely state 6, that can be reached both by an illegal string s′ = (αβγ)^n αf1 αβ
and a legal string s″ = (αβγ)^n αβf1 τ .
The recognizer S for the set Φ is shown in Fig. 4.7, and the modified automaton Gm = G × S is
shown in Fig. 4.8 (left). For the sake of simplicity, we have kept the same state names and indicated the
additional information next to the state. The effect of the product between S and the original model G is
the splitting of state 6 into two states: a B-state and an N B-state.
The safe diagnoser Gsd is shown in Fig. 4.8 (right) where, for the sake of clarity, only the information
about B-states and N B-states has been reported and not the full name of the state. It is immediately
seen that both conditions (TC1) and (TC2) are satisfied, and therefore the language generated by system
G is safe diagnosable with respect to the failure f1 and the forbidden string β.
Figure 4.5: Conditions (TC1) and (TC2) and their influence on the safe diagnoser. On the left
are shown, from top to bottom, a string that violates (TC1), a string that violates (TC2) and a
string that violates neither (TC1) nor (TC2). On the right is shown what happens in these three
cases in the safe diagnoser. The label UC stands for uncertain state, while the label C stands for
certain state.
Example 4.3 Consider again the simple system sketched in Fig. 4.1. The only failure mode considered
is that the valve is stuck closed. The tank is equipped with a level sensor, while the pipe can be equipped
with either a flow sensor or a pressure sensor. The flow sensor reads two possible values: F meaning that
there is a flow in the pipe and N F meaning that there is no flow in the pipe. Similarly, the pressure
sensor can read two possible values: P meaning that there is pressure in the pipe and N P meaning that
there is no pressure in the pipe. We control the system to achieve the following behavior:
• if the level sensor says that the tank is full (event Level) the controller must respond by opening
the valve and starting the pump;
• if the level sensor says that the tank is not full (event Not Level) the controller must respond by
stopping the pump and closing the valve.
The readings of the flow and pressure sensors are linked with the states of the valve and the pump as
depicted in Table 4.2. As explained above the situation valve closed and pump on is risky and should
be avoided; namely, Φ = {start}. The FSM modelling the controlled behavior of the system is depicted
again in Fig. 4.9. There is a state in the admissible behavior (namely state 12) where the dangerous
situation “valve closed and pump on” occurs. This means that, due to the possible failure of the valve,
the system can enter this risky state. This situation should be avoided by detecting the failure before the
system goes into state 12. Here, for the sake of simplicity, all the analysis will be done on automaton G and
not on the modified automaton Gm , since these two have the same structure and the B-state to care about
is state 12.
[Figures 4.6 and 4.7: the automaton G and the recognizer S of Example 4.2 (diagrams not reproduced).]
Figure 4.8: The modified automaton Gm (left) and the safe diagnoser Gsd (right).
[Figure 4.9: FSM modelling the controlled behavior of the system (diagram not reproduced); events are of the form <command or level reading, flow reading, pressure reading>.]
Let us assume that to perform the diagnosis only the flow sensor is used, i.e., disregard (or delete)
all pressure information (P,NP) in the event names in Fig. 4.9; then build the corresponding (safe)
diagnoser Gsd (flow) shown in Fig. 4.10. Upon inspection of Gsd (flow) we can see that diagnosability
holds but safe diagnosability is violated at state 12F, entered from state (3N,11F): this is a violation of
(TC2). Intuitively if only the flow sensor is used to detect the valve failure, then we must wait until
the pump is switched on to see if there is flow or not; this diagnoses the failure, but violates the safety
condition.
Repeat the above process but this time using only the pressure sensor; the corresponding (safe) diag-
noser Gsd (pressure) is shown in Fig. 4.11. Upon inspection of Gsd (pressure) we conclude that the system
[Figure 4.10: the (safe) diagnoser Gsd (flow); diagram not reproduced.]
[Figure 4.11: the (safe) diagnoser Gsd (pressure); diagram not reproduced.]
familiar with [30, 29]. In the general case, one may wish to model the unsafe behavior pertaining
to safe time-diagnosability in terms of forbidden timed strings of events. Similarly to [30, 29],
safe time-diagnosability would be defined as in Definition 4.2 above, except that in condition
SC1 the requirement for condition D to hold would be in terms of a bounded time interval
(i.e., bounded number of tick events in the timed traces generated by the timed automaton)
and in condition SC2 the safety condition would be in terms of the illegal timed languages cor-
responding to the forbidden timed strings. Safe time-diagnosability would be tested by first
composing the timed automaton system model Gt with the set of recognizers Sit corresponding
to the forbidden timed strings, as is done in equation 4.23 for untimed models. Note that the
timed automaton model of [22, 30, 29] is closed under the operations of parallel composition
and product, when these operations are defined to combine time intervals by intersection; see
[29, 22] for further details. At that point, the resulting Gtm would be time-unfolded to obtain
Gm , the safe diagnoser of Gm would be constructed, and safe time-diagnosability would be
tested by examining the structure of this safe diagnoser, namely indeterminate cycles and con-
ditions (TC1) and (TC2) of Theorem 4.2. The respective operations of product of timed models
and time-unfolding having taken care of proper marking of states (Bi or N B) and unfolding of
tick events, the safety requirement in safe time-diagnosability can be addressed using the same
untimed conditions TC1 and TC2 as before.
A special case of the above general case arises when timing information is not required
to specify the sets of forbidden strings. In this case, one could time-unfold Gt first,
and then compose the resulting G with the recognizers Si to obtain Gm , exactly as is done in
equation 4.23.
The resulting closed loop system is denoted by SP /G. A realization of the supervisor SP for
G which achieves L(SP /G) = K is given by the pair (R, φ) where R = (XR , Σo , δR , xR,0 ) is a
recognizer for P (K) and φ : XR → 2Σc × Σuc .
In addition to Assumptions (A1) and (A2) on the liveness and the absence of cycles of unobservable
events that we made previously on the system, we now make the following additional
assumption:
(A3) Σc ⊆ Σo , i.e., no unobservable event can be prevented from occurring by control.
Under Assumption (A3) the above map φ(x) can simply be given by the active event set of R
at state x. In this case the supervisor SP may simply be realized by the FSM R, the feedback
map φ(·) being implicit in the transition structure of R.
Given a language M over the alphabet Σ and a language K ⊆ M we denote by K ↑C the
supremal controllable sublanguage of K with respect to M and Σuc ⊆ Σ. Likewise we denote
by K ↑CO the supremal controllable and observable sublanguage of K with respect to M , P and
Σuc ⊆ Σ (Assumption (A3) is sufficient to ensure the existence of K ↑CO ; see [26]). With abuse
of notation, if H is the FSM that generates K we denote by H ↑C the FSM that generates K ↑C .
The active diagnosis problem is formulated as follows:
4.7. Active approach to safe diagnosability 121
Active Diagnosis Problem (ADP): Given the regular, live, language L generated by the sys-
tem G and given a regular diagnosable language K ⊆ L such that every live sublan-
guage of K is diagnosable, find a partial observation supervisor SP for G such that
L(SP /G) = Lact where
(C1) Lact ⊆ K;
(C2) Lact is diagnosable;
(C3) Lact is as large as possible.
A brief description of the solution procedure proposed in [90] for the ADP is now given.
Initialization
Step 0.1 Obtain an FSM Glegal that generates K.
Step 2 Compute M (i) = PL−1 [Md↑C (i)]. Let the FSM H(i) be such that L(H(i)) = M (i).
Step 3 Extend M (i) to the live language M live (i). Let H live (i) denote the FSM that generates
M live (i).
Step 4 Build the diagnoser Hdlive (i) corresponding to H live (i).
Step 5 If M live (i) is diagnosable then Lact = M (i) and the corresponding Sp is realized by the
FSM Hd↑C (i).
Step 6 If M live (i) is not diagnosable then
1. Obtain H̃d by eliminating from Hd↑C (i) all states q such that there is a transition de-
fined at q in Hdlive (i) due to the Stop event and this transition is part of an indeter-
minate cycle in Hdlive (i) or it leads to a state that is part of an indeterminate cycle in
Hdlive (i).
2. Let Hd (i + 1) = Accessible(H̃d ) and Md (i + 1) = L(Hd (i + 1)).
3. Increment i to i + 1.
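The state elimination and the Accessible(·) operation used in Step 6 can be sketched as follows. This is only an illustration of these two generic operations on a deterministic FSM; the dictionary encoding, the function names and the toy data are hypothetical, and the selection of the states to be eliminated (which depends on the indeterminate cycles of the diagnoser) is not modelled here.

```python
from typing import Dict, Set

FSM = Dict[str, Dict[str, str]]  # state -> {event: next state}


def remove_states(fsm: FSM, bad: Set[str]) -> FSM:
    """Delete the states in `bad` together with all transitions entering them."""
    return {x: {e: y for e, y in out.items() if y not in bad}
            for x, out in fsm.items() if x not in bad}


def accessible(fsm: FSM, x0: str) -> FSM:
    """Keep only the states reachable from the initial state x0."""
    reach = {x0}
    stack = [x0]
    while stack:
        x = stack.pop()
        for y in fsm.get(x, {}).values():
            if y not in reach:
                reach.add(y)
                stack.append(y)
    return {x: dict(fsm.get(x, {})) for x in reach}


# Hypothetical toy FSM: eliminating q1 makes q3 unreachable.
h = {
    "q0": {"a": "q1", "b": "q2"},
    "q1": {"c": "q3"},
    "q2": {},
    "q3": {"d": "q0"},
}
pruned = accessible(remove_states(h, {"q1"}), "q0")
```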
Theorem 4.3 [90] The iterative procedure presented for solving the active diagnosis problem converges
in a finite number of iterations. M (i) at convergence is the supremal controllable, observable, and
diagnosable sublanguage of K and is a regular language. The supervisor Sp that achieves the closed loop
language M (i) can be realized by Hd (i), the diagnoser corresponding to the generator H(i) of M (i).
An exhaustive presentation of the class of languages Kl ⊆ L (l ≥ 0) which can be used
as initial condition for the ADP can be found in [90]. In this work the reader can find also
an algorithmic method to generate these languages for the case where the diagnoser Gd is such
that no two indeterminate cycles in Gd are interleaved or nested (i.e., no two indeterminate cycles
share a common state in a manner such that it is possible for the diagnoser to keep alternating
between these two cycles).
4.7.2 Formulation and solution procedure to the active safe diagnosis problem
The active safe diagnosis problem is formulated as follows:
Active Safe Diagnosis Problem (ASDP): Given the regular, live language L generated by the
system G, given a set of forbidden strings after failures Φ = ⋃_{i=1}^{m} Φi , and given a regular
diagnosable language K ⊆ L such that every live sublanguage of K is diagnosable, find
a partial observation supervisor SP for G such that L(SP /G) = Lsafe where:
(C1) Lsafe ⊆ K;
(C2) Lsafe is safe diagnosable with respect to Φ;
(C3) Lsafe is as large as possible.
Solution procedure:
(a) Solve the ADP with respect to K and L; let M be the supremal diagnosable, controllable,
observable sublanguage of K and H the FSM that generates M .
(b) Build the FSM S that recognizes the set of strings Φ0 and the machine
Hs = H × S .
Build the diagnoser Hsd of Hs and check the safety property. If Hs is safe then M is also
the solution to the ASDP.
If Hs is not safe then there exist some states in Hsd that violate conditions (TC1) or (TC2).
(c) Define
\[ \Xi = \{\, s \in \Sigma_o^{*} \ \text{s.t. in } H_{sd}\ \exists \text{ a transition due to } s \text{ which ends in a state that violates (TC1) or (TC2)} \,\} . \]
L=M K = Mnew
Theorem 4.4 Given the regular, live language L generated by the system G, given a set of forbidden
strings after failures, Φ = ⋃_{i=1}^{m} Φi , and given a regular diagnosable language K ⊆ L such that every
sublanguage of K is diagnosable, the solution procedure given for the ASDP returns M′, which is the
supremal observable, controllable, safe diagnosable sublanguage of K. Moreover the supervisor Sp that
achieves the closed loop language M′ can be realized by H′d , the diagnoser corresponding to the generator
H′ of M′.
Proof. Step (a) is proved by Theorem 4.3 to converge in a finite number of iterations and to give at
convergence the supremal controllable, observable and diagnosable sublanguage of K, namely
M . If this language is also safe (step (b)) then we are done and M is the supremal controllable,
observable and safe diagnosable sublanguage of K. Otherwise, from Theorem 4.2, we know that
in the safe diagnoser of the generator H of M there must be some Bi -states with label Fi in
the list of an uncertain state (TC1) or in the list of the first certain state after an uncertain state
(TC2). The set Ξ collects the strings in Σ*o for which the safe diagnoser enters these states. Note
that from Theorem 4.2 we know that the inverse projection in language M of those strings violates
the definition of safe diagnosability, hence we need to remove them from the language (step
(d)), thus obtaining the new language Mnew .
Mnew is a sublanguage of M and so of K, but we cannot say anything about its liveness
(and so about its diagnosability) nor about its controllability and observability properties; however,
we can start again to iterate the solution procedure for the ADP (step (d)). Since each live
sublanguage of Mnew is also a live sublanguage of M we are still in the hypothesis of Theorem
4.3, so the new iteration of the ADP solution is assured to converge in a finite number of steps
to the supremal diagnosable, observable and controllable sublanguage of Mnew , namely M′.
If it is possible to prove that after these iterations the safety condition is still preserved, then
we have proved that M′ is the supremal safe diagnosable, observable and controllable sublanguage
of K. Moreover, from Theorem 4.3, the supervisor Sp that achieves the closed loop language
M′ can be realized by H′d , the diagnoser corresponding to the generator H′ of M′, and the
liveness extension of the diagnoser, namely H′d^{live} , can be used to perform online diagnosis on
the closed loop system.
The fact that M′ ⊆ Mnew ⊆ M with M′ and M diagnosable languages, and the definition of
the operator delay(·; ·) (given in [90]) imply that
\[ \forall s \in M' \qquad delay(s, M') = n'_{t_c} \le delay(s, M) = n_{t_c} ; \]
and ntc = n′tc if the maximum delay in detecting a failure event in M occurs along a trace t
which is also contained in M′. This fact, together with the definition of safe diagnosability, leads
to the conclusion that, under the hypothesis of the ASDP, the property of safe diagnosability is
preserved under string deletions (see Fig. 4.12); hence M′ is safe diagnosable. □
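[Figure 4.12: safe diagnosability is preserved under string deletions; a trace st′ ∈ M with detection delay ntc is compared with a trace st″ ∈ M′ with detection delay n′tc , both followed by a forbidden event σ ∈ φi . Diagram not reproduced.]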
Example 4.4 Consider the system G and the corresponding diagnoser Gd in Fig. 4.13. Let
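\[ \Sigma_f = \Sigma_{f1} = \{\sigma_{f1}\} \qquad \Sigma_{uo} = \Sigma_f \cup \{\sigma_{uo}\} \qquad \Sigma_{uc} = \Sigma_f \cup \{\delta\} \qquad \Phi = \{\tau\} . \]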
The diagnoser Gd has a cycle of F1 -uncertain states with the corresponding event sequence βγδ. Cor-
responding to this cycle in the diagnoser, there are two cycles in G: the first involves states 3-5 which
appear with an F1 label in the diagnoser and the second involves states 11-14 which appear with an N
label in the cycle in the diagnoser, moreover state 14 is reached via an unobservable event (σuo ). Thus
the cycle in Gd is an F1 -indeterminate cycle and the system G is non diagnosable.
The ADP has been solved using K = K1 and the result is shown in Fig. 4.14. For more details on
the steps of the solution procedure the reader is referred to [90].
[Figure 4.13: the system G and the corresponding diagnoser Gd ; diagram not reproduced.]
Building the safe diagnoser for the solution Lact of the ADP (represented in Fig. 4.15 where,
for the sake of clarity, only information about B-states is shown), it is immediate to see that Lact is not
safe diagnosable; in fact a B-state (namely {10F1}) is in the first certain state after an uncertain state,
violating condition (TC2). We can see that
Ξ = {αβγτ }
and
PL−1 [Ξ] ∩ Lact = {ασf 1 βγτ } .
Fig. 4.16 shows the solution Lsafe of the ASDP, i.e., the supremal controllable, observable, safe
diagnosable sublanguage of K ⊆ L. Fig. 4.17 shows the partial observation supervisor Sp such that
L(Sp /G) = Lsafe ; we would like to stress that the live extension of Sp can be used to perform online
diagnosis on the closed loop system.
Figure 4.14: Solution Lact for the ADP for system G with K = K1 and its live extension.
Figure 4.15: Safe diagnoser for Lact and its live extension.
Figure 4.16: Solution Lsafe of the ASDP for system G with K = K1 and Φ = {τ }, and its live
extension.
[Figure 4.17: the partial observation supervisor Sp ; diagram not reproduced.]
1. Is L(Gn+f ) diagnosable?
Let us assume the worst case situation of two negative answers. In this case, we perform the
solution procedure of the ASDP described in Section 4.7. The result is a safe diagnosable system
denoted by Gn+sf and such that
L(Gn+f ) ⊇ L(Gn+sf ) .
We know that Gn+sf satisfies the property that after a failure fi and prior to any undesired
string v ∈ Φi (i = 1 . . . n), there exists an observable event σ ∈ Σo whose observation leads to
the detection and isolation of the fault.
We propose to approach fault tolerant supervision by enhancing Gn+sf and using σ as a
trigger to “force” reconfiguration events that will lead the system to compensate (to the extent
possible) for the effect of the detected fault. We do not specify such reconfiguration, as it is
problem-dependent. The system model Gn+sf is refined as follows. We introduce the new
events ri , nri ∈ Σc (i = 1 . . . n), which are assumed to be controllable and are used to force
reconfigurations. In Gn+sf , we insert the pair of events ri and nri after the above event σ
and just before the above string v, as shown in Fig. 4.18. After the event ri we model the
desired system reconfiguration depending on the specific application¹. Prior to fault detection,
all nri events are enabled while all ri events are disabled. When we observe event σ (that
leads to the detection of the fault), we disable nri and enable ri , thereby “forcing” the desired
reconfiguration².
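The enabling logic just described can be sketched as a small piece of supervisory code. This is only an illustrative sketch: the class name and its methods are hypothetical, the detection of the fault is assumed to be signalled by the diagnoser, and the reconfiguration itself, being problem-dependent, is not modelled.

```python
from typing import Dict, Iterable, Set


class ReconfigurationGate:
    """Enable nr_i before the detection of fault i and r_i afterwards."""

    def __init__(self, fault_types: Iterable[int]) -> None:
        self.detected: Dict[int, bool] = {i: False for i in fault_types}

    def on_detection(self, i: int) -> None:
        """To be called when the diagnoser becomes F_i-certain."""
        self.detected[i] = True

    def enabled_events(self) -> Set[str]:
        """Exactly one event of each pair (r_i, nr_i) is enabled."""
        return {f"r{i}" if done else f"nr{i}"
                for i, done in self.detected.items()}


gate = ReconfigurationGate([1, 2])
before = gate.enabled_events()
gate.on_detection(1)
after = gate.enabled_events()
```

Since r_i and nr_i are assumed controllable, disabling nr_i and enabling r_i leaves the reconfiguration event as the only way to proceed along that branch of the model.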
The above strategy is sound since the safe diagnosability property ensures that along any
string s ∈ L(Gn+sf ) there is an observable event σ ∈ Σo between fi and the undesired string
v ∈ Φi after which the diagnoser is Fi -certain and where we can “force” the reconfiguration. In
a general sense, this strategy is conceptually similar to so-called “explicit” approaches to fault
tolerant control in the literature on continuous-variable systems (cf. [12]).
[Figure 4.18: along a string s ∈ L(Gn+sf ), the events nri and ri are inserted after the observable event σ ∈ Σo that follows fi and before the undesired string v ∈ Φi ; the event ri leads to the reconfigured model. Diagram not reproduced.]
¹ Events ri can also be used to freeze the state of the diagnoser or possibly to reset the diagnoser itself.
² In Example 4.3, the reconfiguration could for instance be realized by trying to open and close the valve a certain number
of times and, if the valve is still blocked, shutting down the system.
4.9 Conclusion
In this chapter we have shown that the starting point towards a fault tolerant supervision of
DES is the problem of safe failure diagnosis. Starting from the standard definition of diagnosability
of discrete event systems, which deals with the problem of detecting the occurrence of an
unobservable event using the available observations on the system, the problem of performing
the detection before the system executes a forbidden string was introduced. This idea resulted
in the new notion of safe diagnosability for discrete event systems and in necessary and sufficient
conditions to test this language property. Moreover, the problem of explicitly taking into
account safe diagnosability as a requirement in system design was addressed and solved.
This work is intended as a starting point for the problem of fault tolerant supervision for discrete
event systems, namely, the design of a reconfiguration unit which, on the basis of the information
provided by the diagnoser, “adjusts” the supervisor in order to achieve a prescribed
performance in the case of faulty behavior. In such a framework, reconfiguration actions on the
system should be enabled by the supervisor just when a failure occurrence has been detected
and the forbidden actions (namely the set Φ) can be considered as those strings of events after
which reconfiguration is no longer effective.
To this aim, in the last section of the chapter we have presented a simple modelling tool which
makes use of the property of safe diagnosability and which permits reconfiguration actions on the
faulty plant.
Chapter 5
Implicit fault tolerant control systems
5.1 Introduction
The most common approach in dealing with the fault tolerant control (FTC) problem is to split the
overall design into two distinct phases. The first phase addresses the so-called “Fault Detection
and Isolation” (FDI) problem, which consists in designing a dynamical system (filter) which,
by processing input/output data, is able to detect the presence of an incipient fault and to isolate
it from other faults and/or disturbances. Once the FDI filter has been designed, the second
phase usually consists in the design of a supervisory unit which, on the basis of the information
provided by the FDI filter, reconfigures the control so as to compensate for the effect of the fault
and to fulfill performance constraints. In general, the latter phase is carried out by means of
a parameterized controller which is suitably updated by the supervisory unit, on the basis of the
information provided by the FDI filter.
It is clear from this description that the classical approach to FDI and FTC relies upon a “cer-
tainty equivalence” idea extensively used in the context of adaptive control, since it is based on
the explicit estimation of unknown time varying signals/parameters (in the specific case the
faults) by the FDI filter and subsequent explicit reconfiguration of the controller in presence of
faults.
Our aim is to follow a different approach to fault tolerant control. Specifically, we address the
case in which the faults affecting the controlled system can be modeled as functions (of time)
5.2 The Induction Motor model and the Indirect Field Oriented Controller
In this subsection we briefly review the model of the induction motor. For a more exhaustive
treatment on how this model can be derived the interested reader can refer to [67].
Under assumptions of linear magnetic circuits and balanced operating conditions, the equiv-
alent two-phase model of the symmetrical IM, represented in an arbitrary rotating two-phase
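\[
B = \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ \frac{1}{\sigma} & 0 \\ 0 & \frac{1}{\sigma} \end{pmatrix},
\qquad
d = \begin{pmatrix} -\frac{1}{J} \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}
\]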
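where the positive constants in the model are related to the electrical and mechanical parameters
of the IM as follows:
\[
\sigma = L_s\left(1 - \frac{L_m^2}{L_s L_r}\right), \quad
\beta = \frac{L_m}{\sigma L_r}, \quad
\mu = \frac{3}{2}\frac{L_m}{J L_r}, \quad
\alpha = \frac{R_r}{L_r}, \quad
\gamma = \frac{R_s}{\sigma} + \alpha L_m \beta
\]
with $J$ the rotor inertia, $R_s$, $R_r$, $L_s$, $L_r$ the stator/rotor resistances and inductances respectively, and
$L_m$ the magnetizing inductance.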
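As a quick numerical companion to these definitions, the following sketch evaluates the five model constants from the physical parameters. The class name, the function name and the sample values are illustrative assumptions introduced here; in particular, the numbers are not the motor data of Table 5.1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MotorParameters:
    """Physical IM parameters (hypothetical container, SI units assumed)."""
    J: float   # rotor inertia
    Rs: float  # stator resistance
    Rr: float  # rotor resistance
    Ls: float  # stator inductance
    Lr: float  # rotor inductance
    Lm: float  # magnetizing inductance


def model_constants(p: MotorParameters) -> dict:
    """Evaluate sigma, beta, mu, alpha, gamma as defined in the text."""
    sigma = p.Ls * (1.0 - p.Lm ** 2 / (p.Ls * p.Lr))
    beta = p.Lm / (sigma * p.Lr)
    mu = 1.5 * p.Lm / (p.J * p.Lr)
    alpha = p.Rr / p.Lr
    gamma = p.Rs / sigma + alpha * p.Lm * beta
    return {"sigma": sigma, "beta": beta, "mu": mu, "alpha": alpha, "gamma": gamma}


# Illustrative values only (assumed, not taken from the thesis).
example = MotorParameters(J=0.01, Rs=1.0, Rr=1.0, Ls=0.1, Lr=0.1, Lm=0.09)
constants = model_constants(example)
```

All five constants are positive whenever $L_m^2 < L_s L_r$, which is exactly the condition for $\sigma > 0$.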
signals denoted respectively by $\omega^\star(t)$ and $\Psi^\star(t)$, which are assumed to be smooth functions of
time. A further control objective consists of having $\Psi_q(t)$ asymptotically decaying to zero; this
property is known as steady-state flux decoupling. Hence, given $\omega^\star(t)$ and $\Psi^\star(t)$, the problem
consists in the design of a dynamic output feedback controller of the form
\[
\dot\nu = \alpha(\nu, y, \omega^\star, \Psi^\star), \qquad
u = \beta(\nu, y, \omega^\star, \Psi^\star), \qquad
\omega_0 = \pi(\nu, y, \omega^\star, \Psi^\star)
\]
such that for all initial states $x(0) \in \mathbb{R}^5$ and for all possible constant load torques $T_L$, the trajectories
of the closed loop system are bounded and
\[
\lim_{t\to\infty} |\omega(t) - \omega^\star(t)| = 0, \qquad
\lim_{t\to\infty} |\Psi_d(t) - \Psi^\star(t)| = 0, \qquad
\lim_{t\to\infty} \Psi_q(t) = 0 .
\]
The idea in [83] is to design the control inputs $(u_d, u_q)$ and $\omega_0$ in order to force the overall
system to behave as the cascade connection of two subsystems, called the flux subsystem and the
speed subsystem. More specifically, we first consider a subsystem associated with the state variables
$(\Psi_d, \Psi_q, i_d)$ and we show that a suitable choice of the control input $u_d$ and of the $(d, q)$
reference frame rate $\omega_0$ allows us to achieve some stability properties (better specified later)
regardless of the behavior of the other state variables $(\omega, i_q)$, provided that $\omega(t)$ is bounded. Then
we turn our attention to the subsystem associated with the remaining state variables $(\omega, i_q)$
and we design the control input $u_q$ in order to achieve the prescribed stability properties.
Proposition 5.1 Set $K_d = k\bar K_d$, where $\bar K_d$ is a fixed positive constant. Then there exists a number
$k_1^\star > 0$, independent of $\omega(t)$ and $s_\omega(t)$, such that for all $k \ge k_1^\star$ the state of system (5.5) satisfies an
a-$L_\infty$ bound with respect to the input $v_d$, without restriction on the input and on the initial state, and with
linear gain functions. In particular, there exist numbers $\gamma_f > 0$ and $\gamma_f^0 > 0$ such that, for each $x_f^0 \in \mathbb{R}^3$
and each measurable $v_d$, the solution of (5.5) with $x_f(0) = x_f^0$ exists for all $t \ge 0$ and satisfies
\[
\|x_f\|_\infty \le \max\left\{\gamma_f^0 \|x_f^0\|,\ \frac{\gamma_f}{k}\|v_d\|_\infty\right\}, \qquad
\|x_f\|_a \le \frac{\gamma_f}{k}\|v_d\|_a . \tag{5.6}
\]
Proof. The result follows as a trivial application of Young's inequality, considering the
candidate ISS-Lyapunov function $V := x_f^T x_f$, as the dependence on $\omega(t)$ and on $s_\omega(t)$ is
skew-symmetric. $\triangleleft$
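We now take into account the remaining state variables of (5.1) to define the speed subsystem.
To this end, define the additional error variable
\[
\tilde\omega := \omega - \omega^\star
\]
and set
\[
\tilde i_q := i_q - i_q^\star \quad \text{where} \quad
i_q^\star = \frac{1}{\mu\Psi^\star}\left(-K_\omega\tilde\omega + \hat T_L + \dot\omega^\star\right).
\]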
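In the expression of $i_q^\star$ above, $K_\omega$ is a positive constant and $\hat T_L$ is an auxiliary state variable of the
controller, introduced to offset the unknown load torque $T_L$, whose dynamics will be specified
later. Moreover, let $\dot i_{q1}^\star$ denote the known part of the derivative of $i_q^\star$, which is given by
\[
\dot i_{q1}^\star := \frac{1}{\mu\Psi^\star}\left[K_\omega\left(K_\omega\tilde\omega - \mu\Psi^\star\tilde i_q\right) + \dot{\hat T}_L + \ddot\omega^\star\right]
- \frac{\dot\Psi^\star}{\Psi^\star}\, i_q^\star .
\]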
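With this notation in mind we design the control law for the input $u_q$ in the following way
\[
\begin{aligned}
u_q &= \sigma\left[-(K_q - \gamma)\tilde i_q + \omega_0 i_d + \beta\omega\Psi^\star + \dot i_{q1}^\star
- \frac{1}{\Psi^\star}\left(\dot\Psi^\star\tilde i_q + K_\xi\xi\right)\right] + \sigma u_q^{fc}
:= \bar u_q + \sigma u_q^{fc} \\
\dot\xi &= K_\eta\Psi\tilde i_q \\
\dot{\hat T}_L &= -K_T\tilde\omega
\end{aligned}
\tag{5.7}
\]
where $K_q$ and $K_\xi$ are design parameters, $K_\eta$ and $K_T$ are fixed (though arbitrary) positive constants
and $u_q^{fc}$ is a new control input, used for fault compensation, to be fixed later.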
¹ For our purposes, the notion of a-$L_\infty$ bound is equivalent to that of input-to-state
stability (ISS). We use it instead of the more common notion of ISS because in the proofs of the next propositions
it is convenient to use the small gain theorem of [100], which is expressed in terms of a-$L_\infty$ bounds.
The subsystem thus obtained (which defines the speed subsystem), having set
Proposition 5.2 Set $K_q = k\bar K_q$ and $K_\xi = k\bar K_\xi$, where $\bar K_q$ and $\bar K_\xi$ are fixed positive constants. Then
there exist numbers $k_2^\star > 0$ and $\Delta > 0$ such that for all $k \ge k_2^\star$ the state of system (5.8) satisfies
an a-$L_\infty$ bound with respect to the inputs $(x_f, v_q)$, without restriction on the initial state, with restrictions
$(\Delta, \infty)$ on the inputs $(x_f, v_q)$, and with linear gain functions. In particular there exist numbers $\gamma_s > 0$ and
$\gamma_s^0 > 0$ such that for each $x_s^0 \in \mathbb{R}^4$ and each measurable $(x_f, v_q)$, the solution of (5.8) with $x_s(0) = x_s^0$
exists for all $t \ge 0$ and satisfies
Proof. The proof is a consequence of the small gain theorem. In particular, it is convenient to partition
the state variable $x_s$ as $x_s = (x_s^1, x_s^2)$, with $x_s^1 = (\tilde T_L, \tilde\omega)^T$ and $x_s^2 = (\xi, \eta)^T$, and to consider
the 4-dimensional system (5.8) as the feedback interconnection of the two 2-dimensional subsystems
\[
\dot x_s^1 = \left[A_s^{11} + A_{sf}^{11}(x_f, t)\right] x_s^1 + \left[A_s^{12} + A_{sf}^{12}(x_f, t)\right] x_s^2 + B_s^1(x_f, t)\, x_f \tag{5.11}
\]
\[
\dot x_s^2 = \left[A_s^{22} + A_{sf}^{22}(x_f, t)\right] x_s^2 + \left[A_s^{21} + A_{sf}^{21}(x_f, t)\right] x_s^1 + B_s^2(x_f, t)\, x_f + B_v v_q \tag{5.12}
\]
and note that there exist positive constants $a$ and $b_0$, $b_1$ such that
\[
\|A_{sf}^{11}(x_f, t)\| \le a\|x_f\|, \qquad
\|A_{sf}^{12}(x_f, t)\| \le a\|x_f\|, \qquad
\|B_s^1(x_f, t)\| \le b_0 + b_1\|x_f\| . \tag{5.14}
\]
Now fix $\Delta' > 0$ so that $\Delta' a\|P_1\| < 1/4$ and note that
\[
\|x_f\| < \Delta' \;\Rightarrow\;
\dot V \le -\frac{1}{2}\|x_s^1\|^2 + \ell_1\|x_s^1\|\|x_s^2\| + (\ell_2 + \ell_3\Delta')\|x_s^1\|\|x_f\|
\]
for some positive numbers $\ell_i$, $i = 1, \dots, 3$. From this and Lemma 3.3 in [100] it turns out that
the state $x_s^1$ of (5.11) satisfies an a-$L_\infty$ bound without restriction on the initial state, with restrictions
$(\infty, \Delta')$ on the inputs $(x_s^2, x_f)$ and with linear gain functions. In particular there exists $\gamma_s' > 0$ such
that
\[
\|x_s^1\|_a \le \gamma_s' \max\{\|x_s^2\|_a, \|x_f\|_a\} .
\]
In a similar way, it can be shown that the state $x_s^2$ of (5.12) satisfies an a-$L_\infty$ bound without
restriction on the initial state, with nonzero restriction on the inputs $(x_s^1, x_f)$ and with a linear gain function
whose coefficient can be made arbitrarily small by increasing $k$. To this end, observe that the real
part of the eigenvalues of the matrix $A_s^{22}$ depends linearly on $1/k$. In view of this there exist
symmetric positive definite matrices $P_2$ and $Q$ such that (see [105])
\[
P_2 A_s^{22} + {A_s^{22}}^T P_2 = -2\ell k P_2 - Q \tag{5.15}
\]
for some positive $\ell$. Consider now the candidate ISS Lyapunov function $V = {x_s^2}^T P_2 x_s^2$, whose
derivative along (5.12) satisfies (observe that bounds like (5.14) also hold for $A_{sf}^{22}$, $A_{sf}^{21}$ and $B_s^2$)
\[
\dot V \le -2\ell k\|P_2\|\|x_s^2\|^2 - {x_s^2}^T Q x_s^2
+ \|x_s^2\|\left(q_0\|x_s^2\|\|x_f\| + q_1\|x_s^1\|\|x_f\| + q_2\|x_s^1\| + q_3\|x_f\| + q_4\|x_f\|^2 + q_5\|v_q\|\right)
\]
for some positive numbers $q_i$, $i = 0, \dots, 5$, and fix $\Delta'' > 0$ such that $q_0\Delta'' < \|Q\|$. Hence
\[
\|x_f\| < \Delta'' \;\Rightarrow\;
\dot V \le -2\ell k\|P_2\|\|x_s^2\|^2
+ \|x_s^2\|\left[(q_2 + q_1\Delta'')\|x_s^1\| + (q_3 + q_4\Delta'')\|x_f\| + q_5\|v_q\|\right]
\]
from which it is easy to conclude, again by Lemma 3.3 in [100], that the state $x_s^2$ of (5.12) satisfies
an a-$L_\infty$ bound without restriction on the initial state, with restrictions $(\infty, \Delta'', \infty)$ on the inputs
$(x_s^1, x_f, v_q)$ and with linear gain functions. In particular there exists $\gamma_s'' > 0$ such that
\[
\|x_s^2\|_a \le \frac{\gamma_s''}{k} \max\{\|x_s^1\|_a, \|x_f\|_a, \|v_q\|_a\} .
\]
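In view of this, a simple application of the small gain theorem (Theorem 1 in [100]) proves the claim of the
proposition with $k_2^\star \ge \gamma_s'' \max\{1, 1/\gamma_s', \gamma_s'\}$, $\Delta \le \min\{\Delta', \Delta''\}$ and $\gamma_s \ge 2\max\{\gamma_s', \gamma_s'', \gamma_s'\gamma_s''\}$. $\triangleleft$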
The previous two propositions are building blocks for proving the next concluding propo-
sition which states that a sufficiently large value of Kd , Kq and Kξ renders the overall system
(5.5)-(5.8) input-to-state stable with respect to the inputs (vd , vq ) with arbitrarily large restric-
tions on the inputs and arbitrarily small linear gains.
Proposition 5.3 Let $M$ be an arbitrary positive number and set, as in the previous propositions,
$K_d = k\bar K_d$, $K_q = k\bar K_q$, $K_\xi = k\bar K_\xi$, where $\bar K_d$, $\bar K_q$ and $\bar K_\xi$ are fixed positive numbers. Then there exists
$k_3^\star > \max\{k_1^\star, k_2^\star\}$ such that for all $k \ge k_3^\star$ the state $x := (x_f, x_s) \in \mathbb{R}^7$ of the overall system (5.5)-(5.8)
satisfies an a-$L_\infty$ bound without restriction on the initial state, with restrictions $(M, \infty)$ on the inputs
$(v_d, v_q)$ and with linear gain functions. In particular there exist numbers $\gamma > 0$ and $\gamma^0 > 0$ such that for
each $x^0 \in \mathbb{R}^7$ and each measurable $(v_d, v_q)$ such that $\|v_d\|_\infty \le M$, the solution of (5.5)-(5.8) with
$x(0) = (x_f(0), x_s(0)) = x^0$ exists for all $t \ge 0$ and satisfies
\[
\|x\|_\infty \le \max\left\{\gamma^0\|x^0\|,\ \frac{\gamma}{k}\max\{\|v_d\|_\infty, \|v_q\|_\infty\}\right\}, \qquad
\|x\|_a \le \frac{\gamma}{k}\max\{\|v_d\|_a, \|v_q\|_a\} .
\]
Proof. The idea is to act on $k$ in order to force $x_f$ to fulfill the restriction $\Delta$ on an interval
$[T^\star, \infty)$ and to see the overall system as the cascade connection of two ISS systems. For this we
first need to make sure that the solution exists on any interval of the form $[0, T^\star]$, i.e. that the
system does not have finite escape time. To this end note that, if solutions (hence, in particular,
$\omega(t)$ and $s_\omega(t)$) are defined on an interval $[0, T^\star]$, $\|x_f(t)\|$ is bounded by the fixed quantity
$\max\{\gamma_f^0\|x_f^0\|, \frac{\gamma_f}{k}\|v_d\|_\infty\}$. On the other hand, the growth of the right-hand side of (5.8) is affine
in $\|x_s\|$, with coefficients only depending on bounds on $\|x_f(t)\|$ and $\|v_q(t)\|$. Thus, on the interval
$[0, T^\star]$, $\|x_s(t)\|$ is guaranteed to be bounded by an exponentially growing function which
only depends on $\|x_f^0\|$, $\|x_s^0\|$, $\|v_d\|_\infty$, $\|v_q\|_\infty$. As a consequence, solutions exist on any interval of
the form $[0, T^\star]$, i.e. no finite escape time can occur.
Now note that, by Proposition 5.1, $\|v_d\|_\infty < M$ and $k \ge k_1^\star$ imply that
\[
\|x_f\|_a \le \frac{\gamma_f}{k} M .
\]
Hence, fixing $k_3^\star > \gamma_f M/\Delta$, it turns out that there exists $T^\star > 0$ such that $\|x_f(t)\| < \Delta$ for all
$t \ge T^\star$, namely the input $x_f$ of the speed subsystem fulfills the restriction $\Delta$ in finite time. From
this the claim of the proposition follows by standard cascade arguments as a consequence of
Propositions 5.1 and 5.2. In particular, the fact that the asymptotic gain is proportional to $1/k$ follows
from gain composition. $\triangleleft$
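The gain selection used in the last step of the proof can be made explicit. The short derivation below assumes, as in the framework of [100] (the definition is not restated in this section), that $\|\cdot\|_a$ denotes the asymptotic norm $\limsup_{t\to\infty}\|\cdot\|$.

```latex
\[
k \ge k_3^\star > \frac{\gamma_f M}{\Delta}
\quad\Longrightarrow\quad
\|x_f\|_a = \limsup_{t\to\infty}\|x_f(t)\| \le \frac{\gamma_f}{k}\,M < \Delta .
\]
```

Since the limit superior is strictly smaller than $\Delta$, there is a finite time after which $\|x_f(t)\| < \Delta$ holds, which is the restriction required by the speed subsystem.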
The previous result implies that, in case $v_d$ and $v_q$ are identically zero, the overall closed
loop system given by (5.5)-(5.8) is globally asymptotically stable, namely the control objectives
specified in section 5.2.2 are achieved for every initial state $x(0)$ of the induction motor (5.1).
The control structure given by (5.4), (5.7) turns out to be a classical indirect field oriented (IFO)
controller, as proposed in [83]. In case $v_d$ and $v_q$ are not zero, the previous analysis has shown
that the IFO controller is “robust” with respect to exogenous inputs matched with the current
dynamics, as it achieves input-to-state stability with a linear gain proportional to $1/k$.
Before describing how this IFO controller can be enriched with a fault tolerant unit, we conclude
this subsection with a few remarks on some important points of the above analysis.
Remark 5.1 Note that the controller (5.7) includes two integral actions, provided by the state variables
T̂L and ξ. The purpose of the first one is to offset the constant unknown load torque TL . The presence of
the other integral action is justified to achieve “steady state robustness” with respect to the parameter α
(see for more details [83]).
Remark 5.2 It is interesting to note that the choice of the control law as in (5.4) and (5.7) puts the
system in a cascade form, with the flux subsystem feeding the speed subsystem. This is achieved by
forcing the variables ω(t) and sω (t) in the flux dynamics (5.5) to appear as entries of a skew symmetric
matrix. This has permitted, in the above analysis, to consider the flux subsystem as an autonomous
system feeding the speed dynamics (see the proof of the last proposition).
Remark 5.3 In principle the above IFO controller does not allow for uncertainties on the rotor parameter
α (which is typically a highly uncertain parameter), as this is explicitly used in (5.3)-(5.4) for
achieving the cascade structure. However it must be stressed that, in practice, this drawback does not
compromise the effectiveness of the controller, as demonstrated by the experimental results described in [83],
where it is shown that high uncertainties on α can be tolerated (see also the first remark).
Following the theory in [103], it turns out that the presence of mechanical and electrical faults
generates asymmetries in the IM, yielding some slot harmonics in the stator winding. In the
$a$-$b$ reference frame, it is possible to model this effect by adding a sinusoidal corruption term
to the stator current values. Specifically, letting $i_a^{uf}(t)$ and $i_b^{uf}(t)$ denote the values of the stator
currents in the absence of faults and $i_a^f(t)$ and $i_b^f(t)$ the corresponding values in the presence of
faults, the latter can be expressed in the form
in which
\[
\epsilon_e(t) = 2\pi\int_0^t f_e(\tau)\,d\tau , \tag{5.17}
\]
where $f_e(t)$ is a function which depends on the specific fault. For example, faults caused by
rotor asymmetries (typically due to broken bars or dynamic eccentricity) yield harmonic components
at frequencies
\[
f_e = f_{rbb} = (1 \pm 2k s_\omega) f \tag{5.18}
\]
in which (see section 5.2) $s_\omega = \omega_0 - \omega$ is the slip angular frequency, $f$ is the supply frequency
and $k$ is a positive integer. On the other hand, faults generated by stator asymmetries (typically
induced by short circuits or static eccentricity) generate harmonic components at the frequency
\[
f_e = f_{ssc} = f . \tag{5.19}
\]
The amplitude $A$ and the phase $\phi$ in (5.16) depend on the extent
of the rotor or stator asymmetry; they cannot therefore be considered known, since they depend on the
specific fault severity.
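To make the frequency pattern of (5.18) and (5.19) concrete, the following sketch lists the sideband frequencies produced by a rotor asymmetry. It is only an illustration: the function names are hypothetical, the slip is treated as a dimensionless number multiplying the supply frequency, and the 50 Hz / 0.02 values are assumed for the example rather than taken from the experiments.

```python
def rotor_fault_frequencies(f_supply: float, slip: float, n_components: int):
    """Return [(lower_k, upper_k)] for k = 1..N, following (1 -/+ 2 k s) f."""
    if n_components < 1:
        raise ValueError("n_components must be a positive integer")
    return [
        ((1.0 - 2.0 * k * slip) * f_supply, (1.0 + 2.0 * k * slip) * f_supply)
        for k in range(1, n_components + 1)
    ]


def stator_fault_frequency(f_supply: float) -> float:
    """Stator asymmetries show up at the supply frequency itself, as in (5.19)."""
    return f_supply


# Illustrative numbers (assumed): 50 Hz supply, slip 0.02, first two components.
sidebands = rotor_fault_frequencies(f_supply=50.0, slip=0.02, n_components=2)
```

With these sample numbers the first pair of sidebands sits 2 Hz below and above the supply frequency, and the second pair 4 Hz below and above it.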
Similarly, once the variables are expressed in the $(d, q)$ reference frame, it turns out that the
stator currents in the presence of (stator or rotor) asymmetries change into
\[
\begin{aligned}
i_d^f &= i_d^{uf} + A\sin(\epsilon_e(t) + \epsilon_0(t) + \phi) \\
i_q^f &= i_q^{uf} + A\cos(\epsilon_e(t) + \epsilon_0(t) + \phi)
\end{aligned}
\tag{5.20}
\]
where $\epsilon_0$, introduced in the previous section, denotes the angular position of the $(d, q)$ reference
frame. A few assumptions are made in the following in order to simplify relation (5.20).
First of all, for the sake of simplicity, we concentrate on the case in which the possible frequencies
which characterize the sinusoidal additive terms in (5.20) are constant. This corresponds to
assuming that the following two hypotheses hold:
(a) the reference angular velocity $\omega^\star$ is constant and a possible fault is allowed to arise only
when the steady state has been reached;
(b) the frequencies which characterize the faulty current are “frozen” to the steady state values,
namely to the values obtained when the state of the system assumes its steady state
value. This is equivalent to assuming that the frequencies do not depend on the actual
state of the plant but only on the reference to be tracked and on the parameters of
the system. These assumptions, which play a crucial role in the analysis which follows,
will be removed in the experimental and simulation results, where it will be shown how
the proposed fault tolerant controller works properly even in the case of state-dependent
frequencies.
As a matter of fact, under these assumptions, it is easy to realize that (bearing in mind (5.17),
(5.18), (5.19) and the definition of the slip $s_\omega$)
as far as faults concerning rotor asymmetries are concerned. In the previous expressions $s_\omega^\star$
denotes the (constant) steady state reached by $s_\omega$, which turns out to be
\[
s_\omega^\star := \frac{\alpha L_m \hat T}{\mu {\Psi^\star}^2} \tag{5.23}
\]
while $\epsilon_0^\star$ denotes the (unknown) position of the reference frame once the fault occurs.
As a final hypothesis about the effect of rotor asymmetries, we assume that there exists a (possibly
large) finite integer $N > 0$ such that all the components with frequencies $\Omega_{2,\pm k}$, $k > N$, are
negligible with respect to the first components.
These assumptions allow us to express the stator current values in the presence
of faults in terms of the un-faulty values as
\[
\begin{aligned}
i_d^f &= i_d^{uf} + A\sin(\Omega_1 t + \phi) + \sum_{k=1}^{N}\left[A_k\sin(\Omega_{2,k} t + \phi_k) + A_{-k}\sin(\Omega_{2,-k} t + \phi_{-k})\right] \\
i_q^f &= i_q^{uf} + A\cos(\Omega_1 t + \phi) + \sum_{k=1}^{N}\left[A_k\cos(\Omega_{2,k} t + \phi_k) + A_{-k}\cos(\Omega_{2,-k} t + \phi_{-k})\right]
\end{aligned}
\tag{5.24}
\]
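and
\[
S_{r,k} = \begin{pmatrix}
0 & \Omega_{2,k} & 0 & 0 \\
-\Omega_{2,k} & 0 & 0 & 0 \\
0 & 0 & 0 & \Omega_{2,-k} \\
0 & 0 & -\Omega_{2,-k} & 0
\end{pmatrix} .
\]
With this in mind it is clear that (5.24) can be expressed as
\[
i_d^f = i_d^{uf} + Q_d w, \qquad i_q^f = i_q^{uf} + Q_q w
\]
with
\[
Q_d := \begin{pmatrix} 1 & 0 & 1 & 0 & \cdots & 1 & 0 \end{pmatrix}, \qquad
Q_q := \begin{pmatrix} 0 & 1 & 0 & 1 & \cdots & 0 & 1 \end{pmatrix} .
\]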
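The link between the exosystem state and the sinusoidal terms of (5.24) can be checked numerically. The sketch below is an illustration under stated assumptions: the exosystem is taken to be $\dot w = S w$ with $S$ block diagonal and every 2x2 block of the form shown in $S_{r,k}$ (the stator block is assumed to have the same form with frequency $\Omega_1$), and each pair of initial conditions is chosen as $(A\sin\phi, A\cos\phi)$. All function names and numerical values are hypothetical.

```python
import math


def initial_state(amplitudes, phases):
    """Stack (A sin(phi), A cos(phi)) for every sinusoidal component."""
    w0 = []
    for amp, phi in zip(amplitudes, phases):
        w0.extend([amp * math.sin(phi), amp * math.cos(phi)])
    return w0


def exosystem_state(frequencies, w0, t):
    """Closed-form solution of w' = S w for blocks [[0, Om], [-Om, 0]]."""
    w = []
    for i, om in enumerate(frequencies):
        a, b = w0[2 * i], w0[2 * i + 1]
        c, s = math.cos(om * t), math.sin(om * t)
        w.extend([c * a + s * b, -s * a + c * b])
    return w


def current_deviation(w):
    """Apply Q_d = (1 0 1 0 ...) and Q_q = (0 1 0 1 ...) to w."""
    return sum(w[0::2]), sum(w[1::2])


# Illustrative data (assumed): one stator component and one rotor pair (N = 1).
frequencies = [2 * math.pi * 50.0, 2 * math.pi * 48.0, 2 * math.pi * 52.0]
amplitudes = [0.5, 0.2, 0.1]
phases = [0.3, -1.0, 2.0]
t = 0.0123
w_t = exosystem_state(frequencies, initial_state(amplitudes, phases), t)
dev_d, dev_q = current_deviation(w_t)
```

Under these assumptions `dev_d` reproduces the sum of sines and `dev_q` the sum of cosines, so the unknown amplitudes and phases enter only through the initial state of the exosystem.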
In this setting the uncertainty on the amplitude and phase of the additive sinusoidal terms
is reflected in the uncertainty on the initial state $w(0)$ of the exosystem (5.25), whose structure is uncertain as
the vector $\varpi$ of frequencies is uncertain.
Bearing in mind the dynamics of the rotor currents in the normal (i.e. fault-free)
operating conditions, it is also simple to obtain the IM dynamics after the occurrence of a fault.
As a matter of fact, taking derivatives of (5.24), it is readily seen that the model of the IM in the
presence of faults is given by (5.1) with an exogenous input $V$ equal to
\[
V = \begin{pmatrix} V_d \\ V_q \end{pmatrix}
= \begin{pmatrix} -\gamma Q_d w + Q_d S w + \omega_0 Q_q w \\ -\omega_0 Q_d w - \gamma Q_q w + Q_q S w \end{pmatrix} . \tag{5.26}
\]
and
\[
\ell_1 := \omega^\star + s_\omega^\star + \Omega_1, \qquad
\ell_{2,k} := \omega^\star + s_\omega^\star + \Omega_{2,k}, \qquad
\ell_{3,k} := \omega^\star + s_\omega^\star + \Omega_{2,-k} .
\]
Note that, with the above formalism, electrical faults, mechanical faults and simultaneous faults of both
kinds are allowed, with the first two components of the exosystem state accounting for stator
faults and the last $4N$ for rotor faults.
To conclude this section, we anticipate that in the next part of the chapter we will
assume that the initial state of the exosystem (5.25), which as stressed above depends on the
specific fault and on its severity, is unknown but ranges within a known, but otherwise arbitrarily
large, compact set, denoted $\mathcal{W}$. As far as the vector of frequencies $\varpi$ is concerned, we will study
first the case in which this is perfectly known (which corresponds to requiring perfect knowledge
of the IM parameters) and then the case in which it is uncertain. In the latter case, we
will assume the knowledge of a compact set, denoted $\mathcal{F}$, to which the vector $\varpi$ is supposed to
belong.
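[Figure: block diagram of the implicit fault tolerant control scheme. The flux subsystem ($x_f$) and the speed subsystem ($x_s$), driven by $u_d$, $u_q$ and affected by the fault inputs $V_d$, $V_q$, are controlled by the IFO controller (states $\xi$, $\hat T_L$; references $\Psi^\star$, $\omega^\star$; outputs $\bar u_d$, $\bar u_q$). The fault compensation unit (state $\zeta$), driven by $\tilde i_d$, $\tilde i_q$, adds the saturated inputs $u_d^{fc}$, $u_q^{fc}$ and passes $\zeta$ to the supervisor, which performs FDI.]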
fault compensation unit. In this perspective, the FDI phase is postponed to that of FTC, as it is
carried out by looking at the unit which is possibly already compensating the fault.
It is important to stress that the desired goal is to realize the fault compensation unit as a sort
of plug-and-play device, whose design is as independent as possible of the previously
designed controller, whose purpose is to achieve the prescribed control objectives in the un-faulty
operation mode. As a matter of fact, the only feature required of the controller designed
in section 5.2.3 is that the closed-loop system is input-to-state stable with suitable restrictions
with respect to a matched control input and with a sufficiently small linear gain function.
For the sake of clarity we address separately the design of the fault tolerant control unit in the
two cases in which the frequency of the disturbance is known (namely when the matrix $S$ in
(5.25) is known) and in which it is not.
T S − F T = GΓ (5.27)
namely the pair (S, Γ) is similar, via the change of coordinates induced by T , to the pair (F +
GΦ, Φ). Using this result twice, it is seen that the two components Vd , Vq of the exogenous input
and note that the dynamics of ĩ can be written in the compact form as
Proposition 5.4 Consider system (5.5), (5.8) with dynamic feedback control law given by (5.4), (5.7),
(5.30), with $K_d = k\bar K_d$, $K_q = k\bar K_q$, $K_\xi = k\bar K_\xi$, where $\bar K_d$, $\bar K_q$ and $\bar K_\xi$ are fixed positive numbers. Let
$\lambda$, the output saturation level of (5.30), be any number such that $\lambda \ge V_M$. Then there exists a number
$k^\star > 0$ such that, for all $k \ge k^\star$, $\|(x_f(t), x_s(t))\|$ and $\|\zeta(t) + Tw(t)\|$ asymptotically (and locally
exponentially) converge to zero for all $x_f(0) \in \mathbb{R}^3$, $x_s(0) \in \mathbb{R}^4$, $\zeta(0) \in \mathbb{R}^n$ and all $w(0) \in \mathcal{W}$.
ζ → χ := ζ + T w − Gĩ . (5.32)
In the new coordinates (it suffices to consider here only the dynamics of ĩ instead of the whole
dynamics of (xf , xs ) )
where $\rho(x_f, x_s, t) = a(x_f, x_s, t) + K\tilde i$. Now fix $M \ge 2\lambda$, let $k_3^\star$ and $\gamma$ be the lower bound for
the gain $k$ and, respectively, the gain coefficient determined in Proposition 5.3, and note that if
$k^\star \ge k_3^\star$, since $\|v_d\| \le 2\lambda \le M$, the restriction (equal to $M$) of the flux/speed subsystem is
fulfilled for all $t \ge 0$. As a consequence, since also $\|v_d\| \le M$, we deduce that $x_f(t)$, $x_s(t)$ exist
for all $t$ and
\[
\|(x_f, x_s)\|_a \le \frac{\gamma}{k} M . \tag{5.34}
\]
Observe now (compare with (5.5) and (5.8)) that there exist constants $\ell_1$, $\ell_2$, independent of $k$,
such that
\[
\|\rho(x_f, x_s, t)\| \le \ell_1\|(x_f, x_s)\| + \ell_2\|(x_f, x_s)\|^2 \tag{5.35}
\]
for all $t \ge 0$. In fact, the term $K\tilde i$ in $\rho(x_f, x_s, t)$ cancels out the terms in $a(x_f, x_s, t)$ which depend
on $k$. This, the fact that the matrix $F$ is Hurwitz, and (5.34) imply (assuming without loss of
generality that $\gamma M/k \le 1$) that $\chi(t)$ exists for all $t$ and
\[
\|\chi\|_a \le q\|\rho(x_f, x_s, \cdot)\|_a \le q\left(\ell_1 + \ell_2\|(x_f, x_s)\|_a\right)\|(x_f, x_s)\|_a \le q(\ell_1 + \ell_2)\|(x_f, x_s)\|_a \tag{5.36}
\]
for some positive $q$. Consider now the function $\mathrm{sat}_\lambda(\Phi\chi - \Phi T w) + \Phi T w$. Since $\lambda \ge \|\Phi T w\|_\infty$ it
turns out that there exists a continuous positive and bounded $\varphi(t)$ such that
where $L$ is a positive constant. Hence by the small gain theorem (Theorem 1 in [100]) we conclude that if
\[
k > \gamma q(\ell_1 + \ell_2)L
\]
the overall system is globally asymptotically stable. In particular, local exponential stability
follows from the fact that the linear approximation is Hurwitz.
Asymptotic convergence of $\|\zeta(t) + Tw(t)\|$ to zero trivially follows from the fact that $\chi(t)$ and $\tilde i$
converge to zero. $\triangleleft$
The statement of the previous result highlights two key features of the fault tolerant unit
(5.30): the first is that for all possible faults belonging to the classes specified in subsection 5.3.1
(regardless the fault severity) the state (xf , xs ) of the flux/speed subsystem converges to zero,
namely the control objectives are achieved. This means that the control is fault tolerant. The
second result is that the state of the fault compensation unit $\zeta$ converges to $-Tw(t)$, namely
the state of the exogenous signal is asymptotically reconstructed. This means, as specified in
subsection 5.3.1, that the specific fault and its severity can be precisely isolated and evaluated.
We conclude this subsection with some remarks which shed further light on the result of the
previous proposition.
Remark 5.4 It is worth noting that the result of proposition 5.4 remains true also in case the control
action ū is not provided by the IFO controller specified in section 5.2.3 but is generated by a different
controller. As a matter of fact it is easy to realize, with an eye to the proof of the previous proposition,
that the key feature required for the controller which generates the input ū, is the property of rendering
the corresponding closed-loop system input-to-state stable with respect to the matched exogenous input
ufc + V , with sufficiently large restrictions (compatible with the fault effect to compensate) and with
sufficiently small linear gain (to enforce the small gain condition on which the stability proof relies).
Any controller yielding such property is suitable for implementing the structure sketched in fig. 5.1.
Remark 5.5 It is interesting to stress the key role of the saturation function introduced in the output
of the fault compensation unit. On one hand, its presence allows us to fulfill the restriction on the
input ufc + V which characterizes the system under the IFO controller and hence to render the use of
the small gain theorem of [100] possible, so as to obtain asymptotic stability. On the other hand, the
saturation plays a role in decoupling the system consisting of the IM and the IFO controller from the
dynamics of the fault compensation unit. In this regard, note that a key point in the proof of the previous
proposition is the fact that the state (xf , xs ) can be rendered asymptotically small (see (5.34)), which
in turn implies that the (quadratic) function a(xs , xf , t) in (5.35) can be asymptotically dominated by a
linear function (see (5.36)). This fact, which is crucial to enforce the small gain condition, holds precisely
because of the presence of a saturation function, which renders the bound (5.34) fulfilled independently
of the χ-dynamics.
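The bounding role of the saturation discussed in Remark 5.5 can be illustrated with a minimal sketch. The definition of $\mathrm{sat}_\lambda$ is not reproduced in this part of the chapter, so the code assumes the standard componentwise clipping to $[-\lambda, \lambda]$; the function names and the sample input are hypothetical.

```python
def sat(s: float, level: float) -> float:
    """Clip a scalar to the interval [-level, level] (assumed definition)."""
    if level <= 0:
        raise ValueError("the saturation level must be positive")
    return max(-level, min(level, s))


def sat_vector(v, level: float):
    """Apply the scalar saturation to every component of v."""
    return [sat(x, level) for x in v]


# Illustrative compensator output (assumed values): the first entry exceeds the level.
u_fc = sat_vector([3500.0, -120.0], level=2000.0)
```

Whatever the internal state of the compensation unit does, each component of its output stays within the saturation level, which is what allows the restriction on the matched input to be met independently of the $\chi$-dynamics.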
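in which
\[
\zeta = \begin{pmatrix} \zeta_d \\ \zeta_q \end{pmatrix}, \qquad
\hat\Phi = \begin{pmatrix} \hat\Phi_d & 0 \\ 0 & \hat\Phi_q \end{pmatrix}
\]
is an estimate updated according to the following dynamics
\[
\begin{aligned}
\dot{\hat\Phi}_d &= \mathrm{dzn}_\ell(\hat\Phi_d) - \rho\,\tilde i_d\,\zeta_d^T \\
\dot{\hat\Phi}_q &= \mathrm{dzn}_\ell(\hat\Phi_q) - \rho\,\Psi^\star\tilde i_q\,\zeta_q^T
\end{aligned}
\tag{5.39}
\]
in which $\ell$ and $\rho$ are positive design parameters. As in the previous section, dealing with the
case of known frequencies, it is possible to prove that the indirect field oriented controller (5.4),
(5.7) with the adaptive fault compensation unit (5.38), (5.39) is able, for sufficiently large values
of $\lambda$, $\ell$ and $k$, to achieve the control objectives while offsetting the effect $V$ of any fault.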
The next proposition provides the desired result.
Proposition 5.5 Consider system (5.5), (5.8) with dynamic feedback control law given by (5.4), (5.7),
(5.38) and (5.39), with $K_d = k\bar K_d$, $K_q = k\bar K_q$, $K_\xi = k\bar K_\xi$, where $\bar K_d$, $\bar K_q$ and $\bar K_\xi$ are fixed positive
numbers. Suppose that the vector $\varpi$ and the initial state $w(0)$ range within fixed compact sets $\mathcal{F}$ and,
respectively, $\mathcal{W}$. Let $\ell$ and $\rho$ be arbitrary constants, with $\ell$ (the amplitude of the dead-zone functions
which characterize the adaptation law (5.39)) such that
Then there exist $\lambda > 0$ and $k^\star > 0$ such that, for all $k \ge k^\star$, $\|(x_f(t), x_s(t))\|$ and $\|\zeta(t) + Tw(t)\|$
asymptotically (and locally exponentially) converge to zero for all $x_f(0) \in \mathbb{R}^3$, $x_s(0) \in \mathbb{R}^4$, $\zeta(0) \in \mathbb{R}^n$,
$\hat\Phi(0) \in \mathbb{R}^n$ and all $w(0) \in \mathcal{W}$, $\varpi \in \mathcal{F}$.
Proof. For convenience the proof is divided in two parts. In the first part it is shown that suf-
ficiently large values of λ and k guarantee that the trajectories are bounded and the saturation
function enters the linear region in finite time. Then, in the second part, a Lyapunov argument
is used to show that the fault tolerance is achieved, namely the state (xf , xs ) asymptotically
decays to zero.
Consider again the change of variable (5.32). The input $v = (v_d, v_q)$ to (5.5)-(5.8), in the new
coordinates, reads as
\[
v = \mathrm{sat}_\lambda(\hat\Phi\chi - \hat\Phi T w) + \Phi T w . \tag{5.40}
\]
Since $\|\Phi T w\| \le V_M$ (with $V_M$ defined in (5.31)) and $|\mathrm{sat}_\lambda(s)| \le \lambda$ for all $s$, assuming without loss
of generality that $\lambda \ge V_M \ge 1$, it turns out that $\|v\|_\infty \le 4\lambda$. This implies (see Proposition
5.3) that, if $k$ is large enough, $x_f(t)$, $x_s(t)$ exist for all $t$ and
\[
\|(x_f, x_s)\|_a \le \frac{4\gamma\lambda}{k} = \frac{4\gamma}{k'} \qquad \text{taking } k = k'\lambda . \tag{5.41}
\]
Even though the control law has changed, from (5.30) to (5.38), the dynamics of $\chi$ still has the
form given by the second equation of (5.33). Bearing in mind the fact that $F$ is Hurwitz, the
bound (5.35), and assuming without loss of generality $k > 1$, the estimate (5.41) also implies
that $\chi(t)$ exists for all $t$ and there exists a $\delta > 0$, not dependent on $\lambda$, such that
\[
\|\chi\|_a \le \frac{\delta}{k'} .
\]
Hence, by definition of $\chi$,
\[
\|\zeta\|_a \le \|\chi - Tw + G\tilde i\|_a \le \|Tw\|_\infty + \|\chi + G\tilde i\|_a \le \|Tw\|_\infty + \frac{\delta + 4\|G\|\gamma}{k'} \le m .
\]
In the previous expression $m$ is a positive constant not dependent on $k'$ (assuming $k' \ge 1$) and
not dependent on $\lambda$.
We now focus on the $\hat\Phi$ dynamics (5.39). Since $|\mathrm{sat}_\ell(s)| \le \ell$ for all $s \in \mathbb{R}$ and
\[
\|\tilde i\zeta^T\|_a \le \|(x_f, x_s)\|_a\|\zeta\|_a \le \frac{4\gamma m}{k'}
\]
it is easy to realize that $\hat\Phi$ is bounded and the following asymptotic estimate holds
\[
\|\hat\Phi\|_a < \ell + \frac{4\rho\gamma m}{k'} \le 2\ell \tag{5.42}
\]
where the last inequality holds provided $k' \ge 4\rho\gamma m/\ell$. This means that the argument of the
saturation function in (5.40) satisfies an asymptotic estimate of the form
\[
\|\hat\Phi\chi - \hat\Phi T w\|_a < 2\ell\left(\frac{\delta}{k'} + \|Tw\|_\infty\right) \le n
\]
in which $n$ is a positive number not dependent on $\lambda$ and $k'$ (as $k' \ge 1$). In view of this, we
choose $\lambda \ge n$. This implies that there exists a $T^\star > 0$ such that for all $t \ge T^\star$
where
Φ̃ := Φ̂ − Φ . (5.43)
This completes the first part of the proof. Note that the previous discussion, in addition to
proving that the saturation function enters the linear region in finite time, has also shown that
the states of the flux and speed subsystems can be rendered arbitrarily small by increasing $k$
(see (5.41)) and that the estimate $\hat\Phi$ is bounded by a positive number (see (5.42)).
We now proceed to prove that $x_f(t)$ and $x_s(t)$ asymptotically decay to zero. The overall system
consists of the flux subsystem (5.5), of the speed subsystem (5.8), with
\[
v = \tilde\Phi\zeta + \Phi\chi - \tilde\Phi G\tilde i ,
\]
of the fault compensation unit, whose dynamics in the $\chi$-coordinates is described by the second
equation in (5.33), and of the adaptation law (5.39), which in the new error coordinates (5.43)
reads as
\[
\begin{aligned}
\dot{\tilde\Phi}_d &= \mathrm{dzn}_\ell(\tilde\Phi_d + \Phi_d) - \rho\,\tilde i_d\,\zeta_d^T \\
\dot{\tilde\Phi}_q &= \mathrm{dzn}_\ell(\tilde\Phi_q + \Phi_q) - \rho\,\eta\,\zeta_q^T .
\end{aligned}
\tag{5.44}
\]
We construct the Lyapunov function for the whole system in stages. First, consider the
flux subsystem and the Lyapunov function $V_f = \frac{1}{2} x_f^T x_f$. Taking derivatives along (5.5), simple
computations show that for a large value of $k$
for some positive numbers $n_1$, $n_2$, $n_3$. Consider now the speed subsystem (5.8) and, in
analogy to the analysis carried out in the proof of Proposition 5.2, set $x_s = (x_s^1, x_s^2)$, where
$x_s^1 := (\tilde T_L, \tilde\omega)$ and $x_s^2 := (\xi, \eta)$. For this system consider the Lyapunov function
$V_s = {x_s^1}^T P_1 x_s^1 + \frac{1}{2}{x_s^2}^T x_s^2$, with $P_1$ defined in (5.13). Bearing in mind the estimate (5.41) and noting that there
exist positive numbers $a$, $b_0$, $b_1$ such that
it is easy to obtain that, for a sufficiently large value of $k$, there exists a $T_1^\star$ such that for all
$t \ge T_1^\star$
\[
\dot V_s \le -q_1\|x_s\|^2 - (k\bar K_q - q_2)\eta^2 + q_3\|x_s\|\|x_f\| + \eta\tilde\Phi_q\zeta_q + q_4\|\eta\|\|\chi\|
\]
for some positive numbers $q_i$, $i = 1, \dots, 4$. Consider now the $\chi$-dynamics and define $V_\chi = \chi^T P_f \chi$, with $P_f$ satisfying
\[
P_f F + F^T P_f = -I .
\]
Recalling (5.35) and (5.41), it is easy to conclude that for a sufficiently large $k$, there exists $T_2^\star$
such that for all $t \ge T_2^\star$
For all $\ell \ge |a|$, the graph of the function $\mathrm{dzn}_\ell(s + a)$ lies in the second and fourth quadrants,
and therefore
\[
s\,\mathrm{dzn}_\ell(s + a) \le 0 \quad \text{for all } s \in \mathbb{R}.
\]
Hence, since by hypothesis $\ell \ge \Phi_M$,
\[
\dot V_\Phi \le -\tilde\Phi_d\,\tilde i_d\,\zeta_d^T - \tilde\Phi_q\,\eta\,\zeta_q^T .
\]
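The sign property of the shifted dead-zone used in this step can be checked numerically. The definition of $\mathrm{dzn}_\ell$ is not reproduced in this part of the chapter, so the sketch below assumes $\mathrm{dzn}_\ell(s) = \mathrm{sat}_\ell(s) - s$, a convention chosen here only because it matches the second/fourth quadrant statement above; the function names and the test grid are likewise hypothetical.

```python
def sat(s: float, level: float) -> float:
    """Clip s to [-level, level]."""
    return max(-level, min(level, s))


def dzn(s: float, level: float) -> float:
    """Assumed dead-zone: zero for |s| <= level, opposite in sign to s outside."""
    return sat(s, level) - s


def sign_condition_holds(level: float, offset: float, grid) -> bool:
    """Check s * dzn(s + offset) <= 0 on every grid point."""
    return all(s * dzn(s + offset, level) <= 0.0 for s in grid)


# Grid of exactly representable points in [-10, 10].
grid = [0.25 * i for i in range(-40, 41)]
inside = sign_condition_holds(level=2.0, offset=1.5, grid=grid)   # |a| <= level
outside = sign_condition_holds(level=2.0, offset=3.0, grid=grid)  # |a| > level
```

Under the assumed definition the inequality holds on the whole grid when the offset does not exceed the dead-zone amplitude and fails otherwise, which is why the hypothesis $\ell \ge \Phi_M$ is needed.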
with
\[
\epsilon_1 \le \frac{q_1 n_1}{q_2^2}, \qquad \epsilon_2 \le \frac{n_1 + q_1}{\ell_1} .
\]
A simple application of Young's inequality shows that for sufficiently large $k$
\[
\dot W(t) \le -r_1\|(x_f(t), x_s(t))\| - r_2\|\chi(t)\| \qquad \text{for all } t \ge \max\{T_1^\star, T_2^\star\},
\]
for some positive $r_1$ and $r_2$. This, by the LaSalle theorem, implies that the trajectories of the
closed loop system converge toward the largest invariant set characterized by $x_f = 0$, $x_s = 0$
and $\chi = 0$. This concludes the second part of the proof. $\triangleleft$
A few remarks commenting on the previous proposition are now presented.
Remark 5.6 The adaptation law chosen in (5.39) differs from that proposed in [93] in the presence of
the dead-zone functions $\mathrm{dzn}_\ell(\cdot)$, which is motivated by the fact that the output of the fault compensation
unit is saturated. As a matter of fact it is interesting to note, with an eye to the proof of the previous
proposition, that the role of $\mathrm{dzn}_\ell(\cdot)$ consists in keeping the estimate $\hat\Phi$ bounded with a bound which is not
dependent on $k$ (see (5.42) and the analysis just before). This indeed is crucial to show that in finite time
the saturation function which characterizes the output of the fault compensation unit definitively enters
the linear region.
Remark 5.7 Note that the proposition is not conclusive about the asymptotic convergence of the matrix $\hat\Phi$
to $\Phi$. In this respect it can be easily proved (following an analysis similar to that in [93]) that $\hat\Phi$ converges
to a fixed matrix which is, in general, different from $\Phi$, except in the case in which all the frequencies in $\varpi$ are
excited by the fault (namely if the initial condition $w(0)$ is such that the signal $Tw(t)$ contains all the
frequencies in $\varpi$). In general this condition does not hold, as it occurs only in the case of simultaneous stator and
rotor asymmetries.
ŵ(t) := T⁻¹ ζ(t)
where Tfs and Tfr are two positive thresholds.2 The two residual signals rfs and rfr correspond
to faults due to stator asymmetries and, respectively, faults due to rotor asymmetries.
It is important to note that the previous isolation strategy cannot be implemented as such
when the vector ϖ of exogenous frequencies is not perfectly known. As a matter of fact,
in such a case the matrix T solution of (5.27), which depends on ϖ, is not known and the
exogenous variable estimate ŵ cannot be computed. In this scenario a more sophisticated
isolation algorithm should be devised by using signal processing techniques.
2 With the notation s|i,j we denote the components of the vector s from the i-th to the j-th.
5.4. Experimental and simulation results 149
Table 5.1: Parameters of the induction motor adopted for the experimental activity.
The induction motor has been damaged by introducing a mechanical fault. Specifically five
of the 28 rotor bars have been holed in order to simulate a broken bar rotor failure, see fig. 5.2.
The diameter of each hole is 4 mm.
In all the experimental results which will be presented in the following, we have fixed a
constant speed reference ω ? = 100 rad/sec and a constant load torque TL = 6 Nm applied by
means of the brush-less motor. Furthermore the IFO controller described in section 5.2.3 has
been tuned following the constructive procedure illustrated in [84] and the control parameters
thus obtained are shown in tab. 5.2.
Kω Kd Kq Kξ Kη KT
120 300 300 500 90 7200
Figure 5.2: The rotor of the induction motor, with five bars broken by drilling two holes in each
bar.
As far as the Fault Compensation Unit described in sections 5.3.3 and 5.3.4 is concerned, we
have fixed λ, the amplitude of the saturation function, and ℓ, the amplitude of the dead-zone
function, to λ = 2000 and ℓ = 2500 respectively. These large values have been chosen in order
to test the fault tolerance even in case of very severe failures. Furthermore, the tuning of the
fault compensation unit (5.38)-(5.39) has been completed by taking ρ = 5 and the controllable pair
(F, G) as
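\[
F = \begin{pmatrix}
-10 & 0 & 0 & 0 & 0 & 0 \\
0 & -20 & 0 & 0 & 0 & 0 \\
0 & 0 & -30 & 0 & 0 & 0 \\
0 & 0 & 0 & -40 & 0 & 0 \\
0 & 0 & 0 & 0 & -50 & 0 \\
0 & 0 & 0 & 0 & 0 & -60
\end{pmatrix},
\qquad
G = \begin{pmatrix}
1 & 0 \\ 0 & 2 \\ 3 & 0 \\ 0 & 4 \\ 5 & 0 \\ 0 & 6
\end{pmatrix}.
\]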
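The claim that (F, G) is a controllable pair can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper names `is_hurwitz` and `is_controllable_pbh` are hypothetical and not part of the thesis. It uses the Popov-Belevitch-Hautus rank test, which is better conditioned than forming the controllability matrix when the eigenvalues are spread over a wide range.

```python
import numpy as np

# Matrices as reported in the text (F diagonal, G with one nonzero entry per row).
F = np.diag([-10.0, -20.0, -30.0, -40.0, -50.0, -60.0])
G = np.array([[1, 0], [0, 2], [3, 0], [0, 4], [5, 0], [0, 6]], dtype=float)


def is_hurwitz(A):
    """True if every eigenvalue of A has strictly negative real part."""
    return bool(np.all(np.linalg.eigvals(A).real < 0))


def is_controllable_pbh(A, B):
    """PBH test: (A, B) is controllable iff rank [lam*I - A, B] = n
    for every eigenvalue lam of A."""
    n = A.shape[0]
    for lam in np.linalg.eigvals(A):
        pencil = np.hstack([lam * np.eye(n) - A, B])
        if np.linalg.matrix_rank(pencil) < n:
            return False
    return True


print(is_hurwitz(F), is_controllable_pbh(F, G))
```

Since F is diagonal with distinct eigenvalues, the test reduces to requiring that no row of G is identically zero, which is the case here.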
Figures 5.3, 5.4 and 5.5 report respectively the steady state current tracking
error ĩd(t), the steady state current tracking error ĩq(t) and the steady state speed tracking
error ω̃(t) when only the IFO controller is present in the loop (upper plots) and when the
IFO controller is enriched with the fault compensation unit (lower plots). It is worth noting
that the failure due to the broken bars, if not compensated by means of the
fault compensation unit, generates a rather large steady state tracking error for the current id (in
particular ĩd(t) shows a mean value equal to 0.1 and oscillations of amplitude 0.4), which in
turn yields a large deviation of the speed tracking error. The effectiveness of the
fault compensation unit in compensating the effect of the fault and recovering the control
objectives is evident in these figures.
In order to stress how the information extracted from the state behavior of the fault compensation
unit can be useful to detect the occurred (and compensated) fault
(see section 5.3.5), fig. 5.6 reports the behavior of the outputs u_d^fc and u_q^fc when the
implicit fault compensation algorithm is run with an un-faulty induction motor (upper
plots) and with the induction motor with broken bars (lower plots). It is quite
evident that the behavior of these additional control inputs yields robust information about
the presence of the fault, which can be used for fault detection and isolation. In particular, as
stressed in section 5.3.5, the comparison of u_d^fc and u_q^fc with a fixed (or suitably adapted)
threshold can be used to detect whether the fault compensation unit is working to compensate for
oscillations and hence whether a fault has occurred.
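The threshold comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the residual used here (peak-to-peak amplitude over a sliding window) is a hypothetical choice and not the residual defined in section 5.3.5, and the signal amplitudes, window length and threshold are made-up values.

```python
import numpy as np


def oscillation_amplitude(u, window):
    """Peak-to-peak amplitude of u over every sliding window of `window` samples."""
    u = np.asarray(u, dtype=float)
    views = np.lib.stride_tricks.sliding_window_view(u, window)
    return views.max(axis=1) - views.min(axis=1)


def detect_fault(u, threshold, window):
    """Boolean array: True where the windowed oscillation amplitude exceeds the threshold."""
    return oscillation_amplitude(u, window) > threshold


# Illustrative signals sampled at 1 kHz for 1 s (values are assumptions).
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
u_healthy = 0.01 * np.sin(2 * np.pi * 10 * t)   # compensation unit almost idle
u_faulty = 0.30 * np.sin(2 * np.pi * 10 * t)    # unit compensating an oscillation

print(detect_fault(u_healthy, 0.1, 200).any(), detect_fault(u_faulty, 0.1, 200).all())
```

The window should span at least one period of the slowest expected fault harmonic, otherwise the peak-to-peak value underestimates the oscillation.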
Figure 5.3: Experimental results. Current tracking error ĩd(t) versus time (sec) in case of rotor failure with the
IM controlled via the standard IFO controller (upper plot) and using the implicit FT controller
(lower plot).
Figure 5.4: Experimental results. Current tracking error ĩq(t) versus time (sec) in case of rotor failure with the
IM controlled via the standard IFO controller (upper plot) and using the implicit FT controller
(lower plot).
Figure 5.5: Experimental results. Speed tracking error ω̃(t) versus time (sec) in case of rotor failure with the
IM controlled via the standard IFO controller (upper plot) and using the implicit FT controller
(lower plot).
To conclude this section we present a few simulation results aimed at testing the performance
of the algorithm also in the presence of electrical faults which, at present, are not yet
implemented in the experimental setup described above. The same electrical and mechanical
values presented in table 5.1 have been adopted as nominal parameters, and the simulation
results have been obtained assuming parametric uncertainties of up to 20%. Moreover, the same
tuning of the IFO controller and of the Fault Compensation Unit used for the experimental
activity has also been chosen for the simulations.
According to the theory presented in section 5.3.1, to simulate stator and rotor failures the
stator currents have been corrupted as in (5.24), taking N = 1, assuming φ = φ1 = φ−1 = 0
and computing Ω, Ω1 and Ω−1 using (5.21) and (5.22). The amplitudes A, A1 and A−1 have been
set to zero or to a positive number depending on the kind of simulated fault (stator or rotor).
Finally the stator currents and angular speed, which are processed by the control algorithm,
have been corrupted with Gaussian white noise with zero mean and standard deviation 0.2
A and 0.3 rad/sec respectively.
In all the experiments presented in the following the induction motor is simulated in the
un-faulty condition up to t = 2 sec and in different faulty scenarios from t = 2 sec. Moreover
the fault compensation unit is always initially not activated and it is inserted in the control
loop, with the only exception of the first experiment, at the time t = 1.2 sec.
In fig. 5.7 the behavior of the flux and speed tracking errors is reported when a stator fault
characterized by A = 0.1 is simulated and the fault compensation unit is not present in the
control loop. It is quite evident how the presence of the stator fault, for instance due to an
electrical short circuit, prevents the achievement of the control objectives. Fig. 5.8 and
fig. 5.9 show the same tracking errors in case of, respectively, a rotor fault (with A1 = A−1 =
0.1) and a stator fault (with A = 0.1) with the fault compensation unit present in the loop.
In these figures one can recognize the transient at t = 2 sec due to the occurrence of the fault. Note
how in this case fault tolerance is achieved by the fault compensation unit. Finally fig. 5.10
and fig. 5.11 report the behaviors of the first and third components of the vector ŵ defined in
section 5.3.5, which represents the estimate of the internal state of the exosystem (5.25), in the
presence respectively of rotor and stator faults. The same figures also show, overlapped,
the behaviors of the signals rfs(t) and rfr(t) defined in section 5.3.5, i.e. the residual
signals sensitive respectively to stator and rotor faults (the thresholds Tfs and Tfr have been
fixed at the value Tfs = Tfr = 5). It is interesting to note that, after an initial transient, only the
component of ŵ(t) which belongs to the part of the internal model devoted to compensating the
specific class of fault (rotor or stator) presents a steady state value different from zero, and hence
the residual signals rfs(t) and rfr(t) allow for perfect isolation of the occurred fault.
Figure 5.6: Experimental results. Left plots: the output σu_d^fc(t) of the fault compensation unit
with an un-faulty IM (upper plot) and with the faulty IM (lower plot). Right plots: the output
σu_q^fc(t) of the fault compensation unit with an un-faulty IM (upper plot) and with the faulty IM
(lower plot). Horizontal axes: time (sec).
Figure 5.7: Simulation results. From upper to lower: speed tracking error ω̃(t), the d flux
tracking error Ψ̃d(t) and the q flux Ψq(t) using a standard IFO controller when a rotor fault
occurs at time t = 2 s.
Figure 5.8: Simulation results. From upper to lower: speed tracking error ω̃(t), the d flux
tracking error Ψ̃d(t) and the q flux Ψq(t) using the augmented fault tolerant IFO controller
when a rotor fault occurs at time t = 2 s.
Figure 5.9: Simulation results. From upper to lower: speed tracking error ω̃(t), the d flux
tracking error Ψ̃d(t) and the q flux Ψq(t) using the augmented fault tolerant IFO controller
when a stator fault occurs at time t = 2 s.
5.5 Conclusions
In this chapter we have presented a new approach to the design of fault tolerant control systems,
illustrated through the design of a fault tolerant control unit for induction motors. We have
shown how an Indirect Field Oriented controller, which processes the currents and the angular velocity
of the IM in order to enforce desired flux and speed profiles, can be “enriched” with an internal
model of the fault in order to achieve fault tolerance as well as fault detection and isolation. The
design of the internal model unit can be considered independent from that of the stabilizing
IFO unit, as only the current gains may eventually need to be re-tuned. We have shown
how the internal model unit can be designed in order to have “global” tracking of the desired
references and “semi-global” tolerance to possible faults. Experimental and simulation results
have been presented in order to show the effectiveness of the approach.
Figure 5.10: Simulation results. Rotor fault affecting the IM at t = 2 sec. Upper plot: stator
failure signal ŵ1(t) (dash-dot line) and signal rfs(t) (continuous line). Lower plot: rotor failure
signal ŵr(t) (dash-dot line) and signal rfr(t) (continuous line). Horizontal axes: time (sec).
Figure 5.11: Simulation results. Stator fault affecting the IM at t = 2 sec. Upper plot: stator
failure signal ŵ1(t) (dash-dot line) and signal rfs(t) (continuous line). Lower plot: rotor failure
signal ŵr(t) (dash-dot line) and signal rfr(t) (continuous line). Horizontal axes: time (sec).
Conclusions and future works
The main aim of this thesis, the concluding work of a three-year research period focused on
Fault Diagnosis and Fault Tolerant Control, is to present a collection of results which
should lead towards a unified framework for Fault Tolerant Control of Distributed Control
Systems. The first key point was: why distributed systems? As stated in this work, from a
functional point of view there is practically no difference whether a task is implemented using
a centralized or a decentralized architecture.
The most important properties of distributed systems are composability (system properties
follow from subsystem properties), scalability (there are
no limits to the extensibility of a system and, at the same time, the complexity of reasoning
about the proper operation of any system function does not depend on the system size) and
dependability. This last property means that, when designing a distributed system, it is possible
to implement well-defined error-containment regions, thus achieving fault tolerance.
This idea, and the importance that large distributed systems have nowadays in applications
(see e.g. the automotive field), make distributed systems perfect candidates for
research on fault tolerance.
This work starts with a survey chapter which illustrates concepts, definitions and classical
results about Fault Detection and Isolation and Fault-Tolerant Control and introduces basic
concepts of distributed computer system architectures. Starting from these concepts, a novel
architecture for Fault Tolerant Distributed Control Systems has been introduced. This architecture
has been developed in order to accomplish the task of fault tolerance in a modular and
hierarchical way. This means that we want to achieve fault tolerance starting from each single
function of the system, composing the fault tolerant modules in the same way as the single functions
are composed to generate the global function of the complex system. For this reason functionality
analysis tools and failure analysis tools have been shown to be the right tools to design the
exoskeleton of this architecture.
Within this architecture three levels of operation can be identified: a low level (basic
functionalities level) at which physical faults are detected and reconfigured; an intermediate level
that supervises this substratum using cross information between different functionalities; and
a high level which is dedicated to the allocation of different functionalities on common hardware
in an optimal way and that also acts as a high level interface (communication cluster)
between human operator and plant.
After this discussion we have presented results that can be mapped onto the three levels
of operation introduced above. More specifically, we have presented an algorithm to predict the
reliability of the diagnostic level. The procedure is used to evaluate the reliability of a
complex distributed diagnosis system and gives as output an index which represents the
probability that, if a fault happens, the diagnosis system is able to detect it without generating
false alarms.
We have then presented a “high-level” diagnosis and reconfiguration engine in the framework
of discrete event systems, i.e. an automaton which is able to diagnose the occurrence of
a fault event before the system executes some dangerous operation. This property turns out to be
a kind of “robustness on demand”, in the sense that if the system is safe diagnosable we can
always detect the fault before the system becomes unreconfigurable and hence “force” reconfiguration
events that will lead the system to compensate (to the extent possible) for the effect
of the detected fault.
Concerning the “low-level”, an innovative fault-tolerant control scheme has been developed.
The idea is that of achieving implicit fault-tolerance of basic functionalities, without an
explicit estimation of faults and explicit reconfiguration actions. This idea has been applied to control
in a reliable way an induction motor in case of mechanical faults and a robot manipulator in
case of actuator faults (see Appendix C).
Future developments of this work include the study of complex modeling tools that are
able to highlight both the distributed nature of complex systems and the stochastic nature of
faults. We have to deal with horizontal modularity and vertical modularity. Discrete event
system theory can help in describing such systems. For example, horizontal modularity can be
achieved by modeling subsystems as automata and interconnecting them through suitable communication
protocols. An interesting development will be to extend results regarding diagnosability
and safe diagnosability to such systems. Considering vertical modularity, preliminary investigations
have shown that the state-charts modeling formalism of Harel (see [48]) can be applied
to study the diagnosability property in a more computationally efficient manner.
Another future development will be to enrich the deterministic information provided by
automata with statistical information. Doing so it is possible to introduce the notions of sta-
tistical diagnosability (some recent efforts in this sense are presented in [101]) and statistical
safe diagnosability. In this last case forbidden strings would be tolerated after a fault if their
probability is below a certain threshold.
Another very interesting research problem is to investigate methods to fuse information
from the high level with information from the low level in order to solve the FTC problem.
In this sense we are interested in exploring the problem of diagnosis of hybrid automata (see
[63]).
Finally, the topic of reconfiguration of systems at the high level of abstraction is also of great
importance. In this direction it is our intention to explore problems such as system redundancy,
probing for possible faults and controller reconfiguration using the automata formalism.
All this effort will go in the direction of finding a unified tool to study phenomena linked
with the distributed nature of the system and the stochastic nature of faults, and of using analysis
tools in this new setting to refine the proposed framework.
Appendices
Appendix A
Introduction to discrete event systems theory
Definition A.1 A discrete event system is a discrete-state, event-driven system, i.e. its state depends
entirely on the occurrence of asynchronous discrete events over time.
With this in mind, the behavior of a DES can be described in terms of event sequences of the
form e1 , e2 , · · · , en . A more formal way to study the logical behavior of DES is based on the
theories of languages and automata.
The starting point is the fact that any DES has an underlying event set E associated with
it. The set E is thought of as the alphabet of a language and event sequences are thought of as
strings (words) in that language. A string consisting of no events is called the empty string and is
denoted by ε. The length of a string s, denoted by |s|, is the number of events contained in it,
counting multiple occurrences of the same event.
Definition A.2 A language defined over a set E is a set of finite-length strings from events in E.
The key operation involved in building strings, and thus languages, from a set of events E is
concatenation: the string abb is the concatenation of the string ab and the event b. The empty
string ε is the identity element of concatenation. Let E* denote the set of all finite strings
of elements of E, including the empty string ε; the (·)* operation is called Kleene-closure. A
language over an event set E is a subset of E*.
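As an illustration of these notions, the sketch below enumerates the strings of E* up to a given length. It assumes, for convenience only, that every event is a single character, so that a string of events is a Python `str` and the empty string ε is `""`; the function name is hypothetical.

```python
from itertools import product


def kleene_closure_up_to(events, max_len):
    """All strings over `events` of length 0..max_len (E* itself is infinite)."""
    return [
        "".join(p)
        for n in range(max_len + 1)
        for p in product(sorted(events), repeat=n)
    ]


E = {"a", "b"}
strings = kleene_closure_up_to(E, 2)
print(strings)            # strings of length 0, 1 and 2 over {a, b}
print("ab" + "b")         # concatenation of the string ab and the event b
print("" + "ab" == "ab")  # the empty string is the identity of concatenation
```

Any language over E with strings of length at most 2 is then a subset of `strings`.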
If tuv = s, the following nomenclature can be defined:
• t is called a prefix of s,
• u is called a substring of s,
• v is called a suffix of s.
L̄ := {s ∈ E* : ∃t ∈ E* (st ∈ L)} .
The prefix closure of L is the language L̄ consisting of all the prefixes of all the strings in
L. In general L ⊆ L̄.
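For a finite language the prefix closure can be computed directly from the definition. A minimal sketch, again assuming single-character events so that strings of events are Python `str` values (the function name is hypothetical):

```python
def prefix_closure(language):
    """Set of all prefixes (including the empty string) of all strings in `language`."""
    return {s[:i] for s in language for i in range(len(s) + 1)}


L = {"abb", "ac"}
L_bar = prefix_closure(L)
print(sorted(L_bar))
print(L <= L_bar)                      # L is always contained in its prefix closure
print(prefix_closure(L_bar) == L_bar)  # the closure of a prefix-closed language is itself
```

A language is prefix-closed exactly when it coincides with its own prefix closure.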
L* := {ε} ∪ L ∪ LL ∪ LLL ∪ · · · .
G = (X, E, f, Γ, x0 , Xm )
where
• Γ : X → 2^E is the active event function: Γ(x) is the set of all the events e for which f(x, e) is
defined
In other words a string s is in L(G) if and only if it corresponds to an admissible path in the
state transition diagram. Note that in the above definitions we work with an extension of the
transition function defined over X × E* as:
f(x, ε) := x
f(x, se) := f(f(x, s), e) for s ∈ E* and e ∈ E .
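The extended transition function and the membership tests for L(G) and Lm(G) can be sketched as follows. This is an illustrative encoding, not the thesis' own: the partial transition function f is stored as a dictionary keyed by (state, event), events are single characters, and `None` is reserved to mean "undefined".

```python
class Automaton:
    """Deterministic automaton with a partial transition function."""

    def __init__(self, transitions, x0, marked):
        self.f = dict(transitions)   # (state, event) -> next state
        self.x0 = x0
        self.Xm = set(marked)

    def active_events(self, x):
        """Gamma(x): events e for which f(x, e) is defined."""
        return {e for (y, e) in self.f if y == x}

    def run(self, s, x=None):
        """Extended transition function f(x, s); returns None when undefined."""
        x = self.x0 if x is None else x
        for e in s:
            if (x, e) not in self.f:
                return None
            x = self.f[(x, e)]
        return x

    def generates(self, s):
        """True if s belongs to L(G)."""
        return self.run(s) is not None

    def marks(self, s):
        """True if s belongs to Lm(G)."""
        return self.run(s) in self.Xm


G = Automaton({(0, "a"): 1, (1, "b"): 0}, x0=0, marked={0})
print(G.generates("aba"), G.marks("aba"), G.marks("abab"))
```

The loop in `run` is the recursion f(x, se) = f(f(x, s), e) unrolled from left to right, with f(x, ε) = x as the starting value.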
Two automata G1 and G2 are said to be equivalent if they generate and mark the same languages, i.e.
L(G1) = L(G2) and Lm(G1) = Lm(G2).
It can happen that an automaton G reaches a state x where Γ(x) = ∅ but x ∉ Xm. This is
called a deadlock, because no further event can be executed. If deadlock happens, then necessarily
the prefix closure \overline{L_m(G)} will be a proper subset of L(G), since any string in L(G) that ends at state x
cannot be a prefix of a string in Lm(G).
Consider now the case when there is a set of unmarked states in G that forms a strongly
connected component, but with no transitions going out of the set. If the system enters this set,
then we get a so-called livelock. If livelock is possible then again \overline{L_m(G)} will be a proper subset
of L(G).
\overline{L_m(G)} ⊂ L(G)
In other words, if an automaton is blocking, deadlock and/or livelock can happen.
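Blocking can be tested on a finite automaton by comparing the states reachable from x0 with the states from which a marked state can be reached: the automaton is blocking if some reachable state cannot reach Xm. The sketch below uses a hypothetical dictionary encoding of the transition function, keyed by (state, event).

```python
def accessible(f, x0):
    """States reachable from x0."""
    seen, stack = {x0}, [x0]
    while stack:
        x = stack.pop()
        for (y, _e), z in f.items():
            if y == x and z not in seen:
                seen.add(z)
                stack.append(z)
    return seen


def coaccessible(f, marked):
    """States from which at least one marked state can be reached."""
    seen = set(marked)
    changed = True
    while changed:
        changed = False
        for (y, _e), z in f.items():
            if z in seen and y not in seen:
                seen.add(y)
                changed = True
    return seen


def is_blocking(f, x0, marked):
    """True if some accessible state is not coaccessible (deadlock or livelock)."""
    return not accessible(f, x0) <= coaccessible(f, marked)


deadlock = {(0, "a"): 1, (0, "b"): 2}                              # state 2: no events, unmarked
livelock = {(0, "a"): 1, (0, "b"): 2, (2, "c"): 3, (3, "c"): 2}    # states 2, 3 form a trap
print(is_blocking(deadlock, 0, {1}), is_blocking(livelock, 0, {1}))
```

The test does not distinguish deadlock from livelock; it only detects that the prefix closure of the marked language is strictly smaller than the generated language.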
Suppose now that an event e at state x may cause transitions to more than one state. In this
case f(x, e) is represented by a set of states. In addition we may want to allow the label ε in
the state transition diagram, i.e. we allow transitions between distinct states to have the empty
string as label². These two changes lead to the definition of a nondeterministic automaton.
where
The CoAc operation may shrink L(G), but does not affect Lm (G).
• Trim operation: An automaton that is both accessible and coaccessible is said to be trim.
We define the Trim operation as:
where
\[
f((x_1, x_2), e) =
\begin{cases}
(f_1(x_1, e), f_2(x_2, e)) & \text{if } e \in \Gamma_1(x_1) \cap \Gamma_2(x_2) \\
\text{undefined} & \text{otherwise}
\end{cases}
\]
and thus
Γ1×2(x1, x2) = Γ1(x1) ∩ Γ2(x2) .
This means that in the product the transitions of the two automata must always be synchronized
on common events (in E1 ∩ E2). It is easy to verify that:
L(G1 × G2) = L(G1) ∩ L(G2) , Lm(G1 × G2) = Lm(G1) ∩ Lm(G2) .
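where
\[
f((x_1, x_2), e) =
\begin{cases}
(f_1(x_1, e), f_2(x_2, e)) & \text{if } e \in \Gamma_1(x_1) \cap \Gamma_2(x_2) \\
(f_1(x_1, e), x_2) & \text{if } e \in \Gamma_1(x_1) \setminus E_2 \\
(x_1, f_2(x_2, e)) & \text{if } e \in \Gamma_2(x_2) \setminus E_1 \\
\text{undefined} & \text{otherwise.}
\end{cases}
\]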
In the parallel composition a common event can only be executed if the two automata
both execute it simultaneously. The two automata are synchronized on common events.
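To characterize the language generated, we define the projections
Pi : (E1 ∪ E2)* → Ei* , i = 1, 2,
as follows:
\[
P_i(\varepsilon) = \varepsilon, \qquad
P_i(e) =
\begin{cases}
e & \text{if } e \in E_i \\
\varepsilon & \text{if } e \notin E_i
\end{cases}
\qquad
P_i(se) = P_i(s) P_i(e) \ \text{ for } s \in (E_1 \cup E_2)^{*},\ e \in (E_1 \cup E_2).
\]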
In other words given two event sets where one is a subset of the other, this kind of pro-
jection (called natural projection) erases events in a string formed from the larger set, that
do not belong to the smaller one. We can also introduce the corresponding inverse maps
(inverse projection)
Pi⁻¹ : Ei* → 2^{(E1 ∪ E2)*}
defined as:
Pi⁻¹(t) = {s ∈ (E1 ∪ E2)* : Pi(s) = t} .
In other words given a string in the smaller event set, the inverse projection returns the
set of all strings in the larger event set that project to the given string. The projections
and their inverses are extended to languages, simply by applying them to all the strings
in the language. Note that
Pi[Pi⁻¹(L)] = L
but in general
L ⊆ Pi⁻¹[Pi(L)] .
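The natural projection and its inverse can be sketched as follows. Since the inverse projection of a string is in general an infinite set, the hypothetical helper below only enumerates its elements up to a maximum length; events are again assumed to be single characters.

```python
from itertools import product


def project(s, Ei):
    """Natural projection: erase from s every event that is not in Ei."""
    return "".join(e for e in s if e in Ei)


def inverse_projection_up_to(t, E, Ei, max_len):
    """Strings over E, of length <= max_len, whose projection onto Ei equals t."""
    return {
        "".join(p)
        for n in range(max_len + 1)
        for p in product(sorted(E), repeat=n)
        if project(p, Ei) == t
    }


E = {"a", "b", "u"}     # larger event set
Ei = {"a", "b"}         # smaller event set
print(project("aubu", Ei))
print(sorted(inverse_projection_up_to("a", E, Ei, 2)))
```

The second output illustrates why L ⊆ Pi⁻¹[Pi(L)] is in general a strict inclusion: starting from L = {ua}, the projection is {a} and its inverse image also contains a and au.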
Returning to the parallel composition between automata, it is now easy to prove that
L(G1 ∥ G2) = P1⁻¹[L(G1)] ∩ P2⁻¹[L(G2)] ,
Lm(G1 ∥ G2) = P1⁻¹[Lm(G1)] ∩ P2⁻¹[Lm(G2)] .
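The parallel composition rule (common events are executed jointly, private events independently) can be sketched on the accessible part of the composed automaton. The dictionary encoding of the transition functions and the function name are assumptions made for illustration; `None` must not be used as a state name.

```python
def parallel(f1, x01, E1, f2, x02, E2):
    """Accessible part of G1 || G2; returns (transition dict, initial state)."""
    x0 = (x01, x02)
    f, seen, stack = {}, {x0}, [x0]
    while stack:
        x1, x2 = stack.pop()
        for e in E1 | E2:
            n1 = f1.get((x1, e))   # None if e is not active at x1
            n2 = f2.get((x2, e))   # None if e is not active at x2
            if e in E1 and e in E2:
                if n1 is None or n2 is None:
                    continue       # common event: both must execute it
                nxt = (n1, n2)
            elif e in E1:
                if n1 is None:
                    continue
                nxt = (n1, x2)     # private event of G1
            else:
                if n2 is None:
                    continue
                nxt = (x1, n2)     # private event of G2
            f[((x1, x2), e)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return f, x0


# Two automata sharing only the event "c".
f1 = {(0, "a"): 1, (1, "c"): 0}
f2 = {(0, "b"): 1, (1, "c"): 0}
f, x0 = parallel(f1, 0, {"a", "c"}, f2, 0, {"b", "c"})
print(len(f), f[((1, 1), "c")])
```

In this example the common event c can occur only from the composed state (1, 1), i.e. after both a and b have occurred in either order. When E1 = E2 the same function returns the product of the two automata.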
2. L(Gobs ) = L(Gnd )
3. Lm (Gobs ) = Lm (Gnd )
where Eo is the set of observable events and Euo is the set of unobservable events. Treating
unobservable events as ε-transitions and building the observer corresponding to the nondeterministic
automaton obtained, it is easy to prove that the observer satisfies the following properties:
• L(Gobs) = P[L(G)]
• The state of Gobs that is reached after a string t ∈ P[L(G)] will contain all the states of G
that can be reached after any string in
P⁻¹(t) ∩ L(G) ,
where the natural projection P : E* → Eo* is defined by
\[
P(\varepsilon) = \varepsilon, \qquad
P(e) =
\begin{cases}
e & \text{if } e \in E_o \\
\varepsilon & \text{if } e \notin E_o
\end{cases}
\qquad
P(se) = P(s) P(e) \ \text{ for } s \in E^{*},\ e \in E.
\]
In other words, the state of Gobs is the union of all the states of G that are consistent with the
observable events that have occurred so far. In this sense the state of Gobs is an estimate of the
current state of G.
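The observer construction described above can be sketched as a subset construction in which unobservable events play the role of ε-transitions. The encoding (transition dictionary keyed by (state, event), observer states as `frozenset`s of states of G) and the function names are assumptions made for illustration.

```python
def unobservable_reach(states, f, Euo):
    """States reachable from `states` using unobservable events only."""
    reach, stack = set(states), list(states)
    while stack:
        x = stack.pop()
        for (y, e), z in f.items():
            if y == x and e in Euo and z not in reach:
                reach.add(z)
                stack.append(z)
    return frozenset(reach)


def build_observer(f, x0, Eo, Euo):
    """Return (observer transition dict, observer initial state)."""
    init = unobservable_reach({x0}, f, Euo)
    fobs, seen, stack = {}, {init}, [init]
    while stack:
        S = stack.pop()
        for e in Eo:
            step = {f[(x, e)] for x in S if (x, e) in f}
            if not step:
                continue                     # e cannot be observed from this estimate
            T = unobservable_reach(step, f, Euo)
            fobs[(S, e)] = T
            if T not in seen:
                seen.add(T)
                stack.append(T)
    return fobs, init


# Example with one unobservable event "u".
f = {(0, "u"): 1, (0, "a"): 2, (1, "a"): 3, (2, "b"): 0, (3, "b"): 3}
fobs, init = build_observer(f, 0, Eo={"a", "b"}, Euo={"u"})
print(sorted(init), sorted(fobs[(init, "a")]))
```

Each observer state is the set of states of G compatible with the observed string, which is the state estimate mentioned in the text.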
Theorem A.1 The class of languages representable by nondeterministic finite-state automata is exactly
the same as the class of languages representable by deterministic finite-state automata: R.
Theorem A.2 If L1 and L2 are in R, then the following languages are also in R:
1. L̄1
2. L1*
3. L1^c := E* \ L1
4. L1 ∪ L2
5. L1 ∩ L2
6. L1 L2 .
3 The simplest case is the language L = E* that can be represented by a single state automaton.
Theorem A.3 A language is regular if and only if it can be represented by a regular expression, i.e. by
means of the operations of Kleene-closure, union and concatenation.
G = (X, E, f, Γ, x0 , Xm ) .
E = Ec ∪ Euc
where
• Ec is the set of controllable events, i.e. those events that can be prevented from happening
(disabled) by supervisor S;
• Euc is the set of uncontrollable events, i.e. those events that cannot be prevented from
happening by supervisor S.
Assume for the moment that all the events in E executed by G are observed by S. A supervisor
S is a function
S : L(G) → 2^E
such that for each s ∈ L(G),
S(s) ∩ Γ(f (x0 , s))
is the set of enabled events that G can execute at its current state f (x0 , s). In view of this we will
say that supervisor S is admissible if for all s ∈ L(G)
K̄Euc ∩ L(G) ⊆ K̄ .
Remark A.1 The controllability condition in controllability theorem is intuitive and can be paraphrased
as: “if you cannot prevent it, then it should be legal”.
Definition A.12 (controllability) Let K and M = M̄ be languages over set E. Let Euc be a subset
of E. K is said to be controllable with respect to M and Euc if and only if
K̄Euc ∩ M ⊆ K̄ .
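For finite languages the controllability condition can be checked by direct enumeration. The sketch below assumes single-character events, strings represented as Python `str` values, and a prefix-closed M given explicitly as a finite set; the function names are hypothetical.

```python
def prefix_closure(language):
    """Set of all prefixes of all strings in `language`."""
    return {s[:i] for s in language for i in range(len(s) + 1)}


def is_controllable(K, M, Euc):
    """Check that every uncontrollable continuation of a prefix of K that is
    possible in M is again a prefix of K."""
    K_bar = prefix_closure(K)
    return all(s + e in K_bar for s in K_bar for e in Euc if s + e in M)


M = prefix_closure({"ab", "au"})   # plant behavior; "u" cannot be disabled
Euc = {"u"}
print(is_controllable({"ab"}, M, Euc))        # "au" is possible but not allowed by K
print(is_controllable({"ab", "au"}, M, Euc))
```

The first specification fails because after a the plant can execute the uncontrollable event u, which the specification does not admit: "if you cannot prevent it, then it should be legal".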
K̄Euc ∩ L(G) ⊆ K̄ ,
2. K is Lm(G)-closed, i.e.
K = K̄ ∩ Lm(G) .
results in
L(S/G) = K̄ .
We need now to build a convenient representation of the function S. Consider now an automa-
ton R that marks the language K:
R = (Y, E, g, ΓR , y0 , Y )
A.7. Unobservability problem 171
Lm(R × G) = Lm(R) ∩ Lm(G)
= K̄ ∩ Lm(G)
= L(S/G) ∩ Lm(G) = Lm(S/G) .
Since R is defined to have the same event set as G, R ∥ G = R × G. We will call R the
standard realization of S.
Definition A.13 (observability) Let K and M = M̄ be languages over set E. Let Ec be a subset of
E. Let Eo be another subset of E with P as the corresponding natural projection from E* to Eo*. K is
said to be observable with respect to M, P and Ec if for all s ∈ K̄ and for all σ ∈ Ec ,
3. K is Lm (G)-closed.
174 An experimental setup for FTC algorithms test
Table B.1: Parameters of the induction motor adopted for the experimental activity.
The stator current and the motor angular speed are acquired, respectively, by means of commercial
Hall-type sensors that output a 0-10 V signal proportional to the instantaneous value of an
AC current signal (0-50 A), and by a commercial two-pole resolver (6 V, 10 kHz, with a transformation
ratio of 0.28 ± 10%) with an encoder simulation output (1024 ppr).
The brush-less motor is controlled by its commercial driver in order to track a torque reference
signal. It is used to simulate an unknown torque load and is interconnected with the DSP
board in order to generate torque load references via software and hence simulate particular
operating conditions. The mechanical coupling of the two motors is made by two
mechanical components:
• an adaptive joint able to compensate for angular, axial and radial offsets;
[Figure residue, captions not recovered. Diagram labels: Induction Motor, Brush-less motor, IM Driver,
Brushless Driver, Resolver output, Encoder output, Torque reference, Current value, Control law,
Saddle + Joint.]
the brush-less motor to the power stage, the load torque reference signal from the DSP board
to the brush-less driver, command signals from the DSP board to the brush-less driver. These
signals are conditioned by means of a dedicated board.
The design has been completed with some expedients to solve electromagnetic compatibility
problems and with operator safety systems, in order to protect the human operator
from the high-voltage hazard.
Some pictures of the realized system are shown in fig. B.5 and fig. B.6.
B.2. The power stage 177
Figure B.4: Interconnection between the brush-less motor system and the setup. Diagram labels:
Resolver, Driver Brushless, Encoder Output, POWERIF, Torque reference, DSP-IF Board, Digital Bus,
Analog Bus, FastProt, Control Law.
[Figure residue, caption not recovered. Diagram labels: Plug A, Power stage, Plug C, Control board,
Resolver output, Torque reference, Plug MIL, Plug D, Analog bus, Digital Bus, FastProt Board,
Derivation board, Interface Board.]
C.1 Introduction
Chapter 5 addressed the case in which the faults affecting the controlled system
can be modelled as functions (of time) within a finitely-parameterized family of such functions.
A controller which embeds an internal model of this family is then designed in order to generate
a supplementary control action which compensates for the presence of any such fault,
regardless of its magnitude. The idea is pursued using the theoretical machinery of (nonlinear)
output regulation theory (see [23]) under the assumption that the side-effects generated by the
occurrence of the fault can be modelled as an exogenous signal generated by an autonomous
“neutrally stable” system (the so-called “exosystem”). In this framework, the Fault Detection
and Isolation phase is postponed to that of control reconfiguration, since it can be carried out by
testing the state of the internal model unit, which automatically activates to offset the presence
of the fault. This approach has been applied with very good results to the control of induction motors
in faulty conditions.
In this chapter the approach outlined above is specialized to the design of a fault tolerant
control system for an n-dof fully actuated robot manipulator subject to various sinusoidal torque
disturbances acting on the joints (see [61]). We will show how this problem can be cast as an
output regulation problem. More in detail, we show how a standard tracking robot controller (see
[40], [102], [39]) can be “augmented” with an internal model unit designed so as to compensate
the unknown spurious torque harmonics. In this way the controller is proved to be globally
implicitly fault tolerant to all the faults belonging to the model embedded in the regulator.
[Figure residue, caption not recovered. Block diagram with blocks: Exosystem, Trajectory Generator,
Nominal controller, n-dof Robot, Internal Model Unit, FDI Logic (Fault Estimation); signals:
v(t), q⋆(t), p⋆(t), ν(t), τ(t), ξ(t), q(t), p(t), q̄(t), p̄(t).]
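and, finally,
\[
J = \begin{pmatrix} 0 & I_n \\ -I_n & 0 \end{pmatrix}, \qquad
R = \begin{pmatrix} 0 & 0 \\ 0 & D(q) \end{pmatrix}, \qquad
G = \begin{pmatrix} 0 \\ I_n \end{pmatrix}
\]
with D(q) = Dᵀ(q) ≥ 0 taking into account the dissipation effects. The input is an effort
representing the input torques and the output is a flow representing the joint velocities. These
considerations lead to the following model
\[
\begin{aligned}
\begin{bmatrix} \dot q \\ \dot p \end{bmatrix}
&= \left[ \begin{pmatrix} 0 & I_n \\ -I_n & 0 \end{pmatrix}
 - \begin{pmatrix} 0 & 0 \\ 0 & D(q) \end{pmatrix} \right]
\begin{bmatrix} \dfrac{\partial H}{\partial q} \\[2mm] \dfrac{\partial H}{\partial p} \end{bmatrix}
+ \begin{bmatrix} 0 \\ I_n \end{bmatrix} \nu \\
y &= \begin{bmatrix} 0 & I_n \end{bmatrix}
\begin{bmatrix} \dfrac{\partial H}{\partial q} \\[2mm] \dfrac{\partial H}{\partial p} \end{bmatrix}
\end{aligned}
\tag{C.1}
\]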
This system will be affected by an external torque ripple v(t) acting through the control input
channel (i.e. the torque actually applied to the system will be the sum ν + v(t) of the control torque and
the external disturbance), and the problem addressed here is to compensate this disturbance
while, at the same time, detecting it and estimating its (unknown) magnitude.
It is worth pointing out again that the design of the internal model unit does not affect a previously
designed regulator, which carries out a particular task. To stress this feature, in the following
we will introduce a simple control scheme whose aim is to make the manipulator track a
known trajectory. This tracking control is developed following [39], but the same results can also be
obtained using a simpler controller.
First, a preliminary torque input able to compensate potential energies (such as gravity) is
designed:
\[
\nu = \frac{\partial P(q)}{\partial q} + \nu' \tag{C.2}
\]
Let us define the desired trajectory for the generalized coordinates and the generalized momenta
as $(q^\star(t), p^\star(t))$; this trajectory, to be realizable, has to satisfy
$p^\star(t) = M(q^\star)\dot q^\star(t)$. To define new error variables, consider the following
change of coordinates
\[
\begin{aligned}
\bar q &= q - q^\star(t) \\
\bar p &= p - M(q)\dot q^\star(t)
\end{aligned}
\tag{C.3}
\]
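\[
\begin{aligned}
\dot{\bar q} &= M^{-1}(q)\,\bar p \\
\dot{\bar p} &= -\frac{\partial H}{\partial q} - D(q)\frac{\partial H}{\partial p} + \nu'
  - \frac{d}{dt}\big(M(q)\dot q^\star(t)\big) \\
&= -\frac{1}{2}\,p^T \frac{\partial M^{-1}}{\partial q}\,p - D M^{-1}(q)\,p + \nu'
  - \frac{d}{dt}\big(M \dot q^\star(t)\big) \\
&= -\frac{1}{2}\,(\bar p + M\dot q^\star(t))^T \frac{\partial M^{-1}}{\partial q}
  (\bar p + M\dot q^\star(t)) - D M^{-1}(\bar p + M\dot q^\star(t)) + \nu'
  - \frac{d}{dt}\big(M \dot q^\star(t)\big) \\
&= -\frac{1}{2}\,\bar p^T \frac{\partial M^{-1}(q)}{\partial q}\,\bar p
  - D(q) M^{-1}(q)\,\bar p + \nu' - \Pi(q, \dot q^\star(t), \ddot q^\star(t))
\end{aligned}
\tag{C.4}
\]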
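Defining a new Hamiltonian function as
\[
H' = \frac{1}{2}\,\bar p^T M^{-1}(q)\,\bar p
\]
it is possible to rewrite (C.4) as a port-Hamiltonian system:
\[
\begin{aligned}
\dot{\bar q} &= \frac{\partial H'}{\partial \bar p} \\
\dot{\bar p} &= -\frac{\partial H'}{\partial \bar q} - D(q)\frac{\partial H'}{\partial \bar p}
  + \nu' - \Pi(q, \dot q^\star(t), \ddot q^\star(t))
\end{aligned}
\tag{C.5}
\]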
It is now possible to obtain perfect asymptotic tracking by designing the control torque so as
to cancel the "bad" term $\Pi(\cdot)$, to shape the energy of the error system so that it has a
minimum at the origin¹, and to add some damping so that this minimum is globally attractive:
\[
\nu' = \Pi(q, \dot q^\star(t), \ddot q^\star(t)) + D M^{-1}(q)\,\bar p - \bar q
  - M^{-1}(q)\,\bar p + \tau \tag{C.6}
\]
where $\tau$ is an additional control torque that will be used in the following section in order
to compensate the presence of additional torque disturbances.
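The whole error system (C.5) with the controller (C.6) writes as
\[
\begin{aligned}
\dot{\bar q} &= \frac{\partial \bar H}{\partial \bar p} \\
\dot{\bar p} &= -\frac{\partial \bar H}{\partial \bar q} - \frac{\partial \bar H}{\partial \bar p} + \tau
\end{aligned}
\tag{C.7}
\]
with
\[
\bar H = \frac{1}{2}\,\bar p^T M^{-1}(q)\,\bar p + \frac{1}{2}\,\bar q^T \bar q \tag{C.8}
\]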
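As a check of (C.7), the substitution of the control law (C.6) into the second equation of (C.5) can be written out term by term. The block below is only a sketch of that computation, using nothing beyond the definitions of $H'$ and $\bar H$ given in the text.

```latex
\begin{aligned}
\dot{\bar p}
&= -\frac{\partial H'}{\partial \bar q} - D(q) M^{-1}(q)\bar p
   + \underbrace{\Pi + D M^{-1}(q)\bar p - \bar q - M^{-1}(q)\bar p + \tau}_{\nu' \text{ from (C.6)}}
   - \Pi \\
&= -\Big(\frac{\partial H'}{\partial \bar q} + \bar q\Big) - M^{-1}(q)\bar p + \tau ,
\qquad\text{and, since } \bar H = H' + \tfrac12 \bar q^T \bar q, \\
\frac{\partial \bar H}{\partial \bar q} &= \frac{\partial H'}{\partial \bar q} + \bar q ,
\qquad
\frac{\partial \bar H}{\partial \bar p} = M^{-1}(q)\bar p
\;\;\Longrightarrow\;\;
\dot{\bar p} = -\frac{\partial \bar H}{\partial \bar q} - \frac{\partial \bar H}{\partial \bar p} + \tau .
\end{aligned}
```

The term $\Pi$ and the natural damping $D M^{-1}\bar p$ cancel exactly, while $-\bar q$ shapes the energy and $-M^{-1}\bar p$ injects the new damping.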
Remark C.1 It is worth remarking that this kind of control strategy is very similar to the
classical tracking control obtained by inverting the model and introducing simple proportional
and derivative terms (see e.g. [40], [102], [73]).
¹ Note that $\bar q = 0$ means that the tracking is achieved, as $q \to q^\star(t)$.
C.3. Canonical internal model unit design 185
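\[
S = \mathrm{diag}\{S_1, \dots, S_k\} \tag{C.11}
\]
with
\[
S_i = \begin{bmatrix} 0 & \omega_i \\ -\omega_i & 0 \end{bmatrix}, \qquad
\omega_i > 0, \quad i = 1, \dots, k \tag{C.12}
\]
and $z(0) \in Z$, with $Z \subseteq \mathbb{R}^{2k}$ a compact set.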
In this discussion the matrix $S$ is first assumed to be perfectly known; then, in section C.4,
this hypothesis is removed (as in [93]): the dimension $2k$ of the matrix $S$ is still known, but
all the characteristic frequencies $\omega_i$ are unknown and range within known compact sets,
i.e. $\omega_i^{\min} \le \omega_i \le \omega_i^{\max}$.
In this setup the lack of knowledge of the exogenous disturbance is reflected in the lack of
knowledge of the initial state $z(0)$ of the exosystem and, in section C.4, also of the
characteristic frequencies. For instance, in the next section any $v(t)$ obtained as a linear
combination of sinusoidal signals with known frequencies but unknown amplitudes and phases is
considered, while in section C.4 the frequencies are unknown too.
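To make the role of the initial state concrete, the following sketch evaluates the exosystem with the block-diagonal matrix S of (C.11)-(C.12) in closed form. The output map (summing the first state of every block) is an assumption made for the example, since the thesis only requires v(t) to be a linear combination of the exosystem states; the function names are hypothetical.

```python
import math


def exosystem_state(z0, omegas, t):
    """Exact solution z(t) = exp(S t) z0 with S = diag{S_1, ..., S_k} as in (C.11)-(C.12)."""
    z = []
    for i, w in enumerate(omegas):
        a, b = z0[2 * i], z0[2 * i + 1]
        c, s = math.cos(w * t), math.sin(w * t)
        # exp(S_i t) = [[cos, sin], [-sin, cos]] for S_i = [[0, w], [-w, 0]]
        z.extend([c * a + s * b, -s * a + c * b])
    return z


def disturbance(z0, omegas, t):
    """v(t) = Gamma z(t); Gamma is ASSUMED to pick the first state of every block."""
    return sum(exosystem_state(z0, omegas, t)[0::2])
```

With this choice each pair (z0[2i], z0[2i+1]) = (a, b) produces the harmonic a cos(ω_i t) + b sin(ω_i t), so the unknown amplitude and phase of every sinusoid are encoded entirely in z(0), while the frequencies sit in S.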
All these assumptions allow us to cast the problem of disturbance suppression as a problem of
output regulation (see [24], [44]), complicated by the lack of knowledge of the matrix $S$ (see
[93]), and suggest looking for a controller which embeds an internal model of the exogenous
disturbances, augmented by an adaptive part in order to estimate the characteristic frequencies
of the disturbances.
Remark C.2 Note again that the whole design method introduced in the following can be easily
applied to a general mechanical system described as a pHs (C.9). Hence the method is suitable for
a generic mechanical system already regulated to accomplish a certain task with a classical
control strategy (see [73] for a survey of passivity based control strategies applied to
port-Hamiltonian systems).
known.
As previously announced, the regulator to be designed will embed the internal model of the
exogenous disturbance: this internal model unit is designed according to the procedure proposed
in [72] (canonical internal model). Given any Hurwitz matrix $F$ and any matrix $G$ such that
$(F, G)$ is controllable, denote by $Y$ the unique matrix solution of the Sylvester equation
\[
Y S - F Y = G \Gamma
\]
and define $\Psi := \Gamma Y^{-1}$.
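The construction of Y and Ψ can be tried on a small case. The sketch below assumes one harmonic (k = 1), a scalar disturbance channel (n = 1) and a diagonal F, so that the Sylvester equation decouples row by row; all numerical values and function names are illustrative assumptions and are not taken from the thesis.

```python
# Illustrative data (assumptions): one harmonic (k = 1), scalar disturbance (n = 1).
OMEGA = 1.0
S = [[0.0, OMEGA], [-OMEGA, 0.0]]
F_DIAG = [-1.0, -2.0]   # F = diag(-1, -2), Hurwitz
G = [5.0, 5.0]          # (F, G) controllable: distinct eigenvalues, nonzero entries
GAMMA = [1.0, 0.0]      # v = GAMMA z


def solve_sylvester(f_diag, g, gamma, omega):
    """Solve Y S - F Y = G Gamma row by row (possible because F is diagonal)."""
    y = []
    for f, gi in zip(f_diag, g):
        # Row y_i satisfies y_i (S - f I) = g_i Gamma, with S - f I = [[-f, w], [-w, -f]].
        det = f * f + omega * omega
        y.append([gi * (-f * gamma[0] + omega * gamma[1]) / det,
                  gi * (-omega * gamma[0] - f * gamma[1]) / det])
    return y


def psi_from(y, gamma):
    """Psi = Gamma Y^{-1} for a 2 x 2 matrix Y."""
    det = y[0][0] * y[1][1] - y[0][1] * y[1][0]
    return [(gamma[0] * y[1][1] - gamma[1] * y[1][0]) / det,
            (-gamma[0] * y[0][1] + gamma[1] * y[0][0]) / det]


def residual(y, s, f_diag, g, gamma):
    """Largest entry of |Y S - F Y - G Gamma|."""
    worst = 0.0
    for i in range(2):
        for j in range(2):
            ys = y[i][0] * s[0][j] + y[i][1] * s[1][j]
            worst = max(worst, abs(ys - f_diag[i] * y[i][j] - g[i] * gamma[j]))
    return worst


Y = solve_sylvester(F_DIAG, G, GAMMA, OMEGA)
PSI = psi_from(Y, GAMMA)
# F + G Psi = Y S Y^{-1}, so the internal model matrix shares the spectrum of S.
A_IM = [[(F_DIAG[i] if i == j else 0.0) + G[i] * PSI[j] for j in range(2)]
        for i in range(2)]
```

Multiplying the Sylvester equation on the right by $Y^{-1}$ gives $F + G\Psi = Y S Y^{-1}$: the matrix driving the internal model state has the same purely imaginary eigenvalues as the exosystem, which is what allows it to reproduce the disturbance. In this example `A_IM` has zero trace and determinant ω².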
Let us introduce the internal model unit as
\[
\chi = \xi - Y z - G \bar p \tag{C.15}
\]
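Choosing $\tau_{st} = -\Psi G \bar p$, simple computations show that the $\bar p$-dynamics become
\[
\dot{\bar p} = -\frac{\partial \bar H}{\partial \bar q} - \frac{\partial \bar H}{\partial \bar p}
  + \Psi \chi \tag{C.17}
\]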
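Concentrating on the $\chi$-dynamics it is possible to design
\[
N(\bar q, \bar p) = -\Psi^T \bar p - F G \bar p
  - G \left( \frac{\partial \bar H}{\partial \bar q} + \frac{\partial \bar H}{\partial \bar p}
  + \Psi G \bar p \right)
\]
and write the last equation of (C.16) as
\[
\dot\chi = F \chi - \Psi^T \bar p \tag{C.18}
\]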
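Consider now the first equation of (C.16) with (C.17) and (C.18). This new system identifies a
port-Hamiltonian system described by:
\[
\dot x = [J(x) - R(x)] \frac{\partial H_x(x)}{\partial x} \tag{C.19}
\]
with
\[
x = \begin{pmatrix} \bar q & \bar p & \chi \end{pmatrix}^T ,
\]
the Hamiltonian $H_x(x)$ defined by
\[
H_x(x) = \frac{1}{2}\,\bar p^T M(q)^{-1} \bar p + \frac{1}{2}\,\bar q^T \bar q
  + \frac{1}{2}\,\chi^T \chi
\]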
C.4. Adaptive internal model unit design 187
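the skew-symmetric interconnection matrix $J(x)$ and the positive semi-definite damping matrix
$R$ defined by:
\[
J(x) = \begin{pmatrix} 0 & I & 0 \\ -I & 0 & \Psi \\ 0 & -\Psi^T & 0 \end{pmatrix}, \qquad
R = \begin{pmatrix} 0 & 0 & 0 \\ 0 & I & 0 \\ 0 & 0 & -F \end{pmatrix}
\]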
Proposition C.1 Consider the controlled n-degree of freedom robot manipulator (C.9) with
Hamiltonian (C.8), affected by the torque disturbances generated by (C.10), (C.11), (C.12).
The additional control law generated by the internal model unit:
\[
\begin{aligned}
\dot\xi &= (F + G\Psi)\xi - \Psi^T \bar p - F G \bar p + G \bar q
  + G\,\bar p^T \frac{\partial M^{-1}(q)}{\partial q}\,\bar p + G M^{-1}(q)\bar p
  - G \Psi G \bar p \\
\tau &= \Psi \xi - \Psi G \bar p .
\end{aligned}
\tag{C.20}
\]
asymptotically ensures the input disturbance suppression (fault tolerance with respect to torque
ripple, i.e. $(\bar q, \bar p) \to (0, 0)$ as time $t \to \infty$) and the convergence of the
state of the adaptive internal model to the fault signal (fault detection, i.e. $\xi \to Y z$).
Proof. Considering $H_x(x)$ as a Lyapunov function, the proof is immediate since (remembering
that $F$ is an arbitrary Hurwitz matrix)
\[
\dot H_x = -\left(\frac{\partial H_x}{\partial x}\right)^T R(x)\, \frac{\partial H_x}{\partial x} \le 0
\]
and, by LaSalle's invariance principle, the system asymptotically converges to
$\lim_{t\to\infty} (\bar p, \chi) = (0, 0)$. Moreover, from the first and second equations of
(C.20) it is possible to state that also $\lim_{t\to\infty} \bar q(t) = 0$, and the proposition
is proved. □
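The statement of Proposition C.1 can be explored numerically on a toy case. The sketch below is not the simulation of the thesis: it assumes a unit-mass 1-dof error system (so that $\partial\bar H/\partial\bar q = \bar q$ and $\partial\bar H/\partial\bar p = \bar p$), one harmonic with ω = 1, a disturbance entering the momentum equation as $-\Gamma z$, and the illustrative matrices F = diag(−1, −2), G = (5, 5)ᵀ, Γ = (1 0). The update term is the N(q̄, p̄) designed above, and Y, Ψ are the corresponding solution of the Sylvester equation.

```python
# Assumed data: unit-mass 1-dof error system, one harmonic with omega = 1.
F = [-1.0, -2.0]                 # diagonal of the Hurwitz matrix F
G = [5.0, 5.0]
GAMMA = [1.0, 0.0]
Y = [[2.5, -2.5], [2.0, -1.0]]   # solves Y S - F Y = G Gamma for S = [[0, 1], [-1, 0]]
PSI = [-0.4, 1.0]                # Psi = Gamma Y^{-1}


def rhs(x):
    """Closed loop in the original coordinates (q_bar, p_bar, xi, z)."""
    qb, pb, xi1, xi2, z1, z2 = x
    psi_g = PSI[0] * G[0] + PSI[1] * G[1]
    psi_xi = PSI[0] * xi1 + PSI[1] * xi2
    tau = psi_xi - psi_g * pb                  # tau = Psi xi - Psi G p_bar
    v = GAMMA[0] * z1 + GAMMA[1] * z2
    pb_dot = -qb - pb + tau - v                # disturbance enters as -Gamma z (assumed sign)
    common = qb + pb + psi_g * pb              # dH/dq + dH/dp + Psi G p_bar
    xi_dot = []
    for i, xi in enumerate((xi1, xi2)):
        n_i = -PSI[i] * pb - F[i] * G[i] * pb - G[i] * common
        xi_dot.append(F[i] * xi + G[i] * psi_xi + n_i)   # (F + G Psi) xi + N
    return [pb, pb_dot, xi_dot[0], xi_dot[1], z2, -z1]   # z_dot = S z


def rk4_step(x, dt):
    k1 = rhs(x)
    k2 = rhs([a + 0.5 * dt * b for a, b in zip(x, k1)])
    k3 = rhs([a + 0.5 * dt * b for a, b in zip(x, k2)])
    k4 = rhs([a + dt * b for a, b in zip(x, k3)])
    return [a + dt * (b + 2.0 * c + 2.0 * d + e) / 6.0
            for a, b, c, d, e in zip(x, k1, k2, k3, k4)]


def simulate(x0, dt=0.01, t_end=50.0):
    """Integrate the closed loop and return the final state."""
    x = list(x0)
    for _ in range(int(round(t_end / dt))):
        x = rk4_step(x, dt)
    return x
```

Starting for instance from `[1.0, 0.0, 0.0, 0.0, 2.0, 0.0]` (initial tracking error, internal model at rest, disturbance of amplitude 2), the tracking error decays while ξ settles on Y z, so the internal model state carries the information on the disturbance even though the disturbance itself is never measured.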
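\[
\begin{aligned}
\dot\xi &= (F + G\hat\Psi)\xi + N(\bar p, \bar q) \\
\dot{\hat\Psi}_i^T &= \varphi_i(\xi, \bar p, \bar q)
\end{aligned}
\tag{C.21}
\]
calling $\hat\Psi_i^T$, with $i = 1, \dots, n$, every column of the matrix
$\hat\Psi^T \in \mathbb{R}^{2k \times n}$.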
\[
\begin{aligned}
\chi &= \xi - Y z - G \bar p \\
\tilde\Psi_i^T &= \hat\Psi_i^T - \Psi_i^T , \qquad i = 1, \dots, n
\end{aligned}
\tag{C.22}
\]
where $\Psi_i^T$ represents the $i$-th column of $\Psi^T$, system (C.9), (C.21) becomes
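\[
\begin{aligned}
\dot{\bar q} &= M(q)^{-1} \bar p \\
\dot{\bar p} &= -\frac{\partial \bar H}{\partial \bar q} - \frac{\partial \bar H}{\partial \bar p}
  + \hat\Psi \xi + \tau_{st} - \Psi Y z \\
\dot\chi &= (F + G\hat\Psi)\xi + N(\bar p, \bar q) - Y S z - G \dot{\bar p} \\
\dot{\hat\Psi}_i^T &= \varphi_i(\xi, \bar p, \bar q) , \qquad i = 1, \dots, n
\end{aligned}
\tag{C.23}
\]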
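Note that, since $\Psi = \hat\Psi - \tilde\Psi$ and $Y z = \xi - \chi - G \bar p$,
\[
\begin{aligned}
\dot{\bar p} &= -\frac{\partial \bar H}{\partial \bar q} - \frac{\partial \bar H}{\partial \bar p}
  + \hat\Psi(\xi - Y z) + \tau_{st} + \tilde\Psi Y z \\
&= -\frac{\partial \bar H}{\partial \bar q} - \frac{\partial \bar H}{\partial \bar p}
  + \hat\Psi(\xi - Y z - G \bar p) + \hat\Psi G \bar p + \tau_{st}
  + \tilde\Psi(\xi - \chi - G \bar p)
\end{aligned}
\]
so that, setting $\tau_{st} = -\hat\Psi G \bar p + \tau'_{st}$,
\[
\dot{\bar p} = -\frac{\partial \bar H}{\partial \bar q} - \frac{\partial \bar H}{\partial \bar p}
  + \hat\Psi \chi + \tilde\Psi \xi - \tilde\Psi \chi - \tilde\Psi G \bar p + \tau'_{st}
\]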
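Choosing now $\tau'_{st} = \bar A M^{-1}(q) \bar p$, with $\bar A$ such that $A = \bar A - I$ is
Hurwitz, we obtain
\[
\dot{\bar p} = -\frac{\partial \bar H}{\partial \bar q} + A \frac{\partial \bar H}{\partial \bar p}
  + \Psi \chi + \tilde\Psi(\xi - G \bar p) . \tag{C.24}
\]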
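Considering every single element of the vector $\bar p$ it is possible to write (from now on the
superscript $i$ denotes the $i$-th element of the vector considered)
\[
\dot{\bar p}^{\,i} = \left( -\frac{\partial \bar H}{\partial \bar q}
  + A \frac{\partial \bar H}{\partial \bar p} + \Psi \chi \right)^{i}
  + (\xi - G \bar p)^T \tilde\Psi_i^T \tag{C.25}
\]
with $i = 1, \dots, n$.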
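Concentrate now on the $\chi$-dynamics in order to suitably design the update term
$N(\bar q, \bar p)$:
\[
\begin{aligned}
\dot\chi &= (F + G\hat\Psi)\xi + N(\bar p, \bar q) - F Y z - G \Gamma z
  - G \left( -\frac{\partial \bar H}{\partial \bar q} - \frac{\partial \bar H}{\partial \bar p}
  + \hat\Psi \xi + \tau_{st} - \Gamma z \right) \\
&= F \chi + F G \bar p + N(\bar p, \bar q) + G \frac{\partial \bar H}{\partial \bar q}
  + G \frac{\partial \bar H}{\partial \bar p} - G \tau_{st}
\end{aligned}
\]
Choosing
\[
N(\bar p, \bar q) = -F G \bar p - G \frac{\partial \bar H}{\partial \bar q}
  - G \frac{\partial \bar H}{\partial \bar p} + G \tau_{st} \tag{C.26}
\]
we obtain
\[
\dot\chi = F \chi = F \chi - \Psi^T \bar p + \Psi^T \bar p . \tag{C.27}
\]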
As all the dynamics of (C.23) have been investigated, it is now possible to design an adaptation
law for $\hat\Psi^T$: assume then
Consider now the first equation of (C.23) with all (C.25), (C.27) and (C.28). This new system
(with a small abuse of notation in order to obtain a more compact and readable formulation)
identifies an interconnection described by:
\[
\dot x = [J(x) - R(x)] \frac{\partial H_x(x)}{\partial x} + \Lambda(x) \tag{C.29}
\]
with
\[
x = \begin{pmatrix} \bar q & \bar p & \chi & \tilde\Psi^T \end{pmatrix}^T ,
\]
the Hamiltonian $H_x(x)$ defined by
\[
H_x(x) = \frac{1}{2}\,\bar p^T M(q)^{-1} \bar p + \frac{1}{2}\,\bar q^T \bar q
  + \frac{1}{2}\,\chi^T \chi + \sum_{i=1}^{n} \frac{1}{2}\,\tilde\Psi_i \tilde\Psi_i^T
\]
Proposition C.2 Consider the controlled n-degree of freedom robot manipulator (C.9) with
Hamiltonian (C.8), affected by the torque disturbances generated by (C.10), (C.11), (C.12).
The additional control law generated by the adaptive internal model unit:
\[
\begin{aligned}
\dot\xi &= (F + G\hat\Psi)\xi - F G \bar p
  + G\,\frac{1}{2}\,\bar p^T \frac{\partial M^{-1}(q)}{\partial q}\,\bar p + G \bar q
  - G \hat\Psi G \bar p + G A M^{-1}(q) \bar p \\
\dot{\hat\Psi} &= -(\xi - G \bar p)^T \bar p \\
\tau &= \hat\Psi \xi - \hat\Psi G \bar p + A M^{-1}(q) \bar p .
\end{aligned}
\tag{C.30}
\]
asymptotically ensures the input disturbance suppression (fault tolerance with respect to torque
ripple, i.e. $(\bar q, \bar p) \to (0, 0)$ as time $t \to \infty$) and the convergence of the
state of the adaptive internal model to the fault signal (fault detection, i.e. $\xi \to Y z$).
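Proof. Consider for system (C.29) (obtained connecting (C.9) with (C.30)) the following Lyapunov
function:
\[
V = H_x(x) \tag{C.31}
\]
Easy computations (remembering the skew-symmetry of the interconnection matrix $J(x)$) show that
there exist two real numbers $\eta_A \in \mathbb{R}^-$, $\eta_F \in \mathbb{R}^-$ (depending on
the design matrices $A$ and $F$) and $\eta_\Psi \in \mathbb{R}$, such that
\[
\begin{aligned}
\dot V &\le \eta_A \|\bar p\|^2 + \eta_F \|\chi\|^2 + \Psi^T \bar p \chi \\
&\le \eta_A \|\bar p\|^2 + \eta_F \|\chi\|^2 + \eta_\Psi \|\bar p\| \|\chi\| .
\end{aligned}
\tag{C.32}
\]
Using a Young's inequality argument we can write:
\[
\dot V \le \eta_A \|\bar p\|^2 + \eta_F \|\chi\|^2 + \frac{\eta_\Psi}{2}\,\varepsilon \|\bar p\|^2
  + \frac{\eta_\Psi}{2\varepsilon} \|\chi\|^2 , \tag{C.33}
\]
for a certain value of $\varepsilon$. Now choosing $\varepsilon = -\eta_A / \eta_\Psi$, we obtain
\[
\dot V \le \frac{\eta_A}{2} \|\bar p\|^2
  + \left( \eta_F - \frac{\eta_\Psi^2}{2 \eta_A} \right) \|\chi\|^2
\]
\[
\lim_{t \to \infty} x(t) = 0 .
\]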
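The Young's inequality step used in the proof of Proposition C.2 can be spelled out as follows. This is a sketch under the assumption $\eta_\Psi > 0$ (so that $\varepsilon = -\eta_A/\eta_\Psi$ is positive); the final sign condition is our reading of what the bound requires and is not stated explicitly in the surviving text.

```latex
\begin{aligned}
&\|\bar p\|\,\|\chi\| \le \frac{\varepsilon}{2}\|\bar p\|^2 + \frac{1}{2\varepsilon}\|\chi\|^2
  \qquad (\varepsilon > 0) \\
&\Longrightarrow\quad
\dot V \le \Big(\eta_A + \frac{\eta_\Psi \varepsilon}{2}\Big)\|\bar p\|^2
  + \Big(\eta_F + \frac{\eta_\Psi}{2\varepsilon}\Big)\|\chi\|^2 , \\
&\varepsilon = -\frac{\eta_A}{\eta_\Psi}:\qquad
\eta_A + \frac{\eta_\Psi \varepsilon}{2} = \frac{\eta_A}{2} , \qquad
\eta_F + \frac{\eta_\Psi}{2\varepsilon} = \eta_F - \frac{\eta_\Psi^2}{2\eta_A} .
\end{aligned}
```

Since $\eta_A < 0$, the term $-\eta_\Psi^2/(2\eta_A)$ is positive, so the coefficient of $\|\chi\|^2$ is negative only when $|\eta_F| > \eta_\Psi^2/(2|\eta_A|)$, a condition that can be met through the choice of the design matrices $A$ and $F$.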
Remark C.3 The fault detection and isolation phase can be performed by checking the state of the
fault compensation unit, which automatically offsets the fault effect. In this framework the
detection phase can be easily carried out by comparing $\|\xi(t)\|$ with a suitably tuned
threshold; in fact, as proved in Proposition C.2, $\xi(t)$ asymptotically converges to $Y z(t)$,
which is zero in the un-faulty case and different
C.5. Simulation results 191
C.6 Conclusions
The main result presented in this appendix is an adaptive internal model unit designed to
compensate unknown spurious torque harmonics that degrade the performance of an n-dof fully
actuated mechanical robot. We have shown how a standard tracking robot control can be
"augmented" with an internal model unit to achieve global implicit fault tolerance with respect
to all the faults belonging to the model embedded in the regulator. We have also shown how fault
detection and isolation can be performed simply by testing the state of the internal model.
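The detection test on the state of the internal model can be sketched in a few lines. The function name, the sampled representation of ξ(t) and the threshold value are hypothetical; the thesis only prescribes the comparison of ‖ξ(t)‖ with a suitably tuned threshold.

```python
import math


def detect_fault(xi_samples, threshold):
    """Return the index of the first sample of xi whose norm exceeds the threshold, else None."""
    for k, xi in enumerate(xi_samples):
        if math.sqrt(sum(c * c for c in xi)) > threshold:
            return k
    return None
```

For example, `detect_fault([[0.0, 0.01], [0.02, 0.0], [0.6, 0.8]], 0.5)` flags the third sample, whose norm is 1. In practice the threshold would have to be tuned against noise and against the transient of the internal model.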
[Figure: (a) Tracking error for θ1, angle position of the first joint; (b) Tracking error for θ2,
angle position of the second joint.]
Figure C.3: From upper to lower plot: disturbance torque ripple acting on the first joint,
controlled torque τ1 on the first joint and controlled torque τ2 on the second joint
Bibliography
[1] R. Alur and D.L. Dill. A theory of timed automata. Theoretical Computer Science, (126):183–
235, 1994.
[2] J.D. Andrews and T.R. Moss. Reliability and Risk Assessment. Professional Engineering
Publishing, 2002.
[4] L. El Bahir, R. Gros, M. Kinnaert, C. Parloir, and J.Yamé. Final report on WP2. IFATIS
deliverable D2-5, January 2003.
[5] J. Balakrishnan and K.S. Narendra. Adaptive control using multiple model. IEEE trans-
actions on automatic control, 42(2), 1997.
[6] A. Bellini, F. Filippetti, G. Franceschini, and C. Tassoni. Closed-loop control impact on the
diagnosis of induction motor faults. IEEE Transactions on Industry Applications, 36(5):1318
– 1328, 2000.
[7] A. Benveniste, E. Fabre, C. Jard, and S. Haar. Diagnosis of asynchronous discrete event
systems, a net unfolding approach. Proceedings of the Workshop on discrete event systems,
2002.
[8] D.E. Bernard, G.A. Dorais, C. Fry, E.B. Gamble Jr., B. Kanefsky, J. Kurien, W. Millar,
N. Muscettola, P.P. Nayak, B. Pell, K. Rajah, N. Rouquette, B. Smith, and B.C. Williams.
Design of the remote agent experiment for spacecraft autonomy. Proceedings of IEEE
Aerospace, 1998.
[10] M. Blanke. Aims and means in the evolution of fault tolerant control. Proceedings of the
European Science Foundation COSY workshop, Rome, 1995.
[11] M. Blanke. Aims and means in the evolution of fault tolerant control. In Proceedings of the
European science foundation COSY workshop, Roma, September 1995.
[12] M. Blanke, M. Kinnaert, J. Lunze, and M. Staroswiecki. Diagnosis and fault-tolerant control.
Springer-Verlag, 2003.
[13] M. Blanke, R. I. Zamanabadi, and S. A. Bøgh. Fault tolerant control systems: a holistic
view. Control engineering practice, 5(5), 1997.
[14] R. Boel and J. Van Schuppen. Decentralized failure diagnosis for discrete-event systems
with constrained communication between diagnosers. Proceedings of the Workshop on dis-
crete event systems, 2002.
[15] S.A. Bogh. Fault Tolerant Control Systems - a Development Method and Real-Life Case Study.
PhD thesis, Aalborg University, Department of Control Engineering, December 1997.
0908-1208.
[16] C. Bonivento, M. Capiluppi, L. Marconi, and A. Paoli. System analysis and decomposi-
tion methods. IFATIS deliverable D6-3, November 2003.
[17] C. Bonivento, L. Gentili, and A. Paoli. Implicit fault tolerant control of a robot manipula-
tor. submitted to CDC, 2004.
[18] C. Bonivento, A. Isidori, L. Marconi, and A. Paoli. Implicit fault tolerant control: Appli-
cation to induction motors. Automatica, 40(3):355–371, 2004.
[19] C. Bonivento, A. Paoli, and L. Marconi. Fault-tolerant control for a ship propulsion sys-
tem. In Proceedings of the ECC, Porto, Portugal, 2001.
[20] C. Bonivento, A. Paoli, and L. Marconi. A fault-tolerant strategy for induction motors.
40th IEEE Conference on Decision and Control, Orlando, 2001.
[21] C. Bonivento, A. Paoli, and L. Marconi. Fault-tolerant control for a ship propulsion
system. Control engineering practice, 11(10), 2002.
[23] C. I. Byrnes, F. Delli Priscoli, A. Isidori, and W. Kang. Structurally stable output regula-
tion of nonlinear systems. Automatica, 33(2):369 – 385, 1997.
[24] C.I. Byrnes, F. Delli Priscoli, and A. Isidori. Output regulation of uncertain nonlinear systems.
Birkhäuser, Boston, 1997.
[25] F. Caliskan and R. Vepa. A real-time reconfiguration algorithm for aircraft flight control.
Proceedings of conference on Aerospace Vehicle Dynamics and Control, Cranfield Institute of
Technology, 1994.
[26] C.G. Cassandras and S. Lafortune. Introduction to discrete event systems. Kluwer Academic
Publisher, 1999.
[27] P.R. Chandler, M. Pachter, and M. Mears. System identification for adaptive and recon-
figurable control. Journal of guidance, control and dynamics, 18(3), 1995.
[28] J. Chen and R.J. Patton. Robust model based fault diagnosis for dynamic systems. Kluwer
academic publishers, Boston, 1999.
[29] Y.-L. Chen and G. Provan. Modeling and diagnosis of timed discrete event systems -
A factory automation example. Technical Report SC-PP-96-72, Rockwell Science Center,
Thousand Oaks, CA, September 1996.
[30] Y.-L. Chen and G. Provan. Modeling and diagnosis of timed discrete event systems - a
factory automation example. In Proceedings of the 1997 American Control Conference, pages
31–36, Albuquerque, NM, June 1997.
[31] M.O. Cordier and L. Rozé. Diagnosing discrete-event systems : extending the “diagnoser
approach” to deal with telecommunication networks. Journal on Discrete Event Dynamic
Systems, 12(2):43 – 81, 2002.
[32] F. Cristian. Understanding fault-tolerant distributed systems. Comm. ACM, 34(2):57–78,
1991.
[33] J. de Kleer and J. Kurien. Fundamentals of model-based diagnosis. Proc. of the SAFEPRO-
CESS’03, 2003.
[34] J. de Kleer, A. Mackworth, and R. Reiter. Characterizing diagnoses and systems. Artificial
Intelligence, 56(2-3), 1992.
[35] R. Debouk, S. Lafortune, and D. Teneketzis. On an optimization problem in sensor selec-
tion. International Journal of Control, 12(4):417 – 445, 2002.
[36] P. Declerk and M. Staroswiecki. Characterisation of the canonical components of a struc-
tural graph for fault detection in large scale industrial plants. Proc. European Control
Conference, Grenoble, 1991.
[37] R.E. Ebert. User interface design. Prentice Hall, Englewood Cliffs, N.J., 1994.
[38] P. M. Frank. Fault diagnosis in dynamic systems using analytical and knowledge based
redundancy: a survey and some new results. Automatica, 26(3), 1990.
[39] K. Fujimoto, K. Sakurama, and T. Sugie. Trajectory tracking control of port-controlled
hamiltonian systems via generalized canonical transformations. Automatica, 39(12):2059–
2069, 2003.
[40] Takegaki G. and Arimoto S. A new feedback method for dynamic control of manipula-
tors. ASME Journal of Dynamic Systems Measurement and Control, 102, 1981.
[41] E. Garcia, F. Morant, and R. Blasco-Giménez. Centralized modular diagnosis and the
phenomenon of coupling. Proceedings of the Workshop on discrete event systems, 2002.
[42] G. Gentile, N. Rotondale, F. Filippetti, G. Franceschini, M. Martelli, and C. Tassoni. Anal-
ysis approach of induction motor stator faults to on-line diagnostics. In Proceedings of
ICEM90, 1990.
[43] G. Gentile, N. Rotondale, M. Martelli, and C. Tassoni. Harmonic analysis of induction
motors with stator faults. Electrical Machines Power Systems, (22):215 – 231, 1994.
[44] L. Gentili and A. J. van der Schaft. Regulation and input disturbance suppression for
port-controlled hamiltonian systems. 2nd IFAC Workshop LHMNLC, Seville, Spain,
2003.
[45] J. Gertler. Failure detection and diagnosis in engineering systems. Marcel Dekker, 1998.
[46] B.E. Goldberg, K. Everhart, R. Stevens, N. Babbit III, P. Clemens, and L. Stout. System
engineering toolbox for design-oriented engineers. Reference publication 1358, NASA,
1994.
[47] C. Hadjicostis. Probabilistic fault detection in finite-state machines based on state occu-
pancy measurements. Proceedings of the 41st IEEE conference on decision and control, 2002.
[48] D. Harel. Statecharts: a visual formalism for complex system. Science of computer pro-
gramming, 8:231–374, 1987.
[49] D.M. Himmelblau. Fault detection and diagnosis in chemical and petrochemical processes.
Elsevier, 1978.
[50] R. Izadi-Zamanabadi. Fault-tolerant supervisory control - system analysis and logic de-
sign. Ph.D. thesis, Department of Control Engineering, Aalborg University, 1999.
[51] P. Jalote. Fault tolerance in distributed systems. Prentice Hall, Englewood Cliffs, N.J., 1994.
[52] S. Jiang and R. Kumar. Failure diagnosis of discrete event systems with linear-time tem-
poral logic fault specifications. Proceedings of the American Control Conference, 2002.
[53] B. Johnson. Design and analysis of fault-tolerant digital systems. Addison Wesley, Reading,
Mass, USA, 1989.
[54] H. Kopetz. Real-time systems: design principles for distributed embedded applications.
Real-time systems. Kluwer academic publishers, London, 1997.
[55] R. Kumar and S. Jiang. Diagnosis of repeated failures in discrete event systems. Proceed-
ings of the 41st IEEE conference on decision and control, 2002.
[57] J.C. Laprie. Dependability: basic concepts and terminology. Springer Verlag, Vienna, Austria,
1992.
[58] P.A. Lee and T. Anderson. Fault tolerance: principles and practice. Springer Verlag, Vienna,
Austria, 1990.
[59] E. Lewis. Introduction of reliability engineering. John Wiley and Sons, 1997.
[60] F. Lin, A.F. Vaz, and W.M. Wonham. Supervisor specification and synthesis for discrete
event system. International Journal of Control, 48(1):321 – 332, 1998.
[61] A. De Luca and R. Mattone. Actuator failure detection and isolation using generalized
momenta. ICRA, Taipei, Taiwan, 2003.
[62] J. Lunze. State observation and diagnosis of discrete-event systems described by stochas-
tic automata. Journal on Discrete Event Dynamic Systems, 11(4):319 – 369, 2001.
[63] N. Lynch, R. Segala, and F. Vaandrager. Hybrid I/O automata. Information and computa-
tion, (185):105–157, 2003.
[64] U. Maier and M. Colnaric. Some basic ideas for intelligent fault tolerant control systems
design. IFAC World Congress 2002, Barcelona, 2002.
[65] M. Malek. Responsive computing. Real-time systems. Kluwer academic publishers,
London, 1994.
[66] L. Marconi, C. Bonivento, A Paoli, and R. Costi. System description. IFATIS deliverable
D6-2, August 2002.
[67] R. Marino, S. Peresada, and P. Valigi. Adaptive input-output linearizing control of
induction motors. IEEE Transactions on Automatic Control, 38(2):208 – 221, 1993.
[68] J. Mauss, V. May, and M. Tatar. Towards model-based engineering: Failure analysis with
mds. Workshop on Knowledge-Based Systems for Model-Based Engineering, European Confer-
ence on AI, 2000.
[69] M.A. Massoumnia. A geometric approach to the synthesis of failure detection filters.
IEEE Transactions on Automatic Control, 31(3), 1986.
[70] M.A. Massoumnia, G.C. Verghese, and A.S. Willsky. Failure detection and identification.
IEEE Transactions on Automatic Control, 34(3), 1989.
[71] S. Mullender. Distributed systems. Addison Wesley, Reading, Mass, USA, 1995.
[72] V.O. Nikiforov. Adaptive non-linear tracking with complete compensation of unknown
disturbance. European Journal of Control, (4):132 – 139, 1998.
[73] R. Ortega. Some applications and recent results on passivity based control. 2nd IFAC
Workshop on Lagrangian and Hamiltonian Methods for Nonlinear Control, Seville,
Spain, 2003.
[74] R. Ortega, P.J. Nicklasson, and G. Espinosa. On speed control of induction motors. Auto-
matica, 32(3):455 – 460, 1996.
[75] D.N. Pandalai and L.E. Holloway. Template languages for fault monitoring of timed
discrete event processes. IEEE Transactions on Automatic Control, 45(5):868 – 882, 2000.
[76] A. Paoli and S. Lafortune. Safe diagnosability of discrete event systems. Technical Re-
port CGR-03-02, System Science and Engineering Division, Department of Electrical En-
gineering and Computer Science, The University of Michigan, 2003.
[77] A. Paoli and S. Lafortune. Safe diagnosability of discrete event systems. Proceedings of
cdc 2003, Maui, Hawaii, 2003.
[78] A. Paoli and S. Lafortune. Safe diagnosability for fault tolerant supervision of discrete
event systems. Submitted to Automatica, March 2004.
[79] R.J. Patton. Fault-tolerant control: the 1997 situation. In Proceedings of the IFAC symposium
on fault detection and safety for technical processes, Hull, 1997.
[80] R.J. Patton, P.M. Frank, and R.N. Clark. Issues of fault diagnosis for dynamical systems.
Springer-Verlag, 2000.
[81] L. Pau. Failure diagnosis and performance monitoring. Marcel Dekker, 1981.
[82] M. Pease, R. Shostak, and L. Lamport. Reaching agreement in the presence of faults.
Journal of the ACM, 27(2):228–234, 1980.
[83] S. Peresada and A. Tonielli. High-performance robust speed-flux tracking controller for
induction motor. International Journal on Adaptive Control and Signal Processing, 14(2-3):177
– 200, 2000.
[85] C. De Persis and A. Isidori. A geometric approach to nonlinear fault detection and isola-
tion. IEEE Transactions on Automatic Control, 45(6), 2001.
[88] H.E. Rauch. Autonomous control reconfiguration. IEEE control systems magazine, 1995.
[89] M.P. Sachenbacher and R. Weber. Advances in design and implementation of obd func-
tions for diesel injection systems based on a qualitative approach to diagnosis. Proceedings
of the SAE 2000 World Congress, 2000.
[90] M. Sampath, S. Lafortune, and D. Teneketzis. Active diagnosis of discrete event systems.
IEEE Transactions on Automatic Control, 43(7):908 – 929, 1998.
[93] A. Serrani, A. Isidori, and L. Marconi. Semiglobal nonlinear output regulation with
adaptive internal model. IEEE Transactions on Automatic Control, 46(8):1178 – 1194, 2001.
[94] S. Simani, C. Fantuzzi, and R.J. Patton. Model-based fault diagnosis in dynamical systems
using identification techniques. Springer Verlag, 2002.
[95] J.A. Stankovic and K. Ramamritham. Hard real-time systems. IEEE Press, 1988.
[96] M. Staroswiecki, S. Attouche, and M.L. Assas. A graphic approach for reconfigurability
analysis. 10th Int. Workshop on principle of diagnosis, Loch Awe, 1999.
[97] R. Su and W.M. Wonham. Probabilistic reasoning in distributed diagnosis for qualitative
systems. Proceedings of the 41st IEEE conference on decision and control, 2002.
[98] K. Suyama. Reliable observer based control using vector valued decision by majority.
Proceedings of cdc 1999, USA, 1999.
[99] S. Williamson and K. Mirzoian. Analysis of cage induction motors with stator winding
faults. IEEE Transactions on Power Application Systems, 104:1838 – 1842, 1985.
[100] A. Teel. A nonlinear small gain theorem for the analysis of control systems with
saturations. IEEE Transactions on Automatic Control, 41(9):1256 – 1270, 1996.
[102] A.J. van der Schaft. L2 -gain and Passivity Techniques in Nonlinear Control. Springer-Verlag,
London, UK, 1999.
[103] P. Vas. Parameter estimation, condition monitoring and diagnosis of electrical machines. Oxford
science publications, 1994.
[104] J. C. Willems. Paradigms and puzzles in the theory of dynamical systems. IEEE Transac-
tions on automatic control, 36(3):259–294, 1991.
[106] B.C. Williams and P.P. Nayak. A model-based approach to reactive self-configuring sys-
tems. Proc. of the First National Conf. on Artificial Intelligence, 1996.
[107] S. Williamson and A. C. Smith. Steady-state analysis of three-phase cage motors with
rotor bar and end ring faults. In Proceedings Institute Electrical Engineering, 129(3):93 –
100, 1982.
[109] N. E. Wu. Reliability of fault tolerant control systems: Part i. 40th IEEE Conference on
Decision and Control, Orlando, 2001.
[110] N. E. Wu. Reliability of fault tolerant control systems: Part ii. 40th IEEE Conference on
Decision and Control, Orlando, 2001.
[111] R. I. Zamanabadi and M. Blanke. Ship propulsion system as a benchmark for fault
tolerant control. Control engineering practice, 7(2), 1999.
Index
D
damping, 184
dead time, 30
dead-zone function, 142
deadlock, 164
dependability, 27
dependability requirements, 30
dependable real-time service, 33
design fault, 36
detectability, 21
detection phase, 148
deterministic automaton, 163
development faults, 36
Diagnosability, 106
diagnosability, 21
diagnoser, 109
Diagnostic Problem, 20
diesel engines, 95
Direct digital control, 29
Discrete event system, 161
distributed system, 27
disturbances, 18
dynamic eccentricity, 137
E
Elementary cells, 88
empirical failure rate, 87
empirical reliability function, 87
empty string, 162
empty trace, 104
equivalent, 163
error, 34–37
error-containment regions, 33
error-detection latency, 30
estimator, 44
event, 161
event set, 161
event-triggered, 32
exclusion law, 17
exosystem, 130, 140
explicit FTC, 10
external faults, 36
external losses, 96
F
fail-silent failure, 35
failure, 18, 35
failure analysis tools, 48
failure free operating time, 87
failure mode, 88
failure modes, 54
Failure Modes and Effects Analysis, 85
failure rate, 87
false alarms, 86, 88
fault, 16, 35, 36
fault accommodation, 25
fault compensation unit, 140
Fault detection, 20
Fault Detection and Isolation, 9
Fault diagnosis, 16
fault diagnosis system, 9
Fault estimation, 20
Fault identification, 20
Fault isolation, 20
fault propagation, 86
Fault Propagation Analysis, 92
fault propagation analysis, 85
Fault Propagation Tree, 92
Fault Tolerant Control/Measure (FTC/M) module, 43
fault tolerant unit, 36
fault tree, 48, 54
fault-isolation, 9
Fault-Tolerant control, 16
fault-tolerant unit, 38
faulty motor, 137
faulty state, 89
filter, 9
finite escape time, 136
Finite State Machine, 102
flux subsystem, 132
forbidden timed strings, 119
form follows function, 27
FTC function, 44
FTC interfaces, 44
FTM function, 44
FTM interfaces, 45
Function monitor, 44
functional requirements, 29
functionality tree, 53
G
gateway, 33
global resource and reconfiguration manager, 47
Global RRM level reconfiguration, 51
globally asymptotically stable, 136
T
temperature sensor test, 96
temporal behavior, 118
temporal logic framework, 109
tick events, 118
time diagnosability, 118
time unfolding, 118
time-triggered, 33
timed automata, 118
timing failures, 35
torque disturbances, 187
transient error, 36
transient failures, 35
transient fault, 36
transition function, 163
Trim operation, 165
triple modular redundancy, 38
two tank system, 71
two-faced failures, 35
two-phase model, 130
Curriculum vitae