Report
Johannesburg, 2006
Declaration
I declare that this dissertation is my own, unaided work, except where other-
wise acknowledged. It is being submitted for the degree of Master of Science in
Engineering in the University of the Witwatersrand, Johannesburg. It has not
been submitted before for any degree or examination in any other university.
Abstract
To my PARENTS and LUFUNO
Acknowledgements
I wish to thank my supervisor Prof. Tshilidzi Marwala for his constant encouragement and advice throughout the course of this research. I wish to thank him for encouraging me to pursue a Master's degree and for making me value the importance of education. Thank you for being such an inspiration.
I would like to thank my family for all the support they have given me through-
out my studies. I would also like to thank Lufuno for all the encouragement
and support throughout my studies and thank you for always believing in me.
I would also like to thank the guys in the C-lab for making this year an enjoyable experience. I would also like to thank the following people for helping me with the proofreading of this thesis: Shoayb Nabbie, Thando Tetty, Shakir Mohamed, Thavi Govender and Brain Leke.
Contents
Declaration i
Abstract ii
Acknowledgements iv
Contents v
List of Figures x
Nomenclature xv
1 Introduction 1
2.3.3 Classification . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Machine Learning 25
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.5 Experimentation . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6 Conclusion 87
A Fuzzy ARTMAP 90
A.1 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.3 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
B Learn++ Algorithm 96
C Publications 99
References 100
List of Figures
4.1 Block diagram of the proposed system for the fuzzy ARTMAP . 46
List of Tables
5.2 Normalized confusion matrix for SVM for the bushing data . . . 80
5.3 Normalized confusion matrix for ENN for the bushing data . . . 81
5.5 Diagnostic results for the MLP and SVM for bushing data . . . 82
5.6 Diagnostic results for the MLP, SVM and ENN for bushing data 82
5.11 Diagnostic results for the MLP and SVM for vibration data . . 85
5.12 Diagnostic results for the MLP, SVM and ENN for vibration data 85
Nomenclature
AI Artificial Intelligence
CI Computational Intelligence
Chapter 1
Introduction
Industrial machinery has a high capital cost and its efficient use depends on low
operating and maintenance costs. To meet these requirements, condition monitoring and diagnosis of machinery have become established industry tools
[3]. Condition monitoring approaches have produced considerable savings by
reducing unplanned outage of machinery, reducing downtime for repair and
improving reliability and safety. Condition monitoring is the technique of sensing equipment health and operating information, and analyzing this information to quantify the condition of the equipment. This is done so that potential problems
can be detected and diagnosed early in their development, and corrected by
suitable recovery measures before they become severe enough to cause plant
breakdown and other serious consequences. As a result, an increasing volume
of condition monitoring data is captured and presented to engineers. This
leads to two key problems: the data volume is too large for engineers to deal
with; and the relationship between the plant item, its health and the condi-
tion monitoring data generated is not always well understood [4]. Therefore,
the extraction of meaningful information from the condition monitoring data
is challenging. Although modern monitoring systems provide operators with
immediate access to a range of raw plant data, only application domain spe-
cialists with clear diagnostic knowledge are capable of providing qualitative
interpretation of acquired data; an ability that will be lost when the special-
ists leave [5]. In addition, the number of plant specialists skilled in monitoring
processes is limited. Also, in many cases, the increasing volume of different
types of measurement data and the pressure on human experts to identify
faults quickly might lead to false conclusions. Hence, there is a need for de-
velopment of sophisticated intelligent condition monitoring systems to reduce
human dependency. A reliable, fast and automated diagnostic technique allow-
ing relatively unskilled operators to make important decisions without the need
for a condition monitoring specialist to examine data and diagnose problems
is required.
In the past decades, various effective monitoring techniques have been developed for machine monitoring and diagnosis, such as vibration monitoring, visual inspection, thermal monitoring and electrical monitoring [3]. These techniques mainly focused on how to extract the pertinent signals or features from the equipment health information. However, the related yet more important problem is how to analyze this information.
Various traditional methods have been used to process and analyze this infor-
mation. These techniques include conventional computation methods, such as
simple threshold methods, system identification and statistical methods. The
main shortcoming of these techniques is that they require a skilled specialist to
make the diagnosis. This shortcoming has led to the application of computational intelligence techniques to the problem of condition monitoring.
monitoring, it has been accepted that software modules such as intelligent agents can be used to promote extensibility and modularity of the system [7]. Many machine learning tools have been applied to the problem of condition monitoring using static machine learning structures, such as artificial neural networks and support vector machines, that are unable to accommodate new information as it becomes available [8]. However, in many real world applications
the environment changes over time and requires the learning system to track
these changes and incorporate them in its knowledge base. The first objective
of this work is to develop an incremental learning system that will ensure that
the condition monitoring system knowledge base is updated in an incremental
fashion without compromising the performance of the classifier on previously
learned information. The vast amount of data and complex processes associated with on-line monitoring have resulted in the development of complex software systems, which are often viewed as isolated, non-flexible, static software components [9, 10]. Hence, the second objective is to use intelligent agents and a multi-agent system to build a fully automated condition monitoring system.
DAI is a subfield of artificial intelligence which has, for more than a decade, been investigating knowledge models, as well as communication and reasoning techniques, that computational agents might need in order to participate in societies composed of computers. More generally, DAI is concerned with situations in which several systems interact in order to solve a common problem. There are two main areas of research in DAI: distributed problem solving (DPS) and multi-agent systems (MAS). DPS considers how the task of solving a particular problem can be divided among a number of modules that cooperate in dividing and sharing knowledge about the problem and about its evolving solution. A multi-agent system is concerned with the behavior of a collection of autonomous agents aiming at solving a given problem.
Chapter 6 summarizes the findings of the work and gives suggestions for
future research.
Appendix C lists the papers that have been published based on the work
performed in this thesis.
Figure 1.1 shows the layout of the dissertation. The reader is advised to read the dissertation sequentially; however, Chapter 4 and Chapter 5 are independent of each other and can be read in either order.
Chapter 2
Approaches to Condition
Monitoring
the monitored signal can be used to predict the need for maintenance before a
breakdown or serious deterioration occurs, or to estimate the current condition
of the machine. Condition monitoring has become increasingly important in different industries due to an increased need for normal undisturbed operation of equipment. An unexpected fault or shutdown can result in a serious
accident and financial loss for the company. Hence, utilities must find ways
to avoid failures, minimize downtime, reduce maintenance costs, and lengthen
the lifetime of their equipment.
Dissolved Gas Analysis (DGA) is one of the most popular chemical techniques used for oil-filled equipment, and is the most commonly used diagnostic technique for oil-filled machines such as transformers and bushings [24]. DGA is used to detect oil breakdown, moisture presence and partial discharge activity. The gaseous byproducts are produced by degradation of transformer and bushing oil and of solid insulation, such as paper and pressboard, which are all made of cellulose. The gases produced from the transformer and bushing operation can be listed as follows [25]:
The symptoms of faults are classified into four main groups: corona, low energy discharge, high energy discharge and thermal. The quantity and types of gases reflect the nature and extent of the stressed mechanism in the bushing. Oil breakdown is shown by the presence of hydrogen, methane, ethane, ethylene and acetylene, while high levels of hydrogen show that the degeneration is due to corona. High levels of acetylene occur in the presence of arcing at high temperatures. Methane and ethane are produced from low temperature thermal heating of oil, while high temperature thermal heating produces ethylene and hydrogen as well as methane and ethane. Low temperature thermal degradation of cellulose produces carbon dioxide and high temperature degradation produces carbon monoxide [24].
Existing diagnostic approaches for power transformers, which are based on the
dissolved gas information, can be divided into two categories: conventional approaches and artificial intelligence techniques.
Conventional Approaches
Single AI Approaches
in [30]. Training samples were taken from post-mortem data and were carefully
selected so that various operating conditions were represented. It was reported
in [31] that more accurate diagnoses can be obtained compared to the Rogers Ratios and Dörnenburg Ratios. Furthermore, two separate ANN were also used
in [32] for fault diagnosis and detection of cellulose degradation. It was found
that carbon dioxide and carbon monoxide are not needed as inputs for fault
diagnosis, and a single output that indicates whether cellulose was involved in
a fault is sufficient for the detection of cellulose degradation [31]. The same
authors also reported in the later publication [32] that higher diagnosis accu-
racy could be achieved if gas generation rates were included as inputs to the
ANN. While the aforementioned ANN-based approaches utilize actual DGA
data as training inputs, there are also other ANN-based approaches which
rely on conventional DGA interpretation schemes for generating the training
outputs. An example of such approaches can be found in [36], where key gas
concentrations, IEC Ratios, and Rogers Ratios were used to generate training
outputs for three independently trained ANN; fault diagnoses as given by these
ANN were combined to arrive at a final decision. The ANN-based approach suggested in [33] relies entirely on a conventional interpretation scheme, where 13 characteristic patterns of gaseous composition were used as inputs to the ANN, which was trained to detect different types of fault as specified in the Japanese ECRA method. Unsupervised ANN were also implemented for the analysis of dissolved gas data. Specifically, a self-organizing map was applied for exploratory analysis of historical DGA data [35]. It was reported that interesting and comprehensive patterns were unearthed, which could be associated with certain incipient faults in power transformers. Kernel-based approaches
such as Support Vector Machine (SVM) were also utilized for bushing fault
diagnosis [25]. The SVM approaches suggested in [25] rely on the conventional
DGA interpretation schemes for generating the training outputs.
Hybrid AI Approaches
value, crest factor analysis, kurtosis analysis, shock pulse counting, the time series averaging method, the signal enveloping method and many more. In this study, Mel-frequency Cepstral Coefficients (MFCCs) and statistical features are used.

MFCCs have been widely used in the field of speech recognition and have managed to represent the dynamic features of a signal, as they extract both linear and non-linear properties [41]. MFCCs can be a useful feature extraction tool for vibration signals, as vibrations contain both linear and non-linear features. The MFCC is a type of wavelet in which frequency scales are placed on a linear scale for frequencies less than 1 kHz and on a log scale for frequencies above 1 kHz [41]. The complex cepstral coefficients obtained from this scale are called the MFCCs [41]. The MFCCs contain both time and frequency information of the signal, and this makes them more useful for feature extraction. The following steps are involved in the MFCC computation:
$\mathrm{mel} = 2595 \times \log_{10}\left(1 + \frac{f_{\mathrm{Hz}}}{700}\right)$ (2.3)
3. The final step converts the logarithmic Mel spectrum back to the time domain. The result of this step is what is called the Mel-frequency Cepstral Coefficients. This conversion is achieved by taking the Discrete Cosine Transform of the spectrum as:

$C_m = \sum_{n=0}^{F-1} \cos\left(\frac{\pi m (n + 0.5)}{F}\right) \log_{10}(H_n)$ (2.4)
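To make the computation concrete, the sketch below implements equations 2.3 and 2.4 directly. It is a minimal illustration only: it assumes the Mel filterbank energies $H_n$ have already been computed from the windowed power spectrum (framing and filterbank construction are omitted), and the function and variable names are illustrative rather than taken from this work.

```python
import numpy as np

def hz_to_mel(f_hz):
    # Equation 2.3: map frequency in Hz onto the Mel scale
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mfcc_from_filterbank(mel_energies, num_coeffs=13):
    # Equation 2.4: DCT of the log Mel spectrum gives the cepstral coefficients
    H = np.asarray(mel_energies, dtype=float)
    F = len(H)
    n = np.arange(F)
    log_H = np.log10(H)
    return np.array([
        np.sum(np.cos(np.pi * m * (n + 0.5) / F) * log_H)
        for m in range(num_coeffs)
    ])

# Example: 24 filterbank channels from one vibration window (placeholder values)
energies = np.abs(np.random.randn(24)) + 1e-6
coeffs = mfcc_from_filterbank(energies, num_coeffs=13)
```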
Statistical Features

Basic statistical features such as the mean, root mean square, variance ($\sigma^2$), skewness (normalized 3rd central moment) and kurtosis (normalized 4th central moment) are implemented to obtain the signature of faults. The root mean square value contains all the energy in the signal, and therefore also all the noise and all the elements that depend on the cutting process. It is therefore not the most effective parameter, but it has retained its place because it is so easy to produce and understand [14]. There is a need to deal with the occasional spiking of vibration data, which is caused by some types of faults, and kurtosis is used to achieve this task. The success of kurtosis in signals is based on the fact that signals of a system under stress or having defects differ from those of a normal system. The sharpness or spiking of the vibration signal changes when there are defects in the system. Kurtosis is a measure of this sharpness.
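A minimal sketch of these five features, assuming a one-dimensional vibration window `x` with non-constant values; names are illustrative:

```python
import numpy as np

def statistical_features(x):
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    rms = np.sqrt(np.mean(x ** 2))       # carries the total signal energy
    var = x.var()
    std = x.std()                        # assumes a non-constant signal
    centered = x - mean
    skewness = np.mean(centered ** 3) / std ** 3   # normalized 3rd central moment
    kurtosis = np.mean(centered ** 4) / std ** 4   # normalized 4th central moment
    return np.array([mean, rms, var, skewness, kurtosis])
```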
2.3.3 Classification
Measurements from the system can be taken every few seconds. This leads to an overwhelming volume of data per item of equipment to be interpreted by engineers. When this is multiplied by the number of items of equipment to be monitored, the problem of data overload becomes insurmountable in terms of manual data interpretation. The second problem is the limited base of experts able to interpret the complex condition monitoring data. The third issue is that, most of
the time, more than one technology is used to effectively monitor the condition of a machine [7]. Thus, there is a longer-term requirement for the integration of further monitoring technologies.
Based on the condition monitoring problems identified above, the requirements of an online condition monitoring system are [7]:
• The system must capture and condition the relevant data automatically.
• It should have the capacity to learn the typical plant behavior over a
period of time and then use this to indicate when anomalies and de-
fects arise and provide clear and concise defect information and remedial
advice to the operation engineer.
and research results can be integrated into the relevant agents without any redesign or modification of the condition monitoring system. This suggests that each of the required functions should be stand-alone, with the ability to cooperate and exchange information as required. In Chapter 5, it is demonstrated how the above requirements can be met using agents and a multi-agent system.
Chapter 3
Machine Learning
Machine learning has been used extensively in the area of condition monitoring
systems as described in the previous chapter. In this chapter, the theoretical
foundation of various machine learning tools such as Artificial Neural Networks
(ANN), Support Vector Machine (SVM), Extension Neural Network (ENN)
and Fuzzy ARTMAP (FAM) are briefly presented. As there are numerous different types of ANN, emphasis is given to the variant used in this study, the Multi-Layer Perceptron (MLP). The design process for an ensemble of classifiers is outlined, and the advantages of the ensemble approach are also discussed.
In all three cases, though, this learning can be represented mathematically: for some input signal $x(t) \in \mathbb{R}^n$ and an output $y(t) \in \mathbb{R}^m$ at some time step $t \in \mathbb{N}$, the system learns a mapping $F : \mathbb{R}^n \supset X \to Y \subset \mathbb{R}^m$, or simply $x \mapsto y$ [39]. The mapping $F$ is usually a function defined by some matrix of weight values.
ANNs are powerful data processing systems that are able to learn complex
input-output relationships from data. A typical ANN consists of a large num-
ber of simple processing elements called neurons that are highly interconnected
in an architecture that is loosely based on the structure of biological neurons
in the human or animal brain. A few attributes of interest are learning ability,
parallelism, distributed representation and computation, fault tolerance and
generalization ability. When used for pattern classification, the ANN performs
a nonlinear mapping function producing an output that indicates membership
of an input vector to a specific pattern class. The ANN learns this map-
ping function from training data and, if trained correctly, is able to generalize
on new data. This ability to learn from examples is useful for classification
Multi-layer Perceptron
$y_k = f(X; W)$ (3.1)
In most cases, ANNs consist of simpler processing units called artificial neurons
arranged according to some topology [1]. Each input to the neuron is multiplied
by an adjustable weight parameter. The sum of all the weighted inputs is called
the activation of the neuron and is represented by
$a_j = \sum_{i=0}^{N} w_{ij} x_i$ (3.2)
The input x0 is a special input, called a bias, that is always equal to 1. The
activation of the neuron becomes the input to a nonlinear activation function
which returns the output of the neuron
$z_j = f(a_j)$ (3.3)
For an MLP architecture with one neuron in the output layer, M neurons in the hidden layer, and N inputs in the input layer, the response of the jth neuron in the hidden layer can be derived from equations 3.2 and 3.3 as

$y_j = f_{\mathrm{hidden}}\left(\sum_{i=0}^{N} w_{ij}^{1} x_i\right)$ (3.4)

where $w_{ij}^{1}$ is the weight between input $x_i$ and hidden neuron $j$, $w_{0j}^{1}$ is the weight of the bias input to neuron $j$, and $f_{\mathrm{hidden}}(\cdot)$ is the activation function. The representational capability of the MLP refers to the range of mappings that the MLP can implement as its weight parameters are varied. This indicates whether the MLP is capable of implementing a desired mapping function. Previous work on this subject has shown that a sufficiently large network with a sigmoidal activation function and one hidden layer of neurons is able to approximate any function with arbitrary accuracy [1]. This universal approximation capability proves that an MLP with one hidden layer is theoretically able to implement any required mapping function. When used for classification, the ANN is required to function as a statistical model of the process from which the input data is generated. In order to realize this model, two problems have to be solved.
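As an illustration of equations 3.2 to 3.4, the following sketch computes the forward pass of a one-hidden-layer MLP. The weight shapes and the tanh activation are assumptions made for the example, not the configuration used in this work:

```python
import numpy as np

def mlp_forward(x, W_hidden, w_out, f=np.tanh):
    """One-hidden-layer MLP. W_hidden has shape (M, N + 1), w_out (M + 1,)."""
    x = np.concatenate(([1.0], x))      # x0 = 1 is the bias input
    a = W_hidden @ x                    # Equation 3.2: a_j = sum_i w_ij x_i
    z = f(a)                            # Equation 3.3: z_j = f(a_j)
    z = np.concatenate(([1.0], z))      # bias input for the output neuron
    return w_out @ z                    # single output neuron

# Example with N = 4 inputs and M = 3 hidden neurons (random weights)
rng = np.random.default_rng(0)
y = mlp_forward(rng.normal(size=4), rng.normal(size=(3, 5)), rng.normal(size=4))
```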
SVMs were introduced by Vapnik in the late 1960s on the foundation of statistical learning theory. However, it was only from the middle of the 1990s, with the greater availability of computing power, that SVM algorithms started emerging in practice, paving the way for numerous applications [2, 43, 44]. The basic SVM deals with two-class problems in which the data are separated by a hyperplane defined by a number of support vectors. The SVM can be considered to create a line or hyperplane between two sets of data for classification. In the two-dimensional case, the action of the SVM can be explained easily without any loss of generality. In Figure 3.2, a series of points for two different classes of data are shown; circles represent class A and squares represent class B.
The SVM attempts to place a linear boundary (solid line) between the two
different classes, and orient it in such a way that the margin (represented
by dotted lines) is maximized. In other words, the SVM tries to orient the boundary such that the distance between the boundary and the nearest data point in each class is maximal. The boundary is then placed in the middle of this margin between the two points. The nearest data points are used to define the margins and are known as support vectors (represented by the gray circle and square). Once the support vectors are selected, the rest of the feature set can be discarded, since the Support Vectors (SVs) contain all the necessary information for the classifier. The boundary can be expressed as

$(w \cdot x) + b = 0, \quad w \in \mathbb{R}^N,\ b \in \mathbb{R}$ (3.7)

where the vector $w$ defines the boundary, $x$ is the input vector of dimension $N$ and $b$ is a scalar threshold. At the margins, where the SVs are located, the equations for classes A and B, respectively, are

$(w \cdot x) + b = 1$ (3.8)

and

$(w \cdot x) + b = -1$ (3.9)
As SVs correspond to the extremities of the data for a given class, the following
decision function holds good for all data points belonging to either A or B:

$y_i\left[(w \cdot x_i) + b\right] \geq 1, \quad i = 1, \dots, l$ (3.10)

where $l$ is the number of training sets. The solution of the constrained quadratic
programming optimization problem can be obtained as
$w = \sum_i v_i x_i$ (3.13)
where xi are SVs obtained from training. Putting equation 3.13 in equation
3.10 the decision function is obtained as follows:
$f(x) = \mathrm{sign}\left(\sum_i v_i (x \cdot x_i) + b\right)$ (3.14)
In cases where a linear boundary in the input space is not enough to separate the two classes properly, it is possible to create a hyperplane that allows linear separation in a higher dimension (corresponding to a curved surface in the lower-dimensional input space). In SVMs, this is achieved through the use of a transformation that converts the data from the N-dimensional input space to a Q-dimensional feature space: $s = \phi(x)$, where $x \in \mathbb{R}^N$ and $s \in \mathbb{R}^Q$. Figure 3.3 shows the transformation from input space to feature space, where a nonlinear boundary has been transformed into a linear boundary in feature space.
$f(x) = \mathrm{sign}\left(\sum_i v_i \left(\phi(x) \cdot \phi(x_i)\right) + b\right)$ (3.15)
$f(x) = \mathrm{sign}\left(\sum_i v_i\, K(x, x_i) + b\right)$ (3.17)
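For readers who want to experiment, the sketch below trains a two-class SVM with an RBF kernel using scikit-learn; scikit-learn is not used in this work, the data here are random placeholders, and the kernel parameters are illustrative (in scikit-learn the RBF width enters as gamma, which corresponds to $1/(2\sigma^2)$):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))       # placeholder feature vectors
y_train = rng.integers(0, 2, size=100)     # two-class labels (A = 0, B = 1)

clf = SVC(kernel="rbf", gamma=0.1, C=1.0)  # gamma ~ 1 / (2 * sigma**2)
clf.fit(X_train, y_train)

# only the support vectors are needed to define the decision boundary
print(clf.support_vectors_.shape)
print(clf.predict(rng.normal(size=(5, 10))))
```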
The width of the RBF kernel parameter (σ) can be determined in general by an
iterative process selecting an optimum value based on the full feature set. In
case there is an overlap between the classes with nonseparable data, the range
The ENN comprises an input layer and an output layer. The input layer nodes receive an input feature pattern and use a set of weight parameters to generate an image of the input pattern. There are two connection weights between input nodes and output nodes: one connection represents the lower bound of the classical domain of the features, and the other represents the upper bound. The entire network is thus represented by a matrix of weights for the upper and lower limits of the features for each class, $W^U$ and $W^L$. A third matrix representing the cluster centers is also defined as [46]:
$z = \frac{W^U + W^L}{2}$ (3.19)
The ENN uses supervised learning, which tunes the weights of the ENN to achieve good clustering performance or to minimize the clustering error. The network
where $x_i$ is the $i$th training pattern, $\beta$ is the learning rate, and $w$ can be either the upper or the lower weight matrix of the network centers. It can be shown that for $t$ training patterns of a particular class $c$, the weight is given by:

$w^c(t) = (1 - \beta)^t w^c(0) - \beta \sum_{i=1}^{t} (1 - \beta)^{t-i} x_i^c$ (3.21)
This equation demonstrates how each training pattern reinforces the learning in the network, with the most recent signal $x_t^c$ determining only a fraction of the current value of $z_t^c$. The equation indicates that there is no convergence of the weight values, since the learning process is adaptive and reinforcing.
Equation 3.21 also highlights the importance of the learning rate $\beta$. Small values of $\beta$ require many training epochs, whereas large values may result in oscillatory behavior of the network weights, resulting in poor classification performance.
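A minimal sketch of the recursive update that Equation 3.21 unrolls, assuming an exponential-moving-average form with learning rate β (the sign convention of the full update rule is not shown in this extract, so this form is an assumption):

```python
import numpy as np

def enn_weight_update(w, x, beta):
    # Recursive form: each new pattern x determines a fraction beta of the
    # weight (assumed moving average; Equation 3.21 gives the unrolled form).
    return (1.0 - beta) * w + beta * x

def cluster_center(w_upper, w_lower):
    # Equation 3.19: z = (W^U + W^L) / 2
    return (w_upper + w_lower) / 2.0
```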
• Training rate, $\beta \in [0, 1]$, which controls the speed of network adaptation: $\beta = 1$ permits the system to adapt fastest, while $0 < \beta < 1$ makes the system adapt more slowly.
The ARTMAP network couples the processing of two ART networks, $ART_a$ and $ART_b$; after resonance is confirmed in each network, $J$ is the active category for the $ART_a$ network and $K$ is the active category for the $ART_b$ network. The next step is to verify, using match tracking, whether the active category on $ART_a$ corresponds to the desired output vector presented on $ART_b$. The vigilance criterion is given by [47]:
$\frac{\left|y^b \wedge w_{JK}^{ab}\right|}{\left|y^b\right|} \geq \rho_{ab}$ (3.22)
Match tracking works by incrementing the vigilance parameter just enough to exclude the current category and select another category, which becomes active and is
reintroduced into the process until the active category corresponds to the desired output. After the input has reached the resonance state through the vigilance criterion, the weight adaptation is effected. The adaptation of the $ART_a$ and $ART_b$ module weights is given by [47]:
Modeling using neural networks often involves trying multiple networks with different architectures and training parameters in order to achieve acceptable model accuracy. Selection of the best network is based on the performance of the network on an independent validation or test set; only the best performing network is kept and the rest are discarded. There are two disadvantages with such an approach: first, all the effort involved in training the remaining neural networks is wasted; second, the generalization performance of the resulting system is reduced. Ensemble approaches typically aim at improving the classifier accuracy on a complex classification problem through a divide-and-conquer approach. Perrone and Cooper [48] proved mathematically that an ensemble method can be used to improve the classification ability of a single classifier. They argue that the ensemble method has the following advantages:
1. It efficiently uses all the networks in the population - none of the networks needs to be discarded.
given by [52]:

$\rho_n = \frac{n N^F}{N - N^F - N^R + n N^F}$ (3.24)
where $N^F$ is the number of samples which are misclassified by all classifiers, $N^R$ is the number of samples which are classified correctly by all classifiers, and $N$ is the total number of test samples. It should be noted that the correlation calculated is over the outcomes of the classifiers. Generally, a smaller correlation degree leads to better performance of the classifier fusion, because independent classifiers can give more effective information.
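The sketch below evaluates Equation 3.24 from a boolean outcome matrix over a validation set; the matrix layout and the guard for the degenerate case are assumptions made for illustration:

```python
import numpy as np

def correlation_degree(correct):
    """Equation 3.24. `correct` is an (n, N) boolean matrix: entry (k, i) is
    True when classifier k classifies validation sample i correctly."""
    n, N = correct.shape
    NF = int(np.sum(~correct.any(axis=0)))   # misclassified by all classifiers
    NR = int(np.sum(correct.all(axis=0)))    # classified correctly by all
    denom = N - NF - NR + n * NF
    return (n * NF) / denom if denom else 0.0  # degenerate when all agree
```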
Chapter 4
The problem of machine fault diagnosis has been an ongoing research area in various industries. Many machine learning tools have been applied to this problem using static machine learning structures, such as neural networks and support vector machines, that are unable to accommodate new information into their existing models as it becomes available. This chapter introduces the incremental learning approach to the problem of condition monitoring. The chapter starts by giving a brief definition of incremental learning. Two incremental learning techniques are then applied to the problem of condition monitoring. The first method uses the incremental learning ability of Fuzzy ARTMAP (FAM) and explores whether an ensemble approach can improve the performance of the FAM. The second technique, Learn++, uses an ensemble of MLP classifiers. Lastly, the two methods are compared with each other.
trained network. They update the weights for the additional instance in such a
way that the influence of the weight update on previously trained instances is
minimum, which is satisfied by minimizing a weight-sensitivity cost function.
Fu et al. [64, 65] introduced an incremental backpropagation learning network
that is capable of learning new data in the absence of old data. However,
this system, also based on learning new instances through minor modification
of current weights by putting a bound on weight modification, is not able to
learn new classes. Various methods have been proposed for incremental SVM
learning in the literature [66, 67].
The proposed system is implemented using the fuzzy ARTMAP. An overview of the proposed system is shown in Figure 4.1. A population of classifiers is used to introduce classification diversity. A bad team of classifiers may even lead to worse performance than the best single classifier, due to the fusion of erroneous decisions. A classifier selection process is therefore executed using the correlation measure. After the classifier selection, a majority voting fusion algorithm is employed for the final decision.
Data Preprocessing
Figure 4.1: Block diagram of the proposed system for the fuzzy ARTMAP
Creation of an Ensemble
data are created. This permutation is needed in order to create diversity among the classifiers being added, since FAM learns in an instance-based fashion, which makes the order in which the training patterns are received an important factor. From the created population, the best performing classifier is selected based on classification accuracy. The correlation degree between the best classifier and all the members of the population is then calculated using Equation 3.24. From the unselected classifiers, the classifier with the lowest correlation is selected for fusion. This is repeated until all the classifiers have been selected. Lastly, the selected classifiers are fused using majority voting to give the final decision. The algorithm for this system is shown in Figure 4.2.
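A sketch of this selection-and-fusion procedure is given below; it assumes each classifier's validation outcomes are available as a boolean matrix, reuses the pairwise form of Equation 3.24, and finishes with majority voting (all names are illustrative):

```python
import numpy as np

def pairwise_rho(c1, c2):
    # Equation 3.24 for n = 2 classifiers, from boolean correctness vectors
    correct = np.vstack([c1, c2])
    n, N = correct.shape
    NF = int(np.sum(~correct.any(axis=0)))
    NR = int(np.sum(correct.all(axis=0)))
    denom = N - NF - NR + n * NF
    return (n * NF) / denom if denom else 0.0

def order_by_correlation(correct, accuracies):
    """Start from the most accurate classifier, then repeatedly pick the
    remaining classifier least correlated with the best one."""
    best = int(np.argmax(accuracies))
    selected = [best]
    remaining = [i for i in range(len(accuracies)) if i != best]
    while remaining:
        rhos = [pairwise_rho(correct[best], correct[i]) for i in remaining]
        pick = remaining[int(np.argmin(rhos))]
        selected.append(pick)
        remaining.remove(pick)
    return selected

def majority_vote(predictions):
    # predictions: (K, N) integer class labels from the fused classifiers
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, predictions)
```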
The available data was divided into three data sets: training, validation and testing. The validation data set is used to evaluate the performance of each classifier in the initial population, and this performance is used to select the classifiers for the ensemble. The testing data set is used to evaluate the performance of the system on unseen data. The first experiment compares the performance of fuzzy ARTMAP with the MLP, SVM and ENN. The second experiment shows that an ensemble of classifiers improves the classification accuracy compared to a single classifier. The third experiment evaluates and compares the incremental capability of a single FAM and the ensemble of FAM on new classes. The last experiment uses an ensemble of FAM to compare the performance of FAM with that of Learn++.
Bushing Data
As previously mentioned, the gases produced from the transformer and bushing operation can be listed as follows [25]:
TRAINING PHASE
3. Find the best performing classifier to be the first classifier of the ensemble.

4. Calculate the correlation degree between the first classifier and each of the other classifiers using Equation 3.24.

5. Select the classifier with the lowest correlation for fusion.

6. Repeat steps 4 and 5 over the classifiers yet to be selected until all the classifiers are determined.
OPERATION PHASE
Table 4.1 shows that FAM gave slightly better results than MLP and SVM.
However, ENN gave slightly better results than the fuzzy ARTMAP.
The results for different numbers of classifiers are shown in Table 4.3. Figure
4.3 shows the effect of classifier selection. It further shows that the optimal
result is achieved with six classifiers; hence, six classifiers are chosen to form the ensemble.
Table 4.3: Results of the FAM classifiers created to be used in the ensemble
for the bushing data
Table 4.4 shows the classification accuracy for both the best classifier and the ensemble. The table shows that the ensemble of fuzzy ARTMAP performs better than the best fuzzy ARTMAP classifier.
Table 4.4: Comparison of classification accuracy for the best classifier and
ensemble for bushing data
and 98% for the ensemble. This simple experiment shows that the system
is able to learn new classes, while still preserving existing knowledge of the
system. The small change in the accuracy of the three classes is due to the
tradeoff between stability and plasticity of the classifier during training. The
learning ability of the system on a further fifth class is shown in Figure 4.4.
Figure 4.4: Incremental learning of the best classifier on the fifth class
Figure 4.4 shows that, as more training examples are added, the ability of the system to correctly classify the data generally increases, as shown by the increase in classification accuracy.
sifier and the ensemble is 95% and 96.33%, respectively. The best classifier
gave a classification accuracy of 90.20% while the ensemble gave a classifica-
tion accuracy of 89.67% on the original test data with three classes. Again,
the change in the accuracy of the three classes is due to the tradeoff between
stability and plasticity of the classifier during training.
Incremental Learning with the Ensemble of FAM

Table 4.5 shows the distribution of the classes in the various databases $D_j$, indicated by the first column. Table 4.6 shows the results of the ensemble of FAM; the first row shows the training cycles $T_i$, $i = 1, \dots, 4$, and the first column indicates the databases $D_j$.
Table 4.6: Classification results (%) of the ensemble of FAM on the bushing data

      T1   T2   T3   T4
D1    100  100  100  100
D2    -    100  100  100
D3    -    -    100  100
D4    -    -    -    100
Test  48   68   79   93
achieved only when all training data are correctly classified. Furthermore, once
a pattern is learned, a particular cluster is assigned to it, and future training
does not alter this clustering. Therefore, ARTMAP never forgets what it has
seen as a training data instance. Table 4.6 shows that FAM gave a classification accuracy of 48%, which improved to 93%; this shows that FAM is able to accommodate new classes without forgetting previously learned information.
Vibration Data
The investigation in this study is based on data obtained from Case Western Reserve University [68]. The experimental setup comprised a Reliance Electric 2 HP IQPreAlert motor connected to a dynamometer. Faults of size 0.007,
0.014, 0.021 and 0.028 inches were introduced into the drive-end bearing of
a motor using the Electric Discharge Machining method. These faults were
introduced separately at the inner raceway, rolling element and outer race-
way. An impulsive force was applied to the motor shaft and the resulting
vibration was measured using two accelerometers, one mounted on the motor
housing and the other on the outer race of the drive-end bearing. All signals
were recorded at a sampling frequency of 12 kHz. The statistical features
mentioned in Chapter 2 were used for feature extraction.
The results for the different classifiers trained with different permutations of the data are shown in Table 4.8. Figure 4.5 shows the effect of classifier selection. The figure shows that classifier selection gives an accuracy of 99.92% with seven classifiers, while the classifiers combined without selection gave 100% accuracy with four classifiers. In this case, classifier selection is not effective due to the high correlation of the classifiers; this might be because the extracted features are highly correlated, so the permutation of the data did not have a large impact. It was therefore decided to use the classifiers without correlation-based selection. The good performance of the classifiers may be attributed to the fact that the feature extraction techniques used were able to discriminate between the different conditions, making it easy for the classifiers to separate them. Table 4.10 shows the classification accuracy for both the best classifier and the ensemble. The table shows that the ensemble of fuzzy ARTMAP gave similar results to the best fuzzy ARTMAP classifier. This is expected: since the single fuzzy ARTMAP already gave 100% accuracy, the ensemble of FAM could not do better.
Table 4.9: Results of the FAM classifiers created to be used in the ensemble
for the vibration data
Classifier Validation Accuracy(%) Correlation (ρ)
1 100 Best Classifier
2 100 1.00
3 100 1.00
4 100 1.00
5 99.5 0.96
6 99.5 0.96
7 99.5 0.96
8 99.5 0.96
9 99 0.94
10 99 0.94
11 99 0.94
12 99 0.94
Table 4.10: Comparison of classification accuracy for the best classifier and ensemble for the vibration data
Table 4.11: Distribution of the four bearing classes across the different databases for FAM
T1 T2 T3
D1 100 100 100
D2 - 100 100
D3 - - 100
Test 49 73 91
weighted majority voting scheme, where the voting weight for each classifier in
the system is determined using the performance of that particular classifier on
the entire set of training data used for the current increment. The Learn++
algorithm is shown in Figure 4.6 and more details on the algorithm can be
found in Appendix B.
on both the training and testing data. If $\varepsilon_t > 1/2$, set $t = t - 1$, discard $h_t$ and go to step 2. Otherwise, compute the normalized error as $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$, and obtain the composite hypothesis by weighted majority voting:

$H_t = \arg\max_{y} \sum_{t:\, h_t(x) = y} \log(1/\beta_t)$

Compute the error of the composite hypothesis:

$E_t = \sum_{t:\, H_t(x_i) \neq y_i} D_t(i) = \sum_{i=1}^{m} D_t(i)\left[\left|H_t(x_i) \neq y_i\right|\right]$

The final hypothesis is:

$H_{\mathrm{final}}(x) = \arg\max_{y} \sum_{k=1}^{K} \sum_{t:\, h_t(x) = y} \log(1/\beta_t)$
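As a concrete illustration of the composite and final hypotheses above, the sketch below performs the log(1/β)-weighted vote over a set of trained weak hypotheses; the callable interface is an assumption made for the example:

```python
import numpy as np

def weighted_majority_vote(hypotheses, betas, x, num_classes):
    """Learn++-style vote: each weak hypothesis h_t votes for its predicted
    class with weight log(1 / beta_t); the class with the largest total wins."""
    votes = np.zeros(num_classes)
    for h, beta in zip(hypotheses, betas):
        votes[h(x)] += np.log(1.0 / beta)
    return int(np.argmax(votes))

# Example: three weak hypotheses as simple placeholder callables
hyps = [lambda x: 0, lambda x: 1, lambda x: 1]
print(weighted_majority_vote(hyps, betas=[0.4, 0.3, 0.2], x=None, num_classes=2))
```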
and constants $\varepsilon_0 < 1/2$ and $\delta_0 < 1$ such that, for every concept $c \in \mathcal{C}$ and for every distribution $D$ on the instance space $\mathcal{X}$, the algorithm, given access to an example set drawn from $(c, D)$, returns with probability at least $1 - \delta_0$ a hypothesis $h \in H$ with error $\mathrm{error}_D(h) \leq \varepsilon_0$ [69].
It should be noted that, unlike strong learning, weak learning imposes the least stringent conditions possible, since the learner is required to perform only slightly better than chance.
Bushing Data
First, the ten variables from the DGA data undergo a min-max normalization. The normalization is a requirement for training an MLP, as it ensures that the data lie within a similar range. Normalization is done using equation 4.1.
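Equation 4.1 itself is not reproduced in this extract; the sketch below shows a standard column-wise min-max normalization to [0, 1], which is assumed here to match its intent:

```python
import numpy as np

def min_max_normalize(X):
    """Scale each of the ten DGA variables (columns of X) into [0, 1]."""
    X = np.asarray(X, dtype=float)
    xmin = X.min(axis=0)
    xmax = X.max(axis=0)
    return (X - xmin) / (xmax - xmin)   # assumes xmax > xmin for every column
```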
For the first experiment, the training data was divided into four databases, each with 300 training instances. In each training session, Learn++ is provided with each database and generates 12 classifiers. The base learner is an MLP with 30 hidden layer neurons and 100 training cycles. To ensure that the method retains previously seen data, the previous databases are tested in each training session. Table 4.13 presents the results: the first four rows indicate the classification performance of Learn++ on each of the databases after each training session, while the last row shows the generalization capability on the testing data. This demonstrates the performance improvement of Learn++, as inherited from AdaBoost, on a single database. Table 4.13 further shows that the classifiers' performance on the testing dataset gradually improved from 65.7% to 95.8% as new databases became available.
Table 4.13: Classification performance (%) of Learn++ on the bushing data

      T1    T2    T3    T4
D1    89.5  85.8  83    86.9
D2    -     91.5  94.2  92.9
D3    -     -     96.5  90.46
D4    -     -     -     98
Test  65.7  79.0  85    95.8
For the second experiment, the training data was divided into the databases whose class distribution is shown in Table 4.5. In each training session, Learn++ is provided with each database and generates a specified number of hypotheses, which is indicated by the number inside the brackets in Table 4.14. Table 4.14 and Figure 4.8 show that the classifiers' performance increases from 49% to 95.67% as new classes are introduced in the subsequent training datasets.
An MLP was trained with the same set of training examples as Learn++ and tested with the same validation data. The training data consisted of all the data in the three databases, and an MLP with 12 hidden layer units was trained. The MLP gave a classification rate of 100% when tested on the same testing data as Learn++. This shows that the classification accuracy of Learn++ is comparable with that of an MLP trained using batch learning.
Figure 4.8: Incremental capability of Learn++ on new classes for bushing data
Vibration Data
The first stage of bearing fault detection and diagnosis is signal pre-processing and feature extraction. The signal is first pre-processed by dividing the vibration signals into T windows, where T is the number of windows. The pre-processing is followed by extraction of features from each window using MFCCs. The optimal number of MFCCs was chosen empirically and found to be 13. A detailed explanation of the MFCC can be found in [41].
The large variations of the vibration signal make direct comparison of the signals difficult. Hence, non-linear pattern classification methods
For the first experiment, the training data was divided into three databases, each with 30 training instances. In each training session, Learn++ is provided with each database and generates 15 hypotheses. The base learner is an MLP with 60 hidden layer neurons and 50 training cycles. To ensure that the method retains previously seen data, the previous databases are tested in each training session, as before. Table 4.15 shows that the system's performance on the testing dataset gradually improved from 79% to 100% as new databases became available, thereby demonstrating the incremental learning capability of Learn++.
Table 4.15: Classification performance (%) of Learn++ on the vibration data

      T1   T2    T3
D1    100  98    94
D2    -    99.5  96
D3    -    -     97.5
Test  79   87.6  100
For the second experiment, the training data is divided into three databases, where a new class is introduced with each database; Table 4.11 shows the distribution of the data. In each training session, Learn++ is provided with each database and generates the specified number of hypotheses, which is given by the number in brackets. Table 4.16 and Figure 4.9 show that the classifiers' performance increases from 56.92% to 92.23% as new classes are introduced in the subsequent training datasets. Learn++ had to generate 42 classifiers to achieve this performance.
sifiers. Learn++ uses weighted majority voting, where each classifier receives a voting weight based on its training performance. This works well in practice for most applications. However, for incremental learning problems that involve the introduction of new classes, the voting scheme proves to be unfair towards the newly introduced class. Since none of the previously generated classifiers can pick the new class, a relatively large number of new classifiers that recognize the new class are needed, so that their total weight can out-vote the first batch of classifiers on instances of the new class. This in turn populates the ensemble with an unnecessarily large number of classifiers. When incrementing the system with new classes, it is best to ensure that the number of classifiers added to the system is greater than the number of classifiers added during the previous system increment. It is also better to include some data from classes that have previously been seen. This ensures that if any pattern is classified into one of the new classes, the votes from the previous classifiers do not 'outvote' the votes received from the new classifiers. The major disadvantage of Learn++ is that it is computationally expensive: to allow incremental learning of new classes, many classifiers had to be generated, while the same performance was obtained from a single classifier trained in batch mode.
4.5 Summary
Chapter 5
Engineers have introduced better decision support systems for condition monitoring procedures through applications of centralized intelligent systems using a variety of artificial intelligence (AI) techniques [7]. It is now widely recognized that problems due to the complexity of condition monitoring systems can be overcome with architectures that contain many dynamically interacting intelligent distributed modules, called intelligent agents [9]. Each agent is an autonomous system that processes a selection of the input, and the complete interpretation and diagnosis is accomplished through interaction with other agents. This chapter looks at the advantages of using multi-agent techniques and how they can be used for condition monitoring. The chapter also presents the conceptual framework of the multi-agent system and describes the analysis and design, including the information flow between agents needed to replicate the diagnostic tasks performed by engineers.
There are several motivations for using a multi-agent system. First, a MAS can solve problems that are too large for a centralized single agent, whether because of resource limitations or the sheer risk of having one centralized system. Second, modularity is enhanced, which reduces complexity; speed is increased due to parallelism; reliability is improved due to redundancy; flexibility is gained (i.e. new tasks are composed more easily from the more modular organization); and reusability at the knowledge level, and hence shareability of resources, is obtained. Lastly, MAS offer an extensible and flexible framework for integrating the necessary data capture systems, monitoring systems and interpretation functions.
In the context of a condition monitoring system, the agent perceives the environment through one or more sensors. Additional information about the
systems that are flexible, robust, and able to adapt to the environment [70]. This is especially helpful when components of the system are not known in advance, change over time, or are highly heterogeneous. Agent-based computing offers the ability to decentralize computing solutions by incorporating autonomy and intelligence into cooperative, distributed applications. Each agent perceives the state of its environment, infers by updating its internal knowledge according to newly received perceptions, decides on an action, and acts to change the state of the environment. Agent-oriented programming is a software paradigm used to facilitate agent-based computing; it extends object-oriented programming by replacing the notions of class and inheritance with the notions of roles and messages, respectively [70].
Within a multi-agent system, agents can communicate with each other using agent communication languages, which resemble human-like speech acts more than typical symbol-level program-to-program communication protocols [70]. This capability enables agents to distill useful knowledge from voluminous heterogeneous information sources and to communicate with each other, on the basis of which they coordinate their actions. By enabling computation to be performed where the computing resources and data are located, and by allowing for flexible communication of relevant results to relevant entities as needed, MAS offer significant new capabilities to power systems, which have for so long depended on various forms of expensive telemetry to satisfy most communication needs.
from data agents and functional agents to decision agents. Because each agent
is designed to perform a specific role, with associated knowledge and skills,
distributed and heterogeneous information may be efficiently assimilated lo-
cally and utilized in a coordinated fashion in distributed knowledge networks,
resulting in reduced information processing time and network bandwidth in
comparison to that of more traditional centralized schemes.
• Interpretation Layer
• Diagnostic Layer
• Information Layer
Based on the analyzed goals, agents need to be identified and their relationships modelled. In any system, a choice must be made about how many functions are combined within a single agent versus each function becoming an autonomous agent. The agents for each layer are discussed below.
The data monitoring layer allows only relevant information to enter the system. Raw data from the sensors and associated condition monitoring systems is received, and all necessary pre-conditioning takes place. This layer consists of a feature extraction agent and a data transformation agent.
Interpretation Layer
The interpretation layer turns the data into information that can be easily interpreted by the plant operator. This module uses advanced intelligent system techniques, coupled with codified knowledge and expertise in the area of condition monitoring, to diagnose a problem. It supports more than one interpretation technique. Data interpretation is achieved through three agents: a kernel-based classifier, a backpropagation neural network, and an extension neural network.
Diagnostics Layer
The diagnostic layer consists of two diagnosis agents, which are allowed to share information and compare their decisions.
Diagnostic Agent

This agent takes the outputs of the interpretation layer agents and builds an overall diagnostic conclusion. It is known that the performance of an ensemble is often much better than that of the individual classifiers, because of independently-trained classifiers and their uncorrelated errors [54]. Since several independent agent classifiers are used, their outputs need to be aggregated in an appropriate manner. A number of decision fusion techniques exist, such as weighted majority voting, majority voting and trained combiner fusion. Majority voting is the simplest and most widely-used aggregation method [54]; this voting scheme treats all agents with equal weights. The prediction errors of the agents are often different, however, so it is more reasonable to give them different weights, in proportion to their prediction performance. In weighted majority voting, the predicted class label of the diagnosis agent is given by:
$C_m = \arg\max_m \sum_{k=1}^{K} W(k, m)\, I_{km}$ (5.1)
where $W(k, m)$ is the weight when the predicted class label of the $k$th agent is $C_m$. The weights can be determined by calculating appropriate performance measures for each classifier, as mentioned in previous sections. The weights of the classifiers must add up to one. A table is built with classifiers tabulated against conclusions, and the cells are populated with the various results that have been submitted to the diagnosis agent. Each time new results are added, the conclusion is recalculated. After the table is fully populated, weighted majority voting is used to determine the outcome of the event.
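A minimal sketch of this fusion step, using the agents' classification accuracies as voting weights normalized to sum to one (the interface and names are illustrative, not taken from the prototype):

```python
import numpy as np

def fuse_diagnosis(agent_predictions, agent_accuracies, num_classes):
    """Equation 5.1: weighted majority vote over K interpretation agents.
    Returns the combined class label and its confidence level."""
    w = np.asarray(agent_accuracies, dtype=float)
    w = w / w.sum()                       # the weights must add up to one
    votes = np.zeros(num_classes)
    for pred, wk in zip(agent_predictions, w):
        votes[pred] += wk                 # I_km = 1 when agent k predicts class m
    label = int(np.argmax(votes))
    return label, votes[label]

# Example: MLP and SVM agents agree on class 3, the ENN agent disagrees
label, confidence = fuse_diagnosis([3, 3, 1], [95.87, 95.63, 97.5], num_classes=5)
```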
Information Layer
Registry agent
The registry agent maintains information about the published variables and monitored conditions for each agent. All agents are required to register with the registry agent. The registry agent also maintains the current status of all registered agents. Agent status is a combination of two parameters: alive
and reachable. The status of a communication link between any two agents is determined by attempting to achieve reliable communication between them. The registry agent is used to find information about agents that may supply required data.
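The prototype registry is implemented in JADE (Java); purely as an illustration of the bookkeeping described above, and keeping to the single sketch language used in this document, here is a minimal Python analogue (all names are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class AgentStatus:
    alive: bool = True
    reachable: bool = True

@dataclass
class RegistryAgent:
    published: dict = field(default_factory=dict)  # agent name -> published variables
    status: dict = field(default_factory=dict)     # agent name -> AgentStatus

    def register(self, name, variables):
        # every agent must register its published variables with the registry
        self.published[name] = set(variables)
        self.status[name] = AgentStatus()

    def suppliers_of(self, variable):
        # find agents that may supply the required data
        return [n for n, v in self.published.items()
                if variable in v and self.status[n].alive and self.status[n].reachable]
```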
5.5 Experimentation
Having designed the condition monitoring system using the most appropriate features of existing MAS design methodologies, the next stage was to implement the prototype. To achieve this, the multi-agent building toolkit JADE (Java Agent DEvelopment Framework) was used. JADE is middleware that can be used to develop agent-based applications in compliance with the FIPA specifications for inter-operable intelligent multi-agent systems. JADE is Java-based and provides the infrastructure for agent communication in distributed environments, based on the FIPA standards.
An MLP with 10 input layer nodes, 12 hidden layer nodes and 5 output layer nodes was used. Table 5.1 shows the normalized confusion matrix resulting from the MLP agent. The backpropagation neural network gave an overall classification accuracy of 95.87%, which is used as the voting weight for the MLP in the diagnosis stage.

Table 5.1: Normalized confusion matrix for MLP for bushing data

A radial basis function kernel was used to train the support vector machines. The normalized confusion matrix for the SVM is shown in Table 5.2. The overall classification accuracy of the SVM is 95.63%, which is the voting weight for the SVM.

Table 5.2: Normalized confusion matrix for SVM for the bushing data

The learning rate of the ENN was chosen empirically to be 0.103. Table 5.3 shows the normalized confusion matrix for the ENN. The ENN gave a classification accuracy of 97.5%. All the interpretation agents gave very good classification accuracy; this might
Table 5.3: Normalized confusion matrix for ENN for the bushing data
be due to the fact that the extracted features gave a good signature of the different bushing conditions.
For experimental purposes, one data instance is passed to the data preprocessing agent and processed as described. The preprocessed data is then sent, via agent subscription and messaging, to each of the interpretation layer agents. The interpretation layer agents work on the data simultaneously and, upon completion, forward their results to the diagnosis agent to perform the diagnosis. The first conclusions are generated by the multi-layer perceptron agent, which passes its result to the diagnosis agent using the agent interactions described. The agent returns results detailing the confidence, or probability, of each of the possible conditions. The diagnosis table is shown in Table 5.4.

Table 5.4: Diagnostic results for the MLP for bushing data

The table shows that the fault is of a thermal nature with a confidence level of 95.87%. Next, the kernel-based agent provides its result. This yields the results shown in Table 5.5, which indicate that there is a thermal fault with a confidence level of 95.75%.
Table 5.5: Diagnostic results for the MLP and SVM for bushing data
The last result passed to the diagnosis agent is the conclusion from the extension neural network. When added to the table, the results shown in Table 5.6 are obtained.
Table 5.6: Diagnostic results for the MLP, SVM and ENN for bushing data
At this stage, the bushing diagnosis agent concludes its calculation of the
combined result. It determines that there is a thermal fault with a confidence
level of 96.33%. This is now passed to the Operator Assistant Agent for display.
Due to the fact that the different classifiers learn the data differently, this
means they will make errors differently. In this case, the MLP and SVM agree
on the nature of fault, although the ENN gives a different result, hence the
reduced probability.
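The combination step can be pictured as each interpretation agent placing its voting weight behind its predicted class, with the diagnosis agent reporting the class that accumulates the most weight. The sketch below assumes a simple weight-sum rule; the class labels and weights are illustrative, and the diagnosis agent's exact combination formula may differ in detail.

\begin{verbatim}
import java.util.AbstractMap.SimpleEntry;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DiagnosisVoting {

    // Each entry pairs an agent's predicted fault class with its voting
    // weight (the classifier's overall accuracy). Returns the class with
    // the largest accumulated weight and prints a combined confidence.
    static String combine(List<SimpleEntry<String, Double>> votes) {
        Map<String, Double> tally = new HashMap<>();
        double total = 0.0;
        for (SimpleEntry<String, Double> v : votes) {
            tally.merge(v.getKey(), v.getValue(), Double::sum);
            total += v.getValue();
        }
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Double> e : tally.entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        System.out.printf("Diagnosis: %s (weight %.2f%% of total)%n",
                best, 100.0 * bestScore / total);
        return best;
    }

    public static void main(String[] args) {
        combine(List.of(
                new SimpleEntry<>("thermal", 0.9587),             // MLP vote
                new SimpleEntry<>("thermal", 0.9563),             // SVM vote
                new SimpleEntry<>("partial discharge", 0.9750))); // ENN vote
    }
}
\end{verbatim}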
For the vibration data, an MLP consisting of 7 input layer nodes, 10 hidden
layer nodes and 4 output layer nodes was used. Table 5.7 shows the normalized
confusion matrix resulting from the MLP agent. The backpropagation neural
network gave an overall classification accuracy of 100%, which is used as the
voting weight for the MLP in the diagnosis stage.

Table 5.7: Normalized confusion matrix for MLP for vibration data

A radial basis function kernel was used to train the support vector machines.
The normalized confusion matrix for the SVM is shown in Table 5.8. The overall
classification accuracy of the SVM is 100%, which is the voting weight for the
SVM.

Table 5.8: Normalized confusion matrix for SVM for vibration data

The learning rate of the ENN was set to 0.534, a value selected empirically.
Table 5.9 shows the normalized confusion matrix for the ENN. The ENN gave a
classification accuracy of 99.98%.
Table 5.9: Normalized confusion matrix for ENN for vibration data
All the interpretation agents gave very good classification accuracies; this
is likely because the extracted features gave a good signature of the
different bearing conditions.
For experimental purposes, one data instance is passed to the data
preprocessing agent and processed as described previously. The preprocessed
data is then sent, via agent subscription and messaging, to each of the
interpretation layer agents. The interpretation layer agents work on the data
simultaneously and forward their results to the diagnosis agent in order of
completion. The first conclusion is generated by the multi-layer perceptron
agent, which passes its result to the diagnosis agent using the agent
interactions described. The agent returns a result detailing the confidence,
or probability, of each of the possible conditions. The diagnosis table is
shown in Table 5.10; it shows that the bearing is operating under normal
conditions with a confidence level of 100%. Next, the kernel-based agent
provides its result. This yields Table 5.11, which shows that the bearing is
operating under normal conditions with a confidence level of 100%.
Table 5.11: Diagnostic results for the MLP and SVM for vibration data
The last result passed to the diagnosis agent is the conclusion from the
extension neural network. When added to the table, the result shown in
Table 5.12 is obtained. At this stage, the bearing diagnosis agent concludes
its calculation of the combined result. It determines that the bearing is
operating under normal conditions with a confidence level of 100%. This is
then passed to the Operator Assistant Agent for display. The system predicted
the condition correctly, as a normal operating condition was indeed presented
to it.
Table 5.12: Diagnostic results for the MLP, SVM and ENN for vibration data
5.6 Summary
This chapter presented the design and implementation of an agent-based
condition monitoring system organized into four layers: the data monitoring
layer, interpretation layer, diagnostic layer and information layer. The data
monitoring layer consists of two agents responsible for extracting features
and conditioning the data received from the measurement system. The
interpretation layer consists of three agents: the kernel-based agent, the
backpropagation agent and the extension neural network agent. The
interpretation agents encapsulate classifiers and were therefore trained prior
to use. These agents interpret data from the data monitoring layer and give a
diagnosis of the condition of the equipment. The diagnosis layer uses weighted
majority voting to combine the decisions from the interpretation agents. This
information, along with the maintenance recommendations, is passed to the
operator assistant agent.
The adoption of a distributed MAS architecture also provides a basis for the
continual improvement of engineering decision support, owing to the
flexibility and scalability achieved through a common communication mechanism.
The system enables the integration of further data sources and interpretation
agents.
Chapter 6
Conclusion
The fuzzy ARTMAP gave slightly better results than the MLP, SVM and ENN
classifiers. This might be because the structure of the fuzzy ARTMAP adapts,
creating a larger number of categories in which to map the output labels as
training data becomes available.
The problem with batch learning is that adding new information to the
classifier involves discarding the existing classifier and retraining from
scratch on the combined data.

6.3 Multi-Agent System
The work illustrates how a multi-agent system can be used to effectively
integrate existing stand-alone intelligent systems and new interpretation
approaches within an environment that offers long-term flexibility and
scalability. This work also has economic impact, as the agent-based condition
monitoring system provides integrated tools for data classification and
clustering.

6.4 Suggestions for Future Research
Future work should integrate various measurement data into the existing model.
Since the agents function in a dynamic environment and need to update their
knowledge as the environment changes, future work should also look into
incorporating the incremental learning system into the multi-agent system, to
ensure that the agents are able to adapt. Learning in a multi-agent system
affects the co-operation of agents, hence the effect of learning in this
system must also be studied, as must the scalability of the multi-agent
system.
Appendix A
Fuzzy ARTMAP
ART evolved from the analysis of how biological brains cope with changes in
the environment in real time and in a stable fashion. This neural
network architecture was first proposed by Carpenter and Grossberg [47]. The
FAM network is a robust architecture encompassing both fuzzy logic and the
properties of ART, and is capable of handling analog or binary input vectors,
which may represent fuzzy or crisp sets of features.
Figure A.1 shows a schematic diagram of the FAM network. Each fuzzy ART module
has two layers of nodes: $F_1^a$ ($F_1^b$) is the input layer, whereas $F_2^a$
($F_2^b$) is a dynamic layer in which the number of nodes can be increased
when necessary, and every node encodes a prototype pattern representing a
cluster of input samples. $F_0^a$ ($F_0^b$) is a pre-processing layer in which
the size of an $M$-dimensional input vector, $\mathbf{a} \in [0,1]^M$, is kept
constant in order to avoid the category proliferation problem [47]. One of the
recommended normalization techniques is complement coding [47], where an
$M$-dimensional input vector is normalized to a $2M$-dimensional vector
$\mathbf{A} = (\mathbf{a}, \mathbf{1}-\mathbf{a}) = (a_1, \ldots, a_M, 1-a_1,
\ldots, 1-a_M)$. Following the notation used in the FAM paper [47], let $2M_a$
be the number of
nodes in $F_1^a$ and $N_a$ be the number of nodes in $F_2^a$. The Short Term
Memory traces, or activity vectors, of $F_1^a$ and $F_2^a$ are denoted by
$\mathbf{x}^a = (x_1^a, \ldots, x_{2M_a}^a)$ and $\mathbf{y}^a = (y_1^a,
\ldots, y_{N_a}^a)$; and $\mathbf{w}_j^a = (w_{j1}^a, \ldots, w_{j,2M_a}^a)$,
$j = 1, \ldots, N_a$, is the $j$th ART$_a$ weight vector, or Long Term Memory
trace. The same notation applies to ART$_b$ when the superscripts and
subscripts $a$ and $b$ are interchanged. In the map field, $\mathbf{w}_j^{ab}
= (w_{j1}^{ab}, \ldots, w_{jN_b}^{ab})$, $j = 1, \ldots, N_a$, is the weight
vector from the $j$th $F_2^a$ node to $F^{ab}$, and $\mathbf{x}^{ab} =
(x_1^{ab}, \ldots, x_{N_b}^{ab})$ is the map field activity vector. In
general, the FAM algorithm can be divided into four phases:
A.1 Initialization
In a fuzzy ART module, the weight vectors subsume both the bottom-up and
top-down weight vectors of ART$_a$ [47]. In ART$_a$, each category node weight
vector fans out to all the nodes in the $F_1^a$ layer. These weight vectors
are initialized to unity:

$$w_{j1}^a(0) = \cdots = w_{j,2M_a}^a(0) = 1, \qquad j = 1, \ldots, N_a \qquad (A.1)$$
There are three parameters associated with ART$_a$ and ART$_b$: the choice
parameter, $\alpha_a > 0$, the learning rate, $\beta_a$, and the baseline
vigilance parameter, $\bar{\rho}_a$. To operate in the conservative mode,
where recoding during learning is minimized, $\alpha_a$ should be initialized
close to 0. The values of $\beta_a$ and $\bar{\rho}_a$ are set between 0 and
1. The same initialization procedure applies to ART$_b$. In the map field, the
vigilance parameter, $\rho_{ab}$, is also initialized between 0 and 1, whereas
the weight vectors from $F_2^a$ to $F^{ab}$ are set to unity. Note that the
number of nodes in $F^{ab}$ is the same as the number of nodes in $F_2^b$, and
there is a one-to-one permanent link between each corresponding pair of nodes.

Activities in Fuzzy ART. During supervised learning, ART$_a$ receives an input
vector and ART$_b$ receives the associated target vector. After
pre-processing, the complement-coded input vector $\mathbf{A}$ is propagated
from $F_1^a$ to $F_2^a$ through $\mathbf{w}^a$. A fuzzy choice function is
used to measure the response of each $F_2^a$ node:
$$T_j(\mathbf{A}) = \frac{|\mathbf{A} \wedge \mathbf{w}_j^a|}{\alpha_a + |\mathbf{w}_j^a|}, \qquad j = 1, \ldots, N_a \qquad (A.2)$$

The fuzzy MIN operator ($\wedge$) and the size $|\cdot|$ are defined as
$(\mathbf{x} \wedge \mathbf{y})_i \equiv \min(x_i, y_i)$ and $|\mathbf{x}|
\equiv \sum_i |x_i|$ [75]. The maximally responsive node is selected as the
winner, denoted node $J$, while all other nodes are shut down in accordance
with the winner-take-all competition. The winning node then sends its weight
vector to $F_1^a$, and a vigilance test is performed to compare the similarity
between the activity vector, $\mathbf{x}^a$, and the input vector against the
vigilance parameter:

$$\frac{|\mathbf{x}^a|}{|\mathbf{A}|} = \frac{|\mathbf{A} \wedge \mathbf{w}_J^a|}{|\mathbf{A}|} \geq \rho_a \qquad (A.3)$$
where $\mathbf{w}_J^a$ is the weight vector of the $J$th winning node in
$F_2^a$. If this novelty test is satisfied, resonance is said to occur, and
learning takes place. However, if the test fails, the winning node is
inhibited, and $\mathbf{A}$ is retransmitted to $F_2^a$ to search for a new
winner that is able to fulfill the vigilance test. If no such node exists, a
new node is recruited to code the input vector. The same search cycle goes on
simultaneously for the target vector in ART$_b$, where a prototype node in
$F_2^b$ that best matches the target vector is found.
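To make the mechanics of equations A.2 and A.3 concrete, the following sketch implements complement coding, the fuzzy choice function and the vigilance test. It is a plain reading of the standard fuzzy ART operations, not the thesis's implementation; all names and example values are illustrative.

\begin{verbatim}
public class FuzzyArtStep {

    // Complement coding: an M-dimensional input a becomes (a, 1 - a).
    static double[] complementCode(double[] a) {
        double[] A = new double[2 * a.length];
        for (int i = 0; i < a.length; i++) {
            A[i] = a[i];
            A[i + a.length] = 1.0 - a[i];
        }
        return A;
    }

    // City-block size |x| = sum of |x_i|.
    static double norm(double[] x) {
        double s = 0.0;
        for (double v : x) s += Math.abs(v);
        return s;
    }

    // Fuzzy MIN operator: component-wise minimum.
    static double[] fuzzyMin(double[] x, double[] y) {
        double[] m = new double[x.length];
        for (int i = 0; i < x.length; i++) m[i] = Math.min(x[i], y[i]);
        return m;
    }

    // Choice function (A.2): T_j = |A ^ w_j| / (alpha + |w_j|).
    static double choice(double[] A, double[] w, double alpha) {
        return norm(fuzzyMin(A, w)) / (alpha + norm(w));
    }

    // Vigilance test (A.3): |A ^ w_J| / |A| >= rho.
    static boolean vigilance(double[] A, double[] wJ, double rho) {
        return norm(fuzzyMin(A, wJ)) / norm(A) >= rho;
    }

    public static void main(String[] args) {
        double[] A = complementCode(new double[] { 0.2, 0.7 });
        double[] w = { 0.3, 0.6, 0.8, 0.4 }; // an existing category prototype
        System.out.println("T_J = " + choice(A, w, 0.001));
        System.out.println("passes vigilance: " + vigilance(A, w, 0.75));
    }
}
\end{verbatim}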
In general, an independent fuzzy ART module is employed as ART$_b$ to
self-organize the target vectors. However, in one-from-$N$ classification
(i.e., each input pattern belongs to only one of the $N$ possible output
classes), ART$_b$ can be replaced by a single layer containing $N$ nodes. The
$N$-bit teaching stimulus is then coded to have unit value at the node
corresponding to the target category and zero for all others.

Activities in the map field. The comparison between the $F_2^a$ and $F_2^b$
activities takes place in the map field. If $K$ is the winning node in
ART$_b$, then

$$y_k^b = \begin{cases} 1 & \text{if } k = K \\ 0 & \text{otherwise} \end{cases} \qquad (A.4)$$
A.2 Match Tracking
Assuming that both ART$_a$ and ART$_b$ are active, the $F^{ab}$ activity
vector, $\mathbf{x}^{ab}$, obeys

$$\mathbf{x}^{ab} = \mathbf{y}^b \wedge \mathbf{w}_J^{ab} \qquad (A.5)$$

which forms a prediction from the $J$th ART$_a$ category to the $K$th ART$_b$
target class via $\mathbf{w}_J^{ab}$. A map field vigilance test is performed
to determine the similarity between $\mathbf{x}^{ab}$ and $\mathbf{y}^b$
against a user-defined map field vigilance parameter, $\rho_{ab}$, i.e.:

$$\frac{|\mathbf{x}^{ab}|}{|\mathbf{y}^b|} \geq \rho_{ab} \qquad (A.6)$$
If equation A.6 is satisfied, learning ensues in ART$_a$, ART$_b$ and the map
field, as described in section A.3. Conversely, if equation A.6 fails, an
activity called match tracking is triggered, which initiates a new search
cycle in ART$_a$. For each new input vector, the ART$_a$ vigilance parameter,
$\rho_a$, equals a user-defined baseline vigilance, $\bar{\rho}_a$. In
response to a failure in the map field vigilance test, $\rho_a$ is raised to

$$\rho_a = \frac{|\mathbf{A} \wedge \mathbf{w}_J^a|}{|\mathbf{A}|} + \delta \qquad (A.7)$$

where $\delta$ is a small positive value slightly greater than zero. Thus, the
ART$_a$ vigilance test (equation A.3) fails, and a new winner in $F_2^a$ has
to be selected. In other words, match tracking provides a means to select a
category node that satisfies both the ART$_a$ and the map field vigilance
tests. If no such node exists, the current input vector is ignored.
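Continuing the sketch above (and reusing its norm and fuzzyMin helpers), match tracking amounts to lifting the ART$_a$ vigilance just above the current match value so that node $J$ fails the vigilance test and search resumes; delta is the small increment of equation A.7.

\begin{verbatim}
// Match tracking (A.7), as a method added to the FuzzyArtStep sketch:
// raise rho_a just above the match value of the current winner J.
static double matchTrack(double[] A, double[] wJ, double delta) {
    return norm(fuzzyMin(A, wJ)) / norm(A) + delta;
}
\end{verbatim}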
A.3 Learning
Once the search ends, the winning map field weight vector is updated according
to

$$w_{JK}^{ab} = 1, \qquad w_{Jk}^{ab} = 0 \quad \text{for } k \neq K$$

This learning rule indicates that the $J$th category prototype in $F_2^a$ is
linked to the $K$th target output in $F_2^b$ via $w_{JK}^{ab}$, and the
association is permanent.
Appendix B
Learn++ Algorithm
If the error is greater than 0.5, the hypothesis is discarded, and new
training and testing subsets are selected according to $D_t$ and another
hypothesis is computed. All classifiers generated so far are combined using
weighted majority voting to obtain the composite hypothesis $H_t$:

$$H_t = \arg\max_{y \in Y} \sum_{t: h_t(\mathbf{x}) = y} \log(1/\beta_t) \qquad (B.4)$$
If the error of the composite hypothesis is greater than 0.5, the current
hypothesis is discarded and new training and testing data are selected
according to the distribution $D_t$. Otherwise, if the error is less than 0.5,
the normalized error of the composite hypothesis is computed as

$$B_t = \frac{E_t}{1 - E_t} \qquad (B.6)$$
This normalized error is used in the distribution update rule, where the
weights of the correctly classified instances are reduced, consequently
increasing the relative weights of the misclassified instances. This ensures
that instances misclassified by the current composite hypothesis have a higher
probability of being selected for the subsequent training set. The
distribution update rule is given by

$$w_{t+1}(i) = w_t(i) \times \begin{cases} B_t & \text{if } H_t(\mathbf{x}_i) = y_i \\ 1 & \text{otherwise} \end{cases} \qquad (B.7)$$
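A minimal sketch of this update, assuming the standard Learn++ rule reconstructed above: weights of correctly classified instances are scaled down by $B_t$ and the result is renormalized into a distribution. Variable names are illustrative.

\begin{verbatim}
// Learn++ distribution update (B.7): down-weight instances the composite
// hypothesis H_t classified correctly, then renormalize.
static double[] updateDistribution(double[] w, boolean[] correct, double Bt) {
    double[] next = w.clone();
    double sum = 0.0;
    for (int i = 0; i < next.length; i++) {
        if (correct[i]) next[i] *= Bt; // Bt < 1 reduces these weights
        sum += next[i];
    }
    for (int i = 0; i < next.length; i++) {
        next[i] /= sum;                // renormalize so the weights sum to 1
    }
    return next;
}
\end{verbatim}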
Once $T$ hypotheses have been created for each database, the final hypothesis
is computed by combining the hypotheses using weighted majority voting:

$$H_{\mathrm{final}} = \arg\max_{y \in Y} \sum_{k=1}^{K} \sum_{t: h_t(\mathbf{x}) = y} \log(1/\beta_t) \qquad (B.8)$$
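The final vote can be sketched as follows, flattening the double sum over databases and hypotheses into one pass over all hypotheses; here preds[t] is hypothesis t's predicted class index for the input and beta[t] its normalized error. A sketch under those assumptions:

\begin{verbatim}
// Final Learn++ hypothesis (B.8): accumulate log(1/beta_t) behind the
// class each hypothesis predicts, then return the highest-scoring class.
static int finalHypothesis(int[] preds, double[] beta, int numClasses) {
    double[] score = new double[numClasses];
    for (int t = 0; t < preds.length; t++) {
        score[preds[t]] += Math.log(1.0 / beta[t]);
    }
    int best = 0;
    for (int y = 1; y < numClasses; y++) {
        if (score[y] > score[best]) best = y;
    }
    return best;
}
\end{verbatim}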
Appendix C
Publications
Below is a list of all publications that have been derived from this work.
References
[1] C. Bishop, Neural Networks for Pattern Recognition. New York: Oxford
University Press, 2003.
[12] X. Lou and K. Loparo, “Bearing fault diagnosis based on wavelet trans-
form and fuzzy inference,” Mechanical Systems and Signal Processing,
vol. 18, pp. 1077–1095, 2004.
[14] B. Samanta, “Gear fault detection using artificial neural networks and
support vector machines with genetic algorithms,” Mechanical Systems and
Signal Processing, vol. 18, pp. 625–644, 2004.
[15] B. Yang, T. Han, and J. An, “ART-KOHONEN neural network for fault
diagnosis of rotating machinery,” Mechanical Systems and Signal Processing,
vol. 18, pp. 645–657, 2004.
[16] A. Rojas and A. Nandi, “Practical scheme for fast detection and classi-
fication of rolling-element bearing faults using support vector machines,”
Mechanical Systems and Signal Processing, vol. 20, no. 7, pp. 1523–1536,
2006.
[20] H. Ocak and K. Loparo, “Estimation of the running speed and bearing
defect frequencies of an induction motor from vibration data,” Mechanical
Systems and Signal Processing, vol. 18, pp. 515–533, 2004.
[25] S. M. Dhlamini and T. Marwala, “Using SVM, RBF and MLP for bushings,” in
Proceedings of IEEE Africon, pp. 613–617, 2004.
[27] R. R. Rogers, “IEEE and IEC codes to interpret incipient faults in
transformers, using gas in oil analysis,” IEEE Transactions on Electrical
Insulation, vol. 13, no. 5, pp. 349–354, 1978.
[28] M. Duval, “Fault gases formed in oil-filled breathing E.H.V. power
transformers: the interpretation of gas analysis data,” in Proceedings of
IEEE Power Engineering Society Conference, pp. C74–476–8, 1974.
[31] Y. Zhang, X. Ding, Y. Liu, and P. J. Griffin, “An artificial neural
network based approach to transformer fault diagnosis,” IEEE Transactions on
Power Delivery, vol. 11, pp. 1836–1841, 1996.
[32] Z. Wang, Y. Zhang, C. Li, and Y. Liu, “ANN based transformer fault
diagnosis,” in Proceedings of the American Power Conference, pp. 428–432,
1997.
[37] C. E. Lin, J. M. Ling, and C. L. Huang, “An expert system for transformer
fault diagnosis using dissolved gas analysis,” IEEE Transactions on Power
Delivery, vol. 8, no. 1, pp. 231–238, 1993.
[41] J. Wang, J. Wang, and Y. Weng, “Chip design of MFCC extraction for speech
recognition,” Integration, the VLSI Journal, vol. 32, pp. 111–131, 2002.
[43] J. Weston and C. Watkins, “Support vector machines for multi-class
pattern recognition,” in Proceedings of ESANN’99, pp. 219–224, 1999.
[48] M. P. Perrone and L. N. Cooper, Neural Networks for Speech and Image
Processing, ch. When Networks Disagree: Ensemble Methods for Hybrid
Neural Networks. Chapman Hall, 1993.
[59] F. S. Osorio and B. Amy, “INSS: A hybrid system for constructive ma-
chine learning,” Neurocomputing, vol. 28, pp. 191–205, 1999.
[60] M. Vo, “Incremental learning using the time delay neural network,” in
Proceedings of the IEEE International Conference on Acoustics, Speech
and Signal Processing, pp. 629–632, 1994.
[75] C. Lim and R. F. Harrison, “An incremental adaptive network for on-line
supervised learning and probability estimation,” Neural Networks, vol. 10,
no. 5, pp. 925–939, 1997.