Generators
A thesis
presented to the University of Waterloo
in fulfillment of the
thesis requirement for the degree of
Master of Applied Science
in
Electrical and Computer Engineering
Acknowledgements
Table of Contents
Acronyms
1 Introduction
1.1 Motivation
1.2 Literature Review
1.2.1 Model-based Prognostics
1.2.2 Data-Driven Prognostics
1.2.3 Hybrid Prognostics
1.3 Research Objectives
1.4 Thesis Outline
2 Background Review
2.1 Time Series Analysis
2.1.1 Definition
2.1.2 Components of Time Series
2.1.3 Concept of Stationarity
2.1.4 Forecasting
2.1.5 Problem Framework
2.2 Data Preprocessing
2.2.1 Missing Data Imputation
2.2.2 Data Scaling
2.2.3 Imbalanced Dataset
2.3 Machine Learning Models
2.3.1 Hyperparameter Optimization
2.3.2 Support Vector Machine
2.3.3 Artificial Neural Network
2.3.4 Long Short Term Memory (LSTM) Network
2.3.5 Ensemble
2.3.6 Evaluation Metrics
2.4 Feature Selection
2.4.1 Extra Trees Classifiers
2.4.2 Partial Autocorrelation Function
2.5 Summary
3.5 Lag Feature Selection
3.6 Machine Learning Models
3.7 Results
3.7.1 SVM
3.7.2 ANN
3.7.3 LSTM
3.7.4 Majority Voting and Weighted Majority Voting
3.7.5 AdaBoost
3.7.6 Discussion
3.8 Summary
4 Conclusion
4.1 Summary and Conclusions
4.2 Contributions
4.3 Future Work
References
List of Figures
2.1 Examples of (a) underfitting, (b) overfitting, and (c) statistical fit cases [72].
2.2 Walk-forward validation technique.
2.3 Support vector machine.
2.4 Effect of C parameter on the hyperplane: (a) large C values are assigned; (b) small C values are assigned.
2.5 An example of a basic feedforward NN.
2.6 Sigmoid activation function.
2.7 ReLU activation function.
2.8 Unfolding of RNNs.
2.9 Schematic graph of LSTM cell at time t.
2.10 Confusion Matrix.
2.11 A sample PACF.
List of Tables
3.21 Testing dataset results for initial ANN model.
3.22 Tuned parameters for final ANN model.
3.23 Testing dataset results for final ANN model.
3.24 Confusion matrix for final ANN model.
3.25 Preset hyperparameters for initial LSTM.
3.26 Initial LSTM model hyperparameters obtained with Random Search.
3.27 Testing dataset results for initial LSTM model.
3.28 Parameters for final LSTM model.
3.29 Results for final LSTM model.
3.30 Confusion matrix for final LSTM model.
3.31 Weights assigned to various pretrained algorithms.
3.32 Results with weights [1, 1, 1].
3.33 Results with weights [1, 1, 2].
3.34 Results with weights [1, 0.5, 2].
3.35 Results for AdaBoost model, applied to imbalanced dataset.
3.36 Confusion matrix for AdaBoost model, applied to imbalanced dataset.
3.37 Results for AdaBoost model, applied to balanced dataset.
3.38 Confusion matrix for AdaBoost model, applied to balanced dataset.
Acronyms
AF Activation Functions
AI Artificial Intelligence
DirRec Direct-Recursive
FN False Negative
FP False Positive
KPSS Kwiatkowski-Phillips-Schmidt-Shin
NF Neural-Fuzzy
SCADA Supervisory Control and Data Acquisition
TN True Negative
TP True Positive
Chapter 1
Introduction
1.1 Motivation
Global energy demand has been steadily increasing, and the resulting need for fossil fuels has expanded, met by supplies of coal, oil, and natural gas. As can be seen from Figure 1.1, this demand is heavily met by these resources.
Figure 1.1: Global Primary Energy Consumption from 1800 to 2017 [1].
While fossil fuels are a widespread energy resource, they present a significant environmental problem in the form of combustion by-products: carbon dioxide, hydrocarbons, and other Greenhouse Gases (GHGs), resulting in a rapidly warming global climate. This issue has united countries and pushed them to adopt policies such as the Kyoto Protocol and the Paris Agreement to mitigate the negative impact of GHGs. One technological approach that can help reduce the effect of greenhouse gases is the utilization of natural resources via Renewable Energy Sources (RESs). As [2] states, RESs were initially used as back-up units and were deployed only when the primary power supply was interrupted. Among RESs, wind turbines are reported to be expanding rapidly [3], with worldwide installed wind power capacity exceeding 539.58 GW [4].
In order to compare the cost of power production from wind turbines with that of other generation technologies, the Levelised Cost of Energy (LCOE) metric is utilised, which expresses the average price of electricity that an asset must receive over its operating life in order to break even. Figure 1.2 shows the range of LCOE for solar, onshore wind, and offshore wind compared to fossil fuels, demonstrating that wind is a cheaper alternative compared to solar generation technologies. LCOE is calculated as [5]:

$$\text{LCOE} = \frac{ICC \cdot FC + LRC}{AEP_{Net}} + O\&M \tag{1.1}$$

$$AEP_{Net} = AEP_{Gross} \cdot \text{Availability} \cdot (1 - \text{Loss}) \tag{1.2}$$
where ICC is the initial capital cost, FC is the fixed charge rate (%/year), O&M is the operation and maintenance cost, AEP is the annual energy production (kWh/year), and LRC is the levelized replacement cost ($/year). From (1.1), it follows that even though wind turbines offer a cheaper LCOE, high Operation and Maintenance (O&M) costs will drive the LCOE up. It has been reported in [6] that O&M accounts for 10-15% and 20-25% of the overall LCOE for onshore and offshore wind turbines, respectively, during their 20 years of operating life, and can reach up to 35% towards the end of their life [7]. Since there is a strong correlation between O&M expenses and LCOE, minimizing those expenses will reduce the LCOE.
Figure 1.2: Comparison of the LCOE of renewable power generation technologies with that of fossil fuels for 2010 and 2017 [8]. Permission to re-use this image was obtained from IRENA.
To ensure proper working conditions and prolong the useful life of equipment, wind turbines are serviced according to one of the following types of maintenance [9]:
Corrective Maintenance: Corrective or “run-to-failure” maintenance is performed only when a failure occurs, and the failed component is replaced with a new one. Even though this type of maintenance is considered a poor choice, it has been surprisingly popular, mainly because it is the cheapest alternative to implement, and it still remains a widely used type of maintenance nowadays [10], with ongoing research efforts [11]. The major problem with corrective maintenance is that a failure might happen at the most inconvenient time, such as when produced power is at its maximum level. Furthermore, due to the unplanned nature of a fault, spare equipment might not be available and thus, downtime would be prolonged. As a result, even though this is the cheapest type to implement, the final financial losses outweigh the benefits [12].
sensors deter operators from installing these systems [17]. An alternative is to use Supervisory Control and Data Acquisition (SCADA) systems, which are commissioned with all large-scale wind turbines. These systems record data with minute-level granularity and, as a result, can be used as a low-cost solution for early failure detection.
As a SCADA system is an integrated part of any production facility, the collected operational and status data may be used to detect incipient faults. However, since constant manual inspection of the collected data is not feasible due to the size of the database, machine learning algorithms can be employed, as they allow the detection of complex patterns in data. Integrated with existing SCADA systems as part of a predictive maintenance program, these algorithms could predict upcoming failures and thus minimize the downtime of a facility and maximize profits. Therefore, this thesis is focused on utilizing collected SCADA data for maintenance scheduling of wind generators in the plant.
In [24], the authors attempt to predict gearbox failure based on the trending of recorded
oil temperature, vibration amplitudes, and oil debris particle counts. By monitoring transmission efficiency and rotational speed, and relating those parameters to the temperature
rise, gearbox failures can be predicted up to 6 months in advance.
Reference [25] proposes an approach that is based on monitoring fault progression in cases where a damage process (“hidden”) occurs on a slower time scale than the observable dynamics (“fast”) of a system. The proposed methodology consists of three distinct steps: in the first step, a tracking function is calculated based on a reference model's short-time prediction error, and then used to approximate a tracking metric. The estimated tracking metric is used as an input to a nonlinear recursive filter that evaluates the current damage level. This estimate becomes an input for the last stage of the proposed algorithm, where time-to-failure is obtained from a Kalman filter using discrete-time state transition and measurement equations, under the assumption of a particular damage evolution model. The developed algorithm is validated on an electromechanical system in which a discharging battery is treated as the hidden damage process. Simulations demonstrate that the presented algorithm can predict a failure 5 hours in advance of the actual failure.
In [26], the authors present an approach to approximate time-to-failure for rolling bearings by combining sensor data for diagnostics with a mathematical model and historical data for prognostics. The developed mathematical model is based on Kotzalas/Harris theory and focuses on spall progression. This model provides damage build-up information in advance of a failure, while sensors supply essential information such as oil condition or debris monitor outputs and vibration signatures, which are used to update the assumptions in the model and reduce the uncertainty in RUL prediction.
The authors in [27] present an integrated model-based diagnostic and prognostic framework that has been tested on a simulated propellant loading system that includes tanks, valves, and pumps. The approach makes use of a common modeling paradigm that accounts for both nominal behaviour and fault progression. The detection problem is formulated by representing the system under nominal operations and marking changes from the nominal state of the system as faults. Measurement residuals are used along with prognostic predictions of how each measurement is expected to deviate from nominal for each possible fault in the system to generate a set of fault candidates, which explain the observed deviations in measurements. The developed model is able to predict the RUL of a pump with a mean accuracy of 70%, 2.7 hours prior to a failure happening.
Model-based approaches are challenging, as they have to be developed for a specific component, and each one requires a different mathematical model. Furthermore, these models require extensive experimentation and verification; changes in system dynamics and operating conditions can affect the model, and it is impossible to model all real-life conditions. In addition, model-based methods have limited capability in generating degradation behaviour for complex systems, where several faults can take place [28]. Therefore, in this work, model-based approaches will not be considered.
Statistical Prognostics
While statistical and machine learning approaches have a similar end goal, statistical methods aim to estimate a function f(X) as follows [30]:

$$y = f(X) + \epsilon \tag{1.3}$$
The authors in [33] present two probabilistic methods based on Hidden Markov Model
(HMM) for the prediction of incipient faults in the case of roller bearings. The data
used in this work consist of vibration signals, composed of 20480 samples, recorded every
10 minutes with a sampling frequency of 20 kHz. The collected data are separated into
classes, and the degradation process is estimated by two probabilistic methods based on
HMMs. The performances of both methods are evaluated in the prediction of an impending
fault for a roller bearing degradation, using 4 degradation stages, predicting failures with
various time horizons depending on the stage.
In [34], the authors employ the residual delay-time concept and stochastic filtering theory to estimate the RUL of 6 rolling bearings. The bearings are tested in a laboratory setting, and the vibration signals are recorded. The proposed model is based on the idea of conditional residual delay time which, unlike the conventional concept of residual life that depends only on the current age of the monitored equipment, also depends on the condition information available to date. The collected data are used to fit the residual time distribution, with the trained model suggesting preventive maintenance 58 hours before the actual failure.
Because statistical approaches only capture linear relationships and rely on prior assumptions that must hold, these techniques will not be considered in this work.
Machine learning methods are based on the application of artificial intelligence algorithms, where failure patterns are learned from collected historical data. These approaches are divided into supervised and unsupervised methods. Supervised learning is applied in cases where the data are labeled according to operating and failure modes. Unsupervised approaches are applied to unlabelled data, grouping data points in order to detect anomalous regions.
In [35], the authors use SCADA system data for wind turbine performance monitoring and prognostics, using power curves provided by a manufacturer, where deviations indicate abnormal behaviour. The power residual (the difference between the measured actual power and the expected power) is used in constructing the baseline average. A three-sigma deviation from the baseline mean is used to build the upper and lower bounds on the residuals, which detect abnormal activities in the system while minimizing false warnings. Once the first signs of an anomaly are captured, higher-order statistical methods, such as skewness and kurtosis, which the authors call condition indicators, are further applied to anomaly detection and prognosis in different time windows. These condition indicators demonstrate the first signs of an impending failure 20 days before the actual fault.
The authors in [36] analyze bearing faults in wind turbines using Artificial Neural Networks (ANNs). High-frequency SCADA data, recorded at 10-second intervals, are collected from 24 wind turbines to be used for the analysis of bearing faults. In order to detect faults, a training dataset is constructed from turbines that have not been affected by any faults. 5 different NNs with varying parameters (number of nodes and activation functions) are modelled and trained with respect to several performance metrics: absolute error, mean absolute error, relative error, mean relative error, and coefficient of determination. A data-mining approach is applied to detect generator bearing anomalies. The NN models are first tested on two different turbines for model validation, and the error residuals are analyzed using a moving average window of 360 data points (an hourly averaged error residual). The best performing model is selected to test the normal behaviour of wind turbines and detect anomalies in bearing temperatures. The same model is used for the prediction of a fault, where the fault is detected 51 minutes prior to the actual fault occurring.
In addition to the previously cited work focused on predicting abnormal behaviour in bearings, [37] applies a data-mining model to predict blade pitch faults. Faults such as blade angle asymmetry and blade angle implausibility are analyzed to determine how they affect wind turbine performance. The paper is focused on predicting a fault associated with the blade pitch angle using genetic algorithms, ANN, k-nearest neighbors, partial decision trees, and bagging. The SCADA data used to train and validate the models are collected at 1-second intervals from 27 wind turbines for three months, and consist of operational parameters such as power, wind speed, and rotor speed, along with status codes and status descriptions. The prediction problem is framed as a classification problem, and models are evaluated on metrics such as accuracy, sensitivity, and specificity. Models are trained on one third of the data and validated on the remaining data, with the best result being achieved by genetic algorithms with a prediction horizon of 10 minutes: 68.7% accuracy, 71.2% sensitivity (also known as recall), and 66.9% specificity.
In [38], the authors apply Support Vector Machine (SVM) algorithms to classify and predict various faults. The SCADA data used in this research are collected from a turbine in the South-East region of Ireland, divided into 3 separate datasets: operational data, status data, and warning data. Warning data are ignored in the simulations, and both operational and status data are used for model training. Status data contain information on faults and warnings that can be used for supervised classification, and operational data contain historical information about turbine performance. The authors focus on both fault detection and prediction, where the initial goal of the algorithm is to separate faults from data points corresponding to the normal operating mode. The second stage consists of classifying a specific fault: feeding, air-cooling, excitation, generator heating, or mains failure faults. The last stage is to predict a specific fault in advance. The results vary based on the fault, with 100% prediction accuracy obtained in the case of generator heating faults; however, the problem of data leakage is present, possibly resulting in overoptimistic predictions.
The previous work is extended in [39], where the authors vary the prediction window and compare approaches such as the Synthetic Minority Oversampling Technique (SMOTE) [40], random under-sampling, and class weights to address the problem of imbalanced classes. This work focuses on predicting generator heating and excitation faults, with a maximum prediction time horizon of 12 hours. The results presented for the aforementioned prediction window are not as accurate as in [38]. Even though the recall metrics demonstrate a fairly good result, the data are shuffled prior to splitting into training and testing sets and, as a result, the problem of data leakage is once more present.
Similarly to [38] and [39], [41] frames the problem as a classification problem and divides the work into 3 distinct stages: (1) prediction of a fault; (2) prediction of fault severity; (3) prediction of a specific fault. SCADA data recorded at 5-minute intervals are collected for 3 months from 4 wind turbines, in two separate data sets: operational and status/fault data. The prediction window is varied from 5 minutes up to 1 hour, and models are evaluated according to accuracy, sensitivity, and specificity. At each stage, a different set of models is trained on two-thirds of the data and tested on two separate test sets: Test Set 1 consists of the remaining third of the data, while Test Set 2 consists of 10% of the labeled SCADA data selected at random. For stage 1, the models trained are: ANN, an ensemble of ANNs, boosting tree, and SVM, with the ensemble of ANNs demonstrating the best performance at a prediction horizon of 1 hour. For stage 2, which consists of predicting a fault category, the models trained are: ANN, Classification and Regression Tree (CART), a boosting tree algorithm, and SVM, with CART showing the best results. For stage 3, i.e., prediction of a specific fault, the models used are: boosting tree algorithm, ANN, ANN ensemble, and SVM, with the ANN ensemble demonstrating the best result. It is worth noting that when the prediction horizon is increased from 5 minutes to 1 hour, all metrics demonstrate stochastic behaviour, which can be explained by data leakage coming from the random splitting of the data set into training and testing sets.
As not all operators provide SCADA data along with information about faults and warnings, it is common to establish a normal operating model, where any deviation from the model results is classified as a fault. Thus, in [42], the authors attempt to predict gearbox main bearing temperature and lubrication cooling oil temperature by analyzing 2 years' worth of SCADA data with a 10-min recording frequency, collected from 26 Bonus 600 kW stall-regulated turbines located in Scotland. The first part of the paper focuses on detecting faults, which is achieved by estimating normal model behaviour via ANNs. The choice of ANN model parameters is guided by cross-correlation, where different parameters are omitted one by one and the model accuracy is checked after each step. After training various ANN models using 2 years' worth of data points that correspond to normal behaviour, the authors are able to detect both gearbox and cooling oil faults. In the specific case of the cooling oil model, incipient overheating problems were detected almost 6 months in advance of the actual failure.
In [43], the authors aim to predict a fault in wind turbine generators in advance, diagnosing the state of the wind turbine generator when the fault occurs. The proposed solution is deployed and evaluated in two real-world wind power plants in China, demonstrating that the time-to-failure of the generators can be predicted 18 days ahead with about 80% accuracy and, when faults occur, the specific type of generator fault can be diagnosed with an accuracy of 94%. SCADA data collected over a period of 18 months from the two wind power plants are used to identify correlated features, and principal component analysis is applied to combine data with these features. A density-based spatial clustering of applications with noise method is employed to form clusters; however, since the state change from healthy to unhealthy is a continuous process, a new metric called the Anomaly Operation Index (AOI) is proposed to measure the wind turbine's performance over time. The characteristics of the AOI are combined with an Auto-Regressive Integrated Moving Average (ARIMA) model, predicting time-to-failure 18 days ahead with an average prediction accuracy of 80%.
In [44], the authors approach the problem of wind turbine blade breakage prognosis and diagnosis by using deep autoencoder models for dimensionality reduction [45]. The autoencoders are trained on data that correspond to normal operating modes; using a test data set that consists of both faulty and normal data points, the squared Euclidean distance between the reconstructed inputs and the original inputs is calculated, and wind turbines with impending blade breakages show higher reconstruction errors. Based on this metric, an Exponentially Weighted Moving Average (EWMA) method is applied to estimate the upper and lower limits, with the EWMA control charts being tested on 3 wind farms, detecting impending blade breakage 7 hours and 20 minutes, 6 hours and 10 minutes, and 8 hours prior to a fault happening for each wind farm, respectively. Furthermore, it is shown that when models are trained on the full set of data without the fault data points being filtered out, the EWMA control charts are only able to detect a fault 2 hours in advance; however, it is not clear whether the data used to train the model are separated from the data used to validate it.
In [46], the authors build an ANN-based condition monitoring approach using a Nonlinear Autoregressive neural network with eXogenous input (NARX) to estimate the condition of the gearbox bearings. SCADA data collected from onshore wind turbines located in the south of Sweden are used to calculate a Mahalanobis distance measure based on a fault-free training data set, and a threshold on this distance is used to determine the presence of an anomaly in the bearings. The proposed ANN model is trained on the fault-free data and, during the validation phase, the trained model is able to detect a gearbox failure 1 week in advance.
Based on the aforementioned review, and considering that machine learning methods do not require prior assumptions about the underlying relationships between variables and focus on detecting complex data patterns, these methods form the basis for the research work presented in this thesis.
a recursive manner. The developed fusion prognostics framework consists of two separate steps: state estimation and future state forecasting. During state estimation, the fusion framework estimates the states in parallel with model parameter identification based on the battery parameters. During the forecasting stage, the identified model parameters are further tuned based on the predicted system evolution from the data-driven predictor. The proposed fusion prognostic framework predicts the failure 1.59 weeks early, which is shown to be superior to both the particle filtering (33.25 weeks late) and RNF (4.18 weeks late) approaches.
In [49], the authors present a fusion prognostic method for the application of Multi-Layer Ceramic Capacitors (MLCCs). In order to detect a fault, a health baseline is estimated using training data from 5 MLCCs that have not experienced failure. The data are then monitored by Multivariate State Estimation Technique (MSET) and Sequential Probability Ratio Test (SPRT) algorithms, where MSET is used to monitor the status of the capacitors and estimate the residuals for each parameter, and SPRT is used to detect anomalous behaviour from the calculated residuals. The approach results in the detection of failures in the range of 875 to 920 hours, with the actual failure taking place at the 962nd hour.
The authors in [50] develop a hybrid model for degradation monitoring and fault prediction in offshore wind turbines. In this paper, physics-based models of component degradation and RUL are estimated from historical failure data or, when this information is lacking, from average failure rates. A prognostic model is then obtained using a Bayesian updating approach to update the parameters of the proposed mathematical models.
While the previous literature review shows that hybrid methods are powerful, they require mathematical models and are focused on components rather than the whole system. Hence, since this thesis focuses on predicting incipient faults in a wind farm, these hybrid methods will not be considered here.
• Develop machine learning models by tuning hyper-parameters using time-series cross-validation techniques, ensuring that no data leakage is present, and addressing data limitations such as imbalanced data sets and missing values.
• Identify the most useful features for making a prediction through feature selection algorithms, interpret the results, and compare the outputs of various algorithms to identify the most suitable one.
Collected data from Summerside wind farm, located in Prince Edward Island, will be
used to develop, test, and validate the proposed fault prediction technique.
• Chapter 2 presents a background review of the main concepts in this research work.
The chapter is focused on discussing multivariate time-series problems and appropriate tools for time-series analysis. A general review of data preprocessing, including missing data imputation, feature selection, and data reduction, is also presented.
Then, the theoretical background of the various machine learning models used here
is discussed, along with the parameters that affect the learning phase. Finally, metrics used to evaluate models are presented.
• Chapter 3 presents and discusses the prediction models developed, along with the
results of applying these models. A comparison of the various prediction models and
techniques is also presented, identifying the best models for the dataset used.
• Chapter 4 presents a summary and the main contributions of the research work, identifying possible future extensions and improvements.
Chapter 2
Background Review
This chapter reviews the theoretical background of the main concepts on which the work presented throughout this thesis is based. First, time series analysis is discussed, along with the forecasting concepts used for fault prediction. Then, data preprocessing steps are discussed, and finally, the machine learning methods used here, i.e., SVM, ANN, Long Short Term Memory (LSTM), and ensemble models, are reviewed.

2.1 Time Series Analysis
2.1.1 Definition
There are two distinct approaches to time series analysis: time domain and frequency domain approaches [51]. In this thesis, the focus is on time domain approaches, where the primary interest is in analyzing lagged relationships, as opposed to frequency domain techniques, where the objective is to investigate cycles [51].
In various applications, including the one discussed here, metrics of interest are collected at regular intervals to form an ordered sequence of n real-valued variables. In the case of a single monitored feature, the time series is univariate and mathematically expressed as:

$$X = \{x_1, x_2, \ldots, x_n\}, \quad x_i \in \mathbb{R} \tag{2.1}$$
where {x_1, ..., x_n} are discrete data points at times {t_1, ..., t_n}. However, it is more common to encounter multivariate time series, where a vector of features is observed at subsequent moments of time, defined as:

$$X = \{(x_1, \alpha_1, \beta_1), (x_2, \alpha_2, \beta_2), \ldots, (x_n, \alpha_n, \beta_n)\} \tag{2.2}$$

where {x_1, ..., x_n}, {α_1, ..., α_n}, and {β_1, ..., β_n} are data points associated with different and distinct features.
2.1.2 Components of Time Series
A time series can be decomposed into its components using either an additive or a multiplicative model:

$$Y(t) = T(t) + S(t) + C(t) + I(t) \tag{2.3}$$

$$Y(t) = T(t) \cdot S(t) \cdot C(t) \cdot I(t) \tag{2.4}$$

where Y(t) is the observation, T(t) is the trend, S(t) is the seasonal component, and C(t) and I(t) are the cyclical and irregular components at time t, respectively. The main difference between these two models is that the multiplicative model assumes the four components can affect each other, while the additive model assumes that the four components are independent of each other [52].
The major difference between modeling time series data via machine learning models and other domains such as computer vision is that time series cannot be assumed to be independent and identically distributed (i.i.d.) variables, as they follow some pattern in the long term. This leads to limitations in data splitting and cross-validation techniques; thus, the dataset must be split in a way that preserves the temporal components inherent in the problem. Violation of this principle leads to data leakage problems and inflated results [54].
2.1.3 Concept of Stationarity
A stationary time series is one whose properties such as mean and variance do not depend
on time [55]. Time series with trends or seasonality are non-stationary, as both trend and
seasonality affect time series at different times. There are two types of stationary processes:
strongly stationary and weakly stationary.
A process is strongly stationary if the joint distribution of every collection of values {x_{t−s}, x_{t−s+1}, ..., x_t, ..., x_{t+s−1}, x_{t+s}} is independent of t for all s ∈ N [56]. Strong stationarity implies the equal distribution of the random variables of the stochastic process as the series is shifted along the time index axis. However, this type of stationarity is too strong for most applications and, as a result, a weaker concept of stationarity is needed.
A time series is weakly stationary if the mean value, µ, of X_t does not depend on t and remains constant [57]. In addition, the autocovariance γ(s, t) of two time points s and t of the series depends only on their distance d = |s − t|, so that any two time points the same distance apart have the same autocovariance.
As mentioned in [58], the concept of stationarity is a mathematical idea constructed to simplify the theoretical and practical development of stochastic processes. In order to design a proper model, adequate for future forecasting, the underlying time series is expected to be stationary. However, in practice, time series are often non-stationary, since they exhibit some persistent increase or decrease over time. For a given time series, a visual inspection can help to decide whether the series is stationary. Except for some obvious cases, such as a clear trend, deciding if a time series is stationary is not simple; furthermore, a time series can be large enough to make visual inspection unfeasible. To reduce the subjectivity in determining stationarity, unit root tests are employed [59]. The two common tests, and the ones used in this work, are the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) and Augmented Dickey-Fuller (ADF) tests.
In statistics, ADF tests are employed to test whether a unit root is present in a time series sample, which would imply that the series is difference stationary. The null hypothesis in ADF tests, H0, applies to processes with a unit root. The alternate hypothesis, Ha, suggests that the time series does not have a unit root, meaning it is stationary. A non-stationary process with a deterministic trend is transformed into a stationary one by removing the trend, or detrending [60].
KPSS tests are used to test a null hypothesis that an observable time series is stationary around a deterministic trend, versus the alternative of a unit root. The null hypothesis, H0, corresponds to the data being stationary, and the alternate hypothesis, Ha, suggests that the data are difference stationary. To correct this non-stationarity, the difference of the time series with respect to its lag, known as differencing, is used, resulting in a new time series [61]:

$$x'_t = x_t - x_{t-1} \tag{2.5}$$
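As a practical illustration, both tests can be run with the statsmodels library; the following is a minimal sketch under the assumption of a one-dimensional series array (the synthetic data and 5% significance level are illustrative choices, not values used in this thesis):

```python
# Sketch: ADF and KPSS stationarity tests via statsmodels; the series here
# is a synthetic random walk standing in for a real SCADA signal.
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

series = np.cumsum(np.random.randn(500))  # non-stationary by construction

adf_stat, adf_p = adfuller(series)[:2]
kpss_stat, kpss_p = kpss(series, regression="c", nlags="auto")[:2]

# ADF: low p-value rejects the unit-root null (suggests stationarity).
# KPSS: low p-value rejects the stationarity null (suggests a unit root).
print(f"ADF p-value:  {adf_p:.3f}")
print(f"KPSS p-value: {kpss_p:.3f}")

if adf_p > 0.05 or kpss_p < 0.05:
    series = np.diff(series)  # apply first-order differencing, as in (2.5)
```

Note the opposite orientation of the two null hypotheses, which is why the tests are commonly used together.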
2.1.4 Forecasting
Forecasting focuses on predicting the future behaviour of a time series, based on the collected historical observations. In order to make accurate forecasts, several assumptions need to hold. First of all, future observations need to display behaviour similar to the collected data. Second, the collected time series must contain patterns that can be captured, so that machine learning techniques can learn those patterns in order to produce accurate results. Historically, forecasts are divided into short-, medium-, and long-term horizons [55]. The strategies used for forecasting are recursive, direct, and direct-recursive.
Recursive Strategy
The recursive strategy first trains a one-step model f that denotes the functional dependency between past and future observations [62]:

$$x_{t+1} = f(x_t, \ldots, x_{t-d+1}) + w_{t+1} \tag{2.6}$$

where w_{t+1} is a stochastic independent and identically distributed error process with mean zero and variance σ², and d is the number of past values used to predict future values, also known as the embedding dimension of the time series. The trained f is then used recursively to produce a multi-step forecast.
A major drawback of the recursive approach is its sensitivity to estimation error, since errors propagate over the forecast horizon. Despite this, recursive strategies have been successfully utilized in real-world applications in conjunction with machine learning models such as recurrent neural networks [63].
Direct Strategy
The direct strategy learns H models f_h from the time series {x_1, ..., x_N}, where:

$$x_{t+h} = f_h(x_t, \ldots, x_{t-d+1}) + w_{t+h} \quad \forall t \in \{d, \ldots, N-H\},\ h \in \{1, \ldots, H\} \tag{2.7}$$
and returns a multi-step forecast for each f_h model over an H-step horizon. Unlike the recursive strategy, the direct strategy does not compute forecasts based on the approximated ones and hence, errors do not accumulate. However, there are still drawbacks to this approach: since the H models are learned independently, no statistical dependencies between the predictions are considered [62]. Also, direct methods demand more computational
resources, since the number of models to learn depends on the horizon size. Different
machine learning models such as decision trees have been used to implement the direct
strategy for multi-step forecasting tasks [64].
Direct-Recursive Strategy
The Direct-Recursive (DirRec) strategy, proposed in [65], uses a different model for every horizon and, at every time step, introduces the approximations from previous steps into the input set. The DirRec strategy learns H models f_h from the time series {x_1, ..., x_N}, where:

$$x_{t+h} = f_h(x_{t+h-1}, \ldots, x_{t-d+1}) + w_{t+h} \quad \forall t \in \{d, \ldots, N-H\},\ h \in \{1, \ldots, H\} \tag{2.8}$$

Unlike the previous strategies, the embedding size is not the same for all horizons. The effectiveness of this strategy has been demonstrated in [65].
In this thesis, where the primary objective is to predict a fault happening in wind
generators, a direct strategy is employed, as it is widely researched and adequately accurate.
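To make the contrast between the strategies concrete, the following sketch implements recursive and direct H-step forecasts; the linear model, lag count, and synthetic series are placeholder assumptions rather than the models used in this thesis:

```python
# Sketch: recursive vs. direct multi-step forecasting on a toy series.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.sin(np.linspace(0, 20, 300))  # synthetic stand-in for a SCADA signal
d, H = 5, 3                          # embedding dimension and horizon

# Recursive: one one-step model; its outputs are fed back (errors compound).
X = np.array([x[i:i + d] for i in range(len(x) - d)])
one_step = LinearRegression().fit(X, x[d:])
window, recursive = list(x[-d:]), []
for _ in range(H):
    pred = one_step.predict(np.array(window[-d:]).reshape(1, -1))[0]
    recursive.append(pred)
    window.append(pred)

# Direct: H independent models f_h, one per horizon step (no error feedback).
direct = []
for h in range(1, H + 1):
    Xh = np.array([x[i:i + d] for i in range(len(x) - d - h + 1)])
    fh = LinearRegression().fit(Xh, x[d + h - 1:])
    direct.append(fh.predict(x[-d:].reshape(1, -1))[0])

print(recursive, direct)
```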
For the fault prediction problem considered here, the forecasting function f can be written as:

$$y_{t+h} = f(X_t, X_{t-1}, \ldots, X_{t-d+1}) \tag{2.9}$$

where d is the number of selected lags used to predict future values, and y_{t+h} is the binary target variable, with 0 corresponding to the “no-fault” or “normal” operation mode, and 1 representing a “fault”. As soon as the trained classifier recognizes a pattern in the data that is characteristic of a potential fault, the target variable at the prediction horizon h is classified as Class 1. The size of the embedding dimension d is selected according to one of the feature selection methods discussed in Section 2.4.
The goal of any machine learning model is to minimize an objective function. Since the proposed classifier f is a binary one, the objective function that will be minimized is the log-loss objective function, or binary cross-entropy:

$$L(y, \hat{y}) = -\sum_{i=1}^{N} y_i \log(\hat{y}_i) - \sum_{i=1}^{N} (1 - y_i)\log(1 - \hat{y}_i) \tag{2.10}$$

where y_i is the true target of sample i, ŷ_i is the predicted probability for sample i, and N is the number of samples.
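For reference, a one-function sketch of (2.10); the clipping constant is an implementation assumption added to avoid taking log(0):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip predicted probabilities away from 0 and 1 before taking logs.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(-np.sum(y_true * np.log(y_pred)
                         + (1 - y_true) * np.log(1 - y_pred)))

# Confident, correct predictions yield a small loss.
print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.1, 0.8])))
```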
features.
where β is the ratio of negative samples over total ones [71].
Figure 2.1: Examples of (a) underfitting, (b) overfitting, and (c) statistical fit cases [72].
Underfitting, also known as high bias, occurs when a model cannot reduce either the training or the testing set error. The most common cause of underfitting is a model that is too simple to learn the underlying complexities of the data. In Figure 2.1, the first-degree polynomial is not complex enough to fit the training samples.
Overfitting, also known as high variance, is a more common occurrence when training a model. Overfitting is characterized by good performance on the training set but poor performance during the validation phase. Overfitting happens when the model is complex enough to learn the smallest details of the training set and, as a result, is not able to generalize well to the test set. In Figure 2.1, the degree-15 polynomial overfits the training data, as it learns the noise.
While underfitting can be solved in a simple manner by increasing model complexity,
overfitting is a more complex problem. In this work, several methods to address overfitting
are employed, with the most common one being a time-series cross-validation technique.
2.3.1 Hyperparameter Optimization
The hyperparameters for all the models presented in this chapter will be tuned according to
Random Search [73], instead of other approaches such as grid search or Bayesian learning.
This technique can be described as trying random combinations to find the best solution
in the hypothesis space. In [73], the authors demonstrate that the probability of finding a
combination of parameters within 5% of local optima is 95%.
Random Search will be combined with cross-validation to test the effectiveness of a machine learning model. Traditional cross-validation techniques such as the K-fold approach will not work here, as the underlying assumption of those approaches includes shuffling the dataset prior to splitting it; as a result, data leakage problems occur, as information from the future becomes part of the training set. To ensure that this problem is avoided, the cross-validation technique known as walk-forward validation [55] is used, with the process shown in Figure 2.2. This technique restricts the usage of the full set; instead, the dataset is split into blocks of contiguous samples, and at each split, the test data from the previous split becomes part of the training dataset.
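A hedged scikit-learn sketch of this combination follows; the estimator, parameter ranges, and scoring metric are illustrative placeholders rather than the tuned settings used later in the thesis:

```python
# Sketch: Random Search with an expanding-window (walk-forward) split, so
# every validation fold lies strictly after its training data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

X = np.random.randn(1000, 8)            # placeholder lagged SCADA features
y = np.random.randint(0, 2, size=1000)  # placeholder fault labels

search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions={"n_estimators": [50, 100, 200],
                         "max_depth": [3, 5, 10, None]},
    n_iter=10,
    cv=TimeSeriesSplit(n_splits=5),      # contiguous, time-ordered folds
    scoring="recall",
)
search.fit(X, y)
print(search.best_params_)
```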
Figure 2.3: Support vector machine.
2.3.2 Support Vector Machine
SVMs are positioned as binary classifiers. A set of N samples is considered, where each input X_i is D-dimensional and the targets consist of two classes, y_i = +1 and y_i = −1, i.e.:

$$\{X_i, y_i\}, \quad y_i \in \{-1, 1\},\quad X_i \in \mathbb{R}^D,\quad i = 1, \ldots, N \tag{2.13}$$

The goal is then to find the hyperplane that defines the widest margin separating both classes:

$$h_{w,b}(X_i) = w^T X_i + b \tag{2.14}$$

where b is a bias term and w is the weight vector. The optimal hyperplane can be expressed in an infinite number of ways by scaling b and w [78]. Regardless of the values of b and w,
the canonical hyperplane will satisfy the following:

$$w^T X_{sv}^+ + b = 1 \tag{2.15}$$

$$w^T X_{sv}^- + b = -1 \tag{2.16}$$

where X_{sv}^+ and X_{sv}^- are known as support vectors, and are the points closest to the hyperplane in Figure 2.3.
If d_1 is defined as the distance from X_{sv}^- to the hyperplane and d_2 as the corresponding distance from X_{sv}^+, then the hyperplane should be equidistant from both classes, i.e., d_1 = d_2. In this context, the margin is defined as [78]:

$$d(X_{sv}^{\pm}) = \frac{1}{\|w\|} \tag{2.17}$$
Hence, maximizing the margin is equivalent to minimizing ½‖w‖², which allows the problem to be transformed into a primal optimization problem with a convex quadratic objective function that can be solved using a quadratic programming optimizer:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i(w^T X_i + b) - 1 \geq 0 \quad \forall i = 1, \ldots, N \tag{2.18}$$

The corresponding Lagrangian is:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i(w^T X_i + b) - 1 \right] \tag{2.19}$$

where α_i are the Lagrange multipliers. Hence, the optimum is obtained as follows:
$$\frac{\partial L(w, b)}{\partial w} = w - \sum_{i=1}^{N} \alpha_i y_i X_i = 0 \implies w = \sum_{i=1}^{N} \alpha_i y_i X_i \tag{2.20}$$

$$\frac{\partial L(w, b)}{\partial b} = \sum_{i=1}^{N} \alpha_i y_i = 0 \tag{2.21}$$
Finally, substituting (2.20) and (2.21) into (2.19) leads to:

$$L(w, b, \alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j X_i^T X_j \tag{2.22}$$
This is known as the dual of (2.19). Hence, adding the problem constraints, the following dual optimization problem can be formulated [79]:

$$\max_{\alpha}\ L(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j X_i^T X_j \quad \text{s.t.}\quad \sum_{i=1}^{N} \alpha_i y_i = 0,\quad \alpha_i \geq 0\ \ \forall i = 1, \ldots, N \tag{2.23}$$
This is a convex quadratic optimization problem, which can be solved with a quadratic programming (QP) solver to obtain α*; from (2.20), the optimal w* can then be found, and from w*, the optimal value of the intercept term b can be recovered from the support vectors.
Figure 2.4: Effect of C parameter on the hyperplane: (a) large C values are assigned; (b)
small C values are assigned.
For data that are not linearly separable, slack variables ζ_i and a penalty parameter C are introduced, yielding the soft-margin primal problem:

$$\min_{w,b,\zeta}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\zeta_i \quad \text{s.t.}\quad y_i(w^T X_i + b) \geq 1 - \zeta_i,\quad \zeta_i \geq 0 \tag{2.25}$$

Similarly to the case of the linearly separable data, the Lagrangian is formed as:

$$L(w, b, \alpha, \zeta, r) = \frac{1}{2}w^T w + C\sum_{i=1}^{N}\zeta_i - \sum_{i=1}^{N}\alpha_i\left[y_i(X_i^T w + b) - 1 + \zeta_i\right] - \sum_{i=1}^{N} r_i\zeta_i \tag{2.26}$$

where α_i and r_i are the Lagrange multipliers. By setting the derivatives with respect to w and b to 0 and simplifying the resulting (2.26), the following dual optimization problem is formed:

$$\max_{\alpha}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{N}\alpha_i\alpha_j y_i y_j X_i^T X_j \quad \text{s.t.}\quad \sum_{i=1}^{N}\alpha_i y_i = 0,\quad 0 \leq \alpha_i \leq C\ \ \forall i = 1, \ldots, N \tag{2.27}$$
In SVMs, the optimization problem (2.23) is solved in a convex space, obtaining the global extremum [78]. When the space is not linearly separable, a transformation to another space is required. That transformation is performed via the kernel function K(X_i, X_j) = φ(X_i)^T φ(X_j). This is also known as the kernel trick; it computes the inner product between the mapped vectors, which leads to computational savings when data are transformed into a higher-dimensional space. Three kernels are considered in this thesis: the linear, polynomial, and Gaussian kernels.
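As a brief illustration, the three kernels correspond to standard options in scikit-learn's SVC; the dataset and hyperparameter values below are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# C is the soft-margin penalty from (2.25)-(2.27); gamma parameterizes
# the Gaussian ("rbf") kernel, playing roughly the role of 1/(2*sigma^2).
for kernel in ["linear", "poly", "rbf"]:
    clf = SVC(kernel=kernel, C=1.0, gamma="scale").fit(X, y)
    print(kernel, clf.score(X, y))
```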
2.3.3 Artificial Neural Network

$$y = f(w^T X + b) \tag{2.32}$$
Figure 2.5: An example of a basic feedforward NN.
Backpropagation
This algorithm looks for the minimum of the error function in weight space using the method of gradient descent; the combination of weights that minimizes the error function is considered a solution, and the method is guaranteed to converge only to a local optimum [82]. As backpropagation takes derivatives at each iteration step, the activation function used must be differentiable; thus, step functions are not applicable, as the composite function produced by the nodes would be discontinuous, and as a result, so would the error function [82]. A general error function that can be minimized is defined in (2.12). The backpropagation algorithm is applied by iterating over the set of observed pairs, and the general algorithm can be summarized in several discrete steps as follows [83]:
• Feed-Forward: The input vector X is fed into the input layer and forward propagated through the network. At each step, the output of each neuron o_i is calculated according to (2.31).
• Error at the output layer: At the output layer, the error δ_j, where j is a neuron in the output layer, is calculated from:

$$\delta_j o_i = \frac{\partial E}{\partial w_{ij}} \tag{2.33}$$

where w_{ij} is the weight from the ith neuron into the jth neuron.
• Backpropagation to the hidden layers: The error is backpropagated from the output layer towards the first layer. If w_{hq} is the connection from the hth to the qth neuron in the hidden layer, then the error is estimated as follows:

$$\delta_h = \sum_{q} w_{hq} \delta_q \tag{2.34}$$

• Weight updates: At the end of each backpropagation iteration, the weights are updated as follows:

$$w_{ij} = w_{ij} - \alpha \delta_j o_i \tag{2.35}$$

where α is the learning rate that controls the step size of the gradient descent.
The algorithm iterations are terminated when the value of the error function has become
sufficiently small, or if early stopping is invoked.
Optimizer
Once backpropagation computes the gradients, the optimizer uses these gradients to train the model. The choice of optimizer is a hyperparameter that needs to be tuned for a specific task; the Adam optimizer, introduced in 2014 in [84], is employed here. Adam is an adaptive learning rate method with the ability to compute individual learning rates for individual parameters. The learning rate is adapted for each weight, based on estimates of the first and second moments of the calculated gradient. In addition to storing an exponentially decaying average of past squared gradients v_t, Adam also keeps an exponentially decaying average of past gradients m_t, as follows:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \tag{2.36}$$
where m_t and v_t are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively; g_t is the gradient of the current mini-batch; β_1 is the exponential decay rate for the first-moment estimates; and β_2 is the exponential decay rate for the second-moment estimates. If m_t and v_t are initialized as null vectors, they are biased towards zero; this limitation can be addressed by computing bias-corrected first and second moment estimates:
bias-corrected first and second moment estimates:
mt
m̂t =
1 − β1t
vt (2.37)
vˆt =
1 − β2t
which are further utilized to produce the following Adam update rule:

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t \tag{2.38}$$

where ε is a number small enough to prevent division by zero in the implementation.
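A compact sketch of one Adam step following (2.36)-(2.38); the default constants mirror the common values from [84], and the quadratic toy objective is an assumption for demonstration only:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # (2.36): first moment
    v = beta2 * v + (1 - beta2) * grad ** 2     # (2.36): second moment
    m_hat = m / (1 - beta1 ** t)                # (2.37): bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # (2.38)
    return theta, m, v

# Minimize f(theta) = theta^2, whose gradient is 2*theta.
theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 5001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(theta)  # approaches 0
```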
The amount α by which the weights are updated during training is referred to as the step size or the learning rate, which controls the rate or speed at which the model learns. Large learning rates help to regularize the training, but if the learning rate is too large, it results in weight updates that are too large, with the performance of the model oscillating over training epochs, or causes the weights to explode [45]. On the other hand, a learning rate that is too small may never converge or may get stuck in a suboptimal solution [45]. In this work, instead of treating the learning rate as a hyperparameter, it is a value that changes according to the decaying learning rate method [45], which reduces the learning rate by a factor of 0.10 when the performance of the model plateaus.
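In Keras, for example, this behaviour roughly corresponds to the ReduceLROnPlateau callback; the monitored metric, patience, and floor value below are assumptions:

```python
import tensorflow as tf

# Cut the learning rate by a factor of 0.10 when validation loss plateaus.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.1, patience=5, min_lr=1e-6)
# Used later as: model.fit(..., callbacks=[reduce_lr])
```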
Sigmoid: This is referred to as the logistic function and is shown in Figure 2.6. Mathematically, it can be expressed as:

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{2.39}$$
This function takes a real-valued number and compresses it into the range between 0 and 1, i.e., σ(x) ∈ (0, 1). In this work, the sigmoid is used as the activation function in the output layer. As an activation function in the hidden layers, the sigmoid possesses two major drawbacks: it saturates and kills gradients, and its outputs are not zero-centered, which leads to large oscillations in the gradient updates [85].
Figure 2.6: Sigmoid activation function.
Rectified Linear Unit (ReLU): In this work, the activation functions used in the hidden layers are ReLUs, proposed in 2010 [86] and illustrated in Figure 2.7. The ReLU is the most widely used activation function for deep learning applications; it is a faster-learning activation function, offering better performance and generalization compared to other activation functions [87]. The ReLU represents a nearly linear function and therefore preserves the properties of linear models that make them easy to optimize with gradient-descent methods [45].
Figure 2.7: ReLU activation function.
The ReLU activation function performs a threshold operation on each input element, where values less than zero are set to zero; thus, the ReLU is given by:

$$f(x) = \max(0, x) = \begin{cases} x_i, & \text{if } x_i \geq 0 \\ 0, & \text{if } x_i < 0 \end{cases}$$
This function “rectifies” the values of inputs less than zero, forcing them to zero and eliminating the vanishing gradient problem observed in earlier types of activation functions. The main advantage of using the ReLU is that it guarantees faster computation, since it does not require exponentials or divisions [85]. However, the function can overfit more easily than the sigmoid function, which can be addressed using the dropout technique to reduce the effect of overfitting [88]. Another drawback of the simple gradient updates (0 and 1) is that they can lead to neurons “dying” during training; to avoid this problem, Leaky ReLUs have been proposed [89].
Epoch, Batch Size, and Dropout
Another set of hyperparameters that affects the performance of an ANN is the batch size and the number of epochs. An epoch is comprised of one or more batches, and the number of epochs is a hyperparameter that defines the number of times that the learning algorithm propagates through the entire training dataset [90]. Running a model for too many epochs will inevitably lead to overfitting; in order to address this limitation, a regularization technique known as early stopping is used. Early stopping halts the learning process if the error on the validation set has not decreased for a set number of epochs [91].
Batch size is a hyperparameter that defines the number of samples to work through before updating the internal model parameters [90]. There are 3 major types of batch sizes, with mini-batch gradient descent being employed here. A mini-batch is characterized by a batch size larger than 1 sample but smaller than the whole training set. The size of the batch is tuned during the learning process.
In addition to early stopping, another technique used in this thesis to prevent overfitting is dropout. The algorithm is so named because various random subsets of neurons are deactivated during training [92]. Each neuron is kept with a fixed probability p, where p = 0.5 yields the maximum regularization and is close to optimal for a wide range of applications [92].
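These hyperparameters appear together in a typical Keras training setup; the following sketch uses illustrative layer sizes, patience, epoch count, and batch size, not the values tuned later in the thesis:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dropout(0.5),  # drop rate 0.5, i.e. keep probability 0.5
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)

# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=200, batch_size=64, callbacks=[early_stop])
```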
neurons results in overfitting, as the model will memorize samples. Furthermore, choosing
a large number of neurons will increase the training time [94].
Unfolding an RNN results in the parameters being shared across a deep network structure, which performs well for various data sequence lengths [45]. Since RNNs are unfolded into a series of feed-forward neural networks, backpropagation could be a proper training choice; however, because RNNs are very sensitive to any changes in the parameters, regular backpropagation leads to a vanishing gradient problem [96]. The solution to this is to use LSTM networks, which are one of the main models used in this work [97].
Basic LSTM
The core idea behind LSTMs is that, at each time step, a few gates are used to control the passing of information along the sequences, so that long-range dependencies can be captured more accurately [98]. The internal state of the LSTM is controlled by the Constant Error Carousel (CEC) [99]. The internal gates control the flow of information, and thus eliminate the problem of vanishing gradients. The name gate derives from the fact that activation functions decide how much information can be let through, where 1 indicates that all information goes through and 0 cuts off the flow from another node. In the architecture presented in Figure 2.9, the forget gate f_t solves the problem with long-term dependencies that arises if sequential information is not broken into distinct sequences a priori. At each time step t, the hidden state h_t is updated by the current data at the same time step x_t, the hidden state of the previous time step h_{t−1}, the input gate i_t, the forget gate f_t, the output gate o_t, and a memory cell c_t [100]. The following updating equations represent this process:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \tag{2.40}$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \tag{2.41}$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \tag{2.42}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \tag{2.43}$$
$$h_t = o_t \odot \tanh(c_t) \tag{2.44}$$
Figure 2.9: Schematic graph of LSTM cell at time t.
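A hedged sketch of an LSTM binary classifier of this kind in Keras; the window length, feature count, and layer width are placeholder assumptions:

```python
import tensorflow as tf

d, n_features = 12, 8  # assumed: d lagged time steps of n_features signals
model = tf.keras.Sequential([
    tf.keras.Input(shape=(d, n_features)),
    tf.keras.layers.LSTM(64),                        # gates as in (2.40)-(2.44)
    tf.keras.layers.Dense(1, activation="sigmoid"),  # fault / no-fault
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```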
2.3.5 Ensemble
Ensemble systems are characterized by a set of classifiers whose individual decisions are combined in some way to classify new samples. There are several approaches to ensemble learning, such as dividing the training set (bagging), manipulating the data distribution (boosting), manipulating the input features (Random Forest), and manipulating the learning algorithms [101]. In [102], the author states that ensemble models generate more accurate classifications for 3 major reasons: statistical, computational, and representational.
• Statistical: The goal of the learning algorithm is to search the chosen hypothesis space in order to find the hypothesis that best approximates the target function [103]. However, if the training set is much smaller than the hypothesis space, statistical problems emerge: a learning algorithm trained on a small training set can find several hypotheses that produce equal results on the training set, yet whose performance on the validation set varies. By constructing an ensemble of learning algorithms run on the same training set, the risk of picking the wrong model is reduced [104].
• Computational: As previously discussed, the backpropagation algorithm converges only to a local minimum. Having a large dataset and complex patterns in the data increases the computational burden of finding the best hypothesis. Ensemble learning with different starting points may result in a better estimate of the true hypothesis compared to an individual classifier [105].
• Representational: This is based on the idea that there might not be a true hypothesis
h∗ in the hypothesis space. By combining several models from the hypothesis space,
h∗ can be approximated; for example, an ensemble of linear models can be employed
to approximate a nonlinear classification boundary [105].
In this work, the following ensemble methods are used: majority and weighted majority voting, and AdaBoost. These methods are explained next.
In voting ensemble approaches, each classifier classifies an unseen sample based on its own evaluation, and the final prediction is the class with the majority of the votes [102]. The output of the ith classifier is expressed as a binary vector [d_{i,1}, \dots, d_{i,M}] of size M, where M is the number of possible classes and B is the number of classifiers in the ensemble. If the ith classifier votes for class C_j for the new sample, then d_{i,j} = 1; otherwise, it is 0. Thus, the majority voting ensemble predicts the class C_k as follows:
\sum_{i=1}^{B} d_{i,k} = \max_{j=1,\dots,M} \sum_{i=1}^{B} d_{i,j} \qquad (2.46)
In (2.46), all classifiers contribute equally to the final predictions; this can be modified,
so that each classifier is assigned a predetermined weight, which is known as weighted
majority voting and is calculated as follows:
\sum_{i=1}^{B} w_i d_{i,k} = \max_{j=1,\dots,M} \sum_{i=1}^{B} w_i d_{i,j} \qquad (2.47)
where w_i is the weight assigned to the ith classifier.
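To make the two voting rules concrete, the following small Python sketch (illustrative, not from the thesis) implements (2.46) and (2.47) for a toy ensemble; the vote matrix and the weights are made-up values:

import numpy as np

def weighted_majority_vote(votes, weights=None):
    """votes: (B, M) binary matrix with votes[i, j] = d_{i,j};
    weights: length-B vector w_i (equal weights give plain majority voting)."""
    votes = np.asarray(votes, dtype=float)
    w = np.ones(votes.shape[0]) if weights is None else np.asarray(weights, dtype=float)
    scores = w @ votes          # weighted vote count per class, as in (2.46)/(2.47)
    return int(np.argmax(scores))

# Three classifiers, two classes: classifiers 2 and 3 vote for class 1.
votes = [[1, 0], [0, 1], [0, 1]]
print(weighted_majority_vote(votes))             # majority voting -> class 1
print(weighted_majority_vote(votes, [3, 1, 1]))  # heavy first weight -> class 0

With equal weights the second class wins two votes to one, while a sufficiently large weight on the first classifier flips the decision.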
AdaBoost
The ADAptive BOOSTing (AdaBoost) algorithm is based on the idea of boosting [106], which consists of assigning a weight to each training sample and adaptively changing the weights at each boosting round [107]. Boosting algorithms place larger weights on the samples most often misclassified by the previous classifier, so that the next round of learning focuses on them [107]. AdaBoost aims to convert a set of weak classifiers, defined as classifiers performing only slightly better than random guessing, into a strong one, and gets its name from adaptively configuring classifiers in favor of the samples misclassified by previous ones.
In [108], the author formalizes the algorithm as follows: AdaBoost takes as input a training set \{(x_i, y_i)\}_{i=1}^{N}, where x_i \in \mathbb{R}^D and y_i \in \{-1, +1\}, and is based on a weak classifier G_m(x) \in \{-1, +1\}, along with the following 0-1 loss function I [109]:

I(G_m(x_i), y_i) = \begin{cases} 0 & \text{if } G_m(x_i) = y_i \\ 1 & \text{if } G_m(x_i) \neq y_i \end{cases} \qquad (2.48)
the pseudo-code for the AdaBoost algorithm can be defined as follows [108]:
2. For m = 1 to M :
(a) Fit a weak classifier Gm (x) to the training data using weights wi .
PN
wi I(yi 6=Gm (xi ))
(b) Compute errm = i=1
PN .
i=1 wi
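As a cross-check of these steps, the following NumPy/scikit-learn sketch (not the thesis implementation) transcribes the pseudo-code, assuming decision stumps as the weak classifiers G_m and labels in {-1, +1}:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=100):
    """Direct transcription of the pseudo-code above; y must be in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                       # step 1: uniform weights
    learners, alphas = [], []
    for _ in range(M):                            # step 2
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)          # (a) fit weak classifier
        miss = stump.predict(X) != y
        err = np.sum(w * miss) / np.sum(w)        # (b) weighted error
        alpha = np.log((1 - err) / max(err, 1e-10))  # (c) classifier weight
        w = w * np.exp(alpha * miss)              # (d) up-weight misclassified samples
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    # Step 3: sign of the weighted sum of the weak classifiers' votes.
    agg = sum(a * g.predict(X) for a, g in zip(alphas, learners))
    return np.sign(agg)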
2.3.6 Evaluation Metrics

The evaluation metrics used in this work are calculated based on the values in the confusion matrix depicted in Figure 2.10, which is not a performance measure as such, but rather a summary of the prediction results on a classification problem. The diagonal cells show the number of points for which the true label matches the predicted label, while the off-diagonal cells show the samples mislabelled by the trained classifier. In Figure 2.10, the abbreviations represent the following: true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). Based on these values, the metrics are defined as:

\text{Recall} = \frac{TP}{TP + FN} \qquad (2.49)

\text{Precision} = \frac{TP}{TP + FP} \qquad (2.50)

F_1 = \frac{2TP}{2TP + FP + FN} \qquad (2.51)
From these metrics, it follows that Recall can be interpreted as the percentage of actual failures that are correctly predicted, while Precision is the percentage of predicted failures that are actual failures.
In fault prediction, the cost of misclassifying a faulty sample as normal is very high; hence, it is important to maximize the Recall metric. Furthermore, even though the Accuracy metric is commonly used in classification problems [110], it is not used here, since the majority class would dominate the average performance.
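The metrics can be computed directly from the confusion matrix; the following sketch uses placeholder label arrays and verifies the hand computation against scikit-learn:

import numpy as np
from sklearn.metrics import confusion_matrix, recall_score, precision_score

# Placeholder labels: 1 = fault, 0 = normal operation.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)            # (2.49): fraction of actual faults caught
precision = tp / (tp + fp)         # (2.50): fraction of alarms that are real
f1 = 2 * tp / (2 * tp + fp + fn)   # (2.51)
print(recall == recall_score(y_true, y_pred))        # True
print(precision == precision_score(y_true, y_pred))  # True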
2.4.2 Partial Autocorrelation Function
As mentioned earlier, in this work a lagged window is constructed that corresponds to the amount of data history the model uses to make a prediction. However, not all of the lagged variables help in predicting the target; as a result, a feature selection on the lagged variables is needed. A comprehensive review of feature selection for lagged variables is presented in [115]; a popular technique, used in this work, is the PACF [116]. The PACF can be considered a filter method that is independent of any machine learning algorithm and is based on the scores of statistical tests. The PACF at lag k is defined as the correlation that remains after removing the effect of any correlations due to shorter lags [116]. In this work, a significance level of 5% is employed, and those lags that fall outside the upper and lower limits, calculated as \pm 1.96/\sqrt{X}, where X is the length of the series, are kept. A sample PACF is shown in Figure 2.11, using data from [117]; in the feature selection process, the lags inside the bounds are filtered out, resulting in a reduced size of the lag window.
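A minimal sketch of this lag-selection step, assuming a placeholder series and a 72-lag window, could look as follows with statsmodels:

import numpy as np
from statsmodels.tsa.stattools import pacf

series = np.random.randn(5000)            # stand-in for one (differenced) feature
values = pacf(series, nlags=72)           # PACF over the full lag window
bound = 1.96 / np.sqrt(len(series))       # 5% significance limits
kept_lags = [k for k in range(1, 73) if abs(values[k]) > bound]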
2.5 Summary
In this chapter, the background of the main concepts required to develop a predictive model is presented. First, time series analysis, along with common statistical tests, is reviewed. This is followed by general data preprocessing steps, including the data imputation technique used in this thesis. Finally, machine learning is introduced, where concepts such as overfitting and underfitting are discussed, along with the models and techniques used in this work to build a predictive framework, i.e., ANN, LSTM, SVM, ensembles, AdaBoost, and PACF.
Chapter 3
This chapter presents the case study and the results of the implementation of the failure
prediction model, using algorithms presented in Chapter 2. The case study is developed
to test the predictive power of several models using the collected performance data from
the City of Summerside wind farm. The preprocessing steps are explained and results are
presented, discussing the selection of the most adequate final model.
• The input is historical data collected from the SCADA system of the wind farm
installed in Prince Edward Island (PEI), Canada.
• Define a target variable, based on the collected data.
• Perform initial tests on the collected time series dataset to determine its stationarity.
• Create a lagged window that will be used as input variables for model training.
• Perform a feature selection.
• Tune parameters by randomly searching for combinations of parameters in a param-
eter grid.
The steps are discussed in detail next.
The models developed in this thesis are implemented in Python using the follow-
ing package versions: Python 3.6.6 [118], Tensorflow 1.10 [119], Keras 2.2.4
[120], scikit-learn 0.21.2 [72], MatPlotLib 2.2.3 [121], NumPy 1.16.4 [122],
Pandas 0.22.0 [123].
Table 3.1: Dataset features.
Alarms Dataset
Along with the operational dataset, a different dataset that captures any changes in a wind
farm operation state was provided. It is assumed that the turbine is operating in normal
conditions until a new status message is generated. The status dataset comes with several
indicators that are summarized in subcategories, presented in Table 3.2, where:
• Category indicates the possible reason behind an abnormal behaviour and includes the following categories: Environmental, Force Majeure, Manufacturer, Owner, Unscheduled maintenance, Utility.
• Event corresponds to the code that helps to distinguish between a fault and a scheduled service, and is supported by the Event Text that clarifies the code in the Event category.
• Responsible Source Title has a single entry, 'Remote pause by Owner'.
• Time from and Time to indicate the starting and ending times of a fault or a service.
• Total kWh can potentially indicate the total generated power; however, most of its values are 0.
Out of these categories, Event, Turbine, Time from, and Time to are the most informative and are used to construct a dataset representing abnormal behaviour.
Table 3.2: Categories in the alarms dataset.
Category
Comment
Duration
Event
Event text
Lost kWh
Parameter1
Parameter2
ResponsibleSourceTitle
Time from
Time to
Total
Total kWh
Turbine
Missing Data
The original operation dataset has 20,165 samples with values missing at random across the whole wind farm; thus, the number of missing samples for each individual turbine is smaller. Even though dropping rows with missing values is an option, due to the high number of missing samples it was decided to impute them using the MICE technique, as explained in Section 2.2.1, via the FancyImpute package v0.3.2 [125].
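As an illustration, the following sketch performs MICE-style imputation; scikit-learn's IterativeImputer is shown here as an equivalent of the FancyImpute MICE routine, and the DataFrame df is a random stand-in for the operational data:

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame(np.random.rand(100, 4), columns=list("ABCD"))
df.iloc[::7, 2] = np.nan                      # inject some missing entries

imputer = IterativeImputer(random_state=42)   # chained-equations (MICE-style) imputation
df_imputed = pd.DataFrame(imputer.fit_transform(df),
                          columns=df.columns, index=df.index)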
The time window between the Time from and Time to entries of the turbine states is used to match up the associated 10-minute operational data. The 10-minute time band is selected to capture any 10-minute period during which a fault occurred; for example, for fault 188 occurring between 10:19 and 12:19, this approach labels the samples between 10:10 and 12:20 as faults.
It is important to mention that the goal of this work is to predict faults associated with the wind turbine system; thus, the faults that correspond to the Environmental and Force Majeure categories have been dropped. Based on this information, a new feature, No-Fault, is created, where samples are labelled 0 if they correspond to normal operation, or 1 if they correspond to a fault, based on the Event code. By doing so, the problem is framed as a binary classification task.
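A sketch of this labelling step is shown below; the frames ops and alarms, and the set system_codes, are illustrative stand-ins for the operational data, the status log, and the retained Event codes, respectively:

import pandas as pd

idx = pd.date_range("2018-01-01 10:00", periods=20, freq="10min")
ops = pd.DataFrame({"Power": range(20)}, index=idx)
alarms = pd.DataFrame({"Event": [188],
                       "Time from": ["2018-01-01 10:19"],
                       "Time to": ["2018-01-01 12:19"]})
system_codes = {188}   # Event codes kept after dropping Environmental/Force Majeure

ops["No-Fault"] = 0
for _, row in alarms[alarms["Event"].isin(system_codes)].iterrows():
    start = pd.Timestamp(row["Time from"]).floor("10min")  # 10:19 -> 10:10
    end = pd.Timestamp(row["Time to"]).ceil("10min")       # 12:19 -> 12:20
    ops.loc[start:end, "No-Fault"] = 1   # overlapping 10-min samples marked as faults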
Table 3.3: Electrical fault codes.
Code Electrical
188 External Power Supply eq.
176 Error on all wind Sensors
315 ExEx Low Voltage (Grid)
485 Circuit Breaker Open
286 Extreme High Current
707 Chopper Hardware error
959 Heating Slipring in stop
958 Heating Slipring in pause
817 Q8 close not possible
187 Q9 open
489 Timeout generator reconnecting
632 Signal error (Failing signal is listed as CAN-node number)
102 Emergency circuit open
648 Arc Detected, F60 tripped (Environmental)
687 Busbar temperature high
935 Grid Voltage above stop limit (125%)
332 Low DC voltage (Turbine)
846 Frequency error (44.43 Hz)
Table 3.4: Mechanical fault codes.
Code Mechanical
604 Low Pressure in Pitch Block C
603 Low Pressure in Pitch Block B
958 Heating Slipring in Pause
154 Max Rotor RPM
162 Pitch doesnt follow reference
165 Low Oil Level, pitch hydraulic
512 Blade 1,3 could not be released
948 Hub Shut off, valves not open
151 High temperature. Generator bearing
163 Low pitch hydr. Press
296 Tower acceleration X. Alarm
734 Pitch Deviation AB
735 Pitch Deviation AC
844 Gear oil low pressure
191 Related to blade A not moving at the expected speed
560 Low water level
297 Tower Acceleration Y. Alarm
504 Defect lubrication system blade C
Table 3.5: Control fault codes.
Code Control
356 Extreme wind direction
144 High windspeed
100 Too many autorestarts
Table 3.6: Status Codes.
Code Status
220 New Service State
276 Start auto-outyawing
224 Pause
222 Emergency
309 Pause over RCS
223 Stop
Table 3.7: Extra Trees classifier parameters.
Parameters Values
number of estimators 100
criterion Gini
max depth None
min samples split 2
min samples leaf 1
min weight fraction leaf 0
max features auto
max leaf nodes None
min impurity decrease 0
min impurity split 1.00E-07
bootstrap False
oob score False
random state 42
Figure 3.2: Contribution of individual features to the final target.
Based on the output of the feature selection algorithm, the importance threshold is set to 0.03. As a result, the original dataset is reduced from 26 features to 23.
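A minimal sketch of this selection step, using placeholder arrays in place of the 26-feature dataset and the parameters listed above:

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

X = np.random.rand(1000, 26)            # stand-in for the 26-feature dataset
y = np.random.randint(0, 2, 1000)       # stand-in for the No-Fault labels

et = ExtraTreesClassifier(n_estimators=100, criterion="gini",
                          bootstrap=False, random_state=42)
et.fit(X, y)
keep = et.feature_importances_ >= 0.03  # importance threshold from Figure 3.2
X_reduced = X[:, keep]                  # features above the threshold are kept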
Table 3.8: Results for ADF tests.
Table 3.9: Results for KPSS tests.
From Table 3.8, it follows that H0 is rejected, meaning that the time series do not have a unit root and hence are stationary according to the ADF test. On the other hand, from Table 3.9, it follows that in several cases p < 0.05, and thus H0 is rejected, which implies that the series are not stationary and that differencing is needed, as per (2.5). Following the differencing, the updated results are given in Tables 3.10 and 3.11, showing that the time series are stationary according to both the KPSS and ADF tests.
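A sketch of these stationarity checks for a single placeholder feature column, using statsmodels:

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller, kpss

x = pd.Series(np.random.randn(1000).cumsum())  # stand-in feature column

adf_p = adfuller(x)[1]     # ADF H0: the series has a unit root (non-stationary)
kpss_p = kpss(x)[1]        # KPSS H0: the series is stationary
if adf_p >= 0.05 or kpss_p < 0.05:
    x = x.diff().dropna()  # first difference, as per (2.5), then re-test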
Table 3.10: Results for ADF tests after differencing.
Table 3.11: Results for KPSS tests after differencing.
Constructing the lag window increases the dimensionality from 23 to 1656 features. Following the procedure set in (2.7), a prediction horizon of 1 hour is set as well.
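The following pandas sketch illustrates the construction; the 72-lag window (23 x 72 = 1656 inputs) and the 6-step, i.e., 1-hour, target shift are assumptions consistent with the dimensions reported above, and the frame is a random stand-in:

import numpy as np
import pandas as pd

idx = pd.date_range("2018-01-01", periods=500, freq="10min")
df = pd.DataFrame(np.random.rand(500, 23),
                  columns=[f"f{i}" for i in range(23)], index=idx)
df["No-Fault"] = np.random.randint(0, 2, 500)

feature_cols = [c for c in df.columns if c != "No-Fault"]
lagged = pd.concat({f"{c}_lag{k}": df[c].shift(k)
                    for c in feature_cols for k in range(1, 73)}, axis=1)
target = df["No-Fault"].shift(-6).rename("target")   # fault state 1 hour ahead
data = pd.concat([lagged, target], axis=1).dropna()  # 1656 inputs plus the target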
Figure 3.3: PACF of “Grid Production VoltagePhase” feature.
SVM (Section 2.3.2): The search grid consists of the parameters in Table 3.12, fed into the RandomizedSearchCV class, along with 5 splits for cross-validation; a sketch of this search is given after the table below. In order to obtain reproducible results, the random state parameter is set to 42, guaranteeing that the same sequence of random numbers is generated. The resulting model is saved to a file.
Table 3.12: SVM hyperparameter search grid.
Hyperparameters Values
C 0.01, 0.1, 1, 10, 100, 1000, 10000
γ 1e-1, 1e-2, 1e-3, 1e-4, 1e-5
kernel linear, rbf, poly
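A sketch of the randomized search described above, with placeholder training arrays; the grid, the 5 time-series splits, and the random state of 42 follow the text, while saving via joblib is an illustrative choice:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from joblib import dump

X_train = np.random.rand(300, 10)            # placeholder training arrays
y_train = np.random.randint(0, 2, 300)

param_grid = {"C": [0.01, 0.1, 1, 10, 100, 1000, 10000],
              "gamma": [1e-1, 1e-2, 1e-3, 1e-4, 1e-5],
              "kernel": ["linear", "rbf", "poly"]}

search = RandomizedSearchCV(SVC(), param_grid,
                            cv=TimeSeriesSplit(n_splits=5),  # 5 time-series CV splits
                            random_state=42)                 # reproducible draws
search.fit(X_train, y_train)
dump(search.best_estimator_, "svm_model.joblib")  # persist the resulting model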
ANN (Section 2.3.3): The following parameters have been fed into the RandomizedSearchCV class, along with 5 splits for cross-validation. Unlike the SVM, for the ANN another independent model is built using the parameters selected by Random Search; in this independent model, the parameters are fine-tuned using a validation set. It is worth noting that the batch sizes are chosen as powers of 2, as the authors in [130] show that such batch sizes offer better runtimes.
Table 3.13: ANN hyperparameter search grid.
Hyperparameters Values
Neurons in the hidden layer 15, 20, 25, 30, 35, 40, 45, 50, 55, 60
Batch size 16, 32, 64, 128
Epochs 150
Learning rate 1E-04
LSTM (Section 2.3.4): The LSTM is tuned in a similar fashion as the ANN, with the same parameters. The primary difference is that the LSTM requires the data to be in the format [samples, timesteps, features]; however, there is no theoretical foundation for how the 2D data should be turned into the 3D format needed by the LSTM. Several experiments were carried out, and the best results were obtained with the format [samples, 1, features], as opposed to [samples, features, 1]; a sketch of this reshaping is given after the table below. In addition, the problem of exploding gradients was observed; thus, the AMSGrad setting of the Adam optimizer in the Keras package is set to True, which addresses a convergence problem of Adam [131].
Table 3.14: LSTM hyperparameter search grid.
Hyperparameters Values
Neurons in the hidden layer 15, 20, 25, 30, 35, 40, 45, 50, 55, 60
Batch size 16, 32, 64, 128
Epochs 150
Learning rate 1E-04
AMSGrad True
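A minimal Keras sketch of the reshaping and optimizer settings above, using placeholder arrays; the 45 hidden neurons match Table 3.26, while the remaining values are illustrative:

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.optimizers import Adam

X_train = np.random.rand(500, 1656)          # placeholder lagged feature matrix
y_train = np.random.randint(0, 2, 500)

# Reshape 2D data into the [samples, timesteps, features] format; the
# experiments favoured [samples, 1, features] over [samples, features, 1].
X_train_3d = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))

model = Sequential()
model.add(LSTM(45, input_shape=(1, X_train.shape[1])))
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer=Adam(lr=1e-4, amsgrad=True),  # AMSGrad variant [131]
              loss="binary_crossentropy")
model.fit(X_train_3d, y_train, epochs=150, batch_size=64)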
AdaBoost (Section 2.3.5): As discussed in [132], AdaBoost does not require the classes to be balanced, as it recalculates the sample weights for each subsequent weak learner, which results in higher weights being assigned to the minority class at each iteration. The simulation results obtained further demonstrate that the performance of the algorithm is identical whether the classes are balanced or unbalanced. At the same time, AdaBoost demonstrated good performance with the default values, and only minor manual tuning of the learning rate was needed; hence, Random Search is not used in this case.
Table 3.15: AdaBoost hyperparameters.
Hyperparameters Values
Learning rate 1E-05
Number of estimators 100
Random state 42
3.7 Results
The available dataset is divided into training, validation, and testing sets with the following ratios: 70% for training, 15% for validation, and 15% for testing; a sketch of this chronological split is given below. The training set is balanced as explained in Section 2.2.3, and thus the objective function to be minimized is given in (2.12), except in the case of AdaBoost, whose objective function is (2.10). The selection of optimal parameters from the parameter grid via RandomizedSearch is performed on Ubuntu 16.04 with a GTX 1070 GPU and 16 GB of RAM. The models are fine-tuned on Windows 10 with an Intel Core i7 CPU and 16 GB of RAM.
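A minimal sketch of the chronological 70/15/15 split, with a placeholder DataFrame standing in for the preprocessed dataset:

import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.rand(1000, 5))  # placeholder preprocessed dataset
n = len(data)
train = data.iloc[:int(0.70 * n)]             # first 70% in time order
val = data.iloc[int(0.70 * n):int(0.85 * n)]  # next 15% for fine-tuning
test = data.iloc[int(0.85 * n):]              # final 15% held out for testing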
3.7.1 SVM
The SVM model is invoked through the scikit-learn package and used as the tunable model in RandomizedSearch with 5 time-series cross-validation splits. The obtained parameters are shown in Table 3.16, the model results are given in Table 3.17, and the confusion matrix is shown in Table 3.18. It is important to note that running the same model as a stand-alone will not change the final results. The SVM training took about 17 hours, as the scikit-learn library does not support GPU hardware acceleration. The GPU-supported library ThunderSVM [133] was also used for model tuning at significantly reduced computational times; however, this library is unable to save the final model, and hence it could not be used for further model processing and testing.
Table 3.16: SVM model hyperparameters obtained with Random Search.
Parameters Value
Kernel Linear
gamma 0.1
C 1
Table 3.18: Confusion matrix for SVM model.
3.7.2 ANN
The ANN model is trained in a similar fashion as the SVM model. The preset model parameters are shown in Table 3.19, and the selected parameters returned by Random Search are shown in Table 3.20. The model results for the testing dataset are given in Table 3.21.
Table 3.20: Initial ANN model hyperparameters obtained with Random Search.
Parameters Value
Neurons in the Hidden Layer 60
Batch size 128
Table 3.21: Testing dataset results for initial ANN model.
After Random Search, an improved model is built with the number of epochs set to 1000 and with callbacks such as early stopping and a decaying learning rate; a sketch of these callbacks follows Table 3.22. To fine-tune the model, the validation set is used, resulting in the parameters in Table 3.22, the final results in Table 3.23, and the confusion matrix in Table 3.24. The learning curves showing the change in learning performance over time are depicted in Figure 3.4, demonstrating how a statistical fit is reached. Similar performance can be achieved with Grid Search, or with a larger parameter grid, but the computational time increases.
Table 3.22: Final ANN model hyperparameters.
Parameters Value
Neurons in the Hidden Layer 50
Batch size 64
Reduced learning rate 1E-05
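A sketch of the fine-tuning callbacks described above, reusing the model and arrays from the earlier ANN sketch; the patience values are illustrative assumptions, while the learning-rate floor of 1E-05 matches Table 3.22:

from keras.callbacks import EarlyStopping, ReduceLROnPlateau

callbacks = [
    # Stop once the validation loss stalls, keeping the best weights.
    EarlyStopping(monitor="val_loss", patience=20, restore_best_weights=True),
    # Decay the learning rate on plateaus, down to the 1E-05 floor.
    ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=10, min_lr=1e-5),
]
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=1000, batch_size=64, callbacks=callbacks)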
Figure 3.4: Change in the final ANN model loss versus epochs.
3.7.3 LSTM
The LSTM model is trained in a similar fashion as the ANN model. The preset hyperparameters are given in Table 3.25, the parameters obtained from Random Search are given in Table 3.26, and the results in Table 3.27.
Table 3.26: Initial LSTM model hyperparameters obtained with Random Search.
Parameters Value
Neurons in the Hidden Layer 45
Batch size 64
Similar to the ANN, an improved model is built using the callbacks and the validation set. The obtained parameters are shown in Table 3.28, the final results in Table 3.29, and the confusion matrix in Table 3.30. The learning curves illustrating the learning process are shown in Figure 3.5, demonstrating how a statistical fit is reached.
Table 3.28: Final LSTM model hyperparameters.
Parameters Value
Neurons in the Hidden Layer 35
Batch size 64
Reduced learning rate 1E-05
Table 3.30: Confusion Matrix for final LSTM model.
Figure 3.5: Change in the final LSTM model loss versus epochs.
3.7.4 Majority Voting and Weighted Majority Voting
The pretrained ANN, LSTM, and SVM models are combined through majority and weighted majority voting; the weight sets considered are listed in Table 3.31.
Table 3.31: Weights assigned to various pretrained algorithms.
Model Weights
ANN 1 1 1
LSTM 1 0.5 1
SVM 1 2 2
3.7.5 AdaBoost
In the case of AdaBoost, Random Search is not needed, as good performance is achieved with the default parameters. The only parameter that needs to be tuned is the learning rate, which is reduced to 0.00001 from the default of 1; the classes do not need to be balanced in this case. The results of the model are given in Table 3.35, and the confusion matrix in Table 3.36. The results for the balanced dataset are shown in Table 3.37, and the confusion matrix in Table 3.38. As can be seen from the results, there is no requirement for the dataset to be balanced prior to applying AdaBoost.
Table 3.36: Confusion matrix for AdaBoost model, applied to imbalanced dataset.
Table 3.38: Confusion matrix for AdaBoost model, applied to balanced dataset.
3.7.6 Discussion
As can be seen from the aforementioned tables, the models show similar performance in terms of precision, recall, and F1 score. The trained models can be deployed in real-world settings as a self-contained application, and while re-training on new incoming data is an option, it is not required. The proposed models can take the incoming data and produce the probability of a fault occurring in an hour.
The results include the prediction of both normal and faulty conditions; the focus of this work is on predicting the faulty state 1 hour in advance. In the case of predicting the normal operating state, the high precision and recall values can be explained by the high number of samples. The final recall for predicting a fault is 83%, which is reasonable and in line with the typical forecast scores expected from the machine learning tools studied here.
Even though all models show good performance, the final prediction methodology should include the best model. However, the simplicity and interpretability of a model also need to be considered in the selection. In [134], the authors apply Occam's razor principle, i.e., not using more than necessary, to choose the model that is simplest to interpret, explain to a client, and maintain in the long run. It is also important to add that ease of model tuning is crucial during the development stage. Therefore, considering the aforementioned criteria, the AdaBoost model should be used in the proposed fault prediction method, as it is the fastest to run and proved to be the easiest to tune.
3.8 Summary
In this chapter, the algorithms discussed in Chapter 2 were applied to the operation data collected from the wind farm in Summerside, PEI, illustrating the proposed fault prediction methodology. First, the data was labeled according to the error log available online. Next, the missing data was imputed, and statistical tests were run to determine the stationarity of the time series. To capture the historical patterns, a lag window approach was employed, and to retain the most relevant lags, PACF-based feature selection was applied, thus reducing the number of features in the training dataset. The simulations were conducted, the performance of several classification models was compared, and the results were discussed to select the final model. The obtained results demonstrate the effectiveness of the proposed methodology in predicting faults in the wind farm.
Chapter 4
Conclusion
4.1 Summary and Conclusions
The main conclusions of this thesis are summarized as follows:
• The results obtained with the proposed framework demonstrate the potential for its actual deployment to minimize the downtime of the wind farm studied.
• The presented results demonstrate that, given enough historical data, machine learning models can properly capture the underlying failure patterns for fault prediction.
• The developed models can be trained off-line and used on-line to quickly and adequately predict, in real time, the probability of a fault occurring within the next 1-hour horizon.
4.2 Contributions
The following are the major contributions of the presented research work for the practical failure prediction of wind farms:
• Develop, compare, and apply several machine learning algorithms with the goal of
determining the best model.
• Perform a feature selection analysis to identify the most relevant features and lags
that would influence the prediction.
4.3 Future Work
The following directions are suggested for future work:
• Further test the proposed methodology with new data from different sites; in this case, the idea of transfer learning can be explored [135].
• Adapt the proposed framework for real-time deployment, and evaluate its performance under real-life conditions.
• In this thesis, ANN, SVM, LSTM, and ensemble machine learning models have been utilized. Hence, the exploration of other models, such as k-nearest neighbors, or of methods traditionally used in other domains, such as Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), and Variational AutoEncoders (VAEs), could be studied.
• Explore the use of reinforcement learning, which does not need large amounts of labeled data, as in [136], where it is applied to predictive maintenance.
References
[1] V. Smil, Energy Transitions: Global and National Perspectives. Praeger, 2017.
[2] A. Fakhouri and A. Kuperman, "Backup of renewable energy for an electrical island: Case study of Israeli electricity system: current status," The Scientific World Journal, vol. 2014, pp. 609–687, January 2014.
[3] U.S. Department of Energy, “20% Wind energy by 2030: increasing wind energy’s
contribution to US electricity supply,” Tech. Rep., 2008.
[4] S. Gsanger, "Wind power capacity worldwide reaches 597 GW, 50.1 GW added in 2018," Jun 2019. [Online]. Available: https://wwindea.org/blog/2019/02/25/wind-power-capacity-worldwide-reaches-600-gw-539-gw-added-in-2018/
[5] C. A. Walford, “Wind turbine reliability: understanding and minimizing wind tur-
bine operation and maintenance costs,” Sandia National Laboratories, Albuquerque,
New Mexico, Tech. Rep., March 2006.
[6] B. Lu, Y. Li, X. Wu, and Z. Yang, “A review of recent advances in wind turbine
condition monitoring and fault diagnosis,” in 2009 IEEE Power Electronics and Ma-
chines in Wind Applications, June 2009, pp. 1–7.
[9] R. Velmurugan and T. Dhingra, “Maintenance strategy selection and its impact in
maintenance function: A conceptual framework,” International Journal of Opera-
tions and Production Management, vol. 35, pp. 1622–1661, December 2015.
[10] Q. Hao, Y. Xue, W. Shen, B. Jones, and J. Zhu, “A decision support system for
integrating corrective maintenance, preventive maintenance and condition-based
maintenance,” in Construction Research Congress 2010: Innovation for Reshaping
Construction Practice - Proceedings of the 2010 Construction Research Congress,
Banff Alberta, Canada, July 2010, pp. 470–479. [Online]. Available: https:
//www2.scopus.com/inward/record.uri?eid=2-s2.0-77956328783&doi=10.1061%
2f41109%28373%2947&partnerID=40&md5=b4e6744f08e923890756882c0884fe13
[11] S. Nachimuthu, M. Zuo, and Y. Ding, “A decision-making model for corrective main-
tenance of offshore wind turbines considering uncertainties,” Energies, vol. 12, pp.
1–13, April 2019.
[12] S. Butler, “Prognostic algorithms for condition monitoring and remaining useful
life estimation,” Ph.D. dissertation, National University of Ireland Maynooth, 2012.
[Online]. Available: http://mural.maynoothuniversity.ie/3994/
[15] J. Moubray, Reliability-centered maintenance, 2nd ed. New York : Industrial Press,
1997.
[18] F. Cheng, L. Qu, and W. Qiao, “A case-based data-driven prediction framework for
machine fault prognostics,” in 2015 IEEE Energy Conversion Congress and Exposi-
tion (ECCE), Sep. 2015, pp. 3957–3963.
[20] K. Javed, “A robust & reliable Data-driven prognostics approach based on extreme
learning machine and fuzzy clustering.” Ph.D. dissertation, Université de Franche-
Comté, April 2014. [Online]. Available: https://tel.archives-ouvertes.fr/tel-01025295
[22] M. G. Pecht and R. Jaai, “A prognostics and health management roadmap for infor-
mation and electronics-rich systems,” Microelectronics Reliability, vol. 50, pp. 317–
323, 2010.
[23] V. T. Tran and B. S. Yang, “Machine fault diagnosis and prognosis: The state of the
art,” The International Journal of Fluid Machinery and Systems (IJFMS), vol. 2, pp.
61–71, January 2009. [Online]. Available: http://eprints.hud.ac.uk/id/eprint/16578/
[27] I. Roychoudhury and M. Daigle, “An Integrated Model-Based Diagnostic and Prog-
nostic Framework,” NASA, NASA Ames Research Center, Moffett Field, CA, Tech.
Rep., 2011.
[28] K. Abid, M. Sayed Mouchaweh, and L. Cornez, “Fault prognostics for the predictive
maintenance of wind turbines: State of the art,” in Communications in Computer
and Information Science ECML PKDD 2018 Workshops, Dublin, Ireland, March
2019, pp. 113–125.
[32] J. Yan, M. Ko, and J. Lee, “A prognostic algorithm for machine performance as-
sessment and its application,” Production Planning & Control, vol. 15, no. 8, pp.
796–801, February 2004.
[34] W. Wang, “A model to predict the residual life of rolling element bearings given
monitored condition information to date,” IMA Journal of Management Mathemat-
ics, vol. 13, no. 1, pp. 3–16, January 2002.
[36] A. Kusiak and A. Verma, “Analyzing bearing faults in wind turbines: A data-mining
approach,” Renewable Energy, vol. 48, pp. 110–116, May 2012. [Online]. Available:
http://dx.doi.org/10.1016/j.renene.2012.04.020
[37] A. Kusiak and A. Verma, “A data-driven approach for monitoring blade pitch faults
in wind turbines,” IEEE Transactions on Sustainable Energy, vol. 2, no. 1, pp. 87–96,
Jan 2011.
[41] A. Kusiak and W. Li, “The prediction and diagnosis of wind turbine faults,” Renew-
able Energy, vol. 36, no. 1, pp. 16–23, October 2011.
[42] A. Zaher, S. McArthur, D. Infield, and Y. Patel, “Online wind turbine fault
detection through automated scada data analysis,” Wind Energy, vol. 12, no. 6,
pp. 574–593, January 2009. [Online]. Available: https://onlinelibrary.wiley.com/
doi/abs/10.1002/we.319
[43] Y. Zhao, D. Li, A. Dong, D. Kang, Q. Lv, and L. Shang, “Fault prediction and
diagnosis of wind turbine generators using SCADA data,” Energies, vol. 10, no. 8,
pp. 1–17, August 2017.
[44] L. Wang, Z. Zhang, J. Xu, and R. Liu, “Wind turbine blade breakage monitoring with
deep autoencoders,” IEEE Transactions on Smart Grid, vol. 9, no. 4, pp. 2824–2833,
July 2018.
[45] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016,
http://www.deeplearningbook.org.
[46] P. Bangalore and L. B. Tjernberg, “An artificial neural network approach for early
fault detection of gearbox bearings,” IEEE Transactions on Smart Grid, vol. 6, no. 2,
pp. 980–987, March 2015.
[47] P. Baraldi, F. Cadini, F. Mangili, and E. Zio, “Model-based and data-
driven prognostics under different available information,” Probabilistic Engineering
Mechanics, vol. 32, pp. 66 – 79, November 2013. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0266892013000143
[48] J. Liu, W. Wang, F. Ma, Y. Yang, and C. Yang, “A data-model-fusion prognostic
framework for dynamic system state forecasting,” Engineering Applications of
Artificial Intelligence, vol. 25, no. 4, pp. 814 – 823, February 2012. [Online].
Available: http://www.sciencedirect.com/science/article/pii/S0952197612000528
[49] S. Cheng and M. Pecht, “A fusion prognostics method for remaining useful life predic-
tion of electronic products,” in 2009 IEEE International Conference on Automation
Science and Engineering, Bangalore, India, August 2009, pp. 102–107.
[50] M. Asgarpour and J. D. Sørensen, “Bayesian based Diagnostic Model for Condition
based Maintenance of Offshore Wind Farms,” Energies, vol. 11, no. 2, pp. 1–17,
January 2018.
[51] O. Brandes, J. Farley, M. Hinich, and U. Zackrisson, “The time domain and the fre-
quency domain in time series analysis,” Scandinavian Journal of Economics, Wiley,
vol. 70, pp. 25–42, March 1968.
[52] R. Adhikari and R. K. Agrawal, “An introductory study on time series modeling
and forecasting,” Computing Research Repository (CoRR), pp. 1–67, February 2013.
[Online]. Available: http://arxiv.org/abs/1302.6613
[53] R. H. Shumway and D. S. Stoffer, Time Series Analysis and Its Applications
(Springer Texts in Statistics). Berlin, Heidelberg: Springer-Verlag, 2005.
[54] YellowRoad, “Beware of data leakage,” Jul 2017. [Online]. Available: https:
//blog.myyellowroad.com/beware-of-data-leakage-f6c307009ad9
[55] R. Hyndman and G. Athanasopoulos, Forecasting: Principles and Practice, 2nd ed.
Australia: OTexts, 2018.
[56] M. Hauskrecht, “Cs 3750 advanced topics in machine learning,” October
2018. [Online]. Available: http://people.cs.pitt.edu/∼milos/courses/cs3750/lectures/
class16.pdf
[57] M. Delfs, “Forecasting in the supply chain with machine learning techniques,”
Master’s thesis, University of Bamberg, Germany, Jun 2018. [Online]. Available:
https://www.uni-bamberg.de/en/cogsys/research/theses/advised-theses/
[58] M. Geurts, G. E. P. Box, and G. M. Jenkins, Time Series Analysis: Forecasting and
Control. Wiley, 2015.
[62] S. Ben Taieb, G. Bontempi, A. Atiya, and A. Sorjamaa, “A review and comparison
of strategies for multi-step ahead time series forecasting based on the nn5 forecasting
competition,” Expert Systems with Applications, vol. 39, pp. 7067–7083, August 2011.
[65] A. Sorjamaa and A. Lendasse, “Time series prediction using dirrec strategy,” in
European Symposium on Artificial Neural Networks Bruges, Belgium, April 2006,
pp. 1–6.
[68] Z. Wu, W. Lin, and Y. Ji, “An integrated ensemble learning model for imbalanced
fault diagnostics and prognostics,” IEEE Access, vol. 6, pp. 8394–8402, 2018.
[69] H. He, Y. Bai, E. A. Garcia, and S. Li, "ADASYN: Adaptive synthetic sampling approach for imbalanced learning," in 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, June 2008, pp. 1322–1328.
[70] T. Fawcett, “Learning from imbalanced classes,” Sep 2017. [Online]. Available:
https://www.svds.com/learning-imbalanced-classes/
[71] L. Nieradzik, “Losses for image segmentation,” Sep 2018. [Online]. Available:
https://lars76.github.io/neural-networks/object-detection/losses-for-segmentation/
[74] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20,
no. 3, pp. 273–297, Sep. 1995. [Online]. Available: https://doi.org/10.1023/A:
1022627411411
[75] Z. Zhang and A. Kusiak, “Monitoring wind turbine vibration based on scada data,”
Journal of Solar Energy Engineering, vol. 134, no. 2, pp. 1–12, February 2012.
[Online]. Available: https://doi.org/10.1115/1.4005753
[76] C. Ma, M. A. Randolph, and J. Drish, "A support vector machines-based rejection technique for speech recognition," in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings (Cat. No.01CH37221), vol. 1, Salt Lake City, UT, USA, May 2001, pp. 381–384.
[78] Y. Vidal, F. Pozo, and C. Tutivn, “Wind Turbine Multi-Fault Detection and Clas-
sification Based on SCADA Data,” Energies, vol. 11, no. 11, pp. 1–18, November
2018.
[79] A. Ng, “CS 229 - support vector machines,” Aug 2019. [Online]. Available:
http://cs229.stanford.edu/notes/cs229-notes3.pdf
[80] T. Fletcher, “Support vector machines explained,” Feb 2008. [Online]. Available:
https://cling.csd.uwo.ca/cs860/papers/SVM Explained.pdf
[81] A. Ben-Hur and J. Weston, “A user’s guide to support vector machines.” Methods
in molecular biology, vol. 609, pp. 223–239, October 2010.
[83] O. K. Ernst, “Stochastic gradient descent learning and the backpropagation algo-
rithm,” University of California, San Diego, La Jolla, CA, Tech. Rep., 2014.
[84] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Interna-
tional Conference on Learning Representations (ICLR), San Diego, CA, December
2015, pp. 1–15.
[86] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann
machines,” in Proceedings of the 27th International Conference on International
Conference on Machine Learning, Haifa, Israel, 2010, pp. 807–814. [Online].
Available: http://dl.acm.org/citation.cfm?id=3104322.3104425
[87] G. E. Dahl, T. N. Sainath, and G. E. Hinton, “Improving deep neural networks for
LVCSR using rectified linear units and dropout,” in IEEE International Conference
on Acoustics, Speech and Signal Processing - Proceedings (ICASSP), Vancouver, BC,
Canada, May 2013, pp. 1–5.
[88] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in
Proceedings of the Fourteenth International Conference on Artificial Intelligence and
Statistics, vol. 15, Fort Lauderdale, FL, USA, April 2011, pp. 315–323. [Online].
Available: http://proceedings.mlr.press/v15/glorot11a.html
[89] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural
network acoustic models,” in 30th International Conference on Machine Learning,
Atlanta, USA, June 2013, pp. 1–6.
[90] J. Brownlee, “What is the difference between a batch and an epoch in a neural
network?” Aug 2019. [Online]. Available: https://machinelearningmastery.com/
difference-between-a-batch-and-an-epoch/
[94] J. Heaton, “The number of hidden layers,” Dec 2018. [Online]. Available:
https://www.heatonresearch.com/2017/06/01/hidden-layers.html
[95] D. E. Rumelhart, J. L. McClelland, and C. PDP Research Group, Eds., Parallel Dis-
tributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foun-
dations. Cambridge, MA, USA: MIT Press, 1986.
[96] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gra-
dient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp.
157–166, March 1994.
[98] Y. Zhang, R. Xiong, H. He, and M. G. Pecht, “Long short-term memory recurrent
neural network for remaining useful life prediction of lithium-ion batteries,” IEEE
Transactions on Vehicular Technology, vol. 67, no. 7, pp. 5695–5705, July 2018.
[99] A. Graves, N. Beringer, and J. Schmidhuber, “A comparison between spiking and
differentiable recurrent neural networks on spoken digit recognition,” in Interna-
tional Conference on Neural Networks and Computational Intelligence, Grindelwald,
Switzerland, January 2004, pp. 164–168.
[100] R. Zhao, J. Wang, R. Yan, and K. Mao, “Machine health monitoring with LSTM
networks,” in 10th International Conference on Sensing Technology (ICST), Nanjing,
China, Nov 2016, pp. 1–6.
[101] M. Farrash, “Machine learning ensemble method for discovering knowledge from big
data,” Ph.D. dissertation, University of East Anglia, 2016.
[102] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms. New York,
NY, USA: Wiley-Interscience, 2004.
[104] R. Polikar, “Ensemble based systems in decision making,” IEEE Circuits and Systems
Magazine, vol. 6, pp. 21–45, September 2006.
[109] M. Brubaker, “CSC 411 - adaboost handout,” September 2015. [Online]. Available:
http://www.cs.toronto.edu/∼mbrubake/teaching/C11/Handouts/AdaBoost.pdf
[110] M. B. Hossin, “A review on evaluation metrics for data classification evaluations,”
International Journal of Data Mining Knowledge Management Process, vol. 5, no. 2,
pp. 1–11, Mar 2015.
[111] K. Deng, “Omega: On-line memory-based general purpose system classifier,” Ph.D.
dissertation, Carnegie Mellon University, Pittsburgh, PA, November 1998.
[112] A. Jović, K. Brkić, and N. Bogunović, "A review of feature selection methods with applications," in 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, May 2015, pp. 1200–1205.
[115] R. May, G. Dandy, and H. Maier, “Review of input variable selection methods for
artificial neural networks,” Artificial Neural Networks-Methodological Advances and
Biomedical Applications, pp. 1–28, April 2011.
[116] P. S. P. Cowpertwait and A. V. Metcalfe, Introductory Time Series with R, 1st ed.
Springer Publishing Company, Incorporated, 2009.
[118] G. Van Rossum and F. L. Drake Jr, Python tutorial. Centrum voor Wiskunde en
Informatica Amsterdam, The Netherlands, 1995.
[120] F. Chollet et al., “Keras,” https://keras.io, 2015.
[126] P. Chen, B. Liao, G. Chen, and S. Zhang, “Understanding and utilizing deep neural
networks trained with noisy labels,” Computing Research Repository (CoRR), pp.
1–13, May 2019. [Online]. Available: http://arxiv.org/abs/1905.05040
[127] J. Han, P. Luo, and X. Wang, “Deep self-learning from noisy labels,”
Computing Research Repository (CoRR), pp. 1–10, August 2019. [Online]. Available:
https://arxiv.org/pdf/1908.02160.pdf
[128] “Fehlerliste v90,” Tech. Rep., Apr 2015. [Online]. Available: https://dokumen.tips/
documents/fehlerlistev902944665r11.html
[130] S. Waslander, “Optimization for training deep models,” Jun 2017. [Online]. Available:
http://wavelab.uwaterloo.ca/wp-content/uploads/2017/04/Lecture-4-1.pdf
[131] P. T. Tran and L. T. Phong, "On the convergence proof of AMSGrad and a new version," IEEE Access, vol. 7, pp. 61706–61716, May 2019.
[132] Y. Sun, M. S. Kamel, and Y. Wang, “Boosting for learning multiple classes with
imbalanced class distribution,” in Sixth International Conference on Data Mining
(ICDM’06), Hong Kong, China, Dec 2006, pp. 592–602.
[135] L. Torrey, T. Walker, J. Shavlik, and R. Maclin, "Using advice to transfer knowledge acquired in one reinforcement learning task to another," in Proceedings of the 16th European Conference on Machine Learning, Porto, Portugal, October 2005, pp. 412–424. [Online]. Available: http://dx.doi.org/10.1007/11564096_40
[136] C. Zhang, C. Gupta, A. Farahat, K. Ristovski, and D. Ghosh, “Equipment health in-
dicator learning using deep reinforcement learning,” in Machine Learning and Knowl-
edge Discovery in Databases, Dublin, Ireland, September 2019, pp. 488–504.