Jakub Bolesław Gęca
http://doi.org/10.35784/iapgos.1834
Abstract. The consequences of failures and unscheduled maintenance are the reasons why engineers have been trying to increase the reliability of industrial equipment for years. In modern solutions, predictive maintenance is a frequently used method: it makes it possible to forecast failures and warn about their possibility. This paper presents a summary of the machine learning algorithms that can be used in predictive maintenance and a comparison of their performance. The analysis was made on the basis of a data set from the Microsoft Azure AI Gallery. The paper presents a comprehensive approach to the issue, including feature engineering, preprocessing, dimensionality reduction techniques, as well as tuning of model parameters in order to obtain the highest possible performance. The conducted research shows that in the analysed case the best algorithm achieved 99.92% accuracy on over 122 thousand test data records. In conclusion, predictive maintenance based on machine learning represents the future of machine reliability in industry.
Keywords: machine learning, random forest, predictive maintenance, neural networks
Introduction

Today's industry is facing new problems associated with the constant growth of production as well as higher accuracy and safety requirements. In addition, the international market is very competitive in terms of prices, which are highly dependent on production speed and reliability. Machines and automated equipment are essential parts of a manufacturing process, which means that if a certain component fails, it causes financial losses related to downtime of the production process. Moreover, some failures may lead to safety violations, which are of course far more undesirable.

To avoid danger and financial losses, many maintenance strategies are used in industry. According to Susto et al. [23], maintenance approaches can be classified as follows:
- Corrective maintenance (also Run-to-Failure, R2F): this method consists of replacing or fixing a certain component after it fails. It is the most straightforward approach, but also the most ineffective one. It leads to additional costs associated with downtime and unscheduled maintenance, often including the spare parts delivery interval.
- Preventive maintenance (PvM): maintenance interventions are performed regularly to avoid unscheduled stoppages. The time between interventions is based on knowledge about a certain system component, but does not make full use of its lifetime. Thus scheduled maintenance may cause additional costs related to unnecessary repairs.
- Predictive maintenance (PdM): the goal of PdM is to forecast failures before they occur. This is possible thanks to monitoring and data acquisition systems, which provide useful information about the history of the machine and its current state. Predictions are based on historical data, defined health factors, engineering approaches and statistical inference methods.

Machine learning algorithms have proved to be very effective in terms of failure prediction and remaining useful life (RUL) estimation [10, 17, 24]. They can also be used in a wide range of industrial applications such as engine soot emission prediction [18], gearbox failure prediction [11] and forecasting of robotic manipulation failures [20]. Moreover, predictive models are very popular in other fields of technology. In [22] the authors used a decision tree algorithm for hard disc drive failure prediction. Korvesis et al. [15] predicted failures from post-flight reports using random forest and support vector machine (SVM) models. Although machine learning methods are often utilized and give good results, scientists are still working on other interesting techniques [2, 13]. Some forecasting tasks are complicated by missing maintenance history or other types of data, so the authors in [6] proposed a hybrid semi-supervised approach. Kanawaday and Sane [12] came up with the idea of first predicting production cycle parameters with an ARIMA (AutoRegressive Integrated Moving Average) model and then feeding a supervised classifier with these values.

This paper presents a comprehensive approach to predictive maintenance, in which the performance of eight machine learning algorithms with tuned parameters is compared. To the best of the author's knowledge, and according to [4], there is no such work in the literature.

1. Data structure and preprocessing

The data comes from the Microsoft Azure AI Gallery and is dedicated to predictive maintenance modelling [25]. It consists of five datasets that contain useful information about a group of identical industrial machines. Every machine has its own identification number, which indicates the model and age of the machine. The first dataset includes real-time telemetry data: timestamp, voltage, rotation, pressure and vibration values. Error messages are in the second dataset. The remaining datasets contain information about the machines, the maintenance history (timestamp and replaced component ID) and the failures (timestamp and broken component ID).

Preprocessing starts with feature engineering, which is important to extract as much useful information from the data as possible. First of all, it should be determined how far back the algorithm should "look" in order to predict failures. This is the so-called lookback parameter, used to create lag features that constitute the short-term history of the machine. The width of this time window has to be discussed with an expert in the particular field. It is also important to remember that if this time is too long, the data will be too noisy for the algorithm to predict with satisfactory performance. On the other hand, if the time window is too small, it will contain too little information to determine the risk of
artykuł recenzowany/revised paper IAPGOS, 3/2020, 32–35
p-ISSN 2083-0157, e-ISSN 2391-6761 IAPGOŚ 3/2020 33
failure. Further research on the lookback parameter is out of the scope of this work; a 24 h time window was chosen.

Creating lag features for the telemetry data consists of calculating the mean and standard deviation for every third record in the dataset. Next, to capture a long-term effect, the mean and standard deviation of the last 24 hours are also calculated.

The error dataset contains a timestamp and an error message ID number for every machine. The number of errors of each type within the 24 h lag window has to be calculated in order to find out what impact it has on the failure probability.

The maintenance history is one of the most important datasets, so it is crucial for a company to build a system that collects such data. It is used to calculate the number of days since the last replacement of a certain machine asset, which provides very useful information about its degradation level.

Finally, all the datasets (including machine and failure information) are merged together and prepared for labelling, i.e. marking each record with a class that says whether or not a fault has occurred. This is not the only way to do it. The authors in [19] considered each of the devices and their components separately, labelling them as faulty or not. Thibaux et al. [5] decided to distinguish between three classes: "impending failure detected", "no impending failure detected" and "uncertain about future failure". In this work, it was decided to treat the issue as a multi-class classification problem in which it is anticipated which of the four components will fail, or none. In addition, it was assumed that the prediction would take place 24 hours in advance, although in general this time should be chosen in terms of maintenance time and spare parts availability. This means that each data record located 24 hours before a fault is marked as "incoming failure of component number x", and as "none" otherwise. Table 1 shows the structure of the labelled data together with sample values.

Table 1. Data structure

Feature: example value
machine_ID: 22
datetime: 2015-09-29 18:00:00
volt_mean_3h: 171.27
rotate_mean_3h: 493.40
pressure_mean_3h: 112.35
vibration_mean_3h: 39.654
volt_std_3h: 9.4917
rotate_std_3h: 10.984
pressure_std_3h: 6.4264
vibration_std_3h: 5.5001
…
number of errors (error1…error5): 0, 0, 0, 0, 0
days since last component replacement (component1…component4): 26, 11, 41, 56
Model: 1
Age: 14
Failure: none

Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) networks require a data input shape of the form (batch_size, timesteps, features). In order to obtain such three-dimensionality, it is necessary to run an algorithm that generates the 24-hour machine history (as an additional dimension) for each labelled data record.

The next step is to process the categorical data, which consists of mapping the ordinal features and encoding the nominal features as well as the class labels. Then the dataset is split into training and test subsets. Each data record has a corresponding point in time, so no random splitting or random sampling method can be utilized. The reason is that past events cannot be predicted based on future events (which might happen when using such methods). It is also unacceptable to use data that arose later than the point under consideration. Hence, a time-dependent splitting method has been applied: one point in time is selected as a division point and the records for 24 hours ahead of it are ignored. This eliminates the risk of leakage of information (created during labelling) between these subsets. As the splitting point, 2015-07-31 01:00:00 was chosen, which gives a 60:40 ratio between training and testing data.

It has been proven that data preparation techniques such as normalization and standardization have a positive impact on the performance of prediction models [8]. Thus, the data was standardized before model training.

2. Prediction models and validation

One of the breakthroughs in machine learning was the development of the perceptron learning rule by F. Rosenblatt [21]:

Δw_i = η(y_i − ŷ_i)x_i   (2)

In the formula above, η is the learning rate (0÷1), y_i is the true class label of the i-th sample, ŷ_i denotes the predicted class label and x_i is the corresponding input value.

Table 2. The best parameters

Logistic regression: dimensionality reduction = LDA; C = 10; penalty = l2; solver = newton_cg; imbalance handling = ENN
Decision tree: dimensionality reduction = GUS; criterion = entropy; max_depth = 10; max_features = None; min_impurity_decrease = 0.0001; imbalance handling = Tomek's links
SVM: dimensionality reduction = LDA; C = 10; kernel = linear; imbalance handling = none
Random forest: dimensionality reduction = GUS; criterion = entropy; max_depth = 10; min_impurity_decrease = 0.0001; n_estimators = 500; imbalance handling = Tomek's links
Gradient boosting: dimensionality reduction = RFE; learning_rate = 0.01; loss = deviance; n_estimators = 500; subsample = 0.5; imbalance handling = none
ANN: dimensionality reduction = none; topology = one hidden layer; units = 50; activation = relu; kernel_initializer = lecun_uniform; optimizer = adam; loss = sparse_categorical_crossentropy; epochs = 10; batch_size = 32; imbalance handling = none
CNN: dimensionality reduction = none; topology = two convolution layers; units in layer 1 = 75; units in layer 2 = 50; layer 1 kernel_size = 2; layer 2 kernel_size = 2; activation = elu; kernel_initializer = glorot_uniform; optimizer = adam; loss = sparse_categorical_crossentropy; epochs = 10; batch_size = 32; imbalance handling = none
LSTM: dimensionality reduction = none; topology = one recurrent layer; units = 75; activation = relu; recurrent_activation = sigmoid; kernel_initializer = glorot_uniform; recurrent_initializer = glorot_normal; imbalance handling = none
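To make the perceptron learning rule discussed above concrete, the following minimal NumPy sketch performs a single weight update for one training sample. This is an illustration only; the toy sample and variable names are not from the original study.

```python
import numpy as np

def perceptron_update(w, x, y_true, y_pred, eta=0.1):
    """One step of Rosenblatt's rule: delta_w_i = eta * (y_true - y_pred) * x_i,
    applied element-wise to all weights."""
    return w + eta * (y_true - y_pred) * x

# Toy sample: true label 1, predicted 0, so the weights move towards x.
w = np.zeros(3)
x = np.array([1.0, 2.0, -1.0])
w = perceptron_update(w, x, y_true=1, y_pred=0)
print(w)  # [ 0.1  0.2 -0.1]
```

Note that when a sample is classified correctly (y_true equals y_pred), the update term vanishes and the weights stay unchanged, which is what makes the rule converge on linearly separable data.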
The idea of updating weights has led researchers to develop more sophisticated models. In this article, the following classification algorithms have been used for failure prediction: logistic regression, support vector machine (SVM), decision tree, random forest, gradient boosting classifier, artificial neural network (ANN), convolutional neural network (CNN) and long short-term memory (LSTM) network.

Initially, the performance of the algorithms with default parameters was tested for different splitting and dimensionality reduction methods. It turned out that the training and test data proportions have only a small impact on accuracy, so for further research the 60:40 split was used. The next step was to investigate the influence of dimensionality reduction techniques such as principal component analysis (PCA) [9], linear discriminant analysis (LDA), generic univariate select (GUS) and recursive feature elimination (RFE) on prediction accuracy. For each algorithm, the method giving the best results was selected for further research. As a consequence, each prediction model has been prepared for the parameter tuning process.

Selection of the best parameter values was made by applying a grid search algorithm, which consists of finding the best result for every parameter combination from the grid. This method usually involves k-fold cross-validation with random sampling, which is unacceptable in this case. Therefore, a time-series splitting method was used for the grid search to avoid overestimating the performance. Furthermore, for the neural networks the best topology was chosen by manual testing and comparison. The best parameters for each model are listed in Table 2.

In predictive maintenance there is another important factor affecting performance. Failures are very rare occurrences among telemetry data, which leads to imbalance in the label distribution. Hence, a classifier tends to predict the majority class labels better than the minority ones. There are many solutions to this problem. Among others, undersampling can be used, as the authors did in [3]. Alternatively, class weighting can be applied to increase or decrease the algorithm's sensitivity towards specific classes. In this work, the methods proposed in [16] were used, especially Tomek's links and edited nearest neighbours (ENN). Oversampling, however, was not utilized, because the number of newly added samples was unreasonably large compared to the efficiency improvement, and the model training time became too long.

3. Results

Table 3 summarises the performance metrics of each model for default parameters and after the data preprocessing and parameter tuning have been applied. The aforementioned label distribution imbalance also has an impact on how a model is evaluated. Failure-free records represent the vast majority of the dataset [14], so an algorithm can predict only a few faults while still maintaining high accuracy. Therefore, other performance metrics such as precision, recall and f1-score should be taken into account. These metrics are based on counts of positive and negative hypotheses, so for multi-class predictions their macro averages are calculated. The results for the CNN and LSTM algorithms confirm the problem described above: their accuracy values exceed 99% after tuning, but the other metrics are much lower.

Table 3. Performance metrics (%; initial values with default parameters / final values after preprocessing and parameter tuning)

Classifier            accuracy  f1-score  precision  recall  /  accuracy  f1-score  precision  recall
Logistic regression     99.80     94.33     93.96     94.71  /   99.82     95.04     93.35     96.85
Decision tree           99.87     96.32     95.55     97.21  /   99.88     96.49     96.09     96.97
SVM                     99.83     95.72     95.20     96.33  /   99.85     96.07     94.24     98.12
Random forest           99.92     97.74     98.56     96.95  /   99.93     97.92     98.74     97.14
Gradient boosting       99.93     97.78     98.01     97.57  /   99.92     97.76     98.02     97.52
ANN                     99.83     95.56     94.13     97.09  /   99.88     96.82     96.04     97.66
CNN                     97.72     62.01     57.44     75.59  /   99.50     88.02     92.63     84.06
LSTM                    98.82     68.86     78.66     63.65  /   99.61     90.38     94.59     86.69

Among the introduced methods, the three that achieved the best results in predicting defects were chosen and their confusion matrices are presented (Fig. 1-3). Comparing these matrices and explicitly selecting the best algorithm involves determining several requirements related to the functioning of the system for which the application is intended. First of all, it is important to specify how expensive the so-called false alarms are, i.e. situations in which the model predicts a failure that does not actually occur. In addition, it is necessary to determine how harmful it would be to forecast the failure of one component instead of another. Of course, failing to detect an upcoming malfunction is the worst-case scenario, because unscheduled maintenance is the most expensive and avoiding it is desirable.

Fig. 1. Gradient boosting confusion matrix

Fig. 2. Random forest confusion matrix

On the basis of the presented results, it might seem that there is no need for sophisticated methods of data preparation and parameter selection, since the results are already very good and their improvement amounts to only a fraction of a percent. However, for a company that would like to use such a system, any incorrect forecast can cost a lot of money, so it is reasonable to refine the algorithms to perfection. The performance metrics of the gradient boosting model (marked in the table) seem particularly interesting, as they slightly deteriorated after the model improvement. Nonetheless, the overfitting has decreased, so the risk of worse behaviour on new, previously unseen data is lower.
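The evaluation procedure described above can be sketched in scikit-learn. The snippet below is an illustration on synthetic data, not the study's code: the feature values, class proportions and random seed are invented, while the chronological 60:40 split, standardization fitted on the training part only, the random forest parameters and the macro-averaged metrics follow the approach reported in Tables 2 and 3.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

rng = np.random.default_rng(0)

# Synthetic stand-in for the labelled dataset: class 0 is "none",
# classes 1-4 are "incoming failure of component x" (rare).
n = 2000
X = rng.normal(size=(n, 10))
y = rng.choice(5, size=n, p=[0.92, 0.02, 0.02, 0.02, 0.02])
X[y > 0] += y[y > 0][:, None] * 0.8  # shift failure classes apart

# Chronological 60:40 split: no shuffling, so "future" records
# never leak into the training set.
split = int(0.6 * n)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Standardize using statistics of the training part only.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Random forest with parameters in the spirit of Table 2.
clf = RandomForestClassifier(n_estimators=500, max_depth=10,
                             criterion="entropy",
                             min_impurity_decrease=0.0001,
                             random_state=0).fit(X_train, y_train)

y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
# Macro averaging weights the rare failure classes and the dominant
# "none" class equally, as in Table 3.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro", zero_division=0)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```

Because the majority class dominates, accuracy on such data is high almost regardless of the classifier, while the macro-averaged precision, recall and f1-score expose how well the rare failure classes are actually detected.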