A Deep Learning Model with Adaptive Learning Rate for Fault Diagnosis
Xiaodong Zhai1, Fei Qiao1
1. School of Electronics and Information Engineering, Tongji University, Shanghai 201804
E-mail: [email protected], [email protected]
Abstract: With the increasing amount of data in the field of equipment fault diagnosis, deep learning is playing an increasingly important role in the fault diagnosis process, where timeliness requirements are high and diagnosis results must be obtained accurately and promptly. However, as the number of network layers increases, the training time of a deep learning model becomes longer. The learning rate plays an important role in training a deep learning model, and a well-designed learning rate adjustment strategy can effectively reduce the training time and satisfy the requirements of fault diagnosis. At present, many deep learning models adopt a globally uniform learning rate, which is unreasonable for different parameters. This paper designs an adaptive learning rate strategy for the weight and bias parameters of a deep learning model respectively. Specifically, the strategy contains a learning rate schedule based on the stochastic gradient descent method for the weights, and a power exponential learning rate schedule for the biases. Experiments are carried out to validate the effectiveness of the proposed strategy. The results suggest that it can reduce the training time and reconstruction error rate of the deep learning model and improve the classification accuracy of fault diagnosis.
Key Words: Deep learning, Learning rate, Adaptive, Fault diagnosis
3.1 Deep Learning Model

The essence of a deep learning model is a multi-layer neural network. A general neural network has only a few layers, whereas a deep learning model contains a large number of hidden layers, so it has a strong feature learning ability. Through multi-layer non-linear transformations, deep abstract features can be learned from complex training data and the intrinsic information of the data can be described. In order to avoid falling into local optima, a layer-by-layer training algorithm is usually adopted to train the parameters of the multi-layer neural network in deep learning.

At present, there are many mature deep learning models, and this paper focuses on the Deep Belief Network (DBN). The basic unit of a DBN is the Restricted Boltzmann Machine (RBM).

$u_j = \sum_{i=1}^{I} \omega_{ij} x_i + b_j$  (2)

In the deep learning model, data are expressed through the connection weights and distinguished by the shared weights and biases. Therefore, the weights are important for feature extraction and the layer-by-layer abstraction of the deep learning model. According to formula (2), the bias $b_j$ can be regarded as a neuron with state $b_j$ and weight 1. It can be seen as adding a dimension to the original data, which is beneficial for distinguishing the data, especially when the dimension of the input data is low. However, when the dimension of the input data is high, which is already enough to distinguish the data, the role of the bias is relatively weakened. Therefore, for a fault diagnosis model based on deep learning, the amount of calculation devoted to the biases can be reduced appropriately when the dimension of the input data is high.
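To make formula (2) concrete, the following minimal sketch (in Python with NumPy, which the paper itself does not use; all variable names and values are illustrative) computes the total input $u_j$ of each hidden unit and illustrates the remark that the bias can be viewed as an extra constant input dimension.

```python
import numpy as np

def hidden_pre_activation(x, W, b):
    """Formula (2): u_j = sum_i w_ij * x_i + b_j for every hidden unit j.
    x: (I,) visible vector, W: (I, J) weight matrix, b: (J,) hidden biases."""
    return x @ W + b

# Illustrative values (not from the paper).
x = np.array([0.2, 0.7, 0.1])
W = np.full((3, 2), 0.5)
b = np.array([0.1, -0.1])
u = hidden_pre_activation(x, W, b)

# Folding the bias into an extra input fixed at 1 gives the same result,
# which is the sense in which the bias "adds a dimension" to the data.
x_aug = np.append(x, 1.0)
W_aug = np.vstack([W, b])
assert np.allclose(u, x_aug @ W_aug)
```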
learning rate of the model. The increments $\Delta\omega_{ij}$, $\Delta b_{1i}$ and $\Delta b_{2j}$ obtained by the above formulas are used to update the corresponding weights and biases in formula (2). With the constant updating of the parameters, the training process of the model continues until the iteration termination condition is reached.
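The update loop described above can be sketched as follows; this is only an illustration under assumptions, since the increments are produced by the preceding formulas, and the placeholder function compute_increments, the maximum iteration count and the tolerance-based termination criterion are introduced here for the example.

```python
import numpy as np

def train_rbm(W, b1, b2, data, compute_increments, max_iters=100, tol=1e-4):
    """Sketch of the training loop: apply the increments to the weights and
    biases of formula (2) until a termination condition is reached."""
    for t in range(max_iters):
        # compute_increments stands for the paper's preceding update formulas
        dW, db1, db2 = compute_increments(W, b1, b2, data, t)
        W, b1, b2 = W + dW, b1 + db1, b2 + db2
        # one possible termination condition: the updates become negligible
        if max(np.abs(dW).max(), np.abs(db1).max(), np.abs(db2).max()) < tol:
            break
    return W, b1, b2
```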
$L_m(\theta) = \frac{1}{N}\sum_{n=1}^{N} l_n(\theta)$  (8)

where $L_m(\theta)$ is the gradient value of the loss function calculated from the m-th batch of data, and $N$ is the number of samples in the m-th batch. As can be seen from these two formulas, the computational complexity of SGD is greatly reduced compared with the gradient descent method.
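As an illustration of formula (8), the sketch below averages per-sample losses and gradients over one mini-batch; the squared-error loss, the random data and the learning rate used in the example are assumptions made for the demonstration, not choices taken from the paper.

```python
import numpy as np

def minibatch_loss_and_grad(theta, X_batch, y_batch):
    """Formula (8): L_m(theta) = (1/N) * sum_n l_n(theta), shown with an
    illustrative squared-error loss l_n(theta) = 0.5 * (x_n . theta - y_n)^2."""
    N = X_batch.shape[0]                 # N: number of samples in the m-th batch
    residual = X_batch @ theta - y_batch
    loss = 0.5 * np.mean(residual ** 2)  # average of the per-sample losses
    grad = X_batch.T @ residual / N      # gradient of the batch loss w.r.t. theta
    return loss, grad

# One SGD step touches only the current mini-batch, which is why its cost is
# much lower than that of full-batch gradient descent.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 5)), rng.normal(size=32)
theta = np.zeros(5)
loss, grad = minibatch_loss_and_grad(theta, X, y)
theta -= 0.1 * grad                      # 0.1 is an illustrative learning rate
```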
3.5 Learning Rate Scheduling

For $\varepsilon(t)$ in formula (7), traditional DBN models usually set a globally uniform constant parameter based on experience. However, as the number of iterations increases, a more precise iteration step size is needed. A constant learning rate slows down the convergence of the model because it keeps the step size of every iteration unchanged throughout the iteration process. A good learning rate strategy can significantly improve the convergence speed and operating efficiency of a deep learning model. In terms of mechanism, the full-parameter learning rate will further reduce the training time of the model and its final classification error. Based on AdaGrad and AdaDec, and combined with the SGD method, the mathematical expressions of the learning rate strategy are designed according to the different characteristics and functions of the weights and biases, and are formulated as follows:

$\varepsilon_{ij}(t) = \dfrac{\varepsilon_{ij}(t-1)}{\sqrt{K + g(t)^2}}$  (9)

$\varepsilon_i(t) = \varepsilon_i(0)\,(1 + t/R)^{-q}$  (10)

$\varepsilon_j(t) = \varepsilon_j(0)\,(1 + t/R)^{-q}$  (11)

where $\varepsilon_{ij}(t)$ is the learning rate of the weight in the next round, $\varepsilon_{ij}(t-1)$ is the learning rate in the current round, $g(t)^2$ is the sum of squares of the gradients of the loss function in the current round, and $K$ is a constant term, generally equal to one, which mainly ensures that the learning rate is bounded and in a downward trend. $\varepsilon_i(t)$ and $\varepsilon_j(t)$ are the learning rates of the bias terms for the visible and hidden units respectively, which follow power exponential functions with a downward trend, where $R$ is the number of iterations and the value of $q$ is generally 0.75.
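A minimal sketch of the schedules in formulas (9)-(11) is given below; the iteration loop, the placeholder value for the summed squared gradients and the initial rates are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

def weight_lr(eps_prev, grad_sq_sum, K=1.0):
    """Formula (9): next-round weight learning rate, the current rate divided by
    sqrt(K + g(t)^2), where g(t)^2 sums the squared gradients of the current round."""
    return eps_prev / np.sqrt(K + grad_sq_sum)

def bias_lr(eps0, t, R, q=0.75):
    """Formulas (10)-(11): power exponential decay eps(t) = eps(0) * (1 + t/R)^(-q)
    for the visible-unit and hidden-unit bias learning rates (q = 0.75)."""
    return eps0 * (1.0 + t / R) ** (-q)

# Illustrative schedule over R = 100 iterations with an initial rate of 0.1.
R, eps_w, eps_b0 = 100, 0.1, 0.1
for t in range(1, R + 1):
    grad_sq_sum = 1e-3                        # placeholder for the current round's g(t)^2
    eps_w = weight_lr(eps_w, grad_sq_sum)     # shrinks every round while staying bounded
    eps_b = bias_lr(eps_b0, t, R)             # smooth downward trend for the biases
```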
According to the above formulas, the main idea of the learning rate adjustment strategy is that a larger learning rate makes the value of the target loss function decrease rapidly in the initial stage of the iteration process. As the iterations proceed, the learning rate decreases gradually, which accelerates the stable expression of the data samples and helps the model find the convergence point of the data samples more quickly and steadily. In this paper, the learning rate of the weights is adjusted adaptively by using the current gradient value based on the learning rate of the previous round. As a result, the learning rate can describe the current running state of the model more accurately, and it reduces the amount of computation on historical gradient data compared with other adaptive methods [11]. For a model dealing with fault diagnosis, the original data is sometimes high-dimensional, which relatively weakens the role of the bias. Therefore, a simple power exponential function is chosen as the learning rate strategy for the bias, which simply ensures that the learning rate is in a downward trend, so as to further reduce the amount of calculation and improve the final classification accuracy.

3.6 Evaluating Indicator

In this paper, the Reconstruction Error Rate (RER) of the test data in the reverse fine-tuning stage is used as the quantitative evaluation index, which describes the convergence state of the model well. For a test data set with $N$ samples, the mathematical expressions of the reconstruction error rate are as follows:

$\mathrm{RER} = \frac{1}{N}\sum_{n=1}^{N} \mathrm{MSE}(data(n))$  (12)

$\mathrm{MSE}(data) = \frac{1}{D}\sum_{d=1}^{D} \big(\mathrm{In}(data(d)) - \mathrm{Out}(data(d))\big)^2$  (13)

Formula (13) is the calculation formula of the Mean Squared Error (MSE), where In(data) is the input data of the model, Out(data) is the data generated by the model, and $D$ is the dimension of a data sample. Under the condition of the same number of iterations, a high reconstruction error rate means the convergence of the model is poor; on the contrary, a low reconstruction error rate means the convergence of the model is good.

4 Case Study

In order to verify the performance of the learning rate strategy proposed in this paper, a constant learning rate (Cons) is introduced. The experiment in Section 4.1 was designed to compare the convergence and computational complexity of Cons, AdaGrad, AdaDec and the learning rate strategy proposed in this paper. On this basis, the classification accuracy of each method is compared and analyzed in Section 4.2. Finally, in order to verify the reliability of setting the learning rates of the weights and biases separately, the experiment in Section 4.3 was designed to elaborate the relationship between weight and bias.

In this paper, a data set of rolling bearings is used [18], including bearings in good condition, bearings with peeling off in the outer ring, bearings with peeling off in the inner ring, bearings with peeling off in the balls and bearings with a broken cage. The neural network model used in the experiment has a five-layer structure. The number of neurons in the input layer is 1000, the numbers of neurons in the three hidden layers are 1000, 500 and 250 respectively, and the number of neurons in the output layer is 5. The initial connection weights between layers obey a Gaussian distribution with mean 0 and variance 0.001. The initial biases of the first layer are determined by the training data, and the initial biases of the other layers are set to 0. All the methods mentioned in the experiment adopt the same initial learning rates: the initial learning rate in the pre-training stage is 0.1, and the initial learning rate in the reverse fine-tuning stage is 0.001. The reconstruction error rate of the model is calculated by formula (12) and formula (13).
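To illustrate how the evaluation index of formulas (12) and (13) could be computed on a test set, a minimal sketch follows; the reconstruction function, the random test data and the function names are assumptions introduced for the example only.

```python
import numpy as np

def mse(in_data, out_data):
    """Formula (13): mean squared error between the model input In(data) and the
    reconstruction Out(data), averaged over the D dimensions of one sample."""
    return np.mean((in_data - out_data) ** 2)

def reconstruction_error_rate(test_set, reconstruct):
    """Formula (12): RER = (1/N) * sum_n MSE(data(n)) over the N test samples;
    `reconstruct` stands for the model's reconstruction of an input sample."""
    return np.mean([mse(x, reconstruct(x)) for x in test_set])

# Illustrative usage with a placeholder reconstruction.
rng = np.random.default_rng(0)
test_set = rng.normal(size=(10, 1000))   # 10 samples with 1000-dimensional input
rer = reconstruction_error_rate(
    test_set, lambda x: x + rng.normal(scale=0.01, size=x.shape))
```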
4.1 Reconstruction Error Rate Comparison

In this section, the influence of the four methods on the convergence of the deep learning model is compared. Meanwhile, the running time of the model with 100 iterations is counted. The experimental results are shown in Fig. 4.

98.7%, 98.1% and 98.3%, respectively. In addition, the curve of the learning rate strategy proposed in this paper is higher than those of the other three strategies in the whole iteration process; that is, its comprehensive performance is better.
100, the reconstruction errors of the two models are 7.81 and 7.88 respectively. By increasing the number of iterations, there is little difference between the two results. However, the learning rate strategy in which the learning rate of the weight is zero and that of the bias is constant does not reduce the reconstruction error rate during the iterations and keeps a high reconstruction error rate. Therefore, we can conclude that the weight plays a decisive role in the process of model convergence, while the bias term does not play an important role in the same process.

5 Conclusions

In this paper, a deep learning model with an adaptive learning rate for fault diagnosis is proposed. According to the different roles of the weight and bias in the deep learning model, the SGD method is used to design a suitable learning rate strategy for the weight parameters, and a power exponential function is chosen as the learning rate strategy for the bias parameters. Experiments show that the strategy proposed in this paper can extract the characteristics of the data samples better, reduce the reconstruction error rate of the data samples, improve the training efficiency and classification accuracy of the model, and perform better than some existing learning rate strategies.

However, there are still some areas for further improvement in the work proposed in this paper. For example, when the dimension of the data set is low, the learning rate strategy proposed in this paper needs to be adjusted appropriately, and further study is needed when the deep learning model with adaptive learning rate is applied to more practical problems.

References

[1] E. Pan, W. Z. Liao, and M. L. Zhuo, Periodic preventive maintenance policy with infinite time and limit of reliability based on health index, Journal of Shanghai Jiaotong University, 15(2): 231-235, 2010.
[2] C. MacGillivary, V. Turner, and D. Lund, Worldwide internet of things (IoT) 2013-2020 forecast: Billions of things, trillions of dollars, IDC Q1 Doc, 243661(3): 1-22, Oct. 2013.
[3] L. Li, and S. S. Yu, Image quality assessment based on deep learning model, Journal of Huazhong University of Science and Technology (Nature Science Edition), 44(12): 70-75, 2016.
[4] S. K. Kim, Y. J. Park, and S. Lee, Voice activity detection based on deep belief networks using likelihood ratio, Journal of Central South University, 23: 145-149, 2016.
[5] K. Liu, and W. Y. Yuan, Short texts feature extraction and clustering based on auto-encoder, Acta Scientiarum Naturalium Universitatis Pekinensis, 51(2): 282-288, 2015.
[6] Y. LeCun, Y. Bengio, and G. E. Hinton, Deep learning, Nature, 521: 436-444, 2015.
[7] K. Liu, L. M. Zhang, and X. L. Fan, New image deep feature extraction based on improved CRBM, Journal of Harbin Institute of Technology, 48(5): 155-159, 2016.
[8] Y. Peng, and M. Dong, A prognosis method using age-dependent hidden semi-Markov model for equipment health prediction, Mechanical Systems & Signal Processing, 25(1): 237-252, 2011.
[9] T. P. Hong, and B. S. Yang, Estimation and forecasting of machine health condition using ARMA/GARCH model, Mechanical Systems & Signal Processing, 24(2): 546-558, 2010.
[10] G. E. Hinton, and R. R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science, 313(5786): 504-507, 2006.
[11] J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, The Journal of Machine Learning Research, 12: 2121-2159, 2011.
[12] A. Senior, G. Heigold, M. A. Ranzato, and K. Yang, An empirical study of learning rates in deep neural networks for speech recognition, in Proceedings of the 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2013: 6724-6728.
[13] J. W. Liu, Y. Liu, and X. L. Luo, Research and development on Boltzmann machine, Journal of Computer Research and Development, 51(1): 1-16, 2014.
[14] R. Salakhutdinov, and G. Hinton, An efficient learning procedure for deep Boltzmann machines, Neural Computation, 24(8): 1967-2006, 2012.
[15] H. Robbins, and S. Monro, A stochastic approximation method, The Annals of Mathematical Statistics, 22(3): 400-407, 1951.
[16] Z. You, X. R. Wang, and B. Xu, Exploring one pass learning for deep neural network training with averaged stochastic gradient descent, in Proceedings of the 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2014: 6854-6858.
[17] S. Klein, J. P. W. Pluim, M. Staring, and M. A. Viergever, Adaptive stochastic gradient descent optimisation for image registration, International Journal of Computer Vision, 81(3): 227-239, 2009.
[18] https://download.csdn.net/download/qq_34133884/11017362