ISA Transactions 97 (2020) 269–281

Practice article

Deep transfer network with joint distribution adaptation: A new intelligent fault diagnosis framework for industry application

Te Han, Chao Liu, Wenguang Yang, Dongxiang Jiang

journal homepage: www.elsevier.com/locate/isatrans
https://doi.org/10.1016/j.isatra.2019.08.012

∗ Corresponding author at: Department of Energy and Power Engineering, Tsinghua University, Beijing 100084, China. E-mail address: [email protected] (C. Liu).
Article history:
Received 30 January 2019
Received in revised form 2 August 2019
Accepted 4 August 2019
Available online 12 August 2019

Keywords:
Transfer learning
Domain adaptation
Joint distribution adaptation
Intelligent fault diagnosis
Convolutional neural networks

Abstract

In recent years, deep learning models have become increasingly popular for intelligent condition monitoring, diagnosis and prognostics of mechanical systems and structures. In previous studies, however, a major assumption accepted by default is that the training and testing data are drawn from the same feature distribution. Unfortunately, this assumption is mostly invalid in real applications, leaving the traditional diagnosis approaches with a certain lack of applicability. Inspired by the idea of transfer learning, which leverages the knowledge learnt from rich labeled data in a source domain to facilitate diagnosing a new but similar target task, a new intelligent fault diagnosis framework, i.e., the deep transfer network (DTN), which generalizes deep learning models to the domain adaptation scenario, is proposed in this paper. By extending marginal distribution adaptation (MDA) to joint distribution adaptation (JDA), the proposed framework can exploit the discrimination structures associated with the labeled data in the source domain to adapt the conditional distribution of the unlabeled target data, and thus guarantee a more accurate distribution matching. Extensive empirical evaluations on three fault datasets validate the applicability and practicability of DTN, which achieves many state-of-the-art transfer results across diverse operating conditions, fault severities and fault types.

© 2019 ISA. Published by Elsevier Ltd. All rights reserved.
1. Introduction

In modern industry, machines and equipment are developing towards high precision, high efficiency, greater automation and greater complexity, making breakdowns or even accidents more frequent. Intelligent monitoring and fault diagnosis systems, in a broad sense, have always been key to enhancing the security and reliability of industrial equipment [1]. Over the past decade, various attempts have been made to design efficient algorithms or new ways of achieving superior diagnostic performance. These studies usually merge advanced signal processing algorithms and machine learning techniques to process machine data and make diagnostic decisions intelligently, leading to impressive results in many diagnosis cases [2–6].

Marvelous success with diverse intelligent fault diagnosis frameworks has been reported over the past decade [7–14]. However, two latent problems in these works may restrict extensive and flexible industry applications. (1) Most of the designed methods or algorithms are validated based on an assumption: the training data and testing data follow a similar distribution. Take bearing fault diagnosis as an example. Lei et al. [1] utilized ensemble empirical mode decomposition (EEMD) and statistical parameters to extract features, and a wavelet neural network (WNN) to intelligently classify and diagnose bearing health conditions. Verstraete et al. [10] designed a deep feature learning method using time–frequency images and convolutional neural networks (CNN) for bearing fault diagnosis. Feng et al. [11] presented a local connection network constructed by stacked auto-encoders (SAE) to extract shift-invariant features from bearing fault signals. Numerous other works can be found in related reviews [15,16]. In these works, the monitored signal is generally divided into many segments, i.e., samples. These samples are randomly partitioned into training data and testing data. In this manner, the designed methods or algorithms are actually validated on the same data distribution. These reported works contribute to the development of more effective diagnosis methods utilizing expert knowledge or adaptive feature learning, while ignoring the fact of distribution discrepancy. Due to the multiple loading conditions, working environments and fault severities for bearings, the distributions of training data and testing data differ in real situations. The diagnostic model is generally learned with training data from limited conditions, and the
Fig. 1. Intelligent fault diagnosis framework. (a) Stage I, (b) Stage II and (c) New one.
adaptation method, which employed an MMD term to evaluate the discrepancy of the normal category between the source and target domains, and retained the sophisticated fault features with a weight regularization term. These studies have preliminarily explored the effectiveness of transfer learning in the field of intelligent fault diagnosis, but further work is needed to improve this framework in the following two aspects. (1) The transfer scenario should be extended to more challenging diagnosis tasks, such as diverse fault severity levels and diverse fault types. (2) The previous studies only adapted the marginal distribution without considering the conditional distribution, thereby neglecting the discrimination structures in the rich labeled source data. Jointly reducing the discrepancy in both the marginal distribution and the conditional distribution may hold the potential to achieve superior transfer performance.

3. Preliminaries

3.1. Convolutional neural network

CNN, as one of the most effective deep learning models, has been widely used in image processing, computer vision and speech recognition. Typically, a CNN is composed of three types of layers: convolutional layers, pooling layers and fully-connected layers. The first step of a CNN is to convolve the input signal with a set of filter kernels (1D for time-series signals and 2D for images). All the feature activations produced by the convolution operation at different locations constitute the feature map. A nonlinear activation function, generally the rectified linear unit (ReLU), is applied to the sum of the feature maps. The operation of the convolutional layer can be expressed as:

$c_n^r = \mathrm{ReLU}\Big(\sum_m v_m^{r-1} * w_n^r + b_n^r\Big)$    (1)

where $c_n^r$ is the $n$th output of convolutional layer $r$, $n$ represents the number of filters in layer $r$, $w_n^r$ and $b_n^r$ are the $n$th filter and bias of layer $r$ respectively, $v_m^{r-1}$ is the $m$th output from the previous layer $r-1$, and $*$ denotes the convolution operation. The obtained feature map is then processed with a pooling layer by taking the mean or maximum feature activation over disjoint regions. By cascading combinations of convolutional and pooling layers, a multi-layer structure is built for feature description. Finally, the fully-connected layers, just like the layers in a multi-layer neural network, are employed for classification. Given the training set $\{X_j\}_j$, the learning process of a CNN with $K$ convolutional layers, including the parameters of the filters $\{W^i\}_{i=1}^K$, the biases $\{b^i\}_{i=1}^K$ and the classification layers $U$, can be defined as an optimization task:

$\min_{\{W^i\}_{i=1}^K,\,\{b^i\}_{i=1}^K,\,U}\ \sum_j \ell\big(h(X_j), f(X_j, \{W^i\}_{i=1}^K, \{b^i\}_{i=1}^K, U)\big)$    (2)

where $\ell$ denotes the loss function that calculates the cost between the true label $h(X)$ and the label predicted by the CNN model $f(X, \{W^i\}_{i=1}^K, \{b^i\}_{i=1}^K, U)$.

3.2. Transfer learning

For completeness, the definitions of transfer learning are first presented.

Definition 1 (Domain). A domain $\mathcal{D}$ is composed of two components: a feature space $\mathcal{X}$ and a marginal probability distribution $P(X)$, where $X = \{x_1, \ldots, x_n\} \in \mathcal{X}$ is a particular training dataset, i.e., $\mathcal{D} = \{\mathcal{X}, P(X)\}$.

Definition 2 (Task). A task $\mathcal{T}$ consists of two parts, a label space $\mathcal{Y}$ and a predictive function $f(X)$, which can be learned from the instance set $X$, i.e., $\mathcal{T} = \{\mathcal{Y}, f(X)\}$. Also, $f(X) = Q(Y|X)$ is the conditional probability distribution.

Definition 3 (Transfer Learning). Given a source domain $\mathcal{D}_s$ with a learning task $\mathcal{T}_s$ and a target domain $\mathcal{D}_t$ with a learning task $\mathcal{T}_t$, transfer learning aims to facilitate the learning process of the target predictive function $f_t(X)$ in $\mathcal{D}_t$ by using the related information or knowledge in $\mathcal{D}_s$ and $\mathcal{T}_s$, where $\mathcal{D}_s \neq \mathcal{D}_t$ or $\mathcal{T}_s \neq \mathcal{T}_t$. When $\mathcal{D}_s = \mathcal{D}_t$ and $\mathcal{T}_s = \mathcal{T}_t$, the problem is categorized as a traditional machine learning task.

Two remarks should be emphasized here. The condition $\mathcal{D}_s \neq \mathcal{D}_t$ means $\mathcal{X}_s \neq \mathcal{X}_t \vee P_s(X_s) \neq P_t(X_t)$, and the condition $\mathcal{T}_s \neq \mathcal{T}_t$ implies $\mathcal{Y}_s \neq \mathcal{Y}_t \vee Q_s(Y_s|X_s) \neq Q_t(Y_t|X_t)$.

3.3. Maximum mean discrepancy

MMD is an index to measure the discrepancy between two distributions. Given two datasets $X_s$, $X_t$ with $P_s(X_s) \neq P_t(X_t)$ and a nonlinear mapping function $\phi$ in a reproducing kernel Hilbert space $\mathcal{H}$ (RKHS), the formulation of MMD can be defined as:

$\mathrm{MMD}_{\mathcal{H}}(X_s, X_t) = \Big\| \frac{1}{n_s}\sum_{i=1}^{n_s} \phi(x_i^s) - \frac{1}{n_t}\sum_{i=1}^{n_t} \phi(x_i^t) \Big\|_{\mathcal{H}}^2$    (3)

In (3), the empirical estimate of the discrepancy between two distributions is the distance between the two data distributions in the RKHS. A value of MMD near zero means the two distributions are matched. In transfer learning, MMD is generally used to construct a regularization term that constrains the feature learning, making the learned feature distributions of different domains more similar.
Fig. 2. An illustration of MDA and CDA. f: discriminative hyperplane; Ds: feature distribution in the source domain; Dt: feature distribution in the target domain.
4. Deep transfer network with joint distribution adaptation

4.1. Joint distribution adaptation

Generally, the probability distributions of diverse domains may exhibit significant differences not only in the marginal distribution, which represents the cluster center of the feature distributions, but also in the conditional distribution in a large number of practical applications. From Fig. 2(a) to (b), it is clear that the distributions of the source and target domains are different. Directly using the discriminative hyperplane trained in the source domain will lead to extensive misclassification in the target domain. Marginal distribution adaptation (MDA) contributes to improving transfer performance by aligning the two distribution centers. However, only adapting the marginal distributions is insufficient, since the discriminative hyperplanes may be different for diverse domain tasks. Conditional distribution adaptation (CDA), which aims to match the discriminative structures between the labeled source data and the unlabeled target data, is also indispensable and highly effective. An intuitive description of this consideration is illustrated from Fig. 2(b) to (c). Hence, in this part, we are dedicated to presenting a simple mathematical formulation of JDA, and further providing a specific deep transfer framework.

Problem formulation (joint distribution adaptation). In a fault diagnosis task, given a labeled source dataset $X_s = \{x_i^s, y_i^s\}_{i=1}^{n_s}$ and an unlabeled target dataset $X_t = \{x_i^t\}_{i=1}^{n_t}$, with $\mathcal{X}_s = \mathcal{X}_t$, $\mathcal{Y}_s = \mathcal{Y}_t$, $P_s(X_s) \neq P_t(X_t)$ and $Q_s(Y_s|X_s) \neq Q_t(Y_t|X_t)$, the weak form of transfer learning with domain adaptation is to learn a feature transform that simultaneously minimizes the discrepancy between the marginal distributions and the conditional distributions [39], i.e.,

$\min D(P_s(\phi(X_s)), P_t(\phi(X_t)))$    (4)

If the marginal distribution adaptation in (4) holds, the optimization problem in (7) becomes

$\min D(Q_s(\phi(X_s)|Y_s), Q_t(\phi(X_t)|Y_t))$    (8)

The above objective function is denoted as CDA. This step is essential for an accurate and robust distribution adaptation. However, it is still intractable since $Y_t$ is unknown. Some previous studies proposed a circuitous way of handling CDA in unsupervised domain adaptation by exploiting pseudo labels for the target data [40,46]. With the aid of models pre-trained on the labeled source data, pseudo labels for the target data can be preliminarily supplied. Supposing a total of $C$ categories with category $c \in \{1, \ldots, C\}$, the distance index MMD can be defined to measure the mismatch of the conditional distributions $Q_s(x^s|y^s = c)$ and $Q_t(x^t|y^t = c)$ of category $c$,

$\mathrm{MMD}_{\mathcal{H}}^2(Q_s^{(c)}, Q_t^{(c)}) = \Big\| \frac{1}{n_s^{(c)}}\sum_{x_i^s \in \mathcal{D}_s^{(c)}} \phi(x_i^s) - \frac{1}{n_t^{(c)}}\sum_{x_j^t \in \mathcal{D}_t^{(c)}} \phi(x_j^t) \Big\|_{\mathcal{H}}^2$    (9)

where $\mathcal{D}_s^{(c)} = \{x_i : x_i \in \mathcal{D}_s \wedge y(x_i) = c\}$, $y(x_i)$ is the true label, $n_s^{(c)} = |\mathcal{D}_s^{(c)}|$, $\mathcal{D}_t^{(c)} = \{x_j : x_j \in \mathcal{D}_t \wedge \hat{y}(x_j) = c\}$, $\hat{y}(x_j)$ is the pseudo label, and $n_t^{(c)} = |\mathcal{D}_t^{(c)}|$.

It should be noted that, although there are probably many mistakes in the initial pseudo labels, one can iteratively update the pseudo labels during model optimization to obtain the optimal prediction accuracy under the current learning conditions.

(3) JDA: By integrating the marginal MMD and the conditional MMD, a regularization term of JDA can be written as:

$D_{\mathcal{H}}(J_s, J_t) = \mathrm{MMD}_{\mathcal{H}}^2(P_s, P_t) + \sum_{c=1}^{C} \mathrm{MMD}_{\mathcal{H}}^2(Q_s^{(c)}, Q_t^{(c)})$    (10)
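Under the same assumptions as the sketch in Section 3 (illustrative names, PyTorch tensors), the JDA regularizer of Eq. (10) can be written as the marginal MMD plus one class-conditional MMD term of Eq. (9) per category, with the target-side class memberships taken from the pseudo labels:

def jda_regularizer(phi_s, y_s, phi_t, y_t_pseudo, n_classes):
    """D_H(J_s, J_t) of Eq. (10): marginal MMD plus class-conditional MMDs of Eq. (9)."""
    reg = mmd(phi_s, phi_t)                   # MMD^2(P_s, P_t)
    for c in range(n_classes):
        s_c = phi_s[y_s == c]                 # D_s^(c), selected by true source labels
        t_c = phi_t[y_t_pseudo == c]          # D_t^(c), selected by pseudo target labels
        if len(s_c) > 0 and len(t_c) > 0:     # a class mean is undefined for empty sets
            reg = reg + mmd(s_c, t_c)         # MMD^2(Q_s^(c), Q_t^(c))
    return reg

Categories absent from a mini-batch are skipped, since their empirical class means in Eq. (9) would be undefined.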
The overall objective of the deep transfer network combines the source classification loss with the JDA regularization term,

$\min_{\theta}\ \ell_{ce} + \lambda D_{\mathcal{H}}(J_s, J_t)$    (11)

where $\theta = \{W^i, b^i\}_{i=1}^{l}$ is the parameter collection of a CNN with $l$ layers and $\lambda$ is a non-negative regularization parameter. It should be emphasized that the mapping function $\phi$ in the RKHS $\mathcal{H}$ is herein the nonlinear feature transform learned by the deep model. For CNNs, the features always change from general to specific as the layer depth increases. The upper layers tend to represent more abstract features, which may result in a larger domain discrepancy [47]. Consequently, we deploy the regularization term on the last hidden fully-connected layer, namely the layer in front of the discrimination layer; that is, $\phi(x) = h_{l-1}(x)$, where $h_{l-1}(\cdot)$ is the feature map produced by the nonlinear feature transform of the first $(l-1)$ layers. The JDA regularization term employed in conjunction with deep models can generate the mapping function $\phi$ by adaptively learning from data, and avoids manually setting a parameterized kernel function.

The architecture of the proposed DTN with JDA is illustrated in Fig. 3. A domain-shared CNN is utilized to extract signal characteristics for both the source data and the target data. That is, the structure and weights of the convolutional blocks and fully-connected layers are kept consistent across the source and target domains. By executing the forward pass, the two terms in (11) can be calculated, namely, the traditional cross-entropy loss $\ell_{ce}$ and the regularization term of JDA. Then, the backpropagation algorithm and mini-batch stochastic gradient descent (SGD) are utilized for network optimization. On the one hand, by optimizing the loss $\ell_{ce}$, the model is driven to capture the discriminant structure from the labeled source data. On the other hand, by optimizing the regularization term of JDA, the model can further reduce the discrepancy of the feature distributions between domains and learn a domain-invariant feature representation, so that the discriminant structure learnt in the source domain can also be applied to the target data.

The gradient of the objective function with respect to the network parameters is

$\nabla \theta_l = \frac{\partial \ell_{ce}}{\partial \Theta_l} + \lambda \big(\nabla D_{\mathcal{H}}(J_s, J_t)\big)^{T} \frac{\partial \phi(x)}{\partial \Theta_l}$    (12)

where $\partial \phi(x)/\partial \Theta_l$ are the partial derivatives of the output of the $(l-1)$th layer with respect to the network parameters. The detailed formulations of $\nabla D_{\mathcal{H}}(J_s, J_t)$ are described as:

$\nabla D_{\mathcal{H}}(J_s, J_t) = \nabla \mathrm{MMD}_{\mathcal{H}}^2(P_s, P_t) + \sum_{c=1}^{C} \nabla \mathrm{MMD}_{\mathcal{H}}^2(Q_s^{(c)}, Q_t^{(c)})$    (13)

$\nabla \mathrm{MMD}_{\mathcal{H}}^2(P_s, P_t) = \begin{cases} \frac{2}{n_s}\Big(\frac{1}{n_s}\sum_{i=1}^{n_s}\phi(x_i^s) - \frac{1}{n_t}\sum_{j=1}^{n_t}\phi(x_j^t)\Big), & x \in \mathcal{D}_s \\ \frac{2}{n_t}\Big(\frac{1}{n_t}\sum_{j=1}^{n_t}\phi(x_j^t) - \frac{1}{n_s}\sum_{i=1}^{n_s}\phi(x_i^s)\Big), & x \in \mathcal{D}_t \end{cases}$    (14)

and

$\nabla \mathrm{MMD}_{\mathcal{H}}^2(Q_s^{(c)}, Q_t^{(c)}) = \begin{cases} \frac{2}{n_s^{(c)}}\Big(\frac{1}{n_s^{(c)}}\sum_{x_i^s \in \mathcal{D}_s^{(c)}}\phi(x_i^s) - \frac{1}{n_t^{(c)}}\sum_{x_j^t \in \mathcal{D}_t^{(c)}}\phi(x_j^t)\Big), & x \in \mathcal{D}_s^{(c)} \\ \frac{2}{n_t^{(c)}}\Big(\frac{1}{n_t^{(c)}}\sum_{x_j^t \in \mathcal{D}_t^{(c)}}\phi(x_j^t) - \frac{1}{n_s^{(c)}}\sum_{x_i^s \in \mathcal{D}_s^{(c)}}\phi(x_i^s)\Big), & x \in \mathcal{D}_t^{(c)} \end{cases}$    (15)

Algorithm 1 Training procedure of DTN with JDA
Input: the labeled dataset $\mathcal{D}_s = \{x_i^s, y_i^s\}_{i=1}^{n_s}$ in the source domain, the unlabeled dataset $\mathcal{D}_t = \{x_i^t\}_{i=1}^{n_t}$ in the target domain, the architecture of the deep neural network, and the trade-off parameter $\lambda$.
Output: the transferred network and the predicted labels for the target samples.
1: begin
2:   Train a base deep network on the source dataset $\mathcal{D}_s$
3:   Predict the pseudo labels $\hat{Y}_0 = \{y_i^t\}_{i=1}^{n_t}$ for the target samples with the base network
4:   repeat
5:     $j = j + 1$
6:     Compute the regularization term of JDA according to (10)
7:     Optimize the network with respect to (11)
8:     Update the pseudo labels $\hat{Y}_j$ with the optimized network
9:   until convergence or $\hat{Y}_j = \hat{Y}_{j-1}$
10:  Check the diagnosis performance of the transferred network on the other target samples.
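A condensed sketch of Algorithm 1 follows, under the stated assumptions: the model and jda_regularizer from the earlier sketches, a target loader that also yields sample indices so pseudo labels can be looked up, and automatic differentiation standing in for the hand-derived gradients (12)–(15). All names are illustrative.

import torch
import torch.nn.functional as F

def adapt_dtn(model, optimizer, source_loader, target_loader,
              x_target, n_classes, lam=1.0, max_rounds=50):
    with torch.no_grad():
        pseudo = model(x_target)[0].argmax(dim=1)            # step 3: initial pseudo labels
    for _ in range(max_rounds):                              # steps 4-9
        for (x_s, y_s), (x_t, idx_t) in zip(source_loader, target_loader):
            logits_s, phi_s = model(x_s)
            _, phi_t = model(x_t)
            loss = F.cross_entropy(logits_s, y_s) \
                 + lam * jda_regularizer(phi_s, y_s, phi_t, pseudo[idx_t], n_classes)
            optimizer.zero_grad()
            loss.backward()                                  # autograd replaces Eqs. (12)-(15)
            optimizer.step()                                 # step 7: optimize w.r.t. (11)
        with torch.no_grad():
            new_pseudo = model(x_target)[0].argmax(dim=1)    # step 8: refresh pseudo labels
        if torch.equal(new_pseudo, pseudo):                  # step 9: labels have settled
            break
        pseudo = new_pseudo
    return model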
4.3. Training strategy

The training procedure of this framework mainly consists of two parts: (1) pre-training on the labeled source data and (2) network adaptation in the target domain with both the labeled source data and the unlabeled target data as input. It should be noted that the dataset is generally divided into small batches, which are fed into the network for training. A desirable batch size should be as large as possible to cover the variance of the whole dataset, whereas a too-large batch size will increase the calculation burden; it is a trade-off between transfer performance and computational effectiveness. Besides, the same number of samples from the source and target domains is used for network adaptation. When the data sizes differ across domains, re-sampling can be applied to the smaller dataset to keep the sample sizes in the source and target domains the same. The whole adaptation procedure of DTN with JDA is listed in Algorithm 1.

5. Experiments

In this section, experiments on three mechanical fault datasets are conducted to demonstrate the efficiency, superiority and practical value of the proposed transfer framework. Mechanical equipment may exhibit diverse failure modes during long-time operation, and different faults may present different characteristics. Studies of intelligent fault diagnosis focus on classifying signal samples from different health conditions and making diagnostic decisions automatically. In the three datasets, faults that frequently occur in mechanical systems are artificially introduced to the machines so as to simulate diverse health conditions. The vibration signals under the diverse machine conditions are collected. The performance of the proposed method and the comparative methods can then be tested on these fault datasets.

5.1. Data description

(1) Wind Turbine Fault Dataset: The first dataset is from our wind turbine experimental platform, whose schematic diagram is illustrated in Fig. 4. This dataset contains ten machine conditions: health, front bearing pedestal loosening (FB), back bearing pedestal loosening (BB), rolling element fault of the front bearing (RF), inner-ring fault of the front bearing (IF), outer-ring fault of the front bearing (OF), misalignment in the horizontal direction (MH), misalignment in the vertical direction (MV), variation in the airfoil of blades (VB) and yaw fault (YF) (corresponding labels 0–9). These faults can basically simulate the typical failure modes from the wind wheel to the drive chain of a real wind turbine. To create working conditions close to reality, we change the power of the axial flow fan in the wind tunnel to generate varying loading conditions (i.e., varying wind speeds). The experiments are performed under six different wind speeds ranging from 5.8 m/s to 11.5 m/s (loads 0–5), with the corresponding wind wheel speeds ranging from 255 rpm to 300 rpm. The raw vibration data are collected by accelerometers at a sampling frequency of 20 kHz. The time-domain waveforms of the diverse machine conditions under load 5 are presented in Fig. 5. When the machine is healthy (condition 0), the vibration amplitude clearly remains at a low level and the signal components related to the rotating frequency are dominant. When faults are introduced to the machine (conditions 1 to 9), obvious impulse characteristics appear, especially for the bearing-related faults (conditions 3 to 5), and the signal components are more complex.

For clarity, the notation A→B is used to represent the transfer task from source dataset A to target dataset B. In the wind turbine fault dataset, we aim to explore the transfer ability of the proposed framework across diverse operating conditions. Consequently, six transfer tasks are designed for empirical evaluation (listed in Table 1). For instance, in A→B, the source dataset A contains the samples of the ten machine conditions under loads 0–2, while the target dataset B is composed of the samples under loads 3–5. In Table 1, the unlabeled target samples are utilized for domain adaptation; no label information can be used in this process. After domain adaptation, another set of labeled testing target samples is used to evaluate the performance of the transferred diagnosis model.

Table 1
Designed transfer tasks across diverse operating conditions.

Transfer tasks   Source domain   Target domain   Unlabeled target samples   Testing target samples   Machine conditions
A→B              Load 0–2        Load 3–5        24000                      4000                     10 conditions (labels 0–9)
B→A              Load 3–5        Load 0–2        24000                      4000
C→D              Load 2          Load 3–5        24000                      4000
D→C              Load 3–5        Load 2          12000                      4000
E→F              Load 2          Load 5          12000                      4000
F→E              Load 5          Load 2          12000                      4000

(2) Bearing Fault Dataset: The bearing fault dataset is an open-access dataset from Case Western Reserve University [48]. Four different bearing conditions are considered: health, outer ring fault (OF), rolling element fault (RF) and inner ring fault (IF) (corresponding labels 0–3). The experiments are performed under four motor speeds (1797, 1772, 1750 and 1730 rpm) at a sampling frequency of 12 kHz. For each kind of fault, single-point faults with different severity levels are introduced to the test bearings. In most existing studies, samples with the same fault type but different severity levels are treated as distinct categories. Indeed, the signal characteristics of a certain fault type always vary with the severity level. Therefore, we aim to investigate the performance of the proposed transfer framework across diverse fault severities on this dataset.

For simplicity, we select two fault severity levels with fault diameters (FD) of 0.18 mm and 0.53 mm to construct the transfer tasks G→H and H→G. Dataset G is composed of the samples of the four bearing conditions under the four motor speeds, where the fault diameter of the OF, RF and IF cases is 0.18 mm. Dataset H is formed by the health samples and the fault samples with 0.53 mm fault diameter.

(3) Gearbox Fault Dataset: The gearbox fault dataset, collected from our single-stage cylindrical straight gearbox test rig (as shown in Fig. 6), is analyzed in a scenario where the domain discrepancy between specific fault types is expected to be bridged by transfer learning. Sometimes, it may be more practical to confirm the location of a failure instead of its specific type. Considering the example of a gearbox, identifying the fault location, such as a gear fault or a bearing fault, is beneficial for monitoring and maintenance. That said, certain types of fault occurring in one component, such as a bearing inner race fault or outer race fault, can be defined as one category. Besides, it may be impossible to obtain fault data for various fault types and train a diagnosis model with high accuracy for a complex mechanical system. Consequently, the transfer performance across similar but diverse fault types is of great practical significance. In the experiments, we introduced two types of faults, i.e., gear root crack (RC) and tooth surface spalling (TS), to the high-speed cylindrical gearing, and another two types of faults, i.e., outer race fault (OR) and roller fault (RO), to the high-speed conical bearing. The vibration data are collected at a sampling frequency of 20 kHz.

We state three conditions of the gearbox in this dataset: health, gear fault and bearing fault (corresponding labels 0–2), and design two transfer tasks: I→J and J→I. Dataset I contains the samples of health, bearing OR and gear RC. Dataset J is formed by the samples of health, bearing RO and gear TS (see Table 2).

5.2. Comparison studies

(1) Comparison methods: The proposed framework is compared with several state-of-the-art methods in the field of intelligent fault diagnosis: (1) SVM [26]; (2) random forest (RF) [26]; (3) empirical mode decomposition analysis (EMD) [1]; (4) CNN [13,33]; (5) TJM [38]; (6) TCA [39]; (7) JDA [40]; (8) DTN with MDA; and (9) DTN with JDA (this work). These baseline methods can be categorized into two subsettings: the standard diagnosis methods (1)–(4) and the transfer learning based techniques (5)–(9).

In (1)–(2), popular statistical features, such as root mean square and kurtosis, are extracted from the raw data in the time and frequency domains to form the input of the classifiers [7,26,49]. In (3), EMD is applied to decompose the raw signal into a sequence of intrinsic mode functions (IMF), and the energy distribution of the first five IMFs is calculated as the input features for the classifier. In (4), the deep learning flow with CNN is adopted. Among the transfer learning based techniques (5)–(9), TJM, TCA and JDA are shallow transfer learning methods, so we likewise extract the statistical features from the raw data, then conduct the unsupervised domain adaptation, and finally make the diagnosis with a classifier. In the deep learning flow, the proposed method is compared with the DTN with MDA method, obtained by removing the CDA term from the objective function. The pre-trained base network resorts to the optimal CNN model in the source domain, that is, the model trained in (4).

(2) Implementation details: For (1)–(4), we use the labeled source data to train the model, which is then applied to diagnose
Fig. 6. The single-stage cylindrical straight gearbox test rig: (a) Schematic diagram of gearbox test rig; (b) The damaged components.
Table 2
Designed transfer tasks across diverse fault severities and types.

Transfer tasks   Source domain   Target domain   Unlabeled target samples   Testing target samples   Machine conditions
G→H              FD 0.18         FD 0.53         12000                      4000                     4 conditions (labels 0–3)
H→G              FD 0.53         FD 0.18         12000                      4000                     4 conditions (labels 0–3)
I→J              H, OR, RC       H, RO, TS       12000                      4000                     3 conditions (labels 0–2)
J→I              H, RO, TS       H, OR, RC       12000                      4000                     3 conditions (labels 0–2)
Fig. 7. Comparison of the diagnosis accuracy of diverse methods on ten transfer tasks.
Table 4
Diagnosis accuracy (%) on ten transfer tasks with different methods.
Methods A→B B→A C→D D→C E→F F→E G→H H→G I→J J→I Avg
SVM 72.8 ± 1.9 74.5 ± 1.5 89.8 ± 0.8 90.4 ± 1.2 62.6 ± 1.0 63.4 ± 1.5 73.4 ± 1.3 75.7 ± 1.3 78.2 ± 0.9 46.6 ± 0.6 72.7 ± 1.2
RF 84.4 ± 0.9 78.3 ± 2.4 89.1 ± 0.3 92.7 ± 1.1 60.9 ± 0.7 61.0 ± 0.7 80.8 ± 1.1 49.9 ± 1.3 69.4 ± 0.9 69.2 ± 7.8 73.6 ± 1.7
EMD 79.8 ± 0.9 72.2 ± 0.9 77.7 ± 4.2 71.7 ± 6.7 64.5 ± 6.2 61.8 ± 1.2 72.5 ± 4.4 64.8 ± 10.7 57.0 ± 6.9 43.4 ± 2.8 66.5 ± 4.5
CNN 91.8 ± 0.3 93.8 ± 0.4 89.4 ± 3.1 94.9 ± 0.2 80.4 ± 4.0 82.1 ± 4.6 81.7 ± 5.3 64.5 ± 10.0 79.5 ± 1.5 72.3 ± 1.3 83.0 ± 3.1
TJM 87.4 ± 1.9 81.4 ± 2.4 92.5 ± 1.3 93.9 ± 0.4 78.3 ± 5.4 67.6 ± 1.3 92.2 ± 5.8 96.0 ± 6.9 77.1 ± 2.2 58.7 ± 5.2 82.5 ± 3.3
TCA 87.8 ± 1.7 79.0 ± 2.8 88.7 ± 0.5 92.9 ± 0.5 76.1 ± 4.0 68.6 ± 1.2 92.8 ± 7.2 94.4 ± 8.5 75.5 ± 4.7 56.4 ± 6.9 81.2 ± 3.8
JDA 86.0 ± 2.2 81.0 ± 2.6 91.3 ± 1.6 94.1 ± 1.3 83.6 ± 2.8 61.4 ± 3.4 93.9 ± 10.9 94.8 ± 11.0 79.2 ± 1.2 55.4 ± 5.4 82.1 ± 4.2
DTN.w.MDA 95.9 ± 2.9 96.9 ± 0.4 94.0 ± 1.1 97.4 ± 0.4 87.3 ± 1.2 87.4 ± 1.7 81.3 ± 5.0 68.2 ± 7.9 80.1 ± 1.0 83.6 ± 1.8 87.2 ± 2.3
DTN.w.JDA 98.3 ± 0.2 98.9 ± 0.5 96.6 ± 0.2 98.5 ± 0.2 96.8 ± 0.5 97.3 ± 0.2 99.3 ± 0.4 97.1 ± 8.7 99.9 ± 0.1 96.3 ± 0.5 97.9 ± 1.2
Table 5
Missing alarm rate (%) on ten transfer tasks with different methods.
Methods A→B B→A C→D D→C E→F F→E G→H H→G I→J J→I Avg
SVM 18.4 ± 2.2 17.7 ± 1.1 8.0 ± 0.6 7.5 ± 0.8 29.9 ± 0.9 38.3 ± 2.2 27.0 ± 1.1 14.2 ± 2.2 15.4 ± 0.8 52.5 ± 0.4 22.9 ± 1.2
RF 12.4 ± 0.8 11.6 ± 1.7 8.5 ± 0.2 5.0 ± 0.6 40.0 ± 1.0 39.5 ± 1.5 12.7 ± 0.9 62.6 ± 0.4 24.3 ± 6.2 28.6 ± 11.7 24.5 ± 2.5
EMD 19.5 ± 0.9 27.2 ± 0.9 21.3 ± 3.7 29.0 ± 7.0 35.6 ± 7.2 37.4 ± 1.3 27.8 ± 5.0 35.4 ± 13.2 44.6 ± 6.6 58.4 ± 2.9 33.6 ± 4.9
CNN 8.2 ± 0.3 6.1 ± 0.4 10.6 ± 3.0 5.0 ± 0.2 19.7 ± 4.2 17.9 ± 4.7 18.3 ± 5.3 35.3 ± 9.9 21.0 ± 2.0 27.1 ± 1.6 16.9 ± 3.2
TJM 9.8 ± 1.0 13.4 ± 2.0 6.4 ± 1.1 5.4 ± 0.4 16.3 ± 3.4 29.6 ± 5.4 6.0 ± 4.2 2.5 ± 3.8 17.0 ± 2.7 25.7 ± 4.7 13.2 ± 2.9
TCA 9.9 ± 1.2 15.6 ± 2.5 10.4 ± 0.9 6.3 ± 0.5 19.6 ± 1.4 25.3 ± 5.0 5.3 ± 4.4 3.4 ± 4.7 16.1 ± 2.7 31.0 ± 11.2 14.3 ± 3.5
JDA 11.9 ± 1.5 13.9 ± 2.1 7.1 ± 1.5 5.4 ± 1.1 14.1 ± 2.2 46.6 ± 2.5 8.3 ± 15.7 7.7 ± 16.2 14.9 ± 1.0 35.2 ± 11.1 16.5 ± 5.5
DTN.w.MDA 4.2 ± 0.3 3.2 ± 0.5 6.0 ± 1.0 2.6 ± 0.4 12.8 ± 1.2 12.7 ± 1.8 18.9 ± 5.0 32.2 ± 7.5 19.8 ± 1.0 16.4 ± 1.8 12.9 ± 2.1
DTN.w.JDA 1.7 ± 0.2 1.1 ± 0.5 3.4 ± 0.2 1.6 ± 0.2 3.2 ± 0.5 2.6 ± 0.5 0.8 ± 0.4 2.8 ± 8.5 0.1 ± 0.1 3.7 ± 0.5 2.1 ± 1.2
Table 6
False alarm rate (%) on ten transfer tasks with different methods.
Methods A→B B→A C→D D→C E→F F→E G→H H→G I→J J→I Avg
SVM 22.1 ± 1.9 25.8 ± 1.5 11.8 ± 0.7 11.1 ± 1.3 36.1 ± 0.9 36.6 ± 1.3 26.8 ± 1.0 24.3 ± 0.3 22.9 ± 1.1 53.4 ± 0.5 27.1 ± 1.1
RF 15.8 ± 0.9 21.4 ± 2.4 11.2 ± 0.4 8.7 ± 1.1 39.2 ± 0.7 38.9 ± 0.7 19.3 ± 1.1 49.3 ± 1.3 32.8 ± 0.6 31.6 ± 7.6 26.8 ± 1.7
EMD 20.1 ± 0.9 27.6 ± 0.9 21.9 ± 3.8 27.9 ± 6.6 34.8 ± 6.6 37.0 ± 2.1 27.0 ± 4.0 35.3 ± 11.2 44.0 ± 6.7 57.7 ± 3.6 33.3 ± 4.6
CNN 7.3 ± 0.3 5.4 ± 0.3 9.5 ± 2.6 4.0 ± 0.2 17.2 ± 3.8 16.6 ± 3.5 17.5 ± 5.3 48.1 ± 9.7 12.7 ± 0.7 18.1 ± 1.6 15.6 ± 2.8
TJM 12.6 ± 1.8 18.4 ± 2.4 7.6 ± 1.4 6.7 ± 0.3 20.6 ± 5.0 32.5 ± 1.5 7.7 ± 5.9 3.9 ± 6.6 22.8 ± 1.9 41.8 ± 5.0 17.5 ± 3.2
TCA 12.3 ± 1.7 21.2 ± 2.8 11.6 ± 0.6 7.4 ± 0.6 23.3 ± 3.4 30.5 ± 1.2 7.3 ± 7.2 5.6 ± 8.5 25.1 ± 4.4 43.6 ± 7.4 18.8 ± 3.8
JDA 14.1 ± 2.2 19.2 ± 2.5 8.8 ± 1.5 6.3 ± 1.4 15.7 ± 2.7 43.8 ± 5.0 5.9 ± 10.7 5.0 ± 10.5 21.1 ± 1.1 44.6 ± 4.5 18.5 ± 4.2
DTN.w.MDA 3.8 ± 0.2 2.7 ± 0.3 5.5 ± 0.9 2.4 ± 0.2 10.8 ± 0.9 11.5 ± 1.0 18.4 ± 5.2 39.6 ± 2.8 19.7 ± 1.0 12.5 ± 0.9 12.7 ± 1.3
DTN.w.JDA 1.6 ± 0.2 1.1 ± 0.4 2.3 ± 1.0 1.4 ± 0.2 2.9 ± 0.3 2.5 ± 0.3 0.7 ± 0.3 4.4 ± 13.0 0.1 ± 0.1 3.3 ± 0.4 2.0 ± 1.6
random tests, where the training set and testing set are randomly split. To comprehensively show the capabilities of the proposed method, three performance indices, i.e., average diagnosis accuracy, missing alarm rate (MAR) and false alarm rate (FAR) [50], are reported.
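The paper defers the precise MAR/FAR conventions to [50]; the sketch below therefore adopts one common reading for multi-class diagnosis — with label 0 denoting the health condition, a missing alarm is a fault sample predicted as health, and a false alarm is a health sample predicted as any fault class. This interpretation, like the function name, is an assumption made for illustration.

import numpy as np

def diagnosis_indices(y_true, y_pred, health=0):
    """Average accuracy, MAR and FAR under the assumed conventions above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accuracy = float(np.mean(y_true == y_pred))
    is_fault = y_true != health
    mar = float(np.mean(y_pred[is_fault] == health))   # fault samples missed as health
    far = float(np.mean(y_pred[~is_fault] != health))  # health samples flagged as faults
    return accuracy, mar, far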
Several encouraging observations are noted first. (1) The DTN with JDA method in this work significantly outperforms the other methods. The stable average accuracies and low root-mean-square errors under the different transfer scenarios (over 96% for all tasks) validate the effective and robust domain adaptation ability of the proposed method. The better performance of DTN with JDA can also be seen in the MAR and FAR (much lower than the comparative methods). (2) The diagnosis performance of the standard methods (the first four) is much improved with domain adaptation in most cases. In particular, the average accuracy of DTN with JDA is 97.9%, a 14.9% transfer improvement over the baseline CNN at 83.0%. (3) The deep learning methods always present performance superior to the shallow methods, whether in the standard diagnosis framework or the transfer learning framework, confirming their extraordinary feature learning and representation capacity as well as their stronger feature transferability. (4) By jointly adapting the marginal distribution and the conditional distribution, the DTN with JDA in this work significantly promotes the adaptation ability of the previous DTN with MDA, especially under the transfer scenarios of diverse fault severity levels and diverse fault types.

To show the real-time practicality of the proposed framework, the computational complexity of the diverse methods in task A→B is compared in Table 7. Generally, the deep learning methods require higher computational complexity but achieve better performance than the shallow methods, and thus only the results of the three deep methods are listed here. Since DTN with JDA calculates more intermediate variables in CDA, it needs more computing time and memory than the standard CNN and DTN with MDA. This work focuses on investigating the effectiveness of DTN. The training process is implemented in a batch manner; an online learning manner could instead train the deep network from a sequential data flow. Transforming the transfer diagnosis framework from batch learning to online learning would largely reduce the real-time computing time and memory [51].

Table 7
Computation complexity for diverse deep methods with the wind turbine dataset in task A→B.

Methods      Time (s/epoch)   Memory (MB)
CNN          4.52             1016.6
DTN.w.MDA    7.12             1548.0
DTN.w.JDA    33.78            1548.5

6.2. Network visualization

In order to give a clear and intuitive understanding of the proposed framework, t-distributed stochastic neighbor embedding (t-SNE) is utilized for network visualization. For comparison, the visualization results of the standard CNN (that is, the pre-trained base network for further domain adaptation), DTN with MDA and DTN with JDA on three transfer tasks are presented in Figs. 10–12, respectively.
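A minimal sketch of this visualization step, assuming scikit-learn and matplotlib are available and using illustrative names: the last hidden fully-connected features of the source and target samples are embedded jointly with t-SNE and scatter-plotted by domain.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(phi_s, phi_t):
    """Joint 2-D t-SNE embedding of source (o) and target (x) features."""
    emb = TSNE(n_components=2).fit_transform(np.vstack([phi_s, phi_t]))
    n_s = len(phi_s)
    plt.scatter(emb[:n_s, 0], emb[:n_s, 1], marker='o', label='source')
    plt.scatter(emb[n_s:, 0], emb[n_s:, 1], marker='x', label='target')
    plt.legend()
    plt.show()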
Fig. 10. Network visualization in task E→F: t-SNE is applied on the feature representation of the last hidden fully-connected layer for both the source data and target data. There are in total 10 categories in the wind turbine dataset (corresponding labels 0–9). S represents the samples in the source domain and T the target domain. For instance, c5-T corresponds to the samples of category 5 (inner-ring fault of bearing, as introduced above) in the target domain.

Task E→F is to realize the domain adaptation across diverse operating conditions. First, as shown in Fig. 10(a), most of the 10 categories of source samples are well separated by the standard CNN, while the feature distributions of the same category in the source and target domains are not aligned well. Even worse, large overlapping areas can be observed among the target samples of certain categories, such as 2, 3 and 8. These observations suggest that the domain discrepancy exists not only in the marginal distribution, but also in the conditional distribution, which may result in degraded diagnosis results in the conventional framework. In Fig. 10(b) and (c), under the transfer learning framework, we can find an obvious improvement in distribution adaptation. In particular, the same category is aligned very well between domains by DTN with JDA, and a consonant and legible discriminant structure can be observed for both the source and target categories.

Fig. 11. Network visualization in task G→H: there are in total 4 categories in the bearing dataset (corresponding labels 0–3).

Task G→H is to adapt the distribution across diverse fault severity levels. In Fig. 11(a), the standard CNN assembles the distributions of OF and IF in the target domain, and the source OF and target IF are mixed, explaining the unsatisfactory accuracy in Table 4. By contrast, in Fig. 11(c), the distributions of the same category in the source and target domains are well matched with JDA. Interestingly, in Fig. 11(b), we can observe that MDA relocates the target OF and IF away from the corresponding distributions in the source domain. Naturally, the marginal distribution only reflects the cluster structure of the feature distribution over all categories, and MDA aims to explicitly reduce the distance between the cluster centers of the different domains. When the conditional distributions are the same across domains, MDA helps to correct the overall shift of the feature space. However, in the field of fault diagnosis, differences in the conditional distributions may be prevalent. Consequently, unlike single MDA, the JDA that simultaneously adapts the marginal distribution and the conditional distribution is promising in these cases. As shown in Fig. 12, similar results can be found in transfer task I→J across diverse fault types. Both the transfer accuracy and the network visualization show that JDA supersedes the performance of MDA.

Fig. 12. Network visualization in task I→J: there are in total 3 categories in the gearbox dataset (corresponding labels 0–2).

6.3. Convergence analysis

Since an additional regularization term is appended to the objective function for the transfer training, a convergence analysis is necessary to illustrate the transfer ability. The transfer loss curves and test accuracy curves for DTN with JDA and DTN with MDA in task G→H are plotted in Figs. 13 and 14, respectively. Here, we display the ℓce term and the regularization term separately for ease of observation.

Fig. 13. The transfer loss curves and test accuracy via DTN with JDA.
Fig. 14. The transfer loss curves and test accuracy via DTN with MDA.

At the beginning, the losses of the regularization term for the two methods are both around 0.1, and those of the ℓce term are almost negligible. From Fig. 13, the loss of the JDA regularization term converges to a certain degree after a series of iterations, accompanied by a continued increase of the test accuracy on the target data. However, from Fig. 14, the loss of the MDA regularization term finally fluctuates at a high level, and the test accuracy is confined around 87%. Besides, it is clear that the loss of the ℓce term presents an abrupt increase after around 300 iterations. Essentially, the ℓce term and the regularization term in the objective function together try to reduce the domain discrepancy while preserving the original discriminant structure in the source domain. One possible reason for the jump is that the gradient direction of the parameter optimization for the regularization term conflicts with that of the ℓce term, causing a significant spike in the transfer loss and test accuracy.
The analysis reveals that using the JDA regularization term is capable of facilitating the network training and guaranteeing a stronger feature transferability.

6.4. Discussion

In the traditional intelligent fault diagnosis framework, whether for the shallow methods or the deep learning methods, the diagnosis performance varies a lot across tasks. For instance, in the first six tasks on the wind turbine dataset, RF obtains the best performance in tasks C→D and D→C (89.1% and 92.7%), while giving degraded results in tasks E→F and F→E (60.9% and 61.0%). This is reasonable because the operating conditions of C and D are closer than those of E and F, and thus the data of C and D share a more similar feature space, leading to a higher diagnosis accuracy. This phenomenon actually reveals the inherent drawback of the conventional diagnosis framework: the feature distribution discrepancy between the source domain and the target domain is neglected. Its success relies strongly on the similarity between the source and target distributions, whereas a large discrepancy across domains is common and inevitable in practical diagnosis applications. The proposed transfer diagnosis framework provides an effective measure for resolving the problem mentioned above, and DTN with JDA achieves the desirable performance both in diagnosis indices and in feature visualization.

Among the domain adaptation methods, the DTN achieves performance superior to the shallow domain adaptation methods, such as TJM, TCA and JDA. The shallow methods require manual feature extraction, which may suffer from the interference of redundant and irrelevant features. More importantly, this process is not flexible and not able to meet the need for adaptivity. DTN establishes the domain adaptation in a deep learning flow, and is capable of adaptively learning intrinsic fault characteristics. It is also worth noting that the complexity of the domain adaptation process always changes with the scenario. In the easy transfer tasks, e.g., C→D and D→C, all of the transfer learning based techniques obtain relatively satisfactory results. However, in several hard tasks, e.g., E→F and J→I, where the source and target data could be substantially dissimilar, the performance drop of the comparative transfer methods, such as DTN with MDA, convincingly illustrates that the difficulty of domain adaptation increases accordingly. The comprehensive assessments under diverse transfer scenarios further demonstrate the pivotal role of JDA in DTN.

This work proposes a novel diagnosis framework that considers deep feature learning and cross-domain feature distribution alignment simultaneously. It may overcome the shortcomings of existing studies and has a certain significance for practical diagnosis applications. Although the effectiveness of the proposed DTN with JDA has been demonstrated in terms of diagnosis indices, feature visualization and loss convergence on ten experimental tasks, it still has limits in its assumed conditions, where the faults occur in both the source and target domains and the fault labels are also the same. However, the monitoring data of industrial processes are mostly collected under health conditions, and the occurring fault types may differ from the known ones in the source domain. As a result, these factors introduce additional difficulties into the application of the transfer diagnosis framework. The integration of data cleaning and selection techniques into this framework is therefore of great significance.

7. Conclusion

Intelligent fault diagnosis in real industrial applications suffers from the difficulty of model re-training due to the discrepancy between the source domain (where the model is learnt) and the target domain (where the model is applied). However, re-training the model is challenging and probably unrealistic owing to the lack of sufficient labeled data in practical applications. To address this issue, this work presents a DTN that takes advantage of a pre-trained network from the source domain and transfers the model with unlabeled data from the target domain, where a novel domain adaptation approach, JDA, is presented. Through extensive experiments on three datasets, the results show that the DTN with JDA outperforms the state-of-the-art approaches. Compared with the shallow methods, i.e., SVM, RF, EMD, TJM, TCA and JDA, DTN with JDA achieves 25.2%, 24.3%, 31.4%, 15.4%, 16.7% and 15.8% improvements in average accuracy on the ten diagnosis tasks. In the deep learning framework, DTN also effectively increases the diagnosis accuracy from 83.0% and 87.2% to 97.9% in comparison with the basic CNN and DTN with MDA. The network visualization further provides an interpretation of the diagnosis results, and DTN with JDA is shown to obtain a more accurate feature distribution alignment across domains. Moreover, the DTN with JDA presents smooth convergence and avoids negative adaptation in comparison with MDA.

Using DTN with JDA, it is promising that the diagnosis models learnt from experimental or real datasets can be transferred to new but similar applications in a more efficient and accurate way, which could benefit many kinds of industrial applications. Further work will pursue (i) quantitative assessment approaches for the similarity and transferability between diverse domains, (ii) application to imbalanced distributions of machine conditions and (iii) hyper-parameter selection with intelligent optimization algorithms [52,53].

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant Nos. 11572167 and 11802152). The authors would like to express their sincere gratitude to Mr. Shaohua Li for his contributions to the acquisition of experimental data.

References

[1] Lei Y, He Z, Zi Y. EEMD method and WNN for fault diagnosis of locomotive roller bearings. Expert Syst Appl 2011;38(6):7334–41.
[2] Jiang D, Liu C. Machine condition classification using deterioration feature extraction and anomaly determination. IEEE Trans Reliab 2011;60(1):41–8.
[3] Cui L, Huang J, Hao Z, Zhang F. Research on the meshing stiffness and vibration response of fault gears under an angle-changing crack based on the universal equation of gear profile. Mech Mach Theory 2016;105:554–67.
[4] Gong X, Qiao W. Current-based mechanical fault detection for direct-drive wind turbines via synchronous sampling and impulse detection. IEEE Trans Ind Electron 2015;62(3):1693–702.
[5] Yunusa-Kaltungo A, Sinha JK, Nembhard AD. A novel fault diagnosis technique for enhancing maintenance and reliability of rotating machines. Struct Health Monit 2015;14(6):231–62.
[6] Cui L, Huang J, Zhang F, Chu F. HVSRMS localization formula and localization law: Localization diagnosis of a ball bearing outer ring fault. Mech Syst Signal Process 2019;120:608–29.
[7] Shen Z, Chen X, Zhang X, He Z. A novel intelligent gear fault diagnosis model based on EMD and multi-class TSVM. Measurement 2012;45(1):30–40.
[8] Li Y, Wang X, Liu Z, Liang X, Si S. The entropy algorithm and its variants in the fault diagnosis of rotating machinery: A review. IEEE Access 2018;6:66723–41.
[9] Li Y, Wang X, Si S, Huang S. Entropy based fault classification using the Case Western Reserve University data: A benchmark study. IEEE Trans Reliab 2019. http://dx.doi.org/10.1109/TR.2019.2896240.
[10] Verstraete D, Ferrada A, Droguett EL, Meruane V, Modarres M. Deep learning enabled fault diagnosis using time-frequency image analysis of rolling element bearings. Shock Vib 2017;2017:1–17.
[11] Feng J, Lei Y, Guo L, Lin J, Xing S. A neural network constructed by deep learning technique and its application to intelligent fault diagnosis of machines. Neurocomputing 2018;272:619–28.
[12] Jia F, Lei Y, Lin J, Zhou X, Lu N. Deep neural networks: A promising tool for fault characteristic mining and intelligent diagnosis of rotating machinery with massive data. Mech Syst Signal Process 2016;72–73:303–15.
[13] Han T, Liu C, Yang W, Jiang D. A novel adversarial learning framework in deep convolutional neural network for intelligent diagnosis of mechanical faults. Knowl-Based Syst 2019;165:474–87.
[14] Wen L, Li X, Gao L, Zhang Y. A new convolutional neural network based data-driven fault diagnosis method. IEEE Trans Ind Electron 2017;65(7):5990–8.
[15] Liu R, Yang B, Zio E, Chen X. Artificial intelligence for fault diagnosis of rotating machinery: A review. Mech Syst Signal Process 2018;108:33–47.
[16] Cerrada M, Sánchez RV, Li C, Pacheco F, Cabrera D, Oliveira JVD, Vásquez RE. A review on data-driven fault severity assessment in rolling bearings. Mech Syst Signal Process 2018;99:169–96.
[17] Oquab M, Bottou L, Laptev I, Sivic J. Learning and transferring mid-level image representations using convolutional neural networks. In: IEEE conference on computer vision and pattern recognition. 2014, p. 1717–24.
[18] Mun S, Shin M, Shon S, Kim W, Han DK, Ko H. DNN transfer learning based non-linear feature extraction for acoustic event classification. IEICE Trans Inf Syst 2017;100(9).
[19] Qureshi AS, Khan A, Zameer A, Usman A. Wind power prediction using deep neural network based meta regression and transfer learning. Appl Soft Comput 2017;58:742–55.
[20] Khatami A, Babaie M, Tizhoosh HR, Khosravi A, Nguyen T, Nahavandi S. A sequential search-space shrinking using CNN transfer learning and a radon projection pool for medical image retrieval. Expert Syst Appl 2018;100:224–33.
[21] Han T, Liu C, Yang W, Jiang D. Learning transferable features in deep convolutional neural networks for diagnosing unseen machine conditions. ISA Trans 2019. http://dx.doi.org/10.1016/j.isatra.2019.03.017.
[22] Wei Y, Zhang Y, Yang Q. Learning to transfer. Eprint arXiv (2017).
[23] Liu C, Jiang D, Yang W. Global geometric similarity scheme for feature selection in fault diagnosis. Expert Syst Appl 2014;41(8):3585–95.
[24] Zhao C, Feng Z, Wei X, Qin Y. Sparse classification based on dictionary learning for planet bearing fault identification. Expert Syst Appl 2018;108:233–45.
[25] Lei Y, Jia F, Lin J, Xing S, Ding SX. An intelligent fault diagnosis method using unsupervised feature learning towards mechanical big data. IEEE Trans Ind Electron 2016;63(5):3137–47.
[26] Han T, Jiang D, Zhao Q, Wang L, Yin K. Comparison of random forest, artificial neural networks and support vector machine for intelligent diagnosis of rotating machinery. Trans Inst Meas Control 2018;40(8):2681–93.
[27] Costilla-Reyes O, Scully P, Ozanyan KB. Deep neural networks for learning spatio-temporal features from tomography sensors. IEEE Trans Ind Electron 2018;65(1):645–53.
[28] Han T, Liu C, Yang W, Jiang D. An adaptive spatiotemporal feature learning approach for fault diagnosis in complex systems. Mech Syst Signal Process 2019;117:170–87.
[29] Jiao J, Zhao M, Lin J, Zhao J. A multivariate encoder information based convolutional neural network for intelligent fault diagnosis of planetary gearboxes. Knowl-Based Syst 2018.
[30] Lu C, Wang ZY, Qin WL, Ma J. Fault diagnosis of rotary machinery components using a stacked denoising autoencoder-based health state identification. Signal Process 2017;130(C):377–88.
[31] Shao H, Jiang H, Wang F, Wang Y. Rolling bearing fault diagnosis using adaptive deep belief network with dual-tree complex wavelet packet. ISA Trans 2017;187–201.
[32] Jing L, Zhao M, Li P, Xu X. A convolutional neural network based feature learning and fault diagnosis method for the condition monitoring of gearbox. Measurement 2017;111:1–10.
[33] Zhang W, Peng G, Li C, Chen Y, Zhang Z. A new deep learning model for fault diagnosis with good anti-noise and domain adaptation ability on raw vibration signals. Sensors 2017;17(2):425.
[34] Liu R, Meng G, Yang B, Sun C, Chen X. Dislocated time series convolutional neural architecture: An intelligent fault diagnosis approach for electric machine. IEEE Trans Ind Inf 2017;13(3):1310–20.
[35] Sun W, Zhao R, Yan R, Shao S, Chen X. Convolutional discriminative feature learning for induction motor fault diagnosis. IEEE Trans Ind Inf 2017;13(3):1350–9.
[36] Pan SJ, Yang Q. A survey on transfer learning. IEEE Trans Knowl Data Eng 2010;22(10):1345–59.
[37] Weiss K, Khoshgoftaar TM, Wang DD. A survey of transfer learning. J Big Data 2016;3(1):9.
[38] Long M, Wang J, Ding G, Sun J, Yu PS. Transfer joint matching for unsupervised domain adaptation. In: IEEE conference on computer vision and pattern recognition. 2014, p. 1410–7.
[39] Pan SJ, Tsang IW, Kwok JT, Yang Q. Domain adaptation via transfer component analysis. IEEE Trans Neural Netw 2011;22(2):199.
[40] Long M, Wang J, Ding G, Sun J, Yu PS. Transfer feature learning with joint distribution adaptation. In: IEEE international conference on computer vision. 2014, p. 2200–7.
[41] Long M, Cao Y, Wang J, Jordan MI. Learning transferable features with deep adaptation networks. Eprint arXiv (2015) 97–105.
[42] Long M, Zhu H, Wang J, Jordan MI. Deep transfer learning with joint adaptation networks. Eprint arXiv (2016).
[43] Ghifary M, Kleijn WB, Zhang M, Balduzzi D, Li W. Deep reconstruction-classification networks for unsupervised domain adaptation. In: European conference on computer vision. 2016, p. 597–613.
[44] Wen L, Gao L, Li X. A new deep transfer learning based on sparse auto-encoder for fault diagnosis. IEEE Trans Syst Man Cybern 2017;1–9.
[45] Lu W, Liang B, Cheng Y, Meng D, Yang J, Zhang T. Deep model based domain adaptation for fault diagnosis. IEEE Trans Ind Electron 2017;64(3):2296–305.
[46] Zhang X, Yu FX, Chang S-F, Wang S. Deep transfer network: Unsupervised domain adaptation. Eprint arXiv, arXiv:1503.00591 (2015).
[47] Yosinski J, Clune J, Bengio Y, Lipson H. How transferable are features in deep neural networks? Eprint arXiv 27 (2014) 3320–3328.
[48] Bearing Data Center. Case Western Reserve University bearing data. http://csegroups.case.edu/bearingdatacenter/pages/download-data-file; 2013.
[49] Rauber TW, Boldt FDA, Varejao FM. Heterogeneous feature models and feature selection applied to bearing fault diagnosis. IEEE Trans Ind Electron 2015;62(1):637–46.
[50] Xu J, Wang J, Izadi I, Chen T. Performance assessment and design for univariate alarm systems based on FAR, MAR, and AAD. IEEE Trans Autom Sci Eng 2012;9(2):296–307.
[51] Wang X, Hou Z, Yu W, Jin Z. Online fast deep learning tracker based on deep sparse neural networks. In: International conference on image and graphics. Springer; 2017, p. 186–98.
[52] Patwal RS, Narang N, Garg H. A novel TVAC-PSO based mutation strategies algorithm for generation scheduling of pumped storage hydrothermal system incorporating solar units. Energy 2018;142:822–37.
[53] Garg H. A hybrid GSA-GA algorithm for constrained optimization problems. Inform Sci 2019;478:499–523.