Marginal Deep Architecture: Stacking Feature Learning Modules to Build Deep Learning Models
Digital Object Identifier 10.1109/ACCESS.2019.2902631
ABSTRACT Recently, many deep models have been proposed in different fields, such as image classifica-
tion, object detection, and speech recognition. However, most of these architectures require a large amount
of training data and employ random initialization. In this paper, we propose to stack feature learning modules
for the design of deep architectures. Specifically, marginal Fisher analysis (MFA) is stacked layer-by-layer
for the initialization and we call the constructed deep architecture marginal deep architecture (MDA). When
implementing the MDA, the weight matrices of MFA are updated layer-by-layer, which is a supervised pre-
training method and does not need a large scale of data. In addition, several deep learning techniques are
applied to this architecture, such as backpropagation, dropout, and denoising, to fine-tune the model. We have
compared MDA with some feature learning and deep learning models on several practical applications,
such as handwritten digits recognition, speech recognition, historical document understanding, and action
recognition. The extensive experiments show that the performance of MDA is better than not only shallow
feature learning models but also related deep learning models in these tasks.
INDEX TERMS Deep architectures, feature learning, marginal Fisher analysis, marginal deep architecture.
closed-form solution or convex optimization. For example, marginal Fisher analysis (MFA) is one of the feature learning methods that is supervised and based on the graph embedding framework [11], [12]. It utilizes an intrinsic graph to characterize the intra-class compactness, and another penalty graph to characterize the inter-class separability. The optimal solution of MFA can be learned by generalized eigenvalue decomposition. However, shallow feature learning models cannot achieve good performance when the structure of the data is highly nonlinear; on the other hand, combinations of these shallow feature learning models have rarely been exploited to design deep models.

In order to simultaneously solve the existing problems in deep learning models and combine the advantages of feature learning models, we propose a novel deep learning method based on stacked feature learning modules. Specifically, instead of using random initialization, stacked MFA layers are applied to initialize the deep architecture, so that the constructed deep learning model is called the marginal deep architecture (MDA). At first, to increase the capacity of the architecture, we use a random weight matrix to project the input data to a higher dimensional space. Next, the stacked MFA layers are applied to learn the lower dimensional representations of the data layer by layer. At last, the softmax layer is connected to the final feature layer. During the implementation of MDA, we add some tricks to the training process to fine-tune it, such as back propagation, dropout and denoising. We have compared MDA with some feature learning and deep learning models on different domains of data sets (including handwritten digits recognition, speech recognition, historical document understanding, image classification, action recognition and so on). Experiments show that the performance of MDA is better than not only shallow feature learning models, but also related deep learning models.

Please note that although convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have played an important role in many image, video and natural language applications, feedforward neural networks are still important. For instance, they can be used to deal with vectorized data and as the fully connected layers of many deep learning architectures. Hence, how to design deep feedforward neural networks is still an important issue for the deep learning community.

The contributions of this work can be summarized as follows:
1. We propose a novel deep architecture called MDA. The number of neurons in the first hidden layer of MDA is twice or quadruple that in the input layer. Next, several layers of feature learning models are stacked to learn the low dimensional representations of the input data. In the end, a softmax classifier is applied.
2. In general, traditional deep learning models require a large amount of training data to obtain good results. In contrast, MDA can achieve better performance than these models with a limited amount of training data, due to the supervised pre-training method rather than random initialization.
3. Experiments demonstrate that the proposed MDA can obtain good results on data sets from several fields, such as natural images, spoken letters and handwritten digits. These results show that MDA is a general model to handle data sets with different scales of data. In addition, for large-size images, combining convolutional operations and MDA, we can obtain results competitive with existing deep learning methods.

The rest of this paper is organized as follows: In Section II, we give a brief overview of related work. In Section III, we present the proposed marginal deep architecture (MDA) in detail. The experimental settings and results are reported in Section IV. At last, Section V concludes this paper with remarks and future work.

II. RELATED WORK
Since 2006, many deep learning models have been proposed. Initially, Hinton and Salakhutdinov proposed the deep autoencoder (AE), which is an effective way to learn low-dimensional representations of high-dimensional data [10]. Based on AE, Vincent et al. [13] proposed the denoising autoencoder (DAE), which makes the learned representations robust to partial corruption of the input data. Subsequently, Vincent et al. extended DAE to the stacked DAE (SDAE), which works very well on natural images and handwritten digits. To prevent the weights in deep neural networks from co-adaptation, Hinton et al. [14] introduced the dropout technique, which has delivered new records in many speech and object recognition applications. However, due to their numerous parameters, previous deep learning models generally need a large scale of training data to obtain good learning results.

In recent years, to address many vision problems, research on deep convolutional neural networks (CNNs) has developed very fast [4], [15]-[18]. Specifically, in the research on image classification, Krizhevsky, Sutskever and Hinton proposed a large, deep convolutional neural network (AlexNet) to classify the 1.2 million high-resolution images in the ImageNet data set, using an efficient GPU implementation to speed up training. The results show that a large, deep convolutional neural network is capable of achieving record-breaking results on a highly challenging data set using purely supervised learning [4]. In order to transfer a trained deep convolutional neural network to new tasks, Donahue et al. [15] proposed the deep convolutional activation feature (DeCAF), which is extracted from a deep convolutional neural network well trained on a large object recognition data set. DeCAF provides a uniform framework for researchers, who can improve and adapt this framework to specific tasks. However, its performance on scene recognition has not attained the same level of success. In order to alleviate this problem, Zhou et al. [17] introduced a new scene-centric database called Places, with over 7 million labeled pictures of scenes. They then learned deep features for scene recognition tasks using deep architectures, and achieved
excellent results on several scene-centric data sets. However, these methods based on convolutional operations need a very large number of training samples and cannot work well with a limited amount of data.

In many domains other than computer vision, deep learning methods also achieve good performance. In [19], Hinton et al. present the shared views of four research groups on using deep neural networks (DNNs) for automatic speech recognition (ASR). DNNs that contain many layers of nonlinear hidden units and a very large output layer can outperform Gaussian mixture models (GMMs) at acoustic modeling for speech recognition on a variety of data sets. In the area of genetics, Xiong et al. [20] used deep learning algorithms to derive a computational model that takes DNA sequences as input to predict splicing in human tissues. It reveals the genetic origins of disease and how strongly genetic variants affect RNA splicing. In the area of natural language understanding, deep learning models have delivered strong results on topic classification, sentiment analysis and so on. Amongst others, Sutskever et al. [21] proposed a general approach, the multilayered long short-term memory (LSTM), which solves general sequence-to-sequence problems better than previous methods.

On the other hand, in the field of feature learning models, dimensionality reduction plays a crucial role in visualizing high-dimensional data and avoiding the "curse of dimensionality" [22], [23]. Traditional dimensionality reduction methods can mainly be classified by three criteria: linear or nonlinear, e.g., principal components analysis (PCA) [24] and locality preserving projections (LPP) [25] are linear methods, while stochastic neighbor embedding (SNE) [26] is a nonlinear method; supervised or unsupervised, e.g., marginal Fisher analysis (MFA) [11], [12] and linear discriminant analysis (LDA) [27] are supervised methods, and PCA is an unsupervised method; local or global, e.g., MFA and SNE are local methods, and PCA is a global method. Many feature learning models provide excellent solutions for dimensionality reduction applications. However, for large scale complex problems, feature learning models may not perform well. Considering this situation, we try to select some well-behaved feature learning models and combine them into deep architectures. Among others, MFA is one special formulation of the graph embedding framework [11]. It utilizes an intrinsic graph to characterize the intra-class compactness, and another penalty graph to characterize the inter-class separability. Our motivation in this work is to combine the advantages of MFA and deep architectures and propose a new supervised initialization method for deep learning algorithms.

There is also some work on feature learning models based on deep architectures [28]-[30]. Yuan et al. [28] proposed an improved multilayer learning model to solve the scene recognition task. This model learns all features used for scene recognition in an unsupervised manner. George et al. [29] proposed the deep semi-non-negative matrix factorization (semi-NMF), which is able to learn hidden representations from different, unknown attributes of a given data set. However, this model is proposed for learning low-dimensional representations that are better suited for clustering. Ngiam et al. [30] proposed a deep architecture, which is an unsupervised model to learn feature representations over multiple modalities. They argued that multi-modality feature learning is better than single-modality learning, and achieved good performance on video and audio data sets.

In this work, we combine the advantages of feature learning models and deep architectures [31], [32], stacking MFA to initialize the deep architecture as a supervised pre-training method. Then, we employ some deep learning techniques, such as back propagation, denoising and dropout, to fine-tune the network. The advantage of this deep architecture is that we can learn desirable weight matrices even if the training data is not large. Compared with traditional deep learning models and shallow feature learning models, the proposed method performs better in most cases.

III. MARGINAL DEEP ARCHITECTURE (MDA)
In this section, we first present an innovative architecture of deep learning models. After that, we introduce the proposed marginal deep architecture (MDA) in detail. In addition, some deep learning techniques used for the training of MDA, including back propagation, denoising and dropout, are also presented.

A. A NOVEL FRAMEWORK OF DEEP ARCHITECTURE
The target of feature learning can be described as follows. Suppose there are n input data, X = {x_1^T, . . . , x_n^T} ∈ R^D, where D is the dimensionality of the data space. The learning objective is to search for compact representations of these data, i.e., Y = {y_1^T, . . . , y_n^T} ∈ R^d, where d is the dimensionality of the low dimensional embeddings.

In this paper, we consider the feature learning problems from the perspective of deep learning, and stack shallow feature learning modules to build deep networks [31], [32]. In this case, the data are mapped from the original D-dimensional space to the d-dimensional space layer by layer. This deep architecture can be seen as a general framework for data representation learning. The data flow in the deep architecture can be abstracted as

D ⇒ D_1 ⇒ · · · ⇒ D_i ⇒ · · · ⇒ D_{p−1} ⇒ d,   (1)

where D_1 is the dimensionality of a high dimensional space mapped from the original space. In order to increase the capacity of the network, it is set to twice or quadruple the number of neurons in the input layer. D_i stands for the dimensionality of the i-th intermediate representation space, and p is the total number of mapping stages. In this framework, feature learning modules with various output dimensions are applied to learn the representations of the data in each layer. The mapping functions between consecutive layers are obtained by the layer-by-layer optimization of the feature learning models. The framework of the proposed deep architectures is briefly presented in Fig. 1.
FIGURE 1. The uniform framework of the proposed deep architectures. W_r1 represents the first-layer random weight matrix, while W_F2 and W_F3 represent the weight matrices learned by feature learning models. For simplicity, the bias terms are omitted.
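As an illustration of this framework, the stacking procedure of Fig. 1 can be sketched as follows. This is only a minimal sketch, not the authors' implementation: the `feature_learner_factory` interface, the weight scale 0.01 and the use of a sigmoid nonlinearity for g(·) are our own assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class StackedFeatureLearningNet:
    """Sketch of the framework of Fig. 1 and Eq. (1): D => D1 => ... => d.

    feature_learner_factory(out_dim) is assumed to return an object with a
    fit(X, y) method and a components_ attribute of shape (in_dim, out_dim).
    """

    def __init__(self, dims, feature_learner_factory, seed=None):
        self.dims = dims                          # e.g. [D, 2 * D, ..., d]
        self.factory = feature_learner_factory
        self.rng = np.random.default_rng(seed)
        self.weights = []

    def initialize(self, X, y):
        # First hidden layer: random projection to a higher dimensional space.
        W1 = 0.01 * self.rng.standard_normal((self.dims[0], self.dims[1]))
        self.weights = [W1]
        A = sigmoid(X @ W1)                       # Eq. (2) below, bias omitted
        # Subsequent layers: each is initialized by a feature learning module
        # trained on the previous layer's outputs.
        for out_dim in self.dims[2:]:
            learner = self.factory(out_dim)
            learner.fit(A, y)
            Wk = learner.components_
            self.weights.append(Wk)
            A = sigmoid(A @ Wk)                   # Eq. (3) below, bias omitted
        return A                                  # fed to the softmax layer

    def transform(self, X):
        A = X
        for W in self.weights:
            A = sigmoid(A @ W)
        return A
```

With an MFA module plugged in as the feature learner, this skeleton becomes the MDA initialization described in Section III-C; a softmax layer on top and the techniques of Section III-D then fine-tune all weights jointly.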
In this figure, the first hidden layer is randomly initialized by the matrix W_r1, and the new representation of an input x can be written as

a_1 = g(W_r1^T x + b),   (2)

where g(·) is a non-linear activation function. After that, MFA or other feature learning models are used to initialize the subsequent layers. The outputs of the subsequent hidden layers are

a_k = g(W_{F_{k−1}}^T a_{k−1} + b).   (3)

For example, in Fig. 1, W_F2 and W_F3 are the weight matrices of the second and third layers learned from feature learning models. In the end, softmax regression is adopted as the last layer for classification tasks. In the first hidden layer, higher dimensional representations of the input data can be learned. Afterwards, the following feature learning models can learn lower dimensional embeddings step by step. The key point of this network is that the weight matrices of the hidden layers, except the matrix of the first hidden layer, are initialized by feature learning modules, which may deliver better performance than other deep learning models initialized by random matrices.

B. MARGINAL FISHER ANALYSIS (MFA)
Based on our novel framework of deep architectures, we employ marginal Fisher analysis (MFA) to construct MDA. There are several advantages to using MFA. Compared with many traditional feature learning models, such as linear discriminant analysis (LDA), it makes no assumption about the data distribution of each class, so that MFA is more general for discriminant analysis. In addition, the margins between classes can properly characterize the separability of the classes. Therefore, we apply MFA as the building block of MDA.

MFA follows the graph embedding framework to construct an intrinsic graph that characterizes the intra-class compactness and another penalty graph that characterizes the inter-class separability [11]. Suppose the input data are X = {x_1, . . . , x_n} and the projection matrix is W = {ω_1, . . . , ω_d}. The intrinsic graph aims to connect each sample to its k-nearest neighbors in the same class. Suppose that we use N(i) to denote the k-nearest neighbors of x_i in its class. The intra-class compactness can be described as

S̃_w = Σ_i Σ_{j∈N(i) ∨ i∈N(j)} ‖ω^T x_i − ω^T x_j‖²   (4)
    = Σ_i Σ_j ‖ω^T x_i − ω^T x_j‖² A_ij   (5)
    = 2 ω^T X (D − A) X^T ω,   (6)

where D is a diagonal matrix with elements D_ii = Σ_j A_ij. The adjacency matrix A is given by

A_ij = { 1, if j ∈ N(i) or i ∈ N(j); 0, otherwise. }   (7)

The penalty graph connects the marginal point pairs of different classes. We use M(C) to denote the set of input pairs that are the k-nearest pairs among the set {(i, j) | i ∈ C ∧ j ∉ C}, where C is the class of an input x_i. The inter-class separability can be defined as

S̃_b = Σ_i Σ_{(i,j)∈M(C_i) ∨ (j,i)∈M(C_j)} ‖ω^T x_i − ω^T x_j‖²   (8)
    = Σ_i Σ_j ‖ω^T x_i − ω^T x_j‖² A^p_ij   (9)
    = 2 ω^T X (D^p − A^p) X^T ω,   (10)
FIGURE 2. A brief representation of MDA. W_r1 stands for the first-layer random weight matrix, while W_MFA2 and W_MFA3 represent the weight matrices learned by MFA. The dotted red lines represent the dropout operation, the dotted red circle is the dropout node, and the cross nodes are corrupted. The denoising and dropout operations are completely random. For simplicity, the bias terms are omitted.
where D^p is a diagonal matrix with elements D^p_ii = Σ_j A^p_ij. The adjacency matrix A^p is given by

A^p_ij = { 1, if (i, j) ∈ M(C_i) or (j, i) ∈ M(C_j); 0, otherwise. }   (11)

The target of MFA is to minimize the intra-class compactness and maximize the inter-class separability simultaneously. Therefore, the marginal Fisher criterion is defined as

W_MFA = argmin_W tr(W^T X (D − A) X^T W) / tr(W^T X (D^p − A^p) X^T W).   (12)

In [11], to apply MFA to face recognition applications, the faces are first projected into a PCA subspace by the transformation matrix W_PCA to reduce noise. Since the features are learned by multiple layers in MDA and the whole deep architecture is fine-tuned by back propagation, it is not necessary to reduce the dimensionality of the data by PCA first. Hence, we compute the projection matrix W_MFA directly using Eq. (12) at each layer.

Here, MFA is used as the initialization method for the weight matrices in MDA. For different layers of MDA, the input X of W_MFA in Eq. (12) is the output of the previous layer. For example, in Fig. 2, W_MFA2 and W_MFA3 are computed using the outputs of their previous layers. In addition, the weight matrices calculated by MFA are only applied to initialize the weight matrices of MDA at the first iteration. Then, we apply back propagation to fine-tune these matrices.

C. MARGINAL DEEP ARCHITECTURE (MDA)
Based on the novel deep architecture framework and the benefits of MFA, we present MDA in the following. As depicted in Fig. 2, MDA is constructed by integrating MFA into the novel framework. Given an input vector x ∈ [0, 1]^d, it is first mapped to a higher dimensional space by a random weight matrix W_r1. The activation output of the first hidden layer can be written as

a_1 = s(W_r1^T x + b),   (13)

where s(·) is the sigmoid function s(x) = 1 / (1 + e^{−x}), b is the bias term, and a_1 is the output of the first hidden layer. From the second layer to the (n − 1)-th layer, the weight matrices are learned by MFA to initialize MDA layer by layer:

a_k = s(W_{MFA_{k−1}}^T a_{k−1} + b).   (14)

We use softmax regression as the last layer of MDA for classification tasks, so that the number of neurons is the same as the number of classes. The cost function can be defined as

J(w) = − (1/N) Σ_{i=1}^{N} Σ_{j=1}^{K} I(y_i = j) log [ exp(w_j^T a_i^{n−1}) / Σ_{l=1}^{K} exp(w_l^T a_i^{n−1}) ],   (15)

where N and K are the total number and the class number of the input data, respectively. I(x) is the indicator function: if x is true, I(x) = 1, else I(x) = 0. y_i is the label of x_i, and w_j and w_l are the weight vectors corresponding to classes j and l. Hence, the probability that x_i is correctly categorized to class j is

p(y_i = j | x_i, w) = exp(w_j^T a_i^{n−1}) / Σ_{l=1}^{K} exp(w_l^T a_i^{n−1}).   (16)

From the (n − 1)-th layer to the last layer, we continue to use MFA to map it. To this end, we can consider that the MDA is initialized with a supervised pre-training method.
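To make this supervised pre-training concrete, the sketch below computes one W_MFA by building the intrinsic and penalty graphs of Eqs. (4)-(11) and solving the generalized eigenvalue problem behind Eq. (12). It is an illustrative reading of the method rather than the authors' code: the per-sample construction of the penalty graph (instead of per-class marginal pairs), the small regularizer and the tie-breaking in neighbor selection are our own assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def mfa_projection(X, y, out_dim, k_intra=5, k_penalty=20, reg=1e-4):
    """Sketch of the MFA projection of Eq. (12).

    X: (n, D) layer inputs, y: (n,) class labels.
    Returns W of shape (D, out_dim), used to initialize one MDA layer.
    """
    n, D = X.shape
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    # Intrinsic graph A: k-nearest neighbors within the same class, Eq. (7).
    A = np.zeros((n, n))
    for i in range(n):
        same = np.where(y == y[i])[0]
        same = same[same != i]
        if same.size:
            A[i, same[np.argsort(dist[i, same])[:k_intra]]] = 1.0
    A = np.maximum(A, A.T)                      # j in N(i) or i in N(j)

    # Penalty graph A^p: nearest different-class samples (a per-sample
    # simplification of the marginal pairs in Eq. (11)).
    Ap = np.zeros((n, n))
    for i in range(n):
        diff = np.where(y != y[i])[0]
        if diff.size:
            Ap[i, diff[np.argsort(dist[i, diff])[:k_penalty]]] = 1.0
    Ap = np.maximum(Ap, Ap.T)

    Lw = np.diag(A.sum(axis=1)) - A             # D - A in Eq. (6)
    Lb = np.diag(Ap.sum(axis=1)) - Ap           # D^p - A^p in Eq. (10)
    Sw = X.T @ Lw @ X                           # numerator matrix of Eq. (12)
    Sb = X.T @ Lb @ X + reg * np.eye(D)         # denominator matrix, regularized

    # Generalized eigendecomposition: the eigenvectors with the smallest
    # eigenvalues of Sw w = lambda Sb w minimize the criterion of Eq. (12).
    _, vecs = eigh(Sw, Sb)
    return vecs[:, :out_dim]
```

Following Section III-C, such a routine would be called once per hidden layer, with X replaced by the activations of the previous layer, before back propagation fine-tunes all weights; the neighborhood sizes 5 and 20 match the settings reported in Section IV-B.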
D. DEEP LEARNING TECHNIQUES APPLIED TO MDA
In order to improve the performance of MDA, we adopt some deep learning techniques to fine-tune it, including back propagation, denoising and dropout.

1) BACK PROPAGATION
Back propagation [33] is an efficient optimization algorithm for MDA, which employs stochastic gradient descent to learn the weight matrices and the bias terms layer by layer. For every neuron i in the output layer (the n-th layer), the error term can be described as

δ_i^n = ∂J(w)/∂a_i^n = − (1/N) Σ_{i=1}^{N} [ I(y_i = j) − p(y_i = j | x_i, w) ],   (17)

where J(w) is the cost function computed from (15), and a_i^n is the output of neuron i in the output layer. For every neuron i from the (n − 1)-th layer down to the second layer, the error term is calculated as

δ_i^k = ( Σ_{j=1}^{k+1} w_{ji}^k δ_j^{k+1} ) s′(a_i^k).   (18)

Then, ∂J(w)/∂w_{ij}^k and ∂J(w)/∂b_i^k are calculated as

∂J(w)/∂w_{ij}^k = a_j^k δ_i^{k+1},   (19)
∂J(w)/∂b_i^k = δ_i^{k+1}.   (20)

By calculating the gradient of the cost function of MDA with respect to the parameters, the back propagation algorithm can update the weight matrices and bias terms in the layers of MDA. It starts from the output layer at the top and ends with the input layer at the bottom.

2) DENOISING OPERATION
Denoising was proposed in the denoising autoencoder to improve its robustness [13]. It can be viewed as a regularization method that avoids the "overfitting" problem. The main idea is to set a required proportion ν of "destruction" and corrupt part of the input data: for every input x, a fixed percentage ν of its components is selected at random and set to 0, while the others remain unchanged. A partially destroyed version x̃ of an initial input x can thus be obtained through a stochastic mapping,

x̃ ∼ q_D(x̃ | x),   (21)

where q_D(x̃ | x) is the unknown data distribution. Hence, a hidden representation h can be computed as

h = s(W^T x̃ + b).   (22)

In MDA, we use denoising to improve its performance. A concrete illustration can be seen in Fig. 2. With the denoising operation, the output of the first hidden layer is computed as

a_2 = s(W_r1^T x̃ + b_1),   (23)

where W_r1 and b_1 are the random weight matrix and the bias term of the first hidden layer. The "denoising" method is proposed based on a hypothetical criterion for network design: robustness to partial destruction of the input data. This criterion implies that good internal representations can be learned from an unidentified distribution of the input data. Therefore, this method helps to learn more robust structures and avoids overfitting in most cases.

3) DROPOUT
Similar to the denoising operation, dropout is an efficient method to prevent overfitting [14]. Dropout has a dramatic effect on the test set when a deep learning model is trained on a small training set. It is a regularization technique that prevents complex co-adaptations on the training data. The key point of dropout is that each neuron in the hidden layers is randomly excluded from the model with a probability of β. Besides, dropout can be seen as an efficient way of performing model averaging with deep learning models. Fig. 2 depicts the dropout operation in MDA.

IV. EXPERIMENTS AND DISCUSSIONS
To evaluate MDA, we performed several experiments on data sets of different sizes. We designed MDA with different structures in order to explore its optimal architecture. At first, we tested MDA on five benchmark data sets to explore its best architecture and compared it with other feature learning and deep learning models. Then, in order to show the performance of MDA initialized by the supervised initialization method on different sizes of data sets, we applied MDA to a data set with extremely limited data, CMU mocap, and a relatively large data set, CIFAR-10. In addition, we combined MDA with the convolutional neural network (CNN) for image classification tasks, and used the supervised initialization method in the pre-training phase of deep CNNs on the CIFAR-10 data set.

A. DATA SET DESCRIPTIONS
We first evaluated MDA on five benchmark data sets. Table 1 illustrates some characteristics of these data sets. The USPS1 data set is composed of handwritten digit images; it contains 7291 training samples and 2007 test samples from 10 classes, and each sample is represented with a 256 dimensional vector. The task is to identify the digits from 0 to 9. The Isolet2 data set contains 6238 training samples and 1559 test samples from 26 classes with 614 dimensional features. It collects audio feature vectors of spoken letters from the English alphabet. Based on the recorded (and preprocessed) audio signals, the task is to identify the exact letter which is spoken. Sensor3 is a sensorless drive diagnosis data set, including 46816 training samples and 11693 test samples from 11 classes.
Each sample contains 48 dimensional features, which are extracted from electric current drive signals. The target is to classify the specific category under different conditions of the drive and its intact and faulty components. Covertype4 is a geological and map-based data set chosen from four wilderness areas located in the Roosevelt National Forest of northern Colorado. It contains 15120 training samples and 565892 test samples from 7 classes with 54 dimensional features. The target is to recognize the categories of forest cover from cartographic variables. IbnSina5 is an ancient Arabic document data set; we select 50 pages of the manuscript for training (17543 training samples) and 10 pages for testing (3125 test samples). There are 174 classes of subwords with 200 dimensions in this data set.

TABLE 1. Characteristics of the used data sets.

In addition, we also tested MDA on a specific task, which uses the CMU motion capture (CMU mocap) data set.6 The CMU mocap data set includes three categories, namely, jumping, running and walking. We chose 49 video sequences from four subjects. For each sequence, the features are generated using Lawrence's method,7 with dimensionality 93 [34]. Because of the few samples in this data set, we adopt 10-fold cross-validation in our experiments and use the average error rate and standard deviation to evaluate the performance. At last, we test MDA on the classic CIFAR-10 data set8 to evaluate its performance on image classification applications. Furthermore, we combined MDA with CNN and evaluated this model on CIFAR-10. The CIFAR-10 data set includes 60000 32 × 32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. Fig. 5 shows some examples of the CIFAR-10 data set from the 10 categories.

1 http://www.gaussianprocess.org/gpml/data/
2 http://archive.ics.uci.edu/ml/datasets/ISOLET
3 http://archive.ics.uci.edu/ml/datasets/Dataset+for+Sensorless+Drive+Diagnosis#
4 http://archive.ics.uci.edu/ml/datasets/Covertype
5 http://www.causality.inf.ethz.ch/al_data/IBN_SINA.html
6 http://mocap.cs.cmu.edu/
7 http://is6.cs.man.ac.uk/~neill/mocap/
8 http://www.cs.toronto.edu/~kriz/cifar.html

B. CLASSIFICATION ON FIVE BENCHMARK DATA SETS
In these experiments, we compare MDA with several related deep learning models on the 5 benchmark data sets. These deep learning models include the autoencoder (AE) [10], stacked autoencoders (SAE), denoising autoencoders (DAE) [13], stacked denoising autoencoders (SDAE) [35], denoising autoencoders with dropout (DAE(dropout)) and a variant of MDA, PDA. Note that the architecture of PDA is the same as that of MDA, but the feature learning module of PDA is PCA [24] instead of MFA [11], [12].

1) EXPERIMENTAL CONFIGURATIONS
All of these deep learning models have the same structure and configurations. The size of the minibatch was set to 100, the learning rate and momentum were set to the default values of 1 and 0.5, and the number of epochs was set to 400, while the dropout rate β and the denoising rate ν were set to 0.1. In AE and SAE, the weight penalty of the L2 norm was set to 10^−4. For MFA, the number of nearest neighbors for constructing the intrinsic graph was set to 5, while that for constructing the penalty graph was set to 20. The target dimensions of the representations learned by MDA and PDA on these data sets are shown in the last column of Table 1.

2) CLASSIFICATION RESULTS
From the experimental results shown in Table 2, we can see that, compared with the other models, MDA performs best on four of the data sets, the exception being the Sensor data set, on which it obtained the second best result. Furthermore, PDA achieved the best result on the Sensor data set and the second best results on the other data sets. These results demonstrate that, in most cases, the proposed deep learning models can achieve good performance on data sets with a limited amount of training data. We can also conclude that the performance of MDA is better than not only the related deep learning models, but also some shallow feature learning methods, such as PCA and MFA, as shown in Table 2. These results demonstrate that MDA and PDA, which are based on stacked feature learning models, can learn better representations of the input data than shallow feature learning methods. However, it is not always true that deep learning models perform better than feature learning models. For instance, the performance of MFA is better than that of AE, DAE, DAE with dropout and SDAE on the Sensor data set. This indicates that, when training with a limited amount of data, some feature learning methods may learn the representations of the input data better than deep learning models. MDA possesses the advantages of both deep learning and feature learning models, and the experimental results show its advantages on data sets with a limited amount of training data.

3) TIME CONSUMPTION
The time consumption of the training and test processes for the 7 different deep architectures is shown in Table 3. Each experiment was carried out 5 times and the averaged results are reported. Note that all the experiments were performed on a 4-core Intel(R) Core(TM)2 Quad Q9550 CPU with a 2.83 GHz clock frequency. We can see that the training times of PDA and MDA are similar to those of AE, DAE and DAE with dropout. However, they are much faster than SAE and SDAE. On the Isolet data set, the time consumptions of PDA and MDA are less than those of the other deep architectures.
TABLE 2. The classification accuracy on five benchmark data sets. ‘‘ORIG’’ represents the results obtained by applying softmax directly to the original
data space. The best result is highlighted with boldface.
TABLE 3. Time consumption of 7 compared deep architectures. The test time is on all the test samples.
TABLE 4. The structures of MDA on the 5 benchmark data sets. "None" means MDA without the second (expansion) layer. "Twice" means the second layer has twice as many nodes as the input layer. "Quadruple" means the second layer has four times as many nodes as the input layer. "Octuple" means the second layer has eight times as many nodes as the input layer.
This demonstrates that PDA and MDA can sometimes achieve good results with a short training time because of their efficient weight initialization. The training periods of SAE and SDAE are very slow because every layer in SAE and SDAE is an autoencoder layer, and it requires a long optimization time to initialize the weight matrix. During the test procedure, all the methods have similar efficiency, as their architectures are the same. These results show the efficiency of MDA, mainly because it can obtain good initial weight matrices in a short time and perform well on the learning tasks.

4) THE DENOISING AND DROPOUT RATIOS
In order to evaluate the influence of the denoising and dropout operations on MDA, we designed an experiment on the 5 benchmark data sets with different denoising ratios and dropout ratios. Firstly, we fixed the dropout ratio at 0.1 and adjusted the denoising ratio from 0.1 to 0.5. Next, we fixed the denoising ratio at 0.1 and modified the dropout ratio from 0.1 to 0.5. The experimental results are shown in Fig. 3. We can see that MDA achieved the minimum error on all data sets when the denoising ratio and dropout ratio are 0.1. The error increased with increasing denoising ratio and dropout ratio in most cases. For the USPS data set, the error decreased when the dropout ratio was changed from 0.2 to 0.3. For the Sensor data set, the error decreased slightly when the denoising ratio was changed from 0.2 to 0.3. The experimental results demonstrate that the denoising and dropout operations can improve the performance of MDA when appropriate values are selected for them. Without the denoising and dropout operations, the experimental results are not as good as when adopting these operations. In the following experiments, we set both of them to 0.1.

5) DIFFERENT STRUCTURES FOR MDA
In order to explore better structures for MDA, we constructed different structures by changing the number of nodes in each layer. For the USPS data set, we first got rid of the second layer, so that the structure of the model was 256 − 128 − 64 − 32. Then, we set the number of nodes in the second layer to twice that of the input layer, so that the architecture changed to 256 − 512 − 128 − 64 − 32. Next, the nodes in the second layer were quadruple as many as those in the input layer, and the architecture became 256 − 1024 − 512 − 256 − 128 − 64 − 32. Finally, the nodes were octuple as many as those in the input layer, so that the architecture was 256 − 2048 − 1024 − 512 − 256 − 128 − 64 − 32. The structures of MDA on the other data sets were changed similarly, as shown in Table 4.
FIGURE 3. Error rates on 5 data sets with respect to different denoising ratios and dropout ratios. (a) USPS. (b) Isolet. (c) Sensor. (d) Covertype. (e) Ibnsina.
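For reference, the denoising and dropout operations whose ratios are varied in Fig. 3 (and described in Section III-D) can be sketched as below. This is an illustrative NumPy sketch under our own assumptions about how the corrupted positions and dropout mask are sampled; it is not the authors' implementation.

```python
import numpy as np

def denoise_corrupt(x, nu, rng):
    """Set a randomly chosen fraction nu of the input components to 0, cf. Eq. (21)."""
    x_tilde = x.copy()
    n_corrupt = int(round(nu * x.size))
    idx = rng.choice(x.size, size=n_corrupt, replace=False)
    x_tilde[idx] = 0.0
    return x_tilde

def dropout_mask(n_units, beta, rng):
    """Exclude each hidden neuron with probability beta (Section III-D)."""
    return (rng.random(n_units) >= beta).astype(float)

rng = np.random.default_rng(0)
x = rng.random(256)                              # e.g. one USPS input vector
x_tilde = denoise_corrupt(x, nu=0.1, rng=rng)    # denoising ratio 0.1
mask = dropout_mask(128, beta=0.1, rng=rng)      # dropout ratio 0.1
# During training, a hidden activation a would be replaced by a * mask.
```

Both ratios of 0.1 correspond to the best settings observed in Fig. 3.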
TABLE 5. The classification error with different structures of MDA on the 5 benchmark data sets. The best result (minimum error) is highlighted with boldface.

The experimental results with these different structures of MDA are shown in Table 5. We can see that MDA achieved the minimum classification error on all the data sets except the Covertype data set when the number of nodes in the second layer is twice that in the input layer. Besides, MDA obtained the best performance on the Covertype data set when the number of nodes in the second layer is quadruple that in the input layer. We can conclude that MDA works well when the nodes in the second layer are twice or quadruple as many as the nodes in the input layer. We then employed the structures that achieved the best results to compare the performance of MDA with its related models on these data sets. In addition, except on the Covertype data set, when the number of nodes in the second layer increases gradually beyond twice the input dimensionality, the errors increase as well.

6) DIFFERENT NUMBERS OF HIDDEN LAYERS FOR MDA
As a deep learning model, the depth of MDA is very important. As the architecture gets deeper and deeper, the training of deep learning models may become more and more difficult, which means that more computing resources are needed if the architecture is very deep. In order to evaluate how many hidden layers are appropriate for different tasks, we designed different structures on the 5 benchmark data sets. We applied MDA with 1 to 7 hidden layers on the USPS and Isolet data sets and with 1 to 5 hidden layers on the Covertype, Sensor and Ibnsina data sets. The other experimental settings are the same as in the previous experiments.

Table 6 shows the classification error on the 5 benchmark data sets with different numbers of hidden layers for MDA. All the data sets achieved the best results when the number of hidden layers is 3, except the USPS data set, on which MDA achieved the best result when the number of hidden layers was 5. When the number of hidden layers increases from 1 to 3, the classification error decreases on all the data sets. With a limited amount of data, very deep architectures are not needed.

7) EFFECTS OF INITIALIZATION METHODS
To evaluate the effect of MFA as a network initialization method, we applied different network initialization methods to the same deep architecture on the 5 benchmark data sets. These initialization methods included random initialization, SAE as the unsupervised pre-training method and MFA as the supervised pre-training method.
TABLE 6. The classification error on the 5 benchmark data sets with different structures of MDA.
FIGURE 4. The test accuracies of MDA with different initialization methods on the 5 benchmark data sets. On each data set, the deep models have the
same structure and hyperparameters. The classification results are evaluated only with initialization and after training the same epochs, respectively.
(a) USPS dataset. (b) Isolet dataset. (c) Sensor dataset. (d) Covertype dataset. (e) Ibnsina dataset.
Firstly, the deep architectures initialized by these methods were evaluated on the 5 benchmark data sets before the training phase. Then, we tested these models after the same number of training epochs on the same data sets. On each data set, the structures and experimental settings of these deep models were kept the same. The structures were 256 − 128 − 64 − 32 for the USPS data set, 617 − 308 for the Isolet data set, 48 − 24 for the Sensor data set, 54 − 27 for the Covertype data set and 200 − 100 for the Ibnsina data set, respectively.

Fig. 4 illustrates the test accuracies of these models before and after the training period on the 5 benchmark data sets. NN is the neural network with random initialization. SAE and MDA are the same neural network initialized by SAE and MFA, respectively. It is easy to see that MDA achieved the best results both in the pre-training period, i.e., without back propagation training, and after the same training period. These experimental results demonstrate that, compared with the other initialization methods, the stacked MFA in MDA can initialize the deep architecture more effectively and obtain better performance after the same number of training epochs.

C. CLASSIFICATION ON THE CMU MOCAP DATASET
To test the performance of MDA on real world applications, we evaluated it on a specific data set, CMU mocap. CMU mocap is a very small data set including only 49 samples. Traditional deep learning methods cannot work well in this application. We compared MDA and PDA with PCA, MFA and 5 other deep learning models. The architectures of all the deep models (except PDA) are 93 − 186 − 93 − 47 − 24. Specifically, since the CMU mocap data set only has 49 samples, the PCA method can only reduce the dimensionality
we intend to exploit some other feature learning methods for the deep architecture construction and explore the different structures of this novel deep learning framework.

ACKNOWLEDGMENT
The Titan X GPU used for this research was donated by the NVIDIA Corporation.

REFERENCES
[1] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proc. ICML, 2008, pp. 160-167.
[2] D. C. Cirecsan, U. Meier, L. M. Gambardella, and J. Schmidhuber, "Deep, big, simple neural nets for handwritten digit recognition," Neural Comput., vol. 22, no. 12, pp. 3207-3220, 2010.
[3] Q. Le, W. Zou, S. Yeung, and A. Ng, "Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis," in Proc. CVPR, Jun. 2011, pp. 3361-3368.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. NIPS, 2012, pp. 1106-1114.
[5] T.-H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma, "PCANet: A simple deep learning baseline for image classification?" IEEE Trans. Image Process., vol. 24, no. 12, pp. 5017-5032, Dec. 2015.
[6] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436-444, May 2015.
[7] M. Ranzato, Y. Boureau, and Y. L. Cun, "Sparse feature learning for deep belief networks," in Proc. NIPS, 2007, pp. 1185-1192.
[8] H. Lee, P. Pham, Y. Largman, and A. Y. Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks," in Proc. NIPS, 2009, pp. 1096-1104.
[9] H. Lee, R. B. Grosse, R. Ranganath, and A. Y. Ng, "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," in Proc. ICML, 2009, pp. 609-616.
[10] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504-507, 2006.
[11] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, "Graph embedding and extensions: A general framework for dimensionality reduction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, pp. 40-51, Jan. 2007.
[12] G. Zhong, Y. Chherawala, and M. Cheriet, "An empirical evaluation of supervised dimensionality reduction for recognition," in Proc. ICDAR, Aug. 2013, pp. 1315-1319.
[13] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proc. ICML, 2008, pp. 1096-1103.
[14] G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. (2012). "Improving neural networks by preventing co-adaptation of feature detectors." [Online]. Available: https://arxiv.org/abs/1207.0580
[15] J. Donahue et al., "DeCAF: A deep convolutional activation feature for generic visual recognition," in Proc. ICML, 2014, pp. 647-655.
[16] M. Long, Y. Cao, J. Wang, and M. I. Jordan, "Learning transferable features with deep adaptation networks," in Proc. ICML, 2015, pp. 97-105.
[17] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, "Learning deep features for scene recognition using places database," in Proc. NIPS, 2014, pp. 487-495.
[18] G. Zhong, X. Ling, and L.-N. Wang, "From shallow feature learning to deep learning: Benefits from the width and depth of deep architectures," Wiley Interdiscipl. Rev., Data Mining Knowl. Discovery, vol. 9, Jan./Feb. 2019, Art. no. e1255.
[19] G. Hinton et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82-97, Nov. 2012.
[20] H. Y. Xiong et al., "The human splicing code reveals new insights into the genetic determinants of disease," Science, vol. 347, no. 6218, 2015, Art. no. 1254806.
[21] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proc. NIPS, 2014, pp. 3104-3112.
[22] L. J. P. van der Maaten, E. O. Postma, and H. J. van den Herik, "Dimensionality reduction: A comparative review," J. Mach. Learn. Res., vol. 10, nos. 1-41, pp. 66-71, Oct. 2009.
[23] L. J. van der Maaten, "An introduction to dimensionality reduction using MATLAB," Univ. Maastricht, Maastricht, The Netherlands, Tech. Rep. MICC 07-07, 2007.
[24] I. T. Jolliffe, Principal Component Analysis. Hoboken, NJ, USA: Wiley, 2002.
[25] X. He and P. Niyogi, "Locality preserving projections," in Proc. NIPS, vol. 16, 2003, pp. 153-160.
[26] G. E. Hinton and S. T. Roweis, "Stochastic neighbor embedding," in Proc. NIPS, 2002, pp. 833-840.
[27] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Ann. Eugenics, vol. 7, no. 2, pp. 179-188, 1936.
[28] Y. Yuan, L. Mou, and X. Lu, "Scene recognition by manifold regularized deep learning architecture," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 10, pp. 2222-2233, Oct. 2015.
[29] T. George, B. Konstantinos, Z. Stefanos, and B. W. Schuller, "A deep semi-NMF model for learning hidden representations," in Proc. ICML, 2014, pp. 1692-1700.
[30] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, "Multimodal deep learning," in Proc. ICML, 2011, pp. 689-696.
[31] Y. Zheng, G. Zhong, J. Liu, X. Cai, and J. Dong, "Visual texture perception with feature learning models and deep architectures," in Proc. CCPR, 2014, pp. 401-410.
[32] Y. Zheng, Y. Cai, G. Zhong, Y. Chherawala, Y. Shi, and J. Dong, "Stretching deep architectures for text recognition," in Proc. ICDAR, 2015, pp. 236-240.
[33] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533-536, Oct. 1986.
[34] G. Zhong, W.-J. Li, D.-Y. Yeung, X. Hou, and C.-L. Liu, "Gaussian process latent random field," in Proc. AAAI, 2010, pp. 1-6.
[35] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," J. Mach. Learn. Res., vol. 11, no. 12, pp. 3371-3408, Dec. 2010.
[36] K. D. Humbird, J. L. Peterson, and R. G. McClarren, "Deep neural network initialization with decision trees," IEEE Trans. Neural Netw. Learn. Syst., to be published. doi: 10.1109/TNNLS.2018.2869694.

GUOQIANG ZHONG (M'16) received the B.S. degree in mathematics from Hebei Normal University, Shijiazhuang, China, in 2004, the M.S. degree in operations research and cybernetics from the Beijing University of Technology, Beijing, China, in 2007, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, in 2011. From 2011 to 2013, he was a Postdoctoral Fellow with the Synchromedia Laboratory for Multimedia Communication in Telepresence, École de Technologie Supérieure, University of Quebec, Montreal, Canada. Since 2014, he has been an Associate Professor with the Department of Computer Science and Technology, Ocean University of China, Qingdao, China. He has published three books, four book chapters, and more than 50 technical papers in the areas of artificial intelligence, pattern recognition, machine learning, and data mining. His research interests include pattern recognition, machine learning, and image processing. He has served as a PC Member/Reviewer for many international conferences and top journals, such as the IEEE TNNLS, the IEEE TKDE, the IEEE TCSVT, Pattern Recognition, Knowledge-Based Systems, Neurocomputing, ACM TKDD, ICPR, and ICDAR. He is a member of the ACM and IAPR and a Professional Committee Member of CAAI-PR, CAA-PRMI, and CSIG-DIAR. He has been awarded as an Outstanding Reviewer by several journals, such as Pattern Recognition, Neurocomputing, and Cognitive Systems Research.
KANG ZHANG received the B.S. degree in communication engineering from the Ocean University of China, Qingdao, China, in 2014, where he is currently pursuing the M.S. degree in electronics and communication engineering. His research interests include data mining, machine learning, and image processing.

YUCHEN ZHENG received the B.S. and M.S. degrees from the Department of Computer Science and Technology, Ocean University of China, in 2014 and 2017, respectively. He is currently pursuing the Ph.D. degree with the Department of Advanced Information Technology, Kyushu University, Japan. His research interests include machine learning, document analysis, neural networks, and computer vision.