Marginal Deep Architecture: Stacking Feature Learning Modules to Build Deep Learning Models
Digital Object Identifier 10.1109/ACCESS.2019.2902631
ABSTRACT Recently, many deep models have been proposed in different fields, such as image classifica-
tion, object detection, and speech recognition. However, most of these architectures require a large amount
of training data and employ random initialization. In this paper, we propose to stack feature learning modules
for the design of deep architectures. Specifically, marginal Fisher analysis (MFA) is stacked layer-by-layer
for the initialization and we call the constructed deep architecture marginal deep architecture (MDA). When
implementing the MDA, the weight matrices of MFA are updated layer-by-layer, which is a supervised pre-
training method and does not need a large scale of data. In addition, several deep learning techniques are
applied to this architecture, such as backpropagation, dropout, and denoising, to fine-tune the model. We have
compared MDA with some feature learning and deep learning models on several practical applications,
such as handwritten digits recognition, speech recognition, historical document understanding, and action
recognition. The extensive experiments show that the performance of MDA is better than not only shallow
feature learning models but also related deep learning models in these tasks.
INDEX TERMS Deep architectures, feature learning, marginal Fisher analysis, marginal deep architecture.
closed-form solution or convex optimization. For example, marginal Fisher analysis (MFA) is one of the feature learning methods that is supervised and based on the graph embedding framework [11], [12]. It utilizes an intrinsic graph to characterize the intra-class compactness, and another penalty graph to characterize the inter-class separability. The optimal solution of MFA can be learned by generalized eigenvalue decomposition. However, shallow feature learning models cannot achieve good performance when the structure of the data is highly nonlinear; on the other hand, combinations of these shallow feature learning models have rarely been exploited to design deep models.

In order to simultaneously solve the existing problems in deep learning models and combine the advantages of feature learning models, we propose a novel deep learning method based on stacked feature learning modules. Specifically, instead of using random initialization, stacked MFA layers are applied to initialize the deep architecture, so that the constructed deep learning model is called the marginal deep architecture (MDA). At first, to increase the capacity of the architecture, we use a random weight matrix to project the input data to a higher dimensional space. Next, the stacked MFA layers are applied to learn the lower dimensional representations of the data layer by layer. At last, the softmax layer is connected to the final feature layer. During the implementation of MDA, we add some tricks to the training process to fine-tune it, such as back propagation, dropout and denoising. We have compared MDA with some feature learning and deep learning models on different domains of data sets (including handwritten digits recognition, speech recognition, historical document understanding, image classification, action recognition and so on). Experiments show that the performance of MDA is better than not only shallow feature learning models, but also related deep learning models.

Please note that although convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have played an important role in many image, video and natural language applications, feedforward neural networks are still important. For instance, they can be used to deal with vectorized data and as the fully connected layers of many deep learning architectures. Hence, how to design deep feedforward neural networks is still an important issue for the deep learning community.

The contributions of this work can be summarized as follows:
1. We propose a novel deep architecture called MDA. The number of neurons in the first hidden layer of MDA is twice or quadruple that in the input layer. Next, several layers of feature learning models are stacked to learn the low dimensional representations of the input data. In the end, a softmax classifier is applied.
2. In general, traditional deep learning models require a large amount of training data to obtain good results. In contrast, MDA can achieve better performance than these models with a limited amount of training data, due to the supervised pre-training method rather than random initialization.
3. Experiments demonstrate that the proposed MDA can obtain good results on data sets from several fields, such as natural images, spoken letters and handwritten digits. These results show that MDA is a general model to handle data sets with different scales of data. In addition, for large-size images, combining convolutional operations and MDA, we can obtain results competitive with existing deep learning methods.

The rest of this paper is organized as follows: In Section II, we give a brief overview of related work. In Section III, we present the proposed marginal deep architecture (MDA) in detail. The experimental settings and results are reported in Section IV. At last, Section V concludes this paper with remarks and future work.

II. RELATED WORK
Since 2006, many deep learning models have been proposed. Initially, Hinton and Salakhutdinov proposed the deep autoencoder (AE), which is an effective way to learn low-dimensional representations of high-dimensional data [10]. Based on AE, Vincent et al. [13] proposed the denoising autoencoder (DAE), which makes the learned representations robust to partial corruption of the input data. Subsequently, Vincent et al. extended DAE to the stacked DAE (SDAE), which works very well on natural images and handwritten digits. To prevent the weights in deep neural networks from co-adaptation, Hinton et al. [14] introduced the dropout technique, which has delivered new records in many speech and object recognition applications. However, due to their numerous parameters, previous deep learning models generally need a large scale of training data to obtain good learning results.

In recent years, to address many vision problems, research on deep convolutional neural networks (CNNs) has developed very fast [4], [15]-[18]. Specifically, in the research on image classification, Krizhevsky, Sutskever and Hinton proposed a large, deep convolutional neural network (AlexNet) to classify the 1.2 million high-resolution images in the ImageNet data set, using an efficient GPU implementation to speed up training. The results show that a large, deep convolutional neural network is capable of achieving record-breaking results on a highly challenging data set using purely supervised learning [4]. In order to transfer a trained deep convolutional neural network to new tasks, Donahue et al. [15] proposed the deep convolutional activation feature (DeCAF), which is extracted from a deep convolutional neural network well trained on a large object recognition data set. DeCAF provides a uniform framework for researchers, who can improve and adapt this framework to specific tasks. However, its performance on scene recognition has not attained the same level of success. In order to alleviate this problem, Zhou et al. [17] introduced a new scene-centric database called Places, with over 7 million labeled pictures of scenes. They then learned deep features for scene recognition tasks using deep architectures, and achieved
excellent results on several scene-centric data sets. However, these methods based on convolutional operations need a very large number of training samples and cannot work well with a limited amount of data.

In many domains other than computer vision, deep learning methods also achieve good performance. In [19], Hinton et al. present the shared views of four research groups on using deep neural networks (DNNs) for automatic speech recognition (ASR). DNNs that contain many layers of nonlinear hidden units and a very large output layer can outperform Gaussian mixture models (GMMs) at acoustic modeling for speech recognition on a variety of data sets. In the area of genetics, Xiong et al. [20] used deep learning algorithms to derive a computational model that takes DNA sequences as input to predict splicing in human tissues. It reveals the genetic origins of disease and how strongly genetic variants affect RNA splicing. In the area of natural language understanding, deep learning models have delivered strong results on topic classification, sentiment analysis and so on. Amongst others, Sutskever et al. [21] proposed a general approach, the multilayered long short-term memory (LSTM), which solves general sequence-to-sequence problems better than previous methods.

On the other hand, in the field of feature learning models, dimensionality reduction plays a crucial role in visualizing high-dimensional data and avoiding the "curse of dimensionality" [22], [23]. Traditional dimensionality reduction methods can mainly be classified by three criteria: linear or nonlinear, e.g., principal components analysis (PCA) [24] and locality preserving projections (LPP) [25] are linear methods, while stochastic neighbor embedding (SNE) [26] is a nonlinear method; supervised or unsupervised, e.g., marginal Fisher analysis (MFA) [11], [12] and linear discriminant analysis (LDA) [27] are supervised methods, and PCA is an unsupervised method; local or global, e.g., MFA and SNE are local methods, and PCA is a global method. Many feature learning models provide excellent solutions for dimensionality reduction applications. However, for large scale complex problems, feature learning models may not perform well. Considering this situation, we try to select some well-behaved feature learning models and combine them into deep architectures. Among others, MFA is one special formulation of the graph embedding framework [11]. It utilizes an intrinsic graph to characterize the intra-class compactness, and another penalty graph to characterize the inter-class separability. Our motivation in this work is to combine the advantages of MFA and deep architectures and propose a new supervised initialization method for deep learning algorithms.

There is also some work on feature learning models based on deep architectures [28]-[30]. Yuan et al. [28] proposed an improved multilayer learning model to solve the scene recognition task. This model learns all features used for scene recognition in an unsupervised manner. George et al. [29] proposed the deep semi-non-negative matrix factorization (semi-NMF), which is able to learn hidden representations from different, unknown attributes of a given data set. However, this model is proposed for learning low-dimensional representations that are better suited for clustering. Ngiam et al. [30] proposed a deep architecture, which is an unsupervised model to learn feature representations over multiple modalities. They argued that multi-modality feature learning is better than single-modality learning, and achieved good performance on video and audio data sets.

In this work, we combine the advantages of feature learning models and deep architectures [31], [32], stacking MFA to initialize the deep architecture as a supervised pre-training method. Then, we employ some deep learning techniques, such as back propagation, denoising and dropout, to fine-tune the network. The advantage of this deep architecture is that we can learn desirable weight matrices even if the training data is not large. Compared with traditional deep learning models and shallow feature learning models, the proposed method performs better in most cases.

III. MARGINAL DEEP ARCHITECTURE (MDA)
In this section, we first present an innovative architecture of deep learning models. After that, we introduce the proposed marginal deep architecture (MDA) in detail. In addition, some deep learning techniques used for the training of MDA, including back propagation, denoising and dropout, are also presented.

A. A NOVEL FRAMEWORK OF DEEP ARCHITECTURE
The target of feature learning can be described as follows. Suppose there are n input data, X = {x_1^T, . . . , x_n^T} ∈ R^D, where D is the dimensionality of the data space. The learning objective is to search for compact representations of these data, i.e., Y = {y_1^T, . . . , y_n^T} ∈ R^d, where d is the dimensionality of the low dimensional embeddings.

In this paper, we consider the feature learning problems from the perspective of deep learning, and stack shallow feature learning modules to build deep networks [31], [32]. In this case, the data are mapped from the original D-dimensional space to the d-dimensional space layer by layer. This deep architecture can be seen as a general framework for data representation learning. The data flow in the deep architecture can be abstracted as

D ⇒ D_1 ⇒ · · · ⇒ D_i ⇒ · · · ⇒ D_{p−1} ⇒ d,   (1)

where D_1 is the dimensionality of a high dimensional space mapped from the original space. In order to increase the capacity of the network, it is set to twice or quadruple the number of neurons in the input layer. D_i stands for the dimensionality of the i-th intermediate representation space, and p is the total number of mapping stages. In this framework, feature learning modules with various output dimensions are applied to learn the representations of the data in each layer. The mapping functions between consecutive layers are obtained by the layer-by-layer optimization of the feature learning models. The framework of the proposed deep architectures is briefly presented in Fig. 1.
FIGURE 1. The uniform framework of the proposed deep architectures. W_r1 represents the first-layer random weight matrix, while W_F2 and W_F3 represent the weight matrices learned by feature learning models. For simplicity, the bias terms are omitted.
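As an illustration of this framework, the stacking procedure of Fig. 1 can be sketched as follows. This is only a minimal sketch, not the authors' implementation: the `feature_learner_factory` interface, the weight scale 0.01 and the use of a sigmoid nonlinearity for g(·) are our own assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class StackedFeatureLearningNet:
    """Sketch of the framework of Fig. 1 and Eq. (1): D => D1 => ... => d.

    feature_learner_factory(out_dim) is assumed to return an object with a
    fit(X, y) method and a components_ attribute of shape (in_dim, out_dim).
    """

    def __init__(self, dims, feature_learner_factory, seed=None):
        self.dims = dims                          # e.g. [D, 2 * D, ..., d]
        self.factory = feature_learner_factory
        self.rng = np.random.default_rng(seed)
        self.weights = []

    def initialize(self, X, y):
        # First hidden layer: random projection to a higher dimensional space.
        W1 = 0.01 * self.rng.standard_normal((self.dims[0], self.dims[1]))
        self.weights = [W1]
        A = sigmoid(X @ W1)                       # Eq. (2) below, bias omitted
        # Subsequent layers: each is initialized by a feature learning module
        # trained on the previous layer's outputs.
        for out_dim in self.dims[2:]:
            learner = self.factory(out_dim)
            learner.fit(A, y)
            Wk = learner.components_
            self.weights.append(Wk)
            A = sigmoid(A @ Wk)                   # Eq. (3) below, bias omitted
        return A                                  # fed to the softmax layer

    def transform(self, X):
        A = X
        for W in self.weights:
            A = sigmoid(A @ W)
        return A
```

With an MFA module plugged in as the feature learner, this skeleton becomes the MDA initialization described in Section III-C; a softmax layer on top and the techniques of Section III-D then fine-tune all weights jointly.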
In this figure, the first hidden layer is randomly initialized by the matrix W_r1, and the new representation of an input x can be written as

a_1 = g(W_r1^T x + b),   (2)

where g(·) is a non-linear activation function. After that, MFA or other feature learning models are used to initialize the subsequent layers. The outputs of the subsequent hidden layers are

a_k = g(W_{F_{k−1}}^T a_{k−1} + b).   (3)

For example, in Fig. 1, W_F2 and W_F3 are the weight matrices of the second and third layers learned from feature learning models. In the end, softmax regression is adopted as the last layer for classification tasks. In the first hidden layer, higher dimensional representations of the input data can be learned. Afterwards, the following feature learning models can learn lower dimensional embeddings step by step. The key point of this network is that the weight matrices of the hidden layers, except the matrix of the first hidden layer, are initialized by feature learning modules, which may deliver better performance than other deep learning models initialized by random matrices.

B. MARGINAL FISHER ANALYSIS (MFA)
Based on our novel framework of deep architectures, we employ marginal Fisher analysis (MFA) to construct MDA. There are several advantages to using MFA. Compared with many traditional feature learning models, such as linear discriminant analysis (LDA), it makes no assumption about the data distribution of each class, so that MFA is more general for discriminant analysis. In addition, the margins between classes can properly characterize the separability of the classes. Therefore, we apply MFA as the building block of MDA.

MFA follows the graph embedding framework to construct an intrinsic graph that characterizes the intra-class compactness and another penalty graph that characterizes the inter-class separability [11]. Suppose the input data are X = {x_1, . . . , x_n} and the projection matrix is W = {ω_1, . . . , ω_d}. The intrinsic graph aims to connect each sample to its k-nearest neighbors in the same class. Suppose that we use N(i) to denote the k-nearest neighbors of x_i in its class. The intra-class compactness can be described as

S̃_w = Σ_i Σ_{j∈N(i) ∨ i∈N(j)} ‖ω^T x_i − ω^T x_j‖²   (4)
    = Σ_i Σ_j ‖ω^T x_i − ω^T x_j‖² A_ij   (5)
    = 2 ω^T X (D − A) X^T ω,   (6)

where D is a diagonal matrix with elements D_ii = Σ_j A_ij. The adjacency matrix A is given by

A_ij = { 1, if j ∈ N(i) or i ∈ N(j); 0, otherwise. }   (7)

The penalty graph connects the marginal point pairs of different classes. We use M(C) to denote the set of input pairs that are the k-nearest pairs among the set {(i, j) | i ∈ C ∧ j ∉ C}, where C is the class of an input x_i. The inter-class separability can be defined as

S̃_b = Σ_i Σ_{(i,j)∈M(C_i) ∨ (j,i)∈M(C_j)} ‖ω^T x_i − ω^T x_j‖²   (8)
    = Σ_i Σ_j ‖ω^T x_i − ω^T x_j‖² A^p_ij   (9)
    = 2 ω^T X (D^p − A^p) X^T ω,   (10)
FIGURE 2. A brief representation of MDA. W_r1 stands for the first-layer random weight matrix, while W_MFA2 and W_MFA3 represent the weight matrices learned by MFA. The dotted red lines represent the dropout operation, the dotted red circle is the dropout node, and the cross nodes are corrupted. The denoising and dropout operations are completely random. For simplicity, the bias terms are omitted.
where D^p is a diagonal matrix with elements D^p_ii = Σ_j A^p_ij. The adjacency matrix A^p is given by

A^p_ij = { 1, if (i, j) ∈ M(C_i) or (j, i) ∈ M(C_j); 0, otherwise. }   (11)

The target of MFA is to minimize the intra-class compactness and maximize the inter-class separability simultaneously. Therefore, the marginal Fisher criterion is defined as

W_MFA = argmin_W tr(W^T X (D − A) X^T W) / tr(W^T X (D^p − A^p) X^T W).   (12)

In [11], to apply MFA to face recognition applications, the faces are first projected into a PCA subspace by the transformation matrix W_PCA to reduce noise. Since the features are learned by multiple layers in MDA and the whole deep architecture is fine-tuned by back propagation, it is not necessary to reduce the dimensionality of the data by PCA first. Hence, we compute the projection matrix W_MFA directly using Eq. (12) at each layer.

Here, MFA is used as the initialization method for the weight matrices in MDA. For different layers of MDA, the input X of W_MFA in Eq. (12) is the output of the previous layer. For example, in Fig. 2, W_MFA2 and W_MFA3 are computed using the outputs of their previous layers. In addition, the weight matrices calculated by MFA are only applied to initialize the weight matrices of MDA at the first iteration. Then, we apply back propagation to fine-tune these matrices.

C. MARGINAL DEEP ARCHITECTURE (MDA)
Based on the novel deep architecture framework and the benefits of MFA, we present MDA in the following. As depicted in Fig. 2, MDA is constructed by integrating MFA into the novel framework. Given an input vector x ∈ [0, 1]^d, it is first mapped to a higher dimensional space by a random weight matrix W_r1. The activation output of the first hidden layer can be written as

a_1 = s(W_r1^T x + b),   (13)

where s(·) is the sigmoid function s(x) = 1 / (1 + e^{−x}), b is the bias term, and a_1 is the output of the first hidden layer. From the second layer to the (n − 1)-th layer, the weight matrices are learned by MFA to initialize MDA layer by layer:

a_k = s(W_{MFA_{k−1}}^T a_{k−1} + b).   (14)

We use softmax regression as the last layer of MDA for classification tasks, so that the number of neurons is the same as the number of classes. The cost function can be defined as

J(w) = − (1/N) Σ_{i=1}^{N} Σ_{j=1}^{K} I(y_i = j) log [ exp(w_j^T a_i^{n−1}) / Σ_{l=1}^{K} exp(w_l^T a_i^{n−1}) ],   (15)

where N and K are the total number and the class number of the input data, respectively. I(x) is the indicator function: if x is true, I(x) = 1, else I(x) = 0. y_i is the label of x_i, and w_j and w_l are the weight vectors corresponding to classes j and l. Hence, the probability that x_i is correctly categorized to class j is

p(y_i = j | x_i, w) = exp(w_j^T a_i^{n−1}) / Σ_{l=1}^{K} exp(w_l^T a_i^{n−1}).   (16)

From the (n − 1)-th layer to the last layer, we continue to use MFA to map it. To this end, we can consider that the MDA is initialized with a supervised pre-training method.
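To make this supervised pre-training concrete, the sketch below computes one W_MFA by building the intrinsic and penalty graphs of Eqs. (4)-(11) and solving the generalized eigenvalue problem behind Eq. (12). It is an illustrative reading of the method rather than the authors' code: the per-sample construction of the penalty graph (instead of per-class marginal pairs), the small regularizer and the tie-breaking in neighbor selection are our own assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def mfa_projection(X, y, out_dim, k_intra=5, k_penalty=20, reg=1e-4):
    """Sketch of the MFA projection of Eq. (12).

    X: (n, D) layer inputs, y: (n,) class labels.
    Returns W of shape (D, out_dim), used to initialize one MDA layer.
    """
    n, D = X.shape
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    # Intrinsic graph A: k-nearest neighbors within the same class, Eq. (7).
    A = np.zeros((n, n))
    for i in range(n):
        same = np.where(y == y[i])[0]
        same = same[same != i]
        if same.size:
            A[i, same[np.argsort(dist[i, same])[:k_intra]]] = 1.0
    A = np.maximum(A, A.T)                      # j in N(i) or i in N(j)

    # Penalty graph A^p: nearest different-class samples (a per-sample
    # simplification of the marginal pairs in Eq. (11)).
    Ap = np.zeros((n, n))
    for i in range(n):
        diff = np.where(y != y[i])[0]
        if diff.size:
            Ap[i, diff[np.argsort(dist[i, diff])[:k_penalty]]] = 1.0
    Ap = np.maximum(Ap, Ap.T)

    Lw = np.diag(A.sum(axis=1)) - A             # D - A in Eq. (6)
    Lb = np.diag(Ap.sum(axis=1)) - Ap           # D^p - A^p in Eq. (10)
    Sw = X.T @ Lw @ X                           # numerator matrix of Eq. (12)
    Sb = X.T @ Lb @ X + reg * np.eye(D)         # denominator matrix, regularized

    # Generalized eigendecomposition: the eigenvectors with the smallest
    # eigenvalues of Sw w = lambda Sb w minimize the criterion of Eq. (12).
    _, vecs = eigh(Sw, Sb)
    return vecs[:, :out_dim]
```

Following Section III-C, such a routine would be called once per hidden layer, with X replaced by the activations of the previous layer, before back propagation fine-tunes all weights; the neighborhood sizes 5 and 20 match the settings reported in Section IV-B.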
D. DEEP LEARNING TECHNIQUES APPLIED TO MDA
In order to improve the performance of MDA, we adopt some deep learning techniques to fine-tune it, including back propagation, denoising and dropout.

1) BACK PROPAGATION
Back propagation [33] is an efficient optimization algorithm for MDA, which employs stochastic gradient descent to learn the weight matrices and the bias terms layer by layer. For every neuron i in the output layer (the n-th layer), the error term can be described as

δ_i^n = ∂J(w)/∂a_i^n = − (1/N) Σ_{i=1}^{N} [ I(y_i = j) − p(y_i = j | x_i, w) ],   (17)

where J(w) is the cost function computed from (15), and a_i^n is the output of neuron i in the output layer. For every neuron i from the (n − 1)-th layer down to the second layer, the error term is calculated as

δ_i^k = ( Σ_{j=1}^{k+1} w_{ji}^k δ_j^{k+1} ) s′(a_i^k).   (18)

Then, ∂J(w)/∂w_{ij}^k and ∂J(w)/∂b_i^k are calculated as

∂J(w)/∂w_{ij}^k = a_j^k δ_i^{k+1},   (19)
∂J(w)/∂b_i^k = δ_i^{k+1}.   (20)

By calculating the gradient of the cost function of MDA with respect to the parameters, the back propagation algorithm can update the weight matrices and bias terms in the layers of MDA. It starts from the output layer at the top and ends with the input layer at the bottom.

2) DENOISING OPERATION
Denoising was proposed in the denoising autoencoder to improve its robustness [13]. It can be viewed as a regularization method that avoids the "overfitting" problem. The main idea is to set a required proportion ν of "destruction" and corrupt part of the input data: for every input x, a fixed percentage ν of its components is selected at random and set to 0, while the others remain unchanged. A partially destroyed version x̃ of an initial input x can thus be obtained through a stochastic mapping,

x̃ ∼ q_D(x̃ | x),   (21)

where q_D(x̃ | x) is the unknown data distribution. Hence, a hidden representation h can be computed as

h = s(W^T x̃ + b).   (22)

In MDA, we use denoising to improve its performance. A concrete illustration can be seen in Fig. 2. With the denoising operation, the output of the first hidden layer is computed as

a_2 = s(W_r1^T x̃ + b_1),   (23)

where W_r1 and b_1 are the random weight matrix and the bias term of the first hidden layer. The "denoising" method is proposed based on a hypothetical criterion for network design: robustness to partial destruction of the input data. This criterion implies that good internal representations can be learned from an unidentified distribution of the input data. Therefore, this method helps to learn more robust structures and avoids overfitting in most cases.

3) DROPOUT
Similar to the denoising operation, dropout is an efficient method to prevent overfitting [14]. Dropout has a dramatic effect on the test set when a deep learning model is trained on a small training set. It is a regularization technique that prevents complex co-adaptations on the training data. The key point of dropout is that each neuron in the hidden layers is randomly excluded from the model with a probability of β. Besides, dropout can be seen as an efficient way of performing model averaging with deep learning models. Fig. 2 depicts the dropout operation in MDA.

IV. EXPERIMENTS AND DISCUSSIONS
To evaluate MDA, we performed several experiments on data sets of different sizes. We designed MDA with different structures in order to explore its optimal architecture. At first, we tested MDA on five benchmark data sets to explore its best architecture and compared it with other feature learning and deep learning models. Then, in order to show the performance of MDA initialized by the supervised initialization method on different sizes of data sets, we applied MDA to a data set with extremely limited data, CMU mocap, and a relatively large data set, CIFAR-10. In addition, we combined MDA with the convolutional neural network (CNN) for image classification tasks, and used the supervised initialization method in the pre-training phase of deep CNNs on the CIFAR-10 data set.

A. DATA SET DESCRIPTIONS
We first evaluated MDA on five benchmark data sets. Table 1 illustrates some characteristics of these data sets. The USPS1 data set is composed of handwritten digit images; it contains 7291 training samples and 2007 test samples from 10 classes, and each sample is represented with a 256 dimensional vector. The task is to identify the digits from 0 to 9. The Isolet2 data set contains 6238 training samples and 1559 test samples from 26 classes with 614 dimensional features. It collects audio feature vectors of spoken letters from the English alphabet. Based on the recorded (and preprocessed) audio signals, the task is to identify the exact letter which is spoken. Sensor3 is a sensorless drive diagnosis data set, including 46816 training samples and 11693 test samples from 11 classes.
Each sample contains 48 dimensional features, which are extracted from electric current drive signals. The target is to classify the specific category under different conditions of the drive and its intact and faulty components. Covertype4 is a geological and map-based data set chosen from four wilderness areas located in the Roosevelt National Forest of northern Colorado. It contains 15120 training samples and 565892 test samples from 7 classes with 54 dimensional features. The target is to recognize the categories of forest cover from cartographic variables. IbnSina5 is an ancient Arabic document data set; we select 50 pages of the manuscript for training (17543 training samples) and 10 pages for testing (3125 test samples). There are 174 classes of subwords with 200 dimensions in this data set.

TABLE 1. Characteristics of the used data sets.

In addition, we also tested MDA on a specific task, which uses the CMU motion capture (CMU mocap) data set.6 The CMU mocap data set includes three categories, namely, jumping, running and walking. We chose 49 video sequences from four subjects. For each sequence, the features are generated using Lawrence's method,7 with dimensionality 93 [34]. Because of the few samples in this data set, we adopt 10-fold cross-validation in our experiments and use the average error rate and standard deviation to evaluate the performance. At last, we test MDA on the classic CIFAR-10 data set8 to evaluate its performance on image classification applications. Furthermore, we combined MDA with CNN and evaluated this model on CIFAR-10. The CIFAR-10 data set includes 60000 32 × 32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. Fig. 5 shows some examples of the CIFAR-10 data set from the 10 categories.

1 http://www.gaussianprocess.org/gpml/data/
2 http://archive.ics.uci.edu/ml/datasets/ISOLET
3 http://archive.ics.uci.edu/ml/datasets/Dataset+for+Sensorless+Drive+Diagnosis#
4 http://archive.ics.uci.edu/ml/datasets/Covertype
5 http://www.causality.inf.ethz.ch/al_data/IBN_SINA.html
6 http://mocap.cs.cmu.edu/
7 http://is6.cs.man.ac.uk/~neill/mocap/
8 http://www.cs.toronto.edu/~kriz/cifar.html

B. CLASSIFICATION ON FIVE BENCHMARK DATA SETS
In these experiments, we compare MDA with several related deep learning models on the 5 benchmark data sets. These deep learning models include the autoencoder (AE) [10], stacked autoencoders (SAE), denoising autoencoders (DAE) [13], stacked denoising autoencoders (SDAE) [35], denoising autoencoders with dropout (DAE(dropout)) and a variant of MDA, PDA. Note that the architecture of PDA is the same as that of MDA, but the feature learning module of PDA is PCA [24] instead of MFA [11], [12].

1) EXPERIMENTAL CONFIGURATIONS
All of these deep learning models have the same structure and configurations. The size of the minibatch was set to 100, the learning rate and momentum were set to the default values of 1 and 0.5, and the number of epochs was set to 400, while the dropout rate β and the denoising rate ν were set to 0.1. In AE and SAE, the weight penalty of the L2 norm was set to 10^−4. For MFA, the number of nearest neighbors for constructing the intrinsic graph was set to 5, while that for constructing the penalty graph was set to 20. The target dimensions of the representations learned by MDA and PDA on these data sets are shown in the last column of Table 1.

2) CLASSIFICATION RESULTS
From the experimental results shown in Table 2, we can see that, compared with the other models, MDA performs best on four of the data sets, the exception being the Sensor data set, on which it obtained the second best result. Furthermore, PDA achieved the best result on the Sensor data set and the second best results on the other data sets. These results demonstrate that, in most cases, the proposed deep learning models can achieve good performance on data sets with a limited amount of training data. We can also conclude that the performance of MDA is better than not only the related deep learning models, but also some shallow feature learning methods, such as PCA and MFA, as shown in Table 2. These results demonstrate that MDA and PDA, which are based on stacked feature learning models, can learn better representations of the input data than shallow feature learning methods. However, it is not always true that deep learning models perform better than feature learning models. For instance, the performance of MFA is better than that of AE, DAE, DAE with dropout and SDAE on the Sensor data set. This indicates that, when training with a limited amount of data, some feature learning methods may learn the representations of the input data better than deep learning models. MDA possesses the advantages of both deep learning and feature learning models, and the experimental results show its advantages on data sets with a limited amount of training data.

3) TIME CONSUMPTION
The time consumption of the training and test processes for the 7 different deep architectures is shown in Table 3. Each experiment was carried out 5 times and the averaged results are reported. Note that all the experiments were performed on a 4-core Intel(R) Core(TM)2 Quad Q9550 CPU with a 2.83 GHz clock frequency. We can see that the training times of PDA and MDA are similar to those of AE, DAE and DAE with dropout. However, they are much faster than SAE and SDAE. On the Isolet data set, the time consumptions of PDA and MDA are less than those of the other deep architectures.
TABLE 2. The classification accuracy on five benchmark data sets. ‘‘ORIG’’ represents the results obtained by applying softmax directly to the original
data space. The best result is highlighted with boldface.
TABLE 3. Time consumption of 7 compared deep architectures. The test time is on all the test samples.
TABLE 4. The structures of MDA on the 5 benchmark data sets. "None" means MDA without the second (expansion) layer. "Twice" means the second layer has twice as many nodes as the input layer. "Quadruple" means the second layer has four times as many nodes as the input layer. "Octuple" means the second layer has eight times as many nodes as the input layer.
This demonstrates that PDA and MDA can sometimes achieve good results with a short training time because of their efficient weight initialization. The training periods of SAE and SDAE are very slow because every layer in SAE and SDAE is an autoencoder layer, and it requires a long optimization time to initialize the weight matrix. During the test procedure, all the methods have similar efficiency, as their architectures are the same. These results show the efficiency of MDA, mainly because it can obtain good initial weight matrices in a short time and perform well on the learning tasks.

4) THE DENOISING AND DROPOUT RATIOS
In order to evaluate the influence of the denoising and dropout operations on MDA, we designed an experiment on the 5 benchmark data sets with different denoising ratios and dropout ratios. Firstly, we fixed the dropout ratio at 0.1 and adjusted the denoising ratio from 0.1 to 0.5. Next, we fixed the denoising ratio at 0.1 and modified the dropout ratio from 0.1 to 0.5. The experimental results are shown in Fig. 3. We can see that MDA achieved the minimum error on all data sets when the denoising ratio and dropout ratio are 0.1. The error increased with increasing denoising ratio and dropout ratio in most cases. For the USPS data set, the error decreased when the dropout ratio was changed from 0.2 to 0.3. For the Sensor data set, the error decreased slightly when the denoising ratio was changed from 0.2 to 0.3. The experimental results demonstrate that the denoising and dropout operations can improve the performance of MDA when appropriate values are selected for them. Without the denoising and dropout operations, the experimental results are not as good as when adopting these operations. In the following experiments, we set both of them to 0.1.

5) DIFFERENT STRUCTURES FOR MDA
In order to explore better structures for MDA, we constructed different structures by changing the number of nodes in each layer. For the USPS data set, we first got rid of the second layer, so that the structure of the model was 256 − 128 − 64 − 32. Then, we set the number of nodes in the second layer to twice that of the input layer, so that the architecture changed to 256 − 512 − 128 − 64 − 32. Next, the nodes in the second layer were quadruple as many as those in the input layer, and the architecture became 256 − 1024 − 512 − 256 − 128 − 64 − 32. Finally, the nodes were octuple as many as those in the input layer, so that the architecture was 256 − 2048 − 1024 − 512 − 256 − 128 − 64 − 32. The structures of MDA on the other data sets were changed similarly, as shown in Table 4.
FIGURE 3. Error rates on 5 data sets with respect to different denoising ratios and dropout ratios. (a) USPS. (b) Isolet. (c) Sensor. (d) Covertype. (e) Ibnsina.
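For reference, the denoising and dropout operations whose ratios are varied in Fig. 3 (and described in Section III-D) can be sketched as below. This is an illustrative NumPy sketch under our own assumptions about how the corrupted positions and dropout mask are sampled; it is not the authors' implementation.

```python
import numpy as np

def denoise_corrupt(x, nu, rng):
    """Set a randomly chosen fraction nu of the input components to 0, cf. Eq. (21)."""
    x_tilde = x.copy()
    n_corrupt = int(round(nu * x.size))
    idx = rng.choice(x.size, size=n_corrupt, replace=False)
    x_tilde[idx] = 0.0
    return x_tilde

def dropout_mask(n_units, beta, rng):
    """Exclude each hidden neuron with probability beta (Section III-D)."""
    return (rng.random(n_units) >= beta).astype(float)

rng = np.random.default_rng(0)
x = rng.random(256)                              # e.g. one USPS input vector
x_tilde = denoise_corrupt(x, nu=0.1, rng=rng)    # denoising ratio 0.1
mask = dropout_mask(128, beta=0.1, rng=rng)      # dropout ratio 0.1
# During training, a hidden activation a would be replaced by a * mask.
```

Both ratios of 0.1 correspond to the best settings observed in Fig. 3.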
TABLE 5. The classification error with different structures of MDA on the 5 benchmark data sets. The best result (minimum error) is highlighted with boldface.

The experimental results with these different structures of MDA are shown in Table 5. We can see that MDA achieved the minimum classification error on all the data sets except the Covertype data set when the number of nodes in the second layer is twice that in the input layer. Besides, MDA obtained the best performance on the Covertype data set when the number of nodes in the second layer is quadruple that in the input layer. We can conclude that MDA works well when the nodes in the second layer are twice or quadruple as many as the nodes in the input layer. We then employed the structures that achieved the best results to compare the performance of MDA with its related models on these data sets. In addition, except on the Covertype data set, when the number of nodes in the second layer increases gradually beyond twice the input dimensionality, the errors increase as well.

6) DIFFERENT NUMBERS OF HIDDEN LAYERS FOR MDA
As a deep learning model, the depth of MDA is very important. As the architecture gets deeper and deeper, the training of deep learning models may become more and more difficult, which means that more computing resources are needed if the architecture is very deep. In order to evaluate how many hidden layers are appropriate for different tasks, we designed different structures on the 5 benchmark data sets. We applied MDA with 1 to 7 hidden layers on the USPS and Isolet data sets and with 1 to 5 hidden layers on the Covertype, Sensor and Ibnsina data sets. The other experimental settings are the same as in the previous experiments.

Table 6 shows the classification error on the 5 benchmark data sets with different numbers of hidden layers for MDA. All the data sets achieved the best results when the number of hidden layers is 3, except the USPS data set, on which MDA achieved the best result when the number of hidden layers was 5. When the number of hidden layers increases from 1 to 3, the classification error decreases on all the data sets. With a limited amount of data, very deep architectures are not needed.

7) EFFECTS OF INITIALIZATION METHODS
To evaluate the effect of MFA as a network initialization method, we applied different network initialization methods to the same deep architecture on the 5 benchmark data sets. These initialization methods included random initialization, SAE as the unsupervised pre-training method and MFA as the supervised pre-training method.
TABLE 6. The classification error on the 5 benchmark data sets with different structures of MDA.
FIGURE 4. The test accuracies of MDA with different initialization methods on the 5 benchmark data sets. On each data set, the deep models have the
same structure and hyperparameters. The classification results are evaluated only with initialization and after training the same epochs, respectively.
(a) USPS dataset. (b) Isolet dataset. (c) Sensor dataset. (d) Covertype dataset. (e) Ibnsina dataset.
Firstly, the deep architectures initialized by these methods were evaluated on the 5 benchmark data sets before the training phase. Then, we tested these models after the same number of training epochs on the same data sets. On each data set, the structures and experimental settings of these deep models were kept the same. The structures were 256 − 128 − 64 − 32 for the USPS data set, 617 − 308 for the Isolet data set, 48 − 24 for the Sensor data set, 54 − 27 for the Covertype data set and 200 − 100 for the Ibnsina data set, respectively.

Fig. 4 illustrates the test accuracies of these models before and after the training period on the 5 benchmark data sets. NN is the neural network with random initialization. SAE and MDA are the same neural network initialized by SAE and MFA, respectively. It is easy to see that MDA achieved the best results both in the pre-training period, i.e., without back propagation training, and after the same training period. These experimental results demonstrate that, compared with the other initialization methods, the stacked MFA in MDA can initialize the deep architecture more effectively and obtain better performance after the same number of training epochs.

C. CLASSIFICATION ON THE CMU MOCAP DATASET
To test the performance of MDA on real world applications, we evaluated it on a specific data set, CMU mocap. CMU mocap is a very small data set including only 49 samples. Traditional deep learning methods cannot work well in this application. We compared MDA and PDA with PCA, MFA and 5 other deep learning models. The architectures of all the deep models (except PDA) are 93 − 186 − 93 − 47 − 24. Specifically, since the CMU mocap data set only has 49 samples, the PCA method can only reduce the dimensionality
we intend to exploit some other feature learning methods for the deep architecture construction and explore the different structures of this novel deep learning framework.

ACKNOWLEDGMENT
The Titan X GPU used for this research was donated by the NVIDIA Corporation.

REFERENCES
[1] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proc. ICML, 2008, pp. 160-167.
[2] D. C. Cirecsan, U. Meier, L. M. Gambardella, and J. Schmidhuber, "Deep, big, simple neural nets for handwritten digit recognition," Neural Comput., vol. 22, no. 12, pp. 3207-3220, 2010.
[3] Q. Le, W. Zou, S. Yeung, and A. Ng, "Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis," in Proc. CVPR, Jun. 2011, pp. 3361-3368.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. NIPS, 2012, pp. 1106-1114.
[5] T.-H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma, "PCANet: A simple deep learning baseline for image classification?" IEEE Trans. Image Process., vol. 24, no. 12, pp. 5017-5032, Dec. 2015.
[6] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436-444, May 2015.
[7] M. Ranzato, Y. Boureau, and Y. L. Cun, "Sparse feature learning for deep belief networks," in Proc. NIPS, 2007, pp. 1185-1192.
[8] H. Lee, P. Pham, Y. Largman, and A. Y. Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks," in Proc. NIPS, 2009, pp. 1096-1104.
[9] H. Lee, R. B. Grosse, R. Ranganath, and A. Y. Ng, "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," in Proc. ICML, 2009, pp. 609-616.
[10] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504-507, 2006.
[11] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, "Graph embedding and extensions: A general framework for dimensionality reduction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, pp. 40-51, Jan. 2007.
[12] G. Zhong, Y. Chherawala, and M. Cheriet, "An empirical evaluation of supervised dimensionality reduction for recognition," in Proc. ICDAR, Aug. 2013, pp. 1315-1319.
[13] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proc. ICML, 2008, pp. 1096-1103.
[14] G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. (2012). "Improving neural networks by preventing co-adaptation of feature detectors." [Online]. Available: https://arxiv.org/abs/1207.0580
[15] J. Donahue et al., "DeCAF: A deep convolutional activation feature for generic visual recognition," in Proc. ICML, 2014, pp. 647-655.
[16] M. Long, Y. Cao, J. Wang, and M. I. Jordan, "Learning transferable features with deep adaptation networks," in Proc. ICML, 2015, pp. 97-105.
[17] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, "Learning deep features for scene recognition using places database," in Proc. NIPS, 2014, pp. 487-495.
[18] G. Zhong, X. Ling, and L.-N. Wang, "From shallow feature learning to deep learning: Benefits from the width and depth of deep architectures," Wiley Interdiscipl. Rev., Data Mining Knowl. Discovery, vol. 9, Jan./Feb. 2019, Art. no. e1255.
[19] G. Hinton et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82-97, Nov. 2012.
[20] H. Y. Xiong et al., "The human splicing code reveals new insights into the genetic determinants of disease," Science, vol. 347, no. 6218, 2015, Art. no. 1254806.
[21] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proc. NIPS, 2014, pp. 3104-3112.
[22] L. J. P. van der Maaten, E. O. Postma, and H. J. van den Herik, "Dimensionality reduction: A comparative review," J. Mach. Learn. Res., vol. 10, nos. 1-41, pp. 66-71, Oct. 2009.
[23] L. J. van der Maaten, "An introduction to dimensionality reduction using MATLAB," Univ. Maastricht, Maastricht, The Netherlands, Tech. Rep. MICC 07-07, 2007.
[24] I. T. Jolliffe, Principal Component Analysis. Hoboken, NJ, USA: Wiley, 2002.
[25] X. He and P. Niyogi, "Locality preserving projections," in Proc. NIPS, vol. 16, 2003, pp. 153-160.
[26] G. E. Hinton and S. T. Roweis, "Stochastic neighbor embedding," in Proc. NIPS, 2002, pp. 833-840.
[27] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Ann. Eugenics, vol. 7, no. 2, pp. 179-188, 1936.
[28] Y. Yuan, L. Mou, and X. Lu, "Scene recognition by manifold regularized deep learning architecture," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 10, pp. 2222-2233, Oct. 2015.
[29] T. George, B. Konstantinos, Z. Stefanos, and B. W. Schuller, "A deep semi-NMF model for learning hidden representations," in Proc. ICML, 2014, pp. 1692-1700.
[30] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, "Multimodal deep learning," in Proc. ICML, 2011, pp. 689-696.
[31] Y. Zheng, G. Zhong, J. Liu, X. Cai, and J. Dong, "Visual texture perception with feature learning models and deep architectures," in Proc. CCPR, 2014, pp. 401-410.
[32] Y. Zheng, Y. Cai, G. Zhong, Y. Chherawala, Y. Shi, and J. Dong, "Stretching deep architectures for text recognition," in Proc. ICDAR, 2015, pp. 236-240.
[33] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533-536, Oct. 1986.
[34] G. Zhong, W.-J. Li, D.-Y. Yeung, X. Hou, and C.-L. Liu, "Gaussian process latent random field," in Proc. AAAI, 2010, pp. 1-6.
[35] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," J. Mach. Learn. Res., vol. 11, no. 12, pp. 3371-3408, Dec. 2010.
[36] K. D. Humbird, J. L. Peterson, and R. G. McClarren, "Deep neural network initialization with decision trees," IEEE Trans. Neural Netw. Learn. Syst., to be published. doi: 10.1109/TNNLS.2018.2869694.

GUOQIANG ZHONG (M'16) received the B.S. degree in mathematics from Hebei Normal University, Shijiazhuang, China, in 2004, the M.S. degree in operations research and cybernetics from the Beijing University of Technology, Beijing, China, in 2007, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, in 2011. From 2011 to 2013, he was a Postdoctoral Fellow with the Synchromedia Laboratory for Multimedia Communication in Telepresence, École de Technologie Supérieure, University of Quebec, Montreal, Canada. Since 2014, he has been an Associate Professor with the Department of Computer Science and Technology, Ocean University of China, Qingdao, China. He has published three books, four book chapters, and more than 50 technical papers in the areas of artificial intelligence, pattern recognition, machine learning, and data mining. His research interests include pattern recognition, machine learning, and image processing. He has served as a PC Member/Reviewer for many international conferences and top journals, such as the IEEE TNNLS, the IEEE TKDE, the IEEE TCSVT, Pattern Recognition, Knowledge-Based Systems, Neurocomputing, ACM TKDD, ICPR, and ICDAR. He is a member of the ACM and IAPR and a Professional Committee Member of CAAI-PR, CAA-PRMI, and CSIG-DIAR. He has been awarded as an Outstanding Reviewer by several journals, such as Pattern Recognition, Neurocomputing, and Cognitive Systems Research.
KANG ZHANG received the B.S. degree in communication engineering from the Ocean University of China, Qingdao, China, in 2014, where he is currently pursuing the M.S. degree in electronics and communication engineering. His research interests include data mining, machine learning, and image processing.

YUCHEN ZHENG received the B.S. and M.S. degrees from the Department of Computer Science and Technology, Ocean University of China, in 2014 and 2017, respectively. He is currently pursuing the Ph.D. degree with the Department of Advanced Information Technology, Kyushu University, Japan. His research interests include machine learning, document analysis, neural networks, and computer vision.