Manifold Regularized Convolutional Neural Network
Fuzhen Zhuang1,2, Lang Huang1(B), Jia He1,2, Jixin Ma3, and Qing He1,2

1 Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
{zhuangfz,hej,heq}@ics.ict.ac.cn, [email protected]
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 University of Greenwich, London, UK
1 Introduction
Recently, deep learning has shown great success in learning robust representations and outperforms conventional state-of-the-art methods in computer vision applications. Convolutional neural networks (CNNs) have won the ImageNet challenge, a contest based on a large-scale data set with over 1 million images, every year since 2012 [15,23]. The key to this success is that the substantially increased depth enlarges the capacity of CNNs and thus enables them to fit the data sets well.
Lang Huang—This work was finished while Lang Huang was an intern (under the supervision of Fuzhen Zhuang) at the Institute of Computing Technology, Chinese Academy of Sciences.
The works [17,24] reinforced this result by showing significant improvements of very deep neural network architectures over shallow models.
On the other hand, as the power of CNNs keeps growing, the complexity of the models increases, which in turn requires more data to avoid overfitting during training. The problem is that most data sets are not large enough for training, so the performance degrades. To take advantage of both the larger capacity of deep models and the smaller data requirement of shallow ones, a training technique called fine-tuning has been proposed. Fine-tuning adapts pretrained models, which are trained on large-scale data sets such as ImageNet, to new tasks while only slightly modifying the parameters of the pretrained models. Several studies have reported that fine-tuning obtains outstanding performance and reduces training time from 2 or 3 weeks to a few days [18,19].
Although fine-tuning can learn effective representations in various fields, its performance drops significantly when it is directly applied to transfer learning with insufficient target domain data. Transfer learning aims to improve learning performance in a target domain with little or no label information by leveraging knowledge from an auxiliary source domain. Yosinski et al. [20] pointed out that deep features transition from general to specific in the higher layers of the network. Hence, fine-tuning with source domain data alone will make the learned representation too specific to the source domain.
To address this problem, we propose a manifold regularized convolutional neural network (MRCNN) framework for transfer learning, which uses a manifold learning approach to regularize the fine-tuning process. Manifold learning approaches are widely adopted in semi-supervised or unsupervised learning and assume that data points within the same local structure are likely to have the same label [2]. Therefore, the unlabeled data in the target domain can be utilized to preserve such structure in the higher layers or the output layer by imposing manifold-based constraints. By coupling manifold regularization with fine-tuning, we expect the representations learned in the higher layers to remain general or become more specific to the target domain, so that the knowledge from the auxiliary source domain is successfully transferred.
The contributions of this paper are summarized as follows:
1. We propose an unsupervised learning framework that combines the fine-tuning technique and manifold regularization within a deep convolutional neural network for transfer learning.
2. We conduct extensive experiments on several data sets, and statistical evidence shows the effectiveness of our framework.
3. Furthermore, we investigate the impact of fine-tuning and manifold regularization on knowledge transfer.
2 Preliminary Knowledge
In this section, we first review the convolutional neural network architecture, the fine-tuning technique and manifold regularization, which serve as the preliminary knowledge of this paper.
2.1 Convolutional Neural Network
Deep learning approaches have been widely adopted in the last decade [3]. In particular, the convolutional neural network (CNN) was proposed to learn robust representations and achieves satisfying results in computer vision [15,17].
A typical CNN is a feed-forward neural network stacked with multiple convolutional (conv) layers, pooling layers (max-pool or average-pool), fully connected (fc) layers and a classifier on top of them. Both conv and fc layers learn a non-linear mapping $h^l$ in the $l$th layer, with a slight difference. The mappings of the conv and fc layers can be formalized as
$h^l = \sigma(w^l * h^{l-1} + b^l)$,   (1)
$h^l = \sigma(w^l h^{l-1} + b^l)$,   (2)
respectively, where $w^l$ is the weight matrix (the kernel in conv layers) and $b^l$ is the bias of the $l$th hidden layer, '*' denotes the convolution operation and $\sigma(\cdot)$ is a non-linear activation function, e.g., the Rectified Linear Unit (ReLU) $\sigma(x) = \max(0, x)$ [8] for hidden layers or the softmax function $\sigma(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$ for the output layer. Pooling layers perform a downsampling operation along the spatial dimensions, which is useful for reducing computational cost and providing robustness of the learned representation [26].
Given a data set $X$ with labels $y$, the objective to minimize in a CNN is
$L = \frac{1}{n} \sum_{i=1}^{n} C(h^l_i, y_i)$,   (3)
where $n$ is the size of the data set $X$, $l$ is the depth of the model, $h^l$ is the learned representation of the $l$th hidden layer formulated by Eq. (1) or (2), and $C(\cdot, \cdot)$ is the cross entropy loss function.
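For concreteness, the following minimal NumPy sketch illustrates the fc mapping of Eq. (2), the ReLU and softmax activations, and the cross entropy objective of Eq. (3); the layer sizes and variable names are illustrative and not those of the paper's actual implementation.

```python
import numpy as np

def relu(x):
    # ReLU activation for hidden layers: sigma(x) = max(0, x)
    return np.maximum(0.0, x)

def softmax(x):
    # Softmax for the output layer, computed row-wise with a stability shift
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def fc_layer(h_prev, w, b, activation=relu):
    # Fully connected mapping of Eq. (2): h^l = sigma(w^l h^{l-1} + b^l)
    return activation(h_prev @ w + b)

def cross_entropy(probs, labels):
    # Objective of Eq. (3): average cross entropy over the n training examples
    n = labels.shape[0]
    return -np.log(probs[np.arange(n), labels] + 1e-12).mean()

# Toy example: 4 samples, 8-dimensional features, binary output.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
w2, b2 = rng.normal(size=(16, 2)), np.zeros(2)
h1 = fc_layer(x, w1, b1)                          # hidden fc layer with ReLU
probs = fc_layer(h1, w2, b2, activation=softmax)  # output layer with softmax
loss = cross_entropy(probs, np.array([0, 1, 0, 1]))
```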
2.2 Fine-Tuning
According to [20], the representations in the earlier layers of deep CNNs trained on large-scale data sets are general across different tasks. Hence, it is beneficial for both performance and training time to use those weights either as an initialization or as a feature extractor. Fine-tuning is a training procedure that adapts pretrained models to new tasks.
A standard fine-tuning procedure usually re-initializes the top fc layers to match the dimensions, since the sizes of the data sets differ between tasks. For the conv layers, there are two major strategies: (1) fix some earlier layers and only fine-tune the higher layers when data is very limited [18]; (2) back-propagate through all layers when enough data is available [19].
Note that it is common to use a smaller learning rate to avoid overfitting. In this paper, we mainly follow the first fine-tuning strategy; the details are presented in later sections.
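As a rough illustration of strategy (1), the sketch below assigns per-block learning rates: zero for the frozen blocks, a reduced rate for the gently tuned block, and the full rate for the re-initialized fc layers. The block and layer names and the 0.1 scaling factor are assumptions for illustration, not values reported in the paper.

```python
# Hypothetical layer grouping for a VGG19-style network; names are illustrative.
conv_blocks = ["conv1", "conv2", "conv3", "conv4", "conv5"]
fc_layers = ["fc6", "fc7", "fc8"]

def learning_rate_for(layer_name, base_lr=1e-3):
    """Per-layer learning rates implementing fine-tuning strategy (1):
    frozen early conv blocks, gently tuned conv5, freshly trained fc layers."""
    if layer_name in ("conv1", "conv2", "conv3", "conv4"):
        return 0.0              # fixed: used purely as a feature extractor
    if layer_name == "conv5":
        return 0.1 * base_lr    # fine-tuned with a smaller learning rate
    return base_lr              # re-initialized fc layers, full learning rate

schedule = {name: learning_rate_for(name) for name in conv_blocks + fc_layers}
```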
2.3 Manifold Regularization
The similarity between two instances is measured by the cosine similarity
$\cos(x_i, x_j) = \frac{x_i^{\top} x_j}{\sqrt{x_i^{\top} x_i} \cdot \sqrt{x_j^{\top} x_j}}$,   (5)
where $x_i$ and $x_j$ are column vectors and $\top$ denotes matrix transpose. Let $D = \mathrm{diag}(\sum_j M_{[i,j]})$ and $L = D - M$ be the Laplacian matrix of the similarity matrix $M$; then the manifold regularization term $\Gamma(X)$ can be written as
$\Gamma(X) = \sum_{i,j} M_{[i,j]} \, \|f(x_i) - f(x_j)\|^2 = \mathrm{trace}(F^{\top} L F)$,   (6)
where $f(\cdot)$ is a mapping function, $F_{[i,\cdot]}$ denotes the $i$th row of $F$ and $F_{[i,\cdot]} = f(x_i)$.
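A small NumPy sketch of these quantities is given below: the cosine similarity of Eq. (5), a k-nearest-neighbor similarity matrix M (the kNN construction is an assumption, since the formal definition of M is not reproduced here), and the manifold term of Eq. (6) via the graph Laplacian.

```python
import numpy as np

def cosine_similarity(X):
    # Eq. (5): cos(x_i, x_j) = x_i^T x_j / (||x_i|| * ||x_j||)
    norms = np.linalg.norm(X, axis=1, keepdims=True) + 1e-12
    Xn = X / norms
    return Xn @ Xn.T

def knn_graph(X, k):
    # Symmetric k-nearest-neighbor similarity matrix M (assumed construction)
    S = cosine_similarity(X)
    n = S.shape[0]
    M = np.zeros_like(S)
    for i in range(n):
        idx = np.argsort(-S[i])[1:k + 1]   # most similar points, skipping the point itself
        M[i, idx] = S[i, idx]
    return np.maximum(M, M.T)              # symmetrize the graph

def manifold_term(F, M):
    # Eq. (6): Gamma(X) = sum_ij M_ij ||f(x_i) - f(x_j)||^2 = trace(F^T L F)
    D = np.diag(M.sum(axis=1))
    L = D - M                              # graph Laplacian
    return np.trace(F.T @ L @ F)
```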
3 MRCNN Model
In this paper, we focus on unsupervised transfer learning, i.e., only labeled data are available in the source domain and only unlabeled data in the target domain. Thus, we denote the source domain as $\mathcal{D}_s = \{x_i^{(s)}, y_i^{(s)}\}_{i=1}^{n_s}$ with $n_s$ labeled instances, and the target domain as $\mathcal{D}_t = \{x_i^{(t)}\}_{i=1}^{n_t}$ with $n_t$ unlabeled instances.
Now we are ready to present the details of our framework. MRCNN is based on the VGG19 network architecture [17], which is composed of 16 conv layers, 5 max-pooling layers, 3 fc layers and 1 softmax layer. Following the notation in [17], the 16 conv layers are divided into 5 blocks by the max-pooling layers, i.e., the conv 1 block consists of the conv layers before the first max-pooling layer, the conv 2 block consists of the conv layers between the first and second max-pooling layers, and so on.
We adopt the weights of the VGG19 model pretrained on ImageNet for all conv layers and randomly initialize the fc layers; these layers are shared by both the source and target domains. We then further extend the VGG net by integrating manifold regularization, which is imposed on the target domain to enforce similar instances to have the same labels, and a cross entropy loss on the source domain data to incorporate label information. By preserving the manifold structure in this manner, we hope the representations learned in the higher layers generalize well. Figure 1 illustrates the MRCNN framework. Note that conv 1–conv 5 each denotes
a convolutional block, not a single layer. Due to the limited data and the fact that the learned representations transition from general to specific along the network, we adopt the following three strategies: (1) we randomly initialize the 3 fc layers, because fc layers require strict dimensional matching while the sizes of the input data sets differ; (2) the conv 5 block, containing 4 conv layers, is carefully fine-tuned, since the representation in this block is still transferable and needs only slight tuning; in other words, we apply a smaller learning rate to this block; (3) the conv 1–conv 4 blocks, consisting of 12 conv layers, are fixed and used as a feature extractor, since the representations in these blocks are general.
Let $w$ and $b$ denote the collections of weights and biases of the MRCNN and let $L$ denote the Laplacian matrix of the target domain; the overall objective is written as
$\min_{w,b} \; \frac{1}{n_s} \sum_{i=1}^{n_s} C\big(h^l_i, y_i^{(s)}\big) + \alpha \, \Gamma\big(k, L, w, b, x^{(t)}\big) + \beta \, \Omega(w, b)$.   (7)
The first term of Eq. (7) is the cross entropy loss between the output logits and the labels of the source domain, as presented in Eq. (3). The second term $\Gamma(k, L, w, b, x^{(t)})$ is the manifold regularization described in Eq. (6), imposed on the target domain, where $k$ is the number of nearest neighbors. The last term of Eq. (7) is the $L_2$ norm of the weight and bias matrices, which controls the complexity of the network structure and is defined as
$\Omega(w, b) = \sum_{i=1}^{l} \big(\|w_i\|^2 + \|b_i\|^2\big)$,   (8)
where $l$ is the depth of the neural network, i.e., 19 in this paper. $\alpha$ and $\beta$ are hyperparameters that balance the importance of manifold regularization and model complexity in the overall framework.
The objective in Eq. (7) can be minimized by gradient descent. Since we implement MRCNN with the deep learning library TensorFlow [25], which automatically computes the gradients and derives the solution, the update rules for the parameters are omitted here.
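For clarity, the sketch below shows how the three terms of Eq. (7) are combined into a single scalar objective. In the actual implementation this composition is built as a TensorFlow graph and differentiated automatically; here plain NumPy is used and the cross entropy and manifold terms are assumed to be precomputed scalars.

```python
import numpy as np

def mrcnn_objective(ce_source, gamma_target, weights, biases, alpha, beta):
    """Assemble the overall objective of Eq. (7):
    source-domain cross entropy + alpha * manifold term + beta * Omega(w, b)."""
    # Eq. (8): L2 penalty over all weight and bias matrices of the network
    omega = sum(np.sum(w ** 2) for w in weights) + sum(np.sum(b ** 2) for b in biases)
    return ce_source + alpha * gamma_target + beta * omega
```

Gradient descent on this composite objective then updates the fc layers and the conv 5 block at their respective learning rates, while conv 1–conv 4 remain fixed as a feature extractor.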
4 Experiments
To evaluate the effectiveness of our proposed framework MRCNN, we conduct
experiments on two image data sets and compare our model with several state-
of-the-art baseline methods.
CIFAR-100. The CIFAR-100 data set1 has 100 classes grouped into 20 superclasses [5], and each class contains 600 images. Among these 20 superclasses, we choose two of them, 'fruit and vegetables' and 'household electrical devices', and take 'fruit and vegetables' as the positive class and 'household electrical devices' as the negative one. Each of these two superclasses contains 5 classes. To construct transfer learning classification problems, we randomly choose one class from 'fruit and vegetables' and one from 'household electrical devices' as the source domain, and then choose another class of 'fruit and vegetables' and another of 'household electrical devices' from the remaining classes to construct the target domain. In this way, we obtain 400 ($P_5^2 \cdot P_5^2$) classification problems.
Corel. The Corel data set2 consists of two different top categories, 'flower' and 'traffic' [9]. Each top category further includes four subcategories. We take 'flower' as the positive class and 'traffic' as the negative one. By following the same construction procedure, we obtain 144 ($P_4^2 \cdot P_4^2$) transfer learning classification problems.
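The task construction described above amounts to enumerating ordered pairs of classes within each superclass, as in the short sketch below; the class names follow the CIFAR-100 label list for the two chosen superclasses, and the Corel case is analogous with four subcategories per top category.

```python
from itertools import permutations

# Classes of the two CIFAR-100 superclasses used here (fine labels).
positive_classes = ["apple", "mushroom", "orange", "pear", "sweet_pepper"]
negative_classes = ["clock", "keyboard", "lamp", "telephone", "television"]

tasks = []
for pos_src, pos_tgt in permutations(positive_classes, 2):
    for neg_src, neg_tgt in permutations(negative_classes, 2):
        # Source domain: one positive and one negative class;
        # target domain: a different positive and a different negative class.
        tasks.append({"source": (pos_src, neg_src), "target": (pos_tgt, neg_tgt)})

assert len(tasks) == 20 * 20   # P(5,2) * P(5,2) = 400 tasks
```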
Note that we do not perform any data augmentation on these data sets; we only subtract the mean for the CNN-based methods and normalize the features to [0, 1] for the other compared competitors.
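A minimal sketch of this preprocessing is shown below; whether the mean is taken per channel or per pixel is an assumption, since the paper only states that the mean is subtracted.

```python
import numpy as np

def preprocess_for_cnn(images):
    # CNN-based methods: subtract the (assumed per-channel) mean, no augmentation.
    images = images.astype(np.float32)          # shape (N, H, W, C)
    return images - images.mean(axis=(0, 1, 2), keepdims=True)

def preprocess_for_baselines(features):
    # Other competitors: min-max normalize each feature to [0, 1].
    fmin = features.min(axis=0, keepdims=True)
    fmax = features.max(axis=0, keepdims=True)
    return (features - fmin) / (fmax - fmin + 1e-12)
```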
The compared baseline methods are listed below.
– Logistic Regression (LR), one of the most widely applied supervised learning algorithms, without any transfer learning technique.
– Transductive Support Vector Machine (TSVM) [1], a transductive learning algorithm that incorporates unlabeled target domain data. However, TSVM assumes that the labeled source domain data and the unlabeled target domain data follow the same distribution.
– Transfer Component Analysis (TCA) [12], which aims at learning a low-dimensional representation for transfer learning. We use a Support Vector Machine (SVM) as its base classifier in this paper.
1 https://www.cs.toronto.edu/~kriz/cifar.html.
2 http://archive.ics.uci.edu/ml/datasets/Corel+Image+Features.
– Transfer Learning with Deep Autoencoders (TLDA) [22], which uses deep autoencoders to find a proper embedding space for both the source and target domains, while their distributions are explicitly enforced to be similar.
– Standard VGG net [17], which is fine-tuned on the source domain but without manifold regularization. We denote it as VGG.
4.4 Results
In total, we construct 400 classification tasks for CIFAR-100 and 144 classification tasks for Corel. To make a comprehensive comparison, we further divide the classification tasks into two groups for each data set according to the accuracy of LR. Specifically, we first run the LR model on all classification tasks and then split them into two groups: the first group contains the tasks on which LR achieves an accuracy lower than 70%, while the other contains those with an LR accuracy higher than 70%. Finally, we report the average results of the two groups; all results are shown in Table 1. Left and Right respectively denote the average performance on the tasks with accuracies lower and higher than 70%, and Total denotes the average result over all tasks. Note that a lower LR accuracy indicates a more difficult transfer, and vice versa.
From these results, we have the following observations:
– TSVM is better than LR, which indicates the importance of considering unlabeled data. However, TSVM cannot achieve satisfying results, since it assumes that the labeled and unlabeled data follow the same distribution. TCA performs even worse than LR on the Corel data set, which may reflect the difficulty of transfer on the constructed classification tasks. TLDA delivers the best results among the above methods in most cases, which reveals the power of deep models.
3 http://vikas.sindhwani.org/svmlin.html.
4 The code is available at https://github.com/LayneH/MRCNN.
– The CNN-based methods, i.e., VGG and MRCNN, significantly outperform all other conventional methods by a large margin. We attribute this to the fine-tuning procedure, which fixes the earlier 4 conv blocks and only back-propagates through the last layers to preserve their generality. This also indicates the necessity of adopting deep learning models for classification.
– Among the deep learning methods, MRCNN achieves considerable improvement over the standard VGG net with fine-tuning. This validates that manifold regularization successfully guides the training process to obtain better representations.
– Overall, the incorporation of manifold regularization leads to the success of MRCNN. In other words, MRCNN performs the best on all groups.
The confusion of the nearest neighbors on the target domain is measured by the fraction of nearest neighbors whose label differs from that of their anchor instance, where $1\{\cdot\}$ is the indicator function, $\#nn_i$ is the number of nearest neighbors of $x_i^{(t)}$, $nn_{i,j}$ is the $j$th nearest neighbor of $x_i^{(t)}$ and $\mathrm{label}(x)$ is the label of $x$. Intuitively, the higher this value is, the more confusing knowledge is introduced by imposing manifold regularization. We sort the classification problems according to the values of this measure on the target domains and show how it influences the classification accuracy in Fig. 2(b). Moreover, we group the tasks into 3 groups according to these values: the first group consists of the tasks with values below 0.1, the second of those with values between 0.1 and 0.15, and the rest form the third group. The average accuracy of each group is presented in Table 2.
From Fig. 2(b) and Table 2, we find that the classification accuracy of all methods substantially drops as this value grows. The reason may be that the positive and negative instances in these classification tasks are similar and hence not easy to separate. The above results also reveal that the confusion of the nearest neighbors is the key factor influencing the performance of MRCNN. Hence, to further generalize MRCNN to transfer learning tasks, one crucial problem is how to obtain correct nearest neighbors.
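Since the formal definition of this confusion measure is not reproduced above, the following sketch implements it as described verbally: the fraction of target-domain nearest neighbors whose label differs from that of their anchor instance; the exact normalization is an assumption.

```python
import numpy as np

def nn_confusion(labels, neighbor_idx):
    """Sketch of the nearest-neighbor confusion measure described above.
    labels: (n,) true labels of the target-domain instances
    neighbor_idx: (n, k) indices of each instance's k nearest neighbors
    Returns the fraction of neighbors whose label differs from their anchor's.
    """
    n, k = neighbor_idx.shape
    # 1{label(nn_{i,j}) != label(x_i)} for every instance i and neighbor j
    mismatches = labels[neighbor_idx] != labels[:, None]
    return mismatches.sum() / (n * k)
```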
5 Related Work
Transfer learning improves learning in a new domain by transferring knowledge from an auxiliary source domain [7,11]. It has drawn much attention in the past decades for its potential to ease the pain of manual labeling. Feature-based approaches are among the most widely proposed; they aim to learn a good feature representation for both the source domain and the target domain by reducing the difference between the domains or by integrating regularization [10,12,14,16]. Among feature-based transfer learning methods, several have been proposed to reduce the domain discrepancy explicitly. For example, transfer component analysis (TCA) [12] aims to minimize the difference between the domain distributions in a reproducing kernel Hilbert space, and [10] tries to find a subspace where training and testing samples are approximately i.i.d. by integrating a Bregman divergence-based regularization between the domain distributions. One crucial problem is that most of these methods only adopt shallow representation models to reduce the domain discrepancy, which limits their ability to generalize to various tasks.
Deep learning methods have shown their potential to learn effective and robust representations in recent years. To enjoy such benefits, several frameworks have been introduced. Stacked Denoising Autoencoders (SDAEs) [13] improve the effectiveness of the representations learned by Denoising Autoencoders (DAEs) [4] by extending the depth of DAEs, i.e., stacking multiple DAEs within one framework. [22] further couples SDAEs with feature-based transfer learning approaches.
6 Conclusion
References
1. Joachims, T.: Transductive inference for text classification using support vector machines. In: ICML, pp. 200–209 (1999)
2. Belkin, M., Niyogi, P., Sindhwani, V.: Manifold regularization: a geometric frame-
work for learning from labeled and unlabeled examples. J. Mach. Learn. Res. 7,
2399–2434 (2006)
3. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with
neural networks. Science 313(5786), 504–507 (2006)
4. Vincent, P., Larochelle, H., Bengio, Y., et al.: Extracting and composing robust
features with denoising autoencoders. In: Proceedings of the 25th International
Conference on Machine Learning, pp. 1096–1103. ACM (2008)
5. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images
(2009)
6. Wu, J., Xiong, H., Chen, J.: Adapting the right measures for k-means clustering.
In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 877–886. ACM (2009)
7. Torrey, L., Shavlik, J.: Transfer learning. Handb. Res. Mach. Learn. Appl. Trends:
Algorithms Methods Tech. 1, 242 (2009)
8. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann
machines. In: Proceedings of the 27th International Conference on Machine Learn-
ing (ICML 2010), pp. 807–814 (2010)
9. Zhuang, F., Luo, P., Xiong, H., et al.: Cross-domain learning from multiple sources:
a consensus regularization perspective. IEEE Trans. Knowl. Data Eng. 22(12),
1664–1678 (2010)
10. Si, S., Tao, D., Geng, B.: Bregman divergence-based regularization for transfer
subspace learning. IEEE Trans. Knowl. Data Eng. 22(7), 929–942 (2010)
11. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng.
22(10), 1345–1359 (2010)
12. Pan, S.J., Tsang, I.W., Kwok, J.T., et al.: Domain adaptation via transfer compo-
nent analysis. IEEE Trans. Neural Netw. 22(2), 199–210 (2011)
13. Vincent, P., Larochelle, H., Lajoie, I., et al.: Stacked denoising autoencoders: learn-
ing useful representations in a deep network with a local denoising criterion. J.
Mach. Learn. Res. 11, 3371–3408 (2010)
14. Glorot, X., Bordes, A., Bengio, Y.: Domain adaptation for large-scale sentiment
classification: a deep learning approach. In: Proceedings of the 28th International
Conference on Machine Learning (ICML 2011), pp. 513–520 (2011)
15. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep con-
volutional neural networks. In: Advances in Neural Information Processing Sys-
tems, pp. 1097–1105 (2012)
16. Chen, M., Xu, Z., Weinberger, K., et al.: Marginalized denoising autoencoders for
domain adaptation. arXiv preprint arXiv:1206.4683 (2012)
17. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556 (2014)
18. Hoffman, J., Guadarrama, S., Tzeng, E.S., et al.: LSDA: large scale detection
through adaptation. In: Advances in Neural Information Processing Systems, pp.
3536–3544 (2014)
19. Sharif Razavian, A., Azizpour, H., Sullivan, J., et al.: CNN features off-the-shelf:
an astounding baseline for recognition. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition Workshops, pp. 806–813 (2014)
20. Yosinski, J., Clune, J., Bengio, Y., et al.: How transferable are features in deep
neural networks?. In: Advances in Neural Information Processing Systems, pp.
3320–3328 (2014)
21. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
22. Zhuang, F., Cheng, X., Luo, P., et al.: Supervised representation learning: transfer
learning with deep autoencoders. In: IJCAI, pp. 4119–4125 (2015)
23. Russakovsky, O., Deng, J., Su, H., et al.: ImageNet large scale visual recognition
challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)
24. He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 770–778 (2016)
25. Abadi, M., Agarwal, A., Barham, P., et al.: TensorFlow: large-scale machine learning
on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)
26. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge
(2016)