Relaxing From Vocabulary: Robust Weakly-Supervised Deep Learning for Vocabulary-Free Image Tagging (ICCV 2015)
Jianlong Fu^{1,2}, Yue Wu^3, Tao Mei^2, Jinqiao Wang^1, Hanqing Lu^1, and Yong Rui^2
^1 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
^2 Microsoft Research, Beijing, China
^3 University of Science and Technology of China, Hefei, China
^2 {jianf, tmei, yongrui}@microsoft.com, ^1 {jqwang, luhq}@nlpr.ia.ac.cn, ^3 [email protected]
Abstract

The development of deep learning has empowered machines with a capability of recognizing limited image categories that is comparable to human beings. However, most existing approaches rely heavily on human-curated training data, which hinders scalability to large and unlabeled vocabularies in image tagging. In this paper, we propose a weakly-supervised deep learning model which can be trained from readily available Web images to relax the dependence on human labor and scale up to arbitrary tags (categories). Specifically, based on the assumption that features of true samples in a category tend to be similar while noisy samples tend to be variant, we embed the feature map of the last deep layer into a new affinity representation, and further minimize the discrepancy between the affinity representation and its low-rank approximation. The discrepancy is finally transformed into the objective function to give relevance feedback to back propagation. Experiments show that we can achieve a performance gain of 14.0% in terms of a semantic-based relevance metric in image tagging with 63,043 tags from WordNet, against the typical deep model trained on the ImageNet 1,000-category vocabulary set.

Figure 1: The illustration of the proposed model. The deep network is trained not only by the label supervision with loss L, but also by the minimization of the discrepancy between the affinity representation Ψ(X; W) and its low-rank approximation Ψ(X; W∗). Note that a traditional CNN model only follows the flowchart of the top green part, without the feature relevance feedback indicated in the bottom red part. Details are in Sec. 3. [Best viewed in color]

1. Introduction

Recently, deep learning has achieved accuracy comparable to human beings in image categorization tasks on a limited vocabulary [10]. However, this result is far from many real-world applications, such as image tagging, where we often need tens of thousands of tags to describe the varied image content [5, 8]. One of the major challenges is to acquire sufficient and high-quality training data for a large vocabulary, which is often too expensive to obtain. For example, it took more than 25,000 AMT¹ workers about one year to construct the entire ImageNet dataset [6] (about 22,000 categories and 14.2 million images). Despite its wide adoption in research communities, ImageNet is still a small subset of the nouns in WordNet. There are huge numbers of categories left unlabeled, making the existing deep learning models hard to scale up. Therefore, how to scale deep learning approaches to large and arbitrary categories without enormous human cost is a challenging yet urgent problem.

With the success of commercial image search engines, learning from the Web has proven to be one of the most effective ways to collect massive training data [4, 9, 22].

* This work was performed when Jianlong Fu and Yue Wu were visiting Microsoft Research as research interns.
¹ https://www.mturk.com/mturk/welcome
Despite the convenience of training models from Web images, performance degradation is inevitable due to the noise in Web image search results. A conventional deep learning network is sensitive to noisy training images, as it tries to fit all the training data without distinguishing the authenticity of their labels. According to our experiments, when 30% of the training images are mislabeled, the accuracy of a conventional deep network drops by at least 20% on the CIFAR-10 dataset. Therefore, designing a noise-robust deep network is imperative to attenuate the influence of the noise in Web images.

Although previous works have studied how to perform weakly-supervised object recognition or localization when accurate image-level labels are provided [19, 24], how to suppress the effect of image-level noise has not been fully explored yet. In this paper, we propose a robust weakly-supervised deep learning network trained with noisy Web data for image tagging. As Web data is readily available, the proposed approach can scale to arbitrary and unlabeled categories without heavy human effort. To achieve this goal, we first embed the feature map of the last deep layer into a new affinity representation that essentially explores the similarities among the deep features of training samples. Second, by adopting the "few and different" assumption about the noise, we minimize the discrepancy between the affinity representation and its low-rank approximation. Third, this discrepancy is further transformed into the objective function to give those "few and different" noisy samples low authority in training.

The advantages of the proposed method are threefold. First, besides the label supervision, we utilize the mutual relationship of features as feedback in our formulation. In this way, the learning process is mainly driven by the dominant correct samples. To the best of our knowledge, this idea has not been exploited by previous deep learning works. Second, we conduct image tagging with the largest vocabulary set of about 63,000 tags from WordNet, and achieve a significant improvement over the typical deep learning model trained on the ImageNet 1,000-category vocabulary set. Third, our improvement is network-independent, so that with the help of our model, any existing deep learning network can be readily extended to unlabeled categories. An illustration of the proposed model is shown in Fig. 1.

The rest of the paper is organized as follows. Sec. 2 reviews related works. In Sec. 3, we introduce the proposed approach and implementation details. The performance is evaluated in Sec. 4. Sec. 5 concludes this paper.

2. Related Work

There are two schemes to handle data noise in deep learning. One aims to remove the noisy data before training by preprocessing. The other is designed to make the deep network itself robust to noise.

The preprocessing methods can be implemented either by conventional outlier detection or by the pre-training strategy in deep learning. First, specific methods in outlier detection include PCA, Robust PCA [3], Robust Kernel PCA [25], probabilistic modeling [11], and one-class SVM [14]. These methods regard the outliers as those "few and different" samples. However, the challenge for these methods is to distinguish "hard samples" from the truly noisy samples. Second, recovering clean training samples by a layer-wise autoencoder or denoising autoencoder [23] in pre-training, and then initializing a deep network with the pre-trained model parameters, is an effective method to remove global noise, which has been used in face parsing [15]. However, these methods are mainly designed for cases where noise is contained in correct images (e.g., background noise), while noise in Web images often takes the form of mislabeled images.

To train a robust deep learning model on noisy training data, J. Larsen et al. proposed one of the pioneering works, which added noise modeling into the neural network [13]. However, they make a symmetric label-noise assumption, which is often not true in real applications. V. Mnih et al. proposed to label aerial images from noisy data, where only binary classification was considered [17]. The work most related to ours was proposed by S. Sukhbaatar et al., who introduced an extra noise layer as a part of the training process in multi-class image classification [21]. They first trained a base model on noisy training data for several iterations, then activated the extra noise layer to absorb the noise from the learned base model.

Compared with previous works, we propose a holistic noise-robust model that handles noisy samples softly by limiting their contributions in the learning process according to their affinity to other samples. Besides, the algorithm affects the whole back propagation, rather than simply relying on a certain layer.

3. Weakly-Supervised Deep Learning Model

Our goal is to design a noise-robust deep learning algorithm. We use the convolutional neural network (CNN) [12] for its state-of-the-art performance in image categorization. We will first analyze its limitation on noisy training samples and then propose the weakly-supervised deep learning model.

3.1. Traditional CNN Model

The first several layers of the traditional CNN model are convolutional and the remaining layers are fully-connected. The exact number of layers generally depends on the specific task. The output of the last fully-connected layer is fed into a softmax classifier which generates a distribution over the final category labels. Let
X = [x_1, ..., x_N] be the matrix of training data, where x_i is the feature vector of the i-th image and N is the number of images. Denote Y = [y_1, ..., y_N]^T ∈ {0, 1}^{N×K}, where y_i ∈ {0, 1}^{K×1} is the cluster indicator vector for x_i and K is the number of categories. There are M layers in total and W = {W^{(1)}, ..., W^{(M)}} are the model parameters. In each layer, we absorb the bias term into the weights and denote them as a whole: W^{(m)} = [w_1^{(m)}, ..., w_{d_m}^{(m)}]^T ∈ R^{d_m × d_{m−1}}, where w_i^{(m)} ∈ R^{d_{m−1}} and d_{m−1} is the dimension of the (m−1)-th feature map. Z^{(m)}(X) = [z^{(m)}(x_1), ..., z^{(m)}(x_N)]^T ∈ R^{N×d_m} is the feature map produced by the m-th layer.

The goal is to minimize the following objective function in the form of a softmax regression with weight decay:

    L(W; X, Y) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{K} \big[ \mathbb{1}_{Y_{ij}}(j) \log p(Y_{ij} = 1 \mid x_i; W) \big] + \frac{\beta}{2} \|W\|_F,    (1)

where Y_{ij} is the (i, j)-th entry of Y, \mathbb{1}_{Y_{ij}}(j) is the indicator function such that \mathbb{1}_{Y_{ij}}(j) = 1 if Y_{ij} = 1 and zero otherwise, and β is the coefficient of weight decay. We can see that the derivative with respect to w_j^{(M)} in the output layer is:

    \frac{\partial L(W; X, Y)}{\partial w_j^{(M)}} = -\frac{1}{N} \sum_{i=1}^{N} z^{(M-1)}(x_i) \big[ \mathbb{1}_{Y_{ij}}(j) - p(Y_{ij} = 1 \mid z^{(M-1)}(x_i); w_j^{(M)}) \big] + \beta w_j^{(M)}.    (2)

Parameters in other layers can be calculated by the back propagation (BP) algorithm [20].

According to the gradients, we can see that if the training data contains noise, the indicator function \mathbb{1}_{Y_{ij}}(j) will produce a wrong value, resulting in a wrong optimization direction or even making the optimization diverge. The reason is that traditional models completely trust the label of each image, and all images are treated equally. As a result, the model will suffer from low accuracy if it is trained on noisy Web images.
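For concreteness, the following NumPy sketch evaluates the loss of Eqn. (1) and the output-layer gradient of Eqn. (2). It is an illustrative reimplementation, assuming one-hot labels and the squared Frobenius norm in the decay term, not the authors' code.

```python
import numpy as np

def softmax(logits):
    # Row-wise softmax with max-subtraction for numerical stability.
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def loss_and_grad(Z_prev, Y, W_out, beta):
    """Eqn. (1) and Eqn. (2) at the output layer.
    Z_prev: (N, d_{M-1}) features z^{(M-1)}(x_i); Y: (N, K) one-hot labels;
    W_out: (K, d_{M-1}) output weights W^{(M)} with the bias absorbed."""
    N = Z_prev.shape[0]
    P = softmax(Z_prev @ W_out.T)                    # p(Y_ij = 1 | x_i; W)
    loss = -(Y * np.log(P + 1e-12)).sum() / N \
           + 0.5 * beta * np.sum(W_out ** 2)         # assumes ||W||_F^2 decay
    grad = (P - Y).T @ Z_prev / N + beta * W_out     # rows: d/dw_j^{(M)}, Eqn. (2)
    return loss, grad
```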
3.2. CNN Model with Feature Relevance Feedback

The proposed model is based on the basic assumption that features of correct samples in a category tend to be similar to each other, while there is a large variance in the representations of the noisy samples. As a result, the relationship among features can be utilized as feedback to make different samples contribute differently, achieving better accuracy.

Specifically, we transform the sample features in the output layer into a new affinity representation that embeds the mutual relationship of sample features. We model this relationship as a nearest-neighbor system as in [1]. We define a similarity metric S ∈ R^{N×N} as follows:

    S_{ij} = \begin{cases} \exp\big\{ -\frac{\| z^{(M)}(x_i) - z^{(M)}(x_j) \|^2}{\gamma^2} \big\} & y_i = y_j \\ 0 & \text{otherwise}, \end{cases}    (3)

where γ is a scale factor. To better reflect the local structure, the similarity metric is normalized with a diagonal matrix D, where D_{ii} = \sum_{j=1}^{N} S_{ij}. We define Ψ(X, W) = [ψ(x_1), ..., ψ(x_N)] = D^{-1}S as the new feature representation. Each column of the matrix Ψ(X, W) embeds the relationship of an image x_i to other images.
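As a concrete reference for this construction, a minimal dense NumPy sketch follows; it assumes the squared Euclidean distance inside the exponential, and for large N one would restrict S to nearest neighbors as in [1].

```python
import numpy as np

def affinity_representation(Z, y, gamma):
    """Psi(X, W) = D^{-1} S built from Eqn. (3).
    Z: (N, d_M) last-layer features z^{(M)}(x_i); y: (N,) integer labels;
    gamma: scale factor of the similarity."""
    sq_dist = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    S = np.exp(-sq_dist / gamma ** 2)
    S = S * (y[:, None] == y[None, :])   # S_ij = 0 whenever y_i != y_j
    d = S.sum(axis=1)                    # D_ii = sum_j S_ij
    Psi = S / d[:, None]                 # D^{-1} S: row-normalized similarities
    return S, Psi
```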
Given the input images X, we assume that the ideal model parameters are W∗. A noise-robust learning algorithm should optimize W to approach W∗ as closely as possible. This objective can be achieved by minimizing the difference E_n between the learned features Ψ(X, W) and Ψ(X, W∗), where E_n is the error in the feature representation caused by the noisy images. In other words, we can regard Ψ(X, W) as the ideal feature map plus an additive error E_n:

    \Psi(X; W) = \Psi(X; W^*) + E_n.    (4)

According to Eqn. (4) and low-rank representation theory [3], we consider Ψ(X; W∗) to be a low-rank matrix, and we have:

    \mathrm{rank}(\Psi(X; W)) > \mathrm{rank}(\Psi(X; W^*)).    (5)

When the vocabulary size is large enough, the categories are fine-grained and images in each category are very similar to each other. Besides, the noise in one category actually consists of images from other categories with wrong labels. Consequently, we can assume that all the features present at most K types of patterns and that the rank of Ψ(X; W∗) equals the category number K. As a result, Ψ(X; W∗) can be calculated by the following optimization problem:

    \min_{\Psi(X; W^*)} \|\Psi(X; W) - \Psi(X; W^*)\|_F, \quad \text{s.t.} \ \mathrm{rank}(\Psi(X; W^*)) = K.    (6)

Since the labels are noisy, we should use the obtained ideal feature map to reduce the noise effect in the learning process. We use the ideal feature map as input to generate the ideal prediction over the category labels by the softmax function. In this way, we make the prediction as accurate as possible and thus reduce the risk that errors of the network are reinforced in each iteration. However, we find that this scheme greatly increases the time cost of the optimization, because it involves the additional computational burden of Eqn. (6). Instead of this step-by-step method, in the following we propose an alternative solution that essentially calculates the ideal feature map and generates the ideal prediction over category labels at the same time. The proposed algorithm is based on the following proposition:
Proposition 1. Let L = D − S and let H∗ ∈ R^{N×K} be comprised of the eigenvectors of the largest K eigenvalues of Ψ(X, W). Then: 1) the solution of Eqn. (6), i.e., the best rank-K approximation of Ψ(X, W), is uniquely determined by the eigenvectors H∗; 2) H∗ is also the solution of the following optimization problem:

    \min_H \ \mathrm{tr}[H^T L H] \quad \text{s.t.} \ H^T H = I.    (7)

[Figure 2: curves for d_{M−1} = 1000, 2000, 4000; caption not recovered in this copy.]

Based on Proposition 1, we integrate the trace term of Eqn. (7) into the objective of Eqn. (1) and obtain the new objective function:

    \tilde{L}(W; X, Y) = \min_W \ L(W; X, Y) + \alpha\, \mathrm{tr}[H^T L H].    (8)
Since the label matrix Y is given, H can be calculated by minimizing the gap between the subspaces spanned by H and Y [26, 27], i.e., \min_H \|HH^T - YY^T\|_F^2. To satisfy the orthogonality constraint, Y is further scaled to Y(Y^T Y)^{-1/2}. Although the solution to the above problem is not unique, H = Y(Y^T Y)^{-1/2} is a feasible one. It avoids the heavy computational cost of solving the eigen-decomposition problem in Eqn. (7). Besides, we find that this approximation makes the network training efficient and robust.
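Since Y^T Y is a diagonal matrix of per-category counts when Y is one-hot, the feasible solution is cheap to compute. Below is a small illustrative sketch of H = Y(Y^T Y)^{-1/2}, of the trace term tr[H^T L H], and of the eigenvector-based H∗ of Proposition 1 (symmetrizing Ψ before the eigen-decomposition is our assumption):

```python
import numpy as np

def scaled_label_subspace(Y):
    """H = Y (Y^T Y)^{-1/2} for a one-hot label matrix Y of shape (N, K).
    The columns of H are orthonormal, i.e., H^T H = I."""
    counts = Y.sum(axis=0)               # number of images per category
    return Y / np.sqrt(counts)[None, :]

def trace_regularizer(H, S):
    """tr[H^T L H] with the graph Laplacian L = D - S of the similarity S."""
    L = np.diag(S.sum(axis=1)) - S
    return float(np.trace(H.T @ L @ H))

def best_rank_k_eigvecs(Psi, K):
    """H*: eigenvectors of the largest K eigenvalues (Proposition 1).
    Psi = D^{-1} S is not symmetric in general, so a symmetrized variant
    is assumed here for a well-defined eigen-decomposition."""
    M = (Psi + Psi.T) / 2.0
    w, V = np.linalg.eigh(M)
    return V[:, np.argsort(w)[::-1][:K]]
```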
3.3. Analysis of Relevance Feedback

We analyze the relevance feedback from the gradient perspective to show the noise-resistance ability of the proposed objective function. Based on the definition of the similarity metric S in Eqn. (3), the mutual relationship of features [...] for each element in W^{(M)} ∈ R^{K×d_{M−1}}, we have:

    \frac{\partial\, \mathrm{tr}[H^T L H]}{\partial W_{kd}^{(M)}} = \mathrm{tr}\Big[ H H^T \frac{\partial L}{\partial W_{kd}^{(M)}} \Big] = \mathrm{tr}\Big[ H H^T \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \frac{\partial S_{ij}}{\partial W_{kd}^{(M)}}\, u_{ij} (u_{ij})^T \Big]
    = \mathrm{tr}\Big[ H H^T \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} -W_{kd}^{(M)} C_{ijk} \frac{[\Delta_d(x_i, x_j)]^2}{\gamma^2}\, u_{ij} (u_{ij})^T \Big] = \sum_{i=1}^{N} \sum_{j=1}^{N} \xi_{ij}\, g(\Delta_d(x_j, x_i)),    (10)

where C_{ijk} = \exp\big\{ -\sum_{d=1}^{d_{M-1}} (W_{kd}^{(M)})^2 [\Delta_d(x_i, x_j)]^2 / (2\gamma^2) \big\}, ξ_{ij} is the (i, j)-th entry of HH^T, g(\Delta_d(x_i, x_j)) represents \partial L / \partial W_{kd}^{(M)}, and g(\Delta_d(x_i, x_j)) \propto [\Delta_d(x_i, x_j)]^2 \exp\big\{ -\sum_{d=1}^{d_{M-1}} [\Delta_d(x_i, x_j)]^2 \big\}.
Algorithm 1 Weakly-Supervised CNN Model
Rectified linear activation function: f(·)
Procedure:
Repeat:
    Forward propagation: implemented as in the traditional CNN.
    Backward propagation:
    1. For m = M, calculate
       \frac{\partial \tilde{L}}{\partial W_{kd}^{(M)}} = \frac{\partial L}{\partial W_{kd}^{(M)}} + \alpha \sum_{i=1}^{N} \sum_{j=1}^{N} \xi_{ij}\, g(\Delta_d(x_j, x_i))
       \delta_k^{(M)} = -\frac{\partial \tilde{L}}{\partial z_k^{(M)}}
    2. For m = M − 1 down to m = 2, set
       \frac{\partial \tilde{L}}{\partial W^{(m)}} = \delta^{(m+1)} (f(Z^{(m)}))^T
       \delta^{(m)} = [(W^{(m)})^T \delta^{(m+1)}] \cdot f'(Z^{(m)})
Until the maximum iteration number is reached
Output: W = [W^{(1)}, ..., W^{(M)}]

Figure 3: The performance with different α (classification accuracy vs. number of iterations; curves for α = 0, 0.01, 0.05, 0.1, and 10). The performance stays in a stable range, except for the one overly large value (α = 10).
Note that this suppression will be back-propagated to the first several layers by the "error term" δ^{(M)} defined in BP [20]; thereby, the contribution of the noisy samples will be limited in each layer. The complete algorithm of our weakly-supervised CNN model is given in Algorithm 1.
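Assembling the earlier sketches, the objective L̃ of Eqn. (8) can be evaluated as below; in Algorithm 1 its output-layer gradient is the Eqn. (2) term plus the α-weighted feedback term of Eqn. (10), which an autodiff framework would derive from this same expression. This is a schematic reading of the algorithm, not the authors' implementation; in particular, taking z^{(M)} = W^{(M)} z^{(M-1)} is an assumption.

```python
import numpy as np

def objective_with_feedback(Z_prev, Y, W_out, beta, alpha, gamma):
    """L-tilde of Eqn. (8): the label loss of Eqn. (1) plus
    alpha * tr[H^T L H] built from the output-layer features.
    Reuses loss_and_grad, affinity_representation, scaled_label_subspace
    and trace_regularizer from the sketches above."""
    label_loss, _ = loss_and_grad(Z_prev, Y, W_out, beta)
    Z_out = Z_prev @ W_out.T                         # z^{(M)}(x_i), assumed linear
    S, _ = affinity_representation(Z_out, Y.argmax(axis=1), gamma)
    H = scaled_label_subspace(Y)
    return label_loss + alpha * trace_regularizer(H, S)
```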
4. Experiments

Our experiments consist of two parts. First, we show the noise-robust performance of the proposed approach on image categorization tasks. Second, we show the vocabulary-free tagging performance with the vocabulary of personal photos and WordNet.

4.1. Image Categorization

Datasets: We conducted experiments on two widely-used datasets in image categorization. One is CIFAR-10, which consists of 60,000 32×32 color images of 10 classes; 50,000 images are for training and 10,000 for testing. To generate noisy training data with different percentages in CIFAR-10, a certain percentage of the training images in each category were randomly replaced by training images from other categories. The total number of images in a category remained unchanged. We set the percentage of noisy data from 10% to 90%.
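A minimal sketch of this noise-injection protocol follows; it reflects our reading of the description, and details such as sampling without replacement are assumptions.

```python
import numpy as np

def make_noisy_split(X, y, noise_pct, num_classes, seed=0):
    """For each category c, randomly replace a fraction noise_pct of its
    training images with images drawn from other categories, keeping the
    label c and the per-category count unchanged."""
    rng = np.random.default_rng(seed)
    X_noisy = X.copy()
    for c in range(num_classes):
        idx_c = np.flatnonzero(y == c)
        n_swap = int(round(noise_pct * idx_c.size))
        victims = rng.choice(idx_c, size=n_swap, replace=False)
        donors = rng.choice(np.flatnonzero(y != c), size=n_swap, replace=False)
        X_noisy[victims] = X[donors]   # mislabeled: content from other classes
    return X_noisy, y                  # the labels themselves are kept
```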
The other dataset is PASCAL VOC2007, which consists of 9,963 images of 20 classes, with a split of 50% for training/validation and 50% for testing. We trained a classification model of 20 categories using Web training images, and compared with the state-of-the-art methods.

Baselines: We denote the proposed method as noise-robust CNN (NRCNN). We compared the proposed method with four baselines on CIFAR-10 and six baselines on VOC2007. The common four baselines are:
• CNN: the state-of-the-art CNN network with convolutional layers and fully-connected layers. We will specify the network structure in each task.
• RPCA+CNN: before the CNN training, we reconstruct each training sample by RPCA [3] and remove those samples with large reconstruction error. The removal ratio is set the same as the noise percentage.
• CAE+CNN: we pre-train the convolutional layers of the CNN by the convolutional autoencoder (CAE) in a layer-wise way and fine-tune the entire network, which is reported in [15] to reduce the noise effect.
• NL+CNN: we reproduce the additional bottom-up noise-adaptation layer in [21], and combine this layer with the CNN network.
We also compared with another two methods on VOC2007:
• Best VOC: pre-training using ImageNet and fine-tuning on VOC2007, which has achieved the state-of-the-art performance [18].
• Web HOG: training concept representations by the part-based model and hand-crafted features with Web training images [22], which is the most recent work on this topic.

Results: First of all, we adjusted the weight decay value of the basic CNN model, i.e., β in Eqn. (2), on the two datasets. For different noise percentages (from 10% to 90%), this value is 0.004 for 10%, 0.008 for 20%, and 0.04 for the rest. We found that the above parameters make the basic CNN model achieve its best result on both datasets. In addition, we empirically set γ to 0.1 in Eqn. (3) so that the similarity values are on an appropriate scale. Besides, there is only one adjustable parameter α in our model. Fig. 3 shows the effect of α on the classification
accuracy on the CIFAR-10 training data with 20% noise. We found that only when α is too large (e.g., 10) does the model lose its classification ability, with the accuracy remaining at random values. For other values, the performance maintains a stable range.

[Figure: "declines in accuracy" curves for CNN, RPCA+CNN, CAE+CNN, NL+CNN, and NRCNN (our method); caption not recovered in this copy.]
Table 1: Accuracy of image classification on CIFAR-10 with clean training data and with training data of different noise percentages.

Method                 clean   10%    20%    30%    40%    50%    60%    70%    80%    90%
CNN 81.24 77.79 71.97 65.09 55.65 45.60 36.65 25.02 19.46 17.55
RPCA[3]+CNN 81.24 77.94 72.44 65.94 57.82 45.77 36.55 23.68 17.85 15.49
CAE[16]+CNN 81.55 78.54 73.19 67.69 60.83 52.71 44.71 34.39 27.54 18.61
NL+CNN[21] 81.16 78.28 73.36 68.26 61.63 55.83 47.33 37.12 30.81 19.49
NRCNN(our method) 81.60 79.39 76.21 72.81 68.79 63.01 54.78 45.48 35.43 20.56
Table 2: Average precision per class on the VOC2007 test set. The words in brackets indicate the training source: "Web" uses the same number of positive/negative Web training images as the standard VOC2007 setting; "Web×4" increases the number of positive images to four times that of "Web."
Method \ Class   plane  bike  bird  boat  btl  bus  car  cat  chr  cow  tab  dog  horse  moto  pers  plnt  shp  sfa  train  tv   mAP
Best VOC [18] 88.5 81.5 87.9 82.0 47.5 75.5 90.1 87.2 61.6 75.7 67.3 85.5 83.5 80.0 95.6 60.8 76.8 58.0 90.4 77.9 77.7
Web HOG [22] 68.5 48.2 47.3 55.7 40.0 56.3 60.1 64.1 43.6 59.2 32.9 46.5 56.2 62.4 41.3 29.6 41.4 35.6 68.9 35.5 49.6
CNN(Web) 84.1 68.8 77.1 73.0 63.0 74.2 74.3 79.2 61.8 73.8 48.9 79.5 81.0 82.1 48.4 57.9 72.0 31.6 83.4 64.7 68.9
CNN(Web×4) 85.4 69.4 77.1 74.5 63.7 74.7 75.0 81.6 62.3 75.7 53.3 80.2 83.8 84.6 50.7 58.9 75.9 41.0 84.5 69.1 71.1
NRCNN(Web) 85.8 69.7 77.4 75.1 63.8 75.8 75.6 82.7 62.7 76.9 53.5 80.6 84.7 84.9 49.2 59.1 76.0 50.8 84.8 69.2 71.9
NRCNN(Web×4) 91.3 75.2 83.3 81.5 70.2 81.3 80.6 88.3 67.0 82.5 60.0 86.3 90.0 90.3 75.8 64.8 81.0 57.8 89.9 74.9 78.6
[...] with registration time of more than two years) in Flickr, we found that 50 categories, e.g., "sunset," "sightseeing," and "birthday," cannot be found even in the category list of ImageNet. For all 200 categories, we can only use the ImageNet dataset to train a CNN model on the 150 existing categories, with 1,000 clean ImageNet training images for each category. We denote this method as CNN (ImageNet). To train the complete 200 categories, we crawled 1,000 images from a commercial image search engine for each category, removed duplicate images, and trained deep learning models. Note that all methods were conducted with their best parameters, respectively. Besides, an alternative way to predict new categories is zero-shot learning. We therefore implemented DeViSE [7] as an additional baseline, which is trained on the 150 existing categories like CNN (ImageNet) and tested on the complete 200 categories by semantic extension.

Figure 5: Tagging results produced by the proposed method: (a) family, dining room, party, people, night; (b) young girl, birthday, table, candle, donut; (c) ocean, sunset, cliff, mountain, beach; (d) sightseeing, bridge, sky, stadium, lake. Note that the underlined tags are missing from the ImageNet categories, but are important for personal photos.

[...] position of p in the ranked list is defined by: [formula not recovered in this copy]
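The NDCG definition is truncated in this copy of the paper. For reference, one standard form of NDCG@K over graded tag relevances is sketched below; the paper's exact gain and discount choices are not recoverable here.

```python
import numpy as np

def ndcg_at_k(relevances, k):
    """NDCG@K for graded relevance scores listed in predicted rank order."""
    rel = np.asarray(relevances, dtype=float)
    gains = 2.0 ** rel - 1.0
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))   # ranks 1..n
    dcg = (gains[:k] * discounts[:k]).sum()
    ideal = np.sort(gains)[::-1]                            # best possible order
    idcg = (ideal[:k] * discounts[:k]).sum()
    return float(dcg / idcg) if idcg > 0 else 0.0
```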
Table 3: Tagging performance in terms of NDCG for the 1,000 testing photos in MIT-Adobe FiveK dataset.
Metric    CNN (Web)   RPCA+CNN (Web)   CAE+CNN (Web)   NL+CNN (Web)   CNN (ImageNet)   DeViSE (ImageNet) [7]   NRCNN (Web)
NDCG@1 0.08 0.23 0.11 0.24 0.20 0.28 0.32
NDCG@3 0.18 0.32 0.25 0.33 0.29 0.36 0.41
NDCG@5 0.26 0.39 0.34 0.41 0.39 0.43 0.46
Table 4: The tagging performance in terms of Similarity@K trained by different models and different vocabulary sets.
ImageNet-1K is the vocabulary set of 1,000 categories in ImageNet competition. WordNet-63K is the largest vocabulary set
used in this paper.
References

[1] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, pages 585–591, 2001.
[2] V. Bychkovsky, S. Paris, E. Chan, and F. Durand. Learning photographic global tonal adjustment with a database of input/output image pairs. In CVPR, 2011.
[3] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58(3):11:1–11:37, 2011.
[4] X. Chen, A. Shrivastava, and A. Gupta. NEIL: Extracting visual knowledge from web data. In ICCV, 2013.
[5] J. Deng, A. C. Berg, K. Li, and L. Fei-Fei. What does classifying more than 10,000 image categories tell us? In ECCV, pages 71–84, 2010.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[7] A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.
[8] J. Fu, T. Mei, K. Yang, H. Lu, and Y. Rui. Tagging personal photos with transfer deep learning. In WWW, pages 344–354, 2015.
[9] J. Fu, J. Wang, Y. Rui, X.-J. Wang, T. Mei, and H. Lu. Image tag refinement with view-dependent concept representations. IEEE Transactions on CSVT, 25:1409–1422, 2015.
[10] B. Graham. Spatially-sparse convolutional neural networks. arXiv preprint arXiv:1409.6070, 2014.
[11] J. Kim and C. D. Scott. Robust kernel density estimation. JMLR, pages 2529–2565, 2012.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
[13] J. Larsen, L. N. Andersen, M. Hintz-Madsen, and L. K. Hansen. Design of robust neural network classifiers. In ICASSP, pages 1205–1208, 1998.
[14] W. Liu, G. Hua, and J. R. Smith. Unsupervised one-class learning for automatic outlier removal. In CVPR, 2014.
[15] P. Luo, X. Wang, and X. Tang. Hierarchical face parsing via deep learning. In CVPR, pages 2480–2487, 2012.
[16] J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. In ICANN, pages 52–59, 2011.
[17] V. Mnih and G. E. Hinton. Learning to label aerial images from noisy data. In ICML, 2012.
[18] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, 2014.
[19] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In CVPR, pages 685–694, 2015.
[20] D. Rumelhart, G. Hinton, and R. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
[21] S. Sukhbaatar and R. Fergus. Learning from noisy labels with deep neural networks. arXiv preprint arXiv:1406.2080, 2014.
[22] S. K. Divvala, A. Farhadi, and C. Guestrin. Learning everything about anything: Webly-supervised visual concept learning. In CVPR, 2014.
[23] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11:3371–3408, 2010.
[24] J. Wu, Y. Yu, C. Huang, and K. Yu. Deep multiple instance learning for image classification and auto-annotation. In CVPR, pages 3460–3469, 2015.
[25] H. Xu, C. Caramanis, and S. Mannor. Outlier-robust PCA: The high-dimensional case. IEEE Transactions on Information Theory, 59(1):546–572, 2013.
[26] Y. Yang, H. T. Shen, Z. Ma, Z. Huang, and X. Zhou. ℓ2,1-norm regularized discriminative feature selection for unsupervised learning. In IJCAI, pages 1589–1594, 2011.
[27] J. Ye, Z. Zhao, and M. Wu. Discriminative k-means for clustering. In NIPS, 2007.