Relaxing from Vocabulary: Robust Weakly-Supervised Deep Learning for Vocabulary-Free Image Tagging∗

Jianlong Fu¹,², Yue Wu³, Tao Mei², Jinqiao Wang¹, Hanqing Lu¹ and Yong Rui²
¹National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
²Microsoft Research, Beijing, China
³University of Science and Technology of China, Hefei, China
²{jianf, tmei, yongrui}@microsoft.com, ¹{jqwang, luhq}@nlpr.ia.ac.cn, ³[email protected]

Abstract

The development of deep learning has empowered machines with a capability of recognizing limited image categories that is comparable to human beings. However, most existing approaches heavily rely on human-curated training data, which hinders the scalability to large and unlabeled vocabularies in image tagging. In this paper, we propose a weakly-supervised deep learning model which can be trained from readily available Web images to relax the dependence on human labor and scale up to arbitrary tags (categories). Specifically, based on the assumption that features of true samples in a category tend to be similar while noises tend to be variant, we embed the feature map of the last deep layer into a new affinity representation, and further minimize the discrepancy between the affinity representation and its low-rank approximation. The discrepancy is finally transformed into the objective function to give relevance feedback to back propagation. Experiments show that we can achieve a performance gain of 14.0% in terms of a semantic-based relevance metric in image tagging with 63,043 tags from WordNet, against the typical deep model trained on the ImageNet 1,000 vocabulary set.

[Figure 1: The illustration of the proposed model. The deep network is trained not only by the label supervision with loss L, but also by the minimization of the discrepancy between the affinity representation Ψ(X; W) and its low-rank approximation Ψ(X; W*). Note that a traditional CNN model only follows the flowchart of the top green part, without the feature relevance feedback indicated in the bottom red part. Details are in Sec. 3. Best viewed in color.]

1. Introduction

Recently, deep learning has achieved accuracy comparable to that of human beings in image categorization tasks on a limited vocabulary [10]. However, this result is far from meeting the needs of many real-world applications, such as image tagging, where we often need tens of thousands of tags to describe the varied image content [5, 8]. One of the major challenges is to acquire sufficient and high-quality training data for a large vocabulary, which is often too expensive to obtain. For example, it took more than 25,000 AMT¹ workers about one year to construct the entire ImageNet dataset [6] (about 22,000 categories and 14.2 million images). Despite its wide adoption in research communities, ImageNet still covers only a small subset of the nouns in WordNet². Huge numbers of categories are left unlabeled, making the existing deep learning models hard to scale up. Therefore, how to scale deep learning approaches to large and arbitrary categories without enormous human cost is a challenging yet urgent problem.

With the success of commercial image search engines, learning from the Web has been demonstrated to be one of the most effective solutions for collecting massive training data [4, 9, 22].

∗This work was performed when Jianlong Fu and Yue Wu were visiting Microsoft Research as research interns. The first two authors contributed equally to this work.
¹https://www.mturk.com/mturk/welcome
²http://wordnet.princeton.edu/
Despite the convenience of using Web images to train models, performance degradation is inevitable due to the noises in Web image search results. A conventional deep learning network is sensitive to noisy training images, as it tries to fit all the training data without distinguishing the authenticity of their labels. According to our experiments, when 30% of the training images are mislabeled, the accuracy of a conventional deep network drops by at least 20% on the CIFAR-10 dataset. Therefore, designing a noise-robust deep network is imperative to attenuate the influence of the noises in Web images.

Although previous works have studied how to perform weakly-supervised object recognition or localization when accurate image-level labels are provided [19, 24], how to suppress the image-level noise effect has not been fully explored yet. In this paper, we propose a robust weakly-supervised deep learning network trained with noisy Web data for image tagging. As the Web data is readily available, the proposed approach can scale to arbitrary and unlabeled categories without heavy human effort. To achieve this goal, we first embed the feature map of the last deep layer into a new affinity representation that essentially explores the similarities among the deep features of training samples. Second, by adopting the "few and different" assumption about the noises, we minimize the discrepancy between the affinity representation and its low-rank approximation. Third, this discrepancy is further transformed into the objective function to give those "few and different" noisy samples low authority in training.

The advantages of the proposed method are threefold. First, beyond the label supervision, we utilize the mutual relationship of features as feedback in our formulation. In this way, the learning process is mainly driven by the dominant correct samples. To the best of our knowledge, this idea has not been exploited by previous deep learning works. Second, we conduct image tagging with the largest vocabulary set of about 63,000 tags from WordNet, and achieve a significant improvement against the typical deep learning model trained on the ImageNet 1,000 vocabulary set. Third, our improvement is network-independent, so that with the help of our model, any existing deep learning network can be readily extended to unlabeled categories. An illustration of the proposed model is shown in Fig. 1.

The rest of the paper is organized as follows. Sec. 2 reviews related works. In Sec. 3, we introduce the proposed approach and implementation details. The performance is evaluated in Sec. 4. Sec. 5 concludes this paper.

2. Related Work

There are two schemes to handle data noises in deep learning. One aims to remove the noisy data before training by preprocessing. The other is designed to make the deep network itself robust to noises.

The preprocessing methods can be implemented either by conventional outlier detection or by the pre-training strategy in deep learning. First, specific methods in outlier detection include PCA, Robust PCA [3], Robust Kernel PCA [25], probabilistic modeling [11] and one-class SVM [14]. These methods regard the outliers as the "few and different" samples. However, the challenge for these methods is to distinguish "hard samples" from the truly noisy samples. Second, recovering clean training samples by a layer-wise autoencoder or denoising autoencoder [23] in pre-training, and then initializing a deep network with the pre-trained model parameters, is an effective method to remove global noises, which has been used in face parsing [15]. However, these methods are mainly designed for cases where noises are contained within correct images (e.g., background noises), while noises in Web images are often mislabeled images.

To train a robust deep learning model on noisy training data, J. Larsen et al. proposed one of the pioneering works, which added noise modeling into neural networks [13]. However, they make a symmetric label noise assumption, which is often not true in real applications. V. Mnih et al. proposed to label aerial images from noisy data where only binary classification was considered [17]. The work most related to ours was proposed by S. Sukhbaatar et al., who introduced an extra noise layer as a part of the training process in multi-class image classification [21]. They first trained a base model on noisy training data for several iterations, then activated the extra noise layer to absorb the noise from the learned base model.

Compared with previous works, we propose a holistic noise-robust model that handles noisy samples softly by limiting their contributions in the learning process according to their affinity to other samples. Besides, the algorithm affects the whole back propagation, rather than simply relying on a certain layer.

3. Weakly-Supervised Deep Learning Model

Our goal is to design a noise-robust deep learning algorithm. We use the convolutional neural network (CNN) [12] for its state-of-the-art performance in image categorization. We will first analyze its limitation on noisy training samples and then propose the weakly-supervised deep learning model.

3.1. Traditional CNN Model

The first several layers of the traditional CNN model are convolutional and the remaining layers are fully-connected. The exact number of layers generally depends on the specific task. The output of the last fully-connected layer is fed to a softmax classifier which generates a distribution over the final category labels. Let
X = [x_1, ..., x_N] be the matrix of training data, where x_i is the feature vector of the ith image and N is the number of images. Denote Y = [y_1, ..., y_N]ᵀ ∈ {0, 1}^{N×K}, where y_i ∈ {0, 1}^{K×1} is the cluster indicator vector for x_i and K is the number of categories. There are M layers in total and W = {W^(1), ..., W^(M)} are the model parameters. In each layer, we absorb the bias term into the weights and denote them as a whole: W^(m) = [w_1^(m), ..., w_{d_m}^(m)]ᵀ ∈ R^{d_m×d_{m−1}}, where w_i^(m) ∈ R^{d_{m−1}} and d_{m−1} is the dimension of the (m−1)th feature map. Z^(m)(X) = [z^(m)(x_1), ..., z^(m)(x_N)]ᵀ ∈ R^{N×d_m} is the feature map produced by the mth layer.

The goal is to minimize the following objective function in the form of a softmax regression with weight decay:

L(W; X, Y) = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{K} 1_{Y_ij}(j) log p(Y_ij = 1 | x_i; W) + (β/2) ||W||_F²,   (1)

where Y_ij is the (i, j)th entry of Y, 1_{Y_ij}(j) is the indicator function such that 1_{Y_ij}(j) = 1 if Y_ij = 1 and zero otherwise, and β is the coefficient of weight decay. We can see that the derivative with respect to w_j^(M) in the output layer is:

∂L(W; X, Y)/∂w_j^(M) = −(1/N) Σ_{i=1}^{N} z^(M−1)(x_i) [1_{Y_ij}(j) − p(Y_ij = 1 | z^(M−1)(x_i); w_j^(M))] + β w_j^(M).   (2)

Parameters in the other layers can be calculated by the back propagation (BP) algorithm [20].
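For concreteness, the following is a minimal NumPy sketch of Eqns. (1)–(2); the function and variable names (`softmax_loss_and_grad`, `Z`, `Y`, `W_out`, `beta`) are ours for illustration and not from the authors' implementation, and the weight decay is shown for the output layer only.

```python
import numpy as np

def softmax_loss_and_grad(Z, Y, W_out, beta):
    """Softmax regression with weight decay, as in Eqns. (1)-(2).

    Z     : (N, d) features from the (M-1)-th layer, z^(M-1)(x_i).
    Y     : (N, K) one-hot label indicator matrix.
    W_out : (K, d) output-layer weights W^(M), bias absorbed.
    beta  : weight-decay coefficient (decay on W_out only, for brevity).
    """
    N = Z.shape[0]
    logits = Z @ W_out.T                          # (N, K)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)             # p(Y_ij = 1 | x_i; W)

    # Eqn. (1): averaged negative log-likelihood + weight decay.
    loss = -np.sum(Y * np.log(P + 1e-12)) / N + 0.5 * beta * np.sum(W_out ** 2)

    # Eqn. (2): gradient w.r.t. the output-layer weights, all rows at once.
    grad = -(Y - P).T @ Z / N + beta * W_out      # (K, d)
    return loss, grad
```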
According to the gradients, we can see that if the training data has noises, the indicator function 1_{Y_ij}(j) will produce a wrong value, resulting in a wrong optimization direction or even making the optimization diverge. The reason is that traditional models completely believe the label of each image, and all images are treated equally. As a result, the model will suffer from low accuracy if it is trained on noisy Web images.

3.2. CNN Model with Feature Relevance Feedback

The proposed model is based on the basic assumption that features of correct samples in a category tend to be similar to each other, while there is a large variance in the representations of the noise samples. As a result, the relationship among features can be utilized as feedback to make different samples contribute differently, achieving better accuracy.

Specifically, we transform the sample features in the output layer into a new affinity representation that embeds the mutual relationship of sample features. We model this relationship as a nearest neighbor system as in [1]. We define a similarity metric S ∈ R^{N×N} as follows:

S_ij = exp{−||z^(M)(x_i) − z^(M)(x_j)||² / γ²} if y_i = y_j, and S_ij = 0 otherwise,   (3)

where γ is a scale factor. To better reflect the local structure, the similarity metric is normalized with a diagonal matrix D, where D_ii = Σ_{j=1}^{N} S_ij. We define Ψ(X, W) = [ψ(x_1), ..., ψ(x_N)] = D⁻¹S as the new feature representation. Each column of the matrix Ψ(X, W) embeds the relationship of an image x_i to the other images.
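A sketch of how the affinity representation of Eqn. (3) could be computed for a batch of in-memory features; the function name and the squared-norm reading of Eqn. (3) are our assumptions.

```python
import numpy as np

def affinity_representation(Z_out, labels, gamma):
    """Build the affinity representation Psi(X, W) = D^{-1} S of Eqn. (3).

    Z_out  : (N, K) output-layer features z^(M)(x_i).
    labels : (N,) integer labels y_i; S_ij is zeroed across categories.
    gamma  : scale factor in Eqn. (3).
    """
    # Pairwise squared Euclidean distances between output features.
    sq_dist = np.sum((Z_out[:, None, :] - Z_out[None, :, :]) ** 2, axis=-1)
    S = np.exp(-sq_dist / gamma ** 2)
    S *= (labels[:, None] == labels[None, :])   # keep same-label pairs only

    D = S.sum(axis=1)                           # D_ii = sum_j S_ij (>= 1)
    return S / D[:, None]                       # row-normalized: D^{-1} S
```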
Given the input images X, we assume that the ideal model parameters are W*. A noise-robust learning algorithm should optimize W to approach W* as closely as possible. This objective can be achieved by minimizing the difference E_n between the learned features Ψ(X, W) and Ψ(X, W*), where E_n is the error in the feature representation caused by the noisy images. In other words, we can regard Ψ(X, W) as the ideal feature map plus an additive error E_n:

Ψ(X; W) = Ψ(X; W*) + E_n.   (4)

According to Eqn. (4) and low-rank representation theory [3], we consider Ψ(X; W*) to be a low-rank matrix, and we have:

rank(Ψ(X; W)) > rank(Ψ(X; W*)).   (5)

When the vocabulary size is large enough, the categories are fine-grained and the images in each category are very similar to each other. Besides, the noises in one category are actually images from other categories with wrong labels. Consequently, we can assume that all the features present at most K types of patterns and that the rank of Ψ(X; W*) equals the category number K. As a result, Ψ(X; W*) can be calculated by the following optimization problem:

min_{Ψ(X;W*)} ||Ψ(X; W) − Ψ(X; W*)||_F,  s.t. rank(Ψ(X; W*)) = K.   (6)
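As a point of reference, the explicit low-rank step in Eqn. (6) can be solved directly: by the Eckart–Young theorem, truncating the singular value decomposition to the top K singular values gives the best rank-K approximation in the Frobenius norm. The following sketch is our illustration of this step, which the paper then avoids via Proposition 1 below.

```python
import numpy as np

def best_rank_k_approximation(Psi, K):
    """Best rank-K approximation of Psi(X, W) in Frobenius norm, Eqn. (6).

    Eckart-Young: the truncated SVD solves
        min ||Psi - Psi*||_F   s.t.  rank(Psi*) = K.
    """
    U, s, Vt = np.linalg.svd(Psi, full_matrices=False)
    return (U[:, :K] * s[:K]) @ Vt[:K, :]
```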
Since the labels are noisy, we should use the obtained ideal feature map to reduce the noise effect in the learning process. We use the ideal feature map as an input to generate the ideal prediction over the different category labels with the softmax function. In this way, we make the prediction as accurate as possible and thus reduce the risk that errors of the network are reinforced in each iteration. However, we find that this scheme greatly increases the time cost of the optimization, because it involves the additional computational burden of Eqn. (6). Instead of this step-by-step method, in the following we propose an alternative solution that essentially calculates the ideal feature map and generates the ideal prediction over category labels at the same time. The proposed algorithm is based on the following proposition:
Proposition 1. Let L = D − S, and let H* ∈ R^{N×K} be comprised of the eigenvectors of the largest K eigenvalues of Ψ(X, W). Then: 1) the solution of Eqn. (6), i.e., the best rank-K approximation of Ψ(X, W), is uniquely determined by the eigenvectors H*; 2) H* is also the solution of the following optimization problem:

min_H tr[Hᵀ L H]  s.t. Hᵀ H = I.   (7)

As both Eqn. (6) and Eqn. (7) achieve the optimum at H*, Eqn. (6) is equivalent to Eqn. (7).

The proof of Proposition 1 is presented in the supplementary material A1. The above proposition shows that the optimal solution of Eqn. (6) can be obtained by solving the trace minimization problem in Eqn. (7). Therefore, we combine the softmax regression of a traditional CNN with the trace optimization. The final objective function for noise-robust deep learning is designed as:

L̃(W; X, Y) = min_W L(W; X, Y) + α tr[Hᵀ L H].   (8)
Since the label matrix Y is given, H can be calculated by minimizing the gap between the subspaces spanned by H and Y [26, 27], i.e., min_H ||H Hᵀ − Y Yᵀ||_F². To satisfy the orthogonality constraint, Y is further scaled to Y(YᵀY)^{−1/2}. Although the solution to the above problem is not unique, H = Y(YᵀY)^{−1/2} is a feasible one. It avoids the heavy computational cost of solving the eigen-decomposition problem in Eqn. (7). Besides, we find that this approximation makes the network training efficient and robust.
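A sketch of this closed-form shortcut: building H = Y(YᵀY)^{−1/2} directly from the one-hot labels and evaluating the regularizer tr(HᵀLH). The helper name `trace_penalty` is ours, and we assume every category appears at least once in the batch.

```python
import numpy as np

def trace_penalty(S, Y):
    """Compute tr(H^T L H) with the closed-form H = Y (Y^T Y)^{-1/2}.

    S : (N, N) similarity matrix of Eqn. (3).
    Y : (N, K) one-hot label matrix; Y^T Y is diagonal with the
        per-category counts, so its inverse square root is cheap.
        Assumes every category has at least one sample.
    """
    counts = Y.sum(axis=0)                    # diagonal of Y^T Y
    H = Y / np.sqrt(counts)[None, :]          # Y (Y^T Y)^{-1/2}, so H^T H = I
    L = np.diag(S.sum(axis=1)) - S            # graph Laplacian L = D - S
    return np.trace(H.T @ L @ H)
```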
3.3. Analysis of Relevance Feedback

We analyze the relevance feedback from the gradient perspective to show the noise-resistance ability of the proposed objective function. Based on the definition of the similarity metric S in Eqn. (3), the mutual relationship of features is described by the discrepancy of the output-layer features. Furthermore, we define:

∆_d(x_i, x_j) = ||z_d^(M−1)(x_i) − z_d^(M−1)(x_j)||₂,   (9)

the discrepancy between two images in the dth dimension of the (M−1)th layer, where d = 1, 2, ..., d_{M−1}. Therefore, if we use a linear activation function, the discrepancy in the Mth layer (the output layer) can be represented by the accumulated products of the weights in the Mth layer and the discrepancy in each dimension of the (M−1)th layer. For clarity, we use the notation u_ij, a column vector with two nonzero elements, whose ith and jth elements equal 1 and −1, respectively. Therefore, for each element of W^(M) ∈ R^{K×d_{M−1}}, we have:

∂tr[Hᵀ L H]/∂W_kd^(M) = tr[H Hᵀ ∂L/∂W_kd^(M)]
  = Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} tr[H Hᵀ (∂S_ij/∂W_kd^(M)) u_ij u_ijᵀ]
  = Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} tr[H Hᵀ (−W_kd^(M) C_ijk [∆_d(x_i, x_j)]² / γ²) u_ij u_ijᵀ]
  = Σ_{i=1}^{N} Σ_{j=1}^{N} ξ_ij g(∆_d(x_j, x_i)),   (10)

where C_ijk = exp{−Σ_{d=1}^{d_{M−1}} (W_kd^(M))² [∆_d(x_i, x_j)]² / (2γ²)}, ξ_ij is the (i, j)th entry of H Hᵀ, g(∆_d(x_i, x_j)) denotes ∂L/∂W_kd^(M), and g(∆_d(x_i, x_j)) ∝ [∆_d(x_i, x_j)]² exp{−Σ_{d=1}^{d_{M−1}} [∆_d(x_i, x_j)]²}.

[Figure 2: The curves show the contribution of an image sample to the gradients as a function of its distance to other images, for d_{M−1} = 1000, 2000 and 4000. With increasing d_{M−1}, only the monotonically decreasing part of the curve can be reflected; hence, the larger the distance, the smaller the contribution.]

Discussions: For an image x_i, its contribution to the gradient in Eqn. (10) can be measured by Σ_{j=1}^{N} ξ_ij g(∆_d(x_i, x_j)). Obviously, this term is non-zero if and only if i ≠ j and ξ_ij ≠ 0 (i.e., ŷ_i = ŷ_j). As ξ_ij plays the role of an indicator, the magnitude of the contribution mainly depends on the value of g(∆_d(x_i, x_j)). The curve of g(∆_d(x_i, x_j)) as ∆_d(x_i, x_j) changes is shown in Fig. 2. We can observe that with increasing d_{M−1}, the extreme point moves very close to the coordinate origin, and only the monotonically decreasing part of the curve is reflected. Therefore, if x_i is a noise sample, that is, if it is very far from the other images in the same category, then ∆_d(x_i, x_j) is large and its contribution g(∆_d(x_i, x_j)) to the gradient will be small.
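The shape of g(·) can be checked numerically. The following standalone sketch (our construction, not the authors' code) evaluates the proportionality g(∆) ∝ ∆² exp(−Σ∆²) under the simplifying assumption of an equal discrepancy ∆ in each of the d_{M−1} dimensions, reproducing the qualitative behavior in Fig. 2.

```python
import numpy as np

# Contribution of a sample to the gradient as a function of its
# per-dimension discrepancy Delta to another sample (Sec. 3.3):
# g(Delta) ∝ Delta^2 * exp(-d * Delta^2) for d feature dimensions,
# assuming the same discrepancy Delta in every one of the d dimensions.
for d in (1000, 2000, 4000):
    delta = np.linspace(0.0, 0.3, 7)
    g = delta ** 2 * np.exp(-d * delta ** 2)
    print(f"d_M-1 = {d}:", np.round(g / (g.max() + 1e-20), 3))
# As d grows, the peak of g moves toward the origin, so for any
# noticeable distance the contribution lies on the decreasing branch:
# far-away (likely mislabeled) samples contribute little to the gradient.
```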
Algorithm 1 Weakly-Supervised CNN Model
Input: noisy Web training images X = [x_1, ..., x_N]; initial parameters W = {W^(1), ..., W^(M)}; rectified linear activation function f(·).
Procedure:
Repeat:
  Forward propagation: implemented as in the traditional CNN.
  Backward propagation:
  1. For m = M, calculate
     ∂L̃/∂W_kd^(M) = ∂L/∂W_kd^(M) + α Σ_i Σ_j ξ_ij g(∆_d(x_j, x_i)),
     δ_k^(M) = −∂L̃/∂z_k^(M).
  2. For m = M−1 down to m = 2, set
     ∂L̃/∂W^(m) = δ^(m+1) (f(Z^(m)))ᵀ,
     δ^(m) = [(W^(m))ᵀ δ^(m+1)] · f′(Z^(m)).
Until the maximum iteration number is reached.
Output: W = {W^(1), ..., W^(M)}
Note that this suppression will be back-propagated to the first several layers through the "error term" δ^(M) defined in BP [20]; thereby the contribution of the noise samples will be limited in every layer. The complete procedure of our weakly-supervised CNN model is given in Algorithm 1.
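A didactic sketch of step 1 of Algorithm 1, reusing the `softmax_loss_and_grad` and `trace_penalty` helpers sketched earlier. For clarity we evaluate the relevance-feedback gradient by central finite differences instead of the closed form in Eqn. (10); this is our illustration, not the authors' implementation, and it is far too slow for real training.

```python
import numpy as np

def noise_robust_output_grad(Z, Y, W_out, alpha, beta, gamma, eps=1e-4):
    """Output-layer gradient of Eqn. (8): dL/dW + alpha * d tr(H^T L H)/dW.

    Z, Y, W_out, beta as in softmax_loss_and_grad; alpha, gamma as in the
    paper. The regularizer gradient is computed numerically for clarity.
    """
    _, grad = softmax_loss_and_grad(Z, Y, W_out, beta)   # Eqn. (2)

    def penalty(W):
        Z_out = Z @ W.T                                  # output features
        sq = np.sum((Z_out[:, None] - Z_out[None, :]) ** 2, axis=-1)
        S = np.exp(-sq / gamma ** 2) * (Y @ Y.T > 0)     # same-label mask
        return trace_penalty(S, Y)                       # tr(H^T L H)

    reg_grad = np.zeros_like(W_out)
    for idx in np.ndindex(W_out.shape):                  # finite differences
        E = np.zeros_like(W_out)
        E[idx] = eps
        reg_grad[idx] = (penalty(W_out + E) - penalty(W_out - E)) / (2 * eps)
    return grad + alpha * reg_grad                       # Algorithm 1, step 1
```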
4. Experiments

Our experiments consist of two parts. First, we show the noise-robust performance of the proposed approach on image categorization tasks. Second, we show the vocabulary-free tagging performance with the vocabulary of personal photos and of WordNet.

4.1. Image Categorization

Datasets: We conducted experiments on two widely-used datasets in image categorization. One is CIFAR-10, which consists of 60,000 32×32 color images of 10 classes; 50,000 images are for training and 10,000 for testing. To generate noisy training data with different percentages in CIFAR-10, a certain percentage of the training images in a certain category were randomly replaced by training images from other categories. The total number of images in a category remained unchanged. We set the percentage of the noise data from 10% to 90%.
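Our reconstruction of this noise-injection protocol as a short sketch (not the authors' script); the function name and seed handling are assumptions.

```python
import numpy as np

def inject_label_noise(images, labels, pct, num_classes=10, seed=0):
    """Replace `pct` of each category's images with images drawn from
    other categories, keeping per-category counts unchanged (Sec. 4.1).
    """
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    for c in range(num_classes):
        idx_c = np.flatnonzero(labels == c)
        n_noisy = int(len(idx_c) * pct)
        victims = rng.choice(idx_c, size=n_noisy, replace=False)
        donors = rng.choice(np.flatnonzero(labels != c), size=n_noisy,
                            replace=False)
        images[victims] = images[donors]   # wrong image, label stays c
    return images, labels
```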
The other dataset is PASCAL VOC2007, which consists of 9,963 images of 20 classes, with a split of 50% for training/validation and 50% for testing. We trained a classification model for the 20 categories using Web training images, and compared with the state-of-the-art methods.

Baselines: We denote the proposed method as noise-robust CNN (NRCNN). We compared the proposed method with four baselines on CIFAR-10 and six baselines on VOC2007. The common four baselines are:
• CNN: the state-of-the-art CNN network with convolutional layers and fully-connected layers. We specify the network structure in each task.
• RPCA+CNN: before the CNN training, we reconstruct each training sample by RPCA [3] and remove the samples with large reconstruction error. The removal ratio is set the same as the noise percentage.
• CAE+CNN: we pre-train the convolutional layers of the CNN by the convolutional autoencoder (CAE) in a layer-wise way and fine-tune the entire network, which is reported in [15] to reduce the noise effect.
• NL+CNN: we reproduce the additional bottom-up noise-adaption layer of [21], and combine this layer with the CNN network.
We also compared with another two methods on VOC2007:
• Best VOC: pre-training using ImageNet and fine-tuning on VOC2007, which has achieved the state-of-the-art performance [18].
• Web HOG: training concept representations by the part-based model and hand-crafted features with Web training images [22], which is the most recent work on this topic.

[Figure 3: The performance with different α (α ∈ {0, 0.01, 0.05, 0.1, 0.5, 1, 5, 10}) over training iterations. The performance stays in a stable range, except for one overly large value (α = 10).]

Results: First of all, we adjusted the weight decay value of the basic CNN model, i.e., β in Eqn. (2), on the two datasets. For the different noise percentages (from 10% to 90%), this value is 0.004 for 10%, 0.008 for 20%, and 0.04 for the rest. We found that the above parameters make the basic CNN model achieve its best result on both datasets. In addition, we empirically set γ to 0.1 in Eqn. (3) so that the similarity values lie in an appropriate range. Besides, there is only one adjustable parameter, α, in our model. Fig. 3 shows the effect of α on the classification

accuracy on the CIFAR-10 training data with 20% noises. We found that only when α is too large (e.g., 10) does the model lose its classification ability, with the accuracy remaining at random values. For the other values, the performance stays in a stable range and is best at 0.5. Besides, we found that the value of 0.5 also ensures the best results for the other noise percentages. Therefore, α is set to 0.5 in the following experiments.

Tab. 1 shows the classification accuracy for different noise percentages on CIFAR-10³. We can see that our model achieves the best accuracy in all cases. Our approach even achieved a slight improvement on the clean training data, compared with the traditional CNN. We found that the traditional CNN dropped by nearly 20% on CIFAR-10 with 30% noises. In contrast, our method only dropped about 10%, showing a strong robustness to noisy training data. In addition, we found an interesting fact about the data preprocessing method RPCA+CNN. When the noise percentage is less than 50%, this method shows a performance improvement over the traditional CNN. As the noises increase, the performance of RPCA+CNN falls below that of the traditional CNN. The reason is that the risk of removing correct samples by mistake increases significantly with the noise percentage, which in turn increases the noise in the final training data. The performances of CAE+CNN and NL+CNN are substantially similar. In the case of 30% noises, they drop by 17.0% and 15.9%, respectively. This indicates that although CAE+CNN can solve the problem where the noises are region-level [15] (e.g., background noises), its performance drops greatly when the noises are sample-level, i.e., when some images are entirely noise for their categories. For NL+CNN, our experiments also demonstrated that it is insufficient to enhance the noise immunity simply by adding a final noise-adaption layer. In contrast, our method can limit the noise effect in all layers through back propagation, and therefore we achieved the best classification results. Fig. 4 clearly reflects the declines of the classification accuracy for the different noise percentages.

[Figure 4: The decline in image classification accuracy on different noise percentages, compared to the performance on clean training data, for CNN, RPCA+CNN, CAE+CNN, NL+CNN and NRCNN (our method). The smaller the decline, the better the method.]

Furthermore, we evaluated the image categorization performance on the PASCAL VOC2007 dataset⁴. We pre-trained our network on the ImageNet 1,000 categories as in [18], and fine-tuned the network with Web training data. Training images were crawled from commercial image search engines by using each category in VOC2007 as a query, and duplicate images were removed. We followed the splits of positive/negative samples provided by VOC2007 to construct the Web training dataset for each category. Note that we have two training sets. First, we kept the positive/negative samples at the same number as in VOC2007 and denote the methods on this training set as CNN (Web) and NRCNN (Web). Second, we increased the positive samples to 4 times those of VOC2007 for each category, and denote the methods on this training set as CNN (Web×4) and NRCNN (Web×4). From our statistics, the noise percentages for the settings of (Web) and (Web×4) are about 20% and 40%, respectively. The average precision on the VOC2007 test set is shown in Tab. 2. We can draw the following conclusions:
• CNN (Web) surpasses Web HOG with a significant gain, which demonstrates that deep learning methods have a stronger noise-robust ability on noisy training data than the method using hand-crafted features.
• NRCNN (Web×4) is better than the state-of-the-art performance in [18]. Since the Web data is readily available, the cost of our model is small. This demonstrates the effectiveness of training a neural network with a noise-robust model on noisy Web training images. However, the traditional CNN model cannot achieve a comparable result to ours.

Besides, we found that the proposed model took 1.34 times the time-cost of the standard CNN model with a batch size of 128. We also found that the performance dropped by about 5.0% and 3.1% when using the features of the first and second fully-connected layers for the similarity computation, respectively, compared to the last layer. The reason is the lack of high-level semantics in the other layers.

4.2. Tagging with the Vocabulary of Personal Photos

One of the most attractive features of the proposed method is that we can quickly obtain a deep learning model to describe any tags (categories) by leveraging the unlimited tags and training data on the Web. For example, categories in personal photos are typically biased toward tags related to "landscape" or "family," for which we do not have a human-labeled training set.

We collected a set of 200 frequent categories from the user-contributed tags of 10,000 active users on Flickr (who had uploaded more than 500 photos in the recent six months, with a registration time of more than two years). We found that 50 of these categories, e.g., "sunset," "sightseeing," and "birthday," cannot be found even in the category list of ImageNet.

³We used the "cifar10 quick train test" network in Caffe (caffe.berkeleyvision.org) as the baseline CNN model in this task.
⁴We used the "alexnet train val" network in Caffe as the baseline CNN model in this task.
Table 1: Accuracy (%) of image classification on CIFAR-10 with clean training data and with training data of different noise percentages.

Method | clean | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90%
CNN | 81.24 | 77.79 | 71.97 | 65.09 | 55.65 | 45.60 | 36.65 | 25.02 | 19.46 | 17.55
RPCA [3]+CNN | 81.24 | 77.94 | 72.44 | 65.94 | 57.82 | 45.77 | 36.55 | 23.68 | 17.85 | 15.49
CAE [16]+CNN | 81.55 | 78.54 | 73.19 | 67.69 | 60.83 | 52.71 | 44.71 | 34.39 | 27.54 | 18.61
NL+CNN [21] | 81.16 | 78.28 | 73.36 | 68.26 | 61.63 | 55.83 | 47.33 | 37.12 | 30.81 | 19.49
NRCNN (our method) | 81.60 | 79.39 | 76.21 | 72.81 | 68.79 | 63.01 | 54.78 | 45.48 | 35.43 | 20.56

Table 2: Average precision per class on the VOC2007 test set. The words in brackets indicate: "Web," this method uses positive/negative Web training images of the same number as the standard setting in VOC2007; "Web×4," compared to "Web," the number of positive images used in this setting is increased 4 times.

Method | plane | bike | bird | boat | btl | bus | car | cat | chr | cow | tab | dog | horse | moto | pers | plnt | shp | sfa | train | tv | mAP
Best VOC [18] | 88.5 | 81.5 | 87.9 | 82.0 | 47.5 | 75.5 | 90.1 | 87.2 | 61.6 | 75.7 | 67.3 | 85.5 | 83.5 | 80.0 | 95.6 | 60.8 | 76.8 | 58.0 | 90.4 | 77.9 | 77.7
Web HOG [22] | 68.5 | 48.2 | 47.3 | 55.7 | 40.0 | 56.3 | 60.1 | 64.1 | 43.6 | 59.2 | 32.9 | 46.5 | 56.2 | 62.4 | 41.3 | 29.6 | 41.4 | 35.6 | 68.9 | 35.5 | 49.6
CNN (Web) | 84.1 | 68.8 | 77.1 | 73.0 | 63.0 | 74.2 | 74.3 | 79.2 | 61.8 | 73.8 | 48.9 | 79.5 | 81.0 | 82.1 | 48.4 | 57.9 | 72.0 | 31.6 | 83.4 | 64.7 | 68.9
CNN (Web×4) | 85.4 | 69.4 | 77.1 | 74.5 | 63.7 | 74.7 | 75.0 | 81.6 | 62.3 | 75.7 | 53.3 | 80.2 | 83.8 | 84.6 | 50.7 | 58.9 | 75.9 | 41.0 | 84.5 | 69.1 | 71.1
NRCNN (Web) | 85.8 | 69.7 | 77.4 | 75.1 | 63.8 | 75.8 | 75.6 | 82.7 | 62.7 | 76.9 | 53.5 | 80.6 | 84.7 | 84.9 | 49.2 | 59.1 | 76.0 | 50.8 | 84.8 | 69.2 | 71.9
NRCNN (Web×4) | 91.3 | 75.2 | 83.3 | 81.5 | 70.2 | 81.3 | 80.6 | 88.3 | 67.0 | 82.5 | 60.0 | 86.3 | 90.0 | 90.3 | 75.8 | 64.8 | 81.0 | 57.8 | 89.9 | 74.9 | 78.6

Of the 200 categories, we can therefore only use the ImageNet dataset to train a CNN model on the 150 existing categories, with 1,000 clean ImageNet training images for each category. We denote this method as CNN (ImageNet). To train the complete 200 categories, we crawled 1,000 images from a commercial image search engine for each category, removed duplicate images, and trained the deep learning models. Note that all methods were conducted with their respective best parameters. Besides, an alternative way to predict new categories is zero-shot learning. We therefore implemented DeViSE [7] as an additional baseline, which is trained on the 150 existing categories like CNN (ImageNet) and tested on the complete 200 categories by semantic extension.

We used the same network as for PASCAL VOC2007, and trained the network without the pre-training scheme. 1,000 randomly-selected photos from the MIT-Adobe FiveK Dataset [2] were used as the test set. Each method produces the top five categories with the highest prediction scores as its tagging list. 25 human labelers were employed to evaluate each tag on three levels: 2 – Highly Relevant; 1 – Relevant; 0 – Non Relevant. We adopted the Normalized Discounted Cumulative Gain (NDCG) as the metric to evaluate the tagging performance. The NDCG measures multi-level relevance and assumes that relevant tags are more useful when they appear higher in a ranked list. This metric at position p of the ranked list is defined by:

NDCG@p = Z_p Σ_{i=1}^{p} (2^{r_i} − 1) / log(1 + i),   (11)

where r_i is the relevance level of the ith tag and Z_p is a normalization constant such that NDCG@p = 1 for the perfect ranking.
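A sketch of Eqn. (11) for a single ranked tag list; choosing Z_p as the inverse of the ideal DCG, so that the perfect ranking scores exactly 1, is the standard reading and our assumption.

```python
import numpy as np

def ndcg_at_p(relevance, p):
    """NDCG@p of Eqn. (11) for one ranked tag list.

    relevance : graded relevance r_i of the tags in ranked order
                (2 = Highly Relevant, 1 = Relevant, 0 = Non Relevant).
    """
    rel = np.asarray(relevance[:p], dtype=float)
    gains = (2.0 ** rel - 1.0) / np.log(1.0 + np.arange(1, len(rel) + 1))
    # Z_p = 1 / (DCG of the ideal, i.e., relevance-sorted, ranking).
    ideal = np.sort(np.asarray(relevance, dtype=float))[::-1][:p]
    ideal_gains = (2.0 ** ideal - 1.0) / np.log(1.0 + np.arange(1, len(ideal) + 1))
    idcg = ideal_gains.sum()
    return gains.sum() / idcg if idcg > 0 else 0.0

print(ndcg_at_p([2, 0, 1, 2, 0], p=5))   # example ranked list
```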
[Figure 5: Tagging results produced by the proposed method on four personal photos: (a) family, dining room, party, people, night; (b) young girl, birthday, table, candle, donut; (c) ocean, sunset, cliff, mountain, beach; (d) sightseeing, bridge, sky, stadium, lake. The underlined tags are missing from the ImageNet categories but are important for personal photos.]

The results are shown in Tab. 3. We can see that our proposed method achieves a consistently better result than the other noise-resistant methods. Besides, CNN (ImageNet) is inferior to our method because of its limited vocabulary. The results also demonstrate that by leveraging Web training images for new categories, we can obtain a superior result to the semantic-embedded method DeViSE.
Table 3: Tagging performance in terms of NDCG for the 1,000 testing photos in the MIT-Adobe FiveK dataset.

Metric | CNN (Web) | RPCA+CNN (Web) | CAE+CNN (Web) | NL+CNN (Web) | CNN (ImageNet) | DeViSE (ImageNet) [7] | NRCNN (Web)
NDCG@1 | 0.08 | 0.23 | 0.11 | 0.24 | 0.20 | 0.28 | 0.32
NDCG@3 | 0.18 | 0.32 | 0.25 | 0.33 | 0.29 | 0.36 | 0.41
NDCG@5 | 0.26 | 0.39 | 0.34 | 0.41 | 0.39 | 0.43 | 0.46

Table 4: The tagging performance in terms of Similarity@K for different models and different vocabulary sets. ImageNet-1K is the vocabulary set of the 1,000 categories in the ImageNet competition. WordNet-63K is the largest vocabulary set used in this paper.

Vocabulary Set | Model | Similarity@1 | Similarity@2 | Similarity@5 | Similarity@10 | Similarity@20
ImageNet-1K | CNN | 0.88 | 0.85 | 0.51 | 0.43 | 0.22
WordNet-63K | CNN | 0.57 | 0.47 | 0.38 | 0.31 | 0.26
WordNet-63K | NRCNN (our method) | 0.58 | 0.56 | 0.51 | 0.45 | 0.36

Fig. 5 further illustrates some exemplary tagging results. We can observe that our approach can provide users with accurate tags, some of which are even excluded from the category list of ImageNet.

4.3. Tagging with the Vocabulary of WordNet

We further train a tagging model with a larger vocabulary set from WordNet. WordNet covers about 82,000 pairs of item ID and tag list⁵. Since the tags in a tag list refer to one synset, we keep the first tag as the representative of the tag list. For each tag, we crawled about 50 images from a commercial image search engine as training data. We removed invalid images and images whose width or height is smaller than 200 pixels. After this processing, we further removed from the vocabulary set the tags with fewer than 30 images. Finally, we collected 63,043 tags and about 2.4 million training images in total. To the best of our knowledge, this is the largest vocabulary set in the image tagging area. We kept the same network as above and fine-tuned it with the Web training data, using the released Alex's network parameters [12] in Caffe as the pre-trained parameters. Although the number of training images per category is limited, we will show good image tagging results with the help of the proposed noise-robust model and this largest vocabulary set.

We randomly selected 20,000 images from the ImageNet validation set as the testing images. To compare the tagging performance of the different approaches, we calculated the cosine similarity between the word vector of the category name of each testing image and the word vector of each tag produced by the different models. The word vectors can be calculated by the GloVe tool⁶. We defined an average similarity, Similarity@K, by averaging the similarity scores over the top-ranked K tags.

⁵image-net.org/archive/words.txt
⁶nlp.stanford.edu/projects/glove/
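A sketch of the Similarity@K computation as we read it: the mean cosine similarity between the GloVe vector of the ground-truth category name and the vectors of the top-K predicted tags for one image. The dictionary-based lookup and per-image averaging are our assumptions; loading the pre-trained vectors is assumed to be done elsewhere.

```python
import numpy as np

def similarity_at_k(word_vecs, truth_word, predicted_tags, k):
    """Similarity@K for one test image.

    word_vecs      : dict mapping a word to its pre-trained word vector;
                     assumes every tag and the truth word have a vector.
    truth_word     : ground-truth category name of the test image.
    predicted_tags : model's tags, ranked by prediction score.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    t = word_vecs[truth_word]
    return float(np.mean([cos(t, word_vecs[tag])
                          for tag in predicted_tags[:k]]))
```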
We show the results in Tab. 4. The results of the second row are achieved by the released Alex's network in Caffe. The results of the third and fourth rows are achieved by fine-tuning the network with Web training data on about 63,000 tags, with the released Alex's network as the pre-trained parameters. We observe that our model achieves better results from Similarity@5 to Similarity@20 than the traditional CNN model, which is implemented by the released Alex's network on the 1,000 vocabulary set. Our model can predict a wide range of tags, and achieves a significant improvement, with a gain of 14.0% in terms of Similarity@20, against the CNN model on the 1,000 vocabulary set. We show exemplar tagging results in the supplementary material A2. The lower results on Similarity@1 and Similarity@2 derive from the variety of tags and the limited number of training images. We will address this problem by using more powerful GPUs that can involve more training samples per category within a reasonable time cost (e.g., the one week we need currently).

5. Conclusions

In this paper, we propose a noise-robust deep learning model for noisy training data. The merit is that we can quickly train a deep learning model for any categories without human-labeled training data and apply the model to real applications. By leveraging the mutual relationships of features in the output layer, the contributions of noise images are weakened in the back propagation. Experiments demonstrate the superior performance. In the future, we will apply the weakly-supervised model to more image domains.

6. Acknowledgements

This work was supported by the 863 Program 2014AA015104, and the National Natural Science Foundation of China (61273034 and 61332016).
References

[1] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, pages 585–591, 2001.
[2] V. Bychkovsky, S. Paris, E. Chan, and F. Durand. Learning photographic global tonal adjustment with a database of input/output image pairs. In CVPR, 2011.
[3] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58(3):11:1–11:37, 2011.
[4] X. Chen, A. Shrivastava, and A. Gupta. NEIL: Extracting visual knowledge from web data. In ICCV, 2013.
[5] J. Deng, A. C. Berg, K. Li, and L. Fei-Fei. What does classifying more than 10,000 image categories tell us? In ECCV, pages 71–84, 2010.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[7] A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.
[8] J. Fu, T. Mei, K. Yang, H. Lu, and Y. Rui. Tagging personal photos with transfer deep learning. In WWW, pages 344–354, 2015.
[9] J. Fu, J. Wang, Y. Rui, X.-J. Wang, T. Mei, and H. Lu. Image tag refinement with view-dependent concept representations. IEEE Transactions on CSVT, 25:1409–1422, 2015.
[10] B. Graham. Spatially-sparse convolutional neural networks. arXiv preprint arXiv:1409.6070, 2014.
[11] J. Kim and C. D. Scott. Robust kernel density estimation. JMLR, pages 2529–2565, 2012.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
[13] J. Larsen, L. N. Andersen, M. Hintz-Madsen, and L. K. Hansen. Design of robust neural network classifiers. In ICASSP, pages 1205–1208, 1998.
[14] W. Liu, G. Hua, and J. R. Smith. Unsupervised one-class learning for automatic outlier removal. In CVPR, 2014.
[15] P. Luo, X. Wang, and X. Tang. Hierarchical face parsing via deep learning. In CVPR, pages 2480–2487, 2012.
[16] J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. In ICANN, pages 52–59, 2011.
[17] V. Mnih and G. E. Hinton. Learning to label aerial images from noisy data. In ICML, 2012.
[18] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, 2014.
[19] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In CVPR, pages 685–694, 2015.
[20] D. Rumelhart, G. Hinton, and R. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
[21] S. Sukhbaatar and R. Fergus. Learning from noisy labels with deep neural networks. arXiv preprint arXiv:1406.2080, 2014.
[22] S. K. Divvala, A. Farhadi, and C. Guestrin. Learning everything about anything: Webly-supervised visual concept learning. In CVPR, 2014.
[23] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11:3371–3408, 2010.
[24] J. Wu, Y. Yu, C. Huang, and K. Yu. Deep multiple instance learning for image classification and auto-annotation. In CVPR, pages 3460–3469, 2015.
[25] H. Xu, C. Caramanis, and S. Mannor. Outlier-robust PCA: The high-dimensional case. IEEE Transactions on Information Theory, 59(1):546–572, 2013.
[26] Y. Yang, H. T. Shen, Z. Ma, Z. Huang, and X. Zhou. ℓ2,1-norm regularized discriminative feature selection for unsupervised learning. In IJCAI, pages 1589–1594, 2011.
[27] J. Ye, Z. Zhao, and M. Wu. Discriminative k-means for clustering. In NIPS, 2007.

You might also like