Guangjian Zheng¹, Min Tan¹, Jun Yu¹, Qing Wu¹, Jianping Fan²
¹Key Laboratory of Complex Systems Modeling and Simulation,
School of Computer Science and Technology, Hangzhou Dianzi University
²Department of Computer Science, University of North Carolina at Charlotte
978-1-5090-6067-2/17/$31.00 ©2017 IEEE ICME 2017
• We present an efficient optimization to iteratively learn both the C-BCNN model and image reliability. Image reliability (weight) is optimized by solving a softmax-loss-based quadratic program.

2. OUR METHOD

We present a novel weakly supervised user click data guided bilinear CNN (W-C-BCNN), in which an image is represented by a combined deep visual and semantic feature. We first review the classical BCNN model proposed in [5], and then describe our model, including its structure and the weakly supervised learning procedure.

2.1. Review of the Classical BCNN Model

A BCNN model [5] consists of two CNN feature extractors whose outputs are multiplied using the outer product at each location of an image to form the image representation. Fig. 2(a) shows the structure of a classical BCNN model. It is particularly useful for fine-grained categorization since it can model local pairwise feature interactions in a translation-invariant manner. When used for a classification task, the BCNN model B is defined as a quadruple B = (fA, fB, P, C), where fA and fB are two CNN feature functions, P is a pooling function, and C is a classification function. The BCNN extracts the deep visual feature φ for an image I as:

φ(I) = Σ_{l∈L} bilinear(l, I, fA, fB),    (1)

where bilinear(l, I, fA, fB) = fA(l, I)^T fB(l, I) is the bilinear combination of fA and fB at location l ∈ L. The mapping function f : I × L → R^{c×D} outputs a feature vector of size c × D for image I at locations L. For classification tasks, the function C is trained on the image features φ. Note that φ is a high-dimensional feature vector: when fA and fB extract features of size C × M and C × N respectively, φ has M × N dimensions, and the classification function C is trained on the reshaped feature of size MN × 1.

To improve performance, the signed square-root step y ← sign(x)·√|x| and the ℓ2 normalization z ← y/‖y‖₂ are applied to x = φ(I), and z is used as the input to the softmax classification layer. An end-to-end training is then applied to learn the BCNN model [5].

2.2. Our Method

The BCNN model distinguishes objects by visual features only, so the subtle visual differences among categories remain a big challenge. We design a novel C-BCNN model to simultaneously extract the deep visual feature zi and the semantic feature ui for image xi.

We employ user click data to construct a semantic feature ui for each image based on its clicked queries. As images/queries obtained from user click data are extremely noisy, we are interested in the following question: how do we determine whether an image/query is reliable for learning the image recognition model, given that many images/queries are noisy? Intuitively, images of higher quality should be more reliable and contribute more to training than those of lower quality. To address these issues, we introduce a variable characterizing each image's reliability, and propose a method to iteratively learn both the C-BCNN model and the image reliability. Fig. 1 illustrates our pipeline for fine-grained image recognition. In the following sections, we present our model and its optimization in detail.

2.2.1. C-BCNN Structure

Our C-BCNN is constructed based on [5]. Fig. 2 illustrates the structures of the classical BCNN and our C-BCNN, respectively. The main difference lies in the feature concatenating layer behind the ℓ2 normalization layer, which is designed to integrate the CNN feature with the semantic feature. More specifically, the normalized BCNN vector z is passed through a feature concatenating layer, generating a combined feature vector oi ← [zi, τ·ui]. Here, zi and ui are the deep visual and semantic features for image i, and τ denotes the weight of the click feature in the combined feature.

We employ user click data to construct the semantic feature ui, representing each image as a click feature vector by concatenating the click count for each query. As the query set obtained from the large-scale click data is extremely large and redundant, we merge queries with similar semantics, and represent each image as a click feature vector over query clusters instead of the original queries:

ui = (Σ_{j∈G1} ci,j, Σ_{j∈G2} ci,j, ..., Σ_{j∈GL} ci,j),    (2)

where Gj is the index set of the j-th query cluster.

2.2.2. The Weakly Supervised Learning of C-BCNN

Given n training samples (Ii, yi), where yi ∈ {1, 2, ..., N} denotes the category label, the parameters θ of the C-BCNN model B are learned by solving the following weakly supervised C-BCNN learning (W-C-BCNN) problem:

(θ*, w*) = argmin_{θ,w} (1/2)‖θ‖² + (C/n) Σ_{i=1}^{n} wi·ℓ(yi, oi) + α·P(w) + β·S(G, w)    (3)
s.t. Σ_{i=1}^{n} wi = n,
     ℓ(yi, oi) = −log(e^{o_{yi}} / Σ_{j=1}^{N} e^{o_j}),
     wi > 0, ∀i,

where wi represents the reliability of sample i, ℓ(yi, oi) is the softmax loss for Ii, and oi is the BCNN feature zi combined with the semantic feature ui.
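The pipeline of Eqs. (1)–(2) — bilinear pooling over locations, the signed square-root and ℓ2 normalization of Section 2.1, and concatenation with a cluster-level click feature — can be sketched in NumPy. All shapes, toy values, and helper names below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def bilinear_feature(fa, fb):
    """Eqs. (1) and the normalization steps: sum of outer products
    fA(l)^T fB(l) over locations l, then signed sqrt and l2 norm.
    fa: (L, M) and fb: (L, N) feature maps flattened over L locations."""
    phi = fa.T @ fb                          # (M, N): equals sum_l outer(fa[l], fb[l])
    x = phi.reshape(-1)                      # reshape to an MN-dimensional vector
    y = np.sign(x) * np.sqrt(np.abs(x))      # signed square-root step
    return y / (np.linalg.norm(y) + 1e-12)   # l2 normalization -> z

def click_feature(counts, clusters):
    """Eq. (2): sum the click counts within each query cluster."""
    return np.array([counts[idx].sum() for idx in clusters])

# toy example
rng = np.random.default_rng(0)
fa, fb = rng.normal(size=(49, 8)), rng.normal(size=(49, 8))
z = bilinear_feature(fa, fb)                 # 64-dimensional normalized BCNN vector

counts = np.array([3., 0., 5., 1., 2.])      # clicks per query for one image
clusters = [np.array([0, 1]), np.array([2, 3, 4])]  # two query clusters G1, G2
u = click_feature(counts, clusters)          # cluster-level click feature u_i

tau = 0.1                                    # click-feature weight
o = np.concatenate([z, tau * u])             # combined feature o_i = [z_i, tau*u_i]
```

Here `fa.T @ fb` equals the sum over locations of the outer products fA(l)^T fB(l), which is why a single matrix product implements the pooling in Eq. (1).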
[Fig. 1 block diagram: training samples are re-weighted and fed into deep C-BCNN learning; the sample losses, a similarity graph, and the weight prior drive sample-weight learning; at test time, testing samples are recognized by the learned C-BCNN model to produce the final result.]

Fig. 1. Pipeline of image recognition via the click data guided BCNN (C-BCNN) model with the weakly supervised training procedure. In the training phase, it iteratively learns a C-BCNN model and the image weights. During testing, we recognize each testing image using the learned C-BCNN model.

[Fig. 2 structure diagrams: input image → bilinear vector → square root → ℓ2 normalization → softmax for the original BCNN; the proposed C-BCNN inserts a filter-concatenation layer merging the normalized bilinear vector with the click feature, followed by dropout and softmax.]

Fig. 2. Visualization of the two BCNN models. The biggest difference (plotted in red) between our C-BCNN and the original BCNN lies in the combination of the BCNN feature with the user click feature. (a) Original [5]. (b) Proposed.

Since the reliability variable w is unknown in advance and must be estimated during training, we employ a weakly supervised training method [10] to solve (3), wherein the reliability variables and the C-BCNN model are learned alternately in each iteration. The sample weight model is constructed from a click-data-based weight prior and a visual-consistency-based smoothness constraint.

Weight Prior. The weight prior term P(w) imposes a regularization constraint. Recently, Tan et al. proposed to discriminatively train a binary SVM classifier to estimate weights, but that reliability classifier is heuristically trained [11]. In this paper, we utilize click data to model the weight prior. Intuitively, an image with a larger user click count should be more reliable and contribute more to training. For each image, we use the total click count to estimate the weight prior as follows:

P(w) = ‖w − wc‖²₂,    (4)

where wc is the normalized click vector defined as:

wc ← wc/‖wc‖,  wc = T(u),    (5)

where T(·) is a transformation scaling the range of u to handle the large imbalance of click counts among images.

Smoothness Assumption. The smoothness term is constructed based on visual consistency: it is assumed that visually similar images should be assigned similar weights. Based on this assumption, we impose a graph-regularization-based smoothness term [12] using a similarity-based adjacency graph G. The BCNN visual feature z is used to measure similarity and construct G as follows:

S(G, w) = Σ_{∀i,j∈χk} gi,j (wi − wj)² / 2,
gi,j = sim(xi, xj) = exp(−‖zi − zj‖).    (6)

2.2.3. Optimization

As (3) is a complicated non-convex problem, it is hard to find a convergent globally optimal solution. We use alternation to reach a local optimum, solving for one variable in each turn with the others fixed. We iterate over two steps: 1) fix each wi and solve a weighted C-BCNN problem to obtain θ; 2) fix θ and solve for wi using quadratic programming. This procedure is shown in Fig. 1.

Learning θ. Similar to [5], θ can be trained by back-propagating the gradients of the classification loss (e.g. conditional log-likelihood). Let dℓ/dx be the gradient of the loss function with respect to x; then, by the chain rule of gradients, we update the two deep networks A and B by:

dℓ/dA = B (dℓ/dx)^T,   dℓ/dB = A (dℓ/dx),    (7)

where dℓ/dx = (dℓ/do)(do/dz)(dz/dy)(dy/dx).

Learning w. With the BCNN model θ fixed, we re-construct G by (6) and solve for w via the following problem:

w* = argmin_w (C/n) Σ_{i=1}^{n} wi·li + α‖w − wc‖²₂ + (β/2) Σ_{∀s,t} Σ_{∀i,j∈As,t} gi,j (wi − wj)²    (8)
s.t. I^T w = n,  0 ≤ wi ≤ n, ∀i,

where I is the all-ones vector. Based on the Laplacian, we rewrite (8) as follows:

w* = argmin_w (1/2) w^T (2β·Llap + 2α·E) w + ((C/n)·l − 2α·wc)^T w    (9)
s.t. I^T w = n,  0 ≤ wi ≤ n, ∀i,

where E is an identity matrix, and Llap is the Laplacian matrix of graph G, defined as:
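The chain-rule identities of Eq. (7) for the bilinear combination x = A^T B can be checked numerically. The sketch below uses a toy quadratic loss (chosen so that dℓ/dx = x) and arbitrary shapes — both are assumptions for illustration, not the paper's softmax loss:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))   # f_A outputs at 5 locations, 3 channels
B = rng.normal(size=(5, 4))   # f_B outputs at 5 locations, 4 channels

def loss(A, B):
    x = A.T @ B                    # bilinear combination, as in Eq. (1)
    return 0.5 * np.sum(x ** 2)    # toy loss: dl/dx = x

x = A.T @ B
dldx = x                           # gradient of the toy loss w.r.t. x
dldA = B @ dldx.T                  # Eq. (7): dl/dA = B (dl/dx)^T
dldB = A @ dldx                    # Eq. (7): dl/dB = A (dl/dx)

# finite-difference check on one entry of A
eps = 1e-6
Ap = A.copy()
Ap[2, 1] += eps
num = (loss(Ap, B) - loss(A, B)) / eps
```

The analytic gradient `dldA[2, 1]` agrees with the finite difference `num` to within the step-size error, confirming the shapes and transposes in Eq. (7).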
Llap = D − G,
D = diag(Σ_j g1,j, Σ_j g2,j, ..., Σ_j gn,j).    (10)

By rewriting (8), we have:

w* = argmin_w (1/2) w^T (2β·Llap + 2α·E) w + ((C/n)·l − 2α·wc)^T w    (11)
s.t. I^T w = n,  0 ≤ wi ≤ n, ∀i.

We use an interior-point algorithm¹ to solve the quadratic programming problem (11).

2.3. Extensions for W-C-BCNN

We discuss two extensions of our method. One is an improved per-category weight learning method; the other is constructing the weight prior from other kinds of data.

2.3.1. Per-Category Weight Learning

We propose a per-category weight optimization for datasets that are very unbalanced. Denoting μj as the weight vector for the samples in category j, we conduct a per-category weight optimization to obtain w* = {μ*1, ..., μ*N}. Each μ*j is obtained by solving the following problem:

3. EXPERIMENTS

Both the combined feature and the weakly supervised training procedure are evaluated. The performance is evaluated by recognition accuracy. Extensive experiments are conducted. First, we present the experimental settings, including the datasets used; second, we evaluate our combined BCNN and click feature; finally, we show the effect of the weakly supervised training procedure, wherein both the click-data-based prior and the visual-consistency-based graph regularization are evaluated.

3.1. Experimental Settings

With no publicly available training/testing split for either dataset, similar to [13] we randomly split each of the two datasets into three parts: 50% for training, 30% for validation, and 20% for testing.

3.1.1. The Clickture-Dog Dataset

Clickture-Dog consists of dog images from 344 categories. To ensure a valid training/testing split, we filter out categories containing fewer than 3 images. We also randomly select 300 samples from categories with more than 300 samples to avoid imbalance among categories. Altogether, we obtain a dog-breed dataset with 30,568 dog images from 283 categories. For each image, the clicked query set and the corresponding counts are collected from Clickture-Full [9] (refer to [13]). Unlike [14], we did not conduct any further preprocessing, e.g. data cleaning.
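Once Llap is assembled from Eqs. (6) and (10), solving (11) is a small convex QP with one equality constraint and box bounds. The sketch below substitutes SciPy's SLSQP solver for the interior-point algorithm used in the paper; the data and the values of C, α, β are made-up toy inputs:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 6
z = rng.normal(size=(n, 10))           # bilinear features z_i
l = rng.uniform(0.1, 2.0, size=n)      # per-sample softmax losses l_i
wc = np.full(n, 1.0)                   # normalized click prior w^c (toy)
C, alpha, beta = 1.0, 0.5, 0.1         # illustrative hyper-parameters

# Eq. (6): similarity graph g_ij = exp(-||z_i - z_j||)
dist = np.linalg.norm(z[:, None] - z[None, :], axis=2)
G = np.exp(-dist)
np.fill_diagonal(G, 0.0)

# Eq. (10): L_lap = D - G, with D the diagonal of row sums
Llap = np.diag(G.sum(axis=1)) - G

# Eq. (11): min_w (1/2) w^T (2*beta*Llap + 2*alpha*E) w + (C/n * l - 2*alpha*wc)^T w
H = 2 * beta * Llap + 2 * alpha * np.eye(n)
q = C / n * l - 2 * alpha * wc
obj = lambda w: 0.5 * w @ H @ w + q @ w
grad = lambda w: H @ w + q

res = minimize(obj, x0=np.ones(n), jac=grad, method="SLSQP",
               constraints=[{"type": "eq", "fun": lambda w: w.sum() - n}],
               bounds=[(0.0, n)] * n)   # I^T w = n and 0 <= w_i <= n
w = res.x                               # learned sample reliabilities
```

Since Llap is positive semi-definite and α > 0, the Hessian 2βLlap + 2αE is positive definite, so the local solution returned here is the global optimum of (11), whichever constrained solver is used.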
[Training-error curves (final values): C-BCNN train 8.7%, C-BCNN val 26.7%; BCNN train 25.6%, BCNN val 55%.]

Table 1. Comparison of recognition accuracy (%) of C-BCNN with BCNN on the Clickture-Dog and WCE datasets.

                 BCNN    C-BCNN
Clickture-Dog    33.20   51.20
WCE              97.10   98.40
Table 2. Comparison of recognition accuracy (%) between C-BCNN learning and W-C-BCNN with different weight transformations. "OUR" denotes the proposed W-C-BCNN method.

                 C-BCNN   OUR     OUR(T)
Clickture-Dog    50.60    52.90   52.90
WCE              98.40    99.40   99.50

5. ACKNOWLEDGEMENTS

This work was supported by the National Natural Science Foundation of China (No. 61602136, No. 61622205, and No. 61601158), and the Zhejiang Provincial Natural Science Foundation of China under Grant LR15F020002.

6. REFERENCES

[1] T. Berg, Jiongxin Liu, Seung Woo Lee, M. L. Alexander, D. W. Jacobs, and P. N. Belhumeur, "Birdsnap: Large-scale fine-grained visual categorization of birds," in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2019–2026.

[2] A. Iscen, G. Tolias, P. H. Gosselin, and H. Jegou, "A comparison of dense region detectors for image search and fine-grained classification," IEEE Transactions on Image Processing, vol. 24, no. 8, pp. 2369–2381, 2015.

[3] Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Li Fei-Fei, "Novel dataset for fine-grained image categorization," in IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, June 2011.

[4] Shenghua Gao, Ivor Wai-Hung Tsang, and Yi Ma, "Learning category-specific dictionary and shared dictionary for fine-grained image categorization," IEEE Transactions on Image Processing, vol. 23, no. 2, pp. 623–634, Feb. 2014.

[5] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji, "Bilinear CNN models for fine-grained visual recognition," in IEEE International Conference on Computer Vision, 2015.

[6] Ning Zhang, Manohar Paluri, Marc'Aurelio Ranzato, Trevor Darrell, and Lubomir Bourdev, "PANDA: Pose aligned networks for deep attribute modeling," in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1637–1644.

[7] Hong Shao, Shuang Chen, Jie-yi Zhao, Wen-cheng Cui, and Tian-shu Yu, "Face recognition based on subset selection via metric learning on manifold," Frontiers of Information Technology & Electronic Engineering, vol. 16, no. 12, pp. 1046–1058, 2015.

[8] A. Vedaldi, S. Mahendran, S. Tsogkas, S. Maji, R. Girshick, J. Kannala, E. Rahtu, I. Kokkinos, M. B. Blaschko, D. Weiss, B. Taskar, K. Simonyan, N. Saphra, and S. Mohamed, "Understanding objects in detail with fine-grained attributes," in IEEE Conference on Computer Vision and Pattern Recognition, 2014.

[9] Xian-Sheng Hua, Linjun Yang, Jingdong Wang, Jing Wang, Ming Ye, Kuansan Wang, Yong Rui, and Jin Li, "Clickage: Towards bridging semantic and intent gaps via mining click logs of search engines," in ACM International Conference on Multimedia, 2013, pp. 243–252.

[10] M. Tan, B. Wang, Z. Wu, J. Wang, and G. Pan, "Weakly supervised metric learning for traffic sign recognition in a lidar-equipped vehicle," IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 5, pp. 1415–1427, May 2016.

[11] Min Tan, Zhenfang Hu, Baoyuan Wang, Jieyi Zhao, and Yueming Wang, "Robust object recognition via weakly supervised metric and template learning," Neurocomputing, vol. 101, pp. 96–107, 2016.

[12] Xuelong Li, Guosheng Cui, and Yongsheng Dong, "Graph regularized non-negative low-rank matrix factorization for image clustering," IEEE Transactions on Cybernetics, 2016.

[13] Min Tan, Jun Yu, Guangjian Zheng, Weichen Wu, and Kejia Sun, "Deep neural network boosted large scale image recognition using user click data," in International Conference on Internet Multimedia Computing and Service, 2016, pp. 118–121.

[14] Chenghua Li, Qiang Song, Yuhang Wang, Hang Song, Qi Kang, Jian Cheng, and Hanqing Lu, "Learning to recognition from Bing Clickture data," in IEEE International Conference on Multimedia and Expo, 2016, pp. 1–4.

[15] Min Tan, Gang Pan, Yueming Wang, Yuting Zhang, and Zhaohui Wu, "L1-norm latent SVM for compact features in object detection," Neurocomputing, vol. 139, pp. 56–64, 2014.

[16] Q. Zheng, A. Kumar, and G. Pan, "A 3D feature descriptor recovered from a single 2D palmprint image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 6, pp. 1272–1279, 2016.

[17] L. Chen, D. Zhang, X. Ma, and L. Wang, "Container port performance measurement and comparison leveraging ship GPS traces and maritime open data," IEEE Transactions on Intelligent Transportation Systems, vol. 5, no. 2, pp. 1–16, 2016.