
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO. 4, APRIL 2016

A Mixed Generative-Discriminative
Based Hashing Method
Qi Zhang, Yang Wang, Jin Qian, and Xuanjing Huang

Abstract—Hashing methods have proven to be useful for a variety of tasks and have attracted extensive attention in recent years. Various hashing approaches have been proposed to capture similarities between textual, visual, and cross-media information. However, most of the existing works use bag-of-words methods to represent textual information. Since words with different forms may have similar meanings, semantic-level text similarities cannot be well handled by these methods. To address these challenges, in this paper we propose a novel method called semantic cross-media hashing (SCMH), which uses continuous word representations to capture textual similarity at the semantic level and a deep belief network (DBN) to construct the correlation between different modalities. To demonstrate the effectiveness of the proposed method, we evaluate it on three commonly used cross-media data sets. Experimental results show that the proposed method achieves significantly better performance than state-of-the-art approaches. Moreover, the efficiency of the proposed method is comparable to or better than that of some other hashing methods.

Index Terms—Hashing method, word embedding, Fisher vector
1 INTRODUCTION

WITH the rapid expansion of the World Wide Web, digital information has become much easier to access, modify, and duplicate. Hence, hashing based similarity calculation and approximate nearest neighbour search methods have been proposed and have received considerable attention in recent years. Various applications, such as information retrieval, near duplicate detection, and data mining, are performed by hashing based methods. Due to the rapid expansion of mobile networks and social media sites, information input through multiple channels has also attracted increasing attention. Images and videos are associated with tags and captions. According to research published on eMarketer, about 75 percent of the content posted by Facebook users contains photos (http://www.socialmediaexaminer.com/photos-generate-engagement-research/). Relevant data from different modalities usually have semantic correlations. Therefore, it is desirable to support the retrieval of information across different modalities. For example, images can be used to find semantically relevant textual information. On the other side, images without (or with little) textual description need to be retrievable with textual queries.

Along with these increasing requirements, cross-media search tasks have received considerable attention in recent years [1], [2], [3], [4], [5], [6], [7]. Since each modality has different representation methods and correlational structures, a variety of methods have studied the problem from the aspect of learning correlations between different modalities. Existing methods proposed to use Canonical Correlation Analysis (CCA) [8], manifold learning [9], dual-wing harmoniums [10], deep autoencoders [11], and deep Boltzmann machines [12] to approach the task. Due to the efficiency of hashing-based methods, there also exists a rich line of work focusing on the problem of mapping multi-modal high-dimensional data to low-dimensional hash codes, such as latent semantic sparse hashing (LSSH) [13], discriminative coupled dictionary hashing (DCDH) [14], cross-view hashing (CVH) [15], and so on.

Most of the existing works use a bag-of-words model to represent textual information. The semantic level similarities between words or documents are rarely considered. Let us consider the following examples:

S1. The company announces new operating system.
S2. The company releases new operating system.
S3. The company delays new operating system.

From these examples, we can observe that although only one word differs between the three sentences, sentence S3 should not be considered a near duplicate of sentences S1 and S2. The meaning expressed by S3 differs greatly from that of S1 and S2. Since existing methods are usually based on lexical level similarities, this kind of issue cannot be well addressed by these methods.

In short text segments (e.g., microblogs, captions, and tags), the similarities between words are especially important for retrieval, for example: journey versus travel, coast versus shore. According to human-assigned similarity judgments [16], more than 90 percent of subjects thought that these pairs of words had similar meanings. Fig. 1 illustrates a set of images retrieved from Flickr using different queries. From these examples, we can see that images may express similar concepts even though there is little overlap in terms of annotated tags. Since users rarely annotate a single image using multiple words with similar meaning, semantic level textual similarities should be incorporated into cross-media retrieval.

The authors are with the Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai 201203, P.R. China. E-mail: {qz, 14210240023, 12110240030, xjhuang}@fudan.edu.cn.

Manuscript received 2 June 2015; revised 17 Oct. 2015; accepted 1 Dec. 2015. Date of publication 9 Dec. 2015; date of current version 3 Mar. 2016. Recommended for acceptance by Y. Chang. Digital Object Identifier no. 10.1109/TKDE.2015.2507127

Fig. 1. An example of top retrieved images from Flickr with different tags.

Motivated by the success of continuous space word representations (also called word embeddings) in a variety of tasks, in this work we propose to incorporate word embeddings to meet these challenges. Words in a text are embedded in a continuous space, which can be viewed as a Bag-of-Embedded-Words (BoEW). Since the number of words in a text is dynamic, in [17] we proposed a method to aggregate it into a fixed length Fisher Vector (FV), using the Fisher kernel framework [18]. However, that method focuses only on textual information. Another challenge in this task is how to determine the correlation between multi-modal representations. Since we propose the use of the Fisher kernel framework to represent the textual information, we also use it to aggregate the SIFT descriptors [19] of images. Through the Fisher kernel framework, both textual and visual information is mapped to points in the gradient space of a Riemannian manifold. However, the relationships that exist between FVs of different modalities are usually highly non-linear. Hence, to construct the correlation between textual and visual modalities, we introduce a DBN based method to model the mapping function, which is used to convert abstract representations of different modalities from one to another.

The main contributions of this work are summarized as follows.

- We propose to incorporate continuous word representations to handle semantic textual similarities, and adapt them for cross-media retrieval.
- Inspired by the advantages of DBNs in handling highly non-linear relationships and noisy data, we introduce a novel DBN based method to construct the correlation between different modalities.
- A variety of experiments on three commonly used cross-media benchmarks demonstrate the effectiveness of the proposed method. The experimental results show that the proposed method can significantly outperform the state-of-the-art methods.

2 RELATED WORK

Along with the increasing requirements, extensive hashing-based methods have been proposed for cross-media retrieval. In this section, we briefly describe the related works, which can be categorized into the following research areas: cross-media retrieval, near-duplicate detection, hashing-based methods, and neural networks for representing image and text.

2.1 Cross-Media Retrieval

Cross-media retrieval, in which the modality of the input query and the returned results can differ, has received considerable attention [1], [3], [6], [7], [8], [9], [10], [12], [20], [21]. Wu et al. [8] introduced a Canonical Correlation Analysis based method to construct an isomorphic subspace and multi-modal correlations between media objects, and polar coordinates to judge the general distance of media objects. Due to the lack of sufficient training samples, user relevance feedback was used to accurately refine cross-media similarities. Yang et al. [9] proposed a manifold-based method, in which they used a Laplacian media object space to represent media objects for each modality, and a multimedia document semantic graph to learn the multimedia document semantic correlations. In [22], a rich-media object retrieval method is proposed to represent data consisting of multiple modalities, such as 2-D images, 3-D objects and audio files. To tackle the large scale problem, a multimedia indexing scheme was also adopted.

Since the relationships across different modalities are typically highly non-linear and observations are usually noisy, Srivastava and Salakhutdinov [12] proposed a deep Boltzmann machine to learn joint representations of image and text inputs. The proposed model fuses multiple data modalities into a unified representation, which can be used for classification and retrieval. Xing et al. [10] introduced the use of dual-wing harmoniums to build a joint model for images and text. The model incorporated Gaussian hidden units together with Gaussian and Poisson visible units into a linear RBM model. In [12], a multimodal deep Boltzmann machine was proposed for learning multimodal data representations.

To reduce the training time complexity, Zhang and Li [6] proposed to seamlessly integrate semantic labels into the hashing learning procedure for large-scale data modeling.

In the past few years, deep neural networks (DNNs) have achieved tremendous success in various tasks. Cross-media retrieval is one of the tasks in which DNNs and other neural network architectures have obtained improvements. In [11], a deep autoencoder was proposed to learn features over multiple modalities. The method uses the hidden units to construct shallow representations for the data and builds deep bimodal representations by modeling the correlations across the learned shallow representations. Karpathy and Fei-Fei proposed a multimodal recurrent neural network for generating descriptions for images [23]. The generated descriptions can be used for cross-media retrieval.

Most of the existing works described above focused on constructing the correlations between multiple modalities from different aspects. They usually use a bag-of-words model to represent text. In this work, however, we propose to use the Fisher kernel framework to represent both textual and visual information, and a deep network to construct the correlations between the two manifolds.

2.2 Near-Duplicate Detection

The task of detecting near duplicate textual information has received considerable attention in recent years. Previous works studied the problem from different aspects, such as fingerprint extraction methods with or without linguistic knowledge, hash code learning methods, different granularities, and so on.

Broder [24] proposed the Shingling method, which uses contiguous subsequences to represent documents. It does not rely on any linguistic knowledge. If the sets of shingles extracted from different documents overlap appreciably, these documents are considered exceedingly similar, which is usually measured by Jaccard similarity. In order to reduce the complexity of shingling, meta-sketches were proposed to handle the efficiency problem [25]. In order to improve the robustness of shingle-like signatures, Theobald et al. [26] introduced a method, SpotSigs, which provides a more semantic pre-selection of shingles for extracting characteristic signatures from Web documents. SpotSigs combines stopword antecedents with short chains of adjacent content terms. Its aim is to filter natural-language text passages out of noisy Web page components. They also proposed several pruning conditions based on the upper bounds of Jaccard similarity.

I-Match [27] is one of the methods using hash codes to represent an input document. It filters the input document based on collection statistics and computes a single hash value for the remaining text. If two documents have the same hash value, they are considered duplicates. It hinges on the premise that removal of very infrequent terms and very common terms results in good document representations for the near-duplicate detection task. Since I-Match signatures are sensitive to small modifications, Kołcz et al. [28] proposed the solution of several I-Match signatures, all derived from randomized versions of the original lexicon.

Local text reuse detection focuses on identifying reused and modified sentences, facts or passages, rather than whole documents. Seo and Croft [29] analyzed the task and defined six categories of text reuse. They proposed a general framework for text reuse detection, under which several fingerprinting techniques were evaluated. Zhang et al. [30] also studied the partial-duplicate detection problem. They converted the task into two subtasks: sentence level near-duplicate detection and sequence matching. Besides the similarities between documents, the method can simultaneously output the positions where the duplicated parts occur. In order to handle the efficiency problem, they implemented their method using three Map-Reduce jobs. Kim et al. [31] proposed to map sentences into points in a high dimensional space and leveraged range searches in this space. They used the MD5 hash function to generate a hash code for each word. A file signature is then created by taking the bitwise-or of all signatures of words that appear in the file.

Different from these existing methods, in this paper we propose to use aggregated word embeddings to capture the semantic level similarities, reducing false matches.

2.3 Hashing-Based Methods

In recent years, hashing-based methods, which create compact hash codes that preserve similarity, have attracted considerable attention for single-modal or cross-modal retrieval on large-scale databases [4], [5], [12], [13], [14], [15], [32], [33], [34], [35], [36], [37], [38]. For the single-modal case, Hinton and Salakhutdinov [33] proposed a two-layer network, which is called a restricted Boltzmann machine (RBM), with a small central layer to convert high-dimensional input vectors into low-dimensional codes. In [36], spectral hashing was defined to seek compact binary codes that preserve the semantic similarity between codewords. The criterion used in spectral hashing is related to graph partitioning. Norouzi and Fleet [39] introduced a method based on a latent structural SVM framework for learning similarity-preserving hash functions. A specific loss function is designed to take both Hamming distance and binary quantization into consideration. In [40], Self-Taught Hashing (STH) converted the hash code learning problem into two stages: an unsupervised method, binarised Laplacian Eigenmap, is used to optimize l-bit binary codes, and classifiers are then trained to predict the l-bit codes for unseen documents.

A variety of works studied the problem of mapping multi-modal high-dimensional data to low-dimensional hash codes. Latent semantic sparse hashing [13] proposed the use of matrix factorization to represent text and sparse coding to capture the salient structures of images. Then, these representations are mapped to a joint abstraction space. However, LSSH requires the use of both visual and textual information to construct the data set. Although out-of-samples can be estimated, the performance may be heavily influenced. Yu et al. [14] introduced a discriminative coupled dictionary hashing approach, which generates a coupled dictionary for each modality based on category labels. Kumar and Udupa [15] formulated the problem of learning hash functions as a constrained minimization problem. Since the optimization problem is NP hard, they transformed it into a tractable eigenvalue problem by means of a relaxation. Inter-media hashing (IMH) [4] uses a linear regression model to jointly learn a set of hashing functions for each individual media type.

Fig. 2. An overview processing flow of the proposed SCMH for cross-media retrieval.

Since we in this work learn the mapping functions between FVs of different modalities, all the hashing based methods for a single modality can be incorporated into it.

2.4 Neural Networks for Representing Image and Text

The task of learning continuous space word representations has a long history [41], [42], [43], [44], [45], [46], [47]. It has demonstrated outstanding results across a variety of tasks. Hinton and Salakhutdinov [44] introduced a deep generative model to learn word-count vectors and binary codes for documents. In [45], the word representations are learned by a recurrent neural network language model; the proposed architecture consists of an input layer and a hidden layer with recurrent connections. The probabilistic neural network language model (NNLM) [48] simultaneously learns a distributed representation for each word and the probability function for word sequences. Bordes et al. [41] proposed a multi-task training process to jointly learn representations of words, entities and meaning representations. The work described in [49] introduced a mix of unsupervised and supervised techniques to learn word vectors that capture both semantic and sentiment similarities among words.

On the image side, there are also a variety of studies tackling the problem of higher-level representations of visual information. Krizhevsky et al. [50] proposed to use a deep convolutional neural network to perform object detection. In [51], region proposals are combined with CNNs to generate features for object detection. Besides these supervised methods, unsupervised learning methods for training visual features have also been carefully studied. Lee et al. [52] introduced the convolutional deep belief network, a hierarchical generative model, to represent images. Taylor et al. [53] proposed a convolutional gated restricted Boltzmann machine to model the spatio-temporal features of videos.

Although in this work we use word embeddings and SIFT to represent texts and images respectively, the proposed method can also incorporate these representations.

3 THE PROPOSED METHOD

The processing flow of the proposed semantic cross-media hashing (SCMH) method is illustrated in Fig. 2. Given a collection of text-image bi-modality data, we first represent image and text respectively. Through table lookup, all the words in a text are transformed to the distributed vectors generated by the word embedding learning methods. For representing images, we use the SIFT detector to extract image keypoints, and the SIFT descriptor to calculate the descriptors of the extracted keypoints [19]. After these steps, a variable size set of points in the embedding space represents the text, and a variable size set of points in SIFT descriptor space represents each image. Then, the Fisher kernel framework is utilized to aggregate these points in different spaces into fixed length vectors, which can also be considered as points in the gradient space of a Riemannian manifold. Henceforth, texts and images are represented by vectors of fixed length. Finally, the mapping functions between textual and visual Fisher vectors (FVs) are learned by a deep neural network. We use the learned mapping function to convert FVs of one modality to another. Hash code generation methods are used to transfer FVs of different modalities to short binary vectors. In the following sections, we describe these steps in detail.

3.1 Word Embeddings Learning

The representation of words as continuous vectors has recently been shown to benefit performance in a variety of NLP and IR tasks [44], [46], [47]. Similar words tend to be close to each other in the vector representation. Moreover, Mikolov et al. [54] also demonstrated that the learned word representations can capture meaningful syntactic and semantic regularities. Hence, in this work, we propose to use word embeddings to capture the semantic level similarities between short text segments.
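As a concrete illustration of this point, the sketch below scores the similarity of two short tag sets by the cosine similarity of their averaged embeddings, so that pairs such as journey/travel or coast/shore contribute even without exact lexical overlap. This is a minimal sketch, not the paper's method (which aggregates embeddings with Fisher vectors, Section 3.2); the toy 3-dimensional vectors are invented for the example.

```python
import numpy as np

# Toy pre-trained embeddings; a real system would load 300-d
# Google News vectors as the paper does. Values are made up.
emb = {
    "journey": np.array([0.81, 0.10, 0.55]),
    "travel":  np.array([0.78, 0.15, 0.58]),
    "coast":   np.array([0.05, 0.90, 0.40]),
    "shore":   np.array([0.08, 0.88, 0.43]),
}

def text_vector(words):
    """Average the embeddings of the words (a crude Bag-of-Embedded-Words)."""
    vecs = [emb[w] for w in words if w in emb]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Tag sets with no lexical overlap still score as semantically close.
print(cosine(text_vector(["journey", "coast"]),
             text_vector(["travel", "shore"])))   # close to 1.0
```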

Fig. 3. Methods used to learn word embeddings. The NNLM architecture predicts the probability of a word based on the preceding words [48]. CBOW predicts the current word based on the context [54]. Skip-gram predicts surrounding words given the current word [54].

Fig. 3 shows three architectures used for learning word embeddings. $w_i$ represents the $i$th word in the given word sequence $\{w_1, w_2, \ldots, w_T\}$. Fig. 3a shows the architecture of the probabilistic neural network language model (NNLM) proposed by Bengio et al. in [48]. It can have either one hidden layer beyond the word feature mapping, or direct connections from the word features to the output layer. They also proposed to use a softmax function for the output layer to guarantee positive probabilities summing to 1. The word vectors and the parameters of the probability function can be learned simultaneously. In this work, we only use the learned word vectors.

Figs. 3b and 3c show the architectures of the methods proposed by Mikolov in [54]. The architecture of CBOW, which is similar to NNLM, is shown in Fig. 3b. The main differences are that (i) the non-linear hidden layer is removed; (ii) words from the future are included; (iii) the training criterion is to correctly classify the current word ($w_t$). The Skip-gram architecture, shown in Fig. 3c, is similar to CBOW. However, instead of predicting the current word based on the history and future words, it tries to maximize the classification accuracy of words within a certain range before and after the current word, based only on the current word as input.

Besides the methods mentioned above, there are also a large number of works addressing the task of learning distributed word representations [47], [49], [55]. Most of them can also be used in this work. The proposed framework places no restriction on which continuous word representation method is used.

3.2 Fisher Kernel Framework

The Fisher kernel framework [18] was proposed to directly obtain a kernel function from a generative probability model. Consider a parametric class of probability models $P(X|\theta)$, where $\theta \in \Theta \subset \mathbb{R}^l$ for some positive integer $l$. If the dependence on $\theta$ is sufficiently smooth, the collection of models with parameters from $\Theta$ can be viewed as a manifold $M_\Theta$. Through applying a scalar product at each point $P(X|\theta) \in M_\Theta$, it can be turned into a Riemannian manifold [56].

We denote a text or an image $X = \{x_i, 1 \le i \le |X|\}$, where $x_i$ is the embedding of the $i$th word of a text or the SIFT descriptor of the $i$th keypoint of an image, and $|X|$ is the number of words in a text or the number of extracted keypoints in an image. Each $x_i$ is a $D$-dimensional word embedding or SIFT descriptor. We should note that there may be different parameters for different data sets. According to the Fisher kernel framework, $X$ can be modeled by a probability density function. In this work, $P(X|\theta)$ is given by a Gaussian mixture model (GMM), which is a sum of $N$ Gaussians $\mathcal{N}(\mu_i, \Sigma_i)$ with weights $\omega_i$. Let $\theta = \{\omega_i, \mu_i, \Sigma_i,\ \forall i = 1 \ldots N\}$ be the set of GMM parameters. The parameters $\theta$ are estimated through the optimization of the Maximum Likelihood (ML) criterion using the Expectation Maximization (EM) method [57].

Based on the learned parameter set $\theta$, a text or an image $X$ can be characterized by the gradient vector given by the following function:

$$G_\theta^X = \nabla_\theta \log P(X|\theta) = \left( \frac{\partial}{\partial \theta_1} \log P(X|\theta), \ldots, \frac{\partial}{\partial \theta_l} \log P(X|\theta) \right), \quad (1)$$

where $G_\theta^X$ is a vector whose dimensionality depends only on the number of parameters in $\theta$, not on the number of words or keypoints. The gradient describes the contribution of each individual parameter to the generative process. It can also be interpreted as how these parameters contribute to the process of generating an example. We follow the work described in [18] for normalizing these gradients by incorporating the Fisher information matrix (FIM) $F_\theta$. According to the theory of information geometry [58], $U = \{P(X|\theta), \theta \in \Theta\}$, which is a parametric family of distributions, can be regarded as a Riemannian manifold $M_\Theta$ with a local metric given by the FIM $F_\theta \in \mathbb{R}^{M \times M}$:

$$F_\theta = E\left[ \nabla_\theta \log P(X|\theta)\, \nabla_\theta \log P(X|\theta)^T \right]. \quad (2)$$

The similarity between two samples $X$ and $Y$ can be measured by the Fisher kernel, defined as:

$$K_{FK}(X, Y) = {G_\theta^X}^T F_\theta^{-1} G_\theta^Y. \quad (3)$$

Since $F_\theta$ is symmetric and positive definite, $F_\theta^{-1}$ can be transformed to $L_\theta^T L_\theta$ based on the Cholesky decomposition. Therefore, $K_{FK}(X, Y)$ can be rewritten as follows:

$$K_{FK}(X, Y) = {\mathcal{G}_\theta^X}^T \mathcal{G}_\theta^Y, \quad (4)$$

where

$$\mathcal{G}_\theta^X = L_\theta G_\theta^X = L_\theta \nabla_\theta \log P(X|\theta). \quad (5)$$
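A minimal sketch of this construction in Python follows, assuming scikit-learn is available for the EM-based GMM estimation; it builds the normalized mean and standard-deviation gradients in the closed form derived just below (Eqs. (7) and (8)). It is an illustrative reimplementation under those assumptions, not the INRIA toolkit the paper actually uses.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm(descriptors, n_components=8, seed=0):
    """Estimate diagonal-covariance GMM parameters by EM (the ML criterion)."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", random_state=seed)
    gmm.fit(descriptors)
    return gmm

def fisher_vector(X, gmm):
    """Aggregate a variable-size point set X (|X| x D) into a 2*N*D Fisher vector."""
    w, var = gmm.weights_, gmm.covariances_      # var = sigma^2 per component
    mu, sigma = gmm.means_, np.sqrt(var)
    gamma = gmm.predict_proba(X)                 # soft assignments, Eq. (7)
    T = X.shape[0]
    d_mu, d_sigma = [], []
    for k in range(gmm.n_components):
        diff = (X - mu[k]) / sigma[k]            # term-by-term division
        g = gamma[:, k][:, None]
        # Gradients w.r.t. the means and standard deviations, Eq. (8)
        d_mu.append((g * diff).sum(axis=0) / (T * np.sqrt(w[k])))
        d_sigma.append((g * (diff ** 2 - 1)).sum(axis=0) / (T * np.sqrt(2 * w[k])))
    return np.concatenate(d_mu + d_sigma)

# Example: 50 word embeddings of dimension 16 -> one fixed-length FV.
rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 16))              # descriptors used to fit the GMM
gmm = fit_gmm(train)
fv = fisher_vector(rng.normal(size=(50, 16)), gmm)
print(fv.shape)                                  # (2 * 8 * 16,) = (256,)
```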

Fig. 4. A single hidden layer model for mapping FVs of different modalities. FVi and FVt denote the Fisher vectors of image and text, respectively. h represents the hidden layer.

Fig. 5. A graphical model representation of the restricted Boltzmann machine.

In this work, we assume that the $x_i$ $(1 \le i \le |X|)$ follow the naive independence model, so $\mathcal{G}_\theta^X$ can be rewritten as follows:

$$\mathcal{G}_\theta^X = \sum_{i=1}^{|X|} L_\theta \nabla_\theta \log P(x_i|\theta). \quad (6)$$

$\mathcal{G}_\theta^X$ is also referred to as the Fisher Vector of $X$.

Based on the specific probability density function used in this work, the GMM, the FV of $X$ is taken with respect to the mean $\mu$ and standard deviation $\sigma$ of all the mixed Gaussian distributions. Let $\gamma_{x_i}(k)$ be the soft assignment of $x_i$ in $X$ to Gaussian $k$:

$$\gamma_{x_i}(k) = P(k|x_i, \theta) = \frac{\omega_k P_k(x_i|\theta)}{\sum_{j=1}^{N} \omega_j P_j(x_i|\theta)}. \quad (7)$$

Mathematical derivations lead to:

$$\mathcal{G}_{\mu,k}^X = \frac{1}{|X|\sqrt{\omega_k}} \sum_{i=1}^{|X|} \gamma_{x_i}(k) \left( \frac{x_i - \mu_k}{\sigma_k} \right), \qquad
\mathcal{G}_{\sigma,k}^X = \frac{1}{|X|\sqrt{2\omega_k}} \sum_{i=1}^{|X|} \gamma_{x_i}(k) \left[ \frac{(x_i - \mu_k)^2}{\sigma_k^2} - 1 \right]. \quad (8)$$

The division between vectors is a term-by-term operation. The final gradient vector $\mathcal{G}_\theta^X$ is the concatenation of the $\mathcal{G}_{\mu,k}^X$ and $\mathcal{G}_{\sigma,k}^X$ vectors for $k = 1 \ldots N$. Let $T$ be the dimensionality of the vector offsets; the final gradient vector $\mathcal{G}_\theta^X$ is therefore $2NT$-dimensional.

3.3 Mapping Function Learning

To transfer the FVs of one modality to another, we propose to use a deep belief network with one hidden layer. Fig. 4 shows the structure of the proposed method. The building block of the network used in this work is the Gaussian restricted Boltzmann machine. Because we have already converted both textual and visual information into the gradient space of a Riemannian manifold, we use a single hidden layer model in this work.

The restricted Boltzmann machine is a kind of undirected graphical model with observed units and hidden units. The undirected graph of an RBM has a bipartite structure. It can be understood as a Markov random field with latent factors which explain the observed input data using binary hidden variables. Let $v$ be the $L$-dimensional observed data, which can take real or binary values. The dimension of the stochastic binary units $h$ is $K$. Each visible unit is connected to each hidden unit. The graphical model representation is illustrated in Fig. 5. The parameters of the RBM consist of the weight matrix $W \in \mathbb{R}^{L \times K}$, the biases $c \in \mathbb{R}^L$ for the observed units, and the biases $b \in \mathbb{R}^K$ for the hidden units. If the observed units are real-valued, the model is called the Gaussian RBM. Its joint probability distribution can be defined as follows:

$$P(v, h) = \frac{1}{Z} \exp(-E(v, h)), \qquad
E(v, h) = \frac{1}{2\sigma^2} \sum_i (v_i - c_i)^2 - \frac{1}{\sigma} \sum_{i,j} v_i W_{ij} h_j - \sum_j b_j h_j, \quad (9)$$

where $Z$ is a normalization constant. The conditional distributions of this model can be written as follows:

$$P(h_j = 1|v) = \operatorname{sigm}\left( \frac{1}{\sigma} \sum_i W_{ij} v_i + b_j \right), \quad (10)$$

$$P(v_i|h) = \mathcal{N}\left( v_i;\ \sigma \sum_j W_{ij} h_j + c_i,\ \sigma^2 \right), \quad (11)$$

where $\operatorname{sigm}(s) = \frac{1}{1 + \exp(-s)}$ is the sigmoid function, and $\mathcal{N}(\cdot\,;\cdot,\cdot)$ is a Gaussian distribution.

Although exact maximum likelihood learning in this model is intractable, sampling-based approximate maximum-likelihood methods can be used to estimate the parameters. Because the variables in a layer are conditionally independent, block Gibbs sampling can be performed in parallel. After training the RBM, Fisher vectors of different modalities can be transferred with the estimated parameters.
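The following is a minimal sketch of how such a Gaussian-Bernoulli RBM can be trained with one step of contrastive divergence (CD-1), a standard sampling-based approximation. The paper does not spell out its training procedure, so the use of CD-1, the hyper-parameters, and the choice of fixing sigma to 1 are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigm(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_gaussian_rbm(V, K=64, lr=1e-3, epochs=10):
    """CD-1 training of a Gaussian-Bernoulli RBM (sigma fixed to 1).

    V: (n_samples x L) real-valued inputs, e.g. Fisher vectors of one modality.
    Returns weights W (L x K), visible biases c (L,), hidden biases b (K,).
    """
    n, L = V.shape
    W = 0.01 * rng.normal(size=(L, K))
    c, b = np.zeros(L), np.zeros(K)
    for _ in range(epochs):
        # Positive phase: P(h = 1 | v), Eq. (10)
        ph = sigm(V @ W + b)
        h = (rng.random(ph.shape) < ph).astype(float)
        # Negative phase: reconstruct v from h with the Gaussian mean, Eq. (11)
        v_neg = h @ W.T + c
        ph_neg = sigm(v_neg @ W + b)
        # Approximate gradient: data statistics minus model statistics
        W += lr * (V.T @ ph - v_neg.T @ ph_neg) / n
        c += lr * (V - v_neg).mean(axis=0)
        b += lr * (ph - ph_neg).mean(axis=0)
    return W, c, b

# Example: learn hidden activations for 256-d Fisher vectors.
V = rng.normal(size=(500, 256))
W, c, b = train_gaussian_rbm(V)
hidden = sigm(V @ W + b)       # deterministic up-pass used after training
print(hidden.shape)            # (500, 64)
```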

3.4 Hash Code Generation

Through the previous steps, a variable-length set of text segments or keypoints can be transferred to a fixed length vector. However, Fisher vectors are usually high dimensional and dense, which limits their use in large-scale applications, where computational requirements must be considered. In this work, we propose to use hashing methods to address the efficiency problem.

The task of generating hash codes for samples can be formalized as learning a mapping $b(x)$, referred to as a hash function, which projects $p$-dimensional real-valued inputs $x \in \mathbb{R}^p$ onto $q$-dimensional binary codes $h \in H \subseteq \{-1, 1\}^q$, while preserving the similarities between samples in the original and transformed spaces. The mapping $b(x)$ can be parameterized by a real-valued vector $w$ as:

$$b(x; w) = \operatorname{sign}(f(x; w)), \quad (12)$$

where $\operatorname{sign}(\cdot)$ represents the element-wise sign function, and $f(x; w)$ denotes a real-valued transformation from $\mathbb{R}^p$ to $\mathbb{R}^q$. In this work, Fisher vectors of text segments or keypoints are the $x$ in the mapping function $b(x; w)$. A variety of existing methods have been proposed to achieve this task under this framework, using different forms of $f$ and different optimization objectives. Most learning-to-hash methods for dense vectors can be used in this framework. In this work, we evaluated several state-of-the-art hashing methods, whose performances are shown in the experiment section.
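As one concrete instance of Eq. (12), the sketch below uses random hyperplanes for f(x; w) = Wx, i.e., locality-sensitive hashing for cosine similarity. This is a deliberately simple stand-in: the paper itself generates codes with Semantic Hashing [33], so the choice of a random linear f here is an assumption made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_hash_fn(p, q):
    """b(x; W) = sign(W x): q random hyperplanes in R^p (LSH for cosine)."""
    W = rng.normal(size=(q, p))
    return lambda x: np.where(W @ x >= 0, 1, -1).astype(np.int8)

def hamming(a, b):
    return int(np.count_nonzero(a != b))

b = make_hash_fn(p=256, q=64)        # 64-bit codes from 256-d Fisher vectors
x = rng.normal(size=256)
y = x + 0.1 * rng.normal(size=256)   # a slightly perturbed neighbour
z = rng.normal(size=256)             # an unrelated vector
print(hamming(b(x), b(y)))           # small: similar inputs -> similar codes
print(hamming(b(x), b(z)))           # large: about q/2 for unrelated inputs
```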
4 EXPERIMENTS

To demonstrate the effectiveness of the proposed method, we compare and contrast the experimental results of SCMH and state-of-the-art hashing methods on three commonly used data sets for cross-media retrieval.

4.1 Data Sets

The three data sets used in this work contain both texts and images. They have been chosen for the purpose of evaluating various cross-media retrieval methods.

Flickr. The MIR Flickr data set (http://press.liacs.nl/mirflickr/) [59], which consists of one million images along with their user assigned tags, was collected from Flickr. Out of all the images, 25,000 images are annotated for 24 concepts, including object categories (e.g., bird, people) and scene categories (e.g., sky, night). A stricter annotation was made on 14 concepts, where a subset of the positive images was selected only if the concept is salient in the image. This leads to a total of 38 concepts for this data set. Following previous works, each image may belong to one or more concepts. Image-text pairs are considered to be similar if they share the same concept.

LabelMe. The LabelMe data set (http://people.csail.mit.edu/torralba/code/spatialenvelope/) [60] contains 2,688 images, which belong to eight outdoor scene categories: coast, mountain, forest, open country, street, inside city, tall buildings and highways. All the objects in these images have been fully labeled and are used as tags of the images. Following the work described in [13], tags occurring in fewer than three images are discarded, leaving a total of 245 unique tags. To construct the gold standard, we also follow previous works and regard image-text pairs as similar if they share the same scene label.

NUS-WIDE. The NUS-WIDE data set (http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm) [61] contains images and their associated tags from Flickr. The total numbers of images and unique tags are 269,648 and 5,018 respectively. The dataset includes six kinds of low-level features extracted from these images and 81 manually constructed ground-truth concepts. For comparison with previous methods, we used the 10 most common concepts and randomly selected 20,000 images from them for evaluation. We treat image-text pairs labeled with the same concepts as similar.

4.2 Experiment Settings

For multimodal documents, we use the SIFT framework to represent images and word embeddings to represent text. We use the SIFT keypoint detector to extract a variable number of keypoints for each image and calculate the descriptors of the keypoints using 128-dimensional SIFT descriptors. The toolkit used in this work is VLFeat 0.9.19 (http://www.vlfeat.org/). The word embeddings used in this work are pre-trained vectors trained on part of a Google News dataset (about 100 billion words). A Skip-gram model [62] was used to generate these 300-dimensional vectors for three million words and phrases. For generating Fisher vectors, we use the implementation of INRIA [63].

To demonstrate the effectiveness of the proposed method, we evaluated the following state-of-the-art methods on the three data sets:

- Cross-view Hashing (CVH) [15] maps similar objects to similar codes across the views to enable cross-view similarity search.
- Discriminative coupled dictionary hashing (DCDH) [14] generates a coupled dictionary for each modality based on category labels.
- Multi-view discriminative coupled dictionary hashing (MV-DCDH) [14] extends DCDH with multi-view representations to enhance the representing capability of the relatively "weak" modalities.
- Latent semantic sparse hashing (LSSH) [13] uses matrix factorization to represent text and sparse coding to capture the salient structures of images.
- Collective matrix factorization hashing (CMFH) [1] generates unified hash codes for the different modalities of one instance through collective matrix factorization with a latent factor model.
- Semantic correlation maximization (SCM) [6] integrates semantic labels into the hashing learning procedure to preserve the semantic similarity across modalities.

The toolkits of LSSH, DCDH, and MV-DCDH were kindly provided by the authors. As mentioned in the previous section, the proposed method SCMH can incorporate any hashing method for a single modality. In this work, we use Semantic Hashing to generate hash codes for both textual and visual information. Semantic Hashing [33] is a multi-layer neural network with a small central layer that converts high-dimensional input vectors into low-dimensional codes (http://www.cs.toronto.edu/~hinton/). All methods generate hash codes of 32, 64, and 128 bits.

Following previous literature on this task, we adopt the widely used Mean Average Precision (MAP) as the evaluation metric. For a single query and top-K retrieved instances, Average Precision (AP) is defined as follows:

$$AP = \frac{1}{R} \sum_{k=1}^{K} P(k)\,\delta(k),$$

where $R$ denotes the number of ground-truth instances in the retrieved set, $P(k)$ denotes the precision of the top-$k$ retrieved instances, and $\delta(k)$ is an indicator function which equals 1 if the $k$th instance is relevant to the query and 0 otherwise. In the experiments, we set $K = 50$. Besides MAP, we also report precision-recall curves to represent the precision at different recall levels.
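A small sketch of this metric, following the definition above (with R counting the ground-truth instances within the retrieved top-K list), might look as follows; the rankings in the usage example are invented.

```python
import numpy as np

def average_precision(relevant, K=50):
    """AP for one query; `relevant` is a 0/1 list in ranked order."""
    rel = np.asarray(relevant[:K], dtype=float)
    R = rel.sum()                      # ground-truth instances retrieved
    if R == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)   # P(k)
    return float((precision_at_k * rel).sum() / R)                # delta(k) mask

def mean_average_precision(all_queries, K=50):
    return float(np.mean([average_precision(r, K) for r in all_queries]))

# Two toy queries: 1 marks a relevant instance at that rank.
queries = [[1, 0, 1, 0, 0], [0, 1, 1, 1, 0]]
print(mean_average_precision(queries, K=5))
```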

We report the results of the Text → Image and Image → Text tasks on all three databases. For the Text → Image task, a text query, which contains the annotated tags of an image, is input to search for images. The text query is first represented by a Fisher vector based on word embeddings. Then, the FV of the text is mapped into an FV in image space. Finally, Hamming distance is used to measure the similarities between the hash code of the converted FV and the hash codes of the images. The top-K images are selected as the results. The procedure of the Image → Text task is similar to that of the Text → Image task. Since the Fisher vector mapping function needs training data, for each data set we select 40 percent of the data to train the mapping function between text and image. 35 percent of the data are chosen as the retrieval database, and the rest form the query set. All the methods use the same data splits.
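Pulling the earlier sketches together, the query procedure just described can be outlined in a few lines. The function and argument names below are placeholders standing in for the components of Sections 3.2 to 3.4, not names from the paper, and the stand-ins in the usage example (mean pooling, an identity mapping, random hyperplane codes) are trivial substitutes for the learned components.

```python
import numpy as np

def text_to_image_query(tag_embeddings, fisher_vector, map_to_image_fv,
                        hash_code, db_codes, K=50):
    """Hedged outline of the Text -> Image retrieval step.

    fisher_vector:   aggregates (n x D) embeddings into one vector (Sec. 3.2)
    map_to_image_fv: learned text-FV -> image-FV mapping (Sec. 3.3)
    hash_code:       function b(x) returning a +-1 code (Sec. 3.4)
    db_codes:        (n_images x q) +-1 codes of the image database
    Returns indices of the K database images closest in Hamming distance.
    """
    fv_text = fisher_vector(tag_embeddings)     # aggregate tags into one vector
    fv_image = map_to_image_fv(fv_text)         # cross-modal mapping
    code = hash_code(fv_image)                  # binarize, Eq. (12)
    dists = np.count_nonzero(db_codes != code, axis=1)
    return np.argsort(dists)[:K]

# Usage with trivial stand-ins for the learned components:
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 300))
b = lambda x: np.where(W @ x >= 0, 1, -1)
db = np.stack([b(v) for v in rng.normal(size=(100, 300))])
top = text_to_image_query(rng.normal(size=(5, 300)),
                          lambda e: e.mean(axis=0), lambda f: f, b, db, K=5)
print(top)
```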
4.3 Results and Discussions

4.3.1 Results on Flickr

Table 1 shows the comparison of the proposed method with the state-of-the-art methods on the Flickr MIR data set.

TABLE 1
MAP Comparison on Flickr

Task            Method     Code Length
                           32       64       128
Text → Image    CVH        0.615    0.613    0.610
                DCDH       0.577    0.598    0.611
                MV-DCDH    0.600    0.603    0.614
                LSSH       0.623    0.634    0.626
                CMFH       0.625    0.630    0.632
                SCM        0.624    0.606    0.600
                SCMH       0.640    0.644    0.645
Image → Text    CVH        0.609    0.601    0.602
                DCDH       0.610    0.621    0.622
                MV-DCDH    0.604    0.614    0.619
                LSSH       0.618    0.630    0.617
                CMFH       0.619    0.626    0.621
                SCM        0.614    0.620    0.623
                SCMH       0.643    0.650    0.649

From the results, we observe that the proposed method SCMH achieves the best performance among all the methods on both the Text → Image and Image → Text tasks. LSSH achieves the second best results in most cases and approaches the best result when the hash code length is 64. However, when the hash code length is increased to 128, the performances of LSSH and SCM decrease. On the contrary, the performance of SCMH with different lengths of hash codes is more robust. The main possible reason is that the performance of SCMH is chiefly determined by the mapping functions between FVs of different modalities. If we use the cosine similarity between Fisher vectors to rank candidates, the MAP results can reach 0.682 and 0.678 on the Text → Image and Image → Text tasks respectively.

The precision-recall curves (PR-curves) are plotted in Fig. 6, where the x-axis denotes the recall and the y-axis indicates the corresponding precision. From these figures, we observe that SCMH outperforms the other methods on all tasks, especially with long hash codes. The performances of CVH, DCDH, MV-DCDH, LSSH, CMFH, and SCM decrease much more quickly than that of SCMH. This also confirms that the proposed SCMH better suits the tasks of cross-media retrieval.

4.3.2 Results on LabelMe

Table 2 compares the relative performances of the different methods on the LabelMe dataset. Fig. 7 gives the PR-curves of the different methods on this dataset. From the results, we observe that SCMH achieves better performance than the state-of-the-art methods on all tasks. From analyzing the data, we find that different tags belonging to the same category may express similar or related meanings. Since such semantic relations can be readily captured by the proposed method, SCMH outperforms the other methods. As the length of the hash code increases, the MAP performance of SCMH improves. However, when the hash code length approaches 128, the performances of most of the methods except SCMH decrease.

Fig. 6. The precision-recall curves of different hash code generation methods on the Flickr data set.

Compared with the Flickr dataset, the total numbers of images and unique tags are much smaller. Hence, the main possible reason is that longer hash codes encode more explicit information, and the inability to capture the semantic level similarities between tags decreases the performance. We also observe that SCMH achieves better performance on the Text → Image task than on the Image → Text task. The DCDH, MV-DCDH, LSSH, CMFH, and SCM methods all behave differently from SCMH, achieving better performance on the Image → Text task. The main reason is possibly that word level semantic similarities can be better captured through SCMH than keypoints represented by SIFT descriptors. On the Image → Text task, the performance of SCMH is slightly worse than LSSH when the hash code length is 32 bits. We think the main reason is that the size of the LabelMe dataset and the number of tags occurring in it are both too small.

From the PR-curves shown in Fig. 7, we also observe that although SCMH has similar performance to LSSH on the Image → Text task, the precision of SCMH decreases much more slowly. This means that SCMH can achieve better results when the user needs more candidates. We also observe from the figure that the improvements of SCMH on the Image → Text task are relatively marginal compared to those on the Text → Image task at all recall levels. This also confirms the phenomenon described above.

TABLE 2
MAP Comparison on LabelMe

Task            Method     Code Length
                           32       64       128
Text → Image    CVH        0.400    0.370    0.349
                DCDH       0.410    0.449    0.424
                MV-DCDH    0.437    0.476    0.448
                LSSH       0.665    0.695    0.671
                CMFH       0.589    0.601    0.610
                SCM        0.613    0.621    0.615
                SCMH       0.694    0.701    0.714
Image → Text    CVH        0.343    0.342    0.338
                DCDH       0.416    0.466    0.428
                MV-DCDH    0.448    0.480    0.455
                LSSH       0.670    0.673    0.687
                CMFH       0.636    0.644    0.652
                SCM        0.622    0.630    0.636
                SCMH       0.662    0.676    0.688

TABLE 3
MAP Comparison on NUS-WIDE

Task            Method     Code Length
                           32       64       128
Text → Image    CVH        0.435    0.426    0.418
                DCDH       0.468    0.486    0.484
                MV-DCDH    0.479    0.487    0.484
                LSSH       0.504    0.509    0.504
                CMFH       0.504    0.510    0.508
                SCM        0.526    0.528    0.530
                SCMH       0.552    0.560    0.556
Image → Text    CVH        0.437    0.426    0.421
                DCDH       0.460    0.476    0.481
                MV-DCDH    0.462    0.474    0.478
                LSSH       0.504    0.501    0.498
                CMFH       0.512    0.514    0.511
                SCM        0.531    0.539    0.541
                SCMH       0.590    0.597    0.593

4.3.3 Results on NUS-WIDE

The results of the different methods on the NUS-WIDE dataset are shown in Table 3, and the corresponding PR-curves are given in Fig. 8. From the results, we observe that SCMH achieves significantly better performance than the state-of-the-art methods on all tasks. The relative improvements of SCMH over the second best results are 10.0 and 18.5 percent on the Text → Image and Image → Text tasks respectively. Compared with the results of SCMH on the LabelMe and Flickr datasets, the improvement of SCMH on NUS-WIDE is more significant. The main possible reason is that the number of tags, selected based on their frequency, used in this dataset is larger than for LabelMe and Flickr. There are only a total of 245 unique tags occurring more than three times in the whole LabelMe dataset. For comparison with other methods, we selected the top 500 most frequent tags in the Flickr data set. Since NUS-WIDE is a more practical dataset, which contains more unique tags, we propose to use the top 1,000 most frequent tags. Hence, the weakness of the other methods in capturing the semantic level similarities between tags decreases their performance.

Fig. 7. The precision-recall curves of different hash code generation methods on the LabelMe data set.

Fig. 8. The precision-recall curves of different hash code generation methods on the NUS-WIDE data set.

This demonstrates, to some degree, that the proposed method SCMH is more appropriate for practical environments. From the PR-curves illustrated in Fig. 8, we observe a similar phenomenon as on LabelMe and Flickr: the precision of SCMH decreases much more slowly. When recall reaches 20 percent, the relative improvement of SCMH over LSSH is more than 28.2 percent on the Image → Text task.

To further analyze the results given by the different methods, we calculate the cosine similarities between the textual descriptions of queries and the correct results in the top 50 lists on the Image → Text task. Fig. 9 shows the distribution of cosine similarities. In the figure, the x-axis denotes the ranges of cosine similarity and the y-axis the number of correct results in each range. From the results, we can see that SCMH can find more correct results whose cosine similarity with the corresponding query is less than 10 percent. This demonstrates, to some degree, the effectiveness of SCMH in capturing semantic textual similarities.

In summary, the evaluation results on the three data sets demonstrate conclusively that the proposed SCMH method is superior to the state-of-the-art methods when measured using commonly accepted performance metrics on data sets that are commonly used for evaluating cross-media retrieval.

4.3.4 Parameter Sensitivity

To analyze the sensitivity of the hyper-parameters of SCMH, we conducted several empirical experiments on all the datasets. For easy comparison with previous methods, we set the hash code length to 64 bits. Fig. 10 shows the performance of SCMH with different percentages of training data. In the two figures, the x-axis denotes the percentage of data used for training and the y-axis denotes the MAP performance. The data used for constructing the retrieval set and the query set are the same as those used in the previous section. From the figures, we observe that as the amount of training data increases, the MAP performance of SCMH consequently improves on all data sets. When the percentage of training data is over 30 percent of the whole dataset, the MAP performance increases slowly. The main reason may be that the number of categories or concepts included in these data sets is small. On the other hand, this shows that the proposed method SCMH can achieve acceptable results with only a few ground truths. Hence, it can be easily adapted to other data sets.

Since the training process of the mapping function is solved by an iterative procedure, we also evaluate its convergence properties. Fig. 11 shows the MAP performance of SCMH on the Image → Text and Text → Image tasks. In the two figures, the x-axis denotes the number of iterations for optimizing the mapping function and the y-axis denotes the MAP performance.

Fig. 9. Distribution of cosine similarities between queries and results on the Image → Text task on the NUS-WIDE dataset.

Fig. 10. Effects of training size on MAP performance on the Image → Text and Text → Image tasks.

Fig. 11. Effects of the number of iterations on MAP performance on the Image → Text and Text → Image tasks.

From these figures, we observe that SCMH can converge within 10 iterations on all three data sets. This means that SCMH can achieve stable and superior performance under a wide range of parameter values. An unusual point occurs on the LabelMe dataset on the Image → Text task: the result is best with three iterations. The main possible reason is that the size of LabelMe is relatively small compared to Flickr and NUS-WIDE. Hence, the results may be more sensitive on the LabelMe dataset.

4.3.5 Efficiency Evaluation

Due to the requirement of processing huge amounts of data, efficiency is also an important issue. In this work, we compare the running time of the proposed approach with other hashing learning methods. Although the offline stage of the proposed framework requires massive computational cost, the computational complexity of the online stage is small or comparable to other hashing methods.

Fig. 12. The efficiency comparison of different hashing methods.

Fig. 12 shows the efficiency comparison of the different hashing methods. We implemented all methods to run on a single thread on the same machine, which contains Xeon quad core CPUs (2.53 GHz) and 32 GB RAM. All the methods take the text query as input. The processing time is measured from receiving the input to generating the hash codes. Since in practical usage queries are usually out-of-sample ones, we compare the proposed method with Spectral Hashing and Semantic Hashing. For processing the out-of-sample extension of spectral hashing, we use the Nystrom method [64]. From the results, we can observe that the computational complexity of the proposed method is comparable with state-of-the-art hashing methods. Compared to the methods based on matrix factorization, the proposed method is much more efficient. In this work, we use Semantic Hashing to generate the hash codes of FVs; hence, additional processing time is required to perform this calculation. However, if a less complex hashing method is used, the efficiency can be further improved. This demonstrates that the proposed method is applicable to large scale applications.
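A further reason hashing-based query processing is cheap online is that ranking by Hamming distance reduces to XOR-plus-popcount over packed bits. The sketch below shows this standard implementation detail (it is not something the paper specifies) with numpy's packed uint8 representation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pack one million 64-bit codes into uint8 rows: 8 bytes per code.
db_bits = rng.integers(0, 2, size=(1_000_000, 64), dtype=np.uint8)
db = np.packbits(db_bits, axis=1)

def hamming_rank(query_bits, db, K=50):
    """Rank packed codes by Hamming distance to one query code."""
    q = np.packbits(query_bits)
    xor = np.bitwise_xor(db, q)                     # differing bits, bytewise
    dists = np.unpackbits(xor, axis=1).sum(axis=1)  # popcount per code
    return np.argsort(dists)[:K]

print(hamming_rank(db_bits[123], db)[:5])           # index 123 ranks first
```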
5 CONCLUSIONS

In this work, we propose a novel hashing method, SCMH, to perform the near-duplicate detection and cross-media retrieval tasks. We propose to use a set of word embeddings to represent textual information. The Fisher kernel framework is incorporated to represent both textual and visual information with fixed length vectors. For mapping the Fisher vectors of different modalities, a deep belief network is proposed to perform the task. We evaluate the proposed method SCMH on three commonly used data sets. SCMH achieves better results than state-of-the-art methods with different lengths of hash codes. On the NUS-WIDE data set, the relative improvements of SCMH over LSSH, which achieves the second best results on these datasets, are 10.0 and 18.5 percent on the Text → Image and Image → Text tasks respectively. Experimental results demonstrate the effectiveness of the proposed method on the cross-media retrieval task.

ACKNOWLEDGMENTS

This work was partially funded by the National Natural Science Foundation of China (No. 61532011, 61473092, and 61472088), the National High Technology Research and Development Program of China (No. 2015AA015408), and Shanghai Science and Technology Development Funds (13dz2260200, 13511504300). J. Qian is the corresponding author.

REFERENCES

[1] G. Ding, Y. Guo, and J. Zhou, “Collective matrix factorization hashing for multimodal data,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 2083–2090.
[2] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, “Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 12, pp. 2916–2929, Dec. 2013.
[3] Y. Pan, T. Yao, T. Mei, H. Li, C.-W. Ngo, and Y. Rui, “Click-through-based cross-view learning for image search,” in Proc. 37th Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2014, pp. 717–726.
[4] J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen, “Inter-media hashing for large-scale retrieval from heterogeneous data sources,” in Proc. Int. Conf. Manage. Data, 2013, pp. 785–796.
[5] D. Zhai, H. Chang, Y. Zhen, X. Liu, X. Chen, and W. Gao, “Parametric local multimodal hashing for cross-view similarity search,” in Proc. 23rd Int. Joint Conf. Artif. Intell., 2013, pp. 2754–2760.
[6] D. Zhang and W.-J. Li, “Large-scale supervised multimodal hashing with semantic correlation maximization,” in Proc. 28th AAAI Conf. Artif. Intell., 2014, pp. 2177–2183.
[7] Y. Zhuang, Y. Yang, F. Wu, and Y. Pan, “Manifold learning based cross-media retrieval: A solution to media object complementary nature,” J. VLSI Signal Process. Syst. Signal, Image Video Technol., vol. 46, pp. 153–164, 2007.
[8] F. Wu, H. Zhang, and Y. Zhuang, “Learning semantic correlations for cross-media retrieval,” in Proc. IEEE Int. Conf. Image Process., 2006, pp. 1465–1468.

for cross-media retrieval," in Proc. IEEE Int. Conf. Image Process., 2006, pp. 1465–1468.
[9] Y. Yang, Y.-T. Zhuang, F. Wu, and Y.-H. Pan, "Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval," IEEE Trans. Multimedia, vol. 10, no. 3, pp. 437–446, Apr. 2008.
[10] E. P. Xing, R. Yan, and A. G. Hauptmann, "Mining associated text and images with dual-wing harmoniums," in Proc. 21st Conf. Uncertainty Artif. Intell., 2005, pp. 633–641.
[11] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, "Multimodal deep learning," in Proc. 28th Int. Conf. Mach. Learn., 2011, pp. 689–696.
[12] N. Srivastava and R. Salakhutdinov, "Multimodal learning with deep Boltzmann machines," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 2222–2230.
[13] J. Zhou, G. Ding, and Y. Guo, "Latent semantic sparse hashing for cross-modal similarity search," in Proc. 37th Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2014, pp. 415–424.
[14] Z. Yu, F. Wu, Y. Yang, Q. Tian, J. Luo, and Y. Zhuang, "Discriminative coupled dictionary hashing for fast cross-media retrieval," in Proc. 37th Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2014, pp. 395–404.
[15] S. Kumar and R. Udupa, "Learning hash functions for cross-view similarity search," in Proc. Int. Joint Conf. Artif. Intell., 2011, pp. 1360–1365.
[16] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin, "Placing search in context: The concept revisited," in Proc. 10th Int. Conf. World Wide Web, 2001, pp. 406–414.
[17] Q. Zhang, J. Kang, J. Qian, and X. Huang, "Continuous word embeddings for detecting local text reuses at the semantic level," in Proc. 37th Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2014, pp. 797–806.
[18] T. Jaakkola and D. Haussler, "Exploiting generative models in discriminative classifiers," in Proc. Adv. Neural Inf. Process. Syst., 1999, pp. 487–493.
[19] D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. Int. Conf. Comput. Vis., 1999, p. 1150.
[20] X. Wang, Y. Liu, D. Wang, and F. Wu, "Cross-media topic mining on Wikipedia," in Proc. 21st ACM Int. Conf. Multimedia, 2013, pp. 689–692.
[21] H. Zhang, J. Yuan, X. Gao, and Z. Chen, "Boosting cross-media retrieval via visual-auditory feature analysis and relevance feedback," in Proc. ACM Int. Conf. Multimedia, 2014, pp. 953–956.
[22] P. Daras, S. Manolopoulou, and A. Axenopoulos, "Search and retrieval of rich media objects supporting multiple multimodal queries," IEEE Trans. Multimedia, vol. 14, no. 3, pp. 734–746, Jun. 2012.
[23] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Boston, MA, USA, Jun. 2015, pp. 3128–3137.
[24] A. Z. Broder, "On the resemblance and containment of documents," in Proc. SEQUENCES, 1997, p. 21.
[25] A. Z. Broder, "Identifying and filtering near-duplicate documents," in Proc. Combinatorial Pattern Matching, 2000, pp. 1–10.
[26] M. Theobald, J. Siddharth, and A. Paepcke, "SpotSigs: Robust and efficient near duplicate detection in large web collections," in Proc. 31st Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2008, pp. 563–570.
[27] A. Chowdhury, O. Frieder, D. Grossman, and M. C. McCabe, "Collection statistics for fast duplicate document detection," ACM Trans. Inf. Syst., vol. 20, no. 2, pp. 171–191, 2002.
[28] A. Kołcz, A. Chowdhury, and J. Alspector, "Improved robustness of signature-based near-replica detection via lexicon randomization," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2004, pp. 605–610.
[29] J. Seo and W. B. Croft, "Local text reuse detection," in Proc. 31st Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2008, pp. 571–578.
[30] Q. Zhang, Y. Zhang, H. Yu, and X. Huang, "Efficient partial-duplicate detection based on sequence matching," in Proc. 33rd Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2010, pp. 571–578.
[31] J. W. Kim, K. S. Candan, and J. Tatemura, "Efficient overlap and content reuse detection in blogs and online news articles," in Proc. Int. Conf. World Wide Web, 2009, pp. 571–578.
[32] Y. Gong, S. Kumar, V. Verma, and S. Lazebnik, "Angular quantization-based binary codes for fast similarity search," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1196–1204.
[33] G. Hinton and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, pp. 504–507, 2006.
[34] B. Kulis and T. Darrell, "Learning to hash with binary reconstructive embeddings," in Proc. Adv. Neural Inf. Process. Syst., 2009, pp. 1042–1050.
[35] K. Grauman and R. Fergus, "Learning binary hash codes for large-scale image search," in Proc. Mach. Learn. Comput. Vis., 2013, pp. 49–87.
[36] Y. Weiss, A. Torralba, and R. Fergus, "Spectral hashing," in Proc. Adv. Neural Inf. Process. Syst., 2008.
[37] Y. Zhen and D.-Y. Yeung, "A probabilistic model for multimodal hash function learning," in Proc. 18th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2012, pp. 940–948.
[38] P. Wang, B. Xu, Y. Wu, and X. Zhou, "Link prediction in social networks: The state-of-the-art," Sci. China Inf. Sci., vol. 58, no. 1, pp. 1–38, 2015.
[39] M. Norouzi and D. Fleet, "Minimal loss hashing for compact binary codes," in Proc. 28th Int. Conf. Mach. Learn., 2011, pp. 353–360.
[40] D. Zhang, J. Wang, D. Cai, and J. Lu, "Self-taught hashing for fast similarity search," in Proc. 33rd Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2010, pp. 18–25.
[41] A. Bordes, X. Glorot, J. Weston, and Y. Bengio, "Joint learning of words and meaning representations for open-text semantic parsing," in Proc. Int. Conf. Artif. Intell. Statist., 2012, pp. 127–135.
[42] J. L. Elman, "Distributed representations, simple recurrent networks, and grammatical structure," Mach. Learn., vol. 7, pp. 195–225, 1991.
[43] G. E. Hinton, "Learning distributed representations of concepts," in Proc. 8th Annu. Conf. Cognitive Sci. Soc., 1986, pp. 1–12.
[44] G. Hinton and R. Salakhutdinov, "Discovering binary codes for documents by learning deep generative models," Topics Cognitive Sci., vol. 3, pp. 74–91, 2010.
[45] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," in Proc. INTERSPEECH, 2010, pp. 1045–1048.
[46] R. Socher, E. H. Huang, J. Pennington, C. D. Manning, and A. Ng, "Dynamic pooling and unfolding recursive autoencoders for paraphrase detection," in Proc. Adv. Neural Inf. Process. Syst., 2011, pp. 801–809.
[47] J. Turian, L. Ratinov, and Y. Bengio, "Word representations: A simple and general method for semi-supervised learning," in Proc. 48th Annu. Meeting Assoc. Comput. Ling., 2010, pp. 384–394.
[48] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, "A neural probabilistic language model," J. Mach. Learn. Res., vol. 3, pp. 1137–1155, 2003.
[49] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, "Learning word vectors for sentiment analysis," in Proc. 49th Annu. Meeting Assoc. Comput. Ling.: Human Lang. Technol., vol. 1, 2011, pp. 142–150.
[50] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[51] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 580–587.
[52] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 609–616.
[53] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler, "Convolutional learning of spatio-temporal features," in Proc. Eur. Conf. Comput. Vis., 2010, pp. 140–153.
[54] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," in Proc. Workshop ICLR, 2013, pp. 1–2.
[55] E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng, "Improving word representations via global context and multiple word prototypes," in Proc. 50th Annu. Meeting Assoc. Comput. Ling., 2012, pp. 873–882.
[56] J. Jost, Riemannian Geometry and Geometric Analysis. New York, NY, USA: Springer, 2008.
[57] R. A. Redner and H. F. Walker, "Mixture densities, maximum likelihood and the EM algorithm," SIAM Rev., vol. 26, no. 2, pp. 195–239, 1984.
[58] S. Amari and H. Nagaoka, Methods of Information Geometry. American Mathematical Soc., 2000.
[59] M. Guillaumin, J. Verbeek, and C. Schmid, "Multimodal semi-supervised learning for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2010, pp. 902–909.
[60] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," Int. J. Comput. Vis., vol. 42, pp. 145–175, 2001.
[61] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y.-T. Zheng, "NUS-WIDE: A real-world web image database from National University of Singapore," in Proc. ACM Conf. Image Video Retrieval, 2009, pp. 48:1–48:9.
[62] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 3111–3119.
[63] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid, "Aggregating local image descriptors into compact codes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 9, pp. 1704–1716, Sep. 2012.
[64] Y. Bengio, O. Delalleau, N. Le Roux, J.-F. Paiement, P. Vincent, and M. Ouimet, "Learning eigenfunctions links spectral embedding and kernel PCA," Neural Comput., vol. 16, pp. 2197–2219, 2004.

Qi Zhang received the PhD degree in computer science from Fudan University. He is an associate professor of computer science at Fudan University, Shanghai, China. His research interests include natural language processing and information retrieval.

Yang Wang received the bachelor's degree in computer science from Xidian University. He is currently working toward the master's degree at Fudan University. His research interests include information retrieval.

Jin Qian received the master's degree in computer science from Shandong University. He is currently working toward the PhD degree at Fudan University. His research interests include data mining.

Xuanjing Huang received the PhD degree in computer science from Fudan University. She is a professor of computer science at Fudan University, Shanghai, China. Her research interests include natural language processing and information retrieval.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.