A Mixed Generative-Discriminative Based Hashing Method
Qi Zhang, Yang Wang, Jin Qian, and Xuanjing Huang
Abstract—Hashing methods have proven to be useful for a variety of tasks and have attracted extensive attention in recent years.
Various hashing approaches have been proposed to capture similarities between textual, visual, and cross-media information.
However, most existing works use bag-of-words methods to represent textual information. Since words with different surface forms may have similar meanings, these methods cannot handle text similarities at the semantic level. To address these challenges, in this paper we propose a novel method called semantic cross-media hashing (SCMH), which uses continuous word representations to capture textual similarity at the semantic level and a deep belief network (DBN) to construct the correlation between different
modalities. To demonstrate its effectiveness, we evaluate the proposed method on three commonly used cross-media data sets. Experimental results show that the proposed method achieves significantly better
performance than state-of-the-art approaches. Moreover, the efficiency of the proposed method is comparable to or better than that of
some other hashing methods.
1 INTRODUCTION
Fig. 1. An example of top retrieved images from Flickr with different tags.
using multiple words with similar meaning, semantic level textual similarities should be incorporated into cross-media retrieval.

Motivated by the success of continuous space word representations (also called word embeddings) in a variety of tasks, in this work we propose to incorporate word embeddings to meet these challenges. Words in a text are embedded in a continuous space, which can be viewed as a Bag-of-Embedded-Words (BoEW). Since the number of words in a text is dynamic, in [17] we proposed a method to aggregate it into a fixed-length Fisher Vector (FV) using the Fisher kernel framework [18]. However, that method focuses only on textual information. Another challenge in this task is how to determine the correlation between multi-modal representations. Since we propose the use of a Fisher kernel framework to represent the textual information, we also use it to aggregate the SIFT descriptors [19] of images. Through the Fisher kernel framework, both textual and visual information is mapped to points in the gradient space of a Riemannian manifold. However, the relationships that exist between FVs of different modalities are usually highly non-linear. Hence, to construct the correlation between textual and visual modalities, we introduce a DBN based method to model the mapping function, which is used to convert abstract representations of different modalities from one to another.

The main contributions of this work are summarized as follows.

- We propose to incorporate continuous word representations to handle semantic textual similarities and adapt them for cross-media retrieval.
- Inspired by the advantages of DBNs in handling highly non-linear relationships and noisy data, we introduce a novel DBN based method to construct the correlation between different modalities.
- A variety of experiments on three commonly used cross-media benchmarks demonstrate the effectiveness of the proposed method. The experimental results show that the proposed method can significantly outperform the state-of-the-art methods.

2 RELATED WORK

With the increasing demand, many hashing-based methods have been proposed for cross-media retrieval. In this section, we briefly describe the related work, which can be categorized into the following four research areas: cross-media retrieval, near-duplicate detection, hashing-based methods, and neural networks for representing images and text.

2.1 Cross-Media Retrieval

Cross-media retrieval, in which the modality of the input query and that of the returned results can be different, has received considerable attention [1], [3], [6], [7], [8], [9], [10], [12], [20], [21]. Wu et al. [8] introduced a Canonical Correlation Analysis based method that constructs an isomorphic subspace and multi-modal correlations between media objects and uses polar coordinates to judge the general distance between media objects. Due to the lack of sufficient training samples, user relevance feedback was used to refine cross-media similarities. Yang et al. [9] proposed a manifold-based method, in which a Laplacian media object space is used to represent media objects for each modality and a multimedia document semantic graph is used to learn multimedia document semantic correlations. In [22], a rich-media object retrieval method is proposed to represent data consisting of multiple modalities, such as 2-D images, 3-D objects, and audio files. To tackle the large-scale problem, a multimedia indexing scheme was also adopted.

Since the relationships across different modalities are typically highly non-linear and observations are usually noisy, Srivastava and Salakhutdinov [12] proposed a Deep Boltzmann Machine to learn joint representations of image and text inputs. The proposed model fuses multiple data modalities into a unified representation, which can be used for classification and retrieval. Xing et al. [10] introduced the use of dual-wing harmoniums to build a joint model for images and text. The model incorporates Gaussian hidden units together with Gaussian and Poisson visible units into a linear RBM model. In [12], a multimodal deep Boltzmann machine was proposed for learning multimodal data representations. To reduce the training time complexity,
Zhang and Li [6] proposed to seamlessly integrate semantic labels into the hashing learning procedure for large-scale data modeling.

In the past few years, deep neural networks (DNNs) have achieved tremendous success in various tasks. Cross-media retrieval is one of the tasks in which DNNs and other neural network architectures have obtained improvements. In [11], a deep autoencoder was proposed to learn features over multiple modalities. The method uses the hidden units to construct shallow representations of the data and builds deep bimodal representations by modeling the correlations across the learned shallow representations. Karpathy and Fei-Fei proposed a multimodal recurrent neural network for generating descriptions for images [23]. The generated descriptions can be used for cross-media retrieval.

Most of the existing works described above focused on constructing the correlations between multiple modalities from different aspects. They usually use a bag-of-words model to represent text. In this work, however, we propose to use the Fisher kernel framework to represent both textual and visual information and use a deep network to construct the correlations between the two manifolds.

2.2 Near-Duplicate Detection

The task of detecting near-duplicate textual information has received considerable attention in recent years. Previous works studied the problem from different aspects, such as fingerprint extraction methods with or without linguistic knowledge, hash code learning methods, different granularities, and so on.

Broder [24] proposed the Shingling method, which uses contiguous subsequences to represent documents. It does not rely on any linguistic knowledge. If the sets of shingles extracted from different documents appreciably overlap, these documents are considered exceedingly similar, which is usually measured by Jaccard similarity. In order to reduce the complexity of shingling, meta-sketches were proposed to handle the efficiency problem [25]. In order to improve the robustness of shingle-like signatures, Theobald et al. [26] introduced a method, SpotSigs, which provides more semantic pre-selection of shingles for extracting characteristic signatures from Web documents. SpotSigs combines stopword antecedents with short chains of adjacent content terms. Its aim is to filter natural-language text passages out of noisy Web page components. They also proposed several pruning conditions based on the upper bounds of Jaccard similarity.

I-Match [27] is one of the methods that uses hash codes to represent an input document. It filters the input document based on collection statistics and computes a single hash value for the remaining text. If two documents have the same hash value, they are considered duplicates. It hinges on the premise that removal of very infrequent terms and very common terms results in good document representations for the near-duplicate detection task. Since I-Match signatures are sensitive to small modifications, Kołcz et al. [28] proposed the solution of using several I-Match signatures, all derived from randomized versions of the original lexicon.

Local text reuse detection focuses on identifying the reused and modified sentences, facts, or passages, rather than whole documents. Seo and Croft [29] analyzed the task and defined six categories of text reuse. They proposed a general framework for text reuse detection, under which several fingerprinting techniques were evaluated. Zhang et al. [30] also studied the partial-duplicate detection problem. They converted the task into two subtasks: sentence-level near-duplicate detection and sequence matching. Besides the similarities between documents, the method can simultaneously output the positions where the duplicated parts occur. In order to handle the efficiency problem, they implemented their method using three Map-Reduce jobs. Kim et al. [31] proposed to map sentences into points in a high-dimensional space and leveraged range searches in this space. They used the MD5 hash function to generate a hash code for each word. A file signature is then created by taking the bitwise-or of all signatures of words that appear in the file.

Different from these existing methods, in this paper we propose to use aggregated word embeddings to capture semantic level similarities and thus reduce false matches.

2.3 Hashing-Based Methods

In recent years, hashing-based methods, which create compact hash codes that preserve similarity, for single-modal or cross-modal retrieval on large-scale databases have attracted considerable attention [4], [5], [12], [13], [14], [15], [32], [33], [34], [35], [36], [37], [38]. For the single-modal case, Hinton and Salakhutdinov [33] proposed a two-layer network, called a Restricted Boltzmann machine (RBM), with a small central layer to convert high-dimensional input vectors into low-dimensional codes. In [36], spectral hashing was defined to seek compact binary codes that preserve the semantic similarity between codewords. The criterion used in spectral hashing is related to graph partitioning. Norouzi and Fleet [39] introduced a method based on the latent structural SVM framework for learning similarity-preserving hash functions. A specific loss function is designed to take both Hamming distance and binary quantization into consideration. In [40], Self-Taught Hashing (STH) converted the hash code learning problem into two stages: an unsupervised method, binarised Laplacian Eigenmap, is used to optimize l-bit binary codes, and classifiers are then trained to predict the l-bit code for unseen documents.

A variety of works studied the problem of mapping multimodal high-dimensional data to low-dimensional hash codes. Latent semantic sparse hashing [13] proposed the use of Matrix Factorization to represent text and sparse coding to capture the salient structures of images. These representations are then mapped to a joint abstraction space. However, LSSH requires the use of both visual and textual information to construct the data set. Although out-of-sample instances can be estimated, the performance may be heavily influenced. Yu et al. [14] introduced a discriminative coupled dictionary hashing approach, which generates a coupled dictionary for each modality based on category labels. Kumar and Udupa [15] formulated the problem of learning hash functions as a constrained minimization problem. Since the optimization problem is NP-hard, they transformed it into a tractable eigenvalue problem by means of a relaxation. Inter-media hashing (IMH) [4] uses a linear
regression model to jointly learn a set of hashing functions for each individual media type.

Since in this work we learn the mapping functions between FVs of different modalities, all the hashing-based methods for a single modality can be incorporated into it.

2.4 Neural Networks for Representing Image and Text

The task of learning continuous space word representations has a long history [41], [42], [43], [44], [45], [46], [47]. It has demonstrated outstanding results across a variety of tasks. Hinton and Salakhutdinov [44] introduced a deep generative model to learn word-count vectors and binary codes for documents. In [45], the word representations are learned by a recurrent neural network language model. The proposed architecture consists of an input layer and a hidden layer with recurrent connections. The probabilistic neural network language model (NNLM) [48] simultaneously learns a distributed representation for each word and the probability function for word sequences. Bordes et al. [41] proposed to use a multi-task training process to jointly learn representations of words, entities, and meaning representations. The work described in [49] introduced a mix of unsupervised and supervised techniques to learn word vectors that capture both semantic and sentiment similarities among words.

On the image side, there are also a variety of studies tackling the problem of higher-level representations of visual information. Krizhevsky et al. [50] proposed to use a deep convolutional neural network to perform object detection. In [51], region proposals are combined with CNNs to generate features for object detection. Besides these supervised methods, unsupervised learning methods for training visual features have also been carefully studied. Lee et al. [52] introduced the convolutional deep belief network, a hierarchical generative model, to represent images. Taylor et al. [53] proposed a convolutional gated restricted Boltzmann machine to model the spatio-temporal features of videos.

Although in this work we use word embeddings and SIFT to represent texts and images respectively, the proposed method can also incorporate these representations.

3 THE PROPOSED METHOD

The processing flow of the proposed semantic cross-media hashing (SCMH) method is illustrated in Fig. 2. Given a collection of text-image bi-modality data, we first represent images and texts respectively. Through table lookup, all the words in a text are transformed into distributed vectors generated by word embedding learning methods. For representing images, we use the SIFT detector to extract image keypoints; the SIFT descriptor is used to calculate descriptors of the extracted keypoints [19]. After these steps, a variable-size set of points in the embedding space represents the text, and a variable-size set of points in the SIFT descriptor space represents each image. Then, the Fisher kernel framework is utilized to aggregate these points in different spaces into fixed-length vectors, which can also be considered as points in the gradient space of a Riemannian manifold. Henceforth, texts and images are represented by vectors with fixed length. Finally, the mapping functions between textual and visual Fisher vectors (FVs) are learned by a deep neural network. We use the learned mapping function to convert FVs of one modality to another. Hash code generation methods are used to transfer FVs of different modalities to short binary vectors. In the following sections, we describe each of these components in detail.

Fig. 2. An overview of the processing flow of the proposed SCMH for cross-media retrieval.

3.1 Word Embeddings Learning

Representation of words as continuous vectors has recently been shown to benefit performance in a variety of NLP and IR tasks [44], [46], [47]. Similar words tend to be close to each other in the vector representation. Moreover, Mikolov et al. [54] also demonstrated that the learned word representations can capture meaningful syntactic and semantic regularities. Hence, in this work, we propose to use word embeddings to capture the semantic level similarities between short text segments.

Fig. 3 shows three architectures used for learning word embeddings. w_i represents the ith word in the given word sequence {w_1, w_2, ..., w_T}. Fig. 3a shows the architecture of
the probabilistic neural network language model (NNLM) proposed by Bengio et al. in [48]. It can have either one hidden layer beyond the word feature mapping or direct connections from the word features to the output layer. They also proposed to use a softmax function for the output layer to guarantee positive probabilities summing to 1. The word vectors and the parameters of the probability function can be learned simultaneously. In this work, we only use the learned word vectors.

Figs. 3b and 3c show the architectures of the methods proposed by Mikolov et al. in [54]. The architecture of CBOW, which is similar to NNLM, is shown in Fig. 3b. The main differences are that (i) the non-linear hidden layer is removed; (ii) the words from the future are included; and (iii) the training criterion is to correctly classify the current word (w_t). The Skip-gram architecture, which is shown in Fig. 3c, is similar to CBOW. However, instead of predicting the current word based on the history and future words, it tries to maximize the classification accuracy of words within a certain range before and after the current word, using only the current word as input.

Besides the methods mentioned above, there are also a large number of works addressing the task of learning distributed word representations [47], [49], [55]. Most of them can also be used in this work. The proposed framework places no limit on which continuous word representation method is used.

Fig. 3. Methods used to learn word embeddings. The NNLM architecture predicts the probability of words based on the existing words [48]. CBOW predicts the current word based on the context [54]. Skip-gram predicts surrounding words given the current word [54].

3.2 Fisher Kernel Framework

The Fisher kernel framework [18] was proposed to directly obtain a kernel function from a generative probability model. Consider a parametric class of probability models P(X|θ), where θ ∈ Θ ⊆ R^l for some positive integer l. If the dependence on θ is sufficiently smooth, the collection of models with parameters from Θ can be viewed as a manifold M_Θ. By applying a scalar product at each point P(X|θ) ∈ M_Θ, it can be turned into a Riemannian manifold [56].

We denote a text or an image as X = {x_i, 1 ≤ i ≤ |X|}, where x_i is the embedding of the ith word of a text or the SIFT descriptor of the ith keypoint of an image, and |X| is the number of words in a text or the number of extracted keypoints in an image. Each x_i is a D-dimensional word embedding or SIFT descriptor. We should note that there may be different parameters for different data sets. According to the Fisher kernel framework, X can be modeled by a probability density function. In this work, P(X|θ) is given by a Gaussian mixture model (GMM), which is a sum of N Gaussians N(μ_i, Σ_i) with weights ω_i. Let θ = {ω_i, μ_i, Σ_i, ∀i = 1...N} be the set of GMM parameters. The parameters θ are estimated through the optimization of the Maximum Likelihood (ML) criterion using the Expectation Maximization (EM) method [57].

Based on the learned parameter set θ, a text or an image X can be characterized by the gradient vector given by the following function:

G_\theta^X = \nabla_\theta \log P(X|\theta) = \left( \frac{\partial}{\partial \theta_1} \log P(X|\theta), \ldots, \frac{\partial}{\partial \theta_l} \log P(X|\theta) \right),   (1)

where G_θ^X is a vector whose dimensionality depends only on the number of parameters in θ, not on the number of words or keypoints. The gradient describes the contribution of each individual parameter to the generative process. It can also be interpreted as how these parameters contribute to the process of generating an example. We follow the work described in [18] in normalizing these gradients by incorporating the Fisher information matrix (FIM) F_θ. According to the theory of information geometry [58], U = {P(X|θ), θ ∈ Θ}, which is a parametric family of distributions, can be regarded as a Riemannian manifold M_Θ with a local metric given by the FIM F_θ ∈ R^{M×M}:

F_\theta = E\left[ \nabla_\theta \log P(X|\theta)\, \nabla_\theta \log P(X|\theta)^T \right].   (2)

The similarity between two samples X and Y can be measured by the Fisher kernel defined as:

K_{FK}(X, Y) = (G_\theta^X)^T F_\theta^{-1} G_\theta^Y.   (3)

Since F_θ is symmetric and positive definite, F_θ^{-1} can be decomposed as L_θ^T L_θ based on the Cholesky decomposition. Therefore, K_{FK}(X, Y) can be rewritten as follows:

K_{FK}(X, Y) = (\mathcal{G}_\theta^X)^T \mathcal{G}_\theta^Y,   (4)

where

\mathcal{G}_\theta^X = L_\theta G_\theta^X = L_\theta \nabla_\theta \log P(X|\theta).   (5)
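For concreteness, the gradient statistics in Eqs. (1)-(5) can be approximated in a few lines of NumPy. The sketch below is illustrative rather than the implementation used in this work (which relies on the INRIA toolkit [63]): it assumes a diagonal-covariance GMM fitted with scikit-learn, keeps only the gradients with respect to the Gaussian means for brevity, and applies the power and L2 normalization commonly used with Fisher vectors; the function name is ours.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(X, gmm):
    """Aggregate a variable-size set of descriptors X (shape |X| x D), e.g. the
    word embeddings of a text or the SIFT descriptors of an image, into a
    fixed-length vector. Only the gradients w.r.t. the GMM means are kept; the
    full formulation of Eqs. (1)-(5) also covers weights and covariances."""
    X = np.atleast_2d(X)
    gamma = gmm.predict_proba(X)        # posteriors p(component k | x_i), |X| x N
    mu = gmm.means_                     # N x D
    sigma = np.sqrt(gmm.covariances_)   # N x D (diagonal covariances assumed)
    w = gmm.weights_                    # N
    # Normalized gradient of the log-likelihood w.r.t. each Gaussian mean
    diff = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]   # |X| x N x D
    g_mu = (gamma[:, :, None] * diff).sum(axis=0)                 # N x D
    g_mu /= X.shape[0] * np.sqrt(w)[:, None]
    fv = g_mu.ravel()
    # Power and L2 normalization, as is common practice for Fisher vectors [63]
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)

# The GMM plays the role of the generative model P(X | theta); one GMM per
# modality is fitted on training descriptors (train_desc: M x D matrix), e.g.:
#   gmm = GaussianMixture(n_components=64, covariance_type="diag").fit(train_desc)
#   fv_text = fisher_vector(text_embeddings, gmm)
#   fv_img  = fisher_vector(sift_descriptors, image_gmm)
```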
where sign(·) represents the element-wise sign function, and f(x; w) denotes a real-valued transformation from R^p to R^q. In this work, the Fisher vectors of text segments or keypoints are the x in the mapping function b(x; w). A variety of existing methods have been proposed to achieve this task under this framework, using different forms of f and different optimization objectives. Most learning-to-hash methods for dense vectors can be used in this framework. In this work, we evaluated several state-of-the-art hashing methods, whose performances are shown in the experiment section.

4 EXPERIMENTS

To demonstrate the effectiveness of the proposed method, we compare and contrast the experimental results of SCMH and state-of-the-art hashing methods on three commonly used data sets for cross-media retrieval.

4.1 Data Sets

The three data sets used in this work contain both texts and images. They have been chosen for the purpose of evaluating various cross-media retrieval methods.

Flickr. The MIR Flickr data set2 [59], which consists of one million images along with their user-assigned tags, was collected from Flickr. Out of all the images, 25,000 images are annotated for 24 concepts, including object categories (e.g., bird, people) and scene categories (e.g., sky, night). A stricter annotation was made on 14 concepts, where a subset of the positive images was selected only if the concept is salient in the image. Therefore, this leads to a total of 38 concepts for this data set. Following previous works, each image may belong to one or more concepts. Image-text pairs are considered to be similar if they share the same concept.

LabelMe. The LabelMe data set3 [60] contains 2,688 images, which belong to eight outdoor scene categories: coast, mountain, forest, open country, street, inside city, tall buildings, and highways. All the objects in these images have been fully labeled and are used as tags of the images. Following the work described in [13], tags occurring in fewer than three images are discarded. Therefore, there are a total of 245 unique tags remaining. To construct the gold standard, we also follow previous works and assume that image-text pairs are regarded as similar if they share the same scene label.

NUS-WIDE. The NUS-WIDE data set4 [61] contains images and their associated tags from Flickr. The total numbers of images and unique tags are 269,648 and 5,018 respectively. The dataset includes six kinds of low-level features extracted from these images and 81 manually constructed ground-truth concepts. For comparison with previous methods, we also used the 10 most common concepts and randomly selected 20,000 images from them for evaluation. We treat image-text pairs labeled with the same concepts as similar.

4.2 Experiment Settings

For multimodal documents, we use the SIFT framework to represent images and use word embeddings to represent text. We use the SIFT keypoint detector to extract a variable number of keypoints for each image and calculate the descriptors of the keypoints using 128-dimensional SIFT descriptors. The toolkit we used in this work is VLFeat 0.9.19.5 The word embeddings we used in this work are pre-trained vectors trained on part of a Google News dataset (about 100 billion words). A Skip-gram model [62] was used to generate these 300-dimensional vectors for three million words and phrases. For generating Fisher vectors, we use the implementation of INRIA [63].

To demonstrate the effectiveness of the proposed method, we evaluated the following state-of-the-art methods on the three data sets:

- Cross-view Hashing [15] maps similar objects to similar codes across the views to enable cross-view similarity search.
- Discriminative coupled dictionary hashing (DCDH) [14] generates a coupled dictionary for each modality based on category labels.
- Multi-view discriminative coupled dictionary hashing (MV-DCDH) [14] extends DCDH with multi-view representations to enhance the representing capability of the relatively "weak" modalities.
- Latent semantic sparse hashing (LSSH) [13] uses Matrix Factorization to represent text and sparse coding to capture the salient structures of images.
- Collective matrix factorization hashing (CMFH) [1] generates unified hash codes for different modalities of one instance through collective matrix factorization with a latent factor model.
- Semantic correlation maximization (SCM) [6] integrates semantic labels into the hashing learning procedure to preserve semantic similarity across modalities.

The toolkits of LSSH, DCDH, and MV-DCDH are kindly provided by the authors. As we mentioned in the previous section, the proposed method SCMH can incorporate any hashing method for a single modality. In this work, we use Semantic Hashing to generate hash codes for both textual and visual information. Semantic Hashing [33] is a multi-layer neural network with a small central layer that converts high-dimensional input vectors into low-dimensional codes.6 For the length of hash codes, all the methods generate 32-, 64-, and 128-bit hash codes.

Following previous literature on this task, we adopt the widely used Mean Average Precision (MAP) as the evaluation metric. For a single query and the top-K retrieved instances, Average Precision (AP) is defined as follows:

AP = \frac{1}{R} \sum_{k=1}^{K} P(k)\,\delta(k),

where R denotes the number of ground-truth instances in the retrieved set, P(k) denotes the precision of the top-k retrieved instances, and δ(k) is an indicator function which equals 1 if the kth instance is relevant to the query and 0 otherwise. In the experiments, we set K = 50. Besides MAP, we

2. http://press.liacs.nl/mirflickr/
3. http://people.csail.mit.edu/torralba/code/spatialenvelope/
4. http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm
5. http://www.vlfeat.org/
6. http://www.cs.toronto.edu/~hinton/
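As a worked illustration of this metric, the sketch below computes AP@K and MAP from binary hash codes by ranking database items by Hamming distance. It is a minimal reference implementation under the assumption that codes are stored as 0/1 NumPy arrays; the function names are ours and not part of any of the compared toolkits.

```python
import numpy as np

def average_precision(query_code, db_codes, relevant, K=50):
    """AP for one query, following the definition above: rank the database by
    Hamming distance to the query code, then AP = (1/R) * sum_k P(k) * delta(k)
    over the top-K list, where delta(k) = 1 iff the k-th item is relevant."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)  # Hamming distances
    top_k = np.argsort(dists, kind="stable")[:K]
    hits = relevant[top_k].astype(float)                      # delta(k), k = 1..K
    R = hits.sum()                                            # relevant items retrieved
    if R == 0:
        return 0.0
    precision_at_k = np.cumsum(hits) / np.arange(1, len(top_k) + 1)   # P(k)
    return float((precision_at_k * hits).sum() / R)

def mean_average_precision(query_codes, db_codes, relevance, K=50):
    """MAP over all queries; relevance[i, j] is 1 if database item j shares a
    concept with query i (the ground-truth criterion used by the data sets)."""
    return float(np.mean([average_precision(q, db_codes, relevance[i], K)
                          for i, q in enumerate(query_codes)]))
```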
Fig. 6. The precision-recall curves of different hash code generation methods on the Flickr data set.
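The precision-recall curves in Figs. 6, 7, and 8 are standard retrieval curves traced over the Hamming-ranked list. For reference, a minimal sketch of how such a curve can be computed for one query is given below (same 0/1 code-matrix convention as in the MAP sketch above); averaging over all queries gives the plotted curves. This is illustrative code of our own, not part of any released toolkit.

```python
import numpy as np

def pr_curve(query_code, db_codes, relevant):
    """Precision/recall pairs for one query, obtained by sweeping the depth of
    the Hamming-ranked retrieval list."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    hits = relevant[np.argsort(dists, kind="stable")].astype(float)
    cum_hits = np.cumsum(hits)
    precision = cum_hits / np.arange(1, hits.size + 1)
    recall = cum_hits / max(hits.sum(), 1.0)
    return precision, recall
```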
TABLE 2. MAP Comparison on LabelMe

TABLE 3. MAP Comparison on NUS-WIDE
semantic level similarities between tags decreases the performance. We also observe that SCMH achieves better performance on the Text → Image task than on the Image → Text task. The DCDH, MV-DCDH, LSSH, CMFH, and SCM methods all behave differently from SCMH, achieving better performance on the Image → Text task. The main reason is possibly that word-level semantic similarities can be better captured through SCMH than keypoints represented by SIFT descriptors. In the Image → Text task, the performance of SCMH is slightly worse than LSSH when the hash code length is 32 bits. We think the main reason is that the size of the LabelMe dataset and the number of tags occurring in this dataset are both too small.

From the PR-curves shown in Fig. 7, we also observe that although SCMH has performance similar to LSSH in the Image → Text task, the precision of SCMH decreases much more slowly. This means that SCMH can achieve better results when the user needs more candidates. We also observe from the figure that the improvements of SCMH on the Image → Text task are relatively marginal compared to those on the Text → Image task at all recall levels. This also confirms the phenomenon described above.

4.3.3 Results on NUS-WIDE

The results of different methods on the NUS-WIDE dataset are shown in Table 3. The corresponding PR-curves are given in Fig. 8. From the results, we observe that SCMH achieves significantly better performance than state-of-the-art methods on all tasks. The relative improvements of SCMH over the second best results are 10.0 and 18.5 percent on the Text → Image and Image → Text tasks respectively. Compared with the results of SCMH on the LabelMe and Flickr datasets, the improvement of SCMH on NUS-WIDE is more significant. The main possible reason is that the number of tags, selected based on their frequency, used in this dataset is larger than in LabelMe and Flickr. There are only a total of 245 unique tags which occur more than three times in the whole LabelMe dataset. For comparison with other methods, we selected the top 500 most frequent tags in the Flickr data set. Since NUS-WIDE is a more practical dataset, which contains more unique tags, we propose to use the top 1,000 most frequent tags. Hence, the weakness of the other methods in capturing the semantic level similarities between tags decreases the performance.
Fig. 7. The precision-recall curves of different hash code generation methods on the LabelMe data set.
Fig. 8. The precision-recall curves of different hash code generation methods on the NUS-WIDE data set.
This, to some degree, demonstrates that the proposed method SCMH is more appropriate for practical environments. From the PR-curves illustrated in Fig. 8, we observe a phenomenon similar to that on LabelMe and Flickr: the precision of SCMH decreases much more slowly. When recall reaches 20 percent, the relative improvement of SCMH over LSSH is more than 28.2 percent on the Image → Text task.

To further analyze the results given by different methods, we calculate the cosine similarities between the textual descriptions of queries and the correct results in the top-50 lists on the Image → Text task. Fig. 9 shows the distribution of cosine similarities. In the figure, the x-axis denotes the ranges of cosine similarity and the y-axis the number of correct results in each range. From the results, we can see that SCMH finds more correct results whose cosine similarity with the corresponding query is less than 10 percent. This, to some degree, demonstrates the effectiveness of SCMH in capturing semantic textual similarities.

Fig. 9. Distribution of cosine similarities between queries and results on the Image → Text task on the NUS-WIDE dataset.

In summary, the evaluation results on three data sets demonstrate conclusively that the proposed SCMH method is superior to the state-of-the-art methods when measured using commonly accepted performance metrics on data sets that are commonly used for evaluating cross-media retrieval.

4.3.4 Parameter Sensitivity

To analyze the sensitivity of the hyper-parameters of SCMH, we conduct several empirical experiments on all the datasets. For easy comparison with previous methods, we set the hash code length to 64 bits. Fig. 10 shows the performance of SCMH with different percentages of training data. In the two figures, the x-axis denotes the percentage of data used for training and the y-axis denotes the MAP performance. The data used for constructing the retrieval set and the query set are the same as those used in the previous section. From the figures, we observe that as the amount of training data increases, the MAP performance of SCMH consequently improves on all data sets. When the percentage of training data exceeds 30 percent of the whole dataset, the MAP performance increases slowly. The main reason may possibly be that the number of categories or concepts included in these data sets is small. On the other hand, this also shows that the proposed method SCMH can achieve acceptable results with only a small amount of ground truth. Hence, it can be easily adapted to other data sets.

Fig. 10. Effects of training size on MAP performance on the Image → Text and Text → Image tasks.

Since the training process of the mapping function is solved by an iterative procedure, we also evaluate its convergence property. Fig. 11 shows the MAP performance of SCMH on the Image → Text and Text → Image tasks. In the two figures, the x-axis denotes the number of iterations for optimizing the mapping function and the y-axis denotes the
[9] Y. Yang, Y.-T. Zhuang, F. Wu, and Y.-H. Pan, "Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval," IEEE Trans. Multimedia, vol. 10, no. 3, pp. 437–446, Apr. 2008.
[10] E. P. Xing, R. Yan, and A. G. Hauptmann, "Mining associated text and images with dual-wing harmoniums," in Proc. 21st Conf. Uncertainty Artif. Intell., 2005, pp. 633–641.
[11] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, "Multimodal deep learning," in Proc. 28th Int. Conf. Mach. Learn., 2011, pp. 689–696.
[12] N. Srivastava and R. Salakhutdinov, "Multimodal learning with deep Boltzmann machines," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 2222–2230.
[13] J. Zhou, G. Ding, and Y. Guo, "Latent semantic sparse hashing for cross-modal similarity search," in Proc. 37th Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2014, pp. 415–424.
[14] Z. Yu, F. Wu, Y. Yang, Q. Tian, J. Luo, and Y. Zhuang, "Discriminative coupled dictionary hashing for fast cross-media retrieval," in Proc. 37th Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2014, pp. 395–404.
[15] S. Kumar and R. Udupa, "Learning hash functions for cross-view similarity search," in Proc. Int. Joint Conf. Artif. Intell., 2011, pp. 1360–1365.
[16] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin, "Placing search in context: The concept revisited," in Proc. 10th Int. Conf. World Wide Web, 2001, pp. 406–414.
[17] Q. Zhang, J. Kang, J. Qian, and X. Huang, "Continuous word embeddings for detecting local text reuses at the semantic level," in Proc. 37th Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2014, pp. 797–806.
[18] T. Jaakkola and D. Haussler, "Exploiting generative models in discriminative classifiers," in Proc. Adv. Neural Inf. Process. Syst., 1999, pp. 487–493.
[19] D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. Int. Conf. Comput. Vis., 1999, p. 1150.
[20] X. Wang, Y. Liu, D. Wang, and F. Wu, "Cross-media topic mining on Wikipedia," in Proc. 21st ACM Int. Conf. Multimedia, 2013, pp. 689–692.
[21] H. Zhang, J. Yuan, X. Gao, and Z. Chen, "Boosting cross-media retrieval via visual-auditory feature analysis and relevance feedback," in Proc. ACM Int. Conf. Multimedia, 2014, pp. 953–956.
[22] P. Daras, S. Manolopoulou, and A. Axenopoulos, "Search and retrieval of rich media objects supporting multiple multimodal queries," IEEE Trans. Multimedia, vol. 14, no. 3, pp. 734–746, Jun. 2012.
[23] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Boston, MA, USA, Jun. 2015, pp. 3128–3137.
[24] A. Z. Broder, "On the resemblance and containment of documents," in Proc. SEQUENCES, 1997, p. 21.
[25] A. Z. Broder, "Identifying and filtering near-duplicate documents," in Proc. Combinatorial Pattern Matching, 2000, pp. 1–10.
[26] M. Theobald, J. Siddharth, and A. Paepcke, "SpotSigs: Robust and efficient near duplicate detection in large web collections," in Proc. 31st Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2008, pp. 563–570.
[27] A. Chowdhury, O. Frieder, D. Grossman, and M. C. McCabe, "Collection statistics for fast duplicate document detection," ACM Trans. Inf. Syst., vol. 20, no. 2, pp. 171–191, 2002.
[28] A. Kołcz, A. Chowdhury, and J. Alspector, "Improved robustness of signature-based near-replica detection via lexicon randomization," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2004, pp. 605–610.
[29] J. Seo and W. B. Croft, "Local text reuse detection," in Proc. 31st Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2008, pp. 571–578.
[30] Q. Zhang, Y. Zhang, H. Yu, and X. Huang, "Efficient partial-duplicate detection based on sequence matching," in Proc. 31st Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2010, pp. 571–578.
[31] J. W. Kim, K. S. Candan, and J. Tatemura, "Efficient overlap and content reuse detection in blogs and online news articles," in Proc. Int. Conf. World Wide Web, 2009, pp. 571–578.
[32] Y. Gong, S. Kumar, V. Verma, and S. Lazebnik, "Angular quantization-based binary codes for fast similarity search," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1196–1204.
[33] G. Hinton and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, pp. 504–507, 2006.
[34] B. Kulis and T. Darrell, "Learning to hash with binary reconstructive embeddings," in Proc. Adv. Neural Inf. Process. Syst., 2009, pp. 1042–1050.
[35] K. Grauman and R. Fergus, "Learning binary hash codes for large-scale image search," in Proc. Mach. Learn. Comput. Vis., 2013, pp. 49–87.
[36] Y. Weiss, A. Torralba, and R. Fergus, "Spectral hashing," in Proc. Adv. Neural Inf. Process. Syst., 2008.
[37] Y. Zhen and D.-Y. Yeung, "A probabilistic model for multimodal hash function learning," in Proc. 18th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2012, pp. 940–948.
[38] P. Wang, B. Xu, Y. Wu, and X. Zhou, "Link prediction in social networks: The state-of-the-art," Sci. China Inf. Sci., vol. 58, no. 1, pp. 1–38, 2015.
[39] M. Norouzi and D. Fleet, "Minimal loss hashing for compact binary codes," in Proc. 28th Int. Conf. Mach. Learn., 2011, pp. 353–360.
[40] D. Zhang, J. Wang, D. Cai, and J. Lu, "Self-taught hashing for fast similarity search," in Proc. 33rd Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2010, pp. 18–25.
[41] A. Bordes, X. Glorot, J. Weston, and Y. Bengio, "Joint learning of words and meaning representations for open-text semantic parsing," in Proc. Int. Conf. Artif. Intell. Statist., 2012, pp. 127–135.
[42] J. L. Elman, "Distributed representations, simple recurrent networks, and grammatical structure," Mach. Learn., vol. 7, pp. 195–225, 1991.
[43] G. E. Hinton, "Learning distributed representations of concepts," in Proc. 8th Annu. Conf. Cognitive Sci. Soc., 1986, pp. 1–12.
[44] G. Hinton and R. Salakhutdinov, "Discovering binary codes for documents by learning deep generative models," Topics Cognitive Sci., vol. 3, pp. 74–91, 2010.
[45] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," in Proc. INTERSPEECH, 2010, pp. 1045–1048.
[46] R. Socher, E. H. Huang, J. Pennin, C. D. Manning, and A. Ng, "Dynamic pooling and unfolding recursive autoencoders for paraphrase detection," in Proc. Adv. Neural Inf. Process. Syst., 2011, pp. 801–809.
[47] J. Turian, L. Ratinov, and Y. Bengio, "Word representations: A simple and general method for semi-supervised learning," in Proc. 48th Annu. Meeting Assoc. Comput. Ling., 2010, pp. 384–394.
[48] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, "A neural probabilistic language model," J. Mach. Learn. Res., vol. 3, pp. 1137–1155, 2003.
[49] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, "Learning word vectors for sentiment analysis," in Proc. 49th Annu. Meeting Assoc. Comput. Ling.: Human Lang. Technol., Vol. 1, 2011, pp. 142–150.
[50] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[51] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 580–587.
[52] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 609–616.
[53] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler, "Convolutional learning of spatio-temporal features," in Proc. Comput. Vis., 2010, pp. 140–153.
[54] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," in Proc. Workshop ICLR, 2013, pp. 1–2.
[55] E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng, "Improving word representations via global context and multiple word prototypes," in Proc. 50th Annu. Meeting Assoc. Comput. Ling., 2012, pp. 873–882.
[56] J. Jost, Riemannian Geometry and Geometric Analysis. New York, NY, USA: Springer, 2008.
[57] R. A. Redner and H. F. Walker, "Mixture densities, maximum likelihood and the EM algorithm," SIAM Rev., vol. 26, no. 2, pp. 195–239, 1984.
[58] S. Amari and H. Nagaoka, Methods of Information Geometry. American Mathematical Soc., 2000.
[59] M. Guillaumin, J. Verbeek, and C. Schmid, "Multimodal semi-supervised learning for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2010, pp. 902–909.
[60] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," Int. J. Comput. Vis., vol. 42, pp. 145–175, 2001.
[61] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y.-T. Zheng, "NUS-WIDE: A real-world web image database from National University of Singapore," in Proc. ACM Conf. Image Video Retrieval, pp. 48:1–48:9.
[62] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 3111–3119.
[63] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid, "Aggregating local image descriptors into compact codes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 9, pp. 1704–1716, Sep. 2011.
[64] Y. Bengio, O. Delalleau, N. Le Roux, J.-F. Paiement, P. Vincent, and M. Ouimet, "Learning eigenfunctions links spectral embedding and kernel PCA," Neural Comput., vol. 16, pp. 2197–2219, 2004.

Jin Qian received the master's degree in computer science from Shandong University. He is currently working toward the PhD degree at Fudan University. His research interests include data mining.

Xuanjing Huang received the PhD degree in computer science from Fudan University. She is a professor of computer science at Fudan University, Shanghai, China. Her research interests include natural language processing and information retrieval.