Learning Image-Text Associations
Abstract—Web information fusion can be defined as the problem of collating and tracking information related to specific topics on the
World Wide Web. Whereas most existing work on web information fusion has focused on text-based multidocument summarization,
this paper concerns the topic of image and text association, a cornerstone of cross-media web information fusion. Specifically, we
present two learning methods for discovering the underlying associations between images and texts based on small training data sets.
The first method, based on vague transformation, measures the information similarity between the visual features and the textual
features through a set of predefined domain-specific information categories. The second method uses a neural network to learn a direct
mapping between the visual and textual features by automatically and incrementally summarizing the associated features into a set of
information templates. Despite their distinct approaches, our experimental results on a terrorist domain document set show that both
methods are capable of learning associations between images and texts from a small training data set.
1 INTRODUCTION
classification can be used to predict whether a person is infected by dengue disease. In the multimedia domain, classification has been used for annotation purposes, i.e., predicting whether certain semantic concepts appear in media objects.

An early work in this area is to classify the indoor-outdoor scenarios of video frames [16]. In this work, a video frame or image is modeled as sequences of image segments, each of which is represented by a set of color histograms. A group of 1D hidden Markov models (HMMs) are first trained to capture the patterns of image segment sequences and then used to predict the indoor-outdoor categories of new images.

Recently, many efforts aim to classify and annotate images with more concrete concepts. In [17], a decision tree is used to learn classification rules that associate color features, including global color histograms and local dominant colors, with semantic concepts such as sunset, marine, arid images, and nocturne. In [18], a learning vector quantization (LVQ)-based neural network is used to classify images into outdoor-domain concepts, such as sky, road, and forest, with image features extracted via Haar wavelet transformation. Another approach using vector quantization for image classification was presented in [19]. In this method, images are divided into a number of image blocks, and the visual information of the image blocks is represented using HSV colors. For each image category, a concept-specific codebook is extracted based on training images. Each codebook contains a set of codewords, i.e., representative color feature vectors for the concept. A new image is classified by finding the most similar codewords for its image blocks; the image is assigned to the category whose codebook provides the largest number of similar codewords (a simple sketch of this codeword-voting scheme appears at the end of this subsection).

At the current stage, image classification mainly works for discriminating images into a relatively small set of categories that are visually separable. It is not suitable for linking images with free texts, in which tens of thousands of different terms exist. On one hand, the concepts represented by those terms, such as "sea" and "sky," may not be easily separable by the visual features. On the other hand, training a classifier for each of these terms would need a large amount of training data, which is usually unavailable, and the training process would also be extremely time consuming.
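To make the codebook-based scheme of [19] concrete, the following is a minimal sketch of the codeword-voting idea described above; the block features, the distance measure, and the nearest-codeword vote are our own illustrative assumptions rather than the exact settings used in [19].

```python
import numpy as np

def classify_by_codebooks(block_features, codebooks):
    """Assign an image to the category whose codebook attracts the most block votes.

    block_features: (num_blocks, d) array of per-block color features (e.g., HSV statistics).
    codebooks: dict mapping category name -> (num_codewords, d) array of codewords.
    The feature choice and the nearest-codeword vote are illustrative assumptions.
    """
    votes = {category: 0 for category in codebooks}
    for block in block_features:
        best_category, best_dist = None, np.inf
        for category, codewords in codebooks.items():
            # Distance from this block to the closest codeword of the category.
            dist = np.min(np.linalg.norm(codewords - block, axis=1))
            if dist < best_dist:
                best_category, best_dist = category, dist
        votes[best_category] += 1
    # The image goes to the category that supplied the most similar codewords.
    return max(votes, key=votes.get)
```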
2.3 Learning Association Rules between Image Content and Semantic Concepts
Association rule mining (ARM) is originally used for discovering association patterns in transaction databases. An association rule is an implication of the form X ⇒ Y, where X, Y ⊆ I (called item sets or patterns) and X ∩ Y = ∅. In the domain of market-basket analysis, such an association rule indicates that the customers who buy the set of items X are also likely to buy the set of items Y. Mining association rules from multimedia data is usually a straightforward extension of ARM in transaction databases.

In this area, many efforts have been conducted to extract associations between low-level visual features and high-level semantic concepts for image annotation. Ding et al. [20] presented a pixel-based approach to deduce associations between pixels' spectral features and semantic concepts. In [20], each pixel is treated as a transaction, while the set ranges of the pixel's spectral bands and auxiliary concept labels (e.g., crop yields) are considered as items. Pixel-level association rules are then extracted, of the form "Band 1 in the range [a, b] and Band 2 in the range [c, d] are likely to imply crop yield E." However, Tesic et al. [21] pointed out that using individual pixels as transactions may lose the context information of surrounding locations, which is usually very useful for determining the image semantics. This motivated them to use images and rectangular image regions as transactions and items, respectively. Image regions are first represented using Gabor texture features and then clustered using a self-organizing map (SOM) [22] and LVQ to form a visual thesaurus. The thesaurus is used to provide the perceptual labeling of the image regions. Then, the first- and second-order spatial predicate associations among regions are tabulated in spatial event cubes (SECs), based on which higher-order association rules are determined using the Apriori algorithm [23]. For example, a third-order item set is of the form "If a region with label u_j is a right neighbor of a u_i region, it is likely that there is a u_k region on the right side of u_j." More recently, Teredesai et al. [24] proposed a framework to learn multirelational visual-textual associations for image annotation. Within this framework, keywords and image visual features (including color saliency maps, orientation, and intensity contrast maps) are extracted and stored separately in relational tables in a database. Then, the FP-Growth algorithm [25] is used for extracting multirelational associations between the visual features and keywords from the database tables. The extracted rules, such as "4 Yellow ⇒ EARTH, GROUND," can be subsequently used for annotating new images.

In [26], the author proposed a method that uses associations of visual features to discriminate high-level semantic concepts. To avoid combinatorial explosion during the association extraction, a clustering method is used to organize the large number of color and texture features into a visual dictionary, in which similar visual features are grouped together. Each image can then be represented using a relatively small set of representative visual feature groups. For each specific image category (i.e., semantic concept), a set of associations is extracted as a visual knowledge base featuring the image category. When a new image comes in, it is considered related to an image category if it globally verifies the associations associated with that image category. In this method, associations were only learned among visual feature groups, not between visual features and semantic concepts or keywords.

Due to the pattern combinatorial explosion problem, the performance of learning association rules is highly dependent on the number of items (e.g., the number of image features and the number of lexical terms). Although existing methods that learn association rules between image features and high-level semantic concepts are applicable for a small set of concepts/keywords, they may encounter problems when mining association rules on images and free texts, where a large number of different terms exist. This may not only cause a significant increase in the learning time but also result in a great number of association rules, which may also lower the performance during the process of annotating images, as more rules need to be considered and consolidated.
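As a concrete illustration of the transaction view used by these ARM-based approaches, the sketch below counts support and confidence for rules that link visual items (e.g., region labels or visterms) to keywords, with one transaction per image; the item encoding, antecedent sizes, and thresholds are illustrative assumptions, not the exact setup of [20], [21], or [24].

```python
from itertools import combinations

def mine_rules(transactions, min_support=0.1, min_confidence=0.6):
    """Mine simple "visual items -> keyword" rules from image transactions.

    transactions: list of sets, each holding the visual items (e.g., 'region:u1')
                  and textual items (e.g., 'kw:attack') observed in one image.
    Returns (antecedent, keyword, support, confidence) tuples.
    """
    n = len(transactions)
    pair_counts = {}
    for t in transactions:
        visual = sorted(i for i in t if not i.startswith('kw:'))
        keywords = [i for i in t if i.startswith('kw:')]
        # Antecedents of size 1 and 2 only, to keep the sketch small; real miners
        # (Apriori, FP-Growth) grow candidate item sets level by level instead.
        for size in (1, 2):
            for ante in combinations(visual, size):
                for kw in keywords:
                    pair_counts[(ante, kw)] = pair_counts.get((ante, kw), 0) + 1
    rules = []
    for (ante, kw), count in pair_counts.items():
        ante_count = sum(1 for t in transactions if set(ante) <= t)
        support, confidence = count / n, count / ante_count
        if support >= min_support and confidence >= min_confidence:
            rules.append((ante, kw, support, confidence))
    return rules
```

With free text, the number of keyword items grows into the tens of thousands, which leads directly to the combinatorial explosion discussed above.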
Based on the above observation, we build two thesauruses in the form of transformation matrices, each of which corresponds to a subtransformation. Suppose the visterm space V is of m dimensions, the textual feature space T is of n dimensions, and the cardinality of the set of high-level domain information categories C is l. Based on V, T, and C, we define the following two transformation matrices:

$$M^{VC} = \begin{pmatrix} m^{VC}_{11} & m^{VC}_{12} & \cdots & m^{VC}_{1l} \\ m^{VC}_{21} & m^{VC}_{22} & \cdots & m^{VC}_{2l} \\ \vdots & \vdots & \ddots & \vdots \\ m^{VC}_{m1} & m^{VC}_{m2} & \cdots & m^{VC}_{ml} \end{pmatrix} \qquad (4)$$

and

$$M^{CT} = \begin{pmatrix} m^{CT}_{11} & m^{CT}_{12} & \cdots & m^{CT}_{1n} \\ m^{CT}_{21} & m^{CT}_{22} & \cdots & m^{CT}_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ m^{CT}_{l1} & m^{CT}_{l2} & \cdots & m^{CT}_{ln} \end{pmatrix}, \qquad (5)$$

where $m^{VC}_{ij}$ represents the association factor between the visual feature $v_i$ and the information category $c_j$, and $m^{CT}_{jk}$ represents the association factor between the information category $c_j$ and the textual feature $t_k$. In our current system, $m^{VC}_{ij}$ and $m^{CT}_{jk}$ are calculated by

$$m^{VC}_{ij} = P(c_j \mid v_i) = \frac{N(v_i, c_j)}{N(v_i)} \qquad (6)$$

and

$$m^{CT}_{jk} = P(tm_k \mid c_j) = \frac{N(c_j, tm_k)}{N(c_j)}, \qquad (7)$$

where $N(v_i)$ is the number of images containing the visual feature $v_i$, $N(v_i, c_j)$ is the number of images containing $v_i$ and belonging to the information category $c_j$, $N(c_j)$ is the number of text segments belonging to the category $c_j$, and $N(c_j, tm_k)$ is the number of text segments belonging to $c_j$ and containing the textual feature (term) $tm_k$.

For calculating $m^{VC}_{ij}$ and $m^{CT}_{jk}$ in (6) and (7), we build a training data set of texts and images that have been manually classified into domain information categories (see Section 4 for details).

Based on (4) and (5), we can define the similarity between the visual part of an image $v^I$ and a text segment represented by $t^{TS}$ as $(v^I)^T M^{VC} M^{CT} t^{TS}$. For embedding into (1), we use its normalized form

$$sim_{VT}(v^I, t^{TS}) = \frac{(v^I)^T M^{VC} M^{CT} t^{TS}}{\left\| (v^I)^T M^{VC} M^{CT} \right\| \, \left\| t^{TS} \right\|}. \qquad (8)$$

3.3.3 Dual-Direction Vague Transformation
Equation (8) calculates the cross-media similarity using a single-direction transformation from the visual feature space to the textual feature space. However, it may still suffer from the vagueness problem. For example, suppose there is a picture I, represented by the visual feature vector $v^I$, belonging to the domain information category Attack Details, and two text segments $TS_1$ and $TS_2$, represented by the textual feature vectors $t^{TS_1}$ and $t^{TS_2}$, belonging to the categories of Attack Details and Victims, respectively. If the two categories Attack Details and Victims share many common words (such as kill, die, and injure), the vague transformation result of $v^I$ might be similar to both $t^{TS_1}$ and $t^{TS_2}$. To reduce the influence of common terms across different categories and utilize the strength of the distinct words, we consider another transformation from the word space to the visterm space. Similarly, we define a pair of transformation matrices $M^{TC} = \{m^{TC}_{kj}\}_{n \times l}$ and $M^{CV} = \{m^{CV}_{ji}\}_{l \times m}$, where $m^{TC}_{kj} = P(c_j \mid tm_k) = \frac{N(c_j, tm_k)}{N(tm_k)}$ and $m^{CV}_{ji} = P(v_i \mid c_j) = \frac{N(v_i, c_j)}{N(c_j)}$ ($i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, l$, and $k = 1, 2, \ldots, n$). Here, $N(tm_k)$ is the number of text segments containing the term $tm_k$; $N(c_j, tm_k)$, $N(v_i, c_j)$, and $N(c_j)$ are the same as those in (6) and (7). Then, the similarity between a text segment represented by the textual feature vector $t^{TS}$ and the visual content of an image $v^I$ can be defined as

$$sim_{TV}(t^{TS}, v^I) = \frac{(t^{TS})^T M^{TC} M^{CV} v^I}{\left\| (t^{TS})^T M^{TC} M^{CV} \right\| \, \left\| v^I \right\|}. \qquad (9)$$

Finally, we can define a cross-media similarity measure based on the dual-direction transformation, which is the arithmetic mean of $sim_{VT}(v^I, t^{TS})$ and $sim_{TV}(t^{TS}, v^I)$, given by

$$sim^{vt}_d(v^I, t^{TS}) = \frac{sim_{VT}(v^I, t^{TS}) + sim_{TV}(t^{TS}, v^I)}{2}. \qquad (10)$$
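The following is a minimal NumPy sketch of how the transformation matrices in (4)-(7) and the similarity measures in (8)-(10) can be computed from co-occurrence counts; the variable names, the toy count matrices, and the zero-denominator guards are our own illustrative assumptions.

```python
import numpy as np

def conditional(counts, totals):
    """Row-wise conditional probabilities counts[i, j] / totals[i], guarding zero totals."""
    return counts / np.maximum(totals, 1)[:, None]

def normalized_bilinear(x, A, y):
    """x^T A y / (||x^T A|| * ||y||), the normalized form used in (8) and (9)."""
    xa = x @ A
    denom = np.linalg.norm(xa) * np.linalg.norm(y)
    return float(xa @ y) / denom if denom > 0 else 0.0

# Toy dimensions: m visterms, l information categories, n terms (illustrative only).
m, l, n = 4, 2, 5
rng = np.random.default_rng(0)
N_vi_cj = rng.integers(0, 10, (m, l))   # N(v_i, c_j)
N_vi    = N_vi_cj.sum(axis=1)           # N(v_i)
N_cj_tk = rng.integers(0, 10, (l, n))   # N(c_j, tm_k)
N_cj    = rng.integers(1, 20, l)        # N(c_j): text segments per category
N_tk    = N_cj_tk.sum(axis=0)           # N(tm_k)

M_VC = conditional(N_vi_cj, N_vi)       # (6): P(c_j | v_i),  shape (m, l)
M_CT = conditional(N_cj_tk, N_cj)       # (7): P(tm_k | c_j), shape (l, n)
M_TC = conditional(N_cj_tk.T, N_tk)     #      P(c_j | tm_k), shape (n, l)
M_CV = conditional(N_vi_cj.T, N_cj)     #      P(v_i | c_j),  shape (l, m)

v_img, t_seg = rng.random(m), rng.random(n)
sim_vt = normalized_bilinear(v_img, M_VC @ M_CT, t_seg)   # (8)
sim_tv = normalized_bilinear(t_seg, M_TC @ M_CV, v_img)   # (9)
sim_dual = 0.5 * (sim_vt + sim_tv)                        # (10)
```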
3.3.4 Vague Transformation with Visual Space Projection
A problem of the reversed cross-media (text-to-visual) transformation in the dual-direction transformation is that the intermediate layer, i.e., the information categories, may be embedded differently in the textual feature space and the visterm space. For example, in Fig. 5, two information categories, "Terrorist Suspects" and "Victims," may contain quite different text descriptions but somewhat similar images, e.g., human faces. Suppose we translate the term vector of a text segment into the visual feature space using a cross-media transformation. Transforming a term vector in the "Victims" category or a term vector in the "Terrorist Suspects" category may result in a similar visual feature vector, as these two information categories have similar representations in the visual feature space. In such a case, when there are text segments belonging to the two categories in the same web page, we may not be able to select a proper text segment for an image about "Terrorist Suspects" or "Victims" based on the text-to-visual vague transformation. To solve this problem, we need to consolidate the differences in the similarities between the information categories in the textual feature space and the visual feature space. We assume that text can more precisely represent the

• Using bipartite graphs of the classified text segments. For constructing the similarity matrix of the information categories in the textual feature space, we utilize the bipartite graph of the classified text segments and the information categories, as shown in Fig. 6. The underlying idea is that the more text segments two information categories share, the more similar they are. We borrow the similarity measure used in [31], originally designed for calculating term similarity based on bipartite graphs of terms and text documents, for calculating the similarity between information categories. Therefore, any $s_t(c_i, c_j)$ in D can be calculated as

Using this refined equation in the dual-direction transformation, we expect that the performance of discovering the image-text associations can be improved. However, solving (11) is a nonlinear optimization problem of a very large scale, because X is an $m \times m$ matrix, i.e., there are $m^2$ variables to tune. Fortunately, from (15) we can see that we do not need to obtain the exact matrix X. Instead, we only need to solve the simple linear equation $D = M^{CV} X^T X \left(M^{CV}\right)^T$ to obtain a matrix

$$A = X^T X = \left(M^{CV}\right)^{-1} D \left(\left(M^{CV}\right)^{-1}\right)^T, \qquad (16)$$

where $\left(M^{CV}\right)^{-1}$ is the pseudo-inverse of the transformation matrix $M^{CV}$.
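A minimal sketch of this closed-form step, assuming NumPy and treating D and M^CV as already computed; np.linalg.pinv plays the role of the pseudo-inverse in (16).

```python
import numpy as np

def solve_projection_gram(M_CV, D):
    """Recover A = X^T X (in the least-squares sense) from D = M_CV A M_CV^T, as in (16).

    M_CV: (l, m) category-to-visterm transformation matrix.
    D:    (l, l) similarity matrix of the information categories in the textual space.
    """
    M_pinv = np.linalg.pinv(M_CV)      # (m, l) pseudo-inverse of M^CV
    return M_pinv @ D @ M_pinv.T       # (m, m); X itself is never needed
```

Only A is needed downstream, which is what makes the nonlinear optimization over the $m^2$ entries of X avoidable.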
where $\|S_j\|$ is the number of images in the jth cluster. Fig. 9 shows the information gains obtained by clustering our image collection based on visterm sets with a varying number of visterms. We can see that, no matter how many clusters of images we generate, the largest information gain is always achieved when k is around 400. Based on this observation, we generate 400 visterms for the image visterm vectors.

Note that we employ information gain with respect to the information categories to determine the number of visterms to use for representing image contents. This is an optimization for the data preprocessing stage, the benefit of which is shared by all the learning models in our experiments. Therefore, it does not conflict with our statement that the fusion ART does not depend on the predefined information categories for learning the image-text associations.
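As an illustration of this selection step, the sketch below scores a candidate visterm vocabulary by clustering the images and measuring the information gain of the clustering with respect to the information categories; the clustering algorithm (k-means) and the standard entropy-based gain formula are our assumptions, since only part of the original formulation is reproduced in this excerpt.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def entropy(labels):
    """Shannon entropy of a sequence of labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def clustering_information_gain(visterm_vectors, category_labels, num_clusters):
    """Information gain of an image clustering with respect to the information categories.

    visterm_vectors: (num_images, k) image representations for a candidate visterm set.
    category_labels: list of num_images information-category labels.
    """
    clusters = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(visterm_vectors)
    base = entropy(category_labels)
    conditional = 0.0
    for c in set(clusters):
        members = [lab for lab, cl in zip(category_labels, clusters) if cl == c]
        conditional += (len(members) / len(category_labels)) * entropy(members)
    return base - conditional   # larger gain: the visterm set separates the categories better
```

Candidate visterm vocabularies of different sizes can then be compared by this score; in the paper, the gain peaks at around 400 visterms.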
TABLE 2
The Seven Cross-Media Models in Comparison
4.3 Evaluation of Cross-Media Similarity Measure Based on Visual Features Only
We first evaluate the performance of the cross-media similarity measures, defined in Sections 3.3 and 3.4, by setting the mixture weight to 0 in our linear mixture similarity model (see (1)), i.e., using only the visual contents of images (without image captions) for measuring image-text associations. As there has been no prior work on image-text association learning, we implement two baseline methods for evaluation and comparison. The first method is based on the CMRM proposed by Jeon et al. [12]. The CMRM is designed for image annotation by estimating the conditional probability of observing a term w given the observed visual content of an image. The other baseline method is based on the DWH model proposed by Xing et al. [30]. As described in Section 2, a trained DWH can also be used to estimate the conditional probability of seeing a term w given the observed image visual features. As our objective is to associate an entire text segment with an image, we extend the CMRM and the DWH model to calculate the average conditional probability of observing the terms in a text segment given the visual content of an image. The reason for using the average conditional probability, instead of the joint conditional probability, is that we need to minimize the influence of the length of the text segments: the longer a text segment is, the smaller the joint conditional probability tends to be. Table 2 summarizes the seven methods that we experimented with for discovering image-text associations based on the pure visual contents of images. The first four methods are the vague-transformation-based cross-media similarity measures defined in Section 3.3. The fifth method is the fusion-ART (object resonance)-based similarity measure. The last two methods are the baseline methods based on the CMRM and the DWH model, respectively. Fig. 10 shows the performance of the various models for extracting image-text associations based on a fivefold cross validation.
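A minimal sketch of the averaging extension just described, assuming some annotation model (such as the CMRM) already provides per-term conditional probabilities P(w | image); the function name and interface are illustrative.

```python
def segment_score(term_prob, image, segment_terms):
    """Average conditional probability of a text segment's terms given an image.

    term_prob: callable (image, term) -> P(term | image), e.g., backed by a trained CMRM.
    Averaging (rather than multiplying) keeps long segments from being penalized
    simply for containing more terms.
    """
    probs = [term_prob(image, w) for w in segment_terms]
    return sum(probs) / len(probs) if probs else 0.0
```

The image is then associated with the candidate text segment that maximizes this score.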
Fig. 10. Comparison of cross-media models for discovering image-text associations.

We see that among the seven methods, DDT_VP_CT and fusion ART provide the best performance. They outperform SDT and DDT, which have a similar performance. All of these four models perform much better than the DWH model, the CMRM, and DDT_VP_BG. We can see that the DWH model always obtains a precision of 0 percent and therefore cannot predict the correct image-text associations for this particular experiment. The reason could be that the training data set is too small while the data dimensions are quite large (501 for visual features and 8,747 for textual features), so an effective DWH model cannot be trained using Gibbs sampling [30]. It is surprising that DDT_VP_BG is the worst method other than the DWH model, hinting that the similarity matrix calculated based on bipartite graphs cannot really reflect the semantic similarity between the domain information categories. We shall revisit this issue in the next section. Note that although DDT outperforms SDT in most of the folds, there is a significant performance reduction in fold 4. The reason could be that the reverse vague transformation results of certain text segments in fold 4 are difficult to discriminate, due to the reason described in Section 3.3.4. Therefore, the reverse vague transformation based on text data may even lower the overall performance of the DDT. On the other hand, DDT_VP_CT performs much more stably than DDT by incorporating the visual space projection.

For evaluating the impact of the size of the training data on the learning performance, we also experiment with different data sizes for training and testing. As the DWH model has been shown not to train properly for this data set, we leave it out in the rest of the experiments. Fig. 11 shows the performance of the six cross-media similarity models with respect to training data of various sizes. We can see that when the size of the training data decreases, the precision of the CMRM drops dramatically. In contrast, the performance of vague transformation and fusion ART drops by less than 10 percent in terms of average precision. This shows that our methods also provide better performance stability on small data sets compared with the statistics-based CMRM.

Fig. 11. Performance comparison of cross-media models with respect to different training data sizes.

4.4 Evaluation of Linear Mixture Similarity Model
In this section, we study the effect of using both textual and visual features in the linear mixture similarity model for discovering image-text associations.
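The evaluation below varies a mixture weight between the caption-based text similarity and the cross-media similarity; the sketch shows the linear mixture under that reading of model (1), whose exact form is not reproduced in this excerpt, and the symbol and function names are ours.

```python
def mixture_similarity(sim_text, sim_cross, weight):
    """Linear mixture of caption-based text similarity and cross-media similarity.

    weight = 1.0 corresponds to the pure text measure, weight = 0.0 to the pure
    cross-media measure; the exact form of the paper's model (1) is assumed here.
    """
    return weight * sim_text + (1.0 - weight) * sim_cross
```

In the experiments below, settings such as 0.0, 0.6, 0.7, and 1.0 are compared in terms of average precision.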
TABLE 3
The Average Precision Scores (in Percentages) for Image-Text Association Extraction
Referring to the experimental results in Table 3, we see that textual information is fairly reliable in identifying image-text associations. In fact, the pure text similarity measure (mixture weight of 1.0) outperforms the pure cross-media similarity measure (mixture weight of 0.0) by 20.7 percent to 24.4 percent in terms of average precision.

However, the best result is achieved by the linear mixture model using both the text-based and the cross-media similarity measures. DDT_VP_CT with a mixture weight of 0.7 achieves an average precision of 62.6 percent, while the fusion ART with a mixture weight of 0.6 achieves an average precision of 62.0 percent. On average, the mixture similarity models outperform the pure text similarity measure by about 5 percent. This shows that visual features are also useful in the identification of image-text associations. In addition, we observe that combining the cross-media and text-based similarity measures improves the performance of the pure text similarity measure on each fold of the experiment; such improvement is therefore stable. In fact, the keywords extracted from the captions of the images may sometimes be inconsistent with the contents of the images. For example, an image of the 911 attack scene may have a caption about a ceremony commemorating the 911 attack, such as "Victims' families will tread Ground Zero for the first time." In such a case, the visual features can compensate for the imprecision in the textual features.

Among the vague transformation methods, the dual-direction transformation achieves almost the same performance as the single-direction transformation. However, visual space projection with dual-direction transformation can slightly improve the average precision. We can also see that the bipartite-graph-based similarity matrix D for visual space projection does not improve the image-text association results. By examining the classified text segments, we notice that only a small number of text segments belong to more than one category and contribute to the category similarities. This may have resulted in an inaccurate similarity matrix and a biased visual space projection.

On the other hand, the performance of fusion ART is comparable with that of vague transformation with visual space projection. Nevertheless, when using the pure cross-media model (mixture weight of 0.0), fusion ART actually outperforms the vague-transformation-based methods by about 1 percent to 3 percent. Looking into each fold of the experiment, we see that the fusion-ART-based method is much more stable than the vague-transformation-based methods, in the sense that its best results are almost always achieved with a mixture weight of 0.6 or 0.7. For the vague-transformation-based methods, the best result of each experiment fold is obtained with rather different weight values. This suggests that the vague-transformation-based methods are more sensitive to the training data.

A sample set of the extracted image-text associations is shown in Fig. 12. We find that the cross-media models are usually good at associating general domain keywords with images. Referring to the second image in Fig. 12, the cross-media models can associate an image depicting the scene of the 911 attack with a text segment containing the word "attack," which is commonly used for describing terrorist attack scenes. However, a more specific word, such as "wreckage," usually cannot be identified correctly by the cross-media models. For such cases, using image captions may be helpful. On the other hand, as discussed before, image captions may not always reflect the image content accurately. For example, the caption of the third image contains the word "man," which is a very general term, not quite relevant to the terrorist event. For such cases, the cross-media models can be useful for finding the proper domain-specific textual information based on the visual features of the images.

Fig. 13 shows a sample set of the results of using fusion ART for image annotation. We can see that such annotations reflect the direct associations between the visual and textual features in the images and texts. For example, the visual cue of "debris" in the images may be associated with words such as "bomb" and "terror" in the text segments. Discovering such direct associations is an advantage of the fusion-ART-based method.

Fig. 13. Samples of image annotations using fusion ART.

4.5 Discussions and Comparisons
In Table 4, we provide a summary of the key characteristics of the two proposed methods. First of all, we note that the underlying ideas of the two approaches are quite different. Given a pair of image and text segment, the vague-transformation-based method translates features from one information space into another information space so that features of different spaces can be compared. The fusion-ART-based method, on the other hand, learns a set of prototypical image-text associations and then predicts the degree of association between an incoming pair of image and text segment by comparing it with the learned associations. During the prediction process, the visual and textual information is first compared in their respective spaces, and the results are consolidated based on a multimedia object resonance function (the ART choice function).
Fig. 12. A sample set of image-text associations extracted with similarity scores (SC). The correctly identified associated texts are bolded.
Vague transformation is a statistics-based method which calculates the conditional probabilities in one information space given observations in the other information space. To calculate such conditional probabilities, we need to perform batch learning on a fixed set of training data. Once the transformation matrices are trained, they cannot be updated without rebuilding them from scratch. In contrast, the fusion-ART-based method adopts an incremental competitive learning paradigm. The trained fusion ART can always be updated when new training data are available.

The vague-transformation-based method encodes the learned conditional probabilities in transformation matrices. A fixed number of domain-specific information categories are used to reduce the information complexity. Instead of using predefined information categories, the fusion-ART-based method automatically organizes multimedia information objects into typical categories. The characteristics of an object category are encoded by a multimedia information object template. There are usually more category nodes learned by the fusion ART; therefore, the information in the fusion ART is less compact than that in the transformation matrices. In our experiments, around 70 to 80 categories are learned by the fusion ART on a data set containing 300 images (i.e., 240 images are used for training in our fivefold cross validation).
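For readers unfamiliar with the ART family, the following is a heavily simplified sketch of the kind of incremental template learning described above, with two feature channels (visual and textual); the choice and match functions, the parameters, and the class interface follow the generic fuzzy-ART style and are our illustrative assumptions, not the exact fusion ART formulation of the paper.

```python
import numpy as np

class TwoChannelART:
    """Toy incremental learner of multimedia object templates (one per category node)."""

    def __init__(self, vigilance=0.7, beta=0.5, alpha=0.001):
        self.rho, self.beta, self.alpha = vigilance, beta, alpha
        self.templates = []   # list of (visual_template, textual_template) pairs

    def _choice(self, w, x):
        # Fuzzy-ART-style choice function; inputs are assumed to be nonnegative vectors.
        return np.minimum(x, w).sum() / (self.alpha + w.sum())

    def _match(self, w, x):
        return np.minimum(x, w).sum() / (x.sum() + 1e-12)

    def learn(self, x_visual, x_text):
        """Present one (image, text segment) pair; resonate with a template or add one."""
        scores = [self._choice(wv, x_visual) + self._choice(wt, x_text)
                  for wv, wt in self.templates]
        for j in np.argsort(scores)[::-1]:
            wv, wt = self.templates[j]
            if self._match(wv, x_visual) >= self.rho and self._match(wt, x_text) >= self.rho:
                # Resonance: move the template toward the input in both channels.
                self.templates[j] = (
                    self.beta * np.minimum(x_visual, wv) + (1 - self.beta) * wv,
                    self.beta * np.minimum(x_text, wt) + (1 - self.beta) * wt,
                )
                return j
        self.templates.append((x_visual.copy(), x_text.copy()))   # new category node
        return len(self.templates) - 1
```

Prediction then scores an incoming image-text pair against the learned templates with the same choice function, which corresponds to the resonance-based comparison described above.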
In terms of efficiency, the vague-transformation-based method runs much faster than the fusion-ART-based method during both training and testing. However, the fusion-ART-based method produces a more stable performance than that of the vague-transformation-based method (see the discussions in Section 4.4).
5 CONCLUSION
We have presented two distinct methods for learning and
extracting associations between images and texts from
TABLE 4
Comparison of the Vague-Transformation- and the Fusion-ART-Based Methods
[15] J. Han, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2005.
[16] H.H. Yu and W.H. Wolf, "Scenic Classification Methods for Image and Video Databases," Proc. SPIE, vol. 2606, no. 1, pp. 363-371, http://link.aip.org/link/?PSI/2606/363/1, 1995.
[17] I.K. Sethi, I.L. Coman, and D. Stan, "Mining Association Rules between Low-Level Image Features and High-Level Concepts," Proc. SPIE, vol. 4384, no. 1, pp. 279-290, http://link.aip.org/link/?PSI/4384/279/1, 2001.
[18] M. Blume and D.R. Ballard, "Image Annotation Based on Learning Vector Quantization and Localized Haar Wavelet Transform Features," Proc. SPIE, vol. 3077, no. 1, pp. 181-190, http://link.aip.org/link/?PSI/3077/181/1, 1997.
[19] A. Mustafa and I.K. Sethi, "Creating Agents for Locating Images of Specific Categories," Proc. SPIE, vol. 5304, no. 1, pp. 170-178, http://link.aip.org/link/?PSI/5304/170/1, 2003.
[20] Q. Ding, Q. Ding, and W. Perrizo, "Association Rule Mining on Remotely Sensed Images Using P-Trees," Proc. Sixth Pacific-Asia Conf. Advances in Knowledge Discovery and Data Mining (PAKDD '02), pp. 66-79, 2002.
[21] J. Tesic, S. Newsam, and B.S. Manjunath, "Mining Image Datasets Using Perceptual Association Rules," Proc. SIAM Sixth Workshop Mining Scientific and Eng. Datasets, in conjunction with the Third SIAM Int'l Conf. (SDM '03), http://vision.ece.ucsb.edu/publications/03SDMJelena.pdf, May 2003.
[22] T. Kohonen, Self-Organizing Maps, T. Kohonen, M.R. Schroeder, and T.S. Huang, eds. Springer-Verlag New York, Inc., 2001.
[23] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," Proc. 20th Int'l Conf. Very Large Data Bases (VLDB '94), J.B. Bocca, M. Jarke, and C. Zaniolo, eds., pp. 487-499, 1994.
[24] A.M. Teredesai, M.A. Ahmad, J. Kanodia, and R.S. Gaborski, "Comma: A Framework for Integrated Multimedia Mining Using Multi-Relational Associations," Knowledge and Information Systems, vol. 10, no. 2, pp. 135-162, 2006.
[25] J. Han, J. Pei, Y. Yin, and R. Mao, "Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach," Data Mining and Knowledge Discovery, vol. 8, no. 1, pp. 53-87, 2004.
[26] C. Djeraba, "Association and Content-Based Retrieval," IEEE Trans. Knowledge and Data Eng., vol. 15, no. 1, pp. 118-135, Jan./Feb. 2003.
[27] K. Barnard and D. Forsyth, "Learning the Semantics of Words and Pictures," Proc. Eighth Int'l Conf. Computer Vision (ICCV '01), vol. 2, pp. 408-415, 2001.
[28] P. Duygulu, K. Barnard, J.F.G. de Freitas, and D.A. Forsyth, "Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary," Proc. Seventh European Conf. Computer Vision (ECCV '02), pp. 97-112, 2002.
[29] J. Li and J.Z. Wang, "Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1075-1088, Sept. 2003.
[30] E.P. Xing, R. Yan, and A.G. Hauptmann, "Mining Associated Text and Images with Dual-Wing Harmoniums," Proc. 21st Ann. Conf. Uncertainty in Artificial Intelligence (UAI '05), p. 633, 2005.
[31] P. Sheridan and J.P. Ballerini, "Experiments in Multilingual Information Retrieval Using the Spider System," Proc. 19th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '96), pp. 58-65, 1996.
[32] P. Biebricher, N. Fuhr, G. Lustig, M. Schwantner, and G. Knorz, "The Automatic Indexing System Air/Phys—From Research to Applications," Proc. 11th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '88), pp. 333-342, 1988.
[33] N. Tishby, F. Pereira, and W. Bialek, "The Information Bottleneck Method," Proc. 37th Ann. Allerton Conf. Comm., Control and Computing, pp. 368-377, http://citeseer.ist.psu.edu/tishby99information.html, 1999.
[34] H. Hsu, L.S. Kennedy, and S.-F. Chang, "Video Search Reranking via Information Bottleneck Principle," Proc. 14th Ann. ACM Int'l Conf. Multimedia (MULTIMEDIA '06), pp. 35-44, 2006.
[35] G. Carpenter and S. Grossberg, Pattern Recognition by Self-Organizing Neural Networks. MIT Press, 1991.
[36] A.-H. Tan, "Adaptive Resonance Associative Map," Neural Networks, vol. 8, no. 3, pp. 437-446, 1995.
[37] G.A. Carpenter and S. Grossberg, "ART 2: Self-Organization of Stable Category Recognition Codes for Analog Input Patterns," Applied Optics, vol. 26, pp. 4919-4930, 1987.
[38] G.A. Carpenter, S. Grossberg, and D.B. Rosen, "ART 2-A: An Adaptive Resonance Algorithm for Rapid Category Learning and Recognition," Neural Networks, vol. 4, no. 4, pp. 493-504, 1991.
[39] G.A. Carpenter, S. Grossberg, and D.B. Rosen, "Fuzzy ART: Fast Stable Learning and Categorization of Analog Patterns by an Adaptive Resonance System," Neural Networks, vol. 4, no. 6, pp. 759-771, 1991.
[40] W. Li, K.-L. Ong, and W.K. Ng, "Visual Terrain Analysis of High-Dimensional Datasets," Proc. Ninth European Conf. Principles and Practice of Knowledge Discovery in Databases (PKDD '05), A. Jorge, L. Torgo, P. Brazdil, R. Camacho, and J. Gama, eds., vol. 3721, pp. 593-600, 2005.
[41] F. Chu, Y. Wang, and C. Zaniolo, "An Adaptive Learning Approach for Noisy Data Streams," Proc. Fourth IEEE Int'l Conf. Data Mining (ICDM '04), pp. 351-354, 2004.
[42] D. Shen, Q. Yang, and Z. Chen, "Noise Reduction through Summarization for Web-Page Classification," Information Processing and Management, vol. 43, no. 6, pp. 1735-1747, 2007.
[43] A. Tan, H. Ong, H. Pan, J. Ng, and Q. Li, "FOCI: A Personalized Web Intelligence System," Proc. IJCAI Workshop Intelligent Techniques for Web Personalization (ITWP '01), pp. 14-19, Aug. 2001.
[44] A.-H. Tan, H.-L. Ong, H. Pan, J. Ng, and Q.-X. Li, "Towards Personalised Web Intelligence," Knowledge and Information Systems, vol. 6, no. 5, pp. 595-616, 2004.
[45] E.W.M. Lee, Y.Y. Lee, C.P. Lim, and C.Y. Tang, "Application of a Noisy Data Classification Technique to Determine the Occurrence of Flashover in Compartment Fires," Advanced Eng. Informatics, vol. 20, no. 2, pp. 213-222, 2006.
[46] A.M. Fard, H. Akbari, R. Mohammad, and T. Akbarzadeh, "Fuzzy Adaptive Resonance Theory for Content-Based Data Retrieval," Proc. Third IEEE Int'l Conf. Innovations in Information Technology (IIT '06), pp. 1-5, Nov. 2006.
[47] S. Feng, R. Manmatha, and V. Lavrenko, "Multiple Bernoulli Relevance Models for Image and Video Annotation," Proc. IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR '04), pp. 1002-1009, 2004.
[48] M. Sharma, "Performance Evaluation of Image Segmentation and Texture Extraction Methods in Scene Analysis," master's thesis, 1998.
[49] P. Duygulu, O.C. Ozcanli, and N. Papernick, "Comparison of Feature Sets Using Multimedia Translation," LNCS, vol. 2869, 2003.
[50] A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," J. Royal Statistical Soc. Series B (Methodological), vol. 39, no. 1, pp. 1-38, 1977.
Tao Jiang received the BS degree in computer science and technology from Peking University in 2000 and the PhD degree from Nanyang Technological University. Since October 2007, he has been with ecPresence Technology Pte Ltd., Singapore, where he is currently a project manager. From July 2000 to May 2003, he was with Found Group, one of the biggest IT companies in China, earlier as a software engineer and later as a technical manager. He also currently serves as a coordinator of the "vWorld Online Community" project supported and sponsored by the Multimedia Development Authority (MDA), Singapore. His research interests include data mining, machine learning, and multimedia information fusion.

Ah-Hwee Tan received the BS (first class honors) and MS degrees in computer and information science from the National University of Singapore in 1989 and 1991, respectively, and the PhD degree in cognitive and neural systems from Boston University, Boston, in 1994. He is currently an associate professor and the head of the Division of Information Systems, School of Computer Engineering, Nanyang Technological University, Singapore. He was the founding director of the Emerging Research Laboratory, a research center for incubating interdisciplinary research initiatives. Prior to joining NTU, he was a research manager at the A*STAR Institute for Infocomm Research (I2R), responsible for the Text Mining and Intelligent Agents research groups. His current research interests include cognitive and neural systems, information mining, machine learning, knowledge discovery, document analysis, and intelligent agents. He is the holder of five patents and has published more than 80 technical papers in books, international journals, and conferences. He is an editorial board member of Applied Intelligence and a member of the ACM. He is a senior member of the IEEE.