Relevance Feedback and Learning in Content-Based Image Search
Abstract
A major bottleneck in content-based image retrieval (CBIR) systems or search engines is the large gap between
low-level image features used to index images and high-level semantic contents of images. One solution to this
bottleneck is to apply relevance feedback to refine the query or similarity measures in image search process.
In this paper, we first address the key issues involved in relevance feedback for CBIR systems and present a
brief overview of a set of commonly used relevance feedback algorithms. We then present a framework of
relevance feedback and semantic learning in CBIR, into which almost all of the previously proposed methods
fall. In this framework, low-level features and keyword annotations are integrated in the image retrieval and
feedback processes to improve retrieval performance. We have also extended the framework to a content-based
web image search engine in which hosting web pages are used to collect relevant annotations for images and
users’ feedback logs are used to refine annotations. A prototype system has been developed to evaluate the
proposed schemes, and our experimental results indicate that our approach outperforms traditional CBIR
systems and relevance feedback approaches.
1. Introduction
The popularity of digital images is rapidly increasing due to improving digital imaging
technologies and convenient availability facilitated by the Internet. However, how to find
user-intended images from the Internet is still non-trivial. The main reason is that web
images are usually not well annotated using semantic descriptors. The development his-
tory of image retrieval systems features two stages. The first stage is keyword-based image
retrieval, which is summarized by Chang et al. [2]. Since manual image annotation is
a tedious process, it is practically impossible to annotate all the images on the Internet.
Furthermore, due to the multiplicity of contents in a single image and the subjectivity of
human perception, it is also difficult for different users to annotate the same image identically.
These difficulties have limited the applications of keyword-based image retrieval technology.
Actively researched in the last decade [6,30], content-based image retrieval (CBIR) attempts
to automate the process of indexing or annotating images in image databases.
CBIR approaches work with descriptions based
on inherent properties of images, such as color, texture and shape. However, despite all
∗ This paper is based on the invited keynote that the first author gave at VDB2002, Brisbane, Australia, May 2002.
the research efforts, the retrieval accuracy of today’s CBIR algorithms is still very limited.
In addition to many other difficulties, the bottleneck is the gap between low-level image
features and semantic image contents. This problem stems from the fact that visual similarity
measures, such as color histograms, do not in general match the semantics of images
as perceived by human subjects. Also, each type of visual feature tends to capture only one
aspect of image properties, and it is usually hard for a user to specify clearly how different
aspects should be combined to form an optimal query. To make the problem even worse, people
often have different semantic interpretations of the same image. Even the same person
may have different perception about the same image at different times. To address this
bottleneck, interactive relevance feedback techniques have been proposed. The key idea
is that we should incorporate human perception subjectivity into the retrieval process and
provide users opportunities to evaluate retrieval results and automatically refine queries on
the basis of those evaluations. In the last few years, this research topic has become the
focus in CBIR research community.
Relevance feedback, originally developed for textual document retrieval [16], is a super-
vised active learning technique used to improve the effectiveness of information systems.
The main idea is to use positive and negative examples from the user to improve system
performance. For a given query, the system first retrieves a list of ranked images according
to a predefined similarity metrics. Then, the user marks the retrieved images as relevant
(positive examples) to the query or not (negative examples). The system will refine the
query based on the feedback and retrieves a new list of images and presents to user. Hence,
the key issue in relevance feedback is how to incorporate positive and negative examples
to refine the query and/or to adjust the similarity measure.
In this paper, we present a content-based image retrieval framework that integrates low-level
and semantic-based image similarities and supports automated annotation through
learning from relevance feedback, together with an extension of the framework to a web image
search engine. Instead of describing the novel component algorithms in detail, we focus
our description on the key ideas of the framework. Details of the algorithms and the
framework implementation can be found in the references [4,9,12,23,24]. Also, since we want
the paper to serve as a reference on the current state of the art of CBIR relevance feedback
research, a comprehensive survey of relevance feedback algorithms, in terms of their
natures and limitations, is presented in this paper.
There are many issues in relevance feedback approaches to CBIR, such as learning
schemes, feature selection, index structure, and scalability. Instead of giving an exhaustive
survey of each published relevance feedback algorithm for CBIR in terms of its
advantages and limitations, we focus our discussion on the consideration that relevance
feedback in CBIR is a small-sample machine learning problem, and describe in detail
the learning and search characteristics of each algorithm. This is presented
in Section 2.
In Section 3, we present the integrated relevance feedback framework
for CBIR. In this framework, while the user is interacting with the system by providing
feedback in a query session, a progressive learning process is activated to propagate the
keyword annotations from the labeled images to unlabeled images as the system refines
the retrieval. The knowledge learned in the relevance feedback sessions is accumulated
In this section, we review a set of relevance feedback approaches used in CBIR. The review
is focused on the learning and search characteristics of each relevance feedback algorithm, as we
consider relevance feedback in CBIR a machine learning problem. We begin the
discussion by first providing an overview of classical relevance feedback approaches in
CBIR.
The early relevance feedback schemes for CBIR were mainly adopted from those developed
for classical textual document retrieval. These approaches can be classified into two
categories: query point movement (query refinement) and re-weighting (similarity measure
refinement) [1]. Both were developed based on the vector space model, the most
popular model used in information retrieval [20].
The query point movement method essentially tries to improve the estimate of the “ideal
query point” by moving it towards positive example points and away from bad example
points in the query space. There are various ways to update the query. The technique frequently
used to iteratively improve this estimation is Rocchio’s formula, given below for the
sets of relevant documents DR and non-relevant documents DN provided by the user [16]:
$$Q' = \alpha Q + \beta \left( \frac{1}{N_R} \sum_{i \in D_R} D_i \right) - \gamma \left( \frac{1}{N_N} \sum_{i \in D_N} D_i \right), \qquad (1)$$
where α, β, and γ are suitable constants, and NR and NN are the numbers of documents in
DR and DN, respectively. This technique is also referred to as learning the query vector. It was
implemented in the MARS system [18] by replacing the document vectors with visual feature
vectors. Experiments show that retrieval performance can be improved considerably
by using such relevance feedback approaches.
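As a concrete illustration, the Rocchio update (1) can be sketched in a few lines of numpy; the default constants below (α = 1.0, β = 0.75, γ = 0.15) are common textbook choices, not values prescribed by this paper.

```python
import numpy as np

def rocchio_update(query, relevant, non_relevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio's formula (1): move the query vector toward the centroid
    of the relevant examples D_R and away from the centroid of the
    non-relevant examples D_N."""
    q = alpha * np.asarray(query, dtype=float)
    if len(relevant):
        q = q + beta * np.mean(np.asarray(relevant, dtype=float), axis=0)
    if len(non_relevant):
        q = q - gamma * np.mean(np.asarray(non_relevant, dtype=float), axis=0)
    return q
```

In a CBIR setting, the document vectors are simply replaced by visual feature vectors, as in MARS.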
The basic idea behind the re-weighting method is to enhance the importance of the di-
mensions of a feature that help in retrieving the relevant images and reduce the importance
of those dimensions that hinder this process. This is achieved by updating the weights of
feature vectors in the distance metric. Consider a weighted metric defined as
$$D = \sum_{j \in [N]} \omega_j \left| X_j^{(1)} - X_j^{(2)} \right|. \qquad (2)$$
When an image in the query result is labeled as a positive example, the feature components
that contribute more to the similarity of the match are considered more important, while the
components with less contribution are considered less important. Therefore, the weight for
a feature component, ωi, is updated in the following way:
$$\omega_i = \omega_i \cdot \left( 1 + \bar{\delta} - \delta_i \right), \qquad \delta_i = \left| f_i(Q) - f_i\left(A_j^{+}\right) \right|, \qquad (3)$$
where δ̄ is the mean of the δi over the feature components. On the other hand, if an image is labeled as a negative example,
the feature components that contribute more to the match should be
depressed. That is, the weight is updated as:
$$\omega_i = \omega_i \cdot \left( 1 - \bar{\delta} + \delta_i \right). \qquad (4)$$
This technique is also referred to as learning the metric. Such an approach was
proposed by Huang et al. [7]. The MARS system implemented a slight refinement of the
re-weighting method called the standard deviation method [18].
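The re-weighting updates (3) and (4) can be sketched as follows; clipping the weights to be non-negative is an added safeguard rather than part of the original formulation.

```python
import numpy as np

def reweight(weights, delta, positive=True):
    """Re-weighting update of (3)-(4): delta[i] is the per-component
    distance between the query and a feedback image. For a positive
    example, components with below-average distance (which contributed
    more to the match) gain weight; for a negative example they lose it."""
    weights = np.asarray(weights, dtype=float)
    delta = np.asarray(delta, dtype=float)
    d_bar = delta.mean()
    if positive:
        new = weights * (1.0 + d_bar - delta)
    else:
        new = weights * (1.0 - d_bar + delta)
    return np.clip(new, 0.0, None)  # keep weights non-negative (safeguard)
```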
Instead of updating the individual components of a distance metric, we can also begin
with a set of predefined distance metrics and use relevance feedback to automatically select
the best one in the retrieval process. For instance, in ImageRover system [21], appropriate
Lp Minkowski distance metrics are automatically selected to minimize the mean distance
between the relevant images specified by the user.
Another relevance feedback approach, proposed by Minka and Picard, is to update the
query space by selecting feature models. It is assumed that each feature model has its
own strength in representing a certain aspect of image content, and thus, the best way for
effective content-based retrieval is to utilize “a society of models.” This approach uses a
learning scheme to dynamically determine which feature model or combination of models
is best for subsequent retrieval.
Recently, more computationally robust methods that perform global feature optimization
have been proposed. The MindReader retrieval system designed by Ishikawa et al. [8]
formulates a minimization problem on the parameter estimation process. Unlike traditional
retrieval systems, whose distance functions can be represented by ellipses aligned with the
coordinate axes, the MindReader system proposed a distance function that is not necessarily
aligned with the coordinate axes. Therefore, it allows for correlations between attributes in
addition to different weights on each component.
A further improvement over the MindReader approach is given in [17]. In this ap-
proach, optimal query estimation and weighting functions are derived in a unified frame-
work. Based on the minimization of total distances of positive examples from the revised
query, the weighted average and a whitening transform in the feature space were found to
be the optimal solutions. In more detail, assume that a query vector component qi corre-
sponds to the ith feature, an N element vector r = [r1 , . . . , rN ] represents the degree of
relevance for each of the N input training samples, and there is a set of N training vec-
tors xni for each feature i. It is derived that the ideal query vector qi∗ for feature i is the
weighted average of the training samples for feature i given by
$$q_i^{*T} = \frac{R^T X_i}{\sum_{n=1}^{N} r_n}, \qquad (5)$$
where Xi is the N × Ki training sample matrix for feature i, obtained by stacking the N
training vectors xni into a matrix. It is interesting to note that the original query vector qi
does not appear in (5). This shows that the ideal query vector with respect to the feedbacks
is not influenced by the initial query.
The optimal weight matrix Wi∗ is given by
$$W_i^{*} = \left( \det(C_i) \right)^{1/K_i} C_i^{-1}, \qquad (6)$$
where Ci is the weighted covariance matrix of Xi . That is,
$$C_i^{rs} = \frac{\sum_{n=1}^{N} \pi_n (x_{nir} - q_{ir})(x_{nis} - q_{is})}{\sum_{n=1}^{N} \pi_n}, \qquad r, s = 1, \ldots, K_i. \qquad (7)$$
We can see from the above equations that the critical inputs to the system are the training
vectors xni and the relevance vector R. In this algorithm, the user initially needs to input
these data to the system. Another issue with this algorithm is that negative examples are
not utilized in updating the query and the similarity measure.
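Equations (5)–(7) translate directly into numpy; note that with only a handful of feedback samples the covariance Ci is typically singular, which is precisely the small-sample difficulty discussed below, so this naive sketch assumes enough samples for Ci to be invertible.

```python
import numpy as np

def optimal_query_and_weights(X, r):
    """Optimal query (5) and weight matrix (6)-(7) for one feature.
    X: (N, K) matrix of training feature vectors; r: (N,) relevance degrees.
    Assumes the weighted covariance C is non-singular."""
    X = np.asarray(X, dtype=float)
    r = np.asarray(r, dtype=float)
    q = r @ X / r.sum()                      # weighted average query, eq. (5)
    d = X - q
    C = (r[:, None] * d).T @ d / r.sum()     # weighted covariance, eq. (7)
    K = X.shape[1]
    W = np.linalg.det(C) ** (1.0 / K) * np.linalg.inv(C)  # eq. (6)
    return q, W
```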
Many machine learning techniques, such as decision tree learning [13], artificial neural networks [10], Bayesian learning [5,27], and kernel-based
learning [26], can be and have been applied to relevance feedback in CBIR. However, as
users are usually reluctant to provide a large number of feedback examples, the number of
training samples is very small, typically less than ten in each round of a feedback session. In
contrast, the feature dimensionality in CBIR systems is usually high. Hence, the crucial issue
in performing relevance feedback in CBIR systems is how to learn from small training
samples in a very high-dimensional feature space. This fact makes many learning methods,
such as decision tree learning and artificial neural networks, unsuitable for CBIR.
The key issues in addressing relevance feedback in CBIR as a small sample learning
problem include: How to learn fast from small sets of feedback samples to improve re-
trieval accuracy effectively; How to accumulate knowledge learned from feedback; and
How to integrate low-level visual and high-level semantic features in the query. However, most
of the published works have focused on the first issue. Compared with other learning
methods, Bayesian learning shows advantages in addressing the first issue above,
and almost all aspects of Bayesian learning have been explored in the search for effective
learning algorithms.
Vasconcelos and Lippman [27] treated feature distribution as a Gaussian mixture and
used Bayesian inference for learning during feedback iterations in a query session. Richer
information captured by the mixture model also makes image regional matching possible.
The potential problems of their method are computational efficiency and a complex data model
that leads to too many parameters to be estimated from very limited samples.
To speed up the learning process so the retrieval result can be converged faster to user’s
satisfaction, active learning methods have been used to actively select samples in order to
achieve the maximal information gain, or the minimized entropy/uncertainty in decision-
making. The approach proposed in [5] used Monte Carlo sampling to search for the
set of samples that will minimize the expected number of future iterations. In estimating
the expected number of future iterations, entropy is used as an estimate of the number of
future iterations under the ambiguity specified by the current probability distribution of the
target image over all test images. Tong and Chang [26] proposed a SVM active learning
algorithm to select the sample to maximally reduce the size of the vector space in which
the class boundary lies. Without knowing a priori the class of a candidate, the best strategy
is to halve the search space each time. They attempted to justify that selecting the points
near the SVM boundary can approximately achieve this goal, and it is more efficient than
other more sophisticated schemes, which require exhaustive trials on all the test items.
Therefore, in their work, the points near the SVM boundary are used to approximate the
most-informative points; and the most-positive images are chosen as the ones farthest from
the boundary on the positive side in the feature space.
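The selection rules can be sketched with scikit-learn's SVC as a stand-in for the SVM used in [26]: pool points with the smallest absolute decision value approximate the most-informative samples, and the largest positive values give the most-positive images.

```python
import numpy as np
from sklearn.svm import SVC

def svm_active_select(X_labeled, y, X_pool, n_query=5, n_top=5):
    """SVM active learning sketch: fit on the current feedback examples,
    ask the user to label pool points nearest the boundary (most
    informative), and return points farthest on the positive side as the
    current best results."""
    clf = SVC(kernel="rbf", gamma="scale").fit(X_labeled, y)
    margin = clf.decision_function(X_pool)
    to_label = np.argsort(np.abs(margin))[:n_query]   # near the boundary
    most_positive = np.argsort(-margin)[:n_top]       # farthest positive side
    return to_label, most_positive
```

In a real system the pool would be the feature vectors of the image database, and the labeled set would grow by a few samples per feedback round.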
Some researchers consider relevance feedback process in CBIR as a pattern recognition
or classification problem. Under such a consideration, the positive and negative exam-
ples provided by user can be treated as training examples and a classifier could be trained.
Then, such a classifier can separate the whole data set into relevant and irrelevant groups. It seems
that many existing pattern recognition tools could be adopted for this task, and many kinds
of classifiers have been experimented with, such as linear classifiers [29], nearest-neighbor
classifiers [28], Bayesian classifiers [24], support vector machines (SVM) [26], and so on. In
this category, the most popular algorithm is represented by [26], where an SVM classifier is
trained to divide the positive and negative examples. Such an SVM classifier then classifies
all images in the database into two groups: relevant and irrelevant with respect to a given query.
However, in most cases of CBIR, there is no predefined class structure. From an application
point of view, such classification-based methods may improve the retrieval performance in
some constrained contexts, but they will be limited when applied to general-purpose image
databases.
All the approaches described above perform relevance feedback at the low-level feature
vector level, basically replacing keywords with features when adopting the vector space
model developed for document retrieval. While these approaches do improve the performance
of CBIR, there are severe limitations. The inherent problem is that low-level
features are often not as powerful in representing the complete semantic content of images as
keywords are in representing text documents. Furthermore, since users often pay more attention to
the semantic content (or a certain object/region) of an image than to the background and
other parts, the feedback images may be similar only partially in semantic content and may vary
largely in low-level features. Hence, using low-level features alone may not be effective in
representing users’ feedback and in describing their intentions.
In addition, there are typically two different modes of user interactions involved in image
retrieval systems. In one case, the user types in a list of keywords representing the semantic
contents of the desired images. In the other case, the user provides a set of example
images as the input, and the retrieval system retrieves other similar images. In most
image retrieval systems, these two modes of interaction are mutually exclusive. However,
combining these two approaches and allowing them to benefit from each other will yield a
great deal of advantage in terms of both retrieval accuracy and ease of use of the system.
There have been efforts on incorporating semantics in relevance feedback for image
retrieval. The framework proposed in [11] (to be discussed later in more detail in this
section) attempted to embed semantic information into a low-level feature based image re-
trieval process using a correlation matrix. The FourEye system by Minka and Picard [14]
and the PicHunter system by Cox et al. [5] made use of hidden annotations through a learning
process. However, they excluded the possibility of benefiting from good annotations,
which may lead to very slow convergence.
In terms of feature selection, unlike most CBIR systems that use image features such
as color histogram or moments, texture, shape, and structure features, Tieu and Viola [25]
used a boosting technique to learn a classification function in a feature space of more than
45,000 features. The features were demonstrated to be sparse with high kurtosis, and were
argued to be expressive for high-level semantic concepts. Weak 2-class classifiers were
formulated based on Gaussian assumption for both the positive and negative (randomly
chosen) examples along each feature component, independently. The strong classifier is
then a weighted sum of the weak classifiers as in AdaBoost.
m irrelevant. The relevant as well as irrelevant images may or may not be from different
clusters. This approach memorizes such feedback by updating the correlation matrix as
below:
$$M_t = M_{t-1} + \sum_{i=1}^{n} F(q) F(p_i)^T - \sum_{i=1}^{m} F(q) F(n_i)^T, \qquad (10)$$
where q is the feature vector of the query, pi and ni are feature vectors of positive and
negative feedback samples, and F (x) is a transform function used to determine the update
magnitude based on the feedback samples. In this way, the correlation between the cluster
where the query originally falls and those where the positive samples fall is increased,
progressively embedding information on the semantic correlations between images. This
correlation is then used in subsequent retrievals, in which not only the visual features but
also the semantic correlations are used in determining the similarity of an image to the query.
Experiments have shown that such a progressive learning approach effectively utilizes the
knowledge learnt from previous queries to reduce the number of iterations to achieve high
retrieval accuracy [11].
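The update (10) amounts to adding the outer product of the transformed query with each positive example and subtracting it for each negative one; a sketch, with the transform F assumed to be applied by the caller:

```python
import numpy as np

def update_correlation(M, Fq, F_pos, F_neg):
    """Correlation-matrix update of eq. (10). Fq: transformed query vector;
    F_pos / F_neg: lists of transformed positive / negative feedback vectors."""
    M = np.asarray(M, dtype=float).copy()
    for fp in F_pos:
        M += np.outer(Fq, fp)     # strengthen query-positive correlation
    for fn in F_neg:
        M -= np.outer(Fq, fn)     # weaken query-negative correlation
    return M
```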
Also, if there are two distinct groups in one initial cluster which are semantically dissimilar,
meaning that they are negative examples to each other, a splitting is performed to split the
initial cluster into two clusters. On the other hand, when feedback indicates that two clusters
are close in feature space and have high correlation between them according to M, the
two initial clusters can be merged into one. That is, the correlation network dynamically
updates its structure in addition to updating the correlation matrix as it learns from user
feedback.
More recently, researchers have become aware of the fact that the Web is a rich resource of image data and
that some of the images’ semantics is usually available in the same web documents. Shen et al. [22]
exploited this fact and used natural language processing techniques to obtain semantic
features from web text to characterize web images. Hence, they are able to find
relevant images from the web using text-based queries. In our work on a web image search
engine, we also use web pages as potential sources of semantics. There are two
main differences between the two systems. The first difference is in the natural language processing
approach to obtaining semantic features. They use a so-called weighted chain-net, which
is actually a lexical chain, to represent the document space model for images, while our
document space model of all media objects is simply a vector space model, which is an
effective approach that has been widely used in traditional information retrieval. Other
natural language processing methods, such as proper noun identification, are also used to
extract semantic features. The second difference is that our system exploits relevance feedback
and data mining on the users’ feedback logs to update the document space model. As our
experiments indicate, our approach outperforms traditional CBIR systems and relevance feedback approaches.
Figure 2. The proposed framework of integrated relevance feedback and query expansion.
image features, and a machine learning algorithm to iteratively update the semantic net-
work and to improve the system’s performance over time. The system supports both query
by keyword and query by image example through semantic network and low-level fea-
ture indexing. More importantly, the learning process propagates the keyword annotations
from the labeled images to unlabeled ones during the feedback. In this way, more and more
images are implicitly labeled with keywords through the semantic propagation process. This
annotation propagation process also helps the system accumulate learned knowledge to
improve the performance of future retrieval requests.
The semantic network is a two-layered structure. The top layer is represented by a set of
keywords having links to the images in the database. It can be considered an extension
of the initial information embedding idea in the system shown in Figure 1. The degree
of relevance of the keywords to the associated images’ semantic content is represented as
the weight on each link, as shown pictorially in Figure 3. This layer is what we need in
keyword relevance feedback and will be updated during the semantic propagation. The bottom
layer is a keyword thesaurus used to construct connections between different keywords.
The initial weights can be obtained by manual labeling. In our web image search engine,
they are initially extracted from the following sources on the web page that contains the
image, according to some empirical rules.
1. Image filename and URL. We assume that web page authors/editors usually assign
meaningful filenames to images in a web page. Some heuristic rules are used to extract
keywords from the filenames. First, the filename is segmented into meaningful keywords
based on a predefined dictionary. For example, the filename “redflower.jpg” includes
two semantic words: “red” and “flower.” Then, the clutter letters in filenames, such
as digits, hyphens, filename extensions, etc., are discarded. We also extract semantic
keywords from the URLs of the image files. The URL usually represents the hierarchy
information of an image on the web page. For instance, “animal” and “bird” are
useful information in the URL http://www.ditto.com/images/animals/
anim_birds.jpg. We apply a similar technique to the filename segmentation to
segment the URL into meaningful pieces.
2. ALT (alternate) text. The ALT text in a web page is displayed in place of the
associated image in a text-based browser. Hence, it usually represents the semantics
of the image concisely and is therefore a very relevant feature for representing the semantic
meaning of the image.
3. Surrounding text. In web pages, images are used to enhance the content that the editors
want to present. Hence, some text in the surrounding areas is semantically relevant
to the content of the image. However, it is difficult to judge which area among the
four possible areas (above, below, left, right) is the most relevant to the image.
Therefore, in our prototype, all four areas are chosen as sources of text
features for the image. This feature will be refined by log mining on the users’ relevance
feedback logs, as discussed in Section 4.
4. Page title. The page title is a good candidate of the text feature of images in a web page.
5. Other information. Image hyperlinks, anchor text, etc., are also candidates of text fea-
tures of the images.
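The filename and URL heuristics of item 1 can be sketched as below; the tiny dictionary is purely illustrative, standing in for the predefined dictionary mentioned above.

```python
import re

# Illustrative mini-dictionary; a real system would use a full word list.
DICTIONARY = {"red", "flower", "animal", "animals", "bird", "birds", "images"}

def segment(token, dictionary=DICTIONARY):
    """Greedily segment a concatenated token like 'redflower' into
    dictionary words; digits, hyphens, underscores and unknown clutter
    are dropped."""
    token = re.sub(r"[\d_\-]+", " ", token.lower())
    words = []
    for part in token.split():
        i = 0
        while i < len(part):
            for j in range(len(part), i, -1):
                if part[i:j] in dictionary:
                    words.append(part[i:j])
                    i = j
                    break
            else:
                i += 1  # skip an unsegmentable character
    return words

def keywords_from_url(url):
    """Extract candidate keywords from an image URL path and filename."""
    path = re.sub(r"^https?://[^/]+/", "", url)   # drop scheme and host
    stem = re.sub(r"\.[a-z]+$", "", path)         # drop file extension
    kws = []
    for piece in stem.split("/"):
        kws.extend(segment(piece))
    return kws
```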
The initial value of the weight wij associated with each keyword of an image is calculated
by the TF*IDF method [19]. That is, a feature vector is used to represent all the keywords
of an image, and the vector is defined as
$$D_{ih} = TF_i \cdot IDF_i = \left[ t_{i1} \cdot \log \frac{N}{n_1}, \; \ldots, \; t_{ij} \cdot \log \frac{N}{n_j}, \; \ldots, \; t_{im} \cdot \log \frac{N}{n_m} \right], \qquad (11)$$
where Dih is the feature vector, with each component value corresponding to the initial
weight assigned to the association of a keyword with image i; tij stands for the frequency
of keyword j appearing in the text description of image i; nj is the number of images
that are characterized by keyword j; and N is the total number of images. Of course, if no
keyword information is available for the image, the corresponding feature vector is set to null.
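Equation (11) is a direct computation; the guard against a zero document frequency is an added safeguard:

```python
import math

def tfidf_vector(term_freqs, doc_freqs, n_images):
    """Initial keyword weights for one image, eq. (11).
    term_freqs[j]: frequency t_ij of keyword j in the image's text;
    doc_freqs[j]: number n_j of images characterized by keyword j;
    n_images: total number N of images."""
    return [tf * math.log(n_images / nf) if nf else 0.0
            for tf, nf in zip(term_freqs, doc_freqs)]
```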
With the semantic network, semantic based relevance feedback can be performed rela-
tively easily compared to its low-level feature counterpart. This is performed by updating
the weights wij associated with each link shown in Figure 3. The weight updating process
is described below.
1. A user submits a query and the system retrieves similar images using cross-modality
query expansion, to be explained in the next subsection.
2. The system collects the positive and negative feedback examples corresponding to the
query.
3. For each keyword in the input query, check to see if any of them is not in the keyword
database. If so, add them into the database without creating any links.
4. For each positive example, check to see if any query keyword is not linked to it. If so,
create a link with an initial weight from each missing keyword to this image. For all
other keywords that are already linked to this image, increase the weight by a predefined
value or using the method defined by (10) and (11).
5. Similarly, for each negative example, check to see if any query keyword is linked with
it. If so, decrease its weight, but not below zero.
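Steps 3–5 can be sketched over a dictionary of (keyword, image) link weights; the initial weight and step size below are hypothetical, since the “predefined value” is left unspecified here.

```python
def update_semantic_links(links, query_keywords, positives, negatives,
                          init_w=1.0, step=0.2):
    """links maps (keyword, image_id) -> weight. Positive feedback creates
    missing links and strengthens existing ones; negative feedback weakens
    links, never below zero."""
    for img in positives:
        for kw in query_keywords:
            key = (kw, img)
            links[key] = links[key] + step if key in links else init_w
    for img in negatives:
        for kw in query_keywords:
            key = (kw, img)
            if key in links:
                links[key] = max(0.0, links[key] - step)
    return links
```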
Through this updating process, the keywords that represent the actual semantic content
of each image will receive larger weights. Also, it can easily be seen that as more queries
are input into the system, the system is able to expand its vocabulary. Furthermore,
a semantic propagation method is used to propagate keywords to unlabeled images during
the user’s feedback iterations, which will be described later in this section.
The proposed framework has an integrated relevance feedback scheme in which both low-level
feature based and high-level semantic feedbacks are performed. We define a unified
metric function G to measure the relevance between query Q' and any image j within an
image database in terms of both semantic and low-level feature content, where Q' includes
the original query and the users’ feedback information:
$$G(j, Q') = \alpha \cdot \mathrm{sim}_k(j, Q'_k) + (1 - \alpha) \cdot \mathrm{sim}_f(j, Q'_f), \qquad (12)$$
where α ∈ [0, 1] is the weight of the semantic relevance in the overall similarity measure,
which can be specified by users. The larger α is, the more important a role the semantic
relevance plays in the overall similarity measurement. sim_k(j, Q'_k) and sim_f(j, Q'_f) are the
semantic similarity and the low-level feature similarity, respectively, between image j and
the revised query Q'.
The revised query Q' consists of two parts: the feature-based part Q'_f and the semantic
(keyword)-based part Q'_k. Q'_f is defined by (3)–(5) based on the feature vectors of feedback
images. With the semantic network, sim_k(j, Q'_k) can be directly computed with the updated
weights.
To further improve the retrieval performance of the proposed framework, a cross-
modality query expansion method is supported. That is, once a query is submitted in
the form of keywords, the images retrieved by keyword search are considered positive
examples, based on which the query is expanded by the features of these images.
This is done by first searching the semantic network, shown in Figure 3, for the keywords.
Then, the visual features of the images that contain these keywords (referred to as training
images) are incorporated into the expanded query, qi*, defined by (5).
Since the expanded query qi* is defined by the feature vectors of all images associated
with the query keywords, we need to determine the relevance vector R in (5) so that a proper
relevance factor r is assigned to each image feature vector. Therefore, we introduce a relevance
factor rij of the ith keyword association to the jth image, defined as
$$r_{ij} = \frac{w_{ij}}{\sum_{l=1}^{N_j} w_{lj}}, \qquad (13)$$
where Nj is the number of keywords linked to image j. This is the relative weighting of matched keyword i over all keyword weights of image j.
This relevance factor is defined in accordance with information retrieval theory. That is, the
importance to an image of a keyword whose links spread over a large number of
images in the database should be penalized. We use this relevance factor in (5) to compute
the expanded query in the feature space.
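The relevance factor (13) is a simple normalization over an image's keyword weights:

```python
def relevance_factor(weights_of_image, i):
    """Eq. (13): relative weight of matched keyword i among all keyword
    weights of one image. weights_of_image: list of w_lj for image j."""
    total = sum(weights_of_image)
    return weights_of_image[i] / total if total else 0.0
```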
One can consider this expanded query as a result of relevance feedback, except that
the feedback images are obtained by semantic network search. Using this approach may
seem dangerous at first, since some images may have keyword associations which the user
does not intend to search for. However, the goal here is to generate a set of queries that is
guaranteed to contain the user’s intended search results. The query can then be narrowed
down by including more feedback images through the relevance feedback cycle.
For query by image example, a similar procedure is used to extend the retrieval from
the feature space to the semantic space through the semantic network. In this way, user input
information is utilized as much as possible to improve the retrieval performance.
Using the methods described above, we can perform the combined semantic and feature-
based relevance feedback as follows.
1. Collect the user query and expand it.
2. Compute the combined similarity according to (12) to retrieve the initial set of images
using the expanded query.
3. Collect positive and negative feedback from the user.
4. Compute the feature vectors xni and the degree-of-relevance vector R of the retrieved
images to form the revised feature-based query Qf defined by (5).
5. Update the weights in the semantic network. The new weights implicitly define the revised
keyword-based query Qk by defining simk (j, Qk ) in (12).
6. Use the revised query to compute the ranking score for each image based on (12) and
sort the results.
7. Show the new retrieval results and go to step 3.
From this query and combined feedback process, we can see that the system learns from the
user's feedback both semantically and in a feature-based manner. In addition, it is easy to
see that our method degenerates into the method of [18] when no semantic information
is available.
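The seven steps above can be sketched as a minimal, runnable loop body. The similarity and update rules below are simplified stand-ins for (5) and (12), and the dictionary-based image and query representations are toy structures we invent for illustration.

```python
import math

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0

def combined_similarity(image, query, alpha=0.5):
    # Steps 2/6: linear combination of feature and keyword similarity (cf. (12)).
    sim_f = cosine(image["features"], query["features"])
    sim_k = len(image["keywords"] & query["keywords"]) / max(len(query["keywords"]), 1)
    return alpha * sim_f + (1 - alpha) * sim_k

def revise_query(query, positives):
    # Step 4: pull the feature query toward the positive examples
    # (a Rocchio-style update standing in for (5)).
    n = len(positives)
    centroid = [sum(p["features"][d] for p in positives) / n
                for d in range(len(query["features"]))]
    query["features"] = [(q + c) / 2 for q, c in zip(query["features"], centroid)]
    # Step 5: grow the keyword query from the positives' annotations.
    for p in positives:
        query["keywords"] |= set(p["keywords"])
    return query

db = [{"features": [1.0, 0.0], "keywords": {"tiger"}},
      {"features": [0.0, 1.0], "keywords": {"sunset"}}]
query = {"features": [0.9, 0.1], "keywords": {"tiger"}}
ranked = sorted(db, key=lambda im: combined_similarity(im, query), reverse=True)
```

Each feedback round re-ranks the database with the revised query, so both the feature centroid and the keyword set sharpen toward the user's intent.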
As illustrated in Figure 3, the more images are (correctly) annotated, the better the system's
retrieval performance will be. In reality, however, human labeling of images is tedious
and expensive, hence not a feasible solution; this is what motivated CBIR research
fifteen years ago. To address this issue, a probabilistic progressive keyword propagation
scheme is proposed in our framework to automatically annotate images in the database
during the relevance feedback process, based on a small percentage of annotated images.
We assume that initially only a few images in a database have been manually labeled with
keywords, and retrieval is performed mainly on low-level features. As stated before,
the initial keyword annotations can come from the Web through the crawler when the images
are collected from the Web, or be provided by human labelers. While the user interacts with the
system by providing feedback in a query session, a progressive learning process is activated
to propagate the keyword annotations from labeled images to unlabeled ones, so that more
and more images are implicitly labeled by keywords. In this way, the semantic network is
updated such that the keywords with a majority of user consensus emerge as the dominant
representation of the semantic content of their associated images. As more queries
are input into the system, the system is able to expand its vocabulary. Also, through
the propagation process, the keywords that represent the actual semantic content of each
image receive large weights.
There are two major issues in keyword propagation: which images and which key-
word(s) should be propagated during a query session. To answer the first question, a
probability model based on Bayesian learning is proposed. We assume that (1) all pos-
itive examples in one retrieval session belong to the same semantic class with common
semantic object(s) or meaning(s); and (2) the features of the same semantic class follow
a Gaussian or mixture-of-Gaussians distribution. Therefore, all positive examples in a
query session are used to calculate and update the parameters of the corresponding seman-
tic Gaussian class. Then, the probability that each image in the database belongs to this
semantic class is calculated. The common keywords of the positive examples are propagated
to the images that belong to this class with very high probability.
As we can see, the propagation framework uses the same procedure as the feedback algo-
rithm on low-level features [23]. The only difference is that for low-level feature feedback,
the calculated probability is used to rank an image in the retrieval candidate list,
while here it is used to determine whether an image should be in the propagation candidate list.
The propagation candidate set S is obtained as follows:
$$S = \{c_1, \ldots, c_k\}, \quad \text{where } p(c_j) > \psi, \qquad (14)$$
where p(cj ) is the probability that image j in the database belongs to this semantic
class and ψ is a constant threshold that can be estimated in the training process. The
weight associated with the propagated keyword i and image j is wij = p(cj ). More
complex distribution models, for example mixtures of Gaussians, could be used in this propaga-
tion framework. However, because the user's feedback examples are often very few in practice,
complex models lead to much larger parameter estimation errors, as there are
more parameters to estimate.
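A minimal sketch of the propagation step: a single diagonal Gaussian is fitted to the session's positive examples, and keywords are propagated to images whose score exceeds ψ as in (14). Using the raw density directly as the score p(cj ) is a simplification of the paper's Bayesian formulation, and all names below are illustrative assumptions.

```python
import math

def fit_gaussian(positives):
    """Mean and per-dimension variance of the positive feature vectors."""
    d, n = len(positives[0]), len(positives)
    mean = [sum(x[k] for x in positives) / n for k in range(d)]
    var = [max(sum((x[k] - mean[k]) ** 2 for x in positives) / n, 1e-6)
           for k in range(d)]
    return mean, var

def gaussian_prob(x, mean, var):
    """Density of a diagonal Gaussian, used here as the class score."""
    p = 1.0
    for k in range(len(x)):
        p *= (math.exp(-((x[k] - mean[k]) ** 2) / (2 * var[k]))
              / math.sqrt(2 * math.pi * var[k]))
    return p

def propagation_candidates(database, positives, psi):
    """Images whose score exceeds psi, as in (14); the score also serves
    as the weight w_ij of the propagated keyword."""
    mean, var = fit_gaussian(positives)
    scores = {j: gaussian_prob(x, mean, var) for j, x in database.items()}
    return {j: s for j, s in scores.items() if s > psi}
```

An image close to the positives in feature space clears the threshold and receives their common keywords; a distant one does not.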
The image set used to evaluate the proposed framework described in this section is the
Corel Image Gallery of 10,000 images, manually labeled into 79 semantic categories.
200 randomly selected images compose the test query set. Whether a retrieved image is
correct or incorrect is judged according to this ground truth. Three types of color features
and three types of texture features are used in our system. The feedback process runs as
follows. Given a query from the test set, a different test image of the same category as the
query is used in each round of feedback as the positive example for updating the
Gaussian parameters and revising the query. To incorporate negative feedback, the first two
irrelevant images are assigned as negative examples. The accuracy is defined as
$$\text{Accuracy} = \frac{\text{relevant images retrieved in top } N \text{ returns}}{N}. \qquad (15)$$
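The measure in (15) amounts to the following one-liner (a sketch; the image identifiers are arbitrary):

```python
def accuracy(top_n_results, relevant_set):
    """Fraction of the top-N returned images that are relevant, as in (15)."""
    return sum(1 for img in top_n_results if img in relevant_set) / len(top_n_results)

# e.g. three relevant images among the top 5 returns:
accuracy(["a", "b", "c", "d", "e"], {"a", "c", "e", "z"})  # 0.6
```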
Several experiments have been performed. First, three feature-based feedback
algorithms are compared: the Bayesian feedback scheme by Su et al. [23,24],
the scheme of [27], and the scheme of [17] as defined by (5)–(7). This comparison is done
in the same feature space. Figure 4 shows that the accuracy of the Bayesian feedback scheme
(referred to as "our feedback approach") becomes higher than that of the other two methods after
two feedback iterations. This demonstrates that the incorporated Bayesian estimation with
the Gaussian parameter-updating scheme improves retrieval effectively.
To demonstrate the performance of the semantic propagation, the following experiment
was designed. The 200 images in the query set were annotated with their category names, so
only one keyword is associated with each query image, and the other images in the database
have no keyword annotations. During the test, each query image was used twice. Figure 5
compares the retrieval performance with and without propagation. For feedback with
propagation, the retrieval accuracy is much higher than without it. This is because,
when the system has the propagation ability, later queries can utilize the knowledge
accumulated from previous feedback iterations. In other words, the system has learning
ability and becomes smarter with more user interactions.
Figure 4. Retrieval accuracy for top 100 results in original feature space.
Figure 5. Retrieval accuracy for top 100 results: feedback without propagation versus feedback with the propagation scheme.
The architecture of our proposed web image search engine is shown in Figure 6. In addition
to all components in a CBIR system, the web search engine contains an image crawler and
three other modules, namely, the log miner, the model updater, and the query updater [3,4].
The data organization of the system mainly consists of four parts: the image database, which
also contains image metadata (i.e., low-level and high-level features); the user's relevance
feedback log database; the document space model; and the user space model.
A typical scenario of the system is as follows. The off-line crawler is first employed at
regular intervals (e.g., once every day at non-peak network traffic hours) to collect potential
web pages containing images and store them into a local database. The feature extractor is
then applied to these pages to extract both the low-level visual features and the high-level
semantic features for the images that appear in these pages. In our system, the crawler and the
feature extractor actually work simultaneously. An image indexer is applied to the images
and their features to build the document space model, which is the representation of the
images in the database using their features. Once the document space model is available,
the matcher compares the user’s query with the document space model of images to yield
the image retrieval results. Since many irrelevant images may be returned by the retrieval
system, a user feedback interface is also provided so that users can specify whether a returned
image is relevant to their intent. The image retrieval system can utilize this
feedback to gain an understanding of the relevancy of certain images and update the
query or adjust the matcher to return more accurate retrieval results. The user's feedback
log data are also stored in the user log database of the system, from which the log miner
can find and build the user space model through log analysis. The user space model is
then combined with the document space model to update the latter, eliminating the mismatch
between the page author's expression and the user's understanding and expectations,
which further improves retrieval accuracy.
The document space model in the image search engine combines the low-level visual fea-
tures and high-level semantic features to index the images on the Web. The detailed process
is described as follows.
To collect images from the Web, a crawler (or spider, a program that automatically
analyzes web pages and downloads the pages hyperlinked from them) is used to gather
images from many websites. First, we re-arrange the semantic network shown in Figure 3
into a concept hierarchy of image categories, such as "animals," "architecture," "arts," etc.
Then, we select some representative sites to be collected for each concept category, for
instance, http://www.nba.com for sports, http://www.cnn.com for news, and
http://www.disney.com for entertainment. For each candidate site, the crawler
collects the images and saves them to a local web page database. We then use a simple
classifier to separate the images into meaningful and junk (e.g., banners, backgrounds,
buttons, icons, etc.) categories based on information such as color histograms, image
sizes, and image file types.
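The junk-versus-meaningful split described above might look like the following rule-based sketch; the rules and thresholds are illustrative guesses, not the system's actual classifier.

```python
def is_junk(width, height, file_type, n_distinct_colors):
    """Heuristic junk filter for crawled web images.
    All thresholds below are made-up illustrations."""
    if file_type == "gif" and n_distinct_colors < 16:
        return True   # flat-color buttons, icons, bullet graphics
    if width < 32 or height < 32:
        return True   # too small to be a content image
    aspect = width / height
    if aspect > 5 or aspect < 0.2:
        return True   # banner-like aspect ratio
    return False
```

In practice such rules are cheap to evaluate at crawl time, so only the surviving "meaningful" images need feature extraction and indexing.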
For each image collected, the initial keywords are assigned as described in
Section 3.1. In addition, the low-level features of each image are calculated. The keywords
and low-level features of all collected images form the document space.
In the image search process, the overall similarity is simply a linear combination of the
visual and textual similarities, as defined in (12). Setting a fixed default weight α = 0.5
in (12) to balance the importance of low-level and high-level features is not ideal, but it
is an efficient way to build the baseline configuration of our image retrieval system. The
weight is then automatically adjusted to a suitable value by the system through the user's
feedback on the relevancy of returned images.
Moreover, after we collect enough user feedback log information, data mining
technology (presented in the next section) can be applied to find the relative
importance of low-level and high-level features for different concepts/categories.
For example, we find that for the concept "Clinton," the high-level features are more important
than the low-level features, while for the concept "sunshine," the low-level features are more
useful than the high-level features.
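The per-concept weighting just described can be sketched as a small lookup around the linear combination in (12); the concept-to-α table below is a made-up illustration of that observation, not mined values.

```python
# Hypothetical per-concept weights (alpha weights the visual similarity);
# in the real system these would come from log mining, not be hand-set.
ALPHA_BY_CONCEPT = {"clinton": 0.2, "sunshine": 0.8}

def overall_similarity(sim_visual, sim_textual, concept=None, default_alpha=0.5):
    """Linear combination as in (12), with a concept-specific alpha when
    one has been learned, and the baseline alpha = 0.5 otherwise."""
    alpha = ALPHA_BY_CONCEPT.get(concept, default_alpha)
    return alpha * sim_visual + (1 - alpha) * sim_textual
```

A "sunshine" query thus leans on visual similarity (α = 0.8), while a "clinton" query leans on the textual score.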
In order to reduce the ambiguity in the text descriptors extracted from web pages and
the low-level image features, and to improve the search performance, we have proposed
a user space model to supplement the original document space model. This is achieved
by applying a user log analysis process. The user space model is also a vector space
model. The difference between the user space model and the document space model is that
vectors in the user space model are constructed from the information mined from the user
feedback log data, not from the original information extracted from the web pages. When
a user submits a query, our system will return to the user some images found based on the
original document space model. The user can then use the feedback interface to tell
the system whether each returned image is relevant or irrelevant to the query, based on
his/her subjective judgment. Of course, most users do not have the patience and time to
mark all relevant and irrelevant images in the returned collection. However, this is
not a serious problem, because even a small set of feedback images can provide very
useful information.
After we obtain some user feedback log data, the user space model can be built from the
user log. Let Q be the set of all queries issued so far. Let Tj (j = 1, . . . , NT ) be the
set of all individual words that appear in Q. (Note that a single query may contain multiple
words.) For a query in Q, Iri is one of the relevant images and Iii is one of the irrelevant
images specified by the user and stored in the user log.
From the user log, we can easily calculate the probabilities listed below:
$$P(I_{ri}) = \frac{N_{ri}}{N_Q}, \qquad (16)$$
where Nri is the number of queries for which image Iri has been retrieved and marked as
relevant, and NQ is the total number of queries.
$$P(I_{ri} \mid T_j) = \frac{N_{ri}(T_j)}{N_Q(T_j)}, \qquad (17)$$
where Nri (Tj ) is the number of queries containing word Tj for which image Iri has been
retrieved and marked as relevant, and NQ (Tj ) is the number of queries that contain Tj .
$$P(T_j) = \frac{N_Q(T_j)}{N_Q}. \qquad (18)$$
By Bayes' theorem, we have
$$P(T_j \mid I_{ri}) = \frac{P(I_{ri} \mid T_j)\, P(T_j)}{P(I_{ri})}. \qquad (19)$$
In addition, for irrelevant images in the user log, we have
$$P(I_{ii} \mid T_j) = \frac{N_{ii}(T_j)}{N_Q(T_j)}, \qquad (20)$$
where Nii (Tj ) is the number of queries containing word Tj for which image Iii has been
retrieved and marked as irrelevant.
For a given image I , the probabilities P (Tj |I ) (j = 1, . . . , NT ) calculated using (19)
form a vector for I . We call this vector the user space model of image I , in contrast to the
document space model of image I , which is built from the related features extracted from
the web pages.
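The estimates (16)–(19) can be computed directly from a feedback log; the log format below (a list of per-query records of words and relevant images) is an assumption for illustration.

```python
def user_space_probs(log, image, word):
    """Estimate P(T_j | I_ri) via (16)-(19) from a toy feedback log.
    Each log entry: {"words": set of query words, "relevant": set of images
    the user marked relevant}.  Assumes `word` occurs in at least one query
    and `image` was marked relevant at least once."""
    n_q = len(log)
    n_q_t = sum(1 for e in log if word in e["words"])
    n_r = sum(1 for e in log if image in e["relevant"])
    n_r_t = sum(1 for e in log if word in e["words"] and image in e["relevant"])
    p_i = n_r / n_q                        # (16)
    p_i_given_t = n_r_t / n_q_t            # (17)
    p_t = n_q_t / n_q                      # (18)
    return p_i_given_t * p_t / p_i         # (19), Bayes' rule

log = [
    {"words": {"tiger"}, "relevant": {"img1"}},
    {"words": {"tiger", "zoo"}, "relevant": {"img1", "img2"}},
    {"words": {"sunset"}, "relevant": {"img3"}},
]
p = user_space_probs(log, "img1", "tiger")  # 1.0 for this toy log
```

Evaluating this for every word Tj yields the user space vector of an image.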
If we have a large collection of user log data, it is reasonable to say that the information
in the user space model is more accurate than the information in the original document
space model. However, as we have previously stated, few users tag all relevant and
irrelevant images in the retrieval results. Hence, the user feedback log is usually
insufficient, which makes the user space model less comprehensive than the original document
space model. Therefore, we cannot replace the document space model with the user space
model completely. Instead, we integrate the user space model into the original document
space model to improve the accuracy of the final document space model.
For each image I , let vector U be its feature vector in the user space model and vector D
its textual feature vector in the document space model. We simply use the linear combination
method to integrate these two vectors. We use Dnew to denote the updated document space
model, which is calculated using
$$D_{\text{new}} = \eta U + (1 - \eta) D, \qquad (21)$$
where η adjusts the weight between the user space model and the document space model;
in effect, η is the confidence in the vector U of the user space model. If the vector in the
user space model is accurate and comprehensive enough, we can assign η a value very close
to 1.0; otherwise, η should be relatively small. The number of times an image has been
marked in user feedback can be used to determine the value of η for that image. Obviously,
if one image has been marked in user feedback more times than another, its feedback
information should be more accurate and comprehensive, so the confidence in its vector U
should be higher and it can be assigned a larger η.
Since irrelevant images are also recorded in the user feedback log, we can utilize
this information as well. For each irrelevant image Iii , we use P (Iii |Tj ) as the confidence
that Iii is irrelevant to query word Tj and form a vector I from these values. We denote by
Dfinal the text feature vector of the image in the final document space model and calculate it
using (22), in a manner similar to the TF ∗ IDF method:
$$D_{\text{final}} = D_{\text{new}} \cdot (1 - I). \qquad (22)$$
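The update described above can be sketched in a few lines: the user-space vector U is blended into the document vector D with confidence η, and the irrelevance vector I then damps the result as in (22). The plain-list vector representation and element-wise operations are assumptions for illustration.

```python
def update_document_vector(D, U, I, eta):
    """Blend user-space vector U into document vector D with confidence eta
    (the eta-weighted linear combination described in the text), then damp
    words the log marks irrelevant, element-wise, as in (22)."""
    d_new = [eta * u + (1 - eta) * d for u, d in zip(U, D)]
    return [dn * (1 - i) for dn, i in zip(d_new, I)]
```

With η = 0.5, a word the users confirm gains weight, while a word with high irrelevance confidence is suppressed in the final model.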
4.3. Experiments
Based on the proposed architecture, a demo image search engine called iFind© has been
developed at Microsoft Research Asia. The graphical interface is shown in Figure 7.
The search options that iFind supports include:
• Keyword-based search. One can type one or more keywords, such as girl, into the
textbox and start the retrieval. The resulting images are displayed over several pages in
the browse mode.
• Query by example. If the “Similar” hyperlink under an image is selected, the system
will retrieve some similar images that are semantically/visually similar to the example
image.
• Relevance feedback. The system improves its retrieval performance after the user
provides some positive and/or negative examples. One can expect much better results
after several iterations of feedback.
• Log mining. The retrieval performance of the system is greatly improved after the
off-line log mining process, so each user benefits from other users' usage.
To illustrate the improvement brought by log mining in image search, we show here some
evaluation results based on three system configurations: (1) the baseline system, which
provides only query and retrieval; (2) the feedback system, which provides user feedback
in addition to the baseline functionality; and (3) the full configuration, which also includes
user log mining.
In our experiments, we selected more than 2000 representative image websites. The
intelligent crawler is used to collect the images from these sites. All related semantic
features, including image filenames, ALT texts, surrounding texts, and page titles, as well
as the low-level visual features, are extracted by the feature extractor at the same time.
The images are stored in the database and indexed with their textual and visual features. In
total, we have collected more than 30,000 images from these websites. It is difficult for us
to calculate the recall of the system because browsing the entire image database and
specifying the ground truth manually is a tedious job. Therefore, we only choose 17 queries
Figure 8. The average precision–recall curve of the system’s retrieval performance for all queries.
5. Conclusions
In this paper, we have presented a framework that combines relevance feedback with
semantic learning in CBIR, and we have extended the framework to a web image search
engine by incorporating user log mining to refine search accuracy. This new framework
makes the image retrieval system superior to either classical CBIR or text-based systems.
Publisher’s note
This article is based on the original conference paper published by Kluwer Academic Pub-
lishers in Visual and Multimedia Information Management, edited by Xiaofang Zhou and
Pearl Pu. ISBN: 1-4020-7060-8. © 2002 by International Federation for Information
Processing.
References
[1] C. Buckley and G. Salton, “Optimization of relevance feedback weights,” in Proceedings of SIGIR’95,
1995.
[2] S. K. Chang, C. W. Yan, D. C. Dimitroff, and T. Arndt, “An intelligent image database system,” IEEE
Transactions on Software Engineering 14(5), 1988.
[3] Z. Chen, W. Liu, C. Hu, M. Li, and H. J. Zhang, “iFind: A web image search engine,” in Proceedings of
SIGIR2001, 2001.
[4] Z. Chen, W. Liu, F. Zhang, M. Li, and H. J. Zhang, “Web mining for web image retrieval,” Journal of the
American Society for Information Science and Technology 52(10), August 2001, 831–839.
[5] I. J. Cox, T. P. Minka, T. V. Papathomas, and P. N. Yianilos, “The Bayesian image retrieval system,
PicHunter: Theory, implementation, and psychophysical experiments,” IEEE Transactions on Image
Processing, Special Issue on Digital Libraries, 2000.
[6] M. Flickner, H. Sawhney, W. Niblack et al., “Query by image and video content: The QBIC system,” IEEE
Computer Magazine 28, 1995, 23–32.
[7] J. Huang, S. R. Kumar, and M. Mitra, "Combining supervised learning with color correlograms for content-
based image retrieval," in Proceedings of ACM Multimedia'97, November 1997, pp. 325–334.
[8] Y. Ishikawa, R. Subramanya, and C. Faloutsos, “Mindreader: Query databases through multiple examples,”
in Proceedings of the 24th VLDB Conference, New York, 1998.
[9] F. Jing, M. Li, H. J. Zhang, and B. Zhang, “An effective region-based image retrieval framework,” in
Proceedings of ACM Multimedia 2002, Juan-les-Pins, France, December 1–6, 2002.
[10] J. Laaksonen, M. Koskela, and E. Oja, "PicSOM: Self-organizing maps for content-based image retrieval,"
in Proceedings of the International Joint Conference on Neural Networks, July 1999.
[11] C. Lee, W. Y. Ma, and H. J. Zhang, “Information embedding based on user’s relevance feedback for image
retrieval,” in Proceedings of SPIE International Conference on Multimedia Storage and Archiving Sys-
tems IV, Boston, 19–22 September 1999.
[12] Y. Lu et al., “A unified framework for semantics and feature based relevance feedback in image retrieval
systems,” in Proceedings of ACM MM2000, 2000.
[13] S. D. MacArthur, C. E. Brodley, and C.-R. Shyu, “Relevance feedback decision trees in content-based image
retrieval,” in IEEE Workshop on Content-Based Access of Image and Video Libraries, 2000, pp. 68–72.
[14] T. Minka and R. Picard, “Interactive learning using a ‘Society of Models’,” Pattern Recognition 30(4), 1997.
[15] T. Mitchell, Machine Learning, McGraw-Hill, 1997.
[16] J. J. Rocchio Jr., “Relevance feedback in information retrieval,” in The SMART Retrieval System: Experi-
ments in Automatic Document Processing, ed. G. Salton, Prentice-Hall, 1971, pp. 313–323.
[17] Y. Rui and T. S. Huang, “A novel relevance feedback technique in image retrieval,” in Proceedings of 7th
ACM Conference on Multimedia, 1999.
[18] Y. Rui, T. S. Huang, and S. Mehrotra, “Content-based image retrieval with relevance feedback in MARS,”
in Proceedings of IEEE International Conference on Image Processing, 1997.