
Ten Years of Relevance Score for Content Based Image Retrieval

Lorenzo Putzu(2), Luca Piras(1,2), and Giorgio Giacinto(1,2)

(1) Pluribus One, Cagliari, Italy
(2) Department of Electrical and Electronic Engineering, University of Cagliari, Piazza d'Armi, 09123 Cagliari, Italy
{lorenzo.putzu,luca.piras,giacinto}@diee.unica.it
http://pralab.diee.unica.it

Abstract. After more than 20 years of research on Content-Based Image Retrieval (CBIR), the community is still facing many challenges in improving the retrieval results by filling the semantic gap between the user needs and the automatic image description provided by different image representations. Including the human in the loop through Relevance Feedback (RF) mechanisms turned out to help improve the retrieval results in CBIR. In this paper, we claim that Nearest Neighbour approaches still provide an effective method to assign a Relevance Score to images after the user labels a small set of images as being relevant or not to a given query. Although many other approaches to relevance feedback have been proposed in the past ten years, we show that the Relevance Score, while simple in its implementation, attains superior results with respect to more complex approaches and can be easily adopted with any feature representation. Reported results on different real-world datasets with a large number of classes, characterised by different degrees of semantic and visual intra- and inter-class variability, clearly show the current challenges faced by CBIR systems in reaching acceptable retrieval performances, and the effectiveness of Nearest Neighbour approaches to exploit Relevance Feedback.

Keywords: Image retrieval · Image description · Relevance feedback · Nearest neighbour

1 Introduction

Content based image retrieval (CBIR) systems include all the approaches to retrieve images from large repositories by analysing the visual and semantic content of the images. A CBIR system represents each image in the repository as a set of low- and mid-level features, such as colour, texture, shape, connected components, etc., and uses a set of distance functions defined over these feature spaces to estimate the similarity between images. The goal of a CBIR system is to retrieve the set of images that best suits the user's intention, formulated using an image as the query. The performance of a CBIR system is strongly related not only to the employed feature representation, but also to the distance functions used to measure the similarity between images [12,41]. The implicit assumption is that the similarity between images is related to a distance defined over a particular feature space. But since an image may be characterized by a large number of concepts, this leads to the so-called semantic gap between the real user image interpretation and the semantics induced from the low-level features. Furthermore, the availability of personal devices and the possibility for people to capture an unlimited number of photos and videos have increased the amount of multimedia documents, so the need for efficient and effective image retrieval systems has become crucial. Indeed, in order to accurately extract information from this vast amount of data, retrieval methods need to quickly discard irrelevant information and focus on the items of interest.
To this end, and to reduce the semantic gap introduced by the automatic image representation, most of the recent CBIR systems have adopted a mechanism of Relevance Feedback (RF) [2,35,43]. RF is a mechanism initially developed to improve text-based information retrieval systems [36], and then extended to CBIR systems in [38]. RF techniques involve the user in the process of refining the search, which becomes an iterative process in which the original query is refined interactively to progressively obtain a more accurate result. In a CBIR system, after the user submits a query image, the system retrieves a series of images and asks the user to label the returned images as relevant or non-relevant to the query image. The model of the user's intention is then updated according to these labels, producing a new set of images that is expected to contain a larger number of relevant images. The relationship between the query image and any other image in the database is expressed using a relevance value, which is aimed at directly reflecting the user's intention. Many different approaches have been proposed to estimate this relevance value. In this context, nearest neighbour approaches proved to be particularly effective [14]. The aim of these methods is to produce a Relevance Score for each image by using the distances to its relevant and non-relevant neighbours. Intuitively, an image receives a higher relevance score the smaller its distance to the nearest relevant image is, compared to the distance to its nearest non-relevant image. Although many other approaches to relevance feedback have been proposed in the past ten years, we believe that the Relevance Score can still be considered the state-of-the-art approach for relevance estimation.
Other Relevance Feedback techniques proposed in the literature involve the optimization of one or more CBIR components, such as the formulation of a new query or the transformation of the feature space. This kind of approach extracts global properties pertaining to relevant images, derived from the set of relevant and non-relevant images retrieved so far. Such global optimization can suffer from the small sample problem, as the number of images displayed to the user for labelling is usually small (e.g., 20 images at a time). In particular, as the size of the database increases, it is very likely that only a small number of relevant images is retrieved in response to the user's query. As a consequence, the modifications of CBIR parameters based on such information may be unreliable or provide poor performance improvements [16]. Nearest Neighbour approaches, instead of trying to estimate global properties pertaining to relevant images, estimate image relevance locally. As a result, such a mechanism is apt to identify classes of images with complex boundaries.
This paper introduces the reader to the major approaches proposed in the literature to create an effective image retrieval system. Section 2 describes the basic concepts behind the techniques proposed in the content based image retrieval field. Section 3 summarizes the main categories of features used for this task. Section 4 describes some approaches proposed so far to exploit the relevance feedback, including the Relevance Score. Section 5 proposes a comparison of different RF strategies on real-world challenging datasets, showing the effectiveness of the Relevance Score in different retrieval scenarios and employing different feature representations. Conclusions and future research perspectives are drawn in Sect. 6.

2 Architecture of CBIR Systems


Many CBIR systems based on the automatic analysis of the image content from a computer perspective have been proposed over the years. They present many advantages over traditional image retrieval based on textual queries only. In particular, they avoid either relying on costly professional textual annotations, which are needed to deem the labels and tags as being reliable, or relying on social labels and tags, for which an estimation of their relevance needs to be computed. Moreover, CBIR systems present advantages even in cases where textual annotations are already present, since such annotations could be focused on just some aspects of the image and neglect other important contents [11]. Most of the existing successful CBIR systems are tailored to specific applications, usually referred to as narrow domain systems, and consequently to specific retrieval problems, such as Computer Aided Diagnosis systems [27], sport events [12], and cultural heritage preservation [7]. In all the previous examples, the objects of interest related to a given query can be either precisely defined or easily modelled thanks to the knowledge of the semantic content of the images.
Thus, it is easier to manage a CBIR system designed for narrow domain archives, as the semantic content to be searched is clearly defined [41]. Instead, the design of a general purpose multimedia retrieval system is still a challenging task, as such a system should be capable of adapting to different semantic contents and different intents of the users. A huge number of works have been devoted to the special cases of instance retrieval or near duplicate retrieval [19,45], which consist in retrieving images showing exactly the same object as the query image. But in most real cases, the user is not interested in retrieving images with the same subject, but images with similar content. As a consequence, this could create some ambiguity, in particular if the image contains multiple objects, or if some objects are just parts of a larger object. This happens because different users, for a given image, may refer to different regions and objects in the image. Furthermore, an object could even belong to multiple orthogonal classes. Different mechanisms have been proposed to manage these kinds of ambiguity: most of them require the user feedback to improve the retrieval results, while others are mainly based on defining a strong set of features with the most appropriate similarity measure.

3 Features in CBIR
In a CBIR system, the description and representation of the content of a specific image can be provided in multiple ways, according to three levels of abstraction [12]. The first level includes the representation of primitive characteristics, or low-level features, such as colour, texture and basic geometric shapes. This kind of information, although very simple and easy to extract, does not guarantee reliability and accuracy. The second level is devoted to providing a more detailed description of the elements mentioned above, representing more complex objects by means of their aggregation and spatial arrangements: these are precisely the logical characteristics, or mid-level features. The common perception of an image or a scene is not always related to its description through such low- and mid-level features. Indeed, an image can be seen as the representation of different concepts, related not only to primitive characteristics but also to emotions and memories. For these reasons, a third level, called the semantic level, is used. It includes the description of abstract features, such as the meaning of the scenario, or the feelings induced in an observer.
While most of the current CBIR systems employ low-level and mid-level features [12,30], aimed at taking into account information like colour, edge and texture [6], CBIR systems that address specific retrieval problems leverage different kinds of features that are specifically designed [40]. Some works exploited low-level image descriptors, such as SIFT [44], originally proposed for object recognition [25], different colour histograms [23], or a fusion of textual and visual information [11] for scene categorization. More specific low-level features designed for other applications have also been used in CBIR systems, such as the HOG [10] and LBP [28] descriptors, originally proposed for pedestrian detection and texture analysis, respectively. Obviously, these features do not fit CBIR systems designed for less specific applications and general purpose retrieval systems [30]. Furthermore, while low-level visual descriptors help improve the retrieval performances for particular purposes, they often fail to provide a high-level understanding of the scene. Several CBIR systems that use a combination or fusion of different image descriptors have been proposed in the literature [32]. Fusion approaches are usually categorized into two classes, namely early and late fusion approaches. Early fusion approaches are very common in the image retrieval field, and the simplest and most well-known solution is based on the concatenation of the feature vectors, as in [45], where the authors propose two ways of integrating the SIFT and LBP, and the HOG and LBP descriptors, respectively. The aim of late fusion approaches, instead, is to produce a new output by combining either different similarities or distances from the query [13], or different ranks obtained by the classifiers [39], as sketched below.
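As a minimal illustration of the two styles, consider the following Python sketch; the vector sizes match those reported in Sect. 5.2, but the L2 normalisation step and the weighted-sum combination rule are our own simplifying assumptions, not the exact schemes used in the cited works.

```python
import numpy as np

def early_fusion(feat_a, feat_b):
    """Early fusion: concatenate two per-image descriptors into one
    vector, after L2-normalising each one (a common, assumed choice)."""
    a = feat_a / (np.linalg.norm(feat_a) + 1e-12)
    b = feat_b / (np.linalg.norm(feat_b) + 1e-12)
    return np.concatenate([a, b])

def late_fusion(dist_a, dist_b, w=0.5):
    """Late fusion: combine the distances to the query computed in two
    separate feature spaces (here, a simple weighted sum)."""
    return w * dist_a + (1.0 - w) * dist_b

# Usage: SIFT-BoVW and LBP descriptors of the same image
sift_bovw, lbp = np.random.rand(4096), np.random.rand(2891)
fused = early_fusion(sift_bovw, lbp)   # single descriptor of size 6987
score = late_fusion(0.8, 0.3)          # single combined distance
```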

In this light, deep learning approaches implicitly perform feature fusion, modelling high-level features by employing a high number of connected layers composed of multiple non-linear transformation units [32]. Indeed, the Convolutional Neural Network (CNN) model [5] consists of several convolutional layers and pooling layers, where a convolutional layer performs a weighted combination of the input values, while a pooling layer performs a down-sampling operation that reduces the output of the convolutional layer. In [20] it has been shown that features extracted from the upper layers of a CNN can also serve as good descriptors for image retrieval. This implies that a CNN trained for a given general task has acquired a generic representation of objects that will be useful for all sorts of visual recognition tasks [3].
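As a concrete illustration of this practice, the sketch below extracts fc7 descriptors (the representation used later, in Sect. 5.2) from the pre-trained AlexNet shipped with torchvision. The paper does not state which framework was used, so the preprocessing pipeline and the helper name fc7_descriptor are our assumptions.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Load a pre-trained AlexNet and truncate the classifier right after
# the second fully connected layer (fc7), whose output has size 4096.
alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
alexnet.classifier = torch.nn.Sequential(*list(alexnet.classifier)[:6])
alexnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def fc7_descriptor(path):
    """Return the 4096-dimensional fc7 activation for one image."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return alexnet(x).squeeze(0)   # shape: (4096,)
```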

4 Four Approaches to Exploit Relevance Feedback


In the following, we will refer to image retrieval systems where a query image is used as an example to find all the images in the repository that are relevant to that query. Defining which image is relevant or non-relevant to a query is not a trivial problem, in particular if the problem must be addressed using just one image as the query. Indeed, there is still a gap between the human perception of the semantic information present in an image and its computer description, which is typically able to capture just a small subset of semantic concepts. Moreover, the user that performs the query may not have a specific target in mind, or he could perform the query with an image that is only partially related to the content that he has in mind. In both cases, the retrieval process cannot be accomplished in just one step. The mechanism of Relevance Feedback (RF) has been developed to involve the user also in further steps of the process, in particular to verify whether the search results are relevant or not. Indeed, after the user feedback, the system can consider all relevant images as additional examples to better specify the query, and the non-relevant ones as examples of images that the user is not interested in.
A number of RF approaches have been proposed to refine the retrieval results, and they can be divided into four main categories. One of the first mechanisms of RF used in CBIR tasks, and still used in many applications, is based on the so-called Query Shifting or Query-Point Movement (QPM) paradigm. This technique was first proposed for text retrieval refinement [36] and then adopted in CBIR systems [37]. The assumption behind this approach is that relevant images are clustered in the feature space, but the original query could lie in a region of the feature space that is in some way far from the cluster of relevant images. Accordingly, a new optimal query is computed in such a way that it lies near the Euclidean centre of the relevant images, and far from the non-relevant images, according to Eq. (1):
$$Q_{opt} = \frac{1}{N_R}\sum_{i \in D_R} D_i \;-\; \frac{1}{N_T - N_R}\sum_{i \in D_N} D_i \qquad (1)$$

where $D_R$ and $D_N$ are the sets of relevant and non-relevant images, respectively, $N_R$ is the number of images in $D_R$, $N_T$ is the total number of documents, and $D_i$ is the representation of an image in the feature space. It is easy to see that this approach is suitable only in cases in which relevant images tend to form a cluster with a small intersection with other images that are not relevant to the user's interests.
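A minimal NumPy sketch of Eq. (1) follows. In an interactive RF round only the labelled images are available, so here D_R and D_N are the labelled relevant and non-relevant sets and N_T reduces to the number of labelled images; this restriction is our assumption, not part of the original formulation.

```python
import numpy as np

def optimal_query(D_R, D_N):
    """Query-Point Movement (Eq. (1)): move the query towards the mean
    of the relevant images and away from the non-relevant ones.

    D_R: (N_R, d) array of relevant feature vectors
    D_N: (N_N, d) array of non-relevant feature vectors
    """
    N_R = len(D_R)
    N_T = N_R + len(D_N)   # labelled images stand in for the collection
    return D_R.sum(axis=0) / N_R - D_N.sum(axis=0) / (N_T - N_R)
```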
A closely related group of RF approaches is based on distance or similarity learning: instead of optimizing the query with respect to the relevant and non-relevant images retrieved so far, they optimize the distance metric used to compute image similarities. The goal is to have a high pair-wise similarity value for images marked as relevant, and a low similarity value between relevant and non-relevant images. In the simplest case, the metric learning may consist in just re-weighting the individual features [31]. A major advantage of these two groups of approaches is that they are relatively fast, but they usually ignore dependencies between features, do not consider local properties in different portions of the feature space, and are only effective if the concept represented by the relevant images consists of a convex region in the feature space.
A third group of approaches is based on the formulation of RF in terms of a pattern classification task, using the relevant and non-relevant image sets to train popular learning algorithms such as SVMs [24], neural networks and self-organizing maps [8,21]. However, in many practical CBIR settings it is usually difficult to produce a high-level generalization, as the number of available relevant and non-relevant samples may be too small. This kind of problem has been partially mitigated using the active learning paradigm [9,34], where the system is trained not only with the most relevant images according to the user judgement, but also with the most informative images that allow driving the search into more promising regions of the feature space [18,33,42].
Finally, RF can be formulated according to a probabilistic approach, where the posterior probability distribution of a random variable is estimated according to the user feedback [2]. In particular, the probability densities of the relevant and non-relevant images are used as a similarity measure, as in the case of using a soft classifier [2,15]. This category also includes the nearest neighbour (NN) methods, used in this context to estimate the posterior probability of an image being relevant or not to the user's query. NN approaches have been adapted in several forms over the years, but the most used and effective form to compute a Relevance Score for each image is through the computation of its distances to its nearest relevant and non-relevant neighbours, as follows:
$$rel_{NN}(I) = \frac{\|I - NN^{nr}(I)\|}{\|I - NN^{r}(I)\| + \|I - NN^{nr}(I)\|} \qquad (2)$$

where $NN^{r}(\cdot)$ and $NN^{nr}(\cdot)$ denote the nearest relevant and non-relevant image for the image $I$, respectively, and $\|\cdot\|$ is the metric, typically the Euclidean distance, defined for the feature space. For convenience, the equation can be substituted by other formulations, such as $\|I - NN^{nr}(I)\| / \|I - NN^{r}(I)\|$ [1]. In a recent work [2] this ratio has been modified by introducing a smoothing term, in order to increase the importance of the images closest to the relevant ones through the distance to the closest relevant image. The modified ratio $\|I - NN^{nr}(I)\| / \|I - NN^{r}(I)\|^2$ has been shown to improve upon the basic one in some cases [2], but in general, it turns out that the original formulation gives surprisingly good results also in comparison to other state-of-the-art techniques [14].
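The score of Eq. (2), and the modified ratio of [2], can be computed for a whole repository with a few lines of NumPy; the following sketch assumes Euclidean distances and in-memory feature matrices.

```python
import numpy as np

def relevance_scores(X, relevant, nonrelevant, modified=False):
    """Relevance Score (Eq. (2)) for every image in the repository.

    X:           (n, d) feature matrix of the images to be ranked
    relevant:    (n_r, d) feature vectors labelled as relevant
    nonrelevant: (n_n, d) feature vectors labelled as non-relevant
    modified:    if True, use the modified ratio of [2], which squares
                 the distance to the nearest relevant image
    """
    # Distance of each image to its nearest relevant / non-relevant one
    d_r = np.linalg.norm(X[:, None, :] - relevant[None], axis=2).min(axis=1)
    d_nr = np.linalg.norm(X[:, None, :] - nonrelevant[None], axis=2).min(axis=1)
    if modified:
        return d_nr / (d_r ** 2 + 1e-12)
    return d_nr / (d_r + d_nr + 1e-12)

# Images are then shown to the user by decreasing score:
# ranking = np.argsort(-relevance_scores(X, rel, nonrel))
```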

5 Experimental Results
5.1 Datasets
We performed experiments with three real-world, state-of-the-art datasets differing in the number of classes and in the semantic content of the images. Caltech is a well-known image dataset (http://www.vision.caltech.edu/Image_Datasets/) that presents a collection of pictures of objects. In most images, objects are in the centre with fairly similar poses, and with very limited or no occlusion. There are different versions of this dataset, and in this work we used the Caltech-101 and Caltech-256 versions. Caltech-101 is a collection of pictures of objects belonging to 101 categories. It contains a total of 9,144 images; most of the categories have about 50 images, but the number of images per category ranges between 40 and 800. Caltech-256 contains pictures of objects belonging to 256 categories. It contains a total of 30,607 images, and the number of images per category ranges greatly, between 80 and 827, with an average value of 100 images per category. The Flower dataset (http://www.robots.ox.ac.uk/~vgg/data/flowers/) presents a collection of flower images. This dataset is released in two different versions, and in this work we used the 102-category version, Flowers-102. Although this dataset has a similar number of classes as Caltech-101, the two datasets are related to two very different problems. Indeed, Flowers-102 turns out to be a problem of fine retrieval, since it contains the single object category 'Flower', subdivided into 102 sub-categories. It consists of 8,189 images, with a number of images per class that ranges between 20 and 238. SUN-397 is an image dataset for scene categorization. It contains 108,754 images belonging to 397 categories. The number of images varies across categories, but there are at least 100 images per category. In the experimental evaluation, each dataset has been randomly divided into two subsets: the query set, containing a query image for each class, and the search set, containing all the remaining images for retrieval.

5.2 Features for Image Representation


In order to assess the performances of the different RF mechanisms on different image representations, we used 8 sets of features: CNN, Colour features, SIFT [25], HOG [10], LBP [28], LLBP [4], Gabor wavelets [22] and HAAR wavelets [17]. In particular, the CNN features have been extracted with one of the most used CNN architectures, that is, AlexNet [20]. Each level of a CNN could be used for this purpose. Indeed, the first network layers are able to describe just some image characteristics, like points and edges, while the innermost layers can capture high-level features and thus create a richer image representation. We extracted the features from the second fully connected layer (fc7), which produces a feature vector of size 4096. The Colour features consist of the set proposed in [26], which includes Colour Histogram, Colour Moments and Colour Auto-correlogram; concatenated, they produce a feature vector of size 102. The SIFT features have been extracted with a grid sampling of 8 pixels and a window size ranging from 32 to 128; the extracted SIFT features have then been used to create a BoVW of size 4096. HOG features have been computed on blocks composed of 4 cells of 16 pixels each; the blocks are overlapped by one cell per side, creating a feature vector of size 6084. LBP features have also been extracted from blocks of 32 pixels, in order to favour the analysis of small regions of the image, since LBP was created for texture analysis; the final feature vector has a size of 2891. In addition, the LLBP, Gabor and HAAR wavelet features have been used, with feature vectors of size 768, 5760 and 3456, respectively.
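For the handcrafted descriptors, a possible extraction recipe with scikit-image is sketched below; the cell, block and radius parameters reflect our reading of the description above and are not guaranteed to match the authors' exact configuration (in particular, the final vector size depends on the image size).

```python
import numpy as np
from skimage import io, color
from skimage.feature import hog, local_binary_pattern

def hog_descriptor(path):
    """HOG over 16-pixel cells grouped in blocks of 4 cells (2x2)."""
    img = color.rgb2gray(io.imread(path))
    return hog(img, orientations=9, pixels_per_cell=(16, 16),
               cells_per_block=(2, 2), block_norm='L2-Hys')

def lbp_descriptor(path, block=32, p=8, r=1):
    """Uniform LBP computed on 32-pixel blocks; one histogram per
    block, concatenated into the final descriptor."""
    img = color.rgb2gray(io.imread(path))
    h, w = img.shape
    feats = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            codes = local_binary_pattern(img[y:y+block, x:x+block],
                                         p, r, method='uniform')
            hist, _ = np.histogram(codes, bins=p + 2,
                                   range=(0, p + 2), density=True)
            feats.append(hist)
    return np.concatenate(feats)
```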

5.3 Experimental Setup


The different RF approaches have been applied after the first retrieval round, where the top k images are returned to the user for labelling as relevant or not. Values of k = 20, 50, 100 have been considered. Then, 4 RF rounds are performed for each query image. Reported results have been computed by automating the RF task, thanks to the availability of labelled datasets. In particular, for a given query image, the returned images are labelled as relevant if they belong to the same class as the query image, and as non-relevant otherwise. This procedure is useful to obtain objective relevance labels, and to perform repeatable comparisons between various retrieval systems; the underlying assumption is that a user who performs a query using an image belonging to a certain class is interested only in images belonging to the same class. The performances attained with the Relevance Score have been compared to RF approaches following a different paradigm, namely the QPM paradigm (see Eq. (1)) and a binary linear SVM classifier [24]. Since the training set used to train the SVM is different for each query image, and also different at each RF round for each query, the selection of the SVM hyperparameters is not trivial. Thus, the employed SVM uses an error-correcting output code (ECOC) mechanism [29] with a 5-fold cross validation to fit and automatically tune the hyperparameters. The trained SVM is used to classify the images belonging to the repository, and a vector is produced where each component represents the probability that the given image belongs to a certain class. Therefore, given that the SVM has been used to classify relevant and non-relevant images, the provided score can be used as a similarity measure, directly indicating the relevance of an image. A sketch of the automated evaluation protocol is given below.
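In the following sketch, the ground-truth class labels play the role of the user. The accumulation of feedback across rounds and the use of the plain nearest neighbours of the query at round 0 are our assumptions about details the text leaves implicit.

```python
import numpy as np

def run_rf_rounds(query, X, labels, query_label, score_fn, k=20, rounds=4):
    """Simulate RF: at each round the 'user' (the class labels) marks the
    top-k images as relevant/non-relevant, then the images are re-ranked.

    score_fn: callable (X, relevant, nonrelevant) -> (n,) relevance scores,
              e.g. the Relevance Score of Eq. (2)
    """
    rel, nonrel = [query], []
    # Round 0: plain nearest-neighbour retrieval from the query
    ranking = np.argsort(np.linalg.norm(X - query, axis=1))
    for _ in range(rounds):
        for i in ranking[:k]:   # simulated user feedback on the top k
            (rel if labels[i] == query_label else nonrel).append(X[i])
        # Assumes at least one non-relevant image has been returned
        scores = score_fn(X, np.array(rel), np.array(nonrel))
        ranking = np.argsort(-scores)
    return ranking
```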
To evaluate the performances of the proposed RF mechanisms, we used three different accuracy metrics: Precision, Recall and Average Precision. The Precision (P) (see Eq. (3)) measures the fraction of Relevant Retrieved Images (RRI) within the top k retrieved images. The Recall (R) (see Eq. (4)) measures the ratio between the RRI and the total number of Relevant Images (RI) in the dataset. The Average Precision (AP) (see Eq. (5)) takes into account the position of the relevant images in the result set:
$$P = \frac{RRI}{k} \quad (3) \qquad R = \frac{RRI}{RI} \quad (4) \qquad AP = \frac{1}{Q_i}\sum_{n=1}^{N} \frac{R_n^i}{n}\, t_n^i \quad (5)$$

where $Q_i$ is the number of relevant images for the $i$-th query, $N$ is the total number of images in the search set, $R_n^i$ is the number of relevant retrieved images within the top $n$ results, and $t_n^i$ indicates whether the $n$-th retrieved image is relevant ($=1$) for the $i$-th query or not ($=0$).
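Equations (3)-(5) translate directly into code; in the sketch below, t is the binary relevance vector of the ranked results for one query, and n_relevant corresponds to RI (or Q_i for the AP).

```python
import numpy as np

def precision_at_k(t, k):
    """Eq. (3): fraction of relevant images within the top-k results."""
    return np.sum(t[:k]) / k

def recall_at_k(t, k, n_relevant):
    """Eq. (4): fraction of all relevant images retrieved in the top k."""
    return np.sum(t[:k]) / n_relevant

def average_precision(t, n_relevant):
    """Eq. (5): t[n] = 1 if the (n+1)-th retrieved image is relevant."""
    t = np.asarray(t, dtype=float)
    cum_rel = np.cumsum(t)                 # R_n: relevant images in top n
    n = np.arange(1, len(t) + 1)
    return np.sum((cum_rel / n) * t) / n_relevant

# Example: average_precision([1, 0, 1, 1, 0], n_relevant=3) ~= 0.806
```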

5.4 Results

In order to give the reader a first idea of the performances attained on the considered feature sets, Table 1 shows the AP results after the first retrieval round on each dataset. As can be observed, the performances obtained are very different for each feature set, although in general the CNN features outperform the other methods. This great imbalance in the initial retrieval performances makes the analysis of the approaches on the individual sets of descriptors even more interesting, in order to verify which one benefits more from these mechanisms. The retrieval performances of the different approaches are reported in Figs. 1, 2 and 3, in which we compare the various techniques with different feature sets on retrieval sets of size 20, 50 and 100, respectively. As a general comment on the attained results, in all the approaches the performances increase after each RF round, but with very different trends. The feature set that benefits most from an RF approach is the CNN features, as a gain of more than 25% can be obtained for all the datasets when the RF mechanism based on the Relevance Score is employed. We can also see that the Relevance Score approach outperforms the other approaches in most of the cases. In particular, looking at the trends for the CNN features, it can be observed that the Relevance Score allows attaining performance improvements in every RF round.

Table 1. Comparison of features for image retrieval without relevance feedback.

Features  Caltech101  Caltech256  Flowers  SUN397
CNN        38.53       18.17       29.81    6.14
Colors      2.96        1.05        5.21    0.48
SIFT        9.99        2.43        4.17    0.83
HOG         8.06        2.33        2.72    0.74
LBP         7.50        2.59        4.11    0.85
LLBP        4.05        1.09        4.24    0.62
HAAR        8.79        2.57        5.56    0.78
Gabor      10.26        2.48        4.27    0.59

[Figure: line plots omitted] Fig. 1. Performances after 4 RF rounds, when the size of the retrieval set is 20. Each row reports Precision, Recall and AP per round on Caltech101, Flowers, Caltech256 and SUN397, for each feature set (CNN, Colors, SIFT, HOG, LBP, LLBP, HAAR, Gabor) and each strategy (Relevance Score, Query Shift, SVM).

However, the amount of improvement starts reducing after three rounds, while the other RF approaches show many fluctuations, as they do not provide increases in every RF round. This trend can be observed in detail in Table 2, where we report the average results over all datasets and all feature sets, starting from the initial AP value, then showing the amount of gain in AP after each round and, finally, the total gain for each RF approach. It is also worth noting that in many cases the SVM outperforms the QS approach for each k value,

[Figure: line plots omitted] Fig. 2. Performances after 4 RF rounds, when the size of the retrieval set is 50. Each row reports Precision, Recall and AP per round on Caltech101, Flowers, Caltech256 and SUN397, for each feature set (CNN, Colors, SIFT, HOG, LBP, LLBP, HAAR, Gabor) and each strategy (Relevance Score, Query Shift, SVM).

with the exception of k = 20 where, although its precision is higher than that of QS, it presents lower recall and AP values. This is a known trend of learning algorithms such as the SVM, which are not able to properly fit a model with a too small training set. This trend is also confirmed by the fact that the SVM performances rise as k increases, until they reach the Relevance Score performances on Caltech101 with k = 100.

[Figure: line plots omitted] Fig. 3. Performances after 4 RF rounds, when the size of the retrieval set is 100. Each row reports Precision, Recall and AP per round on Caltech101, Flowers, Caltech256 and SUN397, for each feature set (CNN, Colors, SIFT, HOG, LBP, LLBP, HAAR, Gabor) and each strategy (Relevance Score, Query Shift, SVM).

6 Conclusion

A Content-Based Image Retrieval (CBIR) system is focused on retrieving images from digital archives by comparing their visual content with the semantic content required by the user, using a query image as an example. Although many feature sets have been proposed over the past years to represent images in CBIR tasks, there is still a semantic gap between the automatic image description and the user needs.

Table 2. AP gain for the different RF approaches, averaged over all the datasets and feature sets.

RF approach      k    Round 1  Round 2  Round 3  Round 4  Total gain
Relevance Score  20    2.56     0.97     0.53     0.42      4.47
                 50    5.17     2.03     1.24     0.88      9.32
                 100   7.57     3.48     2.03     1.52     14.60
Query Shift      20    0.15     1.68    −0.45     0.82      2.20
                 50    2.36     2.32     0.09     0.89      5.66
                 100   4.78     3.32     0.95     0.83      9.87
SVM              20   −0.17     0.95    −0.06     0.27      1.00
                 50    2.93     1.17     0.67     0.40      5.16
                 100   5.78     1.82     1.36     0.73      9.69

The results reported in this paper clearly show the challenges in improving the retrieval performances with different image representations, and thus the need for approaches that fill the semantic gap. To this end, we tested different Relevance Feedback approaches, and showed that Nearest Neighbour (NN) approaches provide an effective method to exploit the user's feedback, both on different retrieval problems and using different feature representations. In particular, although different NN approaches that exploit the relevance feedback paradigm have been proposed in the past ten years, the formulation proposed in [14] still proves to be effective.

Acknowledgements. This work has been supported by the Regional Administration of Sardinia (RAS), Italy, within the project BS2R - Beyond Social Semantic Recommendation (POR FESR 2007/2013 - PIA 2013).

References
1. Arevalillo-Herráez, M., Domingo, J., Ferri, F.J.: Combining similarity measures in
content-based image retrieval. Pattern Recogn. Lett. 29(16), 2174–2181 (2008)
2. Arevalillo-Herráez, M., Ferri, F.J., Domingo, J.: A naive relevance feedback model
for content-based image retrieval using multiple similarity measures. Pattern
Recogn. 43(3), 619–629 (2010)
3. Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V.: Neural codes for image
retrieval. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014.
LNCS, vol. 8689, pp. 584–599. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_38
4. Rosdi, B.A., Shing, C.W., Suandi, S.A.: Finger vein recognition using local line
binary pattern. Sensors 11, 11357–11371 (2011)
5. Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new
perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)
6. Chatzichristofis, S.A., Boutalis, Y.S.: CEDD: color and edge directivity descriptor:
a compact descriptor for image indexing and retrieval. In: Gasteratos, A., Vincze,
M., Tsotsos, J.K. (eds.) ICVS 2008. LNCS, vol. 5008, pp. 312–322. Springer, Hei-
delberg (2008). https://doi.org/10.1007/978-3-540-79547-6_30

7. Chen, H.: A socio-technical perspective of museum practitioners’ image-using


behaviors. The Electron. Libr. 25(1), 18–35 (2007)
8. Chen, Y., Zhou, X.S., Huang, T.: One-class SVM for learning in image retrieval.
ICIP 1, 34–37 (2001)
9. Cohn, D.A., Atlas, L.E., Ladner, R.E.: Improving generalization with active learn-
ing. Mach. Learn. 15(2), 201–221 (1994)
10. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In:
Proceedings of CVPR, pp. 886–893 (2005)
11. Dang-Nguyen, D.T., Piras, L., Giacinto, G., Boato, G., De Natale, F.G.B.: Mul-
timodal retrieval with diversification and relevance feedback for tourist attrac-
tion images. ACM Trans. Multimedia Comput. Commun. Appl. 13(4), 49:1–49:24
(2017)
12. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Image retrieval: ideas, influences, and
trends of the new age. ACM Comput. Surv. 40(2), 1–60 (2008)
13. Escalante, H.J., Hérnadez, C.A., Sucar, L.E., Montes, M.: Late fusion of hetero-
geneous methods for multimedia image retrieval. In: Proceedings of the 1st ACM
International Conference on Multimedia Information Retrieval, pp. 172–179 (2008)
14. Giacinto, G.: A nearest-neighbor approach to relevance feedback in content based
image retrieval. In: Proceedings of the 6th ACM International Conference on Image
and Video Retrieval, CIVR 2007, pp. 456–463. ACM, New York (2007)
15. Giacinto, G., Roli, F.: Bayesian relevance feedback for content-based image
retrieval. Pattern Recogn. 37(7), 1499–1508 (2004)
16. Giacinto, G., Roli, F.: Nearest-prototype relevance feedback for content based
image retrieval. In: ICPR, vol. 2, pp. 989–992 (2004)
17. Graps, A.: An introduction to wavelets. IEEE Comput. Sci. Eng. 2(2), 50–61 (1995)
18. Hoi, S.C.H., Jin, R., Zhu, J., Lyu, M.R.: Semisupervised SVM batch mode active
learning with applications to image retrieval. ACM Trans. Inf. Syst. 27(3), 16:1–
16:29 (2009)
19. Jégou, H., Zisserman, A.: Triangulation embedding and democratic aggregation for
image search. In: Proceedings of CVPR, pp. 3310–3317. IEEE Computer Society
(2014)
20. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. In: 26th Annual Conference on Advances in Neural
Information Processing Systems, pp. 1106–1114 (2012)
21. Laaksonen, J., Koskela, M., Oja, E.: PicSOM-self-organizing image retrieval with
MPEG-7 content descriptors. IEEE Trans. Neural Netw. 13(4), 841–853 (2002)
22. Lee, T.S.: Image representation using 2D Gabor wavelets. IEEE Trans. Pattern
Anal. Mach. Intell. 18(10), 959–971 (1996)
23. van Leuken, R.H., Garcia, L., Olivares, X., van Zwol, R.: Visual diversification of
image search results. In: ACM International Conference on World Wide Web, pp.
341–350 (2009)
24. Liang, S., Sun, Z.: Sketch retrieval and relevance feedback with biased SVM clas-
sification. Pattern Recogn. Lett. 29(12), 1733–1741 (2008)
25. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Com-
put. Vis. 60(2), 91–110 (2004)
26. Mitro, J.: Content-based image retrieval tutorial. ArXiv e-prints (2016)
27. Müller, H., Clough, P.D., Deselaers, T., Caputo, B. (eds.): ImageCLEF: Experi-
mental Evaluation in Visual Information Retrieval. Springer, Heidelberg (2010).
https://doi.org/10.1007/978-3-642-15181-1

28. Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation
invariant texture classification with local binary patterns. IEEE Trans. Pattern
Anal. Mach. Intell. 24(7), 971–987 (2002)
29. Passerini, A., Pontil, M., Frasconi, P.: New results on error correcting output codes
of kernel machines. IEEE Trans. Neural Netw. 15(1), 45–54 (2004)
30. Pavlidis, T.: Limitations of content-based image retrieval. Technical report, Stony
Brook University (2008)
31. Piras, L., Giacinto, G.: Neighborhood-based feature weighting for relevance feed-
back in content-based retrieval. In: WIAMIS, pp. 238–241. IEEE Computer Society
(2009)
32. Piras, L., Giacinto, G.: Information fusion in content based image retrieval: a
comprehensive overview. Inf. Fusion 37, 50–60 (2017)
33. Piras, L., Giacinto, G., Paredes, R.: Enhancing image retrieval by an exploration-
exploitation approach. In: Perner, P. (ed.) MLDM 2012. LNCS (LNAI), vol.
7376, pp. 355–365. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31537-4_28
34. Piras, L., Giacinto, G., Paredes, R.: Passive-aggressive online learning for relevance
feedback in content based image retrieval. In: Proceedings of the 2nd International
Conference on Pattern Recognition Applications and Methods, pp. 182–187 (2013)
35. Piras, L., Tronci, R., Giacinto, G.: Diversity in ensembles of codebooks for visual
concept detection. In: Petrosino, A. (ed.) ICIAP 2013. LNCS, vol. 8157, pp. 399–
408. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41184-7_41
36. Rocchio, J.J.: Relevance feedback in information retrieval, pp. 313–323. Prentice
Hall, Englewood Cliffs (1971)
37. Rui, Y., Huang, T.S., Mehrotra, S.: Content-based image retrieval with relevance
feedback in MARS. In: International Conference on Image Processing Proceedings,
pp. 815–818, October 1997
38. Rui, Y., Huang, T.S., Mehrotra, S.: Relevance feedback: a power tool in interactive
content-based image retrieval. IEEE Trans. Circuits Syst. Video Technol. 8(5),
644–655 (1998)
39. da Torres, R.S., Falcão, A.X., Gonçalves, M.A., Papa, J.P., Zhang, B., Fan, W.,
Fox, E.A.: A genetic programming framework for content-based image retrieval.
Pattern Recogn. 42(2), 283–292 (2009)
40. Sivic, J., Zisserman, A.: Efficient visual search for objects in videos. Proc. IEEE
96(4), 548–566 (2008)
41. Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based
image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach.
Intell. 22(12), 1349–1380 (2000)
42. Tong, S., Chang, E.Y.: Support vector machine active learning for image retrieval.
In: ACM Multimedia, pp. 107–118 (2001)
43. Tronci, R., Murgia, G., Pili, M., Piras, L., Giacinto, G.: ImageHunter: a novel
tool for relevance feedback in content based image retrieval. In: Lai, C., Semeraro,
G., Vargiu, E. (eds.) New Challenges in Distributed Information Filtering and
Retrieval. SCI, vol. 439, pp. 53–70. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-31546-6_4
44. Tsai, C.M., Qamra, A., Chang, E., Wang, Y.F.: Extent: inferring image metadata
from context and content. In: IEEE International Conference on Multimedia and
Expo, pp. 1270–1273 (2006)
45. Yu, J., Qin, Z., Wan, T., Zhang, X.: Feature integration analysis of bag-of-features
model for image retrieval. Neurocomputing 120, 355–364 (2013)
