1 Introduction
Content-based image retrieval (CBIR) systems encompass all the approaches that retrieve images from large repositories by analysing their visual and semantic content. A CBIR system represents each image in the repository as a set of low- and mid-level features, such as colour, texture, shape, and connected components, and uses a set of distance functions defined over these feature spaces to estimate the similarity between images. The goal of a CBIR system is to retrieve the set of images best suited to the user's intention formulated
using an image as the query. The performance of CBIR systems is strongly related not only to the employed feature representations, but also to the distance functions used to measure the similarity between images [12,41]. The implicit assumption is that the similarity between images is related to a distance defined over a particular feature space. But since an image may be characterized by a large number of concepts, this leads to the so-called semantic gap between the user's actual interpretation of the image and the semantics induced from the low-level features. Furthermore, the availability of personal devices, and the possibility for people to capture an unlimited number of photos and videos, have increased the amount of multimedia documents. Thus, the need for efficient and effective image retrieval systems has become crucial. Indeed, in order to accurately extract information from this vast amount of data, retrieval methods need to quickly discard irrelevant information and focus on the items of interest.
To this end, and to reduce the semantic gap introduced by the automatic image representation, most recent CBIR systems have adopted a mechanism of Relevance Feedback (RF) [2,35,43]. RF is a mechanism initially developed to improve text-based information retrieval systems [36], and later extended to CBIR systems in [38]. RF techniques involve the user in the process of refining the search, which becomes an iterative process in which the original query is interactively refined to progressively obtain a more accurate result. In a CBIR system, after the user submits a query image, the system retrieves a series of images and asks the user to label the returned images as relevant or non-relevant to the query. The model of the user's intention is then updated according to this feedback, producing a new set of images that is expected to contain a larger number of relevant ones. The relationship between the query image and any other image in the database is expressed through a relevance value that aims to directly reflect the user's intention. Many different approaches have been proposed to estimate this relevance value. In this context, nearest-neighbour approaches have proved particularly effective [14]. The aim of these methods is to produce a Relevance Score for each image by using the distances to its relevant and non-relevant neighbours. Intuitively, an image receives a higher relevance score the smaller its distance to the nearest relevant image is, compared to the distance to its nearest non-relevant image. Although many other approaches for relevance feedback have been proposed in the past ten years, we believe that the Relevance Score can still be considered the state-of-the-art approach for relevance estimation.
Other Relevance Feedback techniques proposed in the literature involve the optimization of one or more CBIR components, such as the formulation of a new query or the transformation of the feature space. These approaches extract global properties of the relevant images, derived from the set of relevant and non-relevant images retrieved so far. Such global optimization can suffer from the small-sample problem, as the number of images displayed to the user for labelling is usually small (e.g., 20 images at a time). In particular, as the size of the database increases, it is very likely that only a small number of relevant images is retrieved in response to the user's query.
3 Features in CBIR
In a CBIR system, the content of a specific image can be described and represented in multiple ways, according to three levels of abstraction [12]. The first level includes the representation of primitive characteristics, or low-level features, such as colour, texture, and basic geometric shapes. This kind of information, although very simple and easy to extract, does not guarantee reliability and accuracy. The second level is devoted to providing a more detailed description of the elements mentioned above, representing more complex objects by means of their aggregation and spatial arrangement; these are precisely the logical characteristics, or mid-level features. The common perception of an image or a scene, however, is not always related to its description through such low- and mid-level features. Indeed, an image can be seen as the representation of different concepts, related not only to primitive characteristics but also to emotions and memories. For these reasons a third level, called the semantic level, is used. It includes the description of abstract features, such as the meaning of the scenario or the feelings induced in an observer.
While most current CBIR systems employ low-level and mid-level features [12,30], aimed at taking into account information like colour, edge, and texture [6], CBIR systems that address specific retrieval problems leverage specifically designed kinds of features [40]. Some works exploited low-level image descriptors, such as SIFT [44], originally proposed for object recognition [25], different colour histograms [23], or a fusion of textual and visual information [11] for scene categorization. More specific low-level features designed for other applications have also been used in CBIR systems, such as the HOG [10] and LBP [28] descriptors, originally proposed for pedestrian detection and texture analysis, respectively. Obviously, these features do not fit CBIR systems designed for less specific applications and general-purpose retrieval [30]. Furthermore, while low-level visual descriptors help improve the retrieval performance for particular purposes, they often fail to provide a high-level understanding of the scene. Several CBIR systems that use a combination or fusion of different image descriptors have been proposed in the literature [32]. Fusion approaches are usually categorized into two classes, namely early and late fusion. Early fusion approaches are very common in the image retrieval field, and the simplest and best-known solution is based on the concatenation of feature vectors, as in [45], where the authors propose two ways of integrating the SIFT and LBP, and the HOG and LBP descriptors, respectively. The aim of late fusion approaches, instead, is to produce a new output by combining either different similarities or distances from the query [13], or the different ranks produced by classifiers [39].
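To make the two fusion styles concrete, here is a minimal sketch (function names and the L2 normalization are our own choices, not taken from the cited works): early fusion concatenates two descriptors, while late fusion combines two distance vectors into one ranking score.

```python
import numpy as np

def early_fusion(feat_a, feat_b):
    """Early fusion: concatenate L2-normalized feature vectors,
    in the spirit of the SIFT+LBP and HOG+LBP integration of [45]."""
    a = feat_a / (np.linalg.norm(feat_a) + 1e-12)
    b = feat_b / (np.linalg.norm(feat_b) + 1e-12)
    return np.concatenate([a, b])

def late_fusion(dists_a, dists_b, w=0.5):
    """Late fusion: weighted combination of the per-image distances to
    the query computed on two different descriptors."""
    return w * dists_a + (1.0 - w) * dists_b

# order = np.argsort(late_fusion(d_colour, d_texture))  # fused ranking
```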
In this light, deep learning approaches implicitly perform feature fusion, modelling high-level features by employing a large number of connected layers composed of multiple non-linear transformation units [32]. Indeed, the Convolutional Neural Network (CNN) model [5] consists of several convolutional and pooling layers, where a convolutional layer performs a weighted combination of its input values, while a pooling layer performs a down-sampling operation that reduces the output of the convolutional layer. In [20] it has also been shown that features extracted from the upper layers of a CNN can serve as good descriptors for image retrieval. This implies that a CNN trained on a given general task acquires a generic representation of objects that is useful for all sorts of visual recognition tasks [3].
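As an illustration of this use of CNN activations as descriptors, the sketch below extracts a 4096-dimensional vector from an off-the-shelf pretrained network; the choice of VGG16 and of the fc7 layer is our assumption for illustration, not necessarily the configuration of the cited works.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Assumption: VGG16 pretrained on ImageNet, truncated after the second
# fully-connected layer (fc7), whose activations form a 4096-d descriptor.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
extractor = torch.nn.Sequential(
    vgg.features, vgg.avgpool, torch.nn.Flatten(),
    *list(vgg.classifier.children())[:-2],  # keep up to fc7 + its ReLU
)

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def describe(pil_image):
    """Map a PIL image to a 4096-d feature vector for retrieval."""
    x = preprocess(pil_image).unsqueeze(0)
    return extractor(x).squeeze(0)
```

Retrieval then reduces to ranking the search set by the distance between such descriptors.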
A common query reformulation strategy is Rocchio's query shifting which, in its standard form, computes the new query as

Q' = \alpha Q + \frac{\beta}{N_R} \sum_{I \in D_R} I - \frac{\gamma}{N_T - N_R} \sum_{I \in D_N} I \quad (1)

where D_R and D_N are the sets of relevant and non-relevant images, respectively, N_R is the number of images in D_R, and N_T the total number of documents. The Relevance Score instead ranks each image I according to

rel(I) = \frac{\|I - NN^{nr}(I)\|}{\|I - NN^{r}(I)\| + \|I - NN^{nr}(I)\|} \quad (2)

where NN^r(\cdot) and NN^nr(\cdot) denote the nearest relevant and non-relevant image for the image I, respectively, and \|\cdot\| is the metric, typically the Euclidean distance, defined on the feature space. For ranking purposes, the equation can be substituted by equivalent formulations such as \|I - NN^{nr}(I)\| / \|I - NN^{r}(I)\| [1]. In a recent work [2] this ratio has been modified by introducing a smoothing term, in order to increase the importance of the images that lie closest to the images marked as relevant by the user. The modified ratio \|I - NN^{nr}(I)\| / \|I - NN^{r}(I)\|^2 has been shown to improve on the basic one in some cases [2], but in
general the original formulation turns out to give surprisingly good results, even in comparison with other state-of-the-art techniques [14].
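A minimal sketch of how the score of Eq. (2) can be computed over a candidate set, assuming images are already mapped to feature vectors (array names are ours):

```python
import numpy as np
from scipy.spatial.distance import cdist

def relevance_scores(X, X_rel, X_non):
    """Eq. (2): rel(I) = dN / (dR + dN), where dR and dN are the distances
    from I to its nearest relevant and non-relevant labelled neighbours."""
    d_rel = cdist(X, X_rel).min(axis=1)  # distance to nearest relevant
    d_non = cdist(X, X_non).min(axis=1)  # distance to nearest non-relevant
    return d_non / (d_rel + d_non + 1e-12)

# Images are re-ranked by decreasing score at each RF round:
# order = np.argsort(-relevance_scores(X, X_rel, X_non))
```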
5 Experimental Results
5.1 Datasets
We performed experiments with three real-world, state-of-the-art datasets that differ in the number of classes and in the semantic content of the images. Caltech is a well-known image dataset1 consisting of pictures of objects. In most images, the objects are centred, in fairly similar poses, and with very limited or no occlusion. There are different versions of this dataset; in this work we used the Caltech-101 and Caltech-256 versions. Caltech-101 is a collection of pictures of objects belonging to 101 categories. It contains a total of 9,144 images; most categories have about 50 images, but the number of images per category ranges between 40 and 800. Caltech-256 contains pictures of objects belonging to 256 categories. It contains a total of 30,607 images, and the number of images per category varies greatly, ranging between 80 and 827 with an average of 100 images per category. The Flowers dataset2 presents a collection of flower images. This dataset is released in two different versions; in this work we used the 102-category version, Flowers-102. Although this dataset has a similar number of classes to Caltech-101, the two datasets pose two very different problems. Indeed, Flowers-102 turns out to be a fine-grained retrieval problem, since it contains the single object category 'Flower' subdivided into 102 sub-categories. It consists of 8,189 images, with a number of images per class ranging between 20 and 238. SUN-397 is an image dataset for scene categorization. It contains 108,754 images belonging to 397 categories. The number of images varies across categories, but there are at least 100 images per category. In the experimental evaluation these three datasets have been randomly divided into two subsets: the query set, containing one query image per class, and the search set, containing all the remaining images for retrieval.
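A possible implementation of this split, assuming a mapping from image identifiers to class labels (the data structure is hypothetical):

```python
import random
from collections import defaultdict

def split_query_search(labels, seed=0):
    """Draw one random query image per class; all remaining images
    form the search set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for img, cls in labels.items():
        by_class[cls].append(img)
    query_set, search_set = [], []
    for imgs in by_class.values():
        rng.shuffle(imgs)
        query_set.append(imgs[0])
        search_set.extend(imgs[1:])
    return query_set, search_set
```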
As for the feature sets, the CNN features produce a feature vector of size 4096. The colour features are the set proposed in [26], which includes Colour Histogram, Colour Moments, and Colour Auto-correlogram; concatenated, they produce a feature vector of size 102. The SIFT features have been extracted with a grid sampling of 8 pixels and a window size ranging from 32 to 128; the extracted SIFT features have then been used to create a BoVW of size 4096. The HOG features have been computed on blocks composed of 4 cells of 16 pixels each; the blocks overlap by one cell per side, producing a feature vector of size 6084. LBP has also been extracted from blocks of 32 pixels, in order to favour the analysis of small regions of the image, since it was created for texture analysis; the final feature vector has a size of 2,891. In addition, the LLBP, Gabor, and HAAR wavelet features have also been used, with feature vectors of size 768, 5,760, and 3,456, respectively.
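As a sketch of how such a dense-SIFT BoVW pipeline can be assembled (the OpenCV grid-sampling loop and the vocabulary-training subsample are our own simplifications; the multiple window sizes of 32 to 128 pixels are omitted):

```python
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

sift = cv2.SIFT_create()

def dense_sift(gray, step=8, size=32):
    """SIFT descriptors on a regular grid with 8-pixel spacing."""
    kps = [cv2.KeyPoint(float(x), float(y), float(size))
           for y in range(step, gray.shape[0] - step, step)
           for x in range(step, gray.shape[1] - step, step)]
    _, desc = sift.compute(gray, kps)
    return desc

def bovw_histogram(desc, kmeans):
    """Normalized histogram of visual-word occurrences (BoVW vector)."""
    words = kmeans.predict(desc)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-12)

# Vocabulary of 4096 visual words from a sample of training descriptors:
# kmeans = MiniBatchKMeans(n_clusters=4096).fit(np.vstack(sampled_descs))
```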
The retrieval performance is evaluated in terms of precision (P), recall (R), and Average Precision (AP):

P = \frac{R_{RI}}{k} \quad (3) \qquad R = \frac{R_{RI}}{R_I} \quad (4) \qquad AP_i = \frac{1}{Q_i} \sum_{n=1}^{N} \frac{R_n^i}{n} \, t_n^i \quad (5)

where R_{RI} is the number of relevant images among the k retrieved ones, R_I the total number of relevant images for the query, Q_i is the number of relevant images for the i-th query, N is the total number of images of the search set, R_n^i is the number of relevant retrieved images within the n top results, and t_n^i indicates whether the n-th retrieved image is relevant (=1) for the i-th query or not (=0).
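The three measures can be computed from the ranked relevance indicators as follows (a sketch assuming the full ranking of the search set is available; variable names mirror the symbols of Eqs. (3)-(5)):

```python
import numpy as np

def precision_recall_ap(t, Q_i, k):
    """t is the 0/1 relevance vector over the ranked search set,
    Q_i the number of relevant images for the query, k the size
    of the retrieval set shown to the user."""
    t = np.asarray(t, dtype=float)
    R_n = np.cumsum(t)                # relevant images within top n
    P = R_n[k - 1] / k                # Eq. (3)
    R = R_n[k - 1] / Q_i              # Eq. (4), with R_I = Q_i
    n = np.arange(1, len(t) + 1)
    AP = np.sum((R_n / n) * t) / Q_i  # Eq. (5)
    return P, R, AP
```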
5.4 Results
In order to give the reader a first idea of the performance attained with the considered feature sets, Table 1 shows the AP results after the first retrieval round on each dataset. As can be observed, the performances differ widely across feature sets, although in general the CNN features outperform the other methods. This great imbalance in the initial retrieval performance makes the analysis of the approaches on the individual sets of descriptors even more interesting, in order to verify which one benefits most from these mechanisms. The retrieval performances of the different approaches are reported in Figs. 1, 2 and 3, in which we compare the various techniques with different feature sets on retrieval sets of size 20, 50, and 100, respectively. As a general comment on the attained results, in all the approaches the performances increase after each RF round, but with very different trends. The feature set that takes most advantage of an RF approach is the CNN features, as a gain of more than 25% can be obtained for all the datasets when the RF mechanism based on the Relevance Score is employed. We can also see that the Relevance Score approach outperforms the other approaches in most cases. In particular, looking at the trends for CNN features, it can be observed that the Relevance Score attains performance improvements in every RF round.
Fig. 1. Performances after 4 RF rounds, when the size of the retrieval set is 20 (Precision, Recall, and AP per RF round on Caltech101, Flowers, Caltech256, and SUN397, for the Relevance Score, Query Shift, and SVM strategies on each feature set: CNN, Colors, SIFT, HOG, LBP, LLBP, HAAR, Gabor).
The improvement, however, starts shrinking after three rounds, while the other RF approaches show many fluctuations, as they do not provide an increase at every RF round. This trend can be observed in detail in Table 2, where we report the average results over all datasets and all features: the initial AP value, the AP gain after each round, and finally the total gain for each RF approach. It is also worth noting that in many cases the SVM outperforms the QS approach for each k
value, with the exception of k = 20 where, although its precision is higher than that of QS, it presents lower values of recall and AP. This is a well-known behaviour of learning algorithms such as the SVM, which cannot properly fit a model with too small a training set. This trend is also confirmed by the fact that the SVM performance rises as k increases, until it reaches the Relevance Score performance on Caltech101 with k = 100.

Fig. 2. Performances after 4 RF rounds, when the size of the retrieval set is 50 (same panels, strategies, and feature sets as Fig. 1).
Fig. 3. Performances after 4 RF rounds, when the size of the retrieval set is 100 (same panels, strategies, and feature sets as Fig. 1).
Table 2. AP gain for the different RF approaches, averaged over all the datasets and feature sets.

6 Conclusion
References
1. Arevalillo-Herráez, M., Domingo, J., Ferri, F.J.: Combining similarity measures in
content-based image retrieval. Pattern Recogn. Lett. 29(16), 2174–2181 (2008)
2. Arevalillo-Herráez, M., Ferri, F.J., Domingo, J.: A naive relevance feedback model
for content-based image retrieval using multiple similarity measures. Pattern
Recogn. 43(3), 619–629 (2010)
3. Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V.: Neural codes for image
retrieval. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014.
LNCS, vol. 8689, pp. 584–599. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_38
4. Rosdi, B.A., Shing, C.W., Suandi, S.A.: Finger vein recognition using local line
binary pattern. Sensors 11, 11357–11371 (2011)
5. Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new
perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)
6. Chatzichristofis, S.A., Boutalis, Y.S.: CEDD: color and edge directivity descriptor:
a compact descriptor for image indexing and retrieval. In: Gasteratos, A., Vincze,
M., Tsotsos, J.K. (eds.) ICVS 2008. LNCS, vol. 5008, pp. 312–322. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-79547-6_30
28. Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation
invariant texture classification with local binary patterns. IEEE Trans. Pattern
Anal. Mach. Intell. 24(7), 971–987 (2002)
29. Passerini, A., Pontil, M., Frasconi, P.: New results on error correcting output codes
of kernel machines. IEEE Trans. Neural Netw. 15(1), 45–54 (2004)
30. Pavlidis, T.: Limitations of content-based image retrieval. Technical report, Stony
Brook University (2008)
31. Piras, L., Giacinto, G.: Neighborhood-based feature weighting for relevance feed-
back in content-based retrieval. In: WIAMIS, pp. 238–241. IEEE Computer Society
(2009)
32. Piras, L., Giacinto, G.: Information fusion in content based image retrieval: a
comprehensive overview. Inf. Fusion 37, 50–60 (2017)
33. Piras, L., Giacinto, G., Paredes, R.: Enhancing image retrieval by an exploration-
exploitation approach. In: Perner, P. (ed.) MLDM 2012. LNCS (LNAI), vol.
7376, pp. 355–365. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31537-4_28
34. Piras, L., Giacinto, G., Paredes, R.: Passive-aggressive online learning for relevance
feedback in content based image retrieval. In: Proceedings of the 2nd International
Conference on Pattern Recognition Applications and Methods, pp. 182–187 (2013)
35. Piras, L., Tronci, R., Giacinto, G.: Diversity in ensembles of codebooks for visual
concept detection. In: Petrosino, A. (ed.) ICIAP 2013. LNCS, vol. 8157, pp. 399–408. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41184-7_41
36. Rocchio, J.J.: Relevance feedback in information retrieval, pp. 313–323. Prentice
Hall, Englewood Cliffs (1971)
37. Rui, Y., Huang, T.S., Mehrotra, S.: Content-based image retrieval with relevance
feedback in MARS. In: International Conference on Image Processing Proceedings,
pp. 815–818, October 1997
38. Rui, Y., Huang, T.S., Mehrotra, S.: Relevance feedback: a power tool in interactive
content-based image retrieval. IEEE Trans. Circuits Syst. Video Technol. 8(5),
644–655 (1998)
39. da Torres, R.S., Falcão, A.X., Gonçalves, M.A., Papa, J.P., Zhang, B., Fan, W.,
Fox, E.A.: A genetic programming framework for content-based image retrieval.
Pattern Recogn. 42(2), 283–292 (2009)
40. Sivic, J., Zisserman, A.: Efficient visual search for objects in videos. Proc. IEEE
96(4), 548–566 (2008)
41. Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based
image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach.
Intell. 22(12), 1349–1380 (2000)
42. Tong, S., Chang, E.Y.: Support vector machine active learning for image retrieval.
In: ACM Multimedia, pp. 107–118 (2001)
43. Tronci, R., Murgia, G., Pili, M., Piras, L., Giacinto, G.: ImageHunter: a novel
tool for relevance feedback in content based image retrieval. In: Lai, C., Semeraro,
G., Vargiu, E. (eds.) New Challenges in Distributed Information Filtering and
Retrieval. SCI, vol. 439, pp. 53–70. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-31546-6_4
44. Tsai, C.M., Qamra, A., Chang, E., Wang, Y.F.: EXTENT: inferring image metadata
from context and content. In: IEEE International Conference on Multimedia and
Expo, pp. 1270–1273 (2006)
45. Yu, J., Qin, Z., Wan, T., Zhang, X.: Feature integration analysis of bag-of-features
model for image retrieval. Neurocomputing 120, 355–364 (2013)