An Approximate Proximity Graph Incremental Construction For Large Image Collections Indexing
Abstract. This paper addresses the problem of the incremental construction of an indexing structure, namely a proximity graph, for large
image collections. To this purpose, a local update strategy is examined.
Considering an existing graph G and a new node q, how can only a relevant
sub-graph of G be updated following the insertion of q? For a given
proximity graph, we study the most recent algorithm of the literature and
highlight its limitations. Then, a method that leverages an edge-based
neighbourhood local update strategy to yield an approximate graph is
proposed. Using real-world and synthetic data, the proposed algorithm is
tested to assess the accuracy of the approximate graphs. The scalability
is verified with large image collections, up to one million images.
Keywords: Image indexing · Proximity graphs · Incremental
1 Introduction
Dealing with large amounts of data has become a great challenge. Advances in
technology make it possible to collect data from almost everywhere, everything and everyone in a continuous way. This permanent flow of data can be nearly infinite and
occurs in various fields. A perfect illustration of this phenomenon is the exponential growth of images. Thousands of photos are added each minute to online
platforms such as Flickr, Instagram or Facebook.
One challenge, along with the storage of this huge amount of images, is the
exploration of these image collections. In order to extract relevant information
from these images, one needs to have a relevant representation to observe their
global topology and search local information. Proximity graphs [6] have the
property of extracting the structure of the data they represent. Each piece of
data is represented by a vertex, and two vertices are linked by an edge if they
are close enough to be considered as neighbours. Such graphs are well suited to
purposes such as clustering and outlier detection, but also to indexing and
retrieval tasks.
© Springer International Publishing Switzerland 2015
F. Esposito et al. (Eds.): ISMIS 2015, LNAI 9384, pp. 59–68, 2015.
DOI: 10.1007/978-3-319-25252-0_7
F. Rayar et al.
Proximity graphs [6] are weighted graphs with no loops. They aim at extracting
the structure of a data point set, where each point is represented by a node. They
associate an edge between two points if they are close enough to be considered
as neighbours. Notable proximity graphs include the k-nearest neighbour graph,
the relative neighbourhood graph (RNG), the Gabriel graph and the Delaunay graph.
In the present paper, we focus on the RNG. Indeed, it
is the smallest connected proximity graph that embeds local information about
the neighbourhood of each vertex. The connectivity property guarantees that each image
is reachable during a content-based exploration.
2.1 Definition
The relative neighbourhood graph was introduced by Toussaint
[5]. The construction of this graph is based on the notion of relatively close
neighbours, which defines two points as relative neighbours if they are at least as
close to each other as they are to any other point. From this definition, we can
define $RNG = (V, E)$ as the graph built from the points of $D$ where distinct
points $p$ and $q$ of $D$ are connected by an edge $pq$ if and only if they are relative
neighbours. Thus,
\[ E(RNG) = \{\, pq \mid p, q \in D,\ p \neq q,\ \delta(p, q) \leq \max(\delta(p, r), \delta(q, r)),\ \forall r \in D \setminus \{p, q\} \,\}, \]
where $\delta : D \times D \rightarrow \mathbb{R}$ is a distance function. An illustration of the relative
neighbourhood of two points $p, q \in \mathbb{R}^2$ is given in Fig. 1.
The main drawback of the RNG is its construction. The classical brute-force construction has a complexity of $O(n^3)$, where $n = |D|$ is the number of
data points considered. A few works in the literature address this complexity for
2D and 3D points. Their key idea is to build a supergraph of the RNG (e.g.
the Delaunay graph), and then eliminate edges from it to yield the
RNG. Thus, one can find in the literature [2] algorithms for 2D and 3D points
whose complexities are $O(n \log n)$ and $O(n^{23/12} \log n)$, respectively.
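To make the cubic cost of the classical construction concrete, the following is a minimal Python sketch of a brute-force RNG builder under the definition above. It is an illustration only, not the authors' C++ implementation: the function name `brute_force_rng`, the use of the Euclidean distance via `math.dist`, and the toy points are all assumptions made for this example.

```python
from itertools import combinations
from math import dist


def brute_force_rng(points):
    """Return the RNG edges of `points` as index pairs (i, j) with i < j."""
    n = len(points)
    edges = set()
    # O(n^2) candidate pairs, each checked against O(n) witnesses: O(n^3).
    for i, j in combinations(range(n), 2):
        d_ij = dist(points[i], points[j])
        # The pair is rejected when some third point r is strictly closer
        # to both endpoints than the endpoints are to each other.
        blocked = any(
            max(dist(points[i], points[r]), dist(points[j], points[r])) < d_ij
            for r in range(n)
            if r != i and r != j
        )
        if not blocked:
            edges.add((i, j))
    return edges


# Hypothetical toy data: point 2 lies between points 0 and 1.
points = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.1), (5.0, 5.0)]
rng_edges = brute_force_rng(points)
print(sorted(rng_edges))
```

On this toy set, the pair (0, 1) is rejected because point 2 lies in its relative neighbourhood, and the resulting graph is connected.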
2.2 Incremental Construction
To the best of our knowledge, only a few works in the literature address
the incremental construction of the RNG. Scuturici et al. [4] explain
that they insert the new vertex in the existing RNG by verifying the
relative neighbourhood criterion specified in Sect. 2. The authors state that the
graph is locally updated, but no details are given. They experimented with up to
10,000 images and evaluated their work using classification performance metrics,
namely precision and recall.
In [1], Hacid et al. propose an algorithm to perform a local update of an RNG
following the insertion of a new vertex. This algorithm is leveraged to incrementally build the RNG. Given a set of vertices V , the incremental construction of
the RNG proposed by Hacid et al. consists in (i) randomly selecting 2 vertices
of V and creating an edge between them and (ii) iteratively inserting the other
vertices by locally updating the RNG. The insertion algorithm (Algorithm 1) is
detailed below.
Let RNG be the relative neighbourhood graph built from the vertices of V,
q be a new vertex to be inserted, and $\varepsilon \in \mathbb{R}^+$. First the nearest vertex nn of q is
sought in V (line 1). The farthest relative neighbour fn of nn is retrieved in the
graph RNG (line 2). A hypersphere SR centred on q is then computed as
its neighbourhood. All vertices that lie in that hypersphere are retrieved (lines
6–11). The radius of this hypersphere is the sum of the distance
between q and nn and the distance between nn and fn. Note that this
radius can be magnified thanks to the parameter $\varepsilon$ (line 3). The neighbourhood
relationships within the hypersphere SR are updated (line 12) with the classical
brute-force algorithm.
Algorithm 1. Hacid et al.'s insertion algorithm
Input: RNG = (V, E), q, ε
Output: RNG' = (V', E')
1: nn = nearest_vertex(q, V)
2: fn = farthest_relative_neighbour(nn, RNG)
3: sr = (δ(q, nn) + δ(nn, fn)) × (1 + ε)
4: V' = V ∪ {q}
5: E' = E
6: SR = ∅
7: for each p ∈ V do
8:     if δ(p, q) ≤ sr then
9:         SR = SR ∪ {p}
10:    end if
11: end for
12: E' = Update(SR)
13: return RNG' = (V', E')
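The steps of Algorithm 1 can be sketched in Python as follows. This is a hedged reading of the listing, not the original implementation. It assumes that points are tuples compared with the Euclidean distance, that edges are stored as index pairs (i, j) with i < j, that q itself takes part in the update of SR, and that Update(SR) discards the edges internal to SR before recomputing them from the vertices of SR only. The name `insert_hacid` is hypothetical.

```python
from itertools import combinations
from math import dist


def insert_hacid(points, edges, q, eps=0.1):
    """Insert q into the graph (points, edges) and return the new edge set.

    `points` is extended in place; `edges` holds pairs (i, j) with i < j.
    """
    # Line 1: nearest vertex nn of q among the existing vertices.
    nn = min(range(len(points)), key=lambda i: dist(points[i], q))
    # Line 2: farthest relative neighbour fn of nn in the current graph
    # (falls back to nn itself if nn has no neighbour; an assumption).
    nbrs = [j if i == nn else i for (i, j) in edges if nn in (i, j)]
    fn = max(nbrs, key=lambda i: dist(points[nn], points[i]), default=nn)
    # Line 3: magnified hypersphere radius.
    sr = (dist(q, points[nn]) + dist(points[nn], points[fn])) * (1 + eps)
    # Line 4: add q to the vertex set.
    points.append(q)
    # Lines 6-11: vertices inside the hypersphere (q included, an assumption).
    inside = {i for i in range(len(points)) if dist(points[i], q) <= sr}
    # Line 12: recompute the edges among the vertices of SR, using only
    # the vertices of SR as possible blockers (classical brute force).
    new_edges = {e for e in edges if not (e[0] in inside and e[1] in inside)}
    for i, j in combinations(sorted(inside), 2):
        d_ij = dist(points[i], points[j])
        if not any(
            max(dist(points[i], points[r]), dist(points[j], points[r])) < d_ij
            for r in inside
            if r != i and r != j
        ):
            new_edges.add((i, j))
    return new_edges


# Hypothetical usage: start from two linked vertices, then insert a third.
pts = [(0.0, 0.0), (1.0, 0.0)]
graph = insert_hacid(pts, {(0, 1)}, (0.4, 0.1))
print(sorted(graph))
```

In this toy run the inserted point lies between the two initial vertices, so the initial edge is invalidated and replaced by two edges towards the new vertex.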
The complexity of this insertion algorithm is $O(2n + n'^3)$, where $n = |V|$ and
$n' = |SR|$. The $2n$ term corresponds to the search for the nearest neighbour and
the search for the vertices that lie in the hypersphere. The second term is the time for
updating the neighbourhood relations between the points within the hypersphere
with the classical RNG algorithm. Using a recall measure and graph correspondence, the authors state that the incrementally built
RNG corresponds exactly to the RNG built with a brute-force algorithm.
We have noticed several drawbacks regarding this insertion algorithm. The first concerns
the choice of the parameter $\varepsilon$, used by the authors to expand the neighbourhood
of q. It is empirically set to $\varepsilon = 0.1$ in [1]. However, no proof is given that the relative
neighbours of the newly inserted point must lie in this magnified hypersphere.
This could cause relative neighbours to be lost, as illustrated in
Fig. 2 (left). Second, due to the spherical definition of the neighbourhood SR,
the update step may create false edges. Indeed, the classical RNG algorithm is
performed considering only the vertices lying in SR. Figure 2 (right) illustrates
such an erroneous edge creation. Thus, the insertion algorithm described above
might not incrementally yield the exact RNG, contrary to what the authors state, due
to the loss of edges or the inclusion of bad ones. This has been observed and
is reported in Sect. 4. Third, an assumption is made that $n' \ll n$, i.e. the
number of vertices in the hypersphere is far smaller than the number of previously
added points. This may not be the case, for instance, if a set of dense points lying in
the same part of the space is considered. This has been experimentally observed
for a few datasets. Thus the term $n'^3$ in the complexity might be an issue.
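The second drawback can be reproduced on a small hand-made configuration. The sketch below is an illustration under stated assumptions: the four 2D points and the radius `sr = 1.0` are hypothetical values chosen for the example (the radius is not derived from nn and fn here), and `rng_edges` is a hypothetical helper that runs the classical criterion using only the given vertices as possible blockers.

```python
from itertools import combinations
from math import dist


def rng_edges(points, indices):
    """Classical RNG criterion restricted to the vertices in `indices`."""
    edges = set()
    for i, j in combinations(sorted(indices), 2):
        d_ij = dist(points[i], points[j])
        if not any(
            max(dist(points[i], points[r]), dist(points[j], points[r])) < d_ij
            for r in indices
            if r != i and r != j
        ):
            edges.add((i, j))
    return edges


# Hypothetical layout: r blocks the pair (p, s) but lies outside the sphere.
q, p, s, r = (0.0, 0.0), (0.9, 0.0), (0.5, 0.8), (1.05, 0.5)
points = [q, p, s, r]
sr = 1.0  # assumed hypersphere radius around q
sphere = [i for i, x in enumerate(points) if dist(x, q) <= sr]

local = rng_edges(points, sphere)               # update restricted to SR
exact = rng_edges(points, range(len(points)))   # all points considered
false_edges = local - exact
print(sphere, sorted(false_edges))
```

Because the only point lying in the relative neighbourhood of p and s is outside the sphere, the restricted update keeps the edge between them, while the computation over all points rejects it.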
[Fig. 2. Left: loss of a relative neighbour. Right: creation of an erroneous edge. Figure labels: fn, nn, SR, g.]
3
3.1
\[ \bigcup_{i=1}^{L} N_i^e(q) \]
3.2 Algorithm
Fig. 3. First- and second-order vertex neighbours $N_1(q)$ and $N_2(q)$ (in light grey and grey, respectively)
and edge neighbours $N_1^e(q)$ and $N_2^e(q)$ (in dotted light grey and grey, respectively) of the vertex q.
The main steps of Algorithm 2 are as follows. The first steps (lines 1–9) are
the same as in Algorithm 1. As explained in the previous section, the hypersphere SR centred on q and its content are computed. Then, we retrieve
the relative neighbours of q in SR (lines 12–16). For each vertex p in SR, the
pair (p, q) is considered. For each vertex r in SR, we check whether r lies in the relative
neighbourhood of the pair (p, q). If no vertex lies in this relative neighbourhood,
then p and q are relative neighbours, and the edge pq is created. This step is carried out in $O(n'^2)$, where $n' = |SR|$. Step 1 gathers all the edges that belong to
the edge-based neighbourhood $N_L^e(q)$ of q, given an order L. This is performed
with a recursive algorithm. First, we initialise an empty set of edges A. For each
relative neighbour q' of q in RNG, we recursively compute the (L − 1)th-order
edge neighbours of q' and store them in A. At the end of this step, the set
A contains the list of edges that belong to $N_L^e(q) \setminus N_1^e(q)$. Finally, the effective
update is made in Step 2: for each edge e in A, we check whether e has to be removed
due to the appearance of q, i.e., whether q lies in the relative neighbourhood of the
two endpoints of the considered edge. The overall complexity of the proposed
insertion algorithm is $O(2n + n'^2 + deg^L)$, where deg is the average degree of the
graph RNG.
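Steps 1 and 2 described above can be sketched in Python as follows. This is one possible reading of the prose, not the authors' code. It assumes that the first-order edge neighbours of a vertex are the edges incident to it, that higher orders are obtained by following neighbours recursively, and that the edges between q and its relative neighbours have already been created. The names `edge_neighbours` and `invalidate_edges` are hypothetical.

```python
from math import dist


def edge_neighbours(adj, v, order):
    """Edges within `order` hops of v (order 1 = the edges incident to v)."""
    if order <= 0:
        return set()
    found = set()
    for u in adj[v]:
        found.add((min(u, v), max(u, v)))
        found |= edge_neighbours(adj, u, order - 1)
    return found


def invalidate_edges(points, edges, q_idx, order):
    """Remove the edges of the order-L edge neighbourhood that q now blocks."""
    adj = {i: set() for i in range(len(points))}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    first_order = {(min(u, q_idx), max(u, q_idx)) for u in adj[q_idx]}
    # Step 1: gather A, the edge neighbourhood of q minus its own edges.
    candidates = set()
    for u in adj[q_idx]:
        candidates |= edge_neighbours(adj, u, order - 1)
    candidates -= first_order
    # Step 2: drop every candidate edge whose relative neighbourhood contains q.
    q = points[q_idx]
    kept = set(edges)
    for i, j in candidates:
        if max(dist(points[i], q), dist(points[j], q)) < dist(points[i], points[j]):
            kept.discard((i, j))
    return kept


# Hypothetical example: q (index 2) was just linked to vertices 0 and 1,
# and the older edge (0, 1) is still present.
points = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.1)]
edges = {(0, 1), (0, 2), (1, 2)}
after_l1 = invalidate_edges(points, edges, q_idx=2, order=1)
after_l2 = invalidate_edges(points, edges, q_idx=2, order=2)
print(sorted(after_l1), sorted(after_l2))
```

With order 1 no existing edge is examined, so the stale edge (0, 1) survives; with order 2 it is reached through the neighbours of q and removed. The recursion visits on the order of deg^L edges, consistent with the last term of the stated complexity.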
Thus, we propose here an algorithm that reduces the time complexity of the
local update strategy. Moreover, the edge-based neighbourhood makes it possible to check
more edges that may be affected by the appearance of a new vertex. The trade-off between computation time and accuracy that can be achieved is studied
in the experiments.
4 Experiments
4.1 Experimental Setup
Algorithms 1 and 2 presented in this paper were implemented in C++. The
classical $O(n^3)$ RNG algorithm was also implemented for reference, to assess
the graph accuracy. For a fair comparison, the algorithm described by Hacid
et al. was implemented under the same constraints as our algorithm (only the
local update strategy differs). The value of $\varepsilon$ was set to 0.1 as in [1]. In order to speed
up some operations (e.g. the nearest neighbour search), they were parallelised
using OpenMP¹. In the present experiments, the whole dataset was loaded in
memory, and then each piece of data was inserted one by one. The graph was
stored as an adjacency list. For runtime experiments, we used an Intel Xeon
W3520 CPU (quad-core) at 2.66 GHz, with 8 GB of RAM.
4.2 Datasets
Five datasets were selected (available on the online UCI machine learning repository²). They are either artificial or real-world multidimensional datasets. Table 1
summarizes the specifications of the datasets. The first three can be considered
as small datasets, i.e. their distance matrices can be stored in memory.
They were used mainly to assess the validity of our algorithm and the accuracy of the resulting graphs. The last two, which are large image collections
of up to one million images, were used to verify the scalability of the algorithm.
For these datasets, the exact computation of the RNG is not tractable (n.t.) in
reasonable time with the $O(n^3)$ algorithm. Therefore, a CPU/GPU RNG construction method [3], which can handle up to 300,000 entries, was used to generate the exact graph for the Corel68k dataset. Regarding the MIRFLICKR-1M³
(MF-1M) image collection, its RNG is not tractable at all, thus the number of
edges does not appear in Table 1.
¹ http://www.openmp.org/
² http://archive.ics.uci.edu/ml
³ http://press.liacs.nl/mirflickr
Table 1. Datasets used for experiments. The number of vertices, their dimension and
the number of edges in the exact RNG are given.
D          Type         |V|      Dimension   |E(RNG)|
Iris       real world   150      4           195
WDBC       real world   569      30          712
Breiman    artificial   5000     40          17,837
Corel68k                68,040   57          190,410
All five datasets share one common property: their attributes are numerical, hence the Euclidean distance was used for data comparison. Note that this
work can also be applied to data described by categorical features, given an appropriate distance function.
4.3 Accuracy Evaluation
First, we evaluate the accuracy of the proposed algorithm and its sensitivity
with regard to the parameter L. The exact RNG was computed for the first four
datasets. The number of edges of these graphs is used as ground truth
and the graph correspondence is computed to evaluate the approximate graphs.
Table 2 gives the number of erroneously added edges and removed edges. Since
the exact RNG could not be produced in reasonable time, this experiment is not
reported for the largest dataset, namely MF-1M.
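The comparison between an approximate graph and the exact one can be expressed as two set differences. The short Python sketch below shows one way to obtain the two counts; the function name `edge_differences` and the toy edge sets are hypothetical, and the paper does not specify how its graph correspondence is implemented.

```python
def normalise(edges):
    """Store each undirected edge as a sorted pair so (a, b) equals (b, a)."""
    return {tuple(sorted(e)) for e in edges}


def edge_differences(approx_edges, exact_edges):
    """Return (wrongly added, wrongly removed) edge counts."""
    approx, exact = normalise(approx_edges), normalise(exact_edges)
    wrongly_added = approx - exact    # in the approximate graph only
    wrongly_removed = exact - approx  # in the exact graph only
    return len(wrongly_added), len(wrongly_removed)


# Hypothetical edge sets for illustration.
approx = {(0, 1), (1, 2), (2, 3)}
exact = {(0, 1), (2, 1), (1, 3)}
added, removed = edge_differences(approx, exact)
print(f"+{added}/-{removed}")
```

A result of zero for both counts means that the approximate graph corresponds exactly to the exact graph.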
One interesting observation in this experiment is that the main difference
between the graph produced incrementally and the exact graph is often the
addition of wrong edges. Strictly speaking, these edges are not wrongly added;
rather, some edges are not invalidated after an insertion because of the limited extent of the
proposed edge-based neighbourhood. Thus, our algorithm creates a small
number of false similarities between data, which may not be critical in some
applications (e.g. similar image retrieval or user recommendation systems).
We notice that Hacid et al.'s algorithm does not always incrementally yield
the exact RNG as stated in their paper. Furthermore, our proposed algorithm
performs at least as well as, if not better than, Hacid et al.'s algorithm in terms
of accuracy, considering a low edge-based neighbourhood order (L = 4).
As expected, the number of wrongly added or removed edges in the
approximate graphs decreases as the edge-based neighbourhood order increases.
Indeed, as more edges are checked, fewer erroneous edges are left, thus improving
the accuracy of the approximate RNG. It is possible to build such a graph with
less than 1 % of wrongly added or removed edges considering a low order of edge-based neighbourhood (such as L = 4).
Table 2. Number of wrongly added edges and removed edges in the RNGs computed
by Algorithms 1 and 2. The symbol == means that the approximate graph corresponds
exactly to the exact graph.
            |E(RNG)|   Algorithm 1   Algorithm 2
                                     L=2        L=3       L=4
Iris        195        +10/-2        +8/-1      ==        ==
WDBC        712        +2/-1         +10/-0     +3/-0     ==
Breiman     17837      +0/-0         +1161/-0   +299/-0   +26/-0
Corel68k    190410
4.4
            7692     16       25       178
Corel68k    122 h    889      1371     1604
MF-1M       145 h    151 h    181 h    >> 250 h
5 Conclusion
References
1. Hacid, H., Yoshida, T.: Incremental neighborhood graphs construction for multidimensional databases indexing. In: Canadian Conference on AI (2007)
2. Jaromczyk, J.W., Toussaint, G.T.: Relative neighborhood graphs and their relatives. Proc. IEEE 80, 1502–1517 (1992)
3. Liu, T., Bouali, F., Venturini, G.: EXOD: a tool for building and exploring a large graph of open datasets. Comput. Graph. 39, 117–130 (2014)
4. Scuturici, M., Scuturici, V.-M., Clech, J., Zighed, D.A.: Navigation dans une base d'images à l'aide de graphes topologiques. In: Inforsid (2004)
5. Toussaint, G.T.: The relative neighbourhood graph of a finite planar set. Pattern Recogn. 12, 261–268 (1980)
6. Toussaint, G.T.: Some unsolved problems on proximity graphs (1991)