Similarity Search Using Metric Trees: Bhavin Bhuta Gautam Chauhan
IJIACS
ISSN 2347 – 8616
Volume 4, Issue 11
November 2015
Performance of BK tree:
As the database size increases, the number of irrelevant nodes visited increases.

VP Trees (Vantage Point)
The VP-tree divides the search space into two regions, a left and a right subtree, according to distance from a vantage point, thereby reducing computation. The vantage point can simply be a randomly selected point. There are other ways to select it (number of nodes, distance to other nodes, etc.), but they require computation and thereby incur additional cost. The technique is general and its logic is simple. The distance from each point to the vantage point (VP) is computed, and the median of these distances is then calculated. The points are partitioned according to whether their distance is less than or greater than the median, and the two subtrees are formed: points closer than the median go to the left subtree and points farther than the median go to the right subtree. The same procedure is followed recursively for each node. The median distance acts as the separating constraint. We can think of it as a radius: all points whose distance is less than the median (the left subtree points) fall inside the circle, and all points whose distance is greater than the radius (the right subtree points) fall outside the circle. Each node of the VP tree stores the VP id, the median distance, and the addresses of the left and right child nodes [11]. The figure below explains the concept.

Points X, Y and Z form the left subtree.
Points P, Q and R form the right subtree.
Hence we 'prune' one half of the tree and continue pruning until we reach the goal node, dealing with one subtree at a time. This pruning speeds up the retrieval of information from the database. But it comes at the cost of calculating the median and comparing every point's distance with it.

Performance of VP tree:
The graph plots the size of the database on the x-axis and the average number of nodes visited on the y-axis. Because the VP tree makes a decision at each node about which subtree to search, very few nodes are visited. As the database size increases, search time shows logarithmic growth. The VP tree performs best for a large database. Creating the tree takes O(n log n) time, where n is the number of nodes. Search time complexity varies, but an efficient technique should have O(log n) expected time [2].
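The VP-tree construction and pruning described above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the names `VPNode`, `build_vp_tree` and `range_search`, the Euclidean metric, and the choice to send points at exactly the median distance to the right subtree are all assumptions made for the example. The search shown is a range query (all points within `radius` of the query). It uses the triangle inequality to skip a subtree only when that subtree cannot contain a match, so both subtrees are visited when the query ball straddles the median boundary.

```python
import math
import random
import statistics
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, ...]
Metric = Callable[[Point, Point], float]


def euclidean(a: Point, b: Point) -> float:
    """Assumed metric for this sketch; any metric distance would do."""
    return math.dist(a, b)


@dataclass
class VPNode:
    vantage: Point                     # the vantage point stored at this node
    median: float = 0.0                # median distance from the vantage point
    left: Optional["VPNode"] = None    # points with d(V, p) < median (inside)
    right: Optional["VPNode"] = None   # points with d(V, p) >= median (outside)


def build_vp_tree(points: Sequence[Point], dist: Metric = euclidean,
                  rng: Optional[random.Random] = None) -> Optional[VPNode]:
    """Recursively split the points at the median distance from a random VP."""
    if not points:
        return None
    if rng is None:
        rng = random.Random(0)  # fixed seed only to make the sketch repeatable
    rest = list(points)
    vantage = rest.pop(rng.randrange(len(rest)))  # randomly selected VP
    node = VPNode(vantage)
    if rest:
        dists = [dist(vantage, p) for p in rest]
        node.median = statistics.median(dists)
        inner = [p for p, d in zip(rest, dists) if d < node.median]
        outer = [p for p, d in zip(rest, dists) if d >= node.median]
        node.left = build_vp_tree(inner, dist, rng)
        node.right = build_vp_tree(outer, dist, rng)
    return node


def range_search(node: Optional[VPNode], query: Point, radius: float,
                 dist: Metric = euclidean) -> List[Point]:
    """Return every stored point within `radius` of `query`."""
    if node is None:
        return []
    d = dist(query, node.vantage)
    hits = [node.vantage] if d <= radius else []
    # Left points satisfy d(V, p) < median, so d(q, p) > d - median.
    # The left subtree can only hold a match if d - radius < median.
    if d - radius < node.median:
        hits += range_search(node.left, query, radius, dist)
    # Right points satisfy d(V, p) >= median, so d(q, p) >= median - d.
    # The right subtree can only hold a match if d + radius >= median.
    if d + radius >= node.median:
        hits += range_search(node.right, query, radius, dist)
    return hits


points = [(0, 0), (1, 1), (2, 2), (5, 5), (8, 8), (9, 9)]
tree = build_vp_tree(points)
print(sorted(range_search(tree, (1, 1), 1.5)))  # [(0, 0), (1, 1), (2, 2)]
```

The result does not depend on which vantage points happen to be drawn; only the number of nodes visited does.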
Legend for the VP-tree figure: V is the vantage point; d(V,P), d(V,Q) and d(V,R) are greater than the median distance; d(V,X), d(V,Y) and d(V,Z) are less than the median distance.

Kd-tree (K-Dimensional)
The kd-tree, also called the K-Dimensional tree, is a spatial data partitioning tree. It is constructed as a binary search tree in which the data value stored in each node is a k-dimensional point in space. Kd-trees are used in nearest neighbor search and several other applications. To insert a new node, we traverse the kd-tree starting from the root node and move either left or right at each node, based on a comparison of the data values in the 'k'th dimension. Once we find the node under which the data value should be placed, we add the new data as its left or right child, depending on how the data value compares with that node [5].
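The insertion procedure for the kd-tree can be sketched as follows. This is an illustrative sketch rather than the method of [5]: the names `KdNode` and `kd_insert` are hypothetical, and cycling the comparison dimension with the depth of the node (x at the root, y one level down, and so on) is the usual convention but is assumed here. Points with an equal coordinate go to the right subtree, matching the "greater than or equal" rule of the two-dimensional example in this paper, whose three points are reused.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class KdNode:
    point: Point
    left: Optional["KdNode"] = None    # smaller coordinate in the split dimension
    right: Optional["KdNode"] = None   # greater or equal coordinate


def kd_insert(root: Optional[KdNode], point: Point, depth: int = 0) -> KdNode:
    """Insert `point` and return the (possibly new) root of the subtree."""
    if root is None:
        return KdNode(point)
    axis = depth % len(point)  # assumed: the split dimension cycles with depth
    if point[axis] < root.point[axis]:
        root.left = kd_insert(root.left, point, depth + 1)
    else:
        root.right = kd_insert(root.right, point, depth + 1)
    return root


root = None
for p in [(4, 7), (3, 8), (18, 16)]:
    root = kd_insert(root, p)

print(root.point, root.left.point, root.right.point)  # (4, 7) (3, 8) (18, 16)
```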
To find whether a point is present in the left or the right subtree:
Here we consider a two-dimensional tree. First align the root node in the X plane (that is, split on the x-coordinate). The right and left subtrees then contain all the points whose x-coordinates are greater than or equal to, and less than, that of the root node, respectively. Here (4,7) represents the root node, which is aligned in the X plane. The point (3,8) has a smaller x-coordinate than the root node and is hence placed in the left subtree, whereas the point (18,16) is placed in the right subtree since its x-coordinate is greater than that of (4,7) [5].

Kd tree Example

Performance of kd-tree:
Performance of kd-trees can be measured in terms of the number of queries Q and the time T for processing the queries. Figure 1.1 shows the relationship between Q and T. From the figure we can conclude that query time T grows approximately linearly with the number of queries Q. The approximate time complexity is O(n log n) [6]. The second graph shows the relationship between database size and the average number of nodes visited. As the database size increases, search time also shows logarithmic growth. It is quite evident that the kd-tree is less efficient than the VP tree, although as the database size increases its curve looks similar to that of the VP tree [11].

The Quadtree
A quadtree is a spatial data partitioning tree. It is used to find nearest neighbor distances and in optimizations of various algorithms. The root has exactly four child nodes, hence the name quadtree. The root node covers the entire space, whereas every other node covers a specific portion of it. New data can be inserted into a quadtree easily. Each node is in turn divided into four nodes, and it acts as the root node for these newly generated nodes. This method of dividing a single node into four nodes is used in two-dimensional searches. Quadtrees are classified into different types on the basis of the data they contain, i.e. points, curves, lines, shapes and areas [8].

4.1 Quadtree example
In the example below we start with a single root node A, as shown in the figure. We divide this root node A into four small squares A1, A2, A3 and A4. These small squares are termed child nodes.
Now repeat the entire process on each child node, i.e. again divide the square into four small squares. For node A1, the child nodes are A11, A12, A13 and A14. For node A2, the child nodes are A21, A22, A23 and A24. Similarly, repeat this process for nodes A3 and A4, as shown in the figure [7].

Graph (a) shows query optimization, i.e. the relationship between query selectivity percentage and response time in seconds. Graph (b) shows the average response time taken for queries on a two-dimensional database, i.e. the relationship between database size and response time in seconds. The database size is approximately 57,000 million polygons, whereas the database size in graph (b) is approximately 1.5 million polygons. Samet and Webber represented polygon maps using the point-region quadtree. Each vertex (point) of the polygon map is stored in the point-region quadtree. The size of the polygon map can be increased by cloning the polygon data [11].
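The recursive subdivision from the quadtree example (A into A1 to A4, A1 into A11 to A14) can be sketched as a small point-region quadtree. This is an assumption-laden illustration, not the structure used in the cited experiments: the class `QuadNode`, the capacity of one point per leaf, the depth limit, and the numbering of the four quadrants (1 = lower left, 2 = lower right, 3 = upper left, 4 = upper right) are all choices made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class QuadNode:
    name: str            # e.g. "A", "A1", "A14"
    x: float             # lower-left corner of the square
    y: float
    size: float          # side length of the square
    depth: int = 0
    points: List[Point] = field(default_factory=list)
    children: List["QuadNode"] = field(default_factory=list)

    CAPACITY = 1         # assumed: a leaf splits once it would hold 2 points
    MAX_DEPTH = 8        # assumed guard against endless splits on duplicates

    def contains(self, p: Point) -> bool:
        return (self.x <= p[0] < self.x + self.size
                and self.y <= p[1] < self.y + self.size)

    def _child_for(self, p: Point) -> "QuadNode":
        half = self.size / 2
        col = 1 if p[0] >= self.x + half else 0
        row = 1 if p[1] >= self.y + half else 0
        return self.children[2 * row + col]

    def _subdivide(self) -> None:
        """Split this square into four equal child squares."""
        half = self.size / 2
        self.children = [
            QuadNode(f"{self.name}{2 * row + col + 1}",
                     self.x + col * half, self.y + row * half,
                     half, self.depth + 1)
            for row in (0, 1) for col in (0, 1)
        ]

    def _insert(self, p: Point) -> None:
        if self.children:
            self._child_for(p)._insert(p)
        elif len(self.points) < self.CAPACITY or self.depth >= self.MAX_DEPTH:
            self.points.append(p)
        else:
            self._subdivide()
            pending, self.points = self.points + [p], []
            for q in pending:  # push the stored points down to the children
                self._child_for(q)._insert(q)

    def insert(self, p: Point) -> bool:
        """Insert p; returns False if p lies outside this node's square."""
        if not self.contains(p):
            return False
        self._insert(p)
        return True

    def leaf_for(self, p: Point) -> "QuadNode":
        node = self
        while node.children:
            node = node._child_for(p)
        return node


root = QuadNode("A", 0, 0, 16)
for p in [(1, 1), (9, 1), (1, 9), (5, 5)]:
    root.insert(p)

print([c.name for c in root.children])  # ['A1', 'A2', 'A3', 'A4']
print(root.leaf_for((5, 5)).name)       # A14
```

Here the second point forces the root A to split, and (5, 5) later forces A1 to split because it already holds (1, 1).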
suggest to create a combination of two or more techniques, making use of a hybrid indexing method. In our opinion the VP tree works best in all cases as compared to the other trees. But it has been found that as the database size increases, the performance of the Kd tree and the VP tree eventually becomes the same.

REFERENCES:
[1] N. Katayama, S. Satoh. The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, Arizona.
[2] Chávez, E., Navarro, G., Baeza-Yates, R., Marroquín, J.L.: Searching in metric spaces. ACM Computing Surveys 33(3) (2001) 273–321.
[3] Burkhard, W.A. and Keller, R.M. "Some Approaches to Best-Match File Searching", Communications of the ACM 16 (4), April 1973, 230-236.
[4] Baeza-Yates, Ricardo, et al. "Proximity matching using fixed-queries trees." Combinatorial Pattern Matching. Springer Berlin Heidelberg, 1994.
[5] Bozkaya, Tolga, and Meral Ozsoyoglu. "Distance-based indexing for high-dimensional metric spaces." ACM SIGMOD Record. Vol. 26. No. 2. ACM, 1997.
[6] Chandran, Sharat. Introduction to kd-trees. University of Maryland Department of Computer Science.
[7] Wald I, Havran, On building fast kd-trees for ray tracing, and on doing that in O(N log N), 2006.
[8] Tomas G. Rokicki, An Algorithm for Compressing Space and Time, 2006.
[9] Jacob E. Goodman, Joseph O'Rourke and Piotr Indyk, Nearest neighbours in high-dimensional spaces, 2004.
[10] Data Structures and Algorithms for Nearest Neighbour Search in General Metric Spaces, 1993.
[11] Yianilos, Peter N. "Data structures and algorithms for nearest neighbor search in general metric spaces." SODA. Vol. 93. No. 194. 1993.