
International Journal of Innovations & Advancement in Computer Science (IJIACS)
ISSN 2347–8616, Volume 4, Issue 11, November 2015

Similarity Search using Metric Trees


Bhavin Bhuta*, Gautam Chauhan*
Department of Computer Engineering, Thadomal Shahani Engineering College, Mumbai, India

Abstract:

The traditional method of comparing values directly against a database may work well for a small database, but as the database grows the efficiency invariably takes a hit. For instance, image plagiarism detection software has a large number of images stored in its database; when we compare the hash value of an input query image against those in the database, the comparison may consume a good amount of time. For this reason we introduce the concept of metric trees, which make use of metric space properties for indexing in databases. In this paper we present the working of various nearest neighbor search trees for speeding up this task. Various spatial partitioning and indexing techniques are explained in detail and their real-world performance is discussed.

Keywords: Metric Trees, Metric Space, Nearest Neighbor Search Trees, Spatial Partition, Indexing Technique

Introduction to Metric Trees / Nearest Neighbor Search Trees

With ever-growing multimedia (audio, video, images, etc.) there is a pressing need to manage it. Retrieving information from a database by brute force proves costly in every respect, because we must determine how similar one multimedia file is to another, and the similarity factors to be considered while comparing include image patterns, texture, color, sound and shape [1]. Therefore we make use of a distance function to account for similarity. Thus, to increase the efficiency of executing similarity queries, we need the right indexing technique; it is for this reason that we make use of metric trees. Metric trees use the properties of metric spaces, namely positivity, symmetry and the triangle inequality. When we talk about these properties we consider the absolute distance between the entities in question (say, two points) [5]. The properties are:

1. Positivity
d(a, b) ≡ |a − b| ≥ 0, and d(a, b) = 0 iff |a − b| = 0, i.e., iff a = b (positive-definiteness)
2. Symmetry
d(a, b) ≡ |a − b| = |b − a| ≡ d(b, a)
3. Triangle Inequality
d(a, b) ≡ |a − b| = |a − c + c − b| ≤ |a − c| + |c − b| ≡ d(a, c) + d(c, b)

All indexing techniques primarily exploit the triangle inequality property. We'll discuss four of these indexing techniques, the BK tree, VP tree, KD tree and Quad tree, in detail [5].

BK-tree
The BK tree is named after Burkhard and Keller. Many powerful search engines make use of this tree because it supports fuzzy search. The problem definition for this type of tree can be stated as: "Find, in a dictionary of some finite size, all the strings that match a given word, taking into account k possible differences." In simpler terms, if there is a string of hash values of finite length to be matched against a given data set, then at most k error terms (differences) are allowed. For example, if your input query requests "sunny" with 1 possible error term, then "suny" or "bunny" are acceptable, and so on. Instead of using brute force and searching throughout the database, which would take O(n·m²) time, where m is the average string length and n is the word count, we try to find only the nodes that lie within the admissible error range and consider only those nodes. Next we measure the distance of the input query to the entries in the database in order to determine how many error terms there are, and retrieve only those entries which satisfy the defined error criterion [3].

Levenshtein Distance
The Levenshtein distance is the metric used in the BK tree implementation to measure the difference between strings, and it satisfies the metric space properties. The Levenshtein distance basically tells us how many modifications are required for, let's say, string1 to be transformed into string2.


Playing can be represented as Praying by replacing the L with an R, hence the Levenshtein distance is 1. Similarly, if we have two binary strings then, instead of alphabets, we replace 0s and 1s. For example, d(1000001000, 0000000000) is 2.
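For reference, the sketch below is a standard dynamic-programming formulation of the Levenshtein distance in Python (illustrative code of ours, not an implementation from the paper), checked against the examples used in the text.

```python
def levenshtein(s1, s2):
    """Minimum number of insertions, deletions and substitutions
    needed to turn s1 into s2 (standard DP formulation)."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1                      # keep the longer string as s1
    previous = list(range(len(s2) + 1))      # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, start=1):
        current = [i]                        # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            insert_cost  = current[j - 1] + 1
            delete_cost  = previous[j] + 1
            replace_cost = previous[j - 1] + (c1 != c2)
            current.append(min(insert_cost, delete_cost, replace_cost))
        previous = current
    return previous[-1]

if __name__ == "__main__":
    print(levenshtein("Playing", "Praying"))          # 1
    print(levenshtein("1000001000", "0000000000"))    # 2
    print(levenshtein("Japn", "Delhi"))               # 5
```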
Example of BK tree
We create a tree based on the Levenshtein distance. Below is a tree having Delhi as its root node, with Mumbai, Seoul, London and so on as its descendants, where the edge weights represent the Levenshtein distance.

[Figure: BK tree rooted at Delhi, with edge weights equal to the Levenshtein distance]

Let's say the query asks for the string "Japn". We need to find "Japn" in the given tree with the admissible error being n = 1. We start with the root node and calculate the Levenshtein distance to our input string:
d(Japn, Delhi) = 5
Then we assign a range that specifies the nodes to be explored, with a lower limit and an upper limit defined as lower limit = d − n and upper limit = d + n (4 to 6 for our example). Hence the branches labelled 4, 5 and 6 will be explored, and we calculate the Levenshtein distance again, keeping every entry whose distance to the input query (Japn) is less than or equal to the admissible error n. Hence 4 is the lower limit and 6 is the upper limit, and we repeat this procedure till the last node:
d(Japn, Seoul) = 4
d(Japn, Mumbai) = 6
d(Japn, London) = 5
Considering Seoul as parent and exploring its child:
d(Japn, Japan) = 1
Even if we find a distance equal to n, we still have to search the remaining branches, as priority should be given to a Levenshtein distance of zero. The admissible error should be chosen carefully; an admissible error of 2 generally works fine, but as we increase it further the performance starts degrading.

Performance of BK tree:

[Figure: percentage of nodes scanned and percentage of matching nodes found versus database size]

The database size signifies the amount of data stored; strings numbering about 100 were searched, with the lengths of the searched strings being 5-10. We have two graphs, one showing the nodes scanned for a particular string and the other the nodes found with a matching string. One important note: the strings searched for these graphs had admissible error = 1. We examine the difference between the percentage of nodes scanned and the percentage of nodes found; it should be as low as possible, ideally 0. As seen from both charts, as the database size increases, the number of irrelevant nodes visited also increases.
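To make the procedure concrete, the following is a minimal BK-tree sketch in Python (our own illustrative code; the class and helper names are ours, and the levenshtein routine is the same DP as in the earlier sketch). Insertion hangs each new word off an edge labelled with its distance to the current node, and search only follows edges whose labels lie in [d − n, d + n]. The exact tree shape depends on the insertion order, so it need not match the figure, but the query result is the same.

```python
# Illustrative BK-tree sketch (our own code, not from the paper).
# Each child edge stores the Levenshtein distance to its parent; a query
# only descends into edges whose weight lies in [d - n, d + n].

def levenshtein(s1, s2):
    # Same dynamic-programming routine as in the earlier sketch.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(cur[j - 1] + 1, prev[j] + 1, prev[j - 1] + (c1 != c2)))
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, words):
        it = iter(words)
        self.root = next(it)          # first word becomes the root (e.g. "Delhi")
        self.children = {}            # (node, edge weight) -> child node
        for word in it:
            self.add(word)

    def add(self, word):
        node = self.root
        while True:
            dist = levenshtein(word, node)
            if dist == 0:
                return                # word is already present
            child = self.children.get((node, dist))
            if child is None:
                self.children[(node, dist)] = word
                return
            node = child              # edge weight already taken: descend

    def search(self, query, n):
        """Return all stored words within admissible error n of query."""
        matches, stack = [], [self.root]
        while stack:
            node = stack.pop()
            dist = levenshtein(query, node)
            if dist <= n:
                matches.append((node, dist))
            # Only edges with weight in [dist - n, dist + n] can hold matches.
            for weight in range(dist - n, dist + n + 1):
                child = self.children.get((node, weight))
                if child is not None:
                    stack.append(child)
        return matches

if __name__ == "__main__":
    tree = BKTree(["Delhi", "Mumbai", "Seoul", "London", "Japan"])
    print(tree.search("Japn", 1))     # [('Japan', 1)]
```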


VP Trees (Vantage Point)
The VP-tree divides the search space into two regions, a left and a right subtree, relative to a vantage point, thereby reducing the computation. This vantage point is simply a randomly selected point; there are other ways to select the vantage point (number of nodes, distance to other nodes, etc.), but they require extra computation and thereby incur additional cost. The implementation of this technique follows a very general method with simple logic. The distance from each point to the vantage point is computed, and once we have all these values we calculate their median. The points are then sorted according to whether they are less than or greater than the median value, and the two subtrees are formed: all points closer than the median go to the left subtree and the rest to the right subtree. We recursively follow the same procedure for each node. The median distance acts as a separating constraint: we can think of it as a radius, with all values less than the median distance (left-subtree points) falling inside the circle and all values greater than this radius (right-subtree points) falling outside it. The data structure implementation of a VP tree node contains the vantage point id, the median distance, and the addresses of the left and right subtree nodes [11]. The figure below explains the concept.

[Figure: VP tree partition around vantage point V]
Here, V is the vantage point:
d(V, P) > median distance
d(V, R) > median distance
d(V, Q) > median distance
d(V, X) < median distance
d(V, Y) < median distance
d(V, Z) < median distance
Points X, Y and Z form the left subtree; points P, Q and R form the right subtree.

Hence we "prune" one half of the tree entirely and continue pruning until we reach the goal node, thus dealing with one subtree at a time. This pruning helps while retrieving information from the database, thereby increasing the retrieval rate, but it comes at the cost of calculating the median and comparing all points against it.

Performance of VP tree:

[Figure: average number of nodes visited versus database size]

The graph has the size of the database on the x-axis and the average number of nodes visited on the y-axis; because the VP tree has this decision functionality, we make very few comparisons. As the database size goes on increasing, the search time shows logarithmic growth, and the VP tree performs best for a large database. Creating the tree takes O(n log n), where n is the number of nodes; the search time complexity varies, but an efficient technique should have O(log n) as the expected time [2].
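A compact sketch of the construction and search just described (illustrative Python of ours, not the authors' implementation; names such as VPNode and nearest are hypothetical). For simplicity the metric is the absolute difference d(a, b) = |a − b| used earlier; the vantage point is simply the first point of each partition, the median of the distances to it separates the inside (left) and outside (right) sets, and a nearest-neighbour query visits the far side only when the triangle-inequality bound |d(q, vp) − median| still allows a closer match there.

```python
# Illustrative VP-tree sketch (our own code, not the authors' implementation).
# Metric: absolute difference, matching the d(a, b) = |a - b| used earlier.
import statistics

def d(a, b):
    return abs(a - b)

class VPNode:
    def __init__(self, points):
        self.vp = points[0]                        # vantage point (here: first point)
        rest = points[1:]
        self.left = self.right = None
        self.median = None
        if not rest:
            return
        dists = [d(self.vp, p) for p in rest]
        self.median = statistics.median(dists)     # separating radius
        inside  = [p for p, dp in zip(rest, dists) if dp <  self.median]
        outside = [p for p, dp in zip(rest, dists) if dp >= self.median]
        self.left  = VPNode(inside)  if inside  else None   # within the radius
        self.right = VPNode(outside) if outside else None   # beyond the radius

def nearest(node, query, best=None):
    """Return (point, distance) of the closest stored point to query."""
    if node is None:
        return best
    dq = d(query, node.vp)
    if best is None or dq < best[1]:
        best = (node.vp, dq)
    if node.median is None:
        return best
    # Search the more promising side first; visit the other side only if the
    # ball around the query can still cross the median radius.
    near, far = (node.left, node.right) if dq < node.median else (node.right, node.left)
    best = nearest(near, query, best)
    if abs(dq - node.median) < best[1]:
        best = nearest(far, query, best)
    return best

if __name__ == "__main__":
    tree = VPNode([50, 10, 25, 40, 60, 75, 90])
    print(nearest(tree, 72))          # (75, 3)
```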
Kd-tree (K-Dimensional)
The Kd tree, also called a K-Dimensional tree, is a spatial data partitioning tree. It is constructed as a binary search tree in which the data value stored in each node is a k-dimensional point in space. Kd trees are used in nearest neighbor search and several other applications. The initial step in inserting a new node is traversing the kd tree starting from the root node and moving either left or right based on a comparison of the data value in the kth dimension. Once we find the node under which the data value should be placed, we add the new data as its left or right child, again depending on the comparison with the value stored in that node [5].


How do we decide whether a point belongs to the left or the right subtree? Here we consider a two-dimensional tree. First align the root node with the X plane; then the right and left subtrees will contain all the points whose X coordinates are, respectively, greater than or equal to and less than that of the root node. Here (4,7) represents the root node, which is aligned with the X plane. The coordinate (3,8) has a smaller x-coordinate than the root node and hence is placed in the left subtree, whereas (18,16) is placed in the right subtree since its x-coordinate is greater than that of (4,7) [5].

Kd tree Example
[Figure: Kd tree built from the 2-D points above]
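A minimal 2-D kd-tree insertion sketch (our own illustration, with hypothetical names) reproduces the example above: the root (4,7) splits on the x-coordinate, so (3,8) goes to the left subtree and (18,16) to the right, and the comparison dimension alternates at each level.

```python
# Minimal k-d tree insertion sketch (our own illustration).
# The comparison dimension cycles x, y, x, y, ... as we descend.

class KdNode:
    def __init__(self, point):
        self.point = point
        self.left = None      # points with a smaller coordinate in the split dimension
        self.right = None     # points with a greater-or-equal coordinate

def insert(root, point, depth=0, k=2):
    if root is None:
        return KdNode(point)
    axis = depth % k                          # axis 0 = x, axis 1 = y
    if point[axis] < root.point[axis]:
        root.left = insert(root.left, point, depth + 1, k)
    else:
        root.right = insert(root.right, point, depth + 1, k)
    return root

if __name__ == "__main__":
    root = None
    for p in [(4, 7), (3, 8), (18, 16)]:      # points from the example above
        root = insert(root, p)
    print(root.point, root.left.point, root.right.point)
    # (4, 7) (3, 8) (18, 16): (3,8) has the smaller x-coordinate, (18,16) the larger.
```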

Performance of kd-tree:

[Figure: query time versus number of queries (Figure 1.1); average number of nodes visited versus database size]

The performance of kd-trees can be measured in terms of the number of queries Q and the time T for processing each query. Figure 1.1 shows the relationship between Q and T. From the figure we can conclude that the query time T grows approximately linearly with respect to the number of queries Q; the approximate time complexity is O(n log n) [6]. The second graph shows the relationship between database size and the average number of nodes visited. As the database size goes on increasing, the search time again shows logarithmic growth. It is quite evident that the performance of the kd tree is less efficient than that of the VP tree, although as the database size increases the curve looks similar to that of the VP tree [11].

The Quadtree
A quadtree is a spatial data partitioning tree. It is used to find nearest neighbor distances and in the optimization of various algorithms. The root has exactly four child nodes, hence it is called a quadtree. The root node covers the entire portion of space, whereas every other node covers some specific portion of it. We can insert new data into a quadtree easily: each node can in turn be divided into four nodes, which then act as root nodes for the newly generated subtrees. This methodology of dividing a single node into four nodes is used in two-dimensional searches. Quadtrees are classified into different types on the basis of the data they contain, i.e. points, curves, lines, shapes and areas [8].

Quadtree example
In the example below we first start with a single root node A, as shown in the figure. Now divide this root node A into four small squares A1, A2, A3 and A4. These small squares are termed child nodes.


Now repeat the entire process on each child node, i.e. again divide each square into four small squares. For node A1 the child nodes are A11, A12, A13 and A14; for node A2 the child nodes are A21, A22, A23 and A24. Similarly, repeat this process for nodes A3 and A4, as shown in the figure [7].

[Figure: Quad tree example, showing node A subdivided into A1-A4 and further into A11-A14, A21-A24, and so on]
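The subdivision described above can be sketched as follows (illustrative Python of ours with hypothetical names, assuming a point-region quadtree that holds at most one point per leaf): inserting a point into an occupied node splits its square into four quadrant children, mirroring the A → A1…A4 subdivision in the example.

```python
# Illustrative point-region quadtree sketch (our own code, hypothetical names).
# Each node covers a square; inserting into an occupied leaf splits the square
# into four quadrant children, mirroring the A -> A1..A4 subdivision above.

class QuadNode:
    def __init__(self, x, y, size):
        self.x, self.y, self.size = x, y, size   # lower-left corner and side length
        self.point = None                        # at most one stored point per leaf
        self.children = None                     # four quadrants once subdivided

    def insert(self, px, py):
        if self.children is not None:            # internal node: pass the point down
            self._child_for(px, py).insert(px, py)
        elif self.point is None:                 # empty leaf: store the point here
            self.point = (px, py)
        elif self.point == (px, py):
            return                               # ignore exact duplicates
        else:                                    # occupied leaf: subdivide into four
            half = self.size / 2
            self.children = [
                QuadNode(self.x,        self.y,        half),   # e.g. A1 (lower-left)
                QuadNode(self.x + half, self.y,        half),   # A2 (lower-right)
                QuadNode(self.x,        self.y + half, half),   # A3 (upper-left)
                QuadNode(self.x + half, self.y + half, half),   # A4 (upper-right)
            ]
            old = self.point
            self.point = None
            self._child_for(*old).insert(*old)
            self._child_for(px, py).insert(px, py)

    def _child_for(self, px, py):
        half = self.size / 2
        index = (1 if px >= self.x + half else 0) + (2 if py >= self.y + half else 0)
        return self.children[index]

if __name__ == "__main__":
    root = QuadNode(0, 0, 100)                   # node "A" covering the whole space
    for p in [(10, 10), (80, 20), (30, 90)]:
        root.insert(*p)
    print([child.point for child in root.children])
    # [(10, 10), (80, 20), (30, 90), None]
```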

Quick Search using quad tree:
The quadtree divides each node into four child nodes at every level. Each level represents particular information in space and can be used in various ways. Let's take an example: we work with points in two-dimensional space, having X and Y coordinates, which represent particular data values A1, A2, A3 and so on, as shown in the figure above. Each level of the tree represents these data values. We use this concept for quick search on data. To perform a quick search on a quadtree, we first sort the data present at each level, considering one level at a time, and then move down, checking whether the data value is found by traversing the correct node at each level until we reach the last level [8].

Performance of Quadtree:

[Figure: (a) response time versus query selectivity percentage; (b) average response time versus database size]

Graph (a) shows query optimization, i.e. the relationship between the query selectivity percentage and the response time in seconds. Graph (b) shows the average response time taken for queries on a two-dimensional database, i.e. the relationship between database size and response time in seconds. The database size is approximately 57,000 million polygons, whereas the database size in graph (b) is approximately 1.5 million polygons. Samet and Webber represented polygon maps using a point-region quadtree, which stores each vertex of the polygon map; the size of this polygon map can be increased by cloning the polygon data [11].

Conclusion
In this work we have presented the working of the BK tree, VP tree, KD tree and Quad tree. These techniques have been tested, and it has been found that the VP tree proves to be the forerunner out of the four techniques, followed by the Kd tree.


For future work we suggest creating a combination of two or more techniques, making use of a hybrid indexing method. In our opinion the VP tree works best in all cases compared to the other trees, but it has been found that as the database size increases, the performance of the Kd tree and the VP tree eventually becomes the same.

REFERENCES:
[1] Katayama, N. and Satoh, S. "The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries." In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, 1997.
[2] Chávez, E., Navarro, G., Baeza-Yates, R. and Marroquín, J.L. "Searching in metric spaces." ACM Computing Surveys 33(3) (2001), 273-321.
[3] Burkhard, W.A. and Keller, R.M. "Some Approaches to Best-Match File Searching." Communications of the ACM 16(4), April 1973, 230-236.
[4] Baeza-Yates, Ricardo, et al. "Proximity matching using fixed-queries trees." Combinatorial Pattern Matching. Springer Berlin Heidelberg, 1994.
[5] Bozkaya, Tolga, and Meral Ozsoyoglu. "Distance-based indexing for high-dimensional metric spaces." ACM SIGMOD Record, Vol. 26, No. 2. ACM, 1997.
[6] Chandran, Sharat. Introduction to kd-trees. University of Maryland, Department of Computer Science.
[7] Wald, I. and Havran, V. "On building fast kd-trees for ray tracing, and on doing that in O(N log N)." 2006.
[8] Rokicki, Tomas G. "An Algorithm for Compressing Space and Time." 2006.
[9] Goodman, Jacob E., O'Rourke, Joseph, and Indyk, Piotr. "Nearest neighbours in high-dimensional spaces." 2004.
[10] "Data Structures and Algorithms for Nearest Neighbour Search in General Metric Spaces." 1993.
[11] Yianilos, Peter N. "Data structures and algorithms for nearest neighbor search in general metric spaces." SODA, Vol. 93, No. 194, 1993.

