1 Applications of Nearest Neighbor
Example 1.2 This example relates to the vector space model of information retrieval developed by
Salton. Suppose we have a reasonable vocabulary of the English language. We represent documents
in English by a vector with one coordinate for each word in the vocabulary; document i is mapped to
a vector v_i ∈ R^d. Typically d is of the order of 50K to 100K for a reasonable information retrieval
system. The j-th coordinate of v_i stores the number of times the j-th word from the vocabulary
appears in the document. Alternatively, we may take the vectors to be Boolean, the j-th coordinate
recording whether the j-th word appears in the document at all. In either case, the distance between
two vectors under the L1 or L2 norm gives us some idea of the similarity between the corresponding
documents (a small distance meaning similar documents).
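As a quick illustration (not from the original notes), here is a minimal sketch of this bag-of-words representation, assuming numpy and a toy five-word vocabulary; a real system would use a vocabulary of 50K to 100K words.

# Bag-of-words sketch: map documents to word-count vectors over a fixed
# vocabulary and compare them under the L1/L2 norms (toy data, not real).
import numpy as np

vocabulary = ["nearest", "neighbor", "search", "document", "vector"]
index = {word: j for j, word in enumerate(vocabulary)}

def to_vector(doc):
    v = np.zeros(len(vocabulary))
    for word in doc.lower().split():
        if word in index:          # words outside the vocabulary are ignored
            v[index[word]] += 1    # j-th coordinate counts the j-th word
    return v

d1 = to_vector("nearest neighbor search")
d2 = to_vector("document vector search")
print(np.linalg.norm(d1 - d2))     # L2 distance between the two documents
print(np.linalg.norm(d1 - d2, 1))  # L1 distance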
[Figure: a set of points p1, . . . , p5 in the plane and a query point q.]
The Voronoi diagram of the points p_1, . . . , p_n partitions R^2 into n cells (closed or open polygons)
such that the point p_i lies in cell i and all points of R^2 lying in cell i are nearer to p_i than to any
other point from the given set. It is known that Voronoi diagrams in R^2 can be computed in time
O(n log n) and require linear storage. Given a query point, we then only need to find which cell it
lies in; this is a planar point-location problem, which can be solved in O(log n) query time with linear
storage (for example, via Kirkpatrick's hierarchical decomposition).
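For concreteness, the sketch below answers planar nearest-neighbor queries with scipy's k-d tree rather than the Voronoi-plus-point-location method just described; it returns the same answer (the site whose Voronoi cell contains the query), but the data structure and guarantees differ. The use of scipy.spatial.KDTree is an assumption of this sketch, not part of the notes.

# Planar nearest-neighbor queries with a k-d tree (scipy.spatial.KDTree)
# instead of the Voronoi diagram + point location described above.
# The nearest site returned is, by definition, the site whose Voronoi
# cell contains the query point.
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
points = rng.random((1000, 2))   # the given set p_1, ..., p_n in R^2
tree = KDTree(points)            # preprocessing step

q = np.array([0.5, 0.5])         # query point
dist, i = tree.query(q)          # distance to and index of the nearest p_i
print(i, dist)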
[Figure: the Voronoi diagram of six points p1, . . . , p6 in the plane.]
This technique does not scale well to higher dimensions. For larger d, Clarkson (1987)
improved on a long sequence of previous work with an algorithm for nearest-neighbor search
that has preprocessing and storage requirement O(n^{⌈d/2⌉(1+ε)}). The query time is O(c^d log n)
for some constant c. As d gets near or as high as log n, this query time is not really doing better
than the brute-force search algorithm, which takes no preprocessing time, linear storage, and
query time linear in n.
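For reference, the brute-force baseline mentioned above is a single linear scan; a minimal numpy sketch (the point set and dimensions below are arbitrary):

# Brute-force nearest neighbor: no preprocessing, linear storage,
# and O(dn) work per query -- the baseline referred to above.
import numpy as np

def brute_force_nn(points, q):
    # points has shape (n, d); return the index of the point closest to q.
    dists = np.linalg.norm(points - q, axis=1)
    return int(np.argmin(dists))

rng = np.random.default_rng(1)
P = rng.standard_normal((10000, 50))   # n = 10000 points in d = 50 dimensions
q = rng.standard_normal(50)
print(brute_force_nn(P, q))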
[Figure: vectors x and y with angle θ between them, and a random unit vector v making angle ϕ with x.]
Goal. Suppose we are willing to pay high preprocessing and storage costs. Can we then
achieve a query time of O(poly(1/ε) · poly(d) · log n)?
It turns out that this is possible using an idea based on random projections (Kleinberg
1997). The intuition behind the algorithm is that the relative distances between the query point
and the points of the initial set P should preserve their relation under projection onto random vectors.
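The lemma below draws a vector v uniformly at random from the unit sphere S^{d−1}. A standard way to generate such a vector, sketched here with numpy (illustrative, not part of the original notes), is to normalize a vector of independent Gaussians:

# Sample a uniformly random unit vector from S^{d-1}: a standard Gaussian
# vector is rotationally symmetric, so normalizing it gives a uniform direction.
import numpy as np

def random_unit_vector(d, rng):
    g = rng.standard_normal(d)
    return g / np.linalg.norm(g)

rng = np.random.default_rng(2)
v = random_unit_vector(3, rng)
print(v, np.linalg.norm(v))   # the norm is 1 up to floating-point error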
Lemma 1.2 Suppose 0 < γ ≤ 1/2 and x, y ∈ R^d are such that ‖x‖(1 + γ) ≤ ‖y‖ (i.e. ‖x‖ is
sufficiently smaller than ‖y‖). Now choose a vector v uniformly at random from the unit sphere S^{d−1}
in R^d. Then Pr[|v · x| < |v · y|] ≥ 1/2 + γ/5.
Proof. It is enough to think about the 2-dimensional space spanned by x and y. Let
r = ‖x‖/‖y‖ ≤ 1/(1 + γ). Suppose the angle between v and x is ϕ and the angle between x and y is θ.
Then |v · x| ≥ |v · y| exactly when cos^2(θ − ϕ) ≤ r^2 cos^2 ϕ. This is the bad outcome for us, and
hence we want to upper bound the probability of this event. It is easily seen that the worst
case occurs when the vectors x and y are orthogonal to each other, i.e. θ = π/2: the
probability of the bad event increases as θ goes from 0 to π/2 and decreases again from π/2 to
π, and so on. For θ = π/2 the bad event is cos^2(π/2 − ϕ) = sin^2 ϕ ≤ r^2 cos^2 ϕ, so we need to
upper bound Pr[|tan ϕ| ≤ r]. Hence Pr[bad event] = Pr[|tan ϕ| ≤ r] = (2 tan^{−1} r)/π < 1/2, as can
be seen from Figure 4. Using a Taylor expansion of tan^{−1} r, one can show that this probability is
at most 1/2 − γ/5. Hence Pr[bad event] < 1/2 − γ/5, and the lemma follows.
[Figure 4: the graph of tan^{−1} r.]
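As a sanity check (not part of the notes), Lemma 1.2 can be tested by simulation: estimate Pr[|v · x| < |v · y|] over many random unit vectors and compare it with the bound 1/2 + γ/5. The particular x, y, γ below are arbitrary choices satisfying ‖x‖(1 + γ) ≤ ‖y‖.

# Monte Carlo estimate of Pr[|v.x| < |v.y|] for v uniform on S^{d-1},
# compared against the lower bound 1/2 + gamma/5 from Lemma 1.2.
import numpy as np

rng = np.random.default_rng(3)
d, gamma = 20, 0.3
x = rng.standard_normal(d)
y = rng.standard_normal(d)
y *= (1 + gamma) * np.linalg.norm(x) / np.linalg.norm(y)   # now ||y|| = (1+gamma)||x||

trials = 100_000
V = rng.standard_normal((trials, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # each row is uniform on S^{d-1}
estimate = np.mean(np.abs(V @ x) < np.abs(V @ y))
print(estimate, 0.5 + gamma / 5)   # the estimate should be at least the bound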
Corollary 1.3 Given x and y as above, the set W_{x,y} of vectors from S^{d−1} that give rise to the bad
event (i.e. the projection of x exceeds the projection of y, under the conditions of the above
lemma) is a wedge bounded by hyperplanes, of probability measure < 1/2 − γ/5.
Definition 1.4 A distinguishing set is a finite set V of points on the unit sphere in R^d such
that no wedge W_{x,y} of measure < 1/2 − γ/5 (for any x and y producing such a wedge) contains at
least half of V.
The point of a distinguishing set is that V gives a correct length comparison, by majority vote,
for any x, y ∈ R^d whose lengths differ by a factor of at least (1 + γ).
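A minimal sketch of this majority-vote comparison, assuming numpy; here V is simply a random sample of unit vectors rather than a certified distinguishing set, so correctness is only probabilistic.

# Majority vote over a set V of unit vectors: declare ||x|| < ||y|| iff
# |v.x| < |v.y| for more than half of the v in V. A true distinguishing set
# makes this deterministic for norms differing by a (1+gamma) factor; here
# V is only a random sample, so the conclusion holds with high probability.
import numpy as np

def shorter_by_majority(V, x, y):
    votes = np.abs(V @ x) < np.abs(V @ y)   # one vote per vector in V
    return votes.mean() > 0.5

rng = np.random.default_rng(4)
d = 30
V = rng.standard_normal((501, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # 501 random unit vectors

x = rng.standard_normal(d)
y = rng.standard_normal(d)
y *= 1.5 * np.linalg.norm(x) / np.linalg.norm(y)   # ||y|| = 1.5 ||x||
print(shorter_by_majority(V, x, y))                # expected: True (x is shorter)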
Question: How big must V be? This is actually a VC-dimension question. The ground set
here is the unit sphere S^{d−1}, and the concept class C consists of all the wedges W_{x,y}.
Fact: An ε-sample for the infinite set system (S^{d−1}, wedges) with ε = γ/5 is a distinguishing
set. This is clear from the definitions of distinguishing set and ε-sample. So, if d′ = VC-
dim(S^{d−1}, wedges), then we can take a random sample V from S^{d−1} of size
|V| = O((d′/γ^2) log(d′/γ) + (1/γ^2) log(1/δ)),
since such a sample forms a (γ/5)-sample with probability ≥ 1 − δ. This bound is in terms of
d′, however, so we need to bound d′.
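If one wants a concrete number out of this bound, the helper below simply evaluates |V| = C · ((d′/γ^2) log(d′/γ) + (1/γ^2) log(1/δ)); the multiplicative constant C is a placeholder assumed for this sketch, since the O(·) does not specify it.

# Evaluate the epsilon-sample size bound
#   |V| = O((d'/gamma^2) log(d'/gamma) + (1/gamma^2) log(1/delta)).
# The multiplicative constant C is a placeholder, not given by the O(.).
import math

def sample_size(d_prime, gamma, delta, C=8.0):
    term1 = (d_prime / gamma**2) * math.log(d_prime / gamma)
    term2 = (1 / gamma**2) * math.log(1 / delta)
    return math.ceil(C * (term1 + term2))

print(sample_size(d_prime=100, gamma=0.1, delta=0.01))   # example parameters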
To prove this, notice that every wedge is the result of a Boolean function of four halfspaces,
specifically the function f(A1, A2, A3, A4) which takes four halfspaces and produces (A1 ∩
A2) ∪ (A3 ∩ A4) (the Boolean operations here are ∩ and ∪); indeed, |v · x| ≥ |v · y| holds exactly
when v · (x − y) and v · (x + y) are both ≥ 0 or both ≤ 0. Since halfspaces in R^d have
VC-dimension d + 1, Claim 1.5 (which bounds d′ by O(d log d)) results from the following lemma:
Lemma 1.6 Let f be a Boolean function on h inputs, each input a set. Let (U, R) be
a set system of VC-dimension d. Let (U, R_f) be the new set system, where
R_f = { f(R1, . . . , Rh) : R1, . . . , Rh ∈ R }. Then the VC-dimension of (U, R_f) is O(dh log(dh)) if h = O(d).