1 Applications of Nearest Neighbor
Example 1.2 This example relates to the vector space model of information retrieval developed by
Salton. Suppose we have a reasonable vocabulary of the English language. We represent documents
in English by a vector with one coordinate for each word in the vocabulary; document i is mapped to
a vector v_i ∈ R^d. Typically d is of the order of 50K to 100K for a reasonable information retrieval
system. The j-th coordinate of v_i stores the number of times the j-th word from the vocabulary
appears in the document. Alternatively, we may take the vectors to be Boolean, the j-th coordinate
recording whether the j-th word appears in the document at all. In either case, the distance between
two vectors under the L1 or L2 norm gives us some idea of the similarity between the corresponding
documents (a small distance meaning similar documents).
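As a quick illustration (not from the original notes), here is a minimal sketch of this bag-of-words representation, assuming numpy and a toy five-word vocabulary; a real system would use a vocabulary of 50K to 100K words.

# Bag-of-words sketch: map documents to word-count vectors over a fixed
# vocabulary and compare them under the L1/L2 norms (toy data, not real).
import numpy as np

vocabulary = ["nearest", "neighbor", "search", "document", "vector"]
index = {word: j for j, word in enumerate(vocabulary)}

def to_vector(doc):
    v = np.zeros(len(vocabulary))
    for word in doc.lower().split():
        if word in index:          # words outside the vocabulary are ignored
            v[index[word]] += 1    # j-th coordinate counts the j-th word
    return v

d1 = to_vector("nearest neighbor search")
d2 = to_vector("document vector search")
print(np.linalg.norm(d1 - d2))     # L2 distance between the two documents
print(np.linalg.norm(d1 - d2, 1))  # L1 distance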
[Figure: a set of points p1, . . . , p5 in the plane and a query point q.]
The Voronoi diagram of the points p_1, . . . , p_n partitions R^2 into n cells (closed or open polygons)
such that the point p_i lies in cell i and all points of R^2 lying in cell i are nearer to p_i than to any
other point from the given set. It is known that Voronoi diagrams in R^2 can be computed in time
O(n log n) and require linear storage. Given a query point, we then only need to find which cell it
lies in; this is a planar point-location problem, which can be solved in O(log n) query time with linear
storage (for example, via Kirkpatrick's hierarchical decomposition).
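For concreteness, the sketch below answers planar nearest-neighbor queries with scipy's k-d tree rather than the Voronoi-plus-point-location method just described; it returns the same answer (the site whose Voronoi cell contains the query), but the data structure and guarantees differ. The use of scipy.spatial.KDTree is an assumption of this sketch, not part of the notes.

# Planar nearest-neighbor queries with a k-d tree (scipy.spatial.KDTree)
# instead of the Voronoi diagram + point location described above.
# The nearest site returned is, by definition, the site whose Voronoi
# cell contains the query point.
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
points = rng.random((1000, 2))   # the given set p_1, ..., p_n in R^2
tree = KDTree(points)            # preprocessing step

q = np.array([0.5, 0.5])         # query point
dist, i = tree.query(q)          # distance to and index of the nearest p_i
print(i, dist)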
[Figure: the Voronoi diagram of six points p1, . . . , p6 in the plane.]
This technique does not scale well to higher dimensions. For larger d, Clarkson (1987)
improved on a long sequence of previous work with an algorithm for nearest-neighbor search
that has preprocessing and storage requirement O(n^{⌈d/2⌉(1+ε)}). The query time is O(c^d log n)
for some constant c. As d gets near or as high as log n, this query time is not really doing better
than the brute-force search algorithm, which takes no preprocessing time, linear storage, and
query time linear in n.
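For reference, the brute-force baseline mentioned above is a single linear scan; a minimal numpy sketch (the point set and dimensions below are arbitrary):

# Brute-force nearest neighbor: no preprocessing, linear storage,
# and O(dn) work per query -- the baseline referred to above.
import numpy as np

def brute_force_nn(points, q):
    # points has shape (n, d); return the index of the point closest to q.
    dists = np.linalg.norm(points - q, axis=1)
    return int(np.argmin(dists))

rng = np.random.default_rng(1)
P = rng.standard_normal((10000, 50))   # n = 10000 points in d = 50 dimensions
q = rng.standard_normal(50)
print(brute_force_nn(P, q))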
[Figure: vectors x and y with angle θ between them, and a random unit vector v making angle ϕ with x.]
Goal. Suppose we are willing to pay high preprocessing and storage costs. Can we then
achieve a query time of O(poly(1/ε) · poly(d) · log n)?
It turns out that this is possible using an idea based on random projections (Kleinberg
1997). The intuition behind the algorithm is that the relative distances between the query point
and the points of the initial set P should preserve their relation under projection onto random vectors.
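The lemma below draws a vector v uniformly at random from the unit sphere S^{d−1}. A standard way to generate such a vector, sketched here with numpy (illustrative, not part of the original notes), is to normalize a vector of independent Gaussians:

# Sample a uniformly random unit vector from S^{d-1}: a standard Gaussian
# vector is rotationally symmetric, so normalizing it gives a uniform direction.
import numpy as np

def random_unit_vector(d, rng):
    g = rng.standard_normal(d)
    return g / np.linalg.norm(g)

rng = np.random.default_rng(2)
v = random_unit_vector(3, rng)
print(v, np.linalg.norm(v))   # the norm is 1 up to floating-point error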
Lemma 1.2 Suppose 0 < γ ≤ 1/2 and x, y ∈ R^d are such that ‖x‖(1 + γ) ≤ ‖y‖ (i.e. ‖x‖ is
sufficiently smaller than ‖y‖). Now choose a vector v uniformly at random from the unit sphere S^{d−1}
in R^d. Then Pr[|v · x| < |v · y|] ≥ 1/2 + γ/5.
Proof. It is enough to think about the 2-dimensional space spanned by x and y. Let
r = ‖x‖/‖y‖ ≤ 1/(1 + γ). Suppose the angle between v and x is ϕ and the angle between x and y is θ.
Then |v · x| ≥ |v · y| exactly when cos^2(θ − ϕ) ≤ r^2 cos^2 ϕ. This is the bad outcome for us, and
hence we want to upper bound the probability of this event. It is easily seen that the worst
case occurs when the vectors x and y are orthogonal to each other, i.e. θ = π/2: the
probability of the bad event increases as θ goes from 0 to π/2 and decreases again from π/2 to
π, and so on. For θ = π/2 the bad event is cos^2(π/2 − ϕ) = sin^2 ϕ ≤ r^2 cos^2 ϕ, so we need to
upper bound Pr[|tan ϕ| ≤ r]. Hence Pr[bad event] = Pr[|tan ϕ| ≤ r] = (2 tan^{−1} r)/π < 1/2, as can
be seen from Figure 4. Using a Taylor expansion of tan^{−1} r, one can show that this probability is
at most 1/2 − γ/5. Hence Pr[bad event] < 1/2 − γ/5, and the lemma follows.
[Figure 4: the graph of tan^{−1} r.]
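As a sanity check (not part of the notes), Lemma 1.2 can be tested by simulation: estimate Pr[|v · x| < |v · y|] over many random unit vectors and compare it with the bound 1/2 + γ/5. The particular x, y, γ below are arbitrary choices satisfying ‖x‖(1 + γ) ≤ ‖y‖.

# Monte Carlo estimate of Pr[|v.x| < |v.y|] for v uniform on S^{d-1},
# compared against the lower bound 1/2 + gamma/5 from Lemma 1.2.
import numpy as np

rng = np.random.default_rng(3)
d, gamma = 20, 0.3
x = rng.standard_normal(d)
y = rng.standard_normal(d)
y *= (1 + gamma) * np.linalg.norm(x) / np.linalg.norm(y)   # now ||y|| = (1+gamma)||x||

trials = 100_000
V = rng.standard_normal((trials, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # each row is uniform on S^{d-1}
estimate = np.mean(np.abs(V @ x) < np.abs(V @ y))
print(estimate, 0.5 + gamma / 5)   # the estimate should be at least the bound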
Corollary 1.3 Given x and y as above, the set W_{x,y} of vectors from S^{d−1} that give rise to the bad
event (i.e. the projection of x exceeds the projection of y, under the conditions of the above
lemma) is a wedge bounded by hyperplanes, of probability measure < 1/2 − γ/5.
Definition 1.4 A distinguishing set is a finite set V of points on the unit sphere in R^d such
that no wedge W_{x,y} of measure < 1/2 − γ/5 (for any x and y producing such a wedge) contains at
least half of V.
The point of a distinguishing set is that V gives a correct length comparison, by majority vote,
for any x, y ∈ R^d whose lengths differ by a factor of at least (1 + γ).
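A minimal sketch of this majority-vote comparison, assuming numpy; here V is simply a random sample of unit vectors rather than a certified distinguishing set, so correctness is only probabilistic.

# Majority vote over a set V of unit vectors: declare ||x|| < ||y|| iff
# |v.x| < |v.y| for more than half of the v in V. A true distinguishing set
# makes this deterministic for norms differing by a (1+gamma) factor; here
# V is only a random sample, so the conclusion holds with high probability.
import numpy as np

def shorter_by_majority(V, x, y):
    votes = np.abs(V @ x) < np.abs(V @ y)   # one vote per vector in V
    return votes.mean() > 0.5

rng = np.random.default_rng(4)
d = 30
V = rng.standard_normal((501, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # 501 random unit vectors

x = rng.standard_normal(d)
y = rng.standard_normal(d)
y *= 1.5 * np.linalg.norm(x) / np.linalg.norm(y)   # ||y|| = 1.5 ||x||
print(shorter_by_majority(V, x, y))                # expected: True (x is shorter)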
Question: How big must V be? This is actually a VC-dimension question. The ground set
here is the unit sphere S^{d−1}, and the concept class C consists of all the wedges W_{x,y}.
Fact: An ε-sample for the infinite set system (S^{d−1}, wedges) with ε = γ/5 is a distinguishing
set. This is clear from the definitions of distinguishing set and ε-sample. So, if d′ = VC-
dim(S^{d−1}, wedges), then we can take a random sample V from S^{d−1} of size
|V| = O((d′/γ^2) log(d′/γ) + (1/γ^2) log(1/δ)),
since such a sample forms a (γ/5)-sample with probability ≥ 1 − δ. This bound is in terms of
d′, however, so we need to bound d′.
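If one wants a concrete number out of this bound, the helper below simply evaluates |V| = C · ((d′/γ^2) log(d′/γ) + (1/γ^2) log(1/δ)); the multiplicative constant C is a placeholder assumed for this sketch, since the O(·) does not specify it.

# Evaluate the epsilon-sample size bound
#   |V| = O((d'/gamma^2) log(d'/gamma) + (1/gamma^2) log(1/delta)).
# The multiplicative constant C is a placeholder, not given by the O(.).
import math

def sample_size(d_prime, gamma, delta, C=8.0):
    term1 = (d_prime / gamma**2) * math.log(d_prime / gamma)
    term2 = (1 / gamma**2) * math.log(1 / delta)
    return math.ceil(C * (term1 + term2))

print(sample_size(d_prime=100, gamma=0.1, delta=0.01))   # example parameters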
To prove this, notice that every wedge is the result of a Boolean function of four halfspaces,
specifically the function f(A1, A2, A3, A4) which takes four halfspaces and produces (A1 ∩
A2) ∪ (A3 ∩ A4) (the Boolean operations here are ∩ and ∪); indeed, |v · x| ≥ |v · y| holds exactly
when v · (x − y) and v · (x + y) are both ≥ 0 or both ≤ 0. Since halfspaces in R^d have
VC-dimension d + 1, Claim 1.5 (which bounds d′ by O(d log d)) results from the following lemma:
Lemma 1.6 Let f be a Boolean function on h inputs, each input a set. Let (U, R) be
a set system of VC-dimension d. Let (U, R_f) be the new set system, where
R_f = { f(R1, . . . , Rh) : R1, . . . , Rh ∈ R }. Then the VC-dimension of (U, R_f) is O(dh log(dh)) if h = O(d).