Random Projection, Margins, Kernels, and Feature-Selection
Avrim Blum
1 Introduction
Random projection is a technique that has found substantial use in the area
of algorithm design (especially approximation algorithms), by allowing one to
substantially reduce the dimensionality of a problem while still retaining a significant
degree of problem structure. In particular, given n points in Euclidean space
(of any dimension, but which we can think of as R^n), we can project these
points down to a random d-dimensional subspace for d ≪ n, with the following
outcomes:

1. If d is moderately large (roughly logarithmic in the number of points), then with high probability all pairwise distances and angles are approximately preserved.
2. Even if d is very small (even d = 1, i.e., projection onto a single random line), enough structure is preserved in a probabilistic sense for the projection to be algorithmically useful.
Projections of the first type have had a number of uses including fast approxi-
mate nearest-neighbor algorithms [IM98, EK00] and approximate clustering al-
gorithms [Sch00] among others. Projections of the second type are often used
for “rounding” a semidefinite-programming relaxation, such as for the Max-CUT
problem [GW95], and have been used for various graph-layout problems [Vem98].
The purpose of this survey is to describe some ways that this technique can
be used (either practically, or for providing insight) in the context of machine
learning. In particular, random projection can provide a simple way to see why
data that is separable by a large margin is easy for learning even if data lies
in a high-dimensional space (e.g., because such data can be randomly projected
down to a low dimensional space without affecting separability, and therefore
it is “really” a low-dimensional problem after all). It can also suggest some
especially simple algorithms. In addition, random projection (of various types)
can be used to provide an interesting perspective on kernel functions, and also
provide a mechanism for converting a kernel function into an explicit feature
space.
The use of Johnson-Lindenstrauss type results in the context of learning was
first proposed by Arriaga and Vempala [AV99], and a number of uses of random
projection in learning are discussed in [Vem04]. Experimental work on using
random projection has been performed in [FM03, GBN05, Das00]. This survey,
in addition to background material, focuses primarily on work in [BB05, BBV04].
Except in a few places (e.g., Theorem 1, Lemma 1) we give only sketches and
basic intuition for proofs, leaving the full proofs to the papers cited.
We begin with some definitions.
Definition 1. We say that a set S of labeled examples is linearly separable by margin γ if there
exists a unit-length vector w such that every labeled example (x, ℓ) ∈ S satisfies
ℓ(w · x)/||x|| ≥ γ; that is, the cosine of the angle between w and x has
magnitude at least γ. For simplicity, we are only considering separators that
pass through the origin, though the results we discuss can be adapted to the general
case as well.
We can similarly talk in terms of the distribution P rather than a sample S.
Definition 2. We say that P is linearly separable by margin γ if there
exists a unit-length vector w such that
    Pr_{(x,ℓ)∼P}[ ℓ(w · x)/||x|| < γ ] = 0,
and we say that P is linearly separable with error α at margin γ if this probability is at most α.
We now describe a particularly simple algorithm, based on projection to a random
1-dimensional space, that is guaranteed to weak-learn with error rate at most
1/2 − γ/4 whenever a learning problem is linearly separable by some
margin γ. This can then be plugged into boosting [Sch90, FS97] to achieve
strong learning. This material is taken from [BB05].
In particular, the algorithm is as follows.

Algorithm 1
1. Pick a random unit-length vector h, defining the hypothesis sign(h · x).
2. Evaluate the error rate of this hypothesis; if err(h) ≤ 1/2 − γ/4, halt and output h, else go back to Step 1.

The analysis in [BB05] shows that, because every example has margin at least γ
with respect to the target separator w∗, a random h conditioned on h · w∗ ≥ 0
has expected error at most 1/2 − Ω(γ).
Finally, since the error rate of any hypothesis is bounded between 0 and 1, and
a random vector h has a 1/2 chance of satisfying h · w∗ ≥ 0, it must be the case
that:
    Pr_h[err(h) ≤ 1/2 − γ/4] = Ω(γ).
² For simplicity, we have presented Algorithm 1 as if it can exactly evaluate the true
error of its chosen hypothesis in Step 2. Technically, we should change Step 2 to
talk about empirical error, using an intermediate value such as 1/2 − γ/6. In that
case, a sample of size O((1/γ²) log(1/γ)) is sufficient to be able to run Algorithm 1
for O(1/γ) repetitions, and to evaluate the error rate of each hypothesis produced
to sufficient precision.
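To make this concrete, here is a minimal sketch of Algorithm 1 in Python/NumPy (not from the paper). The data layout (a matrix X of examples with ±1 labels y), the cap on the number of repetitions, and the fallback when no vector passes the test are assumptions of the example; the empirical-error threshold 1/2 − γ/6 follows the footnote above.

```python
import numpy as np

def weak_learn_by_random_projection(X, y, gamma, max_tries=1000, rng=None):
    """Minimal sketch of Algorithm 1: keep trying random unit vectors h until
    the hypothesis sign(h . x) has empirical error at most 1/2 - gamma/6.

    X : (m, n) array of examples;  y : length-m array of +/-1 labels.
    Assumes the sample is linearly separable by margin gamma, so each try
    succeeds with probability Omega(gamma)."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[1]
    best_h, best_err = None, float("inf")
    for _ in range(max_tries):
        h = rng.standard_normal(n)
        h /= np.linalg.norm(h)                 # Step 1: a random unit-length vector
        err = np.mean(np.sign(X @ h) != y)     # Step 2: empirical error of sign(h . x)
        if err < best_err:
            best_h, best_err = h, err
        if err <= 0.5 - gamma / 6.0:           # good enough to be a weak hypothesis
            return h
    return best_h                              # fallback: best hypothesis seen
```

In the boosting application mentioned above, this routine would simply be re-run on each reweighted sample produced by the booster.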
[Figure: the target vector w∗, a point x, the angle α, and the two halfspaces h · x ≥ 0 and h · x ≤ 0 determined by a random hypothesis h.]
nearly independent (in fact, in forms (2) and (3) they are completely independent
random variables). This allows us to use a Chernoff-style bound to argue that
d = O((1/γ²) log(1/δ)) is sufficient so that with probability 1 − δ, the quantity
y1² + · · · + yd² will be within 1 ± γ of its expectation. This in turn implies that
the length of y is within 1 ± γ of its expectation. Finally, using δ = o(1/n²), we
have by the union bound that with high probability this is satisfied simultaneously
for all pairs of points pi, pj in the input.
Formally, here is a convenient form of the Johnson-Lindenstrauss Lemma
given in [AV99]. Let N(0, 1) denote the standard Normal distribution with mean
0 and variance 1, and let U(−1, 1) denote the distribution that has probability 1/2
on −1 and probability 1/2 on 1.

Theorem 2 ([AV99]). Let u, v ∈ R^n, and let u′ = (1/√d) uA and v′ = (1/√d) vA,
where A is an n × d matrix whose entries are chosen independently from either
N(0, 1) or U(−1, 1). Then
    Pr[ (1 − ε)||u − v||² ≤ ||u′ − v′||² ≤ (1 + ε)||u − v||² ] ≥ 1 − 2e^{−(ε²−ε³)d/4}.
³ The Johnson-Lindenstrauss Lemma talks about relative distances being approximately
preserved, but it is a straightforward calculation to show that this implies
angles must be approximately preserved as well.
Theorem 3. If P is linearly separable by margin γ, then d = O((1/γ²) log(1/(εδ))) is
sufficient so that with probability at least 1 − δ, a random projection down to R^d
will be linearly separable with error at most ε at margin γ/2.
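As an illustration of Theorem 3, the following sketch (our own, not from the paper) generates unit-length synthetic data that is separable by margin γ, projects it through a matrix of i.i.d. N(0, 1) entries scaled by 1/√d, and measures how many points retain margin γ/2. The synthetic data model and the constant in the choice of d are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma, eps, delta = 1000, 2000, 0.5, 0.05, 0.05

# Unit-length target separator w_star, and unit-length examples with margin >= gamma.
w_star = rng.standard_normal(n)
w_star /= np.linalg.norm(w_star)
Z = rng.standard_normal((m, n))
Z -= np.outer(Z @ w_star, w_star)                 # components orthogonal to w_star
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
y = rng.choice([-1.0, 1.0], size=m)
t = rng.uniform(gamma, 1.0, size=m)               # per-example margins in [gamma, 1]
X = (y * t)[:, None] * w_star + np.sqrt(1.0 - t**2)[:, None] * Z

# Random projection to R^d with i.i.d. N(0,1) entries, scaled by 1/sqrt(d).
d = int(np.ceil((4.0 / gamma**2) * np.log(1.0 / (eps * delta))))   # constant 4 is arbitrary
A = rng.standard_normal((n, d)) / np.sqrt(d)
Xp, wp = X @ A, w_star @ A
wp /= np.linalg.norm(wp)

# Fraction of projected points that still have margin at least gamma/2.
margins = y * (Xp @ wp) / np.linalg.norm(Xp, axis=1)
print(f"d = {d}, fraction with margin >= gamma/2: {np.mean(margins >= gamma / 2):.3f}")
```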
Let w be the target large-margin separator in the φ-space, and for a set S of
examples let w_in(S) denote the projection of w onto span(S) (identifying each
example with its image under φ) and let w_out(S) = w − w_in(S). If w_out(S) is
small, in the sense that |w_out(S) · z| ≤ γ/2 for all but an ε probability mass of
points z, then w_in(S) has the property we want: there is at most an ε probability
mass of points z whose dot-product with w and with w_in(S) differ by more than
γ/2. So, we need only to consider what happens when w_out(S) is large.
The crux of the proof now is that if w_out(S) is large, this means that a
new random point z has at least an ε chance of significantly improving the
set S. Specifically, consider z such that |w_out(S) · z| > γ/2. Let z_in(S) be the
projection of z to span(S), let z_out(S) = z − z_in(S) be the portion of z orthogonal
to span(S), and let z′ = z_out(S)/||z_out(S)||. Now, for S′ = S ∪ {z}, we have
w_out(S′) = w_out(S) − proj(w_out(S), span(S′)) = w_out(S) − (w_out(S) · z′)z′, where
the last equality holds because w_out(S) is orthogonal to span(S) and so its
projection onto span(S′) is the same as its projection onto z′. Finally, since
w_out(S′) is orthogonal to z′ we have ||w_out(S′)||² = ||w_out(S)||² − |w_out(S) · z′|²,
and since |w_out(S) · z′| ≥ |w_out(S) · z_out(S)| = |w_out(S) · z|, this implies by
definition of z that ||w_out(S′)||² < ||w_out(S)||² − (γ/2)².
So, we have a situation where so long as w_out is large, each example has
at least an ε chance of reducing ||w_out||² by at least γ²/4, and since ||w||² =
||w_out(∅)||² = 1, this can happen at most 4/γ² times. Chernoff bounds state
that a coin of bias ε flipped n = (8/ε)[1/γ² + ln(1/δ)] times will, with probability
1 − δ, have at least nε/2 ≥ 4/γ² heads. Together, these imply that with probability
at least 1 − δ, w_out(S) will be small for |S| ≥ (8/ε)[1/γ² + ln(1/δ)], as desired.
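The following small simulation is an illustration of this potential argument under assumed settings: it works in an explicit feature space rather than through a kernel, draws unit-length examples whose dot-product with the target w has magnitude at least γ, and tracks how ||w_out(S)||² shrinks as examples are added.

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 200, 0.3
w = rng.standard_normal(n)
w /= np.linalg.norm(w)                                   # unit-length target separator

def draw_example():
    """A unit-length example whose dot-product with w has magnitude >= gamma."""
    u = rng.standard_normal(n)
    u -= (u @ w) * w
    u /= np.linalg.norm(u)                               # random direction orthogonal to w
    t = rng.uniform(gamma, 1.0) * rng.choice([-1.0, 1.0])
    return t * w + np.sqrt(1.0 - t**2) * u

Q = np.zeros((n, 0))                                     # orthonormal basis of span(S)
for _ in range(500):
    w_out = w - Q @ (Q.T @ w)                            # part of w orthogonal to span(S)
    z = draw_example()
    if abs(w_out @ z) > gamma / 2:                       # z "significantly improves" S
        z_out = z - Q @ (Q.T @ z)                        # part of z orthogonal to span(S)
        Q = np.column_stack([Q, z_out / np.linalg.norm(z_out)])

w_out = w - Q @ (Q.T @ w)
# Each accepted z reduced ||w_out(S)||^2 by more than (gamma/2)^2, starting from 1,
# so the number of accepted examples can never exceed 4/gamma^2.
print(f"||w_out(S)||^2 = {np.linalg.norm(w_out)**2:.3f}, "
      f"|S| = {Q.shape[1]} (bound: 4/gamma^2 = {4/gamma**2:.0f})")
```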
By the argument above, with high probability there exists a low-error large-margin
separator lying in the span of the φ-images of a random sample x1, . . . , xd, i.e., a
separator of the form
    w = α1 φ(x1) + · · · + αd φ(xd).
However, this by itself gives us no control over the length of
the separator in the new space, or the length of the examples F1(x). The key
problem is that if many of the φ(xi) are very similar, then their associated
features K(x, xi) will be highly correlated. Instead, to preserve margin we want
to choose an orthonormal basis of the space spanned by the φ(xi): i.e., to do an
orthogonal projection of φ(x) into this space. Specifically, let S = {x1, . . . , xd} be
a set of (8/ε)[1/γ² + ln(1/δ)] unlabeled examples from D as in Corollary 1. We can then
implement the desired orthogonal projection of φ(x) as follows. Run K(x, y)
for all pairs x, y ∈ S, and let M(S) = (K(xi, xj))_{xi,xj∈S} be the resulting kernel
matrix. Now decompose M(S) into U^T U, where U is an upper-triangular matrix.
Finally, define the mapping F2 : X → R^d to be F2(x) = F1(x)U^{-1}, where F1
is the mapping of Corollary 1. This is equivalent to an orthogonal projection of
φ(x) into span(φ(x1), . . . , φ(xd)). Technically, if U is not full rank then we want
to use the (Moore-Penrose) pseudoinverse [BIG74] of U in place of U^{-1}.
By Lemma 1, this mapping F2 maintains approximate separability at margin
γ/2 (see [BBV04] for a full proof).
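Here is a sketch of the mappings F1 and F2 in Python/NumPy. It is not from the paper: the RBF kernel, the sample size d = 50, and the small diagonal jitter are assumptions of the example. The decomposition M(S) = U^T U is computed with a Cholesky factorization, and the pseudoinverse of U is used as described above.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)); stands in for any black-box kernel."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma**2))

rng = np.random.default_rng(2)
S = rng.standard_normal((50, 10))       # the sample x_1, ..., x_d drawn from D (d = 50 here)
Xnew = rng.standard_normal((5, 10))     # points to be mapped

# F1(x) = (K(x, x_1), ..., K(x, x_d)): one kernel evaluation per sample point.
F1 = rbf_kernel(Xnew, S)

# Kernel matrix M(S) = (K(x_i, x_j)) and a decomposition M(S) = U^T U.
M = rbf_kernel(S, S)
L = np.linalg.cholesky(M + 1e-10 * np.eye(len(S)))      # tiny jitter for numerical safety
U = L.T                                                  # upper-triangular, U^T U = M
# F2(x) = F1(x) U^{-1}; the pseudoinverse covers the rank-deficient case.
F2 = F1 @ np.linalg.pinv(U)

# Check: dot-products of F2 images equal dot-products of the orthogonal projections
# of phi(x) onto span(phi(x_1), ..., phi(x_d)), namely k_x^T M^{-1} k_y.
print(F2.shape, np.allclose(F2 @ F2.T, F1 @ np.linalg.pinv(M) @ F1.T, atol=1e-6))
```

The final check confirms that dot-products of F2-images agree with dot-products of the orthogonal projections of φ(x) onto span(φ(x1), . . . , φ(xd)), consistent with F2 implementing the desired orthogonal projection.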
Recall from Theorem 3 that if we have access to the φ-space, then no matter what the distribution is, a
random projection down to R^d will approximately preserve the existence of a
large-margin separator with high probability. So perhaps such a mapping F
can be produced by just computing K on some polynomial number of cleverly-chosen
(or uniform random) points in X. (Let us assume X is a “nice” space
such as the unit ball or {0, 1}^n that can be randomly sampled.) In this section,
we show this is not possible in general for an arbitrary black-box kernel. This
leaves open, however, the case of specific natural kernels.
One way to view the result of this section is as follows. If we define a fea-
ture space based on dot-products with uniform or gaussian-random points in
the φ-space, then we know this will work by the Johnson-Lindenstrauss lemma.
However, this requires explicit access to the φ-space. Alternatively, using Corol-
lary 1 we can define features based on dot-products with points φ(x) for x ∈ X,
which only requires implicit access to the φ-space through the kernel. However,
this procedure needs to use D to select the points x. What we show here is that
such use of D is necessary: if we define features based on points φ(x) for uniform
random x ∈ X, or any other distribution that does not depend on D, then there
will exist kernels for which this does not work.
We demonstrate the necessity of access to D as follows. Consider X = {0, 1}^n,
let X′ be a random subset of 2^{n/2} elements of X, and let D be the uniform
distribution on X′. For a given target function c, we will define a special φ-function
φc such that c is a large-margin separator in the φ-space under distribution D,
but that only the points in X′ behave nicely, and points not in X′ provide no
useful information. Specifically, consider φc : X → R^2 defined as:
    φc(x) = (1, 0)              if x ∉ X′
    φc(x) = (−1/2, √3/2)        if x ∈ X′ and c(x) = 1
    φc(x) = (−1/2, −√3/2)       if x ∈ X′ and c(x) = −1

Notice that the distribution P = (D, c) over labeled examples has margin γ =
√3/2 in the φ-space: for instance, the unit-length separator w = (0, 1) classifies
every point in the support of D correctly with margin exactly √3/2.
[Figure: the three possible values of φc(x), at 120° to one another, labeled “x not in X′”, “x in X′, c(x) = 1”, and “x in X′, c(x) = −1”.]
Conclusions and Open Problems

This survey has examined ways in which random projection (of various forms)
can provide algorithms for, and insight into, problems in machine learning. For
example, if a learning problem is separable by a large margin γ, then a random
projection to a space of dimension O((1/γ²) log(1/δ)) will with high probability
approximately preserve separability, so we can think of the problem as really
an O(1/γ²)-dimensional problem after all. In addition, we saw that just picking
a random separator (which can be thought of as projecting to a random
1-dimensional space) has a reasonable chance of producing a weak hypothesis.
We also saw how given black-box access to a kernel function K and a distri-
bution D (i.e., unlabeled examples) we can use K and D together to construct a
new low-dimensional feature space in which to place the data that approximately
preserves the desired properties of the kernel. Thus, through this mapping, we
can think of a kernel as in some sense providing a distribution-dependent feature
space. One interesting aspect of the simplest method considered, namely choos-
ing x1 , . . . , xd from D and then using the mapping x → (K(x, x1 ), . . . , K(x, xd )),
is that it can be applied to any generic “similarity” function K(x, y), even those
that are not necessarily legal kernels and do not necessarily have the same inter-
pretation as computing a dot-product in some implicit φ-space. Recent results
of [BB06] extend some of these guarantees to this more general setting.
One concrete open question is whether, for natural standard kernel functions,
one can produce mappings F : X → Rd in an oblivious manner, without using
examples from the data distribution. The Johnson-Lindenstrauss lemma tells us
that such mappings exist, but the goal is to produce them without explicitly
computing the φ-function. Barring that, perhaps one can at least reduce the
unlabeled sample-complexity of our approach. On the practical side, it would
be interesting to further explore the alternatives that these (or other) mappings
provide to widely used algorithms such as SVM and Kernel Perceptron.
Acknowledgements
Much of this was based on joint work with Maria-Florina (Nina) Balcan and San-
tosh Vempala. Thanks also to the referees for their helpful comments. This work
was supported in part by NSF grants CCR-0105488, NSF-ITR CCR-0122581,
and NSF-ITR IIS-0312814.
References

[AV99] R. I. Arriaga and S. Vempala. An algorithmic theory of learning: Robust
concepts and random projection. In Proceedings of the 40th Annual IEEE
Symposium on Foundations of Computer Science, 1999.
[BB05] M-F. Balcan and A. Blum. A PAC-style model for learning from labeled
and unlabeled data. In Proceedings of the 18th Annual Conference on
Computational Learning Theory (COLT), pages 111–126, 2005.
[BB06] M-F. Balcan and A. Blum. On a theory of kernels as similarity functions.
Manuscript, 2006.
[BBV04] M.F. Balcan, A. Blum, and S. Vempala. Kernels as features: On kernels,
margins, and low-dimensional mappings. In 15th International Conference
on Algorithmic Learning Theory (ALT ’04), pages 194–205, 2004. An ex-
tended version is available at http://www.cs.cmu.edu/~avrim/Papers/.
[BGV92] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for
optimal margin classifiers. In Proceedings of the Fifth Annual Workshop
on Computational Learning Theory, 1992.
[BIG74] A. Ben-Israel and T.N.E. Greville. Generalized Inverses: Theory and
Applications. Wiley, New York, 1974.
[Blo62] H.D. Block. The perceptron: A model for brain functioning. Reviews
of Modern Physics, 34:123–135, 1962. Reprinted in Neurocomputing,
Anderson and Rosenfeld.
[BST99] P. Bartlett and J. Shawe-Taylor. Generalization performance of support
vector machines and other pattern classifiers. In Advances in Kernel
Methods: Support Vector Learning. MIT Press, 1999.
[CV95] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning,
20(3):273–297, 1995.
[Das00] S. Dasgupta. Experiments with random projection. In Proceedings of the
16th Conference on Uncertainty in Artificial Intelligence (UAI), pages
143–151, 2000.
[DG02] S. Dasgupta and A. Gupta. An elementary proof of the Johnson-Lind-
enstrauss Lemma. Random Structures & Algorithms, 22(1):60–65, 2002.
[EK00] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approxi-
mate nearest neighbor in high dimensional spaces. SIAM J. Computing,
30(2):457–474, 2000.
[FM03] D. Fradkin and D. Madigan. Experiments with random projections for
machine learning. In KDD ’03: Proceedings of the ninth ACM SIGKDD
international conference on Knowledge discovery and data mining, pages
517–522, 2003.
[FS97] Y. Freund and R. Schapire. A decision-theoretic generalization of on-
line learning and an application to boosting. Journal of Computer and
System Sciences, 55(1):119–139, 1997.
[FS99] Y. Freund and R.E. Schapire. Large margin classification using the Per-
ceptron algorithm. Machine Learning, 37(3):277–296, 1999.
[GBN05] N. Goel, G. Bebis, and A. Nefian. Face recognition experiments with
random projection. In Proceedings SPIE Vol. 5779, pages 426–437, 2005.
[GW95] M.X. Goemans and D.P. Williamson. Improved approximation algo-
rithms for maximum cut and satisfiability problems using semidefinite
programming. Journal of the ACM, 42(6):1115–1145, 1995.
[IM98] P. Indyk and R. Motwani. Approximate nearest neighbors: towards re-
moving the curse of dimensionality. In Proceedings of the 30th Annual
ACM Symposium on Theory of Computing, pages 604–613, 1998.
[JL84] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings
into a Hilbert space. In Conference in Modern Analysis and Probability,
pages 189–206, 1984.
[Lit89] Nick Littlestone. From on-line to batch learning. In COLT ’89: Proceed-
ings of the 2nd Annual Workshop on Computational Learning Theory,
pages 269–284, 1989.
[MMR+01] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An
introduction to kernel-based learning algorithms. IEEE Transactions on
Neural Networks, 12:181–201, 2001.
[MP69] M. Minsky and S. Papert. Perceptrons: An Introduction to Computa-
tional Geometry. The MIT Press, 1969.
[Nov62] A.B.J. Novikoff. On convergence proofs on perceptrons. In Proceedings
of the Symposium on the Mathematical Theory of Automata, Vol. XII,
pages 615–622, 1962.
[Sch90] R. E. Schapire. The strength of weak learnability. Machine Learning,
5(2):197–227, 1990.
[Sch00] L. Schulman. Clustering for edge-cost minimization. In Proceedings
of the 32nd Annual ACM Symposium on Theory of Computing, pages
547–555, 2000.
[STBWA98] J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, and M. Anthony. Struc-
tural risk minimization over data-dependent hierarchies. IEEE Trans.
on Information Theory, 44(5):1926–1940, 1998.
[Vap98] V. N. Vapnik. Statistical Learning Theory. John Wiley and Sons Inc.,
New York, 1998.
[Vem98] S. Vempala. Random projection: A new approach to VLSI layout.
In Proceedings of the 39th Annual IEEE Symposium on Foundations of
Computer Science, pages 389–395, 1998.
[Vem04] S. Vempala. The Random Projection Method. American Mathemati-
cal Society, DIMACS: Series in Discrete Mathematics and Theoretical
Computer Science, 2004.