Random Projection, Margins, Kernels, and Feature-Selection
Avrim Blum
1 Introduction
Random projection is a technique that has found substantial use in the area
of algorithm design (especially approximation algorithms), by allowing one to
substantially reduce the dimensionality of a problem while still retaining a significant
degree of problem structure. In particular, given n points in Euclidean space
(of any dimension, but which we can think of as R^n), we can project these
points down to a random d-dimensional subspace for d ≪ n, with the following
outcomes:

1. If d is moderately large (roughly logarithmic in the number of points), then with high probability all pairwise distances and angles are approximately preserved.
2. Even if d is very small (even d = 1, i.e., projection onto a single random line), enough structure is preserved in a probabilistic sense for the projection to be algorithmically useful.
Projections of the first type have had a number of uses including fast approxi-
mate nearest-neighbor algorithms [IM98, EK00] and approximate clustering al-
gorithms [Sch00] among others. Projections of the second type are often used
for “rounding” a semidefinite-programming relaxation, such as for the Max-CUT
problem [GW95], and have been used for various graph-layout problems [Vem98].
The purpose of this survey is to describe some ways that this technique can
be used (either practically, or for providing insight) in the context of machine
learning. In particular, random projection can provide a simple way to see why
data that is separable by a large margin is easy for learning even if data lies
in a high-dimensional space (e.g., because such data can be randomly projected
down to a low dimensional space without affecting separability, and therefore
it is “really” a low-dimensional problem after all). It can also suggest some
especially simple algorithms. In addition, random projection (of various types)
can be used to provide an interesting perspective on kernel functions, and also
provide a mechanism for converting a kernel function into an explicit feature
space.
The use of Johnson-Lindenstrauss type results in the context of learning was
first proposed by Arriaga and Vempala [AV99], and a number of uses of random
projection in learning are discussed in [Vem04]. Experimental work on using
random projection has been performed in [FM03, GBN05, Das00]. This survey,
in addition to background material, focuses primarily on work in [BB05, BBV04].
Except in a few places (e.g., Theorem 1, Lemma 1) we give only sketches and
basic intuition for proofs, leaving the full proofs to the papers cited.
We begin with some definitions.
Definition 1. We say that a set S of labeled examples is linearly separable by margin γ if there
exists a unit-length vector w such that every labeled example (x, ℓ) ∈ S satisfies
ℓ(w · x)/||x|| ≥ γ; that is, the cosine of the angle between w and x has
magnitude at least γ. For simplicity, we are only considering separators that
pass through the origin, though the results we discuss can be adapted to the general
case as well.
We can similarly talk in terms of the distribution P rather than a sample S.
Definition 2. We say that P is linearly separable by margin γ if there
exists a unit-length vector w such that
    Pr_{(x,ℓ)∼P}[ ℓ(w · x)/||x|| < γ ] = 0,
and we say that P is linearly separable with error α at margin γ if this probability is at most α.
We now describe a particularly simple algorithm, based on projection to a random
1-dimensional space, that is guaranteed to weak-learn with error rate at most
1/2 − γ/4 whenever a learning problem is linearly separable by some
margin γ. This can then be plugged into boosting [Sch90, FS97] to achieve
strong learning. This material is taken from [BB05].
In particular, the algorithm is as follows.

Algorithm 1
1. Pick a random unit-length vector h, defining the hypothesis sign(h · x).
2. Evaluate the error rate of this hypothesis; if err(h) ≤ 1/2 − γ/4, halt and output h, else go back to Step 1.

The analysis in [BB05] shows that, because every example has margin at least γ
with respect to the target separator w∗, a random h conditioned on h · w∗ ≥ 0
has expected error at most 1/2 − Ω(γ).
Finally, since the error rate of any hypothesis is bounded between 0 and 1, and
a random vector h has a 1/2 chance of satisfying h · w∗ ≥ 0, it must be the case
that:
    Pr_h[err(h) ≤ 1/2 − γ/4] = Ω(γ).
² For simplicity, we have presented Algorithm 1 as if it can exactly evaluate the true
error of its chosen hypothesis in Step 2. Technically, we should change Step 2 to
talk about empirical error, using an intermediate value such as 1/2 − γ/6. In that
case, a sample of size O((1/γ²) log(1/γ)) is sufficient to be able to run Algorithm 1
for O(1/γ) repetitions, and to evaluate the error rate of each hypothesis produced
to sufficient precision.
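To make this concrete, here is a minimal sketch of Algorithm 1 in Python/NumPy (not from the paper). The data layout (a matrix X of examples with ±1 labels y), the cap on the number of repetitions, and the fallback when no vector passes the test are assumptions of the example; the empirical-error threshold 1/2 − γ/6 follows the footnote above.

```python
import numpy as np

def weak_learn_by_random_projection(X, y, gamma, max_tries=1000, rng=None):
    """Minimal sketch of Algorithm 1: keep trying random unit vectors h until
    the hypothesis sign(h . x) has empirical error at most 1/2 - gamma/6.

    X : (m, n) array of examples;  y : length-m array of +/-1 labels.
    Assumes the sample is linearly separable by margin gamma, so each try
    succeeds with probability Omega(gamma)."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[1]
    best_h, best_err = None, float("inf")
    for _ in range(max_tries):
        h = rng.standard_normal(n)
        h /= np.linalg.norm(h)                 # Step 1: a random unit-length vector
        err = np.mean(np.sign(X @ h) != y)     # Step 2: empirical error of sign(h . x)
        if err < best_err:
            best_h, best_err = h, err
        if err <= 0.5 - gamma / 6.0:           # good enough to be a weak hypothesis
            return h
    return best_h                              # fallback: best hypothesis seen
```

In the boosting application mentioned above, this routine would simply be re-run on each reweighted sample produced by the booster.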
[Figure: the target vector w∗, a point x, the angle α, and the two halfspaces h · x ≥ 0 and h · x ≤ 0 determined by a random hypothesis h.]
nearly independent (in fact, in forms (2) and (3) they are completely independent
random variables). This allows us to use a Chernoff-style bound to argue that
d = O((1/γ²) log(1/δ)) is sufficient so that with probability 1 − δ, the quantity
y1² + · · · + yd² will be within 1 ± γ of its expectation. This in turn implies that
the length of y is within 1 ± γ of its expectation. Finally, using δ = o(1/n²), we
have by the union bound that with high probability this is satisfied simultaneously
for all pairs of points pi, pj in the input.
Formally, here is a convenient form of the Johnson-Lindenstrauss Lemma
given in [AV99]. Let N(0, 1) denote the standard Normal distribution with mean
0 and variance 1, and let U(−1, 1) denote the distribution that has probability 1/2
on −1 and probability 1/2 on 1.

Theorem 2 ([AV99]). Let u, v ∈ R^n, and let u′ = (1/√d) uA and v′ = (1/√d) vA,
where A is an n × d matrix whose entries are chosen independently from either
N(0, 1) or U(−1, 1). Then
    Pr[ (1 − ε)||u − v||² ≤ ||u′ − v′||² ≤ (1 + ε)||u − v||² ] ≥ 1 − 2e^{−(ε²−ε³)d/4}.
³ The Johnson-Lindenstrauss Lemma talks about relative distances being approximately
preserved, but it is a straightforward calculation to show that this implies
angles must be approximately preserved as well.
Theorem 3. If P is linearly separable by margin γ, then d = O((1/γ²) log(1/(εδ))) is
sufficient so that with probability at least 1 − δ, a random projection down to R^d
will be linearly separable with error at most ε at margin γ/2.
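As an illustration of Theorem 3, the following sketch (our own, not from the paper) generates unit-length synthetic data that is separable by margin γ, projects it through a matrix of i.i.d. N(0, 1) entries scaled by 1/√d, and measures how many points retain margin γ/2. The synthetic data model and the constant in the choice of d are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma, eps, delta = 1000, 2000, 0.5, 0.05, 0.05

# Unit-length target separator w_star, and unit-length examples with margin >= gamma.
w_star = rng.standard_normal(n)
w_star /= np.linalg.norm(w_star)
Z = rng.standard_normal((m, n))
Z -= np.outer(Z @ w_star, w_star)                 # components orthogonal to w_star
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
y = rng.choice([-1.0, 1.0], size=m)
t = rng.uniform(gamma, 1.0, size=m)               # per-example margins in [gamma, 1]
X = (y * t)[:, None] * w_star + np.sqrt(1.0 - t**2)[:, None] * Z

# Random projection to R^d with i.i.d. N(0,1) entries, scaled by 1/sqrt(d).
d = int(np.ceil((4.0 / gamma**2) * np.log(1.0 / (eps * delta))))   # constant 4 is arbitrary
A = rng.standard_normal((n, d)) / np.sqrt(d)
Xp, wp = X @ A, w_star @ A
wp /= np.linalg.norm(wp)

# Fraction of projected points that still have margin at least gamma/2.
margins = y * (Xp @ wp) / np.linalg.norm(Xp, axis=1)
print(f"d = {d}, fraction with margin >= gamma/2: {np.mean(margins >= gamma / 2):.3f}")
```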
Let w be the target large-margin separator in the φ-space, and for a set S of
examples let w_in(S) denote the projection of w onto span(S) (identifying each
example with its image under φ) and let w_out(S) = w − w_in(S). If w_out(S) is
small, in the sense that |w_out(S) · z| ≤ γ/2 for all but an ε probability mass of
points z, then w_in(S) has the property we want: there is at most an ε probability
mass of points z whose dot-product with w and with w_in(S) differ by more than
γ/2. So, we need only to consider what happens when w_out(S) is large.
The crux of the proof now is that if w_out(S) is large, this means that a
new random point z has at least an ε chance of significantly improving the
set S. Specifically, consider z such that |w_out(S) · z| > γ/2. Let z_in(S) be the
projection of z to span(S), let z_out(S) = z − z_in(S) be the portion of z orthogonal
to span(S), and let z′ = z_out(S)/||z_out(S)||. Now, for S′ = S ∪ {z}, we have
w_out(S′) = w_out(S) − proj(w_out(S), span(S′)) = w_out(S) − (w_out(S) · z′)z′, where
the last equality holds because w_out(S) is orthogonal to span(S) and so its
projection onto span(S′) is the same as its projection onto z′. Finally, since
w_out(S′) is orthogonal to z′ we have ||w_out(S′)||² = ||w_out(S)||² − |w_out(S) · z′|²,
and since |w_out(S) · z′| ≥ |w_out(S) · z_out(S)| = |w_out(S) · z|, this implies by
definition of z that ||w_out(S′)||² < ||w_out(S)||² − (γ/2)².
So, we have a situation where so long as w_out is large, each example has
at least an ε chance of reducing ||w_out||² by at least γ²/4, and since ||w||² =
||w_out(∅)||² = 1, this can happen at most 4/γ² times. Chernoff bounds state
that a coin of bias ε flipped n = (8/ε)[1/γ² + ln(1/δ)] times will, with probability
1 − δ, have at least nε/2 ≥ 4/γ² heads. Together, these imply that with probability
at least 1 − δ, w_out(S) will be small for |S| ≥ (8/ε)[1/γ² + ln(1/δ)], as desired.
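The following small simulation is an illustration of this potential argument under assumed settings: it works in an explicit feature space rather than through a kernel, draws unit-length examples whose dot-product with the target w has magnitude at least γ, and tracks how ||w_out(S)||² shrinks as examples are added.

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 200, 0.3
w = rng.standard_normal(n)
w /= np.linalg.norm(w)                                   # unit-length target separator

def draw_example():
    """A unit-length example whose dot-product with w has magnitude >= gamma."""
    u = rng.standard_normal(n)
    u -= (u @ w) * w
    u /= np.linalg.norm(u)                               # random direction orthogonal to w
    t = rng.uniform(gamma, 1.0) * rng.choice([-1.0, 1.0])
    return t * w + np.sqrt(1.0 - t**2) * u

Q = np.zeros((n, 0))                                     # orthonormal basis of span(S)
for _ in range(500):
    w_out = w - Q @ (Q.T @ w)                            # part of w orthogonal to span(S)
    z = draw_example()
    if abs(w_out @ z) > gamma / 2:                       # z "significantly improves" S
        z_out = z - Q @ (Q.T @ z)                        # part of z orthogonal to span(S)
        Q = np.column_stack([Q, z_out / np.linalg.norm(z_out)])

w_out = w - Q @ (Q.T @ w)
# Each accepted z reduced ||w_out(S)||^2 by more than (gamma/2)^2, starting from 1,
# so the number of accepted examples can never exceed 4/gamma^2.
print(f"||w_out(S)||^2 = {np.linalg.norm(w_out)**2:.3f}, "
      f"|S| = {Q.shape[1]} (bound: 4/gamma^2 = {4/gamma**2:.0f})")
```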
By the argument above, with high probability there exists a low-error large-margin
separator lying in the span of the φ-images of a random sample x1, . . . , xd, i.e., a
separator of the form
    w = α1 φ(x1) + · · · + αd φ(xd).
However, this by itself gives us no control over the length of
the separator in the new space, or the length of the examples F1(x). The key
problem is that if many of the φ(xi) are very similar, then their associated
features K(x, xi) will be highly correlated. Instead, to preserve margin we want
to choose an orthonormal basis of the space spanned by the φ(xi): i.e., to do an
orthogonal projection of φ(x) into this space. Specifically, let S = {x1, . . . , xd} be
a set of (8/ε)[1/γ² + ln(1/δ)] unlabeled examples from D as in Corollary 1. We can then
implement the desired orthogonal projection of φ(x) as follows. Run K(x, y)
for all pairs x, y ∈ S, and let M(S) = (K(xi, xj))_{xi,xj∈S} be the resulting kernel
matrix. Now decompose M(S) into U^T U, where U is an upper-triangular matrix.
Finally, define the mapping F2 : X → R^d to be F2(x) = F1(x)U^{-1}, where F1
is the mapping of Corollary 1. This is equivalent to an orthogonal projection of
φ(x) into span(φ(x1), . . . , φ(xd)). Technically, if U is not full rank then we want
to use the (Moore-Penrose) pseudoinverse [BIG74] of U in place of U^{-1}.
By Lemma 1, this mapping F2 maintains approximate separability at margin
γ/2 (see [BBV04] for a full proof).
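Here is a sketch of the mappings F1 and F2 in Python/NumPy. It is not from the paper: the RBF kernel, the sample size d = 50, and the small diagonal jitter are assumptions of the example. The decomposition M(S) = U^T U is computed with a Cholesky factorization, and the pseudoinverse of U is used as described above.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)); stands in for any black-box kernel."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma**2))

rng = np.random.default_rng(2)
S = rng.standard_normal((50, 10))       # the sample x_1, ..., x_d drawn from D (d = 50 here)
Xnew = rng.standard_normal((5, 10))     # points to be mapped

# F1(x) = (K(x, x_1), ..., K(x, x_d)): one kernel evaluation per sample point.
F1 = rbf_kernel(Xnew, S)

# Kernel matrix M(S) = (K(x_i, x_j)) and a decomposition M(S) = U^T U.
M = rbf_kernel(S, S)
L = np.linalg.cholesky(M + 1e-10 * np.eye(len(S)))      # tiny jitter for numerical safety
U = L.T                                                  # upper-triangular, U^T U = M
# F2(x) = F1(x) U^{-1}; the pseudoinverse covers the rank-deficient case.
F2 = F1 @ np.linalg.pinv(U)

# Check: dot-products of F2 images equal dot-products of the orthogonal projections
# of phi(x) onto span(phi(x_1), ..., phi(x_d)), namely k_x^T M^{-1} k_y.
print(F2.shape, np.allclose(F2 @ F2.T, F1 @ np.linalg.pinv(M) @ F1.T, atol=1e-6))
```

The final check confirms that dot-products of F2-images agree with dot-products of the orthogonal projections of φ(x) onto span(φ(x1), . . . , φ(xd)), consistent with F2 implementing the desired orthogonal projection.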
Recall from Theorem 3 that if we have access to the φ-space, then no matter what the distribution is, a
random projection down to R^d will approximately preserve the existence of a
large-margin separator with high probability. So perhaps such a mapping F
can be produced by just computing K on some polynomial number of cleverly-chosen
(or uniform random) points in X. (Let us assume X is a “nice” space
such as the unit ball or {0, 1}^n that can be randomly sampled.) In this section,
we show this is not possible in general for an arbitrary black-box kernel. This
leaves open, however, the case of specific natural kernels.
One way to view the result of this section is as follows. If we define a fea-
ture space based on dot-products with uniform or gaussian-random points in
the φ-space, then we know this will work by the Johnson-Lindenstrauss lemma.
However, this requires explicit access to the φ-space. Alternatively, using Corol-
lary 1 we can define features based on dot-products with points φ(x) for x ∈ X,
which only requires implicit access to the φ-space through the kernel. However,
this procedure needs to use D to select the points x. What we show here is that
such use of D is necessary: if we define features based on points φ(x) for uniform
random x ∈ X, or any other distribution that does not depend on D, then there
will exist kernels for which this does not work.
We demonstrate the necessity of access to D as follows. Consider X = {0, 1}^n,
let X′ be a random subset of 2^{n/2} elements of X, and let D be the uniform
distribution on X′. For a given target function c, we will define a special φ-function
φc such that c is a large-margin separator in the φ-space under distribution D,
but that only the points in X′ behave nicely, and points not in X′ provide no
useful information. Specifically, consider φc : X → R^2 defined as:
    φc(x) = (1, 0)              if x ∉ X′
    φc(x) = (−1/2, √3/2)        if x ∈ X′ and c(x) = 1
    φc(x) = (−1/2, −√3/2)       if x ∈ X′ and c(x) = −1

Notice that the distribution P = (D, c) over labeled examples has margin γ =
√3/2 in the φ-space: for instance, the unit-length separator w = (0, 1) classifies
every point in the support of D correctly with margin exactly √3/2.
[Figure: the three possible values of φc(x), at 120° to one another, labeled “x not in X′”, “x in X′, c(x) = 1”, and “x in X′, c(x) = −1”.]
Conclusions and Open Problems

This survey has examined ways in which random projection (of various forms)
can provide algorithms for, and insight into, problems in machine learning. For
example, if a learning problem is separable by a large margin γ, then a random
projection to a space of dimension O((1/γ²) log(1/δ)) will with high probability
approximately preserve separability, so we can think of the problem as really
an O(1/γ²)-dimensional problem after all. In addition, we saw that just picking
a random separator (which can be thought of as projecting to a random
1-dimensional space) has a reasonable chance of producing a weak hypothesis.
We also saw how given black-box access to a kernel function K and a distri-
bution D (i.e., unlabeled examples) we can use K and D together to construct a
new low-dimensional feature space in which to place the data that approximately
preserves the desired properties of the kernel. Thus, through this mapping, we
can think of a kernel as in some sense providing a distribution-dependent feature
space. One interesting aspect of the simplest method considered, namely choos-
ing x1 , . . . , xd from D and then using the mapping x → (K(x, x1 ), . . . , K(x, xd )),
is that it can be applied to any generic “similarity” function K(x, y), even those
that are not necessarily legal kernels and do not necessarily have the same inter-
pretation as computing a dot-product in some implicit φ-space. Recent results
of [BB06] extend some of these guarantees to this more general setting.
One concrete open question is whether, for natural standard kernel functions,
one can produce mappings F : X → Rd in an oblivious manner, without using
examples from the data distribution. The Johnson-Lindenstrauss lemma tells us
that such mappings exist, but the goal is to produce them without explicitly
computing the φ-function. Barring that, perhaps one can at least reduce the
unlabeled sample-complexity of our approach. On the practical side, it would
be interesting to further explore the alternatives that these (or other) mappings
provide to widely used algorithms such as SVM and Kernel Perceptron.
Acknowledgements
Much of this was based on joint work with Maria-Florina (Nina) Balcan and San-
tosh Vempala. Thanks also to the referees for their helpful comments. This work
was supported in part by NSF grants CCR-0105488, NSF-ITR CCR-0122581,
and NSF-ITR IIS-0312814.
References

[AV99] R. I. Arriaga and S. Vempala. An algorithmic theory of learning: Robust
concepts and random projection. In Proceedings of the 40th Annual IEEE
Symposium on Foundations of Computer Science, 1999.
[BB05] M-F. Balcan and A. Blum. A PAC-style model for learning from labeled
and unlabeled data. In Proceedings of the 18th Annual Conference on
Computational Learning Theory (COLT), pages 111–126, 2005.
[BB06] M-F. Balcan and A. Blum. On a theory of kernels as similarity functions.
Manuscript, 2006.
[BBV04] M.F. Balcan, A. Blum, and S. Vempala. Kernels as features: On kernels,
margins, and low-dimensional mappings. In 15th International Conference
on Algorithmic Learning Theory (ALT ’04), pages 194–205, 2004. An ex-
tended version is available at http://www.cs.cmu.edu/~avrim/Papers/.
[BGV92] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for
optimal margin classifiers. In Proceedings of the Fifth Annual Workshop
on Computational Learning Theory, 1992.
[BIG74] A. Ben-Israel and T.N.E. Greville. Generalized Inverses: Theory and
Applications. Wiley, New York, 1974.
[Blo62] H.D. Block. The perceptron: A model for brain functioning. Reviews
of Modern Physics, 34:123–135, 1962. Reprinted in Neurocomputing,
Anderson and Rosenfeld.
[BST99] P. Bartlett and J. Shawe-Taylor. Generalization performance of support
vector machines and other pattern classifiers. In Advances in Kernel
Methods: Support Vector Learning. MIT Press, 1999.
[CV95] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning,
20(3):273–297, 1995.
[Das00] S. Dasgupta. Experiments with random projection. In Proceedings of the
16th Conference on Uncertainty in Artificial Intelligence (UAI), pages
143–151, 2000.
[DG02] S. Dasgupta and A. Gupta. An elementary proof of the Johnson-Lind-
enstrauss Lemma. Random Structures & Algorithms, 22(1):60–65, 2002.
[EK00] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approxi-
mate nearest neighbor in high dimensional spaces. SIAM J. Computing,
30(2):457–474, 2000.
[FM03] D. Fradkin and D. Madigan. Experiments with random projections for
machine learning. In KDD ’03: Proceedings of the ninth ACM SIGKDD
international conference on Knowledge discovery and data mining, pages
517–522, 2003.
[FS97] Y. Freund and R. Schapire. A decision-theoretic generalization of on-
line learning and an application to boosting. Journal of Computer and
System Sciences, 55(1):119–139, 1997.
[FS99] Y. Freund and R.E. Schapire. Large margin classification using the Per-
ceptron algorithm. Machine Learning, 37(3):277–296, 1999.
[GBN05] N. Goel, G. Bebis, and A. Nefian. Face recognition experiments with
random projection. In Proceedings SPIE Vol. 5779, pages 426–437, 2005.
[GW95] M.X. Goemans and D.P. Williamson. Improved approximation algo-
rithms for maximum cut and satisfiability problems using semidefinite
programming. Journal of the ACM, 42(6):1115–1145, 1995.
[IM98] P. Indyk and R. Motwani. Approximate nearest neighbors: towards re-
moving the curse of dimensionality. In Proceedings of the 30th Annual
ACM Symposium on Theory of Computing, pages 604–613, 1998.
[JL84] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings
into a Hilbert space. In Conference in Modern Analysis and Probability,
pages 189–206, 1984.
[Lit89] Nick Littlestone. From on-line to batch learning. In COLT ’89: Proceed-
ings of the 2nd Annual Workshop on Computational Learning Theory,
pages 269–284, 1989.
[MMR+01] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An
introduction to kernel-based learning algorithms. IEEE Transactions on
Neural Networks, 12:181–201, 2001.
[MP69] M. Minsky and S. Papert. Perceptrons: An Introduction to Computa-
tional Geometry. The MIT Press, 1969.
[Nov62] A.B.J. Novikoff. On convergence proofs on perceptrons. In Proceedings
of the Symposium on the Mathematical Theory of Automata, Vol. XII,
pages 615–622, 1962.
[Sch90] R. E. Schapire. The strength of weak learnability. Machine Learning,
5(2):197–227, 1990.
[Sch00] L. Schulman. Clustering for edge-cost minimization. In Proceedings
of the 32nd Annual ACM Symposium on Theory of Computing, pages
547–555, 2000.
[STBWA98] J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, and M. Anthony. Struc-
tural risk minimization over data-dependent hierarchies. IEEE Trans.
on Information Theory, 44(5):1926–1940, 1998.
[Vap98] V. N. Vapnik. Statistical Learning Theory. John Wiley and Sons Inc.,
New York, 1998.
[Vem98] S. Vempala. Random projection: A new approach to VLSI layout.
In Proceedings of the 39th Annual IEEE Symposium on Foundations of
Computer Science, pages 389–395, 1998.
[Vem04] S. Vempala. The Random Projection Method. American Mathemati-
cal Society, DIMACS: Series in Discrete Mathematics and Theoretical
Computer Science, 2004.