Rise of The Machines

Larry Wasserman
1.1 Introduction
Statistics is the science of learning from data. Machine Learning (ML) is the
science of learning from data. These fields are identical in intent although they
differ in their history, conventions, emphasis and culture.
There is no denying the success and importance of the field of Statistics for
science and, more generally, for society. I’m proud to be a part of the field. The
focus of this essay is on one challenge (and opportunity) to our field: the rise of
Machine Learning.
During my twenty-five-year career I have seen Machine Learning evolve from
a collection of rather primitive (yet clever) methods for classification into a
sophisticated science that is rich in theory and applications.
A quick glance at The Journal of Machine Learning Research (jmlr.csail.mit.edu)
and NIPS (books.nips.cc) reveals papers on a variety of topics that will be
familiar to Statisticians, such as:
conditional likelihood, sequential design, reproducing kernel Hilbert
spaces, clustering, bioinformatics, minimax theory, sparse regres-
sion, estimating large covariance matrices, model selection, density
estimation, graphical models, wavelets, nonparametric regression.
it, and eventually you write it up and submit it. Then the refereeing process
starts. One paper can take years.
In ML, the intellectual currency is conference publications. There are a
number of deadlines for the main conferences (NIPS, AISTATS, ICML, COLT).
The threat of a deadline forces one to quit ruminating and start writing. Most
importantly, all faculty members and students are facing the same deadline so
there is a synergy in the field that has mutual benefits. No one minds if you
cancel a class right before the NIPS deadline. And then, after the deadline,
everyone is facing another deadline: refereeing each other's papers and doing so
in a timely manner. If you have an idea and don’t submit a paper on it, then
you may be out of luck because someone may scoop you.
This pressure is good; it keeps the field moving at a fast pace. If you think
this leads to poorly written papers or poorly thought out ideas, I suggest you
look at nips.cc and read some of the papers. There are some substantial, deep
papers. There are also a few bad papers. Just like in our journals. The papers
are refereed and the acceptance rate is comparable to our main journals. And
if an idea requires more detailed followup, then one can always write a longer
journal version of the paper.
Absent this constant stream of deadlines, a field moves slowly. This is a
problem for Statistics, not only for its own sake but also because it now competes
with ML.
Of course, there are disadvantages to the conference culture. Work is done
in a rush, and ideas are often not fleshed out in detail. But I think that the
advantages outweigh the disadvantages.
and hence, semisupervised estimation dominates supervised estimation.
The class Pn consists of distributions such that the marginal for X is highly
concentrated near some lower dimensional set and such that the regression func-
tion is smooth on this set. We have not proved that the class must be of this
form for semisupervised inference to improve on supervised inference, but we
suspect that is indeed the case. Our framework includes a parameter α that
characterizes the strength of the semisupervised assumption. We showed that,
in fact, one can use the data to adapt to the correct value of α.
$$Y_i = X_i + \epsilon_i, \qquad i = 1, \ldots, n \qquad\qquad (1.4)$$
Of course, the risk depends on what conditions we assume on $M$ and on the
noise $\Phi$.
Our main findings are as follows. When there is no noise, so that the data fall
on the manifold, we get $R_n \asymp n^{-2/d}$. When the noise is perpendicular to $M$,
the risk is $R_n \asymp n^{-2/(2+d)}$. When the noise is Gaussian, the rate is $R_n \asymp 1/\log n$.
The latter is not surprising when one considers the similar problem of estimating
a function when there are errors in variables.
The implication for Machine Learning is that how well such algorithms
can do depends strongly on the particular type of noise.
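As a concrete illustration of model (1.4), here is a minimal simulation sketch (Python with NumPy); the choice of the unit circle as $M$ with $d = 1$, $D = 2$, and the noise level sigma are illustrative assumptions, not from the text. It generates data under the three noise regimes discussed above:

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_circle_data(n, sigma=0.1, noise="gaussian"):
        """Draw Y_i = X_i + eps_i where X_i lies on the unit circle (d = 1, D = 2).

        noise: "none"          -> data fall exactly on the manifold
               "perpendicular" -> eps_i is radial, i.e. orthogonal to the circle
               "gaussian"      -> eps_i is isotropic Gaussian in R^2
        """
        theta = rng.uniform(0, 2 * np.pi, size=n)
        X = np.column_stack([np.cos(theta), np.sin(theta)])   # points on M
        if noise == "none":
            eps = np.zeros_like(X)
        elif noise == "perpendicular":
            # on the unit circle the radial direction at X is X itself, so eps is orthogonal to M
            eps = sigma * rng.standard_normal(n)[:, None] * X
        else:  # "gaussian"
            eps = sigma * rng.standard_normal((n, 2))
        return X + eps

    Y_clean = sample_circle_data(500, noise="none")
    Y_perp  = sample_circle_data(500, noise="perpendicular")
    Y_gauss = sample_circle_data(500, noise="gaussian")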
How do we actually estimate these manifolds in practice? In [8] we take the
following point of view: if the noise is not too large, then the manifold should
be close to a d-dimensional hyper-ridge in the density p(y) for Y . Ridge finding
is an extension of mode finding, which is a common task in computer vision.
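For readers unfamiliar with mode finding, here is a minimal sketch of the classical mean shift idea (Python with NumPy; the Gaussian kernel, the bandwidth, and the stopping rule are illustrative choices, not taken from the text): each point is repeatedly moved to a kernel-weighted average of the data, which sends it uphill toward a mode of the kernel density estimate, as depicted in Figure 1.3.

    import numpy as np

    def mean_shift(data, h=0.5, n_iter=100, tol=1e-6):
        """Move every point uphill to a mode of the Gaussian kernel density estimate.

        data : (n, D) array of observations
        h    : kernel bandwidth (illustrative choice)
        Returns an (n, D) array of mode locations, one per starting point.
        """
        points = data.copy()
        for _ in range(n_iter):
            shifted = np.empty_like(points)
            for i, x in enumerate(points):
                w = np.exp(-np.sum((data - x) ** 2, axis=1) / (2 * h ** 2))
                shifted[i] = w @ data / w.sum()          # kernel-weighted mean
            converged = np.max(np.abs(shifted - points)) < tol
            points = shifted
            if converged:
                break
        return points

Points whose trajectories end at (approximately) the same location can then be grouped into the same mode-based cluster.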
Figure 1.3: The Mean Shift Algorithm. The data points move along trajecto-
ries during iterations until they reach the two modes marked by the two large
asterisks.
Let $\lambda_1(x) \geq \cdots \geq \lambda_D(x)$ denote the eigenvalues of the Hessian $H(x)$ of the density $p$, and let $\Lambda(x)$
be the diagonal matrix whose diagonal elements are these eigenvalues. Write the
spectral decomposition of $H(x)$ as $H(x) = U(x)\Lambda(x)U(x)^T$. Fix $0 \leq d < D$ and let
$V(x)$ be the last $D - d$ columns of $U(x)$ (that is, the columns corresponding to the
$D - d$ smallest eigenvalues), and let $V_\natural(x)$ denote the remaining $d$ columns. If we
write $U(x) = [V_\natural(x) : V(x)]$ then we can write $H(x) = [V_\natural(x) : V(x)]\,\Lambda(x)\,[V_\natural(x) : V(x)]^T$.
Let $L(x) = V(x)V(x)^T$ be the projector onto the linear space defined by the
columns of $V(x)$. Define the projected gradient $G(x) = L(x)\,g(x)$, where $g(x)$ is
the gradient of $p$ at $x$.
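To make the construction concrete, here is a minimal sketch (Python with NumPy; the function name and the toy inputs are illustrative, not from the text) of computing the projected gradient $G(x)$ from a gradient $g$ and Hessian $H$ supplied at a single point, for example obtained from a kernel density estimate:

    import numpy as np

    def projected_gradient(g, H, d):
        """Compute G = L g where L = V V^T projects onto the span of the
        eigenvectors of H with the D - d smallest eigenvalues.

        g : (D,) gradient of the density at x
        H : (D, D) Hessian of the density at x
        d : dimension of the ridge sought (0 <= d < D)
        """
        eigvals, U = np.linalg.eigh(H)     # eigh returns eigenvalues in ascending order
        V = U[:, : H.shape[0] - d]         # columns for the D - d smallest eigenvalues
        L = V @ V.T                        # projector onto span(V)
        return L @ g

    # toy check in D = 2, d = 1: only the component of g along the direction
    # of strongest negative curvature (across the ridge) survives
    g = np.array([1.0, 2.0])
    H = np.diag([-3.0, -0.5])
    print(projected_gradient(g, H, d=1))   # prints [1. 0.]

Because np.linalg.eigh orders eigenvalues from smallest to largest, its first $D - d$ eigenvector columns play the role of $V(x)$ above.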
If the vector field $G(x)$ is Lipschitz then, by Theorem 3.39 of [9], $G$ defines
a global flow as follows. The flow is a family of functions $\phi(x, t)$ such that
$\phi(x, 0) = x$, $\phi'(x, 0) = G(x)$ and $\phi(s, \phi(t, x)) = \phi(s + t, x)$. The flow lines,
or integral curves, partition the space and at each x where G(x) is non-null,
there is a unique integral curve passing through x. The intuition is that the
flow passing through x is a gradient ascent path moving towards higher values
of p. Unlike the paths defined by the gradient g which move towards modes,
the paths defined by G move towards ridges.
The paths can be parameterized in many ways. One commonly used param-
eterization is to use $t \in [-\infty, \infty]$ where large values of $t$ correspond to higher
values of $p$. In this case $t = \infty$ corresponds to a point on the ridge. In this
parameterization we can express each integral curve in the flow as follows. A
map $\pi: \mathbb{R} \to \mathbb{R}^D$ is an integral curve with respect to the flow of $G$ if $\pi'(t) = G(\pi(t))$.
As mentioned above, the integral curves partition the space and, for each
$x \notin R$, there is a unique path $\pi_x$ passing through $x$. The ridge points are zeros
of the projected gradient: $y \in R$ implies that $G(y) = (0, \ldots, 0)^T$. [10] derived
an extension of the mean-shift algorithm, called the subspace constrained mean
shift algorithm, which finds ridges and which can be applied to the kernel density
estimator. Our results can be summarized as follows:
2. We constructed an estimator $\hat{R}$ such that
$$H(R, \hat{R}) = O_P\!\left( \left( \frac{\log n}{n} \right)^{2/(D+8)} \right) \qquad\qquad (1.11)$$
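For intuition about how such a ridge estimator can be computed, here is a rough sketch of a subspace constrained mean shift update (Python with NumPy). It is a simplified illustration in the spirit of the algorithm in [10], not the exact method: the function names, the bandwidth $h$, and the closed-form Gaussian KDE derivatives (with constant factors dropped) are my own assumptions. Each step takes an ordinary mean-shift step and keeps only its component in the span of the eigenvectors of the estimated Hessian with the $D - d$ smallest eigenvalues.

    import numpy as np

    def kde_grad_hess(x, data, h):
        """Gradient and Hessian of a Gaussian kernel density estimate at x
        (constant factors dropped; they do not affect the projected direction)."""
        diff = data - x                                   # (n, D)
        w = np.exp(-np.sum(diff ** 2, axis=1) / (2 * h ** 2))
        g = (w[:, None] * diff).sum(axis=0) / h ** 2
        H = (w[:, None, None] * (np.einsum("ni,nj->nij", diff, diff) / h ** 2
                                 - np.eye(x.size))).sum(axis=0) / h ** 2
        return g, H

    def scms_step(x, data, h, d):
        """One subspace constrained mean shift update at the point x."""
        diff = data - x
        w = np.exp(-np.sum(diff ** 2, axis=1) / (2 * h ** 2))
        ms = w @ data / w.sum() - x                       # ordinary mean-shift step
        _, H = kde_grad_hess(x, data, h)
        _, U = np.linalg.eigh(H)                          # eigenvalues in ascending order
        V = U[:, : x.size - d]                            # D - d smallest eigenvalues
        return x + V @ (V.T @ ms)                         # keep only the projected component

Iterating scms_step from each data point moves it toward the estimated ridge, much as the ordinary mean shift of Figure 1.3 moves points toward modes.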
Figure 1.4: Simulated cosmic web data.
An example can be found in Figures 1.4 and 1.5. I believe that Statistics
has much to offer to this area especially in terms of making the assumptions
precise and clarifying how accurate the inferences can be.
Figure 1.5: Ridge finder applied to simulated cosmic web data.
Having been on hiring committees for both Statistics and ML, I can say that
the difference is striking. It is easy to choose candidates to interview in ML.
You have a lot of data on each candidate and you know what you are getting.
In Statistics, it is a struggle. You have little more than a few papers that bear
their advisor’s footprint.
The ML conference culture encourages publishing many papers on many
topics, which is better for both the students and their potential employers. And
now Statistics students are competing with ML students, which puts them at a
significant disadvantage.
There are a number of topics that are routinely covered in ML that we rarely
teach in Statistics. Examples are: Vapnik-Chervonenkis theory, concentration of
measure, random matrices, convex optimization, graphical models, reproducing
kernel Hilbert spaces, support vector machines, and sequential game theory. It
is time to get rid of antiques like UMVUE, complete statistics and so on and
teach modern ideas.
Bibliography
[2] J.E. Chacón. Clusters and water flows: a novel approach to modal clustering
through Morse theory. arXiv preprint arXiv:1212.1384, 2012.
[3] D. Comaniciu and P. Meer. Mean shift: a robust approach toward feature
space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence,
24(5):603–619, 2002.
[4] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a
statistical view of boosting (with discussion and a rejoinder by the authors).
The Annals of Statistics, 28(2):337–407, 2000.
[5] K. Fukunaga and L.D. Hostetler. The estimation of the gradient of a density
function, with applications in pattern recognition. IEEE Transactions on
Information Theory, 21:32–40, 1975.
[6] C.R. Genovese, M. Perone-Pacifico, I. Verdinelli, and L. Wasserman. Manifold
estimation and singular deconvolution under Hausdorff loss. The Annals of
Statistics, 40:941–963, 2012.
[9] M.C. Irwin. Smooth dynamical systems, volume 94. Academic Press, 1980.
[10] U. Ozertem and D. Erdogmus. Locally defined principal curves and surfaces.
Journal of Machine Learning Research, 12:1249–1286, 2011.