Mathematics of machine learning: an introduction
Sanjeev Arora
Princeton University Computer Science
Institute for Advanced Study
Abstract
Machine learning is the subfield of computer science concerned with
creating machines that can improve from experience and interaction. It
relies upon mathematical optimization, statistics, and algorithm design.
Rapid empirical success in this field currently outstrips mathematical un-
derstanding. This elementary article sketches the basic framework of ma-
chine learning and hints at the open mathematical problems in it.
An updated version of this article and related articles can be found on
the author’s webpage.
MSC: 68-02, 68Q99, 68T05.
The dictionary defines the act of learning as gaining or acquiring knowledge
or skill (in something) by study, experience, or being taught. Machine learning,
a field in computer science, seeks to design machines that learn. This may seem
to fly in the face of the usual view of computers as logic-based
devices whose behavior is completely fixed by their programmer. But this view
is simplistic because it is in fact straightforward to write programs that learn
new capabilities from new experiences and new data (images, pieces of text,
etc.). This learnt capability can become part of its program, and of course,
any newly learnt capabilities can also be trivially copied from one machine to
another.
Machine learning is related to artificial intelligence, but somewhat distinct
because it does not seek to recreate only human-like skills in a machine. Some
skills —e.g., detecting patterns in millions of images from a particle accelerator,
or in billions of Facebook posts— may be easy for a machine, but beyond the
cognitive abilities of humans. (In fact, lately machines can go beyond human
capabilities in some image recognition tasks.) Conversely, many human skills
such as composing good music and proving math theorems seem beyond the
reach of current machine learning paradigms.
The quest to imbue machines with learning abilities rests upon an emerging
body of knowledge that spans computer science, mathematical optimization,
statistics, applied math, applied physics, etc. It ultimately requires us to
mathematically formulate nebulous concepts such as the “meaning” of a picture, or a
newspaper article. This article provides a brief introduction to machine learning.
The mathematical notion closest to machine learning is curve-fitting, which
has long been a mainstay of science and social science. For example, the sup-
posed inverse relationship between an economy’s inflation and unemployment
rates, called the Phillips curve, was discovered by fitting a curve to economic
data over a few decades. Machine learning algorithms do something similar,
except the settings are more complicated and involve many more variables, sometimes
tens of millions. This raises many issues, computational as well as
statistical. Let’s introduce them with a simple example.
After training we expect to find that the weights assigned to words are
meaningful. Positive words like terrific, enjoyable, loved etc. get high weights
and negative words like terrible, hated, avoid get low or negative weights.
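To make this concrete, the following is a minimal illustrative sketch of the kind of model suggested by this passage: each review is represented as a bag-of-words count vector over the dictionary, the predicted score is a weighted sum of the word counts, and the weights are fit by regularized least squares. The tiny dataset and the value of λ are purely illustrative, and the exact objective may differ in details from the one denoted (1)–(2) in the text.

    import numpy as np

    # Hypothetical toy data: V-dimensional bag-of-words vectors and a +1/-1
    # sentiment score for each review. Real dictionaries have V in the tens
    # of thousands; here V = 6 for illustration.
    vocab = ["terrific", "enjoyable", "loved", "terrible", "hated", "avoid"]
    X = np.array([[1, 1, 0, 0, 0, 0],    # "terrific ... enjoyable ..."
                  [0, 0, 1, 0, 0, 0],    # "... loved ..."
                  [0, 0, 0, 1, 0, 1],    # "terrible ... avoid ..."
                  [0, 0, 0, 0, 1, 0]],   # "... hated ..."
                 dtype=float)
    y = np.array([1.0, 1.0, -1.0, -1.0])  # human-assigned sentiment scores

    lam = 0.1           # regularization multiplier, the λ of equation (2)
    V = X.shape[1]
    # Regularized least squares: minimize ||Xθ - y||² + λ||θ||²,
    # whose minimizer solves (XᵀX + λI) θ = Xᵀy.
    theta = np.linalg.solve(X.T @ X + lam * np.eye(V), X.T @ y)

    for word, weight in zip(vocab, theta):
        print(f"{word:10s} {weight:+.3f}")   # positive words get positive weights

Running this on the toy data indeed assigns positive weights to the positive words and negative weights to the negative ones.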
To finish our discussion we need to address two important issues.
the algorithmic task can be solved very efficiently to optimality. The reason is
that the optimization problem in (1) happens to be convex, a notion we define
below. Under fairly general conditions, convex optimization problems can be
solved efficiently.
where the expectation in the second term is over the entire distribution of re-
views. At first glance this appears to be a trivial matter of bounding the differ-
ence between the population average and the sample average, in other words,
to use measure concentration bounds. But actually there is a complication: the
solution θ∗ was computed using the sample, and thus depends intimately upon
it. We handle this complication by taking a union bound over all possible θ∗ .
First, we can discretize θ∗ by rounding off its entries to the nearest integer
multiple of ε, since this can affect the predicted score by at most ε/2. Now all
nonzero entries in θ∗ are at least ε in magnitude, which means there are at most
m = ‖θ∗‖²/ε² of them. The number of possible choices for such vectors is at most
T = (V choose m) · (1/ε)^m, where recall that V denotes the number of words in
the dictionary. Now (3) follows from standard concentration bounds provided the
number of training samples exceeds c₀ log T/ε² for some suitable (and explicit)
constant c₀. This number grows roughly as ‖θ∗‖² log V/ε², which is usually much
smaller than V.
By now it should be clearer what role the tunable λ multiplier plays in (2).
For best generalization we wish to find a solution θ∗ that minimizes the ℓ2 norm.
Increasing λ penalizes solutions θ with higher ℓ2 norm, so it serves to balance
the ℓ2 norm against the total ℓ2 error on the training data. So the algorithm can
start with a high value of λ (which rules out all θ except those with very low
norm) and then perform binary search to home in on a value that balances
the error in (3) against the ℓ2 norm, so that we end up with the
minimum-norm solution.
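The balancing procedure just described can be sketched as follows; the synthetic data, the target training error, and the stopping rule are illustrative assumptions rather than part of the original algorithm.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic stand-in for the training data of equation (2): n reviews,
    # V-dimensional feature vectors, noisy scores from a sparse ground truth.
    n, V = 200, 300
    X = rng.standard_normal((n, V))
    theta_true = np.zeros(V)
    theta_true[:10] = 1.0
    y = X @ theta_true + 0.1 * rng.standard_normal(n)

    def ridge(lam):
        """Minimizer of ||Xθ - y||² + λ||θ||²."""
        return np.linalg.solve(X.T @ X + lam * np.eye(V), X.T @ y)

    def train_error(theta):
        return np.mean((X @ theta - y) ** 2)

    # Binary search on λ: a large λ forces a small-norm (but high-error) solution,
    # a small λ fits the data well; home in on the largest λ (i.e., smallest norm)
    # that still achieves a target training error. The target is illustrative.
    target_err = 0.05
    lo, hi = 1e-4, 1e4
    for _ in range(40):
        lam = np.sqrt(lo * hi)            # search on a logarithmic scale
        if train_error(ridge(lam)) <= target_err:
            lo = lam                      # can afford more regularization
        else:
            hi = lam                      # over-regularized; back off
    theta_hat = ridge(lo)
    print(f"λ ≈ {lo:.3g}, ||θ|| = {np.linalg.norm(theta_hat):.3f}, "
          f"train error = {train_error(theta_hat):.4f}")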
The above simple argument can be strengthened in various ways and ulti-
mately connects with broader questions in statistics [HTF09] as well as beautiful
parts of discrete mathematics such as VC dimension and Rademacher complex-
ity [BDSS14].
2 Supervised learning
The above simple example illustrates a more general paradigm: supervised learn-
ing, which concerns learning to classify data-points after seeing many labeled
examples. This is the most well-known and successful paradigm of machine
learning. To illustrate it we use a famous and empirically successful example,
image recognition. Imagine we have divided everyday objects into k classes:
chair, building, dog, drink, etc., and want to train the machine to assign the cor-
rect label when given an image. Here each image is in pixel format, so assume it
is a point in ℝ^d. The training data contains N images of each class, where N is
some modest number (such as 1000). Let the labels be {1, 2, . . . , k}. In formal-
izing the learning problem, it helps to think of the label y^i of x^i as a vector in
ℝ^k: it has an entry 1 in coordinate y^i and zero in the other coordinates. Ideally, the
learning algorithm would learn to produce labels with only one nonzero coordi-
nate as well, which we encourage by appropriately setting up the optimization
problem.
The machine has to learn a function f_θ : ℝ^d → ℝ^k that classifies the images
correctly, where θ are the parameters in the description of f_θ. The training
objective, variously called the loss function or the empirical risk, is
    min_θ  Σ_{i=1}^N ‖f_θ(x^i) − y^i‖²        (ℓ2 loss).        (4)
Variations of this formulation are used as well, for example the following, where
y_j denotes the jth coordinate of y:

    min_θ  −Σ_{i=1}^N Σ_{j=1}^k y^i_j log(f_θ(x^i)_j)        (cross-entropy loss).        (5)
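As a concrete illustration, the following sketch evaluates both objectives on a single hypothetical example; for the cross-entropy loss it assumes the raw outputs are first converted into a probability vector by a softmax, a standard choice that the text does not spell out.

    import numpy as np

    k = 4                          # number of classes
    y = np.zeros(k)
    y[2] = 1.0                     # one-hot label: the true class is 2

    # Hypothetical raw outputs of a classifier f_θ(x) for one image.
    f_x = np.array([0.1, 0.2, 2.0, -0.5])

    # ℓ2 loss of equation (4) for this single example.
    l2_loss = np.sum((f_x - y) ** 2)

    # Cross-entropy loss of equation (5); here the raw outputs are first turned
    # into a probability vector with a softmax (an assumed, standard choice).
    p = np.exp(f_x - f_x.max())
    p /= p.sum()
    cross_entropy = -np.sum(y * np.log(p))

    print(f"l2 loss = {l2_loss:.3f}, cross-entropy = {cross_entropy:.3f}")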
This framework for supervised learning goes by the name Empirical Risk
Minimization (ERM) [Vap98]. The learning generalizes if the expected loss
of the optimum solution θ∗ on the entire distribution is close to that on the
samples. The flip side of this issue is statistical efficiency —determining the
minimum number of samples that lead to good generalization—as was discussed
earlier.
    min_{θ∈K} g(θ)
The minimum exists, but can we find it efficiently? One could imagine using a
variety of algorithms to solve such an optimization problem —optimization the-
ory is quite well-developed! Usually design of such algorithms needs to assume
that the objects in question are efficiently computable. Specifically, given a θ
we need to be able to (a) efficiently compute g(θ) and (b) check whether θ ∈ K. Both
assumptions are easily seen to hold in the machine learning setting.
In practice, machine learning algorithms often use some variant of gradient
descent, which seems to give the best balance between performance and scala-
bility. Basically the same algorithm that is covered in freshman calculus, this
algorithm iteratively improves the solution, starting at an initial point θ_0 and then
finding θ_1, θ_2, . . . such that at step t

    s_{t+1} = θ_t − η ∇g(θ_t),        θ_{t+1} = Proj(s_{t+1}, K),

where η > 0 is called the learning rate and Proj(s_{t+1}, K) is the point in K closest
to s_{t+1}, also called the projection of s_{t+1} on K. For a suitably small learning
rate, the Pythagoras theorem implies monotonicity: g(θ_{t+1}) ≤ g(θ_t). In general,
gradient descent started from an arbitrary θ_0 is not guaranteed to reach the minimum,
as is clear from Figure 1. It converges to a stationary point, where ∇g(θ) = 0, and at
best we can hope that this is a local optimum.
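Here is a minimal sketch of the projected gradient descent update just described, on a toy convex objective with K a Euclidean ball; the objective, the radius of K, and the learning rate are all illustrative.

    import numpy as np

    def g(theta):
        """Toy smooth objective; its unconstrained minimum lies outside K."""
        return np.sum((theta - np.array([2.0, 2.0])) ** 2)

    def grad_g(theta):
        return 2.0 * (theta - np.array([2.0, 2.0]))

    def proj(s, radius=1.0):
        """Projection onto K = {θ : ||θ|| <= radius}, i.e., the closest point of K."""
        norm = np.linalg.norm(s)
        return s if norm <= radius else s * (radius / norm)

    eta = 0.1                      # learning rate η
    theta = np.zeros(2)            # θ_0
    for t in range(100):
        s = theta - eta * grad_g(theta)    # gradient step s_{t+1}
        theta = proj(s)                    # pull back into the feasible set K
    print("theta after 100 steps:", theta, " g(theta) =", g(theta))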
A well-behaved special case is when g is a convex function and K is a convex
body, as is the case in (2). Then gradient descent does reach the global optimum
if run long enough. Under modest conditions, e.g., a bound on the Lipschitz
constant, it approaches the global optimum quite quickly. A comprehensive
survey of such convex optimization procedures appears in [BV08].

Figure 1: Gradient descent on a nonconvex function is not guaranteed to reach
the global minimum.
But in general, problems (4), (5) are not convex and gradient descent can
converge, at best, to a local optimum. A nonconvex problem may have multi-
ple local optima, with some having lower objective values than others. So it is
unclear which one gradient descent ends up at. Nevertheless, in practice gradient
descent works quite well: the solutions found are generally of good quality.
Explaining why this happens is an important open problem. It is known that
regularization can help, and a cottage industry of tricks has sprung up for regu-
larizing the problem. Another important trick that helps is stochastic gradient
descent, whereby one estimates the gradient of the ERM objective (4) using a small
random subset of the training samples: this improves the running time, and also
seems to act as a regularizer.
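A minimal sketch of stochastic gradient descent on the ERM objective (4) follows, for the simplest case of a linear model f_θ(x) = θ·x, so that the per-example gradient has a closed form; the synthetic data, batch size, and learning rate are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)

    # Synthetic supervised data; f_θ(x) = θ·x is linear here for simplicity.
    N, d = 10_000, 20
    X = rng.standard_normal((N, d))
    theta_true = rng.standard_normal(d)
    y = X @ theta_true + 0.1 * rng.standard_normal(N)

    theta = np.zeros(d)
    eta, batch_size = 0.05, 32
    for step in range(2000):
        idx = rng.integers(0, N, size=batch_size)      # small random sample
        Xb, yb = X[idx], y[idx]
        # Gradient of (1/batch_size) Σ (θ·x_i - y_i)² over the mini-batch:
        grad = 2.0 / batch_size * Xb.T @ (Xb @ theta - yb)
        theta -= eta * grad
    print("distance to theta_true:", np.linalg.norm(theta - theta_true))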
Deep nets are a particularly successful class of such functions f_θ. A deep net
with d layers is described by layer matrices A_1, A_2, . . . , A_d and a
function σ : ℝ → ℝ called the nonlinearity. The most popular nonlinearity σ
these days is the rectified linear unit (ReLU), ReLU_b(x) = max{0, x − b}. Here b
is called the bias, and it is also a parameter of the network, together with the
A_i’s. Defining y^0 = x^0, this net computes y^1, y^2, . . . , y^d where y^{i+1} = σ(A_i y^i).
Here σ(z) denotes the vector obtained by applying σ to each coordinate of z.
Also we are assuming that the dimensions of the y^i’s and A_i’s match so that the
matrix-vector products are well-defined. Each coordinate of a computed vector
y^i is referred to as a node of the net, and each entry of one of the A_i’s is referred
to as an edge. The output of the net is y^d. The size of the net is the number of
nodes in it. The number of parameters is the number of edges plus the number
of nodes.
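A minimal sketch of this computation, with randomly chosen (untrained) layer matrices and illustrative layer widths; the biases enter through ReLU_b exactly as in the definition above.

    import numpy as np

    rng = np.random.default_rng(2)

    def relu(z, b):
        """Coordinate-wise ReLU_b(z) = max{0, z - b}; b is the vector of biases."""
        return np.maximum(0.0, z - b)

    # Illustrative layer widths for a depth-3 net mapping R^8 -> R^4 -> R^4 -> R^2.
    widths = [8, 4, 4, 2]
    A = [rng.standard_normal((widths[i + 1], widths[i])) for i in range(3)]
    b = [rng.standard_normal(widths[i + 1]) for i in range(3)]

    def forward(x0):
        """Computes y^1, ..., y^d via y^{i+1} = sigma(A_i y^i); returns y^d."""
        y = x0
        for Ai, bi in zip(A, b):
            y = relu(Ai @ y, bi)
        return y

    x0 = rng.standard_normal(8)
    print("network output y^d:", forward(x0))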
A deep net thus defines an input-output behavior, mapping the input vector
x^0 to the output vector y^d = f_{A_1, A_2, ..., A_d, b}(x^0), where the A_i’s are the layer matrices
and b is the vector of all bias values at the nodes. Thus this model can be
used to do supervised learning, where the trainable parameters are the matrices
and the biases. (An important subcase of a deep net is a convolutional deep
net, where the matrices A_i have a specific compact representation whereby the
same weight is reused in a fixed pattern across the input. These are easier to
train in practice especially on data such as images which have patterns that are
well-represented by such nets. We will ignore convolution in this survey.)
How does depth help in deep nets? While a net with a single hidden layer
(i.e., depth 2) can in principle express any function computed by a net with more
layers, doing so may come at a cost of requiring vastly more nodes [ES16, Tel16].
Training such a vast net would be computationally infeasible. Thus increasing
depth allows a more succinct net to do interesting classification tasks.
To train d-layer deep nets for supervised learning using the above-mentioned
Empirical Risk Minimization paradigm, we need to solve an optimization prob-
lem that solves for the matrices A1 , A2 , . . . , Ad and the bias vector ~b. Writing
out the expression for Empirical Risk we find it to be nonconvex in the vari-
ables. Nevertheless, we can plough ahead and try to solve it using some variant
of gradient descent.
The gradient of this objective with respect to all the parameters can be computed
by the backpropagation algorithm, using a number of operations that is linear in
the number of parameters. This is a crucial saving that enables deep learning to
get off the ground, so to speak. An elementary exposition of backpropagation and
its variants appears in [AM16].
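The text does not give the backpropagation equations, so the following is a minimal illustrative sketch for a two-layer ReLU net with ℓ2 loss: one forward pass, one backward pass applying the chain rule layer by layer, and a finite-difference check of a single gradient entry. The architecture and data are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(3)

    # A tiny two-layer net out = A2 · relu(A1 · x) with squared loss.
    d_in, d_hid, d_out = 5, 7, 3
    A1 = rng.standard_normal((d_hid, d_in))
    A2 = rng.standard_normal((d_out, d_hid))
    x = rng.standard_normal(d_in)
    y = rng.standard_normal(d_out)

    def loss_and_grads(A1, A2, x, y):
        # Forward pass.
        z1 = A1 @ x                  # pre-activation of the hidden layer
        h = np.maximum(0.0, z1)      # ReLU
        out = A2 @ h
        loss = np.sum((out - y) ** 2)
        # Backward pass: chain rule, layer by layer.
        d_out_vec = 2.0 * (out - y)          # d loss / d out
        gA2 = np.outer(d_out_vec, h)         # d loss / d A2
        d_h = A2.T @ d_out_vec               # d loss / d h
        d_z1 = d_h * (z1 > 0)                # ReLU passes gradient only where active
        gA1 = np.outer(d_z1, x)              # d loss / d A1
        return loss, gA1, gA2

    loss, gA1, gA2 = loss_and_grads(A1, A2, x, y)

    # Sanity check of one entry of gA1 against a finite-difference estimate.
    eps = 1e-6
    A1p = A1.copy()
    A1p[0, 0] += eps
    numeric = (loss_and_grads(A1p, A2, x, y)[0] - loss) / eps
    print("backprop:", gA1[0, 0], " numeric:", numeric)

The work in the backward pass is, as the text notes, proportional to the number of entries in the matrices, i.e., to the number of parameters.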
What fueled deep learning’s rise? While the basic ingredients of deep
learning were known for several decades, a confluence of factors around 2011 led
to its rapid progress and adoption. The first was availability of large labeled
datasets. Datasets for training image recognition software used to be created in
academia, and it was just not feasible for a small academic team to hand-label a
very large number of images. Starting a decade ago, researchers could use crowd-
sourcing to create datasets containing millions of humanly-labeled images, such
as ImageNet [DDS+ 09]. The second factor was availability of extremely fast
Graphical Processing Units (GPUs) that brought the power of supercomputers
to grad student desktops and fed a wave of experimentation that led to deep
learning’s resurgence. The third factor is developments in the theory of opti-
mization for machine learning. The new generation of researchers understood
notions such as regularization and acceleration and were able to employ them
effectively, as well as design new ideas such as batch normalization, dropout,
AdaGrad, Adam, etc., to improve optimization: specifically, what things to
try when training a large net fails initially.
Finally, enormous corporate interest in applications of deep learning has led to an
enormous research effort in industry as well.
3 Unsupervised learning
The techniques discussed thus far can train machines to do classification tasks
where the output is a scalar (or small number of scalars) and there is plentiful
training data that has been labeled by humans. But this captures only a small
part of what we humans consider as learning. One suspects that a big part of
our learning is unsupervised, whereby we passively observe the world around us
and notice patterns in it. When we see a new animal or bird while visiting a
new continent, we do not need to be told its name to already be able to describe
it, and relate it to animals we’ve seen in the past. Efforts to endow machines
with such capabilities have not been as successful.
Viewed from a distance, all methods for unsupervised learning try to for-
malize a notion of a “high level” descriptor of the data. If the training datapoints are
x^1, x^2, . . ., one assumes that each has an implicit (i.e., unknown) high-level
descriptor h^1, h^2, . . .. To give an (advanced) example, x^i could be a pixel-level
description of a photo of an unknown bird, and h^i could say in some form “white
bird with long legs and long beak.” Clearly, each h^i corresponds to multiple (even
infinitely many) images and, conversely, an image can have multiple high-level
descriptions. Methods for unsupervised learning allow for this possibility.
They define some (possibly loose) way to go from x^i to h^i and vice versa. The
following is a non-exhaustive list of ideas that have been tried for many years.
One classical idea is to fit a parametric probability distribution p_θ to the data.
Assuming the datapoints are independent samples from the distribution, this amounts to

    argmax_θ  Π_i p_θ(x^i).        (8)
This is very analogous to the Empirical Risk Minimization paradigm mentioned
above.
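As an illustration of (8) in the simplest possible setting, the sketch below fits a one-parameter Gaussian family p_θ by gradient ascent on the log-likelihood; the family, step size, and number of iterations are illustrative, and real unsupervised models are far richer.

    import numpy as np

    rng = np.random.default_rng(4)

    # Samples x^1, ..., x^n from an unknown distribution; here a Gaussian.
    x = rng.normal(loc=3.0, scale=2.0, size=500)

    def log_likelihood(mu, sigma):
        """log of the product in (8) for the Gaussian family p_θ with θ = (μ, σ)."""
        return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                      - (x - mu) ** 2 / (2 * sigma**2))

    # Maximize (8) by gradient ascent on the log-likelihood in μ (σ held fixed
    # for simplicity); the closed-form answer here is just the sample mean.
    mu, sigma, eta = 0.0, 2.0, 0.001
    for _ in range(500):
        grad_mu = np.sum(x - mu) / sigma**2
        mu += eta * grad_mu
    print("gradient ascent mu:", mu, " sample mean:", x.mean(),
          " log-likelihood:", log_likelihood(mu, sigma))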
Another recent idea is to use a deep net itself to transform a random seed vector
into a sample from the distribution that we are trying to learn. This model is trained
using a set of samples from the target distribution D (for example, real-life images).
Thus the deep net implicitly defines a probability distribution U, which we
are trying to make close to D. This technically is a subcase of the setting in
Section 3.2, and the main idea in training is to do some form of gradient descent
on the objective (9). Some notable notions in this line of work include Restricted
Boltzmann Machines [HS06], Variational Autoencoders [KW14], and Generative
Adversarial Nets [GPAM+15].
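A minimal sketch of how a deep net defines a distribution implicitly: pushing random seed vectors through a net yields samples from U. The net below is untrained and randomly initialized, the architecture is illustrative, and the training loop that would bring U close to D is not shown.

    import numpy as np

    rng = np.random.default_rng(5)

    # A (random, untrained) deep net used as a generator: it maps a random seed
    # vector z to a point in the data space, thereby implicitly defining a
    # distribution U over outputs.
    widths = [16, 32, 32, 2]                  # illustrative layer widths
    A = [rng.standard_normal((widths[i + 1], widths[i])) * 0.5 for i in range(3)]

    def generate(n_samples):
        z = rng.standard_normal((n_samples, widths[0]))    # random seeds
        y = z
        for Ai in A[:-1]:
            y = np.maximum(0.0, y @ Ai.T)                  # ReLU layers
        return y @ A[-1].T                                 # linear output layer

    samples = generate(1000)    # 1000 draws from the implicit distribution U
    print("mean of U:", samples.mean(axis=0))
    print("covariance of U:\n", np.cov(samples.T))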
4 Reinforcement learning
Reinforcement learning concerns design of autonomous agents that take a se-
quence (potentially of unbounded length) of actions. For example, a self-driving
car has to take dozens of actions every second and maintain a safe course
on the road. Such an agent may be trained for a long time in various ways, but
once trained it has to be autonomous. Another setting where similar issues arise
is in playing a complicated game like Chess or Go, where machines now outplay
humans.
To formulate the goals of such learning, let’s identify key aspects of such a
system. (a) It needs to maintain some state at every time step, to allow it to
store relevant information from previous steps (e.g., current speed, direction,
separation from nearby vehicles) that will be needed in future steps. We denote
the set of all possible states by S. (b) There is uncertainty in every measurement
and action, which will be modeled via probabilities. (c) In each state the agent
has the choice of some actions. Let A denote the set of possible actions. When
the agent takes action a ∈ A in a state, it makes a probabilistic transition
to another state. (d) The agent moves from state to state as follows. Upon
reaching a state, it takes an action, which causes it to transition probabilistically
to another state, and in the process get some internal reward. This reward is
its “internal motivation,” so to speak. For example, the reward function for a self-
driving car may be a simple function of the distances from the nearest vehicles in
all four directions. The agent is trying to maximize this reward, as formalized
later.
Similar frameworks have been well-studied in the past century in fields such
as control theory, finance, economic theory, operations research, etc. In machine
learning the above framework is called a Markov Decision Process (MDP). As
sketched above, it consists of the following components: a finite set of states S; a
set of actions A (each action can be taken in each state); a probabilistic transition
function that gives for each pair of states (s, s′) and action a a probability
p(s, a, s′) of transitioning to s′ when action a is taken in state s (for all s, a it
satisfies Σ_{s′} p(s, a, s′) = 1); and a reward function that gives for each pair of
states (s, s′) and action a a reward r(s, a, s′) which is obtained when action
a is taken in state s followed by a transition to state s′.
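A minimal sketch of these components for a tiny hypothetical MDP with two states and two actions, together with a simulation of the random walk induced by a fixed policy (as described in the next paragraph); the transition probabilities and rewards are made up for illustration, and the average per-step reward printed here is only a convenient proxy for the discounted objective formalized below.

    import numpy as np

    rng = np.random.default_rng(6)

    # A tiny hypothetical MDP: 2 states, 2 actions.
    # p[s, a, s'] = probability of moving to s' when action a is taken in state s.
    p = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])
    # r[s, a, s'] = reward obtained on that transition.
    r = np.array([[[1.0, 0.0], [0.0, 2.0]],
                  [[0.5, 0.5], [0.0, 1.0]]])

    def run_policy(pi, s0=0, steps=10_000):
        """Simulates the Markov chain induced by the policy pi (state -> action)
        and returns the average per-step reward along the trajectory."""
        s, total = s0, 0.0
        for _ in range(steps):
            a = pi[s]
            s_next = rng.choice(2, p=p[s, a])
            total += r[s, a, s_next]
            s = s_next
        return total / steps

    print("average reward of policy [0, 1]:", run_policy(np.array([0, 1])))
    print("average reward of policy [1, 1]:", run_policy(np.array([1, 1])))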
The goal of the learner is to identify a policy π, which maps states to actions.
Once an agent decides a policy π : S → A, the MDP turns effectively into a
Markov chain, where p(s, π(s), s′) is the probability of transitioning to s′ at the
next step if the agent is currently at state s. Thus if it is started in a state s_0,
the agent’s trajectory is a random sample from the distribution of random walks
starting from s_0. It is customary to assume for convenience that this Markov
chain is ergodic. Thus if s_0, s_1, s_2, . . . are random variables listing an infinite
sequence of states that are visited during a random walk starting from s_0, then
the expected reward is

    E[ Σ_{i=0}^∞ R(s_i, π(s_i), s_{i+1}) ].
In general this sum need not be finite, so it is customary to discount the reward
obtained at the ith step by a factor γ^i for some fixed γ ∈ (0, 1), which ensures that
the expected discounted reward stays finite. (The discounting idea is borrowed from
economics, where this is a formalization of the familiar human instinct to treat a
bird in hand as better than two in the bush.) The policy is optimum if this
discounted reward is optimum
for every choice of s_0. The optimum policy can be computed using dynamic
programming or linear programming in time that is a fixed polynomial in the
number of states.
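A minimal sketch of this dynamic-programming computation (value iteration) for the same kind of tiny MDP, using the discounted objective; the discount factor γ, the number of iterations, and the MDP itself are illustrative.

    import numpy as np

    # Same kind of tiny MDP as before: p[s, a, s'] and r[s, a, s'].
    p = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])
    r = np.array([[[1.0, 0.0], [0.0, 2.0]],
                  [[0.5, 0.5], [0.0, 1.0]]])
    gamma = 0.95                     # discount factor, 0 < gamma < 1

    n_states = p.shape[0]
    V = np.zeros(n_states)           # current estimate of each state's value

    # Value iteration: repeatedly apply the Bellman optimality update
    #   V(s) <- max_a sum_{s'} p(s,a,s') [ r(s,a,s') + gamma * V(s') ].
    for _ in range(1000):
        Q = np.einsum('saj,saj->sa', p, r + gamma * V)   # Q[s, a]
        V = Q.max(axis=1)

    policy = Q.argmax(axis=1)        # optimal action in each state
    print("optimal values:", V)
    print("optimal policy:", policy)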
However, in practice today the set of states is often very large, or even
infinite. For example, perhaps a state is a vector in ℝ^d and an action is a vector
in ℝ^k, which makes a policy a function from ℝ^d to ℝ^k. Now there is no known
efficient algorithm for finding an optimum policy, and in fact the task is known
to be NP-hard. In practice, various heuristics are used, such as policy iteration
and value iteration, where the policy being computed is represented implicitly
via a suitable representation, often a deep net. Usually the machine does not
know the underlying MDP and has to learn it while coming up with the policy.
For a detailed introduction see [SB98]. Providing theoretical support for this
heuristic work is an important open problem, since obvious ways to formalize
it run into NP-hard problems. A start would be to formalize what it means for
training to generalize here, since the above algorithms such as policy iteration
do an exploration to progressively improve the policy, which takes us far afield
from the independent sample framework utilized in our treatment of supervised
learning in Section 1.
We note that the above framework can be changed in various ways to pro-
vide other well-studied frameworks that we will not describe here, such as on-
line computation, bandit optimization, etc. These capture less general types
of sequential decision-making, which retain aspects of classical optimization by
restricting attention to convex functions. For an introduction see [Haz].
Acknowledgements
Sanjeev Arora’s work is supported by the NSF, ONR, Simons Foundation,
Schmidt Foundation, SRC, Yahoo Research and Mozilla Foundation. Thanks
to Mark Goresky, Avi Wigderson and Yi Zhang for useful feedback on the
manuscript.
References
[AGH+14] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and
Matus Telgarsky. Tensor decompositions for learning latent variable models.
Journal of Machine Learning Research, 15:2773–2832, 2014.
http://jmlr.org/papers/v15/anandkumar14b.html.
[ALL+16] Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej
Risteski. A latent variable model approach to PMI-based word embeddings.
Transactions of the Association for Computational Linguistics, pages 385–399, 2016.
[HS06] Geoff Hinton and Ruslan Salakhutdinov. Reducing the dimension-
ality of data with neural networks. Science, pages 504–507, 2006.
[HTF09] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements
of Statistical Learning. Springer Verlag, 2009.