Week 5 Optimisation

Introduction to Applied Machine Learning

IAML: Optimization

Nigel Goddard
School of Informatics

Semester 1

1 / 24
Outline

I Why we use optimization in machine learning
I The general optimization problem
I Gradient descent
I Problems with gradient descent
I Batch versus online
I Second-order methods
I Constrained optimization

Many illustrations, text, and general ideas in these slides are taken from Sam Roweis (1972–2010).

2 / 24
Why Optimization

I A main idea in machine learning is to convert the learning
  problem into a continuous optimization problem.
I Examples: Linear regression, logistic regression (we have
  seen), neural networks, SVMs (we will see these later)
I One way to do this is maximum likelihood

$$\ell(w) = \log p(y_1, x_1, y_2, x_2, \ldots, y_n, x_n \mid w) = \log \prod_{i=1}^{n} p(y_i, x_i \mid w) = \sum_{i=1}^{n} \log p(y_i, x_i \mid w)$$

I Example: Linear regression

3 / 24
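To make the maximum likelihood recipe concrete, here is a minimal sketch (not from the slides; the data and names are illustrative) showing that for linear regression with Gaussian noise of known variance, the negative log likelihood is just a rescaled squared-error function, so both objectives have the same minimizer:

```python
import numpy as np

# Illustrative sketch: Gaussian-noise linear regression, where
# maximizing the log likelihood == minimizing squared error.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # 100 instances, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)  # targets with Gaussian noise

def neg_log_likelihood(w, X, y, sigma=0.1):
    """Negative Gaussian log likelihood, dropping w-independent constants."""
    residuals = y - X @ w
    return np.sum(residuals ** 2) / (2 * sigma ** 2)

def squared_error(w, X, y):
    return np.sum((y - X @ w) ** 2)

# The two objectives differ only by the positive constant 1/(2*sigma^2),
# so they have exactly the same minimizer w*.
w = rng.normal(size=3)
print(neg_log_likelihood(w, X, y) / squared_error(w, X, y))  # 50.0 for any w
```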
Error Surfaces and Weight Space

I End result: an “error function” E(w) which we want to
  minimize.
I E(w) can be, e.g., the negative log likelihood or the
  negative log posterior.
I Consider a fixed training set; think in weight (not input)
  space. At each setting of the weights there is some error
  (given the fixed training set): this defines an error surface
  in weight space.
I Learning == descending the error surface.
I If the data are IID, the error function E is a sum of error
  functions Ei, one per data point.

[Figure: error surfaces E(w) in one dimension (over w) and in
two dimensions (over weights wi and wj).]
4 / 24
Role of Smoothness
If E is completely unconstrained, minimization is impossible.

[Figure: an arbitrary, erratic error surface E(w) over w.]

All we could do is search through all possible values of w.

Key idea: If E is continuous, then measuring E(w) gives
information about E at many nearby values.
5 / 24
Role of Derivatives

I If we wiggle wk and keep everything else the same, does
  the error get better or worse?
I Calculus has an answer to exactly this question: ∂E/∂wk
I So: use a differentiable cost function E and compute the
  partial derivative with respect to each parameter.
I The vector of partial derivatives is called the gradient of
  the error. It is written ∇E = (∂E/∂w1, ∂E/∂w2, ..., ∂E/∂wn).
  Alternate notation: ∂E/∂w.
I It points in the direction of steepest error ascent in weight
  space, so its negation −∇E is the direction of steepest
  descent.
I Three crucial questions:
I How do we compute the gradient ∇E efficiently?
I Once we have the gradient, how do we minimize the error?
I Where will we end up in weight space?

6 / 24
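As a concrete illustration of “wiggling wk” (a sketch; the example function and all names are invented for the demo), the partial derivatives can be approximated by central finite differences and checked against an analytic gradient:

```python
import numpy as np

# Illustrative sketch: approximate the gradient of E at w by wiggling
# each coordinate wk in turn (central finite differences). A standard
# way to sanity-check an analytic gradient before optimizing.

def E(w):
    # An example error function: E(w) = w0^2 + 10*w1^2.
    return w[0] ** 2 + 10 * w[1] ** 2

def grad_E(w):
    # Its analytic gradient.
    return np.array([2 * w[0], 20 * w[1]])

def numerical_gradient(f, w, eps=1e-6):
    g = np.zeros_like(w)
    for k in range(len(w)):
        d = np.zeros_like(w)
        d[k] = eps
        g[k] = (f(w + d) - f(w - d)) / (2 * eps)  # wiggle wk only
    return g

w = np.array([1.0, -0.5])
print(grad_E(w))                 # [  2. -10.]
print(numerical_gradient(E, w))  # ≈ [  2. -10.]
```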
Numerical Optimization Algorithms

I Numerical optimization algorithms try to solve the general
  problem

      $$\min_{w} E(w)$$

I Most commonly, a numerical optimization procedure takes
  two inputs:
  I A procedure that computes E(w)
  I A procedure that computes the partial derivatives ∂E/∂wj
I (Aside: Some algorithms use less information, i.e., they don’t
  use gradients. Some use more information, i.e., higher-order
  derivatives. We won’t go into these algorithms in the
  course.)

7 / 24
Optimization Algorithm Cartoon
I Basically, numerical optimization algorithms are iterative.
  They generate a sequence of iterates

      w0, w1, w2, ...

  along the way computing the values E(w0), E(w1), E(w2), ...
  and the gradients ∇E(w0), ∇E(w1), ∇E(w2), ...
I The basic optimization algorithm is:

      initialize w
      while E(w) is unacceptably high
          calculate g = ∇E
          compute direction d from w, E(w), g
          (can use previous gradients as well...)
          w ← w − η d
      end while
      return w
8 / 24
A Choice of Direction

I The simplest choice of d is the current gradient ∇E.
I Stepping along −d = −∇E is locally the steepest descent
  direction.
I (Technically, the reason for this choice is Taylor’s theorem
  from calculus.)

9 / 24
Gradient Descent

I Simple gradient descent algorithm:

      initialize w
      while E(w) is unacceptably high
          calculate g ← ∂E/∂w
          w ← w − η g
      end while
      return w
I η is known as the step size (sometimes called learning
rate)
I We must choose η > 0.
I η too small → too slow
I η too large → instability

10 / 24
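A minimal Python sketch of this loop (the stopping rule `tol`/`max_iters` is an illustrative choice; the slide leaves “unacceptably high” unspecified):

```python
import numpy as np

# Minimal sketch of the gradient descent loop above.

def gradient_descent(E, grad_E, w0, eta=0.1, tol=1e-8, max_iters=10_000):
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        g = grad_E(w)                   # g ← ∂E/∂w
        if np.linalg.norm(g) < tol:     # stand-in for "E(w) acceptably low"
            break
        w = w - eta * g                 # w ← w − η g
    return w

# Example: minimize E(w) = w² starting from w0 = 1.0.
print(gradient_descent(lambda w: w**2, lambda w: 2*w, w0=1.0))  # ≈ 0
```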
Effect of Step Size

Goal: Minimize E(w) = w²

I Take η = 0.1. Works well.

      w0 = 1.0
      w1 = w0 − 0.1 · 2w0 = 0.8
      w2 = w1 − 0.1 · 2w1 = 0.64
      w3 = w2 − 0.1 · 2w2 = 0.512
      ···
      w25 = 0.0047

[Figure: the parabola E(w) = w² for w ∈ [−3, 3].]
11 / 24
Effect of Step Size

Goal: Minimize E(w) = w²

I Take η = 1.1. Not so good. If you step too far, you can leap
  over the region that contains the minimum.

      w0 = 1.0
      w1 = w0 − 1.1 · 2w0 = −1.2
      w2 = w1 − 1.1 · 2w1 = 1.44
      w3 = w2 − 1.1 · 2w2 = −1.72
      ···
      w25 = 79.50

[Figure: the parabola E(w) = w² for w ∈ [−3, 3].]

I Finally, take η = 0.000001. What happens here?
12 / 24
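A quick numerical check of the three step sizes (illustrative; only the qualitative behaviour matters, since the exact iterate values depend on how the steps are counted):

```python
# Check the three step sizes on E(w) = w², whose gradient is 2w.
for eta in (0.1, 1.1, 1e-6):
    w = 1.0
    for _ in range(25):
        w = w - eta * 2 * w        # w ← w − η ∇E(w)
    print(eta, w)
# eta = 0.1:  w shrinks geometrically toward 0     (works well)
# eta = 1.1:  |w| grows as it oscillates across 0  (diverges)
# eta = 1e-6: w ≈ 0.99995, i.e. we barely moved    (far too slow)
```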
“Bold Driver” Gradient Descent

I A simple heuristic for choosing η which you can use if you’re
  desperate:

      initialize w, η
      e ← E(w); g ← ∇E(w)
      while η > 0
          w1 ← w − η g
          e1 ← E(w1); g1 ← ∇E(w1)
          if e1 ≥ e
              η ← η/2
          else
              η ← 1.01 η; w ← w1; g ← g1; e ← e1
          end if
      end while
      return w
I Finds a local minimum of E.

13 / 24
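A sketch of the bold driver heuristic in Python (illustrative; the `min_eta` threshold replaces the slide’s “while η > 0”, which in exact arithmetic would never terminate):

```python
import numpy as np

# Sketch of the "bold driver" heuristic above.

def bold_driver(E, grad_E, w0, eta=0.1, min_eta=1e-12):
    w = np.asarray(w0, dtype=float)
    e, g = E(w), grad_E(w)
    while eta > min_eta:
        w1 = w - eta * g
        e1, g1 = E(w1), grad_E(w1)
        if e1 >= e:
            eta /= 2            # step made things worse: retract, halve η
        else:
            eta *= 1.01         # step helped: accept it, grow η slightly
            w, g, e = w1, g1, e1
    return w

w_star = bold_driver(lambda w: (w - 3)**2, lambda w: 2*(w - 3), w0=0.0)
print(w_star)   # ≈ 3, a local (here also global) minimum
```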
Batch vs online
I So far, all the objective functions we have seen look like

      $$E(w; D) = \sum_{i=1}^{n} E_i(w; y_i, x_i)$$

  where D = {(x1, y1), (x2, y2), ..., (xn, yn)} is the training set.
I Each term in the sum depends on only one training instance.
I Example: Logistic regression: Ei(w; yi, xi) = −log p(yi | xi, w).
I The gradient in this case is always

      $$\frac{\partial E}{\partial w} = \sum_{i=1}^{n} \frac{\partial E_i}{\partial w}$$
I The algorithm on slide 10 scans all the training instances
before changing the parameters.
I Seems dumb if we have millions of training instances.
Surely we can get a gradient that is “good enough” from
fewer instances, e.g., a couple of thousand? Or maybe
  even from just one?

14 / 24
Batch vs online

I Batch learning: use all patterns in the training set, and
  update the weights after calculating

      $$\frac{\partial E}{\partial w} = \sum_{i} \frac{\partial E_i}{\partial w}$$

I On-line learning: adapt the weights after each pattern
  presentation, using ∂Ei/∂w
I Batch: allows more powerful optimization methods
I Batch: easier to analyze
I On-line: more feasible for huge or continually growing
  datasets
I On-line: may have the ability to jump over local optima

15 / 24
Algorithms for Batch Gradient Descent

I Here is batch gradient descent:

      initialize w
      while E(w) is unacceptably high
          calculate g ← Σ_{i=1}^{N} ∂Ei/∂w
          w ← w − η g
      end while
      return w

I This is just the algorithm we have seen before. We have just
  “substituted in” the fact that E = Σ_{i=1}^{N} Ei.

16 / 24
Algorithms for Online Gradient Descent

I Here is (a particular type of) online gradient descent
  algorithm:

      initialize w
      while E(w) is unacceptably high
          pick j as a uniform random integer in 1 . . . N
          calculate g ← ∂Ej/∂w
          w ← w − η g
      end while
      return w

I This version is also called “stochastic gradient descent”
  because we pick the training instance randomly.
I There are other variants of online gradient descent.

17 / 24
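A sketch of stochastic gradient descent for a linear model with per-instance squared error Ei = (yi − w·xi)² (the data and names are invented for the demo, not from the slides):

```python
import numpy as np

# Illustrative sketch of stochastic gradient descent.
# grad_Ei returns the gradient of one instance's error Ei = (y - w·x)².

def grad_Ei(w, x, y):
    return -2 * (y - w @ x) * x

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(3)
eta = 0.01
for _ in range(20_000):
    j = rng.integers(len(X))              # pick j uniformly from 1..N
    w = w - eta * grad_Ei(w, X[j], y[j])  # update from a single instance
print(w)   # ≈ [ 1.  -2.   0.5]
```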
Problems With Gradient Descent

I Setting the step size η
I Shallow valleys
I Highly curved error surfaces
I Local minima

18 / 24
Shallow Valleys

I Typical gradient descent can be fooled in several ways, which
  is why more sophisticated methods are used when possible.
I One problem: if the error surface is a long and narrow valley,
  gradient descent goes quickly down the valley walls but very
  slowly along the valley bottom.

[Figure: zig-zag gradient descent path bouncing between the
walls of a long, narrow valley.]

I Gradient descent goes very slowly once it hits the shallow
  valley.
I One hack to deal with this is momentum: update the
  parameters using a combination of the previous update and
  the gradient update

      $$d_t = \beta\, d_{t-1} + (1 - \beta)\, \eta\, \nabla E(w_t), \qquad w_{t+1} = w_t - d_t$$

I Usually β is quite high, about 0.95. Physically, this is like
  giving momentum to our weights.
I Now you have to set both η and β. This can be difficult and
  irritating.
19 / 24
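A sketch of the momentum update on a long, narrow quadratic valley (illustrative; the valley E(w) = w0² + 100·w1² and all constants are invented for the demo):

```python
import numpy as np

# Sketch of gradient descent with momentum, as on the slide.
# E(w) = w0² + 100·w1² is a narrow valley: steep in w1, flat in w0.

def grad_E(w):
    return np.array([2 * w[0], 200 * w[1]])

eta, beta = 0.009, 0.95
w = np.array([1.0, 1.0])
d = np.zeros(2)
for _ in range(500):
    d = beta * d + (1 - beta) * eta * grad_E(w)  # d_t = β d_{t−1} + (1−β) η ∇E(w_t)
    w = w - d                                    # w_{t+1} = w_t − d_t
print(w)   # ≈ [0, 0]: momentum damps the side-to-side bouncing
```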
Curved Error Surfaces

I A second problem with gradient descent is that the gradient
  might not point towards the optimum. This is because of
  curvature: the error surface may be curved differently in
  different directions.
I Note: the gradient is the locally steepest direction. It need
  not point directly toward the nearest local minimum.

[Figure: elliptical error contours; the gradient dE/dW at a point
is perpendicular to the contour and does not point at the
minimum.]

I The local geometry of curvature is measured by the Hessian
  matrix of second derivatives:

      $$H_{ij} = \frac{\partial^2 E}{\partial w_i \, \partial w_j}$$

I Eigenvectors/eigenvalues of the Hessian describe the
  directions of principal curvature and the amount of curvature
  in each direction. Near a local minimum, the Hessian is
  positive definite.
I The maximum sensible step size is 2/λmax, and the rate of
  convergence depends on (1 − 2 λmin/λmax).
20 / 24
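An illustrative check of the step-size claim (the quadratic E and its Hessian are invented for the demo): gradient descent on a quadratic diverges once η exceeds 2/λmax.

```python
import numpy as np

# For the quadratic E(w) = w0² + 100·w1² the Hessian is constant.

H = np.array([[2.0, 0.0],
              [0.0, 200.0]])           # Hij = ∂²E/∂wi∂wj
lam_max = np.linalg.eigvalsh(H).max()  # 200
print(2 / lam_max)                     # 0.01: the maximum sensible step size

def run(eta, steps=200):
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w - eta * (H @ w)          # ∇E(w) = H w for a quadratic
    return w

print(run(0.009))  # converges: w ≈ [0.03, 0] (slow along the flat direction)
print(run(0.011))  # blows up along the most curved direction
```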
Local Minima
I If you follow the gradient, where will you end up? Once you
  hit a local minimum, the gradient is 0, so you stop.
I Unfortunately, many differentiable error functions are not
  unimodal. When using gradient descent we can get stuck in
  local minima. Where we end up depends on where we start.

[Figure: a non-convex error surface over parameter space;
gradient descent paths from different starting points w(t) end
in different minima.]

I Some very nice error functions (e.g., linear least squares,
  logistic regression, lasso) are convex, and thus have a unique
  (global) minimum. Convexity means that the error at any
  convex combination of weights is no greater than the same
  convex combination of the original errors; it implies that any
  local minimum is global.
I But most settings do not lead to convex optimization
  problems.
I There is no great solution to this problem. It is a
  fundamental one. Usually, the best you can do is rerun the
  optimizer multiple times from different random starting
  points.

21 / 24
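A sketch of the rerun-from-random-starts heuristic (the non-convex E(w) = w⁴ − 3w² + w is invented for the demo; it has two local minima of different depths):

```python
import numpy as np

# Run gradient descent from several random starts; keep the best result.

E = lambda w: w**4 - 3*w**2 + w
grad_E = lambda w: 4*w**3 - 6*w + 1

rng = np.random.default_rng(0)
best_w, best_e = None, np.inf
for _ in range(10):
    w = rng.uniform(-2, 2)           # random starting point
    for _ in range(2000):
        w = w - 0.01 * grad_E(w)     # plain gradient descent
    if E(w) < best_e:                # keep the lowest minimum found
        best_w, best_e = w, E(w)
print(best_w, best_e)  # the deeper of the two minima (w ≈ −1.3)
```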
Advanced Topics That We Will Not Cover (Part I)

I Some of these issues (shallow valleys, curved error surfaces)
  can be fixed.
I Some of the fixes are second-order methods, like Newton’s
  method, that use the second derivatives.
I There are also fancy first-order methods like quasi-Newton
  methods (e.g., limited-memory BFGS) and conjugate
  gradient.
I These are the state-of-the-art methods for logistic regression
  (as long as there are not too many data points).
I We will not discuss these methods in the course.
I Other issues (like local minima) cannot be easily fixed.

22 / 24
Advanced Topics That We Will Not Cover (Part II)

I Sometimes the optimization problem has constraints.
I Example: Observe the points {0.5, 1.0} from a Gaussian
with known mean µ = 0.8 and unknown standard deviation
σ. Want to estimate σ by maximum likelihood.
I Constraint: σ must be positive.
I In this case, to find the maximum likelihood solution, the
  optimization problem is

      $$\max_{\sigma} \; \sum_{i=1}^{2} \left( -\log \sigma - \frac{(x_i - \mu)^2}{2\sigma^2} \right) \quad \text{subject to } \sigma > 0$$

I There are ways to solve this (in this case: can be done
analytically). We will not discuss them in this course.
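  For concreteness, here is the analytic solution (standard
  maximum likelihood algebra, not worked through on the slide):
  setting the derivative with respect to σ to zero gives

      $$\frac{\partial}{\partial \sigma} \sum_{i=1}^{2} \left( -\log \sigma - \frac{(x_i - \mu)^2}{2\sigma^2} \right) = -\frac{2}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{2} (x_i - \mu)^2 = 0 \;\Rightarrow\; \hat\sigma^2 = \frac{(0.5 - 0.8)^2 + (1.0 - 0.8)^2}{2} = 0.065,$$

  so σ̂ = √0.065 ≈ 0.255, which satisfies the constraint σ > 0.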

23 / 24
Summary

I Complex mathematical area. Do not implement your own
  optimization algorithms if you can help it!
I Stuff you should understand:
I How and why we convert learning problems into
optimization problems
I Modularity between modelling and optimization
I Gradient descent
I Why gradient descent can run into problems
I Especially local minima
I Methods of choice: Fancy first-order methods (e.g.,
quasi-Newton, CG) for moderate amounts of data.
Stochastic gradient for large amounts of data.

24 / 24
