Week 5 Optimisation
Nigel Goddard
School of Informatics
Semester 1
1 / 24
Outline
Many illustrations, text, and general ideas in these slides are taken from Sam Roweis (1972-2010).
2 / 24
Why Optimization
3 / 24
– absolute weight decay (lasso) ⇔ Laplace prior (decay = 1/Λ)
– smoothing on multinomial parameters ⇔ Dirichlet prior
– smoothing on covariance matrices ⇔ Wishart prior
I End result: an “error function” E(w) which we want to minimize.
I e.g., E(w) can be the negative log likelihood or the negative log posterior.
Error Surfaces and Weight Space
I Consider a fixed training set; think in weight (not input) space.
I At each setting of the weights there is some error (given the fixed training set): this defines an error surface in weight space.
I Learning == descending the error surface.
I Notice: if the data are IID, the error function E is a sum of error functions Ei, one per data point.
[Figures: the error surface E plotted over a single weight w, and over a pair of weights wi, wj.]
4 / 24
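To make the last point concrete, here is a minimal Python sketch (not from the slides) of an error function built as a sum of per-data-point errors, using squared error for a linear model; the names E_i and E and the toy data are illustrative.

    import numpy as np

    # Illustrative sketch: for IID data, E(w) = sum_i Ei(w; yi, xi).
    # Here Ei is the squared error of a linear model on one data point.
    def E_i(w, x_i, y_i):
        return 0.5 * (y_i - x_i @ w) ** 2          # error on a single data point

    def E(w, X, y):
        return sum(E_i(w, x_i, y_i) for x_i, y_i in zip(X, y))   # sum over the training set

    X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # toy training inputs
    y = np.array([1.0, 2.0, 3.0])                         # toy targets
    print(E(np.zeros(2), X, y))                           # 7.0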
Role of Smoothness
If E is completely unconstrained, minimization is impossible.
[Figure: an error surface E(w) plotted against w.]
All we could do is search through all possible values of w.
6 / 24
Numerical Optimization Algorithms
7 / 24
Optimization Algorithm Cartoon
I Basically, numerical optimization algorithms are iterative. They generate a sequence of points, together with the error and gradient at each one:
    w0 , w1 , w2 , . . .
    E(w0 ), E(w1 ), E(w2 ), . . .
    ∇E(w0 ), ∇E(w1 ), ∇E(w2 ), . . .
I The basic optimization algorithm is:
    initialize w
    while E(w) is unacceptably high
        calculate g = ∇E(w)
        compute direction d from w, E(w), g
        (can use previous gradients as well...)
        w ← w − η d
    end while
    return w
8 / 24
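As a concrete companion to the pseudocode above, here is a minimal Python sketch, assuming plain gradient descent (direction d = g) and a fixed step size η; the stopping test uses the gradient norm as a stand-in for “E(w) is unacceptably high”, and the function names are illustrative.

    import numpy as np

    def gradient_descent(grad_E, w0, eta=0.1, tol=1e-6, max_iters=1000):
        w = np.asarray(w0, dtype=float)
        for _ in range(max_iters):
            g = grad_E(w)                  # g = ∇E(w)
            if np.linalg.norm(g) < tol:    # stop once the gradient is (almost) zero
                break
            d = g                          # plain gradient descent: the direction is the gradient
            w = w - eta * d                # w ← w − η d
        return w

    # Example: E(w) = w², so ∇E(w) = 2w; the minimum is at w = 0.
    print(gradient_descent(lambda w: 2 * w, w0=[1.0]))   # ≈ [0.]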
A Choice of Direction
9 / 24
Gradient Descent
10 / 24
Effect of Step Size
Goal: Minimize E(w) = w²
I Take η = 0.1. Works well:
    w0 = 1.0
    w1 = w0 − 0.1 · 2w0 = 0.8
    w2 = w1 − 0.1 · 2w1 = 0.64
    w3 = w2 − 0.1 · 2w2 = 0.512
    ···
    w25 ≈ 0.0038
[Figure: E(w) = w² plotted for w between −3 and 3.]
11 / 24
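A quick check of these iterates (my own sketch, not from the slides): for E(w) = w², the update w ← w − η·2w multiplies w by (1 − 2η) each step, so η = 0.1 contracts towards 0, while a too-large step size such as η = 1.1 (an illustrative value) makes the iterates diverge.

    def run(eta, steps=25, w0=1.0):
        w = w0
        for _ in range(steps):
            w = w - eta * 2 * w        # gradient step on E(w) = w², i.e. w ← (1 − 2η)·w
        return w

    print(run(0.1))    # ≈ 0.0038  (each step multiplies w by 0.8)
    print(run(1.1))    # ≈ -95    (each step multiplies w by -1.2: divergence)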
Effect of Step Size
12 / 24
“Bold Driver” Gradient Descent
13 / 24
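The slides do not spell out the “bold driver” rule here, so the Python sketch below is one common formulation (an assumption, not necessarily the exact rule used in the lecture): grow the step size slightly after every successful step, and shrink it sharply, rejecting the step, whenever the error increases. The factors 1.01 and 0.5 and the function names are illustrative.

    import numpy as np

    def bold_driver(E, grad_E, w0, eta=0.1, grow=1.01, shrink=0.5, max_iters=1000, tol=1e-8):
        w = np.asarray(w0, dtype=float)
        err = E(w)
        for _ in range(max_iters):
            g = grad_E(w)
            if np.linalg.norm(g) < tol:
                break
            w_new = w - eta * g
            err_new = E(w_new)
            if err_new <= err:          # step succeeded: accept it and grow the step size a little
                w, err = w_new, err_new
                eta *= grow
            else:                       # step overshot: reject it and shrink the step size
                eta *= shrink
        return w

    # Example on E(w) = w².
    print(bold_driver(lambda w: float(w @ w), lambda w: 2 * w, w0=[1.0]))   # ≈ [0.]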
Batch vs online
I So far, all the objective functions we have seen look like
    E(w; D) = Σ_{i=1}^n Ei(w; yi, xi),
  where D = {(x1, y1), (x2, y2), . . . , (xn, yn)} is the training set.
I Each term in the sum depends on only one training instance.
I Example: logistic regression: Ei(w; yi, xi) = − log p(yi | xi, w).
I The gradient in this case is always
    ∂E/∂w = Σ_{i=1}^n ∂Ei/∂w.
I The algorithm on slide 10 scans all the training instances before changing the parameters.
I Seems dumb if we have millions of training instances. Surely we can get a gradient that is “good enough” from fewer instances, e.g., a couple of thousand? Or maybe even from just one?
14 / 24
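Below is a minimal Python sketch of that idea: mini-batch (or, with batch_size=1, online) gradient descent estimates the gradient from a small random subset of the data at each update. The details (batch size, step size, function names, toy data) are illustrative assumptions, not the lecture's exact algorithm.

    import numpy as np

    def sgd(grad_Ei, X, y, w0, eta=0.01, batch_size=32, epochs=20, seed=0):
        w = np.asarray(w0, dtype=float)
        rng = np.random.default_rng(seed)
        n = len(y)
        for _ in range(epochs):
            order = rng.permutation(n)                    # visit the data in a random order
            for start in range(0, n, batch_size):
                batch = order[start:start + batch_size]
                # average the per-instance gradients over the mini-batch:
                # a noisy but "good enough" estimate of the full-data gradient
                g = np.mean([grad_Ei(w, X[i], y[i]) for i in batch], axis=0)
                w = w - eta * g
        return w

    # Example: least squares, Ei(w; yi, xi) = ½(yi − xiᵀw)², so ∂Ei/∂w = −(yi − xiᵀw)·xi.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)
    print(sgd(lambda w, xi, yi: -(yi - xi @ w) * xi, X, y, w0=np.zeros(3)))   # ≈ [1., -2., 0.5]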
Batch vs online
    ∂E/∂w = Σ_i ∂Ei/∂w
• When our data is big, computing the exact gradient is expensive.
• This seems wasteful, since the only thing we are going to use the gradient for is to make a small change to the weights and then throw it away and measure it again at the new weights.
15 / 24
Algorithms for Batch Gradient Descent
16 / 24
Algorithms for Online Gradient Descent
17 / 24
Problems With Gradient Descent
18 / 24
Shallow Valleys
I Use a direction that averages the recent gradients:
    dt = β dt−1 + (1 − β) η ∇E(wt)
I Now you have to set both η and β. Can be difficult and irritating.
19 / 24
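A minimal Python sketch of a descent loop using this direction update, assuming the weights are then moved by w ← w − dt (the slide does not show the weight update itself); β = 0.9 and the quadratic “shallow valley” example are illustrative choices.

    import numpy as np

    def momentum_descent(grad_E, w0, eta=0.01, beta=0.9, iters=500):
        w = np.asarray(w0, dtype=float)
        d = np.zeros_like(w)
        for _ in range(iters):
            d = beta * d + (1 - beta) * eta * grad_E(w)   # d_t = β d_{t−1} + (1 − β) η ∇E(w_t)
            w = w - d                                     # assumed update: w ← w − d_t
        return w

    # A shallow valley: E(w) = ½(100·w1² + w2²), steeply curved in w1 but shallow in w2.
    grad = lambda w: np.array([100.0 * w[0], w[1]])
    print(momentum_descent(grad, w0=[1.0, 1.0]))   # heads towards [0, 0]; w2 converges more slowly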
Curved Error Surfaces
• Gradient descent as described only finds a local minimum of the cost.
• The basic algorithm is batch gradient descent, but mini-batch or online updates may be better.
20 / 24
Local Minima
I If you follow the gradient, where will you end up? Once you hit a local minimum, the gradient is 0, so you stop.
I Unfortunately, many error functions, while differentiable, are not unimodal. When using gradient descent we can get stuck in local minima. Where we end up depends on where we start.
21 / 24
Convexity, Local Optima
I Some very nice error functions (e.g. linear least squares, logistic regression, lasso) are convex: the second derivative is never negative, and the error at any convex combination of weight vectors is no greater than the same combination of their errors. This implies that any local minimum is global.
I But most settings do not lead to convex optimization problems.
22 / 24
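A small numerical illustration of “where we end up depends on where we start” (my own example, not from the slides): running plain gradient descent on a non-convex one-dimensional error function from two starting points reaches two different local minima.

    # Illustrative non-convex error: E(w) = w⁴ − 2w² + 0.3w has two local minima,
    # one on each side of the origin, with different error values.
    E      = lambda w: w**4 - 2 * w**2 + 0.3 * w
    grad_E = lambda w: 4 * w**3 - 4 * w + 0.3

    def descend(w, eta=0.01, iters=2000):
        for _ in range(iters):
            w = w - eta * grad_E(w)       # plain gradient descent
        return w

    for w0 in (-1.5, +1.5):               # two different initialisations
        w = descend(w0)
        print(f"start {w0:+.1f} -> local minimum near w = {w:+.3f}, E = {E(w):+.3f}")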
Advanced Topics That We Will Not Cover (Part II)
I There are ways to solve this (in this case: can be done
analytically). We will not discuss them in this course.
23 / 24
Summary
24 / 24