04 Numerical Computation
(Goodfellow 2017)
Roadmap
• Iterative Optimization
(Goodfellow 2017)
Iterative Optimization
• Gradient descent
• Curvature
• Constrained optimization
(Goodfellow 2017)
Gradient Descent
[Figure 4.1: Plot of f(x) = ½x² and its derivative f'(x) = x. For x < 0 we have f'(x) < 0, so we can decrease f by moving rightward; for x > 0 we have f'(x) > 0, so we can decrease f by moving leftward. Gradient descent uses the derivative of a function to follow the function downhill to a minimum.]
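A minimal NumPy sketch of this rule (the starting point and learning rate below are arbitrary choices, not from the slides):

import numpy as np

f = lambda x: 0.5 * x ** 2     # the function from Figure 4.1
grad = lambda x: x             # its derivative f'(x) = x

x = 2.0                        # arbitrary starting point
lr = 0.1                       # arbitrary learning rate
for _ in range(100):
    x = x - lr * grad(x)       # step opposite the sign of the derivative
print(x)                       # close to the minimum at x = 0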
(Goodfellow 2017)
Approximate Optimization
[Plot of f(x) with the annotation: "This local minimum performs nearly as well as the global one, so it is an acceptable halting point."]
[Figure 8.1: Gradient descent often does not arrive at a critical point of any kind. In this example, the gradient norm increases throughout training of a convolutional network, yet training still succeeds: the classification error (right panel) decreases. Both panels are plotted against training time (epochs).]
(Goodfellow 2017)
Deep learning optimization
way of life
• Pure math way of life:
(Goodfellow 2017)
Iterative Optimization
• Gradient descent
• Curvature
• Constrained optimization
(Goodfellow 2017)
Critical Points
[Figure 4.2: Examples of each of the three types of critical points in 1-D: minimum, maximum, and saddle point. A critical point is a point with zero slope. Such a point can be a local minimum, which is lower than the neighboring points; a local maximum, which is higher than the neighboring points; or a saddle point, which has neighbors both higher and lower than itself.]
(Goodfellow 2017)
Saddle Points
[Figure 4.5: A saddle point containing both positive and negative curvature. The function in this example is f(x) = x1^2 - x2^2. Along the axis corresponding to x1, the function curves upward; this axis is an eigenvector of the Hessian and has a positive eigenvalue. Along the axis corresponding to x2, the function curves downward; this direction is an eigenvector of the Hessian with a negative eigenvalue.]
Saddle points attract Newton's method. (Gradient descent escapes; see Appendix C of "Qualitatively Characterizing Neural Network Optimization Problems".)
(Goodfellow 2017)
Curvature
[Figure 4.4: The second derivative determines the curvature of a function. Shown are quadratic functions with negative, zero, and positive curvature. The dashed line indicates the value of the function we would predict from the gradient alone when taking a downhill step.]
(Goodfellow 2017)
Directional Curvature
[Figure 2.3: Effect of eigenvectors and eigenvalues. Multiplying by a matrix A scales space along its eigenvector directions: the unit circle spanned by v(1) and v(2) (left) is stretched by λ1 along v(1) and by λ2 along v(2) (right). In the same way, the eigenvalues of the Hessian determine how strongly a function curves in each eigenvector direction.]
(Goodfellow 2017)
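A small NumPy sketch of the idea behind this slide (the Hessian below is an arbitrary example, not from the deck): the second derivative in a unit direction d is d^T H d, and along an eigenvector it equals the corresponding eigenvalue.

import numpy as np

H = np.array([[2.0, 0.0],
              [0.0, -1.0]])           # example Hessian with mixed curvature

eigvals, eigvecs = np.linalg.eigh(H)   # eigendecomposition of the symmetric matrix

d = eigvecs[:, 0]                      # unit direction along the first eigenvector
print(d @ H @ d, eigvals[0])           # directional curvature equals that eigenvalue

d = np.array([1.0, 1.0]) / np.sqrt(2)  # an arbitrary unit direction
print(d @ H @ d)                       # lies between the smallest and largest eigenvalue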
Predicting optimal step size using Taylor series

The second-order Taylor series expansion approximates f(x) around the current point x^{(0)}:

f(x) \approx f(x^{(0)}) + (x - x^{(0)})^\top g + \frac{1}{2} (x - x^{(0)})^\top H (x - x^{(0)}),   (4.8)

where g is the gradient and H is the Hessian at x^{(0)}. If we use a learning rate \epsilon, the new point x will be given by x^{(0)} - \epsilon g. Substituting this into our approximation, we obtain

f(x^{(0)} - \epsilon g) \approx f(x^{(0)}) - \epsilon g^\top g + \frac{1}{2} \epsilon^2 g^\top H g.   (4.9)

There are three terms here: the original value of the function, the expected improvement due to the slope of the function, and the correction we must apply to account for the curvature of the function. When this last term is too large, the gradient descent step can actually move uphill. When g^\top H g is zero or negative, the Taylor series approximation predicts that increasing \epsilon forever will decrease f forever. In practice, the Taylor series is unlikely to remain accurate for large \epsilon, so we must resort to more heuristic choices of \epsilon in this case. When g^\top H g is positive, solving for the optimal step size that decreases the Taylor series approximation of the function the most yields

\epsilon^* = \frac{g^\top g}{g^\top H g}.   (4.10)

Big gradients speed you up; big eigenvalues slow you down if you align well with their eigenvectors. In the worst case, when g aligns with the eigenvector of H corresponding to the maximal eigenvalue \lambda_{max}, this optimal step size is 1/\lambda_{max}. To the extent that the function we minimize can be approximated well by a quadratic, the eigenvalues of the Hessian thus determine the scale of the learning rate.
(Goodfellow 2017)
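A short NumPy check of equation (4.10); the quadratic f(x) = ½ xᵀHx and the particular H are illustrative choices, not from the slides.

import numpy as np

H = np.array([[4.0, 0.0],
              [0.0, 1.0]])        # Hessian of an example quadratic f(x) = 0.5 x^T H x
x = np.array([1.0, 2.0])          # current point
g = H @ x                         # gradient of the quadratic at x

eps_star = (g @ g) / (g @ H @ g)  # optimal step size from equation (4.10)

f = lambda x: 0.5 * x @ H @ x
print(f(x - eps_star * g))        # lower than f(x); best possible step along -g for this quadratic
print(1.0 / eps_star)             # lies between the smallest and largest eigenvalue of H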
Condition Number

Functions that change rapidly when their inputs are perturbed slightly can be problematic for scientific computation, because rounding errors in the inputs can result in large changes in the output. Consider f(x) = A^{-1} x. When A \in \mathbb{R}^{n \times n} has an eigenvalue decomposition, its condition number is

\max_{i,j} \left| \frac{\lambda_i}{\lambda_j} \right|.   (4.2)

[2-D contour plot over (x1, x2), from "Qualitatively Characterizing Neural Network Optimization Problems".]
(Goodfellow 2017)
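A quick NumPy illustration (the matrix is chosen arbitrarily): the eigenvalue ratio from equation (4.2), compared with np.linalg.cond.

import numpy as np

A = np.array([[100.0, 0.0],
              [0.0, 1.0]])        # example matrix with very different eigenvalues

lam = np.linalg.eigvals(A)
print(np.max(np.abs(lam)) / np.min(np.abs(lam)))  # eigenvalue ratio from equation (4.2)
print(np.linalg.cond(A))                          # agrees for this symmetric matrix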
Iterative Optimization
• Gradient descent
• Curvature
• Constrained optimization
(Goodfellow 2017)
KKT Multipliers

These properties guarantee that no infeasible point can be optimal, and that the optimum within the feasible points is unchanged. To perform constrained maximization, we can construct the generalized Lagrange function of -f(x), which leads to this optimization problem:

\min_x \max_{\lambda} \max_{\alpha, \alpha \ge 0} \; -f(x) + \sum_i \lambda_i g^{(i)}(x) + \sum_j \alpha_j h^{(j)}(x).   (4.19)
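As a tiny worked instance of this machinery (the problem below is an illustrative choice, not from the slides): minimize f(x) = x1^2 + x2^2 subject to the single equality constraint g(x) = x1 + x2 - 1 = 0, with no inequality constraints. The KKT conditions reduce to a linear system.

import numpy as np

# Stationarity: 2*x1 + lam = 0 and 2*x2 + lam = 0; feasibility: x1 + x2 = 1.
# Unknowns are (x1, x2, lam); solve the linear KKT system directly.
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])

x1, x2, lam = np.linalg.solve(K, rhs)
print(x1, x2, lam)   # 0.5 0.5 -1.0: the constrained minimum and its multiplier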
• Iterative Optimization
(Goodfellow 2017)
Numerical Precision: A deep
learning super skill
• Often deep learning algorithms “sort of work”
• No “bugs” per se
(Goodfellow 2017)
Example
• Adding a very small number to a larger one may
have no effect. This can cause large changes
downstream:
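A minimal NumPy demonstration (the particular values are arbitrary):

import numpy as np

big = np.float32(1.0e8)
small = np.float32(1.0)

print(big + small == big)   # True: the small addend is absorbed in float32
print((big + small) - big)  # 0.0, not 1.0, so later results can change a lot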
(Goodfellow 2017)
Secondary effects
(Goodfellow 2017)
exp
• exp(x) overflows for large x
• float32: 89 overflows
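A quick check of this threshold in NumPy:

import numpy as np

print(np.exp(np.float32(88.0)))   # still finite in float32
print(np.exp(np.float32(89.0)))   # inf: overflow in float32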
(Goodfellow 2017)
Subtraction
• Suppose x and y have similar magnitude
• Suppose x is always greater than y
• In a computer, x - y may be negative due to rounding error
• Example: variance
  Safe: Var(f(x)) = E[(f(x) - E[f(x)])^2]
  Dangerous: Var(f(x)) = E[f(x)^2] - E[f(x)]^2
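A NumPy sketch of why the second formula is dangerous (synthetic data with a large mean, chosen to trigger cancellation in float32):

import numpy as np

rng = np.random.RandomState(0)
x = (1.0e4 + rng.rand(100000)).astype(np.float32)  # true variance is about 1/12

safe = np.mean((x - np.mean(x)) ** 2)
dangerous = np.mean(x ** 2) - np.mean(x) ** 2      # subtracts two nearly equal numbers near 1e8

print(safe)        # close to 0.083
print(dangerous)   # badly wrong, possibly even negative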
(Goodfellow 2017)
log and sqrt
• log(0) = - inf
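A few one-liners showing the failure modes; the sqrt example assumes a value that should be zero went slightly negative, as on the Subtraction slide.

import numpy as np

print(np.log(np.float32(0.0)))                    # -inf (with a divide-by-zero warning)
print(np.float32(0.0) * np.log(np.float32(0.0)))  # nan: the -inf propagates into later math
print(np.sqrt(np.float32(-1e-9)))                 # nan: the input should have been 0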
(Goodfellow 2017)
log exp
• log exp(x) is a common pattern
• Should be simplified to x
• Avoids:
• Overflow in exp
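For instance:

import numpy as np

x = np.float32(100.0)
print(np.log(np.exp(x)))  # inf: exp overflows before log can undo it
print(x)                  # the simplified form is exact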
(Goodfellow 2017)
Which is the better hack?
• normalized_x = x / st_dev
• eps = 1e-7
• Should we use
(Goodfellow 2017)
log(sum(exp))
• Naive implementation:
tf.log(tf.reduce_sum(tf.exp(array)))
• Failure modes: overflow in exp when any entry of array is large; underflow of every exp term when all entries are very negative, making the sum 0 and the log -inf
(Goodfellow 2017)
Stable version
mx = tf.reduce_max(array)
safe_array = array - mx
log_sum_exp = mx + tf.log(tf.reduce_sum(tf.exp(safe_array)))
Built in version:
tf.reduce_logsumexp
(Goodfellow 2017)
Why does the logsumexp trick
work?
• Algebraically equivalent to the original version:
m + \log \sum_i \exp(a_i - m)
  = m + \log \sum_i \frac{\exp(a_i)}{\exp(m)}
  = m + \log \frac{1}{\exp(m)} \sum_i \exp(a_i)
  = m - \log \exp(m) + \log \sum_i \exp(a_i)
  = \log \sum_i \exp(a_i)   (since \log \exp(m) = m)
(Goodfellow 2017)
Why does the logsumexp trick
work?
• No overflow: after subtracting the max, every argument to exp is at most 0, and the largest is exactly 0
• Softmax can be stabilized with a similar trick (it is closely related to logsumexp)
(Goodfellow 2017)
Sigmoid
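The slide gives no code; a common numerically stable formulation (a sketch, assuming the goal is to avoid overflow in exp for large negative inputs, not necessarily the deck's own recipe) is:

import numpy as np

def stable_sigmoid(x):
    # For x >= 0, exp(-x) <= 1, so 1 / (1 + exp(-x)) is safe.
    # For x < 0, rewrite as exp(x) / (1 + exp(x)) so we only exponentiate non-positive values.
    out = np.empty_like(x, dtype=np.float64)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    out[~pos] = np.exp(x[~pos]) / (1.0 + np.exp(x[~pos]))
    return out

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))  # [0.0, 0.5, 1.0] with no overflow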
(Goodfellow 2017)
Cross-entropy
• Cross-entropy loss for softmax (and sigmoid) has both
softmax and logsumexp in it
• If you roll your own, use the stabilization tricks for softmax
and logsumexp
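A NumPy sketch of a stabilized cross-entropy computed from logits, using the identity -log softmax(z)[y] = logsumexp(z) - z[y]; the logits and target below are arbitrary examples.

import numpy as np

def cross_entropy_from_logits(logits, target):
    # Stable logsumexp, as on the earlier slide: subtract the max first.
    m = np.max(logits)
    log_sum_exp = m + np.log(np.sum(np.exp(logits - m)))
    return log_sum_exp - logits[target]   # equals -log softmax(logits)[target]

print(cross_entropy_from_logits(np.array([1000.0, 0.0, -1000.0]), target=0))  # ~0.0, no overflow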
(Goodfellow 2017)
Bug hunting strategies
(Goodfellow 2017)
Bug hunting strategies
• If you see explosion (NaNs, very large values) immediately
suspect:
• log
• exp
• sqrt
• division
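One generic way to localize the offending operation (a sketch, not from the deck): add finiteness checks around the usual suspects so the failure surfaces where it is created, not many steps downstream.

import numpy as np

def check_finite(name, value):
    # Fail fast with a useful message instead of letting NaN/inf propagate silently.
    if not np.all(np.isfinite(value)):
        raise FloatingPointError("non-finite values in " + name)
    return value

probs = np.array([0.5, 0.5, 0.0])
log_probs = check_finite("log_probs", np.log(probs))  # raises here, pointing at the log
# np.seterr(all='raise') is another way to turn these warnings into exceptions.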
(Goodfellow 2017)