
Numerical Computation

for Deep Learning


Lecture slides for Chapter 4 of Deep Learning
www.deeplearningbook.org
Ian Goodfellow
Last modified 2017-10-14

Thanks to Justin Gilmer and Jacob Buckman for helpful discussions
Numerical concerns for implementations of deep learning algorithms

• Algorithms are often specified in terms of real numbers; real numbers cannot be implemented in a finite computer

• Does the algorithm still work when implemented with a finite number of bits?

• Do small changes in the input to a function cause large changes to an output?

• Rounding errors, noise, measurement errors can cause large changes

• Iterative search for best input is difficult

(Goodfellow 2017)
Roadmap

• Iterative Optimization

• Rounding error, underflow, overflow

(Goodfellow 2017)
Iterative Optimization

• Gradient descent

• Curvature

• Constrained optimization

(Goodfellow 2017)
Gradient Descent

[Figure 4.1: An illustration of how the gradient descent algorithm uses the derivatives of a function to follow the function downhill to a minimum. The example is f(x) = (1/2) x^2, with f'(x) = x. For x < 0, we have f'(x) < 0, so we can decrease f by moving rightward. For x > 0, we have f'(x) > 0, so we can decrease f by moving leftward. The global minimum is at x = 0; since f'(x) = 0 there, gradient descent halts.]
(Goodfellow 2017)
Approximate Optimization

[Figure 4.3: Optimization algorithms may fail to find a global minimum when multiple local minima or plateaus are present. Ideally, we would like to arrive at the global minimum, but this might not be possible. A local minimum that performs nearly as well as the global one is an acceptable halting point; a local minimum that performs poorly should be avoided. In the context of deep learning, we generally accept such solutions even though they are not truly minimal, so long as they correspond to significantly low values of the cost function.]
(Goodfellow 2017)
We usually don't even reach a local minimum

[Figure 8.1: Gradient descent often does not arrive at a critical point of any kind. In this example, the gradient norm increases throughout training of a convolutional network, even as the classification error rate decreases. Left panel: gradient norm vs. training time (epochs). Right panel: classification error rate vs. training time (epochs).]
(Goodfellow 2017)
Deep learning optimization way of life

• Pure math way of life:

  • Find literally the smallest value of f(x)

  • Or maybe: find some critical point of f(x) where the value is locally smallest

• Deep learning way of life:

  • Decrease the value of f(x) a lot
(Goodfellow 2017)
Iterative Optimization

• Gradient descent

• Curvature

• Constrained optimization

(Goodfellow 2017)
Critical Points

[Figure 4.2: Examples of each of the three types of critical points in 1-D: minimum, maximum, and saddle point. A critical point is a point with zero slope. Such a point can either be a local minimum, which is lower than the neighboring points, a local maximum, which is higher than the neighboring points, or a saddle point, which has neighbors that are both higher and lower than the point itself.]
(Goodfellow 2017)
Saddle Points

[Figure 4.5: A saddle point containing both positive and negative curvature. The function in this example is f(x) = x_1^2 - x_2^2. Along the axis corresponding to x_1, the function curves upward; this axis is an eigenvector of the Hessian and has a positive eigenvalue. Along the axis corresponding to x_2, the function curves downward; this direction is an eigenvector of the Hessian with a negative eigenvalue.]

(Gradient descent escapes saddle points in practice; see Appendix C of "Qualitatively Characterizing Neural Network Optimization Problems")
(Goodfellow 2017)
Curvature

[Figure 4.4: The second derivative determines the curvature of a function. Three quadratic functions are shown: negative curvature, no curvature, and positive curvature. The dashed line indicates the value of the cost function we would expect based on the gradient information alone.]
(Goodfellow 2017)
Directional Second Derivatives (Directional Curvature)

[Figure 2.3: Effect of eigenvectors and eigenvalues. An example of the effect of eigenvectors and eigenvalues on matrix multiplication, for a matrix with two orthonormal eigenvectors, v^(1) with eigenvalue λ_1 and v^(2) with eigenvalue λ_2. Left panel ("Before multiplication"): the unit circle with the eigenvectors marked. Right panel ("After multiplication"): the same set after multiplication by the matrix; space is scaled by λ_1 in direction v^(1) and by λ_2 in direction v^(2).]
(Goodfellow 2017)
Predicting optimal step size using Taylor series

The second-order Taylor series approximation of the function f(x) around the current point x^(0) is

$$f(x) \approx f(x^{(0)}) + (x - x^{(0)})^\top g + \frac{1}{2} (x - x^{(0)})^\top H (x - x^{(0)}), \tag{4.8}$$

where g is the gradient and H is the Hessian at x^(0). If we use a learning rate ε, then the new point x will be given by x^(0) − εg. Substituting this into our approximation, we obtain

$$f(x^{(0)} - \epsilon g) \approx f(x^{(0)}) - \epsilon g^\top g + \frac{1}{2} \epsilon^2 g^\top H g. \tag{4.9}$$

There are three terms here: the original value of the function, the expected improvement due to the slope of the function, and the correction we must apply for the curvature of the function. When this last term is too large, the gradient descent step can actually move uphill. When g^T H g is zero or negative, the Taylor series approximation predicts that increasing ε forever will decrease f. In practice, the Taylor series is unlikely to remain accurate for large ε, so we must resort to more heuristic choices of ε in this case. When g^T H g is positive, solving for the optimal step size that decreases the Taylor series approximation of the function the most yields

$$\epsilon^* = \frac{g^\top g}{g^\top H g}. \tag{4.10}$$

Big gradients speed you up; big eigenvalues slow you down if you align well with their eigenvectors. In the worst case, when g aligns with the eigenvector of H corresponding to the maximal eigenvalue λ_max, this optimal step size is 1/λ_max. The eigenvalues of the Hessian thus determine the scale of the learning rate.
(Goodfellow 2017)
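To make equation 4.10 concrete, here is a minimal numpy sketch (not from the original slides; the matrix and starting point are made up for illustration) that computes ε* for a quadratic and checks that the step decreases the function:

import numpy as np

# A quadratic f(x) = 0.5 * x^T H x, so the gradient at x is H x.
H = np.array([[5.0, 0.0],
              [0.0, 1.0]])            # Hessian with eigenvalues 5 and 1
x = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ H @ x
g = H @ x                             # gradient at the current point

eps_star = (g @ g) / (g @ H @ g)      # equation 4.10
x_new = x - eps_star * g

print(f(x), f(x_new))                 # the step decreases f (3.0 -> ~0.32)
print(1.0 / np.max(np.linalg.eigvalsh(H)))   # worst-case optimal step, 1 / lambda_max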
Condition Number

Functions that change rapidly when their inputs are perturbed slightly can be problematic for scientific computation, because rounding errors in the inputs can result in large changes in the output. Consider the function f(x) = A^{-1} x. When A ∈ R^{n×n} has an eigenvalue decomposition, its condition number is

$$\max_{i,j} \left| \frac{\lambda_i}{\lambda_j} \right|, \tag{4.2}$$

the ratio of the magnitude of the largest and smallest eigenvalue. When this number is large, matrix inversion is particularly sensitive to error in the input. This sensitivity is an intrinsic property of the matrix itself, not the result of rounding error during matrix inversion. Poorly conditioned matrices amplify pre-existing errors when we multiply by the true matrix inverse; in practice, the error is compounded further by numerical errors in the inversion process itself.

When the condition number is large, sometimes you hit large eigenvalues and sometimes you hit small ones. The large ones force you to keep the learning rate small, and you miss out on moving fast in the small eigenvalue directions.
(Goodfellow 2017)
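Here is a minimal numpy sketch (not from the original slides; the matrix is made up for illustration) computing the condition number of equation 4.2:

import numpy as np

A = np.diag([100.0, 1.0])             # eigenvalues 100 and 1
lam = np.linalg.eigvals(A)
cond = np.max(np.abs(lam)) / np.min(np.abs(lam))
print(cond)                           # 100.0
print(np.linalg.cond(A))              # numpy's built-in 2-norm condition number agrees here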
Gradient Descent and Poor Conditioning

[Figure 4.6: Gradient descent fails to exploit the curvature information contained in the Hessian matrix. Here we use gradient descent to minimize a quadratic function f(x) whose Hessian matrix has condition number 5. The path zigzags down a long, narrow canyon: gradient descent wastes time repeatedly descending the canyon walls, because they are the steepest feature, while making slow progress along the canyon floor.]
(Goodfellow 2017)
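A minimal numpy sketch of this zigzag behavior (not from the original slides; the Hessian and learning rate are made up for illustration):

import numpy as np

# Gradient descent on f(x) = 0.5 * x^T H x with condition number 5, as in Figure 4.6.
H = np.diag([5.0, 1.0])
x = np.array([-10.0, 10.0])
lr = 0.35                             # near the stability limit 2 / lambda_max = 0.4

for step in range(10):
    x = x - lr * (H @ x)              # the gradient of the quadratic is H x
    print(step, x)                    # x[0] flips sign each step; x[1] shrinks slowly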
Neural net visualization

At end of learning:
- gradient is still large
- curvature is huge

(From "Qualitatively Characterizing Neural Network Optimization Problems", conference paper at ICLR 2015)
(Goodfellow 2017)
Iterative Optimization

• Gradient descent

• Curvature

• Constrained optimization

(Goodfellow 2017)
KKT Multipliers

These properties guarantee that no infeasible point can be optimal, and that the optimum within the feasible points is unchanged. To perform constrained maximization, we can construct the generalized Lagrange function of −f(x), which leads to this optimization problem:

$$\min_x \max_\lambda \max_{\alpha, \alpha \ge 0} -f(x) + \sum_i \lambda_i g^{(i)}(x) + \sum_j \alpha_j h^{(j)}(x). \tag{4.19}$$

We may also convert this to a problem with maximization in the outer loop:

$$\max_x \min_\lambda \min_{\alpha, \alpha \ge 0} f(x) + \sum_i \lambda_i g^{(i)}(x) - \sum_j \alpha_j h^{(j)}(x). \tag{4.20}$$

The sign of the term for the equality constraints does not matter; we may define it with addition or subtraction as we wish, because the optimization is free to choose any sign for each λ_i.

In this book, the KKT framework is mostly used for theory (e.g.: showing that the Gaussian is the highest entropy distribution). In practice, we usually just project back to the constraint region after each step (see the sketch below).
(Goodfellow 2017)
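A minimal numpy sketch of the projection approach (not from the original slides; the objective and constraint are made up for illustration):

import numpy as np

# Projected gradient descent: take a gradient step, then project back
# onto the feasible set. Constraint here: the unit ball ||x||_2 <= 1.
def project_to_unit_ball(x):
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm

H = np.diag([2.0, 1.0])
target = np.array([3.0, 0.0])         # unconstrained minimum lies outside the ball
grad = lambda x: H @ (x - target)     # gradient of 0.5 (x - target)^T H (x - target)

x = np.zeros(2)
for _ in range(100):
    x = project_to_unit_ball(x - 0.1 * grad(x))

print(x)                              # converges to a boundary point, roughly [1, 0]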
Roadmap

• Iterative Optimization

• Rounding error, underflow, overflow

(Goodfellow 2017)
Numerical Precision: A deep learning super skill

• Often deep learning algorithms “sort of work”

  • Loss goes down, accuracy gets within a few percentage points of state-of-the-art

  • No “bugs” per se

• Often deep learning algorithms “explode” (NaNs, large values)

• Culprit is often loss of numerical precision


(Goodfellow 2017)
Rounding and truncation errors

• In a digital computer, we use float32 or similar schemes to represent real numbers

• A real number x is rounded to x + delta for some small delta

• Overflow: large x replaced by inf

• Underflow: small x replaced by 0

(Goodfellow 2017)
Example

• Adding a very small number to a larger one may have no effect. This can cause large changes downstream:

>>> import numpy as np
>>> a = np.array([0., 1e-8]).astype('float32')
>>> a.argmax()
1
>>> (a + 1).argmax()
0

(Goodfellow 2017)
Secondary effects

• Suppose we have code that computes x-y

• Suppose x overflows to inf

• Suppose y overflows to inf

• Then x - y = inf - inf = NaN

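A minimal numpy sketch of this failure mode (not from the original slides):

import numpy as np

x = np.float32(1e39)                  # overflows: float32 max is about 3.4e38
y = np.float32(2e39)                  # also overflows
print(x, y)                           # inf inf
print(x - y)                          # nan, even though the true difference is -1e39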
(Goodfellow 2017)
exp

• exp(x) overflows for large x

  • Doesn’t need to be very large

  • float32: 89 overflows

  • Never use large x

• exp(x) underflows for very negative x

  • Possibly not a problem

  • Possibly catastrophic if exp(x) is a denominator, an argument to a logarithm, etc.

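A minimal numpy sketch of both failure modes (not from the original slides):

import numpy as np

print(np.exp(np.float32(89.0)))       # inf: e^89 ~ 4.5e38 exceeds the float32 max
print(np.exp(np.float32(-200.0)))     # 0.0: underflow
print(np.log(np.exp(np.float32(-200.0))))  # -inf: the underflow poisons the log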
(Goodfellow 2017)
Subtraction

• Suppose x and y have similar magnitude

• Suppose x is always greater than y

• In a computer, x - y may be negative due to rounding error

• Example: variance

  Safe: $\mathrm{Var}(f(x)) = \mathbb{E}\left[ (f(x) - \mathbb{E}[f(x)])^2 \right]$

  Dangerous: $\mathrm{Var}(f(x)) = \mathbb{E}\left[ f(x)^2 \right] - \mathbb{E}[f(x)]^2$
(Goodfellow 2017)
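A minimal numpy sketch of the cancellation (not from the original slides; the data is made up for illustration, and the exact garbage value depends on rounding):

import numpy as np

# 10,000 float32 samples with huge mean (1e4) and tiny standard deviation (0.1).
rng = np.random.RandomState(0)
x = (1e4 + 0.1 * rng.randn(10000)).astype(np.float32)

dangerous = np.mean(x * x) - np.mean(x) ** 2   # E[x^2] - E[x]^2: catastrophic cancellation
safe = np.mean((x - np.mean(x)) ** 2)          # E[(x - E[x])^2]: subtract before squaring

print(dangerous)                      # garbage: often 0.0 or even negative
print(safe)                           # close to the true variance, 0.01
print(np.sqrt(dangerous))             # nan whenever the garbage came out negative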
log and sqrt

• log(0) = -inf

• log(<negative>) is imaginary, usually nan in software

• sqrt(0) is 0, but its derivative has a divide by zero

• Definitely avoid underflow or round-to-negative in the argument!

• Common case: standard_dev = sqrt(variance)
(Goodfellow 2017)
log exp
• log exp(x) is a common pattern

• Should be simplified to x

• Avoids:

• Overflow in exp

• Underflow in exp causing -inf in log

(Goodfellow 2017)
Which is the better hack?

• normalized_x = x / st_dev

• eps = 1e-7

• Should we use

  • st_dev = sqrt(eps + variance)

  • st_dev = eps + sqrt(variance) ?

• What if variance is implemented safely and will never round to negative?

(Goodfellow 2017)
log(sum(exp))

• Naive implementation:
  tf.log(tf.reduce_sum(tf.exp(array)))

• Failure modes:

  • If any entry is very large, exp overflows

  • If all entries are very negative, all exps underflow… and then log is -inf

(Goodfellow 2017)
Stable version

mx = tf.reduce_max(array)
safe_array = array - mx
log_sum_exp = mx + tf.log(tf.reduce_sum(tf.exp(safe_array)))

Built-in version:
tf.reduce_logsumexp

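A minimal numpy sketch comparing the two versions (not from the original slides):

import numpy as np

def naive_logsumexp(a):
    return np.log(np.sum(np.exp(a)))

def stable_logsumexp(a):
    m = np.max(a)                     # subtract the max before exponentiating
    return m + np.log(np.sum(np.exp(a - m)))

a = np.array([1000.0, 1000.0])
print(naive_logsumexp(a))             # inf: exp(1000) overflows even in float64
print(stable_logsumexp(a))            # 1000.6931..., the correct answer (1000 + log 2)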
(Goodfellow 2017)
Why does the logsumexp trick work?

• Algebraically equivalent to the original version:

$$
\begin{aligned}
m + \log \sum_i \exp(a_i - m)
&= m + \log \sum_i \frac{\exp(a_i)}{\exp(m)} \\
&= m + \log \frac{1}{\exp(m)} \sum_i \exp(a_i) \\
&= m - \log \exp(m) + \log \sum_i \exp(a_i) \\
&= \log \sum_i \exp(a_i)
\end{aligned}
$$
(Goodfellow 2017)
Why does the logsumexp trick
work?
• No overflow:

• Entries of safe_array are at most 0

• Some of the exp terms underflow, but not all

• At least one entry of safe_array is 0

• The sum of exp terms is at least 1

• The sum is now safe to pass to the log


(Goodfellow 2017)
Softmax

• Softmax: use your library’s built-in softmax function

• If you build your own, use:


safe_logits = logits - tf.reduce_max(logits)
softmax = tf.nn.softmax(safe_logits)

• Similar to logsumexp

(Goodfellow 2017)
Sigmoid

• Use your library’s built-in sigmoid function

• If you build your own:

  • Recall that sigmoid is just softmax with one of the logits hard-coded to 0 (see the sketch below)

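A minimal numpy sketch of a stable sigmoid built on this idea (not from the original slides): sigmoid(x) = softmax([x, 0])[0], and after the max-subtraction trick neither branch ever exponentiates a positive number:

import numpy as np

def stable_sigmoid(x):
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))   # exp of a non-positive number
    ex = np.exp(x[~pos])                       # here x < 0, so exp cannot overflow
    out[~pos] = ex / (1.0 + ex)
    return out

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))   # [0.  0.5 1.] with no overflow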
(Goodfellow 2017)
Cross-entropy

• Cross-entropy loss for softmax (and sigmoid) has both softmax and logsumexp in it

• Compute it using the logits not the probabilities

  • The probabilities lose gradient due to rounding error where the softmax saturates

• Use tf.nn.softmax_cross_entropy_with_logits or similar

• If you roll your own, use the stabilization tricks for softmax and logsumexp (see the sketch below)

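A minimal numpy sketch of cross-entropy computed from logits (not from the original slides), reusing the stable logsumexp from earlier:

import numpy as np

# For a single example with integer label y:
#   loss = -log softmax(logits)[y] = logsumexp(logits) - logits[y]
def cross_entropy_from_logits(logits, y):
    m = np.max(logits)
    log_sum_exp = m + np.log(np.sum(np.exp(logits - m)))   # stable logsumexp
    return log_sum_exp - logits[y]

logits = np.array([1000.0, 0.0, -1000.0])
print(cross_entropy_from_logits(logits, 0))   # ~0.0: confident and correct
print(cross_entropy_from_logits(logits, 1))   # 1000.0: the gradient signal survives
# Going through probabilities instead would give -log(0) = inf for label 1.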
(Goodfellow 2017)
Bug hunting strategies

• If you increase your learning rate and the loss gets stuck, you are probably rounding your gradient to zero somewhere: maybe computing cross-entropy using probabilities instead of logits

• For a correctly implemented loss, too high a learning rate should usually cause an explosion

(Goodfellow 2017)
Bug hunting strategies
• If you see explosion (NaNs, very large values), immediately suspect:

• log

• exp

• sqrt

• division

• Always suspect the code that changed most recently


(Goodfellow 2017)
Questions

(Goodfellow 2017)
