
Numerical Computation for Deep Learning


Lecture slides for Chapter 4 of Deep Learning
www.deeplearningbook.org
Ian Goodfellow
Last modified 2017-10-14
Adapted by m.n. for CMPS 392

Thanks to Justin Gilmer and Jacob Buckman for helpful discussions
Numerical concerns for implementations of deep learning algorithms
• Algorithms are often specified in terms of real numbers; real numbers cannot be implemented in a finite computer
q Does the algorithm still work when implemented with a finite number of bits?
• Do small changes in the input to a function cause large changes to an output?
q Rounding errors, noise, measurement errors can cause large changes
q Iterative search for best input is difficult

(Goodfellow 2016)
Roadmap
• Iterative Optimization
• Rounding error, underflow, overflow

(Goodfellow 2016)
Iterative Optimization
• Gradient descent
• Curvature
• Constrained optimization

(Goodfellow 2016)
Gradient-based optimization
• Optimization is the task of minimizing or maximizing
some function 𝑓(𝒙) by altering 𝒙
• $f(x + \epsilon) \approx f(x) + \epsilon f'(x)$
• $f(x - \epsilon\,\operatorname{sign}(f'(x))) < f(x)$ for small $\epsilon$
• This technique is called gradient descent (Cauchy,
1847)
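A minimal NumPy sketch of this idea (the objective f(x) = x², the starting point, and the step size are illustrative assumptions, not from the slides): repeatedly stepping against the sign of the derivative decreases f.

import numpy as np

def f(x):
    return x ** 2                        # illustrative objective
def f_prime(x):
    return 2 * x                         # its derivative

x, eps = 3.0, 0.1                        # starting point and small step size (assumed)
for _ in range(10):
    x = x - eps * np.sign(f_prime(x))    # move against the sign of f'(x)
    print(x, f(x))                       # f(x) shrinks at every step while eps stays
                                         # small relative to the distance from the minimum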

(Goodfellow 2016)
Gradient Descent

Figure 4.1
(Goodfellow 2016)
Approximate Optimization

Figure 4.3
(Goodfellow 2016)
We usually don’t even reach
a local minimum

(Goodfellow 2016)
Deep learning optimization
way of life
• Pure math way of life:
q Find literally the smallest value of f(x)
q Or maybe: find some critical point of f(x) where
the value is locally smallest
• Deep learning way of life:
q Decrease the value of f(x) a lot

(Goodfellow 2016)
Iterative Optimization
• Gradient descent
• Curvature
• Constrained optimization

(Goodfellow 2016)
Critical Points (𝑓’(𝑥) = 0)

Figure 4.2

(Goodfellow 2016)
Saddle Points

Figure 4.5
Saddle points attract Newton's method (gradient descent escapes; see Appendix C of "Qualitatively Characterizing Neural Network Optimization Problems")
(Goodfellow 2016)
Multiple dimensions
• The gradient of $f$ is the vector containing the partial derivatives: $\nabla_{\boldsymbol{x}} f(\boldsymbol{x})$
• A critical point is a point where all elements of the gradient are 0
• The directional derivative in a direction $\boldsymbol{u}$ is the slope of $f(\boldsymbol{x} + \alpha\boldsymbol{u})$ with respect to $\alpha$
• $\frac{\partial}{\partial \alpha} f(\boldsymbol{x} + \alpha\boldsymbol{u}) = \frac{\partial}{\partial \alpha}\left[f(\boldsymbol{x}) + \nabla_{\boldsymbol{x}} f(\boldsymbol{x})^\top \alpha\boldsymbol{u} + O(\alpha^2)\right] = \nabla_{\boldsymbol{x}} f(\boldsymbol{x})^\top \boldsymbol{u} + O(\alpha)$
• $\left.\frac{\partial}{\partial \alpha} f(\boldsymbol{x} + \alpha\boldsymbol{u})\right|_{\alpha = 0} = \nabla_{\boldsymbol{x}} f(\boldsymbol{x})^\top \boldsymbol{u}$
• The directional derivative is the projection of the gradient onto the vector $\boldsymbol{u}$ (checked numerically in the sketch below)
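A quick NumPy check of that identity (the quadratic f and the direction u below are illustrative choices): the finite-difference slope of f(x + αu) at α ≈ 0 matches the projection of the gradient onto u.

import numpy as np

# Illustrative function: f(x) = x^T A x for a fixed symmetric A (assumed example)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
f = lambda x: x @ A @ x
grad_f = lambda x: 2 * A @ x              # analytic gradient of x^T A x (A symmetric)

x = np.array([1.0, -2.0])
u = np.array([3.0, 4.0])
u = u / np.linalg.norm(u)                 # unit direction

alpha = 1e-6
numeric = (f(x + alpha * u) - f(x)) / alpha   # slope of f(x + alpha u) near alpha = 0
analytic = grad_f(x) @ u                      # gradient projected onto u
print(numeric, analytic)                      # the two values agree to several decimals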

(Goodfellow 2016)
The best u is in the opposite
direction of the gradient!
• To minimize $f$, find the direction $\boldsymbol{u}$ in which $f$ decreases the fastest:
q $\min_{\boldsymbol{u},\, \boldsymbol{u}^\top\boldsymbol{u}=1} \boldsymbol{u}^\top \nabla_{\boldsymbol{x}} f(\boldsymbol{x}) = \min_{\boldsymbol{u},\, \boldsymbol{u}^\top\boldsymbol{u}=1} \lVert \boldsymbol{u} \rVert_2\, \lVert \nabla_{\boldsymbol{x}} f(\boldsymbol{x}) \rVert_2 \cos\theta$, which reduces to minimizing $\cos\theta$, attained at $\cos\theta = -1$: $\boldsymbol{u}$ points opposite the gradient
q Take $\boldsymbol{x}' = \boldsymbol{x} - \epsilon\, \nabla_{\boldsymbol{x}} f(\boldsymbol{x})$
q $\epsilon$ is called the learning rate

(Goodfellow 2016)
What is the best 𝜖 ?
• Constant small 𝜖
• Solve for 𝜖 that makes the directional derivative
vanish
• Line search
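A sketch of the third option, a simple backtracking line search (the shrink factor, the sufficient-decrease constant, and the test function are common heuristics assumed here, not prescribed by the slides):

import numpy as np

def backtracking_step(f, grad, x, eps0=1.0, shrink=0.5, c=1e-4, max_tries=20):
    # Shrink the step size until f decreases enough (a sufficient-decrease test).
    g = grad(x)
    eps = eps0
    for _ in range(max_tries):
        x_new = x - eps * g
        if f(x_new) <= f(x) - c * eps * (g @ g):   # enough decrease: accept
            return x_new, eps
        eps *= shrink                               # otherwise try a smaller step
    return x, 0.0                                   # no acceptable step found

# Illustrative use on f(x) = ||x||^2
f = lambda x: x @ x
grad = lambda x: 2 * x
x = np.array([3.0, -4.0])
x, eps = backtracking_step(f, grad, x)
print(x, eps)                                       # accepts eps = 0.5 and lands at the minimum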

(Goodfellow 2016)
Beyond the gradient: Curvature

The second derivative tells us how the first derivative will change as we vary the input
(Goodfellow 2016)
Second derivative
• If the gradient is 1 and we take a step $\epsilon$ along the negative gradient:
q $f'' = 0$: no curvature, flat line ($f$ decreases by $\epsilon$)
q $f'' < 0$: curves downward ($f$ decreases by more than $\epsilon$)
q $f'' > 0$: curves upward ($f$ decreases by less than $\epsilon$; see the numeric illustration below)
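A small numeric illustration of the three cases (the 1-D functions below are assumed for illustration): each has derivative 1 at x = 0, and we compare the actual decrease after a step ε with ε itself.

eps = 0.1
# Three functions with f(0) = 0 and f'(0) = 1 but different curvature
flat    = lambda x: x              # f'' = 0
concave = lambda x: x - x**2       # f'' = -2, curves downward
convex  = lambda x: x + x**2       # f'' = +2, curves upward

for name, f in [("flat", flat), ("concave", concave), ("convex", convex)]:
    decrease = f(0.0) - f(-eps)    # decrease from stepping eps against the gradient
    print(name, decrease)          # flat: 0.1, concave: 0.11, convex: 0.09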

(Goodfellow 2016)
Second derivative – multiple
dimensions
• The second-order partial derivatives are collected in the Hessian matrix $H$
• If the second-order partial derivatives are continuous, $\frac{\partial^2}{\partial x_i \partial x_j} f(\boldsymbol{x}) = \frac{\partial^2}{\partial x_j \partial x_i} f(\boldsymbol{x})$, so $H$ is symmetric
• The directional second derivative:
q $\frac{\partial^2}{\partial \alpha^2} f(\boldsymbol{x} + \alpha\boldsymbol{u}) = \frac{\partial^2}{\partial \alpha^2}\left[f(\boldsymbol{x}) + \nabla_{\boldsymbol{x}} f(\boldsymbol{x})^\top \alpha\boldsymbol{u} + \tfrac{\alpha^2}{2}\, \boldsymbol{u}^\top H \boldsymbol{u} + O(\alpha^3)\right]$
q $\left.\frac{\partial^2}{\partial \alpha^2} f(\boldsymbol{x} + \alpha\boldsymbol{u})\right|_{\alpha = 0} = \boldsymbol{u}^\top H \boldsymbol{u}$
*,

(Goodfellow 2016)
Directional Second
Derivatives
• $H$ can be decomposed into $Q \Lambda Q^\top$, where $Q$ is an orthogonal basis of eigenvectors
• The directional second derivative along a direction $\boldsymbol{d}$ is $\boldsymbol{d}^\top H \boldsymbol{d}$
q $\boldsymbol{d} = \sum_i d_i \boldsymbol{v}_i$ (expansion in the eigenvectors $\boldsymbol{v}_i$)
q $\boldsymbol{d}^\top H \boldsymbol{d} = \sum_i \lambda_i d_i^2$
• The maximum eigenvalue determines the maximum second derivative
• The minimum eigenvalue determines the minimum second derivative (see the sketch below)
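A NumPy sketch of this decomposition (the Hessian below is an arbitrary symmetric example): for unit directions d, the directional second derivative dᵀHd always falls between the smallest and largest eigenvalues.

import numpy as np

H = np.array([[4.0, 1.0],
              [1.0, 2.0]])               # illustrative symmetric Hessian
eigvals, Q = np.linalg.eigh(H)           # H = Q diag(eigvals) Q^T, eigvals ascending
print(eigvals)                           # [lambda_min, lambda_max]

rng = np.random.default_rng(0)
for _ in range(5):
    d = rng.normal(size=2)
    d /= np.linalg.norm(d)               # unit direction
    curvature = d @ H @ d                # directional second derivative
    print(curvature)                     # always within [lambda_min, lambda_max]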
(Goodfellow 2016)
Predicting optimal step size
using Taylor series

Second-order Taylor approximation (expected improvement term plus correction term):
$f(\boldsymbol{x}^{(0)} - \epsilon\boldsymbol{g}) \approx f(\boldsymbol{x}^{(0)}) - \epsilon\, \boldsymbol{g}^\top\boldsymbol{g} + \tfrac{1}{2}\epsilon^2\, \boldsymbol{g}^\top H \boldsymbol{g}$
When $\boldsymbol{g}^\top H \boldsymbol{g} > 0$, solve for the optimal step size:
$-\boldsymbol{g}^\top\boldsymbol{g} + \epsilon\, \boldsymbol{g}^\top H \boldsymbol{g} = 0 \;\Rightarrow\; \epsilon^* = \dfrac{\boldsymbol{g}^\top\boldsymbol{g}}{\boldsymbol{g}^\top H \boldsymbol{g}}$
If $\boldsymbol{g}^\top H \boldsymbol{g} \le 0$: negative curvature
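A NumPy check of this formula on an assumed quadratic f(x) = ½xᵀHx − bᵀx: stepping along −g by ε* gives a larger decrease than nearby step sizes.

import numpy as np

H = np.array([[3.0, 0.0],
              [0.0, 1.0]])                   # illustrative positive definite Hessian
b = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ H @ x - b @ x
grad = lambda x: H @ x - b

x = np.array([2.0, -2.0])
g = grad(x)
eps_star = (g @ g) / (g @ H @ g)             # optimal step when g^T H g > 0
for eps in [0.5 * eps_star, eps_star, 1.5 * eps_star]:
    print(eps, f(x - eps * g))               # f is smallest at eps = eps_star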
(Goodfellow 2016)
Optimal step
• When $\boldsymbol{g}^\top H \boldsymbol{g} > 0$:
q If $\boldsymbol{g}$ aligns with the eigenvector of $H$ corresponding to $\lambda_{\max}$:
$\epsilon^* = \dfrac{1}{\lambda_{\max}}$
• Big gradients speed you up
• Big eigenvalues slow you down if you align with their eigenvectors
(Goodfellow 2016)
Condition Number

When the condition number is large, sometimes you hit large eigenvalues and sometimes you hit small ones. The large ones force you to keep the learning rate small, making you miss out on moving fast in the small-eigenvalue directions.
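A small NumPy sketch of this effect (the quadratic and the learning rate are assumed for illustration): on f(x) = ½xᵀHx with condition number 100, the step size must stay below 2/λ_max to remain stable, so progress along the small-eigenvalue direction is slow.

import numpy as np

H = np.diag([100.0, 1.0])        # condition number 100 (illustrative)
grad = lambda x: H @ x           # gradient of f(x) = 0.5 x^T H x

x = np.array([1.0, 1.0])
eps = 0.019                      # just under 2 / lambda_max = 0.02, to avoid divergence
for _ in range(100):
    x = x - eps * grad(x)
print(x)                         # the lambda=100 coordinate is ~0; the lambda=1
                                 # coordinate has only decayed to about 0.15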

(Goodfellow 2016)
Gradient Descent and Poor
Conditioning

Figure 4.6
(Goodfellow 2016)
Newton’s Method
$f(\boldsymbol{x}) \approx f(\boldsymbol{x}^{(0)}) + (\boldsymbol{x} - \boldsymbol{x}^{(0)})^\top \nabla_{\boldsymbol{x}} f(\boldsymbol{x}^{(0)}) + \tfrac{1}{2}(\boldsymbol{x} - \boldsymbol{x}^{(0)})^\top H f(\boldsymbol{x}^{(0)}) (\boldsymbol{x} - \boldsymbol{x}^{(0)})$

Solve for the critical point: $\nabla_{\boldsymbol{x}} f(\boldsymbol{x}^{(0)}) + H f(\boldsymbol{x}^{(0)}) (\boldsymbol{x} - \boldsymbol{x}^{(0)}) = 0$

$\boldsymbol{x}^* = \boldsymbol{x}^{(0)} - H f(\boldsymbol{x}^{(0)})^{-1} \nabla_{\boldsymbol{x}} f(\boldsymbol{x}^{(0)})$

• If $f$ is a positive definite quadratic function:
q When $f$ is truly quadratic, apply the update once to jump to the minimum
q When $f$ is not truly quadratic, apply the update multiple times
q Useful when close to a local minimum (where all Hessian eigenvalues are positive)
q Bad near a saddle point
• Gradient descent has the advantage of not being attracted to saddle points (see the sketch below)
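A NumPy sketch of one Newton step on an assumed positive definite quadratic: a single update lands on the minimizer, where the gradient is (numerically) zero.

import numpy as np

H = np.array([[3.0, 1.0],
              [1.0, 2.0]])                   # illustrative positive definite Hessian
b = np.array([1.0, -1.0])
grad = lambda x: H @ x - b                   # gradient of 0.5 x^T H x - b^T x

x0 = np.array([5.0, 5.0])
x_star = x0 - np.linalg.solve(H, grad(x0))   # Newton step: x0 - H^{-1} grad f(x0)
print(x_star, grad(x_star))                  # the gradient at x_star is ~0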

(Goodfellow 2016)
Iterative Optimization
• Gradient descent
• Curvature
• Constrained optimization

(Goodfellow 2016)
KKT Multipliers

In this book, mostly used for theory (e.g., to show the Gaussian is the highest-entropy distribution)
In practice, we usually just project back to the constraint region after each step

(Goodfellow 2016)
Roadmap
• Iterative Optimization
• Rounding error, underflow, overflow

(Goodfellow 2016)
Numerical Precision: A deep
learning super skill
• Often deep learning algorithms “sort of work”
q Loss goes down, accuracy gets within a few
percentage points of state-of-the-art
q No “bugs” per se
• Often deep learning algorithms “explode” (NaNs,
large values)
• Culprit is often loss of numerical precision

(Goodfellow 2016)
Rounding and truncation
errors
• In a digital computer, we use float32 or similar
schemes to represent real numbers
• A real number x is rounded to x + delta for some
small delta
• Overflow: large x replaced by inf
• Underflow: small x replaced by 0

(Goodfellow 2016)
Example
• Adding a very small number to a larger one may
have no effect. This can cause large changes
downstream:

>>> import numpy as np
>>> a = np.array([0., 1e-8]).astype('float32')
>>> a.argmax()
1
>>> (a + 1).argmax()
0

(Goodfellow 2016)
Secondary effects
• Suppose we have code that computes x-y
• Suppose x overflows to inf
• Suppose y overflows to inf
• Then x - y = inf - inf = NaN

(Goodfellow 2016)
exp
• exp(x) overflows for large x

q Doesn’t need to be very large

q float32: 89 overflows

q Never use large x

• exp(x) underflows for very negative x

q Possibly not a problem

q Possibly catastrophic if exp(x) is a denominator, an argument to a logarithm, etc.
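A quick NumPy check of these thresholds (values chosen to sit just around the float32 limits):

import numpy as np

print(np.exp(np.float32(88.0)))    # ~1.65e38: still representable in float32
print(np.exp(np.float32(89.0)))    # inf: exceeds the float32 maximum (~3.4e38)
print(np.exp(np.float32(-104.0)))  # 0.0: underflows even the smallest float32 subnormal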

(Goodfellow 2016)
Subtraction
• Suppose x and y have similar magnitude
• Suppose x is always greater than y
• In a computer, x - y may be negative due to
rounding error
• Example: variance
q Safe: Var(x) = E[(x − E[x])²]
q Dangerous: Var(x) = E[x²] − (E[x])², which can round to a negative number (see the sketch below)
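A NumPy illustration of the danger (the data values are arbitrary): with a large mean, E[x²] − (E[x])² cancels catastrophically in float32 and can even come out negative, while the centered formula stays close to the true variance.

import numpy as np

x = (1e4 + np.array([0.1, 0.2, 0.3, 0.4])).astype('float32')   # large mean, tiny spread

mean = x.mean(dtype='float32')
dangerous = (x * x).mean(dtype='float32') - mean * mean   # E[x^2] - (E[x])^2: cancellation
safe = ((x - mean) ** 2).mean(dtype='float32')            # E[(x - E[x])^2]: no cancellation

print(dangerous)   # wildly inaccurate, possibly negative -> sqrt would give nan
print(safe)        # close to the true variance of 0.0125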

(Goodfellow 2016)
log and sqrt
• log(0) = - inf
• log(<negative>) is imaginary, usually nan in
software
• sqrt(0) is 0, but its derivative has a divide by zero
• Definitely avoid underflow or round-to-negative in
the argument!
• Common case: standard_dev = sqrt(variance)

(Goodfellow 2016)
log exp
• log exp(x) is a common pattern
• Should be simplified to x
• Avoids:
q Overflow in exp
q Underflow in exp causing -inf in log

(Goodfellow 2016)
Which is the better hack?
• normalized_x = x / st_dev

• eps = 1e-7

• Should we use

q st_dev = sqrt(eps + variance)

q or st_dev = eps + sqrt(variance) ?

• What if variance is implemented safely and will never round to negative?

(Goodfellow 2016)
log(sum(exp))
• Naive implementation:
tf.log(tf.reduce_sum(tf.exp(array)))
• Failure modes:
q If any entry is very large, exp overflows
q If all entries are very negative, all exps
underflow… and then log is -inf

(Goodfellow 2016)
Stable version

mx = tf.reduce_max(array)
safe_array = array - mx
log_sum_exp = mx + tf.log(tf.reduce_sum(tf.exp(safe_array)))

Built in version:
tf.reduce_logsumexp

(Goodfellow 2016)
Why does the logsumexp
trick work?
• Algebraically equivalent to the original version: with $m = \max_i x_i$,
$\log \sum_i \exp(x_i) = m + \log \sum_i \exp(x_i - m)$

(Goodfellow 2016)
Why does the logsumexp
trick work?
• No overflow:
q Entries of safe_array are at most 0
• Some of the exp terms underflow, but not all
q At least one entry of safe_array is 0
q The sum of exp terms is at least 1
q The sum is now safe to pass to the log
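The same trick in plain NumPy (the logit values are illustrative): the naive version overflows to inf, while the shifted version matches NumPy's own stable reduction.

import numpy as np

array = np.array([1000.0, 1001.0, 1002.0])         # large logits (illustrative)

naive = np.log(np.sum(np.exp(array)))              # exp overflows -> inf -> log(inf) = inf
mx = np.max(array)
stable = mx + np.log(np.sum(np.exp(array - mx)))   # entries of array - mx are <= 0

print(naive)                        # inf, with an overflow warning
print(stable)                       # ~1002.4076
print(np.logaddexp.reduce(array))   # NumPy's built-in stable log-sum-exp agrees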

(Goodfellow 2016)
Softmax
• Softmax: use your library’s built-in softmax function

• If you build your own, use:
• Similar to logsumexp
safe_logits = logits - tf.reduce_max(logits)
softmax = tf.nn.softmax(safe_logits)
(Goodfellow 2016)
Cross-entropy
• Cross-entropy loss for softmax (and sigmoid) has both softmax and logsumexp in it
• Compute it using the logits, not the probabilities
• The probabilities lose gradient due to rounding error where the softmax saturates
• Use tf.nn.softmax_cross_entropy_with_logits or similar
• If you roll your own, use the stabilization tricks for softmax and logsumexp

(Goodfellow 2016)
Sigmoid
• Use your library’s built-in sigmoid function
• If you build your own:
q Recall that sigmoid is just softmax with one of the
logits hard-coded to 0
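A NumPy sketch of that identity (the input values are illustrative): sigmoid(x) equals the first component of softmax over the logits [x, 0], and branching on the sign of x keeps exp from overflowing.

import numpy as np

def stable_sigmoid(x):
    # Sigmoid computed without overflowing exp, by branching on the sign of x.
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))   # exp of a non-positive number: safe
    e = np.exp(x[~pos])                        # exp of a negative number: safe
    out[~pos] = e / (1.0 + e)
    return out

x = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
print(stable_sigmoid(x))                       # [0, 0.269, 0.5, 0.731, 1]

# sigmoid(x) is softmax over the logits [x, 0], keeping the first component
logits = np.stack([x, np.zeros_like(x)], axis=1)
shifted = logits - logits.max(axis=1, keepdims=True)   # the softmax stabilization trick
softmax = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
print(softmax[:, 0])                           # matches stable_sigmoid(x)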

(Goodfellow 2016)
Bug hunting strategies
q If you increase your learning rate and the loss gets stuck, you are probably rounding your gradient to zero somewhere, e.g. by computing cross-entropy using probabilities instead of logits
q For a correctly implemented loss, too high a learning rate should usually cause an explosion

(Goodfellow 2016)
Bug hunting strategies
• If you see explosion (NaNs, very large values)
immediately suspect:

q log

q exp

q sqrt

q division

• Always suspect the code that changed most recently

(Goodfellow 2016)
Watch
• https://www.youtube.com/watch?v=XlYD8jn1ayE&list=PLoWh1paHYVRfygApBdss1HCt-TFZRXs0k

(Goodfellow 2016)
