
Numerical Computation

for Deep Learning


Lecture slides for Chapter 4 of Deep Learning
www.deeplearningbook.org
Ian Goodfellow
Last modified 2017-10-14

Thanks to Justin Gilmer and Jacob Buckman for helpful discussions
Numerical concerns for implementations of deep learning algorithms

• Algorithms are often specified in terms of real numbers; real numbers cannot be implemented in a finite computer

• Does the algorithm still work when implemented with a finite number of bits?

• Do small changes in the input to a function cause large changes to an output?

• Rounding errors, noise, measurement errors can cause large changes

• Iterative search for best input is difficult

(Goodfellow 2017)
Roadmap

• Iterative Optimization

• Rounding error, underflow, overflow

(Goodfellow 2017)
Iterative Optimization

• Gradient descent

• Curvature

• Constrained optimization

(Goodfellow 2017)
Gradient Descent

[Figure 4.1: An illustration of how the gradient descent algorithm uses the derivatives of a function to follow the function downhill to a minimum. The example is f(x) = (1/2) x^2, with f'(x) = x. For x < 0, we have f'(x) < 0, so we can decrease f by moving rightward. For x > 0, we have f'(x) > 0, so we can decrease f by moving leftward. The global minimum is at x = 0; since f'(x) = 0 there, gradient descent halts.]
(Goodfellow 2017)
Approximate Optimization

[Figure 4.3: Optimization algorithms may fail to find a global minimum when multiple local minima or plateaus are present. Ideally, we would like to arrive at the global minimum, but this might not be possible. A local minimum that performs nearly as well as the global one is an acceptable halting point; a local minimum that performs poorly should be avoided. In the context of deep learning, we generally accept such solutions even though they are not truly minimal, so long as they correspond to significantly low values of the cost function.]
(Goodfellow 2017)
We usually don't even reach a local minimum

[Figure 8.1: Gradient descent often does not arrive at a critical point of any kind. In this example, the gradient norm increases throughout training of a convolutional network, even as the classification error rate decreases. Left panel: gradient norm vs. training time (epochs). Right panel: classification error rate vs. training time (epochs).]
(Goodfellow 2017)
Deep learning optimization way of life

• Pure math way of life:

  • Find literally the smallest value of f(x)

  • Or maybe: find some critical point of f(x) where the value is locally smallest

• Deep learning way of life:

  • Decrease the value of f(x) a lot
(Goodfellow 2017)
Iterative Optimization

• Gradient descent

• Curvature

• Constrained optimization

(Goodfellow 2017)
Critical Points

[Figure 4.2: Examples of each of the three types of critical points in 1-D: minimum, maximum, and saddle point. A critical point is a point with zero slope. Such a point can either be a local minimum, which is lower than the neighboring points, a local maximum, which is higher than the neighboring points, or a saddle point, which has neighbors that are both higher and lower than the point itself.]
(Goodfellow 2017)
Saddle Points

[Figure 4.5: A saddle point containing both positive and negative curvature. The function in this example is f(x) = x_1^2 - x_2^2. Along the axis corresponding to x_1, the function curves upward; this axis is an eigenvector of the Hessian and has a positive eigenvalue. Along the axis corresponding to x_2, the function curves downward; this direction is an eigenvector of the Hessian with a negative eigenvalue.]

(Gradient descent escapes saddle points in practice; see Appendix C of "Qualitatively Characterizing Neural Network Optimization Problems")
(Goodfellow 2017)
Curvature

[Figure 4.4: The second derivative determines the curvature of a function. Three quadratic functions are shown: negative curvature, no curvature, and positive curvature. The dashed line indicates the value of the cost function we would expect based on the gradient information alone.]
(Goodfellow 2017)
Directional Second Derivatives (Directional Curvature)

[Figure 2.3: Effect of eigenvectors and eigenvalues. An example of the effect of eigenvectors and eigenvalues on matrix multiplication, for a matrix with two orthonormal eigenvectors, v^(1) with eigenvalue λ_1 and v^(2) with eigenvalue λ_2. Left panel ("Before multiplication"): the unit circle with the eigenvectors marked. Right panel ("After multiplication"): the same set after multiplication by the matrix; space is scaled by λ_1 in direction v^(1) and by λ_2 in direction v^(2).]
(Goodfellow 2017)
Predicting optimal step size using Taylor series

The second-order Taylor series approximation of the function f(x) around the current point x^(0) is

$$f(x) \approx f(x^{(0)}) + (x - x^{(0)})^\top g + \frac{1}{2} (x - x^{(0)})^\top H (x - x^{(0)}), \tag{4.8}$$

where g is the gradient and H is the Hessian at x^(0). If we use a learning rate ε, then the new point x will be given by x^(0) − εg. Substituting this into our approximation, we obtain

$$f(x^{(0)} - \epsilon g) \approx f(x^{(0)}) - \epsilon g^\top g + \frac{1}{2} \epsilon^2 g^\top H g. \tag{4.9}$$

There are three terms here: the original value of the function, the expected improvement due to the slope of the function, and the correction we must apply for the curvature of the function. When this last term is too large, the gradient descent step can actually move uphill. When g^T H g is zero or negative, the Taylor series approximation predicts that increasing ε forever will decrease f. In practice, the Taylor series is unlikely to remain accurate for large ε, so we must resort to more heuristic choices of ε in this case. When g^T H g is positive, solving for the optimal step size that decreases the Taylor series approximation of the function the most yields

$$\epsilon^* = \frac{g^\top g}{g^\top H g}. \tag{4.10}$$

Big gradients speed you up; big eigenvalues slow you down if you align well with their eigenvectors. In the worst case, when g aligns with the eigenvector of H corresponding to the maximal eigenvalue λ_max, this optimal step size is 1/λ_max. The eigenvalues of the Hessian thus determine the scale of the learning rate.
(Goodfellow 2017)
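To make equation 4.10 concrete, here is a minimal numpy sketch (not from the original slides; the matrix and starting point are made up for illustration) that computes ε* for a quadratic and checks that the step decreases the function:

import numpy as np

# A quadratic f(x) = 0.5 * x^T H x, so the gradient at x is H x.
H = np.array([[5.0, 0.0],
              [0.0, 1.0]])            # Hessian with eigenvalues 5 and 1
x = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ H @ x
g = H @ x                             # gradient at the current point

eps_star = (g @ g) / (g @ H @ g)      # equation 4.10
x_new = x - eps_star * g

print(f(x), f(x_new))                 # the step decreases f (3.0 -> ~0.32)
print(1.0 / np.max(np.linalg.eigvalsh(H)))   # worst-case optimal step, 1 / lambda_max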
Condition Number

Functions that change rapidly when their inputs are perturbed slightly can be problematic for scientific computation, because rounding errors in the inputs can result in large changes in the output. Consider the function f(x) = A^{-1} x. When A ∈ R^{n×n} has an eigenvalue decomposition, its condition number is

$$\max_{i,j} \left| \frac{\lambda_i}{\lambda_j} \right|, \tag{4.2}$$

the ratio of the magnitude of the largest and smallest eigenvalue. When this number is large, matrix inversion is particularly sensitive to error in the input. This sensitivity is an intrinsic property of the matrix itself, not the result of rounding error during matrix inversion. Poorly conditioned matrices amplify pre-existing errors when we multiply by the true matrix inverse; in practice, the error is compounded further by numerical errors in the inversion process itself.

When the condition number is large, sometimes you hit large eigenvalues and sometimes you hit small ones. The large ones force you to keep the learning rate small, and you miss out on moving fast in the small eigenvalue directions.
(Goodfellow 2017)
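Here is a minimal numpy sketch (not from the original slides; the matrix is made up for illustration) computing the condition number of equation 4.2:

import numpy as np

A = np.diag([100.0, 1.0])             # eigenvalues 100 and 1
lam = np.linalg.eigvals(A)
cond = np.max(np.abs(lam)) / np.min(np.abs(lam))
print(cond)                           # 100.0
print(np.linalg.cond(A))              # numpy's built-in 2-norm condition number agrees here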
Gradient Descent and Poor Conditioning

[Figure 4.6: Gradient descent fails to exploit the curvature information contained in the Hessian matrix. Here we use gradient descent to minimize a quadratic function f(x) whose Hessian matrix has condition number 5. The path zigzags down a long, narrow canyon: gradient descent wastes time repeatedly descending the canyon walls, because they are the steepest feature, while making slow progress along the canyon floor.]
(Goodfellow 2017)
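A minimal numpy sketch of this zigzag behavior (not from the original slides; the Hessian and learning rate are made up for illustration):

import numpy as np

# Gradient descent on f(x) = 0.5 * x^T H x with condition number 5, as in Figure 4.6.
H = np.diag([5.0, 1.0])
x = np.array([-10.0, 10.0])
lr = 0.35                             # near the stability limit 2 / lambda_max = 0.4

for step in range(10):
    x = x - lr * (H @ x)              # the gradient of the quadratic is H x
    print(step, x)                    # x[0] flips sign each step; x[1] shrinks slowly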
Neural net visualization

At end of learning:
- gradient is still large
- curvature is huge

(From "Qualitatively Characterizing Neural Network Optimization Problems", conference paper at ICLR 2015)
(Goodfellow 2017)
Iterative Optimization

• Gradient descent

• Curvature

• Constrained optimization

(Goodfellow 2017)
KKT Multipliers

These properties guarantee that no infeasible point can be optimal, and that the optimum within the feasible points is unchanged. To perform constrained maximization, we can construct the generalized Lagrange function of −f(x), which leads to this optimization problem:

$$\min_x \max_\lambda \max_{\alpha, \alpha \ge 0} -f(x) + \sum_i \lambda_i g^{(i)}(x) + \sum_j \alpha_j h^{(j)}(x). \tag{4.19}$$

We may also convert this to a problem with maximization in the outer loop:

$$\max_x \min_\lambda \min_{\alpha, \alpha \ge 0} f(x) + \sum_i \lambda_i g^{(i)}(x) - \sum_j \alpha_j h^{(j)}(x). \tag{4.20}$$

The sign of the term for the equality constraints does not matter; we may define it with addition or subtraction as we wish, because the optimization is free to choose any sign for each λ_i.

In this book, the KKT framework is mostly used for theory (e.g.: showing that the Gaussian is the highest entropy distribution). In practice, we usually just project back to the constraint region after each step (see the sketch below).
(Goodfellow 2017)
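A minimal numpy sketch of the projection approach (not from the original slides; the objective and constraint are made up for illustration):

import numpy as np

# Projected gradient descent: take a gradient step, then project back
# onto the feasible set. Constraint here: the unit ball ||x||_2 <= 1.
def project_to_unit_ball(x):
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm

H = np.diag([2.0, 1.0])
target = np.array([3.0, 0.0])         # unconstrained minimum lies outside the ball
grad = lambda x: H @ (x - target)     # gradient of 0.5 (x - target)^T H (x - target)

x = np.zeros(2)
for _ in range(100):
    x = project_to_unit_ball(x - 0.1 * grad(x))

print(x)                              # converges to a boundary point, roughly [1, 0]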
Roadmap

• Iterative Optimization

• Rounding error, underflow, overflow

(Goodfellow 2017)
Numerical Precision: A deep learning super skill

• Often deep learning algorithms “sort of work”

  • Loss goes down, accuracy gets within a few percentage points of state-of-the-art

  • No “bugs” per se

• Often deep learning algorithms “explode” (NaNs, large values)

• Culprit is often loss of numerical precision


(Goodfellow 2017)
Rounding and truncation errors

• In a digital computer, we use float32 or similar schemes to represent real numbers

• A real number x is rounded to x + delta for some small delta

• Overflow: large x replaced by inf

• Underflow: small x replaced by 0

(Goodfellow 2017)
Example

• Adding a very small number to a larger one may have no effect. This can cause large changes downstream:

>>> import numpy as np
>>> a = np.array([0., 1e-8]).astype('float32')
>>> a.argmax()
1
>>> (a + 1).argmax()
0

(Goodfellow 2017)
Secondary effects

• Suppose we have code that computes x-y

• Suppose x overflows to inf

• Suppose y overflows to inf

• Then x - y = inf - inf = NaN

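A minimal numpy sketch of this failure mode (not from the original slides):

import numpy as np

x = np.float32(1e39)                  # overflows: float32 max is about 3.4e38
y = np.float32(2e39)                  # also overflows
print(x, y)                           # inf inf
print(x - y)                          # nan, even though the true difference is -1e39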
(Goodfellow 2017)
exp

• exp(x) overflows for large x

  • Doesn’t need to be very large

  • float32: 89 overflows

  • Never use large x

• exp(x) underflows for very negative x

  • Possibly not a problem

  • Possibly catastrophic if exp(x) is a denominator, an argument to a logarithm, etc.

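A minimal numpy sketch of both failure modes (not from the original slides):

import numpy as np

print(np.exp(np.float32(89.0)))       # inf: e^89 ~ 4.5e38 exceeds the float32 max
print(np.exp(np.float32(-200.0)))     # 0.0: underflow
print(np.log(np.exp(np.float32(-200.0))))  # -inf: the underflow poisons the log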
(Goodfellow 2017)
Subtraction

• Suppose x and y have similar magnitude

• Suppose x is always greater than y

• In a computer, x - y may be negative due to rounding error

• Example: variance

  Safe: $\mathrm{Var}(f(x)) = \mathbb{E}\left[ (f(x) - \mathbb{E}[f(x)])^2 \right]$

  Dangerous: $\mathrm{Var}(f(x)) = \mathbb{E}\left[ f(x)^2 \right] - \mathbb{E}[f(x)]^2$
(Goodfellow 2017)
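A minimal numpy sketch of the cancellation (not from the original slides; the data is made up for illustration, and the exact garbage value depends on rounding):

import numpy as np

# 10,000 float32 samples with huge mean (1e4) and tiny standard deviation (0.1).
rng = np.random.RandomState(0)
x = (1e4 + 0.1 * rng.randn(10000)).astype(np.float32)

dangerous = np.mean(x * x) - np.mean(x) ** 2   # E[x^2] - E[x]^2: catastrophic cancellation
safe = np.mean((x - np.mean(x)) ** 2)          # E[(x - E[x])^2]: subtract before squaring

print(dangerous)                      # garbage: often 0.0 or even negative
print(safe)                           # close to the true variance, 0.01
print(np.sqrt(dangerous))             # nan whenever the garbage came out negative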
log and sqrt

• log(0) = -inf

• log(<negative>) is imaginary, usually nan in software

• sqrt(0) is 0, but its derivative has a divide by zero

• Definitely avoid underflow or round-to-negative in the argument!

• Common case: standard_dev = sqrt(variance)
(Goodfellow 2017)
log exp
• log exp(x) is a common pattern

• Should be simplified to x

• Avoids:

• Overflow in exp

• Underflow in exp causing -inf in log

(Goodfellow 2017)
Which is the better hack?

• normalized_x = x / st_dev

• eps = 1e-7

• Should we use

  • st_dev = sqrt(eps + variance)

  • st_dev = eps + sqrt(variance) ?

• What if variance is implemented safely and will never round to negative?

(Goodfellow 2017)
log(sum(exp))

• Naive implementation:
  tf.log(tf.reduce_sum(tf.exp(array)))

• Failure modes:

  • If any entry is very large, exp overflows

  • If all entries are very negative, all exps underflow… and then log is -inf

(Goodfellow 2017)
Stable version

mx = tf.reduce_max(array)
safe_array = array - mx
log_sum_exp = mx + tf.log(tf.reduce_sum(tf.exp(safe_array)))

Built-in version:
tf.reduce_logsumexp

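A minimal numpy sketch comparing the two versions (not from the original slides):

import numpy as np

def naive_logsumexp(a):
    return np.log(np.sum(np.exp(a)))

def stable_logsumexp(a):
    m = np.max(a)                     # subtract the max before exponentiating
    return m + np.log(np.sum(np.exp(a - m)))

a = np.array([1000.0, 1000.0])
print(naive_logsumexp(a))             # inf: exp(1000) overflows even in float64
print(stable_logsumexp(a))            # 1000.6931..., the correct answer (1000 + log 2)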
(Goodfellow 2017)
Why does the logsumexp trick work?

• Algebraically equivalent to the original version:

$$
\begin{aligned}
m + \log \sum_i \exp(a_i - m)
&= m + \log \sum_i \frac{\exp(a_i)}{\exp(m)} \\
&= m + \log \frac{1}{\exp(m)} \sum_i \exp(a_i) \\
&= m - \log \exp(m) + \log \sum_i \exp(a_i) \\
&= \log \sum_i \exp(a_i)
\end{aligned}
$$
(Goodfellow 2017)
Why does the logsumexp trick
work?
• No overflow:

• Entries of safe_array are at most 0

• Some of the exp terms underflow, but not all

• At least one entry of safe_array is 0

• The sum of exp terms is at least 1

• The sum is now safe to pass to the log


(Goodfellow 2017)
Softmax

• Softmax: use your library’s built-in softmax function

• If you build your own, use:


safe_logits = logits - tf.reduce_max(logits)
softmax = tf.nn.softmax(safe_logits)

• Similar to logsumexp

(Goodfellow 2017)
Sigmoid

• Use your library’s built-in sigmoid function

• If you build your own:

  • Recall that sigmoid is just softmax with one of the logits hard-coded to 0 (see the sketch below)

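A minimal numpy sketch of a stable sigmoid built on this idea (not from the original slides): sigmoid(x) = softmax([x, 0])[0], and after the max-subtraction trick neither branch ever exponentiates a positive number:

import numpy as np

def stable_sigmoid(x):
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))   # exp of a non-positive number
    ex = np.exp(x[~pos])                       # here x < 0, so exp cannot overflow
    out[~pos] = ex / (1.0 + ex)
    return out

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))   # [0.  0.5 1.] with no overflow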
(Goodfellow 2017)
Cross-entropy

• Cross-entropy loss for softmax (and sigmoid) has both softmax and logsumexp in it

• Compute it using the logits not the probabilities

  • The probabilities lose gradient due to rounding error where the softmax saturates

• Use tf.nn.softmax_cross_entropy_with_logits or similar

• If you roll your own, use the stabilization tricks for softmax and logsumexp (see the sketch below)

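A minimal numpy sketch of cross-entropy computed from logits (not from the original slides), reusing the stable logsumexp from earlier:

import numpy as np

# For a single example with integer label y:
#   loss = -log softmax(logits)[y] = logsumexp(logits) - logits[y]
def cross_entropy_from_logits(logits, y):
    m = np.max(logits)
    log_sum_exp = m + np.log(np.sum(np.exp(logits - m)))   # stable logsumexp
    return log_sum_exp - logits[y]

logits = np.array([1000.0, 0.0, -1000.0])
print(cross_entropy_from_logits(logits, 0))   # ~0.0: confident and correct
print(cross_entropy_from_logits(logits, 1))   # 1000.0: the gradient signal survives
# Going through probabilities instead would give -log(0) = inf for label 1.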
(Goodfellow 2017)
Bug hunting strategies

• If you increase your learning rate and the loss gets stuck, you are probably rounding your gradient to zero somewhere: maybe computing cross-entropy using probabilities instead of logits

• For a correctly implemented loss, too high a learning rate should usually cause an explosion

(Goodfellow 2017)
Bug hunting strategies
• If you see explosion (NaNs, very large values), immediately suspect:

• log

• exp

• sqrt

• division

• Always suspect the code that changed most recently


(Goodfellow 2017)
Questions

(Goodfellow 2017)
