04 Numerical Computation
(Goodfellow 2016)
Roadmap
• Iterative Optimization
• Rounding error, underflow, overflow
(Goodfellow 2016)
Iterative Optimization
• Gradient descent
• Curvature
• Constrained optimization
(Goodfellow 2016)
Gradient-based optimization
• Optimization is the task of minimizing or maximizing
some function 𝑓(𝒙) by altering 𝒙
• $f(x + \epsilon) \approx f(x) + \epsilon f'(x)$
• $f(x - \epsilon\,\operatorname{sign}(f'(x))) < f(x)$ for small enough $\epsilon$
• This technique is called gradient descent (Cauchy,
1847)
(Goodfellow 2016)
Gradient Descent
Figure 4.1
(Goodfellow 2016)
Approximate Optimization
Figure 4.3
(Goodfellow 2016)
We usually don’t even reach
a local minimum
(Goodfellow 2016)
Deep learning optimization
way of life
• Pure math way of life:
q Find literally the smallest value of f(x)
q Or maybe: find some critical point of f(x) where
the value is locally smallest
• Deep learning way of life:
q Decrease the value of f(x) a lot
(Goodfellow 2016)
Iterative Optimization
• Gradient descent
• Curvature
• Constrained optimization
(Goodfellow 2016)
Critical Points ($f'(x) = 0$)
Figure 4.2
(Goodfellow 2016)
Saddle Points
Figure 4.5
Saddle points attract Newton's method
(Gradient descent escapes; see Appendix C of "Qualitatively Characterizing Neural Network Optimization Problems")
(Goodfellow 2016)
Multiple dimensions
• The gradient of $f$ is the vector containing the partial derivatives: $\nabla_{\boldsymbol{x}} f(\boldsymbol{x})$
• Critical point is when all elements of the gradient are 0
• The directional derivative is the slope of the function $f(\boldsymbol{x} + \alpha\boldsymbol{u})$ with respect to $\alpha$
• $\frac{\partial}{\partial\alpha} f(\boldsymbol{x} + \alpha\boldsymbol{u}) = \frac{\partial}{\partial\alpha}\left[ f(\boldsymbol{x}) + \nabla_{\boldsymbol{x}} f(\boldsymbol{x})^{\top} \alpha\boldsymbol{u} + O(\alpha^2) \right] = \nabla_{\boldsymbol{x}} f(\boldsymbol{x})^{\top} \boldsymbol{u} + O(\alpha)$
• $\left.\frac{\partial}{\partial\alpha} f(\boldsymbol{x} + \alpha\boldsymbol{u})\right|_{\alpha=0} = \nabla_{\boldsymbol{x}} f(\boldsymbol{x})^{\top} \boldsymbol{u}$
• The directional derivative is the projection of the gradient
onto the vector 𝒖
(Goodfellow 2016)
The best u is in the opposite
direction of the gradient!
• To minimize 𝑓, find the direction 𝑢 in which 𝑓
decreases the fastest:
q $\min_{\boldsymbol{u},\,\boldsymbol{u}^{\top}\boldsymbol{u}=1} \boldsymbol{u}^{\top} \nabla_{\boldsymbol{x}} f(\boldsymbol{x}) = \min_{\boldsymbol{u},\,\boldsymbol{u}^{\top}\boldsymbol{u}=1} \|\boldsymbol{u}\|_2 \,\|\nabla_{\boldsymbol{x}} f(\boldsymbol{x})\|_2 \cos\theta$, which is minimized when $\cos\theta = -1$, i.e. when $\boldsymbol{u}$ points opposite the gradient
q Take $\boldsymbol{x}' = \boldsymbol{x} - \epsilon \nabla_{\boldsymbol{x}} f(\boldsymbol{x})$
q 𝜖 is called the learning rate
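A minimal sketch of this update rule on a toy quadratic (the objective, learning rate, and iteration count below are illustrative choices, not from the slides):

import numpy as np

# Toy objective f(x) = 0.5 * x^T A x - b^T x with symmetric positive definite A,
# so it has a unique minimizer A^{-1} b that gradient descent should approach.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x

def grad_f(x):
    return A @ x - b

x = np.zeros(2)
epsilon = 0.1                      # learning rate
for _ in range(200):
    x = x - epsilon * grad_f(x)    # x' = x - eps * grad_x f(x)

print(x)                           # close to...
print(np.linalg.solve(A, b))       # ...the analytic minimizer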
(Goodfellow 2016)
What is the best 𝜖 ?
• Constant small 𝜖
• Solve for 𝜖 that makes the directional derivative
vanish
• Line search
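One way to picture the line-search option: evaluate a few candidate values of $\epsilon$ along the negative gradient and keep whichever gives the lowest objective value. A rough sketch (the candidate values and the f / grad_f names are illustrative assumptions, reusing the toy setup above):

import numpy as np

def line_search_step(f, grad_f, x, candidates=(1e-3, 1e-2, 1e-1, 1.0)):
    """Take one gradient step, choosing the learning rate by a crude line search."""
    g = grad_f(x)
    # Evaluate f at each candidate step size and keep the best one.
    best_eps = min(candidates, key=lambda eps: f(x - eps * g))
    return x - best_eps * g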
(Goodfellow 2016)
Beyond the gradient: Curvature
(Goodfellow 2016)
Second derivative – multiple
dimensions
• The second order partial derivatives are collected in
the Hessian matrix 𝐻
• If the second order partial derivatives are continuous, we have $\frac{\partial^2}{\partial x_i \partial x_j} f(\boldsymbol{x}) = \frac{\partial^2}{\partial x_j \partial x_i} f(\boldsymbol{x})$, so $H$ is symmetric
• The directional second derivative:
q $\frac{\partial^2}{\partial\alpha^2} f(\boldsymbol{x} + \alpha\boldsymbol{u}) = \frac{\partial^2}{\partial\alpha^2}\left[ f(\boldsymbol{x}) + \nabla_{\boldsymbol{x}} f(\boldsymbol{x})^{\top} \alpha\boldsymbol{u} + \frac{\alpha^2}{2}\, \boldsymbol{u}^{\top} H \boldsymbol{u} + O(\alpha^3) \right] = \boldsymbol{u}^{\top} H \boldsymbol{u} + O(\alpha)$
q $\left.\frac{\partial^2}{\partial\alpha^2} f(\boldsymbol{x} + \alpha\boldsymbol{u})\right|_{\alpha=0} = \boldsymbol{u}^{\top} H \boldsymbol{u}$
(Goodfellow 2016)
Directional Second
Derivatives
• $H$ can be decomposed as $Q \Lambda Q^{\top}$, where the columns of $Q$ are an orthonormal basis of eigenvectors $\boldsymbol{v}_i$ and $\Lambda$ is the diagonal matrix of eigenvalues $\lambda_i$
• The directional second derivative along direction $\boldsymbol{d}$ (unit norm) is $\boldsymbol{d}^{\top} H \boldsymbol{d}$
q Writing $\boldsymbol{d} = \sum_i d_i \boldsymbol{v}_i$,
q $\boldsymbol{d}^{\top} H \boldsymbol{d} = \sum_i \lambda_i d_i^2$
• The maximum eigenvalue determines the maximum second derivative
• The minimum eigenvalue determines the minimum second derivative
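A small numerical check of this decomposition (the particular "Hessian" and direction below are just examples):

import numpy as np

H = np.array([[2.0, 0.3],
              [0.3, 1.0]])                 # symmetric matrix standing in for a Hessian
eigvals, Q = np.linalg.eigh(H)             # H = Q diag(eigvals) Q^T

d = np.array([0.6, 0.8])                   # unit-norm direction
coeffs = Q.T @ d                           # coordinates of d in the eigenbasis

print(d @ H @ d)                           # directional second derivative d^T H d
print(np.sum(eigvals * coeffs**2))         # same value: sum_i lambda_i d_i^2
print(eigvals.min(), eigvals.max())        # bounds on any directional second derivative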
(Goodfellow 2016)
Predicting optimal step size
using Taylor series
• Second-order Taylor approximation of a gradient step $\boldsymbol{x} - \epsilon\boldsymbol{g}$ (with $\boldsymbol{g} = \nabla_{\boldsymbol{x}} f(\boldsymbol{x})$):
$f(\boldsymbol{x} - \epsilon\boldsymbol{g}) \approx f(\boldsymbol{x}) - \epsilon\boldsymbol{g}^{\top}\boldsymbol{g} + \frac{1}{2}\epsilon^2 \boldsymbol{g}^{\top} H \boldsymbol{g}$
(the $-\epsilon\boldsymbol{g}^{\top}\boldsymbol{g}$ term is the expected improvement; the $\frac{1}{2}\epsilon^2\boldsymbol{g}^{\top}H\boldsymbol{g}$ term is the correction)
• When $\boldsymbol{g}^{\top} H \boldsymbol{g} > 0$, solve for the optimal step size by setting the derivative with respect to $\epsilon$ to zero:
$\frac{\partial}{\partial\epsilon}\left[ f(\boldsymbol{x}) - \epsilon\boldsymbol{g}^{\top}\boldsymbol{g} + \frac{1}{2}\epsilon^2 \boldsymbol{g}^{\top} H \boldsymbol{g} \right] = -\boldsymbol{g}^{\top}\boldsymbol{g} + \epsilon\, \boldsymbol{g}^{\top} H \boldsymbol{g} = 0 \;\Rightarrow\; \epsilon^{*} = \frac{\boldsymbol{g}^{\top}\boldsymbol{g}}{\boldsymbol{g}^{\top} H \boldsymbol{g}}$
• If $\boldsymbol{g}^{\top} H \boldsymbol{g} \le 0$: negative curvature
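A sketch of the $\epsilon^{*}$ formula on a toy quadratic (the Hessian and gradient values are illustrative):

import numpy as np

H = np.array([[3.0, 0.5],
              [0.5, 1.0]])            # Hessian of a quadratic objective
g = np.array([1.0, -2.0])             # gradient at the current point

gHg = g @ H @ g
if gHg > 0:
    eps_star = (g @ g) / gHg          # optimal step size predicted by the Taylor series
    print(eps_star)
else:
    print("negative curvature: the quadratic model predicts no finite optimal step")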
(Goodfellow 2016)
Optimal step
• When $\boldsymbol{g}^{\top} H \boldsymbol{g} > 0$:
q If $\boldsymbol{g}$ aligns with the eigenvector of $H$ corresponding to $\lambda_{\max}$:
$\epsilon^{*} = \frac{1}{\lambda_{\max}}$
(Goodfellow 2016)
Gradient Descent and Poor
Conditioning
Figure 4.6
(Goodfellow 2016)
Newton’s Method
$f(\boldsymbol{x}) \approx f(\boldsymbol{x}^{(0)}) + (\boldsymbol{x} - \boldsymbol{x}^{(0)})^{\top} \nabla_{\boldsymbol{x}} f(\boldsymbol{x}^{(0)}) + \frac{1}{2} (\boldsymbol{x} - \boldsymbol{x}^{(0)})^{\top} H(f)(\boldsymbol{x}^{(0)}) (\boldsymbol{x} - \boldsymbol{x}^{(0)})$
Solving for the critical point of this quadratic approximation gives the Newton update
$\boldsymbol{x}^{*} = \boldsymbol{x}^{(0)} - H(f)(\boldsymbol{x}^{(0)})^{-1} \nabla_{\boldsymbol{x}} f(\boldsymbol{x}^{(0)})$
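A minimal NumPy sketch of one Newton step on a quadratic, where the update jumps straight to the minimum (the matrices below are illustrative):

import numpy as np

# Quadratic objective f(x) = 0.5 * x^T A x - b^T x: gradient A x - b, Hessian A.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

x0 = np.array([5.0, -5.0])
grad = A @ x0 - b
hess = A

# Newton update: x* = x0 - H^{-1} grad (use a linear solve rather than forming the inverse).
x_star = x0 - np.linalg.solve(hess, grad)
print(x_star)                      # equals the exact minimizer A^{-1} b for a quadratic
print(np.linalg.solve(A, b))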
(Goodfellow 2016)
Iterative Optimization
• Gradient descent
• Curvature
• Constrained optimization
(Goodfellow 2016)
KKT Multipliers
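The body of this slide did not survive extraction; for reference, the generalized Lagrangian from the corresponding book section (Sec. 4.4), with equality constraints $g^{(i)}(\boldsymbol{x}) = 0$ and inequality constraints $h^{(j)}(\boldsymbol{x}) \le 0$, is
$L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}) = f(\boldsymbol{x}) + \sum_i \lambda_i\, g^{(i)}(\boldsymbol{x}) + \sum_j \alpha_j\, h^{(j)}(\boldsymbol{x}), \qquad \alpha_j \ge 0$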
(Goodfellow 2016)
Roadmap
• Iterative Optimization
• Rounding error, underflow, overflow
(Goodfellow 2016)
Numerical Precision: A deep
learning super skill
• Often deep learning algorithms “sort of work”
q Loss goes down, accuracy gets within a few
percentage points of state-of-the-art
q No “bugs” per se
• Often deep learning algorithms “explode” (NaNs,
large values)
• Culprit is often loss of numerical precision
(Goodfellow 2016)
Rounding and truncation
errors
• In a digital computer, we use float32 or similar
schemes to represent real numbers
• A real number x is rounded to x + delta for some
small delta
• Overflow: large x replaced by inf
• Underflow: small x replaced by 0
(Goodfellow 2016)
Example
• Adding a very small number to a larger one may
have no effect. This can cause large changes
downstream:
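A tiny float32 illustration of the effect (the specific numbers are arbitrary):

import numpy as np

a = np.float32(1e8)
b = np.float32(1.0)

print(a + b == a)      # True: at float32 precision, adding 1.0 to 1e8 changes nothing
print((a + b) - a)     # 0.0 rather than 1.0, which can silently distort later computations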
(Goodfellow 2016)
Secondary effects
• Suppose we have code that computes x-y
• Suppose x overflows to inf
• Suppose y overflows to inf
• Then x - y = inf - inf = NaN
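A sketch of how this plays out in float32 (the values are chosen only to force the overflow):

import numpy as np

x = np.float32(1e30) * np.float32(1e10)   # overflows to inf
y = np.float32(1e30) * np.float32(1e10)   # also inf
print(x - y)                              # nan, even though the "true" answer is 0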
(Goodfellow 2016)
exp
• exp(x) overflows for large x
q float32: 89 overflows
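To see the float32 threshold mentioned above (np.exp on a float32 scalar keeps float32 precision):

import numpy as np

print(np.exp(np.float32(88.0)))   # ~1.65e38, still representable in float32
print(np.exp(np.float32(89.0)))   # inf: e^89 exceeds the float32 maximum (~3.4e38)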
(Goodfellow 2016)
Subtraction
• Suppose x and y have similar magnitude
• Suppose x is always greater than y
• In a computer, x - y may be negative due to
rounding error
• Example: variance (the original slide contrasts a "safe" and a "dangerous" implementation)
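The slide's code boxes are not in the extracted text; the sketch below shows the standard contrast, assuming the "dangerous" form is E[x²] − E[x]² and the "safe" form subtracts the mean before squaring:

import numpy as np

# Data with a large mean and a tiny true variance, stored in float32.
x = (1e4 + 0.1 * np.random.randn(10000)).astype(np.float32)

mean = np.mean(x, dtype=np.float32)

# Dangerous: difference of two large, nearly equal numbers -> catastrophic cancellation;
# the result can be wildly wrong or even negative.
dangerous = np.mean(x * x, dtype=np.float32) - mean * mean

# Safe: subtract the mean first, so we only square and average small numbers.
safe = np.mean((x - mean) ** 2, dtype=np.float32)

print(dangerous, safe)   # safe is close to 0.01; dangerous generally is not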
(Goodfellow 2016)
log and sqrt
• log(0) = - inf
• log(<negative>) is imaginary, usually nan in
software
• sqrt(0) is 0, but its derivative has a divide by zero
• Definitely avoid underflow or round-to-negative in
the argument!
• Common case: standard_dev = sqrt(variance)
(Goodfellow 2016)
log exp
• log exp(x) is a common pattern
• Should be simplified to x
• Avoids:
q Overflow in exp
q Underflow in exp causing -inf in log
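A quick float32 illustration of both failure modes:

import numpy as np

x = np.float32(100.0)
print(np.log(np.exp(x)))    # inf: exp overflowed before log could undo it
print(x)                    # the algebraically simplified form is exact

y = np.float32(-200.0)
print(np.log(np.exp(y)))    # -inf: exp underflowed to 0, then log(0) = -inf
print(y)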
(Goodfellow 2016)
Which is the better hack?
• normalized_x = x / st_dev
• eps = 1e-7
• Should we add eps inside the square-root argument (st_dev = sqrt(variance + eps)) or to its result (st_dev = sqrt(variance) + eps)?
(Goodfellow 2016)
log(sum(exp))
• Naive implementation:
tf.log(tf.reduce_sum(tf.exp(array)))
• Failure modes:
q If any entry is very large, exp overflows
q If all entries are very negative, all exps
underflow… and then log is -inf
(Goodfellow 2016)
Stable version
mx = tf.reduce_max(array)
safe_array = array - mx
log_sum_exp = mx + tf.log(tf.reduce_sum(tf.exp(safe_array)))
Built in version:
tf.reduce_logsumexp
(Goodfellow 2016)
Why does the logsumexp
trick work?
• Algebraically equivalent to the original version:
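The equation itself did not survive extraction; with $m = \max_i x_i$, the identity is
$\log \sum_i \exp(x_i) = \log\left( \exp(m) \sum_i \exp(x_i - m) \right) = m + \log \sum_i \exp(x_i - m)$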
(Goodfellow 2016)
Why does the logsumexp
trick work?
• No overflow:
q Entries of safe_array are at most 0
• Some of the exp terms underflow, but not all
q At least one entry of safe_array is 0
q The sum of exp terms is at least 1
q The sum is now safe to pass to the log
(Goodfellow 2016)
Softmax
• Softmax: use your library’s built-in softmax function
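If you do want to see what a stable implementation does internally, here is a minimal NumPy sketch of the usual max-subtraction trick (a sketch, not any particular library's implementation):

import numpy as np

def stable_softmax(logits):
    shifted = logits - np.max(logits)   # largest shifted logit is 0, so exp never overflows
    exps = np.exp(shifted)
    return exps / np.sum(exps)          # the sum is at least 1, so the division is safe

print(stable_softmax(np.array([1000.0, 1001.0, 1002.0])))   # naive exp would overflow here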
(Goodfellow 2016)
Sigmoid
• Use your library’s built-in sigmoid function
• If you build your own:
q Recall that sigmoid is just softmax with one of the
logits hard-coded to 0
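Following that observation, one common numerically stable formulation (again a sketch, not the library's actual code) is:

import numpy as np

def stable_sigmoid(x):
    # Softmax over the logits [x, 0], arranged so exp only ever sees a non-positive argument.
    if x >= 0:
        return 1.0 / (1.0 + np.exp(-x))
    z = np.exp(x)
    return z / (1.0 + z)

print(stable_sigmoid(1000.0), stable_sigmoid(-1000.0))   # 1.0 and 0.0, no overflow warnings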
(Goodfellow 2016)
Bug hunting strategies
q If you increase your learning rate and the loss
gets stuck, you are probably rounding your
gradient to zero somewhere: maybe computing
cross-entropy using probabilities instead of logits
q For a correctly implemented loss, too high a learning rate should usually cause an explosion
(Goodfellow 2016)
Bug hunting strategies
• If you see explosion (NaNs, very large values)
immediately suspect:
q log
q exp
q sqrt
q division
(Goodfellow 2016)
Watch
• https://www.youtube.com/watch?v=XlYD8jn1ayE&list=PLoWh1paHYVRfygApBdss1HCt-TFZRXs0k
(Goodfellow 2016)