
Deep Learning

3. Gradient and Auto Differentiation


Matrix Calculus
Review: Scalar Derivative

▪ The derivative is the slope of the tangent line

[Figure: a curve with its tangent line at a point; the slope of the tangent line in the example is 2]
Subderivative

▪ Extend the derivative to non-differentiable cases

[Figure: another example of a non-differentiable function, with slope −0.3 on one piece and slope 0.5 on another]

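A standard worked example (a common textbook case, not necessarily the function in the original figure): for f(x) = |x| the subderivative at the kink is a whole interval,

\partial |x| = \begin{cases} -1 & x < 0 \\ [-1, 1] & x = 0 \\ 1 & x > 0 \end{cases}

so at x = 0 any slope a \in [-1, 1] is a valid subderivative.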

Gradients

▪ Generalize derivatives into vectors

[Table: the shape of ∂y/∂x for each combination of scalar/vector y and scalar/vector x]
Gradients

▪ Case: y is a scalar, x is a vector

▪ ∂y/∂x is a row vector

[Figure: the gradient at (x1, x2) = (1, 1) points in the direction (2, 4), perpendicular to the contour line]
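The numbers in the figure are consistent with, for example, y = x_1^2 + 2x_2^2 (an assumed example chosen to match the (2, 4) direction; the original function is not recoverable here):

\frac{\partial y}{\partial \mathbf{x}} = \left[\frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}\right] = [2x_1, 4x_2],

which at (x_1, x_2) = (1, 1) equals (2, 4), a vector perpendicular to the contour line through that point.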
Examples
Gradients

▪ Case: y is a vector, x is a scalar

▪ For scalar y and vector x, ∂y/∂x is a row vector, while for vector y and scalar x, ∂y/∂x is a column vector

▪ This convention is called numerator-layout notation; the reversed convention is called denominator-layout notation
Gradients

▪ Case: y is a vector, x is a vector

▪ The result ∂y/∂x is the Jacobian matrix
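Written out (the standard numerator-layout definition), for \mathbf{y} \in \mathbb{R}^m and \mathbf{x} \in \mathbb{R}^n:

\frac{\partial \mathbf{y}}{\partial \mathbf{x}} =
\begin{bmatrix}
\partial y_1/\partial x_1 & \cdots & \partial y_1/\partial x_n \\
\vdots & \ddots & \vdots \\
\partial y_m/\partial x_1 & \cdots & \partial y_m/\partial x_n
\end{bmatrix},

an m \times n matrix.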


Examples
Generalize to Matrices

[Table: the shape of ∂y/∂x for each combination of scalar, vector, and matrix y with scalar, vector, and matrix x]
Chain Rule
Generalize to Vectors

▪ Chain rule for scalars:

▪ Generalize to vectors:
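In standard notation (a sketch in numerator layout, with y = f(u) and u = g(x); shapes shown for u \in \mathbb{R}^k, \mathbf{x} \in \mathbb{R}^n, \mathbf{y} \in \mathbb{R}^m):

\frac{dy}{dx} = \frac{dy}{du}\,\frac{du}{dx}

\frac{\partial y}{\partial \mathbf{x}} = \frac{\partial y}{\partial \mathbf{u}}\,\frac{\partial \mathbf{u}}{\partial \mathbf{x}} \quad (1 \times k)(k \times n) = (1 \times n)

\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{u}}\,\frac{\partial \mathbf{u}}{\partial \mathbf{x}} \quad (m \times k)(k \times n) = (m \times n)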
Example 1

Assume

Compute

Decompose
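A typical instance of this assume–compute–decompose pattern (an assumed example for illustration; the equations on the original slide are not recoverable): assume z = (\langle \mathbf{x}, \mathbf{w} \rangle - y)^2 and compute \partial z / \partial \mathbf{w}.

Decompose: a = \langle \mathbf{x}, \mathbf{w} \rangle, \quad b = a - y, \quad z = b^2.

Then by the chain rule:

\frac{\partial z}{\partial \mathbf{w}} = \frac{\partial z}{\partial b}\,\frac{\partial b}{\partial a}\,\frac{\partial a}{\partial \mathbf{w}} = 2b \cdot 1 \cdot \mathbf{x}^\top = 2\,(\langle \mathbf{x}, \mathbf{w} \rangle - y)\,\mathbf{x}^\top.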
Auto Differentiation

Auto Differentiation (AD)

▪ AD evaluates the gradients of a function specified by a program at given values

▪ AD differs from
▪ Symbolic differentiation
▪ Numerical differentiation
Computation Graph

▪ Decompose the computation into primitive operations

▪ Build a directed acyclic graph (DAG) to represent the computation

[Figure: an example expression decomposed into a computation graph]
Computation Graph

▪ Decompose the computation into primitive operations

▪ Build a directed acyclic graph to represent the computation

▪ Build explicitly
▪ TensorFlow/Theano/MXNet

from mxnet import sym

a = sym.var('a')
b = sym.var('b')
c = 2 * a + b
# bind data into a and b later
Computation Graph

▪ Decompose the computation into primitive operations

▪ Build a directed acyclic graph to represent the computation

▪ Build explicitly
▪ TensorFlow/Theano/MXNet
▪ Build implicitly through tracing
▪ PyTorch/MXNet

from mxnet import autograd, nd

with autograd.record():
    a = nd.ones((2, 1))
    b = nd.ones((2, 1))
    c = 2 * a + b
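To actually obtain gradient values from the traced graph, a complete sketch (standard MXNet autograd usage; this code is not on the original slide) looks like:

from mxnet import autograd, nd

a = nd.ones((2, 1))
b = nd.ones((2, 1))
a.attach_grad()            # allocate storage for the gradient w.r.t. a
b.attach_grad()            # allocate storage for the gradient w.r.t. b

with autograd.record():    # trace the computation
    c = 2 * a + b

c.backward()               # reverse accumulation; for non-scalar c, MXNet uses a head gradient of ones
print(a.grad)              # [[2.], [2.]]
print(b.grad)              # [[1.], [1.]]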
Two Modes

▪ By the chain rule

▪ Forward accumulation

▪ Reverse accumulation (a.k.a. backpropagation)


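For a composition y = f(u), u = g(v), v = h(x), both modes apply the same chain rule but group the products differently (a standard illustration):

Forward accumulation (from the input toward the output):

\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u}\left(\frac{\partial u}{\partial v}\left(\frac{\partial v}{\partial x}\right)\right)

Reverse accumulation (from the output toward the input):

\frac{\partial y}{\partial x} = \left(\left(\frac{\partial y}{\partial u}\right)\frac{\partial u}{\partial v}\right)\frac{\partial v}{\partial x}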
Reverse Accumulation

[Figure: an example expression evaluated on a computation graph; the forward pass computes and stores intermediate results, and the backward pass walks the graph in reverse, reading the pre-computed results from the forward pass]
Reverse Accumulation Summary

▪ Build a computation graph

▪ Forward: evaluate the graph and store intermediate results
▪ Backward: evaluate the graph in reverse order
▪ Eliminate paths that are not needed

[Figure: the forward and backward passes over the example computation graph]
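A minimal sketch of reverse accumulation in plain Python (a hypothetical Var class for illustration, not part of any library): each operation records how to push its output gradient back to its inputs, and backward() replays the recorded graph in reverse topological order.

class Var:
    def __init__(self, value):
        self.value = value
        self.grad = 0.0
        self._backward = lambda: None   # how to push this node's gradient to its parents
        self._parents = []

    def __add__(self, other):
        out = Var(self.value + other.value)
        out._parents = [self, other]
        def _backward():
            self.grad += out.grad        # d(out)/d(self) = 1
            other.grad += out.grad       # d(out)/d(other) = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value)
        out._parents = [self, other]
        def _backward():
            self.grad += other.value * out.grad   # d(out)/d(self) = other
            other.grad += self.value * out.grad   # d(out)/d(other) = self
        out._backward = _backward
        return out

    def backward(self):
        # Topologically order the graph from this output, then walk it in reverse.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

a, b = Var(1.0), Var(2.0)
c = a * b + a          # c = a*b + a
c.backward()
print(a.grad, b.grad)  # 3.0 1.0  (dc/da = b + 1, dc/db = a)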
Complexities

▪ Computational complexity: O(n), where n is the number of operations, to compute all derivatives

▪ Often similar to the cost of the forward pass

▪ Memory complexity: O(n), since all intermediate results of the forward pass must be recorded

▪ Compared to forward accumulation:

▪ O(n) time to compute one gradient, so O(n·k) to compute gradients with respect to k variables

▪ O(1) memory complexity


[Advanced] Rematerialization

▪ Memory is the bottleneck for reverse accumulation
▪ Grows linearly with the number of layers and the batch size
▪ GPU memory is limited (32 GB max)
▪ Trade computation for memory
▪ Save only a part of the intermediate results
▪ Recompute the rest when needed
Rematerialization

[Figure: the graph is split into parts (Part 1, Part 2); the forward pass only stores the head result of each part, and the backward pass recomputes the rest of Part 2, then the rest of Part 1, as needed]
Complexities

▪ An additional forward pass is needed

▪ Assume m parts: O(m) memory for the head results, O(n/m) memory to store one part's intermediate results
▪ Choose m = O(√n); then the memory complexity is O(√n)
▪ Applying to deep neural networks
▪ Only throw away simple layers, e.g. activations, often < 30% additional overhead
▪ Train 10x larger networks, or use a 10x larger batch size
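A short derivation of that choice (the standard checkpointing argument): the total memory is the sum of the two terms, which is minimized when they balance,

\text{memory} = O(m) + O(n/m), \qquad \frac{d}{dm}\left(m + \frac{n}{m}\right) = 1 - \frac{n}{m^2} = 0 \;\Rightarrow\; m = \sqrt{n},

giving O(\sqrt{n}) memory at the cost of roughly one extra forward pass.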
