
Deep Learning

3. Gradient and Auto Differentiation


Matrix Calculus
Review: Scalar Derivative

▪ The derivative is the slope of the tangent line

[Figure: a curve with its tangent line at a point; the slope of the tangent line in the example is 2]
Subderivative

▪ Extend the derivative to non-differentiable cases

[Figure: another example of a non-differentiable function, with slope −0.3 on one piece and slope 0.5 on another]

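A standard worked example (a common textbook case, not necessarily the function in the original figure): for f(x) = |x| the subderivative at the kink is a whole interval,

\partial |x| = \begin{cases} -1 & x < 0 \\ [-1, 1] & x = 0 \\ 1 & x > 0 \end{cases}

so at x = 0 any slope a \in [-1, 1] is a valid subderivative.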

Gradients

▪ Generalize derivatives into vectors

[Table: the shape of ∂y/∂x for each combination of scalar/vector y and scalar/vector x]
Gradients

▪ Case: y is a scalar, x is a vector

▪ ∂y/∂x is a row vector

[Figure: the gradient at (x1, x2) = (1, 1) points in the direction (2, 4), perpendicular to the contour line]
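The numbers in the figure are consistent with, for example, y = x_1^2 + 2x_2^2 (an assumed example chosen to match the (2, 4) direction; the original function is not recoverable here):

\frac{\partial y}{\partial \mathbf{x}} = \left[\frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}\right] = [2x_1, 4x_2],

which at (x_1, x_2) = (1, 1) equals (2, 4), a vector perpendicular to the contour line through that point.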
Examples
Gradients

▪ Case: y is a vector, x is a scalar

▪ For scalar y and vector x, ∂y/∂x is a row vector, while for vector y and scalar x, ∂y/∂x is a column vector

▪ This convention is called numerator-layout notation; the reversed convention is called denominator-layout notation
Gradients

▪ Case: y is a vector, x is a vector

▪ The result ∂y/∂x is the Jacobian matrix
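Written out (the standard numerator-layout definition), for \mathbf{y} \in \mathbb{R}^m and \mathbf{x} \in \mathbb{R}^n:

\frac{\partial \mathbf{y}}{\partial \mathbf{x}} =
\begin{bmatrix}
\partial y_1/\partial x_1 & \cdots & \partial y_1/\partial x_n \\
\vdots & \ddots & \vdots \\
\partial y_m/\partial x_1 & \cdots & \partial y_m/\partial x_n
\end{bmatrix},

an m \times n matrix.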


Examples
Generalize to Matrices

[Table: the shape of ∂y/∂x for each combination of scalar, vector, and matrix y with scalar, vector, and matrix x]
Chain Rule
Generalize to Vectors

▪ Chain rule for scalars:

▪ Generalize to vectors:
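In standard notation (a sketch in numerator layout, with y = f(u) and u = g(x); shapes shown for u \in \mathbb{R}^k, \mathbf{x} \in \mathbb{R}^n, \mathbf{y} \in \mathbb{R}^m):

\frac{dy}{dx} = \frac{dy}{du}\,\frac{du}{dx}

\frac{\partial y}{\partial \mathbf{x}} = \frac{\partial y}{\partial \mathbf{u}}\,\frac{\partial \mathbf{u}}{\partial \mathbf{x}} \quad (1 \times k)(k \times n) = (1 \times n)

\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{u}}\,\frac{\partial \mathbf{u}}{\partial \mathbf{x}} \quad (m \times k)(k \times n) = (m \times n)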
Example 1

Assume

Compute

Decompose
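A typical instance of this assume–compute–decompose pattern (an assumed example for illustration; the equations on the original slide are not recoverable): assume z = (\langle \mathbf{x}, \mathbf{w} \rangle - y)^2 and compute \partial z / \partial \mathbf{w}.

Decompose: a = \langle \mathbf{x}, \mathbf{w} \rangle, \quad b = a - y, \quad z = b^2.

Then by the chain rule:

\frac{\partial z}{\partial \mathbf{w}} = \frac{\partial z}{\partial b}\,\frac{\partial b}{\partial a}\,\frac{\partial a}{\partial \mathbf{w}} = 2b \cdot 1 \cdot \mathbf{x}^\top = 2\,(\langle \mathbf{x}, \mathbf{w} \rangle - y)\,\mathbf{x}^\top.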
Auto Differentiation

Auto Differentiation (AD)

▪ AD evaluates the gradients of a function specified by a program at given values

▪ AD differs from
▪ Symbolic differentiation
▪ Numerical differentiation
Computation Graph

▪ Decompose the computation into primitive operations

▪ Build a directed acyclic graph (DAG) to represent the computation

[Figure: an example expression decomposed into a computation graph]
Computation Graph

▪ Decompose the computation into primitive operations

▪ Build a directed acyclic graph to represent the computation

▪ Build explicitly
▪ TensorFlow/Theano/MXNet

from mxnet import sym

a = sym.var('a')
b = sym.var('b')
c = 2 * a + b
# bind data into a and b later
Computation Graph

▪ Decompose the computation into primitive operations

▪ Build a directed acyclic graph to represent the computation

▪ Build explicitly
▪ TensorFlow/Theano/MXNet
▪ Build implicitly through tracing
▪ PyTorch/MXNet

from mxnet import autograd, nd

with autograd.record():
    a = nd.ones((2, 1))
    b = nd.ones((2, 1))
    c = 2 * a + b
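To actually obtain gradient values from the traced graph, a complete sketch (standard MXNet autograd usage; this code is not on the original slide) looks like:

from mxnet import autograd, nd

a = nd.ones((2, 1))
b = nd.ones((2, 1))
a.attach_grad()            # allocate storage for the gradient w.r.t. a
b.attach_grad()            # allocate storage for the gradient w.r.t. b

with autograd.record():    # trace the computation
    c = 2 * a + b

c.backward()               # reverse accumulation; for non-scalar c, MXNet uses a head gradient of ones
print(a.grad)              # [[2.], [2.]]
print(b.grad)              # [[1.], [1.]]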
Two Modes

▪ By the chain rule

▪ Forward accumulation

▪ Reverse accumulation (a.k.a. backpropagation)


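For a composition y = f(u), u = g(v), v = h(x), both modes apply the same chain rule but group the products differently (a standard illustration):

Forward accumulation (from the input toward the output):

\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u}\left(\frac{\partial u}{\partial v}\left(\frac{\partial v}{\partial x}\right)\right)

Reverse accumulation (from the output toward the input):

\frac{\partial y}{\partial x} = \left(\left(\frac{\partial y}{\partial u}\right)\frac{\partial u}{\partial v}\right)\frac{\partial v}{\partial x}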
Reverse Accumulation

[Figure: an example expression evaluated on a computation graph; the forward pass computes and stores intermediate results, and the backward pass walks the graph in reverse, reading the pre-computed results from the forward pass]
Reverse Accumulation Summary

▪ Build a computation graph

▪ Forward: evaluate the graph and store intermediate results
▪ Backward: evaluate the graph in reverse order
▪ Eliminate paths that are not needed

[Figure: the forward and backward passes over the example computation graph]
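A minimal sketch of reverse accumulation in plain Python (a hypothetical Var class for illustration, not part of any library): each operation records how to push its output gradient back to its inputs, and backward() replays the recorded graph in reverse topological order.

class Var:
    def __init__(self, value):
        self.value = value
        self.grad = 0.0
        self._backward = lambda: None   # how to push this node's gradient to its parents
        self._parents = []

    def __add__(self, other):
        out = Var(self.value + other.value)
        out._parents = [self, other]
        def _backward():
            self.grad += out.grad        # d(out)/d(self) = 1
            other.grad += out.grad       # d(out)/d(other) = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value)
        out._parents = [self, other]
        def _backward():
            self.grad += other.value * out.grad   # d(out)/d(self) = other
            other.grad += self.value * out.grad   # d(out)/d(other) = self
        out._backward = _backward
        return out

    def backward(self):
        # Topologically order the graph from this output, then walk it in reverse.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

a, b = Var(1.0), Var(2.0)
c = a * b + a          # c = a*b + a
c.backward()
print(a.grad, b.grad)  # 3.0 1.0  (dc/da = b + 1, dc/db = a)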
Complexities

▪ Computational complexity: O(n), where n is the number of operations, to compute all derivatives

▪ Often similar to the cost of the forward pass

▪ Memory complexity: O(n), since all intermediate results of the forward pass must be recorded

▪ Compared to forward accumulation:

▪ O(n) time to compute one gradient, so O(n·k) to compute gradients with respect to k variables

▪ O(1) memory complexity


[Advanced] Rematerialization

▪ Memory is the bottleneck for reverse accumulation
▪ Grows linearly with the number of layers and the batch size
▪ GPU memory is limited (32 GB max)
▪ Trade computation for memory
▪ Save only a part of the intermediate results
▪ Recompute the rest when needed
Rematerialization

[Figure: the graph is split into parts (Part 1, Part 2); the forward pass only stores the head result of each part, and the backward pass recomputes the rest of Part 2, then the rest of Part 1, as needed]
Complexities

▪ An additional forward pass is needed

▪ Assume m parts: O(m) memory for the head results, O(n/m) memory to store one part's intermediate results
▪ Choose m = O(√n); then the memory complexity is O(√n)
▪ Applying to deep neural networks
▪ Only throw away simple layers, e.g. activations, often < 30% additional overhead
▪ Train 10x larger networks, or use a 10x larger batch size
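A short derivation of that choice (the standard checkpointing argument): the total memory is the sum of the two terms, which is minimized when they balance,

\text{memory} = O(m) + O(n/m), \qquad \frac{d}{dm}\left(m + \frac{n}{m}\right) = 1 - \frac{n}{m^2} = 0 \;\Rightarrow\; m = \sqrt{n},

giving O(\sqrt{n}) memory at the cost of roughly one extra forward pass.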
