Chapter 4 - Optimization

Optimization

With permission from:
Justin Johnson, EECS 498-007 / 598-005
Deep Learning for Computer Vision
University of Michigan

Zafri Baharuddin, EEEB4023/ECEB463 AI, The Energy University, 9/10/2023


Last Time: Linear Classifiers

Algebraic Viewpoint: f(x, W) = Wx + b
Visual Viewpoint: one template per class
Geometric Viewpoint: hyperplanes cutting up high-dimensional space
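As a quick reminder of the algebraic viewpoint, here is a minimal score computation in NumPy; the shapes (10 classes, 3072-dimensional flattened inputs) and random values are illustrative only, not data from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 3072)) * 0.01   # one row (one template) per class
b = np.zeros(10)
x = rng.standard_normal(3072)                # one flattened input image

scores = W @ x + b                           # f(x, W) = Wx + b: one score per class
print(scores.shape)                          # (10,)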



Summary: Loss Functions quantify preferences

• We have some dataset of (x, y)
• We have a score function: Linear Classifier
• We have a loss function: cross-entropy loss, SVM loss, full/average loss

Q: How do we find the best W?
A: Optimization


Optimization

• Gradient
• Gradient Descent
• Stochastic Gradient Descent (SGD)
• SGD + Momentum
• AdaGrad
• RMSProp
• Adam

Follow the slope

In one dimension, the derivative of a function gives the slope:

df(x)/dx = lim_{h→0} (f(x + h) - f(x)) / h

In multiple dimensions, the gradient is the vector of partial derivatives along each
dimension.

The slope in any direction is the dot product of the direction with the gradient.
The direction of steepest descent is the negative gradient.
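A small numerical illustration of the last two points (the gradient values here are made up for illustration):

```python
import numpy as np

# Hypothetical gradient of the loss at the current W (illustrative values only)
grad = np.array([-2.5, 0.6, 0.0])

# Slope in a given unit direction = dot(direction, gradient)
direction = np.array([1.0, 0.0, 0.0])
print(np.dot(direction, grad))       # -2.5: moving along +w1 decreases the loss

# The direction of steepest descent is the negative gradient (normalized)
steepest = -grad / np.linalg.norm(grad)
print(np.dot(steepest, grad))        # the most negative slope achievable: -||grad||
```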



Numeric gradient: evaluate the limit definition with a small h

df(W)/dW = lim_{h→0} (f(W + h) - f(W)) / h,  here with h = 0.0001

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]
loss 1.25347

W + h (first dim): [0.34 + 0.0001, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]
loss 1.25322
First entry of dL/dW: (1.25322 - 1.25347) / 0.0001 = -2.5

W + h (second dim): [0.34, -1.11 + 0.0001, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]
loss 1.25353
Second entry of dL/dW: (1.25353 - 1.25347) / 0.0001 = 0.6

W + h (third dim): [0.34, -1.11, 0.78 + 0.0001, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]
loss 1.25347
Third entry of dL/dW: (1.25347 - 1.25347) / 0.0001 = 0.0

Repeating this for every dimension fills in the gradient:
dL/dW = [-2.5, 0.6, 0.0, ?, ?, ?, ?, ?, ?, ...]

Numeric Gradient:
- Slow: O(#dimensions) loss evaluations per gradient
- Approximate
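A minimal sketch of this procedure in Python/NumPy; the loss function `f` and the step `h` are placeholders, not code from the lecture:

```python
import numpy as np

def numeric_gradient(f, W, h=1e-4):
    """Approximate dL/dW by perturbing one dimension of W at a time."""
    grad = np.zeros_like(W)
    base = f(W)                      # loss at the current W
    for i in range(W.size):
        old = W.flat[i]
        W.flat[i] = old + h          # nudge a single dimension
        grad.flat[i] = (f(W) - base) / h
        W.flat[i] = old              # restore
    return grad
```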


Analytic Gradient: Loss is a function of W

We want the gradient of the loss with respect to the weights, dL/dW.

Use calculus to compute an analytic gradient.


current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]
loss 1.25347

dL/dW = ... (some function of the data and W)

gradient dL/dW: [-2.5, 0.6, 0, 0.2, 0.7, -0.5, 1.1, 1.3, -2.1, ...]

(In practice we will compute dL/dW using backpropagation)


Computing Gradients

• Numeric gradient: approximate, slow, easy to write


• Analytic gradient: exact, fast, error-prone

In practice: Always use the analytic gradient, but check your implementation
with the numerical gradient. This is called a gradient check.
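A sketch of such a gradient check, reusing the `numeric_gradient` helper sketched above; the relative-error threshold and names are illustrative, not from the lecture:

```python
import numpy as np

def gradient_check(f, analytic_grad, W, h=1e-4, tol=1e-5):
    """Compare an analytic gradient against the numeric approximation."""
    num_grad = numeric_gradient(f, W, h)   # numeric_gradient from the sketch above
    # Relative error is more meaningful than absolute error across scales
    denom = np.maximum(1e-8, np.abs(num_grad) + np.abs(analytic_grad))
    rel_error = np.abs(num_grad - analytic_grad) / denom
    return rel_error.max() < tol
```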



Gradient Descent

Iteratively step in the direction of the negative gradient
(the direction of local steepest descent).

Hyperparameters:
- Weight initialization method
- Number of steps
- Learning rate

[Figure: loss contours over (W_1, W_2); starting from the original W, each step moves in the negative gradient direction.]


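A minimal sketch of this loop; the `loss_and_grad` callback, initial weights, step count, and learning rate are placeholders, not values from the lecture:

```python
import numpy as np

def gradient_descent(loss_and_grad, W0, learning_rate=1e-2, num_steps=100):
    """Vanilla (full-batch) gradient descent.

    loss_and_grad(W) is assumed to return (loss, dL/dW) over the whole dataset.
    """
    W = W0.copy()
    for _ in range(num_steps):
        loss, grad = loss_and_grad(W)
        W -= learning_rate * grad     # step along the negative gradient
    return W
```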


Batch Gradient Descent / Full-Batch Gradient Descent

The loss is an average over all N training examples, so each gradient step requires a sum over the whole dataset. This full sum is expensive when N is large!


Stochastic Gradient Descent (SGD)

The full sum is expensive when N is large!
Approximate the sum using a minibatch of examples;
minibatch sizes of 32 / 64 / 128 are common.

Hyperparameters:
- Weight initialization
- Number of steps
- Learning rate
- Batch size
- Data sampling
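A minimal minibatch SGD sketch under the same assumptions as the gradient descent function above; the minibatch loss/gradient callback and dataset arrays are placeholders:

```python
import numpy as np

def sgd(loss_and_grad, W0, X, y, learning_rate=1e-2,
        batch_size=64, num_steps=1000, seed=0):
    """Minibatch SGD: each step approximates the full sum with a random minibatch.

    loss_and_grad(W, X_batch, y_batch) is assumed to return (loss, dL/dW)
    averaged over the minibatch.
    """
    rng = np.random.default_rng(seed)
    W = W0.copy()
    for _ in range(num_steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)  # data sampling
        loss, grad = loss_and_grad(W, X[idx], y[idx])
        W -= learning_rate * grad
    return W
```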


Problems with SGD

What if the loss changes quickly in one direction and slowly in another?
What does gradient descent do?

Very slow progress along the shallow dimension, jitter along the steep direction.


Problems with SGD

What if the loss function has a local minimum or saddle point?

Zero gradient; gradient descent gets stuck.

[Figure: a 1-D loss curve with a local minimum, and another with a saddle point.]


Problems with SGD

Our gradients come from minibatches, so they can be noisy!

[Figure: a noisy SGD trajectory zig-zagging toward the minimum.]


SGD + Momentum

[Update rules for SGD and SGD+Momentum shown side by side in the original slide.]

- Build up "velocity" as a running mean of gradients
- Rho gives "friction"; typically rho = 0.9 or 0.99

Sutskever et al, "On the importance of initialization and momentum in deep learning", ICML 2013

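A sketch of the standard SGD+Momentum update; this is the common formulation of the technique, not a reproduction of the lecture's exact notation:

```python
def sgd_momentum_step(w, dw, v, learning_rate=1e-2, rho=0.9):
    """One SGD+Momentum step: v is a leaky running mean of gradients ("velocity")."""
    v = rho * v + dw                  # rho acts as friction on the accumulated velocity
    w = w - learning_rate * v         # step along the velocity, not the raw gradient
    return w, v
```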


SGD + Momentum

[Figure: SGD vs SGD+Momentum trajectories on three problem cases: local minima / saddle points, poor conditioning, and gradient noise.]

Sutskever et al, "On the importance of initialization and momentum in deep learning", ICML 2013


AdaGrad

Adds element-wise scaling of the gradient based on the historical sum of squares in each dimension.

"Per-parameter learning rates" or "adaptive learning rates"

Duchi et al, "Adaptive subgradient methods for online learning and stochastic optimization", JMLR 2011




AdaGrad

Q: What happens with AdaGrad?
A: Progress along "steep" directions is damped;
progress along "flat" directions is accelerated.
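A sketch of the standard AdaGrad update described above; the `eps` constant for numerical stability and the default learning rate are assumptions:

```python
import numpy as np

def adagrad_step(w, dw, grad_squared, learning_rate=1e-2, eps=1e-7):
    """One AdaGrad step: scale each dimension by its historical sum of squared gradients."""
    grad_squared = grad_squared + dw * dw                    # per-dimension history
    w = w - learning_rate * dw / (np.sqrt(grad_squared) + eps)
    return w, grad_squared
```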
RMSProp: AdaGrad + decay rate

RMSProp replaces AdaGrad's running sum of squared gradients with a leaky (exponentially decaying) average, so the effective step size does not keep shrinking forever.

Tieleman and Hinton, 2012
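A sketch of the corresponding RMSProp update; the `decay_rate` and `eps` defaults are assumptions:

```python
import numpy as np

def rmsprop_step(w, dw, grad_squared, learning_rate=1e-3, decay_rate=0.99, eps=1e-7):
    """One RMSProp step: a leaky average of squared gradients instead of AdaGrad's full sum."""
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dw * dw
    w = w - learning_rate * dw / (np.sqrt(grad_squared) + eps)
    return w, grad_squared
```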


RMSProp

[Figure: optimization trajectories of SGD, SGD+Momentum, and RMSProp on the same loss surface.]


RMSProp

Compare them:
• Different algorithms will converge to different minima.
• Often, SGD and SGD with momentum will converge to the poorer minimum.
• In this visualization, RMSProp converges to the global minimum.

https://emiliendupont.github.io/2018/01/24/optimization-visualization/


Adam: RMSProp + Momentum

[Adam update shown in the original slide, with its momentum and RMSProp components highlighted and compared against SGD+Momentum and RMSProp.]

Kingma and Ba, "Adam: A method for stochastic optimization", ICLR 2015




Adam: RMSProp + Momentum

Q: What happens at t = 0? (Assume beta2 = 0.999.)
A: The second-moment estimate is initialized to zero, so after the first update it is still very small; dividing by its square root can make the very first step extremely large. The bias correction below addresses this.

Kingma and Ba, "Adam: A method for stochastic optimization", ICLR 2015


Adam: RMSProp + Momentum

[Adam update with its momentum, RMSProp, and bias-correction terms highlighted in the original slide.]

Bias correction: for the fact that first and second moment estimates start at zero.

Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3, 5e-4, or 1e-4 is a great starting point for many models!
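A sketch of the full Adam update with bias correction, following the standard formulation from the paper; the defaults and `eps` are assumptions:

```python
import numpy as np

def adam_step(w, dw, m1, m2, t, learning_rate=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum (m1) + RMSProp-style second moment (m2) + bias correction.

    t is the 1-based step counter; bias correction undoes the zero initialization of m1 and m2.
    """
    m1 = beta1 * m1 + (1 - beta1) * dw            # leaky average of gradients (momentum)
    m2 = beta2 * m2 + (1 - beta2) * dw * dw       # leaky average of squared gradients
    m1_hat = m1 / (1 - beta1 ** t)                # bias correction
    m2_hat = m2 / (1 - beta2 ** t)
    w = w - learning_rate * m1_hat / (np.sqrt(m2_hat) + eps)
    return w, m1, m2
```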


Adam

[Figure: optimization trajectories of SGD, SGD+Momentum, RMSProp, and Adam on the same loss surface.]


Optimization Algorithm Comparison

Algorithm     | Tracks first moments (Momentum) | Tracks second moments (Adaptive learning rates) | Leaky second moments | Bias correction
SGD           | no                              | no                                              | no                   | no
SGD+Momentum  | yes                             | no                                              | no                   | no
AdaGrad       | no                              | yes                                             | no                   | no
RMSProp       | no                              | yes                                             | yes                  | no
Adam          | yes                             | yes                                             | yes                  | yes


Summary

In practice:
• Adam is a good default choice in many cases
• SGD+Momentum can outperform Adam but may require more tuning

• Use Linear Models for image classification problems
• Use Loss Functions to express preferences over different choices of weights
• Use Stochastic Gradient Descent to minimize our loss functions and train the model


Next time:
Neural Networks
