Gradient Descent As Quadratic Approximation

1) Gradient descent is a general method for minimizing an objective function (loss) through iterative updates when an analytic solution is not available.
2) It works by taking steps proportional to the negative gradient of the loss, moving in the direction of steepest descent.
3) At each step, the parameters are updated according to the rule w_{t+1} = w_t − η ∇ℒ(w_t), where η is the learning rate controlling the step size.


Deep learning

3.5. Gradient descent

François Fleuret
https://fleuret.org/dlc/
We saw that training consists of finding the model parameters minimizing an
empirical risk or loss, for instance the mean-squared error (MSE)

ℒ(w, b) = (1/N) ∑_n ( f(x_n; w, b) − y_n )².

Other losses are more fitting for classification, certain regression problems, or
density estimation. We will come back to this.

So far we minimized the loss either with an analytic solution for the MSE, or
with ad hoc recipes for the empirical error rate (k-NN and perceptron).

François Fleuret Deep learning / 3.5. Gradient descent 1 / 13
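For concreteness, here is a minimal PyTorch sketch of this empirical risk for a linear model f(x; w, b) = x · w + b; the linear form, the helper name mse, and the tensor shapes (x of size N×D, y of size N) are illustrative assumptions, not taken from the slides.

import torch

def mse(x, y, w, b):
    # (1/N) ∑_n (f(x_n; w, b) − y_n)², with f(x; w, b) = x · w + b
    return ((x @ w + b - y) ** 2).mean()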


There is generally no ad hoc method. Logistic regression, for instance,

P_w(Y = 1 | X = x) = σ(w · x + b),  with  σ(x) = 1 / (1 + e^(−x)),

leads to the loss

ℒ(w, b) = − ∑_n log σ( y_n (w · x_n + b) ),

which cannot be minimized analytically.

The general minimization method used in such a case is gradient descent.

François Fleuret Deep learning / 3.5. Gradient descent 2 / 13
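Transcribed directly into PyTorch, this loss could look like the sketch below (labels y_n are assumed to be in {−1, +1}, as in the formula); a helper of this form could play the role of the loss(x, y, w, b) printed by the training loop later in the slides.

import torch

def loss(x, y, w, b):
    # ℒ(w, b) = − ∑_n log σ(y_n (w · x_n + b)), with y_n in {−1, +1}
    return - (y * (x @ w + b)).sigmoid().log().sum()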


Given a functional

f : ℝ^D → ℝ
x ↦ f(x_1, …, x_D),

its gradient is the mapping

∇f : ℝ^D → ℝ^D
x ↦ ( ∂f/∂x_1(x), …, ∂f/∂x_D(x) ).

François Fleuret Deep learning / 3.5. Gradient descent 3 / 13
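As a side note not on the slide, automatic differentiation can evaluate this mapping at a given point; below is a toy check with torch.autograd, where the functional f(x) = x_1² + 3 x_2 is just an illustrative choice.

import torch

# Gradient of f(x) = x1² + 3·x2, evaluated at x = (1, 2).
x = torch.tensor([1.0, 2.0], requires_grad=True)
f = x[0] ** 2 + 3 * x[1]
f.backward()
print(x.grad)  # tensor([2., 3.]) = (∂f/∂x1(x), ∂f/∂x2(x))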


To minimize a functional
ℒ : ℝ^D → ℝ,
gradient descent uses local linear information to iteratively move toward a
(local) minimum.

For w_0 ∈ ℝ^D, consider an approximation of ℒ around w_0

ℒ̃_{w_0}(w) = ℒ(w_0) + ∇ℒ(w_0)ᵀ (w − w_0) + (1/(2η)) ‖w − w_0‖².

Note that the chosen quadratic term does not depend on ℒ.

We have

∇ℒ̃_{w_0}(w) = ∇ℒ(w_0) + (1/η) (w − w_0),

which vanishes at the minimizer, leading to

argmin_w ℒ̃_{w_0}(w) = w_0 − η ∇ℒ(w_0).

François Fleuret Deep learning / 3.5. Gradient descent 4 / 13
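As a quick numerical sanity check, not from the slides: on a toy loss ℒ(w) = ‖w‖² / 2, the gradient of the surrogate ℒ̃_{w_0} indeed vanishes at w_0 − η ∇ℒ(w_0). The toy loss, w_0 and η below are arbitrary illustrative choices.

import torch

eta = 0.125
w0 = torch.tensor([1.0, -2.0])
g = w0.clone()  # ∇ℒ(w0) for the toy loss ℒ(w) = ‖w‖² / 2

def L_tilde(w):
    # ℒ̃_{w0}(w) = ℒ(w0) + ∇ℒ(w0)·(w − w0) + ‖w − w0‖² / (2η)
    return (w0 @ w0) / 2 + g @ (w - w0) + ((w - w0) @ (w - w0)) / (2 * eta)

w_star = (w0 - eta * g).requires_grad_()
L_tilde(w_star).backward()
print(w_star.grad)  # tensor([0., 0.]): the surrogate's gradient vanishes there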


The resulting iterative rule, which goes to the minimum of the approximation at
the current location, takes the form:

w_{t+1} = w_t − η ∇ℒ(w_t),

which corresponds intuitively to “following the steepest descent”.

This [most of the time] eventually ends up in a local minimum, and the choices
of w_0 and η are important.

François Fleuret Deep learning / 3.5. Gradient descent 5 / 13
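Below is a minimal PyTorch sketch of this rule on a one-dimensional toy loss; the loss (w − 3)², the starting point, and the number of steps are illustrative choices, not the setting of the figures that follow.

import torch

def L(w):
    return (w - 3.0) ** 2  # toy loss, minimized at w = 3

eta = 0.125
w = torch.tensor(0.0, requires_grad=True)

for t in range(12):
    L(w).backward()           # computes ∇ℒ(w_t) into w.grad
    with torch.no_grad():
        w -= eta * w.grad     # w_{t+1} = w_t − η ∇ℒ(w_t)
    w.grad.zero_()

print(w.item())  # approaches 3 as the number of steps grows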


η = 0.125

[Figure: twelve frames showing successive iterates w_0, …, w_11, with the loss ℒ and its local approximation ℒ̃ drawn at each step.]

François Fleuret Deep learning / 3.5. Gradient descent 6 / 13


η = 0.125

[Figure: a second example, again twelve frames of iterates w_0, …, w_11, with ℒ and its local approximation ℒ̃ drawn at each step.]

François Fleuret Deep learning / 3.5. Gradient descent 7 / 13


η = 0.5

[Figure: twelve frames of iterates w_0, …, w_11 with the larger step size, with ℒ and its local approximation ℒ̃ drawn at each step.]

François Fleuret Deep learning / 3.5. Gradient descent 8 / 13


[Figure: a two-dimensional example shown over four frames; both axes range from 0.0 to 1.0.]

François Fleuret Deep learning / 3.5. Gradient descent 9 / 13


We saw that the minimum of the logistic regression loss

ℒ(w, b) = − ∑_n log σ( y_n (w · x_n + b) )

does not have an analytic form.

François Fleuret Deep learning / 3.5. Gradient descent 10 / 13


We can derive

∂ℒ/∂b = − ∑_n u_n,  with  u_n = y_n σ(−y_n (w · x_n + b)),

∀d, ∂ℒ/∂w_d = − ∑_n v_{n,d},  with  v_{n,d} = x_{n,d} y_n σ(−y_n (w · x_n + b)),

which can be implemented as

def gradient(x, y, w, b):
    u = y * ( - y * (x @ w + b)).sigmoid()
    v = x * u.view(-1, 1) # Broadcasting
    return - v.sum(0), - u.sum(0)

and the gradient descent as

w, b = torch.randn(x.size(1)), 0
eta = 1e-1

for k in range(nb_iterations):
    print(k, loss(x, y, w, b))
    dw, db = gradient(x, y, w, b)
    w -= eta * dw
    b -= eta * db

François Fleuret Deep learning / 3.5. Gradient descent 11 / 13
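For reference, a hypothetical end-to-end run on synthetic two-dimensional data with labels in {−1, +1}, reusing the gradient function above; the data, seed, and number of iterations are illustrative choices, and the loss printout is omitted.

import torch

torch.manual_seed(0)
x = torch.cat((torch.randn(50, 2) + 1.5, torch.randn(50, 2) - 1.5))
y = torch.cat((torch.ones(50), -torch.ones(50)))

w, b = torch.randn(x.size(1)), 0.0
eta = 1e-1

for k in range(1000):
    dw, db = gradient(x, y, w, b)
    w = w - eta * dw
    b = b - eta * db

print(((x @ w + b).sign() == y).float().mean().item())  # training accuracy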


[Figure: training loss on a logarithmic scale (roughly 10⁻¹ to 10²) as a function of the number of steps, from 0 to 10,000.]

François Fleuret Deep learning / 3.5. Gradient descent 12 / 13


With 100 training points and η = 10⁻¹.

[Figure: frames shown after n = 0, 10, 10², 10³, and 10⁴ steps, plus a frame labeled LDA for comparison.]

François Fleuret Deep learning / 3.5. Gradient descent 13 / 13
The end
