Gradient Descent As Quadratic Approximation

1) Gradient descent is a general method for minimizing an objective function (loss) through iterative updates when an analytic solution is not available.
2) It works by taking steps proportional to the negative gradient of the loss, moving in the direction of steepest descent.
3) At each step, the parameters are updated according to the rule w_{t+1} = w_t − η ∇ℒ(w_t), where η is the learning rate controlling the step size.


Deep learning

3.5. Gradient descent

François Fleuret
https://fleuret.org/dlc/
We saw that training consists of finding the model parameters minimizing an
empirical risk or loss, for instance the mean-squared error (MSE)

ℒ(w, b) = (1/N) ∑_n ( f(x_n; w, b) − y_n )².

Other losses are more fitting for classification, certain regression problems, or
density estimation. We will come back to this.

So far we minimized the loss either with an analytic solution for the MSE, or
with ad hoc recipes for the empirical error rate (k-NN and perceptron).

François Fleuret Deep learning / 3.5. Gradient descent 1 / 13
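For concreteness, here is a minimal PyTorch sketch of this empirical risk for a linear model f(x; w, b) = x · w + b; the linear form, the helper name mse, and the tensor shapes (x of size N×D, y of size N) are illustrative assumptions, not taken from the slides.

import torch

def mse(x, y, w, b):
    # (1/N) ∑_n (f(x_n; w, b) − y_n)², with f(x; w, b) = x · w + b
    return ((x @ w + b - y) ** 2).mean()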


There is generally no ad hoc method. Logistic regression, for instance,

P_w(Y = 1 | X = x) = σ(w · x + b),  with  σ(x) = 1 / (1 + e^(−x)),

leads to the loss

ℒ(w, b) = − ∑_n log σ( y_n (w · x_n + b) ),

which cannot be minimized analytically.

The general minimization method used in such a case is gradient descent.

François Fleuret Deep learning / 3.5. Gradient descent 2 / 13
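Transcribed directly into PyTorch, this loss could look like the sketch below (labels y_n are assumed to be in {−1, +1}, as in the formula); a helper of this form could play the role of the loss(x, y, w, b) printed by the training loop later in the slides.

import torch

def loss(x, y, w, b):
    # ℒ(w, b) = − ∑_n log σ(y_n (w · x_n + b)), with y_n in {−1, +1}
    return - (y * (x @ w + b)).sigmoid().log().sum()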


Given a functional

f : ℝ^D → ℝ
x ↦ f(x_1, …, x_D),

its gradient is the mapping

∇f : ℝ^D → ℝ^D
x ↦ ( ∂f/∂x_1(x), …, ∂f/∂x_D(x) ).

François Fleuret Deep learning / 3.5. Gradient descent 3 / 13
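As a side note not on the slide, automatic differentiation can evaluate this mapping at a given point; below is a toy check with torch.autograd, where the functional f(x) = x_1² + 3 x_2 is just an illustrative choice.

import torch

# Gradient of f(x) = x1² + 3·x2, evaluated at x = (1, 2).
x = torch.tensor([1.0, 2.0], requires_grad=True)
f = x[0] ** 2 + 3 * x[1]
f.backward()
print(x.grad)  # tensor([2., 3.]) = (∂f/∂x1(x), ∂f/∂x2(x))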


To minimize a functional
ℒ : ℝ^D → ℝ,
gradient descent uses local linear information to iteratively move toward a
(local) minimum.

For w_0 ∈ ℝ^D, consider an approximation of ℒ around w_0

ℒ̃_{w_0}(w) = ℒ(w_0) + ∇ℒ(w_0)ᵀ (w − w_0) + (1/(2η)) ‖w − w_0‖².

Note that the chosen quadratic term does not depend on ℒ.

We have

∇ℒ̃_{w_0}(w) = ∇ℒ(w_0) + (1/η) (w − w_0),

which vanishes at the minimizer, leading to

argmin_w ℒ̃_{w_0}(w) = w_0 − η ∇ℒ(w_0).

François Fleuret Deep learning / 3.5. Gradient descent 4 / 13
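As a quick numerical sanity check, not from the slides: on a toy loss ℒ(w) = ‖w‖² / 2, the gradient of the surrogate ℒ̃_{w_0} indeed vanishes at w_0 − η ∇ℒ(w_0). The toy loss, w_0 and η below are arbitrary illustrative choices.

import torch

eta = 0.125
w0 = torch.tensor([1.0, -2.0])
g = w0.clone()  # ∇ℒ(w0) for the toy loss ℒ(w) = ‖w‖² / 2

def L_tilde(w):
    # ℒ̃_{w0}(w) = ℒ(w0) + ∇ℒ(w0)·(w − w0) + ‖w − w0‖² / (2η)
    return (w0 @ w0) / 2 + g @ (w - w0) + ((w - w0) @ (w - w0)) / (2 * eta)

w_star = (w0 - eta * g).requires_grad_()
L_tilde(w_star).backward()
print(w_star.grad)  # tensor([0., 0.]): the surrogate's gradient vanishes there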


The resulting iterative rule, which goes to the minimum of the approximation at
the current location, takes the form:

w_{t+1} = w_t − η ∇ℒ(w_t),

which corresponds intuitively to “following the steepest descent”.

This [most of the time] eventually ends up in a local minimum, and the choices
of w_0 and η are important.

François Fleuret Deep learning / 3.5. Gradient descent 5 / 13
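Below is a minimal PyTorch sketch of this rule on a one-dimensional toy loss; the loss (w − 3)², the starting point, and the number of steps are illustrative choices, not the setting of the figures that follow.

import torch

def L(w):
    return (w - 3.0) ** 2  # toy loss, minimized at w = 3

eta = 0.125
w = torch.tensor(0.0, requires_grad=True)

for t in range(12):
    L(w).backward()           # computes ∇ℒ(w_t) into w.grad
    with torch.no_grad():
        w -= eta * w.grad     # w_{t+1} = w_t − η ∇ℒ(w_t)
    w.grad.zero_()

print(w.item())  # approaches 3 as the number of steps grows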


η = 0.125

[Figure: twelve frames showing successive iterates w_0, …, w_11, with the loss ℒ and its local approximation ℒ̃ drawn at each step.]

François Fleuret Deep learning / 3.5. Gradient descent 6 / 13


η = 0.125

[Figure: a second example, again twelve frames of iterates w_0, …, w_11, with ℒ and its local approximation ℒ̃ drawn at each step.]

François Fleuret Deep learning / 3.5. Gradient descent 7 / 13


η = 0.5

[Figure: twelve frames of iterates w_0, …, w_11 with the larger step size, with ℒ and its local approximation ℒ̃ drawn at each step.]

François Fleuret Deep learning / 3.5. Gradient descent 8 / 13


[Figure: a two-dimensional example shown over four frames; both axes range from 0.0 to 1.0.]

François Fleuret Deep learning / 3.5. Gradient descent 9 / 13


We saw that the minimum of the logistic regression loss

ℒ(w, b) = − ∑_n log σ( y_n (w · x_n + b) )

does not have an analytic form.

François Fleuret Deep learning / 3.5. Gradient descent 10 / 13


We can derive

∂ℒ/∂b = − ∑_n u_n,  with  u_n = y_n σ(−y_n (w · x_n + b)),

∀d, ∂ℒ/∂w_d = − ∑_n v_{n,d},  with  v_{n,d} = x_{n,d} y_n σ(−y_n (w · x_n + b)),

which can be implemented as

def gradient(x, y, w, b):
    u = y * ( - y * (x @ w + b)).sigmoid()
    v = x * u.view(-1, 1) # Broadcasting
    return - v.sum(0), - u.sum(0)

and the gradient descent as

w, b = torch.randn(x.size(1)), 0
eta = 1e-1

for k in range(nb_iterations):
    print(k, loss(x, y, w, b))
    dw, db = gradient(x, y, w, b)
    w -= eta * dw
    b -= eta * db

François Fleuret Deep learning / 3.5. Gradient descent 11 / 13
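For reference, a hypothetical end-to-end run on synthetic two-dimensional data with labels in {−1, +1}, reusing the gradient function above; the data, seed, and number of iterations are illustrative choices, and the loss printout is omitted.

import torch

torch.manual_seed(0)
x = torch.cat((torch.randn(50, 2) + 1.5, torch.randn(50, 2) - 1.5))
y = torch.cat((torch.ones(50), -torch.ones(50)))

w, b = torch.randn(x.size(1)), 0.0
eta = 1e-1

for k in range(1000):
    dw, db = gradient(x, y, w, b)
    w = w - eta * dw
    b = b - eta * db

print(((x @ w + b).sign() == y).float().mean().item())  # training accuracy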


[Figure: training loss on a logarithmic scale (roughly 10⁻¹ to 10²) as a function of the number of steps, from 0 to 10,000.]

François Fleuret Deep learning / 3.5. Gradient descent 12 / 13


With 100 training points and η = 10⁻¹.

[Figure: frames shown after n = 0, 10, 10², 10³, and 10⁴ steps, plus a frame labeled LDA for comparison.]

François Fleuret Deep learning / 3.5. Gradient descent 13 / 13
The end
