Gradient Descent As Quadratic Approximation
François Fleuret
https://fleuret.org/dlc/
We saw that training consists of finding the model parameters minimizing an
empirical risk or loss, for instance the mean-squared error (MSE)
\[
\mathscr{L}(w, b) = \frac{1}{N} \sum_{n} \big( f(x_n; w, b) - y_n \big)^2 .
\]
Other losses are more fitting for classification, certain regression problems, or
density estimation. We will come back to this.
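As an illustration, here is a minimal PyTorch sketch of this MSE for a linear model f(x; w, b) = x·w + b; the linear form and the helper name mse_loss are assumptions made for this example, not definitions from the course.

import torch

# Hypothetical linear model f(x; w, b) = x @ w + b, used only to illustrate the MSE.
def mse_loss(x, y, w, b):
    pred = x @ w + b                  # model outputs f(x_n; w, b), one per sample
    return ((pred - y) ** 2).mean()   # average squared error over the N samples

# Toy usage on random data
x, y = torch.randn(100, 5), torch.randn(100)
w, b = torch.randn(5), torch.tensor(0.)
print(mse_loss(x, y, w, b))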
So far we minimized the loss either with an analytic solution for the MSE, or
with ad hoc recipes for the empirical error rate (k-NN and perceptron).
The general minimization method used in such a case is gradient descent.
Given a function

\[
f : \mathbb{R}^D \to \mathbb{R}, \quad x \mapsto f(x_1, \dots, x_D),
\]

its gradient is the mapping

\[
\nabla f : \mathbb{R}^D \to \mathbb{R}^D, \quad x \mapsto \left( \frac{\partial f}{\partial x_1}(x), \dots, \frac{\partial f}{\partial x_D}(x) \right).
\]
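As a small sketch, PyTorch's autograd can compute such a gradient numerically; the function g below is an arbitrary example chosen for illustration, not one from the course.

import torch

# Arbitrary function g : R^3 -> R, chosen only to illustrate autograd.
def g(x):
    return (x ** 2).sum() + x.prod()

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
g(x).backward()    # fills x.grad with the partial derivatives of g at x
print(x.grad)      # the gradient, a vector in R^3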
Given a current estimate w_0 and a step size η > 0, gradient descent relies on the local quadratic approximation

\[
\tilde{\mathscr{L}}_{w_0}(w) = \mathscr{L}(w_0) + \nabla \mathscr{L}(w_0)^{\top} (w - w_0) + \frac{1}{2 \eta} \| w - w_0 \|^2 .
\]

We have

\[
\nabla \tilde{\mathscr{L}}_{w_0}(w) = \nabla \mathscr{L}(w_0) + \frac{1}{\eta} (w - w_0),
\]

which leads to

\[
\operatorname*{argmin}_{w} \; \tilde{\mathscr{L}}_{w_0}(w) = w_0 - \eta \nabla \mathscr{L}(w_0).
\]
Iterating this update, $w_{t+1} = w_t - \eta \nabla \mathscr{L}(w_t)$, [most of the time] eventually ends up in a local minimum, and the choices of $w_0$ and $\eta$ are important.
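A minimal sketch of the resulting procedure, applied to an arbitrary toy function (an illustration, not the setting of the course):

import torch

# Arbitrary non-convex toy loss of a single parameter, for illustration only.
def toy_loss(w):
    return (w ** 2 - 1) ** 2 + 0.1 * w

w = torch.tensor(2.0, requires_grad=True)   # starting point w_0
eta = 1e-2                                  # step size / learning rate

for t in range(100):
    loss = toy_loss(w)
    loss.backward()                         # compute d(loss)/dw
    with torch.no_grad():
        w -= eta * w.grad                   # gradient step: w <- w - eta * grad
    w.grad.zero_()                          # reset the accumulated gradient

print(w.item())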
[Figures: the local quadratic approximation ℒ˜ of the loss ℒ at successive iterates w0, …, w11, shown for three gradient-descent runs.]
w, b = torch.randn(x.size(1)), 0   # random initial weights, zero bias
eta = 1e-1                          # learning rate

for k in range(nb_iterations):
    print(k, loss(x, y, w, b))      # monitor the loss at every step
    dw, db = gradient(x, y, w, b)   # gradient of the loss w.r.t. w and b
    w -= eta * dw                   # gradient step on the weights
    b -= eta * db                   # gradient step on the bias
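The helpers loss and gradient are defined elsewhere in the course; a hypothetical sketch of what they could look like for the MSE of a linear model f(x; w, b) = x·w + b (an assumption made here) is:

# Hypothetical definitions, assuming a linear model and the MSE above.
def loss(x, y, w, b):
    return ((x @ w + b - y) ** 2).mean()

def gradient(x, y, w, b):
    r = x @ w + b - y                        # residuals f(x_n; w, b) - y_n
    dw = 2 * (x * r.unsqueeze(1)).mean(0)    # d loss / d w
    db = 2 * r.mean()                        # d loss / d b
    return dw, db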
[Plot: training Loss (log scale, 10^-1 to 10^1) vs. Nb. of steps (0 to 10,000).]
[Figures: the model after n = 0, 10, 10^2, 10^3, and 10^4 gradient steps, with 100 training points and η = 10^-1, and the LDA solution for comparison.]
The end