Class 2
Jacob Whitehill
Gradient descent for 2-layer linear NNs
Gradient descent algorithm
• Set w to random values; call this initial choice $w^{(0)}$.
• Compute the gradient: $\nabla_w f(w^{(0)})$.
• Update w by moving opposite the gradient, multiplied by a learning rate $\epsilon$:
$$w^{(1)} \leftarrow w^{(0)} - \epsilon \nabla_w f(w^{(0)})$$
• Repeat:
$$w^{(2)} \leftarrow w^{(1)} - \epsilon \nabla_w f(w^{(1)})$$
$$w^{(3)} \leftarrow w^{(2)} - \epsilon \nabla_w f(w^{(2)})$$
$$\vdots$$
$$w^{(t)} \leftarrow w^{(t-1)} - \epsilon \nabla_w f(w^{(t-1)})$$
• …until convergence.
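As a concrete illustration, here is a minimal sketch of this loop in Python with NumPy; the objective's gradient, the stopping tolerance, and all constants are hypothetical stand-ins:

```python
import numpy as np

def gradient_descent(grad_f, w0, eps=0.1, tol=1e-6, max_iters=10_000):
    """Iterate w(t) <- w(t-1) - eps * grad_f(w(t-1)) until convergence."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(w)
        if np.linalg.norm(g) < tol:   # "...until convergence"
            break
        w = w - eps * g               # move opposite the gradient
    return w

# Toy example: f(w) = ||w||^2 has gradient 2w, so GD converges to 0.
w_final = gradient_descent(lambda w: 2 * w, w0=[3.0, -2.0])
```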
Gradient descent
• For a 2-layer linear NN, the gradient of fMSE w.r.t. w is:
" n ⇣
#
1 X ⌘2
(i) >
rw fMSE (y, ŷ; w) = rw x w y (i)
2n i=1
Xn ⇣ ⌘2
1 (i) >
= rw x w y (i)
2n i=1
1 Xn ⇣ ⌘
(i) (i) >
= x x w y (i)
n i=1
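The summation form translates directly into code. A minimal sketch, assuming X_rows is an n×m NumPy array holding one training example per row (the names are hypothetical):

```python
import numpy as np

def grad_mse_loop(X_rows, y, w):
    """(1/n) * sum_i x^(i) (x^(i)T w - y^(i)), one example at a time."""
    n = X_rows.shape[0]
    g = np.zeros_like(w)
    for i in range(n):
        residual = X_rows[i] @ w - y[i]   # x^(i)T w - y^(i)
        g += X_rows[i] * residual
    return g / n
```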
Gradient descent
• By using matrices, we can find a more compact notation for the gradient.
$$y = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}, \qquad X = \begin{bmatrix} x^{(1)} & \cdots & x^{(n)} \end{bmatrix}$$
(X collects the training inputs as its columns.)
• Now we can rewrite the gradient:
$$\nabla_w f_{\text{MSE}}(y, \hat{y}; w) = \frac{1}{n} \sum_{i=1}^{n} x^{(i)} \left( x^{(i)\top} w - y^{(i)} \right) = \frac{1}{n} X (X^\top w - y)$$
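The matrix form maps to a one-liner. A minimal sketch, where X stores one example per column to match the slide's convention; the trailing comment checks agreement with the loop version above:

```python
import numpy as np

def grad_mse_matrix(X, y, w):
    """(1/n) * X (X^T w - y), with examples as the columns of X."""
    n = X.shape[1]
    return X @ (X.T @ w - y) / n

# Sanity check against the per-example loop (which uses rows, hence X.T):
# rng = np.random.default_rng(0)
# X, y, w = rng.normal(size=(3, 5)), rng.normal(size=5), rng.normal(size=3)
# assert np.allclose(grad_mse_loop(X.T, y, w), grad_mse_matrix(X, y, w))
```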
Exercise
Gradient descent
• For the 2-layer NN below, let m = 2 and $w^{(0)} = [1\ 0]^\top$.
• Recall: $\nabla_w f_{\text{MSE}}(w) = \frac{1}{n} X (X^\top w - y)$
[Diagram: a 2-layer linear NN with inputs $x_1, \dots, x_m$, weights $w_1, \dots, w_m$, and output $\hat{y}$.]
Exercise
• Draw on paper a function (with one local minimum) such
that the magnitude of the gradient is NOT an indicator of
how far to move w so as to reach the local minimum.
Hyperparameter tuning
Hyperparameter tuning
[Plot: model accuracy as a function of a hyperparameter h.]
Hyperparameter tuning
• If you choose hyperparameters on the test set, you are
likely deceiving yourself about how good your model is.
Hyperparameter tuning
• Instead, you should use a separate dataset that is not part of the test set (a validation set) to choose hyperparameters.
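A minimal sketch of this workflow in Python; the data arrays, the model-training and accuracy helpers, the candidate values for h, and the split sizes are all hypothetical:

```python
import numpy as np

# Hypothetical setup: split the non-test data into train and validation.
rng = np.random.default_rng(0)
idx = rng.permutation(len(X_all))            # X_all, y_all: hypothetical data
train_idx, val_idx = idx[:800], idx[800:]

best_h, best_acc = None, -np.inf
for h in [0.01, 0.1, 1.0, 10.0]:             # candidate hyperparameter values
    model = train_model(X_all[train_idx], y_all[train_idx], h)   # hypothetical helper
    acc = accuracy(model, X_all[val_idx], y_all[val_idx])        # hypothetical helper
    if acc > best_acc:
        best_h, best_acc = h, acc
# Only after choosing best_h do we touch the test set, exactly once.
```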
Linear auto-regressive (AR) models
• In one classic prediction model, we use a fixed length of history (p) to predict the next value $x_t$:
$$\hat{x}_t = w_1 x_{t-1} + w_2 x_{t-2} + \dots + w_p x_{t-p}$$
[Diagram: a linear network with inputs $x_{t-1}, \dots, x_{t-p}$, weights $w_1, \dots, w_p$, and output $\hat{x}_t$.]
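A minimal sketch of AR(p) prediction in Python, assuming x is a 1-D NumPy array of past observations and w a length-p weight vector (both hypothetical):

```python
import numpy as np

def ar_predict(x, w):
    """Predict x_t from the previous p values: w1*x_{t-1} + ... + wp*x_{t-p}."""
    p = len(w)
    # x[-1] is x_{t-1}, x[-2] is x_{t-2}, ..., x[-p] is x_{t-p}
    return sum(w[k] * x[-(k + 1)] for k in range(p))

x = np.array([0.5, 0.8, 1.1])      # hypothetical history
w = np.array([0.9, -0.2])          # hypothetical AR(2) weights
x_hat = ar_predict(x, w)           # 0.9*1.1 + (-0.2)*0.8 = 0.83
```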
Auto-regression
• The essence of auto-regression is that we are using the past to predict the next future event.
Example
• Model: $\hat{x}_t = w_1 x_{t-1} + w_2 x_{t-2}$
Multivariate auto-regression
• $\hat{\mathbf{x}}_t = W^{(1)} \mathbf{x}_{t-1} + \dots + W^{(p)} \mathbf{x}_{t-p}$
Multivariate auto-regression
• Suppose each observation $\mathbf{x}_t$ has 2 components $(x_t^a, x_t^b)$, and that p = 2.
[Diagram: a fully connected linear network from inputs $x_{t-1}^a, x_{t-1}^b, x_{t-2}^a, x_{t-2}^b$ to outputs $\hat{x}_t^a, \hat{x}_t^b$.]
Exercise
• Recall: $\hat{\mathbf{x}}_t = W^{(1)} \mathbf{x}_{t-1} + \dots + W^{(p)} \mathbf{x}_{t-p}$
[Diagram: the same network from inputs $x_{t-1}^a, x_{t-1}^b, x_{t-2}^a, x_{t-2}^b$ to outputs $\hat{x}_t^a, \hat{x}_t^b$.]
Multivariate auto-regression
• We can alternatively represent this network with just a
single matrix of weights W if we “stack” the inputs:
• $\hat{\mathbf{x}}_t = W\, [\mathbf{x}_{t-1}^\top; \dots; \mathbf{x}_{t-p}^\top]^\top$
[Diagram: the same network drawn with a single weight matrix W from the stacked inputs to $\hat{x}_t^a, \hat{x}_t^b$.]
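A minimal sketch of the stacked form in Python; the dimensions and all values are hypothetical:

```python
import numpy as np

d, p = 2, 2                         # hypothetical: 2 components, history length 2
W = np.arange(d * d * p).reshape(d, d * p) / 10.0   # hypothetical d x (d*p) weights

x_tm1 = np.array([1.0, 2.0])        # x_{t-1}
x_tm2 = np.array([3.0, 4.0])        # x_{t-2}

z = np.concatenate([x_tm1, x_tm2])  # stack the history into one vector
x_hat = W @ z                       # single-matrix prediction of x_t

# Equivalent per-lag form: W^(1) x_{t-1} + W^(2) x_{t-2}
W1, W2 = W[:, :d], W[:, d:]
assert np.allclose(x_hat, W1 @ x_tm1 + W2 @ x_tm2)
```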
Auto-regression in deep learning
Stochastic gradient descent
Gradient descent
• With gradient descent, we only update the weights after
scanning the entire training set.
• This is slow.
• Procedure (sketched in code below):
1. Let $\tilde{n} \ll n$ equal the size of the mini-batch.
2. Randomize the order of the examples in the training set.
3. For e = 0 to numEpochs:
   I. For i = 0 to $\lceil n/\tilde{n} \rceil - 1$ (one epoch):
      A. Select a mini-batch containing the next $\tilde{n}$ examples.
      B. Compute the gradient on this mini-batch.
      C. Update the weights based on the current mini-batch gradient.
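A minimal sketch of this procedure in Python (NumPy) for the 2-layer linear NN with MSE loss; the data X (one example per column), y, and all constants are hypothetical:

```python
import numpy as np

def sgd(X, y, w, eps=0.01, batch_size=32, num_epochs=10):
    """Mini-batch SGD for the 2-layer linear NN with MSE loss."""
    rng = np.random.default_rng(0)
    n = X.shape[1]                          # examples stored as columns
    for epoch in range(num_epochs):
        order = rng.permutation(n)          # randomize example order
        for i in range(int(np.ceil(n / batch_size))):
            batch = order[i * batch_size:(i + 1) * batch_size]
            Xb, yb = X[:, batch], y[batch]
            g = Xb @ (Xb.T @ w - yb) / len(batch)   # mini-batch gradient
            w = w - eps * g                          # weight update
    return w
```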
SGD: learning rates
• Necessary conditions on the learning rates $\epsilon_t$:
$$\lim_{T \to \infty} \sum_{t=1}^{T} |\epsilon_t|^2 < \infty \qquad\qquad \lim_{T \to \infty} \sum_{t=1}^{T} |\epsilon_t| = \infty$$
Not too big: the sum of squared learning rates converges. Not too small: the sum of absolute learning rates grows to infinity.
SGD: learning rates
• One common learning rate “schedule” is to multiply $\epsilon$ by $c \in (0, 1)$ every k rounds.
• SGD may not fully converge, but the learned model might still perform well.
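A minimal sketch of such a step-decay schedule; the constants are hypothetical:

```python
def step_decay(eps0=0.1, c=0.5, k=1000):
    """Learning rate at round t: eps0 * c^(t // k)."""
    return lambda t: eps0 * (c ** (t // k))

eps_t = step_decay()
# eps_t(0) == 0.1, eps_t(1000) == 0.05, eps_t(2000) == 0.025, ...
```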
[Figure: a convex function and a non-convex function, side by side. Source: https://plus.maths.org/content/convexity]
Convexity in 1-d
• How can we tell if a 1-d function f is convex? For twice-differentiable f, it suffices that $f''(x) \ge 0$ for every x.
• More generally, for $f : \mathbb{R}^m \to \mathbb{R}$, f is convex if the Hessian matrix is positive semi-definite for every input x.
• For $f(x, y) = xy + x^2 - y^2$ (graphed below), the partial derivatives give the Hessian
$$H = \begin{bmatrix} 2 & 1 \\ 1 & -2 \end{bmatrix}$$
• Notice that H for this f does not depend on (x, y).
• Is there a v such that $v^\top H v < 0$? Yes. For example, $v = [1\ 2]^\top$:
$$v^\top H v = \begin{bmatrix} 1 & 2 \end{bmatrix} \begin{bmatrix} 2 & 1 \\ 1 & -2 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \end{bmatrix} = \begin{bmatrix} 4 & -3 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \end{bmatrix} = -2$$
• So H is not positive semi-definite, and f is not convex.
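A quick way to check this numerically is to inspect the eigenvalues of H; a minimal sketch with NumPy:

```python
import numpy as np

H = np.array([[2.0, 1.0],
              [1.0, -2.0]])         # Hessian of f(x, y) = xy + x^2 - y^2

eigvals = np.linalg.eigvalsh(H)     # eigenvalues of a symmetric matrix
print(eigvals)                      # one negative, one positive -> indefinite

v = np.array([1.0, 2.0])
print(v @ H @ v)                    # -2.0 < 0, so H is not PSD
```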
Example
• Graph of $f(x, y) = xy + x^2 - y^2$:
[3-D surface plot of f, a saddle-shaped surface.]
Convex ML models
[Figure: a non-convex function with a local maximum, a saddle point, a local minimum, and the global minimum labeled.]
Optimization: what can go wrong?
• In general ML and DL models, optimization is usually not so simple, due to:
2. Bad initialization of the weights w.
[Figure: two starting points on a non-convex curve, a "good" one that descends to the global minimum and a "not so good" one that descends only to a local minimum.]
Optimization: what can go wrong?
• In general ML and DL models, optimization is usually not so simple, due to:
3. Learning rate is too small.
[Figure: gradient descent taking many tiny steps along the curve, making little progress toward the global minimum.]
Optimization: what can go wrong?
• In general ML and DL models, optimization is usually not so simple, due to:
4. Learning rate is too large.
[Figure: gradient descent steps overshooting the global minimum back and forth, eventually jumping off the chart.]
Optimization: what can go wrong?
• With multidimensional weight vectors, badly chosen learning rates can cause more subtle problems.
[Figure: the update direction $-\nabla_w f(w)$ plotted in the $(w_1, w_2)$ plane.]