The document discusses the gradient descent algorithm for training 2-layer linear neural networks, detailing the steps for weight initialization, gradient computation, and iterative updates until convergence. It emphasizes the importance of hyperparameter tuning, explaining the distinction between model parameters and hyperparameters, and the need for separate datasets for training, validation, and testing. Additionally, it introduces linear auto-regressive models for predicting future values based on historical data, illustrating the concept of using past values for future predictions.

CS/DS 541: Class 2

Jacob Whitehill

Gradient descent for 2-layer linear NNs
Gradient descent algorithm
• Set w to random values; call this initial choice w^(0).
• Compute the gradient: ∇w f(w^(0))
• Update w by moving opposite the gradient, multiplied by a learning rate ε:
  w^(1) ⟵ w^(0) − ε ∇w f(w^(0))
• Repeat…
  w^(2) ⟵ w^(1) − ε ∇w f(w^(1))
  w^(3) ⟵ w^(2) − ε ∇w f(w^(2))
  …
  w^(t) ⟵ w^(t−1) − ε ∇w f(w^(t−1))
• …until convergence.
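The update rule above can be written as a short NumPy sketch (an illustration, not from the original slides; f_grad is an assumed callable that returns ∇w f(w), and the stopping test is one simple choice among many):

```python
import numpy as np

def gradient_descent(f_grad, w0, eps=0.1, max_iters=1000, tol=1e-8):
    """Iterate w <- w - eps * grad_f(w) until the update becomes tiny."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        step = eps * f_grad(w)           # move opposite the gradient
        w = w - step
        if np.linalg.norm(step) < tol:   # crude convergence test
            break
    return w
```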
Gradient descent
• For a 2-layer linear NN, the gradient of fMSE w.r.t. w is:
  ∇w fMSE(y, ŷ; w) = ∇w [ (1/2n) Σᵢ₌₁ⁿ (x^(i)ᵀ w − y^(i))² ]
                   = (1/2n) Σᵢ₌₁ⁿ ∇w (x^(i)ᵀ w − y^(i))²
                   = (1/n) Σᵢ₌₁ⁿ x^(i) (x^(i)ᵀ w − y^(i))
Gradient descent
• By using matrices, we can find a more compact notation for the gradient.
• Define the design/feature matrix X and label vector y:
  X = [x^(1) ⋯ x^(n)]  (each column is one training example)
  y = [y^(1), …, y^(n)]ᵀ
• Now we can rewrite the gradient:
  ∇w fMSE(y, ŷ; w) = (1/n) Σᵢ₌₁ⁿ x^(i) (x^(i)ᵀ w − y^(i))
                   = (1/n) X (Xᵀ w − y)
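A small sketch (with made-up data) can confirm that the summation form and the matrix form of the gradient agree:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5                                  # m features, n examples
X = rng.normal(size=(m, n))                  # columns are the examples x^(i)
y = rng.normal(size=n)
w = rng.normal(size=m)

# Summation form: (1/n) sum_i x^(i) (x^(i)^T w - y^(i))
grad_sum = sum(X[:, i] * (X[:, i] @ w - y[i]) for i in range(n)) / n

# Matrix form: (1/n) X (X^T w - y)
grad_mat = X @ (X.T @ w - y) / n

print(np.allclose(grad_sum, grad_mat))       # True
```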
Exercise

Gradient descent
• For the 2-layer NN below, let m=2 and w^(0) = [1, 0]ᵀ.
• Compute the updated weight vector w^(1) after one iteration of gradient descent using (1/2) MSE loss, a single training example (x, y) = ([2, 3]ᵀ, 4), and learning rate ε = 0.1.
• Recall: ∇w fMSE(w) = (1/n) X(Xᵀ w − y)
[Figure: a 2-layer linear NN with input layer x1, x2, …, xm, weights w1, w2, …, wm, and output layer ŷ.]
Solution
∇w fMSE(w) = (1/n) X(Xᵀ w − y)
w^(1) ⟵ w^(0) − ε ∇w fMSE(w^(0))
      = [1, 0]ᵀ − 0.1 · [2, 3]ᵀ ([2 3][1, 0]ᵀ − 4)    (here n = 1)
      = [1 + 0.1·2·2, 0 + 0.1·3·2]ᵀ
      = [1.4, 0.6]ᵀ
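The arithmetic can be checked in a few lines of NumPy (a sketch using the slide's single training example):

```python
import numpy as np

X = np.array([[2.0], [3.0]])    # the single example as a column (m=2, n=1)
y = np.array([4.0])
w0 = np.array([1.0, 0.0])       # w^(0)
eps = 0.1

grad = X @ (X.T @ w0 - y) / X.shape[1]   # (1/n) X (X^T w - y) = [-4, -6]
w1 = w0 - eps * grad
print(w1)                                # [1.4 0.6]
```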
Exercise
• Draw on paper a function (with one local minimum) such that the magnitude of the gradient is NOT an indicator of how far to move w so as to reach the local minimum.

Exercise
• Draw on paper a function such that this property is false.
Hyperparameter tuning
• The values we optimize when training a machine learning model — e.g., w and b for linear regression — are the parameters of the model.
• There are also values related to the training process itself — e.g., learning rate ε, batch size ñ, regularization strength ɑ — which are the hyperparameters of training.
Hyperparameter tuning
• Both the parameters and hyperparameters can have a huge impact on model performance on test data.
• Ideally, we would hope that the accuracy of the system varies smoothly with each hyperparameter value, e.g.:
[Figure: a smooth curve of Accuracy vs. hyperparameter h.]
Hyperparameter tuning
• However, in the real world, the hyperparameter landscape can be quite erratic, e.g.:
[Figure: a jagged, erratic curve of Accuracy vs. hyperparameter h.]
Hyperparameter tuning
• If you choose hyperparameters on the test set, you are likely deceiving yourself about how good your model is.
• This is a subtle but very dangerous form of ML cheating.
Hyperparameter tuning
• Instead, you should use a separate dataset that is not part of the test set to choose hyperparameters.
• The most common approach is to use training, validation, & testing sets:
  • Training (typically 70-80%): optimization of parameters
  • Validation (typically 5-10%): tuning of hyperparameters
  • Testing (typically 5-10%): evaluation of the final model
• For comparison with other researchers' methods, this partition should be fixed.
Training/validation/testing sets
• Hyperparameter tuning works as follows:
  1. Choose a set of hyperparameter configurations.
  2. For each configuration h:
     • Train the parameters on the training set using h.
     • Evaluate the model on the validation set.
     • If performance is better than what we got with the best h so far (h*), then save h as h*.
  3. Train a model with h*, and evaluate its accuracy A on the testing set. (You can train either on training data, or on training+validation data.)
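A runnable sketch of this loop is below; train and evaluate are toy stand-ins (ridge-style linear regression scored by negative MSE) chosen only to make the example self-contained, not part of the original procedure:

```python
import numpy as np

def train(data, alpha):
    """Toy stand-in: ridge-regularized linear regression."""
    X, y = data
    m = X.shape[0]
    return np.linalg.solve(X @ X.T + alpha * np.eye(m), X @ y)

def evaluate(w, data):
    """Toy stand-in metric: negative MSE (higher is better)."""
    X, y = data
    return -np.mean((X.T @ w - y) ** 2)

def tune(configs, train_set, val_set, test_set):
    best_h, best_score = None, -np.inf
    for h in configs:                    # step 2: try each configuration h
        w = train(train_set, **h)
        score = evaluate(w, val_set)     # validate; never touch the test set
        if score > best_score:           # keep the best h so far (h*)
            best_h, best_score = h, score
    w = train(train_set, **best_h)       # step 3 (could also use train+val)
    return best_h, evaluate(w, test_set)

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100))
y = np.array([1.0, -2.0]) @ X + rng.normal(0.0, 0.1, 100)
train_set = (X[:, :70], y[:70])      # ~70% training
val_set = (X[:, 70:85], y[70:85])    # ~15% validation
test_set = (X[:, 85:], y[85:])       # ~15% testing
configs = [{"alpha": a} for a in (0.01, 0.1, 1.0)]
print(tune(configs, train_set, val_set, test_set))
```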
Linear auto-regressive (AR) models
• In some application areas, we have a time series of values x1, x2, …, xt, but no "labels" y.
• Task: Given the known values of x1, x2, …, xt−1, we want to predict the value of xt.
• A classic example is stock market price prediction.
Linear auto-regressive (AR) models
• In one classic prediction model, we use a fixed length of history (p) to predict the next value xt:
  x̂t = w1 xt−1 + w2 xt−2 + … + wp xt−p
• We can model this prediction using the same 2-layer neural network as before:
[Figure: inputs xt−1, …, xt−p connected through weights w1, …, wp to the output x̂t.]
Auto-regression
• The essence of auto-regression is that we are using the past to predict the next future event.
• We can apply this recursively to predict infinitely into the future.
• Example for p=2, assuming we already know x1, x2 (later predictions are fed back in as inputs):
  • x̂3 = w1 x2 + w2 x1
  • x̂4 = w1 x̂3 + w2 x2
  • x̂5 = w1 x̂4 + w2 x̂3
  • …
Example
• Model: x̂t = w1 xt−1 + w2 xt−2
• For w1=0.4, w2=−0.5, x1=0, and x2=2, what are the predictions for x3, x4, and x5?
  • x̂3 = (0.4)(2) + (−0.5)(0) = 0.8
  • x̂4 = (0.4)(0.8) + (−0.5)(2) = −0.68
  • x̂5 = (0.4)(−0.68) + (−0.5)(0.8) = −0.672
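A minimal sketch of this recursion (the function name and interface are illustrative assumptions) reproduces the numbers above:

```python
def ar_predict(history, w, num_steps):
    """Recursive AR prediction. w = [w1, ..., wp]; history holds the
    known values, newest last; predictions are fed back as inputs."""
    xs = list(history)
    p = len(w)
    for _ in range(num_steps):
        # x̂_t = w1*x_{t-1} + w2*x_{t-2} + ... + wp*x_{t-p}
        xs.append(sum(w[j] * xs[-1 - j] for j in range(p)))
    return xs[len(history):]

print(ar_predict([0.0, 2.0], w=[0.4, -0.5], num_steps=3))
# approximately [0.8, -0.68, -0.672]
```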
Multivariate auto-regression
• The value xt of each time-step can also be a vector, in which case we multiply the values of previous timesteps with matrices:
  x̂t = W^(1) xt−1 + … + W^(p) xt−p
Multivariate auto-regression
• Suppose each observation xt has 2 components (xta, xtb), and that p=2.
• Here is the corresponding neural network:
[Figure: inputs xt−1a, xt−1b, xt−2a, xt−2b fully connected to outputs xta, xtb.]
Exercise
• Recall: x̂t = W^(1) xt−1 + … + W^(p) xt−p
• To which matrix (W^(1), W^(2), or neither) do the first 4 edges correspond? Answer: W^(1).
[Figure: the same network as above; the first 4 edges are those leaving xt−1a and xt−1b.]
Multivariate auto-regression
• We can alternatively represent this network with just a single matrix of weights W if we "stack" the inputs:
  x̂t = W [xt−1ᵀ ; … ; xt−pᵀ]ᵀ
[Figure: the same network, with the single matrix W mapping the stacked inputs (xt−1a, xt−1b, xt−2a, xt−2b) to (xta, xtb).]
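A quick NumPy sketch (with random matrices) confirms that the stacked form is equivalent to the sum of per-timestep matrix products:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 2                                      # components per observation; p = 2
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
x_tm1, x_tm2 = rng.normal(size=d), rng.normal(size=d)

pred_separate = W1 @ x_tm1 + W2 @ x_tm2    # W^(1) x_{t-1} + W^(2) x_{t-2}
W = np.hstack([W1, W2])                    # single d x (p*d) matrix
pred_stacked = W @ np.concatenate([x_tm1, x_tm2])

print(np.allclose(pred_separate, pred_stacked))   # True
```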
Auto-regression in deep learning
• Auto-regression is used frequently in deep learning, especially for machine translation and text generation (e.g., ChatGPT).
Stochastic gradient descent

Gradient descent
• With gradient descent, we only update the weights after scanning the entire training set.
• This is slow.
• If the training set contains 20K examples, then the gradient is an average over 20K examples.
• How much would the gradient really change if we just used, say, 10K examples? 5K examples? 128 examples?
  ∇w fMSE(y, ŷ; w) = (1/n) X(Xᵀ w − y)   ⟸ average over the entire training set
Stochastic gradient descent
• This is the idea behind stochastic gradient descent (SGD):
  • Randomly sample a small (ñ ≪ n) mini-batch (or sometimes just batch) of training examples.
  • Estimate the gradient on just the mini-batch.
  • Update weights based on the mini-batch gradient estimate.
  • Repeat.


Stochastic gradient descent
• In practice, SGD is usually conducted over multiple epochs.
• An epoch is a single pass through the entire training set.
• Procedure:
  1. Let ñ ≪ n equal the size of the mini-batch.
  2. Randomize the order of the examples in the training set.
  3. For e = 0 to numEpochs:
     I. For i = 0 to ⌈n/ñ⌉ − 1 (one epoch):
        A. Select a mini-batch J containing the next ñ examples.
        B. Compute the gradient on this mini-batch: (1/ñ) Σ_{i∈J} ∇W f(y^(i), ŷ^(i); W)
        C. Update the weights based on the current mini-batch gradient.
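Putting the whole procedure together for the 2-layer linear NN with (1/2) MSE loss, here is a minimal NumPy sketch (synthetic data; the hyperparameter values are arbitrary choices for illustration):

```python
import numpy as np

def sgd(X, y, batch_size=2, eps=0.01, num_epochs=100, seed=0):
    """Mini-batch SGD for the 2-layer linear NN with (1/2) MSE loss.
    X holds one training example per column (shape m x n)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = rng.normal(size=m)                        # random initial weights w^(0)
    order = rng.permutation(n)                    # randomize the example order
    num_batches = int(np.ceil(n / batch_size))    # ceil(n/ñ) rounds per epoch
    for _ in range(num_epochs):
        for i in range(num_batches):              # one epoch
            J = order[i * batch_size:(i + 1) * batch_size]
            Xb, yb = X[:, J], y[J]                # the next ñ examples
            grad = Xb @ (Xb.T @ w - yb) / len(J)  # mini-batch gradient estimate
            w -= eps * grad                       # update the weights
    return w

# Tiny synthetic check: the weights should approach w_true = [2, -1].
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 64))
y = np.array([2.0, -1.0]) @ X
print(sgd(X, y))    # approximately [ 2. -1.]
```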


SGD versus GD: example
• Suppose our training set contains n=8 examples.
• Here is how regular gradient descent would proceed:
  • Initialize weights w^(0) to random values.
  • For each round:
    • Compute gradient on all n examples.
    • Update weights: w^(t+1) ⟵ w^(t) − ϵ ∇w f
[Figure: training examples 1–8; every round uses all of them.]
SGD versus GD: example
• Suppose our training set contains n=8 examples with ñ = 2.
• Here is how stochastic gradient descent would proceed:
  • Initialize weights w^(0) to random values.
  • Randomize the order of the training data.
  • For each epoch (e=1, …, E):
    • For each round (r=1, …, ⌈n/ñ⌉):
      • Compute gradient on the next ñ examples.
      • Update weights: w^(t+1) ⟵ w^(t) − ϵ ∇w f̃
[Figure: the shuffled training examples (4, 1, 3, 5, 7, 6, 8, 2) are consumed ñ=2 at a time per round; after ⌈n/ñ⌉ = 4 rounds, epoch e=2 begins again from the top.]
Stochastic gradient descent
• Despite "noise" (statistical inaccuracy) in the mini-batch gradient estimates, we will still converge to a local minimum.
• The noise can even sometimes help us to get out of worse local minima and into better ones.
• Training can be much faster than regular gradient descent because we adjust the weights many times per epoch.
SGD: learning rates
• With SGD, our learning rate needs to be annealed (reduced slowly over time) to guarantee convergence.
• Otherwise we might just oscillate forever in weight space.
• Necessary conditions:
  lim_{T→∞} Σ_{t=1}^T |ε_t|² < ∞   (not too big: the sum of squared learning rates converges)
  lim_{T→∞} Σ_{t=1}^T |ε_t| = ∞   (not too small: the sum of absolute learning rates grows to infinity)
SGD: learning rates
• One common learning rate "schedule" is to multiply ε by c ∈ (0, 1) every k rounds.
  • This is called exponential decay.
• Another possibility (which avoids the issue) is to set the number of epochs T to a finite number.
  • SGD may not fully converge, but the machine might still perform well.
• There are many other strategies.
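A sketch of such an exponential-decay schedule (the function and constants are illustrative):

```python
def exponential_decay(eps0, c, k, t):
    """Learning rate at round t: multiply eps0 by c every k rounds."""
    return eps0 * c ** (t // k)

for t in (0, 10, 20, 30):
    print(t, exponential_decay(eps0=0.1, c=0.5, k=10, t=t))
# 0 0.1 / 10 0.05 / 20 0.025 / 30 0.0125
```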
Optimization of ML models
• With linear regression, the cost function fMSE has a single local minimum w.r.t. the weights w.
• As long as our learning rate is small enough, we will eventually find the optimal w.
Convex ML models
• Linear regression has a loss function that is convex.
• With a convex function f, every local minimum is also a global minimum.
[Figure: a convex function beside a non-convex function. Source: https://plus.maths.org/content/convexity]
• Convex functions are ideal for conducting gradient descent.
Convexity in 1-d
• How can we tell if a 1-d function f is convex?
• What property of the slope of f ensures there is only one local minimum?
• From left to right, the slope of f never decreases.
  ⟹ the derivative of the slope is always non-negative
  ⟹ the second derivative of f is always non-negative
Convexity in higher dimensions
• For higher-dimensional f, convexity is determined by the Hessian of f:
  H[f] = [ ∂²f/∂x1∂x1 … ∂²f/∂x1∂xm
           ⋮               ⋮
           ∂²f/∂xm∂x1 … ∂²f/∂xm∂xm ]
• For f : ℝᵐ → ℝ, f is convex if the Hessian matrix is positive semi-definite for every input x.
Positive semi-definite
• Positive semi-definite is the matrix analog of being "non-negative".
• A real symmetric matrix A is positive semi-definite (PSD) if (equivalent conditions):
  • All its eigenvalues are ≥ 0.
    • In particular, if A is diagonal, then its eigenvalues are just the diagonal elements, so A is PSD iff those are all ≥ 0.
  • For every vector v: vᵀAv ≥ 0.
    • Therefore: If there exists any vector v such that vᵀAv < 0, then A is not PSD.
Example
• Suppose f(x, y) = 3x² + 2y² − 2.
• Then the first derivatives are: ∂f/∂x = 6x, ∂f/∂y = 4y.
• The Hessian matrix is therefore:
  H = [ ∂²f/∂x∂x  ∂²f/∂x∂y ; ∂²f/∂y∂x  ∂²f/∂y∂y ] = [ 6 0 ; 0 4 ]
• Notice that H for this f does not depend on (x, y).
• Also, H is a diagonal matrix (with 6 and 4 on the diagonal). Hence, the eigenvalues are just 6 and 4. Since they are both non-negative, f is convex.
Example
• Graph of f(x, y) = 3x² + 2y² − 2:
[Figure: an upward-opening bowl (elliptic paraboloid).]
Example
• Suppose f(x, y) = xy + x² − y².
• Then the first derivatives are: ∂f/∂x = y + 2x, ∂f/∂y = x − 2y.
• The Hessian matrix is therefore:
  H = [ 2 1 ; 1 −2 ]
• Notice that H for this f does not depend on (x, y).
• Does there exist any vector v s.t. vᵀHv < 0?
• Yes. For example, v = [1, 2]ᵀ:
  vᵀHv = [1 2] [ 2 1 ; 1 −2 ] [1 ; 2] = [1 2] [4 ; −3] = 4 − 6 = −2 < 0
  Hence H is not PSD, and f is not convex.
Example
• Graph of f(x, y) = xy + x² − y²:
[Figure: a saddle-shaped surface.]
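Both examples can be checked numerically by inspecting the eigenvalues of their Hessians (a NumPy sketch):

```python
import numpy as np

# The two example Hessians from the slides above.
H_bowl = np.array([[6.0, 0.0], [0.0, 4.0]])     # f(x,y) = 3x^2 + 2y^2 - 2
H_saddle = np.array([[2.0, 1.0], [1.0, -2.0]])  # f(x,y) = xy + x^2 - y^2

print(np.linalg.eigvalsh(H_bowl))     # [4. 6.] -> all >= 0, so f is convex
print(np.linalg.eigvalsh(H_saddle))   # one eigenvalue < 0 -> not PSD

# The witness vector from the slide: v^T H v < 0 confirms H_saddle is not PSD.
v = np.array([1.0, 2.0])
print(v @ H_saddle @ v)               # -2.0
```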
Convex ML models
• Prominent convex models in ML include linear regression, logistic regression, softmax regression, and support vector machines (SVMs).
• However, models in deep learning are generally not convex.
• Much DL research is devoted to how to optimize the weights to deliver good generalization performance.
Optimization: what can go wrong?
• In general ML and DL models, optimization is usually not so simple, due to:
  1. Presence of multiple local minima & saddle points
[Figure: a wiggly 1-d function annotated with its global maximum, a local maximum, a saddle point, a local minimum, and the global minimum.]
Optimization: what can go wrong?
• In general ML and DL models, optimization is usually not so simple, due to:
  2. Bad initialization of the weights w.
[Figure: the same function with a "good" starting point that descends into the global minimum and a "not so good" one that descends into a local minimum.]
Optimization: what can go wrong?
• In general ML and DL models, optimization is usually not so simple, due to:
  3. Learning rate is too small.
[Figure: a sequence of tiny steps creeping slowly toward the global minimum.]
Optimization: what can go wrong?
• In general ML and DL models, optimization is usually not so simple, due to:
  4. Learning rate is too large (the iterates can even diverge off the chart).
[Figure: steps that overshoot the global minimum, bouncing from side to side with growing amplitude.]
Optimization: what can go wrong?
• With multidimensional weight vectors, badly chosen learning rates can cause more subtle problems.
• Consider the cost f whose level sets are shown below. Which direction does the gradient point?
[Figure: circular level sets in the (w1, w2) plane; ∇w f(w) points outward, perpendicular to the level sets, so −∇w f(w) points toward the center.]
Optimization: what can go wrong?
• With multidimensional weight vectors, badly chosen learning rates can cause more subtle problems.
• Gradient descent guides the search along the direction of steepest decrease in f.
[Figure: with spherical level sets, the steps head straight for the minimum.]

Optimization: what can go wrong?
• But what if the level sets are ellipsoids instead of spheres?
• If we are lucky, we still converge quickly.
[Figure: ellipsoidal level sets where the steps still reach the minimum in a few iterations.]

Optimization: what can go wrong?
• But what if the level sets are ellipsoids instead of spheres?
• If we are unlucky, convergence is very slow.
[Figure: elongated ellipsoidal level sets where the steps bounce between the sides of the valley and make slow progress toward the minimum.]
Curvature
• The problem is that gradient descent only considers slope (a 1st-order effect), i.e., how f changes with w.
• The gradient does not consider how the slope itself changes with w (a 2nd-order effect).
• The higher-order derivatives, including the Hessian H, determine the curvature of f.
[Figure: a curved loss surface with the gradient ∇w f(w) drawn at one point.]
Optimization: what can we do?
• To accelerate optimization of the weights, we can either:
  • Alter the curvature of the loss by transforming the input data.
  • Change our optimization method to account for the curvature.
• Both of these strategies play an important role in deep learning.
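To make the first strategy concrete: for the 2-layer linear NN, the Hessian of fMSE is (1/n) X Xᵀ (the derivative of the gradient (1/n) X(Xᵀw − y)), so rescaling the input features directly alters the curvature. A sketch with synthetic data:

```python
import numpy as np

# Standardizing the inputs shrinks the condition number of the Hessian
# (1/n) X X^T, turning elongated level sets into nearly spherical ones.
rng = np.random.default_rng(2)
n = 1000
X = np.vstack([rng.normal(0.0, 1.0, n),      # feature 1: std 1
               rng.normal(0.0, 100.0, n)])   # feature 2: std 100

H_raw = X @ X.T / n
X_std = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
H_std = X_std @ X_std.T / n

print(np.linalg.cond(H_raw))   # huge: elongated level sets, slow descent
print(np.linalg.cond(H_std))   # near 1: nearly spherical level sets
```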
