
COMP 6321 Machine Learning

Basic Machine Learning Concepts

Computer Science & Software Engineering


Concordia University, Winter 2023

1
Summary of the last episode….

● What is machine learning?

● Types of problems (classification, regression)

● Types of learning (supervised, unsupervised)

● Dataset and features

● Objective functions

● Parameters

● Training

2
Outline

● Capacity, Underfitting, Overfitting, Regularization

● Optimization

● Gradient Descent

● Hyperparameters

● Gradient Descent Variants

3
Basic Machine Learning Concepts:
Capacity

4
Capacity

● Different machine learning algorithms have different hypothesis spaces.

● The capacity (also called representational capacity) attempts to quantify how “big” (or “rich”) the hypothesis space is.

● It is usually not intended as the number of functions in the hypothesis space (which can easily be infinite even for simple machine learning models).

● Instead, it more often refers to the variability in terms of “family” of functions (expressive power, richness).

● For instance, a complex model that can implement linear, exponential, sinusoidal, and logarithmic functions has a larger capacity than one that implements linear functions only.
5
VC Dimension

● One popular measure of the capacity of a model is the Vapnik–Chervonenkis (VC) dimension.

● It is defined as the cardinality of the largest set of points that a binary classifier can shatter.

● A set of points is shattered by the classifier if for all ways of splitting the examples into positive
and negative subsets there exists a perfect classifier.

● If we have a set of N points X = [x1, x2, …, xN]ᵀ, we say that the VC dimension is N if at least one configuration of those data points can be shattered, but no set of N+1 points can be shattered.

6
VC Dimension

Example: a binary linear classifier in a 2D space.

This classifier implements linear decision boundaries (a line separating the positive from the negative region).

● Let’s now compute the VC dimension of this simple model.

7
VC Dimension

● Let’s start by drawing 2 random points and assigning random labels (positive or negative) to them.

How many label configurations do we have? 2² = 4.

Can we draw a perfect linear classifier for all the label configurations? Yes, for every one of them.
8
VC Dimension

● Let’s draw 3 random points!

How many label configurations do we have? 2³ = 8.

Can we draw a perfect linear classifier for all the label configurations? Yes, for every one of them.

9
VC Dimension

● Note that with some special point configurations (e.g., three collinear points with alternating labels), you cannot find a perfect classifier that works for all the label configurations.

● We cannot classify such a set correctly for all the label configurations.

● However, by definition, it is enough to find at least one set of points that can be shattered (i.e., classified correctly under all label configurations).

10
VC Dimension

● Let’s draw 4 random points!

How many label configurations do we have? 2⁴ = 16.

Can we draw a perfect linear classifier for all the label configurations? No: some configurations (e.g., an XOR-like labeling in which diagonally opposite points share the same label) cannot be separated by a line.

● Try different points and you will see that you cannot find any configuration of 4 points that leads to a perfect classifier for all the label configurations.

11
VC Dimension

● 4 points cannot be shattered by this linear classifier, while 3 points can: the VC dimension of a binary linear classifier in 2D is therefore 3.

● It can be shown that for a binary linear classifier operating on D-dimensional inputs:

VC dimension = D + 1

● The VC dimension thus depends on “how” complex the classifier can be.

● The larger D, the larger the set of parameters of the classifier.
12
Basic Machine Learning Concepts:
Generalization, Overfitting, Underfitting

13
Generalization

● Training a machine learning model often requires solving an optimization problem: we have to find the parameters of the function f that minimize the objective function using the training data.

● However, we are more interested in the performance achieved on data never seen before.

● Generalization is the ability of a machine learning algorithm to perform well on new, previously unseen data.

● A machine learning model generalizes well if the test loss is low enough (according to our application).

Training Loss: the objective function computed on the training set.
Test Loss: the objective function computed on the test set.
14
Underfitting

● By analyzing training and test losses, we can identify some “pathologies” that often affect machine learning models.

● One of these conditions is called underfitting.

● It happens when the model cannot achieve a sufficiently low training loss.

● In both the regression and the classification examples shown on the slide, the function found by the machine learning algorithm is too simple to explain the training data well.

15
Overfitting

● Another pathology is overfitting.

● It happens when the gap between the training and test losses is too large.

● In both the regression and the classification examples shown on the slide, the learned function is too complex: it explains the training data very well but fails to generalize.

● In an extreme case, the model stores the training samples and just behaves as a memory, without generalization capabilities.
16
Underfitting, Overfitting, Capacity

● Underfitting and Overfitting are connected to the capacity of the model.

● Intuitively:

Too Low Capacity → Underfitting
Proper Capacity → Proper Fitting
Too High Capacity → Overfitting

● For each task, we have to choose a model with the proper complexity.

Underfitting (low capacity), Proper Fitting (proper capacity), Overfitting (high capacity)

17
Underfitting, Overfitting, Dataset size

● Underfitting and Overfitting are influenced by the number of training samples as well.

● Intuitively:

More Training Examples → Better Generalization
Few Training Examples → Overfitting

● Small number of training samples: the training loss is low, but we are in an overfitting regime because the test loss is high.

● High number of training samples: the training loss is also low, but this time we have good generalization because the test loss is low too.

18
Regularization

● Regularization techniques aim to counteract overfitting (and thus improve generalization).

● Different techniques have been proposed in the machine learning literature (e.g., L1 regularization, L2
regularization).

● One way to regularize is to express some preference for some solutions over others (using prior
knowledge).

● For instance, we can penalize complexity: among the candidate functions f1, …, f6 in the hypothesis space, prefer the simplest ones.

Occam’s razor (William of Ockham, c. 1287-1347): among competing hypotheses that explain the training data equally well, we should choose the “simplest” one.

● This selection is based on prior knowledge about what a proper function for my problem could look like.

19
Basic Machine Learning Concepts:
Optimization

20
Optimization

● Training a machine learning model often requires solving an optimization problem.

For parametric machine learning, we want to find the parameters of the function f that minimize the objective function:

θ* = argmin_θ J(θ)

This is usually a difficult problem:

● The objective function might have local minima, maxima, and saddle points.

● In a real machine learning problem, we might have tons of parameters to optimize.

● For instance, in modern deep learning, we might even have billions of parameters.

21
Critical Points
Single Parameter Case

● The derivative tells us how the objective function J changes with a little change in the parameter θ:

dJ/dθ = lim (Δθ → 0) [J(θ + Δθ) − J(θ)] / Δθ

If this limit exists for every value of θ, the function is said to be differentiable.

● Critical points: points where dJ/dθ = 0.

● Local Minimum: a critical point where J(θ) is lower than at all the neighboring points.

● Local Maximum: a critical point where J(θ) is higher than at all the neighboring points.

● Saddle point: a critical point that is neither a maximum nor a minimum.

● Plateau: a wide, flat region where the value of J(θ) is constant.
22
Gradient
Multiple Parameters

● When we have more than one parameter, we have to compute the partial derivatives with respect to all the parameters:

∇J(θ) = [∂J/∂θ1, …, ∂J/∂θM]ᵀ

● Each partial derivative ∂J/∂θi tells us how J changes when we apply a little perturbation to θi.

● The gradient generalizes the notion of derivative.

● It is a vector that points in the direction of the greatest increase of the objective.
23
Critical Points in Multi-dimensional Spaces
Multiple Parameters

● Critical points: points where ∇J(θ) = 0.

● In high-dimensional spaces, there is a proliferation of local minima, maxima, saddle points, and plateaus.

● In particular, there is a proliferation of saddle points.

● Saddle points are a local minimum along one dimension (or cross-section of the objective function) and a local maximum along another one.

● The ratio between the expected number of saddle points and the number of local minima grows exponentially with dimensionality.

● This makes the optimization problem very challenging.

24
How can we solve the optimization problem?

25
Analytical Solution

● How can we solve the optimization problem?

We can try to find an analytical solution to the problem:

1. We compute an expression for the gradient ∇J(θ).

2. We find a closed-form expression that tells us where all the critical points are: ∇J(θ) = 0.

3. We test all the critical points and choose the one that minimizes the objective J.

Most of the time, we cannot find a closed-form analytical expression that solves this problem (root finding).

Closed-form expressions only exist for simple functions (e.g., linear).

26


Numerical Optimization

● How can we solve the optimization problem?

● Fortunately, we can try to find a numerical solution to this problem.

● With this approach, we accept the idea of finding an approximate solution.

● When using numerical methods, we start from a set of candidate solutions and progressively improve them until we think the problem is solved well enough.

27
Numerical Optimization
Naive Approach

Idea 1: Try all the parameter values θ and choose the set that optimizes the objective J.

● Very often the parameters are continuous, so we have an infinite number of parameter configurations.

Idea 2: (Randomly) sample N parameter configurations and take the set that optimizes the objective J.

● The number of parameter combinations grows exponentially with the number of parameters.

● In a high-dimensional space, the probability of finding a good set of parameters by random sampling is almost zero.

● What’s the probability of a monkey typing Shakespeare?
28
Numerical Optimization

● Numerical Optimization is a big research field and many approaches have been proposed so far.

● Some attempt to find a global optimum (e.g., simulated annealing, ant colony optimization, genetic algorithms, particle swarm optimization). This is usually computationally demanding.

● Others can find local optima only (e.g., hill climbing, coordinate descent, gradient descent, the conjugate gradient method, Newton’s methods).

● In this case, we accept a “good enough” sub-optimal solution.

● Local optimization is much faster than global optimization.

● It is thus the dominant approach in current machine learning.

29
Gradient Descent

● We have seen that the gradient gives us useful information: it points in the direction of the greatest increase of the objective.

● If we want to minimize J, we might want to take a little step in the opposite direction:

θ_new = θ_current − η ∇J(θ_current)

where η is the learning rate and ∇J(θ) is the gradient.
30
Gradient Descent
Algorithm

1. Choose the learning rate η.
2. Randomly initialize θ.
3. Compute the gradient ∇J(θ).
4. Update the parameters: θ ← θ − η ∇J(θ).
5. If the solution is good enough, stop; otherwise, go back to step 3.

● We start from a random solution and progressively improve it until convergence (a minimal code sketch of this loop follows below).

● The algorithm is sensitive to the initialization point.

31
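As an added illustration (not from the original slides), here is a minimal sketch of this loop in Python/NumPy for a generic objective; grad_J is a hypothetical placeholder for the gradient of your own objective.

import numpy as np

def gradient_descent(grad_J, theta_init, lr=0.01, n_steps=1000, tol=1e-6):
    # Start from a (possibly random) initial solution.
    theta = np.asarray(theta_init, dtype=float)
    for _ in range(n_steps):
        g = grad_J(theta)            # step 3: compute the gradient
        theta = theta - lr * g       # step 4: update the parameters
        if np.linalg.norm(g) < tol:  # crude convergence check
            break
    return theta

# Example: minimize J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_star = gradient_descent(lambda t: 2 * (t - 3.0), theta_init=[0.0])
print(theta_star)  # ~ [3.]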
Learning Rate

● Setting a proper learning rate η is crucial for gradient descent. A common choice is to set it to a constant value (e.g., η = 0.01).

(Figure: three plots of J(θ) illustrating the effect of different learning rates.)
32
Stochastic Gradient Descent

● To improve generalization, modern machine learning models are trained on large datasets.

● In standard gradient descent, we update the parameters based on the gradient.

● To compute the gradient, we need to process all the N training examples.

● The objective function is often decomposed as a sum over training samples (empirical risk minimization):

J(θ) = (1/N) Σᵢ L(f(xᵢ; θ), yᵢ)

● The total gradient is the average of the gradients computed for each training sample:

∇J(θ) = (1/N) Σᵢ ∇L(f(xᵢ; θ), yᵢ)

● The complexity of the gradient computation is O(N).

33


Stochastic Gradient Descent

● What about approximating the gradient using only a little training data?

(Figure: the training set as a matrix whose rows are feature vectors x with labels y; a highlighted subset of rows forms a minibatch.)

● Minibatch: the set of training samples used for computing the gradient.

● Batch size: the number of samples in the minibatch.

● Epoch: a complete pass through all the training data. At the end of the first epoch, we have used all of the data exactly once.

● We often train the machine learning model over multiple epochs.


37
Stochastic Gradient Descent

● With SGD we use noisy versions of the gradient.

● This introduces some randomness in the learning process that helps the algorithm escape from saddle points and local minima.

● SGD naturally introduces some regularization that yields better generalization properties.

38
Batch Size

● The batch size manages the trade-off between the “accuracy” and the computational cost of the updates.

● Batch size = N: if the batch size is equal to the number of training samples N, we have standard gradient descent. “Accurate” updates, but slow.

● 1 < Batch size < N: the best value is normally in between the two extremes. We must choose the batch size carefully because it has a remarkable impact on convergence speed and performance.

● Batch size = 1: we update the weights after processing every single sample. This is called online SGD. “Noisy” updates, but fast.
39
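To make the minibatch mechanics concrete, here is an added sketch (not from the slides) of one epoch of minibatch SGD in NumPy; grad_J_batch is a hypothetical function returning the gradient averaged over a minibatch.

import numpy as np

def sgd_epoch(theta, X, y, grad_J_batch, lr=0.01, batch_size=32, rng=None):
    # Shuffle the data so every epoch visits minibatches in a new order.
    rng = rng or np.random.default_rng()
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        # Gradient estimated on the minibatch only: noisy but cheap.
        g = grad_J_batch(theta, X[batch], y[batch])
        theta = theta - lr * g
    return theta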
Advantages and Limitations

Limitations

● Gradient descent can get stuck in local optima.

● It depends on proper initialization.

● It requires tuning the learning rate (and the batch size for SGD).

● It is difficult to apply to non-differentiable loss functions.

Advantages

● It is computationally efficient (computing the gradient is usually fast).

● It works well in practice (even with nonconvex objectives); e.g., it has proved capable of escaping from saddle points.

From the “Deep Learning” book:

“In the past the application of gradient descent to nonconvex optimization problems was regarded as foolhardy or unprincipled. Today, we know that machine learning models work very well when trained with gradient descent.”

“The optimization algorithm may not be guaranteed to arrive at even a local minimum in a reasonable amount of time, but it often finds a very low value of the cost function quickly enough to be useful.”

40
Example: Linear Least Squares

41
Example - Linear Least Squares
● Let’s apply gradient descent to a simple linear regression problem.

● The machine learning model maps an input x to an output ŷ:

ŷ = f(x) = w1 x + w0

● Training set: inputs x = [x1, …, xN] with labels y = [y1, …, yN].

● Objective (MSE, written here with a 1/2 factor so the gradients below carry no extra factor of 2):

J(w0, w1) = (1/2N) Σᵢ (ŷᵢ − yᵢ)²

42
Example - Linear Least Squares
● To apply gradient descent, we need to compute the gradient of the objective function J:

∂J/∂w0 = (1/N) Σᵢ (ŷᵢ − yᵢ)

∂J/∂w1 = (1/N) Σᵢ (ŷᵢ − yᵢ) xᵢ

44
Example - Linear Least Squares
● Now that we have the gradient, we can start the gradient descent training:

import numpy as np

def linear_model(x, w0, w1):
    # Linear prediction: y_hat = w1 * x + w0
    return w1 * x + w0

# Hypothetical training data for illustration (the original data are not shown on the slide)
rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=100)
y_train = 2.0 * x_train + 1.0 + 0.1 * rng.normal(size=100)

# Initial values
w0 = 1.0
w1 = 0.5

N_epochs = 10
lr = 0.05

train_loss = []
test_loss = []

for epoch in range(N_epochs):
    # compute the predictions
    y_hat = linear_model(x_train, w0, w1)

    # compute the gradient
    grad_w0 = (y_hat - y_train).mean()
    grad_w1 = ((y_hat - y_train) * x_train).mean()

    # parameter updates
    w0 = w0 - lr * grad_w0
    w1 = w1 - lr * grad_w1

(Figure: the fitted line at epochs 1, 2, 3, and 10.)

45
Example - Linear Least Squares
● Note: sometimes it is convenient to write the linear model in vector form. For a 1D input x, we define:

x̃ = [1, x]ᵀ    w = [w0, w1]ᵀ    ŷ = wᵀx̃

● Appending the constant 1 to the input is needed to manage the intercept term w0.

47
Example - Linear Least Squares
● The same trick applies to multiple D-dimensional inputs:

x̃ = [1, x1, …, xD]ᵀ    w = [w0, w1, …, wD]ᵀ    ŷ = wᵀx̃

● The leading 1 is needed to vectorize the intercept term.

● The number of parameters for this model is P = D + 1 (where D is the feature dimension).

48


Example - Linear Least Squares
● In the multidimensional case, we can generalize the gradient as:

∇w J = (1/N) Σᵢ (ŷᵢ − yᵢ) x̃ᵢ

● If we set D = 1 (so x̃ = [1, x]ᵀ), we obtain the equations seen for the 1D case.

49
Example - Linear Least Squares
● We can also write the gradient equations in matrix form, where X is the N × P matrix whose rows are the vectors x̃ᵢ and y is the vector of labels:

∇w J = (1/N) Xᵀ(Xw − y) = (1/N) (XᵀXw − Xᵀy)

50
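As a quick numerical check of the matrix form (an added sketch with hypothetical toy data):

import numpy as np

rng = np.random.default_rng(0)
N = 3
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # rows are [1, x_i]
y = rng.normal(size=N)
w = np.array([1.0, 0.5])

residual = X @ w - y

# Per-sample form: average of (y_hat_i - y_i) * x_tilde_i
grad_sum = (residual[:, None] * X).mean(axis=0)

# Matrix form: (1/N) X^T (Xw - y)
grad_mat = X.T @ residual / N

print(np.allclose(grad_sum, grad_mat))  # True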
Example - Linear Least Squares
● Let’s do a sanity check on the dimensionalities to convince ourselves that the two expressions are equivalent.

● Assume we have 2 parameters, w0 and w1.

● What is the expected dimensionality of the gradient? (2 × 1)

● What is the dimensionality of w? (2 × 1)

● Assume we have 3 examples. What is the dimensionality of the feature matrix X? (3 × 2)

51
Example - Linear Least Squares
● Checking the dimensions: Xᵀ is (2 × 3), X is (3 × 2), w is (2 × 1), and y is (3 × 1). Therefore XᵀXw is (2 × 1) and Xᵀy is (2 × 1), so the result is (2 × 1), which is the expected dimensionality for the gradient.

● Why do we want expressions that contain matrices? For the sake of compactness, and because we can take advantage of fast matrix multiplication libraries.

● If you expand the computations of both expressions, you will see the exact same operations (do it as an additional exercise).

52
Example - Linear Least Squares

● This problem is simple enough to be solved in closed form with an analytical expression. Setting ∇w J = 0 gives:

w = (XᵀX)⁻¹Xᵀy = X⁺y

● X⁺ is the Moore-Penrose pseudoinverse, a generalization of the notion of “inverse” to non-square matrices.

● We can solve this simple problem in one shot (without using gradient descent), just by solving a system of linear equations.

● This is possible only for such a simple machine learning model. For more complex ones, a closed-form solution does not exist.

53
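A short added sketch of the closed-form solution in NumPy (with hypothetical toy data; np.linalg.lstsq solves the least-squares system in a numerically stable way):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=100)

X = np.column_stack([np.ones_like(x), x])   # rows are [1, x_i]

# One-shot solution via the pseudoinverse:
w_pinv = np.linalg.pinv(X) @ y

# Equivalent, numerically preferable route:
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_pinv, w_lstsq)  # both close to [w0, w1] = [1.0, 2.0]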
Example - Linear Least Squares

For this kind of simple model (a linear model trained with MSE), it can be shown that the objective function is convex:

● The function has only one minimum, which is the global one.

● This happens only in such simple cases.

● More often, we have to solve non-convex optimization problems.

54
Hyperparameters and Validation Set

56
Hyperparameters

● Machine learning models have to learn the parameters θ that implement the desired input-output mapping (e.g., the slope w1 and the intercept w0 of our linear model).

● There are also special “parameters” that control the learning algorithm itself.

● These “special parameters” are called hyperparameters.

● In the context of SGD, the learning rate, the batch size, and the number of epochs are examples of hyperparameters.

● We cannot compute a gradient for these variables, so users have to set them manually.

57
Hyperparameters

How do we choose the hyperparameters?

● One way is to perform several training experiments with different sets of hyperparameters and choose the best one.

● To select the best one, we cannot use the performance achieved on the training set, because that increases the risk of overfitting.

● We also cannot use the performance achieved on the test set, because we would then overestimate the actual performance of the system.


58
Validation Set

● We can employ a third set, called validation set, to choose the best set of hyperparameters.

● The validation set is normally extracted from the training set (e.g., 10-20% of the training data are devoted to validation).

● The training set is used to find the best parameters, the validation set is used only to select the
hyperparameters.

59
Hyperparameter Search

● Searching for the best hyperparameter is usually expensive because we have to train the model
multiple times before finding the best configuration.

● One possible way is to initialize the hyperparameters with “reasonable” values (based on our
experience or default settings suggested in the literature).

● Then, we can fine-tune the initial configuration through a hyperparameter search (a simple random-search sketch follows this list):

1. Manual Search

2. Grid Search

3. Random Search

4. Bayesian Optimization

60
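A minimal random-search sketch (an added illustration; train_and_validate is a hypothetical function that trains with the given hyperparameters and returns the validation loss):

import numpy as np

def random_search(train_and_validate, n_trials=20, rng=None):
    rng = rng or np.random.default_rng()
    best_hparams, best_loss = None, np.inf
    for _ in range(n_trials):
        hparams = {
            "lr": 10 ** rng.uniform(-4, -1),                # log-uniform learning rate
            "batch_size": int(rng.choice([16, 32, 64, 128])),
            "n_epochs": int(rng.integers(5, 50)),
        }
        loss = train_and_validate(hparams)                  # evaluated on the validation set
        if loss < best_loss:
            best_hparams, best_loss = hparams, loss
    return best_hparams, best_loss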
Hyperparameter Search

● To summarize, this is what we do when we develop a machine learning model:

1. Choose the hyperparameters.
2. Train the model (to find θ) on the training set.
3. Check the performance on the validation set.
4. If the performance is not good enough, go back to step 1.
5. Once it is, check the final performance on the test set.
61
Hyperparameters

Parameters:

● They are part of the model f(x, θ).

● They are estimated during training, using the training set.

● In gradient descent, we train the model by computing the gradient of the objective over the parameters.

● Examples: the weights w0 and w1.

Hyperparameters:

● They are external to the model f(x, θ).

● They are not estimated during training, but during the hyperparameter search (performed on the validation set).

● We cannot compute the gradient over the hyperparameters.

● Examples: the learning rate, batch size, and number of epochs.

62
Basic Machine Learning Concepts:
Variants and Extensions of Gradient Descent

63
SGD Extensions and Variants

Several improvements have been proposed to the vanilla SGD algorithm:

● Early Stopping

● Learning Rate Annealing

● SGD with Momentum

● Adaptive Learning Rate methods (e.g., AdaGrad, RMSProp, Adam)

● Second Order Methods (e.g., Newton’s methods)

64
Early Stopping
● Often we iterate gradient descent for a predefined number of epochs (which is one of the hyperparameters of the system).

● However, if the number of epochs is too high, we might end up in an overfitting regime.

● If the number of epochs is too low, we might end up in an underfitting regime.

● We can monitor the performance on the validation set after each epoch and stop training when the validation performance starts to get worse.

● This strategy is known as early stopping.

● It is one of the most commonly used forms of regularization.
65
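A sketch of early stopping with a patience counter (an added illustration; train_one_epoch and validation_loss are hypothetical functions):

def train_with_early_stopping(theta, train_one_epoch, validation_loss,
                              max_epochs=100, patience=5):
    best_theta, best_loss, bad_epochs = theta, float("inf"), 0
    for epoch in range(max_epochs):
        theta = train_one_epoch(theta)
        loss = validation_loss(theta)      # monitored on the validation set
        if loss < best_loss:
            best_theta, best_loss, bad_epochs = theta, loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                      # validation loss stopped improving
    return best_theta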
Learning Rate Annealing
● Changing the learning rate over the epochs can improve performance and reduce training time.

● We normally reduce the learning rate as training progresses.

● This operation is known as learning rate annealing.

● Several schedules have been proposed, such as exponential decay and linear decay.

● Sometimes we reduce the learning rate only when some condition is met (e.g., little improvement on the validation set). This is called “new-bob” annealing.

66
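For instance, exponential decay can be written as η_t = η_0 · kᵗ. A tiny added sketch (with hypothetical values for η_0 and k):

def exponential_decay(lr0, decay_rate, epoch):
    # lr_t = lr0 * decay_rate ** epoch (decay_rate < 1 shrinks the rate each epoch)
    return lr0 * decay_rate ** epoch

for epoch in range(5):
    print(epoch, exponential_decay(0.1, 0.9, epoch))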
SGD with Momentum

67
Momentum

● SGD has trouble navigating areas where the surface curves much more steeply in one dimension than in another (a narrow valley).

● At a point P in such a valley, the partial derivative with respect to θ1 has a small magnitude, because the surface is rather flat along that dimension.

● The partial derivative with respect to θ2 has a large magnitude, because the surface is quite steep along that dimension.

● As a result, gradient descent does a little update for θ1 and a larger one for θ2.

● This causes slow convergence, as we lose time jumping back and forth between the sides of the narrow valley.
68
Momentum

● The method of momentum tackles this problem by accumulating an exponentially decaying moving average of the past gradients:

v_t = γ v_{t−1} + η ∇J(θ)
θ ← θ − v_t

where γ is the momentum term, η the learning rate, ∇J(θ) the gradient, and v_{t−1} the previous velocity.

● The velocity v_t is the vector containing the updates to perform.

● This time the updates do not depend only on the learning rate and the gradient, but also on the previous updates v_{t−1} (weighted by the factor γ).

69
Momentum

● The magnitude of the update v_t depends on how the current gradient aligns with the previous velocity:

● If the previous update v_{t−1} points in a direction very different from the current gradient, we do a little update.

● If the previous update v_{t−1} points in a direction similar to the current gradient, we do a big update.

70
Momentum

● The effect of the momentum is to dampen the oscillations observed with plain SGD and reach the minimum faster.

● Why? Inside the narrow valley, two consecutive gradients point in very different directions; as we have seen, this leads to smaller updates, which minimizes the jumps over the sides of the valley.

● The basic idea of momentum is the following:

- When there is “agreement” between the current and previous gradients, we are “safe” to take a big step.

- If there is “disagreement”, it is better to be “prudent” and take little steps.
71
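A sketch of the momentum update in NumPy (an added illustration; grad_J is a hypothetical gradient function and theta a NumPy array):

import numpy as np

def sgd_momentum(grad_J, theta, lr=0.01, gamma=0.9, n_steps=1000):
    v = np.zeros_like(theta)            # initial velocity
    for _ in range(n_steps):
        g = grad_J(theta)
        v = gamma * v + lr * g          # decaying moving average of past gradients
        theta = theta - v               # update with the velocity
    return theta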
Nesterov Momentum

● A popular variation of the standard momentum is the so-called Nesterov momentum:

Standard momentum: v_t = γ v_{t−1} + η ∇J(θ)

Nesterov momentum: v_t = γ v_{t−1} + η ∇J(θ − γ v_{t−1})

● Here we try to “anticipate” the next move by evaluating the gradient at the look-ahead point θ − γ v_{t−1}, which acts as a “correction factor” to the standard momentum.

● This anticipatory update prevents us from going too fast.
72
Adaptive Learning Rate

73
Adaptive Learning Rate
● In SGD, the learning rate η is the same constant value for all the parameters θ = [θ1, …, θi, …, θP]ᵀ.

● However, each parameter is different from the others.

● Idea: why not use a different learning rate for each parameter?

Standard SGD: θ ← θ − η g, where η is a constant value.

SGD with adaptive learning rate: θ ← θ − η ⊙ g, where η is a vector and ⊙ denotes element-wise multiplication.

74
AdaGrad
● In a real case, we have millions or even billions of parameters; we cannot set their learning rates manually.

● We need a method that assigns the learning rate to each parameter automatically.

● AdaGrad (Duchi et al., 2011) proposed to do it this way:

r ← r + g ⊙ g
θ ← θ − (η / (δ + √r)) ⊙ g

where δ is a small constant (for numerical stability).

● We individually scale each parameter update using the historical values of the squared gradient magnitude.

75
AdaGrad

● Flat region → small gradients → large updates: when the squared magnitude of the gradient is, on average, small, it is “safe” to do large updates.

● Steep region → large gradients → little updates: when the squared magnitude of the gradient is, on average, large, it is “safer” to do small updates.

● Problem: the updates tend to get smaller and smaller over time, as we keep accumulating the squared magnitude of the gradients (a positive quantity).

● This decrease is excessive for many practical applications.
76
RMSProp

● RMSProp mitigates this issue by changing the squared-gradient accumulation into an exponential moving average:

r ← ρ r + (1 − ρ) g ⊙ g
θ ← θ − (η / (δ + √r)) ⊙ g

where ρ is the decay rate.

● With the exponential moving average, we give more “weight” to the most recent updates and less weight to the older ones.

● Good default values are η = 0.001 and ρ = 0.9.

● RMSProp has been shown to work well in practice for real machine learning methods.

77
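A sketch of the RMSProp update (an added illustration; grad_J is a hypothetical gradient function):

import numpy as np

def rmsprop(grad_J, theta, lr=0.001, rho=0.9, delta=1e-8, n_steps=1000):
    r = np.zeros_like(theta)                    # moving average of squared gradients
    for _ in range(n_steps):
        g = grad_J(theta)
        r = rho * r + (1 - rho) * g * g         # exponential moving average
        theta = theta - lr * g / (delta + np.sqrt(r))
    return theta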
Adam

● Adam is an extension of the RMSProp optimizer that also considers momentum.

● First-order moment (similar to momentum, but with exponential weighting):

s ← ρ1 s + (1 − ρ1) g

where ρ1 s weights the previous gradients and (1 − ρ1) g the current gradient.

● The term s is big if the gradients point in the same direction, and small if they point in different directions.

78
Adam

● Second-order moment (this term is the same as the one used in RMSProp):

r ← ρ2 r + (1 − ρ2) g ⊙ g

● The term r is big if the squared magnitude of the gradient is big, and small otherwise.

79
Adam

● The parameter update combines the two moments (ŝ and r̂ are the bias-corrected moments defined on the next slide):

θ ← θ − η ŝ / (δ + √r̂)

Suggested values: ρ1 = 0.9, ρ2 = 0.999, η = 0.001.

● The update considers exponential moving averages of the gradients (first-order moment) and of the squared magnitude of the gradients (second-order moment).

● With RMSProp, the direction of the update depends only on the current gradient (as if ρ1 = 0), while the step size also depends on the history of the squared gradients.

● With Adam, both the direction of the update and the step size depend on the past gradients.

80
Adam

● Normally, s and r are initialized to 0.

● This adds a bias, especially during the initial time steps and when ρ1 and ρ2 are close to 1.

● We can compensate for this bias in the following way:

ŝ = s / (1 − ρ1ᵗ)    r̂ = r / (1 − ρ2ᵗ)

where t is the update number (e.g., first update t = 1, second t = 2, etc.).

● The bias correction has an effect only in the first part of training: when t grows, s and ŝ get closer and closer.

● Correcting this bias is not that crucial in practice.

81
Adam

● Adam often works better than other optimizers in real machine learning problems.

● It requires setting ρ1, ρ2, and η. However, the suggested default values often work well (ρ1 = 0.9, ρ2 = 0.999, η = 0.001).

● It requires storing s and r, which are vectors whose size corresponds to the number of parameters θ to optimize.

● If we have millions or billions of parameters, this can be quite memory-demanding.

82
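Putting the pieces together, a sketch of the full Adam update (an added illustration; grad_J is a hypothetical gradient function):

import numpy as np

def adam(grad_J, theta, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8, n_steps=1000):
    s = np.zeros_like(theta)                     # first-order moment
    r = np.zeros_like(theta)                     # second-order moment
    for t in range(1, n_steps + 1):
        g = grad_J(theta)
        s = rho1 * s + (1 - rho1) * g            # moving average of gradients
        r = rho2 * r + (1 - rho2) * g * g        # moving average of squared gradients
        s_hat = s / (1 - rho1 ** t)              # bias correction
        r_hat = r / (1 - rho2 ** t)
        theta = theta - lr * s_hat / (delta + np.sqrt(r_hat))
    return theta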
Second Order Methods

83
Second Order Methods

● We have seen that the gradient (based on first-order partial derivatives) provides useful information that we can use to minimize our objective.

● What about using the second-order derivatives?

● The gradient is a vector containing the first-order partial derivatives:

∇J(θ) = [∂J/∂θ1, …, ∂J/∂θP]ᵀ

● The Hessian is a symmetric square matrix of second-order partial derivatives:

H_ij = ∂²J / (∂θi ∂θj)

84
Second Order Methods

● The second-order derivative tells us how the derivative itself changes when we apply a little change to the input.

● It measures the curvature of the objective function J(θ) around the point θ.

(Figure: single-parameter plots of J(θ) with different curvatures.)
85
Single parameter
Second Order Methods
● The information from the second derivative can be used to classify critical points (second derivative test). At a critical point θ1:

● J″(θ1) < 0 → Local Maximum

● J″(θ1) > 0 → Local Minimum

● J″(θ1) = 0 → Inconclusive (saddle point or flat region?)

86
Multiple parameters
Second Order Methods
● In the multi-dimensional case, the test involves the eigenvalues of the Hessian:

● If all eigenvalues are positive → Local Minimum

● If all eigenvalues are negative → Local Maximum

● If at least one eigenvalue is positive and at least one is negative → Saddle Point

● In the special case of a diagonal Hessian, the eigenvalues are just the elements on the diagonal.

87
Multiple parameters Second Order Methods

● The condition for a saddle point is less restrictive than the one needed for local minima and maxima.

● Intuitively, it is significantly easier to find points where at least one eigenvalue is positive and at least one is negative (saddle points) than points where all eigenvalues are positive (minima) or all negative (maxima).

● Now we can better understand why saddle points are much more common than local minima and maxima in high-dimensional spaces.

88
Single parameter
Newton's Method

● The optimization methods that use both the gradient and the Hessian are called second order methods.

● A popular one is called Newton’s method.

● We can approximate the objective with a Taylor expansion (up to the second order) around the current point θ0 and jump directly to the expected minimum:

J_quad(θ) = J(θ0) + J′(θ0)(θ − θ0) + ½ J″(θ0)(θ − θ0)²

● If the function is convex around θ0 (positive second derivative), we can find the minimum of this quadratic approximation by solving dJ_quad/dθ = 0.

89
Single parameter
Newton's Method

● Solving dJ_quad/dθ = 0 gives the update:

θ* = θ0 − J′(θ0) / J″(θ0)

● The update equation is similar to gradient descent.

● The main difference is that the “learning rate” is not specified manually but automatically determined by the second derivative (1/J″(θ0) plays the role of the learning rate).

● Similar to gradient descent, we can iterate multiple times until convergence.

90
Newton's Method
High-dimensional case

● Minimizing the quadratic approximation around θ0 in one step gives:

θ* = θ0 − H⁻¹∇J(θ0)

● If H is positive definite, this will find the minimum.

Algorithm:
1. Compute the gradient ∇J(θ).
2. Compute the Hessian H.
3. Compute the Hessian inverse H⁻¹.
4. Update the parameters: θ ← θ − H⁻¹∇J(θ).
91
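A sketch of the iterated Newton update (an added illustration; grad_J and hessian_J are hypothetical functions returning the gradient vector and the Hessian matrix):

import numpy as np

def newtons_method(grad_J, hessian_J, theta, n_steps=20):
    for _ in range(n_steps):
        g = grad_J(theta)                # gradient vector, shape (P,)
        H = hessian_J(theta)             # Hessian matrix, shape (P, P)
        # Solve H d = g instead of explicitly inverting H (cheaper and more stable).
        d = np.linalg.solve(H, g)
        theta = theta - d                # Newton update
    return theta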
Newton's Method
Issues

● Newton’s method sounds appealing, but in practice it suffers from several issues:

● If the second derivative is negative, we reach a local maximum and not a local minimum.

● In the multi-dimensional case, this happens if the eigenvalues of H are not all positive.

● This happens, for instance, near saddle points.

● Newton’s method is thus sensitive to saddle points (which are a big issue in high-dimensional spaces).

● Standard SGD is less sensitive to this issue.

● Several solutions have been proposed to mitigate this issue (e.g., Hessian regularization).

92
Newton's Method
Issues

● The other issue is the computational complexity, which increases significantly with the number of parameters P:

Gradient computation: O(P)
Hessian computation: O(P²)
Hessian inversion: O(P³)

● Quasi-Newton methods attempt to reduce the computational burden by approximating the Hessian. Examples of such techniques are Conjugate Gradients and the BFGS algorithm.

● The objective should be twice differentiable.

● Numerical instability can occur when the second derivative is close to zero.

93
Additional Material

Chapter 2: Linear Algebra
Chapter 3: Probability and Information Theory
Chapter 5: Machine Learning Basics

Introduction (pages 1-11)
Introduction (pages 1-27)
Linear Models (pages 47-57)

94
Lab Session

● During the weekly lab session, we will do:

Tutorial on Matplotlib

Tutorial on Scikit-learn

Plotting exercises

Deadline: January 22, 11:59 pm

95
