Optimization
Summary of the last episode…
● Objective functions
● Parameters
● Training
Outline
● Optimization
● Gradient Descent
● Hyperparameters
Basic Machine Learning Concepts:
Capacity
Capacity
● The capacity (also called representational capacity) attempts to quantify how “big” (or “rich”) the hypothesis space is.
● For instance, a complex model that can implement linear, exponential, sinusoidal, and logarithmic
functions has a larger capacity than one that implements linear functions only.
VC Dimension
● One popular measure of the capacity of a model is the Vapnik–Chervonenkis (VC) dimension.
● It is defined as the cardinality of the largest set of points that a binary classifier can shatter.
● A set of points is shattered by the classifier if, for every way of splitting the examples into positive and negative subsets, there exists a perfect classifier.
● If we have a set of N points X = [x1, x2, ..., xN]ᵀ, we say that the VC dimension is N if at least one configuration of those data points can be shattered, but no set of N+1 points can be shattered.
VC Dimension
[Figure: a set of three points in the plane shown under several different +/− label assignments. How many label configurations do we have? 2³ = 8. For every labeling, a linear classifier can separate the positive and negative points.]
VC Dimension
● Note that with some special point configurations, you cannot find a perfect classifier that works for all the label configurations.
[Figure: a special configuration of three points (e.g., collinear) with labelings that no line can separate.]
● We cannot correctly classify this set for all the label configurations.
● However, by definition, it is enough to find at least one set of points whose labels can be classified correctly for all configurations.
VC Dimension
[Figure: a set of four points in the plane. How many label configurations do we have? 2⁴ = 16.]
● Try different points and you will see that you cannot find any configuration of four points that leads to a perfect linear classifier for all the label configurations.
VC Dimension
● In 2D, a linear classifier can shatter a set of 3 points but no set of 4 points, so its VC dimension is 3.
● It can be shown that for a binary linear classifier operating on D-dimensional inputs:

VC dimension = D + 1
Generalization
● However, we are more interested in the performance achieved on data never seen before.
● Generalization is the ability of a machine learning algorithm to perform well on new, previously
unseen data.
● A machine learning model generalizes well if the test loss is low enough (according to our application).
Training Loss: the objective function computed with the training set.
Test Loss: the objective function computed with the test set.
Underfitting
● By analyzing training and test losses, we can identify some “pathologies” that often affect machine learning models.
● Underfitting happens when the model cannot achieve a sufficiently low training loss.
[Figure: underfitting examples for regression and classification.]
Overfitting
● Overfitting happens when the gap between the training and test losses is too large.
[Figure: overfitting examples for regression and classification.]
Underfitting, Overfitting, Capacity
● Intuitively:
[Figure: three fits of the same data: underfitting (low capacity), proper fitting (proper capacity), and overfitting (high capacity).]
Underfitting, Overfitting, Dataset size
● Underfitting and overfitting are influenced by the number of training samples as well.
● Intuitively:
[Figure: two datasets drawn from p_data, one small and one large; in both cases the training loss is low.]

Regularization
● Different techniques have been proposed in the machine learning literature (e.g., L1 regularization, L2 regularization).
● One way to regularize is to express some preference for some solutions over others (using prior knowledge).
Optimization
Gradient
Multiple Parameters
● When we have more than one parameter, we have to compute the partial derivatives over all the parameters. The gradient collects them into a single vector:

∇θJ(θ) = [ ∂J/∂θ1, ∂J/∂θ2, …, ∂J/∂θP ]ᵀ
Analytical Solution
1. We compute the gradient of the objective and set it to zero: ∇θJ(θ) = 0.
2. We find a closed-form expression that tells us where all the critical points are.
3. We test all the critical points and we choose the one that minimizes the objective J.
Most of the time, we cannot find a closed-form analytical expression that solves this problem (root finding).

Numerical Optimization
● With this approach, we accept the idea of finding an approximate solution to the problem.
● When using numerical methods, we start from a set of candidate solutions, and we progressively improve them until we think the problem is solved well enough.
Numerical Optimization
Naive Approach
Idea 1: We can try all the parameters θ and choose the set that optimizes the objective J.
Problem: very often the parameters are continuous, so we have an infinite number of parameter configurations.
Idea 2: We can (randomly) sample N parameter configurations and take the set that optimizes the objective J (a minimal sketch follows).
Problem: the number of parameter combinations grows exponentially with the number of parameters.
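● A minimal sketch of Idea 2 in Python (the objective, the sampling bounds, and the number of samples are illustrative assumptions):

import numpy as np

def random_search(objective, n_samples=1000, low=-5.0, high=5.0, dim=2):
    # Idea 2: sample random parameter vectors and keep the best one
    best_theta, best_J = None, np.inf
    for _ in range(n_samples):
        theta = np.random.uniform(low, high, size=dim)  # candidate parameters
        J = objective(theta)
        if J < best_J:
            best_theta, best_J = theta, J
    return best_theta, best_J

# Example: minimize J(theta) = ||theta - 1||^2
theta_best, J_best = random_search(lambda th: np.sum((th - 1.0) ** 2))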
Numerical Optimization
● Numerical optimization is a big research field, and many approaches have been proposed so far.
● Some attempt to find a global optimum (e.g., simulated annealing, ant colony optimization, genetic algorithms, particle swarm optimization). This is usually computationally demanding.
● Some of them can find local optima only (e.g., hill climbing, coordinate descent, gradient descent, the conjugate gradient method, Newton’s method).
Gradient Descent
● The gradient points in the direction of the greatest increase of the objective, so to minimize J we move in the opposite direction:

θ_new = θ_current − ε ∇θJ(θ_current)

(new parameters = current parameters − learning rate × gradient)
[Figure: a 1D objective with maxima and minima; following the negative gradient leads toward a minimum.]
Gradient Descent
Algorithm
1. Initialize the parameters θ.
2. Compute the gradient ∇θJ(θ).
3. Update the parameters: θ ← θ − ε ∇θJ(θ).
4. Good solution? If no, go back to step 2. If yes, end.
● The algorithm is sensitive to the initialization point.
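● A minimal Python sketch of this loop (the stopping test and the toy objective are illustrative assumptions):

import numpy as np

def gradient_descent(grad_J, theta0, lr=0.01, n_steps=100, tol=1e-6):
    theta = np.asarray(theta0, dtype=float)  # step 1: initialize the parameters
    for _ in range(n_steps):
        g = grad_J(theta)                    # step 2: compute the gradient
        theta = theta - lr * g               # step 3: update the parameters
        if np.linalg.norm(g) < tol:          # step 4: good solution? then stop
            break
    return theta

# Example: minimize J(theta) = ||theta||^2, whose gradient is 2 * theta
theta_min = gradient_descent(lambda th: 2.0 * th, theta0=[3.0, -2.0])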
Learning Rate
● Setting a proper learning rate Ɛ is crucial for gradient descent. A common choice is to set it as a
constant value (e.g., n = 0.01).
θ θ
θ
32
Stochastic Gradient Descent
● To improve generalization, modern machine learning models are trained on large datasets, so computing the gradient on the full dataset at every update is expensive.
● Instead, SGD updates the parameters using one small minibatch of samples at a time.
[Figure: the training set as a matrix of features and labels; at each step, the gradient is computed on a different minibatch of rows.]
● At the end of the first epoch, we have used all of the data exactly once.
● This introduces some randomness in the learning process that helps the algorithm escape from saddle points and local minima.
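● A minimal minibatch SGD sketch in Python (grad_minibatch and the data arrays are assumed to be provided):

import numpy as np

def sgd(grad_minibatch, theta0, X, y, lr=0.01, batch_size=32, n_epochs=10):
    theta = np.asarray(theta0, dtype=float)
    N = X.shape[0]
    for _ in range(n_epochs):
        perm = np.random.permutation(N)           # random order adds the noise
        for start in range(0, N, batch_size):
            idx = perm[start:start + batch_size]  # one minibatch of rows
            theta = theta - lr * grad_minibatch(theta, X[idx], y[idx])
    return theta  # after each epoch, all the data has been used exactly once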
Batch Size
● The batch size manages the trade-off between the “accuracy” and the computational cost of the updates.
● Batch size = N: if the batch size is equal to the number of training samples N, we have the standard gradient descent. “Accurate” update, but slow.
● Batch size = 1: we update the weights after processing every single sample. This is called online SGD. “Noisy” update, but fast.
Advantages and Limitations
Limitations
● Gradient descent can get stuck in local optima.
● It requires tuning the learning rate (and the batch size for SGD).
[Figure: an objective J(θ) with several local minima where gradient descent can get stuck.]
Example - Linear Least Squares
● Let’s try to apply gradient descent to a simple linear regression problem.
[Figure: the machine learning model maps the input x to the output ŷ; the labels y are used for training.]
● Model: ŷ = w1·x + w0. Objective (MSE): J(w0, w1) = (1/N) Σi (ŷi − yi)².
Example - Linear Least Squares
● To apply gradient descent, we need to compute the gradient of the objective function J:

∂J/∂w0 = (2/N) Σi (ŷi − yi)
∂J/∂w1 = (2/N) Σi (ŷi − yi) xi
Example - Linear Least Squares
● Now that we have the gradient, we can start the gradient descent training (x_train, y_train, x_test, y_test are assumed to be given NumPy arrays):

import numpy as np

def linear_model(x, w0, w1):
    return w1 * x + w0  # linear model: y_hat = w1 * x + w0

# Initial values
w0 = 1.0
w1 = 0.5
N_epochs = 10
lr = 0.05
train_loss = []
test_loss = []

for epoch in range(N_epochs):
    # compute the predictions
    y_hat = linear_model(x_train, w0, w1)
    # compute the gradients of J
    grad_w0 = 2 * np.mean(y_hat - y_train)
    grad_w1 = 2 * np.mean((y_hat - y_train) * x_train)
    # parameter updates
    w0 = w0 - lr * grad_w0
    w1 = w1 - lr * grad_w1
    # track the train and test losses (MSE)
    train_loss.append(np.mean((linear_model(x_train, w0, w1) - y_train) ** 2))
    test_loss.append(np.mean((linear_model(x_test, w0, w1) - y_test) ** 2))

[Figure: the fitted line after epochs 1, 2, 3, and 10.]
Example - Linear Least Squares
● Note: sometimes it is convenient to write the linear model in a vector form. For a 1D input:

ŷ = w1·x + w0 = x̃ᵀw,  with x̃ = [x, 1]ᵀ and w = [w1, w0]ᵀ
Example - Linear Least Squares
● If we set X to be the N×2 design matrix whose i-th row is x̃iᵀ = [xi, 1], the predictions for the whole training set can be written compactly as ŷ = Xw.
Example - Linear Least Squares
● Let’s do a sanity check on the dimensionalities to convince ourselves that the two expressions are equivalent: X is N×2 and w is 2×1, so ŷ = Xw is N×1, one prediction per sample.
● If you expand the computations for both expressions, you will see the exact same operations (do it as an additional exercise).
Example - Linear Least Squares
This problem is simple enough to be solved in closed form with an analytical expression (the normal equations):

w = (XᵀX)⁻¹ Xᵀ y

We can solve this simple problem in one shot (without using gradient descent) just by solving a system of linear equations.
This is possible only for such a simple machine learning model. For more complex ones, a closed-form solution does not exist.
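● A minimal sketch of the closed-form solution in Python (the data arrays are illustrative assumptions):

import numpy as np

# Illustrative 1D data
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([1.1, 2.9, 5.2, 7.1])

# Design matrix: the i-th row is [x_i, 1]
X = np.stack([x_train, np.ones_like(x_train)], axis=1)

# Normal equations: w = (X^T X)^{-1} X^T y, solved as a linear system
w = np.linalg.solve(X.T @ X, X.T @ y_train)
w1, w0 = w  # slope and intercept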
Example - Linear Least Squares
For this kind of simple model (a linear model trained with MSE), it can be shown that the objective function is convex:
[Figure: the convex, bowl-shaped MSE objective J(w0, w1); gradient descent converges to its global minimum.]
Hyperparameters and Validation Set
Hyperparameters
● Several machine learning models have to learn the parameters θ that implement the desired input-output mapping (e.g., the slope w1 and the intercept w0 of our linear model).
● There are also special “parameters” that must be set to properly control the learning algorithm itself: the hyperparameters.
● In the context of SGD, the learning rate, the batch size, and the number of epochs are examples of hyperparameters.
● We cannot compute the gradient for these variables, so users have to set them manually.
Hyperparameters
● One way is to perform several training experiments with different sets of hyperparameters and choose the best one.
To select the best one, we cannot use the performance achieved on the training set (it would favor configurations that overfit).
To select the best one, we cannot use the performance achieved on the test set (it must remain unseen to provide an unbiased estimate of generalization).
● We can employ a third set, called the validation set, to choose the best set of hyperparameters.
● The validation set is normally extracted from the training set (e.g., 10%-20% of the training data are devoted to validation).
● The training set is used to find the best parameters; the validation set is used only to select the hyperparameters.
Hyperparameter Search
● Searching for the best hyperparameters is usually expensive, because we have to train the model multiple times before finding the best configuration.
● One possible way is to initialize the hyperparameters with “reasonable” values (based on our experience or default settings suggested in the literature) and then refine them with a search strategy (a grid-search sketch follows the list):
1. Manual Search
2. Grid Search
3. Random Search
4. Bayesian Optimization
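● A minimal grid-search sketch in Python (the grid values are illustrative, and train_and_validate is a hypothetical helper standing in for a full training run):

import itertools
import random

def train_and_validate(lr, batch_size):
    # Hypothetical helper: train with the given hyperparameters and
    # return the loss on the validation set (here a stand-in random value).
    return random.random()

grid = {"lr": [0.001, 0.01, 0.1], "batch_size": [16, 32, 64]}

best_config, best_val_loss = None, float("inf")
for lr, batch_size in itertools.product(grid["lr"], grid["batch_size"]):
    val_loss = train_and_validate(lr=lr, batch_size=batch_size)
    if val_loss < best_val_loss:  # keep the configuration with the lowest validation loss
        best_config, best_val_loss = {"lr": lr, "batch_size": batch_size}, val_loss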
Hyperparameter Search
1. Train the model on the training set (to find θ).
2. Check the performance on the validation set.
3. Good performance? If no, change the hyperparameters and go back to step 1.
4. If yes, check the final performance on the test set. End.
Hyperparameters
Parameters
● They are part of the model f(x,θ).
● They are estimated during training using a training set.
● In gradient descent, we train the model by computing the gradient of the objective over the parameters.
Hyperparameters
● They are external to the model f(x,θ).
● They are not estimated during training, but during the hyperparameter search (performed on the validation set).
● We cannot compute the gradient over the hyperparameters.
Basic Machine Learning Concepts:
Variants and Extensions of Gradient Descent
SGD Extensions and Variants
● Early Stopping
● SGD with Momentum
● Adaptive Learning Rate (AdaGrad, RMSProp, Adam)
● Second Order Methods
Early Stopping
● Often we iterate gradient descent for a predefined number of epochs (which is one of the hyperparameters of the system).
● However, if the number of epochs is too high, we might end up in an overfitting regime.
● Early stopping: monitor the validation loss during training and stop as soon as it stops improving (a minimal sketch follows).
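● A minimal early-stopping sketch in Python (train_one_epoch and validation_loss are hypothetical callbacks):

import numpy as np

def train_with_early_stopping(train_one_epoch, validation_loss,
                              max_epochs=100, patience=3):
    best_val, wait = np.inf, 0
    for epoch in range(max_epochs):
        train_one_epoch()            # one pass over the training data
        val = validation_loss()      # loss on the validation set
        if val < best_val:
            best_val, wait = val, 0  # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:     # no improvement for `patience` epochs: stop
                break
    return best_val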
SGD with Momentum
Momentum
● SGD has trouble navigating areas where the surface curves much more steeply in one dimension than in another.
● This causes a slow convergence, as we lose time jumping back and forth between the sides of the narrow valley.
[Figure: SGD oscillating across a narrow valley whose gradient is large in one dimension and small in the other.]
Momentum
Momentum term Learning rate Gradient ● The velocity vt is the vector containing the
velocity
Previous updates to perform.
velocity
vt
69
Momentum
vt
Momentum term Learning rate Gradient
velocity vt
Previous
velocity
70
Momentum
● The effect of the momentum is to dampen the oscillations observed with SGD and reach the minimum faster.
[Figure: contour plot comparing the oscillating SGD trajectory with the smoother SGD-with-momentum trajectory.]
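● A minimal sketch of the momentum update in Python (the interface is an illustrative assumption):

import numpy as np

def sgd_momentum(grad_J, theta0, lr=0.01, alpha=0.9, n_steps=100):
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        v = alpha * v - lr * grad_J(theta)  # v_t = alpha * v_{t-1} - lr * gradient
        theta = theta + v                   # move along the velocity
    return theta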
Adaptive Learning Rate
● In SGD, the learning rate η is the same for all the parameters θ = [θ1, …, θi, …, θP]ᵀ.
● Idea: why not use a different learning rate for each parameter?
AdaGrad
● In a real case, we have millions or even billions of parameters, so we cannot set their learning rates manually.
● We need a method that assigns a learning rate to each parameter automatically.
● AdaGrad individually scales each parameter update using the historical values of the squared gradient magnitude: it accumulates r ← r + g ⊙ g and updates θ ← θ − (ε / (√r + δ)) ⊙ g, with a small constant δ for numerical stability (a minimal sketch follows).
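● A minimal AdaGrad sketch in Python (the interface and the constant δ are illustrative assumptions):

import numpy as np

def adagrad(grad_J, theta0, lr=0.01, delta=1e-8, n_steps=100):
    theta = np.asarray(theta0, dtype=float)
    r = np.zeros_like(theta)                           # accumulated squared gradients
    for _ in range(n_steps):
        g = grad_J(theta)
        r = r + g * g                                  # accumulate g^2, element-wise
        theta = theta - lr / (np.sqrt(r) + delta) * g  # per-parameter learning rate
    return theta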
AdaGrad
Flat region Small gradients Large updates Adagrad
J(θ)
● RMSProp mitigates this issue by changing the squared gradient accumulation into an exponential
moving average.
Decay Rate
● With the exponential moving average, we
give more “weight” to the most recent
updates and less weight to the older ones.
● RMSProp has been shown to work well in practice for real machine learning methods.
77
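● A minimal RMSProp sketch in Python; compared with the AdaGrad sketch above, only the accumulation line changes:

import numpy as np

def rmsprop(grad_J, theta0, lr=0.001, rho=0.9, delta=1e-8, n_steps=100):
    theta = np.asarray(theta0, dtype=float)
    r = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad_J(theta)
        r = rho * r + (1.0 - rho) * g * g  # EMA replaces the full accumulation
        theta = theta - lr / (np.sqrt(r) + delta) * g
    return theta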
Adam
● Adam keeps an exponential moving average of the gradients (first moment):

s ← ρ1 s + (1 − ρ1) g

(ρ1 s: contribution of the previous gradients; (1 − ρ1) g: contribution of the current gradient)
● The term s is big if the gradients point in the same direction, small if they point in different directions.
● Like RMSProp, Adam also keeps an exponential moving average of the squared gradients (second moment): r ← ρ2 r + (1 − ρ2) g ⊙ g.
Adam
● Suggested values: ρ1 = 0.9, ρ2 = 0.999, η = 0.001.
● With RMSProp, the direction of the update depends only on the current gradient (as if ρ1 = 0), while the step size also depends on the history of the squared gradients.
● With Adam, both the direction of the update and the step size depend on the past gradients.
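● A minimal Adam sketch in Python following the updates above, including the usual bias correction for the zero-initialized moments (the interface is an illustrative assumption):

import numpy as np

def adam(grad_J, theta0, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8, n_steps=100):
    theta = np.asarray(theta0, dtype=float)
    s = np.zeros_like(theta)  # first moment: direction of the update
    r = np.zeros_like(theta)  # second moment: step size
    for t in range(1, n_steps + 1):
        g = grad_J(theta)
        s = rho1 * s + (1.0 - rho1) * g
        r = rho2 * r + (1.0 - rho2) * g * g
        s_hat = s / (1.0 - rho1 ** t)  # bias correction (s and r start at zero)
        r_hat = r / (1.0 - rho2 ** t)
        theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta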
Second Order Methods
Second Order Methods
● We have seen that the gradient (based on first-order partial derivatives) provides useful information that we can use to minimize our objective.
● The gradient is a vector containing the first-order partial derivatives; the Hessian is a symmetric square matrix of the second-order partial derivatives:

∇θJ(θ) = [ ∂J/∂θ1, …, ∂J/∂θP ]ᵀ,  Hij = ∂²J / (∂θi ∂θj)
Second Order Methods
● The second-order derivative tells us how the derivative changes when we apply a little change to the input.
[Figure: three objectives J(θ) with different curvatures.]
Second Order Methods
Single parameter
● The information of the second derivative can be used to classify critical points (second derivative test). At a critical point θ1:
● J''(θ1) < 0: local maximum.
● J''(θ1) > 0: local minimum.
● J''(θ1) = 0: inconclusive (saddle point or flat region?).
Second Order Methods
Multiple parameters
● In the multi-dimensional case, the test involves the eigenvalues of the Hessian:
● All eigenvalues positive: local minimum.
● All eigenvalues negative: local maximum.
● At least one eigenvalue positive and at least one negative: saddle point.
● In the special case of a diagonal Hessian, the eigenvalues are just the elements on the diagonal.
Second Order Methods
Multiple parameters
● The condition to have a saddle point is less restrictive than the one needed for local minima and maxima.
● Intuitively, it is significantly easier to find points where at least one eigenvalue is positive and at least one is negative (saddle points) than points where all eigenvalues are positive (minima) or all are negative (maxima).
● Now we can understand better why saddle points are much more common than local minima and maxima in high-dimensional spaces.
Newton's Method
Single parameter
● The optimization methods that use both the gradient and the Hessian are called second-order methods.
● We can approximate the objective with a Taylor expansion (up to the second order) and jump directly to the minimum of this quadratic approximation:

J(θ) ≈ J(θ0) + (θ − θ0)ᵀ ∇θJ(θ0) + ½ (θ − θ0)ᵀ H (θ − θ0)

[Figure: the actual objective J(θ) and its quadratic approximation around θ0.]
Newton's Method
Single parameter
[Figure: starting from θ0, Newton's method jumps to θ*, the minimum of the quadratic approximation.]
Algorithm:
1. Compute the gradient.
2. Compute the Hessian.
3. Compute the Hessian inverse.
4. Update the parameters: θ ← θ − H⁻¹ ∇θJ(θ).
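● A minimal sketch of Newton's method in Python (solving a linear system instead of explicitly inverting the Hessian):

import numpy as np

def newton(grad_J, hess_J, theta0, n_steps=10):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        g = grad_J(theta)             # 1. gradient
        H = hess_J(theta)             # 2. Hessian
        step = np.linalg.solve(H, g)  # 3. H^{-1} g, via a linear solve
        theta = theta - step          # 4. parameter update
    return theta

# Example: J(theta) = theta^T A theta is minimized in a single Newton step
A = np.array([[2.0, 0.0], [0.0, 4.0]])
theta_min = newton(lambda th: 2 * A @ th, lambda th: 2 * A, [1.0, -1.0], n_steps=1)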
Newton's Method
Issues
● Newton’s method sounds appealing, but in practice, it suffers from several issues:
● Newton's method is thus sensitive to saddle points (that are a big issue in high-dimensional spaces).
● Several solutions have been proposed to mitigate this issue (e.g, Hessian regularization). 92
Newton's Method
Issues
● The other issue is the computational complexity, which increases significantly with the number of parameters: for P parameters, the Hessian has P² entries to compute and store, and inverting it costs O(P³) operations.
Additional Material
Lab Session
Tutorial on Matplotlib
Tutorial on Scikit-learn
Plotting exercises