
IT549: Deep Learning

Lecture 04/05

Optimization
[Search Methods and Gradient Descent]

Arpit Rana
9th / 10th January 2025
Components of Supervised Learning

Representation ✔
Choosing the set of functions (the hypothesis space, or the
model class) that can be learned.

Evaluation ✔
An evaluation function (also called an objective function
or scoring function) is needed to distinguish good
hypotheses from bad ones.

Optimization
We need a method to search the hypothesis space for
the highest-scoring hypothesis.

Hyperparameters

Hyperparameters are parameters of the model class, not of the individual model.
● We define them before training to control the learning process.
● The learner algorithm does not learn these (can’t be estimated from data).

Example:

from sklearn.ensemble import RandomForestClassifier

# Instantiate the model
rf_model = RandomForestClassifier()

# Print hyperparameters (get_params() returns a dict of names and values)
print(rf_model.get_params())

Default hyperparameter values (the exact defaults depend on the scikit-learn version):

RandomForestClassifier(
    bootstrap=True, ccp_alpha=0.0, class_weight=None,
    criterion='gini', max_depth=None, max_features='auto',
    max_leaf_nodes=None, max_samples=None,
    min_impurity_decrease=0.0, min_impurity_split=None,
    min_samples_leaf=1, min_samples_split=2,
    min_weight_fraction_leaf=0.0, n_estimators=100,
    n_jobs=None, oob_score=False, random_state=None,
    verbose=0, warm_start=False)

Some hyperparameters are more important than others.

Note (scikit-learn documentation): Changed in version 1.1: The default of max_features changed from
"auto" to "sqrt".

Parameters

(Model) Parameters are components of the model whose value can be estimated from data.
● We do not set these manually (we can't in fact!)
● The algorithm will discover these for us (learned during the training).

Example:

from sklearn.linear_model import LogisticRegression

# Instantiate the model
log_reg_clf = LogisticRegression()

# Train the model
log_reg_clf.fit(X_train, y_train)

# Print the learned coefficients (the model parameters)
print(log_reg_clf.coef_)

Output:

array([[-2.88651273e-06, -8.23168511e-03, 7.50857018e-04,
        3.94375060e-04, 3.79423562e-04, 4.34612046e-04,
        4.37561467e-04, 4.12107102e-04, -6.41089138e-06,
        -4.39364494e-06, cont... ]])

● For decision tree or random forest, split column (the attribute chosen for split) and split column value
(the value of that attribute chosen for split) are examples of parameters that are learned while training.
Hyperparameter Tuning

Hyperparameter Tuning is choosing the best combination of hyperparameters. Since they can't
be estimated from the training data, they have to be found by searching over candidate values.

The following methods are commonly used for tuning the hyperparameters.
● Manual Search
● Grid Search
● Random Search
● Coarse to Fine Search
● Bayesian Search
● Genetic Algorithm (will be covered in Numerical Optimization)
Manual Search/Hand Tuning

In this search, the user manually tweaks the hyperparameter combinations until the
model reaches satisfactory performance.

● Guess some hyperparameter values based on past experience,
● Train a model, measure its performance on the validation data,
● Analyze the results, and use your intuition to suggest new hyperparameter values.
● Repeat until you have satisfactory performance (or you run out of time, computing
budget, or patience).

Pros
● For a skilled practitioner, this can help to reduce computational time

Cons
● Good values are hard to guess even if you really understand the algorithm
● Time-consuming
Grid Search

If there are only a few hyperparameters, each with a small number of possible values, then a
more systematic approach called grid search is appropriate.

● Try all combinations of values and see which performs best on the validation data.
● Different combinations can be run in parallel on different machines, so if you have
sufficient computing resources, this need not be slow.
● In some cases, however, model selection has been known to suck up resources on
thousand-computer clusters for days at a time.
● If two hyperparameters are independent of each other, they can be optimized separately.

Image Source: A Medium Blog by Louis Owen


Grid Search

As a baseline, we first run a random forest classifier with default values and get the
predictions for the test set.
Example:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Instantiate and fit random forest classifier
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)

# Predict on the test set and compute accuracy
y_pred = rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(accuracy)

Output:
0.81
Grid Search

Grid Search starts with defining a search space grid.
The grid consists of selected hyperparameter names and values, and grid search exhaustively
searches for the best combination of these given values.
Example:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
# ('auto' was deprecated in scikit-learn 1.1; newer versions expect 'sqrt' or 'log2')
param_grid = {
    'n_estimators': [50, 100, 200, 300],
    'min_samples_leaf': [1, 5, 10],
    'max_depth': [2, 4, 6, 8, 10],
    'max_features': ['auto', 'sqrt'],
    'bootstrap': [True, False]}

# Instantiate GridSearchCV
model_gridsearch = GridSearchCV(
    estimator=rf_model,
    param_grid=param_grid,
    scoring='accuracy',
    n_jobs=4,
    cv=5,
    refit=True,
    return_train_score=True)

Grid search will have to run and compare 240 models (= 4*3*5*2*2).
For 5-fold cross-validation, grid search will have to evaluate 1200 (= 240*5) model performances.
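
The candidate count for the grid above can be checked by enumerating the combinations directly. This is only a sketch in plain Python using itertools; it does not train anything, and the variable names (candidates, n_fits) are chosen here for illustration.

```python
from itertools import product

# Same grid as in the example above.
param_grid = {
    'n_estimators': [50, 100, 200, 300],
    'min_samples_leaf': [1, 5, 10],
    'max_depth': [2, 4, 6, 8, 10],
    'max_features': ['auto', 'sqrt'],
    'bootstrap': [True, False],
}

# Every combination of one value per hyperparameter is one candidate model.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

n_candidates = len(candidates)
n_folds = 5
n_fits = n_candidates * n_folds   # one fit per candidate per fold

print(n_candidates)  # 240
print(n_fits)        # 1200
```

Adding one more hyperparameter with k values multiplies both numbers by k, which is why grid search becomes expensive quickly.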
Grid Search

from time import time

# Record the current time
start = time()

# Fit the selected model
model_gridsearch.fit(X_train, y_train)

# Print the time spent and the number of candidate settings evaluated
print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % ((time() - start), len(model_gridsearch.cv_results_['params'])))

# Predict on the test set and compute accuracy
y_pred_grid = model_gridsearch.predict(X_test)
accuracy_grid = accuracy_score(y_test, y_pred_grid)
print(accuracy_grid)

Output:
GridSearchCV took 247.79 seconds for 240 candidate parameter settings.
0.88
Random Search

In random search, we define a distribution (or a list of candidate values) for each
hyperparameter; values can be sampled uniformly or with another sampling method.

For example, if the search space contains 500 possible combinations and we set n_iter=50, then
random search will randomly sample 50 combinations to test.

Since random search does not try every hyperparameter combination, it does not necessarily
return the best-performing values, but it usually returns a relatively good model in a
significantly shorter time.
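
The sampling idea can be sketched in plain Python. The search space below is hypothetical and chosen so that it has exactly 500 combinations; values are drawn uniformly and with replacement for simplicity, which is an assumption of this sketch rather than a description of how any particular library samples.

```python
import random

# Hypothetical search space: 10 * 10 * 5 = 500 candidate combinations.
search_space = {
    'n_estimators': list(range(50, 250, 20)),   # 10 values
    'min_samples_leaf': list(range(1, 11)),     # 10 values
    'max_depth': [2, 4, 6, 8, 10],              # 5 values
}

n_iter = 50
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Draw n_iter candidates; each value is sampled uniformly and independently.
sampled = [{name: rng.choice(values) for name, values in search_space.items()}
           for _ in range(n_iter)]

# Size of the full grid, for comparison with the number of sampled candidates.
total = 1
for values in search_space.values():
    total *= len(values)

print(total, len(sampled))
```

Only the 50 sampled candidates would be trained and validated, instead of all 500.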
Random Search

from sklearn.model_selection import RandomizedSearchCV

# Specify the values to sample from
param_dist = {
    'n_estimators': list(range(50, 300, 10)),
    'min_samples_leaf': list(range(1, 50)),
    'max_depth': list(range(2, 20)),
    'max_features': ['auto', 'sqrt'],
    'bootstrap': [True, False]}

# Specify the number of search iterations
n_iter = 50

# Instantiate RandomizedSearchCV
model_random_search = RandomizedSearchCV(
    estimator=rf_model,
    param_distributions=param_dist,
    n_iter=n_iter)
Random Search

from time import time

# Record the current time
start = time()

# Fit the selected model
model_random_search.fit(X_train, y_train)

# Print the time spent and the number of candidate settings evaluated
print("RandomizedSearchCV took %.2f seconds for %d candidate parameter settings."
      % ((time() - start), len(model_random_search.cv_results_['params'])))

# Predict on the test set and compute accuracy
y_pred_random = model_random_search.predict(X_test)
accuracy_random = accuracy_score(y_test, y_pred_random)
print(accuracy_random)

Output:
RandomizedSearchCV took 64.17 seconds for 50 candidate parameter settings.
0.86
Coarse to Fine Search

For Grid Search, an increased number of hyperparameters easily becomes a bottleneck. To
prevent this inefficiency, we can combine grid search with random search.

● In coarse-to-fine tuning, we start with a random search to find the promising value
ranges for each hyperparameter.

● After getting a focus area for each hyperparameter using random search, we can define
the grid accordingly for grid search to find the best values amongst them.

● For example, if the random search returns high performance for n_estimators between
150 and 200, this is the range we want grid search to focus on.
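
The two stages can be sketched on a single hyperparameter. Everything here is illustrative: validation_score is a hypothetical stand-in for training a model and scoring it on validation data, and the ranges and step sizes are assumptions.

```python
import random

def validation_score(n_estimators):
    # Hypothetical stand-in for "train a model and score it on validation data";
    # it peaks at n_estimators = 170.
    return -(n_estimators - 170) ** 2

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Coarse stage: random search over a wide range.
coarse_candidates = rng.sample(range(50, 300, 10), 8)
best_coarse = max(coarse_candidates, key=validation_score)

# Fine stage: small exhaustive grid centred on the best coarse value.
fine_grid = range(best_coarse - 10, best_coarse + 11, 5)
best_fine = max(fine_grid, key=validation_score)

print(best_coarse, best_fine)
```

Because the fine grid contains the best coarse value itself, the fine stage can only match or improve on the coarse result.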
Image Source: A Medium Blog by Idil Ismiguzel


When to Use What

When we are training a deep neural network with a huge hyperparameter space, it is
preferable to use Manual Search or Random Search rather than Grid Search.

Image Source: A Medium Blog by Louis Owen


Overall Objective

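Let Ɛ be the set of all possible input–output examples that follow a prior probability
distribution P(X, Y).
Then the expected generalization loss for a hypothesis h (with respect to loss function L) is:

$$\mathrm{GenLoss}_L(h) = \sum_{(x, y) \in \mathcal{E}} L(y, h(x)) \, P(x, y)$$

The best hypothesis h* is the one with the minimum expected generalization loss:

$$h^* = \operatorname*{argmin}_{h \in \mathcal{H}} \mathrm{GenLoss}_L(h)$$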
Overall Objective

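Because P(X, Y) is not known in most cases, the learner can only estimate generalization loss
with empirical loss on a set of examples D of size N:

$$\mathrm{EmpLoss}_{L,D}(h) = \frac{1}{N} \sum_{(x, y) \in D} L(y, h(x))$$

The estimated best hypothesis ĥ* is the one with the minimum empirical loss:

$$\hat{h}^* = \operatorname*{argmin}_{h \in \mathcal{H}} \mathrm{EmpLoss}_{L,D}(h)$$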
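
As a small illustration of choosing the hypothesis with minimum empirical loss, the sketch below averages a squared-error loss over a toy data set and picks the best of three candidate lines. The data, the loss function, and the three-element hypothesis space are all assumptions made for the example.

```python
# Toy data set D, generated from y = 2x + 3 (an assumption for this sketch).
D = [(0, 3), (1, 5), (2, 7), (3, 9)]

def squared_error(y, y_hat):
    return (y - y_hat) ** 2

def empirical_loss(h, examples, loss=squared_error):
    """Average loss of hypothesis h over the examples."""
    return sum(loss(y, h(x)) for x, y in examples) / len(examples)

# A tiny hypothesis space of lines h(x) = a*x + b, keyed by (a, b).
hypotheses = {
    (1, 0): lambda x: 1 * x + 0,
    (2, 3): lambda x: 2 * x + 3,
    (3, 1): lambda x: 3 * x + 1,
}

losses = {params: empirical_loss(h, D) for params, h in hypotheses.items()}
best_params = min(losses, key=losses.get)   # argmin over the hypothesis space

print(best_params, losses[best_params])
```

With a continuous hypothesis space, enumerating candidates like this is no longer possible, which is where an iterative optimizer comes in.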
Training

Training finds the best hypothesis within the hypothesis space.

Parameters:
a=2, b=3

One common way to find the final hypothesis is by minimizing a loss function using the Gradient
Descent algorithm.
Gradient Descent

Gradient Descent is a generic method to tweak parameters iteratively in order to minimize a
cost (a.k.a. loss) function.
● It is a search in the model's parameter space for values of the parameters that minimize
the loss function.

Image Source: https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9


Gradient Descent

Conceptually:
● It starts with an initial guess for the values of the parameters (called random
initialization).

● Then repeatedly:
○ It updates the parameter values, hopefully to reduce the loss.

Ideally, it keeps doing this until convergence, i.e. until changes to the parameter values
no longer result in lower loss.

The key to this algorithm is how to update the parameter values.
Gradient Descent: The Update Rule

To update the parameter values to reduce the loss:

● Compute the gradient vector.
○ But this points 'uphill' and we want to go 'downhill'.
○ And we want to make 'baby steps' (see later), so we use a learning rate α, which is
typically between 0 and 1.

● So subtract α times the gradient vector from w:

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \, \nabla_{\mathbf{w}} \mathrm{Loss}(\mathbf{w})$$
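
The update rule can be sketched for a single parameter. The loss function here is a hypothetical convex example with its minimum at w = 3, and the starting point is fixed rather than random so the run is repeatable; both are assumptions of this sketch.

```python
def loss(w):
    # Hypothetical convex loss with its minimum at w = 3.
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of the loss above with respect to w.
    return 2.0 * (w - 3.0)

alpha = 0.1   # learning rate
w = 0.0       # initial guess (a fixed value here instead of a random one)

for step in range(100):
    w = w - alpha * gradient(w)   # move against the gradient ('downhill')

print(w, loss(w))
```

Each step shrinks the distance to the minimum by a constant factor, so w approaches 3 and the loss approaches 0.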


Gradient Descent: The Learning Rate

The size of the steps is determined by the learning rate.

● If the learning rate is too small, it will take many updates until convergence.

● If the learning rate is too big, the algorithm might jump across the valley (overshoot). It
may even end up with higher loss than before, making the next step bigger.
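
Both failure modes can be reproduced on a one-parameter toy problem. The loss w**2 and the three learning rates below are illustrative choices, not values from the lecture.

```python
def run(alpha, steps=20, w=1.0):
    """Gradient descent on loss(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - alpha * 2.0 * w
    return w

w_small = run(alpha=0.01)   # too small: still far from 0 after 20 steps
w_good = run(alpha=0.4)     # reasonable: very close to 0
w_big = run(alpha=1.1)      # too big: every step overshoots and |w| grows

print(w_small, w_good, w_big)
```

With alpha = 1.1 each update multiplies w by -1.2, so the iterate jumps across the valley and lands further away every time.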
Gradient Descent: Sensitive to Scaling of Features

For Gradient Descent, we do need to scale the features.

● If features have different ranges, it affects the shape of the 'bowl'.
○ E.g. features 1 and 2 have similar ranges of values: a 'bowl'.

● The algorithm goes straight towards the minimum.


Gradient Descent: Sensitive to Scaling of Features

E.g. feature 1 has smaller values than feature 2 — an elongated 'bowl':

● Since feature 1 has smaller values, it takes a larger change in θ1 to affect the loss function,
which is why it is elongated.
● It takes more steps to get to the minimum — steeply down but not really towards the
goal, followed by a long march down a nearly flat valley.
● It makes it more difficult to choose a value for the learning rate that avoids divergence: a
value that suits one feature may not suit another.
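
One common way to bring features onto similar ranges is standardization (zero mean, unit standard deviation). The lecture does not name a specific scaling method, so this choice and the toy feature values are assumptions of the sketch.

```python
from statistics import mean, pstdev

# Hypothetical data: feature 1 has small values, feature 2 has large values.
feature_1 = [0.1, 0.2, 0.3, 0.4, 0.5]
feature_2 = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

scaled_1 = standardize(feature_1)
scaled_2 = standardize(feature_2)

print(scaled_1)
print(scaled_2)
```

After scaling, both features cover the same range, so a single learning rate suits both directions of the 'bowl'.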
Gradient Descent Algorithm Pseudocode
Gradient Descent in Action

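For univariate regression with h_w(x) = w1 x + w0, the squared-error loss is quadratic, so
the partial derivative will be linear:

$$\frac{\partial}{\partial w_i} \mathrm{Loss}(\mathbf{w}) = \frac{\partial}{\partial w_i} (y - h_{\mathbf{w}}(x))^2 = 2\,(y - h_{\mathbf{w}}(x)) \cdot \frac{\partial}{\partial w_i} (y - (w_1 x + w_0))$$

Gradient Descent in Action

Applying this to both w0 and w1 we get:

$$\frac{\partial}{\partial w_0} \mathrm{Loss}(\mathbf{w}) = -2\,(y - h_{\mathbf{w}}(x)), \qquad \frac{\partial}{\partial w_1} \mathrm{Loss}(\mathbf{w}) = -2\,(y - h_{\mathbf{w}}(x)) \cdot x$$

Substituting into the update rule, and folding the constant 2 into the learning rate α:

$$w_0 \leftarrow w_0 + \alpha\,(y - h_{\mathbf{w}}(x)), \qquad w_1 \leftarrow w_1 + \alpha\,(y - h_{\mathbf{w}}(x)) \cdot x$$

Here, loss covers only one training example.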


Batch Gradient Descent

For N training examples, we want to minimize the sum of the individual losses for each
example.
● The derivative of a sum is the sum of the derivatives, so we have:

$$w_0 \leftarrow w_0 + \alpha \sum_{j=1}^{N} (y_j - h_{\mathbf{w}}(x_j)), \qquad w_1 \leftarrow w_1 + \alpha \sum_{j=1}^{N} (y_j - h_{\mathbf{w}}(x_j)) \cdot x_j$$

● We have to sum over all N training examples for every step, and there may be many
steps.
● A step that covers all the training examples is called an epoch.

These updates constitute the batch gradient descent learning rule for univariate linear
regression.
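
A minimal sketch of batch gradient descent for univariate linear regression follows. The toy data (generated from y = 2x + 3), the learning rate, and the number of epochs are assumptions chosen so that the summed updates stay stable.

```python
# Toy training set generated from y = 2x + 3 (an assumption for this sketch).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

alpha = 0.02        # learning rate, small enough for the summed gradient
w0, w1 = 0.0, 0.0   # initial weights

for epoch in range(2000):
    # Residuals y_j - h_w(x_j) over all N examples, using the current weights.
    errors = [y - (w1 * x + w0) for x, y in zip(xs, ys)]
    # One step per epoch: sum the per-example contributions.
    w0 += alpha * sum(errors)
    w1 += alpha * sum(e * x for e, x in zip(errors, xs))

print(round(w0, 4), round(w1, 4))
```

Because the gradient is summed rather than averaged, a usable learning rate shrinks as N grows; dividing the sums by N is a common alternative.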
Gradient Descent: Multivariate Linear Regression

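We can easily extend to multivariable linear regression problems, in which each example x_j is
an n-element vector:

$$h_{\mathbf{w}}(\mathbf{x}_j) = w_0 + w_1 x_{j,1} + \cdots + w_n x_{j,n} = w_0 + \sum_{i=1}^{n} w_i x_{j,i}$$

Vectorized form (with a dummy input x_{j,0} = 1):

$$h_{\mathbf{w}}(\mathbf{x}_j) = \mathbf{w} \cdot \mathbf{x}_j = \mathbf{w}^\top \mathbf{x}_j = \sum_{i=0}^{n} w_i x_{j,i}$$

The best vector of weights, w*, minimizes squared-error loss over the examples:

$$\mathbf{w}^* = \operatorname*{argmin}_{\mathbf{w}} \sum_{j} L_2(y_j, \mathbf{w} \cdot \mathbf{x}_j)$$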
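
The same batch update carries over to weight vectors. In this sketch each example has a leading dummy input of 1 so that the intercept is just another weight; the toy data (from y = 1 + 2*x1 + 3*x2) and the learning rate are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Each example carries a leading dummy input x_{j,0} = 1, so w[0] is the intercept.
# Toy data generated from y = 1 + 2*x1 + 3*x2 (an assumption for this sketch).
X = [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
y = [1.0, 3.0, 4.0, 6.0]

alpha = 0.1
w = [0.0, 0.0, 0.0]

for epoch in range(1000):
    errors = [y_j - dot(w, x_j) for x_j, y_j in zip(X, y)]
    # Batch update for every weight i: w_i += alpha * sum_j error_j * x_{j,i}
    w = [w_i + alpha * sum(e * x_j[i] for e, x_j in zip(errors, X))
         for i, w_i in enumerate(w)]

print([round(w_i, 4) for w_i in w])
```

In practice this loop would be written with array operations, but the arithmetic is the same.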
Stochastic Gradient Descent

As we saw, in each iteration, Batch Gradient Descent does a calculation on the entire training
set, which, for large training sets, may be slow.

● Stochastic Gradient Descent (SGD), on each iteration, picks just one training example xj
at random and computes the gradients on just that one example.
● This gives a huge speed-up.
○ It enables us to train on huge training sets since only one example needs to be in
memory in each iteration.
○ But, because it is stochastic (random), the loss will not necessarily decrease on each
iteration.
● On average, the loss decreases, but in any one iteration, loss may go up or down.
● Eventually, it will get close to the minimum, but it will not necessarily settle at the
optimum.
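
A minimal SGD sketch for univariate linear regression is shown below. The toy data is noise-free (generated from y = 2x + 3), which is an assumption that lets SGD reach the exact line; with noisy data the weights would keep bouncing around the minimum as described above.

```python
import random

# Toy, noise-free data from y = 2x + 3 (an assumption for this sketch).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

rng = random.Random(0)   # fixed seed so the run is repeatable
alpha = 0.05
w0, w1 = 0.0, 0.0

for iteration in range(5000):
    j = rng.randrange(len(xs))          # pick one training example at random
    error = ys[j] - (w1 * xs[j] + w0)   # residual on that example only
    w0 += alpha * error
    w1 += alpha * error * xs[j]

print(round(w0, 3), round(w1, 3))
```

Each iteration touches a single example, so its cost does not depend on the size of the training set.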
Stochastic Gradient Descent

Simulated Annealing
● As we discussed, SGD does not settle at the minimum.

● One solution is to gradually reduce the learning rate:
○ Updates start out 'large' so you make progress.
○ But, over time, updates get smaller, allowing SGD to settle at or near the global
minimum.

● The function that determines how to reduce the learning rate is called the learning
schedule.
○ Reduce it too quickly and you may not converge on or near to the global minimum.
○ Reduce it too slowly and you may still bounce around a lot and, if stopped after too
few iterations, may end up with a suboptimal solution.
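
One simple learning schedule is inverse-time decay. The lecture does not specify a particular schedule, so the function below, its name, and its constants (alpha_0, decay) are assumptions for illustration.

```python
def learning_rate(t, alpha_0=0.5, decay=0.01):
    """Inverse-time decay: starts at alpha_0 and shrinks as iteration t grows."""
    return alpha_0 / (1.0 + decay * t)

# The learning rate at a few iteration counts.
for t in (0, 100, 1000, 10000):
    print(t, learning_rate(t))
```

A larger decay reduces the learning rate more quickly; a smaller one keeps the updates large for longer.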
Mini-Batch Gradient Descent

Batch Gradient Descent computes gradients from the full training set and Stochastic Gradient
Descent computes gradients from just one example.

● Mini-Batch Gradient Descent lies between the two:
○ It computes gradients from a small randomly-selected subset of the training set,
called a mini-batch.

● Since it lies between the two:
○ It may bounce less and get closer to the global minimum than SGD…
■ …although both of them can reach the global minimum with a good learning
schedule.
○ Its time and memory costs lie between the two.
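
A mini-batch version can be sketched in the same style. The toy data, the batch size of 3, and the choice to average (rather than sum) the gradient over each mini-batch are assumptions of this sketch.

```python
import random

# Toy, noise-free data from y = 2x + 3 (an assumption for this sketch).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 3.0 for x in xs]

rng = random.Random(0)   # fixed seed so the run is repeatable
alpha, batch_size = 0.05, 3
w0, w1 = 0.0, 0.0

for epoch in range(1000):
    indices = list(range(len(xs)))
    rng.shuffle(indices)   # new random mini-batches every epoch
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        errors = [ys[j] - (w1 * xs[j] + w0) for j in batch]
        # Average the gradient contributions over the mini-batch.
        w0 += alpha * sum(errors) / len(batch)
        w1 += alpha * sum(e * xs[j] for e, j in zip(errors, batch)) / len(batch)

print(round(w0, 3), round(w1, 3))
```

Setting batch_size to 1 gives an SGD-style update, and setting it to the number of examples gives a batch update with an averaged gradient.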
Non-Convex Loss Functions

Gradient Descent is a generic method: you can use it to find the minima of other loss
functions.

● Not all loss functions are convex, which can cause problems for Gradient Descent:
○ The algorithm might converge to a local minimum, instead of the global minimum.
○ It may take a long time to cross a plateau.

What do we do about this?


Non-Convex Loss Functions

What do we do about this?

● One thing is to prefer Stochastic Gradient Descent (or Mini-Batch Gradient Descent):
because of the way they 'bounce around', they might even escape a local minimum, and
might even get to the global minimum.

● In this context, simulated annealing is also useful: updates start out 'large' allowing these
algorithms to make progress and even escape local minima; but, over time, updates get
smaller, allowing these algorithms to settle at or near the global minimum.

● But with simulated annealing, if you reduce the learning rate too quickly, you may still
get stuck in a local minimum.
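
The local-minimum problem can be seen on a small non-convex function. The loss below is a hypothetical example with two minima, and plain (non-stochastic) gradient descent is run from two fixed starting points; all values are assumptions for illustration.

```python
def loss(w):
    # Hypothetical non-convex loss with two minima.
    return w ** 4 - 3.0 * w ** 2 + w

def gradient(w):
    return 4.0 * w ** 3 - 6.0 * w + 1.0

def gradient_descent(w, alpha=0.01, steps=1000):
    for _ in range(steps):
        w = w - alpha * gradient(w)
    return w

w_left = gradient_descent(-2.0)    # ends near w = -1.30 (the global minimum)
w_right = gradient_descent(2.0)    # ends near w = 1.13 (a local minimum)

print(w_left, loss(w_left))
print(w_right, loss(w_right))
```

Which minimum is reached depends only on the starting point here; the randomness of SGD or mini-batch updates is one way to get a chance of leaving the shallower basin.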
Next lecture
Introduction to Neural Networks
13th January 2025
