Lec 04 05
Lecture 04/05
Optimization
[Search Methods and Gradient Descent]
Arpit Rana
9th / 10th January 2025
Components of Supervised Learning
Representation ✔
Choosing the set of functions (the hypothesis space or the
model class) that can be learned.
Evaluation ✔
An evaluation function (also called objective function
or scoring function) is needed to distinguish good
hypotheses from bad ones.
Optimization
We need a method to search the hypothesis space for
the highest-scoring one.
Hyperparameters
Hyperparameters are parameters of the model class, not of the individual model.
● We define them before training to control the learning process.
● The learning algorithm does not learn these (they can't be estimated from data).
Example:
from sklearn.ensemble import RandomForestClassifier

# Instantiate the model
rf_model = RandomForestClassifier()

# Print hyperparameters
print(rf_model.get_params())

The hyperparameters and their defaults (from an older scikit-learn release):
RandomForestClassifier(
    bootstrap=True, ccp_alpha=0.0, class_weight=None,
    criterion='gini',
    max_depth=None,
    max_features='auto',
    max_leaf_nodes=None,
    max_samples=None,
    min_impurity_decrease=0.0,
    min_impurity_split=None,
    min_samples_leaf=1,
    min_samples_split=2,
    min_weight_fraction_leaf=0.0,
    n_estimators=100,
    n_jobs=None, oob_score=False,
    random_state=None,
    verbose=0, warm_start=False)
● Some hyperparameters are more important than others.
● Changed in scikit-learn version 1.1: the default of max_features changed from "auto" to "sqrt".
Parameters
(Model) Parameters are components of the model whose value can be estimated from data.
● We do not set these manually (we can't in fact!)
● The algorithm will discover these for us (learned during the training).
Example:
from sklearn.linear_model import LogisticRegression

# Instantiate the model
log_reg_clf = LogisticRegression()

# Train the model
log_reg_clf.fit(X_train, y_train)

# Print the learned coefficients
print(log_reg_clf.coef_)

Output:
array([[-2.88651273e-06, -8.23168511e-03, 7.50857018e-04, 3.94375060e-04,
        3.79423562e-04, 4.34612046e-04, 4.37561467e-04, 4.12107102e-04,
        -6.41089138e-06, -4.39364494e-06, cont... ]])
● For a decision tree or random forest, the split column (the attribute chosen for a split) and the split column value
(the value of that attribute chosen for the split) are examples of parameters that are learned during training.
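The distinction can be checked on a fitted tree. The sketch below is illustrative only: the iris data and max_depth=2 are assumptions made for the example, not part of the lecture. It reads the split columns and split values that scikit-learn stores after training.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Toy data (an assumption; the lecture's X_train / y_train are not shown)
X, y = load_iris(return_X_y=True)

# max_depth is a hyperparameter: we choose it before training
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_clf.fit(X, y)

# Learned parameters: for every node, the index of the feature used for the
# split and the threshold value (leaf nodes carry the marker value -2)
split_columns = tree_clf.tree_.feature
split_values = tree_clf.tree_.threshold
print(split_columns)
print(split_values)
```

Changing max_depth changes how many splits can be learned, but the splits themselves always come from the data.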
Hyperparameter Tuning
Hyperparameter tuning is choosing the best combination of hyperparameters. Because they can’t
be estimated from the data during training, we have to search over candidate values and compare the resulting models.
The following methods are commonly used for tuning the hyperparameters.
● Manual Search
● Grid Search
● Random Search
● Coarse to Fine Search
● Bayesian Search
● Genetic Algorithm (will be covered in Numerical Optimization)
Manual Search/Hand Tuning
In this search, the user manually tweaks the hyperparameter combinations until the
model reaches satisfactory performance.
Pros
● For a skilled practitioner, this can help to reduce computational time
Cons
● Good values are hard to guess, even if you understand the algorithm well
● Time-consuming
Grid Search
If there are only a few hyperparameters, each with a small number of possible values, then a
more systematic approach called grid search is appropriate.
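As a baseline, we first run a random forest classifier with default values and get the predictions for the
test set.
Example:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Instantiate and fit a random forest classifier with default hyperparameters
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)

# Score the predictions on the test set
accuracy = accuracy_score(y_test, rf_model.predict(X_test))
print(accuracy)
0.81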
Grid Search
Accuracy after grid search: 0.88
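A minimal sketch of grid search with scikit-learn's GridSearchCV follows. The dataset, the grid values, and the 3-fold cross-validation are assumptions chosen to keep the example small; the lecture's own data and grid are not shown, so the score will not match the figure above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data (an assumption): the lecture's dataset is not shown
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Hypothetical grid: 2 x 2 = 4 combinations, each scored with 3-fold CV
param_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}

model_grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=3,
)
model_grid_search.fit(X_train, y_train)

# Best combination found, and its accuracy on the held-out test set
print(model_grid_search.best_params_)
grid_accuracy = accuracy_score(y_test, model_grid_search.predict(X_test))
print(grid_accuracy)
```

Every combination in the grid is tried, so the cost grows multiplicatively with each hyperparameter added.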
Random Search
In random search, we define a distribution for each hyperparameter, either as a uniform choice over
a set of values or through a sampling distribution.
For example, if the search space contains 500 candidate values and we set n_iter=50, then
random search will randomly sample 50 of them to test.
Since random search does not try every hyperparameter combination, it does not necessarily
return the best performing values, but it returns a relatively good performing model in a
significantly shorter time.
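from sklearn.model_selection import RandomizedSearchCV

# Instantiate RandomizedSearchCV: param_dist maps each hyperparameter name to
# a list or distribution, and n_iter is the number of combinations to sample
model_random_search = RandomizedSearchCV(
    estimator=rf_model,
    param_distributions=param_dist,
    n_iter=n_iter)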
Random Search
Accuracy after random search: 0.86
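For a runnable version, param_dist and n_iter have to be defined. The sketch below fills them in with hypothetical values (the ranges, n_iter=5, cv=3, and the stand-in dataset are all assumptions), using scipy.stats.randint for integer-valued distributions.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Stand-in data (an assumption): the lecture's dataset is not shown
X, y = load_breast_cancer(return_X_y=True)

# Hypothetical search space: distributions are sampled, lists are drawn uniformly
param_dist = {
    "n_estimators": randint(10, 60),      # integers in [10, 60)
    "max_depth": [3, 5, None],
    "min_samples_split": randint(2, 11),  # integers in [2, 11)
}
n_iter = 5  # number of sampled combinations

model_random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=n_iter,
    cv=3,
    random_state=0,
)
model_random_search.fit(X, y)

print(model_random_search.best_params_)
print(model_random_search.best_score_)
```

Only n_iter combinations are evaluated regardless of how large the search space is, which is where the time saving comes from.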
Coarse to Fine Search
● In coarse-to-fine tuning, we start with a random search to find the promising value
ranges for each hyperparameter.
● After identifying a focus area for each hyperparameter using random search, we can define
the grid accordingly so that grid search finds the best values within it.
● For example, if the random search returns high performance for n_estimators between
150 and 200, this is the range we want grid search to focus on.
● When we are training a deep neural network with a huge hyperparameter space, it is
preferable to use a Manual Search or Random Search method rather than the Grid Search method.
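The two stages of coarse-to-fine search can be chained as in the sketch below. It is one possible arrangement under stated assumptions: a stand-in dataset, a single hyperparameter (n_estimators), and a hypothetical fine grid of three values centred on the best coarse result.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Stand-in data (an assumption): the lecture's dataset is not shown
X, y = load_breast_cancer(return_X_y=True)
rf_model = RandomForestClassifier(random_state=0)

# Coarse stage: random search over a wide range of n_estimators
coarse_search = RandomizedSearchCV(
    estimator=rf_model,
    param_distributions={"n_estimators": randint(10, 100)},
    n_iter=4,
    cv=3,
    random_state=0,
)
coarse_search.fit(X, y)
best_n = int(coarse_search.best_params_["n_estimators"])

# Fine stage: a small grid centred on the best coarse value
fine_grid = {"n_estimators": [max(1, best_n - 5), best_n, best_n + 5]}
fine_search = GridSearchCV(estimator=rf_model, param_grid=fine_grid, cv=3)
fine_search.fit(X, y)

print(best_n, fine_search.best_params_)
```

Because the fine grid contains the best coarse value, the fine stage can only match or improve on the coarse cross-validation score here.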
Overall Objective
Let Ɛ be the set of all possible input–output examples that follow a prior probability
distribution P(X, Y).
Then the expected generalization loss for a hypothesis h (with respect to loss function L) is:
$$\mathrm{GenLoss}_L(h) = \sum_{(x, y) \in Ɛ} L(y, h(x)) \, P(x, y)$$
The best hypothesis, h*, is the one with the minimum expected generalization loss:
$$h^* = \arg\min_{h \in 𝓗} \mathrm{GenLoss}_L(h)$$
Because P(X, Y) is not known in most cases, the learner can only estimate generalization loss
with empirical loss on a set of examples D of size N:
$$\mathrm{EmpLoss}_{L,D}(h) = \frac{1}{N} \sum_{(x, y) \in D} L(y, h(x))$$
The estimated best hypothesis, ĥ*, is the one with the minimum empirical loss:
$$\hat{h}^* = \arg\min_{h \in 𝓗} \mathrm{EmpLoss}_{L,D}(h)$$
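As a small illustration of picking the hypothesis with the lowest empirical loss, the sketch below assumes a squared-error loss, a toy dataset generated from y = 2x + 3 plus noise, and a hypothetical hypothesis space of just three candidate lines.

```python
import numpy as np

# Toy dataset D of size N = 50 (an assumption: y = 2x + 3 plus Gaussian noise)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 * x + 3 + rng.normal(0, 1, size=50)


def squared_loss(y_true, y_pred):
    """Loss L(y, h(x)) for each example."""
    return (y_true - y_pred) ** 2


def empirical_loss(h, x, y):
    """Average loss of hypothesis h over the examples in D."""
    return float(np.mean(squared_loss(y, h(x))))


# A tiny, hypothetical hypothesis space of three candidate lines
hypotheses = {
    "y = 1x + 0": lambda x: 1 * x + 0,
    "y = 2x + 3": lambda x: 2 * x + 3,
    "y = 3x + 1": lambda x: 3 * x + 1,
}

losses = {name: empirical_loss(h, x, y) for name, h in hypotheses.items()}
best_name = min(losses, key=losses.get)
print(best_name)
```

Real hypothesis spaces are far too large to enumerate like this, which is why a search procedure is needed.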
Training
Parameters:
a=2, b=3
One common way to find the final hypothesis is to minimize a loss function using the Gradient
Descent algorithm.
Gradient Descent
Conceptually:
● It starts with an initial guess for the values
of the parameters (called random
initialization).
● Then repeatedly:
○ It updates the parameter values —
hopefully to reduce the loss.
● The size of each update is set by the learning rate. If the learning rate is too small, it will take many updates to converge.
● If the learning rate is too big, the algorithm might jump across the valley (overshoot); it
may even end up with a higher loss than before, making the next step bigger.
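The effect of the learning rate can be seen on a one-parameter toy problem. The sketch below assumes the loss θ², the starting value 5.0, and three hypothetical learning rates; none of these come from the lecture.

```python
def run_gradient_descent(learning_rate, n_steps=20, theta=5.0):
    """Minimise loss(theta) = theta**2, whose gradient is 2 * theta."""
    for _ in range(n_steps):
        gradient = 2 * theta
        theta = theta - learning_rate * gradient
    return theta


theta_small = run_gradient_descent(learning_rate=0.01)  # too small: slow progress
theta_good = run_gradient_descent(learning_rate=0.3)    # reasonable: near the minimum
theta_big = run_gradient_descent(learning_rate=1.1)     # too big: overshoots and diverges
print(theta_small, theta_good, theta_big)
```

After the same 20 updates, the small rate is still far from the minimum at 0, while the large rate has moved further away than where it started.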
Gradient Descent: Sensitive to Scaling of Features
● Since feature 1 has smaller values, it takes a larger change in θ1 to affect the loss function,
which is why the loss surface is elongated along the θ1 axis.
● It takes more steps to get to the minimum: steeply down but not really towards the
goal, followed by a long march down a nearly flat valley.
● It makes it more difficult to choose a value for the learning rate that avoids divergence: a
value that suits one feature may not suit another.
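How elongated the loss surface is can be quantified before and after scaling. The sketch below is an illustration under assumptions: toy data with one small-valued and one large-valued feature, scikit-learn's StandardScaler for scaling, and a hypothetical helper curvature_ratio that compares the steepest and flattest directions of the squared-error loss.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Feature 1 has small values, feature 2 has large values (toy data, an assumption)
X = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 100, 200)])

# Standardise each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)


def curvature_ratio(features):
    """Ratio of the largest to the smallest curvature of the squared-error loss.

    For linear regression the curvature matrix is proportional to X^T X
    (computed here on centred features), so this ratio measures how elongated
    the loss contours are; 1.0 would mean perfectly circular contours.
    """
    centered = features - features.mean(axis=0)
    hessian = centered.T @ centered / len(features)
    return float(np.linalg.cond(hessian))


print(curvature_ratio(X), curvature_ratio(X_scaled))
```

A ratio close to 1 after scaling means a single learning rate suits both directions, which is the practical reason to scale features before running gradient descent.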
Gradient Descent Algorithm Pseudocode
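The slide's pseudocode is not reproduced in this text. As a stand-in, here is a minimal sketch of a generic gradient descent loop; the function name, the stopping tolerance, and the example loss are all assumptions rather than the lecture's own formulation.

```python
import numpy as np


def gradient_descent(gradient_fn, initial_params, learning_rate=0.1,
                     max_steps=1000, tolerance=1e-8):
    """Generic gradient descent loop (illustrative sketch)."""
    params = np.asarray(initial_params, dtype=float)
    for _ in range(max_steps):
        gradient = gradient_fn(params)
        # Stop once the gradient is (almost) zero: a stationary point
        if np.linalg.norm(gradient) < tolerance:
            break
        # Move against the gradient, scaled by the learning rate
        params = params - learning_rate * gradient
    return params


# Usage: minimise loss(p) = (p0 - 1)^2 + (p1 + 2)^2, whose minimum is at (1, -2)
target = np.array([1.0, -2.0])
minimum = gradient_descent(lambda p: 2 * (p - target), initial_params=[0.0, 0.0])
print(minimum)
```

The loop only needs a function that returns the gradient, so the same code works for any differentiable loss.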
Gradient Descent in Action
For N training examples, we want to minimize the sum of the individual losses for each
example.
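● The derivative of a sum is the sum of the derivatives, so, for the hypothesis h_w(x) = w_1 x + w_0 with squared-error loss and learning rate α, we have:
$$w_0 \leftarrow w_0 + \alpha \sum_{j} (y_j - h_w(x_j)), \qquad w_1 \leftarrow w_1 + \alpha \sum_{j} (y_j - h_w(x_j)) \, x_j$$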
● We have to sum over all N training examples for every step, and there may be many
steps.
● A step that covers all the training examples is called an epoch.
These updates constitute the batch gradient descent learning rule for univariate linear
regression.
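A minimal sketch of batch gradient descent for univariate linear regression follows. It assumes the hypothesis h(x) = w1·x + w0, squared-error loss with the constant factor absorbed into the learning rate, toy data generated from y = 2x + 3, and hypothetical values for the learning rate and number of epochs.

```python
import numpy as np

# Toy data from y = 2x + 3 plus noise (an assumption, echoing a=2, b=3)
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=100)
y = 2 * x + 3 + rng.normal(0, 0.1, size=100)

w0, w1 = 0.0, 0.0        # initial guess for intercept and slope
learning_rate = 0.005    # hypothetical value, small enough to be stable here
n_epochs = 2000

for epoch in range(n_epochs):
    # Prediction error y_j - h(x_j) for all N examples at once
    error = y - (w1 * x + w0)
    # Each update sums over the whole training set, so one step is one epoch
    w0 = w0 + learning_rate * np.sum(error)
    w1 = w1 + learning_rate * np.sum(error * x)

print(w0, w1)
```

Because the gradient is summed rather than averaged, the usable learning rate shrinks as N grows; dividing the sums by N is a common alternative.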
Gradient Descent: Multivariate Linear Regression
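We can easily extend this to multivariable linear regression problems, in which each example x_j is
an n-element vector.
Vectorized form (with a dummy input x_{j,0} = 1 so that w_0 acts as the intercept):
$$h_w(x_j) = w \cdot x_j = w^\top x_j = \sum_{i} w_i x_{j,i}$$
The best vector of weights, w*, minimizes squared-error loss over the examples:
$$w^* = \arg\min_{w} \sum_{j} (y_j - w \cdot x_j)^2$$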
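In vectorized form, all weights can be updated in one matrix operation. The sketch below is written under assumptions: synthetic data with hypothetical true weights, a column of ones prepended for the intercept, and the mean (rather than the sum) of squared errors as the loss. It compares the result with NumPy's least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 200, 3
X = rng.normal(size=(N, n))
X_b = np.column_stack([np.ones(N), X])     # prepend a dummy input of 1 for the intercept
true_w = np.array([3.0, 2.0, -1.0, 0.5])   # hypothetical weights used to generate y
y = X_b @ true_w + rng.normal(0, 0.1, size=N)

w = np.zeros(n + 1)
learning_rate = 0.1
for epoch in range(500):
    # Gradient of the mean squared error with respect to all weights at once
    gradient = (2 / N) * X_b.T @ (X_b @ w - y)
    w = w - learning_rate * gradient

# Reference: the weight vector that minimises squared error, from least squares
w_star, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(w)
print(w_star)
```

For linear regression the loss is convex, so gradient descent and the least-squares solution agree once the iterations have converged.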
Stochastic Gradient Descent
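As we saw, in each iteration, Batch Gradient Descent does a calculation on the entire training
set, which, for large training sets, may be slow.
Stochastic Gradient Descent instead computes the gradient from a single, randomly chosen training example at each step, so each step is much cheaper.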
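A minimal sketch of stochastic gradient descent on the same kind of univariate problem follows. The toy data, the fixed learning rate, and the choice to shuffle the examples once per epoch are assumptions made for the illustration.

```python
import numpy as np

# Toy data from y = 2x + 3 plus noise (an assumption)
rng = np.random.default_rng(0)
N = 200
x = rng.uniform(0, 2, size=N)
y = 2 * x + 3 + rng.normal(0, 0.1, size=N)

w0, w1 = 0.0, 0.0
learning_rate = 0.05  # hypothetical value

for epoch in range(50):
    # Visit the examples in a fresh random order each epoch
    for j in rng.permutation(N):
        # The update uses ONE example, not a sum over all N
        error = y[j] - (w1 * x[j] + w0)
        w0 += learning_rate * error
        w1 += learning_rate * error * x[j]

print(w0, w1)
```

Each update is cheap, but with a fixed learning rate the weights keep fluctuating around the minimum instead of settling exactly on it.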
Simulated Annealing
● As we discussed, SGD does not settle at the minimum. One remedy is to gradually reduce the learning rate, as in simulated annealing.
● The function that determines how to reduce the learning rate is called the learning
schedule.
○ Reduce it too quickly and you may not converge on or near the global minimum.
○ Reduce it too slowly and you may still bounce around a lot and, if stopped after too
few iterations, may end up with a suboptimal solution.
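One simple form a learning schedule can take is shown below. The functional form t0 / (t + t1) and the constants are assumptions chosen for illustration; many other schedules are used in practice.

```python
def learning_schedule(t, t0=5.0, t1=50.0):
    """Learning rate for update number t; it shrinks as t grows."""
    return t0 / (t + t1)


# Learning rate at a few points during training
rates = [learning_schedule(t) for t in (0, 100, 1000, 10000)]
print(rates)
```

Larger t1 keeps the rate high for longer (slower reduction), while smaller t1 reduces it sooner; this is the trade-off described in the two sub-points above.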
Mini-Batch Gradient Descent
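Batch Gradient Descent computes gradients from the full training set and Stochastic Gradient
Descent computes gradients from just one example. Mini-Batch Gradient Descent computes them from small random subsets of the training set, called mini-batches.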
● When the loss function has local minima, one option is to prefer Stochastic Gradient Descent (or Mini-Batch Gradient Descent):
because of the way they 'bounce around', they may escape a local minimum and
reach the global minimum.
● In this context, simulated annealing is also useful: updates start out 'large', allowing these
algorithms to make progress and even escape local minima; over time, updates get
smaller, allowing these algorithms to settle at or near the global minimum.
● However, with simulated annealing, if you reduce the learning rate too quickly, you may still
get stuck in a local minimum.
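A minimal sketch of mini-batch gradient descent is given below, assuming toy linear data, a hypothetical batch size of 20, and a fixed learning rate. The problem is convex, so it shows only the mechanics of mini-batching, not the escape from local minima.

```python
import numpy as np

# Toy data from y = 2x + 3 plus noise (an assumption)
rng = np.random.default_rng(0)
N, batch_size = 200, 20
x = rng.uniform(0, 2, size=N)
y = 2 * x + 3 + rng.normal(0, 0.1, size=N)
X_b = np.column_stack([np.ones(N), x])  # column of ones for the intercept

w = np.zeros(2)        # [intercept, slope]
learning_rate = 0.1    # hypothetical value

for epoch in range(200):
    order = rng.permutation(N)  # reshuffle the examples every epoch
    for start in range(0, N, batch_size):
        batch = order[start:start + batch_size]
        # Gradient of the mean squared error on this mini-batch only
        residual = X_b[batch] @ w - y[batch]
        gradient = (2 / len(batch)) * X_b[batch].T @ residual
        w = w - learning_rate * gradient

print(w)
```

Setting batch_size to N recovers batch gradient descent, and setting it to 1 recovers stochastic gradient descent.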
Next lecture
Introduction to Neural Networks
13th January 2025