IT549: Deep Learning

Lecture 04/05

Optimization
[Search Methods and Gradient Descent]

Arpit Rana
9th / 10th January 2025
Components of Supervised Learning

Representation ✔
Choosing the set of functions (the hypothesis space or the
model class) that can be learned.

Evaluation ✔
An evaluation function (also called objective function
or scoring function) is needed to distinguish good
hypotheses from bad ones.

Optimization
We need a method to search the hypothesis space for
the highest-scoring one.
Hyperparameters

Hyperparameters are parameters of the model class, not of the individual model.
● We deﬁne them before training to control the learning process.
● The learner algorithm does not learn these (can’t be estimated from data).

Example:
from sklearn.ensemble import RandomForestClassifier

# Instantiate the model
rf_model = RandomForestClassifier()

# Print hyperparameters (get_params() returns them as a dict)
print(rf_model.get_params())

Default hyperparameters as shown on the slide (from an older scikit-learn release):
RandomForestClassifier(
    bootstrap=True, ccp_alpha=0.0, class_weight=None,
    criterion='gini', max_depth=None, max_features='auto',
    max_leaf_nodes=None, max_samples=None,
    min_impurity_decrease=0.0, min_impurity_split=None,
    min_samples_leaf=1, min_samples_split=2,
    min_weight_fraction_leaf=0.0, n_estimators=100,
    n_jobs=None, oob_score=False, random_state=None,
    verbose=0, warm_start=False)
Hyperparameters

Some hyperparameters are more important than others.

Note (scikit-learn documentation): Changed in version 1.1: the default of max_features changed from "auto" to "sqrt".
Parameters

(Model) Parameters are components of the model whose value can be estimated from data.
● We do not set these manually (we can't in fact!)
● The algorithm will discover these for us (learned during the training).

Example:
from sklearn.linear_model import LogisticRegression

# Instantiate the model
log_reg_clf = LogisticRegression()

# Train the model
log_reg_clf.fit(X_train, y_train)

# Print the learned coefficients (the model parameters)
print(log_reg_clf.coef_)

Output:
array([[-2.88651273e-06, -8.23168511e-03, 7.50857018e-04,
        3.94375060e-04, 3.79423562e-04, 4.34612046e-04,
        4.37561467e-04, 4.12107102e-04, -6.41089138e-06,
        -4.39364494e-06, cont... ]])

● For decision tree or random forest, split column (the attribute chosen for split) and split column value
(the value of that attribute chosen for split) are examples of parameters that are learned while training.
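
To make the distinction concrete, here is a minimal sketch of a one-level decision tree (a "stump") in plain Python. The function name `fit_stump`, the toy data, and the one-directional split rule are assumptions made for illustration; this is not how scikit-learn implements trees. The split column and split value are not set by hand: they are found from the data.

```python
def fit_stump(X, y):
    """Learn the split column and split value of a one-level decision tree."""
    best = None  # (errors, column, threshold)
    n_cols = len(X[0])
    for col in range(n_cols):
        for threshold in sorted({row[col] for row in X}):
            # Predict class 1 when the feature value exceeds the threshold.
            preds = [1 if row[col] > threshold else 0 for row in X]
            errors = sum(p != t for p, t in zip(preds, y))
            if best is None or errors < best[0]:
                best = (errors, col, threshold)
    return {"split_column": best[1], "split_value": best[2], "errors": best[0]}


# Toy training set (hypothetical): two features, binary labels.
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
y = [0, 0, 1, 1]

stump = fit_stump(X, y)
print(stump)  # the learned parameters: split column 0, split value 2.0
```

Here the learned parameters are the pair (split column, split value); a limit such as the maximum tree depth would be a hyperparameter, fixed before training.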
Hyperparameter Tuning

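Hyperparameter tuning is the process of choosing the best combination of hyperparameters.
Because hyperparameters cannot be estimated from the training data directly, we search for
them by comparing candidate settings on validation data.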

The following methods are commonly used for tuning the hyperparameters.
● Manual Search
● Grid Search
● Random Search
● Coarse to Fine Search
● Bayesian Search
● Genetic Algorithm (will be covered in Numerical Optimization)
Manual Search/Hand Tuning

In manual search, the user tweaks the hyperparameter combinations by hand until the
model reaches satisfactory performance.

● Guess some hyperparameter values based on past experience.
● Train a model and measure its performance on the validation data.
● Analyze the results, and use your intuition to suggest new hyperparameter values.
● Repeat until you have satisfactory performance (or you run out of time, computing
budget, or patience).

Pros
● For a skilled practitioner, this can help to reduce computational time

Cons
● Good values are hard to guess, even if you understand the algorithm well
● Time-consuming
Grid Search

If there are only a few hyperparameters, each with a small number of possible values, then a
more systematic approach called grid search is appropriate.

● Try all combinations of values and see which performs best on the validation data.
● Different combinations can be run in parallel on different machines, so if you have
sufficient computing resources, this need not be slow.
● Even so, model selection has in some cases been known to consume the resources of
thousand-computer clusters for days at a time.
● If two hyperparameters are independent of each other, they can be optimized separately.

Image Source: A Medium Blog by Louis Owen

Grid Search

As a baseline, we run a random forest classifier with default values and get the predictions
for the test set.
Example:
from sklearn.metrics import accuracy_score

# Instantiate and fit random forest classifier
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)

# Predict on the test set and compute accuracy
y_pred = rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(accuracy)

Output:
0.81
Grid Search

Grid Search starts with defining a search space grid.

The grid consists of selected hyperparameter names and values, and grid search exhaustively
searches for the best combination of these given values.
Example:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
# Note: 'auto' is no longer accepted for max_features in recent scikit-learn
# releases; on those versions replace it with another valid value.
param_grid = {
    'n_estimators': [50, 100, 200, 300],
    'min_samples_leaf': [1, 5, 10],
    'max_depth': [2, 4, 6, 8, 10],
    'max_features': ['auto', 'sqrt'],
    'bootstrap': [True, False]}

# Instantiate GridSearchCV
model_gridsearch = GridSearchCV(
    estimator=rf_model,
    param_grid=param_grid,
    scoring='accuracy',
    n_jobs=4,
    cv=5,
    refit=True,
    return_train_score=True)

Grid search will have to run and compare 240 models (= 4*3*5*2*2).
For 5-fold cross-validation, grid search will have to evaluate 1200 (= 240*5) model performances.
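
The size of this grid can be checked without training anything. The sketch below enumerates the Cartesian product of the value lists with the standard library; the variable names `candidates`, `n_candidates`, and `n_fits` are illustrative and are not part of scikit-learn.

```python
from itertools import product

# Same value lists as in the grid on the slide (no model is trained here).
param_grid = {
    "n_estimators": [50, 100, 200, 300],
    "min_samples_leaf": [1, 5, 10],
    "max_depth": [2, 4, 6, 8, 10],
    "max_features": ["auto", "sqrt"],
    "bootstrap": [True, False],
}

# Cartesian product of all value lists: one dict per candidate setting.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)    # 4 * 3 * 5 * 2 * 2
cv_folds = 5
n_fits = n_candidates * cv_folds  # one fit per candidate per fold

print(n_candidates, n_fits)  # 240 1200
```

Adding one more hyperparameter with k values multiplies both numbers by k, which is why grid search scales poorly with the number of hyperparameters.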
Grid Search

from time import time

# Record the current time
start = time()

# Fit the selected model
model_gridsearch.fit(X_train, y_train)

# Print the time spent and the number of candidate settings evaluated
print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % ((time() - start), len(model_gridsearch.cv_results_['params'])))

# Predict on the test set and compute accuracy
y_pred_grid = model_gridsearch.predict(X_test)
accuracy_grid = accuracy_score(y_test, y_pred_grid)
print(accuracy_grid)

Output:
GridSearchCV took 247.79 seconds for 240 candidate parameter settings.
0.88
Random Search

In random search, we define a distribution (or a list of candidate values) for each
hyperparameter; candidate settings are then drawn at random instead of being enumerated
exhaustively.

For example, if the search space contains 500 possible settings and we set n_iter=50, then
random search will randomly sample 50 of them to test.

Since random search does not try every hyperparameter combination, it does not necessarily
return the best-performing values, but it returns a relatively good model in a
significantly shorter time.
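
The sampling step can be sketched with the standard library alone. The search space below and the helper `sample_candidates` are hypothetical; the sketch draws each value independently and uniformly (with replacement), which is a simplification of what a library implementation may do.

```python
import random

# Hypothetical search space: one list of candidate values per hyperparameter.
param_space = {
    "n_estimators": list(range(50, 300, 10)),
    "min_samples_leaf": list(range(1, 50)),
    "max_depth": list(range(2, 20)),
}


def sample_candidates(space, n_iter, seed=0):
    """Draw n_iter hyperparameter settings, each value chosen uniformly at random."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(n_iter)]


candidates = sample_candidates(param_space, n_iter=50)

# Size of the full grid that an exhaustive search would have to cover.
total = 1
for values in param_space.values():
    total *= len(values)

print(len(candidates), "sampled out of", total, "possible settings")
```

The cost is controlled by n_iter alone, regardless of how large the full space is.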
Random Search

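from sklearn.model_selection import RandomizedSearchCV

# Specify the distributions to sample from
# Note: 'auto' is no longer accepted for max_features in recent scikit-learn
# releases; on those versions replace it with another valid value.
param_dist = {
    'n_estimators': list(range(50, 300, 10)),
    'min_samples_leaf': list(range(1, 50)),
    'max_depth': list(range(2, 20)),
    'max_features': ['auto', 'sqrt'],
    'bootstrap': [True, False]}

# Specify the number of search iterations
n_iter = 50

# Instantiate RandomizedSearchCV
model_random_search = RandomizedSearchCV(
    estimator=rf_model,
    param_distributions=param_dist,
    n_iter=n_iter)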
Random Search

from time import time

# Record the current time
start = time()

# Fit the selected model
model_random_search.fit(X_train, y_train)

# Print the time spent and the number of candidate settings evaluated
print("RandomizedSearchCV took %.2f seconds for %d candidate parameter settings."
      % ((time() - start), len(model_random_search.cv_results_['params'])))

# Predict on the test set and compute accuracy
y_pred_random = model_random_search.predict(X_test)
accuracy_random = accuracy_score(y_test, y_pred_random)
print(accuracy_random)

Output:
RandomizedSearchCV took 64.17 seconds for 50 candidate parameter settings.
0.86
Coarse to Fine Search

For Grid Search, an increased number of hyperparameters easily becomes a bottleneck. To
prevent this inefficiency, we can combine grid search with random search.

● In coarse-to-fine tuning, we start with a random search to find the promising value
ranges for each hyperparameter.
● After random search has identified a focus area for each hyperparameter, we define
the grid accordingly so that grid search can find the best values within it.
● For example, if the random search returns high performance for n_estimators between
150 and 200, this is the range we want grid search to focus on.
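
The two stages can be sketched on a single hyperparameter. The function `validation_score` below is a made-up stand-in for a cross-validated score (it simply peaks at 170), and the number of coarse samples, the top-3 cut-off, and the fine step of 5 are arbitrary choices for illustration.

```python
import random


def validation_score(n_estimators):
    """Hypothetical stand-in for a cross-validated score; peaks at 170."""
    return 1.0 - ((n_estimators - 170) / 300) ** 2


rng = random.Random(0)

# Coarse stage: random search over a wide range of values.
coarse_values = rng.sample(range(50, 300, 10), 8)
ranked = sorted(coarse_values, key=validation_score, reverse=True)
top = ranked[:3]                    # most promising coarse candidates
low, high = min(top), max(top)      # focus area for the fine stage

# Fine stage: exhaustive grid search inside the focus area only.
fine_grid = range(low, high + 1, 5)
best = max(fine_grid, key=validation_score)

print("focus area:", (low, high), "best n_estimators:", best)
```

The fine grid is small because it only covers the focus area, so the exhaustive stage stays cheap.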
Image Source: A Medium Blog by Idil Ismiguzel

When to Use What

When we are training a deep neural network with a huge hyperparameter space, it is
preferable to use Manual Search or Random Search rather than Grid Search.

Image Source: A Medium Blog by Louis Owen

Overall Objective

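Let Ɛ be the set of all possible input–output examples that follow a prior probability
distribution P(X, Y).
Then the expected generalization loss for a hypothesis h (with respect to loss function L) is:

\[ \mathrm{GenLoss}_L(h) = \sum_{(x, y) \in \mathcal{E}} L(y, h(x)) \, P(x, y) \]

The best hypothesis h* is the one with the minimum expected generalization loss:

\[ h^* = \underset{h \in \mathcal{H}}{\arg\min} \; \mathrm{GenLoss}_L(h) \]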
Overall Objective

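Because P(X, Y) is not known in most cases, the learner can only estimate generalization loss
with the empirical loss on a set of examples D of size N:

\[ \mathrm{EmpLoss}_{L,D}(h) = \frac{1}{N} \sum_{(x, y) \in D} L(y, h(x)) \]

The estimated best hypothesis ĥ* is the one with the minimum empirical loss:

\[ \hat{h}^* = \underset{h \in \mathcal{H}}{\arg\min} \; \mathrm{EmpLoss}_{L,D}(h) \]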
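
As a small illustration of minimizing empirical loss, the sketch below evaluates a tiny, hand-picked hypothesis space of lines h(x) = a*x + b on a toy data set and picks the one with the lowest average squared error. The data, the candidate values of a and b, and the function names are assumptions for this example.

```python
def squared_loss(y, y_hat):
    return (y - y_hat) ** 2


def empirical_loss(h, examples, loss=squared_loss):
    """Average loss of hypothesis h over the examples D."""
    return sum(loss(y, h(x)) for x, y in examples) / len(examples)


# Toy data set D, generated by y = 2x + 3 (an assumption for this example).
D = [(0.0, 3.0), (1.0, 5.0), (2.0, 7.0), (3.0, 9.0)]

# A tiny hypothesis space of lines h(x) = a*x + b.
hypotheses = {
    (a, b): (lambda x, a=a, b=b: a * x + b)
    for a in (1, 2, 3) for b in (0, 3)
}

# The estimated best hypothesis minimizes the empirical loss over D.
best_params = min(hypotheses, key=lambda p: empirical_loss(hypotheses[p], D))
print(best_params, empirical_loss(hypotheses[best_params], D))  # (2, 3) 0.0
```

Enumerating hypotheses only works for a finite, very small space; for continuous parameters the same minimization is done by an optimizer such as gradient descent.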
Training

Training ﬁnds the best hypothesis within the hypothesis space.

[Figure: example hypothesis with parameters a=2, b=3]

One common way to find the final hypothesis is by minimizing a loss function using the Gradient
Descent algorithm.
Gradient Descent

Gradient Descent is a generic method to tweak parameters iteratively in order to minimize a
cost (a.k.a. loss) function.
● It is a search in the model's parameter space for values of the parameters that minimize
the loss function.

Image Source: https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9

Gradient Descent

Conceptually:
● It starts with an initial guess for the values of the parameters (called random
initialization).
● Then, repeatedly, it updates the parameter values, hopefully to reduce the loss.

Ideally, it keeps doing this until convergence, that is, until changes to the parameter
values no longer result in lower loss.

The key to this algorithm is how to update the parameter values.
Gradient Descent: The Update Rule

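To update the parameter values to reduce the loss:

● Compute the gradient vector of the loss with respect to the parameters w.
○ But this points 'uphill' and we want to go 'downhill'.
○ And we want to make 'baby steps' (see later), so we use a learning rate α, which is
typically between 0 and 1.
● So subtract α times the gradient vector from w:

\[ \mathbf{w} \leftarrow \mathbf{w} - \alpha \, \nabla_{\mathbf{w}} \mathrm{Loss}(\mathbf{w}) \]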
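
A single application of the update rule can be traced by hand. The sketch below uses an illustrative bowl-shaped loss with its minimum at w = (1, -2); the loss, the starting point, and α = 0.25 are assumptions chosen to keep the arithmetic easy to follow.

```python
def gradient_step(w, grad, alpha):
    """Return w - alpha * grad, applied element-wise."""
    return [w_i - alpha * g_i for w_i, g_i in zip(w, grad)]


def loss(w):
    # Illustrative bowl-shaped loss with its minimum at w = (1, -2).
    return (w[0] - 1) ** 2 + (w[1] + 2) ** 2


def loss_gradient(w):
    # Partial derivatives of the loss above with respect to w[0] and w[1].
    return [2 * (w[0] - 1), 2 * (w[1] + 2)]


w = [3.0, 0.0]
w_new = gradient_step(w, loss_gradient(w), alpha=0.25)

print(w_new)                    # [2.0, -1.0]
print(loss(w), "->", loss(w_new))  # 8.0 -> 2.0
```

The gradient at (3, 0) is (4, 4), so one step of size 0.25 moves the point to (2, -1) and the loss drops from 8 to 2.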

Gradient Descent: The Learning Rate

The size of the steps is determined by the learning rate.

● If the learning rate is too small, it will take many updates until convergence.

● If the learning rate is too big, the algorithm might jump across the valley (overshoot). It
may even end up with a higher loss than before; where the slope is steeper, the next
step is bigger still, so the algorithm can diverge.
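
Both failure modes show up on the simplest possible loss, Loss(w) = w², whose gradient is 2w. The three learning rates below are arbitrary choices for illustration; each update multiplies w by (1 - 2α).

```python
def run(alpha, steps=20, w=1.0):
    """Gradient descent on Loss(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w


for alpha in (0.01, 0.4, 1.1):
    # Distance from the minimum (w = 0) after 20 updates.
    print(alpha, abs(run(alpha)))
```

With α = 0.01 the point has barely moved after 20 updates, with α = 0.4 it is essentially at the minimum, and with α = 1.1 each update overshoots by more than the last, so the distance from the minimum grows.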
Gradient Descent: Sensitive to Scaling of Features

For Gradient Descent, we do need to scale the features.

● If features have different ranges, it affects the shape of the 'bowl'.
○ E.g. features 1 and 2 have similar ranges of values: a round 'bowl'.
○ The algorithm goes straight towards the minimum.

Gradient Descent: Sensitive to Scaling of Features

E.g. feature 1 has smaller values than feature 2 — an elongated 'bowl':

● Since feature 1 has smaller values, it takes a larger change in θ1 to affect the loss function,
which is why it is elongated.
● It takes more steps to get to the minimum — steeply down but not really towards the
goal, followed by a long march down a nearly ﬂat valley.
● It makes it more difﬁcult to choose a value for the learning rate that avoids divergence: a
value that suits one feature may not suit another.
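
The effect can be measured on a toy problem. The sketch below fits y ≈ w1*x1 + w2*x2 by batch gradient descent on the mean squared error, once on raw features (x2 is ten times larger than x1) and once after standardizing both. The data, the tolerance, and the two learning rates are assumptions: 0.005 is used for the raw features because rates above 0.01 diverge on this data, while the standardized features tolerate 0.4.

```python
def standardize(column):
    """Rescale a feature to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / std for v in column]


def steps_to_converge(x1, x2, y, alpha, tol=1e-6, max_steps=100_000):
    """Count gradient descent steps until both gradient components are below tol."""
    w1 = w2 = 0.0
    n = len(y)
    for step in range(1, max_steps + 1):
        errors = [w1 * a + w2 * b - t for a, b, t in zip(x1, x2, y)]
        g1 = 2 * sum(e * a for e, a in zip(errors, x1)) / n
        g2 = 2 * sum(e * b for e, b in zip(errors, x2)) / n
        if abs(g1) < tol and abs(g2) < tol:
            return step
        w1, w2 = w1 - alpha * g1, w2 - alpha * g2
    return max_steps


x1 = [1.0, -1.0, 1.0, -1.0]        # small-valued feature
x2 = [10.0, 10.0, -10.0, -10.0]    # large-valued feature
y = [3 * a + 0.05 * b for a, b in zip(x1, x2)]

unscaled = steps_to_converge(x1, x2, y, alpha=0.005)
scaled = steps_to_converge(standardize(x1), standardize(x2), y, alpha=0.4)
print("steps without scaling:", unscaled, "| with scaling:", scaled)
```

The learning rate for the raw features is capped by the steep direction (the large feature), which makes progress along the flat direction slow; after scaling, one rate suits both weights.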
Gradient Descent Algorithm Pseudocode
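
A minimal sketch of the loop described on the preceding slides, written as runnable Python rather than pseudocode. The function signature, the default values, and the stopping test (all gradient components below a tolerance) are assumptions made for this example; other convergence tests, such as checking the change in loss, are equally valid.

```python
def gradient_descent(gradient, w_init, alpha=0.1, tol=1e-8, max_iters=10_000):
    """Generic gradient descent.

    gradient: function returning the gradient (a list) at a point w (a list).
    Stops when every gradient component is below tol, or after max_iters updates.
    """
    w = list(w_init)
    for _ in range(max_iters):
        grad = gradient(w)
        if all(abs(g) < tol for g in grad):
            break  # converged: further updates would barely change w
        # Update rule: subtract alpha times the gradient from w.
        w = [w_i - alpha * g_i for w_i, g_i in zip(w, grad)]
    return w


# Example: minimize Loss(w) = (w0 - 1)^2 + (w1 + 2)^2, whose minimum is at (1, -2).
w_star = gradient_descent(lambda w: [2 * (w[0] - 1), 2 * (w[1] + 2)], [0.0, 0.0])
print(w_star)
```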
Gradient Descent in Action

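For univariate regression, with h_w(x) = w1 x + w0, the squared-error loss is quadratic, so
the partial derivative will be linear. For a single example (x, y):

\[ \frac{\partial}{\partial w_i} \mathrm{Loss}(\mathbf{w}) = \frac{\partial}{\partial w_i} \left( y - h_{\mathbf{w}}(x) \right)^2 = 2 \left( y - h_{\mathbf{w}}(x) \right) \cdot \frac{\partial}{\partial w_i} \left( y - (w_1 x + w_0) \right) \]
Gradient Descent in Action

Applying this to both w0 and w1 we get:

\[ \frac{\partial}{\partial w_0} \mathrm{Loss}(\mathbf{w}) = -2 \left( y - h_{\mathbf{w}}(x) \right), \qquad \frac{\partial}{\partial w_1} \mathrm{Loss}(\mathbf{w}) = -2 \left( y - h_{\mathbf{w}}(x) \right) x \]

Substituting these into the update rule and folding the constant 2 into the learning rate α gives:

\[ w_0 \leftarrow w_0 + \alpha \left( y - h_{\mathbf{w}}(x) \right), \qquad w_1 \leftarrow w_1 + \alpha \left( y - h_{\mathbf{w}}(x) \right) x \]

Here, loss covers only one training example.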

Batch Gradient Descent

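For N training examples, we want to minimize the sum of the individual losses for each
example.
● The derivative of a sum is the sum of the derivatives, so, with h_w(x) = w1 x + w0, we have:

\[ w_0 \leftarrow w_0 + \alpha \sum_{j=1}^{N} \left( y_j - h_{\mathbf{w}}(x_j) \right), \qquad w_1 \leftarrow w_1 + \alpha \sum_{j=1}^{N} \left( y_j - h_{\mathbf{w}}(x_j) \right) x_j \]

● We have to sum over all N training examples for every step, and there may be many
steps.
● A step that covers all the training examples is called an epoch.

These updates constitute the batch gradient descent learning rule for univariate linear
regression.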
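
The batch rule can be sketched in a few lines of plain Python. The toy data (generated by y = 2x + 3), the learning rate, and the number of epochs are assumptions for this example; the constant factor from the derivative is folded into α, so each weight is updated by α times a sum of errors over all examples.

```python
def batch_gradient_descent(xs, ys, alpha=0.01, epochs=5000):
    """Fit h(x) = w1*x + w0 with the batch update rule (sum over all examples)."""
    w0 = w1 = 0.0
    for _ in range(epochs):
        # Errors for every training example, all computed from the current weights.
        errors = [y - (w1 * x + w0) for x, y in zip(xs, ys)]
        w0 += alpha * sum(errors)
        w1 += alpha * sum(e * x for e, x in zip(errors, xs))
    return w0, w1


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]   # generated by y = 2x + 3

w0, w1 = batch_gradient_descent(xs, ys)
print(round(w0, 4), round(w1, 4))  # 3.0 2.0
```

Every epoch touches all five examples once; with a large training set this full pass per update is the cost that the stochastic and mini-batch variants reduce.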
Gradient Descent: Multivariate Linear Regression

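We can easily extend to multivariable linear regression problems, in which each example x_j is
an n-element vector:

\[ h_{\mathbf{w}}(\mathbf{x}_j) = w_0 + w_1 x_{j,1} + \dots + w_n x_{j,n} = w_0 + \sum_{i=1}^{n} w_i x_{j,i} \]

Vectorized form (with a dummy input attribute x_{j,0} = 1, so that w_0 is handled like the other weights):

\[ h_{\mathbf{w}}(\mathbf{x}_j) = \mathbf{w} \cdot \mathbf{x}_j = \mathbf{w}^\top \mathbf{x}_j = \sum_{i=0}^{n} w_i x_{j,i} \]

The best vector of weights, w*, minimizes squared-error loss over the examples:

\[ \mathbf{w}^* = \underset{\mathbf{w}}{\arg\min} \sum_{j=1}^{N} \left( y_j - \mathbf{w} \cdot \mathbf{x}_j \right)^2 \]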
Stochastic Gradient Descent

As we saw, in each iteration, Batch Gradient Descent does a calculation on the entire training
set, which, for large training sets, may be slow.

● Stochastic Gradient Descent (SGD), on each iteration, picks just one training example x_j
at random and computes the gradients on just that one example.
● This gives a huge speed-up.
○ It enables us to train on huge training sets, since only one example needs to be in
memory in each iteration.
○ But, because it is stochastic (random), the loss will not necessarily decrease on each
iteration.
● On average, the loss decreases, but in any one iteration, the loss may go up or down.
● Eventually, it will get close to the minimum, but it keeps bouncing around and does not
necessarily settle at the optimum.
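
A minimal sketch of SGD for univariate linear regression, using one randomly chosen example per update. The data, learning rate, iteration count, and seed are assumptions. The toy data is noise-free, so here the estimate does settle; on noisy data the weights would keep fluctuating around the minimum.

```python
import random


def sgd(xs, ys, alpha=0.01, iterations=20_000, seed=0):
    """Fit h(x) = w1*x + w0, updating on one random example per iteration."""
    rng = random.Random(seed)
    w0 = w1 = 0.0
    for _ in range(iterations):
        j = rng.randrange(len(xs))            # pick one training example at random
        error = ys[j] - (w1 * xs[j] + w0)     # error on that example only
        w0 += alpha * error
        w1 += alpha * error * xs[j]
    return w0, w1


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]   # generated by y = 2x + 3

w0, w1 = sgd(xs, ys)
print(round(w0, 3), round(w1, 3))
```

Each update costs the same no matter how large the training set is, which is where the speed-up over the batch rule comes from.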
Stochastic Gradient Descent

Simulated Annealing
● As we discussed, SGD does not settle at the minimum.

● One solution is to gradually reduce the learning rate:

○ Updates start out 'large' so you make progress.
○ But, over time, updates get smaller, allowing SGD to settle at or near the global
minimum.

● The function that determines how to reduce the learning rate is called the learning
schedule.
○ Reduce it too quickly and you may not converge on or near to the global minimum.
○ Reduce it too slowly and you may still bounce around a lot and, if stopped after too
few iterations, may end up with a suboptimal solution.
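
One simple learning schedule is inverse-time decay. The function below, including its name and the constants `alpha0` and `decay`, is a hypothetical example and not the schedule used in the lecture; many other schedules exist.

```python
def learning_schedule(t, alpha0=0.5, decay=0.01):
    """Hypothetical inverse-time decay: alpha_t = alpha0 / (1 + decay * t)."""
    return alpha0 / (1 + decay * t)


# Learning rate at a few iteration counts: large early on, small later.
for t in (0, 100, 1000, 10_000):
    print(t, learning_schedule(t))
```

A larger `decay` shrinks the steps sooner (risking a premature stop); a smaller one keeps them large for longer (more bouncing around).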
Mini-Batch Gradient Descent

Batch Gradient Descent computes gradients from the full training set and Stochastic Gradient
Descent computes gradients from just one example.

● Mini-Batch Gradient Descent lies between the two:

○ It computes gradients from a small randomly-selected subset of the training set,
called a mini-batch.

● Since it lies between the two:

○ It may bounce less and get closer to the global minimum than SGD…
■ …although both of them can reach the global minimum with a good learning
schedule.
○ Its time and memory costs lie between the two.
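
The three variants differ only in how many examples feed each update, which the sketch below exposes as `batch_size`. The function name, data, learning rate, and epoch count are assumptions for illustration. Setting `batch_size=1` gives SGD-style updates and `batch_size=len(xs)` gives batch gradient descent.

```python
import random


def minibatch_gd(xs, ys, batch_size, alpha=0.01, epochs=2000, seed=0):
    """Fit h(x) = w1*x + w0; each update uses a random mini-batch of examples."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w0 = w1 = 0.0
    for _ in range(epochs):
        rng.shuffle(indices)                       # new random order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [ys[j] - (w1 * xs[j] + w0) for j in batch]
            w0 += alpha * sum(errors)
            w1 += alpha * sum(e * xs[j] for e, j in zip(errors, batch))
    return w0, w1


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]   # generated by y = 2x + 3

w0, w1 = minibatch_gd(xs, ys, batch_size=2)
print(round(w0, 3), round(w1, 3))
```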
Non-Convex Loss Functions

Gradient Descent is a generic method: you can use it to find the minima of other loss
functions.

● Not all loss functions are convex, which can cause problems for Gradient Descent:
○ The algorithm might converge to a local minimum, instead of the global minimum.
○ It may take a long time to cross a plateau.

What do we do about this?

Non-Convex Loss Functions

What do we do about this?

● One thing is to prefer Stochastic Gradient Descent (or Mini-Batch Gradient Descent):
because of the way they 'bounce around', they might even escape a local minimum, and
might even get to the global minimum.

● In this context, simulated annealing is also useful: updates start out 'large' allowing these
algorithms to make progress and even escape local minima; but, over time, updates get
smaller, allowing these algorithms to settle at or near the global minimum.

● But, with simulated annealing, if you reduce the learning rate too quickly, you may still
get stuck in a local minimum.
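
The local-minimum problem is easy to reproduce with plain (deterministic) gradient descent. The non-convex function f(w) = w⁴ - 3w² + w used below is an arbitrary example chosen for illustration; it has a local minimum near w ≈ 1.13 and its global minimum near w ≈ -1.30.

```python
def f(w):
    """Illustrative non-convex loss with two minima."""
    return w ** 4 - 3 * w ** 2 + w


def f_grad(w):
    return 4 * w ** 3 - 6 * w + 1


def descend(w, alpha=0.01, steps=1000):
    """Plain gradient descent from the starting point w."""
    for _ in range(steps):
        w = w - alpha * f_grad(w)
    return w


w_right = descend(2.0)    # starts on the right: ends in the local minimum
w_left = descend(-2.0)    # starts on the left: ends in the global minimum

print(round(w_right, 3), round(f(w_right), 3))
print(round(w_left, 3), round(f(w_left), 3))
```

Which minimum is reached depends only on the starting point here; the noise in stochastic or mini-batch updates is what gives those variants a chance to leave the shallower basin.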
Next lecture
Introduction to Neural Networks
13th January 2025
