
Lecture 8: Gradient Descent and Logistic Regression

This document provides an overview of gradient descent and logistic regression. It discusses how gradient descent can be used to optimize parameters in linear and logistic regression models by minimizing a cost function. Gradient descent works by taking small steps in the direction of steepest descent. Stochastic gradient descent processes examples one at a time for faster convergence on large datasets. Logistic regression applies gradient descent to classification problems by using a logistic cost function and sigmoid activation. Advanced optimization methods like Newton's method can converge faster than gradient descent but require calculating the Hessian matrix.

Uploaded by

Ashish Jain

Lecture 8: Gradient descent and logistic regression
17.11.2016
Course contents

• Lecture 1: Introduction and basic principles
• Lecture 2: Covariance and Gaussianity
• Lecture 3: Multivariate linear regression
• Lecture 4: Principal component analysis
• Lecture 5: Bridging input and output
• Lecture 6: Gaussian Mixture Models (Pedram Daee)
• Lecture 7: Gaussian Process Regression (Pedram Daee)
• Lecture 8: Gradient descent and logistic regression

2
Mathematical terms

• x: input variables, also called input features
• y: output or target variable that we are trying to predict
• A pair (x, y) is called a training example
• The dataset that we use to learn, a list of m training examples (x(i), y(i)); i = 1, . . . , m, is called a training set
• We use X to denote the space of input values, and Y the space of output values

3
Model learning

• Our goal is, given a training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y
• The function h is called a hypothesis

4
Hypothesis function
• For example, if the input has two variables (is two-dimensional, x ∈ R2), we should decide how to represent the hypothesis h
• Let’s say we decide to approximate y as a linear function of x:
  h(x) = θ0 + θ1 x1 + θ2 x2
• The θ’s are parameters, also called weights
• In this course, we always include the intercept parameter by adding an input fixed to one, x0 = 1. This deals with the offset.

5
How to learn the parameters θ
• Make h(x) close to y, i.e. minimize the following cost function:
  J(θ) = (1/2) Σi (h(x(i)) − y(i))²
• To do so, let’s use a search algorithm that starts with some “initial guess” for θ, and that repeatedly changes θ to make J(θ) smaller
• We need an algorithm that starts with an initial θ and repeatedly performs updates until J converges

6
Gradient descent
• A natural algorithm that repeatedly takes a step in the direction of steepest decrease of J:
  θj := θj − α ∂J(θ)/∂θj
• α is called the learning rate
• For a single training example, the update becomes:
  θj := θj + α (y(i) − h(x(i))) xj(i)

7
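To make the update rule concrete, here is a minimal sketch of batch gradient descent for linear regression in NumPy. This is my own illustration, not part of the slides: the function name `gradient_descent` and the noiseless toy data generated from y = 1 + 2x are assumptions.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent for linear regression.

    X is assumed to already contain the intercept column x0 = 1.
    Update: theta := theta - alpha * X^T (X theta - y) / m
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m   # gradient of the average squared error
        theta -= alpha * grad              # step against the gradient
    return theta

# Hypothetical toy data: y = 1 + 2*x, no noise
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # prepend the x0 = 1 intercept column
y = 1.0 + 2.0 * x
theta = gradient_descent(X, y)
```

On this noiseless data the iterates approach the true parameters [1, 2]; the learning rate α here is small enough for the quadratic cost to contract at every step.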
Global minima for linear regression

• Gradient descent can be susceptible to local minima
• But for linear regression, J has only one global optimum and no other local optima
• Because J is a convex quadratic function, gradient descent always converges (assuming the learning rate α is not too large) to the global minimum

8
Convex cost function

• For the linear regression problem, the cost function defined above is convex, so gradient descent always converges to the global optimum

[Figure: contours of the quadratic cost function]

9
Non-convex cost function

• Depending on where we start, we might end up in different local optima

10
Batch and stochastic gradient descents
• Batch (sum over all m examples before each step):
  θj := θj + α Σi (y(i) − h(x(i))) xj(i)   (for every j)
• Stochastic (one example at a time):
  for i = 1 to m: θj := θj + α (y(i) − h(x(i))) xj(i)   (for every j)

11
Batch and stochastic gradient descents

• Batch gradient descent has to scan through the entire training set before taking a single step, a costly operation if m is large
• Stochastic gradient descent can start making progress right away, and continues to make progress with each example it looks at
• When the training set is large, stochastic gradient descent is often preferred over batch gradient descent

12
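As a rough sketch (again my own illustration, with hypothetical toy data), the stochastic variant updates θ after every single example rather than after a full pass:

```python
import numpy as np

def sgd(X, y, alpha=0.05, epochs=300, seed=0):
    """Stochastic gradient descent: update theta after *each* example."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):        # visit examples in random order
            err = X[i] @ theta - y[i]       # error on this single example
            theta -= alpha * err * X[i]     # immediate update, no full scan
    return theta

# Same hypothetical toy set as before: y = 1 + 2*x
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
theta = sgd(X, y)
```

Because the data here are noiseless and consistent, the per-example updates settle on the same solution as batch descent; on noisy data a fixed α makes the iterates oscillate around the optimum instead.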
Some points before implementation

• Make sure features are on a similar scale
• This makes the contour plots look more circular, so gradient descent reaches the global optimum faster
• Scaling can be done using the variance or standard deviation of each variable

13
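A minimal sketch of the scaling step described above (my own illustration; the helper name `standardize` and the toy matrix are assumptions):

```python
import numpy as np

def standardize(X):
    """Scale each feature to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Hypothetical features on very different scales (e.g. size vs. room count)
X = np.array([[2000.0, 3.0],
              [1500.0, 2.0],
              [1000.0, 4.0]])
X_scaled, mu, sigma = standardize(X)
```

The stored mu and sigma must be reused to scale any test inputs the same way.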
Correct implementation

14
Selecting learning rate

• If α is small, gradient descent can be slow
• If α is too large, gradient descent might overshoot the minimum

15
Polynomial fitting
• Adding features (e.g. polynomial terms) to the model can give a better fit
• Overfitting is a problem
• Cross-validation is one way to avoid overfitting

16
Gradient descent versus normal equation

Gradient Descent | Normal Equation: θ = (XᵀX)⁻¹ XᵀY
Multiple iterations and many steps to reach the global optimum | One step to get to the optimal value
Need to choose α | No need to choose α
Works fine for a large number of features | Slow if the number of features is large
Time complexity is O(n) | Time complexity is O(n³) for (XᵀX)⁻¹
Handles error measures with no closed-form solution, e.g. the absolute error or any non-differentiable term | Not applicable when XᵀX is not invertible; the pseudo-inverse can help to some extent

17
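The normal-equation column can be sketched in a couple of lines (my own illustration, not from the slides; data are the same hypothetical toy set). Using the pseudo-inverse, as the table suggests, also covers the case where XᵀX is not invertible:

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least squares: theta = (X^T X)^{-1} X^T y.

    np.linalg.pinv is the pseudo-inverse, so this still returns a
    solution when X^T X is singular.
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Hypothetical toy data: y = 1 + 2*x
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
theta = normal_equation(X, y)
```

One call recovers the optimum that gradient descent needs many iterations to reach, but the n³ cost of the inversion dominates when the number of features grows.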
Classification

• Binary classification: y ∈ {0, 1}
• Multi-class classification: y ∈ {0, 1, 2, 3, …}

[Figure: examples with labels y = 0 and y = 1 plotted against x; a threshold (e.g. 0.5) on the fitted value is defined to classify]

18
Linear regression for classification
• Applying linear regression to classification is often not useful

[Figure: a regression line fitted to 0/1-labelled data; a single distant example shifts the fitted line and the 0.5 threshold]

• h(x) can be a large positive or negative value, while y is only 0 or 1
• Logistic regression:
  – a classification method, not regression, despite the name

19
Change of hypothesis

• Logistic or sigmoid function:
  g(z) = 1 / (1 + e^(−z))
• Or, as the hypothesis: h(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))
• Derivative of the sigmoid: g′(z) = g(z) (1 − g(z))

20
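The sigmoid and its derivative g′(z) = g(z)(1 − g(z)) are two one-liners; this sketch (my own, not from the slides) just makes the slide's formulas executable:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    """Derivative g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)
```

Note that g(0) = 0.5, which is exactly the classification threshold used later, and the derivative peaks at 0.25 at z = 0.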
Logistic regression

• Assume:
  P(y = 1 | x; θ) = h(x),  P(y = 0 | x; θ) = 1 − h(x)
• It can be written as:
  p(y | x; θ) = h(x)^y (1 − h(x))^(1−y)
• We can see: g(z) ≥ 0.5 if z ≥ 0, and g(z) < 0.5 if z < 0

21
Cost Function and optimization
• The linear regression cost function was convex
• The same cost function for logistic regression is non-convex because of the nonlinear sigmoid function
• We define the logistic regression cost function as:
  Cost(h(x), y) = −log(h(x)) if y = 1, and −log(1 − h(x)) if y = 0

22
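The averaged logistic cost over a training set can be sketched as follows (my own illustration; the function name `logistic_cost` and the toy data are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """J(theta) = -(1/m) * sum( y*log(h) + (1-y)*log(1-h) )."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# Hypothetical 1-D data with an intercept column
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
cost_zero = logistic_cost(np.zeros(2), X, y)          # h = 0.5 everywhere
cost_fit = logistic_cost(np.array([0.0, 1.0]), X, y)  # a better slope
```

At θ = 0 every h is 0.5, so the cost is exactly log 2 ≈ 0.693; any θ that pushes h toward the correct labels drives the cost below that.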
Convex cost function for logistic regression

[Figure: Cost(h(x), y) plotted against h(x) ∈ [0, 1], one curve for y = 1 and one for y = 0]

• If h goes to zero and the cost also goes to zero, class 0 is selected
• If h goes to 1 and the cost goes to zero, class 1 is selected

23
Cost function for logistic regression

• It can be written as:
  Cost(h(x), y) = −y log(h(x)) − (1 − y) log(1 − h(x))
• For m training examples, the likelihood of the parameters is:
  L(θ) = Πi p(y(i) | x(i); θ) = Πi h(x(i))^y(i) (1 − h(x(i)))^(1−y(i))

24
• It will be easier to work with the log-likelihood; instead of minimizing the cost function, we will maximize the log-likelihood function:
  ℓ(θ) = log L(θ) = Σi [ y(i) log h(x(i)) + (1 − y(i)) log(1 − h(x(i))) ]
• Given g′(z) = g(z)(1 − g(z)), the partial derivatives are:
  ∂ℓ(θ)/∂θj = Σi (y(i) − h(x(i))) xj(i)

25
• Since we are maximizing rather than minimizing, gradient ascent is applied for parameter optimization
• The parameter updating rule:
  θj := θj + α (y(i) − h(x(i))) xj(i)
• This looks similar to the least-mean-squares update
• However, they are different, because h(x) is now a nonlinear (sigmoid) function of θᵀx
• But the updating rules of the parameters look the same

26
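The gradient-ascent rule above can be sketched as follows (my own illustration, not from the slides; the batch form of the update and the separable toy data are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, iters=1000):
    """Gradient ascent on the log-likelihood, summed over all examples.

    Update: theta := theta + alpha * X^T (y - g(X theta))
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += alpha * X.T @ (y - sigmoid(X @ theta))  # ascent, hence +=
    return theta

# Hypothetical 1-D data with an intercept column
x = np.array([-2.0, -1.0, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = fit_logistic(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(float)
```

The code is nearly identical to the LMS loop for linear regression; only the sigmoid wrapped around Xθ differs, exactly as the slide notes.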
Another algorithm for optimization

• Gradient descent takes many iterative steps to reach the optimum
• The parameter α must be set manually
• There are other algorithms that converge faster than gradient descent, with no need to pick α
• However, they are more complex than gradient descent
• We know Newton’s method for finding zeros of a function f:
  θ := θ − f(θ) / f′(θ)

27
Optimization using newton’s method
• In fact, Newton’s method approximates the function f by a linear function, the tangent of f (with slope f′) at the current guess of the parameter
• It iterates until the function f equals zero
• Newton’s method is a way of finding zeros. What about finding the maxima of a function?
• The maximum of a function occurs where its derivative is zero:
  ℓ′(θ) = 0
• Therefore, by replacing f with ℓ′, the same algorithm can be used in Newton’s method:
  θ := θ − ℓ′(θ) / ℓ″(θ)

28
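The root-finding iteration x := x − f(x)/f′(x) fits in a few lines; this sketch (my own, with f(x) = x² − 2 as a hypothetical example) finds √2:

```python
def newton_root(f, fprime, x0, tol=1e-12, max_iter=50):
    """Newton's method: repeat x := x - f(x) / f'(x) until the step is tiny."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)   # how far the tangent line says to move
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: the positive root of f(x) = x^2 - 2 is sqrt(2)
root = newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
```

Each iteration roughly doubles the number of correct digits, which is the quadratic convergence that makes the method attractive.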
Newton- Raphson

• The generalization of Newton’s method to the multi-dimensional setting is called Newton–Raphson:
  θ := θ − H⁻¹ ∇θ ℓ(θ)
• where H is the Hessian:
  Hjk = ∂²ℓ(θ) / ∂θj ∂θk
• Newton’s method usually converges faster than gradient descent when maximizing the logistic regression log-likelihood
• Each iteration is more expensive than for gradient descent because of calculating the inverse of the Hessian
• As long as the number of parameters is not very large (H is an n × n matrix), Newton’s method is preferred

29
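Applied to the logistic log-likelihood, the Newton–Raphson step uses the gradient Xᵀ(y − h) and the Hessian −XᵀSX with S = diag(h(1 − h)). A minimal sketch (my own illustration; the non-separable toy data are an assumption, chosen so the maximum is finite):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, iters=10):
    """Newton-Raphson for the logistic log-likelihood.

    grad = X^T (y - h),  H = -X^T S X,  S = diag(h * (1 - h))
    Update: theta := theta - H^{-1} grad
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)
        S = np.diag(h * (1.0 - h))
        H = -(X.T @ S @ X)
        theta -= np.linalg.solve(H, grad)  # solve instead of explicit inverse
    return theta

# Hypothetical overlapping (non-separable) 1-D data with intercept column
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
theta = newton_logistic(X, y)
```

A handful of iterations drives the gradient to essentially zero, whereas gradient ascent would need far more steps; each step, however, pays for forming and solving the n × n Hessian system.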
Advanced optimization algorithms

• It is recommended to use built-in functions or software packages for advanced optimization algorithms rather than coding the algorithm yourself
• For this course, it is recommended to use Matlab’s advanced optimization function fminunc
• Built-in functions apply other, faster methods of optimization, e.g. quasi-Newton methods instead of Newton’s method
• Newton’s method calculates H directly; calculating H numerically involves a large amount of computation. Quasi-Newton methods avoid this by using an approximation to H

30
Multi-class classification

[Figure: three classes of points (×, +, O); the one-vs-all strategy turns them into three binary problems, one class against the rest]

• One-vs-all strategy: work with multiple binary classifications
• We train one logistic regression classifier for each class i to predict the probability that y = i
• For each x, pick the class having the highest probability

31
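The one-vs-all strategy can be sketched by reusing a binary logistic trainer per class (my own illustration; all names and the three-class 1-D toy data are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, alpha=0.1, iters=2000):
    """One binary logistic classifier via gradient ascent."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += alpha * X.T @ (y - sigmoid(X @ theta))
    return theta

def one_vs_all(X, y, n_classes):
    """Train one classifier per class i: 'class i' vs. 'all the rest'."""
    return [fit_binary(X, (y == i).astype(float)) for i in range(n_classes)]

def predict(thetas, X):
    """Pick, for each row of X, the class whose classifier is most confident."""
    probs = np.column_stack([sigmoid(X @ t) for t in thetas])
    return probs.argmax(axis=1)

# Hypothetical 1-D points in three clusters, with intercept column
x = np.array([-3.0, -2.5, -2.0, 0.0, 0.2, 0.5, 2.0, 2.5, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
thetas = one_vs_all(X, y, 3)
pred = predict(thetas, X)
```

The middle class is not linearly separable from the rest in one dimension, so its classifier outputs a roughly constant probability; it still wins on the middle points because the two outer classifiers are confident those points are not theirs.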
One-versus-one strategy

[Figure: pairwise binary problems between the three classes (×, +, O)]

We train K(K − 1)/2 binary classifiers for K classes, one for every combination of two classes. For the test data, we use all the classifiers to classify the data and then count the number of times the test data was assigned to each class. The final class is the one with the maximum number of wins.

32
Decision boundary

• Example: if θ = [−3 1 1], then y = 1 if −3 + x1 + x2 ≥ 0, i.e. x1 + x2 ≥ 3

Note:
• The decision boundary is a property of the parameters, not of the data
• The parameters are a property of the data and are learned from the data
• The decision boundary can have a more complex shape if higher orders of polynomials are applied

33
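The slide's example θ = [−3 1 1] can be checked directly (a sketch of my own; only the θ values come from the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-3.0, 1.0, 1.0])   # [intercept, x1, x2] from the slide

def predict(x1, x2):
    """y = 1 exactly when -3 + x1 + x2 >= 0, i.e. x1 + x2 >= 3."""
    return int(sigmoid(theta @ np.array([1.0, x1, x2])) >= 0.5)
```

Since g(z) ≥ 0.5 exactly when z ≥ 0, thresholding the sigmoid at 0.5 is the same as thresholding the linear score θᵀx at 0, which is why the boundary x1 + x2 = 3 is a straight line.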
Overfitting

[Figure: three fits of the same data]
• Underfit, high bias
• Quadratic terms: good fit
• Higher orders: overfit, not able to generalize to unseen data

34
How to deal with overfitting
• Having higher orders of polynomials seems to give a good fit, but how do we deal with overfitting?
  – Reduce the number of features manually
  – Keep all the features, but apply regularization
  – The most common variants in machine learning are L1 and L2 regularization
  – Minimize E(X, Y) + α‖w‖, where w is the model’s weight vector, ‖·‖ is either the L1 norm or the squared L2 norm, and α is a free parameter that needs to be tuned empirically
  – Regularization using the L2 norm is called Tikhonov regularization (ridge regression); using the L1 norm it is called lasso regularization

35
Regularized linear regression

• The regularized cost function:
  J(θ) = (1/2m) [ Σi (h(x(i)) − y(i))² + λ Σj θj² ]   (sum over j ≥ 1)
• Note: do not include the intercept parameter θ0 in the regularization
• If you choose a very large value for the regularization parameter λ, all the parameters are penalized close to zero and only the intercept is left; the fit reduces to a flat line

36
Regularized regression

• Regularized gradient descent:
  θ0 := θ0 − α (1/m) Σi (h(x(i)) − y(i)) x0(i)
  θj := θj − α [ (1/m) Σi (h(x(i)) − y(i)) xj(i) + (λ/m) θj ]   (j ≥ 1)

37
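The regularized update, with the intercept excluded from the penalty, can be sketched as follows (my own illustration; `ridge_gd` and the toy data are assumptions):

```python
import numpy as np

def ridge_gd(X, y, alpha=0.1, lam=1.0, iters=2000):
    """Gradient descent for L2-regularized linear regression.

    The intercept theta_0 (first column of X) is NOT penalized:
    theta_j := theta_j - alpha * [ (1/m) sum_i (h - y) x_j + (lam/m) theta_j ]
    """
    m, n = X.shape
    theta = np.zeros(n)
    reg = np.full(n, lam)
    reg[0] = 0.0                      # do not regularize the intercept
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m + (reg / m) * theta
        theta -= alpha * grad
    return theta

# Hypothetical toy data: y = 1 + 2*x, with intercept column
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
theta_ridge = ridge_gd(X, y, lam=10.0)   # penalized slope
theta_ols = ridge_gd(X, y, lam=0.0)      # plain least squares
```

With λ = 0 this reduces to ordinary gradient descent and recovers the slope 2; a large λ shrinks the slope toward zero while the unpenalized intercept absorbs the offset.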
Regularized logistic regression
• Similar to linear regression, the cost function is:
  J(θ) = −(1/m) Σi [ y(i) log h(x(i)) + (1 − y(i)) log(1 − h(x(i))) ] + (λ/2m) Σj θj²   (sum over j ≥ 1)
• Gradient descent:
  θj := θj − α [ (1/m) Σi (h(x(i)) − y(i)) xj(i) + (λ/m) θj ]   (j ≥ 1)
• The difference is that now h(x) = 1 / (1 + e^(−θᵀx))

38
Summary

• Gradient descent is a useful optimization technique for both classification and linear regression
• For linear regression, the cost function is convex, meaning that gradient descent always converges to the global optimum
• For a non-convex cost function, gradient descent might get stuck in a local optimum
• Logistic regression is a widely applied supervised classification technique
• For logistic regression, gradient descent and Newton–Raphson optimization techniques were explained

39
