
Lecture 8: Gradient Descent and Logistic Regression

This document provides an overview of gradient descent and logistic regression. It discusses how gradient descent can be used to optimize parameters in linear and logistic regression models by minimizing a cost function. Gradient descent works by taking small steps in the direction of steepest descent. Stochastic gradient descent processes examples one at a time for faster convergence on large datasets. Logistic regression applies gradient descent to classification problems by using a logistic cost function and sigmoid activation. Advanced optimization methods like Newton's method can converge faster than gradient descent but require calculating the Hessian matrix.

Uploaded by

Ashish Jain

Lecture 8: Gradient descent and logistic regression
17.11.2016
Course contents

• Lecture 1: Introduction and basic principles
• Lecture 2: Covariance and Gaussianity
• Lecture 3: Multivariate linear regression
• Lecture 4: Principal component analysis
• Lecture 5: Bridging input and output
• Lecture 6: Gaussian Mixture Models (Pedram Daee)
• Lecture 7: Gaussian Process Regression (Pedram Daee)
• Lecture 8: Gradient descent and logistic regression

2
Mathematical terms

• x: input variables, also called input features
• y: output or target variable that we are trying to predict
• A pair (x, y) is called a training example
• The dataset that we use to learn, a list of m training examples (x(i), y(i)); i = 1, . . . , m, is called a training set
• We use X to denote the space of input values, and Y the space of output values

3
Model learning

• Our goal is, given a training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y
• The function h is called a hypothesis

4
Hypothesis function
• For example, if the input has two variables (is two-dimensional, x ∈ R2), we should decide how to represent the hypothesis h
• Let’s say we decide to approximate y as a linear function of x:
  h(x) = θ0 + θ1 x1 + θ2 x2
• The θ’s are parameters, also called weights
• In this course, we always include the intercept parameter by adding an input fixed to one, x0 = 1. This deals with the offset.

5
How to learn the parameters θ
• Make h(x) close to y, i.e. minimize the following cost function:
  J(θ) = (1/2) Σi (h(x(i)) − y(i))²
• To do so, let’s use a search algorithm that starts with some “initial guess” for θ, and that repeatedly changes θ to make J(θ) smaller
• We need an algorithm that starts with an initial θ and repeatedly performs updates until J converges

6
Gradient descent
• A natural algorithm that repeatedly takes a step in the direction of steepest decrease of J:
  θj := θj − α ∂J(θ)/∂θj
• α is called the learning rate
• For a single training example, the update becomes:
  θj := θj + α (y(i) − h(x(i))) xj(i)

7
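To make the update rule concrete, here is a minimal sketch of batch gradient descent for linear regression in NumPy. This is my own illustration, not part of the slides: the function name `gradient_descent` and the noiseless toy data generated from y = 1 + 2x are assumptions.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent for linear regression.

    X is assumed to already contain the intercept column x0 = 1.
    Update: theta := theta - alpha * X^T (X theta - y) / m
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m   # gradient of the average squared error
        theta -= alpha * grad              # step against the gradient
    return theta

# Hypothetical toy data: y = 1 + 2*x, no noise
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # prepend the x0 = 1 intercept column
y = 1.0 + 2.0 * x
theta = gradient_descent(X, y)
```

On this noiseless data the iterates approach the true parameters [1, 2]; the learning rate α here is small enough for the quadratic cost to contract at every step.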
Global minima for linear regression

• Gradient descent can be susceptible to local minima
• But for linear regression, J has only one global optimum and no other local optima
• Because J is a convex quadratic function, gradient descent always converges (assuming the learning rate α is not too large) to the global minimum

8
Convex cost function

• For the linear regression problem, the cost function defined above is convex, so gradient descent always converges to the global optimum

[Figure: contours of the quadratic cost function]

9
Non-convex cost function

• Depending on where we start, we might end up in different local optima

10
Batch and stochastic gradient descents
• Batch (sum over all m examples before each step):
  θj := θj + α Σi (y(i) − h(x(i))) xj(i)   (for every j)
• Stochastic (one example at a time):
  for i = 1 to m: θj := θj + α (y(i) − h(x(i))) xj(i)   (for every j)

11
Batch and stochastic gradient descents

• Batch gradient descent has to scan through the entire training set before taking a single step, a costly operation if m is large
• Stochastic gradient descent can start making progress right away, and continues to make progress with each example it looks at
• When the training set is large, stochastic gradient descent is often preferred over batch gradient descent

12
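As a rough sketch (again my own illustration, with hypothetical toy data), the stochastic variant updates θ after every single example rather than after a full pass:

```python
import numpy as np

def sgd(X, y, alpha=0.05, epochs=300, seed=0):
    """Stochastic gradient descent: update theta after *each* example."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):        # visit examples in random order
            err = X[i] @ theta - y[i]       # error on this single example
            theta -= alpha * err * X[i]     # immediate update, no full scan
    return theta

# Same hypothetical toy set as before: y = 1 + 2*x
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
theta = sgd(X, y)
```

Because the data here are noiseless and consistent, the per-example updates settle on the same solution as batch descent; on noisy data a fixed α makes the iterates oscillate around the optimum instead.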
Some points before implementation

• Make sure features are on a similar scale
• This makes the contour plots look more circular, so gradient descent reaches the global optimum faster
• Scaling can be done using the variance or standard deviation of each variable

13
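A minimal sketch of the scaling step described above (my own illustration; the helper name `standardize` and the toy matrix are assumptions):

```python
import numpy as np

def standardize(X):
    """Scale each feature to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Hypothetical features on very different scales (e.g. size vs. room count)
X = np.array([[2000.0, 3.0],
              [1500.0, 2.0],
              [1000.0, 4.0]])
X_scaled, mu, sigma = standardize(X)
```

The stored mu and sigma must be reused to scale any test inputs the same way.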
Correct implementation

14
Selecting learning rate

• If α is small, gradient descent can be slow
• If α is too large, gradient descent might overshoot the minimum

15
Polynomial fitting
• Adding features (e.g. polynomial terms) to the model can give a better fit
• Overfitting is a problem
• Cross-validation is one way to avoid overfitting

16
Gradient descent versus normal equation

Gradient Descent | Normal Equation: θ = (XᵀX)⁻¹ XᵀY
Multiple iterations and many steps to reach the global optimum | One step to get to the optimal value
Need to choose α | No need to choose α
Works fine for a large number of features | Slow if the number of features is large
Time complexity is O(n) | Time complexity is O(n³) for (XᵀX)⁻¹
Handles error measures with no closed-form solution, e.g. the absolute error or any non-differentiable term | Not applicable when XᵀX is not invertible; the pseudo-inverse can help to some extent

17
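The normal-equation column can be sketched in a couple of lines (my own illustration, not from the slides; data are the same hypothetical toy set). Using the pseudo-inverse, as the table suggests, also covers the case where XᵀX is not invertible:

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least squares: theta = (X^T X)^{-1} X^T y.

    np.linalg.pinv is the pseudo-inverse, so this still returns a
    solution when X^T X is singular.
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Hypothetical toy data: y = 1 + 2*x
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
theta = normal_equation(X, y)
```

One call recovers the optimum that gradient descent needs many iterations to reach, but the n³ cost of the inversion dominates when the number of features grows.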
Classification

• Binary classification: y ∈ {0, 1}
• Multi-class classification: y ∈ {0, 1, 2, 3, …}

[Figure: examples with labels y = 0 and y = 1 plotted against x; a threshold (e.g. 0.5) on the fitted value is defined to classify]

18
Linear regression for classification
• Applying linear regression to classification is often not useful

[Figure: a regression line fitted to 0/1-labelled data; a single distant example shifts the fitted line and the 0.5 threshold]

• h(x) can be a large positive or negative value, while y is only 0 or 1
• Logistic regression:
  – a classification method, not regression, despite the name

19
Change of hypothesis

• Logistic or sigmoid function:
  g(z) = 1 / (1 + e^(−z))
• Or, as the hypothesis: h(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))
• Derivative of the sigmoid: g′(z) = g(z) (1 − g(z))

20
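The sigmoid and its derivative g′(z) = g(z)(1 − g(z)) are two one-liners; this sketch (my own, not from the slides) just makes the slide's formulas executable:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    """Derivative g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)
```

Note that g(0) = 0.5, which is exactly the classification threshold used later, and the derivative peaks at 0.25 at z = 0.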
Logistic regression

• Assume:
  P(y = 1 | x; θ) = h(x),  P(y = 0 | x; θ) = 1 − h(x)
• It can be written as:
  p(y | x; θ) = h(x)^y (1 − h(x))^(1−y)
• We can see: g(z) ≥ 0.5 if z ≥ 0, and g(z) < 0.5 if z < 0

21
Cost Function and optimization
• The linear regression cost function was convex
• The same cost function for logistic regression is non-convex because of the nonlinear sigmoid function
• We define the logistic regression cost function as:
  Cost(h(x), y) = −log(h(x)) if y = 1, and −log(1 − h(x)) if y = 0

22
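The averaged logistic cost over a training set can be sketched as follows (my own illustration; the function name `logistic_cost` and the toy data are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """J(theta) = -(1/m) * sum( y*log(h) + (1-y)*log(1-h) )."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# Hypothetical 1-D data with an intercept column
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
cost_zero = logistic_cost(np.zeros(2), X, y)          # h = 0.5 everywhere
cost_fit = logistic_cost(np.array([0.0, 1.0]), X, y)  # a better slope
```

At θ = 0 every h is 0.5, so the cost is exactly log 2 ≈ 0.693; any θ that pushes h toward the correct labels drives the cost below that.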
Convex cost function for logistic regression

[Figure: Cost(h(x), y) plotted against h(x) ∈ [0, 1], one curve for y = 1 and one for y = 0]

• If h goes to zero and the cost also goes to zero, class 0 is selected
• If h goes to 1 and the cost goes to zero, class 1 is selected

23
Cost function for logistic regression

• It can be written as:
  Cost(h(x), y) = −y log(h(x)) − (1 − y) log(1 − h(x))
• For m training examples, the likelihood of the parameters is:
  L(θ) = Πi p(y(i) | x(i); θ) = Πi h(x(i))^y(i) (1 − h(x(i)))^(1−y(i))

24
• It will be easier to work with the log-likelihood; instead of minimizing the cost function, we will maximize the log-likelihood function:
  ℓ(θ) = log L(θ) = Σi [ y(i) log h(x(i)) + (1 − y(i)) log(1 − h(x(i))) ]
• Given g′(z) = g(z)(1 − g(z)), the partial derivatives are:
  ∂ℓ(θ)/∂θj = Σi (y(i) − h(x(i))) xj(i)

25
• Since we are maximizing rather than minimizing, gradient ascent is applied for parameter optimization
• The parameter updating rule:
  θj := θj + α (y(i) − h(x(i))) xj(i)
• This looks similar to the least-mean-squares update
• However, they are different, because h(x) is now a nonlinear (sigmoid) function of θᵀx
• But the updating rules of the parameters look the same

26
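The gradient-ascent rule above can be sketched as follows (my own illustration, not from the slides; the batch form of the update and the separable toy data are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, iters=1000):
    """Gradient ascent on the log-likelihood, summed over all examples.

    Update: theta := theta + alpha * X^T (y - g(X theta))
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += alpha * X.T @ (y - sigmoid(X @ theta))  # ascent, hence +=
    return theta

# Hypothetical 1-D data with an intercept column
x = np.array([-2.0, -1.0, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = fit_logistic(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(float)
```

The code is nearly identical to the LMS loop for linear regression; only the sigmoid wrapped around Xθ differs, exactly as the slide notes.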
Another algorithm for optimization

• Gradient descent takes many iterative steps to reach the optimum
• The parameter α must be set manually
• There are other algorithms that converge faster than gradient descent, with no need to pick α
• However, they are more complex than gradient descent
• We know Newton’s method for finding zeros of a function f:
  θ := θ − f(θ) / f′(θ)

27
Optimization using newton’s method
• In fact, Newton’s method approximates the function f by a linear function, the tangent of f (with slope f′) at the current guess of the parameter
• It iterates until the function f equals zero
• Newton’s method is a way of finding zeros. What about finding the maxima of a function?
• The maximum of a function occurs where its derivative is zero:
  ℓ′(θ) = 0
• Therefore, by replacing f with ℓ′, the same algorithm can be used in Newton’s method:
  θ := θ − ℓ′(θ) / ℓ″(θ)

28
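The root-finding iteration x := x − f(x)/f′(x) fits in a few lines; this sketch (my own, with f(x) = x² − 2 as a hypothetical example) finds √2:

```python
def newton_root(f, fprime, x0, tol=1e-12, max_iter=50):
    """Newton's method: repeat x := x - f(x) / f'(x) until the step is tiny."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)   # how far the tangent line says to move
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: the positive root of f(x) = x^2 - 2 is sqrt(2)
root = newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
```

Each iteration roughly doubles the number of correct digits, which is the quadratic convergence that makes the method attractive.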
Newton- Raphson

• The generalization of Newton’s method to the multi-dimensional setting is called Newton–Raphson:
  θ := θ − H⁻¹ ∇θ ℓ(θ)
• where H is the Hessian:
  Hjk = ∂²ℓ(θ) / ∂θj ∂θk
• Newton’s method usually converges faster than gradient descent when maximizing the logistic regression log-likelihood
• Each iteration is more expensive than for gradient descent because of calculating the inverse of the Hessian
• As long as the number of parameters is not very large (H is an n × n matrix), Newton’s method is preferred

29
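Applied to the logistic log-likelihood, the Newton–Raphson step uses the gradient Xᵀ(y − h) and the Hessian −XᵀSX with S = diag(h(1 − h)). A minimal sketch (my own illustration; the non-separable toy data are an assumption, chosen so the maximum is finite):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, iters=10):
    """Newton-Raphson for the logistic log-likelihood.

    grad = X^T (y - h),  H = -X^T S X,  S = diag(h * (1 - h))
    Update: theta := theta - H^{-1} grad
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)
        S = np.diag(h * (1.0 - h))
        H = -(X.T @ S @ X)
        theta -= np.linalg.solve(H, grad)  # solve instead of explicit inverse
    return theta

# Hypothetical overlapping (non-separable) 1-D data with intercept column
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
theta = newton_logistic(X, y)
```

A handful of iterations drives the gradient to essentially zero, whereas gradient ascent would need far more steps; each step, however, pays for forming and solving the n × n Hessian system.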
Advanced optimization algorithms

• It is recommended to use built-in functions or software packages for advanced optimization algorithms rather than coding the algorithm yourself
• For this course, it is recommended to use Matlab’s advanced optimization function fminunc
• Built-in functions apply other, faster methods of optimization, e.g. quasi-Newton methods instead of Newton’s method
• Newton’s method calculates H directly; calculating H numerically involves a large amount of computation. Quasi-Newton methods avoid this by using an approximation to H

30
Multi-class classification

[Figure: three classes of points (×, +, O); the one-vs-all strategy turns them into three binary problems, one class against the rest]

• One-vs-all strategy: work with multiple binary classifications
• We train one logistic regression classifier for each class i to predict the probability that y = i
• For each x, pick the class having the highest probability

31
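The one-vs-all strategy can be sketched by reusing a binary logistic trainer per class (my own illustration; all names and the three-class 1-D toy data are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, alpha=0.1, iters=2000):
    """One binary logistic classifier via gradient ascent."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += alpha * X.T @ (y - sigmoid(X @ theta))
    return theta

def one_vs_all(X, y, n_classes):
    """Train one classifier per class i: 'class i' vs. 'all the rest'."""
    return [fit_binary(X, (y == i).astype(float)) for i in range(n_classes)]

def predict(thetas, X):
    """Pick, for each row of X, the class whose classifier is most confident."""
    probs = np.column_stack([sigmoid(X @ t) for t in thetas])
    return probs.argmax(axis=1)

# Hypothetical 1-D points in three clusters, with intercept column
x = np.array([-3.0, -2.5, -2.0, 0.0, 0.2, 0.5, 2.0, 2.5, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
thetas = one_vs_all(X, y, 3)
pred = predict(thetas, X)
```

The middle class is not linearly separable from the rest in one dimension, so its classifier outputs a roughly constant probability; it still wins on the middle points because the two outer classifiers are confident those points are not theirs.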
One-versus-one strategy

[Figure: pairwise binary problems between the three classes (×, +, O)]

We train K(K − 1)/2 binary classifiers for K classes, one for every combination of two classes. For the test data, we use all the classifiers to classify the data and then count the number of times the test data was assigned to each class. The final class is the one with the maximum number of wins.

32
Decision boundary

• Example: if θ = [−3 1 1], then y = 1 if −3 + x1 + x2 ≥ 0, i.e. x1 + x2 ≥ 3

Note:
• The decision boundary is a property of the parameters, not of the data
• The parameters are a property of the data and are learned from the data
• The decision boundary can have a more complex shape if higher orders of polynomials are applied

33
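The slide's example θ = [−3 1 1] can be checked directly (a sketch of my own; only the θ values come from the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-3.0, 1.0, 1.0])   # [intercept, x1, x2] from the slide

def predict(x1, x2):
    """y = 1 exactly when -3 + x1 + x2 >= 0, i.e. x1 + x2 >= 3."""
    return int(sigmoid(theta @ np.array([1.0, x1, x2])) >= 0.5)
```

Since g(z) ≥ 0.5 exactly when z ≥ 0, thresholding the sigmoid at 0.5 is the same as thresholding the linear score θᵀx at 0, which is why the boundary x1 + x2 = 3 is a straight line.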
Overfitting

[Figure: three fits of the same data]
• Underfit, high bias
• Quadratic terms: good fit
• Higher orders: overfit, not able to generalize to unseen data

34
How to deal with overfitting
• Having higher orders of polynomials seems to give a good fit, but how do we deal with overfitting?
  – Reduce the number of features manually
  – Keep all the features, but apply regularization
  – The most common variants in machine learning are L1 and L2 regularization
  – Minimize E(X, Y) + α‖w‖, where w is the model’s weight vector, ‖·‖ is either the L1 norm or the squared L2 norm, and α is a free parameter that needs to be tuned empirically
  – Regularization using the L2 norm is called Tikhonov regularization (ridge regression); using the L1 norm it is called lasso regularization

35
Regularized linear regression

• The regularized cost function:
  J(θ) = (1/2m) [ Σi (h(x(i)) − y(i))² + λ Σj θj² ]   (sum over j ≥ 1)
• Note: do not include the intercept parameter θ0 in the regularization
• If you choose a very large value for the regularization parameter λ, all the parameters are penalized close to zero and only the intercept is left; the fit reduces to a flat line

36
Regularized regression

• Regularized gradient descent:
  θ0 := θ0 − α (1/m) Σi (h(x(i)) − y(i)) x0(i)
  θj := θj − α [ (1/m) Σi (h(x(i)) − y(i)) xj(i) + (λ/m) θj ]   (j ≥ 1)

37
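The regularized update, with the intercept excluded from the penalty, can be sketched as follows (my own illustration; `ridge_gd` and the toy data are assumptions):

```python
import numpy as np

def ridge_gd(X, y, alpha=0.1, lam=1.0, iters=2000):
    """Gradient descent for L2-regularized linear regression.

    The intercept theta_0 (first column of X) is NOT penalized:
    theta_j := theta_j - alpha * [ (1/m) sum_i (h - y) x_j + (lam/m) theta_j ]
    """
    m, n = X.shape
    theta = np.zeros(n)
    reg = np.full(n, lam)
    reg[0] = 0.0                      # do not regularize the intercept
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m + (reg / m) * theta
        theta -= alpha * grad
    return theta

# Hypothetical toy data: y = 1 + 2*x, with intercept column
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
theta_ridge = ridge_gd(X, y, lam=10.0)   # penalized slope
theta_ols = ridge_gd(X, y, lam=0.0)      # plain least squares
```

With λ = 0 this reduces to ordinary gradient descent and recovers the slope 2; a large λ shrinks the slope toward zero while the unpenalized intercept absorbs the offset.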
Regularized logistic regression
• Similar to linear regression, the cost function is:
  J(θ) = −(1/m) Σi [ y(i) log h(x(i)) + (1 − y(i)) log(1 − h(x(i))) ] + (λ/2m) Σj θj²   (sum over j ≥ 1)
• Gradient descent:
  θj := θj − α [ (1/m) Σi (h(x(i)) − y(i)) xj(i) + (λ/m) θj ]   (j ≥ 1)
• The difference is that now h(x) = 1 / (1 + e^(−θᵀx))

38
Summary

• Gradient descent is a useful optimization technique for both classification and linear regression
• For linear regression, the cost function is convex, meaning that gradient descent always converges to the global optimum
• For a non-convex cost function, gradient descent might get stuck in a local optimum
• Logistic regression is a widely applied supervised classification technique
• For logistic regression, gradient descent and Newton–Raphson optimization techniques were explained

39
