Lecture 8: Gradient Descent and Logistic Regression
17.11.2016
Course contents
Mathematical terms
Model learning
Hypothesis function
• For example, if the input has two variables (x ∈ R^2), we should decide how to represent the hypothesis h
• Let's say we decide to approximate y as a linear function of x:
  hθ(x) = θ0 + θ1 x1 + θ2 x2
How to learn the parameters
• Make h(x) close to y, i.e. minimize the following cost function:
  J(θ) = 1/2 Σ_{i=1..m} (hθ(x^(i)) − y^(i))^2
Gradient descent
• A natural algorithm that repeatedly takes a step in the direction of steepest decrease of J:
  θj := θj − α · ∂J(θ)/∂θj
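A minimal sketch of batch gradient descent for the linear hypothesis above, assuming a small NumPy design matrix X with a leading column of ones (for θ0) and a target vector y; the function names, toy data, and learning rate are illustrative, not part of the slides.

```python
import numpy as np

def cost(theta, X, y):
    """Least-squares cost J(theta) = 1/2 * sum_i (h(x_i) - y_i)^2."""
    residual = X @ theta - y
    return 0.5 * (residual @ residual)

def gradient_descent(X, y, alpha=0.05, n_steps=5000):
    """Batch gradient descent: theta_j := theta_j - alpha * dJ/dtheta_j."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = X.T @ (X @ theta - y)    # gradient of J over all m examples
        theta = theta - alpha * grad
    return theta

# Toy data: a column of ones for theta_0, then x1 and x2; y = 1 + 2*x1 + 3*x2
X = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [1., 1., 1.],
              [1., 2., 2.]])
y = np.array([4., 3., 6., 11.])

theta = gradient_descent(X, y)
print(theta)                # close to [1, 2, 3]
print(cost(theta, X, y))    # close to 0
```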
Global minima for linear regression
Convex cost function
Non-convex cost function
Batch and stochastic gradient descent
• Batch: each update uses all m training examples:
  θj := θj − α Σ_{i=1..m} (hθ(x^(i)) − y^(i)) xj^(i)
• Stochastic: each update uses a single training example i:
  θj := θj − α (hθ(x^(i)) − y^(i)) xj^(i)
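A minimal sketch contrasting the two update rules, under the same assumptions as the earlier snippet (NumPy arrays X with a leading column of ones, targets y); either step function can replace the loop body of the gradient_descent sketch above. Names and hyperparameters are illustrative.

```python
import numpy as np

def batch_step(theta, X, y, alpha):
    """One batch update: the gradient is summed over all m examples."""
    grad = X.T @ (X @ theta - y)
    return theta - alpha * grad

def stochastic_epoch(theta, X, y, alpha, rng):
    """One pass of stochastic updates: one example at a time, in random order."""
    for i in rng.permutation(len(y)):
        grad_i = (X[i] @ theta - y[i]) * X[i]
        theta = theta - alpha * grad_i
    return theta

# Example usage (reusing the toy X, y from the previous sketch):
# theta = np.zeros(3); rng = np.random.default_rng(0)
# for _ in range(200): theta = stochastic_epoch(theta, X, y, 0.05, rng)
```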
Some points before implementation
Correct implementation
Selecting learning rate
Polynomial fitting
• Adding features to the model can give a better fit
• Overfitting is a problem
• Cross-validation is one way to detect and avoid overfitting (see the sketch below)
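A minimal sketch of this idea, assuming NumPy only: expand a 1-D input into polynomial features and compare the training error with the error on a held-out validation split (a simplified stand-in for full cross-validation). The helper names, degrees, and split sizes are illustrative.

```python
import numpy as np

def poly_features(x, degree):
    """Map a 1-D input x to the feature vector [1, x, x^2, ..., x^degree]."""
    return np.vstack([x ** d for d in range(degree + 1)]).T

def mse(theta, X, y):
    return np.mean((X @ theta - y) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)

# Random train / validation split; a large gap between the two errors
# for high degrees is the signature of overfitting.
idx = rng.permutation(30)
train, val = idx[:20], idx[20:]

for degree in (1, 3, 9):
    X_train = poly_features(x[train], degree)
    theta = np.linalg.lstsq(X_train, y[train], rcond=None)[0]   # least-squares fit
    print(degree,
          mse(theta, X_train, y[train]),
          mse(theta, poly_features(x[val], degree), y[val]))
```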
Gradient descent versus normal equation
• Normal equation: θ = (X^T X)^(−1) X^T y
• Gradient descent: needs multiple iterations and many steps to reach the global optimum; need to choose the learning rate α
• Normal equation: one step to get to the optimal value; no need to choose α
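A minimal sketch of the normal-equation solution, assuming the same NumPy design matrix X (with a leading column of ones) and targets y as in the earlier sketches; in practice np.linalg.solve or np.linalg.lstsq is preferred over forming the inverse explicitly.

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least squares: theta = (X^T X)^(-1) X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.array([[1., 0., 1.], [1., 1., 0.], [1., 1., 1.], [1., 2., 2.]])
y = np.array([4., 3., 6., 11.])
print(normal_equation(X, y))   # ~[1, 2, 3], the same optimum gradient descent converges to
```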
Classification
• Binary classification: y ∈ {0, 1}
• Multi-class classification: y ∈ {0, 1, 2, 3, …}
• A threshold (e.g. 0.5) on the hypothesis output is defined to classify
Linear regression for classification
• Applying linear regression to a classification problem is often not useful: a single far-away example can tilt the fitted line, and predictions made by thresholding that line at 0.5 change accordingly (as the figure on the slide illustrates with 0/1-labelled points).
Change of hypothesis
• Use the sigmoid (logistic) function: hθ(x) = g(θ^T x), with g(z) = 1 / (1 + e^(−z))
• Or equivalently: g(z) = e^z / (1 + e^z)
• Derivative of the sigmoid: g'(z) = g(z) (1 − g(z))
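A minimal sketch of the sigmoid and its derivative in NumPy; the identity g'(z) = g(z)(1 − g(z)) is what keeps the logistic-regression gradients on the following slides so compact.

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + exp(-z)); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

print(sigmoid(0.0))        # 0.5
print(sigmoid_grad(0.0))   # 0.25, the maximum slope of the sigmoid
```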
Logistic regression
• Assume: P(y=1 | x; θ) = hθ(x) and P(y=0 | x; θ) = 1 − hθ(x)
• The hypothesis can be written as: hθ(x) = g(θ^T x) = 1 / (1 + e^(−θ^T x)); predict y = 1 when hθ(x) ≥ 0.5
Cost function and optimization
• The linear regression cost function was convex; plugging the sigmoid hypothesis into the same squared-error cost makes it non-convex, so logistic regression needs a different cost function.
Convex cost function for logistic regression
• Cost(hθ(x), y) = −log(hθ(x)) if y = 1, and −log(1 − hθ(x)) if y = 0 (the slide plots both curves against hθ(x) ∈ [0, 1])
• If h goes to 0 and the cost also goes to 0, class 0 is selected
• If h goes to 1 and the cost goes to 0, class 1 is selected
Cost function for logistic regression
• It can be written as: p(y | x; θ) = (hθ(x))^y · (1 − hθ(x))^(1−y)
• For m training examples, the likelihood of the parameters is:
  L(θ) = Π_{i=1..m} p(y^(i) | x^(i); θ) = Π_{i=1..m} (hθ(x^(i)))^(y^(i)) · (1 − hθ(x^(i)))^(1−y^(i))
• It is easier to work with the log likelihood: instead of minimizing the cost function we will maximize the log likelihood function:
  ℓ(θ) = log L(θ) = Σ_{i=1..m} [ y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i))) ]
• Given the sigmoid derivative g'(z) = g(z)(1 − g(z)), the gradient works out to:
  ∂ℓ(θ)/∂θj = Σ_{i=1..m} (y^(i) − hθ(x^(i))) xj^(i)
• Since we are maximizing rather than minimizing, gradient ascent is applied for parameter optimization:
  θj := θj + α · ∂ℓ(θ)/∂θj = θj + α Σ_{i=1..m} (y^(i) − hθ(x^(i))) xj^(i)
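A minimal sketch of batch gradient ascent on the log likelihood, assuming NumPy arrays X (with a leading column of ones) and binary labels y; the sigmoid helper, toy data, and hyperparameters are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """l(theta) = sum_i [ y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i)) ]."""
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient_ascent(X, y, alpha=0.1, n_steps=2000):
    """theta_j := theta_j + alpha * sum_i (y_i - h(x_i)) * x_ij."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        theta += alpha * X.T @ (y - sigmoid(X @ theta))
    return theta

# Toy data: label 1 roughly when x1 + x2 is large
X = np.array([[1., 0., 0.], [1., 1., 0.], [1., 0., 1.],
              [1., 2., 2.], [1., 3., 1.], [1., 2., 3.]])
y = np.array([0., 0., 0., 1., 1., 1.])
theta = gradient_ascent(X, y)
print(theta, log_likelihood(theta, X, y))
```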
Another algorithm for optimization
Optimization using Newton's method
• Newton's method approximates a function f by a linear function (its tangent, with slope f') at the current guess of the parameter, and it repeats
  θ := θ − f(θ) / f'(θ)
  until f(θ) equals zero.
• Newton's method is a way of finding zeros. What about finding the maxima of a function?
• A maximum of a function occurs where its derivative is zero:
  – ℓ'(θ) = 0
• Therefore, by replacing f with ℓ' in Newton's method, the same algorithm can be used:
  θ := θ − ℓ'(θ) / ℓ''(θ)
Newton-Raphson
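In the multi-parameter (Newton-Raphson) case, ℓ'' becomes the Hessian H of the log likelihood and the update is θ := θ − H^(−1) ∇θ ℓ(θ). A minimal sketch for logistic regression follows, reusing the sigmoid helper from earlier; the toy data and iteration count are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, n_steps=10):
    """Newton-Raphson on the log likelihood: theta := theta - H^(-1) * gradient."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)                     # gradient of l(theta)
        H = -(X.T * (h * (1.0 - h))) @ X         # Hessian of l(theta)
        theta = theta - np.linalg.solve(H, grad)
    return theta

# Tiny non-separable toy set (the point (1, 1) appears with both labels)
X = np.array([[1., 0., 0.], [1., 1., 0.], [1., 1., 1.],
              [1., 1., 1.], [1., 2., 1.], [1., 2., 2.]])
y = np.array([0., 0., 0., 1., 1., 1.])
print(newton_logistic(X, y))   # typically converges in a handful of iterations
```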
Advanced optimization algorithms
Multi-class classification
• One-vs-all strategy: work with multiple binary classifications
• We train one logistic regression classifier for each class i to predict the probability that y = i
• For each x, pick the class with the highest predicted probability (see the sketch below)
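A minimal sketch of one-vs-all built on the batch gradient-ascent trainer from earlier, assuming integer labels y ∈ {0, 1, 2, …}; the helper names and toy clusters are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, alpha=0.1, n_steps=2000):
    """Batch gradient ascent on the logistic log likelihood (as sketched earlier)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        theta += alpha * X.T @ (y - sigmoid(X @ theta))
    return theta

def one_vs_all_train(X, y, n_classes):
    """Train one classifier per class i on the binary labels (y == i)."""
    return [train_binary(X, (y == i).astype(float)) for i in range(n_classes)]

def one_vs_all_predict(thetas, x):
    """Pick the class whose classifier assigns x the highest probability."""
    return int(np.argmax([sigmoid(x @ theta) for theta in thetas]))

# Three well-separated toy clusters (bias term included as the first column)
X = np.array([[1., 0., 0.], [1., 0., 1.], [1., 3., 0.],
              [1., 4., 1.], [1., 0., 4.], [1., 1., 4.]])
y = np.array([0, 0, 1, 1, 2, 2])
thetas = one_vs_all_train(X, y, n_classes=3)
print([one_vs_all_predict(thetas, x) for x in X])   # expect [0, 0, 1, 1, 2, 2]
```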
One-versus-one strategy
We train one binary classifier for every pair of classes. For a test point, we run all the classifiers and count the number of times the point was assigned to each class. The final class is the one with the maximum number of wins (see the sketch below).
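A minimal sketch of one-vs-one voting: for K classes it trains K·(K−1)/2 pairwise classifiers and predicts by majority vote. It reuses the same illustrative gradient-ascent trainer and toy data as the one-vs-all sketch.

```python
import numpy as np
from itertools import combinations

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, alpha=0.1, n_steps=2000):
    """Batch gradient ascent on the logistic log likelihood."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        theta += alpha * X.T @ (y - sigmoid(X @ theta))
    return theta

def one_vs_one_train(X, y, n_classes):
    """One binary classifier per pair of classes (a, b), trained only on their examples."""
    models = {}
    for a, b in combinations(range(n_classes), 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = train_binary(X[mask], (y[mask] == b).astype(float))
    return models

def one_vs_one_predict(models, x, n_classes):
    """Every pairwise classifier votes; the class with the most wins is returned."""
    votes = np.zeros(n_classes)
    for (a, b), theta in models.items():
        votes[b if sigmoid(x @ theta) >= 0.5 else a] += 1
    return int(np.argmax(votes))

# Same toy clusters as in the one-vs-all sketch
X = np.array([[1., 0., 0.], [1., 0., 1.], [1., 3., 0.],
              [1., 4., 1.], [1., 0., 4.], [1., 1., 4.]])
y = np.array([0, 0, 1, 1, 2, 2])
models = one_vs_one_train(X, y, n_classes=3)
print([one_vs_one_predict(models, x, n_classes=3) for x in X])   # expect [0, 0, 1, 1, 2, 2]
```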
Decision boundary
• Example: if θ = [−3, 1, 1], then y = 1 if −3 + x1 + x2 ≥ 0, i.e. x1 + x2 ≥ 3
Note:
• The decision boundary is a property of the parameters, not of the data.
• The parameters are a property of the data and are learned from the data.
• The decision boundary can take a more complex shape if higher-order polynomial features are used.
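A tiny check of the example above, assuming the sigmoid hypothesis: with θ = [−3, 1, 1], points with x1 + x2 ≥ 3 get hθ(x) ≥ 0.5 and are labeled 1. The sample points are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-3., 1., 1.])   # the parameters from the example above
for x1, x2 in [(1.0, 1.0), (2.0, 2.0), (0.5, 2.5), (3.0, 1.0)]:
    h = sigmoid(theta @ np.array([1.0, x1, x2]))
    # x1 + x2 >= 3  <=>  theta^T x >= 0  <=>  h >= 0.5  =>  predict y = 1
    print((x1, x2), round(float(h), 3), "y=1" if h >= 0.5 else "y=0")
```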
Overfitting
How to deal with overfitting
• Higher-order polynomials seem to give a good fit, but how do we deal with the resulting overfitting?
  – Reduce the number of features manually
  – Keep all the features, but apply regularization
Regularized linear regression
• Add a penalty on the size of the parameters to the cost function:
  J(θ) = 1/2 [ Σ_{i=1..m} (hθ(x^(i)) − y^(i))^2 + λ Σ_{j=1..n} θj^2 ]
Regularized regression
• Gradient descent with the regularization term (θ0 is not regularized):
  θj := θj − α [ Σ_{i=1..m} (hθ(x^(i)) − y^(i)) xj^(i) + λ θj ]
Regularized logistic regression
• Similar to linear regression, the cost function gets a penalty term:
  J(θ) = − Σ_{i=1..m} [ y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i))) ] + (λ/2) Σ_{j=1..n} θj^2
• Gradient descent:
  θj := θj − α [ Σ_{i=1..m} (hθ(x^(i)) − y^(i)) xj^(i) + λ θj ]
• The difference is the hypothesis: here hθ(x) = g(θ^T x) is the sigmoid, so the update looks the same as for regularized linear regression but is not the same algorithm.
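A minimal sketch of one regularized gradient-descent step for logistic regression, assuming the conventions above (no penalty on θ0); the function name, default λ, and learning rate are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_step(theta, X, y, alpha=0.01, lam=1.0):
    """One gradient-descent step on the regularized logistic cost:
    theta_j := theta_j - alpha * [ sum_i (h(x_i) - y_i) * x_ij + lam * theta_j ],
    leaving the intercept theta_0 unregularized."""
    grad = X.T @ (sigmoid(X @ theta) - y)
    penalty = lam * theta
    penalty[0] = 0.0                     # do not penalize the intercept term
    return theta - alpha * (grad + penalty)
```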
Summary