CH3 Logistic Regression 2020

The document discusses supervised learning with a focus on logistic regression for classification problems, particularly binary classification. It covers key concepts such as hypothesis representation, decision boundaries, cost functions, and gradient descent, while highlighting the advantages of logistic regression over linear regression. Additionally, it explains the sigmoid function, interpretation of hypothesis outputs, and the formulation of a convex cost function necessary for effective model training.


Supervised learning

Logistic regression

Classification problem
Hypothesis representation
Decision boundary
Cost function for logistic regression
Gradient descent for logistic regression
Multiclass classification problems

Binary classification
• Classification problems
  o Email → spam / not spam?
  o Online transactions → fraudulent?
  o Tumor → malignant / benign
• The variable in these problems is Y
  o Y is either 0 or 1
    - 0 = negative class (absence of something)
    - 1 = positive class (presence of something)
• Start with binary class problems
• Later look at multiclass classification problems, although these are just an extension of binary classification

How do we develop a classification algorithm?

o Example: tumour size vs malignancy (0 or 1)
o We could use linear regression, H_w(x) = w^T x
  - Then threshold the classifier output (i.e. anything over some value is yes, else no)
  - In our example, linear regression with thresholding seems to work


Issues with linear regression in classification problems


• We can see that this does a reasonable job of stratifying the data points into one of two classes
  o But what if we had a single Yes example with a very large tumour?
  o Fitting the line to that outlier would shift it and lead to classifying existing Yes examples as No
• Another issue with linear regression
  o We know Y is 0 or 1
  o The hypothesis can give values larger than 1 or less than 0
• Logistic regression, in contrast, generates a value that is always between 0 and 1
  o Logistic regression is a classification algorithm, despite its name: don't be confused

What function is used to represent our hypothesis in classification?


• We want our classifier to output values between 0 and 1
  o When using linear regression we had H_w(x) = w^T x
  o For classification, the hypothesis representation becomes H_w(x) = g(w^T x), where

        g(z) = 1 / (1 + e^{-z}),   z ∈ ℝ

  o This is the sigmoid function, or the logistic function

• If we combine these equations, we can write out the hypothesis as

        H_w(x) = 1 / (1 + e^{-w^T x})
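As an illustration, this hypothesis can be written in a few lines of Python/NumPy (a minimal sketch; the weight and feature values below are made up):

import numpy as np

def sigmoid(z):
    # logistic function g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(w, x):
    # H_w(x) = g(w^T x); w and x are 1-D arrays of the same length
    return sigmoid(np.dot(w, x))

w = np.array([-3.0, 1.0])       # illustrative weights
x = np.array([1.0, 4.0])        # x[0] = 1 is the bias term
print(hypothesis(w, x))         # always strictly between 0 and 1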


What does the sigmoid function look like? (2/2)


• Crosses 0.5 at z = 0, then flattens out
• Horizontal asymptotes at 0 and 1
  (figure: plot of the sigmoid g(z) against z)


Useful property of the derivative of the sigmoid function

    g'(z) = d/dz [ 1 / (1 + e^{-z}) ]
          = e^{-z} / (1 + e^{-z})^2
          = (1 / (1 + e^{-z})) * (1 - 1 / (1 + e^{-z}))
          = g(z) (1 - g(z))
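This identity is easy to sanity-check numerically (a small sketch; the test point z = 0.7 is arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, eps = 0.7, 1e-6
analytic = sigmoid(z) * (1.0 - sigmoid(z))                      # g(z)(1 - g(z))
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)   # finite-difference g'(z)
print(analytic, numeric)        # the two values should agree to many decimal places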


Interpreting hypothesis output (1/2)


• When H_w(x) outputs a number, we treat that value as the estimated probability that y = 1 on input x
  o Example
    - If x is a feature vector with x0 = 1 (as always) and x1 = tumourSize
    - H_w(x) = 0.7 tells a patient they have a 70% chance of the tumor being malignant
  o We can write this using the following notation: H_w(x) = P(y=1|x ; w)
  o P(y=1|x ; w): the probability that y = 1, given x, parameterized by w
• Since this is a binary classification task we know y = 0 or 1
  o So the following must be true:
    P(y=1|x ; w) + P(y=0|x ; w) = 1, i.e. P(y=0|x ; w) = 1 - P(y=1|x ; w)
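The probabilistic reading of the output can be sketched as follows (the weights and the tumour size are invented for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([-6.0, 0.1])               # hypothetical parameters
x = np.array([1.0, 65.0])               # x0 = 1, x1 = tumourSize

p_malignant = sigmoid(np.dot(w, x))     # P(y=1 | x; w)
p_benign = 1.0 - p_malignant            # P(y=0 | x; w) = 1 - P(y=1 | x; w)
print(p_malignant, p_benign)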

Interpreting hypothesis output (2/2)


o We can predict y = 1 when H_w(x) >= 0.5 (else we predict y = 0)
o When exactly is H_w(x) greater than or equal to 0.5?
  - Look at the sigmoid function: g(z) >= 0.5 when z >= 0
  - So if z is positive, g(z) is greater than 0.5
o Here z = w^T x, so when w^T x >= 0 we have H_w(x) >= 0.5 (see the sketch below)
  (figure: plot of the sigmoid g(z) against z)
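In code, thresholding the probability at 0.5 is therefore the same as testing the sign of w^T x (a sketch with made-up weights and inputs):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, x):
    # predict y = 1 when H_w(x) >= 0.5, i.e. exactly when w^T x >= 0
    return 1 if np.dot(w, x) >= 0 else 0

w = np.array([0.5, -1.0])                       # illustrative weights
for x in (np.array([1.0, 0.2]), np.array([1.0, 0.9])):
    print(x, predict(w, x), sigmoid(np.dot(w, x)))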

Decision boundary
• So what we've shown is that the hypothesis predicts y = 1 when w^T x >= 0
  o The corollary is that when w^T x < 0 the hypothesis predicts y = 0
  o Let's use this to better understand how the hypothesis makes its predictions
• Consider H_w(x) = g(w0 + w1*x1 + w2*x2)


Decision boundary
• Example: w0 = -3, w1 = 1 and w2 = 1
• Our parameter vector is a column vector with the above values, so w^T is the row vector [-3, 1, 1]
• The z here becomes w^T x
  o We predict "y = 1" if
    -3*x0 + 1*x1 + 1*x2 >= 0, i.e. -3 + x1 + x2 >= 0
  o We can also rewrite this as: if (x1 + x2 >= 3) then we predict y = 1
  o If we plot x1 + x2 = 3 we graphically obtain our decision boundary (checked in the sketch below)
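A quick check of this boundary (the two test points are arbitrary):

import numpy as np

w = np.array([-3.0, 1.0, 1.0])          # w0, w1, w2 from the example above

for x1, x2 in [(1.0, 1.0),              # x1 + x2 = 2 < 3  -> predict 0
               (2.0, 2.0)]:             # x1 + x2 = 4 >= 3 -> predict 1
    z = w[0] + w[1] * x1 + w[2] * x2
    print((x1, x2), int(z >= 0))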

Non-linear decision boundaries


• Get logistic regression to fit a complex non-linear data set
  o As in polynomial regression, add higher-order terms
  o So say we have
    H_w(x) = g(w0 + w1*x1 + w2*x2 + w3*x1^2 + w4*x2^2)
  o We take the transpose of the w vector times the feature vector
  o Say w^T was [-1, 0, 0, 1, 1]; then we predict "y = 1" if
    -1 + x1^2 + x2^2 >= 0, i.e. x1^2 + x2^2 >= 1
  o If we plot x1^2 + x2^2 = 1,
    this gives us a circle of radius 1 centred at the origin (see the sketch below)
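The circular boundary can be checked the same way (test points chosen for illustration):

import numpy as np

w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])        # w0..w4 from the example above

def predict(x1, x2):
    features = np.array([1.0, x1, x2, x1**2, x2**2])
    return int(np.dot(w, features) >= 0)        # y = 1 when x1^2 + x2^2 >= 1

print(predict(0.5, 0.5))    # inside the unit circle  -> 0
print(predict(1.5, 0.0))    # outside the unit circle -> 1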

Non-linear decision boundaries


• This means we can build more complex decision boundaries by fitting complex parameters to this (relatively) simple hypothesis
• More complex decision boundaries?
  o By using higher-order polynomial terms, we can get even more complex decision boundaries


Cost function of the linear regression in ‘Logistic Regression’ is non-convex


    C(w) = (1 / 2m) * Σ_{i=1}^{m} ( y^(i) - H_w(x^(i)) )^2

• If we try to use the cost function of linear regression in logistic regression, it is of no use: because H_w(x) is now non-linear, the cost ends up being a non-convex function with many local minima, in which it is very difficult to minimize the cost value and find the global minimum.

  (figure: a non-convex cost function)

A convex logistic regression cost function


• We need a different, convex cost function, which means we can apply gradient descent

• We want a classifier that produces a very high H_w(x) when y = 1 and, conversely, a very low H_w(x) when y = 0. We hope that H_w(x) is very close to y for each sample

• In other words,
  (1) If y = 1 we want to maximize H_w(x)
  (2) If y = 0 we want to maximize 1 - H_w(x)
• If we combine (1) and (2), we want to maximize:

      H_w(x)^y * (1 - H_w(x))^(1 - y)

• Maximizing the above is equivalent to maximizing its logarithm:

      log[ H_w(x)^y * (1 - H_w(x))^(1 - y) ] = y log( H_w(x) ) + (1 - y) log( 1 - H_w(x) )


A convex logistic regression cost function


• Or, equivalently, we want to minimize:

      -[ y log( H_w(x) ) + (1 - y) log( 1 - H_w(x) ) ]

• Generally, in machine learning we like to minimize the loss. That is why we changed the sign. This is by convention.

• The above formula defines a cost (or loss) for only one sample. We also need a cost function over multiple samples, which we will call C(w):

      C(w) = -(1/m) * Σ_{i=1}^{m} [ y^(i) log( H_w(x^(i)) ) + (1 - y^(i)) log( 1 - H_w(x^(i)) ) ]


Gradient of C(w) (1/3)

    C(w) = -(1/m) * Σ_{i=1}^{m} [ y log( H_w(x) ) + (1 - y) log( 1 - H_w(x) ) ]

    ∂C(w)/∂w_j = -(1/m) * Σ_{i=1}^{m} [ y * ∂log( H_w(x) )/∂w_j + (1 - y) * ∂log( 1 - H_w(x) )/∂w_j ]

               = -(1/m) * Σ_{i=1}^{m} [ y * (∂H_w(x)/∂w_j) / H_w(x) + (1 - y) * (∂(1 - H_w(x))/∂w_j) / (1 - H_w(x)) ]


Gradient of C(w) (2/3)

Write H_w(x) as a composition (f ∘ g)(w), with

    f(u) = 1 / (1 + e^{-u})   and   g(w) = w_0 x_0 + ... + w_j x_j + ... + w_n x_n = Σ_{k=0}^{n} w_k x_k

so that H_w(x) = f(g(w)) = 1 / (1 + e^{-Σ_k w_k x_k}).

By the chain rule, (f ∘ g)' = (f' ∘ g) g', with f' = f (1 - f) and ∂g/∂w_j = x_j. Therefore

    ∂H_w(x)/∂w_j = f(g) (1 - f(g)) x_j = H_w(x) (1 - H_w(x)) x_j

    ∂(1 - H_w(x))/∂w_j = -∂H_w(x)/∂w_j = -H_w(x) (1 - H_w(x)) x_j

Gradient of C(w) (3/3)

    ∂C(w)/∂w_j = -(1/m) * Σ_{i=1}^{m} [ y * H_w(x)(1 - H_w(x)) x_j / H_w(x)  -  (1 - y) * H_w(x)(1 - H_w(x)) x_j / (1 - H_w(x)) ]

               = -(1/m) * Σ_{i=1}^{m} [ y (1 - H_w(x)) x_j - (1 - y) H_w(x) x_j ]

               = -(1/m) * Σ_{i=1}^{m} x_j ( y - y H_w(x) - H_w(x) + y H_w(x) )

               = -(1/m) * Σ_{i=1}^{m} ( y - H_w(x) ) x_j


Gradient descent algorithm for logistic regression


Start from any initial value of the w_j, then repeat until convergence:

{
    w_j := w_j + (α/m) * Σ_{i=1}^{m} ( y^(i) - H_w(x^(i)) ) x_j^(i)     (update all w_j simultaneously)
}

• This equation is the same as the linear regression rule

• The only difference is that our definition of the hypothesis H_w(x) has changed
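Put together, batch gradient descent for logistic regression can be sketched as follows (the data set and hyper-parameters below are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    # w_j := w_j + (alpha/m) * sum_i (y_i - H_w(x_i)) * x_ij, all j updated at once
    m, n = X.shape
    w = np.zeros(n)                       # any initial value works
    for _ in range(n_iters):
        h = sigmoid(X @ w)                # the only change from linear regression
        w = w + (alpha / m) * (X.T @ (y - h))
    return w

X = np.array([[1.0, 0.2], [1.0, 0.9], [1.0, 2.1], [1.0, 3.0]])   # illustrative data
y = np.array([0.0, 0.0, 1.0, 1.0])
print(gradient_descent(X, y))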

Reminder of the update rule:

    w_j := w_j + (α/m) * Σ_{i=1}^{m} ( y^(i) - H_w(x^(i)) ) x_j^(i)

1st iteration:

Training set:

    x0     x1     x2     Y
    1     -0.1    1.4    0
    1     -0.5   -0.1    0
    1      1.3    0.9    1
    1     -0.6    0.4    1

Initial values of the w_i: w0 = 0.1, w1 = 0.2, w2 = 0.3
Learning rate: alpha = 0.01
g(z) = 1 / (1 + exp(-z))

• w0 = 0.1 + 0.01/4 * ( (0 - g(w0*x0(1) + w1*x1(1) + w2*x2(1))) * x0(1)
                      + (0 - g(w0*x0(2) + w1*x1(2) + w2*x2(2))) * x0(2)
                      + (1 - g(w0*x0(3) + w1*x1(3) + w2*x2(3))) * x0(3)
                      + (1 - g(w0*x0(4) + w1*x1(4) + w2*x2(4))) * x0(4) )
     = 0.1 + 0.01/4 * ( (0 - g(0.1 - 0.2*0.1 + 0.3*1.4)) + (0 - g(0.1 - 0.2*0.5 - 0.3*0.1))
                      + (1 - g(0.1 + 0.2*1.3 + 0.3*0.9)) + (1 - g(0.1 - 0.2*0.6 + 0.3*0.4)) )
     = 0.1 + 0.01/4 * ( (0 - g(0.5)) + (0 - g(-0.03)) + (1 - g(0.63)) + (1 - g(0.1)) )
     = 0.1 + 0.01/4 * (2.78442238235 - 0.92460027612 + 2.40347419836 - 0.37330225703)
     = 0.1 + 0.0025 * 3.88999404756
     = 0.10972498511

• w1 = 0.2 + 0.01/4 * ( (0 - g(w0*x0(1) + w1*x1(1) + w2*x2(1))) * x1(1)
                      + (0 - g(w0*x0(2) + w1*x1(2) + w2*x2(2))) * x1(2)
                      + (1 - g(w0*x0(3) + w1*x1(3) + w2*x2(3))) * x1(3)
                      + (1 - g(w0*x0(4) + w1*x1(4) + w2*x2(4))) * x1(4) )
     = 0.2 + 0.01/4 * ( (0 - g(0.1 - 0.2*0.1 + 0.3*1.4))*(-0.1) + (0 - g(0.1 - 0.2*0.5 - 0.3*0.1))*(-0.5)
                      + (1 - g(0.1 + 0.2*1.3 + 0.3*0.9))*(1.3) + (1 - g(0.1 - 0.2*0.6 + 0.3*0.4))*(-0.6) )
     = 0.2 + 0.01/4 * ( (0 - g(0.5))*(-0.1) + (0 - g(-0.03))*(-0.5) + (1 - g(0.63))*(1.3) + (1 - g(0.1))*(-0.6) )
     = 0.2 + 0.01/4 * (2.78442238235*(-0.1) - 0.92460027612*(0.5) + 2.40347419836*(1.3) - 0.37330225703*(-0.6))
     = 0.2 + 0.01/4 * (-0.27844223823 - 0.46230013806 - 3.12451645787 + 0.22398135421)
     = 0.2 + 0.0025 * (-3.64127747995)
     = 0.1908968063

(Same training set, initial values, learning rate and update rule as on the previous slide.)

• w2 = 0.3 + 0.01/4 * ( (0 - g(w0*x0(1) + w1*x1(1) + w2*x2(1))) * x2(1)
                      + (0 - g(w0*x0(2) + w1*x1(2) + w2*x2(2))) * x2(2)
                      + (1 - g(w0*x0(3) + w1*x1(3) + w2*x2(3))) * x2(3)
                      + (1 - g(w0*x0(4) + w1*x1(4) + w2*x2(4))) * x2(4) )
     = 0.2 + 0.01/4 * ( (0 - g(0.1 - 0.2*0.1 + 0.3*1.4))*(1.4) + (0 - g(0.1 - 0.2*0.5 - 0.3*0.1))*(-0.1)
                      + (1 - g(0.1 + 0.2*1.3 + 0.3*0.9))*(0.9) + (1 - g(0.1 - 0.2*0.6 + 0.3*0.4))*(0.4) )
     = 0.2 + 0.01/4 * ( (0 - g(0.5))*(1.4) + (0 - g(-0.03))*(-0.1) + (1 - g(0.63))*(0.9) + (1 - g(0.1))*(0.4) )
     = 0.2 + 0.01/4 * (2.78442238235*(1.4) - 0.92460027612*(-0.1) + 2.40347419836*(0.9) - 0.37330225703*(0.4))
     = 0.2 + 0.01/4 * (3.89819133529 + 0.09246002761 + 2.16312677852 - 0.14932090281)
     = 0.2 + 0.0025 * (6.00445723861)
     = 0.21501114309

2nd iteration (using the values of w obtained after the 1st iteration):

• w0 = 0.10972498511 + 0.01/4 * ( (0 - g(w0*x0(1) + w1*x1(1) + w2*x2(1))) * x0(1)
                                + (0 - g(w0*x0(2) + w1*x1(2) + w2*x2(2))) * x0(2)
                                + (1 - g(w0*x0(3) + w1*x1(3) + w2*x2(3))) * x0(3)
                                + (1 - g(w0*x0(4) + w1*x1(4) + w2*x2(4))) * x0(4) )

• w1 = 0.1908968063 + 0.01/4 * ( (0 - g(w0*x0(1) + w1*x1(1) + w2*x2(1))) * x1(1)
                               + (0 - g(w0*x0(2) + w1*x1(2) + w2*x2(2))) * x1(2)
                               + (1 - g(w0*x0(3) + w1*x1(3) + w2*x2(3))) * x1(3)
                               + (1 - g(w0*x0(4) + w1*x1(4) + w2*x2(4))) * x1(4) )

• w2 = 0.21501114309 + 0.01/4 * ( (0 - g(w0*x0(1) + w1*x1(1) + w2*x2(1))) * x2(1)
                                + (0 - g(w0*x0(2) + w1*x1(2) + w2*x2(2))) * x2(2)
                                + (1 - g(w0*x0(3) + w1*x1(3) + w2*x2(3))) * x2(3)
                                + (1 - g(w0*x0(4) + w1*x1(4) + w2*x2(4))) * x2(4) )
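The hand computation above can also be replayed in code. The following sketch applies the same batch update rule to the same training set, initial weights and learning rate, printing w after each pass so the values obtained by hand can be compared against it:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0, -0.1,  1.4],        # training set (columns x0, x1, x2)
              [1.0, -0.5, -0.1],
              [1.0,  1.3,  0.9],
              [1.0, -0.6,  0.4]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = np.array([0.1, 0.2, 0.3])           # initial w0, w1, w2
alpha = 0.01
m = X.shape[0]

for it in range(2):                     # the two iterations worked through above
    h = sigmoid(X @ w)
    w = w + (alpha / m) * (X.T @ (y - h))
    print("iteration", it + 1, "w =", w)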


Gradient descent algorithm for logistic regression simulation


One vs. all technique 1/3


• Getting logistic regression to work for multiclass classification using one vs. all
• Multiclass means more than yes or no (1 or 0)
  o Classification where each example is assigned to one of several classes


One vs. all technique 2/3


• Given a dataset with three classes, how do we get a learning algorithm to work?
  o Use one vs. all classification to make binary classification work for multiclass classification

• One vs. all classification
  o Split the training set into three separate binary classification problems
    - i.e. create new "fake" training sets:

    - Triangles (1) vs crosses and squares (0): h_w1(x) = P(y=1 | x; w1)
    - Crosses (1) vs triangles and squares (0): h_w2(x) = P(y=1 | x; w2)
    - Squares (1) vs triangles and crosses (0): h_w3(x) = P(y=1 | x; w3)

One vs. all technique 3/3


• Overall
  o Train a logistic regression classifier h_w^(i)(x) for each class i to predict the probability that y = i
  o On a new input x, to make a prediction, pick the class i that maximizes h_w^(i)(x)
    (a sketch of this procedure follows below)
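A compact sketch of one vs. all (the three-class data set and training settings below are made up for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, alpha=0.1, n_iters=2000):
    # binary logistic regression trained with batch gradient descent
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iters):
        w += (alpha / m) * (X.T @ (y - sigmoid(X @ w)))
    return w

def one_vs_all(X, y, n_classes):
    # one classifier per class i: class i relabelled 1, all other classes 0
    return np.array([train_logistic(X, (y == i).astype(float)) for i in range(n_classes)])

def predict(W, x):
    # pick the class whose classifier reports the highest probability
    return int(np.argmax(sigmoid(W @ x)))

X = np.array([[1.0, 0.1], [1.0, 0.3], [1.0, 1.0],
              [1.0, 1.2], [1.0, 2.0], [1.0, 2.2]])      # x0 = 1 plus one feature
y = np.array([0, 0, 1, 1, 2, 2])
W = one_vs_all(X, y, 3)
print(predict(W, np.array([1.0, 2.1])))                 # the class-2 classifier should win here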
