
CPE/EE 695: Applied Machine Learning

Lecture 3-1: Logistic Regression

Dr. Shucheng Yu, Associate Professor


Department of Electrical and Computer Engineering
Stevens Institute of Technology
Logistic Regression
Usually used for binary classification:

Pr(Y|X), where Y is a binary variable. (why probability?)

2
Logistic Regression
Usually used for binary classification:

Pr(Y|X), where Y is a binary variable. (why probability?)

E.g., Pr(Tomorrow snow | today windy)

3
Logistic Regression
Usually used for binary classification:

Pr(Y|X), where Y is a binary variable. (why probability?)

E.g., Pr(Tomorrow snow | today windy)

Let Pr(Y = 1 | X = x) = p(x; θ)

4
Logistic Regression
Usually used for binary classification:

Pr(Y|X), where Y is a binary variable. (why probability?)

E.g., Pr(Tomorrow snow | today windy)

Let Pr(Y = 1 | X = x) = p(x; θ)

Maximize likelihood:
$\prod_{i=1}^{n} p^{y_i} (1 - p)^{1 - y_i}$

5
Logistic Regression
Usually used for binary classification:

Pr(Y|X), where Y is a binary variable. (why probability?)

E.g., Pr(Tomorrow snow | today windy)

Let Pr(Y = 1 | X = x) = p(x; θ)


Assumption: p is modeled with parameter θ; otherwise, the optimization problem doesn't work.

Maximize likelihood:
$\prod_{i=1}^{n} p^{y_i} (1 - p)^{1 - y_i}$

6
Logistic Regression
Let Pr(Y = 1 | X = x) = p(x; θ)

Maximize likelihood:
$\prod_{i=1}^{n} p^{y_i} (1 - p)^{1 - y_i}$

7
Logistic Regression
Let Pr(Y = 1 | X = x) = p(x; θ)

Maximize likelihood:
$\prod_{i=1}^{n} p^{y_i} (1 - p)^{1 - y_i}$

Task: to estimate θ by maximizing the likelihood.
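As an aside (not from the slides), here is a minimal NumPy sketch of evaluating this likelihood in log form for a few candidate values of p, treating p as a single constant rather than p(x; θ) purely for illustration; the data array y is an assumed example.

```python
import numpy as np

def bernoulli_log_likelihood(p, y):
    # log of  prod_i  p^{y_i} * (1 - p)^{1 - y_i}
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1, 0])             # observed binary outcomes
for p in (0.3, 0.6, 0.9):
    print(p, bernoulli_log_likelihood(p, y))
# the log-likelihood is largest near p = mean(y) = 0.6
```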

8
Logistic Regression
Let Pr(Y = 1 | X = x) = p(x; θ)

Maximize likelihood:
$\prod_{i=1}^{n} p^{y_i} (1 - p)^{1 - y_i}$

Task: to estimate θ by maximizing the likelihood.

How can we use linear regression to solve this?

9
Logistic Regression
Let Pr(Y = 1 | X = x) = p(x; θ)
Task: to estimate θ by maximizing the likelihood.

How can we use linear regression to solve this?

Attempt 1: assume p(x; θ) to be a linear function of x

Attempt 2: assume log p(x; θ) to be a linear function of x

Attempt 3: assume log(p / (1 − p)) to be a linear function of x (good)

Remember: 0 <= p <= 1. A linear function of x is unbounded, but the log-odds log(p / (1 − p)) ranges over all real values, so it can be matched to a linear function without violating this constraint.

10
Logistic Regression
Logistic regression model:

$\log \frac{p}{1 - p} = \theta_0 + x^\top \theta$

which gives

$p(x) = \frac{1}{1 + e^{-(\theta_0 + x^\top \theta)}}$
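As an illustration (not part of the slides), a minimal NumPy sketch of this model; the feature vector x and the parameter values theta0 and theta below are assumed examples.

```python
import numpy as np

def sigmoid(z):
    # logistic (sigmoid) function: maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, theta0, theta):
    # p(x) = 1 / (1 + exp(-(theta0 + x . theta)))
    return sigmoid(theta0 + x @ theta)

# illustrative values: 2 features
x = np.array([1.5, -0.3])
theta0 = 0.2
theta = np.array([0.8, -1.1])
print(predict_proba(x, theta0, theta))  # a probability in (0, 1)
```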

11
Logistic Regression

The logistic regression model's estimated probability:

$\hat{p} = h_\theta(x) = \sigma(\theta^\top x)$

where σ(·) is the logistic function (or sigmoid function).

Prediction:

$\hat{y} = \begin{cases} 0, & \text{if } \hat{p} < 0.5 \\ 1, & \text{if } \hat{p} \ge 0.5 \end{cases}$

Training the logistic regression model $\hat{p} = \sigma(\theta^\top x)$ is to learn the best value of the
parameter θ that makes the model fit the training data.
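A hedged sketch of the 0.5-threshold decision rule; the parameter vector and feature vector are assumptions for illustration (the first feature is a constant 1 so the first parameter acts as the bias θ₀).

```python
import numpy as np

def predict_proba(x, theta):
    # estimated probability p_hat = sigma(theta . x)
    return 1.0 / (1.0 + np.exp(-(x @ theta)))

def predict(x, theta, threshold=0.5):
    # decision rule: y_hat = 1 if p_hat >= 0.5, else 0
    return int(predict_proba(x, theta) >= threshold)

theta = np.array([0.2, 0.8, -1.1])   # illustrative parameters (first entry is the bias)
x = np.array([1.0, 1.5, -0.3])       # constant-1 feature plus two inputs
print(predict_proba(x, theta), predict(x, theta))
```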

12
Logistic Regression
To train a logistic regression model, we first need to define a performance
measure. A commonly used measure is the so-called log loss function:

$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{p}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{p}^{(i)}) \right]$

It is easier to explain the log loss function with the single-training-example case, in
which we want to maximize the posterior probability

$P(y|x) = \hat{p}^{\,y} (1 - \hat{p})^{1-y} = \begin{cases} \hat{p}, & \text{when } y = 1 \\ 1 - \hat{p}, & \text{when } y = 0 \end{cases}$

Taking the log of both sides, we have $\log P(y|x) = y \log \hat{p} + (1 - y) \log(1 - \hat{p})$.

Averaging this quantity over the m training examples, and negating it so that maximizing the likelihood becomes minimizing a loss, we obtain $J(\theta)$.
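A minimal NumPy sketch of the log loss; the label array y and predicted probabilities p_hat below are assumed example values, and the eps clipping is a common numerical safeguard, not part of the slides.

```python
import numpy as np

def log_loss(y, p_hat, eps=1e-12):
    # J(theta) = -(1/m) * sum[ y*log(p_hat) + (1-y)*log(1-p_hat) ]
    p_hat = np.clip(p_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

y = np.array([1, 0, 1, 1])
p_hat = np.array([0.9, 0.2, 0.7, 0.6])
print(log_loss(y, p_hat))
```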

13
Logistic Regression
Learning the logistic regression model is to find:

$\hat{\theta} = \arg\min_{\theta} J(\theta)$

where $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{p}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{p}^{(i)}) \right]$,

$\hat{p} = \sigma(\theta^\top x)$

There is no Normal Equation (i.e., closed-form solution) for θ.

But the cost function J(θ) is convex and differentiable, so Gradient
Descent is guaranteed to find the global minimum.

14
Training Logistic Regression Model
The gradient of the log loss function J(θ) is:

$\nabla_{\theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \sigma(\theta^\top x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

At each round of GD, θ is updated as follows (similar to linear regression;
different values of m for different modes of GD):

$\theta = \theta - \eta \nabla_{\theta} J(\theta)$
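A hedged batch-gradient-descent training loop in NumPy following the update rule above; the toy data, learning rate eta, and iteration count are illustrative assumptions, not values from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, eta=0.1, n_iters=1000):
    # X: (m, d) feature matrix, y: (m,) array of 0/1 labels
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        p_hat = sigmoid(X @ theta)        # predicted probabilities
        grad = (X.T @ (p_hat - y)) / m    # (1/m) * sum (p_hat - y) * x
        theta -= eta * grad               # theta = theta - eta * grad(J)
    return theta

# toy example (bias handled by a constant-1 feature column)
X = np.array([[1, 0.5], [1, 1.5], [1, -1.0], [1, 2.0]])
y = np.array([0, 1, 0, 1])
print(train_logreg(X, y))
```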

15
Training Logistic Regression Model
Overfitting may also happen in logistic regression.

Similarly, to combat overfitting we can introduce a regularization term into
the cost function J(θ):

$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{p}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{p}^{(i)}) \right] + \alpha R(\theta)$

where α is a hyperparameter and R(θ) can be the ℓ2-norm of θ, i.e.,

$R(\theta) = \lVert \theta \rVert_2 = \left( \sum_i \theta_i^2 \right)^{1/2}$

Note: R(θ) is the ℓ2-norm in Ridge regression, and the ℓ1-norm in Lasso regression.
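A hedged sketch of adding an ℓ2 penalty to the log loss and its gradient; it uses the squared ℓ2-norm (a common implementation variant that keeps the gradient simple) rather than the plain norm above, and alpha is an assumed hyperparameter value.

```python
import numpy as np

def regularized_loss_and_grad(theta, X, y, alpha=0.1, eps=1e-12):
    # log loss plus an alpha * ||theta||_2^2 penalty
    m = X.shape[0]
    p_hat = 1.0 / (1.0 + np.exp(-(X @ theta)))
    p_hat = np.clip(p_hat, eps, 1 - eps)              # avoid log(0)
    loss = -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat)) \
           + alpha * np.sum(theta ** 2)
    grad = X.T @ (p_hat - y) / m + 2 * alpha * theta  # gradient of both terms
    return loss, grad
```

In scikit-learn, the same idea is controlled through LogisticRegression's penalty and C parameters, where C is the inverse of the regularization strength.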

16
Multi-Class Classification
One-Vs-Rest Method:

We can use a binary classifier for multi-class classification with the so-called
One-Vs-Rest (OvR) method. Specifically, it uses multiple rounds of
binary classification for multi-class classification.
For example, to determine if an object X is a dog, cat or fish, we call a
binary classifier f() as follows:

if f(X) outputs dog
    return dog;
else if f(X) outputs cat
    return cat;
else return fish
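As an illustration (not part of the slides), scikit-learn provides an OvR wrapper around any binary classifier; the iris dataset used here is just an assumed example with three classes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes
# one binary logistic regression classifier is trained per class (class vs. rest)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)
print(ovr.predict(X[:5]))
```

OneVsRestClassifier fits one copy of the base estimator per class and, at prediction time, selects the class whose classifier gives the highest score.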

17
Multi-Class Classification
Multinomial Logistic Regression:

Another approach for multi-class classification is to use
multinomial logistic regression. For each class 1 ≤ k ≤ K:

1) first compute $s_k(x) = \theta_k^\top x$
2) then compute the Softmax function:

$\hat{p}_k = \sigma(s(x))_k = \frac{\exp(s_k(x))}{\sum_{j=1}^{K} \exp(s_j(x))}$

where $\theta_k$ is the vector of parameters of the input features for $s_k$.
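A minimal NumPy sketch of these two steps; Theta is assumed to be a (K, d) matrix whose k-th row is θ_k, and the numerical values are illustrative.

```python
import numpy as np

def softmax_scores(Theta, x):
    # step 1: s_k(x) = theta_k . x for every class k
    s = Theta @ x
    # step 2: softmax, shifted by max(s) for numerical stability
    exp_s = np.exp(s - np.max(s))
    return exp_s / np.sum(exp_s)

Theta = np.array([[0.5, -0.2], [0.1, 0.9], [-0.4, 0.3]])  # K=3 classes, d=2 features
x = np.array([1.0, 2.0])
p_hat = softmax_scores(Theta, x)
print(p_hat, p_hat.sum())  # class probabilities summing to 1
```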

18
Multi-Class Classification
Training Multinomial Logistic Regression Model:

The performance measure is the cross-entropy cost function:

$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log(\hat{p}_k^{(i)})$

where $y_k^{(i)} = \begin{cases} 1, & \text{the } i\text{-th example is of class } k \\ 0, & \text{the } i\text{-th example is not of class } k \end{cases}$

GD can be used to train the multinomial logistic regression
model. The gradient is:

$\nabla_{\theta_k} J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{p}_k^{(i)} - y_k^{(i)} \right) x^{(i)}$
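A hedged NumPy sketch of the cross-entropy cost and its gradient, assuming X has shape (m, d), Y is one-hot encoded with shape (m, K), and Theta has shape (d, K) with one column per θ_k; these shape conventions are assumptions for illustration.

```python
import numpy as np

def softmax(S):
    exp_S = np.exp(S - S.max(axis=1, keepdims=True))
    return exp_S / exp_S.sum(axis=1, keepdims=True)

def cross_entropy_and_grad(Theta, X, Y, eps=1e-12):
    # X: (m, d) features, Y: (m, K) one-hot labels, Theta: (d, K) parameters
    m = X.shape[0]
    P_hat = softmax(X @ Theta)                    # (m, K) class probabilities
    cost = -np.sum(Y * np.log(P_hat + eps)) / m   # J(Theta)
    grad = X.T @ (P_hat - Y) / m                  # one gradient column per theta_k
    return cost, grad
```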

19
Multi-Class Classification
Other Approaches: One-Vs-One Method:

The One-Vs-One (OvO) method constructs a binary classifier for each
pair of classes. Therefore, with K classes, we need to construct K(K-1)/2
binary classifiers.

The decision at prediction time can be made by counting the votes from the
individual binary classifiers. In case of a tie, the aggregated classification
confidence (i.e., the output probability) of the individual binary classifiers
of each class is compared, and the class with the higher confidence is selected.

The OvO method is slower than OvR. But for some algorithms (e.g., kernel
algorithms) that cannot scale to many training examples, this approach
can be helpful.
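As an illustrative sketch (not part of the slides), scikit-learn also offers an OvO wrapper; the kernel SVM base estimator and the iris dataset are assumed choices, echoing the remark about kernel algorithms above.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes -> 3*(3-1)/2 = 3 pairwise classifiers
ovo = OneVsOneClassifier(SVC(kernel='rbf'))
ovo.fit(X, y)
print(ovo.predict(X[:5]))
```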

20
Multi-Class Classification
Other Approaches: Error-Correcting Output Codes
The Error-Correcting Output Codes (ECOC) method encodes K classes into N bit
vectors. Each class is represented as a bit in each bit vector. ECOC trains N
binary classifiers, each splitting one group of classes from another (using the
column bit vectors below). At prediction time, the N binary classifiers are called,
and their outputs yield an N-bit vector. The class whose codeword has the closest
Euclidean distance to this N-bit vector is selected. To reduce the classification
error, error-correcting codes are used when generating the "code book".

A code book with K=9, N=15
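As an illustrative sketch (not part of the slides), scikit-learn implements a related idea in OutputCodeClassifier; note that it generates a random code book rather than a designed error-correcting one, and the base estimator, code_size value, and iris dataset here are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OutputCodeClassifier

X, y = load_iris(return_X_y=True)   # K = 3 classes
# code_size controls how many binary classifiers are trained relative to K
# (code_size=5 with K=3 gives a 15-bit codeword per class)
ecoc = OutputCodeClassifier(LogisticRegression(max_iter=1000), code_size=5, random_state=0)
ecoc.fit(X, y)
print(ecoc.predict(X[:5]))
```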

21
