Lecture 3 - 1
Logistic Regression
Usually used for binary classification.
Logistic Regression
Let $\Pr(Y=1 \mid X=x) = p(x; \theta)$
Maximize the likelihood:
$L(\theta) = \prod_{i=1}^{m} p(x_i; \theta)^{y_i} \, \big(1 - p(x_i; \theta)\big)^{1 - y_i}$
Logistic Regression
Let $\Pr(Y=1 \mid X=x) = p(x; \theta)$
Task: to estimate $\theta$ by maximizing the likelihood.
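Taking the logarithm turns the product into a sum, which connects the maximum-likelihood view to the log loss used on the later slides (a short bridging step, using the same notation):

$\log L(\theta) = \sum_{i=1}^{m} \big[\, y_i \log p(x_i;\theta) + (1-y_i) \log\big(1 - p(x_i;\theta)\big) \big]$

so maximizing $\log L(\theta)$ is equivalent to minimizing $-\frac{1}{m}\log L(\theta)$, which is exactly the log loss $J(\theta)$ defined below.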
Logistic Regression
Logistic regression model:
$\log \dfrac{p}{1-p} = \theta_0 + x^{T}\theta$
which gives $p(x) = \dfrac{1}{1 + e^{-\theta_0 - x^{T}\theta}}$
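A minimal sketch of how this model maps a feature vector to a probability, assuming NumPy; the parameter values are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, theta0, theta):
    # p(x) = 1 / (1 + exp(-(theta_0 + x^T theta)))
    return sigmoid(theta0 + x @ theta)

# Made-up parameters and one example.
theta0 = -1.0
theta = np.array([2.0, -0.5])
x = np.array([1.5, 0.3])
print(predict_proba(x, theta0, theta))  # estimated Pr(Y = 1 | X = x)
```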
Logistic Regression
Prediction:
$\hat{y} = \begin{cases} 0, & \text{if } \hat{p} < 0.5 \\ 1, & \text{if } \hat{p} \ge 0.5 \end{cases}$
Training the logistic regression model $\hat{p}(x) = \sigma(\theta^{T} x)$ is to learn the best value of the parameter $\theta$ that makes the model fit the training data.
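The 0.5 threshold can be applied directly to the predicted probability; a small self-contained sketch (NumPy assumed, parameters made up):

```python
import numpy as np

def predict_label(x, theta0, theta, threshold=0.5):
    # p_hat = sigma(theta_0 + x^T theta); y_hat = 1 if p_hat >= threshold else 0
    p_hat = 1.0 / (1.0 + np.exp(-(theta0 + x @ theta)))
    return int(p_hat >= threshold)

# Made-up parameters and one example.
print(predict_label(np.array([1.5, 0.3]), -1.0, np.array([2.0, -0.5])))
```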
Logistic Regression
To train a logistic regression model, we first need to define a performance measure. A commonly used measure is the so-called log loss function:
$J(\theta) = -\dfrac{1}{m} \sum_{i=1}^{m} \big[\, y^{(i)} \log\hat{p}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{p}^{(i)}) \big]$
It is easier to explain the log loss function with a single training example, in which case we want to maximize the posterior probability
$P(y \mid x) = \begin{cases} \hat{p}, & \text{when } y = 1 \\ 1 - \hat{p}, & \text{when } y = 0 \end{cases}$
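A minimal sketch of evaluating the log loss on a batch of predictions, assuming NumPy; the labels and probabilities below are made up:

```python
import numpy as np

def log_loss(y, p_hat, eps=1e-12):
    # J = -(1/m) * sum[ y*log(p_hat) + (1 - y)*log(1 - p_hat) ]
    p_hat = np.clip(p_hat, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p_hat) + (1.0 - y) * np.log(1.0 - p_hat))

y = np.array([1, 0, 1, 1])               # made-up labels
p_hat = np.array([0.9, 0.2, 0.7, 0.4])   # made-up predicted probabilities
print(log_loss(y, p_hat))
```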
Logistic Regression
Learning the logistic regression model is to find:
$\hat{\theta} = \arg\min_{\theta} J(\theta)$
where $J(\theta) = -\dfrac{1}{m} \sum_{i=1}^{m} \big[\, y^{(i)} \log\hat{p}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{p}^{(i)}) \big]$ and $\hat{p} = \sigma(\theta^{T} x)$.
Training Logistic Regression Model
The gradient of the log loss function $J(\theta)$ is:
$\dfrac{\partial J(\theta)}{\partial \theta_j} = \dfrac{1}{m} \sum_{i=1}^{m} \big( \sigma(\theta^{T} x^{(i)}) - y^{(i)} \big)\, x_j^{(i)}$
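Putting the loss and its gradient together, a minimal batch gradient-descent training sketch (synthetic data; the learning rate and iteration count are made up, and a bias column is prepended so $\theta_0$ is handled as part of $\theta$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iters=2000):
    # X: (m, d) feature matrix, y: (m,) labels in {0, 1}.
    m, d = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])   # bias column -> theta[0] plays the role of theta_0
    theta = np.zeros(d + 1)
    for _ in range(n_iters):
        p_hat = sigmoid(Xb @ theta)
        grad = Xb.T @ (p_hat - y) / m      # (1/m) * sum (p_hat^(i) - y^(i)) * x^(i)
        theta -= lr * grad                 # gradient-descent step
    return theta

# Tiny synthetic (made-up) dataset with one informative feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(float)
print(train_logistic_regression(X, y))
```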
Training Logistic Regression Model
Overfitting may also happen in logistic regression.
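The slide only notes that overfitting can occur; a common mitigation (an addition here, not stated on the slide) is L2 regularization of $\theta$. A minimal sketch using scikit-learn, whose C parameter is the inverse regularization strength (smaller C = stronger penalty); the data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Many features, few examples: a setting where overfitting is likely.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

weak_reg = LogisticRegression(C=1000.0).fit(X, y)  # nearly unregularized
strong_reg = LogisticRegression(C=0.1).fit(X, y)   # stronger L2 penalty shrinks the weights

print(np.abs(weak_reg.coef_).max(), np.abs(strong_reg.coef_).max())
```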
Multi-Class Classification
One-Vs-Rest Method:
Multi-Class Classification
Multinomial Logistic Regression:
Multi-Class Classification
Training Multinomial Logistic Regression Model:
Multi-Class Classification
Other Approaches: One-Vs-One Method:
The decision at prediction time can be made by counting the votes from the individual binary classifiers. In case of a tie, the aggregated classification confidence (i.e., the summed output probabilities) of the tied classes is compared, and the class with the higher confidence is selected.
The OvO method is slower than OvR. However, for some algorithms (e.g., kernel methods) that do not scale well with the number of training examples, OvO can be helpful, because each binary classifier is trained on only the examples of its two classes.
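A minimal sketch of the OvO decision rule described above, using scikit-learn's LogisticRegression as the (assumed) base binary classifier on made-up three-class data; ties are broken by the summed probabilities, as described:

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def train_ovo(X, y):
    # One binary classifier per pair of classes, trained only on those two classes.
    classes = np.unique(y)
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = LogisticRegression().fit(X[mask], (y[mask] == b).astype(int))
    return classes, models

def predict_ovo(x, classes, models):
    votes = {c: 0 for c in classes}
    conf = {c: 0.0 for c in classes}     # aggregated probability, used to break ties
    for (a, b), clf in models.items():
        p_b = clf.predict_proba(x.reshape(1, -1))[0, 1]
        votes[b if p_b >= 0.5 else a] += 1
        conf[a] += 1.0 - p_b
        conf[b] += p_b
    return max(classes, key=lambda c: (votes[c], conf[c]))

# Made-up 3-class data: three Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, size=(30, 2)) for m in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 30)
classes, models = train_ovo(X, y)
print(predict_ovo(np.array([3.9, 4.1]), classes, models))
```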
Multi-Class Classification
Other Approaches: Error-Correcting Output Codes
The Error-Correcting Output Codes (ECOC) method encodes the K classes as N-bit vectors: each class is assigned an N-bit codeword (a row of the code book), and each bit position corresponds to one binary classifier. ECOC trains N binary classifiers, each splitting one group of classes from another (using the column bit vectors of the code book). At prediction time, the N binary classifiers are called and their outputs form an N-bit vector; the class whose codeword has the smallest Euclidean distance to this vector is selected. To reduce the classification error, error-correcting codes are used when generating the "code book".
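A small sketch of the ECOC decoding step described above; the code book, class names, and classifier outputs are all made up for illustration:

```python
import numpy as np

# Made-up code book: K = 4 classes, each encoded as an N = 6 bit codeword (one row per class).
code_book = np.array([
    [0, 0, 1, 1, 0, 1],
    [1, 0, 0, 0, 1, 1],
    [0, 1, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0],
])
class_names = ["cat", "dog", "bird", "fish"]

def decode(binary_outputs):
    # binary_outputs: the N outputs of the N binary classifiers for one example.
    dists = np.linalg.norm(code_book - np.asarray(binary_outputs), axis=1)
    return class_names[int(np.argmin(dists))]

# Made-up outputs: one bit flipped relative to the "dog" codeword.
print(decode([1, 0, 0, 1, 1, 1]))  # -> "dog", despite the single-bit error
```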