
ITCS 6156/8156 Fall 2023

Machine Learning

Logistic Regression

Instructor: Hongfei Xue


Email: [email protected]
Class Meeting: Mon & Wed, 4:00 PM – 5:15 PM, CHHS 376

Some content in the slides is based on Dr. Razvan’s and Dr. Andrew’s lectures
Logistic Regression

[Diagram: Supervised Learning, split by type of output]
• Regression (continuous output) → Linear Regression
• Classification (discrete output) → Logistic Regression
Classification

Question                         Answer "y"
Is this email spam?              no / yes
Is the transaction fraudulent?   no / yes
Is the tumor malignant?          no / yes

• Binary classification:
  • "y" can only be one of two values:
    - false: 0: "negative class" = "absence"
    - true: 1: "positive class" = "presence"
Linear Regression Approach

f_{w,b}(x) = w x + b    (with b = w_0)

[Plot: malignant? ((No) 0 / (Yes) 1) vs. tumor size x (diameter in cm); threshold 0.5, with ŷ = 1 on one side and ŷ = 0 on the other]

if f_{w,b}(x) < 0.5 → ŷ = 0
if f_{w,b}(x) ≥ 0.5 → ŷ = 1
Logistic Function

[Plot: malignant? ((No) 0 / (Yes) 1) vs. tumor size x (diameter in cm), with threshold 0.7]

• Probabilistic Discriminative Models: directly model the posterior class probabilities p(C | x⃗; w⃗, b)
Logistic Function

• Want outputs between 0 and 1

z = w⃗·x⃗ + b
g(z) = 1 / (1 + e^{−z})

f_{w,b}(x⃗) = g(w⃗·x⃗ + b) = 1 / (1 + e^{−(w⃗·x⃗ + b)})

[Plot: g(z) vs. z from −3 to 3; g crosses 0.5 at z = 0]

Logistic regression:
• sigmoid function / logistic function
• outputs between 0 and 1: g(z) = 1 / (1 + e^{−z}), with 0 < g(z) < 1
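A minimal NumPy sketch (not part of the original slides) of the sigmoid and the logistic model f_{w,b}(x⃗) = g(w⃗·x⃗ + b); the weights, bias, and inputs are made-up values for illustration.

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)); maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    # f_{w,b}(x) = g(w . x + b) for each row of X
    return sigmoid(X @ w + b)

# made-up example: two features, three samples
X = np.array([[1.0, 2.0], [0.5, 0.5], [3.0, 1.0]])
w = np.array([1.0, 1.0])
b = -3.0
print(predict_proba(X, w, b))   # all outputs lie strictly between 0 and 1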
Decision Boundary

f_{w,b}(x⃗) = g(z) = g(w_1 x_1 + w_2 x_2 + b)

Decision boundary: z = w⃗·x⃗ + b = 0
  (set w_1 = 1, w_2 = 1, b = −3)
  z = x_1 + x_2 − 3 = 0, i.e., x_1 + x_2 = 3

[Plot: x_2 vs. x_1 (axes 0 to 3), with the line x_1 + x_2 = 3 separating the two classes]

The decision boundary is a hyperplane: f(x⃗) = 0.5 ⟺ z = 0


Non-linear Decision Boundary

z = w_1 x_1² + w_2 x_2² + b

Decision boundary:
  (set w_1 = 1, w_2 = 1, b = −1)
  z = x_1² + x_2² − 1 = 0
  x_1² + x_2² = 1

[Plot: x_2 vs. x_1 (axes from −1 to 1), with the unit circle x_1² + x_2² = 1 as the boundary]
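A short illustrative sketch (not from the slides) checking both boundaries above: the linear boundary x_1 + x_2 = 3, and the circular boundary x_1² + x_2² = 1 obtained by feeding squared features into the same logistic model. The test points are made up.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# linear boundary: z = x1 + x2 - 3  (w1 = w2 = 1, b = -3)
def predict_linear(x1, x2):
    return (sigmoid(x1 + x2 - 3.0) >= 0.5).astype(int)

# non-linear boundary: z = x1^2 + x2^2 - 1  (squared features, w1 = w2 = 1, b = -1)
def predict_circle(x1, x2):
    return (sigmoid(x1**2 + x2**2 - 1.0) >= 0.5).astype(int)

x1 = np.array([1.0, 2.5, 0.1])
x2 = np.array([1.0, 2.5, 0.2])
print(predict_linear(x1, x2))   # [0 1 0]: only points with x1 + x2 >= 3 get class 1
print(predict_circle(x1, x2))   # [1 1 0]: points outside the unit circle get class 1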
Loss Function

Training Set:

  tumor size (cm)   …   patient's age   malignant?
  x_1               …   x_n             y
  10                …   52              1
  2                 …   73              0
  5                 …   55              0
  12                …   49              1
  …                 …   …               …

i = 1, 2, ⋯, m   (m: number of training samples)
j = 1, 2, ⋯, n   (n: number of features)
target y is 0 or 1

f_{w,b}(x⃗) = 1 / (1 + e^{−(w⃗·x⃗ + b)})

How to choose w⃗ = [w_1, w_2, ⋯, w_n] and b?
Loss Function

• Squared Error Cost:


J(w, b) = (1/m) ∑_{i=1}^{m} (1/2) (f_{w,b}(x⃗^{(i)}) − y^{(i)})²

- Differentiable => can use gradient descent


- Non-convex => not guaranteed to find the global optimum
Loss Function

• Logistic Loss Function:

L(f_{w,b}(x⃗^{(i)}), y^{(i)}) = −log(f_{w,b}(x⃗^{(i)}))        if y^{(i)} = 1
                               −log(1 − f_{w,b}(x⃗^{(i)}))    if y^{(i)} = 0

if y^{(i)} = 1: as f_{w,b}(x⃗^{(i)}) → 1, loss → 0; as f_{w,b}(x⃗^{(i)}) → 0, loss → ∞
if y^{(i)} = 0: as f_{w,b}(x⃗^{(i)}) → 1, loss → ∞; as f_{w,b}(x⃗^{(i)}) → 0, loss → 0
Simplified Loss Function

• Logistic Loss Function:

L(f_{w,b}(x⃗^{(i)}), y^{(i)}) = −log(f_{w,b}(x⃗^{(i)}))        if y^{(i)} = 1
                               −log(1 − f_{w,b}(x⃗^{(i)}))    if y^{(i)} = 0

• Simplified Logistic Loss Function (Convex):

L(f_{w,b}(x⃗^{(i)}), y^{(i)}) = −y^{(i)} log(f_{w,b}(x⃗^{(i)})) − (1 − y^{(i)}) log(1 − f_{w,b}(x⃗^{(i)}))

• Overall:

J(w, b) = (1/m) ∑_{i=1}^{m} L(f_{w,b}(x⃗^{(i)}), y^{(i)})
        = −(1/m) ∑_{i=1}^{m} [ y^{(i)} log(f_{w,b}(x⃗^{(i)})) + (1 − y^{(i)}) log(1 − f_{w,b}(x⃗^{(i)})) ]

• Can be derived from Maximum Likelihood.
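A small NumPy sketch (illustrative) of the simplified logistic loss and the overall cost J(w, b); the tiny one-feature dataset and parameter values are made up.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, w, b, eps=1e-12):
    # J(w, b) = -(1/m) * sum[ y*log(f) + (1 - y)*log(1 - f) ]
    f = sigmoid(X @ w + b)
    f = np.clip(f, eps, 1.0 - eps)          # avoid log(0)
    return -np.mean(y * np.log(f) + (1.0 - y) * np.log(1.0 - f))

# made-up data: one feature (e.g., tumor size), binary labels
X = np.array([[10.0], [2.0], [5.0], [12.0]])
y = np.array([1, 0, 0, 1])
print(cost(X, y, w=np.array([0.5]), b=-3.0))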


Gradient Descent
• Overall Loss (Cost):

J(w, b) = −(1/m) ∑_{i=1}^{m} [ y^{(i)} log(f_{w,b}(x⃗^{(i)})) + (1 − y^{(i)}) log(1 − f_{w,b}(x⃗^{(i)})) ]

• Gradient Descent (compared with Linear Regression: the update rule has the same form, but here f_{w,b}(x⃗) is the sigmoid of w⃗·x⃗ + b):

Repeat {
  w_j := w_j − α ∂J(w, b)/∂w_j,
    where ∂J(w, b)/∂w_j = (1/m) ∑_{i=1}^{m} (f_{w,b}(x⃗^{(i)}) − y^{(i)}) x_j^{(i)}
  b := b − α ∂J(w, b)/∂b,
    where ∂J(w, b)/∂b = (1/m) ∑_{i=1}^{m} (f_{w,b}(x⃗^{(i)}) − y^{(i)})
} simultaneous updates
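A hedged NumPy sketch of the batch gradient-descent updates above (simultaneous updates of w and b); the learning rate, iteration count, and toy data are arbitrary choices for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=1000):
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        f = sigmoid(X @ w + b)                 # predictions f_{w,b}(x^(i))
        dw = (X.T @ (f - y)) / m               # dJ/dw_j = (1/m) * sum (f - y) * x_j
        db = np.mean(f - y)                    # dJ/db   = (1/m) * sum (f - y)
        w, b = w - alpha * dw, b - alpha * db  # simultaneous update
    return w, b

# toy data: class 1 when x1 + x2 is large
X = np.array([[0.5, 0.5], [1.0, 1.0], [2.5, 2.0], [3.0, 3.0]])
y = np.array([0, 0, 1, 1])
w, b = gradient_descent(X, y)
print(np.round(sigmoid(X @ w + b)))   # should approach [0, 0, 1, 1]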
Bias & Variance
• Bias and Variance are two fundamental concepts in machine learning that
pertain to the errors associated with predictive models.
• Bias: The differences between actual or expected values and the predicted
values are known as bias error or error due to bias. Bias is a systematic error
that occurs due to wrong assumptions in the machine learning process.
• Low Bias: In this case, the model will closely match the training dataset.
• High Bias: If a model has high bias, this means it can't capture the
patterns in the data, no matter how much you train it. The model is too
simplistic. This scenario is often referred to as underfitting.
• Variance: Variance is the amount by which the performance of a predictive model changes when it is trained on different subsets of the training data. More specifically, it measures how sensitive the model is to the particular subset of training data it was fit on (i.e., how much the learned function changes when the model is refit on a new subset of the training data).
  • Low Variance: Low variance means that the model is less sensitive to changes in the training data and produces consistent estimates of the target function across different subsets of data from the same distribution.
  • High Variance: High variance means that the model is very sensitive to changes in the training data, so the estimate of the target function can change significantly when the model is trained on different subsets of data from the same distribution.
Polynomial Regression Examples

[Figure: polynomial fits of degree M = 1, M = 3, and M = 9 to the same data]

• M = 1: Underfitting. Does not fit the training set well; cannot fit the test set well. High bias.
• M = 3: Just right. Fits the training set pretty well; fits the test set well. Good generalization.
• M = 9: Overfitting. Fits the training set extremely well; cannot fit the test set well. High variance.
Classification Examples

• Underfitting: z = w_1 x_1 + w_2 x_2 + b
• Just right: z = w_1 x_1 + w_2 x_2 + w_3 x_1² + w_4 x_2² + w_5 x_1 x_2 + b
• Overfitting: z = w_1 x_1³ + w_2 x_2³ + w_3 x_1² + w_4 x_2² + w_5 x_1 x_2 + ⋯ + b (many high-order polynomial terms)
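One way (an illustrative sketch, not prescribed by the slides) to build the "just right" quadratic feature set x_1, x_2, x_1², x_1·x_2, x_2² before fitting the logistic model; scikit-learn's PolynomialFeatures and LogisticRegression are used as a convenience, and the circular toy data is made up.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

# made-up 2-D points labeled by whether they fall inside a circle
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 1).astype(int)

# degree-2 mapping: x1, x2, x1^2, x1*x2, x2^2 (the "just right" feature set)
quad = PolynomialFeatures(degree=2, include_bias=False)
Xq = quad.fit_transform(X)

clf = LogisticRegression(max_iter=1000).fit(Xq, y)
print(clf.score(Xq, y))   # training accuracy; should be high, since a degree-2 model can represent a circle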


Dealing with Overfitting

• Collect more training examples


Dealing with Overfitting

• Select features to include/exclude:
  • 100 features → 10 features
  • 100 features + insufficient data → overfitting
  • 10 features + same data → just right (possibly)

• Disadvantage:
  • Useful features could be lost
Regularization

• Reduce the size of parameters w

f(x) = 28x − 385x² + 39x³ − 174x⁴ + 100
vs.
f(x) = 13x − 0.23x² + 0.000014x³ − 0.0001x⁴ + 10   (smaller parameters)
Regularized Logistic Regression
• Overall Loss with Regularizer:

J(w, b) = −(1/m) ∑_{i=1}^{m} [ y^{(i)} log(f_{w,b}(x⃗^{(i)})) + (1 − y^{(i)}) log(1 − f_{w,b}(x⃗^{(i)})) ] + (λ/(2m)) ∑_{j=1}^{n} w_j²

• Gradient Descent:
Repeat {
  w_j := w_j − α ∂J(w, b)/∂w_j,
    where ∂J(w, b)/∂w_j = (1/m) ∑_{i=1}^{m} (f_{w,b}(x⃗^{(i)}) − y^{(i)}) x_j^{(i)} + (λ/m) w_j
  b := b − α ∂J(w, b)/∂b,
    where ∂J(w, b)/∂b = (1/m) ∑_{i=1}^{m} (f_{w,b}(x⃗^{(i)}) − y^{(i)})
} simultaneous updates
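A sketch of the regularized update above in NumPy (illustrative; λ, α, and the iteration count are arbitrary). Note that only the w_j updates get the (λ/m)·w_j term; b is not regularized.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_gradient_descent(X, y, lam=1.0, alpha=0.1, iters=1000):
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        f = sigmoid(X @ w + b)
        dw = (X.T @ (f - y)) / m + (lam / m) * w   # extra term shrinks each w_j toward 0
        db = np.mean(f - y)                        # b is not regularized
        w, b = w - alpha * dw, b - alpha * db      # simultaneous update
    return w, b

For reference, scikit-learn's LogisticRegression applies L2 regularization by default, with its C parameter acting as the inverse of the regularization strength (roughly 1/λ).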
Machine Learning Objective

• Find a model M:
• that fits the training data + that is simple

• Inductive hypothesis: Models that perform well on


training examples are expected to do well on test (unseen)
examples.
• Occam's Razor: Simpler models are expected to do better
than complex models on test examples (assuming similar
training performance).
Algebraic Interpretation

• The output of the neuron is a linear combination of inputs


from other neurons, rescaled by the weights.
• summation corresponds to combination of signals
• It is often transformed through an activation/output
function.
Binary Classification

• Test dataset for evaluation:
  • In a binary classification dataset, each instance has its true label (true class): Positive Class (P) vs. Negative Class (N).
• Predictions on the test dataset:
  • [Figure] A perfect classifier
  • [Figure] A real-world classifier

Images from https://classeval.wordpress.com/introduction/basic-evaluation-measures/


Confusion Matrix

• Confusion matrix (a 2x2 table) is composed of four outcomes of classification:


• True positive (TP): correct positive prediction
• False positive (FP): incorrect positive prediction
• True negative (TN): correct negative prediction
• False negative (FN): incorrect negative prediction

                   Predicted Positive   Predicted Negative
True Positive      # of TPs             # of FNs
True Negative      # of FPs             # of TNs
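A small sketch (with made-up labels) that tallies the four outcomes from true and predicted labels.

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # made-up ground truth
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # made-up predictions

tp = int(np.sum((y_true == 1) & (y_pred == 1)))   # correct positive predictions
fp = int(np.sum((y_true == 0) & (y_pred == 1)))   # incorrect positive predictions
tn = int(np.sum((y_true == 0) & (y_pred == 0)))   # correct negative predictions
fn = int(np.sum((y_true == 1) & (y_pred == 0)))   # incorrect negative predictions
print(tp, fp, tn, fn)   # 3 1 3 1 for this toy example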
Basic Measurements

• Accuracy is calculated as the number of all correct predictions divided by the total number of instances in the dataset.
• Precision is calculated as the number of correct positive predictions divided by the total number of positive predictions.
• Recall (sensitivity, true positive rate) is calculated as the number of correct positive predictions divided by the total number of positives.
• F1 Score is the harmonic mean of precision and recall.
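Continuing the toy counts from the confusion-matrix sketch above (still illustrative), the four measures follow directly from TP, FP, TN, and FN.

tp, fp, tn, fn = 3, 1, 3, 1   # counts from the toy confusion matrix above

accuracy  = (tp + tn) / (tp + tn + fp + fn)                  # all correct / all predictions
precision = tp / (tp + fp)                                   # correct positives / predicted positives
recall    = tp / (tp + fn)                                   # correct positives / actual positives
f1        = 2 * precision * recall / (precision + recall)    # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)   # 0.75 0.75 0.75 0.75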
Multi-class Classification

• Multi-class Classification:
  • To classify instances into one of more than two classes (i.e., there are more than two possible categories or labels).
• Strategies:
  • One-vs-All (One-vs-Rest)
  • One-vs-One
  • Softmax Regression (later)
  • Decision Trees (later)

[Figures: binary classification vs. multi-class classification]

Images from: https://utkuufuk.com/2018/06/03/one-vs-all-classification/


One-vs-All

• One-vs-all classification breaks the N classes present in the dataset down into N binary classifier models, each of which aims to classify a data point as either part of its class or not.
• Suppose you have classes 1, 2, and 3.
• Model A: 1 or 2,3 (1 or not 1)
• Model B: 2 or 1,3 (2 or not 2)
• Model C: 3 or 1,2 (3 or not 3)

• At prediction time, the class that corresponds


to the classifier with the highest confidence
score is the predicted class.
• Model A: 𝑃(𝑥 = 1) and 𝑃(𝑥 ≠ 1)
• Model B: 𝑃(𝑥 = 2) and 𝑃(𝑥 ≠ 2)
• Model C: 𝑃(𝑥 = 3) and 𝑃(𝑥 ≠ 3)
• Among 𝑃(𝑥 = 1) , 𝑃(𝑥 = 2) , and
𝑃(𝑥 = 3) , which one is the highest?

Images from: https://www.cc.gatech.edu/classes/AY2016/cs4476_fall/results/proj4/html/jnanda3/index.html
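A hedged sketch of one-vs-all using logistic regression for the per-class models: fit one binary classifier per class, then predict the class whose classifier reports the highest probability. scikit-learn's LogisticRegression is used here as a convenience, and the 3-class toy data is made up.

import numpy as np
from sklearn.linear_model import LogisticRegression

# made-up 3-class data in 2-D (one cluster per class)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([1, 2, 3], 30)

# one binary model per class: "class k" vs. "not class k"
models = {k: LogisticRegression(max_iter=1000).fit(X, (y == k).astype(int)) for k in (1, 2, 3)}

def predict_ova(x):
    # pick the class whose classifier reports the highest P(class = k | x)
    scores = {k: m.predict_proba(x.reshape(1, -1))[0, 1] for k, m in models.items()}
    return max(scores, key=scores.get)

print(predict_ova(np.array([2.9, 0.1])))   # expected to be class 2 for this toy data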


One-vs-one

• One-vs-one classification breaks the N classes present in the dataset down into N*(N-1)/2 binary classifier models, one for each pair of classes.

• Suppose you have classes 1, 2, and 3.


• Model A: 1 or 2
• Model B: 1 or 3
• Model C: 2 or 3

• At prediction time, each classifier votes for a


class, and the class with the most votes is the
predicted class.
• Model A: Vote for 1 or 2
• Model B: Vote for 1 or 3
• Model C: Vote for 2 or 3
• Classes 1, 2, and 3, which one has the
most votes?
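A similar illustrative sketch for one-vs-one voting (it reuses the made-up 3-class setup from the one-vs-all sketch): train one binary model per pair of classes, then let each model vote and take the majority.

import numpy as np
from itertools import combinations
from collections import Counter
from sklearn.linear_model import LogisticRegression

# made-up 3-class data in 2-D (one cluster per class)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([1, 2, 3], 30)

# one model per pair of classes: N*(N-1)/2 = 3 models here
pair_models = {}
for a, b in combinations([1, 2, 3], 2):
    mask = (y == a) | (y == b)
    pair_models[(a, b)] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])

def predict_ovo(x):
    # each pairwise model votes for one of its two classes; the majority wins
    votes = [int(m.predict(x.reshape(1, -1))[0]) for m in pair_models.values()]
    return Counter(votes).most_common(1)[0][0]

print(predict_ovo(np.array([0.2, 2.8])))   # expected to be class 3 for this toy data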
Questions?
