3 - Logistic Regression (v3)

1. Logistic regression uses a sigmoid function to output a probability between 0 and 1 for classification problems. It finds the weights and bias that maximize the likelihood of the training data.
2. The likelihood function measures how well the sigmoid function predicts the training labels. Taking the negative log transforms it into a sum of cross-entropy terms that can be minimized using gradient descent.
3. In each step of gradient descent, the weights are updated by subtracting the learning rate times the derivative of the negative log-likelihood with respect to the weights. This derivative involves only the predicted probabilities and the training labels.


Logistic Regression

Step 1: Function Set


We want to find $P_{w,b}(C_1|x)$:

$$P_{w,b}(C_1|x) = \sigma(z), \qquad \sigma(z) = \frac{1}{1+\exp(-z)}, \qquad z = w \cdot x + b = \sum_i w_i x_i + b$$

If $P_{w,b}(C_1|x) \ge 0.5$, output $C_1$; otherwise, output $C_2$.

Function set: $f_{w,b}(x) = P_{w,b}(C_1|x)$, including all different $w$ and $b$.
Step 1: Function Set

[Figure: the inputs $x_1, \dots, x_I$ are weighted by $w_1, \dots, w_I$ and summed with the bias $b$ to give $z = \sum_i w_i x_i + b$, which is passed through the sigmoid function $\sigma(z) = \frac{1}{1+e^{-z}}$ to produce $f_{w,b}(x) = P_{w,b}(C_1|x)$.]
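Below is a minimal NumPy sketch of this function set; the weight vector, bias, and input are arbitrary values chosen for illustration, not numbers from the slides.

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def f_wb(x, w, b):
    # f_{w,b}(x) = sigma(w . x + b) = P_{w,b}(C1 | x)
    return sigmoid(np.dot(w, x) + b)

# Example parameters (arbitrary, for illustration only)
w = np.array([0.5, -1.2])
b = 0.3
x = np.array([1.0, 2.0])

p_c1 = f_wb(x, w, b)
print(p_c1)                            # probability of class C1, between 0 and 1
print("C1" if p_c1 >= 0.5 else "C2")   # decision rule from the slide
```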
Logistic Regression v.s. Linear Regression

Step 1:
  Logistic regression: $f_{w,b}(x) = \sigma\left(\sum_i w_i x_i + b\right)$; output between 0 and 1.
  Linear regression: $f_{w,b}(x) = \sum_i w_i x_i + b$; output can be any value.

Step 2: (compared after it is introduced below)

Step 3: (compared after it is introduced below)
Step 2: Goodness of a Function

Training data: $x^1, x^2, x^3, \dots, x^N$ with classes $C_1, C_1, C_2, \dots, C_1$.

Assume the data is generated based on $f_{w,b}(x) = P_{w,b}(C_1|x)$.

Given a set of $w$ and $b$, what is its probability of generating the data?

$$L(w,b) = f_{w,b}(x^1)\, f_{w,b}(x^2)\, \bigl(1 - f_{w,b}(x^3)\bigr) \cdots f_{w,b}(x^N)$$

The most likely $w^*$ and $b^*$ are the ones with the largest $L(w,b)$:

$$w^*, b^* = \arg\max_{w,b} L(w,b)$$

Relabel the classes as $\hat{y}^n$: 1 for class 1, 0 for class 2. For the data above, $\hat{y}^1 = 1$, $\hat{y}^2 = 1$, $\hat{y}^3 = 0$.

$$w^*, b^* = \arg\max_{w,b} L(w,b) = \arg\min_{w,b} \bigl(-\ln L(w,b)\bigr)$$

Each factor of $-\ln L(w,b)$ can be written in a single form using $\hat{y}^n$:

$$-\ln f_{w,b}(x^1) = -\bigl[\hat{y}^1 \ln f_{w,b}(x^1) + (1-\hat{y}^1)\ln\bigl(1 - f_{w,b}(x^1)\bigr)\bigr] \quad (\hat{y}^1 = 1)$$
$$-\ln f_{w,b}(x^2) = -\bigl[\hat{y}^2 \ln f_{w,b}(x^2) + (1-\hat{y}^2)\ln\bigl(1 - f_{w,b}(x^2)\bigr)\bigr] \quad (\hat{y}^2 = 1)$$
$$-\ln\bigl(1 - f_{w,b}(x^3)\bigr) = -\bigl[\hat{y}^3 \ln f_{w,b}(x^3) + (1-\hat{y}^3)\ln\bigl(1 - f_{w,b}(x^3)\bigr)\bigr] \quad (\hat{y}^3 = 0)$$
$$\cdots$$
Step 2: Goodness of a Function
$$L(w,b) = f_{w,b}(x^1)\, f_{w,b}(x^2)\, \bigl(1 - f_{w,b}(x^3)\bigr) \cdots f_{w,b}(x^N)$$

$$-\ln L(w,b) = -\ln f_{w,b}(x^1) - \ln f_{w,b}(x^2) - \ln\bigl(1 - f_{w,b}(x^3)\bigr) - \cdots$$

With $\hat{y}^n$ equal to 1 for class 1 and 0 for class 2, this becomes

$$-\ln L(w,b) = \sum_n -\bigl[\hat{y}^n \ln f_{w,b}(x^n) + (1-\hat{y}^n)\ln\bigl(1 - f_{w,b}(x^n)\bigr)\bigr]$$

Each term is the cross entropy between two Bernoulli distributions:

Distribution p: $p(x=1) = \hat{y}^n$, $p(x=0) = 1 - \hat{y}^n$
Distribution q: $q(x=1) = f_{w,b}(x^n)$, $q(x=0) = 1 - f_{w,b}(x^n)$

$$H(p,q) = -\sum_x p(x) \ln q(x)$$
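A small NumPy sketch of this negative log-likelihood (equivalently, the summed cross entropy); the labels follow the slide's example, while the predicted probabilities are made-up values for illustration.

```python
import numpy as np

def neg_log_likelihood(y_hat, f_x):
    # -ln L(w,b) = sum_n -[ y_hat^n ln f(x^n) + (1 - y_hat^n) ln(1 - f(x^n)) ]
    # i.e. the sum of cross entropies between Bernoulli(y_hat^n) and Bernoulli(f(x^n))
    return np.sum(-(y_hat * np.log(f_x) + (1 - y_hat) * np.log(1 - f_x)))

# Labels (1 for class 1, 0 for class 2) and predicted probabilities f_{w,b}(x^n)
y_hat = np.array([1, 1, 0])        # x^1 in C1, x^2 in C1, x^3 in C2, as in the slide
f_x   = np.array([0.9, 0.8, 0.3])  # assumed model outputs, for illustration only

print(neg_log_likelihood(y_hat, f_x))  # smaller value = larger likelihood L(w,b)
```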
Logistic Regression v.s. Linear Regression

Step 1:
  Logistic regression: $f_{w,b}(x) = \sigma\left(\sum_i w_i x_i + b\right)$; output between 0 and 1.
  Linear regression: $f_{w,b}(x) = \sum_i w_i x_i + b$; output can be any value.

Step 2:
  Logistic regression: training data $(x^n, \hat{y}^n)$, $\hat{y}^n$: 1 for class 1, 0 for class 2.
    $L(f) = \sum_n C\bigl(f(x^n), \hat{y}^n\bigr)$, where the cross entropy is
    $C\bigl(f(x^n), \hat{y}^n\bigr) = -\bigl[\hat{y}^n \ln f(x^n) + (1-\hat{y}^n)\ln\bigl(1 - f(x^n)\bigr)\bigr]$
  Linear regression: training data $(x^n, \hat{y}^n)$, $\hat{y}^n$: a real number.
    $L(f) = \frac{1}{2}\sum_n \bigl(f(x^n) - \hat{y}^n\bigr)^2$

Question: Why don't we simply use square error, as in linear regression?
Step 3: Find the best function

$$-\ln L(w,b) = \sum_n -\bigl[\hat{y}^n \ln f_{w,b}(x^n) + (1-\hat{y}^n)\ln\bigl(1 - f_{w,b}(x^n)\bigr)\bigr]$$

Differentiate the first term with respect to $w_i$, using $f_{w,b}(x) = \sigma(z) = 1/\bigl(1+\exp(-z)\bigr)$ and $z = w \cdot x + b = \sum_i w_i x_i + b$:

$$\frac{\partial \ln f_{w,b}(x)}{\partial w_i} = \frac{\partial \ln f_{w,b}(x)}{\partial z}\,\frac{\partial z}{\partial w_i}, \qquad \frac{\partial z}{\partial w_i} = x_i$$

$$\frac{\partial \ln \sigma(z)}{\partial z} = \frac{1}{\sigma(z)}\,\frac{\partial \sigma(z)}{\partial z} = \frac{1}{\sigma(z)}\,\sigma(z)\bigl(1-\sigma(z)\bigr) = 1-\sigma(z)$$

So the first term contributes $\bigl(1 - f_{w,b}(x^n)\bigr)\, x_i^n$.
Step 3: Find the best function

Differentiate the second term with respect to $w_i$:

$$\frac{\partial \ln\bigl(1 - f_{w,b}(x)\bigr)}{\partial w_i} = \frac{\partial \ln\bigl(1 - f_{w,b}(x)\bigr)}{\partial z}\,\frac{\partial z}{\partial w_i}, \qquad \frac{\partial z}{\partial w_i} = x_i$$

$$\frac{\partial \ln\bigl(1 - \sigma(z)\bigr)}{\partial z} = -\frac{1}{1-\sigma(z)}\,\frac{\partial \sigma(z)}{\partial z} = -\frac{1}{1-\sigma(z)}\,\sigma(z)\bigl(1-\sigma(z)\bigr) = -\sigma(z)$$

So the second term contributes $-f_{w,b}(x^n)\, x_i^n$.
Step 3: Find the best function

$$-\frac{\partial \ln L(w,b)}{\partial w_i} = \sum_n -\bigl[\hat{y}^n \bigl(1 - f_{w,b}(x^n)\bigr) x_i^n - (1-\hat{y}^n)\, f_{w,b}(x^n)\, x_i^n\bigr]$$

$$= \sum_n -\bigl[\hat{y}^n - \hat{y}^n f_{w,b}(x^n) - f_{w,b}(x^n) + \hat{y}^n f_{w,b}(x^n)\bigr]\, x_i^n$$

$$= \sum_n -\bigl(\hat{y}^n - f_{w,b}(x^n)\bigr)\, x_i^n$$

Gradient descent update:

$$w_i \leftarrow w_i - \eta \sum_n -\bigl(\hat{y}^n - f_{w,b}(x^n)\bigr)\, x_i^n$$

The larger the difference between the target $\hat{y}^n$ and the output $f_{w,b}(x^n)$, the larger the update.
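A minimal NumPy sketch of this gradient descent update on a toy dataset; the data, learning rate, and number of steps are assumptions chosen only to show the update rule in action.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy training data: rows of X are x^n, y_hat holds the labels (1 = class 1, 0 = class 2)
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 0.0]])
y_hat = np.array([1, 1, 0, 0])

w = np.zeros(2)
b = 0.0
eta = 0.1                                  # learning rate

for step in range(100):
    f = sigmoid(X @ w + b)                 # f_{w,b}(x^n) for every n
    grad_w = -(y_hat - f) @ X              # sum_n -(y_hat^n - f(x^n)) x^n
    grad_b = -np.sum(y_hat - f)            # same form for the bias
    w -= eta * grad_w                      # w_i <- w_i - eta * gradient
    b -= eta * grad_b

print(w, b, sigmoid(X @ w + b))            # outputs move toward the labels
```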
Logistic Regression v.s. Linear Regression

Step 1:
  Logistic regression: $f_{w,b}(x) = \sigma\left(\sum_i w_i x_i + b\right)$; output between 0 and 1.
  Linear regression: $f_{w,b}(x) = \sum_i w_i x_i + b$; output can be any value.

Step 2:
  Logistic regression: training data $(x^n, \hat{y}^n)$, $\hat{y}^n$: 1 for class 1, 0 for class 2; $L(f) = \sum_n C\bigl(f(x^n), \hat{y}^n\bigr)$ (cross entropy).
  Linear regression: training data $(x^n, \hat{y}^n)$, $\hat{y}^n$: a real number; $L(f) = \frac{1}{2}\sum_n \bigl(f(x^n) - \hat{y}^n\bigr)^2$ (square error).

Step 3:
  Logistic regression: $w_i \leftarrow w_i - \eta \sum_n -\bigl(\hat{y}^n - f_{w,b}(x^n)\bigr) x_i^n$
  Linear regression: $w_i \leftarrow w_i - \eta \sum_n -\bigl(\hat{y}^n - f_{w,b}(x^n)\bigr) x_i^n$
  The two update rules have exactly the same form.
Logistic Regression + Square Error

Step 1: $f_{w,b}(x) = \sigma\left(\sum_i w_i x_i + b\right)$

Step 2: Training data $(x^n, \hat{y}^n)$, $\hat{y}^n$: 1 for class 1, 0 for class 2.

$$L(f) = \frac{1}{2}\sum_n \bigl(f_{w,b}(x^n) - \hat{y}^n\bigr)^2$$

Step 3:

$$\frac{\partial \bigl(f_{w,b}(x) - \hat{y}\bigr)^2}{\partial w_i} = 2\bigl(f_{w,b}(x) - \hat{y}\bigr)\frac{\partial f_{w,b}(x)}{\partial z}\frac{\partial z}{\partial w_i} = 2\bigl(f_{w,b}(x) - \hat{y}\bigr)\, f_{w,b}(x)\bigl(1 - f_{w,b}(x)\bigr)\, x_i$$

If $\hat{y}^n = 1$:
  If $f_{w,b}(x^n) = 1$ (close to target): $\partial L / \partial w_i = 0$
  If $f_{w,b}(x^n) = 0$ (far from target): $\partial L / \partial w_i = 0$

If $\hat{y}^n = 0$:
  If $f_{w,b}(x^n) = 1$ (far from target): $\partial L / \partial w_i = 0$
  If $f_{w,b}(x^n) = 0$ (close to target): $\partial L / \partial w_i = 0$

So with square error the gradient is zero not only near the target but also far from it, which makes training very slow.
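A quick NumPy check of this behavior, comparing the square-error gradient with the cross-entropy gradient at the two extremes (the probabilities are arbitrary illustrative values):

```python
import numpy as np

def square_error_grad(f_x, y_hat, x_i):
    # d(f(x) - y_hat)^2 / dw_i = 2 (f(x) - y_hat) f(x) (1 - f(x)) x_i
    return 2 * (f_x - y_hat) * f_x * (1 - f_x) * x_i

def cross_entropy_grad(f_x, y_hat, x_i):
    # per-example cross-entropy gradient: -(y_hat - f(x)) x_i
    return -(y_hat - f_x) * x_i

x_i, y_hat = 1.0, 1.0
for f_x in (0.9999, 0.0001):                   # close to target vs. far from target
    print(f_x,
          square_error_grad(f_x, y_hat, x_i),   # ~0 in both cases
          cross_entropy_grad(f_x, y_hat, x_i))  # large when far from target
```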


Cross Entropy v.s. Square Error

[Figure: total loss plotted as a surface over two parameters $w_1$ and $w_2$, comparing cross entropy with square error; the square-error surface is very flat away from the minimum, so gradient descent makes little progress there.]

http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
Discriminative v.s. Generative

Both use the same model: $P(C_1|x) = \sigma(w \cdot x + b)$.

Discriminative: directly find $w$ and $b$ (e.g., by logistic regression).

Generative: find $\mu^1$, $\mu^2$, $\Sigma^{-1}$, then

$$w^T = (\mu^1 - \mu^2)^T \Sigma^{-1}$$
$$b = -\frac{1}{2}(\mu^1)^T \Sigma^{-1} \mu^1 + \frac{1}{2}(\mu^2)^T \Sigma^{-1} \mu^2 + \ln\frac{N_1}{N_2}$$

Will we obtain the same set of $w$ and $b$? No: it is the same model (function set), but a different function is selected by the same training data.
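A short NumPy sketch of how the generative solution turns class statistics into $w$ and $b$; the means, shared covariance, and class counts below are invented for illustration.

```python
import numpy as np

# Hypothetical class statistics (assumed for illustration, not from the slides)
mu1 = np.array([2.0, 1.0])
mu2 = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])    # shared covariance
N1, N2 = 60, 40                                # class counts

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)                    # w^T = (mu1 - mu2)^T Sigma^-1
b = (-0.5 * mu1 @ Sigma_inv @ mu1
     + 0.5 * mu2 @ Sigma_inv @ mu2
     + np.log(N1 / N2))

def posterior_c1(x):
    # P(C1|x) = sigma(w . x + b), the same functional form as logistic regression
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

print(w, b, posterior_c1(np.array([1.0, 1.0])))
```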
Generative v.s. Discriminative

Using all features (total, hp, att, sp att, de, sp de, speed):
  Generative: 73% accuracy
  Discriminative: 79% accuracy
Generative v.s. Discriminative

• Example

Training data (13 examples, each with two binary features):
  Class 1: $(x_1=1, x_2=1)$, 1 example
  Class 2: $(x_1=1, x_2=0)$, 4 examples; $(x_1=0, x_2=1)$, 4 examples; $(x_1=0, x_2=0)$, 4 examples

Testing data: $(x_1=1, x_2=1)$. Class 1 or Class 2?

How about Naïve Bayes? Assume the features are independent within each class:

$$P(x|C_i) = P(x_1|C_i)\, P(x_2|C_i)$$

From the training data:

$$P(C_1) = \frac{1}{13}, \quad P(x_1=1|C_1) = 1, \quad P(x_2=1|C_1) = 1$$
$$P(C_2) = \frac{12}{13}, \quad P(x_1=1|C_2) = \frac{1}{3}, \quad P(x_2=1|C_2) = \frac{1}{3}$$

For the testing example:

$$P(C_1|x) = \frac{P(x|C_1)\,P(C_1)}{P(x|C_1)\,P(C_1) + P(x|C_2)\,P(C_2)} = \frac{1 \times 1 \times \frac{1}{13}}{1 \times 1 \times \frac{1}{13} + \frac{1}{3} \times \frac{1}{3} \times \frac{12}{13}} < 0.5$$

So the generative (Naïve Bayes) model assigns the testing example to Class 2, even though it is identical to the only Class 1 training example.
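A tiny Python check of this calculation, using exact fractions so the counts above carry through unchanged:

```python
from fractions import Fraction as F

# Class priors and per-feature likelihoods estimated from the 13 training examples
p_c1, p_c2 = F(1, 13), F(12, 13)
p_x1_c1, p_x2_c1 = F(1), F(1)        # both features are 1 for the single class-1 example
p_x1_c2, p_x2_c2 = F(1, 3), F(1, 3)  # 4 of the 12 class-2 examples have each feature = 1

# Naive Bayes: P(x|Ci) = P(x1|Ci) * P(x2|Ci), for the testing example x = (1, 1)
num = p_x1_c1 * p_x2_c1 * p_c1
den = num + p_x1_c2 * p_x2_c2 * p_c2
posterior_c1 = num / den

print(posterior_c1, float(posterior_c1))  # 3/7 ~ 0.43 < 0.5, so Class 2
```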
Generative v.s. Discriminative

• Benefit of generative model
  • With the assumption of a probability distribution, less training data is needed
  • With the assumption of a probability distribution, it is more robust to noise
  • Priors and class-dependent probabilities can be estimated from different sources
[Bishop, P209-210]
Multi-class Classification (3 classes as example)

$$C_1: w^1, b^1, \quad z_1 = w^1 \cdot x + b^1$$
$$C_2: w^2, b^2, \quad z_2 = w^2 \cdot x + b^2$$
$$C_3: w^3, b^3, \quad z_3 = w^3 \cdot x + b^3$$

Softmax:

$$y_i = \frac{e^{z_i}}{\sum_{j=1}^{3} e^{z_j}}$$

The outputs can be interpreted as probabilities: $1 > y_i > 0$, $\sum_i y_i = 1$, and $y_i = P(C_i|x)$.

Example: for $z = (3, 1, -3)$, $e^z \approx (20, 2.7, 0.05)$, so the softmax outputs are approximately $(0.88, 0.12, \approx 0)$.
Multi-class Classification (3 classes as example)

$$z_1 = w^1 \cdot x + b^1, \quad z_2 = w^2 \cdot x + b^2, \quad z_3 = w^3 \cdot x + b^3$$

The softmax outputs $y = (y_1, y_2, y_3)$ are compared with the target $\hat{y} = (\hat{y}_1, \hat{y}_2, \hat{y}_3)$ using the cross entropy

$$-\sum_{i=1}^{3} \hat{y}_i \ln y_i$$

Targets:
  If $x \in$ class 1: $\hat{y} = [1, 0, 0]^T$
  If $x \in$ class 2: $\hat{y} = [0, 1, 0]^T$
  If $x \in$ class 3: $\hat{y} = [0, 0, 1]^T$
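A compact NumPy sketch of softmax followed by cross entropy, reproducing the numeric example above; the one-hot target is an assumed example.

```python
import numpy as np

def softmax(z):
    # y_i = exp(z_i) / sum_j exp(z_j); subtracting max(z) improves numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y_hat, y):
    # -sum_i y_hat_i ln(y_i), with y_hat a one-hot target
    return -np.sum(y_hat * np.log(y))

z = np.array([3.0, 1.0, -3.0])           # the example logits from the slide
y = softmax(z)
print(np.round(y, 2))                     # ~[0.88, 0.12, 0.00]

y_hat = np.array([1.0, 0.0, 0.0])         # assumed target: x belongs to class 1
print(cross_entropy(y_hat, y))            # loss is small when y_1 is close to 1
```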
Limitation of Logistic Regression

$$z = w_1 x_1 + w_2 x_2 + b, \qquad y = \sigma(z)$$

Output Class 1 if $y \ge 0.5$, Class 2 if $y < 0.5$.

Can we classify the following data (XOR-like labels)?

Input features $(x_1, x_2)$ and label:
  (0, 0): Class 2
  (0, 1): Class 1
  (1, 0): Class 1
  (1, 1): Class 2

[Figure: the four points in the $x_1$-$x_2$ plane; the two Class 1 points and the two Class 2 points sit on opposite diagonals, so no single line splits them into the $y \ge 0.5$ and $y < 0.5$ regions.]
Limitation of Logistic Regression

• No, we can't. The boundary of a single logistic regression unit $y = \sigma(w_1 x_1 + w_2 x_2 + b)$ is a straight line, and no straight line separates the two classes above.

[Figure: the same four points with candidate linear boundaries, none of which separates the classes.]
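A small NumPy experiment illustrating the point: running the Step 3 update on these four points leaves a single logistic unit unable to classify them all (the learning rate and iteration count are arbitrary choices).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_hat = np.array([0, 1, 1, 0])            # XOR-like labels: 1 = Class 1, 0 = Class 2

w, b, eta = np.zeros(2), 0.0, 0.5
for _ in range(5000):                      # gradient descent on the cross-entropy loss
    f = sigmoid(X @ w + b)
    w -= eta * (-(y_hat - f) @ X)
    b -= eta * (-np.sum(y_hat - f))

pred = sigmoid(X @ w + b)
print(np.round(pred, 2))                   # outputs stay near 0.5
print(((pred >= 0.5) == y_hat).mean())     # only half correct; no line gets all four right
```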
Limitation of Logistic Regression

• Feature Transformation

$$x_1' = \text{distance to } \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \qquad x_2' = \text{distance to } \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

Transform $(x_1, x_2)$ into $(x_1', x_2')$:
  (0, 0): Class 2, transformed to $(0, \sqrt{2})$
  (0, 1): Class 1, transformed to $(1, 1)$
  (1, 0): Class 1, transformed to $(1, 1)$
  (1, 1): Class 2, transformed to $(\sqrt{2}, 0)$

In the transformed space the two classes are linearly separable. However, it is not always easy to find a good transformation by hand.
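A short NumPy sketch of this transformation followed by a single logistic unit; the final weights are hand-picked to show separability in the transformed space, not values from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_hat = np.array([0, 1, 1, 0])             # 1 = Class 1, 0 = Class 2

# Feature transformation: x1' = distance to (0,0), x2' = distance to (1,1)
x1p = np.linalg.norm(X - np.array([0.0, 0.0]), axis=1)
x2p = np.linalg.norm(X - np.array([1.0, 1.0]), axis=1)
X_prime = np.stack([x1p, x2p], axis=1)
print(np.round(X_prime, 2))                # (0,1.41), (1,1), (1,1), (1.41,0)

# One logistic unit now suffices; these weights are hand-picked (an assumption)
w, b = np.array([10.0, 10.0]), -17.0       # outputs Class 1 when x1' + x2' > 1.7
pred = sigmoid(X_prime @ w + b)
print((pred >= 0.5).astype(int))           # [0, 1, 1, 0], matching y_hat
```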
Limitation of Logistic Regression

• Cascading logistic regression models

[Figure: $x_1$ and $x_2$ feed two logistic units that produce $x_1'$ and $x_2'$ (feature transformation); a third logistic unit takes $(x_1', x_2')$ and produces the final output $y$ (classification). Bias terms are omitted in the figure.]

With suitable weights, the first two units map the four input points to new coordinates: the values of $x_1'$ at the corners are 0.73, 0.27, 0.27, 0.05, and the values of $x_2'$ are 0.05, 0.27, 0.27, 0.73. In the $(x_1', x_2')$ plane the data collapses to three points, $(0.73, 0.05)$, $(0.27, 0.27)$ and $(0.05, 0.73)$, with both Class 2 corners landing on $(0.27, 0.27)$. These are linearly separable, so the final logistic unit (with weights $w_1$, $w_2$) can classify them correctly.
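A minimal NumPy sketch of such a cascade; the first-layer weights are hand-picked so the transformed values come out near the 0.73 / 0.27 / 0.05 numbers in the figure (an assumption, since the slides do not give the weights).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_hat = np.array([0, 1, 1, 0])     # 1 = Class 1, 0 = Class 2

# First layer: two logistic units performing the feature transformation.
# Hand-picked weights (an assumption), chosen to roughly reproduce the figure's values.
W1 = np.array([[-2.0,  2.0],
               [ 2.0, -2.0]])
b1 = np.array([-1.0, -1.0])
X_prime = sigmoid(X @ W1.T + b1)
print(np.round(X_prime, 2))        # rows: (0.27,0.27), (0.73,0.05), (0.05,0.73), (0.27,0.27)

# Second layer: one logistic unit classifies in the transformed space.
w2 = np.array([10.0, 10.0])
b2 = -7.0
y = sigmoid(X_prime @ w2 + b2)
print((y >= 0.5).astype(int))      # matches y_hat: the cascade solves the problem
```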
Deep Learning!

[Figure: many logistic units $\sigma(z)$ connected together, the output of one feeding the input of the next. Each unit is a "neuron", and the whole structure is a neural network.]
Reference
• Bishop: Chapter 4.3
