Naïve Bayes vs Perceptron
• Naïve Bayes predicts the class of an instance based on the probability of the instance belonging to each class
• It learns P(Y = y_k | X = x_i)
• The perceptron does not produce probability estimates
• It estimates θ from the training data
• For a new instance, it computes the sign of θᵀx_i
• Based on the sign, it assigns a class
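A minimal sketch (Python/NumPy, with made-up weights and a made-up instance) of the perceptron-style decision rule described above: no probability estimate, only the sign of θᵀx.

```python
import numpy as np

# Hypothetical learned weights and a new instance (first feature is the bias term).
theta = np.array([0.5, -1.2, 2.0])
x_new = np.array([1.0, 0.3, 0.8])

score = theta @ x_new              # real-valued score theta^T x
y_hat = 1 if score >= 0 else 0     # class assigned purely from the sign
print(score, y_hat)
```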
Logistic Regression
• Logistic regression takes a probabilistic approach to learning a classifier (a function)
• h_θ(x) should give p(y = 1 | x; θ)
• We want 0 ≤ h_θ(x) ≤ 1
• Logistic regression model:
  h_θ(x) = g(θᵀx),   where   g(z) = 1 / (1 + e^(−z))
  h_θ(x) = 1 / (1 + e^(−θᵀx)) = e^(θᵀx) / (1 + e^(θᵀx))
• The sigmoid first computes a real-valued score and then squashes it between 0 and 1 so that it can be interpreted as a probability
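A minimal sketch of the logistic model above, assuming NumPy; the values of θ and x are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)): squashes a real score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x), read as p(y = 1 | x; theta)."""
    return sigmoid(theta @ x)

theta = np.array([-1.0, 2.0])      # illustrative parameters
x = np.array([1.0, 0.75])          # bias feature plus one input feature
print(h(theta, x))                 # a value strictly between 0 and 1
```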
Interpreting Hypothesis Output
• h_θ(x) = estimated p(y = 1 | x; θ) = 1 / (1 + e^(−θᵀx)) = e^(θᵀx) / (1 + e^(θᵀx))
• Note: p(y = 0 | x; θ) + p(y = 1 | x; θ) = 1
• So, p(y = 0 | x; θ) = 1 − p(y = 1 | x; θ) = 1 − 1 / (1 + e^(−θᵀx)) = 1 / (1 + e^(θᵀx))
• The log-odds (logit) of the model:
  log [ p(y = 1 | x; θ) / p(y = 0 | x; θ) ] = log e^(θᵀx) = θᵀx
• Thus, if θᵀx > 0, the positive class is more probable
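A small numeric check of the two identities above (θ and x are made-up values): the class probabilities sum to 1 and the log-odds recover θᵀx.

```python
import numpy as np

theta = np.array([0.2, -0.7, 1.5])
x = np.array([1.0, 2.0, 1.0])

z = theta @ x
p1 = 1.0 / (1.0 + np.exp(-z))      # p(y = 1 | x; theta)
p0 = 1.0 / (1.0 + np.exp(z))       # p(y = 0 | x; theta)

print(p0 + p1)                     # 1.0
print(np.log(p1 / p0), z)          # log-odds equals theta^T x
```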
Logistic Regression
• h_θ(x) = g(θᵀx),   g(z) = 1 / (1 + e^(−z))
• θᵀx should be a large negative value for negative instances
• θᵀx should be a large positive value for positive instances
• Assume a threshold and predict
  • y = 1 if h_θ(x) ≥ 0.5  (equivalently θᵀx ≥ 0)
  • y = 0 if h_θ(x) < 0.5  (equivalently θᵀx < 0)
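A sketch (illustrative θ and instances) showing that thresholding h_θ(x) at 0.5 gives exactly the same predictions as thresholding θᵀx at 0.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-0.5, 1.0])
X = np.array([[1.0, -2.0],
              [1.0,  0.4],
              [1.0,  3.0]])                          # rows are instances (bias feature first)

scores = X @ theta
pred_via_h = (sigmoid(scores) >= 0.5).astype(int)    # h_theta(x) >= 0.5
pred_via_z = (scores >= 0).astype(int)               # theta^T x >= 0
print(pred_via_h, pred_via_z)                        # identical class assignments
```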
Non-linear Decision Boundary
• Can apply basis function expansion to features
x = (1, x₁, x₂)ᵀ  →  (1, x₁, x₂, x₁x₂, x₁², x₂², x₁²x₂, x₁x₂², …)ᵀ
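A sketch of this basis expansion as a hand-rolled feature map; the function name `expand` and the test values are illustrative.

```python
import numpy as np

def expand(x1, x2):
    """Map (1, x1, x2) to the higher-order features listed on the slide."""
    return np.array([1.0, x1, x2,
                     x1 * x2,
                     x1**2, x2**2,
                     x1**2 * x2, x1 * x2**2])

print(expand(2.0, -1.0))
# Running logistic regression on these expanded features can produce a
# non-linear decision boundary in the original (x1, x2) space.
```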
Logistic Regression Cost Function
• We should not use the squared loss as in linear regression:
  J(θ) = (1 / 2n) Σᵢ₌₁ⁿ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
• With the logistic regression model h_θ(x) = 1 / (1 + e^(−θᵀx)), this leads to a non-convex cost function
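A numeric sketch of the non-convexity claim, using a single made-up training point: along a 1-D slice of θ, the second finite difference of the squared loss changes sign.

```python
import numpy as np

x, y = 1.0, 1.0                               # one toy training instance
thetas = np.linspace(-6, 6, 241)              # a 1-D grid of parameter values
h = 1.0 / (1.0 + np.exp(-thetas * x))         # sigmoid predictions
J = 0.5 * (h - y) ** 2                        # squared loss at each theta

curv = np.diff(J, 2)                          # discrete second derivative of J
print(curv.min() < 0 < curv.max())            # True: curvature changes sign, so J is not convex
```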
Finding the Cost Function via MLE
• The likelihood of the data is given by L(θ) = ∏ᵢ₌₁ⁿ p(y⁽ⁱ⁾ | x⁽ⁱ⁾; θ)
• We want the θ that maximizes the likelihood:
  θ_MLE = argmax_θ L(θ) = argmax_θ ∏ᵢ₌₁ⁿ p(y⁽ⁱ⁾ | x⁽ⁱ⁾; θ)
  θ_MLE = argmax_θ log L(θ) = argmax_θ log ∏ᵢ₌₁ⁿ p(y⁽ⁱ⁾ | x⁽ⁱ⁾; θ) = argmax_θ Σᵢ₌₁ⁿ log p(y⁽ⁱ⁾ | x⁽ⁱ⁾; θ)
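A small aside on why working with the log is convenient (the per-instance probabilities below are random stand-ins for p(y⁽ⁱ⁾ | x⁽ⁱ⁾; θ)): the raw product underflows in floating point, while the sum of logs stays finite, and both are maximized by the same θ.

```python
import numpy as np

# Pretend these are p(y_i | x_i; theta) for 5000 training instances.
p = np.random.default_rng(0).uniform(0.6, 0.9, size=5000)

print(np.prod(p))          # 0.0 -- the product underflows
print(np.sum(np.log(p)))   # a finite, usable log-likelihood value
```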
Finding the Cost Function via MLE
• Each label y⁽ⁱ⁾ is binary, taking the value 1 with probability h_θ(x⁽ⁱ⁾)
• Assume Bernoulli likelihood
p(y | X, θ) = ∏ᵢ₌₁ⁿ p(y⁽ⁱ⁾ | x⁽ⁱ⁾; θ) = ∏ᵢ₌₁ⁿ [h_θ(x⁽ⁱ⁾)]^(y⁽ⁱ⁾) [1 − h_θ(x⁽ⁱ⁾)]^(1 − y⁽ⁱ⁾)
• The log-likelihood:
  ℓ(θ) = Σᵢ₌₁ⁿ [ y⁽ⁱ⁾ log h_θ(x⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − h_θ(x⁽ⁱ⁾)) ]
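A sketch of this Bernoulli log-likelihood in NumPy; the design matrix X, labels y, and θ below are made-up illustrative values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """l(theta) = sum_i [ y_i log h_i + (1 - y_i) log(1 - h_i) ]."""
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])   # rows: instances with a bias feature
y = np.array([1, 0, 1])
theta = np.array([0.1, 1.2])
print(log_likelihood(theta, X, y))
```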
The Cost Function
• Maximizing ℓ(θ) is equivalent to minimizing the negative log-likelihood (NLL):
  J(θ) = NLL(θ) = − Σᵢ₌₁ⁿ [ y⁽ⁱ⁾ log h_θ(x⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − h_θ(x⁽ⁱ⁾)) ]
• Cost of a single instance:
  cost(h_θ(x), y) = −log h_θ(x)        if y = 1
  cost(h_θ(x), y) = −log(1 − h_θ(x))   if y = 0
• The objective function:
  J(θ) = Σᵢ₌₁ⁿ cost(h_θ(x⁽ⁱ⁾), y⁽ⁱ⁾)
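A sketch of the objective J(θ) built from the per-instance cost above (same kind of illustrative toy data as before).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_one(h, y):
    """-log(h) if y = 1, -log(1 - h) if y = 0."""
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

def J(theta, X, y):
    h = sigmoid(X @ theta)
    return sum(cost_one(h_i, y_i) for h_i, y_i in zip(h, y))

X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1, 0, 1])
print(J(np.array([0.1, 1.2]), X, y))
```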
Intuition
• cost(h_θ(x), y) = −log h_θ(x) if y = 1;   −log(1 − h_θ(x)) if y = 0
• If y = 1
  • cost = 0 for a confident correct prediction (h_θ(x) = 1)
  • As h_θ(x) → 0, cost → ∞
  • Mistakes should get large penalties
  • e.g., predicting h_θ(x) = 0 when y = 1
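A tiny numeric illustration of this behaviour (hand-picked probabilities): when y = 1, the cost −log h_θ(x) is near 0 for confident correct predictions and grows without bound as h_θ(x) → 0.

```python
import numpy as np

for h in [0.99, 0.9, 0.5, 0.1, 1e-6]:
    print(h, -np.log(h))    # cost when the true label is y = 1
```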
Intuition
• cost(h_θ(x), y) = −log h_θ(x) if y = 1;   −log(1 − h_θ(x)) if y = 0
• If y = 0
  • cost = 0 for a confident correct prediction (h_θ(x) = 0)
  • As (1 − h_θ(x)) → 0, cost → ∞
  • Mistakes should get large penalties
  • e.g., predicting h_θ(x) = 1 when y = 0
MAP formulation
Regularized Logistic Regression
• J(θ) = − Σᵢ₌₁ⁿ [ y⁽ⁱ⁾ log h_θ(x⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − h_θ(x⁽ⁱ⁾)) ]
• We can regularize logistic regression as:
  J_reg(θ) = J(θ) + λ Σⱼ₌₁ᵈ θⱼ² = J(θ) + λ ‖θ‖₂²
  J_reg(θ) = − Σᵢ₌₁ⁿ [ y⁽ⁱ⁾ log h_θ(x⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − h_θ(x⁽ⁱ⁾)) ] + λ Σⱼ₌₁ᵈ θⱼ²
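A sketch of the regularized objective; the toy data and λ are illustrative, and leaving the bias θ₀ unpenalized is an assumption consistent with the sum running over j = 1..d.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def J_reg(theta, X, y, lam):
    h = sigmoid(X @ theta)
    nll = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
    return nll + lam * np.sum(theta[1:] ** 2)   # L2 penalty on theta_1 .. theta_d only

X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1, 0, 1])
print(J_reg(np.array([0.1, 1.2]), X, y, lam=0.1))
```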
Estimating the Parameter