
Logistic Regression

Aarti Singh

Machine Learning 10-315


Feb 2, 2022
Announcements

• Recitation Friday Feb 4


– MLE, MAP examples, Convexity, Optimization

Discriminative Classifiers
Bayes Classifier: predict \arg\max_Y P(X \mid Y)\, P(Y), i.e. model P(X|Y) and P(Y).

Why not learn P(Y|X) directly? Or better yet, why not learn the
decision boundary directly?
• Assume some functional form for P(Y|X) or for the
decision boundary
• Estimate parameters of functional form directly from
training data
Today we will see one such classifier – Logistic Regression
Logistic Regression (not really regression)

Assumes the following functional form for P(Y|X):

P(Y = 1 \mid X) = \frac{1}{1 + \exp(-(w_0 + \sum_i w_i X_i))}

Logistic function applied to a linear function of the data.

Logistic function (or sigmoid): s(z) = \frac{1}{1 + \exp(-z)}
[Figure: plot of the logistic function s(z) vs. z]

Features can be discrete or continuous!

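To make this functional form concrete, here is a minimal NumPy sketch (my own illustration, not from the lecture; the weight values and the feature vector below are made up):

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: s(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def p_y1_given_x(x, w0, w):
    # P(Y = 1 | X = x) = s(w0 + sum_i w_i x_i): sigmoid of a linear function of x
    return sigmoid(w0 + np.dot(w, x))

# Made-up parameters and a single example with d = 2 features
w0 = -1.0
w = np.array([2.0, -0.5])
x = np.array([1.0, 3.0])
print(p_y1_given_x(x, w0, w))   # probability that Y = 1 for this x (about 0.38)
```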
Logistic Regression is a Linear Classifier!
Assumes the following functional form for P(Y|X):
P(Y = 1 \mid X) = \frac{1}{1 + \exp(-(w_0 + \sum_i w_i X_i))}

Decision boundary: predict Y = 1 if P(Y = 1 \mid X) \ge P(Y = 0 \mid X), i.e. if w_0 + \sum_i w_i X_i \ge 0, and predict Y = 0 otherwise.

(Linear Decision Boundary)


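To see why the boundary is linear, a short standard derivation consistent with the formula above: the log-odds are

\log \frac{P(Y = 1 \mid X)}{P(Y = 0 \mid X)} = w_0 + \sum_i w_i X_i

so we predict Y = 1 exactly when w_0 + \sum_i w_i X_i \ge 0, and the set where both classes are equally likely (probability 1/2) is the hyperplane w_0 + \sum_i w_i X_i = 0, i.e. a linear decision boundary.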
Training Logistic Regression
How to learn the parameters w0, w1, … wd? (d features)
Training data: \{(x^j, y^j)\}_{j=1}^{n}

Maximum Likelihood Estimate: \hat{w}_{MLE} = \arg\max_w \prod_j P(x^j, y^j \mid w)

But there is a problem …


Don’t have a model for P(X) or P(X|Y) – only for P(Y|X)

Training Logistic Regression
How to learn the parameters w0, w1, … wd? (d features)
Training data: \{(x^j, y^j)\}_{j=1}^{n}

Maximum (Conditional) Likelihood Estimate: \hat{w}_{MCLE} = \arg\max_w \prod_j P(y^j \mid x^j, w)

Discriminative philosophy – Don’t waste effort learning P(X),


focus on P(Y|X) – that’s all that matters for classification!

Expressing Conditional log Likelihood
P(Y = 1 \mid X, w) = \frac{1}{1 + \exp(-(w_0 + \sum_i w_i X_i))}

P(Y = 0 \mid X, w) = \frac{1}{1 + \exp(w_0 + \sum_i w_i X_i)}

Good news: l(w) is a concave function of w! (QnA 2)

Bad news: no closed-form solution to maximize l(w)


Good news: can use iterative optimization methods (gradient ascent)
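Written out for reference (a standard form that follows from the two probabilities above), with training data \{(x^j, y^j)\}_{j=1}^{n} and labels y^j \in \{0, 1\}:

l(w) = \sum_j \ln P(y^j \mid x^j, w) = \sum_j \Big[ y^j \big(w_0 + \sum_i w_i x_i^j\big) - \ln\big(1 + \exp(w_0 + \sum_i w_i x_i^j)\big) \Big]

Each summand is an affine function of w minus a convex softplus of an affine function of w, which is why l(w) is concave.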
Iteratively optimizing concave function
• Conditional likelihood for Logistic Regression is concave
• Maximum of a concave function can be reached by
Gradient Ascent Algorithm
Initialize: Pick w at random

Gradient: \nabla_w l(w) = \Big[\frac{\partial l(w)}{\partial w_0}, \dots, \frac{\partial l(w)}{\partial w_d}\Big]

Update rule: w^{(t+1)} \leftarrow w^{(t)} + \eta \, \nabla_w l(w) \big|_{w = w^{(t)}},  with learning rate η > 0

Ø Poll: Effect of step-size η?
Effect of step-size η

Large η => Fast convergence but larger residual error; also possible oscillations

Small η => Slow convergence but small residual error

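A tiny made-up numerical illustration of this trade-off (not from the lecture): gradient ascent on the concave toy function f(w) = -(w - 3)^2, whose maximum is at w = 3, with one large and one small step size.

```python
def toy_gradient_ascent(eta, w=0.0, steps=8):
    # Maximize f(w) = -(w - 3)^2; its gradient is f'(w) = -2 * (w - 3)
    trace = [round(w, 3)]
    for _ in range(steps):
        w = w + eta * (-2.0 * (w - 3.0))   # gradient ascent update with step size eta
        trace.append(round(w, 3))
    return trace

print(toy_gradient_ascent(eta=0.9))    # large eta: overshoots and oscillates around w = 3
print(toy_gradient_ascent(eta=0.05))   # small eta: steady but slow progress towards w = 3
```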
Gradient Ascent for M(C)LE

Gradient ascent rule for w0:

\frac{\partial l(w)}{\partial w_0} = \sum_j \Big[ y^j - \frac{\exp(w_0 + \sum_i w_i x_i^j)}{1 + \exp(w_0 + \sum_i w_i x_i^j)} \Big] = \sum_j \Big[ y^j - \hat{P}(Y = 1 \mid x^j, w) \Big]
Gradient Ascent for M(C)LE
Logistic Regression
Gradient ascent algorithm: iterate until change < ε

  w_0^{(t+1)} \leftarrow w_0^{(t)} + \eta \sum_j \big[ y^j - \hat{P}(Y = 1 \mid x^j, w^{(t)}) \big]

  For i = 1, …, d,

  w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \sum_j x_i^j \big[ y^j - \hat{P}(Y = 1 \mid x^j, w^{(t)}) \big]

  repeat

Here \hat{P}(Y = 1 \mid x^j, w^{(t)}) is what the current weights predict the label Y should be (a code sketch follows below).
• Gradient ascent is the simplest of optimization approaches
  – e.g. Stochastic gradient ascent, Momentum methods, Newton method, Conjugate gradient ascent, IRLS (see Bishop 4.3.3)
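A minimal NumPy sketch of this batch gradient ascent loop (my own illustration under the assumptions above: labels y^j ∈ {0, 1}; the step size, tolerance, and toy data are made-up values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_mcle(X, y, eta=0.01, tol=1e-6, max_iters=10000):
    """Batch gradient ascent on the conditional log likelihood l(w).
    X: (n, d) feature matrix, y: (n,) array of 0/1 labels."""
    n, d = X.shape
    w0, w = 0.0, np.zeros(d)
    for _ in range(max_iters):
        p = sigmoid(w0 + X @ w)        # P_hat(Y = 1 | x^j, w): what the current weights predict
        err = y - p                    # y^j - P_hat(Y = 1 | x^j, w)
        grad_w0 = err.sum()            # dl/dw0  = sum_j [y^j - P_hat(Y = 1 | x^j, w)]
        grad_w = X.T @ err             # dl/dw_i = sum_j x_i^j [y^j - P_hat(Y = 1 | x^j, w)]
        w0_new, w_new = w0 + eta * grad_w0, w + eta * grad_w
        change = abs(w0_new - w0) + np.abs(w_new - w).sum()
        w0, w = w0_new, w_new
        if change < tol:               # iterate until change < epsilon
            break
    return w0, w

# Made-up toy data: the label depends (noisily) on the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)
print(train_logistic_mcle(X, y))
```

A stochastic gradient ascent variant would use one (or a few) training examples per update instead of the full sum over j.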
That’s M(C)LE. How about M(C)AP?

• Define priors on w
  – Common assumption: Normal distribution, zero mean, identity covariance (a zero-mean Gaussian prior)
  – “Pushes” parameters towards zero

Logistic function (or sigmoid): s(z) = \frac{1}{1 + \exp(-z)}
[Figure: plot of the logistic function s(z) vs. z]

Ø What happens if we scale z by a large constant?


Logistic function (or sigmoid): s(z) = \frac{1}{1 + \exp(-z)}
[Figure: plot of the logistic function s(z) vs. z]

Ø Poll: What happens if we scale z (equivalently the weights w) by a large constant?

A) The logistic decision boundary shifts towards class 1
B) The logistic decision boundary remains the same
C) The logistic classifier tries to separate the data perfectly
D) The logistic classifier allows more mixing of labels on each side of the decision boundary

That’s M(C)LE. How about M(C)AP?

• M(C)AP estimate, with a zero-mean Gaussian prior p(w) on the weights:

  \hat{w}_{MCAP} = \arg\max_w \Big[ \ln p(w) + \sum_j \ln P(y^j \mid x^j, w) \Big] = \arg\max_w \Big[ l(w) - \frac{\lambda}{2} \sum_i w_i^2 \Big]

  where λ is the inverse prior variance (λ = 1 for an identity covariance).

Still a concave objective! Penalizes large weights.
M(C)AP – Gradient

• Gradient of the M(C)AP objective (with the zero-mean Gaussian prior):

  \frac{\partial}{\partial w_i} \Big[ l(w) - \frac{\lambda}{2} \sum_{i'} w_{i'}^2 \Big] = \sum_j x_i^j \big[ y^j - \hat{P}(Y = 1 \mid x^j, w) \big] - \lambda w_i

  Same as before, plus an extra term that penalizes large weights.

Penalization = Regularization
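Continuing the M(C)LE sketch from a few slides back, the M(C)AP gradient only adds the -λw term; a hedged illustration (the regularization strength lam is a made-up parameter, and I follow the common convention of leaving the intercept w0 unpenalized, which the slides do not specify):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mcap_gradients(X, y, w0, w, lam):
    """Gradient of the M(C)AP objective l(w) - (lam / 2) * ||w||^2:
    the M(C)LE gradient plus an extra -lam * w term from the
    zero-mean Gaussian prior, which penalizes (regularizes) large weights."""
    p = sigmoid(w0 + X @ w)
    err = y - p
    grad_w0 = err.sum()               # intercept left unpenalized (common convention)
    grad_w = X.T @ err - lam * w      # extra term: derivative of -(lam / 2) * sum_i w_i^2
    return grad_w0, grad_w
```

The gradient ascent update rule is unchanged; a larger lam corresponds to a tighter prior and shrinks the learned weights more.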
M(C)LE vs. M(C)AP
• Maximum conditional likelihood estimate:  \hat{w}_{MCLE} = \arg\max_w \sum_j \ln P(y^j \mid x^j, w)

• Maximum conditional a posteriori estimate:  \hat{w}_{MCAP} = \arg\max_w \Big[ \ln p(w) + \sum_j \ln P(y^j \mid x^j, w) \Big]
Logistic Regression for more than 2 classes

• Logistic regression in the more general case, where Y ∈ {y_1, …, y_K}:

  P(Y = y_k \mid X) = \frac{\exp(w_{k0} + \sum_i w_{ki} X_i)}{1 + \sum_{k'=1}^{K-1} \exp(w_{k'0} + \sum_i w_{k'i} X_i)}    for k < K

  P(Y = y_K \mid X) = \frac{1}{1 + \sum_{k'=1}^{K-1} \exp(w_{k'0} + \sum_i w_{k'i} X_i)}    for k = K (normalization, so no weights for this class)

  Predict:  \hat{y} = \arg\max_k P(Y = y_k \mid X)

Is the decision boundary still linear?
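A small NumPy sketch of this parameterization (my own illustration; the parameter values are made up): each of the first K - 1 classes gets its own weight vector, class K acts as the reference class with no weights, and we predict the class with the largest P(Y = y_k | X).

```python
import numpy as np

def multiclass_probs(x, W0, W):
    """P(Y = y_k | x) for k = 1..K, with K-1 weight vectors and class K
    acting as the reference (normalization) class with no weights.
    W0: (K-1,) intercepts, W: (K-1, d) weight matrix, x: (d,) features."""
    scores = W0 + W @ x                          # w_k0 + sum_i w_ki x_i  for k < K
    scores = np.append(scores, 0.0)              # implicit score of 0 for class K
    exp_scores = np.exp(scores - scores.max())   # shift by the max for numerical stability
    return exp_scores / exp_scores.sum()         # normalize so the K probabilities sum to 1

def predict(x, W0, W):
    return np.argmax(multiclass_probs(x, W0, W))   # 0-based index of the predicted class

# Made-up example: K = 3 classes, d = 2 features
W0 = np.array([0.5, -0.2])
W = np.array([[1.0, -1.0],
              [0.3,  0.8]])
x = np.array([2.0, 1.0])
print(multiclass_probs(x, W0, W), predict(x, W0, W))
```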


Comparison with Gaussian Naïve Bayes

Gaussian Naïve Bayes vs. Logistic
Regression
[Figure: the set of Gaussian Naïve Bayes parameters (with feature variance independent of class label) compared with the set of Logistic Regression parameters]

• Representation equivalence (both yield linear decision boundaries)
  – But only in a special case!!! (GNB with class-independent variances)
  – LR makes no assumptions about P(X|Y) in learning!!!
  – They optimize different objective functions (MLE/MCLE vs. MAP/MCAP), so they obtain different solutions
Experimental Comparison (Ng-Jordan’01)
UCI Machine Learning Repository: 15 datasets (8 with continuous features, 7 with discrete features)

[Figure: test error vs. training-set size curves comparing Naïve Bayes and Logistic Regression on several datasets; more in the paper]
Gaussian Naïve Bayes vs. Logistic
Regression
Both GNB and LR have a similar number, O(d), of parameters.

• GNB error converges faster with an increasing number of samples, as its parameter estimates are not coupled;

however,

• GNB has higher large-sample error if the conditional independence assumption DOES NOT hold.

GNB outperforms LR if the conditional independence assumption holds.
What you should know
• LR is a linear classifier
• LR optimized by maximizing conditional likelihood or
conditional posterior
– no closed-form solution
– concave → global optimum with gradient ascent
• Gaussian Naïve Bayes with class-independent variances
representationally equivalent to LR
– Solution differs because of objective (loss) function
• In general, NB and LR make different assumptions
– NB: Features independent given class → assumption on P(X|Y)
– LR: Functional form of P(Y|X), no assumption on P(X|Y)
• Convergence rates
– GNB (usually) needs less data
– LR (usually) gets to better solutions in the limit

