Machine Learning: Probabilistic View of Linear Regression, Logistic Regression, Hyperplane Based Classifiers and Perceptron
Probabilistic View of Linear Regression

Maximum Likelihood Estimation

Given i.i.d. data x_1, ..., x_N drawn from a distribution P(x | θ), maximum likelihood estimation picks the parameter θ under which the observed data is most probable.
Diversion: Some Probability
If X and Y are independent random variables, the joint distribution factorizes:
\[
P(X, Y) = P(X)\, P(Y)
\]
Maximum Likelihood Estimation (contd...)

\[
\theta^{*}_{\text{MLE}} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \sum_{n=1}^{N} \log P(x_n \mid \theta)
\]
Maximum Likelihood Estimation (contd...)

Example: Suppose X_n is a binary random variable following a Bernoulli distribution, i.e. P(x | θ) = θ^x (1 − θ)^{1−x}. Then
\[
L(\theta) = \sum_{n=1}^{N} \log P(x_n \mid \theta) = \sum_{n=1}^{N} \big[ x_n \log\theta + (1 - x_n)\log(1 - \theta) \big]
\]
\[
\frac{\partial L(\theta)}{\partial \theta}
= \frac{1}{\theta}\sum_{n=1}^{N} x_n - \frac{1}{1-\theta}\sum_{n=1}^{N} (1 - x_n)
= \frac{1}{\theta}\sum_{n=1}^{N} x_n - \frac{1}{1-\theta}\Big(N - \sum_{n=1}^{N} x_n\Big)
\]
Maximum Likelihood Estimation (contd...)

Setting the derivative to zero,
\[
\theta^{*}_{\text{MLE}} = \frac{\sum_{n=1}^{N} x_n}{N}
\]
[In a coin-tossing experiment, this is just the fraction of heads.]
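As a quick numerical check of this result, here is a minimal Python sketch (the toy data and variable names are illustrative, not from the lecture): it draws Bernoulli samples, computes the closed-form MLE as the sample mean, and compares it against a brute-force maximization of the log-likelihood over a grid.

```python
import numpy as np

# Minimal sketch: MLE for a Bernoulli parameter theta from coin-toss data.
# The closed form theta_MLE = mean(x) is compared against a direct
# numerical maximization of the log-likelihood.

def log_likelihood(theta, x):
    # L(theta) = sum_n [ x_n log(theta) + (1 - x_n) log(1 - theta) ]
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=1000)          # 1000 tosses of a biased coin

theta_mle = x.mean()                         # closed-form estimate: fraction of heads

# brute-force check on a grid of candidate theta values
grid = np.linspace(0.01, 0.99, 981)
theta_grid = grid[np.argmax([log_likelihood(t, x) for t in grid])]

print(theta_mle, theta_grid)                 # both should be close to 0.7
```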
Maximum a Posteriori Estimate
- P(θ | X): Posterior
- P(X | θ): Likelihood
- P(θ): Prior
- P(X): Evidence
Maximum a Posteriori Estimate (contd...)

Bayes Rule:
\[
P(\theta \mid X) = \frac{P(X \mid \theta)\, P(\theta)}{P(X)}
\]
Maximum a Posteriori Estimate (contd...)

MAP Estimate:
\[
\theta^{*}_{\text{MAP}} = \arg\max_{\theta} P(\theta \mid X)
\]
Linear Regression: Probabilistic Setting

Assume each target is generated with Gaussian noise around a linear function of the input:
\[
P(y \mid x, w) = \mathcal{N}(y \mid w^T x, \sigma^2)
= \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(y - w^T x)^2}{2\sigma^2} \right)
\]
Linear Regression: ML Estimation

Log Likelihood:
\[
\log P(\mathbf{y} \mid \mathbf{X}, w) = \sum_{n=1}^{N} \log \mathcal{N}(y_n \mid w^T x_n, \sigma^2)
= -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=1}^{N} (y_n - w^T x_n)^2
\]
Linear Regression: ML Estimation (contd...)

\[
w^{*}_{\text{MLE}} = \arg\max_{w} \; -\frac{1}{2\sigma^2}\sum_{n=1}^{N} (y_n - w^T x_n)^2
= \arg\min_{w} \; \frac{1}{2\sigma^2}\sum_{n=1}^{N} (y_n - w^T x_n)^2
\]
So maximizing the likelihood under Gaussian noise is equivalent to minimizing the squared error.
Linear Regression: MAP Estimate

- Prior on the weights: P(w) = \mathcal{N}(w \mid 0, \lambda^{-1} I), i.e. mean 0 and covariance \lambda^{-1} I.
- For a multivariate Gaussian,
\[
\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^{D}\,|\Sigma|}} \exp\!\left( -\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \right)
\]
so the prior becomes
\[
P(w) = \frac{1}{(2\pi)^{D/2}\,\big(\tfrac{1}{\lambda}\big)^{D/2}} \exp\!\left( -\frac{\lambda}{2}\, w^T w \right)
\]
Linear Regression : MAP Estimate (contd. . . )
P (D|w)P (w)
log(w|D) = log
P (D)
= log P (w) + log P (w|D) − log P (D)
∗
WM AP = arg max log P (w|D)
w
= arg max log P (w) + log P (D|w) + log P (D)
w
= arg max log P (w) + log P (D|w)
w
17
Linear Regression: MAP Estimate (contd...)

\[
w^{*}_{\text{MAP}} = \arg\max_{w} \log P(w \mid \mathcal{D})
= \arg\max_{w} \left[ -\frac{D}{2}\log 2\pi - \frac{\lambda}{2}\, w^T w
+ \sum_{n=1}^{N} \left( -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(y_n - w^T x_n)^2}{2\sigma^2} \right) \right]
\]
\[
= \arg\min_{w} \left[ \frac{1}{2\sigma^2}\sum_{n=1}^{N} (y_n - w^T x_n)^2 + \frac{\lambda}{2}\, w^T w \right]
\]
MLE vs MAP

MLE maximizes the likelihood alone; MAP additionally weights the likelihood by a prior. With a Gaussian prior on w, the MAP estimate for linear regression is exactly L2-regularized (ridge) least squares.
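A small sketch of this comparison for linear regression, under the Gaussian likelihood and Gaussian prior used above (the data-generating values and λ are illustrative choices, not from the slides): the MLE weights come from ordinary least squares, the MAP weights from the ridge-regularized normal equations.

```python
import numpy as np

# Minimal sketch: closed-form MLE vs MAP weights for linear regression with
# Gaussian noise and a N(0, (1/lambda) I) prior on w.
# MLE: w = (X^T X)^{-1} X^T y
# MAP: w = (X^T X + lambda * sigma^2 * I)^{-1} X^T y   (ridge regression)

rng = np.random.default_rng(0)
N, D = 50, 5
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
sigma = 0.5
y = X @ w_true + sigma * rng.normal(size=N)

lam = 1.0                                    # prior precision (illustrative value)
w_mle = np.linalg.solve(X.T @ X, X.T @ y)
w_map = np.linalg.solve(X.T @ X + lam * sigma**2 * np.eye(D), X.T @ y)

print("MLE:", w_mle)
print("MAP:", w_map)                         # shrunk towards zero by the prior
```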
Logistic Regression

Problem Set-Up

- Given inputs x_n with binary labels y_n ∈ {0, 1}, we want to predict
\[
P(y_n = 1 \mid x_n, w) = \mu_n, \qquad P(y_n = 0 \mid x_n, w) = 1 - \mu_n
\]
The Logistic Regression Model

\[
\mu_n = f(x_n) = \sigma(w^T x_n) = \frac{1}{1 + \exp(-w^T x_n)} = \frac{\exp(w^T x_n)}{1 + \exp(w^T x_n)}
\]
[Figure: the sigmoid curve, µ plotted against w^T x.]
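A tiny numerical check (values are illustrative) that the two expressions for σ above agree:

```python
import numpy as np

# Minimal sketch: the two forms of the sigmoid,
# sigma(z) = 1/(1+exp(-z)) = exp(z)/(1+exp(z)), evaluated on a few points.

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])

form_a = 1.0 / (1.0 + np.exp(-z))
form_b = np.exp(z) / (1.0 + np.exp(z))

print(form_a)
print(np.allclose(form_a, form_b))   # True: the two expressions are equal
```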
The Decision Boundary

The decision boundary is the set of points where w^T x = 0.

- Rather than the squared loss, logistic regression uses the log loss:
\[
\ell(y_n, f(x_n)) =
\begin{cases}
-\log(\mu_n) & \text{when } y_n = 1 \\
-\log(1 - \mu_n) & \text{when } y_n = 0
\end{cases}
=
\begin{cases}
-\log P(y_n = 1 \mid x_n, w) & \text{when } y_n = 1 \\
-\log P(y_n = 0 \mid x_n, w) & \text{when } y_n = 0
\end{cases}
\]
Cross Entropy Loss

- Adding an L2 regularizer:
\[
L(w) = -\sum_{n=1}^{N} \big[ y_n w^T x_n - \log(1 + \exp(w^T x_n)) \big] + \lambda \lVert w \rVert^2
\]
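A short sketch (toy data and names are illustrative) evaluating this regularized loss in both its explicit cross-entropy form and the simplified form above, to confirm they coincide:

```python
import numpy as np

# Minimal sketch: the L2-regularized cross-entropy loss written two equivalent
# ways, to check the algebraic simplification used in the slides.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_explicit(w, X, y, lam):
    mu = sigmoid(X @ w)
    return -np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu)) + lam * w @ w

def loss_simplified(w, X, y, lam):
    z = X @ w
    return -np.sum(y * z - np.log(1 + np.exp(z))) + lam * w @ w

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20)
w = rng.normal(size=3)

print(loss_explicit(w, X, y, 0.1), loss_simplified(w, X, y, 0.1))  # should match
```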
Logistic Regression: MLE Formulation

\[
\log P(Y \mid X, w) = \sum_{n=1}^{N} \big[ y_n \log \mu_n + (1 - y_n)\log(1 - \mu_n) \big]
\]
With \mu_n = \frac{\exp(w^T x_n)}{1 + \exp(w^T x_n)}, this simplifies to
\[
L(w) = \sum_{n=1}^{N} \big[ y_n w^T x_n - \log(1 + \exp(w^T x_n)) \big]
\]
Logistic Regression: MAP Estimate

- We have
\[
w^{*}_{\text{MAP}} = \arg\max_{w} \log P(w \mid \mathcal{D})
= \arg\max_{w} \big[ \log P(\mathcal{D} \mid w) + \log P(w) \big]
\]
Optimization is the Key

Gradient descent methods are used in:
- Linear Regression
- Logistic Regression
- Neural Networks
Example of an Objective

Multivariate Functions

[Surface plots of example objectives, e.g. f(x, y) = cos²(x) + y² and f(x, y) = cos²(x) + cos²(y).]
Partial Derivatives

For a function f(x, y), the partial derivative ∂f/∂x is the derivative with respect to x with y held fixed (and similarly for ∂f/∂y); the gradient ∇f collects them into the vector (∂f/∂x, ∂f/∂y). A small worked sketch follows.
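To make this concrete, here is a small sketch (step size and starting point are illustrative) that computes the partial derivatives of the example objective f(x, y) = cos²(x) + y², checks them against finite differences, and takes a few gradient descent steps:

```python
import numpy as np

# Minimal sketch: partial derivatives of f(x, y) = cos^2(x) + y^2, checked
# against finite differences, followed by a few gradient descent steps.

def f(x, y):
    return np.cos(x) ** 2 + y ** 2

def grad_f(x, y):
    return np.array([-2.0 * np.cos(x) * np.sin(x),   # df/dx
                     2.0 * y])                        # df/dy

# finite-difference check of the analytic gradient at (1.0, 0.5)
eps = 1e-6
x0, y0 = 1.0, 0.5
num = np.array([(f(x0 + eps, y0) - f(x0 - eps, y0)) / (2 * eps),
                (f(x0, y0 + eps) - f(x0, y0 - eps)) / (2 * eps)])
print(grad_f(x0, y0), num)        # the two should agree closely

# a few gradient descent steps from (1.0, 0.5)
p = np.array([x0, y0])
for _ in range(100):
    p = p - 0.1 * grad_f(*p)
print(p, f(*p))                   # moves towards a minimiser of f
```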
Idea of Gradient Descent Algorithm

[Credits for all the images in this section go to Michailidis and Maiden.]
Gradient Descent
Gradient Descent (contd...)

In the update w^{(t+1)} = w^{(t)} - \eta \sum_{n=1}^{N} (\mu_n^{(t)} - y_n)\, x_n:
- \mu_n^{(t)} = \sigma\big((w^{(t)})^T x_n\big)
- \eta is the learning rate.
Gradient Descent: Some Remarks
Multiclass Logistic or Softmax Regression

- Class probabilities:
\[
P(y_n = k \mid x_n, W) = \frac{\exp(w_k^T x_n)}{\sum_{l=1}^{K} \exp(w_l^T x_n)}
\]
- W = [w_1 \; w_2 \; \dots \; w_K] \in \mathbb{R}^{D \times K}, one weight vector per class.
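A minimal sketch of these class probabilities (shapes and data are illustrative; the max-subtraction is a standard numerical-stability trick, not something from the slides):

```python
import numpy as np

# Minimal sketch: softmax-regression class probabilities for a batch of inputs.
# W has one weight column per class (D x K).

def softmax_probs(X, W):
    scores = X @ W                                  # (N, K) class scores w_k^T x_n
    scores -= scores.max(axis=1, keepdims=True)     # stabilise exp()
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                          # N=4 examples, D=3 features
W = rng.normal(size=(3, 5))                          # K=5 classes

P = softmax_probs(X, W)
print(P)                     # each row sums to 1
print(P.argmax(axis=1))      # predicted class per example
```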
Hyperplane Based Classifiers and Perceptron

Linear as Optimization

Linear Regression: Problem Set-Up
Logistic Regression (contd...)

Problem: Find a function f such that
\[
\mu_n = f(x_n)
\]
Model:
\[
\mu_n = f(x_n) = \sigma(w^T x_n) = \frac{1}{1 + \exp(-w^T x_n)} = \frac{\exp(w^T x_n)}{1 + \exp(w^T x_n)}
\]

Logistic Regression (contd...)

[Figure: the sigmoid function.]
Logistic Regression (contd...)

\[
\ell(y_n, f(x_n)) =
\begin{cases}
-\log(\mu_n) & \text{when } y_n = 1 \\
-\log(1 - \mu_n) & \text{when } y_n = 0
\end{cases}
\]
\[
= -y_n \log(\mu_n) - (1 - y_n)\log(1 - \mu_n)
= -y_n \log(\sigma(w^T x_n)) - (1 - y_n)\log\big(1 - \sigma(w^T x_n)\big)
\]
Summing over the training set, the gradient of the total loss is
\[
\frac{\partial L}{\partial w} = \sum_{n=1}^{N} (\mu_n - y_n)\, x_n
\]
which gives the gradient descent update
\[
\underbrace{w^{(t+1)}}_{\text{new weights}}
= \underbrace{w^{(t)}}_{\text{previous value}}
- \underbrace{\eta}_{\text{learning rate}}\;
\underbrace{\sum_{n=1}^{N} \big(\mu_n^{(t)} - y_n\big)\, x_n}_{\text{gradient at previous value}}
\]
- Note: Here \mu_n^{(t)} = \sigma\big((w^{(t)})^T x_n\big)
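Putting the update rule together, a minimal batch gradient descent sketch for logistic regression (toy data; the learning rate and iteration count are illustrative choices):

```python
import numpy as np

# Minimal sketch: batch gradient descent for logistic regression using the
# update w <- w - eta * sum_n (mu_n - y_n) x_n.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N, D = 200, 2
X = rng.normal(size=(N, D))
w_true = np.array([2.0, -1.0])
y = (sigmoid(X @ w_true) > rng.uniform(size=N)).astype(float)  # noisy binary labels

w = np.zeros(D)
eta = 0.01                    # learning rate (illustrative value)
for t in range(500):
    mu = sigmoid(X @ w)       # mu_n^(t) = sigma((w^(t))^T x_n)
    grad = X.T @ (mu - y)     # sum_n (mu_n - y_n) x_n
    w = w - eta * grad

print(w)                      # should be roughly aligned with w_true
```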
Stochastic Gradient Descent

- Instead of the full-batch gradient, each step uses the gradient from a single (randomly chosen) example: w^{(t+1)} = w^{(t)} - \eta\,(\mu_n^{(t)} - y_n)\, x_n.
- Also: replace the predicted label probability \mu_n^{(t)} by the predicted binary label \hat{y}_n^{(t)}, where
\[
\hat{y}_n^{(t)} =
\begin{cases}
1 & \text{if } \mu_n^{(t)} > 0.5 \text{ or } (w^{(t)})^T x_n > 0 \\
0 & \text{if } \mu_n^{(t)} < 0.5 \text{ or } (w^{(t)})^T x_n < 0
\end{cases}
\]
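A corresponding stochastic variant (again a sketch with illustrative toy data), updating with one randomly picked example per step:

```python
import numpy as np

# Minimal sketch: stochastic gradient descent for logistic regression,
# updating with one randomly chosen example per step.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N, D = 200, 2
X = rng.normal(size=(N, D))
w_true = np.array([2.0, -1.0])
y = (sigmoid(X @ w_true) > rng.uniform(size=N)).astype(float)

w = np.zeros(D)
eta = 0.05
for t in range(5000):
    n = rng.integers(N)              # pick one example at random
    mu_n = sigmoid(X[n] @ w)
    w = w - eta * (mu_n - y[n]) * X[n]

print(w)
```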
Hyperplanes

A hyperplane through the origin: w^T x = 0.

- Adding a bias b ∈ ℝ gives w^T x + b = 0; b > 0 moves the hyperplane parallel to itself along w, and b < 0 moves it in the opposite direction.
Hyperplane Based Classification

- Classification rule:
\[
y = \mathrm{sign}(w^T x + b)
\]
\[
w^T x + b > 0 \implies y = +1, \qquad w^T x + b < 0 \implies y = -1
\]
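A tiny sketch of the classification rule (w, b, and the test points are made-up values):

```python
import numpy as np

# Minimal sketch: hyperplane classification rule y = sign(w^T x + b)
# on a couple of hand-picked points.

w = np.array([1.0, -2.0])
b = 0.5

def classify(x):
    return 1 if w @ x + b > 0 else -1

print(classify(np.array([3.0, 1.0])))    # w^T x + b = 1.5 > 0  -> +1
print(classify(np.array([0.0, 1.0])))    # w^T x + b = -1.5 < 0 -> -1
```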
The Perceptron Algorithm (Rosenblatt, 1958)

Perceptron Algorithm

- If y_n (w^T x_n + b) ≤ 0 [or sign(w^T x_n + b) ≠ y_n, i.e. a mistake is made], update:
  - w_new = w_old + y_n x_n
  - b_new = b_old + y_n
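A minimal sketch of the full mistake-driven training loop on separable toy data (the data generation and epoch count are illustrative choices):

```python
import numpy as np

# Minimal sketch: the perceptron mistake-driven update on linearly separable
# toy data with labels in {-1, +1}.

rng = np.random.default_rng(0)
N, D = 100, 2
X = rng.normal(size=(N, D))
w_true, b_true = np.array([1.5, -1.0]), 0.3
y = np.where(X @ w_true + b_true > 0, 1, -1)

w, b = np.zeros(D), 0.0
for epoch in range(20):
    mistakes = 0
    for n in range(N):
        if y[n] * (X[n] @ w + b) <= 0:         # mistake made on example n
            w = w + y[n] * X[n]                # w_new = w_old + y_n x_n
            b = b + y[n]                       # b_new = b_old + y_n
            mistakes += 1
    if mistakes == 0:                          # converged: no mistakes in a full pass
        break

print(w, b, "mistake-free after epoch", epoch)
```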
Perceptron Convergence Theorem (Block and Novikoff)

If the data is linearly separable with margin γ and every point lies within radius R of the origin, the perceptron makes at most (R/γ)² mistakes before converging.

What if the data is not linearly separable?

What did we learn?