Machine Learning: Probabilistic View of Linear Regression, Logistic Regression, Hyperplane-Based Classifiers and Perceptron

The document discusses machine learning topics including linear regression, logistic regression, and hyperplane-based classifiers. It begins by introducing probabilistic linear regression, describing it using a maximum likelihood estimation framework. It then discusses regularized linear regression using a maximum a posteriori estimation approach, introducing a prior over the model parameters to encourage simpler models.

Machine Learning by ambedkar@IISc

I Probabilistic view of linear regression

I Logistic regression

I Hyperplane based classifiers and perceptron

Topics

Probabilistic View of Linear Regression

Logistic Regression

Hyperplane based classifiers and Perceptron
Probabilistic View of Linear Regression

Maximum Likelihood Estimation

I Let X = x1, x2, ..., xN, where xn ∈ R^d, be some data that is generated from xn ∼ P(x|θ).
I Recall: In the statistical approach to machine learning, we assume that there is an underlying probability distribution from which the data is sampled.
I Hence θ denotes the parameters of the distribution.
I For example, xn ∼ N(x|µ, σ). That is, θ = (µ, σ).
I Assumption: The data in X is generated i.i.d. (independent and identically distributed). This is a very important assumption and we will see it very often.
I Aim: Learn θ given the data X = x1, x2, ..., xN.
Diversion: Some Probability

I We say two random variables X, Y are identical if their probability distributions are the same.
I Two Gaussian random variables are identical if and only if their means and variances (covariance matrices) are the same.
I We say two random variables X, Y are independent if

  P(X, Y) = P(X)P(Y)
Maximum Likelihood Estimation (contd...)

I Given X = x1, x2, ..., xN, with xn ∼ P(x|θ).
I Learn P so that the likelihood that x1, x2, ..., xN are sampled from P is maximum.
I Equivalently, learn or estimate θ so that the likelihood that x1, x2, ..., xN are sampled from P is maximum.
I By the i.i.d. assumption,

  P(X|θ) = P(x1, x2, ..., xN|θ) = ∏_{n=1}^N P(xn|θ)

I P(X|θ) is the likelihood.
Maximum Likelihood Estimation (contd...)

How do we estimate θ given the data X?

Find the value of θ that makes the observed data most probable.

Find θ that maximizes the log-likelihood function

  L(θ) = log P(X|θ) = Σ_{n=1}^N log P(xn|θ)
Maximum Likelihood Estimation (contd...)

  θ_MLE = arg max_θ L(θ) = arg max_θ Σ_{n=1}^N log P(xn|θ)
Maximum Likelihood Estimation (contd...)

Example:
Suppose xn is a binary random variable that follows a Bernoulli distribution, i.e. P(x|θ) = θ^x (1 − θ)^(1−x). Then

  L(θ) = Σ_{n=1}^N log P(xn|θ) = Σ_{n=1}^N [ xn log θ + (1 − xn) log(1 − θ) ]

  ∂L(θ)/∂θ = (1/θ) Σ_{n=1}^N xn − (1/(1 − θ)) Σ_{n=1}^N (1 − xn)
           = (1/θ) Σ_{n=1}^N xn − (1/(1 − θ)) (N − Σ_{n=1}^N xn)
Maximum Likelihood Estimation (contd...)

Setting this derivative to zero,

  θ*_MLE = (Σ_{n=1}^N xn) / N

[In a coin-tossing experiment, this is just the fraction of heads.]
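As a quick illustration (not part of the original slides), here is a minimal NumPy sketch of this result on a made-up coin-flip sample; it also checks the closed form against a brute-force grid search over the log-likelihood:

```python
import numpy as np

# Hypothetical coin-flip data: 1 = heads, 0 = tails.
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# Closed-form MLE for the Bernoulli parameter: the fraction of heads.
theta_mle = x.mean()

# Sanity check: maximize the log-likelihood L(theta) on a grid.
thetas = np.linspace(0.01, 0.99, 99)
log_lik = x.sum() * np.log(thetas) + (len(x) - x.sum()) * np.log(1 - thetas)
theta_grid = thetas[np.argmax(log_lik)]

print(theta_mle, theta_grid)  # both are (close to) the fraction of heads, here 0.7
```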
Maximum a Posteriori Estimate

I We will place a prior on the parameter θ, i.e. P(θ).
  θ is no longer a mere number; it is a random variable.
I One can encode prior knowledge about θ.
I The prior acts as a regularizer (we will see this later).
I Bayes rule:

  P(θ|X) = P(X|θ) P(θ) / P(X)

  P(θ|X): posterior
  P(X|θ): likelihood
  P(θ): prior
  P(X): evidence
Maximum a Posteriori Estimate (contd...)

Bayes rule:

  P(θ|X) = P(X|θ) P(θ) / P(X)
Maximum a Posteriori Estimate (contd...)

MAP Estimate

  θ_MAP = arg max_θ log P(θ|X)
        = arg max_θ [ log P(X|θ) + log P(θ) ]
        = arg max_θ [ Σ_{n=1}^N log P(xn|θ) + log P(θ) ]

Note: When P(θ) is a uniform distribution, MAP reduces to MLE.
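The following sketch (an illustration with an assumed Beta(2, 2) prior, not from the slides) continues the coin-flip example: it maximizes log-likelihood + log-prior on a grid; with a flat prior the MAP estimate would coincide with the MLE, as the note above says.

```python
import numpy as np

# Same hypothetical coin-flip data as in the MLE sketch.
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

thetas = np.linspace(0.01, 0.99, 99)
log_lik = x.sum() * np.log(thetas) + (len(x) - x.sum()) * np.log(1 - thetas)

# Assumed prior: Beta(2, 2), which gently prefers theta near 0.5.
a, b = 2.0, 2.0
log_prior = (a - 1) * np.log(thetas) + (b - 1) * np.log(1 - thetas)

theta_mle = thetas[np.argmax(log_lik)]
theta_map = thetas[np.argmax(log_lik + log_prior)]
print(theta_mle, theta_map)  # the MAP estimate is pulled slightly toward 0.5 by the prior
```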
Linear Regression : Probabilistic Setting

I Each response is generated by a linear model plus Gaussian noise:

  y = w^T x + ε

I That is, we are given N i.i.d. training samples {(xn, yn)}_{n=1}^N with xn ∈ R^d and yn ∈ R.
I εn ∼ N(0, σ²)
I yn ∼ N(w^T xn, σ²)

  ⟹ P(y|x, w) = N(y|w^T x, σ²) = (1/(σ√(2π))) exp( −(y − w^T x)² / (2σ²) )
Linear Regression : ML Estimation

Log Likelihood

  log L(w) = log P(D|w) = log P(y|X, w)
           = log ∏_{n=1}^N P(yn|xn, w)
           = Σ_{n=1}^N log P(yn|xn, w)
           = Σ_{n=1}^N [ −(1/2) log(2πσ²) − (yn − w^T xn)² / (2σ²) ]
Linear Regression : ML Estimation (contd...)

  w*_MLE = arg max_w −(1/(2σ²)) Σ_{n=1}^N (yn − w^T xn)²
         = arg min_w (1/(2σ²)) Σ_{n=1}^N (yn − w^T xn)²

i.e. ML estimation under Gaussian noise ≡ the least squares objective for regression.
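A minimal sketch of this equivalence, assuming simulated data and NumPy's least squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (assumed for illustration): y_n = w_true . x_n + Gaussian noise.
N, d = 200, 3
w_true = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(N, d))
y = X @ w_true + rng.normal(scale=0.1, size=N)

# MLE under Gaussian noise = ordinary least squares.
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_mle)  # close to w_true
```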
Linear Regression : MAP Estimate

I Here we introduce a prior on the parameter w.
  ⇒ This will lead to regularization of the model.
I Remember: in MAP we treat the parameters as random variables.
I P(w) = N(w | 0, λ^{-1} I)   (mean 0, covariance λ^{-1} I)
I The multivariate Gaussian is

  N(x; µ, Σ) = (1/√((2π)^D |Σ|)) exp( −(1/2)(x − µ)^T Σ^{-1} (x − µ) )

  so that

  P(w) = (1/((2π)^{D/2} (1/λ)^{D/2})) exp( −(λ/2) w^T w )
Linear Regression : MAP Estimate (contd...)

I Log posterior probability:

  log P(w|D) = log [ P(D|w) P(w) / P(D) ]
             = log P(w) + log P(D|w) − log P(D)

  w_MAP = arg max_w log P(w|D)
        = arg max_w [ log P(w) + log P(D|w) − log P(D) ]
        = arg max_w [ log P(w) + log P(D|w) ]    (P(D) does not depend on w)
Linear Regression : MAP Estimate (contd...)

  w_MAP = arg max_w log P(w|D)
        = arg max_w [ −(D/2) log 2π − (λ/2) w^T w
                      + Σ_{n=1}^N ( −(1/2) log(2πσ²) − (yn − w^T xn)² / (2σ²) ) ]
        = arg min_w [ (1/(2σ²)) Σ_{n=1}^N (yn − w^T xn)² + (λ/2) w^T w ]

The MAP estimate under Gaussian noise and a Gaussian prior ≡ the least squares objective with L2 regularization.
MLE vs MAP

MAP estimate shrinks the estimate of w towards the prior.
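A small sketch of that shrinkage effect (simulated data; the ridge parameter lam is an assumed value standing in for λσ² from the derivation above):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
w_true = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(N, d))
y = X @ w_true + rng.normal(scale=0.5, size=N)

# MLE (least squares) vs. MAP with a zero-mean Gaussian prior (ridge regression).
# lam stands in for lambda * sigma^2 from the derivation above (an assumed value).
lam = 10.0
w_mle = np.linalg.solve(X.T @ X, X.T @ y)
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(np.linalg.norm(w_mle), np.linalg.norm(w_map))  # the MAP weights are shrunk toward 0
```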

Logistic Regression
Problem Set Up

I Two-class classification.
I Instead of the exact labels, estimate the probabilities of the labels.
I i.e. predict

  P(yn = 1|xn, w) = µn
  P(yn = 0|xn, w) = 1 − µn
The Logistic Regression Model

  µn = f(xn) = σ(w^T xn) = 1/(1 + exp(−w^T xn)) = exp(w^T xn)/(1 + exp(w^T xn))

I Here σ is the sigmoid or logistic function.
I The model first computes a real-valued score

  w^T x = Σ_{i=1}^d wi xi

  and non-linearly squashes it into (0, 1) to turn it into a probability.

[Figure: the sigmoid µ = σ(w^T x), an S-shaped curve from 0 to 1 as a function of the score w^T x.]
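A tiny sketch of the squashing step, using some hypothetical score values w^T x:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z)) = exp(z) / (1 + exp(z))
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])  # hypothetical values of the score w.x
probs = sigmoid(scores)
print(probs)        # roughly [0.007, 0.269, 0.5, 0.731, 0.993], all in (0, 1)
print(probs > 0.5)  # predicts y = 1 exactly when the score w.x is positive
```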
The Decision Boundary

If w^T x > 0 ⟹ P(yn = 1|xn, w) > P(yn = 0|xn, w)
If w^T x < 0 ⟹ P(yn = 1|xn, w) < P(yn = 0|xn, w)

w^T x = 0 defines the decision boundary.

[Figure: logistic regression decision boundary.]
Loss Function Optimization

I Squared loss:

  ℓ(yn, f(xn)) = (yn − f(xn))² = (yn − σ(w^T xn))²

I This is non-convex in w and not easy to optimize.
I Cross-entropy loss:

  ℓ(yn, f(xn)) = −log(µn)       when yn = 1
               = −log(1 − µn)   when yn = 0

  that is,

               = −log P(yn = 1|xn, w)   when yn = 1
               = −log P(yn = 0|xn, w)   when yn = 0
Cross Entropy loss

  ℓ(yn, f(xn)) = −yn log(µn) − (1 − yn) log(1 − µn)
               = −yn log(σ(w^T xn)) − (1 − yn) log(1 − σ(w^T xn))

I Cross-entropy loss over the entire data:

  L(w) = Σ_{n=1}^N ℓ(yn, f(xn))
       = Σ_{n=1}^N [ −yn log(µn) − (1 − yn) log(1 − µn) ]
       = −Σ_{n=1}^N [ yn w^T xn − log(1 + exp(w^T xn)) ]
Cross Entropy loss

I Adding an L2 regularizer:

  L(w) = −Σ_{n=1}^N [ yn w^T xn − log(1 + exp(w^T xn)) ] + λ‖w‖²
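A minimal sketch of evaluating this regularized loss (the data, weights, and λ below are made up for illustration; log(1 + exp(s)) is computed with np.logaddexp for numerical stability):

```python
import numpy as np

def regularized_cross_entropy(w, X, y, lam):
    """L(w) = -sum_n [ y_n w.x_n - log(1 + exp(w.x_n)) ] + lam * ||w||^2."""
    scores = X @ w
    # np.logaddexp(0, s) = log(1 + exp(s)), computed in a numerically stable way.
    nll = -np.sum(y * scores - np.logaddexp(0.0, scores))
    return nll + lam * np.dot(w, w)

# Made-up data, weights, and regularization strength.
X = np.array([[1.0, 2.0], [-1.0, 0.5], [2.0, -1.0]])
y = np.array([1, 0, 1])
print(regularized_cross_entropy(np.zeros(2), X, y, lam=0.1))  # = 3 * log(2) at w = 0
```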
Logistic Regression: MLE formulation

I AIM: Learn w from the data so that we can predict the probability of xn belonging to class 1 or 0.
I Log likelihood: Given D = {(x1, y1), ..., (xN, yN)},

  log L(w) = log P(D|w)
           = log P(Y|X, w)
           = log ∏_{n=1}^N P(yn|xn, w)
           = log ∏_{n=1}^N µn^{yn} (1 − µn)^{1−yn}

I since each yn is a Bernoulli random variable:

  P(yn = 1|xn, w) = µn
  P(yn = 0|xn, w) = 1 − µn
Logistic Regression: MLE formulation (contd...)

  log L(w) = Σ_{n=1}^N [ yn log µn + (1 − yn) log(1 − µn) ]

We have µn = exp(w^T xn) / (1 + exp(w^T xn)), so

  ⟹ log L(w) = Σ_{n=1}^N [ yn w^T xn − log(1 + exp(w^T xn)) ]

Maximizing this is the same as minimizing the cross-entropy loss.
Logistic Regression: MAP estimate

I MAP → We assume a prior distribution on the weight vector w, that is,

  P(w) = N(w | 0, λ^{-1} I)
       = (1/((2π)^{D/2} (1/λ)^{D/2})) exp( −(λ/2) w^T w )

I Note: The multivariate Gaussian is defined as

  N(x | µ, Σ) = (1/√((2π)^D |Σ|)) exp( −(1/2)(x − µ)^T Σ^{-1} (x − µ) )

I Then the MAP estimate is

  w_MAP = arg max_w log P(w|D)
Logistic Regression: MAP estimate (contd...)

I We have

  w_MAP = arg max_w log P(w|D)
        = arg max_w [ log P(D|w) + log P(w) ]
        = arg max_w [ −(D/2) log 2π − (λ/2) w^T w − Σ_{n=1}^N log(1 + exp(−yn w^T xn)) ]
        = arg min_w [ Σ_{n=1}^N log(1 + exp(−yn w^T xn)) + (λ/2) w^T w ]

  (here the likelihood is written for labels yn ∈ {−1, +1})

which is the same as minimizing the regularized cross-entropy loss.
Logistic Regression: Some Comments

I The objective function of logistic regression is the same as that of SVMs, except for the loss function:

  Logistic regression → log loss
  SVM → hinge loss

I Logistic regression can be extended to the multiclass case: just use the softmax function.

  P(Y = k|w, x) = exp(wk^T x) / Σ_{l=1}^K exp(wl^T x),   k = 1, 2, ..., K classes
Optimization is the Key

I Almost all problems in machine learning lead to optimization problems.
I The following two factors decide the fate of any method:
  I What kind of optimization problem we are led to
  I What optimization methods are available to us
I There are several methods available for optimization; among these, gradient descent methods are the most popular.
Gradient Descent Methods are Used in ...

I Linear Regression
I Logistic Regression
  I It is just classification, but instead of labels it gives us class probabilities.
I Support Vector Machines
I Neural Networks
  I The backbone of neural networks is the back-propagation algorithm.
Example of an objective

I Most often, we do not even have the functional form of the objective.
I Given x, we can only compute f(x).
I Sometimes this may involve simulating a system.
I Computing each f(x) can be time-consuming.
I This becomes even more difficult when x is a D-dimensional vector and D is very large.
Multivariate Functions

[Figure: example surfaces (a) f(x, y) = x² + y², (b) f(x, y) = −x² + y², (c) f(x, y) = cos²(x) + y², (d) f(x, y) = cos²(x) + cos²(y).]
Partial Derivatives

[Figure: (a) the surface f(x, y) = 9 − x² − y²; (b) the plane y = 1; (c) their intersection f(x, 1) = 8 − x² is a curve, and f′(x) = −2x is the derivative (slope) of that curve.]
Idea of Gradient Descent Algorithm

I Start at some random point (of course, the final result will depend on this).
I Take steps based on the gradient vector at the current position until convergence.
I The gradient vector gives the direction and rate of fastest increase at any point.
I At any point x, if the gradient is nonzero, then the direction of the gradient is the direction in which the function increases most quickly from x.
I The magnitude of the gradient is the rate of increase in that direction.
Idea of Gradient Descent Algorithm

(Credits for all the images in this section go to Michailidis and Maiden.)
Gradient Descent

I AIM: Minimize the function

  L(w) = −Σ_{n=1}^N [ yn w^T xn − log(1 + exp(w^T xn)) ]

I We do this by calculating the derivative of L with respect to w.
I Note: Since the log-likelihood is concave in w, L(w) is convex and has a unique minimum.
Gradient Descent

I AIM: Minimize the function

  L(w) = −Σ_{n=1}^N [ yn w^T xn − log(1 + exp(w^T xn)) ]

I Gradient:

  ∂L/∂w = −Σ_{n=1}^N [ yn xn − (exp(w^T xn) / (1 + exp(w^T xn))) xn ]
        = −Σ_{n=1}^N (yn − µn) xn = X^T (µ − y)

  where µ = (µ1, ..., µN)^T, y = (y1, ..., yN)^T, and X is the N × D matrix whose rows are x1^T, ..., xN^T.
Gradient Descent (contd...)

I Since there is no closed-form solution, we take recourse to iterative methods like gradient descent.
I Gradient Descent:
  1 Initialize w^(1) ∈ R^D randomly.
  2 Iterate until convergence:

    w^(t+1) = w^(t) − η Σ_{n=1}^N (µn^(t) − yn) xn

    (the sum is the gradient at the previous value)

I µn^(t) = σ(w^(t)T xn)
I η is the learning rate.
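A minimal sketch of this batch gradient descent loop for logistic regression (toy data and hyperparameters are assumed for illustration; no bias term or regularizer):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, eta=0.1, n_iters=1000):
    """Batch gradient descent for logistic regression (no bias, no regularizer)."""
    N, D = X.shape
    w = np.zeros(D)                # step 1: initialize (random init also works)
    for _ in range(n_iters):       # step 2: iterate until (approximate) convergence
        mu = sigmoid(X @ w)        # mu_n = sigma(w . x_n)
        grad = X.T @ (mu - y)      # sum_n (mu_n - y_n) x_n
        w = w - eta * grad
    return w

# Tiny separable toy data (assumed for illustration).
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, 0, 0])
w = logistic_gd(X, y)
print(sigmoid(X @ w))  # near 1 for the first two points, near 0 for the last two
```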
Gradient Descent (contd...)

I We have the following update:

  w^(t+1) = w^(t) − η Σ_{n=1}^N (µn^(t) − yn) xn

  (the sum is the gradient at the previous value)

I Note: Calculating the gradient in each iteration requires all the data. When N is large this may not be feasible.
I Stochastic Gradient Descent: Use mini-batches to compute the gradient.
Gradient Descent: Some Remarks

Note on the learning rate:

I Sometimes choosing the learning rate is difficult.
I Larger learning rate → too much fluctuation.
I Smaller learning rate → slow convergence.

To deal with this problem:

I Choose an optimal step size ηt at each iteration using line search.
I Add momentum to the update:

  w^(t+1) = w^(t) − η^(t) g^(t) + αt (w^(t) − w^(t−1))

I Use second-order methods like Newton's method to exploit the curvature of the loss function (but then we need to compute the Hessian matrix).
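A small sketch of the momentum update on an assumed quadratic objective (the objective and the η, α values are illustrative only):

```python
import numpy as np

def gd_momentum(grad_fn, w0, eta=0.1, alpha=0.9, n_iters=500):
    """Gradient descent with momentum:
    w_{t+1} = w_t - eta * g_t + alpha * (w_t - w_{t-1})."""
    w_prev = w0.copy()
    w = w0.copy()
    for _ in range(n_iters):
        g = grad_fn(w)
        w_next = w - eta * g + alpha * (w - w_prev)
        w_prev, w = w, w_next
    return w

# Assumed objective: a simple quadratic f(w) = 0.5 * w.(A w), with gradient A w.
A = np.diag([1.0, 10.0])
w_final = gd_momentum(lambda w: A @ w, w0=np.array([5.0, 5.0]))
print(w_final)  # close to the minimizer [0, 0]
```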
Multiclass Logistic or Softmax Regression

I Logistic regression can be extended to the multiclass case.
I Let yn ∈ {0, 1, ..., K − 1}.
I Define

  P(yn = k|xn, W) = exp(wk^T xn) / Σ_{l=1}^K exp(wl^T xn) = µnk

I µnk is the probability that the nth sample belongs to the kth class, and Σ_{l=1}^K µnl = 1.
I Softmax: the class k with the largest wk^T xn dominates the probability.
Multiclass Logistic or Softmax Regression

I P(yn = k|xn, W) = exp(wk^T xn) / Σ_{l=1}^K exp(wl^T xn)
I W = [w1 w2 ... wK]_{D×K}
I We can think of the yn as drawn from a multinoulli (categorical) distribution:

  P(y|X, W) = ∏_{n=1}^N ∏_{l=1}^K µnl^{ynl}   (likelihood function)

I where ynl = 1 if the true class of example n is l, and ynl = 0 for all other l.
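A short sketch of the softmax computation (shapes and data are assumed for illustration; the max-subtraction is a standard numerical-stability trick, not something the slides discuss):

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max is a standard stability trick."""
    z = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Assumed shapes: X is N x D, and W stacks one weight vector per class (D x K).
rng = np.random.default_rng(0)
N, D, K = 4, 3, 5
X = rng.normal(size=(N, D))
W = rng.normal(size=(D, K))

P = softmax(X @ W)        # P[n, k] = exp(w_k . x_n) / sum_l exp(w_l . x_n)
print(P.sum(axis=1))      # each row sums to 1
print(P.argmax(axis=1))   # predicted class: the k with the largest w_k . x_n
```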
Hyperplane Based Classifiers and Perceptron

Linear as Optimization

Supervised Learning Problem

I Given data {(xn, yn)}_{n=1}^N, find f : X → Y that best approximates the relation between X and Y.
I Determine f in such a way that the loss ℓ(y, f(x)) is minimum.
I f and ℓ are specific to the problem and the method that we choose.
Linear Regression

I Data: {(xn, yn)}_{n=1}^N
  I xn ∈ R^D is a D-dimensional input
  I yn ∈ R is the output
I The aim is to find a hyperplane that best fits these points.
I Here a hyperplane is the model of choice, i.e.,

  f(x) = Σ_{j=1}^D xj wj + b = w^T x + b

I Here w1, ..., wD and b are the model parameters.
I "Best" is determined by some loss function:

  Loss(w) = Σ_{n=1}^N [yn − f(xn)]²

I Aim: Determine the model parameters that minimize the loss.
Logistic Regression

Problem Set-Up

I Two-class classification.
I Instead of the exact labels, estimate the probabilities of the labels, i.e.

  predict P(yn = 1|xn, w) = µn
          P(yn = 0|xn, w) = 1 − µn

I Here (xn, yn) is an input-output pair.
Logistic Regression (contd...)

Problem
Find a function f such that

  µn = f(xn)

Model

  µn = f(xn) = σ(w^T xn) = 1/(1 + exp(−w^T xn)) = exp(w^T xn)/(1 + exp(w^T xn))
Logistic Regression (contd...)

Sigmoid Function

I Here σ(·) is the sigmoid function.
I The model first computes a real-valued score w^T x = Σ_{i=1}^D wi xi and then nonlinearly "squashes" it into (0, 1) to turn it into a probability.
Logistic Regression (contd...)

Loss Function: Here we use the cross-entropy loss instead of the squared loss.
The cross-entropy loss is defined as:

  L(yn, f(xn)) = −log(µn)      when yn = 1
               = −log(1 − µn)  when yn = 0
               = −yn log(µn) − (1 − yn) log(1 − µn)
               = −yn log(σ(w^T xn)) − (1 − yn) log(1 − σ(w^T xn))

And now the empirical risk is

  L(w) = −Σ_{n=1}^N [ yn w^T xn − log(1 + exp(w^T xn)) ]
Logistic Regression (contd...)

Taking the derivative with respect to w,

  ∂L/∂w = Σ_{n=1}^N (µn − yn) xn

I The gradient descent algorithm is:
  1 Initialize w^(1) ∈ R^D randomly.
  2 Iterate until convergence:

    w^(t+1) = w^(t) − η Σ_{n=1}^N (µn^(t) − yn) xn

    (new parameters = previous value − learning rate × gradient at the previous value)

I Note: Here µn^(t) = σ(w^(t)T xn).
Logistic Regression (contd...)

Let us take a look at the update equation again:

  w^(t+1) = w^(t) − η Σ_{n=1}^N (µn^(t) − yn) xn

  (new parameters = previous value − learning rate × gradient at the previous value)

What do we notice here?

Problem: Calculating the gradient in each iteration requires all the data. When N is large this may not be feasible.
Stochastic Gradient Descent

I Strategy: Approximate the gradient using a single randomly chosen data point (xn, yn):

  w^(t+1) = w^(t) − ηt (µn^(t) − yn) xn

I Also: Replace the predicted label probability µn^(t) by the predicted binary label ŷn^(t), where

  ŷn^(t) = 1 if µn^(t) > 0.5 (i.e. w^(t)T xn > 0)
         = 0 if µn^(t) < 0.5 (i.e. w^(t)T xn < 0)
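A minimal sketch of this single-sample update for logistic regression (toy data and learning rate are assumed for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_sgd(X, y, eta=0.1, n_epochs=50, seed=0):
    """Stochastic gradient descent: one randomly chosen (x_n, y_n) per update."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_epochs):
        for n in rng.permutation(N):
            mu_n = sigmoid(w @ X[n])              # predicted probability for sample n
            w = w - eta * (mu_n - y[n]) * X[n]    # single-sample gradient step
    return w

# Same toy data as the batch sketch (assumed for illustration).
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, 0, 0])
print(sigmoid(X @ logistic_sgd(X, y)))
```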
Stochastic Gradient Descent (contd...)

I Hence: The update rule becomes

  w^(t+1) = w^(t) − ηt (ŷn^(t) − yn) xn

I This is a mistake-driven update rule.
I w^(t) gets updated only when there is a misclassification, i.e. ŷn^(t) ≠ yn.
Stochastic Gradient Descent (contd...)

We will make one more simple change:

I Change the class labels to {−1, +1}.

  ⟹ ŷn^(t) − yn = −2yn if ŷn^(t) ≠ yn
                 = 0    if ŷn^(t) = yn

I Hence: Whenever there is a misclassification,

  w^(t+1) = w^(t) + 2η^(t) yn xn

I ⟹ This is the perceptron learning algorithm, which is a hyperplane-based learning algorithm.
Hyperplanes

I A hyperplane separates a d-dimensional space into two half-spaces (positive and negative).
I The equation of the hyperplane is

  w^T x = 0

I By adding a bias b ∈ R:

  w^T x + b = 0

  b > 0 moves the hyperplane parallel to itself along w; b < 0 moves it in the opposite direction.
Hyperplane based classification

I Classification rule:

  y = sign(w^T x + b)

  w^T x + b > 0 ⟹ y = +1
  w^T x + b < 0 ⟹ y = −1
The Perceptron Algorithm (Rosenblatt, 1958)

I The aim is to learn a linear hyperplane that separates the two classes.
I It is a mistake-driven online learning algorithm.
I It is guaranteed to find a separating hyperplane if the data is linearly separable.
Perceptron Algorithm

I Given training data D = {(x1, y1), ..., (xN, yN)}
I Initialize w_old = [0, ..., 0], b_old = 0
I Repeat until convergence:
  I Pick a random (xn, yn) ∈ D
  I If yn (w^T xn + b) ≤ 0
    [or sign(w^T xn + b) ≠ yn, i.e. a mistake is made]
    I w_new = w_old + yn xn
    I b_new = b_old + yn
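A minimal sketch of this perceptron loop (toy linearly separable data assumed; labels are in {−1, +1} as in the previous slides):

```python
import numpy as np

def perceptron(X, y, max_epochs=100, seed=0):
    """Perceptron for labels y in {-1, +1}; stops after a mistake-free epoch."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for n in rng.permutation(N):
            if y[n] * (w @ X[n] + b) <= 0:   # mistake (or exactly on the boundary)
                w = w + y[n] * X[n]
                b = b + y[n]
                mistakes += 1
        if mistakes == 0:                    # converged on linearly separable data
            break
    return w, b

# Linearly separable toy data (assumed for illustration).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(np.sign(X @ w + b))  # matches y
```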
Perceptron Convergence Theorem (Block and Novikoff)

"Roughly": If the data is linearly separable, the perceptron algorithm converges.
What if the data is not linearly separable?

Yes! In practice, most often the data is not linearly separable. Then:

I Make the data linearly separable using kernel methods.
I (Or) Use a multilayer perceptron.

What are these?

I The first leads to Support Vector Machines, which ruled machine learning for decades.
I The second leads to Deep Learning!
What did we learn?

I Maximum Likelihood Estimates

I Bayes again! MAP

I Probabilistic view of Linear and Logistic Regression

I Hyperplanes and Perceptrons

I The two very big paradigms in ML
