Probabilistic Learning and Generalized Linear Models (GLMs)

Probabilistic Learning

Probabilistic interpretation of Linear Regression


$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}, \qquad i = 1, \dots, m$$

where the error terms $\epsilon^{(i)}$ are i.i.d. Gaussian, $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$:

$$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\right)$$

Likelihood function:

$$L(\theta) = p(\vec{y} \mid X; \theta) = \prod_{i=1}^{m} p\left(y^{(i)} \mid x^{(i)}; \theta\right)$$

Maximizing the log-likelihood $\ell(\theta) = \log L(\theta)$ is equivalent to minimizing the least-squares cost:

$$\theta^* = \arg\min_{\theta} \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$
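As a sanity check, here is a short numerical sketch (not part of the original slides) showing that, under the Gaussian noise model above, the maximum likelihood estimate coincides with the ordinary least-squares solution; the data and parameter values are made up for illustration.

```python
import numpy as np

# Minimal sketch: for Gaussian noise, maximizing the log-likelihood over theta
# is the same as minimizing the squared-error cost, so the MLE coincides with
# the closed-form least-squares solution theta = (X^T X)^{-1} X^T y.
rng = np.random.default_rng(0)
m, n = 100, 3
X = rng.normal(size=(m, n))
theta_true = np.array([1.5, -2.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.3, size=m)  # y = theta^T x + eps

theta_mle = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_mle)  # close to theta_true
```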
Probabilistic interpretation of logistic regression

$$h_\theta(x) = P(y = 1 \mid x; \theta), \qquad 1 - h_\theta(x) = P(y = 0 \mid x; \theta)$$

$$p(y \mid x; \theta) = \left(h_\theta(x)\right)^{y} \left(1 - h_\theta(x)\right)^{1-y}$$

$$L(\theta) = \prod_{i=1}^{m} \left(h_\theta(x^{(i)})\right)^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}}$$

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) = -J(\theta)$$
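The negated log-likelihood above is the usual logistic regression cost $J(\theta)$. A minimal sketch (illustrative, not from the slides):

```python
import numpy as np

# Minimal sketch: the Bernoulli log-likelihood above, negated, is the
# cross-entropy cost J(theta) for logistic regression.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(theta, X, y):
    """J(theta) = -l(theta) for labels y in {0, 1}."""
    h = sigmoid(X @ theta)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```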


Generalized Linear Models
GLM
In the regression and classification examples, we had:

$$y \mid x; \theta \sim \mathcal{N}(\mu, \sigma^2) \qquad\qquad y \mid x; \theta \sim \text{Bernoulli}(\phi)$$

In this section, we will show that both of these methods are special cases of a broader family of
models, called Generalized Linear Models (GLMs).

We will also show how other models in the GLM family can be derived and applied to other
classification and regression problems.
The exponential family
The exponential family plays a crucial role in statistics and machine learning, for various reasons, including the following:

• The exponential family is the family of distributions with maximum entropy, subject to given moment constraints (i.e., it makes the fewest additional assumptions about the data).

• The exponential family is at the core of GLMs.

• The exponential family is at the core of variational inference.

• Under certain regularity conditions, the exponential family is the only family of distributions with finite-sized sufficient statistics.

• All members of the exponential family have a conjugate prior, which simplifies Bayesian inference of the parameters.
The exponential family
We say that a class of distributions is in the exponential family if it can be written in the
form:
$$p(y; \eta) = b(y) \exp\left(\eta^T T(y) - a(\eta)\right) \qquad (1)$$

$\eta$: natural parameter $\qquad$ $T(y)$: sufficient statistic $\qquad$ $a(\eta)$: log partition function

The quantity $e^{-a(\eta)}$ essentially plays the role of a normalization constant, making sure that the distribution $p(y; \eta)$ sums/integrates over $y$ to 1.

A fixed choice of $T$, $a$ and $b$ defines a family (or set) of distributions that is parameterized by $\eta$; as we vary $\eta$, we then get different distributions within this family.
Bernoulli distribution
The Bernoulli distribution with mean $\phi$, written Bernoulli($\phi$), specifies a distribution over $y \in \{0, 1\}$, so that $p(y = 1; \phi) = \phi$ and $p(y = 0; \phi) = 1 - \phi$.

There is a choice of $T$, $a$ and $b$ so that Equation (1) becomes exactly the class of Bernoulli distributions.
Bernoulli distribution
We write the Bernoulli distribution as:

$$p(y; \phi) = \phi^{y} (1 - \phi)^{1-y}$$
$$= \exp\left(y \log \phi + (1 - y) \log(1 - \phi)\right)$$
$$= \exp\left(\left(\log \frac{\phi}{1 - \phi}\right) y + \log(1 - \phi)\right)$$

Thus, the natural parameter is given by $\eta = \log(\phi / (1 - \phi))$.

If we invert this definition by solving for $\phi$ in terms of $\eta$, we obtain $\phi = 1/(1 + e^{-\eta})$, which is the familiar sigmoid function.


Bernoulli distribution
To complete the formulation of the Bernoulli distribution as an exponential family distribution, we
also have:
$$T(y) = y$$
$$a(\eta) = -\log(1 - \phi) = \log\left(1 + e^{\eta}\right)$$
$$b(y) = 1$$

This shows that the Bernoulli distribution can be written in the form of Equation (1), using an appropriate choice of $T$, $a$ and $b$.
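A quick numerical check of this formulation (illustrative only, not from the slides):

```python
import numpy as np

# Minimal sketch: verify that the exponential-family form
# b(y) * exp(eta * T(y) - a(eta)) reproduces the Bernoulli pmf.
phi = 0.3
eta = np.log(phi / (1 - phi))   # natural parameter
a = np.log(1 + np.exp(eta))     # log partition function a(eta)

for y in (0, 1):
    p_bernoulli = phi**y * (1 - phi)**(1 - y)
    p_exp_family = np.exp(eta * y - a)   # T(y) = y, b(y) = 1
    assert np.isclose(p_bernoulli, p_exp_family)
```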
Gaussian distribution
Recall that, when deriving linear regression, the value of $\sigma^2$ had no effect on our final choice of $\theta$ and $h_\theta(x)$. Thus, we can choose an arbitrary value for $\sigma^2$ without changing anything. To simplify the derivation below, let's set $\sigma^2 = 1$. We then have:

$$p(y; \mu) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(y - \mu)^2\right)$$
$$= \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} y^2\right) \cdot \exp\left(\mu y - \frac{1}{2}\mu^2\right)$$
Gaussian distribution
we see that the Gaussian is in the exponential family,
with:

$$\eta = \mu$$
$$T(y) = y$$
$$a(\eta) = \mu^2 / 2 = \eta^2 / 2$$
$$b(y) = \left(1/\sqrt{2\pi}\right) \exp\left(-y^2 / 2\right)$$
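Again, a quick numerical check (illustrative only) that the factored form reproduces the $\sigma^2 = 1$ Gaussian density:

```python
import numpy as np

# Minimal sketch: check p(y; mu) = b(y) * exp(eta * y - a(eta)) with sigma^2 = 1.
mu = 1.7
eta = mu            # natural parameter
a = eta**2 / 2      # log partition function

for y in (-1.0, 0.0, 2.5):
    b = np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)
    p_direct = np.exp(-(y - mu)**2 / 2) / np.sqrt(2 * np.pi)
    assert np.isclose(b * np.exp(eta * y - a), p_direct)
```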
Constructing GLMs
• Suppose you would like to build a model to estimate the number of customers arriving in your store (or the number of page-views on your website) in any given hour, based on certain features such as store promotions, recent advertising, weather, day-of-week, etc.

• We know that the Poisson distribution usually gives a good model for such counts of visitors.

• The Poisson is an exponential family distribution, so we can apply a Generalized Linear Model (GLM), as sketched below.
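A minimal sketch of what the resulting Poisson GLM prediction looks like; the feature names and parameter values are hypothetical, and the exponential response function is the standard canonical choice for the Poisson:

```python
import numpy as np

# Minimal sketch: for the Poisson GLM, the canonical response function is exp,
# so the expected count is E[y | x] = exp(theta^T x).
def predict_visitors_per_hour(theta, x):
    """x: feature vector (promotions, ad spend, ...); returns E[y | x]."""
    return np.exp(theta @ x)

theta = np.array([0.2, 1.1, -0.4])   # hypothetical learned parameters
x = np.array([1.0, 0.5, 2.0])        # hypothetical features for one hour
print(predict_visitors_per_hour(theta, x))
```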
Constructing GLMs
• More generally, consider a classification or regression problem where we would like to predict the value of some random variable $y$ as a function of $x$.

• To derive a GLM for this problem, we will make the following three assumptions about the conditional distribution of $y$ given $x$ and about our model:

1. $y \mid x; \theta \sim$ ExponentialFamily($\eta$). I.e., given $x$ and $\theta$, the distribution of $y$ follows some exponential family distribution, with parameter $\eta$.

2. Given $x$, our goal is to predict the expected value of $T(y)$ given $x$. In most of our examples, we will have $T(y) = y$, so this means we would like the prediction $h(x)$ output by our learned hypothesis $h$ to satisfy $h(x) = \mathrm{E}[y \mid x]$.

3. The natural parameter $\eta$ and the inputs $x$ are related linearly: $\eta = \theta^T x$. (Or, if $\eta$ is vector-valued, then $\eta_i = \theta_i^T x$.)
Ordinary Least Squares
• To show that ordinary least squares is a special case of the GLM family of models, consider the setting where the target variable $y$ (also called the response variable in GLM terminology) is continuous, and we model the conditional distribution of $y$ given $x$ as a Gaussian $\mathcal{N}(\mu, \sigma^2)$. (Here, $\mu$ may depend on $x$.)

• Let the ExponentialFamily($\eta$) distribution above be the Gaussian distribution.

• As we saw previously, in the formulation of the Gaussian as an exponential family distribution, we had $\mu = \eta$.

• So, we have:
$$h_\theta(x) = \mathrm{E}[y \mid x; \theta] = \mu = \eta = \theta^T x.$$
Logistic Regression
In our formulation of the Bernoulli distribution as an exponential family distribution, we had:

$$\phi = 1/\left(1 + e^{-\eta}\right).$$

Furthermore, if $y \mid x; \theta \sim$ Bernoulli($\phi$), then $\mathrm{E}[y \mid x; \theta] = \phi$. So, following a similar derivation as the one for ordinary least squares, we get:

$$h_\theta(x) = \mathrm{E}[y \mid x; \theta] = \phi = 1/\left(1 + e^{-\eta}\right) = 1/\left(1 + e^{-\theta^T x}\right)$$

So, this gives us hypothesis functions of the form $h_\theta(x) = 1/\left(1 + e^{-\theta^T x}\right)$.
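A minimal training sketch (assuming batch gradient ascent on the Bernoulli log-likelihood; illustrative, not from the slides):

```python
import numpy as np

# Minimal sketch: batch gradient ascent on l(theta), using the GLM hypothesis
# h_theta(x) = 1 / (1 + exp(-theta^T x)). The gradient is X^T (y - h).
def fit_logistic(X, y, lr=0.1, n_iters=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))
        theta += lr * X.T @ (y - h)   # ascend the log-likelihood
    return theta
```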

Softmax Regression
• Consider a classification problem in which the response variable $y$ can take on any one of $k$ values, so $y \in \{1, 2, \dots, k\}$.

• For example, rather than classifying email into the two classes spam or not-spam (which would have been a binary classification problem), we might want to classify it into three classes, such as spam, personal mail, and work-related mail.

• The response variable is still discrete, but can now take on more than two values. We will thus model it as distributed according to a categorical distribution.


Softmax Regression
• To parameterize a categorical distribution over $k$ possible outcomes, one could use $k$ parameters $\phi_1, \dots, \phi_k$ specifying the probability of each of the outcomes. However, these parameters would be redundant, or more formally, they would not be independent (since $\sum_{i=1}^{k} \phi_i = 1$).

• So, we will instead parameterize the categorical with only $k - 1$ parameters, $\phi_1, \dots, \phi_{k-1}$, where $\phi_i = p(y = i; \phi)$, and $p(y = k; \phi) = 1 - \sum_{i=1}^{k-1} \phi_i$.

• For notational convenience, we will also let $\phi_k = 1 - \sum_{i=1}^{k-1} \phi_i$, but we should keep in mind that this is not a parameter, and that it is fully specified by $\phi_1, \dots, \phi_{k-1}$.
Softmax Regression
• To express the categorical as an exponential family distribution, we will define $T(y) \in \mathbb{R}^{k-1}$ as follows:

$$T(1) = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad T(2) = \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad T(3) = \begin{bmatrix} 0 \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \quad \dots, \quad T(k-1) = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}, \quad T(k) = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$
Softmax Regression
We are now ready to show that the categorical is a member of the exponential
family. We have:
$$p(y; \phi) = \phi_1^{1\{y=1\}} \phi_2^{1\{y=2\}} \cdots \phi_k^{1\{y=k\}}$$
$$= \phi_1^{1\{y=1\}} \phi_2^{1\{y=2\}} \cdots \phi_k^{1 - \sum_{i=1}^{k-1} 1\{y=i\}}$$
$$= \phi_1^{(T(y))_1} \phi_2^{(T(y))_2} \cdots \phi_k^{1 - \sum_{i=1}^{k-1} (T(y))_i}$$
$$= \exp\left((T(y))_1 \log \phi_1 + (T(y))_2 \log \phi_2 + \cdots + \left(1 - \sum_{i=1}^{k-1} (T(y))_i\right) \log \phi_k\right)$$
$$= \exp\left((T(y))_1 \log(\phi_1 / \phi_k) + (T(y))_2 \log(\phi_2 / \phi_k) + \cdots + (T(y))_{k-1} \log(\phi_{k-1} / \phi_k) + \log \phi_k\right)$$
$$= b(y) \exp\left(\eta^T T(y) - a(\eta)\right)$$
Softmax Regression
where
$$\eta = \begin{bmatrix} \log(\phi_1 / \phi_k) \\ \log(\phi_2 / \phi_k) \\ \vdots \\ \log(\phi_{k-1} / \phi_k) \end{bmatrix}, \qquad a(\eta) = -\log \phi_k, \qquad b(y) = 1.$$

This completes our formulation of the categorical as an exponential family distribution.

The link function is given (for $i = 1, \dots, k$) by

$$\eta_i = \log \frac{\phi_i}{\phi_k}$$
Softmax Regression
For convenience, we have also defined $\eta_k = \log(\phi_k / \phi_k) = 0$. To invert the link function and derive the response function, we therefore have that

$$e^{\eta_i} = \frac{\phi_i}{\phi_k}$$
$$\phi_k e^{\eta_i} = \phi_i \qquad (2)$$
$$\phi_k \sum_{i=1}^{k} e^{\eta_i} = \sum_{i=1}^{k} \phi_i = 1$$

This implies that $\phi_k = 1 / \sum_{i=1}^{k} e^{\eta_i}$, which can be substituted back into Equation (2) to give the response function:

$$\phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^{k} e^{\eta_j}}$$
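In code, this response function (named the softmax function on the next slide) looks like the following minimal sketch; the max-subtraction is a standard numerical-stability trick added here, not something derived above:

```python
import numpy as np

# Minimal sketch: the response function phi_i = e^{eta_i} / sum_j e^{eta_j}.
# Subtracting max(eta) cancels in the ratio but prevents overflow in exp.
def softmax(eta):
    z = np.exp(eta - np.max(eta))
    return z / np.sum(z)

print(softmax(np.array([2.0, 1.0, 0.0])))  # probabilities summing to 1
```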
Softmax Regression
This function mapping from the $\eta$'s to the $\phi$'s is called the softmax function.

To complete our model, we use Assumption 3, given earlier, that the $\eta_i$'s are linearly related to the $x$'s. So, we have $\eta_i = \theta_i^T x$ (for $i = 1, \dots, k-1$), where $\theta_1, \dots, \theta_{k-1} \in \mathbb{R}^{n+1}$ are the parameters of our model.

For notational convenience, we can also define $\theta_k = 0$, so that $\eta_k = \theta_k^T x = 0$, as given previously. Hence, our model assumes that the conditional distribution of $y$ given $x$ is given by

$$p(y = i \mid x; \theta) = \phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^{k} e^{\eta_j}} = \frac{e^{\theta_i^T x}}{\sum_{j=1}^{k} e^{\theta_j^T x}} \qquad (3)$$

Softmax Regression
This model, which applies to classification problems where $y \in \{1, \dots, k\}$, is called softmax regression. It is a generalization of logistic regression.

Our hypothesis will output:

$$h_\theta(x) = \mathrm{E}[T(y) \mid x; \theta] = \mathrm{E}\left[\left.\begin{array}{c} 1\{y = 1\} \\ 1\{y = 2\} \\ \vdots \\ 1\{y = k-1\} \end{array}\right|\, x; \theta\right] = \begin{bmatrix} \dfrac{\exp(\theta_1^T x)}{\sum_{j=1}^{k} \exp(\theta_j^T x)} \\ \dfrac{\exp(\theta_2^T x)}{\sum_{j=1}^{k} \exp(\theta_j^T x)} \\ \vdots \\ \dfrac{\exp(\theta_{k-1}^T x)}{\sum_{j=1}^{k} \exp(\theta_j^T x)} \end{bmatrix}.$$
Softmax Regression
If we have a training set of $m$ examples $\{(x^{(i)}, y^{(i)}); i = 1, \dots, m\}$ and would like to learn the parameters $\theta_i$ of this model, we would begin by writing down the log-likelihood

$$\ell(\theta) = \sum_{i=1}^{m} \log p\left(y^{(i)} \mid x^{(i)}; \theta\right) = \sum_{i=1}^{m} \log \prod_{l=1}^{k} \left(\frac{e^{\theta_l^T x^{(i)}}}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}}\right)^{1\{y^{(i)} = l\}}$$

To obtain the second equality above, we used the definition of $p(y \mid x; \theta)$ given in Equation (3). We can now obtain the maximum likelihood estimate of the parameters by maximizing $\ell(\theta)$ in terms of $\theta$, using a method such as gradient ascent or Newton's method, as sketched below.
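A minimal gradient-ascent sketch (illustrative, not from the slides; it learns all $k$ parameter vectors rather than fixing $\theta_k = 0$, an equivalent overparameterized variant, and uses labels in $\{0, \dots, k-1\}$):

```python
import numpy as np

# Minimal sketch: batch gradient ascent on the softmax log-likelihood.
# The gradient w.r.t. theta_l is sum_i (1{y_i = l} - P_il) x_i.
def fit_softmax(X, y, k, lr=0.1, n_iters=1000):
    """X: (m, n) features; y: (m,) integer labels in {0, ..., k-1}."""
    m, n = X.shape
    Theta = np.zeros((k, n))
    Y = np.eye(k)[y]                        # one-hot labels, shape (m, k)
    for _ in range(n_iters):
        logits = X @ Theta.T                # (m, k)
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)   # softmax probabilities
        Theta += lr * (Y - P).T @ X / m     # ascend l(theta)
    return Theta
```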
Generative Models for Classification
Introduction
• Previous algorithms modeled $p(y \mid x; \theta)$, the conditional distribution of $y$ given $x$.

• For instance, logistic regression modeled $p(y \mid x; \theta)$ as $h_\theta(x) = 1/\left(1 + e^{-\theta^T x}\right)$.

Why do we need another method, when we have logistic regression?

• When there is substantial separation between the two classes, the parameter estimates for the logistic regression model are surprisingly unstable.

• If the distribution of the predictors is approximately normal in each of the classes and the sample size is small, then the approaches in this section may be more accurate than logistic regression.

• The methods can be naturally extended to the case of more than two response classes; in other words, we can fit each class separately.
Discriminative learning vs. generative learning

• Generative models generate the features $x$ for each class $c$ by sampling from $p(x \mid y = c; \theta)$.

• Discriminative models directly learn the class posterior $p(y \mid x; \theta)$.

Bayes theorem for classification

$$p(y = c \mid x; \theta) = \frac{p(x \mid y = c; \theta)\, p(y = c; \theta)}{\sum_{c'} p(x \mid y = c'; \theta)\, p(y = c'; \theta)}$$

The term $p(y = c; \theta)$ is the prior over class labels, and the term $p(x \mid y = c; \theta)$ is called the class conditional density for class $c$.
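As a minimal sketch (a hypothetical helper, not from the slides), the posterior is just a normalized product of prior and class conditional density:

```python
import numpy as np

# Minimal sketch: Bayes' theorem over k classes.
def class_posteriors(x, priors, class_densities):
    """priors: (k,) array p(y=c); class_densities: list of k callables p(x|y=c)."""
    joint = np.array([d(x) for d in class_densities]) * priors
    return joint / joint.sum()   # normalize over classes
```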
Gaussian discriminant analysis
• Assume that $p(x \mid y)$ is distributed according to a multivariate normal distribution:

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right).$$
Density of a Gaussian distribution

[Figure: contour plots of the bivariate Gaussian density for $\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, $\Sigma = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}$, and $\Sigma = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}$.]
The Gaussian Discriminant Analysis model

When we have a classification problem in which the input features $x$ are continuous-valued random variables, we can then use the Gaussian Discriminant Analysis (GDA) model, which models $p(x \mid y)$ using a multivariate normal distribution. The model is:

$$y \sim \text{Bernoulli}(\phi)$$
$$x \mid y = 0 \sim \mathcal{N}(\mu_0, \Sigma)$$
$$x \mid y = 1 \sim \mathcal{N}(\mu_1, \Sigma)$$

The corresponding densities are:

$$p(y) = \phi^{y} (1 - \phi)^{1-y}$$
$$p(x \mid y = 0) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0)\right)$$
$$p(x \mid y = 1) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)\right)$$

Model parameters: $\phi, \Sigma, \mu_0, \mu_1$. A parameter-estimation sketch follows.
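A minimal estimation sketch (illustrative, not from the slides); these closed-form averages are the standard maximum likelihood estimates for the GDA parameters:

```python
import numpy as np

# Minimal sketch: MLE for GDA reduces to empirical class frequencies, class
# means, and a shared empirical covariance.
def fit_gda(X, y):
    """X: (m, n) features; y: (m,) binary labels in {0, 1}."""
    phi = y.mean()                         # estimate of P(y = 1)
    mu0 = X[y == 0].mean(axis=0)           # class-0 mean
    mu1 = X[y == 1].mean(axis=0)           # class-1 mean
    mus = np.where(y[:, None] == 1, mu1, mu0)
    diff = X - mus
    Sigma = diff.T @ diff / len(y)         # shared covariance
    return phi, mu0, mu1, Sigma
```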
