Probabilistic Learning and Generalized Linear Models (GLMS)
Probabilistic interpretation of linear regression
Assume $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$, where the errors $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$ are Gaussian and i.i.d., so that
$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$
Likelihood function: $L(\theta) = p(\vec{y} \mid X; \theta) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta)$
Maximizing the log-likelihood $\ell(\theta) = \log L(\theta)$ is equivalent to least squares: $\theta^* = \arg\min_\theta \sum_{i=1}^{n} (y^{(i)} - \theta^T x^{(i)})^2$.
Probabilistic interpretation of logistic regression
$h_\theta(x) = P(y = 1 \mid x; \theta)$ and $1 - h_\theta(x) = P(y = 0 \mid x; \theta)$, which can be written compactly as
$$p(y \mid x; \theta) = (h_\theta(x))^y (1 - h_\theta(x))^{1-y}$$
Likelihood function: $L(\theta) = p(\vec{y} \mid X; \theta) = \prod_{i=1}^{n} (h_\theta(x^{(i)}))^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1 - y^{(i)}}$
In both cases the model specifies a conditional distribution: for regression, $y \mid x; \theta \sim \mathcal{N}(\mu, \sigma^2)$; for classification, $y \mid x; \theta \sim \text{Bernoulli}(\phi)$.
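Since maximizing the Gaussian log-likelihood is equivalent to minimizing the squared error, the two estimates should coincide. A minimal numerical check in Python (synthetic data; all names here are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data: y = 2*x + Gaussian noise with sigma = 0.5.
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x + rng.normal(0.0, 0.5, size=100)

def log_likelihood(theta, sigma=0.5):
    # ell(theta) = sum_i log N(y_i | theta*x_i, sigma^2)
    resid = y - theta * x
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

theta_ls = np.sum(x * y) / np.sum(x * x)          # least-squares estimate
grid = np.linspace(0.0, 4.0, 401)
theta_ml = grid[np.argmax([log_likelihood(t) for t in grid])]
print(theta_ls, theta_ml)                          # nearly identical values
```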
In this section, we will show that both of these methods are special cases of a broader family of
models, called Generalized Linear Models (GLMs).
We will also show how other models in the GLM family can be derived and applied to other
classification and regression problems.
The exponential family
The exponential family plays a crucial role in statistics and machine learning, for various reasons,
including the following:
The exponential family is the unique family of distributions with maximum entropy subject to given moment constraints (i.e., it makes the fewest additional assumptions about the data).
Under certain regularity conditions, the exponential family is the only family of distributions with
finite-sized sufficient statistics.
All members of the exponential family have a conjugate prior, which simplifies Bayesian inference
of the parameters.
The exponential family
We say that a class of distributions is in the exponential family if it can be written in the form:
$$p(y; \eta) = b(y) \exp(\eta^T T(y) - a(\eta)) \tag{1}$$
Here $\eta$ is the natural parameter, $T(y)$ is the sufficient statistic, $a(\eta)$ is the log partition function, and $b(y)$ is the base measure.
The quantity $e^{-a(\eta)}$ essentially plays the role of a normalization constant, which makes sure the distribution $p(y; \eta)$ sums/integrates over $y$ to 1.
A fixed choice of $T$, $a$ and $b$ defines a family (or set) of distributions that is parameterized by $\eta$; as we vary $\eta$, we then get different distributions within this family.
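Equation (1) translates directly into code. The sketch below is a generic density evaluator under the stated form; the function and argument names are mine, not from the text:

```python
import numpy as np

def exp_family_density(y, eta, T, a, b):
    """p(y; eta) = b(y) * exp(eta * T(y) - a(eta)) -- Equation (1), scalar eta."""
    return b(y) * np.exp(eta * T(y) - a(eta))
```

The Bernoulli and Gaussian sections below supply concrete choices of $T$, $a$ and $b$.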
Bernoulli distribution
The Bernoulli distribution with mean $\phi$, written Bernoulli($\phi$), specifies a distribution over $y \in \{0, 1\}$, so that $p(y = 1; \phi) = \phi$ and $p(y = 0; \phi) = 1 - \phi$.
We now show there is a choice of $T$, $a$ and $b$ so that Equation (1) becomes exactly the class of Bernoulli distributions.
Bernoulli distribution
We write the Bernoulli distribution as:
$$p(y; \phi) = \phi^y (1 - \phi)^{1-y}$$
$$= \exp(y \log \phi + (1 - y) \log(1 - \phi))$$
$$= \exp\left( \left( \log\frac{\phi}{1-\phi} \right) y + \log(1 - \phi) \right)$$
Thus, the natural parameter is $\eta = \log(\phi / (1 - \phi))$; inverting gives $\phi = 1/(1 + e^{-\eta})$. The remaining components are:
$$T(y) = y$$
$$a(\eta) = -\log(1 - \phi) = \log(1 + e^\eta)$$
$$b(y) = 1$$
This shows that the Bernoulli distribution can be written in the form of Equation (1), using an appropriate choice of $T$, $a$ and $b$.
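A quick numerical sanity check of this algebra (a minimal sketch; variable names are mine):

```python
import numpy as np

phi = 0.3
eta = np.log(phi / (1 - phi))        # natural parameter eta = log(phi/(1-phi))
a = np.log(1 + np.exp(eta))          # a(eta) = log(1 + e^eta) = -log(1 - phi)

for y in (0, 1):
    direct = phi**y * (1 - phi)**(1 - y)    # phi^y (1 - phi)^(1 - y)
    exp_form = np.exp(eta * y - a)          # b(y) exp(eta*T(y) - a(eta)), T(y)=y, b(y)=1
    print(y, direct, exp_form)              # the two columns match
```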
Gaussian distribution
Recall that, when deriving linear regression, the value of $\sigma^2$ had no effect on our final choice of $\theta$ and $h_\theta(x)$. Thus, we can choose an arbitrary value for $\sigma^2$ without changing anything. To simplify the derivation below, let's set $\sigma^2 = 1$. We then have:
$$p(y; \mu) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(y - \mu)^2\right)$$
$$= \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{y^2}{2}\right) \cdot \exp\left(\mu y - \frac{\mu^2}{2}\right)$$
Gaussian distribution
Thus, we see that the Gaussian is in the exponential family, with:
$$\eta = \mu$$
$$T(y) = y$$
$$a(\eta) = \mu^2 / 2 = \eta^2 / 2$$
$$b(y) = (1/\sqrt{2\pi}) \exp(-y^2 / 2)$$
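The same numerical check works for the Gaussian with $\sigma^2 = 1$ (again a minimal sketch):

```python
import numpy as np

mu, y = 1.5, 0.7
eta = mu                                      # eta = mu
a = eta**2 / 2                                # a(eta) = eta^2 / 2
b = np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)    # b(y) = (1/sqrt(2*pi)) exp(-y^2/2)

direct = np.exp(-0.5 * (y - mu)**2) / np.sqrt(2 * np.pi)  # N(y; mu, 1)
exp_form = b * np.exp(eta * y - a)                        # exponential family form
print(direct, exp_form)                                   # identical values
```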
Constructing GLMs
• Suppose you would like to build a model to estimate the number $y$ of customers arriving in your store (or number of page-views on your website) in any given hour, based on certain features $x$.
• We know that the Poisson distribution usually gives a good model for numbers of visitors, and the Poisson is an exponential family distribution, so we can build a Generalized Linear Model (GLM) for this problem.
Constructing GLMs
• More generally, consider a classification or regression problem where we would like to predict the value of some random variable $y$ as a function of $x$.
• To derive a GLM for this problem, we will make the following three assumptions about the conditional distribution of $y$ given $x$ and about our model (a concrete sketch follows the list):
1. $y \mid x; \theta \sim \text{ExponentialFamily}(\eta)$. I.e., given $x$ and $\theta$, the distribution of $y$ follows some exponential family distribution, with natural parameter $\eta$.
2. Given $x$, our goal is to predict the expected value of $T(y)$ given $x$. In most of our examples, we will have $T(y) = y$, so this means we would like the prediction $h(x)$ output by our learned hypothesis $h$ to satisfy $h(x) = \mathrm{E}[y \mid x]$.
3. The natural parameter $\eta$ and the inputs $x$ are related linearly: $\eta = \theta^T x$. (Or, if $\eta$ is vector-valued, then $\eta_i = \theta_i^T x$.)
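To preview how the three assumptions combine, take the store-visits example: the Poisson distribution is in the exponential family with natural parameter $\eta = \log \lambda$, so Assumption 3 ($\eta = \theta^T x$) gives $\mathrm{E}[y \mid x] = \lambda = e^{\theta^T x}$. A minimal sketch with hypothetical numbers (the feature layout and parameter values are mine):

```python
import numpy as np

def poisson_glm_predict(theta, x):
    """Poisson GLM response: eta = theta^T x, E[y | x] = lambda = exp(eta)."""
    return np.exp(x @ theta)

theta = np.array([0.5, 1.2])   # hypothetical learned parameters
x = np.array([1.0, 0.3])       # hypothetical features, e.g. [intercept, promotion]
print(poisson_glm_predict(theta, x))   # exp(0.86) ~ 2.36 expected visitors/hour
```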
Ordinary Least Squares
• To show that ordinary least squares is a special case of the GLM family of models, consider the
setting where the target variable $y$ (also called the response variable in GLM terminology) is continuous, and we model the conditional distribution of $y$ given $x$ as a Gaussian $\mathcal{N}(\mu, \sigma^2)$. (Here, $\mu$ may depend on $x$.)
• We let the ExponentialFamily($\eta$) distribution above be the Gaussian distribution.
• As we saw previously, in the formulation of the Gaussian as an exponential family distribution, we had $\mu = \eta$.
• So, we have:
$$h_\theta(x) = \mathrm{E}[y \mid x; \theta] = \mu = \eta = \theta^T x.$$
The first equality follows from Assumption 2; the second from $y \mid x; \theta \sim \mathcal{N}(\mu, \sigma^2)$; the third from $\mu = \eta$; and the fourth from Assumption 3.
Logistic Regression
In our formulation of the Bernoulli distribution as an exponential family distribution, we had $\phi = 1/(1 + e^{-\eta})$. Furthermore, if $y \mid x; \theta \sim \text{Bernoulli}(\phi)$, then $\mathrm{E}[y \mid x; \theta] = \phi$. So, following a derivation similar to the one for ordinary least squares:
$$h_\theta(x) = \mathrm{E}[y \mid x; \theta] = \phi = \frac{1}{1 + e^{-\eta}} = \frac{1}{1 + e^{-\theta^T x}}$$
This recovers exactly the logistic regression hypothesis.
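In code, the resulting hypothesis is just the sigmoid of $\theta^T x$ (a sketch; the parameter values are made up):

```python
import numpy as np

def h(theta, x):
    """Logistic regression hypothesis: h_theta(x) = 1 / (1 + exp(-theta^T x))."""
    return 1.0 / (1.0 + np.exp(-(x @ theta)))

theta = np.array([0.5, -1.0])   # hypothetical parameters
x = np.array([1.0, 2.0])
print(h(theta, x))              # P(y = 1 | x; theta) ~ 0.18
```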
Softmax Regression
• Consider a classification problem in which the response variable $y$ can take on any one of $k$ values, so $y \in \{1, 2, \ldots, k\}$.
• For example, rather than classifying email into the two classes spam or not-spam, which would have been a binary classification problem, we might want to classify it into three classes, such as spam, personal mail, and work-related mail.
• The response variable is still discrete, but can now take on more than two values. We will thus model it as distributed according to a categorical distribution, with parameters $\phi_1, \ldots, \phi_k$ summing to 1, and define $T(y)$ as follows:
$$T(1) = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix},\; T(2) = \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix},\; T(3) = \begin{bmatrix} 0 \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix},\; \ldots,\; T(k-1) = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix},\; T(k) = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$
These are vectors in $\mathbb{R}^{k-1}$; $(T(y))_i$ denotes the $i$-th element of $T(y)$, and with the indicator function $1\{\cdot\}$ we can write $(T(y))_i = 1\{y = i\}$.
Softmax Regression
We are now ready to show that the categorical distribution is a member of the exponential family. We have:
$$p(y; \phi) = \phi_1^{1\{y=1\}} \phi_2^{1\{y=2\}} \cdots \phi_k^{1\{y=k\}}$$
$$= \phi_1^{1\{y=1\}} \phi_2^{1\{y=2\}} \cdots \phi_k^{1 - \sum_{i=1}^{k-1} 1\{y=i\}}$$
$$= \phi_1^{(T(y))_1} \phi_2^{(T(y))_2} \cdots \phi_k^{1 - \sum_{i=1}^{k-1} (T(y))_i}$$
$$= \exp\left( (T(y))_1 \log \phi_1 + (T(y))_2 \log \phi_2 + \cdots + \left(1 - \sum_{i=1}^{k-1} (T(y))_i\right) \log \phi_k \right)$$
$$= \exp\left( (T(y))_1 \log\frac{\phi_1}{\phi_k} + (T(y))_2 \log\frac{\phi_2}{\phi_k} + \cdots + (T(y))_{k-1} \log\frac{\phi_{k-1}}{\phi_k} + \log \phi_k \right)$$
This is in the form $b(y)\exp(\eta^T T(y) - a(\eta))$, with $a(\eta) = -\log \phi_k$, $b(y) = 1$, and link function
$$\eta_i = \log\frac{\phi_i}{\phi_k}$$
Softmax Regression
For convenience, we have also defined $\eta_k = \log(\phi_k / \phi_k) = 0$. To invert the link function and derive the response function, we therefore have that
$$e^{\eta_i} = \frac{\phi_i}{\phi_k}$$
$$\phi_k e^{\eta_i} = \phi_i \tag{2}$$
$$\phi_k \sum_{i=1}^{k} e^{\eta_i} = \sum_{i=1}^{k} \phi_i = 1$$
This implies that $\phi_k = 1 / \sum_{i=1}^{k} e^{\eta_i}$, which can be substituted back into Equation (2) to give the response function:
$$\phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^{k} e^{\eta_j}}$$
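This response function is easy to implement; the max-subtraction below is a standard numerical-stability trick, not part of the derivation:

```python
import numpy as np

def softmax(eta):
    """phi_i = exp(eta_i) / sum_j exp(eta_j); shifting by max(eta) avoids overflow."""
    z = np.exp(eta - np.max(eta))
    return z / z.sum()

eta = np.array([2.0, 1.0, 0.0])   # eta_k = 0 by the convention above
print(softmax(eta))               # probabilities that sum to 1
```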
Softmax Regression
This function mapping from the $\eta$'s to the $\phi$'s is called the softmax function.
To complete our model, we use Assumption 3, given earlier, that the $\eta_i$'s are linearly related to the $x$'s. So, we have $\eta_i = \theta_i^T x$ (for $i = 1, \ldots, k-1$), where $\theta_1, \ldots, \theta_{k-1} \in \mathbb{R}^{d+1}$ are the parameters of our model. For notational convenience, we can also define $\theta_k = 0$, so that $\eta_k = \theta_k^T x = 0$, as given previously. Hence, the conditional distribution of $y$ given $x$ is:
$$p(y = i \mid x; \theta) = \phi_i = \frac{e^{\theta_i^T x}}{\sum_{j=1}^{k} e^{\theta_j^T x}} \tag{3}$$
Softmax Regression
This model, which applies to classification problems where $y \in \{1, \ldots, k\}$, is called softmax regression. Our hypothesis will output:
$$h_\theta(x) = \mathrm{E}[T(y) \mid x; \theta] = \begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_{k-1} \end{bmatrix} = \begin{bmatrix} \dfrac{\exp(\theta_1^T x)}{\sum_{j=1}^{k} \exp(\theta_j^T x)} \\ \dfrac{\exp(\theta_2^T x)}{\sum_{j=1}^{k} \exp(\theta_j^T x)} \\ \vdots \\ \dfrac{\exp(\theta_{k-1}^T x)}{\sum_{j=1}^{k} \exp(\theta_j^T x)} \end{bmatrix}$$
In other words, our hypothesis outputs the estimated probability $p(y = i \mid x; \theta)$ for each value $i = 1, \ldots, k-1$ (the probability of $y = k$ follows as $1 - \sum_{i=1}^{k-1} \phi_i$).
Softmax Regression
If we have a training set of $n$ examples $\{(x^{(i)}, y^{(i)}); i = 1, \ldots, n\}$ and would like to learn the parameters $\theta_i$ of this model, we would begin by writing down the log-likelihood
$$\ell(\theta) = \sum_{i=1}^{n} \log p(y^{(i)} \mid x^{(i)}; \theta)$$
$$= \sum_{i=1}^{n} \log \prod_{l=1}^{k} \left( \frac{e^{\theta_l^T x^{(i)}}}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}} \right)^{1\{y^{(i)} = l\}}$$
To obtain the second line above, we used the definition for ( ∣ ; ) given in Equation (3). We
can now obtain the maximum likelihood estimate of the parameters by maximizing ℓ( ) in terms
of , using a method such as gradient ascent or Newton's method.
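A minimal gradient-ascent sketch on synthetic data. The gradient $\nabla_{\theta_l} \ell(\theta) = \sum_i (1\{y^{(i)} = l\} - \phi_l^{(i)}) x^{(i)}$ is standard but not derived in this excerpt, and the labels below are coded $0, \ldots, k-1$ rather than $1, \ldots, k$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 3, 4
X = rng.normal(size=(n, d))
y = rng.integers(0, k, size=n)          # labels in {0, ..., k-1}

Theta = np.zeros((k, d))                # one parameter vector per class

def probs(Theta, X):
    """phi[i, l] = exp(theta_l^T x_i) / sum_j exp(theta_j^T x_i)."""
    Z = X @ Theta.T
    Z -= Z.max(axis=1, keepdims=True)   # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

Y = np.eye(k)[y]                        # one-hot rows: Y[i, l] = 1{y_i = l}
alpha = 0.1
for _ in range(100):                    # gradient ascent on the log-likelihood
    P = probs(Theta, X)
    Theta += alpha / n * (Y - P).T @ X  # grad wrt theta_l: sum_i (1{y_i=l} - phi_il) x_i

loglik = np.sum(np.log(probs(Theta, X)[np.arange(n), y]))
print(loglik)                           # increases over the iterations
```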
Generative Models for Classification
Introduction
• Previous algorithms modeled $p(y \mid x; \theta)$, the conditional distribution of $y$ given $x$.
• When there is substantial separation between the two classes, the parameter estimates for
the logistic regression model are surprisingly unstable.
• If the distribution of the predictors is approximately normal in each of the classes and
the sample size is small, then the approaches in this section may be more accurate than
logistic regression.
• The methods can be naturally extended to the case of more than two response classes; in other words, we can fit a class-conditional model for each of the $k$ classes separately.
Discriminative learning VS generative learning
$$p(y = c \mid x; \theta) = \frac{p(x \mid y = c; \theta)\, p(y = c; \theta)}{\sum_{c'} p(x \mid y = c'; \theta)\, p(y = c'; \theta)}$$
The term $p(y = c; \theta)$ is the prior over class labels, and the term $p(x \mid y = c; \theta)$ is called the class-conditional density for class $c$.
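Numerically, the generative route is just Bayes' rule over the classes (a sketch with made-up density and prior values):

```python
import numpy as np

def posterior(class_cond, prior):
    """p(y = c | x) via Bayes' rule, from class-conditional densities and priors."""
    joint = class_cond * prior     # p(x | y = c) p(y = c) for each class c
    return joint / joint.sum()     # normalize over classes

class_cond = np.array([0.05, 0.20])   # hypothetical p(x | y = c) at one point x
prior = np.array([0.7, 0.3])          # hypothetical class priors
print(posterior(class_cond, prior))   # [0.368..., 0.631...]
```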
Gaussian discriminant analysis
• Assume that $p(x \mid y)$ is distributed according to a multivariate normal distribution:
$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right).$$
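A direct implementation of this density (a sketch; np.linalg.solve applies $\Sigma^{-1}$ without forming the inverse explicitly):

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Multivariate normal density N(x; mu, Sigma) for x, mu in R^d."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

x = np.array([0.5, -0.2])
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
print(mvn_density(x, mu, Sigma))
```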
Density of a Gaussian distribution
[Figure: density plots of a two-dimensional Gaussian with covariance $\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $\Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}$, and $\Sigma = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1 \end{bmatrix}$; larger off-diagonal entries correlate the two coordinates.]
The Gaussian Discriminant Analysis model
When we have a classification problem in which the input features $x$ are continuous-valued random variables, we can then use the Gaussian Discriminant Analysis (GDA) model, which models $p(x \mid y)$ using a multivariate normal distribution. The model is:
$$y \sim \text{Bernoulli}(\phi)$$
$$x \mid y = 0 \sim \mathcal{N}(\mu_0, \Sigma)$$
$$x \mid y = 1 \sim \mathcal{N}(\mu_1, \Sigma)$$
Writing out the distributions:
$$p(y) = \phi^y (1 - \phi)^{1-y}$$
$$p(x \mid y = 0) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right)$$
$$p(x \mid y = 1) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) \right)$$
Model parameters: $\phi$, $\Sigma$, $\mu_0$, $\mu_1$. Note that the two classes have different means but share a single covariance matrix $\Sigma$.
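The maximum-likelihood estimates of these parameters are not derived in this excerpt, but they are the intuitive ones: $\phi$ is the fraction of $y = 1$ examples, $\mu_0$ and $\mu_1$ are the per-class feature means, and $\Sigma$ is the covariance pooled across both classes. A sketch on synthetic data:

```python
import numpy as np

def fit_gda(X, y):
    """Maximum-likelihood estimates of the GDA parameters (phi, mu0, mu1, Sigma)."""
    phi = y.mean()                            # fraction of positive examples
    mu0 = X[y == 0].mean(axis=0)              # mean of class-0 features
    mu1 = X[y == 1].mean(axis=0)              # mean of class-1 features
    mus = np.where(y[:, None] == 1, mu1, mu0) # mu_{y^(i)} for each example
    diff = X - mus                            # x^(i) - mu_{y^(i)}
    Sigma = diff.T @ diff / len(y)            # shared (pooled) covariance
    return phi, mu0, mu1, Sigma

rng = np.random.default_rng(0)
X0 = rng.normal(loc=[-1, -1], size=(50, 2))   # class-0 samples
X1 = rng.normal(loc=[+1, +1], size=(50, 2))   # class-1 samples
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)
print(fit_gda(X, y))
```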