Probabilistic Learning and Generalized Linear Models (GLMS)
Probabilistic interpretation of linear regression
Assume $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$, where the errors $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$ are Gaussian and i.i.d., so that
$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$
Likelihood function: $L(\theta) = p(\vec{y} \mid X; \theta) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta)$
Maximizing the log-likelihood $\ell(\theta) = \log L(\theta)$ is equivalent to least squares: $\theta^* = \arg\min_\theta \sum_{i=1}^{n} (y^{(i)} - \theta^T x^{(i)})^2$.
Probabilistic interpretation of logistic regression
$h_\theta(x) = P(y = 1 \mid x; \theta)$ and $1 - h_\theta(x) = P(y = 0 \mid x; \theta)$, which can be written compactly as
$$p(y \mid x; \theta) = (h_\theta(x))^y (1 - h_\theta(x))^{1-y}$$
Likelihood function: $L(\theta) = p(\vec{y} \mid X; \theta) = \prod_{i=1}^{n} (h_\theta(x^{(i)}))^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1 - y^{(i)}}$
In both cases the model specifies a conditional distribution: for regression, $y \mid x; \theta \sim \mathcal{N}(\mu, \sigma^2)$; for classification, $y \mid x; \theta \sim \text{Bernoulli}(\phi)$.
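Since maximizing the Gaussian log-likelihood is equivalent to minimizing the squared error, the two estimates should coincide. A minimal numerical check in Python (synthetic data; all names here are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data: y = 2*x + Gaussian noise with sigma = 0.5.
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x + rng.normal(0.0, 0.5, size=100)

def log_likelihood(theta, sigma=0.5):
    # ell(theta) = sum_i log N(y_i | theta*x_i, sigma^2)
    resid = y - theta * x
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

theta_ls = np.sum(x * y) / np.sum(x * x)          # least-squares estimate
grid = np.linspace(0.0, 4.0, 401)
theta_ml = grid[np.argmax([log_likelihood(t) for t in grid])]
print(theta_ls, theta_ml)                          # nearly identical values
```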
In this section, we will show that both of these methods are special cases of a broader family of
models, called Generalized Linear Models (GLMs).
We will also show how other models in the GLM family can be derived and applied to other
classification and regression problems.
The exponential family
The exponential family plays a crucial role in statistics and machine learning, for various reasons,
including the following:
The exponential family is the unique family of distributions with maximum entropy subject to given moment constraints (i.e., it makes the fewest additional assumptions about the data).
Under certain regularity conditions, the exponential family is the only family of distributions with
finite-sized sufficient statistics.
All members of the exponential family have a conjugate prior, which simplifies Bayesian inference
of the parameters.
The exponential family
We say that a class of distributions is in the exponential family if it can be written in the form:
$$p(y; \eta) = b(y) \exp(\eta^T T(y) - a(\eta)) \tag{1}$$
Here $\eta$ is the natural parameter, $T(y)$ is the sufficient statistic, $a(\eta)$ is the log partition function, and $b(y)$ is the base measure.
The quantity $e^{-a(\eta)}$ essentially plays the role of a normalization constant, which makes sure the distribution $p(y; \eta)$ sums/integrates over $y$ to 1.
A fixed choice of $T$, $a$ and $b$ defines a family (or set) of distributions that is parameterized by $\eta$; as we vary $\eta$, we then get different distributions within this family.
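Equation (1) translates directly into code. The sketch below is a generic density evaluator under the stated form; the function and argument names are mine, not from the text:

```python
import numpy as np

def exp_family_density(y, eta, T, a, b):
    """p(y; eta) = b(y) * exp(eta * T(y) - a(eta)) -- Equation (1), scalar eta."""
    return b(y) * np.exp(eta * T(y) - a(eta))
```

The Bernoulli and Gaussian sections below supply concrete choices of $T$, $a$ and $b$.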
Bernoulli distribution
The Bernoulli distribution with mean $\phi$, written Bernoulli($\phi$), specifies a distribution over $y \in \{0, 1\}$, so that $p(y = 1; \phi) = \phi$ and $p(y = 0; \phi) = 1 - \phi$.
We now show there is a choice of $T$, $a$ and $b$ so that Equation (1) becomes exactly the class of Bernoulli distributions.
Bernoulli distribution
We write the Bernoulli distribution as:
$$p(y; \phi) = \phi^y (1 - \phi)^{1-y}$$
$$= \exp(y \log \phi + (1 - y) \log(1 - \phi))$$
$$= \exp\left( \left( \log\frac{\phi}{1-\phi} \right) y + \log(1 - \phi) \right)$$
Thus, the natural parameter is $\eta = \log(\phi / (1 - \phi))$; inverting gives $\phi = 1/(1 + e^{-\eta})$. The remaining components are:
$$T(y) = y$$
$$a(\eta) = -\log(1 - \phi) = \log(1 + e^\eta)$$
$$b(y) = 1$$
This shows that the Bernoulli distribution can be written in the form of Equation (1), using an appropriate choice of $T$, $a$ and $b$.
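A quick numerical sanity check of this algebra (a minimal sketch; variable names are mine):

```python
import numpy as np

phi = 0.3
eta = np.log(phi / (1 - phi))        # natural parameter eta = log(phi/(1-phi))
a = np.log(1 + np.exp(eta))          # a(eta) = log(1 + e^eta) = -log(1 - phi)

for y in (0, 1):
    direct = phi**y * (1 - phi)**(1 - y)    # phi^y (1 - phi)^(1 - y)
    exp_form = np.exp(eta * y - a)          # b(y) exp(eta*T(y) - a(eta)), T(y)=y, b(y)=1
    print(y, direct, exp_form)              # the two columns match
```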
Gaussian distribution
Recall that, when deriving linear regression, the value of $\sigma^2$ had no effect on our final choice of $\theta$ and $h_\theta(x)$. Thus, we can choose an arbitrary value for $\sigma^2$ without changing anything. To simplify the derivation below, let's set $\sigma^2 = 1$. We then have:
$$p(y; \mu) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(y - \mu)^2\right)$$
$$= \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{y^2}{2}\right) \cdot \exp\left(\mu y - \frac{\mu^2}{2}\right)$$
Gaussian distribution
Thus, we see that the Gaussian is in the exponential family, with:
$$\eta = \mu$$
$$T(y) = y$$
$$a(\eta) = \mu^2 / 2 = \eta^2 / 2$$
$$b(y) = (1/\sqrt{2\pi}) \exp(-y^2 / 2)$$
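The same numerical check works for the Gaussian with $\sigma^2 = 1$ (again a minimal sketch):

```python
import numpy as np

mu, y = 1.5, 0.7
eta = mu                                      # eta = mu
a = eta**2 / 2                                # a(eta) = eta^2 / 2
b = np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)    # b(y) = (1/sqrt(2*pi)) exp(-y^2/2)

direct = np.exp(-0.5 * (y - mu)**2) / np.sqrt(2 * np.pi)  # N(y; mu, 1)
exp_form = b * np.exp(eta * y - a)                        # exponential family form
print(direct, exp_form)                                   # identical values
```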
Constructing GLMs
• Suppose you would like to build a model to estimate the number $y$ of customers arriving in your store (or number of page-views on your website) in any given hour, based on certain features $x$.
• We know that the Poisson distribution usually gives a good model for numbers of visitors, and the Poisson is an exponential family distribution, so we can build a Generalized Linear Model (GLM) for this problem.
Constructing GLMs
• More generally, consider a classification or regression problem where we would like to predict the value of some random variable $y$ as a function of $x$.
• To derive a GLM for this problem, we will make the following three assumptions about the conditional distribution of $y$ given $x$ and about our model (a concrete sketch follows the list):
1. $y \mid x; \theta \sim \text{ExponentialFamily}(\eta)$. I.e., given $x$ and $\theta$, the distribution of $y$ follows some exponential family distribution, with natural parameter $\eta$.
2. Given $x$, our goal is to predict the expected value of $T(y)$ given $x$. In most of our examples, we will have $T(y) = y$, so this means we would like the prediction $h(x)$ output by our learned hypothesis $h$ to satisfy $h(x) = \mathrm{E}[y \mid x]$.
3. The natural parameter $\eta$ and the inputs $x$ are related linearly: $\eta = \theta^T x$. (Or, if $\eta$ is vector-valued, then $\eta_i = \theta_i^T x$.)
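To preview how the three assumptions combine, take the store-visits example: the Poisson distribution is in the exponential family with natural parameter $\eta = \log \lambda$, so Assumption 3 ($\eta = \theta^T x$) gives $\mathrm{E}[y \mid x] = \lambda = e^{\theta^T x}$. A minimal sketch with hypothetical numbers (the feature layout and parameter values are mine):

```python
import numpy as np

def poisson_glm_predict(theta, x):
    """Poisson GLM response: eta = theta^T x, E[y | x] = lambda = exp(eta)."""
    return np.exp(x @ theta)

theta = np.array([0.5, 1.2])   # hypothetical learned parameters
x = np.array([1.0, 0.3])       # hypothetical features, e.g. [intercept, promotion]
print(poisson_glm_predict(theta, x))   # exp(0.86) ~ 2.36 expected visitors/hour
```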
Ordinary Least Squares
• To show that ordinary least squares is a special case of the GLM family of models, consider the
setting where the target variable $y$ (also called the response variable in GLM terminology) is continuous, and we model the conditional distribution of $y$ given $x$ as a Gaussian $\mathcal{N}(\mu, \sigma^2)$. (Here, $\mu$ may depend on $x$.)
• We let the ExponentialFamily($\eta$) distribution above be the Gaussian distribution.
• As we saw previously, in the formulation of the Gaussian as an exponential family distribution, we had $\mu = \eta$.
• So, we have:
$$h_\theta(x) = \mathrm{E}[y \mid x; \theta] = \mu = \eta = \theta^T x.$$
The first equality follows from Assumption 2; the second from $y \mid x; \theta \sim \mathcal{N}(\mu, \sigma^2)$; the third from $\mu = \eta$; and the fourth from Assumption 3.
Logistic Regression
In our formulation of the Bernoulli distribution as an exponential family distribution, we had $\phi = 1/(1 + e^{-\eta})$. Furthermore, if $y \mid x; \theta \sim \text{Bernoulli}(\phi)$, then $\mathrm{E}[y \mid x; \theta] = \phi$. So, following a derivation similar to the one for ordinary least squares:
$$h_\theta(x) = \mathrm{E}[y \mid x; \theta] = \phi = \frac{1}{1 + e^{-\eta}} = \frac{1}{1 + e^{-\theta^T x}}$$
This recovers exactly the logistic regression hypothesis.
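In code, the resulting hypothesis is just the sigmoid of $\theta^T x$ (a sketch; the parameter values are made up):

```python
import numpy as np

def h(theta, x):
    """Logistic regression hypothesis: h_theta(x) = 1 / (1 + exp(-theta^T x))."""
    return 1.0 / (1.0 + np.exp(-(x @ theta)))

theta = np.array([0.5, -1.0])   # hypothetical parameters
x = np.array([1.0, 2.0])
print(h(theta, x))              # P(y = 1 | x; theta) ~ 0.18
```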
Softmax Regression
• Consider a classification problem in which the response variable $y$ can take on any one of $k$ values, so $y \in \{1, 2, \ldots, k\}$.
• For example, rather than classifying email into the two classes spam or not-spam, which would have been a binary classification problem, we might want to classify it into three classes, such as spam, personal mail, and work-related mail.
• The response variable is still discrete, but can now take on more than two values. We will thus model it as distributed according to a categorical distribution, with parameters $\phi_1, \ldots, \phi_k$ summing to 1, and define $T(y)$ as follows:
$$T(1) = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix},\; T(2) = \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix},\; T(3) = \begin{bmatrix} 0 \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix},\; \ldots,\; T(k-1) = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix},\; T(k) = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$
These are vectors in $\mathbb{R}^{k-1}$; $(T(y))_i$ denotes the $i$-th element of $T(y)$, and with the indicator function $1\{\cdot\}$ we can write $(T(y))_i = 1\{y = i\}$.
Softmax Regression
We are now ready to show that the categorical distribution is a member of the exponential family. We have:
$$p(y; \phi) = \phi_1^{1\{y=1\}} \phi_2^{1\{y=2\}} \cdots \phi_k^{1\{y=k\}}$$
$$= \phi_1^{1\{y=1\}} \phi_2^{1\{y=2\}} \cdots \phi_k^{1 - \sum_{i=1}^{k-1} 1\{y=i\}}$$
$$= \phi_1^{(T(y))_1} \phi_2^{(T(y))_2} \cdots \phi_k^{1 - \sum_{i=1}^{k-1} (T(y))_i}$$
$$= \exp\left( (T(y))_1 \log \phi_1 + (T(y))_2 \log \phi_2 + \cdots + \left(1 - \sum_{i=1}^{k-1} (T(y))_i\right) \log \phi_k \right)$$
$$= \exp\left( (T(y))_1 \log\frac{\phi_1}{\phi_k} + (T(y))_2 \log\frac{\phi_2}{\phi_k} + \cdots + (T(y))_{k-1} \log\frac{\phi_{k-1}}{\phi_k} + \log \phi_k \right)$$
This is in the form $b(y)\exp(\eta^T T(y) - a(\eta))$, with $a(\eta) = -\log \phi_k$, $b(y) = 1$, and link function
$$\eta_i = \log\frac{\phi_i}{\phi_k}$$
Softmax Regression
For convenience, we have also defined $\eta_k = \log(\phi_k / \phi_k) = 0$. To invert the link function and derive the response function, we therefore have that
$$e^{\eta_i} = \frac{\phi_i}{\phi_k}$$
$$\phi_k e^{\eta_i} = \phi_i \tag{2}$$
$$\phi_k \sum_{i=1}^{k} e^{\eta_i} = \sum_{i=1}^{k} \phi_i = 1$$
This implies that $\phi_k = 1 / \sum_{i=1}^{k} e^{\eta_i}$, which can be substituted back into Equation (2) to give the response function:
$$\phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^{k} e^{\eta_j}}$$
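This response function is easy to implement; the max-subtraction below is a standard numerical-stability trick, not part of the derivation:

```python
import numpy as np

def softmax(eta):
    """phi_i = exp(eta_i) / sum_j exp(eta_j); shifting by max(eta) avoids overflow."""
    z = np.exp(eta - np.max(eta))
    return z / z.sum()

eta = np.array([2.0, 1.0, 0.0])   # eta_k = 0 by the convention above
print(softmax(eta))               # probabilities that sum to 1
```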
Softmax Regression
This function mapping from the $\eta$'s to the $\phi$'s is called the softmax function.
To complete our model, we use Assumption 3, given earlier, that the $\eta_i$'s are linearly related to the $x$'s. So, we have $\eta_i = \theta_i^T x$ (for $i = 1, \ldots, k-1$), where $\theta_1, \ldots, \theta_{k-1} \in \mathbb{R}^{d+1}$ are the parameters of our model. For notational convenience, we can also define $\theta_k = 0$, so that $\eta_k = \theta_k^T x = 0$, as given previously. Hence, the conditional distribution of $y$ given $x$ is:
$$p(y = i \mid x; \theta) = \phi_i = \frac{e^{\theta_i^T x}}{\sum_{j=1}^{k} e^{\theta_j^T x}} \tag{3}$$
Softmax Regression
This model, which applies to classification problems where $y \in \{1, \ldots, k\}$, is called softmax regression. Our hypothesis will output:
$$h_\theta(x) = \mathrm{E}[T(y) \mid x; \theta] = \begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_{k-1} \end{bmatrix} = \begin{bmatrix} \dfrac{\exp(\theta_1^T x)}{\sum_{j=1}^{k} \exp(\theta_j^T x)} \\ \dfrac{\exp(\theta_2^T x)}{\sum_{j=1}^{k} \exp(\theta_j^T x)} \\ \vdots \\ \dfrac{\exp(\theta_{k-1}^T x)}{\sum_{j=1}^{k} \exp(\theta_j^T x)} \end{bmatrix}$$
In other words, our hypothesis outputs the estimated probability $p(y = i \mid x; \theta)$ for each value $i = 1, \ldots, k-1$ (the probability of $y = k$ follows as $1 - \sum_{i=1}^{k-1} \phi_i$).
Softmax Regression
If we have a training set of $n$ examples $\{(x^{(i)}, y^{(i)}); i = 1, \ldots, n\}$ and would like to learn the parameters $\theta_i$ of this model, we would begin by writing down the log-likelihood
$$\ell(\theta) = \sum_{i=1}^{n} \log p(y^{(i)} \mid x^{(i)}; \theta)$$
$$= \sum_{i=1}^{n} \log \prod_{l=1}^{k} \left( \frac{e^{\theta_l^T x^{(i)}}}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}} \right)^{1\{y^{(i)} = l\}}$$
To obtain the second line above, we used the definition for ( ∣ ; ) given in Equation (3). We
can now obtain the maximum likelihood estimate of the parameters by maximizing ℓ( ) in terms
of , using a method such as gradient ascent or Newton's method.
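A minimal gradient-ascent sketch on synthetic data. The gradient $\nabla_{\theta_l} \ell(\theta) = \sum_i (1\{y^{(i)} = l\} - \phi_l^{(i)}) x^{(i)}$ is standard but not derived in this excerpt, and the labels below are coded $0, \ldots, k-1$ rather than $1, \ldots, k$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 3, 4
X = rng.normal(size=(n, d))
y = rng.integers(0, k, size=n)          # labels in {0, ..., k-1}

Theta = np.zeros((k, d))                # one parameter vector per class

def probs(Theta, X):
    """phi[i, l] = exp(theta_l^T x_i) / sum_j exp(theta_j^T x_i)."""
    Z = X @ Theta.T
    Z -= Z.max(axis=1, keepdims=True)   # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

Y = np.eye(k)[y]                        # one-hot rows: Y[i, l] = 1{y_i = l}
alpha = 0.1
for _ in range(100):                    # gradient ascent on the log-likelihood
    P = probs(Theta, X)
    Theta += alpha / n * (Y - P).T @ X  # grad wrt theta_l: sum_i (1{y_i=l} - phi_il) x_i

loglik = np.sum(np.log(probs(Theta, X)[np.arange(n), y]))
print(loglik)                           # increases over the iterations
```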
Generative Models for Classification
Introduction
• Previous algorithms modeled $p(y \mid x; \theta)$, the conditional distribution of $y$ given $x$.
• When there is substantial separation between the two classes, the parameter estimates for
the logistic regression model are surprisingly unstable.
• If the distribution of the predictors is approximately normal in each of the classes and
the sample size is small, then the approaches in this section may be more accurate than
logistic regression.
• The methods can be naturally extended to the case of more than two response classes; in other words, we can fit a class-conditional model for each of the $k$ classes separately.
Discriminative learning VS generative learning
$$p(y = c \mid x; \theta) = \frac{p(x \mid y = c; \theta)\, p(y = c; \theta)}{\sum_{c'} p(x \mid y = c'; \theta)\, p(y = c'; \theta)}$$
The term $p(y = c; \theta)$ is the prior over class labels, and the term $p(x \mid y = c; \theta)$ is called the class-conditional density for class $c$.
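Numerically, the generative route is just Bayes' rule over the classes (a sketch with made-up density and prior values):

```python
import numpy as np

def posterior(class_cond, prior):
    """p(y = c | x) via Bayes' rule, from class-conditional densities and priors."""
    joint = class_cond * prior     # p(x | y = c) p(y = c) for each class c
    return joint / joint.sum()     # normalize over classes

class_cond = np.array([0.05, 0.20])   # hypothetical p(x | y = c) at one point x
prior = np.array([0.7, 0.3])          # hypothetical class priors
print(posterior(class_cond, prior))   # [0.368..., 0.631...]
```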
Gaussian discriminant analysis
• Assume that $p(x \mid y)$ is distributed according to a multivariate normal distribution:
$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right).$$
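A direct implementation of this density (a sketch; np.linalg.solve applies $\Sigma^{-1}$ without forming the inverse explicitly):

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Multivariate normal density N(x; mu, Sigma) for x, mu in R^d."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

x = np.array([0.5, -0.2])
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
print(mvn_density(x, mu, Sigma))
```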
Density of a Gaussian distribution
[Figure: density plots of a two-dimensional Gaussian with covariance $\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $\Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}$, and $\Sigma = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1 \end{bmatrix}$; larger off-diagonal entries correlate the two coordinates.]
The Gaussian Discriminant Analysis model
When we have a classification problem in which the input features $x$ are continuous-valued random variables, we can then use the Gaussian Discriminant Analysis (GDA) model, which models $p(x \mid y)$ using a multivariate normal distribution. The model is:
$$y \sim \text{Bernoulli}(\phi)$$
$$x \mid y = 0 \sim \mathcal{N}(\mu_0, \Sigma)$$
$$x \mid y = 1 \sim \mathcal{N}(\mu_1, \Sigma)$$
Writing out the distributions:
$$p(y) = \phi^y (1 - \phi)^{1-y}$$
$$p(x \mid y = 0) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right)$$
$$p(x \mid y = 1) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) \right)$$
Model parameters: $\phi$, $\Sigma$, $\mu_0$, $\mu_1$. Note that the two classes have different means but share a single covariance matrix $\Sigma$.
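The maximum-likelihood estimates of these parameters are not derived in this excerpt, but they are the intuitive ones: $\phi$ is the fraction of $y = 1$ examples, $\mu_0$ and $\mu_1$ are the per-class feature means, and $\Sigma$ is the covariance pooled across both classes. A sketch on synthetic data:

```python
import numpy as np

def fit_gda(X, y):
    """Maximum-likelihood estimates of the GDA parameters (phi, mu0, mu1, Sigma)."""
    phi = y.mean()                            # fraction of positive examples
    mu0 = X[y == 0].mean(axis=0)              # mean of class-0 features
    mu1 = X[y == 1].mean(axis=0)              # mean of class-1 features
    mus = np.where(y[:, None] == 1, mu1, mu0) # mu_{y^(i)} for each example
    diff = X - mus                            # x^(i) - mu_{y^(i)}
    Sigma = diff.T @ diff / len(y)            # shared (pooled) covariance
    return phi, mu0, mu1, Sigma

rng = np.random.default_rng(0)
X0 = rng.normal(loc=[-1, -1], size=(50, 2))   # class-0 samples
X1 = rng.normal(loc=[+1, +1], size=(50, 2))   # class-1 samples
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)
print(fit_gda(X, y))
```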