Lectured by

STT553: Categorical Data Analysis (CDA)


Md. Kaderi Kibria, STT-HSTU

Lecture # 5 & 6 Generalized Linear Model

Objectives of this lecture:


After reading this unit, you should be able to
• understand the basic concepts of two-way contingency tables
• understand the normal approximation
• understand the delta method

What is a Generalized Linear Model?


The Generalized Linear Model (GLM) is a statistical modelling framework formulated by
John Nelder and Robert Wedderburn in 1972. It is an umbrella term that encompasses many
other models and allows the response variable y to have an error distribution other than the normal
distribution. The models include Linear Regression, Logistic Regression, and Poisson
Regression.
In a Linear Regression Model, the response variable 'y' is expressed as a linear function/linear
combination of all the predictors 'X' (aka independent/regressor/explanatory/observed variables).
The underlying relationship between the response and the predictors is assumed to be linear, and the
errors are assumed to be normally distributed.
GLMs allow us to build a linear relationship between a transformation of the mean response and the
predictors, even though the underlying relationship between the response and the predictors is not
linear. This is made possible by a link function, which links the mean of the response variable to a
linear model. Unlike in Linear Regression models, the error distribution of the response variable need
not be normal. The response variable is assumed to follow a distribution from the exponential family
(e.g. normal, binomial, Poisson, or gamma).
By contrast, the term "general" linear model (GLM) usually refers to conventional linear regression
models for a continuous response variable given continuous and/or categorical predictors. It includes
multiple linear regression, as well as ANOVA and ANCOVA (with fixed effects only).

Why Generalized Linear Model?


A Linear Regression model is not suitable if:
i. The relationship between X and Y is not linear. For example, Y increases exponentially as X
increases.
ii. The variance of the errors in Y is not constant but varies with X. (Constant error variance is
called homoscedasticity in Linear Regression; its violation is called heteroscedasticity.)
iii. The response variable is not continuous, but discrete/categorical. Linear Regression assumes a
normal distribution for the response variable, which applies only to continuous data. If we fit a
linear regression model to a discrete/binary Y variable, the model can predict values outside the
valid range of the response (for example, negative values), which is inappropriate.

When linear models don’t fit your data, now what?


When your dependent variable is not continuous, unbounded, and measured on an interval or ratio
scale, linear models don't fit. The data just will not meet the assumptions of linear models. But
there's good news: other models exist for many types of dependent variables.

Categorical Dependent Variables


Both binary (2 values) and multicategory (3 or more values) variables
clearly fail all three criteria. But there are other types of regression models
that work just fine for these variables.
For binary variables, probit and logistic regression models are the most
common. For multicategorical variables, use multinomial logistic
regression.

Ordinal Variables
These variables are made up of ordered categories. They include rank and Likert-item variables,
although they are not limited to these.
Although ordinal variables look like numbers, the distances between their values aren’t equal in a
true numerical sense. So it doesn’t make sense to apply numerical operations like addition and
division to them. Hence means, the basis of linear models, don’t really compute.
Like unordered categorical variables, ordinal variables require specialized logistic or probit models,
such as the proportional odds model. There are a few other types of ordinal models, but the
proportional odds model is most commonly available.

Count Variables
Discrete counts fail the assumptions of linear models for many reasons. The
most obvious is that the normal distribution of linear models allows any value
on the number scale, but counts are bounded at 0. It just doesn’t make sense to
predict negative numbers of cigarettes smoked each day, children in a family,
or aggressive incidents.

But Poisson regression, or related models like negative binomial, are designed to accurately model
count data.

Zero Inflated Variables


Zero-inflated data have a spike in the distribution at 0.
They are common in count (Poisson) data, but can occur with any distribution. A recent example I saw
was a set of scores on a depression scale. The scale ranges from 0 to 20, and 0 was by far the most
common value (which is a good thing for the state of humanity, but it really messes up the linear
model assumptions).
Even if the rest of the distribution is normal, you can't transform zero-inflated data to look normal.
A zero-inflated model, however, accommodates the high number of zeros by simultaneously
modeling 0/Not 0 with a logistic regression and the remaining values with another distribution.
Zero-inflated Poisson (ZIP) regression is used to model count data that have an excess of zero counts.
The theory suggests that the excess zeros are generated by a separate process from the count values
and that the excess zeros can be modeled independently. Thus, the ZIP model has two parts: a Poisson
count model and a logit model for predicting excess zeros.
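
The two-part structure of the ZIP model can be illustrated with a minimal Python sketch of its probability mass function. The names `zip_pmf`, `lam` and `psi`, and the parameter values, are hypothetical choices made for this illustration; `psi` stands for the probability of an excess (structural) zero, which in a full ZIP regression would come from the logit part of the model.

```python
import math


def poisson_pmf(k, lam):
    """Ordinary Poisson probability P(Y = k) with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def zip_pmf(k, lam, psi):
    """Zero-inflated Poisson probability P(Y = k).

    psi is the probability of an excess zero (assumed fixed here);
    with probability 1 - psi the count comes from a Poisson(lam).
    """
    base = (1.0 - psi) * poisson_pmf(k, lam)
    return psi + base if k == 0 else base


# Hypothetical parameter values, for illustration only.
lam, psi = 2.0, 0.3

p0_poisson = poisson_pmf(0, lam)   # zero probability without inflation
p0_zip = zip_pmf(0, lam, psi)      # zero probability with inflation

# Sum over a long enough range that the remaining tail is negligible.
total = sum(zip_pmf(k, lam, psi) for k in range(60))
zip_mean = sum(k * zip_pmf(k, lam, psi) for k in range(60))
```

In this sketch the zero probability is larger than under a plain Poisson distribution with the same `lam`, and the mean shrinks to `(1 - psi) * lam`.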

Proportions
Proportions, bounded at 0 and 1, or percentages, bounded at 0 and 100, really become problematic
if much of the data are close to the bounds.
If all the data fall in the middle portion, say in the .2 to .8 range, a linear
model can give reasonably good results. But beyond that, you need to either
use a beta regression if the proportion is continuous or logistic regression if
the proportion measures discrete events with a certain outcome (proportion
of questions answered correctly).

Most of the models we’ve described here fit into the family of regression models called
Generalized Linear Models.

Assumptions of GLM
i. The data $Y_1, Y_2, \ldots, Y_n$ are independently distributed, i.e., cases are independent.
ii. The dependent variable $Y_i$ does NOT need to be normally distributed, but it is typically
assumed to follow a distribution from an exponential family (e.g. binomial, Poisson, multinomial,
normal, etc.).
iii. A GLM does NOT assume a linear relationship between the response variable and the
explanatory variables, but it does assume a linear relationship between the transformed
expected response (in terms of the link function) and the explanatory variables; e.g., for binary
logistic regression, $\operatorname{logit}(\pi) = \beta_0 + \beta_1 x$.
iv. Explanatory variables can be nonlinear transformations of some original variables.
v. The homogeneity of variance does NOT need to be satisfied. In fact, it is not even possible in
many cases given the model structure.
vi. Errors need to be independent but NOT normally distributed.
vii. Parameter estimation uses maximum likelihood estimation (MLE) rather than ordinary
least squares (OLS).
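
To make assumption vii concrete, here is a minimal sketch of maximum likelihood estimation for a binary logistic regression with one predictor, using Newton-Raphson iterations. The data and the function name `fit_logistic` are hypothetical and chosen only for illustration; statistical software uses more robust variants of this idea.

```python
import math


def fit_logistic(x, y, iters=25):
    """Newton-Raphson MLE for logit(pi_i) = b0 + b1 * x_i (illustrative sketch)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = 0.0            # score vector (gradient of the log-likelihood)
        h00 = h01 = h11 = 0.0    # information matrix entries
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)    # Bernoulli variance at the current fit
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + (information matrix)^{-1} * score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical data: x is a predictor, y a binary outcome.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(x, y)

# At the MLE the score equations are (numerically) zero.
fitted = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
score0 = sum(yi - pi for yi, pi in zip(y, fitted))
score1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, fitted))
```

Unlike OLS, there is no closed-form solution here: the estimates are found iteratively by solving the likelihood (score) equations.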

The Generalized Linear Model


The term "general" linear model (GLM) usually refers to conventional linear regression models for
a continuous response variable given continuous and/or categorical predictors. It includes multiple
linear regression, as well as ANOVA and ANCOVA (with fixed effects only). The form is
$$y_i \sim N(x_i^T \beta, \sigma^2),$$
where $x_i$ contains known covariates and $\beta$ contains the coefficients to be estimated.
In a generalized linear model, the response variable $y_i$ is assumed to follow an exponential family
distribution with mean $\mu_i$, which is assumed to be some (often nonlinear) function of
$x_i^T \beta$. Some would call these models "nonlinear" because $\mu_i$ is often a nonlinear function
of the covariates.
Assume that you have n data points $(x_{i1}, \ldots, x_{ip}, y_i) \in \mathbb{R}^{p+1}$ for
$i = 1, 2, \ldots, n$. We want to build a model of the response y using the p other features
$X_1, \ldots, X_p$. Assume that the x values are all fixed throughout the discussion.
A GLM consists of three components:
1. A random component,
2. A systematic component, and
3. A link function.
Random component -- identifies the response variable Y and its probability distribution.
The random component of a GLM consists of a response variable Y with independent observations
$(y_1, \ldots, y_n)$ from a distribution in the natural exponential family. We assume that
$y_1, \ldots, y_n$ are samples of independent random variables $Y_1, \ldots, Y_n$ respectively. This
family has a probability density function or mass function of the form
$$f(y_i; \theta_i) = a(\theta_i)\, b(y_i) \exp[y_i Q(\theta_i)].$$
In the above, the form of f (and hence that of a, b and Q) is assumed to be known. What is
unknown are the $\theta_i$'s, which have to be estimated. The value of $\theta_i$ can vary across i.

The family of distributions above is known as an exponential family, and $Q(\theta)$ is called the
natural parameter. If $Q(\theta) = \theta$, the exponential family is said to be in canonical form.

Systematic component -- specifies the explanatory variables $(x_1, x_2, \ldots, x_p)$ in the model,
more specifically, their linear combination; e.g., $\beta_0 + \beta_1 x_1 + \beta_2 x_2$, as we have
seen in linear regression, and as we will see in logistic regression in this lesson.
The systematic (non-random) component relates a parameter $\eta$ to the predictors X. In a generalized
linear model, this is always done via
$$\eta_i = \beta^T x_i = \beta_1 x_{i1} + \cdots + \beta_p x_{ip} = \sum_{j=1}^{p} \beta_j x_{ij}.$$

As in the linear model, and in the logit and probit models, the regressors $x_{ij}$ are pre-specified
functions of the explanatory variables and therefore may include quantitative explanatory variables,
transformations of quantitative explanatory variables, polynomial regressors, dummy regressors,
interactions, and so on. Indeed, one of the advantages of GLMs is that the structure of the linear
predictor is the familiar structure of a linear model.

Link function -- specifies the link between the random and the systematic components. It
indicates how the expected value of the response relates to the linear combination of explanatory
variables; e.g., $\eta = g(E(Y_i)) = E(Y_i)$ for classical regression, or
$\eta = \log\left(\frac{\pi}{1-\pi}\right) = \operatorname{logit}(\pi)$ for logistic regression.
The link function is a function g such that $\eta_i = g(\mu_i)$ for $i = 1, 2, \ldots, n$. The
function g is assumed to be known, and is chosen by the data modeler.
If $\eta_i = g(\mu_i) = \mu_i$, the link function is called the identity link. If
$\eta_i = g(\mu_i) = Q(\theta_i)$, the link function is called the canonical link.
In a GLM, we wish to estimate the $\beta_j$'s. This in turn gives us an estimate for the
$g(\mu_i)$'s, which will give us an estimate for the $\mu_i$'s.

In summary, a GLM is a linear model for a transformed mean of a response variable that has a
distribution in the natural exponential family.
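
The relationship between the linear predictor, the link function and the mean can be sketched in a few lines of Python. The dictionary `LINKS`, the helper names and the coefficient values are hypothetical and serve only to illustrate $\eta = g(\mu)$ and $\mu = g^{-1}(\eta)$ for the identity, log and logit links.

```python
import math

# Each entry holds (link g, inverse link g^{-1}).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (
        lambda mu: math.log(mu / (1.0 - mu)),
        lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    ),
}


def linear_predictor(beta, x):
    """Systematic component: eta = sum_j beta_j * x_j."""
    return sum(b * xj for b, xj in zip(beta, x))


def glm_mean(beta, x, link):
    """Mean implied by the linear predictor: mu = g^{-1}(eta)."""
    _, inverse = LINKS[link]
    return inverse(linear_predictor(beta, x))


# Hypothetical coefficients; the leading 1.0 in x plays the role of the intercept term.
beta = [-1.0, 0.5]
x = [1.0, 3.0]

eta = linear_predictor(beta, x)  # -1.0 + 0.5 * 3.0
means = {name: glm_mean(beta, x, name) for name in LINKS}
```

The same linear predictor gives a different mean under each link: unrestricted under the identity link, positive under the log link, and inside (0, 1) under the logit link.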

Simple Linear Regression


Simple linear regression (SLR) models how the mean of a continuous response variable Y depends on
an explanatory variable x, where i indexes each observation:
$$\mu_i = \beta_0 + \beta_1 x_i$$

• Random component - Y has a normal distribution with mean $\mu$ and constant variance $\sigma^2$.

• Systematic component - x is the explanatory variable (can be continuous or discrete) and enters
through the predictor $\beta_0 + \beta_1 x$, which is linear in the parameters. This can be extended
to multiple linear regression, where we may have more than one explanatory variable, e.g.,
$(x_1, x_2, \ldots, x_k)$.

• Link function - the identity link, $\eta = g(E(Y)) = E(Y)$, is used; this is the simplest link
function.

Binary Logistic Regression


Binary logistic regression models how the odds of success for a binary response variable Y depend
on a set of explanatory variables:
$$\operatorname{logit}(\pi_i) = \log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 x_i$$

• Random component – The distribution of the response variable is assumed to be binomial
with a single trial and success probability $E(Y) = \pi$.

• Systematic component – x is the explanatory variable and is linear in the parameters. As
with the above example, this can be extended to multiple variables or non-linear
transformations of them.

• Link function – the log-odds or logit link, $\eta = g(\pi) = \log\left(\frac{\pi}{1-\pi}\right)$,
is used.

Poisson Regression
Poisson regression models how the mean of a discrete (count) response variable Y depends on a set of
explanatory variables:
$$\log \lambda_i = \beta_0 + \beta_1 x_i$$

• Random component – The distribution of Y is Poisson with mean $\lambda$.

• Systematic component – x is the explanatory variable and is linear in the parameters.

• Link function – the log link is used.

Example 1: Binomial Logit Models for Binary Data


Many response variables are binary. We represent the "success" and "failure" outcomes by 1 and 0.
A Bernoulli trial has probabilities $P(Y=1) = \pi$ and $P(Y=0) = 1-\pi$, for which $E(Y) = \pi$.
This is the special case of the binomial distribution with n=1. We can express the probability mass
function as
$$f(y; \pi) = \pi^y (1-\pi)^{1-y} = (1-\pi)\left[\frac{\pi}{1-\pi}\right]^y = (1-\pi)\exp\left[y \log\left(\frac{\pi}{1-\pi}\right)\right]$$
for y=0 and 1. This is in the natural exponential family, identifying $\theta$ with $\pi$,
$a(\pi) = 1-\pi$, $b(y) = 1$ and $Q(\pi) = \log[\pi/(1-\pi)]$. The natural parameter
$\log[\pi/(1-\pi)]$ is the log odds of response outcome 1, the logit of $\pi$. The logit is therefore
the canonical link function.
In logistic regression, we take the link function to be the canonical link. That is, our systematic
component is
$$\log\left(\frac{\pi_i}{1-\pi_i}\right) = \sum_{j=1}^{p} \beta_j x_{ij}, \quad i = 1, 2, \ldots, n.$$

Example 2: Poisson Loglinear Models for Count Data


Some response variables have counts as their possible outcomes. In a health survey, each
observation might be the number of illnesses in the past year for which the subject visited a doctor.
Counts also occur as entries in contingency tables.
The simplest distribution for count data is the Poisson, i.e. $Y_i \sim \text{Poisson}(\mu_i)$. The
Poisson probability mass function for a count Y is
$$f(y; \mu) = \frac{e^{-\mu}\mu^y}{y!} = \exp(-\mu)\left(\frac{1}{y!}\right)\exp[y \log \mu], \quad y = 0, 1, 2, \ldots$$
This has natural exponential form with $\theta = \mu$, $a(\mu) = \exp(-\mu)$, $b(y) = 1/y!$, and
$Q(\mu) = \log \mu$. The natural parameter is $\log \mu$, so the canonical link function is the log
link, $\eta = \log \mu$. The model using this link function is
$$\log \mu_i = \sum_j \beta_j x_{ij}, \quad i = 1, 2, \ldots, n.$$

This model is called a Poisson loglinear model.
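
The exponential family factorizations in Examples 1 and 2 can be checked numerically. The following Python sketch evaluates $a(\theta)\,b(y)\exp[y\,Q(\theta)]$ with the stated choices of a, b and Q and compares the result with the usual Bernoulli and Poisson formulas. The function names and the parameter values 0.3 and 2.5 are hypothetical, chosen only for illustration.

```python
import math


def expfam_pmf(y, theta, a, b, Q):
    """Natural exponential family form: a(theta) * b(y) * exp(y * Q(theta))."""
    return a(theta) * b(y) * math.exp(y * Q(theta))


def bernoulli_pmf(y, pi):
    """Direct Bernoulli formula pi^y (1 - pi)^(1 - y)."""
    return pi**y * (1.0 - pi) ** (1 - y)


def poisson_pmf(y, mu):
    """Direct Poisson formula exp(-mu) mu^y / y!."""
    return math.exp(-mu) * mu**y / math.factorial(y)


# Components identified in Example 1 (Bernoulli) and Example 2 (Poisson).
bernoulli_parts = dict(
    a=lambda pi: 1.0 - pi,
    b=lambda y: 1.0,
    Q=lambda pi: math.log(pi / (1.0 - pi)),  # logit: the natural parameter
)
poisson_parts = dict(
    a=lambda mu: math.exp(-mu),
    b=lambda y: 1.0 / math.factorial(y),
    Q=math.log,                               # log mean: the natural parameter
)

pi, mu = 0.3, 2.5  # hypothetical parameter values

# Largest absolute difference between the two ways of computing each pmf.
bernoulli_gap = max(
    abs(expfam_pmf(y, pi, **bernoulli_parts) - bernoulli_pmf(y, pi)) for y in (0, 1)
)
poisson_gap = max(
    abs(expfam_pmf(y, mu, **poisson_parts) - poisson_pmf(y, mu)) for y in range(15)
)
```

Both gaps are zero up to floating-point rounding, which confirms that the factorizations reproduce the original mass functions.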

Deviance of a GLM
The deviance is a key concept in generalized linear models. Intuitively, it measures how far the
fitted generalized linear model is from a perfect model for the sample. This perfect
model, known as the saturated model, is the model that perfectly fits the data, in the sense that the
fitted responses ($\hat{Y}_i$) equal the observed responses ($Y_i$). In other words, a saturated model
has as many parameters as there are observations, resulting in a perfect fit
to the data. Deviance is a measure of error; lower deviance means a better fit to the data. The
greater the deviance, the worse the model fits compared to the best case (saturated).
For a particular GLM with observations $y = (y_1, \ldots, y_N)$, let $L(\mu; y)$ denote the
log-likelihood function expressed in terms of the means $\mu = (\mu_1, \ldots, \mu_N)$. Let
$L(\hat{\mu}; y)$ denote the maximum of the log likelihood for the model. Over all possible models,
the maximum achievable log likelihood is $L(y; y)$. This occurs for the most general model, having a
separate parameter for each observation and the perfect fit $\hat{\mu} = y$. Such a model is called
the saturated model. This model is not useful, because it does not provide data reduction. However,
it serves as a baseline for comparison with other model fits.
The deviance of a Poisson or binomial GLM is defined to be
$$-2[L(\hat{\mu}; y) - L(y; y)].$$
This is the likelihood-ratio statistic for testing the null hypothesis that the model holds against the
general alternative (i.e. the saturated model).
For some applications with Poisson and binomial GLMs, the number of observations N is fixed and
the individual counts are relatively large. Then the deviance has an approximate chi-squared null
distribution with df = N − p, where p is the number of model parameters; that is, df equals the
difference between the numbers of parameters in the saturated model and in the unsaturated model. The
deviance then provides a test of model fit.
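
As a small worked illustration, the following Python sketch computes the deviance of an intercept-only Poisson model directly from the definition above and compares it with the closed form $2\sum_i [y_i \log(y_i/\hat{\mu}_i) - (y_i - \hat{\mu}_i)]$ that the definition reduces to for the Poisson distribution. The counts are hypothetical; for the intercept-only model the fitted mean is the sample mean.

```python
import math


def poisson_loglik(mu, y):
    """Poisson log-likelihood L(mu; y), using the convention 0 * log(0) = 0."""
    total = 0.0
    for m, yi in zip(mu, y):
        term = yi * math.log(m) if yi > 0 else 0.0
        total += term - m - math.lgamma(yi + 1)  # lgamma(y + 1) = log(y!)
    return total


def poisson_deviance(mu_hat, y):
    """Deviance from the definition: -2 [L(mu_hat; y) - L(y; y)]."""
    return -2.0 * (poisson_loglik(mu_hat, y) - poisson_loglik(y, y))


def poisson_deviance_closed_form(mu_hat, y):
    """Closed form: 2 * sum[y log(y / mu_hat) - (y - mu_hat)]."""
    total = 0.0
    for m, yi in zip(mu_hat, y):
        if yi > 0:
            total += yi * math.log(yi / m)
        total -= yi - m
    return 2.0 * total


# Hypothetical counts; the intercept-only fit uses the sample mean for every observation.
y = [2, 0, 5, 3, 7]
mu_hat = [sum(y) / len(y)] * len(y)

deviance = poisson_deviance(mu_hat, y)
deviance_check = poisson_deviance_closed_form(mu_hat, y)
df = len(y) - 1  # N - p, with p = 1 parameter (the intercept)
```

The saturated model, with fitted means equal to the observations, has deviance 0; any simpler model has a non-negative deviance.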

Generalized Linear Models for Binary Data


Let Y denote a binary response variable, such as the result of a medical treatment (success, failure).
Each observation has one of two outcomes, denoted by 1 and 0, which we treat as a binomial variate
for a single Bernoulli trial. The mean is $E(Y) = P(Y=1)$. We denote $P(Y=1)$ by $\pi(x)$,
reflecting its dependence on the values $x = (x_1, \ldots, x_p)$ of the explanatory variables. The
variance of Y is
$$\operatorname{var}(Y) = \pi(x)[1 - \pi(x)],$$
which is the binomial variance for n=1.

Linear Probability Model


One approach to modeling the effect of X uses the form of ordinary regression, by which the
expected value of Y is a linear function of X. For a binary response variable $Y_i$, this is called
the linear probability model. In the linear probability model we have
$$E(Y \mid X_1, X_2, \ldots, X_k) = P(Y=1 \mid X_1, X_2, \ldots, X_k),$$
where
$$\pi(x) = P(Y=1 \mid X_1, X_2, \ldots, X_k) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k. \quad (1)$$
With independent observations, it is a GLM with binomial random component and identity link
function. Thus, $\beta_j$ can be interpreted as the change in the probability that $Y_i = 1$ for a
one-unit change in $X_j$, holding constant the other k−1 regressors.
Suppose the regression model is as follows:
$$Y_i = \beta_1 + \beta_2 X_i + u_i,$$
where X = family income and Y = 1 if the family owns a house and 0 if it does not own a
house.
The above model looks like a typical linear regression model. Because the regressand is binary, or
dichotomous, it is called a linear probability model (LPM).
In this model, the conditional expectation of $Y_i$ given $X_i$, $E(Y_i \mid X_i)$, can be
interpreted as the conditional probability that the event will occur given $X_i$, that is,
$\Pr(Y_i = 1 \mid X_i)$.
In our example, $E(Y_i \mid X_i)$ gives the probability that a family whose income is the given
amount $X_i$ owns a house.

Logistic Regression Model


Usually, binary data result from a nonlinear relationship between $\pi(x)$ and x. A fixed change in x
often has less impact when $\pi(x)$ is near 0 or 1 than when $\pi(x)$ is near 0.50.
In practice, nonlinear relationships between $\pi(x)$ and x are often monotonic, with $\pi(x)$
increasing continuously or decreasing continuously as x increases. The most important
model for such an S-shaped curve,
$$\pi(x) = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}, \quad (2)$$
is the logistic regression model. As x increases, $\pi(x)$ increases when $\beta > 0$ and decreases
when $\beta < 0$.
Let's find the link function for which logistic regression is a GLM. For (2) extended to multiple
predictors, the odds are
$$\frac{\pi(x)}{1-\pi(x)} = \exp(\alpha + \beta_1 x_1 + \cdots + \beta_p x_p).$$
The log odds has the linear relationship
$$\log\frac{\pi(x)}{1-\pi(x)} = \alpha + \beta_1 x_1 + \cdots + \beta_p x_p.$$
Thus, the appropriate link is the log odds transformation, the logit. Logistic regression models are
GLMs with binomial random component and logit link function.

The logit is the natural parameter for the binomial distribution, so the logit link is its canonical link
function. Whereas $\pi(x)$ must fall in the (0,1) range, the logit can be any real number. The real
numbers are also the range for linear predictors that form the systematic component of a GLM. So,
the model does not have the structural problem of the linear probability model, whose linear
predictor can yield probabilities below 0 or above 1.
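
The contrast between the two models can be seen in a short Python sketch. The coefficients and x values below are hypothetical and were picked only so that the linear probability model leaves the (0, 1) range; they are not estimates from any data set.

```python
import math


def lpm_probability(alpha, beta, x):
    """Linear probability model: pi(x) = alpha + beta * x (identity link)."""
    return alpha + beta * x


def logistic_probability(alpha, beta, x):
    """Logistic model: pi(x) = exp(eta) / (1 + exp(eta)), with eta = alpha + beta * x."""
    eta = alpha + beta * x
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficients for each model.
lpm_alpha, lpm_beta = -0.2, 0.1
logit_alpha, logit_beta = -3.0, 0.5

xs = [0, 2, 6, 10, 14]
lpm_fits = [lpm_probability(lpm_alpha, lpm_beta, x) for x in xs]
logistic_fits = [logistic_probability(logit_alpha, logit_beta, x) for x in xs]
```

With these values the linear probability model gives a negative "probability" at x = 0 and one above 1 at x = 14, while every logistic value stays strictly between 0 and 1 and increases with x because the slope is positive.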

Binomial GLM for 2x2 Contingency Tables


Among the simplest GLMs for a binary response is the one having a single explanatory variable X
that is also binary. Label its values by 0 and 1. For a given link function, the GLM
$$\text{link}[\pi(x)] = \alpha + \beta x$$
has the effect of x described by
$$\beta = \text{link}[\pi(1)] - \text{link}[\pi(0)].$$
For the identity link, $\beta = \pi(1) - \pi(0)$ is the difference between proportions. For the log
link, $\beta = \log[\pi(1)] - \log[\pi(0)] = \log[\pi(1)/\pi(0)]$ is the log relative risk.
For the logit link,
$$\beta = \operatorname{logit}[\pi(1)] - \operatorname{logit}[\pi(0)] = \log\frac{\pi(1)}{1-\pi(1)} - \log\frac{\pi(0)}{1-\pi(0)} = \log\left[\frac{\pi(1)/(1-\pi(1))}{\pi(0)/(1-\pi(0))}\right]$$
is the log odds ratio. Measures of association for 2x2 tables are effect parameters in GLMs for binary
data.
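
The three interpretations of $\beta$ can be illustrated on a single 2x2 table. The counts below are hypothetical, and because a model with one binary predictor fits a 2x2 table perfectly, the sample proportions are used directly in place of $\pi(1)$ and $\pi(0)$.

```python
import math

# Hypothetical 2x2 table: for each value of x, (number of successes, number of failures).
table = {1: (30, 70), 0: (15, 85)}


def proportion(successes, failures):
    """Sample proportion of successes in one row of the table."""
    return successes / (successes + failures)


def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))


pi1 = proportion(*table[1])  # proportion of successes when x = 1
pi0 = proportion(*table[0])  # proportion of successes when x = 0

beta_identity = pi1 - pi0                  # difference between proportions
beta_log = math.log(pi1) - math.log(pi0)   # log relative risk
beta_logit = logit(pi1) - logit(pi0)       # log odds ratio

relative_risk = math.exp(beta_log)
odds_ratio = math.exp(beta_logit)
```

For this table the odds ratio equals the cross-product ratio (30 × 85) / (70 × 15), and the relative risk equals 0.30 / 0.15.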

Generalized Linear Model for Counts and Rates


The best known GLMs for count data assume a Poisson distribution for Y. We will use Poisson
GLMs for counts in contingency tables with categorical response variables. We first introduce
Poisson GLMs to model count or rate data for a single non-negative integer valued response
variable.

Poisson Loglinear Models


The Poisson distribution has a positive mean $\mu$. Although a GLM can model a positive mean
using the identity link, it is more common to model the log of the mean. Like the linear predictor,
the log mean can take any real value. The log mean is the natural parameter for the Poisson
distribution, and the log link is the canonical link for a Poisson GLM. A Poisson loglinear GLM
assumes a Poisson distribution for Y and uses the log link.
The Poisson loglinear model with explanatory variables $x = (x_1, \ldots, x_p)$ is
$$\log \mu(x) = \alpha + \beta_1 x_1 + \cdots + \beta_p x_p.$$
For this model, the mean satisfies the exponential relationship
$$\mu(x) = \exp(\alpha + \beta_1 x_1 + \cdots + \beta_p x_p) = e^{\alpha} (e^{\beta_1})^{x_1} \cdots (e^{\beta_p})^{x_p}.$$
A 1-unit increase in $x_j$ has a multiplicative impact of $e^{\beta_j}$ on the mean: the mean at
$x_j + 1$ equals the mean at $x_j$ multiplied by $e^{\beta_j}$, with the other variables held fixed.

Poisson Regression for Rates Using Offsets


Often a response count $Y_i$ has an index $t_i$ such that its expected value is proportional to
$t_i$. For instance, this index could be an amount of time or a spatial area over which the count is
made. Then, the sample rate is $y_i/t_i$, with expected value $\mu_i/t_i$. With explanatory variables
x, a loglinear model for the expected rate has the form
$$\log(\mu_i/t_i) = \alpha + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}. \quad (3)$$
This model has the equivalent representation
$$\log \mu_i - \log t_i = \alpha + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}.$$
The adjustment term, $-\log t_i$, to the log link of the mean is called an offset. The fit
corresponds to using $\log t_i$ as a predictor on the right-hand side and forcing its coefficient to
equal 1.
For model (3), the expected response count satisfies
$$\mu_i = t_i \exp(\alpha + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}).$$
The mean is proportional to the index $t_i$, with a proportionality constant that depends on the
values of the explanatory variables. The identity link is also sometimes useful. The model is then
$$\mu_i/t_i = \alpha + \beta_1 x_{i1} + \cdots + \beta_p x_{ip},$$
or
$$\mu_i = \alpha t_i + \beta_1 x_{i1} t_i + \cdots + \beta_p x_{ip} t_i.$$

N.B. In Generalized Linear Models (GLMs), an offset is a term in the linear predictor with a known
coefficient of 1. It is used to account for the exposure or scale of each observation.
Offsets are typically used when the response variable is a count and we want to model the
rate while adjusting for the exposure or scale of each observation. For example, if we are
modeling disease rates per population size with a log link, the log of the population size would be
included as the offset.
Mathematically, in a GLM, an offset is included in the linear predictor as an additional term with a
fixed coefficient of 1. With a log link, this ensures that the expected count is proportional to the
exposure.
For example, in the context of Poisson regression, if Y represents the count response variable, X
represents the predictor variable, and O represents the exposure variable, the model would be
formulated as:
$$\log(E(Y)) = \beta_0 + \beta_1 X + \log(O),$$
where E(Y) represents the expected count, and $\beta_0$ and $\beta_1$ are coefficients estimated from
the data.
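
The effect of an offset can be sketched numerically. In the Python snippet below, the coefficients, the predictor values and the exposures (for example, population sizes) are hypothetical; the function `expected_count` simply evaluates $E(Y) = O \exp(\beta_0 + \beta_1 X)$.

```python
import math


def expected_count(beta0, beta1, x, exposure):
    """Expected count under a Poisson log-link model with offset log(exposure)."""
    eta = beta0 + beta1 * x + math.log(exposure)  # offset enters with coefficient fixed at 1
    return math.exp(eta)


# Hypothetical coefficients, for illustration only.
beta0, beta1 = -4.0, 0.3

# Same predictor value, two different exposures.
mu_small = expected_count(beta0, beta1, 2.0, 1000.0)
mu_large = expected_count(beta0, beta1, 2.0, 2000.0)

# The expected rate (count per unit of exposure) does not depend on the exposure.
rate_small = mu_small / 1000.0
rate_large = mu_large / 2000.0

# A 1-unit increase in x multiplies the expected count (and rate) by exp(beta1).
rate_ratio = expected_count(beta0, beta1, 3.0, 1000.0) / mu_small
```

Doubling the exposure doubles the expected count but leaves the rate unchanged, which is exactly what fixing the coefficient of the log exposure at 1 is meant to achieve.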

Poisson GLM of Independence in Two-Way Contingency Tables


Poisson loglinear models are also used to model counts in ordinary contingency tables. We illustrate
for two-way tables with independent counts $\{Y_{ij}\}$ having Poisson distributions with means
$\{\mu_{ij}\}$. Suppose that the $\{\mu_{ij}\}$ satisfy
$$\mu_{ij} = \mu \alpha_i \beta_j,$$
where $\{\alpha_i\}$ and $\{\beta_j\}$ are positive constants satisfying
$\sum_i \alpha_i = \sum_j \beta_j = 1$. This is a multiplicative model, but a linear predictor for a
GLM results using the log link,
$$\log \mu_{ij} = \lambda + \alpha_i^* + \beta_j^*,$$
where $\lambda = \log \mu$, $\alpha_i^* = \log \alpha_i$, $\beta_j^* = \log \beta_j$. This Poisson
loglinear model has additive main effects of the two classifications but no interaction.
Since the $\{Y_{ij}\}$ are independent, the total sample size $\sum_i \sum_j Y_{ij}$ has a Poisson
distribution with mean $\sum_i \sum_j \mu_{ij} = \mu$. Conditional on $\sum_i \sum_j y_{ij} = n$,
the cell counts have a multinomial distribution with probabilities
$\{\pi_{ij} = \mu_{ij}/\mu = \alpha_i \beta_j = \pi_{i+} \pi_{+j}\}$. This is independence between the
two categorical variables.
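
The link between the multiplicative model and independence can be verified with a small numerical sketch. The total mean `mu` and the row and column constants below are hypothetical values for a 3x2 table, chosen so that each set of constants sums to 1.

```python
# Hypothetical constants for a 3x2 table; each set sums to 1.
mu = 200.0
alpha = [0.2, 0.5, 0.3]   # row constants alpha_i
beta = [0.6, 0.4]         # column constants beta_j

# Cell means under the multiplicative model mu_ij = mu * alpha_i * beta_j.
means = [[mu * a * b for b in beta] for a in alpha]

# Cell probabilities pi_ij = mu_ij / mu, and their row and column margins.
total = sum(sum(row) for row in means)
probs = [[m / total for m in row] for row in means]
row_margins = [sum(row) for row in probs]
col_margins = [sum(probs[i][j] for i in range(len(alpha))) for j in range(len(beta))]

# Independence: every cell probability equals the product of its margins.
max_gap = max(
    abs(probs[i][j] - row_margins[i] * col_margins[j])
    for i in range(len(alpha))
    for j in range(len(beta))
)
```

In this sketch the margins recover the constants, so $\pi_{i+} = \alpha_i$ and $\pi_{+j} = \beta_j$, and `max_gap` is zero up to rounding.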

Reference Book:
i. Agresti, A. (2019). An Introduction to Categorical Data Analysis, 3rd edition. John Wiley & Sons,
Inc.
<><><><><><><><><> End <><><><><><><><><>
