Class - Lecture 5 & 6
iii. The response variable is not continuous, but discrete/categorical. Linear regression assumes
a normal distribution for the response variable, which applies only to continuous data. If we
try to build a linear regression model on a discrete/binary Y variable, the model can predict
negative values for the response variable, which is inappropriate.
Ordinal Variables
These variables are made up of ordered categories. They include rank and Likert-item variables,
although they are not limited to these.
Although ordinal variables look like numbers, the distances between their values aren’t equal in a
true numerical sense. So it doesn’t make sense to apply numerical operations like addition and
division to them. Hence means, the basis of linear models, don’t really compute.
Like unordered categorical variables, ordinal variables require specialized logistic or probit models,
such as the proportional odds model. There are a few other types of ordinal models, but the
proportional odds model is most commonly available.
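As a brief illustration, the following is a minimal sketch of a proportional odds fit using statsmodels' OrderedModel; the three-category response and single predictor are simulated purely for illustration.

```python
# Minimal sketch: proportional odds (ordinal logistic) model on simulated data.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
# Simulate an ordered three-category response driven by a latent score
latent = 0.8 * x + rng.logistic(size=n)
y = pd.Series(pd.cut(latent, bins=[-np.inf, -0.5, 0.5, np.inf],
                     labels=["low", "medium", "high"]))

model = OrderedModel(y, x[:, None], distr="logit")  # proportional odds
result = model.fit(method="bfgs", disp=False)
print(result.summary())
```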
Count Variables
Discrete counts fail the assumptions of linear models for many reasons. The
most obvious is that the normal distribution of linear models allows any value
on the number scale, but counts are bounded at 0. It just doesn’t make sense to
predict negative numbers of cigarettes smoked each day, children in a family,
or aggressive incidents.
But Poisson regression and related models, such as negative binomial regression, are designed to
model count data accurately.
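A minimal sketch of this, assuming simulated overdispersed counts: fit a Poisson GLM and a negative binomial GLM with statsmodels and compare deviances.

```python
# Minimal sketch: Poisson vs. negative binomial GLM on overdispersed counts.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300
x = rng.uniform(0, 2, size=n)
mu = np.exp(0.5 + 0.8 * x)
# Negative binomial counts with mean mu: overdispersed relative to Poisson
y = rng.negative_binomial(n=2, p=2 / (2 + mu))

X = sm.add_constant(x)
poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
negbin_fit = sm.GLM(y, X, family=sm.families.NegativeBinomial()).fit()
print("Poisson deviance:", poisson_fit.deviance)
print("Negative binomial deviance:", negbin_fit.deviance)
```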
Proportions
Proportions, bounded at 0 and 1, or percentages, bounded at 0 and 100, really become problematic
if much of the data are close to the bounds.
If all the data fall in the middle portion, say in the .2 to .8 range, a linear
model can give reasonably good results. But beyond that, you need to either
use a beta regression if the proportion is continuous or logistic regression if
the proportion measures discrete events with a certain outcome (proportion
of questions answered correctly).
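For the discrete-events case, a hedged sketch: statsmodels' binomial GLM accepts the response as (successes, failures) pairs, suitable for a proportion such as questions answered correctly; the data below are simulated.

```python
# Minimal sketch: logistic regression for proportions of discrete events.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n_students, n_questions = 150, 20
ability = rng.normal(size=n_students)
p = 1 / (1 + np.exp(-(0.2 + 1.0 * ability)))   # true success probability
correct = rng.binomial(n_questions, p)          # questions answered correctly

endog = np.column_stack([correct, n_questions - correct])  # (successes, failures)
X = sm.add_constant(ability)
fit = sm.GLM(endog, X, family=sm.families.Binomial()).fit()
print(fit.params)   # coefficients on the logit scale
```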
Most of the models we’ve described here fit into the family of regression models called
Generalized Linear Models.
Assumptions of GLM
i. The data $Y_1, Y_2, \ldots, Y_n$ are independently distributed, i.e., cases are independent.
ii. The dependent variable $Y_i$ does NOT need to be normally distributed, but it typically
assumes a distribution from an exponential family (e.g., binomial, Poisson, multinomial,
normal, etc.).
iii. A GLM does NOT assume a linear relationship between the response variable and the
explanatory variables, but it does assume a linear relationship between the transformed
expected response, in terms of the link function, and the explanatory variables; e.g., for binary
logistic regression, $\text{logit}(\pi) = \beta_0 + \beta_1 x$.
iv. Explanatory variables can be nonlinear transformations of some original variables.
v. The homogeneity of variance does NOT need to be satisfied. In fact, it is not even possible in
many cases given the model structure.
vi. Errors need to be independent but NOT normally distributed.
vii. Parameter estimation uses maximum likelihood estimation (MLE) rather than ordinary
least squares (OLS); a short fitted example follows this list.
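To make assumptions iii and vii concrete, here is a minimal sketch (on simulated data) of fitting $\text{logit}(\pi) = \beta_0 + \beta_1 x$ by maximum likelihood.

```python
# Minimal sketch: binary logistic regression fit by MLE.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=500)
pi = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))   # true probabilities
y = rng.binomial(1, pi)                     # binary response

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()  # logit link is the default
print(fit.params)   # MLE estimates of (beta0, beta1), close to (-0.5, 1.2)
```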
A distribution belongs to the natural exponential family if its probability function can be written
as $f(y; \theta) = a(\theta)\, b(y)\, \exp[y\, Q(\theta)]$, where $Q(\theta)$ is called the
natural parameter. If $Q(\theta) = \theta$, the exponential family is said to be in canonical form.
As in the linear model, and in the logit and probit models, the regressors $X_{ij}$ are pre-specified
functions of the explanatory variables and therefore may include quantitative explanatory variables,
transformations of quantitative explanatory variables, polynomial regressors, dummy regressors,
interactions, and so on. Indeed, one of the advantages of GLMs is that the structure of the linear
predictor is the familiar structure of a linear model.
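A minimal sketch of this flexibility using statsmodels' formula interface; the variable names and data are purely illustrative.

```python
# Minimal sketch: transformations, a polynomial term, dummies, and an
# interaction inside one GLM linear predictor.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "y": rng.poisson(3, size=200),
    "x": rng.uniform(1, 5, size=200),
    "z": rng.normal(size=200),
    "group": rng.choice(["a", "b", "c"], size=200),
})

fit = smf.glm("y ~ np.log(x) + I(z**2) + C(group) + x:z",
              data=df, family=sm.families.Poisson()).fit()
print(fit.summary())
```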
Link function -- specifies the link between the random and the systematic components. It
indicates how the expected value of the response relates to the linear combination of explanatory
variables; e.g., $\eta = g(E(Y_i)) = E(Y_i)$ for classical regression, or
$\eta = \log\left(\frac{\pi}{1-\pi}\right) = \text{logit}(\pi)$ for logistic regression.
The link function is a function $g$ such that $\eta_i = g(\mu_i)$ for $i = 1, 2, \ldots, n$. The
function $g$ is assumed to be known, and is something the modeler picks.
If $\eta_i = g(\mu_i) = \mu_i$, the link function is called the identity link. If
$\eta_i = g(\mu_i) = Q(\theta_i)$, the link function is called the canonical link.
In a GLM, we wish to estimate the $\beta_j$'s. This in turn gives us an estimate for the
$g(\mu_i)$'s, which in turn gives us an estimate for the $\mu_i$'s.
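A small sketch of this chain of estimates for a Poisson GLM with the log link (simulated data): the fitted linear predictor gives the estimated $g(\mu_i)$'s, and applying the inverse link recovers the $\hat{\mu}_i$'s.

```python
# Minimal sketch: beta-hat -> g(mu)-hat -> mu-hat in a Poisson GLM.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(size=100)
y = rng.poisson(np.exp(1.0 + 0.5 * x))
X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

eta_hat = X @ fit.params                       # estimated g(mu_i): the linear predictor
mu_hat = np.exp(eta_hat)                       # inverse of the log link
assert np.allclose(mu_hat, fit.fittedvalues)   # matches statsmodels' fitted means
```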
In summary, a GLM is a linear model for a transformed mean of a response variable that has a
distribution in the natural exponential family.
• Random component: The distribution of Y is normal with mean $\mu$ and constant variance $\sigma^2$.
For binary logistic regression, the systematic component uses the logit:
$$\text{logit}(\pi_i) = \log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 x_i$$
Poisson Regression
models how the mean of a discrete response variable Y depends on a set of explanatory variables:
$$\log \lambda_i = \beta_0 + \beta_1 x_i$$
This is the special case of the binomial distribution with n = 1 (the Bernoulli distribution). We
can express the probability mass function as
$$f(y; \pi) = \pi^y (1-\pi)^{1-y} = (1-\pi)\left[\frac{\pi}{1-\pi}\right]^y = (1-\pi)\exp\left[y \log\left(\frac{\pi}{1-\pi}\right)\right]$$
for $y = 0$ and $1$. This is in the natural exponential family, identifying $\theta$ with $\pi$,
$a(\pi) = 1-\pi$, $b(y) = 1$, and $Q(\pi) = \log[\pi/(1-\pi)]$. The natural parameter
$\log[\pi/(1-\pi)]$ is the log odds (logit) of response outcome 1, the logit of $\pi$. This is the
canonical link function.
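A quick numeric sanity check of this identity (not from the text):

```python
# Check: pi**y * (1-pi)**(1-y) == (1-pi) * exp(y * log(pi/(1-pi))) for y in {0, 1}.
import numpy as np

pi = 0.3
for y in (0, 1):
    lhs = pi**y * (1 - pi)**(1 - y)
    rhs = (1 - pi) * np.exp(y * np.log(pi / (1 - pi)))
    assert np.isclose(lhs, rhs)
print("Bernoulli pmf matches its natural-exponential-family form.")
```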
In logistic regression, we take the link function to be the canonical link. That is, our systematic
component is
$$\log\left(\frac{\pi_i}{1-\pi_i}\right) = \sum_{j=1}^{p} \beta_j x_{ij}, \quad i = 1, 2, \ldots, n$$
Deviance of a GLM
The deviance is a key concept in generalized linear models. Intuitively, it measures the
discrepancy between the fitted generalized linear model and a perfect model for the sample. This
perfect model, known as the saturated model, fits the data perfectly, in the sense that the fitted
responses ($\hat{Y}_i$) equal the observed responses ($Y_i$). In other words, a saturated model has
as many parameters as there are observations, resulting in a perfect fit to the data. Deviance is a
measure of error; lower deviance means a better fit to the data. The greater the deviance, the
worse the model fits compared to the best case (the saturated model).
For a particular GLM with observations $y = (y_1, \ldots, y_N)$, let $L(\mu; y)$ denote the
log-likelihood function expressed in terms of the means $\mu = (\mu_1, \ldots, \mu_N)$. Let
$L(\hat{\mu}; y)$ denote the maximum of the log-likelihood for the model. Over all possible models,
the maximum achievable log-likelihood is $L(y; y)$. This occurs for the most general model, having
a separate parameter for each observation and the perfect fit $\hat{\mu} = y$. Such a model is
called the saturated model. This model is not useful, because it provides no data reduction.
However, it serves as a baseline for comparison with other model fits.
The deviance of a Poisson or binomial GLM is defined to be
$$-2\left[L(\hat{\mu}; y) - L(y; y)\right]$$
This is the likelihood-ratio statistic for testing the null hypothesis that the model holds against
the general alternative (i.e., the saturated model).
For some applications with Poisson and binomial GLMs, the number of observations N is fixed and
the individual counts are relatively large. Then the deviance has an approximate chi-square null
distribution with df = N − p, where p is the number of model parameters; that is, df equals the
difference between the numbers of parameters in the saturated model and in the unsaturated model.
The deviance then provides a test of model fit.
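A minimal sketch of such a test, on simulated data where the Poisson model holds by construction; per the caveat above, the chi-square approximation is trustworthy only when the individual counts are reasonably large.

```python
# Minimal sketch: deviance goodness-of-fit test for a Poisson GLM.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(6)
x = rng.uniform(0, 3, size=50)
y = rng.poisson(np.exp(1.0 + 0.4 * x))   # model holds by construction

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
df = fit.df_resid                        # N - p
p_value = stats.chi2.sf(fit.deviance, df)
print(f"deviance = {fit.deviance:.2f}, df = {df}, p = {p_value:.3f}")
```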
The logit is the natural parameter for the binomial distribution, so the logit link is its canonical link
function. Whereas $\pi(x)$ must fall in the (0, 1) range, the logit can be any real number. The real
numbers are also the range for linear predictors that form the systematic component of a GLM. So,
the model does not have the structural problem that the linear probability model has.
A 1-unit increase in $x_j$ has a multiplicative impact of $e^{\beta_j}$ on the mean. For example,
if $\beta_j = 0.3$, then $e^{0.3} \approx 1.35$, so each 1-unit increase in $x_j$ multiplies the
expected response by about 1.35.
N.B. In Generalized Linear Models (GLMs), an offset is a predictor variable with a known
coefficient of 1. It is used to model the effect of exposure or scale on the response variable.
Offsets are typically used when the response variable is a rate or a count, and we want to model the
rate or count while adjusting for the exposure or scale of each observation. For example, if we are
modeling disease rates per population size, the population size would be included as an offset
variable.
Mathematically, in a GLM, an offset is included in the linear predictor as an additional term with a
fixed coefficient of 1. This ensures that the predictor's effect on the response variable is
proportional to the offset variable.
For example, in the context of Poisson regression, if Y represents the count response variable, X
represents the predictor variable, and O represents the offset variable, the model would be
formulated as:
$$\log(E(Y)) = \beta_0 + \beta_1 X + \log(O)$$
where $E(Y)$ represents the expected count, and $\beta_0$ and $\beta_1$ are coefficients estimated
from the data.
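A minimal sketch of the disease-rate example, with simulated region-level data; log(population) enters as the offset with its coefficient fixed at 1.

```python
# Minimal sketch: Poisson regression with an offset for exposure.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 100
pop = rng.integers(1_000, 50_000, size=n)   # population size (exposure) per region
x = rng.normal(size=n)                      # a region-level predictor
rate = np.exp(-7.0 + 0.5 * x)               # true disease rate per person
y = rng.poisson(rate * pop)                 # observed disease counts

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson(),
             offset=np.log(pop)).fit()      # log(pop) has coefficient fixed at 1
print(fit.params)   # estimates of (beta0, beta1), close to (-7.0, 0.5)
```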
The independence model for a two-way contingency table, $\mu_{ij} = \mu \alpha_i \beta_j$, is a
multiplicative model, but a linear predictor for a GLM results using the log link,
$$\log \mu_{ij} = \lambda + \alpha_i^* + \beta_j^*,$$
where $\lambda = \log \mu$, $\alpha_i^* = \log \alpha_i$, $\beta_j^* = \log \beta_j$. This Poisson
loglinear model has additive main effects of the two classifications but no interaction.
Since the $\{Y_{ij}\}$ are independent, the total sample size $\sum_i \sum_j Y_{ij}$ has a Poisson
distribution with mean $\sum_i \sum_j \mu_{ij}$.
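A minimal sketch of fitting this independence loglinear model; the 2x3 table of counts below is made up for illustration.

```python
# Minimal sketch: Poisson loglinear model with row and column main effects only.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

table = pd.DataFrame({
    "row":   ["r1", "r1", "r1", "r2", "r2", "r2"],
    "col":   ["c1", "c2", "c3", "c1", "c2", "c3"],
    "count": [20, 35, 15, 40, 60, 30],
})

fit = smf.glm("count ~ C(row) + C(col)", data=table,
              family=sm.families.Poisson()).fit()
print(fit.fittedvalues)   # expected counts under independence (no interaction)
```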
Reference Book:
i. Agresti, A. (2019). An Introduction to Categorical Data Analysis, 3rd edition. John Wiley &
Sons, Inc.
<><><><><><><><><> End <><><><><><><><><>