Generalized Linear Models
The usual linear regression model assumes that the study variable is normally distributed, whereas the nonlinear logistic and Poisson regressions are based on the Bernoulli and Poisson distributions, respectively, of the study variable. As in logistic and Poisson regressions, the study variable can follow other probability distributions such as the exponential, gamma, inverse normal, etc. One family describing such distributions is the exponential family of distributions. The generalized linear model is based on this family and unifies linear and nonlinear regression models. It assumes that the distribution of the study variable is a member of the exponential family of distributions, i.e., its probability density (or mass) function can be written as
$$f(x; \theta) = \exp\left[ a(x)\, b(\theta) + c(\theta) + d(x) \right].$$
If $a(X) = X$, the distribution is said to be in canonical form. The function $b(\theta)$ is called the natural parameter of the distribution. The parameter $\theta$ is of interest, and all other parameters, which are not of interest, are called nuisance parameters.
Example:
Normal distribution
$$f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[ -\frac{1}{2\sigma^2}(x - \mu)^2 \right]; \quad -\infty < x < \infty,\; -\infty < \mu < \infty,\; \sigma^2 > 0$$
$$= \exp\left[ x\,\frac{\mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2} - \frac{1}{2}\ln\left(2\pi\sigma^2\right) - \frac{x^2}{2\sigma^2} \right].$$
Here $a(x) = x$ and $b(\theta) = \dfrac{\mu}{\sigma^2}$.
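As a quick numerical sanity check, the sketch below (Python with NumPy/SciPy; the values of $\mu$ and $\sigma$ are arbitrary) verifies that the exponential-family rewrite above reproduces the usual normal density:

```python
import numpy as np
from scipy.stats import norm

# Check numerically that the exponential-family rewrite of the normal
# density matches the usual form (parameter values chosen arbitrarily).
mu, sigma = 1.5, 2.0
x = np.linspace(-5.0, 8.0, 7)

usual = norm.pdf(x, loc=mu, scale=sigma)
# exp[ x*mu/sigma^2 - mu^2/(2 sigma^2) - (1/2) ln(2 pi sigma^2) - x^2/(2 sigma^2) ]
ef_form = np.exp(x * mu / sigma**2
                 - mu**2 / (2 * sigma**2)
                 - 0.5 * np.log(2 * np.pi * sigma**2)
                 - x**2 / (2 * sigma**2))

assert np.allclose(usual, ef_form)
```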
Let
$$U = \frac{dL}{d\theta};$$
then for any distribution
$$E(U) = 0$$
$$\operatorname{Var}(U) = E(U^2) = E(-U')$$
where $U' = \dfrac{dU}{d\theta}$. The function $U$ is called the score, and $\operatorname{Var}(U)$ is called the information.
The log-likelihood function is
$$L = \ln[f(X, \theta)] = a(X)\, b(\theta) + c(\theta) + d(X)$$
and then
$$U = \frac{dL}{d\theta} = a(X)\, b'(\theta) + c'(\theta)$$
$$U' = \frac{d^2 L}{d\theta^2} = a(X)\, b''(\theta) + c''(\theta)$$
where $b'(\theta) = \dfrac{db(\theta)}{d\theta}$, $b''(\theta) = \dfrac{d^2 b(\theta)}{d\theta^2}$, $c'(\theta) = \dfrac{dc(\theta)}{d\theta}$ and $c''(\theta) = \dfrac{d^2 c(\theta)}{d\theta^2}$.
Since $E(U) = 0$,
$$E[a(X)] = -\frac{c'(\theta)}{b'(\theta)}.$$
Further,
$$E(-U') = -b''(\theta)\, E[a(X)] - c''(\theta)$$
and, because $U = a(X)\, b'(\theta) + c'(\theta)$,
$$\operatorname{Var}(U) = [b'(\theta)]^2 \operatorname{Var}[a(X)] = E(-U')$$
$$\Rightarrow\; \operatorname{Var}[a(X)] = \frac{-b''(\theta)\, E[a(X)] - c''(\theta)}{[b'(\theta)]^2} = \frac{b''(\theta)\, c'(\theta) - c''(\theta)\, b'(\theta)}{[b'(\theta)]^3}.$$
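These identities can be illustrated by simulation. The sketch below takes the normal family with $\theta = \mu$ and known $\sigma^2$ as a working example, so that $U = (X - \mu)/\sigma^2$ and $-U' = 1/\sigma^2$; the sample size and parameter values are arbitrary:

```python
import numpy as np

# Monte Carlo illustration of E(U) = 0 and Var(U) = E(-U') for the
# N(mu, sigma^2) family with theta = mu (sigma^2 treated as known).
rng = np.random.default_rng(0)
mu, sigma, m = 1.5, 2.0, 200_000

x = rng.normal(mu, sigma, size=m)
U = (x - mu) / sigma**2          # score: dL/dmu = (x - mu)/sigma^2
minus_U_prime = 1 / sigma**2     # -U' = 1/sigma^2 (constant here)

print(U.mean())                  # approx 0
print(U.var(), minus_U_prime)    # both approx 1/sigma^2 = 0.25
```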
Now we consider two examples which illustrate how other distributions and their properties can be obtained as particular cases:
Binomial distribution
$$f(x, \pi) = \binom{n}{x} \pi^x (1 - \pi)^{n - x}, \quad x = 0, 1, \ldots, n.$$
Here $a(x) = x$, $\theta = \pi$, $b(\theta) = \ln\dfrac{\pi}{1 - \pi}$, $c(\theta) = n\ln(1 - \pi)$, $d(x) = \ln\dbinom{n}{x}$, and
$$L = \ln f(x, \pi) = x\ln\pi - x\ln(1 - \pi) + n\ln(1 - \pi) + \ln\binom{n}{x}.$$
This is the canonical form of $f(x, \pi)$ with natural parameter $\ln\dfrac{\pi}{1 - \pi}$.
Then
$$U = \frac{dL}{d\pi} = \frac{x - n\pi}{\pi(1 - \pi)}$$
and
$$E(U) = \frac{E(x) - n\pi}{\pi(1 - \pi)} = \frac{n\pi - n\pi}{\pi(1 - \pi)} = 0$$
$$\operatorname{Var}(U) = \frac{\operatorname{Var}(x)}{\pi^2(1 - \pi)^2} = \frac{n\pi(1 - \pi)}{\pi^2(1 - \pi)^2} = \frac{n}{\pi(1 - \pi)}$$
$$E(-U') = E\left[\frac{x}{\pi^2} + \frac{n - x}{(1 - \pi)^2}\right] = \frac{n}{\pi} + \frac{n}{1 - \pi} = \frac{n}{\pi(1 - \pi)}.$$
Also
$$b'(\theta) = b'(\pi) = \frac{1}{\pi(1 - \pi)}$$
$$b''(\theta) = b''(\pi) = \frac{2\pi - 1}{[\pi(1 - \pi)]^2}$$
$$c'(\theta) = c'(\pi) = -\frac{n}{1 - \pi}$$
$$c''(\theta) = c''(\pi) = -\frac{n}{(1 - \pi)^2}.$$
Thus
$$E[a(X)] = E(X) = -\frac{c'(\pi)}{b'(\pi)} = n\pi$$
$$\operatorname{Var}[a(X)] = \operatorname{Var}(X) = \frac{b''(\pi)\, c'(\pi) - c''(\pi)\, b'(\pi)}{[b'(\pi)]^3} = n\pi(1 - \pi).$$
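A quick numerical check of these results: plugging the derivatives $b'$, $b''$, $c'$, $c''$ into the general formulas recovers $E(X) = n\pi$ and $\operatorname{Var}(X) = n\pi(1 - \pi)$ (the values of $n$ and $\pi$ below are arbitrary):

```python
import numpy as np

# Numerical check of the binomial results above: plug b', b'', c', c''
# into E[a(X)] = -c'/b' and Var[a(X)] = (b''c' - c''b')/(b')^3.
n, pi = 10, 0.3

b1 = 1 / (pi * (1 - pi))                 # b'(pi)
b2 = (2 * pi - 1) / (pi * (1 - pi))**2   # b''(pi)
c1 = -n / (1 - pi)                       # c'(pi)
c2 = -n / (1 - pi)**2                    # c''(pi)

E_X = -c1 / b1
Var_X = (b2 * c1 - c2 * b1) / b1**3

assert np.isclose(E_X, n * pi)                 # n*pi = 3.0
assert np.isclose(Var_X, n * pi * (1 - pi))    # n*pi*(1-pi) = 2.1
```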
Poisson distribution
$$f(x, \lambda) = \frac{e^{-\lambda}\lambda^x}{x!}, \quad x = 0, 1, 2, \ldots$$
Here $a(x) = x$, $\theta = \lambda$, $b(\theta) = \ln\lambda$, $c(\theta) = -\lambda$, $d(x) = -\ln(x!)$, and
$$L = \ln f(x, \lambda) = x\ln\lambda - \lambda - \ln(x!).$$
This is the canonical form of $f(x, \lambda)$ with natural parameter $\ln\lambda$. Then
$$U = \frac{dL}{d\lambda} = \frac{x}{\lambda} - 1, \qquad E(U) = \frac{\lambda}{\lambda} - 1 = 0$$
$$\operatorname{Var}(U) = \frac{\operatorname{Var}(x)}{\lambda^2} = \frac{\lambda}{\lambda^2} = \frac{1}{\lambda}$$
$$E(-U') = E\left[-\frac{d}{d\lambda}\left(\frac{x}{\lambda} - 1\right)\right] = E\left(\frac{x}{\lambda^2}\right) = \frac{\lambda}{\lambda^2} = \frac{1}{\lambda}$$
$$b'(\theta) = b'(\lambda) = \frac{1}{\lambda}, \qquad b''(\theta) = b''(\lambda) = -\frac{1}{\lambda^2}$$
$$c'(\theta) = c'(\lambda) = -1, \qquad c''(\theta) = c''(\lambda) = 0.$$
Thus
$$E[a(X)] = E(X) = -\frac{c'(\lambda)}{b'(\lambda)} = \lambda$$
$$\operatorname{Var}[a(X)] = \operatorname{Var}(X) = \frac{b''(\lambda)\, c'(\lambda) - c''(\lambda)\, b'(\lambda)}{[b'(\lambda)]^3} = \frac{\frac{1}{\lambda^2} - 0}{\frac{1}{\lambda^3}} = \lambda.$$
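Again these results can be checked by simulation; the sketch below draws Poisson variates with an arbitrary $\lambda$ and verifies $E(U) \approx 0$ and $\operatorname{Var}(U) \approx E(-U') \approx 1/\lambda$:

```python
import numpy as np

# Monte Carlo check of the Poisson score results: E(U) = 0 and
# Var(U) = E(-U') = 1/lambda, with U = x/lambda - 1.
rng = np.random.default_rng(1)
lam, m = 4.0, 200_000

x = rng.poisson(lam, size=m)
U = x / lam - 1

print(U.mean())                        # approx 0
print(U.var(), (x / lam**2).mean())    # both approx 1/lambda = 0.25
```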
Let $\eta_i$ denote the linear predictor, which relates to the expected value of the study variable; it is expressed as
$$\eta_i = g[E(y_i)] = g(\mu_i) = x_i'\beta$$
where $x_i'\beta = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}$. Thus
$$E(y_i) = g^{-1}(\eta_i) = g^{-1}(x_i'\beta)$$
where $g$ is called the link function.
If the study variable follows the:
• Binomial distribution, then logistic regression is used, and the logistic (logit) link is used as the canonical link, which is defined as $\eta_i = \ln\dfrac{\pi_i}{1 - \pi_i}$.
• Poisson distribution, then the log link is used as the canonical link, which is given as $\eta_i = \ln\lambda_i$.
• Exponential or gamma distribution, then the canonical link function used is the reciprocal link, given by $\eta_i = \dfrac{1}{\lambda_i}$.
Other types of link functions are
- the probit link, given as $\eta_i = \Phi^{-1}[E(y_i)]$, where $\Phi$ is the cumulative distribution function of the $N(0, 1)$ distribution;
- the complementary log-log link, given by $\eta_i = \ln\left[-\ln\{1 - E(y_i)\}\right]$.
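For concreteness, the sketch below collects these link functions and their inverses as plain Python functions; the function names are illustrative, not a library API:

```python
import numpy as np
from scipy.stats import norm

# Link functions mapping the mean mu to the linear predictor eta, and back.
def logit(mu):        return np.log(mu / (1 - mu))       # canonical: binomial
def inv_logit(eta):   return 1 / (1 + np.exp(-eta))

def log_link(mu):     return np.log(mu)                  # canonical: Poisson
def inv_log(eta):     return np.exp(eta)

def reciprocal(mu):   return 1 / mu                      # canonical: exponential/gamma
def probit(mu):       return norm.ppf(mu)                # Phi^{-1}
def cloglog(mu):      return np.log(-np.log(1 - mu))     # complementary log-log

mu = 0.7
assert np.isclose(inv_logit(logit(mu)), mu)
```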
A link is preferable if it maps the range of $\mu_i$ onto the whole real line and provides a good empirical approximation. It should also carry a meaningful interpretation in the case of real applications. The choice of link function is like choosing an appropriate transformation of the study variable. The link function takes advantage of the natural distribution of the study variable. An incorrect choice of link function can give rise to incorrect statistical inferences.
Maximum likelihood estimation of GLM:
The least squares method cannot be directly applied when the study variable is not continuous. So we use the maximum likelihood estimation method in GLM, which has a close connection with the iteratively reweighted least squares method.
Given the data $(x_i, y_i)$, $i = 1, 2, \ldots, n$, with $y$ following an exponential family distribution, the joint p.d.f. is
$$f(y; \theta, \phi) = \exp\left[\sum_{i=1}^{n} y_i\, b(\theta_i) + \sum_{i=1}^{n} c(\theta_i) + \sum_{i=1}^{n} d(y_i)\right]$$
where $\theta$ is the parameter of interest and $\phi$ is the nuisance parameter. Both $\theta$ and $\phi$ can also be vectors.
Consider a smaller set of parameters $\beta = (\beta_1, \beta_2, \ldots, \beta_k)'$ which relates a function $g(\mu_i)$ of the mean $\mu_i$ to the linear predictor. For example, if data on $y_i$ and $n_i$ are available such that $y_i \sim Bin(n_i, \pi_i)$, where $y_i$ is the number of successes in $n_i$ trials and $\pi_i$ is the probability of success, then the joint p.d.f. of all $n$ observations is
$$\exp\left[\sum_{i=1}^{n} y_i \ln\frac{\pi_i}{1 - \pi_i} + \sum_{i=1}^{n} n_i \ln(1 - \pi_i) + \sum_{i=1}^{n} \ln\binom{n_i}{y_i}\right].$$
Assuming that the variation in $\pi_i$ is explained by the $x_i$ values, choose a suitable link function $g(\pi_i) = x_i'\beta$. A sensible link function is the log-odds,
$$g(\pi_i) = \ln\frac{\pi_i}{1 - \pi_i}.$$
Now the objective is to fit the model
$$\ln\frac{\pi_i}{1 - \pi_i} = x_i'\beta = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_k x_{ik}$$
or, equivalently,
$$\pi_i = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}.$$
The general log-likelihood function is
$$L(\beta) = \ln f(y; \theta, \phi) = \sum_{i=1}^{n} L_i = \sum_{i=1}^{n} y_i\, b(\theta_i) + \sum_{i=1}^{n} c(\theta_i) + \sum_{i=1}^{n} d(y_i).$$
Suppose $\hat\beta$ is the final value obtained after optimization and is the maximum likelihood estimator of $\beta$; then, asymptotically,
$$E(\hat\beta) = \beta$$
$$V(\hat\beta) = a(\phi)(X'V^{-1}X)^{-1}$$
where $V$ is a diagonal matrix formed by the variances of the estimated parameters in the linear predictor, apart from $a(\phi)$. The covariance matrix can be estimated by replacing $V$ by its estimate $\hat V$.
In GLM, the variance of $y_i$ is not constant, so generalized least squares estimation is used to obtain more efficient estimators.
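A minimal sketch of maximum likelihood fitting of such a logistic GLM, using the GLM routine of the statsmodels package (which fits by iteratively reweighted least squares); the data are simulated purely for illustration and the true coefficients are arbitrary:

```python
import numpy as np
import statsmodels.api as sm

# Simulate Bernoulli data from a logistic model and fit it by ML.
rng = np.random.default_rng(2)
n = 500
X = sm.add_constant(rng.normal(size=(n, 2)))     # columns: 1, x1, x2
beta_true = np.array([-0.5, 1.0, -2.0])
pi = 1 / (1 + np.exp(-X @ beta_true))
y = rng.binomial(1, pi)

fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(fit.params)         # beta_hat, approximately beta_true
print(fit.cov_params())   # estimated asymptotic covariance of beta_hat
```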
To conduct tests of hypothesis in GLM, the model deviance is used for testing the goodness of the model fit. The difference in the deviances of the full and reduced models is used to decide on a subset model.
The Wald inference can be applied for testing hypotheses and for confidence interval estimation about individual model parameters. The Wald statistic for testing the null hypothesis $H_0: R\beta = r$, where $R$ is $q \times (k + 1)$ with $\operatorname{rank}(R) = q$, is
$$W = (R\hat\beta - r)'\left[R(X'\hat V^{-1}X)^{-1}R'\right]^{-1}(R\hat\beta - r).$$
The distribution of $W$ under $H_0$ is the $\chi^2$ distribution with $q$ degrees of freedom.
For an individual parameter, the test of $H_0: \beta_j = \beta_{j0}$ uses
$$Z = \sqrt{W} = \frac{\hat\beta_j - \beta_{j0}}{se(\hat\beta_j)}$$
which has the $N(0, 1)$ distribution under $H_0$, where $se(\hat\beta_j)$ is the standard error of $\hat\beta_j$. Confidence intervals can be constructed using the Wald test. For example, a $100(1 - \alpha)\%$ confidence interval for $\beta_j$ is
$$\hat\beta_j \pm Z_{\alpha/2}\, se(\hat\beta_j).$$
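The Wald computation is direct matrix algebra. In the sketch below, $\hat\beta$ and its estimated covariance matrix are illustrative placeholders standing in for $a(\phi)(X'\hat V^{-1}X)^{-1}$ from a real fit:

```python
import numpy as np
from scipy.stats import chi2, norm

# Wald test W = (R b - r)' [R Cov(b) R']^{-1} (R b - r), illustrative values.
beta_hat = np.array([-0.4, 1.1, -1.9])
cov_hat = np.diag([0.04, 0.03, 0.05])       # placeholder for Cov(beta_hat)

R = np.array([[0.0, 1.0, 0.0],              # H0: beta_1 = 0
              [0.0, 0.0, 1.0]])             #     beta_2 = 0
r = np.zeros(2)

d = R @ beta_hat - r
W = d @ np.linalg.solve(R @ cov_hat @ R.T, d)
p_value = chi2.sf(W, df=R.shape[0])         # chi-square with q = 2 d.f.

# 95% Wald confidence interval for an individual coefficient:
se1 = np.sqrt(cov_hat[1, 1])
ci = beta_hat[1] + np.array([-1, 1]) * norm.ppf(0.975) * se1
```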
For comparing nested models, the likelihood ratio test statistic is
$$LR = 2\ln\frac{\hat L_{full}}{\hat L_{reduced}}$$
where $\hat L_{full}$ and $\hat L_{reduced}$ are the maximized likelihood functions under the full and reduced models, respectively. The likelihood ratio test statistic has a $\chi^2$ distribution with degrees of freedom equal to the difference in the degrees of freedom of the full and reduced models.
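A sketch of the computation, with placeholder values for the two maximized log-likelihoods and for the difference in degrees of freedom:

```python
from scipy.stats import chi2

# Likelihood ratio test; the log-likelihoods below are hypothetical
# maximized values under the full and reduced models.
llf_full, llf_reduced = -210.3, -218.9
df_diff = 2                                 # extra parameters in the full model

LR = 2 * (llf_full - llf_reduced)           # = 2 ln(L_full / L_reduced)
p_value = chi2.sf(LR, df=df_diff)
```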
The estimate of the mean response at a point $x_0$ is
$$\hat y_0 = \hat\mu_0 = g^{-1}(x_0'\hat\beta).$$
It is understood that $x_0$ is expandable to the model form if more terms, e.g., interaction terms, are to be accommodated in the linear predictor.
To find the confidence interval, the asymptotic covariance matrix of $\hat\beta$, given by $\Omega = a(\phi)(X'V^{-1}X)^{-1}$, is used; a $100(1 - \alpha)\%$ confidence interval for the mean response at $x_0$ is
$$g^{-1}\left[x_0'\hat\beta - Z_{\alpha/2}\sqrt{x_0'\hat\Omega\, x_0}\right] \le \mu(x_0) \le g^{-1}\left[x_0'\hat\beta + Z_{\alpha/2}\sqrt{x_0'\hat\Omega\, x_0}\right].$$
This approach usually works well in practice because $\hat\beta$ is the maximum likelihood estimate of $\beta$, so any function of $\hat\beta$ is also a maximum likelihood estimate. This method constructs the confidence interval in the space of the linear predictor and then transforms the interval back into the original metric. The Wald method can also be used to derive an approximate confidence interval for the mean response.
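The sketch below illustrates this construction for the logit link, with illustrative placeholder values for $\hat\beta$, $\hat\Omega$, and $x_0$:

```python
import numpy as np
from scipy.stats import norm

# CI for the mean response at x0: build the interval on the linear
# predictor, then map it back through g^{-1} (here the inverse logit).
beta_hat = np.array([-0.4, 1.1, -1.9])      # illustrative estimates
Omega_hat = np.diag([0.04, 0.03, 0.05])     # asymptotic Cov(beta_hat)
x0 = np.array([1.0, 0.5, -0.2])             # includes the intercept term

eta0 = x0 @ beta_hat
half = norm.ppf(0.975) * np.sqrt(x0 @ Omega_hat @ x0)
inv_logit = lambda e: 1 / (1 + np.exp(-e))
lower, upper = inv_logit(eta0 - half), inv_logit(eta0 + half)
```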
In the case of logistic regression, the deviance residual is
$$d_i = \pm\left\{2\left[y_i \ln\left(\frac{y_i}{n_i\hat\pi_i}\right) + (n_i - y_i)\ln\left(\frac{n_i - y_i}{n_i(1 - \hat\pi_i)}\right)\right]\right\}^{1/2}$$
where
$$\hat\pi_i = \frac{1}{1 + \exp(-x_i'\hat\beta)}.$$
As the fit of the model to the data becomes better, $\hat\pi_i$ approaches $\dfrac{y_i}{n_i}$ and the deviance residuals become smaller and close to zero.
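A sketch computing these deviance residuals, with illustrative values of $y_i$, $n_i$, and $\hat\pi_i$; the sign is taken as the sign of $y_i/n_i - \hat\pi_i$, a common convention:

```python
import numpy as np

# Logistic-regression deviance residuals for illustrative (not real) data.
y = np.array([3.0, 7.0, 2.0, 9.0])           # successes
n = np.array([10.0, 12.0, 8.0, 15.0])        # trials
pi_hat = np.array([0.28, 0.60, 0.22, 0.63])  # fitted probabilities

inner = (y * np.log(y / (n * pi_hat))
         + (n - y) * np.log((n - y) / (n * (1 - pi_hat))))
d = np.sign(y / n - pi_hat) * np.sqrt(2 * inner)
print(d)   # residuals shrink toward zero as pi_hat_i approaches y_i / n_i
```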
In the case of Poisson regression,
$$d_i = y_i \ln\left[\frac{y_i}{\exp(x_i'\hat\beta)}\right] - \left[y_i - \exp(x_i'\hat\beta)\right], \quad i = 1, 2, \ldots, n.$$
Here $y_i$ and $\hat y_i = \exp(x_i'\hat\beta)$ become close to each other as the deviance residuals approach zero.
The behaviour of deviance residuals is like the behaviour of ordinary residuals in the standard normal linear regression model. The normal probability plot is obtained by plotting the deviance residuals on a normal probability scale versus the fitted values. Usually, the fitted values are transformed to a constant information scale before plotting, so
• $\hat y_i$ is used in the case of usual regression with the normal distribution,