
Applied Multilevel Analysis

Course: Stat-5123

Section-B (Binary)

• Use a multilevel model whenever your data are grouped (or nested) in more than one
category (for example, states, countries, etc.).
• Multilevel models (mixed-effects, random-effects, or hierarchical linear models) are
now a standard generalization of conventional regression models for analyzing
clustered and longitudinal data in the social, psychological, behavioral and medical
sciences. Examples include students within schools, respondents within
neighborhoods, patients within hospitals, repeated measures within subjects, and panel
survey waves on households. Multilevel models have been further generalized to
handle a wide range of response types, including continuous, categorical (binary or
dichotomous, ordinal, and nominal or discrete choice), count, and survival responses.

Some advantages:
• Regular (single-level) regression ignores the variation between entities (groups).
• Fitting a separate regression for each entity may face small-sample problems and lack generalization.

Question: Describe the concept of Generalized Linear Models:

The classical approach to the problem of non-normally distributed variables and
heteroscedastic errors is to apply a transformation to achieve normality and reduce the
heteroscedasticity, followed by a straightforward analysis with ANOVA or multiple
regression.

To distinguish this approach from generalized linear modeling, where the transformation
is part of the statistical model, it is often referred to as an empirical transformation. Some
general guidelines for choosing a suitable transformation have been suggested for situations
in which a specific transformation is often successful. For instance, for a proportion p some
recommended transformations are: the arcsine transformation f(p) = 2 arcsin(√p), the logit
transformation f(p) = logit(p) = ln(p/(1 − p)), where 'ln' is the natural logarithm, and the
probit or inverse normal transformation f(p) = Φ⁻¹(p), where Φ⁻¹ is the inverse of the
standard normal distribution. Thus, for proportions, we can use the logit transformation, and
use standard regression procedures on the transformed variable:

logit(p) = β0 + β1x1 + β2x2 + … + e
Empirical transformations have the disadvantage that they are ad hoc, and
may encounter problems in specific situations. For instance, if we model dichotomous data,
which are simply observed proportions in a sample of size 1, both the logit
transformation f(p) = ln(p/(1 − p)) (where 'ln' is the natural logarithm) and the
probit or inverse normal transformation f(p) = Φ⁻¹(p) (where Φ⁻¹ is the inverse of the
standard normal distribution) break down, because these functions are not defined for the values
0 and 1. In fact, no empirical transformation can ever transform a dichotomous variable,
which takes on only two values, into any resemblance of a normal distribution.
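As a quick numerical illustration (an addition, not from the original notes), the sketch below applies the arcsine, logit, and probit transformations to a few proportions and shows that the logit and probit are undefined at the boundary values 0 and 1; it assumes NumPy and SciPy are available.

```python
import numpy as np
from scipy.stats import norm

p = np.array([0.0, 0.2, 0.5, 0.8, 1.0])

# Empirical transformations for proportions
arcsine = 2 * np.arcsin(np.sqrt(p))          # defined everywhere on [0, 1]
with np.errstate(divide="ignore"):
    logit = np.log(p / (1 - p))              # -inf at p=0, +inf at p=1
probit = norm.ppf(p)                         # inverse standard normal CDF

print("p      :", p)
print("arcsine:", arcsine)
print("logit  :", logit)                     # breaks down at 0 and 1
print("probit :", probit)                    # breaks down at 0 and 1
```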

The modern approach to the problem of non-normally distributed variables is to include the
necessary transformation and the choice of the appropriate error distribution (not necessarily
a normal distribution) explicitly in the statistical model. This class of statistical models is
called generalized linear models.

Generalized linear models are defined by three components:

1. an outcome variable y with a specific error distribution that has mean µ and variance σ²,

2. a linear additive regression equation that produces an unobserved (latent) predictor η of the
outcome variable y,

3. a link function that links the expected values of the outcome variable y to the predicted
values for η: η = f(µ).

Question: Commonly used generalized linear model for dichotomous data is the logistic
regression model:
The commonly used generalized linear model for dichotomous data is the logistic regression
model, specified by:

1. the probability distribution of the outcome is binomial (n, π) with mean µ,

2. the linear predictor is the multiple regression equation for η, e.g., η = β0 + β1x1 + β2x2,

3. the link function is the logit function given by η = logit(µ) = ln(µ/(1 − µ)).
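As an added illustration (not part of the original notes), a single-level logistic regression can be fitted as a generalized linear model with a binomial error distribution and the canonical logit link. The sketch below uses statsmodels on simulated data; the variable names and parameter values are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

# Simulated predictor and dichotomous outcome (hypothetical example data)
n = 500
x = rng.normal(size=n)
eta = -0.5 + 1.2 * x                      # linear predictor
prob = 1 / (1 + np.exp(-eta))             # inverse logit link
y = rng.binomial(1, prob)

# GLM with binomial error distribution and (canonical) logit link
X = sm.add_constant(x)
model = sm.GLM(y, X, family=sm.families.Binomial())
result = model.fit()
print(result.summary())
```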
Question: Link function and Canonical link function:
The link function is often the inverse of the cumulative distribution function of the error
distribution. For example, we have the logit link for the logistic distribution, the probit link
for the normal distribution, and the complementary log-log link for the extreme value
(Weibull) distribution.
Many distributions have a specific link function for which sufficient statistics exist, which
is called the canonical link function. Some commonly used canonical link functions and
the corresponding error distributions are given below:

Table: Some commonly used canonical link functions and the corresponding error
distributions

Response     Link function       Name       Distribution
Continuous   η = µ               Identity   Normal
Proportion   η = ln(µ/(1 − µ))   Logit      Binomial
Count        η = ln(µ)           Log        Poisson
Positive     η = 1/µ             Inverse    Gamma
Question: MULTILEVEL GENERALIZED LINEAR MODELS:

In multilevel generalized linear models, the multilevel structure appears in the linear
regression equation of the generalized linear model. Thus, a two-level model for proportions
is written as follows:

1. the probability distribution for the proportion p_ij is binomial (n_ij, µ_ij) with overall mean µ_ij,

2. the linear predictor is the multilevel regression equation for η_ij, e.g.,

η_ij = γ00 + γ10 x_ij + γ01 z_j + γ11 x_ij z_j + u0j + u1j x_ij,

3. the link function is the logit function given by η_ij = logit(µ_ij).

These equations state that our outcome variable is a proportion p_ij, that we use a logit link
function, and that, conditional on the predictor variables, we assume that p_ij has a binomial
error distribution with expected value µ_ij and number of trials n_ij. If there is only one trial
(all n_ij are equal to one), the only possible outcomes are 0 and 1, and we are modeling
dichotomous data. This specific case of the binomial distribution is called the Bernoulli
distribution. Note that the usual lowest-level residual variance e_ij is not in the model
equation, because it is part of the specification of the error distribution. If the error
distribution is binomial, the variance is a function of the number of trials n and the
population proportion π: σ² = n × π × (1 − π), and it does not have to be estimated
separately.
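To make the data-generating process concrete, the sketch below (an addition, not from the original notes) simulates a two-level random-intercept logistic model with hypothetical parameter values: a level-2 residual u0j is drawn for each group, added to the linear predictor, and the binary outcomes are drawn as Bernoulli (binomial with one trial) responses.

```python
import numpy as np

rng = np.random.default_rng(1)

n_groups, n_per_group = 50, 30          # hypothetical two-level design
gamma00, gamma10 = -0.5, 0.8            # fixed intercept and slope
sigma2_u0 = 1.0                         # level-2 (random intercept) variance

group = np.repeat(np.arange(n_groups), n_per_group)
x = rng.normal(size=n_groups * n_per_group)       # level-1 predictor
u0 = rng.normal(0, np.sqrt(sigma2_u0), n_groups)  # random intercepts u0j

eta = gamma00 + gamma10 * x + u0[group]           # multilevel linear predictor
mu = 1 / (1 + np.exp(-eta))                       # inverse logit link
y = rng.binomial(1, mu)                           # Bernoulli (n_ij = 1) outcomes

print("overall proportion of 1s:", y.mean())
```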

Question: Estimation in Generalized Multilevel Models:


The parameters of generalized linear models are estimated using maximum likelihood
methods. Multilevel models are also generally estimated using maximum likelihood methods,
and combining multilevel and generalized linear models leads to complex models and
estimation procedures.
There are two different approaches to estimation: quasi-likelihood and numerical integration.
1) Quasi-likelihood approach:
The quasi-likelihood approach is to approximate the nonlinear link by a nearly linear
function, and to embed the multilevel estimation for that function in the generalized linear
model. This approach is a quasi-likelihood approach, and it confronts us with two choices
that must be made. The nonlinear function is linearized using an approximation known as
Taylor series expansion. Taylor series expansion approximates a nonlinear function by an
infinite series of terms. Often only the first term of the series is used, which is referred to as a
first-order Taylor approximation.
When the second term is also used, we have a second-order Taylor approximation, which is
generally more accurate. So the first choice is whether to use a first-order or a second order
approximation. The second choice also involves the Taylor series expansion.
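For intuition (an illustrative addition, not from the original text), the sketch below linearizes the inverse logit function around a working value η0 using first- and second-order Taylor expansions; this is the kind of local approximation that the quasi-likelihood approach repeats as the parameter estimates are updated.

```python
import numpy as np

def inv_logit(eta):
    return 1 / (1 + np.exp(-eta))

eta0 = 0.5                              # current (working) value of the linear predictor
mu0 = inv_logit(eta0)
d1 = mu0 * (1 - mu0)                    # first derivative of the inverse logit
d2 = mu0 * (1 - mu0) * (1 - 2 * mu0)    # second derivative

eta = np.linspace(-1.0, 2.0, 7)
exact = inv_logit(eta)
taylor1 = mu0 + d1 * (eta - eta0)                    # first-order approximation
taylor2 = taylor1 + 0.5 * d2 * (eta - eta0) ** 2     # second-order approximation

for e, ex, t1, t2 in zip(eta, exact, taylor1, taylor2):
    print(f"eta={e:5.2f}  exact={ex:.4f}  1st-order={t1:.4f}  2nd-order={t2:.4f}")
```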

The quasi-likelihood estimation procedure in multilevel modeling proceeds iteratively,


starting with approximate parameter values, which are improved in each successive iteration.
Thus, the estimated parameter values change during the iterations. In consequence, the Taylor
series expansion must be repeated after each run of the multilevel estimation procedure, using
the current estimated values of the multilevel model parameters. This results in two sets of
iterations. One set of iterations is the standard iterations carried out on the linearized
outcomes, estimating the parameters (coefficients and variances) of the multilevel model. In
HLM, these are called the micro-iterations. The second set of iterations uses the currently
converged estimates from the micro iterations to improve the Taylor series approximation.
After each update of the linearized outcomes, the micro iterations are performed again. The
successive improvements of the Taylor series approximation are called the macro iterations.
Thus, in the quasi-likelihood approach based on Taylor series approximation, there are two
sets of iterations to check for convergence problems.
Marginal quasi-likelihood (MQL) and penalized quasi-likelihood (PQL):
Taylor series linearization of a nonlinear function depends on the values of its parameters.
And this presents us with the second choice: the Taylor series expansion can use the current
estimated values of the fixed part only, which is referred to as marginal quasi-likelihood
(MQL), or it can be improved by using the current values of the fixed part plus the residuals,
which is referred to as penalized quasi-likelihood (PQL).

2) Numerical integration approach:

The numerical integration approach does not use an approximate likelihood, but uses
numerical integration of the exact likelihood function. Numerical integration maximizes the
correct likelihood. The estimation methods involve the numerical integration of a complex
likelihood function, which becomes more complicated as the number of random effects
increases. The actual calculations involve quadrature points, and the numerical approximation
becomes better when the number of quadrature points in the numerical integration is
increased. Unfortunately, increasing the number of quadrature points also increases the
computing time, sometimes dramatically. When full maximum likelihood estimation with
numerical integration is used, the test procedures and goodness of fit indices based on the
deviance are appropriate. Simulation research suggests that when both approaches are
feasible, the numerical integration method achieves more precise estimates.
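The sketch below (an added illustration under simplifying assumptions, with hypothetical data and parameter values) uses Gauss-Hermite quadrature to approximate the marginal likelihood contribution of a single cluster in a random-intercept logistic model, showing how the approximation stabilizes as the number of quadrature points increases.

```python
import numpy as np

def inv_logit(eta):
    return 1 / (1 + np.exp(-eta))

# Hypothetical data for one cluster: outcomes, predictor, and parameters
y = np.array([1, 0, 1, 1, 0])
x = np.array([0.2, -1.0, 0.5, 1.3, -0.4])
beta0, beta1, sigma_u = -0.2, 0.9, 1.0

def cluster_likelihood(n_points):
    # integral of prod_i p(y_i | u) * N(u; 0, sigma_u^2) du via Gauss-Hermite
    nodes, weights = np.polynomial.hermite.hermgauss(n_points)
    u = np.sqrt(2) * sigma_u * nodes          # change of variables for N(0, sigma_u^2)
    total = 0.0
    for u_k, w_k in zip(u, weights):
        p = inv_logit(beta0 + beta1 * x + u_k)
        lik = np.prod(p ** y * (1 - p) ** (1 - y))
        total += w_k * lik
    return total / np.sqrt(np.pi)

for n_points in (2, 5, 10, 20):
    print(n_points, "quadrature points:", cluster_likelihood(n_points))
```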

Question: If second-order estimation and penalized quasi-likelihood are always better,
then why not always use these?

Answer: The reason is that complex models or small data sets may pose convergence
problems, and we may be forced to use first-order MQL. Goldstein and Rasbash (1996)
suggest using bootstrap methods to improve the quasi-likelihood estimates, and Browne
(1998) explores bootstrap and Bayesian methods.

Question: Explain the Ever-Changing Latent Scale:


In many generalized linear models, for example in logistic and probit regression, the scale of
the unobserved latent variable is arbitrary, and to identify the model it must be
standardized. Probit regression uses the standard normal distribution, with mean 0 and
variance 1. Logistic regression uses the standard logistic distribution (scale parameter equal
to one), which has a mean of zero and a variance of π²/3 ≈ 3.29. The assumption of an
underlying latent variable is convenient for interpretation, but not crucial. An important issue
in these models is that the underlying scale is standardized to the same standard distribution
in each of the analyzed models.
If we start with an intercept-only model, and then estimate a second model where we add a
number of explanatory variables that explain part of the variance, we normally expect the
estimated variance components to become smaller. However, in logistic and probit regression
(and in many other generalized linear models), the underlying latent variable is rescaled, so
the lowest-level residual variance is again π²/3 ≈ 3.29 (or unity in probit regression). Consequently,
the values of the regression coefficients and higher-level variances are also rescaled. This
change of scale is not specific to multilevel generalized linear modeling;
it also occurs in single-level logistic and probit models. For the single-level logistic model,
several pseudo-R² formulas have been proposed to provide an indication of the explained
variance. These are all based on the log-likelihood. They can be applied in multilevel logistic
and probit regression, provided that a good estimate of the log-likelihood is available.

Question: Sample size and power analysis:


Answer: Rules of thumb and data constraints (Book-3, page 272, 10.3)
The most commonly offered rule of thumb with regard to sample size for multilevel models
is at least 20 groups and at least 30 observations per group. An alternative recommendation,
cited almost as frequently, is 30 groups and 30 observations per group. Almost invariably,
however, such recommendations are heavily qualified. It soon becomes clear that sample size
and sample structure are complex and under-researched issues in multilevel analysis.
Even without immersing ourselves in the growing and difficult technical literature on sample
size and statistical power in multilevel analysis, the "20/30" and "30/30" rules of thumb merit
a closer look. With 20 (or even 30) observations at the second level of a multilevel analysis,
tests of significance and construction of confidence intervals for level-two regression
coefficients will be based on a dangerously small number of cases.
As a rather extreme but quite real case, recall the data set we have used that contains
measures on 331 beginning kindergarten students in 18 classrooms in 12 randomly selected
West Virginia elementary schools. Clearly, there are three levels that might be used in an
analysis of this information: the individual student, the classroom, and the school. With 331
students, at first glance, the sample size seems not to be unduly small. Again, however, if we
use multilevel regression, the nominal sample size at the classroom level is only 18, and the
effective sample size at the school level is only 12. These are constraints that came with the
data set and over which we have no control. At levels two and three, therefore, actual sample
size constraints are even more severe than those embedded in the dubious 20/30 and 30/30
rules of thumb.
Variance Partition Coefficients (VPCs)

The proportion of variation in the observed response that is due to a given level of the model
hierarchy is defined by the variance partition coefficient (VPC). Thus, with VPCs we can
understand the relative importance of the community, compared with the women within
communities, as a source of variation in MHCS non-utilization.

The community-level VPC for a two-level model is calculated as:

VPC = σ²u0 / (σ²u0 + σ²e)

Here, σ²u0 is the between-community (level-2) variance and σ²e is the within-community
(level-1) variance (for a binary logistic model, σ²e = π²/3 ≈ 3.29).

In conditional models, VPCs are based on the residuals rather than on the originally observed
responses: the VPC then gives the proportion of variation in the outcome left unexplained by
the independent variables at each level of the model hierarchy.

Intra-Class Correlations (ICCs)

Within a given class or community, the correlation (i.e., homogeneity or similarity) of the
observed responses is termed the intra-class correlation (ICC).

The community-level ICC for a two-level model is calculated as:

ICC = corr(y_ij, y_i′j) = σ²u0 / (σ²u0 + σ²e)

Here, y is the response variable, and i and i′ index two different women in the same community j.

In conditional models, ICCs are based on the residuals rather than on the observed responses.
The ICC therefore measures the similarity in responses having adjusted for the independent
variables; that is, the homogeneity in the unexplained responses. When ICCs are computed from
conditional models, they are sometimes referred to as adjusted ICCs. VPCs and ICCs are the
same for a two-level model.
The purpose of MLM is to partition variance in the outcome between the different groupings
in the data.

For example, if we make multiple observations on individual participants, we partition the
outcome variance into between-individual variance and residual variance.

Definition

In multilevel modeling we might want to know what proportion of the total variance is
attributable to variation within groups, or how much is found between groups. This statistic is
called the VPC or ICC.

The mathematical form of the two-level binary logistic regression can be written as follows:

logit(π_ij) = ln(π_ij / (1 − π_ij)) = β0 + β1 x_ij + β2 z_j + u0j

where β0 is the fixed-effect intercept and u0j is the residual of the second level. Furthermore,
x_ij refers to a level-1 variable with fixed effect β1, and z_j refers to a level-2 variable with
fixed effect β2, respectively. The value of the intra-class correlation coefficient (ICC) should
be calculated prior to the application of any two-level binary model from the following formula:

Intraclass correlation = σ²u0 / (σ²u0 + π²/3)

where σ²u0 represents the variance of the random intercept (the variance of the 2nd level) and
π²/3 ≈ 3.29 is the level-1 variance of the standard logistic distribution.
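A minimal sketch (an addition, not from the original notes) of this calculation in Python; the variance values 1.02 and 0.42 are taken from Table 1 below, so the results should reproduce the reported ICCs of about 0.237 and 0.113.

```python
import math

def icc_logistic(sigma2_u0):
    """ICC/VPC for a two-level binary logistic model with a random intercept."""
    level1_var = math.pi ** 2 / 3          # variance of the standard logistic distribution
    return sigma2_u0 / (sigma2_u0 + level1_var)

print(round(icc_logistic(1.02), 3))        # null model (Model 0) -> about 0.237
print(round(icc_logistic(0.42), 3))        # Model 1              -> about 0.113
```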

Question: Comment on the random part of the following analysis

Table 1: Multilevel model's output (random part) identifying individual-level and
community-level factors associated with ANC utilization in Bangladesh

Measures of variation for the random part

                          Model 0       Model 1       Model 2       Model 3
Variance (SE)             1.02 (0.12)   0.42 (0.08)   0.53 (0.08)   0.38 (0.07)
Explained variation (%)ⁱ  Ref           58.82         48.04         62.75
ICC = VPC                 0.237         0.113         0.139         0.104
Chi-square (χ²)           313.0         65.09         117.09        56.81
P-value                   0.000         0.000         0.000         0.000

SE = Standard Error; ⁱ compared to the null model; χ² corresponds to the likelihood ratio test vs.
single-level logistic regression.

In this table, the estimated variance (SE) of the random part for the null model (Model 0) is
1.02 (0.12) and the ICC value is 0.237, suggesting that about 24% of the heterogeneity in ANC
utilization in Bangladesh lies between communities (clearly greater than 0), while the remaining
76% of the heterogeneity is due to differences within communities. That is, there is variation
due to the 2nd (community) level, and for this reason we can apply a multilevel regression model
instead of a simple regression model.

For the null model (Model 0), the likelihood ratio (LR) test statistic against the traditional
single-level logistic regression is χ² = 313.0, with a p-value that is effectively zero
(p < 0.001). Hence the data are fitted significantly better by the two-level model than by the
traditional single-level model.

The next model (Model 1) contains only the lower-level (individual-level) variables. The
estimated variance (SE) for Model 1 is 0.42 (0.08).

So Model 1 explains [(1.02 − 0.42)/1.02] × 100 = 58.82% of the community-level variance
compared to the null model (Model 0). The ICC/VPC = 0.42/(0.42 + π²/3) = 0.113, which is
0.237 − 0.113 = 0.124, i.e. 12.4 percentage points lower than in the null model (Model 0).

………..

The full model contains the significant variables of both levels (Level-1 and Level-2).
Question: Basics of Multilevel Generalized Linear Model for count Data.
Frequently the outcome variable of interest is a count of events. In most cases count data do
not have a nice normal distribution. A count cannot be lower than zero, so count data always
have a lower bound at zero. When the outcome is a count of events that occur frequently,
these problems can be addressed by taking the square root or in more extreme cases the
logarithm. However, such nonlinear transformations change the interpretation of the
underlying scale, so analyzing counts directly may be preferable. Count data can be analyzed
directly using a generalized linear model.
When the counted events are relatively rare they are often analyzed using a Poisson model.
Examples of such events are frequency of depressive symptoms in a normal population,
traffic accidents on specific road stretches, or conflicts in stable relationships. More frequent
counts are often analyzed using a negative binomial model.

i. The Poisson Model for Count Data:


In the Poisson distribution, the probability of observing y events (y = 0, 1, 2, 3, …) is:

Pr(y) = exp(−λ) λ^y / y!

where exp is the inverse of the natural logarithm. Just like the binomial distribution, the
Poisson distribution has only one parameter, the event rate λ (lambda). The mean and variance
of the Poisson distribution are both equal to λ. As a result, with an increasing event rate, the
frequency of the higher counts increases, and the variance of the counts also increases, which
introduces heteroscedasticity. An important assumption in the Poisson distribution is that the
events are independent and have a constant mean rate (λ).

Example: counting how many days a pupil has missed school is probably not a Poisson
variate, because one may miss school because of an illness, and if this lasts several days these
counts are not independent. The number of typing errors on randomly chosen book pages is
probably a Poisson variate.
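A small illustration (not from the original text): the probabilities below are computed directly from the Poisson formula and compared with scipy.stats.poisson, using a hypothetical event rate λ = 2.

```python
import math
from scipy.stats import poisson

lam = 2.0                                  # hypothetical event rate (lambda)

for y in range(6):
    # exp(-lambda) * lambda^y / y!
    by_hand = math.exp(-lam) * lam ** y / math.factorial(y)
    print(y, round(by_hand, 4), round(poisson.pmf(y, lam), 4))
```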

The Poisson model for count data is a generalized linear regression model that consists of
three components:
I. an outcome variable y with a specific error distribution that has mean µ and variance σ²,
II. a linear additive regression equation that produces a predictor η of the outcome
variable y,
III. a link function that links the expected values of the outcome variable y to the
predicted values for η: η = f(µ).
For counts, the outcome variable is often assumed to follow a Poisson distribution with event
rate λ. The model can be further extended by including a varying exposure rate m. For
instance, if book pages have different numbers of words, the distribution of typing errors
would be Poisson with exposure rate equal to the number of words on a page.
The multilevel Poisson regression model for a count Y_ij for person i in group j can be
written as:

Y_ij | λ_ij ~ Poisson(λ_ij)

The standard link function for the Poisson distribution is the logarithm, and

η_ij = ln(λ_ij)

The first-level and second-level models are constructed as usual, so

η_ij = β0j + β1j x_ij, with β0j = γ00 + u0j and β1j = γ10 + u1j,

giving

η_ij = γ00 + γ10 x_ij + u0j + u1j x_ij.

Since the Poisson distribution has only one parameter, its variance is equal to the mean.
Estimating an expected count implies a specific variance. Therefore, just as in logistic
regression, the first-level equations do not have a lowest-level error term.
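As an added, hedged illustration of the exposure idea (not from the original notes), the sketch below fits a single-level Poisson GLM with statsmodels, passing the number of words per page as an exposure so that the model describes the typing-error rate per word; the data are simulated and all names and parameter values are hypothetical, assuming the exposure argument of statsmodels' GLM is available.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)

n_pages = 200
words = rng.integers(150, 450, size=n_pages)     # exposure: words per page
x = rng.normal(size=n_pages)                     # a page-level predictor
rate_per_word = np.exp(-5.5 + 0.3 * x)           # log link on the error rate
errors = rng.poisson(words * rate_per_word)      # observed error counts

X = sm.add_constant(x)
model = sm.GLM(errors, X, family=sm.families.Poisson(), exposure=words)
result = model.fit()
print(result.params)     # estimates should be near (-5.5, 0.3)
```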

ii. The Negative Binomial Model for Count Data:

In the Poisson model, the variance of the outcome is equal to the mean. When the observed
variance is much larger than expected under the Poisson model, we have over-dispersion.
One way to model over-dispersion is to add an explicit error term to the model. Thus, for the
Poisson model we have the link function η = ln(λ), and the inverse is λ = exp(η),
where η is the outcome predicted by the linear regression model. The negative binomial model
adds an explicit error term ε, as follows:

λ = exp(η + ε) = exp(η) exp(ε)

The error term ε increases the variance compared to the variance implied by the
Poisson model. This is similar to adding a dispersion parameter in a Poisson model.
Given that the negative binomial model is a Poisson model with an added variance term, the
test on the deviance can be used to assess whether the negative binomial model fits better.
The negative binomial model cannot directly be compared to the Poisson model with over-
dispersion parameter, because these models are not nested. However, the AIC and BIC can be
used to compare these models. Both the AIC and the BIC are smaller for the Poisson model
with over-dispersion than for the negative binomial model.
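As a hedged illustration of comparing such models with information criteria (an addition, not from the original notes), the sketch below simulates over-dispersed counts and fits Poisson and negative binomial models using statsmodels' discrete-model classes, then prints their AIC and BIC values; the data are hypothetical and no claim is made about which model wins in general.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)

n = 1000
x = rng.normal(size=n)
eta = 0.5 + 0.7 * x
eps = rng.normal(0, 0.8, size=n)              # extra variation -> over-dispersion
y = rng.poisson(np.exp(eta + eps))            # over-dispersed counts

X = sm.add_constant(x)
pois = sm.Poisson(y, X).fit(disp=False)
negbin = sm.NegativeBinomial(y, X).fit(disp=False)

print("Poisson            AIC:", round(pois.aic, 1), " BIC:", round(pois.bic, 1))
print("Negative binomial  AIC:", round(negbin.aic, 1), " BIC:", round(negbin.bic, 1))
```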
iii. The Zero-Inflated Model:
When the data show an excess of zeros compared to the expected number under the Poisson
distribution, it is sometimes assumed that there are two processes that produce the data. Some
of the zeros are part of the event count, and are assumed to follow a Poisson model (or a
negative binomial). Other zeros reflect whether the event takes place at all, a binary process
modeled by a binomial model. These zeros are not part of the count; they are structural zeros,
indicating that the event never takes place. Thus, the assumption is that our data actually
include two populations: one that always produces zeros and a second that produces counts
following a Poisson model. For example, assume that we study risky behavior, such as using
drugs or having unsafe sex. One population never shows this behavior; it is simply not part of
their behavioral repertoire. These individuals will always report a zero. The other population
consists of individuals who do have this behavior in their repertoire. These individuals can
report on their behavior, and these reports can also contain zeros. An individual may
sometimes use drugs, but just did not do this in the time period surveyed. Models for such
mixtures are referred to as zero-inflated Poisson or ZIP models.
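To make the mixture explicit (an added sketch, not from the original text), the code below computes zero-inflated Poisson probabilities from the two components: with probability ψ the observation comes from the always-zero population, and with probability 1 − ψ it comes from a Poisson(λ) count; the values of ψ and λ are hypothetical.

```python
import math

def zip_pmf(y, lam, psi):
    """Zero-inflated Poisson: psi is the probability of a structural zero."""
    poisson_part = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        return psi + (1 - psi) * poisson_part   # structural zeros plus Poisson zeros
    return (1 - psi) * poisson_part

lam, psi = 1.5, 0.3
probs = [zip_pmf(y, lam, psi) for y in range(8)]
print([round(p, 4) for p in probs])
print("zero probability:", round(probs[0], 3),
      "vs plain Poisson:", round(math.exp(-lam), 3))
```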
