Week14 Longitudinal Data

This document outlines statistical methods for analyzing missing data in longitudinal studies, focusing on techniques such as Generalized Estimating Equations (GEE) and Last Observation Carried Forward (LOCF). It discusses the structure of longitudinal data, the implications of missingness patterns, and various methods for handling missing data, including complete case analysis and multiple imputation. The document also highlights the challenges and biases associated with different approaches to missing data in clinical trials and longitudinal studies.


Statistical Methods for Analysis with Missing Data

Instructor: Martin Lukusa, PhD

Department of Statistics
Graduate Course
NCKU

Spring Semester, 2023


Methods for Missing Data:

Missing Data in Longitudinal Studies


Longitudinal/Clustered Data

This lecture covers the following:

I Study of missing data in longitudinal data sets.
I Study of generalized estimating equations (GEE) for missing data.
I Review of "last observation carried forward" (LOCF).


Longitudinal/Clustered Data

 Longitudinal or clustered data can be encountered in medical
studies, population health, social studies, economics, etc.

 In a longitudinal study, we collect data from every sampled
subject (or unit) at multiple time points.

 Under cluster sampling, we collect data from units within
each sampled cluster.

 Longitudinal data can be coded into "long" and "wide" formats.
Understand Longitudinal/Clustered Data

 A wide dataset will have one record for each individual.


The observations made (measured) at different time points are
coded as different columns.

id age Y1 Y2
1 14 28 22
2 12 34 16
3 ··· ··· ···

Table: Wide format


Understand Longitudinal/Clustered Data

 In the long format there will be multiple records for each unit.

id age Y
1 14 28
1 14 22
2 12 34
2 12 16
3 ··· ···

Table: Long format

 Clearly, in the wide format, Y1 and Y2 are consecutive measures
taken on the same unit and sit in the same row, whereas in the long
format consecutive records can refer to the same unit.
Understand Longitudinal/Clustered Data

 Both the wide format and the long format present advantages
and disadvantages in some situations.

The long format has an explicit time variable available that can be
used for analysis, so making graphs and conducting statistical
analyses are easier in the long format.

 The wide and the long formats can be converted into each other.
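For concreteness, the conversion between the two formats can be sketched in plain Python. The record layout mirrors the toy tables above (columns id, age, Y1, Y2); the function names are illustrative, not from any particular library:

```python
# Convert between the wide and long formats of the lecture's toy tables.

def wide_to_long(wide_rows):
    """Each wide record becomes one long record per measurement occasion,
    with an explicit 'time' variable added."""
    long_rows = []
    for rec in wide_rows:
        for occasion in (1, 2):
            long_rows.append({
                "id": rec["id"],
                "age": rec["age"],
                "time": occasion,          # explicit time variable
                "Y": rec[f"Y{occasion}"],
            })
    return long_rows

def long_to_wide(long_rows):
    """Group long records by id and spread Y over time-indexed columns."""
    wide = {}
    for rec in long_rows:
        w = wide.setdefault(rec["id"], {"id": rec["id"], "age": rec["age"]})
        w[f"Y{rec['time']}"] = rec["Y"]
    return list(wide.values())

wide = [{"id": 1, "age": 14, "Y1": 28, "Y2": 22},
        {"id": 2, "age": 12, "Y1": 34, "Y2": 16}]
long = wide_to_long(wide)
assert long[0] == {"id": 1, "age": 14, "time": 1, "Y": 28}
assert long_to_wide(long) == wide      # round trip recovers the wide table
```

In practice a reshaping routine such as a melt/pivot pair in a data-frame library plays the same role; the point is only that the two formats carry the same information.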
Understand Longitudinal/Clustered Data

On the other hand,

with longitudinal data, multiple imputation is conveniently done
when data are in the wide format.

A more general approach is to impute data in the long format,
which requires some form of multilevel imputation.

Apart from the fact that the columns are ordered in time, there
is no major difference from cross-sectional imputation methods.
Basic Settings for Longitudinal Data

Assume that for each independent unit i = 1, · · · , N in the study,
it is planned to collect a set of measurements

Yij : j = 1, · · · , ni ; i = 1, · · · , N.

In a longitudinal study,
i indicates the subject and j the measurement occasion.

For multivariate studies, j refers to the particular outcome variable.
Basic Settings for Longitudinal Data

The outcomes are collected into a vector as follows:

Y i = (Yi1 , Yi2 , · · · , Yini )T ; i = 1, · · · , N.

To account for the missing data, we define

Rij = 1 if Yij is observed, and Rij = 0 otherwise.

Here Rij is the missing-data indicator of subject i at occasion j.


Basic Settings for Longitudinal Data

These missing data indicators Rij are organized into a vector R i of
parallel structure to Y i .

 In the cross-sectional setting, δij = Rij refers to unit i and
predictor (or factor) j.

 In the longitudinal setting, δij = Rij refers to unit i, but the
meaning of j depends on the format used: wide or long.

In other words, the indicator Rij is adaptive to the type of data
format.
Missingness Patterns in Longitudinal Data
In clinical trials, the missingness patterns are summarized as follows:

Table: Patients' clinical trials

Subject Time 1 Time 2 Time 3 Time 4 Time 5 Time 6
1 O O O O O O
2 O O O O NA NA
3 NA O O O O O
4 O O NA O NA O
··· O O O O O O

O means observed data and NA means a missing record.

From the above,


I Subject 1: completer
I Subject 2: dropout (=monotone pattern)
I Subject 3: late entry
I Subject 4: intermittent (=intermediate or non-monotone patterns)
Basic Settings for Longitudinal Data

 When missingness is restricted to dropout or attrition, we can
replace the vector R i by a scalar variable Di , which denotes the
dropout indicator.

In this case, each vector R i is of the form

(1, 1, · · · , 1, 0, 0, · · · , 0)

and we can define the scalar dropout indicator

Di = 1 + Σ_{j=1}^{ni} Rij .
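The dropout indicator can be computed directly from the observation vector Ri. A minimal Python sketch, valid only under the monotone (dropout) pattern assumed here; the function name is ours:

```python
# Dropout indicator Di = 1 + sum_j Rij for a monotone missingness vector.

def dropout_indicator(r):
    """r is the vector (Ri1, ..., Rini) of observation indicators.
    Only valid under monotone missingness: r looks like (1,...,1,0,...,0)."""
    assert all(a >= b for a, b in zip(r, r[1:])), "pattern must be monotone"
    return 1 + sum(r)

# Complete sequence over ni = 4 occasions: Di = ni + 1 = 5
assert dropout_indicator([1, 1, 1, 1]) == 5
# Dropout after occasion 2: Di = 3, the first occasion that is missed
assert dropout_indicator([1, 1, 0, 0]) == 3
```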
Basic Settings for Longitudinal Data

 For an incomplete sequence, Di denotes the occasion at which
dropout occurs.

 For a complete sequence, Di = ni + 1.

 In both cases, Di is equal to one plus the length of the
measurement sequence, whether complete or incomplete.

Sometimes it is convenient to define an alternative dropout indicator:
Basic Settings for Longitudinal Data

Alternatively, the dropout indicator

Ti = Di − 1

indicates the number of measurements actually taken, rather than
the first occasion at which a planned measurement has not been
taken.

Dropout or attrition is an example of a monotone pattern of
missingness.

 The missingness mechanisms (processes) are similar to the
non-longitudinal case.
Basic Settings for Longitudinal Data

 The general setting of missing data in longitudinal studies has
the following characteristics:
I Occasion (time): baseline — follow-up

I Missingness pattern: complete — monotone — non-monotone

I Dropout pattern: complete — dropout — intermittent

I Missingness mechanism: MCAR — MAR — MNAR

I Ignorability: ignorable — non-ignorable

I Inference paradigm: frequentist — likelihood — Bayes


Illustration of CC and LOCF Methods

Table: Artificial incomplete data.

Week
Patient Treatment Baseline 1 2 3 4 5 6
1 1 22 20 18 16 14 12 10
2 1 22 21 18 15 12 9 6
3 1 22 22 21 20 19 -99 -99
4 2 20 20 20 20 21 21 22
5 2 21 22 22 23 24 25 26
6 2 18 19 20 -99 -99 -99 -99

In a clinical trial study, data related to two treatment groups were
collected on 6 subjects during 7 consecutive weeks.
Missing data were recorded as "-99".
CC and LOCF Methods
 Naive estimation / complete case (CC) analysis:
the CC method consists of removing incomplete units, here unit 3
and unit 6, from the sample, and then proceeding with the analysis
as usual.

Table: Longitudinal incomplete data.

Week
Patient Treatment Baseline 1 2 3 4 5 6
1 1 22 20 18 16 14 12 10
2 1 22 21 18 15 12 9 6
3 1 22 22 21 20 19 -99 -99
4 2 20 20 20 20 21 21 22
5 2 21 22 22 23 24 25 26
6 2 18 19 20 -99 -99 -99 -99
CC and LOCF Methods

 Last observation carried forward (LOCF) is the method most
commonly used in clinical trials for longitudinal data.
Principle: replace each missing value by the last observed value
from the same subject.

Table: Longitudinal incomplete data.


Week
Patient Treatment Baseline 1 2 3 4 5 6
1 1 22 20 18 16 14 12 10
2 1 22 21 18 15 12 9 6
3 1 22 22 21 20 19 19 19
4 2 20 20 20 20 21 21 22
5 2 21 22 22 23 24 25 26
6 2 18 19 20 20 20 20 20
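The LOCF fill applied to the table above can be sketched in a few lines of Python, using the same -99 sentinel for missing values; the function name is illustrative:

```python
# LOCF: replace each missing value (-99 in the lecture's table) with the
# last observed value from the same subject.

MISSING = -99

def locf(row):
    """row is one subject's sequence (baseline, week 1, ..., week 6)."""
    filled, last = [], None
    for y in row:
        if y != MISSING:
            last = y
        filled.append(last if y == MISSING else y)
    return filled

# Patient 3 from the artificial data: week-4 value 19 is carried forward
assert locf([22, 22, 21, 20, 19, -99, -99]) == [22, 22, 21, 20, 19, 19, 19]
# Patient 6: week-2 value 20 is carried forward to the end
assert locf([18, 19, 20, -99, -99, -99, -99]) == [18, 19, 20, 20, 20, 20, 20]
```

The two asserted rows reproduce exactly the filled-in values shown for patients 3 and 6 in the LOCF table above.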
Basic Settings for Longitudinal Data

 In principle, a choice has to be made regarding how to treat
the missingness process.

Note that, under certain assumptions, however, this process can be
ignored.

 The most recommended techniques include

I Likelihood-based methods: EM algorithm, multiple imputation,
weighting, etc.
I Bayesian ignorable analysis.

 Simple methods include: complete case analysis (CC), last
observation carried forward (LOCF), and others.
Problem with CC Analysis and LOCF

 Using an example, we show how CC and LOCF fail to give
consistent estimators for longitudinal data with dropout
missingness.

Assume that each subject i is to be measured on two occasions
ti = 0, 1.

 Subjects are randomized to one of two treatment arms:
Ti = 0 for the standard arm,
Ti = 1 for the experimental arm.

 The probability of a measurement being observed on the second
occasion (Di = 2) is
p0 for treatment group 0,
p1 for treatment group 1.
Problem with CC Analysis and LOCF
We can write the means of the observations in the two dropout
groups as follows:

For dropouts, with Di = 1,

β0 + β1 Ti + β2 ti + β3 Ti ti , (1)

and for completers, with Di = 2,

γ0 + γ1 Ti + γ2 ti + γ3 Ti ti . (2)

The true underlying population treatment difference at time ti = 1,
as determined from (1)-(2), is equal to

∆true = p1 (γ0 + γ1 + γ2 + γ3 ) + (1 − p1 )(β0 + β1 + β2 + β3 )
        − [p0 (γ0 + γ2 ) + (1 − p0 )(β0 + β2 )]. (3)
Problem with CC Analysis and LOCF

 If we use LOCF as the estimation procedure, the expectation of
the corresponding estimator is

∆LOCF = p1 (γ0 + γ1 + γ2 + γ3 ) + (1 − p1 )(β0 + β1 )
        − [p0 (γ0 + γ2 ) + (1 − p0 )β0 ]. (4)

 Alternatively, if we use CC, the above expression changes to

∆CC = γ1 + γ3 . (5)

Clearly these are, in general, both biased estimators.


Problem with CC Analysis and LOCF

• We will now consider the special but important cases where the
true missing data mechanisms are MCAR and MAR, respectively.

Each of these will impose particular constraints on the β and γ
parameters in models (1)-(2).

• Under MCAR, the β parameters are equal to their γ
counterparts, and ∆true simplifies to

∆mcar ,true = β1 + β3 . (6)

Suppose we choose to apply the LOCF procedure in this setting.

Problem with CC Analysis and LOCF

The expectation of the resulting estimator then simplifies to

∆mcar ,LOCF = β1 + (p1 − p0 )β2 + p1 β3 . (7)

The bias is given by the difference between (7) and (6):

Biasmcar ,LOCF = (p1 − p0 )β2 − (1 − p1 )β3 . (8)

Typically, this LOCF bias does not vanish.

On the other hand, expressions (5) and (6) reveal that the CC
estimator is unbiased under MCAR.
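The algebra can be checked numerically. The sketch below plugs arbitrary parameter values into expressions (3) and (4) with γ = β (the MCAR constraint) and confirms that the bias ∆LOCF − ∆true works out to (p1 − p0)β2 − (1 − p1)β3, i.e. with a minus sign on the β3 term:

```python
# Numeric check of the LOCF bias under MCAR, using arbitrary parameter
# values. Under MCAR the beta and gamma parameters coincide.
b0, b1, b2, b3 = 10.0, 2.0, -1.5, 0.7   # arbitrary values for beta = gamma
p0, p1 = 0.6, 0.8                       # completion probabilities per arm

# True treatment difference at time 1 (expression (3) with gamma = beta):
d_true = (p1 * (b0 + b1 + b2 + b3) + (1 - p1) * (b0 + b1 + b2 + b3)
          - (p0 * (b0 + b2) + (1 - p0) * (b0 + b2)))
assert abs(d_true - (b1 + b3)) < 1e-12                         # matches (6)

# Expectation of the LOCF estimator (expression (4) with gamma = beta):
d_locf = (p1 * (b0 + b1 + b2 + b3) + (1 - p1) * (b0 + b1)
          - (p0 * (b0 + b2) + (1 - p0) * b0))
assert abs(d_locf - (b1 + (p1 - p0) * b2 + p1 * b3)) < 1e-12   # matches (7)

# Bias = (7) - (6): (p1 - p0)*b2 - (1 - p1)*b3
assert abs((d_locf - d_true) - ((p1 - p0) * b2 - (1 - p1) * b3)) < 1e-12
```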
Example: Orthodontic Growth Data

The Potthoff and Roy (1964) data contain growth measurements
for 11 girls and 16 boys. For each subject, the distance from the
center of the pituitary to the maxillary fissure was recorded at
ages 8, 10, 12, and 14.

Little and Rubin (1987) deleted nine of the measurements at age
10, thereby producing nine incomplete subjects.

 They describe the mechanism as being such that subjects with a
low value at age 8 are more likely to have a missing value at age
10.

The measurements that were deleted are marked with an asterisk.
The data are presented in Table 2.1.
The Orthodontic Growth Data

Figure: longitudinal data


The Orthodontic Growth Data

Figure: Missing Patterns


The Orthodontic Growth Data

Figure: Missing Patterns


Generalized Estimating Equations

 Marginal models describe the measurements within a repeated
or multivariate sequence, Y i , conditional on covariates, but not on
other measurements, nor on unobserved (latent) structures.

 While full likelihood approaches exist for such models
(Molenberghs and Verbeke 2005), outside the multivariate normal
linear model setting they can be very demanding in computational
terms.

 This explains the popularity of so-called generalized
estimating equations (GEE; Liang and Zeger 1986).
General Estimating Equations

 Generalized estimating equations (GEE; Liang and Zeger
1986) are useful when scientific interest focuses on the first
moments of the outcome vector.

For example: time evolutions in the response probability, the
treatment effect, their interaction, and the effect of (baseline)
covariates on these probabilities.

 GEE allow the researcher to

I use a 'fix' for the correlations between measurements, present
in the second moments, and
I ignore the higher-order moments, while still obtaining valid
inferences, with reasonable efficiency.
Generalized estimating equations

 Relating linear regression to GEE: Consider a simple
example of linear regression. Suppose that we have n independent
pairs {Yi , Xi } such that

E(Yi ) = β0 + β1 xi , i = 1, 2, · · · , n,

with errors εi = Yi − E(Yi ) ∼ N(0, σ 2 ), independent.

The ordinary least squares regression line (or maximum likelihood)
is obtained by solving the normal equation for β:

Σ_i XiT (yi − XiT β̂) = 0.
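As a sanity check on the normal equation, the sketch below solves it in closed form for the two-parameter model E(Yi) = β0 + β1 xi, in plain Python; the function name is ours:

```python
# Solve the normal equation sum_i Xi^T (yi - Xi^T beta) = 0 for simple
# linear regression E(Yi) = b0 + b1*xi. With Xi = (1, xi), the equation
# reduces to the familiar two normal equations solved below.

def ols(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    # Normal equations: n*b0 + sx*b1 = sy  and  sx*b0 + sxx*b1 = sxy
    b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b0 = (sy - b1 * sx) / n
    return b0, b1

# An exact line y = 1 + 2x is recovered without error:
b0, b1 = ols([0, 1, 2, 3], [1, 3, 5, 7])
assert abs(b0 - 1) < 1e-12 and abs(b1 - 2) < 1e-12
```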
Generalized estimating equations

Suppose now that some outcome observations Yi are missing, and
let Ri be the binary missing value indicator, defined so that

Ri = 0 implies that Yi is missing.

It is easily checked that the following normal equations remain
unbiased (hence consistent for β):

Σ_observed XiT (yi − XiT β̂) + Σ_missing XiT (yi∗ − XiT β̂) = 0, (9)

where yi∗ = EYi |ri (Yi ).
Generalized estimating equations

 If Ri is independent of Yi (everything assumed conditional on
xi ), then

yi∗ = EYi |ri (Yi ) = E(Yi ) = XiT β,

and (9) reduces to

Σ_observed XiT (yi − XiT β̂) = 0,

which uses the completers only (CC data).

The assumption above is equivalent to MCAR, and in this setting,
without imposing a distribution on the Xi , there is no relevant
additional information in the incomplete pairs.
Generalized estimating equations

 Suppose now that Ri and Yi are not independent.

If the conditional distribution of Yi |ri were known, we could still
use (9); this is, for example, the basis of tobit regression.

But suppose that we do not know this distribution, and instead
know, or can estimate, the missingness probabilities

πi = P(Ri = 1 | yi , Xi ).

Note that πi is also called the selection probability, or the weight
for the i th observation.
Generalized estimating equations

Then it is easily seen that the IPW normal equations

Σ_observed XiT (yi − XiT β̂) / πi = 0 (10)

are unbiased (for known πi ), and hence consistent for β.

When consistent estimates of the πi are available, then, given
suitable regularity conditions, consistency for β is maintained.

Warning: these probabilities should not become too small and, in
particular, must not equal zero.
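The unbiasedness behind (10) can be verified exactly for a tiny example: for each unit, E[Ri yi / πi] = yi, so the IPW sum is unbiased for the complete-data sum. The sketch below enumerates all 2^n missingness patterns to compute the expectation exactly; the values of yi and πi are arbitrary:

```python
# Exact check that the IPW sum is unbiased: enumerate every missingness
# pattern, weight each by its probability, and compare the expected
# weighted sum with the complete-data sum.
from itertools import product

y  = [3.0, 7.0, 11.0]
pi = [0.5, 0.8, 0.25]          # known selection probabilities, all > 0

expected_ipw_sum = 0.0
for r in product([0, 1], repeat=len(y)):
    prob = 1.0
    for ri, p in zip(r, pi):
        prob *= p if ri == 1 else (1 - p)
    ipw_sum = sum(ri * yi / p for ri, yi, p in zip(r, y, pi))
    expected_ipw_sum += prob * ipw_sum

assert abs(expected_ipw_sum - sum(y)) < 1e-12   # unbiased for sum(y) = 21
```

Note how the warning above shows up here: a πi near zero inflates the weight 1/πi, so a single observed unit with tiny πi dominates the sum, and πi = 0 makes the weight undefined.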
Generalized estimating equations

 We can now generalize this to a setting where we have an
unbiased estimating equation for some parameter vector β and a
vector of observations (outcome and/or explanatory) zi :

S(β) = Σ_{i=1}^{n} Si (zi ; β̂) = 0. (11)

 If Ri = 1 is now the event of observing a complete set zi , then
the following IPW estimating equation is still unbiased:

Σ_observed Si (zi , β̂) / P(Ri = 1). (12)
Generalized estimating equations

 Remarks. Again, to use this in practice we would need suitable
estimates of the probabilities, obtained using, for example, logistic
regression.

For suitable estimates of the probabilities, it can be shown that

π̂ → π0 , as n → ∞,

where π0 is the true π.

The method developed in (12) is most applicable under MAR, for
then there is a direct route to obtaining estimates of these
probabilities.
Generalized Estimating Equations

 Relating generalized linear models (GLMs) to GEE:

 Univariate GLMs have a score function of the form (scalar Yi ):

S(β) = Σ_{i=1}^{N} (∂µi /∂β) υi^{-1} (yi − µi ) = 0, (13)

where υi = Var(Yi ).

 In the longitudinal setting, Y = (Y1 , Y2 , · · · , YN ):

S(β) = Σ_i Σ_j (∂µij /∂β) υij^{-1} (yij − µij )
     = Σ_{i=1}^{N} Di′ [Vi (α)]^{-1} (y i − µi ) = 0. (14)
Generalized Estimating Equations

Note that

I Di is an ni × p matrix with (j, k)th element ∂µij /∂βk

I y i and µi are ni -vectors with elements yij and µij , respectively

I Vi = Var(Y i ) is an ni × ni matrix; it is more complex since it
involves a set of nuisance parameters α, determining the
covariance structure of Y i .
Generalized Estimating Equations

Vi (β, α) = φ Ai^{1/2}(β) Ri (α) Ai^{1/2}(β),

in which Ai^{1/2}(β) is the diagonal matrix

Ai^{1/2}(β) = diag( √υi,1 (µi1 (β)), √υi,2 (µi2 (β)), · · · , √υi,ni (µini (β)) ),

and Ri (α) is the correlation matrix of Y i , parameterized by α as in
Liang and Zeger (1986).
Generalized Estimating Equations

 In general, a working correlation matrix is needed, as the exact
Vi (β, α) is just a theoretical variance.

• Interestingly, β̂ is consistent even if the working correlation
matrix is incorrect.

 An estimator of Vi can be found by replacing the unknown
variance Var(Y i ) by

(Y i − µ̂i )(Y i − µ̂i )′.

• Note that in GLMs with the identity link, µi = XiT β̂.
Generalized estimating equations

Working correlation matrix:

Vi (β, α, φ) = φ Ai^{1/2}(β) Ri (α) Ai^{1/2}(β).

Liang and Zeger (1986) proposed moment-based estimates for the
working correlation.
1. Independence

2. Exchangeable

3. AR(1)

4. Unstructured
working correlation matrix

1. Independence:

Ri =
[ 1 0 0 ... 0 ]
[ 0 1 0 ... 0 ]
[ ...         ]
[ 0 0 0 ... 1 ]

2. Exchangeable:

Ri =
[ 1 α α ... α ]
[ α 1 α ... α ]
[ ...         ]
[ α α α ... 1 ]
working correlation matrix

3. Autoregressive (AR(1)): the (j, k)th element is α^{|j−k|}:

Ri =
[ 1     α     α2    ... αm−1 ]
[ α     1     α     ... αm−2 ]
[ ...                        ]
[ αm−1  αm−2  αm−3  ... 1    ]

4. Unstructured (symmetric): the (j, k)th element is αjk , with αjk = αkj :

Ri =
[ 1    α12  α13  ... α1m ]
[ α21  1    α23  ... α2m ]
[ ...                    ]
[ αm1  αm2  αm3  ... 1   ]

There are others.
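The four structures above can be generated programmatically. A minimal Python sketch, building each matrix as nested lists (the function name and arguments are ours):

```python
# Build the four working correlation structures of Liang and Zeger (1986)
# for a cluster of size m, as plain nested lists.

def working_correlation(m, structure, alpha=0.0, alpha_mat=None):
    if structure == "independence":
        return [[1.0 if j == k else 0.0 for k in range(m)] for j in range(m)]
    if structure == "exchangeable":
        # constant correlation alpha off the diagonal
        return [[1.0 if j == k else alpha for k in range(m)] for j in range(m)]
    if structure == "ar1":
        # correlation decays geometrically with lag |j - k|
        return [[alpha ** abs(j - k) for k in range(m)] for j in range(m)]
    if structure == "unstructured":
        # alpha_mat supplies alpha_jk for j != k; assumed symmetric
        return [[1.0 if j == k else alpha_mat[j][k] for k in range(m)]
                for j in range(m)]
    raise ValueError(f"unknown structure: {structure}")

R = working_correlation(3, "ar1", alpha=0.5)
assert R[0] == [1.0, 0.5, 0.25]
assert working_correlation(2, "exchangeable", alpha=0.3)[0][1] == 0.3
```

Because β̂ remains consistent under a misspecified working correlation, the choice among these structures mainly affects efficiency, not consistency.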


Incomplete Longitudinal Data

Figure: Missing Patterns


Incomplete Longitudinal Data

• For MAR or MNAR mechanisms, GEE can be combined with

I IPW

I Augmented IPW

I Multiple imputation

I EM algorithm (maximum likelihood based methods)

• Under MCAR, GEE based on CC data is consistent.
