SAS/STAT 14.3 User's Guide: The GEE Procedure
This document is an individual chapter from SAS/STAT® 14.3 User’s Guide.
The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2017. SAS/STAT® 14.3 User’s Guide. Cary, NC:
SAS Institute Inc.
SAS/STAT® 14.3 User’s Guide
Copyright © 2017, SAS Institute Inc., Cary, NC, USA
All Rights Reserved. Produced in the United States of America.
SAS Institute Inc., SAS Campus Drive, Cary, NC 27513-2414
September 2017
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the
USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
SAS software may be provided with certain third-party software, including but not limited to open-source software, which is
licensed under its applicable third-party software license agreement. For license information about third-party software distributed
with SAS software, refer to http://support.sas.com/thirdpartylicenses.
Chapter 45
The GEE Procedure
Contents
   Overview: GEE Procedure
   Getting Started
   Syntax: GEE Procedure
      PROC GEE Statement
      BY Statement
      CLASS Statement
      EFFECTPLOT Statement
      ESTIMATE Statement
      FREQ Statement
      LSMEANS Statement
      LSMESTIMATE Statement
      MISSMODEL Statement
      MODEL Statement
      OUTPUT Statement
      REPEATED Statement
      SLICE Statement
      STORE Statement
      WEIGHT Statement
   Details: GEE Procedure
      Generalized Estimating Equations
      Alternating Logistic Regression
      Weighted Generalized Estimating Equations under the MAR Assumption
      Type 3 Analysis
      ODS Table Names
      ODS Graphics
   Examples: GEE Procedure
      Example 45.1: Comparison of the Marginal and Random Effect Models for Binary Data
      Example 45.2: Log-Linear Model for Count Data
      Example 45.3: Weighted GEE for Longitudinal Data That Have Missing Values
      Example 45.4: GEE for Binary Data with Logit Link Function
      Example 45.5: Alternating Logistic Regression for Ordinal Multinomial Data
      Example 45.6: GEE for Nominal Multinomial Data
   References
Getting Started
This section illustrates some of the basic features of the GEE procedure by analyzing longitudinal data from
Stokes, Davis, and Koch (2012).
In this study, researchers followed 25 children at ages 8, 9, 10, and 11 years. The goal of this study is to
investigate the health effects of air pollution on children. The binary response is the wheezing status of the
children at four different ages. The explanatory variables are age, city, and a passive smoking index (with
values 0, 1, and 2) that represents the degree of smoking in the home. The responses for individual children are
assumed to be equally correlated, implying an exchangeable correlation structure.
The following statements create the data set Children:
data Children;
   input ID City$ @@;
   do i=1 to 4;
      input Age Smoke Symptom @@;
      output;
   end;
   datalines;
1 steelcity 8 0 1 9 0 1 10 0 1 11 0 0
2 steelcity 8 2 1 9 2 1 10 2 1 11 1 0
3 steelcity 8 2 1 9 2 0 10 1 0 11 0 0
4 greenhills 8 0 0 9 1 1 10 1 1 11 0 0
5 steelcity 8 0 0 9 1 0 10 1 0 11 1 0
6 greenhills 8 0 1 9 0 0 10 0 0 11 0 1
7 steelcity 8 1 1 9 1 1 10 0 1 11 0 0
8 greenhills 8 1 0 9 1 0 10 1 0 11 2 0
9 greenhills 8 2 1 9 2 0 10 1 1 11 1 0
10 steelcity 8 0 0 9 0 0 10 0 0 11 1 0
11 steelcity 8 1 1 9 0 0 10 0 0 11 0 1
12 greenhills 8 0 0 9 0 0 10 0 0 11 0 0
13 steelcity 8 2 1 9 2 1 10 1 0 11 0 1
14 greenhills 8 0 1 9 0 1 10 0 0 11 0 0
15 steelcity 8 2 0 9 0 0 10 0 0 11 2 1
16 greenhills 8 1 0 9 1 0 10 0 0 11 1 0
17 greenhills 8 0 0 9 0 1 10 0 1 11 1 1
18 steelcity 8 1 1 9 2 1 10 0 0 11 1 0
19 steelcity 8 2 1 9 1 0 10 0 1 11 0 0
20 greenhills 8 0 0 9 0 1 10 0 1 11 0 0
21 steelcity 8 1 0 9 1 0 10 1 0 11 2 1
22 greenhills 8 0 1 9 0 1 10 0 0 11 0 0
23 steelcity 8 1 1 9 1 0 10 0 1 11 0 0
24 greenhills 8 1 0 9 1 1 10 1 1 11 2 1
25 greenhills 8 0 1 9 0 0 10 0 0 11 0 0
;
The following statements fit the model by the GEE method:
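A minimal sketch of such a call, assembled from the option descriptions in this section, is shown here. The DESCENDING option is an assumption: it is included so that the model describes the probability that Symptom=1 rather than Symptom=0.

```sas
/* Sketch only: assembled from the options described in this section. */
proc gee data=Children descending;   /* DESCENDING is an assumption: models Pr(Symptom=1) */
   class ID City;                    /* ID must be a CLASS variable; City is character    */
   model Symptom = City Age Smoke / dist=bin link=logit;
   repeated subject=ID / type=exch covb corrw;
run;
```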
Both the MODEL statement and the REPEATED statement are required.
The DIST=BIN and LINK=LOGIT options in the MODEL statement request a logistic regression with the
variable Symptom as the response and City, Age, and Smoke as explanatory variables.
The REPEATED statement specifies the correlation structure and requests various tables in the output. The
SUBJECT=ID option requests that individual subjects be identified in the input data set by the variable ID,
which must be listed in the CLASS statement. Measurements of individual subjects at ages 8, 9, 10, and 11
are in the proper order in the data set, so the WITHIN= option is not required. The TYPE=EXCH option
specifies an exchangeable working correlation structure, the COVB option requests the parameter estimate
covariance matrix, and the CORRW option requests the working correlation matrix.
Figure 45.1 shows the “Model Information” table, which provides information about the specified logistic
regression model and the input data set.
Model Information
   Data Set              WORK.CHILDREN
   Distribution          Binomial
   Link Function         Logit
   Dependent Variable    Symptom
Figure 45.2 displays general information about the GEE analysis. Each subject has four measurements.
Figure 45.3 displays the model-based and empirical covariance matrices of the parameter estimates.
The parameter estimates table, shown in Figure 45.5, contains parameter estimates, standard errors, confidence
intervals, Z scores, and p-values for the parameter estimates. Empirical standard error estimates are used in
this table. You can create a table that uses model-based standard errors by specifying the MODELSE option
in the REPEATED statement. The results indicate that smoking exposure is significant with a p-value of
0.0211, Age is marginally influential with a p-value of 0.0893, and City shows no significant effect on wheezing. The
parameter estimate for Age is –0.3201, which indicates that the odds of wheezing are multiplied by
$e^{-0.3201} = 0.726$ for each additional year of age.
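Because Age enters the model as a continuous covariate on the logit scale, the same estimate can be scaled to other age differences. The following worked arithmetic is a sketch under that linearity assumption; the three-year comparison (age 8 versus age 11) is an illustration and is not a result reported in the output.

```latex
\begin{align*}
\mathrm{OR}_{\text{1 year}}  &= e^{\hat\beta_{\mathrm{Age}}}  = e^{-0.3201} \approx 0.726, \\
\mathrm{OR}_{\text{3 years}} &= e^{3\hat\beta_{\mathrm{Age}}} = e^{-0.9603} \approx 0.383.
\end{align*}
```

In other words, the estimated odds of wheezing fall by about 27% per year of age.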
Goodness-of-fit criteria for the model are displayed in Figure 45.6. For more information about the quasi-
likelihood information criterion (QIC), see the section “Quasi-likelihood Information Criterion” on page 3127.
GEE Fit Criteria
   QIC     137.1373
   QICu    136.2173
PROC GEE Statement
The PROC GEE statement invokes the GEE procedure. Table 45.1 summarizes the options available in the
PROC GEE statement.
Option        Description
DATA=         Specifies the input data set
DESCENDING    Sorts the response variable in the reverse of the default order
NAMELEN=      Specifies the length of effect names
ORDER=        Specifies the sort order of CLASS variable
PLOTS         Controls the plots that are produced through ODS Graphics
DATA=SAS-data-set
specifies the SAS data set that contains the data to be analyzed. If you omit the DATA= option, PROC
GEE uses the most recently created SAS data set.
DESCENDING
DESCEND
DESC
requests that the levels of the response variable for the binomial model that uses a single-variable
response syntax be sorted in the reverse of the default order.
NAMELEN=number
specifies the length to which long effect names are shortened. The default and minimum value is 20.
For more information about enabling and disabling ODS Graphics, see the section “Enabling and
Disabling ODS Graphics” on page 615 in Chapter 21, “Statistical Graphics Using ODS.”
You can specify the following plot-requests:
ALL
requests that all default plots be produced.
HISTOGRAM
creates a histogram for the predicted weights from the missingness model.
NONE
suppresses all plots.
BY Statement
BY variables ;
You can specify a BY statement with PROC GEE to obtain separate analyses of observations in groups that
are defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be
sorted in order of the BY variables. If you specify more than one BY statement, only the last one specified is
used.
If your input data set is not sorted in ascending order, use one of the following alternatives:
- Sort the data by using the SORT procedure with a similar BY statement.
- Specify the NOTSORTED or DESCENDING option in the BY statement for the GEE procedure. The
  NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged
  in groups (according to values of the BY variables) and that these groups are not necessarily in
  alphabetical or increasing numeric order.
- Create an index on the BY variables by using the DATASETS procedure (in Base SAS software).
For more information about BY-group processing, see the discussion in SAS Language Reference: Concepts.
For more information about the DATASETS procedure, see the discussion in the SAS Visual Data Management
and Utility Procedures Guide.
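As an illustration of the first alternative, the following sketch sorts the Children data set from the "Getting Started" section and fits a separate model for each city. The data set name ChildrenSorted is hypothetical, and City is dropped from the MODEL statement because it is constant within each BY group. Each BY group contains only 12 or 13 children, so the fits are meant to show the syntax rather than to support inference.

```sas
/* Sort so that BY groups are contiguous and each child's records stay in age order. */
proc sort data=Children out=ChildrenSorted;   /* ChildrenSorted is a hypothetical name */
   by City ID Age;
run;

/* One GEE fit per city. DESCENDING is an assumption: models Pr(Symptom=1). */
proc gee data=ChildrenSorted descending;
   by City;
   class ID;
   model Symptom = Age Smoke / dist=bin link=logit;
   repeated subject=ID / type=exch;
run;
```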
CLASS Statement
CLASS variables < / options > ;
The CLASS statement names the classification variables to be used in the analysis. If the CLASS statement
is used, it must appear before the MODEL statement.
Classification variables can be either character or numeric. CLASS levels are determined from the formatted
values of the variables. Thus, you can use formats to group values into levels. For more information, see the
discussion of the FORMAT procedure in the SAS Visual Data Management and Utility Procedures Guide
and the discussions of the FORMAT statement and SAS formats in SAS Formats and Informats: Reference.
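Grouping values into levels with a format might look like the following sketch. The format name SmokeGrp and its labels are hypothetical; the sketch collapses the smoking index of the Children data set into two CLASS levels without changing the stored values.

```sas
/* Hypothetical format: collapse the smoking index 0/1/2 into two levels. */
proc format;
   value SmokeGrp 0     = 'None'
                  1 - 2 = 'Exposed';
run;

/* CLASS levels are taken from the formatted values of Smoke. */
proc gee data=Children descending;   /* DESCENDING is an assumption: models Pr(Symptom=1) */
   format Smoke SmokeGrp.;
   class ID City Smoke;
   model Symptom = City Age Smoke / dist=bin link=logit;
   repeated subject=ID / type=exch;
run;
```

Because Smoke is now a classification variable with two formatted levels, the model estimates one contrast between those levels instead of a slope for the numeric index.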
You can specify the following options for classification variables:
DESCENDING
DESC
reverses the sort order of the classification variable. If you specify both the DESCENDING and
ORDER= options, PROC GEE orders the categories according to the ORDER= option and then
reverses that order.
ORDER=order-type
specifies the sort order for the categories of categorical variables. This ordering determines which
parameters in the model correspond to each level in the data. When the default ORDER=FORMATTED
is in effect for numeric variables for which you have supplied no explicit format, the levels are ordered
by their internal values. Table 45.2 shows how PROC GEE interprets values of the ORDER= option.
For the FORMATTED and INTERNAL values, the sort order is machine-dependent. If you specify
the ORDER= option in the MODEL statement and the ORDER= option in the CLASS statement, the
former takes precedence.
For more information about sort order, see the chapter on the SORT procedure in the SAS Visual
Data Management and Utility Procedures Guide and the discussion of BY-group processing in SAS
Language Reference: Concepts.
EFFECTPLOT Statement
EFFECTPLOT < plot-type < (plot-definition-options) > > < / options > ;
The EFFECTPLOT statement produces a display of the fitted model and provides options for changing and
enhancing the displays. Table 45.3 describes the available plot-types and their plot-definition-options.
For full details about the syntax and options of the EFFECTPLOT statement, see the section “EFFECTPLOT
Statement” on page 420 in Chapter 19, “Shared Concepts and Topics.”
ESTIMATE Statement
ESTIMATE < 'label' > estimate-specification < (divisor=n) >
         < , . . . < 'label' > estimate-specification < (divisor=n) > >
         < / options > ;
The ESTIMATE statement provides a mechanism for obtaining custom hypothesis tests. Estimates are
formed as linear estimable functions of the form $L\beta$. You can perform hypothesis tests for the estimable
functions, construct confidence limits, and obtain specific nonlinear transformations.
Table 45.4 summarizes the options available in the ESTIMATE statement.
Option        Description

Construction and Computation of Estimable Functions
DIVISOR=      Specifies a list of values to divide the coefficients
NOFILL        Suppresses the automatic fill-in of coefficients for higher-order effects
SINGULAR=     Tunes the estimability checking difference

Statistical Output
CL            Constructs confidence limits
CORR          Displays the correlation matrix of estimates
COV           Displays the covariance matrix of estimates
E             Prints the L matrix
JOINT         Produces a joint F or chi-square test for the estimable functions
PLOTS=        Requests ODS statistical graphics if the analysis is sampling-based
SEED=         Specifies the seed for computations that depend on random numbers
For details about the syntax of the ESTIMATE statement, see the section “ESTIMATE Statement” on
page 448 in Chapter 19, “Shared Concepts and Topics.”
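As a sketch of how an estimable function might be requested for the Children model from the "Getting Started" section, the following statements estimate the Age effect for a one-year and a three-year difference. The labels are hypothetical. The EXP option, which exponentiates the estimate to give an odds ratio under the logit link, comes from the shared ESTIMATE syntax in Chapter 19 and is not listed in Table 45.4, so treat its use here as an assumption.

```sas
proc gee data=Children descending;   /* DESCENDING is an assumption: models Pr(Symptom=1) */
   class ID City;
   model Symptom = City Age Smoke / dist=bin link=logit;
   repeated subject=ID / type=exch;
   /* L has a single nonzero coefficient, on Age. */
   estimate 'Age, per year' Age 1 / cl exp;
   /* Three times the Age coefficient: age 11 versus age 8. */
   estimate 'Age, 8 to 11'  Age 3 / cl exp;
run;
```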
FREQ Statement
FREQ variable ;
FREQUENCY variable ;
The variable in the FREQ statement identifies a variable in the input data set that contains the frequency of
occurrence of each observation. PROC GEE treats each observation as if it appeared n times, where n is the
value of the FREQ variable for the observation. If the frequency value is not an integer, it is truncated to an
integer. If it is less than 1 or missing, the observation is not used. The frequencies must be the same for all
observations within each subject.
LSMEANS Statement
LSMEANS < model-effects > < / options > ;
The LSMEANS statement computes and compares least squares means (LS-means) of fixed effects. LS-means
are predicted population margins—that is, they estimate the marginal means over a balanced population. In a
sense, LS-means are to unbalanced designs as class and subclass arithmetic means are to balanced designs.
Table 45.5 summarizes the options available in the LSMEANS statement.
Option        Description

Construction and Computation of LS-Means
AT            Modifies the covariate value in computing LS-means
BYLEVEL       Computes separate margins
DIFF          Requests differences of LS-means
OM=           Specifies the weighting scheme for LS-means computation as determined by the input data set
SINGULAR=     Tunes estimability checking

Statistical Output
CL            Constructs confidence limits for means and mean differences
CORR          Displays the correlation matrix of LS-means
COV           Displays the covariance matrix of LS-means
E             Prints the L matrix
LINES         Uses connecting lines to indicate nonsignificantly different subsets of LS-means
LINESTABLE    Displays the results of the LINES option as a table
MEANS         Prints the LS-means
PLOTS=        Requests graphs of means and mean comparisons
SEED=         Specifies the seed for computations that depend on random numbers
For details about the syntax of the LSMEANS statement, see the section “LSMEANS Statement” on page 464
in Chapter 19, “Shared Concepts and Topics.”
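A minimal sketch of an LSMEANS request for the Children model follows. It uses only options that appear in Table 45.5; note that with the logit link the LS-means and their difference are reported on the logit scale.

```sas
proc gee data=Children descending;   /* DESCENDING is an assumption: models Pr(Symptom=1) */
   class ID City;
   model Symptom = City Age Smoke / dist=bin link=logit;
   repeated subject=ID / type=exch;
   /* LS-means for the two cities, their difference, and confidence limits. */
   lsmeans City / diff cl;
run;
```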
LSMESTIMATE Statement
LSMESTIMATE model-effect < 'label' > values < divisor=n >
            < , . . . < 'label' > values < divisor=n > >
            < / options > ;
The LSMESTIMATE statement provides a mechanism for obtaining custom hypothesis tests among least
squares means.
Table 45.6 summarizes the options available in the LSMESTIMATE statement.
Option        Description

Construction and Computation of LS-Means
AT            Modifies covariate values in computing LS-means
BYLEVEL       Computes separate margins
DIVISOR=      Specifies a list of values to divide the coefficients
OM=           Specifies the weighting scheme for LS-means computation as determined by a data set
SINGULAR=     Tunes estimability checking

Statistical Output
CL            Constructs confidence limits for means and mean differences
CORR          Displays the correlation matrix of LS-means
COV           Displays the covariance matrix of LS-means
E             Prints the L matrix
ELSM          Prints the K matrix
JOINT         Produces a joint F or chi-square test for the LS-means and LS-means differences
PLOTS=        Requests graphs of means and mean comparisons
SEED=         Specifies the seed for computations that depend on random numbers
For details about the syntax of the LSMESTIMATE statement, see the section “LSMESTIMATE Statement”
on page 484 in Chapter 19, “Shared Concepts and Topics.”
MISSMODEL Statement
MISSMODEL effects < / options > ;
The MISSMODEL statement requests a weighted GEE analysis. It specifies a logistic regression that is
used to estimate the weights under the MAR assumption. If the pattern of missing data is intermittent (not
dropout), the GEE procedure terminates and does not perform an analysis.
You can use the same effects or different effects in the MODEL and MISSMODEL statements. Explanatory
variables can be continuous or classification variables. Classification variables can be character or numeric.
Explanatory variables that represent nominal (classification) data must be declared in a CLASS statement.
Interactions between variables can also be included as effects. Columns of the design matrix are automatically
generated for classification variables and interactions. The syntax for effects is the same as for the GLM
procedure. For more information, see the section “Specification of Effects” on page 3773 in Chapter 48,
“The GLM Procedure.”
You can specify the following options after a slash (/).
MAXWEIGHT=number
truncates the predicted weights from the missingness model if they are larger than number, where
number ≥ 1.
TYPE=OBSLEVEL | SUBLEVEL
specifies the type of weighted GEE method. You can specify the following values:
By default, TYPE=OBSLEVEL.
MODEL Statement
MODEL response = < effects > < / options > ;
The MODEL statement specifies the response (dependent variable) and the effects (explanatory variables). If
you omit the explanatory variables, PROC GEE fits an intercept-only model. An intercept term is included in
the model by default. You can remove the intercept by specifying the NOINT option.
You can specify the response in the form of a single variable (response) or in the form of a ratio of two
variables ( events/trials). The first form is applicable to all responses. The second form is applicable only to
summarized binomial response data. When each observation in the input data set contains the number of
events (for example, successes) and the number of trials from a set of binomial trials, use the events/trials
syntax.
In the events/trials model syntax, you specify two variables: one for the event counts and one for trial counts.
These two variables are separated by a slash (/). The value of the events variable must be nonnegative,
and the value of the trials variable must be equal to or greater than the value of the events variable for an
observation to be valid. The events and trials variables can take non-integer values.
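To show the events/trials form, the following sketch first summarizes the Children data set from the "Getting Started" section into one row per city and age. The data set name CityAge and the variable names Events and Trials are hypothetical. With only two clusters (the two cities), the empirical standard errors are not trustworthy, so the example is meant to illustrate the syntax only.

```sas
/* Summarize: number of wheezing children (Events) out of those observed (Trials). */
proc sql;
   create table CityAge as
   select City, Age,
          sum(Symptom)   as Events,
          count(Symptom) as Trials   /* counts nonmissing responses */
   from Children
   group by City, Age
   order by City, Age;
quit;

/* Events/trials response; each city is treated as one cluster. */
proc gee data=CityAge;
   class City;
   model Events/Trials = Age / dist=bin link=logit;
   repeated subject=City / type=ind;
run;
```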
When each observation in the input data set contains a single trial from a binomial experiment, use the
response form of the MODEL statement. The response variable can be numeric or character. The ordering
of response levels is critical in these models.
Responses for the Poisson distribution must be all nonnegative, but they can be non-integer values.
The effects in the MODEL statement consist of an explanatory variable or combination of variables.
Explanatory variables can be continuous or classification variables. Classification variables can be character or
numeric. Explanatory variables that represent nominal (classification) data must be declared in a CLASS
statement. Interactions between variables can also be included as effects. Columns of the design matrix
are automatically generated for classification variables and interactions. The syntax for specifying effects
is the same as for the GLM procedure. For more information, see the section “Specification of Effects” on
page 3773 in Chapter 48, “The GLM Procedure.”
Table 45.7 summarizes the options available in the MODEL statement.
Option        Description
ALPHA=        Sets the confidence coefficient
DIST=         Specifies the probability distribution
LINK=         Specifies the link function
NOINT         Requests no intercept term
NOSCALE       Holds the scale parameter fixed
OFFSET=       Specifies a variable in the input data set to be used as an offset
SCALE=        Specifies the value used for the scale
TYPE3         Computes statistics for Type 3 contrasts
WALD          Requests Wald statistics for Type 3 contrasts
ALPHA=number
sets the confidence coefficient for parameter confidence intervals to 1–number . The value of number
must be between 0 and 1. The default value of number is 0.05.
DIST=keyword
D=keyword
ERROR=keyword
ERR=keyword
specifies the built-in probability distribution to use in the model. If you specify the DIST= option
and you omit the LINK= option, a default link function is chosen as displayed in Table 45.8. If you
specify neither the DIST= option nor the LINK= option, then the GEE procedure defaults to the normal
distribution with the identity link function.
LINK=keyword
specifies the link function in the model. You can specify the keywords shown in Table 45.9.
LINK=                     Link Function                       $g(\mu) = \eta =$
CLOGLOG | CLL             Complementary log-log               $\log(-\log(1-\mu))$
CUMCLL | CCLL             Cumulative complementary log-log    $\log(-\log(1-\mu))$
CUMLOGIT | CLOGIT         Cumulative logit                    $\log(\mu/(1-\mu))$
CUMPROBIT | CPROBIT       Cumulative probit                   $\Phi^{-1}(\mu)$
GLOGIT                    Generalized logit
IDENTITY | ID             Identity                            $\mu$
LOG                       Log                                 $\log(\mu)$
LOGIT                     Logit                               $\log(\mu/(1-\mu))$
PROBIT                    Probit                              $\Phi^{-1}(\mu)$
INVERSE | RECIPROCAL      Reciprocal                          $1/\mu$
POWERMINUS2               Power with exponent –2              $1/\mu^2$

For the probit and cumulative probit links, $\Phi^{-1}(\cdot)$ denotes the quantile function of the standard normal
distribution. If you do not specify the LINK= option, then by default the canonical link function is used
if you specify the DIST= option. Otherwise, if you omit the DIST= option, the identity link function is
used.
The cumulative link functions are appropriate only for the multinomial distribution with ordinal
responses, with cumulative probabilities indicated by $\mu$. The GLOGIT link function is appropriate
only for the multinomial distribution with nominal responses.
NOINT
requests that no intercept term be included in the model. An intercept is included unless this option is
specified.
NOSCALE
holds the scale parameter fixed. Otherwise, for the normal, inverse Gaussian, and gamma distributions,
the scale parameter is estimated by maximum likelihood. If you omit the SCALE= option, the scale
parameter is fixed at the value 1.
OFFSET=variable
specifies a variable in the input data set to be used as an offset variable. This variable cannot be a
CLASS variable, the response variable, or any of the explanatory variables.
SCALE=number
SCALE=PEARSON | P
PSCALE
SCALE=DEVIANCE | D
DSCALE
specifies the value used for the scale parameter when the NOSCALE option is used. For the binomial
and Poisson distributions, which have no free scale parameter, this can be used to specify an overdispersed
model. If the NOSCALE option is not specified, then number is used as an initial estimate of
the scale parameter.
Specifying SCALE=PEARSON or SCALE=P is the same as specifying the PSCALE option. This
fixes the scale parameter at the value 1 in the estimation procedure. After the parameter estimates
are determined, the exponential family dispersion parameter is assumed to be given by Pearson’s
chi-square statistic divided by the degrees of freedom, and all statistics such as standard errors are
adjusted appropriately.
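Specifying SCALE=DEVIANCE or SCALE=D is the same as specifying the DSCALE option. This
fixes the scale parameter at a value of 1 in the estimation procedure. After the parameter estimates
are determined, the exponential family dispersion parameter is assumed to be given by the deviance
divided by the degrees of freedom, and all statistics such as standard errors are adjusted appropriately.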
TYPE3
requests that statistics for Type 3 contrasts be computed for each effect specified in the MODEL
statement. The default analysis is to compute score statistics for the contrasts. Type 3 analyses using
the score statistics are not supported for nominal response data or weighted GEE methods. Wald
statistics are computed if the WALD option is also specified.
WALD
requests Wald statistics for Type 3 contrasts. You must also specify the TYPE3 option in order to
compute Type 3 Wald statistics.
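The two options can be combined as in the following sketch, which requests Type 3 Wald statistics for each effect in the Children model from the "Getting Started" section. Omitting WALD would request the default score statistics instead.

```sas
proc gee data=Children descending;   /* DESCENDING is an assumption: models Pr(Symptom=1) */
   class ID City;
   /* TYPE3 requests Type 3 contrasts; WALD switches them from score to Wald statistics. */
   model Symptom = City Age Smoke / dist=bin link=logit type3 wald;
   repeated subject=ID / type=exch;
run;
```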
OUTPUT Statement
OUTPUT < OUT=SAS-data-set > < keyword=name . . . keyword=name > ;
The OUTPUT statement creates a new SAS data set that contains all the variables in the input data set and,
optionally, the estimated linear predictors (XBETA) and their standard error estimates, predicted values of
the mean, and confidence limits for predicted values.
If you use the multinomial distribution with one of the cumulative link functions for ordinal data, the data
set also contains variables named _ORDER_ and _LEVEL_ that indicate the levels of the ordinal response
variable and the values of the variable in the input data set corresponding to the sorted levels. These variables
indicate that the predicted value for a given observation is the probability that the response variable is
less than or equal to the value of the _LEVEL_ variable. Residuals and other diagnostic statistics are not available for the
multinomial distribution.
The estimated linear predictor, its standard error estimate, and the predicted values and their confidence
intervals are computed for all observations in which the explanatory variables are all nonmissing, even if
the response is missing. By adding observations with missing response values to the input data set, you can
compute these statistics for new observations or for settings of the explanatory variables not present in the
data without affecting the model fit.
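One way to apply this technique to the Children data set is sketched here. The child with ID 26 and its covariate values are hypothetical, as are the data set names ChildrenPlus and Pred and the output variable names. Because Symptom is missing for the added rows, they do not contribute to the fit but still receive predictions.

```sas
/* Append four rows for a hypothetical new child whose response is missing. */
data ChildrenPlus;
   set Children end=last;
   output;                          /* keep every original observation    */
   if last then do;
      ID = 26;                      /* hypothetical subject               */
      City = 'steelcity';           /* stored at the length City already has in Children */
      Smoke = 2;
      Symptom = .;                  /* missing response: not used in fit  */
      do Age = 8 to 11;
         output;
      end;
   end;
   drop i;
run;

proc gee data=ChildrenPlus descending;   /* DESCENDING is an assumption */
   class ID City;
   model Symptom = City Age Smoke / dist=bin link=logit;
   repeated subject=ID / type=exch;
   /* Predicted probability, its confidence limits, and the linear predictor. */
   output out=Pred pred=PHat lower=PLower upper=PUpper
          xbeta=Eta stdxbeta=EtaSE;
run;
```

The rows of Pred where ID=26 then hold the predicted wheezing probabilities for the new covariate settings.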
The following list explains specifications in the OUTPUT statement.
OUT=SAS-data-set
specifies the output data set. If you omit the OUT= option, the output data set is created and given a
default name that uses the DATAn convention.
keyword=name
specifies the statistics to be included in the output data set and names the new variables that contain the
statistics. Specify a keyword for each desired statistic (see the following list of keywords), an equal
sign, and the name of the new variable or variables to contain the statistic.
Although you can use the OUTPUT statement without any keyword=name specifications, the output
data set then contains only the original variables and, possibly, the variables _ORDER_ and _LEVEL_ (if you
use the multinomial model with ordinal data).
The keywords allowed and the statistics they represent are as follows:
LOWER | L
      represents the lower confidence limit for the predicted value of the mean, or the
      lower confidence limit for the probability that the response is less than or equal
      to the value of _LEVEL_. The confidence coefficient is determined by the
      ALPHA=number option in the MODEL statement as $(1 - \textit{number}) \times 100\%$. The
      default confidence coefficient is 95%.
PREDICTED | PRED | PROB | P
      represents the predicted value of the mean of the response or the
      predicted probability that the response variable is less than or equal to the value
      of _LEVEL_ if the multinomial model for ordinal data is used (in other words,
      $\Pr(Y \le \text{\_LEVEL\_})$, where Y is the response variable).
RESCHI
      represents the Pearson (chi) residual for identifying observations that are poorly
      accounted for by the model. This option is not available for the multinomial
      distribution.
RESRAW
      represents the raw residual for identifying poorly fitted observations. This option is
      not available for the multinomial distribution.
STDXBETA
      represents the standard error estimate of XBETA (see the XBETA keyword).
UPPER | U
      represents the upper confidence limit for the predicted value of the mean, or the
      upper confidence limit for the probability that the response is less than or equal
      to the value of _LEVEL_. The confidence coefficient is determined by the
      ALPHA=number option in the MODEL statement as $(1 - \textit{number}) \times 100\%$. The
      default confidence coefficient is 95%.
XBETA
      represents the estimate of the linear predictor $x_i'\beta$ for observation i, or
      $\alpha_j + x_i'\beta$, where j is the corresponding ordered value of the response variable for the
      multinomial model with ordinal data. If there is an offset, it is included in $x_i'\beta$.
REPEATED Statement
REPEATED SUBJECT=subject-effect < / options > ;
The REPEATED statement specifies the correlation structure of the responses for GEE model fitting. In
addition, the REPEATED statement controls the iterative fitting algorithm and specifies optional output.
Table 45.10 summarizes the options available in the REPEATED statement.
Option Description
ALPHAINIT= Specifies initial values for log odds ratio regression parameters
CONVERGE= Specifies the convergence criterion for GEE parameter estimation
CORRB Displays the estimated correlation matrix
REPEATED Statement F 3121
Option Description
CORRW Displays the estimated working correlation matrix
COVB Displays the estimated covariance matrix
ECORRB Displays the estimated empirical correlation matrix
ECOVB Displays the estimated empirical covariance matrix
INITIAL= Specifies initial values of the regression parameters
INTERCEPT= Specifies an initial value of the intercept
LOGOR= Specifies the use of alternating logistic regression and a model for the log
odds ratio
MAXITER= Specifies the maximum number of iterations
MCORRB Displays the estimated model-based correlation matrix
MCOVB Displays the estimated model-based covariance matrix
MODELSE Displays a parameter estimates table with the model-based standard errors
SUBCLUSTER= Specifies a variable that defines subclusters
SUBJECT= Identifies subjects (clusters) in the input data set
TYPE= Specifies the working correlation matrix structure
WITHIN= Specifies the order of measurements within subjects
ZDATA= Specifies the full z matrix
ZROW= Specifies the rows of the z matrix
SUBJECT=subject-effect
identifies subjects in the input data set. The subject-effect can be a single variable, an interaction effect,
a nested effect, or a combination. Each distinct value (level) of the effect identifies a different subject
(cluster). Responses from different subjects are assumed to be statistically independent, and responses
within subjects are assumed to be correlated. You must specify a subject-effect, and you must list
variables that are used in defining the subject-effect in the CLASS statement.
You can also specify the following options after a slash (/) to control how the model is fit and what output is
produced:
ALPHAINIT=numbers
specifies initial values for log odds ratio regression parameters if you specify the option LOGOR= for
data that have either binary or ordinal multinomial responses. The default value of numbers is 0.01.
CONVERGE=number
specifies the convergence criterion for GEE parameter estimation. If the maximum absolute difference
between regression parameter estimates is less than number on two successive iterations, convergence
is declared. If the absolute value of a regression parameter estimate is greater than 0.08, then the
absolute difference normalized by the regression parameter value is used instead of the absolute
difference. The default value of number is 0.0001.
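The convergence rule described for CONVERGE= can be sketched outside SAS as follows. This is a minimal Python illustration, not PROC GEE's implementation: the function name `gee_converged` and its arguments are hypothetical, and normalizing by the current (rather than the previous) estimate is an assumption, because the text does not say which value is used.

```python
def gee_converged(previous, current, tol=1e-4, switch=0.08):
    """Return True when every parameter change is below tol.

    The absolute change is used for small estimates; once the absolute
    value of an estimate exceeds `switch`, the change is divided by the
    estimate (assumption: the current estimate is the normalizer).
    """
    for old, new in zip(previous, current):
        change = abs(new - old)
        if abs(new) > switch:
            change /= abs(new)  # normalized (relative) difference
        if change >= tol:
            return False
    return True


# Two successive iterations of a two-parameter model (made-up numbers).
small_step = gee_converged([0.5000, 0.0100], [0.50004, 0.01005])
large_step = gee_converged([0.5], [0.6])
print(small_step, large_step)
```

The first call uses the relative change for the parameter near 0.5 and the absolute change for the parameter near 0.01.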
CORRB
displays the estimated regression parameter correlation matrix. Both model-based and empirical
correlations are displayed.
CORRW
displays the estimated working correlation matrix. If you specify TYPE=EXCH for the exchangeable
working correlation structure, then the CORRW option is not needed to view the estimated correlation,
because a table that contains the single estimated correlation is printed by default.
COVB
displays the estimated regression parameter covariance matrix. Both model-based and empirical
covariances are displayed.
ECORRB
displays the estimated regression parameter empirical correlation matrix.
ECOVB
displays the estimated regression parameter empirical covariance matrix.
INITIAL=numbers
specifies initial values of the regression parameters, other than the intercept parameter, for
GEE estimation. If you do not specify this option, then the regression parameter estimates that are
obtained by assuming independence for all responses are used as the initial values.
INTERCEPT=number
specifies an initial value of the intercept regression parameter in the GEE model.
LOGOR=log-odds-ratio-structure-keyword
specifies the use of the alternating logistic regression (ALR) method and the regression model structure
for the log odds ratio. For data that have either a binary or ordinal multinomial response distribution,
the ALR method uses the log odds ratio to model the association of the responses from subjects. For
more information about the ALR method and examples of specifying log odds ratio models, see the
section “Alternating Logistic Regression” on page 3128. You can specify the values that are shown in
Table 45.11.
For ordinal multinomial data, only the exchangeable regression structure that is specified by
LOGOR=EXCH is supported. You should specify the option LOGOR= or TYPE=, but not both.
MAXITER=number
MAXIT=number
specifies the maximum number of iterations allowed in the iterative GEE estimation process. By
default, MAXITER=50.
MCORRB
displays the estimated regression parameter model-based correlation matrix.
MCOVB
displays the estimated regression parameter model-based covariance matrix.
MODELSE
displays a parameter estimates table that uses model-based standard errors for inference. By default, a
“Parameter Estimates” table that is based on empirical standard errors is displayed.
SUBCLUSTER=variable
SUBCLUST=variable
specifies a variable that defines subclusters for the 1-nested or k-nested log odds ratio association
modeling structures for data that have a binary response distribution. A 1-nested or k-nested modeling
structure is specified in the option LOGOR=, and variable must be listed in the CLASS statement. For
definitions of the 1-nested and k-nested modeling structures, see the section “Specifying Log Odds
Ratio Models” on page 3130.
TYPE=correlation-structure-keyword
CORR=correlation-structure-keyword
specifies the structure of the working correlation matrix that is used to model the correlation of the
responses from subjects for ordinary GEEs. You can specify the values that are shown in Table 45.12
(for definitions of the correlation matrix types, see Table 45.13 in the section “Details: GEE Procedure”
on page 3125).
By default, TYPE=IND. When you specify the alternating logistic regression method by using the option
LOGOR=, you should not specify TYPE=.
WITHINSUBJECT=within-subject-effect
WITHIN=within-subject-effect
defines an effect that specifies the order of measurements within subjects. Each distinct level of the
within-subject-effect defines a different response from the same subject. If the data are in proper order
within each subject, you do not need to specify this option.
If some measurements do not appear in the data for some subjects, this option properly orders the
existing measurements and treats the omitted measurements as missing values.
If you do not specify the WITHIN= option for the standard GEE method, missing values are assumed
to be the last values and are not used; the remaining observations are then ordered in the sequence
in which they are provided in the input data set. If you do not specify the WITHIN= option for the
weighted GEE method, the observations are assumed to be ordered in the sequence in which they are
provided in the input data set.
Variables that are used in defining the within-subject-effect must be listed in the CLASS statement.
ZDATA=SAS-data-set
specifies a SAS data set that contains either the full z matrix for log odds ratio association modeling for
data with binary responses or the z matrix for a single complete cluster to be replicated for all clusters.
ZROW=variable-list
specifies the variables in the ZDATA= data set that correspond to rows of the z matrix for log odds
ratio association modeling for data with binary responses.
SLICE Statement
SLICE model-effect < / options > ;
The SLICE statement provides a general mechanism for performing a partitioned analysis of the LS-means
for an interaction. This analysis is also known as an analysis of simple effects.
The SLICE statement uses the same options as the LSMEANS statement, which are summarized in
Table 19.21. For details about the syntax of the SLICE statement, see the section “SLICE Statement” on
page 512 in Chapter 19, “Shared Concepts and Topics.”
STORE Statement
STORE < OUT= >item-store-name < / LABEL='label' > ;
The STORE statement requests that the procedure save the context and results of the statistical analysis. The
resulting item store has a binary file format that cannot be modified. The contents of the item store can be
processed with the PLM procedure. For details about the syntax of the STORE statement, see the section
“STORE Statement” on page 515 in Chapter 19, “Shared Concepts and Topics.”
WEIGHT Statement
WEIGHT variable ;
The WEIGHT statement identifies a variable in the input data set to be used as the exponential family
dispersion parameter weight for each observation. The exponential family dispersion parameter is divided by
the WEIGHT variable value for each observation.
The WEIGHT variable value does not have to be an integer; if the value is less than or equal to 0 or if it is
missing, the corresponding observation is not used.
where $A_i$ is an $n_i \times n_i$ diagonal matrix whose jth diagonal element is $v(\mu_{ij})$ and $W_i$ is an $n_i \times n_i$ diagonal
matrix whose jth diagonal is $w_{ij}$, where $w_{ij}$ is a weight variable that is specified in the WEIGHT statement.
If there is no WEIGHT statement, $w_{ij} = 1$ for all i and j. If $R_i(\alpha)$ is the true correlation matrix of $Y_i$, then
$V_i$ is the true covariance matrix of $Y_i$.
In practice, the working correlation matrix is usually unknown and must be estimated. It is estimated in the
iterative fitting process by using the current value of the parameter vector $\beta$ to compute appropriate functions
of the Pearson residual:
$$e_{ij} = \frac{y_{ij} - \mu_{ij}}{\sqrt{v(\mu_{ij}) / w_{ij}}}$$
If you specify the working correlation matrix as $R_0 = I$, which is the identity matrix, the GEE reduces to the
independence estimating equation.
Table 45.13 shows the working correlation structures that are supported by the GEE procedure and the
estimators that are used to estimate the working correlations.
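Structure: m-dependent
Working correlation:
$$\mathrm{Corr}(Y_{ij}, Y_{i,j+t}) = \begin{cases} 1 & t = 0 \\ \alpha_t & t = 1, 2, \ldots, m \\ 0 & t > m \end{cases}$$
Estimator:
$$\hat\alpha_t = \frac{1}{(K_t - p)\phi} \sum_{i=1}^{K} \sum_{j \le n_i - t} e_{ij} e_{i,j+t}, \qquad K_t = \sum_{i=1}^{K} (n_i - t)$$

Structure: Exchangeable
Working correlation:
$$\mathrm{Corr}(Y_{ij}, Y_{ik}) = \begin{cases} 1 & j = k \\ \alpha & j \ne k \end{cases}$$
Estimator:
$$\hat\alpha = \frac{1}{(N^* - p)\phi} \sum_{i=1}^{K} \sum_{j<k} e_{ij} e_{ik}, \qquad N^* = 0.5 \sum_{i=1}^{K} n_i (n_i - 1)$$

Structure: Unstructured
Working correlation:
$$\mathrm{Corr}(Y_{ij}, Y_{ik}) = \begin{cases} 1 & j = k \\ \alpha_{jk} & j \ne k \end{cases}$$
Estimator:
$$\hat\alpha_{jk} = \frac{1}{(K - p)\phi} \sum_{i=1}^{K} e_{ij} e_{ik}$$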
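The exchangeable moment estimator in Table 45.13 can be illustrated numerically. The following Python sketch is not part of PROC GEE; the function name and the toy residuals are hypothetical, and the dispersion `phi` is passed in explicitly as an assumption about how the denominator is scaled (use `phi=1.0` to leave the residual cross products unscaled).

```python
def exchangeable_alpha(residuals, p, phi=1.0):
    """Moment estimate of the common within-cluster correlation.

    residuals: list of clusters, each a list of Pearson residuals e_ij
    p:         number of regression parameters
    phi:       dispersion parameter (assumed known here)
    """
    cross_products = 0.0
    n_pairs = 0.0  # N* = 0.5 * sum of n_i * (n_i - 1)
    for e in residuals:
        n = len(e)
        n_pairs += 0.5 * n * (n - 1)
        for j in range(n):
            for k in range(j + 1, n):  # all pairs with j < k
                cross_products += e[j] * e[k]
    return cross_products / ((n_pairs - p) * phi)


# Three clusters of three Pearson residuals each (made-up values).
clusters = [[1.0, 0.5, 0.5], [-1.0, -0.5, -0.5], [0.5, 1.0, 1.0]]
alpha_hat = exchangeable_alpha(clusters, p=1, phi=1.0)
print(alpha_hat)
```

Here the cross products sum to 4.5 over N* = 9 pairs, so with p = 1 the estimate is 4.5 / 8.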
Dispersion Parameter
The dispersion parameter $\phi$ is estimated by
$$\hat\phi = \frac{1}{N - p} \sum_{i=1}^{K} \sum_{j=1}^{n_i} e_{ij}^2$$
where $N = \sum_{i=1}^{K} n_i$ is the total number of measurements and p is the number of regression parameters.
The square root of $\hat\phi$ is reported by PROC GEE as the scale parameter in the “Parameter Estimates for
Response Model with Model-Based Standard Error” output table. If a fixed scale parameter is specified
by using the NOSCALE option in the MODEL statement, then the fixed value is used in estimating the
model-based covariance matrix and standard errors.
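As a worked illustration of the Pearson residual and the dispersion estimate, the following Python sketch computes both for a tiny data set. It is an illustration only: the helper names are hypothetical, the data are made up, and a Poisson-type variance function $v(\mu) = \mu$ with unit weights is assumed.

```python
import math


def pearson_residual(y, mu, variance, weight=1.0):
    """e_ij = (y_ij - mu_ij) / sqrt(v(mu_ij) / w_ij)."""
    return (y - mu) / math.sqrt(variance / weight)


def dispersion(residuals, p):
    """phi_hat = sum of squared Pearson residuals / (N - p)."""
    n_total = sum(len(e) for e in residuals)  # N = sum of n_i
    sum_sq = sum(r * r for e in residuals for r in e)
    return sum_sq / (n_total - p)


# Two clusters with two measurements each; v(mu) = mu is assumed.
e = [
    [pearson_residual(6, 4.0, 4.0), pearson_residual(2, 4.0, 4.0)],
    [pearson_residual(12, 9.0, 9.0), pearson_residual(9, 9.0, 9.0)],
]
phi_hat = dispersion(e, p=2)
scale = math.sqrt(phi_hat)  # the quantity reported as the scale parameter
print(phi_hat, scale)
```

The residuals are 1, -1, 1, and 0, so the sum of squares is 3 and, with N = 4 and p = 2, the dispersion estimate is 1.5.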
where the quasi-likelihood contribution of the jth observation in the ith cluster is defined in the section
“Quasi-likelihood Functions” on page 3128 and $\hat\beta(R)$ are the parameter estimates that are obtained by using
the GEE approach with the working correlation of interest R.
QIC is defined as
$$\mathrm{QIC}(R) = -2 Q(\hat\beta(R), \phi) + 2\,\mathrm{trace}(\hat\Omega_I \hat V_R)$$
where $\hat V_R$ is the robust covariance estimate and $\hat\Omega_I$ is the inverse of the model-based covariance estimate
under the independent working correlation assumption, evaluated at $\hat\beta(R)$, which are the parameter estimates
that are obtained by using the GEE approach with the working correlation of interest R.
PROC GEE also computes an approximation to QIC(R), which is defined by Pan (2001) as
$$\mathrm{QIC}_u(R) = -2 Q(\hat\beta(R), \phi) + 2p$$
where p is the number of regression parameters.
Pan (2001) notes that QIC is appropriate for selecting regression models and working correlations, whereas
QICu is appropriate only for selecting regression models.
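The two criteria differ only in the penalty term, which the following Python sketch makes explicit. The function names and the numeric inputs are hypothetical; the quasi-likelihood value and the two matrices are assumed to have been computed elsewhere.

```python
def trace_of_product(a, b):
    """trace(A B) for square matrices stored as lists of rows."""
    size = len(a)
    return sum(a[i][k] * b[k][i] for i in range(size) for k in range(size))


def qic(quasi_lik, omega_i, v_robust):
    """QIC(R) = -2 Q + 2 trace(Omega_I V_R)."""
    return -2.0 * quasi_lik + 2.0 * trace_of_product(omega_i, v_robust)


def qic_u(quasi_lik, p):
    """QICu(R) = -2 Q + 2 p."""
    return -2.0 * quasi_lik + 2.0 * p


# Made-up inputs for a model with p = 2 regression parameters.
Q = -120.0
omega_i = [[50.0, 0.0], [0.0, 20.0]]    # inverse model-based covariance (independence)
v_robust = [[0.03, 0.0], [0.0, 0.04]]   # robust covariance estimate
qic_value = qic(Q, omega_i, v_robust)
qic_u_value = qic_u(Q, p=2)
print(qic_value, qic_u_value)
```

When the robust and model-based covariances agree, the trace is close to p and the two criteria nearly coincide; here the trace is 2.3 rather than 2.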
Quasi-likelihood Functions
See McCullagh and Nelder (1989) and Hardin and Hilbe (2003) for discussions of quasi-likelihood functions.
The contribution of observation j in cluster i to the quasi-likelihood function that is evaluated at the regression
parameters $\beta$ is expressed by $Q(\beta, \phi; (Y_{ij}, X_{ij})) = Q_{ij} / \phi$, where $Q_{ij}$ is defined in the following list. These
definitions are used in the computation of the quasi-likelihood information criteria (QIC) for goodness of
fit of models that are fit by the GEE approach. The $w_{ij}$ are prior weights, if any, that are specified in the
WEIGHT or FREQ statement. Note that the definition of the quasi-likelihood for the negative binomial differs
from that given in McCullagh and Nelder (1989). The definition used here allows the negative binomial
quasi-likelihood to approach the Poisson as $k \to 0$.
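Normal:
$$Q_{ij} = -\frac{1}{2} w_{ij} (y_{ij} - \mu_{ij})^2$$
Inverse Gaussian:
$$Q_{ij} = \frac{w_{ij} (\mu_{ij} - 0.5\, y_{ij})}{\mu_{ij}^2}$$
Gamma:
$$Q_{ij} = -w_{ij} \left[ \frac{y_{ij}}{\mu_{ij}} + \log(\mu_{ij}) \right]$$
Negative binomial:
$$Q_{ij} = w_{ij} \left[ \log\Gamma\left(y_{ij} + \frac{1}{k}\right) - \log\Gamma\left(\frac{1}{k}\right) + y_{ij} \log\left(\frac{k\mu_{ij}}{1 + k\mu_{ij}}\right) + \frac{1}{k} \log\left(\frac{1}{1 + k\mu_{ij}}\right) \right]$$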
Poisson:
Binomial:
Multinomial (s categories):
$$Q_{ij} = w_{ij} \sum_{k=1}^{s} y_{ijk} \log(\mu_{ijk})$$
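A few of the contributions listed above can be evaluated directly. The following Python sketch covers the normal, gamma, and multinomial cases and sums them into $Q = \sum Q_{ij} / \phi$. The function names are hypothetical, and the signs follow the convention that larger values of Q indicate better fit.

```python
import math


def q_normal(y, mu, w=1.0):
    """Normal: -(1/2) w (y - mu)^2."""
    return -0.5 * w * (y - mu) ** 2


def q_gamma(y, mu, w=1.0):
    """Gamma: -w [y / mu + log(mu)]."""
    return -w * (y / mu + math.log(mu))


def q_multinomial(y, mu, w=1.0):
    """Multinomial: w * sum over categories of y_k log(mu_k)."""
    return w * sum(yk * math.log(mk) for yk, mk in zip(y, mu))


def quasi_likelihood(contributions, phi=1.0):
    """Q = (sum of Q_ij) / phi."""
    return sum(contributions) / phi


# Made-up observations: (y, mu) pairs for a normal model.
normal_terms = [q_normal(3.0, 1.0), q_normal(2.0, 2.0)]
total = quasi_likelihood(normal_terms, phi=2.0)
print(total, q_gamma(2.0, 1.0), q_multinomial([0, 1, 0], [0.2, 0.5, 0.3]))
```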
The joint probability in the numerator satisfies the following bounds, by elementary properties of probability,
because $\mu_{ij} = \Pr(Y_{ij} = 1)$:
Therefore, the correlation is constrained to be within limits that depend in a complicated way on the means
of the data.
The odds ratio, defined as
$$\mathrm{OR}(Y_{ij}, Y_{ik}) = \frac{\Pr(Y_{ij} = 1, Y_{ik} = 1)\,\Pr(Y_{ij} = 0, Y_{ik} = 0)}{\Pr(Y_{ij} = 1, Y_{ik} = 0)\,\Pr(Y_{ij} = 0, Y_{ik} = 1)}$$
is not constrained by the means and is preferred, in some cases, to correlations for binary data.
The ALR algorithm seeks to model the logarithm of the odds ratio,
$\gamma_{ijk} = \log(\mathrm{OR}(Y_{ij}, Y_{ik}))$, as
$$\gamma_{ijk} = z_{ijk}'\alpha$$
where $\alpha$ is a $q \times 1$ vector of regression parameters and $z_{ijk}$ is a fixed, specified vector of coefficients.
The parameter $\gamma_{ijk}$ can take any value in $(-\infty, \infty)$, with $\gamma_{ijk} = 0$ corresponding to no association.
The log odds ratio, when modeled in this way with a regression model, can take different values in subgroups
defined by zij k . For example, zij k can define subgroups within clusters, or it can define “block effects”
between clusters.
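To make the odds ratio and its regression model concrete, the following Python sketch computes the log odds ratio from the four joint probabilities of a pair of binary responses and evaluates the linear model $z'\alpha$ for one pair. The function names and numbers are hypothetical and serve only as an illustration.

```python
import math


def log_odds_ratio(p11, p10, p01, p00):
    """log of Pr(1,1) Pr(0,0) / (Pr(1,0) Pr(0,1)) for one pair of responses."""
    return math.log((p11 * p00) / (p10 * p01))


def modeled_log_odds_ratio(z, alpha):
    """Linear model for the log odds ratio: z' alpha."""
    return sum(zk * ak for zk, ak in zip(z, alpha))


# Independent responses with means 0.5 and 0.2: the joint probabilities
# are products of the margins, so the log odds ratio is 0.
independent = log_odds_ratio(0.5 * 0.2, 0.5 * 0.8, 0.5 * 0.2, 0.5 * 0.8)

# Strong positive association: odds ratio (0.4 * 0.4) / (0.1 * 0.1) = 16.
associated = log_odds_ratio(0.4, 0.1, 0.1, 0.4)

# A pair whose z row is (1, 1), with made-up parameters alpha.
gamma = modeled_log_odds_ratio([1, 1], [0.5, 0.25])
print(independent, associated, gamma)
```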
You specify a GEE model for binary data that uses log odds ratios by specifying a model for the mean, as
in ordinary GEEs, and by specifying a model for the log odds ratios. You can use any of the link functions
appropriate for binary data in the model for the mean, such as logistic, probit, or complementary log-log.
where $\beta_1, \beta_2, \ldots, \beta_{C-1}$ are increasing intercept terms that depend only on the level c. Let the binary
vector that represents the responses of the ith subject be $Y_i = (Y_{i1}, \ldots, Y_{in_i})'$ with corresponding means
$\mu_i = (\mu_{i1}, \ldots, \mu_{in_i})'$.
The log odds ratio between two indicator variables $Y_{ijc_1}$ and $Y_{ikc_2}$ is modeled as
for $q \times 1$ regression parameters $\alpha$ and fixed coefficients $z_{i(jk)(c_1 c_2)}$. As in Carey, Zeger, and Diggle
(1993), $\alpha$ then provides a vector of regression parameters in a logistic model for the conditional expectation
$\zeta_{i(jk)(c_1 c_2)} = E(Y_{ijc_1} \mid Y_{ikc_2})$. To estimate $\alpha$, the conditional expectation is considered for all pairs $Y_{ijc_1}$
and $Y_{ikc_2}$ with $j < k$. Let
$$\zeta_{i(jk)} = \left( \zeta_{i(jk)(11)}, \zeta_{i(jk)(12)}, \ldots, \zeta_{i(jk)(21)}, \ldots, \zeta_{i(jk)(C-1,C-1)} \right)'$$
$$\zeta_i = \left( \zeta_{i(12)}, \zeta_{i(13)}, \ldots, \zeta_{i(23)}, \ldots, \zeta_{i(n_i - 1\, n_i)} \right)'$$
$$Y_i^* = \Big[ \underbrace{Y_{i1} \otimes e_{C-1}, \ldots, Y_{i1} \otimes e_{C-1}}_{n_i - 1},\; \underbrace{Y_{i2} \otimes e_{C-1}, \ldots, Y_{i2} \otimes e_{C-1}}_{n_i - 2},\; \ldots,\; \underbrace{Y_{i\,n_i - 1} \otimes e_{C-1}}_{1} \Big]'$$
where $\otimes$ denotes the Kronecker product and $e_l$ denotes a vector of dimension l composed of ones. The
difference $Y_i^* - \zeta_i$ represents the residuals of the model for the conditional expectation.
For both binary and multinomial data, the ALR estimates for $\beta$ and $\alpha$ are the simultaneous solutions to the
estimating equations
$$S_1(\beta, \alpha) = \sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta} V_{i11}^{-1} \left( Y_i - \mu_i(\beta) \right) = 0$$
$$S_2(\beta, \alpha) = \sum_{i=1}^{K} \frac{\partial \zeta_i'}{\partial \alpha} V_{i33}^{-1} \left( Y_i^* - \zeta_i \right) = 0$$
where $V_{i11} = \mathrm{cov}(Y_i)$ and $V_{i33} = \mathrm{diag}[\zeta_i (1 - \zeta_i)]$. The fitting algorithm alternates between a GEE
step to update the model for the mean and a logistic regression step to update the log odds ratio model.
Upon convergence, the ALR algorithm provides estimates of the regression parameters for the mean, $\beta$; the
regression parameters for the log odds ratios, $\alpha$; their standard errors; and their covariances.
EXCH specifies exchangeable log odds ratios. In this model, the log odds ratio is a
constant for all clusters i and pairs $(j, k)$. The parameter $\alpha$ is the common log
odds ratio.
FULLCLUST specifies fully parameterized clusters. Each cluster is parameterized in the same
way, and there is a parameter for each unique pair within clusters. If a complete
cluster is of size n, then there are $n(n-1)/2$ parameters in the vector $\alpha$. For example,
if a full cluster is of size 4, then there are $\frac{4 \times 3}{2} = 6$ parameters, and the z matrix
is of the form
$$Z = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}$$
The elements of $\alpha$ correspond to log odds ratios for cluster pairs in the following
order:
Pair Parameter
(1,2) Alpha1
(1,3) Alpha2
(1,4) Alpha3
(2,3) Alpha4
(2,4) Alpha5
(3,4) Alpha6
LOGORVAR(variable) specifies log odds ratios by cluster. The argument variable is a variable name
that defines the “block effects” between clusters. The log odds ratios are constant
within clusters, but they take a different value for each different value of
the variable. For example, if Center is a variable in the input data set that
takes a different value for k treatment centers, then when you specify
LOGOR=LOGORVAR(Center), you get a model that has different log odds ratios
for each of the k centers, constant within center.
NESTK specifies k-nested log odds ratios. You must also specify the
SUBCLUST=variable option to define subclusters within clusters. Within each
cluster, PROC GEE computes a log odds ratio parameter for pairs that have
the same value of variable for both members of the pair and one log odds ratio
parameter for each unique combination of different values of variable.
NEST1 specifies 1-nested log odds ratios. You must also specify the
SUBCLUST=variable option to define subclusters within clusters. There are
two log odds ratio parameters for this model. Pairs that have the same value of
variable correspond to one parameter; pairs that have different values of variable
correspond to the other parameter. For example, if patients are clustered by
hospital and subclusters are the wards within those hospitals, then the outcomes
of patients within the same ward have one log odds ratio parameter, and the
outcomes of patients from different wards have the other parameter.
ZFULL specifies the full z matrix. You must also specify a SAS data set that contains
the z matrix by using the ZDATA=data-set-name option. Each observation
in the data set corresponds to one row of the z matrix. You must specify the
ZDATA data set as if all clusters are complete (that is, as if all clusters are
the same size and there are no missing observations). The ZDATA data set
has $K[n_{max}(n_{max} - 1)/2]$ observations, where K is the number of clusters and
$n_{max}$ is the maximum cluster size. If the members of cluster i are ordered
as $1, 2, \ldots, n$, then the rows of the z matrix must be specified for pairs in the
order $(1,2), (1,3), \ldots, (1,n), (2,3), \ldots, (2,n), \ldots, (n-1,n)$. The variables
that you specify in the REPEATED statement for the SUBJECT effect must
also be present in the ZDATA= data set to identify clusters. You must specify
variables in the data set that define the columns of the z matrix by using the
ZROW=variable-list option. If there are q columns (q variables in variable-list ),
then there are q log odds ratio parameters. You can optionally specify variables
that indicate the cluster pairs corresponding to each row of the z matrix by using
the YPAIR=(variable1, variable2 ) option. If you specify this option, the data from
the ZDATA data set are sorted within each cluster by variable1 and variable2 .
See Example 45.4 for an example of specifying a full z matrix.
ZREP specifies a replicated z matrix. You specify z matrix data exactly as you do
for the ZFULL option case, except that you specify only one complete cluster.
The z matrix for the one cluster is replicated for each cluster. The number of
observations in the ZDATA data set is $n_{max}(n_{max} - 1)/2$, where $n_{max}$ is the size of a
complete cluster (a cluster with no missing observations).
ZREP(matrix) specifies direct input of the replicated z matrix. You specify the z matrix for one
cluster by using the syntax LOGOR=ZREP( $(y_j\; y_k)\; z_{jk1}\; z_{jk2} \ldots z_{jkq}, \ldots$ ),
where $y_j$ and $y_k$ are numbers that represent a pair of observations from the ith
cluster and the values $z_{jk1}, z_{jk2}, \ldots, z_{jkq}$ make up the corresponding row $z_{ijk}$
of the z matrix. The number of specified rows is $n_{max}(n_{max} - 1)/2$, where $n_{max}$ is the
size of a complete cluster (a cluster with no missing observations). For example,
logor = zrep((1 2) 1 0,
(1 3) 1 0,
(1 4) 1 0,
(2 3) 1 1,
(2 4) 1 1,
(3 4) 1 1)
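The pair ordering and row counts that the ZFULL, ZREP, and FULLCLUST keywords rely on can be checked with a short script. The following Python sketch is an illustration only (the function names are hypothetical): it enumerates pairs in the order $(1,2), (1,3), \ldots, (n-1,n)$ and builds the identity-like z matrix of a fully parameterized cluster.

```python
def cluster_pairs(n):
    """Pairs (j, k) with j < k, in the row order required for the z matrix."""
    return [(j, k) for j in range(1, n + 1) for k in range(j + 1, n + 1)]


def fullclust_z(n):
    """z matrix with one log odds ratio parameter per unique pair."""
    q = len(cluster_pairs(n))  # q = n (n - 1) / 2
    return [[1 if col == row else 0 for col in range(q)] for row in range(q)]


pairs = cluster_pairs(4)
z = fullclust_z(4)
for pair, row in zip(pairs, z):
    print(pair, row)
```

For a complete cluster of size 4 this prints six rows, one per pair, which matches the number of rows in the ZREP example above.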
The mechanism for missingness can be described by a statistical model for the probability of observing
a missing value, and making the right assumption about the mechanism is crucial to methods that handle
missing data. Missingness mechanisms are classified into three types: missing completely at random
(MCAR), missing at random (MAR), and missing not at random (MNAR) (Rubin 1976).
Assumptions about longitudinal data that include missing responses caused by dropouts are classified as
follows:
The data are said to be MCAR if the probability of a missing response is independent of its past, current,
and future responses conditional on the covariates. That is, $P(r_{ij} = 0 \mid Y_i, X_i) = P(r_{ij} = 0 \mid X_i)$.
The data are said to be MAR if the probability of a missing response is independent of its current
and future responses conditional on the observed past responses and the covariates. That is,
$P(r_{ij} = 0 \mid r_{ij-1} = 1, X_i, Y_i) = P(r_{ij} = 0 \mid r_{ij-1} = 1, X_i, y_{i1}, \ldots, y_{ij-1})$. MAR is a weaker assumption than
MCAR.
The data are said to be MNAR if the probability of a missing response depends on the unobserved
responses. MNAR is the most general and the most problematic missing-data scenario.
The GEE procedure implements two different weighted methods (observation-specific and subject-specific)
of estimating the regression parameter $\beta$ when dropouts occur. Both methods provide consistent estimates
if the data are MAR. The weighted GEE methods are not supported for the multinomial distribution for
polytomous responses.
Unlike the standard generalized estimating equations, the weighted generalized estimating equations are
unbiased when the observations are appropriately weighted and lead to consistent estimates of $\beta$.
The weights $w_{ij}$ are often unknown in practice and are estimated by a logistic regression model under the
MAR assumption. Specifically, suppose that $\lambda_{ij} = P(r_{ij} = 1 \mid r_{ij-1} = 1, X_i, Y_i)$ denotes the probability of
observing the response $y_{ij}$ given its observed previous responses.
Under the MAR assumption,
Using the observed data, $\lambda_{ij}$ can be predicted from a logistic regression model,
$$\mathrm{logit}\{\lambda_{ij}\} = z_{ij}\alpha$$
where the $z_{ij}$ are predictors that usually include the covariates $x_{ij}$, the past responses, and the indicators for
visit times. The dropout process implies that the estimated probability of observing $y_{ij}$ can be expressed as a
cumulative product of conditional probabilities:
With the estimated weights $\hat w_{ij} = \hat P(r_{ij} = 1 \mid X_i, Y_i)^{-1}$, the regression parameter $\beta$ is estimated by solving
the equation for $S_{ow}(\beta)$ after plugging in the estimated weights.
The fitting algorithm is described in the section “Fitting Algorithm for Weighted GEE” on page 3135.
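The observation-specific weights can be sketched as inverse cumulative products. In the following Python illustration, the function name is hypothetical and the conditional probabilities of remaining in the study are made-up inputs; in practice they would be the fitted values from the logistic dropout model. The first probability is 1 because the first visit is assumed to be always observed.

```python
def observation_weights(cond_probs):
    """Inverse-probability weights for one subject.

    cond_probs[j] is the estimated conditional probability that visit j + 1
    is observed given that the previous visit was observed. The probability
    of observing a visit is the cumulative product up to that visit, and the
    weight is its inverse.
    """
    weights = []
    cumulative = 1.0
    for lam in cond_probs:
        cumulative *= lam
        weights.append(1.0 / cumulative)
    return weights


# Made-up conditional probabilities for three visits.
w = observation_weights([1.0, 0.8, 0.5])
print(w)
```

Later visits receive larger weights because fewer subjects like this one are expected to remain in the study.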
where the responses for the ith subject are $Y_i = (y_{i1}, y_{i2}, \ldots, y_{in_i})'$ and the weight $w_i$ for subject i is the
inverse probability of subject i dropping out at the observed time (Fitzmaurice, Molenberghs, and Lipsitz
1995; Preisser, Lohman, and Rathouz 2002). Note that the weight $w_i$ is a scalar, in contrast to the weight
matrix $W_i$ that the observation-specific weighted GEE method uses.
The subject-specific weighted estimating equations are also unbiased when the subjects are appropriately
weighted and lead to consistent estimates of the regression parameters $\beta$.
The weight $w_i$ is usually unknown in practice and needs to be estimated. Suppose subject i drops out at time
$m_i = \sum_{j=1}^{T} r_{ij} + 1$. Assume that the first visit $y_{i1}$ is always observed with $r_{i1} = 1$. Thus, the dropout
times $m_i$ range from 2 to T+1. Note that a dropout time of T+1 indicates that subject i completes all the T
visits and dropout does not occur.
The weight $w_i$ is defined as follows: if subject i drops out before completing the last visit (that is, $m_i \le T$),
then $w_i = P(r_{im_i} = 0, r_{im_i - 1} = 1 \mid X_i, Y_i)^{-1}$; otherwise, the subject completes all the T visits (that is,
$m_i = T + 1$), and $w_i = P(r_{iT} = 1 \mid X_i, Y_i)^{-1}$.
Similar to the process for the observation-specific weighted method, the dropout process for the subject-
specific weighted method implies that subject-specific weights can be estimated as a cumulative product of
conditional probabilities:
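$$\hat w_i = P(r_{im_i} = 0, r_{im_i - 1} = 1 \mid X_i, Y_i)^{-1} = \left[ \lambda_{i1}(\hat\alpha) \times \cdots \times \lambda_{im_i - 1}(\hat\alpha) \left( 1 - \lambda_{im_i}(\hat\alpha) \right) \right]^{-1}, \quad \text{if } m_i \le T$$
$$\hat w_i = P(r_{im_i - 1} = 1 \mid X_i, Y_i)^{-1} = \left[ \lambda_{i1}(\hat\alpha) \times \lambda_{i2}(\hat\alpha) \times \cdots \times \lambda_{im_i - 1}(\hat\alpha) \right]^{-1}, \quad \text{if } m_i = T + 1$$
Thus, the subject-specific weights $\hat w_i$ can be obtained after $\lambda_{ij}$ is estimated by fitting a logistic regression to
the data $(r_{ij}, z_{ij})$.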
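The two cases of the subject-specific weight can be sketched in a few lines. The following Python illustration is not PROC GEE's implementation: the function names are hypothetical, the conditional probabilities are made-up inputs that would normally come from the fitted dropout model, and the response indicators are assumed to follow a dropout (monotone) pattern.

```python
def dropout_time(r):
    """m_i = (number of observed visits) + 1; T + 1 means no dropout."""
    return sum(r) + 1


def subject_weight(cond_probs, r):
    """Inverse probability of the subject's observed dropout pattern.

    cond_probs[j - 1] is the conditional probability of observing visit j
    given that visit j - 1 was observed; r[j - 1] is 1 if visit j was observed.
    """
    if r[0] != 1 or r != sorted(r, reverse=True):
        raise ValueError("r must start with 1 and follow a dropout pattern")
    total_visits = len(r)
    m = dropout_time(r)
    prob = 1.0
    for lam in cond_probs[: m - 1]:       # visits 1, ..., m - 1 were observed
        prob *= lam
    if m <= total_visits:                 # dropped out at visit m
        prob *= 1.0 - cond_probs[m - 1]
    return 1.0 / prob


lam = [1.0, 0.8, 0.75]                    # made-up conditional probabilities, T = 3
w_dropout = subject_weight(lam, [1, 1, 0])    # m = 3: 1 / (1.0 * 0.8 * 0.25)
w_complete = subject_weight(lam, [1, 1, 1])   # m = 4: 1 / (1.0 * 0.8 * 0.75)
print(w_dropout, w_complete)
```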
The regression parameter $\beta$ from the subject-specific weighted GEE method can be estimated by solving for
$S_{sw}(\beta)$ after plugging in the estimated weights. The fitting algorithm is described in the section “Fitting
Algorithm for Weighted GEE” on page 3135. The subject-specific weighting scheme was originally developed
for computational convenience. Preisser, Lohman, and Rathouz (2002) showed that the observation-level
weighted GEE method produces more efficient estimates than the cluster-level weighted GEE method for
incomplete longitudinal binary data.
1. Fit a logistic regression to the data $(r_{ij}, z_{ij})$ to obtain an estimate of $\alpha$ and estimate the weights.
2. Compute an initial estimate of $\beta$ by using an ordinary generalized linear model, assuming independence
of the responses.
3. Compute the working correlation matrix R based on the standardized residuals, the current estimate of
$\beta$, and the specified structure of R.
4. Compute the estimated covariance matrix:
$$V_i = \phi A_i^{1/2} \hat R(\alpha) A_i^{1/2}$$
5. Update $\hat\beta$:
$$\hat\beta_{r+1} = \hat\beta_r + \left[ \sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta} V_i^{-1} \frac{\partial \mu_i}{\partial \beta} \right]^{-1} \left[ \sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta} V_i^{-1} W_i (Y_i - \mu_i) \right]$$
For the observation-specific weighted method, $Y_i = (y_{i1}, y_{i2}, \ldots, y_{iT})'$; $\mu_i$ and $V_i$ are its
corresponding mean vector and working covariance matrix, respectively; and $W_i$ is a $T \times T$
diagonal matrix whose jth diagonal is $r_{ij} \hat w_{ij}$.
For the subject-specific weighted method, $Y_i = (y_{i1}, y_{i2}, \ldots, y_{in_i})'$; $\mu_i$ and $V_i$ are its
corresponding mean vector and working covariance matrix, respectively; and $W_i$ is an $n_i \times n_i$ diagonal
matrix whose jth diagonal is $\hat w_i$.
Note that you can use the WEIGHT statement in the GENMOD procedure to perform a two-stage strategy
that is often used in practice to obtain the weighted GEE estimates. You fit a logistic regression to the data
$(r_{ij}, z_{ij})$ to obtain the weights as described in the preceding steps. Then you estimate $\beta$ by specifying the
estimated weights in the WEIGHT statement in PROC GENMOD for the GEE analysis. For the subject-
specific weighted GEE method, this approach is appropriate for any working correlation structure. However,
for the observation-specific weighted method, this approach is appropriate only for the independent working
correlation structure.
The two-stage approach results in standard errors that are larger than those that are produced by using
the MISSMODEL statement in the GEE procedure (because PROC GENMOD treats the weights as fixed
and known). Thus, the two-stage approach that uses PROC GENMOD results in conservative inference
(Fitzmaurice, Laird, and Ware 2011). The GEE procedure computes the parameter estimate covariances as
described in Fitzmaurice, Laird, and Ware (2011) and Preisser, Lohman, and Rathouz (2002).
Missing Data
Suppose that each subject in a longitudinal study is measured at T times. In other words, for the ith subject
you measure T responses $(y_{i1}, y_{i2}, \ldots, y_{iT})$ and T corresponding covariates $(x_{i1}, x_{i2}, \ldots, x_{iT})$.
By default, the GEE procedure handles missing data in the same manner as the standard GEE method in the
GENMOD procedure. The working correlation matrix is estimated from data that contain both intermittent
and dropout types of missing values by using the all-available-pairs method, in which all nonmissing pairs of
data are used in the moment estimators. The resulting covariances and standard errors are valid under the
missing completely at random (MCAR) assumption. For more information, see the section “Missing Data”
on page 3272 in Chapter 46, “The GENMOD Procedure.”
When you specify the MISSMODEL statement in the GEE procedure to use the weighted GEE method to
analyze the data, the procedure uses observations that have missing values in the response, provided that the
missing values for all subjects are caused by dropouts. If the missing values are intermittent for any of the
subjects, then the weighted GEE method does not apply and the procedure terminates.
For the observation-specific weighted GEE method, the covariates for all the observations for a subject must
be observed, regardless of whether the response is missing. For each subject, the input data set must provide
T observations.
For the subject-specific weighted GEE method, the covariates for a subject who drops out at time k must
be observed for the observations up to and including time k. The input data set must provide at least k
observations for this subject. The covariates must be observed for all observations on a subject who completes
the study, and the input data set must provide T observations for this subject.
For more information about how weighted GEE methods handle missing values, see Fitzmaurice, Laird, and
Ware (2011) and Preisser, Lohman, and Rathouz (2002).
Type 3 Analysis
A Type 3 analysis is similar to the Type 3 sums of squares used in PROC GLM, except that generalized
score tests for Type 3 contrasts instead of Type 3 sums of squares are computed. Briefly, a Type 3 estimable
function (contrast) for an effect is a linear function of the model parameters that involves the parameters of
the effect and any interactions with that effect. A test of the hypothesis that the Type 3 contrast for a main
effect is equal to 0 is intended to test the significance of the main effect in the presence of interactions. For
more information about Type 3 estimable functions, see Chapter 48, “The GLM Procedure,” and Chapter 15,
“The Four Types of Estimable Functions.” Also see Littell, Freund, and Spector (1991).
Boos (1992) and Rotnitzky and Jewell (1990) describe score tests applicable to testing $L'\beta = 0$ in GEEs,
where $L'$ is a user-specified $r \times p$ contrast matrix or a contrast for a Type 3 test of hypothesis.
Let $\tilde\beta$ be the regression parameters that result from solving the GEE under the restricted model $L'\beta = 0$, and
let $S(\tilde\beta)$ be the generalized estimating equation values at $\tilde\beta$. The generalized score statistic is
$$T = S(\tilde\beta)' \Sigma_m L (L' \Sigma_e L)^{-1} L' \Sigma_m S(\tilde\beta)$$
where $\Sigma_m$ is the model-based covariance estimate and $\Sigma_e$ is the empirical covariance estimate. The p-values
for T are computed based on the chi-square distribution with r degrees of freedom, where r is the rank of L.
A Type 3 analysis can consume considerable computation time because a constrained model is fitted for each
effect. Wald statistics for Type 3 contrasts are computed if you specify the WALD option. Wald statistics for
contrasts use less computation time than likelihood ratio statistics but might be less accurate indicators of the
significance of the effect of interest. The Wald statistic for testing $L'\beta = 0$ is defined by
$$S = (L'\hat\beta)' (L' \Sigma_e L)^{-1} (L'\hat\beta)$$
where L is the contrast matrix, $\hat\beta$ are the GEE parameter estimates, and $\Sigma_e$ is the empirical covariance
estimate. The asymptotic distribution of S is chi-square with r degrees of freedom, where r is the rank of L.
The results of this type of analysis do not depend on the order in which the terms are specified in the MODEL
statement. Type 3 analyses that use score statistics are not supported for nominal response data or weighted
GEE methods. Type 3 analyses can be conducted using the Wald statistics for all the models that the GEE
procedure supports.
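As a hedged illustration, Type 3 Wald tests might be requested as in the following sketch. It assumes a long-format data set such as Resp (constructed in Example 45.1) and an exchangeable working correlation; both choices are assumptions made for illustration, not part of this section.

```sas
/* Sketch only: the data set, effects, and CORR= choice are assumptions */
proc gee data=Resp descending;
   class ID Treatment Center Sex Baseline;
   /* TYPE3 requests Type 3 tests; WALD replaces score statistics with Wald statistics */
   model Outcome = Treatment Center Sex Age Baseline
         / dist=bin link=logit type3 wald;
   repeated subject=ID(Center) / corr=exch;
run;
```

Omitting the WALD option would leave the Type 3 tests based on score statistics, which refit a constrained model for each effect.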
ODS Graphics
Statistical procedures use ODS Graphics to create graphs as part of their output. ODS Graphics is described
in detail in Chapter 21, “Statistical Graphics Using ODS.”
Before you create graphs, ODS Graphics must be enabled (for example, by specifying the ODS GRAPHICS ON
statement). For more information about enabling and disabling ODS Graphics, see the section
“Enabling and Disabling ODS Graphics” on page 615 in Chapter 21, “Statistical Graphics Using ODS.”
The overall appearance of graphs is controlled by ODS styles. Styles and other aspects of using ODS
Graphics are discussed in the section “A Primer on ODS Statistical Graphics” on page 614 in Chapter 21,
“Statistical Graphics Using ODS.”
Example 45.1: Comparison of the Marginal and Random Effect Models for
Binary Data
A clinical trial (Stokes, Davis, and Koch 2012) was conducted to compare two treatments for a respiratory
illness. Patients in each of two centers were randomly assigned to two groups: one group received the active
treatment and one group received a placebo.
During treatment, respiratory status was determined for each of four visits and is represented by the variable
Outcome (coded here as 0 = poor, 1 = good). The variables Center, Treatment, Sex, and Baseline (baseline
respiratory status) are classification variables that have two levels. The variable Age (age at time of entry into
the study) is a continuous variable.
All 111 patients completed the study. That is, there are no missing data for responses or covariates. The
following statements create the data set Resp:
data Resp;
input Center ID Treatment $ Sex $ Age Baseline Visit1-Visit4;
datalines;
1 1 P M 46 0 0 0 0 0
1 2 P M 28 0 0 0 0 0
1 3 A M 23 1 1 1 1 1
1 4 P M 44 1 1 1 1 0
1 5 P F 13 1 1 1 1 1
1 6 A M 34 0 0 0 0 0
2 51 A M 43 1 1 1 1 0
2 52 A F 39 0 1 1 1 1
2 53 A M 68 0 1 1 1 1
2 54 A F 63 1 1 1 1 1
2 55 A M 31 1 1 1 1 1
;
/* Reshape from one row per patient to one row per patient visit */
data Resp;
   set Resp;
   Visit=1; Outcome=Visit1; output;
   Visit=2; Outcome=Visit2; output;
   Visit=3; Outcome=Visit3; output;
   Visit=4; Outcome=Visit4; output;
run;
Suppose $y_{ij}$ represents the respiratory status of patient $i$ at the $j$th visit, $j = 1, \ldots, 4$, and $\mu_{ij} = E(y_{ij})$
represents the mean of the respiratory status. Logistic regression is commonly used to analyze binary response
data. You can use the variance function for the binomial distribution, $v(\mu_{ij}) = \mu_{ij}(1 - \mu_{ij})$, and the logit
link function, $g(\mu_{ij}) = \log(\mu_{ij}/(1 - \mu_{ij}))$. The model for the mean is $g(\mu_{ij}) = x_{ij}'\beta$, where $\beta$ is a vector
of regression parameters to be estimated.
The following SAS statements perform the GEE model fit:
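The listing itself is not reproduced here. A minimal sketch of statements that would fit the model just described might look as follows; the exchangeable working correlation (CORR=EXCH) and the CORRW request are assumptions, whereas the binomial distribution, logit link, and covariates come from the text.

```sas
/* Sketch only: the working correlation type is an assumption */
proc gee data=Resp descending;          /* DESCENDING models Outcome=1 (good) */
   class ID Treatment Center Sex Baseline;
   model Outcome = Treatment Center Sex Age Baseline
         / dist=bin link=logit;
   /* Patient IDs are nested within centers, so the cluster is ID(Center) */
   repeated subject=ID(Center) / corr=exch corrw;
run;
```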
Model Information
Data Set              WORK.RESP
Distribution          Binomial
Link Function         Logit
Dependent Variable    Outcome
General information about the GEE analysis is displayed in Output 45.1.2, and model fit criteria for the
model are displayed in Output 45.1.3.
GEE Fit Criteria
QIC     512.5723
QICu    499.4873
The results of GEE model fitting are displayed in Output 45.1.4. If you specify no other options, the standard
errors, confidence intervals, Z scores, and p-values are based on empirical standard error estimates. You can
specify the MODELSE option in the REPEATED statement to create a table that is based on model-based
standard error estimates.
Output 45.1.4 Results of Model Fitting
Treatment and Baseline appear to be strongly influential, and Center might be marginally significant.
For comparison, a generalized linear mixed model is fitted to the data set to obtain subject-specific effects.
Specifically, consider the logistic regression model
$$g(E(y_{ij} \mid b_i)) = x_{ij}'\beta + b_i$$
where $g$ is the logit link and the random effect $b_i$ is normally distributed with zero mean and variance $\mathrm{Var}(b_i) = \sigma_b^2$.
The following statements use the GLIMMIX procedure to fit a generalized linear mixed model:
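The listing is not reproduced here. One way such a random-intercept logistic model could be specified is sketched below; the estimation method is left at the GLIMMIX default, which is an assumption, so the estimates need not match Output 45.1.5 exactly.

```sas
/* Sketch only: estimation method and option choices are assumptions */
proc glimmix data=Resp;
   class ID Treatment Center Sex Baseline;
   /* EVENT='1' models the probability of a good response */
   model Outcome(event='1') = Treatment Center Sex Age Baseline
         / dist=binary link=logit solution;
   /* One normally distributed random intercept per patient within center */
   random intercept / subject=ID(Center);
run;
```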
From Output 45.1.4 and Output 45.1.5, you can see that the parameter estimates from the marginal model
and the mixed-effects model differ. For example, the estimated treatment effects are 1.2654 and 1.4758 from
the marginal model and the mixed-effects model, respectively.
The interpretation of the model effects in the marginal and random-effects models differs. For example, the estimated
treatment effect from the marginal model indicates that, on average, the odds of a good response for the
patients are $e^{1.2654} = 3.5$ times as high when they receive the active treatment as when they receive the placebo. The
estimated treatment effect from the generalized linear mixed model indicates that an individual patient’s odds
of a good response are $e^{1.4758} = 4.4$ times as high when the patient receives the active treatment as when the patient
receives the placebo.
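The odds ratios quoted above are simply the exponentiated treatment estimates. The following DATA step is a small sketch of that arithmetic; the variable names are hypothetical.

```sas
/* Exponentiate the two treatment estimates to obtain odds ratios */
data _null_;
   or_marginal = exp(1.2654);   /* population-averaged (GEE) estimate */
   or_subject  = exp(1.4758);   /* subject-specific (mixed model) estimate */
   put or_marginal= 8.4 or_subject= 8.4;
run;
```

The two values written to the log round to the 3.5 and 4.4 cited in the text.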
The choice of the marginal model or a subject-specific model often depends on the goal of your analysis:
whether you are interested in population-averaged effects or subject-specific effects. For more information,
see Diggle et al. (2002); Fitzmaurice, Laird, and Ware (2011).
Example 45.2: Log-Linear Model for Count Data
during which patients received either a placebo or the drug progabide. The question of scientific interest is
whether progabide is effective in reducing the rate of epileptic seizures.
The following DATA step creates the data set Seizure:
data Seizure;
input ID Count Visit Trt Age Weeks;
datalines;
104 11 0 0 31 8
104 5 1 0 31 2
104 3 2 0 31 2
104 3 3 0 31 2
104 3 4 0 31 2
106 11 0 0 30 8
236 12 0 1 37 8
236 1 1 1 37 2
236 4 2 1 37 2
236 3 3 1 37 2
236 2 4 1 37 2
;
The following DATA step creates a log time interval variable for use as an offset and an indicator variable for
whether the observation is for a baseline measurement or a visit measurement. Patient 207 is deleted as an
outlier, as was done in the Diggle et al. (2002) analysis:
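data Seizure;
   set Seizure;
   if ID ne 207;               /* drop patient 207 (outlier) */
   if Visit = 0 then do;       /* baseline: eight-week period */
      X1 = 0;
      Ltime = log(8);
   end;
   else do;                    /* treatment visits: two-week periods */
      X1 = 1;
      Ltime = log(2);
   end;
run;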
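Poisson regression is commonly used to model count data. In this example, the log-linear Poisson model is
specified by $V(\mu) = \mu$ (the Poisson variance function) and a log link function,
$$\log(E(Y_{ij})) = \beta_0 + x_{i1}\beta_1 + x_{i2}\beta_2 + x_{i1}x_{i2}\beta_3 + \log(t_{ij})$$
where $Y_{ij}$ is the seizure count for patient $i$ in period $j$, $t_{ij}$ is the length of that period in weeks (its logarithm
is the offset variable Ltime), $x_{i1}$ is the period indicator X1 (0 for the baseline period, 1 for a treatment visit),
and $x_{i2}$ is the treatment indicator Trt (0 for placebo, 1 for progabide).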
Because the visits represent repeated measurements, the responses from the same individual are correlated
and inferences need to take this into account. The correlations between the counts are modeled as $r_{ij} = \alpha$,
$i \neq j$ (exchangeable correlations).
In this model, the regression parameters are interpreted in terms of the log seizure rate that is displayed in
Table 45.16.
The difference between the log seizure rates in the pretreatment (baseline) period and the treatment periods is
$\beta_1$ for the placebo group and $\beta_1 + \beta_3$ for the progabide group. A value of $\beta_3 < 0$ indicates a reduction in
the seizure rate.
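These differences follow directly from the model once the offset is moved to the left-hand side. The short derivation below is a sketch that assumes the indicator coding used in the DATA step above (X1 = 0 at baseline and 1 at a visit) and a treatment indicator equal to 1 for progabide.

```latex
\log\frac{E(Y_{ij})}{t_{ij}} = \beta_0 + x_{i1}\beta_1 + x_{i2}\beta_2 + x_{i1}x_{i2}\beta_3
% Substituting the four combinations of the two indicators:
\begin{array}{lll}
\text{placebo, baseline}   & (x_{i1}=0,\ x_{i2}=0): & \beta_0 \\
\text{placebo, visit}      & (x_{i1}=1,\ x_{i2}=0): & \beta_0 + \beta_1 \\
\text{progabide, baseline} & (x_{i1}=0,\ x_{i2}=1): & \beta_0 + \beta_2 \\
\text{progabide, visit}    & (x_{i1}=1,\ x_{i2}=1): & \beta_0 + \beta_1 + \beta_2 + \beta_3
\end{array}
% Visit minus baseline within each group:
(\beta_0 + \beta_1) - \beta_0 = \beta_1, \qquad
(\beta_0 + \beta_1 + \beta_2 + \beta_3) - (\beta_0 + \beta_2) = \beta_1 + \beta_3
```

The difference between the two groups in their change from baseline is therefore $\beta_3$, which is why its sign carries the treatment comparison.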
The following statements perform the analysis:
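The listing is not reproduced here. A sketch of statements consistent with the model described above might be the following; the exchangeable type and the COVB and CORRW requests are inferred from the surrounding discussion, and the exact option spelling should be treated as an assumption.

```sas
/* Sketch only: reconstructed from the surrounding description */
proc gee data=Seizure;
   class ID Visit;
   /* Log-linear Poisson model with log(period length) as the offset */
   model Count = X1 Trt X1*Trt
         / dist=poisson link=log offset=Ltime;
   /* Exchangeable working correlation; WITHIN= orders the repeated measures */
   repeated subject=ID / within=Visit type=exch covb corrw;
run;
```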
Model Information
Data Set              WORK.SEIZURE
Distribution          Poisson
Link Function         Log
Dependent Variable    Count
Offset Variable       Ltime
Output 45.2.2 displays general information about the GEE model analysis.
Output 45.2.3 displays the parameter estimate covariance matrices, which are requested by the COVB option.
Both model-based and empirical covariances are produced.
The exchangeable working correlation matrix is displayed in Output 45.2.4. It shows that there are noticeable
correlations among the respective visits.
The parameter estimates table, shown in Output 45.2.5, contains parameter estimates, standard errors,
confidence intervals, Z scores, and p-values for the parameter estimates. Empirical standard error estimates
are used in this table.
The estimate of $\beta_3$ is –0.3162, which indicates that progabide is effective in reducing the rate of epileptic
seizures.
Model fit criteria for the model are displayed in Output 45.2.6. These criteria are used in selecting regression
models and working correlations.
Example 45.3: Weighted GEE for Longitudinal Data That Have Missing Values
This example shows how you can use the GEE procedure to analyze longitudinal data that contain missing
values. The data set is taken from a longitudinal study of women who used contraception during one year
(Fitzmaurice, Laird, and Ware 2011). In this study, 1,151 women were randomly assigned to one of two
treatments: 100 mg or 150 mg of depot medroxyprogesterone acetate (DMPA) at baseline and at three-month
intervals. The response variable indicates their amenorrhea status during the four three-month intervals. The
question of interest is whether the treatment has an effect on the rate of amenorrhea over time. The
example follows the analysis done by Fitzmaurice, Laird, and Ware (2011).
The following statements create the data set Amenorrhea:
data Amenorrhea;
input ID Dose Time Y@@;
datalines;
1 0 1 0
1 0 2 .
1 0 3 .
1 0 4 .
1150 1 4 1
1151 1 1 1
1151 1 2 1
1151 1 3 1
1151 1 4 1
;
The variables in the data are as follows:
ID: patient’s ID
Dose: 0 for treatment with 100 mg injection; 1 for treatment with 150 mg injection
Prevy: the patient’s amenorrhea status in the previous three-month interval. For the baseline visit, this
is set to an arbitrary nonmissing value (0 here). In this release of PROC GEE, this arbitrary value
must be nonmissing and valid for the response variable—for example, it should be 0 or 1 for a binary
response—but it does not otherwise affect the results.
Ctime: a copy of Time, which you can include in the marginal model as a continuous effect and also in
the missingness model as a classification effect
The following statements add these two variables to the data set:
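data Amenorrhea;
   set Amenorrhea;
   by ID;
   Prevy=lag(Y);                /* response from the previous interval */
   if first.id then Prevy=0;    /* arbitrary nonmissing value at the first visit */
   Time=Time-1;                 /* recode Time from 1-4 to 0-3 */
   Ctime=Time;                  /* copy of Time for use as a CLASS variable */
run;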
Suppose $y_{ij}$ denotes the amenorrhea status of woman $i$ at the $j$th visit, $j = 1, \ldots, 4$, and suppose
$\mu_{ij} = P(y_{ij} = 1)$ denotes the probability of amenorrhea for woman $i$ at visit $j$. To explore whether the
treatment has an effect on the rate of amenorrhea over time, consider the following marginal model:
Of the 1,151 women in this study, 576 are from the low-dose group, and 575 are from the high-dose group.
For the low-dose group, 62.67% of the women completed the trial; for the high-dose group, 61.39% of the
women completed this trial. Thus, both groups have substantial dropouts.
To obtain the weights for the weighted GEE analysis, consider the following logistic regression model for
missingness:
The following statements use the observation-specific weighted GEE method and the specified response and
missingness models to analyze the data:
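The listing is not reproduced here. The sketch below shows one plausible specification; the effects in the MODEL and MISSMODEL statements, the TYPE=OBSLEVEL keyword, and the working correlation are assumptions chosen to agree with the discussion that follows (Ctime and Prevy in the missingness model, and Dose*Time as the fourth regression effect after the intercept).

```sas
/* Sketch only: model effects and option values are assumptions */
proc gee data=Amenorrhea descending;
   class ID Ctime;
   /* Missingness model whose fitted probabilities supply the weights */
   missmodel Ctime Prevy Dose Dose*Prevy / type=obslevel;
   /* Marginal model with linear and quadratic time trends by dose */
   model Y = Time Dose Time*Time Dose*Time Dose*Time*Time / dist=bin;
   repeated subject=ID / within=Ctime corr=exch;
run;
```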
The classification variable Ctime has two levels whose estimates are equal to zero. One is the reference level
Ctime = 3. The first level, Ctime = 0, also has an estimate of zero, because the first visit is always observed
and the first level is never used in estimating the weights in the missing model.
Output 45.3.2 displays the results of the weighted GEE analysis.
Output 45.3.2 Parameter Estimates for Amenorrhea Data Analysis Using Weighted GEE
The GEE Procedure
The estimate of $\beta_4$ (the parameter estimate for the Dose*Time interaction) is 0.4092, which indicates that the
change in the amenorrhea rate over time depends on the dose of DMPA. Specifically, for women in the low-dose
group, the amenorrhea rates $\mu_{ij}$ at the four consecutive time intervals are 0.1830, 0.2764, 0.3928, and 0.5210,
and for women in the high-dose group, the amenorrhea rates are 0.1997, 0.3609, 0.4963, and 0.5701. In other
words, the amenorrhea rate increases over time for both treatments, and the rates of increase are slightly
different.
You can request subject-level weights by specifying the TYPE=SUBLEVEL option. The results (not
shown here) from the subject-level weighted method are similar to the results from the observation-level
weighted method. Both of the weighted GEE methods provide unbiased regression parameter estimates if the
missingness model is specified correctly. Preisser, Lohman, and Rathouz (2002) note that the observation-
level weighted GEE produces more efficient estimates than the cluster-level weighted GEE produces for
incomplete longitudinal binary data.
Large weights can unduly influence the parameter estimates. Consequently, it is recommended that you check
the distribution of the estimated weights. If there are large weights, you might consider trimming them
by specifying the MAXWEIGHT= option in the MISSMODEL statement. Output 45.3.3 shows that the
estimated weights in this example range between 1 and 2.1, so no trimming is needed.
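For illustration, subject-level weights with trimming might be requested as in the following sketch. The trimming threshold of 5 is a hypothetical value, and the model effects are the same assumptions as in any observation-level specification; only the TYPE=SUBLEVEL and MAXWEIGHT= options are taken from the text.

```sas
/* Sketch only: MAXWEIGHT=5 is a hypothetical threshold */
proc gee data=Amenorrhea descending;
   class ID Ctime;
   /* Subject-level weights, trimmed so that no weight exceeds 5 */
   missmodel Ctime Prevy Dose Dose*Prevy / type=sublevel maxweight=5;
   model Y = Time Dose Time*Time Dose*Time Dose*Time*Time / dist=bin;
   repeated subject=ID / within=Ctime corr=exch;
run;
```

Because the estimated weights in this example stay below about 2.1, a threshold this large would leave them unchanged; it matters only when a few observations receive very large weights.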
Example 45.4: GEE for Binary Data with Logit Link Function
Because the respiratory data in Example 45.1 are binary, you can use the alternating logistic regression (ALR)
method and model associations by using the log odds ratios instead of working correlations. This example
fits a “fully parameterized cluster” model for the log odds ratio. That is, there is a log odds ratio parameter
for each unique pair of responses within clusters, and all clusters are parameterized identically. The following
statements fit the same regression model for the mean as in Example 45.1 but use a regression model for the
log odds ratios instead of a working correlation. LOGOR=FULLCLUST specifies a fully parameterized log
odds ratio model.
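The listing is not reproduced here. A minimal sketch, assuming the same mean model as in Example 45.1, replaces the working correlation with the LOGOR=FULLCLUST specification named above:

```sas
/* Sketch only: mean model carried over from Example 45.1 */
proc gee data=Resp descending;
   class ID Treatment Center Sex Baseline;
   model Outcome = Treatment Center Sex Age Baseline
         / dist=bin link=logit;
   /* One log odds ratio parameter per within-cluster pair of visits */
   repeated subject=ID(Center) / logor=fullclust;
run;
```

With four visits per patient there are six unique pairs, which is why six parameters (Alpha1 through Alpha6) are estimated.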
The parameters Alpha1 through Alpha6 estimate the log odds ratio for each unique within-cluster pair. The
correspondence between the log odds ratio parameters and within-cluster pairs is displayed in Output 45.4.2.
GEE Fit Criteria
QIC     511.8589
QICu    499.6516
The QIC for the ALR model shown in Output 45.4.3 is 511.86, whereas the QIC for the unstructured working
correlation model shown in Output 45.1.3 is 512.34, indicating that the ALR model has a slightly better fit.
You can fit the same model by fully specifying the z matrix; for the definition of the z matrix, see the section
“Specifying Log Odds Ratio Models” on page 3130. The following statements create a data set that contains
the full z matrix:
data zin;
   keep id center z1-z6 y1 y2;
   array zin(6) z1-z6;
   set resp;
   by center id;
   /* Emit the six visit pairs once per patient (cluster) */
   if first.id then do;
      t = 0;
      do m = 1 to 4;
         do n = m+1 to 4;
            /* Reset the indicator row, then flag the current pair */
            do j = 1 to 6;
               zin(j) = 0;
            end;
            y1 = m;
            y2 = n;
            t + 1;
            zin(t) = 1;
            output;
         end;
      end;
   end;
run;
Obs z1 z2 z3 z4 z5 z6 Center ID y1 y2
1 1 0 0 0 0 0 1 1 1 2
2 0 1 0 0 0 0 1 1 1 3
3 0 0 1 0 0 0 1 1 1 4
4 0 0 0 1 0 0 1 1 2 3
5 0 0 0 0 1 0 1 1 2 4
6 0 0 0 0 0 1 1 1 3 4
7 1 0 0 0 0 0 1 2 1 2
8 0 1 0 0 0 0 1 2 1 3
9 0 0 1 0 0 0 1 2 1 4
10 0 0 0 1 0 0 1 2 2 3
11 0 0 0 0 1 0 1 2 2 4
12 0 0 0 0 0 1 1 2 3 4
The following statements fit the model for fully parameterized clusters by fully specifying the z matrix. The
results are identical to those shown previously.
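The listing is not reproduced here. A sketch of how the zin data set might be supplied is shown below; ZDATA= and ZROW= appear in this chapter's syntax index, whereas the LOGOR=ZFULL keyword and the YPAIR= option are recalled from related SAS procedures and should be treated as assumptions.

```sas
/* Sketch only: LOGOR=ZFULL and YPAIR= are assumed option names */
proc gee data=Resp descending;
   class ID Treatment Center Sex Baseline;
   model Outcome = Treatment Center Sex Age Baseline
         / dist=bin link=logit;
   repeated subject=ID(Center) / logor=zfull
                                 zdata=zin        /* data set that holds the z matrix */
                                 zrow=(z1-z6)     /* columns that form each z row */
                                 ypair=(y1 y2);   /* visit pair that each row describes */
run;
```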
Example 45.5: Alternating Logistic Regression for Ordinal Multinomial Data
data Arthritis;
input ID Rating Sex Age Treatment Baseline Visit;
datalines;
1 4 2 54 2 2 1
1 5 2 54 2 2 3
1 5 2 54 2 2 5
2 4 1 41 1 3 1
2 4 1 41 1 3 3
2 4 1 41 1 3 5
301 2 2 64 1 2 5
302 2 2 55 1 2 1
302 3 2 55 1 2 3
302 3 2 55 1 2 5
;
The following SAS statements use PROC GEE to fit a model that has a fully exchangeable working correlation
structure:
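The listing is not reproduced here. The following sketch shows one way an ALR fit with an exchangeable log odds ratio could be requested for the ordinal response Rating; the choice of covariates and CLASS variables is an assumption based on the variables in the Arthritis data set.

```sas
/* Sketch only: covariates and CLASS variables are assumptions */
proc gee data=Arthritis;
   class ID Treatment Sex;
   /* Cumulative logit model for the ordinal response */
   model Rating = Treatment Sex Age Baseline Visit
         / dist=multinomial link=cumlogit;
   /* A single log odds ratio (Alpha1) shared by all pairs of responses */
   repeated subject=ID / logor=exch;
run;
```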
The parameter Alpha1, which is used to estimate the log odds ratio, is included in Output 45.5.1.
To fit the ALR model, each response is coded as a vector of binary variables and the log odds ratio models the
association between pairs of responses. For more information about the log odds ratio and the ALR method
for ordinal multinomial data, see the section “ALR for Ordinal Multinomial Data” on page 3129. The ALR
model fit criteria are shown in Output 45.5.2.
For comparison, the following SAS statements use PROC GEE to fit the same marginal model by using an
independent working correlation structure:
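The listing is not reproduced here. Under the same assumed covariates as in an ALR specification, the independence fit would differ only in the REPEATED statement:

```sas
/* Sketch only: covariates and CLASS variables are assumptions */
proc gee data=Arthritis;
   class ID Treatment Sex;
   model Rating = Treatment Sex Age Baseline Visit
         / dist=multinomial link=cumlogit;
   /* Independent working correlation instead of a log odds ratio model */
   repeated subject=ID / corr=ind;
run;
```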
Output 45.5.3 Parameter Estimates for Arthritis Data Using Independent Working Correlation
The GEE Procedure
The QIC for the ALR model shown in Output 45.5.2 is 2241.95, whereas the QIC for the independent
working correlation model shown in Output 45.5.4 is 2269.82, indicating a slightly better fit for the ALR
model.
Output 45.5.4 Model Fit Criteria
data Housing;
input ID Housing Time Sec;
datalines;
1 1 0 1
1 2 6 1
1 2 12 1
1 2 24 1
2 1 0 1
2 2 6 1
362 1 0 0
362 1 6 0
362 1 12 0
362 1 24 0
;
The following SAS statements use PROC GEE to fit a model to nominal multinomial data:
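The listing is not reproduced here. A sketch of a nominal (generalized logit) fit to the Housing data might look like the following; the covariates, the LINK=GLOGIT keyword, and the independent working correlation are assumptions.

```sas
/* Sketch only: effects and link keyword are assumptions */
proc gee data=Housing;
   class ID Sec;
   /* Generalized logit link: one set of coefficients per nonreference category */
   model Housing = Time Sec Time*Sec
         / dist=multinomial link=glogit;
   repeated subject=ID / corr=ind;
run;
```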
and
$$\mu_{ij} = \frac{\exp(\eta_{ij})}{\sum_{k=1}^{J} \exp(\eta_{ik})}, \qquad \eta_{iJ} = 0$$
The results of fitting the model are displayed in Output 45.6.1.
The positive estimates for the classification variable Sec = 0 at each response category, Housing = 0 and 1,
indicate an increased probability that a client will live independently when given access to Section 8 housing.
The model fit criteria are shown in Output 45.6.2.
For comparison, the following SAS statements treat the responses as ordinal and use PROC GEE to fit a
marginal model by using an independent working correlation structure:
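The listing is not reproduced here. Assuming the same covariates as in a nominal specification, the ordinal version would change only the link function:

```sas
/* Sketch only: effects are assumptions */
proc gee data=Housing;
   class ID Sec;
   /* Cumulative logit link treats the Housing categories as ordered */
   model Housing = Time Sec Time*Sec
         / dist=multinomial link=cumlogit;
   repeated subject=ID / corr=ind;
run;
```

A cumulative logit model shares one slope per effect across the response categories, which is why the ordinal fit yields a single estimate for Sec.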
Treating the responses as ordinal results in a single parameter estimate that is related to the classification
variable Sec. The QIC for the model that is fit by treating the responses as nominal (shown in Output 45.6.2) is
2675.21, whereas the QIC for the model that is fit by treating the responses as ordinal (shown in Output 45.6.4)
is 2710.50, indicating a slightly better fit when the responses are treated as nominal.
References
Boos, D. (1992). “On Generalized Score Tests.” American Statistician 46:327–333.
Carey, V., Zeger, S. L., and Diggle, P. J. (1993). “Modelling Multivariate Binary Data with Alternating
Logistic Regressions.” Biometrika 80:517–526.
Diggle, P. J., Heagerty, P., Liang, K.-Y., and Zeger, S. L. (2002). Analysis of Longitudinal Data. 2nd ed. New
York: Oxford University Press.
Diggle, P. J., Liang, K.-Y., and Zeger, S. L. (1994). Analysis of Longitudinal Data. Oxford: Clarendon Press.
Fitzmaurice, G. M., Laird, N. M., and Ware, J. H. (2011). Applied Longitudinal Analysis. 2nd ed. Hoboken,
NJ: John Wiley & Sons.
Fitzmaurice, G. M., Molenberghs, G., and Lipsitz, S. R. (1995). “Regression Models for Longitudinal Binary
Responses with Informative Drop-Outs.” Journal of the Royal Statistical Society, Series B 57:691–704.
Hardin, J. W., and Hilbe, J. M. (2003). Generalized Estimating Equations. Boca Raton, FL: Chapman &
Hall/CRC.
Heagerty, P., and Zeger, S. L. (1996). “Marginal Regression Models for Clustered Ordinal Measurements.”
Journal of the American Statistical Association 91:1024–1036.
Hurlbut, M. S., Wood, P. A., and Hough, R. L. (1996). “Providing Independent Housing for the Homeless
Mentally Ill: A Novel Approach to Evaluating Long-Term Longitudinal Housing Patterns.” Journal of
Community Psychology 24:291–310.
Liang, K.-Y., and Zeger, S. L. (1986). “Longitudinal Data Analysis Using Generalized Linear Models.”
Biometrika 73:13–22.
Lipsitz, S. R., Fitzmaurice, G. M., Orav, E. J., and Laird, N. M. (1994). “Performance of Generalized
Estimating Equations in Practical Situations.” Biometrics 50:270–278.
Lipsitz, S. R., Kim, K., and Zhao, L. (1994). “Analysis of Repeated Categorical Data Using Generalized
Estimating Equations.” Statistics in Medicine 13:1149–1163.
Littell, R. C., Freund, R. J., and Spector, P. C. (1991). SAS System for Linear Models. 3rd ed. Cary, NC: SAS
Institute Inc.
Mallinckrodt, C. (2013). Preventing and Treating Missing Data in Longitudinal Clinical Trials: A Practical
Guide. Cambridge: Cambridge University Press.
McCullagh, P., and Nelder, J. A. (1989). Generalized Linear Models. 2nd ed. London: Chapman & Hall.
Molenberghs, G., and Kenward, M. G. (2007). Missing Data in Clinical Studies. New York: John Wiley &
Sons.
O’Kelly, M., and Ratitch, B. (2014). Clinical Trials with Missing Data: A Guide for Practitioners. Chichester,
UK: John Wiley & Sons.
Pan, W. (2001). “Akaike’s Information Criterion in Generalized Estimating Equations.” Biometrics 57:120–125.
Preisser, J. S., Lohman, K. K., and Rathouz, P. J. (2002). “Performance of Weighted Estimating Equations
for Longitudinal Binary Data with Drop-Outs Missing at Random.” Statistics in Medicine 21:3035–3054.
Robins, J. M., and Rotnitzky, A. (1995). “Semiparametric Efficiency in Multivariate Regression Models with
Missing Data.” Journal of the American Statistical Association 90:122–129.
Rotnitzky, A., and Jewell, N. P. (1990). “Hypothesis Testing of Regression Parameters in Semiparametric
Generalized Linear Models for Cluster Correlated Data.” Biometrika 77:485–497.
Stokes, M. E., Davis, C. S., and Koch, G. G. (2012). Categorical Data Analysis Using SAS. 3rd ed. Cary,
NC: SAS Institute Inc.