SAS/STAT 14.3
User’s Guide
The GEE Procedure
This document is an individual chapter from SAS/STAT® 14.3 User’s Guide.
The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2017. SAS/STAT® 14.3 User’s Guide. Cary, NC:
SAS Institute Inc.
SAS/STAT® 14.3 User’s Guide
Copyright © 2017, SAS Institute Inc., Cary, NC, USA
All Rights Reserved. Produced in the United States of America.
For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by
any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute
Inc.
For a web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time
you acquire this publication.
The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is
illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic
piracy of copyrighted materials. Your support of others’ rights is appreciated.
U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer software
developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use, duplication, or
disclosure of the Software by the United States Government is subject to the license terms of this Agreement pursuant to, as
applicable, FAR 12.212, DFAR 227.7202-1(a), DFAR 227.7202-3(a), and DFAR 227.7202-4, and, to the extent required under U.S.
federal law, the minimum restricted rights as set out in FAR 52.227-19 (DEC 2007). If FAR 52.227-19 is applicable, this provision
serves as notice under clause (c) thereof and no other notice is required to be affixed to the Software or documentation. The
Government’s rights in Software and documentation shall be only those set forth in this Agreement.
SAS Institute Inc., SAS Campus Drive, Cary, NC 27513-2414
September 2017
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the
USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
SAS software may be provided with certain third-party software, including but not limited to open-source software, which is
licensed under its applicable third-party software license agreement. For license information about third-party software distributed
with SAS software, refer to http://support.sas.com/thirdpartylicenses.
Chapter 45
The GEE Procedure

Contents
Overview: GEE Procedure
Getting Started
Syntax: GEE Procedure
    PROC GEE Statement
    BY Statement
    CLASS Statement
    EFFECTPLOT Statement
    ESTIMATE Statement
    FREQ Statement
    LSMEANS Statement
    LSMESTIMATE Statement
    MISSMODEL Statement
    MODEL Statement
    OUTPUT Statement
    REPEATED Statement
    SLICE Statement
    STORE Statement
    WEIGHT Statement
Details: GEE Procedure
    Generalized Estimating Equations
    Alternating Logistic Regression
    Weighted Generalized Estimating Equations under the MAR Assumption
    Type 3 Analysis
    ODS Table Names
    ODS Graphics
Examples: GEE Procedure
    Example 45.1: Comparison of the Marginal and Random Effect Models for Binary Data
    Example 45.2: Log-Linear Model for Count Data
    Example 45.3: Weighted GEE for Longitudinal Data That Have Missing Values
    Example 45.4: GEE for Binary Data with Logit Link Function
    Example 45.5: Alternating Logistic Regression for Ordinal Multinomial Data
    Example 45.6: GEE for Nominal Multinomial Data
References
3104 F Chapter 45: The GEE Procedure

Overview: GEE Procedure

The GEE procedure implements the generalized estimating equations (GEE) approach (Liang and Zeger
1986), which extends the generalized linear model to handle longitudinal data (Stokes, Davis, and Koch 2012;
Fitzmaurice, Laird, and Ware 2011; Diggle et al. 2002). For longitudinal studies, missing data are common,
and they can be caused by dropouts or skipped visits. If missing responses depend on previous responses,
the usual GEE approach can lead to biased estimates. So the GEE procedure also implements the weighted
GEE method to handle missing responses that are caused by dropouts in longitudinal studies (Robins and
Rotnitzky 1995; Preisser, Lohman, and Rathouz 2002). The GEE procedure in SAS/STAT 14.1 does not
support the weighted GEE method for the multinomial distribution for polytomous responses.
The GEE method fits a marginal model to longitudinal data. The regression parameters in the marginal model
are interpreted as population-averaged. For more information about the GEE method, see Fitzmaurice, Laird,
and Ware (2011); Hardin and Hilbe (2003); Diggle et al. (2002); Lipsitz et al. (1994).
The GEE procedure compares most closely to the GENMOD procedure in SAS/STAT software. Both
procedures implement the standard generalized estimating equation approach for longitudinal data; this
approach is appropriate for complete data or when data are missing completely at random (MCAR). When
the data are missing at random (MAR), the weighted GEE method produces valid inference. Molenberghs
and Kenward (2007); Fitzmaurice, Laird, and Ware (2011); Mallinckrodt (2013); O’Kelly and Ratitch (2014)
describe the weighted GEE method.
The GEE procedure includes alternating logistic regression (ALR) analysis for binary and ordinal multinomial
responses. In ordinary GEEs, the association between pairs of responses is modeled with correlations. The
ALR approach provides an alternative by using the log odds ratio to model the association between pairs.
For more information about the log odds ratio and the ALR method, see the section “Alternating Logistic
Regression” on page 3128. For binary responses the ALR algorithm of Carey, Zeger, and Diggle (1993) is
implemented in both the GEE and GENMOD procedures. The GEE procedure also implements the ALR
algorithm of Heagerty and Zeger (1996), which extends the ALR approach to ordinal multinomial responses.
An ordinary GEE with the independent working correlation structure is also available for both nominal and
ordinal multinomial data.
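
As a rough sketch of how the two association models differ in practice, the following statements fit the same binary model twice to the Children data set that is created in the "Getting Started" section: once with an exchangeable working correlation and once with an exchangeable log odds ratio. The LOGOR=EXCH option is an assumption here; it belongs to the REPEATED statement, which is documented outside this excerpt.

```sas
/* Ordinary GEE: within-child association modeled by a working correlation */
proc gee data=Children descending;
   class ID City;
   model Symptom = City Age Smoke / dist=bin link=logit;
   repeated subject=ID / type=exch;
run;

/* ALR: within-child association modeled by a common log odds ratio.  */
/* LOGOR=EXCH is assumed from the REPEATED statement documentation.   */
proc gee data=Children descending;
   class ID City;
   model Symptom = City Age Smoke / dist=bin link=logit;
   repeated subject=ID / logor=exch;
run;
```

Only the REPEATED statement changes between the two fits, so the regression parameters keep the same population-averaged interpretation in both.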

Getting Started
This section illustrates some of the basic features of the GEE procedure by analyzing longitudinal data from
Stokes, Davis, and Koch (2012).
In this study, researchers followed 25 children at ages 8, 9, 10, and 11 years to investigate the health effects
of air pollution on children. The binary response is the wheezing status of each child at each of the four ages.
The explanatory variables are age, city, and a passive smoking index (with values 0, 1, 2) that represents the
degree of smoking in the home. The responses for an individual child are assumed to be equally correlated,
implying an exchangeable correlation structure.
The following statements create the data set Children:

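data Children;
   input ID City$ @@;                /* child identifier and city; @@ holds the line */
   do i=1 to 4;                      /* four visits per child, at ages 8 through 11 */
      input Age Smoke Symptom @@;
      output;                        /* one observation per child per visit */
   end;
   datalines;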
1 steelcity 8 0 1 9 0 1 10 0 1 11 0 0
2 steelcity 8 2 1 9 2 1 10 2 1 11 1 0
3 steelcity 8 2 1 9 2 0 10 1 0 11 0 0
4 greenhills 8 0 0 9 1 1 10 1 1 11 0 0
5 steelcity 8 0 0 9 1 0 10 1 0 11 1 0
6 greenhills 8 0 1 9 0 0 10 0 0 11 0 1
7 steelcity 8 1 1 9 1 1 10 0 1 11 0 0
8 greenhills 8 1 0 9 1 0 10 1 0 11 2 0
9 greenhills 8 2 1 9 2 0 10 1 1 11 1 0
10 steelcity 8 0 0 9 0 0 10 0 0 11 1 0
11 steelcity 8 1 1 9 0 0 10 0 0 11 0 1
12 greenhills 8 0 0 9 0 0 10 0 0 11 0 0
13 steelcity 8 2 1 9 2 1 10 1 0 11 0 1
14 greenhills 8 0 1 9 0 1 10 0 0 11 0 0
15 steelcity 8 2 0 9 0 0 10 0 0 11 2 1
16 greenhills 8 1 0 9 1 0 10 0 0 11 1 0
17 greenhills 8 0 0 9 0 1 10 0 1 11 1 1
18 steelcity 8 1 1 9 2 1 10 0 0 11 1 0
19 steelcity 8 2 1 9 1 0 10 0 1 11 0 0
20 greenhills 8 0 0 9 0 1 10 0 1 11 0 0
21 steelcity 8 1 0 9 1 0 10 1 0 11 2 1
22 greenhills 8 0 1 9 0 1 10 0 0 11 0 0
23 steelcity 8 1 1 9 1 0 10 0 1 11 0 0
24 greenhills 8 1 0 9 1 1 10 1 1 11 2 1
25 greenhills 8 0 1 9 0 0 10 0 0 11 0 0
;
The following statements fit the model by the GEE method:

proc gee data=Children descending;
   class ID City;
   model Symptom = City Age Smoke / dist=bin link=logit;
   repeated subject=ID / type=exch covb corrw;
run;

Both the MODEL statement and the REPEATED statement are required.
The DIST=BIN and LINK=LOGIT options in the MODEL statement request a logistic regression with the
variable Symptom as the response and City, Age, and Smoke as explanatory variables.
The REPEATED statement specifies the correlation structure and requests various tables in the output. The
SUBJECT=ID option requests that individual subjects be identified in the input data set by the variable ID,
which must be listed in the CLASS statement. Measurements of individual subjects at ages 8, 9, 10, and 11
are in the proper order in the data set, so the WITHIN= option is not required. The TYPE=EXCH option
specifies an exchangeable working correlation structure, the COVB option requests the parameter estimate
covariance matrix, and the CORRW option requests the working correlation matrix.
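
If the rows for a child were not stored in age order, the within-subject order could be declared explicitly. The following sketch assumes that the WITHIN= option of the REPEATED statement takes an effect whose variables are listed in the CLASS statement, as it does in PROC GENMOD. Children2 and Visit are hypothetical names; Visit is a copy of Age, introduced so that Age itself can stay continuous in the model.

```sas
data Children2;            /* hypothetical copy of the Children data set */
   set Children;
   Visit = Age;            /* classification copy of Age, used only for ordering */
run;

proc gee data=Children2 descending;
   class ID City Visit;
   model Symptom = City Age Smoke / dist=bin link=logit;
   repeated subject=ID / within=Visit type=exch covb corrw;
run;
```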
Figure 45.1 shows the “Model Information” table, which provides information about the specified logistic
regression model and the input data set.

Figure 45.1 Model Information

The GEE Procedure

Model Information
Data Set WORK.CHILDREN
Distribution Binomial
Link Function Logit
Dependent Variable Symptom

Figure 45.2 displays general information about the GEE analysis. Each subject has four measurements.

Figure 45.2 GEE Model Information

GEE Model Information

Correlation Structure Exchangeable
Subject Effect ID (25 levels)
Number of Clusters 25
Correlation Matrix Dimension 4
Maximum Cluster Size 4
Minimum Cluster Size 4

Figure 45.3 displays the model-based and empirical covariance matrices of the parameter estimates.

Figure 45.3 Covariance Matrices of Parameter Estimates

Covariance Matrix (Model-Based)

Prm1 Prm2 Prm4 Prm5
Prm1 3.26069 -0.16313 -0.32274 -0.12257
Prm2 -0.16313 0.24015 0.002520 0.03422
Prm4 -0.32274 0.002520 0.03379 0.004471
Prm5 -0.12257 0.03422 0.004471 0.09533

Covariance Matrix (Empirical)

Prm1 Prm2 Prm4 Prm5
Prm1 4.09770 -0.55261 -0.37280 -0.29397
Prm2 -0.55261 0.29538 0.03719 0.09143
Prm4 -0.37280 0.03719 0.03550 0.02064
Prm5 -0.29397 0.09143 0.02064 0.07957

The exchangeable working correlation matrix is displayed in Figure 45.4.

Figure 45.4 Working Correlation Matrix

Working Correlation Matrix

Obs 1 Obs 2 Obs 3 Obs 4
Obs 1 1.0000 0.0883 0.0883 0.0883
Obs 2 0.0883 1.0000 0.0883 0.0883
Obs 3 0.0883 0.0883 1.0000 0.0883
Obs 4 0.0883 0.0883 0.0883 1.0000
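
Written out, an exchangeable working correlation for the four measurements per child has a single parameter for every off-diagonal entry, and Figure 45.4 reports its estimate as 0.0883. The symbols $R$, $\alpha$, $I_4$, and $\mathbf{1}_4$ below are notation introduced for this illustration only.

```latex
\[
R(\alpha) \;=\; (1-\alpha)\, I_4 + \alpha\, \mathbf{1}_4 \mathbf{1}_4^{\top}
\;=\;
\begin{pmatrix}
1 & \alpha & \alpha & \alpha \\
\alpha & 1 & \alpha & \alpha \\
\alpha & \alpha & 1 & \alpha \\
\alpha & \alpha & \alpha & 1
\end{pmatrix},
\qquad \hat{\alpha} = 0.0883 .
\]
```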

The parameter estimates table, shown in Figure 45.5, contains parameter estimates, standard errors, confidence
intervals, Z scores, and p-values for the parameter estimates. Empirical standard error estimates are used in
this table. You can create a table that uses model-based standard errors by specifying the MODELSE option
in the REPEATED statement. The results indicate that smoking exposure is significant with a p-value of
0.0211, Age is marginally influential with a p-value of 0.0893, and there is no evidence that City influences
wheezing (p-value of 0.9387). The parameter estimate for Age is –0.3201, which indicates that the odds of
wheezing are multiplied by $e^{-0.3201} = 0.726$ for each one-year increase in age.

Figure 45.5 GEE Parameter Estimates Table

Parameter Estimates for Response Model
with Empirical Standard Error Estimates

Parameter        Estimate   Standard Error   95% Confidence Limits        Z   Pr > |Z|
Intercept          2.2615           2.0243    -1.7060       6.2290     1.12     0.2639
City greenhil      0.0418           0.5435    -1.0234       1.1070     0.08     0.9387
City steelcit      0.0000           0.0000     0.0000       0.0000        .          .
Age               -0.3201           0.1884    -0.6894       0.0492    -1.70     0.0893
Smoke              0.6506           0.2821     0.0978       1.2035     2.31     0.0211
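
Because the model uses the logit link, exponentiating an estimate and its confidence limits from Figure 45.5 gives an odds ratio per one-unit increase in the covariate, with limits on the same scale. The arithmetic below is a worked illustration from the tabled values, rounded to three decimals; it is not additional procedure output.

```latex
\begin{align*}
\text{Age:}   \quad & e^{-0.3201} \approx 0.726,
  & \bigl(e^{-0.6894},\, e^{0.0492}\bigr) &\approx (0.502,\ 1.050) \\
\text{Smoke:} \quad & e^{0.6506} \approx 1.917,
  & \bigl(e^{0.0978},\, e^{1.2035}\bigr) &\approx (1.103,\ 3.332)
\end{align*}
```

The interval for Age contains 1 and the interval for Smoke does not, which matches the p-values of 0.0893 and 0.0211 in the table.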

Goodness-of-fit criteria for the model are displayed in Figure 45.6. For more information about the quasi-
likelihood information criterion (QIC), see the section “Quasi-likelihood Information Criterion” on page 3127.

Figure 45.6 Model Fit Criteria

GEE Fit Criteria
QIC     137.1373
QICu    136.2173

Syntax: GEE Procedure

The following statements are available in the GEE procedure. Items within < > are optional.

PROC GEE < options > ;

BY variables ;
CLASS variable < (options) > . . . < variable < (options) > > < / options > ;
EFFECTPLOT < plot-type < (plot-definition-options) > > < / options > ;
ESTIMATE < 'label' > estimate-specification < / options > ;
FREQ | FREQUENCY variable ;
LSMEANS < model-effects > < / options > ;
LSMESTIMATE model-effect < 'label' > values < divisor =n > < , . . . < 'label' > values < divisor =n > >
< / options > ;
MISSMODEL < effects > < / options > ;
MODEL response = < effects > < / options > ;
OUTPUT < OUT=SAS-data-set > < keyword=name . . . keyword=name > ;
REPEATED SUBJECT=subject-effect < / options > ;
SLICE model-effect < / options > ;
STORE < OUT= >item-store-name < / LABEL='label' > ;
WEIGHT variable ;
The syntax of the GEE procedure compares most closely to that of the GENMOD procedure. The PROC
GEE, MODEL, and REPEATED statements are required. All other statements can appear only once. The
following sections describe the PROC GEE statement and then describe the other statements in alphabetical
order.

PROC GEE Statement

PROC GEE < options > ;

The PROC GEE statement invokes the GEE procedure. Table 45.1 summarizes the options available in the
PROC GEE statement.

Table 45.1 PROC GEE Statement Options

Option Description
DATA= Specifies the input data set
DESCENDING Sorts the response variable in the reverse of the default order
NAMELEN= Specifies the length of effect names
ORDER= Specifies the sort order of CLASS variables
PLOTS Controls the plots that are produced through ODS Graphics

You can specify the following options.

DATA=SAS-data-set
specifies the SAS data set that contains the data to be analyzed. If you omit the DATA= option, PROC
GEE uses the most recently created SAS data set.

DESCENDING
DESCEND
DESC
requests that the levels of the response variable for the binomial model that uses a single-variable
response syntax be sorted in the reverse of the default order.

NAMELEN=number
specifies the length to which long effect names are shortened. The default and minimum value is 20.

PLOTS < = plot-request >

controls the plots produced through ODS Graphics. For example:

proc gee plots=histogram;
   model y=x1;
run;

For more information about enabling and disabling ODS Graphics, see the section “Enabling and
Disabling ODS Graphics” on page 615 in Chapter 21, “Statistical Graphics Using ODS.”
You can specify the following plot-requests:

ALL
requests that all default plots be produced.

HISTOGRAM
creates a histogram for the predicted weights from the missingness model.

NONE
suppresses all plots.
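
Because the histogram shows predicted weights from the missingness model, a complete request presumably also needs a MISSMODEL statement and the required REPEATED statement. The following is a minimal sketch under that assumption. The data set Trial and the variables ID, Visit, Dose, Y, and Prevy (the response at the previous visit) are hypothetical names used only for illustration.

```sas
ods graphics on;

/* Hypothetical data set and variable names, for illustration only */
proc gee data=Trial descending plots=histogram;
   class ID;
   missmodel Prevy Dose / type=obslevel;   /* missingness model that supplies the weights */
   model Y = Visit Dose / dist=bin link=logit;
   repeated subject=ID / type=exch;
run;
```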

BY Statement
BY variables ;

You can specify a BY statement with PROC GEE to obtain separate analyses of observations in groups that
are defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be
sorted in order of the BY variables. If you specify more than one BY statement, only the last one specified is
used.
If your input data set is not sorted in ascending order, use one of the following alternatives:

• Sort the data by using the SORT procedure with a similar BY statement.

• Specify the NOTSORTED or DESCENDING option in the BY statement for the GEE procedure. The
  NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged
  in groups (according to values of the BY variables) and that these groups are not necessarily in
  alphabetical or increasing numeric order.

• Create an index on the BY variables by using the DATASETS procedure (in Base SAS software).
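
The first alternative can be sketched with the Children data set from the "Getting Started" section. The output data set name ChildrenByCity is hypothetical, and City is left out of the MODEL statement because it is constant within each BY group.

```sas
/* Sort by the BY variable, keeping each child's visits in age order */
proc sort data=Children out=ChildrenByCity;
   by City ID Age;
run;

/* One GEE fit per city */
proc gee data=ChildrenByCity descending;
   by City;
   class ID;
   model Symptom = Age Smoke / dist=bin link=logit;
   repeated subject=ID / type=exch;
run;
```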

For more information about BY-group processing, see the discussion in SAS Language Reference: Concepts.
For more information about the DATASETS procedure, see the discussion in the SAS Visual Data Management
and Utility Procedures Guide.

CLASS Statement
CLASS variables < / options > ;
The CLASS statement names the classification variables to be used in the analysis. If the CLASS statement
is used, it must appear before the MODEL statement.
Classification variables can be either character or numeric. CLASS levels are determined from the formatted
values of the variables. Thus, you can use formats to group values into levels. For more information, see the
discussion of the FORMAT procedure in the SAS Visual Data Management and Utility Procedures Guide
and the discussions of the FORMAT statement and SAS formats in SAS Formats and Informats: Reference.
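
As an illustration of grouping values through a format, the following sketch collapses the three values of Smoke in the Children data set into two CLASS levels. The format name smokefmt and its labels are hypothetical.

```sas
proc format;
   value smokefmt 0   = 'none'       /* hypothetical labels */
                  1-2 = 'exposed';
run;

proc gee data=Children descending;
   class ID City Smoke;
   format Smoke smokefmt.;           /* CLASS levels come from the formatted values */
   model Symptom = City Age Smoke / dist=bin link=logit;
   repeated subject=ID / type=exch;
run;
```

With the format in effect, Smoke should contribute two levels (exposed and none) instead of three.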
You can specify the following options for classification variables:
DESCENDING
DESC
reverses the sort order of the classification variable. If you specify both the DESCENDING and
ORDER= options, PROC GEE orders the categories according to the ORDER= option and then
reverses that order.
ORDER=order-type
specifies the sort order for the categories of categorical variables. This ordering determines which
parameters in the model correspond to each level in the data. When the default ORDER=FORMATTED
is in effect for numeric variables for which you have supplied no explicit format, the levels are ordered
by their internal values. Table 45.2 shows how PROC GEE interprets values of the ORDER= option.

Table 45.2 Sort Order for Categorical Variables

order-type       Levels Sorted By

DATA             Order of appearance in the input data set
FORMATTED        External formatted value, except for numeric variables that have no
                 explicit format, which are sorted by their unformatted (internal) value
FREQ             Descending frequency count; levels that have the most observations
                 come first in the order
FREQDATA         Order of descending frequency count, and within counts by order of
                 appearance in the input data set when counts are tied
FREQFORMATTED    Order of descending frequency count, and within counts by formatted
                 value (as above) when counts are tied
FREQINTERNAL     Order of descending frequency count, and within counts by unformatted
                 value when counts are tied
INTERNAL         Unformatted value

For the FORMATTED and INTERNAL values, the sort order is machine-dependent. If you specify
the ORDER= option in the MODEL statement and the ORDER= option in the CLASS statement, the
former takes precedence.

For more information about sort order, see the chapter on the SORT procedure in the SAS Visual
Data Management and Utility Procedures Guide and the discussion of BY-group processing in SAS
Language Reference: Concepts.
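
The effect of the ORDER= option can be sketched with the Children data set. In Figure 45.5 the last level in formatted order, steelcity, has a zero estimate and acts as the reference level. The first observation in Children is from steelcity, so under ORDER=DATA the last level would presumably be greenhills instead. The following statements are a sketch under that assumption, using the option after the slash so that it applies to all CLASS variables.

```sas
proc gee data=Children descending;
   class ID City / order=data;       /* levels ordered by first appearance in the data */
   model Symptom = City Age Smoke / dist=bin link=logit;
   repeated subject=ID / type=exch;
run;
```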

EFFECTPLOT Statement
EFFECTPLOT < plot-type < (plot-definition-options) > > < / options > ;

The EFFECTPLOT statement produces a display of the fitted model and provides options for changing and
enhancing the displays. Table 45.3 describes the available plot-types and their plot-definition-options.

Table 45.3 Plot-Types and Plot-Definition-Options

Plot-Type, Description, and Plot-Definition-Options

BOX
   Displays a box plot of continuous response data at each level of a CLASS effect, with
   predicted values superimposed and connected by a line. This is an alternative to the
   INTERACTION plot-type.
   Plot-definition-options:
      PLOTBY= variable or CLASS effect
      X= CLASS variable or effect

CONTOUR
   Displays a contour plot of predicted values against two continuous covariates.
   Plot-definition-options:
      PLOTBY= variable or CLASS effect
      X= continuous variable
      Y= continuous variable

FIT
   Displays a curve of predicted values versus a continuous variable.
   Plot-definition-options:
      PLOTBY= variable or CLASS effect
      X= continuous variable

INTERACTION
   Displays a plot of predicted values (possibly with error bars) versus the levels of a
   CLASS effect. The predicted values are connected with lines and can be grouped by the
   levels of another CLASS effect.
   Plot-definition-options:
      PLOTBY= variable or CLASS effect
      SLICEBY= variable or CLASS effect
      X= CLASS variable or effect

MOSAIC
   Displays a mosaic plot of predicted values using up to three CLASS effects.
   Plot-definition-options:
      PLOTBY= variable or CLASS effect
      X= CLASS effects

SLICEFIT
   Displays a curve of predicted values versus a continuous variable grouped by the levels
   of a CLASS effect.
   Plot-definition-options:
      PLOTBY= variable or CLASS effect
      SLICEBY= variable or CLASS effect
      X= continuous variable

For full details about the syntax and options of the EFFECTPLOT statement, see the section “EFFECTPLOT
Statement” on page 420 in Chapter 19, “Shared Concepts and Topics.”
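
As a minimal sketch of one plot-type from Table 45.3, the following statements request a SLICEFIT plot for the model from the "Getting Started" section, with Age on the horizontal axis and one curve per city. The choice of plot-type and options is illustrative only.

```sas
ods graphics on;

proc gee data=Children descending;
   class ID City;
   model Symptom = City Age Smoke / dist=bin link=logit;
   repeated subject=ID / type=exch;
   effectplot slicefit(x=Age sliceby=City);   /* predicted values versus Age, grouped by City */
run;
```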

ESTIMATE Statement
ESTIMATE < 'label' > estimate-specification < (divisor =n) >
< , . . . < 'label' > estimate-specification < (divisor =n) > >
< / options > ;

The ESTIMATE statement provides a mechanism for obtaining custom hypothesis tests. Estimates are
formed as linear estimable functions of the form $L\beta$. You can perform hypothesis tests for the estimable
functions, construct confidence limits, and obtain specific nonlinear transformations.
Table 45.4 summarizes the options available in the ESTIMATE statement.

Table 45.4 ESTIMATE Statement Options

Option           Description

Construction and Computation of Estimable Functions
DIVISOR=         Specifies a list of values to divide the coefficients
NOFILL           Suppresses the automatic fill-in of coefficients for higher-order effects
SINGULAR=        Tunes the estimability checking difference

Degrees of Freedom and p-values
ADJUST=          Determines the method for multiple comparison adjustment of estimates
ALPHA=$\alpha$   Determines the confidence level ($1 - \alpha$)
LOWER            Performs one-sided, lower-tailed inference
STEPDOWN         Adjusts multiplicity-corrected p-values further in a step-down fashion
TESTVALUE=       Specifies values under the null hypothesis for tests
UPPER            Performs one-sided, upper-tailed inference

Statistical Output
CL               Constructs confidence limits
CORR             Displays the correlation matrix of estimates
COV              Displays the covariance matrix of estimates
E                Prints the L matrix
JOINT            Produces a joint F or chi-square test for the estimable functions
PLOTS=           Requests ODS statistical graphics if the analysis is sampling-based
SEED=            Specifies the seed for computations that depend on random numbers

Generalized Linear Modeling
CATEGORY=        Specifies how to construct estimable functions with multinomial data
EXP              Exponentiates and displays estimates
ILINK            Computes and displays estimates and standard errors on the inverse
                 linked scale

For details about the syntax of the ESTIMATE statement, see the section “ESTIMATE Statement” on
page 448 in Chapter 19, “Shared Concepts and Topics.”
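
As a short sketch, the following ESTIMATE statements ask for the Age and Smoke effects from the "Getting Started" model on the odds ratio scale by combining the EXP and CL options from Table 45.4. The labels are arbitrary.

```sas
proc gee data=Children descending;
   class ID City;
   model Symptom = City Age Smoke / dist=bin link=logit;
   repeated subject=ID / type=exch;
   estimate 'Age, per year'   Age   1 / exp cl;   /* log odds ratio and its exponentiated value */
   estimate 'Smoke, per unit' Smoke 1 / exp cl;
run;
```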

FREQ Statement
FREQ variable ;

FREQUENCY variable ;

The variable in the FREQ statement identifies a variable in the input data set that contains the frequency of
occurrence of each observation. PROC GEE treats each observation as if it appeared n times, where n is the
value of the FREQ variable for the observation. If the frequency value is not an integer, it is truncated to an
integer. If it is less than 1 or missing, the observation is not used. The frequencies must be the same for all
observations within each subject.
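
A FREQ variable is useful when identical response profiles have been collapsed into one record group with a count. The following is a sketch only: the data set Profiles, the identifier ProfileID, and the variable Count are hypothetical, and Count is assumed to be constant across the rows of each profile, as the preceding paragraph requires.

```sas
/* Hypothetical data set: one group of rows per distinct profile, */
/* with Count giving the number of subjects that share it          */
proc gee data=Profiles descending;
   class ProfileID;
   freq Count;
   model Symptom = Age Smoke / dist=bin link=logit;
   repeated subject=ProfileID / type=exch;
run;
```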

LSMEANS Statement
LSMEANS < model-effects > < / options > ;

The LSMEANS statement computes and compares least squares means (LS-means) of fixed effects. LS-means
are predicted population margins—that is, they estimate the marginal means over a balanced population. In a
sense, LS-means are to unbalanced designs as class and subclass arithmetic means are to balanced designs.
Table 45.5 summarizes the options available in the LSMEANS statement.

Table 45.5 LSMEANS Statement Options

Option           Description

Construction and Computation of LS-Means
AT               Modifies the covariate value in computing LS-means
BYLEVEL          Computes separate margins
DIFF             Requests differences of LS-means
OM=              Specifies the weighting scheme for LS-means computation as determined
                 by the input data set
SINGULAR=        Tunes estimability checking

Degrees of Freedom and p-values
ADJUST=          Determines the method for multiple-comparison adjustment of LS-means
                 differences
ALPHA=$\alpha$   Determines the confidence level ($1 - \alpha$)
STEPDOWN         Adjusts multiple-comparison p-values further in a step-down fashion

Statistical Output
E                Prints the L matrix
LINES            Uses connecting lines to indicate nonsignificantly different subsets
                 of LS-means
LINESTABLE       Displays the results of the LINES option as a table
MEANS            Prints the LS-means
PLOTS=           Requests graphs of means and mean comparisons
SEED=            Specifies the seed for computations that depend on random numbers

Generalized Linear Modeling
EXP              Exponentiates and displays estimates of LS-means or LS-means differences
ILINK            Computes and displays estimates and standard errors of LS-means
                 (but not differences) on the inverse linked scale
ODDSRATIO        Reports (simple) differences of least squares means in terms of
                 odds ratios if permitted by the link function

For details about the syntax of the LSMEANS statement, see the section “LSMEANS Statement” on page 464
in Chapter 19, “Shared Concepts and Topics.”
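
A minimal sketch of the LSMEANS statement for the "Getting Started" model follows. It requests LS-means for City on the probability scale (ILINK) and the difference between cities as an odds ratio (DIFF and ODDSRATIO), all of which appear in Table 45.5.

```sas
proc gee data=Children descending;
   class ID City;
   model Symptom = City Age Smoke / dist=bin link=logit;
   repeated subject=ID / type=exch;
   lsmeans City / ilink diff oddsratio;
run;
```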

LSMESTIMATE Statement
LSMESTIMATE model-effect < 'label' > values < divisor =n >
< , . . . < 'label' > values < divisor =n > >
< / options > ;

The LSMESTIMATE statement provides a mechanism for obtaining custom hypothesis tests among least
squares means.
Table 45.6 summarizes the options available in the LSMESTIMATE statement.

Table 45.6 LSMESTIMATE Statement Options

Option           Description

Construction and Computation of LS-Means
AT               Modifies covariate values in computing LS-means
BYLEVEL          Computes separate margins
DIVISOR=         Specifies a list of values to divide the coefficients
OM=              Specifies the weighting scheme for LS-means computation as determined
                 by a data set
SINGULAR=        Tunes estimability checking

Degrees of Freedom and p-values
ADJUST=          Determines the method for multiple-comparison adjustment of LS-means
                 differences
ALPHA=$\alpha$   Determines the confidence level ($1 - \alpha$)
LOWER            Performs one-sided, lower-tailed inference
STEPDOWN         Adjusts multiple-comparison p-values further in a step-down fashion
TESTVALUE=       Specifies values under the null hypothesis for tests
UPPER            Performs one-sided, upper-tailed inference

Statistical Output
CL               Constructs confidence limits for means and mean differences
CORR             Displays the correlation matrix of LS-means
COV              Displays the covariance matrix of LS-means
E                Prints the L matrix
ELSM             Prints the K matrix
JOINT            Produces a joint F or chi-square test for the LS-means and
                 LS-means differences
PLOTS=           Requests graphs of means and mean comparisons
SEED=            Specifies the seed for computations that depend on random numbers

Generalized Linear Modeling
CATEGORY=        Specifies how to construct estimable functions with multinomial data
EXP              Exponentiates and displays LS-means estimates
ILINK            Computes and displays estimates and standard errors of LS-means
                 (but not differences) on the inverse linked scale

For details about the syntax of the LSMESTIMATE statement, see the section “LSMESTIMATE Statement”
on page 484 in Chapter 19, “Shared Concepts and Topics.”
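
The following sketch uses the LSMESTIMATE statement to compare the two City LS-means in the "Getting Started" model. The coefficients 1 and -1 follow the default formatted order of the levels (greenhills, then steelcity), and the label is arbitrary.

```sas
proc gee data=Children descending;
   class ID City;
   model Symptom = City Age Smoke / dist=bin link=logit;
   repeated subject=ID / type=exch;
   lsmestimate City 'greenhills vs steelcity' 1 -1 / exp cl;
run;
```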

MISSMODEL Statement
MISSMODEL effects < / options > ;
The MISSMODEL statement requests a weighted GEE analysis. It specifies a logistic regression that is
used to estimate the weights under the MAR assumption. If the pattern of missing data is intermittent (not
dropout), the GEE procedure terminates and does not perform an analysis.
You can use the same effects or different effects in the MODEL and MISSMODEL statements. Explanatory
variables can be continuous or classification variables. Classification variables can be character or numeric.
Explanatory variables that represent nominal (classification) data must be declared in a CLASS statement.
Interactions between variables can also be included as effects. Columns of the design matrix are automatically
generated for classification variables and interactions. The syntax for effects is the same as for the GLM
procedure. For more information, see the section “Specification of Effects” on page 3773 in Chapter 48,
“The GLM Procedure.”
You can specify the following options after a slash (/).

MAXWEIGHT=number
truncates the predicted weights from the missingness model if they are larger than number, where
number $\geq$ 1.

TYPE=OBSLEVEL | SUBLEVEL
specifies the type of weighted GEE method. You can specify the following values:

OBSLEVEL specifies the observation-level weighted GEE method.

SUBLEVEL specifies the subject-level weighted GEE method.

By default, TYPE=OBSLEVEL.
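
The two MISSMODEL options can be combined as in the following sketch, which requests the subject-level method and caps the predicted weights at 10. The data set Trial, the variables ID, Visit, Dose, Y, and Prevy (the response at the previous visit), and the cap of 10 are all hypothetical choices for illustration.

```sas
/* Hypothetical data set with dropout: Y is missing after a subject leaves the study */
proc gee data=Trial descending;
   class ID;
   missmodel Prevy Dose / type=sublevel maxweight=10;
   model Y = Visit Dose / dist=bin link=logit;
   repeated subject=ID / type=exch;
run;
```

Capping the weights is a way to limit the influence of a few observations that have very small predicted probabilities of remaining in the study.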

MODEL Statement
MODEL response = < effects > < / options > ;

MODEL events/trials = < effects > < / options > ;

The MODEL statement specifies the response (dependent variable) and the effects (explanatory variables). If
you omit the explanatory variables, PROC GEE fits an intercept-only model. An intercept term is included in
the model by default. You can remove the intercept by specifying the NOINT option.
You can specify the response in the form of a single variable (response) or in the form of a ratio of two
variables (events/trials). The first form is applicable to all responses. The second form is applicable only to
summarized binomial response data. When each observation in the input data set contains the number of
events (for example, successes) and the number of trials from a set of binomial trials, use the events/trials
syntax.
In the events/trials model syntax, you specify two variables: one for the event counts and one for trial counts.
These two variables are separated by a slash (/). The value of the events variable must be nonnegative,
and the value of the trials variable must be equal to or greater than the value of the events variable for an
observation to be valid. The events and trials variables can take non-integer values.
When each observation in the input data set contains a single trial from a binomial experiment, use the
response form of the MODEL statement. The response variable can be numeric or character. The ordering
of response levels is critical in these models.
Responses for the Poisson distribution must be all nonnegative, but they can be non-integer values.
The effects in the MODEL statement consist of an explanatory variable or combination of variables.
Explanatory variables can be continuous or classification variables. Classification variables can be character or
numeric. Explanatory variables that represent nominal (classification) data must be declared in a CLASS
statement. Interactions between variables can also be included as effects. Columns of the design matrix
are automatically generated for classification variables and interactions. The syntax for specifying effects
is the same as for the GLM procedure. For more information, see the section “Specification of Effects” on
page 3773 in Chapter 48, “The GLM Procedure.”
Table 45.7 summarizes the options available in the MODEL statement.

Table 45.7 MODEL Statement Options

Option Description
ALPHA= Sets the confidence coefficient
DIST= Specifies the probability distribution
LINK= Specifies the link function
NOINT Requests no intercept term
NOSCALE Holds the scale parameter fixed
OFFSET= Specifies a variable in the input data set to be used as an offset
SCALE= Specifies the value used for the scale
TYPE3 Computes statistics for Type 3 contrasts
WALD Requests Wald statistics for Type 3 contrasts

You can specify the following options after a slash (/).

ALPHA=number
sets the confidence coefficient for parameter confidence intervals to 1–number . The value of number
must be between 0 and 1. The default value of number is 0.05.

DIST=keyword
D=keyword
ERROR=keyword
ERR=keyword
specifies the built-in probability distribution to use in the model. If you specify the DIST= option
and you omit the LINK= option, a default link function is chosen as displayed in Table 45.8. If you
specify neither the DIST= option nor the LINK= option, then the GEE procedure defaults to the normal
distribution with the identity link function.

Table 45.8 Distributions and Default Link Functions

DIST= Distribution Default Link Function

LINK=keyword
specifies the link function in the model. You can specify the keywords shown in Table 45.9.

Table 45.9 Built-In Link Functions of the GEE Procedure

LINK=                   Link Function                      $g(\mu) = \eta =$
CLOGLOG | CLL           Complementary log-log              $\log(-\log(1 - \mu))$
CUMCLL | CCLL           Cumulative complementary log-log   $\log(-\log(1 - \mu))$
CUMLOGIT | CLOGIT       Cumulative logit                   $\log(\mu / (1 - \mu))$
CUMPROBIT | CPROBIT     Cumulative probit                  $\Phi^{-1}(\mu)$
GLOGIT                  Generalized logit
IDENTITY | ID           Identity                           $\mu$
LOG                     Log                                $\log(\mu)$
LOGIT                   Logit                              $\log(\mu / (1 - \mu))$
PROBIT                  Probit                             $\Phi^{-1}(\mu)$
INVERSE | RECIPROCAL    Reciprocal                         $1/\mu$
POWERMINUS2             Power with exponent $-2$           $1/\mu^2$

For the probit and cumulative probit links, $\Phi^{-1}(\cdot)$ denotes the quantile function of the standard normal
distribution. If you do not specify the LINK= option, then by default the canonical link function is used
if you specify the DIST= option. Otherwise, if you omit the DIST= option, the identity link function is
used.
The cumulative link functions are appropriate only for the multinomial distribution with ordinal
responses, with cumulative probabilities indicated by $\mu$. The GLOGIT link function is appropriate
only for the multinomial distribution with nominal responses.
NOINT
requests that no intercept term be included in the model. An intercept is included unless this option is
specified.
NOSCALE
holds the scale parameter fixed. Otherwise, for the normal, inverse Gaussian, and gamma distributions,
the scale parameter is estimated by maximum likelihood. If you omit the SCALE= option, the scale
parameter is fixed at the value 1.
OFFSET=variable
specifies a variable in the input data set to be used as an offset variable. This variable cannot be a
CLASS variable, the response variable, or any of the explanatory variables.
SCALE=number
SCALE=PEARSON | P
PSCALE
SCALE=DEVIANCE | D
DSCALE
specifies the value used for the scale parameter when the NOSCALE option is used. For the binomial
and Poisson distributions, which have no free scale parameter, this can be used to specify an
overdispersed model. If the NOSCALE option is not specified, then number is used as an initial estimate of
the scale parameter.
Specifying SCALE=PEARSON or SCALE=P is the same as specifying the PSCALE option. This
fixes the scale parameter at the value 1 in the estimation procedure. After the parameter estimates
are determined, the exponential family dispersion parameter is assumed to be given by Pearson’s
chi-square statistic divided by the degrees of freedom, and all statistics such as standard errors are
adjusted appropriately.
Specifying SCALE=DEVIANCE or SCALE=D is the same as specifying the DSCALE option. This
fixes the scale parameter at a value of 1 in the estimation procedure.

TYPE3
requests that statistics for Type 3 contrasts be computed for each effect specified in the MODEL
statement. The default analysis is to compute score statistics for the contrasts. Type 3 analyses using
the score statistics are not supported for nominal response data or weighted GEE methods. Wald
statistics are computed if the WALD option is also specified.

WALD
requests Wald statistics for Type 3 contrasts. You must also specify the TYPE3 option in order to
compute Type 3 Wald statistics.
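
As a hedged illustration of the link functions listed in Table 45.9 (this is Python, not SAS code, and it is not how PROC GEE is implemented), the following sketch evaluates the linear predictor $\eta = g(\mu)$ for several LINK= keywords. The dictionary `LINKS` and the helper `linear_predictor` are hypothetical names introduced only for this example.

```python
import math
from statistics import NormalDist

# Link functions g(mu) for a scalar mean mu, keyed by LINK= keyword.
LINKS = {
    "CLOGLOG": lambda mu: math.log(-math.log(1.0 - mu)),
    "IDENTITY": lambda mu: mu,
    "LOG": lambda mu: math.log(mu),
    "LOGIT": lambda mu: math.log(mu / (1.0 - mu)),
    "PROBIT": lambda mu: NormalDist().inv_cdf(mu),  # standard normal quantile
    "INVERSE": lambda mu: 1.0 / mu,
    "POWERMINUS2": lambda mu: 1.0 / mu ** 2,
}


def linear_predictor(link, mu):
    """Return eta = g(mu) for the named link (hypothetical helper)."""
    return LINKS[link.upper()](mu)


# Compare three links for binary data at the same mean.
for name in ("LOGIT", "PROBIT", "CLOGLOG"):
    print(name, round(linear_predictor(name, 0.3), 4))
```

The three binary-data links agree in sign near $\mu = 0.5$ but differ in the tails, which is one reason the choice of LINK= can matter for rare or very common events.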

OUTPUT Statement
OUTPUT < OUT=SAS-data-set > < keyword=name . . . keyword=name > ;

The OUTPUT statement creates a new SAS data set that contains all the variables in the input data set and,
optionally, the estimated linear predictors (XBETA) and their standard error estimates, predicted values of
the mean, and confidence limits for predicted values.
If you use the multinomial distribution with one of the cumulative link functions for ordinal data, the data
set also contains variables named _ORDER_ and _LEVEL_ that indicate the levels of the ordinal response
variable and the values of the variable in the input data set corresponding to the sorted levels. These variables
indicate that the predicted value for a given observation is the probability that the response variable is less
than or equal to the value of the _LEVEL_ variable. Residuals and other diagnostic statistics are not available for the
multinomial distribution.
The estimated linear predictor, its standard error estimate, and the predicted values and their confidence
intervals are computed for all observations in which the explanatory variables are all nonmissing, even if
the response is missing. By adding observations with missing response values to the input data set, you can
compute these statistics for new observations or for settings of the explanatory variables not present in the
data without affecting the model fit.
The following list explains specifications in the OUTPUT statement.

OUT=SAS-data-set
specifies the output data set. If you omit the OUT= option, the output data set is created and given a
default name that uses the DATAn convention.

keyword=name
specifies the statistics to be included in the output data set and names the new variables that contain the
statistics. Specify a keyword for each desired statistic (see the following list of keywords), an equal
sign, and the name of the new variable or variables to contain the statistic.
Although you can use the OUTPUT statement without any keyword=name specifications, the output
data set then contains only the original variables and, possibly, the variables _ORDER_ and _LEVEL_ (if you
use the multinomial model with ordinal data).
The keywords allowed and the statistics they represent are as follows:

LOWER | L represents the lower confidence limit for the predicted value of the mean, or the
lower confidence limit for the probability that the response is less than or equal
to the value of _LEVEL_. The confidence coefficient is determined by the
ALPHA=number option in the MODEL statement as $(1 - \mathit{number}) \times 100\%$. The
default confidence coefficient is 95%.
PREDICTED | PRED | PROB | P represents the predicted value of the mean of the response or the
predicted probability that the response variable is less than or equal to the value
of _LEVEL_ if the multinomial model for ordinal data is used (in other words,
$\Pr(Y \le \texttt{\_LEVEL\_})$, where Y is the response variable).
RESCHI represents the Pearson (chi) residual for identifying observations that are poorly
accounted for by the model. This option is not available for the multinomial
distribution.
RESRAW represents the raw residual for identifying poorly fitted observations. This option is
not available for the multinomial distribution.
STDXBETA represents the standard error estimate of XBETA (see the XBETA keyword).
UPPER | U represents the upper confidence limit for the predicted value of the mean, or the
upper confidence limit for the probability that the response is less than or equal
to the value of _LEVEL_. The confidence coefficient is determined by the
ALPHA=number option in the MODEL statement as $(1 - \mathit{number}) \times 100\%$. The
default confidence coefficient is 95%.
XBETA represents the estimate of the linear predictor $x_i'\beta$ for observation i, or $\alpha_j + x_i'\beta$,
where j is the corresponding ordered value of the response variable for the
multinomial model with ordinal data. If there is an offset, it is included in $x_i'\beta$.

REPEATED Statement
REPEATED SUBJECT=subject-effect < / options > ;

The REPEATED statement specifies the correlation structure of the responses for GEE model fitting. In
addition, the REPEATED statement controls the iterative fitting algorithm and specifies optional output.
Table 45.10 summarizes the options available in the REPEATED statement.

Table 45.10 REPEATED Statement Options

Option Description
ALPHAINIT= Specifies initial values for log odds ratio regression parameters
CONVERGE= Specifies the convergence criterion for GEE parameter estimation
CORRB Displays the estimated correlation matrix
CORRW Displays the estimated working correlation matrix
COVB Displays the estimated covariance matrix
ECORRB Displays the estimated empirical correlation matrix
ECOVB Displays the estimated empirical covariance matrix
INITIAL= Specifies initial values of the regression parameters
INTERCEPT= Specifies an initial value of the intercept
LOGOR= Specifies the use of alternating logistic regression and a model for the log
odds ratio
MAXITER= Specifies the maximum number of iterations
MCORRB Displays the estimated model-based correlation matrix
MCOVB Displays the estimated model-based covariance matrix
MODELSE Displays a parameter estimates table with the model-based standard errors
SUBCLUSTER= Specifies a variable that defines subclusters
SUBJECT= Identifies subjects (clusters) in the input data set
TYPE= Specifies the working correlation matrix structure
WITHIN= Specifies the order of measurements within subjects
ZDATA= Specifies the full z matrix
ZROW= Specifies the rows of the z matrix

You must specify the SUBJECT= option:

SUBJECT=subject-effect
identifies subjects in the input data set. The subject-effect can be a single variable, an interaction effect,
a nested effect, or a combination. Each distinct value (level) of the effect identifies a different subject
(cluster). Responses from different subjects are assumed to be statistically independent, and responses
within subjects are assumed to be correlated. You must specify a subject-effect , and you must list
variables that are used in defining the subject-effect in the CLASS statement.

You can also specify the following options after a slash (/) to control how the model is fit and what output is
produced:

ALPHAINIT=numbers
specifies initial values for log odds ratio regression parameters if you specify the option LOGOR= for
data that have either binary or ordinal multinomial responses. The default value of numbers is 0.01.

CONVERGE=number
specifies the convergence criterion for GEE parameter estimation. If the maximum absolute difference
between regression parameter estimates is less than number on two successive iterations, convergence
is declared. If the absolute value of a regression parameter estimate is greater than 0.08, then the
absolute difference normalized by the regression parameter value is used instead of the absolute
difference. The default value of number is 0.0001.

CORRB
displays the estimated regression parameter correlation matrix. Both model-based and empirical
correlations are displayed.

CORRW
displays the estimated working correlation matrix. If you specify TYPE=EXCH for the exchangeable
working correlation structure, then the CORRW option is not needed to view the estimated correlation,
because a table that contains the single estimated correlation is printed by default.

COVB
displays the estimated regression parameter covariance matrix. Both model-based and empirical
covariances are displayed.

ECORRB
displays the estimated regression parameter empirical correlation matrix.

ECOVB
displays the estimated regression parameter empirical covariance matrix.

INITIAL=numbers
specifies initial values of the regression parameters, other than the intercept parameter, for
GEE estimation. If you do not specify this option, then the estimated regression parameters (assuming
independence for all responses) are used for the initial values.

INTERCEPT=number
specifies an initial value of the intercept regression parameter in the GEE model.

LOGOR=log-odds-ratio-structure-keyword
specifies the use of the alternating logistic regression (ALR) method and the regression model structure
for the log odds ratio. For data that have either a binary or ordinal multinomial response distribution,
the ALR method uses the log odds ratio to model the association of the responses from subjects. For
more information about the ALR method and examples of specifying log odds ratio models, see the
section “Alternating Logistic Regression” on page 3128. You can specify the values that are shown in
Table 45.11.

Table 45.11 Log Odds Ratio Regression Structures

Keyword Log Odds Ratio Regression Structure

EXCH Exchangeable
FULLCLUST Fully parameterized clusters
LOGORVAR(variable) Indicator variable for specifying block effects
NESTK k-nested
NEST1 1-nested
ZFULL Fully specified z matrix specified in ZDATA= data set
ZREP Single cluster specification for replicated z matrix specified
in ZDATA= data set
ZREP(matrix ) Single cluster specification for replicated z matrix

For ordinal multinomial data, only the exchangeable regression structure that is specified by
LOGOR=EXCH is supported. You should specify the option LOGOR= or TYPE=, but not both.

MAXITER=number
MAXIT=number
specifies the maximum number of iterations allowed in the iterative GEE estimation process. By
default, MAXITER=50.

MCORRB
displays the estimated regression parameter model-based correlation matrix.

MCOVB
displays the estimated regression parameter model-based covariance matrix.

MODELSE
displays a parameter estimates table that uses model-based standard errors for inference. By default, a
“Parameter Estimates” table that is based on empirical standard errors is displayed.

SUBCLUSTER=variable
SUBCLUST=variable
specifies a variable that defines subclusters for the 1-nested or k-nested log odds ratio association
modeling structures for data that have a binary response distribution. A 1-nested or k-nested modeling
structure is specified in the option LOGOR=, and variable must be listed in the CLASS statement. For
definitions of the 1-nested and k-nested modeling structures, see the section “Specifying Log Odds
Ratio Models” on page 3130.

TYPE=correlation-structure-keyword
CORR=correlation-structure-keyword
specifies the structure of the working correlation matrix that is used to model the correlation of the
responses from subjects for ordinary GEEs. You can specify the values that are shown in Table 45.12
(for definitions of the correlation matrix types, see Table 45.13 in the section “Details: GEE Procedure”
on page 3125).

Table 45.12 Correlation Structure Types

Keyword Correlation Structure Type

AR | AR(1) Autoregressive(1)
EXCH | CS Exchangeable
IND Independent
MDEP(number ) m-dependent, where m = number
UNSTR | UN Unstructured
USER(matrix ) | FIXED(matrix ) Fixed, user-specified correlation matrix

For example, the following option specifies a fixed $4 \times 4$ correlation matrix:

   type=user( 1.0 0.9 0.8 0.6
              0.9 1.0 0.9 0.8
              0.8 0.9 1.0 0.9
              0.6 0.8 0.9 1.0 )

By default, TYPE=IND. When you specify the alternating logistic regression method by using the option
LOGOR=, you should not specify TYPE=.

WITHINSUBJECT=within-subject-effect
WITHIN=within-subject-effect
defines an effect that specifies the order of measurements within subjects. Each distinct level of the
within-subject-effect defines a different response from the same subject. If the data are in proper order
within each subject, you do not need to specify this option.
If some measurements do not appear in the data for some subjects, this option properly orders the
existing measurements and treats the omitted measurements as missing values.
If you do not specify the WITHIN= option for the standard GEE method, missing values are assumed
to be the last values and are not used; the remaining observations are then ordered in the sequence
in which they are provided in the input data set. If you do not specify the WITHIN= option for the
weighted GEE method, the observations are assumed to be ordered in the sequence in which they are
provided in the input data set.
Variables that are used in defining the within-subject-effect must be listed in the CLASS statement.

ZDATA=SAS-data-set
specifies a SAS data set that contains either the full z matrix for log odds ratio association modeling for
data with binary responses or the z matrix for a single complete cluster to be replicated for all clusters.

ZROW=variable-list
specifies the variables in the ZDATA= data set that correspond to rows of the z matrix for log odds
ratio association modeling for data with binary responses.
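
The convergence rule described for the CONVERGE= option can be sketched outside SAS as follows. This Python function is a minimal illustration, not PROC GEE's code: the name `gee_converged` is hypothetical, and it assumes that the current (new) estimate decides whether a difference is normalized, which the text above does not state explicitly.

```python
def gee_converged(old, new, tol=1e-4, switch=0.08):
    """Apply the CONVERGE= rule to two successive parameter vectors.

    tol mirrors the documented default of 0.0001; switch is the 0.08
    threshold above which a relative difference is used.
    """
    worst = 0.0
    for b_old, b_new in zip(old, new):
        diff = abs(b_new - b_old)
        # Assumption: the current estimate decides whether to normalize.
        if abs(b_new) > switch:
            diff /= abs(b_new)
        worst = max(worst, diff)
    return worst < tol


# A large coefficient is judged on relative change, a small one on absolute change.
print(gee_converged([10.0], [10.0005]))
print(gee_converged([0.05], [0.0505]))
```

Switching to a relative difference for larger coefficients keeps the criterion from being dominated by the scale of a single parameter.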

SLICE Statement
SLICE model-effect < / options > ;

The SLICE statement provides a general mechanism for performing a partitioned analysis of the LS-means
for an interaction. This analysis is also known as an analysis of simple effects.
The SLICE statement uses the same options as the LSMEANS statement, which are summarized in
Table 19.21. For details about the syntax of the SLICE statement, see the section “SLICE Statement” on
page 512 in Chapter 19, “Shared Concepts and Topics.”

STORE Statement
STORE < OUT= >item-store-name < / LABEL='label' > ;

The STORE statement requests that the procedure save the context and results of the statistical analysis. The
resulting item store has a binary file format that cannot be modified. The contents of the item store can be
processed with the PLM procedure. For details about the syntax of the STORE statement, see the section
“STORE Statement” on page 515 in Chapter 19, “Shared Concepts and Topics.”

WEIGHT Statement
WEIGHT variable ;

The WEIGHT statement identifies a variable in the input data set to be used as the exponential family
dispersion parameter weight for each observation. The exponential family dispersion parameter is divided by
the WEIGHT variable value for each observation.
The WEIGHT variable value does not have to be an integer; if the value is less than or equal to 0 or if it is
missing, the corresponding observation is not used.

Details: GEE Procedure

Generalized Estimating Equations

The marginal model is commonly used in analyzing longitudinal data when the population-averaged effect is
of interest. To estimate the regression parameters in the marginal model, Liang and Zeger (1986) proposed
the generalized estimating equations method, which is widely used.
Suppose $y_{ij}$, $j = 1, \ldots, n_i$, $i = 1, \ldots, K$, represents the jth response of the ith subject, which has a vector
of covariates $x_{ij}$. There are $n_i$ measurements on subject i, and the maximum number of measurements per
subject is T.
Let the responses of the ith subject be $Y_i = [y_{i1}, \ldots, y_{in_i}]'$ with corresponding means
$\mu_i = [\mu_{i1}, \ldots, \mu_{in_i}]'$. For generalized linear models, the marginal mean $\mu_{ij}$ of the response $y_{ij}$ is related
to a linear predictor through a link function $g(\mu_{ij}) = x_{ij}'\beta$, and the variance of $y_{ij}$ depends on the mean
through a variance function $v(\mu_{ij})$.
An estimate of the parameter $\beta$ in the marginal model can be obtained by solving the generalized estimating
equations,

$$S(\beta) = \sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta} V_i^{-1} (Y_i - \mu_i(\beta)) = 0$$

where $V_i$ is the working covariance matrix of $Y_i$.
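
To make the estimating equation concrete, the following Python sketch solves $S(\beta) = 0$ in the simplest case. It is an illustration under stated assumptions, not the procedure's algorithm: an intercept-only model with a log link, variance function $v(\mu) = \mu$, an independent working correlation, and dispersion fixed at 1. The data and the names `score` and `solve` are hypothetical.

```python
import math

# Hypothetical count responses for three subjects (clusters).
clusters = [[2, 3, 1], [4, 2], [3, 3, 2]]


def score(beta, data):
    """S(beta) for mu = exp(beta) and V_i = diag(mu)."""
    mu = math.exp(beta)
    # d mu / d beta = mu and V^{-1} = 1 / mu, so each term reduces to (y - mu).
    return sum(mu * (1.0 / mu) * (y - mu) for ys in data for y in ys)


def solve(data, beta=0.0, tol=1e-10, max_iter=50):
    """Fisher scoring: beta <- beta + S(beta) / information."""
    n = sum(len(ys) for ys in data)
    for _ in range(max_iter):
        info = n * math.exp(beta)  # sum of D' V^{-1} D = n * mu in this model
        step = score(beta, data) / info
        beta += step
        if abs(step) < tol:
            break
    return beta


beta_hat = solve(clusters)
print(beta_hat, math.exp(beta_hat))
```

In this special case the solution is the log of the overall sample mean, which gives a quick check that the iteration solves the stated equation.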

Only the mean and the covariance of Yi are required in the GEE method; a full specification of the joint
distribution of the correlated responses is not needed. This is particularly convenient because the joint
distribution for noncontinuous responses involves high-order associations and is complicated to specify.
Moreover, the regression parameter estimates are consistent even when the working covariance is incorrectly
specified. Because of these properties, the GEE method is popular in situations where the marginal effect is
of interest and the responses are not continuous. However, the GEE approach can lead to biased estimates
when missing responses depend on previous responses. The weighted GEE method, which is described in
the section “Weighted Generalized Estimating Equations under the MAR Assumption” on page 3132, can
provide unbiased estimates.

Working Correlation Matrix

Suppose $R_i(\alpha)$ is an $n_i \times n_i$ “working” correlation matrix that is fully specified by the vector of parameters
$\alpha$. The covariance matrix of $Y_i$ is modeled as

$$V_i = \phi A_i^{1/2} W_i^{-1/2} R(\alpha) W_i^{-1/2} A_i^{1/2}$$

where $\phi$ is the dispersion parameter, $A_i$ is an $n_i \times n_i$ diagonal matrix whose jth diagonal element is
$v(\mu_{ij})$, and $W_i$ is an $n_i \times n_i$ diagonal matrix whose jth diagonal element is $w_{ij}$, where $w_{ij}$ is a weight
variable that is specified in the WEIGHT statement.
If there is no WEIGHT statement, $w_{ij} = 1$ for all i and j. If $R_i(\alpha)$ is the true correlation matrix of $Y_i$, then
$V_i$ is the true covariance matrix of $Y_i$.
In practice, the working correlation matrix is usually unknown and must be estimated. It is estimated in the
iterative fitting process by using the current value of the parameter vector $\beta$ to compute appropriate functions
of the Pearson residual:

$$e_{ij} = \frac{y_{ij} - \mu_{ij}}{\sqrt{v(\mu_{ij}) / w_{ij}}}$$

If you specify the working correlation matrix as $R_0 = I$, which is the identity matrix, the GEE reduces to the
independence estimating equation.
Table 45.13 shows the working correlation structures that are supported by the GEE procedure and the
estimators that are used to estimate the working correlations.

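Table 45.13 Working Correlation Structures and Estimators

Fixed
   Structure: $\mathrm{Corr}(Y_{ij}, Y_{ik}) = r_{jk}$, where $r_{jk}$ is the jkth element of a constant,
              user-specified correlation matrix $R_0$
   Estimator: The working correlation is not estimated in this case.

Independent
   Structure: $\mathrm{Corr}(Y_{ij}, Y_{ik}) = \begin{cases} 1 & j = k \\ 0 & j \ne k \end{cases}$
   Estimator: The working correlation is not estimated in this case.

m-dependent
   Structure: $\mathrm{Corr}(Y_{ij}, Y_{i,j+t}) = \begin{cases} 1 & t = 0 \\ \alpha_t & t = 1, 2, \ldots, m \\ 0 & t > m \end{cases}$
   Estimator: $\hat{\alpha}_t = \frac{1}{(K_t - p)\phi} \sum_{i=1}^{K} \sum_{j \le n_i - t} e_{ij} e_{i,j+t}$,
              where $K_t = \sum_{i=1}^{K} (n_i - t)$

Exchangeable
   Structure: $\mathrm{Corr}(Y_{ij}, Y_{ik}) = \begin{cases} 1 & j = k \\ \alpha & j \ne k \end{cases}$
   Estimator: $\hat{\alpha} = \frac{1}{(N^{*} - p)\phi} \sum_{i=1}^{K} \sum_{j < k} e_{ij} e_{ik}$,
              where $N^{*} = 0.5 \sum_{i=1}^{K} n_i (n_i - 1)$

Unstructured
   Structure: $\mathrm{Corr}(Y_{ij}, Y_{ik}) = \begin{cases} 1 & j = k \\ \alpha_{jk} & j \ne k \end{cases}$
   Estimator: $\hat{\alpha}_{jk} = \frac{1}{(K - p)\phi} \sum_{i=1}^{K} e_{ij} e_{ik}$

Autoregressive AR(1)
   Structure: $\mathrm{Corr}(Y_{ij}, Y_{i,j+t}) = \alpha^{t}$ for $t = 0, 1, 2, \ldots, n_i - j$
   Estimator: $\hat{\alpha} = \frac{1}{(K_1 - p)\phi} \sum_{i=1}^{K} \sum_{j \le n_i - 1} e_{ij} e_{i,j+1}$,
              where $K_1 = \sum_{i=1}^{K} (n_i - 1)$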
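
The exchangeable and AR(1) estimators in Table 45.13 can be sketched numerically as follows. This is a hedged Python illustration rather than PROC GEE's implementation: the Pearson residuals are made-up toy values, the dispersion $\phi$ is passed in explicitly (it is assumed to appear in the denominator of each estimator), and the function names are hypothetical.

```python
# Hypothetical Pearson residuals e_ij for three subjects of unequal size.
resid = [[1.0, 0.8, 0.6], [-1.0, -0.5, -0.7], [0.5, 0.9]]


def exchangeable_alpha(e_by_subject, p, phi):
    """Sum e_ij * e_ik over within-subject pairs j < k, scaled by (N* - p) * phi."""
    total, n_star = 0.0, 0.0
    for e in e_by_subject:
        n = len(e)
        n_star += 0.5 * n * (n - 1)  # number of pairs in this subject
        total += sum(e[j] * e[k] for j in range(n) for k in range(j + 1, n))
    return total / ((n_star - p) * phi)


def ar1_alpha(e_by_subject, p, phi):
    """Sum e_ij * e_i,j+1 over adjacent pairs, scaled by (K_1 - p) * phi."""
    total, k1 = 0.0, 0
    for e in e_by_subject:
        k1 += len(e) - 1  # number of adjacent pairs in this subject
        total += sum(e[j] * e[j + 1] for j in range(len(e) - 1))
    return total / ((k1 - p) * phi)


print(exchangeable_alpha(resid, p=1, phi=1.0))
print(ar1_alpha(resid, p=1, phi=1.0))
```

Both estimators are moment-based averages of residual cross-products; they differ only in which pairs of measurements contribute.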

Dispersion Parameter
The dispersion parameter $\phi$ is estimated by

$$\hat{\phi} = \frac{1}{N - p} \sum_{i=1}^{K} \sum_{j=1}^{n_i} e_{ij}^2$$

where $N = \sum_{i=1}^{K} n_i$ is the total number of measurements and p is the number of regression parameters.
The square root of $\hat{\phi}$ is reported by PROC GEE as the scale parameter in the “Parameter Estimates for
Response Model with Model-Based Standard Error” output table. If a fixed scale parameter is specified
by using the NOSCALE option in the MODEL statement, then the fixed value is used in estimating the
model-based covariance matrix and standard errors.
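
As a small numeric check of the dispersion estimate, the following Python sketch computes the moment estimate from Pearson residuals and takes its square root to obtain the scale. The residual values and the function name `dispersion` are hypothetical, and the sketch is only an illustration of the formula above.

```python
import math

# Hypothetical Pearson residuals e_ij for three subjects.
resid = [[1.0, 0.8, 0.6], [-1.0, -0.5, -0.7], [0.5, 0.9]]


def dispersion(e_by_subject, p):
    """Sum of squared Pearson residuals divided by (N - p)."""
    n_total = sum(len(e) for e in e_by_subject)
    return sum(x * x for e in e_by_subject for x in e) / (n_total - p)


phi_hat = dispersion(resid, p=1)
scale = math.sqrt(phi_hat)  # the quantity reported as the scale parameter
print(phi_hat, scale)
```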

Quasi-likelihood Information Criterion

The quasi-likelihood information criterion (QIC) was developed by Pan (2001) as a modification of Akaike’s
information criterion (AIC) to apply to models fit by the GEE approach.
Define the quasi-likelihood under the independent working correlation assumption, evaluated with the
parameter estimates under the working correlation of interest as

$$Q(\hat{\beta}(R), \phi) = \sum_{i=1}^{K} \sum_{j=1}^{n_i} Q(\hat{\beta}(R), \phi; (Y_{ij}, X_{ij}))$$

where the quasi-likelihood contribution of the jth observation in the ith cluster is defined in the section
“Quasi-likelihood Functions” on page 3128 and $\hat{\beta}(R)$ are the parameter estimates that are obtained by using
the GEE approach with the working correlation of interest R.
QIC is defined as

$$\mathrm{QIC}(R) = -2 Q(\hat{\beta}(R), \phi) + 2\,\mathrm{trace}(\hat{\Omega}_I \hat{V}_R)$$

where $\hat{V}_R$ is the robust covariance estimate and $\hat{\Omega}_I$ is the inverse of the model-based covariance estimate
under the independent working correlation assumption, evaluated at $\hat{\beta}(R)$, which are the parameter estimates
that are obtained by using the GEE approach with the working correlation of interest R.
PROC GEE also computes an approximation to $\mathrm{QIC}(R)$, which is defined by Pan (2001) as

$$\mathrm{QIC}_u(R) = -2 Q(\hat{\beta}(R), \phi) + 2p$$

where p is the number of regression parameters.
Pan (2001) notes that QIC is appropriate for selecting regression models and working correlations, whereas
QICu is appropriate only for selecting regression models.
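
The arithmetic behind the two criteria can be sketched as follows, assuming the quasi-likelihood value and the two matrices have already been obtained from a fit. The numbers are invented for illustration, and `trace_of_product`, `qic`, and `qic_u` are hypothetical helper names, not PROC GEE functions.

```python
def trace_of_product(a, b):
    """trace(A B) for square matrices stored as nested lists."""
    n = len(a)
    return sum(a[i][k] * b[k][i] for i in range(n) for k in range(n))


def qic(quasi_lik, omega_i, v_r):
    """QIC(R) = -2 Q + 2 trace(Omega_I V_R)."""
    return -2.0 * quasi_lik + 2.0 * trace_of_product(omega_i, v_r)


def qic_u(quasi_lik, p):
    """QIC_u(R) = -2 Q + 2 p."""
    return -2.0 * quasi_lik + 2.0 * p


# Hypothetical inputs for a model with p = 2 regression parameters.
Q = -120.5
omega_i = [[50.0, 5.0], [5.0, 20.0]]     # inverse model-based covariance
v_r = [[0.03, -0.005], [-0.005, 0.06]]   # robust covariance

print(qic(Q, omega_i, v_r), qic_u(Q, p=2))
```

When the robust covariance equals the inverse of $\hat{\Omega}_I$, the trace term equals p and the two criteria coincide, which is the sense in which $\mathrm{QIC}_u$ approximates QIC.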

Quasi-likelihood Functions
See McCullagh and Nelder (1989) and Hardin and Hilbe (2003) for discussions of quasi-likelihood functions.
The contribution of observation j in cluster i to the quasi-likelihood function that is evaluated at the regression
parameters $\beta$ is expressed by $Q(\beta, \phi; (Y_{ij}, X_{ij})) = Q_{ij} / \phi$, where $Q_{ij}$ is defined in the following list. These
definitions are used in the computation of the quasi-likelihood information criteria (QIC) for goodness of
fit of models that are fit by the GEE approach. The $w_{ij}$ are prior weights, if any, that are specified in the
WEIGHT or FREQ statement. Note that the definition of the quasi-likelihood for the negative binomial differs
from that given in McCullagh and Nelder (1989). The definition used here allows the negative binomial
quasi-likelihood to approach the Poisson as $k \to 0$.

Normal:

$$Q_{ij} = -\frac{1}{2} w_{ij} (y_{ij} - \mu_{ij})^2$$

Inverse Gaussian:

$$Q_{ij} = \frac{w_{ij} (\mu_{ij} - .5 y_{ij})}{\mu_{ij}^2}$$

Gamma:

$$Q_{ij} = -w_{ij} \left[ \frac{y_{ij}}{\mu_{ij}} + \log(\mu_{ij}) \right]$$

Negative binomial:

$$Q_{ij} = w_{ij} \left[ \log\Gamma\left(y_{ij} + \frac{1}{k}\right) - \log\Gamma\left(\frac{1}{k}\right) + y_{ij} \log\left(\frac{k\mu_{ij}}{1 + k\mu_{ij}}\right) + \frac{1}{k} \log\left(\frac{1}{1 + k\mu_{ij}}\right) \right]$$

Poisson:

$$Q_{ij} = w_{ij} (y_{ij} \log(\mu_{ij}) - \mu_{ij})$$

Binomial:

$$Q_{ij} = w_{ij} [r_{ij} \log(p_{ij}) + (n_{ij} - r_{ij}) \log(1 - p_{ij})]$$

Multinomial (s categories):

$$Q_{ij} = w_{ij} \sum_{k=1}^{s} y_{ijk} \log(\mu_{ijk})$$
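
A few of these contributions are easy to evaluate directly. The following Python sketch, with hypothetical function names and made-up inputs, computes $Q_{ij}$ for the Poisson, binomial, and negative binomial cases and checks numerically that the negative binomial form approaches the Poisson form as $k \to 0$. It returns $Q_{ij}$ only; division by the dispersion is left to the caller.

```python
import math


def q_poisson(y, mu, w=1.0):
    """Poisson contribution: w * (y log(mu) - mu)."""
    return w * (y * math.log(mu) - mu)


def q_binomial(r, n, p, w=1.0):
    """Binomial contribution for r events in n trials with probability p."""
    return w * (r * math.log(p) + (n - r) * math.log(1.0 - p))


def q_negbin(y, mu, k, w=1.0):
    """Negative binomial contribution with dispersion k > 0."""
    return w * (
        math.lgamma(y + 1.0 / k)
        - math.lgamma(1.0 / k)
        + y * math.log(k * mu / (1.0 + k * mu))
        + (1.0 / k) * math.log(1.0 / (1.0 + k * mu))
    )


# The gap to the Poisson contribution shrinks as k decreases.
for k in (1.0, 1e-2, 1e-4):
    print(k, q_negbin(3, 2.0, k) - q_poisson(3, 2.0))
```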

Alternating Logistic Regression

If the responses are binary (that is, they take only two values), then there is an alternative method to account
for the association among the measurements. The alternating logistic regressions (ALR) algorithm of Carey,
Zeger, and Diggle (1993) models the association between pairs of responses by using log odds ratios instead
of using correlations, as ordinary GEEs do. The ALR algorithm of Heagerty and Zeger (1996) extends the
method to GEEs that have ordinal multinomial responses (that is, they fall into one of C ordered categories).

ALR for Binary Data

For binary data, the correlation between the jth and kth response is, by definition,

$$\mathrm{Corr}(Y_{ij}, Y_{ik}) = \frac{\Pr(Y_{ij} = 1, Y_{ik} = 1) - \mu_{ij}\mu_{ik}}{\sqrt{\mu_{ij}(1 - \mu_{ij})\mu_{ik}(1 - \mu_{ik})}}$$

The joint probability in the numerator satisfies the following bounds, by elementary properties of probability,
because $\mu_{ij} = \Pr(Y_{ij} = 1)$:

$$\max(0, \mu_{ij} + \mu_{ik} - 1) \le \Pr(Y_{ij} = 1, Y_{ik} = 1) \le \min(\mu_{ij}, \mu_{ik})$$

Therefore, the correlation is constrained to be within limits that depend in a complicated way on the means
of the data.
The odds ratio, defined as

$$\mathrm{OR}(Y_{ij}, Y_{ik}) = \frac{\Pr(Y_{ij} = 1, Y_{ik} = 1)\,\Pr(Y_{ij} = 0, Y_{ik} = 0)}{\Pr(Y_{ij} = 1, Y_{ik} = 0)\,\Pr(Y_{ij} = 0, Y_{ik} = 1)}$$

is not constrained by the means and is preferred, in some cases, to correlations for binary data.
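
The constraint on the correlation, and the absence of such a constraint on the odds ratio, can be seen in a short numeric sketch. The Python below is illustrative only; the means 0.2 and 0.8 are made-up values and the function names are hypothetical.

```python
import math


def joint_bounds(mu_j, mu_k):
    """Lower and upper bounds on Pr(Y_ij = 1, Y_ik = 1) given the two means."""
    return max(0.0, mu_j + mu_k - 1.0), min(mu_j, mu_k)


def correlation(p11, mu_j, mu_k):
    """Correlation of two binary responses with joint success probability p11."""
    return (p11 - mu_j * mu_k) / math.sqrt(mu_j * (1 - mu_j) * mu_k * (1 - mu_k))


def odds_ratio(p11, mu_j, mu_k):
    """Odds ratio built from the 2 x 2 table implied by p11 and the means."""
    p10 = mu_j - p11
    p01 = mu_k - p11
    p00 = 1.0 - p11 - p10 - p01
    return (p11 * p00) / (p10 * p01)


lo, hi = joint_bounds(0.2, 0.8)
# With these means the attainable correlations do not reach +1.
print(correlation(lo, 0.2, 0.8), correlation(hi, 0.2, 0.8))
# Under independence (p11 = mu_j * mu_k) the odds ratio is 1, so its log is 0.
print(math.log(odds_ratio(0.16, 0.2, 0.8)))
```

For these means the correlation can only range over a limited interval, whereas any positive odds ratio corresponds to a valid joint distribution, which is the motivation for modeling association on the log odds ratio scale.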
The ALR algorithm seeks to model the logarithm of the odds ratio, $\gamma_{ijk} = \log(\mathrm{OR}(Y_{ij}, Y_{ik}))$, as

$$\gamma_{ijk} = z_{ijk}'\alpha$$

where $\alpha$ is a $q \times 1$ vector of regression parameters and $z_{ijk}$ is a fixed, specified vector of coefficients.
The parameter $\gamma_{ijk}$ can take any value in $(-\infty, \infty)$, with $\gamma_{ijk} = 0$ corresponding to no association.
The log odds ratio, when modeled in this way with a regression model, can take different values in subgroups
defined by $z_{ijk}$. For example, $z_{ijk}$ can define subgroups within clusters, or it can define “block effects”
between clusters.
You specify a GEE model for binary data that uses log odds ratios by specifying a model for the mean, as
in ordinary GEEs, and by specifying a model for the log odds ratios. You can use any of the link functions
appropriate for binary data in the model for the mean, such as logistic, probit, or complementary log-log.

ALR for Ordinal Multinomial Data

For ordinal multinomial data, let $O_{ij}$, $i = 1, \ldots, K$, $j = 1, \ldots, n_i$, denote the jth measurement on the ith
subject. To apply the ALR algorithm, the responses $O_{ij}$ are represented by a vector $Y_{ij} = (Y_{ij1}, \ldots, Y_{ij,C-1})'$ of
cumulative indicator variables $Y_{ijc} = I(O_{ij} \le c)$. You model the cumulative probabilities $\mu_{ijc} = E(Y_{ijc})$
by using a cumulative link function,

$$g(\mu_{ijc}) = \beta_c + x_{ij}'\beta, \quad \text{for } c = 1, \ldots, C - 1$$

where $\beta_1, \beta_2, \ldots, \beta_{C-1}$ are increasing intercept terms that depend only on the level c. Let the binary
vector that represents the responses of the ith subject be $Y_i = (Y_{i1}, \ldots, Y_{in_i})'$ with corresponding means
$\mu_i = (\mu_{i1}, \ldots, \mu_{in_i})'$.
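
The cumulative indicator coding $Y_{ijc} = I(O_{ij} \le c)$ can be sketched in a few lines of Python. This is only an illustration of the coding; the function name `cumulative_indicators` is hypothetical, and responses are assumed to be coded as the integers 1 through C.

```python
def cumulative_indicators(o, c_levels):
    """Return [I(o <= c) for c = 1, ..., C-1] for an ordinal response o in 1..C."""
    return [1 if o <= c else 0 for c in range(1, c_levels)]


# With C = 4 ordered categories, each response becomes a vector of length 3.
for response in (1, 2, 3, 4):
    print(response, cumulative_indicators(response, 4))
```

Each ordinal response is thereby replaced by $C - 1$ binary responses, which is what allows the binary-data ALR machinery to be reused.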
The log odds ratio between two indicator variables $Y_{ijc_1}$ and $Y_{ikc_2}$ is modeled as

$$\gamma_{i(jk)(c_1 c_2)} = \log(\mathrm{OR}(Y_{ijc_1}, Y_{ikc_2})) = z_{i(jk)(c_1 c_2)}'\alpha$$

for $q \times 1$ regression parameters $\alpha$ and fixed coefficients $z_{i(jk)(c_1 c_2)}$. As in Carey, Zeger, and Diggle
(1993), $\alpha$ then provides a vector of regression parameters in a logistic model for the conditional expectation
$\zeta_{i(jk)(c_1 c_2)} = E(Y_{ijc_1} \mid Y_{ikc_2})$. To estimate $\alpha$, the conditional expectation is considered for all pairs $Y_{ijc_1}$
and $Y_{ikc_2}$ with $j < k$. Let

$$\zeta_{i(jk)} = (\zeta_{i(jk)(11)}, \zeta_{i(jk)(12)}, \ldots, \zeta_{i(jk)(21)}, \ldots, \zeta_{i(jk)(C-1,C-1)})'$$

$$\zeta_i = (\zeta_{i(12)}, \zeta_{i(13)}, \ldots, \zeta_{i(23)}, \ldots, \zeta_{i(n_i - 1\, n_i)})'$$

$$Y_i^{*} = [\underbrace{Y_{i1} \otimes e_{C-1}, \ldots, Y_{i1} \otimes e_{C-1}}_{n_i - 1}, \underbrace{Y_{i2} \otimes e_{C-1}, \ldots, Y_{i2} \otimes e_{C-1}}_{n_i - 2}, \ldots, \underbrace{Y_{i,n_i - 1} \otimes e_{C-1}}_{1}]'$$

where $\otimes$ denotes the Kronecker product and $e_l$ denotes a vector of dimension l composed of ones. The
difference $Y_i^{*} - \zeta_i$ represents the residuals of the model for the conditional expectation.
For both binary and multinomial data, the ALR estimates for $\beta$ and $\alpha$ are the simultaneous solutions to the
estimating equations

$$S_1(\beta, \alpha) = \sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta} V_{i11}^{-1} (Y_i - \mu_i(\beta)) = 0$$

$$S_2(\beta, \alpha) = \sum_{i=1}^{K} \frac{\partial \zeta_i'}{\partial \alpha} V_{i33}^{-1} (Y_i^{*} - \zeta_i) = 0$$

where $V_{i11} = \mathrm{cov}(Y_i)$ and $V_{i33} = \mathrm{diag}[\zeta_i (1 - \zeta_i)]$. The fitting algorithm alternates between a GEE
step to update the model for the mean and a logistic regression step to update the log odds ratio model.
Upon convergence, the ALR algorithm provides estimates of the regression parameters for the mean, $\beta$; the
regression parameters for the log odds ratios, $\alpha$; their standard errors; and their covariances.

Specifying Log Odds Ratio Models

Specifying a regression model for the log odds ratio requires you to specify the rows of the matrix z. For
binary data, there is a row $z_{ijk}$ for each cluster i and within-cluster pair $(j, k)$. For ordinal multinomial data,
there is a row $z_{i(jk)(c_1 c_2)}$ for each cluster i, within-cluster pair $(j, k)$, and choice of levels $(c_1, c_2)$.
For ordinal multinomial data, the GEE procedure supports only the ALR method that uses a fully exchangeable
regression structure for the log odds ratio. In a fully exchangeable model, the log odds ratio is constant for all
clusters i, within-cluster pairs $(j, k)$, and levels $(c_1, c_2)$. You select a fully exchangeable model for the log
odds ratio by specifying LOGOR=EXCH.
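As a minimal sketch of how the fully exchangeable specification might look for an ordinal response, the following statements assume a hypothetical data set Ratings that contains a subject identifier ID, an ordinal response Rating, and covariates Treatment and Visit. None of these names come from this chapter; they are placeholders for illustration.

```sas
/* Hypothetical data set and variable names, for illustration only */
proc gee data=Ratings;
   class ID Treatment;
   /* cumulative logit model for the ordinal response */
   model Rating = Treatment Visit / dist=multinomial link=cumlogit;
   /* ALR with one common log odds ratio for all pairs and levels */
   repeated subject=ID / logor=exch;
run;
```

In this sketch the single parameter α is shared by every within-cluster pair and every pair of cumulative levels, which is the only log odds ratio structure that the text above describes as available for ordinal multinomial data.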
For binary data, the GEE procedure provides several methods of specifying $z_{ijk}$. You apply these methods by
specifying LOGOR=keyword and associated options in the REPEATED statement. The supported keywords
and the resulting log odds ratio models are described as follows:

EXCH specifies exchangeable log odds ratios. In this model, the log odds ratio is a
    constant for all clusters i and pairs $(j, k)$. The parameter $\alpha$ is the common log
    odds ratio.

    \[ z_{ijk} = 1 \quad \text{for all } i, j, k \]

FULLCLUST specifies fully parameterized clusters. Each cluster is parameterized in the same
    way, and there is a parameter for each unique pair within clusters. If a complete
    cluster is of size n, then there are $n(n-1)/2$ parameters in the vector $\alpha$. For example,
    if a full cluster is of size 4, then there are $(4 \times 3)/2 = 6$ parameters, and the z matrix
    is of the form

    \[ Z = \begin{bmatrix}
    1 & 0 & 0 & 0 & 0 & 0 \\
    0 & 1 & 0 & 0 & 0 & 0 \\
    0 & 0 & 1 & 0 & 0 & 0 \\
    0 & 0 & 0 & 1 & 0 & 0 \\
    0 & 0 & 0 & 0 & 1 & 0 \\
    0 & 0 & 0 & 0 & 0 & 1
    \end{bmatrix} \]

    The elements of $\alpha$ correspond to log odds ratios for cluster pairs in the following
    order:

    Pair     Parameter
    (1,2)    Alpha1
    (1,3)    Alpha2
    (1,4)    Alpha3
    (2,3)    Alpha4
    (2,4)    Alpha5
    (3,4)    Alpha6

LOGORVAR(variable) specifies log odds ratios by cluster. The argument variable is a variable name
    that defines the “block effects” between clusters. The log odds ratios are constant
    within clusters, but they take a different value for each different value of
    the variable. For example, if Center is a variable in the input data set that
    takes a different value for each of k treatment centers, then when you specify
    LOGOR=LOGORVAR(Center), you get a model that has a different log odds ratio
    for each of the k centers, constant within center.
NESTK specifies k-nested log odds ratios. You must also specify the SUBCLUST=variable
    option to define subclusters within clusters. Within each
    cluster, PROC GEE computes a log odds ratio parameter for pairs that have
    the same value of variable for both members of the pair and one log odds ratio
    parameter for each unique combination of different values of variable.
NEST1 specifies 1-nested log odds ratios. You must also specify the SUBCLUST=variable
    option to define subclusters within clusters. There are
    two log odds ratio parameters for this model. Pairs that have the same value of
    variable correspond to one parameter; pairs that have different values of variable
    correspond to the other parameter. For example, if patients are clustered by
    hospital and subclusters are the wards within those hospitals, then the outcomes
    of patients within the same ward have one log odds ratio parameter, and the
    outcomes of patients from different wards have the other parameter.
ZFULL specifies the full z matrix. You must also specify a SAS data set that contains
    the z matrix by using the ZDATA=data-set-name option. Each observation
    in the data set corresponds to one row of the z matrix. You must specify the
    ZDATA data set as if all clusters are complete; that is, as if all clusters are
    the same size and there are no missing observations. The ZDATA data set
    has $K[n_{\max}(n_{\max} - 1)/2]$ observations, where K is the number of clusters and
    $n_{\max}$ is the maximum cluster size. If the members of cluster i are ordered
    as $1, 2, \ldots, n$, then the rows of the z matrix must be specified for pairs in the
    order $(1,2), (1,3), \ldots, (1,n), (2,3), \ldots, (2,n), \ldots, (n-1,n)$. The variables
    that you specify in the REPEATED statement for the SUBJECT effect must
    also be present in the ZDATA= data set to identify clusters. You must specify
    variables in the data set that define the columns of the z matrix by using the
    ZROW=variable-list option. If there are q columns (q variables in variable-list),
    then there are q log odds ratio parameters. You can optionally specify variables
    that indicate the cluster pairs corresponding to each row of the z matrix by using
    the YPAIR=(variable1, variable2) option. If you specify this option, the data from
    the ZDATA data set are sorted within each cluster by variable1 and variable2.
    See Example 45.4 for an example of specifying a full z matrix.
ZREP specifies a replicated z matrix. You specify z matrix data exactly as you do
    for the ZFULL option case, except that you specify only one complete cluster.
    The z matrix for the one cluster is replicated for each cluster. The number of
    observations in the ZDATA data set is $n_{\max}(n_{\max} - 1)/2$, where $n_{\max}$ is the size of a
    complete cluster (a cluster with no missing observations).
ZREP(matrix) specifies direct input of the replicated z matrix. You specify the z matrix for one
    cluster by using the syntax LOGOR=ZREP( $(y_j\; y_k)\; z_{jk1}\; z_{jk2} \ldots z_{jkq}$, ... ),
    where $y_j$ and $y_k$ are numbers that represent a pair of observations from the ith
    cluster and the values $z_{jk1}, z_{jk2}, \ldots, z_{jkq}$ make up the corresponding row $z_{ijk}$
    of the z matrix. The number of specified rows is $n_{\max}(n_{\max} - 1)/2$, where $n_{\max}$ is the
    size of a complete cluster (a cluster with no missing observations). For example,

    logor = zrep((1 2) 1 0,
                 (1 3) 1 0,
                 (1 4) 1 0,
                 (2 3) 1 1,
                 (2 4) 1 1,
                 (3 4) 1 1)

    specifies the $(4 \times 3)/2 = 6$ rows of the z matrix for a cluster of size 4 with q = 2 log
    odds ratio parameters. The log odds ratio for the pairs (1 2), (1 3), (1 4) is $\alpha_1$, and
    the log odds ratio for the pairs (2 3), (2 4), (3 4) is $\alpha_1 + \alpha_2$.
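To make the keyword list above more concrete, here is a hedged sketch of the NEST1 hospital and ward scenario. The data set Patients and the variables Hospital, Ward, Improved, and Age are hypothetical names chosen for illustration; only the LOGOR= and SUBCLUST= options are taken from the descriptions above.

```sas
/* Hypothetical data: one row per patient; patients are clustered by    */
/* Hospital, Ward identifies subclusters, Improved is a 0/1 response.   */
proc gee data=Patients descend;
   class Hospital Ward;
   model Improved = Age / dist=bin link=logit;
   /* two log odds ratio parameters: same-ward pairs and               */
   /* different-ward pairs within a hospital                            */
   repeated subject=Hospital / logor=nest1 subclust=Ward;
run;
```

Replacing LOGOR=NEST1 SUBCLUST=Ward with LOGOR=EXCH in this sketch would instead fit one common log odds ratio for every within-hospital pair.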

Weighted Generalized Estimating Equations under the MAR Assumption

In longitudinal studies, response measurements are often missing because of skipped visits or dropouts.
Suppose $r_{ij}$ is the indicator that the response $y_{ij}$ is observed, where $r_{ij} = 1$ if $y_{ij}$ is observed and 0 otherwise.
Missing data patterns can be classified into two types: dropout and intermittent. A dropout occurs if an
individual skips a particular visit and then never comes back for subsequent visits. That is, if $r_{ij} = 0$, then
$r_{ik} = 0$ for all $k > j$. Otherwise, the missing data pattern is intermittent. Intermittent patterns can be quite
complicated; only dropout patterns are considered here.
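The distinction between the two patterns can be checked before any model is fitted. The following DATA step is a rough sketch under stated assumptions: a hypothetical long-format data set Long has one row per subject (ID) and scheduled visit (Visit), including rows for missed visits, and Y is the response. These names are illustrative and are not part of this chapter.

```sas
/* Hypothetical data set Long: one row per ID and scheduled Visit;      */
/* Y is missing when the response was not observed.                     */
proc sort data=Long;
   by ID Visit;
run;

data Pattern;
   set Long;
   by ID;
   retain SeenMissing Intermittent;
   if first.ID then do;
      SeenMissing  = 0;
      Intermittent = 0;
   end;
   if missing(Y) then SeenMissing = 1;
   /* an observed response after an earlier missing one is intermittent */
   else if SeenMissing then Intermittent = 1;
   if last.ID then output;     /* one summary row per subject */
   keep ID SeenMissing Intermittent;
run;
```

A subject with SeenMissing = 1 and Intermittent = 0 has a dropout pattern in the sense defined above; any subject with Intermittent = 1 falls outside the patterns that this section considers.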

The mechanism for missingness can be described by a statistical model for the probability that a response
is missing, and making the right assumption about the mechanism is crucial to methods that handle
missing data. Missingness mechanisms are classified into three types: missing completely at random
(MCAR), missing at random (MAR), and missing not at random (MNAR) (Rubin 1976).
Assumptions about longitudinal data that include missing responses caused by dropouts are classified as
follows:

The data are said to be MCAR if the probability of a missing response is independent of its past, current,
and future responses conditional on the covariates. That is, $P(r_{ij} = 0 \mid Y_i, X_i) = P(r_{ij} = 0 \mid X_i)$.

The data are said to be MAR if the probability of a missing response is independent of its current
and future responses conditional on the observed past responses and the covariates. That is,
$P(r_{ij} = 0 \mid r_{ij-1} = 1, X_i, Y_i) = P(r_{ij} = 0 \mid r_{ij-1} = 1, X_i, y_{i1}, \ldots, y_{ij-1})$. MAR is a weaker assumption than
MCAR.

The data are said to be MNAR if the probability of a missing response depends on the unobserved
responses. MNAR is the most general and the most problematic missing-data scenario.

The GEE procedure implements two different weighted methods (observation-specific and subject-specific)
of estimating the regression parameter $\beta$ when dropouts occur. Both methods provide consistent estimates
if the data are MAR. The weighted GEE methods are not supported for the multinomial distribution for
polytomous responses.

Observation-Specific Weighted GEE Method

Suppose $w_{ij}$ is the weight for $y_{ij}$, which is defined as the inverse probability of observing $y_{ij}$. In other
words, $w_{ij} = P(r_{ij} = 1 \mid X_i, Y_i)^{-1}$. Suppose $W_i$ is a $T \times T$ diagonal matrix whose jth diagonal is $r_{ij} w_{ij}$.
The responses for the ith subject are $Y_i = (y_{i1}, y_{i2}, \ldots, y_{iT})'$. Consider the following weighted generalized
estimating equations (Robins and Rotnitzky 1995; Preisser, Lohman, and Rathouz 2002):

\[ S_{ow}(\beta) = \sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta} V_i^{-1} W_i \left( Y_i - \mu_i(\beta) \right) = 0 \]

Unlike the standard generalized estimating equations, the weighted generalized estimating equations are
unbiased when the observations are appropriately weighted and lead to consistent estimates of $\beta$.
The weights $w_{ij}$ are often unknown in practice and are estimated by a logistic regression model under the
MAR assumption. Specifically, suppose that $\lambda_{ij} = P(r_{ij} = 1 \mid r_{ij-1} = 1, X_i, Y_i)$ denotes the probability of
observing the response $y_{ij}$ given that the previous response was observed.
Under the MAR assumption,

\[ \lambda_{ij} = P(r_{ij} = 1 \mid r_{ij-1} = 1, X_i, Y_i) = P(r_{ij} = 1 \mid r_{ij-1} = 1, X_i, y_{i1}, \ldots, y_{ij-1}) \]

Using the observed data, $\lambda_{ij}$ can be predicted from a logistic regression model,

\[ \mathrm{logit}\{\lambda_{ij}\} = z_{ij}\alpha \]

where the $z_{ij}$ are predictors that usually include the covariates $x_{ij}$, the past responses, and the indicators for
visit times. The dropout process implies that the estimated probability of observing $y_{ij}$ can be expressed as a
cumulative product of conditional probabilities:

\[ \hat{P}(r_{ij} = 1 \mid X_i, Y_i) = \lambda_{i1}(\hat\alpha) \times \lambda_{i2}(\hat\alpha) \times \cdots \times \lambda_{ij}(\hat\alpha) \]

With the estimated weights $\hat{w}_{ij} = \hat{P}(r_{ij} = 1 \mid X_i, Y_i)^{-1}$ plugged in, the regression parameter $\beta$ is estimated
by solving the equation $S_{ow}(\beta) = 0$.
The fitting algorithm is described in the section “Fitting Algorithm for Weighted GEE.”
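As a purely illustrative calculation of the cumulative product, suppose T = 3 and the fitted conditional probabilities for one subject are 1, 0.9, and 0.8 at visits 1, 2, and 3. These values are invented for the example and are not estimates from any data set in this chapter.

```latex
\begin{aligned}
\hat{P}(r_{i1}=1 \mid X_i, Y_i) &= 1,
  & \hat{w}_{i1} &= 1 \\
\hat{P}(r_{i2}=1 \mid X_i, Y_i) &= 1 \times 0.9 = 0.9,
  & \hat{w}_{i2} &= 1/0.9 \approx 1.111 \\
\hat{P}(r_{i3}=1 \mid X_i, Y_i) &= 1 \times 0.9 \times 0.8 = 0.72,
  & \hat{w}_{i3} &= 1/0.72 \approx 1.389
\end{aligned}
```

Later visits receive larger weights because fewer subjects like this one are expected to remain in the study, so each observed response stands in for more subjects who dropped out.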

Subject-Specific Weighted GEE Method

Unlike the observation-specific weighted method, which assigns an observation-specific weight to each
observation, the subject-specific weighted method assigns a single weight to each subject. In other words, all
the observations from a subject receive the same weight. Specifically, the subject-specific weighted method
obtains the regression parameter estimates by solving the equations

\[ S_{sw}(\beta) = \sum_{i=1}^{K} D_i' V_i^{-1} w_i \left( Y_i - \mu_i(\beta) \right) = 0 \]

where the responses for the ith subject are $Y_i = (y_{i1}, y_{i2}, \ldots, y_{in_i})'$ and the weight $w_i$ for subject i is the
inverse of the probability that subject i drops out at the observed dropout time (Fitzmaurice, Molenberghs, and
Lipsitz 1995; Preisser, Lohman, and Rathouz 2002). Note that the weight $w_i$ is a scalar, in contrast to the weight
matrix $W_i$ that the observation-specific weighted GEE method uses.
The subject-specific weighted estimating equations are also unbiased when the subjects are appropriately
weighted and lead to consistent estimates of the regression parameters $\beta$.

The weight $w_i$ is usually unknown in practice and needs to be estimated. Suppose subject i drops out at time
$m_i = \sum_{j=1}^{T} r_{ij} + 1$. Assume that the first visit $y_{i1}$ is always observed with $r_{i1} = 1$. Thus, the dropout
times $m_i$ range from 2 to T+1. Note that a dropout time of T+1 indicates that subject i completes all the T
visits and dropout does not occur.
The weight $w_i$ is defined as follows: if subject i drops out before completing the last visit (that is, $m_i \le T$),
then $w_i = P(r_{im_i} = 0, r_{im_i-1} = 1 \mid X_i, Y_i)^{-1}$; otherwise, the subject completes all the T visits (that is,
$m_i = T + 1$), and $w_i = P(r_{iT} = 1 \mid X_i, Y_i)^{-1}$.
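Similar to the process for the observation-specific weighted method, the dropout process for the subject-specific
weighted method implies that subject-specific weights can be estimated as a cumulative product of
conditional probabilities:

\[ \hat{w}_i = \hat{P}(r_{im_i} = 0, r_{im_i-1} = 1 \mid X_i, Y_i)^{-1}
   = \left[ \lambda_{i1}(\hat\alpha) \times \cdots \times \lambda_{im_i-1}(\hat\alpha) \times (1 - \lambda_{im_i}(\hat\alpha)) \right]^{-1},
   \quad \text{if } m_i \le T \]
\[ \hat{w}_i = \hat{P}(r_{im_i-1} = 1 \mid X_i, Y_i)^{-1}
   = \left[ \lambda_{i1}(\hat\alpha) \times \lambda_{i2}(\hat\alpha) \times \cdots \times \lambda_{im_i-1}(\hat\alpha) \right]^{-1},
   \quad \text{if } m_i = T + 1 \]

Thus, the subject-specific weights $\hat{w}_i$ can be obtained after $\lambda_{ij}$ is estimated by fitting a logistic regression to
the data $(r_{ij}, z_{ij})$.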
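For intuition, consider a small worked example with T = 3 and invented conditional probabilities of remaining in the study of 1, 0.9, and 0.8 at visits 1, 2, and 3 (the same for every subject, which is an assumption made only to keep the arithmetic short). The three possible dropout times give the following subject-specific weights:

```latex
\begin{aligned}
m_i = 2:\quad & \hat{w}_i^{-1} = 1 \times (1 - 0.9) = 0.10,
  & \hat{w}_i &= 10 \\
m_i = 3:\quad & \hat{w}_i^{-1} = 1 \times 0.9 \times (1 - 0.8) = 0.18,
  & \hat{w}_i &\approx 5.556 \\
m_i = 4 = T + 1:\quad & \hat{w}_i^{-1} = 1 \times 0.9 \times 0.8 = 0.72,
  & \hat{w}_i &\approx 1.389
\end{aligned}
```

The three probabilities 0.10, 0.18, and 0.72 sum to 1, as they should, because every subject either drops out at visit 2 or 3 or completes the study.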
The regression parameter $\beta$ from the subject-specific weighted GEE method can be estimated by solving
$S_{sw}(\beta) = 0$ after plugging in the estimated weights. The fitting algorithm is described in the section “Fitting
Algorithm for Weighted GEE.” The subject-specific weighting scheme was originally developed
for computational convenience. Preisser, Lohman, and Rathouz (2002) showed that the observation-specific
weighted GEE method produces more efficient estimates than the subject-specific weighted GEE method for
incomplete longitudinal binary data.

Fitting Algorithm for Weighted GEE

The following fitting algorithm fits marginal models by using the observation-specific or the subject-specific
weighted GEE method when the dropout process is missing at random:

1. Fit a logistic regression to the data $(r_{ij}, z_{ij})$ to obtain an estimate of $\alpha$ and estimate the weights.
2. Compute an initial estimate of $\beta$ by using an ordinary generalized linear model, assuming independence
   of the responses.
3. Compute the working correlation matrix R based on the standardized residuals, the current estimate of
   $\beta$, and the specified structure of R.
4. Compute the estimated covariance matrix:

   \[ V_i = \phi A_i^{1/2} \hat{R}(\alpha) A_i^{1/2} \]

5. Update $\hat\beta$:

   \[ \hat\beta_{r+1} = \hat\beta_r
      + \left[ \sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta} V_i^{-1} \frac{\partial \mu_i}{\partial \beta} \right]^{-1}
        \left[ \sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta} V_i^{-1} W_i (Y_i - \mu_i) \right] \]

   where $Y_i$, $\mu_i$, $V_i$, and $W_i$ are as follows:

   For the observation-specific weighted method, $Y_i = (y_{i1}, y_{i2}, \ldots, y_{iT})'$; $\mu_i$ and $V_i$ are its
   corresponding mean vector and working covariance matrix, respectively; and $W_i$ is a $T \times T$
   diagonal matrix whose jth diagonal is $r_{ij}\hat{w}_{ij}$.
   For the subject-specific weighted method, $Y_i = (y_{i1}, y_{i2}, \ldots, y_{in_i})'$; $\mu_i$ and $V_i$ are its
   corresponding mean vector and working covariance matrix, respectively; and $W_i$ is an $n_i \times n_i$ diagonal
   matrix whose jth diagonal is $\hat{w}_i$.

6. Repeat steps 3–5 until convergence.

Note that you can use the WEIGHT statement in the GENMOD procedure to perform a two-stage strategy
that is often used in practice to obtain the weighted GEE estimates. You fit a logistic regression to the data
$(r_{ij}, z_{ij})$ to obtain the weights as described in the preceding steps. Then you estimate $\beta$ by specifying the
estimated weights in the WEIGHT statement in PROC GENMOD for the GEE analysis. For the subject-specific
weighted GEE method, this approach is appropriate for any working correlation structure. However,
for the observation-specific weighted method, this approach is appropriate only for the independent working
correlation structure.
The two-stage approach results in standard errors that are larger than those that are produced by using
the MISSMODEL statement in the GEE procedure (because PROC GENMOD treats the weights as fixed
and known). Thus, the two-stage approach that uses PROC GENMOD results in conservative inference
(Fitzmaurice, Laird, and Ware 2011). The GEE procedure computes the parameter estimate covariances as
described in Fitzmaurice, Laird, and Ware (2011) and Preisser, Lohman, and Rathouz (2002).
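The two-stage strategy can be sketched as follows for the observation-specific weights with an independence working correlation. This is an illustration under stated assumptions, not the procedure's internal method: the data set Visits and the variables ID, Y, RAtRisk, PrevY, Dose, and Ctime are hypothetical, and the layout described in the comments is assumed to have been prepared beforehand.

```sas
/* Hypothetical data set Visits: one row per subject (ID) and scheduled   */
/* visit, sorted by ID and visit.                                         */
/*   Y        binary response (missing after dropout)                     */
/*   RAtRisk  1/0 observed indicator for visits whose previous response   */
/*            was observed; missing at the first visit and after dropout  */
/*   PrevY    response at the previous visit; Dose, Ctime are covariates  */

/* Stage 1: logistic model for remaining in the study. Rows that have a   */
/* missing RAtRisk are not used in the fit.                               */
proc logistic data=Visits;
   model RAtRisk(event='1') = PrevY Ctime;
   output out=Pred p=Lambda;
run;

/* Weight = inverse of the cumulative product of the fitted probabilities */
data Weighted;
   set Pred;
   by ID;
   retain CumProb;
   if first.ID then CumProb = 1;          /* first visit always observed  */
   else CumProb = CumProb * Lambda;
   if not missing(Y) then W = 1 / CumProb;
run;

/* Stage 2: weighted GEE that treats the weights as fixed and known       */
proc genmod data=Weighted descending;
   class ID;
   model Y = Dose Ctime / dist=bin link=logit;
   weight W;
   repeated subject=ID / type=ind;
run;
```

Because stage 2 ignores the uncertainty in the estimated weights, the standard errors from this sketch would be expected to be conservative, as the text above notes.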

Missing Data
Suppose that each subject in a longitudinal study is measured at T times. In other words, for the ith subject
you measure T responses $(y_{i1}, y_{i2}, \ldots, y_{iT})$ and T corresponding covariates $(x_{i1}, x_{i2}, \ldots, x_{iT})$.
By default, the GEE procedure handles missing data in the same manner as the standard GEE method in the
GENMOD procedure. The working correlation matrix is estimated from data that contain both intermittent
and dropout types of missing values by using the all-available-pairs method, in which all nonmissing pairs of
data are used in the moment estimators. The resulting covariances and standard errors are valid under the
missing completely at random (MCAR) assumption. For more information, see the section “Missing Data”
in Chapter 46, “The GENMOD Procedure.”
When you specify the MISSMODEL statement in the GEE procedure to use the weighted GEE method to
analyze the data, the procedure uses observations that have missing values in the response, provided that the
missing values for all subjects are caused by dropouts. If the missing values are intermittent for any of the
subjects, then the weighted GEE method does not apply and the procedure terminates.
For the observation-specific weighted GEE method, the covariates for all the observations for a subject must
be observed, regardless of whether the response is missing. For each subject, the input data set must provide
T observations.
For the subject-specific weighted GEE method, the covariates for a subject who drops out at time k must
be observed for the observations up to and including time k. The input data set must provide at least k
observations for this subject. The covariates must be observed for all observations on a subject who completes
the study, and the input data set must provide T observations for this subject.
For more information about how weighted GEE methods handle missing values, see Fitzmaurice, Laird, and
Ware (2011) and Preisser, Lohman, and Rathouz (2002).
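The data requirements above can be illustrated with a short sketch of a weighted GEE analysis that uses the MISSMODEL statement. The data set Visits and the variables ID, Y, PrevY, Dose, and Ctime are hypothetical, and the choice of predictors in the missingness model is an assumption made only for illustration.

```sas
/* Hypothetical data set Visits: one row per subject and scheduled visit, */
/* in visit order within ID, with Y missing after dropout and the model   */
/* covariates (Dose, Ctime) present on every row.                         */
proc gee data=Visits descend;
   class ID;
   /* logistic model for the dropout process; observation-specific        */
   /* weights (TYPE=SUBLEVEL would request subject-specific weights)      */
   missmodel PrevY Ctime / type=obslevel;
   model Y = Dose Ctime / dist=bin link=logit;
   repeated subject=ID / corr=exch;
run;
```

If any subject in such a data set had an intermittent pattern (an observed response after a missing one), the weighted method would not apply, as described above.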

Type 3 Analysis
A Type 3 analysis is similar to the Type 3 sums of squares used in PROC GLM, except that generalized
score tests for Type 3 contrasts instead of Type 3 sums of squares are computed. Briefly, a Type 3 estimable
function (contrast) for an effect is a linear function of the model parameters that involves the parameters of
the effect and any interactions with that effect. A test of the hypothesis that the Type 3 contrast for a main
effect is equal to 0 is intended to test the significance of the main effect in the presence of interactions. For
more information about Type 3 estimable functions, see Chapter 48, “The GLM Procedure,” and Chapter 15,
“The Four Types of Estimable Functions.” Also see Littell, Freund, and Spector (1991).
Boos (1992) and Rotnitzky and Jewell (1990) describe score tests applicable to testing $L'\beta = 0$ in GEEs,
where $L'$ is a user-specified $r \times p$ contrast matrix or a contrast for a Type 3 test of hypothesis.
Let $\tilde\beta$ be the regression parameters that result from solving the GEE under the restricted model $L'\beta = 0$, and
let $S(\tilde\beta)$ be the generalized estimating equation values at $\tilde\beta$.

The generalized score statistic is

\[ T = S(\tilde\beta)' \Sigma_m L (L' \Sigma_e L)^{-1} L' \Sigma_m S(\tilde\beta) \]

where $\Sigma_m$ is the model-based covariance estimate and $\Sigma_e$ is the empirical covariance estimate. The p-values
for T are computed based on the chi-square distribution with r degrees of freedom, where r is the rank of L.

A Type 3 analysis can consume considerable computation time because a constrained model is fitted for each
effect. Wald statistics for Type 3 contrasts are computed if you specify the WALD option. Wald statistics for
contrasts use less computation time than score statistics but might be less accurate indicators of the
significance of the effect of interest. The Wald statistic for testing $L'\beta = 0$ is defined by

\[ S = (L'\hat\beta)' (L' \Sigma_e L)^{-1} (L'\hat\beta) \]

where L is the contrast matrix, $\hat\beta$ is the vector of GEE parameter estimates, and $\Sigma_e$ is the empirical covariance
estimate. The asymptotic distribution of S is chi-square with r degrees of freedom, where r is the rank of L.
The results of this type of analysis do not depend on the order in which the terms are specified in the MODEL
statement. Type 3 analyses that use score statistics are not supported for nominal response data or weighted
GEE methods. Type 3 analyses can be conducted using the Wald statistics for all the models that the GEE
procedure supports.
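As a brief sketch of how the two kinds of Type 3 tests might be requested, the following statements assume the Resp data set and model that are used in Example 45.1 of this chapter. The TYPE3 and WALD options are the ones named in the text and in Table 45.14; nothing else is implied about the output.

```sas
/* Generalized score tests for Type 3 contrasts */
proc gee data=Resp descend;
   class ID Treatment Center Sex Baseline;
   model Outcome = Treatment Center Sex Age Baseline /
         dist=bin link=logit type3;
   repeated subject=ID(Center) / corr=exch;
run;

/* Wald tests for the same contrasts; no constrained refits are needed */
proc gee data=Resp descend;
   class ID Treatment Center Sex Baseline;
   model Outcome = Treatment Center Sex Age Baseline /
         dist=bin link=logit type3 wald;
   repeated subject=ID(Center) / corr=exch;
run;
```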

ODS Table Names

PROC GEE assigns a name to each table that it creates. You can use these names to refer to the table when
you use the Output Delivery System (ODS) to select tables and create output data sets. Table 45.14 lists these
names. For more information about ODS, see Chapter 20, “Using the Output Delivery System.”

Table 45.14 ODS Tables Produced by PROC GEE

ODS Table Name   Description                                           Statement  Option

ClassLevels      Classification variable levels                        CLASS      Default
Coef             Coefficients for LS-means                             LSMEANS    E
Diffs            Differences of LS-means                               LSMEANS    DIFF
Estimates        Estimates of contrasts                                ESTIMATE   Default
GEEEmpPEst       Parameter estimates with empirical standard errors    REPEATED   Default
GEEExchCorr      Exchangeable working correlation value                REPEATED   TYPE=EXCH
GEEFitCriteria   QIC fit criteria                                      REPEATED   Default
GEELogORInfor    GEE log odds ratio model information                  REPEATED   LOGOR=
GEEModInfo       GEE model information                                 REPEATED   Default
GEEModPEst       Parameter estimates with model-based standard errors  REPEATED   MODELSE
GEENCorr         Model-based correlation matrix                        REPEATED   MCORRB
GEENCov          Model-based covariance matrix                         REPEATED   MCOVB
GEERCorr         Empirical correlation matrix                          REPEATED   ECORRB
GEERCov          Empirical covariance matrix                           REPEATED   ECOVB
GEEWCorr         GEE working correlation matrix                        REPEATED   CORRW
LSMeans          LS-means                                              LSMEANS    Default
LSMLines         Lines display for LS-means                            LSMEANS    LINES
MissModelPEst    Parameter estimates for the missingness model         MISSMODEL  Default
MissPattern      Frequency counts for dropout times                    MISSMODEL  Default
ModelInfo        Model information                                     MODEL      Default
NObs             Number of observations summary                                   Default
ParmInfo         Parameter indices                                     REPEATED   MCORRB, MCOVB, ECORRB, ECOVB
ResponseProfile  Frequency counts for binary models                    MODEL      DIST=BINOMIAL
Type3            Type 3 tests                                          MODEL      TYPE3
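For example, the table names in Table 45.14 can be used in an ODS OUTPUT statement to save results as data sets. The sketch below assumes the Resp data set and model from Example 45.1; the output data set names EmpEst and Fit are arbitrary names chosen for illustration.

```sas
/* Save the empirical parameter estimates and the QIC fit criteria */
ods output GEEEmpPEst=EmpEst GEEFitCriteria=Fit;

proc gee data=Resp descend;
   class ID Treatment Center Sex Baseline;
   model Outcome = Treatment Center Sex Age Baseline /
         dist=bin link=logit;
   repeated subject=ID(Center) / corr=exch;
run;

proc print data=EmpEst;
run;
```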

ODS Graphics
Statistical procedures use ODS Graphics to create graphs as part of their output. ODS Graphics is described
in detail in Chapter 21, “Statistical Graphics Using ODS.”
Before you create graphs, ODS Graphics must be enabled (for example, by specifying the ODS GRAPHICS ON
statement). For more information about enabling and disabling ODS Graphics, see the section
“Enabling and Disabling ODS Graphics” in Chapter 21, “Statistical Graphics Using ODS.”
The overall appearance of graphs is controlled by ODS styles. Styles and other aspects of using ODS
Graphics are discussed in the section “A Primer on ODS Statistical Graphics” in Chapter 21,
“Statistical Graphics Using ODS.”

ODS Graph Names

PROC GEE assigns a name to each graph it creates using ODS. You can use these names to refer to the
graphs when you use ODS. Table 45.15 lists the names.
To request these graphs, ODS Graphics must be enabled and you must specify the statement and option that
are indicated in Table 45.15.

Table 45.15 Graphs Produced by PROC GEE

ODS Graph Name   Description                                                 Statement  Option

Histogram        Histogram of predicted weights from the missingness model   PROC       PLOTS=

Examples: GEE Procedure

The following examples illustrate some of the capabilities of the GEE procedure. These examples are not
intended to represent definitive analyses of the data sets that are presented here.

Example 45.1: Comparison of the Marginal and Random Effect Models for
Binary Data
A clinical trial (Stokes, Davis, and Koch 2012) was conducted to compare two treatments for a respiratory
illness. Patients in each of two centers were randomly assigned to two groups: one group received the active
treatment and one group received a placebo.
During treatment, respiratory status was determined for each of four visits and is represented by the variable
Outcome (coded here as 0 = poor, 1 = good). The variables Center, Treatment, Sex, and Baseline (baseline
respiratory status) are classification variables that have two levels. The variable Age (age at time of entry into
the study) is a continuous variable.
All 111 patients completed the study. That is, there are no missing data for responses or covariates. The
following statements create the data set Resp:

data Resp;
input Center ID Treatment $ Sex $ Age Baseline Visit1-Visit4;
datalines;
1 1 P M 46 0 0 0 0 0
1 2 P M 28 0 0 0 0 0
1 3 A M 23 1 1 1 1 1
1 4 P M 44 1 1 1 1 0
1 5 P F 13 1 1 1 1 1
1 6 A M 34 0 0 0 0 0

... more lines ...

2 51 A M 43 1 1 1 1 0
2 52 A F 39 0 1 1 1 1
2 53 A M 68 0 1 1 1 1
2 54 A F 63 1 1 1 1 1
2 55 A M 31 1 1 1 1 1
;

data Resp;
set Resp;
Visit=1; Outcome=Visit1; output;
Visit=2; Outcome=Visit2; output;
Visit=3; Outcome=Visit3; output;
Visit=4; Outcome=Visit4; output;
run;
Suppose $y_{ij}$ represents the respiratory status of patient i at the jth visit, $j = 1, \ldots, 4$, and $\mu_{ij} = E(y_{ij})$
represents the mean of the respiratory status. Logistic regression is commonly used to analyze binary response
data. You can use the variance function for the binomial distribution, $v(\mu_{ij}) = \mu_{ij}(1 - \mu_{ij})$, and the logit
link function, $g(\mu_{ij}) = \log(\mu_{ij}/(1 - \mu_{ij}))$. The model for the mean is $g(\mu_{ij}) = x_{ij}'\beta$, where $\beta$ is a vector
of regression parameters to be estimated.
The following SAS statements perform the GEE model fit:

proc gee data=Resp descend;

class ID Treatment Center Sex Baseline;
model Outcome=Treatment Center Sex Age Baseline /
dist=bin link=logit;
repeated subject=ID(Center) / corr=exch corrw;
run;
Both the MODEL statement and the REPEATED statement are required.
In the MODEL statement, you use the DIST=BIN and LINK=LOGIT options to specify a logistic regression,
and you specify Outcome as the response variable and Treatment, Center, Sex, Age, and Baseline as the
explanatory variables. The DESCEND option in the PROC GEE statement requests that the probability that
Outcome = 1 be modeled. If the DESCEND option had not been specified, the probability that Outcome = 0
would be modeled by default.
You use the REPEATED statement to specify the subject and the correlation structure of the responses. The
SUBJECT=ID(CENTER) option specifies that the observations in any single cluster are uniquely identified
by Center and ID. An equivalent specification is SUBJECT=ID*CENTER. Because the same ID values
are used in each center, one of these specifications is needed. If ID values were unique across all centers,
SUBJECT=ID could be specified. The CORR=EXCH option (equivalently, TYPE=EXCH) specifies the
exchangeable working correlation structure, and the CORRW option displays the working correlation matrix.
The “Model Information” table displayed in Output 45.1.1 provides information about the specified logistic
regression model and the input data set.

Output 45.1.1 Model Information

The GEE Procedure

Model Information
Data Set WORK.RESP
Distribution Binomial
Link Function Logit
Dependent Variable Outcome

General information about the GEE analysis is displayed in Output 45.1.2, and model fit criteria for the
model are displayed in Output 45.1.3.

Output 45.1.2 Model Fitting Information

GEE Model Information

Correlation Structure Exchangeable
Subject Effect ID(Center) (111 levels)
Number of Clusters 111
Correlation Matrix Dimension 4
Maximum Cluster Size 4
Minimum Cluster Size 4

Output 45.1.3 Model Fitting Information

GEE Fit Criteria
QIC 512.5723
QICu 499.4873

The results of GEE model fitting are displayed in Output 45.1.4. If you specify no other options, the standard
errors, confidence intervals, Z scores, and p-values are based on empirical standard error estimates. You can
specify the MODELSE option in the REPEATED statement to create a table that is based on model-based
standard error estimates.
Output 45.1.4 Results of Model Fitting

Parameter Estimates for Response Model
with Empirical Standard Error Estimates

                            Standard     95% Confidence
Parameter        Estimate      Error         Limits             Z   Pr > |Z|
Intercept          1.6391     0.5247     0.6107    2.6675    3.12     0.0018
Treatment  A       1.2654     0.3467     0.5859    1.9448    3.65     0.0003
Treatment  P       0.0000     0.0000     0.0000    0.0000       .          .
Center     1      -0.6495     0.3532    -1.3418    0.0428   -1.84     0.0660
Center     2       0.0000     0.0000     0.0000    0.0000       .          .
Sex        F       0.1368     0.4402    -0.7261    0.9996    0.31     0.7560
Sex        M       0.0000     0.0000     0.0000    0.0000       .          .
Age               -0.0188     0.0130    -0.0442    0.0067   -1.45     0.1480
Baseline   0      -1.8457     0.3460    -2.5238   -1.1676   -5.33     <.0001
Baseline   1       0.0000     0.0000     0.0000    0.0000       .          .

Treatment and Baseline appear to be strongly influential, and Center might be marginally significant.
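To compare the empirical standard errors in Output 45.1.4 with model-based ones, the MODELSE option that is mentioned above can be added to the REPEATED statement. The following is a minimal sketch that reuses the same model; it is shown only to illustrate where the option goes.

```sas
proc gee data=Resp descend;
   class ID Treatment Center Sex Baseline;
   model Outcome = Treatment Center Sex Age Baseline /
         dist=bin link=logit;
   /* MODELSE requests the table of model-based standard errors */
   repeated subject=ID(Center) / corr=exch corrw modelse;
run;
```

Large differences between the two sets of standard errors can be a sign that the working correlation structure is a poor description of the data, although that interpretation is a general rule of thumb rather than something this example establishes.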
For comparison, a generalized linear mixed model is fitted to the data set to obtain subject-specific effects.
Specifically, consider the logistic regression model,

\[ \mathrm{logit}(E(y_{ij} \mid b_i)) = x_{ij}'\beta + b_i \]

where the random effect $b_i$ is normally distributed with zero mean and variance $\mathrm{Var}(b_i) = \sigma_b^2$.

The following statements use the GLIMMIX procedure to fit a generalized linear mixed model:

proc glimmix data=Resp;

class ID Treatment Center Sex Baseline;
model Outcome (desc)=Treatment Center Sex Age Baseline /
dist=binary solution;
random ID(Center);
run;
Output 45.1.5 displays the parameter estimates for the fixed effects in the generalized linear mixed model.

Output 45.1.5 Parameter Estimates

The GLIMMIX Procedure

Solutions for Fixed Effects

                                                          Standard
Effect      Treatment  Sex  Center  Baseline   Estimate      Error    DF   t Value   Pr > |t|
Intercept                                        1.7936     0.6292   105      2.85     0.0053
Treatment   A                                    1.4758     0.3898   333      3.79     0.0002
Treatment   P                                         0          .     .         .          .
Center                      1                   -0.7201     0.4051   105     -1.78     0.0784
Center                      2                         0          .     .         .          .
Sex                    F                         0.1732     0.5034   333      0.34     0.7310
Sex                    M                              0          .     .         .          .
Age                                            -0.02011    0.01507   333     -1.33     0.1831
Baseline                            0           -2.1343     0.3971   333     -5.38     <.0001
Baseline                            1                 0          .     .         .          .

From Output 45.1.4 and Output 45.1.5, you can see that the parameter estimates from the marginal model
and the mixed-effects model differ. For example, the estimated treatment effects are 1.2654 and 1.4758 from
the marginal model and the mixed-effects model, respectively.
The interpretation of the model effects in the marginal and random-effects models differs. For example, the
estimated treatment effect from the marginal model indicates that, on average, the odds of a good response for
the patients are $e^{1.2654} = 3.5$ times as high when they receive the active treatment as when they receive the
placebo. The estimated treatment effect from the generalized linear mixed model indicates that an individual
patient’s odds of a good response are $e^{1.4758} = 4.4$ times as high when the patient receives the active treatment
as when the patient receives the placebo.
The choice of the marginal model or a subject-specific model often depends on the goal of your analysis:
whether you are interested in population-averaged effects or subject-specific effects. For more information,
see Diggle et al. (2002) and Fitzmaurice, Laird, and Ware (2011).
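The odds ratios quoted in this example follow directly from exponentiating the estimates in Output 45.1.4 and Output 45.1.5. The short calculation below also exponentiates the 95% confidence limits for the marginal treatment effect; the interval on the odds ratio scale is an added illustration and is not reported in the output itself.

```latex
\begin{aligned}
\text{Marginal (GEE):}\quad & e^{1.2654} \approx 3.54,
  \qquad \left( e^{0.5859},\; e^{1.9448} \right) \approx (1.80,\; 6.99) \\
\text{Subject-specific (GLMM):}\quad & e^{1.4758} \approx 4.37
\end{aligned}
```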

Example 45.2: Log-Linear Model for Count Data

The following example demonstrates how you can fit a GEE model to count data. The data are analyzed by
Diggle, Liang, and Zeger (1994). The response is the number of epileptic seizures, which was measured
at the end of each of eight two-week treatment periods over sixteen weeks. The first eight weeks were the
baseline period (during which no treatment was given), and the second eight weeks were the treatment period,
during which patients received either a placebo or the drug progabide. The question of scientific interest is
whether progabide is effective in reducing the rate of epileptic seizures.
The following DATA step creates the data set Seizure:

data Seizure;
input ID Count Visit Trt Age Weeks;
datalines;
104 11 0 0 31 8
104 5 1 0 31 2
104 3 2 0 31 2
104 3 3 0 31 2
104 3 4 0 31 2
106 11 0 0 30 8

... more lines ...

236 12 0 1 37 8
236 1 1 1 37 2
236 4 2 1 37 2
236 3 3 1 37 2
236 2 4 1 37 2
;
The following DATA step creates a log time interval variable for use as an offset and an indicator variable for
whether the observation is for a baseline measurement or a visit measurement. Patient 207 is deleted as an
outlier, which was done in the Diggle et al. (2002) analysis:

data Seizure;
set Seizure;
if ID ne 207;
if Visit = 0 then do;
X1=0;
Ltime = log(8);
end;
else do;
X1=1;
Ltime=log(2);
end;
run;
Poisson regression is commonly used to model count data. In this example, the log-linear Poisson model is
specified by $V(\mu) = \mu$ (the Poisson variance function) and a log link function,

\[ \log(E(Y_{ij})) = \beta_0 + x_{i1}\beta_1 + x_{i2}\beta_2 + x_{i1}x_{i2}\beta_3 + \log(t_{ij}) \]

where

$Y_{ij}$ = number of epileptic seizures in interval $j$

$t_{ij}$ = length of interval $j$

\[ x_{i1} = \begin{cases} 1 & \text{weeks 8–16 (treatment)} \\ 0 & \text{weeks 0–8 (baseline)} \end{cases} \]

\[ x_{i2} = \begin{cases} 1 & \text{progabide group} \\ 0 & \text{placebo group} \end{cases} \]

Because the visits represent repeated measurements, the responses from the same individual are correlated
and inferences need to take this into account. The correlations between the counts are modeled by an
unstructured working correlation matrix, in which each pair of measurement times $j \neq k$ has its own
correlation parameter $r_{jk} = \alpha_{jk}$.
In this model, the regression parameters are interpreted in terms of the log seizure rate that is displayed in
Table 45.16.

Table 45.16 Interpretation of Regression Parameters

Treatment    Visit       $\log(E(Y_{ij})/t_{ij})$

Placebo      Baseline    $\beta_0$
             1–4         $\beta_0 + \beta_1$
Progabide    Baseline    $\beta_0 + \beta_2$
             1–4         $\beta_0 + \beta_1 + \beta_2 + \beta_3$

The difference between the log seizure rates in the pretreatment (baseline) period and the treatment periods is
$\beta_1$ for the placebo group and $\beta_1 + \beta_3$ for the progabide group. A value of $\beta_3 < 0$ indicates a reduction in
the seizure rate for the progabide group relative to the placebo group.
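The role of $\beta_3$ follows from differencing the rows of Table 45.16. The short derivation below is a sketch that uses only the quantities in that table.

```latex
\begin{align*}
\text{Placebo (visits 1--4 minus baseline):}   \quad & (\beta_0 + \beta_1) - \beta_0 = \beta_1 \\
\text{Progabide (visits 1--4 minus baseline):} \quad & (\beta_0 + \beta_1 + \beta_2 + \beta_3) - (\beta_0 + \beta_2) = \beta_1 + \beta_3 \\
\text{Progabide change minus placebo change:}  \quad & (\beta_1 + \beta_3) - \beta_1 = \beta_3
\end{align*}
```

On the rate scale, $e^{\beta_3}$ is therefore the ratio of the treatment-to-baseline rate ratio in the progabide group to the same ratio in the placebo group.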
The following statements perform the analysis:

proc gee data = Seizure;

class ID Visit;
model Count = X1 Trt X1*Trt / dist=poisson link=log offset= Ltime;
repeated subject = ID / within = Visit type=unstr covb corrw;
run;
In the MODEL statement, Count is the response variable, and X1, Trt, and the interaction X1*Trt are the
explanatory variables. You request Poisson regression with the DIST=POISSON and the LINK=LOG options.
The offset variable is often used in Poisson regression to account for different exposures. In this case, the
OFFSET= option specifies Ltime as the offset variable representing different time intervals.
In the REPEATED statement, the SUBJECT= option indicates that the variable ID identifies the observations
from a single cluster, and the TYPE=UNSTR option specifies the unstructured working correlation structure.
The CORRW option requests that the working correlation matrix be displayed.
The “Model Information” table that is displayed in Output 45.2.1 provides information about the specified
model and the input data set.

Output 45.2.1 Model Information

The GEE Procedure

Model Information
Data Set WORK.SEIZURE
Distribution Poisson
Link Function Log
Dependent Variable Count
Offset Variable Ltime

Output 45.2.2 displays general information about the GEE model analysis.

Output 45.2.2 GEE Model Information

GEE Model Information

Correlation Structure Unstructured
Within-Subject Effect Visit (5 levels)
Subject Effect ID (58 levels)
Number of Clusters 58
Correlation Matrix Dimension 5
Maximum Cluster Size 5
Minimum Cluster Size 5

Output 45.2.3 displays the parameter estimate covariance matrices, which are requested by the COVB option.
Both model-based and empirical covariances are produced.

Output 45.2.3 Covariance Matrices of Parameter Estimate

Covariance Matrix (Model-Based)

Prm1 Prm2 Prm3 Prm4
Prm1 0.01210 0.004902 -0.01210 -0.004902
Prm2 0.004902 0.006660 -0.004902 -0.006660
Prm3 -0.01210 -0.004902 0.02461 0.01299
Prm4 -0.004902 -0.006660 0.01299 0.01852

Covariance Matrix (Empirical)

Prm1 Prm2 Prm3 Prm4
Prm1 0.02597 -0.003069 -0.02597 0.003069
Prm2 -0.003069 0.008597 0.003069 -0.008597
Prm3 -0.02597 0.003069 0.03841 -0.006196
Prm4 0.003069 -0.008597 -0.006196 0.02237
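The empirical standard errors that PROC GEE reports in the parameter estimates table (Output 45.2.5) are the square roots of the diagonal elements of the empirical covariance matrix. The following DATA step is a small sketch of that relationship; the array name v and the loop index k are illustrative, and the diagonal values are typed in by hand from the matrix above.

```sas
/* Square roots of the diagonal of the empirical covariance matrix */
data _null_;
   array v{4} _temporary_ (0.02597 0.008597 0.03841 0.02237);
   do k = 1 to 4;          /* Prm1-Prm4: Intercept, X1, Trt, X1*Trt */
      se = sqrt(v{k});
      put k= se= 6.4;
   end;
run;
```

The four values agree with the standard errors shown in Output 45.2.5 (0.1612, 0.0927, 0.1960, and 0.1496).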

The unstructured working correlation matrix is displayed in Output 45.2.4. It shows that there are noticeable
correlations among the respective visits.

Output 45.2.4 Working Correlation Matrix

Working Correlation Matrix

Obs 1 Obs 2 Obs 3 Obs 4 Obs 5
Obs 1 1.0000 0.7920 0.7190 0.8111 0.6582
Obs 2 0.7920 1.0000 0.4859 0.6552 0.4566
Obs 3 0.7190 0.4859 1.0000 0.6988 0.4171
Obs 4 0.8111 0.6552 0.6988 1.0000 0.6464
Obs 5 0.6582 0.4566 0.4171 0.6464 1.0000

The parameter estimates table, shown in Output 45.2.5, contains parameter estimates, standard errors,
confidence intervals, Z scores, and p-values for the parameter estimates. Empirical standard error estimates
are used in this table.

Output 45.2.5 Parameter Estimates Table

Parameter Estimates for Response Model
with Empirical Standard Error Estimates

Parameter   Estimate   Standard Error   95% Confidence Limits (lower, upper)   Z   Pr > |Z|
Intercept 1.3309 0.1612 1.0151 1.6468 8.26 <.0001
X1 0.1128 0.0927 -0.0689 0.2945 1.22 0.2237
Trt -0.1034 0.1960 -0.4875 0.2807 -0.53 0.5978
X1*Trt -0.3162 0.1496 -0.6093 -0.0231 -2.11 0.0345

The estimate of $\beta_3$ is -0.3162 (p = 0.0345), which indicates that progabide is effective in reducing the rate of epileptic
seizures.
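Because the model uses a log link, exponentiating the X1*Trt estimate and its confidence limits puts the effect on the rate-ratio scale. The following DATA step is a minimal sketch; the variable names are illustrative, and the inputs are copied by hand from Output 45.2.5.

```sas
/* Rate-ratio scale for the X1*Trt interaction */
data _null_;
   est   = -0.3162;
   lower = -0.6093;
   upper = -0.0231;
   rr       = exp(est);     /* ratio of rate ratios, progabide versus placebo */
   rr_lower = exp(lower);
   rr_upper = exp(upper);
   put rr= 6.3 rr_lower= 6.3 rr_upper= 6.3;
run;
```

The ratio is roughly 0.73, with limits of roughly 0.54 and 0.98. That is, the change in seizure rate from baseline to treatment in the progabide group is about 27% lower than the corresponding change in the placebo group.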
Model fit criteria for the model are displayed in Output 45.2.6. These criteria are used in selecting regression
models and working correlation structures; smaller values indicate a better fit.

Output 45.2.6 Model Fit Criteria

GEE Fit Criteria

QIC -1036.2837
QICu -1041.8041
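To compare working correlation structures, you can refit the same mean model with a different TYPE= value and compare the resulting "GEE Fit Criteria" tables. The following statements are a sketch of such a comparison that uses an exchangeable structure; choosing TYPE=EXCH here is only an illustration, not a recommendation for these data.

```sas
/* Same mean model as above, exchangeable working correlation */
proc gee data=Seizure;
   class ID Visit;
   model Count = X1 Trt X1*Trt / dist=poisson link=log offset=Ltime;
   repeated subject=ID / within=Visit type=exch corrw;
run;
```

The structure that yields the smaller QIC is preferred under the criterion of Pan (2001).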

Example 45.3: Weighted GEE for Longitudinal Data That Have Missing Values
This example shows how you can use the GEE procedure to analyze longitudinal data that contain missing
values. The data set is taken from a longitudinal study of women who used contraception during one year
(Fitzmaurice, Laird, and Ware 2011). In this study, 1,151 women were randomly assigned to one of two
treatments: 100 mg or 150 mg of depot medroxyprogesterone acetate (DMPA) at baseline and at three-month
intervals. The response variable indicates their amenorrhea status during the four three-month intervals. The
question of interest is whether the treatment has an effect on the rate of amenorrhea over time. The
example follows the analysis done by Fitzmaurice, Laird, and Ware (2011).
The following statements create the data set Amenorrhea:

data Amenorrhea;
input ID Dose Time Y@@;
datalines;
1 0 1 0
1 0 2 .
1 0 3 .
1 0 4 .

... more lines ...

1150 1 4 1
1151 1 1 1
1151 1 2 1
1151 1 3 1
1151 1 4 1
;
The variables in the data are as follows:

ID: patient’s ID

Y: indicator of amenorrhea status (1 for amenorrhea; 0 otherwise)

Time: four consecutive three-month intervals with values 1, 2, 3, and 4

Dose: 0 for treatment with 100 mg injection; 1 for treatment with 150 mg injection

To prepare for the analysis, two additional variables are created:

Prevy: the patient’s amenorrhea status in the previous three-month interval. For the baseline visit, this
is set to an arbitrary nonmissing value (0 here). In this release of PROC GEE, this arbitrary value
must be nonmissing and valid for the response variable—for example, it should be 0 or 1 for a binary
response—but it does not otherwise affect the results.

Ctime: a copy of Time, which you can include in the marginal model as a continuous effect and also in
the missingness model as a classification effect

The following statements add these two variables to the data set:

data Amenorrhea;
set Amenorrhea;
by ID;
Prevy=lag(Y);
if first.id then Prevy=0;
Time=Time-1;
Ctime=Time;
run;
Suppose $y_{ij}$ denotes the amenorrhea status of woman $i$ at the $j$th visit, $j = 1, \ldots, 4$, and suppose
$\mu_{ij} = P(y_{ij} = 1)$ denotes the average rate of amenorrhea. To explore whether the treatment has an effect on the
rate of amenorrhea over time, consider the following marginal model:

\[ \mathrm{logit}(\mu_{ij}) = \beta_0 + \beta_1 \mathrm{time}_{ij} + \beta_2 \mathrm{time}_{ij}^2 + \beta_3 \mathrm{dose}_i + \beta_4 \mathrm{dose}_i \, \mathrm{time}_{ij} + \beta_5 \mathrm{dose}_i \, \mathrm{time}_{ij}^2 \]

Of the 1,151 women in this study, 576 are from the low-dose group, and 575 are from the high-dose group.
For the low-dose group, 62.67% of the women completed the trial; for the high-dose group, 61.39% of the
women completed this trial. Thus, both groups have substantial dropouts.
To obtain the weights for the weighted GEE analysis, consider the following logistic regression model for
missingness:

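\[ \mathrm{logit}\, P(r_{ij} = 1 \mid r_{ij-1} = 1, \mathrm{dose}_i, \mathrm{ctime}_{ij}, y_{ij-1}) = \alpha_0 + \alpha_1 I(\mathrm{ctime}_{ij} = 1) + \alpha_2 I(\mathrm{ctime}_{ij} = 2) + \alpha_3 \mathrm{dose}_i + \alpha_4 y_{ij-1} + \alpha_5 \mathrm{dose}_i \, y_{ij-1} \]

Here $r_{ij} = 1$ if $y_{ij}$ is observed and $r_{ij} = 0$ otherwise.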

The following statements use the observation-specific weighted GEE method and the specified response and
missingness models to analyze the data:

ods graphics on;

proc gee data=Amenorrhea desc plots=histogram;
class ID Ctime;
missmodel Ctime Prevy Dose Dose*Prevy / type=obslevel;
model Y = Time Dose Time*Time Dose*Time Dose*Time*Time / dist=bin;
repeated subject=ID / within=Ctime corr=cs;
run;
The MODEL statement specifies logistic regression and the model effects. The DESCEND option in the
PROC GEE statement models the probability that Y = 1.
The REPEATED statement requests GEE analysis. The SUBJECT=ID option specifies that observations
from the same subject are identified by ID. The CORR=CS option, which is equivalent to TYPE=CS, specifies the
compound symmetric working correlation structure.
The MISSMODEL statement requests the weighted GEE analysis. It specifies the logistic regression model
for missingness. Note that no response variable is needed in weighted GEE analysis to specify a missingness
model because the missingness indicator is completely determined by whether the response variable in the
MODEL statement is observed. Without the MISSMODEL statement, PROC GEE would use the standard GEE approach, the same as
provided by PROC GENMOD. The TYPE=OBSLEVEL option requests observation-specific weights.
Output 45.3.1 shows the parameter estimates for the missingness model. The estimate of $\alpha_4$ is -0.4514 with a
p-value of 0.0053, which suggests that the probability that a participant drops out is related to her previous
amenorrhea status. This suggests that the assumption of missing at random (MAR) is more appropriate than that of
missing completely at random (MCAR).

Output 45.3.1 Parameter Estimates for the Missingness Model

Parameter Estimates for Missingness Model

Parameter   Estimate   Standard Error   95% Confidence Limits (lower, upper)   Z   Pr > |Z|
Intercept 2.3967 0.1438 2.1149 2.6785 16.67 <.0001
Ctime 0 0.0000 0.0000 0.0000 0.0000 . .
Ctime 1 -0.7286 0.1439 -1.0106 -0.4466 -5.06 <.0001
Ctime 2 -0.5919 0.1469 -0.8798 -0.3040 -4.03 <.0001
Ctime 3 0.0000 0.0000 0.0000 0.0000 . .
Prevy -0.4514 0.1619 -0.7687 -0.1341 -2.79 0.0053
Dose 0.0680 0.1313 -0.1893 0.3253 0.52 0.6046
Prevy*Dose -0.2381 0.2196 -0.6685 0.1923 -1.08 0.2782

The classification variable Ctime has two levels whose estimates are equal to zero. One is the reference level,
Ctime = 3. The first level, Ctime = 0, also has an estimate of zero, because the first visit is always observed
and the first level is never used in estimating the weights in the missingness model.
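To see how these estimates translate into weights, the following DATA step works through one case by hand. It is a sketch under a stated assumption: that an observation-level weight is the inverse of the product of the fitted conditional probabilities of remaining in the study up to that visit, which is the usual inverse-probability form. The case shown (high dose, amenorrhea reported at every previous visit) and all variable names are illustrative.

```sas
/* Fitted probabilities of remaining in the study, and the implied weight */
/* at the last visit, for Dose = 1 and Prevy = 1 at every visit           */
data _null_;
   /* Linear predictor without the Ctime term (Output 45.3.1) */
   base = 2.3967 + 0.0680 - 0.4514 - 0.2381;
   p1 = 1 / (1 + exp(-(base - 0.7286)));   /* Ctime = 1 */
   p2 = 1 / (1 + exp(-(base - 0.5919)));   /* Ctime = 2 */
   p3 = 1 / (1 + exp(-base));              /* Ctime = 3 (reference level) */
   weight = 1 / (p1 * p2 * p3);            /* assumed inverse-probability form */
   put p1= 6.4 p2= 6.4 p3= 6.4 weight= 6.3;
run;
```

Under this assumption the weight is about 2.06, which is in line with the upper end of the range of estimated weights reported for this example.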
Output 45.3.2 displays the results of the weighted GEE analysis.

Output 45.3.2 Parameter Estimates for Amenorrhea Data Analysis Using Weighted GEE
The GEE Procedure

Parameter Estimates for Response Model
with Empirical Standard Error Estimates

Parameter   Estimate   Standard Error   95% Confidence Limits (lower, upper)   Z   Pr > |Z|
Intercept -1.4965 0.1072 -1.7067 -1.2863 -13.95 <.0001
Time 0.5379 0.1334 0.2764 0.7994 4.03 <.0001
Dose 0.1061 0.1491 -0.1861 0.3983 0.71 0.4767
Time*Time -0.0037 0.0405 -0.0831 0.0757 -0.09 0.9275
Dose*Time 0.4092 0.1903 0.0362 0.7823 2.15 0.0315
Dose*Time*Time -0.1264 0.0577 -0.2395 -0.0134 -2.19 0.0284

The estimate of $\beta_4$ (the parameter estimate for the Dose*Time interaction) is 0.4092 (p = 0.0315), which indicates that the
change of amenorrhea rate over time depends on the dose of DMPA. Specifically, for women in the low-dose
group, the amenorrhea rates $\mu_{ij}$ at the four consecutive time intervals are 0.1830, 0.2764, 0.3928, and 0.5210,
and for women in the high-dose group, the amenorrhea rates are 0.1997, 0.3609, 0.4963, and 0.5701. In other
words, the amenorrhea rate increases over time for both treatments, and the rates of increase are slightly
different.
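The fitted rates come from applying the inverse logit to the linear predictor of the marginal model. The following statements are a sketch that recomputes them from the rounded estimates in Output 45.3.2; the data set name Fitted and the variable names eta and rate are illustrative. Time takes the values 0 through 3 because the earlier DATA step subtracts 1 from the original variable.

```sas
/* Fitted amenorrhea rates by dose group and time interval */
data Fitted;
   do Dose = 0 to 1;
      do Time = 0 to 3;
         eta = -1.4965 + 0.5379*Time + 0.1061*Dose - 0.0037*Time*Time
               + 0.4092*Dose*Time - 0.1264*Dose*Time*Time;
         rate = 1 / (1 + exp(-eta));   /* inverse logit */
         output;
      end;
   end;
run;

proc print data=Fitted noobs;
   var Dose Time eta rate;
   format rate 6.4;
run;
```

Because the inputs are rounded to four decimal places, the recomputed rates can differ slightly from the reported values in the last digits.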
You can request subject-level weights by specifying the TYPE=SUBLEVEL option. The results (not
shown here) from the subject-level weighted method are similar to the results from the observation-level
weighted method. Both of the weighted GEE methods provide unbiased regression parameter estimates if the
missingness model is specified correctly. Preisser, Lohman, and Rathouz (2002) note that the observation-
level weighted GEE produces more efficient estimates than the cluster-level weighted GEE produces for
incomplete longitudinal binary data.
Large weights can strongly influence the parameter estimates. Consequently, you should check
the distribution of the estimated weights. If there are large weights, you might consider trimming them
by specifying the MAXWEIGHT= option in the MISSMODEL statement. Output 45.3.3 shows that the
estimated weights in this example range between 1 and 2.1, so no trimming is needed.
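If trimming were needed, the call might look like the following sketch. The cutoff of 5 is a hypothetical value chosen only for illustration; it is not a recommendation and has no effect on these data, whose weights are all well below it.

```sas
/* Same analysis with an upper bound on the estimated weights */
proc gee data=Amenorrhea desc;
   class ID Ctime;
   /* maxweight=5 is an illustrative cutoff; use type=sublevel */
   /* instead of type=obslevel to request subject-level weights */
   missmodel Ctime Prevy Dose Dose*Prevy / type=obslevel maxweight=5;
   model Y = Time Dose Time*Time Dose*Time Dose*Time*Time / dist=bin;
   repeated subject=ID / within=Ctime corr=cs;
run;
```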

Output 45.3.3 Histogram of Estimated Weights

Example 45.4: GEE for Binary Data with Logit Link Function
Because the respiratory data in Example 45.1 are binary, you can use the alternating logistic regression (ALR)
method and model associations by using the log odds ratios instead of working correlations. This example
fits a “fully parameterized cluster” model for the log odds ratio. That is, there is a log odds ratio parameter
for each unique pair of responses within clusters, and all clusters are parameterized identically. The following
statements fit the same regression model for the mean as in Example 45.1 but use a regression model for the
log odds ratios instead of a working correlation. LOGOR=FULLCLUST specifies a fully parameterized log
odds ratio model.

proc gee data=Resp descend;

class ID Treatment Center Sex Baseline;
model Outcome=Treatment Center Sex Age Baseline / dist=bin;
repeated subject=ID(Center) / logor=fullclust;
run;
The results of fitting the model are displayed in Output 45.4.1.

Output 45.4.1 Results of ALR Model Fitting

The GEE Procedure

Parameter Estimates for Response Model
with Empirical Standard Error Estimates

Parameter   Estimate   Standard Error   95% Confidence Limits (lower, upper)   Z   Pr > |Z|
Intercept 1.6001 0.5128 0.5950 2.6052 3.12 0.0018
Treatment A 1.2611 0.3406 0.5934 1.9287 3.70 0.0002
Treatment P 0.0000 0.0000 0.0000 0.0000 . .
Center 1 -0.6287 0.3486 -1.3119 0.0545 -1.80 0.0713
Center 2 0.0000 0.0000 0.0000 0.0000 . .
Sex F 0.1024 0.4362 -0.7526 0.9575 0.23 0.8144
Sex M 0.0000 0.0000 0.0000 0.0000 . .
Age -0.0162 0.0125 -0.0407 0.0084 -1.29 0.1977
Baseline 0 -1.8980 0.3404 -2.5652 -1.2308 -5.58 <.0001
Baseline 1 0.0000 0.0000 0.0000 0.0000 . .
Alpha1 1.6109 0.4892 0.6522 2.5696 3.29 0.0010
Alpha2 1.0771 0.4834 0.1297 2.0246 2.23 0.0259
Alpha3 1.5875 0.4735 0.6594 2.5155 3.35 0.0008
Alpha4 2.1224 0.5022 1.1381 3.1068 4.23 <.0001
Alpha5 1.8818 0.4686 0.9634 2.8001 4.02 <.0001
Alpha6 2.1046 0.4949 1.1347 3.0745 4.25 <.0001

The parameters Alpha1 through Alpha6 estimate the log odds ratio for each unique within-cluster pair. The
correspondence between the log odds ratio parameters and within-cluster pairs is displayed in Output 45.4.2.

Output 45.4.2 Log Odds Ratio Parameters

Log Odds Ratio Parameter Information

Parameter   Group
Alpha1 (1, 2)
Alpha2 (1, 3)
Alpha3 (1, 4)
Alpha4 (2, 3)
Alpha5 (2, 4)
Alpha6 (3, 4)
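Exponentiating each Alpha estimate gives the estimated odds ratio for the corresponding pair of visits. The following statements are a sketch that pairs the estimates from Output 45.4.1 with the groups in Output 45.4.2; the data set name PairOR and the variable names are illustrative.

```sas
/* Odds ratios for each within-cluster pair (y1, y2) */
data PairOR;
   array a{6} _temporary_ (1.6109 1.0771 1.5875 2.1224 1.8818 2.1046);
   k = 0;
   do y1 = 1 to 3;
      do y2 = y1 + 1 to 4;
         k = k + 1;                 /* Alpha1-Alpha6 in the order (1,2), (1,3), ..., (3,4) */
         log_or = a{k};
         odds_ratio = exp(log_or);
         output;
      end;
   end;
run;

proc print data=PairOR noobs;
   var y1 y2 log_or odds_ratio;
   format odds_ratio 6.2;
run;
```

All six log odds ratios are positive, so every estimated odds ratio exceeds 1, which indicates positive association between responses at each pair of visits.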

Model goodness-of-fit criteria are shown in Output 45.4.3.

Output 45.4.3 ALR Model Fit Criteria

GEE Fit
Criteria
QIC 511.8589
QICu 499.6516

The QIC for the ALR model shown in Output 45.4.3 is 511.86, whereas the QIC for the unstructured working
correlation model shown in Output 45.1.3 is 512.34, indicating that the ALR model has a slightly better fit.
You can fit the same model by fully specifying the z matrix; for the definition of the z matrix, see the section
“Specifying Log Odds Ratio Models” on page 3130. The following statements create a data set that contains
the full z matrix:

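data zin;
   keep id center z1-z6 y1 y2;
   array zin(6) z1-z6;
   set resp;
   by center id;
   /* Write the six pair rows once per cluster, at the first record of each ID */
   if first.id then do;
      t = 0;
      do m = 1 to 4;
         do n = m+1 to 4;
            /* Reset the row, then flag the column for the pair (m, n) */
            do j = 1 to 6;
               zin(j) = 0;
            end;
            y1 = m;
            y2 = n;
            t + 1;
            zin(t) = 1;
            output;
         end;
      end;
   end;
run;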

proc print data=zin (obs=12);

run;
Output 45.4.4 displays the full z matrix for the first two clusters. The z matrix is identical for all clusters in
this example.

Output 45.4.4 Full z Matrix Data Set

Obs z1 z2 z3 z4 z5 z6 Center ID y1 y2
1 1 0 0 0 0 0 1 1 1 2
2 0 1 0 0 0 0 1 1 1 3
3 0 0 1 0 0 0 1 1 1 4
4 0 0 0 1 0 0 1 1 2 3
5 0 0 0 0 1 0 1 1 2 4
6 0 0 0 0 0 1 1 1 3 4
7 1 0 0 0 0 0 1 2 1 2
8 0 1 0 0 0 0 1 2 1 3
9 0 0 1 0 0 0 1 2 1 4
10 0 0 0 1 0 0 1 2 2 3
11 0 0 0 0 1 0 1 2 2 4
12 0 0 0 0 0 1 1 2 3 4

The following statements fit the model for fully parameterized clusters by fully specifying the z matrix. The
results are identical to those shown previously.

proc gee data=Resp descend;

class ID Treatment Center Sex Baseline;
model Outcome=Treatment Center Sex Age Baseline / dist=bin;
repeated subject=ID(Center) / logor=zfull
zdata=zin
zrow =(z1-z6)
ypair=(y1 y2);
run;

Example 45.5: Alternating Logistic Regression for Ordinal Multinomial Data

This example illustrates how you use the GEE procedure and alternating logistic regression (ALR) to analyze
ordinal multinomial data. A clinical trial was conducted to evaluate the effectiveness of the drug auranofin
for treating arthritis (Lipsitz, Kim, and Zhao 1994). Patients were assigned to one of two groups: one group
was treated with auranofin, and the other group received a placebo. The treatment that a patient received is
recorded in the variable Treatment (coded here as 1 = auranofin and 0 = placebo).
The response was self-assessment of arthritis recorded at one-, three-, and five-month follow-up visits. The
responses are recorded in the Rating variable and are coded as 1 = very poor, 2 = poor, 3 = fair, 4 = good, and
5 = very good. This coding of Rating is finer than the coding in Lipsitz, Kim, and Zhao (1994), where only
three levels were used. The visit numbers are recorded in the classification variable Visit, whose value is 1,
3, or 5. An initial self-assessment that uses the same coding as Rating is recorded in the variable Baseline.
The variable Age records the participants’ ages (in years) at the baseline visit and is treated as a continuous
variable. One participant missed all visits and is not considered. There are an additional 15 missed visits from
eight participants who dropped out, and there are four participants who missed a single visit. A weighted
GEE is not used because the GEE procedure in SAS/STAT 14.1 does not support the weighted GEE method
for the multinomial distribution.
The following DATA step creates the data set Arthritis:

data Arthritis;
input ID Rating Sex Age Treatment Baseline Visit;
datalines;
1 4 2 54 2 2 1
1 5 2 54 2 2 3
1 5 2 54 2 2 5
2 4 1 41 1 3 1
2 4 1 41 1 3 3
2 4 1 41 1 3 5

... more lines ...

301 2 2 64 1 2 5
302 2 2 55 1 2 1
302 3 2 55 1 2 3
302 3 2 55 1 2 5
;
The following SAS statements use PROC GEE to fit a model that has a fully exchangeable log odds ratio
structure:

proc gee data=Arthritis;

class Sex ID Treatment Baseline Visit;
model Rating= Visit Treatment Baseline / dist=multinomial;
repeated subject=ID / within=Visit logor=exch;
run;
You specify LOGOR=EXCH in the REPEATED statement to select the ALR method that has a fully
exchangeable model for the log odds ratio. The results of the ALR model fitting are displayed in Output 45.5.1.

Output 45.5.1 Parameter Estimates for Arthritis Data Using ALR

The GEE Procedure

Parameter Estimates for Response Model
with Empirical Standard Error Estimates

Parameter   Estimate   Standard Error   95% Confidence Limits (lower, upper)   Z   Pr > |Z|
Intercept1 -6.7502 0.4267 -7.5865 -5.9138 -15.82 <.0001
Intercept2 -4.6310 0.3968 -5.4087 -3.8533 -11.67 <.0001
Intercept3 -2.6735 0.3749 -3.4083 -1.9387 -7.13 <.0001
Intercept4 -0.3838 0.3710 -1.1109 0.3433 -1.03 0.3008
Visit 1 0.3740 0.1148 0.1489 0.5991 3.26 0.0011
Visit 3 0.3641 0.1116 0.1455 0.5828 3.26 0.0011
Visit 5 0.0000 0.0000 0.0000 0.0000 . .
Treatment 1 0.5552 0.1673 0.2273 0.8830 3.32 0.0009
Treatment 2 0.0000 0.0000 0.0000 0.0000 . .
Baseline 1 3.9457 0.5352 2.8969 4.9946 7.37 <.0001
Baseline 2 3.3052 0.4268 2.4686 4.1418 7.74 <.0001
Baseline 3 2.7483 0.3790 2.0054 3.4911 7.25 <.0001
Baseline 4 1.4013 0.4132 0.5914 2.2113 3.39 0.0007
Baseline 5 0.0000 0.0000 0.0000 0.0000 . .
Alpha1 1.6447 0.1693 1.3130 1.9764 9.72 <.0001

The parameter Alpha1, which is used to estimate the log odds ratio, is included in Output 45.5.1.
To fit the ALR model, each response is coded as a vector of binary variables and the log odds ratio models the
association between pairs of responses. For more information about the log odds ratio and the ALR method
for ordinal multinomial data, see the section “ALR for Ordinal Multinomial Data” on page 3129. The ALR
model fit criteria are shown in Output 45.5.2.

Output 45.5.2 ALR Model Fit Criteria

GEE Fit Criteria

QIC 2241.9540
QICu 2259.8575

For comparison, the following SAS statements use PROC GEE to fit the same marginal model by using an
independent working correlation structure:

proc gee data=Arthritis;

class Sex ID Treatment Baseline Visit;
model Rating= Visit Treatment Baseline / dist=multinomial;
repeated subject=ID / within=Visit;
run;
When the data have multinomial responses, the independent working correlation structure is the only structure
supported for ordinary GEEs. In Output 45.5.1 and Output 45.5.3, you can see slight differences in the
parameter estimates between the model that you fit by using ALR and the model that you fit by using an
independent working correlation structure.

Output 45.5.3 Parameter Estimates for Arthritis Data Using Independent Working Correlation
The GEE Procedure

Parameter Estimates for Response Model
with Empirical Standard Error Estimates

Parameter   Estimate   Standard Error   95% Confidence Limits (lower, upper)   Z   Pr > |Z|
Intercept1 -6.7528 0.4227 -7.5812 -5.9244 -15.98 <.0001
Intercept2 -4.6719 0.3953 -5.4466 -3.8972 -11.82 <.0001
Intercept3 -2.7138 0.3730 -3.4449 -1.9828 -7.28 <.0001
Intercept4 -0.4129 0.3689 -1.1360 0.3102 -1.12 0.2631
Visit 1 0.3852 0.1160 0.1578 0.6125 3.32 0.0009
Visit 3 0.3725 0.1118 0.1534 0.5916 3.33 0.0009
Visit 5 0.0000 0.0000 0.0000 0.0000 . .
Treatment 1 0.5643 0.1679 0.2352 0.8933 3.36 0.0008
Treatment 2 0.0000 0.0000 0.0000 0.0000 . .
Baseline 1 3.9533 0.5351 2.9046 5.0020 7.39 <.0001
Baseline 2 3.3264 0.4250 2.4934 4.1593 7.83 <.0001
Baseline 3 2.7672 0.3769 2.0285 3.5059 7.34 <.0001
Baseline 4 1.4252 0.4112 0.6192 2.2312 3.47 0.0005
Baseline 5 0.0000 0.0000 0.0000 0.0000 . .

The QIC for the ALR model shown in Output 45.5.2 is 2241.95, whereas the QIC for the independent
working correlation model shown in Output 45.5.4 is 2269.82, indicating a slightly better fit for the ALR
model.
Output 45.5.4 Model Fit Criteria

GEE Fit Criteria

QIC 2269.8166
QICu 2259.7693

Example 45.6: GEE for Nominal Multinomial Data

This example illustrates how you use the GEE procedure to analyze nominal multinomial data. A two-year
study was conducted to assess the impact of access to Section 8 housing as a means of providing independent
housing to the severely mentally ill homeless (Hurlbut, Wood, and Hough 1996). In this study, half of the 362
clients received Section 8 housing certificates. The assignment of Section 8 housing certificates is recorded
in the variable Sec; 0 indicates clients who did not receive a certificate, and 1 indicates clients who received
a certificate.
Every six months during the study, research staff interviewed all 362 clients, who provided data about their
living arrangements in the previous 60 days. Clients’ living arrangements were also recorded during a
baseline interview. The time of interviews is recorded in the variable Time, whose value is 0, 6, 12, or 24 (for
the number of months since the study began). There were a total of 159 missed interviews. The variable
Housing records the living arrangement of a client and is coded as 0 (street living), 1 (community living), or
2 (independent living). The following statements create the data set Housing:

data Housing;
input ID Housing Time Sec;
datalines;
1 1 0 1
1 2 6 1
1 2 12 1
1 2 24 1
2 1 0 1
2 2 6 1

... more lines ...

362 1 0 0
362 1 6 0
362 1 12 0
362 1 24 0
;

The following SAS statements use PROC GEE to fit a model to nominal multinomial data:

proc gee data=Housing;

class ID Housing Time SEC;
model Housing=Sec / dist=multinomial link=glogit;
repeated subject=ID / within=Time;
run;
An ordinary GEE that has an independent working correlation structure is fit. This model is the only
option supported for data that have nominal multinomial responses. In the MODEL statement, you specify
LINK=GLOGIT to indicate that the responses are nominal. In the generalized logit model, you model
baseline category logits. By default, the GEE procedure chooses the last response category as the baseline
category. If your nominal response has $J$ categories, then the baseline logit for category $j$ and subject $i$ is

\[ \log(\mu_{ij}/\mu_{iJ}) = \eta_{ij} = x_i' \beta_j \]

and

\[ \mu_{ij} = \frac{\exp(\eta_{ij})}{\sum_{k=1}^{J} \exp(\eta_{ik})}, \qquad \eta_{iJ} = 0 \]
The results of fitting the model are displayed in Output 45.6.1.

Output 45.6.1 Results of Model Fitting

The GEE Procedure

Parameter Estimates for Response Model
with Empirical Standard Error Estimates

Parameter   Housing   Estimate   Standard Error   95% Confidence Limits (lower, upper)   Z   Pr > |Z|
Intercept 0 -0.9532 0.1266 -1.2013 -0.7051 -7.53 <.0001
Intercept 1 -0.6562 0.1064 -0.8647 -0.4477 -6.17 <.0001
Sec 0 0 0.9226 0.1850 0.5599 1.2853 4.99 <.0001
Sec 0 1 1.2645 0.1642 0.9426 1.5863 7.70 <.0001
Sec 1 0 0.0000 0.0000 0.0000 0.0000 . .
Sec 1 1 0.0000 0.0000 0.0000 0.0000 . .

The positive estimates for the classification variable Sec = 0 at each response category, Housing = 0 and 1,
indicate an increased probability that a client will live independently when given access to Section 8 housing.
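The fitted category probabilities make this interpretation concrete. The following statements are a sketch that applies the generalized logit formulas to the rounded estimates in Output 45.6.1, with Housing = 2 (independent living) as the baseline category; the data set name Predicted and the variable names are illustrative.

```sas
/* Fitted probabilities of each living arrangement by certificate status */
data Predicted;
   do Sec = 0 to 1;
      /* (Sec = 0) evaluates to 1 for clients without a certificate, 0 otherwise */
      eta0 = -0.9532 + 0.9226*(Sec = 0);   /* street versus independent    */
      eta1 = -0.6562 + 1.2645*(Sec = 0);   /* community versus independent */
      denom = exp(eta0) + exp(eta1) + 1;   /* baseline category has eta = 0 */
      p_street      = exp(eta0) / denom;
      p_community   = exp(eta1) / denom;
      p_independent = 1 / denom;
      output;
   end;
run;

proc print data=Predicted noobs;
   var Sec p_street p_community p_independent;
   format p_street p_community p_independent 6.3;
run;
```

The fitted probability of independent living is roughly 0.53 for clients who received a certificate and roughly 0.26 for clients who did not.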
The model fit criteria are shown in Output 45.6.2.

Output 45.6.2 Model Fit Criteria

GEE Fit Criteria

QIC 2675.2174
QICu 2671.4680

For comparison, the following SAS statements treat the responses as ordinal and use PROC GEE to fit a
marginal model by using an independent working correlation structure:

proc gee data=Housing;

class ID Housing Time SEC;
model Housing=Sec / dist=multinomial;
repeated subject=ID / within=Time;
run;
The cumulative logit link function is the default option that is used to fit the model. Because the generalized
logit link function is not specified, the responses are treated as ordinal multinomial data. The results for the
model that is fit by treating the responses as ordinal are displayed in Output 45.6.3.

Output 45.6.3 Results of Model Fitting

The GEE Procedure

Parameter Estimates for Response Model
with Empirical Standard Error Estimates

Parameter   Estimate   Standard Error   95% Confidence Limits (lower, upper)   Z   Pr > |Z|
Intercept1 -1.6917 0.1242 -1.9352 -1.4481 -13.62 <.0001
Intercept2 0.0112 0.0960 -0.1770 0.1994 0.12 0.9072
Sec 0 0.8224 0.1327 0.5624 1.0824 6.20 <.0001
Sec 1 0.0000 0.0000 0.0000 0.0000 . .

Treating the responses as ordinal results in a single parameter estimate that is related to the classification
variable Sec. The QIC for the model that is fit by treating the responses as nominal (shown in Output 45.6.2) is
2675.22, whereas the QIC for the model that is fit by treating the responses as ordinal (shown in Output 45.6.4)
is 2710.50, indicating a slightly better fit when the responses are treated as nominal.

Output 45.6.4 Model Fit Criteria

GEE Fit Criteria

QIC 2710.4971
QICu 2707.2983

References
Boos, D. (1992). “On Generalized Score Tests.” American Statistician 46:327–333.

Carey, V., Zeger, S. L., and Diggle, P. J. (1993). “Modelling Multivariate Binary Data with Alternating
Logistic Regressions.” Biometrika 80:517–526.

Diggle, P. J., Heagerty, P., Liang, K.-Y., and Zeger, S. L. (2002). Analysis of Longitudinal Data. 2nd ed. New
York: Oxford University Press.

Diggle, P. J., Liang, K.-Y., and Zeger, S. L. (1994). Analysis of Longitudinal Data. Oxford: Clarendon Press.

Fitzmaurice, G. M., Laird, N. M., and Ware, J. H. (2011). Applied Longitudinal Analysis. 2nd ed. Hoboken,
NJ: John Wiley & Sons.

Fitzmaurice, G. M., Molenberghs, G., and Lipsitz, S. R. (1995). “Regression Models for Longitudinal Binary
Responses with Informative Drop-Outs.” Journal of the Royal Statistical Society, Series B 57:691–704.

Hardin, J. W., and Hilbe, J. M. (2003). Generalized Estimating Equations. Boca Raton, FL: Chapman &
Hall/CRC.

Heagerty, P., and Zeger, S. L. (1996). “Marginal Regression Models for Clustered Ordinal Measurements.”
Journal of the American Statistical Association 91:1024–1036.

Hurlbut, M. S., Wood, P. A., and Hough, R. L. (1996). “Providing Independent Housing for the Homeless
Mentally Ill: A Novel Approach to Evaluating Long-Term Longitudinal Housing Patterns.” Journal of
Community Psychology 24:291–310.

Liang, K.-Y., and Zeger, S. L. (1986). “Longitudinal Data Analysis Using Generalized Linear Models.”
Biometrika 73:13–22.

Lipsitz, S. R., Fitzmaurice, G. M., Orav, E. J., and Laird, N. M. (1994). “Performance of Generalized
Estimating Equations in Practical Situations.” Biometrics 50:270–278.

Lipsitz, S. R., Kim, K., and Zhao, L. (1994). “Analysis of Repeated Categorical Data Using Generalized
Estimating Equations.” Statistics in Medicine 13:1149–1163.

Littell, R. C., Freund, R. J., and Spector, P. C. (1991). SAS System for Linear Models. 3rd ed. Cary, NC: SAS
Institute Inc.

Mallinckrodt, C. (2013). Preventing and Treating Missing Data in Longitudinal Clinical Trials: A Practical
Guide. Cambridge: Cambridge University Press.

McCullagh, P., and Nelder, J. A. (1989). Generalized Linear Models. 2nd ed. London: Chapman & Hall.

Molenberghs, G., and Kenward, M. G. (2007). Missing Data in Clinical Studies. New York: John Wiley &
Sons.

O’Kelly, M., and Ratitch, B. (2014). Clinical Trials with Missing Data: A Guide for Practitioners. Chichester,
UK: John Wiley & Sons.

Pan, W. (2001). “Akaike’s Information Criterion in Generalized Estimating Equations.” Biometrics 57:120–125.

Preisser, J. S., Lohman, K. K., and Rathouz, P. J. (2002). “Performance of Weighted Estimating Equations
for Longitudinal Binary Data with Drop-Outs Missing at Random.” Statistics in Medicine 21:3035–3054.

Robins, J. M., and Rotnitzky, A. (1995). “Semiparametric Efficiency in Multivariate Regression Models with
Missing Data.” Journal of the American Statistical Association 90:122–129.

Rotnitzky, A., and Jewell, N. P. (1990). “Hypothesis Testing of Regression Parameters in Semiparametric
Generalized Linear Models for Cluster Correlated Data.” Biometrika 77:485–497.

Rubin, D. B. (1976). “Inference and Missing Data.” Biometrika 63:581–592.

Stokes, M. E., Davis, C. S., and Koch, G. G. (2012). Categorical Data Analysis Using SAS. 3rd ed. Cary,
NC: SAS Institute Inc.

Subject Index

confidence intervals
  confidence coefficient, 3117
convergence criterion
  GEE procedure, 3121
correlated data
  GEE procedure, 3125, 3132

dispersion parameter weights
  GEE procedure, 3125

events/trials format for response
  GEE procedure, 3116

GEE procedure
  convergence criterion, 3121
  correlated data, 3125, 3132
  dispersion parameter weights, 3125
  events/trials format for response, 3116
  GEE, 3120, 3150
  generalized estimating equations (GEE), 3125
  initial values, 3122
  intercept, 3118
  logistic regression, 3104
  offset, 3118
  output ODS Graphics table names, 3138
  output table names, 3137
  QIC, 3127
  quasi-likelihood functions, 3128
  quasi-likelihood information criterion (QIC), 3127
  repeated measures, 3125, 3132
  Type 3 analysis, 3136
  working correlation matrix, 3122, 3123, 3126
generalized estimating equations (GEE), 3120, 3150
  GEE procedure, 3125

initial values
  GEE procedure, 3122
intercept
  GEE procedure, 3118

logistic regression
  GEE procedure, 3104

offset
  GEE procedure, 3118
options summary
  ESTIMATE statement, 3112
output ODS Graphics table names
  GEE procedure, 3138
output table names
  GEE procedure, 3137

probability distribution, built-in
  GEE procedure, 3117

QIC
  GEE procedure, 3127
quasi-likelihood functions
  GEE procedure, 3128
quasi-likelihood information criterion (QIC)
  GEE procedure, 3127

repeated measures
  GEE procedure, 3125, 3132

Type 3 analysis
  GEE procedure, 3136

weighted generalized estimating equations (WGEE), 3132
working correlation matrix
  GEE procedure, 3122, 3123, 3126

Syntax Index

ALPHA= option
  GEE procedure, MODEL statement, 3117
ALPHAINIT= option
  REPEATED statement (GEE), 3121

BY statement
  GEE procedure, 3109

CLASS statement
  GEE procedure, 3110
CONVERGE= option
  REPEATED statement, 3121
CORR= option
  REPEATED statement, 3123
CORRB option
  REPEATED statement, 3122
CORRW option
  REPEATED statement, 3122
COVB option
  REPEATED statement, 3122

DATA= option
  PROC GEE statement, 3108
DESCENDING option
  CLASS statement, 3110
  PROC GEE statement, 3109
DIST= option
  MODEL statement, 3117
DSCALE
  MODEL statement, 3118

ECORRB option
  REPEATED statement, 3122
ECOVB option
  REPEATED statement, 3122
EFFECTPLOT statement
  GEE procedure, 3111
ERR= option
  MODEL statement, 3117
ESTIMATE statement
  GEE procedure, 3112

FREQ statement
  GEE procedure, 3113

GEE procedure
  syntax, 3107
GEE procedure, BY statement, 3109
GEE procedure, CLASS statement, 3110
  DESCENDING option, 3110
  ORDER= option, 3110
GEE procedure, EFFECTPLOT statement, 3111
GEE procedure, ESTIMATE statement, 3112
GEE procedure, FREQ statement, 3113
GEE procedure, LSMEANS statement, 3113
GEE procedure, LSMESTIMATE statement, 3114
GEE procedure, MISSMODEL statement, 3115
  MAXWEIGHT option, 3116
  TYPE= option, 3116
GEE procedure, MODEL statement, 3116
  ALPHA= option, 3117
  DIST= option, 3117
  ERR= option, 3117
  LINK= option, 3118
  NOINT option, 3118
  NOSCALE option, 3118
  OFFSET= option, 3118
  SCALE= option, 3118
  TYPE3 option, 3119
  WALD option, 3119
GEE procedure, OUTPUT statement, 3119
  keyword= option, 3119
  OUT= option, 3119
GEE procedure, PROC GEE statement, 3108
  DATA= option, 3108
  DESCENDING option, 3109
  NAMELEN= option, 3109
  PLOTS option, 3109
GEE procedure, REPEATED statement, 3120
  ALPHAINIT= option, 3121
  CONVERGE= option, 3121
  CORR= option, 3123
  CORRB option, 3122
  CORRW option, 3122
  COVB option, 3122
  ECORRB option, 3122
  ECOVB option, 3122
  INITIAL= option, 3122
  INTERCEPT= option, 3122
  MAXITER= option, 3123
  MCORRB option, 3123
  MCOVB option, 3123
  MODELSE option, 3123
  SUBCLUSTER= option, 3123
  SUBJECT= option, 3121
  TYPE= option, 3123
  WITHIN= option, 3124
  WITHINSUBJECT= option, 3124
  ZDATA= option, 3124
  ZROW= option, 3124
GEE procedure, SLICE statement, 3124
GEE procedure, STORE statement, 3124
GEE procedure, WEIGHT statement, 3125

INITIAL= option
  REPEATED statement, 3122
INTERCEPT= option
  REPEATED statement, 3122

keyword= option
  OUTPUT statement (GEE), 3119

LINK= option
  MODEL statement, 3118
LSMEANS statement
  GEE procedure, 3113
LSMESTIMATE statement
  GEE procedure, 3114

MAXITER= option
  REPEATED statement, 3123
MAXWEIGHT= option
  MISSMODEL statement, 3116
MCORRB option
  REPEATED statement, 3123
MCOVB option
  REPEATED statement, 3123
MISSMODEL statement
  GEE procedure, 3115
MODEL statement
  GEE procedure, 3116
MODELSE option
  REPEATED statement, 3123

NAMELEN= option
  PROC GEE statement, 3109
NOINT option
  MODEL statement, 3118
NOSCALE option
  MODEL statement, 3118

OFFSET= option
  MODEL statement, 3118
ORDER= option
  CLASS statement, 3110
OUT= option
  OUTPUT statement (GEE), 3119
OUTPUT statement
  GEE procedure, 3119

PLOTS option
  PROC GEE statement, 3109
PROC GEE statement, see GEE procedure
PSCALE
  MODEL statement, 3118

REPEATED statement
  GEE procedure, 3120

SCALE= option
  MODEL statement, 3118
SLICE statement
  GEE procedure, 3124
STORE statement
  GEE procedure, 3124
SUBCLUSTER= option
  REPEATED statement (GEE), 3123
SUBJECT= option
  REPEATED statement, 3121

TYPE3 option
  MODEL statement (GEE), 3119
TYPE= option
  MISSMODEL statement (WGEE), 3116
  REPEATED statement, 3123

WALD option
  MODEL statement (GEE), 3119
WEIGHT statement
  GEE procedure, 3125
WITHIN= option
  REPEATED statement, 3124
WITHINSUBJECT= option
  REPEATED statement, 3124

ZDATA= option
  REPEATED statement (GEE), 3124
ZROW= option
  REPEATED statement (GEE), 3124

