An Application of The Biprobit Heckman Selection Model To Correct Estimates of HIV Prevalence From Sample Surveys
Contents
1 Background
2 Data
3 Method
3.1 Multi-stage Approach
3.1.1 Correction for Selection Bias and Calculation of HIV Prevalence
3.1.2 Model Specifications
3.2 Bärnighausen et al. Approach
3.2.1 Correction for Selection Bias and Calculation of HIV Prevalence
3.2.2 Model Specifications
1 Background
This is a succinct summary of our application of the Heckman selection model approach, which corrects
estimates of HIV prevalence from sample surveys, to HIV biomarker data recently collected at the Agincourt
HDSS in South Africa. A sex-age-stratified sample was drawn from the 30,000 individuals ages 15+
alive and resident in the DSS in 2010. With respect to the sampled individuals, the survey proceeded as
follows:
1. attempt to make contact; result: found or not-found
2. for those who were found, attempt to interview; result: interviewed or not
3. for those who were interviewed, attempt to collect biomarkers; result: biomarkers or not
4. for those who provided biomarkers, test biomarkers; result: positive or negative
Consequently, there are three decision points at which the sample is subdivided. At each of these, unmeasured
factors could have produced a selection effect that results in the selected fraction of the sample being
systematically different from the not-selected fraction. At each of these stages we can use a Heckman
selection model to attempt to identify and correct for the selection bias. We attempt to do this methodically
so that we can predict the HIV status of everyone in the original sample. Working down the list above, there
are three subgroups of the sample for which HIV status is not observed:
1. not-found
2. found but not-interviewed
3. interviewed but not-tested
2 Data
The Agincourt health and demographic surveillance system is located in rural northeast South Africa. Since
1992 the study has conducted annual censuses of all households in 21 study villages. Vital events, migrations
and many other things are described at each census. During 2010-11 the study conducted a sample survey
that collected data describing HIV and NCD risk factors and biomarkers for both on a sex-age-stratified
sample of everyone fifteen years old and older. Those data inform this work.
Of relevance, the main livelihood for the study population is cyclic labor migration to a variety of locations
outside of the study site with periods on the daily, weekly, monthly and annual time scales. Both men
and women engage in this labor migration, but it is predominantly men. This situation impacted the
sample survey such that many more men than women were not found, even after repeated attempts to
locate them. This sex-specific non-response undoubtedly affects the raw results of the survey. It is hoped
that the Heckman selection model correction procedure can help reduce the effect of this sex-specific non-
response.
3 Method
There are two ways of approaching the correction of estimated population-level HIV prevalence using the
Heckman selection model. Both predict the probability of being HIV positive for subgroups of the population
that were not tested, but the structure of those predictions differs. We first describe our multi-stage
approach and then the Bärnighausen et al. approach.
3.1 Multi-stage Approach
There are three selection decisions that progressively divide the Agincourt sample into subgroups. The first
is whether or not a sampled individual was found. This creates found (F) and not-found (nF) subgroups.
Then within the F group, individuals can either agree (I) or not agree (nI) to being interviewed, thus
creating found, interviewed (F:I) and found, not-interviewed (F:nI) subgroups. Further within the F:I
group, individuals can either agree to be tested (T) or not (nT), and this results in found, interviewed,
tested (F:I:T) and found, interviewed, not-tested (F:I:nT) subgroups. The final set of four subgroups
defined by interview outcome is:
1. not-found (nF),
2. found & not-interviewed (F:nI),
3. found & interviewed & not-tested (F:I:nT), and
4. found & interviewed & tested (F:I:T).
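The four-way categorization above can be sketched in a few lines of Python. This is only an illustration: the function name, the record layout (found, interviewed, tested) and the sample records are hypothetical and are not taken from the Agincourt data.

```python
from collections import Counter


def classify(found, interviewed=None, tested=None):
    """Return the subgroup label (nF, F:nI, F:I:nT or F:I:T) for one person.

    Later flags are only meaningful when the earlier ones are true, which
    mirrors the sequential structure of the survey.
    """
    if not found:
        return "nF"
    if not interviewed:
        return "F:nI"
    if not tested:
        return "F:I:nT"
    return "F:I:T"


# Hypothetical records: (found, interviewed, tested)
sample = [
    (False, None, None),
    (True, False, None),
    (True, True, False),
    (True, True, True),
    (True, True, True),
]

# Every sampled individual falls in exactly one of the four subgroups.
counts = Counter(classify(*record) for record in sample)
```

Because each person receives exactly one label, the subgroup counts always sum to the sample size.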
Clearly we only have an HIV status for the F:I:T group. Our method is designed to predict the HIV status
for everyone else¹, taking account of the selection effect at each stage of selection. Figures 1 and 2 display the
hierarchical categorization of the Agincourt sample. The four observed subgroups listed above are along the
right-hand side of Figure 1 in green, together with the higher level groups from which they disaggregate. The
box in the extreme lower right corner contains the HIV positive individuals whom we have observed, labeled
1.
The remainder of Figure 1 displays how respondents would be subdivided in the counterfactual situation
in which we could observe them. Starting from the right with the F:I:nT group, we can imagine that they
are either HIV positive or negative. In reality they are one or the other, but we do not know their
HIV status. The biprobit Heckman selection model allows us to model this situation and obtain an estimate
for the probability that an unobserved individual in the F:I:nT group is HIV positive. The model used to
accomplish this, M3, operates on the individuals in the red box labeled M3. The predicted probability of
being HIV positive in this subgroup is written as Pr(+|nT) along the leg of the diagram leading to the HIV
positive outcome in that subgroup. These predicted probabilities allow us to identify the fraction of the
F:I:nT group that is HIV positive, labeled 2.
Now we move to the F:nI group. Again, we imagine the counterfactual in which they are interviewed,
either tested or not tested, and finally either HIV positive or negative. The red box labeled M2 contains the
subgroups associated with this counterfactual. Another biprobit Heckman selection model, M2, predicts
the probability of being tested for those who are in the F:nI group, Pr(T|I), and those probabilities are
used to divide the F:nI group into either F:nI:T, with probability Pr(T|I), or F:nI:nT, with probability
1 - Pr(T|I). To estimate the HIV status of these tested and not-tested subgroups, we need the conditional
probabilities of being HIV positive for individuals who are either tested or not tested. To acquire these, we
make the critical assumption that individuals in the tested and not-tested subgroups of the F:nI group have
the same probabilities of being HIV positive as the tested and not-tested individuals in the F:I group. We
know the probabilities of being HIV positive for the F:I:T group, and we can use these to impute the
probabilities of being HIV positive for the F:nI:T group, Pr(+|T). This gives us the fraction of the F:nI:T
group who are HIV positive, labeled 3. Finally, we can borrow the probabilities of being HIV positive in the
F:I:nT group, Pr(+|nT), that we have already predicted using model M3 to be the probabilities of being HIV
positive in the F:nI:nT group. This yields the HIV cases labeled 4.
1 All of the other imaginary groups identified using the same concatenated abbreviation notation, e.g. nF:nI:nT are those
Figure 1. Agincourt Probability Model Found Side.
Figure 2. Agincourt Probability Model Not-Found Side.
Figure 2 displays the counterfactual categorization of the nF group in a manner similar to Figure 1 for the F
group. The probabilities necessary to calculate the number of HIV positive individuals in each of the tested
and not-tested subgroups, labeled 5-8, are all displayed in the diagram. The probabilities are estimated
in a manner analogous to those displayed in Figure 1. One more biprobit Heckman selection model, M1, is
used to estimate the probability of being interviewed in the nF group, Pr(I|F), and the probability of being
tested in the nF:I group, Pr(T|I), is imputed from observations in the F:I and F:nI groups.
Finally, the population HIV prevalence can be calculated by taking an average over the whole population
consisting of: 1) HIV status (coded 1=positive, 0=negative) in the F:I:T group (HIV cases labeled 1), and
2) the probabilities of being HIV positive in the F:I:nT group (HIV cases labeled 2), F:nI group (HIV cases
labeled 3-4), and finally the nF group (HIV cases labeled 5-8). Each individual in the sample appears
in one and only one of those subgroups, so this global average provides a correctly weighted population
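prevalence. The estimated HIV prevalence for the population is:
\[
\begin{aligned}
P_{HIV} = \frac{1}{N} \Bigg[ & \sum_{i=1}^{a} [H]_i \\
& + \sum_{j=1}^{b} \Pr(+ \mid nT)_j \\
& + \sum_{k=1}^{c} \Big( \Pr(T \mid I)_k \Pr(+ \mid T)_k + \big(1 - \Pr(T \mid I)_k\big) \Pr(+ \mid nT)_k \Big) \\
& + \sum_{l=1}^{d} \Pr(I \mid F) \Big[ \Pr(T \mid I)_l \Pr(+ \mid T)_l + \big(1 - \Pr(T \mid I)_l\big) \Pr(+ \mid nT)_l \Big] \\
& + \sum_{l=1}^{d} \big(1 - \Pr(I \mid F)\big) \Big[ \Pr(T \mid I)_l \Pr(+ \mid T)_l + \big(1 - \Pr(T \mid I)_l\big) \Pr(+ \mid nT)_l \Big] \Bigg]
\end{aligned}
\tag{1}
\]
where N is the total number of individuals in the population, a is the number of individuals in the F:I:T
group indexed by i; b the number in the F:I:nT group indexed by j; c the number in the F:nI group indexed
by k and d the number in the nF group with index l. Pr(+|T) and Pr(+|nT) are the probabilities of being
HIV positive for individuals who are tested and not tested, respectively.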
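As a rough illustration of how Equation 1 combines observed statuses with predicted probabilities, the following Python sketch averages over the four subgroups. The function name, argument layout and numbers are hypothetical; in practice the probabilities would come from the fitted models M1-M3, and any survey weights would have to be handled separately.

```python
def multistage_prevalence(fit_status, fint_p_pos, fni_rows, nf_rows):
    """Average observed HIV status and predicted probabilities over the sample.

    fit_status : 0/1 HIV test results for the F:I:T group
    fint_p_pos : Pr(+|not tested) for each person in the F:I:nT group
    fni_rows   : (p_test, p_pos_tested, p_pos_not_tested) per person in F:nI
    nf_rows    : (p_int, p_test, p_pos_tested, p_pos_not_tested) per person in nF
    """
    def mix(p_test, p_pos_tested, p_pos_not_tested):
        # Law of total probability over the unobserved testing decision.
        return p_test * p_pos_tested + (1.0 - p_test) * p_pos_not_tested

    total = float(sum(fit_status))
    total += sum(fint_p_pos)
    total += sum(mix(*row) for row in fni_rows)
    for p_int, p_test, p_pos_tested, p_pos_not_tested in nf_rows:
        inner = mix(p_test, p_pos_tested, p_pos_not_tested)
        # As printed in Equation 1, the interviewed and not-interviewed
        # branches share the same inner term.
        total += p_int * inner + (1.0 - p_int) * inner

    n = len(fit_status) + len(fint_p_pos) + len(fni_rows) + len(nf_rows)
    return total / n


# Hypothetical inputs, for illustration only.
prevalence = multistage_prevalence(
    fit_status=[1, 0, 0, 0],
    fint_p_pos=[0.5],
    fni_rows=[(0.5, 0.2, 0.4)],
    nf_rows=[(0.5, 0.5, 0.2, 0.4)],
)
```

Each person contributes exactly one term, either an observed 0/1 status or a predicted probability, so the result is a simple mean over the whole sample.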
The biprobit Heckman selection models that estimate the selection effects and predict the unobserved
outcomes are specified in detail below. For all three specifications z_i is a row vector of values for individual
i for the variables in the selection equation, and x_i is a row vector of data values for individual i for the
variables in the outcome equation. Likewise γ and β are the column vectors of model coefficients for the
selection and outcome equations, respectively.
Throughout the model specification we use Iverson bracket notation to represent indicator variables:
\[
[S] =
\begin{cases}
1 & \text{if } S \text{ is true} \\
0 & \text{if } S \text{ is false}
\end{cases}
\]
3.1.2.1 Model 1
Model 1 is estimated using everyone in the sample who is eligible and alive. The selection equation for Model
1 with [F ] on the lefthand side is:
where there are no selection variables that appear only in the selection equation. The Heckman ρ is
estimated solely on the basis of the shape of the joint error distribution. The reference categories are:
sex = female, age = 15–19, village = 1, migrant = 0 (no recent history of migration), and SES = 1 (the
first and poorest quintile of the SES distribution). Sex and age are fully interacted.
The outcome equation for Model 1 with [F : I] on the lefthand side is:
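\[
\begin{aligned}
\Pr([F{:}I_i] \mid x_i) &= \Phi(\eta_i) \\
\eta_i &= x_i \beta + \varepsilon_{oi} \\
\eta_i &= \beta_0 + \beta_1 [\text{sex}_i = \text{male}] \\
&\quad + \beta_2 [\text{age}_i = \text{20--24}] + \dots + \beta_{14} [\text{age}_i = \text{80--84}] \\
&\quad + \beta_{15} [\text{age}_i = \text{20--24} \wedge \text{sex}_i = \text{male}] + \dots + \beta_{27} [\text{age}_i = \text{80--84} \wedge \text{sex}_i = \text{male}] \\
&\quad + \beta_{28} [\text{village}_i = 2] + \dots + \beta_{47} [\text{village}_i = 21] \\
&\quad + \beta_{48} [\text{migrant}_i] + \beta_{49} [\text{SES}_i = 2] + \dots + \beta_{52} [\text{SES}_i = 5] \\
&\quad + \varepsilon_{oi}
\end{aligned}
\tag{3}
\]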
where the reference categories and interaction structure are the same as the selection equation in Model 1,
Equation 2.
\[
\rho = \operatorname{corr}(\varepsilon_s, \varepsilon_o) \tag{4}
\]
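To make the indicator coding concrete, here is a minimal Python sketch that builds one row of the design matrix implied by the Model 1 outcome equation: an intercept plus 52 dummy variables. The function name and argument format are hypothetical, and the five-year age groups running from 15–19 to 80–84 are an assumption read off the coefficient numbering.

```python
# Five-year age groups 15-19, 20-24, ..., 80-84; the first is the reference.
AGE_GROUPS = [f"{lo}-{lo + 4}" for lo in range(15, 85, 5)]


def design_row(sex, age_group, village, migrant, ses):
    """Return the intercept plus 52 indicator values for one individual."""
    male = int(sex == "male")
    age_dummies = [int(age_group == g) for g in AGE_GROUPS[1:]]   # 13 age terms

    row = [1, male]                                      # intercept, sex
    row += age_dummies                                   # age main effects
    row += [d * male for d in age_dummies]               # age-by-sex interactions
    row += [int(village == v) for v in range(2, 22)]     # villages 2..21
    row.append(int(bool(migrant)))                       # recent migration history
    row += [int(ses == q) for q in range(2, 6)]          # SES quintiles 2..5
    return row


reference_person = design_row("female", "15-19", 1, 0, 1)
other_person = design_row("male", "20-24", 2, 1, 5)
```

A person in every reference category has only the intercept switched on, which is what identifies the intercept as the linear predictor for that group.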
3.1.2.2 Model 2
Model 2 is estimated using everyone in the sample who was found, F. The selection equation for Model 2
with [F : I] on the lefthand side is:
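\[
\begin{aligned}
\Pr([F{:}I_i] \mid z_i) &= \Phi(\eta_i^{*}) \\
\eta_i^{*} &= z_i \gamma + \varepsilon_{si} \\
\eta_i^{*} &= \gamma_0 + \gamma_1 [\text{sex}_i = \text{male}] \\
&\quad + \gamma_2 [\text{age}_i = \text{20--24}] + \dots + \gamma_{14} [\text{age}_i = \text{80--84}] \\
&\quad + \gamma_{15} [\text{age}_i = \text{20--24} \wedge \text{sex}_i = \text{male}] + \dots + \gamma_{27} [\text{age}_i = \text{80--84} \wedge \text{sex}_i = \text{male}] \\
&\quad + \gamma_{28} [\text{village}_i = 2] + \dots + \gamma_{47} [\text{village}_i = 21] \\
&\quad + \gamma_{48} [\text{migrant}_i] + \gamma_{49} [\text{SES}_i = 2] + \dots + \gamma_{52} [\text{SES}_i = 5] \\
&\quad + \varepsilon_{si}
\end{aligned}
\tag{5}
\]
where there are no selection variables that appear only in the selection equation. The Heckman ρ is
estimated solely on the basis of the shape of the joint error distribution. The reference categories and
interaction structure are the same as Equation 2.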
The outcome equation for Model 2 with [F : I: T ] on the lefthand side is:
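\[
\begin{aligned}
\Pr([F{:}I{:}T_i] \mid x_i) &= \Phi(\eta_i) \\
\eta_i &= x_i \beta + \varepsilon_{oi} \\
\eta_i &= \beta_0 + \beta_1 [\text{sex}_i = \text{male}] \\
&\quad + \beta_2 [\text{age}_i = \text{20--24}] + \dots + \beta_{14} [\text{age}_i = \text{80--84}] \\
&\quad + \beta_{15} [\text{age}_i = \text{20--24} \wedge \text{sex}_i = \text{male}] + \dots + \beta_{27} [\text{age}_i = \text{80--84} \wedge \text{sex}_i = \text{male}] \\
&\quad + \beta_{28} [\text{village}_i = 2] + \dots + \beta_{47} [\text{village}_i = 21] \\
&\quad + \beta_{48} [\text{migrant}_i] + \beta_{49} [\text{SES}_i = 2] + \dots + \beta_{52} [\text{SES}_i = 5] \\
&\quad + \varepsilon_{oi}
\end{aligned}
\tag{6}
\]
where, again, the reference categories and interaction structure are the same as Equation 2.
\[
\rho = \operatorname{corr}(\varepsilon_s, \varepsilon_o) \tag{7}
\]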
3.1.2.3 Model 3
Model 3 is estimated on everyone in the sample who was found and interviewed, F:I. The selection equation
for Model 3 with [F : I: T ] on the lefthand side is:
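\[
\begin{aligned}
\Pr([F{:}I{:}T_i] \mid z_i) &= \Phi(\eta_i^{*}) \\
\eta_i^{*} &= z_i \gamma + \varepsilon_{si} \\
\eta_i^{*} &= \gamma_0 + \gamma_1 [\text{sex}_i = \text{male}] \\
&\quad + \gamma_2 [\text{age}_i = \text{20--24}] + \dots + \gamma_{14} [\text{age}_i = \text{80--84}] \\
&\quad + \gamma_{15} [\text{age}_i = \text{20--24} \wedge \text{sex}_i = \text{male}] + \dots + \gamma_{27} [\text{age}_i = \text{80--84} \wedge \text{sex}_i = \text{male}] \\
&\quad + \gamma_{28} [\text{village}_i = 2] + \dots + \gamma_{47} [\text{village}_i = 21] \\
&\quad + \gamma_{48} [\text{migrant}_i] + \gamma_{49} [\text{SES}_i = 2] + \dots + \gamma_{52} [\text{SES}_i = 5] \\
&\quad + \gamma_{53} [\text{fieldworker}_i = 2] + \dots + \gamma_{63} [\text{fieldworker}_i = 11] \\
&\quad + \varepsilon_{si}
\end{aligned}
\tag{8}
\]
where the selection variable unrelated to the outcome is fieldworker. The reference categories and
interaction structure are the same as Equation 2.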
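The outcome equation for Model 3 with [H] on the left-hand side is:
\[
\begin{aligned}
\Pr([H_i] \mid x_i) &= \Phi(\eta_i) \\
\eta_i &= x_i \beta + \varepsilon_{oi} \\
\eta_i &= \beta_0 + \beta_1 [\text{sex}_i = \text{male}] \\
&\quad + \beta_2 [\text{age}_i = \text{20--24}] + \dots + \beta_{14} [\text{age}_i = \text{80--84}] \\
&\quad + \beta_{15} [\text{age}_i = \text{20--24} \wedge \text{sex}_i = \text{male}] + \dots + \beta_{27} [\text{age}_i = \text{80--84} \wedge \text{sex}_i = \text{male}] \\
&\quad + \beta_{28} [\text{village}_i = 2] + \dots + \beta_{47} [\text{village}_i = 21] \\
&\quad + \beta_{48} [\text{migrant}_i] + \beta_{49} [\text{SES}_i = 2] + \dots + \beta_{52} [\text{SES}_i = 5] \\
&\quad + \varepsilon_{oi}
\end{aligned}
\tag{9}
\]
\[
\rho = \operatorname{corr}(\varepsilon_s, \varepsilon_o) \tag{10}
\]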
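For readers who want to see how a selection/outcome pair such as Model 3 is tied together, the block below gives the textbook log-likelihood of a probit model with sample selection and the implied probability for a non-selected individual. It is a standard form and is not taken from the source. Here γ and β denote the selection and outcome coefficient vectors, ρ the correlation of the two error terms, Φ and Φ₂ the standard univariate and bivariate normal distribution functions, and s_i and y_i are our own labels for the selection indicator (for example, tested) and the outcome indicator (for example, HIV positive).

```latex
\begin{aligned}
\ell(\gamma, \beta, \rho)
  &= \sum_{i:\, s_i = 1,\, y_i = 1} \ln \Phi_2\!\left(z_i\gamma,\; x_i\beta;\; \rho\right) \\
  &\quad + \sum_{i:\, s_i = 1,\, y_i = 0} \ln \Phi_2\!\left(z_i\gamma,\; -x_i\beta;\; -\rho\right) \\
  &\quad + \sum_{i:\, s_i = 0} \ln \Phi\!\left(-z_i\gamma\right), \\[4pt]
\Pr(y_i = 1 \mid s_i = 0)
  &= \frac{\Phi_2\!\left(x_i\beta,\; -z_i\gamma;\; -\rho\right)}{\Phi\!\left(-z_i\gamma\right)} .
\end{aligned}
```

The second expression is the kind of quantity used to predict an unobserved outcome, such as Pr(+|nT) for the F:I:nT group. When ρ = 0 it reduces to Φ(x_iβ), so selection carries no information about the outcome.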
3.2 Bärnighausen et al. Approach
The Bärnighausen et al. approach (REF) divides the population into three groups:
1. those who are not contacted (nCT),
2. those who are contacted but do not consent to testing (CT:nCS), and
3. those who are contacted and consent to testing (CT:CS).
Again, we only have an HIV status for the CT:CS group. The Bärnighausen et al. approach is
designed to predict the HIV status of the other two groups. Figures 3a and 3b display the categorization of
the Agincourt sample according to this approach. There are four ways to be HIV positive in this scheme,
labeled A-D in the two panels of the figure.
The left panel of Figure 3 displays the consent model used to predict the probability of being HIV positive for
individuals who are contacted but do not consent to testing, Pr(+|nCS). Here the contact process is ignored
and the Heckman selection model uses consent as the selection criterion and HIV status as the outcome. The
right panel of Figure 3 shows the contact model used to predict the probability of being HIV positive for
those who were not contacted, Pr(+|nCT). In this model the consent process is ignored, and selection is on
whether or not someone was contacted, with the outcome being HIV status.
Figure 3. Bärnighausen et al. Probability Models (left panel: consent model; right panel: contact model)
The overall population prevalence is calculated by taking an average of the three groups consisting of: 1)
HIV status (coded 1=positive, 0=negative) in the CT:CS group (HIV cases labeled A and C), 2) the
probabilities of being HIV positive in the CT:nCS group (HIV cases labeled B), and 3) the probabilities
of being HIV positive in the nCT group (HIV cases labeled D). The estimated HIV prevalence for the
population is:
\[
P_{HIV} = \frac{1}{N} \left[ \sum_{i=1}^{a} [H]_i + \sum_{j=1}^{b} \Pr(+ \mid nCS)_j + \sum_{k=1}^{d} \Pr(+ \mid nCT)_k \right]
\tag{11}
\]
where N is the total number of individuals in the population, a is the number of individuals in the CT:CS
group indexed by i; b the number in the CT:nCS group indexed by j and d the number in the nCT group
indexed by k. Pr(+|nCS) and Pr(+|nCT) are the predicted probabilities of being HIV positive for those who
did not consent to testing and those who were not contacted, respectively.
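Equation 11 is a plain average over three groups, which the following Python sketch illustrates. The function name and the inputs are hypothetical; the predicted probabilities would come from the consent and contact models.

```python
def three_group_prevalence(tested_status, nonconsent_p_pos, noncontact_p_pos):
    """Average of observed 0/1 statuses and predicted probabilities.

    tested_status    : 0/1 HIV results for the contacted and consenting group
    nonconsent_p_pos : predicted Pr(+) for contacted people who did not consent
    noncontact_p_pos : predicted Pr(+) for people who were not contacted
    """
    total = sum(tested_status) + sum(nonconsent_p_pos) + sum(noncontact_p_pos)
    n = len(tested_status) + len(nonconsent_p_pos) + len(noncontact_p_pos)
    return total / n


# Hypothetical inputs, for illustration only.
prevalence_3g = three_group_prevalence(
    tested_status=[1, 0, 0, 0],
    nonconsent_p_pos=[0.25, 0.35],
    noncontact_p_pos=[0.2, 0.2],
)
```

Unlike the multi-stage calculation, no intermediate probabilities such as Pr(T|I) appear: each untested person's probability is taken directly from a single model.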
Two biprobit Heckman selection models are used to estimate the selection effects and predict the unobserved
outcomes. Notation conventions follow those of Equations 2–9.
The selection equation for the consent model with [CT:CS] on the left-hand side is:
where the selection variable unrelated to the outcome is fieldworker. The reference categories and
interaction structure are the same as Equation 2.
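The outcome equation for the consent model with [H] on the left-hand side is:
\[
\begin{aligned}
\Pr([H_i] \mid x_i) &= \Phi(\eta_i) \\
\eta_i &= x_i \beta + \varepsilon_{oi} \\
\eta_i &= \beta_0 + \beta_1 [\text{sex}_i = \text{male}] \\
&\quad + \beta_2 [\text{age}_i = \text{20--24}] + \dots + \beta_{14} [\text{age}_i = \text{80--84}] \\
&\quad + \beta_{15} [\text{age}_i = \text{20--24} \wedge \text{sex}_i = \text{male}] + \dots + \beta_{27} [\text{age}_i = \text{80--84} \wedge \text{sex}_i = \text{male}] \\
&\quad + \beta_{28} [\text{village}_i = 2] + \dots + \beta_{47} [\text{village}_i = 21] \\
&\quad + \beta_{48} [\text{migrant}_i] + \beta_{49} [\text{SES}_i = 2] + \dots + \beta_{52} [\text{SES}_i = 5] \\
&\quad + \varepsilon_{oi}
\end{aligned}
\tag{13}
\]
\[
\rho = \operatorname{corr}(\varepsilon_s, \varepsilon_o) \tag{14}
\]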
where the selection variable unrelated to the outcome is fieldworker. The reference categories and
interaction structure are the same as Equation 2.
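The outcome equation for the contact model with [H] on the left-hand side is:
\[
\begin{aligned}
\Pr([H_i] \mid x_i) &= \Phi(\eta_i) \\
\eta_i &= x_i \beta + \varepsilon_{oi} \\
\eta_i &= \beta_0 + \beta_1 [\text{sex}_i = \text{male}] \\
&\quad + \beta_2 [\text{age}_i = \text{20--24}] + \dots + \beta_{14} [\text{age}_i = \text{80--84}] \\
&\quad + \beta_{15} [\text{age}_i = \text{20--24} \wedge \text{sex}_i = \text{male}] + \dots + \beta_{27} [\text{age}_i = \text{80--84} \wedge \text{sex}_i = \text{male}] \\
&\quad + \beta_{28} [\text{village}_i = 2] + \dots + \beta_{47} [\text{village}_i = 21] \\
&\quad + \varepsilon_{oi}
\end{aligned}
\tag{16}
\]
\[
\rho = \operatorname{corr}(\varepsilon_s, \varepsilon_o) \tag{17}
\]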
4.1 Results
Coefficient estimates from the regressions for all models are presented below in the appendix, Section 5.
Table 1 displays the estimated HIV prevalence for females (F), males (M) and both sexes combined (B)
derived from the two different approaches, multi-stage and Bärnighausen et al. The measured prevalence
in the group that was tested is 19.4% (F: 23.9%, M: 10.6%); the multi-stage approach estimates 23.1% (F:
26.9%, M: 17.1%) and the Bärnighausen et al. approach 22.1% (F: 25.4%, M: 16.9%). The lower panel
of Table 1 contains the corrections, the differences between the estimated prevalences in the
whole population and the prevalence measured among those who tested. The magnitude of the corrections
is important. The multi-stage approach increases female prevalence by 3.0 percentage points and male
prevalence by 6.4, for a two-sex increase of 3.6. The Bärnighausen et al. correction is half as large for
females (1.5 percentage points) and about the same for males (6.3), for a two-sex increase of 2.7.
Table 1. Estimated HIV Prevalence (%)
                              F      M      B
Crude Prevalence Rates
  Tested                    23.9   10.6   19.4
  Multi-stage               26.9   17.1   23.1
  Bärnighausen et al.       25.4   16.9   22.1
Correction
  Multi-stage                3.0    6.4    3.6
  Bärnighausen et al.        1.5    6.3    2.7
Like all crude rates, the overall population prevalence of HIV is a weighted average across dimensions along
which HIV prevalence varies, sex and age being two of the important ones. The differences between crude
rates, the corrections we are estimating with this method, are the result of changes in the prevalence profiles
across these subgroups and changes in the composition of the population across the subgroups. In our case,
the sex-age profile of prevalence may change to bring about the corrections, or the sex-age composition of the
population may change to provide different weights for the same sex-age profile of prevalence. To unravel
how much of each type of change is contributing to the overall difference, we can decompose the change
in the overall crude rate into components resulting from changes in the prevalence profile and the sex-age
structure of the population.
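The source does not say which decomposition formula was used. One standard choice is the Kitagawa decomposition, which splits the difference between two crude rates exactly into a rate component and a composition component. The Python sketch below illustrates it with hypothetical two-group inputs; the function name is ours.

```python
def kitagawa(comp1, rate1, comp2, rate2):
    """Decompose crude2 - crude1 into rate and composition effects.

    comp1, comp2 : population shares by subgroup (each list sums to 1)
    rate1, rate2 : subgroup-specific prevalence in populations 1 and 2
    """
    groups = list(zip(comp1, rate1, comp2, rate2))
    # Rate effect: change in rates, weighted by the average composition.
    rate_effect = sum((r2 - r1) * (c1 + c2) / 2 for c1, r1, c2, r2 in groups)
    # Composition effect: change in shares, weighted by the average rate.
    comp_effect = sum((c2 - c1) * (r1 + r2) / 2 for c1, r1, c2, r2 in groups)
    return rate_effect, comp_effect


# Hypothetical example: identical rates, different age structures.
comp_a, rate_a = [0.6, 0.4], [0.10, 0.30]
comp_b, rate_b = [0.4, 0.6], [0.10, 0.30]
rate_effect, comp_effect = kitagawa(comp_a, rate_a, comp_b, rate_b)

crude_a = sum(c * r for c, r in zip(comp_a, rate_a))
crude_b = sum(c * r for c, r in zip(comp_b, rate_b))
```

In this example the subgroup rates are unchanged, so the whole difference in crude prevalence (0.18 versus 0.22) is attributed to composition. This is the pattern the text describes when the not-found group is added.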
Table 2 displays the estimated HIV prevalences for subgroups of the population defined in the multi-stage
approach. Starting with the F:I:T group, subgroups are added until the whole population is included. The
crude HIV prevalence rates are given for each group along with the differences between each group and
the previous, smaller, group. The differences in the crude rates are decomposed into rate differences and
age composition differences which are displayed in the lower half of the table. The rate and age difference
components add to 100% within each sex group.
Contributions from changes in the sex-age prevalence profile and sex-age composition vary considerably.
When moving from the F:I:T to F:I:T + F:I:nT groups (adding the not-tested to the tested), the dominant
component of the difference is changes in the sex-age profile of prevalence. When the F:nI group is added
(those who were not interviewed), the two components contribute equally. Finally when the nF group is
added (those who were not found), changes to the sex-age profile of prevalence contribute little (in the
case of males actually work weakly in the opposite direction to decrease the difference in the crude rates),
while changes in the sex-age composition of the population are responsible for almost all of the overall
change. What this means is that the sex-age profiles of prevalence are essentially the same for the F and nF
subgroups, but the sex-age structures are importantly different, with the nF group giving more weight to
high prevalence sex-age groups which leads to a higher crude prevalence rate when this group is added in,
especially for males.
Table 3 displays the age composition and age-specific prevalence rates by sex for the F:I:T group, and then
the changes to each as additional subgroups are added. The column labels in this table relate to the row
numbers in Table 2 and indicate the movement from each group to the next larger group. By examining
Table 3 you can easily verify that changes to the age profile of prevalence are important in the first two
transitions, but not to the third, where changes in the age composition are the driving force.
Table 3. Sex-Age-Specific Composition & HIV Prevalence:
Multi-stage Approach
Table 4 displays the estimated HIV prevalences for subgroups defined in the Bärnighausen et al. approach,
similar to Table 2. Adding the non-consenting group nCS to the tested group adds 2.2% to female and 3.2%
to male prevalence. Further adding the not-contacted nCT group decreases female prevalence by 0.6% and
increases male prevalence by another 3.1%. There are important positive contributions from both the
prevalence profile and the age composition changes when adding the nCS group. However, when the nCT
group is added, the situation is different. For females, prevalence profile changes contribute twice the
magnitude of the overall change in population-level prevalence, while changes in age composition work in
the opposite direction to decrease (cut in half) the change in population-level prevalence. For males the
entire change in population-level prevalence is driven by changes to the age structure, with almost no
contribution (1.6%) from changes in the prevalence profile. Table 6 breaks down these changes by age and
sex, as in Table 3, and makes clear how the age composition and prevalence are changing in each age group
for each sex.
Table 5 below contains a summary of the estimated values of the Heckman ρ in each model. The significance
levels displayed in the table are approximate. Using survey design estimation procedures in Stata (Stata's
svy commands) invalidates the likelihood ratio test that one would normally use to test the null hypothesis
that ρ is zero. Consequently, we use non-survey-design estimation procedures that directly specify weighting
and estimation sample selection, and use the likelihood ratio test for ρ = 0 from that, with the knowledge
that the standard errors are not precisely correct.
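The mechanics of a likelihood ratio test for ρ = 0 can be sketched as follows. This is an illustration only: the function name and the log-likelihood values are hypothetical. It assumes that the restricted model with ρ = 0 is nested in the full model and that the test statistic is compared with a chi-square distribution with one degree of freedom.

```python
import math


def lr_test_rho_zero(loglik_full, loglik_restricted):
    """Likelihood ratio test of rho = 0 against an unrestricted rho.

    loglik_full       : maximized log-likelihood with rho estimated
    loglik_restricted : maximized log-likelihood with rho fixed at zero
    Returns (LR statistic, p-value from a chi-square with 1 df).
    """
    lr = max(2.0 * (loglik_full - loglik_restricted), 0.0)
    # For one degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return lr, p_value


# Hypothetical log-likelihoods chosen so the statistic sits at the 5% critical value.
lr_stat, p_val = lr_test_rho_zero(-1000.0, -1000.0 - 3.841458820694124 / 2.0)
```

A large p-value, as for most of the significance levels reported in Table 5, means the data do not reject ρ = 0 at conventional levels.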
The values of ρ are interesting. Starting with the multi-stage approach, the ρ for model M3 is negative,
indicating that individuals who were interviewed but refused testing are more likely to be HIV positive, an
intuitive and reasonable result. The ρ for model M2 is positive, suggesting that individuals who were found
but refused to be interviewed were less likely to agree to testing, which is also intuitive and reasonable.
Finally, the ρ for model M1 is negative, implying that individuals who were not found would be more likely to
agree to be interviewed (conditional on being found), compared to those who were actually found.
Turning to the Bärnighausen et al. approach, the ρ for the consent model is negative, suggesting that
individuals who did not consent to testing were more likely to be HIV positive, which is reasonable. For the
contact model ρ is positive, indicating that individuals who were not found are less likely to be HIV positive.
This is a strange finding that likely results from improper modeling of the selection processes. The contact model
lumps together the selection process governing whether an individual is found and the selection process
determining whether or not they agree to testing. As we can see with the multi-stage approach, which does
model these processes separately, the different components of the full selection mechanism work in different
directions.
Table 5. Estimates of Heckman ρ Values
                               ρ       Significance   CI
Multi-stage Approach
  M1                        -0.215     0.470          (-0.670, 0.358)
  M2                         0.414     0.114          (-0.105, 0.755)
  M3                        -0.499     0.252          (-0.902, 0.371)
Bärnighausen et al. Approach
  Consent                   -0.342     0.471          (-0.868, 0.546)
  Contact                    0.219     0.180          (-0.102, 0.500)
4.2 Discussion
The Heckman selection model correction procedure works by providing estimates of the probability of being
HIV positive for sampled individuals who did not participate in HIV testing. Not getting a test results from:
1) not being found, 2) being found but refusing to be interviewed, or 3) being found and interviewed and then
refusing to provide blood for testing. At each of these decision points various factors, both measured and
unmeasured, can contribute to the selected and non-selected subgroups being systematically different. The
systematic difference that concerns us is HIV status.
Both the multi-stage and Bärnighausen et al. approaches to applying the Heckman selection model correction
predict the probability of being HIV positive for those who did not receive a test. The two approaches model
the selection processes differently and make different assumptions. The multi-stage approach identifies
each discrete selection step starting with the whole sample all the way through to the individuals who
eventually receive a positive or negative HIV test result. At each of these steps the biprobit Heckman
selection model is used to model the selection process across two levels of the selection hierarchy and to
predict the unobserved outcome at the lower level. These models are organized into the hierarchy of the
categorization scheme for the sample such that the outcome of each higher level model is the selection level
of the model below, see Figures 1 and 2. In this way the whole categorization hierarchy can be modeled, and
all of the conditional probabilities associated with unobserved outcomes can be predicted from the models.
The remaining conditional probabilities that are similar to observed selection processes and outcomes can
be imputed from the situations in which they are observed, under the assumption that those conditional
probabilities actually are similar in the observed and unobserved situations, which is a valid point of discussion.
Finally, using the hierarchical structure of the classification scheme and its associated model, the probability
of being HIV positive can be calculated for each of the unobserved groups, see Equation 1. This modeling
approach:
- is systematic,
- uses as much of the data as possible,
- yields results that can be interpreted with respect to the selection steps that actually governed the
  categorization of the sample into various observed and unobserved groups, and
- clearly identifies where assumptions are being made and exactly what they are.
In contrast, the Bärnighausen et al. approach is less systematic and conflates various selection processes.
This approach is built on two biprobit Heckman selection models, both of which have HIV status as the
outcome. The difference between the two is that the consent model defines consent to test as the selection
process and restricts the population over which the model is estimated to be those who were found, while
the contact model defines the ability to contact a respondent as the selection process and estimates the
model over the whole population. Predictions from the consent model provide an estimate of the probability
of being HIV positive for those who were found but did not consent to testing, and predictions from the
contact model provide the probability of being HIV positive for those who were not found.
The consent model conflates two selection processes that we know exist and can likely be described with
data from a typical survey: 1) the original decision on the part of a respondent to either be interviewed
or not, and 2) the subsequent choice that the respondent makes to either test or not. The two selection
processes at work here may be different and even work in opposite directions with respect to systematic
differences in HIV status among those who opt in and out at each decision point. The contact model ignores
the two selection processes that are described by the consent model but still uses HIV status as the outcome,
effectively conflating all three selection processes (being contacted, agreeing to be interviewed and agreeing to
be tested) into one selection process.
Altogether, the Bärnighausen et al. approach is harder to understand and interpret because it does not
cleanly separate the selection processes and clearly describe how they relate to one another. It is also
troubling that the two models effectively use the same data, HIV status, twice, and both model
some of the same selection processes, effectively using that information twice as well. Finally, although the
overall results are similar, our ability to diagnose exactly what is happening with the Bärnighausen et al.
approach is limited, and the diagnostics are confusing (for example, see Table 4).
Although they constitute two completely different analytical strategies, both approaches suggest upward
corrections of population prevalence on the order of 3%, and in both approaches most of this results from
important increases in male prevalence of almost 6.5%. At the population level, the only real difference
between the two approaches is the correction to female prevalence. The multi-stage approach suggests an
upward correction of 3% while the Bärnighausen et al. approach halves this to 1.5%.
In both approaches, the bulk of the correction for females is associated with the model that illuminates the
selection process governing self-selection into testing (adding the F:I:nT group for the multi-stage approach,
and the nCS group for the Bärnighausen et al. approach). In both cases, most of this correction is the
result of changes to the age-specific prevalence rates, rather than differences between the age structures of
the testing and combined group consisting of both testing and non-testing individuals.
For males the situation is different. In both approaches, the large male correction is contributed in about
equal proportions by the non-testing and not-found groups. For the male non-testers, the situation is similar
to females with large changes to the age-specific prevalence rates that account for most of the differences
in the crude rates when this subgroup is added. For the not-found males in both approaches, the changes
to overall crude prevalence rates are driven entirely by differences in the age structures of the found and
not-found populations.
The Heckman selection model procedure, no matter how applied, suggests significant differences in age-
specific prevalence rates between the found subgroups who either test or do not test, with the non-testers
having higher age-specific prevalence rates. Neither approach to applying the procedure suggests large
differences in the age-specific prevalence comparing the found and not-found subgroups.
Table 6. Sex-Age-Specific Composition & HIV
Prevalence: Bärnighausen et al. Approach
For differences between the found and not-found groups, both approaches suggest large age-structure-driven
corrections to overall prevalence for men when the not-founds are added into the total. For women, the
multi-stage approach also suggests an age-structure-driven correction, but the magnitude is small (0.8%). The
Bärnighausen et al. approach comparing found women to the combined group consisting of found and
not-found women is less well-defined. There are conflicting corrections with a large component from differences
in the age-specific prevalence profiles of the two groups counterbalanced by another large component of
opposite sign associated with differences in the age structures of the two groups. The overall difference in
crude prevalence rates between the two groups is negative, so the differences in age-specific prevalence are
producing a negative correction that reduces overall prevalence, while differences in the age structure are
bringing it back in the other direction. The net result is the negative correction of 0.6% that is observed
when taking the differences between the crude prevalences of the two groups.
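The split of a crude-prevalence difference into an age-specific-prevalence component and an age-structure
component can be illustrated with a small sketch. It assumes a standard Kitagawa-style decomposition, which
may differ in detail from the decomposition used for the tables in this report. All numbers below are
hypothetical and chosen only so that the two components have opposite signs; they are not survey results.

```python
def crude_prevalence(shares, rates):
    """Crude prevalence: age-group population shares times age-specific rates."""
    return sum(s * r for s, r in zip(shares, rates))


def decompose(shares_a, rates_a, shares_b, rates_b):
    """Kitagawa-style split of crude(B) - crude(A) into two additive parts.

    rate_effect:      due to differences in age-specific prevalence
    structure_effect: due to differences in age structure
    """
    rate_effect = sum(
        (rb - ra) * (sa + sb) / 2
        for sa, ra, sb, rb in zip(shares_a, rates_a, shares_b, rates_b)
    )
    structure_effect = sum(
        (sb - sa) * (ra + rb) / 2
        for sa, ra, sb, rb in zip(shares_a, rates_a, shares_b, rates_b)
    )
    return rate_effect, structure_effect


# Hypothetical three-age-group example (NOT data from the Agincourt survey).
shares_found = [0.5, 0.3, 0.2]
rates_found = [0.10, 0.30, 0.15]
shares_combined = [0.4, 0.4, 0.2]
rates_combined = [0.08, 0.26, 0.13]

crude_found = crude_prevalence(shares_found, rates_found)
crude_combined = crude_prevalence(shares_combined, rates_combined)
rate_effect, structure_effect = decompose(
    shares_found, rates_found, shares_combined, rates_combined
)

print(f"crude difference: {crude_combined - crude_found:+.3f}")
print(f"rate effect:      {rate_effect:+.3f}")
print(f"structure effect: {structure_effect:+.3f}")
```

In this made-up example the rate component is negative and the structure component is positive, and they sum
to a small negative net difference, which is the pattern described above for found versus combined women.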
Taken as a whole, the signs, magnitudes and origins of these corrections are consistent with our detailed
understanding of what is happening in the Agincourt study population, and the multi-stage approach yields
results that are much easier to interpret, interrogate and corroborate with existing knowledge of the
population. The main corrections to age-specific prevalence rates occur between the testing and non-testing
groups, as we would expect given that people who feel they have a reason to fear the results of the test may
be less likely to agree to testing. Many men of working age are employed outside the field site, and when
they are added back in they change the age structure of the population in such a way as to weight more
heavily the age groups with high HIV prevalence; the resulting correction is substantial. For women a
similar thing happens, but the magnitude of the correction is much smaller because the age structure
differences are less pronounced.
The results obtained here support the notion that, when properly applied, the Heckman selection model
method for correcting estimates of HIV prevalence from sample surveys can work well.
Although both approaches produce similar overall corrections, we feel the multi-stage approach is better
justified and produces more stable and interpretable results.
4.3 Recommendations
1. The Heckman selection model is a useful tool for assessing the possibility and extent of selection bias
in surveys that include HIV tests.
2. The overall corrections to crude HIV prevalence rates at the population level suggested by Heckman
selection model procedures are reasonably robust to exactly how the selection processes are modeled.
In the work presented here, both the multi-stage and Barnighausen et al. approaches produced similar
results.
3. We prefer the multi-stage procedure presented here because it fully describes the selection processes
at each step and produces stable, interpretable results that clearly corroborate our understanding of
what is happening in the population.
4. Designers of future surveys should think through the selection processes before the survey is fielded
and include strong, valid selection/exclusion criteria variables in the survey instruments and/or logistical
tools used to conduct and monitor the survey implementation.
5 Appendix: Regression Estimation Results Tables
... table 7 continued
Variable Coefficient (Std. Err.)
village = 19 0.476 (0.310)
village = 20 -0.043 (0.254)
village = 21 0.081 (0.305)
migration = 1 -0.147 (0.078)
SES quintile = 2 0.052 (0.129)
SES quintile = 3 -0.176 (0.124)
SES quintile = 4 -0.268 (0.119)
SES quintile = 5 -0.290 (0.119)
Intercept 2.327 (0.241)
Selection Equation: [F ]
age = 20 -0.322 (0.108)
age = 25 -0.351 (0.106)
age = 30 -0.203 (0.108)
age = 35 0.000 (0.110)
age = 40 -0.146 (0.116)
age = 45 0.005 (0.120)
age = 50 0.145 (0.149)
age = 55 0.107 (0.147)
age = 60 0.511 (0.171)
age = 65 0.443 (0.167)
age = 70 0.376 (0.189)
age = 75 0.578 (0.221)
age = 80 0.358 (0.201)
sex = 1 0.164 (0.126)
age = 20 and sex = 1 -0.681 (0.154)
age = 25 and sex = 1 -0.835 (0.153)
age = 30 and sex = 1 -0.999 (0.155)
age = 35 and sex = 1 -0.971 (0.156)
age = 40 and sex = 1 -0.973 (0.166)
age = 45 and sex = 1 -0.994 (0.170)
age = 50 and sex = 1 -0.940 (0.203)
age = 55 and sex = 1 -0.984 (0.203)
age = 60 and sex = 1 -0.895 (0.223)
age = 65 and sex = 1 -0.805 (0.227)
age = 70 and sex = 1 -0.678 (0.248)
age = 75 and sex = 1 -0.713 (0.320)
age = 80 and sex = 1 0.081 (0.329)
village = 2 -0.499 (0.112)
village = 3 -0.047 (0.098)
village = 4 -0.028 (0.116)
village = 5 -0.166 (0.114)
village = 6 0.111 (0.115)
village = 7 -0.168 (0.133)
village = 8 -0.138 (0.099)
village = 9 -0.301 (0.101)
village = 10 -0.129 (0.101)
village = 11 -0.189 (0.088)
village = 12 -0.020 (0.137)
village = 13 -0.188 (0.121)
village = 14 -0.381 (0.152)
village = 15 0.049 (0.110)
village = 16 0.117 (0.115)
village = 17 -0.189 (0.141)
village = 18 -0.286 (0.169)
village = 19 0.223 (0.180)
village = 20 -0.049 (0.177)
village = 21 -0.040 (0.154)
migration = 1 -0.173 (0.045)
SES quintile = 2 0.043 (0.071)
SES quintile = 3 0.039 (0.070)
SES quintile = 4 0.044 (0.070)
SES quintile = 5 -0.058 (0.070)
Intercept 1.139 (0.122)
-0.215 (0.288)
Significance levels : : 10% : 5% : 1%
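As a reading aid for the selection-equation coefficients, the sketch below converts a few of the Table 7
[F] estimates into predicted probabilities using the probit link. It rests on assumptions that are not
confirmed by the table itself: that [F] denotes the found/not-found stage, that sex is coded 0/1, and that
all other covariates (village, migration, SES quintile) sit at their omitted reference categories. The
variable names are illustrative only.

```python
from statistics import NormalDist

# Selection-equation [F] coefficients copied from Table 7.
INTERCEPT = 1.139
AGE_30 = -0.203
SEX_1 = 0.164
AGE_30_X_SEX_1 = -0.999


def probit_probability(linear_index: float) -> float:
    """Probit link: standard normal CDF evaluated at the linear index."""
    return NormalDist().cdf(linear_index)


# Assumption: every other covariate is at its omitted reference category,
# so only the intercept, age, sex and age-by-sex terms enter the index.
index_sex0 = INTERCEPT + AGE_30
index_sex1 = INTERCEPT + AGE_30 + SEX_1 + AGE_30_X_SEX_1

p_found_sex0 = probit_probability(index_sex0)
p_found_sex1 = probit_probability(index_sex1)

print(f"age = 30, sex = 0: index {index_sex0:.3f}, probability {p_found_sex0:.3f}")
print(f"age = 30, sex = 1: index {index_sex1:.3f}, probability {p_found_sex1:.3f}")
```

Under these assumptions, the large negative age-by-sex interaction pulls the predicted probability for the
sex = 1 group at age 30 well below that of the sex = 0 group.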
... table 8 continued
Variable Coefficient (Std. Err.)
age = 80 and sex = 1 0.281 (0.484)
village = 2 -0.202 (0.217)
village = 3 -0.048 (0.168)
village = 4 -0.377 (0.196)
village = 5 -0.056 (0.183)
village = 6 -0.253 (0.184)
village = 7 0.362 (0.225)
village = 8 -0.018 (0.155)
village = 9 -0.375 (0.178)
village = 10 0.224 (0.175)
village = 11 0.100 (0.153)
village = 12 0.593 (0.226)
village = 13 -0.089 (0.179)
village = 14 0.476 (0.293)
village = 15 0.165 (0.221)
village = 16 -0.249 (0.175)
village = 17 0.033 (0.245)
village = 18 0.048 (0.267)
village = 19 -0.008 (0.285)
village = 20 0.151 (0.327)
village = 21 0.106 (0.213)
migration = 1 -0.017 (0.083)
SES quintile = 2 -0.036 (0.119)
SES quintile = 3 -0.156 (0.119)
SES quintile = 4 -0.414 (0.118)
SES quintile = 5 -0.494 (0.117)
Intercept 2.430 (0.274)
Selection Equation: [F : I]
age = 20 -0.201 (0.232)
age = 25 -0.380 (0.225)
age = 30 -0.321 (0.218)
age = 35 -0.527 (0.213)
age = 40 -0.569 (0.224)
age = 45 -0.589 (0.222)
age = 50 0.026 (0.302)
age = 55 -0.367 (0.253)
age = 60 -0.257 (0.261)
age = 65 -0.326 (0.262)
age = 70 -0.526 (0.275)
age = 75 -0.532 (0.284)
age = 80 -0.268 (0.320)
sex = 1 -0.029 (0.249)
age = 20 and sex = 1 -0.118 (0.318)
age = 25 and sex = 1 -0.524 (0.300)
age = 30 and sex = 1 -0.707 (0.293)
age = 35 and sex = 1 -0.407 (0.287)
age = 40 and sex = 1 -0.478 (0.307)
age = 45 and sex = 1 -0.253 (0.304)
age = 50 and sex = 1 -1.061 (0.375)
age = 55 and sex = 1 -0.381 (0.360)
age = 60 and sex = 1 -0.264 (0.353)
age = 65 and sex = 1 -0.403 (0.352)
age = 70 and sex = 1 0.205 (0.398)
age = 75 and sex = 1 11.429 (0.000)
age = 80 and sex = 1 -0.368 (0.433)
village = 2 -0.079 (0.177)
village = 3 -0.003 (0.157)
village = 4 -0.038 (0.188)
village = 5 0.357 (0.193)
village = 6 -0.136 (0.182)
village = 7 -0.188 (0.215)
village = 8 0.103 (0.171)
village = 9 -0.280 (0.142)
village = 10 0.442 (0.162)
village = 11 0.053 (0.144)
village = 12 0.053 (0.190)
village = 13 -0.088 (0.170)
village = 14 -0.493 (0.233)
village = 15 0.224 (0.235)
village = 16 -0.020 (0.144)
village = 17 -0.169 (0.262)
village = 18 0.211 (0.290)
village = 19 0.489 (0.308)
village = 20 -0.045 (0.258)
village = 21 0.088 (0.311)
migration = 1 -0.162 (0.075)
SES quintile = 2 0.058 (0.131)
SES quintile = 3 -0.174 (0.125)
SES quintile = 4 -0.272 (0.120)
SES quintile = 5 -0.296 (0.119)
Intercept 2.301 (0.241)
0.414 (0.230)
Significance levels : : 10% : 5% : 1%
... table 9 continued
Variable Coefficient (Std. Err.)
age = 60 0.567 (0.188)
age = 65 0.447 (0.207)
age = 70 0.388 (0.219)
age = 75 0.064 (0.254)
age = 80 -0.614 (0.405)
sex = 1 -0.991 (0.334)
age = 20 and sex = 1 0.078 (0.362)
age = 25 and sex = 1 0.595 (0.361)
age = 30 and sex = 1 1.082 (0.351)
age = 35 and sex = 1 1.073 (0.349)
age = 40 and sex = 1 1.206 (0.361)
age = 45 and sex = 1 0.932 (0.366)
age = 50 and sex = 1 1.164 (0.383)
age = 55 and sex = 1 1.237 (0.377)
age = 60 and sex = 1 1.267 (0.383)
age = 65 and sex = 1 1.187 (0.414)
age = 70 and sex = 1 0.686 (0.433)
age = 75 and sex = 1 0.973 (0.515)
age = 80 and sex = 1 1.140 (0.643)
village = 2 0.178 (0.183)
village = 3 0.114 (0.121)
village = 4 -0.012 (0.152)
village = 5 -0.114 (0.135)
village = 6 0.056 (0.144)
village = 7 -0.095 (0.155)
village = 8 -0.082 (0.125)
village = 9 -0.057 (0.131)
village = 10 -0.217 (0.121)
village = 11 0.047 (0.113)
village = 12 0.073 (0.157)
village = 13 0.001 (0.141)
village = 14 -0.025 (0.179)
village = 15 0.034 (0.141)
village = 16 -0.329 (0.146)
village = 17 0.129 (0.154)
village = 18 0.226 (0.197)
village = 19 0.195 (0.212)
village = 20 -0.268 (0.217)
village = 21 0.664 (0.193)
migration = 1 -0.024 (0.058)
SES quintile = 2 -0.160 (0.081)
SES quintile = 3 -0.070 (0.085)
SES quintile = 4 -0.070 (0.098)
SES quintile = 5 -0.351 (0.110)
Intercept -1.433 (0.180)
Selection Equation: [F : I: T ]
age = 20 -0.555 (0.241)
age = 25 -0.832 (0.225)
age = 30 -0.663 (0.231)
age = 35 -0.721 (0.232)
age = 40 -0.753 (0.246)
age = 45 -0.422 (0.257)
age = 50 -0.569 (0.270)
age = 55 -0.456 (0.276)
age = 60 -0.658 (0.261)
age = 65 -0.817 (0.254)
age = 70 -0.567 (0.313)
age = 75 -0.536 (0.341)
age = 80 -0.466 (0.335)
sex = 1 -0.377 (0.251)
age = 20 and sex = 1 0.127 (0.313)
age = 25 and sex = 1 -0.031 (0.299)
age = 30 and sex = 1 -0.236 (0.309)
age = 35 and sex = 1 -0.027 (0.301)
age = 40 and sex = 1 0.002 (0.327)
age = 45 and sex = 1 -0.185 (0.332)
age = 50 and sex = 1 -0.122 (0.357)
age = 55 and sex = 1 0.223 (0.414)
age = 60 and sex = 1 0.264 (0.353)
age = 65 and sex = 1 0.976 (0.442)
age = 70 and sex = 1 0.689 (0.434)
age = 75 and sex = 1 0.175 (0.482)
age = 80 and sex = 1 0.332 (0.502)
village = 2 -0.081 (0.228)
village = 3 -0.037 (0.173)
village = 4 -0.403 (0.199)
village = 5 -0.076 (0.185)
village = 6 -0.239 (0.191)
village = 7 0.413 (0.232)
village = 8 -0.028 (0.159)
village = 9 -0.358 (0.182)
village = 10 0.219 (0.179)
village = 11 0.081 (0.161)
village = 12 0.612 (0.236)
village = 13 -0.097 (0.183)
village = 14 0.502 (0.294)
village = 15 0.153 (0.225)
village = 16 -0.234 (0.178)
village = 17 0.012 (0.250)
village = 18 0.091 (0.286)
village = 19 -0.039 (0.285)
village = 20 0.179 (0.338)
village = 21 0.123 (0.222)
migration = 1 0.015 (0.079)
SES quintile = 2 -0.008 (0.122)
SES quintile = 3 -0.069 (0.121)
SES quintile = 4 -0.348 (0.124)
SES quintile = 5 -0.425 (0.118)
fieldworker = 3713 -0.123 (0.168)
fieldworker = 3858 -0.239 (0.167)
fieldworker = 4680 0.289 (0.227)
fieldworker = 5681 0.118 (0.159)
fieldworker = 6547 0.463 (0.180)
fieldworker = 6761 0.019 (0.164)
fieldworker = 6963 -0.286 (0.156)
fieldworker = 7683 -0.287 (0.191)
fieldworker = 8875 -0.295 (0.166)
fieldworker = 9821 0.160 (0.165)
Intercept 2.547 (0.299)
-0.499 (0.359)
Significance levels : : 10% : 5% : 1%
... table 10 continued
Variable Coefficient (Std. Err.)
village = 4 -0.035 (0.154)
village = 5 -0.136 (0.136)
village = 6 0.049 (0.146)
village = 7 -0.065 (0.152)
village = 8 -0.091 (0.126)
village = 9 -0.058 (0.142)
village = 10 -0.231 (0.129)
village = 11 0.050 (0.115)
village = 12 0.088 (0.158)
village = 13 -0.002 (0.142)
village = 14 0.040 (0.176)
village = 15 0.029 (0.146)
village = 16 -0.350 (0.147)
village = 17 0.141 (0.155)
village = 18 0.218 (0.200)
village = 19 0.180 (0.216)
village = 20 -0.262 (0.217)
village = 21 0.668 (0.199)
migration = 1 -0.014 (0.059)
SES quintile = 2 -0.164 (0.082)
SES quintile = 3 -0.066 (0.088)
SES quintile = 4 -0.074 (0.112)
SES quintile = 5 -0.359 (0.131)
Intercept -1.430 (0.183)
Selection Equation: [CT : CS]
age = 20 -0.416 (0.185)
age = 25 -0.676 (0.178)
age = 30 -0.522 (0.177)
age = 35 -0.672 (0.178)
age = 40 -0.702 (0.185)
age = 45 -0.566 (0.188)
age = 50 -0.334 (0.220)
age = 55 -0.428 (0.209)
age = 60 -0.505 (0.207)
age = 65 -0.669 (0.204)
age = 70 -0.585 (0.231)
age = 75 -0.574 (0.244)
age = 80 -0.403 (0.261)
sex = 1 -0.241 (0.197)
age = 20 and sex = 1 0.027 (0.250)
age = 25 and sex = 1 -0.297 (0.240)
age = 30 and sex = 1 -0.540 (0.239)
age = 35 and sex = 1 -0.235 (0.234)
age = 40 and sex = 1 -0.316 (0.252)
age = 45 and sex = 1 -0.210 (0.252)
age = 50 and sex = 1 -0.609 (0.287)
age = 55 and sex = 1 -0.089 (0.304)
age = 60 and sex = 1 0.008 (0.278)
age = 65 and sex = 1 0.301 (0.292)
age = 70 and sex = 1 0.504 (0.327)
age = 75 and sex = 1 0.420 (0.405)
age = 80 and sex = 1 -0.010 (0.363)
village = 2 -0.097 (0.168)
village = 3 -0.014 (0.134)
village = 4 -0.274 (0.160)
village = 5 0.116 (0.150)
village = 6 -0.226 (0.152)
village = 7 0.065 (0.187)
village = 8 0.019 (0.134)
village = 9 -0.361 (0.133)
village = 10 0.357 (0.138)
village = 11 0.074 (0.122)
village = 12 0.276 (0.174)
village = 13 -0.078 (0.143)
village = 14 -0.188 (0.215)
village = 15 0.210 (0.182)
village = 16 -0.161 (0.135)
village = 17 -0.080 (0.215)
village = 18 0.161 (0.237)
village = 19 0.193 (0.238)
village = 20 0.070 (0.235)
village = 21 0.139 (0.231)
migration = 1 -0.076 (0.063)
SES quintile = 2 0.027 (0.102)
SES quintile = 3 -0.147 (0.102)
SES quintile = 4 -0.359 (0.100)
SES quintile = 5 -0.435 (0.097)
fieldworker = 3713 -0.201 (0.147)
fieldworker = 3858 -0.266 (0.146)
fieldworker = 4680 0.008 (0.184)
fieldworker = 5681 0.044 (0.136)
fieldworker = 6547 -0.085 (0.158)
fieldworker = 6761 -0.385 (0.142)
fieldworker = 6963 -0.207 (0.136)
fieldworker = 7683 -0.306 (0.161)
fieldworker = 8875 -0.273 (0.141)
fieldworker = 9821 -0.108 (0.142)
Intercept 2.295 (0.231)
-0.342 (0.436)
Significance levels : : 10% : 5% : 1%
... table 11 continued
Variable Coefficient (Std. Err.)
age = 20 0.886 (0.137)
age = 25 1.198 (0.137)
age = 30 1.246 (0.135)
age = 35 1.386 (0.131)
age = 40 1.157 (0.140)
age = 45 1.114 (0.138)
age = 50 0.901 (0.156)
age = 55 0.914 (0.153)
age = 60 0.583 (0.158)
age = 65 0.570 (0.160)
age = 70 0.511 (0.181)
age = 75 0.324 (0.198)
age = 80 -0.098 (0.235)
sex = 1 -0.179 (0.166)
age = 20 and sex = 1 -0.430 (0.210)
age = 25 and sex = 1 0.031 (0.211)
age = 30 and sex = 1 0.368 (0.215)
age = 35 and sex = 1 0.265 (0.210)
age = 40 and sex = 1 0.432 (0.228)
age = 45 and sex = 1 0.196 (0.220)
age = 50 and sex = 1 0.532 (0.246)
age = 55 and sex = 1 0.368 (0.246)
age = 60 and sex = 1 0.406 (0.235)
age = 65 and sex = 1 0.285 (0.243)
age = 70 and sex = 1 -0.129 (0.275)
age = 75 and sex = 1 0.021 (0.342)
age = 80 and sex = 1 0.483 (0.334)
village = 2 0.167 (0.148)
village = 3 0.124 (0.107)
village = 4 0.125 (0.133)
village = 5 -0.109 (0.116)
village = 6 0.148 (0.128)
village = 7 -0.047 (0.144)
village = 8 -0.070 (0.108)
village = 9 0.145 (0.112)
village = 10 -0.281 (0.105)
village = 11 0.007 (0.098)
village = 12 -0.013 (0.131)
village = 13 0.046 (0.119)
village = 14 0.118 (0.174)
village = 15 0.003 (0.131)
village = 16 -0.115 (0.119)
village = 17 0.146 (0.143)
village = 18 0.097 (0.171)
village = 19 0.104 (0.181)
village = 20 -0.184 (0.183)
village = 21 0.521 (0.176)
Intercept -1.428 (0.142)
Selection Equation: [CT ]
age = 20 -0.301 (0.108)
age = 25 -0.353 (0.107)
age = 30 -0.257 (0.108)
age = 35 -0.031 (0.110)
age = 40 -0.161 (0.116)
age = 45 0.031 (0.121)
age = 50 0.229 (0.152)
age = 55 0.174 (0.149)
age = 60 0.595 (0.170)
age = 65 0.498 (0.173)
age = 70 0.425 (0.198)
age = 75 0.642 (0.223)
age = 80 0.394 (0.201)
sex = 1 0.183 (0.127)
age = 20 and sex = 1 -0.672 (0.155)
age = 25 and sex = 1 -0.810 (0.154)
age = 30 and sex = 1 -0.934 (0.155)
age = 35 and sex = 1 -0.932 (0.156)
age = 40 and sex = 1 -0.967 (0.166)
age = 45 and sex = 1 -0.973 (0.169)
age = 50 and sex = 1 -1.008 (0.206)
age = 55 and sex = 1 -0.963 (0.204)
age = 60 and sex = 1 -0.937 (0.222)
age = 65 and sex = 1 -0.796 (0.231)
age = 70 and sex = 1 -0.702 (0.257)
age = 75 and sex = 1 -0.817 (0.320)
age = 80 and sex = 1 0.078 (0.334)
fieldworker = 3713 -1.049 (0.162)
fieldworker = 3858 -0.746 (0.172)
fieldworker = 4680 -1.541 (0.169)
fieldworker = 5681 -1.192 (0.164)
fieldworker = 6547 -1.301 (0.163)
fieldworker = 6761 -1.156 (0.162)
fieldworker = 6963 -1.141 (0.163)
fieldworker = 7683 -1.295 (0.161)
fieldworker = 8875 -1.118 (0.161)
fieldworker = 9821 -0.948 (0.162)
Intercept 2.019 (0.169)
0.219 (0.158)
Significance levels : : 10% : 5% : 1%
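The fieldworker indicators appear in the Table 11 selection equation but not among the outcome-equation
rows shown above, so their strength bears on the recommendation about strong selection variables. The
sketch below computes a rough Wald z-statistic (coefficient divided by standard error) and a two-sided
normal p-value for each fieldworker coefficient. This is an illustrative reading of the reported numbers,
not the test used in the analysis; a joint test of all fieldworker indicators would be the more appropriate
check, and it cannot be computed from the table alone.

```python
from statistics import NormalDist

# (coefficient, standard error) pairs copied from the Table 11 selection equation.
fieldworker_estimates = {
    3713: (-1.049, 0.162),
    3858: (-0.746, 0.172),
    4680: (-1.541, 0.169),
    5681: (-1.192, 0.164),
    6547: (-1.301, 0.163),
    6761: (-1.156, 0.162),
    6963: (-1.141, 0.163),
    7683: (-1.295, 0.161),
    8875: (-1.118, 0.161),
    9821: (-0.948, 0.162),
}


def wald_z(coefficient: float, std_err: float) -> float:
    """Wald z-statistic for a single coefficient."""
    return coefficient / std_err


def two_sided_p(z: float) -> float:
    """Two-sided p-value under a standard normal reference distribution."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


z_scores = {
    fieldworker: wald_z(coef, se)
    for fieldworker, (coef, se) in fieldworker_estimates.items()
}

for fieldworker, z in z_scores.items():
    print(f"fieldworker = {fieldworker}: z = {z:6.2f}, p = {two_sided_p(z):.2g}")
```

Every fieldworker coefficient in this table is several standard errors from zero, which is what one would
hope to see for variables used to identify the selection process.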