Unbiased Estimation of The Average Treatment Effect in Cluster-Randomized Experiments
1 Introduction
In recent years, researchers have paid increased attention to the properties
of treatment effect estimators for randomized experiments under the design-
based model (see, e.g. Freedman 2008a,b). Under the design-based model
(Neyman 1923, 1934; Sarndal 1978), potential outcomes are fixed and the only
source of stochasticity lies in the random administration of a treatment to a
finite population. Importantly, Freedman (2008a) demonstrated that, under
such a model, regression adjustment is generally biased (though consistent) and may reduce efficiency. Researchers have since derived methods that
do not suffer from these problems (Lin 2013; Miratrix et al. 2013) and assessed
the operating characteristics of common model-based estimators (Humphreys
2009; Samii and Aronow 2012) under the design-based paradigm. However, this
1 Hansen and Bowers (2008) also derives design-based balance tests for cluster-randomized experiments.
2 As in Hansen and Bowers (2008), we consider estimation of the effect of assignment to treatment, which we refer to simply as the ATE throughout. This quantity is also termed the intention-to-treat effect. Our approach circumvents the issue of compliance, but our estimators might be divided by suitable compliance rate estimates to estimate average treatment-on-the-treated effects, though this may introduce bias from ratio estimation (Hartley and Ross 1954).
2 Potential Outcomes
The foundation of our design-based approach is the model of potential out-
comes introduced by Neyman (1923) and popularized by Rubin (1974). Define
treatment indicator Di∈{0, 1} for units i∈1, 2, …, N such that Di = 1 when unit i
receives the treatment and Di = 0 otherwise. Assuming that the stable unit treat-
ment value assumption (Rubin 1978, 2005) holds, let Y1i be the potential outcome
if unit i is exposed to the treatment, and let Y0i be the potential outcome if unit i
is not exposed to the treatment. The observed experimental outcome Yi may be
expressed as a function of the potential outcomes and the assigned treatment:
Yi = DiY1i+(1–Di)Y0i. The causal effect of the treatment on unit i, τi, is defined as
the difference between the two potential outcomes for unit i: τi≡Y1i–Y0i. By definition, the ATE, denoted Δ, is the average value of τi over all units i. Under this
model, the only random component of the experiment is the allocation of units to
treatment and control groups.
Since $\tau_i \equiv Y_{1i} - Y_{0i}$, the ATE is equivalently
$$\Delta = \frac{\sum_{i=1}^{N}(Y_{1i} - Y_{0i})}{N} = \frac{1}{N}\sum_{i=1}^{N} Y_{1i} - \frac{1}{N}\sum_{i=1}^{N} Y_{0i} = \frac{1}{N}\left[Y_1^T - Y_0^T\right],$$
where $Y_1^T$ is the sum of potential outcomes if in the treatment condition and $Y_0^T$ is the sum of potential outcomes if in the control condition. An estimator of Δ can be constructed using estimators of $Y_0^T$ and $Y_1^T$:
$$\hat{\Delta} = \frac{1}{N}\left[\hat{Y}_1^T - \hat{Y}_0^T\right], \quad (1)$$
where $\hat{Y}_1^T$ is the estimated sum of potential outcomes under treatment and $\hat{Y}_0^T$ is the estimated sum of potential outcomes under control. If these estimators of the totals are unbiased, then so is $\hat{\Delta}$:
$$E[\hat{\Delta}] = \frac{1}{N}\left[E\left[\hat{Y}_1^T\right] - E\left[\hat{Y}_0^T\right]\right] = \frac{1}{N}\left[Y_1^T - Y_0^T\right] = \Delta.$$
3.1 Unbiased Estimation of Treatment Effects Under Random Allocation of Units
Define N and nt as integers such that 0 < nt < N. Random allocation of treatment implies that nt units, a fixed number, are randomly assigned to treatment (Di = 1) and the remaining nc = N–nt are in control (Di = 0). Define I0 as the set of all i such that Di = 0 and I1 as the set of all i such that Di = 1.
To derive an unbiased estimator of the ATE under random allocation, we can first posit estimators of $Y_0^T$ and $Y_1^T$. Define an estimator of $Y_0^T$,
$$\hat{Y}_{0,S}^T = \frac{N}{n_c}\sum_{i\in I_0} Y_{0i} = \frac{N}{n_c}\sum_{i\in I_0} Y_i, \quad (2)$$
3 Throughout, we use the term random allocation to refer to the assignment of a fixed number
of units (or clusters) to treatment and a fixed number to control, following the terminology of
Lachin (1988).
and an estimator of $Y_1^T$,
$$\hat{Y}_{1,S}^T = \frac{N}{n_t}\sum_{i\in I_1} Y_{1i} = \frac{N}{n_t}\sum_{i\in I_1} Y_i. \quad (3)$$
It is easy to show that the estimators in equations 2 and 3 are unbiased under the random allocation rule:
$$E\left[\hat{Y}_{0,S}^T\right] = E\left[\frac{N}{n_c}\sum_{i\in I_0} Y_i\right] = N\cdot\bar{Y}_0 = Y_0^T, \quad (4)$$
where $\bar{Y}_0$ is the mean value of $Y_{0i}$ over all i units (and is not an observable quantity). A proof of the unbiasedness of $\hat{Y}_{1,S}^T$ directly follows the form of equation 4.
From equation 1, it follows that we may construct an unbiased estimator of Δ:
$$\hat{\Delta}_S = \frac{1}{N}\left[\hat{Y}_{1,S}^T - \hat{Y}_{0,S}^T\right] = \sum_{i\in I_1} Y_i/n_t - \sum_{i\in I_0} Y_i/n_c, \quad (5)$$
where $\sum_{i\in I_1} Y_i/n_t$ is the mean value of $Y_i$ for all units assigned to treatment and $\sum_{i\in I_0} Y_i/n_c$ is the mean value of $Y_i$ for all units assigned to control. $\hat{\Delta}_S$ is known as the difference-in-means estimator.
3.2 Properties of The Difference-In-Means Estimator Under Random Allocation of Clusters
We now derive the bias associated with the difference-in-means estimator when clusters of units, rather than individual units, are randomly allocated to treatment conditions. As our derivation will show, the bias arises whenever outcomes are related to cluster size.
Formally, suppose each cluster j = 1, 2, …, M is assigned to either treatment or
control. Define mt and M as (fixed) integers such that 0 < mt < M. Now mt clusters
are randomly assigned to treatment (Dj = 1) and the remaining mc = M–mt clusters
are assigned to control (Dj = 0). Define J0 as the set of all j such that Dj = 0 and J1 as
the set of all j such that Dj = 1. Let Y0ij be the response of the ith individual in the jth
cluster if the cluster is assigned to control and let Y1ij be the response of the ith indi-
vidual in the jth cluster if the cluster is assigned to treatment. Let nj be the number
of individuals in the jth cluster. Note that all individuals have the same probability
mt/M of entering treatment.
The estimators in equations 2 and 3 can be rewritten as $\hat{Y}_{1,S}^T = N\sum_{j\in J_1}\sum_{i=1}^{n_j} Y_{ij}\big/\sum_{j\in J_1} n_j$ and $\hat{Y}_{0,S}^T = N\sum_{j\in J_0}\sum_{i=1}^{n_j} Y_{ij}\big/\sum_{j\in J_0} n_j$. The difference-in-means estimator is then
$$\hat{\Delta}_S = \frac{\sum_{j\in J_1}\sum_{i=1}^{n_j} Y_{ij}}{\sum_{j\in J_1} n_j} - \frac{\sum_{j\in J_0}\sum_{i=1}^{n_j} Y_{ij}}{\sum_{j\in J_0} n_j}. \quad (6)$$
The double summations in the numerators make explicit that summation takes
place across individuals in different clusters. In the denominators, the summa-
tions operate over clusters. While the estimator remains unchanged from equa-
tion 5, expressing it this way reveals a fundamental problem with its application.
The trouble with using the estimator in equation 6 is that the quantities
$n_t = \sum_{j\in J_1} n_j$ and $n_c = \sum_{j\in J_0} n_j$ are no longer fixed numbers as they were in equation 5, but are now random variables. The total number of individuals in treat-
ment and control now depends on the size of the particular clusters assigned to
the experimental groups. To understand why this dependence is problematic, we
need only examine equation 4: the term N/nc may be moved to the outside of the
expectation operator because it is a fixed constant. When nc is a random variable,
calculating the expectation is more involved. In general, for a ratio of two random variables u and v,
$$E\left[\frac{u}{v}\right] = \frac{1}{E[v]}\left(E[u] - \mathrm{Cov}\left(\frac{u}{v},\, v\right)\right). \quad (7)$$
Applying equation 7 to each term of the difference-in-means estimator,
$$E[\hat{\Delta}_S] = \frac{1}{N}\left[Y_1^T - Y_0^T\right] - \frac{M}{N}\left[\frac{1}{m_t}\mathrm{Cov}\left(\frac{\sum_{j\in J_1}\sum_{i=1}^{n_j} Y_{1ij}}{\sum_{j\in J_1} n_j},\; \sum_{j\in J_1} n_j\right) - \frac{1}{m_c}\mathrm{Cov}\left(\frac{\sum_{j\in J_0}\sum_{i=1}^{n_j} Y_{0ij}}{\sum_{j\in J_0} n_j},\; \sum_{j\in J_0} n_j\right)\right].$$
It follows that the bias is
$$E[\hat{\Delta}_S] - \Delta = -\frac{M}{N}\left[\frac{1}{m_t}\mathrm{Cov}\left(\frac{\sum_{j\in J_1}\sum_{i=1}^{n_j} Y_{1ij}}{\sum_{j\in J_1} n_j},\; \sum_{j\in J_1} n_j\right) - \frac{1}{m_c}\mathrm{Cov}\left(\frac{\sum_{j\in J_0}\sum_{i=1}^{n_j} Y_{0ij}}{\sum_{j\in J_0} n_j},\; \sum_{j\in J_0} n_j\right)\right]. \quad (8)$$
Inspection of this term reveals that, if the size of the cluster is correlated with the
potential outcomes in the cluster, the difference-in-means estimator is biased.
Moreover, the presence of the terms 1/mt and 1/mc shows that the magnitude (and
even the direction) of the bias can depend on the relative number of clusters allo-
cated to treatment and control.
In some special cases, there will be no bias, such as when the cluster size
does not vary or when there is no covariance between cluster size and outcomes.
Nonetheless, in applied research we might expect cluster size to be related to
outcomes. For example, precinct size may be related to the characteristics of the
precinct, such as partisan composition and voting rates. In Section 6 we show
an example where cluster size is significantly related to treatment effect. Such
an association between cluster size and treatment effect has been referred to as
nonignorable cluster size (e.g. Hoffman et al. 2001).
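To make the source of the bias concrete, the following is a minimal simulation sketch in R; all numbers are hypothetical and chosen only for illustration. Clusters are randomly allocated, the cluster-level treatment effect is constant, but cluster totals grow with cluster size, so the difference-in-means estimator is biased.

```r
# Hypothetical illustration: difference-in-means under cluster randomization
# when potential outcomes are related to cluster size.
set.seed(1)
M  <- 10                                        # clusters
mt <- 5                                         # clusters assigned to treatment
nj <- c(5, 10, 20, 40, 80, 5, 10, 20, 40, 80)   # cluster sizes
Y0j <- 0.2 * nj                                 # control cluster totals grow with size
Y1j <- 0.2 * nj + 2                             # constant cluster-level effect of 2
true_ate <- sum(Y1j - Y0j) / sum(nj)

dim_draws <- replicate(20000, {
  treated <- sample(M, mt)                      # random allocation of clusters
  sum(Y1j[treated]) / sum(nj[treated]) - sum(Y0j[-treated]) / sum(nj[-treated])
})
mean(dim_draws) - true_ate                      # nonzero: the estimator is biased
```

Because the covariance terms in equation 8 depend on 1/mt and 1/mc, rerunning the sketch with an unequal split of clusters changes the magnitude (and possibly the sign) of the simulated bias.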
3.3 Asymptotic Properties of the Difference-In-Means Estimator With Random Allocation of Clusters
We consider Brewer's (1979) simple notion of asymptotic growth, illustrated in Figure 1, under which the population is copied h–1 times; an estimator is consistent under this scheme if it converges to the parameter (in probability) as h→∞. Under growth in the number of clusters (Panel A of Figure 1), the difference-in-means estimator takes the same form as in equation 6, where in this case J1 is defined as the set of hmt treatment clusters and J0 is defined as the set of hmc control clusters. As h→∞, by the weak law of large numbers,
$$\frac{1}{h}\sum_{j\in J_1}\sum_{i=1}^{n_j} Y_{ij} \xrightarrow{p} Y_1^T\cdot\frac{hm_t}{hM}, \qquad \frac{1}{h}\sum_{j\in J_0}\sum_{i=1}^{n_j} Y_{ij} \xrightarrow{p} Y_0^T\cdot\frac{hm_c}{hM}, \qquad \frac{1}{h}\sum_{j\in J_1} n_j \xrightarrow{p} N\cdot\frac{hm_t}{hM},$$
and $\frac{1}{h}\sum_{j\in J_0} n_j \xrightarrow{p} N\cdot\frac{hm_c}{hM}$. By Slutsky's theorem,
$$\hat{\Delta}_S \xrightarrow{p} \frac{Y_1^T\cdot\frac{hm_t}{hM}}{N\cdot\frac{hm_t}{hM}} - \frac{Y_0^T\cdot\frac{hm_c}{hM}}{N\cdot\frac{hm_c}{hM}} = \frac{Y_1^T - Y_0^T}{N} = \Delta. \quad (9)$$
The difference-in-means estimator is therefore consistent when the number of clusters grows large.
Figure 1: Two versions of Brewer’s simple notion of asymptotic growth. The population is
simply copied h–1 times. In Panel A, copies of the clusters are made and the number of clusters
grows. In Panel B, the number of clusters is fixed and the individuals within are copied. An
estimator is consistent under asymptotic growth if it converges to the parameter as h→∞.
Under growth in the size of the clusters with the number of clusters held fixed (Panel B of Figure 1), each cluster contains $hn_j$ individuals and the estimator becomes
$$\hat{\Delta}_S = \frac{\sum_{j\in J_1}\sum_{i=1}^{hn_j} Y_{ij}}{\sum_{j\in J_1} hn_j} - \frac{\sum_{j\in J_0}\sum_{i=1}^{hn_j} Y_{ij}}{\sum_{j\in J_0} hn_j} = \frac{\sum_{j\in J_1}\sum_{i=1}^{n_j} Y_{ij}}{\sum_{j\in J_1} n_j} - \frac{\sum_{j\in J_0}\sum_{i=1}^{n_j} Y_{ij}}{\sum_{j\in J_0} n_j}. \quad (10)$$
As h→∞, the estimate remains unchanged: increasing N does not help when the number of clusters is fixed. This proves that the bias articulated in equation 8 is unmitigated for increasingly large clusters.
3.4 Discussion
The results of this section highlight the fact that, for some designs, bias may not
be mitigated with increased units. For example, imagine a study of the effect of
state-level policy on public opinion. Increasing the number of surveys conducted
does nothing to decrease bias in that case since the number of states is fixed.
More troubling, the above results also suggest that the bias of an estimator
that averages together a number of biased sub-estimates will not diminish with
increasing number of sub-estimates. Consider a block randomized design where
clusters (e.g. houses, clinics, precincts) are randomized; if a fixed effects regres-
sion is used to “control” for groups, then adding more units by increasing the
number of blocks (strata) does not diminish the bias. This is because the fixed
effects estimator is simply a weighted average of group-level difference-in-means
estimates (cf. Angrist and Pischke 2009, Chapter 5).4
4 However, as the formulas suggest, a way to mitigate such bias would be to block units based
on cluster size as suggested by Imai et al. (2009).
4 Unbiased Estimation of Treatment Effects Under Random Allocation of Clusters
By understanding bias as a problem fundamental to ratio estimation, we can
circumvent the bias with an alternative design-based estimator. Notationally, it
helps to clarify the task if we consider cluster totals – i.e. the sum of the responses
of the individuals in each cluster. Define $Y_{0j}^T = \sum_{i=1}^{n_j} Y_{0ij}$ as the sum of responses of the individuals in the jth cluster if assigned to control and $Y_{1j}^T = \sum_{i=1}^{n_j} Y_{1ij}$ as the sum of responses of the individuals in the jth cluster if assigned to treatment. For each individual, only one of the two possible responses, $Y_{0ij}$ or $Y_{1ij}$, may be observed and, since individuals are assigned to treatment conditions in clusters, for any given cluster, only one of the possible totals, $Y_{0j}^T$ or $Y_{1j}^T$, may be observed. The observed cluster total for cluster j, $Y_j^T$, may be expressed as $Y_j^T = D_j Y_{1j}^T + (1-D_j)Y_{0j}^T$.
Using this new notation, the ATE may be expressed as
$$\Delta = \frac{\sum_{j=1}^{M}\sum_{i=1}^{n_j}(Y_{1ij} - Y_{0ij})}{\sum_{j=1}^{M} n_j} = \frac{\sum_{j=1}^{M} Y_{1j}^T - \sum_{j=1}^{M} Y_{0j}^T}{\sum_{j=1}^{M} n_j} = \frac{1}{N}\left[Y_1^T - Y_0^T\right].$$
Following Horvitz and Thompson (1952), define an estimator of $Y_0^T$,
$$\hat{Y}_{0,HT}^T = \frac{M}{m_c}\sum_{j\in J_0} Y_{0j}^T = \frac{M}{m_c}\sum_{j\in J_0} Y_j^T. \quad (11)$$
One can think of this estimator as estimating the average of the cluster totals
(among control clusters) and then multiplying by the number of clusters M to get
the estimated total for all units in the study. Likewise,
$$\hat{Y}_{1,HT}^T = \frac{M}{m_t}\sum_{j\in J_1} Y_{1j}^T = \frac{M}{m_t}\sum_{j\in J_1} Y_j^T. \quad (12)$$
Following the same steps as equation 4, it can be shown that $\hat{Y}_{0,HT}^T$ and $\hat{Y}_{1,HT}^T$ are unbiased estimators of $Y_0^T$ and $Y_1^T$, respectively. The terms M/mt and M/mc
are fixed; when taking the expectations of equations 11 and 12, they can be moved
outside the expectation operator. Note that the random variables at the root of
the ratio estimation problem above, nt and nc, do not appear in either estimator.
From these two unbiased estimators, we may therefore construct an estimator of
the ATE:
$$\hat{\Delta}_{HT} = \frac{1}{N}\left[\hat{Y}_{1,HT}^T - \hat{Y}_{0,HT}^T\right] = \frac{M}{N}\left[\frac{1}{m_t}\sum_{j\in J_1} Y_j^T - \frac{1}{m_c}\sum_{j\in J_0} Y_j^T\right]. \quad (13)$$
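As a concrete illustration, here is a minimal R sketch (our own code, not the authors'; the data are made up) of the HT estimator in equation 13, computed from observed cluster totals, cluster-level treatment indicators, and the total number of individuals N.

```r
# HT estimator of the ATE from cluster totals (equation 13).
ht_estimate <- function(YjT, Dj, N) {
  M  <- length(YjT)
  mt <- sum(Dj == 1)
  mc <- sum(Dj == 0)
  Y1_hat <- (M / mt) * sum(YjT[Dj == 1])   # estimated total under treatment (equation 12)
  Y0_hat <- (M / mc) * sum(YjT[Dj == 0])   # estimated total under control (equation 11)
  (Y1_hat - Y0_hat) / N
}

# Hypothetical data: six clusters, three treated, 180 individuals in total.
ht_estimate(YjT = c(12, 30, 18, 25, 9, 40),
            Dj  = c(1, 0, 1, 0, 1, 0),
            N   = 180)
```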
We now consider the behavior of the HT estimator when the outcomes are linearly transformed, $Y_{ij}^* = b_0 + b_1\cdot Y_{ij}$. An estimator $\hat{\Delta}$ is invariant to such a transformation if the estimate based on the transformed data satisfies
$$\hat{\Delta}^* = b_1\cdot\hat{\Delta}, \quad (14)$$
i.e. the ATE estimated from linearly transformed outcomes will be equal to the ATE estimated from non-transformed outcomes multiplied by the scaling factor b1. In Appendix A, we demonstrate that the HT estimator is not location-invariant because the estimate based on the transformed data will be
$$\hat{\Delta}_{HT}^* = b_0\cdot\frac{M}{N}\left[\frac{1}{m_t}\sum_{j\in J_1} n_j - \frac{1}{m_c}\sum_{j\in J_0} n_j\right] + b_1\cdot\hat{\Delta}_{HT}. \quad (15)$$
Unless b0 = 0, the first term in equation 15 does not generally reduce to zero but instead varies across treatment assignments, so equation 15 is not generally equivalent to equation 14 for a given randomization. Note that, while a multiplicative scale
change (e.g. transforming feet to inches) need not be a concern, a linear trans-
formation that includes a location shift (e.g. reversing a binary indicator variable
or transforming Fahrenheit to Celsius) will lead to a violation of invariance. For
any given randomization, linearly transforming the data such that the intercept
changes can yield different estimates.
4.2 Deriving Estimators of the Variance of the Horvitz-Thompson Estimator Under Random Allocation of Clusters
In our derivation of variances, we follow the general formulations of Freedman
et al. (1998), which follow from a long tradition dating from Neyman (1923). The
variance of the estimator in equation 13 is
$$V(\hat{\Delta}_{HT}) = \frac{1}{N^2}\left[V\left(\hat{Y}_{0,HT}^T\right) + V\left(\hat{Y}_{1,HT}^T\right) - 2\,\mathrm{Cov}\left(\hat{Y}_{0,HT}^T,\, \hat{Y}_{1,HT}^T\right)\right]. \quad (16)$$
While estimators of $V(\hat{Y}_{0,HT}^T)$ and $V(\hat{Y}_{1,HT}^T)$ can be constructed from the observed data, there does not generally exist an unbiased estimator for $\mathrm{Cov}(\hat{Y}_{0,HT}^T, \hat{Y}_{1,HT}^T)$, because $Y_{0j}^T$ and $Y_{1j}^T$ are never observed for the same cluster.
From finite-population sampling theory, the components of equation 16 are
$$V\left(\hat{Y}_{0,HT}^T\right) = \frac{M^2}{m_c}\,\frac{M-m_c}{M-1}\,\sigma^2(Y_{0j}^T),$$
$$V\left(\hat{Y}_{1,HT}^T\right) = \frac{M^2}{m_t}\,\frac{M-m_t}{M-1}\,\sigma^2(Y_{1j}^T),$$
and
$$\mathrm{Cov}\left(\hat{Y}_{0,HT}^T,\, \hat{Y}_{1,HT}^T\right) = -\frac{M^2}{M-1}\,\sigma(Y_{0j}^T, Y_{1j}^T),$$
where $\sigma^2(\cdot)$ and $\sigma(\cdot,\cdot)$ denote the variance and covariance of the cluster totals taken over all M clusters. Substituting into equation 16,
$$V(\hat{\Delta}_{HT}) = \frac{1}{N^2}\left[\frac{M^2}{m_c}\,\frac{M-m_c}{M-1}\,\sigma^2(Y_{0j}^T) + \frac{M^2}{m_t}\,\frac{M-m_t}{M-1}\,\sigma^2(Y_{1j}^T) + \frac{2M^2}{M-1}\,\sigma(Y_{0j}^T, Y_{1j}^T)\right]$$
$$= \frac{M^2}{N^2}\left[\frac{M}{M-1}\left(\frac{\sigma^2(Y_{0j}^T)}{m_c} + \frac{\sigma^2(Y_{1j}^T)}{m_t}\right) + \frac{1}{M-1}\left[2\sigma(Y_{0j}^T, Y_{1j}^T) - \sigma^2(Y_{0j}^T) - \sigma^2(Y_{1j}^T)\right]\right]. \quad (17)$$
A conservative estimator of the variance may be constructed from the observed cluster totals by omitting the covariance term:
$$\hat{V}(\hat{\Delta}_{HT}) = \frac{M^2}{N^2}\left[\frac{\sum_{j\in J_0}\left(Y_j^T - \bar{Y}_{cj}^T\right)^2}{m_c(m_c-1)} + \frac{\sum_{j\in J_1}\left(Y_j^T - \bar{Y}_{tj}^T\right)^2}{m_t(m_t-1)}\right],$$
where $\bar{Y}_{cj}^T = \sum_{j\in J_0} Y_j^T/m_c$, the mean value of $Y_j^T$ over all $j\in J_0$, and $\bar{Y}_{tj}^T = \sum_{j\in J_1} Y_j^T/m_t$, the mean value of $Y_j^T$ over all $j\in J_1$. Variance identification is complicated, however, when experiments are performed on finite populations (Aronow et al. 2014) and thus it may be the case that no single variance estimator is generally adequate. This issue is compounded when N is small and asymptotic approximations may be poor.
We propose an alternative estimator of the variance, constructed by assuming a sharp null hypothesis and either analytically or computationally calculating the variance of the estimator under that hypothesis. One common choice is the sharp null hypothesis of no treatment effect: H0: τi = 0, ∀i. H0 implies that the treatment has no effect whatsoever on the outcome, i.e. that both potential outcomes are identical: Y0i = Y1i = Yi.
5 When M is large, researchers may encounter numerical problems computing M2 and, later, M4. This problem may be obviated by replacing $M^2/N^2$ with $\left(\frac{1}{M}\sum_{j=1}^{M} n_j\right)^{-2}$, the reciprocal of the square of the average number of units per cluster.
When the sharp null hypothesis of no effect holds, we know two important facts: $\sigma^2(Y_{0j}^T) = \sigma^2(Y_{1j}^T) = \sigma^2(Y_j^T)$ and $\sigma(Y_{0j}^T, Y_{1j}^T) = \sigma^2(Y_j^T)$. By substituting $\sigma^2(Y_j^T)$ into the last line of equation 17, we may calculate the true variance under this null hypothesis,
$$V^N(\hat{\Delta}_{HT}) = \frac{M^2}{N^2}\left[\frac{M}{M-1}\left(\frac{\sigma^2(Y_j^T)}{m_c} + \frac{\sigma^2(Y_j^T)}{m_t}\right) + \frac{1}{M-1}\left[2\sigma^2(Y_j^T) - \sigma^2(Y_j^T) - \sigma^2(Y_j^T)\right]\right]$$
$$= \frac{M^4\,\sigma^2(Y_j^T)}{N^2(M-1)\,m_c m_t}.$$
Note that if the sharp null hypothesis of no effect holds, $V^N(\hat{\Delta}_{HT})$ is the true variance, which can be calculated from the data exactly or by way of resampling. When the sharp null hypothesis of no effect does not necessarily hold, $V^N(\hat{\Delta}_{HT})$ may be construed as an estimator of $V(\hat{\Delta}_{HT})$. We therefore refer to a variance estimator constructed by assuming the sharp null hypothesis of no effect as $\hat{V}^N(\hat{\Delta}_{HT})$.
The primary benefit of using $\hat{V}^N(\hat{\Delta}_{HT})$ is that it tends to be more stable than $\hat{V}(\hat{\Delta}_{HT})$, particularly when either nc or nt is small, because it combines the variance of the treatment and control groups. In cases where $\hat{V}(\hat{\Delta}_{HT})$ is imprecise, $\hat{V}^N(\hat{\Delta}_{HT})$ may be preferable. Highly imprecise standard errors may be downwardly biased even when the associated variance estimator is conservative. The square root is a concave function so, by Jensen's inequality, $E\left[\hat{V}(\hat{\Delta}_{HT})^{0.5}\right] \le \left(E\left[\hat{V}(\hat{\Delta}_{HT})\right]\right)^{0.5}$. Since the estimates from $\hat{V}^N(\hat{\Delta}_{HT})$ will tend to remain stable across randomizations, its use may therefore avoid the bias resulting from Jensen's inequality. However, when effect sizes are large, $\hat{V}^N(\hat{\Delta}_{HT})$ will tend to overestimate the true sampling variability.
Recent theoretical results suggest that $\hat{V}^N(\hat{\Delta}_{HT})$ may be adequate as a conservative approximation. In general, $\hat{V}^N(\hat{\Delta}_{HT})$ will be conservative relative to the true variance if effects are constant (at the cluster scale) or if the number of clusters is balanced, in a result that directly follows from theorem 3 of Ding (2014) and Samii and Aronow (2012) (by way of the relationship between pooled and combined variance). These results indicate that $\hat{V}^N(\hat{\Delta}_{HT})$ will have a higher value than that of the true variance if treatment effects are in fact constant at the cluster scale. For these reasons, choosing the sharp null of no effect as an approximation will generally be conservative among the class of hypotheses such that effects are constant at the cluster scale.6
6 Researchers may seek to calculate separate variance estimators for each of a grid of
hypothesized, constant treatment effects, and use these to form a confidence interval by way of
inverting hypothesis tests. We thank an anonymous reviewer for this suggestion.
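A minimal R sketch of the two variance estimators discussed in this section follows; the inputs mirror the hypothetical HT example above, and the code is ours rather than the authors'.

```r
# Conservative and sharp-null variance estimators for the HT estimator.
ht_variance <- function(YjT, Dj, N) {
  M  <- length(YjT); mt <- sum(Dj == 1); mc <- sum(Dj == 0)
  Yt <- YjT[Dj == 1]; Yc <- YjT[Dj == 0]
  # conservative estimator: within-arm variance terms, covariance term omitted
  v_cons <- (M^2 / N^2) * (sum((Yc - mean(Yc))^2) / (mc * (mc - 1)) +
                           sum((Yt - mean(Yt))^2) / (mt * (mt - 1)))
  # sharp-null estimator: sigma2 is the variance of the cluster totals over all M clusters
  sigma2 <- sum((YjT - mean(YjT))^2) / M
  v_null <- M^4 * sigma2 / (N^2 * (M - 1) * mc * mt)
  c(conservative = v_cons, sharp_null = v_null)
}

# Standard errors for the hypothetical data used above.
sqrt(ht_variance(YjT = c(12, 30, 18, 25, 9, 40),
                 Dj  = c(1, 0, 1, 0, 1, 0),
                 N   = 180))
```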
When clusters are randomized separately within B blocks (strata), the block-level HT estimates may be combined into an overall estimate,
$$\hat{\Delta}_{HT}^B = \sum_{b=1}^{B}\frac{N_b}{N}\hat{\Delta}_{HT}^b, \quad (18)$$
where $N_b$ is the number of units in the bth block. From first principles, the variance of the estimator is
$$V\left(\hat{\Delta}_{HT}^B\right) = \sum_{b=1}^{B}\frac{N_b^2}{N^2}V\left(\hat{\Delta}_{HT}^b\right), \quad (19)$$
since randomization is carried out independently across blocks.
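For a blocked design, equations 18 and 19 amount to a size-weighted combination of block-level results; a small hypothetical R sketch:

```r
# Combining hypothetical block-level estimates (equation 18) and variances (equation 19).
N_b   <- c(600, 250, 150)          # units per block
est_b <- c(0.04, 0.07, 0.01)       # block-level HT estimates
var_b <- c(0.0009, 0.0016, 0.0025) # block-level variance estimates
w <- N_b / sum(N_b)
c(estimate = sum(w * est_b), variance = sum(w^2 * var_b))
```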
5 Difference Estimators
In this section, we propose a simple extension of the HT estimator to improve the
efficiency of the estimator as well as confer the important property of location
invariance.
A major source of variability with the HT estimator is the variation in the number
of individuals in each cluster. Clusters with large nj will tend to have larger
values of YjT – that is, in many applications, as clusters get larger, the sum of
the outcomes for that cluster will also tend to get larger. We use the Des Raj (1965) difference estimator to reduce this variability. To derive the Des Raj difference estimator in this context, we first derive our estimates of the study population totals, $Y_0^T$ and $Y_1^T$, by “differencing” off some of the variability:
$$\hat{Y}_{0,R1}^T = \frac{M}{m_c}\sum_{j\in J_0}\left(Y_j^T - k(n_j - N/M)\right), \quad (20)$$
$$\hat{Y}_{1,R1}^T = \frac{M}{m_t}\sum_{j\in J_1}\left(Y_j^T - k(n_j - N/M)\right), \quad (21)$$
where k is a constant chosen by the researcher and fixed prior to the analysis.
To develop an intuition about this method, note that it is equivalent to defining a new “differenced” variable $U_j^T = Y_j^T - k(n_j - N/M)$ and conducting the analysis based on $U_j^T$ instead of $Y_j^T$. So long as k is fixed before analysis, this strategy does not lead to bias because
$$E[k(n_j - N/M)] = kE[n_j - N/M] = k\cdot 0 = 0. \quad (22)$$
It follows that the HT and Des Raj estimators have the same expected value. Since $\hat{Y}_{0,R1}^T$ and $\hat{Y}_{1,R1}^T$ are unbiased, it follows that the Des Raj estimator,
$$\hat{\Delta}_{R1} = \frac{1}{N}\left[\hat{Y}_{1,R1}^T - \hat{Y}_{0,R1}^T\right],$$
is also unbiased.8
7 A similar estimator is proposed by Hansen and Bowers (2009), differing primarily in that it
contains a random denominator.
8 Note that estimating k from the same data set can lead to bias, as we demonstrate in Appendix B, raising the question of where to obtain a suitable value. In Section 6, we suggest using data from other blocks in experiments with blocking. Another option would be to find an auxiliary data source from which a trustworthy value of k can be estimated. In survey sampling, researchers sometimes accept the bias of estimating k with regression (Sarndal 1978), but the focus of the current paper is on unbiased estimation so regression estimation is outside our scope. We recommend that either the value of or procedure for choosing k be specified in a preanalysis planning document, so as to reduce the uncertainty associated with researcher discretion.
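The following R sketch (hypothetical inputs, our own code) computes the Des Raj difference estimator from cluster totals and cluster sizes; k is supplied by the researcher and, per footnote 8, should be fixed before the experimental outcomes are analyzed.

```r
# Des Raj difference estimator (equations 20, 21, and the resulting ATE estimate).
desraj_estimate <- function(YjT, nj, Dj, N, k) {
  M  <- length(YjT); mt <- sum(Dj == 1); mc <- sum(Dj == 0)
  UjT <- YjT - k * (nj - N / M)              # differenced cluster totals
  Y1_hat <- (M / mt) * sum(UjT[Dj == 1])
  Y0_hat <- (M / mc) * sum(UjT[Dj == 0])
  (Y1_hat - Y0_hat) / N
}

desraj_estimate(YjT = c(12, 30, 18, 25, 9, 40),
                nj  = c(20, 45, 25, 35, 15, 40),
                Dj  = c(1, 0, 1, 0, 1, 0),
                N   = 180,
                k   = 0.6)   # k chosen from prior or auxiliary information
```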
A conservative estimator of the variance of $\hat{\Delta}_{R1}$ takes the same form as the estimator in Section 4.2, with $U_j^T$ in place of $Y_j^T$:
$$\hat{V}(\hat{\Delta}_{R1}) = \frac{M^2}{N^2}\left[\frac{\sum_{j\in J_0}\left(U_j^T - \bar{U}_{cj}^T\right)^2}{m_c(m_c-1)} + \frac{\sum_{j\in J_1}\left(U_j^T - \bar{U}_{tj}^T\right)^2}{m_t(m_t-1)}\right], \quad (23)$$
where $\bar{U}_{cj}^T = \sum_{j\in J_0} U_j^T/m_c$, the mean value of $U_j^T$ in the control condition, and $\bar{U}_{tj}^T = \sum_{j\in J_1} U_j^T/m_t$, the mean value of $U_j^T$ in the treatment condition. Likewise, following the approach from Section 4.2, we may easily construct a variance estimator for $\hat{\Delta}_{R1}$ by assuming the sharp null hypothesis of no treatment effect:
$$\hat{V}^N(\hat{\Delta}_{R1}) = \frac{M^4\,\sigma^2(U_j^T)}{N^2(M-1)\,m_c m_t}.$$
One benefit of the Des Raj estimator is that it is invariant to location transformations, regardless of the accuracy of the researcher's choice of k. In this section, we prove the invariance of the Des Raj estimator. When $Y_{0ij}$ and $Y_{1ij}$ are linearly transformed, k will also change: since k is on the same scale as the outcome variable, the same transformation must be applied to k as to the outcomes,
$$k^* = b_0 + b_1\cdot k. \quad (24)$$
Using this new k*, we may again define new differenced treatment outcomes,
$$U_{1j}^{T*} = Y_{1j}^{T*} - k^*\cdot(n_j - N/M)$$
$$= \sum_{i=1}^{n_j}(b_0 + b_1\cdot Y_{1ij}) - (b_0 + b_1\cdot k)\cdot(n_j - N/M)$$
$$= n_j\cdot b_0 + b_1\cdot Y_{1j}^T - (b_0 + b_1\cdot k)\cdot(n_j - N/M)$$
$$= b_0\cdot N/M + b_1\cdot U_{1j}^T.$$
A parallel argument gives $U_{0j}^{T*} = b_0\cdot N/M + b_1\cdot U_{0j}^T$ for the control clusters, so the Des Raj estimate based on the transformed data is
$$\hat{\Delta}_{R1}^* = \frac{M}{N}\left[\frac{1}{m_t}\sum_{j\in J_1} U_j^{T*} - \frac{1}{m_c}\sum_{j\in J_0} U_j^{T*}\right]$$
$$= \frac{M}{N}\left[\frac{1}{m_t}\sum_{j\in J_1}\left(b_0\cdot N/M + b_1\cdot U_{1j}^T\right) - \frac{1}{m_c}\sum_{j\in J_0}\left(b_0\cdot N/M + b_1\cdot U_{0j}^T\right)\right]$$
$$= \frac{M}{N}\left[b_1\frac{1}{m_t}\sum_{j\in J_1} U_{1j}^T - b_1\frac{1}{m_c}\sum_{j\in J_0} U_{0j}^T\right]$$
$$= b_1\cdot\hat{\Delta}_{R1}. \quad (25)$$
The Des Raj estimator is therefore invariant to linear transformation because any
linear transformation to the outcome will necessarily be reflected in k.
Note that the HT estimator may be considered a special case of the Des Raj estimator with k = 0. If that assumption is made explicit, then when the scale of the outcome changes, k is transformed along with it (to k* = b0 + b1·0 = b0), and invariance is preserved. The non-invariance of the HT estimator may therefore be thought of as a failure to recognize the implicit assumption that k = 0 and to transform to k* when the scale of the outcome changes.
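A quick numerical check of this argument, using the same hypothetical inputs as above (our own sketch), reverses a binary-style outcome (b0 = 1, b1 = −1) and compares the two estimators.

```r
# Non-invariance of HT versus invariance of Des Raj under Y* = 1 - Y.
YjT <- c(12, 30, 18, 25, 9, 40); nj <- c(20, 45, 25, 35, 15, 40)
Dj  <- c(1, 0, 1, 0, 1, 0); N <- 180; M <- 6; mt <- 3; mc <- 3; k <- 0.6

ht  <- function(Tj) (M / N) * (sum(Tj[Dj == 1]) / mt - sum(Tj[Dj == 0]) / mc)
raj <- function(Tj, kk) ht(Tj - kk * (nj - N / M))

YjT_star <- nj - YjT      # cluster totals of the reversed outcome Y* = 1 - Y
k_star   <- 1 - k         # k transformed as in equation 24

ht(YjT_star) + ht(YjT)                # nonzero: violates equation 14 (see equation 15)
raj(YjT_star, k_star) + raj(YjT, k)   # zero: equation 25 holds
```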
To see how the choice of k affects the precision of the Des Raj estimator, consider the variance of the differenced cluster totals under control,
$$\sigma^2(U_{0j}^T) = \frac{\sum_j\left(U_{0j}^T - \bar{U}_{0j}^T\right)^2}{M} = \frac{\sum_j\left(Y_{0j}^T - k(n_j - N/M) - \bar{Y}_{0j}^T\right)^2}{M} = \sigma^2(Y_{0j}^T) + k^2\sigma^2(n_j) - 2k\,\sigma(n_j, Y_{0j}^T),$$
where $\bar{U}_{0j}^T$ is the mean value of $U_{0j}^T$ over all j clusters. $k_{optim_c}$, the value of k that minimizes $\sigma^2(U_{0j}^T)$, can be found using simple optimization. Since the second derivative with respect to k, $2\sigma^2(n_j)$, must be positive, we may set the first derivative equal to zero and solve for k, so that
$$k_{optim_c} = \frac{\sigma(n_j, Y_{0j}^T)}{\sigma^2(n_j)}. \quad (26)$$
Equation 26 should look familiar to the reader: the best-fitting k is the ordinary least squares coefficient from a regression of $Y_{0j}^T$ on $n_j$.
Likewise, the optimal value of k for the potential outcomes under treatment is $k_{optim_t} = \sigma(n_j, Y_{1j}^T)/\sigma^2(n_j)$. Given that $k_{optim_t}$ does not generally equal $k_{optim_c}$, a researcher could justifiably identify different values of k for the treatment and control groups. In practice, however, this would require a good deal of prior knowledge (including knowledge about treatment effects); for this reason, a single value of k will typically be preferable. In Appendix C, we derive a single optimal value of k, $k_{optim*} = \frac{m_t}{M}k_{optim_c} + \frac{m_c}{M}k_{optim_t}$.
Unlike a structural parameter, the value of $k_{optim*}$ will depend on the number of clusters assigned to treatment and to control. Perhaps counterintuitively, when there are fewer clusters in the control condition, $k_{optim*}$ is more heavily weighted toward $k_{optim_c}$, the value of k that minimizes $\sigma^2(U_{0j}^T)$ (and vice versa). A simple intuition for this weighting is that the condition with fewer clusters will contribute more to the overall variance of the estimator. Consequently, the greatest increase in precision comes from adjustments made to units in that condition.
The chosen value of k will reduce the variability of the Des Raj estimator, $\hat{\Delta}_{R1}$, relative to the HT estimator when, for $k_{optim*} > 0$, $0 < k < 2k_{optim*}$ and, for $k_{optim*} < 0$, $0 > k > 2k_{optim*}$. In other words, the Des Raj estimator will have better precision than the HT estimator unless the researcher picks a k with the wrong sign or chooses a k that is more than twice the magnitude of $k_{optim*}$. Because $k_{optim*}$ will tend to be close to the average outcome for all individuals, the researcher will usually have prior knowledge about the mean individual-level outcome.9
Under the sharp null hypothesis of no treatment effect, $k_{optim*} = k_{optim_c} = k_{optim_t} = \sigma(n_j, Y_j^T)/\sigma^2(n_j)$, and thus the optimal k would be the ordinary least squares coefficient from regressing $Y_j^T$ on $n_j$. Prima facie, the intuitive next step would be to try to estimate k from the data, utilizing ordinary least squares on the observed data (perhaps controlling for $D_j$). However, regression estimates of k can lead to bias in the estimation of treatment effects. In Appendix B, we demonstrate that the bias from estimating k from within-sample data is
$$E\left[\frac{\hat{Y}_{1,R1}^T - \hat{Y}_{0,R1}^T}{N}\right] - \Delta = \frac{M}{N}\left(\mathrm{Cov}(\hat{k}, \bar{n}_{cj}) - \mathrm{Cov}(\hat{k}, \bar{n}_{tj})\right),$$
9 Note that our fundamental uncertainty about the optimal value of k does not itself contribute
to the uncertainty of our estimate since k is treated as a fixed constant, e.g. in equation 23.
where $\hat{k}$ is an estimator of k, $\bar{n}_{tj}$ is the mean value of $n_j$ for clusters in the treatment condition in a given randomization and $\bar{n}_{cj}$ is the mean value of $n_j$ for clusters in the control condition in a given randomization.
Knowing the optimal value of k under the sharp null hypothesis of no treat-
ment effect is nevertheless informative as we seek to construct principled prior esti-
mates for k. By using the ordinary least squares estimator on auxiliary data with
similar potential outcomes, we can approximate koptim* with out-of-sample data.
As we will demonstrate in our empirical example, such auxiliary data can come from the other blocks in a block randomized experiment. If one were concerned that estimating the values of k from other blocks of an experiment would lead to additional stochasticity in the values of $U_{0j}^T$ and $U_{1j}^T$, Monte Carlo simulations (whereby the values of k are recomputed for each simulation) may be used to compute the sharp null variance estimate.
5.4 Des Raj Difference Estimator for Cluster Size and Covariates
The Des Raj estimator may also be extended to include other covariates which
may further reduce the sampling variability of the estimator. Assume the
researcher has access to A covariates for each individual i in cluster j, denoted by $X_{aij}$, $a\in 1, 2, \ldots, A$. Define the cluster total of the covariate, $X_{aj}^T = \sum_{i=1}^{n_j} X_{aij}$, and define the sum of the $X_{aij}$ across all individuals in all clusters, $X_a^T = \sum_{j=1}^{M}\sum_{i=1}^{n_j} X_{aij}$. It
is simple to adapt the Des Raj estimator to incorporate these additional covariates.
Define constants k′ and ka (∀a) as prior estimates of the coefficients associated
with a regression of Yj on cluster size and cluster-level covariates, respectively.
Again, k′ and ka do not have causal interpretations. It follows that we may define
$$\hat{Y}_{0,R2}^T = \frac{M}{m_c}\sum_{j\in J_0}\Big[Y_j^T - \underbrace{k'(n_j - N/M)}_{\text{adjusting for size}} - \underbrace{\textstyle\sum_{a=1}^{A} k_a\left(X_{aj}^T - X_a^T/M\right)}_{\text{adjusting for other covariates}}\Big]$$
and
$$\hat{Y}_{1,R2}^T = \frac{M}{m_t}\sum_{j\in J_1}\Big[Y_j^T - k'(n_j - N/M) - \sum_{a=1}^{A} k_a\left(X_{aj}^T - X_a^T/M\right)\Big].$$
By the logic of equation 22, $\hat{Y}_{0,R2}^T$ and $\hat{Y}_{1,R2}^T$ are unbiased estimators of $Y_0^T$ and $Y_1^T$, respectively. It follows that we may again construct an unbiased estimator of Δ,
$$\hat{\Delta}_{R2} = \frac{1}{N}\left[\hat{Y}_{1,R2}^T - \hat{Y}_{0,R2}^T\right].$$
Following the same steps as in equation 25, it can be shown that as long as k′ undergoes the same linear transformation as the original data and each ka (∀a∈A) undergoes the same multiplicative scale shift, the Des Raj estimator with covariates will also be invariant. It will also be more efficient than the preceding estimators if the researcher's estimates for k′ and ka are reasonable; constructing variance estimators for $\hat{\Delta}_{R2}$ is simple and follows directly from Section 5.1.
Note that the efficiency characteristics of this Des Raj estimator may be
derived as in Section 5.3, where the same intuitions about efficiency hold. In prin-
ciple, a researcher should choose covariates that together do the best job of pre-
dicting values of the potential outcomes to achieve the values of $\hat{Y}_{0,R2}^T$ and $\hat{Y}_{1,R2}^T$
with the lowest variability across randomizations. In practice, a researcher might
apply a variable selection method such as penalized regression techniques using
an auxiliary data set to identify suitable covariates and values of k.
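A sketch of the covariate-adjusted estimator with a single additional cluster-level covariate total follows; the covariate, its coefficient k1, and all other inputs are hypothetical.

```r
# Des Raj estimator adjusting for cluster size and one cluster-level covariate total XjT.
desraj_cov_estimate <- function(YjT, nj, XjT, Dj, N, k_prime, k1) {
  M  <- length(YjT); mt <- sum(Dj == 1); mc <- sum(Dj == 0)
  UjT <- YjT - k_prime * (nj - N / M) - k1 * (XjT - sum(XjT) / M)
  (M / N) * (sum(UjT[Dj == 1]) / mt - sum(UjT[Dj == 0]) / mc)
}

desraj_cov_estimate(YjT = c(12, 30, 18, 25, 9, 40),
                    nj  = c(20, 45, 25, 35, 15, 40),
                    XjT = c(10, 28, 15, 22, 8, 35),   # e.g. cluster totals of a past outcome
                    Dj  = c(1, 0, 1, 0, 1, 0),
                    N = 180, k_prime = 0.6, k1 = 0.3)
```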
6 Application
In this application we reanalyze the data from Green and Vavreck (2008) who
used a cluster randomized design to examine the effectiveness of television ads
on voter turnout among 18- and 19-year-old voters in the 2004 presidential elec-
tion. The study randomized television cable districts to either a treatment group,
in which advertisements encouraging young people to vote were shown, or to
the control group. The original experiment included a total of 23,869 voters in 85
television cable districts in blocks (strata) of size 2 or 3. Because we wanted to use
prior turnout in the cable district as a covariate in our analysis, we limited the
analysis to the 80 cable districts for which this information was available from the
authors. This yielded 40 blocks of two cable districts each (one in treatment, the
other in control) and a total of 22,733 individual voters.
The outcome measure of interest, Yij, is whether or not the individual i in
cluster (cable district) j voted in the 2004 American presidential election (coded
1 if the individual voted, 0 if the individual did not vote). Because 18- and 19-year-olds are new registrants, they have no prior voter history, so individual voter history could not be used as a covariate. However, we use turnout rate in the cable
district in the 2000 election as a covariate as well as age. While the covariates are
somewhat less than ideal because they are unlikely to be particularly predictive,
they provide us with an opportunity to examine how the Raj difference estimator
performs when covariates are not particularly informative. In such a situation we might expect koptim to be near zero, and the values of k chosen may actually reduce the efficiency of the Raj difference estimator, since the condition 2koptim > k > 0 is unlikely to hold when koptim is near zero.
Randomization inference (RI) will allow us to assess the bias and variance of
any given estimator. In addition, RI allows the researcher to perform completely
nonparametric significance testing (see, e.g. Rosenbaum 2002). We refer to the
estimate produced by a given estimator as the test statistic. RI assumes that a
given sharp null hypothesis holds and evaluates the test statistic for every pos-
sible random assignment of units to treatment and control. By recalculating the
test statistic for each possible treatment assignment, the reference distribution of
the test statistic is constructed. Fisher’s exact test is a well-known form of RI for
significance testing, but the method is much more general.
Because the number of possible permutations increases rapidly with population size, RI may be computationally infeasible. We may use Monte Carlo simula-
tions to approximate RI by repeatedly assigning units to treatment and control
groups randomly and estimating the test statistic that would be observed for each
repetition. The distribution of the test statistic across randomizations forms the
reference distribution of the statistic. As the number of repetitions gets large, the
distribution of the test statistic based on repeated randomizations converges to
that of full RI. This method can achieve results arbitrarily close to RI by increasing
the number of repetitions.
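The following R sketch illustrates the Monte Carlo approach for the sharp null of no treatment effect with the HT estimate as the test statistic; the data are hypothetical and, for a blocked design like the application below, the re-randomization would be carried out within blocks.

```r
# Monte Carlo randomization inference under the sharp null of no effect.
YjT <- c(12, 30, 18, 25, 9, 40); nj <- c(20, 45, 25, 35, 15, 40)
Dj_obs <- c(1, 0, 1, 0, 1, 0); N <- sum(nj); M <- length(nj); mt <- sum(Dj_obs)

ht <- function(Dj) (M / N) * (sum(YjT[Dj == 1]) / mt - sum(YjT[Dj == 0]) / (M - mt))

observed <- ht(Dj_obs)
ref_dist <- replicate(5000, {
  Dj_sim <- integer(M)
  Dj_sim[sample(M, mt)] <- 1     # a new random allocation of clusters
  ht(Dj_sim)                     # under the sharp null the observed YjT are unchanged
})
mean(abs(ref_dist) >= abs(observed))   # two-sided randomization p-value
```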
We use randomization inference to examine the behavior of our estimators
and compare them with the behavior of three commonly used estimators. We
conduct randomization inference for two scenarios (5000 iterations). The first
scenario examines the behavior of the estimators under the sharp null hypothesis
of absolutely no treatment effect. The second scenario examines the behavior of
estimators under heterogeneous treatment effects.
Computing the test statistics under repeated randomizations requires that we can
observe both potential outcomes for each unit. Since in reality we only observe
the response of unit i under one of the treatments, we must impute the value of
the missing potential outcome before conducting RI. We conduct RI using two
different methods of imputation.
The first method assumes the sharp null hypothesis of no treatment effect.
This effectively imputes the missing potential outcome with the observed poten-
tial outcome.
In the second method we simulate heterogeneous treatment effects, first
modeling the data using logistic regression in order to impute missing potential
outcomes. This method looks to the data as a guide to creating realistic potential
outcomes that have a similar structure to the original data. We used the logistic
regression model,
$$P(Y_{ij} = 1) = 1 - \left[1 + \exp\left(\alpha + \tau t_j + \beta n_j + \phi n_j t_j + \sum_{f=1}^{F-1}\gamma_f\Gamma_f\right)\right]^{-1},$$
where tj is a treatment indicator for cluster j, nj is the cluster size, F is the number of blocks (in this case, 40), and Γf indicates whether cluster j is in block f. The terms α, τ, β, φ and γf are coefficients estimated from the data
using maximum likelihood methods. Note the coefficient φ is responsible for the
heterogeneous treatment effects. We estimate τ as 0.4, β as 1.4 and φ as –0.9.
We used this model to impute missing potential outcomes for each individual. To do so, the latent probability of response (voting) was first computed for each unit when treated, pti, and when not treated, pci, using the estimated model. Each missing Yci and Yti was then imputed using a random draw from a Bernoulli random variable with the probability estimated from the logistic regression model. The imputation process was conducted for each iteration of the RI; marginalizing over these imputations yields the expected value, and therefore the bias, of each estimator.
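A rough R sketch of this imputation step is given below; the data frame d and its column names (y, t, nj, block) are our own assumptions, and the model formula mirrors the specification in the text. As described above, the draws would be repeated for every iteration of the randomization inference.

```r
# Fit the imputation model and draw the missing potential outcomes.
fit <- glm(y ~ t + nj + t:nj + factor(block), family = binomial, data = d)

d_t <- transform(d, t = 1)                          # everyone assigned to treatment
d_c <- transform(d, t = 0)                          # everyone assigned to control
p_t <- predict(fit, newdata = d_t, type = "response")
p_c <- predict(fit, newdata = d_c, type = "response")

# Keep the observed outcome; draw the unobserved potential outcome from a Bernoulli.
y1 <- ifelse(d$t == 1, d$y, rbinom(nrow(d), 1, p_t))
y0 <- ifelse(d$t == 0, d$y, rbinom(nrow(d), 1, p_c))
```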
In this section, we define the estimators that will be compared. We will consider
four regression-based estimators as well as the three design-based estimators
proposed in this paper. We begin by detailing each of these estimators.
The first estimator is the regression without covariates, also known as the difference-in-means. The model can be written:
$$Y_{ij} = \beta_0 + \beta_1 D_j + e_{ij},$$
where $\beta_0$ is a constant, $\beta_1$ is the coefficient on the treatment indicator $D_j$, and $e_{ij}$ is an individual error term; the model is fitted with ordinary least squares. The second estimator is the fixed-effects regression, which adds indicators for the blocks:
$$Y_{ij} = \beta_0 + \beta_1 D_j + \sum_{f=1}^{F-1}\gamma_f\Gamma_f + e_{ij},$$
where β0, β1, Dj and eij are as above and Γf represents the dummy variable for the fth
block, and the model is again fitted with ordinary least squares. We then consider
the fixed-effects regression estimator that also adjusts for the covariates: average
turnout in 2000 and age. The model can be written:
$$Y_{ij} = \beta_0 + \beta_1 D_j + \beta_2 X_{2ij} + \beta_3 X_{3ij} + \sum_{f=1}^{F-1}\gamma_f\Gamma_f + e_{ij},$$
where $X_{2ij}$ and $X_{3ij}$ are the covariates (turnout in the cable district in 2000 and age).
In this application we use the alternate blocks of the experiment to derive the
values of k, k′ and ka from the data. For a given block, the values are estimated by
dropping that block from the data and regressing the outcome on the covariates
using data from the remaining 39 blocks.
To estimate k for the Des Raj estimator with only nj, we use the following model:
$$Y_j^T = \alpha + k n_j + e_j,$$
and to estimate k′, k1 and k2 for the Des Raj estimator with covariates, we use
$$Y_j^T = \alpha' + k' n_j + k_1 X_{1j}^T + k_2 X_{2j}^T + e_j,$$
where α′ is a constant, $X_{1j}^T$ is the total turnout in cluster j in the 2000 election, and $X_{2j}^T$ is the sum of ages in cluster j.
Note that in the sharp null scenario the estimated values of k, k′, k1 and k2 are
the same for all randomizations for a given block. For the heterogeneous treat-
ment effect scenario, however, these values can vary across randomizations as
the observed values of YjT change depending on whether cluster j is in treatment
or not. As mentioned above, this sort of variability in these values contributes
to the variability of the Raj difference estimator. In our application, the variance
estimators remain conservative nonetheless. In practice, if the contribution of k
to the uncertainty is a concern, Monte Carlo simulations could be used to esti-
mate the variance.
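A sketch of this leave-one-block-out choice of k, assuming a hypothetical cluster-level data frame cl with columns block, YjT and nj:

```r
# For each block, estimate k by OLS using only the clusters in the other blocks.
blocks <- unique(cl$block)
k_for_block <- sapply(blocks, function(b) {
  other <- subset(cl, block != b)
  coef(lm(YjT ~ nj, data = other))[["nj"]]
})
names(k_for_block) <- blocks
```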
6.4 Randomization Inference With the Sharp Null Hypothesis of No Treatment Effect
Figure 2 displays the results for the point estimators assuming the sharp null
hypothesis of no treatment effect. Solid vertical lines indicate the mean of the
sampling distributions.
Results show that all estimators are unbiased under the sharp null. The HT
estimator is the least precise estimator by far. The rest perform very competitively, with the random effects regression and Raj's difference estimator being the most precise.
Figure 2: Sampling distributions associated with the ATE estimators under the sharp null hypothesis of no treatment effect detailed in Section 6.
Figure 3: Sampling distributions associated with the SE estimators under the sharp null hypothesis of no treatment effect detailed in Section 6. Five thousand randomizations were used to estimate the sampling distributions. Density plots were generated using the density() function in R and SD estimates are computed from each empirical distribution. Distributions for the SEs under the sharp null hypothesis of no treatment effect were too narrow to display.
Figure 3 displays the results for the standard error estimators under the sharp null hypothesis. In the case of the regression with no covariates (difference-in-means) the standard errors are biased upwards due to the failure of this model to account for blocking.10 However, for the regression models that include fixed effects the standard errors are badly biased downward, as we might expect given that sandwich-type estimators tend to be unreliable in finite samples. The standard errors associated with the random effects model perform reasonably well, being only slightly biased downwards. Meanwhile, under the sharp null, the standard errors associated with the HT estimator and the Raj difference estimators are exact, being both unbiased and having no sampling variability.
6.5 Randomization Inference with Treatment Effect Heterogeneity
Figure 4 displays results under treatment effect heterogeneity. Solid vertical lines
indicate the mean of the sampling distributions. Dotted vertical lines indicate the
true treatment effect (0.7 percentage points).
The results demonstrate that the regressions tend to be biased to varying
degrees. Interestingly, the regression without covariates (difference-in-means)
is only slightly biased downward. That the difference-in-means is not terribly
biased can be understood as a result of the sample size (80 clusters) being suf-
ficiently large (recall the consistency proof in Section 3.3).
When the regressions include fixed-effects, however, the bias actually
increases. This can be understood in light of the fact that fixed-effects regression
estimates yield variance-weighted averages of the block-level estimates (Angrist
and Pischke 2009). In other words, the fixed-effects estimator is equivalent to
taking the difference-in-means for each block and then taking a weighted average
of them. Since the block-level estimates are each biased, the overall average is
similarly biased. As discussed in Section 3.4 above, this is a particularly troubling
property of the fixed-effects estimator because it will also be inconsistent for
increasing numbers of blocks. In other words, adding more blocks to the experi-
ment will not necessarily diminish the overall bias.
Again, the HT estimator is unbiased but has very poor precision. And while
the random effects estimator has the lowest standard deviation, Des Raj’s differ-
ence estimators are the most precise in terms of RMSE.
10 Although we consider the bias of the standard error estimator, in practice, bias is not an ideal
loss function for evaluating standard error estimators. However, given the size of the sample and
the typical rate of convergence for variance estimators, we expect that bias serves as an approxi-
mation for asymptotic bias, which is of greater interest for constructing confidence intervals.
Figure 4: ATE estimator sampling distributions associated with heterogeneous treatment effects detailed in Section 6. Five thousand randomizations were used to estimate the sampling distributions and to recover the expected value, and therefore bias, of the estimator. Bias and SE estimates in the upper-right of each plot are computed from each empirical distribution.
Note also that the addition of the covariates (age and turnout rate in 2000)
actually increases the variability in Raj's difference estimator. This is because the covariates are not particularly predictive of the outcome, and so the estimated values of the k's tend to miss their mark by a wide margin.
Finally, Figure 5 displays the performance of the standard error estimators
in the case of heterogeneous treatment effects. Results again show that the
“robust cluster” standard errors can perform very badly, being substantially
downwardly biased in the case of the regressions with fixed effects. The stand-
ard error estimator associated with the random effects regression performs
well, being only slightly upwardly biased. The standard error estimators for the
HT and Raj difference estimators are conservative, being biased only slightly
upwards.
7 Conclusion
The unbiased estimation of the ATE in cluster-randomized experiments has
been elusive. In unpacking the source of the bias in the difference-in-means
estimator, this paper has also identified some common design-estimator com-
binations where the bias of estimators will not diminish with sample size such
as pair-randomized designs combined with regression estimators with fixed
effects for block. This paper has returned to the first principles of randomiza-
tion and sampling theory, showing that the fundamental statistical properties
of randomization can be applied to modern causal inferential problems. Not
only does the Des Raj estimator provide the basis for an unbiased and location-invariant estimator for the analysis of cluster-randomized experiments, it also achieves improved precision relative to the HT estimator through covariate adjustment.
There are a number of theoretical implications of this return to the first prin-
ciples of randomization. First, machinery based solely on sampling-theoretic
ideas can be sufficient for precise and unbiased estimation of causal parameters.
Second, researchers need not feel that achieving precise and unbiased causal esti-
mates requires an up-to-date knowledge of complex statistical models: we may
easily derive estimators with good statistical properties using only fundamental
concepts. Third, utilizing such estimators serves to remind us of the importance
of this distinction between observational studies and randomized experiments.
The importance of the logic of the experiment, with its reliance on randomiza-
tion, may be lost when researchers rely on model-based estimators that may or
may not reflect the experimental design.
Figure 5: SE estimator sampling distributions associated with the heterogeneous treatment effect detailed in Section 6. Five thousand randomizations were used to estimate the sampling distributions. Density plots were generated using the density() function in R (R Development Core Team 2010) and SD estimates are computed from each empirical distribution. Distributions for the SEs under the sharp null hypothesis of no treatment effect were too narrow to display.
Appendix
Appendix A
Consider the linear transformation $Y_{ij}^* = b_0 + b_1\cdot Y_{ij}$, with cluster totals $Y_j^{T*} = \sum_{i=1}^{n_j} Y_{ij}^*$. The HT estimate computed from the transformed data is
$$\hat{\Delta}_{HT}^* = \frac{M}{N}\left[\frac{1}{m_t}\sum_{j\in J_1} Y_j^{T*} - \frac{1}{m_c}\sum_{j\in J_0} Y_j^{T*}\right]$$
$$= \frac{M}{N}\left[\frac{1}{m_t}\sum_{j\in J_1}\sum_{i=1}^{n_j} Y_{ij}^* - \frac{1}{m_c}\sum_{j\in J_0}\sum_{i=1}^{n_j} Y_{ij}^*\right]$$
$$= \frac{M}{N}\left[\frac{1}{m_t}\sum_{j\in J_1}\sum_{i=1}^{n_j}(b_0 + b_1\cdot Y_{ij}) - \frac{1}{m_c}\sum_{j\in J_0}\sum_{i=1}^{n_j}(b_0 + b_1\cdot Y_{ij})\right]$$
$$= \frac{M}{N}\left[\frac{1}{m_t}\sum_{j\in J_1}\left(n_j b_0 + b_1 Y_j^T\right) - \frac{1}{m_c}\sum_{j\in J_0}\left(n_j b_0 + b_1 Y_j^T\right)\right]$$
$$= b_0\cdot\frac{M}{N}\left[\frac{1}{m_t}\sum_{j\in J_1} n_j - \frac{1}{m_c}\sum_{j\in J_0} n_j\right] + b_1\cdot\hat{\Delta}_{HT},$$
which is equation 15.
Appendix B
Suppose that, rather than fixing k in advance, the researcher estimates k in equations 20 and 21 from the data to approximate the optimal value of k with an estimator $\hat{k}$.
In this scenario, the expected value of equation 21 yields
$$E\left[\hat{Y}_{1,R1}^T\right] = E\left[\frac{M}{m_t}\sum_{j\in J_1}\left(Y_j^T - \hat{k}(n_j - N/M)\right)\right]$$
$$= \frac{M}{m_t}\left(E\left[\sum_{j\in J_1} Y_j^T\right] - E\left[\sum_{j\in J_1}\hat{k} n_j\right] + E\left[\sum_{j\in J_1}\hat{k} N/M\right]\right)$$
$$= \frac{M}{m_t}\left(m_t\bar{Y}_{1j}^T - E\left[\hat{k}m_t\bar{n}_{tj}\right] + E\left[\hat{k}m_t N/M\right]\right)$$
$$= Y_1^T - M\left(E[\hat{k}\bar{n}_{tj}] - E[\hat{k}]E[\bar{n}_{tj}]\right)$$
$$= Y_1^T - M\,\mathrm{Cov}(\hat{k}, \bar{n}_{tj}), \quad (29)$$
where $\bar{n}_{tj}$ is the mean value of $n_j$ for clusters in the treatment condition in a given randomization. In the third line of equation 29, $\hat{k}$ moves outside the summation operator because it is a constant for a given randomization, and the final lines use the fact that $E[\bar{n}_{tj}] = N/M$. Likewise,
$$E\left[\hat{Y}_{0,R1}^T\right] = Y_0^T - M\,\mathrm{Cov}(\hat{k}, \bar{n}_{cj}), \quad (30)$$
where $\bar{n}_{cj}$ is the mean value of $n_j$ for clusters in the control condition in a given randomization. So the expected value of the estimator will be
$$E\left[\frac{\hat{Y}_{1,R1}^T - \hat{Y}_{0,R1}^T}{N}\right] = \Delta + \frac{M}{N}\left(\mathrm{Cov}(\hat{k}, \bar{n}_{cj}) - \mathrm{Cov}(\hat{k}, \bar{n}_{tj})\right). \quad (31)$$
The term on the right of equation 31 represents the bias. A special case with no
bias is when the sharp null hypothesis of no treatment effect holds and treat-
ment and control groups have equal numbers of clusters. We refer the reader to
Williams (1961), Freedman (2008a) and Freedman (2008b) for additional reading
on the particular bias associated with the regression adjustment of random
samples and experimental data.
Appendix C
The variance of $\hat{\Delta}_{R1}$ takes the same form as equation 17, with the differenced totals $U_{j0}^T = Y_{j0}^T - k(n_j - N/M)$ and $U_{j1}^T = Y_{j1}^T - k(n_j - N/M)$ in place of the cluster totals. Multiplying through by a positive constant,
$$v\,V(\hat{\Delta}_{R1}) = c\,\sigma^2(U_{j0}^T) + t\,\sigma^2(U_{j1}^T) + 2\,\sigma(U_{j0}^T, U_{j1}^T), \quad (32)$$
where $v = \frac{(M-1)N^2}{M^2}$, $c = \frac{M-m_c}{m_c}$, and $t = \frac{M-m_t}{m_t}$. Now note that the terms $\sigma^2(U_{j0}^T)$, $\sigma^2(U_{j1}^T)$, and $\sigma(U_{j0}^T, U_{j1}^T)$ in equation 32 can be written as follows:
$$\sigma^2(U_{j0}^T) = \sigma^2(Y_{j0}^T) + k^2\sigma^2(n_j) - 2k\,\sigma(Y_{j0}^T, n_j),$$
$$\sigma^2(U_{j1}^T) = \sigma^2(Y_{j1}^T) + k^2\sigma^2(n_j) - 2k\,\sigma(Y_{j1}^T, n_j),$$
$$\sigma(U_{j0}^T, U_{j1}^T) = \sigma(Y_{j0}^T, Y_{j1}^T) - k\,\sigma(Y_{j0}^T, n_j) - k\,\sigma(Y_{j1}^T, n_j) + k^2\sigma^2(n_j).$$
Substituting these expressions into equation 32,
$$v\,V(\hat{\Delta}_{R1}) = c\left[\sigma^2(Y_{j0}^T) + k^2\sigma^2(n_j) - 2k\,\sigma(Y_{j0}^T, n_j)\right] + t\left[\sigma^2(Y_{j1}^T) + k^2\sigma^2(n_j) - 2k\,\sigma(Y_{j1}^T, n_j)\right] + 2\left[\sigma(Y_{j0}^T, Y_{j1}^T) - k\,\sigma(Y_{j0}^T, n_j) - k\,\sigma(Y_{j1}^T, n_j) + k^2\sigma^2(n_j)\right].$$
Setting the derivative with respect to k equal to zero and solving (the second derivative, $2(c + t + 2)\sigma^2(n_j)$, is positive) gives
$$\left[\frac{M-m_c}{m_c} + \frac{M-m_t}{m_t} + \frac{m_c}{m_c} + \frac{m_t}{m_t}\right]k_{optim*}\,\sigma^2(n_j) = \left[\frac{M-m_c}{m_c} + \frac{m_c}{m_c}\right]\sigma(Y_{j0}^T, n_j) + \left[\frac{M-m_t}{m_t} + \frac{m_t}{m_t}\right]\sigma(Y_{j1}^T, n_j)$$
$$\left[\frac{M}{m_c} + \frac{M}{m_t}\right]k_{optim*}\,\sigma^2(n_j) = \frac{M}{m_c}\sigma(Y_{j0}^T, n_j) + \frac{M}{m_t}\sigma(Y_{j1}^T, n_j)$$
$$k_{optim*} = \left[\frac{1}{m_c} + \frac{1}{m_t}\right]^{-1}\left[\frac{1}{m_c}\frac{\sigma(Y_{j0}^T, n_j)}{\sigma^2(n_j)} + \frac{1}{m_t}\frac{\sigma(Y_{j1}^T, n_j)}{\sigma^2(n_j)}\right]$$
$$k_{optim*} = \left[\frac{1}{m_c} + \frac{1}{m_t}\right]^{-1}\left[\frac{1}{m_c}k_{optim_c} + \frac{1}{m_t}k_{optim_t}\right]$$
$$k_{optim*} = \frac{m_t}{M}k_{optim_c} + \frac{m_c}{M}k_{optim_t}.$$
The Des Raj estimator will be more efficient than the HT estimator when
$$k^2 < 2k\left[\frac{m_t}{M}k_{optim_c} + \frac{m_c}{M}k_{optim_t}\right]$$
$$k^2 < 2k\cdot k_{optim*}.$$
References
Angrist, J. D. and J. Pischke (2009) Mostly Harmless Econometrics. Princeton: Princeton
University Press.
Aronow, P. M., D. P. Green and D. K. K. Lee (2014) “Sharp Bounds on the Variance in
Randomized Experiments,” Annals of Statistics, 42(3):850–871.
Bates, D. and M. Maechler (2010) lme4: Linear mixed-effects models using S4 classes.
R package, version 0.999375-37.
Brewer, K. R. W. (1979) “A Class of Robust Sampling Designs for Large-Scale Surveys,” Journal
of the American Statistical Association, 74:911–915.
Chaudhuri, A. and H. Stenger (2005) Survey Sampling. Boca Raton: Chapman and Hall.
Cochran, W. G. (1977) Sampling Techniques, 3rd ed. New York: John Wiley.
Des Raj. (1965) “On A Method of Using Multi-Auxiliary Information in Sample Surveys,” Journal
of The American Statistical Association, 60:270–277.
Ding, P. (2014) “A Paradox from Randomization-Based Causal Inference,” arXiv preprint
arXiv:1402.0142.
Donner, A. and N. Klar (2000) Design and Analysis of Cluster Randomization Trials in Health
Research. New York: Oxford Univ. Press.
Freedman, D. A. (2006) “On the So-Called ‘Huber Sandwich Estimator’ and ‘Robust’ Standard
Errors,” American Statistician, 60:299–302.
Freedman, D. A. (2008a) “On Regression Adjustments to Experimental Data,” Advances in
Applied Mathematics, 40:180–193.
Freedman, D. A. (2008b) “On Regression Adjustments in Experiments with Several Treat-
ments.” Annals of Applied Statistics, 2:176–196.
Freedman, D. A., R. Pisani and R. A. Purves (1998) Statistics, 3rd ed. New York: W. W. Norton,
Inc.
Green, D. P. and L. Vavreck (2008) “Analysis of Cluster-Randomized Experiments: A Comparison
of Alternative Estimation Approaches,” Political Analysis, 16:138–152.
Hansen, B. and J. Bowers (2008) “Covariate Balance in Simple, Stratified and Clustered
Comparative Studies,” Statistical Science, 23:219–236.
Hansen, B. and J. Bowers (2009) “Attributing Effects to a Cluster-Randomized Get-Out-the-Vote
Campaign,” Journal of the American Statistical Association, 104:873–885.
Hartley, H. O. and A. Ross (1954) “Unbiased Ratio Estimators,” Nature, 174:270.
Hoffman, E. B., P. K. Sen and C. R. Weinberg (2001) “Within-Cluster Resampling,” Biometrika,
88: 1121–1134.
Horvitz, D. G. and D. J. Thompson (1952) “A Generalization of Sampling Without Replacement
From a Finite Universe,” Journal of the American Statistical Association, 47:663–684.
Humphreys, M. (2009) Bounds on Least Squares Estimates of Causal Effects in the Presence of Heterogeneous Assignment Probabilities. Working paper. Available at: http://www.columbia.edu/~mh2245/papers1/monotonicity4.pdf.
Imai, K., G. King and C. Nall (2009) “The Essential Role of Pair Matching in Cluster-Randomized
Experiments, with Application to the Mexican Universal Health Insurance Evaluation,”
Statistical Science, 24:29–53.
King, G. and M. Roberts (2014) “How Robust Standard Errors Expose Methodological Problems
They Do Not Fix, and What to Do About It,” Political Analysis, 1–12.
Lachin, J. M. (1988) “Properties of Simple Randomization in Clinical Trials,” Controlled Clinical
Trials, 9(4):312–326.
Lin, W. (2013) “Agnostic Notes on Regression Adjustments to Experimental Data: Reexamining
Freedman’s Critique,” Annals of Applied Statistics, 7(1):295–318.
Middleton, J. A. (2008) “Bias of the Regression Estimator for Experiments Using Clustered
Random Assignment,” Statistics and Probability Letters, 78:2654–2659.
Miratrix, L., J. Sekhon and B. Yu (2013) “Adjusting Treatment Effect Estimates by Post-
Stratification in Randomized Experiments,” Journal of the Royal Statistical Society. Series B
(Methodological), 75(2):369–396.
Neyman, J. (1923) “On the Application of Probability Theory to Agricultural Experiments: Essay
on Principles, Section 9,” Statistical Science, 5:465–480. (Translated in 1990).
Neyman, J. (1934) “On the Two Different Aspects of the Representative Method: The Method of
Stratified Sampling and the Method of Purposive Selection,” Journal of the Royal Statistical
Society, 97(4):558–625.
R Development Core Team. (2010) R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0. Version
2.12.0.
Rosenbaum, P. R. (2002) Observational Studies, 2nd ed. New York: Springer.
Rubin, D. (1974) “Estimating Causal Effects of Treatments in Randomized and Nonrandomized
Studies,” Journal of Educational Psychology, 66:688–701.
Rubin, D. B. (1978) “Bayesian Inference for Causal Effects: The Role of Randomization,” The
Annals of Statistics, 6:34–58.
Rubin, D. B. (2005) “Causal Inference Using Potential Outcomes: Design, Modeling, Decisions,”
Journal of the American Statistical Association, 100:322–331.
Samii, C. and P. M. Aronow (2012) “On Equivalencies Between Design-Based and Regression-
Based Variance Estimators for Randomized Experiments,” Statistics and Probability
Letters, 82:365–370.
Sarndal, C.-E. (1978) “Design-Based and Model-Based Inference in Survey Sampling,”
Scandinavian Journal of Statistics, 5(1):27–52.
Williams, W. H. (1961) “Generating Unbiased Ratio and Regression Estimators,” Biometrics,
17:267–274.