Statistics, Politics, and Policy 2015; aop

Joel A. Middleton and Peter M. Aronow*


Unbiased Estimation of the Average
Treatment Effect in Cluster-Randomized
Experiments
DOI 10.1515/spp-2013-0002

Abstract: Many estimators of the average treatment effect, including the
difference-in-means, may be biased when clusters of units are allocated to
treatment. This bias remains even when the number of units within each cluster
grows asymptotically large. In this paper, we propose simple, unbiased, location-
invariant, and covariate-adjusted estimators of the average treatment effect in
experiments with random allocation of clusters, along with associated variance
estimators. We then analyze a cluster-randomized field experiment on voter
mobilization in the US, demonstrating that the proposed estimators have preci-
sion that is comparable, if not superior, to that of existing, biased estimators of
the average treatment effect.

1 Introduction
In recent years, researchers have paid increased attention to the properties
of treatment effect estimators for randomized experiments under the design-
based model (see, e.g. Freedman 2008a,b). Under the design-based model
(Neyman 1923, 1934; Sarndal 1978), potential outcomes are fixed and the only
source of stochasticity lies in the random administration of a treatment to a
finite population. Importantly, Freedman (2008a) demonstrated that, under
such a model, regression adjustment is generally biased (though consistent)
and may reduce efficiency. Researchers have since derived methods that
do not suffer from these problems (Lin 2013; Miratrix et al. 2013) and assessed
the operating characteristics of common model-based estimators (Humphreys
2009; Samii and Aronow 2012) under the design-based paradigm. However, this

*Corresponding author: Peter M. Aronow, Department of Political Science, Yale University,
New Haven, CT, USA, e-mail: [email protected]
Joel A. Middleton: Department of Political Science, University of California Berkeley,
Berkeley, CA, USA

Brought to you by | University of California - Berkeley


Authenticated
Download Date | 11/3/15 12:13 AM

research has largely focused on experiments wherein treatment is randomized
at the unit level.
Although extensively studied under the model-based paradigm (see, e.g.
Donner and Klar 2000), comparatively little attention has been devoted to
designs with random allocation of clusters under the design-based paradigm.
The aforementioned estimators are not directly applicable to cluster-randomized
designs. Even seemingly design-based estimators – such as the difference-in-
means estimator – may suffer from bias even when all units have an equal prob-
ability of treatment assignment. Importantly, Middleton (2008) proves the bias of
the difference-in-means estimator (and inconsistency under asymptotic scalings
that entail a fixed number of clusters) for randomized experiments with unequal
cluster sizes. Similarly, Imai et al. (2009) recognize the bias of the difference-in-
means estimator and propose solutions that require altering the design of the
experiment. The authors recommend pair matching on observables in order to
reduce the amount of bias and variance that may result from a standard anal-
ysis of cluster-randomized experiments. The closest analogue to our proposed
approach, however, may be found in Hansen and Bowers (2009), which proposes
similar – though not necessarily unbiased – design-based estimators for cluster-
randomized experiments.1
Bias is not the only statistical property that researchers are interested in.
In choosing an estimator, researchers often consider efficiency (typically mean
square error) to be of paramount importance. However, as we show below, the
bias of estimators such as the difference-in-means estimator may not diminish
with increasing study size under common designs. Estimators that are asymp-
totically biased are guaranteed to be relatively inefficient for a sufficiently large
sample size, and we provide an empirical example where bias is critical in under-
mining the relative efficiency of common estimators. In sum, bias cannot always
be ignored even when efficiency is a primary concern.
In this paper, we propose a simple and unbiased design-based estima-
tor for the average treatment effect (ATE) for cluster-randomized experiments.2
Drawing from classical sampling theory, we then propose a natural extension to
improve efficiency and confer the property of location invariance: the Des Raj

1 Hansen and Bowers (2008) also derives design-based balance tests for cluster-randomized
experiments.
2 As in Hansen and Bowers (2008), we consider estimation of the effect of assignment to
treatment, which we refer to simply as the ATE throughout. This quantity is also termed the
intention to treat effect. Our approach circumvents the issue of compliance, but our estimators
might be divided by suitable compliance rate estimates to estimate average treatment on treated
effects, though this may introduce bias from ratio estimation (Hartley and Ross 1954).

(1965) difference estimator, which remains unbiased even in small samples. We
also derive two different variance estimators. We then examine a field experiment
designed to assess the effect of voter mobilization in a US presidential election,
and use randomization inference to assess the bias and precision of a number of
estimators under two different null hypotheses. Whereas many common treat-
ment effect estimators, including the difference-in-means, ordinary least squares
regression and random effects regression fail to unbiasedly recover the ATE, the
proposed estimators are unbiased and are comparable (if not superior) in terms
of efficiency.

2 Potential Outcomes
The foundation of our design-based approach is the model of potential out-
comes introduced by Neyman (1923) and popularized by Rubin (1974). Define
treatment indicator Di∈{0, 1} for units i∈1, 2, …, N such that Di = 1 when unit i
receives the treatment and Di = 0 otherwise. Assuming that the stable unit treat-
ment value assumption (Rubin 1978, 2005) holds, let Y1i be the potential outcome
if unit i is exposed to the treatment, and let Y0i be the potential outcome if unit i
is not exposed to the treatment. The observed experimental outcome Yi may be
expressed as a function of the potential outcomes and the assigned treatment:
Yi = DiY1i+(1–Di)Y0i. The causal effect of the treatment on unit i, τi, is defined as
the difference between the two potential outcomes for unit i: τi≡Y1i–Y0i. And, by
definition the ATE, denoted Δ, is the average value of τi for all units i. Under this
model, the only random component of the experiment is the allocation of units to
treatment and control groups.
Since τi≡Y1i–Y0i, the ATE is equivalently

$$\Delta = \frac{\sum_{i=1}^{N}(Y_{1i}-Y_{0i})}{N} = \frac{1}{N}\left[\sum_{i=1}^{N}Y_{1i} - \sum_{i=1}^{N}Y_{0i}\right] = \frac{1}{N}\left[Y_1^T - Y_0^T\right],$$

where Y1T is the sum of potential outcomes if in the treatment condition and Y0T
is the sum of potential outcomes if in the control condition. An estimator of Δ can
be constructed using estimators of Y0T and Y1T:

$$\hat{\Delta} = \frac{1}{N}\left[\widehat{Y_1^T} - \widehat{Y_0^T}\right], \qquad (1)$$

where $\widehat{Y_1^T}$ is the estimated sum of potential outcomes under treatment and $\widehat{Y_0^T}$ is
the estimated sum of potential outcomes under control.

Formally, the bias of an estimator is the difference between the expected
value of the estimator (over all randomizations) and the parameter of interest. If
the estimators $\widehat{Y_0^T}$ and $\widehat{Y_1^T}$ are unbiased, the corresponding estimator of Δ is also
unbiased since

$$E[\hat{\Delta}] = \frac{1}{N}\left[E\left[\widehat{Y_1^T}\right] - E\left[\widehat{Y_0^T}\right]\right] = \frac{1}{N}\left[Y_1^T - Y_0^T\right] = \Delta.$$

3 Properties of the Difference-In-Means Estimator


In this section, we examine the properties of the difference-in-means estimator.
We start with an examination of the difference-in-means for three reasons. First,
the difference-in-means is one of the most commonly used estimators of the ATE in
randomized experiments. Second, insights about the difference-in-means will
help us understand the bias of other estimators. Third, the derivations will help
us to identify conditions under which common estimators are not consistent and,
hence, asymptotically inefficient.
Before discussing random allocation of clusters, we begin with a short deri-
vation of the unbiasedness of the difference-in-means estimator under random
allocation of individual units.3 We then articulate the source of the bias for the
difference-in-means estimator when applied to a cluster randomized experiment.
Finally, we examine the asymptotic properties of the estimator.

3.1 Unbiased Estimation of Treatment Effects Under Random Allocation of Units

Define N and nt as integers such that 0 < nt < N. Random allocation of treatment
implies that a fixed number of units, nt, are randomly assigned to treatment (Di = 1)
and the remaining nc = N–nt are in control (Di = 0). Define I0 as the set of all i such
that Di = 0 and I1 as the set of all i such that Di = 1.
To derive an unbiased estimator of the ATE under random allocation, we can
first posit estimators of Y0T and Y1T. Define an estimator of Y0T,

$$\widehat{Y_{0,S}^T} = \frac{N}{n_c}\sum_{i\in I_0} Y_{0i} = \frac{N}{n_c}\sum_{i\in I_0} Y_i \qquad (2)$$


3 Throughout, we use the term random allocation to refer to the assignment of a fixed number
of units (or clusters) to treatment and a fixed number to control, following the terminology of
Lachin (1988).

and, similarly, define an estimator of Y1T,

$$\widehat{Y_{1,S}^T} = \frac{N}{n_t}\sum_{i\in I_1} Y_{1i} = \frac{N}{n_t}\sum_{i\in I_1} Y_i. \qquad (3)$$


It is easy to show that the estimators in equations 2 and 3 are unbiased under the
random allocation rule:

$$E\left[\widehat{Y_{0,S}^T}\right] = E\left[\frac{N}{n_c}\sum_{i\in I_0} Y_i\right] = N\cdot\bar{Y}_0 = Y_0^T, \qquad (4)$$

where $\bar{Y}_0$ is the mean value of Y0i over all i units (and is not an observable
quantity). A proof for the unbiasedness of $\widehat{Y_{1,S}^T}$ directly follows the form of equation 4.
From equation 1, it follows that we may construct an unbiased estimator of Δ:

$$\hat{\Delta}_S = \frac{1}{N}\left[\widehat{Y_{1,S}^T} - \widehat{Y_{0,S}^T}\right] = \frac{\sum_{i\in I_1} Y_i}{n_t} - \frac{\sum_{i\in I_0} Y_i}{n_c}, \qquad (5)$$

where $\sum_{i\in I_1} Y_i / n_t$ is the mean value of Yi for all units assigned to treatment and
$\sum_{i\in I_0} Y_i / n_c$ is the mean value of Yi for all units assigned to control. $\hat{\Delta}_S$ is known
as the difference-in-means estimator.
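
The unbiasedness claim can be checked directly on a small example. The following is a minimal sketch, assuming a hypothetical population of five units with made-up potential outcomes (the values `y0`, `y1` and the choice `n_t = 2` are illustrative and do not come from the paper). It enumerates every equally likely allocation and averages the estimator in equation 5 over them.

```python
from itertools import combinations
from fractions import Fraction

# Hypothetical potential outcomes for N = 5 units (illustrative values only).
y0 = [1, 4, 2, 7, 3]
y1 = [3, 5, 2, 9, 8]
N, n_t = len(y0), 2

def diff_in_means(treated):
    """Equation 5: treated mean minus control mean for one allocation."""
    control = [i for i in range(N) if i not in treated]
    mean_t = Fraction(sum(y1[i] for i in treated), len(treated))
    mean_c = Fraction(sum(y0[i] for i in control), len(control))
    return mean_t - mean_c

# Under random allocation, every set of n_t treated units is equally likely.
assignments = list(combinations(range(N), n_t))
expected = sum(diff_in_means(set(a)) for a in assignments) / len(assignments)
ate = Fraction(sum(y1) - sum(y0), N)

print(expected, ate)
```

Exact rational arithmetic is used so that the expectation over allocations can be compared with the ATE without rounding error.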

3.2 Properties of the Difference-In-Means Estimator Under Random Allocation of Clusters

Under random allocation of clusters, the difference-in-means estimator is no
longer generally unbiased, despite all individuals having the same probability
of entering into each treatment condition. The unit of randomization is no
longer the individual: instead, clusters (or groups of individuals) are assigned
to treatment. While random allocation of units may yield more efficient designs
in principle, a number of settings may dictate clustered designs in practice.
Some examples include when unit randomization is infeasible, when outcome
measures are only available at the level of the cluster, or when unit interfer-
ence (e.g. treatment synergies or spillover effects) is an important aspect of
treatment.
In settings where unit randomization is infeasible or undesirable, the
researcher rarely has control over the cluster size (e.g. household, village). As
a consequence, bias can arise in estimation. We begin this section by deriving
the bias associated with the difference-in-means estimator. As our derivation will
show, the bias arises whenever outcomes are related to cluster size.
Formally, suppose each cluster j = 1, 2, …, M is assigned to either treatment or
control. Define mt and M as (fixed) integers such that 0 < mt < M. Now mt clusters
are randomly assigned to treatment (Dj = 1) and the remaining mc = M–mt clusters
are assigned to control (Dj = 0). Define J0 as the set of all j such that Dj = 0 and J1 as
the set of all j such that Dj = 1. Let Y0ij be the response of the ith individual in the jth
cluster if the cluster is assigned to control and let Y1ij be the response of the ith indi-
vidual in the jth cluster if the cluster is assigned to treatment. Let nj be the number
of individuals in the jth cluster. Note that all individuals have the same probability
mt/M of entering treatment.
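The estimators in equations 2 and 3 can be rewritten as
$\widehat{Y_{1,S}^T} = N \sum_{j\in J_1}\sum_{i=1}^{n_j} Y_{ij} / \sum_{j\in J_1} n_j$ and $\widehat{Y_{0,S}^T} = N \sum_{j\in J_0}\sum_{i=1}^{n_j} Y_{ij} / \sum_{j\in J_0} n_j$.
The difference-in-means estimator in equation 5 can therefore be rewritten

$$\hat{\Delta}_S = \frac{1}{N}\left[\widehat{Y_{1,S}^T} - \widehat{Y_{0,S}^T}\right] = \frac{\sum_{j\in J_1}\sum_{i=1}^{n_j} Y_{ij}}{\sum_{j\in J_1} n_j} - \frac{\sum_{j\in J_0}\sum_{i=1}^{n_j} Y_{ij}}{\sum_{j\in J_0} n_j}. \qquad (6)$$
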


The double summations in the numerators make explicit that summation takes
place across individuals in different clusters. In the denominators, the summa-
tions operate over clusters. While the estimator remains unchanged from equa-
tion 5, expressing it this way reveals a fundamental problem with its application.
The trouble with using the estimator in equation 6 is that the quantities
$n_t = \sum_{j\in J_1} n_j$ and $n_c = \sum_{j\in J_0} n_j$ are no longer fixed numbers as they were in
equation 5, but are now random variables. The total number of individuals in
treatment and control now depends on the size of the particular clusters assigned to
the experimental groups. To understand why this dependence is problematic, we
need only examine equation 4: the term N/nc may be moved to the outside of the
expectation operator because it is a fixed constant. When nc is a random variable,
calculating the expectation is more involved. In general, for the ratio u/v of two
random variables u and v,

$$E\left[\frac{u}{v}\right] = \frac{1}{E[v]}\left[E[u] - \mathrm{Cov}\left(\frac{u}{v}, v\right)\right] \qquad (7)$$

if v > 0 (Hartley and Ross 1954). Because the difference-in-means estimator is
the difference between two ratios of random variables we can use the result in
equation 7 to derive the bias of the difference-in-means estimator in equation 6.
Following Middleton (2008),

$$E[\hat{\Delta}_S] = \frac{1}{N}\left[Y_1^T - Y_0^T\right] - \frac{M}{N}\left[\frac{1}{m_t}\mathrm{Cov}\left(\frac{\sum_{j\in J_1}\sum_{i=1}^{n_j} Y_{1ij}}{\sum_{j\in J_1} n_j},\ \sum_{j\in J_1} n_j\right) - \frac{1}{m_c}\mathrm{Cov}\left(\frac{\sum_{j\in J_0}\sum_{i=1}^{n_j} Y_{0ij}}{\sum_{j\in J_0} n_j},\ \sum_{j\in J_0} n_j\right)\right].$$

It follows that the bias, $E[\hat{\Delta}_S] - \Delta$, is

$$-\frac{M}{N}\left[\frac{1}{m_t}\mathrm{Cov}\left(\frac{\sum_{j\in J_1}\sum_{i=1}^{n_j} Y_{1ij}}{\sum_{j\in J_1} n_j},\ \sum_{j\in J_1} n_j\right) - \frac{1}{m_c}\mathrm{Cov}\left(\frac{\sum_{j\in J_0}\sum_{i=1}^{n_j} Y_{0ij}}{\sum_{j\in J_0} n_j},\ \sum_{j\in J_0} n_j\right)\right]. \qquad (8)$$

Inspection of this term reveals that, if the size of the cluster is correlated with the
potential outcomes in the cluster, the difference-in-means estimator is biased.
Moreover, the presence of the terms 1/mt and 1/mc shows that the magnitude (and
even the direction) of the bias can depend on the relative number of clusters allo-
cated to treatment and control.
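
The bias expression in equation 8 can be verified numerically. The sketch below assumes three hypothetical clusters in which larger clusters have larger outcomes and larger effects (the data, `m_t = 1`, and all helper names are illustrative assumptions, not taken from the paper). It enumerates every allocation of clusters, computes the bias of the difference-in-means directly, and compares it with the covariance formula.

```python
from itertools import combinations
from fractions import Fraction

# Hypothetical clusters: each is a list of (Y0, Y1) pairs, one per individual.
clusters = [
    [(0, 1)],                          # n_j = 1
    [(1, 2), (1, 2)],                  # n_j = 2
    [(3, 6), (3, 6), (3, 6), (3, 6)],  # n_j = 4, larger outcomes and effects
]
M, m_t = len(clusters), 1
m_c = M - m_t
N = sum(len(c) for c in clusters)

def group_stats(js, arm):
    """Mean outcome (a ratio) and number of individuals in the clusters js."""
    total = sum(unit[arm] for j in js for unit in clusters[j])
    size = sum(len(clusters[j]) for j in js)
    return Fraction(total, size), size

def mean(xs):
    return Fraction(sum(xs)) / len(xs)

def cov(xs, ys):
    # Covariance over the randomization distribution (equally likely allocations).
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

rows = []
for treated in combinations(range(M), m_t):
    control = [j for j in range(M) if j not in treated]
    r1, n1 = group_stats(treated, 1)
    r0, n0 = group_stats(control, 0)
    rows.append((r1, n1, r0, n0))

r1s, n1s, r0s, n0s = zip(*rows)
ate = Fraction(sum(u[1] - u[0] for c in clusters for u in c), N)
bias_enumerated = mean([a - b for a, b in zip(r1s, r0s)]) - ate
bias_formula = -Fraction(M, N) * (cov(r1s, n1s) / m_t - cov(r0s, n0s) / m_c)

print(bias_enumerated, bias_formula)
```

In this constructed example the two quantities agree exactly and are nonzero, because cluster size covaries with the potential outcomes.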
In some special cases, there will be no bias, such as when the cluster size
does not vary or when there is no covariance between cluster size and outcomes.
Nonetheless, in applied research we might expect cluster size to be related to
outcomes. For example, precinct size may be related to the characteristics of the
precinct, such as partisan composition and voting rates. In Section 6 we show
an example where cluster size is significantly related to treatment effect. Such
an association between cluster size and treatment effect has been referred to as
nonignorable cluster size (e.g. Hoffman et al. 2001).

3.3 Asymptotic Properties of the Difference-In-Means Estimator With Random Allocation of Clusters

In this section, we demonstrate two important facts about the difference-in-means
estimator. First, in a proof adapted from Middleton (2008), we will show
that the difference-in-means estimator is consistent as the number of clusters,
M, grows. Second, we demonstrate that the difference-in-means estimator is not
necessarily consistent as N grows.
Consistency of a statistic under a finite population is defined given a sequence
of h finite populations H where Nh < Nh+1, nth < nth+1 and nch < nch+1 for h = 1, 2, 3, …. The
estimator $\hat{\Delta}_S$ is said to be a consistent estimator of Δ if $\hat{\Delta}_S \xrightarrow{p} \Delta$ (converges in
probability) as h→∞.

To show that the difference-in-means estimator is consistent with large M, we
follow Brewer (1979) in assuming that as h→∞, the finite population H increases
as follows: (1) the original population of M clusters is exactly copied (h–1) times;
(2) from each of the h copies, mt clusters are allocated to treatment (such that
0 < mt < M) and the remaining mc = M–mt are allocated to control; (3) the h subsets
are collected in a single population of hM clusters, with hmt clusters in treatment
and hmc = hM–hmt in control; and (4) $\hat{\Delta}_S$ is defined as the difference-in-means
estimator as in equation 5, only now summation takes place across all hmc and
hmt clusters. Figure 1, Panel A illustrates this sort of asymptotic growth.
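A less restrictive set of assumptions is possible, but this setup is convenient
because H is easy to visualize and moment assumptions are built-in. We
express the estimator as

$$\hat{\Delta}_S = \frac{\sum_{j\in J_1}\sum_{i=1}^{n_j} Y_{ij}}{\sum_{j\in J_1} n_j} - \frac{\sum_{j\in J_0}\sum_{i=1}^{n_j} Y_{ij}}{\sum_{j\in J_0} n_j},$$

where in this case J1 is defined as the set of hmt treatment clusters and J0 is defined
as the set of hmc control clusters. As h→∞, by the weak law of large numbers,

$$\frac{1}{h}\sum_{j\in J_1}\sum_{i=1}^{n_j} Y_{ij} \xrightarrow{p} Y_1^T\cdot\frac{hm_t}{hM}, \qquad \frac{1}{h}\sum_{j\in J_0}\sum_{i=1}^{n_j} Y_{ij} \xrightarrow{p} Y_0^T\cdot\frac{hm_c}{hM},$$

$$\frac{1}{h}\sum_{j\in J_1} n_j \xrightarrow{p} N\cdot\frac{hm_t}{hM} \qquad \text{and} \qquad \frac{1}{h}\sum_{j\in J_0} n_j \xrightarrow{p} N\cdot\frac{hm_c}{hM}.$$

By Slutsky’s theorem,

$$\hat{\Delta}_S \xrightarrow{p} \frac{Y_1^T\cdot\frac{hm_t}{hM}}{N\cdot\frac{hm_t}{hM}} - \frac{Y_0^T\cdot\frac{hm_c}{hM}}{N\cdot\frac{hm_c}{hM}} = \frac{Y_1^T - Y_0^T}{N}. \qquad (9)$$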

Figure 1: Two versions of Brewer’s simple notion of asymptotic growth. The population is
simply copied h–1 times. In Panel A, copies of the clusters are made and the number of clusters
grows. In Panel B, the number of clusters is fixed and the individuals within are copied. An
estimator is consistent under asymptotic growth if it converges to the parameter as h→∞.

This proves that the difference-in-means estimator is consistent as the
number of clusters grows.
In the case where the size (rather than the number) of the clusters grows as
h→∞, the finite population H increases as follows: (1) the original population of
M clusters is exactly copied (h–1) times, but this time the h copies of a cluster are
considered part of one supercluster; (2) mt of the clusters are allocated to
treatment (such that 0 < mt < M) and the remaining mc = M–mt are allocated to control;
and (3) $\hat{\Delta}_S$ is defined as the difference-in-means estimator as in equation 9, but
now the inner summation takes place across all hnj units in each cluster. Figure 1,
Panel B illustrates this sort of asymptotic growth.
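To show that the difference-in-means estimator is not necessarily consistent
simply with large N, we express the estimator as

$$\hat{\Delta}_S = \frac{\sum_{j\in J_1}\sum_{i=1}^{hn_j} Y_{ij}}{\sum_{j\in J_1} hn_j} - \frac{\sum_{j\in J_0}\sum_{i=1}^{hn_j} Y_{ij}}{\sum_{j\in J_0} hn_j} = \frac{\sum_{j\in J_1}\sum_{i=1}^{n_j} Y_{ij}}{\sum_{j\in J_1} n_j} - \frac{\sum_{j\in J_0}\sum_{i=1}^{n_j} Y_{ij}}{\sum_{j\in J_0} n_j}. \qquad (10)$$
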


As h→∞, the estimate remains unchanged with large N if the number of clus-
ters is fixed. This proves that the bias articulated in equation 8 is unmitigated for
increasingly large clusters.
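
The contrast between the two growth schemes in Figure 1 can be illustrated by exact enumeration. This is a minimal sketch under assumed data: the three hypothetical clusters in `base`, the choice `m_t = 1`, and the function names are illustrative and not from the paper. `bias_more_clusters` follows Panel A (h copies of the clusters, with mt treated within each copy) and `bias_bigger_clusters` follows Panel B (each cluster holds h copies of its individuals).

```python
from itertools import combinations, product
from fractions import Fraction

# Hypothetical (Y0, Y1) pairs per individual, grouped by cluster.
base = [[(0, 1)], [(1, 2), (1, 2)], [(3, 6)] * 4]
m_t = 1  # treated clusters per copy of the population
M = len(base)
one_copy = list(combinations(range(M), m_t))

def dim(clusters, treated):
    """Difference-in-means (equation 6) for one set of treated cluster indices."""
    t = [u[1] for j in treated for u in clusters[j]]
    c = [u[0] for j in range(len(clusters)) if j not in treated for u in clusters[j]]
    return Fraction(sum(t), len(t)) - Fraction(sum(c), len(c))

def bias(clusters, assignments):
    """Exact bias: average estimate over the listed allocations minus the ATE."""
    units = [u for c in clusters for u in c]
    ate = Fraction(sum(y1 - y0 for y0, y1 in units), len(units))
    ests = [dim(clusters, set(a)) for a in assignments]
    return sum(ests) / len(ests) - ate

def bias_more_clusters(h):
    """Panel A: h copies of the clusters, m_t treated within each copy."""
    assignments = [
        [j + k * M for k, picks in enumerate(choice) for j in picks]
        for choice in product(one_copy, repeat=h)
    ]
    return bias(base * h, assignments)

def bias_bigger_clusters(h):
    """Panel B: M clusters, each holding h copies of its individuals."""
    return bias([c * h for c in base], one_copy)

print(bias_more_clusters(1), bias_more_clusters(2))
print(bias_bigger_clusters(1), bias_bigger_clusters(4))
```

In this constructed example the bias shrinks in magnitude when clusters are added but is identical for every h when only the cluster sizes grow, in line with equation 10.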

3.4 Discussion

The results of this section highlight the fact that, for some designs, bias may not
be mitigated with increased units. For example, imagine a study of the effect of
state-level policy on public opinion. Increasing the number of surveys conducted
does nothing to decrease bias in that case since the number of states is fixed.
More troubling, the above results also suggest that the bias of an estimator
that averages together a number of biased sub-estimates will not diminish with
increasing number of sub-estimates. Consider a block randomized design where
clusters (e.g. houses, clinics, precincts) are randomized; if a fixed effects regres-
sion is used to “control” for groups, then adding more units by increasing the
number of blocks (strata) does not diminish the bias. This is because the fixed
effects estimator is simply a weighted average of group-level difference-in-means
estimates (cf. Angrist and Pischke 2009, Chapter 5).4

4 However, as the formulas suggest, a way to mitigate such bias would be to block units based
on cluster size as suggested by Imai et al. (2009).

4 Unbiased Estimation of Treatment Effects Under Random Allocation of Clusters
By understanding bias as a problem fundamental to ratio estimation, we can
circumvent the bias with an alternative design-based estimator. Notationally, it
helps to clarify the task if we consider cluster totals – i.e. the sum of the responses
of the individuals in each cluster. Define $Y_{0j}^T = \sum_{i=1}^{n_j} Y_{0ij}$ as the sum of responses of
the individuals in the jth cluster if assigned to control and $Y_{1j}^T = \sum_{i=1}^{n_j} Y_{1ij}$ as the sum
of responses of the individuals in the jth cluster if assigned to treatment. For each
individual, only one of the two possible responses, Y0ij or Y1ij, may be observed and,
since individuals are assigned to treatment conditions in clusters, for any given
cluster, only one of the possible totals, $Y_{0j}^T$ or $Y_{1j}^T$, may be observed. The observed
cluster total for cluster j, $Y_j^T$, may be expressed as $Y_j^T = D_j Y_{1j}^T + (1 - D_j) Y_{0j}^T$.
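Using this new notation, the ATE may be expressed as

$$\Delta = \frac{\sum_{j=1}^{M}\sum_{i=1}^{n_j}(Y_{1ij} - Y_{0ij})}{\sum_{j=1}^{M}\sum_{i=1}^{n_j} 1} = \frac{\sum_{j=1}^{M} Y_{1j}^T - \sum_{j=1}^{M} Y_{0j}^T}{\sum_{j=1}^{M} n_j} = \frac{1}{N}\left[Y_1^T - Y_0^T\right].$$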

We can again construct an unbiased estimator for Δ with unbiased estimators of
Y0T and Y1T. Following the logic of equation 4,

$$\widehat{Y_{0,HT}^T} = \frac{M}{m_c}\sum_{j\in J_0} Y_{0j}^T = \frac{M}{m_c}\sum_{j\in J_0} Y_j^T. \qquad (11)$$



One can think of this estimator as estimating the average of the cluster totals
(among control clusters) and then multiplying by the number of clusters M to get
the estimated total for all units in the study. Likewise,

$$\widehat{Y_{1,HT}^T} = \frac{M}{m_t}\sum_{j\in J_1} Y_{1j}^T = \frac{M}{m_t}\sum_{j\in J_1} Y_j^T. \qquad (12)$$

Following the same steps as equation 4, it can be shown that $\widehat{Y_{0,HT}^T}$ and $\widehat{Y_{1,HT}^T}$
are unbiased estimators of Y0T and Y1T, respectively. The terms M/mt and M/mc
are fixed; when taking the expectations of equations 11 and 12, they can be moved
outside the expectation operator. Note that the random variables at the root of
the ratio estimation problem above, nt and nc, do not appear in either estimator.
From these two unbiased estimators, we may therefore construct an estimator of
the ATE:

$$\hat{\Delta}_{HT} = \frac{1}{N}\left[\widehat{Y_{1,HT}^T} - \widehat{Y_{0,HT}^T}\right] = \frac{M}{N}\left[\frac{1}{m_t}\sum_{j\in J_1} Y_j^T - \frac{1}{m_c}\sum_{j\in J_0} Y_j^T\right]. \qquad (13)$$

We refer to this estimator as the Horvitz-Thompson (HT) estimator because it is
a special case of the well-known estimator from sampling theory (Horvitz and
Thompson 1952; Chaudhuri and Stenger 2005).
The HT estimator can be criticized on two grounds. First, as Imai et al. (2009)
suggest, this estimator is not location invariant. We offer a proof of the
non-invariance of the HT estimator in Section 4.1. Second, the HT estimator can be
highly imprecise; cluster sums tend to vary a great deal because there are more
individuals in some clusters than in others. In large clusters, totals may tend to be
large and in small clusters, totals may tend to be smaller. In Section 5.1, we will
develop an estimator that addresses both these limitations.
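
Both the unbiasedness of the HT estimator and its potential imprecision can be seen in a small enumeration. The sketch below assumes three hypothetical clusters with illustrative potential outcomes and `m_t = 1` (none of these values come from the paper). It applies equation 13 to every possible allocation of clusters.

```python
from itertools import combinations
from fractions import Fraction

# Hypothetical (Y0, Y1) pairs per individual, grouped by cluster.
clusters = [[(0, 1)], [(1, 2), (1, 2)], [(3, 6)] * 4]
M, m_t = len(clusters), 1
m_c = M - m_t
N = sum(len(c) for c in clusters)

# Potential cluster totals Y0j^T and Y1j^T.
tot0 = [sum(y0 for y0, _ in c) for c in clusters]
tot1 = [sum(y1 for _, y1 in c) for c in clusters]

def ht_estimate(treated):
    """Equation 13: (M/N) * (mean treated cluster total - mean control cluster total)."""
    control = [j for j in range(M) if j not in treated]
    mean_t = Fraction(sum(tot1[j] for j in treated), m_t)
    mean_c = Fraction(sum(tot0[j] for j in control), m_c)
    return Fraction(M, N) * (mean_t - mean_c)

assignments = list(combinations(range(M), m_t))
estimates = [ht_estimate(a) for a in assignments]
expected = sum(estimates) / len(estimates)
ate = Fraction(sum(tot1) - sum(tot0), N)

print(expected, ate)                   # average over allocations vs. the ATE
print(min(estimates), max(estimates))  # spread across allocations
```

The average over allocations equals the ATE exactly, while the individual estimates vary widely because the cluster totals differ so much in size.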

4.1 Non-Invariance of the Horvitz-Thompson estimator

To show that the estimator in equation 13 is not invariant to location shifts,
let $Y_{1ij}^*$ be a linear transformation of the treatment outcome for the ith person
in the jth cluster such that $Y_{1ij}^* \equiv b_0 + b_1\cdot Y_{1ij}$ and likewise, the control outcomes,
$Y_{0ij}^* \equiv b_0 + b_1\cdot Y_{0ij}$. Invariance to this transformation would imply that, when
analyzing the transformed data, we achieve the relationship between the old estimate
and new estimate such that

$$\hat{\Delta}_{HT}^* = b_1\cdot\hat{\Delta}_{HT}, \qquad (14)$$

i.e. the ATE estimated from linearly transformed outcomes will be equal to the
ATE estimated from non-transformed outcomes multiplied by the scaling factor
b1. In Appendix A, we demonstrate that the HT estimator is not location-invariant
because the estimate based on the transformed data will be

$$\hat{\Delta}_{HT}^* = b_0\cdot\frac{M}{N}\left[\frac{1}{m_t}\sum_{j\in J_1} n_j - \frac{1}{m_c}\sum_{j\in J_0} n_j\right] + b_1\cdot\hat{\Delta}_{HT}. \qquad (15)$$

Unless b0 = 0, the first term on the right-hand side of equation 15 does not generally
reduce to zero but instead varies across treatment assignments, so equation 15 is not
generally equivalent to equation 14 for a given randomization. Note that, while a multiplicative scale
change (e.g. transforming feet to inches) need not be a concern, a linear trans-
formation that includes a location shift (e.g. reversing a binary indicator variable
or transforming Fahrenheit to Celsius) will lead to a violation of invariance. For
any given randomization, linearly transforming the data such that the intercept
changes can yield different estimates.
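
Equation 15 can be checked on a single hypothetical randomization. In the sketch below, the cluster sizes, observed totals, treated set, and the constants `b0` and `b1` are all illustrative assumptions. The helper `ht` applies the HT contrast of equation 13 to any per-cluster quantity, so applying it to the cluster sizes gives the bracketed term in equation 15.

```python
from fractions import Fraction

sizes = [1, 2, 4]    # hypothetical cluster sizes n_j
totals = [1, 2, 12]  # hypothetical observed cluster totals Y_j^T
treated = {2}        # one hypothetical allocation (cluster index 2 treated)
M, N = len(sizes), sum(sizes)

def ht(values, treated):
    """(M/N) * (mean of values over treated clusters - mean over control clusters)."""
    t = [v for j, v in enumerate(values) if j in treated]
    c = [v for j, v in enumerate(values) if j not in treated]
    return Fraction(M, N) * (Fraction(sum(t), len(t)) - Fraction(sum(c), len(c)))

b0, b1 = 10, 2
# Transforming every individual outcome by b0 + b1 * Y turns each cluster total
# into b0 * n_j + b1 * Y_j^T.
shifted = [b0 * n + b1 * y for n, y in zip(sizes, totals)]

original = ht(totals, treated)
transformed = ht(shifted, treated)
size_term = ht(sizes, treated)

print(transformed, b1 * original, b0 * size_term + b1 * original)
```

The transformed estimate matches equation 15 but not equation 14, since the size term is nonzero for this allocation.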

4.2 Deriving Estimators of the Variance of the Horvitz-Thompson Estimator Under Random Allocation of Clusters
In our derivation of variances, we follow the general formulations of Freedman
et al. (1998), which follow from a long tradition dating from Neyman (1923). The
variance of the estimator in equation 13 is

$$V(\hat{\Delta}) = \frac{1}{N^2}\left[V\left(\widehat{Y_0^T}\right) + V\left(\widehat{Y_1^T}\right) - 2\,\mathrm{Cov}\left(\widehat{Y_0^T}, \widehat{Y_1^T}\right)\right]. \qquad (16)$$

This expression is the true, not estimated, variance. To construct an unbiased
estimator of this variance, we must have unbiased estimators of each of the
quantities in equation 16. While unbiased estimators may be constructed for $V(\widehat{Y_0^T})$
and $V(\widehat{Y_1^T})$, there does not generally exist an unbiased estimator for $\mathrm{Cov}(\widehat{Y_0^T}, \widehat{Y_1^T})$
because the joint distribution of potential outcomes is not observable.


We may, however, derive a generally conservative estimator of the variance.
First, we derive the components of the true variance from equation 16. From the
principles of finite population sampling,

$$V\left(\widehat{Y_{0,HT}^T}\right) = \frac{M^2}{m_c}\left(\frac{M - m_c}{M - 1}\right)\sigma^2(Y_{0j}^T),$$

$$V\left(\widehat{Y_{1,HT}^T}\right) = \frac{M^2}{m_t}\left(\frac{M - m_t}{M - 1}\right)\sigma^2(Y_{1j}^T),$$

and

$$\mathrm{Cov}\left(\widehat{Y_{0,HT}^T}, \widehat{Y_{1,HT}^T}\right) = -\frac{M^2}{M - 1}\,\sigma(Y_{0j}^T, Y_{1j}^T),$$

where, given features vj and wj for j∈1, …, M, finite population variance
$\sigma^2(v_j) = \frac{1}{M}\sum_{j=1}^{M}\left(v_j - \frac{1}{M}\sum_{j=1}^{M} v_j\right)^2$ and finite population covariance
$\sigma(v_j, w_j) = \frac{1}{M}\sum_{j=1}^{M}\left(v_j - \frac{1}{M}\sum_{j=1}^{M} v_j\right)\left(w_j - \frac{1}{M}\sum_{j=1}^{M} w_j\right)$.

From equation 16,

$$V(\hat{\Delta}_{HT}) = \frac{1}{N^2}\left[\frac{M^2}{m_c}\left(\frac{M - m_c}{M - 1}\right)\sigma^2(Y_{0j}^T) + \frac{M^2}{m_t}\left(\frac{M - m_t}{M - 1}\right)\sigma^2(Y_{1j}^T) + \frac{2M^2}{M - 1}\,\sigma(Y_{0j}^T, Y_{1j}^T)\right]$$

$$= \frac{M^2}{N^2}\left[\frac{M}{M - 1}\left(\frac{\sigma^2(Y_{0j}^T)}{m_c} + \frac{\sigma^2(Y_{1j}^T)}{m_t}\right) + \frac{1}{M - 1}\left[2\sigma(Y_{0j}^T, Y_{1j}^T) - \sigma^2(Y_{0j}^T) - \sigma^2(Y_{1j}^T)\right]\right]. \qquad (17)$$

Since $2\sigma(Y_{0j}^T, Y_{1j}^T) - \sigma^2(Y_{0j}^T) - \sigma^2(Y_{1j}^T) \le 0$, it follows that

$$V(\hat{\Delta}_{HT}) \le V_{apx}(\hat{\Delta}_{HT}) = \frac{M^3}{N^2(M - 1)}\left[\frac{\sigma^2(Y_{0j}^T)}{m_c} + \frac{\sigma^2(Y_{1j}^T)}{m_t}\right].$$

Substituting unbiased estimators of $\sigma^2(Y_{0j}^T)$ and $\sigma^2(Y_{1j}^T)$ (Cochran 1977, theorem
2.4), we may derive an unbiased estimator of the quantity $V_{apx}(\hat{\Delta}_{HT})$,

$$\hat{V}(\hat{\Delta}_{HT}) = \frac{M^2}{N^2}\left[\frac{\sum_{j\in J_0}\left(Y_j^T - \bar{Y}_{cj}^T\right)^2}{m_c(m_c - 1)} + \frac{\sum_{j\in J_1}\left(Y_j^T - \bar{Y}_{tj}^T\right)^2}{m_t(m_t - 1)}\right],$$

where $\bar{Y}_{cj}^T = \sum_{j\in J_0} Y_j^T / m_c$, the mean value of $Y_j^T$ over all j∈J0 and $\bar{Y}_{tj}^T = \sum_{j\in J_1} Y_j^T / m_t$,
the mean value of $Y_j^T$ over all j∈J1.5


The bias of the variance estimator is always nonnegative, thus ensuring the
variance estimator is conservative. However, while $\hat{V}(\hat{\Delta}_{HT})$ is conservative, it
may also be imprecise. In addition, when mc or mt is 1 the estimate is undefined.
In general, it is impossible to consistently estimate $V(\hat{\Delta}_{HT})$ in experiments
performed on finite populations (Aronow et al. 2014) and thus it may be the case that
no single variance estimator is generally adequate. This issue is compounded
when N is small and asymptotic approximations may be poor.
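
The conservativeness of this variance estimator can be illustrated by enumeration. The following is a minimal sketch assuming four hypothetical clusters with illustrative sizes and potential cluster totals, and `m_t = 2` so that both sample variances are defined (all values and helper names are assumptions for illustration). It compares the true randomization variance of the HT estimator, the expectation of the variance estimator, and $V_{apx}$.

```python
from itertools import combinations
from fractions import Fraction

# Hypothetical cluster sizes and potential cluster totals (Y0j^T, Y1j^T).
sizes = [1, 2, 4, 3]
tot0 = [0, 2, 12, 5]
tot1 = [1, 4, 24, 6]
M, m_t = len(sizes), 2
m_c = M - m_t
N = sum(sizes)

def mean(xs):
    return Fraction(sum(xs)) / len(xs)

def sample_var(xs):
    """Sum of squared deviations divided by (len - 1)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def pop_var(xs):
    """Finite population variance sigma^2 (divides by the population size)."""
    m = mean(xs)
    return mean([(x - m) ** 2 for x in xs])

def ht_and_vhat(treated):
    yt = [tot1[j] for j in treated]
    yc = [tot0[j] for j in range(M) if j not in treated]
    est = Fraction(M, N) * (mean(yt) - mean(yc))
    vhat = Fraction(M, N) ** 2 * (sample_var(yc) / m_c + sample_var(yt) / m_t)
    return est, vhat

ests, vhats = zip(*(ht_and_vhat(a) for a in combinations(range(M), m_t)))
true_var = mean([e ** 2 for e in ests]) - mean(ests) ** 2
expected_vhat = mean(vhats)
v_apx = Fraction(M ** 3, N ** 2 * (M - 1)) * (pop_var(tot0) / m_c + pop_var(tot1) / m_t)

print(true_var, expected_vhat, v_apx)
```

In this example the expectation of the variance estimator equals $V_{apx}$ exactly and is at least as large as the true variance.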
We propose an alternative estimator of the variance by assuming a sharp
null hypothesis and either analytically or computationally calculating the
variance of the estimator. One common sharp null hypothesis is the sharp
null hypothesis of no treatment effect: H0:τi = 0, ∀i. H0 implies that the treatment
has no effect whatsoever on the outcome, i.e. that both potential outcomes are

5 When M is large, researchers may encounter numerical problems computing $M^2$ and, later,
$M^4$. This problem may be obviated by replacing $M^2/N^2$ with $\left(\frac{1}{M}\sum_{j=1}^{M} n_j\right)^{-2}$, the reciprocal of the
square of the average number of units per cluster.

identical: Y0i = Y1i = Yi. When the sharp null hypothesis of no effect holds, we know
two important facts: $\sigma^2(Y_{0j}^T) = \sigma^2(Y_{1j}^T) = \sigma^2(Y_j^T)$ and $\sigma(Y_{0j}^T, Y_{1j}^T) = \sigma^2(Y_j^T)$. By substituting $\sigma^2(Y_j^T)$
into the last line of equation 17, we may calculate the true variance under this null
hypothesis,

$$V^N(\hat{\Delta}_{HT}) = \frac{M^2}{N^2}\left[\frac{M}{M - 1}\left(\frac{\sigma^2(Y_j^T)}{m_c} + \frac{\sigma^2(Y_j^T)}{m_t}\right) + \frac{1}{M - 1}\left[2\sigma^2(Y_j^T) - \sigma^2(Y_j^T) - \sigma^2(Y_j^T)\right]\right] = \frac{M^4\,\sigma^2(Y_j^T)}{N^2(M - 1)\,m_c m_t}.$$

Note that if the sharp null hypothesis of no effect holds, $V^N(\hat{\Delta}_{HT})$ is the true
variance, which can be calculated from the data exactly or by way of resampling.
When the sharp null hypothesis of no effect does not necessarily hold, $V^N(\hat{\Delta}_{HT})$
may be construed as an estimator of $V(\hat{\Delta}_{HT})$. We therefore refer to a variance
estimator constructed by assuming the sharp null hypothesis of no effect as
$\hat{V}^N(\hat{\Delta}_{HT})$.
The primary benefit of using $\hat{V}^N(\hat{\Delta}_{HT})$ is that it tends to be more stable
than $\hat{V}(\hat{\Delta}_{HT})$, particularly when either nc or nt is small, because it combines
the variance of the treatment and control groups. In cases where $\hat{V}(\hat{\Delta}_{HT})$ is
imprecise, $\hat{V}^N(\hat{\Delta}_{HT})$ may be preferable. Highly imprecise standard errors may
be downwardly biased even when the associated variance estimator is
conservative. The square root is a concave function so, by Jensen’s inequality,
$E[\hat{V}(\hat{\Delta}_{HT})^{0.5}] \le (E[\hat{V}(\hat{\Delta}_{HT})])^{0.5}$. Since the estimates from $\hat{V}^N(\hat{\Delta}_{HT})$ will tend to
remain stable across randomizations, its use may therefore avoid the bias
resulting from Jensen’s inequality. However, when effect sizes are large, $\hat{V}^N(\hat{\Delta}_{HT})$ will
tend to overestimate the true sampling variability.
Recent theoretical results suggest that $\hat{V}_N(\hat{\Delta}_{HT})$ may be adequate as a
conservative approximation. In general, $\hat{V}_N(\hat{\Delta}_{HT})$ will be conservative
relative to the true variance if effects are constant (at the cluster scale) or if the number
of clusters is balanced, in a result that directly follows from theorem 3 of Ding (2014)
and Samii and Aronow (2012) (by way of the relationship between pooled and
combined variance). These results indicate $\hat{V}_N(\hat{\Delta}_{HT})$ will have a higher value
than that of the true variance if treatment effects are in fact constant at the cluster
scale. For these reasons, choosing the sharp null of no effect as an approximation
will generally be conservative among the class of hypotheses such that effects are
constant at the cluster scale.6

6 Researchers may seek to calculate separate variance estimators for each of a grid of
hypothesized, constant treatment effects, and use these to form a confidence interval by way of
inverting hypothesis tests. We thank an anonymous reviewer for this suggestion.

Computational approximations of exact bias and variance terms may be
computed for any estimator under any given sharp null hypothesis. Another noteworthy
benefit of using $\hat{V}_N(\hat{\Delta}_{HT})$ is that it can be computed under designs where
$\hat{V}(\hat{\Delta}_{HT})$ cannot be computed, such as pair randomized designs.

4.3 Block Randomized Designs

In this section we consider how to generalize the HT estimator to block randomized
designs. In a block randomized design, clusters are first classified into
one of B blocks, often on the basis of homogeneity of the clusters. In the bth block,
a fixed number of clusters, $m_{tb}$, are assigned to treatment and the rest, $m_{cb}$, to
control.
As each block represents an independent randomized experiment, the HT
estimator and variance estimators may be applied to each block separately. For
the bth block the estimator of the ATE can be written $\hat{\Delta}_{HT}^{b}$. An unbiased estimate
of the ATE for all the units in the study can be written as a weighted average of the
block-level estimates,

$$
\hat{\Delta}_{HT}^{B} = \sum_{b=1}^{B} \frac{N_b}{N}\, \hat{\Delta}_{HT}^{b}, \tag{18}
$$


where $N_b$ is the number of units in the bth block. From first principles, the variance
of the estimator is

$$
V\left(\hat{\Delta}_{HT}^{B}\right) = \sum_{b=1}^{B} \frac{N_b^2}{N^2}\, V\left(\hat{\Delta}_{HT}^{b}\right), \tag{19}
$$

and conservative variance estimation can be achieved by "plugging in" conservative
estimators of the variance for each of the $V(\hat{\Delta}_{HT}^{b})$. Alternatively, Monte Carlo
simulations may be used as an approximation.
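
As a rough illustration of equations 18 and 19, the following Python sketch combines block-level estimates and their variances using the weights $N_b/N$. The function name and the numbers in the usage lines are hypothetical and are not taken from the paper; the block-level estimates and variances are assumed to have been computed separately for each block.

```python
def block_weighted_estimate(block_sizes, block_estimates, block_variances):
    """Combine per-block ATE estimates as in equations 18 and 19.

    block_sizes[b]     -- N_b, the number of units in block b
    block_estimates[b] -- the block-level estimate of the ATE
    block_variances[b] -- the (estimated) variance of that block-level estimate
    """
    total_units = sum(block_sizes)  # N
    # Equation 18: weights N_b / N
    estimate = sum(
        (n_b / total_units) * est
        for n_b, est in zip(block_sizes, block_estimates)
    )
    # Equation 19: weights N_b^2 / N^2 (blocks are independent experiments)
    variance = sum(
        (n_b / total_units) ** 2 * var
        for n_b, var in zip(block_sizes, block_variances)
    )
    return estimate, variance


# Hypothetical inputs: three blocks with made-up sizes, estimates and variances.
est, var = block_weighted_estimate(
    [100, 300, 600], [0.02, 0.01, 0.03], [4e-4, 1e-4, 2e-4]
)
```

Plugging conservative block-level variance estimates into the third argument yields a conservative overall estimate, in line with the remark above.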

5 Difference Estimators
In this section, we propose a simple extension of the HT estimator to improve the
efficiency of the estimator as well as confer the important property of location
invariance.

5.1 Des Raj Difference Estimator for Cluster Size

A major source of variability with the HT estimator is the variation in the number
of individuals in each cluster. Clusters with large $n_j$ will tend to have larger
values of $Y_j^T$; that is, in many applications, as clusters get larger, the sum of
the outcomes for that cluster will also tend to get larger. We use the Des Raj (1965)
difference estimator to reduce this variability. To derive the Des Raj difference
estimator in this context, we first derive our estimates of the study population
totals, $Y_0^T$ and $Y_1^T$, by "differencing" off some of the variability:

$$
\hat{Y}_{0,R1}^{T} = \frac{M}{m_c} \sum_{j \in J_0} \left( Y_j^T - k\,(n_j - N/M) \right), \tag{20}
$$

where constant k is a prior estimate of the regression coefficient from a regression
of $Y_j^T$ on $n_j$, and $(n_j - N/M)$ is the difference between the size of cluster j and the
average cluster size.7 k is also roughly equivalent to an estimate of the average
value of $Y_{ij}$ for all units and does not have a causal interpretation. In Section 5.3,
we derive an exact expression for the optimal value of k, which depends on both
potential outcomes and the specifics of the experimental design. Similarly,

$$
\hat{Y}_{1,R1}^{T} = \frac{M}{m_t} \sum_{j \in J_1} \left( Y_j^T - k\,(n_j - N/M) \right). \tag{21}
$$

To develop an intuition about this method, note that it is equivalent to defining
a new "differenced" variable $U_j^T$, where $U_j^T = Y_j^T - k\,(n_j - N/M)$, and conducting
the analysis based on $U_j^T$ instead of $Y_j^T$. So long as k is fixed before analysis, this
strategy does not lead to bias because

$$
E[k\,(n_j - N/M)] = k\,E[n_j - N/M] = k \cdot 0 = 0. \tag{22}
$$

It follows that the HT and Des Raj estimators have the same expected value. Since
$\hat{Y}_{0,R1}^{T}$ and $\hat{Y}_{1,R1}^{T}$ are unbiased, it follows that the Des Raj estimator,

$$
\hat{\Delta}_{R1} = \frac{1}{N}\left[ \hat{Y}_{1,R1}^{T} - \hat{Y}_{0,R1}^{T} \right],
$$

is also unbiased.8
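
The construction in equations 20 to 22 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a single (unblocked) experiment, a hypothetical function name, and made-up data in which every cluster has the same outcome rate. Setting k = 0 gives the HT estimate.

```python
def des_raj_estimate(cluster_totals, cluster_sizes, treated, k):
    """Des Raj difference estimate of the ATE (equations 20 and 21).

    cluster_totals[j] -- observed outcome total Y_j^T for cluster j
    cluster_sizes[j]  -- n_j, the number of units in cluster j
    treated[j]        -- True if cluster j was assigned to treatment
    k                 -- constant fixed before the outcomes are analyzed
    """
    M = len(cluster_sizes)
    N = sum(cluster_sizes)
    m_t = sum(1 for t in treated if t)
    m_c = M - m_t
    # Differenced totals U_j^T = Y_j^T - k (n_j - N/M)
    u = [y - k * (n - N / M) for y, n in zip(cluster_totals, cluster_sizes)]
    total_treated = (M / m_t) * sum(x for x, t in zip(u, treated) if t)
    total_control = (M / m_c) * sum(x for x, t in zip(u, treated) if not t)
    return (total_treated - total_control) / N


# Made-up data: no treatment effect, outcome rate 0.5 in every cluster,
# but the two largest clusters happen to be assigned to treatment.
sizes = [10, 20, 30, 40]
totals = [5.0, 10.0, 15.0, 20.0]
treated = [False, False, True, True]

ht = des_raj_estimate(totals, sizes, treated, k=0.0)       # HT special case
des_raj = des_raj_estimate(totals, sizes, treated, k=0.5)  # k near the mean outcome
```

In this particular assignment the HT estimate is pulled away from zero purely by the imbalance in cluster sizes, while the differenced version with k = 0.5 is not. Both are unbiased across randomizations; the difference is in their variability.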

7 A similar estimator is proposed by Hansen and Bowers (2009), differing primarily in that it
contains a random denominator.
8 Note that estimating k from the same data set can lead to bias, as we demonstrate in
Appendix B, raising the question of where to obtain a suitable value. In Section 6, we suggest
using data from other blocks in experiments with blocking. Another option would be to
find an auxiliary data source from which a trustworthy value of k can be estimated. In survey

Deriving a conservative estimator of the variance of the Des Raj estimator
follows directly from Section 4.2:

$$
\hat{V}(\hat{\Delta}_{R1}) = \frac{M^2}{N^2}\left[ \frac{\sum_{j \in J_0} \left( U_j^T - \bar{U}_{cj}^T \right)^2}{m_c (m_c - 1)} + \frac{\sum_{j \in J_1} \left( U_j^T - \bar{U}_{tj}^T \right)^2}{m_t (m_t - 1)} \right], \tag{23}
$$

where $\bar{U}_{cj}^T = \sum_{j \in J_0} U_j^T / m_c$, the mean value of $U_j^T$ in the control condition, and
$\bar{U}_{tj}^T = \sum_{j \in J_1} U_j^T / m_t$, the mean value of $U_j^T$ in the treatment condition. Similarly,
from Section 4.2, we may easily construct a variance estimator for $\hat{\Delta}_{R1}$ by assuming
the sharp null hypothesis of no treatment effect:

$$
\hat{V}_N(\hat{\Delta}_{R1}) = \frac{M^4\, \sigma^2(U_j^T)}{N^2 (M-1)\, m_c m_t}.
$$

As an alternative, Monte Carlo simulations can be used to compute this quantity.
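
A minimal Python sketch of the two variance estimators follows. It assumes the differenced totals $U_j^T$ have already been computed, takes $\sigma^2(\cdot)$ to be the finite-population variance with divisor M (as in the expression used in Section 5.3), and uses a hypothetical function name and made-up inputs.

```python
def variance_estimates(u, treated, N):
    """Return (V_hat, V_hat_N) for differenced cluster totals u.

    V_hat   -- the estimator in equation 23 (needs m_c >= 2 and m_t >= 2)
    V_hat_N -- the estimator that assumes the sharp null of no effect
    N       -- total number of units in the study
    """
    M = len(u)
    u_t = [x for x, t in zip(u, treated) if t]
    u_c = [x for x, t in zip(u, treated) if not t]
    m_t, m_c = len(u_t), len(u_c)

    def sum_sq_dev(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    v_hat = (M ** 2 / N ** 2) * (
        sum_sq_dev(u_c) / (m_c * (m_c - 1)) + sum_sq_dev(u_t) / (m_t * (m_t - 1))
    )
    sigma2 = sum_sq_dev(u) / M  # pooled over all clusters, divisor M
    v_hat_n = M ** 4 * sigma2 / (N ** 2 * (M - 1) * m_c * m_t)
    return v_hat, v_hat_n


# Made-up differenced totals for four clusters in a study of N = 100 units.
v_hat, v_hat_n = variance_estimates(
    [1.0, 2.0, 3.0, 6.0], [True, True, False, False], N=100
)
```

Because the sharp-null version pools all M clusters, it can still be computed when a condition contains a single cluster, where equation 23 would divide by zero.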

5.2 Invariance of the Des Raj Difference Estimator

One benefit of the Des Raj estimator is that it has invariance to location transformation,
regardless of the accuracy of the researcher's choice of k. In this section,
we prove the invariance of the Des Raj estimator. When $Y_{0ij}$ and $Y_{1ij}$ are linearly
transformed, k will also change: the same transformation must be applied to k as
to $Y_{0ij}$ and $Y_{1ij}$. Since k is on the same scale as the outcome variable, when the
outcome variable is transformed, k will also be transformed:

$$
k^{*} = (b_0 + b_1 \cdot k). \tag{24}
$$

Using this new k*, we may again define new differenced treatment outcomes,

$$
\begin{aligned}
U_{1j}^{T*} &= Y_{1j}^{T*} - k^{*} \cdot (n_j - N/M) \\
&= \sum_{i=1}^{n_j} (b_0 + b_1 \cdot Y_{1ij}) - (b_0 + b_1 \cdot k) \cdot (n_j - N/M) \\
&= n_j \cdot b_0 + b_1 \cdot Y_{1j}^{T} - (b_0 + b_1 \cdot k) \cdot (n_j - N/M) \\
&= b_0 \cdot N/M + b_1 \cdot U_{1j}^{T}.
\end{aligned}
$$

sampling, researchers sometimes accept the bias of estimating k with regression (Sarndal 1978),
but the focus of the current paper is on unbiased estimation so regression estimation is outside
our scope. We recommend that either the value of or procedure for choosing k be specified in
a preanalysis planning document, so as to reduce the uncertainty associated with researcher
discretion.

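And, likewise, we may define new differenced control outcomes,
$U_{0j}^{T*} = b_0 \cdot N/M + b_1 \cdot U_{0j}^{T}$. The estimate based on these transformed variables will be

$$
\begin{aligned}
\hat{\Delta}_{R1}^{*} &= \frac{M}{N}\left[ \frac{1}{m_t} \sum_{j \in J_1} U_{1j}^{T*} - \frac{1}{m_c} \sum_{j \in J_0} U_{0j}^{T*} \right] \\
&= \frac{M}{N}\left[ \frac{1}{m_t} \sum_{j \in J_1} \left( b_0 \cdot N/M + b_1 \cdot U_{1j}^{T} \right) - \frac{1}{m_c} \sum_{j \in J_0} \left( b_0 \cdot N/M + b_1 \cdot U_{0j}^{T} \right) \right] \\
&= \frac{M}{N}\left[ b_1 \frac{1}{m_t} \sum_{j \in J_1} U_{1j}^{T} - b_1 \frac{1}{m_c} \sum_{j \in J_0} U_{0j}^{T} \right] \\
&= b_1 \cdot \hat{\Delta}_{R1}.
\end{aligned} \tag{25}
$$
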

The Des Raj estimator is therefore invariant to linear transformation because any
linear transformation to the outcome will necessarily be reflected in k.
Note that the HT estimator may be considered a special case of the Des Raj
estimator when k = 0. However, unlike the HT estimator, the explicit assumption
that k = 0 ensures that when the scale of the outcome changes, the scale of k also
changes. The non-invariance of the HT estimator may therefore be thought of as a
failure to recognize the implicit assumption that k = 0 and to transform to k* when
the scale of the outcome changes.
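
The algebra in equations 24 and 25 can be checked numerically. The Python sketch below uses made-up cluster data and a hypothetical helper function. It applies the transformation $Y^{*} = b_0 + b_1 Y$ to every unit, so each cluster total becomes $n_j b_0 + b_1 Y_j^T$, and compares the estimates before and after.

```python
def des_raj(totals, sizes, treated, k):
    """Des Raj difference estimate; k = 0 reproduces the HT estimate."""
    M, N = len(sizes), sum(sizes)
    m_t = sum(1 for t in treated if t)
    m_c = M - m_t
    u = [y - k * (n - N / M) for y, n in zip(totals, sizes)]
    return ((M / m_t) * sum(x for x, t in zip(u, treated) if t)
            - (M / m_c) * sum(x for x, t in zip(u, treated) if not t)) / N


# Made-up data for four clusters.
sizes = [10, 20, 30, 40]
totals = [4.0, 11.0, 14.0, 23.0]
treated = [False, True, False, True]
k = 0.5

# Transform every unit-level outcome: Y* = b0 + b1 * Y.
b0, b1 = 3.0, 2.0
totals_star = [n * b0 + b1 * y for n, y in zip(sizes, totals)]
k_star = b0 + b1 * k  # equation 24

original = des_raj(totals, sizes, treated, k)
transformed = des_raj(totals_star, sizes, treated, k_star)

# Holding k at 0 (the HT estimator) while the outcome is shifted.
ht_original = des_raj(totals, sizes, treated, 0.0)
ht_transformed = des_raj(totals_star, sizes, treated, 0.0)
ht_shift = ht_transformed - b1 * ht_original
```

Here `transformed` equals `b1 * original`, as in equation 25, whereas `ht_shift` is nonzero: the HT estimate picks up a term proportional to $b_0$ times the imbalance in average cluster size between the two conditions.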

5.3 Optimal Selection of k

To derive the optimal value of k, we begin by noting that the variance of $U_{0j}^T$ is

$$
\begin{aligned}
\sigma^2(U_{0j}^T) &= \frac{\sum_j \left( U_{0j}^T - \bar{U}_{0j}^T \right)^2}{M} \\
&= \frac{\sum_j \left( Y_{0j}^T - k\,(n_j - N/M) - \bar{Y}_{0j}^T \right)^2}{M} \\
&= \sigma^2(Y_{0j}^T) + k^2 \sigma^2(n_j) - 2k\, \sigma(n_j, Y_{0j}^T),
\end{aligned}
$$

where $\bar{U}_{0j}^T$ is the mean value of $U_{0j}^T$ over all clusters. $k_{optim_c}$, the value of k that
minimizes $\sigma^2(U_{0j}^T)$, can be found using simple optimization. Since the second
derivative with respect to k, $2\sigma^2(n_j)$, must be positive, we may set the first derivative
equal to zero and solve for k, so that

$$
k_{optim_c} = \frac{\sigma(n_j, Y_{0j}^T)}{\sigma^2(n_j)}. \tag{26}
$$


Equation 26 should look familiar to the reader: the best fitting k is the ordinary
least squares coefficient.
Likewise, the optimal value of k for the potential outcomes under treatment
is $k_{optim_t} = \sigma(n_j, Y_{1j}^T) / \sigma^2(n_j)$. Given that $k_{optim_t}$ does not generally equal $k_{optim_c}$, a
researcher could justifiably identify different values of k for treatment and control
groups. In practice, however, this would require a good deal of prior knowledge
(including knowledge about treatment effects); for this reason, a single value of k
will typically be preferable. In Appendix C, we derive a single optimal value of
k, $k_{optim*} = m_t k_{optim_c} / M + m_c k_{optim_t} / M$.
Unlike a structural parameter, the value of $k_{optim*}$ will depend on the number
of clusters assigned to treatment and to control. Perhaps counterintuitively, when
there are fewer clusters in the control condition, $k_{optim*}$ is more heavily weighted
toward $k_{optim_c}$, the value of k that minimizes $\sigma^2(U_{0j}^T)$ (and vice versa). A simple
intuition for this weighting is that the condition with fewer clusters will contribute
more to the overall variance of the estimator. Consequently, the greatest
increase in precision comes from adjustments made to units in that condition.
The chosen value of k will reduce the variability of the Des Raj estimator, $\hat{\Delta}_{R1}$,
relative to the HT estimator when, for $k_{optim*} > 0$, $0 < k < 2k_{optim*}$ and, for $k_{optim*} < 0$,
$0 > k > 2k_{optim*}$. In other words, the Des Raj estimator will have better precision than
the HT estimator unless the researcher picks a k with the wrong sign or chooses
a k that is more than twice the magnitude of $k_{optim*}$. Because $k_{optim*}$ will tend to be
close to the average outcome for all individuals, the researcher will usually have
prior knowledge about the mean individual-level outcome.9
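
The weighting in $k_{optim*}$ can be checked by brute force on a toy example. The Python sketch below assumes a single experiment with complete randomization of clusters and uses made-up potential-outcome totals, which are never jointly observed in practice. It computes $k_{optim_c}$, $k_{optim_t}$ and $k_{optim*}$ from the formulas above, then enumerates every possible assignment to find the k that minimizes the variance of the Des Raj estimate directly. All function names are hypothetical.

```python
import itertools


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Finite-population covariance with divisor M (variance when a is b)."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


def optimal_k(sizes, y0, y1, m_t):
    """Return (k_optim_c, k_optim_t, k_optim_star) from the formulas above."""
    M = len(sizes)
    m_c = M - m_t
    k_c = cov(sizes, y0) / cov(sizes, sizes)
    k_t = cov(sizes, y1) / cov(sizes, sizes)
    return k_c, k_t, (m_t * k_c + m_c * k_t) / M


def minimizer_by_enumeration(sizes, y0, y1, m_t):
    """Variance-minimizing k found by enumerating every assignment.

    The Des Raj estimate equals ht - k * imbalance, so its variance is
    quadratic in k and is minimized at Cov(ht, imbalance) / Var(imbalance).
    """
    M, N = len(sizes), sum(sizes)
    m_c = M - m_t
    d = [n - N / M for n in sizes]
    ht, imbalance = [], []
    for treated in itertools.combinations(range(M), m_t):
        control = [j for j in range(M) if j not in treated]
        ht.append((M / N) * (sum(y1[j] for j in treated) / m_t
                             - sum(y0[j] for j in control) / m_c))
        imbalance.append((M / N) * (sum(d[j] for j in treated) / m_t
                                    - sum(d[j] for j in control) / m_c))
    return cov(ht, imbalance) / cov(imbalance, imbalance)


# Made-up cluster sizes and potential-outcome totals for six clusters.
sizes = [12, 30, 45, 18, 60, 25]
y0 = [5, 14, 20, 9, 31, 10]
y1 = [7, 15, 26, 9, 40, 13]

k_c, k_t, k_star = optimal_k(sizes, y0, y1, m_t=2)
k_enum = minimizer_by_enumeration(sizes, y0, y1, m_t=2)
```

With two of six clusters treated, `k_star` places weight 2/6 on `k_c` and 4/6 on `k_t`, and `k_enum` agrees with it up to floating-point error.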
Under the sharp null hypothesis of no treatment effect,
$k_{optim*} = k_{optim_c} = k_{optim_t} = \sigma(n_j, Y_j^T) / \sigma^2(n_j)$, and thus the optimal k would be the ordinary
least squares coefficient from regressing $Y_j^T$ on $n_j$. Prima facie, the intuitive next
step would be to try to estimate k from the data, utilizing ordinary least squares
on the observed data (perhaps controlling for $D_j$). However, regression estimates
of k can lead to bias in the estimation of treatment effects. In Appendix B, we
demonstrate that the bias from estimating k from within-sample data is

$$
E\left[ \frac{\hat{Y}_{1,R1}^{T} - \hat{Y}_{0,R1}^{T}}{N} \right] - \Delta = \frac{M}{N}\left( \mathrm{Cov}(\hat{k}, \bar{n}_{cj}) - \mathrm{Cov}(\hat{k}, \bar{n}_{tj}) \right),
$$
9 Note that our fundamental uncertainty about the optimal value of k does not itself contribute
to the uncertainty of our estimate since k is treated as a fixed constant, e.g. in equation 23.

where $\hat{k}$ is an estimator of k, $\bar{n}_{tj}$ is the mean value of $n_j$ for clusters in the treatment
condition in a given randomization and $\bar{n}_{cj}$ is the mean value of $n_j$ for clusters
in the control condition in a given randomization.
Knowing the optimal value of k under the sharp null hypothesis of no treatment
effect is nevertheless informative as we seek to construct principled prior estimates
for k. By using the ordinary least squares estimator on auxiliary data with
similar potential outcomes, we can approximate $k_{optim*}$ with out-of-sample data.
As we will demonstrate in our empirical example, such auxiliary data can
come from the other blocks in a block randomized experiment. If one were concerned
that estimating the values of k from other blocks of an experiment would
lead to additional stochasticity in the values of $U_{0j}^T$ and $U_{1j}^T$, Monte Carlo simulations
(whereby the values of k are recomputed for each simulation) may be used
to compute the sharp null variance estimate.

5.4 Des Raj Difference Estimator for Cluster Size and Covariates

The Des Raj estimator may also be extended to include other covariates which
may further reduce the sampling variability of the estimator. Assume the
researcher has access to A covariates for each individual i in cluster j, denoted
by $X_{aij}$, $a \in 1, 2, \ldots, A$. Define the cluster total of the covariate, $X_{aj}^T = \sum_{i=1}^{n_j} X_{aij}$, and
define the sum of the $X_{aij}$ across all individuals in all clusters, $X_a^T = \sum_{j=1}^{M} \sum_{i=1}^{n_j} X_{aij}$. It
is simple to adapt the Des Raj estimator to incorporate these additional covariates.
Define constants k′ and $k_a$ (∀a) as prior estimates of the coefficients associated
with a regression of $Y_j^T$ on cluster size and cluster-level covariates, respectively.
Again, k′ and $k_a$ do not have causal interpretations. It follows that we may define

$$
\hat{Y}_{0,R2}^{T} = \frac{M}{m_c} \sum_{j \in J_0} \Bigg[ Y_j^T - \underbrace{k'(n_j - N/M)}_{\text{adjusting for size}} - \underbrace{\sum_{a=1}^{A} k_a \left( X_{aj}^T - X_a^T / M \right)}_{\text{adjusting for other covariates}} \Bigg]
$$

and

$$
\hat{Y}_{1,R2}^{T} = \frac{M}{m_t} \sum_{j \in J_1} \Bigg[ Y_j^T - k'(n_j - N/M) - \sum_{a=1}^{A} k_a \left( X_{aj}^T - X_a^T / M \right) \Bigg].
$$

By the logic of equation 22, $\hat{Y}_{0,R2}^{T}$ and $\hat{Y}_{1,R2}^{T}$ are unbiased estimators of $Y_0^T$ and
$Y_1^T$, respectively. It follows that we may again construct an unbiased estimator
of Δ,

$$
\hat{\Delta}_{R2} = \frac{1}{N}\left[ \hat{Y}_{1,R2}^{T} - \hat{Y}_{0,R2}^{T} \right].
$$

Following the same steps as in equation 25, it can be shown that as long as k′
undergoes the same linear transformation as the original data and $k_a$ (∀a∈A)
undergoes the same multiplicative scale shift, the Des Raj estimator with covariates
will also be invariant. It will also be more efficient than the preceding estimators
if the researcher's estimates for k′ and $k_a$ are reasonable; constructing
variance estimators for $\hat{\Delta}_{R2}$ is simple and follows directly from Section 5.1.
Note that the efficiency characteristics of this Des Raj estimator may be
derived as in Section 5.3, where the same intuitions about efficiency hold. In principle,
a researcher should choose covariates that together do the best job of predicting
values of the potential outcomes to achieve the values of $\hat{Y}_{0,R2}^{T}$ and $\hat{Y}_{1,R2}^{T}$
with the lowest variability across randomizations. In practice, a researcher might
apply a variable selection method such as penalized regression techniques using
an auxiliary data set to identify suitable covariates and values of k.
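
The covariate-adjusted estimator differs from the size-only version only in the adjustment subtracted from each cluster total. The Python sketch below is an illustration with a hypothetical function name and made-up data: one covariate, four clusters, and totals constructed to be an exact linear function of size and the covariate total, so that the adjustment removes all cluster-to-cluster variation.

```python
def des_raj_with_covariates(totals, sizes, covariate_totals, treated, k_size, k_cov):
    """Des Raj estimate adjusting for cluster size and cluster-level covariate totals.

    covariate_totals[a][j] -- total of covariate a in cluster j (X_aj^T)
    k_size                 -- fixed constant k' for cluster size
    k_cov[a]               -- fixed constant k_a for covariate a
    """
    M, N = len(sizes), sum(sizes)
    m_t = sum(1 for t in treated if t)
    m_c = M - m_t
    # Average cluster total of each covariate, X_a^T / M
    cov_means = [sum(col) / M for col in covariate_totals]
    adjusted = []
    for j in range(M):
        size_adj = k_size * (sizes[j] - N / M)
        cov_adj = sum(
            k_a * (col[j] - col_mean)
            for k_a, col, col_mean in zip(k_cov, covariate_totals, cov_means)
        )
        adjusted.append(totals[j] - size_adj - cov_adj)
    y1_hat = (M / m_t) * sum(x for x, t in zip(adjusted, treated) if t)
    y0_hat = (M / m_c) * sum(x for x, t in zip(adjusted, treated) if not t)
    return (y1_hat - y0_hat) / N


# Made-up data: totals equal 0.5 * size + 0.1 * covariate total, no treatment effect.
sizes = [10, 20, 30, 40]
x_totals = [[100.0, 50.0, 80.0, 70.0]]
totals = [15.0, 15.0, 23.0, 27.0]
treated = [False, False, True, True]

unadjusted = des_raj_with_covariates(totals, sizes, x_totals, treated, 0.0, [0.0])
adjusted = des_raj_with_covariates(totals, sizes, x_totals, treated, 0.5, [0.1])
```

With both constants set to zero the function returns the HT estimate. With the constants matching the data-generating coefficients, the estimate under this assignment is zero, the true effect in this toy example.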

6 Application
In this application we reanalyze the data from Green and Vavreck (2008) who
used a cluster randomized design to examine the effectiveness of television ads
on voter turnout among 18- and 19-year-old voters in the 2004 presidential elec-
tion. The study randomized television cable districts to either a treatment group,
in which advertisements encouraging young people to vote were shown, or to
the control group. The original experiment included a total of 23,869 voters in 85
television cable districts in blocks (strata) of size 2 or 3. Because we wanted to use
prior turnout in the cable district as a covariate in our analysis, we limited the
analysis to the 80 cable districts for which this information was available from the
authors. This yielded 40 blocks of two cable districts each (one in treatment, the
other in control) and a total of 22,733 individual voters.
The outcome measure of interest, Yij, is whether or not the individual i in
cluster (cable district) j voted in the 2004 American presidential election (coded
1 if the individual voted, 0 if the individual did not vote). Because 18- and 19-year-olds
are new registrants, they have no prior voter history, so individual voter
history could not be used for covariates. However, we use turnout rate in the cable
district in the 2000 election as a covariate as well as age. While the covariates are
somewhat less than ideal because they are unlikely to be particularly predictive,
they provide us with an opportunity to examine how the Raj difference estimator
performs when covariates are not particularly informative. In such a situation we
might expect $k_{optim}$ to be near zero and values of k chosen may actually reduce the
efficiency of the Raj difference estimator since it is less likely to be the case that
$2k_{optim} > k > 0$ when $k_{optim} = 0$.

6.1 Randomization inference

Randomization inference (RI) will allow us to assess the bias and variance of
any given estimator. In addition, RI allows the researcher to perform completely
nonparametric significance testing (see, e.g. Rosenbaum 2002). We refer to the
estimate produced by a given estimator as the test statistic. RI assumes that a
given sharp null hypothesis holds and evaluates the test statistic for every pos-
sible random assignment of units to treatment and control. By recalculating the
test statistic for each possible treatment assignment, the reference distribution of
the test statistic is constructed. Fisher’s exact test is a well-known form of RI for
significance testing, but the method is much more general.
Because the total possible permutations increase rapidly with population
size, RI may be computationally infeasible. We may use Monte Carlo simula-
tions to approximate RI by repeatedly assigning units to treatment and control
groups randomly and estimating the test statistic that would be observed for each
repetition. The distribution of the test statistic across randomizations forms the
reference distribution of the statistic. As the number of repetitions gets large, the
distribution of the test statistic based on repeated randomizations converges to
that of full RI. This method can achieve results arbitrarily close to RI by increasing
the number of repetitions.
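
The Monte Carlo approximation to RI described above can be sketched as follows. This Python illustration makes several simplifying assumptions that are not taken from the application: clusters are completely randomized (rather than randomized within pairs), the test statistic is a simple difference in mean cluster outcomes, and the data are made up. Function names are hypothetical.

```python
import random


def ri_reference_distribution(statistic, outcomes, m_t, draws, seed=0):
    """Approximate the reference distribution of `statistic` under the sharp
    null of no effect by repeatedly re-randomizing which clusters are treated."""
    rng = random.Random(seed)
    M = len(outcomes)
    reference = []
    for _ in range(draws):
        chosen = set(rng.sample(range(M), m_t))
        treated = [j in chosen for j in range(M)]
        # Under the sharp null, outcomes do not depend on the assignment.
        reference.append(statistic(outcomes, treated))
    return reference


def two_sided_p_value(observed, reference):
    """Share of simulated statistics at least as extreme as the observed one."""
    return sum(1 for s in reference if abs(s) >= abs(observed)) / len(reference)


def difference_in_cluster_means(outcomes, treated):
    y_t = [y for y, t in zip(outcomes, treated) if t]
    y_c = [y for y, t in zip(outcomes, treated) if not t]
    return sum(y_t) / len(y_t) - sum(y_c) / len(y_c)


# Made-up cluster-level outcomes and an observed assignment.
outcomes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
observed_assignment = [True] * 4 + [False] * 4
observed = difference_in_cluster_means(outcomes, observed_assignment)

reference = ri_reference_distribution(
    difference_in_cluster_means, outcomes, m_t=4, draws=2000
)
p_value = two_sided_p_value(observed, reference)
```

Increasing `draws` brings the simulated reference distribution closer to the one obtained by enumerating all assignments. Any of the estimators discussed in this paper could be passed in place of `difference_in_cluster_means`.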
We use randomization inference to examine the behavior of our estimators
and compare them with the behavior of three commonly used estimators. We
conduct randomization inference for two scenarios (5000 iterations). The first
scenario examines the behavior of the estimators under the sharp null hypothesis
of absolutely no treatment effect. The second scenario examines the behavior of
estimators under heterogeneous treatment effects.

6.2 Imputing Missing Potential Outcomes

Computing the test statistics under repeated randomizations requires that we can
observe both potential outcomes for each unit. Since in reality we only observe
the response of unit i under one of the treatments, we must impute the value of
the missing potential outcome before conducting RI. We conduct RI using two
different methods of imputation.

The first method assumes the sharp null hypothesis of no treatment effect.
This effectively imputes the missing potential outcome with the observed poten-
tial outcome.
In the second method we simulate heterogeneous treatment effects, first
modeling the data using logistic regression in order to impute missing potential
outcomes. This method looks to the data as a guide to creating realistic potential
outcomes that have a similar structure to the original data. We used the logistic
regression model,

$$
P(Y_{ij} = 1) = 1 - \left( 1 + \exp\left( \alpha + \tau t_j + \beta n_j + \phi n_j t_j + \sum_{f=1}^{F-1} \gamma_f \Gamma_f \right) \right)^{-1},
$$

where tj is a treatment indicator for cluster j, nj is the cluster size, F is the number
of blocks (in this case, 40), and Γf is an indicator variable for whether cluster
j is in block f. The terms α, τ, β, φ and γf are coefficients estimated from the data
using maximum likelihood methods. Note the coefficient φ is responsible for the
heterogeneous treatment effects. We estimate τ as 0.4, β as 1.4 and φ as –0.9.
We used this model to impute missing potential outcomes for each individ-
ual. To do so, the latent probability of response (voting) was first computed for
each unit when treated, pti, and when not treated, pci, using the estimated model.
Each missing Yci and Yti was imputed using a random draw from a Bernoulli
random variable with probability estimated from the logistic regression model.
The imputation process was conducted for each iteration of the RI. Marginalizing
over the imputation process, the ATE is in expectation
$\sum_i (p_{ti} - p_{ci}) / N = 0.007$, or 0.7 percentage points.
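
The imputation step can be sketched in Python as below. The values of τ, β and φ are those reported above; the intercept, the block effect and the scaling of cluster size are not reported in this section, so the values used for them here are purely hypothetical placeholders, as are the function names.

```python
import math
import random


def vote_probability(alpha, tau, beta, phi, gamma_block, size, treated):
    """P(Y = 1) from the logistic model above; `treated` is 0 or 1.

    1 - (1 + exp(x))^(-1) is the usual logistic function exp(x) / (1 + exp(x)).
    """
    x = alpha + tau * treated + beta * size + phi * size * treated + gamma_block
    return 1.0 - 1.0 / (1.0 + math.exp(x))


def impute_potential_outcomes(observed, treated, p_c, p_t, rng):
    """Return (y_control, y_treated): keep the observed outcome and draw the
    missing potential outcome from a Bernoulli distribution."""
    if treated:
        return (1 if rng.random() < p_c else 0), observed
    return observed, (1 if rng.random() < p_t else 0)


# tau, beta and phi as reported in the text; alpha, gamma_block and the
# (rescaled) cluster size are hypothetical values chosen for illustration.
tau, beta, phi = 0.4, 1.4, -0.9
alpha, gamma_block, size = -1.0, 0.0, 0.5

p_c = vote_probability(alpha, tau, beta, phi, gamma_block, size, treated=0)
p_t = vote_probability(alpha, tau, beta, phi, gamma_block, size, treated=1)

rng = random.Random(1)
y_c, y_t = impute_potential_outcomes(observed=1, treated=True, p_c=p_c, p_t=p_t, rng=rng)
```

Averaging `p_t - p_c` over all units gives the expected ATE of the simulated potential outcomes, which is the quantity reported at the end of this section.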

6.3 Treatment effect estimates

In this section, we define the estimators that will be compared. We will consider
four regression-based estimators as well as the three design-based estimators
proposed in this paper. We begin by detailing each of these estimators.
The first estimator is the regression without covariates, also known as the
difference-in-means. The model can be written:

Yij = β0 + β 1 Dj + eij , (27)

where β0 is a constant, Dj is an indicator for treatment, β1 is an estimate of Δ, and
eij is an error term. Our estimate of β1 follows from fitting the model with ordinary
least squares. The next estimator under consideration is the fixed-effects
regression estimator, $\hat{\Delta}_{FE}$, which includes fixed effects for each of the blocks
(strata):

$$
Y_{ij} = \beta_0 + \beta_1 D_j + \sum_{f=1}^{F-1} \gamma_f \Gamma_f + e_{ij}, \tag{28}
$$


where β0, β1, Dj and eij are as above and Γf represents the dummy variable for the fth
block, and the model is again fitted with ordinary least squares. We then consider
the fixed-effects regression estimator that also adjusts for the covariates: average
turnout in 2000 and age. The model can be written:

$$
Y_{ij} = \beta_0 + \beta_1 D_j + \beta_2 X_{2ij} + \beta_3 X_{3ij} + \sum_{f=1}^{F-1} \gamma_f \Gamma_f + e_{ij},
$$

where β2 is the coefficient on precinct-level voter turnout in 2000, β3 is the coefficient
on age, and the model is fitted with ordinary least squares.
As Freedman (2008a) notes, even without clustering, the models above
that include covariates may be biased due to regression adjustment if treatment
assignment is imbalanced (i.e. nt≠nc) or there exists treatment effect heterogene-
ity (i.e. ∃i, j s.t. τi≠τj). For both fixed effects models, Huber-White “robust” cluster
standard errors are estimated. While often sufficient for inference, these standard
errors may be unreliable in finite samples (Freedman 2006; Angrist and Pischke
2009) and they may also fail to address larger issues of model misspecification
(King and Roberts 2014).
Our next estimator adds a random effect for cluster to the above specification.
Random effects estimation was the recommended analytical technique in
Green and Vavreck (2008). However, as we will show, this estimator is not guaranteed
to be unbiased. We use the following specification:

$$
Y_{ij} = \beta_0 + \beta_1 D_j + \beta_2 X_{2ij} + \beta_3 X_{3ij} + \sum_{f=1}^{F-1} \gamma_f \Gamma_f + e_j + e_{ij},
$$

where ej is a normally distributed cluster-level disturbance (and eij is also distributed
normally). This model is estimated using the lmer() function in the lme4
(Bates and Maechler 2010) package in R (R Development Core Team 2010) using
the default settings. Standard errors are empirical Bayes estimates also produced
by the lmer() function.
And, finally, we present treatment effect estimates for the HT estimator, the
Des Raj difference estimator (with nj) and the Des Raj difference estimator (with
nj and covariates). For all three, the standard error estimates are the square roots
of our estimated "sharp null" variances: $\hat{V}_N(\hat{\Delta})$ is used as opposed to $\hat{V}(\hat{\Delta})$ because
the latter is not identified in the pair-randomized design.

In this application we use the alternate blocks of the experiment to derive the
values of k, k′ and ka from the data. For a given block, the values are estimated by
dropping that block from the data and regressing the outcome on the covariates
using data from the remaining 39 blocks.
To estimate k for the Des Raj estimator with only nj, we use the following
model:

$$
Y_j^T = \alpha + k\, n_j + e_j,
$$

where α is a constant, nj is cluster size, and ej is a random disturbance. This estimation
procedure yields a principled estimate for k. To estimate k′ and ka for the
Des Raj estimator with both nj and covariates, we use the following model:

$$
Y_j^T = \alpha' + k' n_j + k_1 X_{1j}^T + k_2 X_{2j}^T + e_j,
$$

where α′ is a constant, $X_{1j}^T$ is the total turnout in cluster j in the 2000 election, and
$X_{2j}^T$ is the sum of ages in cluster j.
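
The leave-one-block-out procedure for the size-only model can be sketched as follows. This Python illustration uses a hypothetical data layout (parallel lists of block labels, cluster sizes and cluster totals) and made-up numbers; it fits the simple regression $Y_j^T = \alpha + k\,n_j + e_j$ by ordinary least squares on the clusters outside each block.

```python
def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def leave_one_block_out_k(blocks, sizes, totals):
    """For each block, estimate k using only the clusters in the other blocks."""
    k_by_block = {}
    for b in set(blocks):
        xs = [n for n, label in zip(sizes, blocks) if label != b]
        ys = [y for y, label in zip(totals, blocks) if label != b]
        k_by_block[b] = ols_slope(xs, ys)
    return k_by_block


# Made-up example: three blocks of two clusters each, totals = 1 + 0.5 * size.
blocks = [0, 0, 1, 1, 2, 2]
sizes = [10, 20, 30, 40, 50, 60]
totals = [6.0, 11.0, 16.0, 21.0, 26.0, 31.0]

k_values = leave_one_block_out_k(blocks, sizes, totals)
```

The value in `k_values[b]` does not use any outcome from block b, so within that block it plays the role of a constant chosen from auxiliary data.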
Note that in the sharp null scenario the estimated values of k, k′, k1 and k2 are
the same for all randomizations for a given block. For the heterogeneous treat-
ment effect scenario, however, these values can vary across randomizations as
the observed values of $Y_j^T$ change depending on whether cluster j is in treatment
or not. As mentioned above, this sort of variability in these values contributes
to the variability of the Raj difference estimator. In our application, the variance
estimators remain conservative nonetheless. In practice, if the contribution of k
to the uncertainty is a concern, Monte Carlo simulations could be used to esti-
mate the variance.

6.4 Randomization Inference With the Sharp Null Hypothesis of No Treatment Effect

Figure 2 displays the results for the point estimators assuming the sharp null
hypothesis of no treatment effect. Solid vertical lines indicate the mean of the
sampling distributions.
Results show that all estimators are unbiased under the sharp null. The HT
estimator is the least precise estimator by far. The rest perform very competitively
with the random effects regression and Raj’s difference estimator being the most
precise.
Figure 3 displays the results for the standard error estimators under
the sharp null hypothesis. In the case of the regression with no covariates
(difference-in-means) the standard errors are biased upwards due to the failure

Figure 2: Sampling distributions associated with the ATE estimators under the sharp null hypothesis of no treatment effect detailed in Section 6.
Five thousand randomizations were used to estimate the density of the sampling distributions. Density plots were generated using the density()
function in R (R Development Core Team 2010), with the default settings and a bandwidth of 0.5 percentage points. Each estimator is detailed in
Section 6.3. The vertical line indicates the expected value, and therefore bias, of the estimator. Bias and SE estimates in the upper-right of each plot
are computed from each empirical distribution.

[Seven density panels. Horizontal axis: ATE (percentage points). Vertical axis: density. Statistics printed in the panels:]

Estimator                   Bias   SD    RMSE
Reg. (Diff-in-means)        0.0    2.4   2.4
Reg. w/FE                   0.0    1.9   1.9
Reg. w/FE, Cov.             0.0    2.0   2.0
Reg. w/FE, Cov, RE          0.0    1.6   1.6
Horvitz-Thompson            0.0    9.2   9.2
Raj difference (nj)         0.0    1.6   1.6
Raj difference (nj, cov.)   0.0    1.6   1.6

Figure 3: Sampling distributions associated with the SE estimators under the sharp null hypothesis of no treatment effect detailed in Section 6.
Five thousand randomizations were used to estimate the sampling distributions. Density plots were generated using the density() function in R
(R Development Core Team 2010), with the default settings and a bandwidth of 0.05 percentage points. Each SE estimator is detailed in Section 6.3.
The solid vertical line indicates the expected value of the SE estimator. The dotted vertical line indicates the true standard error of the estimator. Bias
and SD estimates are computed from each empirical distribution. Distributions for the SEs under the sharp null hypothesis of no treatment effect were
too narrow to display.

[Seven density panels. Horizontal axis: SE estimate (percentage points). Vertical axis: density. Statistics printed in the panels:]

Estimator                   SD     Bias
Reg. (Diff-in-means)        0.09   0.53
Reg. w/FE                   0.06   -0.54
Reg. w/FE, cov.             0.07   -0.55
Reg. w/FE, cov, RE          0.04   -0.06
Horvitz-Thompson            0.00   0.00
Raj difference (nj)         0.00   0.00
Raj difference (nj, cov.)   0.00   0.00

of this model to account for blocking.10 However, for the regression models
that include fixed effects, the standard errors are badly biased downward, as
we might expect given that sandwich-type estimators tend to be unreliable in
finite samples. The standard errors associated with the random effects model
perform reasonably well, being only slightly biased downwards. Meanwhile,
under the sharp null, the standard errors associated with the HT estimator and
the Raj difference estimators are exact, being both unbiased and having no
sampling variability.

6.5 Randomization Inference with Treatment Effect Heterogeneity

Figure 4 displays results under treatment effect heterogeneity. Solid vertical lines
indicate the mean of the sampling distributions. Dotted vertical lines indicate the
true treatment effect (0.7 percentage points).
The results demonstrate that the regressions tend to be biased to varying
degrees. Interestingly, the regression without covariates (difference-in-means)
is only slightly biased downward. That the difference-in-means is not terribly
biased can be understood as a result of the sample size (80 clusters) being suf-
ficiently large (recall the consistency proof in Section 3.3).
When the regressions include fixed-effects, however, the bias actually
increases. This can be understood in light of the fact that fixed-effects regression
estimates yield variance-weighted averages of the block-level estimates (Angrist
and Pischke 2009). In other words, the fixed-effects estimator is equivalent to
taking the difference-in-means for each block and then taking a weighted average
of them. Since the block-level estimates are each biased, the overall average is
similarly biased. As discussed in Section 3.4 above, this is a particularly troubling
property of the fixed-effects estimator because it will also be inconsistent for
increasing numbers of blocks. In other words, adding more blocks to the experi-
ment will not necessarily diminish the overall bias.
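
The weighted-average interpretation of the fixed-effects estimator can be checked numerically. The sketch below is illustrative only: the data are simulated, all variable names are hypothetical, and each row is treated as a generic unit rather than as a cluster from the experiment analyzed here. It assumes the standard result that the weight on block b is proportional to n_b p_b (1 - p_b), where p_b is the share of treated units in block b.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 3 blocks with different sizes and treatment shares.
block = np.repeat([0, 1, 2], [6, 10, 8])
treat = np.concatenate([
    [1, 1, 1, 0, 0, 0],              # block 0: 3 of 6 treated
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # block 1: 2 of 10 treated
    [1, 1, 1, 1, 1, 1, 0, 0],        # block 2: 6 of 8 treated
]).astype(float)
y = rng.normal(size=block.size) + 0.5 * treat * (block + 1)

# OLS of y on treatment plus a full set of block dummies (fixed effects).
dummies = (block[:, None] == np.arange(3)[None, :]).astype(float)
X = np.column_stack([treat, dummies])
beta_fe = np.linalg.lstsq(X, y, rcond=None)[0][0]

# Block-level difference-in-means and weights n_b * p_b * (1 - p_b).
taus, weights = [], []
for b in range(3):
    in_b = block == b
    p_b = treat[in_b].mean()
    taus.append(y[in_b & (treat == 1)].mean() - y[in_b & (treat == 0)].mean())
    weights.append(in_b.sum() * p_b * (1 - p_b))

weighted_avg = np.average(taus, weights=weights)
print(beta_fe, weighted_avg)  # the two numbers agree up to floating-point error
```

Because the fixed-effects coefficient is exactly this weighted average, any bias in the block-level difference-in-means estimates carries through to the overall estimate.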
Again, the HT estimator is unbiased but has very poor precision. And while
the random effects estimator has the lowest standard deviation, Des Raj’s differ-
ence estimators are the most precise in terms of RMSE.
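
As a reading aid, the RMSE values can be related to the bias and SD through RMSE = sqrt(Bias^2 + SD^2). The sketch below recomputes RMSE from the rounded bias and SD values printed in the Figure 4 panels; because those inputs are rounded to one decimal place, the recomputed values can differ from the printed RMSE by up to about 0.1. The dictionary and function names are hypothetical.

```python
import math

# (bias, SD, reported RMSE) in percentage points, as printed in the Figure 4 panels.
panels = {
    "Reg. (Diff-in-means)":      (-0.2, 2.3, 2.3),
    "Reg. w/FE":                 (0.9, 1.5, 1.8),
    "Reg. w/FE, cov.":           (0.9, 1.5, 1.8),
    "Reg. w/FE, cov, RE":        (2.0, 1.3, 2.4),
    "Horvitz-Thompson":          (0.0, 9.2, 9.2),
    "Raj difference (nj)":       (0.0, 1.5, 1.5),
    "Raj difference (nj, cov.)": (0.0, 1.6, 1.6),
}

def rmse(bias, sd):
    """Root mean squared error from bias and standard deviation."""
    return math.sqrt(bias ** 2 + sd ** 2)

recomputed = {name: rmse(b, s) for name, (b, s, _) in panels.items()}
for name, (b, s, reported) in panels.items():
    print(f"{name:28s} recomputed={recomputed[name]:.2f} reported={reported}")

most_precise = min(recomputed, key=recomputed.get)
```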

10 Although we consider the bias of the standard error estimator, in practice, bias is not an ideal
loss function for evaluating standard error estimators. However, given the size of the sample and
the typical rate of convergence for variance estimators, we expect that bias serves as an approxi-
mation for asymptotic bias, which is of greater interest for constructing confidence intervals.

[Figure 4 appears here: seven density plots, one per estimator. Horizontal axis: ATE (percentage points), from -10 to 10. Vertical axis: density. Statistics printed in each panel, in percentage points:]

Estimator                      Bias     SD   RMSE
Reg. (Diff-in-means)           -0.2    2.3    2.3
Reg. w/FE                       0.9    1.5    1.8
Reg. w/FE, cov.                 0.9    1.5    1.8
Reg. w/FE, cov, RE              2.0    1.3    2.4
Horvitz-Thompson                0.0    9.2    9.2
Raj difference (nj)             0.0    1.5    1.5
Raj difference (nj, cov.)       0.0    1.6    1.6

Figure 4: ATE estimator sampling distributions associated with heterogeneous treatment effects detailed in Section 6. Five thousand randomizations
were used to estimate the density of the sampling distributions. Density plots were generated using the density() function in R (R Development Core
Team 2010), with the default settings and a bandwidth of 0.5 percentage points. Each estimator is detailed in Section 6.3. The vertical line indicates
the expected value, and therefore bias, of the estimator. Bias, SD, and RMSE estimates in the upper-right of each plot are computed from each empirical
distribution.


Note also that the addition of the covariates (age and turnout rate in 2000)
actually increases the variability in Raj’s difference estimator. This is because the
covariates are not particularly predictive of the outcome, and so the estimated
values of k tend to miss their mark by a wide margin.
Finally, Figure 5 displays the performance of the standard error estimators
in the case of heterogeneous treatment effects. Results again show that the
“robust cluster” standard errors can perform very badly, being substantially
downwardly biased in the case of the regressions with fixed effects. The stand-
ard error estimator associated with the random effects regression performs
well, being only slightly upwardly biased. The standard error estimators for the
HT and Raj difference estimators are conservative, being biased only slightly
upwards.

7 Conclusion
The unbiased estimation of the ATE in cluster-randomized experiments has
been elusive. In unpacking the source of the bias in the difference-in-means
estimator, this paper has also identified some common design-estimator
combinations where the bias of estimators will not diminish with sample size,
such as pair-randomized designs combined with regression estimators with fixed
effects for block. This paper has returned to the first principles of
randomization and sampling theory, showing that the fundamental statistical
properties of randomization can be applied to modern causal inferential problems.
Not only does the Des Raj estimator provide the basis for an unbiased and
location-invariant estimator for the analysis of cluster-randomized experiments;
compared to the HT estimator, it also achieves improved precision through
covariate adjustment.
There are a number of theoretical implications of this return to the first prin-
ciples of randomization. First, machinery based solely on sampling-theoretic
ideas can be sufficient for precise and unbiased estimation of causal parameters.
Second, researchers need not feel that achieving precise and unbiased causal esti-
mates requires an up-to-date knowledge of complex statistical models: we may
easily derive estimators with good statistical properties using only fundamental
concepts. Third, utilizing such estimators serves to remind us of the importance
of the distinction between observational studies and randomized experiments.
The importance of the logic of the experiment, with its reliance on randomiza-
tion, may be lost when researchers rely on model-based estimators that may or
may not reflect the experimental design.

[Figure 5 appears here: seven density plots, one per SE estimator. Horizontal axis: SE estimate (percentage points), from 0 to 3.0 in every panel except Horvitz-Thompson, which runs from 8.0 to 10.0. Vertical axis: density. Statistics printed in each panel, in percentage points:]

Estimator                      Bias      SD
Reg. (Diff-in-means)           0.84    0.25
Reg. w/FE                     -0.20    0.16
Reg. w/FE, cov.               -0.21    0.17
Reg. w/FE, cov, RE             0.22    0.12
Horvitz-Thompson               0.03    0.24
Raj difference (nj)            0.22    0.24
Raj difference (nj, cov.)      0.22    0.26

Figure 5: SE estimator sampling distributions associated with the heterogeneous treatment effect detailed in Section 6. Five thousand randomizations
were used to estimate the sampling distributions. Density plots were generated using the density() function in R (R Development Core
Team 2010), with the default settings and a bandwidth of 0.05 percentage points. Each SE estimator is detailed in Section 6.3. The solid vertical line
indicates the expected value of the SE estimator. The dotted vertical line indicates the true standard error of the estimator. Bias and SD estimates are
computed from each empirical distribution. Distributions for the SEs under the sharp null hypothesis of no treatment effect were too narrow to display.


Acknowledgments: The authors acknowledge support from the Yale University
Faculty of Arts and Sciences High Performance Computing facility and staff.
The authors would also like to thank Allison Carnegie, Adam Dynes, Don Green,
Jennifer Hill, Mary McGrath, David Nickerson, Cyrus Samii and two reviewers
for helpful comments. The authors thank Kyle Peyton for research assistance
and manuscript preparation. Any errata are the sole responsibility of the authors.

Appendix

A Proof of non-invariance of the Horvitz-Thompson estimator

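To prove that the HT estimator is not invariant to location shifts, we need only
replace $Y_j^T$ with its linear transformation:

\[
\begin{aligned}
\hat{\Delta}^*_{HT}
&= \frac{M}{N}\left[\frac{1}{m_t}\sum_{j\in J_1} Y_j^{T*} - \frac{1}{m_c}\sum_{j\in J_0} Y_j^{T*}\right] \\
&= \frac{M}{N}\left[\frac{1}{m_t}\sum_{j\in J_1}\sum_{i=1}^{n_j} Y_{ij}^{*} - \frac{1}{m_c}\sum_{j\in J_0}\sum_{i=1}^{n_j} Y_{ij}^{*}\right] \\
&= \frac{M}{N}\left[\frac{1}{m_t}\sum_{j\in J_1}\sum_{i=1}^{n_j}\left(b_0 + b_1 \cdot Y_{ij}\right) - \frac{1}{m_c}\sum_{j\in J_0}\sum_{i=1}^{n_j}\left(b_0 + b_1 \cdot Y_{ij}\right)\right] \\
&= \frac{M}{N}\left[\frac{1}{m_t}\sum_{j\in J_1}\left(n_j \cdot b_0 + \sum_{i=1}^{n_j} b_1 \cdot Y_{ij}\right) - \frac{1}{m_c}\sum_{j\in J_0}\left(n_j \cdot b_0 + \sum_{i=1}^{n_j} b_1 \cdot Y_{ij}\right)\right] \\
&= b_0 \cdot \frac{M}{N}\left[\frac{1}{m_t}\sum_{j\in J_1} n_j - \frac{1}{m_c}\sum_{j\in J_0} n_j\right]
 + b_1 \cdot \frac{M}{N}\left[\frac{1}{m_t}\sum_{j\in J_1}\sum_{i=1}^{n_j} Y_{ij} - \frac{1}{m_c}\sum_{j\in J_0}\sum_{i=1}^{n_j} Y_{ij}\right] \\
&= b_0 \cdot \frac{M}{N}\left[\frac{1}{m_t}\sum_{j\in J_1} n_j - \frac{1}{m_c}\sum_{j\in J_0} n_j\right] + b_1 \cdot \hat{\Delta}_{HT}.
\end{aligned}
\]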
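
The result above can be illustrated with a small numerical example. The sketch below uses made-up cluster sizes and outcomes (all values and names are hypothetical) and a single fixed assignment. It checks that the HT estimate on the transformed outcomes equals the b0 term plus b1 times the original estimate, so the location shift b0 does not cancel when treated and control clusters differ in average size.

```python
import numpy as np

# Hypothetical experiment: M = 4 clusters, the first two treated (illustrative values).
sizes = np.array([3.0, 5.0, 2.0, 4.0])                 # n_j
outcomes = [np.array([1.0, 0.0, 1.0]),
            np.array([1.0, 1.0, 0.0, 0.0, 1.0]),
            np.array([0.0, 1.0]),
            np.array([1.0, 0.0, 0.0, 1.0])]            # observed Y_ij
treated = np.array([True, True, False, False])
M, N = len(sizes), sizes.sum()
m_t, m_c = treated.sum(), (~treated).sum()

def ht(cluster_values):
    """(M/N) * [mean over treated clusters - mean over control clusters]."""
    return (M / N) * (cluster_values[treated].sum() / m_t
                      - cluster_values[~treated].sum() / m_c)

b0, b1 = 10.0, 2.0
totals = np.array([y.sum() for y in outcomes])                  # Y_j^T
totals_star = np.array([(b0 + b1 * y).sum() for y in outcomes])  # Y_j^{T*}

delta = ht(totals)
delta_star = ht(totals_star)
shift_term = b0 * ht(sizes)   # the b0 term in the last line of the proof

print(delta, delta_star, shift_term + b1 * delta)
```

An estimator of a difference that is invariant to location would return b1 times the original estimate; here the extra term is nonzero because the treated clusters are larger on average than the control clusters.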

B Bias from estimating k from within-sample data


Consider the situation where one wishes to improve upon the HT estimator by
adjusting for cluster size; in other words, one wishes to estimate k in equations 20
and 21 from the data to approximate the optimal value of k with an estimator $\hat{k}$.
In this scenario, the expected value of equation 20 yields

\[
\begin{aligned}
E\left[\hat{Y}^T_{1,R1}\right]
&= E\left[\frac{M}{m_t}\sum_{j\in J_1}\left(Y_j^T - \hat{k}\,(n_j - N/M)\right)\right] \\
&= \frac{M}{m_t}\left(E\left[\sum_{j\in J_1} Y_j^T\right] - E\left[\sum_{j\in J_1}\hat{k}\,n_j\right] + E\left[\sum_{j\in J_1}\hat{k}\,N/M\right]\right) \\
&= \frac{M}{m_t}\left(E\left[m_t\,\bar{Y}^T_{1j}\right] - E\left[\hat{k}\,m_t\,\bar{n}_{tj}\right] + E\left[\hat{k}\,m_t\,N/M\right]\right) \\
&= Y_1^T - M\left(E\left[\hat{k}\,\bar{n}_{tj}\right] - E\left[\hat{k}\right]E\left[\bar{n}_{tj}\right]\right) \\
&= Y_1^T - M\,\mathrm{Cov}\left(\hat{k}, \bar{n}_{tj}\right), \qquad (29)
\end{aligned}
\]

where $\bar{n}_{tj}$ is the mean value of $n_j$ for clusters in the treatment condition in a given
randomization. In the third line of equation 29, $\hat{k}$ moves outside the summation
operator because it is a constant for a given randomization. Likewise,

\[
E\left[\hat{Y}^T_{0,R1}\right] = Y_0^T - M\,\mathrm{Cov}\left(\hat{k}, \bar{n}_{cj}\right), \qquad (30)
\]

where $\bar{n}_{cj}$ is the mean value of $n_j$ for clusters in the control condition in a given
randomization. So the expected value of the estimator will be

\[
E\left[\frac{\hat{Y}^T_{1,R1} - \hat{Y}^T_{0,R1}}{N}\right]
= \Delta + \frac{M}{N}\left(\mathrm{Cov}\left(\hat{k}, \bar{n}_{cj}\right) - \mathrm{Cov}\left(\hat{k}, \bar{n}_{tj}\right)\right). \qquad (31)
\]


The term on the right of equation 31 represents the bias. A special case with no
bias is when the sharp null hypothesis of no treatment effect holds and treatment
and control groups have equal numbers of clusters. We refer the reader to
Williams (1961), Freedman (2008a) and Freedman (2008b) for additional reading
on the particular bias associated with the regression adjustment of random
samples and experimental data.
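
The identity in equation 29 can be verified by exact enumeration in a small finite population. The sketch below is a minimal illustration under stated assumptions: the cluster sizes and treated-outcome totals are invented, and the within-sample estimator of k (an OLS slope of treated cluster totals on cluster size) is chosen here only for illustration, not taken from the paper. The identity holds for any estimator of k that is a function of the randomization.

```python
from itertools import combinations
import numpy as np

# Hypothetical finite population of M = 6 clusters (illustrative values only).
n = np.array([2.0, 3.0, 5.0, 8.0, 4.0, 6.0])    # cluster sizes n_j
y1 = np.array([1.0, 2.0, 4.0, 5.0, 1.0, 3.0])   # cluster totals under treatment
M, m_t = len(n), 3
N = n.sum()

def k_hat(idx):
    """Within-sample OLS slope of treated cluster totals on cluster size (an assumption)."""
    return np.polyfit(n[idx], y1[idx], 1)[0]

est, ks, nbars = [], [], []
for treated in combinations(range(M), m_t):     # all 20 equally likely assignments
    idx = list(treated)
    k = k_hat(idx)
    est.append(M / m_t * np.sum(y1[idx] - k * (n[idx] - N / M)))
    ks.append(k)
    nbars.append(n[idx].mean())

est, ks, nbars = map(np.array, (est, ks, nbars))
cov_k_nbar = np.mean(ks * nbars) - ks.mean() * nbars.mean()   # Cov(k_hat, nbar_tj)

expected_value = est.mean()                     # exact expectation over the design
rhs_eq_29 = y1.sum() - M * cov_k_nbar           # Y_1^T - M * Cov(k_hat, nbar_tj)
print(expected_value, rhs_eq_29, y1.sum())
```

Comparing the first two printed numbers with the third shows how far the expectation sits from the true total when k is estimated from the same data.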

C Derivation of the optimal value of k


To identify a single optimal value of k, $k_{\mathrm{optim}*}$, we refer to the first line of
equation 17,

\[
v\,V(\hat{\Delta}_{R1}) = c\,\sigma^2(U^T_{j0}) + t\,\sigma^2(U^T_{j1}) + 2\,\sigma(U^T_{j0}, U^T_{j1}), \qquad (32)
\]

where $v = \frac{(M-1)N^2}{M^2}$, $c = \frac{M - m_c}{m_c}$, and $t = \frac{M - m_t}{m_t}$. Now note that the terms
$\sigma^2(U^T_{j1})$, $\sigma^2(U^T_{j0})$, and $\sigma(U^T_{j0}, U^T_{j1})$ in equation 32 can be written as follows:

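\[
\sigma^2(U^T_{j1}) = \sigma^2(Y^T_{j1}) + k^2\sigma^2(n_j) - 2k\,\sigma(Y^T_{j1}, n_j), \qquad (33)
\]

\[
\sigma^2(U^T_{j0}) = \sigma^2(Y^T_{j0}) + k^2\sigma^2(n_j) - 2k\,\sigma(Y^T_{j0}, n_j), \qquad (34)
\]

and, defining $\delta_j = (n_j - N/M)$,

\[
\begin{aligned}
\sigma(U^T_{j0}, U^T_{j1})
&= E\left[U^T_{j0}U^T_{j1}\right] - \bar{U}^T_0\,\bar{U}^T_1 \\
&= E\left[(Y^T_{j0} - k\delta_j)(Y^T_{j1} - k\delta_j)\right] - \bar{Y}^T_0\,\bar{Y}^T_1 \\
&= E\left[Y^T_{j0}Y^T_{j1} - Y^T_{j0}k\delta_j - Y^T_{j1}k\delta_j + k^2\delta_j^2\right] - \bar{Y}^T_0\,\bar{Y}^T_1 \\
&= E\left[Y^T_{j0}Y^T_{j1}\right] - \bar{Y}^T_0\,\bar{Y}^T_1 - E\left[Y^T_{j0}k\delta_j\right] - E\left[Y^T_{j1}k\delta_j\right] + E\left[k^2\delta_j^2\right] \\
&= \sigma(Y^T_{j0}, Y^T_{j1}) - k\left[\sigma(Y^T_{j0}, n_j) + E[Y^T_{j0}]E[\delta_j]\right]
   - k\left[\sigma(Y^T_{j1}, n_j) + E[Y^T_{j1}]E[\delta_j]\right] + k^2\sigma^2(n_j) \\
&= \sigma(Y^T_{j0}, Y^T_{j1}) - k\left[\sigma(Y^T_{j0}, n_j) + E[Y^T_{j0}]\cdot 0\right]
   - k\left[\sigma(Y^T_{j1}, n_j) + E[Y^T_{j1}]\cdot 0\right] + k^2\sigma^2(n_j) \\
&= \sigma(Y^T_{j0}, Y^T_{j1}) - k\,\sigma(Y^T_{j0}, n_j) - k\,\sigma(Y^T_{j1}, n_j) + k^2\sigma^2(n_j), \qquad (35)
\end{aligned}
\]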




respectively. Substituting equations 33, 34, and 35 into equation 32,

\[
\begin{aligned}
v\,V(\hat{\Delta}_{R1}) ={}& c\left[\sigma^2(Y^T_{j0}) + k^2\sigma^2(n_j) - 2k\,\sigma(Y^T_{j0}, n_j)\right]
 + t\left[\sigma^2(Y^T_{j1}) + k^2\sigma^2(n_j) - 2k\,\sigma(Y^T_{j1}, n_j)\right] \\
&+ 2\left[\sigma(Y^T_{j0}, Y^T_{j1}) - k\,\sigma(Y^T_{j0}, n_j) - k\,\sigma(Y^T_{j1}, n_j) + k^2\sigma^2(n_j)\right].
\end{aligned}
\]

Setting the first derivative with respect to k equal to zero,

\[
\begin{aligned}
0 ={}& c\left[2k_{\mathrm{optim}*}\sigma^2(n_j) - 2\sigma(Y^T_{j0}, n_j)\right] + t\left[2k_{\mathrm{optim}*}\sigma^2(n_j) - 2\sigma(Y^T_{j1}, n_j)\right] \\
&+ 2\left[-\sigma(Y^T_{j0}, n_j) - \sigma(Y^T_{j1}, n_j) + 2k_{\mathrm{optim}*}\sigma^2(n_j)\right],
\end{aligned}
\]

\[
c\,k_{\mathrm{optim}*}\sigma^2(n_j) + t\,k_{\mathrm{optim}*}\sigma^2(n_j) + 2k_{\mathrm{optim}*}\sigma^2(n_j)
= c\,\sigma(Y^T_{j0}, n_j) + t\,\sigma(Y^T_{j1}, n_j) + \sigma(Y^T_{j0}, n_j) + \sigma(Y^T_{j1}, n_j)
\]

\[
\left[\frac{M - m_c}{m_c} + \frac{M - m_t}{m_t} + \frac{m_c}{m_c} + \frac{m_t}{m_t}\right] k_{\mathrm{optim}*}\sigma^2(n_j)
= \left[\frac{M - m_c}{m_c} + \frac{m_c}{m_c}\right]\cdot\sigma(Y^T_{j0}, n_j)
+ \left[\frac{M - m_t}{m_t} + \frac{m_t}{m_t}\right]\cdot\sigma(Y^T_{j1}, n_j)
\]

\[
\left[\frac{M}{m_c} + \frac{M}{m_t}\right] k_{\mathrm{optim}*}\sigma^2(n_j)
= \left[\frac{M}{m_c}\right]\sigma(Y^T_{j0}, n_j) + \left[\frac{M}{m_t}\right]\sigma(Y^T_{j1}, n_j)
\]

\[
k_{\mathrm{optim}*} = \left[\frac{1}{m_c} + \frac{1}{m_t}\right]^{-1}
\left[\left(\frac{1}{m_c}\right)\frac{\sigma(Y^T_{j0}, n_j)}{\sigma^2(n_j)} + \left(\frac{1}{m_t}\right)\frac{\sigma(Y^T_{j1}, n_j)}{\sigma^2(n_j)}\right]
\]

\[
k_{\mathrm{optim}*} = \left[\frac{1}{m_c} + \frac{1}{m_t}\right]^{-1}
\left[\left(\frac{1}{m_c}\right)k_{\mathrm{optim}_c} + \left(\frac{1}{m_t}\right)k_{\mathrm{optim}_t}\right]
\]

\[
k_{\mathrm{optim}*} = \frac{m_t}{M}\,k_{\mathrm{optim}_c} + \frac{m_c}{M}\,k_{\mathrm{optim}_t}.
\]
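
The closed form for the optimal k can be checked against a direct numerical minimization of the right-hand side of equation 32. The sketch below is illustrative only: the cluster sizes and potential-outcome totals are invented, the names are hypothetical, and variances and covariances use the divisor M. Any common divisor would give the same minimizer.

```python
import numpy as np

# Hypothetical cluster sizes and potential-outcome totals (illustrative values).
n = np.array([2.0, 3.0, 5.0, 8.0, 4.0, 6.0])    # n_j
y0 = np.array([1.0, 1.0, 3.0, 4.0, 2.0, 2.0])   # Y^T_{j0}
y1 = np.array([1.0, 2.0, 4.0, 5.0, 1.0, 3.0])   # Y^T_{j1}
M, m_t = len(n), 2
m_c = M - m_t
c, t = (M - m_c) / m_c, (M - m_t) / m_t

def cov(a, b):
    """Finite-population covariance with divisor M (an assumption of this sketch)."""
    return np.mean(a * b) - a.mean() * b.mean()

def scaled_variance(k):
    """Right-hand side of equation 32 with U_j = Y_j - k * (n_j - N/M)."""
    delta = n - n.mean()                         # n.mean() equals N/M
    u0, u1 = y0 - k * delta, y1 - k * delta
    return c * cov(u0, u0) + t * cov(u1, u1) + 2 * cov(u0, u1)

k_c = cov(y0, n) / cov(n, n)                     # k_optim_c
k_t = cov(y1, n) / cov(n, n)                     # k_optim_t
k_star = (m_t / M) * k_c + (m_c / M) * k_t       # closed form derived above

# Brute-force check: minimize over a fine grid centered on the closed form.
grid = np.linspace(k_star - 1.0, k_star + 1.0, 2001)
k_grid = grid[np.argmin([scaled_variance(k) for k in grid])]
print(k_star, k_grid)
```

Note that the weights are crossed: the control-arm slope is weighted by the share of treated clusters and the treated-arm slope by the share of control clusters.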

The Des Raj estimator will be more efficient than the HT estimator when

\[
\begin{aligned}
(c + t + 2)\,k^2\sigma^2(n_j) &< 2k\left[(c + 1)\,\sigma(Y^T_{j0}, n_j) + (t + 1)\,\sigma(Y^T_{j1}, n_j)\right] \\
(c + t + 2)\,k^2 &< 2k\left[(c + 1)\frac{\sigma(Y^T_{j0}, n_j)}{\sigma^2(n_j)} + (t + 1)\frac{\sigma(Y^T_{j1}, n_j)}{\sigma^2(n_j)}\right] \\
(c + t + 2)\,k^2 &< 2k\left[(c + 1)\,k_{\mathrm{optim}_c} + (t + 1)\,k_{\mathrm{optim}_t}\right] \\
k^2 &< 2k\left[\frac{m_t}{M}\,k_{\mathrm{optim}_c} + \frac{m_c}{M}\,k_{\mathrm{optim}_t}\right] \\
k^2 &< 2k \cdot k_{\mathrm{optim}*}.
\end{aligned}
\]
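
The efficiency condition says that the adjusted estimator beats the HT estimator (the case k = 0) exactly when k lies strictly between zero and twice the optimal value. The sketch below checks this equivalence for a few values of k, using the same kind of invented finite population as an illustration. All names and values are hypothetical.

```python
import numpy as np

# Hypothetical cluster sizes and potential-outcome totals (illustrative values).
n = np.array([2.0, 3.0, 5.0, 8.0, 4.0, 6.0])
y0 = np.array([1.0, 1.0, 3.0, 4.0, 2.0, 2.0])
y1 = np.array([1.0, 2.0, 4.0, 5.0, 1.0, 3.0])
M, m_t = len(n), 2
m_c = M - m_t
c, t = (M - m_c) / m_c, (M - m_t) / m_t

def cov(a, b):
    """Finite-population covariance with divisor M (an assumption of this sketch)."""
    return np.mean(a * b) - a.mean() * b.mean()

def scaled_variance(k):
    """Right-hand side of equation 32; k = 0 corresponds to the HT estimator."""
    delta = n - n.mean()
    u0, u1 = y0 - k * delta, y1 - k * delta
    return c * cov(u0, u0) + t * cov(u1, u1) + 2 * cov(u0, u1)

k_star = (m_t / M) * cov(y0, n) / cov(n, n) + (m_c / M) * cov(y1, n) / cov(n, n)

# Compare "variance is lower than HT" with the condition k^2 < 2 * k * k_star.
checks = []
for k in [-0.5, 0.3, 0.6, 1.0, 1.5, 2.0]:
    beats_ht = scaled_variance(k) < scaled_variance(0.0)
    condition = k ** 2 < 2 * k * k_star
    checks.append((k, beats_ht, condition))
    print(k, beats_ht, condition)
```

In this example the optimal value is 0.6, so values of k between 0 and 1.2 improve on the HT estimator and values outside that range do worse.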

References
Angrist, J. D. and J. Pischke (2009) Mostly Harmless Econometrics. Princeton: Princeton
University Press.
Aronow, P. M., D. P. Green and D. K. K. Lee (2014) “Sharp Bounds on the Variance in
Randomized Experiments,” Annals of Statistics, 42(3):850–871.
Bates, D. and M. Maechler (2010) lme4: Linear mixed-effects models using S4 classes.
R package, version 0.999375-37.

Brewer, K. R. W. (1979) “A Class of Robust Sampling Designs for Large-Scale Surveys,” Journal
of the American Statistical Association, 74:911–915.
Chaudhuri, A. and H. Stenger (2005) Survey Sampling. Boca Raton: Chapman and Hall.
Cochran, W. G. (1977) Sampling Techniques, 3rd ed. New York: John Wiley.
Des Raj. (1965) “On A Method of Using Multi-Auxiliary Information in Sample Surveys,” Journal
of The American Statistical Association, 60:270–277.
Ding, P. (2014) “A Paradox from Randomization-Based Causal Inference,” arXiv preprint
arXiv:1402.0142.
Donner, A. and N. Klar (2000) Design and Analysis of Cluster Randomization Trials in Health
Research. New York: Oxford Univ. Press.
Freedman, D. A. (2006) “On the So-Called ‘Huber Sandwich Estimator’ and ‘Robust’ Standard
Errors,” American Statistician, 60:299–302.
Freedman, D. A. (2008a) “On Regression Adjustments to Experimental Data,” Advances in
Applied Mathematics, 40:180–193.
Freedman, D. A. (2008b) “On Regression Adjustments in Experiments with Several
Treatments,” Annals of Applied Statistics, 2:176–196.
Freedman, D. A., R. Pisani and R. A. Purves (1998) Statistics, 3rd ed. New York: W. W. Norton,
Inc.
Green, D. P. and L. Vavreck (2008) “Analysis of Cluster-Randomized Experiments: A Comparison
of Alternative Estimation Approaches,” Political Analysis, 16:138–152.
Hansen, B. and J. Bowers (2008) “Covariate Balance in Simple, Stratified and Clustered
Comparative Studies,” Statistical Science, 23:219–236.
Hansen, B. and J. Bowers (2009) “Attributing Effects to a Cluster-Randomized Get-Out-the-Vote
Campaign,” Journal of the American Statistical Association, 104:873–885.
Hartley, H. O. and A. Ross (1954) “Unbiased Ratio Estimators,” Nature, 174:270.
Hoffman, E. B., P. K. Sen and C. R. Weinberg (2001) “Within-Cluster Resampling,” Biometrika,
88:1121–1134.
Horvitz, D. G. and D. J. Thompson (1952) “A Generalization of Sampling Without Replacement
From a Finite Universe,” Journal of the American Statistical Association, 47:663–684.
Humphreys, M. (2009) Bounds on Least Squares Estimates of Causal Effects in the Presence of
Heterogeneous Assignment Probabilities. Working paper. Available at:
http://www.columbia.edu/~mh2245/papers1/monotonicity4.pdf.
Imai, K., G. King and C. Nall (2009) “The Essential Role of Pair Matching in Cluster-Randomized
Experiments, with Application to the Mexican Universal Health Insurance Evaluation,”
Statistical Science, 24:29–53.
King, G. and M. Roberts (2014) “How Robust Standard Errors Expose Methodological Problems
They Do Not Fix, and What to Do About It,” Political Analysis, 1–12.
Lachin, J. M. (1988) “Properties of Simple Randomization in Clinical Trials,” Controlled Clinical
Trials, 9(4):312–326.
Lin, W. (2013) “Agnostic Notes on Regression Adjustments to Experimental Data: Reexamining
Freedman’s Critique,” Annals of Applied Statistics, 7(1):295–318.
Middleton, J. A. (2008) “Bias of the Regression Estimator for Experiments Using Clustered
Random Assignment,” Statistics and Probability Letters, 78:2654–2659.
Miratrix, L., J. Sekhon and B. Yu (2013) “Adjusting Treatment Effect Estimates by
Post-Stratification in Randomized Experiments,” Journal of the Royal Statistical Society. Series B
(Methodological), 75(2):369–396.

Neyman, J. (1923) “On the Application of Probability Theory to Agricultural Experiments: Essay
on Principles, Section 9,” Statistical Science, 5:465–480. (Translated in 1990).
Neyman, J. (1934) “On the Two Different Aspects of the Representative Method: The Method of
Stratified Sampling and the Method of Purposive Selection,” Journal of the Royal Statistical
Society, 97(4):558–625.
R Development Core Team. (2010) R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0. Version
2.12.0.
Rosenbaum, P. R. (2002) Observational Studies, 2nd ed. New York: Springer.
Rubin, D. (1974) “Estimating Causal Effects of Treatments in Randomized and Nonrandomized
Studies,” Journal of Educational Psychology, 66:688–701.
Rubin, D. B. (1978) “Bayesian Inference for Causal Effects: The Role of Randomization,” The
Annals of Statistics, 6:34–58.
Rubin, D. B. (2005) “Causal Inference Using Potential Outcomes: Design, Modeling, Decisions,”
Journal of the American Statistical Association, 100:322–331.
Samii, C. and P. M. Aronow (2012) “On Equivalencies Between Design-Based and Regression-
Based Variance Estimators for Randomized Experiments,” Statistics and Probability
Letters, 82:365–370.
Sarndal, C.-E. (1978) “Design-Based and Model-Based Inference in Survey Sampling,”
Scandinavian Journal of Statistics, 5(1):27–52.
Williams, W. H. (1961) “Generating Unbiased Ratio and Regression Estimators,” Biometrics,
17:267–274.
