Methods Matter
Richard J. Murnane
John B. Willett
2011
Oxford University Press
Preface
and difficult questions that lie at the center of the educational enterprise.
Because of this, we have always sought to motivate—and embed—our work
in substance, in the important questions that educational policymakers
ask. We believe that substantive needs are a powerful catalyst to the devel-
opment of new research designs and data-analytic methods. It is the
interplay between substance and method that has always provided us with
our most fertile ground and that we seek to preserve in our work together.
Once you have a substantive question, then it is clear that methods matter!
So, that explains why we work together. But why did we write this book?
It is not a decision that we reached either quickly or lightly; in fact, it was
more like a decision that evolved, rather than being made. Over the last
15 years, it became clear to us that innovative research designs and ana-
lytic practices were being developed constantly, and applied in the social
sciences and statistics. We thought that these new methods of causal infer-
ence had enormous potential for resolving critical problems that plagued
education research. After all, don’t we want compelling evidence of what
works to influence educational policymaking?
Yet, when we examined the scholarly literature that was supposed to
inform educational policymaking, we found that most of the quantitative
research could not even support credible statements of cause and effect.
Consequently, it seemed sensible to facilitate the implementation of the
new methods of causal inference in the fields of educational and social
science research. We wanted to persuade scholars, policymakers, and
practitioners that there were substantial and powerful methods that could
improve causal research in education and the social sciences. In our expe-
riences as teachers, the successful migration of innovative ideas across
domain boundaries has always demanded that they be expressed not only
understandably, but also in context. Those working in education and the social
sciences had to be persuaded that there was something worthwhile that
would work for them. Consequently, over the last decade and a half, as
our own ideas began to crystallize, we tried to draw an adept group of up-
and-coming young scholars at our school into an advanced doctoral
seminar on causal inference, to worry about the issues with us. From out
of that seminar has grown this book.
In our seminar and in this book, our pedagogic approach has been to
embed the learning of innovative methods for causal inference in sub-
stantive contexts. To do this, we have drawn on exemplary empirical
research papers from other fields, mainly economics (because that’s a
field that at least one of us knows well!), to introduce, explain, and illus-
trate the application of the new methods. We have asked our students to
study these papers with us carefully. At the same time, we have tried to
provide them with clear and sensible intellectual frameworks within which
the technical bases of the new methods for causal inference made sense.
In creating these frameworks, we have opted for conceptual, graphical,
and data-based explanations rather than those that are intensively math-
ematical and statistical. Our objective is to widen the reach and appeal of
the methods to scholars who do not possess the same deep technical
backgrounds as the developers and early implementers of the methods.
We have experienced some success in this effort, and have now brought
the same approach to this book. Throughout the book, we have sought to
present new methods for causal inference in a way that is sensitive to the
practical realities of the educational and social context. We hope not only
to make you receptive to incorporating these methods in your own
research, but also to help you see the value of the guidelines provided in the book
for judging the quality of the research studies you read.
Many colleagues have helped us as we worked on this book, answering
our many questions, providing data from their studies, and providing feed-
back on draft chapters. At the distinct risk of leaving out the names of
colleagues to whom we are indebted, we would like to thank Joshua
Angrist, David Autor, Felipe Barrera-Osorio, Howard Bloom, Geoffrey
Borman, Kathryn Boudett, Sarah Cohodes, Tom Dee, Susan Dynarski,
Patricia Graham, Rema Hanna, Caroline Hoxby, Guido Imbens, Brian
Jacob, Larry Katz, Jim Kemple, Jeff Kling, Peter Kemper, Daniel Koretz,
Victor Lavy, Frank Levy, Leigh Linden, Jens Ludwig, Douglas Miller,
Richard Nelson, Edward Pauly, Stephen Raudenbush, Jonah Rockoff,
Juan Saavedra, Judy Singer, Miguel Urquiola, Emiliana Vegas, and partici-
pants in our causal inference doctoral course. We would especially like to
thank Lindsay Page and John Papay, who read the entire manuscript and
provided innumerable suggestions for improving it.
The staff members of our Learning Technology Center at HGSE have
always gone out of their way to support our computing needs, and have
responded to our questions and difficulties with immediate and thought-
ful help. We also want to thank our wonderful and extraordinarily efficient
assistant at HGSE, Wendy Angus. Wendy has solved numerous logistical
problems for us, formatted tables, fixed idiosyncratic problems in our
word processing, and been immensely helpful in getting this manuscript
out the door. Finally, we very much appreciate the financial support that
the Spencer Foundation provided for the research that contributed to
this book.
It goes without saying that we are also indebted to the members of our
production team at Oxford University Press in New York City. We are
particularly grateful to Joan Bossert, Editorial Director, who was receptive
to our proposal and directed us to our editor, Abby Gross. We have also
The call for better empirical evidence upon which to base sound educa-
tional policy decisions has a long history, one that is particularly well
documented in the United States. In a speech given to the National
Education Association (NEA) in 1913, Paul Hanus—a Harvard professor,
and later the first dean of the Harvard Graduate School of Education—
argued that “the only way to combat successfully mistaken common-sense
as applied to educational affairs is to meet it with uncommon-sense in the
same field—with technical information the validity of which is indisputable”
(Hanus, 1920, p. 12). For Hanus, this meant that systematic research must
be conducted and its findings applied. In his words, “We are no longer
disputing whether education has a scientific basis; we are trying to find
that basis.” In his lengthy speech to the NEA, Hanus identified a number
of school policy decisions that he believed should be based on scientific
evidence. These included finding an “adequate and appropriate means of
determining the qualifications of well-trained and otherwise satisfactory
workers for the educational staff . . .,” and formulating “courses of study
. . . together with suggestions as to methods of teaching.” Although edu-
cational policymakers today would use somewhat different terms in
framing such questions, these same substantive concerns remain pressing
in countries around the world: How do we attract and retain skilled teach-
ers? What are the most important skills for students to acquire? What are
the most effective pedagogies for teaching these skills?
For educational researchers in Hanus’s time, and for many years there-
after, “carrying out scientific research” meant implementing the ideas of
scientific management that had been developed by Frederick W. Taylor
and laid out in his 1911 book Principles of Scientific Management. Taylor’s
central thesis was that experts could uncover the single “best” way to do a
particular job by conducting “time and motion” studies. Then, the task of
management was to provide the appropriate tools, and create training,
incentives, and monitoring systems to ensure that workers adopted and
followed the prescribed methods.
Although Taylor was careful not to apply his methods to any process as
complicated as education, many educational researchers were less cau-
tious. One of the most influential proponents of applying Taylor’s system
of scientific management to education was Frank Spaulding, who earned
a doctorate from the University of Leipzig, Germany, in 1894, served as
superintendent of several U.S. school districts during the first two decades
of the twentieth century, and, in 1920, became head of Yale University’s
newly formed Department of Education. Speaking at the same meeting
of the NEA at which Hanus gave his address, Spaulding described three
essentials for applying scientific management to education. He stipulated
that we must: (a) measure results, (b) compare the conditions and meth-
ods under which results are secured, and (c) adopt consistently the
conditions and methods that produce the best results (Callahan, 1962,
pp. 65–68).
Most educators today would agree with these essentials, even though
many would object to the “Taylorist” ideas that underlie them. However,
it was in applying these essentials to education that controversy arose.
The “results” that Spaulding used in his research included “the percent-
age of children of each year of age in the school district that the school
When the NIE’s research programs did not produce analogous visible
successes for education, it was deemed a failure. Few of its advocates had
appreciated how difficult it would be to answer questions posed by policy-
makers and parents about the effective use of educational resources.
Yet another part of the explanation for the demise of the NIE, and the
low funding levels of its successor, the U.S. Department of Education’s
Office of Educational Research and Improvement (OERI), was the wide-
spread perception that educational research was of relatively low quality.
A common indictment was that educational researchers did not take
advantage of new methodological advances in the social sciences, particu-
larly in the application of innovative strategies for making causal inferences.
In an attempt to respond to the concern about the low quality of
educational research, the U.S. Congress established the Institute of
Education Sciences (IES) in 2002, with a mandate to pursue rigorous
“scientific research” in education. One indication of the energy with
which the IES has pursued this mandate is that, in its first six years of
operation, it funded more than 100 randomized field trials of the effec-
tiveness of educational interventions.2 As we explain in Chapter 4, the
randomized experiment is the “gold-standard” design for research that
aims to make unbiased causal inferences.
Although the quest for causal evidence about the consequences of par-
ticular educational policies is particularly well documented in the United
States, researchers in many countries have conducted important studies
that have both broken new ground methodologically and raised new sub-
stantive questions. We illustrate with two examples. Ernesto Schiefelbein
and Joseph Farrell (1982) conducted a remarkable longitudinal study
during the 1970s of the transition of Chilean adolescents through school
and into early adulthood. The authors collected data periodically on a
cohort of students as they moved from grade 8 (the end of primary school)
through their subsequent schooling (which, of course, differed among
individuals) and into the labor market or into the university. This study,
Eight Years of Their Lives, was a remarkable tour de force for its time. It
demonstrated that it was possible, even in a developing country that was
experiencing extraordinary political turmoil, to collect data on the same
2. See Whitehurst (2008a; 2008b). We would like to thank Russ Whitehurst for explaining
to us which IES-funded research projects were designed as randomized field trials.
individuals over an extended period of time, and that these data could pro-
vide insights not possible from analyses of cross-sectional data. Substantively,
the study documented the important role that the formal education system
in Chile played in sorting students on the basis of their socioeconomic
status. This evidence provided the basis for considerable debate in Chile
about the design of publicly funded education in the years after democracy
returned to the country in 1989 (McEwan, Urquiola, & Vegas, 2008).
The book Fifteen Thousand Hours, by Michael Rutter (1979), describes
another pioneering longitudinal study. The research team followed stu-
dents in 12 secondary schools in inner-city London over a three-year
period from 1971 through 1974, and documented that students attending
some secondary schools achieved better outcomes, on average, than those
attending other schools. One methodological contribution of the study
was that it measured several different types of student outcomes, includ-
ing delinquency, performance on curriculum-based examinations, and
employment one year after leaving school. A second was the attention
paid to collecting information on variables other than resource levels. In
particular, the researchers documented that characteristics of schools as
social organizations—including the use of rewards and penalties, the ways
teachers taught particular material, and expectations that faculty had of
students for active participation—were associated with differences in aver-
age student outcomes.
A close reading of the studies by Schiefelbein and Farrell, and by Rutter
and his colleagues, shows that both sets of researchers were acutely aware
of the difficulty of making causal inferences, even with the rich, longitudi-
nal data they had collected. For example, Schiefelbein and Farrell wrote:
“It is important to reemphasize that this study has not been designed as a
hypothesis-testing exercise. Our approach has consistently been explor-
atory and heuristic. And necessarily so” (p. 35). In their concluding
chapter, Rutter and his colleagues wrote: “The total pattern of findings
indicates the strong probability that the associations between school pro-
cesses and outcome reflect in part a causal process” (p. 179). Why were
these talented researchers, working with such rich data, not able to make
definitive causal statements about the answers to critical questions of edu-
cational policy? What does it take to make defensible causal inferences?
We address these questions in the chapters that follow.
Notice that all of the educational policy questions listed here concern
the impact of a particular action on one or more outcomes. For example,
does the provision of financial aid affect families’ decisions to send a
child to secondary school? This is a distinctive characteristic of causal
questions, and learning to answer such questions is the topic of this book.
In our work, we distinguish such causal questions from descriptive ques-
tions, such as whether the gap between the average reading achievement
of black students and that of white students closed during the 1980s.
Although there are often significant challenges to answering descriptive
questions well, these challenges are typically less difficult than the chal-
lenges you will face when addressing causal questions.
We have written this book not only for those who would like to conduct
causal research in education and the social sciences, but also for those
who want to interpret the results of such causal research appropriately
and understand how the results can inform policy decisions. In present-
ing these new designs and methods, we assume that you have a solid
background in quantitative methods, that you are familiar with the notion
of statistical inference, and that you are comfortable with statistical tech-
niques up to, and including, ordinary least-squares (OLS) regression
analysis. However, as an interested reader can see by skimming ahead in
the text, ours is not a highly technical book. To the contrary, our empha-
sis is not on mathematics, but on providing intuitive explanations of key
ideas and procedures. We believe that illustrating our technical explana-
tions with data from exemplary research studies makes the book widely
accessible.
We anticipate that you will obtain several immediate benefits from
reading our book carefully. First, you will learn how alternative research
designs for making causal inferences function, and you will come to
understand the strengths and limitations of each innovative approach.
In addition, you will learn how to interpret the results of studies that use
these research designs and analytic methods, and will come to understand
that careful interpretation of their findings, although often not obvious,
is critical to making the research useful in the policy process.
1. For many references to the evidence of the role of education in fostering economic
growth, see Hanushek and Woessman (2008). For evidence on the especially valuable
role of education in increasing productivity in environments experiencing technologi-
cal change, see Jamison & Lau (1982). The classic reference on the reasons why
employers are typically willing to pay for specific training, but not general training, is
Becker (1964).
What Is Theory?
3. In Chapter 10, we describe one important paper in this line of research, written by
Janet Currie and Enrico Moretti (2003).
4. For an accessible discussion of human capital and market signaling models, see Weiss
(1995).
5. For an introduction to Bourdieu’s theory, see Lane (2000).
Theory in Education
6. The ideas we describe in this paragraph are taken from Shavelson and Towne (eds.,
2002).
Voucher Theory
7. See Hoxby (2003), and Nechyba (2003) for discussions of the importance of general
equilibrium models for understanding the consequences of particular voucher
plans. For examples of such equilibrium models, see Nechyba (2003, pp. 387–414);
Epple and Romano (1998, pp. 33–62); Hoxby (2001); Fernandez and Rogerson (2003,
pp. 195–226).
8. Concerned with the sorting by socioeconomic status that took place under its equal-
value voucher system, the Chilean government modified its national voucher system in
2008. Under the new system, the vouchers distributed to children from the one-third
poorest families in the country (called Priority students) are worth 50% more than
those distributed to more affluent families. Private schools that receive higher-valued
vouchers are prohibited from charging Priority students any tuition or fees in excess of
the value of their voucher.
9. See, for example, Hoxby (2001); Fernandez and Rogerson (2003).
In this chapter, we have chosen our examples primarily from the field of
economics because it is the social science discipline we know best. However,
theories drawn from other social-science disciplines can also inform the
design of causal educational research. Examples include theories of social
capital drawn from sociology and theories of child development from psy-
chology. The choice of a theoretical framework within which to embed
the design of quantitative research depends on the nature of the causal
question being asked and the knowledge base of the investigators.
We do want to emphasize, however, the distinction between social-
science theory and statistical theory. In recent decades, important
advances have been made in statistical theory that have led to new research
designs and analytic methods, many of which are presented in later chap-
ters of this book. New resampling methods for conducting hypothesis
tests and methods for estimating statistical power when individuals are
clustered in classrooms and/or schools provide two examples. The point
we want to emphasize here is that statistical theory, and the methods
stemming from advances in statistical theory, are methodological comple-
ments to substantive social-science theory, not substitutes.
For readers interested in learning more about the role of theory in inform-
ing causal research in general, and causal research in education in
One of the first actions that Grover “Russ” Whitehurst, the first director
of the Institute of Education Sciences, took after assuming office in 2002
was to commission a survey of educational practitioners and policymak-
ers in order to learn what they wanted from educational research.1 Not
surprisingly, the survey results showed that the priorities of educators
depended on their responsibilities. Superintendents and other local edu-
cation officials were most interested in evidence about particular curricula
and instructional techniques that were effective in increasing student
achievement. State-level policymakers wanted to learn about the conse-
quences of standards-based educational reforms and the impact of
particular school intervention strategies. Congressional staff wanted to
know about the effectiveness of different strategies for enhancing teacher
quality. Educators at all levels wanted to know about the effect of differ-
ences in resource levels, such as class sizes, in determining students’
achievement.
Whereas the priorities of educators depended on their responsibilities,
the striking commonality in their responses was that practitioners and
policymakers—at all levels—wanted to know the answers to questions about
cause and effect. They wanted to know if A caused B, and wanted IES to com-
mission research that would provide them with answers. In this chapter,
we discuss the conditions that must be satisfied for such causal questions
to be addressed effectively in education, and we introduce some of the
major concepts and terms that we use throughout the rest of the book.
Before we begin our discussion of how best to address the causal questions
that are so central to educators, we begin with a brief description of the
classical elements of good research design in the social sciences and edu-
cation. We do this because designing causal research requires us to pay
attention to the central tenets of all good research. Then, within this larger
domain, causal research must satisfy an additional set of constraints, and
it is these that form the central topic for the rest of our book. We used the
expression “strive for” in the title of this section because it is typically dif-
ficult to satisfy all of the conditions we describe. We use examples throughout
the book to clarify the consequences of not satisfying particular elements
of the classical description of effective research design. As you will learn,
violation of some of the tenets of appropriate design makes it impossible
to make a defensible causal inference about the consequences of an edu-
cational policy or intervention. Violation of other tenets does not threaten
the ability to make a causal inference, but does limit the ability to deter-
mine to whom the results of the study apply. We will return to these issues.
However, we begin by stating these elements of good research design.
First, in any high-quality research, whether it be purely descriptive or
able to support causal inference, it is critically important that it begin with
a clear statement of the research question that will drive the project and
the theory that will frame the effort. These two key elements ultimately
drive every aspect of the research design, as they provide the motivation
and the rationale for every design decision that you ultimately make. They
have also been the topics of our first two chapters and, as we have argued,
they are completely intertwined. As theories are refined, it becomes pos-
sible to pose more complex questions, and these, in their turn, inform
refinements of the theory. Light, Singer, and Willett (1990) referred to
this as the “wheel of science.”
An explicit statement of the research question makes it possible to
define the population of interest clearly and unambiguously. This is critical
in any research. If we do not do it, we cannot build a suitable sampling
frame, nor can we know to whom we can generalize the findings of our
research. In addition, it pays to be explicit, rather than vague, about the
nature of the population of interest. For example, in studying the impact
of class size on children’s reading skills, it might make sense to define the
population of interest to be “all children without special needs in first-
grade classrooms in urban public schools in the United States,” rather
than just “children.” Defining the population clearly enables readers who
have a particular concern, such as the impact of class size on the learning
of autistic children, to judge the relevance of our results to their concern.
[Figure residue: a diagram depicting randomized experiments and quasi-experiments within the larger set of all experiments.]
“the age of regression.” Seminal studies published in the 1980s threw cold
water on this “control for everything” strategy by demonstrating that
regression analyses that contained a very rich set of covariates did not
reproduce consistently the results of experiments in which individuals
were assigned randomly to different experimental conditions.2
A second response, especially common among developmental psychol-
ogists, was to accept that analysis of observational data could not support
causal inference and to simply avoid using causal language in both the
framing of research questions and in the interpretation of research results.
For example, researchers would investigate whether children placed in
center-based child care had better subsequent performance on cognitive
tests than did observationally similar children in family-based child care,
and would simply caution that causal attribution was not justified on the
basis of their findings. In our view, there are at least two problems with
this approach. First, the cautions presented in the “Methods” and “Results”
sections of research papers were often forgotten in the “Discussion” sec-
tion, where researchers would suggest policy implications that depended
on an unsupported causal interpretation of their findings. Second, their
use of noncausal language meant that these researchers were not accus-
tomed to considering explicitly alternative explanations for the statistical
relationships they observed.
Fortunately, in more recent years, social scientists have developed a
variety of new research designs and analytic strategies that offer greater
promise for addressing causal questions about the impact of educational
policies. Many of these new approaches also make use of standard tech-
niques of multiple regression analysis, but apply them in new ways.
Explaining these strategies, and illustrating their use, is a central goal of
this book.
2. See Angrist & Pischke (2009, pp. 86–91) for a discussion of this evidence.
a treatment (e.g., “small” class size) and a “control” (e.g., “normal” class
size) condition, resetting all internal and external conditions to their
identical initial values before participants experienced either condition.
So, you might draw a representative sample of participants from the pop-
ulation, administer the treatment to them, and measure their outcome
values afterward. Then, to learn what the outcomes would be under the
counterfactual condition, you would need to transport these same par-
ticipants back to a time before your research was conducted, erase
all their experiences of the treatment and the outcome measurement
from their memories, and measure their values of the outcome again,
after their lives had transpired under the control condition. If this were
possible, you could argue convincingly that any difference in each partici-
pant’s outcome values between the two conditions must be due only to
their experiences of the treatment.
Then, because you possessed values of the outcome for each individual
obtained under both “factual” and “counterfactual” conditions, you would
be able to estimate the effect of the treatment for each participant. We
call this the individual treatment effect (ITE). You would do this simply by
subtracting the value of the outcome obtained under the counterfactual
condition from the value obtained under the treated condition. In this
imaginary world, you could then average these estimated ITEs across all
members of the sample to obtain the estimated average treatment effect
(ATE) for the entire group. Finally, with a statistical technique like a simple
paired t-test, you could seek to reject the null hypothesis that the popula-
tion mean difference in participants’ outcomes between the treated and
counterfactual conditions was zero. On its rejection, you could use your
estimate of the ATE as an unbiased estimate of the causal effect of the
treatment in the population from which you had sampled the participants.
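To make this thought experiment concrete, here is a minimal sketch in Python using simulated data in which—contrary to real life—both potential outcomes are observed for every participant. All names and numbers (sample size, effect size, noise levels) are invented for illustration and are not drawn from any study discussed in this book.

```python
# Illustrative sketch only: in this imaginary world we can observe BOTH potential
# outcomes for every participant, so the ITE and ATE are directly computable.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 500

y_control = rng.normal(loc=50, scale=10, size=n)             # outcome under the control condition
y_treated = y_control + rng.normal(loc=3, scale=5, size=n)   # outcome under the treatment condition

ite = y_treated - y_control          # individual treatment effects
ate = ite.mean()                     # average treatment effect in the sample

# Paired t-test of H0: the population mean difference between conditions is zero
t_stat, p_value = stats.ttest_rel(y_treated, y_control)
print(f"ATE = {ate:.2f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```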
Since time travel and selective memory erasure lie in the realm of imag-
ination rather than research, in practice you always have a “missing data”
problem. As we illustrate in Figure 3.2, you never actually know the value
of the outcome for any individual under both the treatment and control
conditions. Instead, for members of the treatment group, you are missing
the value of the outcome under the control condition, and for members
of the control group, you are missing the value of the outcome under the
treatment condition. Consequently, you can no longer estimate the indi-
vidual treatment effects and average them up to obtain an estimate of the
average treatment effect.
So, you must devise an alternative, practical strategy for estimating the
average treatment effect. The reason that this is so difficult to do in prac-
tice is that actors in the educational system typically care a lot about which
experimental units (whether they be students or teachers or schools) are
[Figure 3.2, The challenge of the counterfactual: for members of the treatment group, the outcome is known under the treatment condition and missing under the control condition; for members of the control group, the outcome is missing under the treatment condition and known under the control condition.]
that the external agent has exercised his or her opportunity to assign par-
ticipants in a way that supports causal inference directly. One very simple
and useful way that such exogenous variation in experimental conditions
can be created is for the investigator to assign participants randomly to
treatments. Such an approach was taken in the Tennessee Student/
Teacher Achievement Ratio (STAR) experiment (Krueger, 1999).
In the mid-1980s, the Tennessee state legislature appropriated funding
for a randomized experiment to evaluate the causal impact of class-size
reduction on the reading and mathematics achievement of children in
the primary grades. More than 11,000 students and 1,300 teachers in
79 public schools throughout the state participated in the experiment,
which became known as Project STAR. In each participating school,
children entering kindergarten in the fall of 1985 were assigned randomly
by investigators to one of three types of classes: (a) a small class with 13 to
17 children, (b) a class of regular size with 22 to 25 students, or (c) a class
of regular size staffed by both a teacher and a full-time teacher’s aide.
Teachers in each school were also assigned randomly to classrooms.
Finally, the research design called for students to remain in their origi-
nally designated class type through third grade.
A major theme of our book is that some element of exogeneity in the
assignment of units to a treatment is necessary in order to make causal
inferences about the effects of that treatment. Expressed in the formal
terms used by statisticians and quantitative social scientists, a source of
exogenous assignment of units to treatments is necessary to identify the
causal impact of the treatment. So, when a social scientist asks what
identification strategy was used in a particular study, the question is about
the source of the exogeneity in the assignment of units to treatments.
In subsequent chapters, we show that randomization is not the only way
of obtaining useful exogenous variation in treatment status and conse-
quently of identifying the causal impact of a treatment. Sometimes, it is
possible to do so with data from a quasi-experiment. Sometimes, it is even
possible to do so with data from an observational study, using a statistical
method known as instrumental-variables estimation that we introduce in
Chapter 10.
The Tennessee STAR experiment, which the eminent Harvard statisti-
cian Frederick Mosteller called “one of the most important educational
investigations ever carried out” (Mosteller 1995, p. 113), illustrates the
difficulties in satisfying all of the conditions for good research that we
described earlier in this chapter. After the Tennessee legislature autho-
rized the experiment in 1985, the State Commissioner of Education
invited all public school systems and elementary schools in the state to
apply to participate. Approximately 180 schools did so, 100 of which were
sufficiently large to satisfy the design criterion of having three classes at
each grade level from kindergarten through grade 3. The research team
then chose 79 schools to participate.
The process of selecting schools to participate in the STAR experiment
illustrates some of the compromises with best research practice that are
sometimes necessary in even extremely well-planned experiments. First,
the research sample of schools was chosen from the set of schools that
volunteered to participate. It is possible that the schools that volunteered
differed from those that did not in dimensions such as the quality of lead-
ership. Second, only quite large schools met the design requirements and
consequently the STAR experiment provided no evidence about the
impact of class size on student achievement in small schools. Third,
although the research team was careful to include in the research sample
urban, suburban, and rural schools, as the enabling legislation mandated,
it did not randomly select 79 schools from the population of 100 schools
that volunteered and met the size criteria (Folger, 1989). A consequence
of the sample selection process is that the definition of the population of
schools to which the results of the experiment could be generalized is not
completely clear. The most that can be said is that the results pertain to
large elementary schools in Tennessee that volunteered to participate in
the class-size experiment. It is important to understand that the lack of
clarity about the population from which the sample is taken is a matter of
external validity. The sampling strategy did not threaten the internal valid-
ity of the experiment because students and teachers within participating
schools were randomized to treatment conditions.
The STAR experiment also encountered challenges to internal validity.
Even though children in participating schools had originally been ran-
domly and exogenously assigned to classes of different sizes, some parents
were successful in switching their children from a regular-size class to
a small class at the start of the second school year. This endogenous
manipulation had the potential to violate the principal assumption that
underpinned the randomized experiment, namely, that the average
achievement of the students in regular-size classes provided a compelling
estimate of what the average achievement of the students placed in the
small classes would have been in the absence of the intervention. The
actions of these parents therefore posed a threat to the internal validity of
the causal inferences made from data collected in the STAR experiment
about the impact of a second year of placement in a small class.
This term, threat to internal validity, is important in the annals of causal
research and was one of four types of validity threats that Donald Campbell
For readers who wish to follow up on the ideas we have raised in this chapter,
we recommend Shadish, Cook, and Campbell’s comprehensive book (2002)
on the design of research, Experimental and Quasi-Experimental Designs,
and Morgan and Winship’s insightful book (2007), Counterfactuals and
Causal Inference.
4
Investigator-Designed Randomized Experiments
We use data from the SCSF initiative—which we refer to as the New York
Scholarship Program (NYSP)—to illustrate ways of working with data from
randomized experiments.
In the next section, we present a framework for the design of experi-
mental research, often referred to as the potential outcomes framework.
Then, in the following section, we describe some simple statistical meth-
ods for analyzing the data that are generated in a randomized experiment,
and we illustrate these methods using the NYSP example. We draw your
attention on two key statistical properties of an estimator of experimental
effect—the properties of bias and precision— that have great relevance for
the design of research and subsequent data analysis. Our presentation of
basic experimental research in this chapter sets the stage for more com-
plex methodological developments that we describe later in the book.
2. Our brief introduction to Rubin’s Potential Outcomes Framework is drawn from the
excellent review article by Imbens and Wooldridge (2009).
cannot depend on the group to which particular other children have been
assigned. Peer-group effects constitute one possible violation of SUTVA
in the evaluation of educational interventions. For example, if the impact
of voucher receipt on the reading achievement of child i depended on
whether the child’s next-door neighbor and closest friend also received a
voucher (and could then choose to move with him or her from a public to
a private school), then this would violate SUTVA. In Chapter 7, we discuss
strategies for dealing with the peer-group problem in evaluating the
impacts of educational interventions.
We turn now to the practical steps involved in implementing a two-
group randomized experiment. Figure 4.1 illustrates these steps. First, a
sensible number of participants are randomly sampled from a well-defined
population.3 Second, the sampled participants are randomly assigned to
experimental conditions. Here, in the case of a two-group experiment,
each is assigned to either the treatment or control condition. Third, a
well-defined intervention is implemented faithfully among participants in
the treatment group, but not among participants in the control group,
and all other conditions remain identical. Fourth, the value of an out-
come is measured for every participant, and its sample average estimated
separately for participants in the treatment and control groups. Fifth, the
sample difference between the outcome averages in the treatment and
control groups is computed, providing an estimate of the ATE. Standard
statistical methods—for instance, a two-group t-test—are then used to test
the null hypothesis of “no treatment/control group differences, in the
population.” If we reject the null hypothesis, then we can conclude that
the treatment has had a causal impact on the outcome. Irrespective of the
outcome of the hypothesis test, the ATE is an unbiased estimate of the
impact of the treatment in the population from which the sample was
drawn.
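As a hedged illustration of these five steps—not an analysis of any real experiment—the following Python sketch simulates a two-group randomized experiment, estimates the ATE as the difference in sample means, and tests it with a pooled two-group t-test. All values are invented.

```python
# Sketch of the five steps with simulated data: sample, randomize, treat,
# measure, and compare group means with a pooled two-group t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 500
true_effect = 3.0

assigned_to_treatment = rng.permutation(np.repeat([1, 0], n // 2))   # step 2: random assignment
potential_control = rng.normal(loc=50, scale=10, size=n)
outcome = potential_control + true_effect * assigned_to_treatment    # steps 3-4: treat and measure

treated = outcome[assigned_to_treatment == 1]
control = outcome[assigned_to_treatment == 0]

ate_hat = treated.mean() - control.mean()                            # step 5: difference in means
t_stat, p_value = stats.ttest_ind(treated, control, equal_var=True)  # pooled t-test
print(f"Estimated ATE = {ate_hat:.2f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```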
The reason this process leads to an unbiased estimate of the treatment
effect is that, when the assignment of participants to experimental condi-
tions is faithfully random, all factors other than treatment status will tend
to be distributed equally between participants in the treatment and con-
trol groups. This will be true not only for observed characteristics of
individuals in the two groups, such as gender, race, and age, but also for
any unobserved characteristics, such as motivation. As a consequence, all
rival explanations for any treatment/control differences in the outcome
4. From a purely technical perspective, it does not matter in which order these two ran-
domizations occur. For instance, for the purposes of this argument, it would be equally
effective to label each population member at random as a potential “treatment” or
“control” group member and then sample randomly from each of these newly labeled
subpopulations into the treatment and control groups. The results would be identical.
Of course, this “labeling the population” approach is considerably more impractical.
However, conceiving of the process of random selection and assignment in this way
does provide a better sense of how members of the treatment and control groups can
be equal in expectation—that is, equal, on average, in the population.
5. Using the term made popular by Milton Friedman (and discussed in Chapter 2), the
“offer of a scholarship” is often referred to as the “receipt of a voucher.”
6. Notice that the sizes of the treatment and control groups do not have to be identical.
7. To reduce attrition of participants from the study, control group families were offered
modest payments to induce them to complete the same tests and surveys that voucher
recipients completed.
[Figure 4.1 residue: the diagram begins with a defined population from which a representative sample is randomly selected.]
from the eligible families that did not apply for private-school tuition
vouchers.
It is also critical to understand that the primary question that the NYSP
evaluation addressed concerned the impact of the family’s receipt of a
tuition voucher on the student’s ultimate academic achievement. This is
an important topic because, as we discussed in Chapter 2, many govern-
ments provide families with vouchers to help pay their children’s tuitions
at private schools. However, it is important to distinguish the question of
whether voucher receipt impacts student achievement from the question
of whether attendance at a private school instead of a public school
impacts student achievement. In Chapter 11, we explain how instrumental-
variables estimation can provide a method of using the lottery-outcome
information from random-assignment experiments to address this second
research question. However, the analysis described in this chapter
addresses the first question, whether the randomized receipt of a voucher
had a causal impact on children’s subsequent educational achievement.
9. We thank Don Lara, the director of administration at Mathematica Policy Research, for
providing the NYSP data.
Table 4.1 Alternative analyses of the impact of voucher receipt (VOUCHER) on
third-grade academic achievement (POST_ACH) for a subsample of 521 African-
American children randomly assigned to either a “voucher” treatment or a “no voucher”
control group (n = 521)
achievement tests prior to entering the NYSP experiment and at the end
of their third year of participation. Of these, 291 were participants in the
“voucher receipt” group and 230 in the “no voucher” group. Following
the procedure adopted by Howell et al. (2002), we have averaged each
child’s national percentile scores on the reading and mathematics tests to
obtain variables measuring composite academic achievement on entry
into the study (which we refer to subsequently as covariate PRE_ACH)
and after the third year of the experiment (which we refer to subsequently
as outcome POST_ACH).
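The composite construction is simple averaging. The short sketch below shows the idea with a few made-up scores; the column names are hypothetical and are not those used in the NYSP data files.

```python
# A minimal sketch of the composite-score construction described above,
# using invented scores and hypothetical column names.
import pandas as pd

scores = pd.DataFrame({
    "read_pct_pre": [34, 58, 41], "math_pct_pre": [40, 62, 37],
    "read_pct_post": [45, 60, 50], "math_pct_post": [47, 66, 44],
})
scores["PRE_ACH"] = scores[["read_pct_pre", "math_pct_pre"]].mean(axis=1)    # entry composite
scores["POST_ACH"] = scores[["read_pct_post", "math_pct_post"]].mean(axis=1) # year-3 composite
print(scores[["PRE_ACH", "POST_ACH"]])
```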
10. Note that this is a pooled t-test, in which we assume that the population variance of
the outcome is identical in the treatment and control groups.
11. As usual, each person is assumed to draw their residual randomly and independently from an identical normal distribution with mean zero and homoscedastic variance, $\sigma_{\varepsilon}^{2}$.
12. Although we distinguish two experimental conditions—“voucher” versus “no voucher”—
as usual, we need only a single dichotomous predictor to separate these groups. We
could have created two dummy predictors to represent them—for instance, (a) VOUCHER,
coded 1 when a participant was in the treatment group, and (b) NOVOUCHER, coded
1 when a participant was in the control group. However, as is well known, it is unnec-
essary to include both of these question predictors in the regression model, because
membership in the treatment and control groups is mutually exclusive, which implies
that predictors VOUCHER and NOVOUCHER are perfectly collinear. Therefore, one
can be omitted from the model, thereby defining a “reference category.” In the model
in Equation 4.1, we have omitted the predictor that identifies members of the control
group.
13. Technically, it is the standard deviation of the estimates obtained in infinite re-sampling
from a population in which the null hypothesis (that the target parameter is zero)
is true.
14. Today, using high-speed computing, there are ways of estimating standard errors,
such as the jackknife (Miller, 1974) and the bootstrap (Efron & Tibshirani, 1998), which
are “nonparametric” and do not make strong distributional assumptions. Instead,
they use a process of “resampling from the sample,” which matches our hypothetical
“thought experiment,” to obtain many estimates of the parameter of interest and then
estimate the standard deviation of these multiple estimates to estimate the standard
error. When you apply these techniques, you replace the standard OLS parametric
assumptions with the raw power of computers and use that power to re-sample, not
from the population itself, but from the sample you have already drawn from that
population! This idea is founded on the notion that a random sample drawn from a
random sample of a population is also a random sample from the population itself.
15. There are other well-known estimators of the regression slope, including those that
minimize the mean absolute deviation of the data points from the trend line, which has
been generalized to provide the methods of quantile regression, and regression based
on ranks.
methods as the technique for addressing the research questions but, more
importantly, to make sure that all its assumptions are met. It is this last
point that is critical for the development in the rest of our book.
But, what are these critical assumptions and how do they impact the
bias and precision of the OLS estimator? In specifying a regression
model—like the one in Equation 4.1—you make assumptions about both
the structural and stochastic parts of the model. In the structural compo-
nent of the model (which contains the intercept, the predictors, and the
slope parameters), you assume that the hypothesized relationship between
outcome and predictor is linear—that unit differences in the predictor
correspond to equal differences in the outcome at every level of the pre-
dictor. If you suspect that this assumption may not be valid, then you can
usually seek transformations of the outcome or the predictor to achieve
linearity. In our NYSP example, we are not concerned about the linearity
assumption as our principal predictor is a dichotomy that describes
voucher receipt. This means that there is only a single unit difference in
the predictor with which we are concerned, and that is the difference
between assigning a child to the control or the treatment group. In cases
in which continuous predictors are included in the regression model, it is
more pressing to make sure that the linearity assumption is met.
Notice the presence of the residual in the hypothesized regression
model in Equation 4.1. These residuals, by their presence in the model,
are also statements about the population, but the statements are about
stochastic—not structural—properties. They stipulate that we are willing
to believe, in the population, that some unknown part of the value of the
outcome for each individual is not directly attributable to the effects of
predictors that we have included in the model—in our case, the single
predictor, VOUCHER. Then, as discussed earlier, to proceed with statisti-
cal inference in the context of a single sample of data, we must adopt a set
of viable assumptions about the population distribution of the residuals.
Under the OLS fitting method, for instance, we assume that residuals are
randomly and independently drawn from a distribution that has a zero
mean value and an unknown but homoscedastic (that is, constant) variance
in the population.16 Each part of this statement affects a different facet of
16. Notice that we have not stipulated that the population residuals are drawn from a
normal distribution, despite the common practice of assuming that they are normally
distributed when standard regression analysis is conducted. We have taken this subtle
step because the actual algebraic formulation of the OLS estimate of the regression
slope, and its unbiasedness property, derive only from a fitting algorithm that minimizes
the sum-of-squared residuals, regardless of their distribution. It is the subsequent
provision of ancillary inferential statistics—the critical values and p-values of the associated
small-sample statistical tests—that depend upon the normal theory assumption.
Finally, it is also worth noting that, under the normal theory assumption, an OLS
estimate of a regression slope is identical to the maximum-likelihood estimate (MLE).
Typically, in standard regression analysis, such hairs are not split and the assumption
that population residuals are normally distributed is often bundled immediately into
the standard expression of the assumptions.
the OLS estimation process. For instance, that the residuals are “randomly
and independently drawn” is the make-or-break assumption for the
unbiasedness property of the OLS estimator of regression slope. In par-
ticular, for this assumption to be true, the regression residuals must be
completely unrelated to any predictors that are included in the regression
model. If you violate this “randomness” assumption—for instance, if the
values of the residuals in Equation 4.1 are correlated with the values of
predictor, VOUCHER, for some reason—then the OLS estimate of regres-
sion parameter, β1, will be a biased estimate of the population average
treatment effect.
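The consequence of violating this randomness assumption can be seen in a small simulation. The sketch below uses invented data and a hypothetical unobserved “motivation” variable (not a variable in the NYSP data): when assignment is random, OLS recovers the true effect; when the residual is correlated with the treatment indicator, the estimate is biased.

```python
# Hedged illustration with simulated data: bias appears when the residual is
# correlated with the predictor, and disappears under random assignment.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n, true_effect = 100_000, 4.0

# Random (exogenous) assignment: residual unrelated to the predictor
voucher = rng.binomial(1, 0.5, size=n)
y = 20 + true_effect * voucher + rng.normal(0, 15, size=n)
b_random = sm.OLS(y, sm.add_constant(voucher)).fit().params[1]

# Endogenous assignment: more 'motivated' families are more likely to hold a
# voucher, and motivation also raises achievement
motivation = rng.normal(0, 1, size=n)
voucher_endog = (motivation + rng.normal(0, 1, size=n) > 0).astype(int)
y_endog = 20 + true_effect * voucher_endog + 10 * motivation + rng.normal(0, 15, size=n)
b_endog = sm.OLS(y_endog, sm.add_constant(voucher_endog)).fit().params[1]

print(f"true effect = {true_effect}, "
      f"random-assignment slope = {b_random:.2f}, "
      f"endogenous-assignment slope = {b_endog:.2f}")
```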
Let us return to the second property of an OLS estimator that we have
deemed important—the property of precision. We have stated earlier that,
providing its underlying assumptions are met, an OLS estimate is the
most precise estimate of a regression slope that can be devised from a
given set of data, and we have introduced the concept of the standard
error of the estimated slope to summarize that precision. In our presenta-
tion, we have argued that the value of the standard error of the slope
estimate depends not only on the data but also on the distributional
assumptions made about the population residuals—that they are homosce-
dastic and, ultimately, normally distributed. In fact, adopting these
assumptions and provided that the residual homoscedasticity assumption
holds, standard regression texts tell us that the estimated standard error
of the OLS-estimated VOUCHER regression slope in Equation 4.1 is
$$
\text{Standard Error of } b_{1} \;=\; \sqrt{\frac{\hat{\sigma}_{\varepsilon}^{2}}{\sum_{i=1}^{n}\bigl(VOUCHER_{i}-\overline{VOUCHER}\bigr)^{2}}} \qquad (4.2)
$$
where, on the right-hand side of the equation and within the square-root
sign, the numerator contains the estimated variance of the residuals, $\hat{\sigma}_{\varepsilon}^{2}$,
and the denominator contains the sum of squared deviations of the values
of predictor VOUCHER around their sample mean, $\overline{VOUCHER}$. A simi-
lar expression for standard error could be crafted if there were multiple
predictors present in the model. For our purposes, however, it is sufficient
to focus on the single predictor case and the expression in Equation 4.2.
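As a check on Equation 4.2, the following sketch—using simulated data, not the NYSP sample—computes the standard error by the formula and compares it with the value reported by a standard OLS routine; the two agree exactly in the single-predictor case.

```python
# Verify Equation 4.2 on invented data: hand-computed SE vs. the SE that
# statsmodels reports for the slope of a single dichotomous predictor.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 521
voucher = rng.binomial(1, 0.56, size=n)
post_ach = 20 + 4.9 * voucher + rng.normal(0, 19, size=n)

fit = sm.OLS(post_ach, sm.add_constant(voucher)).fit()

sigma2_hat = fit.scale                                  # estimated residual variance
ssx = np.sum((voucher - voucher.mean()) ** 2)           # sum of squared deviations of VOUCHER
se_by_formula = np.sqrt(sigma2_hat / ssx)               # Equation 4.2

print(f"SE from Equation 4.2: {se_by_formula:.4f}")
print(f"SE reported by OLS : {fit.bse[1]:.4f}")
```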
17. The t-statistic is the ratio of the slope estimate to its standard error.
We are not concerned here with the differences between these two
estimates of the causal effect. Because the assignment of vouchers to fam-
ilies and children was random and exogenous, both are unbiased estimates
of the average treatment effect in the population.18 The question we are
asking, instead, is: What advantage was there to including the covariate in
the regression analysis in the bottom panel, if we could already make an
unbiased causal interpretation on the basis of the middle panel? The
answer concerns precision. Notice that the inclusion of the covariate in
the regression analysis in the bottom panel has reduced the magnitude of
the residual variance by 25%, from a value of 19.072 in the middle panel
to 14.373 in the bottom panel.19 This has occurred because the prior test
score is an important predictor of third-grade academic achievement,
and so its inclusion has predicted additional variation in the outcome,
with a consequent reduction in the unexplained variation that is repre-
sented by the residuals.
This reduction in residual variance is reflected in a substantial reduction
in the standard error of the estimated VOUCHER regression slope (from
1.683 in the middle panel to 1.269 in the bottom panel). As a consequence,
the t-statistic associated with the VOUCHER slope rises from 2.911 to 3.23,
and we obtain a p-value indicating an even smaller probability that our
data derive from a population in which the average treatment effect is zero.20
Thus, by including the covariate—even though we did not need it to obtain
an unbiased estimate of the treatment effect—we enjoy an improvement
in statistical power, at the same sample size. This gain in power is reflected
in the reduction of the associated p-value from 0.004 to 0.001.
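The following sketch illustrates this precision argument with simulated data; the numbers are invented and are not the NYSP estimates reported in Table 4.1. Adding a strong pre-treatment covariate leaves the treatment slope unbiased but shrinks its standard error.

```python
# Sketch with invented data: adding a pre-treatment covariate reduces residual
# variation and therefore the standard error of the treatment slope.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 521
voucher = rng.binomial(1, 0.56, size=n)          # randomly assigned, independent of PRE_ACH
pre_ach = rng.normal(25, 15, size=n)
post_ach = 5 + 4.0 * voucher + 0.8 * pre_ach + rng.normal(0, 12, size=n)

no_covariate = sm.OLS(post_ach, sm.add_constant(voucher)).fit()
with_covariate = sm.OLS(post_ach,
                        sm.add_constant(np.column_stack([voucher, pre_ach]))).fit()

print(f"Without PRE_ACH: slope = {no_covariate.params[1]:.2f}, SE = {no_covariate.bse[1]:.3f}")
print(f"With PRE_ACH   : slope = {with_covariate.params[1]:.2f}, SE = {with_covariate.bse[1]:.3f}")
```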
It is important to keep in mind why appropriate covariates are often
included in analyses of experimental data. It is certainly not to reduce
bias. If you have randomly assigned participants to experimental condi-
tions, then your estimate of the average treatment effect will be unbiased.
If your treatment assignment was flawed and not random, then there is
little you can do to avoid bias. Regardless of how many covariates you
18. Different unbiased estimators of the same population parameter often provide
differently valued estimates of the same effect, in the same sample. This is neither
unusual, nor problematic, because each estimator is using the data to offer its own
“best guess” as to the value of the underlying population parameter. For instance, in
a symmetric distribution, the mean, median, and mode are all unbiased estimators of
the “center” of a normally distributed variable in the population. But, each weights
the elements of the sample data differently and so the values of the three estimators
are unlikely to be identical, even in the same sample of data.
19. Notice that the R2 statistic has risen correspondingly, from 0.016 to 0.442.
20. Notice that this increase in the t-statistic occurs despite a reduction in the parameter
estimate itself from 4.899 to 4.098.
include and whatever stories you tell to motivate their inclusion, you will
find it hard to convince your audience that you have removed all the
potential bias by “controlling” for these features. Rarely can you fix
by analysis what you bungled by design (Light, Singer, & Willett, 1990). The
purpose of incorporating relevant covariates into an analysis of experi-
mental data is to reduce residual variation, decrease standard errors, and
increase statistical power.
Covariates appropriate for inclusion in an analysis of experimental
data include important exogenous characteristics of individuals that
do not vary over time, such as gender and race, and variables whose values
are measured prior to random assignment. The baseline test scores
included in Howell and Peterson’s regression model fall into this second
category. It is important to keep in mind that it is inappropriate to include
as covariates variables whose values are measured after random assign-
ment has been completed because they may be endogenous. An example
of the latter would be scores on tests that students took at the end of their
first year in the NYSP experiment. These latter scores are not candidates
for inclusion as covariates in the regression model because their values
may have been affected by students’ participation in the experiment.
Their inclusion would lead to bias in the estimate of the impact of voucher
receipt on student achievement measured at the end of three years of
participation. We return to these issues throughout the rest of the book,
as they are central to our ability to employ a variety of research designs to
obtain unbiased estimates of causal effects.
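The danger of conditioning on a post-assignment variable can also be illustrated with a small, purely hypothetical simulation. In the sketch below, a first-year score (y1) is itself affected by voucher receipt; "controlling" for it distorts the estimate of the treatment's total impact on the later outcome (y3), even though assignment was random. All names and parameter values are invented for illustration.

```python
# A hypothetical simulation of the endogeneity problem: y1 is measured after random
# assignment and is itself affected by the voucher, so "controlling" for it distorts
# the estimate of the voucher's total effect on the later outcome y3.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5000
voucher = rng.integers(0, 2, n)
y1 = 4 * voucher + rng.normal(0, 10, n)                  # first-year score, post-assignment
y3 = 4 * voucher + 0.6 * y1 + rng.normal(0, 10, n)       # third-year score
df = pd.DataFrame({"voucher": voucher, "y1": y1, "y3": y3})

total = smf.ols("y3 ~ voucher", data=df).fit().params["voucher"]          # near 6.4, the total effect
adjusted = smf.ols("y3 ~ voucher + y1", data=df).fit().params["voucher"]  # near 4.0, understates it
print(total, adjusted)
```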
Challenges in Designing,
Implementing, and Learning from
Randomized Experiments
Countries around the world have struggled with the design of secondary-
school education programs. In many countries, students are given the
option of enrolling in an academic track to prepare for post-secondary
education or in a vocational track to prepare for work in a specific occupa-
tion. Critics of vocational education argue that it does not prepare
students to cope with changing labor markets and that participating in
vocational training closes off access to post-secondary education. While
conceding these limitations of conventional vocational education pro-
grams, advocates argue that the solution lies in improving vocational
programs rather than in abandoning the concept and requiring all ado-
lescents to enroll in a traditional academic track.
One response to the call for a different kind of education, especially for
students who do not thrive in conventional academic tracks, has been the
3. The following website, accessed September 10, 2009, provides a description of the his-
tory of career academies: http://www.ncacinc.com/index.php?option=com_content&
task=view&id=17&Itemid=28.
would be to include in the study students from a great many high schools
containing career academies. This would make the research extremely
expensive to conduct.
Now, consider the second option for defining the population of inter-
est. The school would conduct information sessions in which they describe
the career academy option to all ninth graders and explain that the selec-
tion of students to receive enrollment offers would be determined by
lottery from among those students who demonstrate an active interest in
enrolling by participating in an interview and completing an application.
Among students randomly assigned to the treatment group from this
population, the take-up rate would probably be much higher, perhaps as
high as 80%. Again, assuming that participation in a career academy did
result in improved outcomes, the higher take-up rate would increase
dramatically the probability of rejecting the null hypothesis.
In summary, holding constant the size of the research budget, drawing
the research sample from a population of students who express an active
interest in career-academy enrollment increases the chances of demon-
strating—by statistical analysis—that the offer of enrollment in a career
academy improves outcomes. A cost of this choice is that the results pro-
vide conclusions that can be generalized only to the population of students
who expressed an interest in participating in a career academy, not to the
wider population of all students entering the relevant high-school grades.
In contrast, drawing the research sample from the population of all stu-
dents enrolled in the relevant grade in participating schools means that
the results can be generalized to this latter population. The disadvantage
is that you risk a low take-up rate for the offer, which, in turn, affects the
statistical power you have to reject the null hypothesis that the treatment
is no more effective than alternatives. Combating this problem would
require increasing the sample size massively to improve statistical power
and precision. Given the expense of conducting randomized experi-
ments, it is not surprising that MDRC investigators chose to draw their research
samples from the population of students at each participating school who
expressed an active interest in enrolling in their school’s career academy.
In this section, we explain and illustrate some threats to the internal and
external validity of randomized experiments. The list of threats to validity
that we describe here is by no means exhaustive. We describe others in
chapters to come. The purpose of this section is to emphasize that,
although randomized experiments are the most effective way to learn
about the causal impact of many educational interventions, they are by no
Cross-overs
One common threat to the internal validity of a two-group randomized
experiment occurs when participants “cross over” from the control group
to the treatment group, or vice versa, after random assignment has taken
place. In Project STAR, for instance, approximately 10% of students
switched between the small- and regular-size classes between one grade
and the next. These cross-overs jeopardized the internal validity of the
experiment because they challenged the original exogeneity of the assign-
ment to experimental conditions, and the consequent equality in
expectation between the treatment and control groups that was required
for drawing causal inferences. The cross-overs create the possibility that
any higher-than-average academic achievement detected among children
in small classes may have stemmed at least in part from the uncontrolled
sorting of children with unobserved differences between the two experi-
mental conditions. In fact, Krueger (1999) argued that cross-overs did not
have a dramatic impact on the results of the Project STAR experiment.
The strongest evidence that he cited in support of this conclusion was
that the class-size effects that were detected were the largest in the first
year of the experiment, before any cross-overs occurred. In Chapter 11,
we describe how instrumental-variables estimation can be used to deal
with the internal threat to validity created by such cross-overs.
4. After the scheduled completion of the Project STAR experiment, Alan Krueger and
Diane Whitmore Schanzenbach raised the funds to follow participants in Project STAR
through elementary school and into high school (Krueger & Whitmore, 2000).
In what ways does such attrition affect validity? Attrition from the
sample itself, regardless of whether it was from the treatment or control
group, may simply make the sample less representative of the underlying
population from which it was drawn, thereby undermining external validity.
Attrition also threatens the internal validity of the experiment. The reason
is that members who remain in the treatment group may no longer be
equal in expectation to members remaining in the control group.
Consequently, at least some part of any subsequent between-group differ-
ences in outcome could be due to unobserved differences that exist
between the members who remain in the treatment and control groups
after attrition, instead of being due to a causal effect of the experimental
treatment.
One sensible step in evaluating the extent to which sample attrition
poses a threat to the internal validity of a study is to examine whether the
attrition rate is higher in the control group than in the treatment group,
or vice versa. In Project STAR, for instance, 49% of the children assigned
initially to small classes in kindergarten had left the experiment by its
fourth year. The comparable figure for children assigned initially to regular-
size classes was 52% (Krueger, 1999). The percentage of students in the
control group of the NYSP who took the relevant achievement tests at
the end of the second year of the experiment was 7 points lower than the
percentage of students in the treatment group who did so. However, the
percentages of the treatment and control groups in the original sample
that took the tests at the end of year three of the experiment were similar
(Howell & Peterson, 2006). In the career-academy experiment, 82% of the
students offered a place in a career academy and 80% of those in the
control group completed the survey administered in the study’s
eleventh year (Kemple & Willner, 2008).
Of course, although evidence that the attrition rates in the treatment
and control groups of a randomized experiment are approximately equal
is comforting, the patterns of attrition in the two groups could still be
quite different. One way to examine this possibility is to capitalize on
information from a baseline survey administered prior to random assign-
ment to compare the sample distributions of the observed characteristics
in the key groups. These include individuals who left the treatment group,
those who left the control group, and those who remained in each of
these groups. Evidence that the sample distributions of observed baseline
characteristics in the four groups are very similar would support the case
that attrition from the research sample did not jeopardize the internal
validity of the experiment seriously, although such evidence is hardly
definitive as it pertains only to the characteristics that were actually mea-
sured in the baseline survey. The evaluators of all three of the interventions
experiments that researchers from the Abdul Latif Jameel Poverty Action
Lab (J-PAL) conducted in order to learn about the benefits of two inter-
ventions to improve student achievement in cities in India. The first,
which took place in two large cities, examines the consequences of a novel
input strategy. The second, which took place in a rural area of India,
examines the consequences of a change in incentives for primary-school
teachers. The descriptions illustrate some of the practical challenges in
conducting random-assignment experiments and some strategies for
overcoming these challenges.
hours per day during the regular school day, with groups of 15 to 20 chil-
dren in the third or fourth grade who had fallen behind academically.
They taught a standardized curriculum that focused on the basic literacy
and numeracy skills that were part of the regular first- and second-grade
school curricula, but that the children in their care had not yet mastered.
The Balsakhi program proved popular in many Indian cities and grew
rapidly. Regular teachers liked the program because it removed the least
academically able students from their classes for part of the school day.
Participating children liked it because the Balsakhi came from their home
communities and tended to be more attuned to their problems than were
regular teachers. Additional factors contributing to its popularity were
the program’s low cost and the ease with which it could be maintained
and expanded. Indian cities typically have a large supply of young female
secondary-school graduates looking for work—all potential Balsakhis. The
rate of pay for Balsakhis, between $10 and $15 per month, was about one-
tenth the cost of a regular teacher. Moreover, since their training took
only two weeks, a high annual turnover rate among Balsakhis did not
inhibit program expansion. Nor did a lack of classrooms, because the
Balsakhi worked with students wherever space was available, often in
corridors or on playgrounds.
The Pratham staff that designed the Balsakhi program had reason to
believe that it would enhance children’s skills. One reason is that the pro-
gram concentrated its instruction on fundamental literacy and numeracy
skills that lagging students needed in order to comprehend the curricula in
the third and fourth grades. A second reason is that the third- and fourth-
grade teachers in the regular government schools tended to focus on
covering the curriculum by the end of the school year and paid little or no
attention to the needs of students whose skills were lagging. Consequently,
the Pratham staff reasoned that little would be lost from pulling lagging
students out of their regular classroom to work with a Balsakhi. Although
this reasoning was persuasive to many school directors and government
officials, it did not constitute evidence of program effectiveness.7
In the late 1990s, Pratham requested that researchers from J-PAL evaluate
how effective the Balsakhi program was in enhancing children’s academic
achievement. The research team, which included Abhijit Banerjee, Shawn
Cole, Esther Duflo, and Leigh Linden, concluded that the best way to
answer this question was to conduct a random-assignment experiment.
Pratham staff supported the J-PAL researchers’ recommendation, and
7. We thank Leigh Linden, a member of the J-PAL team that evaluated the Balsakhi pro-
gram, for providing clarifying comments about the details of the J-PAL team’s work on
this project.
the research team began the work to design an experiment that would
take place during the 2001–2002 and 2002–2003 school years.
Since the program assigned Balsakhis to schools serving low-income
children, a logical way to design the experiment would have been to select
a sample of schools eligible to participate in the program, and then assign
Balsakhis randomly to half of the schools, treating the other half of the
schools as a control group. The research team anticipated, however, that
school directors in Vadodara and Mumbai, the two cities in western India
selected for the evaluation, would have reservations about participating
in an experiment with this kind of design. The reason was that schools in
the control group would not receive the assistance of a Balsakhi, but
would need to subject their students to the extra testing that was to be
part of the evaluation.
Recognizing the difficulty in obtaining cooperation for conducting an
experiment in which control schools obtained no additional resources,
the J-PAL researchers adopted a different design. The alternative that
they chose, after consultation with school directors, was to provide one
Balsakhi to each school that volunteered to participate in the experiment,
and then assign the Balsakhi randomly to either grade 3 or grade 4. Thus,
in 2001–2002, the first year of the experiment, half of the government
primary schools in Vadodara that participated in the experiment were
given a Balsakhi to work with children in grade 3; the other half were
given a Balsakhi to work with students in grade 4. In the second year of
the experiment, the assignments of Balsakhis to grade levels were switched.
Those participating schools that had a Balsakhi in grade 3 in year 1 were
given a Balsakhi for grade 4, and vice versa.
In Table 5.1, which is adapted from Banerjee et al. (2007), we illustrate
this design. In evaluating the first-year impact on student achievement of
having a Balsakhi work with grade 3 children, Group A schools would
make up the treatment group and Group B schools the control group.
Conversely, in evaluating the first-year impact of having a Balsakhi to work
with grade 4 children, Group A schools were the control group and
Group B schools the treatment group. A similar design was used in assign-
ing Balsakhi to schools in Mumbai that volunteered to participate in the
experiment.
An advantage of the research design chosen by the J-PAL researchers
was that it allowed them to examine whether the causal impact on student
achievement of having access to a Balsakhi for two years was greater than
the impact of one year of access. The reason is that children who were in
the third grade in Group A schools in the first year of the experiment also
received access to a Balsakhi in the second year of the experiment, when
the children were in fourth grade. Their achievement at the end of the
second year of the experiment (when they had completed grade 4) could
be compared to the achievement, at the end of the first year of the exper-
iment, of those children who were in fourth grade in Group B schools in
that year.
The results of the evaluation of the Balsakhi experiment were encour-
aging. In the first year of the evaluation, the Balsakhi program increased
student test scores by an average of 0.14 of a standard deviation. In the
second year of the evaluation, the average effect of one year’s access to a
Balsakhi was 0.28 of a standard deviation, and the impacts were quite
similar across grades, subject areas, and research sites. The explanation
for the larger effect in the second year of the program was that implemen-
tation improved.
On the important question of whether access to two years of support
from a Balsakhi improved achievement more than one year of access, the
results were cautiously optimistic. The evidence from Mumbai indicated
that two years of access to a Balsakhi increased student performance on
the mathematics examination by 0.60 standard deviations, an impact
twice as large as the impact of one year of access.8
The research team also examined the persistence of the impact of the
Balsakhi program. One year after receiving the support of a Balsakhi, the
impact for low-achieving students had declined to approximately 0.10 of
a standard deviation. This suggests that the Balsakhi program is better
viewed as a vitamin, an intervention that struggling students need con-
tinually, than as a vaccination that, once received, protects students from
future struggles. However, of greater importance is the message from the
evaluation that a remarkably low-cost intervention made an important
difference in the achievement of struggling primary school students in
Indian cities. This evidence proved important in building support for the
Balsakhi program, which now serves hundreds of thousands of children
in India.
month (approximately $23) as in the past, an amount that did not depend
on their attendance. Teachers in the experimental group would be paid
50 rupees for every day that they actually taught each month, with a min-
imum of 500 rupees per month. Consequently, teachers in the treatment
group could earn as much as 1,300 rupees per month, but they could also
earn only half of their prior pay. Participants were told that a lottery would
be used to determine whether they would be placed in the treatment
group or the control group.
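As a small illustration of the incentive schedule described above, the sketch below codes the payment rule (in rupees); the 26-day teaching month is an assumption, used only to reproduce the stated 1,300-rupee maximum.

```python
# A sketch of the treatment-group incentive schedule described above, in rupees.
# The 26-day teaching month is an assumption used only to reproduce the stated
# 1,300-rupee maximum.
def treatment_pay(days_taught: int) -> int:
    return max(500, 50 * days_taught)

for days in (8, 10, 20, 26):
    print(days, treatment_pay(days))   # 500, 500, 1000, 1300
```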
Once potential participants understood the incentive pay system and
how the lottery would work, the process for determining which teachers
would be in the treatment group appealed to their sense of fair play. One
question that some teachers asked J-PAL researchers during the partici-
pant focus-group sessions was why all teachers could not work under the
new pay-incentive system. The researchers explained that the Seva Mandir
had only sufficient resources to try the new approach with 60 teachers,
and that it had concluded that assignment by a fair lottery was the best
way to allocate the opportunity.
A second critical challenge was how to measure teacher attendance in
far-flung rural schools. The cost of having Seva Mandir staff visit widely
dispersed rural schools to monitor teacher attendance frequently would
have been prohibitive. In addition, participating teachers would have
resented unannounced monitoring visits, and the resentment might have
affected their teaching performance. The research team’s response to the
measurement challenge was to give each teacher a tamper-proof camera
that recorded on film the date and time that any picture was taken.
Teachers were instructed to have a student take a picture of the teacher
accompanied by at least eight students at the beginning of each school
day and then, again, at least five hours later, near the end of the school
day. The films were collected and developed each month, and the photo-
graphic record used to determine each teacher’s attendance rate and
their pay for the month.9 Once this data-collection process was explained
to teachers, they supported it because it was deemed fair and not subject
to the stresses from unannounced visits by Seva Mandir staff.
The results of the 27-month-long experiment showed that basing teach-
ers’ pay on the number of days that they actually taught reduced teacher
absences markedly, from 42% to 21% of the available working days, on
average. Even more important, it resulted in an increase of almost a fifth
of a standard deviation in their students’ achievement, as measured by
9. We thank Rema Hanna for reading a draft of this section and providing clarifying com-
ments about the details of the J-PAL team’s work on this project.
tests of mathematics and reading. This impact is only slightly smaller than
that of the very expensive small class-size intervention tested in the
Tennessee STAR Experiment (Duflo, Hanna, & Ryan, 2008).
To learn more about the challenges you may face in carrying out random-
ized field trials, we recommend two additional readings. The first is the
insight-filled chapter entitled “Using Randomization in Development
Economics Research: A Toolkit,” by Esther Duflo, Rachel Glennerster,
and Michael Kremer (2008). The second is the NBER working paper by
John List, Sally Sadoff, and Mathis Wagner entitled “So You Want to Run
an Experiment, Now What? Some Simple Rules of Thumb for Optimal
Experimental Design” (2010).
6
Statistical Power and Sample Size
We began the previous chapter by citing statistics from the What Works
Clearinghouse (WWC) about the enormous number of completed empiri-
cal evaluations of educational interventions that were unable to support
causal inference. For example, we noted that among 301 evaluations of the
effectiveness of interventions in elementary mathematics, 97% of the stud-
ies reviewed could not support a causal conclusion. The most common
reason was that the authors of the studies were unable to defend the
assumption that participants who had been assigned to the treatment and
control conditions were equal in expectation before the intervention began.
However, even in studies that meet this condition—for example, because
the investigator has assigned members of the analytic sample randomly to
treatment and control groups—the effort can be stymied by a sample of
inadequate size. If you conduct otherwise well-designed experimental
research in a too-small sample of participants, you may estimate a positive
impact for your intervention, but be unable to reject the null hypothesis
that its effect is zero, in the population. For example, the 3% of studies of
elementary-mathematics interventions that met the WWC standards for
supporting causal inferences included one evaluation of the causal impact
of a curriculum entitled Progress in Mathematics 2006.1 Had the sample
size of this study been larger and all else remained the same, the modest
positive results of the evaluation would have been statistically significant.
Statistical Power
addressing the implicit NYSP research question using the simplest appro-
priate analytic technique available to the empirical researcher. This is a
two-group t-test of the null hypothesis that there is no difference, in
the population, between the average academic achievement of African-
American children in the experimental (voucher) and control (no voucher)
conditions.
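For readers who want to see the bare machinery, here is a minimal sketch of such a two-group comparison in Python, using simulated scores rather than the NYSP data. The one-sided alternative anticipates the pedagogic choice discussed next; note that the alternative argument requires a reasonably recent version of SciPy.

```python
# A minimal sketch of a two-group, one-sided t-test on simulated (not NYSP) scores.
# The `alternative` argument requires SciPy 1.6 or later.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
treated = rng.normal(25, 17, 500)     # hypothetical achievement scores, voucher group
control = rng.normal(21, 17, 500)     # hypothetical achievement scores, no-voucher group

t_stat, p_value = stats.ttest_ind(treated, control, alternative="greater")
print(round(t_stat, 3), round(p_value, 4))
```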
To simplify our explanation of the new statistical concepts in this chapter,
we base our presentation on the application of a one-sided t-test. This
means that—in our introduction of the concept of statistical power—we
test the null hypothesis that the average academic achievement of treated
children is equal to the average achievement of untreated children versus
an alternative hypothesis that their achievement is greater than that of con-
trol children, in the population. This is a strictly pedagogic decision on our
part and was made to simplify our technical presentation. It contrasts
with our earlier substantive decision to rely on a two-sided t-test in our
detailed presentation of the actual analyses and findings from the NYSP
project in Chapter 4. There, we assumed that, if the null hypothesis were
rejected, the average achievement of children in the population who were
offered vouchers could be either greater than, or less than, the average
achievement of children not offered vouchers. Generally, in conducting
research, a one-sided test should only be used in circumstances in which
you can defend a strong prior belief that, if the treatment did have an
effect on the outcome of interest, you would know with certainty what the
direction of the difference in outcomes would be. This is rarely true in
practice, and we do not believe it would be true in the case of empirical
analyses of the NYSP data. On the other hand, as we discuss in Chapter 8,
an example in which we believe a one-sided test would be appropriate
concerns the impact of college scholarship aid on the decisions of high-
school seniors to enroll in college. Since scholarship aid reduces the cost
of college enrollment, it seems compelling to assume that, if scholarships
did have an impact on the percentage of high-school seniors who enrolled
in college, that effect would indeed be positive.
Fortunately, whether you choose a directional or a nondirectional alter-
native for your hypothesis testing, the technical concepts and connections
that we introduce in this chapter—and, in particular, the concept of statis-
tical power itself—remain unchanged. Later in the chapter, we describe
how critical features of the research design, the measurement of the vari-
ables, and the choice of a particular data-analytic approach affect the
statistical power in any particular experiment. At that point, we recon-
sider the decision to adopt a directional versus a nondirectional alternative
hypothesis and comment on how it impacts the magnitude of the statistical
power.
$$t_{observed} = \frac{\left(\overline{POST\_ACH}_{V} - \overline{POST\_ACH}_{NV}\right)}{\sqrt{s^{2}\left(\frac{1}{n_{V}} + \frac{1}{n_{NV}}\right)}} \qquad (6.1)$$
5. Unfortunately for the pedagogy of our example, the “some non-0 value” to which we
refer in this sentence is not δ itself, but a linear function of it. This is because, under the
alternative hypothesis, the observed t-statistic has a non-central t-distribution whose
population mean is equal to δ multiplied by a constant whose value is
$\sqrt{\nu/2}\,\Gamma\!\left((\nu-1)/2\right)/\Gamma\!\left(\nu/2\right)$,
where ν represents the degrees of freedom of the distribution and Γ(·) is the gamma
function.
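The same statistic can also be computed directly from Equation 6.1, as the following sketch shows. The summary statistics it uses are hypothetical, and s² is the usual pooled estimate of the common within-group variance.

```python
# Equation 6.1 computed directly: a pooled-variance, two-sample t-statistic.
# The summary statistics below are hypothetical, not the NYSP values.
import math

mean_v, mean_nv = 25.0, 21.0      # sample means of the outcome, by group
s2_v, s2_nv = 290.0, 295.0        # sample variances, by group
n_v, n_nv = 500, 500              # group sizes

# Pooled estimate of the common within-group variance, s^2
s2 = ((n_v - 1) * s2_v + (n_nv - 1) * s2_nv) / (n_v + n_nv - 2)
t_observed = (mean_v - mean_nv) / math.sqrt(s2 * (1 / n_v + 1 / n_nv))
print(round(t_observed, 3))
```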
[Figure 6.1 comprises two aligned panels: the upper panel displays the pdf of tobserved under H0, with α = P(rejecting H0 when it is true) falling to the right of tcritical; the lower panel displays the pdf of tobserved under HA, with β = P(not rejecting H0 when it is false) falling to the left of tcritical.]
Figure 6.1 Distributions of the observed t-statistic (tobserved) under competing null (H0)
and alternative (HA) hypotheses, showing the Type I error (α), Type II error (β), and
placement of the critical value of the t-statistic (tcritical), for a one-tailed test of population
outcome mean differences between a treatment and a control group.
horizontal axis represents the possible values that tobserved could attain—
actually, these values range from −∞ to +∞. The vertical axis represents the
“frequency” with which each value has occurred during the resampling
process. However, we are dealing with infinite resampling and a statistic
that can take on values ranging continuously between plus and minus
infinity. Consequently, we have drawn the exhibit as the envelope of a
probability density function (or pdf) in which the histogram has been rescaled
so that the total area under the envelope is equal to 1. Areas beneath the
envelope represent the probabilities with which particular ranges of values
of tobserved would occur in infinite resampling from a null population. For
instance, the probability that tobserved will take on any value at all is obviously
1, a value equal to the total area beneath the pdf.6 Similarly, because the
pdf is symmetric and centered on zero, there is a probability of exactly
one half—a 50% chance—that a value of tobserved sampled at random from
the null population will be larger than zero, and an equal chance that it will be smaller than zero.7
In the bottom panel in Figure 6.1, we display the situation that would
occur under the competing alternative hypothesis, HA: Δμ = δ. The graphic
is essentially identical to that displayed under H0 , but we have shifted the
pdf of tobserved to the right by an amount that depends on δ—the value we
would anticipate for the population outcome mean difference between
treatment and control groups if HA were true.8 Again, the displaced pdf
represents the distribution of all the possible values of tobserved that could
be obtained if samples were drawn repeatedly and randomly from the
alternative population.
To complete our test, we rely on a decision rule that derives from our
decision to set the Type I error of our test at 5%. From this decision, we
can derive a critical value against which to compare the value of the
observed test statistic. We do this by determining the value that tobserved
would have to take on in order to split the null distribution of tobserved in the
top panel of Figure 6.1 vertically into two parts, with 5% of the
area beneath its envelope falling to the right of the split and 95% falling
to the left.9 In the figure, we indicate the place at which this split occurs
6. The area beneath the t-distribution is finite, and equal to 1, because its tails asymptote
to zero.
7. Not all distributions of test statistics are symmetric and zero at the center. However, the
logic of our argument does not depend for its veracity on the particular shape of the
pdf we have chosen to display. All that is required is that the pdf of the test statistic,
under H0 , be known. Consequently, our argument applies equally well to cases in which
distributions are asymmetric (as with the F and χ2 distributions).
8. Again, under the alternative hypothesis, the pdf of the observed t-statistic is not
centered on the value of δ itself, but on a value proportional to it. See footnote 5.
9. Recall that this is a one-sided test.
by drawing a dashed vertical line. The place at which the vertical dashed
line intersects the horizontal axis provides the required critical value of
the test statistic tcritical that we will use in our hypothesis test. Our decision
is then straightforward. If tobserved is greater than tcritical, then we conclude
that it is probably too extreme to have come legitimately from the null
distribution. Consequently, we reject H0 in favor of HA, and conclude that
parameter Δμ is equal to δ, not zero, in the population from which we
have sampled. On the other hand, if tobserved is less than tcritical, we conclude
that our single empirical value of tobserved was probably sampled from a null
population. Consequently, we would not reject H0 in favor of HA. In other
words, by choosing a particular α-level (5%, say) to fix the level of the
Type I error, and combining this with our theoretical knowledge of the
shape of the pdf of the t-statistic under the null hypothesis, we can carry
out the desired test. It is the choice of the Type I error that provides us
with the criterion that we need to make the testing decision.
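In code, the decision rule amounts to a single quantile lookup: once α and the degrees of freedom are fixed, so is tcritical. The degrees of freedom below are hypothetical.

```python
# The decision rule in code: for a chosen Type I error rate, t_critical is the quantile
# of the null t-distribution that leaves alpha in the upper tail. The degrees of
# freedom shown are hypothetical.
from scipy import stats

alpha = 0.05
df = 998                                  # e.g., two groups of 500, so n_V + n_NV - 2
t_critical = stats.t.ppf(1 - alpha, df)   # one-sided, upper-tail test
print(round(t_critical, 3))               # about 1.646; reject H0 if t_observed exceeds it
```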
Now focus on the second (lower) panel in Figure 6.1, which is aligned
beneath the first. As we have noted, this lower panel illustrates the “alter-
native” side of the hypothesis testing situation. In it, we display the pdf of
all possible values that an observed t-statistic could take on in repeated
resampling from a population in which the alternative hypothesis was
true, and parameter Δμ had a non-zero value of δ. Of course, because of
sampling variation, it is entirely possible that, in some proportion of
resamplings, tobserved will take on very small values, perhaps even values less
than tcritical —values that we typically associate with sampling from a null
population—even though the alternative hypothesis is actually true. If
this were to happen in practice, and we were to base our decision on
an artificially small empirically obtained value, we would declare the null
hypothesis true. In this case, we would have committed another kind of
mistake—called a Type II error. Now, we would end up falsely accepting the
null hypothesis even though the alternative was, in fact, true. The proba-
bility that tobserved may be idiosyncratically less than tcritical , even when the
alternative hypothesis is true, is represented by the shaded area under the
“alternative” probability density function to the left of tcritical. Just as the symbol
α is used to represent the magnitude of the Type I error, β is the symbol used
to represent the probability of a Type II error.
Finally, notice the horizontal separation of the centers of the pdfs,
under the competing null and alternative hypotheses, H0 and HA , in
Figure 6.1. This separation reflects the difference in the potential
values of Δμ, under the alternative (Δμ = δ) and null hypotheses (Δμ = 0).10
10. Again, the horizontal distance between the centers of the H0 and HA pdfs is not equal
to δ, but is proportional to it. See footnote 5.
11. Effect size can also be defined in terms of the correlation between outcome and predictor.
In the NYSP evaluation, an effect size defined in this way would be the sample correla-
tion between the academic achievement outcome and the dichotomous VOUCHER
predictor, for the sample of African-American children. This correlation has a value
of 0.127. When effect sizes are defined as correlations, a coefficient of magnitude 0.10
is regarded as a “small” effect size, 0.25 as a “medium” effect size, and 0.37 as a “large”
effect size (Cohen, 1988, Table 2.2.1, p. 22).
12. Some argue that effect size is best scaled in terms of the standard deviation of the
outcome for participants in the control condition only. In the NYSP evaluation, this
would have led to an effect size of (4.899/17.172), or 0.285.
[Figure 6.2 annotates the two pdfs from Figure 6.1 with the four decision probabilities: when H0 is true (and HA is false), the probability of not rejecting H0 is 1 − α and the probability of rejecting it is α; when HA is true (and H0 is false), the probability of not rejecting H0 is β and the probability of rejecting it is 1 − β.]
Figure 6.2 Four-way decision scenario, summarizing the probabilities of not rejecting H0
(1st column) or rejecting H0 (2nd column) when it is either True (1st row) or False (2nd
row), showing the Type I error (α), Type II error (β), and placement of the critical value
of the t-statistic (tcritical), for a one-tailed test of population outcome mean differences
between a treatment and a control group.
To appreciate this fully, recall that, once the pdf of the test statistic has
been specified under H0 , the value of tcritical depends only on your selec-
tion of the α-level. So, if you were willing to entertain a larger Type I
error, perhaps as high as 0.10, then your corresponding value of tcritical
would shrink, so that 10% of the area beneath H0 ’s pdf can now become
entrapped to its right. With your new willingness to entertain this larger
Type I error, you would find it easier to reject H0 because the single empir-
ically obtained value of your observed test statistic would be more likely to
exceed the now smaller value of tcritical . This means that, if you can tolerate
increased Type I error, you can more easily reject H0 and more easily
claim detection of a non-zero effect in the population. Of course, in
enhancing your chances of claiming such a non-zero effect, you have
increased the probability of Type I error—that is, you are now more likely
to reject H0 even when it is true! At the same time, shifting tcritical to a
smaller value has implicitly moved the vertical splitting of HA ’s pdf to the
left in Figure 6.2, and thereby reduced the value of the Type II error β.
So, you are now more likely to accept HA when it is true. This intimate—
and inverse—connection between the magnitudes of the Type I and II
errors is a central fact of statistical life. As you decide to make one type of
error less likely, you force the other one to become more likely, and vice
versa. So, you can correctly regard hypothesis testing as a trade-off between
the probabilities of two competing types of error.
More importantly, the decision probability featured in the right-hand
cell of the second (lower) row in Figure 6.2, which is of magnitude (1 – β),
is a central and important commodity in our empirical work. It is the prob-
ability of rejecting H0 when it is false. Or, alternatively, it is the probability of
accepting the alternative hypothesis when it is true. This is a highly pre-
ferred end result for most research—the rejection of the null hypothesis in
favor of the alternative, when the alternative is true. For example, in
designing the NYSP experiment, investigators were hoping to reject the
null hypothesis of no causal connection between voucher receipt and stu-
dent achievement in favor of an alternative hypothesis that stipulated
voucher receipt had a causal effect on student achievement. This impor-
tant quantity is defined as the statistical power of the study and, as you can
see from Figure 6.2, it is simply the complement of the Type II error. This
means that, knowing the pdfs of our test statistics—such as the t-statistic—
under the null and alternative hypotheses, and being willing to set the
Type I error level to some sensible value, we can actually estimate a value
for the statistical power. This can be very useful both during
the design of the research and also after the research has been completed.
We follow up on these ideas in the section that follows.
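A sketch of such a calculation appears below. Under HA the test statistic follows a non-central t-distribution (see footnote 5), so power is simply the area of that distribution lying above tcritical. The inputs are illustrative, chosen to echo the values shown later in Figure 6.3.

```python
# A sketch of a power calculation for a one-sided, two-group t-test: under HA the
# statistic follows a non-central t-distribution, and power is its area above t_critical.
# Inputs are illustrative.
import math
from scipy import stats

def power_two_group(effect_size, n_per_group, alpha=0.05):
    df = 2 * n_per_group - 2
    ncp = effect_size * math.sqrt(n_per_group / 2)    # non-centrality parameter
    t_critical = stats.t.ppf(1 - alpha, df)
    return stats.nct.sf(t_critical, df, ncp)          # P(t_observed > t_critical | HA)

print(round(power_two_group(0.5, 50), 2))   # medium effect, 100 participants in total: ~0.80
print(round(power_two_group(0.2, 50), 2))   # small effect, 100 participants in total: ~0.26
```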
13. Depending on the type of analysis, the test statistic’s pdf under HA may also have a
different shape from its pdf under H0 .
14. All the power analyses in this chapter were conducted using the G∗Power freeware,
v2.0, GPOWER: A-Priori, Post-Hoc and Compromise Power Analyses for MS-DOS, Dept. of
Psychology, Bonn University, Germany, http://www.psycho.uni-duesseldorf.de/aap/
projects/gpower/.
Figure 6.3 Statistical power as a function of total sample size, effect size (0.2 versus 0.5),
and α-level (0.05 versus 0.10), for a one-tailed test of population outcome mean
differences between a treatment and a control group.
The second important relationship evident in our figure is that, all else
remaining equal, you will always have more power to detect a larger effect.
In Figure 6.3, with a total sample size of 100 participants randomized to
treatment conditions, say, and an α-level of 0.05, you have a power of just
over 0.25 to detect a small effect (ES = 0.2) and a power of almost 0.80 to
detect a medium effect (ES = 0.5). Again, the reason for this link between
effect size and power can be deduced from our decision-scenario descrip-
tion in Figure 6.2. As we have noted already, the effect size determines the
horizontal separation of the test statistic’s pdfs under H0 and HA. So, if a
larger effect size is accommodated, the H0 and HA pdfs must be more
widely separated along the horizontal axis. But, in the first row of the
figure, the center of H0 ’s pdf is fixed at zero (because it represents the
“null” condition). So, as effect size is increased, the pdf of the test statistic
under HA shifts to the right, in the second row of the figure, sliding past
the location of tcritical , the placement of which has been fixed by the earlier
choice of α-level under the H0 pdf. Consequently, the area beneath the
alternative distribution to the right of tcritical must rise, and statistical power
is again increased.
Third, and most important, statistical power is always greater when the
total number of participants included in the experiment is larger, all else
being equal. This is quite a dramatic effect, as evidenced by the slopes of
the power/sample size relationships in Figure 6.3. Notice, for instance, in
research to detect a medium effect size (ES = 0.5) at an α-level of 0.05,
statistical power can be increased from about 0.55 to more than 0.80 by
increasing the total sample size from 50 to 100 participants! Although it is
more difficult to understand, the reason for this dependency can again be
deduced from Figure 6.2. As sample size increases, the pdf associated
with any test statistic always becomes slimmer and taller because its values
enjoy greater precision—and less scatter on repeated sampling—at larger
sample sizes. However, the location of the center of the distribution
remains unchanged.15 So, as the H0 and HA pdfs in Figure 6.2 slim down
and become more pointy, there are two important consequences, one for
each featured pdf. First, in the H0 pdf in the first row of the figure, the
location of tcritical must move to the left—that is, the critical value must get
smaller—in order to accommodate the fixed choice of α-level adopted for
the test. (Recall that choice of α-level splits the pdf under the null distri-
bution vertically, so that an area equal to the Type I error must fall to the
right of tcritical . In a rapidly narrowing distribution, this can only continue
15. You can check out this claim using one of the simulations of the distribution of the
sample mean as a function of sample size available on the Internet.
16. Some versions of the t-test relax the population homoscedasticity assumption.
violated, then your answer may be wrong no matter how powerful the
technique!
17. If treatment status is assigned randomly by the investigator, then the treatment
predictor will necessarily be uncorrelated with all other exogenous covariates.
a set of covariates that predict about half the variation in the outcome
jointly, then you can maintain the same statistical power for your analyses
at half the sample size.
The message is clear. There is always an analytic advantage to prefer-
ring a more complex statistical analysis over a less complex one because it
provides you with an opportunity to increase precision by including cova-
riates. Greater precision brings increased statistical power, and the ability
to detect a smaller effect at the same sample size. However, significant
knowledge is needed to use complex statistical analyses appropriately.
In doing so, you are relying more heavily on the hypothesized structure of
the statistical model. You have to ensure that additional assumptions are
met. You have to do a good job, analytically speaking, with the new terms
in the model. You need to worry about whether the new covariates meet
the underlying requirements of the analysis in terms of the quality of their
measurement, the functional form of their relationship with the outcome,
whether they interact with other predictors in the model, and whether
they are truly independent of the existing residuals, as required. Clearly,
everything has its price! However, if it is a price that you can pay, the
rewards are great.
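The arithmetic behind the earlier claim, that covariates jointly predicting about half the outcome variation let you keep the same power at roughly half the sample size, can be sketched in two lines: the residual variance, and therefore the required sample size for a given level of power, scales roughly with (1 − R²). The starting sample size of 620 below is illustrative, and the whole calculation is a back-of-the-envelope approximation.

```python
# A back-of-the-envelope approximation: with a randomized treatment, covariates that
# jointly explain a proportion R^2 of the outcome variation cut the residual variance,
# and hence the sample size needed for a given level of power, by roughly (1 - R^2).
def approx_n_with_covariates(n_without_covariates, r_squared):
    return n_without_covariates * (1 - r_squared)

print(approx_n_with_covariates(620, 0.0))    # illustrative baseline requirement: 620
print(approx_n_with_covariates(620, 0.5))    # covariates explain half the variation: about 310
```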
probably mean that you will only have to increase your anticipated total
sample size by a few percent.
of its positive value at the upper end and 2.5% of the area was entrapped
to the left of its negative value at the lower end.18 As a consequence, the
magnitude of the new tcritical must be larger than that currently displayed.
In going from the existing critical value of the t-statistic obtained under
the one-tailed test of our initial explanation to the new larger critical value,
we have effectively moved the vertical dashed reference line in Figure 6.2—
the line that also splits the pdf of the t-statistic under HA, in the second
row of the figure—to the right. Thus, the Type II error (β)—represented by
the area entrapped beneath the pdf of the test statistic (under HA ) to the
left of the dashed vertical line—will have increased. Concurrently, the sta-
tistical power—the complement of that area, to the right of the vertical
dashed line—must be reduced. Thus, switching from a one-tailed to a two-
tailed test implicitly reduces the power of a statistical test.
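The size of this power penalty is easy to quantify with the same non-central-t machinery sketched earlier; the example below ignores the negligible probability of rejecting in the wrong tail, and its inputs are illustrative.

```python
# Quantifying the power cost of a two-tailed test at the same alpha, ignoring the
# negligible chance of rejecting in the "wrong" tail. Inputs are illustrative.
import math
from scipy import stats

effect_size, n_per_group, alpha = 0.5, 50, 0.05
df = 2 * n_per_group - 2
ncp = effect_size * math.sqrt(n_per_group / 2)

power_one_tailed = stats.nct.sf(stats.t.ppf(1 - alpha, df), df, ncp)
power_two_tailed = stats.nct.sf(stats.t.ppf(1 - alpha / 2, df), df, ncp)
print(round(power_one_tailed, 2), round(power_two_tailed, 2))   # roughly 0.80 versus 0.70
```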
We conclude by reminding you then that, in most research, two-tailed
tests are the order of the day, even though they are implicitly less power-
ful than one-tailed tests. Only when you can mount a compelling defense
of the argument that a particular policy or intervention can have only a
directed impact (positive or negative) on the outcomes of interest, in the
population, is the use of one-tailed tests justified.
If you want to learn more about statistical power, we suggest that you
consult the classic text by Jacob Cohen entitled Statistical Power Analysis
for the Behavioral Sciences (1988, 2nd edition).
18. Implicitly, in the two-tailed case, because the pdf of the t-statistic (under H0 ) is sym-
metric, tcritical will take on two values of the same magnitude—one positive and the
other negative—which are equally spaced on either side of the center of the pdf.
During the subsequent test, if the value of the observed t-statistic is positive, it will be
compared to the upper positive value of tcritical ; if it is negative, it will be compared to
the lower negative value.
7
Experimental Research When Participants Are Clustered Within Intact Groups
1. Imbens and Wooldridge (2009) also explain that it is usually not possible to separate
out the direct effects of the intervention on the individual from the indirect effects on
that individual that take place via their interactions with other students in their school.
2. We thank Geoffrey Borman for providing the data. Although our findings do not differ
substantively from those of the original research, we recommend that readers inter-
ested in the evaluation of SFA consult the published papers by Borman and his
colleagues. One paper (Borman et al., 2005a), which provides the basis for our presen-
tation, describes the first-year results of the evaluation. A second (Borman, Slavin, &
Cheung, 2005b) describes the second-year results, and a third (Borman et al., 2007)
describes the results from the third and final year of the evaluation.
3. This was the outcome for which the original authors had the strongest findings in the
first year of the evaluation.
4. Borman et al. (2005a) state their random-intercepts multilevel models using what has
become known as a “level-1/level-2” specification of the multilevel model. Under this
approach, they specify both a within-school (“level-1”) and a between-school (“level-2”)
component of the model. For instance, a simplified version of their within-school model,
without added control predictors, is
Level 1: $WATTACK_{ij} = \beta_{0j} + \varepsilon_{ij}$
And the corresponding between-school model, again without additional covariates, is
Level 2: $\beta_{0j} = \gamma_{00} + \gamma_{01}SFA_{j} + u_{0j}$
The level-1 intercept parameter β0j represents the within-school average of the outcome
in school j and differs from school to school. In the level-2 model, the school-level
residuals u0j provide random shocks to the grand intercept γ00 and lead to the
random intercepts of the schools. The level-1/level-2 specification can be collapsed
into a single “composite” model by substituting for parameter β0j from the level-2 into
the level-1 model, as follows:
$WATTACK_{ij} = \gamma_{00} + \gamma_{01}SFA_{j} + (u_{0j} + \varepsilon_{ij})$
Note that the level-1/level-2 specification of the multilevel model is identical algebra-
ically to the random-intercepts regression model in Equation 7.1, with only cosmetic
differences in notation. In multilevel modeling, all level-1/level-2 specifications can be
collapsed into a single composite model. It is the composite specifications that we
choose to present in Equation 7.1.
for the ith child in the jth school. To simplify our presentation, we have
omitted from this model a pair of important control predictors that
Borman and his colleagues included in their statistical models. One of
the omitted covariates represented the child’s grade level in school. This
is not relevant here because we have limited our analytic sample to chil-
dren in first grade. The other covariate was the school-level average
student pretest score on the Peabody Picture Vocabulary Test (PPVT).
We reserve this covariate for inclusion later in our presentation.
Notice that, unlike a standard OLS regression model, our random-
intercepts multilevel model in Equation 7.1 contains a composite residual
that sums two distinct error terms. We have specified the model in this
way deliberately to provide a mechanism, within the model, that accounts
for the hypothesized lack of independence that may exist among the
unpredicted portion of the responses of children within a school. The
first term is a child-level residual, εij, and the second a school-level residual, uj.
In our hypothesized random-intercepts multilevel model, all children in
the same school share the same value of the school-level residual uj , and
this serves to tie together—or correlate—their composite residuals.
Consequently, the model does not constrain their composite residuals to
be independent of each other, as standard OLS models require. In fitting
a random-intercepts multilevel model to data, we assume that each of the
constituent error terms, εij and uj , satisfies the usual residual normal-
theory assumptions. Thus, we assume that the child- and school-level
residuals are distributed independently of each other in the population,
that the child-level residuals have a population mean of zero and a variance
of σε², and that the school-level residuals have a population mean of
zero and a variance of σu².
It is worth pausing at this point to understand why this new multilevel
model is referred to as a random-intercepts model. The reason becomes evi-
dent with a simple reordering of the terms in the model itself, to become:
$$WATTACK_{ij} = (\gamma_{0} + u_{j}) + \gamma_{1}SFA_{j} + \varepsilon_{ij} \qquad (7.2)$$
(under the assumption that the school-level residuals are drawn randomly
from a distribution with mean zero and homoscedastic variance σu²).
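As a concrete, hedged illustration of fitting such a random-intercepts model, the sketch below simulates clustered data (it does not use the SFA evaluation data) and fits a model like Equation 7.1 with the MixedLM routine in Python's statsmodels. The simulated variance components are chosen to resemble the estimates reported in Table 7.1, so the implied intraclass correlation comes out near 0.20.

```python
# A minimal sketch (simulated data, not the SFA evaluation) of fitting a
# random-intercepts model like Equation 7.1 and recovering its variance components.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n_schools, n_per_school = 40, 60
school = np.repeat(np.arange(n_schools), n_per_school)
sfa = np.repeat(rng.integers(0, 2, n_schools), n_per_school)              # school-level assignment
u_j = np.repeat(rng.normal(0, np.sqrt(80.0), n_schools), n_per_school)    # school-level residuals
wattack = 475 + 4 * sfa + u_j + rng.normal(0, np.sqrt(315.0), n_schools * n_per_school)
df = pd.DataFrame({"wattack": wattack, "sfa": sfa, "school": school})

fit = smf.mixedlm("wattack ~ sfa", data=df, groups=df["school"]).fit()
between = float(fit.cov_re.iloc[0, 0])    # estimated school-level residual variance
within = fit.scale                        # estimated child-level residual variance
print(fit.params["sfa"], between, within, between / (between + within))   # last value near 0.20
```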
We have fitted the random-intercepts multilevel model specified in
Equation 7.1 to our subsample of data from the SFA evaluation. It appears
Table 7.1 Parameter estimates, approximate p-values, standard errors, and selected
goodness-of-fit statistics for three random-effects multilevel models describing the fitted
relationship between the word-attack scores of first-graders, at the end of their first year
in the study, and the assignment of their school to either the SFA intervention or the
control condition (nschools = 41; nstudents = 2,334)
fitted model, our table contains two other fitted models: (a) an uncondi-
tional model that contains no predictors at all, and (b) a second conditional
model in which we have added the main effect of an interesting school-
level covariate, the within-school average value of a prior PPVT student
test score, measured before the intervention began.
Before turning to the results of the evaluation itself (as summarized in
fitted Models #2 and #3), we focus on the consequences of fitting Model
#1—the “unconditional” multilevel model. Parameter estimates from this
fitted model are easy to interpret because the model contains no explicit
predictors. The estimated intercept in the unconditional model, for
instance, tells us that—over all children and schools in our subsample of
first-graders—the average word-attack score is 477.54 points (p <0.001).
More interesting are the estimates of the child- and school-level residual
variances, which are 314.20 and 78.69, respectively. What do we make of
these two components of residual variance?
First, it is important to realize—as in a regular OLS-fitted regression
model—that when no predictors are present in the model, residual vari-
ability and outcome variability are synonymous. If no part of the outcome
is being predicted, then outcome variability must equal residual variabil-
ity. Here, because we have articulated our multilevel residual as a sum of
two independent contributions, we have partitioned the outcome varia-
tion effectively into its child- and school-level components. In the
unconditional multilevel model, the school-level residual variance is a sum-
mary of the variability in the school-mean value of the outcome from school
to school. It is often referred to as between-school variance. It summarizes
the scatter in the outcome among schools. The child-level residual variance
is what is left over after school-level variance has been removed from the
outcome variability. In the unconditional multilevel model, it is the out-
come variance among children within each school, pooled over the
schools. It is often referred to as the within-school variance. It describes the
scatter in the outcome from student to student within each school.
What we learn from the fitted unconditional model in Table 7.1 is that
the total sample variance in the word-attack score outcome (392.89) is the
sum of a within-school contribution of 314.2 and a between-school contri-
bution of 78.69. Comparing the magnitudes of these two contributions,
we notice that the sample outcome variability is made up disproportion-
ately of child-level rather than of school-level variation. We can summarize
the proportion of the total sample variance in the outcome at the school
level by expressing it as a fraction of the total variance—this fraction is
equal to 78.69/392.89, or 0.20. You will find this latter statistic, 0.20, listed
under the fitted unconditional model in the bottom row of Table 7.1 and
labeled “Intraclass Correlation.” It is an important summary statistic that
will feature in the analyses that follow, including our subsequent statistical
power analyses. It summarizes the fact that, in our current sample of first-
graders clustered within schools, 20% of the variation in our outcome can
be attributed to differences in the average value of the outcome among
schools and that the rest is due to heterogeneity among children within a
school. Correspondingly, we define the population intraclass correlation
in terms of the population residual variances present in our hypothesized
random-intercepts multilevel model, as follows6
2
s between
r= 2
s within + s between
2
s u2 (7.3)
=
s e2 + s u2
78.69
=
314.20 + 78.69
= 0.20
You can articulate potential problems that can be caused by the non-
independence of participants within intact groups by thinking in terms of
the within- and between-group variability in the outcome. First, consider
the clusters of children—that is, the schools—that were enrolled ultimately
in the SFA evaluation. Imagine a fictitious scenario in which every child was
actually assigned randomly to his or her school at the beginning of the
school year. In this scenario, there would be considerable natural varia-
tion in reading achievement across all of the children. However, as a result
of their initial random assignment to schools, the average reading perfor-
mance of the children in each school would not differ across schools.
In other words, at the beginning of the year, when children were initially
6. Another version of this index could just as easily have been defined as the proportion
of the sum of the constituent variances that lies within school, or as the ratio of the two
constituent variances. By convention, however, Equation 7.3 is the definition that is
adopted because of its mapping onto other important parameters defined in tradi-
tional analyses of variance and regression analysis.
7. Of course, this situation never occurs in practice because children are clustered natu-
rally in neighborhoods before they are assigned to schools. And, even then, they are
not assigned randomly to neighborhoods. In fact, unobserved forces in the neighbor-
hood act to render children’s responses interdependent long before they even get to
school, because schools draw from catchment areas within which children share many
unobserved opportunities and experiences. Typically, students will arrive in a school
already interdependent (that is, with a non-zero intraclass correlation), but it is likely
that their interdependence is enhanced by the unobserved common experiences that
they share subsequently at the school, over the academic year. In designing effective
research, it is critical to take both of these into account and focus on what the intraclass
correlation may potentially become, at that point in time at which the final value of the
outcome of the evaluation has been measured. In this case, that would be at the end of
the school year.
8. Notice that these are approximately the squares of the corresponding standard
“small,” “medium,” and “large” values of the Pearson correlation coefficient.
9. Potentially, the deviation of an individual-level variable from the grand mean can be
written as a sum of: (a) the deviation of the individual-level score from the group-level
mean, and (b) the deviation of the group-level mean from the grand mean. Unless this
latter contribution is zero, any individual-level variable may contain both individual-
level and group-level variation, and thus adding an individual-level predictor to a
multilevel model could predict both individual-level and group-level variation in the
outcome.
10. Notice also that the overall model R2 statistic has risen from zero to 0.03.
advance, to figure out how the number of intact clusters (that is, schools),
the number of children within each cluster, and the magnitude of the
intraclass correlation will affect the statistical power of our design.
You can gain insight into the dependence of statistical power on the
clustering of participants within intact groups by examining an expres-
sion for the population sampling variance of the estimate of regression
parameter γ1, which in turn represents the effect of the SFA treatment in Equation 7.1. Under the simplifying assumptions that an identical number of participants, n (students), is present in each of an even number, J, of intact groups (schools), and that these schools have been randomized with equal numbers assigned to the treatment and control conditions, we can write this population sampling variance as the sum of two parts, as follows:
$$\operatorname{Var}(\hat{\gamma}_{1}) \;=\; \left(\frac{\sigma^{2}_{\varepsilon}}{nJ/4}\right) \;+\; \left(\frac{4\,\sigma^{2}_{\varepsilon}\,\rho}{(1-\rho)\,J}\right) \qquad (7.4)$$
12. The requisite OLS population sampling variance is given by the first term to the right of the equals sign in Equation 7.4, or $\sigma^{2}_{\varepsilon}/(nJ/4)$.
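A small Python sketch shows how the two terms of Equation 7.4 behave numerically. The function name is illustrative, and the example values (50 children in each of 41 schools, an intraclass correlation of 0.20, and the within-school variance estimated earlier) are assumptions used only for illustration.

```python
def var_treatment_effect(sigma2_e, rho, n, J):
    """Population sampling variance of the estimated treatment effect in a balanced
    cluster-randomized design (Equation 7.4): an OLS-like first term plus a
    clustering penalty that grows with the intraclass correlation rho."""
    ols_term = sigma2_e / (n * J / 4)
    clustering_term = (4 * sigma2_e * rho) / ((1 - rho) * J)
    return ols_term + clustering_term

print(var_treatment_effect(sigma2_e=314.20, rho=0.20, n=50, J=41))  # roughly 8.3
print(var_treatment_effect(sigma2_e=314.20, rho=0.0, n=50, J=41))   # about 0.6: no clustering penalty
```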
13. The power computations were carried out using Optimal Design for Multi-Level and
Longitudinal Research, Version 0.35 (Liu et al., 2005). The accompanying manual is
a good source for further details of statistical power computation in the cluster-
randomized design.
14. Two factors contribute to the small difference between the estimated requisite sample
size, 650, reported here and the estimate of 620 described in the previous chapter.
The first is the impact of rounding error in the power computation algorithms of the
different software we have used. The second is that you cannot make up a sample of exactly 620 out of intact groups of 50.
Figure 7.1 (two panels): statistical power (vertical axis, 0.4 to 1.0) plotted against the number of clusters (horizontal axis, 19 to 79) for the prototypical cluster-randomized designs discussed in the text.
When the value of the intraclass correlation rises to 0.1, the impact is even more dramatic.
Moderate power is not achieved until around 75 schools have been
included in the sampling plan, for a total sample size of 3,750 students.
The patterns displayed in Figure 7.1 illustrate that the statistical power of
a cluster-randomized design is indeed very sensitive to the value of the
intraclass correlation.
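The qualitative pattern in Figure 7.1 can be reproduced approximately with a few lines of Python. This is only a rough normal-approximation sketch, not the algorithm implemented in the Optimal Design software cited in the footnotes, and the standardized effect size of 0.22 is an assumption inserted purely for illustration.

```python
from scipy.stats import norm

def cluster_power(delta, n, J, rho, alpha=0.05):
    """Approximate power of a two-sided test of the treatment effect in a balanced
    cluster-randomized design; 1 + (n - 1) * rho is the familiar design effect."""
    design_effect = 1 + (n - 1) * rho
    se = (4 * design_effect / (n * J)) ** 0.5      # s.e. of the standardized effect estimate
    return norm.cdf(delta / se - norm.ppf(1 - alpha / 2))

# Power rises with the number of clusters, but more slowly when the ICC is larger
for J in (19, 34, 49, 64, 79):
    print(J,
          round(cluster_power(delta=0.22, n=50, J=J, rho=0.05), 2),
          round(cluster_power(delta=0.22, n=50, J=J, rho=0.10), 2))
```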
Consider now the impact on the statistical power of a cluster-randomized
design of the number of participants within a cluster (in the SFA evalua-
tion, the total number of children in grades K-2 within a school). In the
expression for the population sampling variance in Equation 7.4 (and the
corresponding standard error of the treatment effect), the number of
participants within a cluster n appears only in conjunction with the
number of clusters J as a product, to represent total sample size nJ. In
addition, it is present only in the denominator of the first term that
follows the equal sign. Thus, it plays the same role in the determination
of the statistical power of the cluster-randomized design as it does in
an individually randomized design in which the same total number of
“unclustered” participants were randomized to experimental conditions.
Thus, as the number of participants within a cluster is increased (for a
fixed number of clusters), the total sample size must increase, the popula-
tion sampling variance (and the corresponding standard error of the
treatment effect) must decrease, and the statistical power improve, as
we expect from our arguments in the previous chapter. However, in
Equation 7.4, we note that it is only the magnitude of the first term to the
right of the equal sign that is diminished by this increase in the total
sample size, regardless of the contribution of the second term. Thus, we
anticipate that any control over statistical power that is provided in the
cluster-randomized design by manipulating the number of participants
within a cluster must offer benefits no different from those we would expect for
the same sample size increase in the simpler individually randomized
design. But, in a cluster-randomized design, its contribution is rapidly
dominated by the impact of any increase in the magnitude of the intra-
class correlation and in the number of clusters, both of which appear in
the second term to the right of the equal sign in Equation 7.4.
You can see the dependence of the statistical power of a cluster-
randomized design on the number of participants within a cluster in
Figure 7.1 by comparing the corresponding prototypical power values in
the top and bottom panels, at the same number of clusters and the same
intraclass correlation. Notice, for instance, that increasing the number of
participants present at each school from 50 to 100 students has only a
marginal impact on the power of the prospective analysis. With 50 stu-
dents at each school and an intraclass correlation of 0.05, about 45 schools
estimates and improve statistical power. More often than not, however, in
settings in which the intraclass correlation is non-zero—that is, when the
intact grouping of participants has indeed influenced their unobserved
behaviors—one approach will be superior. It will typically be more effec-
tive, from the perspective of increasing statistical power, to reduce
group-level residual variance by including group-level covariates than to
reduce individual-level residual variance. This conclusion echoes a similar
conclusion that we reported in the previous section, where we noted that
the statistical power of a cluster-randomized design tends to be more sen-
sitive to the number of intact groups than to the number of participants
within those groups.
Essentially, the magnitude of the intraclass correlation is more sensitive
to changes in between-group residual variance than to changes in within-
group residual variance. This matters because statistical power is very
sensitive to the magnitude of the intraclass correlation. In fact, a reduction
in between-group residual variance will always lead to a greater increase
in statistical power than a corresponding reduction in the within-group
residual variance. This means, of course, that when you seek covariates to
include in analyses of clustered data, you would be well advised to first
seek out effective group-level covariates.
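A quick numerical check of this claim follows, in Python. The rewriting of Equation 7.4 directly in terms of the two variance components is algebraically equivalent to the form given earlier, and the function name, the illustrative values, and the 28-point reduction (chosen to mirror the drop in between-school variance produced by the school-level covariate discussed next) are assumptions made only for illustration.

```python
def var_effect(sigma2_e, sigma2_u, n, J):
    """Equation 7.4 rewritten in the two variance components:
    Var = 4 * (sigma2_e + n * sigma2_u) / (n * J)."""
    return 4 * (sigma2_e + n * sigma2_u) / (n * J)

base         = var_effect(sigma2_e=314.20, sigma2_u=78.69, n=50, J=41)
less_between = var_effect(sigma2_e=314.20, sigma2_u=78.69 - 28, n=50, J=41)  # trim between-school variance
less_within  = var_effect(sigma2_e=314.20 - 28, sigma2_u=78.69, n=50, J=41)  # trim within-school variance
print(round(base, 2), round(less_between, 2), round(less_within, 2))  # about 8.29, 5.56, and 8.24
```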
Following Borman and his colleagues (2005a), the only difference
between our fitted Models #2 and #3, in Table 7.1, is that the latter includes
the covariate, SCH_PPVT, which is the school-specific average value of
children’s scores on the PPVT prior to the start of the evaluation.15 Notice
that the addition of this school-level covariate to the model results in a
dramatic reduction in the estimated between-school residual variance,
from 76.61 to 48.57, while the within-school residual variance remains
unchanged. As a result, the estimated intraclass correlation falls from its
“large” value of 0.196 in Model #2 to an almost “medium” value of 0.134
in Model #3. There is a concomitant reduction in the standard error
of the estimated effect of the SFA treatment from 2.859 to 2.340.
Unfortunately, in this particular example, these changes were offset by a
reduction in the size of the parameter estimate itself, from 4.353 to 3.572.
Consequently, the value of the t-statistic does not differ very much.
This example is atypical, and the take-away remains that the addition of
15. Rather than estimate the average score of children in our subsample on this pretest
and use it as the covariate, we took advantage of the school-average pretest PPVT
score that was provided by Borman et al. (2005a), in their larger dataset. Repeating
our analyses with this latter average replaced by a within-school average obtained in
our subsample provided similar, if slightly weaker, results.
$$WATTACK_{ij} \;=\; \gamma_{0} + u_{j} + \gamma_{1}\,SFA_{j} + \varepsilon_{ij} \qquad (7.5)$$

$$WATTACK_{ij} \;=\; \alpha_{0} + \alpha_{2} S_{2j} + \alpha_{3} S_{3j} + \cdots + \alpha_{41} S_{41j} + \gamma_{1}\,SFA_{j} + \varepsilon_{ij} \qquad (7.7)$$
16. We could have achieved the same ends by eliminating the standard intercept α0 and
retaining all of the school dummy predictors and their corresponding slope parame-
ters. Then, each parameter would represent the population average word-attack score
in its associated school, without the necessity to declare one as a “reference” school.
The results from the two alternative methods would be substantively identical.
A model such as Equation 7.7, in which each intact school has its own
intercept, is referred to as a fixed-effects of schools model, and the coeffi-
cients associated with the dummy predictors represent those fixed effects.
Unfortunately, there is a problem in using this model to examine the rela-
tive effectiveness of SFA, which you will recall was a school-wide approach
to teaching reading. If you try to fit the fixed-effects of schools model in Equation 7.7 to our SFA data, the statistical software will typically balk. Depending on how the software is written, the program will either cease to function or will respond by dropping one term—probably the predictor SFA, and its associated slope parameter—from the model. In either case, the effect of the SFA intervention on the outcome will not be estimated. The reason is that perfect collinearity exists between the SFA predictor and the full collection of school dummies. Recall that the SFA treatment was randomized to intact schools, and so predictor SFA is a dichotomous school-level predictor that possesses no variation within school. In other words, all the children in grades K-2 in any school have the same value on the SFA predictor. So, SFA is a
predictor that essentially distinguishes between two types of school. Of
course, the dichotomous school predictors already in the model are also
doing the same job. For instance, let’s suppose that schools 21 through 41
were the control schools. Then, if we knew that a schoolgirl had a value of
1 for any of the corresponding dummy predictors, S21 through S41, we
would know that she attended a control school, and we would not need to
know her value on variable SFA. Alternatively, if she had a value of zero
on this same set of school indicators, we would know for sure that she had
been assigned to the treatment condition.17 In fact, once the vectors of
school dummies are included as predictors in the model, they absorb all
of the school-level variation in the outcome, and you can no longer add
any other school-level predictors to the model. Consequently, you cannot
include the critical school-level question predictor SFA, whose associated
regression parameter addresses the all-important research question at the
heart of the evaluation!
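The perfect collinearity can be written down explicitly. Under the text's supposition that schools 21 through 41 were the control schools (so that schools 1 through 20 received SFA, with school 1 serving as the reference school in Equation 7.7), the SFA predictor is an exact linear combination of the intercept and the school dummies already in the model:

$$SFA_{j} \;=\; 1 - \left( S_{21j} + S_{22j} + \cdots + S_{41j} \right),$$

so it carries no school-level information that the fixed effects have not already absorbed.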
Given that you cannot include a school-level question predictor, such
as SFA, in a model that contains the fixed effects for schools, you may be
tempted to ask: What use then are such fixed-effects models? The answer
17. This explanation suggests a strategy that could be used effectively to estimate the
impact of the SFA treatment even in a fixed-effects-of-schools model from which
the SFA predictor had been omitted because of its complete redundancy. After fitting
the model containing all the school dummies and no SFA predictor, you can then
compare the average of the population intercepts of all the SFA-designated schools
with the average of the population intercepts of all non-SFA-designated schools using
a post-hoc general linear hypothesis (GLH) test, or linear-contrast analysis.
is that they are very useful if you want to control all variation in the
outcome at one level and pose an important research question at another.
This is a situation that occurs frequently in educational research, where
children are not just nested within a two-level hierarchy, but within hierar-
chies that are many levels deep. Thus, children may be nested within
intact classes, which are then nested within schools, which are nested
within districts, and so on. For instance, although we cannot retain the
SFA predictor in a multilevel model that contains the fixed effects of
schools, we could include the fixed effects of a school district, thereby
controlling all variation in the outcome at this higher level of clustering.
In addition, we could continue to account for the nesting of children
within school by using the standard random-effects strategy of including
a school-level residual. Such combining of the methods of fixed and
random effects proves to be a flexible analytic strategy for handling the
grouping of participants at multiple levels.
For instance, in the Tennessee Student/Teacher Achievement Ratio
(STAR) experiment, kindergarten students in each of 79 large elementary
schools were assigned randomly to either a small class (13–17 students), a regular-size class (22–25 students), or a regular-size class with a full-time teacher's aide. Kindergarten teachers in participating schools were then
randomly assigned to classes. Over the year, of course, even though
students were originally assigned randomly to classes, their shared unob-
served experiences over the academic year could “build up” an intraclass
correlation of substantial magnitude. In evaluating the impact of the
experimental treatments on children’s academic achievement, it then
becomes important to take into account that kindergarten students and
teachers were randomized to experimental conditions in intact classes
within each participating school, so that they were nested within both
classrooms and schools. The corresponding analyses can accommodate
this complexity by estimating the treatment effects in a statistical model
that contains the random effects of a classroom and the fixed effects of
schools. In other words, you can fit a random-intercepts model with a
class-level residual, but include a set of dichotomous control predictors to
distinguish among the schools.
You have deliberate choices that you can make about your multilevel-model specifications. In analyzing data from the STAR experiment, for instance, one option is simply to include random effects for children (ε), classes (u), and schools (v), as follows:
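A sketch of one such specification appears below; the outcome label, the treatment indicators, and the subscripting are illustrative only, with child i nested in class j within school k:

$$Y_{ijk} \;=\; \gamma_{0} + \gamma_{1}\,SMALL_{jk} + \gamma_{2}\,AIDE_{jk} + v_{k} + u_{jk} + \varepsilon_{ijk},$$

where $v_{k}$, $u_{jk}$, and $\varepsilon_{ijk}$ are the school-, class-, and child-level random effects, respectively.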
A natural experiment that took place during the Vietnam War era
provided Angrist with an opportunity to obtain unbiased estimates of the
impact of military draft eligibility on long-term labor market outcomes.
Between 1970 and 1975, the U.S. Department of Defense conducted five
draft lotteries that determined which American males in a particular age
group were eligible to be drafted into military service. The 1970 lottery
included men aged 19 through 26, and the lotteries in the four subse-
quent years included men aged 19 to 20. In each lottery, a random-sequence
number (RSN), ranging from 1 through 365, was assigned to each birth
date. Then, only men in the relevant age cohorts whose birthdays had
RSNs less than an exogenously determined ceiling, which was specified
by the Department of Defense each year, were subject to induction.
Angrist called such men “draft-eligible.” A simple comparison of the
annual earnings in the early 1980s for the group of men in a particular
cohort who were draft-eligible with those who were not provides an unbi-
ased estimate of the impact on earnings of being draft-eligible.
Note that the treatment being administered in this natural experiment
is “eligibility for the draft,” not the actual experience of military service.
This is because it is only the assignment of young men to draft eligibility
that was randomized, not military service itself. Indeed, some of those in
the draft-eligible “treatment group” avoided military service by enrolling
in college, by being declared unfit for military service due to physical lim-
itations, or by having been arrested prior to the draft. Among white males
born in 1950, 35% of those declared draft-eligible as a result of having a
low draft number actually served in the military, compared to 19% of
those declared draft-ineligible (Angrist, 1990; Table 2, p. 321). Of course,
the fact that many draft-eligible men did not serve in the military, and
some men who were not draft-eligible did serve, does not threaten the
unbiased estimation of the impact of being “draft-eligible” on later labor-
market outcomes. In fact, this natural experiment resembles the New York
Scholarship Program (NYSP), the investigator-designed experiment in
which the treatment was the randomized offer of a scholarship to help
pay private-school tuition. As explained in Chapter 4, not all families that
received the scholarship offer sent their child to a private school. As we
will see in Chapter 11, it is possible to use a more sophisticated technique,
called instrumental-variables estimation, to tease out the causal impact of
actual military service (or actual private-school attendance), using the
original randomly assigned offer as an “instrument.” In this chapter, how-
ever, we focus only on estimating the impact of draft eligibility on later
labor-market earnings.
Combining information from the draft lotteries with information on
subsequent earnings from the Social Security Administration, Angrist
3. We argue that a two-sided test makes sense for this example because it represents the
theoretical position that the treatment group could have either better or worse average
labor-market outcomes than the control group. We discuss the choice between one-
and two-tailed hypothesis testing in Chapter 6.
All else being equal, subtracting the average value of the outcome for the
control group from the average value of the outcome for the treatment
group provides an unbiased estimate of the causal impact of the financial-
aid offer on the subsequent college-going behavior of students whose
fathers were deceased. This is often called a “first-difference” estimate of
the treatment effect. Notice that the action by the Congress to eliminate
the SSSB program allowed Dynarski to study the causal impact of finan-
cial aid on the college-enrollment decisions of students without facing the
ethical questions that an investigator-designed randomized experiment
might have elicited.
We focus here on the first of several outcomes that Dynarski studied—
whether the student attended college by age 23. We define COLLi as a
dichotomous outcome variable that we code 1 if the ith high-school senior
attended college by age 23 and 0 otherwise. We present summary statis-
tics on outcome COLL in the upper panel of Table 8.1, in rows that
distinguish the treatment and control groups of students whose fathers
were deceased.4 The first row in the upper panel contains summary infor-
mation on the 137 high-school seniors who received the SSSB aid offer in
the years 1979 through 1981 (our “treatment” group). The second row
contains parallel information on the 54 high-school seniors who would
have been eligible for SSSB aid in 1982 through 1983, but did not receive
an SSSB aid offer because the program was cancelled (our “control”
group). The sample averages of COLL in the treatment and control groups
are 0.560 and 0.352, respectively, which means that 56% of students who
received an offer of tuition aid attended college by age 23, whereas only
35% of those who did not receive the offer did so.
Provided that this “natural” assignment of high-school seniors whose
fathers were deceased to the treatment and control conditions rendered the
two groups equal in expectation initially, we can obtain an unbiased esti-
mate of the population impact of a financial-aid offer on college attendance
by age 23 among students with deceased fathers. One method is to simply
estimate the sample between-group difference in outcome means D1:
$$D_{1} \;=\; \overline{COLL}^{\,(\text{Father Deceased},\,79\rightarrow 81)}_{\bullet} \;-\; \overline{COLL}^{\,(\text{Father Deceased},\,82\rightarrow 83)}_{\bullet} \;=\; 0.560 - 0.352 \;=\; 0.208 \qquad (8.2)$$
4. We thank Susan Dynarski for providing her dataset. All our analyses of these data account
for the cluster sampling and weighting in the complex survey design of the NLSY.
Table 8.1 “First difference” estimate of the causal impact of an offer of $6,700 in
financial aid (in 2000 dollars) on whether high-school seniors whose fathers were
deceased attended college by age 23 in the United States
Columns: H.S. Senior Cohort; Number of Students; Was Student's Father Deceased?; Did H.S. Seniors Receive an Offer of SSSB Aid?; Avg. Value of COLL (standard error); Between-Group Difference in Avg. Value of COLL; and, for H0: μOFFER = μNO OFFER, the t-statistic and p-value.
financial aid did not affect college attendance by age 23 among students whose fathers were deceased, in the population, as follows:5
$$
\begin{aligned}
t \;&=\; \frac{\overline{COLL}^{\,(\text{Father Deceased},\,79\rightarrow 81)}_{\bullet} - \overline{COLL}^{\,(\text{Father Deceased},\,82\rightarrow 83)}_{\bullet}}
{s.e.\!\left(\overline{COLL}^{\,(\text{Father Deceased},\,79\rightarrow 81)}_{\bullet} - \overline{COLL}^{\,(\text{Father Deceased},\,82\rightarrow 83)}_{\bullet}\right)} \\[6pt]
&=\; \frac{\overline{COLL}^{\,(\text{Father Deceased},\,79\rightarrow 81)}_{\bullet} - \overline{COLL}^{\,(\text{Father Deceased},\,82\rightarrow 83)}_{\bullet}}
{\sqrt{\left[s.e.\!\left(\overline{COLL}^{\,(\text{Father Deceased},\,79\rightarrow 81)}_{\bullet}\right)\right]^{2} + \left[s.e.\!\left(\overline{COLL}^{\,(\text{Father Deceased},\,82\rightarrow 83)}_{\bullet}\right)\right]^{2}}} \\[6pt]
&=\; \frac{(0.560 - 0.352)}{\sqrt{(0.053)^{2} + (0.081)^{2}}} \;=\; \frac{0.208}{0.097} \;=\; 2.14
\end{aligned}
\qquad (8.3)
$$
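The arithmetic of Equations 8.2 and 8.3 fits in a few lines of Python; the function name is illustrative, and the means and standard errors are those reported in Table 8.1.

```python
from math import sqrt

def first_difference_test(mean_t, se_t, mean_c, se_c):
    """First-difference estimate of a treatment effect and its approximate t-statistic,
    treating the two group means as independent (Equations 8.2 and 8.3)."""
    diff = mean_t - mean_c
    return diff, diff / sqrt(se_t ** 2 + se_c ** 2)

d1, t = first_difference_test(mean_t=0.560, se_t=0.053, mean_c=0.352, se_c=0.081)
print(round(d1, 3), round(t, 2))  # 0.208 and about 2.15 (2.14 in the text, which rounds the s.e. first)
```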
We could also have summarized and tested the impact of the receipt
of the higher-education financial-aid offer in the same dataset by using
OLS methods to fit a linear-probability model or we could have used
logistic regression analysis to regress our outcome COLL on a dichoto-
mous question predictor OFFER, defined to distinguish students in
the treatment and control groups (coded 1 if relevant students became
high-school seniors in 1979, 1980, or 1981, and therefore received an
offer of post-secondary tuition support; 0 otherwise), as in the case of an
investigator-designed experiment. We present the corresponding fitted
linear-probability model in the lower panel of Table 8.1. Notice that the
parameter estimate associated with the treatment predictor has a magni-
tude identical to our “difference estimate” in Equation 8.2.6 Using either
of these strategies, we can reject the standard null hypothesis associated
with the treatment effect, and conclude that the offer of financial aid did
5. The results of this t-test are approximate, as the outcome COLL is a dichotomous, not
continuous, variable.
6. Notice that the standard error associated with the main effect of OFFER, at 0.094, is
slightly different from the estimate provided in Equation 8.3, which had a value of
0.097. This occurs because, although both statistics are estimates of the standard error
of the same treatment effect, they are based on different assumptions. The OLS-based
estimate makes the more stringent assumption that residuals are homoscedastic at each
value of predictor OFFER—that is, essentially, that the within-group variances of the outcome are the same in both the treatment and control groups, whereas the hand-computed estimate permits these within-group variances to differ.
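As a sketch of the linear-probability route described above, and of the standard-error contrast in footnote 6, one could proceed as follows in Python with statsmodels. The data frame here is simulated stand-in data, built only to mimic the structure of Dynarski's father-deceased subsample; it is purely illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the father-deceased subsample: 137 seniors with the SSSB offer,
# 54 without, with college-going probabilities close to those reported in Table 8.1.
rng = np.random.default_rng(0)
df = pd.DataFrame({"OFFER": np.repeat([1, 0], [137, 54])})
df["COLL"] = rng.binomial(1, np.where(df["OFFER"] == 1, 0.56, 0.35))

# Linear-probability model: classical (homoscedastic) versus heteroscedasticity-robust SEs
lpm_classical = smf.ols("COLL ~ OFFER", data=df).fit()
lpm_robust = smf.ols("COLL ~ OFFER", data=df).fit(cov_type="HC1")
print(lpm_classical.params["OFFER"], lpm_classical.bse["OFFER"], lpm_robust.bse["OFFER"])
```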
7. Note that we used a one-tailed test because we had a strong prior belief that the offer of
scholarship aid would increase the probability of college attendance, not decrease it.
there were a great many Chicago students who scored just below, or just
above, the 2.8 cut-off point on the end-of-third-grade mathematics exami-
nation. By comparing the average mathematics scores of these two groups
one year and two years later, they estimated the causal impact of the man-
datory summer school and associated promotion policy. They found that
the treatments increased average student achievement by an amount
equal to 20% of the average amount of mathematics learning that typical
Chicago public school third-graders exhibited during the school year, an
effect that faded by 25%–40% over a second year.8
John Papay, Richard Murnane, and John Willett (2010) took advantage
of a structurally similar natural experiment in Massachusetts. Beginning
with the high-school class of 2003, Massachusetts public school students
have had to pass state-wide examinations in mathematics and English lan-
guage arts in order to obtain a high-school diploma. Students take the
examinations at the end of the tenth grade, and those who fail to score
above the exogenously defined minimum passing score may retake the
examinations in subsequent years. The research team found that being
classified as just failing on the mandatory mathematics test (when com-
pared to those students who just passed) lowered by 8 percentage points
the probability that low-income students attending urban high schools
graduated on time. Since the research had a discontinuity design, the
comparison was between low-income urban students whose scores fell
just below the minimum passing score on the forcing variable and those
whose scores fell just above the minimum passing score.
Other examples of natural experiments of this type originate in the
rules that schools and school districts adopt to set maximum class size. In
settings with maximum class-size rules, students are arrayed on a forcing
variable defined by the number of students enrolled in a school at their
grade level. If the number of enrolled students is no greater than the maximum class size, say, 40 students, then the students' class size will be equal to the number of students enrolled at that grade level. However, if the number
of enrolled students is slightly greater than the maximum class size, say,
42 students, then the expected class size would be 21 students because a
second class must be added to the grade in order to comply with the
maximum class-size policy. As with other natural experiments, students in
schools with class-size maximums have then been arrayed implicitly on a
forcing variable (enrollment at that grade level), an exogenous cut-point
8. Jacob and Lefgren (2004, p. 235). The policy also applied to sixth-graders, and the
results of the intervention were different for this group than for third-graders.
(the class-size maximum) has been defined, and students who fall just to
one side of the cut-point experience different educational treatments
(class sizes) than do students who fall just to the other side of the
cut-point. Discontinuity studies of the impact of class size on average
student achievement that were based on maximum class-size rules have
been conducted using data from many countries, including Bolivia,
Denmark, France, Israel, the Netherlands, Norway, South Africa, and the
United States.9
In summary, natural experiments occur most frequently with disconti-
nuity designs, and are derived usually from three common sources of
discontinuity. First, a natural disaster or an abrupt change in a policy can
assign individuals or organizations that are in the same geographical juris-
diction randomly to different educational treatments at temporally
adjacent points in time. Dynarski (2003) analyzed data from a natural
experiment of this type. Second, exogenous differences in policies across
geographical jurisdictions at the same point in time can assign individuals
or organizations randomly to different policies based on their location.
Tyler and his colleagues (2000) made use of a natural experiment of this
type. Third, policies in a particular jurisdiction at a particular point in
time can assign individuals randomly to different educational treatments
based on their values on a forcing variable such as a test score, a measure
of socioeconomic status, or the number of students enrolled in a grade in
a particular school. As we explain in the next chapter, Angrist and Lavy
(1999) studied a natural experiment of this type.
Some natural experiments fall into more than one category. For example,
the natural experiment created by the Chicago mandatory summer-school
policy that Jacob and Lefgren (2004) studied falls into both our first and
third categories. We have already described how that study fell into the
third category—with end-of-school-year scores on the third-grade ITBS
mathematics achievement test as the forcing variable. However, it also
falls into the first category—with a temporal forcing variable—because
Chicago students who were in the third grade in the 1996–1997 school
year were subject to the policy, whereas those who were in the third grade
in the 1995–1996 school year were not. In fact, Jacob and Lefgren took
advantage of both of these attributes of the natural experiment cleverly in
their analytic strategy.
9. Relevant studies include: Browning and Heinesen (2003); Leuven, Oosterbeek, and
Rønning (2008); Case and Deaton (1999); Dobbelsteen, Levin, and Oosterbeek (2002);
Boozer and Rouse (2001); Angrist and Lavy (1999); Urquiola (2006), and Hoxby (2000).
10. An exception would be the case in which participants succeed in subverting their
assignment and “cross-over” to the condition to which they were not assigned. In
Chapter 11, we describe a solution to this problem.
only 1981 high-school seniors with deceased fathers and her control group
as including only 1982 high-school seniors with deceased fathers.
Tightening her focus on the students with deceased fathers who were
most immediately adjacent to the cut-off year would have strengthened
Dynarski’s claim that the treatment and control groups were likely to be
equal in expectation, prior to treatment. The reason is that this decision
would have provided little time for anything else to have occurred that
could have affected college enrollment decisions for the relevant high-
school seniors. However, this would also have reduced the sample sizes of
her treatment and control groups dramatically, thereby reducing the sta-
tistical power of her research. Even with the slightly broader criteria that
Dynarski did use, there were only 137 high-school seniors in the treat-
ment group and 54 in the control group—a comparison that provides very
limited statistical power.
On the other hand, Dynarski could have expanded her definitions of
the treatment and control groups. For example, she might have included
in her treatment group all students with deceased fathers who graduated
from high school in any year from 1972 through 1981. Similarly, she could
have included in her control group students with deceased fathers who
graduated from high school in any year from 1982 through 1991. Using
such a ten-year window on either side of the discontinuity certainly would
have increased her sample size and statistical power dramatically. However,
had Dynarski widened the analytic window around the cut-point, she
would have found it more difficult to argue that seniors in the treatment
and control groups were equal in expectation initially. The reason is that
unanticipated events and longer-term trends might have had a substantial
influence on high-school seniors’ college-enrollment decisions. For exam-
ple, the differential between the average earnings of college graduates
and high-school graduates fell dramatically during the 1970s and then
rose rapidly during the 1980s.11 As a result, the incentives for high-school
seniors to attend college in the early 1970s were quite different from the
incentives facing high-school seniors in the late 1980s. Thus, widening the
analytic window would have cast into doubt the claim that any observed
difference between the treatment and control groups in college-going by
age 23 was solely a causal consequence of the elimination of the SSSB
financial-aid offer.
Taking these limiting cases into account, it may seem reasonable to
assume that high-school seniors with deceased fathers in the years within
11. For a discussion of the causes and consequences of trends in the college/high-school
wage differential, see Freeman (1976), Goldin and Katz (2008), and Murnane and
Levy (1996).
transfer themselves knowingly from one side of the cut-off to the other.
This jeopardizes the exogeneity of the assignment process and under-
mines the assumption of equality of expectation for those in the treatment
and control groups. We discuss each of these threats in turn.
12. See Dynarski (2003, p. 283, fn. 14). As our colleague Bridget Long pointed out,
another factor that influenced college-enrollment rates during the period 1978–1983
was an increase in college tuition prices.
men within each yearly cohort to either the treatment or control group
allayed the problem. With a random-assignment design, chronological
year would simply function as a stratifier, and the equality of expectation
assumption would be met by the randomization of participants to the dif-
ferent experimental conditions within each yearly stratum, and, hence,
overall. In a discontinuity design, on the other hand, participants in the
treatment and control conditions are, by definition, drawn from groups
at immediately adjacent values of the forcing variable (adjacent by year, by
geography, by test score, or by whatever defines the forcing variable on
which the exogenous cut-off has been imposed). Then, if an underlying
relationship exists between outcome and forcing variable (as it often
does!), the resulting small differences between participants in the treat-
ment and control groups on their values of the forcing variable may also
result in differences in the outcome between the groups.13
When faced with this kind of threat to the internal validity of a natural
experiment with a discontinuity design, investigators often respond by
correcting their estimate of the treatment effect using what is known as a
difference-in-differences strategy. Recall that we construed our estimate of
the impact of a financial-aid offer in the Dynarski example as simply the
difference in the sample average value of the binary outcome COLL
between seniors with deceased fathers assigned to the financial–aid-offer
treatment group and those assigned to the no financial-aid-offer control
group. As shown earlier, we can estimate and test this difference easily,
once we have chosen an appropriate bandwidth on either side of the cut-
off score within which to estimate the respective average values of the
outcome. In what follows, we refer to this as the first difference, D1, and we
presented it earlier in Equation 8.2 and Table 8.1.
Now let’s correct our first difference for the anticipated threat to validity
due to the small chronological difference in years of assignment between
the treatment and control groups. We can do this by subtracting, from
the first difference, a second difference that estimates the consequences of
any “secular” trend in college-going by age 23 that might have affected all
high-school seniors over the same period, including those eligible for
SSSB benefits. To do this, we need to estimate this latter secular trend
over time, so that it can be “subtracted out.” In the case of Dynarski’s
13. One can also argue that this “secular trend” problem is exacerbated when the forcing
variable is not continuous, but is coarsely discrete (as were the “years” in the Dynarski
example). If the assignment variable were continuous and the cut-off selected exoge-
nously, then—in the limit—participants in the vanishingly small regions to either side
of the cut-off would be mathematically equal in expectation, by definition. However,
the number of participants—and hence the sample size in the ensuing natural experi-
ment—in these infinitesimal regions would also be vanishingly small.
(a) Students with deceased fathers; (b) Students whose fathers were not deceased. Each panel plots p{College} (vertical axis, 0.3 to 0.6) against year (Pre-1981 versus Post-1981).
Figure 8.1 Sample probability of attending college by age 23 among high-school seniors,
in the United States, immediately before and after the elimination of the SSSB program,
by whether their fathers were deceased or not.
COLL, but that its magnitude (0.026) is considerably smaller than that of
the first difference. We display this second difference in the right-hand
panel of Figure 8.1, where the line segment joining the sample average
values of COLL, before and after the cancellation of the SSSB program,
has a smaller negative slope.
Now, we have two estimated differences. We could argue that our first
difference D1 estimates the population impact (call this ∆1) on college-
going by age 23 of both the elimination of financial aid and any impact
of a secular decline in college-going over the same period. Our second
difference D2 provides an estimate of just the population secular decline
in college-going over this same period (call this ∆2). We can now remove
the impact of the secular time trend from our estimate of the causal
effect of financial aid by subtracting the second difference from the first,
as follows:
$$
\begin{aligned}
D \;&=\; D_{1} - D_{2} \\[4pt]
&=\; \left\{ \overline{COLL}^{\,(\text{Father Deceased},\,79\rightarrow 81)}_{\bullet} - \overline{COLL}^{\,(\text{Father Deceased},\,82\rightarrow 83)}_{\bullet} \right\}
 \;-\; \left\{ \overline{COLL}^{\,(\text{Father Not Deceased},\,79\rightarrow 81)}_{\bullet} - \overline{COLL}^{\,(\text{Father Not Deceased},\,82\rightarrow 83)}_{\bullet} \right\} \\[4pt]
&=\; (0.560 - 0.352) - (0.502 - 0.476) \\
&=\; 0.208 - 0.026 \\
&=\; 0.182
\end{aligned}
\qquad (8.4)
$$
$$H_{0}:\;\Delta \;=\; \Delta_{1} - \Delta_{2} \;=\; 0 \qquad (8.5)$$
$$
\begin{aligned}
t \;&=\; \frac{\left\{ \overline{COLL}^{\,(\text{FD},\,79\rightarrow 81)}_{\bullet} - \overline{COLL}^{\,(\text{FD},\,82\rightarrow 83)}_{\bullet} \right\} - \left\{ \overline{COLL}^{\,(\text{FND},\,79\rightarrow 81)}_{\bullet} - \overline{COLL}^{\,(\text{FND},\,82\rightarrow 83)}_{\bullet} \right\}}
{s.e.\!\left[\left\{ \overline{COLL}^{\,(\text{FD},\,79\rightarrow 81)}_{\bullet} - \overline{COLL}^{\,(\text{FD},\,82\rightarrow 83)}_{\bullet} \right\} - \left\{ \overline{COLL}^{\,(\text{FND},\,79\rightarrow 81)}_{\bullet} - \overline{COLL}^{\,(\text{FND},\,82\rightarrow 83)}_{\bullet} \right\}\right]} \\[6pt]
&=\; \frac{\left\{ \overline{COLL}^{\,(\text{FD},\,79\rightarrow 81)}_{\bullet} - \overline{COLL}^{\,(\text{FD},\,82\rightarrow 83)}_{\bullet} \right\} - \left\{ \overline{COLL}^{\,(\text{FND},\,79\rightarrow 81)}_{\bullet} - \overline{COLL}^{\,(\text{FND},\,82\rightarrow 83)}_{\bullet} \right\}}
{\sqrt{\left[s.e.\!\left(\overline{COLL}^{\,(\text{FD},\,79\rightarrow 81)}_{\bullet}\right)\right]^{2} + \left[s.e.\!\left(\overline{COLL}^{\,(\text{FD},\,82\rightarrow 83)}_{\bullet}\right)\right]^{2} + \left[s.e.\!\left(\overline{COLL}^{\,(\text{FND},\,79\rightarrow 81)}_{\bullet}\right)\right]^{2} + \left[s.e.\!\left(\overline{COLL}^{\,(\text{FND},\,82\rightarrow 83)}_{\bullet}\right)\right]^{2}}} \\[6pt]
&=\; \frac{(0.560 - 0.352) - (0.502 - 0.476)}{\sqrt{(0.053)^{2} + (0.081)^{2} + (0.012)^{2} + (0.019)^{2}}} \;=\; \frac{0.182}{0.099} \;=\; 1.84
\end{aligned}
\qquad (8.6)
$$

where FD and FND denote the groups of high-school seniors whose fathers were, and were not, deceased, respectively.
This t-statistic is large enough to reject the null hypothesis using a one-tailed test with Type I error = 0.05 (p < 0.03).15 Consequently, we again conclude that, in the population of high-school seniors around the 1981 cut-off whose fathers were deceased, financial aid mattered in the decision to go to college by age 23.
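The same difference-in-differences arithmetic can be packaged in a few lines of Python; the function name is illustrative, and the four group means and standard errors are those used in Equations 8.4 and 8.6.

```python
from math import sqrt

def diff_in_diff(m_t_pre, m_t_post, m_c_pre, m_c_post,
                 se_t_pre, se_t_post, se_c_pre, se_c_post):
    """Difference-in-differences estimate and approximate t-statistic, treating the
    four group means as independent (Equations 8.4 and 8.6)."""
    d = (m_t_pre - m_t_post) - (m_c_pre - m_c_post)
    se = sqrt(se_t_pre ** 2 + se_t_post ** 2 + se_c_pre ** 2 + se_c_post ** 2)
    return d, d / se

# Father-deceased group (1979-81 vs. 1982-83), then father-not-deceased group
d, t = diff_in_diff(0.560, 0.352, 0.502, 0.476, 0.053, 0.081, 0.012, 0.019)
print(round(d, 3), round(t, 2))  # 0.182 and about 1.83 (1.84 in the text, which rounds the s.e. first)
```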
We noted earlier that we did not necessarily need to use a t-test to
test the size of the population first difference. Instead, we could regress
15. This is based on a normal approximation because of the large total sample size (n = 3,986)
across the four groups.
Table 8.3 Information on cohort membership and college attendance by age 23 in the
United States for selected high-school seniors in the 1979, 1980, 1981, 1982, and 1983
cohorts, with accompanying information on whether they were offered financial aid
(those in cohorts 1979–1981) and whether their father was deceased
Student ID   H.S. Senior Cohort   COLL   OFFER   FATHERDEC
7901         1979                 1      1       1
7902         1979                 0      1       1
7903         1979                 1      1       0
7904         1979                 1      1       1
…
8001         1980                 1      1       0
8002         1980                 1      1       1
8003         1980                 0      1       1
8004         1980                 0      1       0
…
8101         1981                 0      1       1
8102         1981                 1      1       0
8103         1981                 1      1       0
8104         1981                 1      1       0
…
8201         1982                 1      0       1
8202         1982                 0      0       0
8203         1982                 0      0       0
8204         1982                 0      0       1
…
8301         1983                 0      0       1
8302         1983                 1      0       1
8303         1983                 0      0       1
8304         1983                 1      0       0
…
16. The reason why the regression model in Equation 8.7 embodies the difference-in-
differences approach is easily understood by taking conditional expectations
throughout, at all possible pairs of values of predictors OFFER and FATHERDEC.
This provides population expressions for the average value of the outcome in each of
the four groups present in the difference-in-differences estimate, which can then be
subtracted to provide the required proof.
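To make the footnote's argument concrete, suppose that Equation 8.7 is a linear-probability model of the general form $COLL_{i} = \beta_{0} + \beta_{1}OFFER_{i} + \beta_{2}FATHERDEC_{i} + \beta_{3}(OFFER_{i} \times FATHERDEC_{i}) + \varepsilon_{i}$ (a rendering assumed here for illustration). Taking conditional expectations in each of the four cells gives:

$$
\begin{aligned}
E[COLL \mid OFFER=1,\, FATHERDEC=1] &= \beta_{0} + \beta_{1} + \beta_{2} + \beta_{3},\\
E[COLL \mid OFFER=0,\, FATHERDEC=1] &= \beta_{0} + \beta_{2},\\
E[COLL \mid OFFER=1,\, FATHERDEC=0] &= \beta_{0} + \beta_{1},\\
E[COLL \mid OFFER=0,\, FATHERDEC=0] &= \beta_{0}.
\end{aligned}
$$

Differencing the first pair, differencing the second pair, and then subtracting the second result from the first leaves $\beta_{3}$ alone, so the coefficient on the two-way interaction is exactly the population difference-in-differences parameter $\Delta$.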
17. Notice that the standard error associated with the two-way interaction of OFFER and
FATHERDEC, at 0.096, is marginally different from the estimate provided in Equa-
tion 8.6, which had a value of 0.099. This small difference occurs because, although
both are estimates of the standard error of the corresponding statistic, they are based
on slightly different assumptions. As usual, the OLS-based estimate makes the more
stringent assumption that residuals are homoscedastic at each level of predictors OFFER
and FATHERDEC—that is, that the within-group variances of the outcome are the
same in all four groups, whereas the hand-computed estimate permits the within-group variances to differ, in the population. If the more stringent assumption of
homoscedasticity had been applied in both cases, then the standard-error estimates
would have been identical and equal to the regression-based estimate.
more readily and also to match the findings in Dynarski’s paper. However,
even if we had specified a more appropriate nonlinear logit (or probit)
model, the predictor specification would have been identical and the
results congruent.
Finally, note that we have made two additional assumptions in obtain-
ing our difference-in-differences estimate of the treatment effect. Both
these assumptions concern whether it is reasonable to use this particular
“second” difference in average outcome to adjust for any secular trend in
the college-going decisions of high-school seniors whose fathers were
deceased. One of the assumptions is mathematical, the other conceptual.
First, in computing both the first and second differences, we are making
the “mathematical” assumption that it is reasonable to use a simple differ-
ence in average values to summarize trends in the outcome as a function
of the forcing variable, and that the mere subtraction of these differences
from one another does indeed adjust the experimental effect-size esti-
mate for the secular trend adequately. For instance, in our current
example, we obtained the first difference by subtracting the average value
of COLL in a pair of groups formed immediately before and after the
1981 cut-point, for both the high-school seniors whose fathers had died
and for those whose fathers had not died. Both these differences are
therefore rudimentary estimates of the slope of a hypothesized linear
trend that links average college-going and chronological year (grouped as
stipulated in our estimation), in the presence and absence of any experi-
mental treatment, respectively. Our confidence in the appropriateness of
these differences—and the value of the difference-in-differences estimate
derived from them—rests heavily on an assumption that the underlying
trends are indeed linear and can be estimated adequately by differencing
average outcome values at pairs of adjacent points on the horizontal axis.
These assumptions will not be correct if the true trend linking average
college-going to year is nonlinear. We have no way of checking this
assumption without introducing more data into the analysis and without
conducting explicit analyses to test the assumption. However, as we will
see in the following chapter, in which we describe what is known as the
regression-discontinuity design, if more data are available, there are not
only ways of testing the implicit linearity assumption, but also of obtaining
better estimates of the conceptual “second difference.”
Our second assumption is substantive, rather than mathematical, and
concerns the credibility of using the sample of students whose fathers
had not died to estimate an appropriate second difference. In our presen-
tation, we followed Dynarski in making the hopefully credible argument
that the trend in the average college-going rate for students whose fathers
were not deceased provided a valid estimate of the trend in the college-going
rate for students whose fathers were deceased. Of course, this may not be
true. Regardless, our point is that you must be cautious in applying the
difference-in-differences strategy. As the Dynarski (2003) paper illustrates,
it can be an effective strategy for analyzing data from a natural experi-
ment with a discontinuity design and for addressing an important
educational policy question. However, as we have explained, there are
additional assumptions underlying its use, and the researcher’s obliga-
tion is to defend the validity of these assumptions, as Dynarski did so
successfully in her paper.
18. Of course, there are also circumstances in which individuals would prefer not to be in
the treatment group. For example, most third-graders in the Chicago public schools
hoped not to be assigned to mandatory summer school.
research sample. She chose not to do this for good reason. During the
years in which the SSSB program was in effect, some fathers of college-
bound high-school seniors could have chosen to retire or to press a disability
claim in order to acquire SSSB benefits for their child. Demographically
similar parents of students who were high-school seniors in 1982–1983
would not have had this incentive. Such actions could have led to unob-
served differences between participants in the treatment and control
groups thus defined, including perhaps differences in their interest in
enrolling in college. Thus, including students with disabled or retired
fathers in the research sample would probably have introduced bias into
the estimation of the treatment effect, and undermined the internal valid-
ity of the research. In subsequent chapters, we return to these threats to
the exogeneity of the assignment of individuals to treatment and control
groups in natural experiments with a discontinuity design.
Joshua Angrist and Victor Lavy (1999) used data from an interesting natu-
ral experiment in Israel to examine whether class size had a causal impact
on the achievement of third-, fourth-, and fifth-grade students. The source
of their natural experiment was an interpretation by the 12th-century rabbinic scholar Maimonides of a discussion in the 6th-century Babylonian Talmud about the most appropriate class size for Bible study. Maimonides
ruled that class size should be limited to 40 students. If enrollment
exceeded that number, he stipulated that another teacher must be
appointed and the class split, generating two classes of smaller size.1 Of
critical importance to the work of Angrist and Lavy is that, since 1969, the
Education Ministry in Israel has used Maimonides’ rule to determine the
number of classes each elementary school in Israel would need, each year,
at each grade-level. Children entering a grade in a school with an enroll-
ment cohort of 40 students or fewer, for instance, would be assigned to
a single class containing that number of students. In another school
with an enrollment cohort of size 41 at the same grade level, an extra
teacher would be hired and two classes established, each containing 20 or 21 students.
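As described in this paragraph, Maimonides' rule maps grade-level enrollment mechanically into an intended class size. A minimal Python sketch of that mapping follows; the function name is illustrative, and it assumes each cohort is split as evenly as possible across the smallest number of classes of at most 40 students.

```python
import math

def intended_class_size(enrollment, cap=40):
    """Average intended class size under Maimonides' rule: the cohort is divided into
    the smallest number of classes whose sizes do not exceed the cap."""
    n_classes = math.ceil(enrollment / cap)
    return enrollment / n_classes

for e in (38, 40, 41, 46, 80, 81):
    print(e, round(intended_class_size(e), 1))
# e.g., 40 -> 40.0, 41 -> 20.5 (classes of 20 and 21), 81 -> 27.0
```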
2. We thank Joshua Angrist for providing these data. The complete dataset contains
grade-level, school-specific enrollment cohorts that range in size from 8 through 226,
all of which are used in the research described in Angrist and Lavy (1999). For peda-
gogical simplicity, we focus on a subset of the data in which grade-level cohort enrollment
sizes fall close to the first Maimonides-inspired class-size cut-off of 40 students.
Table 9.1 Fifth-grade average reading achievement and average class size (intended and
observed) in Israeli Jewish public schools, by fifth-grade enrollment-cohort size, around
the Maimonides-inspired cut-off of 40 students
In each row in the table, we also list the number of classrooms of that
cohort size in the sample (column 2), the intended and observed average
class sizes in those classrooms (columns 3 and 4), and the standard devia-
tions of the class-average reading achievement across the classes in the
cohort (column 7). We also label the cohort by whether it provides a nom-
inally “large” or “small” class-size treatment to the students it contains
(column 5). Notice that average reading achievement does indeed appear
to differ by class size. Children entering fifth grade in enrollment cohorts
of size 36 through 40, who would be assigned to “large” classes by
Maimonides’ rule, tend to have average achievement in the high 60s.
On the other hand (except in cohorts containing 42 children), children
entering the fifth grade in enrollment cohorts of sizes 41 through 46, and
who are therefore assigned to smaller classes by Maimonides’ rule, tend
to have average achievement in the mid-70s.
If the schools contributing the 37 classrooms that are listed in the fifth
and sixth rows of Table 9.1 had obeyed Maimonides’ rule to the letter, we
would anticipate a “large” average class size of 40 in the nine schools that
had an entering cohort with an enrollment of 40 and a small average class
size of 20.5 (that is, half of 41) in the 28 classrooms in schools in which
41 students started the school year in the fifth-grade cohort. However, for
reasons not yet apparent, among those large classes that were intended to
contain 40 students, the observed average class size is actually less than
30. In the small classes that were intended to have an average size of 20.5,
the observed average class size is closer to the intended class size. However,
at 22.7, it is still two students larger than the Maimonides’ rule-intended
class size. Inspecting the data on these classrooms, on a school-by-school
basis, provides a clue as to why these anomalies are observed “on the
ground.” Among the schools with 40 children in their fifth-grade enter-
ing cohort, only four of the nine classrooms actually have class sizes of
exactly 40. One of the remaining classrooms contains 29 students, and
the other four have class sizes of around 20 students.3 Thus, actual class sizes were
smaller than the class sizes intended by application of Maimonides’ rule.
One possible explanation for this difference hinges on the timing of the
enrollment measure. Cohort enrollments in this dataset were measured
in September, at the beginning of the academic year. It is possible for a
student to be added after the start of a school year to a fifth-grade cohort
with an initial size of 40 students, and then for the class to be divided
subsequently into two.
Is the discordance between actual class sizes and those anticipated by
application of Maimonides’ rule problematic for our evaluation? The
answer is “not necessarily,” providing we keep the specific wording of our
research question in mind. Recall the important distinction that we made
between “intent-to-treat” and “treated” in our discussion of the New York
Scholarship Program (NYSP) in Chapter 4. In the NYSP, it was receipt of
a private-school tuition voucher—which we regarded as an expression of
intent to send a child to private school—that was assigned randomly to
participating families. It was the causal impact of this offer of a private-
school education that could then be estimated without bias. Of course,
subsequently, some of the families that received vouchers chose not to
use them, and so the act of actually going to private school involved con-
siderable personal choice. This means that variation in a private-school
treatment, across children, was potentially endogenous. Consequently, a
comparison of the average achievement of children attending private
schools and those attending public schools could not provide an unbiased
estimate of the causal impact of a private-school treatment. Nonetheless,
it remained valuable to obtain an unbiased estimate of the causal impact
of the expressed intent to treat—that is, of the offer of a subsidy for pri-
vate-school tuition through the randomized receipt of a voucher—because
it is such offers that public policies typically provide.
3. The actual sizes of these four classes were 18, 20, 20, and 22.
The natural experiment linking class size and aggregate student reading
achievement among Israeli fifth-graders provides an analogous situation.
The Maimonides’ rule-inspired class sizes are the intended treatment,
and we can estimate the causal effect on student achievement of this
intent to place each student into either a large or small class. It is certainly
useful to know the impact of such an offer, because it provides informa-
tion about the consequences of a policy decision to use Maimonides’ rule
to determine class sizes. However, it does not tell us how an actual reduc-
tion in observed class size would affect student achievement.4
4. In Chapter 11, we explain how exogenous variation in the offer of a treatment can be
used to tease out the causal impact of the actual treatment, using instrumental-variables
estimation.
5. This test treats classroom as the unit of analysis, and the degrees of freedom reflect a
count of those classes.
A Difference-in-Differences Analysis
Recall the logic underlying the assumption of equality in expectation
across the cut-off on the forcing variable in a natural experiment with a
discontinuity design. We argued that, if the cut-off were determined exog-
enously, then only idiosyncratic events—such as the haphazard timing of
birth—will determine whether a particular student fell to the left or right
of the cut-off on the forcing variable, at least within a reasonably narrow
“window” on either side of the cut-off. If this argument is defensible, then
any unobserved differences between the students who were offered large
classes and those offered small classes at the beginning of the academic year will be inconsequential. On the other hand, we cannot ignore
the possible presence of a sizeable “second difference.” Even though stu-
dents who received large-class offers and those who received small-class
offers are separated nominally by only one child on the underlying forc-
ing variable (the enrollment-size continuum), it is possible that there may
8. One-sided test, with the t-statistic computed from the summary statistics presented in Table 9.1, as follows:

$$
t \;=\; \frac{\left\{\bar{Y}^{[41]}_{\bullet} - \bar{Y}^{[40]}_{\bullet}\right\} - \left\{\bar{Y}^{[39]}_{\bullet} - \bar{Y}^{[38]}_{\bullet}\right\}}
{\sqrt{\dfrac{s^{2}_{[41]}}{n_{[41]}} + \dfrac{s^{2}_{[40]}}{n_{[40]}} + \dfrac{s^{2}_{[39]}}{n_{[39]}} + \dfrac{s^{2}_{[38]}}{n_{[38]}}}}
\;=\; \frac{\{73.68 - 67.93\} - \{68.87 - 67.85\}}
{\sqrt{\dfrac{8.77^{2}}{28} + \dfrac{7.87^{2}}{9} + \dfrac{12.07^{2}}{10} + \dfrac{14.04^{2}}{10}}}
\;=\; \frac{4.73}{6.63} \;=\; 0.71
$$

where the superscripts and subscripts in brackets distinguish the enrollment cohorts of sizes 38, 39, 40, and 41, respectively.
We return to this issue next, where we illustrate how you can use the
regression-discontinuity approach to be more systematic about these choices
and can examine the sensitivity of findings to alternative bandwidths.
cohorts. If that is the case, then a slope estimate based on only two data
points could be very imprecise. Clearly, the difference-in-differences
method of estimating the magnitude of the treatment effect relies heavily
on sample average achievement values at just four values of cohort-
enrollment size. Fortunately, in situations where more data are available,
as is the case here, a regression-discontinuity (RD) approach allows us to
test and relax these assumptions.
It is clear from Table 9.1 that we know a lot more about the potential
relationship between average student achievement and cohort-enrollment
size than we have exploited so far. For example, based on these data, we
could obtain estimates of the crucial second difference in several differ-
ent ways. For instance, we could use the average achievement information
for the students in cohorts with enrollments of size 37 and 38 to estimate
a second difference of –1.09 (= 67.85 – 68.94). Similarly, the achievement
information for enrollment cohorts of size 36 and 37 provides another
second difference estimate of +1.64 (= 68.94 – 67.30). Recall though, that
the original estimate of the second difference, based on cohorts of size 38
and 39, was +1.02 points. Averaging these multiple estimates of the second
difference together leads to an overall average second difference of 0.52, which is perhaps a more precise estimate of the underlying linear trend in achievement across cohort-enrollment sizes than any estimate derived from a particular pair of adjacent data points to the left of the cut-off. When we
can estimate many second differences like this, how do we figure out what
to do and where to stop? In fact, we are not even limited to estimating
second differences that are only “one child apart” on the cohort-enrollment
size forcing variable. For example, we could use the average achievement
information for enrollment cohorts of sizes 36 and 39 to provide a second-difference estimate of +0.52 (= [68.87 – 67.30]/3). Finally, although
averaging together multiple second-difference estimates does draw addi-
tional relevant information into the estimation process, it does so in a
completely ad hoc fashion. The RD approach that we describe next pro-
vides a more systematic strategy for incorporating all the additional
information into the estimation process.
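To make this arithmetic concrete, here is a minimal sketch in Python that reproduces the calculations above from the cohort means cited in the text (the dictionary below simply restates the values quoted from Table 9.1; it is an illustration, not a substitute for the classroom-level analysis):

# Class-average reading achievement, by enrollment-cohort size (values quoted in the text from Table 9.1).
means = {36: 67.30, 37: 68.94, 38: 67.85, 39: 68.87, 40: 67.93, 41: 73.68}

# First difference at the cut-off: cohort 41 (offered small classes) versus cohort 40.
first_diff = means[41] - means[40]                        # 5.75

# Adjacent "second differences" to the left of the cut-off.
second_diffs = [means[39] - means[38],                    # +1.02
                means[38] - means[37],                    # -1.09
                means[37] - means[36]]                    # +1.64
avg_second_diff = sum(second_diffs) / len(second_diffs)   # about 0.52

# A crude difference-in-differences estimate that nets the averaged trend out of the first difference.
print(round(first_diff, 2), round(avg_second_diff, 2), round(first_diff - avg_second_diff, 2))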
To facilitate understanding of the RD approach, examine Figure 9.1.
This figure displays selected sample information from Table 9.1 describ-
ing the overall class-average reading achievement of Israeli fifth-grade
children plotted against the sizes of grade-level enrollment cohorts of size
36 through 41. Notice that there is moderate vertical scatter—of about
one scale point, up or down—in the class-average reading achievement of
children in enrollment cohorts of size 36 through 40 (all of whom were
offered large classes by Maimonides’ rule). Notice also that this vertical
scatter is quite small in contrast to the approximately 6-point difference between the class-average achievement of these cohorts and that of the cohort of 41 students, whose members were offered small classes.
Figure 9.1 Class-average reading achievement (vertical axis, 66 to 74) plotted against size of enrollment cohort (horizontal axis, 35 to 42), with the first-difference ("First Diff.") and adjusted ("Adj. Diff.") estimates of treatment impact marked near the cut-off.
On the plot, we have also superimposed a sloped arrow summarizing the trend in class-average achievement with cohort-enrollment size over these pre–cut-off cohorts.9 We have then projected this line
forward until it intersects the vertical dotted line drawn above the point
on the x-axis that corresponds to a cohort enrollment size of 41 students.
The vertical elevation of the tip of this arrow at this dotted line indicates
our “best projection” for what class-average reading achievement would
be in a cohort with an enrollment of 41, if Maimonides’ rule had not inter-
vened. If the projection is credible, then the “adjusted vertical difference”
between the tip of the arrow and the “observed” class-average achieve-
ment in this cohort will be a better estimate of the causal impact of the
exogenous offer of a class-size reduction than any of the separate piece-
wise difference-in-differences estimates. This is because the new adjusted
difference incorporates evidence on the achievement/enrollment trend
from all the enrollment cohorts from size 36 through 40 systematically,
and then uses it to adjust the first-difference estimate for this trend. We label the "first-difference" and "adjusted" estimates of treatment impact on the plot in Figure 9.1. These are the basic ideas that underpin RD analysis.
Of course, projections like that which we imposed on Figure 9.1 can be
constructed less arbitrarily by specifying and fitting an appropriate regres-
sion model to the original data, rather than by working with within-cohort
size averages and sketching plots, as we have done in our illustration. For
instance, we can extract from Angrist and Lavy’s original dataset all the
information on the 75 classrooms pertaining to fifth-grade enrollment
cohorts of sizes 36 through 41. We can then specify a regression model to
summarize the pre–cut-off achievement/enrollment trend credibly and
project it forward to the cohort of enrollment size 41, while including in
the model a dedicated regression parameter to capture the vertical “jig”
that we hypothesize occurs between the projected and observed class-
average reading achievement for classes offered to members of the
enrollment cohort of size 41. Estimation of this latter “jig” parameter,
obtained by standard regression methods, would provide an adjusted RD
estimate of the differential effect on student class-average reading achieve-
ment of an offer of “small” versus “large” classes generated by the
exogenous application of Maimonides’ rule.
To complete these analyses, we create two additional predictors in the
Angrist and Lavy dataset. We illustrate their creation in Table 9.2, where
we list selected cases (classrooms) from enrollment cohorts of size 36
through 41 students (we have omitted data on some cohorts in Table 9.2
Table 9.2 Class-average reading achievement and intended class size for the first three
fifth-grade classes in enrollment cohorts of size 36 to 41 (enrollment cohorts of size 37
and 38 omitted to save space) in Israeli Jewish public schools
Classroom   READ    SIZE   CSIZE   SMALL
3601 51.00 36 −5 0
3602 83.32 36 −5 0
3603 64.57 36 −5 0
…
…
3901 46.67 39 −2 0
3902 68.94 39 −2 0
3903 74.08 39 −2 0
…
4001 73.15 40 −1 0
4002 60.18 40 −1 0
4003 52.77 40 −1 0
…
4101 69.41 41 0 1
4102 80.53 41 0 1
4103 55.32 41 0 1
…
to save space, but these data were included in our computations). In the
second and third columns, we list values of class-average reading achieve-
ment, READ, and the corresponding enrollment-cohort size, SIZE, for
the first three classrooms in each cohort. We also list two new variables to
act as predictors in our RD analyses. First, we have created the new dichot-
omous predictor SMALL to indicate classrooms in those cohorts in which,
as a result of Maimonides’ rule, the offered class size is small. In this
reduced dataset, this is only for classrooms formed from enrollment
cohorts that contain 41 students. Second, we have “recentered” the forc-
ing variable—that is, the original enrollment-cohort SIZE predictor—by
subtracting a constant value of 41 from all values to form a new predictor
labeled CSIZE. This new predictor takes on a value of zero for classes in
the cohort of enrollment 41, and its non-zero values measure the horizon-
tal distance in the size of each cohort from the cohort of 41 students.
Thus, for instance, CSIZE has a value of “−1” for classrooms in enrollment
cohorts of 40 students.
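To make the specification concrete, here is a minimal sketch in Python of how one might construct CSIZE and SMALL and fit a simple linear RD model of the kind just described to classroom-level data. The file name is hypothetical, and the sketch assumes classroom-level variables READ and SIZE as defined above; it illustrates the general approach rather than reproducing the authors' analysis:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical classroom-level file with class-average reading scores (READ)
# and September grade-enrollment size (SIZE).
df = pd.read_csv("angrist_lavy_grade5.csv")
df = df[(df["SIZE"] >= 36) & (df["SIZE"] <= 41)]

# Recenter the forcing variable at the cut-off and flag cohorts offered small classes.
df["CSIZE"] = df["SIZE"] - 41
df["SMALL"] = (df["SIZE"] >= 41).astype(int)

# Linear RD model: READ = b0 + b1*CSIZE + b2*SMALL + error.
# The estimated coefficient on SMALL captures the vertical "jig" at the cut-off.
rd_model = smf.ols("READ ~ CSIZE + SMALL", data=df).fit()
print(rd_model.summary())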
Then, subtracting Equation 9.4 from Equation 9.3, you can see that
regression parameter β2—the parameter associated with the predictor
SMALL in Equation 9.1—represents the population difference in average
reading achievement that we hypothesize to occur between children
offered prototypical small-class sizes and large-class sizes, at an enrollment
cohort size of 41. This is exactly the parameter that we wanted to estimate.
Notice the critical role played by the recentering of the original
forcing variable—enrollment-cohort size—to provide new predictor CSIZE.
cut-off. Finally, we benefit from the increased statistical power that comes
with the anticipated increase in the sample size.
We present estimates of all parameters and ancillary statistics from this
analysis in Table 9.4. The results support and strengthen our earlier con-
clusions. Although our estimate of the average treatment effect of +3.85
is almost 25% smaller than the comparable estimate of 5.12 in Table 9.3,
its standard error is much smaller (at 2.81 instead of 4.00), a result of the
addition of 105 classrooms to the sample. Consequently, the p-value asso-
ciated with the corresponding one-sided test of average treatment impact
has fallen from .10 to .09.
Now it becomes tempting to widen the analytic window even more, in
the hope of further increasing statistical power. We present a summary
of the results of doing so in Table 9.5, where we have incorporated
additional enrollment cohorts into the analysis. In the first and second
row, we repeat the findings from the two RD analyses that we have already
completed: (a) the first analysis with all classrooms in schools with
September enrollments of 36 through 41, and (b) the second analysis that
added the enrollment cohorts of size 42 to 46. Recall that, in the process,
our sample of classes has more than doubled to 180 and, while the esti-
mate of the average treatment effect shrank in magnitude from 5.12 to
3.85, it remained positive and its precision increased. As we increase the
bandwidth still further, these trends continue. Consider the last row of
the table, where the comparison now includes 423 classrooms. Here, the
fundamental finding remains that there is a benefit to being offered edu-
cation in a smaller class, and the estimated effect size—which measures the "vertical jig" in average achievement between students offered education in "Large" and "Small" classes—has remained stable. Note that the addi-
tion of more classrooms has also increased statistical power dramatically
and reduced the size of the p-values associated with the effect of question-
predictor SMALL. In fact, when the RD regression model is fitted to the
data from the 423 classrooms that make up the enrollment cohorts
between 29 and 53 students in size, we can readily reject the associated
null hypothesis of no treatment effect (p = .02, one-sided test).
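As a hedged sketch of this kind of bandwidth sensitivity check (using the same hypothetical file and variable definitions as in the earlier sketch), one could refit the same simple linear specification over successively wider enrollment windows around the cut-off and watch how the estimate and its standard error change:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("angrist_lavy_grade5.csv")
df["CSIZE"] = df["SIZE"] - 41
df["SMALL"] = (df["SIZE"] >= 41).astype(int)

# Refit the same linear RD model within successively wider enrollment windows.
for low, high in [(36, 41), (36, 46), (29, 53)]:
    window = df[(df["SIZE"] >= low) & (df["SIZE"] <= high)]
    fit = smf.ols("READ ~ CSIZE + SMALL", data=window).fit()
    print(low, high, len(window),
          round(fit.params["SMALL"], 2), round(fit.bse["SMALL"], 2))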
But how far can we widen the bandwidth and continue to believe
in the credibility of our results? Certainly, when we were using a first-
difference or a difference-in-differences approach, it is clear that the
narrower our focus around the Maimonides’ cut-off, the more confident
we could be in the equality-in-expectation assumption for children thus
pooled into large-offer and small-offer classrooms, and therefore in the
internal validity of any comparison between these groups. But, as we nar-
rowed the window of analytic attention to those cohorts with enrollments
successively closer to, and on either side of, the cut-off, the number of
classes included in the comparison decreased and the statistical power for
detecting an effect declined. Increasing the bandwidth, on the other
hand, certainly increased sample size and statistical power, but may have
also increased the challenge to the internal validity of the comparison.
In particular, when we make use of either a first-difference or a difference-
in-differences approach in our Angrist and Lavy example, the further we
expand our bandwidth to include enrollment cohorts far from those with
grade-level enrollment of 40 or 41 students, the less plausible is our
assumption that students in the groups being compared are equal in
expectation on all unobserved dimensions. Indeed, as Angrist and Lavy
point out, socioeconomic status and population density are positively
related in Israel. As a result, children in enrollment cohorts of vastly dif-
ferent sizes are likely to differ dramatically in family socioeconomic status
and in other unobserved respects that may be related to academic achieve-
ment. This would be an especially serious problem if we were relying only
on a first difference as our estimate of the treatment effect, as we would
be successively pooling into our treatment and control groups cohorts
that were progressively less equal in expectation prior to treatment.
However, the situation with regression-discontinuity analyses is thank-
fully less stark. The reason is that we are not pooling the more remote
cohorts into our nominal treatment and control groups. Instead, we are
using the more remote cohorts in conjunction with the nearer ones—and
their relationship with the forcing variable—to project the estimated treat-
ment effect at the cut-off, where the assumption of equality in expectation
is indeed met. In other words, because our estimate of the treatment
effect pertains only at the cut-off, it may remain internally valid there,
regardless of how many additional data points are added into the analy-
ses, and from how far afield they are drawn. It does not matter if cohorts
remote from the cut-off are not equal in expectation with other cohorts
as remote on the other side, provided we can be confident that we are
using the information they offer to better project our expectations for
any difference in average outcome at the cut-off. Of course, doing so
requires the correct modeling of the relationship between the outcome
and the forcing variable.
To capitalize effectively on natural experiments with an RD design, we
must therefore strike a sensible balance between adequate statistical
power and defensible modeling of the trend that links the outcome to the
forcing variable along which the exogenous cut-off is specified. From a
mathematical perspective, over a sufficiently short range, all such trends
are locally linear. In the Angrist and Lavy dataset, the requirements of
local linearity may support inclusion of enrollment cohorts of size 36
through 41, or even the cohorts of size 29 through 35. However, if there
is evidence of curvilinearity in the underlying outcome/forcing-variable
trend, then we must either limit our bandwidth, so that the local linearity
assumption is met, or we must model the trend with a functional form
that has more credibility than the linear. We illustrate this point with
evidence from another important study.
contained 649 children who had lived in counties whose 1960 poverty rates placed them among the 300 poorest, and 674 who had lived in one of the next 300
poorest counties. In all of their analyses, they treated the county as their
unit of analysis.10
In their first set of analyses, Ludwig and Miller examined whether the
exogenously defined poverty-rate cut-off for grant-writing assistance did
indeed result in a discontinuous jump in the availability of Head Start
programs in poor counties, as measured by funding levels per child in the
appropriate age group. They did this by comparing Head Start spending
per four-year-old in 1968 for two groups of counties, just on either side of
the exogenously defined cut-off. The treatment group consisted of counties with poverty rates between 59.2% (the minimum for receiving the
OEO application aid) and 69.2% (10 percentage points above the cut-off),
and the control group consisted of those with poverty rates from just
below 59.2% (the 301st poorest counties in the U.S.) to 49.2% (10 percent-
age points below the cut-off). They found that Head Start spending per
student in the first group (which contained 228 counties) was $288 per
four-year-old, whereas the comparable figure for the second group was
$134. They found similar results when they compared the spending levels
for groups defined within different bandwidths (for example, in analytic
windows that contained only those counties that fell within 5 percentage
points of the poverty cut-off). This gave the researchers confidence that
the OEO intervention did influence the availability of Head Start in these
counties markedly. Ludwig and Miller also verified that the difference in
Head Start participation rates between counties with poverty rates just
above the OEO poverty cut-off for support and those with poverty rates
just below this cut-off continued through the 1970s. This proved impor-
tant in locating data that would permit them to examine the long-term
effects of Head Start availability on children’s outcomes.
As Ludwig and Miller explain, the principal statistical model that they
fitted in their study is a generalization of the simple linear RD model that
we specified in Equation 9.1. They wrote it as
Y_c = m(P_c) + \alpha G_c + \upsilon_c \quad (9.5)
Although the similarity between this model and our standard RD model
in Equation 9.1 is not immediately apparent, due to differences in nota-
tion, the two are essentially the same. In Equation 9.5, Yc is the value of an
10. Thus, they aggregated child-level outcome measures from the NELS to the county
level.
outcome for the cth county, and could represent the average value of the
Head Start participation rate, say, in that county in a chosen year. The
forcing variable Pc is the poverty rate in the cth county in 1960, recentered so that it has a value of zero at the poverty-rate cut-off of 59.2%. The dichotomous variable Gc is the principal question predictor and indicates whether the cth county received grant-writing assistance from the OEO (1 =
received assistance; 0 = otherwise). Thus, its associated regression param-
eter α (which corresponds to regression parameter β2 in Equation 9.1),
represents the causal impact of the grant-writing assistance on the out-
come, estimated at the poverty-rate cut-off of 59.2%. The stochastic
element in the model, υc, is a county-level residual.
The principal difference in appearance between the hypothesized
models in Equations 9.1 and 9.5 revolves around the term m(Pc ), which is
intended as a generic representation of the functional form of the hypoth-
esized relationship between the outcome and the forcing variable, P.
Ludwig and Miller first modeled the outcome as a linear function of
the forcing variable, as we did in our earlier example, but they allowed the
population slopes of the relationship to differ on opposite sides of the
discontinuity. They achieved this by replacing function m(Pc ) by a stan-
dard linear function of P and then adding the two-way interaction between
it and question predictor G as follows:
Y_c = \beta_0 + \beta_1 P_c + \alpha G_c + \beta_2 (P_c \times G_c) + \upsilon_c \quad (9.6)

The fitted relationship for counties to the left of the cut-off (where G_c = 0) is then:

\hat{Y}_c = \hat{\beta}_0 + \hat{\beta}_1 P_c \quad (9.7)

and the fitted relationship for counties to the right of the cut-off is:

\hat{Y}_c = (\hat{\beta}_0 + \hat{\alpha}) + (\hat{\beta}_1 + \hat{\beta}_2) P_c \quad (9.8)
But, since forcing variable P is centered at the cut-off, and despite the
permitted difference in its hypothesized slope on the two sides of the cut-
off, you can demonstrate by subtraction that parameter α continues to
capture the average causal impact of grant-writing assistance on the Head
Start participation rate (don’t forget to set the value of Pc to zero before
you subtract!).
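To see this in one line (a sketch of the subtraction described above, using the fitted relationships in Equations 9.7 and 9.8), evaluate both fitted lines at the recentered cut-off, where P_c = 0:

\hat{Y}_c^{\,right} - \hat{Y}_c^{\,left}
  = \left[(\hat{\beta}_0 + \hat{\alpha}) + (\hat{\beta}_1 + \hat{\beta}_2)(0)\right] - \left[\hat{\beta}_0 + \hat{\beta}_1(0)\right]
  = \hat{\alpha}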
Ludwig and Miller initially fit Equation 9.6 with a bandwidth of
8 percentage points in the poverty index on either side of the cut-off.
With this specification, they estimated that treatment effect α was 0.316,
with a standard error of 0.151. Thus, with this specification and band-
width, they could reject the null hypothesis that grant-writing assistance
had no impact on the rate of children’s participation in Head Start.
In Figure 9.2,11 with smooth dashed curves on either side of the cut-off,
we display the fitted quadratic relationships between Head-Start partici-
pation rate and the county poverty-rate forcing variable that were obtained
in Ludwig and Miller’s analyses. Notice, first, that the shape of the fitted
relationship is very different to the left of the cut-off point than to the
right. In addition, one dramatic limitation of the quadratic specifications
on either side of the cut-off is that the shapes of the fitted curves are
constrained to be symmetric around their maximum or minimum. Given
that these curves are fitted to a modest number of cases, it seems quite
plausible that the fitted curvilinear relationships between outcome and
forcing variable could be highly sensitive to atypical outcome values and
the leverage exercised by a very small number of atypical data points.
Moreover, since the estimate of the average treatment effect α comes
11. Figure 9.2 is a reproduction of Figure I, Panel B, from page 175 of Ludwig and Miller’s
2007 paper.
(Figure 9.2 axes: Head Start participation rate, 0 to .8, on the vertical axis; 1960 poverty rate, 40 to 80, on the horizontal axis.)
Figure 9.2 Estimated discontinuity in Head Start participation at the Office of Economic
Opportunity cut-off for grant-writing support, using data from the National Educational
Longitudinal Study (NELS) first follow-up sample. Reproduced with permission from
Ludwig and Miller (2007), Figure I, Panel B, p. 175.
12. To learn more about local linear regression analysis, see Imbens and Lemieux (2008)
or Bloom (forthcoming).
13. Figure 9.3 is a reproduction of Figure IV, Panel A, from page 182 of Ludwig and
Miller’s 2007 paper.
(Figure 9.3 axes: county-level mortality rate for children aged five to nine, 1 to 5, on the vertical axis; 1960 poverty rate, 40 to 80, on the horizontal axis.)
Figure 9.3 Estimated discontinuity in county-level mortality rate for children aged five
to nine, 1973 through 1983, from causes that could be affected by Head Start, at the
Office of Economic Opportunity cut-off for grant-writing support. Reproduced with
permission from Ludwig and Miller (2007), Figure IV, Panel A, p. 182.
15. As explained in the final report of the evaluation (Gamse et al., 2008), the evaluation
of Reading First included one group of schools in addition to those in the 16 school
districts and one state that employed an RD design in allocating funds. This one
district assigned ten schools randomly to treatment or control status.
16. Figure 9.4 is a reproduction of Figure 5 from page 198 of Urquiola and Verhoogen’s
(2009) paper.
(Figure 9.4 axes: 4th grade class size (x/n), 0 to 50, on the vertical axis; 4th grade enrollment, 0 to 135, on the horizontal axis.)
Figure 9.4 Fourth-grade enrollment versus class size in urban private-voucher schools in
Chile, 2002. Based on administrative data for 2002. The solid line describes the relation-
ship between enrollment and class size that would exist if the class-size rule were applied
mechanically. The circles plot actual enrollment cell means of fourth- grade class size.
Only data for schools with fourth-grade enrollments below 180 are plotted; this excludes
less than 2% of all schools. Reproduced with permission from Urquiola and Verhoogen
(2009), Figure 5, p. 198.
(Figure 9.5: number of schools, 0 to 80, on the vertical axis.)
17. Figure 9.5 is a reproduction of Figure 7, Panel A, from page 203 of Urquiola and
Verhoogen’s (2009) paper.
(Figure 9.6 axes: household income, 11.5 to 13, on the vertical axis; 4th-grade enrollment, 45 to 135, on the horizontal axis.)
18. Figure 9.6 is a reproduction of Figure 8, Panel A, from page 204 of Urquiola and
Verhoogen’s 2009 paper.
The reason is that the students assigned to the smaller classes came from
families with more resources than those assigned to larger classes.
The Urquiola and Verhoogen study of class-size determination in Chile
illustrates two lessons about a critical threat to the internal validity of
studies that attempt to make causal inferences using data drawn from
natural experiments, especially those with discontinuity designs. The first
is the importance of learning a great deal about the context from which
the data derive, including the nature of the educational system, the incen-
tives that the natural experiment creates for educators and parents to
alter their behavior, and the opportunities that are present for respond-
ing to these incentives. The second lesson is the importance of examining
the data closely to see if there is any evidence that actions by educators,
parents, or others resulted in a violation of the critical “equal in expecta-
tion immediately on either side of the cut-off, prior to treatment”
assumption underlying this research approach to making causal inferences.
As Urquiola and Verhoogen’s paper illustrates, exploratory graphical
analyses often prove highly effective in detecting violations of the assump-
tions that underlie the RD identification strategy.
To learn about the history of the RD strategy for making causal infer-
ences, read Thomas Cook’s erudite paper, “Waiting for Life to Arrive:
A History of the Regression-Discontinuity Design in Psychology, Statistics
and Economics,” which appeared in a 2008 special issue of The Journal
of Econometrics dedicated to RD methodology. Other papers in this volume
provide a rich set of ideas for determining appropriate bandwidths, for
estimating the relationship between the outcome and the forcing vari-
able, and for examining threats to the internal validity of particular
applications. For an especially clear exposition of recent research on the
RD approach, read Howard Bloom’s forthcoming chapter entitled
“Regression-Discontinuity Analysis of Treatment Effects.”
10
Introducing Instrumental-Variables
Estimation
like civic engagement. Dee used IVE to analyze data from the High School
and Beyond (HS&B) dataset, which contains rich information on large
samples of American students who were first surveyed in 1980. Dee
focused his research on students who were members of the HS&B
“sophomore cohort,” meaning that they were tenth-graders in American
high schools in 1980. This sophomore cohort was resurveyed in 1984
(when respondents were around 20 years old) and again in 1992 (when
they were around 28 years old).4 Here, we focus on a subsample drawn
from Dee’s data, consisting of 9,227 of the original HS&B respondents.5
In panel (a) of Table 10.1, we present univariate descriptive statistics on
the two key variables in our analyses. Our outcome variable REGISTER
measures active adult civic participation and this information was obtained
when respondents were about 28 years old. It is dichotomous and indi-
cates whether the respondent was registered to vote in 1992 (1 = registered;
0 = not registered); about two-thirds (67.1%) of the respondents were reg-
istered to vote in that year. Our principal question predictor COLLEGE is
also dichotomous and coarsely summarizes respondents’ educational
attainment as of the 1984 administration of the HS&B, when respondents
were about 20 years old (1 = had entered a two- or four-year college by
1984; 0 = had not entered). Slightly more than half of the respondents
(54.7%) had entered college by this time.
4. We thank Thomas Dee for providing the data on which he based his 2004 paper.
5. Because dichotomous outcomes were involved, Dee (2004) relied on a more sophisti-
cated approach based on the simultaneous-equations estimation of a bivariate probit
model. Here, for pedagogic clarity, we begin by adopting a simpler analytic approach
that specifies a linear-probability model (LPM). To better meet the demands of our
LPM model, we have limited our sample to respondents for whom a two-year college in
their county was located within 35 miles of their base-year high school when they
attended tenth grade, and for whom there were ten or fewer such two-year colleges
within the county. We have also eliminated 329 cases that had missing values on the
critical variables. Consequently, our estimates differ marginally from Dee’s (2004) pub-
lished estimates, although the thrust of our findings and his remain the same. Readers
interested in the substantive findings should consult his paper.
Table 10.1 Civic engagement (in 1992) and educational attainment (in 1984) for 9,227
participants in the sophomore cohort of the HS&B survey. (a) Univariate statistics on the
outcome REGISTER and question predictor COLLEGE; (b) sample bivariate statistics for
the same variables; and (c) OLS regression analysis of REGISTER on COLLEGE
REGISTER COLLEGE
Correlation:
REGISTER 1.0000
COLLEGE 0.1874 ∗∗∗,† 1.0000
Covariance:
REGISTER 0.2208
COLLEGE 0.0438 0.2478
R2 0.0351
6. Like the correlation coefficient, the covariance statistic summarizes the linear association between two variables. The sample covariance of Y and X—represented by s_YX—is defined below, along with the sample variances s_YY and s_XX and the sample correlation r_YX.
index, its value is not constrained to fall between –1 and +1 (which are the limiting values of the correlation coefficient), and so its absolute magnitude can be more difficult to interpret. However, as we will see later, there are advantages to having an index of association that preserves the scales of the component variables. Also, because the covariance of a variable with itself is simply its variance, the elements that fall on the diagonal of a covariance matrix contain those variances. You can recover the companion correlation coefficient by direct computation from the elements of the corresponding covariance matrix. For instance, in panel (b) of Table 10.1, the sample variances of variables REGISTER and COLLEGE are 0.221 and 0.248, respectively, meaning that their corresponding standard deviations are the square roots of these quantities, 0.470 and 0.498. The estimated correlation between these two variables is then simply their sample covariance divided by the product of their sample standard deviations, or 0.0438/(0.470 × 0.498), which is approximately 0.187.
s_{YY} = \frac{\sum_{i=1}^{n}(Y_i - \bar{Y}_{\bullet})(Y_i - \bar{Y}_{\bullet})}{n-1} = \frac{\sum_{i=1}^{n}(Y_i - \bar{Y}_{\bullet})^{2}}{n-1} = s_{Y}^{2}

s_{XX} = \frac{\sum_{i=1}^{n}(X_i - \bar{X}_{\bullet})(X_i - \bar{X}_{\bullet})}{n-1} = \frac{\sum_{i=1}^{n}(X_i - \bar{X}_{\bullet})^{2}}{n-1} = s_{X}^{2}

s_{YX} = \frac{\sum_{i=1}^{n}(Y_i - \bar{Y}_{\bullet})(X_i - \bar{X}_{\bullet})}{n-1}

r_{YX} = \frac{\sum_{i=1}^{n}(Y_i - \bar{Y}_{\bullet})(X_i - \bar{X}_{\bullet})}{\sqrt{\left\{\sum_{i=1}^{n}(Y_i - \bar{Y}_{\bullet})^{2}\right\}\left\{\sum_{i=1}^{n}(X_i - \bar{X}_{\bullet})^{2}\right\}}}

and so, correlation is the covariance between two variables, each standardized to mean zero and unit standard deviation.
7. This hypothesis test is identical to the test on the correlation coefficient in panel (b).
8. Our naïve OLS regression analysis includes no individual-, county- and state-level cova-
riates and takes no account of the natural clustering of High School and Beyond (HS&B) respondents within their base-year high schools, as does Dee (2004)
in his more complete analysis. We omitted these features at this point for pedagogic
clarity. In the following section, “Incorporating Multiple Instruments into the First-
Stage Model,” we illustrate the inclusion of additional covariates into our analyses.
Finally, in sensitivity analyses not presented here, we have repeated all analyses pre-
sented in this chapter using robust standard errors estimated to account for the
clustering of participants within their base-year high schools. Although this increases
the standard errors associated with our central regression parameters by around 15%,
our basic results do not differ.
\hat{\beta}_1^{OLS} = \left(\frac{s_{YX}}{s_X^{2}}\right) = \left(\frac{s_{(REGISTER,\,COLLEGE)}}{s^{2}_{COLLEGE}}\right) = \frac{0.0438}{0.2478} = 0.177 \quad (10.1)
Notice that this estimate is identical to that obtained in the OLS regres-
sion analysis in panel (c). The intimate link among the sample covariance
of outcome and predictor, the sample variance of the predictor, and the
OLS-estimated slope, in a simple linear regression analysis, emphasizes
the utility of the sample covariance matrix as a summary of variation and
covariation in the data. More importantly, we will soon see that it pro-
vides insight into the functioning of the OLS slope estimator itself and
lights the way for us to instrumental-variables estimation.
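As a quick back-of-the-envelope check (a minimal sketch in Python that uses only the summary statistics quoted from Table 10.1), you can verify both the slope in Equation 10.1 and the recovery of the correlation from the covariance matrix:

import math

# Sample variances and covariance from panel (b) of Table 10.1.
var_register = 0.2208
var_college = 0.2478
cov_reg_col = 0.0438

# OLS slope for REGISTER on COLLEGE, as in Equation 10.1.
b1_ols = cov_reg_col / var_college                                     # about 0.177

# Recover the correlation from the covariance and the two standard deviations.
r = cov_reg_col / (math.sqrt(var_register) * math.sqrt(var_college))   # about 0.187

print(round(b1_ols, 3), round(r, 3))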
But first, let’s examine how the presence of endogeneity in the ques-
tion predictor results in bias in the OLS estimator of the causal impact
of education on civic engagement. We begin by specifying a statistical
model that describes how we believe educational attainment affects civic
engagement. To keep notation simple in what follows, we do this in
generic form:
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \quad (10.2)
for the ith member of the population, with conventional notation and
assumptions.9 In our civic-engagement example, generic outcome Y would
be replaced by REGISTER, and generic predictor X would be replaced by COLLEGE.
10. The covariance algebra that we present here only illustrates the asymptotic unbiasedness of the OLS estimator of slope. Using a more detailed application of statistical theory, we can also show that it is unbiased in small samples.
11. The magnitude and direction of the bias are given by the second term on the right in
Equation 10.5, and so the larger the covariance of predictor and residual, the greater
the magnitude of the bias.
Figure 10.1 Graphical analog for the population variation and covariation among
outcome Y, potentially endogenous question predictor X, and instrument, I, used for
distinguishing the OLS and IV approaches (a) OLS Approach: bivariate relationship
between Y and X, (b) IV Approach: trivariate relationship among Y, X, and I.
The part of the upper light-grey ellipse that extends beyond the intersection of the two ellipses represents the population variation in outcome Y that is therefore unpredicted by question predictor X.
In other words, it represents the population residual variation.
Next, remaining in the top panel of the figure, examine the area of
intersection of the outcome and question predictor ellipses in comparison
to the total area of the lower medium-grey ellipse (which represents the variance of the question predictor, σ_X²).
Instrumental-Variables Estimation
In the current example, we possess observational data on an interesting
outcome—REGISTER , a measure of civic engagement—and an important
question predictor—COLLEGE, a measure of educational attainment.
Our theory suggests that there should be a causal relationship between
the latter and the former. Consequently, we would like to use our obser-
vational data to obtain a credible estimate of the causal impact of
educational attainment on civic engagement. However, we suspect that
question predictor COLLEGE is potentially endogenous because partici-
pants have been able to choose their own levels of educational attainment.
As a result, an OLS estimate of the relationship between REGISTER and
COLLEGE may provide a biased view of the hypothesized underlying
population relationship between civic engagement and educational attain-
ment. What can we do to resolve this? How can we use the available
observational data to estimate the population relationship between these
two constructs, while avoiding the bias introduced into the results of the
standard OLS process by the potential endogeneity in COLLEGE?
In statistics, as in life, we can usually do better if we have some way to incorporate additional useful information into our decisions. Setting all skepticism aside, let’s imagine for a moment that we
had information available on an additional and very special kind of vari-
able—mysteriously named I, for “instrument”—that has also been measured
for all the participants in the sample. Let’s ask ourselves: What properties
would such an instrument need to have, to be helpful to us? How could we
incorporate it into our analysis, if we wanted to end up with an unbiased
estimate of the critical relationship between civic engagement and educa-
tional attainments, in which we are interested?
\sigma_{YI} = \beta_1 \sigma_{XI} + \sigma_{\epsilon I} \quad (10.8)

\left(\frac{\sigma_{YI}}{\sigma_{XI}}\right) = \beta_1 + \left(\frac{\sigma_{\epsilon I}}{\sigma_{XI}}\right) \quad (10.9)
Here, surprisingly, notice a second interesting consequence of the speci-
fication of the population linear regression model. From Equation 10.9,
we see that the population covariance of Y with I ( σYI ) divided by the pop-
ulation covariance of X with I ( σXI ) is again equal to our critical parameter
representing the key population relationship of interest ( β1 ), provided
that the second term on the right-hand side of the equation is zero. And
it is zero when our new instrument is uncorrelated with the residuals in
the population regression model. That is,
\left(\frac{\sigma_{YI}}{\sigma_{XI}}\right) = \beta_1, \quad \text{when } \sigma_{\epsilon I} = 0 \quad (10.10)

\hat{\beta}_1^{IVE} = \left(\frac{s_{YI}}{s_{XI}}\right) \quad (10.11)
Table 10.2 Civic engagement (in 1992) and educational attainment (in 1984) for 9,227
participants in the sophomore cohort of the HS&B survey. (a) Univariate statistics on the
outcome REGISTER, question predictor COLLEGE, and instrument, DISTANCE;
(b) sample bivariate statistics among the same three variables; and (c) a method-of-
moments instrumental-variables estimate of the REGISTER on COLLEGE regression slope
REGISTER COLLEGE DISTANCE
Correlation:
REGISTER 1.0000
COLLEGE 0.1874 ∗∗∗,† 1.0000
DISTANCE –0.0335 ∗∗∗ –0.1114∗∗∗ 1.0000
Covariance:
REGISTER 0.2208
COLLEGE 0.0438 0.2478
DISTANCE –0.1369 –0.4825 75.730
Parameter Estimate
12. Of course, there may be reasons why the proximity of participants to their local insti-
tutions of higher education is not determined randomly and exogenously. We discuss
First, and perhaps most importantly, notice that the endogenous ques-
tion predictor COLLEGE and instrument DISTANCE are indeed related.
The greater the distance between a tenth-grader’s high school and the
nearest community college (when they were in high school), the lower the
probability that the student will enroll in college subsequently (r = –0.111,
p<0.001, Table 10.2, panel (b)). In addition, our outcome REGISTER has
a negative and even smaller, but again statistically significant, correlation
with the instrument DISTANCE (r = –0.033; p <.001, Table 10.2, panel (b)).
Thus, the greater the distance between a tenth-grader’s former high
school and the nearest two-year college, the less probable it is that the
student registered to vote as an adult. Substituting the corresponding
sample covariances into Equation 10.11, we obtain an asymptotically unbi-
ased method-of-moments IVE of the impact of college enrollment on the
probability of registering to vote, as follows:
\hat{\beta}_1^{IVE} = \left(\frac{s_{YI}}{s_{XI}}\right) = \frac{s_{(REGISTER,\,DISTANCE)}}{s_{(COLLEGE,\,DISTANCE)}} = \frac{-0.1369}{-0.4825} = 0.284 \quad (10.12)
Notice that this coefficient is positive and considerably larger than the corresponding OLS estimate (0.177, Table 10.1). This suggests that
the probability that an individual will register to vote, as an adult, is about
28 percentage points higher among college entrants than among those
who did not enroll in college. Provided that our instrument—the distance
of the respondent’s high school from the nearest two-year college in the
same county—satisfies the critical assumption we have described earlier,
then this new value of 0.284 is an asymptotically unbiased estimate of the
impact of educational attainment on civic engagement.
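The same kind of back-of-the-envelope check works for the method-of-moments estimate in Equation 10.12; this minimal Python sketch uses only the covariances quoted in panel (b) of Table 10.2:

# Sample covariances from panel (b) of Table 10.2.
cov_register_distance = -0.1369
cov_college_distance = -0.4825

# Method-of-moments IV estimate of the REGISTER-on-COLLEGE slope (Equations 10.11 and 10.12).
b1_ive = cov_register_distance / cov_college_distance   # about 0.284
print(round(b1_ive, 3))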
It is useful to explore the logic upon which this new method of estima-
tion is based. Conceptually, during the IVE process, we use our
instrument—which we regard as exogenous by assumption (and therefore
uncorrelated with the residuals in the main regression model)—to carve
out part of the variation in the question predictor that is also exogenous
three common objections—and the strategies that can be used for dealing with them—
later in this chapter, in the section “Proximity of Educational Institutions,” where we
describe research conducted by Janet Currie and Enrico Moretti (2003) in which they
used proximity as an instrument.
and then we use only that latter part in the estimation of the regression
slope. We can illustrate this statement by extending our earlier graphical
analogy to the lower panel of Figure 10.1. In the new panel, we have rep-
licated the original Venn diagram in the upper panel, with the same pair
of overlapping light- and medium-grey ellipses representing the variances
and covariance of the outcome and question predictor, as before. Then,
across these two intersecting ellipses, we have carefully overlaid a third
dark-grey, almost black, ellipse to represent variation in our instrument I.
Notice that this latter ellipse has been drawn to overlap both the first two
ellipses, thereby co-varying uniquely with both. However, it does not over-
lap any of the original residual variation in Y, which as we have explained
is represented by that part of the upper light-grey ellipse that falls beyond
the reach of variation in the question predictor X. We have drawn the
new figure like this because, by definition, a successful instrument must
not be correlated with those residuals, and therefore there can be no
overlap of instrument and residual variation. Finally, in the lower panel,
we also suggest that there may be some substantial part of the instru-
ment’s variation that is independent of variation in the question predictor;
this is why we have drawn the dark-grey “instrumental” ellipse sticking out
to the right of the lower medium-grey “question predictor” ellipse.
When we carry out successful IVE, it is as though we have allowed the
dark-grey ellipse that represents variation in the instrument to carve out
the corresponding parts of the original medium-grey “X” ellipse and the
medium-on-light-grey “Y on X” overlap, for further analytic attention.
And, because variation in the instrument is exogenous (by an assumption
that we still need to defend), the parts that we have carved out must also
be exogenous. Then, in forming our IV estimator, we restrict ourselves
implicitly to working with only the variation in outcome and question
predictor that is shared (i.e., that intersects or covaries) with the new
instrument within the lower dark-grey ellipse. Within this shared region,
we again form a quotient that is a ratio of a “part” to a “whole” to provide
our new instrumental-variables estimate of the Y on X slope. This quo-
tient is the ratio of the covariation shared between outcome and instrument
to the covariation shared by question predictor and instrument, respec-
tively. Identify the corresponding regions for yourself on the plot. They
are the regions where the light-, medium-, and dark-grey ellipses overlap,
and where the medium- and dark-grey ellipses overlap, respectively.
In a sense, we have used the instrument to carve up the old dubious varia-
tion and covariation in both outcome and question predictor into
identifiable parts, and have picked out and incorporated into our new
estimate only those parts that we know are—by assumption, at least—
unequivocally exogenous.
13. The IV estimator also capitalizes on that part of the variation in outcome Y that is
localized within the region of instrumental variation. However, in the lower panel of
Figure 10.1, you can see that this part of the variation in Y—as defined by the overlap
of the Y and I regions—is also a subset of the variation in X that falls within the varia-
tion in I, and so our statement about the latter encompasses the former.
Equations 10.9 and 10.10 shows that there are two additional assumptions
that a variable must satisfy if it is to serve as a viable instrument. In practice,
the veracity of one of these assumptions proves relatively easy to confirm.
Unfortunately, this is not the case for the second assumption.
The “easy-to-prove” condition for successful IVE is that the instrument
must be related to the potentially endogenous question predictor (in other words,
the population covariance of question predictor and instrument, σ_XI,
cannot be zero). This condition seems obvious, from both a logical and a
statistical perspective. If the question predictor and instrument were not
related, then no corresponding regions of outcome and question predic-
tor variation would be carved out by the dark-grey ellipse in the lower
panel of Figure 10.1, and σ_XI would be zero, thus rendering the quotients
in Equations 10.9 and 10.10 indeterminate (infinite). In simpler terms, if
the question predictor and instrument are unrelated, then we cannot use
the instrument successfully to carve out any part of the variation in X, let
alone any exogenous variation, and so our IVE will inevitably fail.
Fortunately, in the case of our civic-engagement example, this is not the
case. We have confirmed that the instrument and question predictor are
indeed related, using the hypothesis test that we conducted on their
bivariate correlation in panel (b) of Table 10.2. We can reject the null
hypothesis that COLLEGE and DISTANCE are unrelated, in the popula-
tion, at a reassuring p <0.001.14
The second important condition that must be satisfied for successful
IVE is that the instrument cannot be related to the unobserved effects (i.e., the
original residuals) that rendered the question predictor endogenous in the first
place. In other words, the covariance of instrument and residuals, σ_εI,
must be zero. We see this throughout the algebraic development that led
to the IV estimator in Equations 10.9 and 10.10. The condition is also
appealing logically. If the instrument were correlated with the residuals in
the original “question” equation, it would suffer from the same problem
as the question predictor itself. Thus, it could hardly provide a solution to
our endogeneity problem.
14. The t-statistic associated with the rejection of the null hypothesis that the instrument
and question predictor are uncorrelated, in the population, is equal to 10.76. This
t-statistic—because only a single degree of freedom is involved in the test—corresponds
to an F-statistic of magnitude 115.9 (the square of 10.76). Such F statistics are often
used to gauge the strength of particular sets of instruments. Some methodologists
have suggested that sets of instruments should be considered “weak” if the associated
F-statistic has a magnitude less than 10 (Stock, Wright, & Yogo, 2002). Although this
cut-off is arbitrary, it is easily applied in more complex analyses in which multiple instru-
ments are incorporated and the required F-statistic is obtained in a global (GLH) test of
the hypothesis that all the instruments had no joint effect on the question predictor.
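A minimal sketch of the first-stage strength check described in this footnote, in Python, assuming a data frame containing the COLLEGE and DISTANCE variables (the file name is hypothetical); with a single instrument, the relevant F-statistic is simply the square of the first-stage t-statistic:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("hsb_dee_subsample.csv")   # hypothetical extract of the HS&B cases

# First-stage regression of the endogenous question predictor on the instrument.
first_stage = smf.ols("COLLEGE ~ DISTANCE", data=df).fit()

t_distance = first_stage.tvalues["DISTANCE"]
f_stat = t_distance ** 2                    # about 115.9 in the example in the text

# A common (if arbitrary) rule of thumb flags an instrument as "weak" when F < 10.
print(round(f_stat, 1), f_stat < 10)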
ultimate outcome goes through the question predictor. If there had been
a direct path from the instrument to the outcome, we would have seen an
overlap between the dark and light-grey ellipses that was not contained
within the medium-grey ellipse.
Thus, the second critical assumption of IVE, which we have stated formally above as “instrument and residuals must be uncorrelated,” can be
reframed as “there is no direct path from instrument to outcome, except through
the question predictor.” This means that, in seeking a successful instrument,
we need to find a variable that is related to the potentially endogenous
question predictor (we call this “the first path”), and which in turn is
related to the outcome (“the second path”), but there is no path linking the
instrument directly to the outcome (i.e., there is “no third path”). So, you
can think of the IVE process as one in which instrument I ultimately and
indirectly predicts outcome Y, but its influence passes only through ques-
tion predictor X, rather than passing directly to the outcome from the
instrument itself. If you find yourself able to argue successfully that there
is no third path in your particular empirical setting, then you have a viable
instrument! This is often the most difficult challenge you face as an empir-
ical researcher who wants to employ IVE.
Thomas Dee argued that this condition held in the data that he used in
his civic-engagement study. In fact, he invoked economic theory to argue
that his distance instrument would be negatively related to college atten-
dance because the longer a student’s commute, the greater the cost of
college attendance. Dee argued that, after conditioning on observed cova-
riates, students’ high schools (and implicitly, their homes) were distributed
randomly around their local two-year college. Consequently, DISTANCE
becomes a credible instrument because it predicts some part of the varia-
tion in educational attainment, and it only affects future civic engagement
through its relationship with attainment. So, using IV estimation, he could
tease out an asymptotically unbiased estimate of the causal impact of edu-
cational attainment on civic engagement. As we will see below, there is a
long tradition in empirical research in economics and the social sciences
of using such measures of access as instruments for otherwise endoge-
nous question predictors like educational attainment.
15. This is what Thomas Dee (2004) does in his analysis of the civic engagement data.
He uses a probit, rather than a linear, functional form in modeling the hypothesized
relationships between the outcome REGISTER and the question predictor COLLEGE,
and between the latter variable and the instrument DISTANCE.
instrument, I, and then estimating and outputting the predicted values, X̂,
for each person. These predicted values then contain only the exogenous
part of the question predictor variation because the instrument used to
predict those values was itself exogenous.
This means that, at the first stage of the 2SLS process, we fit an OLS
regression model to the hypothesized relationship between the endoge-
nous question predictor and instrument, as follows:
Table 10.3 Civic engagement (in 1992) and educational attainment (in 1984) for 9,227
participants in the sophomore cohort of the HS&B survey. IV estimation of the REGIS-
TER on COLLEGE relationship using 2SLS, with DISTANCE as the instrument
First-stage R2 = 0.0124; second-stage R2 = 0.0223.
1st Stage: X_i = \alpha_0 + \alpha_1 I_i + \delta_i \quad (10.13)

2nd Stage: Y_i = \beta_0 + \beta_1 \hat{X}_i + \epsilon_i \quad (10.14)
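Below is a minimal two-stage sketch of Equations 10.13 and 10.14 in Python, using the variable names from the text and a hypothetical data file. Note that fitting the second stage "by hand" in this way reproduces the 2SLS point estimate, but the second-stage standard errors it reports are not correct, because they treat the predicted values as if they were data; dedicated 2SLS routines apply the appropriate correction:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("hsb_dee_subsample.csv")   # hypothetical extract of the HS&B cases

# First stage (Equation 10.13): regress the endogenous predictor on the instrument.
first_stage = smf.ols("COLLEGE ~ DISTANCE", data=df).fit()
df["COLLEGE_HAT"] = first_stage.fittedvalues

# Second stage (Equation 10.14): regress the outcome on the predicted values.
second_stage = smf.ols("REGISTER ~ COLLEGE_HAT", data=df).fit()
print(second_stage.params["COLLEGE_HAT"])   # the IV estimate of beta_1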
16. As noted earlier, the F-statistic associated with the prediction of COLLEGE by DIS-
TANCE in the first-stage analysis is 115.9, exceeding the cited “weak” instrument
cut-off of 10 by a considerable margin (see footnote 14).
17. This computation is –.0064 × 9.74, which equals –.0623.
18. We can use covariance algebra to confirm this claim. In the population, the predicted values of question predictor X are represented by \alpha_0 + \alpha_1 I_i, and can replace \hat{X}_i in Equation 10.14, leading to the “reduced” model:

Y_i = \beta_0 + \beta_1 \hat{X}_i + \epsilon_i = \beta_0 + \beta_1(\alpha_0 + \alpha_1 I_i) + \epsilon_i
We can also use our earlier graphical analogy for IV methods to illus-
trate the process of 2SLS estimation. In Figure 10.2, we replicate the
original light-, medium-, and dark-grey ellipses that represent the varia-
tion and covariation among our outcome, Y, potentially endogenous
question predictor, X, and instrument, I, from the lower-panel in
Figure 10.1. To reflect the stepwise nature of the 2SLS approach, we have
replicated the original Venn diagram and presented it twice, illustrating
the first-stage and second-stage facets of the 2SLS process by “dimming
out” the unneeded portions of the figure at each stage. We present these
new figures, with their respective dimmed-out portions, in the two panels
of Figure 10.2 . The Venn diagram for the first stage of the 2SLS process
is at the top; the diagram for the second stage is at the bottom.
Recall that, in the first stage of the 2SLS process, question predictor X
is regressed on instrument I. We illustrate this in the upper panel of
Figure 10.2 by featuring the medium-grey ellipse that represents variation
in X overlapping the dark-grey ellipse that represents variation in I. Their
overlap represents not only their covariation and the success of the first-stage regression analysis, but also the part of the variation in question predictor X that has been “carved out” as exogenous and captured in the predicted values, X̂. It is this part of the variation in the question predic-
tor that is then carried through to the second stage of the 2SLS process,
by the data analyst, as fitted values. So, we have redrawn this part of the
variation in X in the lower panel as an identical truncated partial ellipse,
relabeled as “Predicted Variation in Question Predictor X̂.” This partial
ellipse is darkened to acknowledge that it represents a portion of the
question-predictor variation. Finally, it is the overlap between the com-
plete light-grey ellipse that describes variation in the outcome Y and the
Reorganizing and taking covariances with I throughout the reduced model, we have

\mathrm{Cov}(Y_i, I_i) = \mathrm{Cov}\left((\beta_0 + \beta_1\alpha_0) + \beta_1\alpha_1 I_i + \epsilon_i,\; I_i\right)

Because the covariance of a constant with I, and that of the residuals with I, are both zero, this reduces to:

\sigma_{YI} = \mathrm{Cov}(\beta_1\alpha_1 I_i, I_i) = \beta_1\alpha_1\sigma_I^{2}

Reorganizing and making \beta_1 the subject of the formula, we have:

\beta_1 = \frac{\sigma_{YI}}{\sigma_I^{2}} \times \frac{1}{\alpha_1}

We can re-express the first-stage slope parameter, \alpha_1, as (\sigma_{XI}/\sigma_I^{2}), obtained by taking covariances with I throughout the first-stage model; substituting this expression above gives \beta_1 = \sigma_{YI}/\sigma_{XI}, as in Equation 10.10.
Figure 10.2 Graphical analog for the population variation and covariation among outcome Y, potentially endogenous question predictor X, and instrument, I, used for presenting the 2SLS approach: (a) First stage: relationship between X and I, distinguishing the predicted values, X̂; (b) Second stage: relationship between Y and X̂.
The 2SLS approach provides concrete insight into one of the problems
with IVE. When you estimate the all-important “exogenous” predicted
values of X in the first stage of the process, you sacrifice variation in the
question predictor automatically because the predicted values of X inevi-
tably shrink from their observed values toward the sample mean, unless
prediction is perfect. Then, when you fit the second-stage model, regress-
ing the outcome on the newly predicted values of the question predictor,
the precision of the estimated regression slope β1 is impacted deleteri-
ously by the reduced variation present in the newly diminished version of
the question predictor. Consequently, the standard error of the new slope
will be larger than the corresponding standard error obtained in a naïve
(and biased) OLS regression analysis of the outcome/question predictor
relationship. The weaker the first-stage relationship, the less successful
you will be in carving out exogenous variability to load into the predicted
values of X. This means that when the first-stage relationship is weak, it
will be difficult to detect a relationship at the second stage unless your
sample is extremely large. This is the inevitable trade-off involved in
implementing IVE —you must forfeit variation in the question predictor
(hopefully forfeiting the endogenous portion of the variation, and retain-
ing as much of the exogenous as possible), so that you can eliminate bias
in the estimated value of β1 ! But, as a consequence, you sacrifice precision
in that estimate.19 It seems like a pretty decent trade, to us.
This trade-off between bias and precision is evident in our civic engage-
ment example. Notice that the R2 statistic in the first stage of the 2SLS
process is only slightly more than 0.01. Consequently, the standard error
associated with the all-important REGISTER on COLLEGE slope in the
second-stage analysis is quite large. In fact, if you compare the standard
Under the 2SLS approach, you fit these two models in a stepwise fashion,
with a predicted value replacing the measured value of the endogenous
predictor in the second-stage fit. However, you can also fit the two hypoth-
esized models simultaneously, using the methods of simultaneous-equations
modeling, or SEM, and again you will obtain identical results.20
As is the usual practice with the SEM approach, we first present our
hypotheses in Figure 10.3 as a path model that specifies the hypothesized
first- and second-stage relationships among the several variables simulta-
neously. In the figure, outcome Y, question predictor X, and instrument I
are symbolized by rectangles, and the connections among them are repre-
sented by single-headed arrows, each pointing in the hypothesized
direction of prediction.21 The path model contains all our hypotheses and
assumptions about the IV approach. For instance, the solid single-headed
arrow linking instrument I to question predictor X embodies our first
important assumption that the instrument I is related directly to question
predictor X, with slope parameter α1 (the first path). Then, a second
20. This technique is also often called structural-equations modeling or covariance structure
analysis. It is typically carried out using software such as LISREL and EQS.
21. In path models, a single-headed arrow indicates a connection that possesses a hypoth-
esized causal direction, such as that between predictor and outcome, and a
double-headed arrow indicates simple covariation between a pair of variables with no
explicit causal direction implied.
Figure 10.3 Path model for the IV approach: instrument I_i predicts the potentially endogenous question predictor X_i (slope α1), and X_i in turn predicts outcome Y_i (slope β1); δ_i and ε_i are the first- and second-stage residuals.
In this specification, the first-stage model partitions the original question-predictor variation into two parts: (a) the valuable “predicted” part (which is related
to I, and is therefore exogenous), and (b) the problematic “unpredicted”
part or residual, which contains any endogenously determined compo-
nent that was originally part of X. Both of these parts of the original
variation in X may be correlated with the ultimate outcome, Y, but it is
only the exogenous first part that we want to contribute to our estimate of
β1. We make sure that this happens by providing a “back door” route—
via the covariation of the first- and second-stage residuals—by which
any potentially endogenous component of X can take whatever relation-
ship it wants with outcome Y. Then, our estimate of regression slope β1
depends only on the components of variation that our IV estimation
process requires.
In Table 10.4, we present IV estimates for our civic engagement exam-
ple obtained by SEM. As expected, the estimates, their standard errors,
and the associated statistical inference match those provided by our ear-
lier approaches to IVE, and the associated R2 statistics match those that
we obtained in the first and second stages of the 2SLS analysis.
Consequently, we offer no further interpretation of them here. One small
advantage of using SEM estimation to fit the first- and second-stage models
simultaneously, however, is that the estimation process yields an estimate
of the correlation between the residuals in the first- and second-stage
models. In our example, for instance, the estimated correlation between
the errors in the two models is –0.1151, which is negative and statistically
significant. This provides evidence that IVE was indeed required and that question predictor COLLEGE is, in fact, endogenous in the second-stage model.
Table 10.4 Civic engagement (in 1992) and educational attainment (in 1984) for 9,227
participants in the sophomore cohort of HS&B survey. IV estimation of the REGISTER
on COLLEGE relationship, using SEM, with DISTANCE as the instrument
R2 0.0124
[Figure 10.4 annotations: β1 = σYX within I / σ²X within I, with first-stage residual variance σ²d, second-stage residual variance σ²e, and residual covariance σed.]
Figure 10.4 Graphical analog for the population variation and covariation among
outcome Y, potentially endogenous question predictor X, and instrument, I, used for
distinguishing the different variance components identified under the SEM approach.
distribution of these three variables from the lower panel of Figure 10.1.
Although we have left the size and shape of all three original ellipses
unchanged between figures, we have modified the construction and label-
ing of the Venn diagram to reflect the components of variation that are
identified by the way that the statistical models have been specified under
the SEM approach. Recall that this consists of two things: (a) the two
model specifications in Equation 10.15, along with (b) the additional
assumption that the residuals in the first- and second-stage models can
covary. In this new version of the graphical analog, we show how the SEM
model specification maps the two residual variances σ²d and σ²e and the
residual covariance σed onto existing parts of the Venn diagram. First, the
population variance of the residuals in the first-stage model, σ²d, consists
of all variation in question predictor X that is not related to variation in the
instrument; thus, it corresponds to the area of the medium-grey ellipse
that does not intersect with the dark-grey ellipse. Second, the population
variance of the residuals in the second-stage model, σ²e, consists of all vari-
ation in outcome Y that falls beyond the reach of that part of the variation
in the question predictor that covaries with the instrument; thus, it cor-
responds to the area of the light-grey ellipse that does not overlap with the
intersection of the medium- and dark-grey ellipses. Finally, the popula-
tion covariation of the first- and second-stage residuals, σed, is represented
by the overlap of these latter two regions, in the center of the figure, and
labeled off to the left. By specifying the SEM model as in Equation 10.15,
and permitting their residuals to covary, the areas of variation and cova-
riation among the residuals are partitioned effectively from among the
joint ellipses, leaving behind the same ratio of smaller areas to contribute
to the estimation of the population regression slope, as before. We pro-
vide a final conceptual expression for this parameter in the upper right of
the figure, to emphasize the point.
Finally, you might reasonably ask, if several methods are available for
obtaining an identical IV estimate, is one preferable to another? If you are
in the simplest analytic setting with which we opened this chapter—a
single outcome, a single potentially endogenous predictor, and a single
instrument—it doesn’t matter. Whichever analytic approach you use, you
will get the same answer. However, generally, we recommend using either
2SLS or SEM because these approaches can be extended more readily to
more complex analytic settings. (We introduced the simple method-of-
moments IV estimator at the beginning of this chapter solely for its
pedagogic value in establishing the basic principles and assumptions of
the IV approach.) We ourselves find the 2SLS approach especially ame-
nable to thoughtful data analysis, as it is simply a doubled-up application
of traditional OLS methods. So, all of the usual practices of good regression analysis continue to apply.
Now that we have explained how the standard IVE strategy involves the
fitting of a pair of linked statistical models, either sequentially by 2SLS or
simultaneously by SEM, the way is open to extend the basic approach to
the more complex analytic situations that we often face in practice. The
extensions include adding covariates as control predictors to the first- and
second-stage models, using multiple instruments in the first-stage model,
estimating models with multiple endogenous predictors in the second-
stage model, and using nonlinear model specifications. We consider each
of these extensions in turn.
Adding exogenous covariates to the first- and second-stage models will reduce residual variation, make standard errors smaller, and increase
your statistical power. However, you must exercise caution. You must be
convinced that any control variables that you add to an IVE—at either
stage—are themselves exogenous; otherwise, you will be simply introduc-
ing additional biases into your analyses.
To illustrate the process of adding covariates into an IVE, we extend
our 2SLS analysis of the civic-engagement data. Following Dee (2004), we
continue to treat DISTANCE as the critical instrumental variable, but now
we also include a vector of covariates that describe the participants’ race/
ethnicity in both the first- and second-stage regression models. For this
purpose, we have used a vector of three dichotomous predictors to distin-
guish whether a respondent was BLACK, HISPANIC, or OTHERRACE.
In each case, the relevant predictor takes on a value of 1 if the respondent
is of that particular race/ethnicity, and 0 otherwise. We have omitted the
dichotomous predictor, WHITE, to provide the reference category.
In Table 10.5, we present estimates of the critical regression parame-
ters, obtained with the new covariates incorporated into the analysis.
They can be compared with the estimates that we obtained in the absence
of the covariates in Table 10.3. As you might expect, the inclusion of the
covariates has increased the explanatory power of the models at both
stages. The associated R2 statistic has almost doubled, at each stage. This
improvement in prediction is reflected in a reduction of standard errors
associated with the second-stage parameter estimates. For instance, the
standard error associated with question predictor COLLEGE in the
second-stage model has declined by about 8%, from 0.0873 to 0.0806.
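A hedged sketch of how such a covariate-augmented IV fit might be specified, again assuming a hypothetical DataFrame df with the variables named in the text; with the linearmodels package, covariates written outside the square brackets enter both the first- and second-stage models automatically:

```python
# Covariate-augmented IV fit (hypothetical DataFrame `df`).
from linearmodels.iv import IV2SLS

iv_with_covariates = IV2SLS.from_formula(
    "REGISTER ~ 1 + BLACK + HISPANIC + OTHERRACE + [COLLEGE ~ DISTANCE]",
    data=df,
).fit()
print(iv_with_covariates.summary)
```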
Of course, as in any statistical analysis, you need to exercise caution
when adding covariates. You must weigh the potential improvement in
predicted outcome sums-of-squares against the forfeiture of degrees of
freedom. Here, in the second-stage model, for instance, we have paid for
the improved prediction of REGISTER by our sacrifice of three degrees
of freedom, one for each of the new slope parameters introduced in the
model by our inclusion of the three race/ethnicity covariates. When
sample size is large, as in this case, this is not a problem. But, if sample
size were small, the number of additional covariates was large, and there
was little improvement in fit, then the standard error associated with the
question predictor could very well become larger and statistical power
lower upon the addition of more covariates. These, however, are the same
trade-offs that we are accustomed to making in all of our statistical analyses.
Given that fitting the first- and second-stage models without covariates
resulted in a reasonably strong, statistically significant positive impact
of college enrollment on subsequent civic engagement (see Table 10.3),
it makes sense to ask why Dee included so many additional covariates in
Table 10.5 Civic engagement (in 1992) and educational attainment (in 1984) for 9,227
participants in the sophomore cohort of the HS&B survey. IV estimation of the
REGISTER on COLLEGE relationship using 2SLS, with DISTANCE as the instrument
and including participant race as covariates, at both the first and second stages
R2 0.0217
R2 0.0345
∼ p<0.10; ∗ p <0.05; ∗∗ p <0.01; ∗∗∗ p <0.001
†One-sided test.
his models. He addresses this question explicitly in his paper, and his
answer is linked to his defense of the credibility of his instruments. For
instance, one potential threat to the credibility of Dee’s instruments is
that states might choose to locate two-year colleges near communities in
which parents were well educated and the public high schools were
thought to be of especially high quality. Students in these communities
might not only have short commutes to the nearest community college,
but also be especially likely to vote as adults because their families valued
civic participation and because they attended high schools that empha-
sized its importance. As a result, students with short commutes to
community colleges might have higher probabilities of registering to
vote as adults, even if college enrollment had no causal impact on their
interest in civic participation. If this were the case, then DISTANCE (and
other measures of community college accessibility) would fail our “no
third path” test for a credible instrument, and the results of our IVE would
be flawed.
22. Notice that several of Dee’s chosen covariates—student tenth-grade test scores, for
instance—might be easily considered endogenous if they had been measured concur-
rently with the student’s decision to enroll in college (COLLEGE). However, the values
of these covariates were “predetermined”—their values were measured prior to the
period in which the value of the question predictor was determined (Kennedy, 1992,
p. 370). In contrast, it would not have been appropriate for Dee to include adult
labor-market earnings as a control variable in his second-stage equation, even though
this variable may predict the probability of civic engagement. The reason is that this
covariate is indeed endogenous in the second-stage model as its value was determined
after the college-enrollment decision had been made. Not only may this covariate be
correlated with unobserved determinants of civic engagement, such as motivation, it
may also have been associated with the decision of whether to go to college. Adding
a predictor that is potentially endogenous—at either stage—would compound your
analytic problems rather than resolve them.
The distinction between Dee’s instruments and his other exogenous covariates rests on their adherence, or lack of it, to the “no third path” assumption. If they are
truly instruments, then they must only be related to the ultimate outcome
REGISTER, through their impact on the potentially endogenous ques-
tion predictor COLLEGE. This is the case, by assumption, for Dee’s
instrument DISTANCE. Dee did not need to impose this same restriction
on the other exogenous covariates he included in the first-stage model; he
intended these other predictors to serve simply as covariates, not instru-
ments. Although they are also required to be exogenous, they are not
restricted to act only indirectly on ultimate outcome REGISTER through
the potentially endogenous question predictor. (And if they did meet the
“no third path” assumption, they would be instruments too!) The exoge-
nous covariates in the first-stage model may therefore act on the ultimate
outcome both indirectly (through question predictor COLLEGE), and
also directly—that is, by having a direct third path to the outcome of the
second-stage equation. Of course, if covariates in the first-stage model are
able to predict the ultimate outcome of the analyses directly, then you
must include these covariates necessarily in the second-stage model so
that they can display that path. Their inclusion accounts for their poten-
tial direct impact on the outcome of the second-stage model, REGISTER.
Thus, to summarize, any exogenous covariate that you choose to include
in the first-stage model and that you do not want to defend as an instru-
ment must also be included in the second-stage model. Perhaps to prevent
the inadvertent violation of this recommendation, statistical software
packages such as Stata force you to include in the second-stage model all
the covariates that you have listed for inclusion in the first-stage model
(except any instruments, of course).
There is a parallel and conceptually convergent argument, based on
the practice of 2SLS estimation, that supports this principle that "all cova-
riates in the first-stage model must also be included in the second-stage
model." It goes like this: if both the instrument and the covariates are present in
the first-stage model, then the predicted values of the potentially endogenous
question predictor obtained during the completed first stage of the 2SLS
estimation will contain all the variation in the endogenous question pre-
dictor that was predicted by both the instrument and covariates. Then,
when these predicted values replace the observed values of the poten-
tially endogenous question predictor in the second-stage estimation, the
estimated value of their associated regression coefficient will depend on
both the earlier instrumental and covariate variation unless we control
explicitly for covariate variation in the second-stage model too! To be
unbiased, the estimated value of the regression coefficient associated with
the question predictor must be derived only from the variation that origi-
nates in the instrument itself. Therefore, we must include the first-stage covariates in the second-stage model as well.
Table 10.6 Civic engagement (in 1992) and educational attainment (in 1984) for 9,227
participants in the sophomore cohort of the HS&B survey. IV estimation of the
REGISTER on COLLEGE relationship, using 2SLS, with DISTANCE and NUMBER as
instruments and participant race as covariates, at both the first and second stages
Table 10.7 Civic engagement (in 1992) and educational attainment (in 1984) for 9,227
participants in the sophomore cohort of the HS&B survey. IV estimation of the REGIS-
TER on COLLEGE regression slope, using 2SLS, with DISTANCE and NUMBER and
their interactions with each other and with race as instruments, and with the main
effects of race included at both stages as covariates
Model A Model B
R2 0.0270 0.0276
Model A Model B
R2 0.0295 0.0274
∼ p<0.10; ∗ p <0.05; ∗∗ p <0.01; ∗∗∗ p <0.001
†One-sided test.
The first-stage R2 statistic rises only trivially, from 0.0269 in Table 10.6 to 0.0276 in Table 10.7. Thus, as you might
expect, the estimated impact of COLLEGE on REGISTER at the second
stage also appears largely unaffected by the addition of the new interac-
tion instruments (the impact of COLLEGE rises only from 0.2776
to 0.2854) and the associated standard error declines only minimally.
The principle, however, remains intact and potentially useful, for applying
IVE in other data: it is always a good idea to look beyond a simple main-
effects specification of the first-stage model by considering interactions
among your instruments, and between your instruments and your exog-
enous covariates. All these could function potentially as instruments and
you could benefit if they did.23
23. To keep the exposition as simple as possible, we do not make use of the second instru-
mental variable, NUMBER, in subsequent sections of this chapter.
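As an illustration of this advice, one might form the DISTANCE-by-race cross-products and offer them as additional instruments alongside the main effect of DISTANCE (a sketch under the same hypothetical DataFrame assumption as before; the DIST_X_ variable names are our own):

```python
# Interactions between the instrument and exogenous covariates used as
# additional instruments (hypothetical DataFrame `df`).
from linearmodels.iv import IV2SLS

for race in ["BLACK", "HISPANIC", "OTHERRACE"]:
    df["DIST_X_" + race] = df["DISTANCE"] * df[race]

iv_interactions = IV2SLS.from_formula(
    "REGISTER ~ 1 + BLACK + HISPANIC + OTHERRACE"
    " + [COLLEGE ~ DISTANCE + DIST_X_BLACK + DIST_X_HISPANIC + DIST_X_OTHERRACE]",
    data=df,
).fit()
print(iv_interactions.summary)
```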
24. This is a practice that is enforced by the programming structure of statistical software
such as Stata.
Table 10.8 Civic engagement (in 1992) and educational attainment (in 1984) for 9,227
participants in the sophomore cohort of the HS&B survey. IV estimation of the REGIS-
TER on COLLEGE relationship using 2SLS, including participant race as covariates at
both the first and second stages. Second-stage model contains interactions between the
endogenous question predictor COLLEGE and participant race, and first-stage model
includes DISTANCE and its interactions with race as instruments
First-stage models (cells show the parameter estimate followed by its standard error)
Outcome = COLLEGE
INTERCEPT α0 0.6452∗∗∗ 0.0101
DISTANCE α1 –0.0071∗∗∗ 0.0007
BLACK α2 –0.0651∗∗∗ 0.0228
HISPANIC α3 –0.1276∗∗∗ 0.0194
OTHERRACE α4 0.0590∼ 0.0351
DIST × BLACK α5 0.0009 0.0020
DIST × HISPANIC α6 0.0013 0.0016
DIST × OTHERRACE α7 –0.0030 0.0030
R2 0.022
Outcome = COLLEGE × BLACK
BLACK α8 0.5801∗∗∗ 0.0081
DIST × BLACK α9 –0.0062∗∗∗ 0.0007
R2 0.504
Outcome = COLLEGE × HISPANIC
HISPANIC α10 0.5176∗∗∗ 0.0087
DIST × HISPANIC α11 –0.0058∗∗∗ 0.0007
R2 0.419
Outcome = COLLEGE × OTHERRACE
OTHERRACE α12 0.7041∗∗∗ 0.0076
DIST × OTHERRACE α13 –0.0101∗∗∗ 0.0006
R2 0.617
R2 0.0125
∼ p<0.10; ∗ p <0.05; ∗∗ p <0.01; ∗∗∗ p <0.001
†One-sided test.
25. This statement is easy to confirm analytically, by fitting the requisite models by simul-
taneous-equations estimation. Then, both the “full” and “reduced” models can be fit
and will provide identical answers, with un-needed coefficients taking on a value of
zero during analysis if they are not set to zero in advance.
26. Another approach that can be used to test for the presence of interactions between
the endogenous predictor and covariates in the second-stage model is 2SLS,
conducted piecewise “by hand” using OLS regression analysis. Under the two-step
approach, you fit a single first-stage model by regressing the endogenous question
predictor (COLLEGE) on the main effect of the single instrument (DISTANCE) and
covariates. Fitted values of COLLEGE are then output from the fitted first-stage model
into a new variable, call it PREDCOLL. This latter variable is then introduced into the
second-stage model, in place of COLLEGE, in the usual way. PREDCOLL can be inter-
acted with exogenous predictors in the second-stage model, by forming cross-products
and entering them as predictors also. The estimates obtained are identical to those
obtained using the simultaneous methods described in the text, but you must adjust
the standard errors by hand, which can be tedious.
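A sketch of the "by hand" two-step procedure that footnote 26 describes, under the same hypothetical DataFrame assumption (PREDCOLL and the cross-product names are ours):

```python
# "By hand" two-step approach from footnote 26 (hypothetical DataFrame `df`).
import statsmodels.api as sm

# Single first-stage model: COLLEGE on the instrument and the covariates.
X1 = sm.add_constant(df[["DISTANCE", "BLACK", "HISPANIC", "OTHERRACE"]])
df["PREDCOLL"] = sm.OLS(df["COLLEGE"], X1).fit().fittedvalues

# Second stage: PREDCOLL and its cross-products with the race indicators.
for race in ["BLACK", "HISPANIC", "OTHERRACE"]:
    df["PREDCOLL_X_" + race] = df["PREDCOLL"] * df[race]
X2 = sm.add_constant(
    df[["PREDCOLL", "PREDCOLL_X_BLACK", "PREDCOLL_X_HISPANIC",
        "PREDCOLL_X_OTHERRACE", "BLACK", "HISPANIC", "OTHERRACE"]]
)
second_stage = sm.OLS(df["REGISTER"], X2).fit()
# Remember: these OLS standard errors must still be corrected by hand.
```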
27. Notice that the coefficients on COLLEGE∗BLACK, COLLEGE∗HISPANIC, and
COLLEGE∗OTHERRACE are all negative and of approximately the same size (in abso-
lute value) as the positive coefficient on the main effect of COLLEGE. Recall that the
estimate of the impact of college enrollment on the probability of voter registration
for blacks, for example, is the sum of the coefficient on the main effect plus the coef-
ficient on the interaction term. Thus, the pattern of coefficients suggests that Dee’s
finding that college enrollment results in an increase in the probability of voter regis-
tration may be driven by the behavior of white students. However, it would take
analyses based on a much larger sample than that available in HS&B to reject the null
hypothesis that the impact of college is the same for all racial/ethnic groups.
Table 10.9 Civic engagement (in 1992) and educational attainment (in 1984) for 9,227
participants in the sophomore cohort of the HS&B survey. IV estimation of the
REGISTER on COLLEGE regression slope, using bivariate probit analysis, with
DISTANCE and NUMBER as instruments and including covariates representing
participant race, at the first and second stages
28. The estimated coefficient on COLLEGE in the second-stage fitted probit model must
be transformed into a more meaningful metric before it can be interpreted easily. This
is usually achieved by estimating the instantaneous slope of the outcome/predictor
relationship at the average values, or some other sensible specified values, of the cova-
riates. Here, we have estimated its value when COLLEGE is set to its sample average of
0.55, controlling for the presence of other predictors in the model.
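For a rough sense of the transformation that footnote 28 describes, the usual instantaneous-slope (marginal-effect) calculation for a fitted probit evaluates the standard normal density at chosen predictor values (a generic formula, not necessarily the authors' exact computation):

```latex
\frac{\partial \Pr(\textit{REGISTER}_i = 1)}{\partial\, \textit{COLLEGE}_i}
  = \phi\!\left(\mathbf{x}_i'\hat{\boldsymbol{\beta}}\right)\hat{\beta}_{\textit{COLLEGE}},
```

where φ(·) is the standard normal density and the predictors in x are set to chosen values (for instance, COLLEGE at its sample average of 0.55).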
29. Many of the ideas in this section derive from the presentation of Angrist and Krueger
(2001).
if December 31 is the birth date cut-off, children who are born early in the
calendar year enter school almost a year older than children who are born
later. Most states also have compulsory schooling laws, and usually require
students to remain in school until their 16th birthday. Thus, children
whose birthdays fall early in the calendar year typically reach the age of 16
one grade lower than children whose birthdays fall later in the calendar
year. This pattern led Angrist and Krueger to hypothesize that “quarter of
birth”—that is, whether the child was born in Spring, Summer, Fall, or
Winter—provided a set of credible instruments for identifying exogenous
variation in educational attainment. At any subsequent age, children
who had been born in a later quarter in the calendar year would tend
to have higher completed educational attainment. Furthermore, these
differences would be arguably exogenous because they were simply a con-
sequence of the haphazard and idiosyncratic nature of birth timing. The
authors applied their instruments in statistical models fitted to data from
the 1960, 1970, and 1980 censuses of the population. They included in
their analytic samples males who were 30–39 years of age, and those who
were 40–49 years of age, at the time they completed the relevant census
questionnaire.
The outcome variable in Angrist and Krueger’s first-stage model was
the number of years of schooling that each male had completed as of the
date of the relevant census. And, as you might expect, the predictor vari-
ables at this stage were of two types: (a) exogenous variables that also
served as covariates in the second-stage model, and (b) instruments. The
first group included a vector of nine dichotomous variables to distinguish
year of birth, and a vector of eight dichotomous variables that described
region of residence. The instruments included three dichotomous vari-
ables that identified the quarter of birth, and the interactions of these
quarter-of-birth predictors with the year-of-birth predictors.30 To confirm
their claim that the quarter of birth did indeed predict completed years
of schooling, Angrist and Krueger demonstrated that, on average, men
born in the first quarter of a calendar year had completed about one-
tenth of a year less schooling by age 30 than men born in the last quarter
of the calendar year. In their second-stage model, Angrist and Krueger
used the exogenous variation in educational attainment that had been
identified at the first stage to predict the logarithm of the men’s weekly
labor-market earnings. They found that, on average, each additional year
of education had caused a 10% increase in average weekly earnings.
30. To demonstrate the robustness of their results, Angrist and Krueger present
coefficients estimated from fitting first- and second-stage models with a variety of
specifications.
Of course, the authors were careful to point out that their estimate was a
local average treatment effect (LATE) that pertained only to males who
wanted to leave school as soon as they were old enough to do so.
Angrist and Krueger faced two threats to the validity of their instru-
ments. The first is that there may have been intrinsic differences in the
unobserved abilities of males who were born in different quarters of the
calendar year. If this were the case, then there could be a “third path” that
connected quarter of birth directly to subsequent labor-market earnings,
invalidating the use of quarter-of-birth indicators as instruments. The
authors responded to this potential threat in the following way. First, they
argued that quarter of birth should not affect the completed years of
schooling of males who had college degrees because the ultimate educa-
tional decisions of this group would not have been constrained by
compulsory schooling laws. Thus, for college graduates, there should not
even be any “second path” that related quarter of birth to labor-market
earnings through the impact on educational attainment. Consequently,
evidence that quarter of birth predicted the labor-market earnings of
college graduates would suggest the presence of a third path, directly
relating quarter of birth to the ultimate earnings outcome. The presence
of such a third path would invalidate the quarter-of-birth instruments.
Angrist and Krueger used their data on 40 to 49-year-old male college
graduates in the 1980 Census data to test for the presence of the direct
“third path.” They achieved this by fitting a single OLS regression model
in which they treated the ultimate outcome, logarithm of weekly earn-
ings, as the outcome variable and all the first-stage covariates and
instruments as predictors. In the annals of IVE, this is called the reduced-
form model, and it corresponds to the statistical model that is obtained by
collapsing the first-stage model into the second-stage model algebraically.31
After fitting their reduced-form model, Angrist and Krueger conducted a
test of the null hypothesis that the coefficients on the three quarter-
of-birth dichotomous predictors were jointly equal to zero. They failed
to reject this null hypothesis. They found the same result when they
repeated the exercise using data on 40 to 49-year-old college graduates
taken from the 1970 Census. This evidence led them to conclude that
there was no direct path relating quarter of birth to the weekly earnings
31. Recall that the endogenous predictor both appears in the second-stage model (as the
question predictor) and is the outcome of the first-stage model. Thus, you can take
the right-hand side of the first-stage model and substitute it for the endogenous pre-
dictor in the second-stage model, and simplify the resulting combination algebraically
to leave a “reduced-form” model in which the ultimate outcome is regressed directly
on the instruments and covariates.
of college graduates. By analogy, they argued that this pattern would also
hold true for males with lower completed years of schooling. This evi-
dence and logic were central to Angrist and Krueger’s argument that
quarter-of-birth predictors satisfied the “no third path” requirement that
was needed for an instrument to be legitimate.32
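To make the substitution that footnote 31 describes concrete, in the simplest case of a single instrument, a single endogenous predictor, and no covariates (our sketch, not Angrist and Krueger's full specification), the algebra works as follows:

```latex
\begin{aligned}
\text{First stage:}  \quad X_i &= \alpha_0 + \alpha_1 I_i + d_i\\
\text{Second stage:} \quad Y_i &= \beta_0 + \beta_1 X_i + e_i\\
\text{Reduced form:} \quad Y_i &= (\beta_0 + \beta_1\alpha_0) + \beta_1\alpha_1 I_i + (\beta_1 d_i + e_i),
\end{aligned}
```

so that the ultimate outcome is regressed directly on the instruments (and, in Angrist and Krueger's application, on the covariates as well).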
The second threat to the validity of Angrist and Krueger’s instrumental-
variables strategy is that their instruments predicted only a very small part
of the total variation in the endogenous question predictor, years of com-
pleted schooling. As a result, their analyses could have been subject to
the weak-instrument problem. As explained earlier, one problem with the
implementation of IVE using weak instruments is that the results of
fitting the second-stage model can be very sensitive to the presence in the
point-cloud of even relatively few aberrant data points. Thus, IVE with
weak instruments can produce substantially biased results, even when the
analytic samples are very large (Bound, Jaeger, & Baker, 1995; Murray,
2006).
Angrist and Krueger’s response to the weak-instrument threat was to
check whether they did indeed have a problem. They conducted tests of
the null hypothesis that the coefficients on the three quarter-of-birth
dummy variables in their fitted first-stage model were jointly equal to
zero. (Recall that this is the model in which the men’s years of completed
schooling is the outcome variable.) As reported in Table 1 of their 1991
paper, they were able to reject this null hypothesis when they fit the first-
stage model to data on a sample of 30 to 39-year-old males (F = 24.9;
p<0.0001) and to data from a sample of 40 to 49-year-old males (F = 18.6;
p <0.0001), both samples taken from the 1980 Census. They presented
these test results as evidence that their IV estimations did not suffer from
a weak-instrument problem.33
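A sketch of such a first-stage joint test, with hypothetical column names (SCHOOLING, QOB1–QOB3 for the quarter-of-birth dummies, YOB, REGION), since we do not have Angrist and Krueger's census extracts:

```python
# First-stage joint test of the instruments (hypothetical column names).
import statsmodels.formula.api as smf

first_stage = smf.ols(
    "SCHOOLING ~ QOB1 + QOB2 + QOB3 + C(YOB) + C(REGION)", data=df
).fit()
# Test H0: the three quarter-of-birth coefficients are jointly zero.
print(first_stage.f_test("QOB1 = 0, QOB2 = 0, QOB3 = 0"))
```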
32. Several studies published after the appearance of the Angrist and Krueger (1991)
paper provide evidence that season of birth affects adult life outcomes through mech-
anisms other than via the effect of compulsory-schooling laws. See, for example,
Bound and Jaeger (2000), and Buckles and Hungerman (2008).
33. Bound, Jaeger, and Baker (1995) argued that Angrist and Krueger’s defense of their
instruments was inadequate for two reasons. First, they did not find convincing
Angrist and Krueger’s defense of the “weak instrument” threat, pointing out that the
values of the R2 statistic from Angrist and Krueger’s first-stage regressions were
extraordinarily low. Second, Bound and his colleagues questioned whether the quarter-
of-birth instruments that Angrist and Krueger used in their 1991 paper satisfy the “no
third path” assumption.
Hoxby found that her instrument—the idiosyncratic deviations of grade-level enrollment from each district’s smooth underlying enrollment trend—did indeed predict the logarithm of class size quite well at the first stage.34 However,
the exogenous portion of class size that was thereby carved out ended up
having no causal impact on student achievement. She concluded that
there was no causal relationship between class size and student achieve-
ment in her data. Hoxby defended her null finding by arguing that this
idiosyncratic year-to-year variation in class size would be unlikely to lead
teachers to change their instructional patterns in a manner that would
affect student achievement.
The primary threat to the validity of Hoxby’s choice of instrument is
the possibility that her residual deviations from the smooth underlying
enrollment trends were not simply a consequence of idiosyncrasies in the
timing of births, but rather stemmed from purposeful actions by families
that could also result in a “third path” that connected the values of the
instrument to the ultimate levels of student achievement directly. For
example, perhaps entrepreneurial parents who learned that their child
was about to be placed in a large class could have transferred their child
to another school in the same school district, moved to another school
district, or sent their child to a private school. If the parents who responded
in this way were those who also devoted a particularly large amount of
time and resources at home to improving their child’s achievement, these
responses would create a direct third path that linked the enrollment
deviations to children’s achievement.
Hoxby presented four arguments in defense of her choice of instru-
ment. First, she pointed out that the coefficients on the time predictors in
her fourth-order polynomial enrollment model captured virtually all the
smooth time trends, even the quite subtle ones. Consequently, it is highly
likely that the residuals from these fitted curves (which served as her
instrument) were indeed the results of idiosyncratic events. Second, she
argued that even if the idiosyncratic variation did reflect the purposeful
actions of a few families choosing one school over another, it is likely that
the families would be choosing among public schools in a particular district.
She then showed that the results of fitting her first- and second-stage
models remained essentially the same when she refitted them on data
that had been aggregated to the district level. Third, Hoxby refit the
models from which she had derived her instrumental residuals, this time
using data on the number of children in each district who were aged
five on the school-entry date as her outcome variable, instead of the com-
plete grade-level enrollment in the district. Replicate analyses using these alternative measures of enrollment produced essentially the same results.
34. As reported in Table III (p. 1270) of Hoxby’s (2000) paper, the t-statistics on the instru-
ment in her first-stage regressions ranged in value from roughly 4 to 80, depending on
the grade level and model specification.
In Colombia, secondary school begins in the sixth grade and ends in the eleventh. Recipients of the PACES
scholarships could renew them through eleventh grade so long as their
academic progress was sufficient to merit promotion each year to the next
grade.
The PACES program was administered by local governments, which
covered 20% of the cost. The other 80% was covered by the central gov-
ernment. In many locales, including the capital city of Bogotá, demand
for the secondary-school scholarships exceeded supply. This led local gov-
ernments, including Bogotá’s, to use lotteries to determine which children
were offered scholarships. To be eligible for the government scholarship
lottery in Bogotá, a child had to live in a designated low-income neighbor-
hood, have attended a public primary school, and have been accepted at
a private secondary school that participated in the PACES program.
The evaluators of the Colombia secondary-school voucher program, a
group that included Joshua Angrist, Eric Bettinger, Erik Bloom, Elizabeth
King, and Michael Kremer (2002), started out by addressing the following
question: Did the offer of a PACES scholarship increase students’ educa-
tional attainments? The evaluators hypothesized that there were several
mechanisms through which this might be the case. First, some low-income
families that wanted to send their child to a private secondary school
could not have afforded to do so (at least for very long) in the absence of
a scholarship. A second is that availability of a scholarship would allow
some parents who might have sent their child to a private secondary
school in any case to upgrade to a better (and more expensive) private
school. A third is that the condition that renewal of the scholarship was
contingent on promotion would induce some students to devote more
attention to their studies than they otherwise would have.2
You, the careful reader, will recognize from Chapter 4 that you already
know how to make use of information from a fair lottery to answer a
research question about the impact of an offer of a scholarship from the
PACES program. The lottery created two exogenously assigned groups:
(a) a treatment group of participants who were offered a scholarship, and
(b) a control group of participants who were not offered a scholarship.
Using standard ordinary least-squares (OLS)-based regression methods
on the sample of students who participated in the 1995 Bogotá lottery,
the investigators found that the offer of a government-provided scholar-
ship increased, by about 11 percentage points, the probability that a student from a low-income family would complete the eighth grade by 1998.3
2. The Angrist et al. (2002) paper explains how the authors used their hypotheses about
the mechanisms underlying the program effects to inform their empirical work.
Angrist and his colleagues also wanted to address a second research ques-
tion: Does the use of financial aid to pay for secondary school increase the
educational attainments of low-income students? There are two reasons
why this second question differs from the first. One reason is that not all
of the low-income families that won the lottery and were offered a govern-
ment scholarship chose to use it. The second is that some families that
lost out in the lottery were successful in finding financial aid from other
sources. The challenge is to find a way to obtain an unbiased estimate of
the answer to this second research question.
It is important to keep in mind that Angrist and his colleagues’ second
research question concerns the consequences of obtaining and making
use of financial aid, not the consequences of attending a private school.
The distinction is important because almost all students who participated
in the PACES lottery, even those who were not awarded a PACES scholar-
ship, enrolled in a private secondary school for grade 6. This is not
surprising, given that a condition for eligibility for the PACES lottery was
that students had to have been accepted by a private secondary school
that participated in the PACES program. As mentioned earlier, some of
the families that lost out in the PACES lottery were able to obtain finan-
cial aid from another source to help pay their child’s private-school fees.
Some families that lost out in the PACES lottery paid the private-school
fees out of their own resources. However, many of these families were
unable to pay the private-school fees in subsequent years and consequently
their children left private school.
3. The estimates reported in the Angrist et al. (2002) paper range between 9 and 11
percentage points, depending on the covariates included in the statistical model.
The figure that we report here comes from an OLS-fitted regression model in which
the outcome was a dichotomous variable indicating whether a student had completed
eighth grade by 1998, and the single predictor was a dichotomous variable that took on
a value of 1 for students who were offered a government scholarship in the 1995 Bogotá
lottery.
The challenge that Angrist and his colleagues faced was to find a way to
obtain an asymptotically unbiased estimate of the causal impact of the use
of financial aid on children’s subsequent educational attainment when
the offer—but not the use—of financial aid had been assigned randomly. In
what follows, we reorient our account of the evaluation of the PACES
scholarship program in Colombia to show how to answer this second
research question. Rather than regard the PACES lottery in Bogotá as a
randomized experiment designed to assess the causal impact of the offer
of a government scholarship on subsequent educational attainment, we
will regard it as a “flawed” (i.e., nonrandomized) experiment to assess the
impact of the use of scholarship aid from any source. In other words, in
our subsequent descriptions, the treatment of interest will be the use of
scholarship aid from any source. Seen from this perspective, the assign-
ment of participants to experimental conditions—that is, to a treatment
group that made use of financial aid or to a control group that did not
make use of financial aid—was tainted by self-selection. Thus, using the
term we defined in Chapter 3, we will refer to the evaluation of the
Colombia PACES program as a quasi-experiment to investigate the impact
of the use of financial aid on educational attainment, rather than as a
randomized experiment examining the impact of the offer of a PACES
scholarship.
As we explained in Chapter 3, it was not so long ago that researchers tried
to eliminate potential biases in their analyses of such quasi-experimental
data by incorporating large numbers of covariates into their OLS regres-
sion analyses in the hope of “controlling away” the differences due to
selection into the treatment and control groups. However, as we hope is
now clear, this strategy is unlikely to be successful because students whose
families took advantage of financial aid may have differed from those that
did not in many unobserved ways. For instance, parents who made use of
financial aid may have placed an especially high value on education,
regardless of whether they were assigned a government-provided scholar-
ship or not. Such a family value system may then have led to enhanced
family support for the child’s education and eventually to greater educa-
tional attainment, irrespective of any impact of the use of financial aid.
Then, in naïve OLS analyses of the quasi-experimental data, investigators
would have attributed these achievement differences spuriously to the
effect of the use of financial aid unless all differences in relevant family
values were controlled completely. Since many of the differences between
the families that found and used financial aid and those that did not were
unobserved, it is unlikely that the use of OLS methods, even including a
rich set of covariates, would provide an unbiased estimate of the answer
to the research question.
Table 11.1 Sample means (and standard deviations, where appropriate) on the outcome
variable, question predictor, instrument, and covariates, for a sample of students from
Bogotá, Colombia, who participated in the 1995 lottery to obtain a government-funded
private-school tuition scholarship, overall and by whether the child was offered financial aid
                              Full       Offered            Not offered        p-value for
                              sample     (WON_LOTTRY = 1)   (WON_LOTTRY = 0)   difference
Outcome:
  FINISH8TH                   0.681      0.736              0.625              0.000
Endogenous Question Predictor:
  USE_FIN_AID                 0.582      0.915              0.240              0.000
Instrument:
  WON_LOTTRY                  0.506      –                  –                  –
Covariates:
  BASE_AGE                    12.00      11.97              12.04              0.42
                              (1.35)     (1.35)             (1.34)
  MALE                        0.505      0.505              0.504              0.98
Notice that about 58% of the students in our subsample did make use of
financial aid for at least one year during the three-year period under study.
The second variable listed is also dichotomous, and we have named it
WON_LOTTRY. Under our new quasi-experimental conception of the
Bogotá evaluation, this randomized offer of a government PACES schol-
arship is now an expression of the investigators’ intent to provide financial
aid, and is exogenous by randomization. WON_LOTTRY therefore has a
value of 1 for students who won the lottery and were offered a govern-
ment scholarship, and a value of 0 for students who lost out in the lottery.
In our subsample of children from Bogotá who participated in the 1995
lottery, almost 51% of participants were assigned randomly to receive an
offer of a government scholarship. Finally, in Table 11.1, we present paral-
lel descriptive statistics on two other variables measured at baseline:
(a) BASE_AGE, which measures the child’s age (in years) on the date of the
lottery, and (b) MALE, a dichotomous indicator that takes on a value of
1 for a male child, 0 for a female. We treat BASE_AGE and MALE as cova-
riates in our instrumental-variables analysis, using the strategies described
in the previous chapter to improve the precision of our estimates.
In the remaining columns of Table 11.1, we provide descriptive statis-
tics on outcome FINISH8TH, potentially endogenous question predictor,
USE_FIN_AID, and covariates BASE_AGE and MALE for the subsample
of children who were offered a PACES scholarship (WON_LOTTRY = 1),
and for the subsample of children who were not (WON_LOTTRY = 0).
We have also added a final column that contains the p-value from a t-test
of the null hypothesis that the population means of each variable do not
differ between those children who received the offer of a PACES scholar-
ship and those who did not. Notice the interesting similarities and
differences between the two groups, which ultimately drive the success of
our instrumental-variables estimation. For instance, on average, age on
the lottery date is the same in the two groups at baseline, as is the percent-
age of male students, as you would expect because the groups were formed
by the random assignment of a tuition offer. After three years, however,
there is about an 11 percentage point difference favoring the group that
was offered PACES scholarships in the percentage of students who had
completed grade 8. Notice that there are also statistically significant differ-
ences in the average value of the potentially endogenous question predictor,
USE_FIN_AID, between the two groups. As mentioned earlier, almost 92%
of the students who were offered a government scholarship used financial
aid to pay private-school fees, whereas only 24% of students who lost out
in the lottery did so. This confirms, as we suspected, that there is a strong
relationship between our potential instrument, WON_LOTTRY, and our
suspect and potentially endogenous question predictor, USE_FIN_AID.
Therefore, we have met the first condition for a credible instrument.
Under our new quasi-experimental framework for investigating the
impact of use of financial aid on students’ educational attainment, varia-
tion in our question predictor, USE_FIN_AID, is potentially endogenous.
Clearly, the choice of whether to use financial aid (from any source)
depends not only on the lottery outcome but also on the many unseen
resources, needs, and objectives of the family, each of which may also
impact the child’s subsequent educational attainment. Consequently, if
we were to use OLS regression analysis to investigate the relationship
between outcome FINISH8TH and question predictor USE_FIN_AID
(controlling for BASE_AGE and MALE), we would undoubtedly end up
with a biased estimate of any causal effect. Instead, we have used the two-
stage least-squares (2SLS) approach to obtain an IV estimate of the
relationship, using the exogenous assignment of intent to treat, repre-
sented by WON_LOTTRY, as our instrument. Our first- and second-stage
statistical models follow the pattern that we established in the previous
chapter, as follows:
USE_FIN_AIDi = α0 + α1WON_LOTTRYi + α2BASE_AGEi + α3MALEi + di
FINISH8THi = β0 + β1USE_FIN_AIDi + β2BASE_AGEi + β3MALEi + ei
with the usual notation and assumptions. We have written these models
to reflect that, under 2SLS, the predicted values of potentially endoge-
nous question predictor USE_FIN_AID are obtained at the first stage and
used in place of the corresponding observed values at the second stage
(with appropriate corrections to the standard errors).4 We have also fol-
lowed our own earlier advice and included covariates, BASE_AGE and
MALE, in both the first- and second-stage models. Finally, for pedagogical
clarity, we have again adopted the simple linear-probability specification
for the first- and second-stage models.
We provide estimates, corrected standard errors, and approximate p-val-
ues for the model parameters, at both stages, in the upper and lower panels
of Table 11.2. In addition, in the two right-hand columns of the lower
panel, we have included estimates of corresponding parameters from a
naïve OLS regression analysis of FINISH8TH on the potentially endoge-
nous question predictor USE_FIN_AID, again controlling for BASE_AGE
and MALE, for comparison.5 Much of the table confirms what we already
suspected from examining the descriptive statistics in Table 11.1.
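A sketch of how the two fits compared in Table 11.2 might be specified, assuming a hypothetical DataFrame df with the chapter's variable names (the software choice is ours, not the authors'):

```python
# Naive OLS versus 2SLS estimates of the effect of using financial aid.
import statsmodels.formula.api as smf
from linearmodels.iv import IV2SLS

naive_ols = smf.ols("FINISH8TH ~ USE_FIN_AID + BASE_AGE + MALE", data=df).fit()

# WON_LOTTRY instruments USE_FIN_AID; BASE_AGE and MALE enter both stages.
iv_2sls = IV2SLS.from_formula(
    "FINISH8TH ~ 1 + BASE_AGE + MALE + [USE_FIN_AID ~ WON_LOTTRY]", data=df
).fit()

print(naive_ols.params["USE_FIN_AID"], iv_2sls.params["USE_FIN_AID"])
```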
Because endogenous predictor X is a dichotomy, its averages are proportions, and so the
denominator can be further simplified:
βYX = ( μY|I=1 − μY|I=0 ) / ( p(X = 1 | I = 1) − p(X = 1 | I = 0) )
This result has an interesting interpretation. First, the numerator is the difference in
outcome means between subpopulations defined by the values of the instrument,
1 and 0. In our Colombia tuition-voucher example, for instance, it is the difference in
outcome means between the subpopulation that received an offer of a scholarship and
the subpopulation that did not. It is the population effect of intent to treat (ITT). Second,
the denominator is the difference between subpopulations for whom I = 1 and I = 0,
Table 11.2 Instrumental-variables (2SLS) and naïve OLS estimates of the impact of use
of financial aid on on-time graduation from grade 8, among low-income students from
Bogotá, controlling for student gender and baseline age
in the proportions of participants for whom the endogenous predictor takes on a value of 1.
In our Colombia tuition-voucher example, this is the difference between the “offer”
and “no-offer” groups in the proportion of children who made use of financial aid from
any source. Combining these interpretations, we conclude that an asymptotically unbi-
ased estimate of the effect of using financial aid on educational attainment can be
obtained by rescaling the ITT estimate, using the difference in the sample proportion
of children who made use of financial aid, in each of the original randomized offer and
no-offer groups. (This is called a Wald estimator, after the eminent statistician, Abraham
Wald.) This conclusion continues to hold when additional exogenous covariates are
included, except that the effects of the new covariates must be partialed from the con-
ditional averages being divided above. Including the additional covariates in the first
and second stages of the 2SLS procedure achieves the conditioning automatically.
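As a quick check, a back-of-the-envelope Wald calculation using the sample means reported in Table 11.1 (and ignoring the covariates) comes very close to the IV estimate reported below; a minimal sketch:

```python
# Wald estimator from the sample means in Table 11.1 (no covariates).
itt = 0.736 - 0.625        # difference in mean FINISH8TH: offered vs. not offered
take_up = 0.915 - 0.240    # difference in proportion making use of financial aid
wald_estimate = itt / take_up
print(round(wald_estimate, 3))  # about 0.164, i.e., roughly 16 percentage points
```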
From the upper panel, at the first stage, we can examine the all-important
first path that links instrument WON_LOTTRY to the potentially endog-
enous question predictor, USE_FIN_AID. Their relationship is strong and
statistically significant (p <0.001), and the fitted probability that a child
will use a scholarship is almost 68 percentage points higher among those
who received the offer of a PACES government scholarship than among
those who did not. Thus, WON_LOTTRY is confirmed as a strong instru-
ment.6 Notice, also, that covariate BASE_AGE has a negative relationship
(p <0.10) with the outcome variable, indicating that older children were
somewhat less likely to make use of financial aid than were those who
were relatively young on the date of the lottery.
At the second stage (lower panel), we can examine the second path
linking the predicted values of the potentially endogenous question pre-
dictor, USE_FIN_AID, to children’s subsequent educational attainment.
The results of the (biased) naïve OLS regression analysis suggest that the
children whose families made use of financial aid were 12 percentage
points more likely to have completed the eighth grade by 1998 than
those who did not make use of a scholarship. In contrast, our IV estimate
is almost 16 percentage points, almost one-third larger than the biased
OLS estimate. Notice that both covariates, BASE_AGE and MALE, play a
statistically significant role at the second stage, reducing residual variance
and increasing statistical power.
7. Our description of the use of IVE to identify causal effects within the context of Rubin’s
causal model draws heavily on the lucid description provided by Gennetian and her
colleagues in Chapter 3 of the 2005 book edited by Howard Bloom.
“Compliers”: if offered a voucher (WON_LOTTRY = 1), USE_FIN_AID = 1 (used financial aid from some source); if not offered (WON_LOTTRY = 0), USE_FIN_AID = 0 (did not use financial aid from any source).
“Always-Takers”: USE_FIN_AID = 1 (used financial aid from some source) whether offered a voucher or not.
“Never-Takers”: USE_FIN_AID = 0 (did not use financial aid from any source) whether offered a voucher or not.
Figure 11.1 Compliance styles in the population of students from Bogotá, by offer of a
government private-school tuition voucher, WON_LOTTRY (1 = scholarship voucher
offered; 0 = no scholarship voucher offered), and ultimate use of financial aid,
USE_FIN_AID (1 = used; 0 = did not use).
8. The reason we abstract from the details of the administration of the 1995 PACES lot-
tery in Bogotá in presenting this framework is that we want to distinguish the population
from the research sample. This distinction is not present in the Bogotá case because all
members of the population of eligible volunteers in Bogotá participated in that PACES
lottery.
The first compliance style in the Angrist, Imbens, and Rubin framework is that of the compliers and, in designing the research, we hope that there are a lot of these. Compliers
are willing to have their behavior determined by the outcomes of the lot-
tery, regardless of the particular experimental condition to which they
were assigned. Complying families who were assigned a PACES scholar-
ship use financial aid to help pay their children’s school fees at a private
secondary school; complying families who were not assigned a PACES
scholarship do not make use of financial aid from any source. The last two
compliance styles in the Angrist, Imbens, and Rubin framework, labeled
always-takers and never-takers, are also present potentially in empirical
research. Always-takers are families who will find and make use of finan-
cial aid to pay private-school fees regardless of whether they had been
assigned a PACES government scholarship. Never-takers are their mirror
image—they will not make use of financial aid to pay children’s fees at a
private secondary school under any circumstances.9
What do these three categories of potential compliance style teach us
about the interpretation of the instrumental-variables estimation of a
LATE? They have no consequence if you are only interested in estimating
and interpreting the causal effect of intent to treat (that is, the impact of
the offer of a PACES scholarship). However, they are relevant if you want
to know the impact on educational attainment of actually making use of
financial aid to pay private-school fees. The first thing to keep firmly in
mind is that, in any quasi-experiment, membership in these compliance
classes is hidden from view. All we know for sure is what we can observe.
In the case of the Bogotá study, this is whether the participant was offered
a PACES scholarship, and whether that participant then made use of
financial aid from any source. Notice that this information is not suffi-
cient to enable us to distinguish the compliance style of the family. Among
families that were assigned to the offer of a PACES scholarship, both
those that were compliers and those that were always-takers actually make
9. Gennetian et al. (2005) also describe a fourth potential group of population members,
whom they call defiers. These are participants whose experimental assignment induces
them to do exactly the opposite of what the investigator intends—they are contrarians.
Assigning them a PACES scholarship induces them not to use any scholarship; denying
them a government scholarship induces them to find a scholarship from another
source for their child. Such behavior is usually not anticipated in most experiments
because it implies that these participants are consistently contrary—that they will always
do the opposite of what the investigator asks them to do. To classify families as defiers,
we must be convinced that they simply choose to do the opposite of their intent-to-treat
assignment. Although logic demands the existence of this fourth class, in practice we
believe that it is often an empty set, and so we have eliminated it from our argument
here. This makes our presentation consistent with the framework presented by Angrist,
Imbens, and Rubin (1996). These authors describe the “no defiers” assumption as the
“monotonicity” assumption.
use of financial aid to pay fees at a private secondary school. So, we cannot
distinguish between these two groups by observing their actions. Yet, the
two groups may differ in unobserved ways—the latter group being pre-
pared to use whatever effort is necessary to find financial aid from an
alternative source if they lose out in the lottery for the government schol-
arships. Similarly, among families that were not assigned a government
scholarship, neither the compliers nor the never-takers make use of finan-
cial aid to pay private secondary-school fees, and these groups cannot be
distinguished on the basis of their overt actions. Yet again, these two
groups may differ. The first are families that would like to make use of
financial aid, but, after losing out in the PACES lottery, are unable to find
aid from another source. The second are families that have decided not
to make use of financial aid to pay private secondary-school fees under
any circumstances.
You can imagine how problematic such differences in compliance style
can be, if you are interested in the unbiased estimation of the causal
impact of use of financial aid on students’ educational attainment. If you
were to form two contrasting groups naïvely, those children whose par-
ents made use of financial aid and those children whose parents did not
use financial aid, and compare the children’s subsequent average educa-
tional attainment, you would be on shaky ground in claiming that any
difference detected was due solely to the causal impact of financial aid.
Why? Because both the group that made use of financial aid, and the
group that did not, contain a self-selected and unknown mixture of par-
ticipants of different compliance styles, each differing in unobserved
ways. Among members of the group that made use of financial aid, some
families (the compliers) would have done so because they were complying
with their favorable outcome in the lottery. Others (the always-takers)
would have done so because they searched for and found financial aid
from another source after losing out in the lottery. Similarly, some mem-
bers of the group that did not make use of financial aid (the compliers) let
the unfavorable outcome of the lottery dictate their behavior. Others (the
never-takers) would not have made use of financial aid under any circum-
stances, even if they had been offered it. Consequently, comparisons of
average educational attainments of the group that made use of financial
aid and the group that did not would be polluted potentially by the unseen
personal choices of families with different motivations (and perhaps dif-
ferences in their ability to support their children’s efforts to succeed in
secondary school).
In this context, it is interesting to ask: Exactly what comparison is being
estimated by the execution of the IVE using the PACES data that we
described earlier in this chapter? Or, more to the point perhaps, to whom does the resulting estimate apply? The answer is that the IV estimate recovers the average causal effect of using financial aid only for the compliers—it is a local average treatment effect.10
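Stated in the potential-outcomes notation of Rubin's causal model (our restatement, not the authors' own), this result says that, under the monotonicity ("no defiers") and "no third path" assumptions, the Wald/IV estimand equals the average treatment effect among compliers:

```latex
\frac{E[\,Y_i \mid I_i = 1\,]-E[\,Y_i \mid I_i = 0\,]}
     {E[\,X_i \mid I_i = 1\,]-E[\,X_i \mid I_i = 0\,]}
  \;=\; E\big[\,Y_i(1)-Y_i(0)\;\big|\; i\ \text{is a complier}\,\big].
```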
What strategies are effective in increasing the skills of students who lag
behind their classmates? One policy that many schools have tried is to
provide extra instruction to those who need it, either after the regular
school day is finished or during the school vacation period. Another
common, and quite controversial, policy is to mandate that students who
do not meet achievement benchmarks be retained in the same grade for
another school year. Recall from Chapter 8 that the Chicago Public
Schools (CPS) introduced a policy, in 1996, that included elements of
both of these remediation strategies. First, the district examined the
results of the standardized reading and mathematics tests that students in
10. Angrist, Imbens, and Rubin (1996) prove this statement. A critical assumption under-
lying the interpretation of the IV estimator as the treatment effect for compliers is
that the treatment effect for always-takers must not be influenced by the outcome of
the random assignment—that is, by whether always-takers were assigned to the treat-
ment or to the control group. The same applies to never-takers.
grade 3 took at the end of the school year.11 It then mandated that all
third-grade students whose scores fell below a cut-off score of 2.8 grade
equivalents on the reading or mathematics test had to attend a six-week
summer school that focused on building skills in these subjects. The
summer school attendees then retook the achievement tests at the end of
the summer instructional period. The policy specified that those students
who then met the 2.8 grade equivalents benchmarks were promoted to
fourth grade, and those who did not meet these benchmarks were retained
in the third grade for another year. All students were tested in reading a
year later.
The CPS policy was based on a sensible theory of action. The notion
was that the policy would provide a significant amount of extra instruc-
tion in core subjects for lagging students. Summer school classes were
small, typically with fewer than 15 students. For the summer session, prin-
cipals hand-picked teachers who they thought would be effective in
teaching those students in need of remediation. All teachers were told to
follow a highly structured curriculum designed to emphasize mastery of
basic skills. The students had incentives to pay attention because their
promotion to the following grade was contingent on their achieving
scores of at least 2.8 grade equivalents on the end-of-summer reading and
mathematics achievement tests.
One of the first steps that researchers Brian Jacob and Lars Lefgren
(2004) undertook to evaluate whether it would be possible to conduct a
strong evaluation of the consequences of the CPS policy was to examine
whether the assignment rules specified in the policy had actually been
followed. They did this by estimating—on the end-of-school-year reading
examination that was used to determine which children would be assigned
to attend summer school—the percentage of students obtaining each
possible grade equivalent score who actually attended summer school.
They then plotted this percentage versus the grade-equivalent reading
score (centered to have a value of zero at the cut-off score of 2.8 grade
equivalents).12 In Figure 11.2, which is a reproduction of Figure 2 from
Jacob and Lefgren’s paper, we present the resulting graph. It shows that
the rules for assigning participants were obeyed fairly well, but not per-
fectly. About 90% of the students who scored below the exogenously
determined cut-off score of 2.8 grade equivalents attended the mandatory
11. As we discuss in Chapter 13, the policy pertained to students in grade 6 as well as
to those in grade 3. To simplify the description of the policy, we focus here on the
students in grade 3.
12. Students were much more likely to fail the reading than the mathematics achievement
test. This led the investigators to focus their analysis on the former.
[Figure 11.2 appears here: fraction treated (vertical axis, 0 to 0.8) plotted against reading grade equivalent relative to the cut-off (horizontal axis, −1.5 to 1.5).]
Figure 11.2 The relationship between the June reading scores (centered at the cut-off)
for Chicago Public-School (CPS) students in grades 3 and 6 and the sample probability
of attending summer school. Reproduced with permission from Figure 2, Jacob and
Lefgren, 2004, p. 230.
summer school as the policy specified they should, and only a very small
percentage of students scoring above the cut-off score did so. This good,
but less than perfect, take-up of the summer program by participants
meant that the cut-off score of 2.8 grade equivalents provided what meth-
odologists refer to as a fuzzy discontinuity, dividing students into treatment
and control groups well, but not perfectly, as a sharp discontinuity would
have done.
If compliance with the policy mandate had been perfect, Jacob and
Lefgren could have obtained an unbiased estimate of the impact of
summer school attendance on reading scores one year later by fitting the
regression discontinuity (RD) regression model specified in Equation 11.2
using OLS:13
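Equation 11.2 itself is not reproduced in this excerpt. Judging from the variable definitions that follow, the simplified specification would take roughly the following form (our sketch, with our own coefficient labels, not necessarily the authors' exact equation):

\[
READ_{i,t+1} = \beta_0 + \beta_1\, READ_{i,t} + \beta_2\, SUMMER_i + \mathbf{X}_i'\boldsymbol{\gamma} + \varepsilon_i
\]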
13. We have simplified the authors' model specification for pedagogic purposes.
where READi,t is the student's reading score, measured at the end of third grade and used as the forcing variable in the RD design; outcome READi,t+1 is the student's score on the standardized reading test
taken one year later; X is a vector of exogenous time-invariant student
characteristics; and residual εi is the error term.14 The key point to under-
stand is that if compliance with the assignment policy had been perfect,
the dichotomous variable SUMMERi would have had the same value for
every student as the exogenous variable BELOWi , which we define as
taking on a value of 1 for every student whose score on the end-of-third-
grade reading examination READi,t was less than 2.8 (0, otherwise). Had
compliance with the CPS remediation policy been perfect, the estimate of
β2 would have provided an unbiased estimate of the impact of summer
school attendance on the subsequent reading scores of children whose
initial reading score READi,t was very close to the cut-off score of 2.8 grade
equivalents.
Since compliance with the policy mandate was not perfect, Jacob and
Lefgren realized that simply fitting Equation 11.2 by OLS methods would
result in a biased estimate of the causal impact of summer school atten-
dance. The values of question predictor SUMMER were not assigned
entirely exogenously. Although most students had complied with their
assignment, not all had. Some students with reading scores above the cut-
off actually participated in summer school, and some students with scores
below the cut-off did not. It is likely that students who did not comply
with their assignment had unobserved abilities and motivation that not
only resulted in their crossing over, but also influenced their reading-
achievement scores a year later. Thus, the endogeneity of the actual
assignment of students to the treatment group (SUMMER = 1) meant that
fitting Equation 11.2 by OLS regression methods would result in a biased
estimate of the program effect.
Fortunately, we know that the solution to this problem is simple now
that we understand that an original exogenous assignment to an experimental
condition can serve as a credible instrument for the potentially endoge-
nous take-up of that treatment. We simply combine our interest in fitting
the statistical model in Equation 11.2 with what we have learned about
the application of IVE earlier in this chapter. We see from Figure 11.2 that
exogenous assignment to experimental condition was indeed a strong
predictor of the actual take-up of the program—most of the students did what
14. In formal analyses, we would need to specify an error covariance structure that
accounted for the clustering of students within schools.
they were told.15 Thus, the potential instrument predicts the endogenous
predictor strongly, as required. Second, it was unlikely that exogenous
assignment to the program would affect reading achievement a year later
for those immediately on either side of the cut-off, except through the
provision of the summer program. Thus, there was no “third path.” So,
Jacob and Lefgren treated Equation 11.2 as the second stage of a two-
stage model in which the first-stage model predicted enrollment in the
SUMMER program by the exogenous RD assignment, as follows:
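The first-stage equation is likewise not reproduced in this excerpt. Consistent with the description just given, it would take roughly the following form (again our sketch, with our own coefficient labels), in which the exogenous indicator BELOWi serves as the excluded instrument:

\[
SUMMER_i = \alpha_0 + \alpha_1\, READ_{i,t} + \alpha_2\, BELOW_i + \mathbf{X}_i'\boldsymbol{\delta} + \nu_i
\]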
15. Jacob and Lefgren (2004) explain that Figure 2 in their paper (which is reproduced as
Figure 11.2 here) is based on data pertaining to CPS students in grade 6 as well as
those in grade 3. In a private communication, Brian Jacob told us that the relationship
displayed in Figure 11.2 is virtually identical for students in the two grades.
16. Again, in formal analyses, we would specify an error covariance structure that
accounted for the clustering of students within schools.
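To make the two-stage logic concrete, here is a minimal sketch in Python of how such a fuzzy-RD estimate might be computed by explicit two-stage least squares. The data-frame column names (read_t, read_t1, summer, below) are hypothetical, additional exogenous covariates are omitted, and the naïve second-stage standard errors would need correction; in practice, a dedicated IV routine would be used.

```python
import numpy as np
import pandas as pd

def fuzzy_rd_2sls(df: pd.DataFrame) -> float:
    """Sketch: instrument actual summer-school attendance (summer) with the
    exogenous assignment indicator (below), controlling for the forcing
    variable (read_t, centered at the cut-off)."""
    n = len(df)
    ones = np.ones(n)

    # First stage: predicted take-up from the instrument and the forcing variable.
    Z = np.column_stack([ones, df["below"], df["read_t"]])
    gamma, *_ = np.linalg.lstsq(Z, df["summer"].to_numpy(), rcond=None)
    summer_hat = Z @ gamma

    # Second stage: outcome on predicted take-up and the forcing variable.
    X = np.column_stack([ones, summer_hat, df["read_t"]])
    beta, *_ = np.linalg.lstsq(X, df["read_t1"].to_numpy(), rcond=None)
    return beta[1]  # coefficient on predicted summer-school attendance

# Hypothetical usage:
# df = pd.read_csv("third_grade_sample.csv")
# print(fuzzy_rd_2sls(df))
```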
schools were equally effective. In other words, the choices that families
made in selecting Catholic school for their children may have deceived
the researchers into overestimating the impact of the Catholic school
“treatment.” Methodologists refer to this as the selection-bias problem.
As we have noted throughout our book, you face this problem when you
evaluate any program in which participants or their advocates can choose
the treatment conditions they will experience.
Coleman and his colleagues recognized that they faced the selection-
bias problem, and responded by using a fix-up strategy that was
conventional at that time. In their work, they had used multiple regres-
sion analysis to model the relationship between students’ ultimate
academic achievement and a dichotomous question predictor that distin-
guished between the Catholic- and public-school treatments. To this basic
model, they added carefully selected covariates—representing the parents’
socioeconomic status and other background characteristics—in an attempt
to control away important preexisting differences between the Catholic
and public high-school students and their families. Critics argued that
this strategy was inadequate because one could never be sure that the full
spectrum of underlying unobserved differences has been taken into
account fully by the particular control predictors included, no matter how
well chosen.
Of course, if Coleman and his colleagues had possessed a suitable
instrument—an exogenous variable that predicted entry into Catholic
school, and through it student achievement, all without the presence of
an offending “third path”—they would have been able to solve their selection-
bias problem. As explained in Chapters 10 and 11, they could have used
instrumental-variables estimation (IVE) to obtain an asymptotically unbi-
ased estimate of the size of the “Catholic-school advantage.”2 Our point
is that, if you have a viable instrument, you simply do not need the
methods that we are about to describe in this chapter. Without a viable
instrument, though, all you can do is include covariates in your analysis
in order to try to control for any extant differences among children
who attended public or Catholic school, as Coleman attempted. But, no
matter which covariates you decide to include, it will always be dangerous
to draw causal conclusions from analyses of observational data. The
reason is that you can only eliminate the bias due to the observables you
2. Some analysts have suggested that a family’s religious affiliation (Evans & Schwab, 1995)
and the distance between a family’s residence and the nearest Catholic school (Neal,
1997) could serve as instrumental variables. However, Altonji and his colleagues (2005b)
present evidence that these variables are not legitimate instruments in existing datasets.
3. You might argue that you could keep piling additional covariates into the model in
order to “whittle down” any bias present in the estimated Catholic-school advantage.
However, there is a problem with this strategy—apart from the fact that adding each
covariate costs an additional degree of freedom and leads to an accumulation of Type I
error. As you add each new covariate, it can only predict that part of the variation in the
outcome that has not yet already been predicted by all the other covariates. This means that
the outcome variation available nominally for subsequent prediction by new covariates
declines as the process continues. If covariates are intercorrelated (as they usually are!),
and if they are also correlated with the endogenous question predictor, CATHOLIC in our
example, then you may face a burgeoning problem of multicollinearity as you proceed.
In this case, your estimation may become increasingly sensitive to the presence of aber-
rant data points in the point cloud, leading your estimates to become increasingly
erratic.
4. Although we do not make use of data beyond the 1992 survey, follow-up surveys of
NELS-88 participants were also conducted in 1994 and 2000.
5. For pedagogic reasons, we have also limited our sample to students with no missing
data on any variable included in our analyses. Thus, our sample size is smaller than that
of the original NELS-88 sample and has not maintained its population generalizability.
Consequently, we do not account for the complex survey sampling design in our analy-
ses. However, our results do not differ dramatically from full-sample analyses that do
take the design into account.
6. Throughout this chapter, we have used one-sided tests in examining the hypothesis
that Catholic schools are more effective than public schools in enhancing the mathe-
matics achievement of students. We recognize that one could make the case for
two-sided hypothesis tests.
7. The standard deviation of MATH12 is 9.502 in our subsample.
8. Annual family income was coded as follows (in 1988 dollars): (1) no income, (2) less
than $1,000, (3) $1,000–$2,999, (4) $3,000–$4,999, (5) $5,000–$7,499, (6) $7,500–
$9,999, (7) $10,000–$14,999, (8) $15,000–$19,999, (9) $20,000–$24,999, (10)
$25,000–$34,999, (11) $35,000–$49,999, (12) $50,000–$74,999, (13) $75,000–$99,999,
(14) $100,000–$199,999, (15) more than $200,000.
9. Test for equality of medians, between groups: continuity-corrected χ2(df = 1) = 104.7
(p <0.001).
Table 12.1 Descriptive statistics on annual family income, by stratum, overall and by type of high school attended, and average twelfth-grade mathematics achievement by income stratum and by high-school type (n = 5,671)

Label     Income Range         Sample     Sample Mean        Frequency                       Mean Twelfth-Grade Math
                               Variance   Publ.    Cath.     Publ.    Cath. (% of stratum)   Publ.    Cath.    Diff.
Hi_Inc    $35,000 to $74,999   0.24       11.38    11.42     1,969    344 (14.87%)           53.60    55.72    2.12 ∗∗∗,†
Med_Inc   $20,000 to $34,999   0.22       9.65     9.73      1,745    177 (9.21%)            50.34    53.86    3.52 ∗∗∗,†
Lo_Inc    ≤$19,999             3.06       6.33     6.77      1,365    71 (4.94%)             46.77    50.54    3.76 ∗∗∗,†
Weighted Average ATE                                                                                           3.01
Weighted Average ATT                                                                                           2.74

∼p <0.10; ∗p <0.05; ∗∗p <0.01; ∗∗∗p <0.001
†One-sided test.
10. There is no imperative to choose three groups. In fact, there is technical evidence,
which we present later, that it is most effective to create at least five strata.
11. When you iterate to a final set of strata, you may conduct many such hypothesis tests,
leading to an accumulation of Type I error. To avoid this, you can invoke a Bonferroni
correction to the α-level, in each test. In our work, for instance, we conducted each of
our balancing tests at the 0.01 level.
12. Our adoption of three strata in Table 12.1 results from an iterative “divisive” approach.
We started by pooling all students into a single income stratum. When a within-
stratum t-test suggested that we had not achieved balance on the sample means, by
CATHOLIC, we split the stratum into successively narrower strata, until balance on
the means was achieved in all strata. This process led to the three strata we present
here. Software designed to help you stratify typically uses this approach.
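For readers who want to see what the balancing tests described in notes 11 and 12 might look like in practice, here is a minimal sketch in Python. The data frame and column names (faminc8, catholic, stratum) are hypothetical; the sketch simply compares mean base-year family income between Catholic- and public-school students within each stratum, at the 0.01 level used in our own work.

```python
import pandas as pd
from scipy import stats

def check_balance(df, covariate="faminc8", treatment="catholic",
                  stratum="stratum", alpha=0.01):
    """Within-stratum two-sample t-tests of covariate balance, by treatment."""
    results = {}
    for s, block in df.groupby(stratum):
        treated = block.loc[block[treatment] == 1, covariate]
        control = block.loc[block[treatment] == 0, covariate]
        t_stat, p_value = stats.ttest_ind(treated, control)
        results[s] = {"t": t_stat, "p": p_value, "balanced": p_value > alpha}
    return pd.DataFrame(results).T

# Strata that fail the check would be split into narrower strata,
# as described in note 12, and the test repeated.
```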
public and Catholic, we obtain the estimated average effect of the Catholic
treatment, or ATE. Its value is 3.01 (lower right corner, Table 12.1), almost
a full point lower than the earlier biased estimate.13 A reason that this
estimate is so much lower than the naïve biased estimate is that the Hi_Inc
stratum—which contributes the smallest within-stratum estimate—contains
the greatest number of children, and so contributes most heavily to the
final estimate. An alternative is to weight within-stratum estimates of the
Catholic-school advantage by the number of treated (Catholic) students
in each cell. Doing so provides an estimate of 2.74 (lower-right corner of
Table 12.1).14 This is called an estimate of the average impact of the Catholic
treatment on the treated, or ATT. It is our best estimate of the difference in
average twelfth-grade mathematics achievement between students who
experienced the Catholic treatment and what their average would have
been if they had not been treated.
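To make the weighting explicit, the following short sketch in Python reproduces both averages from the within-stratum differences and frequencies reported in Table 12.1 and in notes 13 and 14 (the stratum totals are simply the public and Catholic counts combined).

```python
import numpy as np

# Within-stratum estimates of the Catholic-school advantage (Table 12.1)
diffs      = np.array([2.12, 3.52, 3.76])   # Hi_Inc, Med_Inc, Lo_Inc
n_total    = np.array([2313, 1922, 1436])   # public + Catholic, per stratum
n_catholic = np.array([344, 177, 71])       # treated (Catholic), per stratum

ate = np.average(diffs, weights=n_total)      # weighted by stratum totals: 3.01
att = np.average(diffs, weights=n_catholic)   # weighted by treated counts: 2.74
print(round(ate, 2), round(att, 2))
```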
To provide additional insight into how stratification functions to
eliminate observed bias, we have used these stratum means and mean dif-
ferences to simulate the OLS-fitted relationships between outcome
MATH12 and question predictor CATHOLIC within each of our three
family income strata, in Figure 12.1. The fitted within-stratum trends are
presented as three solid lines, labeled on the right. The dashed line repre-
sents the original naive OLS-fitted trend line, corresponding to our initial
biased estimate of the Catholic-school advantage, obtained in the full
sample. Notice, first, that the between-stratum differences we have com-
mented on—which are evident in the last column of the table—are also
clear in the plot. All three of the fitted within-stratum trends have slopes
that are less steep than the naïve and biased full-sample slope estimate,
with the slope of the fitted trend line obtained in the Hi_Inc stratum the
least steep.15
13. The weighted average is {(2,313 × 2.12) + (1,922 × 3.52) + (1,436 × 3.76)}/5,671. Its
associated standard error can be obtained by pooling within-stratum standard devia-
tions, or by applying a resampling method such as bootstrapping. Similar computations
can be made for each of the overall bias-corrected estimates of the Catholic-school
advantage that we have estimated in this chapter.
14. The weighted average is {(344 × 2.12) + (177 × 3.52) + (71 × 3.76)}/592.
15. Unfortunately, there is some evidence in this plot of a potential interaction between
CATHOLIC and FAMINC8, in that the slopes of the three line segments appear to
diminish systematically at higher baseline annual family income. It was this heteroge-
neity in the impact of the Catholic schools that we were seeking to avoid by limiting
our sample to children in “non-wealthy” families. Although we have not succeeded
completely in this mission, we ignore the potential interaction in what follows and
retain our focus on the main effect of Catholic versus public high school, in order to
keep the exposition as simple as possible. This means, in essence, that we have aver-
aged the heterogeneous treatment effects across family-income groups.
[Figure 12.1 appears here: twelfth-grade mathematics achievement (vertical axis, roughly 45 to 55) by type of high school attended (public, Catholic), with separate fitted lines for the FAMINC8 "Hi," "Med," and "Lo" strata.]
These plots, along with the entries in Table 12.1, provide insight
into why the intrusion of base-year family income into the MATH12/
CATHOLIC relationship led the original naïve estimate to have a
positive—“upward”—bias. In building your intuition, begin from the per-
spective of the three separate within-stratum relationships displayed in
Figure 12.1 and try to resurrect mentally the full-sample relationship by
combining the associated (and undisplayed) point clouds.16 First, notice
that the three within-stratum trend lines are ordered by the base-year
16. The point clouds surrounding these trend lines are not the familiar ellipses, because
the principal predictor, CATHOLIC, is dichotomous.
Stratifying on Covariates
Now, suppose our theory suggests that, in addition to the impact of differ-
ences in base-year annual family income, parents of children who display
above-average prior academic skills are especially likely to enroll their
child in a Catholic high school. If this is the case, then the children who
enter Catholic high schools would not be equivalent in terms of prior
academic ability to those children who enroll in public high schools. This
too would create bias in the estimate of the Catholic-school advantage.
In fact, there is evidence in the NELS-88 dataset that this is the case:
Children who entered Catholic high schools had a base-year average
mathematics achievement (53.66) that was about 2 points higher, on aver-
age, than that of children who entered public high schools (51.24), and
the difference is statistically significant (t = 5.78; p <0.001).
Thus, it makes sense to now remove observed bias that is attributable to
both base-year annual family income and prior mathematics achievement
simultaneously from our naïve estimate of the Catholic-school advantage.
We can generalize the stratification approach easily to accommodate this,
but it stretches the capabilities of the technique. For instance, in the
NELS-88 dataset, student base-year mathematics achievement was mea-
sured by a standardized test, and information on this score is coded in
our covariate, MATH8. In our sample, MATH8 ranges from about 34 to
77, with a mean of 51.5. Following our earlier strategy of creating strata
within which children were relatively homogeneous on the new covariate,
we have created four prior achievement strata:
• Hi_Ach: High achievement stratum—scores of 51 or more.
• MHi_Ach: Medium-high achievement stratum—scores from 44 to 51.
• MLo_Ach: Medium-low achievement stratum—scores from 38 to 44.
• Lo_Ach: Low achievement stratum—scores below 38.
Imposing this stratification again limits heterogeneity in prior mathemat-
ics achievement within each stratum, and we can again attain balance on
the means of base-year mathematics achievement, by CATHOLIC, within
each of the new strata.
Now, we have the option to go through the same exercise within these
four base-year mathematics achievement strata as we did with our three
original family-income strata, inspecting the distribution of children’s
grade 12 mathematics scores and providing respective estimates of the
Catholic-school advantage. In fact, we have performed these analyses,
and their consequences are as expected. However, our purpose here is
greater than this. First, we want to illustrate how to use stratification to
correct for bias due to both observed covariates—FAMINC8 and MATH8—
simultaneously. Second, we want to point out the problems that surface
when multiple covariates are incorporated into the stratification process.
Consequently, rather than stratifying by the child’s base-year mathemat-
ics achievement alone, we have “crossed” the three base-year annual
family-income strata with the four prior mathematics-ability strata to pro-
duce a cross-tabulation that contains 12 cells. Then, within each of these
cells, we have estimated the average twelfth-grade mathematics achieve-
ment of children in the public schools and those in Catholic high schools
and subtracted one from the other to obtain 12 estimates of the Catholic-
school advantage. We list these estimates, along with their corresponding
cell frequencies, in Table 12.2.
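A sketch in Python of how such a two-way stratification might be assembled is shown below. The data frame and column names (inc_stratum, ach_stratum, catholic coded 0/1, math12) are hypothetical.

```python
import pandas as pd

def within_cell_effects(df: pd.DataFrame):
    """Cross two stratifications and estimate the Catholic-public difference
    in mean twelfth-grade mathematics achievement within each cell."""
    cell_means = (df.groupby(["inc_stratum", "ach_stratum", "catholic"])["math12"]
                    .agg(["mean", "count"])
                    .unstack("catholic"))
    diffs = cell_means[("mean", 1)] - cell_means[("mean", 0)]
    counts = cell_means["count"]
    return diffs, counts

# Cells containing no Catholic (or no public) students yield missing
# differences, which is the sparseness problem discussed in the text.
```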
Notice that there has been a dramatic decline in the frequencies of
students who participated in each of the separate Catholic/public com-
parisons, a result of spreading the original sample across many more cells.
This problem of data sparseness has become particularly acute for chil-
dren who attended Catholic high schools, a modest-sized group to begin
with. For example, in strata where the Med_Inc and Lo_Inc base-year
annual family-income groups are crossed with the Lo_Ach prior mathe-
matics-achievement group (the eighth and twelfth strata from the top,
in Table 12.2), only a total of three students are in Catholic high schools.
The numbers of public high-school children in these cells are also smaller
(96 and 142, respectively) than in other cells in the stratification.
Comparisons in these sparse cells lack statistical power and precision.
This problem of sparseness is a standard concern in the application of
stratification methods, even in large datasets. As we try to correct for bias
along more and more dimensions, we find ourselves with cells that con-
tain fewer and fewer observations. This results in erratic within-stratum
estimates of the size of the treatment effect. This is illustrated by the
estimates of the size of the Catholic-school advantage listed in the
final column of Table 12.2. Some are very small, such as the estimate of
0.47 in the ninth (Med_Inc × Hi_Ach) stratum. Others are very large, such
as the estimate of 5.76 in the twelfth (Lo_Inc × Lo_Ach) stratum. Notice
that the outlying estimates tend to occur in cells containing very few
Catholic high-school students. This is a generic problem with the method
within-cell estimates have been weighted by the total cell sample size within
cell—is 1.50 (listed in the lower right corner of Table 12.2).17 Clearly, we
have again reduced the observed bias in our naïve estimate of the Catholic-
school advantage (3.89) dramatically from the intermediate estimate that
we obtained by stratifying on base-year annual family income alone
(3.01).
This progress toward successively smaller estimates of the Catholic-
school advantage as we bias-correct for additional covariates prompts us
to wonder whether, by incorporating further well-chosen covariates, we
could reduce the apparent Catholic-school advantage to zero. Of course,
it would be difficult to continue to add covariates into our stratification
without exacerbating the technical problems that we just described.
As we add covariates into the stratification design, the number of cells in
the cross-tabulation increases multiplicatively and cell frequencies plum-
met. We have to deal with increasing data sparseness, diminishing
statistical power, and poor precision for estimates of the Catholic-school
advantage within the cells of the stratification, and the increasing scatter
of the within-cell estimates. Clearly, there are practical limits to the appli-
cation of the stratification approach in bias correction!
The most serious consequence of increasing the complexity of the
stratification design by adding further covariates is that it leads eventually
to an extreme form of data sparseness in which there may be either no
Catholic high-school students or no public high-school students present
in particular cells of the cross-tabulation. Then, we can no longer estimate
the critical bias-corrected Catholic-school advantage in these cells.
Methodologists use the expression a lack of common support to describe
such regions in the space spanned by the covariates. In these regions,
estimates of treatment effects cannot be made because the regions do not
contain members of both the treatment and control groups that share the
same values of the covariates.
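In practice, locating the cells that do offer common support is simple bookkeeping. A minimal sketch in Python, with hypothetical column names (inc_stratum, ach_stratum, and a 0/1 catholic indicator), might look like this:

```python
import pandas as pd

def drop_unsupported_cells(df: pd.DataFrame,
                           cell_vars=("inc_stratum", "ach_stratum")) -> pd.DataFrame:
    """Keep only covariate cells that contain both Catholic and public students."""
    keys = list(cell_vars)
    # A cell offers common support only if both values of the treatment
    # indicator appear within it.
    has_both = df.groupby(keys)["catholic"].transform("nunique") == 2
    return df[has_both]
```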
We could, of course, proceed by eliminating the offending cells
from contention. In some sense, this is not a problem because the public-
school students who would be eliminated from the sample had no
Catholic-school counterparts whom they matched on the values of the key
covariates. We would be left comparing only those Catholic and public
high-school students who are matched on the covariates. Perhaps
this enhances the legitimacy of the obtained comparisons, despite the
17. Weighting by the within-cell frequencies of Catholic high-school students, the average
effect of the treatment on the treated is 1.31 (lower right-hand cell in Table 12.2).
18. In experimental data, with decent sample sizes, there tends to be no equivalent problem
because random assignment to experimental conditions ensures that the distribution
of participants over all levels of every covariate is similar in both the treatment and
control groups.
19. In fact, Catholic high schools serve a substantial number of children from relatively
low-income families who exhibit quite modest academic skills as eighth-graders.
Consequently, the problem of lack of common support is not an especially serious
problem in comparing the achievement of Catholic high-school students and public
high-school students using quite large datasets such as NELS-88. However, lack of
common support is a much greater problem in comparing the achievement of stu-
dents who attend non-Catholic private high schools and the achievement of students
who attend public schools, using data from NELS-88 and other similar datasets.
Given our theoretical position that family income and prior achievement
are both elements of a selection process that drove better-prepared chil-
dren from relatively high-income families disproportionately into Catholic
high schools, it is natural to ask why we need to employ stratification to
correct for the bias due to these observed covariates. Indeed, including
measures of these two constructs directly as covariates in a multiple regres-
sion model that will address our research question seems a sensible way
to proceed, although the approach is laden with assumptions that are
often overlooked.
In most respects, the incorporation of base-year annual family income
and prior achievement as covariates in the regression of MATH12 on
CATHOLIC is conceptually identical—as an approach to bias correction
on observables—to the earlier stratification approach. Even though it may
not appear so on the surface, direct control for covariates by regression
analysis implicitly forces the Catholic-school advantage to be estimated
simultaneously in “slices” of the dataset that are defined by the values
of the covariates. In addition, by including only the main effect of ques-
tion predictor CATHOLIC in the model, we force the estimated
Catholic-school advantage to be identical across all these slices; that is, we
force the MATH12 versus CATHOLIC trend lines to be parallel within
each cell defined by the covariates. Thus, in a very real sense, the overall
covariate-adjusted regression estimate is an implicit average of all of the
20. Technical details of the averaging process differ implicitly between the stratification
and regression approaches. In our example, under stratification, we averaged the
multiple within-cell estimates of the Catholic-school advantage by hand, weighting by
some version of the cell frequencies. If we weight by the total frequency of students in
each cell, we obtain an estimate of the overall ATE. If we weight by the frequency of
treated (Catholic-school) participants in each cell, we obtain an estimate of the overall
average effect of the treatment on the treated. In the regression approach, however, this
choice is taken out of our hands by an averaging process built implicitly into the OLS
estimation of the main effect of CATHOLIC. This latter averaging incorporates built-
in weights that depend essentially on the precisions of the within-cell estimates. Thus,
an OLS estimate of the Catholic-school advantage is a special kind of weighted ATE.
Table 12.3 Parameter estimates and approximate p-values from the OLS regression of
twelfth-grade mathematics achievement on attendance at a Catholic versus public high
school, controlling for base-year annual family-income (Model A: FAMINC8; Model B:
INC8) and mathematics achievement (n = 5,671)
21. These intercept estimates are not exactly identical to the sample estimates of cell
means obtained in the stratification analyses. For instance, in the lowest stratum of
both income and prior mathematics achievement, the cell average twelfth-grade
mathematics achievement was 36.81 in the stratification analysis (Table 12.2, Row 12),
not 36.84, as in the regression analysis. The reason that these estimates are not identical,
and neither are other respective pairs, is because we estimated cell means separately
during the stratification analyses, and each cell was free to take on its own mean and
standard deviation, determined only by its own data, independent of data in all other
cells. This is not the case under the regression specification, where we have analyzed
all data simultaneously across 12 cells, under the assumption that the population
Catholic- versus public-school difference is identical in each cell and that population
residual variance is homoscedastic. These constraints, while tenable, have wrought
minor changes in the estimated high-school means in each cell. So, the new quantities
remain estimates of the average mathematics score in twelfth grade of the public high-
school students in each cell, but assume that the Catholic-school advantage is identical
in each cell and that residual variance is homoscedastic in the population. We benefit
by this pooling of data across cells—in terms of power and precision—at the cost of
relying on additional assumptions.
22. The regression approach estimates the ATE, not the ATT.
23. Again, the two estimates differ because of constraints imposed by the additional
assumption on Model A’s functional form.
24. We could test this assumption by including interactions between question predictor
CATHOLIC and the fixed effects of the 12-cell FAMINC8 by MATH8 stratification.
Follow-up analyses showed that none of the additional interactions made a statistically
significant contribution to model fit, beyond Model A.
25. The converted values are: (1) $0K, (2) $0.5K, (3) $2K, (4) $4K, (5) $6.25K, (6) $8.75K,
(7) $12.5K, (8) $17.5K, (9) $22.5K, (10) $30K, (11) $42.5K, (12) $62.5K, in 1988 dollars,
where K = 1,000.
values in our sample ranging from 1 through 12. The steps between these
original scale points were not equally spaced in monetary terms and,
when we created our three original strata to limit observed variability in
this covariate, we collapsed together many of the original categories. So,
for instance, our Lo_Inc category included families with incomes that
ranged from $0 through $19,999. The median family incomes in each of
our three strata were not equally spaced, either. On the other hand, we
made no assumption that the effect of annual family income was linear
over its entire range. Thus, for children with a medium-low level of math-
ematics preparation (those in the MLo_Ach strata), the estimated
difference in elevation between the Med_Inc and Lo_Inc intercepts was
(42.03 – 40.96) or 1.07 (Table 12.3, Model A, rows 7 and 11). In contrast,
the difference in estimated elevation between the Hi_Inc and Med_Inc
intercepts is (42.76 – 42.03) or 0.73. The situation is different in Model B.
Not only have we replaced FAMINC8 by a variable that is measured in
actual dollar amounts, we have included only its linear effect in the model.
So, now, the effects of equal increments in INC8 on the outcome are held
implicitly to be equal by the linearity assumption. The same goes for
MATH8, now modeled in its original test metric, with the impact of equal
increments of test score on the outcome also held to be equal. Finally, the
same goes for the two-way interaction. It is now the interaction of the
linear effects of INC8 and MATH8. In the Model A specification, we may
have crudely collapsed categories of family income and test score, but we
did not mandate linearity!
Notice that Model B actually fits better than Model A—its R2 statistic is
almost 10 percentage points higher, and its residual variance about 13%
smaller. In addition, the estimated Catholic-school advantage is now
1.66, larger than the estimate obtained under Model A, but consistent
with the estimate obtained under the stratification approach (1.50). The
reason this improvement in fit has occurred is that: (a) the linearity
assumptions may make sense, given the data coding; (b) we have managed
to pick up on some of the additional variation in the covariates that was
sacrificed when we stratified them; and (c) Model B is dramatically more
parsimonious—we have estimated five parameters rather than 13.
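A sketch of the two specifications in Python, using the statsmodels formula interface, is given below. The data frame df and its lower-case column names are hypothetical, and the Model A formula is our reading of the description in the text (the question predictor plus the 12-cell stratification entered as fixed effects, 13 parameters in all); Model B contains the linear effects of income and prior achievement and their interaction (5 parameters in all).

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_models(df: pd.DataFrame):
    """Fit the two covariate-adjustment specifications discussed in the text."""
    # Model A: question predictor plus the 12-cell stratification as fixed effects.
    model_a = smf.ols("math12 ~ catholic + C(inc_stratum) * C(ach_stratum)",
                      data=df).fit()
    # Model B: linear effects of income and prior achievement and their interaction
    # (inc8 * math8 expands to inc8 + math8 + inc8:math8).
    model_b = smf.ols("math12 ~ catholic + inc8 * math8", data=df).fit()
    return model_a.params["catholic"], model_b.params["catholic"]
```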
Rather than designate one of these estimates of the Catholic-school
advantage as “correct,” our point is that differences in functional form
between Models A and B make a difference. So, when you decide to adopt
a direct control for covariates by regression modeling approach, you may
no longer need to make arbitrary decisions about how to collapse the
covariates and stratify, but you do have other, equally critical decisions to
make. This point will resurface when we describe the use of propensity
scores for bias correction, in the following section. Finally, it is worth
mentioning that this is not the end of the process of selection-bias correc-
tion for observed covariates. If Model B had not fitted so well, we would
have sought transformations of the covariates, perhaps polynomials, and
included multiple interactions among the differently transformed vari-
ables, hoping for a successful and parsimonious specification. Perhaps we
would not have found such a specification and, in the end, had to accept
the coarsening of the covariates in the stratification process and fallen
back on non-parsimonious Model A, with its 13 parameters, or even on
the stratification method itself.
On the other hand, our choice of linear effects has served us well in this
particular example. So, we could continue the process of bias adjustment
by including other carefully selected covariates into the regression model.
However, there is a better approach to controlling bias due to observed
covariates in the estimation of treatment effects from nonexperimental
data, and we turn now to this approach.
Table 12.4 Parameter estimates and approximate p-values for a pair of fitted logistic
regression models in which attendance at a public or a Catholic high school
(CATHOLIC) has been regressed on hypothesized selection predictors (INC8 and
MATH8) that describe the base-year annual family income and student mathematics
achievement (n = 5,671)
p <0.001). Both the main effects of the covariates (p <0.001) and their
two-way interaction (p <0.01) have statistically significant effects on the
type of high school attended. We learn that it is more probable that chil-
dren from higher-income homes and with greater eighth-grade
mathematics achievement will attend a Catholic high school, although
the effect of each predictor is moderated by the presence of the other. At
higher levels of base-year annual family income, the impact of prior aca-
demic preparation is lessened, and vice versa.
In Model B (the lower panel of Table 12.4), we have refined our selec-
tion model by including a quadratic transformation of base-year annual
family income. This leads to a statistically significant improvement in fit,
as indicated by the decline in the –2LL statistic of 8.1 points with the loss
of one degree of freedom (p <0.01). As a selection model, we favor fitted
Model B over Model A for reasons that we reveal below and that hark
back to the issues of sparseness and common support that we described
at the beginning of the chapter.
Notice that the parameter estimates listed in both panels were obtained
by maximum likelihood estimation. In that sense, they are then “best” esti-
mates of the population parameters, chosen to maximize the joint
probability of observing all the outcome data—that is, the entire collec-
tion of 0’s and 1’s that represent the public- or Catholic-school choices of
the sampled children, given the statistical model. Thus, in a real sense,
these estimates provide us with a mathematical vision of how the covari-
ate values can best be combined to discriminate between children who
attend the two types of high school.26 Their contributions, in this regard,
can then be consolidated into a single number, for each child, by estimat-
ing his or her fitted value of the outcome. Because our outcome—
CATHOLIC—is dichotomous, and its “upper” category represents the
Catholic high-school choice, the fitted values are simply the estimated
probabilities that each child will attend a Catholic high school. As noted
earlier, in this application, we refer to these fitted probabilities as esti-
mated propensities.27 Providing that we have chosen the covariates well and
26. Here, we draw a conceptual link between logistic regression analysis and discriminant
analysis, with the former being parametrically less stringent in that its covariates need
not be drawn from a multivariate normal distribution (as is the case with discriminant
analysis).
27. Because we have specified a logistic function for the selection model, the estimated
propensities are a nonlinear composite of covariate values and are optimal for dis-
criminating between children who choose the Catholic- versus public-school options,
given the adequacy of the logistic model. We could also have used a probit or a linear-
probability function. In the former, the propensities would again be a nonlinear
composite of covariate values; in the latter, a linear composite. However, the issue is
not whether the composite is linear or nonlinear, but that the estimated propensities
best discriminate among those who choose to go to Catholic versus public schools,
given the particular covariates and the model. Thus, which of the three possible sets
of propensities is optimal—the logit-based, probit-based, or linear-probability–
based—depends on which of the three functions is appropriate. In practice, the choice
between probit and logit functions makes little difference. But, the linear-probability
function may lead to fitted values that fall outside the permissible range [0,1].
Traditionally, in propensity score estimation, the logit function has been preferred
(Rosenbaum & Rubin, 1984).
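As an illustration of how such a logit selection model might be fit and the propensities extracted, here is a minimal sketch in Python, in the spirit of Model B of Table 12.4 (main effects, their two-way interaction, and a quadratic in base-year income). The data frame and column names are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

def estimate_propensities(df: pd.DataFrame) -> pd.Series:
    """Fit a logistic selection model and return estimated propensity scores."""
    selection_model = smf.logit(
        "catholic ~ inc8 * math8 + I(inc8 ** 2)", data=df).fit()
    # Fitted probabilities of attending a Catholic high school,
    # that is, the estimated propensities.
    return selection_model.predict(df)
```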
28. Rosenbaum and Rubin (1984) show that this statement is true under the assumption
of unconfoundedness. As explained earlier in the chapter, this critical assumption is that
the treatment assignment is independent of the outcomes, conditional on the covariates.
Or, in simpler terms, this means that we are assuming that, within each cell of a cross-
tabulation formed by the values of the covariates, assignment to treatment and control
conditions is random.
29. Notice that we have superimposed a kernel density plot in Panel A, to describe the
smoothed envelope of the histogram. We provide similar smooth envelopes on all
subsequent histograms displayed in figures in this chapter.
[Figure appears here: histograms of estimated propensity scores (horizontal axis, 0 to 0.2); Panel B shows the frequencies separately by CATHOLIC ("no" and "yes").]
within each stratum. This, in turn, would imply that we could estimate the
difference in average twelfth-grade mathematics achievement between
Catholic and public high-school children within each stratum and pool
the average differences across strata to obtain an overall estimate of the
Catholic-school advantage. This is exactly what we do.
Block  Propensity-Score     Block Frequencies   Avg. Estimated      Avg. Base-Year Annual   Avg. Base-Year Math       Avg. Math Achievement
       Range                Publ.     Cath.     Propensity Score    Family Income           Achievement (8th Grade)   (12th Grade)
                                                Publ.     Cath.     Publ.     Cath.         Publ.     Cath.           Publ.   Cath.   Diff.
1      p̂ < 0.05            810       31        0.036     0.040     8.47      9.81          43.16     44.68           42.74   45.35   2.61 ∗,†
2      0.05 ≤ p̂ < 0.075    741       45        0.062     0.064     18.14     17.53         47.45     49.46           47.16   50.22   3.07 ∗∗,†
3      0.075 ≤ p̂ < 0.1     928       100       0.088     0.089     26.64     26.57         48.80     49.63           48.79   49.63   0.84 ∗∗,†
4      0.1 ≤ p̂ < 0.125     786       87        0.114     0.114     33.35     33.36         52.62     52.91           52.02   54.26   2.24 ∗,†
5      0.125 ≤ p̂ < 0.15    810       145       0.136     0.137     40.73     41.47         55.15     54.79           54.72   56.54   1.82 ∗∗,†
6      0.15 ≤ p̂ < 0.2      1,004     184       0.163     0.163     57.34     58.37         58.55     57.86           56.95   57.32   0.36
31. The corresponding estimated effect of the Catholic-school treatment on the treated was
1.40, obtained by weighting within-block estimates by the number of Catholic high-
school students within block.
Table 12.6 Five propensity-score blocks, based on predicted values from final Model C (which contains selected covariates in addition to
those included in Model B). Within-block sample statistics include: (a) frequencies, (b) average propensity scores, and (c) average twelfth-
grade mathematics achievement by type of high school, and their difference (n = 5,671)
Block  Propensity-Score Range   Frequency         Avg. Propensity     Avg. Math (12th Grade)
                                Publ.    Cath.    Publ.    Cath.      Publ.    Cath.    Diff.
2      0.05 ≤ p̂ < 0.1          1,431    110      0.075    0.078      48.85    51.00    2.15 ∗,†
3      0.1 ≤ p̂ < 0.15          1,599    253      0.127    0.129      53.62    55.38    1.76 ∗∗,†
32. Details of these analyses are available from the authors on request.
33. Because the estimated propensity scores depend on many continuous (and categori-
cal) predictors, it is unlikely that two participants will share exactly the same propensity
score.
34. This claim, of course, is not universally true, as two participants can have similar pro-
pensity scores but differ on their values of the covariates, the differences in the latter
being offset by differences in the parameter estimates associated with the covariates
in the selection model. However, when selection predictors have been chosen sensibly,
it usually turns out to be a defensible claim.
35. It is true that this process sacrifices statistical power as it eliminates members of the
sample from the analysis. However, remember that we are using observational data,
and that the sample members who have been eliminated were demonstrably non-equiv-
alent to those who were retained.
36. There is a lot of recent technical research on matching methods. For example, Abadie
and Imbens (2008) show that the use of bootstrapping does not produce correct stan-
dard errors for matching estimators. We thank Juan Saavedra for pointing this out to
us and for very helpful conversations about the topics discussed in this chapter.
all treated. For example, they may be low-income adult males who had
participated in the National Supported Work (NSW) program, an initia-
tive that provided participants with structured work experiences. Now,
you want to learn whether the treatment improved labor-market outcomes
for participants. However, you lack a control group. How can you best
select a suitable control group from another dataset, such as the Current
Population Survey (CPS), which the U.S. Bureau of the Census administers
monthly to a sample of more than 50,000 U.S. households? Propensity-
score analysis and nearest-neighbor matching provide one alternative.
First, you merge all your data on the participants in the training program
with the CPS data from an appropriate year. Second, you fit a sensible
selection model that predicts whether an adult male participated in
the training program, or not, in the combined dataset and estimate a
propensity score for each person. Working with these propensities and
nearest-neighbor matching, you then select a subsample of males from
the CPS sample that can best serve as the control group for comparison
of outcomes with your treatment group, in the usual way.
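One way such a matched comparison group might be assembled is sketched below in Python, using scikit-learn's NearestNeighbors. The data frames and the pscore column are hypothetical, and the sketch performs one-to-one matching with replacement on the estimated propensity score alone.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def match_controls(treated_scores, comparison_scores):
    """For each treated unit, find the comparison-pool member with the closest
    estimated propensity score (1-to-1 nearest-neighbor matching, with replacement)."""
    pool = np.asarray(comparison_scores).reshape(-1, 1)
    nn = NearestNeighbors(n_neighbors=1).fit(pool)
    _, indices = nn.kneighbors(np.asarray(treated_scores).reshape(-1, 1))
    return indices.ravel()  # row positions of the matched comparison units

# Hypothetical usage, with propensities estimated from a selection model
# fit to the merged NSW and CPS data:
# matched_rows = match_controls(nsw["pscore"], cps["pscore"])
# control_group = cps.iloc[matched_rows]
```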
In recent years, a number of studies have examined whether this par-
ticular application of propensity scores and nearest-neighbor matching
produces estimates of treatment effects that are similar to those obtained
from random-assignment experiments. Not surprisingly, the results of
these studies are mixed. For example, Rajeev Dehejia and Sadek Wahba
(2002) find that the application of these techniques allowed them to
replicate the results of the random-assignment evaluation of the NSW
program using a comparison group drawn from the CPS. On the other
hand, Diaz and Handa (2006) report that the use of these methods did
not allow them to replicate consistently the experimental results from the
analysis of Mexico’s PROGRESA conditional cash-transfer program.
The mixed nature of the results of studies that examine whether the
application of matching methods can reproduce the results from random-
assignment experiments should not surprise you. Much depends on the
success of the researchers in understanding the selection process, in mod-
eling it accurately, and in obtaining comparable measures of the critical
covariates in all of the datasets used in the analysis.
37. They would be equal to the overall proportion of Catholic high-school students in the
full sample, 0.104.
38. With only two choices, if the probability of choosing one of them is p, then the
probability of choosing the other must be (1 – p).
39. There is a typo in the extreme right-hand quotient in Imbens and Wooldridge’s
Equation (18). When corrected, it should read:
\[
\hat{\tau}_{ipw} \;=\; \left( \sum_{i=1}^{N} \frac{W_i\,Y_i}{\hat{e}(X_i)} \;\middle/\; \sum_{i=1}^{N} \frac{W_i}{\hat{e}(X_i)} \right) \;-\; \left( \sum_{i=1}^{N} \frac{(1-W_i)\,Y_i}{1-\hat{e}(X_i)} \;\middle/\; \sum_{i=1}^{N} \frac{1-W_i}{1-\hat{e}(X_i)} \right)
\]
(Imbens, private communication, 2009), where Wi is the value of the treatment indicator for the ith individual and takes on a value of 1 when the participant is a member of the treatment group (0, otherwise), and ê(Xi) is the propensity score, predicted from covariates X.
40. Be cautious how you program and execute the WLS regression analysis, as the idio-
syncrasies of your statistical software may affect how you need to communicate the
IPW weights. To obtain the Imbens and Wooldridge IPW estimator by classical WLS
regression analysis, the regression weights must be the square-roots of the IPW
weights. However, some statistical routines require you to input WLS regression
weights as squares—that is, as variance-based weights—in which case, the IPW weights
themselves are the required form.
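For readers who want to see the estimator in code, a direct translation of the corrected expression in note 39 into Python might look like the following sketch, where y, w, and e are hypothetical arrays holding the outcome, the treatment indicator, and the estimated propensity scores. As note 40 cautions, obtaining the same point estimate by weighted least squares requires supplying the weights in the form your software expects.

```python
import numpy as np

def ipw_ate(y, w, e):
    """Inverse-probability-weighted estimate of the average treatment effect,
    following the corrected expression in note 39."""
    y, w, e = map(np.asarray, (y, w, e))
    treated_mean = np.sum(w * y / e) / np.sum(w / e)
    control_mean = np.sum((1 - w) * y / (1 - e)) / np.sum((1 - w) / (1 - e))
    return treated_mean - control_mean
```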
[Figure appears here: kernel-density estimates of MATH8 (horizontal axis, 30 to 80), in two panels, the second labeled "Panel B: After."]
Recall from Chapter 1 the statement that Paul Hanus made at the 1913
meeting of the National Education Association: “The only way to combat
successfully mistaken common-sense as applied to educational affairs is to
meet it with uncommon-sense in the same field—with technical informa-
tion, the validity of which is indisputable” (Hanus, 1920). The quest for
Hanus’s “indisputable technical information” has been a long one, stretch-
ing out over the last century. However, advances made in recent decades
have made it increasingly possible to conduct social science research that
meets this standard. Some of these advances reflect improvements in our
application and interpretation of powerful analytic approaches. For exam-
ple, the efforts of Donald Rubin and other methodologists have clarified
the assumptions under which random-assignment experiments can pro-
vide unbiased answers to educational policy questions. A body of work
that began with contributions from psychologist Donald Campbell, and
to which methodologists from several disciplines have contributed, has
improved our understanding of the conditions under which the regression-
discontinuity design can support causal inferences. Methodologists have
also developed new insights into the application of instrumental-variables
estimation and the practical utility of particular kinds of instruments.
Other notable advances have been in the use of computers and data
warehouses for administrative record-keeping. For example, well-organized
digitized records that provide extensive information on such important
things as school enrollments, teacher assignments, student test scores,
and the labor-market outcomes of adults have become available increas-
ingly. This, in turn, has increased dramatically the feasibility of examining
important roles that theory plays in social science research by again con-
sidering how a reduction in class size might result in higher student
achievement. At its most basic level, a simple theory about why smaller
class size would lead to higher student achievement seems straightfor-
ward: The smaller the class, the more time the teacher has to work with
students, individually or in small groups. However, a little reflection leads
one to the realization that this minimal theory must be refined if it is
to address the many subtle questions that arise in designing a class-
size-reduction policy and to evaluate the sensitivity of its consequences to
design decisions. For example, what can theory tell us about whether it
matters if all students are placed in smaller classes of the same size, or if
students with particular characteristics are assigned to especially small
classes? Does it matter how teachers are assigned to classes of different
size? More refined theories that suggest answers to questions such as
these shed light on the design of class-size reduction initiatives and on the
interpretation of their results.
For example, in 2001, economist Edward Lazear published a theory of
the links between class size and student achievement that addressed one
of these questions. Lazear theorized that students differ in their propen-
sity to disrupt the classroom learning environment through their poor
behavior. In his theory, the mechanism through which a smaller class size
increases student achievement is that it reduces the amount of instruc-
tional time that is lost to student disruptions. One hypothesis stemming
from Lazear’s “disruptive-student” theory is that class-size reductions
would then have a greater aggregate impact on student achievement if
administrators grouped disruptive students strategically in especially
small classes than if students were assigned randomly to smaller classes.
The reason is that, in the larger classes now free of disruptions by unruly
students, all students will achieve at higher levels. In testing this subtle
hypothesis, it would be important to be able to identify disruptive chil-
dren and obtain the permission of participants to place them in particularly
small classes.
A quite different theory of why class size affects student achievement
centers on teacher labor markets and working conditions. From this theo-
retical perspective, smaller classes offer more desirable working conditions
for teachers. Consequently, schools that offer small classes will be more
effective in attracting strong applicants for their teaching positions than
will schools with larger classes (Murnane, 1981). In designing research to
test this “working-conditions” hypothesis, it would be important to adopt
a time-frame that is long enough to allow schools with reduced class sizes
to recruit these stronger candidates for their teaching positions and to
observe their impacts on subsequent student achievement.
These two theories about the mechanisms through which class size
affects student achievement are relevant to the interpretation of the
results from the Tennessee Student/Teacher Achievement Ratio (STAR)
experiment. Recall from Chapter 3 that, in this experiment, teachers and
students were assigned randomly either to small classes (of 13–17 stu-
dents) or to regular- size classes (of 21–25 students). The results of the
evaluation showed that, at the end of the first year of the experiment,
students placed in small classes scored 4 percentage points higher, on
average, on a standardized test of cognitive skills than did students placed
in regular-size classes (Krueger, 1999). Although advocates of class-size
reduction applauded the positive impact of the intervention, critics
argued that the impact was only modest in size, given the high financial
cost of reducing the class sizes in the first place (Hanushek, 1998).
One thing missing from the debate about the results of the STAR
experiment was recognition that the design of the experiment eliminated
two mechanisms through which class-size reductions may affect student
achievement. The assignment process in the STAR experiment meant
that, on average, potentially disruptive children were distributed randomly
among classes, rather than being clustered together in small classes.
Recent ingenious research by Bryan Graham (2008) using data from the
STAR experiment lends support for Lazear’s “disruptive-student” hypoth-
esis. The random assignment of teachers to classes of different size within
the same school also meant that the STAR experiment did not provide an
opportunity to test the “working-conditions” hypothesis.
The point of this example is to illustrate the importance of theory in
designing and in interpreting the results of research to evaluate the
impact of a policy intervention. The STAR experiment was a remarkable
accomplishment that provided important new information about the
consequences of class-size reductions. However, thinking about the theo-
retical mechanisms through which class size may influence student
achievement leads to the realization that the experiment was not designed
to test important hypotheses about two potential benefits of small classes:
the ability to handle disruptive students well, and the ability to attract
especially skilled teachers. It would take subsequent research, with different
designs, to estimate the importance of these mechanisms.
Putting this pattern together, Jacob and Lefgren concluded that their
results were consistent with the conclusion that retention in grade had no
long-term effect on the reading achievement of either third-grade students
or sixth-grade students. The difference in the test patterns two years after
promotion decisions were made could be accounted for by the pattern of
stakes attached to the tests subsequently faced by the different groups.
This example illustrates the importance of developing a thorough
understanding of the institutions and rules in the settings where an
educational-policy intervention takes place. In the case of the Jacob and
Lefgren study, knowledge of the details of Chicago’s testing policy—
including the stakes attached to tests administered at different grade
levels—was important in shedding light on the likely explanation of an
initially puzzling set of findings about the consequences of grade retention.
scores of the control group. However, this difference did not translate to
a treatment–control group difference in scores on the other tests of
reading. In particular, the students who were given access to the FFW
software did not score better than the control group, on average, on the
criterion-referenced state test that was aligned with the state’s reading
standards. This pattern led Rouse and Krueger to conclude that use of
the Fast ForWord computer-aided instruction (CAI) programs did not
result in improved reading skills for low-achieving urban students. Of
course, their conclusion would have been different had they only exam-
ined scores on the test provided by the software developer.
The random-assignment evaluation of Moving to Opportunity (MTO)
provides a compelling illustration of the second point. MTO is a ten-year
experiment sponsored by the U.S. Department of Housing and Urban
Development that provides a large random sample of low-income families
with the opportunity to move from extremely disadvantaged urban
neighborhoods to less distressed communities. Many social scientists
hypothesized that the primary benefit of MTO would be improvement in
labor-market outcomes for adults in treatment-group families. The logic
was that the residential moves would place families closer to jobs. However,
to date, the results of the evaluation have shown no improvements in
labor-market outcomes. Had the evaluation focused solely on examining
labor-market outcomes, the evidence regarding the effects of the MTO
experiment would have been uniformly discouraging.
Fortunately, in planning the evaluation, the research team considered
a variety of mechanisms through which moving to a better neighborhood
could alter the lives of families. One of many hypotheses was that the
opportunity to move out of high-crime neighborhoods would reduce
stress levels for parents and improve their mental health. This hypothesis
led the research team to collect data on a variety of measures of partici-
pants’ mental health. One of the most striking findings to date from the
MTO evaluation has been the marked improvement in the mental health
of mothers (Kling, Liebman, & Katz, 2007). This finding has led the
research team to plan for the next round of data collection and analysis,
an assessment of whether improved mental health for mothers is a mechanism
through which MTO leads to better cognitive and emotional develop-
ment for young children. Results bearing on this hypothesis will be
available by 2012.
examined impacts on student test scores at the end of the school year in
which each teacher worked with a particular group of students. This
makes sense because the effects of particular teachers on students’ skills
are likely to become muted over subsequent years, as the students experi-
ence other teachers. At the same time, policymakers are especially
interested in learning whether particular interventions have lasting
effects, and evidence that they do or do not is important in deciding
whether to continue the interventions and whether to scale them up so
that they serve more participants. For that reason, it is useful, when finan-
cially feasible, to design research in a manner that makes it possible to
evaluate whether interventions have long-term impacts.
Recall that, on average, the MDRC evaluation team found no differ-
ences between the treatment group of students offered places in a career
academy and the control group in terms of their high-school grades, test
scores, graduation rates, or college-enrollment rates (Kemple, 2008).
Thus, even though each participant contributed data on each of multiple
outcome measures, the conclusion reached by the investigators by the
end of the participants’ high-school years was that the offer of a place in
a career academy had not resulted in better academic outcomes. However,
the evidence on outcomes that were measured subsequently—eight years
after high-school graduation—was quite different. For instance, the origi-
nal offer of a place in a career academy ultimately resulted in members of
the treatment group enjoying labor-market earnings that were, on average,
11% higher than those of members of the control group. Thus, the combination
of a long-term evaluation and the use of multiple outcome measures
resulted in an important—and surprising—set of results concerning the
outcomes of career academies.
students who had also applied for a PACES scholarship, but lost out in the
lottery. Of course, this is an unbiased estimate of the causal impact of the
intent to treat. Angrist and his colleagues then used instrumental-variables
estimation to address their second question: Does making use of a schol-
arship to help pay private secondary-school fees increase the educational
attainments of low-income students? One reason that the answer to this
question differs from the answer to the first question is that 8% of the
low-income students who won the lottery did not make use of the offer of
a scholarship from the PACES program. A second reason is that almost one-quarter
of those students who lost out in the lottery were successful eventually in
obtaining a scholarship to help pay private secondary-school fees. Thus,
the assignment of a scholarship take-up “treatment” was endogenous, a
result in part of unobserved family motivations and skills.
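The estimation logic can be sketched with simulated data; everything below is a hypothetical illustration, not the PACES records, and all variable names and parameter values are invented. The randomized offer serves as an instrument for endogenous scholarship use, and the instrumental-variables (Wald) estimate is the intent-to-treat contrast in the outcome divided by the offer’s effect on take-up:

    # Hypothetical sketch of IV (Wald) estimation with a randomized offer as instrument.
    # Z = lottery offer, D = endogenous scholarship use, Y = outcome of interest.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 4000
    Z = rng.integers(0, 2, size=n)                      # randomized scholarship offer
    motivation = rng.normal(0.0, 1.0, n)                # unobserved family factor
    latent = 0.25 + 0.65 * Z + 0.30 * motivation + rng.normal(0.0, 0.30, n)
    D = (latent > 0.5).astype(int)                      # take-up depends on the offer and on motivation
    Y = 0.16 * D + 0.10 * motivation + rng.normal(0.0, 0.50, n)   # hypothetical effect of use

    naive = Y[D == 1].mean() - Y[D == 0].mean()         # biased: users differ in unobserved motivation
    itt = Y[Z == 1].mean() - Y[Z == 0].mean()           # reduced form (intent to treat)
    first_stage = D[Z == 1].mean() - D[Z == 0].mean()   # offer's effect on take-up
    print(f"naive users-vs-nonusers difference: {naive:.3f}")
    print(f"IV (Wald) estimate:                 {itt / first_stage:.3f}")

Because take-up also reflects unobserved motivation, the naive comparison of users and non-users is biased, whereas the ratio of the two randomized contrasts recovers the effect of scholarship use for compliers.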
Using the randomized assignment of the offer of a PACES scholarship
as an instrument, the research team estimated that making use of a schol-
arship increased by 16 percentage points the probability that low-income
students completed the eighth grade on time. Recall from our earlier dis-
cussion of instrumental-variables estimation that this is an estimate of the
local-average treatment effect, or LATE estimate, and that it pertains to
those students whose decision about whether to make use of a scholar-
ship to pay secondary-school fees was sensitive to the PACES scholarship
offer or the lack of this offer. It is the effect of the treatment for compliers,
and not for those students in the population who would have always
obtained a scholarship (the always-takers) or never have done so (the never-
takers), regardless of their original assignment. Of course, we do not know
which members of the student population fall into each of these classes,
as they are unobserved features of individuals. If the impact of scholar-
ship use is homogeneous across all population members, then the LATE
estimate obtained by Angrist and his colleagues also applies universally to
all students in the population. When only one instrument is available, it is
not possible to explore whether this is the case or not.
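In standard potential-outcomes notation (a sketch in conventional notation, not necessarily the book’s own), let Z indicate the randomized offer, D the use of a scholarship, and Y the outcome. Then

    \mathrm{LATE} = \frac{E[\,Y \mid Z = 1\,] - E[\,Y \mid Z = 0\,]}{E[\,D \mid Z = 1\,] - E[\,D \mid Z = 0\,]} .

Using the approximate take-up rates cited above, roughly 0.92 for lottery winners and 0.24 for lottery losers, the denominator is about 0.68. A LATE of 16 percentage points for on-time completion of eighth grade therefore corresponds to an intent-to-treat effect of roughly 0.68 × 16, or about 11 percentage points, which is the arithmetic behind the difference between the answers to the two questions.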
It is also important to recognize that the LATE estimate does not
provide an estimate of the impact of scholarship use on students’ subse-
quent educational attainment, holding constant all other aspects of family
dynamics. Instead, it provides an estimate of the total impact of financial
aid on a student’s subsequent educational attainment. The distinction
may matter because the choice of whether to make use of a scholarship
offer to help pay a child’s secondary-school fees may not have been the
only parental decision affected by the lottery outcome. For example, par-
ents may have decided to reduce expenditures on books and other
learning materials for the child who obtained and made use of a scholar-
ship in order to free up money to help other children in their family who
As part of the struggle to have their work published in the best possible
peer-reviewed journals, researchers often do not report evidence that par-
ticular interventions have no statistically significant effects on outcomes
or effects that run counter to their theories. Instead, they focus their
papers on describing the strongest results that support their theories.
Unfortunately, this behavior hinders the accumulation of knowledge.
Attempts to synthesize evidence about the efficacy of a particular
intervention from published studies can then summarize only the positive
evidence, even in cases in which the vast majority of evaluations found
no effects of the intervention but were either not published or had their
contrary findings downplayed. This problem is often referred to as publication
bias. Although there is no easy solution to the problem, the best defense
against it may be the practices of conscientious referees who focus on the
quality of the methodology used in a particular causal study, not on the
consistency of the results. In this regard, it is important to keep in mind
that unexpected results often occur in well-designed studies. We illustrate
this with evidence from Angrist and Lavy’s Maimonides’ rule paper.
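Before turning to that example, a small simulation (entirely hypothetical numbers, unrelated to any study discussed in this book) makes the mechanism concrete. When only statistically significant positive estimates reach print, a synthesis of the published record can suggest a sizable effect even though the true effect is zero:

    # Hypothetical illustration of publication bias: the true effect is zero, but a
    # summary of only the "significant" published estimates looks clearly positive.
    import numpy as np

    rng = np.random.default_rng(2)
    true_effect, se, n_studies = 0.0, 0.10, 1000
    estimates = rng.normal(true_effect, se, n_studies)  # sampling error only; the true effect is zero
    published = estimates[estimates / se > 1.96]        # suppose only significant positive results appear in print

    print(f"mean of all {n_studies} estimates: {estimates.mean():+.3f}")
    print(f"mean of the {published.size} published estimates: {published.mean():+.3f}")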
Recall from Chapter 9 that Angrist and Lavy (1999) made clever use of
a natural experiment created by the exogenous application of Maimonides’
rule to estimate the causal impact of differences in intended class sizes on
student achievement. The substantive results that most readers of their
classic paper remember are that class size had a substantial impact on the
reading and mathematics achievement of students in fifth grade in the
year 1991, and that the impact was especially large in schools that served
high concentrations of economically disadvantaged students. In thinking
about the substantive implications of this study—and for that matter, of all
evaluations of the causal impacts of policy interventions—we believe it is
1. For more discussion of this point, see Todd and Wolpin (2003), and Duflo, Glennerster,
and Kremer (2008).
important to pay attention to all of the results, including those that seem
somewhat puzzling.
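Before examining those puzzling results, it may help to recall how the instrument works. Maimonides’ rule, as described in Angrist and Lavy (1999), caps classes at 40 pupils, so a grade cohort of e students is divided into int((e − 1)/40) + 1 classes, and predicted class size drops sharply as enrollment crosses 41, 81, 121, and so on. A short illustrative sketch (not taken from the published paper):

    # Predicted (intended) class size under Maimonides' rule with a cap of 40 pupils,
    # following the formula in Angrist and Lavy (1999). The sharp drops at 41, 81, ...
    # provide the variation that the identification strategy exploits.
    def predicted_class_size(enrollment: int, cap: int = 40) -> float:
        n_classes = (enrollment - 1) // cap + 1
        return enrollment / n_classes

    for e in (38, 40, 41, 80, 81, 120, 121):
        print(e, round(predicted_class_size(e), 2))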
In their paper, Angrist and Lavy also reported results on the impact of
class size on the reading and mathematics achievement of fourth-graders
in 1991 and third-graders in 1992. The results for the fourth-graders were
much weaker than those for the fifth-graders—the corresponding impact
on reading achievement was less than half as large, and the impact on
mathematics achievement was even smaller, and was not statistically sig-
nificant. Angrist and Lavy suggest that this pattern could be due to the
cumulative effect of class size: students who were taught in small fifth-
grade classes were probably also in relatively small classes in their earlier
grades, and each year brought additional benefits. Although this is a plau-
sible explanation for why the impact of intended class size on student
achievement should be somewhat smaller for fourth- than for fifth-graders,
it does not explain why the fourth-grade impacts would be less than half
as large as the corresponding fifth-grade impacts.
The results for third-graders are even more puzzling. Angrist and Lavy
found no impact of differences in intended class size on the reading and
mathematics achievement of third-graders in 1992 (the second and final
year of the testing program that provided the student-achievement
outcome). They speculated that this pattern stemmed from teachers’
responses to the publication of the 1991 results. After reading the results,
teachers may have devoted more time to test preparation in 1992, and
thus weakened the relationship between students’ test scores and their
true skill levels. Angrist and Lavy’s hypothesized explanation for their
third-grade results raises questions about the quality of the information
on students’ skills that is generated by test-based accountability systems,
an issue with which many countries grapple today.2 The point that we
want to emphasize, however, is the importance of paying attention to all
of the results of well-designed research studies, not just the strongest. We
see this as a necessary condition for research to have a beneficial impact
on public policy debates concerning the consequences of particular
educational initiatives. We applaud Angrist and Lavy for describing the
puzzling results for the third- and fourth-grade students whose achieve-
ment they investigated.
2. To learn more about the difficulty of interpreting student test results under test-based
accountability systems, see Koretz (2008).
back to that of the children in the control group. This led him to conclude
that the smaller class sizes for grades 2 and 3 were required for the treat-
ment group to sustain its higher achievement. Hanushek assumed that
the higher achievement for children in the treatment group would have
been sustained if these children had been placed in regular-size classes in
grades 2 and 3. Thus, he concluded that, since the achievement differ-
ences at the end of grade 3 were no larger than those detected at the end
of grade 1, placement in small classes for grades 2 and 3 did not result in
further achievement benefits. Unfortunately, Project STAR did not pro-
vide the evidence needed to determine whether Mosteller’s or Hanushek’s
assumption was more accurate because the research design did not
include the randomization of a subset of students into small classes
for grades K and 1 and into regular-size classes in grades 2 and 3. This
step needs to be taken in a new round of random-assignment studies of
class-size reduction.
Substantive Lessons and New Questions
2. The topics we explore do not exhaust the interventions aimed at improving student
achievement that researchers have evaluated using high-quality causal research methods.
For example, there are many studies examining the effectiveness of new curricula and
of particular computer-based software programs.
Until quite recently, the dominant strategy for improving school quality
was to purchase more or better inputs—for example, provide more books;
hire additional teachers so that class sizes could be reduced; or raise
teacher salaries in order to attract more skilled teachers. The attraction
of this approach is that teachers, students, and parents all enjoy having
additional resources in classrooms and so the strategy is popular politi-
cally. Unfortunately, input-based improvement strategies do not consistently
result in improved student achievement, and it is not difficult to under-
stand why. The fundamental problem in a great many schools is that
students do not receive consistently good instruction that is tailored to
their needs. As a result, children are not engaged actively in learning
while they are in school. A necessary condition for improving student
achievement is increased student engagement, and this means changing
the daily educational experiences of children in schools. Simply providing
additional resources to the school or classroom, without changing how
those resources are used, does not achieve the desired result.
Of course, the conclusion that resource levels do not affect student
achievement is too strong. In some settings, the provision of additional
resources does indeed make a difference to student outcomes because
the new resources do result in a change in children’s daily experiences in
school. In fact, we argue that paying attention to whether a particular
school-improvement strategy results in a change in children’s daily experi-
ences goes a long way toward predicting whether the intervention will
succeed in improving student achievement. We illustrate this point with
evidence from a number of high-quality studies that have investigated the
impacts of input-based educational interventions.
More Books?
Paul Glewwe and his colleagues conducted a random-assignment evalua-
tion of a program in rural Kenya that provided primary schools with
Smaller Classes?
In earlier chapters, we described two well-known studies of the impact of
class size on student achievement. The Tennessee Student/Teacher
Achievement Ratio (STAR) experiment provided strong evidence that
spending the first year of school in a small class improved student achieve-
ment, especially for students from low-income families (Krueger, 1999).
The likely explanation is that when children come to school for the first
time, they have a lot to learn about how to behave in a structured class-
room setting. In a relatively small class (13–17 students), children’s initial
experiences are different from those in a larger class because teachers are
better able to help children acquire the appropriate behaviors.
Other relevant evidence for the impact of class size on student achieve-
ment comes from Angrist and Lavy’s (1999) analyses of data from Israel,
and from Miguel Urquiola’s (2006) similar study of the impact of class
size on the achievement of third-graders in schools in rural Bolivia. Both
studies found that students in some middle-primary grades who were in
small classes had higher reading and mathematics achievement, on aver-
age, than children schooled in larger classes. One possible explanation
for these findings stems from the identification strategy they employed.
Both studies used a regression-discontinuity design to exploit the
consequences of natural experiments that had been inaugurated by the
implementation of rules to govern maximum class size. The net effect
of this identification strategy was that comparisons of achievement
were between students in classes containing quite different numbers of
Better Teaching?
Studies conducted in a great many countries have documented that chil-
dren have higher achievement, on average, in some classrooms than they
do in others, and that differences in the quality of teaching are the likely
explanation (Rivkin, Hanushek, & Kain, 2005). This unsurprising pattern
suggests the potential value of devoting resources either to hiring teach-
ers who are known to be more effective or to improving the skills of
the incumbent teaching force. Many educational systems try to do both.
3. The standard deviation in class size in Hoxby (2000) ranged between 5.5 and 6.4
students, depending on the grade level (Appendix table, p. 1283). The standard devia-
tion of class size in the discontinuity sample in the Angrist and Lavy (1999) study
ranged between 7.2 and 7.4 (Table 1, p. 539). The standard deviation in class size in the
Urquiola (2006) study was 9.9 (Table 1, p. 172).
The evidence that input-based strategies, such as reducing class size and
investing in the professional development of teachers, do not consistently
result in improved education for children is sobering. This pattern has
led to a growing interest in using different kinds of incentives to alter
the behaviors of teachers or students, and thereby improve educational
outcomes for children.
The first is that programs of this type have promise, as shown by the posi-
tive results of the student incentive program in Busia, Kenya. Second,
incentives may be a way to increase students’ use of the extra support
services that a great many educational institutions offer to struggling stu-
dents. Angrist, Lang, and Oreopoulos (2009) found this pattern in the
responses of female college students in Canada to the combination of
extra supports and financial rewards for academic achievement. Indeed,
paying students to engage in behaviors known to contribute to skill devel-
opment may be more effective in enhancing students’ skills in some
settings than rewarding their performances on standardized tests. Among
the questions worthy of attention in new studies of these effects are why
the same set of incentives elicits different responses in different settings,
and why, in at least some settings, girls are more responsive to short-term
incentives for improving academic performance than are boys.
charter schools in New York City. All of these evaluations found that
children who won a lottery that provided the offer of a place in a charter
school had higher test scores one or more years later than children who
lost out in the lottery and typically then attended conventional public
schools. In interpreting this evidence, it is important to keep in mind that
these lottery-based evaluations only examine the effectiveness of charter
schools that are heavily oversubscribed, and therefore have used lotteries
to determine who is accepted. Many charter schools across the nation are
not oversubscribed, and results from nation-wide evaluations of their
effects on student test scores, which are necessarily conducted with less
rigorous evaluation methods, are mixed.4
Although the evidence from many recent evaluations of charter schools
is encouraging, many important questions remain. One is whether the
charter schools that are more effective in increasing students’ skills will
flourish and those that are not will die. Given the complexity of the polit-
ical processes that determine which schools are granted renewals of their
charters, the answer is not obvious. Second, some charter schools require
that parents sign pledges stating they will be responsible for ensuring
their children adhere to a dress code and to rules regarding behavior and
attendance. It is not clear to what extent such requirements mean that
“high-commitment” charter schools will serve only a modest percentage
of children from low-income families. Third, some charter
schools make extraordinary demands on teachers, for example, requiring
very long work days and that teachers respond to phone calls from stu-
dents on evenings and weekends. It is not clear whether limits on the
supply of skilled teachers who are willing to work under these conditions
for sustained periods of time will constrain the role of charter schools in
educating poor students. Fourth, almost all of the evidence to date on the
relative effectiveness of charter schools comes from analyses of student
scores on standardized tests. Of course, the more important outcomes
are success in post-secondary education, in labor markets, and in adult
life. To date, there is little information on the extent to which charter
schools are more effective than conventional public schools in helping children from
poor families achieve these outcomes.5
4. For example, see the Center for Research on Education Outcomes or CREDO (2009).
5. Many of the questions about charter schools described in this paragraph are taken
from Curto, Fryer, and Howard (2010).
Summing Up
are embedded, and on the cultures that influence the priorities and
behaviors of teachers, administrators, parents, children, and employers.
Final Words
References
Aaron, H. J. (1978). Politics and the professors: The great society in perspective. Studies
in social economics. Washington, DC: Brookings Institution.
Abadie, A. (January 2005). Semiparametric difference-in-differences estimators.
Review of Economic Studies, 72(1), 1–19.
Abadie, A., Imbens, G.W. (November 2008). On the failure of the bootstrap for
matching estimators. Econometrica, 76(6), 1537–57.
Abdulkadiroglu, A., Angrist, J., Dynarski, S., Kane, T. J., Pathak, P. (2009).
Accountability and flexibility in public schools: Evidence from Boston’s charters and
pilots. Research Working Paper No. 15549. Cambridge, MA: National Bureau
of Economic Research.
Agodini, R., Dynarski, M. (February 2004). Are experiments the only option?
A look at dropout-prevention programs. Review of Economics and Statistics,
86(1), 180–94.
Almond, D., Edlund, L., Palme, M. (2007). Chernobyl’s subclinical legacy: Prenatal
exposure to radioactive fallout and school outcomes in Sweden. Research Working
Paper No. 13347. Cambridge, MA: National Bureau of Economic Research.
Altonji, J. G., Elder, T. E., Taber, C. R. (February 2005a). Selection on observed
and unobserved variables: Assessing the effectiveness of Catholic schools. Jour-
nal of Political Economy, 113(1), 151–84.
Altonji, J. G., Elder, T. E., Taber, C. R. (2005b). An evaluation of instrumental-
variable strategies for estimating the effects of Catholic schooling. Journal of
Human Resources, 40(4), 791–821.
Angrist, J., Bettinger, E., Bloom, E., King, E., Kremer, M. (December 2002).
Vouchers for private schooling in Colombia: Evidence from a randomized
natural experiment. American Economic Review, 92(5), 1535–58.
Angrist, J., Bettinger, E., Kremer, M. (June 2006). Long-term educational conse-
quences of secondary-school vouchers: Evidence from administrative records
in Colombia. American Economic Review, 96(3), 847–62.
Angrist, J. D., Dynarski, S. M., Kane, T. J., Pathak, P.A., Walters, C.R. (2010). Who
benefits from Kipp? Research Working Paper No. 15740. Cambridge, MA:
National Bureau of Economic Research.
Angrist, J., Lang, D., Oreopoulos, P. (January 2009). Incentives and services for
college achievement: Evidence from a randomized trial. American Economic
Journal: Applied Economics, 1(1), 136–63.
Angrist, J., Lavy, V. (September 2009). The effects of high stakes high-school
achievement awards: Evidence from a randomized trial. American Economic
Review, 99(4), 1384–414.
Angrist, J. D. (June 1990). Lifetime earnings and the Vietnam-era draft lottery:
Evidence from Social Security administrative records. American Economic
Review, 80(3), 313–36.
Angrist, J. D., Imbens, G.W., Rubin, D. B. (June 1996). Identification of causal
effects using instrumental variables. Journal of the American Statistical Associa-
tion, 91(434), 444–55.
Angrist, J. D., Krueger, A. B. (Fall 2001). Instrumental variables and the search
for identification: From supply and demand to natural experiments. Journal of
Economic Perspectives, 15(4), 69–85.
Angrist, J. D., Krueger, A. B. (November 1991). Does compulsory school atten-
dance affect schooling and earnings? Quarterly Journal of Economics, 106(4),
979–1014.
Angrist, J. D., Lavy, V. (May 1999). Using Maimonides’ rule to estimate the effect
of class size on scholastic achievement. Quarterly Journal of Economics, 114(2),
533–75.
Angrist, J. D., Pischke, J. S. (2009). Mostly harmless econometrics: An empiricist’s com-
panion. Princeton, NJ: Princeton University Press.
Banerjee, A.V., Cole, S., Duflo, E., Linden, L. (August 2007). Remedying educa-
tion: Evidence from two randomized experiments in India. Quarterly Journal of
Economics, 122(3), 1235–64.
Barrera-Osorio, F., Raju, D. (2009). Evaluating a test-based public subsidy program
for low-cost private schools: Regression-discontinuity evidence from Pakistan. Paper
presented at the National Bureau of Economic Research Program on Educa-
tion Meeting, April 30, 2009, Cambridge, MA.
Becker, G. S. (1964). Human capital: A theoretical and empirical analysis, with special
reference to education (vol. 80). New York: National Bureau of Economic
Research, distributed by Columbia University Press.
Becker, S. O., Ichino, A. (2002). Estimation of average treatment effects based on
propensity scores. Stata Journal, 2(4), 358–77.
Black, S. (May 1999). Do better schools matter? Parental valuation of elementary
education. Quarterly Journal of Economics, 114(2), 577–99.
Bloom, H. S. (Forthcoming). Modern regression-discontinuity analysis. In Field
experimentation: Methods for evaluating what works, for whom, under what circum-
stances, how, and why. M. W. Lipsey, D. S. Cordray (Eds.). Newbury Park, CA:
Sage.
Bloom, H. S., ed. (2005). Learning more from social experiments: Evolving analytic
approaches. New York: Sage.
Bloom, H. S., Thompson, S. L., Unterman, R. (2010). Transforming the High School
Experience: How New York City’s New Small Schools are Boosting Student Achieve-
ment and Graduation Rates. New York: MDRC.
Boozer, M., Rouse, C. (July 2001). Intra-school variation in class size: Patterns and
implications. Journal of Urban Economics, 50(1), 163–89.
Borko, H. (2004). Professional development and teacher learning: Mapping the
terrain. Educational Researcher, 33(8), 3–15.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences, 2nd ed.
Hillsdale, NJ: Lawrence Erlbaum Associates.
Coleman, J. S., Campbell, E. Q., Hobson, C. J., McPartland, J., Mood, A. M.,
Weinfeld, F. D., York, R. L. (1966). Equality of educational opportunity. Washing-
ton, DC: U.S. Department of Health, Education, and Welfare, Office of
Education.
Coleman, J. S., Hoffer, T., Kilgore, S. (1982). High-school achievement: Public,
Catholic, and private schools compared. New York: Basic Books.
Cook, T. D. (February 2008). “Waiting for life to arrive”: A history of the regres-
sion-discontinuity design in psychology, statistics and economics. Journal of
Econometrics, 142(2), 636–54.
Cook, T. D., Shadish, W. R., Wong, V. C. (Autumn 2008). Three conditions under
which experiments and observational studies produce comparable causal esti-
mates: New findings from within-study comparisons. Journal of Policy Analysis
and Management, 27(4), 724–50.
Cook, T. D., Wong, V. C. (forthcoming). Empirical tests of the validity of the
regression-discontinuity design. Annales d’Economie et de Statistique.
Center for Research on Education Outcomes (CREDO) (2009). Multiple choice:
Charter performance in 16 states. Palo Alto, CA: Stanford University.
Currie, J., Moretti, E. (2003). Mother’s education and the intergenerational trans-
mission of human capital: Evidence from college openings. Quarterly Journal
of Economics, 118(4), 495–532.
Davidoff, I., Leigh, A. (June 2008). How much do public schools really cost? Esti-
mating the relationship between house prices and school quality. Economic
Record, 84(265), 193–206.
Decker, P. T., Mayer, D. P., Glazerman, S. (2004). The effects of Teach For America on
students: Findings from a national evaluation. Princeton, NJ: Mathematica Policy
Research.
Dee, T. S. (August 2004). Are there civic returns to education? Journal of Public
Economics, 88(9–10), 1697–720.
Dehejia, R. (March–April 2005). Practical propensity-score matching: A reply to
Smith and Todd. Journal of Econometrics, 125(1–2), 355–64.
Dehejia, R. H., Wahba, S. (February 2002). Propensity-score-matching methods
for nonexperimental causal studies. Review of Economics and Statistics, 84(1),
151–61.
Dehejia, R. H., Wahba, S. (December 1999). Causal effects in nonexperimental
studies: Reevaluating the evaluation of training programs. Journal of the Ameri-
can Statistical Association, 94(448), 1053–62.
Deming, D. (2009). Better schools, less crime? Cambridge, MA: Harvard University.
Unpublished Working Paper.
Deming, D., Hastings, J. S., Kane, T. J., Staiger, D. O. (2009). School choice and college
attendance: Evidence from randomized lotteries. Cambridge, MA: Harvard Univer-
sity. Unpublished Working Paper.
Dewey, J. (1929). The sources of a science of education. New York: H. Liveright.
Diaz, J. J., Handa, S. (2006). An assessment of propensity score matching as a
nonexperimental impact estimator: Evidence from Mexico’s PROGRESA pro-
gram. Journal of Human Resources, 41(2), 319–45.
Dobbelsteen, S., Levin, J., Oosterbeek, H. (February 2002). The causal effect of
class size on scholastic achievement: Distinguishing the pure class-size effect
Howell, W. G., Wolf, P. J., Campbell, D. E., Peterson, P. E. (2002). School vouchers
and academic performance: Results from three randomized field trials.
Journal of Policy Analysis and Management, 21(2), 191–217.
Hoxby, C. (2000). Peer effects in the classroom: Learning from gender and race varia-
tion. Research Working Paper, No. 7867. Cambridge, MA: National Bureau of
Economic Research.
Hoxby, C. M. (2003). Introduction. In Caroline M. Hoxby (Ed.), The economics of
school choice (pp. 1–22). Chicago: University of Chicago Press.
Hoxby, C. M. (2001). Ideal vouchers. Cambridge MA: Harvard University. Unpub-
lished manuscript.
Hoxby, C. M. (November 2000). The effects of class size on student achievement:
New evidence from population variation. Quarterly Journal of Economics, 115(4),
1239–85.
Hoxby, C., Murarka, S. (2009). Charter schools in New York City: Who enrolls and how
they affect their students’ achievement. Research Working Paper, No. 14852. Cam-
bridge, MA: National Bureau of Economic Research.
Hsieh, C.-T., Urquiola, M. (September 2006). The effects of generalized school
choice on achievement and stratification: Evidence from Chile’s voucher pro-
gram. Journal of Public Economics, 90(8–9), 1477–503.
Huang, G., Reiser, M., Parker, A., Muniec, J., Salvucci, S. (2003). Institute of educa-
tion sciences: Findings from interviews with education policymakers. Washington,
DC: U.S. Department of Education.
Imbens, G. W., Lemieux, T. (February 2008). Regression-discontinuity designs:
A guide to practice. Journal of Econometrics, 142(2), 615–35.
Imbens, G. W., Wooldridge, J. M. (2009). Recent developments of the economet-
rics of program evaluation. Journal of Economic Literature, 47(1), 5–86.
Jacob, B. A., Lefgren, L. (February 2004). Remedial education and student
achievement: A regression-discontinuity analysis. Review of Economics and Sta-
tistics, 86(1), 226–44.
Jacob, B. A., Levitt, S. D. (August 2003). Rotten apples: An investigation of the
prevalence and predictors of teacher cheating. Quarterly Journal of Economics,
118(3), 843–77.
Jamison, D. T., Lau, L.J. (1982). Farmer education and farm efficiency. A World Bank
research publication. Baltimore: Johns Hopkins University Press.
Kemple, J. J. (June 2008a). Career academies: Long-term impacts on labor-market
outcomes, educational attainment, and transitions to adulthood. New York:
MDRC.
Kemple, J. J., Willner, C. J. (2008b). Technical resources for career academies: Long-
term impacts on labor-market outcomes, educational attainment, and transitions to
adulthood. New York: MDRC.
Kennedy, P. (1992). A guide to econometrics, 3rd ed. Cambridge, MA: MIT Press.
Kling, J. R., Liebman, J. B., Katz, L. F. (January 2007). Experimental analysis of
neighborhood effects. Econometrica, 75(1), 83–119.
Knowles, E. (1999). The Oxford dictionary of quotations, 5th ed. New York: Oxford
University Press.
Koretz, D. M. (2008). Measuring up: What educational testing really tells us.
Cambridge, MA: Harvard University Press.
Kremer, M., Miguel, E., Thornton, R. (2009). Incentives to learn. Review of
Economics and Statistics, 91(3), 437–56.
Krueger, A., Whitmore, D. (2001). The effect of attending a small class in the
early grades on college-test taking and middle-school test results: Evidence
from project STAR. Economic Journal, 111, 1–28.
Krueger, A., Zhu, P. (2004). Another look at the New York City school-voucher
experiment. American Behavioral Scientist, 47, 658–98.
Krueger, A. B. (May 1999). Experimental estimates of education production
functions. Quarterly Journal of Economics, 114(2), 497–532.
LaLonde, R. J. (September 1986). Evaluating the econometric evaluations of
training programs with experimental data. American Economic Review, 76(4),
604–20.
Lane, J. F. (2000). Pierre Bourdieu: A critical introduction. Modern European thinkers.
Sterling, VA: Pluto Press.
Lavy, V. (2009). Performance pay and teachers’ effort, productivity and grading
ethics. American Economic Review, 99(5), 1979–2011.
Lazear, E. P. (August 2001). Educational production. Quarterly Journal of Economics,
116(3), 777–803.
Leuven, E., Oosterbeek, H., Rønning, M. (2008). Quasi-experimental estimates of the
effect of class size on achievement in Norway. Discussion paper. Bonn, Germany:
IZA.
Light, R. J., Singer, J. D., Willett, J. B. (1990). By design: Planning research on higher
education. Cambridge, MA: Harvard University Press.
List, J. A., Wagner, M. (2010). So you want to run an experiment, now what? Some
simple rules of thumb for optimal experimental design. Research Working Paper,
No. 15701. Cambridge, MA: National Bureau of Economic Research.
Liu, X. F., Spybrook, J., Congdon, R., Raudenbush, S. (2005). Optimal design for
multi-level and longitudinal research, Version 0.35. Ann Arbor, MI: Survey
Research Center, Institute for Social Research, University of Michigan.
Ludwig, J., Miller, D. L. (February 2007). Does Head Start improve children’s life
chances? Evidence from a regression-discontinuity design. Quarterly Journal of
Economics, 122(1), 159–208.
Ludwig, J., Miller, D. L. (2005). Does head start improve children’s life chances?
Evidence from a regression-discontinuity design. Research Working Paper, No.
11702. Cambridge, MA: National Bureau of Economic Research.
Mann, H. (1891). Report for 1846. In M. T. Peabody Mann, G. C. Mann, F. Pécant
(Eds.), Life and works of Horace Mann, 5 vols. Boston/New York: Lee and
Shepard/C. T. Dillingham.
McEwan, P. J., Urquiola, M., Vegas, E. (Spring 2008). School choice, stratification,
and information on school performance: Lessons from Chile. Economia: Jour-
nal of the Latin American and Caribbean Economic Association, 8(2), 1, 27, 38–42.
McLaughlin, M. W. (1975). Evaluation and reform: The elementary and Secondary
Education Act of 1965, Title I. A Rand educational policy study. Cambridge,
MA: Ballinger.
Miller, R. G. (1974). The jackknife: A review. Biometrika, 61(1), 1–15.
Morgan, S. L., Winship, C. (2007). Counterfactuals and causal inference: Methods
and principles for social research. New York: Cambridge University Press.
Mosteller, F. (1995). The Tennessee study of class size in the early school grades.
The Future of Children, 5(2), 113–27.
Mosteller, F., Moynihan, D. P. (Eds.). (1972). On equality of educational opportunity.
New York: Random House.
Tyler, J. H., Murnane, R. J., Willett, J. B. (May 2000). Estimating the labor-market
signaling value of the GED. Quarterly Journal of Economics, 115(2), 431–68.
Urquiola, M. (February 2006). Identifying class size effects in developing
countries: Evidence from rural Bolivia. Review of Economics and Statistics, 88(1),
171–7.
Urquiola, M., Verhoogen, E. (March 2009). Class-size caps, sorting, and the
regression-discontinuity design. American Economic Review, 99(1), 179–215.
Weiss, A. (Fall 1995). Human capital vs. signalling explanations of wages. Journal
of Economic Perspectives, 9(4), 133–54.
Whitehurst, G. J. (2008a). National board for education sciences: 5-year report, 2003
through 2008. NBES 2009-6011. Washington, DC: National Board for Educa-
tion Sciences.
Whitehurst, G. J. (2008b). Rigor and relevance redux: Director’s biennial report to
congress. IES 2009-6010. Washington, DC: Institute of Education Sciences, U.S.
Department of Education.
Wooldridge, J. M. (2002). Econometric analysis of cross-section and panel data.
Cambridge, MA: MIT Press.