The Effects of Class Size on Student Behavioral Outcomes:
Qualifying Paper
Submitted by
Yi Xe Thng
July 2016
Acknowledgements
I would like to thank Mathematica Policy Research Inc. and Research Connections
for granting access to the Head Start FACES dataset, and the U.S. Department of Health
& Human Services, Office of Planning, Research and Evaluation for funding the Head Start FACES
study. I would also like to thank the members of my committee, Andrew Ho, Stephanie
Jones, and Luke Miratrix for their thoughtful feedback on numerous drafts of this paper.
Special thanks also to David Deming, for helping me develop this project in S‐090 and
providing feedback to the early versions of the paper. I am indebted to my advisor,
Andrew Ho, for his unwavering support throughout this process.
Table of Contents
Abstract
Introduction
Background
Research Design
Empirical Strategy
Results
Discussion
Conclusion
Tables
References
Appendices
Abstract
Class size has been the subject of a long history of research. To date, there is high quality evidence from
causal studies suggesting that smaller class sizes yield short- and long-term benefits for
students. The understanding of how smaller class sizes achieve their benefits, i.e., the
mechanisms, is less clear. Using data from the Head Start Family and Child
Experiences Survey (FACES) 2009 cohort, I used propensity score techniques to
investigate the effects of class size on behavioral outcomes for children who enrolled in
Head Start for the first time in 2009, in full-day classrooms with predominantly 4 and 5-year
olds. I also studied the role of teacher-student interactions in the classroom as a
potential mediator of the above relationship. I found that smaller class sizes (17-18
children per class) had a very small, statistically non-significant effect (+0.10 S.D.) on
student behavioral outcomes relative to comparison class sizes (19-20 children per class). I
also found a statistically significant effect of smaller class sizes on the quality of teacher-student
interactions in the classroom (+0.33 S.D.). This effect was driven mainly by a
sub-component of the teacher-student interaction scale, namely, classroom
organization (+0.42 S.D.). The findings did not rule out the hypothesis that the quality of
teacher-student interactions in the classroom may be a potential mechanism by which
smaller class sizes achieve their effects on students.
Introduction
Class size has been a much debated policy issue, with a long history of research
(see Glass & Smith, 1979; Schanzenbach, 2014; Wilson, 2002). Prior to the 1970s,
research on the effects of class size reduction was controversial, because studies yielded
very different results (Mosteller, 1995). Recent studies using causal inference methods
have found that smaller class sizes can improve student test scores (Angrist & Lavy,
1999; Fredriksson, Öckert, & Oosterbeek, 2013; Krueger, 1999) and provide long-term
benefits. For example, the Tennessee Student/Teacher Achievement Ratio (STAR) study compared the effects of attending smaller
class sizes (13 to 17 students) to that of regular class sizes (22 to 25 students) for four
years from kindergarten through third grade (Finn & Achilles, 1990; Mosteller, 1995).
The experiment found that smaller class sizes conferred short‐term benefits for
students’ standardized test scores (Krueger, 1999), and long‐term benefits in terms of
high school completion (Finn, Gerber, & Boyd-Zaharias, 2005), higher earnings, college
attendance, and residence in better neighborhoods (Chetty et al., 2011), and fewer arrests for crime (Krueger & Whitmore,
2001). Using a regression discontinuity approach that utilized maximum class‐size rules,
researchers found that after splitting classes that reached maximum class size in
elementary schools, the smaller class sizes led to improvements in reading and math
scores in Israel (Angrist & Lavy, 1999) and Sweden (Fredriksson, Öckert, & Oosterbeek,
2013), as well as benefits in areas such as motivation, self-confidence, and absenteeism for students.
Despite the strength of evidence and increasing adoption of class size reduction
policies at the state level in the U.S. (Education Commission of the States, 2010),
debates on class size policy persist. Cost has often been cited as a barrier (Achilles, Finn,
& Bain, 1998; Barnett, Schulman, & Shore, 2004; Biddle & Berliner, 2002) and has been
raised in state-level policy debates (e.g., Washington 2014 Voters' Guide, 2014). Practical issues are also substantial when
implementing class size reduction at scale, such as the difficulty of employing and
training the necessary number of qualified teachers, and the challenges of creating extra
classrooms (Biddle & Berliner, 2002). A few state‐level studies of class size reduction
programs, including California and Florida, have found little to no impact of reducing
class size (Chingos, 2012; Jepsen & Rivkin, 2009). Others have acknowledged the
benefits of class size reduction, but proposed that policy alternatives such as improving
teacher quality are more effective given the costs (Ballotpedia, 2010; Odden, 1990).
These debates give rise to a question about mechanism: How does small class
size achieve its impact on outcomes? The controversies about the effects of class size
reduction could arise due to a poor understanding of the magnitude of the benefits over
the costs, as well as a lack of clarity about the mechanisms at play, i.e., how smaller
class sizes achieve their effects (Barnett, Schulman, & Shore, 2004; Goldstein & Blatchford, 1998).
With a clearer understanding of the mechanisms, schools may achieve similar effects through less expensive interventions, or could undertake class size reduction more effectively.
[Figure 1. Hypothesized conceptual model: small class size is linked to positive student behavioral outcomes (RQ1) and to high quality teacher-student interactions (RQ2a); high quality teacher-student interactions are in turn hypothesized to help explain the link between class size and behavioral outcomes (RQ2b), with behavioral outcomes connected to long-term outcomes.]
I gather empirical evidence for this hypothesis in parts. First, I study the relationship between smaller class size and student behavioral outcomes, including social-emotional
and problem behaviors (RQ1), which research has increasingly identified as a key
predictor of school success and long-term outcomes (Raver, 2002).
Specifically, I look at the (RQ2a) intermediary effects of class size on the quality of teacher-student interactions in the classroom, and (RQ2b) the extent to which those interactions explain the effect of class size on student behavioral outcomes.
Background
Smaller class sizes are widely believed to create conditions that increase effectiveness in teaching and learning for teachers and students
(Molnar et al., 1999; Pedder, 2006; Wilson, 2002). This raises the question: How
does small class size achieve its impact on outcomes? Theories on the
mechanisms carrying the influence of small class size have focused on teacher behavior and on student behavior (Barnett,
Schulman, & Shore, 2004; Biddle & Berliner, 2002; Finn, Pannozzo, & Achilles, 2003;
Wilson, 2002). The first set of theories focuses on the proximal processes (Bronfenbrenner &
Morris, 2006) in the classroom, which reflect the nature and quality of children's interactions with their teachers. As Wilson
(2002, p.52) put it, "It is what teachers do in and with smaller classes that makes the difference."
The question then is, what do teachers do differently in smaller size classrooms?
A review of studies that focus on the kindergarten to lower elementary school years
suggests that there are at least two aspects of teaching that have been postulated to be
affected by class size. One aspect, teachers' teaching methods, involves how teachers
organize the classroom and group students for instruction, as well as the instructional formats they use. Across these studies, however,
teachers' teaching methods do not appear to differ very much regardless of class size.
This teaching methods aspect was the subject of earlier theories, prior to and
including the Tennessee STAR experiment, which postulated that reducing class size
induces changes in teachers’ teaching methods, such that they can provide more
individualized and higher quality instruction (Finn, Pannozzo, & Achilles, 2003).
Contrary to this hypothesis, some studies have found that teachers did not
change their teaching methods or beliefs as a result of reduction in class size (Evertson
& Randolph, 1989; Johnston, 1990; Molnar et al., 1999). A study using observational
data from STAR classrooms found that teachers did not change their teaching methods
even though class size was reduced by about one‐third of the original size (Evertson &
Randolph, 1989). In the Evertson and Randolph (1989) study, the choice of teaching
method appeared more greatly influenced by subject, rather than by class size. For
example, for math, teachers in both small and large classrooms tended to use whole-class instruction (Evertson & Randolph, 1989, p. 96). For reading, teachers tended to use reading circles
for small‐group reading, discussion, and in‐class assignments regardless of class size.
In the Student Achievement Guarantee in Education (SAGE) class size reduction
program in Wisconsin, Molnar et al. (1999) did not find evidence that teachers teaching smaller classes changed their instructional practices
more so than teachers in regular sized classrooms. Instead, Molnar et al. (1999) found
that content coverage was valued over student choice and interest. These studies
suggest that class size reduction may not automatically induce teachers to change their teaching methods.
There is a rich body of literature that examines the relationship between policies
intended to change teachers’ teaching practice and actual changes in their teaching
practice (e.g., see Coburn, 2004; McLaughlin, 1987; Richardson, 1990). This literature
suggests that teachers tend to be resistant to change even in the presence of specific
policies directed at changing teaching practice. For class size reduction policies which
are not direct interventions aimed at changing teaching methods, it seems even less
likely that teachers would respond by voluntarily changing their teaching practice.
A second aspect that seems more responsive to changes in class size is teacher-student interaction,
which is distinct from but may complement teachers' choice of methods. Time appears to be one important dimension of such interaction. In a survey of
1,935 headteachers (i.e., principals), chairs of governors (i.e., heads of school board),
teachers, and parents in primary schools in Britain, Bennett (1996) found that all the
stakeholder groups rated time spent with individual students to be heavily influenced by
class size. Presumably, with fewer students in the class, teachers would have more time for each student.
Observational studies have also found evidence for a link between smaller class size and greater quantity of teacher-student interaction. In a
study of 5-7 year olds in England, Blatchford et al. (2003) found that class size was
negatively associated with the percentage of time spent teaching over class size ranges from
about 15 to 25 children per class. The frequency (percentage of the observation period) of teacher-child interactions was also higher in smaller class sizes
(below 20) compared to larger ones (above 30) while the frequency of not interacting
was higher in the larger class sizes. In a related study of 4 and 5‐year olds, Blatchford
(2003) found higher frequency of occasions when children were the focus of teachers’
attention in smaller class sizes (average of 19 children) than in larger class sizes (average
of 33 children). In a separate study, Hargreaves, Galton, and Pell (1998) found a higher
frequency (number of 25‐second time samples) of feedback, both neutral and positive,
as well as more sustained interactions between teachers and students in smaller class
sizes.
Evidence on the quality of teacher-student interactions in smaller class sizes has come mainly through teacher interviews and self-reports.
Teachers teaching smaller class sizes who were interviewed in the Tennessee STAR
study (Johnston, 1990) and in the Wisconsin SAGE study (Graue et al., 2007; Graue &
Oen, 2008; Molnar et al., 1999) indicated that they listened to their students more, and
developed better knowledge of their students and families. These teachers also
indicated that they had more time to monitor and evaluate student learning, to provide
feedback, and help in a timely manner. They could also spend more time with students
who had difficulty with the material. Although these studies suggest a favorable
relationship between smaller class size and the quality of teacher‐student interactions,
it should be noted that few studies have sought to replicate these findings through the use of independent classroom observations. One
exception is the Graue et al. (2007) study, in which independent observer ratings of the classrooms lent support to teachers'
interpretations.
Whilst the above studies show the link between smaller class size and teacher‐
student interactions, other non-class size related studies have found that the quality of
teacher-student interactions is associated with children's behavioral engagement (Downer, Rimm-Kaufman, & Pianta, 2007), as well as social skills (Moiduddin et al., 2012).
Smaller classes have also been argued to allow for more responsive relationships (Barnett, Schulman, & Shore, 2004), which in turn have been
shown to be associated with better cognitive and language outcomes in the first three years of life.
Another set of theories focuses on student behavior; these theories generally propose that
students in smaller class sizes are more likely to be engaged socially and academically,
and less likely to display problematic behavior, thus allowing teachers to focus more on
subject-matter instruction (Biddle & Berliner, 2002). Finn, Pannozzo, and Achilles (2003) proposed that
small class size increases the "visibility of the individual" and the "sense of belonging"
(p.346). With increased visibility, students cannot easily escape detection from teachers
when they misbehave, and they also face more pressure to participate. In smaller class
sizes, members also tend to feel greater affiliation with the group, which may encourage more positive engagement and behavior.
Some evidence exists for improved student behavior in smaller class sizes,
although the evidence generally hinges on teacher perceptions (Wilson, 2001). For
example, teachers interviewed in the Wisconsin SAGE study reported fewer disciplinary problems in classrooms with smaller class sizes (Molnar et al., 1999).
They attributed this to reasons such as a "familylike atmosphere" (p.175) and their ability to attend to individual students.
One study that included classroom observations and student interviews was
conducted in the context of secondary schools in Hong Kong (Harfitt & Tsui, 2015). The
observational study found that the students perceived a stronger sense of community in
the smaller class sizes, and were more behaviorally engaged, for example, more willing to participate and ask questions in class.
Stronger evidence of a link between smaller class size and improved student behavior in
the longer‐term comes from the Tennessee STAR experiment. Finn and Achilles (1999)
found that children assigned to smaller class sizes during kindergarten to third grade
scored 0.12 to 0.14 standard deviations higher on fourth grade teacher ratings of their classroom behavior. Other researchers
replicated these results in an independent analysis of the same data for Grade 4 and later grades.
Improved student behavior is central to school readiness (see Raver, 2002, for a
review), which has been shown to predict later performance on academic tests
(Alexander & Entwisle, 1993; McLelland, Morrison & Holmes, 2000). However, children's behavioral skills are also important
outcomes in their own right because they can affect how children interact with their peers and adults
(Moiduddin et al., 2012). Researchers have also proposed that improved student
behavior might be a link between smaller class size and its long-term benefits (Chetty et al., 2011).
Establishing that link to the long-term benefits of small class size is beyond the scope of this paper. However, this paper assumes that improved student behavior is a plausible pathway to those benefits, and focuses on the short-term
relationship between smaller class size and student behavioral outcomes, specifically social-emotional and problem behaviors.
Head Start
This study is carried out within the context of Head Start classrooms. Head Start
is a federally funded national program that seeks to promote school readiness for
economically disadvantaged children under 5 years old (Office of the Administration for
Children and Families, 2015). The Office of Head Start administers grants to public and
private, profit and non‐profit agencies in local communities to provide services to young
children and their families, through education, health, social and other services. Special
emphasis is placed on helping preschoolers develop school readiness, including areas such as language and literacy, cognition, physical health, and social and emotional development.
Head Start classrooms provide a salient context for this study especially since
smaller class size has been shown to have larger positive effects for children from low‐
income backgrounds than for children on average (Krueger, 1999). Moreover, studies
that have documented positive benefits tended to study the effects of implementing
smaller class size for younger children in kindergartens and/or elementary schools
(Angrist & Lavy, 1999; Chetty et al., 2011; Finn, Gerber, & Boyd‐Zaharias, 2005;
Fredriksson, Öckert, & Oosterbeek, 2013; Krueger, 1999; Krueger & Whitmore, 2001).
Studies have also shown that during the early childhood years, an interactive and responsive environment supports children's development (National Scientific Council on the
Developing Child, 2004), which, in theory, could be facilitated by smaller class sizes in an
early childhood program such as Head Start. For reasons explained in the Research
Design section, I focus on a particular segment of the Head Start population – children in full-day classrooms serving predominantly 4 and 5-year olds.
Summary
Changes in teacher behavior and changes in student behavior need not be mutually exclusive mechanisms of the effects of smaller class size (Biddle &
Berliner, 2002). However, few studies have examined the interdependent links between
smaller class size, teacher-student interactions, and student behavioral outcomes (Finn, Pannozzo, & Achilles, 2003). This paper examines the relationship between
smaller class size and non-academic student outcomes, specifically student behavioral
outcomes, within an early childhood education context in Head Start. This paper also examines the quality of teacher-student interactions as a potential mediator of that relationship. I address the following research questions:

RQ1: Does smaller class size predict student behavioral outcomes, including
social-emotional and problem behaviors, using propensity score matching to account for
selection into different class sizes, in Head Start classrooms with predominantly 4 and 5-year olds in full-day programs?

RQ2a: Does smaller class size predict the quality of teacher-student interactions in the classroom, a hypothesized mediator of the effects of smaller class size on student behavioral outcomes in Head Start?

RQ2b: How well does class size explain student behavioral outcomes, including social-emotional and problem behaviors, once the quality of teacher-student interactions is taken into account?
Research Design
Dataset
One of the key difficulties in studying the mechanisms of smaller class size is that
the few experimental studies that had been conducted on the impact of class size did
not set out to study the processes that might explain its effects (Goldstein & Blatchford,
1998). Hence, I turned to an observational dataset – the Head Start Family and Child
Experiences Survey (FACES) (Malone et al., 2013). This is one of the few datasets that
contains reliable and established measures of a potential mediator of smaller class size,
namely the quality of teacher-student interactions in the classroom (Goldstein & Blatchford, 1998). Moreover, in the Head Start FACES study, data were also collected on
actual class size, i.e., the number of students and teachers in a class, as opposed to the
average number of students per teacher in the school (Wilson, 2002). Furthermore, it is
of interest to examine the effects of smaller class size on children in Head Start in
particular since prior studies have found larger effects of smaller class size for children from low-income backgrounds (Krueger, 1999).
The Head Start FACES is a periodic, longitudinal study of Head Start programs that collects data on children aged 3
and 4 years old who were enrolled in the Head Start program for the first time in Fall
2009, their families, classrooms, and programs (Malone et al., 2013). Participants were
selected through a multi-stage sampling design with four stages: (1) Head Start programs providing
services; (2) centers within programs; (3) classrooms within centers; and (4) children
within classrooms (Malone et al., 2013, p. 28). A total of 3,718 children and families
from 486 classrooms in 60 Head Start programs were sampled. Of these, 3,349 children
and their families participated in the study. I used data from the 2009 FACES cohort,
that is, data on 3 and 4‐year old children who enrolled in Head Start for the first time
during fall 2009. The data that I used were collected in fall 2009 and spring 2010 (Table
1).
Sample
Class size in Head Start programs is guided by the Head Start Program
Performance Standards (Head Start Bureau, 2005) (Table 2) which specify different class
size ranges based on the predominant age of children in the classroom (3 year olds
versus 4 and 5-year olds) and program type (full- versus partial-day), henceforth termed
class size categories. Among these categories, there was sufficient variation in class sizes for the class size category with full-day programs serving
predominantly 4 and 5-year olds (1,072 children).¹ Table 3 compares the sample characteristics across the class size categories.
My analysis by class size categories showed that some classrooms had class size
beyond the range permissible by the Performance Standards. Since the characteristics
that drive programs and centers to establish class sizes outside the permissible range,
e.g., urbanicity, labor supply and available resources, may lead them to be substantively
different from those which do so within the permissible range, defining smaller class
size to be outside the permissible range and comparison classes to be within the
permissible range may lead to estimates that include effects beyond smaller class size
alone. The dataset also did not contain sufficient covariates, e.g., demographic and
socio-economic variables, at the program level to allow for matching. Hence I restricted
my analytic sample to the classrooms that had class sizes within the range permissible by
the Performance Standards. This limits the interpretation of my results to this specific
group of students. The restriction eliminated a further 159 cases, leaving 913 children
across 135 classrooms. With my final analytic sample, I conducted a complete case analysis.

¹ Propensity score matching was inappropriate for the other class size categories, as satisfactory covariate balance could not be obtained.

I dichotomized class size because this presents a simple case for estimating treatment effects. The
alternative – to treat the different class sizes as multiple treatment doses – has been
identified as an active research area (Stuart, 2010; see also, Imbens, 2000). I used the
median class size within my analytic sample (19 children per class) to distinguish
between smaller (17‐18 children per class) and comparison (19‐20 children per class)
class sizes. The ensuing average class size was 17.6 and 19.9 children per class in the smaller and comparison groups, respectively.
Studies that have documented positive effects of class size on student outcomes
have tended to involve sizeable reductions of between one-third and one-half of the original
class size (e.g., Angrist & Lavy, 1999; Chetty et al., 2011; Dee & West, 2011; Finn,
Gerber, & Boyd‐Zaharias, 2005; Fredriksson, Öckert, & Oosterbeek, 2013; Krueger, 1999;
Krueger & Whitmore, 2001). However, studies that have documented positive
relationships between class size and teacher and/or student behavior were more mixed
in their construction of the smaller class size variable. Some studies (e.g., Blatchford et
al., 2003) have used class size as a continuous variable, and estimated an approximately
linear relationship between class size and key variables such as percentage teaching
time within the range of 15‐25 children per class. Other studies (e.g., Blatchford, 2003;
Blatchford et al., 2003; Hargreaves, Galton, & Pell, 1998) have grouped class sizes into
small (e.g., below 20), large (e.g., 30), and sometimes various in‐between categories.
My contrast between smaller (average 17.6 children per class) and comparison (average 19.9 children per class) class sizes
represents a very small variation in class size. Assuming a six-hour class day with one
teacher who teaches continuously, the teacher could spend an extra 2.5 minutes per
day, representing a 13% increase, with each child in the smaller class. Though seemingly
inconsequential, it is the appropriate use of this short extra time, such as to provide an encouraging word or timely feedback, accumulated over the school year (about
7.5 hours per child in a 36-week academic year), which could have the potential to lead to meaningful differences. Moreover, knowing whether such a small difference in class size matters, for
example in situations where only a limited budget is available, can make a difference.
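As a rough check on these figures, assuming the stated six-hour teaching day and the two average class sizes above:

    360 min / 17.6 children ≈ 20.5 min per child, versus 360 min / 19.9 children ≈ 18.1 min per child,
    a difference of roughly 2.4 minutes per child per day (≈ 13% of 18.1 minutes);
    2.5 min/day × 5 days/week × 36 weeks = 450 min = 7.5 hours per child per year.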
I drew three measures of student behavioral outcomes from the FACES 2009 dataset. Two of the measures were based on teacher reports on
children’s cooperative behavior and problem behavior in the classroom. To reduce the
over-reliance on teacher reports (Finn et al., 2003), I used a third measure based on independent assessor ratings of children's behavior. With three outcome measures, the
probability of a false rejection (Type I error) increases (see Deming, 2009). To address multiple inference, I created a composite index
based on the first component using principal components analysis (see Appendix A for
details). The composite is constructed such that good outcomes, i.e., children’s
cooperative behavior and social/cognitive behavior, have a positive weighting, while the
bad outcome, i.e. problem behavior, has a negative weighting. Overall, more positive
values on the composite would indicate more of the good outcomes and/or less of the
bad outcome.
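As a concrete illustration of this kind of composite, the sketch below extracts a first principal component from three standardized measures and orients it so that higher values indicate better behavior. It is only a sketch under assumed column names (the author's actual procedure is documented in Appendix A), and `children` is a hypothetical DataFrame of child-level measures.

```python
# Illustrative sketch: first-principal-component composite of three behavioral measures,
# with problem behavior entering with a negative weighting. Column names are invented.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

measures = ["cooperative_behavior", "assessor_rating", "problem_behavior"]  # assumed names
z = StandardScaler().fit_transform(children[measures])  # `children` is an assumed DataFrame

pca = PCA(n_components=1)
composite = pca.fit_transform(z).ravel()

# Flip the sign if needed so that cooperative behavior loads positively,
# making larger composite values correspond to better behavior.
if pca.components_[0][measures.index("cooperative_behavior")] < 0:
    composite = -composite
children["behavior_composite"] = composite
```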
The first measure captures teacher reports of children's cooperative classroom behavior, such as following the teacher's directions and waiting for their turn
during classroom and play activities. This measure was adapted from the Personal
Maturity Scales developed by Alexander and Entwisle in 1988, and the Social Skills
Rating Systems developed by Gresham and Elliott in 1990 (as cited in Malone et al.,
2013). The Personal Maturity Scales was used by Zill and Daly (1993) in the 1976–1977 National Survey of Children.
The second measure captures teacher reports of children's problem behaviors, such as being unable to pay attention, disrupting class activities, and fighting.
This measure was modified from the Personal Maturity Scales developed by Alexander
and Entwisle in 1988, and the Behavior Problems Index developed by Peterson and Zill (as cited in Malone et al., 2013).
The third measure consists of independent assessor ratings of children's attention, impulse control, and sociability, using the Leiter International Performance Scale
Revised (Leiter‐R) Examiner Rating Scale. The Leiter‐R examiner ratings were previously
used in two large‐scale studies – Administration for Children and Family’s (2006) Early
Head Start Transition to Prekindergarten, and Olds et al.’s (2004) Home Visiting 2000 (as
cited by Malone et al., 2013). Table 4 provides further details for these three measures.
The quality of teacher-student interactions was measured with the Classroom Assessment Scoring System (CLASS) (Pianta, La Paro, & Hamre, 2008), which measures quality in the
classroom with respect to teacher‐student interactions. The CLASS has been used in
numerous large-scale studies as a measure of quality in classrooms (e.g., LoCaSale et al., 2007; Ponitz et al., 2009; Raver et al.,
2008). The CLASS was developed based on “scales used in large‐scale classroom
observation studies in the National Institute of Child Health and Human Development
(NICHD) Study of Early Care (NICHD Early Child Care Research Network [ECCRN], 2002;
Pianta, La Paro, Payne, Cox, & Bradley, 2002) and the National Center for Early
Development and Learning (NCEDL) MultiState Pre-K Study (Early et al., 2005)" (Pianta, La Paro, & Hamre, 2008).
The CLASS consists of the following domains: (a) Emotional Support (ES) which
measures teachers’ ability to support children socially and emotionally in the classroom,
(b) Instructional Support (IS), which measures how well teachers use interactions such as feedback and discussion to promote children's learning and
development, and (c) Classroom Organization (CO), which measures how well teachers manage students' behavior, time, and attention in the classroom.
The Emotional Support domain consists of the dimensions of positive climate, negative climate, teacher sensitivity, and regard for student
perspectives (Pianta, La Paro, & Hamre, 2008). Positive climate reflects a warm, respectful classroom environment, while negative climate reflects expressed negativity such as
sarcasm and disrespect, and use of punishments. Teacher sensitivity reflects the degree to which teachers are aware of and responsive to students' academic and emotional needs.
Regard for student perspectives measures the extent to which teachers’ interactions
value students’ points of view and ideas, and provide opportunities for development of
student autonomy.
The Instructional Support domain consists of the dimensions of concept development, quality of feedback, and language modeling (Pianta, La Paro, &
Hamre, 2008). Concept development measures the degree to which teachers engage in interactions and activities that promote higher-order
thinking skills among students. Quality of feedback measures the extent to which
teachers provide comments and exchanges to students’ work, ideas, and actions.
Language modeling measures the degree to which teachers use language to motivate and support students' language development.

The Classroom Organization domain consists of the dimensions of behavior management, productivity, and instructional learning formats (Pianta, La Paro,
& Hamre, 2008). Behavior management measures the teachers' ability to prevent and redirect misbehavior. Productivity measures how well teachers manage instructional routines and transitions to maximize
student learning time. Instructional learning formats measures how well teachers use a variety of learning modes and materials to facilitate student engagement and interest.

The above hypothesized mediator variables were measured in spring 2010. With
ten dimensions, the probability of false rejection of the null hypothesis increases (e.g.,
see Deming, 2009). To address multiple inference, I created a composite index based on
the first component using principal components (see Appendix B for details), in addition
to conducting analyses by domain. However, some studies have shown that each CLASS
dimension may reflect a unique aspect of the classroom experience (see LoCaSale et al.,
2007). Hence, I included analyses for each of the CLASS dimensions as a means of examining which dimensions were
driving the results at the domain and, subsequently, the composite index level.
I chose selection covariates at the child and program level that influence either selection into smaller class sizes or the
outcome, or both (Austin, 2011; Harder, Stuart, & Anthony, 2010), but which are “not in
the causal pathway between treatment and outcome” (Harder et al., 2010, p.237).
These variables are either time-invariant or measured at baseline, but not after exposure to the treatment.
Appendix C describes in detail the variables used in this study. Briefly, the child-level selection variables include demographic
variables as well as factors that could influence parents' level of involvement in their
children’s education (e.g., single parent households and mother’s employment status).
The program‐level selection variables include presence of program waitlists which might
influence programs to adopt larger class sizes, and an index of the program director's perceived challenges in running the program, which might also influence class size.
To take into account the multistage sampling design of the original FACES 2009
sample, I included survey weights as a design covariate into the propensity score model.
These weights would capture information about the probability of selection and
response to the survey (DuGoff, Schuler, & Stuart, 2014). I did not include primary
sampling unit and strata variables as it was not feasible to include a large number of such indicator variables in the propensity score model.
In addition, I used regression adjustment after matching to estimate the effects of smaller class size, by including child, teacher, classroom, and
program covariates in the regression model (See Appendix C for details). Regression
adjustment combined with matching has been shown to be more robust and efficient
especially if the selection model is properly specified (Rosenbaum, 2005; Rubin, 1979).
Empirical Strategy
Selection Bias
One of the key challenges of using observational data to study the effects of
smaller class size on student behavioral outcomes is that students in smaller class sizes
may systematically differ from students in larger class sizes. Furthermore, the factors
driving student selection into smaller class sizes are complex, and the direction of bias
introduced may even contradict one another. For example, children requiring special
attention may be preferentially placed in classrooms with smaller class sizes. This may possibly introduce a downward bias to
student behavioral outcomes. Children whose parents are motivated to send their child
to classrooms with smaller class sizes in hope of receiving a larger share of educational
resources may introduce an upward bias. Under such circumstances, the overall
direction and magnitude of bias is hard to predict. This motivates the use of quasi-experimental methods to address selection into different class sizes.
In this study, I used propensity score techniques (Rosenbaum & Rubin, 1983) to
address selection into smaller class sizes. Propensity score matching attempts to render
the treatment and comparison groups comparable on observed characteristics. A key aspect of this method is the modeling of the selection process into
treatment. Propensity score techniques have the potential to mitigate the bias caused
by confounders of the selection process and treatment outcome when the selection
covariates are based on theory and knowledge of the selection process (Murnane &
Willett, 2011).
The assumption is that after applying propensity scores to balance the observed
covariate distribution, children’s enrolment in small and comparison Head Start class
size would be as good as random. In more technical terms, the assumption is that there is no unmeasured confounding of the
association between the treatment and the outcome (Harder, Stuart, & Anthony, 2010); this assumption fails if important confounders remain unaccounted for.
I conducted the propensity score analysis in two stages (Rubin, 2007). In the first, or design, phase, I employed propensity score
techniques to organize the data with the goal of reducing the bias between treatment
and comparison groups. I first modeled the selection process by estimating the
propensity score for selection into treatment status, followed by applying the
propensity score to render the treatment and comparison groups more comparable
using matching and subclassification methods (Harder, Stuart, & Anthony, 2010). After
each application of the propensity scores, I evaluated the resulting covariate balance
using criteria specified a priori, i.e. in the design stage before analyses were conducted
(see Design Phase – Balance Diagnostics sub‐section, p.29, for specific criteria). The
steps in this phase were reiterated until the covariate balance between treatment and comparison groups was satisfactory. No
outcome data were used at this stage, to maintain the objectivity of the design phase
(Rubin, 2007).
In the second, or analysis, phase, I estimated the effects of smaller class size using the data organized by propensity score matching. To improve the precision of the estimate, I used covariate
adjustment after propensity score matching. Through the use of separate design and analysis phases, the approach attempts to approximate a randomized
study in which subjects are randomly assigned to treatment and control groups without reference to the outcomes.
Mediation Analysis
Following the approach outlined in Baron and Kenny (1986), I fitted a series of regression models:

(1) Regress the dependent variable (positive student behavioral outcome) on the independent variable (smaller class size);
(2) Regress the mediator variable (high quality teacher-student interactions) on the independent variable (smaller class size); and
(3) Regress the dependent variable (positive student behavioral outcome) on both the independent variable (smaller class size) and mediator variable (high quality teacher-student interactions).
If the quality of teacher-student interactions mediates the link between smaller class size
and positive student behavioral outcomes, then the following conditions must hold:

(1) The relationship between smaller class size and positive student behavioral outcomes is statistically significant;
(2) The relationship between smaller class size and high quality teacher-student interactions is statistically significant;
(3) The mediator variable (high quality teacher-student interactions) predicts the dependent variable; and
(4) The magnitude of the relationship between smaller class size and positive student behavioral outcomes is smaller when the mediator variable, i.e., high quality teacher-student interactions, is included in the model.

I tested for conditions (1) to (3) using statistical inference tests with an alpha of 5%, while I assessed condition (4) by comparing the magnitudes of the estimated coefficients.
RQ1:
Overview. I first addressed selection into smaller class sizes by balancing the
observed covariate distribution of treatment and comparison groups. The balancing was
achieved through the use of exact matching on class size categories (children in full‐day
classrooms with predominantly 4 and 5‐year olds) as well as two propensity score
techniques: full matching and subclassification. In each iteration of the process, I first
estimated the propensity score, applied the propensity score to the data via full
matching or subclassification, and then evaluated the resulting covariate balance
using an a priori specified criterion (see Design Phase – Balance Diagnostics sub-section,
p.29, for specific criteria). The process of refining the propensity score model was repeated until satisfactory balance was achieved.
Propensity score estimation. In the Head Start program guidelines, there are
different class size categories which stipulate the class size based on the predominant
age of children in the classroom, and the type of session (single or double session, which
loosely translates to the number of hours spent in the program per day) (Head Start
Bureau, 2005) (Table 2). Since both variables are explicit selection variables for class size
and are likely to be associated with student behavioral outcomes, the effect of smaller
class size may be substantively different for each of the class size categories. Green and
Stuart (2014) found that exact matching on subgroups of substantive interest, followed by propensity score matching within each subgroup,
resulted in the best balance among various options for propensity score estimation and
matching.
Within the class size category for full‐day classrooms with predominantly 4 and
5‐year olds, I estimated the propensity score for being in a smaller class size using the
P(SMALLij = 1) = 1 / [1 + e^(−(α + β′Sij))]     (1)

where P(SMALLij = 1) refers to the probability that child i is enrolled in a class j of smaller
size, Sij refers to the vector of selection covariates at the child, classroom, and program levels, and α and β are parameters estimated by maximum likelihood.
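For readers who find a worked example helpful, the sketch below estimates a propensity score model of the form in equation (1). It is not the paper's code (the analysis used the MatchIt package in R and Stata); the data set here is tiny and synthetic, and all variable names are invented stand-ins.

```python
# Illustrative sketch only: a synthetic data set and a logistic propensity score model.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "classroom_id": rng.integers(0, 120, n),          # invented classroom identifier
    "waitlist": rng.integers(0, 2, n),                 # invented program-level covariate
    "teacher_ba": rng.integers(0, 2, n),               # invented teacher covariate
    "household_size": rng.normal(4.5, 1.2, n),         # invented child/family covariate
})
# Purely synthetic treatment assignment and outcome, loosely related to the covariates
logit_true = -0.5 + 0.6 * df["teacher_ba"] - 0.4 * df["waitlist"]
df["small"] = rng.binomial(1, 1 / (1 + np.exp(-logit_true)))
df["outcome"] = 0.1 * df["small"] + rng.normal(0, 1, n)  # stand-in behavioral composite

# Equation (1): logistic model of selection into a smaller class
selection_covs = ["waitlist", "teacher_ba", "household_size"]
X = sm.add_constant(df[selection_covs])
ps_model = sm.Logit(df["small"], X).fit(disp=0)
df["pscore"] = ps_model.predict(X)                     # estimated propensity scores
```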
I then applied the estimated propensity scores to the data using two techniques – full matching and subclassification – in order to check the sensitivity of results to the choice of technique. Both
techniques have the advantage of: (i) using all data, versus nearest neighbor matching in
which data may be discarded if the controls are unmatched, and (ii) estimates not being unduly dependent on a particular set of matched pairs. Full matching creates matched sets consisting of at
least one individual each from the treatment and comparison groups (Harder, Stuart, &
Anthony, 2010; Stuart, 2010). The optimal matched sets are formed by minimizing the
propensity score difference between all treatment-comparison group pairs within each matched set.
Subclassification similarly creates subclasses containing individuals from both the treatment and comparison groups
based on their propensity scores (Harder, Stuart, & Anthony, 2010; Rosenbaum & Rubin,
1984), but differs in that fewer subclasses are created. Some early work in
subclassification suggests that creating five subclasses can remove “at least 90% of the
bias in the estimated treatment effect due to all of the covariates that went into the
propensity score” (Cochran & Rubin, 1973; Rosenbaum & Rubin, 1985 as cited in Stuart,
2010, p.9). However, depending on the sample size and the extent of propensity score
overlap between treatment and comparison groups, the optimal number of subclasses may vary.
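Continuing the synthetic example above, a minimal sketch of subclassification into five propensity score subclasses (per the rule of thumb just cited) and of an ATE-style estimate that weights within-subclass differences by subclass size:

```python
# Subclassification sketch: quintile subclasses of the estimated propensity score.
import pandas as pd

df["subclass"] = pd.qcut(df["pscore"], q=5, labels=False)

by_sub = df.groupby("subclass").apply(
    lambda g: pd.Series({
        "diff": g.loc[g["small"] == 1, "outcome"].mean()
                - g.loc[g["small"] == 0, "outcome"].mean(),
        "n": len(g),
    })
)
# Combine within-subclass mean differences, weighting each subclass by its size
ate_subclass = (by_sub["diff"] * by_sub["n"]).sum() / by_sub["n"].sum()
```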
Balance diagnostics. I assessed covariate balance using two balance diagnostics: standardized bias and region of common support. The standardized
bias was calculated as the difference in means between the treatment and comparison
groups for the covariate in question, divided by the standard deviation of the original
treatment group:
standardized bias = (X̄_treatment − X̄_comparison) / s_treatment     (2)

where X̄_treatment and X̄_comparison are the covariate means for the treatment and comparison groups and s_treatment is the covariate's standard deviation in the original treatment group. I considered a covariate adequately balanced if its
absolute standardized bias is less than 25.0% (Rubin, 2001). Although t-tests are also
commonly used as balance diagnostics, Stuart (2010) cautions against their use since such
hypothesis tests are an in-sample property and often reflect the power of the test to detect differences rather than the degree of covariate balance itself.
I also examined the region of common support for the estimated propensity
scores of the treatment and comparison groups. A greater region of overlap between
the two distributions would suggest that the treatment and comparison groups are
similar in the observed covariate distribution, and that application of propensity score
techniques might further improve the balance. Individuals with propensity scores outside the region of common support are likely to be substantively
different from those within the region, and it is common to remove them from the analysis (Stuart, 2010).
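Both diagnostics can be computed directly. A minimal sketch, continuing the synthetic example (the paper itself relied on MatchIt's diagnostics rather than this code):

```python
# Standardized bias (equation 2) and region of common support, as described above.
import numpy as np

def standardized_bias(data, covariate, treat_col="small", weight_col=None):
    """Difference in (optionally weighted) means divided by the SD of the original treatment group, in percent."""
    t = data[data[treat_col] == 1]
    c = data[data[treat_col] == 0]
    wt = None if weight_col is None else t[weight_col]
    wc = None if weight_col is None else c[weight_col]
    mean_t = np.average(t[covariate], weights=wt)
    mean_c = np.average(c[covariate], weights=wc)
    return 100 * (mean_t - mean_c) / t[covariate].std(ddof=1)

for cov in selection_covs:
    print(cov, round(standardized_bias(df, cov), 1))   # values above 25.0% signal imbalance

# Region of common support: overlap of the two groups' propensity score ranges
lo = max(df.loc[df["small"] == 1, "pscore"].min(), df.loc[df["small"] == 0, "pscore"].min())
hi = min(df.loc[df["small"] == 1, "pscore"].max(), df.loc[df["small"] == 0, "pscore"].max())
df["off_support"] = (df["pscore"] < lo) | (df["pscore"] > hi)
```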
Software. I used the MatchIt software developed by Ho, Imai, King, & Stuart
(2011) to generate the propensity scores, balance diagnostics, and matching weights.
After a propensity score model with satisfactory covariate balance was developed, I
exported the dataset with the corresponding matching weights into Stata for the analysis phase.

Matching Weights. The MatchIt software generates weights that estimate the
average effect of treatment on the treated (ATT) (Ho, Imai, King, & Stuart, 2011). To
obtain the weights needed to estimate the average treatment effect (ATE) (Stuart, 2011), I computed the following weights:
ATEweight_ti = (n_i / n_ti) × (n_t / n)     (3a)
ATEweight_ci = (n_i / n_ci) × (n_c / n)     (3b)

where ATEweight_ti and ATEweight_ci refer to the weights applied to each treatment unit t
or comparison unit c for calculating the ATE, and i refers to the subclass or matched set each unit is
in; n_i refers to the number of units in subclass i formed by matching, n_ti and n_ci refer to the number of treatment and
comparison units respectively in each subclass i; n refers to the number of units in the
sample (for a specific class size category), while n_t and n_c refer to the number of
treatment and comparison units respectively in the sample. The first term in equation
3a scales the weights of the treatment units within each subclass so that they collectively represent all units in that subclass. The second term in equation 3a scales the weights
generated by the first term to match the number of treatment units in the sample. The
same reasoning holds in equation 3b for comparison units. Overall, this weighting
scheme adjusts for the uneven numbers between treatment and comparison groups
within and across subclasses, so that the treatment and comparison groups each appropriately represent the full sample in the estimation of the ATE.
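Under the weights as written in equations (3a) and (3b) above, the computation reduces to a few group-level counts. A sketch continuing the synthetic example (the paper's actual weights came from MatchIt output, not from this code):

```python
# ATE weights per equations (3a) and (3b): scale each unit by its subclass size relative to
# the number of same-group units in that subclass, then by the group's share of the sample.
import numpy as np

n = len(df)
n_t = int((df["small"] == 1).sum())
n_c = n - n_t

grp = df.groupby("subclass")["small"]
n_i = grp.transform("size")       # total units in the unit's subclass
n_ti = grp.transform("sum")       # treatment units in the subclass
n_ci = n_i - n_ti                 # comparison units in the subclass

df["ate_weight"] = np.where(
    df["small"] == 1,
    (n_i / n_ti) * (n_t / n),     # equation (3a), treatment units
    (n_i / n_ci) * (n_c / n),     # equation (3b), comparison units
)
```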
For both the full matching and subclassification approaches, I used the following
model with the corresponding matching weights to estimate the average treatment
effect:
Yij = β0 + β1 SMALLj + γ′Zj + eij     (4)
where Yij represents the outcome for child i in classroom j, SMALLj represents the
treatment indicator for classroom j, vector Z represents the set of teacher and
classroom covariates, and eij represents a mean‐zero error term adjusted for clustering
at the classroom level. The matching weights are calculated from the MatchIt software.
β1 represents the ATE of smaller class size, where a positive value indicates better
behavioral outcomes for children in the treatment group (Mediation condition 1).
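Equation (4) can be estimated as a weighted regression with standard errors clustered at the classroom level. A sketch continuing the synthetic example (covariate names are placeholders, and the paper's estimation was carried out in Stata):

```python
# Weighted least squares estimate of equation (4) with cluster-robust standard errors.
import statsmodels.formula.api as smf

covariates_z = ["waitlist", "teacher_ba"]          # stand-ins for teacher/classroom covariates
formula = "outcome ~ small + " + " + ".join(covariates_z)

fit4 = smf.wls(formula, data=df, weights=df["ate_weight"]).fit(
    cov_type="cluster", cov_kwds={"groups": df["classroom_id"]}
)
print(fit4.params["small"])                        # beta_1: estimated ATE of smaller class size
```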
RQ2a: The quality of teacher-student interactions in the classroom could be a potential mediator of the link between smaller class size and
student behavioral outcomes. I therefore estimated the relationship between smaller class size (SMALLj) and the quality of teacher-student interactions (CLASSj), using
unmatched classroom-level data, since the matching for RQ1 and RQ2b was performed
on student-level data. I fitted OLS regression models with clustered standard errors:

CLASSj = α0 + α1 SMALLj + γ′Zj + uj     (5)
where vector Z represents the set of teacher and classroom covariates. If the effect of
smaller class size on children's outcomes acted through the quality of teacher-student interactions, I would expect to find a
positive relationship (α1) between smaller class size and the quality of teacher-student interactions (Mediation condition 2).
RQ2b: In the second stage, I added the mediator variable, CLASSj, to the model in equation (4):

Yij = β0 + β1 SMALLj + β2 CLASSj + γ′Zj + eij     (6)

Following the Baron and Kenny (1986) formulation for studying mediation, a positive coefficient on the mediator (β2) and a reduction in the magnitude of β1, together with satisfying
mediation conditions (1) and (2) in the previous research questions, would suggest that
the effect of smaller class size on student behavioral outcomes was mediated to some extent by the quality of teacher-student interactions.
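For concreteness, a sketch of the two mediation regressions on the synthetic example. The classroom-level quality score `class_score` is invented here as a stand-in for the CLASS composite, the clustering level for the classroom-level model is not specified in the text, and none of this is the author's code.

```python
# Equations (5) and (6): mediator on class size, then the outcome model with the mediator added.
import numpy as np
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
classrooms = df.groupby("classroom_id", as_index=False).agg(
    small=("small", "max"), waitlist=("waitlist", "max"), teacher_ba=("teacher_ba", "max")
)
classrooms["class_score"] = 0.3 * classrooms["small"] + rng.normal(0, 1, len(classrooms))
df = df.merge(classrooms[["classroom_id", "class_score"]], on="classroom_id")

# Equation (5): classroom-level regression (robust SEs used here as a stand-in)
fit5 = smf.ols("class_score ~ small + waitlist + teacher_ba", data=classrooms).fit(cov_type="HC1")

# Equation (6): child-level outcome model with the mediator, weighted, clustered by classroom
fit6 = smf.wls("outcome ~ small + class_score + waitlist + teacher_ba",
               data=df, weights=df["ate_weight"]).fit(
    cov_type="cluster", cov_kwds={"groups": df["classroom_id"]}
)
# Mediation would be suggested by a positive coefficient on `small` in fit5 (alpha_1) and a
# smaller coefficient on `small` in fit6 than in the model without `class_score`.
```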
Results
In Table 5, I show the covariate balance for covariates that could be associated
with the outcome, or treatment status, or both, for children in full‐day classrooms with
predominantly 4 and 5-year olds. Panels A and B show the means of each covariate for the treatment and comparison groups in the unmatched
dataset. Panels C and D show the means of each covariate for children in the matched datasets, using subclassification and full matching respectively. Covariates
which have absolute standardized bias between treatment and comparison groups
greater than 25.0% are highlighted in the tables. The respective absolute standardized bias values are not shown.
As shown in Table 5 Panels A and B, children in the smaller and comparison class
sizes were broadly similar on child- and family-level characteristics. For example, there was little difference in the percentage of children with IEPs in both types of
classrooms. It did not appear that there were selection effects into smaller class size
based on children’s home backgrounds. At the program and classroom level, however,
there were a number of differences. As expected, classrooms with larger class sizes
tended to be in programs with waitlists for children. The classrooms with smaller class
sizes also tended to have program directors who had worked in the Head Start program
for a longer time, and teachers with Bachelor’s degree or above. The program directors
of these classrooms also tended to perceive fewer challenges in running the program.
These differences suggest that classrooms with smaller class sizes operated in different
program environments from those with bigger class sizes. This motivates the need for
Table 5 Panels C and D show that most covariates became more balanced, using a yardstick of 25.0% in absolute standardized bias, although a few covariates became somewhat less balanced after
matching (but with absolute standardized bias still below 25.0%). One covariate,
"expanded Head Start program in the past year", had its absolute standardized bias exceed the 25.0% threshold even after matching.
The distribution of propensity scores for the treatment and comparison groups in Figure 4 shows substantial overlap for both matching methods
used, although a number of cases had no direct overlap in the extreme ends of the
propensity score distribution (14 individuals with propensity score < .054 all of whom
were in the comparison group; 38 individuals with propensity score > 0.82 all of whom
were in the treatment group). Almost all the individuals with propensity score
below .054 were in programs that had not expanded Head Start in the past year, while
almost all the individuals with propensity score above .82 were in programs that had
expanded Head Start in the past year. I later removed these individuals beyond the
region of common support in my analyses (Stuart, 2010), and checked the sensitivity of the results to their removal.
In Table 6, I show the results for whether smaller class size predicts student
behavioral outcomes after propensity score matching using subclassification and full
matching respectively (RQ1). The results show a very small effect size of smaller classes
(+0.10 S.D.), regardless of matching method and sample used (whether individuals beyond the region of
common support were trimmed from the sample). These estimates were very noisy and
not statistically significant. In analyses not shown, I also did not find any statistically
significant effects of smaller class size on each of the individual student outcome measures.

With this null finding for mediation condition 1, the question about mediation
was no longer applicable. I show the rest of the results here for completeness. In Table
7, I show the results for whether smaller class size was associated with higher quality
teacher‐student interactions in the classroom. The results in Models 3 (no controls) and
4 (with controls) show that there was a small, positive, marginally statistically significant
association between smaller class size and the quality of teacher-student interactions (+0.33 S.D.). Analyses by CLASS domain
(Models 5-10) show that this association was driven primarily by the positive association
of smaller class size with the CLASS Classroom Organization domain (+0.42 S.D.).
In Table 8, I show the results for the effect of smaller class size on student
behavioral outcomes with the quality of teacher-student interactions (the CLASS
composite variable, and each of the 3 CLASS domains) as a mediator variable, using full
matching in the original analytic sample. If the mediation hypothesis were true, I would expect to find a positive relationship between the quality of teacher-student
interactions and student behavioral outcomes (mediation condition 3), and the
magnitude of treatment effect to be smaller than that found for RQ1. I did not find a
statistically significant relationship between the mediator and outcome. The magnitude
of the effect of smaller class size on student behavioral outcomes generally remained
unchanged after adding the mediator (Models 11-18) compared to before (Models 1 &
2). Hence both conditions did not hold in this case, and the results did not differ across the CLASS composite and individual domains.
Discussion
The research on smaller class size has yielded high quality evidence about its
short‐ and long‐term benefits. However, convincing decision‐makers that the benefits
are worth the costs continues to be a challenge. One key issue is that there is little
understanding of how smaller class sizes achieve their outcomes, and few studies have
addressed potential mechanisms (Barnett, Schulman, & Shore, 2004; Goldstein &
Blatchford, 1998).
In this paper, I investigated whether the quality of teacher-student interactions could be a mechanism by which smaller class sizes achieve their effects on
student behavior. I also investigated the effects of smaller class size on student
behavioral outcomes, an important, but often neglected outcome in small class size
research, but which has also been proposed to be a link to longer-term outcomes (Chetty et al., 2011).
Limitations
Before discussing the findings, I note some limitations to my study. First, this
study is based on observational data and cannot make ironclad causal claims for the
main effects and mechanisms. I try to mitigate selection bias by using propensity score matching, but bias from unobserved confounders may remain.
Second, my study is carried out in the context of the Head Start program and may
not be generalizable to other contexts. In addition, the Head Start FACES 2009 study
only included children aged 3 or 4‐years old who attended Head Start for the first time
in 2009 and not all children in Head Start during that year. Moreover, my analytic
sample is sliced from the dataset according to maximum class size rules and is not a representative sample of Head Start children.
Third, this study examines only one potential mediator (the quality of teacher-student interactions) and is not generalizable to other possible mechanisms of small class size effects.
Furthermore, conclusions may only be drawn for the short‐term effects (one
academic year) of smaller class size on student behavioral outcomes and quality of
teacher-student interactions. Also, conclusions may only be drawn for the range and contrast of class sizes studied (17-18 versus 19-20 children per class).

Finally, the restriction of class size range may reduce statistical power to detect
hypothesized effects.
Using propensity score matching methods, I did not find any statistically
significant gains that class sizes of 17‐18 students per class had over class sizes of 19‐20
students per class for positive student behavioral outcomes in my sample of children
attending Head Start classrooms with predominantly 4 and 5‐year olds in full‐day
programs. The effect sizes were very small (+0.10 standard deviations).
There are few studies of the relationship between smaller class size during the early childhood years and student behavioral
outcomes to compare these results to. In the absence of a fairer comparison, I note two related studies.
In the first study, Dee and West (2011) studied the effects of smaller class sizes
that arose when students experienced different class sizes in different subjects in eighth
grade, on students’ psychological and behavioral engagement. The authors found a very
small effect size, ranging from +0.05 to +0.09 standard deviations, for the effect of smaller class sizes on measures of engagement such as interest in the
subject, seeing the subject as useful for their future, and not being afraid to ask
questions. The authors, however, did not find any evidence for an effect of smaller class sizes on students' behavioral engagement. In a
second study that analyzed Tennessee STAR data, researchers found short- to middle-term
effects of being assigned to smaller class sizes during kindergarten to third grade on teacher ratings of students' classroom behavior.
With the above results in perspective, the effect sizes found in my study are
comparable in magnitude to that found in the above two studies, except the estimates
in my study are more imprecisely estimated. Dee and West (2011) noted that the effects
of smaller class sizes on non‐cognitive outcomes are generally smaller than that on
academic outcomes. The larger effect sizes observed in the STAR study (Chetty et al.,
2011; Finn and Achilles, 1999) compared to the Dee and West study, as well as my
study, could arise because students in the former study received a longer treatment
duration.
It is possible that the small difference in class size, 17‐18 versus 19‐20 students
per class, was too small to make substantial differences to the outcome studied.
Previous studies of class size involved a reduction in students by about one-third (Tennessee STAR experiment) or reductions induced by maximum class-size rules
(Angrist & Lavy, 1999; Fredriksson, Öckert, & Oosterbeek, 2013). Further research on behavioral outcomes could examine larger contrasts in class size.
In contrast, I found that the quality of teacher-student interactions in the smaller class sizes of 17-18 students per class was statistically
significantly higher than that in class sizes of 19‐20 students per class (+0.33 standard
deviations), and that this effect was driven primarily by Classroom Organization. This
study did not begin with specific hypotheses about the Classroom Organization domain. Even so, this finding seems to converge with
previous findings that teachers spend less time managing classrooms and more time
teaching when the class size is smaller. In Table 9, I present detailed results of the
associations between class size and individual CLASS dimensions. I found that smaller
class sizes of 17‐18 versus 19‐20 students per class was positively associated with the
Productivity dimension, which looked at how well the teacher manages instructional
routines and transitions to maximize learning time for students. It appears that having
fewer students in the classroom promotes greater productivity during lesson time. This
also complements the finding that the percentage of time teachers spend teaching is higher in smaller classes (Blatchford et al., 2003).
Furthermore, smaller class size was positively associated with the Instructional
Learning Formats dimension, which looks at how well the teacher uses a variety of
learning modes and materials to facilitate and engage student interest. This seems to
support the theory that there is greater individualization in smaller class sizes (Graue &
Oen, 2008; Johnston, 1998; Molnar et al., 1999), i.e., with fewer students to manage,
teachers could tailor their instruction to students' needs and learning styles and to engage their interest. However, I
did not find a statistically significant positive association between smaller class size and
the Behavior Management dimension, which looks at how well teachers set behavioral expectations and redirect misbehavior. It could be that teachers' approach to behavior
management is shaped by their training and prior beliefs (Kagan, 1992; Martin & Yin,
2006), hence a difference in class size alone did not change the way they manage the
classroom.
Taken together, these results suggest that there may be aspects of teaching practice that are amenable to changes in class size, and aspects that require further intervention beyond class size reduction. This interpretation is also suggested
by the other domains for which no statistically significant association was found with
smaller class size. Within the domain of Instructional Support, I found only a very small
positive but statistically non‐significant association between smaller class size and the
Quality of Feedback dimension which included items such as whether teachers provide
scaffolds to aid learning, and prompt students to explain their thinking. This result,
taken together with past studies that found higher frequency of feedback in smaller
class size (Hargreaves, Galton, & Pell, 1998), seems to suggest that greater quantity of feedback may not necessarily translate into higher quality of feedback.

What was surprising though was that within the same Instructional Support
domain, there was a statistically significant positive association between smaller class
size and the Language Modeling dimension, which measured the quantity and quality of teachers' language stimulation and facilitation (Pianta, La
Paro, & Hamre, 2008, p. 75). This dimension included items such as whether
conversations and open-ended questions occur frequently in the classroom, but also whether teachers repeat or extend students'
responses, as well as "map" their own and "student actions with language". It is unclear
whether the positive association was driven primarily through teachers of smaller class
sizes allowing more conversations and asking more open-ended questions, which would
largely reflect quantity, or through higher performance over all items in the dimension, which would raise the question of why
teachers in smaller class sizes could have higher quality of language modeling but not
higher quality of feedback. Still within the Instructional Support domain, I did not find any
positive association between smaller class size and the Concept Development
dimension, which measured how well teachers promoted higher‐order thinking skills
and understanding.
The results for the Emotional Support domain, taken in the light of previous research findings, are puzzling. Past
studies using Tennessee STAR and Wisconsin SAGE data found that teachers reported
having greater personal and learning‐related knowledge of their students such that they
could better provide help and support to those who need it (Johnston, 1990; Molnar et
al., 1999). Hence, we might expect a positive association between smaller class size with
the Emotional Support domain, which includes dimensions that measure teacher
sensitivity towards children’s emotional and academic needs (Teacher Sensitivity), and
warmth and respectfulness among teachers and students (Positive Climate). Instead,
none of the associations were statistically significant, though there were small effect
sizes (between +0.13 to +0.24 standard deviations). In terms of the Positive Climate and
Negative Climate dimensions, the associations were weaker than one might expect. It is unclear whether these results were due to a lack of power to detect effects of this magnitude. Overall, the pattern of results
seems to paint a narrative that smaller class sizes might be associated with greater
quantity of teacher‐student interactions but that the quality of the interactions might
vary. However, this study, as with the previous study, could not untangle whether smaller class sizes led to the higher quality of teacher-student
interactions, or whether there were other factors associated with both selection into
smaller class sizes and quality of teacher‐student interactions that confounded the
results.
Because I did not find a statistically significant relationship between smaller class size and positive student behavioral outcomes, my ability to establish mediation was limited.
The statistically significant results of the relationship between smaller class size and high
quality teacher‐student interactions did not close the door to the possibility of the
mediator hypothesis. Future research should seek to achieve greater statistical power to test this mediation hypothesis.
Conclusion
Class size research has a long history and there is strong evidence from credible
research methods that smaller class sizes can improve student test scores and provide
long‐term benefits. Smaller class sizes are also popularly perceived by educators and
parents to be beneficial for student learning. Still, debates persist over whether the
benefits are worth the costs and whether there are more cost‐effective policy
alternatives. Debates over the implementation of class size reduction policies in the United States have been
persistent, and they point back to a central question about mechanisms: How do smaller class sizes achieve their results? By
understanding the mechanisms of smaller class sizes, the policy debates do not have to
boil down to a yes or no decision to implement class size reduction. Instead, the debates
can move towards more conversations on how to utilize and optimize a policy that has
been shown to work experimentally, and in the case of Israel and Sweden, on a large‐
scale basis.
A useful metaphor for this process is reverse engineering, the process of taking
apart an object to see how it works, with the hope of reproducing it, enhancing it, or
even using its critical components to create something new and better. With a better
understanding of the mechanisms of class size, questions could be raised, for example,
about which critical components should not be compromised – such as teacher quality –
to ensure the success of the policy, especially when implementation is at scale. Are there
policy complements to class size reduction – such as teacher professional development –
that, if implemented, could help teachers make the most use of smaller class sizes and
stretch the benefits further? Can policy alternatives that specifically target those
mechanisms achieve similar benefits at lower cost?
Past research on class size has seldom focused on the mechanisms. My study
addresses this issue by investigating relationships less often studied but which are
important for understanding the mechanisms by which smaller class sizes achieve their
effects. Firstly, I examine the interdependent link between smaller class sizes, a
hypothesized mediator (the quality of teacher-student interactions), and student
behavioral outcomes. Secondly, I examine a less often studied outcome in class size
research – student behavioral outcomes – which in turn has been postulated as a
mechanism for the academic benefits of smaller classes (e.g., through greater
engagement and better use of time etc.). As past experimental studies have not
collected data on the quality of teacher-student interactions, I used propensity score
techniques with observational data to estimate the relationship between smaller class
sizes of 17-18 children versus 19-20 children per class and student behavioral outcomes,
and to investigate the hypothesis that high quality of teacher-student interactions
mediates the relationship between smaller class size and student behavioral outcomes.
However, the estimated effect of smaller class size on student behavioral outcomes was
small and not statistically significant, which limited the mediation analysis. Nonetheless,
regression adjustment analyses found a positive association between smaller class size
and the CLASS domain of Classroom Organization. This coheres with previous research
findings based on observational data that smaller class size is associated with longer
time on task, and it does not exclude Classroom Organization from the list of potential
mechanisms of smaller class sizes. One surprising finding was that smaller class size was
not statistically significantly associated with the CLASS domain of greater Emotional
Support by teachers for students.
Future research could pair a class size reduction treatment with professional
development on small class size interaction strategies. This would allow randomization
into smaller class sizes, and also experimental manipulation of the quality of
teacher-student interactions. Researchers have similarly called for experiments that
randomly assign smaller class sizes to teachers and students, combined with systematic
classroom observations (Goldstein & Blatchford, 1998). Future studies should also
ensure greater variation in the range of class sizes studied to improve statistical power.
With more research on understanding the mechanisms of smaller class size, we can
better design class size policies that translate into benefits for students.
Tables
Table 5. Covariate balance on baseline characteristics for treatment and comparison groups before (Panels A and B) and after
propensity score matching by subclassification (Panel C) or full matching (Panel D) for full‐day classrooms with predominantly 4 & 5‐
year olds (n = 610). Covariates where the absolute standardized bias between treatment and comparison groups (not shown) is
greater than 0.25 are highlighted.
Covariate: (A) Treatment means; Comparison means: (B) Unmatched, (C) Subclassification, (D) Full matching
None of parents born in USA 0.26 0.25 0.26 0.25
Parental depression score 4.29 4.73 3.92 4.19
Single parent household 0.52 0.55 0.53 0.54
Below 100% of income‐poverty threshold 0.56 0.60 0.60 0.60
On multiple assistance programs 0.91 0.85 0.85 0.88
Household size 4.71 4.43 4.62 4.71
Moved multiple times in past year 0.13 0.13 0.14 0.13
English spoken at home 0.28 0.25 0.27 0.28
Neighborhood crime 0.30 0.27 0.35 0.29
Program waitlist 0.91 0.99 0.88 0.94
Program challenges perception scale ‐0.24 ‐0.10 ‐0.14 ‐0.14
Expanded HS program in past year 0.19 0.18 0.30 0.30
Director years in HS 17.97 14.47 16.69 16.82
Class hours per week 32.92 36.28 33.42 33.24
Teacher education: Bachelor's or above 0.62 0.43 0.71 0.67
Classroom Weight 104.93 138.45 105.51 106.40
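A balance table like Table 5 can be produced by estimating propensity scores, forming subclasses, and comparing covariate means by treatment status within the resulting groups. The Python sketch below is illustrative only (for this step the study cites the MatchIt package in R; Ho, Imai, King, & Stuart, 2011); the covariate names and data are hypothetical stand-ins.

```python
# Illustrative propensity score subclassification on simulated data; the
# covariate names are hypothetical stand-ins for those in Table 5.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 610
df = pd.DataFrame({
    "treat": rng.integers(0, 2, n),            # 1 = smaller class (17-18 children)
    "single_parent": rng.integers(0, 2, n),
    "household_size": rng.normal(4.6, 1.2, n),
})

# Estimate propensity scores from the observed covariates.
X = df[["single_parent", "household_size"]]
ps_model = LogisticRegression(max_iter=1000).fit(X, df["treat"])
df["ps"] = ps_model.predict_proba(X)[:, 1]

# Form five subclasses at the quintiles of the estimated propensity score.
df["subclass"] = pd.qcut(df["ps"], q=5, labels=False)

# Compare covariate means by treatment status, overall and within subclasses.
print(df.groupby("treat")[["single_parent", "household_size"]].mean())
print(df.groupby(["subclass", "treat"])[["single_parent", "household_size"]].mean())
```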
Table 6. Effect size of smaller class size on student behavioral outcomes (RQ1) in full‐day
classrooms with predominantly 4 & 5-year olds. Standard errors in parentheses.
                   Subclassification                    Full Matching
                   Original        Trimmed¹             Original        Trimmed¹
                   M1      M2      M1      M2           M1      M2      M1      M2
Small class size 0.08 0.09 0.08 0.10 0.07 0.11 0.09 0.11
(17‐18 vs 19‐20) (0.18) (0.09) (0.18) (0.09) (0.16) (0.09) (0.16) (0.09)
Baseline behavior 0.49*** 0.49*** 0.47*** 0.47***
index (0.03) (0.03) (0.02) (0.03)
No. of teachers in 0.11 0.13 0.09 0.12
classroom (0.09) (0.09) (0.10) (0.10)
Class hours per week 0.00 0.00 0.00 0.00
(0.01) (0.00) (0.00) (0.00)
Teacher depression ‐0.01† ‐0.01† ‐0.01* ‐0.01*
score (0.01) (0.01) (0.01) (0.01)
Teacher education: 0.03 0.05 0.00 0.02
Bachelor's or above (0.08) (0.07) (0.07) (0.07)
Age at assessment 0.01* 0.01* 0.01† 0.02*
(0.01) (0.01) (0.01) (0.01)
Female 0.14† 0.15* 0.15† 0.20**
(0.08) (0.08) (0.08) (0.08)
African American ‐0.09 ‐0.14 ‐0.15 ‐0.19†
(0.10) (0.11) (0.10) (0.10)
Hispanic ‐0.20† ‐0.27* ‐0.21† ‐0.29**
(0.10) (0.10) (0.11) (0.11)
Has IEP ‐0.24 ‐0.20 ‐0.40† ‐0.37
(0.15) (0.16) (0.23) (0.24)
Single parent ‐0.33*** ‐0.30*** ‐0.30*** ‐0.30***
household (0.08) (0.07) (0.07) (0.07)
Below 100% of income‐ 0.19* 0.19* 0.17* 0.14†
poverty threshold (0.07) (0.07) (0.08) (0.08)
Constant ‐0.02 ‐0.79† ‐0.03 ‐0.94* ‐0.01 ‐0.72 ‐0.03 ‐1.06*
(0.09) (0.42) (0.07) (0.41) (0.08) (0.48) (0.09) (0.47)
Adjusted R‐square 0.00 0.52 0.00 0.51 0.00 0.49 0.00 0.50
N 610 610 558 558 610 610 558 558
†p < .10, *p < .05, **p < .01, ***p < .005
¹ Cases where propensity scores (PS) were beyond the region of common support (PS < 0.054 or PS > 0.82) were trimmed from the analytic sample.
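As a small illustration of the trimming rule in this note, the following sketch drops cases whose estimated propensity scores fall outside the region of common support; the data frame and column names are hypothetical.

```python
# Illustrative trimming to the region of common support, using the cutoffs
# reported in the note above (0.054 and 0.82). Column names are hypothetical.
import pandas as pd

def trim_to_common_support(df: pd.DataFrame, ps_col: str = "ps",
                           lower: float = 0.054, upper: float = 0.82) -> pd.DataFrame:
    """Keep only cases whose propensity scores fall within [lower, upper]."""
    return df.loc[df[ps_col].between(lower, upper)].copy()

# Example (hypothetical): trimmed_sample = trim_to_common_support(analytic_sample)
```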
Table 7. Effect size of smaller class sizes on quality of teacher‐student interactions (RQ2a) in full‐day classrooms with predominantly
4 & 5-year olds (n = 115). Standard errors in parentheses.
Dependent variable:  CLASS (M3, M4);  Instructional Support (M5, M6);  Emotional Support (M7, M8);  Classroom Organization (M9, M10)
Independent variable       M3      M4      M5      M6      M7      M8      M9      M10
Small class size 0.38* 0.33† 0.25 0.20 0.23 0.18 0.45* 0.42*
(17‐18 Vs 19‐20) (0.18) (0.19) (0.18) (0.19) (0.19) (0.19) (0.18) (0.19)
Adjusted R‐square .03 0.07 .00 0.04 .00 0.06 .03 0.02
†p < .10, *p < .05, **p < .01, ***p < .001
Table 8. Effect size of smaller class size on student behavioral outcomes after including
mediator variable (RQ2b) for full‐day classrooms with predominantly 4 & 5‐year olds,
using full matching (n = 610). Standard errors in parentheses.
M11 M12 M13 M14 M15 M16 M17 M18
Small class size 0.07 0.11 0.07 0.11 0.07 0.11 0.07 0.11
(17‐18 vs 19‐20) (0.16) (0.09) (0.21) (0.09) (0.21) (0.12) (0.21) (0.12)
CLASS 0.01
(0.05)
Instructional Support 0.03
(0.07)
Emotional Support 0.05
(0.09)
Classroom Organization 0.00
(0.07)
Baseline behavior 0.46*** 0.46*** 0.46*** 0.46***
index (0.03) (0.03) (0.03) (0.03)
No. of teachers in 0.08 0.08 0.07 0.08
classroom (0.10) (0.11) (0.11) (0.10)
Class hours per week 0.00 0.00 0.00 0.00
(0.00) (0.00) (0.00) (0.00)
Teacher depression ‐0.01* ‐0.01* ‐0.02* ‐0.01*
score (0.01) (0.01) (0.01) (0.01)
Teacher education: 0.00 ‐0.01 ‐0.01 0.00
Bachelor's or above (0.08) (0.08) (0.08) (0.07)
Age at assessment ‐0.04 ‐0.03 ‐0.04 ‐0.04
(0.05) (0.05) (0.05) (0.05)
Program challenges 0.01† 0.01† 0.01† 0.01†
perception scale (0.01) (0.01) (0.01) (0.01)
Female 0.15† 0.15† 0.15† 0.15†
(0.08) (0.08) (0.08) (0.08)
African American ‐0.15 ‐0.15 ‐0.16 ‐0.16
(0.10) (0.10) (0.10) (0.10)
Hispanic ‐0.21† ‐0.20† ‐0.21† ‐0.20†
(0.11) (0.11) (0.11) (0.11)
Has IEP ‐0.38 ‐0.39 ‐0.39 ‐0.38
(0.24) (0.24) (0.24) (0.24)
Single parent ‐0.30*** ‐0.30*** ‐0.30*** ‐0.30***
household (0.07) (0.07) (0.07) (0.07)
Below 100% of income‐ 0.18* 0.18* 0.18* 0.18*
poverty threshold (0.09) (0.09) (0.09) (0.08)
Constant ‐0.01 ‐0.73 ‐0.01 ‐0.79 ‐0.01 ‐0.95 ‐0.01 ‐0.76
(0.08) (0.48) (0.08) (0.54) (0.08) (0.65) (0.08) (0.60)
Adjusted R‐square 0.00 0.48 0.00 0.49 0.00 0.49 0.00 0.48
†p < .10, *p < .05, **p < .01, ***p < .005
Table 9. Effect size of smaller class sizes (Small = 17 to 18, Comparison = 19 to 20
children per class) on quality of teacher-student interactions by dimensions¹ in full-day
classrooms with predominantly 4 & 5-year olds (n = 115). Standard errors in parentheses.
Dependent variable: Instructional Support dimensions
Independent variable       Concept Development     Quality of Feedback     Language Modeling
Small class size ‐0.03 0.13 0.39*
(0.20) (0.19) (0.18)
¹ Covariates included in the models: No. of teachers, Class hours per week, Teacher depression score, Teacher education: Bachelor's degree or above, and Program challenges perception scale.
Figure 2. Boxplots of absolute standardized bias before (unmatched) and after matching
(subclassification or full matching). Dotted line refers to absolute standardized bias of
0.25.
Figure 3. Absolute standardized bias in means of treatment and comparison groups before and after matching by subclassification or
full matching.
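For reference, the statistic plotted in Figures 2 and 3 can be computed as the absolute difference in covariate means between the treatment and comparison groups divided by a standard deviation. The sketch below uses a pooled standard deviation, which is one common convention and may differ slightly from the exact formula used for these figures.

```python
# Illustrative computation of absolute standardized bias for one covariate:
# |mean(treated) - mean(comparison)| / pooled standard deviation.
# Values above the 0.25 rule of thumb would flag that covariate for concern.
import numpy as np

def abs_standardized_bias(x_treat: np.ndarray, x_comp: np.ndarray) -> float:
    pooled_sd = np.sqrt((x_treat.var(ddof=1) + x_comp.var(ddof=1)) / 2)
    return abs(x_treat.mean() - x_comp.mean()) / pooled_sd

rng = np.random.default_rng(2)
bias = abs_standardized_bias(rng.normal(0.1, 1, 300), rng.normal(0.0, 1, 310))
print(round(bias, 3), bias > 0.25)
```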
Figure 4. Propensity score distribution by treatment status.
References
Achilles, C. M., Finn, J. D., & Bain, H. P. (1998). Using Class Size to Reduce the Equity
Gap. Educational Leadership, 55(4), 40‐43.
Alexander, K. L., Entwisle, D. R., Blyth, D. A., & McAdoo, H. P. (1988). Achievement in
the first 2 years of school: Patterns and processes. Monographs of the Society
for Research in Child Development, i‐157.
Anderson, L.W. (2002). Balancing breadth and depth of content coverage: Taking
advantage of the opportunities provided by smaller classes. In J.D. Finn & M.C.
Wang (Eds.), Taking small classes one step further (pp. 51‐61). Greenwich, CT:
Information Age Publishing Inc.
Angrist, J. D., & Lavy, V. (1999). Using Maimonides' rule to estimate the effect of class
size on scholastic achievement. The Quarterly Journal of Economics, 114(2),
533‐575.
Barnett, W. S., Schulman, K., & Shore, R. (2004). Class size: What's the best fit? NIEER
Policy Matters, 9, 1-11. Retrieved from
http://nieer.org/resources/policybriefs/9.pdf
Biddle, B.J., & Berliner, D.C. (2002). Small class size and its effects. Educational
Leadership, 59(5), 12‐23.
Blatchford, P. (2003). A systematic observational study of teachers’ and pupils’
behaviour in large and small classes. Learning and Instruction, 13(6), 569‐595.
Blatchford, P., Bassett, P., Goldstein, H., & Martin, C. (2003). Are class size differences
related to pupils' educational progress and classroom processes? Findings
from the institute of education class size study of children aged 5–7
years. British Educational Research Journal, 29(5), 709‐730.
California Voter Guide (1998). Proposition 8: Class Size Reduction Funding. California:
Secretary of State. Retrieved from
http://vote98.sos.ca.gov/VoterGuide/Propositions/8noarg.htm
Chetty, R., Friedman, J. N., Hilger, N., Saez, E., Schanzenbach, D. W., & Yagan, D.
(2011). How does your kindergarten classroom affect your earnings? Evidence
from project STAR. The Quarterly Journal of Economics, 126(4), 1593‐1660.
Dee, T. S., & West, M. R. (2011). The non‐cognitive returns to class size. Educational
Evaluation and Policy Analysis, 33(1), 23‐46.
DuGoff, E. H., Schuler, M., & Stuart, E. A. (2014). Generalizing observational study
results: Applying propensity score methods to complex surveys. Health
Services Research, 49(1), 284‐303.
Duncan, G. J., & Magnuson, K. (2011). The nature and impact of early achievement
skills, attention skills, and behavior problems. In G.J. Duncan & J. Murnane
(Eds.), Whither opportunity (47‐70). New York: Russell Sage Foundation.
Education Commission of the States (2010). Class Size Policies. Retrieved from
http://www.ecs.org/clearinghouse/85/21/8521.pdf
Evertson, C. M., & Randolph, C. H. (1989). Teaching practices and class size: A new
look at an old issue. Peabody Journal of Education, 67(1), 85‐105.
Finn, J. D., & Achilles, C. M. (1990). Answers and questions about class size: A
statewide experiment. American Educational Research Journal, 27(3), 557‐577.
Finn, J. D., Gerber, S. B., & Boyd‐Zaharias, J. (2005). Small classes in the early grades,
academic achievement, and graduating from high school. Journal of
Educational Psychology, 97(2), 214.
Finn, J. D., Pannozzo, G. M., & Achilles, C. M. (2003). The “why’s” of class size: Student
behavior in small classes. Review of Educational Research, 73(3), 321-368.
Fredriksson, P., Öckert, B., & Oosterbeek, H. (2013). Long‐term effects of class size.
The Quarterly Journal of Economics, 128(1), 249‐285.
Glass, G. V., & Smith, M. L. (1979). Meta‐analysis of research on class size and
achievement. Educational Evaluation and Policy Analysis, 1(1), 2-16.
Goldstein, H., & Blatchford, P. (1998). Class size and educational achievement: A
review of methodology with particular reference to study design. British
Educational Research Journal, 24(3), 255‐268.
Graue, E., Hatch, K., Rao, K., & Oen, D. (2007). The wisdom of class‐size
reduction. American Educational Research Journal, 44(3), 670‐700.
Graue, M. E., & Oen, D. (2008). You just feed them with a long‐handled spoon:
Families evaluate their experiences in a class size reduction reform.
Educational Policy, 23(5), 685‐713.
Gresham, F. M., & Elliott, S. N. (1990). Social skills rating system (SSRS). American
Guidance Service.
Grindal, T. (2011). The effects of preschool setting on young children's cognitive skills,
social behavior and approaches to learning: A propensity score analysis
(Unpublished qualifying paper). Harvard Graduate School of Education,
Cambridge, Massachusetts.
Harder, V. S., Stuart, E. A., & Anthony, J. C. (2010). Propensity score techniques and
the assessment of measured covariate balance to test causal associations in
psychological research. Psychological Methods, 15(3), 234.
Harfitt, G. J., & Tsui, A. (2015). An examination of class size reduction on teaching and
learning processes: A theoretical perspective. British Educational Research
Journal, 41(5), 845‐865.
Hargreaves, L., Galton, M., & Pell, A. (1998). The effects of changes in class size on
teacher–pupil interaction. International Journal of Educational Research, 29(8),
779‐795.
Head Start Bureau (2005). Head Start centers and use of space. Head Start Design
Guide. HHS/ACF/ACYF/HSB. Retrieved from
http://eclkc.ohs.acf.hhs.gov/hslc/tta‐
system/teaching/eecd/learning%20environments/planning%20and%20arrangi
ng%20spaces/edudev_art_00059_051606.html Last updated: October 2014
Ho, D., Imai, K., King, G., & Stuart, E. (2011). MatchIt: Nonparametric preprocessing
for parametric causal inference. Journal of Statistical Software, 42(8), 1-28.
Jepsen, C., & Rivkin, S. (2009). Class size reduction and student achievement: The
potential tradeoff between teacher quality and class size. Journal of Human
Resources, 44(1), 223-250.
Jo, B., Stuart, E. A., MacKinnon, D. P., & Vinokur, A. D. (2011). The use of propensity
scores in mediation analysis. Multivariate Behavioral Research, 46(3), 425‐452.
LoCasale‐Crouch, J., Konold, T., Pianta, R., Howes, C., Burchinal, M., Bryant, D., ... &
Barbarin, O. (2007). Observed classroom quality profiles in state‐funded pre‐
kindergarten programs and associations with teacher, program, and classroom
characteristics. Early Childhood Research Quarterly, 22(1), 3‐17.
Malone, L., Carlson, B. L., Aikens, N., Moiduddin, E., Klein, A. K., West, J., … & Rall, K.
(2013). Head Start Family and Child Experiences Survey: 2009 User's Manual.
Ann Arbor, MI: Child Care & Early Education Research Connections.
Martin, N. K., Yin, Z., & Mayall, H. (2006). Classroom management training, teaching
experience and gender: Do these variables impact teachers' attitudes and
beliefs toward classroom management style? Proceedings from Annual
Conference of the Southwest Educational Research Association, 2006. Austin,
TX: ERIC.
McClelland, M. M., Morrison, F. J., & Holmes, D. L. (2000). Children at risk for early
academic problems: The role of learning‐related social skills. Early Childhood
Research Quarterly, 15(3), 307‐329.
Moiduddin, E., Aikens, N., Tarullo, L., West, J., Xue, Y. (2012). Child Outcomes and
Classroom Quality in FACES 2009. OPRE Report 2012‐37a. Washington, DC:
Office of Planning, Research and Evaluation, Administration for Children and
Families, U.S. Department of Health and Human Services.
Molnar, A., Smith, P., Zahorik, J., Palmer, A., Halbach, A., & Ehrle, K. (1999). Evaluating
the SAGE program: A pilot program in targeted pupil‐teacher reduction in
Wisconsin. Educational Evaluation and Policy Analysis, 21(2), 165‐177.
Mosteller, F. (1995). The Tennessee study of class size in the early school grades. The
Future of Children, 113‐127.
Murnane, R. J., & Willett, J. B. (2010). Methods matter: Improving causal inference in
educational and social science research. New York: Oxford University Press.
National Institute of Child Health and Human Development Early Child Care Research
Network (2000). The relation of child care to cognitive and language
development. Child Development, 71(4), 960-980.
National Scientific Council on the Developing Child. (2004). Young children develop in
an environment of relationships. Working Paper No. 1. Retrieved from
http://developingchild.harvard.edu/wp‐content/uploads/2004/04/Young‐
Children‐Develop‐in‐an‐Environment‐of‐Relationships.pdf
Office of the Administration for Children and Families (2015). About Office of Head
Start. Retrieved from http://www.acf.hhs.gov/programs/ohs
Peterson, J. L., & Zill, N. (1986). Marital disruption, parent‐child relationships, and
behavior problems in children. Journal of Marriage and the Family, 48(2), 295‐
307.
Pianta, R. C., La Paro, K. M., & Hamre, B. K. (2008). Classroom Assessment Scoring
System Manual, Pre-K. Baltimore, MD: Paul H Brookes Pub Co.
Ponitz, C. C., Rimm‐Kaufman, S. E., Grimm, K. J., & Curby, T. W. (2009). Kindergarten
classroom quality, behavioral engagement, and reading achievement. School
Psychology Review, 38(1), 102‐120.
Raver, C. C., Jones, S. M., Li‐Grining, C. P., Metzger, M., Champion, K. M., & Sardin, L.
(2008). Improving preschool classroom processes: Preliminary findings from a
randomized trial implemented in Head Start settings. Early Childhood Research
Quarterly, 23(1), 10‐26.
Roid, G. H., & Miller, L. J. (1997). Leiter International Performance Scale-Revised,
Examiner Rating Scale (Leiter-R). Lutz, FL: Psychological Assessment Resources,
Inc.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in
observational studies for causal effects. Biometrika, 70(1), 41‐55.
Rosenbaum, P. R., & Rubin, D. B. (1984). Reducing bias in observational studies using
subclassification on the propensity score. Journal of the American Statistical
Association, 79(387), 516‐524.
Rosenbaum, P. R., & Rubin, D. B. (1985). Constructing a control group using
multivariate matched sampling methods that incorporate the propensity
score. The American Statistician, 39(1), 33‐38.
Rosenbaum, P.R. (2005). Observational study. In B.S. Everitt & D.C. Howell (Eds.),
Encyclopedia of statistics in behavioral science (Volume 3) (pp. 1451‐1462).
Chichester, United Kingdom: John Wiley & Sons.
Rubin, D. B. (2007). The design versus the analysis of observational studies for causal
effects: Parallels with the design of randomized trials. Statistics in Medicine,
26(1), 20‐36.
Schanzenbach, D. W. (2014). Does Class Size Matter? Policy Briefs, National Education
Policy Center, School of Education, University of Colorado, Boulder.
Stuart, E. A. (2010). Matching methods for causal inference: A review and a look
forward. Statistical Science, 25(1), 1‐27.
VanderWeele, T. J., Hong, G., Jones, S. M., & Brown, J. L. (2013). Mediation and
spillover effects in group-randomized trials: A case study of the 4Rs
educational intervention. Journal of the American Statistical
Association, 108(502), 469‐482.
Washington 2014 Voters' Guide (2014). Washington Class Size Reduction Measure
Initiative 1351. Washington: Office of the Secretary of State. Retrieved from
https://wei.sos.wa.gov/agency/osos/en/press_and_research/PreviousElection
s/2014/General‐Election/Pages/Online‐Voters‐Guide.aspx
Wilson, V. (2002). Does small really make a difference? (SCRE Research Report No.
107) The Scottish Council for Research in Education. Retrieved from
http://www.classsizematters.org/wp‐content/uploads/2012/11/107.pdf
Appendices
A summary index for student behavioral outcomes was constructed using the
teacher ratings of social skills, the teacher ratings of problem behavior, and the assessor
ratings. Table 10 shows a strong but negative correlation between the teacher ratings of
social skills and problem behavior. The correlation between assessor ratings and the
other two teacher ratings was moderate. Table 11 shows the principal components
loadings for the index. The positive outcomes were loaded positively while the negative
outcome was loaded negatively, such that a positive value on the index represents good
outcomes. The summary index was formed for the outcome variable, measured in
spring 2010, and for the baseline variable, measured in fall 2009.
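A minimal sketch of this type of index construction follows; it is illustrative only, the column names are hypothetical stand-ins for the teacher and assessor ratings, and the data are simulated rather than the FACES measures.

```python
# Illustrative construction of a summary behavioral index from the first
# principal component of standardized ratings, with the sign set so that a
# higher index value represents better outcomes. Column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 610
ratings = pd.DataFrame({
    "teacher_social_skills": rng.normal(size=n),
    "teacher_problem_behavior": rng.normal(size=n),
    "assessor_rating": rng.normal(size=n),
})

z = StandardScaler().fit_transform(ratings)
pca = PCA(n_components=1).fit(z)
index = pca.transform(z)[:, 0]

# Flip the overall sign if needed so that the positive outcome (social skills)
# loads positively, i.e., higher index values mean better behavior.
if pca.components_[0][ratings.columns.get_loc("teacher_social_skills")] < 0:
    index = -index

ratings["behavior_index"] = index
print(ratings["behavior_index"].describe())
```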
Table 12. Principal components analysis of the measures of the CLASS domains
measured in spring 2010: Instructional Support (IS), Emotional Support (ES), and
Classroom Organization (CO).
Component 1
Eigenvalue 1.84
Proportion of variance explained 0.61
Principal component loadings
IS 0.52
ES 0.52
CO 0.60
Selection Variables
A. Child Characteristics.
Age at start of school term. A continuous variable indicating the child's age at
the start of the school term.
Child gender. A dichotomous variable coded 1 if the child was female and 0 if the
child was male.
Child race. A categorical variable indicating whether the child was white, African
American, Hispanic or other race. These were coded as a series of dummy variables,
with the category of interest coded 1, 0 otherwise. White was set as the reference
category.
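The same dummy-coding scheme can be sketched as follows; the data and category labels are hypothetical and only illustrate dropping the reference category.

```python
# Illustrative dummy coding of a categorical variable with a reference
# category ("White" is dropped as the reference). Data are hypothetical.
import pandas as pd

children = pd.DataFrame(
    {"race": ["White", "African American", "Hispanic", "Other", "White"]})
dummies = pd.get_dummies(children["race"], prefix="race")
dummies = dummies.drop(columns=["race_White"])   # White = reference category
children = pd.concat([children, dummies], axis=1)
print(children)
```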
Early Head Start. A dichotomous variable coded 1 if the child participated in Early
Head Start.
B. Parent Characteristics.
highest level of education was (i) less than a high school diploma; (ii) high school
These were coded as a series of dummy variables, with the category of interest coded 1,
0 otherwise. “Less than a high school diploma” was set as the reference category.
Mother's employment status. A categorical variable indicating whether the
child's mother was employed full-time, part-time, looking for work, or not in the labor
force. These were coded as a series of dummy variables, with the category of interest
coded 1, 0 otherwise. "Not in the labor force" was set as the reference category.
Parental depression score. This was calculated based on the parent's responses
to 12 items on the interview, each scored on a scale of 0 to 3, for a total score range of 0
points (not depressed) to 36 points (severely depressed). The FACES drew the items
from the Center for Epidemiologic Studies – Depression scale [CES-D] (Malone et al., 2013).
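As a minimal sketch of this kind of sum scoring (the item column names below are hypothetical placeholders, not the actual FACES items):

```python
# Illustrative sum scoring of 12 items, each scored 0-3, giving a possible
# range of 0 (not depressed) to 36 (severely depressed). Item names and data
# are hypothetical placeholders.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
item_cols = [f"depression_item_{i}" for i in range(1, 13)]
parents = pd.DataFrame(rng.integers(0, 4, size=(100, 12)), columns=item_cols)
parents["parental_depression_score"] = parents[item_cols].sum(axis=1)
print(parents["parental_depression_score"].describe())
```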
C. Household Characteristics.
On multiple assistance programs. A dichotomous variable coded 1 if the
household received multiple assistance such as welfare, TANF, general assistance, or
food stamps.
Household size. A continuous variable indicating the number of household
members.
D. Neighborhood Characteristics.
E. Program Characteristics.
Program wait list. A dichotomous variable coded 1 if the Head Start program had
a waiting list.
Program challenges perception scale. An index based on items where the
program director indicated whether each item made it harder for him/her to do his/her
job well in areas such as time constraints, lack of funds, lack of qualified staff, staff
turnover, lack of parental support, a challenging population, etc. The index was formed
from the first component, which was then standardized to have a mean of 0 and a
standard deviation of 1.
Expanded Head Start program. A dichotomous variable coded 1 if the Head Start
program had expanded in the past year.
Director Head Start years. A continuous variable indicating the number of years
the director had been working with the Head Start program.
F. Classroom Characteristics.
Class hours per week. A continuous measure of the number of hours the class met per week.
Design Covariates.
Class weight. The Fall 2009 class weight provided in the Head Start FACES
dataset.
Covariates
For all three research questions, I used the classroom, teacher, and program covariates
described below. For RQ1 and RQ2b, which involved children in the analysis, I included
the child covariates described below. Except for the child's age at assessment of
outcomes, which was measured in spring 2010, all other covariates were measured in
fall 2009.
The child covariates were obtained from parent interviews or direct child
assessments; teacher and classroom covariates were obtained from teacher interviews;
and the program selection variables were obtained from Head Start program director
interviews.
A. Classroom Covariates.
Class hours per week. A continuous measure of the number of hours the class met per week.
B. Teacher Covariates.
Teacher depression score. This was calculated based on the teacher's responses
to 12 items on the interview, each scored on a scale of 0 to 3, for a total score range of 0
points (not depressed) to 36 points (severely depressed). The FACES drew the items
from the Center for Epidemiologic Studies – Depression scale [CES-D] (Malone et al., 2013).
C. Program Covariates.
Program challenges perception scale. An index based on items where the
program director indicated whether each item made it harder for him/her to do his/her
job well in areas such as time constraints, lack of funds, lack of qualified staff, staff
turnover, lack of parental support, a challenging population, etc. The index was formed
from the first component, which was then standardized to have a mean of 0 and a
standard deviation of 1.
D. Child Covariates.
Baseline score. The child's baseline score for the behavioral outcomes index,
obtained in fall 2009. The index was formed by taking the principal components of the
fall 2009 teacher ratings of social skills and problem behavior and the assessor ratings.
Child gender. A dichotomous variable coded 1 if the child was female and 0 if the
child was male.
Child race. A categorical variable indicating whether the child was African
American, Hispanic or other race. These were coded as a series of dummy variables,
with the category of interest coded 1, 0 otherwise. Other race was set as the reference
category.