We would like to thank Joseph Price and J. D. LaRock for their excellent research assistance. We thank David Autor, Joe Doyle, Sue Dynarski, Amy Finkelstein, Chris Hansen, Robin Jacob, Jens Ludwig, Frank McIntyre, Jonah Rockoff, Doug Staiger, Thomas Dee, and seminar participants at the University of California, Berkeley, Northwestern University, Brigham Young University, Columbia University, Harvard University, MIT, and the University of Virginia for helpful comments. All remaining errors are our own. Contact the corresponding author, Brian Jacob, at [email protected].
I. Introduction
One of the most striking findings in recent education research involves
the importance of teacher quality. A series of new papers have documented
substantial variation in teacher effectiveness in a variety of settings, even
among teachers in the same school (Rockoff 2004; Hanushek et al. 2005;
Aaronson, Barrow, and Sander 2007). The differences in teacher quality
are dramatic. For example, recent estimates suggest that the benefit of
moving a student from an average teacher to one at the 85th percentile
is comparable to a 33% reduction in class size (Rockoff 2004; Hanushek
et al. 2005). The difference between having a series of very good teachers
versus very bad teachers can be enormous (Sanders and Rivers 1996). At
the same time, researchers have found little association between observable
teacher characteristics and student outcomes—a notable exception being
a large and negative first-year teacher effect (see Hanushek [1986, 1997]
for reviews of this literature and Rockoff [2004] for recent evidence on
teacher experience effects).1 This is particularly puzzling given the likely
upward bias in such estimates (Figlio 1997; Rockoff 2004).
Private schools and most institutions of higher education implicitly
recognize such differences in teacher quality by compensating teachers,
at least in part, on the basis of ability (Ballou 2001; Ballou and Podgursky
2001). However, public school teachers have resisted such merit-based
pay due, in part, to a concern that administrators will not be able to
recognize (and thus properly reward) quality (Murnane and Cohen 1986).2
At first blush, this concern may seem completely unwarranted. Principals
not only interact with teachers on a daily basis—reviewing lesson plans,
observing classes, talking with parents and children—but also have ready
access to student achievement scores. Prior research, however, suggests
that this task might not be as simple as it seems. Indeed, the consistent
1. There is some limited evidence that cognitive ability, as measured by a score
on a certification exam, for example, may be positively associated with teacher
effectiveness, although other studies suggest that factors such as the quality of
one’s undergraduate institution are not systematically associated with effective-
ness. For a review of the earlier literature relating student achievement to teacher
characteristics, see Hanushek (1986, 1997). In recent work, Clotfelter, Ladd, and
Vigdor (2006) find teacher ability is correlated with student achievement, while
Harris and Sass (2006) find no such association.
2. Another concern, which we discuss below, involves favoritism or the simple
capriciousness of ratings.
finding that certified teachers are no more effective than their uncertified
colleagues suggests that commonly held beliefs among educators may be
mistaken.
In this article, we examine how well principals can distinguish between
more and less effective teachers, where effectiveness is measured by the
ability to raise student math and reading achievement. In other words,
do school administrators know good teaching when they see it? We find
that principals are quite good at identifying those teachers who produce
the largest and smallest standardized achievement gains in their schools
(i.e., the top and bottom 10%–20%) but have far less ability to distin-
guish between teachers in the middle of this distribution (i.e., the middle
60%–80%). This is not a result of a highly compressed distribution of
teacher ability.
While there are several limitations to our analysis, which we describe
in later sections of this article, our results suggest that policy makers
should consider incorporating principal evaluations into teacher compen-
sation and promotion systems. To the extent that principal judgments
focus on identifying the best and worst teachers, for example, to determine
bonuses and teacher dismissal, the evidence presented here suggests that
such evaluations would help promote student achievement. Principals can
also evaluate teachers on the basis of a broader spectrum of educational
outputs, including nonachievement outcomes valued by parents.
More generally, our findings inform the education production function
literature, providing compelling evidence that good teaching is, at least
to some extent, observable by those close to the education process even
though it may not be easily captured in those variables commonly avail-
able to the econometrician. The article also makes a contribution to the
empirical literature on subjective performance assessment by demonstrating
the importance of accounting for estimation error in measured productivity
and showing that the relationship between subjective evaluations and actual
productivity can vary substantially across the productivity distribution.
The remainder of the article proceeds as follows. In Section II, we
review the literature on objective and subjective performance evaluation.
In Section III, we describe our data and in Section IV outline how we
construct the different measures of teacher effectiveness. The main results
are presented in Section V. We conclude in Section VI.
II. Background
A. Prior Literature
The theoretical literature on subjective performance evaluation has fo-
cused largely on showing the conditions under which efficient contracts
are possible (Bull 1987; MacLeod and Malcomson 1989). Prendergast
(1993) and Prendergast and Topel (1996) show how the existence of sub-
B. Conceptual Framework
In order to provide a basis for interpreting the empirical findings in
this article, it is useful to consider how principals might form opinions
of teachers. Given the complexity of principal belief formation, and the
limited objectives of this article, we do not develop a formal model. Rather,
we describe the sources of information available to principals and how
they might interpret the signals they receive.
Each year principals receive a series of noisy signals of a teacher’s
performance, stemming from three main sources: (1) formal and informal
observations of the teacher working with students and/or interacting with
colleagues around issues of pedagogy or curriculum, (2) reports from
parents—either informal assessments or formal requests to have a child
placed with the teacher (or not placed with the teacher), and (3) student
achievement scores. Principals will differ in their ability and/or inclination
to gather and incorporate information from these sources and in the weight
that they place on each of the sources. A principal will likely have little
information on first-year teachers, particularly at the point when we sur-
veyed principals; namely, in February, before student testing took place and
before parents began requesting specific teachers for the following year.
Principals may differ with respect to the level of sophistication with
which they collect information and interpret the signals they receive. For
example, principals may be aware of the level of test scores in the teacher’s
classroom but unable to account for differences in classroom composition.
In this case, principal ratings might be more highly correlated to the level
of test scores than to teacher value added if little information besides test
scores was used to construct ratings. Also, principals might vary in how
they deal with the noise inherent in the signals they observe. A naive
principal might simply report the signal she observes regardless of the
variance of the noise component. A more sophisticated principal, however,
might act as a Bayesian, down-weighting noisy signals.6 Finally, it seems
likely that a principal’s investment in gathering information on and up-
dating beliefs about a particular teacher will be endogenously determined
by a variety of factors, including the initial signal the principal observes
as well as the principal’s assessment regarding how much a teacher can
benefit from advice and training.
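As a concrete illustration of this Bayesian benchmark (the notation here is ours, not the authors'): if true teacher ability $\delta$ has variance $\sigma^2_{\delta}$ and the principal observes a signal $s = \delta + u$ with noise variance $\sigma^2_u$, then a Bayesian principal with a prior mean of zero would report

$$E[\delta \mid s] = \frac{\sigma^2_{\delta}}{\sigma^2_{\delta} + \sigma^2_u}\, s,$$

placing less weight on noisier signals. The implication noted in footnote 6 follows directly: as more signals accumulate and the effective noise variance falls, this weight rises toward one, so the dispersion of a Bayesian principal's ratings grows toward the dispersion of true ability.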
Ultimately, in this article we limit our examination to the accuracy of
in Los Angeles. They found that a one standard deviation increase in teacher
effectiveness led to a 1–2 point raw score gain (although it is not possible to
calculate the effect size given the available information in the study).
6. Indeed, a simple model of principals as perfect Bayesians would generate
strong implications regarding how the accuracy of ratings evolves with the time
a principal and a teacher spend together. For example, this type of model would
imply that the variance of principal ratings increases over time (assuming that the
variance of true teacher ability does not change over time).
III. Data
The data for this study come from a midsize school district located in
the western United States. The student data include all of the common
demographic variables, as well as standardized achievement scores, and
allow us to track the students over time. The teacher data, which we can
link to students, include a variety of teacher characteristics that have been
used in previous studies, such as age, experience, educational attainment,
undergraduate and graduate institution attended, and license and certi-
fication information. With the permission of the district, we surveyed all
elementary school principals in February 2003 and asked them to rate the
teachers in their schools along a variety of different performance dimen-
sions.
To provide some context for the analysis, table 1 shows summary sta-
tistics from the district. While the students in the district are predomi-
nantly white (73%), there is a reasonable degree of heterogeneity in terms
of ethnicity and socioeconomic status. Latino students make up 21% of
the elementary population, and nearly half of all students in the district
(48%) receive free or reduced-price lunch. Achievement levels in the dis-
trict are almost exactly at the average of the nation (49th percentile on
the Stanford Achievement Test).
The primary unit of analysis in this study is the teacher. To ensure that
we could link student achievement data to the appropriate teacher, we
limit our sample to elementary teachers who were teaching a core subject
during the 2002–3 academic year.7 We exclude kindergarten and first-grade
teachers because achievement exams are not available for these students.8
Our sample consists of 201 teachers in grades 2–6. Like the students,
the teachers in our sample are fairly representative of elementary school
teachers nationwide. Only 16% of teachers in our sample are men. The
7. We exclude noncore teachers such as music teachers, gym teachers, and li-
brarians. Note, however, that in calculating teacher value-added measures, we use
all student test scores from 1997–98 through 2004–5.
8. Achievement exams are given to students in grades 1–6. In order to create a
value-added measure of teacher effectiveness, it is necessary to have prior achieve-
ment information for the student, which eliminates kindergarten and first-grade
students.
Table 1
Summary Statistics
Mean SD
Student characteristic:
Male .51
White .73
Black .01
Hispanic .21
Other .06
Limited English proficiency .21
Free or reduced-price lunch .48
Special education .12
Math achievement (national percentile) .49
Reading achievement (national percentile) .49
Language achievement (national percentile) .47
Teacher characteristic:
Male .16 .36
Age 41.9 12.5
Untenured .17 .38
Experience 11.9 8.9
Fraction with 10–15 years experience .19 .40
Fraction with 16–20 years experience .14 .35
Fraction with 21+ years experience .16 .37
Years working with principal 4.8 3.6
BA degree at in-state (but not local) college .10 .30
BA degree at out-of-state college .06 .06
MA degree .16 .16
Any additional endorsements .20 .40
Any additional endorsements in areas other than ESL .10 .31
Licensed in more than one area .26 .44
Licensed in area other than early childhood education
or elementary education .07 .26
Grade 2 .23 .42
Grade 3 .21 .41
Grade 4 .20 .40
Grade 5 .18 .38
Grade 6 .18 .39
Mixed-grade classroom .07 .26
Two teachers in the classroom .05 .22
Number of teachers 201
Number of principals 13
Note.—Student characteristics are based on students enrolled in grades 2–6 in spring 2003. Math and
reading achievement measures are based on the spring 2002 scores on the Stanford Achievement Test
(version 9) taken by selected elementary grades in the district. Teacher characteristics are based on
administrative data. Nearly all teachers in the district are Caucasian, so race indicators are omitted.
9. In this district, principals conduct formal evaluations annually for new teachers
and every third year for tenured teachers. However, prior studies have found
such formal evaluations suffer from considerable compression with nearly all
teachers being rated very highly. These evaluations are also part of a teacher’s
personnel file, and it was not possible to obtain access to these without permission
of the teachers.
Table 2
Summary Statistics for Principal Ratings
[Table body not recoverable from this copy: each survey item was reported with its mean (SD) and 10th and 90th percentiles.]
Ratings are normalized by subtracting the principal-specific mean for that question and dividing by the school-specific standard deviation.
reading exam. The scores are reported as the percentage of items that the
student answered correctly, but we normalize achievement scores to be
mean zero and standard deviation to be one within each year-grade. The
vector X consists of the following student characteristics: age, race, gender,
free lunch eligibility, special education placement, limited English profi-
ciency status, prior math achievement, prior reading achievement, and
grade fixed effects, and C is a vector of classroom measures that include
indicators for class size and average student characteristics. A set of year
and school fixed effects are $\omega_t$ and $\phi_k$, respectively. Teacher j's contribution
to value added is captured by the $\delta_j$'s. The error term $\alpha_{jt}$ is common to
all students in teacher j's classroom in period t (e.g., adverse testing con-
ditions faced by all students in a particular class, such as a barking dog).
The error term $\varepsilon_{ijkt}$ takes into account the student's idiosyncratic error. In
order to account for the correlation of students within classrooms, we
correct the standard errors using the method suggested by Moulton
(1990).10
10. Another possibility would be to use cluster-corrected standard errors. How-
ever, the estimated standard errors behave very poorly for teachers who are in
the sample for a small number of years. It is also possible to estimate a model
that includes a random teacher-year effect, which should theoretically provide
more efficient estimates. In practice, however, the random effect estimates are
comparable to those we present in terms of efficiency. The intraclass correlation
coefficients calculated as part of the Moulton procedure are roughly .06 in reading
and .09 in mathematics.
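For reference, the value-added specification described above can be written as follows (our rendering of the text's description, not a verbatim reproduction of the original display equation):

$$A_{ijkt} = X_{it}\beta + C_{jt}\gamma + \omega_t + \phi_k + \delta_j + \alpha_{jt} + \varepsilon_{ijkt},$$

where $A_{ijkt}$ is the normalized achievement of student $i$ taught by teacher $j$ in school $k$ in year $t$.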
14. Hanushek et al. (2005), for example, find that one standard deviation in the
teacher quality distribution is associated with a 0.22 standard deviation increase
in math on the Texas state assessment. Rockoff (2004) finds considerably smaller
effects—namely, that a one standard deviation increase in the teacher fixed-effect
distribution raises student math and reading achievement by roughly 0.10 standard
deviations on a nationally standardized scale. Examining high school students,
Aaronson et al. (2007) find that a one standard deviation improvement in teacher
quality leads to a .20 improvement in math performance over the course of a year.
Because the variance of the estimated fixed effects exceeds the variance of true teacher quality, the raw correlation between principal ratings and estimated value added understates the correlation of interest:

$$\operatorname{Corr}(\hat{\delta}_P, \hat{\delta}_{OLS}) = \frac{\operatorname{Cov}(\hat{\delta}_P, \delta)}{\sqrt{\operatorname{Var}(\hat{\delta}_P)\,[\operatorname{Var}(\delta) + \operatorname{Var}(e)]}} < \operatorname{Corr}(\hat{\delta}_P, \delta). \qquad (2)$$

We therefore multiply the estimated correlation by the factor $\sqrt{\operatorname{Var}(\hat{\delta}_{OLS})}\,/\,\sqrt{\operatorname{Var}(\delta)}$
to obtain the adjusted correlation. We obtain the standard errors using a
bootstrap.17
Note that this adjustment assumes that a principal’s rating is unrelated
to the error of our OLS estimate of teacher effectiveness. Specifically, we
assume that the numerator in equation (2) can be rewritten as follows:
15. We will use the terms “estimation error” and “measurement error” inter-
changeably, although in the testing context measurement error often refers to the
test-retest reliability of an exam, whereas the error stemming from sampling var-
iability is described as estimation error.
16. This assumes that the OLS estimates of the teacher fixed effects are not
correlated with each other. This would be true if the value-added estimates were
calculated with no covariates. Estimation error in the coefficients of the covariates
generates a nonzero covariance between teacher fixed effects, though in practice
the covariates are estimated with sufficient precision that this is not a problem.
17. For our baseline specifications we perform 1,000 iterations. For the robustness
and heterogeneity checks we perform 500 iterations. We also perform a principal-
level block bootstrap. The inference from this procedure (shown below) is similar
to our baseline results.
$\operatorname{Cov}(\hat{\delta}_P, \hat{\delta}_{OLS}) = \operatorname{Cov}(\hat{\delta}_P, \delta) + \operatorname{Cov}(\hat{\delta}_P, e)$. This would not be true if the
principals were doing the same type of statistical analysis as we are to
determine teacher effectiveness. However, to the extent that principals
base their ratings largely on classroom observations, discussions with stu-
dents and parents, and other factors unobservable to the econometrician,
this assumption will hold. To the extent that this is not true and principals
base their ratings, even partially, on the observed test scores (in the same
manner as the value-added model does—that is, conditioning on a variety
of covariates), the correlation we calculate will be biased upward.18
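A minimal sketch of this measurement-error adjustment, under the assumptions just stated (function and variable names are ours; `se_ols` collects the standard errors of the estimated teacher fixed effects, and Var(δ) is backed out as the variance of the estimates net of the average squared standard error):

```python
import numpy as np

def adjusted_correlation(principal_rating, delta_ols, se_ols):
    """Scale the raw correlation by sqrt(Var(delta_hat_OLS)) / sqrt(Var(delta)),
    where Var(delta) is Var(delta_hat_OLS) net of average sampling variance."""
    raw_corr = np.corrcoef(principal_rating, delta_ols)[0, 1]
    var_ols = np.var(delta_ols, ddof=1)        # variance of estimated fixed effects
    var_true = var_ols - np.mean(se_ols ** 2)  # strip out estimation error
    return raw_corr * np.sqrt(var_ols / var_true)
```

Standard errors for the adjusted correlation could then be obtained by bootstrapping teachers (or blocks of teachers within principals), in the spirit of footnote 17.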
In addition to biasing our correlations, estimation error will lead to
attenuation bias if we use the teacher value-added measures as an ex-
planatory variable in a regression context.19 To account for attenuation
bias when we use the teacher value added in a regression context, we
construct empirical Bayes (EB) estimates of teacher quality. This ap-
proach was suggested by Kane and Staiger (2002) for producing efficient
estimates of school quality but has a long history in the statistics lit-
erature (see, e.g., Morris 1983).20 For more information on our calcu-
lation of the EB estimates, see Jacob and Lefgren (2005a).
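A stylized version of the empirical Bayes shrinkage, under textbook normal-normal assumptions (a sketch rather than the authors' exact implementation, which is detailed in Jacob and Lefgren 2005a):

```python
import numpy as np

def empirical_bayes(delta_ols, se_ols):
    """Shrink each OLS teacher effect toward the grand mean by its reliability,
    lambda_j = Var(delta) / (Var(delta) + SE_j^2)."""
    var_true = np.var(delta_ols, ddof=1) - np.mean(se_ols ** 2)  # signal variance
    reliability = var_true / (var_true + se_ols ** 2)
    grand_mean = np.mean(delta_ols)
    return grand_mean + reliability * (delta_ols - grand_mean)
```

Because noisier estimates are shrunk more heavily, using these EB measures as regressors mitigates the attenuation bias discussed above.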
While the correlation between objective and subjective performance
measures is a useful starting point, it has several limitations. Most im-
portant, the principal ratings may not have a cardinal interpretation, which
would make the correlation impossible to interpret. For example, the
difference between a six and seven rating may be greater or less in absolute
terms than the difference between a nine and ten. Second, correlations
are quite sensitive to outliers. Third, while we have normalized the prin-
cipal ratings to ensure that each principal’s ratings have the same variance,
it is possible that the variance of value added differs across schools. In
this case, stacking the data could provide misleading estimates of the
18. The correlations (and associated nonparametric statistics) may understate the
relation between objective and subjective measures if principals have been able
to remove or counsel out the teachers that they view as the lowest quality. How-
ever, our discussions with principals and district officials suggest that this occurs
rarely and is thus unlikely to introduce a substantial bias in our analysis. Similarly,
the correlations may be biased downward if principals assign teachers to class-
rooms in a compensatory way—i.e., principals assign more effective teachers to
classrooms with more difficult students. In this case, our value-added measures
will be attenuated (biased toward zero), which will reduce the correlation between
our subjective and objective measures.
19. If the value-added measure is used as a dependent variable, it will lead to less
precise estimates relative to using a measure of true teacher ability. Measurement
error will also lead us to overstate the variance of teacher effects, although this
is a less central concern for the analysis presented here.
20. In fact, the EB approach described here is very closely related to the errors-
in-variables approach that allows for heteroskedastic measurement error outlined
by Sullivan (2001).
average correlation between principal ratings and value added within the
district. Finally, a simple correlation does not tell us whether principals
are more effective at identifying teachers at certain points on the ability
distribution.
For these reasons, we estimate a nonparametric measure of the asso-
ciation between ratings and productivity. Specifically, we calculate the
probability that a principal can correctly identify teachers in the top or
bottom group within his or her school. If we knew the true ability of
each teacher, this exercise would be trivial. In order to address this ques-
tion using our noisy measure of teacher effectiveness, we rely on a Monte
Carlo simulation in which we assume that a teacher’s true value added is
distributed normally with a mean equal to the point estimate of the teacher
empirical Bayes fixed effect and a standard deviation equal to the standard
error on the teacher’s estimate.21 The basic intuition is that by taking
repeated draws from the value-added distribution of each teacher in a
school, we can determine the probability that any particular teacher will
fall in the top or bottom group within his or her school, which we can
then use to create the conditional probabilities shown below. The appendix
provides a more detailed description of this simulation.
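In code, the simulation can be sketched as follows (variable names are hypothetical; the appendix describes the actual procedure, which also conditions on the principal's ratings):

```python
import numpy as np

def prob_in_top_group(eb_estimates, std_errors, n_top=1, n_draws=10_000, seed=0):
    """For each teacher in a school, estimate the probability that her true value
    added ranks among the school's top n_top teachers, by drawing repeatedly from
    Normal(EB point estimate, SE) and ranking the draws within the school."""
    rng = np.random.default_rng(seed)
    n_teachers = len(eb_estimates)
    draws = rng.normal(eb_estimates, std_errors, size=(n_draws, n_teachers))
    ranks_from_top = (-draws).argsort(axis=1).argsort(axis=1)  # 0 = highest draw
    return (ranks_from_top < n_top).mean(axis=0)
```

These per-teacher probabilities are then combined with the principal's top and bottom designations to form the conditional probabilities reported in table 4.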
A final concern with both the parametric and nonparametric measures
of association described above is that our objective measure of teacher
effectiveness is constructed from student test scores that have no natural
units. Exam scores in this district are reported in terms of the percent of
items answered correctly. We do not have access to individual test items,
which makes it impossible to develop performance measures that account
for the difficulty of the test items, as is commonly done in item response
theory analyses.22 Furthermore, until very recently the test was not stan-
dardized against a national population. However, to test the robustness
of our results, we develop value-added measures that use transformations
of the percent correct score, including a student’s percentile rank within
his year and grade, the square of the percent correct, and the natural
logarithm of the percent correct. Moreover, the district categorizes stu-
dents into four different proficiency categories on the basis of these exam
scores, and in some specifications we use these proficiency indicators as
outcomes for the creation of value-added measures. As we show below,
our results are robust to using these alternative value-added measures.
21. The empirical Bayes estimate of teacher value added is orthogonal to its
estimation error, which makes it more appropriate than the OLS estimate in this
context.
22. The core exam is the examination that is administered to all grades and broadly
reported. It is therefore likely that principals would focus on this measure of
achievement as opposed to some other exam, such as the Stanford Achievement
Test, which is only administered in the third (with limited participation), fifth,
and eighth grades.
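The alternative outcomes described above are simple transformations of the percent-correct score, recomputed before value added is re-estimated; for concreteness (a sketch with hypothetical pandas column names):

```python
import numpy as np
import pandas as pd

def alternative_outcomes(df: pd.DataFrame) -> pd.DataFrame:
    """Add the alternative test-score outcomes used in the robustness checks:
    within grade-year percentile rank, squared percent correct, and log percent
    correct (assumes pct_correct > 0)."""
    out = df.copy()
    out["pct_rank"] = out.groupby(["grade", "year"])["pct_correct"].rank(pct=True)
    out["pct_sq"] = out["pct_correct"] ** 2
    out["log_pct"] = np.log(out["pct_correct"])
    return out
```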
Table 3
Correlation between a Principal’s Rating of a Teacher’s Ability to Raise
Student Achievement and the Value-Added Measure of the Teacher’s
Effectiveness at Raising Student Achievement
[Table body not recoverable from this copy.]
construct, they should not be biased downward, as is the case with many
prior studies of subjective performance evaluation.23 The positive and
significant correlations indicate that principals do have some ability to
identify this dimension of teacher effectiveness. As shown below, these
basic results are robust to a wide variety of alternative specifications and
sensitivity analyses.
However, one might ask why these correlations are not even higher.
One possibility is that principals focus on the average test scores in a
teacher’s classroom rather than student improvement relative to students
in other classrooms. The correlations between principal ratings and av-
erage student achievement scores by teacher, shown in row 2, provide
some support for this hypothesis. The correlation between principal rat-
ings and average test scores in reading is significantly higher than the
correlation between principal ratings and teacher value added (0.55 vs.
0.29). This suggests that principals may base their ratings at least partially
on a naive recollection of student performance in a teacher’s class—failing
to account for differences in classroom composition. Another reason may
be that principals focus on their most recent observations of teachers. In
results not shown here, we find that the average achievement score (or
gains) in a teacher’s classroom in 2002 is a significantly stronger predictor
of the principal’s rating than the scores (or gains) in any prior year.24
To explore principals’ ability to identify the very best and worst teach-
ers, table 4 shows the estimates of the percent of teachers that a principal
can correctly identify in the top (bottom) group within his or her school.
Examining the results in the top panel, we see that the teachers identified
by principals as being in the top category were, in fact, in the top category
according to the value-added measures about 55% of the time in reading
and 70% of the time in mathematics. If principals randomly assigned
ratings to teachers, we would expect the corresponding probabilities to
be 14% and 26%, respectively. This suggests that principals have consid-
23. Bommer et al. (1995) emphasize the potential importance of this issue, noting
that in the three studies they found where objective and subjective measures tapped
precisely the same performance dimension, the mean corrected correlation was
0.71 as compared with correlations of roughly 0.30 in other studies. Medley and
Coker (1987) are unique in specifically asking principals to evaluate a teacher’s
ability to improve student achievement. They find that the correlation with these
subjective evaluations is no higher than with an overall principal rating.
24. This exercise suggests that in assigning teacher ratings, principals might focus
on their most recent observations of a teacher instead of incorporating information
on all past observations of the individual. Of course, it is possible that principals
may be correct in assuming that teacher effectiveness changes over time so that
the most recent experience of a teacher may be the best predictor of actual ef-
fectiveness. To examine this possibility, we create value-added measures that in-
corporate a time-varying teacher experience measure. As shown in our sensitivity
analysis, we obtain comparable results when we use this measure.
Table 4
Relationship between Principal Ratings of a Teacher’s Ability to Raise
Student Achievement and Teacher Value Added
Reading Math
Conditional probability that a teacher who received the top rating from
the principal was the top teacher according to the value-added
measure (SE) .55 .70
(.18) (.13)
Null hypothesis (probability expected if principals randomly assigned
teacher ratings) .14 .26
Z-score (p-value) of test of difference between observed and null 2.29 3.34
(.02) (.00)
Conditional probability that a teacher who received a rating above the
median from the principal was above the median according to the
value-added measure (SE) .62 .59
(.12) (.14)
Null hypothesis (probability expected if principals randomly assigned
teacher ratings) .33 .24
Z-score (p-value) of test of difference between observed and null 2.49 2.41
(.01) (.02)
Conditional probability that a teacher who received a rating below the
median from the principal was below the median according to the
value-added measure (SE) .51 .53
(.11) (.13)
Null hypothesis (probability expected if principals randomly assigned
teacher ratings) .35 .26
Z-score (p-value) of test of difference between observed and null 1.40 2.19
(.16) (.03)
Conditional probability that the teacher(s) who received the bottom rat-
ing from the principal was the bottom teacher(s) according to the
value-added measure (SE) .38 .61
(.22) (.14)
Null hypothesis (probability expected if principals randomly assigned
teacher ratings) .09 .23
Z-score (p-value) of test of difference between observed and null 1.30 2.76
(.19) (.01)
Note.—The probabilities are calculated using the procedure described in the appendix.
erable ability to identify teachers in the top of the distribution. The results
are similar if one examines teachers in the bottom of the ability distri-
bution (bottom panel).25
The second and third panels in table 4 suggest that principals are sig-
nificantly less successful at distinguishing between teachers in the middle
of the ability distribution. For example, in the second panel we see that
principals correctly identify only 62% of teachers as being better than
the median (including those in the top category), relative to the null
hypothesis of 33%—that is, the percent that one would expect if principal
25. We used the delta method to compute the relevant standard errors. A block
bootstrap at the principal level yielded even smaller p-values than those reported
in the tables.
Fig. 2.—Association between estimated teacher value added and principal rating
sample and the sample of teachers who did not receive the top or bottom
principal rating (in the bottom panel). If we exclude those teachers who
received the best and worst ratings, the adjusted correlation between prin-
cipal rating and teacher value added is 0.02 and −0.04 in reading and
math, respectively (neither of which is significantly different than zero).
B. Robustness Checks
In this section, we outline potential concerns regarding the validity of
our estimates and attempt to provide evidence regarding the robustness
of our findings. One reason that principals might have difficulty distin-
guishing between teachers in the middle is that the distribution of teacher
value added is highly compressed. This is not the case, however. Even
adjusting for estimation error, the standard deviation of value added for
teachers outside of the top and bottom categories is .10 in reading and
.25 in math compared with standard deviations of .12 and .26, respectively,
for the overall sample.
Another possible concern is that the lumpiness of the principal ratings
might reduce the observed correlation between principal ratings and actual
value added. To determine the possible extent of this problem, we per-
formed a simulation in which we assumed principals perfectly observed
a normally distributed teacher quality measure. The principals then assigned
ratings by slotting teachers, in order of this quality measure, into the actual
distribution of reading ratings each principal gave. For
example, a principal who assigned two 6s, three 7s, six 8s, three 9s, and
one 10 would assign the two teachers with the lowest generated value-
added measures a six. She would assign the next three teachers 7s, and
so on. The correlation between the lumpy principal rankings and the
generated teacher-quality measure is about 0.9, suggesting that at most
the correlation is downward biased by about 0.1 due to the lumpiness.
When we assume that the latent correlation between the principal’s con-
tinuous measure of teacher quality and true effectiveness is 0.5, the cor-
relation between the lumpy ratings and the truth is biased downward by
about 0.06, far less than would be required to fully explain the relatively
low correlation between the principal ratings and the true teacher effec-
tiveness.28 In addition, as shown in row 5, we obtain comparable corre-
lations when we exclude principals with extremely lumpy ratings.
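A condensed version of this simulation (our own sketch; the latent quality measure is standard normal and is mapped, in order, onto the counts of each rating the principal actually used):

```python
import numpy as np

def lumpy_correlation(rating_counts, n_sims=1000, seed=0):
    """Correlation between a perfectly observed, normally distributed quality
    measure and ratings forced into the principal's actual lumpy categories,
    e.g., rating_counts = {6: 2, 7: 3, 8: 6, 9: 3, 10: 1}."""
    rng = np.random.default_rng(seed)
    items = sorted(rating_counts.items())                 # ascending rating order
    ratings = np.repeat([r for r, _ in items], [c for _, c in items])
    corrs = []
    for _ in range(n_sims):
        quality = np.sort(rng.normal(size=len(ratings)))  # lowest quality first
        corrs.append(np.corrcoef(quality, ratings)[0, 1])
    return float(np.mean(corrs))
```

The variant with a latent correlation of 0.5 can be handled the same way, by lumping a noisy version of the quality measure and correlating the result with the original.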
Table 5 presents a variety of additional sensitivity analyses. For the sake
of brevity, we only show the correlations between principal ratings and
value added, adjusting for the estimation error in value added. In row 1,
28. In practice, the bias from lumpiness is likely to be even lower. This is because
teachers with dissimilar quality signals are unlikely to be placed in the same
category—even if no other teacher is between them. In other words, the size and
number of categories is likely to reflect the actual distribution of teacher quality,
at least in the principal’s own mind.
we present the baseline estimates. In the first panel (rows 2–5), we examine
the sensitivity of our findings to alternative samples. We find similar results
when we examine only grades for which we can examine math as well as
reading. The same is true if we exclude schools with extremely lumpy
principal ratings or first-year teachers. As discussed earlier, the correlation
between ratings and teacher value added is much lower for teachers with
ratings outside of the top and bottom.
In the next panel (rows 6–12), we investigate the concern that our
findings may not be robust to the measure of student achievement used
to calculate value added. As explained earlier, while principals were likely
to focus on the core exam, it is possible that they weigh the improvement
of high- and low-ability students in a way that does not correspond to
the commonly reported test metric. In rows 6–8, we examine measures
of achievement that focus on students at the bottom of the test distri-
bution. These measures include the log of the percent correct, achieving
above the proficiency cutoff, and the achievement of students with initial
achievement below the classroom mean. We find results comparable to
our baseline. In rows 9–10, we show results in which we place greater
weight on high-achieving students. These results include percent correct
squared and the achievement of students with initial achievement above
the classroom mean, and we find somewhat smaller correlations. Overall,
we believe these results are consistent with our baseline findings. However,
there does seem to be some suggestive evidence that principals may place
more weight on teachers bringing up the lowest-achieving students. For
example, the correlations using the teacher value-added scores based on
the achievement of students in the bottom half of the initial ability dis-
tribution are larger than those based on the students in the top half of
the ability distribution (though the differences are not significant at con-
ventional levels). However, the correlations that use value-added measures
based on proficiency cutoffs (which should disproportionately reflect im-
pacts on students with lower initial ability) are smaller than our baseline
results and not statistically different than zero. Of course, this might also
be due to the fact that this binary outcome measure inherently contains
less information than a continuous achievement score.
In row 11, we examine results in which the achievement measure used
is the students’ percentile in the grade-year distribution of test scores within
the district. Finally, in row 12, we examine gain scores that are normalized
so that student gains have unit standard deviation within each decile of the
initial achievement distribution. Both specifications yield results comparable
to the baseline.
In the third panel (rows 13–18), we examine the robustness of our
estimates to alternative estimation strategies for the value added and find
5. Exclude schools with extreme lumpiness in principal ratings .15* .26* .24* .32*
(.07) (.13) (.09) (.12)
Alternative measures of student achievement:
6. Test outcome is natural log of percent correct—greater weight on
low-achieving students .19* .33* .22* .28*
(.07) (.13) (.08) (.10)
7. Test outcome is scoring above “proficiency” cutoff—greater weight
on low-achieving students .12 .20 .12 .17
(.07) (.14) (.08) (.13)
18. Include ln(experience) variable (plus polynomials in prior
achievement) .20* .46 .26* .40*
(.06) (.35) (.07) (.12)
Other checks:
19. Block bootstrapped standard errors (clustering at school level) .18* .29* .25* .32*
(.09) (.14) (.09) (.12)
20. Average of principal-level correlations .21* .31*
(.08) ... (.08) ...
Note.—The adjusted correlations take into account the estimation error in our value-added measures of teacher effectiveness. Bootstrapped standard errors are in parentheses.
C. Heterogeneity of Effects
It is possible that principals are able to gauge the performance of some
teachers more effectively than others. Alternatively, some principals may
be more effective than others in evaluating teacher effectiveness. In table
6, we examine this possibility. The first row shows the baseline estimates.
In the first panel, we examine how the correlation between principal
rating and teacher value added varies with teacher characteristics. It does
not appear that the correlation between ratings and teacher value added
varies systematically with teacher experience, the duration the principal
has known the teacher, or grade taught. The standard errors are generally
too large, however, to draw strong conclusions.
In the second panel, we consider whether some principals are more
capable of identifying effective teachers than others. Principals who have
been at their schools less than 4 years, are male, and identify themselves
as confident in their ability to assess teacher effectiveness appear to rate
teaching ability more accurately. The observed differences, however, are
never significant at the 5% level.
29. One exception is that the correlations that use only 2002–3 achievement data
are smaller than the baseline correlations and are not statistically different than
zero (although they are not statistically different than the baseline correlations,
either).
30. Note that when comparing the predictive power of the various measures, we
are essentially comparing the principal and compensation measures against feasible
value-added measures. Using unobserved actual value added could increase the
predictive power (as measured by the r-squared), but this is not a policy-relevant
measure of teacher quality. Of course, the nature of the EB measures is such that
coefficient estimates are consistent measures of impact of actual teacher value
added.
31. Excluding first-year teachers does not affect the estimated correlation between
the value-added measures and principal ratings, which suggests this exclusion
should not bias our results in this regard.
32. Since we are comparing the relative value of using a test-based vs. principal-
based measure of performance, the most relevant comparison is between a move-
ment in the empirical (not actual) distribution of teacher effectiveness and the
principal rating.
degrees have students who score roughly 0.11 standard deviations higher
than their counterparts (although this relationship should not be inter-
preted as causal since education levels may well be associated with other
omitted teacher characteristics). In this district, however, compensation is
a complicated, nonlinear function of experience and education. Column 2
shows that actual compensation has no significant relationship to student
achievement. In results not shown here, we find that including polynomials
in compensation does not change this result. Columns 3 and 4 indicate that
principal ratings—both overall ratings and ratings of a teacher’s ability to
raise achievement—are significantly associated with higher student achieve-
ment. Conditional on prior student achievement, demographics, and class-
room covariates, students whose teachers receive an overall rating one
standard deviation above the mean are predicted to score roughly 0.07
standard deviations higher than students whose teacher received an aver-
age rating. Column 5 shows that a teacher’s value added is an even better
predictor of future student achievement gains, with a coefficient half again
as large as that on the overall principal ratings.33 The r-squared measures in the
bottom row indicate that none of the measures explain a substantial por-
tion of the variation across students, as one would expect, given that much
of the variation in nearly all student-level regressions occurs within the
classroom. Nonetheless, bootstrap tests indicate that the value-added mea-
sure explains significantly (at the 10% level in reading and the 5% level
in math) more of the variation in student achievement than the principal
ratings do. As shown in columns 7–11, the results for math are comparable
to the reading results.
To the extent that principal ratings are picking up a different dimension
of quality than the test-based measures, one might expect that combining
principal and value-added measures would yield a better predictor of
future achievement. Column 6 suggests that this might be the case. Con-
ditional on teacher value added, the principal’s overall rating of a teacher
is a significant predictor of student achievement. The results for math,
shown in column 12, are even stronger.
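In essence, these comparisons are student-level regressions of next-year achievement on one teacher-quality measure at a time (or on two at once, as in columns 6 and 12). A hedged sketch using statsmodels, with our own variable names and only a subset of the controls described above:

```python
import statsmodels.formula.api as smf

# Predict year-t math achievement with the prior-year principal rating and/or the
# EB value-added measure, conditioning on prior scores and student covariates.
# Clustering at the classroom level here is our simplification, not necessarily
# the paper's exact variance estimator.
model = smf.ols(
    "math_score ~ prior_math + prior_read + principal_rating_z + eb_value_added"
    " + C(grade) + free_lunch + special_ed + limited_english",
    data=students,
).fit(cov_type="cluster", cov_kwds={"groups": students["classroom_id"]})
print(model.summary())
```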
VI. Conclusions
In this article, we examine principals’ ability to identify teachers’ ability
to increase reading and math achievement. We build on prior literature
by using principal ratings that are aligned with the objective metric under
33. We examined the functional form of the relationship between both teacher
quality measures and student achievement but found that both were approximately
linear. Also, when we do not normalize the EB measure by the standard deviation
of teacher ability, the coefficient is insignificantly different from one, which we
would expect, given that the EB is essentially the conditional expectation of teacher
effectiveness.
Appendix
Nonparametric Measures of Association between Performance Indicators
In order to get a more intuitive understanding of the magnitude of the
relationship between principal ratings and actual teacher effectiveness, we
calculate several simple, nonparametric measures of the association be-
tween the subjective and objective performance indicators. While this
exercise is complicated somewhat by the existence of measurement error
in the teacher value-added estimates, it is relatively straightforward to
construct such measures through Monte Carlo simulations with only min-
imal assumptions. Following the notation in the text, we define the prin-
cipal’s assessment of teacher j as d̂Pj , the empirical Bayes estimated value
added of teacher j as d̂j, and the true ability of teacher j as dj. Our goal
is to calculate the following probabilities:

$$\Pr(\delta_j = t \mid \hat{\delta}^P_j = t), \qquad (A1)$$

$$\Pr(\delta_j = b \mid \hat{\delta}^P_j = b), \qquad (A2)$$
where t (b) indicates that the teacher was in the top (bottom) quantile of
the distribution. For example, equation (A1) is the probability that the
teacher is in the top quantile of the true ability distribution conditional
on being in the top quantile of the distribution according to the principal
assessment.
We can calculate the conditional probability of a teacher’s value-added
ranking given her principal ranking directly from the data. These prob-
abilities can be written as follows:
Note that the four equations also assume that the fact that the principal
rates a teacher in the top (bottom) category does not provide any addi-
tional information regarding whether estimated value added will be in the
top (bottom) category once we condition on whether the teacher’s true
ability is in the top (bottom) category. For example, in equation (A3), we
assume that $\Pr(\hat{\delta}_j = t \mid \delta_j = t) = \Pr(\hat{\delta}_j = t \mid \delta_j = t, \hat{\delta}^P_j = t)$ and
$\Pr(\hat{\delta}_j = t \mid \delta_j = b) = \Pr(\hat{\delta}_j = t \mid \delta_j = b, \hat{\delta}^P_j = t)$. While we do not believe this is
strictly true, it should not substantially bias our estimates.
We also know the following identities are true:
We can solve equations (A3) and (A7) to obtain equation (A1) as fol-
lows:
$$\Pr(\delta_j = t \mid \hat{\delta}^P_j = t) = \frac{\Pr(\hat{\delta}_j = t \mid \hat{\delta}^P_j = t) - \Pr(\delta_j = b \mid \hat{\delta}_j = t)\,[\Pr(\hat{\delta}_j = t)/\Pr(\delta_j = b)]}{\Pr(\delta_j = t \mid \hat{\delta}_j = t) - \Pr(\delta_j = b \mid \hat{\delta}_j = t)\,[\Pr(\hat{\delta}_j = t)/\Pr(\delta_j = b)]}. \qquad (A12)$$
$$\Pr(\delta_j = b \mid \hat{\delta}^P_j = b) = \frac{\Pr(\hat{\delta}_j = b \mid \hat{\delta}^P_j = b) - \Pr(\delta_j = t \mid \hat{\delta}_j = b)\,[\Pr(\hat{\delta}_j = b)/\Pr(\delta_j = t)]}{\Pr(\delta_j = b \mid \hat{\delta}_j = b) - \Pr(\delta_j = t \mid \hat{\delta}_j = b)\,[\Pr(\hat{\delta}_j = b)/\Pr(\delta_j = t)]}. \qquad (A17)$$
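A minimal sketch of how the reconstructed equation (A12) combines the data-based and simulation-based probabilities (argument names are ours; the analogous function for the bottom group follows equation [A17]):

```python
def prob_true_top_given_principal_top(p_est_top_given_prin_top,
                                      p_true_top_given_est_top,
                                      p_true_bot_given_est_top,
                                      p_est_top, p_true_bot):
    """Back out Pr(delta = t | principal rates top) from Pr(delta_hat = t |
    principal rates top) and the Monte Carlo quantities, per equation (A12)."""
    ratio = p_est_top / p_true_bot
    numerator = p_est_top_given_prin_top - p_true_bot_given_est_top * ratio
    denominator = p_true_top_given_est_top - p_true_bot_given_est_top * ratio
    return numerator / denominator
```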
References
Aaronson, Daniel, Lisa Barrow, and William Sander. 2007. Teachers and
student achievement in the Chicago public high schools. Journal of
Labor Economics 25, no. 1:95–135.
Alexander, Elmore R., and Ronnie D. Wilkins. 1982. Performance rating
validity: The relationship of objective and subjective measures of per-
formance. Group and Organization Studies 7, no. 4:485–96.
Armor, David, Patricia Conry-Oseguera, Millicent Cox, Nicelma King,
Lorraine McDonnell, Anthony Pascal, Edward Pauly, and Gail Zell-
man. 1976. Analysis of the school preferred reading program in selected
Los Angeles minority schools. Report no. R-2007-LAUSD. Rand Cor-
poration, Santa Monica, CA.
Ballou, Dale. 2001. Pay for performance in public and private schools.
Economics of Education Review 20:51–61.
Ballou, Dale, and Michael Podgursky. 2001. Teacher compensation: The
case for market-based pay. Education Matters 1, no. 1:16–25.
Bolino, Mark C., and William H. Turnley. 2003. Counternormative im-
pression management, likeability, and performance ratings: The use of
intimidation in an organizational setting. Journal of Organizational
Behavior 24, no. 2:237–50.
Bommer, William H., Jonathan L. Johnson, Gregory A. Rich, Philip M.
Podsakoff, and Scott B. MacKenzie. 1995. On the interchangeability
of objective and subjective measures of employee performance: A meta-
analysis. Personnel Psychology 48, no. 3:587–605.
Bull, Clive. 1987. The existence of self-enforcing implicit contracts. Quar-
terly Journal of Economics 102, no. 1:147–60.
Clotfelter, Charles T., Helen F. Ladd, and Jacob L. Vigdor. 2006. Teacher-
student matching and the assessment of teacher effectiveness. NBER
Working Paper no. 11936, National Bureau of Economic Research,
Cambridge, MA.
Figlio, David N. 1997. Teacher salaries and teacher quality. Economics
Letters 55, no. 2:267–71.
Figlio, David N., and Cecilia Elena Rouse. 2006. Do accountability and
voucher threats improve low-performing schools? Journal of Public
Economics 90, nos. 1–2:239–55.
Figlio, David N., and Joshua Winicki. 2005. Food for thought: The effects
of school accountability on school nutrition. Journal of Public Eco-
nomics 89, nos. 2–3:381–94.
Hanushek, Eric A. 1986. The economics of schooling: Production and
efficiency in public schools. Journal of Economic Literature 24, no. 3:
1141–77.
———. 1997. Assessing the effects of school resources on student per-
formance: An update. Educational Evaluation and Policy Analysis 19,
no. 2:141–64.
Hanushek, Eric A., John Kain, Daniel M. O’Brien, and Steven G. Rivkin.
2005. The market for teacher quality. NBER Working Paper no. 11154,
National Bureau of Economic Research, Cambridge, MA.
Hanushek, Eric A., and Steven G. Rivkin. 2004. How to improve the
supply of high quality teachers. In Brookings papers on education policy
2004, ed. Diane Ravitch. Washington, DC: Brookings Institution Press.
Harris, Douglas, and Timothy Sass. 2006. The effects of teacher training
on teacher value-added. Working paper, Economics Department, Flor-
ida State University, Tallahassee.
Heneman, Robert L. 1986. The relationship between supervisory ratings
and results-oriented measures of performance: A meta-analysis. Personnel
Psychology 39:811–26.
Heneman, Robert L., David B. Greenberger, and Chigozie Anonyuo.
1989. Attributions and exchanges: The effects of interpersonal factors
on the diagnosis of employee performance. Academy of Management
Journal 32, no. 2:466–76.
Jacob, Brian A. 2005. Accountability, incentives and behavior: The impact
of high-stakes testing in the Chicago public schools. Journal of Public
Economics 89, nos. 5–6:761–96.
Jacob, Brian A., and Lars Lefgren. 2005a. Principals as agents: Subjective
performance measurement in education. NBER Working Paper no.
11463, National Bureau of Economic Research, Cambridge, MA.
———. 2005b. What do parents value in education? An empirical inves-
tigation of parents’ revealed preferences for teachers. NBER Working
Paper no. 11494, National Bureau of Economic Research, Cambridge,
MA.
Jacob, Brian A., and Stephen D. Levitt. 2003. Rotten apples: An inves-
tigation of the prevalence and predictors of teacher cheating. Quarterly
Journal of Economics 118, no. 3:843–78.
Kane, T., and D. Staiger. 2002. Volatility in school test scores: Implications
for test-based accountability systems. In Brookings papers on education