NORC at the University of Chicago
The University of Chicago

Can Principals Identify Effective Teachers? Evidence on Subjective Performance Evaluation in Education
Author(s): Brian A. Jacob and Lars Lefgren
Source: Journal of Labor Economics, Vol. 26, No. 1 (January 2008), pp. 101-136
Published by: The University of Chicago Press on behalf of the Society of Labor Economists and the NORC at the University of Chicago
Stable URL: http://www.jstor.org/stable/10.1086/522974
Accessed: 10/11/2014 14:40

Can Principals Identify Effective
Teachers? Evidence on Subjective
Performance Evaluation in Education
Brian A. Jacob, University of Michigan,
National Bureau of Economic Research

Lars Lefgren, Brigham Young University

We examine how well principals can distinguish between more and less effective teachers. To put principal evaluations in context, we compare them with the traditional determinants of teacher compensation—education and experience—as well as value-added measures of teacher effectiveness based on student achievement gains. We present “out-of-sample” predictions that mitigate concerns that the teacher quality and student achievement measures are determined simultaneously. We find that principals can generally identify teachers who produce the largest and smallest standardized achievement gains but have far less ability to distinguish between teachers in the middle of this distribution.

We would like to thank Joseph Price and J. D. LaRock for their excellent research assistance. We thank David Autor, Joe Doyle, Sue Dynarski, Amy Finkelstein, Chris Hansen, Robin Jacob, Jens Ludwig, Frank McIntyre, Jonah Rockoff, Doug Staiger, Thomas Dee, and seminar participants at the University of California, Berkeley, Northwestern University, Brigham Young University, Columbia University, Harvard University, MIT, and the University of Virginia for helpful comments. All remaining errors are our own. Contact the corresponding author, Brian Jacob, at [email protected].

[ Journal of Labor Economics, 2008, vol. 26, no. 1]


© 2008 by The University of Chicago. All rights reserved.
0734-306X/2008/2601-0006$10.00


I shall not today attempt further to define the kinds of material I understand to be embraced . . . but I know it when I see it. (Justice Potter Stewart [trying to define obscenity])

I. Introduction
One of the most striking findings in recent education research involves
the importance of teacher quality. A series of new papers have documented
substantial variation in teacher effectiveness in a variety of settings, even
among teachers in the same school (Rockoff 2004; Hanushek et al. 2005;
Aaronson, Barrow, and Sander 2007). The differences in teacher quality
are dramatic. For example, recent estimates suggest that the benefit of
moving a student from an average teacher to one at the 85th percentile
is comparable to a 33% reduction in class size (Rockoff 2004; Hanushek
et al. 2005). The difference between having a series of very good teachers
versus very bad teachers can be enormous (Sanders and Rivers 1996). At
the same time, researchers have found little association between observable
teacher characteristics and student outcomes—a notable exception being
a large and negative first-year teacher effect (see Hanushek [1986, 1997]
for reviews of this literature and Rockoff [2004] for recent evidence on
teacher experience effects).1 This is particularly puzzling given the likely
upward bias in such estimates (Figlio 1997; Rockoff 2004).
Private schools and most institutions of higher education implicitly
recognize such differences in teacher quality by compensating teachers,
at least in part, on the basis of ability (Ballou 2001; Ballou and Podgursky
2001). However, public school teachers have resisted such merit-based
pay due, in part, to a concern that administrators will not be able to
recognize (and thus properly reward) quality (Murnane and Cohen 1986).2
At first blush, this concern may seem completely unwarranted. Principals
not only interact with teachers on a daily basis—reviewing lesson plans,
observing classes, talking with parents and children—but also have ready
access to student achievement scores. Prior research, however, suggests
that this task might not be as simple as it seems. Indeed, the consistent finding that certified teachers are no more effective than their uncertified colleagues suggests that commonly held beliefs among educators may be mistaken.

1. There is some limited evidence that cognitive ability, as measured by a score on a certification exam, for example, may be positively associated with teacher effectiveness, although other studies suggest that factors such as the quality of one’s undergraduate institution are not systematically associated with effectiveness. For a review of the earlier literature relating student achievement to teacher characteristics, see Hanushek (1986, 1997). In recent work, Clotfelter, Ladd, and Vigdor (2006) find teacher ability is correlated with student achievement, while Harris and Sass (2006) find no such association.

2. Another concern, which we discuss below, involves favoritism or the simple capriciousness of ratings.
In this article, we examine how well principals can distinguish between
more and less effective teachers, where effectiveness is measured by the
ability to raise student math and reading achievement. In other words,
do school administrators know good teaching when they see it? We find
that principals are quite good at identifying those teachers who produce
the largest and smallest standardized achievement gains in their schools
(i.e., the top and bottom 10%–20%) but have far less ability to distin-
guish between teachers in the middle of this distribution (i.e., the middle
60%–80%). This is not a result of a highly compressed distribution of
teacher ability.
While there are several limitations to our analysis, which we describe
in later sections of this article, our results suggest that policy makers
should consider incorporating principal evaluations into teacher compen-
sation and promotion systems. To the extent that principal judgments
focus on identifying the best and worst teachers, for example, to determine
bonuses and teacher dismissal, the evidence presented here suggests that
such evaluations would help promote student achievement. Principals can
also evaluate teachers on the basis of a broader spectrum of educational
outputs, including nonachievement outcomes valued by parents.
More generally, our findings inform the education production function
literature, providing compelling evidence that good teaching is, at least
to some extent, observable by those close to the education process even
though it may not be easily captured in those variables commonly avail-
able to the econometrician. The article also makes a contribution to the
empirical literature on subjective performance assessment by demonstrating
the importance of accounting for estimation error in measured productivity
and showing that the relationship between subjective evaluations and actual
productivity can vary substantially across the productivity distribution.
The remainder of the article proceeds as follows. In Section II, we
review the literature on objective and subjective performance evaluation.
In Section III, we describe our data and in Section IV outline how we
construct the different measures of teacher effectiveness. The main results
are presented in Section V. We conclude in Section VI.

II. Background
A. Prior Literature
The theoretical literature on subjective performance evaluation has fo-
cused largely on showing the conditions under which efficient contracts
are possible (Bull 1987; MacLeod and Malcomson 1989). Prendergast
(1993) and Prendergast and Topel (1996) show how the existence of subjective evaluation helps to explain several features of organizations such as the tendency of employers to agree with their employees. Recent work
demonstrates that the compression and leniency in performance evalua-
tions (relative to actual performance) often found in practice are features
of the optimal contract between a risk-neutral principal and a risk-averse
agent when rewards are based on subjective performance evaluation (Levin
2003; MacLeod 2003).
The empirical literature on subjective performance measurement has
focused largely on understanding the extent to which subjective supervisor
ratings match objective measures of employee performance and the extent
to which subjective evaluations are biased. This research suggests that
there is a relatively weak relationship between subjective ratings and ob-
jective performance (see Heneman [1986] and Bommer et al. [1995] for
good reviews) and that supervisor evaluations are indeed often influenced
by a number of nonperformance factors such as the age and gender of
the supervisor and subordinate and the likability of the subordinate (Al-
exander and Wilkins 1982; Heneman, Greenberger, and Anonyuo 1989;
Wayne and Ferris 1990; Lefkowitz 2000; Varma and Stroh 2001; Bolino
and Turnley 2003).3
The studies most closely related to the present analysis examine the
subjective evaluations given to elementary school teachers by their prin-
cipals. A collection of studies in the education literature report quite small
correlations between principal evaluations and student achievement, al-
though these studies are generally based on small, nonrepresentative sam-
ples, do not account properly for measurement error, and rely on objective
measures of teacher performance that are likely biased (Medley and Coker
1987; Peterson 1987, 2000).4 The best study on this topic examines the
relationship between teacher evaluations and student achievement among
second and third graders in the New Haven, CT, public schools (Murnane
1975). Conditional on prior student test scores and demographic controls,
the author found that teacher evaluations were significant predictors of
student achievement, although the magnitude of the relationship was only
modest.5
3. Prendergast (1999) observes that such biases create an incentive for employees to engage in inefficient “influence” activities. Wayne and Ferris (1990) provide some empirical support for this hypothesis.

4. The few studies that examine the correlation between principal evaluations and other measures of teacher performance, such as parent or student satisfaction, find similarly weak relationships (Peterson 1987, 2000).

5. While it is difficult to directly compare these results to the education studies, the magnitude of the relationship appears to be modest. Murnane (1975) found that for third-grade math, an increase in the principal rating of roughly one standard deviation was associated with an increase of 1.3 standard scores (or 0.125 standard deviations). The magnitude of the reading effect was somewhat smaller. Armor et al. (1976) present similar results for a subset of high-poverty schools in Los Angeles. They found that a one standard deviation increase in teacher effectiveness led to a 1–2 point raw score gain (although it is not possible to calculate the effect size given the available information in the study).

B. Conceptual Framework
In order to provide a basis for interpreting the empirical findings in
this article, it is useful to consider how principals might form opinions
of teachers. Given the complexity of principal belief formation, and the
limited objectives of this article, we do not develop a formal model. Rather,
we describe the sources of information available to principals and how
they might interpret the signals they receive.
Each year principals receive a series of noisy signals of a teacher’s
performance, stemming from three main sources: (1) formal and informal
observations of the teacher working with students and/or interacting with
colleagues around issues of pedagogy or curriculum, (2) reports from
parents—either informal assessments or formal requests to have a child
placed with the teacher (or not placed with the teacher), and (3) student
achievement scores. Principals will differ in their ability and/or inclination
to gather and incorporate information from these sources and in the weight
that they place on each of the sources. A principal will likely have little
information on first-year teachers, particularly at the point when we sur-
veyed principals; namely, in February, before student testing took place and
before parents began requesting specific teachers for the following year.
Principals may differ with respect to the level of sophistication with
which they collect information and interpret the signals they receive. For
example, principals may be aware of the level of test scores in the teacher’s
classroom but unable to account for differences in classroom composition.
In this case, principal ratings might be more highly correlated to the level
of test scores than to teacher value added if little information besides test
scores was used to construct ratings. Also, principals might vary in how
they deal with the noise inherent in the signals they observe. A naive
principal might simply report the signal she observes regardless of the
variance of the noise component. A more sophisticated principal, however,
might act as a Bayesian, down-weighting noisy signals.6 Finally, it seems
likely that a principal’s investment in gathering information on and up-
dating beliefs about a particular teacher will be endogenously determined
by a variety of factors, including the initial signal the principal observes
as well as the principal’s assessment regarding how much a teacher can
benefit from advice and training.
Ultimately, in this article we limit our examination to the accuracy of principal ratings regardless of the strategies principals use to construct them. In an early version of this article, we presented evidence inconsistent with principals acting as perfect Bayesians. Based on this work, we concluded that the actual algorithms used by principals to form their opinions of teacher effectiveness are likely to be quite complex and highly variable across individual administrators. For this reason, we defer to future research the construction and testing of behavioral models of principal evaluations.

6. Indeed, a simple model of principals as perfect Bayesians would generate strong implications regarding how the accuracy of ratings evolves with the time a principal and a teacher spend together. For example, this type of model would imply that the variance of principal ratings increases over time (assuming that the variance of true teacher ability does not change over time).

III. Data
The data for this study come from a midsize school district located in
the western United States. The student data include all of the common
demographic variables, as well as standardized achievement scores, and
allow us to track the students over time. The teacher data, which we can
link to students, include a variety of teacher characteristics that have been
used in previous studies, such as age, experience, educational attainment,
undergraduate and graduate institution attended, and license and certi-
fication information. With the permission of the district, we surveyed all
elementary school principals in February 2003 and asked them to rate the
teachers in their schools along a variety of different performance dimen-
sions.
To provide some context for the analysis, table 1 shows summary sta-
tistics from the district. While the students in the district are predomi-
nantly white (73%), there is a reasonable degree of heterogeneity in terms
of ethnicity and socioeconomic status. Latino students make up 21% of
the elementary population, and nearly half of all students in the district
(48%) receive free or reduced-price lunch. Achievement levels in the dis-
trict are almost exactly at the average of the nation (49th percentile on
the Stanford Achievement Test).
The primary unit of analysis in this study is the teacher. To ensure that
we could link student achievement data to the appropriate teacher, we
limit our sample to elementary teachers who were teaching a core subject
during the 2002–3 academic year.7 We exclude kindergarten and first-grade
teachers because achievement exams are not available for these students.8
Our sample consists of 201 teachers in grades 2–6. Like the students,
the teachers in our sample are fairly representative of elementary school
teachers nationwide. Only 16% of teachers in our sample are men.
7. We exclude noncore teachers such as music teachers, gym teachers, and librarians. Note, however, that in calculating teacher value-added measures, we use all student test scores from 1997–98 through 2004–5.

8. Achievement exams are given to students in grades 1–6. In order to create a value-added measure of teacher effectiveness, it is necessary to have prior achievement information for the student, which eliminates kindergarten and first-grade students.

Table 1
Summary Statistics

                                                               Mean     SD
Student characteristic:
  Male                                                          .51
  White                                                         .73
  Black                                                         .01
  Hispanic                                                      .21
  Other                                                         .06
  Limited English proficiency                                   .21
  Free or reduced-price lunch                                   .48
  Special education                                             .12
  Math achievement (national percentile)                        .49
  Reading achievement (national percentile)                     .49
  Language achievement (national percentile)                    .47
Teacher characteristic:
  Male                                                          .16     .36
  Age                                                          41.9    12.5
  Untenured                                                     .17     .38
  Experience                                                   11.9     8.9
  Fraction with 10–15 years experience                          .19     .40
  Fraction with 16–20 years experience                          .14     .35
  Fraction with 21+ years experience                            .16     .37
  Years working with principal                                  4.8     3.6
  BA degree at in-state (but not local) college                 .10     .30
  BA degree at out-of-state college                             .06     .06
  MA degree                                                     .16     .16
  Any additional endorsements                                   .20     .40
  Any additional endorsements in areas other than ESL           .10     .31
  Licensed in more than one area                                .26     .44
  Licensed in area other than early childhood education
    or elementary education                                     .07     .26
  Grade 2                                                       .23     .42
  Grade 3                                                       .21     .41
  Grade 4                                                       .20     .40
  Grade 5                                                       .18     .38
  Grade 6                                                       .18     .39
  Mixed-grade classroom                                         .07     .26
  Two teachers in the classroom                                 .05     .22
  Number of teachers                                           201
  Number of principals                                          13

Note.—Student characteristics are based on students enrolled in grades 2–6 in spring 2003. Math and reading achievement measures are based on the spring 2002 scores on the Stanford Achievement Test (version 9) taken by selected elementary grades in the district. Teacher characteristics are based on administrative data. Nearly all teachers in the district are Caucasian, so race indicators are omitted.

The average teacher is 42 years old and has roughly 12 years of experience teaching. The vast majority of teachers attended the main local university,
while 10% attended another in-state college, and 6% attended a school
out of state. Sixteen percent of teachers have a master’s degree or higher,
and the vast majority of teachers are licensed in either early childhood
education or elementary education. Finally, 7% of the teachers in our
sample taught in a mixed-grade classroom in 2002–3, and 5% were in a
“split” classroom with another teacher.
In this district, elementary students take a set of “core” exams in reading and math in grades 1–8. These multiple-choice, criterion-referenced exams cover topics that are closely linked to the district learning objectives. While
student achievement results have not been directly linked to rewards or
sanctions until recently, the results of the core exams are distributed to
parents and published annually. Citing these factors, district officials sug-
gest that teachers and principals have focused on this exam even before
the recent passage of the federal accountability legislation (No Child Left
Behind).

A. Principal Assessments of Teacher Effectiveness


To obtain subjective performance assessments, we administered a survey
to all elementary school principals in February 2003 asking them to eval-
uate their teachers along a variety of dimensions.9 Principals were asked
to rate teachers on a scale from 1 (inadequate) to 10 (exceptional). Im-
portantly, principals were asked to not only provide a rating of overall
teacher effectiveness but also to assess a number of specific teacher char-
acteristics, including dedication and work ethic, classroom management,
parent satisfaction, positive relationship with administrators, and ability
to raise math and reading achievement. Principals were assured that their
responses would be completely confidential and would not be revealed
to the teachers or to any other school district employee.
Table 2 presents the summary statistics of each rating. While there was
some heterogeneity across principals, the ratings are generally quite high,
with an average of 8.07 and a 10th–90th percentile range from 6 to 10. The average
rating for the least generous principal was 6.7. At the same time, however,
there appears to be considerable variation within school. Figure 1 shows
histograms where each teacher’s rating has been normalized by subtracting
the median rating within the school for that same item. It appears that
principal ratings within a given school are roughly normally distributed
with five to six relevant categories. Perhaps not surprisingly, the principal
responses to some individual survey items are highly correlated (e.g., the correlation between teacher organization and classroom management exceeds 0.7), while others are less highly correlated (e.g., the correlation between role model and relationship with colleagues is less than 0.4).

Fig. 1.—Distribution of principal ratings of a teacher’s ability to raise student achievement
Because principal ratings differ in terms of the degree of leniency and compression, we normalize the ratings by subtracting from each rating the principal-specific mean for that question and dividing by the school-specific standard deviation.

9. In this district, principals conduct formal evaluations annually for new teachers and every third year for tenured teachers. However, prior studies have found such formal evaluations suffer from considerable compression with nearly all teachers being rated very highly. These evaluations are also part of a teacher’s personnel file, and it was not possible to obtain access to these without permission of the teachers.
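As a concrete illustration, the normalization just described can be written in a few lines; the sketch below uses pandas with purely hypothetical data and column names:

import pandas as pd

# One row per teacher-item pair, with hypothetical columns.
ratings = pd.DataFrame({
    "school":    ["A", "A", "A", "B", "B", "B"],
    "principal": ["p1", "p1", "p1", "p2", "p2", "p2"],
    "item":      ["overall"] * 6,
    "rating":    [8, 9, 6, 10, 7, 8],
})

# Subtract the principal-specific mean for the item, then divide by the
# school-specific standard deviation of that item.
centered = ratings["rating"] - ratings.groupby(["principal", "item"])["rating"].transform("mean")
school_sd = ratings.groupby(["school", "item"])["rating"].transform("std")
ratings["normalized"] = centered / school_sd
print(ratings)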


Table 2
Summary Statistics for Principal Ratings

                                              Mean      10th         90th
Item                                          (SD)      Percentile   Percentile
Overall teacher effectiveness                 8.07       6.5          10
                                             (1.36)
Dedication and work ethic                     8.46       6            10
                                             (1.54)
Organization                                  8.04       6            10
                                             (1.60)
Classroom management                          8.06       6            10
                                             (1.63)
Raising student math achievement              7.89       6             9
                                             (1.30)
Raising student reading achievement           7.90       6            10
                                             (1.44)
Role model for students                       8.35       7            10
                                             (1.34)
Student satisfaction with teacher             8.36       7            10
                                             (1.20)
Parent satisfaction with teacher              8.28       7            10
                                             (1.30)
Positive relationship with colleagues         7.94       6            10
                                             (1.72)
Positive relationship with administrators     8.30       6            10
                                             (1.66)

Note.—These statistics are based on the 202 teachers included in the analysis sample.


B. Value-Added Measures of Teacher Ability to Raise Standardized Achievement Scores
The primary challenge in estimating measures of teacher effectiveness
using student achievement data involves the potential for nonrandom
assignment of students to classes. Following the standard practice in this
literature, we estimate value-added models that control for a wide variety
of observable student and classroom characteristics, including prior
achievement measures (see, e.g., Hanushek and Rivkin 2004; Rockoff
2004; Aaronson et al. 2007). The richness of our data allows us to observe
teachers over multiple years and thus to distinguish permanent teacher
quality from idiosyncratic class-year shocks and to estimate a teacher
experience gradient utilizing variation within individual teachers.
For our baseline specification, we use a panel of student achievement
data from 1997–98 through 2002–3 to estimate the following model:
$$y_{ijkt} = C_{jt}B + X_{it}G + w_t + f_k + d_j + a_{jt} + \varepsilon_{ijkt}, \qquad (1)$$
where i indexes students, j indexes teachers, k indexes school, and t indexes
year. The outcome measure y represents a student’s score on a math or reading exam. The scores are reported as the percentage of items that the
student answered correctly, but we normalize achievement scores to be
mean zero and standard deviation to be one within each year-grade. The
vector X consists of the following student characteristics: age, race, gender,
free lunch eligibility, special education placement, limited English profi-
ciency status, prior math achievement, prior reading achievement, and
grade fixed effects, and C is a vector of classroom measures that include
indicators for class size and average student characteristics. A set of year
and school fixed effects are $w_t$ and $f_k$, respectively. Teacher j’s contribution to value added is captured by the $d_j$’s. The error term $a_{jt}$ is common to all students in teacher j’s classroom in period t (e.g., adverse testing conditions faced by all students in a particular class, such as a barking dog). The error term $\varepsilon_{ijkt}$ captures the student’s idiosyncratic error. In
order to account for the correlation of students within classrooms, we
correct the standard errors using the method suggested by Moulton
(1990).10
10. Another possibility would be to use cluster-corrected standard errors. However, the estimated standard errors behave very poorly for teachers who are in the sample for a small number of years. It is also possible to estimate a model that includes a random teacher-year effect, which should theoretically provide more efficient estimates. In practice, however, the random effect estimates are comparable to those we present in terms of efficiency. The intraclass correlation coefficients calculated as part of the Moulton procedure are roughly .06 in reading and .09 in mathematics.


It is worth noting that our value-added model differs from standard practice in one respect. Given that ratings are normalized to have equal
mean and variance for each principal, a value-added indicator that mea-
sures effectiveness relative to a district rather than school average will be
biased downward.11 To ensure that we identify estimates of teacher quality
relative to other teachers within the same school, we (a) examine teachers
who are in their most recent school (i.e., for the small number of switching
teachers, we drop observations from their first school), (b) include school
fixed effects, and then (c) constrain the teacher fixed effects to sum to
zero within each school.12
While there is no way (short of randomly assigning students and teach-
ers to classrooms) to completely rule out the possibility of selection bias,
several pieces of evidence suggest that such nonrandom sorting is unlikely
to produce a substantial bias in our case. To account for unobservable,
time-invariant student characteristics (e.g., motivation or parental involve-
ment), we estimate value-added models that utilize achievement normal-
ized gains as the outcome and include student fixed effects (but not prior
achievement measures) as controls.13 As we show below, our results are
robust to models that include student fixed effects.
11. Typical value-added models simply contain school fixed effects that identify teacher quality relative to all teachers (or some omitted teacher) in the district. The comparison of teachers across schools is identified by both covariates in the model as well as the fact that the same teachers switch schools during the sample period. To see that the use of district-relative value-added measures will lead to a downward bias, consider the possibility that teachers in certain schools have systematically higher value-added measures than teachers in other schools. In the extreme, value added could be perfectly sorted across schools so that all of the teachers in the “best” school have higher value added than teachers in the “second-best” school, and teachers in this second-best school all have higher value-added scores than the teachers in the third-best school, etc. However, because we have normalized principal ratings within a school, this means that half of the teachers in the best school will, by construction, have “below average” subjective ratings. The fact that the subjective and objective ratings are measured on different scales in this example essentially introduces measurement error that will attenuate our correlations.

12. The fact that principals are likely using different scales when evaluating teachers makes any correlation between supervisor ratings and a districtwide productivity measure largely uninformative (even in the case where principals were attempting to evaluate their own teachers relative to all others in the district).

13. Following Hanushek et al. (2005) and Reback (2005), we normalized achievement gains by student prior ability. Specifically, we divide students into q different quantiles based on their prior achievement score, $y_{ijkt-1}$, and then calculate the mean and standard deviation of achievement gains ($g_{ijkt} = y_{ijkt} - y_{ijkt-1}$) for each quantile separately, which we denote as $m_g^q$ and $\sigma_g^q$, respectively. We then create standardized gain scores that are mean zero and unit standard deviation within each quantile: $G_{ijkt}^q = (g_{ijkt} - m_g^q)/\sigma_g^q$. This ensures that each student’s achievement gain is compared to the improvement of comparable peers.
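A minimal sketch of the quantile-by-quantile standardization in note 13, using synthetic data and hypothetical column names:

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"prior": rng.normal(size=1000)})
df["gain"] = 0.1 * df["prior"] + rng.normal(scale=0.5, size=1000)

# Group students into quantiles of prior achievement, then standardize
# gains to mean zero and unit variance within each quantile.
df["quantile"] = pd.qcut(df["prior"], q=5, labels=False)
grp = df.groupby("quantile")["gain"]
df["std_gain"] = (df["gain"] - grp.transform("mean")) / grp.transform("std")
print(df.groupby("quantile")["std_gain"].agg(["mean", "std"]).round(3))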


We do not include such models as our baseline, however, because fewer students contribute to the identification of teacher value added with student fixed effects (e.g., sixth-grade students in the first year of our data and second-grade students in
the last are omitted). Also, the correlation of the estimation error across
individual teacher value-added measures is higher than in our baseline
specifications, complicating estimation. For more details on the value-
added models used in this article, see Jacob and Lefgren (2005a).
Before we turn to our primary objective, it is useful to consider the
teacher value-added measures that we estimate. After adjusting for esti-
mation error, we find that the standard deviation of teacher quality is 0.12
in reading and 0.26 in math, which appears roughly consistent with recent
literature on teacher effects.14 Because the dependent variable is a state-
specific, criterion-referenced test that we have normalized within grade-
year for the district, we take advantage of the fact that in recent years
third and fifth graders in the district have also taken the nationally normed
Stanford Achievement Test (SAT9) in reading and math. To provide a
better sense of the magnitude of these effects, one can determine how a
one standard deviation unit change on the core exam translates into na-
tional percentile points. This comparison suggests that moving a student
from an average teacher to a teacher one standard deviation above the
mean would result in roughly a 2–3 percentile point increase in test scores
(an increase of roughly 0.07–0.10 standard deviation units).

IV. Empirical Strategy


The primary objective of our analysis is to determine how well prin-
cipals can distinguish between more and less effective teachers as measured
by student achievement gains on standardized exams. This section outlines
the strategies we employ to address this issue. Our empirical methods
take into account estimation error in our value-added measures and the
ordinal nature of principal ratings.
Perhaps the most straightforward measure of association between the
subjective principal assessments and the objective value-added measures
is a simple correlation. However, the estimation error in our value-added
measures—which arises not only from sampling variation but also from
idiosyncratic factors that operate at the classroom level in a particular
year (e.g., a dog barking in the playground, a flu epidemic during testing week, or something about the dynamics of a particular group of children)—will lead us to understate the correlation between the principal ratings and the value-added indicators.15 To see this, note that the observed value added can be written as $\hat{d}_{OLS} = d + e$, where $d$ is the true fixed effect and $e$ represents estimation error. If we denote the principal rating as $\hat{d}_P$, then it is simple to show that the correlation between principal rating and observed value added is biased downward relative to the correlation between principal rating and true value added:

$$\mathrm{Corr}(\hat{d}_P, \hat{d}_{OLS}) = \frac{\mathrm{Cov}(\hat{d}_P, \hat{d}_{OLS})}{\sqrt{\mathrm{Var}(\hat{d}_P)\,\mathrm{Var}(\hat{d}_{OLS})}} = \frac{\mathrm{Cov}(\hat{d}_P, d)}{\sqrt{\mathrm{Var}(\hat{d}_P)\,[\mathrm{Var}(d) + \mathrm{Var}(e)]}} < \mathrm{Corr}(\hat{d}_P, d). \qquad (2)$$

14. Hanushek et al. (2005), for example, find that one standard deviation in the teacher quality distribution is associated with a 0.22 standard deviation increase in math on the Texas state assessment. Rockoff (2004) finds considerably smaller effects—namely, that a one standard deviation increase in the teacher fixed-effect distribution raises student math and reading achievement by roughly 0.10 standard deviations on a nationally standardized scale. Examining high school students, Aaronson et al. (2007) find that a one standard deviation improvement in teacher quality leads to a .20 improvement in math performance over the course of a year.

Fortunately, it is relatively simple to correct for this using the observed estimation error from the value-added model. We obtain a measure of the true variance by subtracting the mean error variance (the average of the squared standard errors on the estimated teacher fixed effects) from the variance of the observed value-added measures: $\mathrm{Var}(d) = \mathrm{Var}(\hat{d}_{OLS}) - \mathrm{Var}(e)$.16 Then we can simply multiply the observed correlation, $\mathrm{Corr}(\hat{d}_P, \hat{d}_{OLS})$, by $\sqrt{\mathrm{Var}(\hat{d}_{OLS})}/\sqrt{\mathrm{Var}(d)}$ to obtain the adjusted correlation. We obtain the standard errors using a bootstrap.17
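As an illustration (not the authors' code), the sketch below applies this correction to simulated data and bootstraps the standard error, assuming homoskedastic estimation error:

import numpy as np

rng = np.random.default_rng(2)
n = 200
true_va = rng.normal(scale=0.25, size=n)       # true teacher effects
se = np.full(n, 0.15)                          # estimation-error SDs
va_hat = true_va + rng.normal(scale=se)        # noisy value-added estimates
rating = 2.0 * true_va + rng.normal(scale=0.8, size=n)   # principal rating

def adjusted_corr(rating, va_hat, se):
    # Var(d) = Var(d_hat) - mean squared standard error
    var_true = np.var(va_hat, ddof=1) - np.mean(se ** 2)
    raw = np.corrcoef(rating, va_hat)[0, 1]
    return raw * np.sqrt(np.var(va_hat, ddof=1) / var_true)

boot = []
for _ in range(1000):
    idx = rng.integers(0, n, n)                # resample teachers
    boot.append(adjusted_corr(rating[idx], va_hat[idx], se[idx]))
print(round(adjusted_corr(rating, va_hat, se), 3), round(float(np.std(boot)), 3))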
Note that this adjustment assumes that a principal’s rating is unrelated to the error of our OLS estimate of teacher effectiveness. Specifically, we assume that the numerator in equation (2) can be rewritten as follows: $\mathrm{Cov}(\hat{d}_P, \hat{d}_{OLS}) = \mathrm{Cov}(\hat{d}_P, d) + \mathrm{Cov}(\hat{d}_P, e)$. This would not be true if the principals were doing the same type of statistical analysis as we are to determine teacher effectiveness. However, to the extent that principals base their ratings largely on classroom observations, discussions with students and parents, and other factors unobservable to the econometrician, this assumption will hold. To the extent that this is not true and principals base their ratings, even partially, on the observed test scores (in the same manner as the value-added model does—that is, conditioning on a variety of covariates), the correlation we calculate will be biased upward.18

15. We will use the terms “estimation error” and “measurement error” interchangeably, although in the testing context measurement error often refers to the test-retest reliability of an exam, whereas the error stemming from sampling variability is described as estimation error.

16. This assumes that the OLS estimates of the teacher fixed effects are not correlated with each other. This would be true if the value-added estimates were calculated with no covariates. Estimation error in the coefficients of the covariates generates a nonzero covariance between teacher fixed effects, though in practice the covariates are estimated with sufficient precision that this is not a problem.

17. For our baseline specifications we perform 1,000 iterations. For the robustness and heterogeneity checks we perform 500 iterations. We also perform a principal-level block bootstrap. The inference from this procedure (shown below) is similar to our baseline results.
In addition to biasing our correlations, estimation error will lead to
attenuation bias if we use the teacher value-added measures as an ex-
planatory variable in a regression context.19 To account for attenuation
bias when we use the teacher value added in a regression context, we
construct empirical Bayes (EB) estimates of teacher quality. This ap-
proach was suggested by Kane and Staiger (2002) for producing efficient
estimates of school quality but has a long history in the statistics lit-
erature (see, e.g., Morris 1983).20 For more information on our calcu-
lation of the EB estimates, see Jacob and Lefgren (2005a).
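As a rough illustration of the shrinkage involved, the sketch below uses one common empirical Bayes form, pulling noisier estimates more strongly toward the mean; it is a generic stand-in under a normality assumption, not necessarily the paper's exact estimator:

import numpy as np

def empirical_bayes(va_hat, se):
    """Shrink noisy teacher effects toward the mean, more so when the estimate is noisy."""
    var_true = max(np.var(va_hat, ddof=1) - np.mean(se ** 2), 0.0)
    reliability = var_true / (var_true + se ** 2)
    return np.mean(va_hat) + reliability * (va_hat - np.mean(va_hat))

va_hat = np.array([0.30, -0.20, 0.05, 0.50, -0.40])   # hypothetical estimates
se = np.array([0.10, 0.25, 0.15, 0.30, 0.20])         # their standard errors
print(empirical_bayes(va_hat, se).round(3))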
While the correlation between objective and subjective performance
measures is a useful starting point, it has several limitations. Most im-
portant, the principal ratings may not have a cardinal interpretation, which
would make the correlation impossible to interpret. For example, the
difference between a six and seven rating may be greater or less in absolute
terms than the difference between a nine and ten. Second, correlations
are quite sensitive to outliers. Third, while we have normalized the prin-
cipal ratings to ensure that each principal’s ratings have the same variance,
it is possible that the variance of value added differs across schools. In
this case, stacking the data could provide misleading estimates of the
average correlation between principal ratings and value added within the district. Finally, a simple correlation does not tell us whether principals are more effective at identifying teachers at certain points on the ability distribution.

18. The correlations (and associated nonparametric statistics) may understate the relation between objective and subjective measures if principals have been able to remove or counsel out the teachers that they view as the lowest quality. However, our discussions with principals and district officials suggest that this occurs rarely and is thus unlikely to introduce a substantial bias in our analysis. Similarly, the correlations may be biased downward if principals assign teachers to classrooms in a compensatory way—i.e., principals assign more effective teachers to classrooms with more difficult students. In this case, our value-added measures will be attenuated (biased toward zero), which will reduce the correlation between our subjective and objective measures.

19. If the value-added measure is used as a dependent variable, it will lead to less precise estimates relative to using a measure of true teacher ability. Measurement error will also lead us to overstate the variance of teacher effects, although this is a less central concern for the analysis presented here.

20. In fact, the EB approach described here is very closely related to the errors-in-variables approach that allows for heteroskedastic measurement error outlined by Sullivan (2001).
For these reasons, we estimate a nonparametric measure of the asso-
ciation between ratings and productivity. Specifically, we calculate the
probability that a principal can correctly identify teachers in the top or
bottom group within his or her school. If we knew the true ability of
each teacher, this exercise would be trivial. In order to address this ques-
tion using our noisy measure of teacher effectiveness, we rely on a Monte
Carlo simulation in which we assume that a teacher’s true value added is
distributed normally with a mean equal to the point estimate of the teacher
empirical Bayes fixed effect and a standard deviation equal to the standard
error on the teacher’s estimate.21 The basic intuition is that by taking
repeated draws from the value-added distribution of each teacher in a
school, we can determine the probability that any particular teacher will
fall in the top or bottom group within his or her school, which we can
then use to create the conditional probabilities shown below. The appendix
provides a more detailed description of this simulation.
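A minimal sketch of this simulation for one hypothetical school, assuming each teacher's true value added is drawn from a normal distribution centered at the empirical Bayes point estimate with standard deviation equal to its standard error:

import numpy as np

rng = np.random.default_rng(3)
eb = np.array([0.25, 0.10, 0.00, -0.05, -0.30])   # hypothetical EB estimates
se = np.array([0.08, 0.12, 0.10, 0.15, 0.09])     # their standard errors

draws = rng.normal(eb, se, size=(10_000, eb.size))  # one row per simulated draw
top = (draws.argmax(axis=1)[:, None] == np.arange(eb.size)).mean(axis=0)
bottom = (draws.argmin(axis=1)[:, None] == np.arange(eb.size)).mean(axis=0)
print("P(top of school):   ", top.round(3))
print("P(bottom of school):", bottom.round(3))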
A final concern with both the parametric and nonparametric measures
of association described above is that our objective measure of teacher
effectiveness is constructed from student test scores that have no natural
units. Exam scores in this district are reported in terms of the percent of
items answered correctly. We do not have access to individual test items,
which makes it impossible to develop performance measures that account
for the difficulty of the test items, as is commonly done in item response
theory analyses.22 Furthermore, until very recently the test was not stan-
dardized against a national population. However, to test the robustness
of our results, we develop value-added measures that use transformations
of the percent correct score, including a student’s percentile rank within
his year and grade, the square of the percent correct, and the natural
logarithm of the percent correct. Moreover, the district categorizes stu-
dents into four different proficiency categories on the basis of these exam
scores, and in some specifications we use these proficiency indicators as
outcomes for the creation of value-added measures. As we show below,
our results are robust to using these alternative value-added measures.

21. The empirical Bayes estimate of teacher value added is orthogonal to its estimation error, which makes it more appropriate than the OLS estimate in this context.

22. The core exam is the examination that is administered to all grades and broadly reported. It is therefore likely that principals would focus on this measure of achievement as opposed to some other exam, such as the Stanford Achievement Test, which is only administered in the third (with limited participation), fifth, and eighth grades.

Table 3
Correlation between a Principal’s Rating of a Teacher’s Ability to Raise Student Achievement and the Value-Added Measure of the Teacher’s Effectiveness at Raising Student Achievement

                                               Reading                 Math           Difference: Reading − Math
                                         Unadjusted  Adjusted   Unadjusted  Adjusted      (Math Sample Only)
                                            (1)        (2)         (3)        (4)               (5)
1. Using baseline specification for
   creating value-added measure             .18*       .29*        .25*       .32*              .00
                                           (.07)      (.10)       (.08)      (.10)             (.11)
2. Using average student achievement
   (levels) as the value-added measure      .35*       .55*        .28*       .37*              .19
                                           (.05)      (.09)       (.08)      (.12)             (.13)
Difference                                             −.26*                  −.05
                                                       (.08)                  (.07)

Note.—Number of observations for reading and math is 201 and 151, respectively. Adjusted correlations are described in the text. The standard errors shown in parentheses are calculated using a bootstrap. The reading − math difference in col. 5 does not equal the simple difference between the values in cols. 2 and 4 because the difference is calculated using the limited sample of teachers for whom math value-added measures are available.
* Significant at the 5% level.

V. Can Principals Identify Effective Teachers?


Having outlined our strategy for estimating the relationship between
principal evaluations and teacher effectiveness, we will now present our
empirical findings. We will also compare the usefulness of principal as-
sessments in predicting future teacher effectiveness relative to the tradi-
tional determinants of teacher compensation (i.e., education and experi-
ence) and to value-added measures of teacher quality that are based on
student achievement gains.

A. Can Principals Identify a Teacher’s Ability to Raise Standardized Test Scores?
Table 3 shows the correlation between a principal’s subjective evaluation
of how effective a teacher is at raising student reading (math) achievement
and that teacher’s actual ability to do so as measured by the value-added
measures described in the previous section. Columns 1 and 3 (of row 1)
show unadjusted correlations of 0.18 and 0.25 for reading and math,
respectively. As discussed earlier, however, these correlations will be biased
toward zero because of the estimation error in the value-added measures.
Once we adjust for estimation error, the correlations for reading and
math increase to 0.29 and 0.32, respectively. It is important to emphasize
that these correlations are not based on a principal’s overall rating of the
teacher but rather on the principal’s assessment of how effective the
teacher is at “raising student math (reading) achievement.” Because the
subjective and objective measures are identifying the same underlying construct, they should not be biased downward as in the case with many
prior studies of subjective performance evaluation.23 The positive and
significant correlations indicate that principals do have some ability to
identify this dimension of teacher effectiveness. As shown below, these
basic results are robust to a wide variety of alternative specifications and
sensitivity analyses.
However, one might ask why these correlations are not even higher.
One possibility is that principals focus on the average test scores in a
teacher’s classroom rather than student improvement relative to students
in other classrooms. The correlations between principal ratings and av-
erage student achievement scores by teacher, shown in row 2, provide
some support for this hypothesis. The correlation between principal rat-
ings and average test scores in reading is significantly higher than the
correlation between principal ratings and teacher value added (0.55 vs.
0.29). This suggests that principals may base their ratings at least partially
on a naive recollection of student performance in a teacher’s class—failing
to account for differences in classroom composition. Another reason may
be that principals focus on their most recent observations of teachers. In
results not shown here, we find that the average achievement score (or
gains) in a teacher’s classroom in 2002 is a significantly stronger predictor
of the principal’s rating than the scores (or gains) in any prior year.24
To explore principals’ ability to identify the very best and worst teach-
ers, table 4 shows the estimates of the percent of teachers that a principal
can correctly identify in the top (bottom) group within his or her school.
Examining the results in the top panel, we see that the teachers identified
by principals as being in the top category were, in fact, in the top category
according to the value-added measures about 55% of the time in reading
and 70% of the time in mathematics. If principals randomly assigned
ratings to teachers, we would expect the corresponding probabilities to
be 14% and 26%, respectively. This suggests that principals have consid-

23. Bommer et al. (1995) emphasize the potential importance of this issue, noting that in the three studies they found where objective and subjective measures tapped precisely the same performance dimension, the mean corrected correlation was 0.71 as compared with correlations of roughly 0.30 in other studies. Medley and Coker (1987) are unique in specifically asking principals to evaluate a teacher’s ability to improve student achievement. They find that the correlation with these subjective evaluations is no higher than with an overall principal rating.

24. This exercise suggests that in assigning teacher ratings, principals might focus on their most recent observations of a teacher instead of incorporating information on all past observations of the individual. Of course, it is possible that principals may be correct in assuming that teacher effectiveness changes over time so that the most recent experience of a teacher may be the best predictor of actual effectiveness. To examine this possibility, we create value-added measures that incorporate a time-varying teacher experience measure. As shown in our sensitivity analysis, we obtain comparable results when we use this measure.

Table 4
Relationship between Principal Ratings of a Teacher’s Ability to Raise Student Achievement and Teacher Value Added

                                                                      Reading   Math
Conditional probability that a teacher who received the top
  rating from the principal was the top teacher according to
  the value-added measure (SE)                                          .55      .70
                                                                       (.18)    (.13)
Null hypothesis (probability expected if principals randomly
  assigned teacher ratings)                                             .14      .26
Z-score (p-value) of test of difference between observed and null     2.29     3.34
                                                                       (.02)    (.00)
Conditional probability that a teacher who received a rating
  above the median from the principal was above the median
  according to the value-added measure (SE)                             .62      .59
                                                                       (.12)    (.14)
Null hypothesis (probability expected if principals randomly
  assigned teacher ratings)                                             .33      .24
Z-score (p-value) of test of difference between observed and null     2.49     2.41
                                                                       (.01)    (.02)
Conditional probability that a teacher who received a rating
  below the median from the principal was below the median
  according to the value-added measure (SE)                             .51      .53
                                                                       (.11)    (.13)
Null hypothesis (probability expected if principals randomly
  assigned teacher ratings)                                             .35      .26
Z-score (p-value) of test of difference between observed and null     1.40     2.19
                                                                       (.16)    (.03)
Conditional probability that the teacher(s) who received the
  bottom rating from the principal was the bottom teacher(s)
  according to the value-added measure (SE)                             .38      .61
                                                                       (.22)    (.14)
Null hypothesis (probability expected if principals randomly
  assigned teacher ratings)                                             .09      .23
Z-score (p-value) of test of difference between observed and null     1.30     2.76
                                                                       (.19)    (.01)

Note.—The probabilities are calculated using the procedure described in the appendix.

considerable ability to identify teachers in the top of the distribution. The results
are similar if one examines teachers in the bottom of the ability distri-
bution (bottom panel).25
The second and third panels in table 4 suggest that principals are sig-
nificantly less successful at distinguishing between teachers in the middle
of the ability distribution. For example, in the second panel we see that
principals correctly identify only 62% of teachers as being better than
the median (including those in the top category), relative to the null
hypothesis of 33%—that is, the percent that one would expect if principal ratings were randomly assigned.26 The difference of 29 percentage points is smaller than the difference of 41 percentage points we find in the top panel. There is a similar picture at the bottom of the distribution. Principals appear somewhat better at distinguishing between teachers in the middle of the math distribution compared with reading, but they again appear to be better at identifying the best and worst teachers.27

25. We used the delta method to compute the relevant standard errors. A block bootstrap at the principal level yielded even smaller p-values than those reported in the tables.

Fig. 2.—Association between estimated teacher value added and principal rating
Figure 2 shows scatter plots and lowess lines of the relationship between
principal ratings and estimated teacher value-added measures for the entire
sample and the sample of teachers who did not receive the top or bottom principal rating (in the bottom panel). If we exclude those teachers who received the best and worst ratings, the adjusted correlation between principal rating and teacher value added is 0.02 and −0.04 in reading and math, respectively (neither of which is significantly different than zero).

26. The null hypothesis is not 50% due to the fact that principals assign many teachers the median rating.

27. An alternative explanation for this finding is that principal ratings are noisy, so the difference in value added between two teachers with principal evaluations that differ by one point is much smaller than the difference in value added between teachers whose ratings differ by two or more points. If this were true, we might expect principals to appear capable of identifying top teachers, not because they can identify performance in tails of the distribution but rather because top teachers have much higher principal evaluations than the remainder of the sample. A comparison of teachers not in the top or bottom groups but who had the same difference in absolute ratings might yield similar results. To investigate this possibility, we examined whether the difference in principal ratings predicted which teacher had higher estimated value added. We found that, controlling for whether teachers had received the top or bottom rating, the difference in principal ratings between two teachers was not predictive of which teacher had higher estimated value added. This suggests that our findings do indeed reflect a principal’s ability to identify teachers in the tails of the performance distribution.

B. Robustness Checks
In this section, we outline potential concerns regarding the validity of
our estimates and attempt to provide evidence regarding the robustness
of our findings. One reason that principals might have difficulty distin-
guishing between teachers in the middle is that the distribution of teacher
value added is highly compressed. This is not the case, however. Even
adjusting for estimation error, the standard deviation of value added for
teachers outside of the top and bottom categories is .10 in reading and
.25 in math compared with standard deviations of .12 and .26, respectively,
for the overall sample.
Another possible concern is that the lumpiness of the principal ratings
might reduce the observed correlation between principal ratings and actual
value added. To determine the possible extent of this problem, we per-
formed a simulation in which we assumed principals perfectly observed
a normally distributed teacher quality measure. Each principal then assigned teachers, in order of this measure, to her actual distribution of reading ratings. For
example, a principal who assigned two 6s, three 7s, six 8s, three 9s, and
one 10 would assign the two teachers with the lowest generated value-
added measures a six. She would assign the next three teachers 7s, and
so on. The correlation between the lumpy principal rankings and the
generated teacher-quality measure is about 0.9, suggesting that at most
the correlation is downward biased by about 0.1 due to the lumpiness.
When we assume that the latent correlation between the principal’s con-
tinuous measure of teacher quality and true effectiveness is 0.5, the cor-
relation between the lumpy ratings and the truth is biased downward by
about 0.06, far less than would be required to fully explain the relatively
low correlation between the principal ratings and the true teacher effec-
tiveness.28 In addition, as shown in row 5, we obtain comparable corre-
lations when we exclude principals with extremely lumpy ratings.
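
A minimal sketch of this simulation is given below (in Python). The rating pattern is the illustrative one from the text; the function name, the number of simulation draws, and the focus on a single principal's ratings are our own choices rather than the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def lumpy_correlation(rating_counts, latent_corr=1.0, n_sims=1000):
    """Attenuation in the rating/quality correlation induced purely by
    principals assigning discrete ('lumpy') ratings.

    rating_counts: number of teachers given each rating, e.g. {6: 2, 7: 3, 8: 6, 9: 3, 10: 1}
    latent_corr:   correlation between the principal's continuous signal and
                   true quality (1.0 = principal observes quality perfectly)
    """
    n = sum(rating_counts.values())
    corrs = []
    for _ in range(n_sims):
        quality = rng.standard_normal(n)                 # true teacher quality
        noise = rng.standard_normal(n)
        signal = latent_corr * quality + np.sqrt(1 - latent_corr**2) * noise
        # Give the lowest-signal teachers the lowest rating, and so on,
        # reproducing the principal's observed rating distribution.
        order = np.argsort(signal)
        ratings = np.empty(n)
        pos = 0
        for r in sorted(rating_counts):
            k = rating_counts[r]
            ratings[order[pos:pos + k]] = r
            pos += k
        corrs.append(np.corrcoef(ratings, quality)[0, 1])
    return np.mean(corrs)

# The rating pattern described in the text, with and without a perfect signal.
print(lumpy_correlation({6: 2, 7: 3, 8: 6, 9: 3, 10: 1}, latent_corr=1.0))
print(lumpy_correlation({6: 2, 7: 3, 8: 6, 9: 3, 10: 1}, latent_corr=0.5))
```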
Table 5 presents a variety of additional sensitivity analyses. For the sake
of brevity, we only show the correlations between principal ratings and
value added, adjusting for the estimation error in value added. In row 1,
28. In practice, the bias from lumpiness is likely to be even lower. This is because teachers with dissimilar quality signals are unlikely to be placed in the same category—even if no other teacher is between them. In other words, the size and number of categories are likely to reflect the actual distribution of teacher quality, at least in the principal's own mind.


we present the baseline estimates. In the first panel (rows 2–5), we examine
the sensitivity of our findings to alternative samples. We find similar results when we restrict the sample to grades for which we can examine math as well as reading. The same is true if we exclude schools with extremely lumpy
principal ratings or first-year teachers. As discussed earlier, the correlation
between ratings and teacher value added is much lower for teachers with
ratings outside of the top and bottom.
In the next panel (rows 6–12), we investigate the concern that our
findings may not be robust to the measure of student achievement used
to calculate value added. As explained earlier, while principals were likely
to focus on the core exam, it is possible that they weigh the improvement
of high- and low-ability students in a way that does not correspond to
the commonly reported test metric. In rows 6–8, we examine measures
of achievement that focus on students at the bottom of the test distri-
bution. These measures include the log of the percent correct, achieving
above the proficiency cutoff, and the achievement of students with initial
achievement below the classroom mean. We find results comparable to
our baseline. In rows 9–10, we show results in which we place greater
weight on high-achieving students. These results include percent correct
squared and the achievement of students with initial achievement above
the classroom mean, and we find somewhat smaller correlations. Overall,
we believe these results are consistent with our baseline findings. However,
there does seem to be some suggestive evidence that principals may place
more weight on teachers bringing up the lowest-achieving students. For
example, the correlations using the teacher value-added scores based on
the achievement of students in the bottom half of the initial ability dis-
tribution are larger than those based on the students in the top half of
the ability distribution (though the differences are not significant at con-
ventional levels). However, the correlations that use value-added measures
based on proficiency cutoffs (which should disproportionately reflect im-
pacts on students with lower initial ability) are smaller than our baseline
results and not statistically different than zero. Of course, this might also
be due to the fact that this binary outcome measure inherently contains
less information than a continuous achievement score.
In row 11, we examine results in which the achievement measure used
is the students’ percentile in the grade-year distribution of test scores within
the district. Finally, in row 12, we examine gain scores that are normalized
so that student gains have unit standard deviation within each decile of the
initial achievement distribution. Both specifications yield results comparable
to the baseline.
In the third panel (rows 13–18), we examine the robustness of our
estimates to alternative estimation strategies for the value added and find

Table 5
Sensitivity Analyses
Correlation between Value-Added Measure and Principal Rating of Teacher's Ability to Raise Math (Reading) Achievement
(Columns: Reading Raw, Reading Adj., Math Raw, Math Adj.; bootstrapped standard errors in parentheses)

1. Baseline: .18* (.07), .29* (.10), .25* (.08), .32* (.10)

Alternative samples:
2. For math sample: .23* (.08), .32* (.11), ..., ...
3. Exclude teachers who were in their first year in 2002–3: .19* (.07), .26* (.10), .28* (.08), .35* (.10)
4. Exclude teachers with top or bottom principal rating: .02 (.07), .03 (.13), −.04 (.10), −.05 (.14)
5. Exclude schools with extreme lumpiness in principal ratings: .15* (.07), .26* (.13), .24* (.09), .32* (.12)

Alternative measures of student achievement:
6. Test outcome is natural log of percent correct—greater weight on low-achieving students: .19* (.07), .33* (.13), .22* (.08), .28* (.10)
7. Test outcome is scoring above "proficiency" cutoff—greater weight on low-achieving students: .12 (.07), .20 (.14), .12 (.08), .17 (.13)
8. Estimate VA using only students below class mean prior achievement—use normalized gain score as outcome: .21* (.06), .37* (.12), .29* (.08), .44* (.12)
9. Test outcome is percent correct squared—greater weight on high-achieving students: .20* (.07), .29* (.10), .26* (.07), .33* (.09)
10. Estimate VA using only students above class mean prior achievement—use normalized gain score as outcome: .09 (.07), .16 (.15), .25* (.09), .32* (.11)
11. Test outcome is percentile in district test distribution: .20* (.07), .30* (.10), .28* (.07), .36* (.09)
12. Outcome is the gain score normalized around predicted gain for students with comparable prior achievement: .18* (.07), .28* (.10), .28* (.08), .36* (.10)

Alternative specifications of the value-added model:
13. Exclude students in special education: .258* (.077), .350* (.100), .317* (.077), .402* (.096)
14. Only use 2002–3 achievement data: .101 (.093), .191 (.267), .191 (.111), .233 (.135)
15. Measure value added within grade-school, rather than simply within school: .201* (.063), .428 (.309), .228* (.081), .321* (.150)
16. Gain outcome with student fixed effects: .16* (.07), .27** (.14), .31* (.08), .48* (.16)
17. Include indicator for first-year teachers (plus polynomials in prior achievement): .19* (.07), .30* (.11), .24* (.08), .30* (.10)
18. Include ln(experience) variable (plus polynomials in prior achievement): .20* (.06), .46 (.35), .26* (.07), .40* (.12)

Other checks:
19. Block bootstrapped standard errors (clustering at school level): .18* (.09), .29* (.14), .25* (.09), .32* (.12)
20. Average of principal-level correlations: .21* (.08), ..., .31* (.08), ...

Note.—The adjusted correlations take into account the estimation error in our value-added measures of teacher effectiveness. Bootstrapped standard errors are in parentheses. VA = value added.
* Significant at the 5% level.
** Significant at the 10% level.


results comparable to our baseline.29 Row 19 demonstrates that our inferences are robust to calculating standard errors using a block bootstrap
that clusters observations by school. Row 20 addresses the potential con-
cern that the variance of teacher value added may not be constant across
principals, in which case stacking the data may be inappropriate. Here
we estimate the unadjusted correlations separately for each principal and
take the simple average of them. We find that averaging the unadjusted
correlations across the principals yields an estimate similar to our baseline
unadjusted correlation.
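
A school-level block bootstrap of the kind used in row 19 can be sketched as follows; the data frame and column names (school, rating, value_added) are hypothetical placeholders rather than the authors' variable names, and the statistic shown is the unadjusted correlation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def school_block_bootstrap_se(df, n_boot=1000):
    """Block-bootstrap standard error for the rating/value-added correlation,
    resampling whole schools with replacement so that within-school
    dependence is preserved."""
    schools = df["school"].unique()
    stats = []
    for _ in range(n_boot):
        draw = rng.choice(schools, size=len(schools), replace=True)
        sample = pd.concat([df[df["school"] == s] for s in draw])
        stats.append(sample["rating"].corr(sample["value_added"]))
    return np.std(stats, ddof=1)
```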

C. Heterogeneity of Effects
It is possible that principals are able to gauge the performance of some
teachers more effectively than others. Alternatively, some principals may
be more effective than others in evaluating teacher effectiveness. In table
6, we examine this possibility. The first row shows the baseline estimates.
In the first panel, we examine how the correlation between principal
rating and teacher value added varies with teacher characteristics. It does
not appear that the correlation between ratings and teacher value added
varies systematically with teacher experience, the duration the principal
has known the teacher, or grade taught. The standard errors are generally
too large, however, to draw strong conclusions.
In the second panel, we consider whether some principals are more
capable of identifying effective teachers than others. Principals who have
been at their schools less than 4 years, are male, and identify themselves
as confident in their ability to assess teacher effectiveness appear to rate
teaching ability more accurately. The observed differences, however, are
never significant at the 5% level.

D. A Comparison of Alternative Teacher Assessment Measures


To put the usefulness of principal ratings into perspective, it is helpful
to compare them to alternative metrics of teacher quality. These include
the traditional determinants of teacher compensation—education and ex-
perience—as well as value-added measures of teacher quality that are based
on student achievement gains. Specifically, we examine how well each of
the three proxies for teacher quality—compensation, principal assessment,

29. One exception is that the correlations that use only 2002–3 achievement data are smaller than the baseline correlations and are not statistically different than zero (although they are not statistically different than the baseline correlations, either).

Table 6
Heterogeneity of Effects
Correlation between Value-Added Measure and Principal Rating of Teacher's Ability to Raise Math (Reading) Achievement
(Columns: Reading Raw, Reading Adj., Math Raw, Math Adj.; bootstrapped standard errors in parentheses)

1. Baseline: .18* (.07), .29* (.10), .25* (.08), .32* (.10)

By teacher characteristics:
2. Experienced teachers (≥11 years, n = 92): .29* (.09), .35* (.11), .34* (.09), .39* (.11)
3. Inexperienced teachers (<11 years, n = 109): .07 (.08), .38 (.45), .13 (.11), .20 (.19)
4. Principal known teacher for long time (≥4 years, n = 114): .22* (.09), .28* (.11), .29* (.08), .35* (.10)
5. Principal hasn't known teacher for long (<4 years, n = 87): .13 (.10), .49 (.63), .21 (.12), .29 (.18)
6. Grades 2–4 (n = 128): .30* (.07), .43* (.11), .36* (.07), .46* (.09)
7. Grades 5–6 (n = 73): .09 (.10), .40 (.66), −.10 (.19), −.14 (.30)

By principal characteristics:
8. Principal has been in the same school at least 4 years (n = 108): .08 (.10), .12 (.15), .12 (.10), .16 (.16)
9. Principal has been in the same school less than 4 years (n = 93): .29* (.09), .49* (.25), .35* (.10), .44* (.13)
10. Principal confident in reading ratings (n = 106): .21* (.08), .36* (.16), .37* (.09), .48* (.12)
11. Principal not confident in reading ratings (n = 79): .11 (.11), .16 (.16), .11 (.12), .14 (.16)
12. Principal confident in math ratings (n = 47): .32* (.11), .58 (.49), .51* (.11), .82 (.82)
13. Principal not confident in math ratings (n = 138): .12 (.08), .17 (.12), .17 (.10), .21 (.13)
14. Male principal (n = 103): .22* (.09), .32* (.13), .35* (.10), .46* (.13)
15. Female principal (n = 98): .16 (.10), .25 (.16), .19** (.10), .25** (.13)

Note.—The adjusted correlations take into account the estimation error in our value-added measures of teacher effectiveness. Bootstrapped standard errors are in parentheses.
* Significant at the 5% level.
** Significant at the 10% level.


and estimated value added—predict student achievement.30 This exercise serves an additional purpose as well. To the extent that principals observe
past test scores, their ratings may be correlated to the estimation error of
the value-added estimates. In this section, we overcome this concern by
examining student achievement subsequent to the principal evaluations.
In order to examine how well each of the teacher quality measures
predicts student achievement, we regress 2003 math and reading scores
on prior student achievement, student demographics, and a set of class-
room covariates including average classroom demographics and prior
achievement and class size. We then include different measures of teacher
quality. The standard errors shown account for the correlation of errors
within classroom. Importantly, the value-added measures are calculated
using the specification described in equation (1) but only include student
achievement data from 1998 to 2002. In order to account for attenuation
bias in the regressions, we use empirical Bayes estimates of the value added
(see Morris 1983). It is important to note that the use of value-added
measures based on 1998–2002 data means that we cannot include any
teachers with only 2003 student achievement data, which means that first-
year teachers will be excluded in the subsequent analysis, limiting our
sample to 162 teacher observations for reading and 118 teacher obser-
vations for math.31 To make the coefficients comparable between the principal ratings and value added, we divide each empirical Bayes (EB) measure by the standard deviation of the EB measure itself. Thus, the coefficient can be interpreted
as the effect of moving one standard deviation in the empirical distribution
of teacher quality.32
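
A minimal sketch of this kind of prediction regression is given below, assuming a student-level data set with hypothetical column names (score_2003, lagged_math, teacher_id, and so on); the actual specification also includes classroom-level averages of the student covariates and uses the empirical Bayes value-added measures estimated on 1998–2002 data.

```python
import statsmodels.formula.api as smf

def predictive_regression(df, quality_measure):
    """Regress 2003 scores on prior achievement, demographics, class size,
    grade and school fixed effects, and a standardized teacher-quality
    measure, clustering standard errors at the classroom (teacher) level."""
    # Standardize the quality measure so its coefficient is the effect of a
    # one-standard-deviation move in the empirical distribution.
    df = df.assign(quality_std=(df[quality_measure] - df[quality_measure].mean())
                               / df[quality_measure].std())
    formula = ("score_2003 ~ quality_std + lagged_math + lagged_reading"
               " + male + free_lunch + special_ed + lep + minority + age"
               " + class_size + C(grade) + C(school)")
    return smf.ols(formula, data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["teacher_id"]})

# Example: compare the principal rating with the EB value-added measure.
# results_rating = predictive_regression(df, "principal_rating")
# results_va = predictive_regression(df, "va_eb")
```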
Table 7 presents the results. Column 1 shows the effect of teacher
experience and education on reading achievement. While there does not
appear to be any significant relationship between teacher experience and
student achievement, results not presented here indicate that this is due
to the necessary omission of first-year teachers, who perform worse, on
average, than experienced teachers. In contrast, teachers with advanced

30. Note that when comparing the predictive power of the various measures, we are essentially comparing the principal and compensation measures against feasible value-added measures. Using unobserved actual value added could increase the predictive power (as measured by the r-squared), but this is not a policy-relevant measure of teacher quality. Of course, the nature of the EB measures is such that coefficient estimates are consistent measures of the impact of actual teacher value added.
31. Excluding first-year teachers does not affect the estimated correlation between the value-added measures and principal ratings, which suggests this exclusion should not bias our results in this regard.
32. Since we are comparing the relative value of using a test-based vs. principal-based measure of performance, the most relevant comparison is between a movement in the empirical (not actual) distribution of teacher effectiveness and the principal rating.

Table 7
The Association between Different Teacher Quality Measures and Future Student Achievement
Dependent variable: 2003 reading score (cols. 1–6); 2003 math score (cols. 7–12). Each column is a separate specification; within each row, coefficients from the reading-score specifications are listed before those from the math-score specifications, with standard errors in parentheses.

3–5 years of experience: −.013 (.061); .222 (.125)
6–10 years of experience: .017 (.060); .023 (.125)
11–20 years of experience: −.019 (.049); .127 (.118)
21+ years of experience: −.049 (.070); .009 (.126)
MA degree: .106* (.046); .086 (.075)
Annual pay (in $1,000): .000 (.003); −.001 (.002); −.000 (.004); .000 (.003)
Overall principal rating: .070* (.020); .045* (.020); .141* (.023); .077* (.025)
Principal rating of ability to raise reading (math) scores: .051* (.019); .100* (.029)
Reading (math) value-added (EB measure): .106* (.015); .096* (.017); .207* (.022); .176* (.023)
R2: .477, .475, .479, .477, .484, .486, .389, .383, .398, .391, .413, .417

Note.—Each column represents a separate specification. Specifications in cols. 1–6 include 162 teachers and 3,891 students; cols. 7–12 include 118 teachers and 2,590 students. All regressions include the following variables: male, special education status, free lunch eligibility, limited English proficiency, age, minority, fixed effects for grade and school, lagged math and reading scores, class size, class-level average of student demographics and lagged achievement scores, and an indicator for a mixed-grade class. Standard errors clustered at the teacher (i.e., classroom) level are shown in parentheses. EB = empirical Bayes.
* Significant at the 5% level.

degrees have students who score roughly 0.11 standard deviations higher
than their counterparts (although this relationship should not be inter-
preted as causal since education levels may well be associated with other
omitted teacher characteristics). In this district, however, compensation is
a complicated, nonlinear function of experience and education. Column 2
shows that actual compensation has no significant relationship to student
achievement. In results not shown here, we find that including polynomials
in compensation does not change this result. Columns 3 and 4 indicate that
principal ratings—both overall ratings and ratings of a teacher’s ability to
raise achievement—are significantly associated with higher student achieve-
ment. Conditional on prior student achievement, demographics, and class-
room covariates, students whose teachers receive an overall rating one
standard deviation above the mean are predicted to score roughly 0.07
standard deviations higher than students whose teacher received an aver-
age rating. Column 5 shows that a teacher’s value added is an even better
predictor of future student achievement gains, with a coefficient half again
as that on the overall principal ratings.33 The r-square measures in the
bottom row indicate that none of the measures explain a substantial por-
tion of the variation across students, as one would expect, given that much
of the variation in nearly all student-level regressions occurs within the
classroom. Nonetheless, bootstrap tests indicate that the value-added mea-
sure explains significantly (at the 10% level in reading and the 5% level
in math) more of the variation in student achievement than the principal
ratings do. As shown in columns 7–11, the results for math are comparable
to the reading results.
To the extent that principal ratings are picking up a different dimension
of quality than the test-based measures, one might expect that combining
principal and value-added measures would yield a better predictor of
future achievement. Column 6 suggests that this might be the case. Con-
ditional on teacher value added, the principal’s overall rating of a teacher
is a significant predictor of student achievement. The results for math,
shown in column 12, are even stronger.

VI. Conclusions
In this article, we examine how well principals can identify the teachers who are most effective at increasing reading and math achievement. We build on prior literature by using principal ratings that are aligned with the objective metric under
33. We examined the functional form of the relationship between both teacher quality measures and student achievement but found that both were approximately linear. Also, when we do not normalize the EB measure by the standard deviation of teacher ability, the coefficient is insignificantly different from one, which we would expect, given that the EB measure is essentially the conditional expectation of teacher effectiveness.


examination. We also take into account measurement error in the objective metric and the fact that principal ratings are categorical and may not have
a cardinal interpretation. We find that principals are generally effective at
identifying the very best and worst teachers. On average, however, they
are not able to distinguish teachers in the middle of the achievement
distribution. There is suggestive evidence that principals may be more
concerned with or aware of the achievement of low-ability students than
of high-ability students and may rely on achievement levels rather than
value added to assess teachers. Principal ratings are also a significant pre-
dictor of future student achievement, though they perform worse than
empirical measures of teacher effectiveness.
Our results suggest that policy makers should consider incorporating
principal evaluations into teacher compensation and promotion systems.
Comparing principal assessments to the traditional determinants of
teacher compensation—education and experience—we find that subjective
principal assessments of teachers are a substantially better predictor of
future student achievement. While value-added measures of teacher ef-
fectiveness generally do a better job at predicting future student achieve-
ment than do principal ratings, the two measures do about equally well
at identifying the best and worst teachers. To the extent that principal
judgments focus on identifying the best and worst teachers—for example,
to determine bonuses and teacher dismissal—the evidence presented here
suggests that such evaluations would help promote student achievement.
Moreover, principal assessment has the potential to mitigate some of the
concerns regarding strategic behavior on the part of teachers to improve
test scores without increasing actual knowledge (see Jacob and Levitt
2003).34 If principals can observe inputs as well as outputs, they may be
able to ensure that teachers increase student achievement through im-
provements in pedagogy, classroom management, or curriculum. Also, in
other work we show that principal ratings are correlated to other edu-
cational outputs valued by parents, such as student satisfaction (see Jacob
and Lefgren 2005b).
While principal evaluations seem promising, there are several important
reasons to be cautious in using these results to shape teacher compensation
and promotion systems. First, the inability of principals to distinguish
between a broad middle range of teacher quality suggests that one should
not rely on principals for fine-grained performance determinations as
might be required under certain merit pay policies. Second, our analysis
takes place in a context where principals were not explicitly evaluated on
34. Recent studies have documented a number of undesirable consequences associated with such high-stakes testing policies, including teaching to the test and cheating (Jacob and Levitt 2003; Jacob 2005).


the basis of their ability to identify effective teachers.35 It is possible that moving to a system where principals had more authority and responsi-
bility for monitoring teacher effectiveness would enhance principals’ abil-
ity to identify various teacher characteristics. However, it is possible that
principals would be less willing to honestly assess teachers under such a
system, perhaps because of social or political pressures. Favoritism toward
particular teachers by school administrators long has been a concern among
teachers, and Jacob and Lefgren (2005a) find some tentative evidence that
principals may indeed engage in such behavior. Third, our analysis focuses
on the source of the teacher assessment; we do not address the type of
rewards or sanctions associated with teacher performance. This is clearly
an important dimension of any performance management system, and one
would not expect either a principal-based or a test-based assessment sys-
tem to have a substantial impact on student outcomes unless it were
accompanied by meaningful consequences.36
More broadly, our findings provide compelling evidence that good
teaching is, at least to some extent, observable by those close to the
education process even though it may not be easily captured in those
variables commonly available to the econometrician. This should provide
some hope to those attempting to characterize the behaviors associated
with effective teaching and ultimately improve education for all students.

Appendix
Nonparametric Measures of Association between Performance Indicators
In order to get a more intuitive understanding of the magnitude of the
relationship between principal ratings and actual teacher effectiveness, we
calculate several simple, nonparametric measures of the association be-
tween the subjective and objective performance indicators. While this
exercise is complicated somewhat by the existence of measurement error
in the teacher value-added estimates, it is relatively straightforward to
construct such measures through Monte Carlo simulations with only min-
imal assumptions. Following the notation in the text, we define the principal's assessment of teacher $j$ as $\hat{d}_j^P$, the empirical Bayes estimated value added of teacher $j$ as $\hat{d}_j$, and the true ability of teacher $j$ as $d_j$. Our goal is to calculate the following probabilities:

$$\Pr(d_j = t \mid \hat{d}_j^P = t), \qquad (A1)$$


35. There were, however, a number of informal incentives for principals. For example, the district monitored the standardized test achievement of all schools, and parents are an active presence in many of the schools.
36. For examples of studies that examine accountability programs within education, see Kane and Staiger (2002), Figlio and Winicki (2005), Jacob (2005), and Figlio and Rouse (2006).


$$\Pr(d_j = b \mid \hat{d}_j^P = b), \qquad (A2)$$

where t (b) indicates that the teacher was in the top (bottom) quantile of
the distribution. For example, equation (A1) is the probability that the
teacher is in the top quantile of the true ability distribution conditional
on being in the top quantile of the distribution according to the principal
assessment.
We can calculate the conditional probability of a teacher’s value-added
ranking given her principal ranking directly from the data. These prob-
abilities can be written as follows:

$$\Pr(\hat{d}_j = t \mid \hat{d}_j^P = t) = \Pr(\hat{d}_j = t \mid d_j = t)\,\Pr(d_j = t \mid \hat{d}_j^P = t) + \Pr(\hat{d}_j = t \mid d_j = b)\,\Pr(d_j = b \mid \hat{d}_j^P = t), \qquad (A3)$$

$$\Pr(\hat{d}_j = b \mid \hat{d}_j^P = t) = \Pr(\hat{d}_j = b \mid d_j = t)\,\Pr(d_j = t \mid \hat{d}_j^P = t) + \Pr(\hat{d}_j = b \mid d_j = b)\,\Pr(d_j = b \mid \hat{d}_j^P = t), \qquad (A4)$$

$$\Pr(\hat{d}_j = t \mid \hat{d}_j^P = b) = \Pr(\hat{d}_j = t \mid d_j = t)\,\Pr(d_j = t \mid \hat{d}_j^P = b) + \Pr(\hat{d}_j = t \mid d_j = b)\,\Pr(d_j = b \mid \hat{d}_j^P = b), \qquad (A5)$$

$$\Pr(\hat{d}_j = b \mid \hat{d}_j^P = b) = \Pr(\hat{d}_j = b \mid d_j = t)\,\Pr(d_j = t \mid \hat{d}_j^P = b) + \Pr(\hat{d}_j = b \mid d_j = b)\,\Pr(d_j = b \mid \hat{d}_j^P = b). \qquad (A6)$$

Note that the four equations also assume that the fact that the principal rates a teacher in the top (bottom) category does not provide any additional information regarding whether estimated value added will be in the top (bottom) category once we condition on whether the teacher's true ability is in the top (bottom) category. For example, in equation (A3), we assume that $\Pr(\hat{d}_j = t \mid d_j = t) = \Pr(\hat{d}_j = t \mid d_j = t, \hat{d}_j^P = t)$ and $\Pr(\hat{d}_j = t \mid d_j = b) = \Pr(\hat{d}_j = t \mid d_j = b, \hat{d}_j^P = t)$. While we do not believe this is strictly true, it should not substantially bias our estimates.
We also know the following identities are true:

$$\Pr(d_j = t \mid \hat{d}_j^P = t) + \Pr(d_j = b \mid \hat{d}_j^P = t) = 1, \qquad (A7)$$

$$\Pr(d_j = t \mid \hat{d}_j^P = b) + \Pr(d_j = b \mid \hat{d}_j^P = b) = 1, \qquad (A8)$$

$$\Pr(\hat{d}_j = t \mid \hat{d}_j^P = t) + \Pr(\hat{d}_j = b \mid \hat{d}_j^P = t) = 1, \qquad (A9)$$

$$\Pr(\hat{d}_j = t \mid \hat{d}_j^P = b) + \Pr(\hat{d}_j = b \mid \hat{d}_j^P = b) = 1. \qquad (A10)$$

We can solve equations (A3) and (A7) to obtain equation (A1) as follows:

$$\Pr(d_j = t \mid \hat{d}_j^P = t) = \frac{\Pr(\hat{d}_j = t \mid \hat{d}_j^P = t) - \Pr(\hat{d}_j = t \mid d_j = b)}{\Pr(\hat{d}_j = t \mid d_j = t) - \Pr(\hat{d}_j = t \mid d_j = b)}. \qquad (A11)$$

Using Bayes's rule, we can rewrite equation (A11) as follows:

$$\Pr(d_j = t \mid \hat{d}_j^P = t) = \frac{\Pr(\hat{d}_j = t \mid \hat{d}_j^P = t) - \Pr(d_j = b \mid \hat{d}_j = t)\,[\Pr(\hat{d}_j = t)/\Pr(d_j = b)]}{\Pr(d_j = t \mid \hat{d}_j = t) - \Pr(d_j = b \mid \hat{d}_j = t)\,[\Pr(\hat{d}_j = t)/\Pr(d_j = b)]}. \qquad (A12)$$

We can estimate all of the remaining quantities in equation (A12) from our data. More specifically, we can calculate estimates of the following probabilities through simulation:

$$\Pr(d_j = t \mid \hat{d}_j = t), \qquad (A13)$$

$$\Pr(d_j = b \mid \hat{d}_j = t), \qquad (A14)$$

$$\Pr(d_j = t \mid \hat{d}_j = b), \qquad (A15)$$

$$\Pr(d_j = b \mid \hat{d}_j = b). \qquad (A16)$$

To do so, we assume that the true ability of teacher $j$ is distributed normally with a mean equal to the estimated empirical Bayes value added for teacher $j$, $\hat{d}_j$, and a variance equal to $\operatorname{Var}(\hat{d}_j)$. Note that the empirical Bayes estimates have the useful property that the estimation error is independent of the point estimate (this is not true for OLS). We then randomly draw 500 realizations of each teacher's true ability, $d_j$, and for each draw determine which set of teachers would fall in the top (bottom) quantile of the ability distribution and whether the principal would have correctly classified the teacher based on this realization. We estimate the probabilities in equations (A13)–(A16) as the average of these realizations. Finally, we can calculate $\Pr(\hat{d}_j = t) = \Pr(d_j = t)$ and $\Pr(\hat{d}_j = b) = \Pr(d_j = b)$ directly from our original data.
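
The simulation just described might look roughly like the following sketch; the function, its arguments, and the fixed 20% tail share are our own illustrative assumptions (the paper works with the top and bottom categories of the empirical distributions rather than an arbitrary share).

```python
import numpy as np

rng = np.random.default_rng(0)

def tail_classification_probs(va_eb, va_se, top_share=0.2, n_draws=500):
    """Monte Carlo estimates of Pr(d_j = t | d_hat_j = t) and
    Pr(d_j = b | d_hat_j = b): draw true ability around each teacher's EB
    estimate and record how often the estimate-based tail assignment agrees
    with the drawn 'true' tail assignment."""
    va_eb = np.asarray(va_eb)
    va_se = np.asarray(va_se)
    n = len(va_eb)
    k = max(1, int(round(top_share * n)))
    est_top = np.argsort(va_eb)[-k:]   # teachers in the top quantile of estimates
    est_bot = np.argsort(va_eb)[:k]    # teachers in the bottom quantile of estimates
    hits_top, hits_bot = [], []
    for _ in range(n_draws):
        true_ability = rng.normal(va_eb, va_se)   # one draw of d_j for every teacher
        true_top = set(np.argsort(true_ability)[-k:])
        true_bot = set(np.argsort(true_ability)[:k])
        hits_top.append(np.mean([j in true_top for j in est_top]))
        hits_bot.append(np.mean([j in true_bot for j in est_bot]))
    # Averages over draws approximate eqq. (A13) and (A16).
    return np.mean(hits_top), np.mean(hits_bot)
```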


In a similar fashion, we can obtain equation (A2) by solving equations (A6) and (A8):

$$\Pr(d_j = b \mid \hat{d}_j^P = b) = \frac{\Pr(\hat{d}_j = b \mid \hat{d}_j^P = b) - \Pr(\hat{d}_j = b \mid d_j = t)}{\Pr(\hat{d}_j = b \mid d_j = b) - \Pr(\hat{d}_j = b \mid d_j = t)}$$

$$= \frac{\Pr(\hat{d}_j = b \mid \hat{d}_j^P = b) - \Pr(d_j = t \mid \hat{d}_j = b)\,[\Pr(\hat{d}_j = b)/\Pr(d_j = t)]}{\Pr(d_j = b \mid \hat{d}_j = b) - \Pr(d_j = t \mid \hat{d}_j = b)\,[\Pr(\hat{d}_j = b)/\Pr(d_j = t)]}. \qquad (A17)$$

References
Aaronson, Daniel, Lisa Barrow, and William Sander. 2007. Teachers and
student achievement in the Chicago public high schools. Journal of
Labor Economics 25, no. 1:95–135.
Alexander, Elmore R., and Ronnie D. Wilkins. 1982. Performance rating
validity: The relationship of objective and subjective measures of per-
formance. Group and Organization Studies 7, no. 4:485–96.
Armor, David, Patricia Conry-Oseguera, Millicent Cox, Nicelma King,
Lorraine McDonnell, Anthony Pascal, Edward Pauly, and Gail Zell-
man. 1976. Analysis of the school preferred reading program in selected
Los Angeles minority schools. Report no. R-2007-LAUSD. Rand Cor-
poration, Santa Monica, CA.
Ballou, Dale. 2001. Pay for performance in public and private schools.
Economics of Education Review 20:51–61.
Ballou, Dale, and Michael Podgursky. 2001. Teacher compensation: The
case for market-based pay. Education Matters 1, no. 1:16–25.
Bolino, Mark C., and William H. Turnley. 2003. Counternormative im-
pression management, likeability, and performance ratings: The use of
intimidation in an organizational setting. Journal of Organizational
Behavior 24, no. 2:237–50.
Bommer, William H., Jonathan L. Johnson, Gregory A. Rich, Philip M.
Podsakoff, and Scott B. MacKenzie. 1995. On the interchangeability
of objective and subjective measures of employee performance: A meta-
analysis. Personnel Psychology 48, no. 3:587–605.
Bull, Clive. 1987. The existence of self-enforcing implicit contracts. Quar-
terly Journal of Economics 102, no. 1:147–60.
Clotfelter, Charles T., Helen F. Ladd, and Jacob L. Vigdor. 2006. Teacher-
student matching and the assessment of teacher effectiveness. NBER
Working Paper no. 11936, National Bureau of Economic Research,
Cambridge, MA.
Figlio, David N. 1997. Teacher salaries and teacher quality. Economics
Letters 55, no. 2:267–71.


Figlio, David N., and Cecilia Elena Rouse. 2006. Do accountability and
voucher threats improve low-performing schools? Journal of Public
Economics 90, nos. 1–2:239–55.
Figlio, David N., and Joshua Winicki. 2005. Food for thought: The effects
of school accountability on school nutrition. Journal of Public Eco-
nomics 89, nos. 2–3:381–94.
Hanushek, Eric A. 1986. The economics of schooling: Production and
efficiency in public schools. Journal of Economic Literature 24, no. 3:
1141–77.
———. 1997. Assessing the effects of school resources on student per-
formance: An update. Educational Evaluation and Policy Analysis 19,
no. 2:141–64.
Hanushek, Eric A., John Kain, Daniel M. O’Brien, and Steven G. Rivkin.
2005. The market for teacher quality. NBER Working Paper no. 11154,
National Bureau of Economic Research, Cambridge, MA.
Hanushek, Eric A., and Steven G. Rivkin. 2004. How to improve the
supply of high quality teachers. In Brookings papers on education policy
2004, ed. Diane Ravitch. Washington, DC: Brookings Institution Press.
Harris, Douglas, and Timothy Sass. 2006. The effects of teacher training
on teacher value-added. Working paper, Economics Department, Flor-
ida State University, Tallahassee.
Heneman, Robert L. 1986. The relationship between supervisory ratings
and results-oriented measures of performance: A meta-analysis. Personnel
Psychology 39:811–26.
Heneman, Robert L., David B. Greenberger, and Chigozie Anonyuo.
1989. Attributions and exchanges: The effects of interpersonal factors
on the diagnosis of employee performance. Academy of Management
Journal 32, no. 2:466–76.
Jacob, Brian A. 2005. Accountability, incentives and behavior: The impact
of high-stakes testing in the Chicago public schools. Journal of Public
Economics 89, nos. 5–6:761–96.
Jacob, Brian A., and Lars Lefgren. 2005a. Principals as agents: Subjective
performance measurement in education. NBER Working Paper no.
11463, National Bureau of Economic Research, Cambridge, MA.
———. 2005b. What do parents value in education? An empirical inves-
tigation of parents’ revealed preferences for teachers. NBER Working
Paper no. 11494, National Bureau of Economic Research, Cambridge,
MA.
Jacob, Brian A., and Stephen D. Levitt. 2003. Rotten apples: An inves-
tigation of the prevalence and predictors of teacher cheating. Quarterly
Journal of Economics 118, no. 3:843–78.
Kane, T., and D. Staiger. 2002. Volatility in school test scores: Implications
for test-based accountability systems. In Brookings papers on education


policy 2002, ed. Diane Ravitch. Washington, DC: Brookings Institution Press.
Lefkowitz, Joel. 2000. The role of interpersonal affective regard in su-
pervisory performance ratings: A literature review and proposed causal
model. Journal of Occupational and Organizational Psychology 73:
67–85.
Levin, Jonathan. 2003. Relational incentive contracts. American Economic
Review 93, no. 3:835–57.
MacLeod, W. Bentley. 2003. Optimal contracting with subjective evalu-
ation. American Economic Review 93, no. 1:216–40.
MacLeod, W. Bentley, and J. Malcomson. 1989. Implicit contracts, incen-
tive compatibility and involuntary unemployment. Econometrica 57,
no. 2:447–80.
Medley, Donald M., and Homer Coker. 1987. The accuracy of principals’
judgments of teacher performance. Journal of Educational Research 80,
no. 4:242–47.
Morris, Carl N. 1983. Parametric empirical Bayes inference: Theory and
applications. Journal of the American Statistical Association 78, no. 381:
47–55.
Moulton, Brent R. 1990. An illustration of a pitfall in estimating the effects
of aggregate variables on micro units. Review of Economics and Statistics
72, no. 2:334–38.
Murnane, Richard. 1975. The impact of school resources on the learning
of inner-city children. Cambridge, MA: Ballinger.
Murnane, Richard J., and D. K. Cohen. 1986. Merit pay and the evaluation
problem: Why most merit pay plans fail and few survive. Harvard
Educational Review 56, no. 1:1–17.
Prendergast, Canice. 1993. The role of promotion in inducing specific
human capital acquisition. Quarterly Journal of Economics 108, no. 2:
523–34.
———. 1999. The provision of incentives in firms. Journal of Economic
Literature 37, no. 1:7–63.
Prendergast, Canice, and Robert Topel. 1996. Favoritism in organizations.
Journal of Political Economy 104, no. 5:958–75.
Peterson, Kenneth D. 1987. Teacher evaluation with multiple and variable
lines of evidence. American Educational Research Journal 24, no. 2:
311–17.
———. 2000. Teacher evaluation: A comprehensive guide to new directions
and practices. 2nd ed. Thousand Oaks, CA: Corwin.
Reback, Randall. 2005. Teaching to the rating: School accountability and
the distribution of student achievement. Working paper, Economics
Department, Columbia University.
Rockoff, Jonah E. 2004. The impact of individual teachers on student


achievement: Evidence from panel data. American Economic Review 94, no. 2:247–52.
Sanders, W. L., and J. C. Rivers. 1996. Cumulative and residual effects of
teachers on future student academic achievement. Research report, Uni-
versity of Tennessee Value-Added Research and Assessment Center,
Knoxville.
Sullivan, Daniel G. 2001. A note on the estimation of regression models
with heteroskedastic measurement errors. Working Paper no. 2001-23.
Federal Reserve Bank of Chicago.
Varma, Arup, and Linda K. Stroh. 2001. The impact of same-sex LMX
dyads on performance evaluations. Human Resource Management 40,
no. 4:309–20.
Wayne, Sandy J., and Gerald R. Ferris. 1990. Influence tactics, affect, and
exchange quality in supervisor-subordinate interactions: A laboratory
experiment and field study. Journal of Applied Psychology 75, no. 5:
487–99.
