Students’ Evaluations of University Teaching: Research Findings, Methodological Issues, and Directions for Future Research
Printed in Great Britain. All rights reserved. Copyright © 1987 Pergamon Journals Ltd.
HERBERT W. MARSH
The University of Sydney, Australia
ABSTRACT
The purposes of this monograph are to provide an overview of findings and of research
methodology used to study students’ evaluations of teaching effectiveness, and to examine
implications and directions for future research. The focus of the investigation is on the
author’s own research that has led to the development of the Students’ Evaluations of Edu-
cational Quality (SEEQ) instrument, but it also incorporates a wide range of other re-
search. Based upon this overview, class-average student ratings are: (1) multidimensional;
(2) reliable and stable; (3) primarily a function of the instructor who teaches a course
rather than the course that is taught; (4) relatively valid against a variety of indicators of
effective teaching; (5) relatively unaffected by a variety of variables hypothesized as po-
tential biases; and (6) seen to be useful by faculty as feedback about their teaching, by stu-
dents for use in course selection, and by administrators for use in personnel decisions. In
future research a construct validation approach should be employed in which it is recog-
nized that: effective teaching and students’ evaluations designed to reflect it are multi-
dimensional/multifaceted; there is no single criterion of effective teaching; and tentative
interpretations of relationships with validity criteria and with potential biases must be
scrutinized in different contexts and must examine multiple criteria of effective teaching.
ACKNOWLEDGMENTS
I would like to thank Wilbert McKeachie, Kenneth Feldman, Peter Frey, Kenneth Doyle,
Robert Menges, John Centra, Peter Cohen, Michael Dunkin, Samuel Ball, Jennifer
Barnes, Les Leventhal, John Ware, Philip Abrami, Larry Braskamp, Robert Wilson,
and Dale Brandenburg for their comments on earlier research that is described in this arti-
cle, and also Jesse Overall and Dennis Hocevar, co-authors on many earlier studies. I
would also like to gratefully acknowledge the support and encouragement which Wilbert
McKeachie has consistently given to me, and the invaluable assistance offered by Kenneth
Feldman in both personal correspondence and the outstanding set of review articles which
he has authored. Nevertheless, the interpretations expressed in this article are those of the
author, and may not reflect those of others whose assistance has been acknowledged.
CHAPTER 1
INTRODUCTION
A Historical Perspective
Doyle (1983) noted that in Antioch in AD 350 any father who was dissatisfied with the instruction given to his son could examine the boy, file a formal complaint with a panel of teachers and laymen, and ultimately transfer his son to another teacher if the teacher could be shown to have neglected his duties, while Socrates had been executed in 399 BC for having corrupted the youth of Athens with his teachings. In the twentieth century there were few studies of students’ evaluations before the 1920s, but student evaluation programs were introduced at Harvard, the University of Washington, Purdue University, the University of Texas, and other institutions in the mid-1920s. Barr (1948) noted 138 studies of
teaching efficiency written between 1905 and 1948, and de Wolf (1974) summarized 220
studies of students’ evaluations of teaching effectiveness that were written between 1968
and 1974. The term ‘students’ evaluations of teacher performance’ was first introduced in
the ERIC system in 1976; between 1976 and 1984 there were 1055 published and unpub-
lished studies under this heading and approximately half of those have appeared since
1980. Doyle (1983) noted a cyclical pattern in research activity with a sharp increase in the
decade beginning with 1927 and the most intense activity during the 1970s. Student evalua-
tion research is largely a phenomenon of the 1970s and 1980s but it has a long and import-
ant history dating back to the pioneering research conducted by H. H. Remmers.
H. H. Remmers initiated the first systematic research program in this field and might be regarded as the father of research into students’ evaluations of teaching effectiveness. In 1927 Remmers (Brandenburg & Remmers, 1927) published his multitrait Purdue scale and proposed three principles for the design of such instruments: (a) that the list of traits must
be short enough to avoid halo effects and carelessness due to student boredom; (b) that the
traits must be those agreed upon by experts as the most important; and (c) that the traits
must be susceptible to student observation and judgement. During this early period he
examined issues of reliability, validity, halo effects, bias (Remmers & Brandenburg,
1927), the relation between course grades and student ratings (Remmers, 1928) and the
discriminability and relative importance of his multiple traits (Stalnaker & Remmers,
1928) and engaged in a series of discussion articles (Remmers & Wykoff, 1929) clarifying
his views on students’ evaluations against suggested criticisms (Wykoff, 1929; Protzman,
1929). Some of his substantive and methodological contributions in subsequent research
were:
(1) Remmers (1931, 1934) was the first to recognize that the reliability of student ratings should be based on agreement among different students of the same teacher, and that the reliability of the class-average response varies with the number of students in a way that is analogous to the relation between test length (i.e. number of items) and test reliability in the Spearman-Brown equation (restated in modern notation following this list).
(2) Remmers (Smalzried & Remmers, 1943; see also Creager, 1950) published the first
factor analysis of class-average student responses to his 10 traits and identified two higher-
order traits that he called Empathy and Professional Maturity.
(3) In 1949 Remmers (Remmers et al., 1949; see also Elliot, 1950) found that when
students were randomly assigned to different sections of the same course, section-average
achievement corrected for initial aptitude was positively correlated with class-average rat-
ings of instructional effectiveness, thus providing one basis for the multiple-section validity
paradigm that has been so important in student evaluation research (see Chapter 4).
(4) Drucker and Remmers (1950, 1951) found that the ratings of alumni ten years after
their graduation from Purdue were substantially correlated with ratings by current stu-
dents for those instructors who had taught both groups. Alumni and current students also
showed substantial agreement on the relative importance they placed on the 10 traits from
the Purdue scale.
(5) In the first large-scale, multi-institutional study Remmers (Remmers & Elliot, 1949;
Elliot, 1950) correlated student responses from 14 colleges and universities with a wide
variety of background/demographic variables (for example, sex, rank, scholastic ability,
year in school). While some significant relationships were found, the results suggested that
background/demographic characteristics had little effect on student ratings.
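Remmers’ point about the number of raters (point 1 above) can be restated via the Spearman-Brown prophecy formula applied to raters rather than to test items: if $r_1$ denotes the correlation between the ratings of two individual students of the same teacher (the single-rater reliability), then the estimated reliability of the class-average rating from $k$ students is

$$
r_k = \frac{k\, r_1}{1 + (k - 1)\, r_1}.
$$

Chapter 3 reports SEEQ reliabilities that follow essentially this relation (for example, a single-rater reliability of about 0.23 implies a class-average reliability of about 0.60 for five students).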
In the early period of his research Remmers (for example, Brandenburg & Remmers,
1927; Remmers & Wykoff, 1929) was cautious about the use and interpretation of student
ratings, indicating that his scale: (a) was not designed to serve as a measure of teaching ef-
fectiveness but rather to assess student opinions as one aspect of the teacher-student rela-
tionship and the learning process; (b) should not be used for promotional decisions though
future research may show it to be valid for such purposes; and (c) should only be used vol-
untarily though limited observations suggest that instructors may profit by its use. How-
ever, the accumulated support from more than two decades of research led Remmers to
stronger conclusions such as: (a) “there is warrant for ascribing validity to student ratings
not merely as measures of student attitude toward instructors for which validity and relia-
bility are synonymous but also as measured by what students actually learn of the content
of the course” (Remmers et al., 1949, p. 26); (b) “undergraduate judgment as a criterion
of effective teaching . . . can no longer be waved aside as invalid and irrelevant”
(Remmers, 1950, p. 4); (c) “Teachers at all levels of the educational ladder have no real
choice as to whether they will be judged by those whom they teach . . . The only real
choice any teacher has is whether he wants to know what these judgments are and whether
he wants to use this knowledge in his teaching procedures” (Remmers, 1950, p. 4); (d) “As
higher education is organized and operated, students are pretty much the only ones who
observe and are in a position to judge the teacher’s effectiveness” (1958, p. 20); (e)
“Knowledge of student opinions and attitudes leads to improvement of the teacher’s per-
sonality and educational procedures” (1958, p. 21); and (f) “No research has been pub-
lished invalidating the use of student opinion as one criterion of teacher effectiveness”
(1958, p. 22).
While there have been important methodological advances in student evaluation
research since Remmers’ research, his studies provided the foundation for many of these
advances. Furthermore, while Remmers’ conclusions have been widely challenged in hun-
dreds of studies, results to be summarized in this monograph support his conclusions.
The most widely noted purposes for collecting students’ evaluations of teaching effec-
tiveness are variously to provide the following.
(1) Diagnostic feedback to faculty about the effectiveness of their teaching that will be
useful for the improvement of teaching.
(2) A measure of teaching effectiveness to be used in administrative decision making.
(3) Information for students to use in the selection of courses and instructors.
(4) A measure of the quality of the course, to be used in course improvement and cur-
riculum development.
(5) An outcome or a process description for research on teaching.
While the first purpose is nearly universal, the next three are not. At many universities
systematic student input is required before faculty are even considered for promotion,
while at others the inclusion of students’ evaluations is optional or not encouraged at all.
Similarly, the results of students’ evaluations are sold to students in some university
bookstores at the time when students select their courses, while at other universities the re-
sults are considered to be strictly confidential.
Students’ ratings of teaching effectiveness from instruments such as SEEQ are probably not very useful for the fourth purpose, course evaluation and curriculum de-
velopment. These ratings are primarily a function of the instructor who teaches the course
rather than the course that is being taught (see Chapter 3), and thus provide little informa-
tion that is specific to the course. This conclusion should not be interpreted to mean that
student input is not valuable for such purposes, but only that the student ratings collected
for purposes of evaluating teaching effectiveness are not the appropriate source of student
input into questions of course evaluations.
The fifth purpose of student ratings, their use in research on teaching, has not been sys-
tematically examined, and this is unfortunate. Research on teaching involves at least three
major questions (Gage, 1963; Dunkin & Barnes, 1986): How do teachers behave? Why do
they behave as they do? and What are the effects of their behavior? Dunkin & Barnes go
on to conceptualize this research in terms of: (a) process variables (global teaching
methods and specific teaching behaviors); (b) presage variables (characteristics of
teachers and students); (c) context variables (substantive, physical and institutional envi-
ronments); and (d) product variables (student academic/professional achievement, at-
titudes and evaluations). Similarly, Braskamp et al. (1985) distinguish between input
(what students and teachers bring to the classroom), process (what students and teachers
do), and product (what students do or accomplish), though Doyle (1975) uses the same
three terms somewhat differently. Student ratings are important both as a process-de-
scription measure and as a product measure. This dual role played by student ratings, as a
process description and as a product of the process, is also inherent in their use as diagnos-
tic feedback, as input for tenure promotion decisions, and as information for students to
use in course selection. However, Dunkin and Barnes’ presage and context variables, or
Braskamp’s input variables, also have a substantial impact on both the process and the pro-
duct, and herein lies a dilemma. Students’ evaluations of teaching effectiveness, as either
a process or a product measure, should reflect the valid effects of presage and context mea-
sures. Nevertheless, since many presage and context variables may be beyond the control
of the instructor, such influences may represent a source of unfairness in the evaluation of
teaching effectiveness - particularly when students’ evaluations are used for personnel
decisions (see Chapter 5).
Another dilemma in student evaluation research is the question of whether information
for any one purpose is appropriate for other purposes, and particularly whether the same
data should be used for diagnostic feedback to faculty and for administrative decision mak-
ing. Many (for example, Abrami et al., 1981; Braskamp et al., 1985; Centra, 1979; Doyle,
1983) argue that high inference, global, summative ratings are appropriate for administra-
tive purposes while low inference, specific, formative ratings are appropriate for diagnos-
tic feedback. A related issue is the dimensionality of students’ evaluations and whether
students’ evaluations are best represented by a single score, a set of factor scores on well-
established, distinguishable dimensions of teaching effectiveness, or responses to indi-
vidual items (see Chapter 2). In fact, many student evaluation instruments (see Appendix)
ask students to complete a common core of items that vary in their level of inference, in-
clude provision for academic units or individual instructors to select or create additional
items, and provide space for written comments to open-ended questions. Hence the ques-
tion becomes one of what is the appropriate information to report to whom for what pur-
poses.
Particularly in the last 15 years, the study of students’ evaluations has been one of the
most frequently emphasized areas in American educational research. Literally thousands
of papers have been written and a comprehensive review is beyond the scope of this
monograph. The reader is referred to reviews by Aleamoni (1981); Braskamp et al. (1985);
Centra (1979); Cohen (1980, 1981); Costin et al. (1971); de Wolf (1974); Doyle (1975,
1983); Feldman (1976a, 1976b, 1977, 1978, 1979, 1983, 1984); Kulik and McKeachie
(1975); Marsh (1980a, 1982b, 1984); McKeachie (1979); Murray (1980); Overall and
Marsh (1982); and Remmers (1963). Individually, many of these studies may provide im-
portant insights. Yet, collectively the studies cannot be easily summarized and Aleamoni
(1981) notes that opinions about students’ evaluations vary from “reliable, valid, and use-
ful” to “unreliable, invalid, and useless.” How can opinions vary so drastically in an area
which has been the subject of thousands of studies? Part of the problem lies in the precon-
ceived biases of those who study student ratings; part of the problem lies in unrealistic ex-
pectations of what student evaluations can and should be able to do; part of the problem
lies in the plethora of ad hoc instruments based upon varied item content and untested
psychometric properties; and part of the problem lies in the fragmentary approach to the
design of both student-evaluation instruments and the research based upon them.
In the early 1970s there was a huge increase in the collection of students’ evaluations of teaching effectiveness.
Information from students’ evaluations necessarily depends on the content of the evalua-
tion items, but student ratings, like the teaching that they represent, should be multi-
dimensional (for example, a teacher may be quite well organized but lack enthusiasm).
This contention is supported by common sense and a considerable body of empirical re-
search. Unfortunately, most evaluation instruments and research fail to take cognizance
of this multidimensionality. If a survey instrument contains an ill-defined hodge-podge of
different items and student ratings are summarized by an average of these items, then
there is no basis for knowing what is being measured, no basis for differentially weighting
different components in a way that is most appropriate to the particular purpose they are
to serve, nor any basis for comparing these results with other findings. If a survey contains
separate groups of related items that are derived from a logical analysis of the content of
effective teaching and the purposes that the ratings are to serve, or from a carefully con-
structed theory of teaching and learning, and if empirical procedures such as factor analysis and multitrait-multimethod (MTMM) analyses demonstrate that the items within the same group do measure the same trait and that different traits are separate and distinguishable, then it is possible to interpret what is being measured. The demonstration of a well-defined factor structure also provides a safeguard against a halo effect - a generalization from some subjective feeling
having nothing to do with effective teaching, an external influence, or an idiosyncratic re-
sponse mode - that affects responses to all items.
Some researchers, while not denying the multidimensionality of student ratings, argue
that a total rating or an overall rating provides a more valid measure. This argument is typ-
ically advanced in research where separate components of the students’ evaluations have
not been empirically demonstrated, and so there is no basis for testing the claim. More im-
portantly, the assertion is not accurate for most circumstances. First, there are many pos-
sible indicators of effective teaching; the component that is ‘most valid’ will depend on the
criteria being considered (Marsh & Overall, 1980). Second, reviews and meta-analyses of
different validity criteria show that specific components of student ratings are more highly
correlated with individual validity criteria than an overall or total rating (e.g., student
learning - Cohen, 1981; instructor self-evaluations - Marsh, 1982c; Marsh et al., 1979b; effect of feedback for the improvement of teaching - Cohen, 1980). Third, the influence
of a variety of background characteristics suggested by some as ‘biases’ to student ratings
is more difficult to interpret with total ratings than with specific components (Frey, 1978;
Marsh, 1980b, 1983, 1984). Fourth, the usefulness of student ratings, particularly as diag-
nostic feedback to faculty, is enhanced by the presentation of separate components. Fi-
nally, even if it were agreed that student ratings should be summarized by a single score for
a particular purpose, the weighting of different factors should be a function of logical and
empirical analyses of the multiple factors for the particular purpose; an optimally weighted
set of factor scores will automatically provide a more accurate reflection of any criterion
than will a non-optimally weighted total (for example, Abrami, 1985, p. 223). Hence, no
matter what the purpose, it is logically impossible for an unweighted average to be more
useful than an optimally weighted average of component scores.
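The superiority of optimal weights is easy to verify numerically. The sketch below uses invented component scores and an invented criterion (illustrative values only, not data from the studies cited); least-squares weights recover the criterion more accurately than an equal-weight composite whenever the components contribute unequally:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Two invented rating components and an invented criterion that depends
# mostly on the first component.
x1, x2 = rng.normal(size=n), rng.normal(size=n)
criterion = 0.8 * x1 + 0.1 * x2 + rng.normal(scale=0.6, size=n)

# Optimal (least-squares) weights versus a simple unweighted average.
X = np.column_stack([x1, x2, np.ones(n)])
beta, *_ = np.linalg.lstsq(X, criterion, rcond=None)

unweighted = (x1 + x2) / 2
optimal = X @ beta

print("r with criterion, unweighted average:", round(np.corrcoef(unweighted, criterion)[0, 1], 2))
print("r with criterion, optimal weights:   ", round(np.corrcoef(optimal, criterion)[0, 1], 2))
```

With these invented values the equal-weight composite correlates about .63 with the criterion, while the least-squares composite correlates about .80.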
Still other researchers, while accepting the multidimensionality of students’ evaluations
and the importance of measuring separate components for some purposes such as feed-
back to faculty, defend the unidimensionality of student ratings because “when student
ratings are used in personnel decisions, one decision is made.” However, such reasoning
is clearly illogical. First, the use to which student ratings are put has nothing to do with
their dimensionality, though it may influence the form in which the ratings are to be pre-
sented. Second, even if a single total score were the most useful form in which to sum-
marize student ratings for personnel decisions, and there is no reason to assume that gen-
erally it is, this purpose would be poorly served by an ill-defined total score based upon an
ad hoc collection of items that was not appropriately balanced with respect to the compo-
nents of effective teaching that were being measured. If a single score were to be used, it
should represent a weighted average of the different components where the weight as-
signed to each component was a function of logical and empirical analyses. There are a var-
iety of ways in which the weights could be determined, including the importance of each
component as judged by the instructor being evaluated, and the weighting could vary for
different courses or for different instructors. However the weights are established, they
should not be determined by the ill-defined composition of whatever items happen to ap-
pear on the rating survey, as is typically the case when a total score is used. Third, implicit
in this argument is the suggestion that administrators are unable to utilize or prefer not to
be given multiple sources of information for use in their deliberations, but I know of no
empirical research to support such a claim. At institutions where SEEQ has been used, ad-
ministrators (who are also teachers who have been evaluated with SEEQ and are familiar
with it) prefer to have summaries of ratings for separate SEEQ factors for each course
taught by an instructor for use in administrative decisions (see description of longitudinal
summary report by Marsh, 1982b, pp. 78-79). Important unresolved issues in student
evaluation research are how different rating components should be weighted for various
purposes, and what form of presentation is most useful for different purposes. However,
the continued, and mistaken, insistence that students’ evaluations represent a unidimen-
sional construct hinders progress on the resolution of these issues.
Empirical procedures such as factor analysis are then used to further test the dimensionality of the ratings. The most carefully constructed
instruments combine both logical/theoretical and empirical analyses in the research and
development of student rating instruments.
Feldman (1976b) sought to examine the different characteristics of the superior univer-
sity teacher from the student’s point of view with a systematic review of research that either
asked students to specify these characteristics or inferred them on the basis of correlations
between specific characteristics and students’ overall evaluations. On the basis of such
studies, and also to facilitate presentation of this material and his subsequent reviews of
other student evaluation research, Feldman derived a set of 19 categories that are listed in
Table 2.1. This list provides the most extensive and, perhaps, the best set of dimensions
that are likely to underlie students’ evaluations of effective teaching. Nevertheless,
Feldman used primarily a logical analysis based on his examination of the student evalua-
tion literature, and his results do not necessarily imply that students can accurately
evaluate these components, that these components can be used to differentiate among
teachers, or that other components do not exist.
Table 2.1
Nineteen Instructional Rating Dimensions Adapted From Feldman (1976)
These nineteen categories were originally presented by Feldman (1976) but in subsequent studies (e.g.,
Feldman, 1984) ‘Perceived outcome or impact of instruction’ and ‘Personal characteristics (‘personality’)’ were
added while rating dimensions 12 and 14 presented above were not included.
Factor analysis provides a test of whether students are able to differentiate among diffe-
rent components of effective teaching and whether the empirical factors confirm the facets
that the instrument is designed to measure. The technique cannot, however, determine
whether the obtained factors are important to the understanding of effective teaching; a
set of items related to an instructor’s physical appearance would result in a ‘physical ap-
pearance’ factor which probably has little to do with effective teaching. Consequently,
carefully developed surveys - even when factor analysis is to be used - typically begin
with item pools based upon literature reviews, and with systematic feedback from stu-
dents, faculty, and administrators about what items are important and what type of feed-
back is useful (for example, Hildebrand et al., 1971; Marsh, 1982b). For example, in the
development of SEEQ a large item pool was obtained from a literature review, instru-
ments in current usage, and interviews with faculty and students about characteristics
which they see as constituting effective teaching. Then, students and faculty were asked to
rate the importance of items, faculty were asked to judge the potential usefulness of the
items as a basis for feedback, and open-ended student comments on pilot instruments were
examined to determine if important aspects had been excluded. These criteria, along with
psychometric properties, were used to select items and revise subsequent versions. This
systematic development constitutes evidence for the content validity of SEEQ and makes
it unlikely that it contains any irrelevant factors.
The student evaluation literature does contain several examples of instruments that
have a well defined factor structure and that provide measures of distinct components of
teaching effectiveness. Some of these instruments (see Appendix for the actual instru-
ments) and the factors that they measure are the following.
(1) Frey’s Endeavor instrument (Frey et al., 1975; also see Marsh, 1981a): Presentation
Clarity, Workload, Personal Attention, Class Discussion, Organization/Planning, Grad-
ing, and Student Accomplishments.
(2) The Student Description of Teaching (SDT) questionnaire originally developed by
Hildebrand et al. (1971): Analytic/Synthetic Approach, Organization/Clarity, Instructor-Group Interaction, Instructor-Individual Interaction, and Dynamism/Enthusiasm.
(3) Marsh’s SEEQ instrument (Marsh, 1982b, 1983, 1984): Learning/Value, Instructor
Enthusiasm, Organization, Individual Rapport, Group Interaction, Breadth of Coverage,
Examinations/Grading, Assignments/Readings, and Workload/Difficulty.
(4) The Michigan State SIRS instrument (Warrington, 1973): Instructor Involvement, Student Interest and Performance, Student-Instructor Interaction, Course Demands, and Course Organization.
The systematic approach used in the development of each of these
instruments, and the similarity of the facets which they measure, support their construct
validity. Factor analyses of responses to each of these instruments provide clear support
for the factor structure they were designed to measure, and demonstrate that the students’
evaluations do measure distinct components of teaching effectiveness. More extensive re-
views describing the components found in other research (Cohen, 1981; Feldman, 1976b;
Kulik & McKeachie, 1975) identify dimensions similar to those described here.
Factor analyses of responses to SEEQ (Marsh, 1982b, 1982c, 1983, 1984) consistently
identify the nine factors the instrument was designed to measure. Separate factor analyses
of evaluations from nearly 5,000 classes were conducted on different groups of courses
selected to represent diverse academic disciplines at graduate and undergraduate levels;
each clearly identified the SEEQ factor structure (Marsh, 1983). In one study, faculty in
329 courses were asked to evaluate their own teaching effectiveness on the same SEEQ
form completed by their students (Marsh, 1982c; Marsh & Hocevar, 1983). Separate fac-
tor analyses of student ratings and instructor self-evaluations each identified the nine
SEEQ factors (see Table 2.2). In other research (Marsh & Hocevar, 1984) evaluations of
the same instructor teaching the same course on different occasions demonstrated that
even the multivariate pattern of ratings was generalizable (for example, a teacher who was
judged to be well organized but lacking enthusiasm in one course was likely to receive a
similar pattern of ratings in other classes). These findings clearly demonstrate that student
ratings are multidimensional, that the same factors underlie ratings in different disciplines
and at different levels, and that similar ratings underlie faculty evaluations of their own
teaching effectiveness.
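A minimal sketch of this kind of exploratory analysis is given below, assuming the third-party Python package factor_analyzer as a loose stand-in for the SPSS routine described in the note to Table 2.2; the data are synthetic class-average ratings with a built-in nine-factor block structure, purely for illustration:

```python
import numpy as np
from factor_analyzer import FactorAnalyzer  # third-party package, assumed available

rng = np.random.default_rng(1)

# Synthetic class-average ratings: 36 items, 4 per factor, 9 underlying factors.
n_classes, n_factors, items_per_factor = 500, 9, 4
latent = rng.normal(size=(n_classes, n_factors))
true_loadings = np.kron(np.eye(n_factors), np.ones((items_per_factor, 1)))  # 36 x 9
ratings = latent @ true_loadings.T * 0.8 + rng.normal(
    scale=0.6, size=(n_classes, n_factors * items_per_factor))

# Principal extraction with direct oblimin rotation, loosely mirroring the
# analysis described in the note to Table 2.2.
fa = FactorAnalyzer(n_factors=9, method="principal", rotation="oblimin")
fa.fit(ratings)

print(np.round(fa.loadings_, 2))  # items x 9 rotated factor-loading matrix
```

With clean block-structured data of this kind the rotated loadings recover the nine-factor pattern; the SEEQ analyses themselves, of course, were based on actual class-average responses.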
Higher-Order Factors
Table 2.2
Factor Analyses of Students’ Evaluations (S) and Instructor Self-Evaluations (F): Loadings of SEEQ Items on the Nine SEEQ Factors

The full matrix of factor loadings is not reproduced here; the paraphrased evaluation items, grouped by the factor each was designed to measure, were:

1. Learning/Value: course challenging/stimulating; learned something valuable; increased subject interest; learned/understood subject matter; overall course rating.
2. Enthusiasm: enthusiastic about teaching; dynamic and energetic; enhanced presentations with humor; teaching style held your interest; overall instructor rating.
3. Organization: instructor explanations clear; course materials prepared and clear; objectives stated and pursued; lectures facilitated note taking.
4. Group Interaction: encouraged class discussions; students shared ideas/knowledge; encouraged questions and answers; encouraged expression of ideas.
5. Individual Rapport: friendly towards students; welcomed seeking help/advice; interested in individual students; accessible to individual students.
6. Breadth of Coverage: contrasted implications; gave background of ideas/concepts; gave different points of view; discussed current developments.
7. Examinations/Grading: examination feedback valuable; evaluation methods fair/appropriate; tested emphasized course content.
8. Assignments: reading/texts valuable; added to course understanding.
9. Workload/Difficulty: course difficulty (easy-hard); course workload (light-heavy); course pace (too slow-too fast); hours/week outside of class.

Note. In the original table, factor loadings in boxes are the loadings for items designed to measure each factor, and all loadings are presented without decimal points. Factor analyses of student ratings and instructor self-ratings consisted of a principal-components analysis, Kaiser normalization, and rotation to a direct oblimin criterion. The analyses were performed with the commercially available Statistical Package for the Social Sciences (SPSS) routine (see Nie, Hull, Jenkins, Steinbrenner, & Bent, 1975).
Frey proposed two higher-order factors for his Endeavor instrument, but because his analysis was based upon responses to a single item from each scale*, the higher-order factors were not easily interpreted in that the factor
loadings did not approximate simple structure, no attempt was made to test the ability of
the two-factor solution to fit responses from the original 21 items, other research has
shown that responses to the 21 items do identify seven factors rather than just two (Frey et
al., 1975; Marsh, 1981a) and confirmatory factor analytic techniques designed to test
higher-order structures were not employed. Frey then demonstrated that his two higher-
order factors had systematic and distinguishable patterns of relations to external criteria of
validity and to potential biases to students’ evaluations. These findings reported by Frey
suggest the possibility and perhaps the importance of higher-order factors.
Remmers, as described earlier, demonstrated that the 10 traits on his Purdue scale were
important and distinguishable components of students’ evaluations. Remmers also inter-
preted the results of a factor analysis (Smalzried & Remmers, 1943) to infer two higher-
order traits that he called Empathy and Professional Maturity. However, Remmers did
not suggest that these two higher-order factors should be used instead of the 10 traits, and
he put little emphasis on them in his subsequent research. This study, like the Frey study,
suffers in that the first-order factors were inferred on the basis of single-item scales, but the
higher-order factors proposed by Remmers and Frey appear to be similar.
Feldman (1976b) derived 19 categories of characteristics of the superior teacher from
the students’ view (see Table 2.1). However, he also examined studies that contained two
or more of his 19 categories of teaching effectiveness and used the pattern of correlations
among the different categories to infer three higher-order clusters that were related to the
instructors’ roles as presenter (actor or communicator), facilitator (interactor or recip-
rocator), and manager (director or regulator).
Abrami (1985, p. 214) also suggested “that effective teaching may be described both
unitarily and multidimensionally in a way analogous to the way Wechsler’s tests operation-
ally define intelligence in both general and specific terms” and that support for such an in-
terpretation would have important implications. Such a general factor may be a higher-
order factor derived from lower-order, more specific facets, and this would be consistent
with the higher-order perspective that is considered here. It should be noted, however,
that none of the studies examined here suggest that a single higher-order factor exists.
Hence, it seems unlikely that a unitary construct will be identified but it should be em-
phasized that the appropriate research to support or refute such a claim has not been con-
ducted.
The interpretation of first-order factors and their relation to external constructs may be
facilitated by the demonstration of a well-defined higher-order structure as proposed by
Frey. However, even if the existence of the higher-order factors can be demonstrated with
appropriate analytical procedures, this does not imply that the higher-order factors should
be used instead of the lower-order factors or that the higher-order factors can be inferred
without first measuring the lower-order factors. Despite the expedient advantages of
* Frey (1979) indicates that during its evolution there were 12 different versions of Endeavor. Endeavor XI (see Appendix) consists of 21 items that measure 7 first-order factors, and these first-order factors may be used to infer two higher-order factors. Endeavor XII consists of 7 items, the best item from each of the 7 first-order dimensions on Endeavor XI, that measure 2 higher-order factors. The existence of the 2 Endeavor forms has resulted in some confusion about the dimensionality of the Endeavor. My interpretation is that both the 7- and 21-item versions of Endeavor measure 7 first-order factors, and that responses to these may be used to infer 2 higher-order factors. This interpretation is apparently consistent with that which appears in the Endeavor manual (Frey, 1979) and reconciles apparent discrepancies in descriptions of Endeavor (for example, Marsh, 1984 as compared to Abrami, 1985).
single-item scales, their use instead of multi-item scales, as in the 7-item Endeavor form
and the Purdue rating scale, cannot be generally recommended. In a methodological re-
view of many areas of research, Rushton et al. (1983) concluded that single-item scales are
less stable, less reliable, less valid and less representative than multi-item scales. Marsh et
al. (1985) compared responses to single- and multi-item scales by the same subjects and
reached similar conclusions. The investigation of higher-order factor structures in stu-
dents’ evaluations of teaching effectiveness represents an important area for future re-
search and this research will be facilitated by recent advances in the application of confir-
matory factor analysis that are summarized in the next section.
Successful demonstrations of higher-order factors in student evaluation research would illustrate the practical significance of this application of confirmatory factor analysis for future research.
Implicit Theories
Abrami et al. (1981), Larson (1979), Whitely and Doyle (1976), and others, have argued
that dimensions identified by factor analyses of students’ evaluations may reflect raters’
implicit theories about dimensions of teacher behaviors in addition to, or instead of, di-
mensions of actual teaching behaviors. For example, if a student implicitly assumes that
the occurrence of behaviors X and Y are highly correlated and observes that a teacher is
high on X, then the student may also rate the teacher as high on Y even though the student
does not have an adequate basis for rating Y.
Implicit theories have a particularly large impact on factor analyses of individual student
responses, and this argues against the use of the individual student as the unit of analysis.
In fact, if the ratings by individual students within the same class are factor analyzed and
it is assumed that the stimulus being judged is constant for different students - a prob-
lematic assumption - then the derived factors reflect primarily implicit theories. Whitely
and Doyle (1976) suggest that students’ implicit theories are controlled when factor
analyses are performed on class-average responses, while Abrami et al. (1981, p. 13) warn
that it is only when students are randomly assigned to classes that the “computation of
class means cancels out individual student expectations and response patterns as sources of
variability.” However, Larson (1979) demonstrated that even class-average responses,
whether or not based upon random assignment, are affected by implicit theories if the im-
plicit theories generalize across students; it is only those implicit theories that are idiosyn-
cratic to individual students, along with a variety of sources of random variation, that are
cancelled out in the formation of class averages. While still arguing that course characteris-
tics may be reflected in the results of factor analyses, Abrami (1985) agreed with Larson
on this point.
Whitely and Doyle (1976; see also Abrami, 1985) suggested that rating dimensions de-
fined by factor analyses of individual student responses and of class-average responses are
similar because the implicit theories that affect the individual responses are generally
valid. However this explanation does not account for the possibility that implicit theories
that generalize across students may be reflected in both sets of factors. For this reason Lar-
son (1979) argued that the validity of students’ implicit theories cannot be tested with alter-
native factor analytic procedures based upon student ratings, no matter what the unit of
analysis, and that independent measures are needed. More generally the similarity or dis-
similarity of factors derived from analyses of individual student responses and class-aver-
age responses will never provide convincing evidence for the validity or invalidity of im-
plicit theories, or for the extent of influence implicit theories have on student ratings.
While the class-average is the only defensible unit of analysis for factor analyses of student
evaluations (for example, Marsh, 1984), the factors derived from such analyses represent
some unknown combination of students’ implicit theories about how teaching behaviors
should covary in addition to actual observations of how they do covary.
Support for the validity of the factor structure underlying students’ evaluations requires
that a similar structure be identified with a different method of assessment. Hence, the
similarity of the factor structures resulting from student ratings and instructor self-evalua-
tions shown in Table 2.2 is particularly important. While students and instructors may have
similar implicit theories, instructors are uniquely able to observe their own behaviors and
have little need to rely upon implicit theories in forming their self-ratings. Thus, the simi-
larity of the two factor structures supports the validity of the rating dimensions that were
identified by responses from both groups.
Semantic Similarities
In summary, most student evaluation instruments used in higher education, both in re-
search and in actual practice, have not been developed using systematic logical and empir-
ical techniques such as those described in this monograph. The evaluation instruments discussed
earlier each provided clear support for the multidimensionality of students’ evaluations,
but the debate about which specific components of teaching effectiveness can and should
be measured has not been resolved, though there seems to be consistency in those that are
identified in responses to the most carefully designed instruments. Students’ evaluations
cannot be adequately understood if this multidimensionality is ignored. Many orderly, log-
ical relationships are misinterpreted or cannot be consistently replicated because of this
failure, and the substantiation of this claim will constitute a major focus of this monograph.
Instruments used to collect students’ evaluations of teaching effectiveness, particularly
those used for research purposes, should be designed to measure separate components of
teaching effectiveness, and support for both the content and construct validity of the mul-
tiple dimensions should be demonstrated.
CHAPTER 3
Reliability
The reliability of student ratings is commonly determined from the results of item analyses
(i.e. correlations among responses to different items designed to measure the same com-
ponent of effective teaching) and from studies of interrater agreement (i.e. agreement
among ratings by different students in the same class). The internal consistency among re-
sponses to items designed to measure the same component of effective teaching is consis-
tently high. However, such internal consistency estimates provide an inflated estimate of
reliability since they ignore the substantial portion of error due to the lack of agreement
among different students within the same course, and so they generally should not be used
(see Gilmore et al., 1978 for further discussion). Internal consistency estimators may be
appropriate, however, for determining whether the correlations between multiple facets
are so large that the separate facets cannot be distinguished, as in multitrait-multimethod
(MTMM) studies.
The correlation between responses by any two students in the same class (i.e. the single
rater reliability) is typically in the 0.20s but the reliability of the class-average response de-
pends upon the number of students rating the class as originally described by Remmers
(see also Feldman, 1977, for a review of methodological issues and empirical findings). For
example, the estimated reliability for SEEQ factors is about 0.95 for the average response
from 50 students, 0.90 from 25 students, 0.74 from 10 students, 0.60 from five students,
and only 0.23 for one student. As previously noted by Remmers, given a sufficient number
of students, the reliability of class-average student ratings compares favorably with that of
the best objective tests. In most applications, this reliability of the class-average response,
based on agreement among all the different students within each class, is the appropriate
method for assessing reliability. Recent applications of generalizability theory de-
monstrate how error due to differences between items and error due to differences be-
tween ratings of different students can both be incorporated into the same analysis, but the
error due to differences between items appears to be quite small (Gilmore et al., 1978).
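The reliabilities quoted above can be approximately reproduced from the Spearman-Brown relation given in Chapter 1, taking the single-rater reliability to be about 0.23; a minimal sketch:

```python
# Spearman-Brown prophecy formula applied to raters: reliability of the average
# rating from k students in a class, given the single-rater reliability r1.
def class_average_reliability(k: int, r1: float = 0.23) -> float:
    return k * r1 / (1 + (k - 1) * r1)

for k in (1, 5, 10, 25, 50):
    print(f"{k:2d} students: {class_average_reliability(k):.2f}")
# prints 0.23, 0.60, 0.75, 0.88, 0.94 -- close to the 0.23, 0.60, 0.74, 0.90,
# and 0.95 quoted in the text
```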
Some critics suggest that students cannot recognize effective teaching until after being
called upon to apply course materials in further coursework or after graduation. Accord-
ing to this argument, former students who evaluate courses with the added perspective of
time will differ systematically from students who have just completed a course when
evaluating teaching effectiveness. Remmers (Drucker & Remmers, 1950) originally coun-
tered this contention by showing that responses by ten year alumni agreed with those of
current students. More recent cross-sectional studies (Centra, 1979; Howard et al., 1985;
Marsh, 1977) have also shown good correlational agreement between the retrospective
ratings of former students and those of currently enrolled students. In a longitudinal study
(Marsh & Overall, 1979a; Overall & Marsh, 1980; see Table 3.1) the same students
evaluated classes at the end of a course and again several years later, at least one year after
graduation. End-of-class ratings in 100 courses correlated 0.83 with the retrospective rat-
ings (a correlation approaching the reliability of the ratings), and the median rating at each
time was nearly the same. Firth (1979) asked students to evaluate classes at the time of
graduation from their university (rather than at the end of each class) and one year after
graduation, and he also found good agreement between the two sets of ratings by the same
students. These studies demonstrate that student ratings are quite stable over time, and
argue that added perspective does not alter the ratings given at the end of a course. Hence,
these findings not only provide support for the long-term stability of student ratings, but
they also provide support for their construct validity (see Chapter 4).
Table 3.1
Long-Term Stability of Student Evaluations: Relative and Absolute Agreement Between End-of-Term and Retrospective Ratings (Reprinted with permission from Overall & Marsh, 1980)

Item | r (individual students) | r (class averages) | M end-of-term | M retrospective | Difference (end-of-term minus retrospective)
1. Purpose of class assignments made clear | .55** | .81** | 6.63 | 6.61 | +.02
2. Course objectives adequately outlined | .63** | .84** | 6.61 | 6.47 | +.14*
3. Class presentations prepared and organized | .62** | .79** | 6.67 | 6.54 | +.13
4. You learned something of value | .53** | .81** | 6.65 | 6.87 | -.22**
5. Instructor considerate of your viewpoint | .58** | .83** | 6.59 | 6.88 | -.29**
6. Instructor involved you in discussions | .56** | .84** | 6.63 | 6.75 | -.12
7. Instructor stimulated your interest | .58** | .82** | 6.38 | 6.50 | -.12
8. Overall instructor rating | .65** | .84** | 6.55 | 6.74 | -.19*
9. Overall course rating | .56** | .83** | 6.65 | 6.50 | +.15*
Median across all nine rating items | .58 | .83 | 6.63 | 6.61 |

Note. Student responses from 100 different sections each assessed instructional effectiveness at the end of each class (end of term) and again 1 year after graduation (retrospective follow-up). All ratings were made along a 9-point response scale that varied from 1 (very low or never) to 9 (very high or always).
* p < .05. ** p < .01.
In the same longitudinal study, Marsh (see Marsh & Overall, 1979) demonstrated that,
consistent with previous research, the single-rater reliabilities were generally in the 0.20s
for both end-of-course and retrospective ratings. (Interestingly, the single-rater re-
liabilities were somewhat higher for the retrospective ratings.) However, the median cor-
relation between end-of-class and retrospective ratings, when based on responses by indi-
vidual students instead of class-average responses, was 0.59 (Table 3.1). The explanation
for this apparent paradox is the manner in which systematic unique variance, as opposed
to random error variance, is handled in determining the single rater reliability estimate and
the stability coefficient. Variance that is systematic, but unique to the response of a par-
ticular student, is taken to be error variance in the computation of the single-rater reliabil-
ity. However, if this systematic variance was stable over the several year period between
the end-of-course and retrospective ratings for an individual student, a demanding criter-
ion, then it is taken to be systematic variance rather than error variance in the computation
of the stability coefficient. While conceptual differences between internal consistency and
stability approaches complicate interpretations, there is clearly an enduring source of sys-
tematic variation in responses by individual students that is not captured by internal consis-
tency measures. This also argues that while the process of averaging across the ratings pro-
duces a more reliable measure, it also masks much of the systematic variance in individual
student ratings, and that there may be systematic differences in ratings linked to specific
subgroups of students within a class (also see Feldman, 1977). Various subgroups of stu-
dents within the same class may view teaching effectiveness differently, and may be diffe-
rentially affected by the instruction which they receive, but there has been surprisingly
little systematic research to examine this possibility.
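One way to see how the same data can yield a single-rater reliability near 0.2 and an individual-level stability near 0.6 is a small simulation in which each rating is the sum of a class effect, a stable student-specific component, and occasion-specific noise. The variance components below are invented for illustration and are not estimates from the studies cited:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_students = 500, 25

# Invented variance components: class effect, stable student-specific component,
# and occasion-specific noise (chosen to mimic the figures discussed in the text).
var_class, var_student, var_noise = 0.23, 0.37, 0.40

class_eff = rng.normal(0.0, np.sqrt(var_class), size=(n_classes, 1))
student_eff = rng.normal(0.0, np.sqrt(var_student), size=(n_classes, n_students))
time1 = class_eff + student_eff + rng.normal(0.0, np.sqrt(var_noise), size=(n_classes, n_students))
time2 = class_eff + student_eff + rng.normal(0.0, np.sqrt(var_noise), size=(n_classes, n_students))

# Single-rater reliability: agreement between two different students in the same class.
single_rater = np.corrcoef(time1[:, 0], time1[:, 1])[0, 1]

# Stability: agreement between the two occasions for the same individual students.
stability = np.corrcoef(time1.ravel(), time2.ravel())[0, 1]

print(f"single-rater reliability   ~ {single_rater:.2f}")  # about 0.23
print(f"individual-level stability ~ {stability:.2f}")      # about 0.60
```

The stable student-specific component counts as error in the single-rater reliability but as true-score variance in the stability coefficient, which is exactly the distinction drawn above.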
Researchers have also asked how highly correlated student ratings are in two different
courses taught by the same instructor, and even in the same course taught by different
teachers on two different occasions. This research is designed to address three related
questions. First, what is the generality of the construct of effective teaching as measured
by students’ evaluations? Second, what is the relative importance of the effect of the in-
structor who teaches a class on students’ evaluations, compared to the effect of the particu-
lar class being taught? If the impact of the particular course is large, then the practice of
comparing ratings of different instructors for tenure/promotion decisions may be dubious.
Third, should ratings be averaged across different courses taught by the same instructor?
Marsh (1981b) arranged ratings of 1364 courses into 341 sets such that each set contained
ratings of: the same instructor teaching the same course on two occasions, the same in-
structor teaching two different courses, the same course taught by a different instructor,
and two different courses taught by different instructors (Table 3.2). For an overall in-
structor rating item the correlation between ratings of different instructors teaching the
same course (i.e. a course effect) was -0.05, while correlations for the same instructor in
different courses (0.61) and in two different offerings of the same course (0.72) were much
larger (Table 3.2). While this pattern was observed in each of the SEEQ factors, the corre-
lation between ratings of different instructors in the same course was slightly higher for
some evaluation factors (for example, Workload/Difficulty, Assignments, and Group In-
Table 3.2
Correlations Among Different Sets of Classes for Student Ratings and Background Characteristics (Reprinted with permission from Marsh, 1984b)

Columns: (1) same instructor, same course (two occasions); (2) same instructor, different courses; (3) different instructors, same course; (4) different instructors, different courses.

Student rating | (1) | (2) | (3) | (4)
Learning/Value | .696 | .563 | .232 | .069
Enthusiasm | .734 | .613 | .011 | .028
Organization/Clarity | .676 | .540 | -.023 | -.063
Group Interaction | .699 | .540 | .291 | .224
Individual Rapport | .726 | .542 | .180 | .146
Breadth of Coverage | .727 | .481 | .117 | .067
Examinations/Grading | .633 | .512 | .066 | -.004
Assignments | .681 | .428 | .332 | .112
Workload/Difficulty | .773 | .400 | .392 | .215
Overall course | .712 | .591 | -.011 | -.065
Overall instructor | .719 | .607 | -.051 | -.059
Mean coefficient | .707 | .523 | .140 | .061

Background characteristic | (1) | (2) | (3) | (4)
Prior subject interest | .635 | .312 | .563 | .209
Reason for taking course (percent indicating general interest) | .770 | .448 | .671 | .383
Class average expected grade | .709 | .405 | .483 | .356
Workload/Difficulty | .773 | .400 | .392 | .215
Course enrollment | .846 | .312 | .593 | .058
Percent attendance on day evaluations administered | .406 | .164 | .214 | .045
Mean coefficient | .690 | .340 | .491 | .211
teraction) but had a mean of only 0.14 across all the factors. In marked contrast, correla-
tions between background variables in different sets of courses (for example, prior subject
interest, class size, reason for taking the course) were higher for the same course taught by
two different instructors than for two different courses taught by the same instructor
(Table 3.2). Based on a path analysis of these results, Marsh argued that the effect of the
teacher on student ratings of teaching effectiveness is much larger than is the effect of the
course being taught, and that there is a small portion of reliable variance that is unique to
a particular instructor in a particular course that generalizes across different offerings of
the same course taught by the same instructor. Hence, students’ evaluations primarily re-
flect the effectiveness of the instructor rather than the influence of the course, and some
instructors may be uniquely suited to teaching some specific courses. A systematic exami-
nation of the suggestion that some teachers are better suited for some specific courses, and
that this can be identified from the results from a longitudinal archive of student ratings,
is an important area for further research.
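These correlations can also be read as rough variance shares under a simple additive model (an illustrative simplification on the part of this sketch; the published analysis was a path analysis), and the arithmetic is consistent with the interpretation above:

```python
# Overall instructor rating correlations quoted in the text (Table 3.2), read under
# a simplified additive model (illustration only; the published analysis was a path
# analysis):  rating = instructor + course + instructor-x-course + occasion error
r_same_instr_same_course = 0.72   # same instructor, same course, different offerings
r_same_instr_diff_course = 0.61   # same instructor, different courses
r_diff_instr_same_course = -0.05  # different instructors, same course

share_instructor = r_same_instr_diff_course                 # ~0.61 of total variance
share_course = max(r_diff_instr_same_course, 0.0)           # ~0.00 (negative estimate set to zero)
share_instr_x_course = (r_same_instr_same_course
                        - share_instructor - share_course)  # ~0.11: small, course-specific
share_occasion = 1.0 - r_same_instr_same_course             # ~0.28: not shared across offerings

print(share_instructor, share_course, share_instr_x_course, share_occasion)
```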
These results provide support for the generality of student ratings across different
courses taught by the same instructor, but provide no support for the use of student ratings
to evaluate the course. Even student ratings of the overall course were primarily a function
of the instructor who taught the course, and not the particular course that was being
evaluated. In fact, the predominance of the instructor effect over the course effect was vir-
tually the same for both the overall instructor rating and the overall course rating. This
finding probably reflects the autonomy that university instructors typically have in con-
ducting the courses that they teach, and may not generalize to the relatively unusual setting
in which instructors have little or no autonomy. Nevertheless, the findings provide no sup-
port for the validity of student ratings of the course independent of the instructor who
teaches the course.
Marsh and Overall (1981) examined the effect of course and instructor in a setting where
all students were required to take all the same courses, thus eliminating many of the prob-
lems of self-selection that plague most studies. The same students evaluated instructors at
the end of each course and again one year after graduation from the program. For both
end-of-course and follow-up ratings, the particular instructor teaching the course ac-
counted for 5 to 10 times as much variance as the course. These findings again demonstrate
that the instructor is the primary determinant of student ratings rather than the course he
or she teaches.
Marsh and Hocevar (1984) also examined the consistency of the multivariate structure
of student ratings. University instructors who taught the same course at least four times
over a four-year period were evaluated by different groups of students in each of the four
courses (n=314 instructors, 1254 classes, 31,322 students). Confirmatory factor analysis
demonstrated not only the generalizability of the ratings of the instructors across the four
sets of courses, but also the generalizability of multivariate structure. For example, an in-
structor who was evaluated to be enthusiastic but poorly organized in one class received a
similar pattern of ratings in other offerings of the same course. The results of this study de-
monstrate a consistency of the factor structure across the different sets of courses, a rela-
tive lack of method/halo effect in the ratings, and a generalizability of the multivariate
structure; all of which provide a particularly strong demonstration of the multifaceted na-
ture of student ratings.
Gilmore et al. (1978), applying generalizability theory to student ratings, also found that
the influence of the instructor who teaches the course is much larger than that of the course
that is being taught. They suggested that ratings for a given instructor should be averaged
across different courses to enhance generalizability. If it is likely that an instructor will
teach many different classes during his or her subsequent career, then tenure decisions
should be based upon as many different courses as possible - Gilmore et al. suggest at
least five. However, if it is likely that an instructor will continue to teach the same courses
in which he or she had already been evaluated, then results from at least two different of-
ferings of each of these courses are suggested. These recommendations require that a lon-
gitudinal archive of student ratings be maintained for personnel decisions. These data would
provide for more generalizable summaries, the assessment of changes over time, and the
determination of which particular courses are best taught by a specific instructor. It is in-
deed unfortunate that some universities systematically collect students’ evaluations, but
fail to keep a longitudinal archive of the results. Such an archive would help overcome
some of the objections to student ratings (e.g., idiosyncratic occurrences in one particular
set of ratings), would enhance their usefulness, and would provide an important data base for
further research.
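As shown in the sketch below, the archival recommendation is easy to make concrete. The structure, names, and illustrative ratings are my own assumptions and are not part of SEEQ or of the studies reviewed above; the point is simply that class-average ratings keyed by instructor, course, and term can be summarized per course and pooled across all archived offerings.

```python
# A minimal sketch (hypothetical names and values) of a longitudinal student-ratings
# archive: class-average ratings are stored by instructor, course, and term, and
# summaries are averaged over as many offerings as possible.
from collections import defaultdict
from statistics import mean

archive = defaultdict(list)  # (instructor, course) -> [(term, class-average rating), ...]

def record(instructor, course, term, class_average):
    archive[(instructor, course)].append((term, class_average))

def instructor_summary(instructor):
    """Mean rating per course and overall, pooled across all archived offerings."""
    per_course = {c: round(mean(r for _, r in offerings), 2)
                  for (i, c), offerings in archive.items() if i == instructor}
    pooled = [r for (i, _), offerings in archive.items() if i == instructor
              for _, r in offerings]
    return {"per_course": per_course,
            "overall": round(mean(pooled), 2),
            "n_offerings": len(pooled)}

record("Smith", "PSY101", "1985-1", 4.1)
record("Smith", "PSY101", "1986-1", 4.3)
record("Smith", "PSY305", "1986-2", 3.8)
print(instructor_summary("Smith"))
# {'per_course': {'PSY101': 4.2, 'PSY305': 3.8}, 'overall': 4.07, 'n_offerings': 3}
```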
SEEQ provides space for student comments, and responses to these comments were
used in the development of SEEQ to determine whether or not important aspects of teach-
ing effectiveness had been excluded. Students’ comments are also viewed as a valuable
source of diagnostic feedback for the instructor and are returned along with the com-
puterized summary of the student ratings. However, it was assumed that content analyses
of the written comments were practically unfeasible and that, perhaps, the comments would
be most useful in their original form. Furthermore, students' comments have not been sys-
tematically related to responses to SEEQ items or to other indicators of effective teaching.
Braskamp and his colleagues (Braskamp et al., 1985; Braskamp et al., 1981; Ory et al.,
1980) have pursued some of these issues. In a small study of 14 classes the overall favorabil-
ity of student comments was assessed with reasonable reliability by two judges and its cor-
relation with the overall student ratings (0.93) was close to the limits of the reliability of the
two indicators (Ory et al., 1980). In a larger study of 60 classes Braskamp et al. (1981)
sorted student comments into one of 22 content categories and evaluated comments in
terms of favorability. Although the researchers did not provide reliability data for the
analysis of the written comments, the overall favorability of these comments was again
substantially correlated with the overall instructor rating (0.75). In both studies these
authors argued that student ratings are more cost effective for obtaining overall evalua-
tions of teaching effectiveness, but that the comments offer more specific and diagnostic
information than do class-average ratings.
In a related study, Ory and Braskamp (1981) simulated results from written comments
and rating items - both global and specific - about a hypothetical instructor. The simu-
lated comments were presented as if they were in their original unedited form, and not as
summaries of the content analyses as described earlier. They then asked faculty to evaluate
the ratings and comments for purposes of both self-improvement and personnel decisions
on a number of attributes. The rating items were judged as easier to interpret and more
comprehensive regardless of the purpose but other judgements varied according to pur-
pose. In general, faculty judged ratings as superior for personnel decisions, but judged the
written comments as superior for self-improvement. Speculating on the results for written
comments, the authors suggested that “the nonstandardized, unique, personal written
comments by students are perceived as too subjective for important personnel decisions.
However, this highly idiosyncratic information about a particular course is viewed as use-
ful diagnostic information for making course changes” (pp. 280-281).
The research by Braskamp and his colleagues demonstrates that student comments, at
least on a global basis, can be reliably scored and that these scores agree substantially with
students' responses to overall rating items. This supports the generality of the ratings.
They also contend that the comments contain useful information that is not contained in
the overall ratings, and this seems plausible. Their findings also provide support for the
collection of student comments and returning these comments to faculty along with sum-
maries of the rating items. However, they do not indicate whether the additional infor-
mation provided by comments comes from the undigested, original comments or from the
results of detailed content analyses such as they performed. As Braskamp et al. (1981)
noted, their content categories are similar to those measured by multidimensional rating
scales, and so it may still be more cost effective to use rating items for even this type of
specific information. I suggest, as may be implied by Ory and Braskamp (1981), that the
useful information from comments that cannot be obtained from rating items is idiosyn-
cratic information that cannot easily be classified into generalizable categories, that is so
specific that its value would be lost if it was sorted into broader categories, or that cannot
be easily interpreted without knowledge of the particular context. From this perspective,
Remmers’ approach to the reliability of student ratings, agreement among different stu-
dents within the same class, has been useful, but it is too simplistic. Generalizability
theory, as applied by Kane and Brennan (1977) and by Kane et al. (1976), seemed to pro-
vide the promise of a much more elegant approach, and Aubrecht (1981) predicted that
there would be many such studies. However, researchers have apparently been unable to
deliver on this promise and there are a number of problems that require further attention.
First, while there have been numerous applications of generalizability theory in student
evaluation research, none that I know of have used responses to well defined multiple di-
mensions of teaching effectiveness even though generalizability theory has been used to
examine other multivariate constructs. Second, the ideal generalizability study would re-
quire that each of a large set of courses was taught by each of a large set of different instruc-
tors and was evaluated by the same set of students with no missing data. Since this is obvi-
ously impossible, it is typically assumed that the different instructors who teach the same
course over a period of years represent a random sample of instructors, that the different
students who take a given course represent a random sample of students, or that the diffe-
rent courses taught by the same teacher represent a random sample of courses. However,
such assumptions are clearly unwarranted. Gilmore (1980) illustrated that dramatically
different - or even impossible - findings result, depending on which of these assump-
tions is made, and concluded with a quote from Mark Twain: “The thirteenth stroke of a
clock is not only false of itself, but casts grave doubts on the credibility of the preceding
twelve” (p. 14). Gilmore, though discouraged about the application of generalizability
theory in student evaluation research, proposed that the problems could be resolved, but this
hope has apparently not been realized.
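For illustration only, the fully crossed design described above can be written as a variance decomposition; the notation here is my own sketch and is not taken from Kane and Brennan (1977) or from Gilmore (1980). With instructors (i) as the objects of measurement and courses (c) and students (s) treated as random facets, an observed rating and the generalizability coefficient for instructor differences are approximately

\[
X_{ics} = \mu + \nu_i + \nu_c + \nu_s + \nu_{ic} + \nu_{is} + \nu_{cs} + \nu_{ics,e},
\qquad
E\rho^2_{i} \;=\; \frac{\sigma^2_i}{\sigma^2_i + \sigma^2_{ic}/n_c + \sigma^2_{is}/n_s + \sigma^2_{ics,e}/(n_c n_s)},
\]

where \(n_c\) and \(n_s\) are the numbers of courses and students averaged over in a decision study. The unwarranted sampling assumptions noted above enter at exactly this point: the formula treats the observed instructors, courses, and students as random samples from the universes to which one wishes to generalize, and the variance components cannot be estimated sensibly when that assumption fails.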
Profile Analysis
Research by Braskamp and his colleagues suggests that student written comments can
be scored according to detailed content categories. Unfortunately, their research did not
include a multidimensional evaluation instrument. Hence the extent to
which student comments in specific categories converge with student ratings in matching
categories is not known, and this seems like a fruitful application of MTMM analyses. Also, I know of
no research that has systematically related scores derived from student comments and
from rating items to other indicators of teaching effectiveness and this seems surprising.
While the cost of systematically analyzing student written comments may be prohibitive
for campus-wide programs, the undertaking seems to be reasonable as part of research
studies.
CHAPTER 4
VALIDITY
Student ratings, which constitute one measure of teaching effectiveness, are difficult to
validate since there is no single criterion of effective teaching. Researchers who use a con-
struct validation approach have attempted to demonstrate that student ratings are logically
related to various other indicators of effective teaching. In this approach, ratings are re-
quired to be substantially correlated with a variety of indicators of effective teaching and
less correlated with other variables, and, in particular, specific rating factors are required
to be most highly correlated with variables to which they are most logically and theoreti-
cally related. Within this framework, evidence for the long-term stability of students’
evaluations, for the generalizability of student ratings of the same instructor in different
courses, and for the generalizability of inferences across student ratings and student writ-
ten comments (see Chapter 3) can be interpreted as support for their validity. The most
widely accepted criterion of effective teaching is student learning, but other criteria in-
clude changes in student behaviors, instructor self-evaluations, the evaluations of peers
and/or administrators who actually attend class sessions, the frequency of occurrence of
specific behaviors observed by trained observers, and the effects of experimental manipu-
lations.
Historically, some researchers have argued for the use of criterion-related validity in-
stead of construct validity in the study of students’ evaluations, typically proposing that
measures of student learning are the criterion against which to validate student ratings. If
researchers specifically define effective teaching to mean student learning, and operation-
ally define student learning and students’ evaluations, it may be accurate to describe such
findings in terms of criterion-related validity but there are problems with such an approach
(see also Howard et al., 1985). First, such an approach requires the assumption that effec-
tive teaching and student learning are synonymous and this seems unwarranted. A more
reasonable assumption that is consistent with the construct validation approach is that stu-
dent learning is only one indicator of effective teaching, even if a very important indicator.
Second, even if student learning were assumed to be the only criterion of effective teach-
ing, student learning is also a hypothetical construct and so it might still be appropriate to
use the construct validity approach. Third, since construct validity subsumes criterion-
related validity (Cronbach, 1984; APA, 1985) it is logically impossible for criterion-related
validity to be appropriate instead of construct validity. Fourth, the narrow criterion-
related approach to validity will inhibit a better understanding of what is being measured
by students' evaluations, of what can be inferred from students' responses, and of how find-
ings from diverse studies can be understood within a common framework.
Student learning, particularly if inferred from an objective, reliable, and valid test, is
probably the most widely accepted criterion of effective teaching. The purpose of this sec-
tion is not to argue for or against the use of student learning in the evaluation of teaching,
but rather to review research that uses learning as a criterion for validating students' evaluations of
teaching effectiveness. Nevertheless, it is important to recognize that student achievement
is not generally appropriate as an indicator of effective teaching in universities as they are
presently organized. Examination scores in physical chemistry cannot be compared to
examination scores in introductory psychology; examination scores cannot be compared in
Methodological Problems
Rodin and Rodin (1972) reported a negative correlation between section-average grade
effective at accomplishing this when the number of sections is large and the number of stu-
dents within each section is small, because chance alone will create differences among the
sections. This paradigm does not constitute an experimental design in which students are
randomly assigned to treatment groups that are varied systematically in terms of experi-
mentally manipulated variables, and so the advantages of random assignment are not so
clear as in a standard experimental design.* Furthermore, the assumption of truly random
assignment of students to classes in large scale field studies is almost always compromised
by time-scheduling problems, students dropping out of a course after the initial assign-
ment, missing data, etc. For multisection validity studies the lack of initial equivalence is
particularly critical, since initial presage variables are likely to be the primary determinant
of end-of-course achievement. For this reason it is important to have effective pre-test
measures even when there is random assignment. While this may produce a pre-test sen-
sitization effect, the effect is likely to be trivial since: (a) pre-test variables will typically dif-
fer substantially from post-test measures; (b) there is no intervention other than the nor-
mal instruction that students expect to receive; (c) it seems unlikely that the collection of
pre-test measures will systematically affect either teaching effectiveness or student per-
formance; (d) pre-test measures can sometimes be obtained from student records without
having to actually be collected as part of the study; and (e) a ‘no-pre-test control’ could be
included. Sixth, performance on objectively scored examinations that have been the focus
of multisection validity studies may be an unduly limited criterion of effective teaching
(Dowel1 & Neal, 1982; Marsh & Overall, 1980). In summary, the multisection validity de-
sign is inherently weak and there are many methodological complications in its actual ap-
plication.
Cohen (1981) conducted a meta-analysis of all known multisection validity studies, regard-
less of methodological problems such as those found in the Rodin and Rodin study. Across 68
multisection courses, student achievement was consistently correlated with student ratings
of Skill (0.50), Overall Course (0.47), Structure (0.47), Student Progress (0.47), and Over-
all Instructor (0.43). Only ratings of Difficulty had a near-zero or a negative correlation
with achievement. The correlations were higher when ratings were of full-time teachers,
when students knew their final grade when rating instructors, and when achievement tests
were evaluated by an external evaluator. Other study characteristics (for example, ran-
dom assignment, course content, availability of pre-test data) were not significantly re-
lated to the results. Many of the criticisms of the multisection validity study are at least par-
tially answered by this meta-analysis, particularly problems due to small sample sizes and
the weakness of the predicted effect, and perhaps the issue of the multiplicity of achieve-
ment measures and student rating instruments. These results provide strong support for
the validity of students’ evaluations of teaching effectiveness.
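For readers unfamiliar with how such correlations are pooled, the sketch below shows one common meta-analytic averaging scheme, a Fisher r-to-z transformation with each study weighted by the inverse of its sampling variance. The numbers are hypothetical and the weighting scheme is an assumption on my part; it is not offered as a description of Cohen's (1981) actual procedures.

```python
# Hedged sketch: pooling section-level validity correlations across multisection
# studies via the Fisher r-to-z transformation, weighting each study's z by
# n_sections - 3 (the inverse of the Fisher-z sampling variance).
import math

def mean_correlation(studies):
    """studies: list of (r, n_sections) pairs; returns the weighted mean correlation."""
    num = sum((n - 3) * math.atanh(r) for r, n in studies)
    den = sum(n - 3 for _, n in studies)
    return math.tanh(num / den)

hypothetical_studies = [(0.45, 12), (0.30, 8), (0.55, 20)]
print(round(mean_correlation(hypothetical_studies), 2))  # about 0.49 for these values
```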
Marsh (1984; Marsh et al., 1975; Marsh & Overall, 1980) identified an alternative expla-
nation for positive results in multisection validity studies that he called the grading satisfac-
tion hypothesis (also called the grading leniency effect elsewhere). When course grades
(known or expected) and performance on the final exam are significantly correlated, then
higher evaluations may be due to: (a) more effective teaching that produces greater learn-
ing and higher evaluations by students; (b) increased student satisfaction with higher
grades which causes them to ‘reward’ the instructor with higher ratings independent of
more effective teaching or greater learning; or (c) initial differences in student characteris-
tics (for example, Prior Subject Interest, Motivation, and Ability) that affect both teaching
effectiveness and performance. The first hypothesis argues for the validity of student rat-
ings as a measure of teaching effectiveness, the second represents an undesirable bias in
the ratings, and the third is the effect of presage variables that may be accurately reflected
by the student ratings. Even when there are no initial differences between sections, either
of the first two explanations is viable, and Cohen’s finding that validity correlations are
substantially higher when students already know their final course grade makes the grad-
ing satisfaction hypothesis a plausible counter explanation. Palmer et al. (1978) made simi-
lar distinctions but their research has typically been discussed in relation to the potential
biasing effect of expected grades (see also Howard & Maxwell, 1980, 1982) rather than
multisection validity studies. Dowell and Neal (1982) also suggest such distinctions, but
then apparently confound the effects of grading leniency and initial differences in section-
average ability in their discussion and review of multisection validity studies.
Only in the two SEEQ studies (Marsh et al., 1975; Marsh & Overall, 1980) was the grad-
ing satisfaction hypothesis controlled as a viable alternative to support for the validity
hypothesis. The researchers reasoned that in order for satisfaction with higher grades to af-
fect students’ evaluations at the section-average level, section-average expected grades
must differ at the time the student evaluations are completed. In both these studies student
performance measures prior to the final examination were not standardized across sec-
tions. Hence, while each student knew approximately how his or her performance com-
pared to other students within the same section, there was no basis for knowing how the
performance of any section compared with that of other sections, and thus there was no
basis for differences between the sections in their satisfaction with expected grades. Con-
sistent with this suggestion, section-average expected grades indicated by students at the
time the ratings were collected did not differ significantly from one section to the next, and
were not significantly correlated with section-average performance on the final examina-
tion (even though individual expected grades within each section were). Since section-av-
erage expected grades at the time the ratings were collected did not vary, they could not
be the direct cause of higher student ratings that were positively correlated with student
performance, nor the indirect cause of the higher ratings as a consequence of increased stu-
dent satisfaction with higher grades. In most studies, where section-average expected
grades and section-average performance on the criterion measures are positively corre-
lated, the grading satisfaction hypothesis cannot be so easily countered.
Palmer et al. (1978) also compared validity and grading leniency hypotheses in a multi-
section validity study by relating section-average student learning (controlling for pre-test
data) and section-average grading leniency to student ratings. However, their study failed
to show a significant effect of either student learning or grading leniency on student rat-
ings. Potential problems with the study include the small number of sections (14) charac-
teristic of most multisection validity studies, and perhaps their operationalization of grad-
ing leniency as described below. Despite these problems, this study provides a
methodologically sophisticated approach to the analysis of multisection validity studies
that warrants further consideration.
Both the SEEQ and the Palmer et al. studies attempted to distinguish between the valid-
ity and the grading satisfaction (or grading leniency) hypotheses after controlling for initial
differences, but there were important differences in how this was accomplished. In the
SEEQ studies, due to the particular design of the study, there were no section-average dif-
ferences in expected grades at the time students completed their ratings and the authors ar-
gued that this eliminated the grading satisfaction hypothesis as a plausible explanation.
Palmer et al. used actual grades, instead of expected grades, as measured at approximately
the time when students completed the ratings. Since the grading satisfaction hypothesis
can only be explained in terms of expected grades the use of actual grades is dubious unless
it can be argued that actual and expected grades are virtually the same, that the basis of ac-
tual grades is the same for all sections, and that the relation between expected and actual
grades is the same for all sections. Hence it is recommended that expected grades, instead
of actual grades, should be the basis of such analyses in the future.
Palmer et al. further argued that grading leniency should be defined in terms of how ac-
tual grades differ from grades predicted on the basis of all pre-test variables and student
performance on their final test. This implies that the grading satisfaction hypothesis is due
to actual grades being higher or lower than predicted grades. However, this suggestion is
only plausible if it can be argued: (a) that actual grades are equivalent to expected grades
as described above; and (b) that predicted actual grades are equivalent to the grades that
students feel that they deserve. Palmer et al. make a similar point when they indicate that:
“What we are interested in, of course, is the students’ perceptions of instructor leniency”
(p. 858). Alternatively it may be plausible to argue that the grading satisfaction is based on
expected grades being higher or lower than those that students feel that they deserve.
However, even if students' grades (expected or actual) and test performance both reflect
teaching effectiveness such that no grading leniency exists according to the Palmer et al.
definition, there would be no guarantee that subsequent ratings reflected the teaching ef-
fectiveness instead of satisfaction with the grades. Hence, while the Palmer et al. approach
may reflect a superior definition of objectively defined grading leniency, it apparently does
not provide an adequate test of the grading satisfaction hypothesis.
In the two SEEQ studies, the grading satisfaction hypothesis was tested in terms of
expected grades without any correction for the relative contribution of the instructor. That
is, student satisfaction with higher-than-average grades, even higher grades due to more
effective teaching, may be the cause of higher-than-average student ratings. Palmer et al.
argued that grades should be corrected for pre-test scores. In the SEEQ studies there were
no significant differences among sections in terms of the pre-test scores or expected grades
collected at the start of the course, and so such a correction would have had little effect.
Palmer et al. also argued that their measure of grading leniency should be corrected for
final test performance as an indication of the contribution of the instructor. In the SEEQ
studies there were no differences among sections in uncorrected expected grades and so
this correction would have led to the problematic conclusion that grading leniency as de-
fined by Palmer et al. was negatively correlated with student evaluations (i.e. lower than
deserved expected grades are associated with higher student ratings). These results cast
further doubt on the Palmer et al. approach to testing the grading satisfaction hypothesis
in multisection validity studies.
Instructor Self-Evaluations
Validity paradigms in student evaluation research are often limited to a specialized set-
ting (for example, large multisection courses) or use criteria such as the retrospective rat-
ings of former students that are unlikely to convince skeptics. Hence, the validity of stu-
dent ratings will continue to be questioned until criteria are utilized that are both applic-
able across a wide range of courses and widely accepted as an indicator of teaching effec-
tiveness (see Braskamp et al., 1985 for further discussion). Instructors' self-evaluations of
their own teaching effectiveness constitute a criterion which satisfies both of these requirements.
Furthermore, instructors can be asked to evaluate themselves with the same instrument
used by their students, thereby testing the specific validity of the different rating factors.
Finally, instructor self-evaluations are not substantially correlated with a wide variety of
instructor background/demographic characteristics other than their enjoyment of teaching
and their liking of their subject (Doyle & Weber, 1978; also Marsh, 1984 and discussion in
Chapter 5).
Despite the apparent appeal of instructor self-evaluations as a criterion of effective
teaching, this criterion has had limited application. Centra (1973) found correlations of about 0.20 be-
tween faculty self-evaluations and student ratings, but both sets of ratings were collected
at the middle of the term as part of a study of the effect of feedback from student ratings
(see Chapter 8) rather than at the end of the course. Blackburn and Clark (1975) also re-
ported correlations of about 0.20, but they only asked faculty to rate their own teaching in
a general sense rather than their teaching in a specific class that was also evaluated by stu-
dents. In small studies with ratings of fewer than 20 instructors, correlations of 0.31 and
0.65 were reported by Braskamp et al. (1979) and 0.47 by Doyle and Crichton (1978). In
a study with 43 instructors, Howard et al. (1985) reported instructor self-evaluations to be
correlated 0.34 and 0.31 with ratings by students and former students, respectively. In
larger studies with ratings of 50 or more instructors, correlations of 0.62, 0.49, and 0.45
were reported by Webb and Nolan (1955), Marsh et al. (1979b), and Marsh (1982c).
Marsh (1982c; Marsh et al., 1979) conducted the only studies where faculty in a large
number of courses (81 and 329) were asked to evaluate their own teaching on the same
multifaceted evaluation instrument that was completed by students. In both studies: (a)
separate factor analyses of teacher and student responses identified the same evaluation
factors (see Table 2.2); (b) student-teacher agreement on every dimension was significant
(median r's of 0.49 and 0.45; Table 4.1); (c) mean differences between student and faculty
responses were small and not statistically significant for most items, and were unsystematic
when differences were significant (i.e. student ratings were higher than faculty self-evalua-
tions for some items but lower for others).
In MTMM studies, multiple traits (the student rating factors) are assessed by multiple
methods (student ratings and instructor self-evaluations). Consistent with the construct
validation approach discussed earlier, correlations (see Table 4.1 for MTMM matrix from
Marsh’s 1982 study) between student ratings and instructor self-evaluations on the same
dimension (i.e. convergent validities - median r's of 0.49 and 0.45) were higher than cor-
relations between ratings on nonmatching dimensions (median r’s of -0.04 and 0.02), and
this is taken as support for the divergent validity of the ratings. In the second study, sepa-
rate analyses were also performed for courses taught by teaching assistants, undergraduate
level courses taught by faculty, and graduate level courses. Support for both the conver-
gent and divergent validity of the ratings was found in each set of courses (see also Howard
et al., 1985).
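The logic of this comparison can be sketched in a few lines; the correlations and reliabilities below are hypothetical rather than the Table 4.1 values, and the correction for attenuation in the final step is one conventional adjustment for unreliability that is only assumed to correspond to the correction reported with that table.

```python
# Hedged MTMM sketch (hypothetical values): convergent validities (same factor,
# different methods) should exceed the correlations between nonmatching factors,
# and may be disattenuated using the two reliability (alpha) estimates.
from statistics import median

same_factor = {"Enthusiasm": 0.55, "Organization": 0.40, "Group Interaction": 0.50}
nonmatching = [0.08, -0.05, 0.02, 0.10, -0.01]   # heterotrait-heteromethod correlations
alphas = {"Enthusiasm": (0.92, 0.80),            # (student rating, self-evaluation)
          "Organization": (0.90, 0.75),
          "Group Interaction": (0.95, 0.88)}

convergent = median(same_factor.values())
discriminant = median(nonmatching)
print("median convergent validity:", convergent)            # 0.50
print("median nonmatching correlation:", discriminant)      # 0.02
print("convergent exceeds discriminant:", convergent > discriminant)

for factor, r in same_factor.items():
    a_student, a_self = alphas[factor]
    print(factor, "corrected for attenuation:", round(r / (a_student * a_self) ** 0.5, 2))
```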
In discussing instructor self-evaluations, Centra speculated that prior experience with
student ratings may influence self-evaluations (1975) and that instructors may lower their
self-evaluations as a consequence of having been previously evaluated by students so that
their ratings would be expected to be more consistent with student ratings (1979). If in-
structors were asked to predict how students would evaluate them, then Centra’s sugges-
tion might constitute an important methodological problem for self-evaluation studies.
However, both SEEQ studies specifically instructed the faculty to rate their own teaching
effectiveness as they perceived it even if they felt that their students would disagree, and
not to report how their students would rate them. Hence, the fact that most of the instruc-
tors in these studies had been previously evaluated does not seem to be a source of invalid-
ity in the interpretation of the results (see also Doyle, 1983). Furthermore, given that the
average of student ratings is a little over 4 on a 5-point response scale, if instructor self-
evaluations are substantially higher than student ratings before they receive any feedback
from student ratings as suggested by Centra, then faculty, on average, may have unrealisti-
cally high self-perceptions of their own teaching effectiveness. A systematic examination
of how instructor self-perceptions change, or do not change, as a consequence of student
feedback seems a fruitful area for further research.
This research on instructor self-evaluations has important implications. First, the fact that
students’ evaluations show significant agreement with instructor self-evaluations provides
a demonstration of their validity that is acceptable to most researchers, and this agreement
can be examined in nearly all instructional settings. Second, there is good evidence for the
validity of student ratings for both undergraduate and graduate level courses (Marsh,
1982c). Third, support for the divergent validity demonstrates the validity of each specific
rating factor as well as of the ratings in general, and argues for the importance of using sys-
tematically developed, multifactor evaluation instruments.
Ratings by Peers
Peer ratings, based on actual classroom visitation, are often proposed as indicators of ef-
fective teaching (Braskamp et al., 1985; Centra, 1979; Cohen & McKeachie, 1980; French-
Lazovik, 1981; see also Aleamoni, 1985), and hence a criterion for validating students'
evaluations. In studies where peer ratings are not based upon classroom visitation (for ex-
ample, Blackburn & Clark, 1975; Guthrie, 1954; Maslow & Zimmerman, 1956), ratings by
peers have correlated well with student ratings of university instructors, but it is likely that
peer ratings are based on information from students. Centra (1975) compared peer ratings
based on classroom visitation and student ratings at a brand new university, thus reducing
the probable confounding of the two sources of information. Three different peers
evaluated each teacher on two occasions, but there was a relative lack of agreement among
peers (mean r = 0.26), which brought into question their value as a criterion of effective
teaching and precluded any good correspondence with student ratings (r = 0.20).
Morsh et al. (1956) correlated student ratings, student achievement, peer ratings, and
supervisor ratings in a large multisection course. Student ratings correlated with achieve-
ment, supporting their validity. Peer and supervisor ratings, though significantly corre-
lated with each other, were not related to either student ratings or to achievement,
Table 4.1
Multitrait-Multimethod Matrix: Correlations Between Student Ratings and Faculty Self-Evaluations in 329 Courses (Reprinted with permission from Marsh,
1984b)
Factors 1-9 are instructor self-evaluations and factors 10-18 are student evaluations of the same nine SEEQ dimensions: Learning/Value, Enthusiasm, Organization, Group Interaction, Individual Rapport, Breadth, Examinations, Assignments, and Workload/Difficulty. The table body is an 18 x 18 lower-triangular correlation matrix; the diagonals of the two triangular (same-method) submatrices contain coefficient alpha reliabilities, and the diagonal of the square (student rating by instructor self-evaluation) submatrix contains the convergent validities described in the note.
Note. Values in parentheses in the diagonals of the upper left and lower right matrices, the two triangular matrices, are reliability (coefficient alpha) coefficients (see
Hull & Nie, 1981). The underlined values in the diagonal of the lower left matrix, the square matrix, are convergent validity coefficients that have been corrected for
unreliability according to the Spearman-Brown equation. The nine uncorrected validity coefficients, starting with Learning, would be .41, .48, .25, .46, .25, .37, .13, .36,
and .54. All correlation coefficients are presented without decimal points. Correlations greater than .10 are statistically significant.
suggesting that peer ratings may not have value as an indicator of effective teaching. Webb
and Nolan (1955) reported good correspondence between student ratings and instructor
self-evaluations, but neither of these indicators was positively correlated with supervisor
ratings (which the authors indicated to be like peer ratings). Howard et al. (1985) found
moderate correlations between student ratings and instructor self-evaluations, but ratings
by colleagues were not significantly correlated with student ratings, self-evaluations, or
the ratings of trained observers.
Other reviews of the peer evaluation process in higher education settings (for example,
Centra, 1979; Cohen & McKeachie, 1980; Braskamp et al., 1985; French-Lazovich, 1981)
have also failed to cite studies that provide empirical support for the validity of peer ratings
based on classroom visitation as an indicator of effective college teaching or as a criterion
for student ratings. Cohen and McKeachie (1980) and Braskamp et al. (1985) suggested
that peer ratings may be suitable for formative evaluation, but suggested that they may not
be sufficiently reliable and valid to serve as a summative measure. Murray (1980), in com-
paring student ratings and peer ratings, found peer ratings to be “(1) less sensitive, reli-
able, and valid; (2) more threatening and disruptive of faculty morale; and (3) more af-
fected by non-instructional factors such as research productivity” (p. 45) than student rat-
ings. Ward et al. (1981; also see Braskamp et al., 1985) suggested a methodological prob-
lem with the collection of peer ratings in that the presence of a colleague in the classroom
apparently affects the classroom performance of the instructor and provides a threat to the
external validity of the procedure. In summary, peer ratings based on classroom visitation
do not appear to be very reliable or to correlate substantially with student ratings or with
any other indicator of effective teaching. While these findings neither support nor refute
the validity of student ratings, they clearly indicate that the use of peer evaluations of uni-
versity teaching for personnel decisions is unwarranted (see Scriven, 1981 for further dis-
cussion).
ses. A total of 18 to 24 sets of observer reports were collected for each instructor. The me-
dian of single-rater reliabilities (i.e. the correlation between two sets of observational re-
ports) was 0.32, but the median reliability for the average response across the 18 to 24 re-
ports for each instructor was 0.77. Factor analysis of the observations revealed nine fac-
tors, and their content resembled factors in student ratings described earlier (for example,
Clarity, Enthusiasm, Interaction, Rapport, Organization). The observations significantly
differentiated among the three criterion groups of instructors, but were also modestly cor-
related with a set of background variables (for example, sex, age, rank, class size). Unfor-
tunately, Murray only considered student ratings on an overall instructor rating item, and
these were based upon ratings from a previous course rather than the one that was ob-
served. Hence, MTMM-type analyses could not be used to determine if specific observa-
tional factors were most highly correlated with matching student rating factors. The find-
ings do show, however, that instructors who are rated differently by students do exhibit
systematically different observable teaching behaviors.
Many observational studies focus on a limited range of teacher behaviors rather than the
broad spectrum of behaviors or a global measure of teaching effectiveness, and the study
of teacher clarity has been a particularly fruitful example. In field and correlational re-
search, observers measure (count or rate) clarity-related behaviors (see Land, 1985, for a
description of the types of behaviors) in natural classroom settings and these are related to
student achievement scores. In one such study, Land and Combs (1979) operationally de-
fined teacher clarity as the number of false starts or halts in speech, redundantly spoken
words, and tangles in words. More generally, teacher clarity is a term used to describe how
clearly a teacher explains the subject matter to students, and is frequently examined with
student evaluation instruments and with observational schedules. Teacher clarity vari-
ables are important because it has been shown that they can be reliably judged by students
and by external observers, they are consistently correlated with student achievement, and
they are amenable to both correlational and experimental designs (see Dunkin and
Barnes, 1986; Land, 1979, 1985; Rosenshine & Furst, 1973). In experimental settings, les-
son scripts are videotaped which differ only in the frequency of clarity-related behaviors,
and randomly assigned groups of subjects view different lectures and complete achieve-
ment tests. Most studies, whether employing correlational or experimental designs, focus
on the positive relation between observations of clarity variables and student achieve-
ment. Dunkin and Barnes (1986), and Rosenshine and Furst (1973) were particularly im-
pressed with the robustness of this effect and its generality across different instruments,
different raters, and different levels of education. While not generally the primary focus
in this area of research, some studies have collected student ratings of teaching effective-
ness and particularly ratings of teacher clarity in conjunction with other variables.
Land and Combs (1981; see also Land & Smith, 1981) constructed 10 videotaped lec-
tures for which the frequency of clarity-related behaviors was systematically varied to rep-
resent the range of these behaviors observed in naturalistic studies. Ten randomly assigned
groups of students each viewed one of the lectures, evaluated the quality of teaching on a
ten item scale, and completed an objective achievement examination. The ten group-aver-
age variables (i.e. the average of student ratings and achievement scores in each group)
were determined, and the experimentally manipulated occurrence of clarity-related be-
haviors was substantially correlated with both student ratings and achievement, while stu-
dent ratings and achievement were significantly correlated with each other. Student re-
sponses to the item ‘the teacher’s explanations were clear to me’ were most highly corre-
lated with both the experimentally manipulated clarity behaviors and results on the
achievement test. In an observational study, Hines et al. (1982) found that observer ratings
on a cluster of 29 clarity-related behaviors were substantially correlated with both student
ratings and achievement in college level math courses. In a review of such studies, Land
(1985) indicated that while clarity behaviors were significantly related to both ratings and
achievement, the correlations with ratings were significantly higher.
Research on teacher clarity, though not specifically designed to test the validity of stu-
dents’ evaluations, offers an important paradigm for student evaluation research. Teacher
clarity is evaluated by items on most student evaluation instruments, can be reliably ob-
served in a naturalistic field study, can be easily manipulated in laboratory studies, and is
consistently related to student achievement. Both naturalistic observations and experi-
mental manipulations of clarity-related behaviors are significantly correlated with student
ratings and with achievement, and student ratings of teacher clarity are correlated with
achievement. This pattern of findings supports the inclusion of clarity on student evalua-
tion instruments, demonstrates that student ratings are sensitive to natural and experimen-
tally manipulated differences in this variable, and supports the construct validity of the stu-
dent ratings with respect to this variable.
Systematic observations by trained observers are positively correlated with both stu-
dents’ evaluations and student achievement, even though research described in the last
section reported that peer ratings are not systematically correlated with either students’
evaluations or student achievement. A plausible reason for this difference lies in the relia-
bility of the different indicators. Class-average student ratings are quite reliable, but the
average agreement between ratings by any two students (i.e., the single rater reliability)
is generally in the 0.20s. Hence, it is not surprising that agreement between two peer vis-
itors who attend only a single lecture and respond to very general items is low. When ob-
servers are systematically trained and asked to rate the frequency of quite specific be-
haviors, and there is a sufficient number of ratings of each teacher by different observers,
then it is reasonable that their observations will be more reliable than peer ratings and
more substantially correlated with student ratings. However, further research is needed to
clarify this suggestion. For example, Howard et al. (1985) examined both external ob-
server ratings by trained graduate students and colleague ratings by untrained peers, but
found that neither was significantly correlated with the other, with instructor self-evalua-
tions, or with student ratings. However, both colleague and observer ratings were based
on two visits by only a single rater, both were apparently based on a similar rating instru-
ment, and the nature of the training given to the observers was not specified. While peer
ratings and behavioral observations have been considered as separate in the present arti-
cle, the distinction may not be so clear in actual practice; peers can be trained to estimate
the frequency of specific behaviors and some behavior observation schedules look like rat-
ing instruments.
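The reliability argument in the preceding paragraph can be made explicit with the Spearman-Brown formula for the average of k raters; the single-rater value of .25 used below is purely illustrative, chosen to be consistent with the 0.20s cited above rather than taken from any particular study:

\[
\rho_{kk} \;=\; \frac{k\,\rho_{11}}{1 + (k-1)\,\rho_{11}},
\qquad
\rho_{11}=.25:\quad \rho_{22}=.40, \qquad \rho_{25,25}\approx .89 .
\]

On these assumptions, two colleague visits yield an aggregate reliability of only .40, whereas the average over a class of 25 student raters approaches the reliabilities typically reported for class-average ratings.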
The agreement between multifaceted observation schedules and multiple dimensions of
students’ evaluations appears to be an important area for future research. However, a
word of caution must be noted. The finding that specific teaching behaviors can be reliably
observed and do vary from teacher to teacher does not mean that they are important.
Here, as with student ratings, specific behaviors and observational factors must also be re-
lated to external indicators of effective teaching. In this respect the simultaneous collec-
tion of several indicators of effective teaching is important, and the research conducted on
teacher clarity provides an important example.
Research Productivity
Teaching and research are typically seen as the most important products of university
faculty. Research helps instructors to keep abreast of new developments in their field and
to stimulate their thinking, and this in turn provides one basis for predicting a positive cor-
relation between research activity and students’ evaluations of teaching effectiveness.
However, Blackburn (1974) caricatured two diametrically opposed opinions about the di-
rection of the teaching/research relationship: (a) a professor cannot be a first rate teacher
if he/she is not actively engaged in scholarship; and (b) unsatisfactory classroom perform-
ance results from the professor neglecting teaching responsibilities for the sake of publica-
tions.
Marsh (1979; see also Centra, 1981, 1983; Feldman, 1987), in a review of 13 empirical
studies which mostly used student ratings as an indicator of teaching effectiveness, re-
ported that there was virtually no evidence for a negative relationship between effective-
ness in teaching and research; most studies found no significant relationship, and a few
studies reported weak positive correlations. Faia (1976) found no relationship at research-
oriented universities, but a small significant relationship where there was less emphasis on
research. Centra (1981, 1983), in perhaps the largest study of relations between teaching
effectiveness and research productivity (n = 4,596 faculty from a variety of institutions),
found weak positive correlations (median r = 0.22) between number of articles published
and students' evaluations of teaching effectiveness in the social sciences, but no correlation
in the natural sciences and humanities. Centra (1983) concluded that the teaching effective-
ness/research productivity relationship is nonexistent or modest, and found support for
neither the claim that research activity contributes to nor that it detracts from teaching ef-
fectiveness. Marsh and Overall (1979b; see Table 5.2 in Chapter 5) found that instructor
self-evaluations of their own research productivity were only modestly correlated with
their own self-evaluations of their teaching effectiveness (r’s between 0.09 and 0.41 for 9
SEEQ factors and two overall ratings), and were less correlated with student ratings (r’s
between 0.02 and 0.21). However, 4 of these 11 correlations with students' evaluations did
reach statistical significance, and the largest correlation was with student ratings of
Breadth of Coverage. The researchers reasoned that this dimension, which assesses
characteristics like ‘instructor adequately discussed current developments in the field’, was
the dimension most logically related to research activity. Linsky and Straus (1975) found
research activity was not correlated with students’ global ratings of instructors, but did cor-
relate modestly with student ratings of instructors’ knowledge (0.27). Frey (1978) found
that citation counts for more senior faculty in the sciences were significantly correlated
with student ratings of Pedagogical Skill (0.37), but not student ratings of Rapport (-0.23,
not significant). Frey emphasized that the failure to recognize the multifaceted nature of
students’ evaluations may mask consistent relationships, and this may account for the non-
significant relationships typically found.
Feldman (1987) recently completed a meta-analysis and substantive review of studies
that examined teaching effectiveness based on students’ evaluations and research produc-
tivity. Consistent with Marsh’s earlier 1979 review, Feldman found mostly near-zero or
slightly positive correlations. Using meta-analytic techniques he found an average correla-
tion of 0.12 (p < 0.001) across the 29 studies in his review. This correlation remained rela-
tively stable across different indicators of research productivity except for five studies
using index citation counts which were unrelated to teaching effectiveness. The relation
was relatively unaffected by academic rank and faculty age, but the relation was slightly
higher in humanities and social sciences than natural sciences.
Feldman found that correlations between research productivity and his 19 specific
dimensions of students’ evaluations (see Table 2.1) varied from 0 to 0.21. Correlations of
about 0.2 were found for dimensions 3, 5, and 9 in Table 2.1, whereas dimensions 2, 8, 13, 14,
16, 18, and 19 in Table 2.1 were not significantly related to research productivity. Consis-
tent with Marsh’s research, those dimensions most logically related to research productiv-
ity were more highly correlated with it, but the differences were small. However, none of the
dimensions was negatively correlated with research productivity, as had been implied by Frey (1978).
Ability, time spent, and reward structure are all critical variables in understanding the
teaching/research relationship. In a model developed to explain how these variables are
related (Figure 4.1) it is proposed that: (a) the ability to be effective at teaching and re-
search are positively correlated (a view consistent with the first opinion presented by
Blackburn); (b) time spent on research and teaching are negatively correlated (a view con-
sistent with the second opinion presented by Blackburn) and may be influenced by a re-
ward structure which systematically favors one over the other; (c) effectiveness, in both
teaching and research, is a function of both ability and time allocation; (d) the positive re-
lationship between abilities in the two areas and the negative correlation in time spent in
the two areas will result in little or no correlation in measures of effectiveness in the two
areas. In support of predictions b, c, and d, Jauch (1976) found that research effectiveness
was positively correlated with time spent on research and negatively correlated with time
spent on teaching, and that time spent on teaching and research were negatively correlated
with each other (see also Harry & Goldner, 1972). In his review, Feldman (1987) found
some indication that research productivity was positively correlated with time or effort de-
voted to research and, perhaps, negatively correlated with time or effort devoted to teach-
ing. However, he found almost no support for the contention that teaching effectiveness was
related at all to time or effort devoted to either research or teaching. Thus, whereas there
is some support for the model in Figure 4.1, important linkages were not supported and are
in need of further research.
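A small simulation may help to show why the model in Figure 4.1 predicts little or no correlation between the two effectiveness measures. The parameter values below are arbitrary illustrations, chosen only so that the ability correlation is positive and the time-allocation correlation is negative; they are not estimates from any study reviewed here.

```python
# Hedged simulation of the Figure 4.1 model (illustrative parameters only):
# teaching and research ability share a common component (positive correlation),
# time allocation trades off between the two activities (negative correlation),
# and effectiveness in each area is the sum of ability and time.
import random
random.seed(0)

def corr(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sxy / (sx * sy)

n = 5000
common = [random.gauss(0, 1) for _ in range(n)]
teach_ability = [0.7 * c + 0.7 * random.gauss(0, 1) for c in common]
res_ability = [0.7 * c + 0.7 * random.gauss(0, 1) for c in common]
time_teach = [random.gauss(0, 1) for _ in range(n)]
time_res = [-0.5 * t + 0.85 * random.gauss(0, 1) for t in time_teach]

teach_eff = [a + t for a, t in zip(teach_ability, time_teach)]
res_eff = [a + t for a, t in zip(res_ability, time_res)]

print("ability correlation:        ", round(corr(teach_ability, res_ability), 2))  # about  .5
print("time-allocation correlation:", round(corr(time_teach, time_res), 2))        # about -.5
print("effectiveness correlation:  ", round(corr(teach_eff, res_eff), 2))          # near zero
```

With these illustrative values the positive ability correlation and the negative time correlation combine to yield an effectiveness correlation close to zero, consistent with the near-zero relations summarized in the following paragraph.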
In summary, there appears to be a zero to low-positive correlation between measures of
research productivity and student ratings or other indicators of effective teaching, though
correlations may be somewhat higher for student rating dimensions which are most logi-
cally related to research effectiveness. While these findings seem neither to support nor re-
fute the validity of student ratings, they do demonstrate that measures of research produc-
tivity cannot be used to infer teaching effectiveness or vice versa.
Additional research not described here has considered other indicators of effective
teaching, but the criteria are idiosyncratic to particular settings, are insufficiently de-
scribed, or have only been considered in a few studies. For example, in a multisection val-
idity study, Marsh and Overall (1980) found that sections who rated their teacher most
highly were more likely to pursue further coursework in the area and to join the local com-
puter club (the course was an introduction to computer programming). Similarly,
McKeachie and Solomon (1958) and Boulton, Bonge and Marr (1979) found that students
Figure 4.1 Model of predicted relations among teaching-related and research-related variables (Reprinted with permission from Marsh, 1984b)
from introductory level courses that were taught by highly rated teachers were more likely
to enroll in advanced courses than students taught by poorly rated teachers. As described
in Chapter 3, Ory et al. (1980) found high correlations between student ratings and summa-
tive measures obtained from open-ended comments and a group interview technique,
though the ratings proved to be the most cost effective procedure. Marsh and Overall
(1979b) asked lecturers to rate how much they enjoyed teaching relative to their other duties
such as research, committees, etc. Instructor enjoyment of teaching was significantly and
positively correlated with students’ evaluations and instructor self-evaluations (see Table
5.2), and the highest correlations were with ratings of Instructor Enthusiasm.
Most researchers emphasize that teaching effectiveness should be measured from multi-
ple perspectives and with multiple criteria. Braskamp et al. (1985) identify four sources of
information for evaluating teaching effectiveness: students, colleagues, alumni, and in-
structor self-ratings. Ratings by students and alumni are substantially correlated, and rat-
ings by each of these sources appear to be moderately correlated with self-evaluations.
However, ratings by colleagues based on classroom observations do not seem to be
systematically related to ratings by the other three sources. Braskamp et al. recommend
that colleagues should be used to review classroom materials such as course syllabi, assign-
ments, tests, and texts, and Scriven (1981) suggests that such evaluations should be done by
staff from the same academic discipline from another institution in a trading of services ar-
rangement that eliminates costs. While this use of colleagues is potentially valuable, I
know of no systematic research that demonstrates the reliability or validity of such ratings.
Howard et al. (1985) contrasted SEEQ construct validity research that usually considers
only two or three indicators of teaching effectiveness for large samples with their study that
examined five different indicators in a single study for a small sample (43 classes). In this
study Howard et al. collected data from all four of the sources of information about teach-
ing effectiveness noted by Braskamp et al. (1985). They found that “former-students and
student ratings evidence substantially greater validity coefficients of teaching effectiveness
than do self-report, colleague and trained observer ratings” (p. 195), and that while self-
evaluations were modestly correlated with student ratings, colleague and observer ratings
were not correlated with each other or with any other indicators. While their findings are
in basic agreement with SEEQ research and other research reviewed here, the inclusion
of such a variety of measures within a single study is an important contribution.
Researchers seem less concerned about the validity of information that is given to in-
structors for formative purposes such as feedback for the improvement of teaching effec-
tiveness. This perspective may be justified, pending the outcome of further research, since
there are fewer immediate consequences and legal implications. Nevertheless, even for
formative purposes, the continued use of any sort of information about teaching effective-
ness is not justified unless there is systematic research that supports its validity and aids in
its interpretation. The implicit assumption that instructors will be able to separate valid
and useful information from that which is not when evaluating formative feedback, while
administrators will be unable to make this distinction when evaluating summative mater-
ial, seems dubious.
Too little attention has been given to a grading leniency or grading satisfaction effect in
multisection validity research even though such an effect has been frequently posited as a
bias to students’ evaluations of teaching effectiveness (see Chapter 5). The SEEQ and the
Palmer et al. (1978) studies provide an important basis for further research. The design of
the SEEQ research disrupted the normal positive correlation between section-average ex-
pected grades and examination performance so that these variables were not significantly
correlated, but this design characteristic may limit the generality of the results and this ap-
proach. The particular approach used by Palmer et al. was apparently flawed, but their
general analytic strategy - particularly if based on student ratings of expected and de-
served grades instead of, or in addition to, actual grades and predicted grades - may
prove to be useful. Further research, particularly if able to resolve these issues, will have
important implications for multisection validity studies and will also provide a basis for the
application of a construct validation approach to the simultaneous study of validity and
bias as emphasized in Chapter 5.
Instructor self-evaluations of teaching effectiveness have been used in surprisingly few
large-scale studies. There is logical and empirical support for their use as a criterion of ef-
fective teaching, they can be collected in virtually all instructional settings, they are rela-
tively easy and inexpensive to collect, and they have important implications for the study
of the validity of students’ evaluations of teaching effectiveness and for the effect of poten-
tial sources of bias. Hence, this seems to be an important area for further research.
Murray’s 1983 study of observations of classroom behaviors provided an important de-
monstration of the potential value of ratings by external observers. Given the systematic
development of the classroom observation procedures in his study, it was unfortunate that
Murray’s measure of students’ evaluations of teaching effectiveness was so weak. How-
ever, the success of his study and this apparent problem point to an obvious direction for
further research.
Teacher clarity research provides a model for a new and potentially important paradigm
for validating student ratings. This potential is particularly exciting because it may bring
together findings from naturalistic field studies and more controlled laboratory settings.
While discussion in this monograph has focused on clarity, other specific behaviors can be
experimentally manipulated and the effects of these manipulations tested on student rat-
ings and other indicators of teaching effectiveness. Furthermore, if multiple behaviors are
manipulated within the same study, the paradigm can be used to test the discriminant and
convergent validity of student ratings. This type of approach to the study of students’
evaluations is not new, but its previous application has been limited primarily to the study
of potential biases to student ratings as in the Dr. Fox effect (see Chapter 6) and in the
study of semantic similarities in rating items as the basis for the robustness of the SEEQ
factor structure (see Chapter 3). In fact, findings from each of these alternative applica-
tions of this approach were interpreted as supporting the validity of students’ evaluations
for purposes of this monograph.
Marsh and Overall (1980) distinguished between cognitive and affective criteria of effec-
tive teaching, arguing for the importance of affective outcomes as well as cognitive out-
comes. Those findings indicate that cognitive and affective criteria need not be substan-
tially correlated, and appear to be differentially related to different student evaluation
rating components. Cognitive criteria have typically been limited to student learning as
measured in the multisection validity paradigm, and there are problems with such a narrow
definition. In contrast, affective criteria have been defined as anything that seems to be
noncognitive, and there are even more problems with such an open-ended definition.
Further research is needed to define more systematically what is meant by affective
criteria, perhaps in terms of the affective domains described elsewhere (for example,
Krathwohl et al., 1964), to operationally define indicators of these criteria, and to relate
these to multiple student rating dimensions (Abrami, 1985). The affective side of effective
teaching has not been given sufficient attention in student evaluation research or, perhaps,
in the study of teaching in general.
The disproportionate amount of attention given to the narrow definition of teaching ef-
fectiveness as student learning has stifled research on a wide variety of criteria that are ac-
ceptable in the construct validation approach. While this broader approach to validation
will undoubtedly provide an umbrella for dubious research as suggested by Doyle (1983),
it also promises to bring new vigor and better understanding to the study of students’
evaluations of teaching effectiveness. In particular, there is a need for studies that consider
many different indicators of teaching effectiveness in a single study (for example, Elliot,
1950; Howard et al., 1985; Morsh et al., 1956; Webb & Nolan, 1955).
Finally, the example provided by student evaluation research will hopefully stimulate
the systematic study of other indicators of effective teaching. Research reviewed in this
chapter has focused on the construct validation of student ratings of teaching effectiveness,
but the review suggests that insufficient attention has been given to the construct valida-
tion of other indicators of teaching effectiveness.
CHAPTER 5
RELATIONSHIP TO BACKGROUND
CHARACTERISTICS: THE WITCH HUNT FOR
POTENTIAL BIASES IN STUDENTS’ EVALUATIONS
The construct validity of students’ evaluations requires that they be related to variables
that are indicative of effective teaching, but relatively uncorrelated with variables that are
not (i.e. potential biases). Since correlations between student ratings and other indicators
of effective teaching rarely approach their reliability, there is considerable residual var-
iance in the ratings that may be related to potential biases. Furthermore, faculty generally
believe that students’ evaluations are biased by a number of factors which they believe to
be unrelated to teaching effectiveness (e.g., Ryan, Anderson and Birchler, 1980). In a
survey conducted at a major research university where SEEQ was developed (Marsh &
Overall, 1979b) faculty were asked which of a list of 17 characteristics would cause bias to
student ratings, and over half the respondents cited course difficulty (72 percent), grading
leniency (68 percent), instructor popularity (63 percent), student interest in subject be-
fore course (62 percent), course workload (60 percent), class size (60 percent), reason for
taking the course (55 percent), and student’s GPA (53 percent). In the same survey faculty
indicated that some measure of teaching quality should be given more emphasis in person-
nel decisions than was presently the case and that student ratings provided useful feedback
to faculty. A dilemma existed in that faculty wanted teaching to be evaluated, but were
dubious about any procedure to accomplish this purpose. They were skeptical about the
accuracy of student ratings for personnel decisions but were even more critical of class-
room visitation, self-evaluations and other alternatives. Whether or not potential biases
actually impact student ratings, their utilization will be hindered so long as faculty think
they are biased.
Marsh and Overall (1979b) also asked instructors to consider the special circumstances
involved in teaching a particular course (for example, class size, content area, students’ in-
terest in the subject, etc.) and to rate the “ease of teaching this particular course”. These
ratings of ease-of-teaching (see Table 5.1) were not significantly correlated with any of
the student rating factors, and were nearly uncorrelated with instructor self-evaluations.
Scott (1977) also asked instructors to indicate which, if any, ‘extenuating circumstances’
(e.g., class size, class outside area of competence, first time taught the course, as well as
an ‘other’ category) would adversely affect students’ evaluations. The only extenuating
circumstance that actually affected a total score representing the students’ evaluations
was class size, and this effect was small. These two studies suggest that extenuating cir-
cumstances which lecturers think might adversely affect students’ evaluations in a particu-
lar course apparently have little effect, and also support earlier conclusions that the par-
ticular course has relatively little effect on students’ evaluations compared to the effect of
the lecturer who teaches the course (see Chapter 3).
Table 5.1
Background Characteristics: Correlations With Student Ratings (S) and Faculty Self-Evaluations (F) of Their Own Teaching Effectiveness (undergraduate courses; reprinted with permission from Marsh, 1984b)
[Columns are the SEEQ factors: Learning, Enthusiasm, Organization, Group Interaction, Individual Rapport, Breadth, Exams, Assignments, Workload, Overall Course, and Overall Instructor. The correlation values themselves are not legible in this copy and are not reproduced here.]
Several large studies have looked at the multivariate relationship between a comprehen-
sive set of background characteristics and students’ evaluations. In such research it is im-
portant that similar variables not be included both as items on which students rate teaching
effectiveness and as background characteristics, particularly when reporting some sum-
mary measure of variance explained. For example, Price and Magoon (1971) found that 11
background variables explained over 20 percent of the variance in a set of 24 student rating
items. However, variables that most researchers would consider to be part of the evalua-
tion of teaching (e.g. Availability of Instructor, Explicitness of Course Policies) were con-
sidered as background characteristics and contributed to the variance explained in the stu-
dent ratings. Similarly, in a canonical correlation relating a set of class characteristics to a
set of student ratings, Pohlman (1975) found that over 20 percent of the variance in five
student rating items (i.e. the redundancy statistic described by Cooley & Lohnes, 1976)
could be explained by background characteristics. However, the best predicted student
rating item was Course Difficulty and it was substantially correlated with the conceptually
similar background characteristics of hours spent outside of class and expected grades.
Other multivariate studies have been more careful to separate variables considered as
part of the students’ evaluations and background characteristics. Brandenburg et al. (1977)
found that 27 percent of the variance in an average of student rating items could be
explained by a set of 11 background characteristics, but that most of the variance could be
explained by expected grade and, to a lesser extent, whether the course was an elective or
required. Brown (1976) found that 14 percent of the variance in an average of student rat-
ing items could be explained, but that expected grade accounted for the most variance.
Burton (1975) showed that eight background items explained 8-15 percent of the variance
in instructor ratings over a seven-semester period, but that the most important variable
was student enthusiasm for the subject. Centra and Creech (1976) found that four class-
room teacher variables (class size, years teaching, teaching load, and subject area) and
their interactions accounted for less than 3 percent of the variance in overall instructor rat-
ings. Stumpf, Freedman, and Aguanno (1979) found that background variables accounted
for very little of the variance in student ratings after the effects of expected grades (which
they reported to account for about 5 percent of the variance) had been controlled.
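The incremental-variance logic that runs through these studies can be illustrated with a small computational sketch. The following Python fragment uses hypothetical class-average data and variable names (it is not a reanalysis of Brandenburg et al., Brown, Burton, Centra and Creech, or Stumpf et al.): it reports the variance in an overall rating explained by expected grades alone, and the additional variance explained when the remaining background characteristics are added.

    # Illustrative sketch only: hypothetical class-average data, not a
    # reanalysis of any study reviewed above.
    import numpy as np

    rng = np.random.default_rng(0)
    n_classes = 200

    # Hypothetical class-average background characteristics.
    expected_grade = rng.normal(3.0, 0.4, n_classes)
    prior_interest = rng.normal(3.5, 0.5, n_classes)
    class_size = rng.lognormal(3.5, 0.6, n_classes)
    pct_elective = rng.uniform(0.0, 1.0, n_classes)

    # Hypothetical class-average overall rating, influenced mainly by prior
    # interest and expected grades.
    rating = (3.5 + 0.4 * prior_interest + 0.3 * expected_grade
              - 0.001 * class_size + rng.normal(0, 0.4, n_classes))

    def r_squared(y, predictors):
        # Proportion of variance in y explained by a linear model.
        X = np.column_stack([np.ones(len(y))] + predictors)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

    r2_grades = r_squared(rating, [expected_grade])
    r2_all = r_squared(rating, [expected_grade, prior_interest, class_size, pct_elective])

    print(f"R^2, expected grades alone:          {r2_grades:.3f}")
    print(f"R^2, full set of characteristics:    {r2_all:.3f}")
    print(f"Increment after controlling grades:  {r2_all - r2_grades:.3f}")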
A few studies have considered both multiple background characteristics and multiple di-
mensions of students’ evaluations. Marsh (1980b) found that a set of 16 background
characteristics explained about 13 percent of the variance in the set of SEEQ dimensions.
However, the amount of variance explained varied from more than 20 percent in the Over-
all Course rating and the Learning/Value dimension, to about 2 percent in the Organiza-
tion and Individual Rapport dimensions. Four background variables were most important
and could account for nearly all the explained variance; more favorable ratings were corre-
lated with higher prior subject interest, higher expected grades, higher levels of Workload/
Difficulty, and a higher percentage of students taking the course for General Interest
Only. A path analysis (see Table 5.2) demonstrated that prior subject interest had the
strongest impact on student ratings, and that this variable also accounted for about one-
third of the relationship between expected grades and student ratings. Marsh (1983) de-
monstrated a similar pattern of results in five different sets of courses (one of which was
the set of courses used in the 1980 study) representing diverse academic disciplines at the same university.
Table 5.2
Path Analysis Model Relating Prior Subject Interest, Reason for Taking Course, Expected Grade and Workload/Difficulty to Student Ratings (reprinted with permission from Marsh, 1984b)
[Rows are the SEEQ factors (Learning/Value, Enthusiasm, Organization, Group Interaction, Individual Rapport, Breadth, Exams/Grading, Assignments, Overall Course, Overall Instructor); for each of the four predictor variables the table reports the original correlation (Orig r), the Direct Causal coefficient (DC), and the Total Causal coefficient (TC). The individual coefficients are not cleanly legible in this copy. The variance components (note a), grouped by predictor in the order given in the title, are: prior subject interest 2.9%, 5.1%, 5.3%; reason for taking course 2.3%, 1.5%, 1.8%; expected grade 4.5%, 2.6%, 4.0%; Workload/Difficulty 3.6%, 3.6%, 1.8% (for Orig r, DC, and TC respectively).]
Note. The methods of calculating the path coefficients (p values in Figure 5.1), Direct Causal Coefficients (DC), and Total Causal Coefficients (TC) are described by Marsh (1980a). Orig r = original student rating correlation. See Figure 5.1 for the corresponding path model.
a Calculated by summing the squared coefficients, dividing by the number of coefficients, and multiplying by 100%.
[Figure 5.1. Path analysis model relating prior subject interest, reason for taking course (General Interest), expected grade, and Workload/Difficulty to student ratings. Path coefficients for the student rating factors appear in Table 5.2; reprinted with permission from Marsh, 1984b. Figure not reproduced.]
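Note (a) to Table 5.2 describes a simple calculation that is easy to verify. The sketch below applies it to the first column of coefficients as printed in the original table (the original correlations of prior subject interest with the ten SEEQ factors, to two decimal places; the column alignment is inferred here), and it reproduces the 2.9% variance component reported there.

    # Variance-component calculation from note (a) to Table 5.2: mean squared
    # coefficient expressed as a percentage. The coefficients below are the
    # original correlations of prior subject interest with the ten SEEQ factors
    # as printed in the source table (column alignment inferred).
    def variance_component(coefficients):
        squared = [c ** 2 for c in coefficients]
        return 100.0 * sum(squared) / len(squared)

    prior_interest_orig_r = [0.36, 0.17, -0.04, 0.21, -0.05, -0.07, -0.05, 0.11, 0.23, 0.12]
    print(f"Variance component: {variance_component(prior_interest_orig_r):.1f}%")
    # Prints 2.9%, matching the value reported in the table.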
The finding that a set of background characteristics are correlated with students’ evalua-
tions of teaching effectiveness should not be interpreted to mean that the ratings are
biased, though this conclusion is often inferred by researchers. Support for a bias hypo-
thesis, as with the study of validity, must be based on a construct approach. This approach
requires that the background characteristics that are hypothesized to bias students’ evalua-
tions be examined in studies which are relatively free from methodological flaws using dif-
ferent approaches, and interpreted in relation to a specific definition of bias. Despite the
huge effort in this area of student-evaluation research, such a systematic approach is rare.
Perhaps more than any other area of student-evaluation research, the search for potential
sources of bias is extensive, confused, contradictory, misinterpreted, and methodologi-
cally flawed. In the subsections which follow, methodological weaknesses common to
many studies are presented, theoretical definitions of bias are discussed, and alternative
approaches to the study of bias are considered. The purpose of these subsections is to pro-
vide guidelines for evaluating existing research and for conducting future research. Fi-
nally, within this context, relationships between students’ evaluations and specific charac-
teristics frequently hypothesized to bias student ratings are examined.
Important and common methodological problems in the search for potential biases to
students’ evaluations include the following.
(1) Using correlation to argue for causation-the implication that some variable biases
student ratings argues that causation has been demonstrated, whereas correlation only im-
plies that a concomitant relation exists.
(2) Neglect of the distinction between practical and statistical significance-all conclu-
sions should be based upon some index of effect size as well as on tests of statistical signifi-
cance.
(3) Failure to consider the multivariate nature of both student ratings and a set of poten-
tial biases.
(4) Selection of an inappropriate unit of analysis. Since nearly all applications of
students’ evaluations are based upon class-average responses, this is nearly always the ap-
propriate unit of analysis. The size and even the direction of correlations based on class-
average responses may be different from correlations obtained when the analysis is per-
formed on responses by individual students. Hence, effects based on individual students
as the unit of analysis must also be demonstrated to operate at the class-average level (a small numerical sketch of this point follows the list).
(5) Failure to examine the replicability of findings in a similar setting and their
generalizability to different settings - this is particularly a problem in studies based on
small sample sizes or on classes from a single academic department at a single institution.
(6) The lack of an explicit definition of bias against which to evaluate effects-if a var-
iable actually affects teaching effectiveness and this effect is accurately reflected in student
ratings, then the influence is not a bias.
(7) Questions of the appropriateness of experimental manipulations - studies that
attempt to simulate hypothesized biases with operationally defined experimental manipu-
lations must demonstrate that the size and nature of the manipulation and the observed ef-
fects are representative of those that occur in natural settings (i.e. they must examine
threats to the external validity of the findings).
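A minimal sketch of the unit-of-analysis problem noted in point (4), using hypothetical data rather than any of the studies cited here: the same grade/rating data can yield a weak correlation across individual students and a much stronger correlation across class averages (the reverse, and even sign reversals, can also occur).

    # Unit-of-analysis sketch: the same hypothetical data analysed at the
    # individual-student level and at the class-average level.
    import numpy as np

    rng = np.random.default_rng(1)
    n_classes, n_students = 50, 30

    class_quality = rng.normal(0, 1, n_classes)   # hypothetical teaching quality per class

    grade_lists, rating_lists = [], []
    for c in range(n_classes):
        # Within a class, individual grades vary widely; ratings track the
        # class's teaching quality plus personal idiosyncrasy.
        grade_lists.append(3.0 + 0.3 * class_quality[c] + rng.normal(0, 1.0, n_students))
        rating_lists.append(3.5 + 0.6 * class_quality[c] + rng.normal(0, 1.0, n_students))

    all_grades = np.concatenate(grade_lists)
    all_ratings = np.concatenate(rating_lists)
    mean_grades = np.array([g.mean() for g in grade_lists])
    mean_ratings = np.array([r.mean() for r in rating_lists])

    r_individual = np.corrcoef(all_grades, all_ratings)[0, 1]
    r_class = np.corrcoef(mean_grades, mean_ratings)[0, 1]
    print(f"Correlation across individual students: {r_individual:.2f}")
    print(f"Correlation across class averages:      {r_class:.2f}")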
An important problem in research that examines the effect of potential biases to stu-
dents’ evaluations is that adequate definitions of bias have not been formulated. The mere
existence of a significant correlation between students’ evaluations and some background
characteristic should not be interpreted as support for a bias hypothesis. Even if a back-
ground characteristic is causally related to students’ evaluations, there is insufficient evi-
dence to support a bias hypothesis. For example, it can be plausibly argued that many of
the validity criteria discussed earlier, the alternative indicators of effective teaching such
as student learning and experimental manipulations of teacher clarity, are causally related
to students’ evaluations, but it makes no sense to argue that they bias students’ evalua-
tions. Support for a bias hypothesis must be based on a theoretically defensible definition
of what constitutes a bias. Alternative definitions of bias, which are generally implicit
rather than explicit, are described below.
One possible example, the ‘simplistic bias hypothesis’, is that if an instructor: (a) gives
students high grades; (b) demands little work of students; and (c) agrees to be evaluated
in small classes only; then he or she will be favorably evaluated on all rating items. Implicit
is the assumption that instructors will be rewarded on the basis of these characteristics
rather than on effective teaching. The studies by Marsh (1980b, 1983, 1984) clearly refute
this simplistic hypothesis. The clarity of the SEEQ factor structure demonstrates that stu-
dents differentiate their responses on the basis of more than just global impressions, so that
potential biases, if they do have an effect, will affect different rating dimensions differen-
tially. No background variable was substantially correlated with more than a few SEEQ
factors, and each showed little or no correlation with some of the SEEQ factors. The per-
centage of variance that could be explained in different dimensions varied dramatically.
Furthermore, the direction of the Workload/Difficulty effect was opposite to that pre-
dicted by the hypothesis, while the class size effect was small for dimensions other than
Group Interaction and Individual Rapport. Most importantly, the entire set of back-
ground variables, ignoring the question of whether or not any of them represent biases,
was able to explain only a small portion of the variance in student ratings.
The ‘simplistic bias hypothesis’ represents a strawman and its rejection does not mean
that student ratings are unbiased, but only that they are not biased according to this con-
ceptualization of bias. More rigorous and sophisticated definitions of bias are needed. A
more realistic definition that has guided SEEQ research is that student ratings are biased
to the extent that they are influenced by variables that are unrelated to teaching effective-
ness, and, perhaps, to the extent that this influence generalizes across all rating factors
rather than being specific to the particular factors most logically related to the influence.
For example, even though student learning in multisection validity studies is correlated
with student ratings, this effect should not be considered a ‘bias’. However, this seemingly
simple and intuitive notion of bias is difficult to test. It is not sufficient to show that some
variable is correlated with student ratings and that a causal interpretation is warranted; it
must also be shown that the variable is not correlated with effective teaching. This is dif-
ficult in that effective teaching is a hypothetical construct so that all the problems involved
in trying to show that student ratings are valid come into play, and trying to ‘prove’ a null
hypothesis is always problematic. According to this definition of bias, most claims that stu-
dents’ evaluations are biased by any particular characteristics are clearly unwarranted (see
Feldman, 1984, for further discussion).
Other researchers infer yet another definition of bias by arguing that ratings are biased
to the extent that they are affected by variables that are not under the control of the in-
structor. According to this conceptualization, ratings must be ‘fair’ to be unbiased, even
to the extent of not accurately reflecting influences that do affect teaching effectiveness.
Such a definition is particularly relevant to a variable like prior subject interest that prob-
ably does affect teaching effectiveness in a way which is accurately reflected by student rat-
ings (see discussion below and Marsh & Cooper, 1981). Ironically, this conceptualization
would not classify a grading leniency effect (i.e. students giving better-than-deserved rat-
ings to instructors as a consequence of instructors giving better-than-deserved grades to
students) as a bias, since this variable is clearly under the control of the instructor. Hence,
while the issue of fairness is important, particularly when students’ evaluations are to be
used for personnel decisions, this definition of bias also seems to be inadequate. While
there is a need for further clarification of the issues of bias and fairness, it is also important
to distinguish between these two concepts so that they are not confused. The ‘fairness’ of
students’ evaluations needs to be examined separately from, or in addition to, the exami-
nation of their validity and susceptibility to bias.
Still other researchers (for example, Hoyt et al., 1973; Brandenburg et al., 1977; see also
Howard and Bray, 1979) seem to circumvent the problem of defining bias by statistically
controlling for potential biases with multiple regression techniques or by forming norma-
tive (cohort) groups that are homogeneous with respect to potential biases (e.g., class
size). However, underlying this procedure is the untested assumption that the variables
being controlled are causally related to student ratings, and that the relationship does rep-
resent a bias. For example, if inexperienced, less able teachers are systematically assigned
to teach large introductory classes, then statistically removing the effect of class size is not
appropriate. Furthermore, this approach is predicated on the existence of a theoretical de-
finition of bias and offers no help in deciding what constitutes a bias. Thus, while this pro-
cedure may be appropriate and valuable in some instances, it should only be used cauti-
ously, and in conjunction with research findings that demonstrate that a variable does con-
stitute a bias according to a theoretically defensible definition of bias or fairness.
Over a decade ago, McKeachie (1973) argued that student ratings could be better under-
stood if researchers did not concentrate exclusively on trying to interpret background rela-
tionships as biases, but instead examined the meaning of specific relationships. Following
this orientation, several approaches to the study of background influences have been
utilized. The most frequently employed approach is simply to correlate class-average stu-
dents’ evaluations with a class-average measure of a background variable hypothesized to
bias student ratings. Such an approach can be heuristic, but in isolation it can never be used
to demonstrate a bias. Instead, hypotheses generated from these correlational studies
should be more fully explored in further research using alternative approaches such as
those described below.
One alternative approach (Bausell & Bausell, 1979; Marsh, 1982a) is to examine the re-
lationship between differences in background variables and differences in student ratings
for two or more offerings of the same course taught by the same instructor. The rationale
here is that since the instructor is the single most important determinant of student ratings,
the within-instructor comparison provides a more powerful analysis. Marsh found that for
pairs of courses the more favorably evaluated offering was correlated with: (a) higher ex-
pected grades, and presumably better mastery since grades were assigned by the same in-
structor to all students in the same course; (b) higher levels of Workload/Difficulty; and (c)
the instructor having taught the course at least once previously (and presumably having be-
nefited from that experience and the student ratings). Other background characteristics
such as class size, reason for taking a course, and prior subject interest had little effect.
While providing valuable insight, this approach is limited by technical difficulties involved
in comparing sets of difference scores, by the lack of variance in difference scores repre-
senting both student ratings and the background characteristics (i.e. if there is little var-
iance in the difference scores, then no relationship can be shown), and by difficulties in the
interpretation of the difference scores.
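A minimal sketch of this within-instructor difference-score strategy, with hypothetical paired offerings rather than the Marsh (1982a) or Bausell and Bausell (1979) data: for each course taught twice by the same instructor, the difference in a background variable (here, class-average expected grades) is correlated with the difference in the class-average overall rating.

    # Within-instructor difference-score sketch: paired offerings of the same
    # course by the same instructor, with hypothetical class-average data.
    import numpy as np

    rng = np.random.default_rng(2)
    n_pairs = 100

    grades_1 = rng.normal(3.0, 0.3, n_pairs)
    grades_2 = grades_1 + rng.normal(0.05, 0.15, n_pairs)      # modest shifts between offerings
    ratings_1 = 3.5 + 0.5 * (grades_1 - 3.0) + rng.normal(0, 0.3, n_pairs)
    ratings_2 = 3.5 + 0.5 * (grades_2 - 3.0) + rng.normal(0, 0.3, n_pairs)

    grade_diff = grades_2 - grades_1
    rating_diff = ratings_2 - ratings_1

    r_diff = np.corrcoef(grade_diff, rating_diff)[0, 1]
    print(f"Correlation of grade differences with rating differences: {r_diff:.2f}")
    # If grade_diff or rating_diff had little variance, this correlation would be
    # uninterpretable; that is one of the limitations noted in the text.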
A second approach is to isolate a specific variable, simulate the variable with an experi-
mental manipulation, and examine its effect in experimental studies where students are
randomly assigned to treatment conditions. The internal validity (see Campbell & Stanley,
1973, for a discussion of internal and external threats to validity) of interpretations is
greatly enhanced since many counter explanations that typically exist in correlational
studies can be eliminated. However, this can only be accomplished at the expense of many
threats to the external validity of interpretations: the experimental setting or the manipu-
lation may be so contrived that the finding has little generality to the actual application of
student ratings; the size of the experimental manipulation may be unrealistic; the nature
of the variable in question may be seriously distorted in its ‘operationalization’, and effects
shown to exist when the individual student is the unit-of-analysis may not generalize when
the class-average is used as the unit-of-analysis. Consequently, while the results of such
studies can be very valuable, it is still incumbent upon the researcher to explore the exter-
nal validity of the interpretations and to demonstrate that similar effects exist in real set-
tings where student ratings are actually employed.
A third approach, derived from the construct validation emphasis, is based upon the as-
sumption that specific variables (for example, background characteristics, validity criteria,
experimental manipulations, etc.) should logically or theoretically be related to some
specific components of students’ evaluations, and less related to others. According to this
approach, if a variable is most highly correlated with the dimensions to which it is most log-
ically connected, then the validity of the ratings is supported. For example, class size is sub-
stantially correlated with ratings of Group Interaction and Individual Rapport but not with
other SEEQ dimensions (Marsh et al., 1979a; see discussion below). This pattern of find-
ings argues for the validity of the student ratings. Many relationships can be better under-
stood from this perspective rather than from trying to support or refute the existence of a
bias that impacts all student ratings.
A related approach, that has guided SEEQ research, is more closely tied to an earlier
definition of bias. This approach is based upon the assumption that a ‘bias’ that is specific
to student ratings should have little impact on other indicators of effective teaching. If a
variable is related both to student ratings and to other indicators of effective teaching, then
the validity of the ratings is supported. Employing this approach, Marsh asked instructors
in a large number of classes to evaluate their own teaching effectiveness with the same
SEEQ form used by their students, and the SEEQ factors derived from both groups were
correlated with background characteristics. Support for the interpretation of a bias in this
situation requires that some variable be substantially correlated with student ratings, but
not with instructor self-evaluations of their own teaching (also see Feldman, 1984). Of
course, even when a variable is substantially correlated with both student and instructor
self-evaluations, it is still possible that the variable biases both student ratings and instruc-
tor self-evaluations, but such an interpretation requires that the variable is not substan-
tially correlated with yet other valid indicators of effective teaching. Also, when the pat-
tern of correlations between a specific variable and the set of student evaluation factors is
similar to the pattern of correlations with faculty self-evaluation factors, there is further
support for the validity of the student ratings. Results based on this and the other ap-
proaches will be presented below in the discussion of the effects of specific background
characteristics.
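The logic of comparing correlational profiles across the two sources can be sketched as follows; all of the correlations in the fragment are hypothetical placeholders, not the Table 5.1 values. A variable that is substantially correlated with several student-rating factors but essentially uncorrelated with the corresponding instructor self-evaluation factors would be a candidate bias, whereas similar profiles across the two sources argue for validity.

    # Compare the correlational profile of one background variable (class size
    # in this example) with student-rating factors and with instructor
    # self-evaluation factors. All correlations are hypothetical placeholders.
    import numpy as np

    factors = ["Learning", "Enthusiasm", "Organization", "Group Interaction",
               "Individual Rapport", "Breadth", "Exams", "Assignments"]

    r_with_student_ratings = np.array([-0.05, -0.02, 0.00, -0.30, -0.25, 0.02, -0.04, -0.03])
    r_with_self_evaluations = np.array([-0.03, 0.00, 0.01, -0.27, -0.22, 0.03, -0.02, -0.05])

    # Similarity of the two profiles (correlation between the two sets of correlations).
    profile_similarity = np.corrcoef(r_with_student_ratings, r_with_self_evaluations)[0, 1]
    print(f"Similarity of the two profiles: {profile_similarity:.2f}")

    for name, rs, rf in zip(factors, r_with_student_ratings, r_with_self_evaluations):
        flag = "candidate bias?" if abs(rs) > 0.20 and abs(rf) < 0.10 else ""
        print(f"{name:20s} students {rs:+.2f}   self {rf:+.2f}   {flag}")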
A fourth, infrequently used approach is to derive relations between student ratings and
external characteristics on the basis of explicit theory. Empirical support for the predic-
tions then supports the validity of both the theory and the measures used to test the theory.
Such an approach, depending on the nature of the theory and the definition of bias, may
also demonstrate that a particular set of relations between student ratings and background
characteristics do not constitute a bias. In one application of this approach,
Neumann and Neumann (1985) used Biglan’s (1973) model of subject matter in different
academic disciplines (e.g., soft/hard, pure/applied, and life/nonlife) and its relation to the
role of teaching in the different disciplines to predict discipline differences in student rat-
ings.
Hundreds of studies have used a variety of approaches to examine the influence of many
background characteristics on students’ evaluations of teaching effectiveness, and a com-
prehensive review is beyond the scope of this monograph. Many of the older studies may
be of questionable relevance, and may also have been inaccurately described. Reviewers,
apparently relying on secondary sources, have perpetuated these inaccurate descriptions
and faulty conclusions (some findings commonly cited in ‘reviews’ are based upon older
studies which did not even consider the variable they are cited to have examined - see
Marsh, 1980a, for examples). Empirical findings in this area have been reviewed in an ex-
cellent series of articles by Feldman (1976a, 1976b, 1977, 1978, 1979, 1983, 1984), other re-
view papers by Aubrecht (1981), Marsh (1983, 1984), and McKeachie (1973, 1979), monog-
raphs by Braskamp et al. (1985), Centra (1979; Centra & Creech, 1976) and Murray
(1980), and a chapter by Aleamoni (1981). Older reviews by Costin et al. (1971), Kulik and
McKeachie (1975), and the annotated bibliography by de Wolf (1974) are also valuable.
Results summarized below emphasize the description and explanation of the mul-
tivariate relationships that exist between specific background characteristics and multiple
dimensions of student ratings. This is a summary of findings based upon some of the most
frequently studied and/or the most important background characteristics, and of different
approaches to understanding the relationships. In this section the effects of the five back-
ground characteristics that have been most extensively examined in SEEQ research are
examined: class size; workload/difficulty; prior subject interest; expected grades; and the
reason for taking a course.
Class Size
Marsh reviewed previous research on the class size effect and examined correlations be-
tween class size and SEEQ dimensions (Marsh et al., 1979a; Marsh, 1980b, 1983). Class
size was moderately correlated with Group Interaction and Individual Rapport (nega-
tively, r’s as large as -0.30), but not with other SEEQ dimensions or with the overall rat-
ings of course or instructor (absolute values of r’s < 0.15). There was also a significant
nonlinear class size effect in which small and very large classes were evaluated
more favorably. However, since the majority of class sizes occur in the range where the
class size effect is negative, the correlation based on the entire sample of classes was still
slightly negative. These findings also appeared when instructor self-evaluations were con-
sidered; the pattern and magnitude of correlations between instructor self-evaluations of
their own teaching effectiveness and class size were similar to findings based upon student
ratings (Marsh et al., 1979a; also see Table 5.1). The specificity of the class size effect to
dimensions most logically related to this variable, and the similarity of findings based on
student ratings and faculty self-evaluations argue that this effect is not a ‘bias’ to student
ratings; rather, class size does have a moderate effect on some aspects of effective teaching
(primarily Group Interaction and Individual Rapport) and these effects are accurately re-
flected in the student ratings. This discussion of the class size effect clearly illustrates why
students’ evaluations cannot be adequately understood if their multidimensionality is ig-
nored (also see Feldman, 1984; Frey, 1978).
Feldman (1984; see also Feldman, 1978) conducted the most extensive review of rela-
tions between class size and students’ evaluations, and the results of his review are reason-
ably consistent with SEEQ research. Feldman noted that most studies find very weak
negative relations, but that the size of the relation is stronger for instructional dimensions
pertaining to the instructor’s interactions and interrelationships with students and that
some studies report a roughly U-shaped nonlinear relation. He also noted the possibility
that the pattern of results, particularly those in SEEQ research, may be interpreted to
mean that the class size effect may represent a valid influence that is accurately reflected
in student ratings. In one of the largest and perhaps the most broadly representative
studies of the class size effect, Centra and Creech (1976; see also Centra, 1979) also found
a clearly curvilinear effect in which classes in the 35-100 range received the lowest ratings,
whereas larger and smaller classes received higher ratings.
Superficially, the U-shaped relation between class size and student ratings appears to
contradict Glass’s well-established conclusion that teaching effectiveness, inferred from
achievement indicators or from affective measures, suffers when class size increases (e.g.,
Glass et al., 1981, pp. 35-43; Smith & Glass, 1980). However, Glass also found a non-
linear class-size effect that he summarized as a logarithmic function where nearly all the
negative effect occurred between class sizes of 1 and 40, and he did not present data for ex-
tremely large class sizes (e.g., several hundred students). Within the range of class sizes re-
ported by Glass, the class size/student evaluation relationship found in SEEQ research
could also be fit by a logarithmic relationship, and the increase in students’ evaluations did
not occur until class size was 200 or more. However, the suggestion that teaching effective-
ness in these extremely large classes may not suffer, or is even superior, has very important
implications, since offering just a few of these very large classes can free an enormous
amount of instructional time that can be used to substantially reduce the average class size
in the range where the class size effect is negative. However, Marsh (Marsh et al., 1979a;
also see Centra, 1979; Feldman, 1984) argued that this correlational effect should be inter-
preted cautiously and he speculated that the unexpectedly higher ratings for very large
classes could be due to: (a) the selection of particularly effective instructors with de-
monstrated success in such settings; (b) students systematically selecting classes taught by
particularly effective instructors, thereby increasing class size; (c) an increased motivation
for instructors to do well when teaching particularly large classes; and (d) the development
of ‘large class’ techniques instead of trying to use inappropriate, ‘small class’ techniques
that may produce lower ratings in moderately large classes. Clearly this is an area that war-
rants further research.
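The curvilinear relation described above can be probed with simple alternative trend fits to class-average data. The sketch below uses hypothetical ratings constructed to dip in the middle of the class-size range (it is not SEEQ or Centra and Creech data) and compares linear, logarithmic, and quadratic class-size trends.

    # Compare linear, logarithmic, and quadratic class-size trends in
    # class-average ratings. The data are hypothetical, constructed so that
    # ratings are lowest for classes of roughly 35-100 students.
    import numpy as np

    rng = np.random.default_rng(3)
    size = rng.integers(10, 400, 300).astype(float)
    dip = np.exp(-((np.log(size) - np.log(60)) ** 2) / 0.8)
    rating = 3.8 - 0.4 * dip + rng.normal(0, 0.15, 300)

    def r_squared(y, predictors):
        X = np.column_stack([np.ones(len(y))] + predictors)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

    print(f"Linear in class size:      R^2 = {r_squared(rating, [size]):.3f}")
    print(f"Linear in log(class size): R^2 = {r_squared(rating, [np.log(size)]):.3f}")
    print(f"Quadratic in class size:   R^2 = {r_squared(rating, [size, size ** 2]):.3f}")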
Prior Subject Interest
Marsh (Marsh & Cooper, 1981) reviewed previous studies of the prior subject interest
effect, as did Feldman (1977) and Howard and Maxwell (1980), and examined its effect on
SEEQ ratings by students (see also Marsh, 1980b, 1983) and by faculty. The effect of prior
subject interest on SEEQ scores was greater than that of any of the 15 other background
variables considered by Marsh (1980b, 1983). In different studies prior subject interest was
consistently more highly correlated with Learning/Value (r’s about 0.4) than with any
other SEEQ dimensions (r’s between 0.3 and -0.12; Table 5.2). Instructor self-evalua-
tions of their own teaching were also positively correlated with both their own and their
students’ perceptions of students’ prior subject interest (see Table 5.1). The self-evalua-
tion dimensions that were most highly correlated with prior subject interest, particularly
Learning/Value, were the same as with student ratings. The specificity of the prior subject
interest effect to dimensions most logically related to this variable, and the similarity of
findings based on student ratings and faculty self-evaluations argue that this effect is not a
‘bias’ to student ratings. Rather, prior subject interest is a variable that influences some
aspects of effective teaching, particularly Learning/Value, and these effects are accurately
reflected in both the student ratings and instructor self-evaluations. Higher student in-
terest in the subject apparently creates a more favorable learning environment and facili-
tates effective teaching, and this effect is reflected in student ratings as well as faculty self-
evaluations.
Prior subject interest, as inferred in SEEQ research, is based on students’ retrospective
ratings collected at the end of the course. There may be methodological problems with
prior subject interest ratings collected either at the start or end of a course. At the start stu-
dents may not know enough about the course content to evaluate their interest, and it is
possible that high ratings of prior subject interest at the end of a course confound prior sub-
ject interest before the course with interest in the subject generated by the instructor. Con-
sequently, Howard and Schmeck (1979) asked students to rate their desire to take a course
at both the start and the end of courses. They found that precourse responses were strongly
correlated with end-of-course responses, and that both indicators of prior subject interest
had similar patterns of correlations with other rating items. They concluded that their find-
ings supported the use of responses collected at the end of courses to measure precourse
interest.
Marsh and Cooper (1981) also asked faculty to evaluate students’ prior subject interest
at the end of the course as well as to evaluate their own teaching effectiveness. Instructor
evaluations of students’ prior subject interest and students’ ratings of their own prior sub-
ject interest were substantially correlated and each showed a similar pattern of results to
both student rating dimensions and instructor self-evaluation rating dimensions. Thus, an
alternative indicator of prior subject interest, one not based on student ratings, provides
additional support for the generality of the findings.
Prior subject interest apparently influences student ratings in a way that is validly re-
flected in student ratings, and so the influence should not be interpreted as a bias to student
ratings. However, to the extent that the influence is inherent to the subject matter that is
being taught, it may represent a source of ‘unfairness’ when ratings are used for personnel
decisions. Whereas student ratings are primarily a function of the teacher who teaches a
course rather than the course being taught, prior subject interest is more a function of the
course than the teacher (see Table 3.2). If further research confirms that prior subject in-
terest is largely determined by the course rather than the instructor and that this compo-
nent of prior subject interest influences students’ evaluations, then it may be appropriate
to use normative comparisons or cohort groups to correct for this effect. Such an approach
is only justified for general courses that are taught by many different instructors so that the
course effect can be accurately determined. For specialised courses that are nearly always
taught by the same instructor, it may be impossible to disentangle the effects of course and
instructor. Nevertheless, this suggestion represents an important area for further research.
Workload/Difficulty
al., 1975; Office of Evaluation Services, 1972; Pohlman, 1972). Pohlman (1975) also re-
ported small but statistically significant relations between hours outside of class and stu-
dent ratings. Schwab (1976) reported that the effect of perceived difficulty on student rat-
ings was positive after controlling for the effects of other background variables such as ex-
pected grades. Thus, the SEEQ results do appear to be consistent with other findings.
Expected Grades
One of the most consistent findings in student evaluation research is that class-average
expected (or actual) grades are modestly correlated with class-average students’ evalua-
tions of teaching effectiveness. The critical issue is how this relation should be interpreted.
Alternative Interpretations
Studies based upon SEEQ, and literature reviews (e.g., Centra, 1979; Feldman, 1976a;
Marsh et al., 1976) have typically found class-average expected grades to be positively cor-
related with student ratings. There are, however, three quite different explanations for
this finding (see discussion in Chapter 4). The grading leniency hypothesis proposes that
instructors who give higher-than-deserved grades will be rewarded with higher-than-de-
served student ratings, and this constitutes a serious bias to student ratings. The validity
hypothesis proposes that better expected grades reflect better student learning, and that a
positive correlation between student learning and student ratings supports the validity of
student ratings. The student characteristic hypothesis proposes that preexisting student
presage variables such as prior subject interest may affect student learning, student grades,
and teaching effectiveness, so that the expected grade effect is spurious. While these ex-
planations of the expected grade effect have quite different implications, it should be
noted that grades, actual or expected, must surely reflect some combination of student
learning, the grading standards employed by an instructor, and preexisting presage vari-
ables.
Marsh (1980b, 1983) examined the relationships among expected grades, prior subject
interest, and student ratings in a path analysis (also see Aubrecht, 1981; Feldman, 1976a).
Across all rating dimensions, nearly one-third of the expected grade effect could be
explained in terms of prior subject interest. Since prior subject interest precedes expected
grades, a large part of the expected grade effect is apparently spurious, and this finding
supports the student characteristic hypothesis. Marsh, however, interpreted the results as
support for the validity hypothesis in that prior subject interest is likely to impact student
performance in a class, but is unlikely to affect grading leniency. Hence, support for the
student characteristics hypothesis may also constitute support for the validity hypothesis;
prior subject interest produces more effective teaching which leads to better student learn-
ing, better grades, and higher evaluations. This interpretation, however, depends on a de-
finition of bias in which student ratings are not ‘biased’ to the extent that they reflect var-
iables which actually influence effectiveness of teaching.
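The decomposition described in this paragraph can be sketched with standardized regressions on hypothetical class-average data (this is an illustration of the logic, not the Marsh, 1980b, 1983 analysis): the zero-order expected-grade/rating correlation is split into a direct component and a component transmitted through prior subject interest.

    # Sketch of the path-analytic decomposition of the expected grade effect.
    # All data are hypothetical; prior interest is assumed to precede expected
    # grades, which in turn precede the end-of-course rating.
    import numpy as np

    rng = np.random.default_rng(4)
    n = 500

    def standardize(x):
        return (x - x.mean()) / x.std()

    prior_interest = rng.normal(0, 1, n)
    expected_grade = 0.4 * prior_interest + rng.normal(0, 1, n)
    rating = 0.35 * prior_interest + 0.2 * expected_grade + rng.normal(0, 1, n)

    z_pi, z_eg, z_r = map(standardize, (prior_interest, expected_grade, rating))

    r_total = np.corrcoef(z_eg, z_r)[0, 1]          # zero-order grade/rating correlation
    X = np.column_stack([np.ones(n), z_pi, z_eg])
    beta, *_ = np.linalg.lstsq(X, z_r, rcond=None)  # standardized partial coefficients
    direct = beta[2]                                # direct path from expected grade
    via_interest = r_total - direct                 # part shared with prior subject interest

    print(f"Zero-order correlation:               {r_total:.2f}")
    print(f"Direct component:                     {direct:.2f}")
    print(f"Component via prior subject interest: {via_interest:.2f} "
          f"({100 * via_interest / r_total:.0f}% of the total)")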
In a similar analysis, Howard and Maxwell (1980; also see Howard & Maxwell, 1982),
found that most of the covariation between expected grades and class-average overall rat-
ings was eliminated by controlling for prior student motivation and student progress rat-
ings. In their path analysis, prior student motivation had a causal impact on expected
grades that was nearly the same as reported in SEEQ research and a causal effect on over-
all ratings which was even larger, while the causal effect of expected grades on student rat-
ings was smaller than that found in SEEQ research. They concluded that “the influence of
student motivation upon student performance, grades, and satisfaction appears to be a
more potent contributor to the covariation between grades and satisfaction than does the
direct contaminating effect of grades upon student satisfaction” (p. 818).
Faculty Self-Evaluations
Marsh and Overall (1979b) examined correlations among student ratings and instructor
self-evaluations of teaching effectiveness, student ratings of expected grades, and teacher
self-evaluations of their own ‘grading leniency’ (see Table 5.1). Correlations between ex-
pected grades and student ratings were positive and modest (r’s between 0.01 and 0.28) for
all SEEQ factors except Group Interaction (r = 0.38) and Workload/Difficulty (r =
-0.25). Correlations between expected grades and faculty self-evaluations were close to
zero (r’s between -0.11 and 0.11) except for Group Interaction (r = 0.17) and Workload/
Difficulty (r = -0.19). Correlations between faculty self-perceptions of their own ‘grading
leniency’ (on an ‘easy/lenient grader’ to ‘hard/strict grader’ scale) with both student and
teacher evaluations of effective teaching were small (r’s between -0.16 and 0.19) except
for ratings of Workload/Difficulty (r’s of 0.26 and 0.28) and faculty self-evaluations of
Examinations/Grading (r = 0.32). In a separate study Marsh (1976) also reported small,
generally nonsignificant correlations between faculty self-evaluations of their grading le-
niency and student ratings, but found that ‘easy’ graders received somewhat (significantly)
lower overall course and Learning/Value ratings. The small correlations between grading le-
niency and student ratings, and the similarity in the pattern of correlations between ex-
pected grades and ratings by students and by faculty, seem to argue against the interpreta-
tion of the expected grade effect as a bias. Nevertheless, the fact that expected grades were
more positively correlated with student ratings than with faculty self-evaluations may pro-
vide some support for a grading leniency bias.
Marsh (Marsh et al., 1975; Marsh & Overall, 1980; see also Chapter 4) examined class-
average pre-test scores, expected grades, student achievement, and student ratings in mul-
tisection validity studies described earlier. Students selected classes in a quasi-random
fashion, and pre-test scores on an achievement test, motivational variables, and student
background variables were also collected at the start of the study. While this set of pre-test
variables was able to predict course performance with reasonable accuracy for individual
students, section-average responses to them were similar. Also, in each study, students
knew how their performance compared with other students within the same section, but
not how the average performance of their section compared with that of other sections.
Primarily as a consequence of this feature of the study, class-average expected grades,
which were collected along with student ratings shortly before the final examination, did
not differ significantly from section to section. Hence, the correlation between examina-
tion performance and student ratings could only be interpreted as support for the validity
hypothesis, and was not due to either preexisting variables or a grading leniency effect. It
is ironic that when researchers propose class-average grades (expected or actual) as a po-
tential bias to student ratings, a positive correlation between ratings and grades is nearly
always interpreted as a grading leniency effect, while a positive correlation between
grades, as reflected in examination performance, and ratings in multisection validity
studies is nearly always interpreted as an indication of validity; both interpretations are
usually viable in both situations. Again, it must be cautioned that support for the validity
hypothesis found here does not deny the appropriateness of other interpretations in other
situations.
Palmer et al. (1978) also compared the validity and grading leniency hypotheses in a
multisection validity study by relating class-average student learning (controlling for pre-
test data) and grading leniency with student ratings. However, their study failed to show
a significant effect of either student learning or grading leniency, and thus provides no sup-
port for either hypothesis. Nevertheless, their study showed that moderately large differ-
ences among instructors on grading leniency had no effect on student ratings (but see earl-
ier discussion of this study in Chapter 4).
Some researchers have argued that the expected grade effect can be better examined by
randomly assigning students to different groups which are given systematically different
grade expectations. For example, Holmes (1972) gave randomly assigned groups of stu-
dents systematically lower grades than they expected and deserved, and found that these
students evaluated the teaching effectiveness as poorer than did control groups. While this
type of research is frequently cited as evidence for a grading leniency effect, this conclusion
is unwarranted. First, Holmes’s manipulation accounted for no more than 8 percent of the
variance in any of the evaluation items and much less variance across the entire set of rat-
ings, and reached statistical significance for only 5 of 19 items (Did instructor have suffi-
cient evidence to evaluate achievement; Did you get less than expected from the course;
Clarity of exam questions; Intellectual stimulation; and Instructor preparation). Hence
the size of the effect was small, was limited to a small portion of the items, and tended to
be larger for items that were related to the specific experimental manipulation. Second,
the results based upon ‘rigged’ grades that violate reasonable grade expectations may not
generalize to other settings and seem to represent a different variable than that examined
in naturalistic settings.
Powell (1977) examined the influence of grading standards of a single course taught by
the same instructor on five different occasions. While the effect of grading standards was
modestly related to ratings, the possible non-equivalence of the different classes and the
possible experimenter expectancy effects dictate that the results be interpreted cautiously
(see discussion by Abrami et al., 1980). Chacko (1983) and Vasta and Sarmiento (1979)
applied different grading standards in two sections of the same course, and found that the
more stringently graded courses were evaluated more poorly on some items. In both these
studies, students in the different sections were similar before the introduction of the grad-
ing manipulation. However, since students in different sections of the same course may
talk to each other, the nature of the experimental manipulation may have been known to
the students and may have created an effect beyond that of different grading standards in
naturalistic settings. If students in one section knew they were being graded on a harsher
grading standard than students in the other section, then the generality of the manipula-
tion may be suspect. Also, the use of a single course, a single instructor who knew the pur-
pose of the study (at least in the Chacko study), and the use of only two intact classes as
the basis of random assignment further attenuates the generalizability of these two studies.
Snyder and Clair (1976), instead of relying on intact classes, examined grading standards
in a laboratory study in which students were randomly assigned to groups that varied in
terms of the grades they were led to expect (A, B or C) and the grades they actually re-
ceived (A, B or C). Students were exposed to a brief tape-recorded lecture, after which
they completed a quiz and were given a grade that was randomly determined to be higher,
lower or the same as they had been led to expect and then evaluated the teaching effective-
ness. While the obtained grade was positively correlated with student ratings, there was
apparently a substantial interaction between expected and actual grades (the interaction
effect was not appropriately examined but is visually quite apparent). In fact, for the three
groups of students who actually received the grades they were led to expect - A, B or C
respectively-grades had little or no effect and the ‘A’ students actually gave the instruc-
tor the lowest ratings. This suggests that the violation of grade expectations may affect rat-
ings more than the actual grades in experimental studies in which grade expectations are
violated (see also Murray, 1980).
Abrami et al. (1980) conducted what appears to be the most methodologically sound
study of the effects of experimentally manipulated grading standards on students’ evalua-
tions. After reviewing previous research they described two ‘Dr. Fox’ type experiments
(see Chapter 6) in which grading standards were experimentally manipulated. Groups of
students viewed a videotaped lecture, rated teacher effectiveness, and completed an ob-
jective exam. Students returned two weeks later when they were given their examination
results and a grade based on their actual performance but scaled according to different
grading standards (i.e., an ‘average’ grade earning a B, C+, or C). The subjects then
viewed a similar videotaped lecture by the same instructor, again evaluated teacher effec-
tiveness, and took a test on the content of the second lecture. The manipulation of grading
standards had no effect on performance on the second achievement test and weak, inconsistent
effects on student ratings. There were also other manipulations (e.g., instructor expres-
siveness, content, and incentive), but the effect of grading standards accounted for no
more than 2 percent of the variance in student ratings for any of the conditions, and failed
to reach statistical significance in some. Not even the direction of the effect was consistent
across conditions, and stricter grading standards occasionally resulted in higher ratings.
These findings fail to support the contention that grading leniency produces an effect that
is of practical significance, though the external validity of this interpretation may also be
questioned.
Other Approaches
Marsh (1982a) compared differences in expected grades with differences in student rat-
ings for pairs of offerings of the same course taught by the same instructor on two different
occasions. He reasoned that differences in expected grades in this situation probably rep-
resent differences in student performance, since grading standards are likely to remain
constant, and differences in prior subject interest were small and relatively uncorrelated
with differences in student ratings. He found even in this context that students in the more
favorably evaluated course tended to have higher expected grades, which argued against
the grading leniency hypothesis. It should be noted, however, that while this study is in a
setting where differences due to grading leniency are minimized, there is no basis for con-
tending that the grading leniency effect does not operate in other situations. Also, the in-
terpretation is based on the untested assumption that differences in expected grades re-
flected, primarily, differences in student performance rather than differences in the grad-
ing standards by the instructor.
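The arithmetic behind this paired-offerings design can be illustrated with a short computational sketch. The Python fragment below is purely illustrative (the data and variable names are invented, not taken from Marsh, 1982a): for each instructor-course pair taught on two occasions, it correlates the change in class-average expected grade with the change in class-average overall rating.

# Hypothetical sketch of the paired-offerings logic in Marsh (1982a):
# for each instructor-course pair taught on two occasions, correlate the
# difference in class-average expected grade with the difference in
# class-average overall rating. Data below are invented for illustration.
import numpy as np

# Each row: (expected grade T1, expected grade T2, rating T1, rating T2)
offerings = np.array([
    [3.1, 3.4, 6.8, 7.3],
    [2.8, 2.7, 6.1, 5.9],
    [3.5, 3.3, 7.5, 7.2],
    [3.0, 3.2, 6.5, 6.9],
    [2.6, 2.9, 5.8, 6.4],
])

grade_diff  = offerings[:, 1] - offerings[:, 0]   # change in class-average expected grade
rating_diff = offerings[:, 3] - offerings[:, 2]   # change in class-average overall rating

r = np.corrcoef(grade_diff, rating_diff)[0, 1]
print(f"correlation of differences: r = {r:.2f}")

A positive correlation of differences in this design is the pattern consistent with the validity hypothesis, since grading standards are held roughly constant within instructor-course pairs.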
Peterson and Cooper (1980) compared students’ evaluations of the same instructors by
students who received grades and those who did not. The study was conducted at two col-
leges where students were free to cross-enroll, but where students from one college were
assigned grades but those from the other were not. Class-average ratings were determined
separately for students in each class who received grades and those who did not, and there
was substantial agreement with evaluations by the two groups of students. Hence, even
though class-average grades of those students who received grades were correlated with
their class-average evaluations and showed the expected grade effect, their class-average
evaluations were in substantial agreement with those of students who did not receive
grades. This suggests that expected grade effect was not due to grading leniency, since
grading leniency was unlikely to affect ratings by students who did not receive grades.
Class-average expected grades are about equally determined by the instructor who
teaches a course and by the course that is being taught (see Table 3.2). I know of no re-
search that has attempted to relate these separate components to student ratings, but such
an approach may be fruitful. The component of expected grades due to the course may be
more strongly influenced by presage variables whereas the component due to the instruc-
tor may be more strongly influenced by grading standards and differences in learning at-
tributable to the instructor. The substantial course effect demonstrates that much of the
variance in class-average expected grades is not due to differences in grading standards
that are idiosyncratic to individual instructors.
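A minimal sketch of how such a decomposition might look is given below. It assumes a balanced, hypothetical data set in which every instructor teaches every course once, which is far simpler than the actual SEEQ archive, and all values are invented.

# Minimal sketch (hypothetical data) of separating class-average expected
# grades into an instructor component and a course component, assuming a
# balanced design in which every instructor teaches every course once.
import numpy as np

# Rows = instructors, columns = courses; cell = class-average expected grade.
grades = np.array([
    [3.2, 2.9, 3.4],
    [2.8, 2.6, 3.1],
    [3.5, 3.0, 3.6],
    [3.0, 2.7, 3.2],
])

grand = grades.mean()
instructor_means = grades.mean(axis=1)
course_means = grades.mean(axis=0)

n_instructors, n_courses = grades.shape
ss_total      = ((grades - grand) ** 2).sum()
ss_instructor = n_courses * ((instructor_means - grand) ** 2).sum()
ss_course     = n_instructors * ((course_means - grand) ** 2).sum()
ss_residual   = ss_total - ss_instructor - ss_course

print(f"instructor component: {ss_instructor / ss_total:.1%} of variance")
print(f"course component:     {ss_course / ss_total:.1%} of variance")
print(f"residual:             {ss_residual / ss_total:.1%} of variance")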
Summary
In this summary of research about the expected grade effect, a modest but not unimpor-
tant correlation between class-average expected grades and student ratings has consis-
tently been reported. There are, however, several alternative interpretations of this find-
ing, which were labeled the grading leniency hypothesis, the validity hypothesis, and the
student characteristics hypothesis. Evidence from a variety of different types of research
clearly supports the validity hypothesis and the student characteristics hypothesis, but
does not rule out the possibility that a grading leniency effect operates simultaneously.
Support for the grading leniency effect was found with some experimental studies, but
these effects were typically weak and inconsistent, may not generalize to nonexperimental
settings where student ratings are actually used, and in some instances may be due to the
violation of grade expectations that students had falsely been led to hold or that were
applied to other students in the same course. Consequently, while it is possible that a grad-
ing leniency effect may produce some bias in student ratings, support for this suggestion is
weak and the size of such an effect is likely to be insubstantial in the actual use of student
ratings.
Courses are often classified as being an elective or a required course, but preliminary
SEEQ research indicated that this dichotomy may be too simplistic. Typically students:
are absolutely required to take a few specific courses in their major; select from a narrow
range of courses in their major that fulfill major requirements; select from a very wide
range of courses that fulfill general education or breadth requirements; occasionally take
specific courses or select from a range of courses in a related field that fulfill major require-
ments; and take some courses for general interest. Hence, on SEEQ students indicate one
of the following as the reason why they took the course: (a) major requirement; (b) major
elective; (c) general interest; (d) general education requirement; (e) minor/related field;
or (f) other. In the analysis of reason for taking a course, the percentages of students indi-
cating each of the first five reasons were considered to be a subset of variables, and the
total contribution of the entire subset and of each separate variable in the subset was con-
sidered. Marsh (1980) found that all SEEQ factors tended to be positively correlated with
the percentage of students taking a course for general interest and as a major elective, but
tended to be negatively correlated with the percentage of students taking a course as a
major requirement or as a general education requirement. However, after controlling for
the effects of the rest of the 16 background characteristics, general interest was the only
reason to have a substantial effect on ratings and it accounted for most of the variance that
could be explained by the subset of five reasons. The percentage of students taking a
course for general interest was also one of the four background variables selected from the
set of 16 as having the largest impact on student ratings and included in path analyses
(Marsh, 1980, 1983) described earlier (see Table 5.2 and Figure 5.1).
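The following sketch illustrates, with invented data, how the contribution of the whole subset of reasons and of each separate reason might be assessed as increments to R-squared beyond the remaining background characteristics. The variable names, weights, and group sizes are hypothetical and are not taken from the SEEQ analyses.

# Hypothetical sketch of assessing the contribution of a subset of predictors
# (here, percentages of students giving each reason for taking the course) as
# the increment in R^2 over the remaining background characteristics.
import numpy as np

rng = np.random.default_rng(0)
n = 200                                        # classes (illustrative)
other_background = rng.normal(size=(n, 11))    # e.g., the other background variables
reasons = rng.normal(size=(n, 5))              # % for each of the five reasons
rating = (0.4 * reasons[:, 2]                  # 'general interest' given most weight here
          + other_background @ rng.normal(scale=0.2, size=11)
          + rng.normal(size=n))

def r_squared(X, y):
    """R^2 from an ordinary least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

r2_base = r_squared(other_background, rating)
r2_full = r_squared(np.column_stack([other_background, reasons]), rating)
print(f"increment for the whole subset: {r2_full - r2_base:.3f}")

labels = ["major req.", "major elective", "general interest",
          "gen. ed. req.", "minor/related field"]
for i, label in enumerate(labels):
    r2_one = r_squared(np.column_stack([other_background, reasons[:, [i]]]), rating)
    print(f"increment for {label:>20s}: {r2_one - r2_base:.3f}")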
Marsh (1980, 1983) consistently found the percentage taking a course for general in-
terest to be positively correlated with each of the SEEQ factors in different academic dis-
ciplines. However, the sizes of the correlations were modest, usually less than 0.20, and
the effect of this variable was smaller than that of the other three variables (prior subject
interest, expected grades, and workload/difficulty) considered in his path analyses. The
correlations were somewhat larger for Learning/Value, Breadth of Coverage, Assign-
ments, Organization and overall course ratings than for the other SEEQ dimensions, but
only the correlations with Breadth of Coverage were as large as or larger than those of the
other variables considered in the path analysis.
Other researchers have typically compared elective courses with required courses, or
have related the percentage of students taking a course as an elective (or a requirement)
to student ratings, and either of these approaches may not be directly comparable to
SEEQ research. Large empirical studies have typically found that a course’s electivity is
positively correlated with student ratings (e.g., Brandenburg et al., 1977; Pohlman, 1975; but
also see Centra & Creech, 1976), and these findings are also consistent with Feldman’s 1978
review. Thus, these generalizations appear to be consistent with the SEEQ research.
Marsh (1976, 1983) examined the relations between a wide variety of background
characteristics and student ratings, but concluded that most of the variance in students’ evaluations that could
be accounted for by the entire set could be explained by those characteristics discussed
above. The effects of other characteristics, though much smaller, are considered briefly
below. A few additional characteristics were examined in particular SEEQ studies (e.g.,
the faculty self-evaluation studies) that were not available for the large scale studies, and
these are also discussed below. Finally, the results are compared with the findings of other
investigators, particularly those summarized in Feldman’s set of review articles.
SEEQ research has found that teaching assistants receive lower ratings than regular
faculty for most rating dimensions and overall rating items, but that they may receive
slightly higher ratings for Individual Rapport and perhaps Group Interaction (e.g., Marsh,
1976, 1980; Marsh & Overall, 1979b). Marsh and Overall (1979b) found this same pattern
in a comparison of self-evaluations by teaching assistants and self-evaluations by regular fa-
culty. Large empirical studies by Centra and Creech (1976) and by Brandenburg et al.
(1977) and Feldman’s 1983 review also indicate that teaching assistants tend to receive
lower evaluations than do other faculty (though Feldman also reported some exceptions).
Once teaching assistants are excluded from the analysis, relations between rank and stu-
dent ratings are much smaller in SEEQ research. There is almost no relation between rank
and global ratings, while faculty rank is somewhat positively correlated with Breadth of
Coverage and somewhat negatively correlated with Group Interaction. These results for
the global ratings are consistent with large empirical studies (e.g., Aleamoni & Yimer,
1973; Brandenburg et al., 1977; Centra & Creech, 1976). Feldman (1983) reported that
a majority of the studies in his review found no significant effect of instructor rank on
global ratings, but that the significant relations that were found were generally positive.
Feldman also reported that rank was not significantly related to more specific rating di-
mensions in a majority of studies, but that positive relations tended to be more likely for
dimensions related to instructor knowledge and intellectual expansiveness whereas nega-
tive relations were more likely for ratings of encouragement of discussion, openness, and con-
cern for students.
Marsh (1976) found instructor age to be negatively correlated with student ratings,
whereas Marsh et al. (1976) and Marsh and Overall (1979a) found nil or slightly negative
relations between years of teaching experience and ratings. Feldman (1983) found that about
half the studies in his review found no relation between either age or years of teaching ex-
perience and global ratings, but that among those that did find significant relations, the
predominant finding was one of a negative relation. Braskamp et al. (1975) suggested that
student ratings may increase during the first 10 years of teaching, but decline somewhat
thereafter.
In summary, teaching assistants typically receive lower ratings than other faculty, but
otherwise there is little relation between either rank or experience and student ratings.
However, to the extent that there are significant relations at all, faculty with higher
academic ranks tend to be rated somewhat more favorably while older faculty and faculty
with more years of teaching experience tend to be rated somewhat lower. This pattern of
findings led Feldman (1983, p. 54) to conclude that: “the teacher’s academic rank should
not be viewed as interchangeable with either teacher’s age or extent of instructional experi-
ence with respect to teacher evaluations.”
Course Level
Empirical studies (e.g., Centra & Creech, 1976; Pohlman, 1972) and Feldman’s 1977 re-
view indicate that student sex has very little effect on student ratings, though Feldman
notes that when significant effects are reported women may give slightly higher ratings
than men. Similarly, large empirical studies (e.g., Brandenburg et al., 1977; Brown, 1976)
and McKeachie’s 1979 review suggest that the sex of the instructor has little relation to stu-
dents’ evaluations, though Braskamp et al. (1985) and Aleamoni and Hexner (1980) con-
clude that the results are mixed. Feldman (1977) noted a few studies that reported sex-
of-student by sex-of-teacher interactions, but such interactions were found in few of the
studies in his review.
The brief summary in this section is based primarily on Feldman’s 1979 review of this
topic, and interested readers are referred to this source for further discussion.
Feldman (1979) reported that anonymous ratings tended to be somewhat lower than
non-anonymous ratings, but that this result might vary with other circumstances. For ex-
ample, his review suggested that this effect would be stronger when teachers are given the
ratings before assigning grades, when students feel they may be called upon to justify or
elaborate their responses, or, perhaps, when students view the instructor as vindictive.
Braskamp et al. (1985) noted similar findings and recommended that ratings should be
anonymous.
Feldman (1979) reported that ratings tend to be higher when they are to be used for ad-
ministrative purposes than when used for feedback to faculty or for research purposes, but
that the size of this effect may be very small (see also Centra, 1979). Frankhouser (1984)
critically reviewed this research and presented results from what he argued to be a better
experimental design that showed that stated purpose had no effect on global ratings.
Timing
Feldman (1979) reported that ratings tended to be similar whether collected in the
middle of the term, near the end of the term, during the final exam or even after comple-
tion of the course. Braskamp et al. (1985) suggested that ratings collected during the final
examination may be lower, and that midterm ratings may be unreliable if students can be
identified. Marsh and Overall (1980) collected ratings during the middle of the term and
during the last week of the term in their multisection validity study. Midterm and end-of-
term ratings were highly correlated, but validity coefficients based on the midterm ratings
were substantially lower. Braskamp et al. recommend that ratings should be collected dur-
ing the last two weeks of the course.
In summary, it appears that some aspects of the manner in which students’ evaluations
of teaching effectiveness are administered may influence the ratings. While these influences may not be large,
the best recommendation is to standardize all aspects of the administration process. This
is particularly important if the ratings are to be used for personnel decisions and may need
to be defended in a legal setting.
Academic Discipline
Feldman (1978) reviewed studies that compared ratings across disciplines and found
that ratings are: somewhat higher than average in English, humanities, arts, languages,
and, perhaps, education; somewhat lower than average in social sciences, physical sci-
ences, mathematics and engineering, and business administration; and about average in
biological sciences. The Centra and Creech 1976 study is particularly important because it
was based on courses from over 100 institutions. They classified courses as natural sci-
ences, social sciences and humanities and found that ratings were highest in humanities
and lowest in natural sciences. However, even though these results were highly significant,
the difference accounted for less than 1 percent of the variance in the student ratings.
Neumann and Neumann (1985), based on Biglan’s 1973 theory, classified academic
areas according to whether they had: (a) a well-defined paradigm structure (hard/soft); (b)
an orientation towards application (applied/pure); and (c) an orientation to living or-
ganisms (life/nonlife). Previous research indicated that the role of teaching is different in
each of the eight combinations that result from these three dichotomies, and led the au-
thors to predict that students’ evaluations would be higher in soft, in pure, and in nonlife
disciplines. Based on this weak partial ordering, they made predictions about student
evaluations for 19 pairwise comparisons of the eight types of disciplines, and their results
provided support for all 19 predictions. While the effect of all three facets on students’
evaluations was significant, the effect of the hard/soft facet was largest. The authors indi-
cated that teaching plays a greater role in preparadigmatic areas, where research procedures
are not well developed, than in paradigmatic areas, where the content and method of research
are well developed and teaching is relatively deemphasized. On the basis of this re-
search, the authors argued that student ratings should only be compared within similar dis-
ciplines and that campus-wide comparisons may be unwarranted since the role of teaching
varies in different academic areas. The generality of these findings may be limited since the
results were based on a single institution, and further tests of the generality of these find-
ings are important. The findings also suggest that discipline differences observed in stu-
dent ratings may reflect the different roles of teaching in these disciplines that are accu-
rately reflected in the student ratings.
There may be some differences in student ratings that are due to the academic discipline
of the subject, but the size of this relation is probably small. Nevertheless, since there are
few large, multi-institutional studies of this relation, conclusions must be tentative. The
implications of such a relation, if it exists, depend on the use of the students’ evaluations.
At institutions where SEEQ has been used, administrative use of the student ratings is at
the school or division level, and student ratings are not compared across diverse academic
disciplines. In such a situation, the relation between ratings and discipline may be less crit-
ical than in a setting where ratings are compared across all disciplines.
The relation between the personality of an instructor and students’ evaluations of his/
her teaching effectiveness is important for at least two different reasons. First, there is
sometimes the suspicion that the relation is substantial and that instructor personality has
nothing to do with being an effective teacher, so that the relation should be interpreted as
a bias to students’ evaluations. (This perspective apparently was the basis of initial in-
terpretations of the Dr. Fox effect discussed in Chapter 6 where ‘instructor expressiveness’
was experimentally manipulated and its effect on student ratings was examined.) Second,
if the relation is significant, then the results may have practical and theoretical importance
for distinguishing between effective and ineffective teachers, and for a better understand-
ing of the study of teaching effectiveness and the study of personality.
Given the potential importance of this relation, there has been surprisingly little re-
search on it. The brief summary of research presented here is based primarily on Feldman’s
1986 review. Because of the small number of studies of this relation, Feldman limited his
review to overall or global evaluations of teaching effectiveness, and focused on studies of
class-average student ratings. Feldman reviewed relations between overall student ratings
and 14 categories of personality as inferred from self-reports by the instructor or as infer-
red from ratings by others (students or colleagues). Across all studies that inferred person-
ality from self-report measures, the only practically significant correlations were for ‘posi-
tive self-regard, self-esteem’ (mean r = 0.30) and ‘energy and enthusiasm’ (mean r =
0.27); the mean correlation was 0.15 or less between student ratings of teaching effective-
ness and each of the other 12 areas of personality. In contrast, when personality was infer-
red from ratings by students, and from ratings by colleagues, the correlations were much
higher; the average correlations between students’ evaluations and most of the 14 cate-
gories of personality were between 0.3 and 0.6.
Aspects of the instructors’ personality, as inferred by self-report and as inferred by rat-
ings by others, are systematically related to students’ overall evaluations of the instructor.
However, correlations based on ratings by others are substantially larger than those based
on self-reports, and many interrelated explanations of this difference are plausible (see
Feldman, 1986): (a) teacher personality as perceived by students may be affected by simi-
lar response biases that affect their ratings of teacher effectiveness; (b) teacher personality
as inferred by colleagues may be based in part on information from students; (c) teacher
personality inferred by students and colleagues may be based in part on perceptions of
teaching effectiveness rather than personality per se (or vice versa); (d) teacher personal-
ity as inferred by self-report may be more or less valid, or may be more or less biased, than
teacher personality as inferred by others; (e) teacher personality as inferred by students,
and perhaps by colleagues, may be limited to a situationally specific aspect of personality
(e.g., personality as a teacher or as an academic professional) whereas self-report mea-
sures of personality are more general measures that do not focus on a specific context.
The aspect of Feldman’s speculations that I find most interesting is the suggestion that
self-report measures of personality assess general (i.e. situationally independent) compo-
nents of personality, whereas personality components inferred by colleagues and particu-
larly by students assess the manifestation of those components in a particular setting (situ-
ationally specific). Since teaching effectiveness should logically be more strongly related
to situationally specific personality as manifested in an academic setting than to personal-
ity in general, the situationally specific components of personality inferred by students and
colleagues should be more strongly related to teaching effectiveness than should the gen-
eral (situationally independent) personality components based on self-reports. In support
of such an interpretation, components of personality that are most strongly related to stu-
dent ratings of teaching effectiveness appear to be similar whether based on self-reports or
ratings by others. The problem with the interpretation based on existing research is that
the person making the ratings (self vs. other) and the situational specificity of the ratings
(general vs. situationally specific) are completely confounded. In order to unconfound
these variables, future research might: (a) ask teachers, students, and colleagues to make
personality ratings of the teacher in general and as manifested in the role of being a
teacher; (b) consider teacher self-evaluations as well as student ratings of teacher effec-
tiveness; and (c) collect personality ratings of the teacher by others who primarily know
the teacher outside of an academic setting.
Feldman’s review suggests that there is a relation between students’ evaluations of an
instructor and at least some aspects of the instructors’ personality, but does not indicate
whether this is a valid source of influence or a source of bias. As noted with other sources
of influence, if teacher personality characteristics influence other indicators of teaching ef-
fectiveness in a manner similar to their influence on students’ evaluations, then the rela-
tion should be viewed as supporting the validity of student ratings. Such an interpretation
may be supported by the review of Dr. Fox studies (Chapter 6) where an experimentally
manipulated personality-like variable - instructor expressiveness - is shown to affect
both students’ evaluations and examination performance in some circumstances.
Because of limitations in existing research, Feldman’s review did not examine the
systematic pattern of relations between specific components of personality and specific
student evaluation factors, and this is unfortunate. Logically, some specific aspects of an
instructor’s personality should be systematically related to some specific student evalua-
tion factors. For example, enthusiasm or some related construct is often a specific compo-
nent measured by personality inventories and by student evaluation instruments, so that
the two measures of enthusiasm should be substantially correlated. In fact, there is a mod-
erate overlap between the 19 categories of teacher effectiveness based on Feldman’s earl-
ier research (see Table 2.1) and the 14 categories of personality in his 1986 study. Other
specific components of personality and other student evaluation factors may be logically
unrelated, and so correlations between them should be much smaller. Again, support for
such an interpretation comes from the review of Dr. Fox studies (Chapter 6) that shows
that experimentally manipulated ‘instructor expressiveness’ is more substantially related
to student ratings of Instructor Enthusiasm than to other rating factors. In future research
it is important to examine relations between instructor personality and students’ evalua-
tions with a construct approach that considers the multidimensionality of both personality
and teaching effectiveness.
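The sketch below illustrates, with invented data, what such a convergent/discriminant check might look like: each personality component is correlated with each student rating factor, and the matched (convergent) correlations are compared with the unmatched (discriminant) ones. The three trait-factor pairs are hypothetical and stand in for pairings such as enthusiasm with ratings of Instructor Enthusiasm.

# Hypothetical sketch of a construct (convergent/discriminant) check:
# correlate each personality component with each student rating factor and
# compare the matched correlations with the unmatched ones. Data invented.
import numpy as np

rng = np.random.default_rng(1)
n = 150  # instructors (illustrative)

# Simulate three matched trait/factor pairs plus measurement noise.
latent = rng.normal(size=(n, 3))
personality = latent + rng.normal(scale=0.8, size=(n, 3))
ratings     = latent + rng.normal(scale=0.8, size=(n, 3))

corr = np.corrcoef(personality.T, ratings.T)[:3, 3:]   # 3 x 3 cross-correlation block

convergent   = np.diag(corr)                   # matched trait-factor pairs
discriminant = corr[~np.eye(3, dtype=bool)]    # unmatched pairs

print("cross-correlation matrix:\n", np.round(corr, 2))
print("mean convergent r:  ", round(convergent.mean(), 2))
print("mean discriminant r:", round(discriminant.mean(), 2))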
The search for potential biases to student ratings has itself been so biased that it could
be called a witch hunt. Methodological problems listed at the start of this section are com-
mon. Furthermore, research in this area is seldom guided by any theoretical definition of
bias, and the definitions that are implicit in most studies are typically inadequate or inter-
nally inconsistent. Research findings described here, particularly for those relations not
emphasized in SEEQ research, should only be taken as rough approximations. There is
clearly a need for meta-analyses, and systematic reviews such as those by Feldman de-
scribed earlier, to provide more accurate estimates of the size of effects which have been
reported, and the conditions under which they were found. For most of the relations, the
effects tend to be small, the directions of the effects are sometimes inconsistent, and the
attribution of a bias is unwarranted if bias is defined as an effect that is specific to students’
evaluations and does not also influence other indicators of teaching effectiveness. Perhaps
the best summary of this area is McKeachie’s (1979) conclusion that a wide variety of var-
iables that could potentially influence student ratings apparently have little effect. Similar
conclusions have been drawn by Centra (1979), Menges (1973), Marsh (1980b), Murray
(1980), Aleamoni (1981), and others.
There are, of course, nearly an infinite number of variables that could be related to stu-
dent ratings and could be posited as potential biases. However, any such claim must be
seriously scrutinized in a series of studies that are relatively free from the common
methodological shortcomings, are based upon an explicit and defensible definition of bias,
and employ the type of logic used to examine the variables described above. Single studies
of the predictive validity of psychological measures have largely been replaced by a series
of construct validity studies, and a similar approach should also be taken in the study of po-
tential biases. Simplistic arguments that a significant correlation between student ratings
and some variable ‘X’ demonstrates a bias can no longer be tolerated, and are an injustice
to the field. It is unfortunate that the caution exercised in interpreting correlations between
student ratings and potential indicators of effective teaching as evidence of validity has not
been matched when interpreting correlations between student ratings and potential
biases as evidence of invalidity.
CHAPTER 6
The Dr Fox effect is defined as the overriding influence of instructor expressiveness on stu-
dents’ evaluations of college/university teaching. The results of Dr Fox studies have been
interpreted to mean that an enthusiastic lecturer can entice or seduce favorable evalua-
tions, even though the lecture may be devoid of meaningful content. In the original Dr Fox
study by Naftulin et al. (1973), a professional actor lectured to educators and graduate stu-
dents in an enthusiastic and expressive manner, and teaching effectiveness was evaluated.
Despite the fact that the lecture content was specifically designed to have little educational
value, the ratings were favorable. The authors and critics agree that the study was fraught
with methodological weaknesses, including the lack of any control group, a poor rating in-
strument, the brevity of the lecture compared to an actual course, the unfamiliar topic
coupled with the lack of a textbook with which to compare the lecture, and so on (see
Abrami et al., 1982; Frey, 1979; Ware & Williams, 1975). Frey (1979) notes that “this study
represents the kind of research that teachers make fun of during the first week of an intro-
ductory course in behavioral research methods. Almost every feature of the study is prob-
lematic” (p. 1). Nevertheless, reminiscent of the Rodin and Rodin (1972) study described
earlier, the results of this study were seized upon by critics of student ratings as support for
the invalidity of this procedure for evaluating teaching effectiveness.
To overcome some of the problems, Ware and Williams (Ware & Williams, 1975, 1977;
Williams & Ware, 1976, 1977) developed the standard Dr Fox paradigm, in which a series of
six lectures, all presented by the same professional actor, was videotaped. Each lecture rep-
resented one of three levels of course content (the number of substantive teaching points
covered) and one of two levels of lecture expressiveness (the expressiveness with which the
actor delivered the lecture). Students viewed one of the six lectures, evaluated teaching ef-
fectiveness on a typical multi-item rating form, and completed an achievement test based
upon all the teaching points in the high-content lecture. Ware and Williams (1979, 1980)
reviewed their studies, and similar studies by other researchers, and concluded that differ-
ences in expressiveness consistently explained much more variance in student ratings than
did differences in content.
A Reanalysis
Marsh and Ware (1982) reanalyzed data from the Ware and Williams studies. A factor
analysis of the rating instrument identified five evaluation factors which varied in the way
they were affected by the experimental manipulations (Table 6.1). In the condition most
like the university classroom, where students were told before viewing the lecture that they
would be tested on the materials and that they would be rewarded in accordance with the
number of exam questions which they answered correctly (incentive before lecture in
Table 6.1), the Dr Fox effect was not supported. The instructor expressiveness manipula-
tion only affected ratings of Instructor Enthusiasm, the factor most logically related to that
manipulation, and content coverage significantly affected ratings of Instructor Knowledge
and Organization/Clarity, the factors most logically related to that manipulation.
When students were given no incentives to perform well, instructor expressiveness had
more impact on all five student rating factors than when external incentives were present,
though the effect on Instructor Enthusiasm was still largest. However, without external in-
centives, expressiveness also had a larger impact on student achievement scores than did
the content manipulation; that is, presentation style had more to do with how well students
performed on the examination than did the number of questions that had been covered in
the lecture (see Table 6.1).
Table 6.1
Effect Sizes of Expressiveness, Content, and the Expressiveness x Content Interaction in Each of the Three
Incentive Conditions (Reprinted with permission from Marsh, 1983b)

                                     Expressiveness    Content    Expressiveness x Content

No External Incentive
  Clarity/Organization                11.3**            4.2**      1.6
  Instructor Concern                  12.9**            2.1        2.8*
  Instructor Knowledge                12.8**            2.7*       1.9**
  Instructor Enthusiasm               34.6**            1.9*       2.4**
  Learning Stimulation                13.0**            9.6**      1.5
  Total rating (across all items)     25.4**            5.1**      3.3**
  Achievement test scores              9.4**            5.2**      1.3

Incentive After Lecture
  Clarity/Organization                 2.0              6.0        1.3
  Instructor Concern                  20.5**            7.5**      1.9
  Instructor Knowledge                25.1**            8.8**      2.3
  Instructor Enthusiasm               30.9**            3.3
  Learning Stimulation                                              .7
  Total rating (across all items)                       7.0**      2.8
  Achievement test scores               .3             13.0**       .4

Incentive Before Lecture
  Clarity/Organization                  .3             11.5**      6.9*
  Instructor Concern                    .1              7.0*       6.2*
  Instructor Knowledge                  .3              6.2*       1.3
  Instructor Enthusiasm               22.1**            4.0        6.6*
  Learning Stimulation                  .1              8.8**      8.1*
  Total rating (across all items)      2.0             11.4**      6.8*
  Achievement test scores               .5             26.5**

Across All Incentive Conditions
  Clarity/Organization                 2.1**            5.0**      1.6*
  Instructor Concern                   7.2**            4.3**      1.0
  Instructor Knowledge                 6.4**            3.1**       .8
  Instructor Enthusiasm               25.4**            1.2*       1.7**
  Learning Stimulation                 3.3**            4.9**      1.1
  Total rating (across all items)     12.5**            5.2**      1.8*
  Achievement test scores              1.7**           10.7**       .3

Note. Separate analyses of variance (ANOVAs) were performed for each of the five evaluation factors, the sum
of the 18 rating items (Total rating), and the achievement test. First, separate two-way ANOVAs (Expressiveness
X Content) were performed for each of the three incentive conditions, and then three-way ANOVAs (Incentive
X Expressiveness X Content) were performed for all the data. The effect sizes were defined as (SSeffect /
SStotal) x 100%. A few entries were illegible in the source and are left blank.
* p < .05. ** p < .01.
This finding demonstrated that, particularly when external incentives are
weak, expressiveness can have an important impact on both student ratings and achieve-
ment scores. In further analyses of the achievement scores, Marsh (1984, p. 212) con-
cluded that the study was one of the few to “show that more expressively presented lec-
tures cause better examination performance in a study where there was random assign-
ment to treatment conditions and lecturer expressiveness was experimentally manipu-
lated.” Across all the conditions, the effect of instructor expressiveness on ratings of In-
structor Enthusiasm was larger than its effect on other student rating factors. Hence, as ob-
served in the examination of potential biases to student ratings, this reanalysis indicates
the importance of considering the multidimensionality of student ratings. An effect which
has been interpreted as a ‘bias’ to students’ evaluations seems more appropriately inter-
preted as support for their validity with respect to one component of effective teaching.
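The effect-size index reported in Table 6.1 is simply each effect's sum of squares expressed as a percentage of the total sum of squares. The sketch below computes that index for a balanced 2 (expressiveness) x 3 (content) layout; the cell means and scores are invented, not the Ware and Williams data.

# Minimal sketch of the effect-size index used in Table 6.1: each effect's
# sum of squares as a percentage of the total sum of squares, here for a
# balanced 2 (expressiveness) x 3 (content) design with invented scores.
import numpy as np

rng = np.random.default_rng(2)
n_per_cell = 20
# cell_means[i, j]: expressiveness level i (low/high), content level j (low/med/high)
cell_means = np.array([[5.0, 5.2, 5.4],
                       [6.2, 6.3, 6.6]])
data = cell_means[:, :, None] + rng.normal(scale=1.0, size=(2, 3, n_per_cell))

grand = data.mean()
a_means  = data.mean(axis=(1, 2))          # expressiveness marginal means
b_means  = data.mean(axis=(0, 2))          # content marginal means
ab_means = data.mean(axis=2)               # cell means

ss_total = ((data - grand) ** 2).sum()
ss_a  = 3 * n_per_cell * ((a_means - grand) ** 2).sum()
ss_b  = 2 * n_per_cell * ((b_means - grand) ** 2).sum()
ss_ab = n_per_cell * ((ab_means - a_means[:, None] - b_means[None, :] + grand) ** 2).sum()

for name, ss in [("expressiveness", ss_a), ("content", ss_b), ("interaction", ss_ab)]:
    print(f"{name:>15s}: {100 * ss / ss_total:.1f}% of variance")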
A Meta-Analysis
Abrami et al. (1982) conducted a review and a meta-analysis of all known Dr Fox
studies. On the basis of their meta-analysis, they concluded that expressiveness manipula-
tions had a substantial impact on overall student ratings and a small effect on achievement,
while content manipulations had a substantial effect on achievement and a small effect on
ratings. Consistent with the Marsh and Ware reanalysis, they also found that in the few
studies that analyzed separate rating factors, the rating factors that were most logically re-
lated to the expressiveness manipulation were most affected by it. Finally, they concluded
that while the expressiveness manipulation did interact with the content manipulation and
a host of other variables examined in the Dr Fox studies, none of these interactions ac-
counted for more than 5 percent of the variance in student ratings.
How should the results of the Dr Fox type studies be evaluated? Consistent with an em-
phasis on the construct validity of multifaceted ratings in this paper, a particularly power-
ful test of the validity of student ratings would be to show that each rating factor is strongly
influenced by manipulations most logically associated with it and less influenced by other
manipulations. This is the approach used in the Marsh and Ware reanalysis of the Dr Fox
data described above, and it offers strong support for the validity of ratings with respect to
expressiveness and, perhaps, limited support for their validity with respect to content.
Multiple rating factors have typically not been considered in Dr Fox studies (but see
Ware & Williams, 1977; and discussion of this study by Marsh & Ware, 1982). Instead, re-
searchers have relied on total scores even though they collect ratings which do represent
multiple rating dimensions (i.e. the same form as was shown to have five factors in the
Marsh & Ware reanalysis, and/or items from the 1971 Hildebrand et al. study described
earlier). However, this makes no sense when researchers also emphasize the differential
effects of the experimental manipulations on the total rating score and the achievement
outcome. According to this approach, student ratings may be invalid because they are
If students were asked to compare high and low content lectures on the amount of content co-
vered, to indicate which test items had been covered in the lecture, to evaluate content
coverage relative to textual materials representing the content which was supposed to have
been covered, or even to evaluate content coverage after completing an examination
where they were told that all the questions were supposed to be covered, then they would
have a much better basis for evaluating the content coverage and I predict that their re-
sponses would more accurately reflect the content manipulation. Some support for this
suggestion comes from a recent study by Leventhal et al. (1983) where students viewed one
lecture which was either ‘good’ (high in content and expressiveness) or ‘poor’ (low in con-
tent and expressiveness), and a second lecture by the same lecturer which was either good
or bad. The sum across all ratings of the second lecture varied inversely with the quality of
the first. This is a ‘contrast’ effect which is typical in frame of reference studies (e.g., Par-
ducci, 1968): after viewing a poor lecture, a second lecture seems better; after viewing a
good lecture a second lecture seems poorer. Here, the authors also examined different rat-
ing factors and found that the effects of manipulations of instructor characteristics varied
substantially according to the rating component (though the evaluation of Group Interac-
tion and Individual Rapport on the basis of videotaped lectures seems dubious). Unfortu-
nately, the effects of content and expressiveness were intentionally confounded in this de-
sign which was not intended to represent a standard Dr Fox study.
In addition to content and expressiveness effects, Dr Fox studies have considered the
effects of a variety of other variables: grading standards (Abrami et al., 1980); instructor
reputation (Perry et al., 1979); student personality characteristics (Abrami et al., 1982);
purpose of evaluation (Meier & Feldhusen, 1979); and student incentive (Williams &
Ware, 1976; Perry et al., 1979; Abrami et al., 1980). In each instance, the Dr Fox video-
tapes reflected the manipulations of just content and expressiveness, whereas the other ex-
perimental manipulations represented verbal instructions given to subjects before or after
viewing a Dr Fox lecture. The incompleteness with which these analyses are generally
reported makes it difficult to draw any conclusions, but apparently only incentive (for
example, the Marsh and Ware study) and instructor reputation had any substantial effect
on student ratings. When students are led, through experimentally manipulated feedback,
to believe that an instructor is an effective teacher, they rate him more favorably on the
basis of one short videotaped lecture and presumably the manipulated feedback. Also,
when students are given external incentives to do well, they perform better on examina-
tions and rate teaching effectiveness more favorably.
Researchers have also prepared videotaped lectures manipulating variables other than
content and expressiveness. For example Land and Combs (1979; see earlier discussion)
videotaped 10 lectures which varied only in teacher speech clarity, operationally defined
as the number of false starts or halts in speech, redundantly spoken words, and tangles in
words. As teacher clarity improved there was a substantial linear improvement in both
student ratings of teaching effectiveness and student performance on a standardized
achievement test.
Cadwell and Jenkins (1985; see earlier discussion) experimentally manipulated verbal
descriptions of instructional behaviors, and asked students to evaluate each instructor in
terms of four rating dimensions that were designed to reflect these behaviors. Each rating
item was substantially related to the behavioral descriptions that it was designed to reflect,
and more highly correlated with those matching behavioral descriptions than with the others
that were considered. While this aspect of the study was not the focus of Cadwell and Jen-
kins’s discussion, when viewed in the context of Dr Fox studies, the results provide clear
support for both the convergent and discriminant validity of the students’ evaluations with
respect to the behavioral descriptions in the Cadwell and Jenkins study.
CHAPTER 7
Braskamp et al. (1985) argue that it is important for universities, as well as individual fa-
culty members, to take evaluations seriously, and Braskamp offers three arguments, based in part
on organizational research, in support of this position. First, through the evaluation process
the university can effectively communicate and reinforce its goals and values. Second, he
argues that the most prestigious universities and the most successful organizations are the
ones that take assessment seriously. In support of this contention Braskamp (p. 12) cited
William Spady’s review of the book In Search of Excellence by Peters and Waterman
where Spady states:
For me, the single most striking thing about the findings in the book was that the man-
agers of the successful corporations were data-based and assessment-driven. They built
into their management procedures the need to gather data about how things are operat-
ing. Also, there was a commitment to change and modify and improve what they were
doing based on those results. They were very much oriented toward a responsive model
of management. The responding had to do with meeting goals, so it was goal-based. It
was very much oriented toward using performance data, sales data and results, as the
basis for changing what needs to change in the organization.
Third, Braskamp cited employee motivation research that indicated that personal in-
vestment is enhanced when organizations provide stimulating work, provide a supportive
environment, and provide rewards that are perceived to be fair. In summary, Braskamp
(p. 14) states that: “the clarity and pursuit of purpose is best done if the achievements are
known. A course is charted and corrections are inevitable. Evaluation plays a role in the
clarity of purpose and determining if the pursuit is on course.” However, Braskamp
further notes that this perspective is well-established for the evaluation of research, but not
for teaching.
The introduction of a broad institution-based, carefully planned program of students’
evaluations of teaching effectiveness is likely to lead to the improvement of teaching.
Faculty will have to give serious consideration to their own teaching in order to evaluate
the merits of the program. The institution of a program that is supported by the administ-
ration will serve notice that teaching effectiveness is being taken more seriously by the ad-
ministrative hierarchy. The results of student ratings, as one indicator of effective teach-
ing, will provide a basis for informed administrative decisions and thereby increase the
likelihood that quality teaching will be recognized and rewarded, and that good teachers
will be given tenure. The social reinforcement of getting favorable ratings will provide
added incentive for the improvement of teaching, even at the tenured faculty level. Fi-
nally, faculty report that the feedback from student evaluations is useful to their own ef-
forts for the improvement of their teaching. None of these observations, however, pro-
vides an empirical demonstration of improvement of teaching effectiveness resulting from
students’ evaluations.
In most studies of the effects of feedback from students’ evaluations, teachers (or clas-
ses) are randomly assigned to experimental (feedback) and one or more control groups;
students’ evaluations are collected during the term; ratings of the feedback teachers are re-
turned to instructors as quickly as possible; and the various groups are compared at the end
of the term on a second administration of student ratings and sometimes on other vari-
ables as well. (There is considerable research on a wide variety of other techniques de-
signed to improve teaching effectiveness which use student ratings as an outcome measure
- see Levinson-Rose & Menges, 1981.)
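A minimal sketch of the end-of-term comparison in such a design is given below. The group sizes echo the numbers of sections in the Overall and Marsh study, but the ratings themselves are simulated, and a simple t test stands in for whatever analysis a particular feedback study actually used.

# Hypothetical sketch of the end-of-term comparison in a typical feedback
# study: classes randomly assigned to feedback or no-feedback conditions and
# compared on class-average ratings from a second administration. Data invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
feedback    = rng.normal(loc=6.6, scale=0.7, size=12)   # class-average end-of-term ratings
no_feedback = rng.normal(loc=6.3, scale=0.7, size=18)

t, p = stats.ttest_ind(feedback, no_feedback)
print(f"feedback M = {feedback.mean():.2f}, control M = {no_feedback.mean():.2f}")
print(f"t = {t:.2f}, p = {p:.3f}")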
SEEQ has been employed in two such feedback studies using multiple sections of the
same course. In the first study, results from an abbreviated form of the survey were simply
returned to faculty, and the impact of the feedback was positive, but very modest (Marsh
et al., 1975). In the second study (Overall & Marsh, 1979) researchers actually met with in-
structors in the feedback group to discuss the evaluations and possible strategies for im-
provement. In this study (see Table 7.1) students in the feedback group subsequently per-
formed better on a standardized final examination, rated teaching effectiveness more
favorably at the end of the course, and experienced more favorable affective outcomes
(i.e., feelings of course mastery and plans to pursue and apply the subject). These two
studies suggest that feedback, coupled with a candid discussion with an external consul-
tant, can be an effective intervention for the improvement of teaching effectiveness (also
see McKeachie et al., 1980).
Reviewers of feedback studies have reached different conclusions (e.g., Abrami et al.,
1979; Kulik & McKeachie, 1975; Levinson-Rose & Menges, 1981; McKeachie, 1979;
Rotem & Glassman, 1979). Cohen (1980), in order to clarify this controversy, conducted
a meta-analysis of all known feedback studies. Across all known feedback studies, Cohen
found that instructors who received midterm feedback were subsequently rated about one-
third of a standard deviation higher than controls on the Total Rating (an overall rating
item or the average of multiple items), and even larger differences were observed for rat-
ings of Instructor Skill, Attitude Toward Subject, and Feedback to Students. Studies that
augmented feedback with consultation produced substantially larger differences, but
other methodological variations had no effect. The results of this meta-analysis support
the SEEQ findings described above and demonstrate that feedback from students’ evalua-
tions, particularly when augmented by consultation, can lead to improvement in teaching
effectiveness. Levinson-Rose and Menges (1981), comparing the results of their review
with those of the reviews by Kulik and McKeachie (1975), Abrami et al. (1979), and Rotem and
Glassman (1979), also concluded that their conclusions were more optimistic, particularly
when feedback is supplemented with consultation.
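Cohen's index is essentially a standardized mean difference averaged across studies. The sketch below shows one common way such effect sizes might be combined, weighting each study's d by its approximate inverse sampling variance; the study values are invented and the weighting scheme is a generic meta-analytic one, not necessarily the procedure Cohen used.

# Illustrative sketch of combining feedback-study effect sizes in the manner
# of a simple meta-analysis: each study contributes a standardized mean
# difference (d) weighted by its inverse sampling variance. Values invented.
import numpy as np

# (d, n_feedback, n_control) for a handful of hypothetical studies
studies = [(0.45, 15, 15), (0.20, 30, 28), (0.60, 10, 12), (0.25, 25, 25)]

ds, weights = [], []
for d, n1, n2 in studies:
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))  # large-sample variance of d
    ds.append(d)
    weights.append(1 / var_d)

ds, weights = np.array(ds), np.array(weights)
mean_d = (weights * ds).sum() / weights.sum()
se = np.sqrt(1 / weights.sum())
print(f"weighted mean d = {mean_d:.2f} (SE = {se:.2f})")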
Table 7.1
F Values for Differences Between Students with Either Feedback or No-Feedback Instructors for End-of-Term
Ratings, Final Exam Performance, and Affective Course Consequences (Reprinted with permission from Overall
and Marsh, 1979; see original article for more details of the analysis)

                                           Feedback (a)       No feedback (b)
                                           M        SD        M        SD       Diff.     F
Rating components
  Concern                                  52.38     8.5      49.51    10.1      2.87     19.1**
  Breadth                                  50.84     7.9      49.59     7.9      1.25      4.8*
  Interaction                              51.94     7.4      48.61    10.3      3.33     32.4**
  Organization                             49.88     9.4      50.88     9.5     -1.00      2.5
  Learning/Value                           50.77     9.9      48.22    10.7      2.55     11.7**
  Exams/Grading                            50.52     9.9      49.08    10.1      1.44      4.1*
  Workload/Difficulty                      51.13     8.8      51.51     8.8      -.38       .4
  Overall Instructor                        7.00     1.6       6.33     2.1       .67     26.4**
  Overall Course                            5.81     1.8       5.39     2.0       .42      5.4*
  Instructional Improvement                 5.97     1.5       5.49     1.5       .48     16.0**
Final exam performance                     51.34     9.9      49.41    10.1      1.93      9.4**
Affective course consequences
  Programming competence achieved           5.80     2.0       5.42     2.3       .38      7.7**
  Computer understanding gained             6.18     2.0       5.94     2.1       .24      3.6
  Future computer use planned               4.00     2.8       3.49     2.7       .51      6.5*
  Future computer application planned       5.05     2.6       4.67     2.6       .38      5.4*
  Further related coursework planned        4.39     2.9       3.52     2.9       .87     11.1**

Note. Evaluation factors and final exam performance were standardized with M = 50 and SD = 10. Responses to
summary rating items and affective course consequences varied along a scale ranging from 1 (very low) to 9
(very high).
(a) For feedback group, N = 295 students in 12 sections.
(b) For no-feedback group, N = 456 students in 18 sections.
* p < .05. ** p < .01.
Several issues still remain in this feedback research. First, the studies demonstrate that
feedback without consultation is only modestly effective, and none of the studies reporting
significant feedback effects with consultation provide an adequate control for the effect of
consultation without feedback (i.e., a placebo effect due to consultation, or a real effect
due to consultation which does not depend upon feedback from student ratings). Second,
the criterion of effective teaching used to evaluate the studies was limited primarily to stu-
dent ratings; only the Overall and Marsh study demonstrated a significant effect of feed-
back on achievement (but also see Miller, 1971; McKeachie et al., 1980). Most other
studies were not based upon multiple sections of the same course, and so it was not possible
to test the effect of feedback on achievement scores. Third, nearly all of the studies were
based on midterm feedback from midterm ratings. This limitation, perhaps, weakens the
likely effects in that many instructional characteristics cannot be easily altered in the sec-
ond half of the course; also, the generality of this approach to the effects of end-of-term
ratings in one term on subsequent teaching in other terms has not been examined. Further-
more, Marsh and Overall (1980) demonstrated in their multisection validity study that
while midterm and end-of-term ratings were substantially correlated, midterm ratings
were judged to be less valid than end-of-term ratings since they were less correlated
with measures of student learning. Fourth, most of the research is based upon instructors
who volunteer to participate, and this further limits the generality of the effect, since vol-
unteers are likely to be more motivated to use the feedback to improve their instruction.
(This limitation does not apply to the two studies based upon SEEQ.) Finally, reward
structure is an important variable which has not been examined in this feedback research.
Even if faculty are intrinsically motivated to improve their teaching effectiveness, poten-
tially valuable feedback will be much less useful if there is no extrinsic motivation for
faculty to improve. To the extent that salary, promotion, and prestige are based almost
exclusively on research productivity, the usefulness of student ratings as feedback for the
improvement of teaching may be limited. Hildebrand (1972, p. 53) noted that: “The more
I study teaching, however, the more convinced I become that the most important require-
ment for improvement is incorporation of effective evaluation of teaching into advance-
ment procedures.”
Nearly all feedback studies have considered the effects of feedback from students’
evaluations within a single term, and this is unfortunate. Students’ evaluations are typi-
cally collected near the end of the term so that the more relevant question is the impact of
end-of-term ratings. Also, as discussed earlier, midterm ratings are apparently less valid
than end-of-term ratings, and many aspects of teaching effectiveness cannot be easily al-
tered in the middle of a course. Finally, as emphasized by Stevens and Aleamoni (1985),
the long-term effect of the feedback is more important than its short-term effect. Surpris-
ingly, only two studies known to the author - Centra (1973) and Stevens and Aleamoni
(1985) - have examined the effect of an intervention by comparing feedback and control
groups for a period of more than one semester.
Centra (1973) conducted a large short-term study of the effects of midterm feedback for
a fall semester, but also included a small follow-up phase during the next semester. Instruc-
tors from the first phase were included in the second phase if they were willing to partici-
pate again and if they taught the same course as evaluated in the first phase. Centra consi-
dered three groups in this second phase: (a) fall feedback, 8 of 43 instructors who had ad-
ministered evaluations and were given feedback from both midterm and end-of-term rat-
ings during the first phase; (b) fall post-test only, 13 of 43 instructors who had administered
ratings and were given feedback for only end-of-term ratings during the first phase; and (c)
spring only, 21 of 30 instructors who had not administered ratings during the fall semester.
(Instructors from the first phase who had administered midterm ratings but not received
any midterm feedback were apparently not considered in this second phase.) A mul-
tivariate analysis across a set of 23 items failed to demonstrate significant group differ-
ences, but for 4 of 23 items fall feedback instructors received significantly better (p < 0.05)
ratings than did fall post-test and spring semester groups. Centra also noted that only the
fall feedback group received normative comparisons, and that this might explain why the
fall post-test and spring groups did not differ from each other. Even though a majority of
the teachers from the first phase did not participate in the second phase, Centra reported
that these nonparticipants did not differ from the participants. Nevertheless, the large
number who did not qualify, the probable nonequivalence of the three groups, and the
very small size of the fall feedback group that provided the only hint of group differences
all dictate that the results should be interpreted cautiously. Similarly, interpretations
based on a few items selected a posteriori, particularly after the multivariate difference
based on all items was not statistically significant, should also be made cautiously. Thus,
even though Centra and some subsequent reviewers suggest that this study provides sup-
port for the long-term effects of feedback from student ratings, the effects are very weak
- not even statistically significant when evaluated by traditional methods (i.e., the origi-
nal multivariate analysis) - and problems with the design of the second phase dictate that
interpretations should be made cautiously.
Aleamoni (1978) compared two groups of instructors on student ratings during the
spring semester. Both groups were given results of students’ evaluations during the fall
term, but only the ‘experimental’ group was given an individual consultation designed to
improve teaching effectiveness based on the student ratings. Originally there was no inten-
tion of having two groups, and the comparison group consisted of those instructors who did
not receive the intervention because of time limitations, scheduling conflicts, and self-
selection. Also, instructors in the top two deciles were eliminated from the experimental
group, but apparently not from the comparison group. While there was significantly grea-
ter improvement in the experimental group than in the control group on two of six measures, interpre-
tations are problematic. Since the groups were not randomly assigned nor even matched
on pre-test scores, they may not have been comparable. More critically, regression effects
- increases in the experimental group relative to the comparison group - were probable
because of the elimination of the top rated instructors in the experimental group but not
the comparison group.
Stevens and Aleamoni (1985) conducted a 10-year follow-up study of those instructors
in the Aleamoni (1978) study. For purposes of the follow-up study they considered three
groups, the consultation and comparison groups from the original study, and the group of
highly rated instructors that was originally scheduled to be part of the consultation group
but were excluded from the original study. In the first set of analyses they found that in-
structors who had received consultation, and other instructors who experienced pre-post
increases in student ratings in the original study, were subsequently more likely to collect
student ratings. In the second set of analyses (based on student ratings by those instructors
from the original study who subsequently used student ratings), the three groups were
compared on student ratings from the original study and from the follow-up period. A
multivariate analysis across five subscales revealed no significant group differences nor
group X time interactions, but univariate ANOVAs for two of the scales were statistically
significant. Further analyses indicated that these small effects were due primarily to the
high pre-test means for the group of highly rated instructors who were excluded from the
original study on the basis of high pre-test ratings. There were no significant differences
between the original consultation group and the original comparison groups. The authors
suggest that the follow-up study supports the results of the original study, but their statisti-
cal analyses - particularly the lack of differences between the two groups that were considered
in the original study - do not support this conclusion. Many suggestions by the authors
warrant further study, but no firm conclusions on the basis of this study are warranted be-
cause of design problems in both the original and follow-up study.
Other studies (e.g., Marsh, 1980b, 1982a; Voght & Lasher, 1973) have examined how
teaching effectiveness varies over time for the same instructor when that instructor is given
systematic feedback from students’ evaluations. However, since these studies have no control group, observed changes may be due to teaching experience, age-related changes, or other influences unrelated to the feedback.
Equilibrium Theory
Equilibrium theory was derived to explain why instructors are motivated to improve
their teaching. In one of the earliest applications of equilibrium theory to the effects of
feedback from student ratings, Gage (1963, p. 264) posited that: “Imbalance may be said
to exist whenever a person (teacher) finds that she holds a different attitude toward some-
thing (her behavior) from what she believed is held by another person or group (her pupils)
to whom she feels positively oriented.” In their review, Levinson-Rose and Menges (1981,
p. 143) stated: “A discrepancy, either negative or positive, between evaluations by the in-
structor and those by students creates a state of disequilibrium. To restore equilibrium, the
teacher may attempt to modify instruction.” Centra (1973) hypothesized that: “On the
basis of equilibrium theory one could expect that the greater the gap between student rat-
ings and faculty self-ratings, the greater the likelihood that there would be a change in in-
struction” (p. 396). As typically operationalized, differences between student ratings and
some other indicator (usually instructor self-ratings, but sometimes ideal ratings by stu-
dents or the instructor*) collected at one point in time are posited to create a sense of dis-
equilibrium in the feedback group of teachers so that they change their teaching behaviors
in a way that is reflected in subsequent student ratings when compared to a control group
that receives no feedback.
* I have not emphasized ideal scores in the discussion of equilibrium theory for three reasons. First, I suspect that class-average ideal scores will nearly always fall near the most favourable end of the response scale and have little variation; hence, variance explained by ideal-actual scores can be explained by just the actual scores. Second, I am not sure that discrepancies between ideal and actual scores will be a motivation to change: logically, actual scores will rarely if ever equal ideal scores, and so the existence of such discrepancies may not be a source of psychological discomfort. Third, most studies have emphasized discrepancies between actual and self-ratings (but see Gage, 1963).

There are, however, four important problems with this typical operationalization of equilibrium theory. First, the existence of a discrepancy may not motivate a need for change if the teacher is not concerned about the discrepancy. The extent of the motivation
for change will be a function of both the size of the discrepancy and the importance of the
discrepancy to the individual teacher. Second, particularly when the disequilibrium is due
to differences between student and self ratings, teachers can reduce the posited imbalance
either by changing their teaching behaviors so as to alter the students’ evaluations or by
changing their own self-evaluations. Hence, subsequent student ratings and self ratings
must be considered. (Other changes, such as denying the validity of the student ratings,
may also be consistent with equilibrium theory, and tests of such predictions would require
that other variables be collected.) Third, the likely effects of a discrepancy should differ
depending on the direction of the discrepancy. For teachers with a positive discrepancy -
self-ratings higher than student ratings - it is reasonable that teachers may be motivated
to alter their teaching behaviors so as to improve ratings, but they could just lower their
own, perhaps unrealistically high, self-evaluations. For teachers with a negative discre-
pancy - self-ratings lower than student ratings - it is not reasonable that teachers would
intentionally undermine their teaching effectiveness in order to lower their student evalua-
tions, and it is more reasonable that they would simply increase their own, perhaps un-
realistically negative, self-evaluations. Hence, the effect of feedback needs to be consi-
dered separately for positively discrepant, negatively discrepant, and nondiscrepant
groups. Fourth, equilibrium theory implies that the existence of a discrepancy, in addition
to or instead of the diagnostic information provided by the feedback, is responsible for the
change in subsequent behavior. Equilibrium theory does not imply that there will be over-
all feedback/control differences, and such differences do not support equilibrium theory.
Rather, it is necessary to demonstrate that the feedback/control differences depend on the
direction and size of the discrepancy.
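To make this grouping concrete, the following minimal sketch (in Python, with entirely hypothetical ratings and an arbitrary tolerance band) classifies instructors as positively discrepant, negatively discrepant, or nondiscrepant from self-ratings and class-average student ratings; it illustrates the logic only and is not a procedure taken from any of the studies reviewed here.

```python
# Illustration only: classify instructors into discrepancy groups from
# hypothetical self-ratings and class-average student ratings.
import pandas as pd

ratings = pd.DataFrame({
    "instructor": ["A", "B", "C", "D"],
    "self_rating": [4.5, 3.0, 3.8, 2.5],      # instructor self-evaluation (1-5)
    "student_rating": [3.6, 3.1, 3.8, 3.4],   # class-average student rating (1-5)
})

TOLERANCE = 0.3   # arbitrary band defining 'nondiscrepant'
discrepancy = ratings["self_rating"] - ratings["student_rating"]

def classify(d, tol=TOLERANCE):
    if d > tol:
        return "positively discrepant"    # self-rating higher than student rating
    if d < -tol:
        return "negatively discrepant"    # self-rating lower than student rating
    return "nondiscrepant"

ratings["discrepancy"] = discrepancy
ratings["group"] = discrepancy.apply(classify)
print(ratings)
```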
In addition to problems with the operationalization of equilibrium theory, there are also
statistical complications in the analysis of difference scores that are critical to tests of the
theory. First, teachers with different levels of discrepancy are not initially equivalent so
that their comparison on subsequent measures is dubious, even after statistical adjust-
ments for initial differences. Instead, comparisons should be made between experimental
and control groups that are equivalent due to random assignment and, perhaps, matching.
Second, depending on the analysis, there is the implicit assumption that instructor self-
evaluations and student ratings vary along the same scale, but this is unlikely. For ex-
ample, Centra (1979) suggests that instructor self-ratings are systematically higher than
student ratings, particularly when instructors have not previously seen results of students’
evaluations. Also, instructor self-ratings are based on the responses by a single individual
whereas student ratings are based on class-average responses, and so their variances will
probably differ substantially. Third, when a difference score is significantly related to
some other variable, it is further necessary to demonstrate that both components of the dif-
ference score contribute significantly and in the predicted direction. For example, even if
an ideal-actual difference score is significantly related to a subsequent criterion measure,
support for the use of a difference score requires that both the ideal and the actual score
contribute uniquely and in the opposite direction to the prediction of the criterion mea-
sure. If most of the variance due to difference scores can be explained by just one of the
two components, then the theoretical and empirical support for the use of the difference
score is weak.
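The point about difference scores can be illustrated with a small simulation (hypothetical data and variable names): a criterion is regressed first on an ideal-actual difference score and then on its two components, and support for the difference score would require that both components contribute uniquely and with opposite signs.

```python
# Simulation of the difference-score check: the criterion is driven by the
# 'actual' score only, yet a difference score still appears to 'predict' it.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
ideal = 4.5 + 0.2 * rng.standard_normal(n)        # near-ceiling, little variance
actual = 3.5 + 0.6 * rng.standard_normal(n)
criterion = 0.5 * actual + 0.3 * rng.standard_normal(n)

# Model 1: criterion regressed on the ideal-actual difference score alone.
m1 = sm.OLS(criterion, sm.add_constant(ideal - actual)).fit()

# Model 2: criterion regressed on both components separately; support for the
# difference score requires unique contributions of opposite sign.
m2 = sm.OLS(criterion, sm.add_constant(np.column_stack([ideal, actual]))).fit()

print("difference-score model:", m1.params.round(2), round(m1.rsquared, 2))
print("two-component model:   ", m2.params.round(2), round(m2.rsquared, 2))
```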
The minimally appropriate analysis and corresponding design for equilibrium theories
requires a three-way MANOVA (or a conceptually similar regression analysis). The three
independent variables are initial instructor self-ratings (with at least three levels - e.g.,
low, medium and high), initial student ratings (with at least three levels), and the feedback
manipulation (feedback vs. control groups), whereas the dependent variables are sub-
sequent student ratings and instructor self-ratings. The main effect of the feedback man-
ipulation provides a test of the effect of feedback, but not of equilibrium theory. Support
for equilibrium theory requires that the feedback effect interacts with both levels of initial
self-ratings and student ratings. In this analysis the initial equivalence of the feedback and
control groups through random assignment of a sufficiently large number of teachers, and,
perhaps, matching, is critically important as a control for regression effects, ceiling effects,
floor effects, etc. Hence, designs without randomly assigned control groups, and particu-
larly designs with no comparison groups at all, are dubious.
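A minimal sketch of this 3 x 3 x 2 design, using simulated data and a univariate ANOVA on a single dependent variable, is given below; a complete test would treat subsequent student ratings and subsequent self-ratings jointly (e.g., in a MANOVA), and all names and values here are hypothetical.

```python
# Sketch of the 3 (initial self-ratings) x 3 (initial student ratings) x
# 2 (feedback vs. control) design with a univariate ANOVA on one dependent
# variable; the simulated data contain no true effects.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
levels = ["low", "medium", "high"]
rows = []
for self_init in levels:
    for stud_init in levels:
        for feedback in ["control", "feedback"]:
            for _ in range(10):                    # 10 teachers per cell
                rows.append({
                    "self_init": self_init,
                    "stud_init": stud_init,
                    "feedback": feedback,
                    "post_student_rating": 3.5 + 0.4 * rng.standard_normal(),
                })
df = pd.DataFrame(rows)

model = smf.ols(
    "post_student_rating ~ C(feedback) * C(self_init) * C(stud_init)",
    data=df,
).fit()
print(anova_lm(model, typ=2))
# Equilibrium theory is supported by the feedback x self_init and
# feedback x stud_init interactions, not by the feedback main effect alone.
```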
One hypothetical set of results that would support equilibrium theory is illustrated in Figure 7.1, which plots feedback/control differences against initial self-ratings (low, medium, and high), with separate lines for low, medium, and high initial student ratings, in two panels: (a) subsequent student ratings and (b) subsequent self-evaluations, for positively discrepant (self-ratings higher than student ratings), nondiscrepant, and negatively discrepant groups of instructors. For subsequent student ratings, the theory predicts differences between feedback and control groups to be positively related to initial self-ratings and negatively related to initial student ratings. Note that the predicted differences are most positive for the three
positively discrepant groups (marked with a’s in Figure 7.1(a)), most negative for the three
negatively discrepant groups (marked with c’s), and intermediate for the three nondiscrep-
ant groups (marked with b’s). In contrast, for subsequent self-ratings, differences between
feedback and control groups are predicted to be negatively related to initial self-ratings
and positively related to initial student ratings. The predicted differences are most nega-
tive for the three positively discrepant groups (marked with a’s in Figure 7.1(b)), most
positive for the three negatively discrepant groups (marked with c’s), and intermediate for
the three nondiscrepant groups (marked with b’s).
The predictions in Figure 7.1 are consistent with equilibrium theory, but they may be
overly simplistic and other results would also support the theory. In particular, the predic-
tions are based on the assumptions that: (a) feedback by itself has no effect on either sub-
sequent student ratings or self-ratings (i.e., the average of feedback/control differences is
zero); (b) the effects of both initial student ratings and self ratings are linear and of approx-
imately the same magnitude; (c) the pattern of effects on subsequent student ratings will
be the mirror image of the effects on subsequent self-ratings. However, equilibrium theory
would still be supported if: (a) the average feedback/control difference was positive (or
even negative); (b) the effects of initial student ratings and self-ratings were nonlinear (but
still monotonic) or differed in magnitude; (c) the effects for positively discrepant and nega-
tively discrepant groups varied depending on whether the dependent variable was sub-
sequent student ratings or self-ratings. (Earlier it was suggested that changes in self ratings
were much more likely for negatively discrepant groups.)
Empirical Findings
Gage (1963) presented one of the earliest applications of equilibrium theory in a study
of the effect of feedback from students’ evaluations by sixth grade students. Gage posited
that psychological disequilibrium would be created by differences in students’ ratings of
their actual and ideal teachers. A randomly assigned group of feedback teachers was given
the results of both the actual and ideal ratings by their students, and between one and two
months later pupils again provided ratings of their actual and ideal teacher. Since discre-
pancy scores were based on ideal-actual ratings, nearly all the discrepancy scores were
positive (i.e. ideal teacher ratings were higher than actual teacher ratings). Although the
support was not overwhelming, the results showed that “teachers who received feedback
did seem to change in the direction of pupils’ ‘ideals’ more than did teachers from whom
feedback was withheld” (p. 265). However, Gage did not report whether this small effect
was due to information provided by the feedback or the discrepancy between ideal and ac-
tual ratings. Support for equilibrium theory would have required that the feedback/control
differences be systematically related to the size of the discrepancies, but such analyses were
not reported. Hence, the reported results provide some support of the effect of feedback,
but not for equilibrium theory.
Centra (1973) hypothesized that the greater the gap between student ratings and faculty
self-ratings, the more likely that there would be a change in instruction. In order to test this
theory, Centra collected student ratings and instructor self-ratings at mid-semester, gave
feedback to a randomly selected feedback group, and compared feedback and control
groups on end-of-semester ratings by students. Centra found that the groups did not differ
significantly on end-of-term ratings, but that end-of-term ratings did vary significantly ac-
cording to the difference between self-ratings and student ratings. For the positive discre-
pancy group, the size of the discrepancy was significantly more correlated with end-of-
term ratings for feedback than for control teachers on 5 of 17 items. For the negative dis-
crepancy group, the size of the discrepancy was significantly more correlated with end-of-
term ratings for the feedback than for control instructors for only one item.
I interpret Centra’s results to mean that the more substantial the positive discrepancy
between self-ratings and student ratings the more likely feedback is to have a positive ef-
fect on end-of-term ratings (13 of 17 items, 5 statistically significant) and, to a lesser extent,
the more substantial the negative discrepancy the more likely feedback is to have a positive
effect* on end-of-term ratings (11 of 16 items, 1 statistically significant). While the results
for the positive discrepancy group provide weak support for equilibrium theory, the results
of the negative discrepancy group apparently do not. However, this interpretation is some-
what dubious because feedback/control differences were primarily nonsignificant, Centra
did not actually test the implied interaction effects, and the results for his regression analy-
sis were not fully presented. Other reviewers have also had trouble interpreting Centra’s
study. Levinson-Rose and Menges (1981), for example, interpret the results to mean that
“5 of 17 items showed higher scores for the unfavorably discrepant group (those whose
self-ratings were higher than student ratings) compared with the favorably discrepant
group” (p. 413), but Centra made no statistical comparisons between the favorably and un-
favorably discrepant groups. Rotem and Glassman (1979) interpreted the results to mean
that “instructor’s self-ratings and students’ ratings have an interaction effect” (p. 502), but
Centra did not test the interaction effect. In summary Centra’s study may provide weak
support for equilibrium theory, but any interpretations of his reported results are prob-
lematic.
Braunstein et al. (1973) found that changes in student ratings between midterm and
end-of-term were more likely to be positive for their feedback group than for their control
group. After completion of midterm ratings but before the return of the feedback, all in-
structors were asked to complete a survey indicating what they expected their ratings to be
at the end of the term. (All instructors had received student evaluation feedback in previous terms as part of a regular student evaluation program, so that these expectancy rat-
ings may differ from the self-evaluations considered in other studies.) For instructors in
just the feedback group, positive discrepancies - expectancies higher than midterm stu-
dent ratings - were more likely to be associated with positive changes in student ratings
than were negative discrepancies. The findings support the effect of feedback and appa-
rently support equilibrium theory, but an important qualification must be noted. The crit-
ical relation between discrepancies and subsequent student ratings was not compared to
results for the control group. All, or nearly all, the feedback instructors completed the ex-
pectancy ratings, but only one-third of the control group did so, suggesting that completion
of the expectancy ratings was not independent of the feedback manipulation. More impor-
tantly, the relation between discrepancy scores and subsequent student ratings for the con-
trol instructors who did complete the expectancy self-ratings was similar to that observed
in the feedback group (27 of 31 changes in the expected direction vs. 5 of 6 for the control
group). Since the relation between discrepancies and subsequent student ratings may have
* For the negativediscrepancygroup, the regression coefficients are negative for the feedbackgroup and tend
to be less positive or more negative than for the control group. However, since the sign of the discrepancy score
is negative, this results in more positive student ratings.
Students’ Evaluations 347
been similar in feedback and control groups, the apparent support for equilibrium theory
must be interpreted cautiously.
Pambookian (1976) examined changes in student ratings for small groups of positively
discrepant (i.e. self ratings higher than student ratings), negatively discrepant, and nondis-
crepant groups. He reasoned that the larger the discrepancy, whether positive or negative,
the greater the pressure on the instructor to change. However, he interpreted equilibrium
theory to mean that the direction of this change should be towards higher student ratings
for both negatively and positively discrepant groups even though this would apparently in-
crease - not decrease - the size of the discrepancy for the negatively discrepant teachers.
Notwithstanding the logic of this prediction, his results showed that for negatively discrep-
ant teachers the student ratings showed little change or were slightly poorer at the end-of-
term than at midterm. For nondiscrepant teachers, and particularly for positively discrep-
ant teachers, end-of-term ratings tended to be more favorable than midterm ratings. Thus
Pambookian’s results support the interpretation of equilibrium theory posited here, even
though they did not support his predictions. Nevertheless, the lack of control groups and
his extremely small group sizes dictate that his findings must be interpreted cautiously.
Rotem (1978) collected both actual and desirable self-ratings from both instructors and
from students during the term, randomly assigned teachers to feedback and control condi-
tions, and collected parallel sets of ratings from students and instructors at the end of the
term. Rotem found no feedback/control differences on students’ actual end-of-term rat-
ings. In order to test the various forms of equilibrium theory, he conducted separate mul-
tiple regressions on feedback and control groups to determine if the desirable ratings by
students or teachers, or the actual ratings by teachers contributed to the end-of-term rat-
ings by students. However, he found that for both feedback and no-feedback groups, none
of these additional variables was related to actual end-of-term student ratings beyond what
could be explained by actual midterm student ratings. He concluded that: “It appears,
therefore, that discrepancies between ‘actual’ and ‘desirable’ ratings and between stu-
dents’ and instructors’ ratings did not have any functional relationship with the post-test
ratings” (p. 309). The design of the Rotem study is the best of those considered here and
his analysis provides a reasonable test of equilibrium theory, but his results provide no sup-
port for the theory with respect to the effect of discrepancies on subsequent student rat-
ings. Despite the fact that Rotem collected subsequent self-ratings as well as student rat-
ings, he did not test predictions from equilibrium theory with respect to this variable.
Hence, it is possible that instructors altered their self ratings instead of altering their teach-
ing behaviors, and such a finding would be consistent with equilibrium theory.
Equilibrium theory posits that instructors will be differentially motivated to improve the
effectiveness of their teaching depending on the discrepancy between how they view them-
selves and are viewed by students. In terms of equilibrium theory the role of the initial stu-
dent ratings is not to provide diagnostic information to improve teaching, though this
would not be inconsistent with the theory, but rather to provide a basis for establishing a
disequilibrium. Hence, support for the theory does not necessarily support the diagnostic
usefulness of the ratings, nor vice versa. The theory has an intuitive appeal. However, the
theory’s operationalization, and the design and analysis of studies to test the theory, are
complicated and frequently misrepresented. None of the empirical studies of the theory
examined here were fully adequate, the most common errors being: (a) the lack of a con-
trol group; (b) the failure to consider changes in both subsequent student ratings and subsequent self-ratings; and (c)
the inappropriate or incomplete analyses of the results. These methodological shortcom-
ings make statements of support or nonsupport for the theory problematic. Nevertheless,
the studies that appeared to be most methodologically adequate - Centra (1973) and par-
ticularly Rotem (1978) - provide weak or no support for the theory. The relation of
equilibrium theory to student evaluation research has not been fully examined, no well-de-
fined paradigm to test the theory has been established, and empirical tests of the theory are
generally inadequate. However, the deficiencies in previous research can be easily re-
medied and so this is an important area for further research, or, perhaps, even the
reanalysis of previous studies.
Since 1929, and particularly during the last 25 years, a variety of surveys have been con-
ducted to determine the importance of students’ evaluations and other indicators of teach-
ing effectiveness in evaluating total faculty performance in North American universities.
A 1929 survey by the American Association of University Professors (AAUP) asked how
teachers were selected and promoted, and it was noted that, while ‘skill in instruction’ was
cited by a majority of the respondents, “One could wish that it had also been revealed how
skill in instruction was determined, since it remains the most difficult and perplexing sub-
ject in the whole matter of promotion and appointment” (AAUP, 1929, as cited by Remmers & Wykoff, 1929).
Subsequent surveys conducted during the last 25 years found that classroom teaching
was considered to be the most important criterion of total effectiveness, though research
effectiveness may be more important at prestigious, research-oriented universities (for re-
views see Centra, 1979; Leventhal et al., 1981; Seldin, 1975). In the earlier surveys ‘sys-
tematically collected student ratings’ was one of the least commonly mentioned methods
of evaluating teaching, and authors of those studies lamented that there seemed to be no
serious attempt to measure teaching effectiveness. Such conclusions seem to be consistent
with those in the 1929 AAUP survey. More recently, however, survey respondents indi-
cate that chairperson reports, followed by colleague evaluations and student ratings, are
the most common criteria used to evaluate teaching effectiveness, and that student ratings
should be the most important (Centra, 1979). These findings suggest that the importance
and usefulness of student ratings as a measure of teaching effectiveness have increased
dramatically during the last 60 years and particularly in the last two decades.
Despite the strong reservations by some, faculty are apparently in favor of the use of stu-
dent ratings in personnel decisions - at least in comparison with other indicators of teach-
ing effectiveness. For example, in a broad cross-section of colleges and universities, Rich
(1976) reported that 75 per cent of the respondents believed that student ratings should be
used in tenure decisions. Rich also noted that faculty at major research-oriented univer-
sities favored the use of student ratings more strongly than faculty from small colleges.
Rich suggested that this was because teaching effectiveness was the most important deter-
minant in personnel decisions in small colleges, so that student ratings were more threaten-
ing. However, Braskamp et al. (1985) noted that university faculty place more emphasis on
striving for excellence and are more competitive than faculty at small colleges, and these
differences might explain their stronger acceptance of student ratings.
Leventhal et al. (1981) and Salthouse et al. (1978) composed fictitious summaries of fa-
culty performance that systematically varied reports of teaching and research effective-
ness, and also varied the type of information given about teaching (chairperson’s report or
chairperson’s report supplemented by summaries of student ratings). Both studies found
reports of research effectiveness to be more important in evaluating total faculty perform-
ance at research universities, though Leventhal et al. found teaching and research to be of
similar importance across a broader range of institutions. While teaching effectiveness as
assessed by the chairperson’s reports did make a significant difference in ratings of overall
faculty performance, neither study found that supplementing the chairperson’s report with
student ratings made any significant difference. However, neither study considered stu-
dent ratings alone nor even suggested that the two sources of evidence about teaching
effectiveness were independent. Information from the ratings and chairperson’s report
was always consistent so that one was redundant, and it would be reasonable for subjects
in these studies to assume that the chairperson’s report was at least partially based upon
students’ evaluations. These studies demonstrate the importance of reports of teaching ef-
fectiveness, but do not appear to test the impact of student ratings.
Little empirical research has been conducted on the use of ratings by prospective stu-
dents in the selection of courses. UCLA students reported that the Professor/Course
Evaluation Survey was the second most frequently read of the many student publications,
following the daily campus newspaper (Marsh, 1983). Leventhal et al. (1975) found that
students say that information about teaching effectiveness influences their course selec-
tion. Students who select a class on the basis of information about teaching effectiveness
are more satisfied with the quality of teaching than are students who indicate other reasons
(Centra & Creech, 1976; Leventhal et al., 1976). In an experimental field study, Coleman
and McKeachie (1981) presented summaries of ratings of four comparable political science
courses to randomly selected groups of students during preregistration meetings. One of
the courses had received substantially higher ratings, and it was chosen more frequently by
students in the experimental group than by those in the control group. Based upon this li-
mited information, it seems that student ratings are used by students in the selection of in-
structors and courses.
With the possible exception of short-term studies of the effects of midterm ratings,
studies of the usefulness of student ratings are infrequent and often anecdotal. This
is unfortunate, because this is an area of research that can have an important and construc-
tive impact on policy and practice. Important, unresolved issues were identified that are
in need of further research. For example, for administrative decisions students’ evaluations
can be summarized by a single score representing an optimally-weighted average of
specific components, or by the separate presentation of each of the multiple components,
but there is no research to indicate which is most effective. If different components of stu-
dents’ evaluations are to be combined to form a total score, how should the different com-
ponents be weighted? Again there is no systematic research to inform policy makers. De-
bates about whether students’ evaluations have too much or too little impact on administra-
tive decisions are seldom based upon any systematic evidence about the amount of impact
they actually do have. Researchers often indicate that students’ evaluations are used as
one basis for personnel decisions, but there is a dearth of policy research on the policy
practices that are actually employed in the use of student ratings. A plethora of policy
questions exist (for example, how to select courses to be evaluated, the manner in which
rating instruments are administered, who is to be given access to the results, how ratings
from different courses are considered, whether special circumstances exist where ratings
for a particular course can be excluded either a priori or post-hoc, whether faculty have the
right to offer their own interpretation of ratings, etc.) which are largely unexplored despite
the apparently wide use of student ratings. Anecdotal reports often suggest that faculty
find student ratings useful, but there has been little systematic attempt to determine what
form of feedback to faculty is most useful (though feedback studies do support the use of
services by an external consultant) and how faculty actually use the results which they do
receive. Some researchers have cited anecdotal evidence for negative effects of student
ratings (e.g., lowering grading standards or making courses easier) but these are also
rarely documented with systematic research. While students’ evaluations are sometimes
used by students in their selection of courses, there is little guidance about the type of in-
formation which students want and whether this is the same as is needed for other uses of
students’ evaluations. These, and a wide range of related questions about how students’
evaluations are actually used and how their usefulness can be enhanced, provide a rich
field for further research.
CHAPTER 8
For this paradigm Marsh (1981a) selected two North American instruments that mea-
sure multiple dimensions of effective teaching, and whose psychometric properties had
been studied extensively - Frey’s Endeavor and the SEEQ that were described earlier
(also see Appendix). University of Sydney students from 25 academic departments
selected ‘one of the best’ and ‘one of the worst’ lecturers they had experienced, and rated
each on an instrument comprised primarily (55 of 63 items) of SEEQ and Endeavor items.
As part of the study, students were asked to indicate ‘inappropriate’ items, and to select
up to five items that they ‘felt were most important in describing either positive or negative
aspects of the overall learning experience in this instructional sequence’ for each instructor
who they evaluated. Analyses of the results included: (a) a discrimination analysis examin-
ing the ability of items and factors to differentiate between ‘best’ and ‘worst’ instructors;
(b) a summary of ‘not appropriate’ responses; (c) a summary of ‘most important item’ re-
sponses; (d) factor analyses of the SEEQ and/or Endeavor items; and (e) a MTMM analy-
sis of agreement between SEEQ and Endeavor scales.
This applicability paradigm was subsequently used in three other studies: Hayton (1983)
with Australian students in Technical and Further Education (TAFE) schools; Marsh et al.
(1985) with students from the Universidad de Navarra in Spain; and Clarkson (1984) with
students from the Papua New Guinea (PNG) University of Technology. The PNG study
differed from the others in that it was based on a much smaller sample and a much more
limited selection of students and teachers. Clarkson also noted that the PNG setting was
‘non-Western’ and differed substantially from the ‘Western’ settings in most student
evaluation research. The TAFE and Spanish studies differed from the other two in that
students were asked to select ‘a good, an average, and a poor teacher’ instead of a best and
worst teacher, and the five-point response scale used in the original study was expanded
to include nine response categories. The Spanish study also differed in that the items were
first translated into Spanish. (English is the official language in PNG even though 720 dif-
ferent languages are spoken in that country according to Clarkson, 1981.)
All but the TAFE study have been previously published, and so only that one will be
described in further detail. TAFE courses vary from manual trade and craft courses to
para-professional diploma courses in fields such as nursing, and TAFE students are more
varied in terms of age, socioeconomic status, and educational background than typical Au-
stralian university students. Hayton collected his data from eight TAFE institutions and
within each institution, ratings were collected from different academic units so that the
final sample of 218 students (654 sets of ratings) was a broad cross-section of TAFE stu-
dents. Each student was asked to select a good, an average, and a poor teacher, and to limit
selection to classes of at least one term that used mainly a lecture/discussion format. The
major findings of this study, as well as of the other three, are summarized below.
The different groups of instructors selected by the students constitute criterion groups,
and student ratings should be able to differentiate between these groups. In each of the
four studies, all but the Workload/Difficulty items strongly differentiated between the two
groups or the three groups of teachers. In the studies with the three criterion groups -
good, average, and poor instructors - nearly all of the between group variance could be
explained by a linear component. In the Spanish study, for example, as much as two-thirds
of the variance in some items could be explained by differences between the groups of
teachers chosen as good, average and poor, and for most items less than 1 percent of this
could be explained by nonlinear components. Differences in the Workload/Difficulty
items tended to be much smaller, sometimes failing to reach statistical significance, though
the best teachers tended to teach courses that were judged to be more difficult and to have
a heavier workload. The ratings differentiated substantially between the criterion groups
in all four studies.
It is hardly surprising that an instructor selected as being ‘best’ by a student is consis-
tently rated more favorably than one who the student selected as being ‘worst’. The halo
effect produced by this selection process probably exaggerates the differentiation among
groups, but also makes it more difficult to distinguish among the multiple components of
effective teaching, thus undermining the factor analyses and MTMM analyses that are a
major focus of the applicability paradigm. Hence, the differentiation is a double-edged
sword; too little would suggest that the ratings are not valid, but too much would preclude
the possibility of demonstrating the multidimensional nature of the ratings. Because of this
problem, students in the TAFE and Spanish studies selected a good, average, and poor in-
structor, instead of best and worst instructors, and the five-point response scale was ex-
panded to nine categories. However, as will be discussed later, these changes do not ap-
pear to resolve the problem which continues to be a weakness in this paradigm.
Factor Analyses
In each of the four studies, separate factor analyses of the Endeavor and SEEQ items
were conducted in an attempt to replicate previously identified factors. However, these at-
tempts were plagued by problems associated with the analysis of responses by individual
students. As described earlier: (a) for most applications of students’ evaluations the class-
average is the appropriate unit of analysis, and factor analyses of responses by individual
students are typically discouraged; (b) many of the response biases that are idiosyncratic
to individual responses are cancelled out when class-average responses are analyzed; (c)
class-average responses tend to be more reliable than individual student responses; and (d)
for individual students within the same class the instructor is constant, so that the interpre-
tation of any factors based on individual student responses may be problematic when the
ratio of the number of students to the number of different classes is large. It is defensible
to factor analyze responses from a randomly selected single student from each of a large
number of classes and the sampling of students for the applicability paradigm should ap-
proximate this ideal. Thus it is important that students come from a wide variety of backgrounds so
that it is unlikely that many students will evaluate the same class and so that variation in
actual teaching behaviors is substantial. This situation appears to exist for three of the four
studies considered here, but not for the PNG study where all respondents were enrolled
in a second-year mathematics course and evaluated only mathematics instructors. How-
ever, even when this precaution is met, it still provides no control for response biases that
are idiosyncratic to individual responses and these are likely to be particularly large due to
the way students were asked to select criterion instructors. Thus it is likely that instructors
selected as being best/good (worst/poor) will be rated favorably (unfavorably) on all items,
and this will reduce the distinctiveness of the different components of effective teaching.
Factor analyses in these four studies are also plagued by problems inherent in the use of
exploratory factor analysis (see Chapter 2). When the researcher hypothesizes a well-de-
fined factor structure and the results of exploratory factor analyses correspond closely to
the hypothesis, then there is strong support for the hypothesized structure and the con-
struct validity of the ratings. However, when there is not a clear correspondence between
the hypothesized and obtained structure, the interpretation is problematic. Because of the
indeterminacy of exploratory factor analysis, a different empirical solution that fits the
data just as well may exist that more closely corresponds to the hypothesized structure, and
the researcher has no way of determining how well the hypothesized structure actually
does fit the data. Many of these problems can be resolved by the use of confirmatory factor
analysis as described earlier, but the technical difficulties in the use of this approach to fac-
tor analysis and the unavailability of appropriate statistical packages may limit its use in
some of the settings where the applicability paradigm might be most useful.
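For readers who wish to apply the paradigm, the following sketch illustrates the kind of exploratory factor analysis (oblique rotation of individual-student item responses) used in these studies; it relies on the third-party Python package factor_analyzer and on simulated placeholder data, neither of which was part of the original studies.

```python
# Exploratory factor analysis with oblique rotation, using the third-party
# 'factor_analyzer' package and simulated item responses as placeholders.
import numpy as np
from factor_analyzer import FactorAnalyzer

rng = np.random.default_rng(2)
n_students, n_items = 400, 55                 # e.g., the 55 SEEQ/Endeavor items
responses = rng.integers(1, 6, size=(n_students, n_items)).astype(float)

fa = FactorAnalyzer(n_factors=9, rotation="oblimin")   # nine hypothesized factors
fa.fit(responses)

print(fa.loadings_.shape)                     # 55 x 9 rotated pattern matrix
print(fa.get_eigenvalues()[0][:10])           # eigenvalues of the unrotated solution
```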
Despite the potential problems in the use of factor analysis described above, most of the
nine SEEQ factors and the seven Endeavor factors were identified in each of the four
studies. In the Spanish study, separate factor analyses of responses to the SEEQ and the
Endeavor items clearly identified all 16 factors that the two instruments were designed to
measure. In the University of Sydney study, only the Examination/Grading factor from
SEEQ was not well-identified, though a similar factor defined by Endeavor items was
identified and when items from the two instruments were combined in a single factor
analysis an Examination factor was defined by responses to items from both instruments.
In the TAFE study, the Examination/Grading factor from SEEQ was again not well-de-
fined, while the Planning/Organization and Presentation Clarity factors from Endeavor
formed a single factor instead of two separate factors. In the TAFE study, like the Univer-
sity of Sydney study, a factor analysis of responses from both instruments produced a well-
defined Examination factor that was defined by both SEEQ and Endeavor items, but the
combined analysis still did not provide support for the separation of the two Endeavor fac-
tors.
Results from the PNG study provided the poorest approximation to the hypothesized
factor structure of the four studies. While there was reasonable support for seven of the
nine SEEQ factors, items from the Organization/Clarity and Individual Rapport factors did
not define separate factors. There was also reasonable support for four of the seven En-
deavor factors, but a large general factor incorporated items from the Presentation Clar-
ity, Organization/Planning, and Personal Attention factors. Clarkson (1984) examined
solutions from factor analyses in which varying numbers of factors were rotated to diffe-
rent degrees of obliqueness, and selected the best solution. However, the nine-factor sol-
ution for SEEQ and the seven-factor solution for Endeavor were not presented, nor was
other useful information such as the eigenvalues for the unrotated factors. Furthermore,
because of the small and limited sample and the associated problems discussed above, the
use of factor analysis in that study may be dubious.
In summary, not all of the SEEQ and Endeavor factors were identified in each of the
four studies, but every hypothesized factor was identified in at least two of the studies. Par-
ticularly given the problems inherent in the exploratory factor analysis in individual stu-
dent responses and the apparent halo effect produced by the selection process in the
applicability paradigm, I interpret these findings to offer reasonable support for the gener-
ality of factors identified in North American settings to a wide variety of educational con-
texts in different countries.
Inappropriate Items
Results from each of the four studies (see Table 8.1) are similar in that every item is
judged to be appropriate by about 80 percent or more of the students, even though 3 to 7
of the 55 items are judged to be ‘inappropriate’ by more than 10 percent of the students;
the mean number of items judged to be ‘inappropriate’ is 2.4 of the 55 items. The items
most frequently judged to be ‘inappropriate’ come from the Group Interaction, Individual
Rapport, Examination, and Assignment factors from the two instruments. However,
there are also different patterns in the four studies. For example, Group Discussion items
are most frequently seen to be ‘inappropriate’ in the PNG study and much less frequently indi-
cated to be ‘inappropriate’ in the TAFE and Spanish studies; Reading/assignment items
are most frequently seen as ‘inappropriate’ in the TAFE and Spanish studies, but not in the
University of Sydney and PNG studies.
I interpret these findings to mean that nearly all the SEEQ and Endeavor items are seen
to be appropriate by a large majority of the students in all four studies. The mean propor-
tion of items judged to be ‘inappropriate’ was similar in the four studies (means between
0.04 and 0.05) and few items are judged to be ‘inappropriate’ in any of the studies.
After completing a survey, students were asked to select up to five items that were ‘most
important’ in describing the overall learning environment. In three of the four studies, all
but the PNG study, every item was selected by at least some of the students as being ‘most
important’ (Table 8.1). Across all four studies the most frequently nominated items came
from the Enthusiasm, Learning/value, and Organization/clarity factors. However, there
were again some marked differences in the four studies - and the PNG study was particu-
larly different. PNG students, compared to students in the other studies, were much more
likely to nominate Individual Rapport/Personal Attention items and Workload/difficulty
items as most important, and were less likely to nominate Learning/value items. For ex-
ample, while at least 10 percent of the students in each of the other three studies nominated
the item ‘learned something valuable’ as most important and no more than 2.6 percent saw
it as ‘inappropriate’, only 2 percent of the PNG students nominated this item as ‘most im-
portant’ and 10 percent judged it to be inappropriate. Again, it is likely that much of the dis-
tinctive nature of the pattern of responses for the PNG students may reflect the limited
sample and might not generalize to a broader sample of PNG students who evaluated a
wider range of courses. In summary, I interpret these results to indicate that items from
each of the factors measured by SEEQ and Endeavor are seen to be important by students
in each of the different studies.
* Clarkson (1984) summarized his ‘inappropriate’ and ‘most important’ item responses only in terms of percentages instead of actual numbers and percentages as was done in the other studies, and the percentages that he reported were substantially higher than in the other studies. Clarkson, as did the other researchers, asked students to indicate up to five most important items for each instructor whom they evaluated. If the number of most important nominations was divided by the total number of sets of responses (i.e., 2 x the number of students, since each student evaluated two instructors), then the sum of the percentages across all items must be no higher than 500 per cent (each student selected up to five items). In fact, the sum of the percentages reported by Clarkson approaches 1000 per cent, indicating that he must have divided the number of most important responses by the number of students rather than by the number of sets of responses. Since students were able to mark as many items as inappropriate as they wanted, I cannot be absolutely sure that Clarkson computed the percentage of not appropriate item responses in the same way. However, it is inconceivable to me that Clarkson would divide the number of most important responses by the number of students and the number of inappropriate responses by twice the number of students and not report this different basis for determining percentages in his study. The problem occurs when Clarkson directly compares his percentages with those in Marsh (1981a) without first taking into account the fact that he apparently divided the number of responses by the number of students while Marsh divided the number of responses by two times the number of students. It should also be noted that this ambiguity only affects interpretations of results in Table 8.1, since the correlations in Table 8.2 are independent of such a linear transformation and results in Tables 8.3 and 8.4 do not involve ‘most important’ and ‘not appropriate’ responses.
Table 8.1
Paraphrased Items and the Factors They Were Designed to Represent in Marsh's (M) SEEQ Instrument and Frey's (F) Endeavor Instrument in Four Studies (Reprinted with permission from Marsh, 1986)

(The table lists the 55 paraphrased SEEQ and Endeavor items, grouped by the factor each item was designed to represent, together with the percentage of students in each study who judged the item to be 'inappropriate' and the percentage who nominated it as 'most important'.)

Note. SEEQ = Students' Evaluations of Educational Quality; USyd = University of Sydney; Span = Spanish; TAFE = Technical and Further Education; PNG = Papua New Guinea.
Hayton (1983) suggested that items judged to be most ‘inappropriate’ were those least
likely to be nominated as ‘most important’ in the TAFE study, while Clarkson (1984)
argued that the pattern of ‘inappropriate’ and ‘most important’ responses by the PNG
students differed from that by University of Sydney students. More generally, the relative
similarity and differences in the pattern of results in each of the studies may provide an im-
portant way to better understand the educational contexts in the different settings. For ex-
ample, one question that may be answered by this analysis is whether the items seen to be
‘most important’ in one setting are the same items seen to be important in other settings.
In order to explore further such possibilities, the proportions of ‘inappropriate’ and ‘most
important’ responses were correlated with each other for each country.
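The analysis itself is simple; a minimal sketch (with hypothetical proportions) is given below: each of the 55 items contributes one proportion of 'inappropriate' and one proportion of 'most important' responses per study, and the two 55-element vectors are correlated.

```python
# Item-level correlation of 'inappropriate' and 'most important' proportions
# (hypothetical values for the 55 items within one study).
import numpy as np

rng = np.random.default_rng(3)
n_items = 55
prop_inappropriate = rng.uniform(0.0, 0.15, n_items)
prop_most_important = rng.uniform(0.0, 0.25, n_items)

r = np.corrcoef(prop_inappropriate, prop_most_important)[0, 1]
print(f"correlation across the {n_items} items: {r:.2f}")
```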
In all four studies the correlation (Table 8.2) between ‘inappropriate’ and ‘most import-
ant’ responses was negative (Mn r = - 0.33), and was statistically significant for all but the
PNG study. When the percentage of ‘inappropriate’ responses for each item was summed
across the four studies, this value correlated - 0.42 with the sum of the ‘most important’
responses. These results provide empirical support for Hayton’s observation that ‘inap-
propriate’ items are less likely to be selected as being ‘most important’.
Correlations between the proportion of ‘most important’ responses given to each item
were positive (Mn r = 0.55) indicating a high degree of correspondence in the patterns of
responses. However, the average correlation for the first three studies (0.73) was substan-
tially higher than between these three studies and the PNG study (Mn r = 0.37). The pat-
tern of responses for the ‘inappropriate’ responses was somewhat less consistent across the
four studies (Mn r = 0.32), but still represented a moderate level of consistency. The aver-
age correlation among the patterns of ‘inappropriate’ responses was again much higher for
the first three studies (Mn r = 0.56) than between each of these three studies and the PNG
study (Mn r = 0.08). The highest degree of consistency occurred between the TAFE and
Spanish studies for both the patterns of ‘most important’ responses (r = 0.79) and ‘inap-
propriate’ responses (r = 0.71). These findings tend to support Clarkson’s observations
that the pattern of responses by PNG students differed from those in the other three
studies, even though there was a moderately high level of consistency in the pattern of responses in all four studies.

Table 8.2
Consistency of Patterns of Inappropriate and Most Important Responses in the Four Different Studies (Reprinted with permission from Marsh, 1986)

Variable                      1     2     3     4     5     6     7     8     9    10
1. Inappropriate (USyd)
2. Inappropriate (Span)      .50
3. Inappropriate (TAFE)      .47   .71
4. Inappropriate (PNG)       .36  -.04  -.08
5. Most important (USyd)    -.41  -.30  -.41  -.08
6. Most important (Span)    -.27  -.32  -.42   .10   .72
7. Most important (TAFE)    -.22  -.25  -.47   .04   .69   .79
8. Most important (PNG)     -.26  -.20  -.32  -.13   .28   .36   .46
9. Inappropriate total       .84   .78   .71   .45  -.42  -.31  -.30  -.31
10. Most important total    -.37  -.33  -.49  -.04   .84   .86   .88   .67  -.42
M of proportions            .030  .048  .037  .053  .087  .079  .075  .082  .044  .081
SD of proportions           .041  .047  .030  .040  .076  .057  .046  .072  .028  .050

Note. For the purpose of this analysis, each of the 55 items was considered to be a "case," and the 10 "variables" were the proportion of "inappropriate" or "most important" responses for that item in the different studies, as indicated by the row labels. (Data for this analysis are in Table 8.1.) For example, the correlation of .79 between Variables 6 and 7 indicates that an item judged to be most important in the Spanish study was also likely to be judged as most important in the TAFE study, whereas the correlation of -.41 between Variables 1 and 5 indicates that an item judged to be inappropriate in the University of Sydney study was less likely to be seen as most important in that study. Statistically significant correlations are those greater than .26 (p < .05, df = 53, two-tailed) and .36 (p < .01, df = 53, two-tailed). USyd = University of Sydney; Span = Spanish; TAFE = Technical and Further Education; PNG = Papua New Guinea.
While suggestive, the interpretation of these correlations must be made cautiously
because of the differences in the methodologies used in the four studies. The PNG study
differed most drastically from the other three - in terms of the sample of students and
limitations in the choice of criterion instructors - and this is the only study where the pat-
tern of responses to ‘inappropriate’ and ‘most important’ items was not highly consistent.
The TAFE and Spanish studies both asked each student to select three criterion instructors
instead of two, used a nine-point response scale instead of a five-point scale, and considered
a slightly different set of items than the other two studies (though all four contained the 55
items from SEEQ and Endeavor and considered approximately the same total number of
items). Thus, the methodological similarities in these two studies might explain why the
patterns of ‘inappropriate’ and ‘most important’ responses were somewhat more consis-
tent for these two studies than between either of them and the University of Sydney study.
I interpret these results to demonstrate a surprising consistency in the pattern of ‘inapprop-
riate’ and ‘most important’ responses for three of the four studies, and I suggest that the
less consistent results in the PNG study probably reflect methodological differences in that
study rather than differences in the PNG context. I find the consistency in the first three
studies surprising because the three settings seem to vary so much that such a high level of
correspondence was not expected.
Multitrait-Multimethod Analyses
The SEEQ and Endeavor forms were independently designed, and do not even measure
the same number of components of effective teaching. Nevertheless, a content analysis of
the items and factors from each instrument suggests that there is considerable overlap
(Marsh, 1981a). There appears to be a one-to-one correspondence between the first five
SEEQ factors and the first five Endeavor factors that appear in Table 8.1, while the Or-
ganization/Clarity factor from SEEQ seems to combine the Organization/Planning and
Clarity factors from Endeavor. The remaining three SEEQ factors - Instructor En-
thusiasm, Breadth of Coverage, and Assignments - do not appear to parallel any factors
from Endeavor. This hypothesized structure of correspondence between SEEQ and En-
deavor factors is the basis of the MTMM analyses described below.
The set of correlations between scores representing the 9 SEEQ factors and the 7 En-
deavor factors is somewhat analogous to a MTMM matrix in which the different factors
correspond to the multiple traits and the different instruments are the different methods.
Convergent validity refers to the correlation between responses to SEEQ and Endeavor
factors that are hypothesized to be matching as described above, and these appear in boxes
in Table 8.3. Discriminant validity refers to the distinctiveness of the different factors; its
demonstration requires that the highest correlations appear between SEEQ and Endeavor
factors that are designed to measure the same components of effective teaching, and that
other correlations are smaller. Adapting the MTMM criteria developed by Campbell and
Fiske (1959), Marsh (1981a) proposed that:
(1) Agreement on SEEQ and Endeavor factors hypothesized to be matching, the con-
vergent validities, should be substantial (a criterion of convergent validity).
(2) Each convergent validity should be higher than other nonconvergent validities in the
same row and column of the 9 X 7 rectangular submatrix of correlations between SEEQ
and Endeavor factors. This criterion of discriminant validity requires that each convergent
validity is compared with either 13 or 14 other correlations (convergent validities were not
compared to other convergent validities when they appear in the same row or column of
this submatrix).
(3) Each convergent validity should be higher than correlations between that factor and
other SEEQ factors, and between that factor and other Endeavor factors, in the two trian-
gular submatrices. This criterion of divergent validity requires that each convergent valid-
ity is compared to 14 other correlations.
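These comparisons are mechanical enough to be scripted; the sketch below (with hypothetical correlation values and hypothesized matching pairs) applies the first two criteria to a simulated 9 x 7 SEEQ-by-Endeavor submatrix.

```python
# Campbell/Fiske-style checks on a hypothetical 9 x 7 SEEQ-by-Endeavor
# correlation submatrix: each convergent validity is compared with the other
# correlations in its row and column (criterion 2); criterion 3 would add the
# comparisons with the within-instrument (triangular) submatrices.
import numpy as np

rng = np.random.default_rng(4)
n_seeq, n_endeavor = 9, 7
hetero = rng.uniform(0.2, 0.6, size=(n_seeq, n_endeavor))   # hypothetical values

matches = [(k, k) for k in range(5)]        # hypothesized matching factor pairs
for i, j in matches:
    hetero[i, j] = rng.uniform(0.75, 0.90)  # strong convergent validities

for i, j in matches:
    cv = hetero[i, j]
    others = np.concatenate([np.delete(hetero[i, :], j),
                             np.delete(hetero[:, j], i)])
    print(f"SEEQ {i} / Endeavor {j}: convergent validity = {cv:.2f}, "
          f"exceeds all {others.size} row/column correlations: "
          f"{bool((cv > others).all())}")
```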
Each of the four studies presented MTMM matrices, but there are a number of problems
with these previous analyses. First, none of the studies applied the Campbell/Fiske guidelines rigorously; instead, support (or, in the case of the PNG study, lack of support) was argued by inspection. Second, empirically derived factor scores were used to determine the MTMM
matrices in the University of Sydney and Spanish studies, but scale scores consisting of an
unweighted average of responses to items designed to measure each factor were used in the
other two studies. Both Hayton (1983) and Clarkson (1984) argued that factor scores were
inappropriate since the SEEQ and Endeavor factors were not clearly identified in their
studies. In the Spanish study, where all the factors were identified, it is certainly justifiable
to use empirically derived factor scores. However, Marsh’s (1981a) use of factor scores is
somewhat problematic since the Examination factor for SEEQ was not well-defined.
Furthermore, even if the use of factor scores is justified, results based on factor scores may
not be directly comparable to those based on scale scores. Since the raw data for two
studies that used factor scores, the University of Sydney and Spanish studies, were avail-
able to the author, the MTMM analysis was redone with both the scale scores and factor
scores for these two studies.
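The distinction between the two kinds of scores can be illustrated with the following minimal sketch. The simulated responses, the item-to-factor assignment, and the factor score coefficient matrix are placeholders, not the actual SEEQ or Endeavor scoring keys.

    import numpy as np

    rng = np.random.default_rng(1)
    n_students, n_items = 200, 12
    X = rng.normal(loc=6.0, scale=1.0, size=(n_students, n_items))   # placeholder item responses

    # Unweighted scale scores: the mean of the items assigned to each factor
    # (the assignment below is illustrative only).
    assignment = {"factor_1": [0, 1, 2, 3], "factor_2": [4, 5, 6, 7], "factor_3": [8, 9, 10, 11]}
    scale_scores = np.column_stack([X[:, items].mean(axis=1) for items in assignment.values()])

    # Empirically derived factor scores: each standardized item is weighted by a factor
    # score coefficient estimated from a factor analysis of the responses; the coefficient
    # matrix B is a random placeholder here.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    B = rng.normal(scale=0.1, size=(n_items, len(assignment)))
    factor_scores = Z @ B

    print(scale_scores.shape, factor_scores.shape)   # (200, 3) (200, 3)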
Six MTMM matrices appear in Table 8.3 and the results of a formal application of the
Campbell/Fiske criteria as presented above are summarized in Table 8.4. For the Univer-
sity of Sydney and the Spanish studies, MTMM matrices and analyses are considered sepa-
rately for factor scores and scale scores, while only the MTMM matrices based on scale
scores (based on values from the original studies) are available for the other two studies.
In the six analyses, every convergent validity is substantial and statistically significant;
means of the convergent validities vary from 0.72 to 0.87 while the medians vary from 0.81
to 0.89. For both the Spanish study and particularly for the University of Sydney study, the
convergent validities for scale scores are higher than for the factor scores. The substantial
difference between the two sets of scores for the University of Sydney study is attributable
primarily to correlations between the Examination (SEEQ) and Grading (Endeavor) fac-
tors; the two factor scores representing this component correlated 0.34 with each other
while the two scale scores correlated 0.72. It is also important to note that the size of the
mean convergent validity in each study is only modestly smaller than the mean of the coef-
ficient alpha estimates for the same factors (mean alphas vary from 0.82 to 0.89). These
findings clearly satisfy the first Campbell/Fiske criterion for each of the different studies
and indicate that the sizes of the validity coefficients approach a logical upper-bound im-
posed by the reliability of the scores.
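For readers unfamiliar with the statistic, the coefficient alpha estimates referred to here can be computed as in the following minimal sketch; the simulated responses are placeholders rather than SEEQ or Endeavor data.

    import numpy as np

    def coefficient_alpha(items):
        """Cronbach's coefficient alpha for an (n_respondents x n_items) array."""
        k = items.shape[1]
        item_variances = items.var(axis=0, ddof=1)
        total_variance = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1.0)) * (1.0 - item_variances.sum() / total_variance)

    # Simulated placeholder responses: four items reflecting one underlying factor.
    rng = np.random.default_rng(2)
    true_score = rng.normal(size=(300, 1))
    responses = true_score + rng.normal(scale=0.8, size=(300, 4))
    print(round(float(coefficient_alpha(responses)), 2))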
For the second Campbell/Fiske criterion, each of the convergent validities is compared
to other correlations in the rectangular submatrix. The means of the comparison correla-
tions are substantial, but substantially smaller than the size of the corresponding means for
the convergent validities. For the University of Sydney and Spanish studies, the mean of
Table 8.3
Six MTMM Matrices of Correlations Among SEEQ and Endeavor Factors
[The body of the table, a 16 x 16 matrix of correlations among the 9 SEEQ and 7 Endeavor factors with six coefficients reported in each cell, is not legibly reproducible from this copy.]
Note: For each cell of the MTMM matrix there are six correlations representing results from study 1 (factor scores, upper left), study 1 (scale scores, upper right), study 2 (factor scores, middle left), study 2 (scale scores, middle right), study 3 (scale scores, lower left), and study 4 (scale scores, lower right). All coefficients are presented without decimal points. The values in parentheses are coefficient alpha estimates of reliability for the four studies, and the values in boxes are the convergent validity coefficients. See Table 2.1 for the factors and the items used to define each.
the comparison correlations is substantially higher for scale scores than for factor scores.
In each of the six analyses a total of 96 comparisons were made, and the Campbell/Fiske
criterion was satisfied for all but 13 of these 576 comparisons. Ten of the 13 rejections came
in the analysis based on factor scores from the University of Sydney study, and most of
these involved the Examination factor from the SEEQ instrument that was not well de-
fined by the factor analysis. For the corresponding analysis based on scale scores from the
same study, there was only one rejection. The results of each of the four studies provide
strong support for this Campbell/Fiske criterion.
For the third Campbell/Fiske criterion, each of the convergent validities is compared to
other correlations in the two triangular submatrices. The means of these comparison cor-
relations are again substantial, but substantially smaller than the corresponding means for
the convergent validities. Again the mean correlation is substantially higher for scale scores
than for factor scores. In each of the six analyses a total of 98 comparisons were made, and
the Campbell/Fiske criterion was satisfied for all but 15.5 of these 588 comparisons. Again,
a majority of the rejections came from the analysis of factor scores from the University of
Sydney study, and most of these involved the Examination factor. For the corresponding
analysis based on scale scores from the same study, there were only two rejections. The re-
sults of each of the four studies provide strong support for this Campbell/Fiske criterion.
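The third criterion can be checked in the same way against the two within-instrument (triangular) submatrices, as the following minimal sketch illustrates. The within-instrument correlations and convergent validities below are simulated placeholders, and the half-credit treatment of ties follows the note to Table 8.4.

    import numpy as np

    rng = np.random.default_rng(3)

    def placeholder_corr(n):
        """A symmetric placeholder correlation matrix with unit diagonal."""
        A = rng.uniform(0.2, 0.6, size=(n, n))
        R = (A + A.T) / 2.0
        np.fill_diagonal(R, 1.0)
        return R

    R_seeq, R_endeavor = placeholder_corr(9), placeholder_corr(7)     # within-instrument correlations
    matches = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (5, 6)]
    convergent = rng.uniform(0.75, 0.90, size=len(matches))           # placeholder convergent validities

    # Criterion 3: each convergent validity is compared with the correlations between
    # that factor and the other factors from the same instrument (8 + 6 = 14 comparisons
    # per validity, 98 in all); ties count as half a success.
    score = 0.0
    comparisons = 0
    for v, (i, j) in zip(convergent, matches):
        others = [R_seeq[i, k] for k in range(9) if k != i]
        others += [R_endeavor[j, k] for k in range(7) if k != j]
        comparisons += len(others)
        for r in others:
            score += 1.0 if v > r else (0.5 if v == r else 0.0)
    print("criterion 3: %.1f of %d comparisons satisfied" % (score, comparisons))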
Table 8.4
Summaries of Three Campbell-Fiske Criteria for the Analysis of Six MTMM Matrices (Reprinted with permission from Marsh, 1986)

                                            Study 1   Study 1   Study 2   Study 2   Study 3   Study 4
                                            (factor   (scale    (factor   (scale    (scale    (scale
                                             scores)   scores)   scores)   scores)   scores)   scores)
Criterion 1
  Convergent validities
    M                                         .726      .833      .814      .869      .824      .721
    Mdn                                       .810      .890      .810      .882      .850      .820
    Proportion statistically
      significant (out of 7)                 1.000     1.000     1.000     1.000     1.000     1.000
Criterion 2
  Comparison coefficients
    M                                         .335      .468      .401      .541      .573      .509
    Mdn                                       .390      .511      .450      .616      .620      .580
  Proportion of successful
    comparisons (out of 96)                   .896      .990      .995     1.000     1.000      .995
Criterion 3
  Comparison coefficients
    M for SEEQ factors                        .305      .487      .358      .554      .579      .512
    M for Endeavor factors                    .313      .448      .426      .530      .567      .501
    M for SEEQ and Endeavor factors           .307      .473      .383      .545      .574      .508
    Mdn for SEEQ factors                      .380      .605      .390      .647      .665      .580
    Mdn for Endeavor factors                  .310      .531      .450      .605      .580      .570
    Mdn for SEEQ and Endeavor factors         .370      .565      .430      .640      .650      .580
  Proportion of successful comparisons
    SEEQ factors (out of 56)                  .911      .964     1.000      .964      .982      .964
    Endeavor factors (out of 42)              .929      .976     1.000     1.000     1.000      .988
    SEEQ and Endeavor factors (out of 98)     .918      .980     1.000      .980      .990      .975

Note: Coefficient alpha estimates were the same when analyses were conducted on factor scores and unweighted scale scores from the same study. Comparisons conducted to test Criteria 2 and 3 were counted as half a success and half a failure when a convergent validity was equal to a comparison coefficient. SEEQ = Students’ Evaluations of Educational Quality.
These analyses resolve several issues. The MTMM analyses performed on factor scores
apparently are not directly comparable to those performed on scale scores. In the Spanish
study where all the factors were well defined, the use of empirically derived factor scores
instead of scale scores resulted in slightly lower convergent validities and substantially
lower comparison correlations. Since the goal of MTMM studies, and many applications
of students’ evaluations, is to maximally differentiate among different components of ef-
fective teaching, the use of factor scores seems preferable. In the University of Sydney
study, however, even though only one factor was not well-identified, the use of factor
scores resulted in weaker support for the discriminant validity of the ratings. The applica-
tion of the Campbell/Fiske guidelines resulted in a total of 18 rejections based on factor
scores, but only 3 rejections for the scale scores. These findings support the decision by
Hayton and by Clarkson to use scale scores in their MTMM analyses.
Items from two North American instruments designed to measure students’ evaluations
* The applicability paradigm was used in subsequent research conducted in New Zealand (Watkins et al., 1987)
and the results further support the generalizability of the findings described here. In particular, the factor struc-
ture based on SEEQ and Endeavor responses was well defined, convergent validities in the MTMM analysis
(0.73 to 0.94) were high, and every one of the 196 comparisons used to test discriminant validity was satisfied.
The authors concluded that: “This research has provided strong support for the applicability of these American
instruments for evaluating effective teaching at a New Zealand university.”
California or UCLA where SEEQ was developed, or at other comparable North Ameri-
can universities, the results would be similar to those reported here. In particular, due
partly to the design of the applicability paradigm, the correlations between scale scores
used to represent the different factors would be substantial, thus making the differentia-
tion among the factors more difficult.
An important and provocative question raised by these findings is why students’ evalua-
tions are so widely employed at North American universities, but not at universities in
other countries. The conclusions summarized here suggest that teaching effectiveness can
be measured by students’ evaluations in different countries and that perhaps other findings
from research conducted in North America may generalize as well, so lack of applicability is not the reason. A more likely explanation is the political climate in North American universities.
While the study of students’ evaluations has a long history in the United States, it was
only in the late 1960s and 1970s that they became widely used. During this period there was
a marked increase in student involvement in university policy making and also an in-
creased emphasis on ‘accountability’ in North American universities. While similar condi-
tions may have existed in universities in at least some other countries, they did not result
in the systematic collection of students’ evaluations of teaching effectiveness.
While the impetus for the increased use of students’ evaluations of teaching effective-
ness in North American universities may have been the political climate, subsequent re-
search has shown them to be reliable, valid, relatively free from bias, and useful to stu-
dents, lecturers, and administrators. Future research in the use of students’ evaluations in
different countries needs to take three directions. First, in order to test the generality of
the conclusions in this article, the paradigm described here should be replicated in other
countries and particularly in non-Western, developing countries. Second, the validity of
the students’ evaluations must be tested against a wide variety of indicators of effective
teaching in different countries as has been done in North American research. Third,
perhaps employing the instruments used in this study, there is a need to examine and docu-
ment the problems inherent in the actual implementation of broad, institutionally-based
programs of students’ evaluations of teaching effectiveness in different countries.
I offer several suggestions about how this paradigm can be improved in future research.
(1) The sample of students selected for the study should represent a broad cross-section
of students at an institution. The ratio of the number of students to the number of different
classes should be kept as small as possible, and presented as part of the results of the
studies.
(2) When feasible, confirmatory factor analyses should be used instead of exploratory
factor analyses in the investigation of the structure underlying responses to SEEQ and to
Endeavor items - particularly when exploratory factor analyses apparently do not pro-
vide support for the hypothesized factors.
(3) Since this paradigm is designed to be exploratory, researchers should consider items
in addition to those contained in the SEEQ and Endeavor instruments. Some additional
items are presented in the University of Sydney and Spanish studies, but these are de-
signed to supplement SEEQ and Endeavor factors rather than to identify additional com-
ponents of effective teaching that might be uniquely appropriate to students in a particular
setting. However, if the intention of a study is to compare the pattern of results obtained
elsewhere with those described here, then the items should be kept constant.
(4) The MTMM analyses described here should be conducted with both scale scores and
factor scores. If the factor structure is not well defined, then the use of factor scores may
be problematic as appears to be the case in the University of Sydney study. Alternatively,
the factor score coefficients based on the Spanish study can be obtained from me, and these can be used to compute factor scores in future studies (see the sketch following this list). While the use of factor score
coefficients derived from another study may be problematic if a comparable factor struc-
ture cannot be demonstrated, the same criticism may be relevant to the computation of un-
weighted scale scores. Furthermore, the use of these factor scores and scale scores in the
MTMM analysis provides a test of their utility. If, as I predict, the use of these factor scores
provides better support for discriminant validity than do the scale scores, then the use of
the factor scores is preferable. Also, the use of the same set of factor score coefficients will
provide further standardization of results across different studies.
(5) Requesting students to select a good, an average and a poor teacher is probably pre-
ferable to selecting a best and worst teacher, though the halo effect appears to be substan-
tial in both cases. Perhaps it would be even better just to ask students to select three in-
structors without specifying that they are good or poor or average. If this procedure re-
duces the size of the halo effect, then it might offset the loss of opportunity to perform the
discrimination analysis. This suggestion is not offered as a recommendation, but rather as
a possibility worthy of further research.
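The computation proposed in suggestion (4) can be sketched as follows. The coefficient matrix, the standardization statistics, and the dimensions used below are illustrative placeholders, not the actual values that would be obtained from the Spanish study.

    import numpy as np

    rng = np.random.default_rng(4)
    n_items, n_factors = 35, 9                                  # illustrative dimensions only

    # Quantities that would be taken from the reference study:
    B_ref = rng.normal(scale=0.1, size=(n_items, n_factors))    # factor score coefficients
    ref_means = rng.uniform(5.0, 7.0, size=n_items)             # reference item means
    ref_sds = rng.uniform(0.8, 1.2, size=n_items)               # reference item standard deviations

    # New-sample item responses (placeholder), standardized with the reference statistics
    # and weighted by the reference coefficients to give one score per hypothesized factor.
    X_new = rng.normal(loc=6.0, scale=1.0, size=(150, n_items))
    factor_scores_new = ((X_new - ref_means) / ref_sds) @ B_ref
    print(factor_scores_new.shape)                              # (150, 9)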
Important weaknesses in this paradigm have been identified, even though some of these
may be overcome by the recommendations presented above. Thus it is important to
evaluate whether the paradigm is worth pursuing. The applicability paradigm is only in-
tended to serve as a first step in studying the generalizability of North American research
to other countries, or perhaps to nontraditional settings in North America, and it should
be evaluated within this context. The data generated by this paradigm seem to be useful
for testing the applicability of the North American instruments and for refining an instru-
ment that may be more suitable to a particular setting; it is clearly preferable to adopting, untried, an instrument that has been validated only in a very different setting. The paradigm is
also cost-effective and practical in that: it requires only a modest amount of effort for data
collection and data entry; it can be conducted with volunteer subjects; it does not require
the identification of either the student completing the forms or the instructor being
evaluated so that it is likely to be politically acceptable in most settings; and it is ideally
suited to being conducted by students in a research seminar (which was the basis of the
University of Sydney study). Furthermore, it may serve as an initial motivation to pursue
further research and the eventual utilization of students’ evaluations of teaching effective-
ness. Alternative approaches to studying the applicability of student ratings will require
researchers to administer forms to all the students in a sufficiently broad cross-section of
classes so that class-average responses can be used in subsequent analyses (i.e., at least 100
classes based on responses by several thousand students). While such a large-scale effort would
be useful, there will be many situations in which it may not be feasible and the applicability
paradigm may still provide a useful pilot study that precedes the larger study even when
such a large-scale study is possible.
The focus of research summarized here has been on the similarity of the results from di-
verse academic settings in order to support the applicability of SEEQ and Endeavor in
these different settings. However, the comparison of patterns of ‘inappropriate’ and ‘most
important’ item responses may also provide an important basis for inferring how the learn-
ing environments in diverse settings differ. Even in the four studies that have been
conducted, the paradigm has been heuristic in that researchers have speculated about the
unique characteristics of students in each particular setting to account for some of the find-
ings, even though there appears to be a surprising consistency in the responses. The valid-
ity of such speculations is likely to improve once methodological problems have been re-
solved and there is a large enough data base with which to compare the findings of new
studies.
CHAPTER 9
Research described in this article demonstrates that student ratings are clearly mul-
tidimensional, quite reliable, reasonably valid, relatively uncontaminated by many vari-
ables often seen as sources of potential bias, and are seen to be useful by students, faculty,
and administrators. However, the same findings also demonstrate that student ratings may
have some halo effect, have at least some unreliability, have only modest agreement with
some criteria of effective teaching, are probably affected by some potential sources of bias,
and are viewed with some skepticism by faculty as a basis for personnel decisions. It should
be noted that this level of uncertainty probably also exists in every area of applied psychol-
ogy and for all personnel evaluation systems. Nevertheless, the reported results clearly de-
monstrate that a considerable amount of useful information can be obtained from student
ratings; useful for feedback to faculty, useful for personnel decisions, useful to students in
the selection of courses, and useful for the study of teaching. Probably, students’ evalua-
tions of teaching effectiveness are the most thoroughly studied of all forms of personnel
evaluation, and one of the best in terms of being supported by empirical research.
Despite the generally supportive research findings, student ratings should be used cauti-
ously, and there should be other forms of systematic input about teaching effectiveness,
particularly when they are used for tenure/promotion decisions. However, while there is
good evidence to support the use of students’ evaluations as one indicator of effective
teaching, there are few other indicators of teaching effectiveness whose use is systemati-
cally supported by research findings. Based upon the research reviewed here, other alter-
natives which may be valid include the ratings of previous students and instructor self-
evaluations, but each of these has problems of its own. Alumni surveys typically have very
low response rates and are still basically student ratings. Faculty self-evaluations may be
valid for some purposes, but probably not when tenure/promotion decisions are to be
based upon them. (Faculty should, however, be encouraged to have a systematic voice in
the interpretation of their student ratings.) Consequently, while extensive lists of alterna-
tive indicators of effective teaching are proposed (e.g., Centra, 1979), few are supported
by systematic research, and none are as clearly supported as students’ evaluations of teach-
ing.
Why then, if student ratings are reasonably well supported by research findings, are they
so controversial and so widely criticized? Several suggestions are obvious. University
faculty have little or no formal training in teaching, yet find themselves in a position where
their salary or even their job may depend upon their classroom teaching skills. Any proce-
dure used to evaluate teaching effectiveness would prove threatening and would therefore be criticized. The threat is exacerbated by the realization that there are no clearly defined
criteria of effective teaching, particularly when there continues to be considerable debate
about the validity of student ratings. Interestingly, measures of research productivity, the
other major determinant of instructor effectiveness, are not nearly so highly criticized, de-
spite the fact that the actual information used to represent them in tenure decisions is often
quite subjective and there are serious problems with the interpretation of the objective
measures of research productivity that are used. As demonstrated in this overview, much
of the debate is based upon ill-founded fears about student ratings, but the fears still per-
sist. Indeed, the popularity of two of the more widely employed paradigms in student
evaluation research, the multisection validity study and the Dr Fox study, apparently
stems from an initial notoriety produced by claims to have demonstrated that student rat-
ings are invalid. This occurred even though the two original studies (the Rodin & Rodin
1972 study, and the Naftulin et al., 1973 study) were so fraught with methodological weak-
nesses as to be uninterpretable. Perhaps this should not be so surprising in the academic
profession where faculty are better trained to find counter explanations for a wide variety
of phenomena than to teach. Indeed, the state of affairs has resulted in a worthwhile and
healthy scrutiny of student ratings and has generated a considerable base of research upon
which to form opinions about their worth. However, the bulk of this research supports their continued use, while also calling for further scrutiny.
REFERENCES
Abrami, P. C. (1985). Dimensions of effective college instruction. Review of Higher Education, 8, 211-228.
Abrami, P. C., Leventhal, L., & Perry, R. P. (1979). Can feedback from student ratings help to improve
teaching? Proceedings of the 5th International Conference on Improving University Teaching. London.
Abrami, P. C., Dickens, W. J., Perry, R. P., & Leventhal, L. (1980). Do teacher standards for assigning grades
affect student evaluations of instruction? Journal of Educational Psychology, 72, 107-118.
Abrami, P. C., Leventhal, L., & Dickens, W. J. (1981). Multidimensionality of student ratings of instruction.
Instructional Evaluation, 6(1), 12-17.
Abrami, P. C., Leventhal, L., & Perry, R. P. (1982a). Educational seduction. Review of Educational Research,
52, 446-464.
Abrami, P. C., Perry, R. P., & Leventhal, L. (1982b). The relationship between student personality
characteristics, teacher ratings, and student achievement. Journal of Educational Psychology, 74, 111-125.
Aleamoni, L. M. (1978). The usefulness of student evaluations in improving college teaching. Instructional
Science, 7, 95-105.
Aleamoni, L. M. (1981). Student ratings of instruction. In J. Millman (Ed.), Handbook of Teacher Evaluation.
(pp. 110-145). Beverley Hills, CA: Sage.
Aleamoni, L. M. (1985). Peer evaluation of instructors and instruction. Instructional Evaluation, 8.
Aleamoni, L. M., & Hexner, P. Z. (1980). A review of the research on student evaluation and a report on the
effect of different sets of instructions on student course and instructor evaluation. Instructional Science, 9,
67-84.
Aleamoni, L. M., & Yimer, M. (1973). An investigation of the relationship between colleague rating, student
rating, research productivity, and academic rank in rating instructional effectiveness. Journal of Educational
Psychology, 64,274-277.
American Psychological Association. (1985). The standards for educational and psychological testing.
Washington, D. C.: Author.
Aubrect, J. D. (1981). Reliability, validity and generalizability of student ratings of instruction (IDEA Paper
No. 6). Kansas State University: Center for Faculty Evaluations and Development. (ERIC Document
Reproduction Service No. ED 213 296.)
Barr, A. S. (1948). The measurement and prediction of teaching efficiency: A summary of investigations. Journal
of Experimental Education, 16, 203-283.
Bausell, R. B., & Bausell, C. R. (1979). Student ratings and various instructional variables from a within-
instructor perspective. Research in Higher Education, 11, 167-177.
Benton, S. E. (1982). Rating College Teaching: Criterion Validity Studies of Student Evaluation of Instruction
Instruments. (ERIC ED 221 147.)
Biglan, A. (1973). The characteristics of subject matter in different academic areas. Journal of Applied
Psychology, 57,195-203.
Blackburn, R. T. (1974). The meaning of work in academia. In J. I. Doi (Ed.), Assessing faculty effort. (a special
issue of New Directions for Institutional Research.) San Francisco: Jossey-Bass.
Blackburn, R. T., & Clark, M. J. (1975). An assessment of faculty performance: Some correlations between
administrators, colleagues, student, and self-ratings. Sociology of Education, 48,242-256.
Brandenburg, D. C., Slinde, J. A., & Batista, E. E. (1977). Student ratings of instruction: Validity and
normative interpretations. Journal of Research in Higher Education, 7, 67-78.
Brandenburg, G. C., & Remmers, H. H. (1927). A rating scale for instructors. Educational Administration and
Supervision, 13, 399-406.
Braskamp, L. A., Caulley, D., & Costin, F. (1979). Student ratings and instructor self-ratings and their
relationship to student achievement. American Educational Research Journal, 16, 295-306.
Braskamp, L. A., Ory, J. C., & Pieper, D. M. (1981). Student written comments: Dimensions of instructional
quality. Journal of Educational Psychology, 73,65-70.
Braskamp, L. A., Brandenburg, D. C., & Ory, J. C. (1985). Evaluating teaching effectiveness: A practical guide.
Beverley Hills, CA: Sage.
Braunstein, D. N., Klein, G. A., & Pachla, M. (1973). Feedback expectancy shifts in student ratings of college
faculty. Journal of Applied Psychology, 58,254-258.
Brown, D. L. (1976). Faculty ratings and student grades: A university-wide multiple regression analysis. Journal
of Educational Psychology, 68,573-578.
Burton, D. (1975). Student ratings - an information source for decision-making. In R. G. Cope (Ed.), Informa-
tion for decisions in postsecondary education: Proceedings of the 15th Annual Forum (pp. 83-86). St. Louis,
MO: Association for Institutional Research.
Cadwell, J., & Jenkins, J. (1985). Effects of semantic similarity on student ratings of instructors. Journal of
Educational Psychology, 77, 383-393.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod
matrix. Psychological Bulletin, 56, 81-105.
Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research on
teaching. In N. L. Gage (Ed.), Handbook of research on teaching (pp. 171-246). Chicago: Rand McNally.
Centra, J. A. (1973). Two studies on the utility of student ratings for instructional improvement. I The effective-
ness of student feedback in modifying college instruction. II Self-ratings of college teachers: a comparison with
student ratings. (SIR Report No. 2). Princeton, NJ: Educational Testing Service.
Centra, J. A. (1975). Colleagues as raters of classroom instruction. Journal of Higher Education, 46, 327-337.
Centra, J. A. (1977). Student ratings of instruction and their relationship to student learning. American
Educational Research Journal, 14, 17-24.
Centra, J. A. (1979). Determining faculty effectiveness. San Francisco: Jossey-Bass.
Centra, J. A. (1981). Research report: Research productivity and teaching effectiveness. Princeton, NJ: Educa-
tional Testing Service.
Centra, J. A. (1983). Research productivity and teaching effectiveness. Research in Higher Education, 18,
379-389.
Centra, J. A., and Creech, F. R. (1976). The relationship between student, teacher, and course characteristics
and student ratings of teacher effectiveness (Project Report 76-1). Princeton, NJ: Educational Testing Service.
Chacko, T. I. (1983). Student ratings of instruction: A function of grading standards. Educational Research
Quarterly, 8,21-25.
Clarkson, P. C. (1984). Papua New Guinea students’ perceptions of mathematics lecturers. Journal of Educa-
tional Psychology, 76, 1386-1395.
Cohen, P. A. (1980). Effectiveness of student-rating feedback for improving college instruction: a meta-analysis.
Research in Higher Education, 13,321-341.
Cohen, P. A. (1981). Student ratings of instruction and student achievement: A meta-analysis of multisection
validity studies. Review of Educational Research, 51,281-309.
Cohen, P. A., & McKeachie, W. J. (1980). The role of colleagues in the evaluation of college teaching.
Improving College and University Teaching, 28, 147-154.
Coleman, J., and McKeachie, W. J. (1981). Effects of instructor/course evaluations on student course selection.
Journal of Educational Psychology, 73,224-226.
Cooley, W. W., & Lohnes, P. R. (1976). Evaluation research in education. New York: Irvington.
Costin, F., Greenough, W. T., & Menges, R. J. (1971). Student ratings of college teaching: Reliability, validity
and usefulness. Review of Educational Research, 41,511-536.
Cranton, P. A., & Hillgarten, W. (1981). The relationships between student ratings and instructor: Implications
for improving teaching. Canadian Journal of Higher Education, 11,73-81.
Creager, J. A. (1950). A multiple-factor analysis of the Purdue rating scale for instructors. Purdue University
Studies in Higher Education, 70, 75-96.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (pp. 443-507).
Washington, D.C.: American Council on Education.
Cronbach, L. J. (1984). Essentials of psychological testing. New York, NY: Harper & Row.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52,
281-302.
de Wolf, W. A. (1974). Student ratings of instruction in post secondary institutions: A comprehensive annotated
bibliography of research reported since 1968 (Vol. 1). Seattle: University of Washington Educational Assess-
ment Center.
Dowell, D. A., & Neal, J. A. (1982). A selective view of the validity of student ratings of teaching. Journal of
Higher Education, 53,51-62.
Doyle, K. O. (1975). Student evaluation of instruction. Lexington, MA: D. C. Heath.
Doyle, K. O. (1983). Evaluating teaching. Lexington, MA: Lexington Books.
Doyle, K. O., & Crichton, L. I. (1978). Student, peer, and self-evaluations of college instructors. Journal of
Educational Psychology, 70, 815-826.
Doyle, K. O., & Weber, P. L. (1978). Self-evaluations of college instructors. American Educational Research
Journal, 15,467-475.
Drucker, A. J., & Remmers, H. H. (1950). Do alumni and students differ in their attitudes towards instructors?
Purdue University Studies in Higher Education, 70, 62-74.
Drucker, A. J., & Remmers, H. H. (1951). Do alumni and students differ in their attitudes toward instructors?
Journal of Educational Psychology, 70, 129-143.
Dunkin, M. J., & Barnes, J. (1986). Research on teaching in higher education. In M. C. Wittrock (Ed.), Hand-
book of research on teaching (3rd Edition) (pp. 754-777). New York: Macmillan.
Elliot, D. N. (1950). Characteristics and relationships of various criteria of college and university teaching.
Purdue University Studies in Higher Education, 70,5-61.
Endeavor Information Systems (1979). The Endeavour instructional rating system: User’s handbook. Evanston,
IL: Author.
Faia, M. (1976). Teaching and research: Rapport or misalliance. Research in Higher Education, 4, 235-246.
Feldman, K. A. (1976a). Grades and college students’ evaluations of their courses and teachers. Research in
Higher Education, 4, 69-111.
Feldman, K. A. (1976b). The superior college teacher from the student’s view. Research in Higher Education,
5,243-288.
Feldman, K. A. (1977). Consistency and variability among college students in rating their teachers and courses.
Research in Higher Education, 6, 223-274.
Feldman, K. A. (1978). Course characteristics and college students’ ratings of their teachers and courses: What
we know and what we don’t. Research in Higher Education, 9, 199-242.
Feldman, K. A. (1979). The significance of circumstances for college students’ ratings of their teachers and
courses. Research in Higher Education, 10, 149-172.
Feldman, K. A. (1983). The seniority and instructional experience of college teachers as related to the evalua-
tions they receive from their students. Research in Higher Education, 18,3-124.
Feldman, K. A. (1984). Class size and students’ evaluations of college teacher and courses: A closer look.
Research in Higher Education, 21, 45-116.
Feldman, K. A. (1986). The perceived instructional effectiveness of college teachers as related to their person-
ality and attitudinal characteristics: A review and synthesis. Research in Higher Education, 24, 139-213.
Firth, M. (1979). Impact of work experience on the validity of student evaluations of teaching effectiveness.
Journal of Educational Psychology, 71,726-730.
Frankhouser, W. M. (1984). The effects of different oral directions as to disposition of results on student
ratings of college instruction. Research in Higher Education, 20,367-374.
French-Lazovik, G. (1981). Peer review: Documentary evidence in the evaluation of teaching. In J. Millman
(Ed.), Handbook of Teacher Evaluation (pp. 73-89). Beverly Hills, CA: Sage.
Frey, P. W. (1978). A two dimensional analysis of student ratings of instruction. Research in Higher Education,
9,69-91.
Frey, P. W. (1979). The Dr Fox Effect and its implications. Instructional Evaluation, 3, 1-5.
Frey, P. W., Leonard, D. W., & Beatty, W. W. (1975). Student ratings of instruction: Validation research.
American Educational Research Journal, 12, 327-336.
Gage, N. L. (1963a). A method for “improving” teaching behavior. Journal of Teacher Education, 14,261-266.
Gage, N. L. (1963b). Handbook on research on teaching. Chicago: Rand McNally.
Gage, N. L. (1972). Teacher effectiveness and teacher education. Palo Alto, CA: Pacific Books.
Gilmore, G. M., Kane, M. T., & Naccarato, R. W. (1978). The generalizability of student ratings of instruction:
Estimates of teacher and course components. Journal of Educational Measurement, 15,1-13.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage.
Guthrie, E. R. (1954). The evaluation of teaching: A progress report. Seattle: University of Washington Press.
Harry, J., & Goldner, N. S. (1972). The null relationship between teaching and research. Sociology of
Education, 45,47-60.
Hayton, G. E. (1983). An investigation of the applicability in Technical and Further Education of a student
evaluation of teaching instrument. An unpublished thesis submitted in partial fulfillment of the requirement
for transfer to the Master of Education (Honours) degree. Department of Education, University of Sydney.
Hildebrand, M. (1972). How to recommend promotion of a mediocre teacher without actually lying. Journal of
Higher Education, 43,44-62.
Hildebrand, M., Wilson, R. C., & Dienst, E. R. (1971). Evaluating university teaching. Berkeley: Center for
Research and Development in Higher Education, University of California, Berkeley.
Hines, C. V., Cruickshank, D. R., & Kennedy, J. J. (1982). Measures of teacher clarity and their relationships to
student achievement and satisfaction. Paper presented at the annual meeting of the American Educational
Research Association, New York.
Holmes, D. S. (1972). Effects of grades and disconfirmed grade expectancies on students’ evaluations of their
instructors. Journal of Educational Psychology, 63, 130-133.
Howard, G. S., & Bray, J. H. (1979). Use of norm groups to adjust student ratings of instruction: A warning.
Journal of Educational Psychology, 71, 58-63.
Howard, G. S., Conway, C. G., & Maxwell, S. E. (1985). Construct validity of measures of college teaching
effectiveness. Journal of Educational Psychology, 77, 187-196.
Howard, G. S., and Maxwell, S. E. (1980). The correlation between student satisfaction and grades: A case of
mistaken causation? Journal of Educational Psychology, 72, 810-820.
Howard, G. S., & Maxwell, S. E. (1982). Do grades contaminate student evaluations of instruction. Research in
Higher Education, 16, 175-188.
Howard, G. S., & Schmeck, R. R. (1979). Relationship of changes in student motivation to student evaluations
of instruction. Research in Higher Education, 10,305-315.
Hoyt, D. P., Owens, R. E., and Grouling, T. (1973). Interpreting student feedback on instruction and courses.
Manhattan, KS: Kansas State University.
Jauch, L. R. (1976). Relationships of research and teaching. Implications for faculty evaluation. Research in
Higher Education, 5, 1-13.
Kane, M. T., & Brennan, R. L. (1977). The generalizability of class means. Review of Educational Research,
47, 276-292.
Kane, M. T., Gilmore, G. M., & Crooks, T. J. (1976). Student evaluations of teaching: The generalizability of
class means. Journal of Educational Measurement, 13, 171-183.
Krathwohl, D. R., Bloom, B. S., & Masia, B. B. (1964). Taxonomy of educational objectives: The classification
of educational goals. Handbook 2: Affective domain. New York: McKay.
Kulik, J. A., & McKeachie, W. J. (1975). The evaluation of teachers in higher education. In Kerlinger (Ed.),
Review of research in education (Vol. 3). Itasca, IL: Peacock.
Land, M. L. (1979). Low-inference variables of teacher clarity: Effects on student concept learning. Journal of
Educational Psychology, 71,795-799.
Land, M. L. (1985). Vagueness and clarity in the classroom. In T. Husen & T. N. Postlethwaite (Eds.), Inter-
national encyclopedia of education: Research and studies. Oxford: Pergamon Press.
Land, M. L., & Combs, A. (1981). Teacher clarity, student instructional ratings, and student performance. Paper
presented at the annual meeting of the American Educational Research Association, Los Angeles.
Land, M. L., & Smith, L. R. (1981). College student ratings and teacher behavior: An experimental study.
Journal of Social Studies Research, 5,19-22.
Larson, J. R. (1979). The limited utility of factor analytic techniques for the study of implicit theories in student
ratings of teacher behavior. American Educational Research Journal, 16, 201-211.
Leventhal, L., Abrami, P. C., Perry, R. P., & Breen, L. J. (1975). Section selection in multi-section courses:
Implications for the validation and use of student rating forms. Educational and Psychological Measurement,
35, 885-895.
Leventhal, L., Perry, R. P., Abrami, P. C., Turcotte, S. J. C., & Kane, B. (1981). Experimental investigation of
tenure/promotion in American and Canadian universities. Paper presented at the Annual Meeting of the
American Educational Research Association, Los Angeles.
Leventhal, L., Turcotte, S. J. C., Abrami, P. C., & Perry, R. P. (1983). Primacy/recency effects in student
ratings of instruction: A reinterpretation of gain-loss effects. Journal of Educational Psychology, 75, 692-704.
Levinson-Rose, J., & Menges, R. J. (1981). Improving college teaching: A critical review of research. Review of
Educational Research, 51, 403-434.
Linsky, A. S., & Straus, M. A. (1975). Student evaluations, research productivity, and eminence of college
faculty. Journal of Higher Education, 46,89-102.
Long, J. S. (1983). Confirmatory factor analysis: A preface to LISREL. Beverly Hills, CA: Sage.
Marsh, H. W. (1976). The relationship between background variables and students’ evaluations of instructional
quality. OIS 769. Los Angeles, CA: Office of Institutional Studies, University of Southern California.
Marsh, H. W. (1977). The validity of students’ evaluations: Classroom evaluations of instructors independently
nominated as best and worst teachers by graduating seniors. American Educational Research Journal, 14,
441-447.
Marsh, H. W. (1979). Annotated bibliography of research on the relationship between quality of teaching and
quality of research in higher education. Los Angeles: Office of Institutional Studies, University of Southern
California.
Marsh, H. W. (1980a). Research on students’ evaluations of teaching effectiveness. Instructional Evaluation, 4,
5-13.
Marsh, H. W. (1980b). The influence of student, course and instructor characteristics on evaluations of university
teaching. American Educational Research Journal, 17, 219-237.
Marsh, H. W. (1981a). Students’ evaluations of tertiary instruction: Testing the applicability of American
surveys in an Australian setting. Australian Journal of Education, 25, 177-192.
Marsh, H. W. (1981b). The use of path analysis to estimate teacher and course effects in student ratings of
instructional effectiveness. Applied Psychological Measurement, 6, 47-60.
Marsh, H. W. (1982a). Factors affecting students’ evaluations of the same course taught by the same instructor on
different occasions. American Educational Research Journal, 19, 485-497.
Marsh, H. W. (1982b). SEEQ: A reliable, valid, and useful instrument for collecting students’ evaluations of
university teaching. British Journal of Educational Psychology, 52, 77-95.
Marsh, H. W. (1982c). Validity of students’ evaluations of college teaching: A multitrait-multimethod analysis.
Journal of Educational Psychology, 74, 264-279.
Marsh, H. W. (1983). Multidimensional ratings of teaching effectiveness by students from different academic
settings and their relation to student/course/instructor characteristics. Journal of Educational Psychology,
75, 150-166.
Marsh, H. W. (1984a). Experimental manipulations of university motivation and their effect on examination
performance. British Journal of Educational Psychology, 54,206-213.
Marsh, H. W. (1984b). Students’ evaluations of university teaching: Dimensionality, reliability, validity,
potential biases, and utility. Journal of Educational Psychology, 76, 707-754.
Marsh, H. W. (1985). Students as evaluators of teaching. In T. Husen & T. N. Postlethwaite (Eds.), International
Encyclopedia of Education: Research and Studies. Oxford: Pergamon Press.
Marsh, H. W. (1986). Applicability paradigm: Students’ evaluations of teaching effectiveness in different
countries. Journal of Educational Psychology, 78, 465-473.
Marsh, H. W., Barnes, J., & Hocevar, D. (1985). Self-other agreement on multi-dimensional self-concept
ratings: Factor analysis and multitrait-multimethod analysis. Journal of Personality and Social Psychology,
49,1360-1377.
Marsh, H. W., & Cooper, T. L. (1981). Prior subject interest, students’ evaluations, and instructional effective-
ness. Multivariate Behavioral Research, 16,82-104.
Marsh, H. W., Fleiner, H., & Thomas, C. S. (1975). Validity and usefulness of student evaluations of instruc-
tional quality. Journal of Educational Psychology, 67,833-839.
Marsh, H. W., & Groves, M. A. (1987). Students’ evaluations of teaching effectiveness and implicit theories:
A critique of Cadwell and Jenkins. Journal of Educational Psychology.
Marsh, H. W., & Hocevar, D. (1983). Confirmatory factor analysis of multitrait-multimethod matrices.
Journal of Educational Measurement, 20, 231-248.
Marsh, H. W., & Hocevar, D. (1984). The factorial invariance of students’ evaluations of college teaching.
American Educational Research Journal, 21, 341-366.
Marsh, H. W., & Overall, J. U. (1979a). Long-term stability of students’ evaluations: A note on Feldman’s
“Consistency and variability among college students in rating their teachers and courses.” Research in
Higher Education, 10, 139-147.
Marsh, H. W., & Overall, J. U. (1979b). Validity of students’ evaluations of teaching: A comparison with
instructor self evaluations by teaching assistants, undergraduate faculty, and graduate faculty. Paper presented
at Annual Meeting of the American Educational Research Association, San Francisco (ERIC Document
Reproduction Service No. ED 177 205).
Marsh, H. W., & Overall, J. U. (1980). Validity of students’ evaluations of teaching effectiveness: Cognitive and
affective criteria. Journal of Educational Psychology, 72, 468-475.
Marsh, H. W., & Overall, J. U. (1981). The relative influence of course level, course type, and instructor on
students’ evaluations of college teaching. American Educational Research Journal, 18, 103-112.
Marsh, H. W., Overall, J. U., & Kesler, S. P. (1979a). Class size, students’ evaluations, and instructional
effectiveness. American Educational Research Journal, 16,57-70.
Marsh, H. W., Overall, J. U., & Kesler, S. P. (1979b). Validity of student evaluations of instructional
effectiveness: A comparison of faculty self-evaluations and evaluations by their students. Journal of Educa-
tional Psychology, 71,149-160.
Marsh, H. W., Overall, J. U., & Thomas, C. S. (1976). The relationship between students’ evaluations of
instruction and expected grade. Paper presented at the Annual Meeting of the American Educational Research
Association, San Francisco. (ERIC Document Reproduction Service No. ED 1X-130.)
Marsh, H. W., Touron, J., & Wheeler, B. (1985). Students’ evaluations of university instructors: The applic-
ability of American instruments in a Spanish setting. Teaching and Teacher Education: An International
Journal of Research and Studies, 1, 123-138.
Marsh, H. W., & Ware, J. E. (1982). Effects of expressiveness, content coverage, and incentive on multi-
dimensional student rating scales: New interpretations of the Dr. Fox Effect. Journal of Educational
Psychology, 74, 126-134.
Maslow, A. H., & Zimmerman, W. (1956). College teaching ability, scholarly activity, and personality. Journal
of Educational Psychology, 47, 185-189.
McKeachie, W. J., & Solomon, D. (1958). Student ratings of instructors: A validity study. Journal of Educational
Research, 51, 379-382.
McKeachie, W. J. (1963). Research on teaching at the college and university level. In N. L. Gage (Ed.),
Handbook of research on teaching (pp. 1118-1172). Chicago: Rand McNally.
McKeachie, W. J. (1973). Correlates of student ratings. In A. L. Sockloff (Ed.), Proceedings: The First Invita-
tional Conference on Faculty Effectiveness as Evaluated by Students (pp. 213-218). Measurement and Research
Center, Temple University.
McKeachie, W. J. (1979). Student ratings of faculty: A reprise. Academe, 384-397.
McKeachie, W. J., Lin, Y-G., Daugherty, M., Moffett, M. M., Neigler, C., Nork, J., Walz, M., & Baldwin, R.
(1980). Using student ratings and consultation to improve instruction. British Journal of Educational
Psychology, 50, 168-174.
Meier, R. S., & Feldhusen, J. F. (1979). Another look at Dr Fox: Effect of stated purpose for evaluation,
lecturer expressiveness, and density of lecture content on student ratings. Journal of Educational Psychology,
71, 339-345.
Menges, R. J. (1973). The new reporters: Students rate instruction. In C. R. Pace (Ed.), Evaluating Learning
and Teaching. San Francisco: Jossey-Bass.
Miller, M. T. (1971). Instructor attitudes toward, and their use of, student ratings of teachers. Journal of
Educational Psychology, 62,235-239.
Morsh, J. E., Burgess, G. G., & Smith, P. N. (1956). Student achievement as a measure of instructional
effectiveness. Journal of Educational Psychology, 47, 79-88.
Murray, H. G. (1976). How do good teachers teach? An observational study of the classroom teaching behaviors of
Social Science professors receiving low, medium and high teacher ratings. Paper presented at the Canadian
Psychological Association meeting.
Murray, H. G. (1980). Evaluating university teaching: A review of research. Toronto, Canada: Ontario Con-
federation of University Faculty Associations.
Murray, H. G. (1983). Low inference classroom teaching behaviors and student ratings of college teaching
effectiveness. Journal of Educational Psychology, 71, 856-865.
Naftulin, D. H., Ware, J. E., & Donnelly, F. A. (1973). The Doctor Fox lecture: A paradigm of educational
seduction. Journal of Medical Education, 48,630-635.
Neumann, L., & Neumann, Y. (1985). Determinants of students’ instructional evaluation: A comparison of
four levels of academic areas. Journal of Educational Research, 78, 152-158.
Office of Evaluation Services (1972). Student Instructional Rating System responses and student characteristics.
SIRS Research Report No. 4. Michigan State University: Author.
Ory, J. C., & Braskamp, L. A. (1981). Faculty perceptions of the quality and usefulness of three types of
evaluative information. Research in Higher Education, 15, 271-282.
Ory, J. C., Braskamp, L. A., & Pieper, D. M. (1980). Congruency of student evaluative information collected
by three methods. Journal of Educational Psychology, 72, 181-185.
Overall, J. U., & Marsh, H. W. (1979). Midterm feedback from students: Its relationship to instructional
improvement and students’ cognitive and affective outcomes. Journal of Educational Psychology, 71,
856-865.
Overall, J. U., & Marsh, H. W. (1980). Students’ evaluations of instruction: A longitudinal study of their
stability. Journal of Educational Psychology, 72,321-325.
Overall, J. U., & Marsh, H. W. (1982). Students’ evaluations of teaching: An update. American Association for
Higher Education Bulletin, 35(4), 9-13.
Palmer, J., Carliner, G., & Romer, T. (1978). Learning, leniency, and evaluations. Journal of Educational
Psychology, 70, 855-863.
Pambookian, H. S. (1976). Discrepancy between instructor and student evaluations of instruction: Effect on
instructor. Instructional Science, 5, 63-75.
Parducci, A. (1968). The relativism of absolute judgment. Scientific American, 219, 84-90.
Perry, R. P., Abrami, P. C., & Leventhal, L. (1979). Educational seduction: The effect of instructor expressive-
ness and lecture content on student ratings and achievement. Journal of Educational Psychology, 71, 107-116.
Perry, R. P., Abrami, P. C., Leventhal, L., & Check, J. (1979). Instructor reputation: An expectancy relation-
ship involving student ratings and achievement. Journal of Educational Psychology, 71, 776-787.
Powell, R. W. (1977). Grades, learning, and student evaluation of instruction. Research in Higher Education, 7,
195-205.
Peterson, C., & Cooper, S. (1980). Teacher evaluation by graded and ungraded students. Journal of Educational
Psychology, 72,682-685.
Pohlman, J. T. (1972). Summary of research on the relationship between student characteristics and student
evaluations of instruction at Southern Illinois University, Carbondale. Technical Report 1.I-72. Counselling
and Testing Center, Southern Illinois University, Carbondale.
Pohlman, J. T. (1975). A multivariate analysis of selected class characteristics and student ratings of instruction.
Multivariate Behavioral Research, 10, 81-91.
Price, J. R., & Magoon, A. J. (1971). Predictors of college student ratings of instructors (Summary). Proceedings
of the 79th Annual Convention of the American Psychological Association, I, 523-524.
Protzman, M. I. (1929). Student rating of college teaching. School and Society, 29, 513-515.
Remmers, H. H. (1928). The relationship between students’ marks and student attitudes towards instructors.
School and Society, 28,759-760.
Remmers, H. H. (1930). To what extent do grades influence student ratings of instructors? Journal of Educa-
tional Research, 21, 314-316.
Remmers, H. H. (1934). Reliability and halo effect on high school and college students’ judgments of their
teachers. Journal of Applied Psychology, 18,619-630.
Remmers, H. H. (1931). The equivalence of judgements and test items in the sense of the Spearman-Brown
formula. Journal of Educational Psychology, 22,66-71.
Remmers, H. H. (1958). On students’ perceptions of teachers’ effectiveness. In McKeachie (Ed.), The appraisal
of teaching in large universities (pp. 17-23). Ann Arbor: The University of Michigan.
Remmers, H. H. (1963). Teaching methods in research on teaching. In N. L. Gage (Ed.), Handbook of research
on teaching. Chicago: Rand McNally.
Remmers, H. H., & Brandenburg, G. C. (1927). Experimental data on the Purdue Rating Scale for instructors.
Educational Administration and Supervision, 13, 519-527.
Remmers, H. H., & Elliot, D. N. (1949). The Indiana college and university staff evaluation program. School
and Society, 70, 168-171.
Remmers, H. H., & Wykoff, G. S. (1929). Student ratings of college teaching-A reply. School and Society,
30,232-234.
Remmers, H. H., Martin, F. D., & Elliot, D. N. (1949). Are student ratings of their instructors related to their
grades? Purdue University Studies in Higher Education, 44, 17-26.
Rich, H. E. (1976). Attitudes of college and university faculty toward the use of student evaluation. Educational
Research Quarterly, 3, 17-28.
Rodin, M., & Rodin, B. (1972). Student evaluations of teachers. Science, 177,1164-1166.
Rosenshine, B. (1971). Teaching behaviors and student achievement. London: National Foundation for
Educational Research.
Rosenshine, B., & Furst, N. (1973). The use of direct observation to study teaching. In R. M. W. Travers (Ed.),
Second handbook of research on teaching. Chicago: Rand McNally.
Rotem, A. (1978). The effects of feedback from students to university instructors: An experimental study.
Research in Higher Education, 9, 303-318.
Rotem, A., & Glassman, N. S. (1979). On the effectiveness of students’ evaluative feedback to university
instructors. Review of Educational Research, 49,497-511.
Rushton, J. P., Brainerd, C. J., & Pressley, M. (1983). Behavioral development and construct validity: The
principle of aggregation. Psychological Bulletin, 94,18-38.
Ryan, J. J., Anderson, J. A., & Birchler, A. B. (1980). Student evaluation: The faculty respond. Research in
Higher Education, 12, 317-333.
Salthouse, T. A., McKeachie, W. J., & Lin, Y. G. (1978). An experimental investigation of factors affecting
university promotion decisions. Journal of Higher Education, 49, 177-183.
Schwab, D. P. (1976). Manual for the Course Evaluation Instrument. School of Business, University of Wisconsin at Madison.
Scott, C. S. (1977). Student ratings and instructor-defined extenuating circumstances. Journal of Educational Psychology, 69, 744-747.
Scriven, M. (1981). Summative teacher evaluation. In J. Millman (Ed.), Handbook of teacher evaluation (pp. 244-271). Beverly Hills, CA: Sage.
Seldin, P. (1975). How colleges evaluate professors: Current policies and practices in evaluating classroom teaching performance in liberal arts colleges. Croton-on-Hudson, New York: Blythe-Pennington.
Shavelson, R. J., Hubner, J. J., & Stanton, G. C. (1976). Self-concept: Validation of construct interpretations. Review of Educational Research, 46, 407-441.
Smalzried, N. T., & Remmers, H. H. (1943). A factor analysis of the Purdue Rating Scale for instructors. Journal of Educational Psychology, 34, 363-367.
Smith, M. L., & Glass, G. V. (1980). Meta-analysis of research on class size and its relationship to attitudes and instruction. American Educational Research Journal, 17, 419-433.
Snyder, C. R., & Clair, M. (1976). Effects of expected and obtained grades on teacher evaluation and attribution of performance. Journal of Educational Psychology, 68, 75-82.
Stalnaker, J. M., & Remmers, H. H. (1928). Can students discriminate traits associated with success in teaching? Journal of Applied Psychology, 12, 602-610.
Stevens, J. J., & Aleamoni, L. M. (1985). The use of evaluative feedback for instructional improvement: A longitudinal perspective. Instructional Science, 13, 285-304.
Stumpf, S. A., Freedman, R. D., & Aguanno, J. C. (1979). A path analysis of factors often found to be related
to student ratings of teaching effectiveness. Research in Higher Education, 11, 111-123.
Vasta, R., & Sarmiento, R. F. (1979). Liberal grading improves evaluations but not performance. Journal of Educational Psychology, 71, 207-211.
Voght, K. E., & Lasher, H. (1973). Does student evaluation stimulate improved teaching? Bowling Green, OH:
Bowling Green University (ERIC ED 013 371).
Ward, M. D., Clark, D. C., & Harrison, G. V. (1981, April). The observer effect in classroom visitation. Paper presented at the annual meeting of the American Educational Research Association, Los Angeles.
Ware, J. E., & Williams, R. G. (1975). The Dr. Fox effect: A study of lecturer expressiveness and ratings of instruction. Journal of Medical Education, 50, 149-156.
Ware, J. E., & Williams, R. G. (1977). Discriminant analysis of student ratings as a means of identifying lecturers who differ in enthusiasm or information giving. Educational and Psychological Measurement, 37, 627-639.
Ware, J. E., & Williams, R. G. (1979). Seeing through the Dr. Fox effect: A response to Frey. Instructional Evaluation, 3, 6-10.
Ware, J. E., & Williams, R. G. (1980). A reanalysis of the Doctor Fox experiments. Instructional Evaluation,
4, 15-18.
Warrington, W. G. (1973). Student evaluation of instruction at Michigan State University. In A. L. Sockloff
(Ed.), Proceedings: The first invitational conference on faculty effectiveness as evaluated by students (pp.
164-182). Philadelphia: Measurement and Research Center, Temple University.
Watkins, D., Marsh, H. W., & Young, D. (1987). Evaluating tertiary teaching: A New Zealand perspective. Teaching and Teacher Education: An International Journal of Research and Studies, 3, 41-53.
Webb, W. B., & Nolan, C. Y. (1955). Student, supervisor, and self-ratings of instructional proficiency. Journal of Educational Psychology, 46, 42-46.
Whitely, S. E., & Doyle, K. O. (1976). Implicit theories in student ratings. American Educational Research Journal, 13, 241-253.
Williams, R. G., & Ware, J. E. (1976). Validity of student ratings of instruction under different incentive conditions: A further study of the Dr. Fox effect. Journal of Educational Psychology, 68, 48-56.
Williams, R. G., & Ware, J. E. (1977). An extended visit with Dr. Fox: Validity of student ratings of instruction after repeated exposures to a lecturer. American Educational Research Journal, 14, 449-457.
Wykoff, G. S. (1929). On the improvement of teaching. School and Society, 29, 58-59.
Yunker, J. A. (1983). Validity research on student evaluations of teaching effectiveness: Individual versus class mean observations. Research in Higher Education, 19, 363-379.
APPENDIX
Included in this appendix are five instruments used to collect students' evaluations of teaching effectiveness: SEEQ (see earlier description), Endeavor (Frey et al., 1975; also see earlier description), SIRS (Warrington, 1973), SDT (a slight modification of the instrument described by Hildebrand et al., 1971; also see earlier description), and ICES (Office of Instructional Resources, 1978; also see Brandenburg et al., 1985). A comparison of the five instruments reveals that they differ in:
(1) Factor Structure. Four of the five - SEEQ, Endeavor, SIRS, and SDT - consist of
items that have a well-defined factor structure.
(2) Open-Ended Questions. Four of the five - SEEQ, SIRS, SDT, and ICES - specifically ask for open-ended comments, though the nature of the open-ended questions varies.
(3) Overall Ratings. Three of the five - SEEQ, SDT, and ICES - include overall ratings of the course and of the instructor.
(4) Background/Demographic Information. Three of the five - SEEQ, SIRS, and ICES - ask for some background/demographic information in addition to students' ratings of instructional effectiveness.
(5) Instructor-Selected Items. Three of the five - SEEQ, ICES, and SIRS - provide for additional items. The ICES, unlike the other instruments, is a 'cafeteria-style' instrument (Braskamp et al., 1985) that consists primarily of items selected by the individual instructor or the academic unit, and only the three overall rating items are constant. Up to 23 additional items are selected from a catalog of over 450 items, and a computer-controlled printer overprints these items on the ICES form. Thus, each instrument is 'tailor-made' and the one illustrated here is just one example. SEEQ and SIRS provide for additional items, up to 20 and 7 respectively, to be selected/created by the individual instructor or the academic unit. (These differences are summarized in the sketch following this list.)
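The comparison in points (1) to (5) can also be recorded as a small data structure, which makes the feature overlap across the five instruments easy to tabulate. The following Python sketch is purely illustrative: the feature labels and the names FEATURES, EXTRA_ITEM_LIMITS, and feature_table are ad hoc shorthand introduced only for this sketch, not part of any of the instruments; the entries simply restate the points above, including the additional-item limits of 20, 7, and 23 for SEEQ, SIRS, and ICES.

```python
# Illustrative sketch only: encodes the five-way comparison of the rating
# instruments (points 1-5 above) as a simple data structure. All names here
# are ad hoc and do not come from the instruments themselves.

INSTRUMENTS = ["SEEQ", "Endeavor", "SIRS", "SDT", "ICES"]

FEATURES = {
    "well_defined_factor_structure": {"SEEQ", "Endeavor", "SIRS", "SDT"},
    "open_ended_questions":          {"SEEQ", "SIRS", "SDT", "ICES"},
    "overall_ratings":               {"SEEQ", "SDT", "ICES"},
    "background_information":        {"SEEQ", "SIRS", "ICES"},
    "instructor_selected_items":     {"SEEQ", "ICES", "SIRS"},
}

# Maximum number of additional instructor-selected items, where applicable.
EXTRA_ITEM_LIMITS = {"SEEQ": 20, "SIRS": 7, "ICES": 23}


def feature_table():
    """Map each instrument to the list of features it includes."""
    return {
        instrument: [name for name, members in FEATURES.items() if instrument in members]
        for instrument in INSTRUMENTS
    }


if __name__ == "__main__":
    for instrument, features in feature_table().items():
        limit = EXTRA_ITEM_LIMITS.get(instrument)
        extra = f" (up to {limit} instructor-selected items)" if limit else ""
        print(f"{instrument}: {', '.join(features)}{extra}")
```

Running the sketch simply prints, for each instrument, the features listed above and its additional-item limit where one applies.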
[Facsimiles of the rating instruments are reproduced here; the following captions accompany them.]
The Student Instructional Rating System (SIRS) form (reprinted with permission).
The Instructor and Course Evaluation System (ICES) form (reprinted with permission).
The Student Description of Teaching (SDT) questionnaire. (Note: The current version of the SDT is nearly the same as that first described by Hildebrand et al., 1971, and has been widely adopted, including this computer-scorable version used in schools of Business Administration.)