
Int. J. Educ. Res. Vol. 11, pp. 253-388, 1987. 0883-0355/87
Printed in Great Britain. All rights reserved. Copyright © 1987 Pergamon Journals Ltd

STUDENTS’ EVALUATIONS OF UNIVERSITY


TEACHING: RESEARCH FINDINGS,
METHODOLOGICAL ISSUES, AND DIRECTIONS FOR
FUTURE RESEARCH

HERBERT W. MARSH

The University of Sydney, Australia

CONTENTS

ABSTRACT 255

ACKNOWLEDGEMENTS 256

CHAPTER 1 INTRODUCTION 257


A Historical Perspective 257
The Purposes of Students’ Evaluations of Teaching Effectiveness 259
A Construct Validation Approach to the Study of Students’ Evaluations 260

CHAPTER 2 DIMENSIONALITY OF STUDENTS’ EVALUATIONS 263


The Need for a Multidimensional Approach 263
The Selection of the Dimensions to be Measured 264
Empirically Defined Dimensions - The Use of Factor Analysis 265
Higher-Order Factors 267
The Use of Confirmatory Factor Analysis in Future Research 271
Implicit Theories and Semantic Similarities: Factors or Antifactors 272
Summary of the Dimensionality of Students’ Evaluations 274

CHAPTER 3 RELIABILITY, STABILITY AND GENERALIZABILITY 275


Reliability 275
Long Term Stability 275
Generalizability - Teacher and Course Effects 277
Student Written Comments - Generality Across Different Response Form 279
Directions for Future Research 281

CHAPTER 4 VALIDITY 285


The Construct Validation Approach to Validity 285
Student Learning-The Multisection Validity Study 286
Instructor Self-Evaluations 293
Ratings by Peers 294
Behavioral Observations by External Observers 296
Research Productivity 299
The Use of Other Criteria of Effective Teaching 300
The Use of Multiple Criteria of Effective Teaching in the Same Study 301
Summary and Implications of Validity Research 302
Directions for Further Research 303


CHAPTER 5 RELATIONSHIP TO BACKGROUND CHARACTERISTICS: THE WITCH HUNT


FOR POTENTIAL BIASES IN STUDENTS’ EVALUATIONS 305
Large-Scale Empirical Studies 307
A Construct Approach to the Study of Bias
Methodological Weaknesses Common to Many Studies
Theoretical Definitions of Bias 310
Approaches to Exploring for Potential Biases 311
Effects of Specific Background Characteristics Emphasized
in SEEQ Research 313
Reason for Taking a Course 322
Effects of Specific Background Characteristics Not Emphasized in
SEEQ Research 322
Characteristics Not Examined in SEEQ Research 324

CHAPTER 6 ‘DR FOX’ STUDIES 331


The Dr Fox Paradigm 331
Reanalyses and Meta-Analyses 331
Interpretations, Implications and Problems 333
Other Variables Considered in Dr Fox-Type Studies 335

CHAPTER 7 UTILITY OF STUDENT RATINGS 337


Changes in Teaching Effectiveness Due to Feedback from Student Ratings 338
Equilibrium Theory 342
Usefulness in Tenure/Promotion Decisions 348
Usefulness in Student Course Selection 349
Summary of Studies of the Utility of Student Ratings 349

CHAPTER 8 THE USE OF STUDENT RATINGS IN DIFFERENT COUNTRIES: THE


APPLICABILITY PARADIGM 351
The Applicability Paradigm 351
Differentiation Between Different Groups of Instructors 352
Factor Analyses 353
Inappropriate and Most Important Items 354
Patterns of Relations in Inappropriate and Most Important Responses 357
Multitrait-Multimethod Analyses 359
Implications and Future Research 363
Recommendations for Future Research with the Applicability Paradigm 365

CHAPTER 9 OVERVIEW, SUMMARY AND IMPLICATIONS 369

REFERENCES 371

APPENDIX FIVE EXAMPLES OF COMPUTER SCORED INSTRUMENTS USED TO


COLLECT STUDENTS’ EVALUATIONS OF TEACHING EFFECTIVENESS 379
ABSTRACT

The purposes of this monograph are to provide an overview of findings and of research
methodology used to study students’ evaluations of teaching effectiveness, and to examine
implications and directions for future research. The focus of the investigation is on the
author’s own research that has led to the development of the Students’ Evaluations of Edu-
cational Quality (SEEQ) instrument, but it also incorporates a wide range of other re-
search. Based upon this overview, class-average student ratings are: (1) multidimensional;
(2) reliable and stable; (3) primarily a function of the instructor who teaches a course
rather than the course that is taught; (4) relatively valid against a variety of indicators of
effective teaching; (5) relatively unaffected by a variety of variables hypothesized as po-
tential biases; and (6) seen to be useful by faculty as feedback about their teaching, by stu-
dents for use in course selection, and by administrators for use in personnel decisions. In
future research a construct validation approach should be employed in which it is recog-
nized that: effective teaching and students’ evaluations designed to reflect it are multi-
dimensional/multifaceted; there is no single criterion of effective teaching; and tentative
interpretations of relationships with validity criteria and with potential biases must be
scrutinized in different contexts and must examine multiple criteria of effective teaching.

ACKNOWLEDGMENTS

I would like to thank Wilbert McKeachie, Kenneth Feldman, Peter Frey, Kenneth Doyle,
Robert Menges, John Centra, Peter Cohen, Michael Dunkin, Samuel Ball, Jennifer
Barnes, Les Leventhal, John Ware, Philip Abrami, Larry Braskamp, Robert Wilson,
and Dale Brandenburg for their comments on earlier research that is described in this arti-
cle, and also Jesse Overall and Dennis Hocevar, co-authors on many earlier studies. I
would also like to gratefully acknowledge the support and encouragement which Wilbert
McKeachie has consistently given to me, and the invaluable assistance offered by Kenneth
Feldman in both personal correspondence and the outstanding set of review articles which
he has authored. Nevertheless, the interpretations expressed in this article are those of the
author, and may not reflect those of others whose assistance has been acknowledged.

CHAPTER 1

INTRODUCTION

A Historical Perspective

Historical Patterns of Student Evaluation Research

Doyle (1983) noted that in Antioch in AD 350 any father who was dissatisfied with the in-
struction given to his son could examine the boy, file a formal complaint to a panel of
teachers and laymen, and ultimately transfer his son to another teacher if the teacher could
be shown to have neglected his duties, while Socrates was executed in 399 BC for having
corrupted the youth of Athens with his teachings. In the twentieth century there were few
studies of students’ evaluation before the 1920s, but student evaluation programs were in-
troduced at Harvard, the University of Washington, Purdue University, and the Univer-
sity of Texas and other institutions in the mid-1920s. Barr (1948) noted 138 studies of
teaching efficiency written between 1905 and 1948, and de Wolf (1974) summarized 220
studies of students’ evaluations of teaching effectiveness that were written between 1968
and 1974. The term ‘students’ evaluations of teacher performance’ was first introduced in
the ERIC system in 1976; between 1976 and 1984 there were 1055 published and unpub-
lished studies under this heading and approximately half of those have appeared since
1980. Doyle (1983) noted a cyclical pattern in research activity with a sharp increase in the
decade beginning with 1927 and the most intense activity during the 1970s. Student evalua-
tion research is largely a phenomenon of the 1970s and 1980s but it has a long and import-
ant history dating back to the pioneering research conducted by H. H. Remmers.

H. H. Remmers - The Father of Student Evaluation Research

H. H. Remmers initiated the first systematic research program in this field and might be
noted as the father of research into students’ evaluations of teaching effectiveness. In 1927
Remmers (Brandenburg & Remmers, 1927) published his multitrait Purdue scale and
proposed three principles for the design of such instruments: (a) that the list of traits must
be short enough to avoid halo effects and carelessness due to student boredom; (b) that the
traits must be those agreed upon by experts as the most important; and (c) that the traits
must be susceptible to student observation and judgement. During this early period he
examined issues of reliability, validity, halo effects, bias (Remmers & Brandenburg,
1927), the relation between course grades and student ratings (Remmers, 1928) and the
discriminability and relative importance of his multiple traits (Stalnaker & Remmers,
1928) and engaged in a series of discussion articles (Remmers & Wykoff, 1929) clarifying

his views on students’ evaluations against suggested criticisms (Wykoff, 1929; Protzman,
1929). Some of his substantive and methodological contributions in subsequent research
were:
(1) Remmers (1931,1934) was the first to recognize that the reliability of student ratings
should be based on agreement among different students of the same teacher, and that the
reliability of the class-average response varies with the number of students in a way that is
analogous to the relation between test length (i.e. number of items) and test reliability in
the Spearman-Brown equation (a standard form of this relation is sketched just after this list).
(2) Remmers (Smalzried & Remmers, 1943; see also Creager, 1950) published the first
factor analysis of class-average student responses to his 10 traits and identified two higher-
order traits that he called Empathy and Professional Maturity.
(3) In 1949 Remmers (Remmers et al., 1949; see also Elliot, 1950) found that when
students were randomly assigned to different sections of the same course, section-average
achievement corrected for initial aptitude was positively correlated with class-average rat-
ings of instructional effectiveness, thus providing one basis for the multiple-section validity
paradigm that has been so important in student evaluation research (see Chapter 4).
(4) Drucker and Remmers (1950, 1951) found that the ratings of alumni ten years after
their graduation from Purdue were substantially correlated with ratings by current stu-
dents for those instructors who had taught both groups. Alumni and current students also
showed substantial agreement on the relative importance they placed on the 10 traits from
the Purdue scale.
(5) In the first large-scale, multi-institutional study Remmers (Remmers & Elliot, 1949;
Elliot, 1950) correlated student responses from 14 colleges and universities with a wide
variety of background/demographic variables (for example, sex, rank, scholastic ability,
year in school). While some significant relationships were found, the results suggested that
background/demographic characteristics had little effect on student ratings.
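For clarity, the relation Remmers invoked in point (1) can be written in the standard Spearman-Brown form (a textbook statement, not reproduced from the monograph): if $\bar{r}$ denotes the reliability of a single student's rating, estimated from the agreement among students rating the same teacher, and $n$ is the number of students in the class, then the reliability of the class-average rating is

$$\rho_{\text{class}} = \frac{n\,\bar{r}}{1 + (n - 1)\,\bar{r}},$$

so that, for example, a modest single-rater reliability of $\bar{r} = .20$ yields a class-average reliability of about .86 when $n = 25$.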

In the early period of his research Remmers (for example, Brandenburg & Remmers,
1927; Remmers & Wykoff, 1929) was cautious about the use and interpretation of student
ratings, indicating that his scale: (a) was not designed to serve as a measure of teaching ef-
fectiveness but rather to assess student opinions as one aspect of the teacher-student rela-
tionship and the learning process; (b) should not be used for promotional decisions though
future research may show it to be valid for such purposes; and (c) should only be used vol-
untarily though limited observations suggest that instructors may profit by its use. How-
ever, the accumulated support from more than two decades of research led Remmers to
stronger conclusions such as: (a) “there is warrant for ascribing validity to student ratings
not merely as measures of student attitude toward instructors for which validity and relia-
bility are synonymous but also as measured by what students actually learn of the content
of the course” (Remmers et al., 1949, p. 26); (b) “undergraduate judgment as a criterion
of effective teaching . . . can no longer be waved aside as invalid and irrelevant”
(Remmers, 1950, p. 4); (c) “Teachers at all levels of the educational ladder have no real
choice as to whether they will be judged by those whom they teach . . . The only real
choice any teacher has is whether he wants to know what these judgments are and whether
he wants to use this knowledge in his teaching procedures” (Remmers, 1950, p. 4); (d) “As
higher education is organized and operated, students are pretty much the only ones who
observe and are in a position to judge the teacher’s effectiveness” (1958, p. 20); (e)
“Knowledge of student opinions and attitudes leads to improvement of the teacher’s per-
sonality and educational procedures” (1958, p. 21); and (f) “No research has been pub-
lished invalidating the use of student opinion as one criterion of teacher effectiveness”
(1958, p. 22).
While there have been important methodological advances in student evaluation
research since Remmers’ research, his studies provided the foundation for many of these
advances. Furthermore, while Remmers’ conclusions have been widely challenged in hun-
dreds of studies, results to be summarized in this monograph support his conclusions.

The Purposes of Students’ Evaluations of Teaching Effectiveness

The most widely noted purposes for collecting students’ evaluations of teaching effec-
tiveness are variously to provide the following.
(1) Diagnostic feedback to faculty about the effectiveness of their teaching that will be
useful for the improvement of teaching.
(2) A measure of teaching effectiveness to be used in administrative decision making.
(3) Information for students to use in the selection of courses and instructors.
(4) A measure of the quality of the course, to be used in course improvement and cur-
riculum development.
(5) An outcome or a process description for research on teaching.
While the first purpose is nearly universal, the next three are not. At many universities
systematic student input is required before faculty are even considered for promotion,
while at others the inclusion of students’ evaluations is optional or not encouraged at all.
Similarly, the results of students’ evaluations are sold to students in some university
bookstores at the time when students select their courses, while at other universities the re-
sults are considered to be strictly confidential.
Students’ ratings of teaching effectiveness from instruments such as SEEQ
are probably not very useful for the fourth purpose, course evaluation and curriculum de-
velopment. These ratings are primarily a function of the instructor who teaches the course
rather than the course that is being taught (see Chapter 3), and thus provide little informa-
tion that is specific to the course. This conclusion should not be interpreted to mean that
student input is not valuable for such purposes, but only that the student ratings collected
for purposes of evaluating teaching effectiveness are not the appropriate source of student
input into questions of course evaluations.
The fifth purpose of student ratings, their use in research on teaching, has not been sys-
tematically examined, and this is unfortunate. Research on teaching involves at least three
major questions (Gage, 1963; Dunkin & Barnes, 1986): How do teachers behave? Why do
they behave as they do? and What are the effects of their behavior? Dunkin & Barnes go
on to conceptualize this research in terms of: (a) process variables (global teaching
methods and specific teaching behaviors); (b) presage variables (characteristics of
teachers and students); (c) context variables (substantive, physical and institutional envi-
ronments); and (d) product variables (student academic/professional achievement, at-
titudes and evaluations). Similarly, Braskamp et al. (1985) distinguish between input
(what students and teachers bring to the classroom), process (what students and teachers
do), and product (what students do or accomplish), though Doyle (1975) uses the same
three terms somewhat differently. Student ratings are important both as a process-de-
scription measure and as a product measure. This dual role played by student ratings, as a
process description and as a product of the process. is also inherent in their use as diagnos-
tic feedback, as input for tenure promotion decisions, and as information for students to
use in course selection. However, Dunkin and Barnes’ presage and context variables, or
Braskamp’s input variables, also have a substantial impact on both the process and the pro-
duct, and herein lies a dilemma. Students’ evaluations of teaching effectiveness, as either
a process or a product measure, should reflect the valid effects of presage and context mea-
sures. Nevertheless, since many presage and context variables may be beyond the control
of the instructor, such influences may represent a source of unfairness in the evaluation of
teaching effectiveness - particularly when students’ evaluations are used for personnel
decisions (see Chapter 5).
Another dilemma in student evaluation research is the question of whether information
for any one purpose is appropriate for other purposes, and particularly whether the same
data should be used for diagnostic feedback to faculty and for administrative decision mak-
ing. Many (for example, Abrami et al., 1981; Braskamp et al., 1985; Centra, 1979; Doyle,
1983) argue that high inference, global, summative ratings are appropriate for administra-
tive purposes while low inference, specific, formative ratings are appropriate for diagnos-
tic feedback. A related issue is the dimensionality of students’ evaluations and whether
students’ evaluations are best represented by a single score, a set of factor scores on well-
established, distinguishable dimensions of teaching effectiveness, or responses to indi-
vidual items (see Chapter 2). In fact, many student evaluation instruments (see Appendix)
ask students to complete a common core of items that vary in their level of inference, in-
clude provision for academic units or individual instructors to select or create additional
items, and provide space for written comments to open-ended questions. Hence the ques-
tion becomes one of what is the appropriate information to report to whom for what pur-
poses.

A Construct Validation Approach to the Study of Students’ Evaluations

Particularly in the last 15 years, the study of students’ evaluations has been one of the
most frequently emphasized areas in American educational research. Literally thousands
of papers have been written and a comprehensive review is beyond the scope of this
monograph. The reader is referred to reviews by Aleamoni (1981); Braskamp et al. (1985);
Centra (1979); Cohen (1980, 1981); Costin et al. (1971); de Wolf (1974); Doyle (1975;
1983); Feldman (1976a, 1976b, 1977, 1978, 1979, 1983, 1984); Kulik and McKeachie
(1975); Marsh (1980a, 1982b, 1984); McKeachie (1979); Murray (1980); Overall and
Marsh (1982); and Remmers (1963). Individually, many of these studies may provide im-
portant insights. Yet, collectively the studies cannot be easily summarized and Aleamoni
(1981) notes that opinions about students’ evaluations vary from “reliable, valid, and use-
ful” to “unreliable, invalid, and useless.” How can opinions vary so drastically in an area
which has been the subject of thousands of studies? Part of the problem lies in the precon-
ceived biases of those who study student ratings; part of the problem lies in unrealistic ex-
pectations of what student evaluations can and should be able to do; part of the problem
lies in the plethora of ad hoc instruments based upon varied item content and untested
psychometric properties; and part of the problem lies in the fragmentary approach to the
design of both student-evaluation instruments and the research based upon them.
In the early 1970s there was a huge increase in the collection of students’ evaluations of
teaching effectiveness at North American universities. Researchers from a wide variety of


academic disciplines were suddenly presented with large data files and they responded
with papers, conference presentations, and articles of varied quality. Unfortunately, there
were few well-established research paradigms and methodological guidelines to guide this
early research, and insufficient attention was given to those that were available (for ex-
ample, Remmers’ research). Many studies conducted during this period were
methodologically unsound, but their conclusions were nevertheless used as the basis of
policy and subsequent research. Richard Schutz, editor of the American Educational Re-
search Journal during the late 1970s, commented that the major educational research jour-
nals may have erred in accepting for publication so many student evaluation studies of
questionable quality during this period (personal communication, 1979). Articles pub-
lished in the major educational research journals during the last five years are of a better
quality and fewer in number than those in the early and mid-1970s. However, based on
ERIC citations, the number of student evaluation studies from all sources has not
changed substantially during this period, methodologically flawed studies continue to be
reported, and the quality of papers presented at conferences and published in less prestigi-
ous journals is still quite varied. During the late 1970s, and particularly the 1980s, research
paradigms and methodological standards evolved, but they are presented in a piecemeal
fashion in journal articles. While there have been monographs with a broader focus, these
have typically emphasized guidelines on how to establish student evaluation programs and
general overviews of research findings rather than a substantial evaluation of the conduct
of student evaluation research. Hence, an important aim of the present monograph is to
present the research paradigms and methodological standards that have evolved in re-
search studies of students’ evaluations of teaching effectiveness.
Validating interpretations of student responses to an evaluation instrument involves an
ongoing interplay between construct interpretations, instrument development, data col-
lection, and logic. Each interpretation must be considered a tentative hypothesis to be
challenged in different contexts and with different approaches. This process corresponds
to defining a nomological network (Cronbach, 1971; Shavelson et al., 1976) where dif-
ferentiable components of students’ evaluations are related to each other and to other con-
structs. Within-network studies attempt to ascertain whether students’ evaluations consist
of distinct components and, if so, what these components are. This involves logical ap-
proaches such as content analysis and empirical approaches such as factor analysis and
multitrait-multimethod (MTMM) analysis. Some clarification of within-network issues
must logically precede between-network studies where students’ evaluations are related to
external variables.
The construct validation approach described here and elsewhere (Marsh, 1982b, 1983,
1984) has been incorporated more fully in the design, the development, and the research
of the Students’ Evaluations of Educational Quality (SEEQ) instrument than in other
student evaluation instruments currently available. Consequently, the focus of this over-
view will be on SEEQ research. In the chapters that follow, relevant SEEQ research will
be described, and methodological, theoretical, and empirical issues arising from this dis-
cussion will be related to other research in the field, and new directions for future research
will be indicated. The emphasis of this monograph on my own research with SEEQ obvi-
ously reflects my own biases, but is also justified because SEEQ apparently has been
studied in a wider range of research studies than have other student evaluation instru-
ments.
The purpose of this monograph is to provide an overview of research conducted in


selected areas of student evaluation research, to examine methodological issues and weak-
nesses in these areas of study, to indicate implications for the use and application of the rat-
ings, and to explore directions for future research. This research and this overview em-
phasize the construct validation approach described above and several perspectives about
student-evaluation research which underlie this approach.
(1) Teaching effectiveness is multifaceted. The design of instruments to measure stu-
dents’ evaluations, and the design of research to study the evaluations should reflect this
multidimensionality.
(2) There is no single criterion of effective teaching and alternative criteria may not be
substantially correlated. Hence, a construct approach to the validation of student ratings
is required in which the ratings are shown to be systematically related to a variety of other
indicators of effective teaching. No single study, no single criterion, and no single
paradigm can demonstrate, or refute, the validity of students’ evaluations.
(3) Different dimensions or factors of students’ evaluations will correlate more highly
with different indicators of effective teaching. The construct validity of interpretations
based upon the rating factors requires that each factor is significantly correlated with
criteria to which it is most logically and theoretically related, and less correlated with other
variables. In general, student ratings should not be summarized by a response to a single
item or an unweighted average response to many items. If ratings are to be averaged for a
particular purpose, logical and empirical analyses specific to the purpose should determine
the weighting each factor receives, so that the weighting will depend upon the purpose.
(4) An external influence, in order to constitute a bias to student ratings, must be sub-
stantially and causally related to the ratings, and relatively unrelated to other indicators of
effective teaching. As with validity research, interpretations of relations as a bias should
be viewed as tentative hypotheses to be challenged in different contexts and with diffe-
rent approaches which are consistent with the multifaceted nature of student ratings. Such
interpretations must also be made in relation to an explicit definition of what consti-
tutes a bias.
CHAPTER 2

DIMENSIONALITY OF STUDENTS’ EVALUATIONS

The Need for a Multidimensional Approach

Information from students’ evaluations necessarily depends on the content of the evalua-
tion items, but student ratings, like the teaching that they represent, should be multi-
dimensional (for example, a teacher may be quite well organized but lack enthusiasm).
This contention is supported by common sense and a considerable body of empirical re-
search. Unfortunately, most evaluation instruments and research fail to take cognizance
of this multidimensionality. If a survey instrument contains an ill-defined hodge-podge of
different items and student ratings are summarized by an average of these items, then
there is no basis for knowing what is being measured, no basis for differentially weighting
different components in a way that is most appropriate to the particular purpose they are
to serve, nor any basis for comparing these results with other findings. If a survey contains
separate groups of related items that are derived from a logical analysis of the content of
effective teaching and the purposes that the ratings are to serve, or from a carefully con-
structed theory of teaching and learning, and if empirical procedures such as factor analy-
sis and MTMM analyses demonstrate that the items within the same group do measure the
same trait and that different traits are separate and distinguishable, then it is possible to in-
terpret what is being measured. The demonstration of a well-defined factor structure also
provides a safeguard against a halo effect-a generalization from some subjective feeling
having nothing to do with effective teaching, an external influence, or an idiosyncratic re-
sponse mode - that affects responses to all items.
Some researchers, while not denying the multidimensionality of student ratings, argue
that a total rating or an overall rating provides a more valid measure. This argument is typ-
ically advanced in research where separate components of the students’ evaluations have
not been empirically demonstrated, and so there is no basis for testing the claim. More im-
portantly, the assertion is not accurate for most circumstances. First, there are many pos-
sible indicators of effective teaching; the component that is ‘most valid’ will depend on the
criteria being considered (Marsh & Overall, 1980). Second, reviews and meta-analyses of
different validity criteria show that specific components of student ratings are more highly
correlated with individual validity criteria than an overall or total rating (e.g., student
learning-Cohen, 1981; instructor self-evaluations - Marsh, 1982c; Marsh et al., 1979b;
effect of feedback for the improvement of teaching -Cohen, 1980). Third, the influence
of a variety of background characteristics suggested by some as ‘biases’ to student ratings
is more difficult to interpret with total ratings than with specific components (Frey, 1978;
Marsh, 1980b, 1983,1984). Fourth, the usefulness of student ratings, particularly as diag-
nostic feedback to faculty, is enhanced by the presentation of separate components. Fi-
nally, even if it were agreed that student ratings should be summarized by a single score for
a particular purpose, the weighting of different factors should be a function of logical and
empirical analyses of the multiple factors for the particular purpose; an optimally weighted
set of factor scores will automatically provide a more accurate reflection of any criterion
than will a non-optimally weighted total (for example, Abrami, 1985, p. 223). Hence, no
matter what the purpose, it is logically impossible for an unweighted average to be more
useful than an optimally weighted average of component scores.
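A minimal numerical sketch of this point follows, using simulated data (the sample size, factor structure, and weights below are invented for illustration and are not taken from SEEQ research): in-sample, least-squares weights for a set of factor scores can never track a criterion less closely than an equally weighted composite of the same scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated class-average factor scores (e.g., 200 classes x 4 rating factors).
n_classes, n_factors = 200, 4
factors = rng.normal(size=(n_classes, n_factors))

# Simulated external criterion that depends unevenly on the factors.
true_weights = np.array([0.8, 0.4, 0.1, 0.0])
criterion = factors @ true_weights + rng.normal(scale=0.5, size=n_classes)

# Unweighted (equally weighted) composite of the factor scores.
unweighted = factors.mean(axis=1)

# Least-squares (optimally weighted, in-sample) composite.
X = np.column_stack([factors, np.ones(n_classes)])
beta, *_ = np.linalg.lstsq(X, criterion, rcond=None)
optimal = X @ beta

print("r(criterion, unweighted composite):",
      np.corrcoef(criterion, unweighted)[0, 1].round(3))
print("r(criterion, optimally weighted):  ",
      np.corrcoef(criterion, optimal)[0, 1].round(3))
```

The gap between the two correlations grows as the criterion depends more unevenly on the separate factors, which is the situation the validity evidence cited above suggests is typical.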
Still other researchers, while accepting the multidimensionality of students’ evaluations
and the importance of measuring separate components for some purposes such as feed-
back to faculty, defend the unidimensionality of student ratings because “when student
ratings are used in personnel decisions, one decision is made.” However, such reasoning
is clearly illogical. First, the use to which student ratings are put has nothing to do with
their dimensionality, though it may influence the form in which the ratings are to be pre-
sented. Second, even if a single total score were the most useful form in which to sum-
marize student ratings for personnel decisions, and there is no reason to assume that gen-
erally it is, this purpose would be poorly served by an ill-defined total score based upon an
ad hoc collection of items that was not appropriately balanced with respect to the compo-
nents of effective teaching that were being measured. If a single score were to be used, it
should represent a weighted average of the different components where the weight as-
signed to each component was a function of logical and empirical analyses. There are a var-
iety of ways in which the weights could be determined, including the importance of each
component as judged by the instructor being evaluated, and the weighting could vary for
different courses or for different instructors. However the weights are established, they
should not be determined by the ill-defined composition of whatever items happen to ap-
pear on the rating survey, as is typically the case when a total score is used. Third, implicit
in this argument is the suggestion that administrators are unable to utilize or prefer not to
be given multiple sources of information for use in their deliberations, but I know of no
empirical research to support such a claim. At institutions where SEEQ has been used, ad-
ministrators (who are also teachers who have been evaluated with SEEQ and are familiar
with it) prefer to have summaries of ratings for separate SEEQ factors for each course
taught by an instructor for use in administrative decisions (see description of longitudinal
summary report by Marsh, 1982b, pp. 78-79). Important unresolved issues in student
evaluation research are how different rating components should be weighted for various
purposes, and what form of presentation is most useful for different purposes. However,
the continued, and mistaken, insistence that students’ evaluations represent a unidimen-
sional construct hinders progress on the resolution of these issues.

The Selection of the Dimensions to be Measured

An important issue in the construction of multidimensional rating scale instruments is


the content of the dimensions to be surveyed. The most typical approach consists of a log-
ical analysis of the content of effective teaching and the purposes of students’ evaluations,
supplemented perhaps with literature reviews of the characteristics other researchers have
found to be useful, and feedback from students and faculty. An alternative approach based
on a theory of teaching or learning could be used to posit the important dimensions,
though such an approach does not seem to have been used in student evaluation research.
However, with each approach, it is important to also use empirical techniques, such as fac-
tor analysis to further test the dimensionality of the ratings. The most carefully constructed
instruments combine both logical/theoretical and empirical analyses in the research and
development of student rating instruments.
Feldman (1976b) sought to examine the different characteristics of the superior univer-
sity teacher from the student’s point of view with a systematic review of research that either
asked students to specify these characteristics or inferred them on the basis of correlations
between specific characteristics and students’ overall evaluations. On the basis of such
studies, and also to facilitate presentation of this material and his subsequent reviews of
other student evaluation research, Feldman derived a set of 19 categories that are listed in
Table 2.1. This list provides the most extensive and, perhaps, the best set of dimensions
that are likely to underlie students’ evaluations of effective teaching. Nevertheless,
Feldman used primarily a logical analysis based on his examination of the student evalua-
tion literature, and his results do not necessarily imply that students can accurately
evaluate these components, that these components can be used to differentiate among
teachers, or that other components do not exist.

Table 2.1
Nineteen Instructional Rating Dimensions Adapted From Feldman (1976)

Teacher’s stimulation of interest in the course and subject matter.


Teacher’s enthusiasm for subject or for teaching.
Teacher’s knowledge of the subject.
Teacher’s intellectual expansiveness and breadth of coverage.
Teacher’s preparation and organization of the course.
Clarity and understandableness of presentations and explanations.
Teacher’s elocutionary skills.
Teacher’s sensitivity to, and concern with, class level and progress.
Clarity of course objectives and requirements.
Nature and value of the course material including its usefulness and relevance.
Nature and usefulness of supplementary materials and teaching aids.
Difficulty and workload of the course.
Teacher’s fairness and impartiality of evaluation of students; quality of exams.
Classroom management.
Nature, quality and frequency of feedback from teacher to students.
Teacher’s encouragement of questions and discussion, and openness to the opinions of others.
Intellectual challenge and encouragement of independent thought.
Teacher’s concern and respect for students; friendliness of the teacher.
Teacher’s availability and helpfulness.

These nineteen categories were originally presented by Feldman (1976) but in subsequent studies (e.g.,
Feldman, 1984) ‘Perceived outcome or impact of instruction’ and ‘Personal characteristics (‘personality’)’ were
added while rating dimensions 12 and 14 presented above were not included.

Empirically Defined Dimensions -The Use of Factor Analysis

Factor analysis provides a test of whether students are able to differentiate among diffe-
rent components of effective teaching and whether the empirical factors confirm the facets
that the instrument is designed to measure. The technique cannot, however, determine
whether the obtained factors are important to the understanding of effective teaching; a
set of items related to an instructor’s physical appearance would result in a ‘physical ap-
pearance’ factor which probably has little to do with effective teaching. Consequently,
carefully developed surveys - even when factor analysis is to be used - typically begin
with item pools based upon literature reviews, and with systematic feedback from stu-
dents, faculty, and administrators about what items are important and what type of feed-
back is useful (for example, Hildebrand et al., 1971; Marsh, 1982b). For example, in the
development of SEEQ a large item pool was obtained from a literature review, instru-
ments in current usage, and interviews with faculty and students about characteristics
which they see as constituting effective teaching. Then, students and faculty were asked to
rate the importance of items, faculty were asked to judge the potential usefulness of the
items as a basis for feedback, and open-ended student comments on pilot instruments were
examined to determine if important aspects had been excluded. These criteria, along with
psychometric properties, were used to select items and revise subsequent versions. This
systematic development constitutes evidence for the content validity of SEEQ and makes
it unlikely that it contains any irrelevant factors.

Student Evaluation Factors Identified by Factor Analysis

The student evaluation literature does contain several examples of instruments that
have a well defined factor structure and that provide measures of distinct components of
teaching effectiveness. Some of these instruments (see Appendix for the actual instru-
ments) and the factors that they measure are the following.
(1) Frey’s Endeavor instrument (Frey et al., 1975; also see Marsh, 1981a): Presentation
Clarity, Workload, Personal Attention, Class Discussion, Organization/Planning, Grad-
ing, and Student Accomplishments.
(2) The Student Description of Teaching (SDT) questionnaire originally developed by
Hildebrand et al. (1971): Analytic/Synthetic Approach, Organization/Clarity, Instructor
Group Interaction, Instructor Individual Interaction, and Dynamism/Enthusiasm.
(3) Marsh’s SEEQ instrument (Marsh, 1982b, 1983,1984): Learning/Value, Instructor
Enthusiasm, Organization, Individual Rapport, Group Interaction, Breadth of Coverage,
Examinations/Grading, Assignments/Readings, and Workload/Difficulty.
(4) The Michigan State SIRS instrument (Warrington, 1973): Instructor Involvement,
Student Interest and Performance, Student-Instructor Interaction, Course Demands, and
Course Organization.
The systematic approach used in the development of each of these
instruments, and the similarity of the facets which they measure, support their construct
validity. Factor analyses of responses to each of these instruments provide clear support
for the factor structure they were designed to measure, and demonstrate that the students’
evaluations do measure distinct components of teaching effectiveness. More extensive re-
views describing the components found in other research (Cohen, 1981; Feldman, 1976b;
Kulik & McKeachie, 1975) identify dimensions similar to those described here.

Empirical Results from SEEQ Research

Factor analyses of responses to SEEQ (Marsh, 1982b, 1982c, 1983, 1984) consistently
identify the nine factors the instrument was designed to measure. Separate factor analyses
of evaluations from nearly 5,000 classes were conducted on different groups of courses
selected to represent diverse academic disciplines at graduate and undergraduate levels;
each clearly identified the SEEQ factor structure (Marsh, 1983). In one study, faculty in
329 courses were asked to evaluate their own teaching effectiveness on the same SEEQ
form completed by their students (Marsh, 1982c; Marsh & Hocevar, 1983). Separate fac-
tor analyses of student ratings and instructor self-evaluations each identified the nine
SEEQ factors (see Table 2.2). In other research (Marsh & Hocevar, 1984) evaluations of
the same instructor teaching the same course on different occasions demonstrated that
even the multivariate pattern of ratings was generalizable (for example, a teacher who was
judged to be well organized but lacking enthusiasm in one course was likely to receive a
similar pattern of ratings in other classes). These findings clearly demonstrate that student
ratings are multidimensional, that the same factors underlie ratings in different disciplines
and at different levels, and that similar ratings underlie faculty evaluations of their own
teaching effectiveness.
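The analyses just summarized were carried out with standard statistical packages (see the note to Table 2.2). As an illustration only, a roughly comparable exploratory analysis can be sketched in Python with the factor_analyzer package; the data matrix here is random noise standing in for a real file of class-average item responses, and the package and settings are assumptions of this sketch rather than anything used in the SEEQ research.

```python
import numpy as np
from factor_analyzer import FactorAnalyzer  # assumed dependency: pip install factor_analyzer

rng = np.random.default_rng(1)

# Placeholder data: rows are classes, columns are evaluation items.
# In a real analysis this would be the matrix of class-average responses
# to the roughly 30 SEEQ items, not random numbers.
n_classes, n_items = 500, 32
class_averages = rng.normal(size=(n_classes, n_items))

# Nine factors, principal-factor extraction, direct oblimin rotation -
# loosely approximating the SPSS analysis described in the note to Table 2.2.
fa = FactorAnalyzer(n_factors=9, rotation="oblimin", method="principal")
fa.fit(class_averages)

# Item-by-factor pattern loadings; with real data, items designed to measure
# the same SEEQ factor should load together on a single factor.
print(np.round(fa.loadings_, 2))
```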

Higher-Order Factors

The multidimensionality of students’ evaluations appears to be becoming more widely


accepted. However, implicit in the formulations of some researchers is the suggestion that
a small number of higher-order factors -or even a single general factor-can account for
much of the variability in the first-order factors. There are two reasons why it is important
to examine this possibility. First, it is theoretically important to know how different dimen-
sions are related to each other. Second, if the higher-order factors are sufficiently strong,
then such higher-order factors may be practically useful for summarizing students’ evalua-
tions and for relating the ratings to other constructs.
Students’ evaluations of teaching effectiveness can be used to infer distinguishable
facets, but I know of no research that provides a methodologically sound basis for conclud-
ing that these first-order facets are independent, orthogonal, or uncorrelated. Whenever
a set of student evaluation factors is even moderately correlated, it may be possible to use
the correlations among these first-order factors to infer second-order factors. However, it
is only recently that advances in the application of confirmatory factor analysis and the
availability of appropriate computer programs have made such analyses generally feasible
(see later discussion of confirmatory factor analysis). Hence, there are relatively few ex-
amples of such analyses in educational and psychological research, and none that I know
of in student evaluation research. Nevertheless, student evaluation researchers have infer-
red the existence of higher-order factors on the basis of less rigorous analyses.
Frey (1978) demonstrated that responses to his 21-item Endeavor instrument define
seven distinct components of effective teaching, but also argued that the seven first-order
factors define two higher-order factors that he called Skill and Empathy. He reported that
these two higher-order factors have quite different patterns of correlations with other vari-
ables such as class size, class-average grade, student learning in multisection validity
studies, and an index of research citation counts. Frey, consistent with the present em-
phasis on multidimensionality, argued that many of the inconsistencies in the student
evaluation literature resulted from the inappropriate unidimensional analysis of ratings
which should be examined in terms of separate dimensions. However, his justification for
summarizing the seven Endeavor dimensions with two higher-order dimensions may be
dubious. His analysis was based upon responses to only 7 of the 21 Endeavor items - one
Table 2.2
Factor Analyses of Students’ Evaluations of Teaching Effectiveness (S) and the Corresponding Faculty Self-Evaluations of Their Own Teaching (F) in 329 Courses (Reprinted with permission from Marsh, 1984b)

[The factor pattern loading matrix is not legibly reproduced here; only the evaluation items (paraphrased), grouped by the factor each was designed to measure, are listed.]

1. Learning/Value: Course challenging/stimulating; Learned something valuable; Increased subject interest; Learned/understood subject matter; Overall course rating.
2. Enthusiasm: Enthusiastic about teaching; Dynamic and energetic; Enhanced presentations with humor; Teaching style held your interest; Overall instructor rating.
3. Organization: Instructor explanations clear; Course materials prepared and clear; Objectives stated and pursued; Lectures facilitated note taking.
4. Group Interaction: Encouraged class discussions; Students shared ideas/knowledge; Encouraged questions and answers; Encouraged expression of ideas.
5. Individual Rapport: Friendly towards students; Welcomed seeking help/advice; Interested in individual students; Accessible to individual students.
6. Breadth of Coverage: Contrasted implications; Gave background of ideas/concepts; Gave different points of view; Discussed current developments.
7. Examinations/Grading: Examination feedback valuable; Evaluation methods fair/appropriate; Tested emphasized course content.
8. Assignments: Reading/texts valuable; Added to course understanding.
9. Workload/Difficulty: Course difficulty (Easy-Hard); Course workload (Light-Heavy); Course pace (Too Slow-Too Fast); Hours/week outside of class.

Note. In the original table, the loadings for items designed to measure each factor appeared in boxes, and all loadings were presented without decimal points. The factor analyses of student ratings and of instructor self-ratings each consisted of a principal-components analysis, Kaiser normalization, and rotation to a direct oblimin criterion, performed with the commercially available Statistical Package for the Social Sciences (SPSS) routine (see Nie, Hull, Jenkins, Steinbrenner, & Bent, 1975).
from each scale*; the higher-order factors were not easily interpreted in that the factor
loadings did not approximate simple structure; no attempt was made to test the ability of
the two-factor solution to fit responses from the original 21 items; other research has
shown that responses to the 21 items do identify seven factors rather than just two (Frey et
al., 1975; Marsh, 1981a); and confirmatory factor analytic techniques designed to test
higher-order structures were not employed.
order factors had systematic and distinguishable patterns of relations to external criteria of
validity and to potential biases to students’ evaluations. These findings reported by Frey
suggest the possibility and perhaps the importance of higher-order factors.
Remmers, as described earlier, demonstrated that the 10 traits on his Purdue scale were
important and distinguishable components of students’ evaluations. Remmers also inter-
preted the results of a factor analysis (Smalzried & Remmers, 1943) to infer two higher-
order traits that he called Empathy and Professional Maturity. However, Remmers did
not suggest that these two higher-order factors should be used instead of the 10 traits, and
he put little emphasis on them in his subsequent research. This study, like the Frey study,
suffers in that the first-order factors were inferred on the basis of single-item scales, but the
higher-order factors proposed by Remmers and Frey appear to be similar.
Feldman (1976b) derived 19 categories of characteristics of the superior teacher from
the students’ view (see Table 2.1). However, he also examined studies that contained two
or more of his 19 categories of teaching effectiveness and used the pattern of correlations
among the different categories to infer three higher-order clusters that were related to the
instructors’ roles as presenter (actor or communicator), facilitator (interactor or recip-
rocator), and manager (director or regulator).
Abrami (1985, p. 214) also suggested “that effective teaching may be described both
unitarily and multidimensionally in a way analogous to the way Wechsler’s tests operation-
ally define intelligence in both general and specific terms” and that support for such an in-
terpretation would have important implications. Such a general factor may be a higher-
order factor derived from lower-order, more specific facets, and this would be consistent
with the higher-order perspective that is considered here. It should be noted, however,
that none of the studies examined here suggest that a single higher-order factor exists.
Hence, it seems unlikely that a unitary construct will be identified but it should be em-
phasized that the appropriate research to support or refute such a claim has not been con-
ducted.
The interpretation of first-order factors and their relation to external constructs may be
facilitated by the demonstration of a well-defined higher-order structure as proposed by
Frey. However, even if the existence of the higher-order factors can be demonstrated with
appropriate analytical procedures, this does not imply that the higher-order factors should
be used instead of the lower-order factors or that the higher-order factors can be inferred
without first measuring the lower-order factors. Despite the expedient advantages of

* Frey (1979) indicates that during its evolvement there were 12 different versions of Endeavor. Endeavor XI (see
Appendix) consists of 21 items that measure 7 first-order factors, and these first-order factors may be used to infer two
higher-order factors. Endeavor XII consists of 7 items, the best item from each of the 7 first-order dimensions
on Endeavor XI, that measure 2 higher-order factors. The existence of the 2 Endeavor forms has resulted in some
confusion about the dimensionality of the Endeavor. My interpretation is that both the 7- and 21-item versions
of Endeavor measure 7 first-order factors, and that responses to these may be used to infer 2 higher-order factors.
This interpretation is apparently consistent with that which appears in the Endeavor manual (Frey, 1979) and re-
conciles apparent discrepancies in descriptions of Endeavor (for example, Marsh, 1984, as compared to Abrami,
1985).
single-item scales, their use instead of multi-item scales, as in the 7-item Endeavor form
and the Purdue rating scale, cannot be generally recommended. In a methodological re-
view of many areas of research, Rushton et al. (1983) concluded that single-item scales are
less stable, less reliable, less valid and less representative than multi-item scales. Marsh et
al. (1985) compared responses to single- and multi-item scales by the same subjects and
reached similar conclusions. The investigation of higher-order factor structures in stu-
dents’ evaluations of teaching effectiveness represents an important area for future re-
search and this research will be facilitated by recent advances in the application of confir-
matory factor analysis that are summarized in the next section.

The Use of Confirmatory Factor Analysis in Future Research

Recent advances in the application of confirmatory factor analysis allow researchers to


test the ability of an a priori factor model to fit responses rather than to simply try to inter-
pret the factors that are generated by exploratory factor analyses (for example, Long,
1983; Marsh et al., 1985). For example, each of the multifactor instruments described earl-
ier contains clusters of items that are posited to measure distinguishable facets of teaching
effectiveness, and this implicit factor model forms the basis of one a priori model. Varia-
tions in such a model might allow each item to load on one and only one factor (a simple
structure) or to load on more than one factor (a complex structure), or might allow the po-
sited factors to be uncorrelated (an orthogonal structure) or correlated (an oblique struc-
ture). Other models might propose that two or more of the posited factors should be com-
bined to form a single factor, that items designed to measure a single factor really define
more than one factor, or that all the items can be represented by a single factor. Each of
these various models could then be compared in terms of their ability to fit the data. The
use of confirmatory factor analysis, instead of the exploratory factor analyses that have
predominated in student evaluation research, offers the researcher tremendous flexibility to
define alternative models and to compare the ability of alternative models to fit the same
data. While confirmatory factor analysis has been used in student evaluation research
(for example, Marsh & Hocevar, 1983, 1984), its application has been surprisingly
infrequent, and it represents an important methodological tool for future research.
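As an illustration of this approach, the sketch below fits a simple-structure, oblique first-order model with the Python package semopy; the item names, the three factors, and the simulated data are hypothetical stand-ins, and semopy itself is an assumption of the sketch rather than a tool used in the research described here.

```python
import numpy as np
import pandas as pd
import semopy  # assumed dependency: pip install semopy

rng = np.random.default_rng(2)
n_classes = 400

# Simulate class-average data with a known three-factor structure and
# moderately correlated factors (names loosely echo SEEQ content).
factor_corr = np.array([[1.0, 0.5, 0.4],
                        [0.5, 1.0, 0.5],
                        [0.4, 0.5, 1.0]])
latent = rng.multivariate_normal(np.zeros(3), factor_corr, size=n_classes)
items = {  # item name -> index of the factor it is designed to measure
    "enthusiastic": 0, "dynamic": 0, "humor": 0, "held_interest": 0,
    "clear_explanations": 1, "materials_prepared": 1, "objectives_pursued": 1,
    "friendly": 2, "welcomed_help": 2, "individual_interest": 2,
}
ratings = pd.DataFrame(
    {name: 0.8 * latent[:, f] + rng.normal(scale=0.6, size=n_classes)
     for name, f in items.items()}
)

# A priori model: each item loads on exactly one factor (simple structure)
# and the factors are free to correlate (oblique structure).
model_desc = """
Enthusiasm   =~ enthusiastic + dynamic + humor + held_interest
Organization =~ clear_explanations + materials_prepared + objectives_pursued
Rapport      =~ friendly + welcomed_help + individual_interest
Enthusiasm ~~ Organization
Enthusiasm ~~ Rapport
Organization ~~ Rapport
"""

model = semopy.Model(model_desc)
model.fit(ratings)
print(model.inspect())           # loadings and factor covariances
print(semopy.calc_stats(model))  # fit indices for comparing alternative models
```

Competing specifications (orthogonal factors, complex loadings, or a single general factor) can be written in the same syntax and compared on the resulting fit statistics.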
The confirmatory factor analysis models described above are first-order models in that
no assumptions were made about the pattern of correlations that exist among the posited
factors. However, so long as the first-order factors are at least moderately correlated,
second-order and even third-order factors can be tested. Conceptually, a second-order
factor analysis is like using the factor correlations derived from an oblique first-order fac-
tor analysis as the starting point for a second factor analysis, though factor loadings for
both first- and second-order factors are solved simultaneously with confirmatory factor
analysis. It may be possible to use exploratory factor analysis of responses to items to iden-
tify first-order factors and then to factor analyze correlations among the first-order factors
to explore second-order factors. However, as indicated by Marsh, Barnes & Hocevar
(1985), it becomes increasingly difficult to rotate to simple structure with exploratory factor analysis as one prog-
resses to higher levels, and unless the first-order factors are very clearly identified the in-
terpretation of higher-order factors becomes problematic. Furthermore, the use of confir-
matory factor analysis also allows the investigator to explicitly define different first- and
second-order factor models and to compare their abilities to fit the data. Earlier sugges-
tions of higher-order factors in student evaluation research demonstrate the practical sig-
nificance of this application of confirmatory factor analysis for future research.
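Continuing the hypothetical semopy sketch above (and reusing its simulated ratings data frame), a second-order structure can be written by letting a higher-order factor load on the first-order factors, so that loadings at both levels are estimated simultaneously; the factor names remain invented for illustration.

```python
import semopy  # assumed dependency, as in the previous sketch

# Second-order specification: a single general factor accounts for the
# correlations among the three first-order factors defined earlier.
second_order_desc = """
Enthusiasm   =~ enthusiastic + dynamic + humor + held_interest
Organization =~ clear_explanations + materials_prepared + objectives_pursued
Rapport      =~ friendly + welcomed_help + individual_interest
General      =~ Enthusiasm + Organization + Rapport
"""

second_order = semopy.Model(second_order_desc)
second_order.fit(ratings)                # 'ratings' from the earlier simulated example
print(semopy.calc_stats(second_order))   # compare fit with the oblique first-order model
```

With only three first-order factors the second-order portion of this toy model is just-identified, so its value here is purely illustrative; with the nine SEEQ factors, first-order and higher-order specifications would differ in degrees of freedom and could be compared directly on fit.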

Implicit Theories and Semantic Similarities: Factors or Antifactors

Implicit Theories

Abrami et al. (1981), Larson (1979), Whitely and Doyle (1976), and others have argued
that dimensions identified by factor analyses of students’ evaluations may reflect raters’
implicit theories about dimensions of teacher behaviors in addition to, or instead of, di-
mensions of actual teaching behaviors. For example, if a student implicitly assumes that
the occurrence of behaviors X and Y are highly correlated and observes that a teacher is
high on X, then the student may also rate the teacher as high on Y even though the student
does not have an adequate basis for rating Y.
Implicit theories have a particularly large impact on factor analyses of individual student
responses, and this argues against the use of the individual student as the unit of analysis.
In fact, if the ratings by individual students within the same class are factor analyzed and
it is assumed that the stimulus being judged is constant for different students - a prob-
lematic assumption - then the derived factors reflect primarily implicit theories. Whitely
and Doyle (1976) suggest that students’ implicit theories are controlled when factor
analyses are performed on class-average responses, while Abrami et al. (1981, p. 13) warn
that it is only when students are randomly assigned to classes that the “computation of
class means cancels out individual student expectations and response patterns as sources of
variability.” However, Larson (1979) demonstrated that even class-average responses,
whether or not based upon random assignment, are affected by implicit theories if the im-
plicit theories generalize across students; it is only those implicit theories that are idiosyn-
cratic to individual students, along with a variety of sources of random variation, that are
cancelled out in the formation of class averages. While still arguing that course characteris-
tics may be reflected in the results of factor analyses, Abrami (1985) agreed with Larson
on this point.
Whitely and Doyle (1976; see also Abrami, 1985) suggested that rating dimensions de-
fined by factor analyses of individual student responses and of class-average responses are
similar because the implicit theories that affect the individual responses are generally
valid. However, this explanation does not account for the possibility that implicit theories
that generalize across students may be reflected in both sets of factors. For this reason Lar-
son (1979) argued that the validity of students’ implicit theories cannot be tested with alter-
native factor analytic procedures based upon student ratings, no matter what the unit of
analysis, and that independent measures are needed. More generally the similarity or dis-
similarity of factors derived from analyses of individual student responses and class-aver-
age responses will never provide convincing evidence for the validity or invalidity of im-
plicit theories, or for the extent of influence implicit theories have on student ratings.
While the class-average is the only defensible unit of analysis for factor analyses of student
evaluations (for example, Marsh, 1984), the factors derived from such analyses represent
some unknown combination of students’ implicit theories about how teaching behaviors
should covary in addition to actual observations of how they do covary.

Support for the validity of the factor structure underlying students’ evaluations requires
that a similar structure be identified with a different method of assessment. Hence, the
similarity of the factor structures resulting from student ratings and instructor self-evalua-
tions shown in Table 2.2 is particularly important. While students and instructors may have
similar implicit theories, instructors are uniquely able to observe their own behaviors and
have little need to rely upon implicit theories in forming their self-ratings. Thus, the simi-
larity of the two factor structures supports the validity of the rating dimensions that were
identified by responses from both groups.

Semantic Similarities

In an alternative conceptualization, Cadwell and Jenkins (1985) suggested that students' responses to different SEEQ items may be correlated not because of implicit theories
about how teaching behaviors covary, but because of the semantic similarity of the rating
items. Thus “instructors who are ‘friendly towards individual students’ also make ‘students
feel welcome in seeking help/advice,’ not because the two behaviors have consistently co-
occurred in the past” [implicit theories], “but because part of the meaning of being friendly
to individual students is making students feel welcome” [semantic similarities] (p. 384). In
order to test this possibility Cadwell and Jenkins gave students limited information (brief
verbal statements) about hypothetical teachers and asked students to rate these teachers
using SEEQ items. On the basis of the information given to students each of the rating fac-
tors should have been statistically independent, but there were systematic correlations
beyond what could be explained by the information given to the students, and the authors
interpreted this as support for their proposal.
However, Marsh and Groves (in press) critiqued the study and concluded that even
though the original proposal was plausible the methodological problems and conceptual
ambiguities in their study dictate that interpretations should be made cautiously, and pre-
clude any justifiable conclusions about the effect of semantic similarities. Among the seri-
ous problems with the Cadwell and Jenkins study were that: (a) it did not operationally dis-
tinguish between the implicit theories and semantic similarities, and this may be impossi-
ble to do; (b) students were given inadequate or contradictory information that forced
them to use implicit theories and/or semantic similarities in forming their responses so that
the study’s external validity is dubious; (c) analyses were conducted on individual student
ratings rather than class-average responses even though the authors indicated much of the
variance due to implicit theories and/or semantic similarities was idiosyncratic to indi-
vidual students; (d) the appropriateness of their fractional factorial design is suspect; and
(e) there were alternative explanations of their findings that were not based on semantic
similarities or implicit theories. Furthermore, most of the systematic variation in students’
responses to each item could be explained by the ‘behaviors’ that were most logically re-
lated to the item - a finding that supports the validity of the responses. Marsh and Groves
concluded that the construct validity of responses to SEEQ must ultimately be based not
just on support or nonsupport for a robust factor structure, but on the demonstration that
SEEQ factors form a consistent and logical pattern of relations with external indicators of
teaching effectiveness. This conclusion is consistent with the construct validation approach
emphasized in this monograph.

Summary of the Dimensionality of Students’ Evaluations

In summary, most student evaluation instruments used in higher education, both in re-
search and in actual practice, have not been developed using systematic logical and empir-
ical techniques such as those described in this monograph. The evaluation instruments discussed
earlier each provided clear support for the multidimensionality of students’ evaluations,
but the debate about which specific components of teaching effectiveness can and should
be measured has not been resolved, though there seems to be consistency in those that are
identified in responses to the most carefully designed instruments. Students’ evaluations
cannot be adequately understood if this multidimensionality is ignored. Many orderly, log-
ical relationships are misinterpreted or cannot be consistently replicated because of this
failure, and the substantiation of this claim will constitute a major focus of this monograph.
Instruments used to collect students’ evaluations of teaching effectiveness, particularly
those used for research purposes, should be designed to measure separate components of
teaching effectiveness, and support for both the content and construct validity of the mul-
tiple dimensions should be demonstrated.
CHAPTER 3

RELIABILITY, STABILITY AND GENERALIZABILITY

Reliability

The reliability of student ratings is commonly determined from the results of item analyses
(i.e. correlations among responses to different items designed to measure the same com-
ponent of effective teaching) and from studies of interrater agreement (i.e. agreement
among ratings by different students in the same class). The internal consistency among re-
sponses to items designed to measure the same component of effective teaching is consis-
tently high. However, such internal consistency estimates provide an inflated estimate of
reliability since they ignore the substantial portion of error due to the lack of agreement
among different students within the same course, and so they generally should not be used
(see Gilmore et al., 1978 for further discussion). Internal consistency estimators may be
appropriate, however, for determining whether the correlations between multiple facets
are so large that the separate facets cannot be distinguished, as in multitrait-multimethod
(MTMM) studies.
The correlation between responses by any two students in the same class (i.e. the single
rater reliability) is typically in the 0.20s but the reliability of the class-average response de-
pends upon the number of students rating the class as originally described by Remmers
(see also Feldman, 1977, for a review of methodological issues and empirical findings). For
example, the estimated reliability for SEEQ factors is about 0.95 for the average response
from 50 students, 0.90 from 25 students, 0.74 from 10 students, 0.60 from five students,
and only 0.23 for one student. As previously noted by Remmers, given a sufficient number
of students, the reliability of class-average student ratings compares favorably with that of
the best objective tests. In most applications, this reliability of the class-average response,
based on agreement among all the different students within each class, is the appropriate
method for assessing reliability. Recent applications of generalizability theory de-
monstrate how error due to differences between items and error due to differences be-
tween ratings of different students can both be incorporated into the same analysis, but the
error due to differences between items appears to be quite small (Gilmore et al., 1978).
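The dependence of the class-average reliability on the number of raters follows the familiar Spearman-Brown relation. The short sketch below is purely illustrative; it assumes a single-rater reliability of about 0.23 (the typical value noted above) and reproduces figures close to, though not identical with, those reported for SEEQ, which were estimated separately for each factor.

    # Spearman-Brown estimate of the reliability of a class-average rating
    # based on k students, given an assumed single-rater reliability r.
    def class_average_reliability(r, k):
        return (k * r) / (1 + (k - 1) * r)

    for k in (1, 5, 10, 25, 50):
        print(k, round(class_average_reliability(0.23, k), 2))
    # Prints approximately 0.23, 0.60, 0.75, 0.88, and 0.94 respectively.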

Long Term Stability

Some critics suggest that students cannot recognize effective teaching until after being
called upon to apply course materials in further coursework or after graduation. Accord-
ing to this argument, former students who evaluate courses with the added perspective of
time will differ systematically from students who have just completed a course when
evaluating teaching effectiveness. Remmers (Drucker & Remmers, 1950) originally coun-
tered this contention by showing that responses by ten-year alumni agreed with those of current students. More recent cross-sectional studies (Centra, 1979; Howard et al., 1985;
Marsh, 1977) have also shown good correlational agreement between the retrospective
ratings of former students and those of currently enrolled students. In a longitudinal study
(Marsh & Overall, 1979a; Overall & Marsh, 1980; see Table 3.1) the same students
evaluated classes at the end of a course and again several years later, at least one year after
graduation. End-of-class ratings in 100 courses correlated 0.83 with the retrospective rat-
ings (a correlation approaching the reliability of the ratings), and the median rating at each
time was nearly the same. Firth (1979) asked students to evaluate classes at the time of
graduation from their university (rather than at the end of each class) and one year after
graduation, and he also found good agreement between the two sets of ratings by the same
students. These studies demonstrate that student ratings are quite stable over time, and
argue that added perspective does not alter the ratings given at the end of a course. Hence,
these findings not only provide support for the long-term stability of student ratings, but
they also provide support for their construct validity (see Chapter 4).

Table 3.1
Long-Term Stability of Student Evaluations: Relative and Absolute Agreement Between End-of-Term and Re-
trospective Ratings (Reprinted with permission from Overall & Marsh, 1980)

                                      Correlations between end-of-term       M differences between end-of-term
                                      and retrospective ratings              and retrospective ratings
                                      (relative agreement)                   (absolute agreement)

                                      Individual      Class-average          Retrospective   End-of-term   Difference
                                      students        responses              ratings         ratings
Evaluation items                      (N = 1,374)     (N = 100)              (N = 100)       (N = 100)     (N = 100)

1. Purpose of class assignments
   made clear                            .55**           .81**                  6.63            6.61         +.02
2. Course objectives adequately
   outlined                              .63**           .84**                  6.61            6.47         +.14*
3. Class presentations prepared
   and organized                         .62**           .79**                  6.67            6.54         +.13
4. You learned something of value        .53**           .81**                  6.65            6.87         -.22**
5. Instructor considerate of
   your viewpoint                        .58**           .83**                  6.59            6.88         -.29**
6. Instructor involved you in
   discussions                           .56**           .84**                  6.63            6.75         -.12
7. Instructor stimulated your
   interest                              .58**           .82**                  6.38            6.50         -.12
8. Overall instructor rating             .65**           .84**                  6.55            6.74         -.19*
9. Overall course rating                 .56**           .83**                  6.65            6.50         +.15*
Median across all nine rating items      .58             .83                    6.63            6.61

Note. A total of 1,374 student responses from 100 different sections each assessed instructional effectiveness at the end of each class (end of term) and again 1 year after graduation (retrospective follow-up). The last three columns are based on N = 100 classes. All ratings were made along a 9-point response scale that varied from 1 (very low or never) to 9 (very high or always).
* p < .05. ** p < .01.

In the same longitudinal study, Marsh (see Marsh & Overall, 1979) demonstrated that,
consistent with previous research, the single-rater reliabilities were generally in the 0.20s
for both end-of-course and retrospective ratings. (Interestingly, the single-rater re-
liabilities were somewhat higher for the retrospective ratings.) However, the median cor-
relation between end-of-class and retrospective ratings, when based on responses by indi-
vidual students instead of class-average responses, was 0.59 (Table 3.1). The explanation
for this apparent paradox is the manner in which systematic unique variance, as opposed
to random error variance, is handled in determining the single rater reliability estimate and
the stability coefficient. Variance that is systematic, but unique to the response of a par-
ticular student, is taken to be error variance in the computation of the single-rater reliabil-
ity. However, if this systematic variance was stable over the several year period between
the end-of-course and retrospective ratings for an individual student, a demanding criter-
ion, then it is taken to be systematic variance rather than error variance in the computation
of the stability coefficient. While conceptual differences between internal consistency and
stability approaches complicate interpretations, there is clearly an enduring source of sys-
tematic variation in responses by individual students that is not captured by internal consis-
tency measures. This also argues that while the process of averaging across the ratings pro-
duces a more reliable measure, it also masks much of the systematic variance in individual
student ratings, and that there may be systematic differences in ratings linked to specific
subgroups of students within a class (also see Feldman, 1977). Various subgroups of stu-
dents within the same class may view teaching effectiveness differently, and may be diffe-
rentially affected by the instruction which they receive, but there has been surprisingly
little systematic research to examine this possibility.
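One way to formalize this apparent paradox (the decomposition below is an illustrative approximation, not one taken from the studies themselves) is to partition the variance of an individual student's rating into a component shared by all students in the class, a component that is systematic but unique to the student, and random error. The single-rater reliability treats the student-specific component as error,

    r_{single} = \sigma^2_{class} / (\sigma^2_{class} + \sigma^2_{student} + \sigma^2_{error})

whereas the stability coefficient for an individual student retains whatever part of that component endures over time,

    r_{stability} \approx (\sigma^2_{class} + \sigma^2_{student}) / (\sigma^2_{class} + \sigma^2_{student} + \sigma^2_{error})

which is why the individual-level stability (0.59) can exceed the single-rater reliability (about 0.20 to 0.30).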

Generalizability - Teacher and Course Effects

Researchers have also asked how highly correlated student ratings are in two different
courses taught by the same instructor, and even in the same course taught by different
teachers on two different occasions. This research is designed to address three related
questions. First, what is the generality of the construct of effective teaching as measured
by students’ evaluations? Second, what is the relative importance of the effect of the in-
structor who teaches a class on students’ evaluations, compared to the effect of the particu-
lar class being taught? If the impact of the particular course is large, then the practice of
comparing ratings of different instructors for tenure/promotion decisions may be dubious.
Third, should ratings be averaged across different courses taught by the same instructor?
Marsh (1981b) arranged ratings of 1364 courses into 341 sets such that each set contained
ratings of: the same instructor teaching the same course on two occasions, the same in-
structor teaching two different courses, the same course taught by a different instructor,
and two different courses taught by different instructors (Table 3.2). For an overall in-
structor rating item the correlation between ratings of different instructors teaching the
same course (i.e. a course effect) was -0.05, while correlations for the same instructor in
different courses (0.61) and in two different offerings of the same course (0.72) were much
larger (Table 3.2). While this pattern was observed in each of the SEEQ factors, the correlation between ratings of different instructors in the same course was slightly higher for some evaluation factors (for example, Workload/Difficulty, Assignments, and Group Interaction) but had a mean of only 0.14 across all the factors.

Table 3.2
Correlations Among Different Sets of Classes for Student Ratings and Background Characteristics (Reprinted with permission from Marsh, 1984b)

                                                    Same        Same        Different   Different
                                                    teacher,    teacher,    teacher,    teacher,
                                                    same        different   same        different
Measure                                             course      course      course      courses

Student rating
  Learning/Value                                     .696        .563        .232        .069
  Enthusiasm                                         .734        .613        .011        .028
  Organization/Clarity                               .676        .540       -.023       -.063
  Group Interaction                                  .699        .540        .291        .224
  Individual Rapport                                 .726        .542        .180        .146
  Breadth of Coverage                                .727        .481        .117        .067
  Examinations/Grading                               .633        .512        .066       -.004
  Assignments                                        .681        .428        .332        .112
  Workload/Difficulty                                .733        .400        .392        .215
  Overall course                                     .712        .591       -.011       -.065
  Overall instructor                                 .719        .607       -.051       -.059
  Mean coefficient                                   .707        .523        .140        .061
Background characteristic
  Prior subject interest                             .635        .312        .563        .209
  Reason for taking course (percent
    indicating general interest)                     .770        .448        .671        .383
  Class average expected grade                       .709        .405        .483        .356
  Workload/difficulty                                .773        .400        .392        .215
  Course enrollment                                  .846        .312        .593        .058
  Percent attendance on day evaluations
    administered                                     .406        .164        .214        .045
  Mean coefficient                                   .690        .340        .491        .211

In marked contrast, correla-
tions between background variables in different sets of courses (for example, prior subject
interest, class size, reason for taking the course) were higher for the same course taught by
two different instructors than for two different courses taught by the same instructor
(Table 3.2). Based on a path analysis of these results, Marsh argued that the effect of the
teacher on student ratings of teaching effectiveness is much larger than is the effect of the
course being taught, and that there is a small portion of reliable variance that is unique to
a particular instructor in a particular course that generalizes across different offerings of
the same course taught by the same instructor. Hence, students’ evaluations primarily re-
flect the effectiveness of the instructor rather than the influence of the course, and some
instructors may be uniquely suited to teaching some specific courses. A systematic exami-
nation of the suggestion that some teachers are better suited for some specific courses, and
that this can be identified from the results from a longitudinal archive of student ratings,
is an important area for further research.
These results provide support for the generality of student ratings across different
courses taught by the same instructor, but provide no support for the use of student ratings
to evaluate the course. Even student ratings of the overall course were primarily a function
of the instructor who taught the course, and not the particular course that was being
evaluated. In fact, the predominance of the instructor effect over the course effect was vir-
tually the same for both the overall instructor rating and the overall course rating. This
finding probably reflects the autonomy that university instructors typically have in con-
ducting the courses that they teach, and may not generalize to the relatively unusual setting
in which instructors have little or no autonomy. Nevertheless, the findings provide no sup-
port for the validity of student ratings of the course independent of the instructor who
teaches the course.
Marsh and Overall (1981) examined the effect of course and instructor in a setting where
all students were required to take all the same courses, thus eliminating many of the prob-
lems of self-selection that plague most studies. The same students evaluated instructors at
the end of each course and again one year after graduation from the program. For both
end-of-course and follow-up ratings, the particular instructor teaching the course ac-
counted for 5 to 10 times as much variance as the course. These findings again demonstrate
that the instructor is the primary determinant of student ratings rather than the course he
or she teaches.
Marsh and Hocevar (1984) also examined the consistency of the multivariate structure
of student ratings. University instructors who taught the same course at least four times
over a four-year period were evaluated by different groups of students in each of the four
courses (n=314 instructors, 1254 classes, 31,322 students). Confirmatory factor analysis
demonstrated not only the generalizability of the ratings of the instructors across the four
sets of courses, but also the generalizability of multivariate structure. For example, an in-
structor who was evaluated to be enthusiastic but poorly organized in one class received a
similar pattern of ratings in other offerings of the same course. The results of this study de-
monstrate a consistency of the factor structure across the different sets of courses, a rela-
tive lack of method/halo effect in the ratings, and a generalizability of the multivariate
structure; all of which provide a particularly strong demonstration of the multifaceted na-
ture of student ratings.
Gilmore et al. (1978), applying generalizability theory to student ratings, also found that
the influence of the instructor who teaches the course is much larger than that of the course
that is being taught. They suggested that ratings for a given instructor should be averaged
across different courses to enhance generalizability. If it is likely that an instructor will
teach many different classes during his or her subsequent career, then tenure decisions
should be based upon as many different courses as possible - Gilmore et al. suggest at
least five. However, if it is likely that an instructor will continue to teach the same courses
in which he or she has already been evaluated, then results from at least two different offerings of each of these courses are suggested. These recommendations require that a longitudinal archive of student ratings be maintained for personnel decisions. Such an archive would provide for more generalizable summaries, the assessment of changes over time, and the determination of which particular courses are best taught by a specific instructor. It is indeed unfortunate that some universities systematically collect students' evaluations, but fail to keep a longitudinal archive of the results. Such an archive would help overcome some of the objections to student ratings (e.g., idiosyncratic occurrences in one particular set of ratings), would enhance their usefulness, and would provide an important data base for
further research.

Student Written Comments - Generality Across Different Response Formats

SEEQ provides space for student comments, and responses to these comments were
used in the development of SEEQ to determine whether or not important aspects of teach-
ing effectiveness had been excluded. Students' comments are also viewed as a valuable source of diagnostic feedback for the instructor and are returned along with the computerized summary of the student ratings. However, it was assumed that content analyses of the written comments were practically unfeasible and that, perhaps, the comments would be most useful in their original form. Furthermore, students' comments have not been sys-
tematically related to responses to SEEQ items or to other indicators of effective teaching.
Braskamp and his colleagues (Braskamp et al., 1985; Braskamp et al., 1981; Ory et al.,
1980) have pursued some of these issues. In a small study of 14 classes the overall favorabil-
ity of student comments was assessed with reasonable reliability by two judges and its cor-
relation with the overall student ratings (0.93) was close to the limits of the reliability of the
two indicators (Ory et al., 1980). In a larger study of 60 classes Braskamp et al. (1981)
sorted student comments into one of 22 content categories and evaluated comments in
terms of favorability. Although the researchers did not provide reliability data for the
analysis of the written comments, the overall favorability of these comments was again
substantially correlated with the overall instructor rating (0.75). In both studies these
authors argued that student ratings are more cost effective for obtaining overall evalua-
tions of teaching effectiveness, but that the comments offer more specific and diagnostic
information than do class-average ratings.
In a related study, Ory and Braskamp (1981) simulated results from written comments
and rating items - both global and specific - about a hypothetical instructor. The simu-
lated comments were presented as if they were in their original unedited form, and not as
summaries of the content analyses as described earlier. They then asked faculty to evaluate
the ratings and comments for purposes of both self-improvement and personnel decisions
on a number of attributes. The rating items were judged as easier to interpret and more
comprehensive regardless of the purpose, but other judgements varied according to pur-
pose. In general, faculty judged ratings as superior for personnel decisions, but judged the
written comments as superior for self-improvement. Speculating on the results for written
comments, the authors suggested that “the nonstandardized, unique, personal written
comments by students are perceived as too subjective for important personnel decisions.
However, this highly idiosyncratic information about a particular course is viewed as use-
ful diagnostic information for making course changes” (pp. 280-281).
The research by Braskamp and his colleagues demonstrates that student comments, at
least on a global basis, can be reliably scored and that these scores agree substantially with
students' responses to overall rating items. This supports the generality of the ratings. They also contend that the comments contain useful information that is not contained in the overall ratings, and this seems plausible. Their findings also provide support for collecting student comments and returning these comments to faculty along with sum-
maries of the rating items. However, they do not indicate whether the additional infor-
mation provided by comments comes from the undigested, original comments or from the
results of detailed content analyses such as they performed. As Braskamp et al. (1981)
noted, their content categories are similar to those measured by multidimensional rating
scales, and so it may still be more cost effective to use rating items for even this type of
specific information. I suggest, as may be implied by Ory and Braskamp (1981), that the
useful information from comments that cannot be obtained from rating items is idiosyn-
cratic information that cannot easily be classified into generalizable categories, that is so
specific that its value would be lost if it was sorted into broader categories, or that cannot
be easily interpreted without knowledge of the particular context. From this perspective,
the attempt to systematically analyze student comments may be counterproductive. If such analyses are to be pursued, then further research is needed to demonstrate that this
lengthy and time consuming exercise will provide useful and reliable information that is
not obtainable from the more cost effective rating items.

Directions for Future Research

Systematic Differences Attributable to Subgroups Within Classes

Students' evaluations of teaching effectiveness are typically summarized by class-average responses and for most purposes this is appropriate. According to this perspective,
variance due to responses by different students within the same class represents error var-
iance -if every student provided a perfectly accurate report of teaching effectiveness, rat-
ings by all students would be the same. However, the longitudinal study by Overall and
Marsh (1979) demonstrated that differences in responses by individual students that were
observed in end-of-course ratings were consistent with differences observed in the retro-
spective ratings by the same students that were collected several years later. This argues,
along with commonsense, that an instructor’s effectiveness will vary systematically for dif-
ferent students in his or her class. While this likelihood has been given lip service in some
studies (for example, Marsh & Overall, 1979) it has not been given sufficient attention
even though it has tremendous implications.
On the negative side, if identifiable subgroups of students within a class give systemati-
cally different responses, then this may constitute a source of bias to the ratings (see Chap-
ter 5). However, this is unlikely. First of all, a wide variety of individual student charac-
teristics have been found to have little effect on student ratings. Second, even if some such characteristics did influence individual student responses, they would have little effect on class-average responses so long as the characteristics were evenly distributed across classes; even if students high
on a particular characteristic gave systematically higher ratings, it would only make a dif-
ference in classes that had a disproportionately high or low number of such students.
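A hypothetical arithmetic example (the figures are invented purely for illustration) shows how strong this dilution is. Suppose that students possessing some characteristic rate instructors 0.40 points higher on a 9-point scale, and that the proportion of such students ranges from 40% in one class to 60% in another. The resulting class averages are elevated by

    0.40 \times 0.40 = 0.16 \quad \text{and} \quad 0.40 \times 0.60 = 0.24

points respectively, so the largest between-class difference attributable to the characteristic is only about 0.08 points.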
On the positive side, if an instructor can be shown to be differentially effective with a
particular type of student, then this would have important implications for the type of clas-
ses which that instructor should teach. Department chairpersons often use intuitive hunches based on such logic in the assignment of teachers, but this would offer an ob-
jective way to test such hunches. Furthermore, the fruitfulness of such an approach has
been well-documented by the success of aptitude-treatment-interaction research and may
provide a useful starting point for future research in students’ evaluations of teaching ef-
fectiveness.

The Inappropriateness of Existing Generalizability Research

Remmers’ approach to the reliability of student ratings, agreement among different stu-
dents within the same class, has been useful, but it is too simplistic. Generalizability
theory, as applied by Kane and Brennan (1977) and by Kane et al. (1976), seemed to pro-
vide the promise of a much more elegant approach, and Aubrecht (1981) predicted that
there would be many such studies. However, researchers have apparently been unable to
deliver on this promise and there are a number of problems that require further attention.
First, while there have been numerous applications of generalizability theory in student
evaluation research, none that I know of have used responses to well defined multiple di-
mensions of teaching effectiveness even though generalizability theory has been used to
examine other multivariate constructs. Second, the ideal generalizability study would re-
quire that each of a large set of courses was taught by each of a large set of different instruc-
tors and was evaluated by the same set of students with no missing data. Since this is obvi-
ously impossible, it is typically assumed that the different instructors who teach the same
course over a period of years represent a random sample of instructors, that the different
students who take a given course represent a random sample of students, or that the diffe-
rent courses taught by the same teacher represent a random sample of courses. However,
such assumptions are clearly unwarranted. Gilmore (1980) illustrated that dramatically
different - or even impossible - findings result, depending on which of these assump-
tions is made, and concluded with a quote from Mark Twain: “The thirteenth stroke of a
clock is not only false of itself, but casts grave doubts on the credibility of the preceding
twelve” (p. 14). Gilmore, though discouraged about the application of generalizability
theory in student evaluation research, proposed that the problems could be resolved, but this
hope has apparently not been actualised.

Profile Analysis

Students' responses to evaluation instruments such as SEEQ are clearly multidimensional. This multidimensionality has important implications for how the rating dimensions
relate to other criteria of effective teaching and potential biases as demonstrated in sub-
sequent chapters of this monograph. However, Marsh and Hocevar (1984) suggest that
each teacher may have a distinguishable profile of scores for the different rating dimen-
sions, and that this profile may generalize across different courses taught by the same in-
structor. More research is needed to determine the reliability of a profile of scores, as op-
posed to the reliability of each score within the profile. If these profiles are reliable, then
researchers must ask how a given profile of scores is related to different validity criteria as
well as examining the relation for each score within the profile. It may be, for example,
that student learning is maximized when both Enthusiasm and Organization/Clarity are
high, whereas being high on either one alone is not sufficient. The implications of such re-
search may be important for student evaluation research, but may be even more important
for the study of teaching and teaching styles.
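As a sketch of how such profile questions might be addressed empirically, the similarity of an instructor's profile across two offerings of a course can be indexed by correlating the two sets of factor-score deviations. The example below uses invented numbers and is only one of several possible approaches; it is not a procedure taken from the studies cited.

    import numpy as np

    # Hypothetical factor-score profiles for one instructor in two offerings of the
    # same course, expressed as deviations from the overall mean of each SEEQ factor.
    offering_1 = np.array([0.6, -0.4, 0.3, 0.1, -0.2, 0.5, 0.0, -0.1, 0.2])
    offering_2 = np.array([0.5, -0.3, 0.4, 0.0, -0.1, 0.4, 0.1, -0.2, 0.3])

    # Profile similarity: the correlation between the two deviation profiles.  A high
    # value indicates that the instructor's pattern of relative strengths and
    # weaknesses (e.g., enthusiastic but poorly organized) generalizes across offerings.
    profile_similarity = np.corrcoef(offering_1, offering_2)[0, 1]
    print(round(profile_similarity, 2))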

Student Written Comments

Research by Braskamp and his colleagues suggests that student written comments can
be scored according to detailed content categories. Unfortunately, their research did not
include a multidimensional evaluation instrument. Hence, the extent to which student comments in specific categories converge with student ratings in matching categories is not known, and this seems like a fruitful application of MTMM analyses. Also, I know of
no research that has systematically related scores derived from student comments and
from rating items to other indicators of teaching effectiveness and this seems surprising.
While the cost of systematically analyzing student written comments may be prohibitive
for campus-wide programs, the undertaking seems to be reasonable as part of research
studies.
CHAPTER 4

VALIDITY

The Construct Validation Approach to Validity

Student ratings, which constitute one measure of teaching effectiveness, are difficult to
validate since there is no single criterion of effective teaching. Researchers who use a con-
struct validation approach have attempted to demonstrate that student ratings are logically
related to various other indicators of effective teaching. In this approach, ratings are re-
quired to be substantially correlated with a variety of indicators of effective teaching and
less correlated with other variables, and, in particular, specific rating factors are required
to be most highly correlated with variables to which they are most logically and theoreti-
cally related. Within this framework, evidence for the long-term stability of students’
evaluations, for the generalizability of student ratings of the same instructor in different
courses, and for the generalizability of inferences across student ratings and student writ-
ten comments (see Chapter 3) can be interpreted as support for their validity. The most
widely accepted criterion of effective teaching is student learning, but other criteria in-
clude changes in student behaviors, instructor self-evaluations, the evaluations of peers
and/or administrators who actually attend class sessions, the frequency of occurrence of
specific behaviors observed by trained observers, and the effects of experimental manipu-
lations.
Historically, some researchers have argued for the use of criterion-related validity in-
stead of construct validity in the study of students’ evaluations, typically proposing that
measures of student learning are the criterion against which to validate student ratings. If
researchers specifically define effective teaching to mean student learning, and operation-
ally define student learning and students’ evaluations, it may be accurate to describe such
findings in terms of criterion-related validity but there are problems with such an approach
(see also Howard et al., 1985). First, such an approach requires the assumption that effec-
tive teaching and student learning are synonymous and this seems unwarranted. A more
reasonable assumption that is consistent with the construct validation approach is that stu-
dent learning is only one indicator of effective teaching, even if a very important indicator.
Second, even if student learning were assumed to be the only criterion of effective teach-
ing, student learning is also a hypothetical construct and so it might still be appropriate to
use the construct validity approach. Third, since construct validity subsumes criterion-
related validity (Cronbach, 1984; APA, 1985) it is logically impossible for criterion-related
validity to be appropriate instead of construct validity. Fourth, the narrow criterion-
related approach to validity will inhibit a better understanding of what is being measured
by students' evaluations, of what can be inferred from students' responses, and of how findings from diverse studies can be understood within a common framework.

A construct validity approach to the study of students' evaluations of teaching effectiveness now appears to be widely accepted (for example, Howard et al., 1985), but its applica-
tion has been criticized. Doyle (1983), for example, argues that construct validation is par-
ticularly applicable for validating measures of effective teaching, but also states that:
“Construct validation (Cronbach and Meehl, 1955) is a mightily abused concept. Rather
than the rigorous articulation of interrelated hypotheses and research findings that its
creators envisioned, construct validation is often an umbrella notion applied to the
haphazard amassing of dubious research findings” (p. 58). While I applaud Doyle’s ap-
preciation of rigor and lofty ideals, it is important to note that Doyle’s perspective does not
reflect the present state of student evaluation research or the typical application of con-
struct validation (APA, 1985; Cronbach, 1984). For example, in his discussion of construct
validation Cronbach (1984, p. 149) stated that: (a) “creating a long-lived theory is an un-
reasonably lofty aspiration for present-day testers . . . Test interpreters employ a scien-
tific logic but - like engineers and physicians - they have to do the best they can with
comparatively primitive theory”; (b) “an interpretation is to be supported by putting together many pieces of evidence”; and (c) “this complexity means that validation cannot be reduced to rules.” Similarly, Howard et al. recommend that researchers collect as many
different measures of teaching effectiveness as possible to create a multiple
operationalized index of the construct and use this to assess the construct validity of each
of the individual measures that comprise the index. While a rigorous theory may be a distal
goal of student evaluation research, a more realistic proximal goal is the systematic appli-
cation of construct validation as described by Cronbach (1984).
The purpose of material to be described in this chapter is to examine empirical relations
between students’ evaluations of teaching effectiveness and other indicators that have
been posited to reflect effective teaching. The intent is not specifically to evaluate the con-
struct validity of these other criteria as indicators of effective teaching, but to some extent
this is inevitable. As emphasized by Cronbach (1984) and many others, one of the most dif-
ficult problems in validating interpretations of a measure is to obtain suitable criterion
measures. An important limitation in validity research is the inadequate attention given to
criterion measures. To the extent that criterion measures are not reliably measured, or do
not validly reflect effective teaching, then they will not be useful for testing the construct
validity of students’ evaluations of teaching effectiveness. More generally, criterion mea-
sures that lack reliability or validity should not be used as indicators of effective teaching
for research, policy formation, feedback to faculty or administrative decision making.

Student Learning - The Multisection Validity Study

Student learning, particularly if inferred from an objective, reliable, and valid test, is probably the most widely accepted criterion of effective teaching. The purpose of this section is not to argue for or against the use of student learning in the evaluation of teaching, but rather to review research that uses learning as a criterion for validating students' evaluations of
teaching effectiveness. Nevertheless, it is important to recognize that student achievement
is not generally appropriate as an indicator of effective teaching in universities as they are
presently organized. Examination scores in physical chemistry cannot be compared to
examination scores in introductory psychology; examination scores cannot be compared in
an upper-division and a lower-division psychology course; and examination scores generally cannot be compared even in two introductory psychology courses taught by different
teachers. It may be reasonable to compare pre-test and post-test scores within a single
course as an indicator that some learning has taken place, but it is not valid to compare the
pre-test-post-test differences obtained in two different courses. It may be useful to deter-
mine the percentage of students who successfully master a behavioral objective as evi-
denced by test performance, but it is not valid to compare percentages obtained from dif-
ferent courses. In a very specialized, highly controlled setting it may be valid to compare
teachers in terms of operationally defined student learning, and this is the intent of multi-
section validity studies. However, it is important to reiterate that most university teaching does not take place in such a setting. Student ratings of teaching effectiveness are related to
student learning in such a limited setting on the assumption that such results will generalize
to settings where student learning is not an adequate basis for assessing effective teaching,
and not to demonstrate the appropriateness of student learning as a criterion of effective
teaching in those other settings.

The Multisection Validity Paradigm

It may be reasonable to validate students' evaluations against student learning in large multisection courses where different groups of students are presented the same materials
by different instructors. In the ideal multisection validity study: (a) there are many sections
of a large multisection course; (b) students are randomly assigned to sections, or at least
enroll without any knowledge about the sections or who will teach them, so as to minimize
initial differences between sections; (c) there are pre-test measures that correlate substan-
tially with final course performance for individual students; (d) each section is taught com-
pletely by a separate instructor; (e) each section has the same course outline, textbooks,
course objectives, and final examination; (f) the final examination is constructed to reflect
the common objectives by some person who does not actually teach any of the sections,
and, if there is a subjective component, is graded by an external person; (g) students in
each section evaluate teaching effectiveness on a standardized evaluation instrument, preferably before they know their final course grade and without knowing how performance in their section compares with that of students in other sections; and (h) section-average student ratings are related to section-average examination performance (see Yunker, 1983, for discussion of the unit-of-analysis issue) after controlling for pre-test measures
(for general discussion see Benton, 1979; Cohen, 1981; Marsh, 1980a; Marsh & Overall,
1980; Yunker, 1983). Support for the validity of the student ratings is demonstrated when
the sections that evaluate the teaching as most effective near the end of the course are also
the sections that perform best on standardized final examinations, and when plausible
counter explanations are not viable. Remmers was the first to apply this type of paradigm,
though it was not until the 1970s and 1980s that it became more fully developed and more
widely applied.
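A minimal sketch of the section-level analysis described in step (h) is given below. The data are invented, and the use of ordinary least-squares residuals is only one of several ways the pre-test control could be implemented.

    import numpy as np

    # Hypothetical section-level means (one value per section).
    rating  = np.array([4.1, 3.6, 4.4, 3.9, 4.7, 3.2, 4.0, 4.3])   # mean overall rating
    exam    = np.array([71., 65., 78., 70., 80., 60., 69., 75.])   # mean final-exam score
    pretest = np.array([52., 50., 57., 53., 58., 47., 51., 55.])   # mean pre-test score

    def residuals(y, x):
        # Residuals of y after removing its linear relation with x.
        slope, intercept = np.polyfit(x, y, 1)
        return y - (slope * x + intercept)

    # Partial correlation between section-average ratings and section-average
    # examination performance, controlling for section-average pre-test scores.
    r_partial = np.corrcoef(residuals(exam, pretest), residuals(rating, pretest))[0, 1]
    print(round(r_partial, 2))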

Methodological Problems

Rodin and Rodin (1972) reported a negative correlation between section-average grade and section-average evaluations of graduate students teaching different quiz sections.
Ironically, this highly publicized study did not constitute a multisection validity study as
described above, and contained serious methodological problems. First, the ratings were
not of the instructor in charge of the course but of teaching assistants who played a small
ancillary role in the actual instruction. Thus, there was no way to separate achievement
produced by a teaching assistant from that due to the instructor; a student who put too
much reliance on the teaching assistant at the expense of lectures by the instructor might
evaluate the assistant highly and perform poorly on the exam. Doyle (1975) also argued
that a negative correlation might be expected since it would be the less able students who
would have the most need for the supplemental services provided by the teaching assis-
tants. Second, the study was conducted during the third term of a year-long course, and
students were free to change teaching assistants between terms. Furthermore, during the
third term students were not even required always to attend sections led by the same teach-
ing assistant. Consequently, the effects of different teaching assistants on student achievement were confounded. Third, there was no adequate measure of end-of-course achieve-
ment; performance was evaluated with problems given at the end of each segment of the
course, and students could repeat each exam as many as six times without penalty. Hence,
a teaching assistant who engendered resentment by applying added pressure on students
to continue retaking the exam might be evaluated poorly even though his or her students
eventually got more problems correct. Since there was no final examination, there is no
evidence that students who got more problems correct actually knew more at the end of the
course. Fourth, these negative findings have not been replicated in any other studies.
Hence, even though there are possible explanations for the negative correlation, the seri-
ous methodological problems in the study render the findings uninterpretable (see also
Frey, 1978). In reviewing this study, Doyle (1975) stated that “to put the matter bluntly,
the attention received by the Rodin study seems disproportionate to its rigor, and their
data provide little if any guidance in the validation of student ratings.” (p. 59). In retros-
pect, the most interesting aspect of this study was that such a methodologically flawed
study received so much attention.
Even when the design of multisection validity studies is more adequate, numerous
methodological problems may still exist. First, the sample size in any given study is almost always quite small - the number of different sections is generally about 15 - and this produces ex-
tremely large sampling errors. Second, most variance in achievement scores at all levels of
education is attributable to student presage variables and researchers are generally unable
to find appreciable effects due to differences in teacher, school practice, or teaching
method (Cooley & Lohnes, 1976; McKeachie, 1963). In multisection validity studies so
many characteristics of the setting are held constant, that differences in student learning
due to differences in teaching effectiveness are even further attenuated. Hence, though
the design is defensible, it is also quite weak for obtaining achievement differences that are
systematically correlated with students’ evaluations. Third, the comparison of findings ac-
ross different multisection validity studies is problematic, given the lack of consistency in
measures of course achievement and student rating instruments. Fourth, other criteria of
teaching effectiveness besides student learning should be considered; Marsh and Overall
(1980) found that different criteria of effective teaching were not significantly correlated
with each other even though each was significantly correlated with student ratings. Fifth,
presage variables such as initial student motivation and particularly ability level must be
equated across sections for comparisons to be valid. Even random assignment becomes in-
effective at accomplishing this when the number of sections is large and the number of stu-
dents within each section is small, because chance alone will create differences among the
sections. This paradigm does not constitute an experimental design in which students are
randomly assigned to treatment groups that are varied systematically in terms of experi-
mentally manipulated variables, and so the advantages of random assignment are not so
clear as in a standard experimental design.* Furthermore, the assumption of truly random
assignment of students to classes in large scale field studies is almost always compromised
by time-scheduling problems, students dropping out of a course after the initial assign-
ment, missing data, etc. For multisection validity studies the lack of initial equivalence is
particularly critical, since initial presage variables are likely to be the primary determinant
of end-of-course achievement. For this reason it is important to have effective pre-test
measures even when there is random assignment. While this may produce a pre-test sen-
sitization effect, the effect is likely to be trivial since: (a) pre-test variables will typically dif-
fer substantially from post-test measures; (b) there is no intervention other than the nor-
mal instruction that students expect to receive; (c) it seems unlikely that the collection of
pre-test measures will systematically affect either teaching effectiveness or student per-
formance; (d) pre-test measures can sometimes be obtained from student records without
having to actually be collected as part of the study; and (e) a ‘no-pre-test control’ could be
included. Sixth, performance on objectively scored examinations that have been the focus
of multisection validity studies may be an unduly limited criterion of effective teaching
(Dowell & Neal, 1982; Marsh & Overall, 1980). In summary, the multisection validity de-
sign is inherently weak and there are many methodological complications in its actual ap-
plication.

The Cohen Meta-Analysis

Cohen (1981) conducted a meta-analysis of all known multisection validity studies, regard-
less of methodological problems such as those found in the Rodin and Rodin study. Across 68
multisection courses, student achievement was consistently correlated with student ratings
of Skill (0.50), Overall Course (0.47), Structure (0.47), Student Progress (0.47), and Over-
all Instructor (0.43). Only ratings of Difficulty had a near-zero or a negative correlation
with achievement. The correlations were higher when ratings were of full-time teachers,
when students knew their final grade when rating instructors, and when achievement tests
were evaluated by an external evaluator. Other study characteristics (for example, ran-
dom assignment, course content, availability of pre-test data) were not significantly re-
lated to the results. Many of the criticisms of the multisection validity study are at least par-
tially answered by this meta-analysis, particularly problems due to small sample sizes and
the weakness of the predicted effect, and perhaps the issue of the multiplicity of achieve-

*The value of randomassignmentis frequentlymisunderstoodin student evaluation research, and particularly


in the multisection validity study. Random assignment is not an end, but merely a means to control for initial dif-
ferences in treatment groups that would otherwise complicate the interpretation of treatment effects. The effec-
tiveness of random assignment is positively related to the number of cases in each treatment group, but negatively
related to the number of different treatment groups. In the multisection validity study the instruction provided
by each teacher is a separate treatment. his or her students are the treatment group, and there is usually no 0
priori basis for establishing which of the many treatments is more or less effective. Hence, even with random as-
signment it is likely that some sections will have students who are systematically more able (prepared. motivated,
etc.) than others, and this is likely to bias the results (also see Yunker, 1983).
290 HERBERT W. MARSH

ment measures and student rating instruments. These results provide strong support for
the validity of students’ evaluations of teaching effectiveness.

Counter Explanations - the Grading Satisfaction Hypothesis

Marsh (1984; Marsh et al., 1975; Marsh & Overall, 1980) identified an alternative explanation for positive results in multisection validity studies that he called the grading satisfaction hypothesis (also called the grading leniency effect elsewhere). When course grades
(known or expected) and performance on the final exam are significantly correlated, then
higher evaluations may be due to: (a) more effective teaching that produces greater learn-
ing and higher evaluations by students; (b) increased student satisfaction with higher
grades which causes them to ‘reward’ the instructor with higher ratings independent of
more effective teaching or greater learning; or (c) initial differences in student characteris-
tics (for example, Prior Subject Interest, Motivation, and Ability) that affect both teaching
effectiveness and performance. The first hypothesis argues for the validity of student rat-
ings as a measure of teaching effectiveness, the second represents an undesirable bias in
the ratings, and the third is the effect of presage variables that may be accurately reflected
by the student ratings. Even when there are no initial differences between sections, either
of the first two explanations is viable, and Cohen’s finding that validity correlations are
substantially higher when students already know their final course grade makes the grad-
ing satisfaction hypothesis a plausible counter explanation. Palmer et al. (1978) made simi-
lar distinctions but their research has typically been discussed in relation to the potential
biasing effect of expected grades (see also Howard & Maxwell, 1980, 1982) rather than
multisection validity studies. Dowell and Neal (1982) also suggest such distinctions, but
then apparently confound the effects of grading leniency and initial differences in section-
average ability in their discussion and review of multisection validity studies.
Only in the two SEEQ studies (Marsh et al., 1975; Marsh & Overall, 1980) was the grad-
ing satisfaction hypothesis controlled as a viable alternative to support for the validity
hypothesis. The researchers reasoned that in order for satisfaction with higher grades to af-
fect students’ evaluations at the section-average level, section-average expected grades
must differ at the time the student evaluations are completed. In both these studies student
performance measures prior to the final examination were not standardized across sec-
tions. Hence, while each student knew approximately how his or her performance com-
pared to other students within the same section, there was no basis for knowing how the
performance of any section compared with that of other sections, and thus there was no
basis for differences between the sections in their satisfaction with expected grades. Con-
sistent with this suggestion, section-average expected grades indicated by students at the
time the ratings were collected did not differ significantly from one section to the next, and
were not significantly correlated with section-average performance on the final examina-
tion (even though individual expected grades within each section were). Since section-av-
erage expected grades at the time the ratings were collected did not vary, they could not
be the direct cause of higher student ratings that were positively correlated with student
performance, nor the indirect cause of the higher ratings as a consequence of increased stu-
dent satisfaction with higher grades. In most studies, where section-average expected
grades and section-average performance on the criterion measures are positively corre-
lated, the grading satisfaction hypothesis cannot be so easily countered.

Palmer et al. (1978) also compared validity and grading leniency hypotheses in a multi-
section validity study by relating section-average student learning (controlling for pre-test
data) and section-average grading leniency to student ratings. However, their study failed
to show a significant effect of either student learning or grading leniency on student rat-
ings. Potential problems with the study include the small number of sections (14) charac-
teristic of most multisection validity studies, and perhaps their operationalization of grad-
ing leniency as described below. Despite these problems, this study provides a
methodologically sophisticated approach to the analysis of multisection validity studies
that warrants further consideration.
Both the SEEQ and the Palmer et al. studies attempted to distinguish between the valid-
ity and the grading satisfaction (or grading leniency) hypotheses after controlling for initial
differences, but there were important differences in how this was accomplished. In the
SEEQ studies, due to the particular design of the study, there were no section-average dif-
ferences in expected grades at the time students completed their ratings and the authors ar-
gued that this eliminated the grading satisfaction hypothesis as a plausible explanation.
Palmer et al. used actual grades, instead of expected grades, as measured at approximately
the time when students completed the ratings. Since the grading satisfaction hypothesis
can only be explained in terms of expected grades the use of actual grades is dubious unless
it can be argued that actual and expected grades are virtually the same, that the basis of ac-
tual grades is the same for all sections, and that the relation between expected and actual
grades is the same for all sections. Hence it is recommended that expected grades, instead
of actual grades, should be the basis of such analyses in the future.
Palmer et al. further argued that grading leniency should be defined in terms of how ac-
tual grades differ from grades predicted on the basis of all pre-test variables and student
performance on their final test. This implies that the grading satisfaction hypothesis is due
to actual grades being higher or lower than predicted grades. However, this suggestion is
only plausible if it can be argued: (a) that actual grades are equivalent to expected grades
as described above; and (b) that predicted actual grades are equivalent to the grades that
students feel that they deserve. Palmer et al. make a similar point when they indicate that:
“What we are interested in, of course, is the students’ perceptions of instructor leniency”
(p. 858). Alternatively it may be plausible to argue that the grading satisfaction is based on
expected grades being higher or lower than those that students feel that they deserve.
However, even if students’ grades (expected or actual) and test performance both reflect
teaching effectiveness such that no grading leniency exists according to the Palmer et al.
definition, there would be no guarantee that subsequent ratings reflected the teaching ef-
fectiveness instead of satisfaction with the grades. Hence, while the Palmer et al. approach
may reflect a superior definition of objectively defined grading leniency, it apparently does
not provide an adequate test of the grading satisfaction hypothesis.
In the two SEEQ studies, the grading satisfaction hypothesis was tested in terms of
expected grades without any correction for the relative contribution of the instructor. That
is, student satisfaction with higher-than-average grades, even higher grades due to more
effective teaching, may be the cause of higher-than-average student ratings. Palmer et al.
argued that grades should be corrected for pre-test scores. In the SEEQ studies there were
no significant differences among sections in terms of the pre-test scores or expected grades
collected at the start of the course, and so such a correction would have had little effect.
Palmer et al. also argued that their measure of grading leniency should be corrected for
final test performance as an indication of the contribution of the instructor. In the SEEQ
studies there were no differences among sections in uncorrected expected grades and so
this correction would have led to the problematic conclusion that grading leniency as de-
fined by Palmer et al. was negatively correlated with student evaluations (i.e. lower than
deserved expected grades are associated with higher student ratings). These results cast
further doubt on the Palmer et al. approach to testing the grading satisfaction hypothesis
in multisection validity studies.

Implications For Further Research

Multisection validity studies should be designed according to the criteria discussed earlier. The interpretation of multisection validity studies may be substantially affected by
section-average differences in pre-test scores, the type of achievement tests used to infer
student learning, the nature of student ratings used to infer instructional effectiveness and
the expected grades at the time students evaluate teaching effectiveness. Hence it is im-
portant to test for the statistical significance of such differences and to provide some indi-
cation of effect size or variance explained. If section-average differences in either final
exam scores or student ratings are small, then it is important to determine whether the re-
sults reflect a lack of variation in the construct or a lack of reliability in the measures. Pre-
test measures should be substantially related to final examination performance of indi-
vidual students, and multiple regressions relating pre-test scores to examination scores
should be summarized. Nevertheless, if there is not at least a reasonable approximation to
random assignment and if pre-test scores are not reasonably consistent across sections,
then the statistical correction for such initial differences may be problematic. Further-
more, if section-average expected grades are significantly correlated with section-average
examination performance, and if the size of the validity coefficient is substantially reduced
when the effects of expected grades are controlled, then grading satisfaction may be a via-
ble alternative explanation of the results. It may be reasonable to correct expected grades
for pre-test scores, but such a correction will not have much effect unless there are substan-
tial section-average differences on pre-test scores and this situation produces other prob-
lems. The correction of section-average expected grades for the contribution of the in-
structional effectiveness as proposed by Palmer et al. is not recommended as the primary
analysis, but the results may further explicate a grading leniency effect. Finally, it may also
be useful to determine how student ratings of expected and deserved grades at the start of
the course, and at the time ratings are collected, affect the results of multisection validity
studies.
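As a hedged sketch of the section-level checks recommended above (all data are synthetic, and the variable names and effect sizes are my own; this is not an analysis of any study discussed here), one might compute the validity coefficient and then recompute it after partialling section-average expected grade out of both ratings and exam scores:

```python
# Sketch: zero-order vs. expected-grade-controlled validity coefficients (synthetic data).
import numpy as np

rng = np.random.default_rng(1)
n_sections = 30                                          # hypothetical number of sections
teaching  = rng.normal(size=n_sections)                  # unobserved teaching effectiveness
exam      = 0.6 * teaching + rng.normal(scale=0.8, size=n_sections)   # section-average exam score
exp_grade = 0.5 * exam + rng.normal(scale=0.8, size=n_sections)       # section-average expected grade
ratings   = 0.5 * teaching + 0.3 * exp_grade + rng.normal(scale=0.8, size=n_sections)

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z from each."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(rx, ry)[0, 1]

print(f"zero-order validity coefficient:     {np.corrcoef(ratings, exam)[0, 1]:.2f}")
print(f"validity controlling expected grade: {partial_corr(ratings, exam, exp_grade):.2f}")
# A large drop from the first to the second value would make grading satisfaction a
# plausible alternative explanation of the validity correlation, as argued in the text.
```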
The methodologically flawed study by Rodin and Rodin aroused considerable interest
in multisection validity studies, and focused attention on the methodological weaknesses
of the design. Perhaps more than any other type of study, the credibility of student ratings
has rested on this paradigm. Researchers’ preoccupation with the multisection validity
study has had both positive and negative aspects. The notoriety of the Rodin and Rodin
study required that further research be conducted. Despite methodological problems and
difficulties in the interpretation of results, Cohen’s meta-analysis demonstrates that sec-
tions for which instructors are evaluated more highly by students tend to do better on stan-
dardized examinations; a finding which has been taken as strong support for use of the rat-
ings. However, the limited generality of the setting, the inherent weakness of the design,
and the possibility of alternative explanations all dictate that it is important to consider
other criterion measures and other paradigms in student-evaluation research.

Instructor Self-Evaluations

Validity paradigms in student evaluation research are often limited to a specialized set-
ting (for example, large multisection courses) or use criteria such as the retrospective rat-
ings of former students that are unlikely to convince skeptics. Hence, the validity of stu-
dent ratings will continue to be questioned until criteria are utilized that are both applic-
able across a wide range of courses and widely accepted as an indicator of teaching effec-
tiveness (see Braskamp et al., 1985 for further discussion). Instructors’ self-evaluations of their own teaching effectiveness are a criterion that satisfies both of these requirements.
Furthermore, instructors can be asked to evaluate themselves with the same instrument
used by their students, thereby testing the specific validity of the different rating factors.
Finally, instructor self-evaluations are not substantially correlated with a wide variety of
instructor background/demographic characteristics other than their enjoyment of teaching
and their liking of their subject (Doyle & Weber, 1978; also Marsh, 1984 and discussion in
Chapter 5).
Despite the apparent appeal of instructor self-evaluations as a criterion of effective
teaching, it has had limited application. Centra (1973) found correlations of about 0.20 be-
tween faculty self-evaluations and student ratings, but both sets of ratings were collected
at the middle of the term as part of a study of the effect of feedback from student ratings
(see Chapter 7) rather than at the end of the course. Blackburn and Clark (1975) also re-
ported correlations of about 0.20, but they only asked faculty to rate their own teaching in
a general sense rather than their teaching in a specific class that was also evaluated by stu-
dents. In small studies with ratings of fewer than 20 instructors, correlations of 0.31 and
0.65 were reported by Braskamp et al. (1979) and 0.47 by Doyle and Crichton (1978). In
a study with 43 instructors, Howard et al. (1985) reported instructor self-evaluations to be
correlated 0.34 and 0.31 with ratings by students and former students, respectively. In
larger studies with ratings of 50 or more instructors, correlations of 0.62, 0.49, and 0.45
were reported by Webb and Nolan (1955), Marsh et al. (1979b), and Marsh (1982c).
Marsh (1982c; Marsh et al., 1979) conducted the only studies where faculty in a large
number of courses (81 and 329) were asked to evaluate their own teaching on the same
multifaceted evaluation instrument that was completed by students. In both studies: (a)
separate factor analyses of teacher and student responses identified the same evaluation
factors (see Table 2.2); (b) student-teacher agreement on every dimension was significant
(median r’s of 0.49 and 0.45; Table 4.1); (c) mean differences between student and faculty
responses were small and not statistically significant for most items, and were unsystematic
when differences were significant (i.e. student ratings were higher than faculty self-evalua-
tions for some items but lower for others).
In MTMM studies, multiple traits (the student rating factors) are assessed by multiple
methods (student ratings and instructor self-evaluations). Consistent with the construct
validation approach discussed earlier, correlations (see Table 4.1 for MTMM matrix from
Marsh’s 1982 study) between student ratings and instructor self-evaluations on the same
dimension (i.e. convergent validities - median r’s of 0.49 and 0.45) were higher than cor-
relations between ratings on nonmatching dimensions (median r’s of -0.04 and 0.02), and
this is taken as support for the divergent validity of the ratings. In the second study, sepa-
rate analyses were also performed for courses taught by teaching assistants, undergraduate
level courses taught by faculty, and graduate level courses. Support for both the conver-
gent and divergent validity of the ratings was found in each set of courses (see also Howard
et al., 1985).
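To make the MTMM logic concrete, the sketch below uses a toy 3 x 3 block with illustrative numbers (not the full matrix reproduced in Table 4.1): convergent validities sit on the diagonal of the student-by-self-evaluation block, divergent validity requires that they exceed the non-matching correlations, and the final line shows the kind of correction for unreliability mentioned in the note to Table 4.1 (the .41, .83 and .95 are the Learning/Value values reported there; the code itself is mine).

```python
# Toy MTMM check: convergent vs. non-matching correlations, plus disattenuation.
import numpy as np

# rows = student rating factors, cols = matching instructor self-evaluation factors
cross_block = np.array([
    [0.46, 0.10, -0.01],
    [0.12, 0.54, -0.04],
    [0.17, 0.08,  0.30],
])

convergent  = np.diag(cross_block)                     # same-dimension correlations
nonmatching = cross_block[~np.eye(3, dtype=bool)]      # different-dimension correlations
print("median convergent validity:     ", np.median(convergent))
print("median non-matching correlation:", np.median(nonmatching))

# Correction for unreliability: r / sqrt(alpha_x * alpha_y). An uncorrected Learning/Value
# validity of .41 with alphas of .83 (self) and .95 (students) becomes roughly .46.
print("disattenuated validity:", round(0.41 / np.sqrt(0.83 * 0.95), 2))
```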
In discussing instructor self-evaluations, Centra speculated that prior experience with
student ratings may influence self-evaluations (1975) and that instructors may lower their
self-evaluations as a consequence of having been previously evaluated by students so that
their ratings would be expected to be more consistent with student ratings (1979). If in-
structors were asked to predict how students would evaluate them, then Centra’s sugges-
tion might constitute an important methodological problem for self-evaluation studies.
However, both SEEQ studies specifically instructed the faculty to rate their own teaching
effectiveness as they perceived it even if they felt that their students would disagree, and
not to report how their students would rate them. Hence, the fact that most of the instruc-
tors in these studies had been previously evaluated does not seem to be a source of invalid-
ity in the interpretation of the results (see also Doyle, 1983). Furthermore, given that the
average of student ratings is a little over 4 on a 5-point response scale, if instructor self-
evaluations are substantially higher than student ratings before they receive any feedback
from student ratings as suggested by Centra, then faculty, on average, may have unrealistically high self-perceptions of their own teaching effectiveness. A systematic examination
of how instructor self-perceptions change, or do not change, as a consequence of student
feedback seems a fruitful area for further research.
This instructor self-evaluations research has important implications. First, the fact that
students’ evaluations show significant agreement with instructor self-evaluations provides
a demonstration of their validity that is acceptable to most researchers, and this agreement
can be examined in nearly all instructional settings. Second, there is good evidence for the
validity of student ratings for both undergraduate and graduate level courses (Marsh,
1982c). Third, support for the divergent validity demonstrates the validity of each specific
rating factor as well as of the ratings in general, and argues for the importance of using sys-
tematically developed, multifactor evaluation instruments.

Ratings by Peers

Peer ratings, based on actual classroom visitation, are often proposed as indicators of ef-
fective teaching (Braskamp et al., 1985; Centra, 1979; Cohen & McKeachie, 1980; French-
Lazovik, 1981; see also Aleamoni, 1985), and hence a criterion for validating students’
evaluations. In studies where peer ratings are not based upon classroom visitation (for ex-
ample, Blackburn & Clark, 1975; Guthrie, 1954; Maslow & Zimmerman, 1956), ratings by
peers have correlated well with student ratings of university instructors, but it is likely that
peer ratings are based on information from students. Centra (1975) compared peer ratings
based on classroom visitation and student ratings at a brand new university, thus reducing
the probable confounding of the two sources of information. Three different peers
evaluated each teacher on two occasions, but there was a relative lack of agreement among
peers (mean r = 0.26) which brings into question their value as a criterion of effective
teaching and precluded any good correspondence with student ratings (r = 0.20).
Morsh et al. (1956) correlated student ratings, student achievement, peer ratings, and
supervisor ratings in a large multisection course. Student ratings correlated with achieve-
ment, supporting their validity. Peer and supervisor ratings, though significantly corre-
lated with each other, were not related to either student ratings or to achievement,
Table 4.1
Multitrait-Multimethod Matrix: Correlations Between Student Ratings and Faculty Self-Evaluations in 329 Courses (Reprinted with permission from Marsh, 1984b)

The matrix crosses nine instructor self-evaluation factors with the corresponding nine student evaluation factors (Learning/Value, Enthusiasm, Organization, Group Interaction, Individual Rapport, Breadth, Examinations, Assignments, Workload/Difficulty). [The individual correlation coefficients are not legible in this reproduction.] The legible reliability (coefficient alpha) estimates are .83, .82, .74, .90, .82, .84, .76, .70, and .70 for the nine instructor self-evaluation factors (in the order listed above), and .95, (not legible), .93, .98, .96, .94, .93, .92, and .87 for the corresponding student evaluation factors.

Note. Values in parentheses in the diagonals of the upper left and lower right matrices, the two triangular matrices, are reliability (coefficient alpha) coefficients (see Hull & Nie, 1981). The underlined values in the diagonal of the lower left matrix, the square matrix, are convergent validity coefficients that have been corrected for unreliability according to the Spearman-Brown equation. The nine uncorrected validity coefficients, starting with Learning, would be .41, .48, .25, .46, .25, .37, .13, .36, and .54. All correlation coefficients are presented without decimal points. Correlations greater than .10 are statistically significant.

suggesting that peer ratings may not have value as an indicator of effective teaching. Webb
and Nolan (1955) reported good correspondence between student ratings and instructor
self-evaluations, but neither of these indicators was positively correlated with supervisor
ratings (which the authors indicated to be like peer ratings). Howard et al. (1985) found
moderate correlations between student ratings and instructor self-evaluations, but ratings
by colleagues were not significantly correlated with student ratings, self-evaluations, or
the ratings of trained observers.
Other reviews of the peer evaluation process in higher education settings (for example,
Centra, 1979; Cohen & McKeachie, 1980; Braskamp et al., 1985; French-Lazovik, 1981)
have also failed to cite studies that provide empirical support for the validity of peer ratings
based on classroom visitation as an indicator of effective college teaching or as a criterion
for student ratings. Cohen and McKeachie (1980) and Braskamp et al. (1985) suggested
that peer ratings may be suitable for formative evaluation, but suggested that they may not
be sufficiently reliable and valid to serve as a summative measure. Murray (1980), in com-
paring student ratings and peer ratings, found peer ratings to be “(1) less sensitive, reli-
able, and valid; (2) more threatening and disruptive of faculty morale; and (3) more af-
fected by non-instructional factors such as research productivity” (p. 45) than student rat-
ings. Ward et al. (1981; also see Braskamp et al., 1985) suggested a methodological prob-
lem with the collection of peer ratings in that the presence of a colleague in the classroom
apparently affects the classroom performance of the instructor and provides a threat to the
external validity of the procedure. In summary, peer ratings based on classroom visitation
do not appear to be very reliable or to correlate substantially with student ratings or with
any other indicator of effective teaching. While these findings neither support nor refute
the validity of student ratings, they clearly indicate that the use of peer evaluations of uni-
versity teaching for personnel decisions is unwarranted (see Scriven, 1981 for further dis-
cussion).

Behavioral Observations by External Observers

At the pre-college level, observational records compiled by specially trained observers are frequently found to be positively correlated with both student ratings of teaching effec-
tiveness and student achievement (see Rosenshine, 1971; Rosenshine & Furst, 1973 for
reviews), and similar studies at the tertiary level are also encouraging (see Dunkin &
Barnes, 1986; Murray, 1980). Murray (1976) found high positive correlations between ob-
servers’ frequency-of-occurrence estimates of specific teaching behaviors and an overall
student rating. Cranton and Hillgartner (1981) examined relationships between specific
teaching behaviors observed on videotapes of lectures in a naturalistic setting and student
ratings; student ratings of effectiveness of discussion were higher “when professors praised
student behavior, asked questions and clarified or elaborated student responses” (p.73);
student ratings of organization were higher “when instructors spent time structuring clas-
ses and explaining relationships” (p.73). Murray (1980) concluded that student ratings
“can be accurately predicted from outside observer reports of specific classroom teaching
behaviors” (p. 31).
In one of the most ambitious observation studies, Murray (1983) trained observers to
estimate the frequency of occurrence of specific teaching behaviors of 54 university in-
structors who had previously obtained high, medium or low student ratings in other clas-
ses. A total of 18 to 24 sets of observer reports were collected for each instructor. The me-
dian of single-rater reliabilities (i.e. the correlation between two sets of observational re-
ports) was 0.32, but the median reliability for the average response across the 18 to 24 re-
ports for each instructor was 0.77. Factor analysis of the observations revealed nine fac-
tors, and their content resembled factors in student ratings described earlier (for example,
Clarity, Enthusiasm, Interaction, Rapport, Organization). The observations significantly
differentiated among the three criterion groups of instructors, but were also modestly cor-
related with a set of background variables (for example, sex, age, rank, class size). Unfor-
tunately, Murray only considered student ratings on an overall instructor rating item, and
these were based upon ratings from a previous course rather than the one that was ob-
served. Hence, MTMM-type analyses could not be used to determine if specific observa-
tional factors were most highly correlated with matching student rating factors. The find-
ings do show, however, that instructors who are rated differently by students do exhibit
systematically different observable teaching behaviors.
Many observational studies focus on a limited range of teacher behaviors rather than the
broad spectrum of behaviors or a global measure of teaching effectiveness, and the study
of teacher clarity has been a particularly fruitful example. In field and correlational re-
search, observers measure (count or rate) clarity-related behaviors (see Land, 1985, for a
description of the types of behaviors) in natural classroom settings and these are related to
student achievement scores. In one such study, Land and Combs (1979) operationally de-
fined teacher clarity as the number of false starts or halts in speech, redundantly spoken
words, and tangles in words. More generally, teacher clarity is a term used to describe how
clearly a teacher explains the subject matter to students, and is frequently examined with
student evaluation instruments and with observational schedules. Teacher clarity vari-
ables are important because it has been shown that they can be reliably judged by students
and by external observers, they are consistently correlated with student achievement, and
they are amenable to both correlational and experimental designs (see Dunkin and
Barnes, 1986; Land, 1979,1985; Rosenshine & Furst, 1973). In experimental settings, les-
son scripts are videotaped which differ only in the frequency of clarity-related behaviors,
and randomly assigned groups of subjects view different lectures and complete achieve-
ment tests. Most studies, whether employing correlational or experimental designs, focus
on the positive relation between observations of clarity variables and student achieve-
ment. Dunkin and Barnes (1986), and Rosenshine and Furst (1973) were particularly im-
pressed with the robustness of this effect and its generality across different instruments,
different raters, and different levels of education. While not generally the primary focus
in this area of research, some studies have collected student ratings of teaching effective-
ness and particularly ratings of teacher clarity in conjunction with other variables.
Land and Combs (1981; see also Land & Smith, 1981) constructed 10 videotaped lec-
tures for which the frequency of clarity-related behaviors was systematically varied to rep-
resent the range of these behaviors observed in naturalistic studies. Ten randomly assigned
groups of students each viewed one of the lectures, evaluated the quality of teaching on a
ten item scale, and completed an objective achievement examination. The ten group-aver-
age variables (i.e. the average of student ratings and achievement scores in each group)
were determined, and the experimentally manipulated occurrence of clarity-related be-
haviors was substantially correlated with both student ratings and achievement, while stu-
dent ratings and achievement were significantly correlated with each other. Student re-
sponses to the item ‘the teacher’s explanations were clear to me’ were most highly corre-
lated with both the experimentally manipulated clarity behaviors and results on the
achievement test. In an observational study, Hines et al. (1982) found that observer ratings
on a cluster of 29 clarity-related behaviors were substantially correlated with both student
ratings and achievement in college level math courses. In a review of such studies, Land
(1985) indicated that while clarity behaviors were significantly related to both ratings and
achievement, the correlations with ratings were significantly higher.
Research on teacher clarity, though not specifically designed to test the validity of stu-
dents’ evaluations, offers an important paradigm for student evaluation research. Teacher
clarity is evaluated by items on most student evaluation instruments, can be reliably ob-
served in a naturalistic field study, can be easily manipulated in laboratory studies, and is
consistently related to student achievement. Both naturalistic observations and experi-
mental manipulations of clarity-related behaviors are significantly correlated with student
ratings and with achievement, and student ratings of teacher clarity are correlated with
achievement. This pattern of findings supports the inclusion of clarity on student evalua-
tion instruments, demonstrates that student ratings are sensitive to natural and experimen-
tally manipulated differences in this variable, and supports the construct validity of the stu-
dent ratings with respect to this variable.
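As a minimal sketch of the group-level analysis used in the clarity experiments described above (all numbers are synthetic and merely illustrative; they are not Land and Combs’ data), the manipulated clarity level of each lecture can be correlated with the group-average rating and achievement scores:

```python
# Group-level correlations in a clarity-manipulation design (synthetic illustration).
import numpy as np

rng = np.random.default_rng(4)
clarity_level   = np.arange(1, 11)                 # manipulated frequency of clarity behaviors
avg_rating      = 2.5 + 0.20 * clarity_level + rng.normal(scale=0.3, size=10)
avg_achievement = 50  + 2.0  * clarity_level + rng.normal(scale=3.0, size=10)

print("r(clarity, rating)      =", round(float(np.corrcoef(clarity_level, avg_rating)[0, 1]), 2))
print("r(clarity, achievement) =", round(float(np.corrcoef(clarity_level, avg_achievement)[0, 1]), 2))
print("r(rating, achievement)  =", round(float(np.corrcoef(avg_rating, avg_achievement)[0, 1]), 2))
```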
Systematic observations by trained observers are positively correlated with both stu-
dents’ evaluations and student achievement, even though research described in the last
section reported that peer ratings are not systematically correlated with either students’
evaluations or student achievement. A plausible reason for this difference lies in the relia-
bility of the different indicators. Class-average student ratings are quite reliable, but the
average agreement between ratings by any two students (i.e., the single rater reliability)
is generally in the 0.20s. Hence, it is not surprising that agreement between two peer vis-
itors who attend only a single lecture and respond to very general items is low. When ob-
servers are systematically trained and asked to rate the frequency of quite specific be-
haviors, and there is a sufficient number of ratings of each teacher by different observers,
then it is reasonable that their observations will be more reliable than peer ratings and
more substantially correlated with student ratings. However, further research is needed to
clarify this suggestion. For example, Howard et al. (1985) examined both external ob-
server ratings by trained graduate students and colleague ratings by untrained peers, but
found that neither was significantly correlated with the other, with instructor self-evalua-
tions, or with student ratings. However, both colleague and observer ratings were based
on two visits by only a single rater, both were apparently based on a similar rating instru-
ment, and the nature of the training given to the observers was not specified. While peer
ratings and behavioral observations have been considered as separate in the present arti-
cle, the distinction may not be so clear in actual practice; peers can be trained to estimate
the frequency of specific behaviors and some behavior observation schedules look like rat-
ing instruments.
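One way to see why aggregation matters so much here, offered as an illustrative calculation rather than a result from any of the studies cited, is the Spearman-Brown formula for the reliability of the average of k raters whose single-rater reliability is r:

$$ r_{kk} = \frac{k\,r}{1 + (k - 1)\,r} . $$

With r = .25, a class of k = 25 students gives r_kk = 6.25/7, or about .89, whereas one or two untrained peer visitors contribute little more than the single-rater reliability itself.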
The agreement between multifaceted observation schedules and multiple dimensions of
students’ evaluations appears to be an important area for future research. However, a
word of caution must be noted. The finding that specific teaching behaviors can be reliably
observed and do vary from teacher to teacher, does not mean that they are important.
Here, as with student ratings, specific behaviors and observational factors must also be re-
lated to external indicators of effective teaching. In this respect the simultaneous collec-
tion of several indicators of effective teaching is important, and the research conducted on
teacher clarity provides an important example.

Research Productivity

Teaching and research are typically seen as the most important products of university
faculty. Research helps instructors to keep abreast of new developments in their field and
to stimulate their thinking, and this in turn provides one basis for predicting a positive cor-
relation between research activity and students’ evaluations of teaching effectiveness.
However, Blackburn (1974) caricatured two diametrically opposed opinions about the di-
rection of the teaching/research relationship: (a) a professor cannot be a first rate teacher
if he/she is not actively engaged in scholarship; and (b) unsatisfactory classroom perform-
ance results from the professor neglecting teaching responsibilities for the sake of publica-
tions.
Marsh (1979; see also Centra, 1981, 1983; Feldman, 1987), in a review of 13 empirical
studies which mostly used student ratings as an indicator of teaching effectiveness, re-
ported that there was virtually no evidence for a negative relationship between effective-
ness in teaching and research; most studies found no significant relationship, and a few
studies reported weak positive correlations. Faia (1976) found no relationship at research-
oriented universities, but a small significant relationship where there was less emphasis on
research. Centra (1981, 1983), in perhaps the largest study of relations between teaching
effectiveness and research productivity (n = 4,596 faculty from a variety of institutions),
found weak positive correlations (median r = 0.22) between number of articles published
and students’ evaluations of teaching effectiveness for social sciences, but no correlation
in natural sciences and humanities. Centra (1983) concluded that the teaching effective-
ness/research productivity relationship is nonexistent or modest, and found support for
neither the claim that research activity contributes to nor that it detracts from teaching ef-
fectiveness. Marsh and Overall (1979b; see Table 5.2 in Chapter 5) found that instructor
self-evaluations of their own research productivity were only modestly correlated with
their own self-evaluations of their teaching effectiveness (r’s between 0.09 and 0.41 for 9
SEEQ factors and two overall ratings), and were less correlated with student ratings (r’s
between 0.02 and 0.21). However, 4 of these 11 correlations with students’ evaluations did
reach statistical significance, and the largest correlation was with student ratings of
Breadth of Coverage. The researchers reasoned that this dimension, which assesses
characteristics like ‘instructor adequately discussed current developments in the field’, was
the dimension most logically related to research activity. Linsky and Straus (1975) found
research activity was not correlated with students’ global ratings of instructors, but did cor-
relate modestly with student ratings of instructors’ knowledge (0.27). Frey (1978) found
that citation counts for more senior faculty in the sciences were significantly correlated
with student ratings of Pedagogical Skill (0.37), but not student ratings of Rapport (-0.23,
not significant). Frey emphasized that the failure to recognize the multifaceted nature of
students’ evaluations may mask consistent relationships, and this may account for the non-
significant relationships typically found.
Feldman (1987) recently completed a meta-analysis and substantive review of studies
that examined teaching effectiveness based on students’ evaluations and research produc-
tivity. Consistent with Marsh’s earlier 1979 review, Feldman found mostly near-zero or
slightly positive correlations. Using meta-analytic techniques he found an average correfa-
tion of 0.12 (p < 0.001) across the 29 studies in his review. This correlation remained refa-
tively stable across different indicators of research productivity except for five studies
using index citation counts which were unrelated to teaching effectiveness. The relation
was relatively unaffected by academic rank and faculty age, but the relation was slightly
higher in humanities and social sciences than natural sciences.
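For readers unfamiliar with how such an average correlation is typically obtained, one common procedure (sketched here as the general approach, not necessarily Feldman’s exact method) converts each study’s correlation to Fisher’s z, averages with sample-size-based weights, and transforms back:

$$ z_i = \tfrac{1}{2}\ln\frac{1 + r_i}{1 - r_i}, \qquad \bar{z} = \frac{\sum_i (n_i - 3)\, z_i}{\sum_i (n_i - 3)}, \qquad \bar{r} = \tanh(\bar{z}) . $$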
Feldman found that correlations between research productivity and his 19 specific
dimensions of students’ evaluations (see Table 2.1) varied from 0 to 0.21. Correlations of
about 0.2 were found for dimensions 3, 5, and 9 in Table 2.1, whereas dimensions 2, 8, 13, 14, 16, 18, and 19 in Table 2.1 were not significantly related to research productivity. Consis-
tent with Marsh’s research, those dimensions most logically related to research productiv-
ity were more highly correlated with it, but the differences were small. However, none of the dimensions was negatively correlated with research productivity as implied by Frey (1978).
Ability, time spent, and reward structure are all critical variables in understanding the
teaching/research relationship. In a model developed to explain how these variables are
related (Figure 4.1) it is proposed that: (a) the ability to be effective at teaching and re-
search are positively correlated (a view consistent with the first opinion presented by
Blackburn); (b) time spent on research and teaching are negatively correlated (a view con-
sistent with the second opinion presented by Blackburn) and may be influenced by a re-
ward structure which systematically favors one over the other; (c) effectiveness, in both
teaching and research, is a function of both ability and time allocation; (d) the positive re-
lationship between abilities in the two areas and the negative correlation in time spent in
the two areas will result in little or no correlation in measures of effectiveness in the two
areas. In support of predictions b, c, and d, Jauch (1976) found that research effectiveness
was positively correlated with time spent on research and negatively correlated with time
spent on teaching, and that time spent on teaching and research were negatively correlated
with each other (see also Harry & Goldner, 1972). In his review, Feldman (1987) found
some indication that research productivity was positively correlated with time or effort de-
voted to research and, perhaps, negatively correlated with time or effort devoted to teach-
ing. However, he found almost no support for the contention that teaching effectiveness was
related at all to time or effort devoted to either research or teaching. Thus, whereas there
is some support for the model in Figure 4.1, important linkages were not supported and are
in need of further research.
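The arithmetic behind prediction (d) can be illustrated with a small simulation (the correlations, the additive model of effectiveness, and all numbers are my own assumptions, not Marsh’s data): positively correlated abilities combined with negatively correlated time allocations yield a near-zero correlation between the two effectiveness measures.

```python
# Illustrative simulation of the teaching/research model: correlated abilities plus
# negatively correlated time allocations produce near-zero correlated effectiveness.
import numpy as np

rng = np.random.default_rng(2)
n = 1000
c = np.sqrt(0.75)                                   # keeps each simulated variable near unit variance

ability_teach = rng.normal(size=n)
ability_res   = 0.5 * ability_teach + c * rng.normal(size=n)    # abilities correlate about +.5
time_teach    = rng.normal(size=n)
time_res      = -0.5 * time_teach + c * rng.normal(size=n)      # time allocations about -.5

effect_teach = ability_teach + time_teach            # effectiveness = ability + time spent
effect_res   = ability_res + time_res

for label, x, y in [("abilities", ability_teach, ability_res),
                    ("time spent", time_teach, time_res),
                    ("effectiveness", effect_teach, effect_res)]:
    print(f"r({label}) = {np.corrcoef(x, y)[0, 1]:+.2f}")
```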
In summary, there appears to be a zero to low-positive correlation between measures of
research productivity and student ratings or other indicators of effective teaching, though
correlations may be somewhat higher for student rating dimensions which are most logi-
cally related to research effectiveness. While these findings seem neither to support nor re-
fute the validity of student ratings, they do demonstrate that measures of research produc-
tivity cannot be used to infer teaching effectiveness or vice versa.

The Use of Other Criteria of Effective Teaching

Additional research not described here has considered other indicators of effective
teaching, but the criteria are idiosyncratic to particular settings, are insufficiently de-
scribed, or have only been considered in a few studies. For example, in a multisection val-
idity study, Marsh and Overall (1980) found that students in sections that rated their teacher most
highly were more likely to pursue further coursework in the area and to join the local com-
puter club (the course was an introduction to computer programming). Similarly,
McKeachie and Solomon (1958) and Boulton, Bonge and Marr (1979) found that students

Figure 4.1 Model of predicted relations among teaching-related and research-related variables (Reprinted with permission from Marsh, 1984b). [Figure not reproduced; the model is described in the text above.]

from introductory level courses that were taught by highly rated teachers were more likely
to enroll in advanced courses than students taught by poorly rated teachers. As described
in Chapter 3, Ory et al. (1980) found high correlations between student ratings and summa-
tive measures obtained from open-ended comments and a group interview technique,
though the ratings proved to be the most cost effective procedure. Marsh and Overall
(1979b) asked lecturers to rate how well they enjoyed teaching relative to their other duties
such as research, committees, etc. Instructor enjoyment of teaching was significantly and
positively correlated with students’ evaluations and instructor self-evaluations (see Table
5.2), and the highest correlations were with ratings of Instructor Enthusiasm.

The Use of Multiple Criteria of Effective Teaching in the Same Study

Most researchers emphasize that teaching effectiveness should be measured from multiple perspectives and with multiple criteria. Braskamp et al. (1985) identify four sources of
information for evaluating teaching effectiveness: students, colleagues, alumni, and in-
structor self-ratings. Ratings by students and alumni are substantially correlated, and rat-
ings by each of these sources appear to be moderately correlated with self-evaluations.
However, ratings by colleagues based on classroom observations do not seem to be
systematically related to ratings by the other three sources. Braskamp et al. recommend
that colleagues should be used to review classroom materials such as course syllabi, assign-
ments, tests, texts, and Scriven (1981) suggests that such evaluations should be done by
staff from the same academic discipline from another institution in a trading of services ar-
rangement that eliminates costs. While this use of colleagues is potentially valuable, I
know of no systematic research that demonstrates the reliability or validity of such ratings.
Howard et al. (1985) contrasted SEEQ construct validity research that usually considers
only two or three indicators of teaching effectiveness for large samples with their study that
examined five different indicators in a single study for a small sample (43 classes). In this
study Howard et al. collected data from all four of the sources of information about teach-
ing effectiveness noted by Braskamp et al. (1985). They found that “former-students and
student ratings evidence substantially greater validity coefficients of teaching effectiveness
than do self-report, colleague and trained observer ratings” (p. 195), and that while self-
evaluations were modestly correlated with student ratings, colleague and observer ratings
were not correlated with each other or with any other indicators. While their findings are
in basic agreement with SEEQ research and other research reviewed here, the inclusion
of such a variety of measures within a single study is an important contribution.

Summary and Implications of Validity Research

Effective teaching is a hypothetical construct for which there is no single indicator. Hence, the validity of students’ evaluations or of any other indicator of effective teaching
must be demonstrated through a construct validation approach. Student ratings are sig-
nificantly and consistently related to a number of varied criteria including the ratings of
former students, student achievement in multisection validity studies, faculty self-evalua-
tions of their own teaching effectiveness, and, perhaps, the observations of trained obser-
vers on specific processes such as teacher clarity. This provides support for the construct
validity of the ratings. Peer ratings, based on classroom visitation, and research productiv-
ity were shown to have little correlation with students’ evaluations, and since they are also
relatively uncorrelated with other indicators of effective teaching, their validity as mea-
sures of effective teaching is problematic.
Nearly all researchers argue strongly that it is absolutely necessary to have multiple indi-
cators of effective teaching whenever the evaluation of teaching effectiveness is to be used
for personnel/tenure decisions. This emphasis on multiple indicators is clearly reflected in
research described in this article. However, it is critical that the validity of all indicators of
teaching effectiveness, not just student ratings, be systematically examined before they are
actually recommended for use in personnel/tenure decisions. It seems ironic that resear-
chers who argue that the validity of student ratings has not been sufficiently demonstrated,
despite the preponderance of research supporting their validity, are so willing to accept
other indicators which have not been tested or have been shown to have little validity.

Researchers seem less concerned about the validity of information that is given to in-
structors for formative purposes such as feedback for the improvement of teaching effec-
tiveness. This perspective may be justified, pending the outcome of further research, since
there are fewer immediate consequences and legal implications. Nevertheless, even for
formative purposes, the continued use of any sort of information about teaching effective-
ness is not justified unless there is systematic research that supports its validity and aids in
its interpretation. The implicit assumption that instructors will be able to separate valid
and useful information from that which is not when evaluating formative feedback, while
administrators will be unable to make this distinction when evaluating summative mater-
ial, seems dubious.

Directions for Further Research

Too little attention has been given to a grading leniency or grading satisfaction effect in
multisection validity research even though such an effect has been frequently posited as a
bias to students’ evaluations of teaching effectiveness (see Chapter 5). The SEEQ and the
Palmer et al. (1978) studies provide an important basis for further research. The design of
the SEEQ research disrupted the normal positive correlation between section-average ex-
pected grades and examination performance so that these variables were not significantly
correlated, but this design characteristic may limit the generality of the results and this ap-
proach. The particular approach used by Palmer et al. was apparently flawed, but their
general analytic strategy - particularly if based on student ratings of expected and de-
served grades instead of, or in addition to, actual grades and predicted grades - may
prove to be useful. Further research, particularly if able to resolve these issues, will have
important implications for multisection validity studies and will also provide a basis for the
application of a construct validation approach to the simultaneous study of validity and
bias as emphasized in Chapter 5.
Instructor self-evaluations of teaching effectiveness have been used in surprisingly few
large-scale studies. There is logical and empirical support for their use as a criterion of ef-
fective teaching, they can be collected in virtually all instructional settings, they are rela-
tively easy and inexpensive to collect, and they have important implications for the study
of the validity of students’ evaluations of teaching effectiveness and for the effect of poten-
tial sources of bias. Hence, this seems to be an important area for further research.
Murray’s 1983 study of observations of classroom behaviors provided an important de-
monstration of the potential value of ratings by external observers. Given the systematic
development of the classroom observation procedures in his study, it was unfortunate that
Murray’s measure of students’ evaluations of teaching effectiveness was so weak. How-
ever, the success of his study and this apparent problem point to an obvious direction for
further research.
Teacher clarity research provides a model for a new and potentially important paradigm
for validating student ratings. This potential is particularly exciting because it may bring
together findings from naturalistic field studies and more controlled laboratory settings.
While discussion in this monograph has focused on clarity, other specific behaviors can be
experimentally manipulated and the effects of these manipulations tested on student rat-
ings and other indicators of teaching effectiveness. Furthermore, if multiple behaviors are
manipulated within the same study, the paradigm can be used to test the discriminant and
convergent validity of student ratings. This type of approach to the study of students’
evaluations is not new, but its previous application has been limited primarily to the study
of potential biases to student ratings as in the Dr. Fox effect (see Chapter 6) and in the
study of semantic similarities in rating items as the basis for the robustness of the SEEQ
factor structure (see Chapter 2). In fact, findings from each of these alternative applica-
tions of this approach were interpreted as supporting the validity of students’ evaluations
for purposes of this monograph.
Marsh and Overall (1980) distinguished between cognitive and affective criteria of effec-
tive teaching, arguing for the importance of affective outcomes as well as cognitive out-
comes. Those findings indicate that cognitive and affective criteria need not be substan-
tially correlated, and appear to be differentially related to different student evaluation
rating components. Cognitive criteria have typically been limited to student learning as
measured in the multisection validity paradigm, and there are problems with such a narrow
definition. In contrast, affective criteria have been defined as anything that seems to be
noncognitive, and there are even more problems with such an open-ended definition.
Further research is needed to define more systematically what is meant by affective
criteria, perhaps in terms of the affective domains described elsewhere (for example,
Krathwohl et al., 1964), to operationally define indicators of these criteria, and to relate
these to multiple student rating dimensions (Abrami, 1985). The affective side of effective
teaching has not been given sufficient attention in student evaluation research or, perhaps,
in the study of teaching in general.
The disproportionate amount of attention given to the narrow definition of teaching ef-
fectiveness as student learning has stifled research on a wide variety of criteria that are ac-
ceptable in the construct validation approach. While this broader approach to validation
will undoubtedly provide an umbrella for dubious research as suggested by Doyle (1983),
it also promises to bring new vigor and better understanding to the study of students’
evaluations of teaching effectiveness. In particular, there is a need for studies that consider
many different indicators of teaching effectiveness in a single study (for example, Elliott, 1950; Howard et al., 1985; Morsh et al., 1956; Webb & Nolan, 1955).
Finally, the example provided by student evaluation research will hopefully stimulate
the systematic study of other indicators of effective teaching. Research reviewed in this
chapter has focused on the construct validation of student ratings of teaching effectiveness,
but the review suggests that insufficient attention has been given to the construct valida-
tion of other indicators of teaching effectiveness.
CHAPTER 5

RELATIONSHIP TO BACKGROUND
CHARACTERISTICS: THE WITCH HUNT FOR
POTENTIAL BIASES IN STUDENTS’ EVALUATIONS

The construct validity of students’ evaluations requires that they be related to variables
that are indicative of effective teaching, but relatively uncorrelated with variables that are
not (i.e. potential biases). Since correlations between student ratings and other indicators
of effective teaching rarely approach their reliability, there is considerable residual var-
iance in the ratings that may be related to potential biases. Furthermore, faculty generally
believe that students’ evaluations are biased by a number of factors which they believe to
be unrelated to teaching effectiveness (e.g., Ryan, Anderson and Birchler, 1980). In a
survey conducted at a major research university where SEEQ was developed (Marsh &
Overall, 1979b) faculty were asked which of a list of 17 characteristics would cause bias to
student ratings, and over half the respondents cited course difficulty (72 per cent), grading
leniency, (68 percent), instructor popularity (63 percent), student interest in subject be-
fore course (62 percent), course workload (60 percent), class size (60 percent), reason for
taking the course (55 percent), and student’s GPA (53 percent). In the same survey faculty
indicated that some measure of teaching quality should be given more emphasis in person-
nel decisions than was presently the case and that student ratings provided useful feedback
to faculty. A dilemma existed in that faculty wanted teaching to be evaluated, but were
dubious about any procedure to accomplish this purpose. They were skeptical about the
accuracy of student ratings for personnel decisions but were even more critical of class-
room visitation, self-evaluations and other alternatives. Whether or not potential biases
actually impact student ratings, their utilization will be hindered so long as faculty think
they are biased.
Marsh and Overall (1979b) also asked instructors to consider the special circumstances
involved in teaching a particular course (for example, class size, content area, students’ in-
terest in the subject, etc.) and to rate the “ease of teaching this particular course”. These
ratings of ease-of-teaching (see Table 5.1) were not significantly correlated with any of
the student rating factors, and were nearly uncorrelated with instructor self-evaluations.
Scott (1977) also asked instructors to indicate which, if any, ‘extenuating circumstances’
(e.g., class size, class outside area of competence, first time taught the course, as well as
an ‘other’ category) would adversely affect students’ evaluations. The only extenuating circumstance that actually affected a total score representing the students’ evaluations was class size, and this effect was small. These two studies suggest that extenuating cir-
cumstances which lecturers think might adversely affect students’ evaluations in a particu-
lar course apparently have little effect, and also support earlier conclusions that the par-
ticular course has relatively little effect on students’ evaluations compared to the effect of
the lecturer who teaches the course (see Chapter 3).
Table 5.1
Background Characteristics: Correlations With Student Ratings (S) and Faculty Self-Evaluations (F) of Their Own Teaching Effectiveness (N = 183 undergraduate courses; reprinted with permission from Marsh, 1984b)

Correlations are reported separately for student ratings (S) and faculty self-evaluations (F) on each of the nine SEEQ factors (Learning/Value, Enthusiasm, Organization, Group Interaction, Individual Rapport, Breadth, Examinations, Assignments, Workload/Difficulty) and on the Overall Course and Overall Instructor ratings, for the following background characteristics:
- Faculty rating of "scholarly production in their discipline" (1 = well below average to 5 = well above average)
- Students' rating of course Workload/Difficulty (1 = low to 5 = high)
- Faculty rating of course Workload/Difficulty (1 = low to 5 = high)
- Students' rating of expected course grade (1 = F to 5 = A)
- Faculty rating of "grading leniency" (1 = easy/lenient to 5 = hard/strict)
- Class size/enrollment (actual number of students enrolled)
- Faculty rating of "enjoy teaching relative to other duties" (1 = extremely unenjoyable to 5 = extremely enjoyable)
- Faculty rating of "ease of teaching particular course" (1 = very easy to 5 = very difficult)
[The individual correlation coefficients are not legible in this reproduction.]

Large-Scale Empirical Studies

Several large studies have looked at the multivariate relationship between a comprehen-
sive set of background characteristics and students’ evaluations. In such research it is im-
portant that similar variables not be included both as items on which students rate teaching
effectiveness and as background characteristics, particularly when reporting some sum-
mary measure of variance explained. For example, Price and Magoon (1971) found that 11
background variables explained over 20 percent of the variance in a set of 24 student rating
items. However, variables that most researchers would consider to be part of the evalua-
tion of teaching (e.g. Availability of Instructor, Explicitness of Course Policies) were con-
sidered as background characteristics and contributed to the variance explained in the stu-
dent ratings. Similarly, in a canonical correlation relating a set of class characteristics to a
set of student ratings, Pohlman (1975) found that over 20 percent of the variance in five
student rating items (i.e. the redundancy statistic described by Cooley & Lohnes, 1976)
could be explained by background characteristics. However, the best predicted student
rating item was Course Difficulty and it was substantially correlated with the conceptually
similar background characteristics of hours spent outside of class and expected grades.
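For reference, the redundancy statistic mentioned above is usually given (in the Stewart-Love formulation; whether this matches Cooley and Lohnes' exact notation is my assumption) as the sum, over canonical variates k, of the squared canonical correlation weighted by the average squared loading of the p student rating variables on that variate:

$$ \mathrm{Rd}_{Y \mid X} = \sum_{k} r_{c_k}^{2}\left(\frac{1}{p}\sum_{j=1}^{p} L_{jk}^{2}\right) , $$

which expresses the proportion of variance in the rating set Y reproducible from the background set X.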
Other multivariate studies have been more careful to separate variables considered as
part of the students’ evaluations and background characteristics. Brandenburg et al. (1977)
found that 27 percent of the variance in an average of student rating items could be
explained by a set of 11 background characteristics, but that most of the variance could be
explained by expected grade and, to a lesser extent, whether the course was an elective or
required. Brown (1976) found that 14 percent of the variance in an average of student rat-
ing items could be explained, but that expected grade accounted for the most variance.
Burton (1975) showed that eight background items explained 8-15 percent of the variance
in instructor ratings over a seven-semester period, but that the most important variable
was student enthusiasm for the subject. Centra and Creech (1976) found that four class-
room teacher variables (class size, years teaching, teaching load, and subject area) and
their interactions accounted for less than 3 percent of the variance in overall instructor rat-
ings. Stumpf, Freedman, and Aguanno (1979) found that background variables accounted
for very little of the variance in student ratings after the effects of expected grades (which
they reported to account for about 5 percent of the variance) had been controlled.
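The hierarchical logic used by Stumpf et al. can be made concrete with a brief sketch. The code below is
illustrative only and is not taken from any of the studies reviewed: the data are simulated, the variable names
are hypothetical, and the point is simply the arithmetic of entering expected grades first and then asking how
much additional rating variance the remaining background characteristics explain.

    import numpy as np

    rng = np.random.default_rng(0)
    n_classes = 200

    # Hypothetical class-average background characteristics and ratings.
    expected_grade = rng.normal(3.0, 0.4, n_classes)
    prior_interest = rng.normal(3.5, 0.5, n_classes)
    class_size = rng.lognormal(3.0, 0.8, n_classes)
    rating = (0.30 * expected_grade + 0.45 * prior_interest
              - 0.002 * class_size + rng.normal(0, 0.4, n_classes))

    def r_squared(y, *predictors):
        """Proportion of variance in y explained by an ordinary least-squares fit."""
        X = np.column_stack([np.ones_like(y), *predictors])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        return 1 - np.var(y - X @ coef) / np.var(y)

    r2_grades_only = r_squared(rating, expected_grade)
    r2_all = r_squared(rating, expected_grade, prior_interest, class_size)
    print(f"R2 for expected grades alone:      {r2_grades_only:.3f}")
    print(f"increment for remaining variables: {r2_all - r2_grades_only:.3f}")

The increment printed on the last line is the quantity that Stumpf et al. found to be very small once expected
grades had been controlled.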
A few studies have considered both multiple background characteristics and multiple di-
mensions of students’ evaluations. Marsh (1980b) found that a set of 16 background
characteristics explained about 13 percent of the variance in the set of SEEQ dimensions.
However, the amount of variance explained varied from more than 20 percent in the Over-
all Course rating and the Learning/Value dimension, to about 2 percent for the Organization
and Individual Rapport dimensions. Four background variables were most important
and could account for nearly all the explained variance; more favorable ratings were corre-
lated with higher prior subject interest, higher expected grades, higher levels of Workload/
Difficulty, and a higher percentage of students taking the course for General Interest
Only. A path analysis (see Table 5.2) demonstrated that prior subject interest had the
strongest impact on student ratings, and that this variable also accounted for about one-
third of the relationship between expected grades and student ratings. Marsh (1983) de-
monstrated a similar pattern of results in five different sets of courses (one of which was
the set of courses used in the 1980 study) representing diverse academic disciplines at the
graduate and undergraduate level, though the importance of a particular characteristic
varied somewhat with the academic setting.

Table 5.2
Path Analysis Model Relating Prior Subject Interest, Reason for Taking Course, Expected Grade and Workload/
Difficulty to Student Ratings (reprinted with permission from Marsh, 1984b)

                                                        Factor
                         I. Prior Subject    II. Reason (General    III. Expected        IV. Workload/
                            Interest             Interest Only)         Course Grade         Difficulty
Student ratings          DC   TC  Orig r     DC   TC  Orig r        DC   TC  Orig r      DC   TC  Orig r

Learning/Value           36   44    ..       15   13   15           26   20   29         17   17   12
Enthusiasm               17   23    ..       09   08   09           20   16   20         11   11   06
Organization            -04  -04   -03       16   16   16           03   02   01         04   04   00
Group Interaction        21   28    29       06   06   07           30   27   31         06   06  -02
Individual Rapport      -05   09    09      -01  -02  -02           18   16   17         06   06   01
Breadth                 -07  -03   -03       23   19   19           06  -01  -02         21   21   15
Exams/Grading           -05   03    03       12   10   10           25   18   18         20   20   10
Assignments              11   19    20       21   17   18           19   09   13         30   30   23
Overall course           23   32    33       19   15   16           26   15   22         30   30   23
Overall instructor       12   20    20       13   11   12           24   17   20         17   17   10
Variance components (a) 2.9% 5.1%  5.3%     2.3% 1.5% 1.8%         4.5% 2.6% 4.0%       3.6% 3.6% 1.8%

Note. The methods of calculating the path coefficients (p values in Figure 5.1), Direct Causal coefficients (DC),
and Total Causal coefficients (TC) are described by Marsh (1980a). Orig r = original correlation with the student
rating factor. Decimal points are omitted; .. indicates an entry that is illegible in the available copy. See
Figure 5.1 for the corresponding path model.
(a) Calculated by summing the squared coefficients, dividing by the number of coefficients, and multiplying by
100%.
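The variance-components row of Table 5.2 can be reproduced directly from the tabled coefficients. The short
check below is mine rather than part of the original table; it simply applies footnote (a) to the ten Direct
Causal coefficients for prior subject interest, with decimal points restored.

    # Footnote (a) of Table 5.2 applied to the Factor I (Prior Subject Interest) DC column.
    dc_prior_interest = [0.36, 0.17, -0.04, 0.21, -0.05, -0.07, -0.05, 0.11, 0.23, 0.12]
    variance_component = sum(c ** 2 for c in dc_prior_interest) / len(dc_prior_interest) * 100
    print(f"{variance_component:.1f}%")  # prints 2.9%, the first entry of the variance-components row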

Based upon a review of large multivariate studies that examine the combined effect of
a set of background variables on student ratings, it appears that between 5 percent and 25
percent of the variance in student ratings can be explained, depending upon the nature of
the student rating items, the background characteristics, perhaps the academic discipline,
and perhaps the institution(s) where the study was conducted. Prior subject interest, ex-
pected grades, and perhaps Workload/Difficulty seem to be the background variables
which are most strongly correlated with students’ evaluations of teaching.

Figure 5.1 Path analysis model relating prior subject interest, reason for taking course, expected grade, and
Workload/Difficulty (path coefficients for the student rating factors appear in Table 5.2; reprinted with
permission from Marsh, 1984b)

A Construct Approach to the Study of Bias

The finding that a set of background characteristics are correlated with students’ evalua-
tions of teaching effectiveness should not be interpreted to mean that the ratings are
biased, though this conclusion is often inferred by researchers. Support for a bias hypo-
thesis, as with the study of validity, must be based on a construct approach. This approach
requires that the background characteristics that are hypothesized to bias students’ evalua-
tions be examined in studies which are relatively free from methodological flaws using dif-
ferent approaches, and interpreted in relation to a specific definition of bias. Despite the
huge effort in this area of student-evaluation research, such a systematic approach is rare.
Perhaps more than any other area of student-evaluation research, the search for potential
sources of bias is extensive, confused, contradictory, misinterpreted, and methodologi-
cally flawed. In the subsections which follow, methodological weaknesses common to
many studies are presented, theoretical definitions of bias are discussed, and alternative
approaches to the study of bias are considered. The purpose of these subsections is to pro-
vide guidelines for evaluating existing research and for conducting future research. Fi-
nally, within this context, relationships between students’ evaluations and specific charac-
teristics frequently hypothesized to bias student ratings are examined.

Methodological Weaknesses Common to Many Studies

Important and common methodological problems in the search for potential biases to
students’ evaluations include the following.
(1) Using correlation to argue for causation: the claim that some variable biases student
ratings implies that causation has been demonstrated, whereas correlation shows only that a
concomitant relation exists.
(2) Neglect of the distinction between practical and statistical significance: all conclu-
sions should be based upon some index of effect size as well as on tests of statistical signifi-
cance.
(3) Failure to consider the multivariate nature of both student ratings and a set of poten-
tial biases.
(4) Selection of an inappropriate unit of analysis. Since nearly all applications of
students’ evaluations are based upon class-average responses, this is nearly always the ap-
propriate unit of analysis. The size and even the direction of correlations based on class-
average responses may be different from correlations obtained when the analysis is per-
formed on responses by individual students (an illustrative example follows this list). Hence,
effects based on individual students as the unit of analysis must also be demonstrated to
operate at the class-average level.
(5) Failure to examine the replicability of findings in a similar setting and their
generalizability to different settings - this is particularly a problem in studies based on
small sample sizes or on classes from a single academic department at a single institution.
(6) The lack of an explicit definition of bias against which to evaluate effects: if a
variable actually affects teaching effectiveness and this effect is accurately reflected in student
ratings, then the influence is not a bias.
(7) Questions of the appropriateness of experimental manipulations: studies that
attempt to simulate hypothesized biases with operationally defined experimental manipu-
lations must demonstrate that the size and nature of the manipulation and the observed
effects are representative of those that occur in natural settings (i.e. they must examine
threats to the external validity of the findings).
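To make point (4) above concrete, the toy simulation below (my own construction; the data, effect sizes, and
within-class pattern are purely hypothetical) builds classes in which the within-class relation between grades
and ratings runs opposite to the between-class relation, so that the correlation computed over individual
students differs in sign from the correlation computed over class averages.

    import numpy as np

    rng = np.random.default_rng(1)
    n_classes, n_students = 50, 30

    grades, ratings, class_ids = [], [], []
    for c in range(n_classes):
        # Between classes: better teaching (quality) raises both grades and ratings.
        quality = rng.normal()
        # Within each class, lower-graded students are constructed to rate the instructor
        # more favorably, a purely hypothetical pattern chosen so that the within-class and
        # between-class relations run in opposite directions.
        struggle = rng.normal(size=n_students)
        grades.append(quality - 1.5 * struggle + rng.normal(0, 0.3, n_students))
        ratings.append(quality + 1.5 * struggle + rng.normal(0, 0.3, n_students))
        class_ids.append(np.full(n_students, c))

    grades, ratings, class_ids = map(np.concatenate, (grades, ratings, class_ids))

    def class_means(x):
        return np.array([x[class_ids == c].mean() for c in range(n_classes)])

    r_individual = np.corrcoef(grades, ratings)[0, 1]
    r_class_avg = np.corrcoef(class_means(grades), class_means(ratings))[0, 1]
    print(f"individual students as unit of analysis: r = {r_individual:.2f}")  # negative here
    print(f"class averages as unit of analysis:      r = {r_class_avg:.2f}")   # positive here

Nothing in this construction is claimed to describe real classes; it only illustrates why an effect demonstrated
with individual students as the unit of analysis need not hold, or even keep its sign, at the class-average level
at which ratings are actually used.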

Theoretical Definitions of Bias

An important problem in research that examines the effect of potential biases to stu-
dents’ evaluations is that adequate definitions of bias have not been formulated. The mere
existence of a significant correlation between students’ evaluations and some background
characteristic should not be interpreted as support for a bias hypothesis. Even if a back-
ground characteristic is causally related to students’ evaluations, that alone is insufficient
evidence to support a bias hypothesis. For example, it can be plausibly argued that many of
the validity criteria discussed earlier, the alternative indicators of effective teaching such
as student learning and experimental manipulations of teacher clarity, are causally related
to students’ evaluations, but it makes no sense to argue that they bias students’ evalua-
tions. Support for a bias hypothesis must be based on a theoretically defensible definition
of what constitutes a bias. Alternative definitions of bias, which are generally implicit
rather than explicit, are described below.
One possible example, the ‘simplistic bias hypothesis’, is that if an instructor: (a) gives
students high grades; (b) demands little work of students; and (c) agrees to be evaluated
in small classes only; then he or she will be favorably evaluated on all rating items. Implicit
is the assumption that instructors will be rewarded on the basis of these characteristics
rather than on effective teaching. The studies by Marsh (1980b, 1983, 1984) clearly refute
this simplistic hypothesis. The clarity of the SEEQ factor structure demonstrates that stu-
dents differentiate their responses on the basis of more than just global impressions, so that
potential biases, if they do have an effect, will affect different rating dimensions differen-
tially. No background variable was substantially correlated with more than a few SEEQ
factors, and each showed little or no correlation with some of the SEEQ factors. The per-
centage of variance that could be explained in different dimensions varied dramatically.
Furthermore, the direction of the Workload/Difficulty effect was opposite to that pre-
dicted by the hypothesis, while the class size effect was small for dimensions other than
Group Interaction and Individual Rapport. Most importantly, the entire set of back-
ground variables, ignoring the question of whether or not any of them represent biases,
was able to explain only a small portion of the variance in student ratings.
The ‘simplistic bias hypothesis’ represents a strawman and its rejection does not mean
that student ratings are unbiased, but only that they are not biased according to this con-
ceptualization of bias. More rigorous and sophisticated definitions of bias are needed. A
more realistic definition that has guided SEEQ research is that student ratings are biased
to the extent that they are influenced by variables that are unrelated to teaching effective-
ness, and, perhaps, to the extent that this influence generalizes across all rating factors
rather than being specific to the particular factors most logically related to the influence.
For example, even though student learning in multisection validity studies is correlated
with student ratings, this effect should not be considered a ‘bias’. However, this seemingly
simple and intuitive notion of bias is difficult to test. It is not sufficient to show that some
variable is correlated with student ratings and that a causal interpretation is warranted; it
must also be shown that the variable is not correlated with effective teaching. This is dif-
ficult in that effective teaching is a hypothetical construct so that all the problems involved
in trying to show that student ratings are valid come into play, and trying to ‘prove’ a null
hypothesis is always problematic. According to this definition of bias, most claims that stu-
dents’ evaluations are biased by any particular characteristics are clearly unwarranted (see
Feldman, 1984, for further discussion).
Other researchers imply yet another definition of bias by arguing that ratings are biased
to the extent that they are affected by variables that are not under the control of the in-
structor. According to this conceptualization, ratings must be ‘fair’ to be unbiased, even
to the extent of not accurately reflecting influences that do affect teaching effectiveness.
Such a definition is particularly relevant to a variable like prior subject interest that prob-
ably does affect teaching effectiveness in a way which is accurately reflected by student rat-
ings (see discussion below and Marsh & Cooper, 1981). Ironically, this conceptualization
would not classify a grading leniency effect (i.e. students giving better-than-deserved rat-
ings to instructors as a consequence of instructors giving better-than-deserved grades to
students) as a bias, since this variable is clearly under the control of the instructor. Hence,
while the issue of fairness is important, particularly when students’ evaluations are to be
used for personnel decisions, this definition of bias also seems to be inadequate. While
there is a need for further clarification of the issues of bias and fairness, it is also important
to distinguish between these two concepts so that they are not confused. The ‘fairness’ of
students’ evaluations needs to be examined separately from, or in addition to, the exami-
nation of their validity and susceptibility to bias.
Still other researchers (for example, Hoyt et al., 1973; Brandenburg et al., 1977; see also
Howard and Bray, 1979) seem to circumvent the problem of defining bias by statistically
controlling for potential biases with multiple regression techniques or by forming norma-
tive (cohort) groups that are homogeneous with respect to potential biases (e.g., class
size). However, underlying this procedure is the untested assumption that the variables
being controlled are causally related to student ratings, and that the relationship does rep-
resent a bias. For example, if inexperienced, less able teachers are systematically assigned
to teach large introductory classes, then statistically removing the effect of class size is not
appropriate. Furthermore, this approach is predicated on the existence of a theoretical de-
finition of bias and offers no help in deciding what constitutes a bias. Thus, while this pro-
cedure may be appropriate and valuable in some instances, it should only be used cauti-
ously, and in conjunction with research findings that demonstrate that a variable does con-
stitute a bias according to a theoretically defensible definition of bias or fairness.

Approaches to Exploring for Potential Biases

Over a decade ago, McKeachie (1973) argued that student ratings could be better under-
stood if researchers did not concentrate exclusively on trying to interpret background rela-
tionships as biases, but instead examined the meaning of specific relationships. Following
this orientation, several approaches to the study of background influences have been
utilized. The most frequently employed approach is simply to correlate class-average stu-
dents’ evaluations with a class-average measure of a background variable hypothesized to
bias student ratings. Such an approach can be heuristic, but in isolation it can never be used
to demonstrate a bias. Instead, hypotheses generated from these correlational studies
should be more fully explored in further research using alternative approaches such as
those described below.

One alternative approach (Bausell & Bausell, 1979; Marsh, 1982a) is to examine the re-
lationship between differences in background variables and differences in student ratings
for two or more offerings of the same course taught by the same instructor. The rationale
here is that since the instructor is the single most important determinant of student ratings,
the within-instructor comparison provides a more powerful analysis. Marsh found that for
pairs of courses the more favorably evaluated offering was correlated with: (a) higher ex-
pected grades, and presumably better mastery since grades were assigned by the same in-
structor to all students in the same course; (b) higher levels of Workload/Difficulty; and (c)
the instructor having taught the course at least once previously (and presumably having be-
nefited from that experience and the student ratings). Other background characteristics
such as class size, reason for taking a course, and prior subject interest had little effect.
While providing valuable insight, this approach is limited by technical difficulties involved
in comparing sets of difference scores, by the lack of variance in difference scores repre-
senting both student ratings and the background characteristics (i.e. if there is little var-
iance in the difference scores, then no relationship can be shown), and by difficulties in the
interpretation of the difference scores.
A second approach is to isolate a specific variable, simulate the variable with an experi-
mental manipulation, and examine its effect in experimental studies where students are
randomly assigned to treatment conditions. The internal validity (see Campbell & Stanley,
1973, for a discussion of internal and external threats to validity) of interpretations is
greatly enhanced since many counter explanations that typically exist in correlational
studies can be eliminated. However, this can only be accomplished at the expense of many
threats to the external validity of interpretations: the experimental setting or the manipu-
lation may be so contrived that the finding has little generality to the actual application of
student ratings; the size of the experimental manipulation may be unrealistic; the nature
of the variable in question may be seriously distorted in its ‘operationalization’; and effects
shown to exist when the individual student is the unit-of-analysis may not generalize when
the class-average is used as the unit-of-analysis. Consequently, while the results of such
studies can be very valuable, it is still incumbent upon the researcher to explore the exter-
nal validity of the interpretations and to demonstrate that similar effects exist in real set-
tings where student ratings are actually employed.
A third approach, derived from the construct validation emphasis, is based upon the as-
sumption that specific variables (for example, background characteristics, validity criteria,
experimental manipulations, etc.) should logically or theoretically be related to some
specific components of students’ evaluations, and less related to others. According to this
approach, if a variable is most highly correlated with the dimensions to which it is most log-
ically connected, then the validity of the ratings is supported. For example, class size is sub-
stantially correlated with ratings of Group Interaction and Individual Rapport but not with
other SEEQ dimensions (Marsh et al., 1979a; see discussion below). This pattern of find-
ings argues for the validity of the student ratings. Many relationships can be better under-
stood from this perspective rather than from trying to support or refute the existence of a
bias that impacts all student ratings.
A related approach, that has guided SEEQ research, is more closely tied to an earlier
definition of bias. This approach is based upon the assumption that a ‘bias’ that is specific
to student ratings should have little impact on other indicators of effective teaching. If a
variable is related both to student ratings and to other indicators of effective teaching, then
the validity of the ratings is supported. Employing this approach, Marsh asked instructors
in a large number of classes to evaluate their own teaching effectiveness with the same
SEEQ form used by their students, and the SEEQ factors derived from both groups were
correlated with background characteristics. Support for the interpretation of a bias in this
situation requires that some variable be substantially correlated with student ratings, but
not with instructor self-evaluations of their own teaching (also see Feldman, 1984). Of
course, even when a variable is substantially correlated with both student and instructor
self-evaluations, it is still possible that the variable biases both student ratings and instruc-
tor self-evaluations, but such an interpretation requires that the variable is not substan-
tially correlated with yet other valid indicators of effective teaching. Also, when the pat-
tern of correlations between a specific variable and the set of student evaluation factors is
similar to the pattern of correlations with faculty self-evaluation factors, there is further
support for the validity of the student ratings. Results based on this and the other ap-
proaches will be presented below in the discussion of the effects of specific background
characteristics.
A fourth, infrequently used approach is to derive relations between student ratings and
external characteristics on the basis of explicit theory. Empirical support for the predic-
tions then supports the validity of both the theory and the measures used to test the theory.
Such an approach, depending on the nature of the theory and the definition of bias, may
also demonstrate that a particular set of relations between student ratings and background
characteristics do not constitute a bias. In one such application, Neumann and Neumann
(1985) used Biglan’s (1973) model of subject matter in different
academic disciplines (e.g., soft/hard, pure/applied, and life/nonlife) and its relation to the
role of teaching in the different disciplines to predict discipline differences in student rat-
ings.

Effects of Specific Background Characteristics Emphasized in SEEQ Research

Hundreds of studies have used a variety of approaches to examine the influence of many
background characteristics on students’ evaluations of teaching effectiveness, and a com-
prehensive review is beyond the scope of this monograph. Many of the older studies may
be of questionable relevance, and may also have been inaccurately described. Reviewers,
apparently relying on secondary sources, have perpetuated these inaccurate descriptions
and faulty conclusions (some findings commonly cited in ‘reviews’ are based upon older
studies which did not even consider the variable they are cited to have examined - see
Marsh, 1980a, for examples). Empirical findings in this area have been reviewed in an ex-
cellent series of articles by Feldman (1976a, 1976b, 1977,1978,1979,1983,1984), other re-
view papers by Aubecht (1981), Marsh (1983,1984), and McKeachie (1973,1979), monog-
raphs by Braskamp et al. (1985), Centra (1979; Centra & Creech, 1976) and Murray
(1980)) and a chapter by Aleamoni (1981). Older reviews by Costin et al. (1971), Kulik and
McKeachie (197.5), and the annotated bibliography by de Wolf (1974) are also valuable.
Results summarized below emphasize the description and explanation of the mul-
tivariate relationships that exist between specific background characteristics and multiple
dimensions of student ratings. This is a summary of findings based upon some of the most
frequently studied and/or the most important background characteristics, and of different
approaches to understanding the relationships. In this section the effects of the five back-
ground characteristics that have been most extensively examined in SEEQ research are
examined: class size; workload/difficulty; prior subject interest; expected grades; and the
reason for taking a course.

Class Size

Marsh reviewed previous research on the class size effect and examined correlations be-
tween class size and SEEQ dimensions (Marsh et al., 1979a; Marsh, 1980b, 1983). Class
size was moderately correlated with Group Interaction and Individual Rapport (nega-
tively, r’s as large as -0.30), but not with other SEEQ dimensions or with the overall rat-
ings of course or instructor (absolute values of r’s < 0.15). There was also a significant
nonlinear trend in the class size effect, such that small and very large classes were evaluated
more favorably. However, since the majority of class sizes occur in the range where the
class size effect is negative, the correlation based on the entire sample of classes was still
slightly negative. These findings also appeared when instructor self-evaluations were con-
sidered; the pattern and magnitude of correlations between instructor self-evaluations of
their own teaching effectiveness and class size was similar to findings based upon student
ratings (Marsh et al., 1979a; also see Table 5.1). The specificity of the class size effect to
dimensions most logically related to this variable, and the similarity of findings based on
student ratings and faculty self-evaluations argue that this effect is not a ‘bias’ to student
ratings; rather, class size does have a moderate effect on some aspects of effective teaching
(primarily Group Interaction and Individual Rapport) and these effects are accurately re-
flected in the student ratings. This discussion of the class size effect clearly illustrates why
students’ evaluations cannot be adequately understood if their multidimensionality is ig-
nored (also see Feldman, 1984; Frey, 1978).
Feldman (1984; see also Feldman, 1978) conducted the most extensive review of rela-
tions between class size and students’ evaluations, and the results of his review are reason-
ably consistent with SEEQ research. Feldman noted that most studies find very weak
negative relations, but that the size of the relation is stronger for instructional dimensions
pertaining to the instructor’s interactions and interrelationships with students and that
some studies report a roughly U-shaped nonlinear relation. He also noted the possibility
that the pattern of results, particularly those in SEEQ research, may be interpreted to
mean that the class size effect may represent a valid influence that is accurately reflected
in student ratings. In one of the largest and perhaps the most broadly representative
studies of the class size effect, Centra and Creech (1976; see also Centra, 1979) also found
a clearly curvilinear effect in which classes in the 35-100 range received the lowest ratings,
whereas larger and smaller classes received higher ratings.
Superficially, the U-shaped relation between class size and student ratings appears to
contradict Glass’s well-established conclusion that teaching effectiveness, inferred from
achievement indicators or from affective measures, suffers when class size increases (e.g.,
Glass et al., 1981, pp. 35-43; Smith & Glass, 1980). However, Glass also found a non-
linear class-size effect that he summarized as a logarithmic function where nearly all the
negative effect occurred between class sizes of 1 and 40, and he did not present data for ex-
tremely large class sizes (e.g., several hundred students). Within the range of class sizes re-
ported by Glass, the class size/student evaluation relationship found in SEEQ research
could also be fit by a logarithmic relationship, and the increase in students’ evaluations did
not occur until class size was 200 or more. However, the suggestion that teaching effective-
ness in these extremely large classes may not suffer, or is even superior, has very important
implications, since offering just a few of these very large classes can free an enormous
amount of instructional time that can be used to substantially reduce the average class size
in the range where the class size effect is negative. However, Marsh (Marsh et al., 1979a;
also see Centra, 1979; Feldman, 1984) argued that this correlational effect should be inter-
preted cautiously and he speculated that the unexpectedly higher ratings for very large
classes could be due to: (a) the selection of particularly effective instructors with de-
monstrated success in such settings; (b) students systematically selecting classes taught by
particularly effective instructors, thereby increasing class size; (c) an increased motivation
for instructors to do well when teaching particularly large classes; and (d) the development
of ‘large class’ techniques instead of trying to use inappropriate, ‘small class’ techniques
that may produce lower ratings in moderately large classes. Clearly this is an area that war-
rants further research.

Prior Subject Interest

Marsh (Marsh & Cooper, 1981) reviewed previous studies of the prior subject interest
effect, as did Feldman (1977) and Howard and Maxwell (1980), and examined its effect on
SEEQ ratings by students (see also Marsh, 1980b, 1983) and by faculty. The effect of prior
subject interest on SEEQ scores was greater than that of any of the 15 other background
variables considered by Marsh (1980b, 1983). In different studies prior subject interest was
consistently more highly correlated with Learning/Value (r’s about 0.4) than with any
other SEEQ dimensions (r’s between 0.3 and -0.12; Table 5.2). Instructor self-evalua-
tions of their own teaching were also positively correlated with both their own and their
students’ perceptions of students’ prior subject interest (see Table 5.1). The self-evalua-
tion dimensions that were most highly correlated with prior subject interest, particularly
Learning/Value, were the same as with student ratings. The specificity of the prior subject
interest effect to dimensions most logically related to this variable, and the similarity of
findings based on student ratings and faculty self-evaluations argue that this effect is not a
‘bias’ to student ratings. Rather, prior subject interest is a variable that influences some
aspects of effective teaching, particularly Learning/Value, and these effects are accurately
reflected in both the student ratings and instructor self-evaluations. Higher student in-
terest in the subject apparently creates a more favorable learning environment and facili-
tates effective teaching, and this effect is reflected in student ratings as well as faculty self-
evaluations.
Prior subject interest, as inferred in SEEQ research, is based on students’ retrospective
ratings collected at the end of the course. There may be methodological problems with
prior subject interest ratings collected either at the start or end of a course. At the start stu-
dents may not know enough about the course content to evaluate their interest, and it is
possible that high ratings of prior subject interest at the end of a course confound prior sub-
ject interest before the course with interest in the subject generated by the instructor. Con-
sequently, Howard and Schmeck (1979) asked students to rate their desire to take a course
at both the start and the end of courses. They found that precourse responses were strongly
correlated with end-of-course responses, and that both indicators of prior subject interest
had similar patterns of correlations with other rating items. They concluded that their find-
ings supported the use of responses collected at the end of courses to measure precourse
interest.
Marsh and Cooper (1981) also asked faculty to evaluate students’ prior subject interest
at the end of the course as well as to evaluate their own teaching effectiveness. Instructor
evaluations of students’ prior subject interest and students’ ratings of their own prior sub-
ject interest were substantially correlated and each showed a similar pattern of results to
both student rating dimensions and instructor self-evaluation rating dimensions. Thus, an
alternative indicator of prior subject interest, one not based on student ratings, provides
additional support for the generality of the findings.
Prior subject interest apparently influences teaching effectiveness in a way that is validly
reflected in student ratings, and so the influence should not be interpreted as a bias to student
ratings. However, to the extent that the influence is inherent to the subject matter that is
being taught, it may represent a source of ‘unfairness’ when ratings are used for personnel
decisions. Whereas student ratings are primarily a function of the teacher who teaches a
course rather than the course being taught, prior subject interest is more a function of the
course than the teacher (see Table 3.2). If further research confirms that prior subject in-
terest is largely determined by the course rather than the instructor and that this compo-
nent of prior subject interest influences students’ evaluations, then it may be appropriate
to use normative comparisons or cohort groups to correct for this effect. Such an approach
is only justified for general courses that are taught by many different instructors so that the
course effect can be accurately determined. For specialised courses that are nearly always
taught by the same instructor, it may be impossible to disentangle the effects of course and
instructor. Nevertheless, this suggestion represents an important area for further research.

Workload/Difficulty

Workload/Difficulty is frequently cited by faculty as a potential bias to student ratings
(Marsh & Overall, 1979a). Ryan et al. (1980) reported that the introduction of student rat-
ings at one institution produced a substantial decrease in the workload and difficulty of
courses in the belief that this would lead to higher student ratings. The Workload/Diffi-
culty effect on students’ evaluations was also one of the largest found by Marsh (1980b,
1983). Paradoxically, at least based upon the supposition that Workload/Difficulty is a po-
tential ‘bias’ to student ratings, higher levels of Workload/Difficulty were positively corre-
lated with student ratings. Marsh (1982a) found that in pairs of courses taught by the same
instructor, the more highly rated course tended to be the one perceived to have the higher
levels of Workload/Difficulty. Marsh and Overall (1979b) found that instructor self-
evaluations of their own teaching effectiveness were less highly correlated with Workload/
Difficulty than were student ratings, but the direction of these correlations was also posi-
tive (see Table 5.1). Since the direction of the Workload/Difficulty effect is opposite to that
predicted as a potential bias, and since this finding is consistent for both student ratings and
instructor self-evaluations, Workload/Difficulty does not appear to constitute a bias to stu-
dent ratings.
Given that the Workload/Difficulty effect is frequently cited as a potential bias, there is
surprisingly little research on its effect by other investigators. Rating instruments that in-
clude a dimension like Workload/Difficulty typically report more difficult courses to be as-
sociated with somewhat more favorable ratings (e.g., Freedman & Stumpf, 1978; Frey et
al., 1975; Office of Evaluation Services, 1972; Pohlman, 1972). Pohlman (1975) also re-
ported small but statistically significant relations between hours outside of class and stu-
dent ratings. Schwab (1976) reported that the effect of perceived difficulty on student rat-
ings was positive after controlling for the effects of other background variables such as ex-
pected grades. Thus, the SEEQ results do appear to be consistent with other findings.

Expected Grades

One of the most consistent findings in student evaluation research is that class-average
expected (or actual) grades are modestly correlated with class-average students’ evalua-
tions of teaching effectiveness. The critical issue is how this relation should be interpreted.

Alternative Interpretations
Studies based upon SEEQ, and literature reviews (e.g., Centra, 1979; Feldman, 1976a;
Marsh et al., 1976) have typically found class-average expected grades to be positively cor-
related with student ratings. There are, however, three quite different explanations for
this finding (see discussion in Chapter 4). The grading leniency hypothesis proposes that
instructors who give higher-than-deserved grades will be rewarded with higher-than-de-
served student ratings, and this constitutes a serious bias to student ratings. The validity
hypothesis proposes that better expected grades reflect better student learning, and that a
positive correlation between student learning and student ratings supports the validity of
student ratings. The student characteristic hypothesis proposes that preexisting student
presage variables such as prior subject interest may affect student learning, student grades,
and teaching effectiveness, so that the expected grade effect is spurious. While these ex-
planations of the expected grade effect have quite different implications, it should be
noted that grades, actual or expected, must surely reflect some combination of student
learning, the grading standards employed by an instructor, and preexisting presage vari-
ables.
Marsh (1980b, 1983) examined the relationships among expected grades, prior subject
interest, and student ratings in a path analysis (also see Aubrecht, 1981; Feldman, 1976a).
Across all rating dimensions, nearly one-third of the expected grade effect could be
explained in terms of prior subject interest. Since prior subject interest precedes expected
grades, a large part of the expected grade effect is apparently spurious, and this finding
supports the student characteristic hypothesis. Marsh, however, interpreted the results as
support for the validity hypothesis in that prior subject interest is likely to impact student
performance in a class, but is unlikely to affect grading leniency. Hence, support for the
student characteristics hypothesis may also constitute support for the validity hypothesis;
prior subject interest produces more effective teaching which leads to better student learn-
ing, better grades, and higher evaluations. This interpretation, however, depends on a de-
finition of bias in which student ratings are not ‘biased’ to the extent that they reflect var-
iables which actually influence effectiveness of teaching.
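The one-third figure can be read directly off Table 5.2 under the usual path-analytic decomposition (the
arithmetic below is a restatement for convenience, not an additional analysis): the original correlation between
expected grades and a rating factor is the sum of the total causal component (TC) and a noncausal component
transmitted through the causally prior variables in Figure 5.1, chiefly prior subject interest. For the Overall
Course factor, for example, with decimal points restored,

    r_{\text{grade, rating}} = 0.22, \qquad TC_{\text{grade}} = 0.15, \qquad
    \frac{r - TC}{r} = \frac{0.22 - 0.15}{0.22} \approx 0.32 \approx \tfrac{1}{3},

so roughly one-third of the original expected-grade correlation is attributable to the prior variables rather
than to expected grades themselves.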
In a similar analysis, Howard and Maxwell (1980; also see Howard & Maxwell, 1982),
found that most of the covariation between expected grades and class-average overall rat-
ings was eliminated by controlling for prior student motivation and student progress rat-
ings. In their path analysis, prior student motivation had a causal impact on expected
grades that was nearly the same as reported in SEEQ research and a causal effect on over-
all ratings which was even larger, while the causal effect of expected grades on student rat-
ings was smaller than that found in SEEQ research. They concluded that “the influence of
student motivation upon student performance, grades, and satisfaction appears to be a
more potent contributor to the covariation between grades and satisfaction than does the
direct contaminating effect of grades upon student satisfaction” (p. 818).

Faculty Self-Evaluations

Marsh and Overall (1979b) examined correlations among student ratings and instructor
self-evaluations of teaching effectiveness, student ratings of expected grades, and teacher
self-evaluations of their own ‘grading leniency’ (see Table 5.1). Correlations between ex-
pected grades and student ratings were positive and modest (r’s between 0.01 and 0.28) for
all SEEQ factors except Group Interaction (r = 0.38) and Workload/Difficulty (r =
-0.25). Correlations between expected grades and faculty self-evaluations were close to
zero (r’s between -0.11 and 0.11) except for Group Interaction (r = 0.17) and Workload/
Difficulty (r = -0.19). Correlations between faculty self-perceptions of their own ‘grading
leniency’ (on an ‘easy/lenient grader’ to ‘hard/strict grader’ scale) with both student and
teacher evaluations of effective teaching were small (r’s between -0.16 and 0.19) except
for ratings of Workload/Difficulty (r’s of 0.26 and 0.28) and faculty self-evaluations of
Examinations/Grading (r = 0.32). In a separate study Marsh (1976) also reported small,
generally nonsignificant correlations between faculty self-evaluations of their grading le-
niency and student ratings, but found that ‘easy’ graders received somewhat (significantly)
lower overall course and Learning/Value ratings. The correlations between grading le-
niency and student ratings, and the similarity in the pattern of correlations between ex-
pected grades and ratings by students and by faculty, seem to argue against the interpreta-
tion of the expected grade effect as a bias. Nevertheless, the fact that expected grades were
more positively correlated with student ratings than with faculty self-evaluations may pro-
vide some support for a grading leniency bias.

Expected Grades in Multisection Validity Studies

Marsh (Marsh et al., 1975; Marsh & Overall, 1980; see also Chapter 4) examined class-
average pre-test scores, expected grades, student achievement, and student ratings in mul-
tisection validity studies described earlier. Students selected classes in a quasi-random
fashion, and pre-test scores on an achievement test, motivational variables, and student
background variables were also collected at the start of the study. While this set of pre-test
variables was able to predict course performance with reasonable accuracy for individual
students, section-average responses to them were similar. Also, in each study, students
knew how their performance compared with other students within the same section, but
not how the average performance of their section compared with that of other sections.
Primarily as a consequence of this feature of the study, class-average expected grades,
which were collected along with student ratings shortly before the final examination, did
not differ significantly from section to section. Hence, the correlation between examina-
tion performance and student ratings could only be interpreted as support for the validity
hypothesis, and was not due to either preexisting variables or a grading leniency effect. It
is ironic that when researchers propose class-average grades (expected or actual) as a po-
tential bias to student ratings, a positive correlation between ratings and grades is nearly
always interpreted as a grading leniency effect, while a positive correlation between
grades, as reflected in examination performance, and ratings in multisection validity
studies is nearly always interpreted as an indication of validity; both interpretations are
usually viable in both situations. Again, it must be cautioned that support for the validity
hypothesis found here does not deny the appropriateness of other interpretations in other
situations.
Palmer et al. (1978) also compared the validity and grading leniency hypotheses in a
multisection validity study by relating class-average student learning (controlling for pre-
test data) and grading leniency with student ratings. However, their study failed to show
a significant effect of either student learning or grading leniency, and thus provides no sup-
port for either hypothesis. Nevertheless, their study showed that moderately large differ-
ences among instructors on grading leniency had no effect on student ratings (but see earl-
ier discussion of this study in Chapter 4).

Experimental Manipulation of Expected Grades

Some researchers have argued that the expected grade effect can be better examined by
randomly assigning students to different groups which are given systematically different
grade expectations. For example, Holmes (1972) gave randomly assigned groups of stu-
dents systematically lower grades than they expected and deserved, and found that these
students evaluated the teaching effectiveness as poorer than did control groups. While this
type of research is frequently cited as evidence for a grading leniency effect, this conclusion
is unwarranted. First, Holmes’s manipulation accounted for no more than 8 percent of the
variance in any of the evaluation items and much less variance across the entire set of rat-
ings, and reached statistical significance for only 5 of 19 items (Did instructor have suffi-
cient evidence to evaluate achievement; Did you get less than expected from the course;
Clarity of exam questions; Intellectual stimulation; and Instructor preparation). Hence
the size of the effect was small, was limited to a small portion of the items, and tended to
be larger for items that were related to the specific experimental manipulation. Second,
the results based upon ‘rigged’ grades that violate reasonable grade expectations may not
generalize to other settings and seem to represent a different variable than that examined
in naturalistic settings.
Powell (1977) examined the influence of grading standards of a single course taught by
the same instructor on five different occasions. While the effect of grading standards was
modestly related to ratings, the possible non-equivalence of the different classes and the
possible experimenter expectancy effects dictate that the results be interpreted cautiously
(see discussion by Abrami et al., 1980). Chacko (1983) and Vasta and Sarmiento (1979)
applied different grading standards in two sections of the same course, and found that the
more stringently graded courses were evaluated more poorly on some items. In both these
studies, students in the different sections were similar before the introduction of the grad-
ing manipulation. However, since students in different sections of the same course may
talk to each other, the nature of the experimental manipulation may have been known to
the students and may have created an effect beyond that of different grading standards in
naturalistic settings. If students in one section knew they were being graded on a harsher
grading standard than students in the other section, then the generality of the manipula-
tion may be suspect. Also, the use of a single course, a single instructor who knew the pur-
pose of the study (at least in the Chacko study), and the use of only two intact classes as
the basis of random assignment further attenuates the generalizability of these two studies.
Snyder and Clair (1976), instead of relying on intact classes, examined grading standards
in a laboratory study in which students were randomly assigned to groups that varied in
terms of the grades they were led to expect (A, B or C) and the grades they actually re-
ceived (A, B or C). Students were exposed to a brief tape-recorded lecture, after which
they completed a quiz and were given a grade that was randomly determined to be higher,
lower or the same as they had been led to expect and then evaluated the teaching effective-
ness. While the obtained grade was positively correlated with student ratings, there was
apparently a substantial interaction between expected and actual grades (the interaction
effect was not appropriately examined but is visually quite apparent). In fact, for the three
groups of students who actually received the grades they were led to expect - A, B or C
respectively-grades had little or no effect and the ‘A’ students actually gave the instruc-
tor the lowest ratings. This suggests that the violation of grade expectations may affect rat-
ings more than the actual grades in experimental studies in which grade expectations are
violated (see also Murray, 1980).
Abrami et al. (1980) conducted what appears to be the most methodologically sound
study of the effects of experimentally manipulated grading standards on students’ evalua-
tions. After reviewing previous research they described two ‘Dr. Fox’ type experiments
(see Chapter 6) in which grading standards were experimentally manipulated. Groups of
students viewed a videotaped lecture, rated teacher effectiveness, and completed an ob-
jective exam. Students returned two weeks later when they were given their examination
results and a grade based on their actual performance but scaled according to different
grading standards (i.e., an ‘average’ grade earning a B, C+, or C). The subjects then
viewed a similar videotaped lecture by the same instructor, again evaluated teacher effec-
tiveness, and took a test on the content of the second lecture. The manipulation of grading
standards had no effect on performance on the second achievement test and weak, inconsistent
effects on student ratings. There were also other manipulations (e.g., instructor expres-
siveness, content, and incentive), but the effect of grading standards accounted for no
more than 2 percent of the variance in student ratings for any of the conditions, and failed
to reach statistical significance in some. Not even the direction of the effect was consistent
across conditions, and stricter grading standards occasionally resulted in higher ratings.
These findings fail to support the contention that grading leniency produces an effect that
is of practical significance, though the external validity of this interpretation may also be
questioned.

Other Approaches

Marsh (1982a) compared differences in expected grades with differences in student rat-
ings for pairs of offerings of the same course taught by the same instructor on two different
occasions. He reasoned that differences in expected grades in this situation probably rep-
resent differences in student performance, since grading standards are likely to remain
constant, and differences in prior subject interest were small and relatively uncorrelated
with differences in student ratings. He found even in this context that students in the more
favorably evaluated course tended to have higher expected grades, which argued against
the grading leniency hypothesis. It should be noted, however, that while this study is in a
setting where differences due to grading leniency are minimized, there is no basis for con-
tending that the grading leniency effect does not operate in other situations. Also, the in-
terpretation is based on the untested assumption that differences in expected grades re-
flected, primarily, differences in student performance rather than differences in the grad-
ing standards by the instructor.
Peterson and Cooper (1980) compared students’ evaluations of the same instructors by
students who received grades and those who did not. The study was conducted at two col-
leges where students were free to cross-enroll, but where students from one college were
assigned grades but those from the other were not. Class-average ratings were determined
separately for students in each class who received grades and those who did not, and there
was substantial agreement between the evaluations by the two groups of students. Hence, even
though class-average grades of those students who received grades were correlated with
their class-average evaluations and showed the expected grade effect, their class-average
evaluations were in substantial agreement with those of students who did not receive
grades. This suggests that expected grade effect was not due to grading leniency, since
grading leniency was unlikely to affect ratings by students who did not receive grades.
Class average expected grades are about equally determined by the instructor who
teaches a course and by the course that is being taught (see Table 3.2). I know of no re-
search that has attempted to relate these separate components to student ratings, but such
an approach may be fruitful. The component of expected grades due to the course may be
more strongly influenced by presage variables whereas the component due to the instruc-
tor may be more strongly influenced by grading standards and differences in learning at-
tributable to the instructor. The substantial course effect demonstrates that much of the
variance in class-average expected grades is not due to differences in grading standards
that are idiosyncratic to individual instructors.

Summary

To summarize research on the expected grade effect: a modest but not unimportant
correlation between class-average expected grades and student ratings has consistently
been reported. There are, however, several alternative interpretations of this find-
ing, which were labeled the grading leniency hypothesis, the validity hypothesis, and the
student characteristics hypothesis. Evidence from a variety of different types of research
clearly supports the validity hypothesis and the student characteristics hypothesis, but
does not rule out the possibility that a grading leniency effect operates simultaneously.
Support for the grading leniency effect was found with some experimental studies, but
these effects were typically weak and inconsistent, may not generalize to nonexperimental
settings where student ratings are actually used, and in some instances may be due to the
violation of grade expectations that students had falsely been led to expect or that were
applied to other students in the same course. Consequently, while it is possible that a grad-
ing leniency effect may produce some bias in student ratings, support for this suggestion is
weak and the size of such an effect is likely to be insubstantial in the actual use of student
ratings.

Reason for Taking a Course

Courses are often classified as being an elective or a required course, but preliminary
SEEQ research indicated that this dichotomy may be too simplistic. Typically students:
are absolutely required to take a few specific courses in their major; select from a narrow
range of courses in their major that fulfill major requirements; select from a very wide
range of courses that fulfill general education or breadth requirements; occasionally take
specific courses or select from a range of courses in a related field that fulfill major require-
ments; and take some courses for general interest. Hence, on SEEQ students indicate one
of the following as the reason why they took the course: (a) major requirement; (b) major
elective; (c) general interest; (d) general education requirement; (e) minor/related field;
or (f) other. In the analysis of reason for taking a course, the percentages of students indi-
cating each of the first five reasons were considered to be a subset of variables, and the
total contribution of the entire subset and of each separate variable in the subset was con-
sidered. Marsh (1980) found that all SEEQ factors tended to be positively correlated with
the percentage of students taking a course for general interest and as a major elective, but
tended to be negatively correlated with the percentage of students taking a course as a
major requirement or as a general education requirement. However, after controlling for
the effects of the rest of the 16 background characteristics, general interest was the only
reason to have a substantial effect on ratings and it accounted for most of the variance that
could be explained by the subset of five reasons. The percentage of students taking a
course for general interest was also one of the four background variables selected from the
set of 16 as having the largest impact on student ratings and included in the path analyses
(Marsh, 1980, 1983) described earlier (see Table 5.2 and Figure 5.1).
Marsh (1980, 1983) consistently found the percentage taking a course for general in-
terest to be positively correlated with each of the SEEQ factors in different academic dis-
ciplines. However, the sizes of the correlations were modest, usually less than 0.20, and
the effect of this variable was smaller than that of the other three variables (prior subject
interest, expected grades, and workload/difficulty) considered in his path analyses. The
correlations were somewhat larger for Learning/value, Breadth of Coverage, Assign-
ments, Organization and overall course ratings than for the other SEEQ dimensions, but
only the correlations with Breadth of Coverage were as large as or larger than those of the
other variables considered in the path analysis.
Other researchers have typically compared elective courses with required courses, or
have related the percentage of students taking a course as an elective (or a requirement)
to student ratings, and either of these approaches may not be directly comparable to
SEEQ research. Large empirical studies have typically found that a course's electivity is
positively correlated with student ratings (e.g., Brandenburg et al., 1977; Pohlman, 1975; but
also see Centra & Creech, 1976), and these findings are also consistent with Feldman's 1978
review. Thus, these generalizations appear to be consistent with the SEEQ research.

Effects of Specific Background Characteristics Not Emphasized in SEEQ Research

Marsh (1976, 1983) examined the relations between a wide variety of background
characteristics and students' evaluations, but concluded that most of the variance in the
evaluations that could be accounted for by the entire set could be explained by those characteristics discussed
above. The effects of other characteristics, though much smaller, are considered briefly
below. A few additional characteristics were examined in particular SEEQ studies (e.g.,
the faculty self-evaluation studies) that were not available for the large scale studies, and
these are also discussed below. Finally, the results are compared with the findings of other
investigators, particularly those summarized in Feldman’s set of review articles.

Instructor Rank and Years Teaching Experience

SEEQ research has found that teaching assistants receive lower ratings than regular
faculty for most rating dimensions and overall rating items, but that they may receive
slightly higher ratings for Individual Rapport and perhaps Group Interaction (e.g., Marsh,
1976, 1980; Marsh & Overall, 1979b). Marsh and Overall (1979b) found this same pattern
in a comparison of self-evaluations by teaching assistants and self-evaluations by regular fa-
culty. Large empirical studies by Centra and Creech (1976) and by Brandenburg et al.
(1977) and Feldman’s 1983 review also indicate that teaching assistants tend to receive
lower evaluations than do other faculty (though Feldman also reported some exceptions).
Once teaching assistants are excluded from the analysis, relations between rank and stu-
dent ratings are much smaller in SEEQ research. There is almost no relation between rank
and global ratings, while faculty rank is somewhat positively correlated with Breadth of
Coverage and somewhat negatively correlated with Group Interaction. These results for
the global ratings are consistent with large empirical studies (e.g., Aleamoni & Yimer,
1973; Brandenburg et al., 1977; Centra and Creech, 1976). Feldman (1983) reported that
a majority of the studies in his review found no significant effect of instructor rank on
global ratings, but that the significant relations that were found were generally positive.
Feldman also reported that rank was not significantly related to more specific rating di-
mensions in a majority of studies, but that positive relations tended to be more likely for
dimensions related to instructor knowledge and intellectual expansiveness whereas nega-
tive relations were more likely for ratings of encouragement of discussion, openness, and
concern for students.
Marsh (1976) found instructor age to be negatively correlated with student ratings,
whereas Marsh et al. (1976) and Marsh and Overall (1979a) found nil or slightly negative
relations between years teaching experience and ratings. Feldman (1983) found that about
half the studies in his review found no relation between either age or years of teaching ex-
perience and global ratings, but that among those that did find significant relations, the
predominant finding was one of a negative relation. Braskamp et al. (1975) suggested that
student ratings may increase during the first 10 years of teaching, but decline somewhat
thereafter.
In summary, teaching assistants typically receive lower ratings than other faculty, but
otherwise there is little relation between either rank or experience and student ratings.
However, to the extent that there are significant relations at all, faculty with higher
academic ranks tend to be rated somewhat more favorably while older faculty and faculty
with more years of teaching experience tend to be rated somewhat lower. This pattern of
findings led Feldman (1983, p. 54) to conclude that: “the teacher’s academic rank should
not be viewed as interchangeable with either teacher's age or extent of instructional experi-
ence with respect to teacher evaluations.”

Course Level

In SEEQ research, higher level courses - particularly graduate level courses - tend to
receive slightly higher ratings (e.g., Marsh, 1976, 1980, 1983; Marsh & Overall, 1979b).
Marsh and Overall (1979b) found that both student ratings and faculty self-evaluations
tended to be higher in graduate level courses than undergraduate courses. In his review of
this relation, Feldman (1978) also found that student ratings tended to be positively related
to course level. The effect of course level is typically diminished and may even disappear
when other background variables are controlled (see Braskamp et al., 1985; Feldman,
1978; Marsh, 1980), but this finding is difficult to interpret without a specific model of the
causal ordering of such variables.

Characteristics Not Examined in SEEQ Research

A wide variety of student/course/instructor characteristics have not been examined in
SEEQ research, but have been considered by other researchers. While an exhaustive re-
view of such other research is beyond the scope of the present monograph, brief sum-
maries of some of this research are considered below.

Sex of Students and/or Instructor

Empirical studies (e.g., Centra & Creech, 1976; Pohlman, 1972) and Feldman’s 1977 re-
view indicate that student sex has very little effect on student ratings, though Feldman
notes that when significant effects are reported women may give slightly higher ratings
than men. Similarly, large empirical studies (e.g., Brandenburg et al., 1977; Brown, 1976)
and McKeachie’s 1979 review suggest that the sex of the instructor has little relation to stu-
dents’ evaluations, though Braskamp et al. (1985) and Aleamoni and Hexner (1980) con-
clude that the results are mixed. Feldman (1977) noted a few studies that reported sex-
of-student by sex-of-teacher interactions, but such interactions appeared in only a small
minority of the studies in his review.

Administration and Stated Purpose of the Ratings

The brief summary in this section is based primarily on Feldman’s 1979 review of this
topic, and interested readers are referred to this source for further discussion.

Anonymity of Student Raters

Feldman (1979) reported that anonymous ratings tended to be somewhat lower than
non-anonymous ratings, but that this result might vary with other circumstances. For ex-
ample, his review suggested that this effect would be stronger when teachers are given the
ratings before assigning grades, when students feel they may be called upon to justify or
elaborate their responses, or, perhaps, when students view the instructor as vindictive.
Braskamp et al. (1985) noted similar findings and recommended that ratings should be
anonymous.

Purpose of the Ratings

Feldman (1979) reported that ratings tend to be higher when they are to be used for ad-
ministrative purposes than when used for feedback to faculty or for research purposes, but
that the size of this effect may be very small (see also Centra, 1979). Frankhouser (1984)
critically reviewed this research and presented results from what he argued to be a better
experimental design that showed that stated purpose had no effect on global ratings.

Timing

Feldman (1979) reported that ratings tended to be similar whether collected in the
middle of the term, near the end of the term, during the final exam or even after comple-
tion of the course. Braskamp et al. (1985) suggested that ratings collected during the final
examination may be lower, and that midterm ratings may be unreliable if students can be
identified. Marsh and Overall (1980) collected ratings during the middle of the term and
during the last week of the term in their multisection validity study. Midterm and end-of-
term ratings were highly correlated, but validity coefficients based on the midterm ratings
were substantially lower. Braskamp et al. recommend that ratings should be collected dur-
ing the last two weeks of the course.

Summary of Administration Characteristics

In summary, it appears that some aspects of the manner in which students' evaluations of
teaching effectiveness are administered may influence the ratings. While these influences may not be large,
the best recommendation is to standardize all aspects of the administration process. This
is particularly important if the ratings are to be used for personnel decisions and may need
to be defended in a legal setting.

Academic Discipline

Feldman (1978) reviewed studies that compared ratings across disciplines and found
that ratings are: somewhat higher than average in English, humanities, arts, languages,
and, perhaps, education; somewhat lower than average in social sciences, physical sci-
ences, mathematics and engineering, and business administration; and about average in
biological sciences. The Centra and Creech 1976 study is particularly important because it
was based on courses from over 100 institutions. They classified courses as natural sci-
ences, social sciences and humanities and found that ratings were highest in humanities
and lowest in natural sciences. However, even though these results were highly significant,
the difference accounted for less than 1 percent of the variance in the student ratings.
Neumann and Neumann (1985), based on Biglan’s 1973 theory, classified academic
areas according to whether they had: (a) a well-defined paradigm structure (hard/soft); (b)
an orientation towards application (applied/pure); and (c) an orientation to living or-
ganisms (life/nonlife). Previous research indicated that the role of teaching is different in
each of the eight combinations that result from these three dichotomies, and led the au-
thors to predict that students’ evaluations would be higher in soft, in pure, and in nonlife
disciplines. Based on this weak partial ordering, they made predictions about student
evaluations for 19 pairwise comparisons of the eight types of disciplines, and their results
provided support for all 19 predictions. While the effect of all three facets on students’
evaluations was significant, the effect of the hard/soft facet was largest. The authors indi-
cated that teachers play a more major role in preparadigmatic areas, where research procedures
are not well developed, than in paradigmatic areas, where the content and method of research
are well developed and teaching is relatively deemphasized. On the basis of this re-
search, the authors argued that student ratings should only be compared within similar dis-
ciplines and that campus-wide comparisons may be unwarranted since the role of teaching
varies in different academic areas. The generality of these findings may be limited since the
results were based on a single institution, and further tests of the generality of these find-
ings are important. The findings also suggest that discipline differences observed in stu-
dent ratings may reflect real differences in the role of teaching across disciplines, differences
that are accurately captured by the student ratings.
There may be some differences in student ratings that are due to the academic discipline
of the subject, but the size of this relation is probably small. Nevertheless, since there are
few large, multi-institutional studies of this relation, conclusions must be tentative. The
implications of such a relation, if it exists, depend on the use of the students’ evaluations.
At institutions where SEEQ has been used, administrative use of the student ratings is at
the school or division level, and student ratings are not compared across diverse academic
disciplines. In such a situation, the relation between ratings and discipline may be less crit-
ical than in a setting where ratings are compared across all disciplines.

Personality of the Instructor

The relation between the personality of an instructor and students’ evaluations of his/
her teaching effectiveness is important for at least two different reasons. First, there is
sometimes the suspicion that the relation is substantial and that instructor personality has
nothing to do with being an effective teacher, so that the relation should be interpreted as
a bias to students’ evaluations. (This perspective apparently was the basis of initial in-
terpretations of the Dr. Fox effect discussed in Chapter 6 where ‘instructor expressiveness’
was experimentally manipulated and its effect on student ratings was examined.) Second,
if the relation is significant, then the results may have practical and theoretical importance
for distinguishing between effective and ineffective teachers, and for a better understand-
ing of the study of teaching effectiveness and the study of personality.
Given the potential importance of this relation, there has been surprisingly little re-
search on it. The brief summary of research presented here is based primarily on Feldman's
1986 review. Because of the small number of studies of this relation, Feldman limited his
review to overall or global evaluations of teaching effectiveness, and focused on studies of
class-average student ratings. Feldman reviewed relations between overall student ratings
and 14 categories of personality as inferred from self-reports by the instructor or as infer-
red from ratings by others (students or colleagues). Across all studies that inferred person-
ality from self-report measures, the only practically significant correlations were for ‘posi-
tive self-regard, self-esteem’ (mean r = 0.30) and ‘energy and enthusiasm’ (mean r =
0.27); the mean correlation was 0.15 or less between student ratings of teaching effective-
ness and each of the other 12 areas of personality. In contrast, when personality was infer-
red from ratings by students, and from ratings by colleagues, the correlations were much
higher; the average correlations between students’ evaluations and most of the 14 cate-
gories of personality were between 0.3 and 0.6.
Aspects of the instructors’ personality, as inferred by self-report and as inferred by rat-
ings by others, are systematically related to students’ overall evaluations of the instructor.
However, correlations based on ratings by others are substantially larger than those based
on self-reports, and many interrelated explanations of this difference are plausible (see
Feldman, 1986): (a) teacher personality as perceived by students may be affected by simi-
lar response biases that affect their ratings of teacher effectiveness; (b) teacher personality
as inferred by colleagues may be based in part on information from students; (c) teacher
personality inferred by students and colleagues may be based in part on perceptions of
teaching effectiveness rather than personality per se (or vice versa); (d) teacher personal-
ity as inferred by self-report may be more or less valid, or may be more or less biased, than
teacher personality as inferred by others; (e) teacher personality as inferred by students,
and perhaps by colleagues, may be limited to a situationally specific aspect of personality
(e.g., personality as a teacher or as an academic professional) whereas self-report mea-
sures of personality are more general measures that do not focus on a specific context.
The aspect of Feldman’s speculations that I find most interesting is the suggestion that
self-report measures of personality assess general (i.e. situationally independent) compo-
nents of personality, whereas personality components inferred by colleagues and particu-
larly by students assess the manifestation of those components in a particular setting (situ-
ationally specific). Since teaching effectiveness should logically be more strongly related
to situationally specific personality as manifested in an academic setting than to personal-
ity in general, the situationally specific components of personality inferred by students and
colleagues should be more strongly related to teaching effectiveness than should the gen-
eral (situationally independent) personality components based on self-reports. In support
of such an interpretation, components of personality that are most strongly related to stu-
dent ratings of teaching effectiveness appear to be similar whether based on self-reports or
ratings by others. The problem with the interpretation based on existing research is that
the person making the ratings (self vs. other) and the situational specificity of the ratings
(general vs. situationally specific) are completely confounded. In order to unconfound
these variables, future research might: (a) ask teachers, students, and colleagues to make
personality ratings of the teacher in general and as manifested in the role of being a
teacher; (b) consider teacher self-evaluations as well as student ratings of teacher effec-
tiveness; and (c) collect personality ratings of the teacher by others who primarily know
the teacher outside of an academic setting.
Feldman’s review suggests that there is a relation between students’ evaluations of an
instructor and at least some aspects of the instructors’ personality, but does not indicate
whether this is a valid source of influence or a source of bias. As noted with other sources
of influence, if teacher personality characteristics influence other indicators of teaching ef-
fectiveness in a manner similar to their influence on students’ evaluations, then the rela-
tion should be viewed as supporting the validity of student ratings. Such an interpretation
may be supported by the review of Dr. Fox studies (Chapter 6) where an experimentally
manipulated personality-like variable - instructor expressiveness - is shown to affect
both students’ evaluations and examination performance in some circumstances.

Because of limitations in existing research, Feldman’s review did not examine the
systematic pattern of relations between specific components of personality and specific
student evaluation factors, and this is unfortunate. Logically, some specific aspects of an
instructor’s personality should be systematically related to some specific student evalua-
tion factors. For example, enthusiasm or some related construct is often a specific compo-
nent measured by personality inventories and by student evaluation instruments, so that
the two measures of enthusiasm should be substantially correlated. In fact, there is a mod-
erate overlap between the 19 categories of teacher effectiveness based on Feldman’s earl-
ier research (see Table 2.1) and the 14 categories of personality in his 1986 study. Other
specific components of personality and other student evaluation factors may be logically
unrelated, and so correlations between them should be much smaller. Again, support for
such an interpretation comes from the review of Dr. Fox studies (Chapter 6) that shows
that experimentally manipulated ‘instructor expressiveness’ is more substantially related
to student ratings of Instructor Enthusiasm than to other rating factors. In future research
it is important to examine relations between instructor personality and students’ evalua-
tions with a construct approach that considers the multidimensionality of both personality
and teaching effectiveness.

Summary of the Search for Potential Biases

The search for potential biases to student ratings has itself been so biased that it could
be called a witch hunt. Methodological problems listed at the start of this section are com-
mon. Furthermore, research in this area is seldom guided by any theoretical definition of
bias, and the definitions that are implicit in most studies are typically inadequate or inter-
nally inconsistent. Research findings described here, particularly for those relations not
emphasized in SEEQ research, should only be taken as rough approximations. There is
clearly a need for meta-analyses, and systematic reviews such as those by Feldman de-
scribed earlier, to provide more accurate estimates of the size of effects which have been
reported, and the conditions under which they were found. For most of the relations, the
effects tend to be small, the directions of the effects are sometimes inconsistent, and the
attribution of a bias is unwarranted if bias is defined as an effect that is specific to students’
evaluations and does not also influence other indicators of teaching effectiveness. Perhaps
the best summary of this area is McKeachie’s (1979) conclusion that a wide variety of var-
iables that could potentially influence student ratings apparently have little effect. Similar
conclusions have been drawn by Centra (1979), Menges (1973), Marsh (1980b), Murray
(1980), Aleamoni (1981), and others.
There are, of course, nearly an infinite number of variables that could be related to stu-
dent ratings and could be posited as potential biases. However, any such claim must be
seriously scrutinized in a series of studies that are relatively free from the common
methodological shortcomings, are based upon an explicit and defensible definition of bias,
and employ the type of logic used to examine the variables described above. Single studies
of the predictive validity of psychological measures have largely been replaced by a series
of construct validity studies, and a similar approach should also be taken in the study of po-
tential biases. Simplistic arguments that a significant correlation between student ratings
and some variable ‘X’ demonstrates a bias can no longer be tolerated, and are an injustice
to the field. It is unfortunate that the caution exercised in interpreting correlations between
student ratings and potential indicators of effective teaching as evidence of validity has not
been matched in the interpretation of correlations between student ratings and potential
biases as evidence of invalidity.
CHAPTER 6

‘DR FOX’ STUDIES

The Dr Fox Paradigm

The Dr Fox effect is defined as the overriding influence of instructor expressiveness on stu-
dents’ evaluations of college/university teaching. The results of Dr Fox studies have been
interpreted to mean that an enthusiastic lecturer can entice or seduce favorable evalua-
tions, even though the lecture may be devoid of meaningful content. In the original Dr Fox
study by Naftulin et al. (1973), a professional actor lectured to educators and graduate stu-
dents in an enthusiastic and expressive manner, and teaching effectiveness was evaluated.
Despite the fact that the lecture content was specifically designed to have little educational
value, the ratings were favorable. The authors and critics agree that the study was fraught
with methodological weaknesses, including the lack of any control group, a poor rating in-
strument, the brevity of the lecture compared to an actual course, the unfamiliar topic
coupled with the lack of a textbook with which to compare the lecture, and so on (see
Abrami et al., 1982; Frey, 1979; Ware & Williams, 1975). Frey (1979) notes that “this study
represents the kind of research that teachers make fun of during the first week of an intro-
ductory course in behavioral research methods. Almost every feature of the study is prob-
lematic” (p. 1). Nevertheless, reminiscent of the Rodin and Rodin (1972) study described
earlier, the results of this study were seized upon by critics of student ratings as support for
the invalidity of this procedure for evaluating teaching effectiveness.
To overcome some of the problems, Ware and Williams (Ware & Williams, 1975, 1977;
Williams & Ware, 1976, 1977) developed the standard Dr Fox paradigm in which a series of
six lectures, all presented by the same professional actor, was videotaped. Each lecture rep-
resented one of three levels of course content (the number of substantive teaching points
covered) and one of two levels of lecture expressiveness (the expressiveness with which the
actor delivered the lecture). Students viewed one of the six lectures, evaluated teaching ef-
fectiveness on a typical multi-item rating form, and completed an achievement test based
upon all the teaching points in the high content lecture. Ware and Williams (1979, 1980)
reviewed their studies, and similar studies by other researchers, and concluded that differ-
ences in expressiveness consistently explained much more variance in student ratings than
did differences in content.

Reanalyses and Meta-Analyses

A Reanalysis

Marsh and Ware (1982) reanalyzed data from the Ware and Williams studies. A factor
analysis of the rating instrument identified five evaluation factors which varied in the way
they were affected by the experimental manipulations (Table 6.1). In the condition most
like the university classroom, where students were told before viewing the lecture that they
would be tested on the materials and that they would be rewarded in accordance with the
number of exam questions which they answered correctly (incentive before lecture in
Table 6.1), the Dr Fox effect was not supported. The instructor expressiveness manipula-
tion only affected ratings of Instructor Enthusiasm, the factor most logically related to that
manipulation, and content coverage significantly affected ratings of Instructor Knowledge
and Organization/Clarity, the factors most logically related to that manipulation.
When students were given no incentives to perform well, instructor expressiveness had
more impact on all five student rating factors than when external incentives were present,
though the effect on Instructor Enthusiasm was still largest. However, without external in-
centives, expressiveness also had a larger impact on student achievement scores than did
the content manipulation (i.e. presentation style had more to do with how well students
performed on the examination than did the number of questions that had been covered in
the lecture). This finding demonstrated that, particularly when external incentives are
weak, expressiveness can have an important impact on both student ratings and achieve-
ment scores. In further analyses of the achievement scores, Marsh (1984, p. 212) con-
cluded that the study was one of the few to “show that more expressively presented lec-
tures cause better examination performance in a study where there was random assign-
ment to treatment conditions and lecturer expressiveness was experimentally manipu-
lated.” Across all the conditions, the effect of instructor expressiveness on ratings of In-
structor Enthusiasm was larger than its effect on other student rating factors. Hence, as ob-
served in the examination of potential biases to student ratings, this reanalysis indicates
the importance of considering the multidimensionality of student ratings. An effect which
has been interpreted as a ‘bias’ to students’ evaluations seems more appropriately inter-
preted as support for their validity with respect to one component of effective teaching.

Table 6.1
Effect Sizes of Expressiveness, Content, and Expressiveness x Content Interaction in Each of the Three Incentive
Conditions (Reprinted with permission from Marsh, 1983b)

Condition                              Expressiveness (%)   Content (%)   Interaction (%)

No External Incentive
  Clarity/Organization                      11.3**              4.2**          1.6
  Instructor Concern                        12.9**              2.1            2.8*
  Instructor Knowledge                      12.8**              2.7*           1.9*
  Instructor Enthusiasm                     34.6**              1.9*           2.4**
  Learning Stimulation                      13.0**              9.6**          1.5
  Total rating (across all items)           25.4**              5.1**          3.3**
  Achievement test scores                    9.4**              5.2**          1.3
Incentive After Lecture
  Clarity/Organization                       2.0                6.0            1.3
  Instructor Concern                        20.5**              7.5**          1.9
  Instructor Knowledge                      25.1**              8.8**          2.3
  Instructor Enthusiasm                     30.9**              3.3
  Learning Stimulation                                                          .7
  Total rating (across all items)                               7.0**          2.8
  Achievement test scores                     .3               13.0**           .4
Incentive Before Lecture
  Clarity/Organization                        .3               11.5**          6.9*
  Instructor Concern                          .1                7.0*           6.2*
  Instructor Knowledge                        .3                6.2*           1.3
  Instructor Enthusiasm                     22.1**              4.0            6.6*
  Learning Stimulation                        .1                8.8**          8.1*
  Total rating (across all items)            2.0               11.4**          6.8*
  Achievement test scores                     .5               26.5**
Across All Incentive Conditions
  Clarity/Organization                       2.1**              5.0**          1.6*
  Instructor Concern                         7.2**              4.3**          1.0
  Instructor Knowledge                       6.4**              3.1**           .8
  Instructor Enthusiasm                     25.4**              1.2*           1.7**
  Learning Stimulation                       3.3**              4.9**          1.1
  Total rating (across all items)           12.5**              5.2**          1.8*
  Achievement test scores                    1.7**             10.7**           .3

Note. Separate analyses of variance (ANOVAs) were performed for each of the five evaluation factors, the sum
of the 18 rating items (Total rating), and the achievement test. First, separate two-way ANOVAs (Expressiveness
x Content) were performed for each of the three incentive conditions, and then three-way ANOVAs (Incentive
x Expressiveness x Content) were performed for all the data. The effect sizes were defined as (SS effect / SS total)
x 100%.
* p < .05. ** p < .01.
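
The effect sizes in Table 6.1 are simply the percentage of total variance attributable to each
effect. As a minimal sketch of this computation for a balanced two-way design - the data below
are simulated and purely illustrative, not the Dr Fox tapes - the sums of squares can be formed
directly from the cell, marginal, and grand means:

    import numpy as np

    rng = np.random.default_rng(1)
    # hypothetical ratings: 2 expressiveness levels x 3 content levels x 30 students per cell
    data = rng.normal(loc=5.0, scale=1.0, size=(2, 3, 30))
    data[1] += 1.0                                   # pretend high expressiveness raises ratings

    grand = data.mean()
    ss_total = ((data - grand) ** 2).sum()

    n_cell = data.shape[2]
    expr_means = data.mean(axis=(1, 2))              # expressiveness marginal means
    cont_means = data.mean(axis=(0, 2))              # content marginal means
    cell_means = data.mean(axis=2)

    ss_expr = n_cell * data.shape[1] * ((expr_means - grand) ** 2).sum()
    ss_cont = n_cell * data.shape[0] * ((cont_means - grand) ** 2).sum()
    ss_cells = n_cell * ((cell_means - grand) ** 2).sum()
    ss_inter = ss_cells - ss_expr - ss_cont

    for name, ss in [("Expressiveness", ss_expr), ("Content", ss_cont), ("Interaction", ss_inter)]:
        print(f"{name}: {100 * ss / ss_total:.1f}% of total variance")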

A Meta-Analysis

Abrami et al. (1982) conducted a review and a meta-analysis of all known Dr Fox
studies. On the basis of their meta-analysis, they concluded that expressiveness manipula-
tions had a substantial impact on overall student ratings and a small effect on achievement,
while content manipulations had a substantial effect on achievement and a small effect on
ratings. Consistent with the Marsh and Ware reanalysis, they also found that in the few
studies that analyzed separate rating factors, the rating factors that were most logically re-
lated to the expressiveness manipulation were most affected by it. Finally, they concluded
that while the expressiveness manipulation did interact with the content manipulation and
a host of other variables examined in the Dr Fox studies, none of these interactions ac-
counted for more than 5 percent of the variance in student ratings.

Interpretations, Implications and Problems

How should the results of the Dr Fox type studies be evaluated? Consistent with an em-
phasis on the construct validity of multifaceted ratings in this paper, a particularly power-
ful test of the validity of student ratings would be to show that each rating factor is strongly
influenced by manipulations most logically associated with it and less influenced by other
manipulations. This is the approach used in the Marsh and Ware reanalysis of the Dr Fox
data described above, and it offers strong support for the validity of ratings with respect to
expressiveness and, perhaps, limited support for their validity with respect to content.
Multiple rating factors have typically not been considered in Dr Fox studies (but see
Ware & Williams, 1977; and discussion of this study by Marsh & Ware, 1982). Instead, re-
searchers have relied on total scores even though they collect ratings which do represent
multiple rating dimensions (i.e. the same form as was shown to have five factors in the
Marsh & Ware reanalysis, and/or items from the 1971 Hildebrand et al. study described
earlier). However, this makes no sense when researchers also emphasize the differential
effects of the experimental manipulations on the total rating score and the achievement
outcome. According to this approach, student ratings may be invalid because they are
‘oversensitive’ to expressiveness and ‘undersensitive’ to content when compared with
achievement scores (but see Abrami et al., 1982). It is hardly surprising that the number
of examination questions answered in a lecture (only 4 of 26 exam questions are covered
in the low content lecture, while all 26 are covered in the high content lecture) has a sub-
stantial impact on examination performance immediately after the lecture, and less impact
on student ratings; more relevant is the finding that content also impacts student ratings.
Nor is it surprising that manipulations of instructor expressiveness have a large impact on
the total rating score when some of the items specifically ask students to judge the charac-
teristic that is being manipulated; more relevant is the finding that some rating factors are
relatively unaffected by expressiveness and that achievement scores are affected by ex-
pressiveness. Student ratings are multifaceted, the different rating factors do vary in the
way they are affected by different manipulations, and any specific criterion can be more ac-
curately predicted by differentially weighting the student rating dimensions. Since most of
the Dr Fox studies are based upon student rating instruments that do measure separate
components, reanalyses of these studies, as was done in the Marsh and Ware study, should
prove valuable.
There are still serious problems with the design of Dr Fox studies that limit their poten-
tial usefulness. First, further research is needed to investigate whether the size and nature
of experimental manipulations are representative of those that actually occur in the field,
and to ensure that the effects are not limited to some idiosyncratic aspect of the Dr Fox
paradigm. For example, the Dr Fox tapes were produced to be of similar length even
though the amount of content varied systematically. The length was equated by adding ir-
relevant content, and this may function like a lack of teacher clarity, which negatively affects both
ratings and achievement as described earlier. Also, in the Abrami et al. (1982) meta-analy-
sis, videotaped lectures developed at the University of Manitoba produced effect sizes that
were twice as large for both student ratings and achievement as did tapes developed by
Ware and Williams. Second, the Dr Fox type of study should be extended to include other
teacher process variables such as instructor clarity manipulations. Third, additional indi-
cators of effective teaching should be examined as well as achievement scores and student
ratings administered immediately after viewing a videotaped lecture in a content area
which is relatively unknown to students. This might include long-term achievement, per-
formance on higher-order objectives rather than tasks requiring primarily knowledge-
level objectives, noncognitive objectives such as increased interest or desire to pursue the
subject further (see Marsh & Overall, 1980), lectures in content areas already familiar to
the viewers, or the inclusion of textual information which also fully covers all the teaching
points in the high content lecture.
Finally, I would like to suggest a counter explanation for some of the Dr Fox findings,
and to propose research to test this hypothesis. Student ratings, like all psychological im-
pressions, are relativistic and based upon some frame of reference. For students in univer-
sity classes the frame of reference is determined by their expectations for that class and by
their experience in other courses. What is the frame of reference in Dr Fox studies? Some
instructor characteristics such as expressiveness and speech clarity can be judged in isola-
tion, since a frame of reference has probably been established through prior experience,
and these characteristics do influence student ratings. For other characteristics such as
content coverage, external frames of reference are not so well defined. For example, cov-
ering four teaching points during a 20-minute lecture may seem like reasonable content
coverage to students, or even to instructors or to curriculum specialists. However, if stu-
dents were asked to compare high and low content lectures on the amount of content co-
vered, to indicate which test items had been covered in the lecture, to evaluate content
coverage relative to textual materials representing the content which was supposed to have
been covered, or even to evaluate content coverage after completing an examination
where they were told that all the questions were supposed to be covered, then they would
have a much better basis for evaluating the content coverage and I predict that their re-
sponses would more accurately reflect the content manipulation. Some support for this
suggestion comes from a recent study by Leventhal et al. (1983) where students viewed one
lecture which was either ‘good’ (high in content and expressiveness) or ‘poor’ (low in con-
tent and expressiveness), and a second lecture by the same lecturer which was either good
or bad. The sum across all ratings of the second lecture varied inversely with the quality of
the first. This is a ‘contrast’ effect which is typical in frame of reference studies (e.g., Par-
ducci, 1968); after viewing a poor lecture a second lecture seems better, after viewing a
good lecture a second lecture seems poorer. Here, the authors also examined different rat-
ing factors and found that the effects of manipulations of instructor characteristics varied
substantially according to the rating component (though the evaluation of Group Interac-
tion and Individual Rapport on the basis of videotaped lectures seems dubious). Unfortu-
nately, the effects of content and expressiveness were intentionally confounded in this de-
sign which was not intended to represent a standard Dr Fox study.

Other Variables Considered in Dr Fox-Type Studies

In addition to content and expressiveness effects, Dr Fox studies have considered the
effects of a variety of other variables: grading standards (Abrami et al., 1980); instructor
reputation (Perry et al., 1979); student personality characteristics (Abrami et al., 1982);
purpose of evaluation (Meier & Feldhusen, 1979); and student incentive (Williams &
Ware, 1976; Perry et al., 1979; Abrami et al., 1980). In each instance, the Dr Fox video-
tapes reflected the manipulations of just content and expressiveness, whereas the other ex-
perimental manipulations represented verbal instructions given to subjects before or after
viewing a Dr Fox lecture. The incompleteness with which these analyses are generally
reported makes it difficult to draw any conclusions, but apparently only incentive (for
example, the Marsh and Ware study) and instructor reputation had any substantial effect
on student ratings. When students are led, through experimentally manipulated feedback,
to believe that an instructor is an effective teacher, they rate him more favorably on the
basis of one short videotaped lecture and presumably the manipulated feedback. Also,
when students are given external incentives to do well, they perform better on examina-
tions and rate teaching effectiveness more favorably.
Researchers have also prepared videotaped lectures manipulating variables other than
content and expressiveness. For example, Land and Combs (1979; see earlier discussion)
videotaped 10 lectures which varied only in teacher speech clarity, operationally defined
as the number of false starts or halts in speech, redundantly spoken words, and tangles in
words. As teacher clarity improved there was a substantial linear improvement in both
student ratings of teaching effectiveness and student performance on a standardized
achievement test.
Cadwell and Jenkins (1985; see earlier discussion) experimentally manipulated verbal
descriptions of instructional behaviors, and asked students to evaluate each instructor in
terms of four rating dimensions that were designed to reflect these behaviors. Each rating
item was substantially related to the behavioral descriptions that it was designed to reflect,
and more highly correlated with those matching behavioral descriptions than with the others
that were considered. While this aspect of the study was not the focus of Cadwell and Jen-
kins' discussion, when viewed in the context of Dr Fox studies, the results provide clear
support for both the convergent and discriminant validity of the students' evaluations with
respect to the behavioral descriptions in the Cadwell and Jenkins study.
CHAPTER 7

UTILITY OF STUDENT RATINGS

Braskamp et al. (1985) argue that it is important for universities, as well as individual fa-
culty members, to take evaluations seriously and offer three arguments, based in part on
organizational research, in support of this position. First, through the evaluation process
the university can effectively communicate and reinforce its goals and values. Second, they
argue that the most prestigious universities and the most successful organizations are the
ones that take assessment seriously. In support of this contention Braskamp (p. 12) cited
William Spady’s review of the book In Search of Excellence by Peters and Waterman
where Spady states:

For me, the single most striking thing about the findings in the book was that the man-
agers of the successful corporations were data-based and assessment-driven. They built
into their management procedures the need to gather data about how things are operat-
ing. Also, there was a commitment to change and modify and improve what they were
doing based on those results. They were very much oriented toward a responsive model
of management. The responding had to do with meeting goals, so it was goal-based. It
was very much oriented toward using performance data, sales data and results, as the
basis for changing what needs to change in the organization.

Third, Braskamp cited employee motivation research that indicated that personal in-
vestment is enhanced when organizations provide stimulating work, provide a supportive
environment, and provide rewards that are perceived to be fair. In summary, Braskamp
(p. 14) states that: “the clarity and pursuit of purpose is best done if the achievements are
known. A course is charted and corrections are inevitable. Evaluation plays a role in the
clarity of purpose and determining if the pursuit is on course.” However, Braskamp
further notes that this perspective is well-established for the evaluation of research, but not
for teaching.
The introduction of a broad institution-based, carefully planned program of students’
evaluations of teaching effectiveness is likely to lead to the improvement of teaching.
Faculty will have to give serious consideration to their own teaching in order to evaluate
the merits of the program. The institution of a program that is supported by the administ-
ration will serve notice that teaching effectiveness is being taken more seriously by the ad-
ministrative hierarchy. The results of student ratings, as one indicator of effective teach-
ing, will provide a basis for informed administrative decisions and thereby increase the
likelihood that quality teaching will be recognized and rewarded, and that good teachers
will be given tenure. The social reinforcement of getting favorable ratings will provide
added incentive for the improvement of teaching, even at the tenured faculty level. Fi-
nally, faculty report that the feedback from student evaluations is useful to their own ef-
forts for the improvement of their teaching. None of these observations, however, pro-
vides an empirical demonstration of improvement of teaching effectiveness resulting from
students’ evaluations.

Changes in Teaching Effectiveness Due to Feedback from Student Ratings

Short Term Feedback Studies

In most studies of the effects of feedback from students’ evaluations, teachers (or clas-
ses) are randomly assigned to experimental (feedback) and one or more control groups;
students’ evaluations are collected during the term; ratings of the feedback teachers are re-
turned to instructors as quickly as possible; and the various groups are compared at the end
of the term on a second administration of student ratings and sometimes on other vari-
ables as well. (There is considerable research on a wide variety of other techniques de-
signed to improve teaching effectiveness which use student ratings as an outcome measure
- see Levinson-Rose & Menges, 1981.)
SEEQ has been employed in two such feedback studies using multiple sections of the
same course. In the first study, results from an abbreviated form of the survey were simply
returned to faculty, and the impact of the feedback was positive, but very modest (Marsh
et al., 1975). In the second study (Overall & Marsh, 1979) researchers actually met with in-
structors in the feedback group to discuss the evaluations and possible strategies for im-
provement. In this study (see Table 7.1) students in the feedback group subsequently per-
formed better on a standardized final examination, rated teaching effectiveness more
favorably at the end of the course, and experienced more favorable affective outcomes
(i.e., feelings of course mastery, and plans to pursue and apply the subject). These two
studies suggest that feedback, coupled with a candid discussion with an external consul-
tant, can be an effective intervention for the improvement of teaching effectiveness (also
see McKeachie et al., 1980).
Reviewers of feedback studies have reached different conclusions (e.g., Abrami et al.,
1979; Kulik & McKeachie, 1975; Levinson-Rose & Menges, 1981; McKeachie, 1979;
Rotem & Glassman, 1979). Cohen (1980), in order to clarify this controversy, conducted
a meta-analysis of all known feedback studies. Across these studies, Cohen
found that instructors who received midterm feedback were subsequently rated about one-
third of a standard deviation higher than controls on the Total Rating (an overall rating
item or the average of multiple items), and even larger differences were observed for rat-
ings of Instructor Skill, Attitude Toward Subject, and Feedback to Students. Studies that
augmented feedback with consultation produced substantially larger differences, but
other methodological variations had no effect. The results of this meta-analysis support
the SEEQ findings described above and demonstrate that feedback from students’ evalua-
tions, particularly when augmented by consultation, can lead to improvement in teaching
effectiveness. Levinson-Rose and Menges (1981), comparing the results of their review
with those of Kulik and McKeachie (1975), Abrami et al. (1979), and Rotem and
Glassman (1979), noted that their own conclusions were more optimistic, particularly
when feedback is supplemented with consultation.
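
The "one-third of a standard deviation" summary refers to a standardized mean difference.
A minimal sketch of that metric - the class-average values below are invented for illustration
and are not Cohen's data - is:

    import math

    def standardized_difference(mean_fb, sd_fb, n_fb, mean_ctl, sd_ctl, n_ctl):
        # difference between feedback and control groups in pooled standard-deviation units
        pooled_var = ((n_fb - 1) * sd_fb**2 + (n_ctl - 1) * sd_ctl**2) / (n_fb + n_ctl - 2)
        return (mean_fb - mean_ctl) / math.sqrt(pooled_var)

    # hypothetical class-average Total Ratings on a 1-9 scale
    print(round(standardized_difference(7.0, 1.2, 30, 6.6, 1.2, 30), 2))   # -> 0.33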

Table 7.1
F Values for Differences Between Students with Either Feedback or No-Feedback Instructors for End-of-Term
Ratings, Final Exam Performance, and Affective Course Consequences (Reprinted with permission from Over-
all and Marsh, 1979; see original article for more details of the analysis)

                                              Feedback(a)       No feedback(b)
Variable                                      M       SD        M       SD     Difference   F(1,601)

Rating components
  Concern                                    52.38    8.5      49.51   10.1       2.87        19.1**
  Breadth                                    50.84    7.9      49.59    7.9       1.25         4.8*
  Interaction                                51.94    7.4      48.61   10.3       3.33        32.4**
  Organization                               49.88    9.4      50.88    9.5      -1.00         2.5
  Learning/Value                             50.77    9.9      48.22   10.7       2.55        11.7**
  Exams/Grading                              50.52    9.9      49.08   10.1       1.44         4.1*
  Workload/Difficulty                        51.13    8.8      51.51    8.8       -.38          .4
  Overall Instructor                          7.00    1.6       6.33    2.1        .67        26.4**
  Overall Course                              5.81    1.8       5.39    2.0        .42         5.4*
  Instructional Improvement                   5.97    1.5       5.49    1.5        .48        16.0**
Final exam performance                       51.34    9.9      49.41   10.1       1.93         9.4**
Affective course consequences
  Programming competence achieved             5.80    2.0       5.42    2.3        .38         7.7**
  Computer understanding gained               6.18    2.0       5.94    2.1        .24         3.6
  Future computer use planned                 4.00    2.8       3.49    2.7        .51         6.5*
  Future computer application planned         5.05    2.6       4.67    2.6        .38         5.4*
  Further related coursework planned          4.39    2.9       3.52    2.9        .87        11.1**

Note. Evaluation factors and final exam performance were standardized with M = 50 and SD = 10. Responses to summary
rating items and affective course consequences varied along a scale ranging from 1 (very low) to 9 (very high).
(a) For feedback group, N = 295 students in 12 sections.
(b) For no-feedback group, N = 456 students in 18 sections.
* p < .05. ** p < .01.
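
The standardization described in the note to Table 7.1 is a conventional T-score transformation.
A minimal sketch with made-up scores:

    import numpy as np

    raw = np.array([3.9, 4.4, 4.1, 4.8, 4.5, 4.2])          # hypothetical raw factor scores
    t_scores = 50 + 10 * (raw - raw.mean()) / raw.std()     # rescaled so that M = 50, SD = 10
    print(t_scores.round(1))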

Several issues still remain in this feedback research. First, the studies demonstrate that
feedback without consultation is only modestly effective, and none of the studies reporting
significant feedback effects with consultation provide an adequate control for the effect of
consultation without feedback (i.e., a placebo effect due to consultation, or a real effect
due to consultation which does not depend upon feedback from student ratings). Second,
the criterion of effective teaching used to evaluate the studies was limited primarily to stu-
dent ratings; only the Overall and Marsh study demonstrated a significant effect of feed-
back on achievement (but also see Miller, 1971; McKeachie et al., 1980). Most other
studies were not based upon multiple sections of the same course, and so it was not possible
to test the effect of feedback on achievement scores. Third, nearly all of the studies were
based on midterm feedback from midterm ratings. This limitation perhaps weakens the
likely effects, in that many instructional characteristics cannot be easily altered in the sec-
ond half of a course; moreover, the generality of these findings to the effects of end-of-term
ratings in one term on subsequent teaching in other terms has not been examined. Further-
more, Marsh and Overall (1980) demonstrated in their multisection validity study that
while midterm and end-of-term ratings were substantially correlated, midterm ratings
appeared to be less valid than end-of-term ratings since they were less correlated
with measures of student learning. Fourth, most of the research is based upon instructors
who volunteer to participate, and this further limits the generality of the effect, since vol-
unteers are likely to be more motivated to use the feedback to improve their instruction.
(This limitation does not apply to the two studies based upon SEEQ.) Finally, reward
structure is an important variable which has not been examined in this feedback research.
Even if faculty are intrinsically motivated to improve their teaching effectiveness, poten-
tially valuable feedback will be much less useful if there is no extrinsic motivation for
faculty to improve. To the extent that salary, promotion, and prestige are based almost
exclusively on research productivity, the usefulness of student ratings as feedback for the
improvement of teaching may be limited. Hildebrand (1972, p. 53) noted that: “The more
I study teaching, however, the more convinced I become that the most important require-
ment for improvement is incorporation of effective evaluation of teaching into advance-
ment procedures.”

Long-Term Feedback Studies

Nearly all feedback studies have considered the effects of feedback from students’
evaluations within a single term, and this is unfortunate. Students’ evaluations are typi-
cally collected near the end of the term so that the more relevant question is the impact of
end-of-term ratings. Also, as discussed earlier, midterm ratings are apparently less valid
than end-of-term ratings, and many aspects of teaching effectiveness cannot be easily al-
tered in the middle of a course. Finally, as emphasized by Stevens and Aleamoni (1985),
the long-term effect of the feedback is more important than its short-term effect. Surpris-
ingly, only two studies known to the author - Centra (1973) and Stevens and Aleamoni
(1985) - have examined the effect of an intervention by comparing feedback and control
groups for a period of more than one semester.
Centra (1973) conducted a large short-term study of the effects of midterm feedback for
a fall semester, but also included a small follow-up phase during the next semester. Instruc-
tors from the first phase were included in the second phase if they were willing to partici-
pate again and if they taught the same course as evaluated in the first phase. Centra consi-
dered three groups in this second phase: (a) fall feedback, 8 of 43 instructors who had ad-
ministered evaluations and were given feedback from both midterm and end-of-term rat-
ings during the first phase; (b) fall post-test only, 13 of 43 instructors who had administered
ratings and were given feedback for only end-of-term ratings during the first phase; and (c)
spring only, 21 of 30 instructors who had not administered ratings during the fall semester.
(Instructors from the first phase who had administered midterm ratings but not received
any midterm feedback were apparently not considered in this second phase.) A mul-
tivariate analysis across a set of 23 items failed to demonstrate significant group differ-
ences, but for 4 of 23 items fall feedback instructors received significantly better (p < 0.05)
ratings than did fall post-test and spring semester groups. Centra also noted that only the
fall feedback group received normative comparisons, and that this might explain why the
fall post-test and spring groups did not differ from each other. Even though a majority of
the teachers from the first phase did not participate in the second phase, Centra reported
that these nonparticipants did not differ from the participants. Nevertheless, the large
number who did not qualify, the probable nonequivalence of the three groups, and the
very small size of the fall feedback group that provided the only hint of group differences
all dictate that the results should be interpreted cautiously. Similarly, interpretations
based on a few items selected a posteriori, particularly after the multivariate difference
based on all items was not statistically significant, should also be made cautiously. Thus,
even though Centra and some subsequent reviewers suggest that this study provides sup-
port for the long-term effects of feedback from student ratings, the effects are very weak
- not even statistically significant when evaluated by traditional methods (i.e., the origi-
nal multivariate analysis) - and problems with the design of the second phase dictate that
interpretations should be made cautiously.
Aleamoni (1978) compared two groups of instructors on student ratings during the
spring semester. Both groups were given results of students' evaluations during the fall
term, but only the ‘experimental’ group was given an individual consultation designed to
improve teaching effectiveness based on the student ratings. Originally there was no inten-
tion of having two groups, and the comparison group consisted of those instructors who did
not receive the intervention because of time limitations, scheduling conflicts, and self-
selection. Also, instructors in the top two deciles were eliminated from the experimental
group, but apparently not from the comparison group. While there was significantly grea-
ter improvement in the experimental group than in the control group on two of six measures, interpre-
tations are problematic. Since the groups were not randomly assigned nor even matched
on pre-test scores, they may not have been comparable. More critically, regression effects
- increases in the experimental group relative to the comparison group - were probable
because of the elimination of the top rated instructors in the experimental group but not
the comparison group.
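
The regression effect referred to above can be demonstrated with a small simulation. The
sketch below uses entirely hypothetical numbers (it is not the Aleamoni data); it shows that
when the top-rated instructors are removed from one group but not the other, the trimmed group
shows a spurious pre-post gain even though no intervention effect is built into the data.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 10_000                                         # simulated instructors per group

    def two_ratings(n):
        true = rng.normal(size=n)                      # stable component of teaching effectiveness
        pre = true + rng.normal(scale=0.7, size=n)     # fall ratings (with measurement error)
        post = true + rng.normal(scale=0.7, size=n)    # spring ratings; no intervention effect
        return pre, post

    pre_c, post_c = two_ratings(n)                     # comparison group: no exclusions
    pre_e, post_e = two_ratings(n)
    keep = pre_e < np.quantile(pre_e, 0.8)             # drop the top two deciles on the pre-test
    pre_e, post_e = pre_e[keep], post_e[keep]

    print(f"comparison group gain:     {post_c.mean() - pre_c.mean():+.3f}")
    print(f"'consultation' group gain: {post_e.mean() - pre_e.mean():+.3f}   (regression artifact)")
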
Stevens and Aleamoni (1985) conducted a 10-year follow-up study of those instructors
in the Aleamoni (1978) study. For purposes of the follow-up study they considered three
groups: the consultation and comparison groups from the original study, and the group of
highly rated instructors that was originally scheduled to be part of the consultation group
but was excluded from the original study. In the first set of analyses they found that in-
structors who had received consultation, and other instructors who experienced pre-post
increases in student ratings in the original study, were subsequently more likely to collect
student ratings. In the second set of analyses (based on student ratings by those instructors
from the original study who subsequently used student ratings), the three groups were
compared on student ratings from the original study and from the follow-up period. A
multivariate analysis across five subscales revealed no significant group differences or
group x time interactions, but univariate ANOVAs for two of the scales were statistically
significant. Further analyses indicated that these small effects were due primarily to the
high pre-test means for the group of highly rated instructors who were excluded from the
original study on the basis of high pre-test ratings. There were no significant differences
between the original consultation group and the original comparison groups. The authors
suggest that the follow-up study supports the results of the original study, but their statisti-
cal analyses - particularly the lack of differences in the two groups that were considered
in the original study - do not support this conclusion. Many suggestions by the authors
warrant further study, but no firm conclusions on the basis of this study are warranted be-
cause of design problems in both the original and follow-up study.
Other studies (e.g., Marsh, 1980b, 1982a; Voght & Lasher, 1973) have examined how
teaching effectiveness varies over time for the same instructor when that instructor is given
systematic feedback from students’ evaluations. However, since these studies have no con-
trol group, observed changes may be due to teaching experience, age-related changes in
the instructors, or even the selection of instructors.


In summary, there has been virtually no research on the long-term effects of feedback
from students’ evaluations. A few studies have considered long-term follow-ups of short-
term interventions, but these are sufficiently flawed that no generalizations are warranted.
However, many of the methodological shortcomings could be overcome easily, particu-
larly at universities where student ratings are regularly collected for all faculty. No re-
search has examined the effects of continued feedback from student evaluations over a
long period of time with a true experimental design, and such research will be very difficult
to conduct. For short periods of time it may be justifiable to withhold the results of student
ratings from randomly selected instructors or to require that some instructors not collect
student ratings at all, but it would not be feasible to do so for many years. The long-term
effects of feedback from students’evaluations may be amenable to quasi-experimental de-
signs (e.g., Voght & Lasher, 1973), but the difficulties inherent in the intervention of such
studies may preclude any firm generalizations.

Equilibrium Theory

The Theory and its Operationalizations

Equilibrium theory was derived to explain why instructors are motivated to improve
their teaching. In one of the earliest applications of equilibrium theory to the effects of
feedback from student ratings, Gage (1963, p. 264) posited that: “Imbalance may be said
to exist whenever a person (teacher) finds that she holds a different attitude toward some-
thing (her behavior) from what she believed is held by another person or group (her pupils)
to whom she feels positively oriented.” In their review, Levinson-Rose and Menges (1981,
p. 143) stated: “A discrepancy, either negative or positive, between evaluations by the in-
structor and those by students creates a state of disequilibrium. To restore equilibrium, the
teacher may attempt to modify instruction.” Centra (1973) hypothesized that: “On the
basis of equilibrium theory one could expect that the greater the gap between student rat-
ings and faculty self-ratings, the greater the likelihood that there would be a change in in-
struction” (p. 396). As typically operationalized, differences between student ratings and
some other indicator (usually instructor self-ratings, but sometimes ideal ratings by stu-
dents or the instructor*) collected at one point in time are posited to create a sense of dis-
equilibrium in the feedback group of teachers so that they change their teaching behaviors
in a way that is reflected in subsequent student ratings when compared to a control group
that receives no feedback.
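To make this operationalization concrete, the following is a minimal sketch in Python, assuming a small hypothetical data set; the column names (self_rating_t1, student_mean_t1, etc.) and the values are illustrative and are not taken from any of the studies reviewed here.

    import pandas as pd

    # Hypothetical data: one row per instructor (all names and values are illustrative).
    df = pd.DataFrame({
        "instructor": ["A", "B", "C", "D"],
        "self_rating_t1": [4.5, 3.0, 4.0, 2.5],    # instructor self-evaluation at time 1
        "student_mean_t1": [3.2, 3.4, 4.1, 2.9],   # class-average student rating at time 1
        "student_mean_t2": [3.6, 3.3, 4.0, 3.1],   # class-average student rating at time 2
        "feedback": [True, False, True, False],    # feedback vs. control manipulation
    })

    # Discrepancy as typically operationalized: self-rating minus class-average student rating.
    df["discrepancy_t1"] = df["self_rating_t1"] - df["student_mean_t1"]

    # Change in student ratings that feedback is posited to produce for discrepant instructors.
    df["rating_change"] = df["student_mean_t2"] - df["student_mean_t1"]

    print(df[["instructor", "feedback", "discrepancy_t1", "rating_change"]])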
There are, however, four important problems with this typical operationalization of
equilibrium theory. First, the existence of a discrepancy may not motivate a need for

* I have not emphasized ideal scores in the discussion of equilibrium theory for three reasons. First, I suspect that
class-average ideal scores will nearly always fall near the most favourable end of the response scale and have little
variation. Hence, variance explained by ideal-actual scores can be explained by just the actual scores. Second,
I am not sure that discrepancies between ideal and actual scores will be a motivation to change. Logically actual
scores will rarely if ever equal ideal scores, and so the existence of such discrepancies may not be a source of
psychological discomfort. Third, most studies have emphasized discrepancies between actual and self-ratings
(but see Gage, 1963).

change if the teacher is not concerned about the discrepancy. The extent of the motivation
for change will be a function of both the size of the discrepancy and the importance of the
discrepancy to the individual teacher. Second, particularly when the disequilibrium is due
to differences between student and self ratings, teachers can reduce the posited imbalance
either by changing their teaching behaviors so as to alter the students’ evaluations or by
changing their own self-evaluations. Hence, subsequent student ratings and self ratings
must be considered. (Other changes, such as denying the validity of the student ratings,
may also be consistent with equilibrium theory, and tests of such predictions would require
that other variables were collected.) Third, the likely effects of a discrepancy should differ
depending on the direction of the discrepancy. For teachers with a positive discrepancy -
self-ratings higher than student ratings - it is reasonable that teachers may be motivated
to alter their teaching behaviors so as to improve ratings, but they could just lower their
own, perhaps unrealistically high, self-evaluations. For teachers with a negative discre-
pancy - self-ratings lower than student ratings - it is not reasonable that teachers would
intentionally undermine their teaching effectiveness in order to lower their student evalua-
tions, and it is more reasonable that they would simply increase their own, perhaps un-
realistically negative, self-evaluations. Hence, the effect of feedback needs to be consi-
dered separately for positively discrepant, negatively discrepant, and nondiscrepant
groups. Fourth, equilibrium theory implies that the existence of a discrepancy, in addition
to or instead of the diagnostic information provided by the feedback, is responsible for the
change in subsequent behavior. Equilibrium theory does not imply that there will be over-
all feedback/control differences, and such differences do not support equilibrium theory.
Rather, it is necessary to demonstrate that the feedback/control differences depend on the
direction and size of the discrepancy.

The Analysis, Design and Predicted Results of Equilibrium Theory Studies

In addition to problems with the operationalization of equilibrium theory, there are also
statistical complications in the analysis of difference scores that are critical to tests of the
theory. First, teachers with different levels of discrepancy are not initially equivalent so
that their comparison on subsequent measures is dubious, even after statistical adjust-
ments for initial differences. Instead, comparisons should be made between experimental
and control groups that are equivalent due to random assignment and, perhaps, matching.
Second, depending on the analysis, there is the implicit assumption that instructor self-
evaluations and student ratings vary along the same scale, but this is unlikely. For ex-
ample, Centra (1979) suggests that instructor self-ratings are systematically higher than
student ratings, particularly when instructors have not previously seen results of students’
evaluations. Also, instructor self-ratings are based on the responses by a single individual
whereas student ratings are based on class-average responses, and so their variances will
probably differ substantially. Third, when a difference score is significantly related to
some other variable, it is further necessary to demonstrate that both components of the dif-
ference score contribute significantly and in the predicted direction. For example, even if
an ideal-actual difference score is significantly related to a subsequent criterion measure,
support for the use of a difference score requires that both the ideal and the actual score
contribute uniquely and in the opposite direction to the prediction of the criterion mea-
sure. If most of the variance due to difference scores can be explained by just one of the
two components, then the theoretical and empirical support for the use of the difference
score is weak.
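The check implied here can be illustrated with a short sketch, assuming simulated data and the statsmodels package; the variable names (ideal, actual, criterion) are illustrative. The difference-score model is a constrained version of a model containing both components, so both components should be entered separately to verify that each contributes uniquely and with the opposite sign.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 200
    ideal = rng.normal(4.5, 0.3, n)                             # illustrative 'ideal' ratings
    actual = rng.normal(3.5, 0.7, n)                            # illustrative 'actual' ratings
    criterion = 0.5 * (ideal - actual) + rng.normal(0, 1, n)    # simulated subsequent criterion
    df = pd.DataFrame({"ideal": ideal, "actual": actual, "criterion": criterion})

    # Model 1: the difference score alone (implicitly constrains the two weights to +1 and -1).
    m1 = smf.ols("criterion ~ I(ideal - actual)", data=df).fit()

    # Model 2: both components entered separately; support for the difference score requires
    # that both coefficients are significant and of opposite sign.
    m2 = smf.ols("criterion ~ ideal + actual", data=df).fit()

    print(m1.params)
    print(m2.params)
    print(m2.pvalues)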
The minimally appropriate analysis and corresponding design for tests of equilibrium
theory require a three-way MANOVA (or a conceptually similar regression analysis). The three
independent variables are initial instructor self-ratings (with at least three levels - e.g.,
low, medium and high), initial student ratings (with at least three levels), and the feedback
manipulation (feedback vs. control groups), whereas the dependent variables are sub-
sequent student ratings and instructor self-ratings. The main effect of the feedback man-
ipulation provides a test of the effect of feedback, but not of equilibrium theory. Support
for equilibrium theory requires that the feedback effect interacts with both levels of initial
self-ratings and student ratings. In this analysis the initial equivalence of the feedback and
control groups through random assignment of a sufficiently large number of teachers, and,
perhaps, matching, is critically important as a control for regression effects, ceiling effects,
floor effects, etc. Hence, designs without randomly assigned control groups, and particu-
larly designs with no comparison groups at all, are dubious.
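A sketch of this design, assuming simulated data and the MANOVA class from statsmodels; the factor levels and variable names are illustrative. The term of interest for equilibrium theory is the feedback x initial self-rating x initial student-rating interaction rather than the feedback main effect.

    import numpy as np
    import pandas as pd
    from statsmodels.multivariate.manova import MANOVA

    rng = np.random.default_rng(2)
    n = 360
    df = pd.DataFrame({
        "feedback": rng.choice(["feedback", "control"], n),        # randomly assigned manipulation
        "init_self": rng.choice(["low", "medium", "high"], n),     # initial instructor self-ratings
        "init_student": rng.choice(["low", "medium", "high"], n),  # initial student ratings
    })
    # Simulated dependent variables: subsequent student ratings and subsequent self-ratings.
    df["subs_student"] = rng.normal(3.5, 0.5, n)
    df["subs_self"] = rng.normal(3.7, 0.5, n)

    mv = MANOVA.from_formula(
        "subs_student + subs_self ~ C(feedback) * C(init_self) * C(init_student)", data=df
    )
    # The three-way interaction term is the critical test for equilibrium theory.
    print(mv.mv_test())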
One hypothetical set of results that would support equilibrium theory is illustrated in
Figure 7.1. For subsequent student ratings, the theory predicts differences between feed-
back and control groups to be positively related to initial self-ratings and negatively related
to initial student ratings.

[Figure 7.1: two panels showing feedback/control differences as a function of initial self-ratings (low, medium, high) at each level of initial student ratings; the left panel plots subsequent student ratings and the right panel subsequent self-ratings.]

Figure 7.1 Hypothetical results in support of equilibrium theory. Feedback/control differences in subsequent student ratings and subsequent self-evaluations for positively discrepant (self-ratings higher than student ratings, indicated with a's), nondiscrepant (b's), and negatively discrepant (c's) groups of instructors.

Note that the predicted differences are most positive for the three
positively discrepant groups (marked with a’s in Figure 7.1(a)), most negative for the three
negatively discrepant groups (marked with c’s), and intermediate for the three nondiscrep-
ant groups (marked with b’s). In contrast, for subsequent self-ratings, differences between
feedback and control groups are predicted to be negatively related to initial self-ratings
and positively related to initial student ratings. The predicted differences are most nega-
tive for the three positively discrepant groups (marked with a’s in Figure 7.1(b)), most
positive for the three negatively discrepant groups (marked with c’s), and intermediate for
the three nondiscrepant groups (marked with b’s).
The predictions in Figure 7.1 are consistent with equilibrium theory, but they may be
overly simplistic and other results would also support the theory. In particular, the predic-
tions are based on the assumptions that: (a) feedback by itself has no effect on either sub-
sequent student ratings or self-ratings (i.e., the average of feedback/control differences is
zero); (b) the effects of both initial student ratings and self ratings are linear and of approx-
imately the same magnitude; (c) the pattern of effects on subsequent student ratings will
be the mirror image of the effects on subsequent self-ratings. However, equilibrium theory
would still be supported if: (a) the average feedback/control difference was positive (or
even negative); (b) the effects of initial student ratings and self-ratings were nonlinear (but
still monotonic) or differed in magnitude; (c) the effects for positively discrepant and nega-
tively discrepant groups varied depending on whether the dependent variable was sub-
sequent student ratings or self-ratings. (Earlier it was suggested that changes in self ratings
were much more likely for negatively discrepant groups.)

Empirical Findings

Gage (1963) presented one of the earliest applications of equilibrium theory in a study
of the effect of feedback from students’ evaluations by sixth grade students. Gage posited
that psychological disequilibrium would be created by differences in students’ ratings of
their actual and ideal teachers. A randomly assigned group of feedback teachers was given
the results of both the actual and ideal ratings by their students, and between one and two
months later pupils again provided ratings of their actual and ideal teacher. Since discre-
pancy scores were based on ideal-actual ratings, nearly all the discrepancy scores were
positive (i.e. ideal teacher ratings were higher than actual teacher ratings). Although the
support was not overwhelming, the results showed that “teachers who received feedback
did seem to change in the direction of pupils’ ‘ideals’ more than did teachers from whom
feedback was withheld” (p. 265). However, Gage did not report whether this small effect
was due to information provided by the feedback or the discrepancy between ideal and ac-
tual ratings. Support for equilibrium theory would have required that the feedback/control
differences be systematically related to the size of discrepancies, but such analyses were
not reported. Hence, the reported results provide some support of the effect of feedback,
but not for equilibrium theory.
Centra (1973) hypothesized that the greater the gap between student ratings and faculty
self-ratings, the more likely that there would be a change in instruction. In order to test this
theory, Centra collected student ratings and instructor self-ratings at mid-semester, gave
feedback to a randomly selected feedback group, and compared feedback and control
groups on end-of-semester ratings by students. Centra found that the groups did not differ
significantly on end-of-term ratings, but that end-of-term ratings did vary significantly ac-
cording to the difference between self-ratings and student ratings. For the positive discre-
pancy group, the size of the discrepancy was significantly more correlated with end-of-
term ratings for feedback than for control teachers on 5 of 17 items. For the negative dis-
crepancy group, the size of the discrepancy was significantly more correlated with end-of-
term ratings for the feedback than for control instructors for only one item.
I interpret Centra’s results to mean that the more substantial the positive discrepancy
between self-ratings and student ratings the more likely feedback is to have a positive ef-
fect on end-of-term ratings (13 of 17 items, 5 statistically significant) and, to a lesser extent,
the more substantial the negative discrepancy the more likely feedback is to have a positive
effect* on end-of-term ratings (11 of 16 items, 1 statistically significant). While the results
for the positive discrepancy group provide weak support for equilibrium theory, the results
of the negative discrepancy group apparently do not. However, this interpretation is some-
what dubious because feedback/control differences were primarily nonsignificant, Centra
did not actually test the implied interaction effects, and the results for his regression analy-
sis were not fully presented. Other reviewers have also had trouble interpreting Centra’s
study. Levinson-Rose and Menges (1981), for example, interpret the results to mean that
“5 of 17 items showed higher scores for the unfavorably discrepant group (those whose
self-ratings were higher than student ratings) compared with the favorably discrepant
group” (p. 413), but Centra made no statistical comparisons between the favorably and un-
favorably discrepant groups. Rotem and Glassman (1979) interpreted the results to mean
that “instructor’s self-ratings and students’ ratings have an interaction effect” (p. 502), but
Centra did not test the interaction effect. In summary Centra’s study may provide weak
support for equilibrium theory, but any interpretations of his reported results are prob-
lematic.
Braunstein et al. (1973) found that changes in student ratings between midterm and
end-of-term were more likely to be positive for their feedback group than for their control
group. After completion of midterm ratings but before the return of the feedback, all in-
structors were asked to complete a survey indicating what they expected their ratings to be
at the end of the term. (All instructors had previously received student evaluation feed-
back in previous terms as part of a regular student evaluation, so that these expectancy rat-
ings may differ from the self-evaluations considered in other studies.) For instructors in
just the feedback group, positive discrepancies - expectancies higher than midterm stu-
dent ratings - were more likely to be associated with positive changes in student ratings
than were negative discrepancies. The findings support the effect of feedback and appa-
rently support equilibrium theory, but an important qualification must be noted. The crit-
ical relation between discrepancies and subsequent student ratings was not compared to
results for the control group. All, or nearly all, the feedback instructors completed the ex-
pectancy ratings, but only one-third of the control group did so, suggesting that completion
of the expectancy ratings was not independent of the feedback manipulation. More impor-
tantly, the relation between discrepancy scores and subsequent student ratings for the con-
trol instructors who did complete the expectancy self-ratings was similar to that observed
in the feedback group (27 of 31 changes in the expected direction vs. 5 of 6 for the control
group). Since the relation between discrepancies and subsequent student ratings may have

* For the negative discrepancy group, the regression coefficients are negative for the feedback group and tend
to be less positive or more negative than for the control group. However, since the sign of the discrepancy score
is negative, this results in more positive student ratings.

been similar in feedback and control groups, the apparent support for equilibrium theory
must be interpreted cautiously.
Pambookian (1976) examined changes in student ratings for small groups of positively
discrepant (i.e. self ratings higher than student ratings), negatively discrepant, and nondis-
crepant groups. He reasoned that the larger the discrepancy, whether positive or negative,
the greater the pressure on the instructor to change. However, he interpreted equilibrium
theory to mean that the direction of this change should be towards higher student ratings
for both negatively and positively discrepant groups even though this would apparently in-
crease - not decrease-the size of the discrepancy for the negatively discrepant teachers.
Notwithstanding the logic of this prediction, his results showed that for negatively discrep-
ant teachers the student ratings showed little change or were slightly poorer at the end-of-
term than at midterm. For nondiscrepant teachers, and particularly for positively discrep-
ant teachers, end-of-term ratings tended to be more favorable than midterm ratings. Thus
Pambookian’s results support the interpretation of equilibrium theory posited here, even
though they did not support his predictions. Nevertheless, the lack of control groups and
his extremely small sample size dictate that his findings must be interpreted cautiously.
Rotem (1978) collected both actual and desirable self-ratings from both instructors and
from students during the term, randomly assigned teachers to feedback and control condi-
tions, and collected parallel sets of ratings from students and instructors at the end of the
term. Rotem found no feedback/control differences on students’ actual end-of-term rat-
ings. In order to test the various forms of equilibrium theory, he conducted separate mul-
tiple regressions on feedback and control groups to determine if the desirable ratings by
students or teachers, or the actual ratings by teachers contributed to the end-of-term rat-
ings by students. However, he found that for both feedback and no-feedback groups, none
of these additional variables was related to actual end-of-term student ratings beyond what
could be explained by actual midterm student ratings. He concluded that: “It appears,
therefore, that discrepancies between ‘actual’ and ‘desirable’ ratings and between stu-
dents’ and instructors’ ratings did not have any functional relationship with the post-test
ratings” (p. 309). The design of the Rotem study is the best of those considered here and
his analysis provides a reasonable test of equilibrium theory, but his results provide no sup-
port for the theory with respect to the effect of discrepancies on subsequent student rat-
ings. Despite the fact that Rotem collected subsequent self-ratings as well as student rat-
ings, he did not test predictions from equilibrium theory with respect to this variable.
Hence, it is possible that instructors altered their self ratings instead of altering their teach-
ing behaviors, and such a finding would be consistent with equilibrium theory.

Summary of Equilibrium Theory and Empirical Studies of the Theory

Equilibrium theory posits that instructors will be differentially motivated to improve the
effectiveness of their teaching depending on the discrepancy between how they view them-
selves and are viewed by students. In terms of equilibrium theory the role of the initial stu-
dent ratings is not to provide diagnostic information to improve teaching, though this
would not be inconsistent with the theory, but rather to provide a basis for establishing a
disequilibrium. Hence, support for the theory does not necessarily support the diagnostic
usefulness of the ratings, nor vice versa. The theory has an intuitive appeal. However, the
theory’s operationalization, and the design and analysis of studies to test the theory, are
complicated and frequently misrepresented. None of the empirical studies of the theory
examined here were fully adequate, the most common errors being: (a) the lack of a con-
trol group; (b) the failure to consider changes in both subsequent and self ratings; and (c)
the inappropriate or incomplete analyses of the results. These methodological shortcom-
ings make statements of support or nonsupport for the theory problematic. Nevertheless,
the studies that appeared to be most methodologically adequate - Centra (1973) and par-
ticularly Rotem (1978) - provide weak or no support for the theory. The relation of
equilibrium theory to student evaluation research has not been fully examined, no well-de-
fined paradigm to test the theory has been established, and empirical tests of the theory are
generally inadequate. However, the deficiencies in previous research can be easily re-
medied and so this is an important area for further research, or, perhaps, even the
reanalysis of previous studies.

Usefulness in Tenure/Promotion Decisions

Since 1929, and particularly during the last 25 years, a variety of surveys have been con-
ducted to determine the importance of students’ evaluations and other indicators of teach-
ing effectiveness in evaluating total faculty performance in North American universities.
A 1929 survey by the American Association of University Professors (AAUP) asked how
teachers were selected and promoted, and it was noted that, while ‘skill in instruction’ was
cited by a majority of the respondents, “One could wish that it had also been revealed how
skill in instruction was determined, since it remains the most difficult and perplexing sub-
ject in the whole matter of promotion and appointment.” (AAUP, 1929 as cited by Re-
mmers & Wykoff, 1929.)
Subsequent surveys conducted during the last 25 years found that classroom teaching
was considered to be the most important criterion of total effectiveness, though research
effectiveness may be more important at prestigious, research-oriented universities (for re-
views see Centra, 1979; Leventhal et al., 1981; Seldin, 1975). In the earlier surveys ‘sys-
tematically collected student ratings’ was one of the least commonly mentioned methods
of evaluating teaching, and authors of those studies lamented that there seemed to be no
serious attempt to measure teaching effectiveness. Such conclusions seem to be consistent
with those in the 1929 AAUP survey. More recently, however, survey respondents indi-
cate that chairperson reports, followed by colleague evaluations and student ratings, are
the most common criteria used to evaluate teaching effectiveness, and that student ratings
should be the most important (Centra, 1979). These findings suggest that the importance
and usefulness of student ratings as a measure of teaching effectiveness have increased
dramatically during the last 60 years and particularly in the last two decades.
Despite the strong reservations by some, faculty are apparently in favor of the use of stu-
dent ratings in personnel decisions-at least in comparison with other indicators of teach-
ing effectiveness. For example, in a broad cross-section of colleges and universities, Rich
(1976) reported that 75 per cent of the respondents believed that student ratings should be
used in tenure decisions. Rich also noted that faculty at major research-oriented univer-
sities favored the use of student ratings more strongly than faculty from small colleges.
Rich suggested that this was because teaching effectiveness was the most important deter-
minant in personnel decisions in small colleges, so that student ratings were more threaten-
ing. However, Braskamp et al. (1985) noted that university faculty place more emphasis on
striving for excellence and are more competitive than faculty at small colleges, and these
differences might explain their stronger acceptance of student ratings.
Leventhal et al. (1981) and Salthouse et al. (1978) composed fictitious summaries of fa-
culty performance that systematically varied reports of teaching and research effective-
ness, and also varied the type of information given about teaching (chairperson’s report or
chairperson’s report supplemented by summaries of student ratings). Both studies found
reports of research effectiveness to be more important in evaluating total faculty perform-
ance at research universities, though Leventhal et al. found teaching and research to be of
similar importance across a broader range of institutions. While teaching effectiveness as
assessed by the chairperson’s reports did make a significant difference in ratings of overall
faculty performance, neither study found that supplementing the chairperson’s report with
student ratings made any significant difference. However, neither study considered stu-
dent ratings alone nor even suggested that the two sources of evidence about teaching
effectiveness were independent. Information from the ratings and chairperson’s report
were always consistent so that one was redundant, and it would be reasonable for subjects
in these studies to assume that the chairperson’s report was at least partially based upon
students’ evaluations. These studies demonstrate the importance of reports of teaching ef-
fectiveness, but do not appear to test the impact of student ratings.

Usefulness in Student Course Selection

Little empirical research has been conducted on the use of ratings by prospective stu-
dents in the selection of courses. UCLA students reported that the Professor/Course
Evaluation Survey was the second most frequently read of the many student publications,
following the daily campus newspaper (Marsh, 1983). Leventhal et al. (1975) found that
students say that information about teaching effectiveness influences their course selec-
tion. Students who select a class on the basis of information about teaching effectiveness
are more satisfied with the quality of teaching than are students who indicate other reasons
(Centra & Creech, 1976; Leventhal et al., 1976). In an experimental field study, Coleman
and McKeachie (1981) presented summaries of ratings of four comparable political science
courses to randomly selected groups of students during preregistration meetings. One of
the courses had received substantially higher ratings, and it was chosen more frequently by
students in the experimental group than by those in the control group. Based upon this li-
mited information, it seems that student ratings are used by students in the selection of in-
structors and courses.

Summary of Studies of the Utility of Student Ratings

With the possible exception of short-term studies of the effects of midterm ratings,
studies of the usefulness of student ratings are both infrequent and often anecdotal. This
is unfortunate, because this is an area of research that can have an important and construc-
tive impact on policy and practice. Important, unresolved issues were identified that are
in need of further research. For example, for administrative decisions students’ evaluations
can be summarized by a single score representing an optimally-weighted average of
specific components, or by the separate presentation of each of the multiple components,
but there is no research to indicate which is most effective. If different components of stu-
dents’ evaluations are to be combined to form a total score, how should the different com-
ponents be weighted? Again there is no systematic research to inform policy makers. De-
bates about whether students’ evaluations have too much or too little impact on administra-
tive decisions are seldom based upon any systematic evidence about the amount of impact
they actually do have. Researchers often indicate that students’ evaluations are used as
one basis for personnel decisions, but there is a dearth of research on the policies and
practices that are actually employed in the use of student ratings. A plethora of policy
questions exist (for example, how to select courses to be evaluated, the manner in which
rating instruments are administered, who is to be given access to the results, how ratings
from different courses are considered, whether special circumstances exist where ratings
for a particular course can be excluded either a priori or post-hoc, whether faculty have the
right to offer their own interpretation of ratings, etc.) which are largely unexplored despite
the apparently wide use of student ratings. Anecdotal reports often suggest that faculty
find student ratings useful, but there has been little systematic attempt to determine what
form of feedback to faculty is most useful (though feedback studies do support the use of
services by an external consultant) and how faculty actually use the results which they do
receive. Some researchers have cited anecdotal evidence for negative effects of student
ratings (e.g., lowering grading standards or making courses easier) but these are also
rarely documented with systematic research. While students’ evaluations are sometimes
used by students in their selection of courses, there is little guidance about the type of in-
formation which students want and whether this is the same as is needed for other uses of
students’ evaluations. These, and a wide range of related questions about how students’
evaluations are actually used and how their usefulness can be enhanced, provide a rich
field for further research.
CHAPTER 8

THE USE OF STUDENT RATINGS IN DIFFERENT
COUNTRIES: THE APPLICABILITY PARADIGM

Students’ evaluations of teaching effectiveness are commonly collected and frequently
studied at North American universities and colleges, but not in most other parts of the
world. There has been little attempt to test the applicability of North American instru-
ments, or the generalizability of findings from North American research, to other coun-
tries. However, there is also the danger that instruments developed in one setting will be used in new
settings without first studying the applicability of the instruments to the new setting. In
order to address this issue, Marsh (1981a) described a new paradigm for studying the
applicability of two North American instruments in different countries. In this chapter the
results from the 1981a study and three subsequent applications of the paradigm are reviewed,
strengths and weaknesses of the paradigm are evaluated, and implications for future
research are discussed.

The Applicability Paradigm

For this paradigm Marsh (1981a) selected two North American instruments that mea-
sure multiple dimensions of effective teaching, and whose psychometric properties had
been studied extensively - Frey’s Endeavor and the SEEQ that were described earlier
(also see Appendix). University of Sydney students from 25 academic departments
selected ‘one of the best’ and ‘one of the worst’ lecturers they had experienced, and rated
each on an instrument comprised primarily (55 of 63 items) of SEEQ and Endeavor items.
As part of the study, students were asked to indicate ‘inappropriate’ items, and to select
up to five items that they ‘felt were most important in describing either positive or negative
aspects of the overall learning experience in this instructional sequence’ for each instructor
who they evaluated. Analyses of the results included: (a) a discrimination analysis examin-
ing the ability of items and factors to differentiate between ‘best’ and ‘worst’ instructors;
(b) a summary of ‘not appropriate’ responses; (c) a summary of ‘most important item’ re-
sponses; (d) factor analyses of the SEEQ and/or Endeavor items; and (e) a MTMM analy-
sis of agreement between SEEQ and Endeavor scales.
This applicability paradigm was subsequently used in three other studies: Hayton (1983)
with Australian students in Technical and Further Education (TAFE) schools; Marsh et al.
(1985) with students from the Universidad de Navarra in Spain; and Clarkson (1984) with
students from the Papua New Guinea (PNG) University of Technology. The PNG study
differed from the others in that it was based on a much smaller sample and a much more
limited selection of students and teachers. Clarkson also noted that the PNG setting was
‘non-Western’ and differed substantially from the ‘Western’ settings in most student
evaluation research. The TAFE and Spanish studies differed from the other two in that
students were asked to select ‘a good, an average, and a poor teacher’ instead of a best and
worst teacher, and the five-point response scale used in the original study was expanded
to include nine response categories. The Spanish study also differed in that the items were
first translated into Spanish. (English is the official language in PNG even though 720 dif-
ferent languages are spoken in that country according to Clarkson, 1981.)
All but the TAFE study have been previously published, and so only that one will be
described in further detail. TAFE courses vary from manual trade and craft courses to
para-professional diploma courses in fields such as nursing, and TAFE students are more
varied in terms of age, socioeconomic status, and educational background than typical Au-
stralian university students. Hayton collected his data from eight TAFE institutions and
within each institution, ratings were collected from different academic units so that the
final sample of 218 students (654 sets of ratings) was a broad cross-section of TAFE stu-
dents. Each student was asked to select a good, an average, and a poor teacher, and to limit
selection to classes of at least one term that used mainly a lecture/discussion format. The
major findings of this study, as well as of the other three, are summarized below.

Differentiation Between Different Groups of Instructors

The different groups of instructors selected by the students constitute criterion groups,
and student ratings should be able to differentiate between these groups. In each of the
four studies, all but the Workload/Difficulty items strongly differentiated between the two
groups or the three groups of teachers. In the studies with the three criterion groups -
good, average, and poor instructors - nearly all of the between group variance could be
explained by a linear component. In the Spanish study, for example, as much as two-thirds
of the variance in some items could be explained by differences between the groups of
teachers chosen as good, average and poor, and for most items less than 1 percent of this
could be explained by nonlinear components. Differences in the Workload/Difficulty
items tended to be much smaller, sometimes failing to reach statistical significance, though
the best teachers tended to teach courses that were judged to be more difficult and to have
a heavier workload. The ratings differentiated substantially between the criterion groups
in all four studies.
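The partitioning of between-group variance into linear and nonlinear components can be sketched with orthogonal polynomial contrasts, assuming equal-sized simulated groups; the contrast weights (-1, 0, 1) and (1, -2, 1) are the standard ones for three ordered groups, but the data and group means are illustrative.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 100
    # Simulated ratings of an item for teachers chosen as poor, average, and good.
    groups = {"poor": rng.normal(2.5, 0.6, n),
              "average": rng.normal(3.5, 0.6, n),
              "good": rng.normal(4.5, 0.6, n)}
    means = np.array([groups[g].mean() for g in ("poor", "average", "good")])

    # Orthogonal polynomial contrasts for three ordered, equal-sized groups.
    linear = np.array([-1, 0, 1])
    quadratic = np.array([1, -2, 1])
    ss_linear = n * (linear @ means) ** 2 / (linear @ linear)
    ss_quadratic = n * (quadratic @ means) ** 2 / (quadratic @ quadratic)
    ss_between = ss_linear + ss_quadratic

    print(f"linear share of between-group variance: {ss_linear / ss_between:.3f}")
    print(f"nonlinear (quadratic) share:             {ss_quadratic / ss_between:.3f}")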
It is hardly surprising that an instructor selected as being ‘best’ by a student is consis-
tently rated more favorably than one whom the student selected as being ‘worst’. The halo
effect produced by this selection process probably exaggerates the differentiation among
groups, but also makes it more difficult to distinguish among the multiple components of
effective teaching, thus undermining the factor analyses and MTMM analyses that are a
major focus of the applicability paradigm. Hence, the differentiation is a double-edged
sword; too little would suggest that the ratings are not valid, but too much would preclude
the possibility of demonstrating the multidimensional nature of the ratings. Because of this
problem, students in the TAFE and Spanish studies selected a good, average, and poor in-
structor, instead of best and worst instructors, and the five-point response scale was ex-
panded to nine categories. However, as will be discussed later, these changes do not ap-
pear to resolve the problem, which continues to be a weakness in this paradigm.

Factor Analyses

In each of the four studies, separate factor analyses of the Endeavor and SEEQ items
were conducted in an attempt to replicate previously identified factors. However, these at-
tempts were plagued by problems associated with the analysis of responses by individual
students. As described earlier: (a) for most applications of students’ evaluations the class-
average is the appropriate unit of analysis, and factor analyses of responses by individual
students are typically discouraged; (b) many of the response biases that are idiosyncratic
to individual responses are cancelled out when class-average responses are analyzed; (c)
class-average responses tend to be more reliable than individual student responses; and (d)
for individual students within the same class the instructor is constant, so that the interpre-
tation of any factors based on individual student responses may be problematic when the
ratio of the number of students to the number of different classes is large. It is defensible
to factor analyze responses from a randomly selected single student from each of a large
number of classes and the sampling of students for the applicability paradigm should ap-
proximate this ideal. Thus it is important that students come from a wide background so
that it is unlikely that many students will evaluate the same class and so that variation in
actual teaching behaviors is substantial. This situation appears to exist for three of the four
studies considered here, but not for the PNG study where all respondents were enrolled
in a second-year mathematics course and evaluated only mathematics instructors. How-
ever, even when this precaution is met, it still provides no control for response biases that
are idiosyncratic to individual responses and these are likely to be particularly large due to
the way students were asked to select criterion instructors. Thus it is likely that instructors
selected as being best/good (worst/poor) will be rated favorably (unfavorably) on all items,
and this will reduce the distinctiveness of the different components of effective teaching.
Factor analyses in these four studies are also plagued by problems inherent in the use of
exploratory factor analysis (see Chapter 2). When the researcher hypothesizes a well-de-
fined factor structure and the results of exploratory factor analyses correspond closely to
the hypothesis, then there is strong support for the hypothesized structure and the con-
struct validity of the ratings. However, when there is not a clear correspondence between
the hypothesized and obtained structure, the interpretation is problematic. Because of the
indeterminacy of exploratory factor analysis, a different empirical solution that fits the
data just as well may exist that more closely corresponds to the hypothesized structure, and
the researcher has no way of determining how well the hypothesized structure actually
does fit the data. Many of these problems can be resolved by the use of confirmatory factor
analysis as described earlier, but the technical difficulties in the use of this approach to fac-
tor analysis and the unavailability of appropriate statistical packages may limit its use in
some of the settings where the applicability paradigm might be most useful.
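As a rough illustration of the exploratory approach used in these studies (not the authors' actual analyses), the following sketch fits an oblique nine-factor solution to simulated item responses, assuming the Python factor_analyzer package is available; a genuinely confirmatory test of the hypothesized structure would instead require a CFA or structural equation modelling package.

    import numpy as np
    import pandas as pd
    from factor_analyzer import FactorAnalyzer

    rng = np.random.default_rng(4)
    n_students, n_items, n_factors = 400, 55, 9

    # Simulated responses with a rough factor structure: each item follows one latent factor.
    latent = rng.normal(size=(n_students, n_factors))
    item_factor = rng.integers(0, n_factors, n_items)   # which factor each item follows
    items = latent[:, item_factor] + rng.normal(scale=0.7, size=(n_students, n_items))
    ratings = pd.DataFrame(items, columns=[f"item_{i + 1}" for i in range(n_items)])

    # Exploratory factor analysis with an oblique (oblimin) rotation, paralleling the attempts
    # to replicate the hypothesized nine SEEQ factors.
    fa = FactorAnalyzer(n_factors=n_factors, rotation="oblimin")
    fa.fit(ratings)

    loadings = pd.DataFrame(fa.loadings_, index=ratings.columns)
    print(loadings.round(2))                  # which items load on which factors
    print(fa.get_factor_variance()[2][-1])    # cumulative proportion of variance explained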
Despite the potential problems in the use of factor analysis described above, most of the
nine SEEQ factors and the seven Endeavor factors were identified in each of the four
studies. In the Spanish study, separate factor analyses of responses to the SEEQ and the
Endeavor items clearly identified all 16 factors that the two instruments were designed to
measure. In the University of Sydney study, only the Examination/Grading factor from
SEEQ was not well-identified, though a similar factor defined by Endeavor items was
identified and when items from the two instruments were combined in a single factor
analysis an Examination factor was defined by responses to items from both instruments.
In the TAFE study, the Examination/Grading factor from SEEQ was again not well-de-
fined, while the Planning/Organization and Presentation Clarity factors from Endeavor
formed a single factor instead of two separate factors. In the TAFE study, like the Univer-
sity of Sydney study, a factor analysis of responses from both instruments produced a well-
defined Examination factor that was defined by both SEEQ and Endeavor items, but the
combined analysis still did not provide support for the separation of the two Endeavor fac-
tors.
Results from the PNG study provided the poorest approximation to the hypothesized
factor structure of the four studies. While there was reasonable support for seven of the
nine SEEQ factors, the Organization/Clarity and Individual Rapport items did
not define separate factors. There was also reasonable support for four of the seven En-
deavor factors, but a large general factor incorporated items from the Presentation Clar-
ity, Organization/Planning, and Personal Attention factors. Clarkson (1984) examined
solutions from factor analyses in which varying numbers of factors were rotated to diffe-
rent degrees of obliqueness, and selected the best solution. However, the nine-factor sol-
ution for SEEQ and the seven-factor solution for Endeavor were not presented, nor was
other useful information such as the eigenvalues for the unrotated factors. Furthermore,
because of the small and limited sample and the associated problems discussed above, the
use of factor analysis in that study may be dubious.
In summary, not all of the SEEQ and Endeavor factors were identified in each of the
four studies, but every hypothesized factor was identified in at least two of the studies. Par-
ticularly given the problems inherent in the exploratory factor analysis in individual stu-
dent responses and the apparent halo effect produced by the selection process in the
applicability paradigm, I interpret these findings to offer reasonable support for the gener-
ality of factors identified in North American settings to a wide variety of educational con-
texts in different countries.

Inappropriate and Most Important Items

Inappropriate items

Items were judged to be ‘inappropriate’ if a student specifically indicated the item to be
inappropriate or failed to respond to the item. In order to examine the appropriateness of
items in the four studies, the number of ‘inappropriate’ responses was divided by the num-
ber of sets of responses (i.e., the number of students times the number of instructors
evaluated by each student since each student evaluated two or three instructors.*)

* Clarkson (1984) summarized his ‘inappropriate’ and ‘most important’ item responses only in terms of percent-
ages instead of actual numbers and percentages as was done in the other studies, and the percentages that he re-
ported were substantially higher than in the other studies. Clarkson, as did the other researchers, asked students
to indicate up to five most important items for each instructor whom they evaluated. If the number of most import-
ant nominations was divided by the total number of sets of responses (i.e., 2 × the number of students, since each
student evaluated two instructors), then the sum of the percentages across all items must be no higher than 500

Results from each of the four studies (see Table 8.1) are similar in that every item is
judged to be appropriate by about 80 percent or more of the students, even though 3 to 7
of the 55 items are judged to be ‘inappropriate’ by more than 10 percent of the students;
the mean number of items judged to be ‘inappropriate’ is 2.4 of the 55 items. The items
most frequently judged to be ‘inappropriate’ come from the Group Interaction, Individual
Rapport, Examination, and Assignment factors from the two instruments. However,
there are also different patterns in the four studies. For example, Group Discussion items
are most frequently seen to be ‘inappropriate’ in the PNG study and much less frequently indi-
cated to be ‘inappropriate’ in the TAFE and Spanish studies; Reading/assignment items
are most frequently seen as ‘inappropriate’ in the TAFE and Spanish studies, but not in the
University of Sydney and PNG studies.
I interpret these findings to mean that nearly all the SEEQ and Endeavor items are seen
to be appropriate by a large majority of the students in all four studies. The mean propor-
tion of items judged to be ‘inappropriate’ was similar in the four studies (means between
0.04 and 0.05) and few items are judged to be ‘inappropriate’ in any of the studies.

Most Important Items

After completing a survey, students were asked to select up to five items that were ‘most
important’ in describing the overall learning environment. In three of the four studies, all
but the PNG study, every item was selected by at least some of the students as being ‘most
important’ (Table 8.1). Across all four studies the most frequently nominated items came
from the Enthusiasm, Learning/value, and Organization/clarity factors. However, there
were again some marked differences in the four studies - and the PNG study was particu-
larly different. PNG students, compared to students in the other studies, were much more
likely to nominate Individual Rapport/Personal Attention items and Workload/Difficulty
items as most important, and were less likely to nominate Learning/value items. For ex-
ample, while at least 10 percent of the students in each of the other three studies nominated
the item ‘learned something valuable’ as most important and no more than 2.6 percent saw
it as ‘inappropriate’, only 2 percent of the PNG students nominated this item as ‘most im-
portant’ and 10 percent judged it to be inappropriate. Again, it is likely that much of the dis-
tinctive nature of the pattern of responses for the PNG students may reflect the limited
sample and might not generalize to a broader sample of PNG students who evaluated a
wider range of courses. In summary, I interpret these results to indicate that items from
each of the factors measured by SEEQ and Endeavor are seen to be important by students
in each of the different studies.

per cent (each student selected up to five items). In fact, the sum of the percentages reported by Clarkson ap-
proaches 1000 per cent, indicating that he must have divided the number of most important question responses
by the number of students rather than the number of sets of responses. Since students were able to mark as many
items as inappropriate as they wanted, I cannot be absolutely sure that Clarkson computed the percentage of not
appropriate item responses in the same way. However, it is inconceivable to me that Clarkson would divide the
number of most important responses by the number of students and the number of inappropriate responses by
twice the number of students and not report this different basis for determining percentages in his study. The
problem occurs when Clarkson directly compares his percentages with those in Marsh (1981a) without first taking
into account the fact that he apparently divided the number of responses by the number of students while Marsh
divided the number of responses by two times the number of students. It should also be noted that this ambiguity
only affects interpretations of results in Table 8.1, since the correlations in Table 8.2 are independent of such a
linear transformation and results in Tables 8.3 and 8.4 do not involve ‘most important’ and ‘not appropriate’ re-
sponses.
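A worked version of the arithmetic in this footnote, under its stated assumptions (N students, two instructors evaluated per student, at most five ‘most important’ nominations per instructor):

    \[
    \text{sets of responses} = 2N, \qquad
    \text{total nominations} \leq 5 \times 2N = 10N,
    \]
    \[
    \sum_{\text{items}} \%(\text{most important})
      = \frac{\text{total nominations}}{\text{sets of responses}} \times 100
      \leq \frac{10N}{2N} \times 100 = 500\%,
    \]

whereas dividing by the number of students (N) rather than by the number of sets of responses (2N) gives an upper bound of (10N/N) x 100 = 1000 per cent, consistent with the percentages that Clarkson reported.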
Table 8.1
Paraphrased Items and the Factors They Were Designed to Represent in Marsh’s (M) SEEQ Instrument and
Frey’s (F) Endeavor Instrument in Four Studies (Reprinted with permission from Marsh, 1986)

[Table 8.1 lists the 55 paraphrased SEEQ and Endeavor items grouped under the factors they were designed to
represent - the nine SEEQ factors (Group Interaction, Learning/Value, Workload/Difficulty, Examinations/
Grading, Individual Rapport, Organization/Clarity, Instructor Enthusiasm, Breadth of Coverage, Assignments/
Readings) plus the two overall rating items, and the seven Endeavor factors (Class Discussion, Student
Accomplishments, Workload, Grading/Examinations, Personal Attention, Presentation Clarity, Organization/
Planning) - and reports, for each item, the percentage of ‘not appropriate’ responses and the percentage of
‘most important’ responses in each of the four studies.]

Note. SEEQ = Students’ Evaluations of Educational Quality; USyd = University of Sydney; Span = Spanish; TAFE = Technical
and Further Education; PNG = Papua New Guinea.

Patterns of Relations in Inappropriate and Most Important Responses

Hayton (1983) suggested that items judged to be most ‘inappropriate’ were those least
likely to be nominated as ‘most important’ in the TAFE study, while Clarkson (1984)
argued that the pattern of ‘inappropriate’ and ‘most important’ responses by the PNG
students differed from that by University of Sydney students. More generally, the relative
similarity and differences in the pattern of results in each of the studies may provide an im-
portant way to better understand the educational contexts in the different settings. For ex-
ample, one question that may be answered by this analysis is whether the items seen to be
‘most important’ in one setting are the same items seen to be important in other settings.
In order to explore further such possibilities, the proportions of ‘inappropriate’ and ‘most
important’ responses were correlated with each other for each country.
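A sketch of this item-level analysis, assuming a pandas DataFrame with one row per item and one column per study for each type of proportion; the column names and values are illustrative of the layout, not the actual Table 8.1 data.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(5)
    studies = ["USyd", "Span", "TAFE", "PNG"]

    # One row per item (55 items); columns hold the proportion of 'inappropriate' and of
    # 'most important' responses that each item received in each study (illustrative values).
    data = {}
    for s in studies:
        data[f"inappropriate_{s}"] = rng.uniform(0, 0.2, 55)
        data[f"most_important_{s}"] = rng.uniform(0, 0.3, 55)
    props = pd.DataFrame(data)

    # Treating items as cases, the correlation matrix shows (a) how consistent the pattern of
    # responses is across studies and (b) whether items judged 'inappropriate' tend not to be
    # nominated as 'most important'.
    print(props.corr().round(2))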
In all four studies the correlation (Table 8.2) between ‘inappropriate’ and ‘most import-
ant’ responses was negative (Mn r = - 0.33), and was statistically significant for all but the
PNG study. When the percentage of ‘inappropriate’ responses for each item was summed
across the four studies, this value correlated - 0.42 with the sum of the ‘most important’
responses. These results provide empirical support for Hayton’s observation that ‘inap-
propriate’ items are less likely to be selected as being ‘most important’.
Correlations between the proportion of ‘most important’ responses given to each item
were positive (Mn r = 0.55) indicating a high degree of correspondence in the patterns of
responses. However, the average correlation for the first three studies (0.73) was substan-
tially higher than between these three studies and the PNG study (Mn r = 0.37). The pat-
tern of responses for the ‘inappropriate’ responses was somewhat less consistent across the
four studies (Mn r = 0.32), but still represented a moderate level of consistency. The aver-
age correlation among the patterns of ‘inappropriate’ responses was again much higher for
the first three studies (Mn r = 0.56) than between each of these three studies and the PNG
study (Mn r = 0.08). The highest degree of consistency occurred between the TAFE and
Spanish studies for both the patterns of ‘most important’ responses (r = 0.79) and ‘inap-
propriate' responses (r = 0.71). These findings tend to support Clarkson's observation that the pattern of responses by PNG students differed from those in the other three studies, even though there was a moderately high level of consistency in the pattern of responses in all four studies.
Table 8.2
Consistency of Patterns of Inappropriate and Most Important Responses in the Four Different Studies (Reprinted with permission from Marsh, 1986)

Variable                       1     2     3     4     5     6     7     8     9    10

 1. Inappropriate (USyd)
 2. Inappropriate (Span)      .50
 3. Inappropriate (TAFE)      .47   .71
 4. Inappropriate (PNG)       .36  -.04  -.08
 5. Most important (USyd)    -.41  -.30  -.41  -.08
 6. Most important (Span)    -.27  -.32  -.42   .10   .72
 7. Most important (TAFE)    -.22  -.25  -.47   .04   .69   .79
 8. Most important (PNG)     -.26  -.20  -.32  -.13   .28   .36   .46
 9. Inappropriate total       .84   .78   .71   .45  -.42  -.31  -.30  -.31
10. Most important total     -.37  -.33  -.49  -.04   .84   .86   .88   .67  -.42

M of proportions             .030  .048  .037  .053  .087  .079  .075  .082  .044  .081
SD of proportions            .041  .047  .030  .040  .076  .057  .046  .072  .028  .050

Note. For the purpose of this analysis, each of the 55 items was considered to be a "case," and the 10 "variables" were the proportion
of "inappropriate" or "most important" responses for that item in the different studies, as indicated by the row labels. (Data for this
analysis are in Table 8.1.) For example, the correlation of .79 between Variables 6 and 7 indicates that an item judged to be most
important in the Spanish study was also likely to be judged as most important in the TAFE study, whereas the correlation of -.41
between Variables 1 and 5 indicates that an item judged to be inappropriate in the University of Sydney study was less likely to be seen
as most important in that study. Statistically significant correlations are those greater than .26 (p < .05, df = 53, two-tailed) and .36
(p < .01, df = 53, two-tailed). USyd = University of Sydney; Span = Spanish; TAFE = Technical and Further Education; PNG =
Papua New Guinea.
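The significance thresholds given in the note can be checked against the standard relation between a Pearson correlation and Student's t (a general result, not a derivation specific to these data):

    t = \frac{r\sqrt{N-2}}{\sqrt{1-r^{2}}}, \qquad
    r_{\mathrm{crit}} = \frac{t_{\mathrm{crit}}}{\sqrt{t_{\mathrm{crit}}^{2} + df}}, \qquad
    df = N - 2 = 53,

where N = 55 is the number of item "cases" and t_crit is the two-tailed critical value of t for the chosen significance level.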

While suggestive, the interpretation of these correlations must be made cautiously
because of the differences in the methodologies used in the four studies. The PNG study
differed most drastically from the other three - in terms of the sample of students and
limitations in the choice of criterion instructors - and this is the only study where the pat-
tern of responses to ‘inappropriate’ and ‘most important’ items was not highly consistent.
The TAFE and Spanish studies both asked each student to select three criterion instructors
instead of two, used a nine-point response scale instead of a five-point scale, and considered
a slightly different set of items than the other two studies (though all four contained the 55
items from SEEQ and Endeavor and considered approximately the same total number of
items). Thus, the methodological similarities in these two studies might explain why the
patterns of ‘inappropriate’ and ‘most important’ responses were somewhat more consis-
tent for these two studies than between either of them and the University of Sydney study.
I interpret these results to demonstrate a surprising consistency in the pattern of ‘inapprop-
riate’ and ‘most important’ responses for three of the four studies, and I suggest that the
less consistent results in the PNG study probably reflect methodological differences in that
study rather than differences in the PNG context. I find the consistency in the first three
studies surprising because the three settings seem to vary so much that such a high level of
correspondence was not expected.

Multitrait-Multimethod Analyses

The SEEQ and Endeavor forms were independently designed, and do not even measure
the same number of components of effective teaching. Nevertheless, a content analysis of
the items and factors from each instrument suggests that there is considerable overlap
(Marsh, 1981a). There appears to be a one-to-one correspondence between the first five
SEEQ factors and the first five Endeavor factors that appear in Table 8.1, while the Or-
ganization/Clarity factor from SEEQ seems to combine the Organization/Planning and
Clarity factors from Endeavor. The remaining three SEEQ factors - Instructor En-
thusiasm, Breadth of Coverage, and Assignments - do not appear to parallel any factors
from Endeavor. This hypothesized structure of correspondence between SEEQ and En-
deavor factors is the basis of the MTMM analyses described below.
The set of correlations between scores representing the 9 SEEQ factors and the 7 En-
deavor factors is somewhat analogous to a MTMM matrix in which the different factors
correspond to the multiple traits and the different instruments are the different methods.
Convergent validity refers to the correlation between responses to SEEQ and Endeavor
factors that are hypothesized to be matching as described above, and these appear in boxes
in Table 8.3. Discriminant validity refers to the distinctiveness of the different factors; its
demonstration requires that the highest correlations appear between SEEQ and Endeavor
factors that are designed to measure the same components of effective teaching, and that
other correlations are smaller. Adapting the MTMM criteria developed by Campbell and
Fiske (1959), Marsh (1981a) proposed that:
(1) Agreement on SEEQ and Endeavor factors hypothesized to be matching, the con-
vergent validities, should be substantial (a criterion of convergent validity).
(2) Each convergent validity should be higher than other nonconvergent validities in the

same row and column of the 9 X 7 rectangular submatrix of correlations between SEEQ
and Endeavor factors. This criterion of discriminant validity requires that each convergent
validity is compared with either 13 or 14 other correlations (convergent validities were not
compared to other convergent validities when they appear in the same row or column of
this submatrix).
(3) Each convergent validity should be higher than correlations between that factor and
other SEEQ factors, and between that factor and other Endeavor factors, in the two trian-
gular submatrices. This criterion of divergent validity requires that each convergent valid-
ity is compared to 14 other correlations.
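These criteria lend themselves to a mechanical check. The following minimal sketch, assuming Python with NumPy, a full 16 x 16 correlation matrix in which the 9 SEEQ factors are indexed before the 7 Endeavor factors, and an illustrative list of hypothesized factor pairings (none of this is the code or the exact indexing used in the original analyses), indicates how Criteria 2 and 3 can be applied:

    import numpy as np

    # R is the full correlation matrix among the 16 factor (or scale) scores;
    # the identity matrix below is only a placeholder -- in practice R would
    # hold the observed correlations.
    R = np.eye(16)
    seeq = range(0, 9)        # indices of the 9 SEEQ factors (assumed ordering)
    endeavor = range(9, 16)   # indices of the 7 Endeavor factors

    # Hypothesized SEEQ-Endeavor pairings; purely illustrative index pairs.
    matches = [(0, 9), (1, 10), (2, 11), (3, 12), (4, 13), (5, 14), (5, 15)]

    for s, e in matches:
        cv = R[s, e]  # convergent validity (Criterion 1: should be substantial)

        # Criterion 2: cv should exceed the other correlations in its row and
        # column of the 9 x 7 rectangular submatrix, excluding other convergent
        # validities that fall in the same row or column.
        rect = [R[s, j] for j in endeavor if j != e and (s, j) not in matches]
        rect += [R[i, e] for i in seeq if i != s and (i, e) not in matches]
        ok2 = all(cv > c for c in rect)

        # Criterion 3: cv should exceed the correlations between each matched
        # factor and the other factors from the same instrument (the two
        # triangular submatrices).
        tri = [R[s, i] for i in seeq if i != s] + [R[e, j] for j in endeavor if j != e]
        ok3 = all(cv > c for c in tri)

        print(f"SEEQ {s} - Endeavor {e}: r = {cv:.2f}, criterion 2: {ok2}, criterion 3: {ok3}")

With seven hypothesized pairings, each convergent validity is compared with 13 or 14 correlations under Criterion 2 and with 14 under Criterion 3, matching the comparison counts described above.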
Each of the four studies presented MTMM matrices, but there are a number of problems
with these previous analyses. First, none of the studies rigorously applied the Campbell/Fiske guidelines; support (or, in the case of the PNG study, the lack of support) was instead argued by inspection. Second, empirically derived factor scores were used to determine the MTMM
matrices in the University of Sydney and Spanish studies, but scale scores consisting of an
unweighted average of responses to items designed to measure each factor were used in the
other two studies. Both Hayton (1983) and Clarkson (1984) argued that factor scores were
inappropriate since the SEEQ and Endeavor factors were not clearly identified in their
studies. In the Spanish study, where all the factors were identified, it is certainly justifiable
to use empirically derived factor scores. However, Marsh’s (1981a) use of factor scores is
somewhat problematic since the Examination factor for SEEQ was not well-defined.
Furthermore, even if the use of factor scores is justified, results based on factor scores may
not be directly comparable to those based on scale scores. Since the raw data for two
studies that used factor scores, the University of Sydney and Spanish studies, were avail-
able to the author, the MTMM analysis was redone with both the scale scores and factor
scores for these two studies.
Six MTMM matrices appear in Table 8.3 and the results of a formal application of the
Campbell/Fiske criteria as presented above are summarized in Table 8.4. For the Univer-
sity of Sydney and the Spanish studies, MTMM matrices and analyses are considered sepa-
rately for factor scores and scale scores, while only the MTMM matrices based on scale
scores (based on values from the original studies) are available for the other two studies.
In the six analyses, every convergent validity is substantial and statistically significant;
means of the convergent validities vary from 0.72 to 0.87 while the medians vary from 0.81
to 0.89. For both the Spanish study and particularly for the University of Sydney study, the
convergent validities for scale scores are higher than for the factor scores. The substantial difference between the two sets of scores for the University of Sydney study is attributable
primarily to correlations between the Examination (SEEQ) and Grading (Endeavor) fac-
tors; the two factor scores representing this component correlated 0.34 with each other
while the two scale scores correlated 0.72. It is also important to note that the size of the
mean convergent validity in each study is only modestly smaller than the mean of the coef-
ficient alpha estimates for the same factors (mean alphas vary from 0.82 to 0.89). These
findings clearly satisfy the first Campbell/Fiske criterion for each of the different studies
and indicate that the sizes of the validity coefficients approach a logical upper-bound im-
posed by the reliability of the scores.
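As a point of reference, coefficient alpha for a factor defined by k items is the standard internal-consistency estimate (a general formula, not one specific to SEEQ or Endeavor):

    \alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_{i}^{2}}{\sigma_{X}^{2}}\right),

where \sigma_{i}^{2} is the variance of item i and \sigma_{X}^{2} is the variance of the unweighted sum of the k items. Because the correlation between two fallible measures is bounded by \sqrt{r_{xx}r_{yy}}, the geometric mean of their reliabilities, convergent validities cannot be expected to exceed the corresponding alphas by much.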
For the second Campbell/Fiske criterion, each of the convergent validities is compared
to other correlations in the rectangular submatrix. The means of the comparison correlations are substantial, but substantially smaller than the size of the corresponding means for the convergent validities. For the University of Sydney and Spanish studies, the mean of the comparison correlations is substantially higher for scale scores than for factor scores.

Table 8.3
Six MTMM Matrices of Correlations Among SEEQ and Endeavor Factors

Factors 1 2 3 4 5 6 7 8 9

SEEQ Factors

1 1;:;
(84) (86)
2 26 61 (92)
39 63 (92)
66 57 (89) (90)
3 -05 01 06 07 (91)
04 04 08 15 (79)
09 07 25 -12 (86) (70)
4 33 56 46 63 20 01 (81)
42 64 50 70 13 05 (85)
65 59 74 63 21 10 (73)(88)
5 54 81 31 62 -03 02 32 58 (93)
68 80 43 69 -05 01 39 70 (90)
72 69 67 77 08 10 67 68 (87)(85)
6 24 64 52 81 -15 -01 48 66 33 67 (93)
38 60 50 83 02 15 46 70 43 65 (91)
68 70 79 78 21 -03 74 65 65 79 (83)(90)
7 39 71 55 81 -04 -02 52 62 47 74 60 88 (95)
47 69 65 87 22 21 40 69 43 73 64 85 (92)
73 61 82 80 19 -01 75 64 76 80 82 85 (90)(94)
8 42 67 39 68 -01 06 46 60 40 64 47 73 49 73 (88)
62 69 55 70 12 12 52 67 55 66 45 67 57 73 (89)
77 56 73 65 17 14 70 58 74 66 75 67 80 71 (82)(89)
9 22 34 37 51 07 14 39 55 18 36 35 42 37 39 33 46 (84)
32 54 25 65 -05 12 26 60 29 55 18 56 24 61 36 61 (84)
09 47 68 52 27 19 63 56 49 45 21 53 55 54 55 54 (64)(90)
Endeavor Factors
10 58- :l‘ 29 53 -03 -01 33 51 57 73 20 51 45 63 39 58 22 33
’ 94’
;88
93 72; 54
37 45
64 04
03 08
05 54
38 45
64 67
69 55
81 55
38 53
61 63
43 53
69 70
59 55
69 43
29 55
35
11 3379 ?30- :? -10 -05 56 62 37 60 70 81 63 77 49 67 39 48
46 58 ‘86 91’ 11 12 52 66 51 63 55 76 66 78 60 65 31 61
60 59 ‘28 83_1 21 -13 71 61 64 71 73 77 74 78 68 58 67 52
12 05 14 1Q lo T;5- 141 32 16 02 12 -02 13 06 11 05 14 20 26
15 17 13 24 ‘82 79’ 20 18 09 18 12 25 25 30 20 23 06 24
27 38 45 21 ‘29 44; 41 28 27 29 38 16 37 08 35 31 48 26
13 28 50 39 59 -04 ?I2 r34 -7? 39 55 43 58 31 53 36 53 50 43
46 55 48 63 01 05 ‘80 84’ 49 60 57 66 41 62 52 59 28 57
58 63 67 67 18 06 k80 67; 61 64 66 64 64 58 61 49 61 57
14 63 78 40 62 -05 -02 43% ‘8i89l 41 68 56 72 57 65 29 35
75 81 37 67 08 08 44 68 ‘81 91’ 40 63 48 70 56 65 32 53
75 67 71 74 16 15 73 69 >5 87; 69 77 76 79 72 74 53 60
15 23 63 47 80 -13 -05 55 66 31’6 r8: 931 71 88 49 73 32 41
47 63 69 84 08 17 51 73 48 67 ‘79 88’79 86 63 72 30 60
68 73 81 76 20 -04 79 68 67 77 26_ 88 ’ 83 85 78 73 60 50
V--i
16 26 47 58 65 06 09 51 48 35 51 ‘68 76’ 59 68 56 6-I 39 40
48 58 59 71 18 24 60 69 48 61 171 82’ 57 72 59 65 24 55
66 63 72 60 27 00 73 52 63 68 ’---81 821 73 71 72 52 60 44

Table 8.3
Six MTMM Matrices of Correlations Among SEEQ and Endeavor Factors (Continued)

Factors 10 11 12 13 14 15 16

Endeavor Factors
10
‘(“y:;
(86) (71)
11 29 49
44 59
50 40 (78) (91)
12 05 10 03 08 (94)
13 19 20 22 (91)
17 37 44 14 (73) (86)
13 25 47 35 57 06 13 W)
42 56 50 60 13 17 (94)
46 48 69 56 36 23 (84) (88)
14 60 72 44 60 04 11 40 57 WI
72 82 46 60 18 21 45 60 (91)
70 60 67 65 36 28 63 67 (82) (85)
15 23 53 60 81 00 09 31 54 43 65 (92)
41 64 71 76 14 27 51 66 51 67 (89)
58 59 74 76 38 21 68 63 71 79 (86)(88)
16 21 36 60 65 16 20 41 49 43 53 67 72 (85)
46 60 58 65 28 32 6066 45 62 67 74 (78)
56 57 66 64 15 15 67 51 68 60 SO 71 (76)(64)

For each cell of the MTMM matrix there are six correlations representing results from study 1 (factor scores
- upper-left), study 1 (scale scores - upper-right), study 2 (factor scores - middle-left), study 2 (scale scores
- middle-right), study 3 (scale scores - lower-left), and study 4 (scale scores - lower-right). All coefficients are
presented without decimal points. The values in parentheses are coefficient alpha estimates of reliability for the
four studies, and the values in boxes are the convergent validity coefficients. See Table 2.1 for the factors and the
items used to define each.

In each of the six analyses a total of 96 comparisons were made, and the Campbell/Fiske
criterion was satisfied for all but 13 of these 576 comparisons. Ten of the 13 rejections came
in the analysis based on factor scores from the University of Sydney study, and most of
these involved the Examination factor from the SEEQ instrument that was not well de-
fined by the factor analysis. For the corresponding analysis based on scale scores from the
same study, there was only one rejection. The results of each of the four studies provide
strong support for this Campbell/Fiske criterion.
For the third Campbell/Fiske criterion, each of the convergent validities is compared to
other correlations in the two triangular submatrices. The means of these comparison cor-
relations are again substantial, but substantially smaller than the corresponding means for
the convergent validities. Again the mean correlation is substantially higher for scale scores
than for factor scores. In each of the six analyses a total of 98 comparisons were made, and
the Campbell/Fiske criterion was satisfied for all but 15.5 of these 588 comparisons. Again,
a majority of the rejections came from the analysis of factor scores from the University of
Sydney study, and most of these involved the Examination factor. For the corresponding
analysis based on scale scores from the same study, there were only two rejections. The re-
sults of each of the four studies provide strong support for this Campbell/Fiske criterion.

Table 8.4
Summaries of Three Campbell-Fiske Criteria for the Analysis of Six MTMM Matrices (Reprinted with permission
from Marsh, 1986)

                                          Study 1    Study 1    Study 2    Study 2    Study 3    Study 4
                                          (factor)   (scale)    (factor)   (scale)    (scale)    (scale)

Criterion 1

Convergent validities
  M                                        .726       .833       .814       .869       .824       .721
  Mdn                                      .810       .890       .810       .882       .850       .820
  Proportion statistically
    significant (out of 7)                1.000      1.000      1.000      1.000      1.000      1.000

Criterion 2

Comparison coefficients
  M                                        .335       .468        —         .541       .573       .509
  Mdn                                       —         .511       .450       .616       .620       .580
  Proportion of successful
    comparisons (out of 96)                .896       .990       .995      1.000      1.000       .995

Criterion 3

Comparison coefficients
  M for SEEQ factors                       .305       .487       .358       .554       .579       .512
  M for Endeavor factors                   .313       .448       .426       .530       .567       .501
  M for SEEQ and Endeavor factors          .307       .473       .383       .545       .574       .508
  Mdn for SEEQ factors                     .380       .605       .390       .647       .665       .580
  Mdn for Endeavor factors                 .310       .531       .450       .605       .580       .570
  Mdn for SEEQ and Endeavor factors        .370       .565       .430       .640       .650       .580
Proportion of successful comparisons
  SEEQ factors (out of 56)                 .911       .964      1.000       .964       .982       .964
  Endeavor factors (out of 42)             .929       .976      1.000      1.000      1.000       .988
  SEEQ and Endeavor factors (out of 98)    .918       .980      1.000       .980       .990       .975

Coefficient alpha reliability estimates*

  M for SEEQ factors                       .901       .901       .884       .884       .820       .869
  M for Endeavor factors                   .887       .887       .889       .889       .826       .801
  M for SEEQ and Endeavor factors          .895       .895       .886       .886       .822       .839
  Mdn for SEEQ factors                     .920       .920       .900       .900       .840       .890
  Mdn for Endeavor factors                 .900       .900       .910       .910       .840       .850
  Mdn for SEEQ and Endeavor factors        .905       .905       .905       .905       .840       .880

* Coefficient alpha estimates were the same when analyses were conducted on factor scores and unweighted scale scores from the
same study.
Note. Comparisons conducted to test Criteria 2 and 3 were counted as half a success and half a failure when a convergent validity was
equal to a comparison coefficient. SEEQ = Students' Evaluations of Educational Quality. Dashes indicate values that could not be
recovered.

These analyses resolve several issues. The MTMM analyses performed on factor scores
apparently are not directly comparable to those performed on scale scores. In the Spanish
study where all the factors were well defined, the use of empirically derived factor scores
instead of scale scores resulted in slightly lower convergent validities and substantially
lower comparison correlations. Since the goal of MTMM studies, and many applications
of students’ evaluations, is to maximally differentiate among different components of ef-
fective teaching, the use of factor scores seems preferable. In the University of Sydney
study, however, even though only one factor was not well-identified, the use of factor
scores resulted in weaker support for the discriminant validity of the ratings. The applica-
tion of the Campbell/Fiske guidelines resulted in a total of 18 rejections based on factor
scores, but only 3 rejections for the scale scores. These findings support the decision by
Hayton and by Clarkson to use scale scores in their MTMM analyses.
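The difference between the two kinds of scores is easy to make concrete. A minimal sketch, assuming Python with NumPy and hypothetical item responses and factor score coefficients (the actual coefficients would come from the factor analyses reported in the studies), is:

    import numpy as np

    # Hypothetical data: responses of n students to k items assumed to define
    # one factor (shapes and values are illustrative, not taken from the studies).
    rng = np.random.default_rng(1)
    n_students, k_items = 200, 4
    items = rng.integers(1, 6, size=(n_students, k_items)).astype(float)

    # Unweighted scale score: the simple average of the items defining the
    # factor, as used in the TAFE and PNG analyses.
    scale_score = items.mean(axis=1)

    # Factor score: a weighted combination of the standardized items, with
    # weights (factor score coefficients) estimated by a factor analysis.
    # The coefficients below are placeholders standing in for estimated values.
    factor_score_coefficients = np.array([0.35, 0.30, 0.25, 0.10])
    z_items = (items - items.mean(axis=0)) / items.std(axis=0)
    factor_score = z_items @ factor_score_coefficients

    # The two summaries are typically highly, but not perfectly, correlated.
    print(np.corrcoef(scale_score, factor_score)[0, 1])

Because factor scores weight items unequally (and can draw on items from other factors), they tend to sharpen the distinctions among components, which is consistent with the lower comparison correlations observed for factor scores above.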

Implications and Future Research

Items from two North American instruments designed to measure students’ evaluations

of teaching effectiveness were administered to university students in different countries. In each of the four studies most of the items were judged to be appropriate and chosen by at least a few students as being 'most important', and all but the Workload/Difficulty items clearly differentiated between lecturers whom students had identified as good and poor teachers. There
was a surprising consistency to the items indicated to be ‘most important’ and to be ‘inap-
propriate’ across the four studies, though to a lesser extent for the PNG study. Factor
analyses clearly identified most of the factors that the instruments were designed to mea-
sure in each of these studies and which have been identified in North American research.
Finally, SEEQ and Endeavor factors that were hypothesized to measure similar dimen-
sions of effective teaching were found to be substantially correlated, while correlations be-
tween nonmatching factors were substantially smaller. Particularly, the results of the fac-
tor analyses and MTMM analyses suggest that students from some different countries do
differentiate among different components of teaching effectiveness, that the specific com-
ponents do generalize across different nationalities, and that students differentiate among
dimensions of effective teaching in a similar manner when responding to SEEQ and En-
deavor. Across the four studies the findings support the applicability and construct validity
of SEEQ and Endeavor when administered to university students in at least these countries.*
Despite the strong evidence for the separation of the various dimensions of effective
teaching, there were substantial correlations among many of the factors in all four studies.
However, several points are relevant to interpreting these high correlations. First, these
correlations were substantially lower than the reliabilities of the factors and convergent
validities observed in the MTMM analysis. Second, these correlations are based on re-
sponse by individual students where halo/method effects are likely to have a relatively
large impact. Students’ evaluations are typically summarized by the average response by
all the students in a given course and halo effects specific to particular students are likely
to cancel each other out. Third, by specifically asking students to select ‘good’, ‘average’,
and ‘poor’ teachers, the ratings are likely to be stereotypic and biased against differentia-
tion among dimensions. Fourth, some of the differentiation among components may be
lost when students are asked to make retrospective ratings of former lecturers rather than
to evaluate current lecturers. Finally, the use of scale scores instead of factor scores, and
the inclusion of the overall rating items as part of the scale scores exacerbate this problem.
Implicit in the discussion by the authors of each of the four studies is the suggestion that
‘North American research’ somehow provides a well-defined basis of comparison for the
studies reviewed here, but this assumption is clearly not justified. North American re-
search does provide strong support for SEEQ and Endeavor as used in research described
by the authors of the two instruments, but not for their use in the present paradigm. To my
knowledge the applicability paradigm has never been conducted in a North American set-
ting, and so there is no basis for comparing the results described here with those from
North American studies. I predict that if the applicability paradigm were used at North-
western University where Endeavor was developed, or at the University of Southern


California or UCLA where SEEQ was developed, or at other comparable North American universities, the results would be similar to those reported here. In particular, due partly to the design of the applicability paradigm, the correlations between scale scores used to represent the different factors would be substantial, thus making the differentiation among the factors more difficult.

* The applicability paradigm was used in subsequent research conducted in New Zealand (Watkins et al., 1987) and the results further support the generalizability of the findings described here. In particular, the factor structure based on SEEQ and Endeavor responses was well defined, convergent validities in the MTMM analysis (0.73 to 0.94) were high, and every one of the 196 comparisons used to test discriminant validity was satisfied. The authors concluded that: "This research has provided strong support for the applicability of these American instruments for evaluating effective teaching at a New Zealand university."
An important and provocative question raised by these findings is why students' evaluations are so widely employed at North American universities but not at universities in other countries. The conclusions summarized here suggest that teaching effectiveness can
be measured by students’ evaluations in different countries and that perhaps other findings
from research conducted in North America may generalize as well, so this is not the rea-
son. A more likely explanation is the political climate in North American universities.
While the study of students' evaluations has a long history in the United States, it was
only in the late 1960s and 1970s that they became widely used. During this period there was
a marked increase in student involvement in university policy making and also an in-
creased emphasis on ‘accountability’ in North American universities. While similar condi-
tions may have existed in universities in at least some other countries, they did not result
in the systematic collection of students’ evaluations of teaching effectiveness.
While the impetus for the increased use of students’ evaluations of teaching effective-
ness in North American universities may have been the political climate, subsequent re-
search has shown them to be reliable, valid, relatively free from bias, and useful to stu-
dents, lecturers, and administrators. Future research in the use of students’ evaluations in
different countries needs to take three directions. First, in order to test the generality of
the conclusions in this article, the paradigm described here should be replicated in other
countries and particularly in non-Western, developing countries. Second, the validity of
the students’ evaluations must be tested against a wide variety of indicators of effective
teaching in different countries as has been done in North American research. Third,
perhaps employing the instruments used in this study, there is a need to examine and docu-
ment the problems inherent in the actual implementation of broad, institutionally-based
programs of students’ evaluations of teaching effectiveness in different countries.

Recommendations for Future Research with the Applicability Paradigm

I offer several suggestions about how this paradigm can be improved in future research.
(1) The sample of students selected for the study should represent a broad cross-section
of students at an institution. The ratio of the number of students to the number of different
classes should be kept as small as possible, and presented as part of the results of the
studies.
(2) When feasible, confirmatory factor analyses should be used instead of exploratory
factor analyses in the investigation of the structure underlying responses to SEEQ and to
Endeavor items - particularly when exploratory factor analyses apparently do not pro-
vide support for the hypothesized factors.
(3) Since this paradigm is designed to be exploratory, researchers should consider items
in addition to those contained in the SEEQ and Endeavor instruments. Some additional
items are presented in the University of Sydney and Spanish studies, but these are de-
signed to supplement SEEQ and Endeavor factors rather than to identify additional com-
ponents of effective teaching that might be uniquely appropriate to students in a particular

setting. However, if the intention of a study is to compare the pattern of results obtained
elsewhere with those described here, then the items should be kept constant.
(4) The MTMM analyses described here should be conducted with both scale scores and
factor scores. If the factor structure is not well defined, then the use of factor scores may
be problematic as appears to be the case in the University of Sydney study. Alternatively,
the factor score coefficients based on the Spanish study can be obtained from me, and
these can be used to compute factor scores in future studies. While the use of factor score
coefficients derived from another study may be problematic if a comparable factor struc-
ture cannot be demonstrated, the same criticism may be relevant to the computation of un-
weighted scale scores. Furthermore, the use of these factor scores and scale scores in the
MTMM analysis provides a test of their utility. If, as I predict, the use of these factor scores
provides better support for discriminant validity than do the scale scores, then the use of
the factor scores is preferable. Also, the use of the same set of factor score coefficients will
provide further standardization of results across different studies.
(5) Requesting students to select a good, an average and a poor teacher is probably pre-
ferable to selecting a best and worst teacher, though the halo effect appears to be substan-
tial in both cases. Perhaps it would be even better just to ask students to select three in-
structors without specifying that they are good or poor or average. If this procedure re-
duces the size of the halo effect then it might offset the loss of opportunity to perform the
discrimination analysis. This suggestion is not offered as a recommendation, but rather as
a possibility worthy of further research.
Important weaknesses in this paradigm have been identified, even though some of these
may be overcome by the recommendations presented above. Thus it is important to
evaluate whether the paradigm is worth pursuing. The applicability paradigm is only in-
tended to serve as a first step in studying the generalizability of North American research
to other countries, or perhaps to nontraditional settings in North America, and it should
be evaluated within this context. The data generated by this paradigm seem to be useful
for testing the applicability of the North American instruments and for refining an instru-
ment that may be more suitable to a particular setting; it is clearly preferable to adopting, untested, an instrument that has been validated only in a very different setting. The paradigm is
also cost-effective and practical in that: it requires only a modest amount of effort for data
collection and data entry; it can be conducted with volunteer subjects; it does not require
the identification of either the student completing the forms or the instructor being
evaluated so that it is likely to be politically acceptable in most settings; and it is ideally
suited to being conducted by students in a research seminar (which was the basis of the
University of Sydney study). Furthermore, it may serve as an initial motivation to pursue
further research and the eventual utilization of students’ evaluations of teaching effective-
ness. Alternative approaches to studying the applicability of student ratings will require
researchers to administer forms to all the students in a sufficiently broad cross-section of
classes so that class-average responses can be used in subsequent analyses (i.e., at least 100
classes based on responses by several thousand students). While such a large-scale effort would
be useful, there will be many situations in which it may not be feasible and the applicability
paradigm may still provide a useful pilot study that precedes the larger study even when
such a large-scale study is possible.
The focus of research summarized here has been on the similarity of the results from di-
verse academic settings in order to support the applicability of SEEQ and Endeavor in
these different settings. However, the comparison of patterns of ‘inappropriate’ and ‘most

important’ item responses may also provide an important basis for inferring how the learn-
ing environments in diverse settings differ. Even in the four studies that have been
conducted, the paradigm has been heuristic in that researchers have speculated about the
unique characteristics of students in each particular setting to account for some of the find-
ings, even though there appears to be a surprising consistency in the responses. The valid-
ity of such speculations is likely to improve once methodological problems have been re-
solved and there is a large enough data base with which to compare the findings of new
studies.
CHAPTER 9

OVERVIEW, SUMMARY AND IMPLICATIONS

Research described in this article demonstrates that student ratings are clearly mul-
tidimensional, quite reliable, reasonably valid, relatively uncontaminated by many vari-
ables often seen as sources of potential bias, and are seen to be useful by students, faculty,
and administrators. However, the same findings also demonstrate that student ratings may
have some halo effect, have at least some unreliability, have only modest agreement with
some criteria of effective teaching, are probably affected by some potential sources of bias,
and are viewed with some skepticism by faculty as a basis for personnel decisions. It should
be noted that this level of uncertainty probably also exists in every area of applied psychol-
ogy and for all personnel evaluation systems. Nevertheless, the reported results clearly de-
monstrate that a considerable amount of useful information can be obtained from student
ratings; useful for feedback to faculty, useful for personnel decisions, useful to students in
the selection of courses, and useful for the study of teaching. Probably, students’ evalua-
tions of teaching effectiveness are the most thoroughly studied of all forms of personnel
evaluation, and one of the best in terms of being supported by empirical research.
Despite the generally supportive research findings, student ratings should be used cauti-
ously, and there should be other forms of systematic input about teaching effectiveness,
particularly when they are used for tenure/promotion decisions. However, while there is
good evidence to support the use of students’ evaluations as one indicator of effective
teaching, there are few other indicators of teaching effectiveness whose use is systemati-
cally supported by research findings. Based upon the research reviewed here, other alter-
natives which may be valid include the ratings of previous students and instructor self-
evaluations, but each of these has problems of its own. Alumni surveys typically have very
low response rates and are still basically student ratings. Faculty self-evaluations may be
valid for some purposes, but probably not when tenure/promotion decisions are to be
based upon them. (Faculty should, however, be encouraged to have a systematic voice in
the interpretation of their student ratings.) Consequently, while extensive lists of alterna-
tive indicators of effective teaching are proposed (e.g., Centra, 1979), few are supported
by systematic research, and none are as clearly supported as students' evaluations of teach-
ing.
Why then, if student ratings are reasonably well supported by research findings, are they
so controversial and so widely criticized? Several suggestions are obvious. University
faculty have little or no formal training in teaching, yet find themselves in a position where
their salary or even their job may depend upon their classroom teaching skills. Any proce-

dure used to evaluate teaching effectiveness would prove to be threatening and therefore
criticized. The threat is exacerbated by the realization that there are no clearly defined
criteria of effective teaching, particularly when there continues to be considerable debate
about the validity of student ratings. Interestingly, measures of research productivity, the
other major determinant of instructor effectiveness, are not nearly so highly criticized, de-
spite the fact that the actual information used to represent them in tenure decisions is often
quite subjective and there are serious problems with the interpretation of the objective
measures of research productivity that are used. As demonstrated in this overview, much
of the debate is based upon ill-founded fears about student ratings, but the fears still per-
sist. Indeed, the popularity of two of the more widely employed paradigms in student
evaluation research, the multisection validity study and the Dr Fox study, apparently
stems from an initial notoriety produced by claims to have demonstrated that student rat-
ings are invalid. This occurred even though the two original studies (the Rodin & Rodin
1972 study, and the Naftulin et al., 1973 study) were so fraught with methodological weak-
nesses as to be uninterpretable. Perhaps this should not be so surprising in the academic
profession where faculty are better trained to find counter explanations for a wide variety
of phenomena than to teach. Indeed, this state of affairs has resulted in a worthwhile and
healthy scrutiny of student ratings and has generated a considerable base of research upon
which to form opinions about their worth. However, the bulk of research has supported
their continued use as well as advocating further scrutiny.
REFERENCES
Abrami, P. C. (1985). Dimensions of effective college instruction. Review of Higher Education, 1, 211-228.
Abrami, P. C., Leventhal, L., & Perry, R. P. (1979). Can feedback from student ratings help to improve
teaching? Proceedings of the 5th International Conference on Improving University Teaching. London.
Abrami, P. C., Dickens, W. J., Perry, R. P., & Leventhal, L. (1980). Do teacher standards for assigning grades
affect student evaluations of instruction? Journal of Educational Psychology, 72, 107-118.
Abrami, P. C., Leventhal, L., & Dickens, W. J. (1981). Multidimensionality of student ratings of instruction.
Instructional Evaluation, 6(l), 12-17.
Abrami, P. C., Leventhal, L., & Perry, R. P. (1982a). Educational seduction. Review of Educational Research,
52, 446-464.
Abrami, P. C., Perry, R. P., & Leventhal, L. (1982b). The relationship between student personality
characteristics, teacher ratings, and student achievement. Journal of Educational Psychology, 74.111-125.
Aleamoni, L. M. (1978). The usefulness of student evaluations in improving college teaching. Instructional
Science, 7, 95-105.
Aleamoni, L. M. (1981). Student ratings of instruction. In J. Millman (Ed.), Handbook of Teacher Evaluation.
(pp. 110-145). Beverley Hills, CA: Sage.
Aleamoni, L. M. (1985). Peer evaluation of instructors and instruction. Instructional Evaluation, 8.
Aleamoni, L. M., & Hexner, P. Z. (1980). A review of the research on student evaluation and a report on the
effect of different sets of instructions on student course and instructor evaluation. Instructional Science, 9,
67-84.
Aleamoni, L. M., & Yimer, M. (1973). An investigation of the relationship between colleague rating, student
rating, research productivity, and academic rank in rating instructional effectiveness. Journal of Educational
Psychology, 64,274-277.
American Psychological Association. (1985). The standards for educational and psychological testing.
Washington, D. C.: Author.
Aubrecht, J. D. (1981). Reliability, validity and generalizability of student ratings of instruction (IDEA Paper
No. 6). Kansas State University: Center for Faculty Evaluations and Development. (ERIC Document
Reproduction Service No. ED 213 296.)
Barr, A. S. (1948). The measurement and prediction of teaching efficiency: A summary of investigations. Journal
of Experimental Education, 16, 203-283.
Bausell, R. B., & Bausell, C. R. (1979). Student ratings and various instructional variables from a within-
instructor perspective. Research in Higher Education, 11, 167-177.
Benton, S. E. (1982). Rating College Teaching: Criterion Validity Studies of Student Evaluation of Instruction
Instruments. (ERIC ED 221 147.)
Biglan, A. (1973). The characteristics of subject matter in different academic areas. Journal of Applied
Psychology, 57,195-203.
Blackburn, R. T. (1974). The meaning of work in academia. In J. I. Doi (Ed.), Assessing faculty effort. (A special
issue of New Directions for Institutional Research.) San Francisco: Jossey-Bass.
Blackburn, R. T., & Clark, M. J. (1975). An assessment of faculty performance: Some correlations between
administrators, colleagues, student, and self-ratings. Sociology of Education, 48,242-256.
Brandenburg, D. C., Slinde, J. A., & Batista, E. E. (1977). Student ratings of instruction: Validity and
normative interpretations. Journal of Research in Higher Education, 7, 67-78.
Brandenburg, G. C., & Remmers, H. H. (1927). A rating scale for instructors. Educational Administration and
Supervision, 13, 399-406.
Braskamp, L. A., Caulley, D., & Costin, F. (1979). Student ratings and instructor self-ratings and their
relationship to student achievement. American Educational Research Journal, 16,295306.
Braskamp, L. A., Ory, J. C., & Pieper, D. M. (1981). Student written comments: Dimensions of instructional
quality. Journal of Educational Psychology, 73,65-70.
Braskamp, L. A., Brandenburg, D. C., & Ory, J. C. (1985). Evaluating teaching effectiveness: A practical guide.
Beverley Hills, CA: Sage.
Braunstein, D. N., Klein, G. A., & Pachla, M. (1973). Feedback expectancy shifts in student ratings of college
faculty. Journal of Applied Psychology, 58,254-258.
Brown, D. L. (1976). Faculty ratings and student grades: A university-wide multiple regression analysis. Journal
of Educational Psychology, 68,573-578.


Burton, D. (1975). Student ratings - an information source for decision-making. In R. G. Cope (Ed.), Information
for decisions in postsecondary education: Proceedings of the 15th Annual Forum (pp. 83-86). St. Louis,
MO: Association for Institutional Research.
Cadwell, J., & Jenkins, J. (1985). Effects of semantic similarity on student ratings of instructors. Journal of
Educational Psychology, 77, 383-393.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod
matrix. Psychological Bulletin, 56, 81-105.
Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research on
teaching. In N. L. Gage (Ed.), Handbook of research on teaching (pp. 171-246). Chicago: Rand McNally.
Centra, J. A. (1973). Two studies on the utility of student ratings for instructional improvement. I The effective-
ness of student feedback in modifying college instruction. II Self-ratings of college teachers: a comparison with
student ratings. (SIR Report No. 2). Princeton, NJ: Educational Testing Service.
Centra, J. A. (1975). Colleagues as raters of classroom instruction. Journal of Higher Education, 46, 327-337.
Centra, J. A. (1977). Student ratings of instruction and their relationship to student learning. American
Educational Research Journal, 14, 17-24.
Centra, J. A. (1979). Determining faculty effectiveness. San Francisco: Jossey-Bass.
Centra, J. A. (1981). Research report: Research productivity and teaching effectiveness. Princeton, NJ: Educa-
tional Testing Service.
Centra, J. A. (1983). Research productivity and teaching effectiveness. Research in Higher Education, 18,
379-389.
Centra, J. A., & Creech, F. R. (1976). The relationship between student, teacher, and course characteristics
and student ratings of teacher effectiveness (Project Report 76-1). Princeton, NJ: Educational Testing Service.
Chacko, T. I. (1983). Student ratings of instruction: A function of grading standards. Educational Research
Quarterly, 8,21-25.
Clarkson, P. C. (1984). Papua New Guinea students' perceptions of mathematics lecturers. Journal of Educa-
tional Psychology, 76, 1386-1395.
Cohen, P. A. (1980). Effectiveness of student-rating feedback for improving college instruction: a meta-analysis.
Research in Higher Education, 13,321-341.
Cohen, P. A. (1981). Student ratings of instruction and student achievement: A meta-analysis of multisection
validity studies. Review of Educational Research, 51,281-309.
Cohen, P. A., & McKeachie, W. J. (1980). The role of colleagues in the evaluation of college teaching.
Improving College and University Teaching, 28, 147-154.
Coleman, J., and McKeachie, W. J. (1981). Effects of instructor/course evaluations on student course selection.
Journal of Educational Psychology, 73,224-226.
Cooley, W. W., & Lohnes, P. R. (1976). Evaluation research in education. New York: Irvington.
Costin, F., Greenough, W. T., & Menges, R. J. (1971). Student ratings of college teaching: Reliability, validity
and usefulness. Review of Educational Research, 41,511-536.
Cranton, P. A., & Hillgarten, W. (1981). The relationships between student ratings and instructor: Implications
for improving teaching. Canadian Journal of Higher Education, 11,73-81.
Creager, J. A. (1950). A multiple-factor analysis of the Purdue rating scale for instructors. Purdue University
Studies in Higher Education, 70, 75-96.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (pp. 443-507).
Washington, D.C.: American Council on Education.
Cronbach, L. J. (1984). Essentials of psychological testing. New York, NY: Harper & Row.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52.
281-302.
de Wolf, W. A. (1974). Student ratings of instruction in post secondary institutions: A comprehensive annotated
bibliography of research reported since 1968 (Vol. 1). Seattle: University of Washington Educational Assess-
ment Center.
Dowell, D. A., & Neal, J. A. (1982). A selective view of the validity of student ratings of teaching. Journal of
Higher Education, 53, 51-62.
Doyle, K. O. (1975). Student evaluation of instruction. Lexington, MA: D. C. Heath.
Doyle, K. O. (1983). Evaluating teaching. Lexington, MA: Lexington Books.
Doyle, K. O., & Crichton, L. I. (1978). Student, peer, and self-evaluations of college instructors. Journal of
Educational Psychology, 70, 815-826.
Doyle, K. O., & Weber, P. L. (1978). Self-evaluations of college instructors. American Educational Research
Journal, 15, 467-475.
Drucker, A. J., & Remmers, H. H. (1950). Do alumni and students differ in their attitudes towards instructors?
Purdue University Studies in Higher Education, 70, 62-74.
Drucker, A. J., & Remmers, H. H. (1951). Do alumni and students differ in their attitudes toward instructors?
Journal of Educational Psychology, 70, 129-143.
Dunkin, M. J., & Barnes, J. (1986). Research on teaching in higher education. In M. C. Wittrock (Ed.), Hand-
book of research on teaching (3rd Edition) (pp. 754-777). New York: MacMillan.
Elliot, D. N. (1950). Characteristics and relationships of various criteria of college and university teaching.
Purdue University Studies in Higher Education, 70,5-61.
Endeavor Information Systems (1979). The Endeavour instructional rating system: User’s handbook. Evanston,
IL: Author.
Faia, M. (1976). Teaching and research: Rapport or misalliance. Research in Higher Education, 4, 235-246.
Feldman, K. A. (1976a). Grades and college students' evaluations of their courses and teachers. Research in
Higher Education, 4, 69-111.
Feldman, K. A. (1976b). The superior college teacher from the student’s view. Research in Higher Education,
5,243-288.
Feldman, K. A. (1977). Consistency and variability among college students in rating their teachers and courses.
Research in Higher Education, 6.223-274.
Feldman, K. A. (1978). Course characteristics and college students’ ratings of their teachers and courses: What
we know and what we don’t. Research in Higher Education, 9, 199-242.
Feldman, K. A. (1979). The significance of circumstances for college students’ ratings of their teachers and
courses. Research in Higher Education, 10, 149-172.
Feldman, K. A. (1983). The seniority and instructional experience of college teachers as related to the evalua-
tions they receive from their students. Research in Higher Education, 18,3-124.
Feldman, K. A. (1984). Class size and students' evaluations of college teachers and courses: A closer look.
Research in Higher Education, 21, 45-116.
Feldman, K. A. (1986). The perceived instructional effectiveness of college teachers as related to their person-
ality and attitudinal characteristics: A review and synthesis. Research in Higher Education, 24, 139-213.
Firth, M. (1979). Impact of work experience on the validity of student evaluations of teaching effectiveness.
Journal of Educational Psychology, 71,726-730.
Frankhouser, W. M. (1984). The effects of different oral directions as to disposition of results on student
ratings of college instruction. Research in Higher Education, 20,367-374.
French-Lazovik, G. (1981). Peer review: Documentary evidence in the evaluation of teaching. In J. Millman
(Ed.), Handbook of Teacher Evaluation (pp. 73-89). Beverly Hills, CA: Sage.
Frey, P. W. (1978). A two dimensional analysis of student ratings of instruction. Research in Higher Education,
9,69-91.
Frey, P. W. (1979). The Dr Fox Effect and its implications. Instructional Evaluation, 3, 1-5.
Frey, P. W., Leonard, D. W., & Beatty, W. W. (1975). Student ratings of instruction: Validation research.
American Educational Research Journal, 12.327-336.
Gage, N. L. (1963a). A method for “improving” teaching behavior. Journal of Teacher Education, 14,261-266.
Gage, N. L. (1963b). Handbook on research on teaching. Chicago: Rand McNally.
Gage, N. L. (1972). Teacher effectiveness and teacher education. Palo Alto, CA: Pacific Books.
Gilmore, G. M., Kane, M. T., & Naccarato, R. W. (1978). The generalizability of student ratings of instruction:
Estimates of teacher and course components. Journal of Educational Measurement, 15,1-13.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage.
Guthrie, E. R. (1954). The evaluation of teaching: A progress report. Seattle: University of Washington Press.
Harry, J., & Goldner, N. S. (1972). The null relationship between teaching and research. Sociology of
Education, 45,47-60.
Hayton, G. E. (1983). An investigation of the applicability in Technical and Further Education of a student
evaluation of teaching instrument. An unpublished thesis submitted in partial fulfillment of the requirement
for transfer to the Master of Education (Honours) degree. Department of Education, University of Sydney.
Hildebrand, M. (1972). How to recommend promotion of a mediocre teacher without actually lying. Journal of
Higher Education, 43,44-62.
Hildebrand, M., Wilson, R. C., & Dienst, E. R. (1971). Evaluating university teaching. Berkeley: Center for
Research and Development in Higher Education, University of California, Berkeley.
Hines, C. V., Cruickshank, D. R., & Kennedy, J. J. (1982). Measures of teacher clarity and their relationships to
student achievement and satisfaction. Paper presented at the annual meeting of the American Educational
Research Association, New York.
Holmes, D. S. (1972). Effects of grades and disconfirmed grade expectancies on students' evaluations of their
instructors. Journal of Educational Psychology, 63, 130-133.
Howard, G. S., & Bray, J. H. (1979). Use of norm groups to adjust student ratings of instruction: A warning.
Journal of Educational Psychology, 71, 58-63.
Howard, G. S., Conway, C. G., & Maxwell, S. E. (1985). Construct validity of measures of college teaching
effectiveness. Journal of Educational Psychology, 77, 187-196.
Howard, G. S., and Maxwell, S. E. (1980). The correlation between student satisfaction and grades: A case of
mistaken causation? Journal of Educational Psychology, 72, 810-820.
Howard, G. S., & Maxwell, S. E. (1982). Do grades contaminate student evaluations of instruction? Research in
Higher Education, 16, 175-188.
Howard, G. S., & Schmeck, R. R. (1979). Relationship of changes in student motivation to student evaluations
of instruction. Research in Higher Education, 10,305-315.
Hoyt, D. P., Owens, R. E., & Grouling, T. (1973). Interpreting student feedback on instruction and courses.
Manhattan, KS: Kansas State University.
Jauch, L. R. (1976). Relationships of research and teaching. Implications for faculty evaluation. Research in
Higher Education, 5, 1-13.
Kane, M. T., & Brennan, R. L. (1977). The generalizability of class means. Review of Educational Research,
47, 276-292.
Kane, M. T., Gilmore, G. M., & Crooks, T. J. (1976). Student evaluations of teaching: The generalizability of
class means. Journal of Educational Measurement, 13, 171-183.
Krathwohl, D. R., Bloom, B. S., & Masia, B. B. (1964). Taxonomy of educational objectives: The classification
of educational goals. Handbook 2: Affective Domain. New York: McKay.
Kulik, J. A., & McKeachie, W. J. (1975). The evaluation of teachers in higher education. In Kerlinger (Ed.),
Review of research in education (Vol. 3). Itasca, IL: Peacock.
Land, M. L. (1979). Low-inference variables of teacher clarity: Effects on student concept learning. Journal of
Educational Psychology, 71,795-799.
Land, M. L. (1985). Vagueness and clarity in the classroom. In T. Husen & T. N. Postlethwaite (Eds.), Inter-
national encyclopedia of education: Research and studies. Oxford: Pergamon Press.
Land, M. L., & Combs, A. (1981). Teacher clarity, student instructional ratings, and student performance. Paper
presented at the annual meeting of the American Educational Research Association, Los Angeles.
Land, M. L., & Smith, L. R. (1981). College student ratings and teacher behavior: An experimental study.
Journal of Social Studies Research, 5,19-22.
Larson, J. R. (1979). The limited utility of factor analytic techniques for the study of implicit theories in student
ratings of teacher behavior. American Educational Research Association, 16, 201-211.
Leventhal, L., Abrami, P. C., Perry, R. P., & Breen, L. J. (1975). Section selection in multi-section courses:
Implications for the validation and use of student rating forms. Educational and Psychological Measurement,
35, 885-895.
Leventhal, L., Perry, R. P., Abrami, P. C., Turcotte, S. J. C., & Kane, B. (1981). Experimental investigation of
tenure/promotion in American and Canadian universities. Paper presented at the Annual Meeting of the
American Educational Research Association, Los Angeles.
Leventhal, L., Turcotte, S. J. C., Abrami, P. C., & Perry, R. P. (1983). Primacy/recency effects in student
ratings of instruction: A reinterpretation of gain-loss effects. Journal of Educational Psychology, 75, 692-704.
Levinson-Rose, J., & Menges, R. J. (1981). Improving college teaching: A critical review of research. Review of
Educational Research, 51, 403-434.
Linsky, A. S., & Straus, M. A. (1975). Student evaluations, research productivity, and eminence of college
faculty. Journal of Higher Education, 46, 89-102.
Long, J. S. (1983). Confirmatory factor analysis: A preface to LISREL. Beverly Hills, CA: Sage.
Marsh, H. W. (1976). The relationship between background variables and students' evaluations of instructional
quality (OIS 769). Los Angeles, CA: Office of Institutional Studies, University of Southern California.
Marsh, H. W. (1977). The validity of students' evaluations: Classroom evaluations of instructors independently
nominated as best and worst teachers by graduating seniors. American Educational Research Journal, 14,
441-447.
Marsh, H. W. (1979). Annotated bibliography of research on the relationship between quality of teaching and
quality of research in higher education. Los Angeles: Office of Institutional Studies, University of Southern
California.
Marsh, H. W. (1980a). Research on students' evaluations of teaching effectiveness. Instructional Evaluation, 4,
5-13.
Marsh, H. W. (1980b) The influence of student, course and instructor characteristics on evaluations of university
teaching. American Educational Research Journal, 17,219-237.
Marsh, H. W. (1981a). Students’ evaluations of tertiary instruction: Testing the applicability of American
surveys in an Australian setting. Australian Journal of Education, 25, 177-192.
Marsh, H. W. (1981b). The use of path analysis to estimate teacher and course effects in student ratings of
instructional effectiveness. Applied Psychological Measurement, 6.47-60.
Marsh, H. W. (1982a). Factors affecting students' evaluations of the same course taught by the same instructor on
different occasions. American Educational Research Journal, 19, 485-497.
Marsh, H. W. (1982b). SEEQ: A reliable, valid, and useful instrument for collecting students' evaluations of
university teaching. British Journal of Educational Psychology, 52, 77-95.
Marsh, H. W. (1982c). Validity of students' evaluations of college teaching: A multitrait-multimethod analysis.
Journal of Educational Psychology, 74, 264-279.
Marsh, H. W. (1983). Multidimensional ratings of teaching effectiveness by students from different academic
settings and their relation to student/course/instructor characteristics. Journal of Educational Psychology,
75, 150-166.
Marsh, H. W. (1984a). Experimental manipulations of university motivation and their effect on examination
performance. British Journal of Educational Psychology, 54,206-213.
Marsh, H. W. (1984b). Students’ evaluations of university teaching: Dimensionality, reliability, validity,
potential biases, and utility. Journal of Educational Psychology, 76.707-754.
Marsh, H. W. (1985). Students as evaluators of teaching. In T. Husen &T. N. Postlethwaite (Eds.), International
Encyclopedia of Education: Research and Studies. Oxford: Pergamon Press.
Marsh, H. W. (1986). Applicability paradigm: Students' evaluations of teaching effectiveness in different
countries. Journal of Educational Psychology, 78, 465-473.
Marsh, H. W., Barnes, J., & Hocevar, D. (1985). Self-other agreement on multi-dimensional self-concept
ratings: Factor analysis and multitrait-multimethod analysis. Journal of Personality and Social Psychology,
49,1360-1377.
Marsh, H. W., & Cooper, T. L. (1981). Prior subject interest, students’ evaluations, and instructional effective-
ness. Multivariate Behavioral Research, 16,82-104.
Marsh, H. W., Fleiner, H., & Thomas, C. S. (1975). Validity and usefulness of student evaluations of instruc-
tional quality. Journal of Educational Psychology, 67,833-839.
Marsh, H. W., & Groves, M. A. (1987). Students’ evaluations of teaching effectiveness and implicit theories:
A critique of Cadwell and Jenkins. Journal of Educational Psychology.
Marsh, H. W., & Hocevar, D. (1983). Confirmatory factor analysis of multitrait-multimethod matrices.
Journal of Educational Measurement, 20.231-248.
Marsh, H. W., & Hocevar, D. (1984). The factorial invariance of students' evaluations of college teaching.
American Educational Research Journal, 21, 341-366.
Marsh, H. W., & Overall, J. U. (1979a). Long-term stability of students’ evaluations: A note on Feldman’s
“Consistency and variability among college students in rating their teachers and courses.” Research in
Higher Education, 10.139-147.
Marsh, H. W., & Overall, J. U. (1979b). Validity of students’ evaluations of teaching: A comparison with
instructor self evaluations by teaching assistants, undergraduate faculty, and graduate faculty. Paper presented
at Annual Meeting of the American Educational Research Association, San Francisco (ERIC Document
Reproduction Service No. ED177 205).
Marsh, H. W., & Overall, J. U. (1980). Validity of students' evaluations of teaching effectiveness: Cognitive and
affective criteria. Journal of Educational Psychology, 72, 468-475.
Marsh, H. W., & Overall, J. U. (1981). The relative influence of course level, course type, and instructor on
students’ evaluations of college teaching. American Educational Research Journal, 18, 103-112.
Marsh, H. W., Overall, J. U., & Kesler, S. P. (1979a). Class size, students’ evaluations, and instructional
effectiveness. American Educational Research Journal, 16,57-70.
Marsh, H. W., Overall, J. U., & Kesler, S. P. (1979b). Validity of student evaluations of instructional
effectiveness: A comparison of faculty self-evaluations and evaluations by their students. Journal of Educa-
tional Psychology, 71,149-160.
Marsh, H. W., Overall, J. U., & Thomas, C. S. (1976). The relationship between students' evaluations of
instruction and expected grade. Paper presented at the Annual Meeting of the American Educational Research
Association, San Francisco. (ERIC Document Reproduction Service No. ED 1X-130.)
Marsh, H. W., Touron, J., & Wheeler, B. (1985). Students' evaluations of university instructors: The applic-
ability of American instruments in a Spanish setting. Teaching and Teacher Education: An International
Journal of Research and Studies, 1, 123-138.
Marsh, H. W., & Ware, J. E. (1982). Effects of expressiveness, content coverage, and incentive on multi-
dimensional student rating scales: New interpretations of the Dr. Fox Effect. Journal of Educational
Psychology, 74, 126-134.
Maslow, A. H., & Zimmerman, W. (1956). College teaching ability, scholarly activity, and personality. Journal
of Educational Psychology, 47, 185-189.
McKeachie, W. J., & Solomon, D. (1958). Student ratings of instructors: A validity study. Journal of Educational
Research, 51, 379-382.
McKeachie, W. J. (1963). Research on teaching at the college and university level. In N. L. Gage (Ed.),
Handbook of research on teaching (pp. 1118-1172). Chicago: Rand McNally.
McKeachie, W. J. (1973). Correlates of student ratings. In A. L. Sockloff (Ed.), Proceedings: The First Invita-
tional Conference on Faculty Effectiveness as Evaluated by Students (pp. 213-218). Measurement and Research
Center, Temple University.
McKeachie, W. J. (1979). Student ratings of faculty: A reprise. Academe, 384-397.
McKeachie, W. J., Lin, Y-G., Daugherty, M., Moffett, M. M., Neigler, C., Nork, J., Walz, M., & Baldwin, R.
(1980). Using student ratings and consultation to improve instruction. British Journal of Educational
Psychology, 50, 168-174.
Meier, R. S., & Feldhusen, J. F. (1979). Another look at Dr Fox: Effect of stated purpose for evaluation,
lecturer expressiveness, and density of lecture content on student ratings. Journal of Educational Psychology,
71,339-345.
Menges, R. J. (1973). The new reporters: Students rate instruction. In C. R. Pace (Ed.), Evaluating Learning
and Teaching. San Francisco: Jossey-Bass.
Miller, M. T. (1971). Instructor attitudes toward, and their use of, student ratings of teachers. Journal of
Educational Psychology, 62,235-239.
Morsh, J. E., Burgess, G. G., & Smith, P. N. (1956). Student achievement as a measure of instructional
effectiveness. Journal of Educational Psychology, 47, 79-88.
Murray, H. G. (1976). How do good teachers teach? An observational study of the classroom teaching behaviors of
Social Science professors receiving low, medium and high teacher ratings. Paper presented at the Canadian
Psychological Association meeting.
Murray, H. G. (1980). Evaluating university teaching: A review of research. Toronto, Canada: Ontario Con-
federation of University Faculty Associations.
Murray, H. G. (1983). Low inference classroom teaching behaviors and student ratings of college teaching
effectiveness. Journal of Educational Psychology, 71,856-865.
Naftulin, D. H., Ware, J. E., & Donnelly, F. A. (1973). The Doctor Fox lecture: A paradigm of educational
seduction. Journal of Medical Education, 48,630-635.
Neumann, L., & Neumann, Y. (1985). Determinants of students’ instructional evaluation: A comparison of
four levels of academic areas. Journal of Educational Research, 78, 152-158.
Office of Evaluation Services (1972). Student Instructional Rating System responses and student characteristics.
SIRS Research Report No. 4. Michigan State University: Author.
Ory, J. C., & Braskamp, L. A. (1981). Faculty perceptions of the quality and usefulness of three types of
evaluative information. Research in Higher Education, 15, 271-282.
Ory, J. C., Braskamp, L. A., & Pieper, D. M. (1980). Congruency of student evaluative information collected
by three methods. Journal of Educational Psychology, 72, 181-185.
Overall, J. U., & Marsh, H. W. (1979). Midterm feedback from students: Its relationship to instructional
improvement and students’ cognitive and affective outcomes. Journal of Educational Psychology, 71,
856-865.
Overall, J. U., & Marsh, H. W. (1980). Students’ evaluations of instruction: A longitudinal study of their
stability. Journal of Educational Psychology, 72,321-325.
Overall, J. U., & Marsh, H. W. (1982). Students' evaluations of teaching: An update. American Association for
Higher Education Bulletin, 35(4), 9-13.
Palmer, J., Carliner, G., & Romer, T. (1978). Learning, leniency, and evaluations. Journal of Educational
Psychology, 70, 855-863.
Pambookian, H. S. (1976). Discrepancy between instructor and student evaluations of instruction: Effect on
instructor. Instructional Science, 5, 63-75.
Parducci, A. (1968). The relativism of absolute judgment. Scientific American, 219, 84-90.
Perry, R. P., Abrami, P. C., & Leventhal, L. (1979). Educational seduction: The effect of instructor expressive-
ness and lecture content on student ratings and achievement. Journal of Educational Psychology, 71, 107-116.
Perry, R. P., Abrami, P. C., Leventhal, L., & Check, J. (1979). Instructor reputation: An expectancy relation-
ship involving student ratings and achievement. Journal of Educational Psychology, 71, 776-787.
Powell, R. W. (1977). Grades, learning, and student evaluation of instruction. Research in Higher Education, 7,
195-205.
Peterson, C., & Cooper, S. (1980). Teacher evaluation by graded and ungraded students. Journal of Educational
Psychology, 72,682-685.
Pohlman, J. T. (1972). Summary of research on the relationship between student characteristics and student
evaluations of instruction at Southern Illinois University, Carbondale. Technical Report 1.I-72. Counselling
and Testing Center, Southern Illinois University, Carbondale.
Pohlman, J. T. (1975). A multivariate analysis of selected class characteristics and student ratings of instruction.
Multivariate Behavioral Research, 10, 81-91.
Price, J. R., & Magoon, A. J. (1971). Predictors of college student ratings of instructors (Summary). Proceedings
of the 79th Annual Convention of the American Psychological Association, 1, 523-524.
Protzman, M. I. (1929). Student rating of college teaching. School and Society, 29, 513-515.
Remmers, H. H. (1928). The relationship between students’ marks and student attitudes towards instructors.
School and Society, 28,759-760.
Remmers, H. H. (1930). To what extent do grades influence student ratings of instructors? Journal of Educa-
tional Research, 21, 314-316.
Remmers, H. H. (1934). Reliability and halo effect on high school and college students’ judgments of their
teachers. Journal of Applied Psychology, 18,619-630.
Remmers, H. H. (1931). The equivalence of judgements and test items in the sense of the Spearman-Brown
formula. Journal of Educational Psychology, 22,66-71.
Remmers, H. H. (1958). On students' perceptions of teachers' effectiveness. In McKeachie (Ed.), The appraisal
of teaching in large universities (pp. 17-23). Ann Arbor: The University of Michigan.
Remmers, H. H. (1963). Teaching methods in research on teaching. In N. L. Gage (Ed.), Handbook on
Teaching. Chicago: Rand McNally.
Remmers, H. H., & Brandenburg, G. C. (1927). Experimental data on the Purdue Rating Scale for instructors.
Educational Administration and Supervision, 13, 519-527.
Remmers, H. H., & Elliot, D. N. (1949). The Indiana college and university staff evaluation program. School
and Society, 70, 168-171.
Remmers, H. H., & Wykoff, G. S. (1929). Student ratings of college teaching-A reply. School and Society,
30,232-234.
Remmers, H. H., Martin, F. D., & Elliot, D. N. (1949). Are student ratings of their instructors related to their
grades? Purdue University Studies in Higher Education, 44, 17-26.
Rich, H. E. (1976). Attitudes of college and university faculty toward the use of student evaluation. Educational
Research Quarterly, 3, 17-28.
Rodin, M., & Rodin, B. (1972). Student evaluations of teachers. Science, 177,1164-1166.
Rosenshine, B. (1971). Teaching behaviors and student achievement. London: National Foundation for
Educational Research.
Rosenshine, B., & Furst, N. (1973). The use of direct observation to study teaching. In R. M. W. Travers (Ed.),
Second handbook of research on teaching. Chicago: Rand McNally.
Rotem, A. (1978). The effects of feedback from students to university instructors: An experimental study.
Research in Higher Education, 9, 303-318.
Rotem, A., & Glassman, N. S. (1979). On the effectiveness of students’ evaluative feedback to university
instructors. Review of Educational Research, 49,497-511.
Rushton, J. P., Brainerd, C. J., & Pressley, M. (1983). Behavioral development and construct validity: The
principle of aggregation. Psychological Bulletin, 94,18-38.
Ryan, J. J., Anderson, J. A., & Birchler, A. B. (1980). Student evaluation: The faculty respond. Research in
Higher Education, 12, 317-333.
Salthouse, T. A., McKeachie, W. J., & Lin, Y. G. (1978). An experimental investigation of factors affecting
university promotion decisions. Journal of Higher Education, 49, 177-183.
Schwab, D. P. (1976). Manual for the Course Evaluation Instrument. School of Business, University of Wisconsin
at Madison.
Scott, C. S. (1977). Student ratings and instructor-defined extenuating circumstances. Journal of Educational
Psychology, 69,1&747.
Scriven, M. (1981). Summative teacher evaluation. In J. Millman (Ed.), Handbook of teacher evaluation
(pp. 244-271). Beverly Hills, CA: Sage.
Seldin, P. (1975). How colleges evaluate professors: Current policies and practices in evaluating classroom
teaching performance in liberal arts colleges. Croton-on-Hudson, New York: Blythe-Pennington.
Shavelson, R. J., Hubner, J. J., & Stanton, G. C. (1976). Self-concept: Validation of construct interpretations.
Review of Educational Research, 46, 407-441.
Smalzried, N. T., & Remmers, H. H. (1943). A factor analysis of the Purdue Rating Scale for instructors. Journal
of Educational Psychology, 34,363-367.
Smith, M. L., & Glass, G. V. (1980). Meta-analysis of research on class size and its relationship to attitudes and
instruction. American Educational Research Journal, 17,419-433.
Snyder, C. R., & Clair, M. (1976). Effects of expected and obtained grades on teacher evaluation and attribution
of performance. Journal of Educational Psychology, 68, 75-82.
Stalnaker, J. M., & Remmers, H. H. (1928). Can students discriminate traits associated with success in teaching?
Journal of Applied Psychology, 12,602-610.
Stevens, J. J., & Aleamoni, L. M. (1985). The use of evaluative feedback for instructional improvement: A
longitudinal perspective. Instructional Science, 13,285-304.
Stumpf, S. A., Freedman, R. D., & Aguanno, J. C. (1979). A path analysis of factors often found to be related
to student ratings of teaching effectiveness. Research in Higher Education, 11, 111-123.
Vasta, R., & Sarmiento, R. F. (1979). Liberal grading improves evaluations but not performance. Journal of
Educational Psychology, 71,207-211.
Voght, K. E., & Lasher, H. (1973). Does student evaluation stimulate improved teaching? Bowling Green, OH:
Bowling Green University (ERIC ED 013 371).
Ward, M. D., Clark, D. C., & Harrison, G. V. (1981, April). The observer effect in classroom visitation. Paper
presented at the annual meeting of the American Educational Research Association, Los Angeles.
Ware, J. E., & Williams, R. G. (1975). The Dr. Fox effect: A study of lecturer expressiveness and ratings of
instruction. Journal of Medical Education, 50, 149-156.
Ware, J. E., & Williams, R. G. (1977). Discriminant analysis of student ratings as a means of identifying lecturers
who differ in enthusiasm or information giving. Educational and Psychological Measurement, 37, 627-639.
Ware, J. E., & Williams, R. G. (1979). Seeing through the Dr. Fox effect: A response to Frey. Instructional
Evaluation, 3, 6-10.
Ware, J. E., & Williams, R. G. (1980). A reanalysis of the Doctor Fox experiments. Instructional Evaluation,
4, 15-18.
Warrington, W. G. (1973). Student evaluation of instruction at Michigan State University. In A. L. Sockloff
(Ed.), Proceedings: The first invitational conference on faculty effectiveness as evaluated by students (pp.
164-182). Philadelphia: Measurement and Research Center, Temple University.
Watkins, D., Marsh, H. W., & Young, D. (1987). Evaluating tertiary teaching: A New Zealand Perspective.
Teaching and Teacher Education: An International Journal of Research and Studies, 3,41-53.
Webb, W. B., & Nolan, C. Y. (1955). Student, supervisor, and self-ratings of instructional proficiency.
Journal of Educational Psychology, 46, 42-46.
Whitely, S. E., & Doyle, K. O. (1976). Implicit theories in student ratings. American Educational Research
Journal, 13, 241-253.
Williams, R. G., & Ware, J. E. (1976). Validity of student ratings of instruction under different incentive
conditions: A further study of the Dr. Fox effect. Journal of Educational Psychology, 68, 48-56.
Williams, R. G., & Ware, J. E. (1977). An extended visit with Dr Fox: Validity of student ratings of instruction
after repeated exposures to a lecturer. American Educational Research Journal, 14, 449-457.
Wykoff, G. S. (1929). On the improvement of teaching. School and Society, 29,58-59.
Yunker, J. A. (1983). Validity research on student evaluations of teaching effectiveness: Individual versus class
mean observations. Research in Higher Education, 19,363-379.
APPENDIX

FIVE EXAMPLES OF COMPUTER SCORED INSTRUMENTS


USED TO COLLECT STUDENTS’ EVALUATIONS OF
TEACHING EFFECTIVENESS

Included in this appendix are five instruments used to collect students' evaluations of teach-
ing effectiveness: SEEQ (see earlier description), Endeavor (Frey et al., 1975; also see
earlier description), SIRS (Warrington, 1973), SDT (a slight modification of the instrument
described by Hildebrand et al., 1971; also see earlier description), and ICES (Office of In-
structional Resources, 1978; also see Brandenburg et al., 1985). A comparison of the five
instruments (summarized schematically after the list below) reveals that they differ in:
(1) Factor Structure. Four of the five - SEEQ, Endeavor, SIRS, and SDT - consist of
items that have a well-defined factor structure.
(2) Open-Ended Questions. Four of the five - SEEQ, SIRS, SDT, and ICES - speci-
fically ask for open-ended comments, though the nature of the open-ended questions varies.
(3) Overall Ratings. Three of the five - SEEQ, SDT and ICES - include overall rat-
ings of the course and of the instructor.
(4) Background/Demographic Information. Three of the five - SEEQ, SIRS and ICES
- ask for some background/demographic information in addition to students’ ratings of
instructional effectiveness.
(5) Instructor-Selected Items. Three of the five - SEEQ, ICES, and SIRS - provide for
additional items. The ICES, unlike the other instruments, is a ‘cafeteria-style’ instrument
(Braskamp et al., 1985) that consists primarily of items selected by the individual instructor
or the academic unit, and only the three overall rating items are constant. Up to 23 addi-
tional items are selected from a catalog of over 450 items, and a computer-controlled
printer overprints these items on the ICES form. Thus, each instrument is ‘tailor made’
and the one illustrated here is just one example. SEEQ and SIRS provide for additional
items, up to 20 and 7 respectively, to be selected/created by the individual instructor or the
academic unit.
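
The five-way comparison above can also be expressed as a small data structure. The following is a minimal illustrative sketch (not part of any of the published instruments; the Python names, field labels, and boolean layout are the present editor's assumptions) that simply re-encodes points (1)-(5) so the instruments can be compared programmatically.

# Minimal sketch: the feature comparison in points (1)-(5), transcribed as a dictionary.
INSTRUMENTS = {
    "SEEQ":     {"factor_structure": True,  "open_ended": True,  "overall_ratings": True,
                 "background_items": True,  "instructor_selected_items": 20},
    "Endeavor": {"factor_structure": True,  "open_ended": False, "overall_ratings": False,
                 "background_items": False, "instructor_selected_items": 0},
    "SIRS":     {"factor_structure": True,  "open_ended": True,  "overall_ratings": False,
                 "background_items": True,  "instructor_selected_items": 7},
    "SDT":      {"factor_structure": True,  "open_ended": True,  "overall_ratings": True,
                 "background_items": False, "instructor_selected_items": 0},
    # ICES is 'cafeteria-style': up to 23 items chosen from a catalog of over 450.
    "ICES":     {"factor_structure": False, "open_ended": True,  "overall_ratings": True,
                 "background_items": True,  "instructor_selected_items": 23},
}

def instruments_with(feature):
    """Return the names of the instruments for which a feature is present (True or non-zero)."""
    return [name for name, fields in INSTRUMENTS.items() if fields.get(feature)]

if __name__ == "__main__":
    print("Well-defined factor structure:", instruments_with("factor_structure"))
    print("Open-ended comments:", instruments_with("open_ended"))
    print("Overall ratings:", instruments_with("overall_ratings"))
    print("Background/demographic items:", instruments_with("background_items"))
    print("Instructor-selected items:", instruments_with("instructor_selected_items"))

For example, instruments_with("open_ended") returns the four instruments that solicit written comments, mirroring point (2) above.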


The Endeavor Instrument (reprinted with permission). [Facsimile of the Endeavor Instructional Rating Form not reproduced here; for each item, students darken the box that most closely indicates their assessment of the course.]

The Student Instructional Rating System (SIRS) form (reprinted with permission). [Facsimile not reproduced here.]

The Instructor and Course Evaluation System (ICES) (reprinted with permission). [Facsimile of the two-sided ICES form not reproduced here. Side 1 carries the objective rating items (e.g., the instructor's conscientiousness about instructional responsibilities, the grading procedures, how well examination questions reflected course content and emphasis, the logical progression of the course, the instructor's ability to explain, motivation to do one's best work, increased interest in the subject matter, and improved understanding of concepts and principles); items 1-3 are used to compare the course and instructor with others in the department and institution. Side 2 asks for written comments on the instructor's major strengths and weaknesses, suggestions for improving the course, the grading procedures, and instructor-option questions, and notes that the instructor will not see the completed evaluations until final grades are in and that someone other than the instructor should collect and mail the forms.]

The SEEQ Instrument. [Facsimile not reproduced here; the computer-scorable form lists the numbered SEEQ rating items, a block of supplemental responses for instructor-selected questions, and space for written comments.]

The Student Description of Teaching (SDT) questionnaire. (Note: The current version of the SDT is nearly the same as that first described by Hildebrand et al., 1971, and has been widely adopted, including this computer-scorable version used in schools of Business Administration.) [Facsimile not reproduced here; the form ends with open-ended prompts asking students to identify the perceived strengths and weaknesses of (a) the course and (b) the instructor's teaching, and to suggest improvements.]
