SEEQ: A Reliable, Valid, and Useful Instrument for Collecting Students' Evaluations of University Teaching
INTRODUCTION
The purpose of this review is to summarise research that led to the development of
SEEQ (Students’ Evaluations of Educational Quality). SEEQ is an instrument and a
programme for collecting students’ evaluations of college/university teaching.
Research presented in this review is described in greater detail in a series of technical
reports and publications. This research, in addition to guiding SEEQ’s development,
has also provided an academic credibility that is essential in winning faculty support.
It is hoped that this review may serve as both a model and encouragement to academic
units seeking to implement or improve systematic programmes of students’ evaluations.
Research and development on the first SEEQ, which is substantially similar to the
current version, was conducted at the University of California, Los Angeles (UCLA).
This effort began with a Task Force on the Evaluation of Teaching that examined
evaluation practices at UCLA and other universities, and made recommendations that
included the development of a campus-wide programme of students’ evaluations of
teaching. Based upon current practices, interviews with students and faculty, and a
review of the evaluation literature, an extensive item pool was developed. The work
done by Hildebrand et al. (1971) at the University of California, Davis was particularly
important in developing this pool of items. Several different pilot surveys, each consisting of 50 to 75 items, were administered to classes in different academic departments. Students, in addition to making ratings, were asked to indicate the items they
felt were most important in describing the quality of teaching. Similarly, staff were
asked to indicate the items they felt would provide them with the most useful feedback
about their teaching. Students’ open-ended comments were reviewed to determine if
important aspects had been excluded. Factor analysis identified the dimensions
underlying the student ratings, and the items that best measured each. Reliability
coefficients were compiled for each of the evaluation items. Finally, after several
revisions, four criteria were used to select items to be included on the UCLA version
of SEEQ. These were: (1) student ratings of item importance, (2) staff ratings of item
usefulness, (3) factor analysis, and (4) item reliabilities. During the last 6 years over
500,000 of these forms have been completed by UCLA students from more than 50
academic departments in over 20,000 courses. The results of the evaluations are
returned to faculty as feedback about their teaching, are used in tenure/promotion
decisions and are published for students to use in the selection of courses.
The current version of SEEQ (see Appendix 1) was developed at the University of
Southern California (USC). A preliminary version of the instrument was adopted on
a trial basis by the Division of Social Sciences, pending the outcome of research on the
instrument. On the basis of much of the research summarised in this review, the
current form was unanimously endorsed by the Dean and Department Chairpersons
in the Division, and its use required in all Social Science courses. The programme was
later adopted by other academic units at USC, and over 250,000 SEEQ forms have
been completed by USC students over the last 4 years.
METHOD
Description of the instrument
The SEEQ survey form is presented in Appendix 1. The two-sided evaluation
instrument is self-explanatory, easily administered, and computer scorable. The form
strives for a compromise between uniformity and flexibility. The standardised ques-
tions used to evaluate all courses measure separate components of instructional
effectiveness that have been identified with factor analysis. Provision for supplemental
questions at the bottom of the printed form allows the individual instructor or aca-
demic unit to design items unique to specific needs. Space for students' comments in response to open-ended questions is provided on the back of the form.
A sample of the two-page summary report prepared for each course is presented
in Appendix 2 (the actual report appears on 8.5 inch x 15 inch computer paper). The
summary report, along with the completed surveys that contain students’ open-ended
comments, is returned to the instructor. Copies of the report are also sent to the
Department Chairperson and/or the Dean of the particular academic unit. The data
upon which the report is based are permanently stored in a computer archive system
by the Office of Institutional Studies, the central office that processes the forms. In the
report, the evaluation factor scores, the overall summary ratings, and demographic/
background items are presented on page 1, while the separate rating items appear on
page 2. Each item is summarised by a frequency distribution of student responses, the
mean, the standard error, and the percentile rank that shows how the mean rating
compares with other courses. A graphic representation of the percentile rank is also
shown. If any supplemental questions were used, a summary of these responses
appears on a third page.
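To make the computation of these per-item statistics concrete, the following fragment is a minimal sketch (in Python, and not the actual SEEQ processing code) of how a mean and standard error could be derived from a frequency distribution of responses on the one-to-five scale:

# Illustrative sketch only: per-item mean and standard error from a
# frequency distribution of responses (1 = lowest, 5 = highest).
import math

def item_summary(freq):
    """freq[k] = number of students choosing response k (k = 1..5)."""
    n = sum(freq.values())
    mean = sum(k * c for k, c in freq.items()) / n
    var = sum(c * (k - mean) ** 2 for k, c in freq.items()) / (n - 1)
    se = math.sqrt(var / n)          # standard error of the mean
    return n, mean, se

# Hypothetical class of 43 respondents with responses near the top of the scale.
n, mean, se = item_summary({1: 0, 2: 0, 3: 5, 4: 18, 5: 20})
print(f"n={n}, mean={mean:.2f}, SE={se:.3f}")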
The normative comparisons provided in the summary report (the percentile ranks)
play an important role in the interpretation of the ratings. First, students are uni-
versally quite generous in their evaluations of teaching. The average overall course
and instructor ratings are typically about 4.0 on a one-to-five scale. Second, some
items receive higher responses than do others: overall instructor ratings are almost
always higher than overall course ratings. Finally, comparisons are made between
instructors teaching courses at similar levels (i.e., there are separate norms for graduate
level courses, undergraduate level courses taught by faculty members, and courses
taught by teaching assistants). Academic units at USC (e.g., the 10 departments in the
Division of Social Sciences) are given the option of using university-wide norms or
norms based upon ratings from just their unit. However, ratings are only ranked
against norms containing at least 200 courses.
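As an illustration of this normative comparison, the sketch below computes a percentile rank for a course mean against a norm group. The function name, the definition of the rank as the percentage of norm-group courses at or below the given mean, and the data layout are assumptions made for the example rather than details of the SEEQ system.

# Illustrative sketch: percentile rank of a course mean within a norm group.
from bisect import bisect_left, bisect_right

def percentile_rank(course_mean, norm_means):
    """Percentage of norm-group courses rated at or below this course's mean."""
    if len(norm_means) < 200:        # ratings are only ranked against norms
        return None                  # containing at least 200 courses
    norm = sorted(norm_means)
    below = bisect_left(norm, course_mean)
    equal = bisect_right(norm, course_mean) - below
    return round(100 * (below + 0.5 * equal) / len(norm))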
A longitudinal summary report, summarising all the available courses ever taught
by each instructor, is also produced annually. The report contains means and percen-
tile ranks for the evaluation factor scores, the overall summary ratings, and selected
background/demographic items. This information is presented separately for each
course, and is averaged across all graduate level courses and across all undergraduate
courses. Courses that were evaluated by fewer than 10 students or by less than 50
per cent of the enrolled students are not included in the longitudinal averages. Impor-
tant information can be gained from examining this report, beyond the convenience of
having a summary of all the ratings for each teacher. The longitudinal average is not
unduly affected by a chance occurrence in any one course offering, and it reflects
teaching effectiveness in the range of courses that are likely to be taught by a particular
instructor. The change in ratings over time provides a measure of instructional
improvement. Furthermore, this summary provides a basis for determining the
classes in which an individual teacher is most effective.
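The exclusion rule and averaging just described might be expressed as follows; the data structures are hypothetical and the fragment is only a sketch of the logic, not the archive software.

# Illustrative sketch of the longitudinal averaging rule: courses evaluated by
# fewer than 10 students, or by less than 50 per cent of those enrolled, are
# excluded from the instructor's longitudinal average.
def longitudinal_average(courses, factor):
    """courses: list of dicts with 'n_raters', 'n_enrolled' and factor scores."""
    eligible = [c for c in courses
                if c["n_raters"] >= 10 and c["n_raters"] / c["n_enrolled"] >= 0.5]
    if not eligible:
        return None
    return sum(c[factor] for c in eligible) / len(eligible)

courses = [
    {"n_raters": 43, "n_enrolled": 47, "Enthusiasm": 4.35},
    {"n_raters": 6,  "n_enrolled": 35, "Enthusiasm": 4.80},  # too few raters: excluded
]
print(longitudinal_average(courses, "Enthusiasm"))  # 4.35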
In addition to the individual and longitudinal summary reports, other studies and
special analyses are performed at the request of the Dean and/or Chairpersons.
These include requests as diverse as using previous ratings for a particular course as a
baseline against which to compare ratings after an innovative change, a determination
of the trend over time in ratings of all courses within a given academic department, and
the use of supplemental questions to query students about their preferences in class
scheduling.
RESULTS
Factor analysis
Factor analysis is used to describe the different components of teaching effective-
ness actually being measured by a set of questions. Its use is particularly important
in the development of student evaluation instruments, since it provides a safeguard
against a 'halo effect', a generalisation from some subjective feeling about the teacher
which affects ratings of all the questions. To the extent that all the items are con-
taminated by this halo effect, they will all blend together and not be distinguished as the
separate components of teaching effectiveness that the evaluation form was designed
to measure.
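For readers unfamiliar with the technique, the fragment below sketches an exploratory factor analysis of class-average ratings. It uses scikit-learn's FactorAnalysis with varimax rotation as a stand-in for the analyses actually reported, and the input file is hypothetical.

# Illustrative sketch of an exploratory factor analysis of rating items;
# this is not the analysis code used in the published studies.
import numpy as np
from sklearn.decomposition import FactorAnalysis

ratings = np.loadtxt("class_ratings.csv", delimiter=",")  # hypothetical file:
                                                          # rows = classes, cols = items
fa = FactorAnalysis(n_components=9, rotation="varimax")   # nine SEEQ factors
fa.fit(ratings)
loadings = fa.components_.T                               # item-by-factor loadings
for i, row in enumerate(loadings, start=1):
    print(f"item {i:2d} loads most heavily on factor {row.argmax() + 1}")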
A well-developed factor structure is also important to the interpretation of the
student ratings. Broad global ratings averaged across a collection of heterogeneous
items provide little diagnostic feedback and are difficult to interpret. For example,
Marsh, Overall and Kesler (1979b) showed that while large classes did tend to receive
lower ratings when averaged across all items, this effect was limited almost entirely
to the Group Interaction and Individual Rapport factors. Similarly, an interview
with a student about an earlier version of the evaluation form indicated that she had
given an instructor lower ratings on several more or less randomly selected items
because there were no items where she could express her sentiment that "the examinations were terrible". Even if particular components of teaching effectiveness seem
less important to a particular instructor (or academic unit), their exclusion may make
other ratings more difficult to interpret.
SEEQ measures nine evaluation factors (see Table 1). Marsh (Marsh and Overall,
1979b; Marsh, in press) presented a factor analysis of student ratings that confirmed
the nine factors SEEQ was designed to measure, and these findings have been replicated
in different academic disciplines and in different academic years. Even more convinc-
ing support came from a study in which faculty in 329 classes were asked to evaluate
their own teaching effectiveness with the same SEEQ form that was used by their
students. Separate factor analyses of the student ratings and the instructor self-
evaluation both demonstrated the same nine evaluation factors that had previously
been identified (see Table 1). More recently the same nine factors have been identified
in ratings collected at the University of Sydney, Australia (Marsh, 1981a). These
analyses illustrate the replicability of the rating factors and their generalisability
across different populations of students and different methods of evaluation.
TABLE 1
FACTOR ANALYSES OF STUDENTS' EVALUATIONS OF TEACHING EFFECTIVENESS AND THE CORRESPONDING STAFF SELF-EVALUATIONS OF THEIR OWN TEACHING (IN BRACKETS) IN ALL 329 COURSES
Long-term stability
A common criticism directed at student ratings is that students do not have an
adequate perspective to recognise the value of instruction at the end of a class. Accor-
ding to this argument, students will only recognise the value of teaching after being
called upon to apply the course materials in further coursework and after graduation.
An unusual opportunity to test this notion arose at a California State University
which had adopted an earlier version of SEEQ. Undergraduate and graduate students
in the school of management evaluated teaching effectiveness at the end of each course.
However, unlike most programmes, the forms were actually signed by the students,
allowing the identification of individual responses. One year after graduation from
the programme (and several years after taking a course) the same students were
again asked to make ‘ retrospective ratings ’ of teaching effectiveness in each course,
using a subset of the original items. Since all evaluations were signed, the end-of-term
ratings could be matched with the retrospective ratings. Over a several-year period of
time, matched sets of ratings, both end-of-term and retrospective, were collected for
students in 100 classes. Analysis of the two sets of ratings showed remarkable agree-
ment. The average correlation (relative agreement) between end-of-term and
retrospective ratings was 0.83. Mean differences between the two sets of ratings
(absolute agreement) were small; the median rating was 6.63 for retrospective ratings and 6.61 for end-of-term ratings. Separate analyses showed these results to be consistent at
both the graduate and undergraduate levels, and across different course types.
This research is described in more detail in Marsh and Overall (1979a, 1981) and
Overall and Marsh (1980). In related research, Marsh (1977) showed that responses
from graduating seniors were similar to the ratings of current students.
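A minimal sketch of this stability analysis, assuming matched arrays of end-of-term and retrospective ratings (the data structures are hypothetical):

# Illustrative sketch: relative agreement (correlation) and absolute agreement
# (mean difference) between matched end-of-term and retrospective ratings.
import numpy as np

def stability(end_of_term, retrospective):
    """Both arguments: arrays of matched ratings from the same students."""
    end_of_term = np.asarray(end_of_term, dtype=float)
    retrospective = np.asarray(retrospective, dtype=float)
    r = np.corrcoef(end_of_term, retrospective)[0, 1]   # relative agreement
    diff = retrospective.mean() - end_of_term.mean()    # absolute agreement
    return r, diff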
Validity: student learning
Student ratings, one measure of teaching effectiveness, are difficult to validate
since there is no universal criterion of effective teaching. Consequently, using an
approach called construct validation, student ratings have been related to other
measures that are assumed to be indicative of effective teaching. If two measures that
are supposed to measure the same thing show agreement, there is evidence that both
are valid. Clearly this approach requires that many alternative validity criteria be used.
Within this framework, evidence of the long-term stability of student ratings can be
interpreted as a validity measure. However, the most commonly used criterion has
been student learning as measured by performance on a standardised exam-
ination.
Methodological problems require a special setting for this research. Ideally,
there are many sections (i.e., different lecture groups that are part of the same course)
of a large multi-section course in which students are randomly assigned or at least
enrol without knowledge of who will be teaching the section. Each section of the
course should be taught by a separate teacher, but the course outline, textbooks,
course objectives, and most importantly the final examination, should be developed by
a course director who does not actually lecture to the students. In two separate studies
applying this methodology, it was found that the sections that evaluated teaching most favourably during the last week of classes also performed best on the standardised examination given to all sections the following week. Since students did not know who would
be teaching different sections at the time of registration, and sections did not differ on a
pretest administered at the start of the term, these findings provide good support for
the validity of student ratings.
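The section-level analysis can be sketched as follows; the numbers are purely illustrative and are not taken from the studies cited.

# Illustrative sketch: correlate each section's mean instructor rating with its
# mean score on the common final examination (illustrative values only).
import numpy as np

section_mean_rating = np.array([4.5, 3.9, 4.2, 3.6, 4.8])    # one value per section
section_mean_exam   = np.array([78.2, 71.5, 75.0, 69.8, 81.3])

validity_r = np.corrcoef(section_mean_rating, section_mean_exam)[0, 1]
print(f"section-level validity coefficient: {validity_r:.2f}")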
In the second of these studies a set of affective variables was also considered as a
validity criterion. Since the course was an introduction to computer programming,
these included such variables as feelings of course mastery, plans to apply the skills that
were gained from the course, plans to pursue the subject further, and whether or not students had joined the local computer club. In each case, more fav-
ourable responses to these items were correlated with more favourable evaluations
of the teacher.
These two studies are described in more detail in Marsh, Fleiner and Thomas
(1975) and Marsh and Overall (1980). Similar findings, using this same methodology,
are presented in Frey et al. (1975), Centra (1977), in studies reviewed by McKeachie
(1979) and Marsh (1980b), and in a meta-analysis by Cohen (1980).
TABLE 2
MULTITRAIT-MULTIMETHOD MATRIX: CORRELATIONS BETWEEN STUDENT AND FACULTY SELF-EVALUATIONS IN ALL 329 COURSES
(Lower-triangle matrix of correlations among the nine evaluation factors: Learning/Value, Enthusiasm, Organisation, Group Interaction, Individual Rapport, Breadth, Examinations, Assignments, Workload/Difficulty; the matrix itself is not reproduced here.)
NOTE: Correlations are presented without decimal points. Correlations that are in bold figures indicate background variables which account for at least 5 per cent of the variance in a particular evaluation score. The value of Multiple R squared is based upon the combined effect of the subset of background variables that is most highly correlated with the evaluation score. This was determined with a step-wise multiple regression in which a new background variable was added at each step until no additional variable could increase Multiple R squared by as much as 1 per cent. The Multiple R squared was then corrected for the number of variables in the equation. These relationships showed substantial non-linearity (i.e., quadratic and/or cubic components add at least 1 per cent to the variance explained by the linear relationship and the total variance explained by all components was at least 5 per cent).
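The step-wise procedure described in the note can be sketched as follows, assuming ordinary least-squares regression; this is an illustration of the logic, not the original analysis code.

# Illustrative sketch: add the background variable that most increases R squared
# at each step, stop when no variable adds at least 1 per cent, then adjust the
# final R squared for the number of predictors.
import numpy as np

def stepwise_r2(X, y, threshold=0.01):
    """X: n-by-p matrix of background variables; y: evaluation score (arrays)."""
    n, p = X.shape
    selected = []
    best_r2 = 0.0
    while True:
        gains = []
        for j in range(p):
            if j in selected:
                continue
            cols = selected + [j]
            design = np.column_stack([np.ones(n), X[:, cols]])
            coef, *_ = np.linalg.lstsq(design, y, rcond=None)
            resid = y - design @ coef
            r2 = 1 - resid.var() / y.var()
            gains.append((r2 - best_r2, r2, j))
        if not gains:
            break
        gain, r2, j = max(gains)
        if gain < threshold:
            break
        selected.append(j)
        best_r2 = r2
    k = len(selected)
    adj_r2 = 1 - (1 - best_r2) * (n - 1) / (n - k - 1) if k else 0.0
    return selected, adj_r2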
According to this interpretation, part of the Expected Grade relationship with student
ratings is spurious. Second, the Expected Grade relationship can only be considered a
bias if higher grades reflect ‘ easy grading ’ on the part of the teacher. If the higher
grades reflect better student achievement, then the Expected Grade relationship may
support the validity of the student ratings, i.e., better ratings are associated with more
student learning. At least two facts support this interpretation. First, Prior Subject
Interest is related to Expected Grades and it is more reasonable to assume that it
affects student achievement rather than the instructor’s grading standards. Second,
lecturers’ self evaluations of their own grading standards showed little correlation
with student ratings. In reality, Expected Grades probably reflect some unknown com-
bination of both ‘ easy grading ’ and student achievement. However, even if Expected
Grades do represent a real bias to the student ratings, their effect is not substantial.
These studies show that none of the suspected biases to student ratings seems
actually to have much impact. Similar findings have been reported by Remmers
(1963), Hildebrand et al. (1971), McKeachie (1979), and Marsh (1980a). Neverthe-
less, as a consequence of this research, summary reports describing student evaluations
also include mean responses and percentile ranks for Prior Subject Interest and
Expected Grades (see Appendix 2). This research is described in greater detail in
Marsh (1978, 1980b). Separate studies have examined the relationship between
student ratings and: (1) Expected Grades (Marsh, Overall and Thomas, 1976), (2)
Class Size (Marsh, Overall and Kesler, 1979b), and (3) Prior Subject Interest (Marsh
and Cooper, 1981). In related research, Marsh (1981b; Marsh and Overall, 1981)
demonstrated that student ratings are primarily a function of the instructor doing the
teaching, and not the particular course or the level at which it is taught.
Instructional improvement: feedback from student ratings
There is ample reason to believe that a carefully planned programme of instruc-
tional evaluation instituted on a broad basis will lead to the improvement of teaching.
Teachers, particularly those who are most critical of the student ratings, will have to
give more serious consideration to their own teaching in order to assess the merits of an evaluation programme. The institution of the programme and the clear
endorsement by the administrative hierarchy will give notice that quality of teaching is
being taken more seriously, an observation that both students and faculty will be
likely to make. The results of the student ratings, as one measure of teaching effective-
ness, will provide a basis for administrative decisions and thereby increase the likeli-
hood that quality teaching will be recognised and rewarded. The social reinforcement
of getting favourable ratings will provide added incentive for the improvement of
teaching, even at the tenured faculty level. Finally, the diagnostic feedback from the
student ratings may provide a basis for instructional improvement. As described
earlier, teaching staff at USC indicate that student ratings are useful in the improve-
ment of a course and/or the quality of their teaching: 80 per cent said that they were
potentially useful while 59 per cent said they actually had been useful. However, this suggestion is more difficult to demonstrate empirically.
In two different studies the effect of feedback from midterm evaluations on end-
of-course criteria was tested. Both these studies were conducted with the multi-
section course in computer programming described earlier. In the first study, students
completed an abbreviated version of the student evaluation instrument at mid-term,
and the results were returned to a random half of the instructors. At the end of the
term, student ratings of "perceived change in instruction between the beginning of the term and the end of the term" were significantly higher for the feedback group, as
were ratings on two of the seven evaluation factors. Ratings on the overall course and
instructor summary items did not differ, nor did student performance on the standar-
dised final examination given to all students.
Several changes were made in the second study, which was based upon 30 classes.
First, mid-term evaluations were made on the same evaluation form that was used at
the end of the course. Second, the researchers actually met with the group of ran-
domly selected feedback instructors to discuss the ratings. At this meeting the
teachers discussed the evaluations with each other and with the researchers, but were
assured that their comments would remain confidential. A third change was the
addition of affective variables, items that focused on application of the subject matter
and student plans to pursue the subject. At the end of the term, students of the feedback instructors: (1) rated teaching effectiveness more favourably, (2) averaged
higher scores on the standardised final examination, and (3) experienced more positive
affective outcomes than students whose instructors received no feedback. Students in
the feedback group were similar to the other students in terms of both pretest achieve-
ment scores completed at the start of the term and the midterm evaluations of their
teachers. These findings suggest that the feedback from student ratings, coupled with
a frank discussion of their implications with an external consultant, can be an effective
intervention for improving teaching effectiveness.
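A minimal sketch of the kind of between-group comparison involved, treating the class as the unit of analysis and using purely illustrative values (the published studies report their own analyses):

# Illustrative sketch: compare end-of-term class-mean ratings of the feedback
# and no-feedback groups with an independent-samples t-test.
from scipy import stats

feedback_group    = [4.4, 4.6, 4.1, 4.5, 4.3, 4.7, 4.2]   # illustrative class means
no_feedback_group = [4.0, 4.2, 3.9, 4.1, 4.3, 3.8, 4.0]

res = stats.ttest_ind(feedback_group, no_feedback_group)
print(f"t = {res.statistic:.2f}, p = {res.pvalue:.3f}")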
The details of these studies have been described in two published articles (Marsh,
Fleiner and Thomas, 1975; Overall and Marsh, 1979). Similar findings have been reported by McKeachie et al. (1980) and in a meta-analysis by Cohen (1981).
In summary, the research described in this review has indicated that:
(1) SEEQ measures nine distinct components of teaching effectiveness as demons-
trated by factor analysis. Factor analysis of faculty evaluations of their own teaching
resulted in the same factors. Factor scores based upon this research are used to
summarise the student ratings that are returned to faculty.
(2) Student evaluations are quite reliable when based upon the responses of 10 to 15 or more students. Class ratings based upon fewer than ten student responses should be interpreted cautiously (see the illustrative sketch following this list).
(3) The retrospective ratings of former students agree remarkably well with the
evaluations that they made at the end of a course.
(4) Student evaluations show moderate correlations with student learning as
measured by a standardised examination and with affective course consequences such
as application of the subject matter and plans to pursue the subject further.
(5) Faculty self evaluations of their own teaching show good agreement with
student ratings.
(6) Suspected sources of bias to student ratings have little impact.
(7) Feedback from student ratings, particularly when coupled with a candid
discussion with an external consultant, can lead to improved teaching.
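As an illustration of point (2), the sketch below uses the Spearman-Brown formula, one standard way of expressing how the reliability of a class-average rating increases with the number of raters; the single-rater reliability value is hypothetical and the fragment does not reproduce the original reliability analyses.

# Illustrative sketch: reliability of a class-average rating as a function of
# the number of raters, via the Spearman-Brown formula.
def class_average_reliability(single_rater_r, n_raters):
    """Reliability of the mean of n raters, given the single-rater reliability."""
    return n_raters * single_rater_r / (1 + (n_raters - 1) * single_rater_r)

# Assuming a single-rater reliability of about 0.25 (hypothetical value):
for n in (5, 10, 15, 25, 50):
    print(n, round(class_average_reliability(0.25, n), 2))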
APPENDIX 2 (continued)
Instructor: Doe, John Class Schedule Number: 99999 Page 2 of 2
Department: Sample Department Term: Spring 78 Number of students completing evaluations: 43
Course: Sample Dept 999 Percentage of Enrolled Students completing evaluations: 92%
Evaluation items (some questions have been abbreviated):
For each question, the percentage of students making each response, the mean average response, and the Standard Error (SE) of the responses are
presented. (These statistics are based upon the actual number of students responding to the question.) In addition, the percentage of students
who completed the evaluation form but did not respond to a particular question is indicated in the No Resp column. Differences in mean
averages that are less than one standard error (see page one for a description) are too small to be reliably interpreted. In general, evaluations based
upon fewer than 10 students' responses, evaluations based upon less than 50 per cent of the class, and evaluation items which were frequently left
blank should be interpreted cautiously. The percentile ranks (which vary between 0 and 100) and the graphs show how your evaluations compare
with other courses in your comparison group. (Higher percentile ranks and more stars indicate more favourable evaluations.) Your comparison
group is:
Undergraduate courses not taught by teaching assistants
                         Percentage responding                                     Rank relative to your comparison group (see above)
                         Very Poor   Poor   Medium   Good   Very good   No resp    Mean   SE +/-   %ile Rank   Graph (0 to 9)
Learning
1. Course was intellectually challenging and stimulating 0 0 5 42 53 0 4.48 0.089 32 *****************
2. Learned something considered to be valuable 0 0 9 40 51 0 4.41 0.100 69 **************
3. Increased interest in subject as consequence of course 0 2 7 40 51 0 4.39 0.110 82 *****************
4. Learned and understood the subject materials 0 0 9 60 30 0 4.20 0.090 68 **************
Enthusiasm
5. Instructor was enthusiastic about teaching the course 0 0 2 26 72 0 4.69 0.077 84 *****************
6. Instructor was dynamic and energetic in conducting course 0 0 7 50 43 2 4.35 0.094 72 ***************
7. Instructor enhanced presentation with humour 0 5 33 36 26 2 3.82 0.135 42 *********
8. Instructor style of presentation held interest 0 7 7 53 33 0 4.11 0.125 71 ***************
Organisation
9. Instructor's explanations were clear 0 0 12 40 49 0 4.36 0.104 80 *****************
10. Course materials were well prepared and explained 0 0 7 34 59 5 4.50 0.099 88 ******************
11. Proposed objectives agreed with those actually taught 0 0 5 44 51 5 4.45 0.092 87 ******************
12. Lectures facilitated taking notes 0 2 9 28 60 0 4.46 0.116 90 *******************
Group interaction
13. Students encouraged to participate in class discussions 0 5 19 33 44 0 4.15 0.136 52 ***********
14. Students invited to share ideas and knowledge 0 2 21 40 37 0 4.11 0.125 48 **********
15. Students encouraged to ask questions and give answers 0 0 21 28 51 0 4.29 0.121 59 ************
16. Students encouraged to express own ideas 0 2 21 35 42 0 4.15 0.128 51 ***********
Individual rapport
17. Instructor was friendly towards individual students 0 5 17 45 32 7 4.04 0.133 26 ******
18. Instructor welcomed students to seek help/advice 0 7 29 29 36 2 3.92 0.149 32 *******
19. Instructor had genuine interest in individual students 0 2 25 40 32 7 4.01 0.130 48 **********
20. Instructor was accessible during office hours/after class 0 0 16 50 34 12 4.17 0.111 68 **************
Breadth
21. Instructor contrasted implications of various theories 0 2 19 36 43 2 4.18 0.128 67 **************
22. Instructor presented background of ideas/concepts 0 0 9 40 51 0 4.41 0.100 84 *****************
23. Instructor presented points of view other than own 0 2 19 45 33 2 4.09 0.121 53 ***********
24. Instructor discussed current developments in field 0 2 16 35 47 0 4.25 0.125 58 ************
Examinations
25. Feedback on exams/graded materials was valuable 2 12 31 40 14 2 3.51 0.148 38 ********
26. Method of evaluation was fair and appropriate 0 5 21 52 21 2 3.89 0.121 56 ************
27. Graded materials tested course content as emphasised 0 0 24 43 33 2 4.09 0.116 63 *************
Assignments
28. Required readings/texts were valuable 0 0 7 49 44 0 4.36 0.093 90 *******************
29. Assignments contributed to appreciation/understanding 0 0 10 48 43 2 4.32 0.099 83 *****************
Overall
30. How does this course compare with others at USC? 0 0 10 35 55 2 4.44 0.102 83 *****************
31. How does this instructor compare with others at USC? 0 0 5 29 67 2 4.61 0.089 84 *****************
† The actual Summary Report is a two-page computer printout that is produced as part of the data processing.