R&D Connections • No. 21 • March 2013
Key Concepts

Constructed-response item: A test question that requires the test takers to supply the answer, instead of choosing it from a list of possibilities.

Common Core State Standards (CCSS): A set of curricular goals in English language arts and mathematics adopted by most states for students in grades K–12.

Large-scale testing programs: Assessments taken by a large volume of test takers, such as ETS’s TOEFL iBT® test and Pearson’s PTE.

High-stakes decision: A judgment, based in part on test results, that has significant consequences for an individual, a group, or an institution, such as college admission, graduation, and school sanctions.

Editor’s note: Mo Zhang is an associate research scientist in ETS’s Research & Development division.

Essay scoring has traditionally relied on human raters, who understand both the content and the quality of writing. However, the increasing use of constructed-response items, and the large number of students that will be exposed to such items in assessments based on the Common Core State Standards (CCSS), raise questions about the viability of relying on human scoring alone. This scoring method is expensive, requires extensive logistical efforts, and depends on less-than-perfect human judgment. Testing programs are therefore tapping into the power of computers to score constructed-response items efficiently. The interest in automated scoring of essays is not new and has recently received additional attention from two federally supported consortia, PARCC and Smarter Balanced, which intend to incorporate automated scoring into their common core state assessments planned for 2014. Nonetheless, human labor cannot simply be replaced with machines, since human scoring and automated scoring have different strengths and limitations. In this essay, the two scoring methods are compared from measurement and logistical perspectives. Conclusions are drawn from research literature, including ETS research, to summarize the current state of automated essay scoring technology.

The published research has few in-depth comparisons of the advantages and limitations of automated and human scoring. There are also debates in academia, the media, and among the general public concerning the use of automated scoring of essays in standardized tests and in electronic learning environments used in and outside of classrooms. It is important for test developers, policymakers, and educators to have sufficient knowledge about the strengths and weaknesses of each scoring method in order to prevent misuse in a testing program. The purpose of this essay is to contrast significant characteristics of the two scoring methods, elucidate their differences, and discuss their practical implications for testing programs.

Human Scoring

Many large-scale testing programs in the United States include at least one essay-writing item. Examples include the GMAT® test administered by the Graduate Management Admission Council®, the GRE® revised General Test administered by ETS, as well as the Pearson® Test of English (PTE).
The written responses to such items are far more complex than responses to multiple-choice items, and are traditionally scored by human judges. Human raters typically gauge an essay’s quality aided by a scoring rubric that identifies the characteristics an essay must have to merit a certain score level. Some of the strengths of scoring by human graders are that they can (a) cognitively process the information given in a text, (b) connect it with their prior knowledge, and (c) based on their understanding of the content, make a judgment on the quality of the text. Trained human raters are able to recognize and appreciate a writer’s creativity and style (e.g., artistic, ironic, rhetorical), as well as evaluate the relevance of an essay’s content to the prompt. A human rater can also judge an examinee’s critical thinking skills, including the quality of the argumentation and the factual correctness of the claims made in the essay.
For all its strengths, human scoring has its limitations. To begin with, qualified human
raters must be recruited. Next, they must be instructed in how to use the scoring rubric
and their rating competencies must be certified prior to engaging in operational
grading. Finally, they must be closely monitored (and retrained if necessary) to ensure
the quality and consistency of their ratings. (See Baldwin, Fowles, & Livingston, 2005,
for ETS’s policies on performance assessment scoring.) In 2012, more than 655,000
test takers worldwide took the GRE revised General Test (ETS, 2013), with each test
taker responding to two essay prompts, producing a total of more than 1.3 million
responses. Obviously, involving humans in grading such high volumes, especially in
large-scale assessments like the GRE test, can be labor intensive, time consuming,
and expensive.
Humans can also make mistakes due to cognitive limitations that can be difficult or
even impossible to quantify, which in turn can add systematic biases to the final scores
(Bejar, 2011).
Table 1 exemplifies sources of human error known from the research literature.
It is also worth emphasizing that there has been relatively little published research on human-rater cognition (e.g., see Suto, Crisp, & Greatorex, 2008). Hence, what goes on in a rater’s mind when passing judgment is not well known, particularly under operational scoring conditions. This lack of knowledge about the cognitive basis for human scoring could, in turn, undermine confidence in the validity of the automated scores produced by computers designed to emulate those human ratings. It is because of these known limitations of human scoring that consequential testing programs, such as those for admissions or licensure, typically use more than one human rater for each response, and adjudicate discrepancies between the raters if necessary.
Automated Scoring
Automated scoring has the potential to provide solutions to some of the obvious
shortcomings in human essay scoring (e.g., rater drift). Today’s state-of-the-art
systems for computer-based scoring involve construct-relevant aggregation of
quantifiable text features in order to evaluate the quality of an essay. These systems
work exclusively with variables that can be extracted and combined mathematically.
Humans, on the other hand, make holistic decisions under the influence of many
interacting factors.
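To make the idea concrete, the following sketch extracts a few quantifiable features from an essay and combines them with fixed weights into a single score. It is a minimal illustration of the “extract and combine mathematically” principle only; the features, weights, and 0–6 scale are invented for this example and do not describe the e-rater engine or any other operational system.

    import re

    def extract_features(essay: str) -> dict:
        """Compute a handful of quantifiable text features (illustrative only)."""
        words = re.findall(r"[A-Za-z']+", essay)
        sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
        word_count = len(words)
        return {
            "word_count": word_count,
            "avg_sentence_length": word_count / max(len(sentences), 1),
            "avg_word_length": sum(len(w) for w in words) / max(word_count, 1),
            "type_token_ratio": len({w.lower() for w in words}) / max(word_count, 1),
        }

    # Hypothetical weights on a hypothetical 0-6 holistic scale; an operational
    # engine would estimate weights from large sets of human-scored essays.
    WEIGHTS = {
        "word_count": 0.004,
        "avg_sentence_length": 0.05,
        "avg_word_length": 0.3,
        "type_token_ratio": 1.5,
    }

    def score_essay(essay: str) -> float:
        """Combine the extracted features into a single holistic score."""
        features = extract_features(essay)
        raw = sum(WEIGHTS[name] * value for name, value in features.items())
        return max(0.0, min(6.0, raw))  # clip to the assumed 0-6 scale

    print(score_essay("Automated scoring rests on measurable properties of text. "
                      "This toy example simply counts and averages them."))

Operational engines derive much richer, rubric-aligned features through natural language processing and estimate the weights from large samples of human-scored essays rather than fixing them by hand.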
The primary strengths of automated scoring compared to human scoring lie in its efficiency, its absolute consistency in applying the same evaluation criteria across essay submissions and over time, and its ability to provide fine-grained, instantaneous feedback. Computers are neither influenced by external factors (e.g., deadlines) nor
emotionally attached to an essay. Computers are not biased by their stereotypes or
preconceptions of a group of examinees. Automated scoring can therefore achieve
greater objectivity than human scoring (Williamson, Bejar, & Hone, 1999). Most
automated scoring systems can generate nearly real-time performance feedback on
various dimensions of writing. For example, ETS’s e-rater® engine can provide feedback
on grammar, word usage, mechanics, style, organization, and development of a
written text (ETS, 2008). Similarly, Pearson’s Intelligent Essay Assessor™ can provide
feedback on six aspects of writing — ideas, organization, conventions, sentence
fluency, word choice, and voice (Pearson Education, Inc., 2011). It would be quite
difficult for human raters to offer such analytical feedback immediately for large
numbers of essays.
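As a toy illustration of what dimension-level feedback can look like, the sketch below applies two crude checks and groups the resulting comments under assumed dimension names. The heuristics and messages are invented for illustration and do not reflect how the e-rater engine or the Intelligent Essay Assessor generates feedback.

    import re

    def analytic_feedback(essay: str) -> dict:
        """Group simple comments under writing dimensions (toy heuristics only)."""
        feedback = {"mechanics": [], "development": []}

        # Mechanics: flag sentences that do not begin with a capital letter.
        for sentence in re.split(r"(?<=[.!?])\s+", essay.strip()):
            if sentence and sentence[0].islower():
                feedback["mechanics"].append(
                    "Sentence may need an initial capital: " + sentence[:30])

        # Development: prompt for elaboration when the response is very short.
        if len(essay.split()) < 50:
            feedback["development"].append(
                "The response is brief; consider developing the ideas further.")

        return feedback

    print(analytic_feedback("this is a short response. It has only two sentences."))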
Automated scoring systems are often able to evaluate essays across grade levels (e.g.,
the e-rater engine, Intelligent Essay Assessor, Vantage Learning’s IntelliMetric®). Human
graders, in contrast, are usually trained to focus on a certain grade range associated
with a specific rubric and a set of tasks. Shifting a human rater to a new grade range
may therefore require considerable retraining.
Further, some automated scoring systems are able to grade essays written in
languages other than English (e.g., Intelligent Essay Assessor; Pearson Education, Inc.,
2012). This capability could facilitate the scoring of tests that are translated into other
languages for international administration, relieving the potential burden of recruiting
and training a large pool of human graders for alternate language assessment. There are also automated scoring systems that are able to detect inauthentic authorship (e.g., IntelliMetric; Rudner, Garcia, & Welch, 2005), which human raters may not be able to do as readily as computers. It is worth noting that these alternate language and inauthentic authorship capabilities have not been widely researched. Still, these directions represent a potential path to improve upon human scoring.

Notwithstanding its strengths, it must be recognized that automated scoring systems generally evaluate relatively rudimentary text-production skills (Williamson et al., 2010), such as the use of subject-verb agreement evaluated by the grammar feature in the e-rater engine, and spelling and capitalization as evaluated by the mechanics feature. Current automated essay-scoring systems cannot directly assess some of the more cognitively demanding aspects of writing proficiency, such as audience awareness, argumentation, critical thinking, and creativity. The current systems are also not well positioned to evaluate the specific content of an essay, including the factual correctness of a claim. Moreover, these systems can only superficially evaluate the rhetorical writing style of a test taker, while trained human raters can appreciate and evaluate rhetorical style on a deeper level.
A final weakness of automated scoring systems is that they are generally designed to
“learn” the evaluation criteria by analyzing human-graded essays. This design implies
that automated scoring systems could inherit not only positive qualities, but also
any rating biases or other undesirable patterns of scoring present in the scores from
human raters.
Table 2 summarizes the strengths and weaknesses of human and automated scoring
as discussed above. The summary compares and contrasts the two methods from two
aspects: measurement and logistical effort.
Table 2: A Summary of Strengths and Weaknesses in Human and Automated Scoring of Essays

Measurement
  Human raters are able to:
  • Comprehend the meaning of the text being graded
  • Make reasonable and logical judgments on the overall quality of the essay
  Automated systems are able to assess:
  • Surface-level content relevance
  • Development
  • Grammar
  • Mechanics
  Human raters are subject to:
  • Inconsistency error
  • Subjectivity
  • Perception difference error
  • Severity error
  • Scale shrinkage error
  • Stereotyping error
  Automated systems also:
  • Inherit biases/errors from human raters

Logistical
  Automated systems can allow:
  • Quick re-scoring
While automated scoring can defensibly be used as the sole evaluation mechanism in low-stakes settings, the current state of the art is not sufficiently mature to endorse automated essay scoring as the sole score in assessments linked to high-stakes decisions. The following three conditions need to be met before automated scoring is applied in high-stakes environments as the only scoring method for essays. ETS and other institutions are doing substantive research and development toward meeting these goals.

• The internal mechanism for an automated scoring system is reasonably transparent, especially in cases where the automated scores are used to contribute to high-stakes decisions. Only when the system’s procedures for generating scores are transparent can score users and the scientific community associated with those users fully evaluate the system’s usefulness and potential liabilities.

• A sufficiently broad base of validity evidence is collected at the early stage of the implementation so that the deployment of the automated scoring system can be justified.

• A quality-control system is in place so that aberrant performance on the part of the automated scoring system can be detected in time to prevent reporting incorrect results (a sketch of one such check follows this list).
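One form such a quality-control check can take, sketched below under illustrative assumptions, is to have human raters double-score a sample of responses and flag any essay whose automated score strays from the human rating by more than a preset threshold, routing it for review before scores are reported. The 0–6 scale, threshold, and routing rule here are assumptions, not the practice of any particular testing program.

    from dataclasses import dataclass

    # Illustrative threshold on an assumed 0-6 holistic scale.
    DISCREPANCY_THRESHOLD = 1.5

    @dataclass
    class ScoredEssay:
        essay_id: str
        human_score: float
        machine_score: float

    def needs_adjudication(record: ScoredEssay) -> bool:
        """Flag responses whose machine score strays far from the human rating."""
        return abs(record.human_score - record.machine_score) > DISCREPANCY_THRESHOLD

    batch = [
        ScoredEssay("A-001", human_score=4.0, machine_score=4.5),
        ScoredEssay("A-002", human_score=5.0, machine_score=2.5),  # large gap
    ]
    print([r.essay_id for r in batch if needs_adjudication(r)])  # ['A-002']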
There are successful examples of using the e-rater engine in ETS testing products representing each of these alternatives. The e-rater engine contributes to the final TOEFL iBT essay score in conjunction with human ratings, and it is used to confirm human ratings in the GRE Analytical Writing assessment. For both programs, decisions on implementation were made in collaboration with program administrators and based on substantive research evidence addressing both human and automated scoring methods.

Implications for Validation

Human ratings are often used not only as a development target, but also as an evaluation criterion, for automated scoring. Agreement with human ratings in itself, however, is not enough to support the use of automated scores, particularly if evidence supporting the validity of the human ratings is insufficient.
Human rating as a development target
The scores generated by automated systems are usually modeled to predict
operational human ratings, but there is an underlying assumption that those ratings
are a valid development target. If not, errors and biases associated with human ratings
could propagate to the automated scoring system.2 Aside from gathering data to
support the validity of operational human ratings, approaches to deal with this issue
might include modeling the ratings gathered under conditions where the raters are
under less pressure, and constructing scoring models using experts’ judgments of the
construct relevance and importance of different text features.
2 This concept was referred to as a “first-order validity argument” in Bejar (2012). More specifically, appraisal of the human ratings is a prerequisite for all subsequent assumptions and inferences, and the omission of such an appraisal would lead one to question the subsequent claims and conclusions.
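A minimal sketch of the modeling step described above, assuming that numeric features have already been extracted from each essay and that operational human ratings are available as the prediction target; ordinary least-squares regression (via scikit-learn) stands in for whatever estimation method a given engine actually uses, and all values are made up.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Made-up training data: each row holds features extracted from one essay
    # (e.g., word count, average sentence length, average word length).
    X_train = np.array([
        [150, 12.0, 4.1],
        [320, 18.5, 4.8],
        [ 90,  9.0, 3.9],
        [410, 21.0, 5.2],
    ])
    # Operational human ratings for the same essays: the development target.
    y_train = np.array([2.0, 4.0, 1.0, 5.0])

    # The scoring model "learns" weights that best reproduce the human ratings,
    # so any systematic bias in those ratings is inherited by the model.
    model = LinearRegression().fit(X_train, y_train)

    # Score a new, unseen essay from its extracted features.
    print(model.predict(np.array([[280, 16.0, 4.6]])))

The same mechanics would apply if the target were ratings gathered under less pressured conditions, or if the weights were set from expert judgments of feature relevance rather than estimated from operational ratings.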
In any event, developers and users of testing programs need to be aware of the potential consequences of any differences in score meaning between human and automated scoring for population groups.
Closing Statement
Advances in artificial intelligence technologies have made machine scoring of essays
a realistic option. Research and practical experience suggest that the technique
is promising for various testing purposes, and that it will be used more widely in
educational assessments in the near future. However, it is important to understand
the fundamental differences between automated and human scoring and be aware of
the consequences of choices in scoring methods and their implementation. Knowing
their strengths and weaknesses allows testing program directors and policymakers
to make strategic decisions about the right balance of methods for particular testing
purposes. Automated scoring can — when carefully deployed — contribute to the
efficient generation and delivery of high-quality essay scores, and it has the potential
to improve writing education in K–12, as well as in higher education, as the capability
becomes more mature.
Acknowledgments
The author would like to thank David Williamson, Randy Bennett, Michael Heilman,
Keelan Evanini, Isaac Bejar, and the editor of R&D Connections for their helpful
comments on earlier versions of this essay. The author is solely responsible for any
errors that may remain. Any opinions expressed in this essay are those of the author,
and not necessarily those of Educational Testing Service.
References

Baldwin, D., Fowles, M., & Livingston, S. (2005). Guidelines for constructed-response and other performance assessments. Retrieved from http://www.ets.org/Media/About_ETS/pdf/8561_ConstructedResponse_guidelines.pdf

Pearson Education, Inc. (2012). Intelligent Essay Assessor™ (IEA) fact sheet. Retrieved from http://kt.pearsonassessments.com/download/IEA-FactSheet-20100401.pdf

Powers, D. E., Burstein, J. C., Chodorow, M., Fowles, M. E., & Kukich, K. (2001). Stumping e-rater: Challenging the validity of automated essay scoring (GRE Board Professional Report No. 98-08bP; ETS Research Report No. 01-03). Retrieved from http://www.ets.org/Media/Research/pdf/RR-01-03-Powers.pdf

Rudner, L. M., Garcia, V., & Welch, C. (2005). An evaluation of IntelliMetric™ essay scoring system using responses to GMAT® AWA prompts. Retrieved from http://www.gmac.com/~/media/Files/gmac/Research/research-report-series/RR0508_IntelliMetricAWA

Suto, I., Crisp, V., & Greatorex, J. (2008). Investigating the judgmental marking process: An overview of our recent research. Research Matters, 5, 6–9.

Williamson, D. M., Bejar, I. I., & Hone, A. S. (1999). Mental model comparison of automated and human scoring. Journal of Educational Measurement, 36, 158–184.

Williamson, D. M., Bennett, R. E., Lazer, S., Bernstein, J., Foltz, P. W., Landauer, T. K., & Sweeney, K. (2010). Automated scoring for the assessment of common core standards. Retrieved from http://research.collegeboard.org/sites/default/files/publications/2012/8/ccss-2010-5-automated-scoring-assessment-common-core-standards.pdf

Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31, 2–13. Retrieved from http://onlinelibrary.wiley.com/doi/10.1111/j.1745-3992.2011.00223.x/abstract