


Creating Scoring Rubric from Representative Student Answers
for Improved Short Answer Grading

Smit Marvaniya, IBM Research - India ([email protected])
Swarnadeep Saha, IBM Research - India ([email protected])
Tejas I. Dhamecha, IBM Research - India ([email protected])
Peter Foltz, Pearson ([email protected])
Renuka Sindhgatta, IBM Research - India ([email protected])
Bikram Sengupta, IBM Research - India ([email protected])

ABSTRACT
Automatic short answer grading remains one of the key challenges of any dialog-based tutoring system due to the variability in the student answers. Typically, each question may have no or few expert-authored exemplary answers, which makes it difficult to (1) generalize to all correct ways of answering the question, or (2) represent answers which are either partially correct or incorrect. In this paper, we propose an affinity propagation based clustering technique to obtain class-specific representative answers from the graded student answers. Our novelty lies in formulating the Scoring Rubric by incorporating class-specific representatives obtained after the proposed clustering, selection, and ranking of graded student answers. We experiment with baseline as well as state-of-the-art sentence-embedding based features to demonstrate the feature-agnostic utility of class-specific representative answers. Experimental evaluations on our large-scale industry dataset and a benchmarking dataset show that the Scoring Rubric significantly improves the classification performance of short answer grading.

Figure 1: Example of student interactions with a dialog-based tutoring system.

KEYWORDS
Short Answer Grading, Scoring Rubric, Clustering, Supervised Learning, Classification, Sentence Embeddings
ACM Reference Format:
Smit Marvaniya, Swarnadeep Saha, Tejas I. Dhamecha, Peter Foltz, Renuka Sindhgatta, and Bikram Sengupta. 2018. Creating Scoring Rubric from Representative Student Answers for Improved Short Answer Grading. In The 27th ACM International Conference on Information and Knowledge Management (CIKM '18), October 22–26, 2018, Torino, Italy. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3269206.3271755

© 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6014-2/18/10.

1 INTRODUCTION
The dialog-based tutoring system is a type of intelligent tutoring system, where learning is driven by a natural language dialog between a student and the tutoring system [13]. The tutoring system guides the student by asking questions, analyzing student responses, and providing relevant feedback (as shown in Figure 1). Hence, automatically analyzing short student answers is a critical requirement of any dialog-based intelligent tutoring system (ITS). Student Response Analysis (SRA) [9] is the task of assessing a student answer in comparison to a (set of) reference answer(s) for a given question. A key challenge of SRA is to approximate human tutor performance by interpreting the correctness of student answers. Human tutors are often able to consider a student response as correct even if it does not strictly match a reference answer or is not a semantic alternative of it. They use their judgment and world knowledge to grade answers. Consider the example in Table 1. The correct responses are not semantic alternatives of the reference answer in the strict sense. However, annotators mark them as correct based on their judgment of the student's understanding. Scenarios like this warrant the need for creating additional reference answers.
There are three ways to address this challenge. The first is to create additional reference answers with the help of Subject Matter Experts (SMEs). The second is to collect additional answers by sampling a large number of students. However, both of these options are expensive: the first due to SME costs, and the second due to the difficulty of collecting a large enough set of student answers to cover the variations. The third way is to build matching systems that can capture a general model of the domain by utilizing limited student answers. Such a system should generalize not just to straight semantic alternatives, but should also understand varying degrees of correctness across answer variants. Building on this research direction, we present a solution that automatically creates a Scoring Rubric from student answers. As shown in Table 2, the Scoring Rubric contains exemplary student answers at varying degrees of correctness (in this case, correct, partially correct, and incorrect). Intuitively, this modeling contains information that helps the classifiers understand the student answer space holistically, as opposed to traditional and multiple-reference models that emphasize a subspace describing the intra-class variations of only correct answers.

This research focuses on leveraging graded student answers for obtaining a Scoring Rubric. Table 1 shows certain class-specific exemplars that are part of a Scoring Rubric. Note the extra information that the exemplars contain compared to the reference answer. It is our assertion that formulating a representative Scoring Rubric, and using it efficiently for feature extraction, can significantly improve short answer grading. Overall, this paper makes the following research contributions:

• We propose a method to formulate a Scoring Rubric (SR) from graded student answers. This is achieved by clustering the student answers and selecting and ranking cluster representatives for each grade category. Our SR is a generic concept that is independent of the exact grade categories in any short answer grading task. In this work, we formalize and utilize the notion of Scoring Rubric for the classification task in SRA.
• We present generic methods to extract features from student answers with respect to the SR. These feature extraction techniques extend traditional techniques of feature representation that use only reference answer(s).
• We experiment on a benchmarking dataset (the SemEval-2013 dataset [7]) and a large-scale industry dataset with simple lexical baseline features and sophisticated sentence-embedding based features. Substantial improvements signify the feature-agnostic utility of SR for short answer grading. Empirically, we also show that SR outperforms existing techniques that compare the student answer with only reference answer(s).
• Finally, on the SemEval-2013 dataset, we show that our SR in conjunction with sentence-embedding based features yields better or comparable results to state-of-the-art systems.

Table 1: Examples of student answers for a given question, representing inter- and intra-class variations.

Question: With whom do older adults seek support?
Reference Answer: Older adults seek support from people who understand what they experience in old age.
Correct Exemplars: People experiencing similar problems | other older adults like them | those in similar situations as them | Support groups, peers, relationship's with others
Partial Exemplars: From people who are caring and interested in their lives and well-being. | Friends and family.
Incorrect Exemplars: I think they seek support from there children if they have any. | Woman | A Counselor

Table 2: Various models for short answer grading. δ: classifier, q: question, a: student answer, r: reference answer, c_j, p_j, and i_j: exemplar correct, partially correct, and incorrect student answers, respectively.

Traditional Model: δ(a | q, r)
Multiple References Model: δ(a | q, {r, c_1, c_2, . . . , c_k})
Scoring Rubric Model: δ(a | q, {c: {r, c_1, . . . , c_k}, p: {p_1, p_2, . . . , p_k}, i: {i_1, i_2, . . . , i_k}})

2 LITERATURE
Researchers have been working on the problem of short answer grading for more than a decade now [20], with the SemEval-2013 challenge on student response analysis [7] further formalizing the problem, benchmarking datasets, and evaluation protocols. Research has focused on various aspects of the problem, including the usage of knowledge bases [20], patterns [25, 29, 30], graph alignment [19], corpus-based similarity [12], combinations of simple features [32], clustering [17], and Bayesian domain adaptation [31]. Efforts have also been made toward effective annotation [2, 3, 6] and content assessment [24]. Further, modern deep learning approaches have been explored for short answer grading [16, 27, 28, 34] and related problems, such as textual entailment [5, 22]. However, the majority of the literature utilizes a traditional model δ(a | q, r), as defined in Table 2, for grading the student answer a in the context of the question q and the SME-created reference answer r.

Limited research has focused on either using multiple alternate representations [25, 26] or obtaining generalizable lexical representations of the reference answer [8]. Broadly, the research in this direction can be considered as following the Multiple References (MR) Model as defined in Table 2. Dzikovska et al. [8] use rules to cover semantic information not encoded in the ontology for evaluating student responses. Ramachandran and Foltz [26] use top-scoring student responses to automatically extract patterns. They further use summaries of correct student responses to create alternate reference answers [25]. Often, a student answer is compared with all the reference answers; if any of the comparisons yields a match, the student answer is predicted as correct.

Mitchell et al. [18] propose a semi-automated approach to create a Marking Scheme for short answer evaluation. A Marking Scheme consists of sets of answers with certain degrees of correctness. Utilization of alternate reference answers can be seen as a special case of creating a marking scheme. The benefits of using alternate reference answers vary depending on the grading technique. Broadly, it is helpful to add those answers as alternate reference answers that the grading technique is not otherwise able to match. In a similar direction, our paper generalizes the idea of having additional reference answers by incorporating representative answers from all grade categories. Representative correct answers by themselves are often not indicative of the answers from other categories, more so when the categories are many and hard to tell apart. We call this generalization the Scoring Rubric (SR); it models the inter-class variations better for the task of short answer grading.

3 PROPOSED APPROACH
Given a set of graded student answers to a question as input, our proposed approach first outputs a Scoring Rubric consisting of representative answers from each grade category (e.g., correct, partial, incorrect), followed by an efficient feature representation for the end task of short answer grading. Figure 2 illustrates the detailed steps involved in the process. Computing the scoring rubric is a 3-step process involving clustering, representative selection, and ranking. First, the student answers belonging to each grade are clustered independently. The clusters in each grade category represent different ways of answering the question for that grade. Next, a representative student answer is selected from each cluster. It signifies a candidate answer for the scoring rubric of the respective grade category. The scoring rubric consists of a set of ranked student answers that are short and grammatically well-formed for each of the grade categories. For the end classification task, a feature representation for each student answer is computed in the context of the scoring rubric. The individual steps are described in detail below.

3.1 Clustering Graded Student Answers
Clustering the student answers belonging to a grade category identifies various ways of answering the question in that category. For example, clustering all correct student answers identifies different ways of answering the question correctly. The clustering process involves design decisions pertaining to the choice of 1) clustering algorithm and 2) similarity metric.

The number of clusters, which translates to the number of ways of answering a question for a grade, is unknown. This challenge rules out clustering techniques that are parametric in the number of clusters, e.g., k-means. Therefore, we employ an affinity propagation [11] based clustering algorithm for clustering the student answers.

Our similarity metric for the clustering algorithm measures the similarity between two student answers. Given two student answers, we devise three similarity metrics as described below: 1) token-based similarity, 2) sentence-embedding based similarity, and 3) a combination of 1) and 2). For factoid questions, the similarity between keyword tokens should suffice as a similarity metric, whereas for a relatively broader question, semantic similarity is more important. Therefore, we employ both token and sentence-embedding based similarity metrics.

Token-based Similarity. The token-based similarity between two student answers u and v is computed using the Dice coefficient [10], which is defined as:

T(u, v) = 2 · |α_u ∩ α_v| / (|α_u| + |α_v|)    (1)

where α_u and α_v are the token bags for u and v, respectively. A token bag for a student answer is computed after removing stop words from the answer, followed by question demoting [19, 32]. The term α_u ∩ α_v represents the overlapping tokens between the two bags. A word w_i ∈ u is considered overlapping with a word w_j ∈ v if 1) they are exact matches, 2) they are synonyms, or 3) the cosine distance between their word vectors [21] is less than a certain threshold θ. In our experiments, we have empirically chosen the value of θ as 0.7.

Sentence-Embedding based Similarity. Our next choice of similarity metric is based on sentence embeddings that help capture the semantics of sentences. We use state-of-the-art InferSent [5] embeddings for encoding the student answers. InferSent embeddings, in general, are shown to work well across various NLP tasks. The pre-trained model is a bidirectional LSTM trained on the Stanford Natural Language Inference dataset. Given the embeddings E(u) and E(v) for two student answers u and v respectively, the similarity between them is computed as:

S(u, v) = (1/2) · (1 + E(u) · E(v) / (||E(u)|| · ||E(v)||))    (2)

Combined Similarity. Our combined similarity metric is a combination of the token-based and sentence-embedding based similarities. We compute the similarity score between student answers u and v as the weighted sum of their token-based and sentence-embedding based similarity scores, formally given as

H(u, v) = β · S(u, v) + (1 − β) · T(u, v)    (3)

where β is the weight parameter. Intuitively, for a factoid question, the presence of keyword(s) dictates the correctness of the answer. This is captured using the token-based similarity. On the other hand, the sentence-embedding based similarity score is indicative of the overall similarity for broader questions. A value of β = 0.5 is used in our experiments.
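To make these metrics concrete, the following sketch implements Eqs. 1-3 under simplifying assumptions: stop-word removal and question demoting are reduced to removing question tokens, the synonym and word-vector tests of the token overlap are omitted (exact matches only), and embed stands in for an InferSent-style sentence encoder. The helper names are illustrative, not taken from the paper's code.

import numpy as np

STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "for"}  # tiny illustrative list

def token_bag(answer, question):
    """Tokenize, drop stop words, and demote (remove) tokens appearing in the question."""
    q_tokens = {w.lower() for w in question.split()}
    return {w.lower() for w in answer.split()
            if w.lower() not in STOP_WORDS and w.lower() not in q_tokens}

def token_similarity(u, v, question):
    """Eq. 1: Dice coefficient over the two token bags (exact-match overlap only here)."""
    a, b = token_bag(u, question), token_bag(v, question)
    if not a and not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))

def embedding_similarity(e_u, e_v):
    """Eq. 2: cosine similarity of the sentence embeddings, rescaled from [-1, 1] to [0, 1]."""
    cos = float(e_u @ e_v / (np.linalg.norm(e_u) * np.linalg.norm(e_v)))
    return 0.5 * (1.0 + cos)

def combined_similarity(u, v, question, embed, beta=0.5):
    """Eq. 3: weighted sum of sentence-embedding and token-based similarities."""
    s = embedding_similarity(embed(u), embed(v))
    t = token_similarity(u, v, question)
    return beta * s + (1.0 - beta) * t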

Using these similarity metrics, a set of clusters is obtained for each grade. Figure 3 shows the cluster distributions obtained with the token-based, sentence-embedding based, and combined similarity methods, respectively, on our large-scale industry dataset. These clusters are used for representative selection as described in the following subsection.
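A minimal sketch of the per-grade clustering step, assuming scikit-learn's AffinityPropagation and the combined_similarity helper sketched above; passing a precomputed similarity matrix is what lets the number of clusters remain unspecified. The grouping of answers by grade and the helper names are illustrative.

import numpy as np
from sklearn.cluster import AffinityPropagation

def cluster_answers(answers, question, embed, beta=0.5):
    """Cluster the graded answers of one grade category with affinity propagation.

    Returns a list of clusters, each cluster being a list of answer strings.
    """
    n = len(answers)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sim[i, j] = combined_similarity(answers[i], answers[j], question, embed, beta)

    # affinity="precomputed" makes the estimator consume our similarity matrix directly.
    ap = AffinityPropagation(affinity="precomputed", random_state=0)
    labels = ap.fit_predict(sim)

    clusters = {}
    for answer, label in zip(answers, labels):
        clusters.setdefault(label, []).append(answer)
    return list(clusters.values())

# Usage: the clustering is run independently for every grade category of a question, e.g.
# graded = {"correct": [...], "partial": [...], "incorrect": [...]}
# per_grade_clusters = {g: cluster_answers(a, question, embed) for g, a in graded.items()}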
3.2 Representative Selection
Representative selection involves selecting, for each cluster, the student answer that is most representative of that cluster. A cluster representative indicates a distinct way of answering the question. In the simplest of approaches, the cluster centroid may be considered as the cluster representative. However, given that the samples here are student answers, the notion of representativeness may not correspond to that of the cluster center. Ideally, the smallest correct answer exhibits what is sufficient to answer a question correctly. Moreover, having a grammatically well-formed answer is helpful for the various kinds of grading techniques that rely on parsing [4, 35]. Table 3 shows examples of short and grammatically well-formed answers, and otherwise, from our industry dataset.

In light of these two observations, we propose to identify representative answers using a sentence construction metric (C), which is a linear combination of 1) a sentence length metric (L) and 2) a sentence parsing metric (P). Formally, it is defined as follows:

C(u) = α · L(u) + (1 − α) · P(u)    (4)

where P(u) represents the confidence score of the dependency parse [4] of the student answer. The dependency parse score P(u) is a likelihood probability computed from inter-word dependencies and the rarity of the words in the sentence, which can serve as a proxy for the complexity of the utterance. In this way, the dependency parse score P(u) helps in selecting student answers which are relatively simple. Figures 4(a) and 4(b) show examples of dependency parses (generated using Stanford CoreNLP, http://corenlp.run) for a short and grammatically well-formed student answer and for a lengthy and complex student answer from our large-scale dataset, respectively. L(u) represents the length score of a sentence u, estimated after question demoting: it is the inverse of the normalized length of the sentence, which helps in preferring shorter sentences over longer ones. α is a weight parameter. Note that both the length and parsing metric values are normalized to be bounded in [0, 1] before computing the construction metric. The student answer with the highest construction score is identified as the cluster representative. A value of α = 0.5 is used in our experiments.
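The selection step can be sketched as follows. The dependency-parse confidence P(u) is taken from whatever parser is used and is therefore passed in as a callable here, and min-max normalization within a cluster is one plausible reading of "normalized to be bounded in [0, 1]"; both are assumptions for illustration, reusing the token_bag helper sketched earlier.

def length_score(answer, question):
    """L(u): inverse of the (question-demoted) answer length, so shorter answers score higher."""
    kept = token_bag(answer, question)
    return 1.0 / (1.0 + len(kept))

def minmax(values):
    """Rescale a list of scores to [0, 1]; degenerate clusters get a constant 0.5."""
    lo, hi = min(values), max(values)
    return [0.5 if hi == lo else (v - lo) / (hi - lo) for v in values]

def select_representative(cluster, question, parse_score, alpha=0.5):
    """Eq. 4: C(u) = alpha * L(u) + (1 - alpha) * P(u); return the highest-scoring answer."""
    lengths = minmax([length_score(a, question) for a in cluster])
    parses = minmax([parse_score(a) for a in cluster])   # P(u): dependency-parse confidence
    scores = [alpha * l + (1 - alpha) * p for l, p in zip(lengths, parses)]
    return cluster[scores.index(max(scores))]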
Figure 2: Our proposed approach for clustering, selection, and ranking of student answers and formulating the Scoring Rubric for improved short answer grading.

Figure 3: Cluster size vs. normalized question frequency on our large-scale industry dataset. (a) token-based similarity (b) sentence-embedding based similarity (c) combined similarity.

3.3 Ranking
There are situations when the number of clusters is large, either due to limitations of the clustering algorithms or due to high intra-class variations. This results in an enlarged scoring rubric. Presumably, there is an optimal size of the scoring rubric, beyond which elaborating the scoring rubric may not yield any further benefits. In fact, very large scoring rubrics may even confuse the classifier, thereby degrading the performance. Therefore, it is necessary to create a ranked order of the representatives to obtain the final scoring rubric. We propose a ranking approach based on two observations: 1) a cluster with more student answers represents a more likely way of answering the question, so the corresponding representative should be part of the scoring rubric; 2) the Scoring Rubric should include well-formed and sufficient answers. To reflect both these observations in the ranking scheme, we propose the following function (R) to score a representative:

R(u) = C(u) · (|π| − |π_min|) / (|π_max| − |π_min|)    (5)

where π is the cluster that the representative answer u belongs to, π_min and π_max represent the smallest and largest clusters, and C(·) is the sentence construction metric as defined in Eq. 4. This way of ranking the cluster representatives helps in identifying the most representative well-formed student answers. Values of 1 and 9 are used for the minimum cluster size (|π_min|) and the maximum cluster size (|π_max|) in our experiments.
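A sketch of the ranking step, assuming each candidate representative carries its construction score C(u) and the size of the cluster it came from; the values π_min = 1 and π_max = 9 follow the experimental setting reported above, and the tuple layout is illustrative.

def rank_representatives(candidates, pi_min=1, pi_max=9):
    """Eq. 5: R(u) = C(u) * (|pi| - |pi_min|) / (|pi_max| - |pi_min|).

    candidates: list of (answer, construction_score, cluster_size) tuples.
    Returns answers ordered from most to least suitable for the Scoring Rubric.
    """
    def rank_score(item):
        _, c_score, cluster_size = item
        return c_score * (cluster_size - pi_min) / (pi_max - pi_min)

    ranked = sorted(candidates, key=rank_score, reverse=True)
    return [answer for answer, _, _ in ranked]

# The final Scoring Rubric keeps only the top-ranked representatives per grade,
# e.g. the top 3 for the correct class and the top 1 for the remaining classes,
# which are the cut-offs used in the experiments reported later in the paper.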
Table 3: An example of clustering correct and partial student answers from our large-scale industry dataset using the token-based clustering method. The cluster representative student answers are shown in boldface.

Question: How are schemas used for memory in middle adulthood?
Reference Answer: Schemas are organized methods used to categorize information stored in memory in middle adulthood

Clustered Correct Student Answers
Cluster 1: Categorizing things makes remembering easier. | they are memory shortcuts to ease the burden of remembering the many things they experience each day | to help retrieve information from past experiences to use in different situations or scenarios
Cluster 2: schemas are like short cuts in our memory
Cluster 3: It helps organize information, making it easier to retain. | Schemas in memory are used to organize and simplify information in our environment. | schemas are used for memory because they are organized bodies of information stored in your memory. | way people recall information, organized bodies of information stored in memory.

Clustered Partial Student Answers
Cluster 1: They are used to help store information. | To retain information | recall new information, comprehend and encounter old experiences.
Cluster 2: People are more likely to notice things that fit into their schema or how they think. A schema is a pattern of behavior or thought | Schemas are used for memory in middle adulthood as a past memory can help them be familiar with a new one | to help remember the past
Cluster 3: Schemas organize one's thoughts. | Having an outline and organization are useful | its referred to on a a daily basis to help organize and prioritize their day | They help organize behavior

Figure 4: Examples of dependency parsing. (a) Short and grammatically well-formed student answer (normalized dependency score = 0.90) (b) Lengthy and complex student answer (normalized dependency score = 0.81).
Thus, by clustering the student answers, selecting representatives, and ranking them, we obtain a scoring rubric consisting of ranked sample answers at varying degrees of correctness. Next, we explain the proposed methodology for obtaining the feature representation of a student answer using the scoring rubric.

4 FEATURE REPRESENTATION
Traditionally, a student answer is compared against the reference answer to obtain its feature representation. In the simplest of forms, this may be textual overlap and other hand-crafted features. For a sentence-embedding based representation, it can be the difference between the embeddings of the student answer and the reference answer. In either case, the pair-wise representation forms the basis. Let f be the operator that creates the feature representation f(q, r, a) between a student answer a and a reference answer r for a question q.

In this work, we utilize a scoring rubric instead of just a reference answer for comparing with the student answer. As described earlier, the scoring rubric consists of representative student answers with varying degrees of correctness. As an example, let us consider three degrees of correctness: correct, partial, and incorrect. Then the scoring rubric consists of a reference answer r, a set of correct answers c, a set of partially correct answers p, and a set of incorrect answers i.

As opposed to the traditional approach where the comparison is against a reference answer r, given by f(q, r, a), we also compare against the representative sets of correct, partial, and incorrect answers, represented as F(q, c, a), F(q, p, a), and F(q, i, a) respectively. We add the reference answer r as part of the correct set c.
Thus, the resultant concatenated feature representation with respect to the scoring rubric SR is obtained from Eq. 6:

F(q, SR, a) = [F(q, c, a), F(q, p, a), F(q, i, a)]    (6)

We now define the operator F. We use two different definitions of F, as described below.

(1) Closest Match: If the elements of the pair-wise feature representation f express different aspects of similarity, we propose to use the following definition of F:

F(q, c, a) = f(q, r, a) ⊕ f(q, c_1, a) ⊕ · · · ⊕ f(q, c_k, a)    (7)

where ⊕ is the element-wise max operator, c_i ∈ c, and |c| = k + 1. Alternatively, the closest matching answer can also be used for the feature representation, as defined below:

F(q, c, a) = f(q, c_j, a), where c_j = argmax_{c_i ∈ c} φ(c_i, a | q)    (8)

where φ(c_i, a | q) represents the similarity between c_i and a in the context of the question q. Eqs. 7 and 8 exhibit the closest-match notion per element and per sample, respectively. Either way, it is a form of max aggregation. Note that in this approach the feature dimension of F(q, c, a) is the same as that of f(q, c_i, a). Thus, the feature dimensionality is independent of the scoring rubric size.

(2) Scoring Rubric Preserving: In this approach, we propose to preserve the representation of a student answer with respect to all the elements of the scoring rubric. Therefore, as opposed to finding the closest match, the feature F(q, c, a) is the concatenation of all the pair-wise features f(q, c_i, a):

F(q, c, a) = [f(q, c_1, a), f(q, c_2, a), · · · , f(q, c_k, a)]    (9)

where |c| = k. Since this approach preserves all the pair-wise features, it allows the classifier to learn what is holistically important across all of c_1, c_2, . . . , c_k. The feature dimensionality of F(q, c, a) is k times that of f(q, c_i, a). Therefore, the role of ranking is important in creating a succinct scoring rubric that controls the feature size.

Figure 5: An example of feature extraction by comparing against all representative sets of correct, partial, and incorrect student answers.

Figure 5 shows an interpretation of the feature representation f of a student answer with respect to the correct, partial, and incorrect exemplars. In a broad sense, it encodes the coordinates of the student answer in the space anchored by the exemplars. To compute the feature representation f, we use lexical overlap baseline features, as released by the SemEval-2013 task organizers [7, 9], and sentence-embedding based features [5]. We choose these two feature sets as they differ significantly in their ways of encoding textual similarity, allowing us to evaluate the effectiveness of the proposed approach in the context of feature complexity.

(1) Baseline Features: The baseline lexical overlap features, as released [7, 9], include 4 features between the student response and the reference answer: (1) raw overlap count, (2) cosine similarity, (3) Lesk similarity [1], and (4) F1 score of the overlaps. The 4 features are additionally computed between the reference answer and the question to create an 8-dimensional feature representation. We use these features to compute the representation f between the student answer and each selected representative. The feature representation F with respect to a grade category is computed according to Eq. 7. Note that the 4 features between the question and the reference answer are the same for all grade categories and are hence taken into consideration only once. Since the individual features are non-overlapping, taking the element-wise maximum gives the closest match to a certain grade category. The final feature representation F with respect to the SR is given by Eq. 6.

(2) Sentence-Embedding based Features: Sentence-embedding features are obtained from the sentence representations produced by InferSent [5], which provides its embeddings in a d-dimensional space. Given a question q, a student answer a, and a reference answer r, the feature representation is obtained as

f(q, r, a) = [|E(a) − E(r)|, E(a) ∗ E(r)]    (10)

where E(a) and E(r) are the embeddings of the student answer and the reference answer respectively, and − and ∗ are element-wise subtraction and multiplication. Note that, to limit the dimensionality of the feature representation, we do not use the question (q) embedding as part of our feature representation. The feature representation F with respect to a grade category is, unlike the baseline features, computed using Eq. 9. Since the features are dimensions of the embedding space, we chose to concatenate the individual representations from the representatives rather than taking the element-wise maximum. The final representation F with respect to the SR is again given by Eq. 6. We keep the InferSent embedding dimension (d) as 4,096. All experiments using InferSent were performed using the pre-trained model (infersent.snli.pickle), which is trained on the SNLI dataset.
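The two definitions of F can be sketched as below, using the sentence-embedding pairwise feature of Eq. 10 as the underlying f. Here embed again stands in for an InferSent-style encoder, the element-wise maximum realizes the closest-match aggregation of Eq. 7, and plain concatenation realizes the rubric-preserving form of Eq. 9; the function names are illustrative.

import numpy as np

def pairwise_feature(student, exemplar, embed):
    """Eq. 10: f = [|E(a) - E(r)|, E(a) * E(r)] for one (student answer, exemplar) pair."""
    e_a, e_r = embed(student), embed(exemplar)
    return np.concatenate([np.abs(e_a - e_r), e_a * e_r])

def closest_match_feature(student, exemplars, embed):
    """Eq. 7: element-wise maximum over the pairwise features of one grade category."""
    feats = [pairwise_feature(student, ex, embed) for ex in exemplars]
    return np.max(np.stack(feats), axis=0)

def rubric_preserving_feature(student, exemplars, embed):
    """Eq. 9: concatenation of the pairwise features, keeping every exemplar's view."""
    return np.concatenate([pairwise_feature(student, ex, embed) for ex in exemplars])

def scoring_rubric_feature(student, rubric, embed, preserve=True):
    """Eq. 6: stack the per-category representations F(q, c, a), F(q, p, a), F(q, i, a).

    rubric: dict mapping grade category -> list of exemplar answers, with the
    reference answer assumed to be already included in the correct set.
    """
    agg = rubric_preserving_feature if preserve else closest_match_feature
    return np.concatenate([agg(student, exemplars, embed)
                           for _, exemplars in sorted(rubric.items())])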


5 EXPERIMENTS
We perform experiments to evaluate the effectiveness of the proposed Scoring Rubric approach, emphasizing (1) its benefits over the multiple references (MR) model, i.e., using only correct exemplars, (2) its sensitivity to feature representations, and (3) its performance compared to earlier published results. Further, to evaluate its generalizability, we experiment on two datasets: (1) the SemEval-2013 dataset [7] and (2) our large-scale industry dataset. Table 4 shows the train-test splits of both datasets. Overall, we perform experiments on two datasets, with two different feature representations, using MR and the proposed SR, along with the three proposed similarity metrics for clustering.
Table 4: Distribution of the SemEval-2013 SciEntsBank dataset and our large-scale industry dataset.

Dataset      | Questions | Total Responses | Train  | Test
Our dataset  | 483       | 16,458          | 12,317 | 4,141
SciEntsBank  | 135       | 5,509           | 4,969  | 540

This section is organized into 4 subsections. In the first two, we describe and analyze the results on both datasets. The third subsection presents a detailed ablation study on both datasets to better understand the effectiveness of SR. Finally, we compare our sentence-embedding based features in conjunction with SR against state-of-the-art systems on the SemEval-2013 dataset.

5.1 SemEval-2013 [7] Dataset
Our first set of experiments is on the SciEntsBank corpus of the SemEval-2013 dataset. The corpus contains reference answers and student answers for 135 questions in the Science domain. There are three classification subtasks on three different test sets: Unseen Answers (UA), Unseen Questions (UQ), and Unseen Domains (UD). The three classification subtasks are (1) 2-way classification into correct and incorrect classes, (2) 3-way classification into correct, incorrect, and contradictory classes, and (3) 5-way classification into correct, partially correct, contradictory, irrelevant, and non_domain classes. Note that the samples in SciEntsBank are the same across the 2-way, 3-way, and 5-way classifications; however, the labels change as the task becomes more granular. Each question in the SciEntsBank data has exactly one associated reference answer and is thus a suitable choice for evaluating the effectiveness of SR, which involves additional class-specific representatives. Furthermore, similar to [26], we also test only on Unseen Answers, as the representative answers generated for one question at train time might not be relevant for another question at test time.

We did not use any representatives for the non_domain and partial classes in the 5-way classification subtask, as a significant number of questions did not have any student answer from those classes in the train set. Although some questions also lacked answers from the contradictory and irrelevant classes, the number of such questions was relatively small. Therefore, we mitigate this problem in the following way: for questions which did not have any contradictory student answer, we used the sentence "I don't know" (a choice motivated by already existing samples of similar meaning in the contradictory class), and for questions lacking irrelevant answers, we used the question itself as an irrelevant representative. Note that these synthetic representatives were necessary to evaluate the effect of class-specific exemplars over only correct ones.

We devise two separate experiments using two different sets of features: (1) baseline features [7, 9] and (2) sentence-embedding based features using InferSent [5]. These feature sets range from simple hand-crafted baseline features to deep learning based sentence embeddings. This choice of features helps evaluate the feature-agnostic utility of class-specific representatives.

5.1.1 Baseline Features. A decision tree with default parameters is learned over the baseline features. We compare against Ramachandran and Foltz [26], as they also report results on the same dataset using the same set of baseline features. Table 5 shows the macro-averaged-F1 and weighted-F1 on the 2-way, 3-way, and 5-way test sets.

The first row of the table shows the results with the baseline features computed using only the reference answers. The next two rows are taken from [26], where Ramachandran and Foltz introduce two clustering/summarization techniques on top of the baseline features for incorporating additional reference answers. We compare our Multiple References (MR) and finally our Scoring Rubric (SR) with these state-of-the-art results. Note that our MR is in principle similar to the MEAD and Graph techniques, as all of them are based on selecting and using multiple reference (or correct) answers; however, in this work we introduce a novel method for identifying the student answer representatives with different similarity metrics. Our Scoring Rubric is a novel contribution where we extend MR to generate representative answers for all classes. We list key observations from the results shown in Table 5.

• Using the Multiple References obtained by the proposed clustering approach with the sentence-embedding based similarity metric (MR + sent) outperforms both Graph and MEAD. Specifically, we achieve a 3-point improvement over MEAD in 5-way, and 2- and 3-point improvements over Graph in 3-way and 2-way respectively. This suggests that the proposed approach yields correct exemplars well suited for the grading task.
• Our Scoring Rubric (SR) shows further improvement after incorporating representative answers for all classes. We achieve a 3-point improvement in macro-averaged-F1 in 5-way over our best performing MR. The improvements for 3-way and 2-way are 8 points and 2 points respectively. We believe the 5-way results would have improved further had there been enough samples for clustering representatives from all classes. Nonetheless, this validates the core intuition that providing information about exemplars of various grades helps the classifier encode the domain more holistically.
• Overall, our proposal of using class-specific representatives, i.e., the Scoring Rubric, significantly outperforms the state-of-the-art MEAD and Graph. Moreover, our best results achieve large improvements over the simple baseline features alone: 6 points, 14 points, and 9 points in 5-way, 3-way, and 2-way respectively.

5.1.2 Sentence-Embedding based Features. We now show the utility of class-specific representatives on state-of-the-art sentence-embedding based features as well. We use InferSent [5] embeddings of reference answers and student answers to encode our features as described in Section 4.

From the ranked representatives of each grade category, we use the top 3 answers (including the reference answer) for the correct class and the top 1 answer for the incorrect, contradictory, and irrelevant classes as exemplars in the Scoring Rubric. We learn a multinomial logistic regression classifier on top of the features. The best parameters are learned using k-fold cross-validation.
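A hedged sketch of this grading step: a multinomial logistic regression over the SR-based features, with the regularization strength selected by k-fold cross-validation. The paper states only that the best parameters are found via k-fold cross-validation, so the search grid, fold count, and scoring choice below are illustrative assumptions.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def train_grader(train_features, train_labels, folds=5):
    """Fit a multinomial logistic regression classifier on the SR feature matrix."""
    base = LogisticRegression(multi_class="multinomial", solver="lbfgs", max_iter=1000)
    grid = {"C": [0.01, 0.1, 1.0, 10.0]}                 # illustrative search space
    search = GridSearchCV(base, grid, cv=folds, scoring="f1_macro")
    search.fit(train_features, train_labels)
    return search.best_estimator_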
Table 5: Comparative evaluation with baseline features on the 5-way, 3-way, and 2-way protocols of the SciEntsBank dataset and on our large-scale industry dataset. MR: Multiple References, SR: Scoring Rubric. ‡ Results as reported by Ramachandran and Foltz [26].

Approach     | Sim.     | 5-way M-F1 / W-F1 | 3-way M-F1 / W-F1 | 2-way M-F1 / W-F1 | Our dataset M-F1 / W-F1
BF [7]       | -        | 0.375 / 0.435     | 0.405 / 0.523     | 0.617 / 0.635     | 0.423 / 0.465
Graph [26]‡  | -        | 0.372 / 0.458     | 0.438 / 0.567     | 0.644 / 0.658     | - / -
MEAD [26]‡   | -        | 0.379 / 0.461     | 0.429 / 0.554     | 0.631 / 0.645     | - / -
MR           | token    | 0.362 / 0.428     | 0.446 / 0.545     | 0.630 / 0.647     | 0.468 / 0.506
MR           | sent     | 0.402 / 0.474     | 0.459 / 0.581     | 0.673 / 0.686     | 0.477 / 0.515
MR           | combined | 0.355 / 0.412     | 0.455 / 0.557     | 0.676 / 0.688     | 0.475 / 0.507
SR           | token    | 0.430 / 0.472     | 0.545 / 0.604     | 0.692 / 0.703     | 0.525 / 0.552
SR           | sent     | 0.405 / 0.459     | 0.500 / 0.578     | 0.676 / 0.685     | 0.530 / 0.559
SR           | combined | 0.400 / 0.462     | 0.501 / 0.579     | 0.702 / 0.712     | 0.531 / 0.560

Table 6: Comparative evaluation with sentence-embedding based features on the 5-way, 3-way, and 2-way protocols of the SciEntsBank dataset and on our large-scale industry dataset. MR: Multiple References, SR: Scoring Rubric.

Approach      | Sim.     | 5-way M-F1 / W-F1 | 3-way M-F1 / W-F1 | 2-way M-F1 / W-F1 | Our dataset M-F1 / W-F1
SE-based [5]  | -        | 0.497 / 0.541     | 0.594 / 0.672     | 0.725 / 0.731     | 0.586 / 0.629
MR            | token    | 0.578 / 0.621     | 0.610 / 0.681     | 0.735 / 0.747     | 0.631 / 0.658
MR            | sent     | 0.557 / 0.584     | 0.627 / 0.703     | 0.754 / 0.766     | 0.627 / 0.654
MR            | combined | 0.557 / 0.604     | 0.599 / 0.682     | 0.744 / 0.754     | 0.627 / 0.657
SR            | token    | 0.579 / 0.610     | 0.637 / 0.710     | 0.752 / 0.767     | 0.634 / 0.655
SR            | sent     | 0.568 / 0.600     | 0.621 / 0.688     | 0.745 / 0.756     | 0.629 / 0.658
SR            | combined | 0.579 / 0.610     | 0.636 / 0.719     | 0.773 / 0.781     | 0.633 / 0.658

Table 6 compares the macro-averaged-F1 and weighted-F1 on 5-way, 3-way, and 2-way with and without using the Scoring Rubric. Note that we could not compare our MR with that of Ramachandran and Foltz [26], as their code was not publicly available. We again list our key observations below.

• Our MR outperforms the sentence-embedding baseline on all three testing protocols as well. Specifically, our best MR achieves an 8-point better macro-averaged-F1 in 5-way and a 3-point better macro-averaged-F1 in both 3-way and 2-way. This suggests that even when feature representations are semantically rich, using Multiple References is better than only using the SME-created reference answer.
• Our Scoring Rubric further improves the results with the addition of representatives from all classes. We obtain 1-point and 2-point macro-averaged-F1 gains with our best SR over the best MR in 3-way and 2-way respectively. Note that 5-way does not improve much because of the lack of representatives from the non_domain and partial classes.
• Overall, incorporating class-specific representatives improves over the sentence-embedding based features alone with substantially high gains of 10 points, 4 points, and 5 points macro-averaged-F1 in 5-way, 3-way, and 2-way respectively. This demonstrates the utility of the Scoring Rubric even when the student answers are represented in a semantic space.

5.2 Our Large-scale Industry Dataset
We also conducted experiments on our large-scale industry dataset, which consists of questions from the Psychology domain. The evaluation is on a 3-way classification task consisting of 3 classes: correct, partial, and incorrect. The ground-truth grades of the student answers were provided by subject matter experts. Similar to the SemEval-2013 dataset, we again experiment with both baseline and sentence-embedding based features.

5.2.1 Baseline Features. Table 5 compares the macro-averaged-F1 and weighted-F1 of all the configurations on our dataset using baseline features. We compare our MR and SR against the traditional approach of not using any class-specific representatives. We make the following key observations.

• Our Multiple References (MR) improves upon the baseline features, with the sentence-embedding based similarity metric achieving a 5-point better macro-averaged-F1.
• Our Scoring Rubric (SR) shows further improvement. In particular, SR with the combined similarity metric achieves a 6-point improvement over MR.
• The overall improvement with respect to the baseline features alone is again substantial, at almost 11 points better macro-averaged-F1.
Table 7: Ablation study of the Scoring Rubric showing macro-averaged-F1 scores with baseline features on SciEntsBank 3-way and our dataset. CR: Correct, I: Incorrect, P: Partial, CN: Contradictory.

Similarity Component | SciEntsBank-3way: CR | CR+I  | CR+I+CN | Our dataset: CR | CR+P  | CR+P+I
Token                | 0.446                | 0.451 | 0.545   | 0.468           | 0.497 | 0.525
Sentence             | 0.459                | 0.457 | 0.500   | 0.477           | 0.504 | 0.530
Combined             | 0.455                | 0.468 | 0.501   | 0.475           | 0.513 | 0.531

Table 8: Comparison of our best SR on sentence-embedding based features with state-of-the-art results on the SciEntsBank 2-way, 3-way, and 5-way testing protocols. ‡ Results as reported by Riordan et al. [27].

Approach              | 5-way M-F1 / W-F1 | 3-way M-F1 / W-F1 | 2-way M-F1 / W-F1
Sultan et al. [32]    | 0.412 / 0.487     | 0.444 / 0.570     | 0.677 / 0.691
ETS [14]              | 0.598 / 0.640     | 0.647 / 0.708     | 0.762 / 0.770
COMeT [23]            | 0.551 / 0.598     | 0.640 / 0.707     | 0.768 / 0.773
SOFTCARDINALITY [15]  | 0.474 / 0.537     | 0.555 / 0.647     | 0.715 / 0.722
T & N best [33]‡      | -     / 0.521     | -     / -         | -     / 0.670
T & N tuned [27]‡     | -     / 0.533     | -     / -         | -     / 0.712
SE-based [5]          | 0.497 / 0.541     | 0.594 / 0.672     | 0.725 / 0.731
SE-based + SR         | 0.579 / 0.610     | 0.636 / 0.719     | 0.773 / 0.781

5.2.2 Sentence-Embedding based Features. We use the top 3 correct representatives (including the reference answer), and the top 1 representative for the partial and incorrect classes, as exemplars in the Scoring Rubric. Our results with sentence-embedding based features are shown in Table 6. We again study the effect of MR and SR on the sentence-embedding based features. The salient observations are listed below.

• Using token-based MR obtains a 5-point better macro-averaged-F1 than the sentence-embedding based features alone.
• SR does not improve the results much here. We believe the sophistication of the feature representation largely limits the substantial improvement that we saw with the baseline features.
• The overall improvement is still noteworthy, with token-based SR obtaining a 5-point better macro-averaged-F1 than the sentence-embedding features.

5.3 Ablation Study of Scoring Rubric
In any short answer grading task, we believe that encoding representatives for each class has its merits. We show this in Table 7 through an ablation study of incorporating class-specific representatives on baseline features. Our experiments are on both datasets; however, on the SciEntsBank dataset, we show results only on 3-way, as for 5-way we did not have representatives for all classes. The results improve as we incrementally add representatives for each of the three classes. Note that the classes in our dataset and in SciEntsBank are different; however, this does not affect the results, as our idea of utilizing class-specific representatives to formulate the Scoring Rubric is generic. When using representatives for all three classes in SciEntsBank, the macro-averaged-F1 with the token-based similarity metric improves substantially, by 8 points, compared to using only correct-class representatives. Similarly, on our dataset, the improvement is about 6 points.

5.4 Comparison with State-of-the-Art Methods
In our final experiment, we compare the sentence-embedding based features in conjunction with the Scoring Rubric against state-of-the-art methods on the SciEntsBank Unseen Answers test data. We use the combined similarity metric for these comparisons. We specifically compare against the best performing systems in the task (per the official results at https://docs.google.com/spreadsheets/d/1Xe3lCi9jnZQiZW97hBfkg0x4cI3oDfztZPhK3TGO_gw/pub?output=html#), namely ETS [14], COMeT [23], and SOFTCARDINALITY [15], along with recent research including Sultan et al. [32] (all experiments performed using their publicly available code at https://github.com/ma-sultan/short-answer-grader), Taghipour and Ng [33], Riordan et al. [27], and InferSent [5]. Table 8 shows the macro-averaged-F1 and weighted-F1 of all the systems on the SciEntsBank dataset.

We observe that the sentence-embedding based features with our SR outperform all systems on the 2-way subtask. On 3-way, we match the state-of-the-art results, with a 1-point better weighted-F1 but a 1-point lower macro-averaged-F1 compared to ETS [14]. On 5-way, our results are competitive: we do better than all systems except ETS. We believe this could be down to our inability to incorporate representatives for all classes; also note that ETS benefits from its underlying domain adaptation. On a direct comparison between the sentence-embedding features [5] and their extension with the proposed Scoring Rubric, the latter yields significant improvements. We could use better features; using the Scoring Rubric on top of more sophisticated features should improve the results further.
6 CONCLUSION
Automatic short answer grading for intelligent tutoring systems has been a well-studied problem in the NLP community over the years. Traditional approaches treat it as a classification task where the student answer is matched against a reference answer for a given question. To overcome challenges arising from the variability in student answers, researchers have incorporated automated ways of selecting correct exemplars and used them as multiple reference answers. In this work, we generalize the notion of multiple reference answers to that of a Scoring Rubric that incorporates representative student answers from multiple grade categories. Creation of a Scoring Rubric involves clustering student answers for each grade category, followed by representative selection from each cluster, and finally ranking the representatives. Extending the feature representation of a student answer with respect to a reference answer, we propose techniques to obtain its feature representation with respect to the Scoring Rubric. We experiment with simple lexical overlap baseline features as well as sophisticated sentence-embedding based features to demonstrate that the notion of a Scoring Rubric is feature-agnostic. Its effectiveness is empirically evaluated on a benchmarking dataset and on our large-scale industry dataset. We report significantly better results on both datasets compared to existing approaches that compare the student answer against only reference answer(s). Our model involving sentence-embedding based features with respect to the Scoring Rubric also demonstrates results comparable to or better than state-of-the-art models on the benchmarking dataset.

Certain short answer grading tasks that output real-valued scores are modeled as regression problems. While our notion of a Scoring Rubric is independent of the individual grade categories, extending it to regression tasks, where the classes are not well-defined, remains one of the key future directions to pursue.

REFERENCES
[1] S. Banerjee and T. Pedersen. An adapted Lesk algorithm for word sense disambiguation using WordNet. In Proceedings of the International Conference on Intelligent Text Processing and Computational Linguistics, pages 136–145, 2002.
[2] S. Basu, C. Jacobs, and L. Vanderwende. Powergrading: a clustering approach to amplify human effort for short answer grading. Transactions of the Association for Computational Linguistics, 1:391–402, 2013.
[3] M. Brooks, S. Basu, C. Jacobs, and L. Vanderwende. Divide and correct: using clusters to grade short answers at scale. In Proceedings of the ACM Conference on Learning @ Scale, pages 89–98, 2014.
[4] D. Chen and C. Manning. A fast and accurate dependency parser using neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 740–750, 2014.
[5] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 670–680, 2017.
[6] T. I. Dhamecha, S. Marvaniya, S. Saha, R. Sindhgatta, and B. Sengupta. Balancing human efforts and performance of student response analyzer in dialog-based tutors. In International Conference on Artificial Intelligence in Education, 2018.
[7] M. Dzikovska, R. Nielsen, C. Brew, C. Leacock, D. Giampiccolo, L. Bentivogli, P. Clark, I. Dagan, and H. T. Dang. SemEval-2013 Task 7: The joint student response analysis and 8th recognizing textual entailment challenge. In Proceedings of the NAACL-HLT Workshop on Semantic Evaluation, 2013.
[8] M. Dzikovska, N. Steinhauser, E. Farrow, J. Moore, and G. Campbell. BEETLE II: Deep natural language understanding and automatic feedback generation for intelligent tutoring in basic electricity and electronics. International Journal of Artificial Intelligence in Education, 24(3):284–332, 2014.
[9] M. O. Dzikovska, R. D. Nielsen, and C. Brew. Towards effective tutorial feedback for explanation questions: A dataset and baselines. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 200–210, 2012.
[10] W. B. Frakes. Stemming algorithms. 1992.
[11] B. J. Frey and D. Dueck. Clustering by passing messages between data points. Science, 315(5814):972–976, 2007.
[12] W. H. Gomaa and A. A. Fahmy. Short answer grading using string similarity and corpus-based similarity. International Journal of Advanced Computer Science and Applications, 3(11), 2012.
[13] A. C. Graesser, S. Lu, G. T. Jackson, H. H. Mitchell, M. Ventura, A. Olney, and M. M. Louwerse. AutoTutor: A tutor with dialogue in natural language. Behavior Research Methods, Instruments, & Computers, 2004.
[14] M. Heilman and N. Madnani. ETS: Domain adaptation and stacking for short answer scoring. In Proceedings of the Joint Conference on Lexical and Computational Semantics, volume 2, pages 275–279, 2013.
[15] S. Jimenez, C. Becerra, and A. Gelbukh. SOFTCARDINALITY: Hierarchical text overlap for student response analysis. In Proceedings of the Joint Conference on Lexical and Computational Semantics, volume 2, pages 280–284, 2013.
[16] S. Kumar, S. Chakrabarti, and S. Roy. Earth mover's distance pooling over siamese LSTMs for automatic short answer grading. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 2046–2052, 2017.
[17] A. S. Lan, D. Vats, A. E. Waters, and R. G. Baraniuk. Mathematical language processing: Automatic grading and feedback for open response mathematical questions. In ACM Conference on Learning @ Scale, pages 167–176, 2015.
[18] T. Mitchell, N. Aldridge, and P. Broomhead. Computerised marking of short-answer free-text responses. In Manchester IAEA Conference, 2003.
[19] M. Mohler, R. C. Bunescu, and R. Mihalcea. Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 752–762, 2011.
[20] M. Mohler and R. Mihalcea. Text-to-text semantic similarity for automatic short answer grading. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, pages 567–575, 2009.
[21] N. Mrkšić, D. Ó Séaghdha, B. Thomson, M. Gašić, L. Rojas-Barahona, P.-H. Su, D. Vandyke, T.-H. Wen, and S. Young. Counter-fitting word vectors to linguistic constraints. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016.
[22] J. Mueller and A. Thyagarajan. Siamese recurrent architectures for learning sentence similarity. In Proceedings of the Association for the Advancement of Artificial Intelligence, pages 2786–2792, 2016.
[23] N. Ott, R. Ziai, M. Hahn, and D. Meurers. CoMeT: Integrating different levels of linguistic modeling for meaning assessment. In Proceedings of the Joint Conference on Lexical and Computational Semantics, volume 2, pages 608–616, 2013.
[24] R. J. Passonneau, A. Poddar, G. Gite, A. Krivokapic, Q. Yang, and D. Perin. Wise crowd content assessment and educational rubrics. International Journal of Artificial Intelligence in Education, 2018.
[25] L. Ramachandran, J. Cheng, and P. W. Foltz. Identifying patterns for short answer scoring using graph-based lexico-semantic text matching. In Proceedings of the NAACL-HLT Workshop on Innovative Use of NLP for Building Educational Applications, pages 97–106, 2015.
[26] L. Ramachandran and P. W. Foltz. Generating reference texts for short answer scoring using graph-based summarization. In Proceedings of the NAACL-HLT Workshop on Innovative Use of NLP for Building Educational Applications, pages 207–212, 2015.
[27] B. Riordan, A. Horbach, A. Cahill, T. Zesch, and C. M. Lee. Investigating neural architectures for short answer scoring. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 159–168, 2017.
[28] S. Saha, T. I. Dhamecha, S. Marvaniya, R. Sindhgatta, and B. Sengupta. Sentence level or token level features for automatic short answer grading?: Use both. In International Conference on Artificial Intelligence in Education, 2018.
[29] J. Z. Sukkarieh and J. Blackmore. c-rater: Automatic content scoring for short constructed responses. In International Florida Artificial Intelligence Research Society Conference, pages 290–295, 2009.
[30] J. Z. Sukkarieh and S. Stoyanchev. Automating model building in c-rater. In Proceedings of the ACL Workshop on Applied Textual Inference, pages 61–69, 2009.
[31] M. A. Sultan, J. Boyd-Graber, and T. Sumner. Bayesian supervised domain adaptation for short text similarity. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 927–936, 2016.
[32] M. A. Sultan, C. Salazar, and T. Sumner. Fast and easy short answer grading with high accuracy. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1070–1075, 2016.
[33] K. Taghipour and H. T. Ng. A neural approach to automated essay scoring. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1882–1891, 2016.
[34] Y. Zhang, R. Shah, and M. Chi. Deep learning + student modeling + clustering: A recipe for effective automatic short answer grading. In Proceedings of the International Conference on Educational Data Mining, pages 562–567, 2016.
[35] M. Zhu, Y. Zhang, W. Chen, M. Zhang, and J. Zhu. Fast and accurate shift-reduce constituent parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, volume 1, pages 434–443, 2013.
