ED589145
Abstract—Vocabulary knowledge is essential to educational progress. High quality vocabulary instruction requires supportive
contextual examples to teach word meaning and proper usage. Identifying such contexts by hand for a large number of words can be
difficult. In this work, we take a statistical learning approach to engineer a system that predicts informativeness of a context for target
words that span the range of difficulty from middle school to college level. Our database (released open source) includes 1,000
hand-selected words associated with approximately 70,000 contextual examples gathered from the Internet. Our training data included
each context rated by 10 individuals on a four-point informativeness scale. We decompose the text of each context into a novel
collection of approximately 600 numerical features that captures diverse linguistic information. We then fit a nonparametric regression
model using Random Forests and compute out-of-sample prediction performance using cross-validation. Our system performs well
enough that it can replace a human judge: for a target word not found in our dataset, we can provide curated contexts to a student
learner such that most of the contexts (54%) feature rich contextual clues and confusing contexts are rare (<1%). The quality of our
curated contexts was validated by an independent panel of high school language arts teachers.
Index Terms—Adaptive and intelligent educational systems, statistical software, machine learning, text mining, language
summarization
1 INTRODUCTION
1939-1382 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TLT.2018.2789900, IEEE
Transactions on Learning Technologies
IEEE TRANSACTIONS ON LEARNING TECHNOLOGIES, VOL. X, NO. X, X XXXX 2
is not true that every context is an appropriate or effective instructional means for vocabulary development.” Beck et al. [26] describe four categories of contexts: misdirective, nondirective, general and directive. Misdirective contexts “direct the student to an incorrect meaning”, which is harmful when learning a word initially. Nondirective contexts lack contextual clues and thereby provide no assistance. General contexts give enough clues for the learner to frame the word into a general category. Directive contexts are full of rich cues and are thereby the most pedagogically effective.

Contextual word learning experiments have demonstrated that the informativeness of instructional contexts impacts learning. Frishkoff et al. [11] presented adults with target words in six sentence contexts. The informativeness of the contexts was systematically manipulated, such that some words were presented in six directive contexts, some words were presented in five directive and one misdirective contexts, and some words were presented in three directive and three misdirective contexts. Directive contexts were constraining for the target word, whereas misdirective contexts were constraining for a distractor word that had a different meaning from the target word but overlapped in phonology and orthography. Participants were asked to generate a succinct definition for the target word after each contextual experience. Analyses examining the accuracy of definitions generated across trials revealed a significant interaction between trial and context quality. Follow-up analyses of the generated definitions indicated no differences in performance for words taught with five vs. six directive contexts; however, performance for words taught with three directive and three misdirective contexts was significantly worse than in the other two conditions. Analyses of synonym judgment accuracy at pre- and post-test also revealed a significant interaction between test time and context informativeness, such that gains in accuracy were significantly greater for words taught with more directive contexts.

Another study presented adults with six contexts for each target word, with contexts varying in their informativeness [27]. This time, no contexts were misdirective; instead, some words were taught in six directive contexts, some words were taught in six nondirective contexts, and some words were taught with half directive and half nondirective contexts. Results were similar to Frishkoff et al. [11]. Analyses of generated definitions revealed a significant interaction between trial number and informativeness, such that the accuracy of definitions generated for words presented in all nondirective contexts did not improve over trials, but the accuracy of definitions for words presented in half or all directive contexts significantly improved. Also, analysis of a synonym judgment task administered at pre- and post-test revealed significant interactions between test time and context informativeness, such that words taught in half- or all-directive contexts were learned better than words taught in all nondirective contexts.

Adlof et al. [28] also examined contextual word learning by children (aged 9-12 years) and adults. In their studies, some words were taught in two or three directive contexts, following zero, one, or four familiarization exposures in nondirective contexts. A matched set of “control” words were included on pre- and post-tests, but never appeared in context. Analyses of synonym generation and synonym matching tasks administered at pre- and post-test revealed significant differences between taught words and control words, but no significant differences between the words that were pre-familiarized in nondirective contexts vs. not familiarized.

Taken together, these results suggest that in designing an optimal vocabulary tutor, the majority of contexts should be directive (or at least generally directive). Misdirective contexts need to be eliminated as far as possible as they detract from learning [11]. As for the ambiguous, nondirective contexts, it appears they do not facilitate learning by themselves, and they neither hurt nor help vocabulary learning once supportive contexts have been provided [27]. Thus, the purpose of this paper is to describe a system that automatically identifies contextual examples with roughly the informativeness distribution outlined above. This system will help to optimize the selection of contexts for our DictionarySquared program, but can also be used by teachers or others who wish to efficiently identify example contexts for vocabulary instruction.

1.2 The Specifics of our Problem

We wish to engineer a system that takes as an input a target word (a single word that we wish to teach a student) and a context (a block of text containing that word) and outputs a binary decision: “use” or “not use” in the DictionarySquared vocabulary-teaching system.

We limit the scope of acceptable words to be from approximately late middle school level (e.g. “relevant”, “ethnic”, “promote”), late high school level (e.g. “surly”, “vestige”, “primordial”) and up to college level and beyond (e.g. “meretricious”, “vitiate”, “bucolic”). The system should be able to take, as an input, any word intended for vocabulary study in this range with an associated written context containing the word.

DictionarySquared teaches by showing contexts whose format is mostly textual. Its strategy is focused on students reading many short contexts instead of a few long contexts (see [25] for specifics). Since the system presented here will be a core technology within DictionarySquared, we limit what we mean by a “context” to blocks of text of 42–65 words (on average, 54.2 ± 3.4), each of which features the target word (in the target inflection) at least once, and virtually all of which (a) begin at the beginning of a sentence, (b) end at the end of a sentence and (c) do not span paragraphs of the source text from which they were excerpted.

As an example, consider the target word “proclivity” (in bold). We would like to select informative contexts such as:

Some people have a genuine proclivity for motion sickness and will undoubtedly suffer more during rough seas. According to medical professionals, seasickness is more prevalent in children and women. On the other hand, children under 2 seem to be immune from the ailment. Of equally interesting note, elderly people are less susceptible ... [29, third paragraph]
that is likely between general and directive, and discard uninformative contexts such as:

Yet more additions to the links bar, and kind of by way of addressing the massive proclivity of “Da Phenomenon,” I’d like to shower a little love on some of my favourite bloggers, be a little selective for a change. Somewhat ruefully I was reflecting that doing one of my pictorial bloggers break-outs again is now becoming increasingly unlikely. [30]

that is likely between misdirective and nondirective.

Within the scope of our problem, we identify two slightly different challenges. The first is to classify the informativeness of a new, never previously seen target word embedded in a new, never previously seen context. The second is to perform the same classification on a target word seen before but a context never previously seen. We will refer to these two similar challenges for the duration of this paper as [word unseen] and [word seen] respectively.

1.3 Previous Research

Finding examples of words to promote learning has been considered before. Brown and Eskenazi [31] developed “REAP”, a search engine for contextual examples optimal for personalized vocabulary learning by considering students’ grade level defined by word histograms [32] while keeping track of words previously seen. We share the final end-goal to foster personalized vocabulary growth, but we focus on providing rich contexts for individual words only. We also make implicit use of their student leveling by including features based on word frequencies and n-grams in our prediction models. Mostow et al. [33] developed a method to create short (about six words) informative contexts containing a single sense of a polysemous word using n-grams. Here we focus on informativeness without regard to the word sense. We also make implicit use of the n-gram features developed therein. Hassan and Mihalcea [34] automatically classified entire documents as “learning objects” [35] or “learning assets” for concepts and showcased their system on fourteen computer science concepts. Although their goal was quite different from teaching vocabulary, we make use of their method of hand-engineering features and employing supervised machine learning to evaluate educational value.

Mostow et al. [36] studied the prediction of informativeness for individual contexts using a subset of 13,000 contexts for easy and medium difficulty words in our database. Their preliminary analysis indicated that a linear model fit with measures of context length, context word frequency, context readability, local predictability (derived using Google n-grams), co-occurrence and distributional similarity as covariates predicted informativeness better than chance. However, the binary classification of good vs. bad contexts was insufficient to generate a set that included mostly good contexts. Even using the most stringent predicted probability rating (which discarded over 95% of the contexts) resulted in the acceptance of more bad contexts than good contexts. In this study, we make use of the full set of available data, expand the features to include more sophisticated n-grams and semantic similarity features as well as a large number of natural language processing indices, and we use advanced machine learning methods that lead to better performance. Additionally, our system’s framework, model assumptions and presentation differ considerably.

2 METHODS

To predict whether or not a context is to be used in a vocabulary learning system, we employ supervised statistical learning [37, Chapter 2]. To do so, we require two things: (1) training data, which includes both (a) informativeness ratings for each contextual example and (b) numerical “features” of the contextual examples that correlate with the informativeness ratings, and (2) a statistical model and a means to fit this model using the training data. We explain these two requirements below and discuss how the statistical model’s performance is assessed.

2.1 Training Data

2.1.1 Target Words and Associated Contexts

The current DictionarySquared system teaches 1,000 words split into 10 “bands” of 100 words each. These words span a range of difficulty to meet the needs of low, average, and high skilled high school students. To curate our list, we began with over 2,500 words compiled from Coxhead’s Academic Word List [38] as well as lists of suggested words to study in preparation for standardized tests such as the ACT, SAT, and GRE. We derived estimates of word difficulty based on word frequency and dispersion norms [39] and age-of-acquisition estimates [40]. These two indices were highly, but not perfectly, correlated within the list (r = +0.77). We then divided the list into 10 difficulty bands and hand-selected 100 words in each band that would presumably be considered “useful” for instruction according to criteria described by vocabulary experts. These included words that characterize written text and are general enough to be found across academic content domains [41], [42]; words that might be difficult to learn from everyday conversation but occur frequently enough in academic texts to be of assistance to the comprehension process [41]; and words that are generally explainable using familiar concepts [43].

For each word, we query the DictionarySquared database for contexts that contain the word in the same inflection. This DictionarySquared corpus was populated between 2008 and 2010 using the Google Web API (since defunct). Contexts came from text separated by spaces within one html tag (to enforce contiguous text) and result order was randomized. Care was taken to drop contexts with duplicate sentences (as defined by sharing one complete sentence) to ensure uniqueness of each context. Note that unlike a Google News search, search results from the web API were not clustered into similar items. The contexts are devoid of illegal characters or inappropriate words (according to a handmade list of ≈900 words) and do not have too many non-letter characters.

We then pruned the original 1,000 word list ensuring each word has more than 20 associated contexts (on average,
the words have 72.7 ± 20.7 contexts) for a total of 933 words with 67,833 associated contexts. Thus, our training data frame has 67,833 individual rows which constitute training data examples for use in training and validating our statistical model.

2.1.2 Informativeness Ratings

After contexts were queried, each context was hand-rated using Amazon’s Mechanical Turk (MTurk), a world-wide market for one-off concise tasks that has been validated to be accurate for natural language tasks [44]. Ten unique MTurk workers rated each context for a total of approximately 700,000 ratings. The contexts were rated on the same ordinal scale of “informativeness” pictured below:

• Very Helpful. After reading the context, a student will have a very good idea of what this word means.
• Somewhat Helpful.
• Neutral. The context neither helps nor hinders a student’s understanding of the word’s meaning.
• Bad. This context is misleading, too difficult, or otherwise inappropriate.

The above scale choices roughly correspond to the Beck et al. [26] scale without the use of the technical terms found therein. We numerically code the levels in our ordinal scale as +2, +1, 0, −1 respectively, i.e. the conventional default of even spacing. We feel it is appropriate to code neutral as 0, bad as negative and helpful as positive, but the values −1, +1 and +2 are arbitrary. Future work can explore other encodings.

Several steps were taken in an effort to ensure quality of the collected MTurk ratings in accordance with methods and recommendations from past studies [45], [46], [47], [48]. First, raters were required to initially pass a qualification test which included contexts with known ratings. Second, individual rater agreement was monitored over time; raters whose response patterns appeared random or who otherwise showed substantial deviations from the crowd were disqualified from future rating assignments. This is similar in spirit to more advanced algorithms [49, for example]. Lastly, we included random “attention checks”; these consisted of eliciting a rating for clearly helpful and clearly unhelpful contexts from the more difficult bands (7–10) as determined by unanimous agreement from trained research assistants. Raters who frequently made errors on the attention checks were disqualified from future rating tasks, a strategy similar to the work of Oppenheimer et al. [50]. Even though we took principled steps, it was still difficult to root out all the low quality ratings. Cleaning out low quality ratings using principled methods [51, for example] is a worthy enterprise, but it is left to future work.

We then employ the sample average of the 10 ratings as the gold-standard label for each context. Previous research has found that averaging the ratings from 9 or more non-expert raters reached agreement with expert raters [44], [52]. We henceforth denote the average rating as y, indicating this is our dependent variable going forward. Note that the quality of the gold-standard y (along with our modeling) is ultimately validated by the independent teacher validation experiment that we describe in Section 4.

The density of our context ratings as well as a breakdown by band are pictured in Figure 1. Note the rating distribution is approximately Gaussian with an average of 0.59 ± 0.53 (footnote 1). These contexts make a good training data set for developing the system we wish to build, as the examples are drawn from a wide distribution of informativeness.

The fraction of misdirective contexts is at most 15% (those rated less than 0) and the fraction of directive contexts is at most 19% (those rated greater than 1). The average context in our current corpus is therefore between nondirective and general. We can appreciate that the raw database has quite informative contexts as it was culled from reliable text sources. Thus, our job to curate a subset of truly excellent contexts is very challenging, a point we will return to in Section 4.

2.1.3 Feature Extraction

Applying learning algorithms to data requires a transformation from the raw data to a collection of features since the raw data representation (the text itself) is a suboptimal representation of the underlying information in the text [54]. Put another way, the raw characters of the text considered alone have negligible correlation with the informativeness rating.

Thus, the next task consists of creating our own data representation: mapping the characters in the contextual examples to numeric features, i.e. functions g_j : text → R, which is a type of “text mining” [55], one of the goals of natural language processing (NLP). Then we will extract patterns from these numeric features, relating them to informativeness; this is a type of “machine learning”.

The literature on data representation via textual feature extraction is vast and it is difficult to know which features should be extracted. We began with simple features such as number of sentences, words and punctuation counts, etc. We then read many hundreds of contexts ourselves and tried to conceptually isolate common attributes observed among highly directive contexts and other attributes observed in misdirective contexts. Through doing so, we tried to identify useful simplifying explanations or abstractions that helped to make sense of the rich data that is natural text; this is a form of disentangling features with correlation from ones without [56, Chapter 1].

We first observed that directive contexts have common expressions and phrases that use the word. This can be captured if we can find all such phrases for all words. Thus, we employ the Google Book Corpus including text from 1800-2000 [57] to calculate the number of times the target word is found in every configuration of n-grams up to 5-grams also including blanks (or stop characters). In addition to the probability estimate used by Mostow et al. [36], we include the raw count. For instance, for the word defiant, we have features for configurations found in Figure 2.

We also observed that directive contexts contain synonymous words (i.e. semantically related) as well as words that

1. Note that it is likely not valid to generalize this observation, as this dataset was collected from the Internet in quite an arbitrary way, and scored using a rating system that does not exactly reflect Beck et al.’s [26] categorization.
[Figure 1. Left panel: density of the Average Rating (x-axis from −1 to 2). Right panel: true y by band (bands 1–10).]
Fig. 1: The distribution of y , the informativeness metric in our training set. All plots are generated with the R package
ggplot2 [53].
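The label construction of Section 2.1.2 and the summary figures quoted alongside Figure 1 (mean, spread, share of contexts rated below 0 and above 1) can be illustrated with a minimal sketch. The numeric codes are those stated in the text; the function names and the toy ratings are hypothetical and are not taken from the released dataset.

```python
from statistics import mean, stdev

# Numeric coding of the four rating levels, as stated in Section 2.1.2.
RATING_CODES = {"Very Helpful": 2, "Somewhat Helpful": 1, "Neutral": 0, "Bad": -1}


def gold_standard_label(ratings):
    """Average the coded ratings of one context to obtain its label y."""
    return mean(RATING_CODES[r] for r in ratings)


def summarize(labels):
    """Mean, sample standard deviation, and the shares of labels below 0 and above 1."""
    n = len(labels)
    return {
        "mean": mean(labels),
        "sd": stdev(labels),
        "frac_below_0": sum(y < 0 for y in labels) / n,
        "frac_above_1": sum(y > 1 for y in labels) / n,
    }


# Hypothetical toy data: three contexts, ten ratings each.
toy_ratings = [
    ["Very Helpful"] * 6 + ["Somewhat Helpful"] * 4,
    ["Neutral"] * 5 + ["Somewhat Helpful"] * 5,
    ["Bad"] * 7 + ["Neutral"] * 3,
]
labels = [gold_standard_label(r) for r in toy_ratings]
summary = summarize(labels)
```

Under this coding, a context's label can only fall in [−1, +2], and the two threshold shares correspond to the "rated less than 0" and "rated greater than 1" fractions discussed in Section 2.1.2.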
[Figure 3. x-axis: Principal Component; y-axis: Cumulative Variance Explained (0.2 to 1.0).]
Fig. 3: The cumulative sum of sorted normed eigenvalues by principal component ordinal number. The vertical blue line
illustrates the cumulative principal component number that explains 95% of the variance in our set of features.
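The quantity plotted in Figure 3 can be sketched as follows, assuming the eigenvalues of the feature covariance (or correlation) matrix have already been computed by some other means. The function names and the toy eigenvalues are hypothetical; only the 95% target comes from the caption.

```python
def cumulative_variance_explained(eigenvalues):
    """Cumulative sum of the eigenvalues, sorted in decreasing order and
    normalized by their total, so that the last entry is 1."""
    total = sum(eigenvalues)
    running = 0.0
    curve = []
    for ev in sorted(eigenvalues, reverse=True):
        running += ev
        curve.append(running / total)
    return curve


def components_needed(eigenvalues, target=0.95):
    """Smallest number of leading principal components whose cumulative
    share of variance reaches the target."""
    for k, frac in enumerate(cumulative_variance_explained(eigenvalues), start=1):
        if frac >= target:
            return k
    return len(eigenvalues)


# Hypothetical toy eigenvalues (not the paper's); they sum to 10.
toy_eigenvalues = [3.0, 0.3, 4.0, 0.1, 2.0, 0.6]
curve = cumulative_variance_explained(toy_eigenvalues)
k95 = components_needed(toy_eigenvalues, target=0.95)
```

In this toy case the sorted shares accumulate to 0.4, 0.7, 0.9 and 0.96, so four components are needed; the vertical line in Figure 3 marks the analogous component number for the paper's feature set.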
during modeling; thus going forward we use the full feature set. This confers the additional advantage of allowing us to investigate individual variable importance directly in Section 3.4.

The full dataset, 70,000 × 615, the individual Turker ratings and the original text of the contexts are available and open source (see Section on “Replication”).

2.2 The Statistical Model and its Use in Prediction

We begin with the continuous response variable of the average informativeness rating (Section 2.1.2) but then must make a binary decision on new contexts: to administer a context to the student (one) or not to administer a context (zero). This binary decision is based on costs which vary across the spectrum of the informativeness rating [−1, +2]. These costs are quite asymmetric as it costs more for a student to be mistakenly presented with a poor context with informativeness near −1 than it does for us to mistakenly reject a good context with informativeness near +2 (see Appendix A for a mathematical description). In short, we have a regression estimation problem where decisions are made with a threshold based upon entire distributions. Below, we will develop a heuristic validation scheme appropriate to this nontraditional problem.

We employ random forests (RF) [67] to model the average informativeness rating y as a function of the 615 features in the training data. RF is widely used for non-parametric estimation of continuous functions (regression) by flexibly fitting complex non-linearities and high-order interactions using its tree structure without overfitting [37, Chapter 15]. Predictions from this RF model will be continuous. In order to make the binary decision, we use a threshold which we denote ŷ0 (see page 12 of [37]) i.e. if y ≥ ŷ0, we administer the context (1, use) and if y < ŷ0 we do not administer the context (0, throwout). Thus, ŷ0 is chosen to minimize the overall asymmetric costs (explained in the next section).

2.3 Performance Validation

We now describe performance assessment for both the [word unseen] and the [word seen] task. We first split the full training data into a model-building set and a holdout or “test” set. The model-building set is used to fit an RF model. The test set is then predicted via the fitted RF model and the results are compared to its true y values to determine performance; this is the “out-of-sample” (oos) estimate or generalization estimate. Validation performed oos ensures that our performance estimate is not inflated by in-sample overfitting.

Generally speaking, the holdout consists of a random sample of the rows of the training data. Under this random sampling, the oos estimate would correspond to an honest estimate of the performance of the [word seen] task only. Why? The target words in the holdout are the same target words (more or less) as those in the model-building set. This simulates new contexts for words already in the word bank.

To estimate the performance of the more realistic [word unseen] system, we randomize the target words themselves and hold out the rows corresponding to 10% of the words (footnote 2). Thus, when the RF model makes predictions oos, it is predicting for contextual examples containing target words never heretofore seen.

However, any single 10%-90% test-training split can over- or underestimate performance due to chance. To mitigate this possibility, we employ 10-fold cross-validation [37, Chapter 7.10].

The typical metric to consider in continuous prediction is oos R² or RMSE. Since our primary goal is to build a vocabulary learning system of which predicting the continuous informativeness metric is only a necessary intermediate step, R² or RMSE are not meaningful performance metrics. Here, our performance metric should reflect a total

2. We employ 10% as this is a common practice. There is little statistical theory at this time that recommends a hold-out size for estimating model performance. The largest used in practice is 20% and the smallest used is one observation (the so-called “leave one out” cross validation procedure).
cost function (Appendix A, Equation 2) which varies with the prespecified ŷ0 threshold. Different threshold values result in different distributions of oos informativeness among the contexts that the system declares usable for future student consumption (we call this the “use distribution”) and unusable for student consumption (the “throwout distribution”). To compare the use and throwout distributions, we compute empirical quantiles. We deem three quantile-based slices of the use distribution most important to monitor during the design of a student learning system: Y < 0, the poor contexts; Y ∈ [0, 0.5), the non-informative, educationally neutral contexts; and Y > 1, the highly informative contexts. For the throwout distribution, we tabulate how often we throw out the very best contexts (Y > 1). A holistic view of these quantiles informs our heuristic total cost performance metric, not R².

To understand more clearly our starting point, Table 1 displays these quantile-slices for the original, uncurated data by band. We can see informativeness varies significantly by band. The highest bands, indicating words representing more nuanced concepts, have a larger proportion of misdirective and non-informative contexts, as well as a lower proportion of richly informative contexts. Thus, our model’s curation task is much more difficult at higher bands, a point we return to later (especially in Figure 5).

As ŷ0 increases, the system will be more and more conservative as to which contexts it deems useful. Thus we can decrease the proportion of contexts with y < 0 and with y ∈ [0, 0.5) and increase our proportion of y > 1 contexts by raising ŷ0. But there is a tradeoff: the cost is more false negatives; the system will throw out many contexts that are good for student learning. If our pool of potential contexts is large (such as the Internet), this is not a problem except on the very rare words (which are rare even across the entire Internet).

3 RESULTS

Here, we focus on results for the [word unseen] system and we discuss results for the [word seen] system in Appendix B.

3.1 RF Results for the [word unseen] System

We vary the ŷ0 threshold in order to investigate the possible binary decision systems that can be created from our continuous RF model (footnote 3). For each threshold, we compute the empirical cumulative probabilities for the important metrics of Table 1 as well as the throwout percentage and display the results in Table 2. Each row of Table 2 estimates future performance: the cost of misdirective contexts, the cost of uninformative contexts, the reward of directive contexts and the cost in throwout percentage. Each row provides multivariate performance metrics.

For our purpose, we stress that misdirective contexts are costly since their appearance in a vocabulary training

should be about 1–2% corresponding to a 7.5–15.1 fold reduction over the original contextual examples from the Internet (the training data). The system that performs at this level has ŷ0 cutoffs 0.845–0.895 (highlighted in Table 2). In this range, approximately 10% of the contexts are uninformative. Most importantly, between 47–54% of the contexts will be general and directive, featuring rich contextual clues that are supportive of rapid vocabulary learning. Put another way, the ratio of directive to misdirective contexts (after discarding) is 25:1 – 54:1. The price we pay is 91.6–96.6% throwout of our potential pool of directive contexts. We address the implications of this high cost in the discussion section. For the model with threshold ŷ0 = 0.895, we illustrate the future use distribution compared to the training data in Figure 4.

3.2 Linear Model Results for the [word unseen] System

To demonstrate the predictive advantage of RF, we fit a linear model (via ordinary least squares regression). We varied ŷ0 similarly and attempted to display the results that exhibit the 1–2% misdirective context proportion in Table 3 as a comparison. However, the linear model was not able to perform at that error rate (without ≈100% throwout). At higher rates, such as 3%, other cutoff performance was similar to the RF implementation but the throwout rate was higher.

3.3 Differential Performance by Band of the [word unseen] System

Note that the results in Table 2 and the use distribution illustration in Figure 4 represent an average across all words (and bands). It is likely that the system will have better performance on contextual examples with target words from lower bands. This association is expected as there are both more misdirective contexts and fewer directive contexts as band increases (review the differential distributions in Table 1).

We illustrate throwout rate by band in Figure 5. Note here that the throwout rate increases as the word band increases. The system does not work beyond band 6. Band 7 has 99.2% throwout and band 10 has 99.6% throwout. Therefore, we recomputed the main results for just bands 1–6 and display them in Table 4. Our metrics of interest in the usage distribution largely remain the same but throwout has significantly improved. It seems the model’s solution to uncertainty of usability among contexts in bands 7-10 is simply to omit them.

3.4 Feature Importance

Which of the 615 features contribute to predictive performance in the [word unseen] model? We queried our RF model (footnote 4) for its variable importance data and we plot
the mean decrease in accuracy as measured by out-of-bag
program for a given word has the potential to confuse the
increase in mean squared error in Figure 6 (see [67] for
student [11], [27]. Thus, we believe our target proportion
more information). This metric is analogous to effect size
3. Note that binary classification decisions are popularly made with in a multivariate regression. It accounts for collinearity
the help of a Receiver Operator Curve or Detection Error Tradeoff plot
[68] (which plots throwout versus “errors”). Here we have two full 4. We used the entire dataset, default hyperparameter values, n =
distributions with differential costs; thus, analogous visual aids are not 10, 000 samples per tree and 500 trees. Fitting more trees or raising the
practical nor appropriate. subsampling would be more computationally burdensome.
TABLE 1: Metrics computed about the oos use distribution. The cutoffs are for the real y values (the average rating). We
then show the differential cutoffs by word band. Note: the columns do not sum to 1 as the range y ∈ (0.5, 1] is not shown.
[Figure 4: density plot. x-axis: average rating (-1 to 2); y-axis: density (0.0 to 0.9); legend (type): usage ys, ys.]
Fig. 4: The density of the informativeness use distribution (green) versus the training data distribution of informativeness
(red) at ŷ0 = 0.895. Contexts above the blue dotted line feature rich contextual clues that optimize student learning.
[Figure 5: per-word points with box-and-whisker overlays. x-axis: band (1 to 10); y-axis: throwout proportion (0.80 to 1.00).]
Fig. 5: Box-and-whisker plots of differential throwout by word organized by band for the RF model in the [word unseen]
system at ŷ0 = 0.895. The blue line plots the average throwout by band. Each point represents the throwout rate for a
single word; they are randomly jittered in the x axis only for easier display.
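The per-word and per-band throwout rates plotted in Figure 5 could be tabulated along the following lines. This is a minimal sketch, not the authors' code: the function name `throwout_by_band`, the record layout, and the toy words are hypothetical, and it assumes that "throwout" means the share of a word's contexts whose prediction falls below the threshold and that the band average is taken over words.

```python
from collections import defaultdict
from statistics import mean


def throwout_by_band(records, threshold):
    """Return (per-word throwout rate, mean throwout rate per band).

    records: iterable of (word, band, y_pred) tuples, one per context.
    A context is assumed to be accepted when y_pred >= threshold.
    """
    kept = defaultdict(int)
    total = defaultdict(int)
    band_of = {}
    for word, band, y_pred in records:
        band_of[word] = band
        total[word] += 1
        if y_pred >= threshold:
            kept[word] += 1

    # Throwout rate for each word: share of its contexts that were rejected.
    per_word = {w: 1 - kept[w] / total[w] for w in total}

    # Average the per-word rates within each band.
    rates_in_band = defaultdict(list)
    for w, rate in per_word.items():
        rates_in_band[band_of[w]].append(rate)
    per_band = {b: mean(rates) for b, rates in rates_in_band.items()}
    return per_word, per_band


# Toy data (hypothetical words and predictions).
toy = [
    ("calm", 1, 0.95), ("calm", 1, 0.50),
    ("lucid", 1, 0.90), ("lucid", 1, 0.91),
    ("abstruse", 9, 0.20), ("abstruse", 9, 0.30),
]
per_word, per_band = throwout_by_band(toy, threshold=0.895)
```

Averaging over words rather than over contexts keeps words with many contexts from dominating a band; whether the paper's blue line was computed this way is not stated.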
ŷ0 threshold P (Y < 0) P (Y ∈ [0, 0.5)) P (Y > 1) P (throwout of Y > 1) # accepted
-0.755 0.151 0.312 0.190 0.000 67833
⋮
0.495 0.091 0.291 0.228 0.110 50399
0.520 0.083 0.282 0.237 0.145 46755
0.545 0.077 0.274 0.247 0.184 42931
0.570 0.071 0.265 0.257 0.234 38758
0.595 0.065 0.254 0.269 0.288 34562
0.620 0.057 0.246 0.282 0.351 30208
0.645 0.053 0.232 0.297 0.418 25867
0.670 0.047 0.220 0.312 0.488 21735
0.695 0.044 0.207 0.329 0.561 17873
0.720 0.039 0.192 0.350 0.635 14160
0.745 0.033 0.175 0.370 0.704 10905
0.770 0.030 0.157 0.389 0.772 8039
0.795 0.025 0.142 0.414 0.831 5670
0.820 0.022 0.126 0.447 0.877 3855
0.845 0.019 0.118 0.474 0.916 2491
0.855 0.019 0.112 0.477 0.930 2048
0.865 0.018 0.103 0.491 0.941 1694
0.875 0.017 0.101 0.509 0.951 1381
0.885 0.013 0.096 0.524 0.959 1100
0.895 0.010 0.086 0.541 0.966 909
0.905 0.008 0.078 0.552 0.973 717
0.915 0.009 0.074 0.573 0.978 571
0.925 0.004 0.057 0.604 0.981 454
0.935 0.006 0.054 0.605 0.985 354
0.945 0.008 0.057 0.630 0.989 262
0.955 0.010 0.060 0.648 0.991 199
TABLE 2: Oos RF performance results for the [word unseen] system operating under a variety of ŷ0 thresholds (oos
R2 = 0.177). We compute empirical cumulative distribution function values of interest as well as the cost (the throwout
rate of informative contexts). We also display the number of contexts marked as acceptable of the 67,833 contexts in the
original training data. At low thresholds (below 0.5), the system keeps all contexts, so these thresholds are not relevant
and are colored gray. Past a threshold of 0.925, we have 98.1% throwout of our good contexts and acceptability is so low,
we lose the ability to accurately estimate empirical probabilities; they are also colored gray. Cutoffs usable in practice are
colored yellow (see main text). The last column displays the number of contexts accepted from the 67,833 in the training
data to inform intuition on the confidence intervals of the estimated probabilities; low values are not stable.
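As an illustration of how one row of Table 2 could be computed from out-of-sample predictions, consider the sketch below. The function name `threshold_row`, the dictionary keys, and the toy ratings are assumptions made for illustration; the interval boundaries follow the table header, and a context is assumed to be accepted when its prediction is at or above the threshold.

```python
def threshold_row(y_true, y_pred, threshold):
    """Summarize the contexts accepted at one prediction threshold.

    y_true: average human ratings; y_pred: out-of-sample predictions.
    """
    accepted = [y for y, p in zip(y_true, y_pred) if p >= threshold]
    n = len(accepted)
    n_rich_total = sum(1 for y in y_true if y > 1)
    n_rich_kept = sum(1 for y in accepted if y > 1)

    def share(condition):
        # Empirical proportion of accepted contexts meeting the condition.
        if n == 0:
            return float("nan")
        return sum(1 for y in accepted if condition(y)) / n

    if n_rich_total:
        throwout_of_rich = 1 - n_rich_kept / n_rich_total
    else:
        throwout_of_rich = float("nan")

    return {
        "threshold": threshold,
        "p_misdirective": share(lambda y: y < 0),      # P(Y < 0)
        "p_neutral": share(lambda y: 0 <= y < 0.5),    # P(Y in [0, 0.5))
        "p_rich": share(lambda y: y > 1),              # P(Y > 1)
        "throwout_of_rich": throwout_of_rich,          # P(throwout of Y > 1)
        "n_accepted": n,
    }


# Toy ratings and predictions (hypothetical values).
y_true = [-0.5, 0.2, 0.4, 0.8, 1.2, 1.5, 1.1, -0.1]
y_pred = [0.1, 0.3, 0.9, 0.7, 1.0, 0.6, 0.95, 0.2]
row = threshold_row(y_true, y_pred, threshold=0.895)
```

Calling the function over a grid of thresholds gives one dictionary per table row, which makes the tradeoff between the proportion of rich contexts and the throwout rate easy to inspect.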
Fig. 6: Variable importance for all 615 features as measured by the increase in percent mean squared error. The red line is
the cutoff for features that no longer increase out-of-bag accuracy of the RF model. 50 features (8%) are in this category.
Feature names are omitted due to space restrictions.
between the features by permuting values of one feature but keeping the other feature values the same.
We learn here that the vast majority of features contribute synergistically to oos predictive performance. If the model had only a few variables contributing, there would be many more scores with near zero or negative oos mean squared error increase.
It is possible we could drop some of the 50 non-contributing variables in a stepwise fashion without performance loss, but running the stepwise elimination would be computationally prohibitive. Thus, Figure 6 cannot answer the question “how many variables matter?” without further work.
Which variables likely contribute the most to performance? In Figure 7, we illustrate the top 30 most important variables of Figure 6 (the right tail) and print their variable names. We now describe the top seven variables and speculate as to why they contain the most information about context quality.
1. similar_1_10 tallies the number of the most similar word stems to the target word (the top 10 most similar as returned by the DISCO system querying the COCA
ŷ0 threshold P (Y < 0) P (Y ∈ [0, 0.5)) P (Y > 1) P (throwout of Y > 1) # accepted
0.445 0.091 0.297 0.226 0.110 51116
0.470 0.086 0.293 0.231 0.136 48696
0.495 0.082 0.289 0.236 0.165 46111
0.520 0.078 0.283 0.242 0.196 43356
0.545 0.072 0.278 0.249 0.231 40424
0.570 0.068 0.272 0.256 0.272 37352
0.595 0.064 0.265 0.264 0.314 34259
0.620 0.062 0.257 0.273 0.357 31130
0.645 0.057 0.250 0.282 0.406 27992
0.670 0.054 0.243 0.290 0.457 24921
0.695 0.051 0.236 0.300 0.507 21964
0.720 0.048 0.230 0.310 0.557 19113
0.745 0.046 0.222 0.320 0.610 16367
0.770 0.043 0.216 0.332 0.657 13999
0.795 0.040 0.208 0.343 0.699 11899
0.820 0.039 0.200 0.355 0.741 9953
0.845 0.035 0.191 0.369 0.778 8227
0.870 0.032 0.183 0.383 0.811 6788
0.895 0.032 0.174 0.394 0.840 5614
0.920 0.029 0.168 0.410 0.868 4492
0.945 0.028 0.160 0.424 0.893 3563
0.970 0.027 0.155 0.438 0.913 2856
0.995 0.027 0.152 0.443 0.928 2307
1.020 0.026 0.145 0.454 0.942 1802
1.045 0.024 0.149 0.459 0.954 1411
1.070 0.025 0.149 0.474 0.963 1096
1.095 0.022 0.146 0.491 0.971 862
1.345 0.020 0.177 0.503 0.995 147
1.370 0.015 0.174 0.492 0.996 132
1.395 0.017 0.197 0.453 0.996 117
1.420 0.019 0.202 0.452 0.997 104
1.570 0.015 0.191 0.412 0.998 68
1.670 0.000 0.152 0.435 0.999 46
TABLE 3: Oos linear model performance results for the [word unseen] system operating under ŷ0 thresholds which
produce similar P (Y < 0) quantiles as Table 2. Here, oos R2 = 0.167 which is approximately the same as the RF model
but performance based on our holistic cost metric is considerably lower. Again, R² is not an appropriate gauge of fit for
our model’s performance.
TABLE 4: Same as Table 2 but now we only look at oos results for bands 1–6. Note the proportion of directive contexts is
about the same but throwout has improved substantially.
corpus). This information is likely important because the reiteration of words related to the meaning of the target word provides contextual clues.
2. collocation_1_10 is the same as similar_1_10 except it tallies the top collocated words. Similar to words with related meaning, these words aid in scoping and limiting the meaning of the target word.
3. politeness_component is a feature returned by the Sentiment Analysis and Social Cognition Engine (SEANCE) [65] that measures politeness using the dictionary lists of “polite” words found in Stone’s [69] General Inquirer and Lasswell and Namenwirth’s [70] dictionary lists. As to why this feature is important for informativeness, we do not know, but the explanation for number 7 below may apply.
4. Kuperman.AoA.AW is a feature returned by the Tool for the Automatic Analysis of Lexical Sophistication [61] that tallies the age of acquisition (a scale created by [40]) for all words in the context. This is likely a proxy for difficulty of the context.
5. Kuperman.AoA.CW is the same as above except it tallies for content words only. We assume these two features are collinear (future work can verify), thereby sharing places in the split rules in the Random Forest. If so, then Kuperman’s scale should occupy the number 1 position.
6. count.word1.target.word2 is a feature computed from our trigrams database illustrated on line 5 of Figure 2. Higher counts indicate the specific use in this context is prevalent. These two words sandwiching the target word likely provide information about its meaning.
7. MRC.Meaningfulness.CW is a feature returned by the Tool for the Automatic Analysis of Lexical Sophis-
[Figure 7: dot chart. x-axis: %IncMSE (ticks at 10, 15, 20, 25); y-axis: feature names, listed here from top to bottom as they appear in the figure.]
similar_1_10
collocation_1_10
count.target.word1
politeness_component
Kuperman.AoA.AW
Kuperman.AoA.CW
count.word1.target.word2
MRC.Meaningfulness.CW
num_times_target_exact_match
num_acronyms
repeated_content_lemmas_pronoun
Polit_GI
prob.word1.target.word2
prob.target.word1
All.AWL.Normed
MRC.Meaningfulness.AW
affect_friends_and_family_component
Positive_EmoLex
count.target
num_times_target_stem_match
SUBTLEXus.Freq.AW
Econ_GI
prob.word1.word2.target
BNC.Written.Bigram.Proportion
Arousal_nwords
SUBTLEXus.Range.CW.Log
BNC.Written.Trigram.Proportion
disjunctions
Arousal
Brown.Freq.CW.Log
Fig. 7: Variable importance for the top 30 most important features as measured by the increase in percent mean squared error.
Descriptions of each individual variable are found in Appendix C.
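The idea behind the importance scores in Figures 6 and 7 (permute one feature, hold the others fixed, and record how much the mean squared error grows) can be sketched as follows. This is a simplified illustration and not the Random Forest procedure itself: the RF metric is computed from out-of-bag samples, whereas this sketch scores any fitted predictor on a held-out set. The names `mse` and `permutation_importance` and the toy model are hypothetical.

```python
import random


def mse(model, X, y):
    """Mean squared error of model(row) against the targets y."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean increase in MSE after shuffling each feature column in turn.

    model: callable mapping one feature row (a list) to a prediction.
    X: list of feature rows (lists); y: list of targets.
    """
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    increases = []
    for j in range(len(X[0])):
        deltas = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            deltas.append(mse(model, X_perm, y) - baseline)
        increases.append(sum(deltas) / n_repeats)
    return increases


# Toy example: the "model" uses feature 0 only, so feature 1 should score zero.
X = [[float(i), float(i % 3)] for i in range(8)]
y = [2.0 * row[0] for row in X]
toy_model = lambda row: 2.0 * row[0]
scores = permutation_importance(toy_model, X, y)
```

A feature whose shuffled values leave the error unchanged, or lower it, is a candidate non-contributor, which is the reading applied to the features at or below the red cutoff in Figure 6.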
correlate with informativeness, contexts can be sorted based on educational value. Space limitations preclude us from discussing more features, the features in greater depth and how our RF model uses feature interactions. There is no doubt much can be learned about the “why” of context informativeness by querying our model.
3.5 Ordering of the Contexts
As discussed during our problem setup, we prefer uninformative contexts to appear at the later stages of learning. To test this, we use the model in Table 2 corresponding to the threshold of 0.845. We examine all words with five or more contexts in the oos use distribution and plot the probability of seeing an uninformative or misdirective context by order of appearance once we order by predicted informativeness best to worst (Figure 8). As shown, about 47% of the words ever had such a context and it was uncommon to see that type of context early on; only 11% of words in the first two exposures and 31% within the first 5 exposures.
4 EXTERNAL TEACHER VALIDATION
The stated purpose of our system is to automatically identify informative contextual examples for vocabulary instruction in high school students. To externally validate the quality of our [word unseen] system’s output, we conducted a randomized experiment with high school teachers.
4.1 Methods
Participants included 31 high school language arts teachers from the United States (30 from South Carolina and 1 from Connecticut). They were recruited through social media advertising and an email campaign. Each participant was asked to complete a web-based survey asking basic demographic information and 18 experimental questions.
Each participant was shown different experimental questions created as follows. First, three target words were randomly selected without replacement from each of the first six difficulty bands for a total of 18 unique words. For each word, one context was drawn at random from the original DictionarySquared database (uncurated) and one context was drawn at random from the future use distribution, the set predicted to be used by the [word unseen] model at ŷ0 = 0.895 (curated). Put another way, one context was drawn randomly from the red distribution in Figure 4 and one was drawn randomly from the green distribution. For each word, the teacher was asked which context would be better for teaching the word. The two contexts were presented randomly side-by-side below the prompt. A screenshot of the experimental question is provided in Figure 9. Each question was a separate web page and teachers were not allowed to change responses upon each page submission.
Collecting many comparisons of a curated context and an uncurated context drawn at random is an honest way to test if our out-of-sample results of Table 2 comport with professional language arts educators’ preferences. Here we are testing the superiority of the median informativeness of the set of contexts sifted by our model.
27 participants answered all 18 questions, and the remaining 4 participants answered between 1 and 11 questions for a total of n = 502 trials. The 27 teachers who completed the survey were rewarded with a $5 Amazon gift card, an incentive that was known before their participation.
4.2 Results
Of the 502 trials, the curated set was selected 305 times or 60.7%, a significant result when testing the null hypothesis that there is no preference between the curated and uncurated contexts (p ≈ 1.8 × 10⁻⁶). This test is only valid if all teachers’ responses are statistically independent. To test this assumption, we tested the differential performance between the teachers using a logistic regression with dummy variables for each teacher. Comparing this model against a simple intercept-only model via the likelihood ratio test yields an insignificant result (p = 0.46). Thus, there is no reason to believe the assumption of independence is not justified.
These results provide unequivocal external validation that our [word unseen] model selects contexts that are more suitable for teaching than the original DictionarySquared database. However, it may seem that the teachers’ choice of the curated set 60.7% of the time is low; you may wonder why the teachers could not select the context from the curated set 100% of the time. An analysis of Figure 4 will demonstrate that perfect discrimination is not possible: the average context in the uncurated set is between nondirective and general and, due to random sampling, the context from the uncurated set can be more informative than the context from the curated set. When running simulations, 60.7% is in the ballpark of expected discrimination, especially when considering the fact that there are inevitable judgment calls when the two contexts are similar in informativeness.
5 DISCUSSION
Considering our RF model for the [word unseen] system, we argue that our predictive performance is good enough to implement this system in a context-collection effort without the need for a human rater. However, we limit our unconditional recommendation to future target words within the levels defined by our bands 1–6, words that were externally validated by high school teachers and words for which the throwout rate is not overly punitive.
The following example may demonstrate a typical final product of our model’s curation. For the target word “malevolent” in band 6, the following context
From Scotland comes stories of the Old Hag or Night Hag. The Old Hag is a malevolent spirit that visits people in the middle of the night while they sleep. Those who survive this nocturnal visit report being awoken with a feeling of dread or unease but unable to move or speak. [75, third paragraph]
received an average MTurk rating of 0.7, i.e., directive and highly informative, as we can see from all the contextual clues. The RF oos prediction here is 0.89. Let’s compare this to
If you or someone you know has gotten nothing but heartache this Valentine’s Day (or any other occasion involving that malevolent blood-pumping
[Figure 8: point plot. y-axis: cumulative probability estimate (0.0 to 1.0); x-axis ticks: 0 to 20.]
Fig. 8: Under the use distribution, the cumulative probability distribution of the first exposure to a context with a true
rating less than 0.5 (non-informative or misdirective). We limit words here to those with 5 or more contexts.
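The curve described in the Fig. 8 caption could be estimated roughly as follows: for each word, sort its accepted contexts by predicted informativeness, find the position of the first context whose true rating is below 0.5, and accumulate over words. This is a sketch under assumptions; the function name `first_bad_exposure_cdf`, the input layout, and the toy words are hypothetical, and the denominator is taken to be all words with at least five accepted contexts.

```python
def first_bad_exposure_cdf(contexts_by_word, max_k, min_contexts=5, cutoff=0.5):
    """Cumulative share of words whose first low-rated context occurs by exposure k.

    contexts_by_word: dict mapping word -> list of (y_pred, y_true) pairs
    for the contexts accepted by the model.
    Returns a list whose entry k-1 is the estimate for exposure k.
    """
    first_positions = []
    for pairs in contexts_by_word.values():
        if len(pairs) < min_contexts:
            continue
        # Present contexts best to worst by predicted informativeness.
        ordered = sorted(pairs, key=lambda p: p[0], reverse=True)
        position = next(
            (i for i, (_, y) in enumerate(ordered, start=1) if y < cutoff),
            None,  # the word never shows a low-rated context
        )
        first_positions.append(position)

    n = len(first_positions)
    if n == 0:
        return []
    return [
        sum(1 for p in first_positions if p is not None and p <= k) / n
        for k in range(1, max_k + 1)
    ]


# Toy data (hypothetical): "a" hits a low-rated context at exposure 3,
# "b" never does, and "c" is skipped for having fewer than five contexts.
toy = {
    "a": [(0.99, 1.2), (0.98, 1.1), (0.97, 0.3), (0.96, 1.0), (0.95, 0.9)],
    "b": [(0.99, 1.0), (0.98, 1.4), (0.97, 0.8), (0.96, 0.6), (0.95, 1.1)],
    "c": [(0.90, -0.5), (0.91, 0.1)],
}
cdf = first_bad_exposure_cdf(toy, max_k=5)
```

Because words that never show a low-rated context stay in the denominator, the curve levels off below 1, in the same way the reported figure of about 47% of words reflects those that ever had such a context.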
Fig. 9: A screenshot of a typical experimental question for the teacher validation study. Here, the target word was “distinct”
and the teacher has already selected the context on the right by clicking on its box. By pressing “Next →”, the choice would
be finalized and the next question would be presented.
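The preference test reported in Section 4.2 (305 curated selections out of 502 trials against a null of no preference) can be checked with an exact binomial calculation. The sketch below uses only the standard library; the function name `binomial_two_sided_p` is hypothetical, and it assumes a two-sided test, which the paper does not state explicitly.

```python
from math import comb


def binomial_two_sided_p(successes, trials):
    """Exact two-sided p-value for H0: success probability = 0.5.

    Under p = 0.5 the distribution is symmetric, so the two-sided
    p-value is twice the tail at least as extreme as the observed count.
    """
    k = max(successes, trials - successes)
    upper_tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * upper_tail)


# Counts reported in Section 4.2.
p_value = binomial_two_sided_p(305, 502)
```

With these counts the p-value is on the order of 10⁻⁶, in line with the reported figure. The calculation treats the 502 trials as independent, which is the assumption the likelihood ratio test in Section 4.2 is meant to probe.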
organ), this 12-song collection offers the perfect antidote. Includes the J. Geils Band’s immortal “Love Stinks,” Gram Parsons’ defining version of “Love Hurts,” Joy Division’s “Love Will Tear Us Apart,” and more. [76, fourth paragraph]
which received an average MTurk rating of -0.1, indicating it is non-directive and possibly even misdirective. Here, the RF oos prediction is 0.48.
Thus, when implementing the RF model for the [word unseen] system in Table 4 for a highlighted threshold, the first context would be administered to students in our vocabulary instruction system and the second would not be administered.
The last point to discuss is the high throwout rates of our best contextual examples of the target word. In order for this system to be practical, we would query massive corpora (such as the Internet) for contextual examples and we would optimize our routines that compute the 615 features. Both are possible and thus we do not anticipate the high throwout rate to be a problem in practice.
In summary, we have developed a system that has the ability to automatically identify informative contexts for learning arbitrary words of interest and our technology can be greatly beneficial to educators and researchers.
5.1 Future Directions
This work represents our initial steps toward automatic identification of useful contexts for vocabulary learning.
[23] E. D. Reichle and C. A. Perfetti, “Morphology in word identification: A word-experience model that accounts for morpheme frequency effects,” Scientific Studies of Reading, vol. 7, no. 3, pp. 219–237, 2003.
[24] W. E. Nagy and J. A. Scott, “Vocabulary processes,” in Handbook of Reading Research (Volume 3), M. L. Kamil, P. Mosenthal, P. D. Pearson, and R. Barr, Eds. Mahwah, NJ: Erlbaum, 2000, pp. 269–284.
[25] S. Adlof, M. McKeown, C. Perfetti, A. Kapelner, S. Nessaiver, and J. Soterwood, “DictionarySquared development paper,” 2016, working paper.
[26] I. L. Beck, M. G. McKeown, and E. S. McCaslin, “Vocabulary development: All contexts are not created equal,” The Elementary School Journal, vol. 83, no. 3, p. 177, Jan. 1983.
[27] G. A. Frishkoff, C. Perfetti, and K. Collins-Thompson, “Predicting robust vocabulary growth from measures of incremental learning,” Scientific Studies of Reading, vol. 15, no. 1, pp. 71–91, Jan. 2011.
[28] S. Adlof, G. Frishkoff, J. Dandy, and C. Perfetti, “Effects of induced orthographic and semantic knowledge on subsequent learning: a test of the partial knowledge hypothesis,” Reading and Writing, vol. 29, no. 3, pp. 475–500, 2016.
[29] E. Silverstein. (n.d.) Avoiding seasickness. [Online]. Available: http://www.cruisecritic.com/articles.cfm?ID=48
[30] M. Ingram. (n.d.). [Online]. Available: http://www.woebot.com/movabletype/archives/000138.html
[31] J. Brown and M. Eskenazi, “Retrieval of authentic documents for reader-specific lexical practice,” in InSTIL/ICALL Symposium 2004, 2004.
[32] K. Collins-Thompson and J. Callan, “A language modeling approach to predicting reading difficulty,” in Proceedings of NAACL HLT, 2004.
[33] J. Mostow and W. Duan, “Generating example contexts to illustrate a target word sense,” in Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, 2011, pp. 105–110.
[34] S. Hassan and R. Mihalcea, “Learning to identify educational materials,” ACM Transactions on Speech and Language Processing (TSLP), vol. 8, no. 2, p. 2, 2011.
[35] S. S. Nash, “Learning objects, learning object repositories, and learning theory: Preliminary best practices for online courses,” Interdisciplinary Journal of Knowledge and Learning Objects, vol. 1, pp. 217–228, 2005.
[36] J. Mostow, D. Gates, R. Ellison, and R. Goutam, “Automatic identification of nutritious contexts for learning vocabulary words,” International Educational Data Mining Society, 2015.
[37] T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning, tenth printing, second ed. Springer, 2009.
[38] A. Coxhead, “A new academic word list,” TESOL Quarterly, vol. 34, no. 2, pp. 213–238, 2000.
[39] S. Zeno, S. Ivens, R. Millard, and R. Duvvuri, The educator’s word frequency guide. Brewster, NY: Touchstone Applied Science Associates, 1995.
[40] V. Kuperman, H. Stadthagen-Gonzalez, and M. Brysbaert, “Age-of-acquisition ratings for 30,000 English words,” Behavior Research Methods, vol. 44, no. 4, pp. 978–990, 2012.
[41] I. L. Beck, M. G. McKeown, and R. C. Omanson, “The effects and uses of diverse vocabulary instructional techniques,” in The nature of vocabulary acquisition, M. McKeown and M. E. Curtis, Eds. Hillsdale, NJ: Erlbaum, 1987, pp. 147–163.
[42] P. Nation, “Learning vocabulary in lexical sets: Dangers and guidelines,” TESOL Journal, vol. 9, no. 2, pp. 6–10, 2000.
[43] S. A. Stahl and W. E. Nagy, Teaching word meanings. Routledge, 2007.
[44] R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng, “Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2008, pp. 254–263.
[45] E. Gibson, S. Piantadosi, and K. Fedorenko, “Using Mechanical Turk to obtain and analyze English acceptability judgments,” Language and Linguistics Compass, vol. 5, no. 8, pp. 509–524, 2011.
[46] J. K. Goodman, C. E. Cryder, and A. Cheema, “Data collection in a flat world: The strengths and weaknesses of Mechanical Turk samples,” Journal of Behavioral Decision Making, vol. 26, no. 3, pp. 213–224, 2013.
[47] G. Paolacci, J. Chandler, and P. G. Ipeirotis, “Running experiments on Amazon Mechanical Turk,” Judgment and Decision Making, vol. 5, no. 5, pp. 411–419, 2010.
[48] A. Kittur, E. H. Chi, and B. Suh, “Crowdsourcing user studies with Mechanical Turk,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2008, pp. 453–456.
[49] P. G. Ipeirotis, F. Provost, and J. Wang, “Quality management on Amazon Mechanical Turk,” in Proceedings of the ACM SIGKDD Workshop on Human Computation. ACM, 2010, pp. 64–67.
[50] D. M. Oppenheimer, T. Meyvis, and N. Davidenko, “Instructional manipulation checks: Detecting satisficing to increase statistical power,” Journal of Experimental Social Psychology, vol. 45, no. 4, pp. 867–872, 2009.
[51] V. C. Raykar and S. Yu, “Eliminating spammers and ranking annotators for crowdsourced labeling tasks,” Journal of Machine Learning Research, vol. 13, no. Feb, pp. 491–518, 2012.
[52] T. M. Byun, P. F. Halpin, and D. Szeredi, “Online crowdsourcing for efficient rating of speech: A validation study,” Journal of Communication Disorders, vol. 53, pp. 70–83, 2015.
[53] H. Wickham, ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009.
[54] M. Bilenko, A. Kamenev, V. Narayanan, and P. Taraba, “Adaptive featurization as a service,” January 2016, US Patent 20,160,012,318.
[55] C. C. Aggarwal and C. Zhai, Mining text data. Springer Science & Business Media, 2012.
[56] I. Goodfellow, Y. Bengio, and A. Courville, “Deep learning,” 2016, book in preparation for MIT Press. [Online]. Available: http://www.deeplearningbook.org
[57] A. Franz and T. Brants. (2006) Official Google research blog: All our n-gram are belong to you. [Online]. Available: http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
[58] P. Kolb, “DISCO: A multilingual database of distributionally similar words,” Proceedings of KONVENS-2008, Berlin, 2008.
[59] D. Yarowsky, “Unsupervised word sense disambiguation rivaling supervised methods,” in Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 1995, pp. 189–196.
[60] M. Davies, “The 385+ million word corpus of contemporary American English (1990–2008+): Design, architecture, and linguistic insights,” International Journal of Corpus Linguistics, vol. 14, no. 2, pp. 159–190, 2009.
[61] K. Kyle and S. A. Crossley, “Automatically assessing lexical sophistication: Indices, tools, findings, and application,” TESOL Quarterly, vol. 49, no. 4, pp. 757–786, 2015.
[62] B. Laufer, “What’s in a word that makes it hard or easy? Intralexical factors affecting the difficulty of vocabulary acquisition,” in Vocabulary Description, Acquisition and Pedagogy, M. McCarthy and N. Schmitt, Eds. Cambridge, United Kingdom: Cambridge University Press, 1997, ch. 2.3, pp. 140–155.
[63] S. A. Crossley, K. Kyle, and D. S. McNamara, “The tool for the automatic analysis of text cohesion (TAACO): Automatic assessment of local, global, and text cohesion,” Behavior Research Methods, pp. 1–11, 2015.
[64] D. S. McNamara, A. C. Graesser, P. M. McCarthy, and Z. Cai, Automated evaluation of text and discourse with Coh-Metrix. Cambridge University Press, 2014.
[65] S. A. Crossley, K. Kyle, and D. S. McNamara, “Sentiment analysis and social cognition engine (SEANCE): An automatic tool for sentiment, social cognition, and social-order analysis,” Behavior Research Methods, pp. 1–19, 2016.
[66] A. Blum and T. Mitchell, “Combining labeled and unlabeled data with co-training,” in Proceedings of the Eleventh Annual Conference on Computational Learning Theory, 1998, pp. 92–100.
[67] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[68] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, “The DET curve in assessment of detection task performance,” National Institute of Standards and Technology, Gaithersburg, MD, Tech. Rep., 1997.
[69] P. J. Stone, D. C. Dunphy, and M. S. Smith, The general inquirer: A computer approach to content analysis. MIT Press, 1966.
[70] H. D. Lasswell and J. Z. Namenwirth, The Lasswell value dictionary. New Haven: Yale University Press, 1969.
[71] M. Coltheart, “The MRC psycholinguistic database,” The Quarterly Journal of Experimental Psychology, vol. 33, no. 4, pp. 497–505, 1981.
[72] S. A. Crossley and D. S. McNamara, “Understanding expert ratings of essay quality: Coh-Metrix analyses of first and second language writing,” International Journal of Continuing Engineering Education and Life Long Learning, vol. 21, no. 2-3, pp. 170–191, 2011.