Automatic Assessment of Syntactic Complexity For Spontaneous Speech Scoring
S. Bhat, S.-Y. Yoon
Speech Communication 67 (2015) 42–57
Received 12 November 2013; received in revised form 7 July 2014; accepted 12 September 2014
Available online 26 September 2014
Abstract
Expanding paradigms of language learning and testing prompt the need for developing objective methods of assessing language proficiency from spontaneous speech. In this paper, new measures of syntactic complexity are studied for use in the framework of automatic scoring systems for spontaneous second language speech. In contrast to most existing measures, which estimate competence levels indirectly based on the length of production units or the frequency of specific grammatical structures, we capture the differences in the distribution of morpho-syntactic features across learners' proficiency levels. We build score-specific models of part-of-speech (POS) tag distribution from a large corpus of spontaneous second language English utterances and use them to measure syntactic complexity. Given a speaker's response, we consider its similarity with a set of utterances scored for proficiency by humans. The comparison is made between the distribution of POS tags in the response and that of a score level. The underlying distribution of POS tags (indicative of syntactic complexity) is represented via two models: a vector-space model and a language model.

Empirical results suggest that the proposed measures of syntactic complexity show a reasonable association with human-rated proficiency scores compared to conventional measures of syntactic complexity. They are also considerably more robust against errors resulting from automatic speech recognition, making them more suitable for use in operational automated scoring applications. When used in combination with other measures of oral proficiency in a state-of-the-art scoring model, the predicted scores show improved agreement with human-assigned scores over a baseline scoring model without our proposed features.
© 2014 Elsevier B.V. All rights reserved.
Keywords: Language testing; Automated scoring; Speaking proficiency; Computer-aided language learning; Syntactic complexity; Objective measures
http://dx.doi.org/10.1016/j.specom.2014.09.005
1. Introduction

Overall spoken proficiency in a target language can be assessed by testing the abilities in various areas including fluency, pronunciation and intonation, grammar and vocabulary, and discourse structure. Currently, speech-enabled dialog systems allow learners to practice their speaking and listening with a virtual interlocutor (e.g., SpeakESL), and to receive feedback on their pronunciation [e.g., Carnegie Speech, or Native Accent (Eskenazi et al., 2007), EduSpeak from SRI (Franco et al., 2000)].

These and other spoken response scoring systems work on restricted speaking tasks such as reading a passage or answering questions with a limited range of responses (Bernstein et al., 2000; Balogh et al., 2007). In contrast to these systems that score restricted speech, scoring unstructured, unrestricted, and spontaneous responses poses a much harder problem. In addition, if the systems target learners with diverse levels of second language proficiency and varied first language backgrounds, the difficulty increases substantially.

The state-of-the-art system for scoring spontaneous speech in a testing scenario is SpeechRaterSM (Zechner et al., 2009). Although the current capability is sufficiently advanced to allow it to be used for the scoring of TOEFL® Practice Online (TPO), a low-stakes practice test product, there is room for improving its feature set by expanding the coverage of important aspects of speaking proficiency and modifying others. For instance, aspects of grammar and vocabulary sophistication are only being measured indirectly (more details on this later in this paper), and a more direct approach to measuring these aspects is necessary.

Taking into consideration the challenges posed in processing spontaneous speech automatically, we propose a set of measures of grammatical competence. This paper describes the measures and their potential for use in a state-of-the-art spontaneous speech scoring system. In Section 2 the problem being studied is placed into the context of previous work done in the related areas of written and spoken language assessment. A description of the measures studied in this paper is found in Section 3. In Section 4, we delve into the details of the implementation of our proposed measures. A description of the data is provided in Section 5. The experimental details comprise the material in Section 6 and the results are presented in Section 7. In Section 8 we discuss the results of data analyses and highlight some extensions to the study. Finally, a brief summary of the major findings of the paper is presented in Section 9.

2. Motivation

2.1. Assessment of syntactic competence in second language learning

Numerous studies in the related second language acquisition literature reveal that syntactic complexity and grammatical accuracy are regarded as some of the key skills that strongly influence second language proficiency. Thus, the study of measures that reflect language learners' command of these influential skills has been a central theme of various studies in the area of second language acquisition.

In the related literature, Ortega (2003) indicates that "the range of forms that surface in language production and the degree of sophistication of such forms" are two important areas in grammar usage, collectively termed "syntactic complexity". A vast majority of measures of syntactic complexity have been used as indicators of levels of acquisition of syntactic competence and, in turn, are suggestive of proficiency levels in ESL writing (e.g. Wolf-Quintero et al., 1998; Ortega, 2003; Lu, 2010). These measures have been broadly classified into two groups (Bardovi-Harlig and Bofman, 1989). The first group is related to the acquisition of specific grammatical expressions corresponding to various stages of language acquisition. Frequencies of negation or relative clauses (in terms of whether these expressions occurred in the test responses without errors) fall into this group (hereafter, the expression-based group). The second group, not tied to particular structures, is related to the length of clauses or the relationship between clauses (hereafter, the length-based group). Representative measures in the second group include the mean length of clause unit, the ratio of dependent clauses to the total number of clauses, and the number of verb phrases per clause.

In contrast with syntactic complexity, grammatical accuracy is the ability to generate sentences without grammatical errors. The measures in this group can be classified into two groups. Global accuracy measures include those that count all errors in sentence production and are calculated as normalized values, e.g., the percentage of error-free clauses among all clauses (Foster and Skehan, 1996). A second group of measures is more focussed on specific types of constructions such as verb tense, third-person singular forms, prepositions, and articles, and calculates the percentage of error-free clauses with respect to these constructions (Robinson, 2006; Iwashita et al., 2008).

In the area of spoken language assessment, researchers have sought the application of measures of syntactic competence and grammatical accuracy. In particular, Halleck (1995)'s study found that in the context of English as a foreign language (EFL) assessment, holistic oral proficiency scores were highly correlated with three quantitative measures (mean length of T-units,1 mean error-free T-unit length, and percentage of error-free T-units). Again, the results from a similar study that included both English and Japanese foreign language assessment confirmed the utility of these and other quantitative measures that assess grammatical accuracy and syntactic complexity, in addition to vocabulary, pronunciation, and fluency (Iwashita et al., 2008; Iwashita, 2010). However, the results were inconclusive about the strength of the relationship between the measures and the proficiency scores.

1 Hunt (1970) proposed the idea of a T-unit, which is a main clause with a subordinate clause and non-clausal units. It is different from a clause since it does not consider a subordinate clause as an independent unit.
Strong data dependencies on the participant groups were reported in (Iwashita, 2010), and the discriminative ability of these quantitative measures with respect to proficiency levels was not adequate. With proficiency levels rated on a scale, the measures could only broadly discriminate students' proficiency levels but failed to make fine-grained distinctions between adjacent levels; there were large variations within a given level and the differences between the proficiency levels were not always statistically significant. It is important to note that in all the ESL-related studies mentioned above, the measures were obtained manually.

Studies in automated speech scoring have focused on measurements of several aspects of speech production, including fluency (Cucchiarini et al., 2000, 2002), pronunciation (Witt and Young, 1997; Witt, 1999; Franco et al., 1997; Neumeyer et al., 2000), and intonation (Zechner et al., 2009). However, research on measurements related to grammar usage is still nascent.

Zechner et al. (2009) include a normalized language model score of the speech recognizer as a grammatical measure. This measures the similarity between the word distribution in the response and that in the language model,2 rather than the accuracy and diversity of grammatical expressions.

2 The language model was trained on non-native students' speech and broadcast news. Consequently, the grammar errors occurring in non-native students' speech may result in a good language model score, while grammatically correct but rare expressions may result in a bad score.

With the recent foray into the realm of automated assessment of spontaneous spoken language (Zechner et al., 2009, 2011), there is a concurrent need to develop quantitative measures of syntactic complexity and grammatical accuracy for use in such an environment. Two important factors govern the use of any measure in this realm. The first has to do with the discriminative ability of the measure with respect to the target; this aspect is related to the utility of the measure. The second has to do with the way in which the measures are obtained from the data. By the very nature of automated assessment, it is expected that these measures be obtained automatically.

Studies have only recently begun to actively investigate the usefulness of syntactic measures in the realm of automated scoring of spontaneous speech (Bernstein et al., 2010; Chen and Yoon, 2011; Chen and Zechner, 2011). These studies have used measures such as the average length of clauses or sentences (Chen and Yoon, 2011), previously studied in the context of writing assessment. In addition to these length-based measures, Chen and Zechner (2011) used parse-tree based features such as the mean depth of parse-tree levels. Using measurements from manual annotations as well as hypotheses generated by the speech recognition engine, these measures were used to predict the oral proficiency score of the learner given his/her response. These studies found that including length-based and parse-tree based features in an automated scoring model resulted in substantially degraded performance (as compared to manually derived features from manually transcribed responses). This degradation has been attributed to errors in the automatic prediction of clause and sentence boundaries as well as those from the ASR. This raises natural questions regarding the utility of those measures for use in speech scoring.

A combination of multiple difficulties encountered in the context of processing spoken responses may help explain the issue. Most measures used in these studies were based on production units such as clauses and T-units. This being the case, the task of identifying them in speech, which is naturally endowed with frequent occurrences of fragments and ellipses, is hard. Additionally, speech from language learners tends to include frequent grammatical errors, which only increase the difficulty of identifying the units. Foster et al. (2000) listed multiple examples of difficult cases, such as phrases without subjects or verbs, and showed that having reliable units representing portions of speech with which to consistently assess features such as accuracy or complexity is not easy to accomplish. A third difficulty is due to the length of spoken responses, which are typically shorter than written responses. Most measures based on sentence or sentence-like units, found to be very reliable in measuring syntactic complexity from written responses, are rendered less reliable for use in speaking tasks that elicit only a few sentences. Not surprisingly, Chen and Yoon (2011) observed a marked decrease in the correlation between syntactic measures and proficiency as response length decreased.

One is faced with even more obstacles while using syntactic complexity and grammatical accuracy measures in automated speech scoring. Spontaneous speech contains disfluencies such as repairs and repetitions, and these disfluencies need to be processed appropriately before calculating the quantitative measures. For instance, in calculating grammatical accuracy measures such as the mean length of error-free T-units, only the corrected parts found in the repairs should be considered and the repetitive parts should be excluded. This requires a high-performing automated disfluency detector. However, such a tool shows suboptimal performance even with native speech. In addition, speech recognition errors only worsen the situation. Chen and Zechner (2011) showed that a moderate correlation between the score for syntactic complexity and speech proficiency (correlation coefficient = 0.49) was drastically reduced when used with automatic speech recognition (ASR) outputs. This reduction in correlation was found to be due to speech recognition errors. Due to these problems, the existing syntactic complexity and grammatical accuracy measures do not seem reliable enough to be used in automated speech proficiency scoring.

In this study, we address the need for measures of grammatical ability, in the sense defined by Ortega (2003), to encompass range and sophistication of surface forms in speech.
The requirements are: (a) the measures should correspond to differences in proficiency levels in non-native spontaneous speech and (b) the measures should be reliable for use in the context of automated speech scoring. In (Yoon and Bhat, 2012), inspired by vector-space modeling in the area of information retrieval, a new measure of grammatical competence based on comparing the similarity of a given response to a body of learner responses was studied. This measure was found to be relatively robust against speech recognition errors and was reliable for use with short responses, making it usable for automated scoring of spontaneous speech in typical language assessment scenarios. In contrast to recent studies focusing on length-based and grammatical structure-based features (as outlined above), the focus was on capturing the differences in the distribution of grammatical expressions across proficiency levels.

In this paper, we extend the study in (Yoon and Bhat, 2012) and propose a new measure that is also similarity-based and derived from a language modeling approach to natural language processing. Comparing the two similarity-based measures side-by-side, we first study their degree of association with proficiency scores as an indication of their utility as measures of syntactic complexity. We then study the performance of an automatic scoring model that uses these features alongside features representing other aspects of language ability. Subsequently, we compare the performance of such a scoring model with that using conventional measures of syntactic complexity that have been found to be useful in the context of automated scoring for written language assessment.

3. Measures of syntactic complexity

With the eventual goal of automatically scoring overall spoken proficiency levels covering various aspects of speaking proficiency, including grammatical accuracy, syntactic complexity, vocabulary diversity, and discourse structure, our immediate goal is to measure the grammatical competence of a given second language utterance. As will be described, conventional (or typical) measures of syntactic competence use sentence or clause length-based measures. In contrast to this, our approach will be to classify the syntactic complexity of an utterance as being similar to a particular category of learner responses (proficiency class) by constructing representative models that are score-class specific.

3.1. Measures from written language assessment

We use a set of fourteen measures of syntactic complexity studied in (Lu, 2010), found to be highly predictive indices of syntactic complexity for grading second language learner essays. Although the study points out that the results cannot be readily extended to productions with a large proportion of grammatically incomplete sentences (such as those produced by beginner-level learners), we chose these measures in order to obtain a baseline, since no other measures of syntactic complexity for spoken utterances are as yet available in related studies.

In (Lu, 2010), these measures of syntactic complexity were chosen based on the literature pertaining to second language development studies. They represent a fairly complete picture of the repertoire of measures that second language development researchers draw from. They can be categorized into five types and are listed in Table 1.3

3 The study further defines the various production units and syntactic structures used in the measures; since we feel that the definitions and the associated details are beyond the scope of the current study, we omit them from our discussion and direct the interested reader to the paper for more details.

With the goal of having measures that normalize the effect of the varying length of utterances, we choose only the ratio-based measures (types II, III, IV and V) and exclude measures of the first type.

3.2. Proposed measures

Our proposed syntactic complexity measures have the following two characteristics. First, in contrast to most methods that consider scores of syntactic complexity as a combination of measures of the length of sentence-based units or the frequency of specific grammatical structures, we directly measure students' sophistication and range in grammar usage based on the distribution of syntactic constructions.

Second, instead of rating a learner's response using a scale based on native speech production, our experiments compare it with a similar body of learners' responses. Considering the variety of grammatical structures that native speakers produce, a comparison of learners' constructions to those of native speakers implies searching in a large space of possible constructions. We instead collected a large amount of learners' spoken responses and classified them into four groups according to their proficiency level (as assigned by professional raters). We then examined how distinct the proficiency classes were based on the distribution of part-of-speech (POS) tags. We hypothesized that the level of acquired grammatical proficiency is signaled by the distribution of the POS tags. Hence, given a student's response, we obtained its similarity with score-specific models that capture the distribution of POS tags at each score level. The intuition here is that the similarity of the POS distribution of a response to that of the representative response of a given score class is governed by the proficiency level of the response.

Working with POS tags as features has the advantage of a much lower dimensionality than would be needed when using lexical information. Moreover, rather than focusing on specific grammatical constructions, we utilize the more fine-grained information at the level of sequences of POS tags. This in turn provides us with a convenient way to counter the effect of differences in topics in speaker responses, as would be the case in several scenarios of practical importance.
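To make the representation concrete, the sketch below (our own illustration, not the authors' code) shows how a single response transcript could be reduced to a distribution over POS tag bigrams. NLTK's default English tagger stands in for the POS tagger described later in Section 4.2, and the function name is hypothetical; the sketch also assumes the standard NLTK tokenizer and tagger models are installed.

```python
# Minimal sketch: turn a response transcript into a POS bigram distribution,
# the representation on which both proposed similarity measures operate.
from collections import Counter

import nltk  # assumes the 'punkt' and default POS tagger models are available


def pos_bigram_distribution(transcript: str) -> Counter:
    """Return relative frequencies of POS tag bigrams for one response."""
    tokens = nltk.word_tokenize(transcript)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]   # e.g. ['PRP', 'VBP', 'DT', ...]
    bigrams = list(zip(tags, tags[1:]))               # adjacent tag pairs
    counts = Counter(bigrams)
    total = sum(counts.values()) or 1
    return Counter({bg: c / total for bg, c in counts.items()})


print(pos_bigram_distribution("I think the project may change next year"))
```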
The following features are derived from these score-specific models:

- cosi: the cosine similarity value of the test response with the representative vector of score level i = 1, …, 4.
- cosmax: the score level with the highest similarity score given the response.
- lmi: the log-probability (likelihood) of the LM of score level i = 1, …, 4.
- lmmax: the score level of the LM with the maximum log-probability given the response.

We assume that the score level having the highest cosine similarity (or log-likelihood) value for a test response is the one most similar to the test response in terms of its underlying syntactic complexity. In contrast to cosi and lmi, which have continuous values, the two max features have the advantage that they can be directly interpreted as the proficiency level of the given response, based on its syntactic complexity.
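To make the definitions above concrete, here is a minimal sketch (ours, not the authors' implementation) of how the four feature families could be computed once the score-specific models exist. The data structures and function names are assumptions; in particular, whether and how the log-probabilities are length-normalized is left to the caller.

```python
# Sketch of the feature computation: given a response's POS n-gram vector,
# four score-level vectors, and four score-level POS LMs (built as in
# Sections 4.4 and 4.5), derive cos1..cos4, cosmax, lm1..lm4 and lmmax.
import math
from collections import Counter
from typing import Callable, Dict, List


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def vsm_features(response_vec: Counter, level_vecs: Dict[int, Counter]) -> Dict[str, float]:
    cos = {i: cosine(response_vec, level_vecs[i]) for i in (1, 2, 3, 4)}
    feats = {f"cos{i}": cos[i] for i in cos}
    feats["cosmax"] = max(cos, key=cos.get)       # score level with the highest similarity
    return feats


def lm_features(tag_seq: List[str],
                level_logprob: Dict[int, Callable[[List[str]], float]]) -> Dict[str, float]:
    # level_logprob[i] returns the log-probability of the tag sequence
    # under the score-level-i POS LM.
    lm = {i: level_logprob[i](tag_seq) for i in (1, 2, 3, 4)}
    feats = {f"lm{i}": lm[i] for i in lm}
    feats["lmmax"] = max(lm, key=lm.get)          # score level with the highest log-probability
    return feats
```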
The spoken responses were first recognized using the automatic speech recognizer described in Section 4.1. All transcriptions were tagged using the POS tagger described in Section 4.2, and POS tag sequences were extracted. This was followed by the model generation phase, where the POS-VSM models and the POS-LM models were obtained from the training data with different sets of tags. Finally, the features were generated for the responses in the test dataset.

4.1. Automatic speech recognizer

An HMM recognizer was trained using approximately 733 h of non-native speech collected from 7872 speakers. A gender-independent triphone acoustic model and a combination of bigram, trigram, and four-gram language models were used. A word error rate (WER) of 27% was observed on the held-out test dataset.
Three sets of POS tags were used: the original POS set without compound units (Base), the original set with 51 compound units (Base + mi50), and the original set with 100 compound units (Base + mi100). The number of tags was 42 for the Base set, 93 for the Base + mi50 set, and 150 for the Base + mi100 set.

4.4. Building VSMs

Unigram, bigram, and trigram tags were used in VSM building. For each n-gram, three VSMs were generated (one per tag set presented in Section 4.3, with the tags as terms), each yielding four score-specific vectors (one per score class), resulting in a total of nine VSMs. For each VSM we have cosi, i = 1, …, 4, and cosmax as features, with cosi taking values in [0, 1] and cosmax taking values in {1, 2, 3, 4}. The results were based on each n-gram model separately and we did not combine any models.

4.5. Training of POS n-gram LM

Unigram, bigram, trigram, 4-gram and 5-gram LMs were used in LM building. For each n-gram, three sets of LMs (using the three tag sets) were trained for each score level using score-specific training data. We used the SRILM toolkit (Stolcke, 2002) for training the models with Witten–Bell smoothing.5 For each n-gram and tag set combination, we obtained lmi, i = 1, …, 4, and lmmax as features. While lmi takes values in (−∞, 0], lmmax takes values in the set {1, 2, 3, 4}. As with the VSMs, the results were based on each n-gram model separately and we did not combine any models.

5 This smoothing was chosen to accommodate the non-zero frequency of POS tags, which prevents the application of Good–Turing smoothing.
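The model-generation step of Sections 4.4 and 4.5 can be pictured as in the sketch below. This is our own reading rather than the authors' code: the tf-idf style weighting of the score-level vectors is suggested by the discussion of idf values in Section 8.2, and the SRILM command given in the comment is only indicative of how the score-level LMs could be trained.

```python
# Sketch of building the four score-level VSM vectors from POS-tagged training
# responses. Each score level gets one vector of idf-weighted POS bigram counts
# pooled over that level's responses. For the LM side, SRILM was used in the
# paper; a comparable command would be
#   ngram-count -order 2 -wbdiscount -text level4.tags -lm level4.lm
import math
from collections import Counter
from typing import Dict, List


def level_vectors(tagged_responses: Dict[int, List[List[str]]]) -> Dict[int, Counter]:
    """tagged_responses maps score level (1-4) to a list of POS tag sequences."""
    # term frequencies pooled per score level
    tf = {level: Counter(bg for tags in seqs for bg in zip(tags, tags[1:]))
          for level, seqs in tagged_responses.items()}

    # idf computed over individual training responses
    all_seqs = [tags for seqs in tagged_responses.values() for tags in seqs]
    df = Counter(bg for tags in all_seqs for bg in set(zip(tags, tags[1:])))
    n_docs = len(all_seqs)
    idf = {bg: math.log(n_docs / df[bg]) for bg in df}

    # idf-weighted score-level vectors, one per proficiency level
    return {level: Counter({bg: counts[bg] * idf[bg] for bg in counts})
            for level, counts in tf.items()}
```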
5. Data

The data used in this study was a proprietary collection of responses from the Test of English as a Foreign Language® internet-based test (TOEFL® iBT), an international English language assessment required for studying in an English-medium university environment. The assessment consisted of six items per speaker, where speakers were prompted to provide responses lasting between 45 and 60 s per item, resulting in approximately 5.5 min of speech per speaker. Among the six items, two items were "independent items" that asked examinees to provide information or opinions on familiar topics based on their personal experience or background knowledge. The four remaining items were "integrated items" that required reading or listening in addition to speaking skills. Test takers read and/or listen to some stimulus materials and then respond to a question based on them. All items elicited spontaneous, unconstrained natural speech. While there were important differences between these task types from an assessment point of view (i.e. whether the item required listening and/or reading skills), they were not differentiated in this study, owing to the fact that both item types elicited unconstrained speech and did not have specific target grammar expressions.

Approximately 50,000 responses were collected and split into two datasets: the ASR set and the scoring model training/test (SM) set. The ASR set comprised about 733 h of speech and was used for ASR training and POS similarity model training. The SM set comprised 44 h of speech and was used for feature evaluation and automated scoring model evaluation. There was no overlap in items between the ASR set and the SM set. Each response was rated for proficiency by trained human scorers using a 4-point scoring scale, where 1 indicated low speaking proficiency and 4 indicated high speaking proficiency. The speaker and item information and the distribution of proficiency scores are presented in Table 2.

As seen in Table 2, there is a strong bias towards the middle scores (scores 2 and 3), with approximately 83–84% of the responses belonging to these two score levels. Although the skewed distribution of students at the low and high score levels increased the difficulty of feature development and model training, we used the data without modifying the distribution, since this constitutes a typical distribution of proficiency scores in a large-scale language assessment scenario. The mean scores in the two datasets were 2.67 and 2.63, respectively, and the overall score distributions of the two datasets were comparable.

773 responses in the ASR set and 4 responses in the SM set were not scoreable due to sub-optimal response characteristics such as technical difficulties (e.g., equipment or transmission errors, loud background noise) and lack of a student's response. These problematic responses were removed from the data, as a result of which the full set of 6 responses was not available for some speakers. The average number of responses per speaker for both data sets was more than 5.9, suggesting that these failures were relatively minor.

In order to evaluate the reliability of the human ratings, approximately 5% of the ASR set (a total of 2388 responses) was scored by two raters. Both the Pearson correlation coefficient r and the weighted kappa κ were 0.62, indicating the level of subjectivity in the task of proficiency scoring.

Table 3 presents a descriptive analysis of the length of production units for each score point. For the analysis, we used a total of 200 responses from TOEFL Practice Online (TPO), an online practice test which allows students to gain familiarity with the format of our main data (TOEFL); thus the responses are similar to those in our datasets. The manual transcriptions were annotated for locations of clause boundaries (hereafter, CBs) and disfluencies such as repetitions, repairs, and sentence fragments by two annotators with a linguistics background.6

6 Approximately 15% of the data was double annotated; κ was 0.95 for CBs and 0.75 for disfluencies.
Table 2
Data size and score distribution.

Data set | No. of responses | No. of speakers | No. of items | Score mean | Score SD | Score distribution: 1 / 2 / 3 / 4
ASR      | 47,227           | 7872            | 24           | 2.67       | 0.73     | 1953 (4%) / 16,834 (36%) / 23,106 (49%) / 5,334 (11%)
SM       | 2,876            | 240              | 12           | 2.63       | 0.75     | 141 (5%) / 1132 (39%) / 1263 (44%) / 340 (12%)

Table 3
Means and standard deviations of the number of production units for each score point.

Level | N  | Words per response (M, SD) | CBs per response (M, SD) | T-units per response (M, SD)
1     | 50 | 41.78, 27.56               | 5.38, 3.58               | 3.20, 2.01
2     | 50 | 71.38, 31.48               | 7.38, 3.31               | 4.86, 2.18
3     | 50 | 85.28, 32.00               | 9.22, 3.84               | 5.50, 2.43
4     | 50 | 94.22, 38.98               | 9.48, 3.93               | 5.94, 2.97
Disfluencies frequently occur in non-native speakers' spontaneous speech and they should not be considered to be parts of a sentence, because their inclusion will result in inflated lengths of production units. Therefore, disfluencies were removed based on the manual annotations. Secondly, the transcriptions were segmented into clauses, T-units, and sentences based on the human annotations. Finally, the means and standard deviations of each production unit were calculated for each score level.

6. Experiments

As such, we need a training set to obtain the underlying score-specific models, which we then use to generate the feature values using responses from the evaluation set. Towards this end, we used the data in two modes. In the manual mode, we used the manual transcriptions (henceforth termed manual), and in the automated mode (henceforth termed auto), we used the output of the speech recognizer. This set-up allowed for an understanding of the impact of ASR on the feature generation process as well as on the feature performance. Finally, we use the features obtained in the automated mode in a multiple regression scoring model and compare its performance with that using conventional features of syntactic complexity (also obtained in the automated mode).

6.1. Dataset combination

The experiments were conducted with the following combinations of datasets. We vary the nature of the transcriptions available (auto or manual) in the training and test sets, yielding the following (train, test) combinations for experimentation. Such a split of the dataset permits us to make the following observations on the performance of the proposed features in various operating scenarios.

- Set 1: (Manual, Manual) – in this mode, we have the best-case performance of the model being considered, with manual transcriptions available for both training and evaluation.
- Set 2: (Manual, auto) – in this set-up we observe the effects of ASR output on the performance of a model that is trained with manually transcribed data and tested on the ASR output.
- Set 3: (Auto, auto) – this set-up provides the level of performance in a true operational scenario where it is difficult to obtain manually transcribed responses.

6.2. Association between proposed measures, automatic scores and human ratings

We determine the utility of the features by computing each feature's correlation with human-assigned proficiency scores, as has been done in prior related studies. These features are then used in a multiple regression scoring model (as studied in Zechner et al. (2009)) and the resulting automatic scores are compared with the human-rated scores. As a baseline, we use the model studied in Zechner et al. (2009), which is then augmented with the features we propose in this study as well as with features using prototypical measures of syntactic complexity. We then compare the resulting scoring models based on the correlations of the predicted values with the human scores.
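A minimal sketch of the feature-evaluation step just described (our own illustration with toy data; the feature values and human scores below are hypothetical placeholders for per-response measurements):

```python
# Sketch of the evaluation in Section 6.2: Pearson correlation of each candidate
# feature with the human-assigned proficiency scores.
import numpy as np
from scipy.stats import pearsonr

human_scores = np.array([2, 3, 3, 4, 1, 2, 3, 2])  # human ratings (1-4), toy data
features = {
    "cos4":  np.array([0.61, 0.74, 0.70, 0.82, 0.40, 0.55, 0.77, 0.58]),
    "lmmax": np.array([2, 3, 3, 4, 2, 2, 3, 2]),
}

for name, values in features.items():
    r, p = pearsonr(values, human_scores)
    print(f"{name}: r = {r:.2f} (p = {p:.3f})")
```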
6.3. Factors for practical consideration

A primary requirement of automated scoring systems that make measurements on automatically recognized spoken responses is that the measures be immune to the ASR errors inherent in the process. The utility of a set of automatically derived measures is better analyzed in the context of a parallel analysis that considers measurements obtained using transcriptions obtained manually and automatically.
The results will aid the assessment of the overall utility of the measures with respect to a change in the mode of measurement. Our experiments and analyses will reflect this factor of practical significance.

Another factor of importance from a practical standpoint is that of generalizability. In this respect, we are interested in seeing the extent to which the size of the training data affects model performance. Along the lines of generalizability, we also explore the extent to which our method is applicable to scoring items unseen in the training set.

6.4. Comparison with prototypical measures from written language assessment

We compare the utility of the proposed measures with that of the most representative syntactic complexity measures used in (Lu, 2010) and described in Section 3.1. As rationalized in Section 2.1, there is reason to believe that the sentence-length based measures are inherently unsuitable for use in automatic spoken language assessment. In order to test this, the performance of our features was compared with that of the syntactic complexity features proposed in (Lu, 2010). Towards this, the clause boundaries of the ASR hypotheses were automatically detected using the automated clause boundary detection method used in (Chen and Zechner, 2011).7 The utterances were then parsed using the Stanford Parser (Klein and Manning, 2003), and a total of 14 features, including both length-related features and parse-tree based features, were generated using (Lu, 2012). Finally, the Pearson correlation coefficients between these features and the human proficiency scores were calculated.

7 The automated clause boundary detection method in this study was a Maximum Entropy Model based on word bigrams, POS tag bigrams, and pause features. The method achieved an F-score of 0.60 on the non-native speakers' ASR hypotheses. A detailed description of the method is presented in (Chen and Yoon, 2011).

7. Results

7.1. POS features

We will first review the results of the VSM-based features, followed by the LM-based features.

The VSMs and the features were obtained using both modes (manual and automated). Table 4 shows feature-score correlations for the features cosmax and cos4 (features cos1, cos2 and cos3 were excluded since they showed lower correlation with human scores and were highly correlated with cos4). From the table, we make the following observations:

- The feature most correlated with human proficiency scores was cos4 using the base set with bigrams. It achieved the best correlation of 0.43 when manual transcriptions were used in model building as well as in evaluation (Set 1). The drop in score-feature correlation when used with ASR output (Set 2 and Set 3) was not statistically significant.
- Overall, cos4 not only outperformed cosmax, but was also more robust against ASR errors.
- Bigram-based features outperformed both unigram-based and trigram-based features.
- The inclusion of compound tags (see Section 4.3) did not result in an increased correlation for cos4. However, it increased the correlation (statistically significant at the 0.01 level) in the case of cosmax when obtained from unigrams.

The unigrams had good coverage but limited power in distinguishing between different score levels. Their ability to distinguish between levels was augmented with the inclusion of co-occurring POS tags. On the other hand, trigrams had the opposite characteristics – they captured more structure but did not have good coverage because of data sparseness. Bigrams seemed to strike a balance between coverage and complexity (from among the three n-gram models considered here) and may thus have resulted in the best performance for both features in both the manual and ASR modes.

It is worthwhile to emphasize that the performance of the ASR-based features was comparable to that of the transcription-based features. The best performing feature among the ASR-based features was the one using the bigram and base set, with correlations nearly the same as the best performing feature among the transcription-based features. Seeing how close the correlations were in the case of the manual transcription-based and ASR-hypothesis-based features, we conclude that the proposed measure is robust to ASR errors.
Table 4
Pearson correlation coefficients between VSM-based features and expert proficiency scores. Here we consider the features cos4 and cosmax in the VSM model. All correlations are significant at the 0.01 level; the best-performing values were marked in bold in the original.

Feature | Set    | Unigram: Base / Base+mi50 / Base+mi100 | Bigram: Base / Base+mi50 / Base+mi100 | Trigram: Base / Base+mi50 / Base+mi100
cosmax  | (M, M) | 0.08 / 0.18 / 0.18                     | 0.34 / 0.33 / 0.34                    | 0.34 / 0.33 / 0.34
cosmax  | (M, A) | 0.12 / 0.17 / 0.17                     | 0.26 / 0.24 / 0.26                    | 0.25 / 0.26 / 0.26
cosmax  | (A, A) | 0.13 / 0.18 / 0.19                     | 0.30 / 0.30 / 0.30                    | 0.31 / 0.28 / 0.27
cos4    | (M, M) | 0.30 / 0.30 / 0.33                     | 0.43 / 0.36 / 0.37                    | 0.40 / 0.32 / 0.30
cos4    | (M, A) | 0.25 / 0.27 / 0.30                     | 0.42 / 0.35 / 0.35                    | 0.37 / 0.31 / 0.28
cos4    | (A, A) | 0.30 / 0.27 / 0.30                     | 0.41 / 0.32 / 0.32                    | 0.34 / 0.28 / 0.26

Table 5
Pearson correlation coefficients between LM-based features and expert proficiency scores. All correlations are significant at the 0.01 level.

Feature | Set    | Unigram: Base / Base+mi50 / Base+mi100 | Bigram: Base / Base+mi50 / Base+mi100 | Trigram: Base / Base+mi50 / Base+mi100
lmmax   | (M, M) | 0.31 / 0.32 / 0.32                     | 0.38 / 0.34 / 0.34                    | 0.36 / 0.26 / 0.25
lmmax   | (M, A) | 0.26 / 0.28 / 0.28                     | 0.29 / 0.28 / 0.29                    | 0.29 / 0.24 / 0.26
lmmax   | (A, A) | 0.30 / 0.31 / 0.31                     | 0.33 / 0.32 / 0.30                    | 0.29 / 0.29 / 0.29
lm4     | (M, M) | 0.06 / 0.05 / 0.07                     | 0.19 / 0.17 / 0.18                    | 0.23 / 0.19 / 0.20
lm4     | (M, A) | 0.05 / 0.03 / 0.02                     | 0.15 / 0.15 / 0.14                    | 0.18 / 0.19 / 0.17
lm4     | (A, A) | 0.06 / 0.04 / 0.05                     | 0.16 / 0.12 / 0.14                    | 0.18 / 0.14 / 0.15
Table 5 shows the correlations between the LM-similarity features and expert-rated proficiency scores for experiments with n-grams of order n = 1, …, 3. Since no improvement in correlation was observed by increasing the n-gram size beyond trigrams, 4-gram and 5-gram results are excluded from this table. Additionally, the features lm1, lm2 and lm3 were excluded since they showed lower correlation with human scores and were in turn highly correlated with lm4. From the table, we make the following observations:

- The feature most correlated with human scores was lmmax using the base set with bigrams. It achieved a correlation of 0.38 when both model building and evaluation were based on manual transcriptions (Set 1). There was a substantial drop in correlation when used with the ASR output; the correlation for Set 3 (ASR-based training and evaluation) was 0.33 (a 0.05 drop in absolute correlation).
- As expected, features based on Set 1 of the dataset combinations (manually transcribed training and evaluation data) outperformed those obtained using Set 2 and Set 3. However, when used with ASR output, Set 3 was better than Set 2. In Set 2, there was a discrepancy in the train/evaluation data condition: the training data was based on the manual transcriptions while the evaluation data was based on the ASR output. This discrepancy would have resulted in the additional performance drop.
- Feature lmmax outperformed lm4 overall.
- Bigram-based features using the base tag set showed better correlation than the other n-gram features. However, for lm4 the 4-gram showed the best correlation (though not too different from the trigram-based lm4), across different tag sets and dataset combinations.
- The inclusion of compound tags did not result in an increased correlation for lm4.

The LM-based features behaved differently from the VSM-based features. First, the features were more susceptible to ASR errors, and there was a substantial performance drop in the best performing features when used with the ASR outputs. Furthermore, from Tables 4 and 5 we observe that the feature cos4 was the better performing feature for the VSM method and the feature lmmax was the better performing feature for the LM method.

7.2. Comparison of features

Table 6 shows the performance of the prototypical measures of syntactic competence from written language assessment. We observe that these measures have lower correlations with human scores compared to our proposed features. We then include the features in a scoring model for subsequent performance comparison.

As seen in the previous section, cos4 emerged as the feature of choice when using the VSM, and lmmax the feature of choice when using the POS-LM. The two features seem to represent different aspects of grammatical competence. What follows is a comparison of the two features.

In the case of cos4, the feature, in a way, captures how far the given POS tag distribution is from the representative vector of score level 4 in a high-dimensional space. While this is useful, the feature value only gives the similarity with the score level. Unlike the feature cos4, lmmax gives a score-level value to a test response, which may be construed as a score of grammatical competence. Although somewhat unrelated to the correlation with human-assigned proficiency scores, the relative ease of interpretation seems to be an advantage of the POS-LM model. However, taking into account the correlation with human judgment, the VSM model seems to be a better option.

We list the correlation coefficients between the conventional measures of syntactic complexity and the scores of proficiency in Table 6. From Table 6, we note that the best performing feature is "DCC" (the mean number of dependent clauses per clause), with a correlation r of 0.14. In addition, "DCT" also had a statistically significant correlation, but it was even weaker than that of "DCC". Our best performing feature (bigram-based cos4) widely outperformed the best of Lu (2010)'s features (with correlations approximately 0.3 apart).
Table 6
Pearson correlation coefficients between the prototypical measures of syntactic complexity and expert proficiency scores for the ASR output. Only correlations marked (*) were significant at the α = 0.01 level.

Measure     | C/S   | C/T   | CT/T  | DC/C   | DC/T   | CP/C  | CP/T  | T/S   | CN/C  | CN/T  | VPT
Correlation | 0.014 | 0.006 | 0.017 | 0.138* | 0.060* | 0.015 | 0.026 | 0.027 | 0.028 | 0.031 | 0.008
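For readers unfamiliar with the abbreviations in Table 6, the sketch below shows how such ratio-based measures are formed once production units have been counted; the counting itself (automatic clause boundary detection and parsing) is the error-prone step discussed in Section 6.4. The data class and field names here are our own illustration, not the interface of Lu's analyzer.

```python
# Illustrative only: ratio-based syntactic complexity measures derived from
# per-response counts of production units (toy numbers in the example call).
from dataclasses import dataclass


@dataclass
class UnitCounts:
    sentences: int
    clauses: int
    t_units: int
    dependent_clauses: int
    verb_phrases: int


def ratio_measures(u: UnitCounts) -> dict:
    return {
        "C/S": u.clauses / u.sentences,           # clauses per sentence
        "C/T": u.clauses / u.t_units,             # clauses per T-unit
        "DC/C": u.dependent_clauses / u.clauses,  # dependent-clause ratio ("DCC")
        "DC/T": u.dependent_clauses / u.t_units,  # dependent clauses per T-unit ("DCT")
        "T/S": u.t_units / u.sentences,           # T-units per sentence
        "VP/T": u.verb_phrases / u.t_units,       # verb phrases per T-unit
    }


print(ratio_measures(UnitCounts(sentences=5, clauses=9, t_units=6,
                                dependent_clauses=3, verb_phrases=11)))
```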
For the purpose of comparing the features and the models side-by-side, we chose the best features (those with the highest correlations with human judgments) from the models studied here (prototypical features from writing assessment, VSM-based and POS-LM based models). For the sake of completeness, we included both cases (the models trained and evaluated with manual transcriptions as well as the models trained and evaluated with automatic transcriptions). We observed that each of these methods had an associated best performing feature – the feature cos4 was the better performing feature in the VSM and the feature lmmax was the better performing feature in the POS-LM. The comparison of features is summarized in Table 7 below.

Table 7
Comparison of feature-score correlations.

Study         | Feature | Condition | Correlation
(Lu, 2010)    | DCC     | Trans     | 0.14
(Lu, 2010)    | DCC     | ASR       | 0.14
Current study | cos4    | Trans     | 0.43
Current study | cos4    | ASR       | 0.41
Current study | lmmax   | Trans     | 0.38
Current study | lmmax   | ASR       | 0.33

A plausible explanation for the poor performance of Lu (2010)'s features is that the features were generated using a multi-stage automated process, and the errors in each stage propagated to the next, resulting in low feature performance. For instance, errors in the automated clause boundary detection may result in a serious drop in performance. With the spoken responses being particularly short (a typical response in the dataset had 10 clauses on average), even one error in clause boundary detection can seriously affect the reliability of the features.

7.3. Effect of the training data size on model performance

For the VSM-based and LM-based POS-model constructions, we used approximately 47 K responses, which may be unavailable in many practical scenarios. To understand the effect of training data size on the resulting POS-model, we consider varying the available data size for POS-model building in the case of the better performing VSM-based scenario. VSM-based models are built using variable-sized training data of manual transcriptions (sampled to preserve the underlying score distribution), and the Pearson correlation coefficients of the feature cos4 (derived from ASR-based transcriptions of the responses in the SM data) with human proficiency scores are noted for every sample. From the results tabulated in Table 8 we notice that even with 500 responses sampled from the training data, we are able to achieve a feature-score correlation of 0.41, which approaches that obtained with a model trained on 47 K responses.

In order to further highlight the generalization performance of the proposed models, it may be worth noting that since the test data had no item overlap with the training set, the feature generalizes well to unseen items.

Table 8
Comparison of cos4-score correlations, r, by varying training data size (number of responses) for VSM-based POS-model training.

Data size (in 1000s) | 0.1  | 0.2  | 0.5  | 1.0  | 2.4  | 4.7  | 9.4  | 24   | 47
r                    | 0.35 | 0.38 | 0.41 | 0.42 | 0.41 | 0.42 | 0.41 | 0.42 | 0.42

7.4. Automatic scoring model comparison

The effect of the inclusion of the proposed features is best understood by studying them in an automatic scoring model. We consider a multiple linear regression scoring model that approximates the human scores by a linear combination of the proposed features, which represent various constructs of oral language proficiency. We first build and compare stand-alone multiple regression models of syntactic complexity. We then compare scoring models that include measurements of syntactic complexity in addition to the other constructs of proficiency currently being measured in the state-of-the-art scoring model (SpeechRater). For the purpose of this comparison, we perform 5-fold cross validation on the SM data described in Section 5 (N = 2876), using a training set (80%) and a held-out test set (20%) to parametrize the models and to assess the models' performance, respectively, in every run. The results are averaged over the 5 runs.
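A sketch of this evaluation protocol (our own illustration, not the authors' code): 5-fold cross-validation of a multiple linear regression scorer, with agreement reported as Pearson correlation and weighted kappa. The feature matrix, the toy scores, and the choice of linear kappa weights are assumptions, since the paper does not specify the weighting scheme.

```python
# Sketch of Section 7.4: 5-fold cross-validated multiple regression scoring
# with Pearson correlation (unrounded) and weighted kappa (rounded) agreement.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(2876, 7))                       # stand-in for the SM-set features
y = np.clip(np.round(2.6 + X[:, 0] + 0.3 * rng.normal(size=2876)), 1, 4)  # toy scores 1-4

correlations, kappas = [], []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    correlations.append(pearsonr(pred, y[test_idx])[0])            # unrounded agreement
    rounded = np.clip(np.round(pred), 1, 4).astype(int)
    kappas.append(cohen_kappa_score(rounded, y[test_idx].astype(int), weights="linear"))

print(f"mean r = {np.mean(correlations):.3f}, mean weighted kappa = {np.mean(kappas):.3f}")
```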
7.4.1. Scoring model for syntactic complexity

Although the human-assigned scores of proficiency are holistic scores of overall proficiency, in the absence of stand-alone scores of syntactic complexity we assume that the scores of proficiency are well correlated with scores of syntactic complexity. Under this assumption, we first construct automatic scoring models for syntactic complexity using the three sets of features studied here – the prototypical measures from written language assessment, the VSM-based features and the LM-based features. The grammar scoring model using only the prototypical features described in Section 3.1 will serve as a baseline (GramLu). We then compare scoring models using VSM- and LM-based features (GramVSM and GramLM, respectively). We avoid collinearity in the scoring models by excluding correlated features (r ≥ 0.8). The resulting GramLu model includes the features CN/C, CN/T, CP/C, C/T, T/S, and CT/T; GramVSM includes cosmax and cos4, while GramLM includes lmmax and lm4. The agreement between predicted scores and human scores in terms of Pearson correlation coefficient (unrounded and rounded) as well as weighted kappa is tabulated in Table 9.

Table 9
Comparison of scoring model performances using only features for grammar competence. All values are significant at the 0.01 level except the one marked with (*); the best-performing values were marked in bold in the original.

Evaluation method       | GramLu | GramLM | GramVSM
Weighted kappa          | 0.05   | 0.25   | 0.33
Correlation (unrounded) | 0.06   | 0.30   | 0.42
Correlation (rounded)   | 0.04*  | 0.23   | 0.34

From Table 9, we observe that the correlation is highest for the scoring model using VSM-based features, and that this correlation is considerably higher (by 0.36) than that of the baseline model using conventional features from written language assessment. The correlation of the scoring model using LM-based features is also better than the baseline (by 0.24). The agreement results of GramLu here are in line with the results in (Chen and Zechner, 2011), where it was observed that measures of syntactic complexity were highly sensitive to errors in ASR and are hence not reliable for use in speech scoring.

7.4.2. Scoring model for oral proficiency

We next consider an evaluation of the syntactic complexity features studied here in an augmented scoring model that includes features from SpeechRater that represent constructs of fluency, pronunciation, vocabulary and grammar. Towards this, we again consider a multiple regression model that includes the features global normalized HMM acoustic model score (amscore), speaking rate (wpsec), types per second (tpsecutt), average chunk length in words (wdpchk) and global normalized language model score (lmscore) (Zechner et al., 2009) by themselves (Base) as well as alongside the features considered in this study. Such an arrangement permits us not only to compare the relative gains in scoring model performance over the base model but also to assess the relative importance of the features considered in the model.

Here, in addition to the base model with the features found in SpeechRater, we study augmented scoring models using the three sets of features studied here – the prototypical measures from written language assessment, the VSM-based features and the LM-based features. The augmented models will be denoted by SynLu, SynVSM, and SynLM respectively. In addition, we consider a combination model, SynAll, that includes all features. The correlations (unrounded and rounded) as well as the weighted kappa for the models studied here are found in Table 10.

Table 10
Comparison of scoring model performances using the features of syntactic complexity studied in this paper along with those available in SpeechRater. Here, Base is the scoring model using just the features in SpeechRater. All correlations are significant at the 0.01 level.

Evaluation method       | Base  | SynLu | SynLM | SynVSM | SynAll
Weighted kappa          | 0.540 | 0.540 | 0.560 | 0.560  | 0.570
Correlation (unrounded) | 0.587 | 0.589 | 0.595 | 0.607  | 0.610
Correlation (rounded)   | 0.500 | 0.503 | 0.509 | 0.526  | 0.522

Comparing the agreement results in Table 10, we notice that the model SynVSM shows the highest correlation (0.607) of predicted scores with human scores among Base, SynLu, SynLM and SynVSM. Moreover, testing for the significance of the difference between two dependent correlations using Steiger's Z-test, we notice that the differences in correlations (Base–SynVSM, SynLu–SynVSM and SynLM–SynVSM) are statistically significant at the 0.01 level. When all features are combined in the SynAll model, the correlation appears marginally higher than that of SynVSM, but the difference is not statistically significant. Thus, the inclusion of measures of syntactic complexity shows improved correlation between machine and human scores compared to the state-of-the-art model (here, Base).

The agreement between human and machine scores for the scoring model SynVSM (Pearson correlation coefficient 0.607) may not seem impressive by itself, but taken in the context of the subjectivity of the task of assigning a proficiency score (as mentioned in Section 5), the correlation coefficient is seen to approach the near-human agreement of 0.62.

Our next focus is the relative importance of the features in an overall scoring model such as SynVSM that includes all available features for measuring the various constructs of oral proficiency. In Table 11 we list the relative importance of the predictors, as the R² contribution averaged over orderings among predictors (Grömping, 2006). We notice that the feature cos4 is the third most important predictor of oral proficiency, next only to tpsecutt and wdpchk, which are existing features in the SpeechRater module.
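The relative-importance computation cited from Grömping (2006) averages, over all orderings of the predictors, the increase in R² contributed by each predictor (the LMG decomposition, implemented in the R package relaimpo). A brute-force sketch of the idea is given below with synthetic data and illustrative feature names; it is practical only for a small number of predictors because it enumerates all orderings.

```python
# Sketch of relative importance as the R^2 contribution averaged over
# predictor orderings (the LMG decomposition). Data and names are illustrative.
from itertools import permutations

import numpy as np
from sklearn.linear_model import LinearRegression


def r2(X, y, cols):
    if not cols:
        return 0.0
    return LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)


def lmg_importance(X, y, names):
    p = X.shape[1]
    contrib = {name: 0.0 for name in names}
    orderings = list(permutations(range(p)))
    for order in orderings:
        used = []
        for col in order:
            before = r2(X, y, used)
            used = used + [col]
            contrib[names[col]] += r2(X, y, used) - before  # marginal R^2 of this predictor
    return {name: value / len(orderings) for name, value in contrib.items()}


rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = 0.8 * X[:, 0] + 0.4 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(size=500)
print(lmg_importance(X, y, ["tpsecutt", "wdpchk", "cos4", "wpsec"]))
```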
in the examples, 'the project may can change' and 'the others must can not be good', they are related to grammatical errors.

Finally, the comparative group includes 'RBR_JJR'. The decrease of the tag 'RBR_JJR' is related to the correct acquisition of the comparative form. 'RBR' is for comparative adverbs and 'JJR' is for comparative adjectives, and a combination of these two tags is strongly related to double-marked errors such as 'more easier'. In the course of acquiring the comparative form, learners tend to use the double-marked form. The compound tags correctly capture this erroneous stage.

The 'DEC' group also includes three Wh-related tags (WDT_NN, WDT_VBP, WRB), but the proportion is much smaller than that found in the 'INC' group. The analysis shows that the combination of original and compound POS tags correctly captures systematic changes in grammatical expressions according to changes in proficiency levels.

8.2. POS LM-based similarity assessment

In contrast to the VSM-based features, ASR errors caused significant performance drops in the LM-based features. In this section, we analyze the impact of ASR errors on the LM-based features in detail with reference to the models based on Set 1 and Set 2 (the manual transcription-based model). Since the feature lmmax had better performance than lm4, we conducted further analysis on lmmax. In addition, there was no significant difference between the manual-transcription-based model and the ASR-hypothesis-based model (Set 1 and Set 3). Hence, we limit our discussion to those features obtained using the manual transcription-based model.

The relationship between predicted scores and human scores in the case of the LM-based similarity features was further analyzed using a contingency table for the feature lmmax with Set 1 and Set 2 (recall that in both these cases the training data consists of manually transcribed utterances, and it is only the mode of the evaluation data that changes). Focusing on the best performing model (the bigram LM), the agreement between the human score and the feature was promising; for the manual transcriptions, the exact agreement between human scores and lmmax was 36% and the combination of the exact and adjacent agreement was 86%. Upon examination of the contingency table for the manual transcriptions, we notice that every predicted score level is dominated by responses from score levels 2 and 3.

However, the feature lmmax was strongly influenced by recognition errors. This is seen in the significant drop in the correlation from 0.39 with Set 1 to 0.29 with Set 2. For the ASR hypotheses, the exact agreement decreased from 36% to 32%, and the adjacent agreement from 50% to 40%.
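The contingency-table and agreement figures above can be reproduced along the following lines (a sketch with toy data; the actual per-response scores are not available in this excerpt):

```python
# Sketch of the agreement analysis for lmmax: cross-tabulate predicted score
# levels against human scores and report exact and adjacent agreement.
import numpy as np
import pandas as pd

human = np.array([2, 3, 3, 2, 4, 1, 3, 2, 3, 2])
lmmax = np.array([2, 3, 2, 2, 3, 2, 3, 3, 3, 2])   # predicted score level per response

table = pd.crosstab(pd.Series(human, name="human"), pd.Series(lmmax, name="lmmax"))
print(table)

exact = np.mean(human == lmmax)                     # same score level
adjacent = np.mean(np.abs(human - lmmax) <= 1)      # within one score level
print(f"exact agreement: {exact:.0%}, exact-or-adjacent agreement: {adjacent:.0%}")
```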
For further analysis, we calculated the mean values of lmmax (the most similar score class) for each score level and compared the ASR-based ones with the transcription-based ones. Fig. 1 shows the results. As a comparison, Fig. 2 provides the mean values of cos4 (the cosine similarity value with score level 4) for each score level.

Fig. 1. Comparison of lmmax between manual-based features and ASR-based features at each score level.
Fig. 2. Comparison of cos4 between manual-based features and ASR-based features at each score level.

For lmmax, the mean values of the ASR-based features were lower than those of the manual transcription-based features, except for score class 1. In particular, the drop in feature values for lmmax increased with increasing score levels, and this reduced the distinction between score levels. However, no such behavior was found for cos4. In fact, the mean values of the ASR-based features were slightly higher than those of the transcription-based features for all score levels, and the distinction between score levels remained steady.

We attribute the drop in performance of the ASR-based features to the following sources of errors:

1. The speech recognizer may mis-recognize grammatical errors in the response, replacing them with more frequent, correct expressions.
2. The speech recognizer may mis-recognize sophisticated expressions in a response, replacing them with more frequent, simpler expressions.
3. The recognizer errors may generate grammatical errors not existing in the actual responses.
Each of the items above, as well as any combination thereof, results in a loss of characteristics of a response, thereby reducing the discriminating power of the features. While source 1 results in inflated predicted levels, sources 2 and 3 result in underestimated proficiency levels. The current results (Fig. 1) support sources 2 and 3, since the mean values of ASR-based lmmax were generally lower than those of the manual transcription-based ones.

In order to identify the bigrams that were heavily influenced by ASR errors, bigram frequencies for manual transcriptions and ASR hypotheses were obtained and their differences were examined. Since the impact of the errors was particularly strong on responses with a score of 4, we focused on responses with that score. We identified the top 10 bigrams for which the ASR-based frequencies were substantially higher than their manual transcription-based frequencies. Among these 10 bigrams, tag repetitions ('DT–DT', 'IN–IN', 'VBZ–VBZ') and some determiner-related bigrams ('DT–IN', 'DT–PRP', 'NN–DT') were strongly associated with grammatical errors. The analysis lends support to our hypothesis that ASR generates grammatical errors that are not present in the actual responses, which depresses the scores of high-scoring responses. As a result, the proficiency level of highly proficient speakers is underestimated when the feature lmmax is used.
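The bigram-difference analysis can be sketched as follows. This is an illustrative Python fragment under our own naming and with placeholder POS sequences, not the exact procedure used in the experiments.

```python
# Minimal sketch: find POS bigrams whose relative frequency is substantially
# higher in ASR hypotheses than in manual transcriptions.
from collections import Counter

def bigram_freqs(tag_sequences):
    """Relative frequencies of POS-tag bigrams over a set of tagged responses."""
    counts = Counter()
    for tags in tag_sequences:
        counts.update(zip(tags, tags[1:]))
    total = sum(counts.values()) or 1
    return {bigram: c / total for bigram, c in counts.items()}

def inflated_bigrams(manual_sequences, asr_sequences, top_n=10):
    """Bigrams ranked by how much more frequent they are in the ASR output."""
    manual = bigram_freqs(manual_sequences)
    asr = bigram_freqs(asr_sequences)
    diffs = {bg: freq - manual.get(bg, 0.0) for bg, freq in asr.items()}
    return sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

if __name__ == "__main__":
    # Illustrative POS-tagged responses (placeholders, not corpus data).
    manual_sequences = [["PRP", "VBZ", "DT", "NN"], ["DT", "NN", "VBZ", "JJ"]]
    asr_sequences = [["PRP", "VBZ", "DT", "DT", "NN"], ["DT", "IN", "NN", "VBZ", "VBZ"]]
    for bigram, diff in inflated_bigrams(manual_sequences, asr_sequences):
        print(bigram, round(diff, 3))
```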
Surprisingly, the ASR errors that affected the performance of lmmax did not influence the performance of cos4. We found that the 10 bigrams that were highly associated with ASR errors had relatively low idf values; they ranked 85–90% of the way down from the top when the bigrams were sorted by their idf values. This low idf has the effect of reducing the impact of ASR errors on cos4. This suggests that, compared to lmmax, cos4 is better suited for use in a fully automated speech scoring system based on speech recognition.
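The effect can be illustrated with a small idf-weighted cosine computation. The sketch below uses invented idf values and counts purely to show why a bigram with low idf barely moves cos4; it is not the paper's implementation.

```python
# Minimal sketch: an ASR-induced bigram with low idf contributes little to a
# tf-idf weighted cosine similarity. All numbers are invented for illustration.
import math

def cosine(u, v):
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def tfidf(tf, idf):
    return {k: tf[k] * idf.get(k, 0.0) for k in tf}

if __name__ == "__main__":
    idf = {("PRP", "VBZ"): 2.5, ("JJ", "NN"): 2.0, ("DT", "DT"): 0.1}   # hypothetical idf values
    score_model = {("PRP", "VBZ"): 3, ("JJ", "NN"): 2}                  # hypothetical score-level counts
    manual_resp = {("PRP", "VBZ"): 2, ("JJ", "NN"): 1}                  # response, manual transcript
    asr_resp = {("PRP", "VBZ"): 2, ("JJ", "NN"): 1, ("DT", "DT"): 2}    # same response with an ASR-induced bigram
    print(cosine(tfidf(manual_resp, idf), tfidf(score_model, idf)))
    print(cosine(tfidf(asr_resp, idf), tfidf(score_model, idf)))        # barely changes: ('DT','DT') has low idf
```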
8.3. Extensions and future directions

In assessing the level of association between the set of features and the human scores for proficiency, the underlying assumption was that the overall proficiency score was also indicative of syntactic competence. In reality, the holistic manual scores for overall proficiency (used in this study and elsewhere) capture more aspects of speech than syntactic complexity alone. Thus, the use of a more analytic human score of syntactic complexity in place of the overall proficiency score may provide a more accurate picture of the quality of our features.

Future explorations in this direction will include the combination of n-gram models and interpolated language models with interpolation weights that are specific to a proficiency level, an approach that will alleviate the problem of data sparsity.
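As a rough illustration of this direction, the sketch below interpolates a score-level-specific POS bigram LM with a background LM using a level-specific weight. The probabilities, weights, and names are placeholders; in practice the weights would be estimated on held-out data rather than fixed by hand.

```python
# Minimal sketch: proficiency-level-specific interpolation of a score-level
# POS bigram LM with a general background LM. All values are placeholders.

def interpolated_prob(bigram, level_lm, background_lm, lam):
    """P(bigram) = lam * P_level(bigram) + (1 - lam) * P_background(bigram)."""
    return lam * level_lm.get(bigram, 0.0) + (1.0 - lam) * background_lm.get(bigram, 0.0)

if __name__ == "__main__":
    background_lm = {("DT", "NN"): 0.08, ("NN", "VBZ"): 0.05}     # hypothetical general LM
    level_lms = {
        2: {("DT", "NN"): 0.10, ("NN", "VBZ"): 0.03},             # hypothetical score-level LMs
        4: {("DT", "NN"): 0.06, ("NN", "VBZ"): 0.07},
    }
    lambdas = {2: 0.6, 4: 0.8}                                    # hypothetical level-specific weights
    for level in (2, 4):
        p = interpolated_prob(("NN", "VBZ"), level_lms[level], background_lm, lambdas[level])
        print(f"score level {level}: P(NN VBZ) = {p:.3f}")
```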
9. Conclusions

In this study, we proposed a set of features corresponding to two models, the VSM and the POS-LM, for measuring grammatical competence, with the requirement that they be amenable for use with automatically recognized spontaneous speech. The underlying assumption in the choice of features is that the level of acquired grammatical proficiency is signaled by the distribution of POS tags. We observed that each of these models has an associated best performing feature: the feature cos4 in the VSM and the feature lmmax in the POS-LM. The features were found to generalize well with respect to differences in item type across responses. When used alongside existing features in the state-of-the-art scoring model for oral proficiency assessment of spontaneous speech, the VSM-based features were seen to be important predictors of oral proficiency. Additionally, their inclusion in the scoring model showed improved agreement between machine and human scores compared to the state-of-the-art model.

We also observed that a set of prototypical features based on sentence- or clause-based units is ill-suited for automatic scoring on ASR output, which is in line with previous studies. In addition to a reasonable correlation with human-assigned scores, the key advantage of our proposed features is their relative immunity to ASR errors (compared to conventional measures of syntactic complexity), making them suitable for use in automatic spontaneous speech scoring systems.

Acknowledgements

The authors would like to thank Klaus Zechner, Keelan Evanini, Lei Chen, and Shasha Xie for their valuable comments, help with data preparation, and assistance with the experiments. In addition, we gratefully acknowledge the proof-reading assistance offered by Prama Pramod.

References

Attali, Y., Burstein, J., 2006. Automated essay scoring with e-rater® V.2. J. Technol. Learn. Assess. 4.
Balogh, J., Bernstein, J., Cheng, J., Townshend, B., 2007. Automated evaluation of reading accuracy: assessing machine scores. In: Proceedings of SLaTE, pp. 1–3.
Bardovi-Harlig, K., Bofman, T., 1989. Attainment of syntactic and morphological accuracy by advanced language learners. Stud. Second Lang. Acquis. 11, 17–34.
Bernstein, J., Cheng, J., Suzuki, M., 2010. Fluency and structural complexity as predictors of L2 oral proficiency. In: Proceedings of InterSpeech, pp. 1241–1244.
Bernstein, J., DeJong, J., Pisoni, D., Townshend, B., 2000. Two experiments in automated scoring of spoken language proficiency. In: Proceedings of InSTIL, pp. 57–61.
Chen, L., Yoon, S.Y., 2011. Detecting structural events for assessing non-native speech. In: Proceedings of the Workshop on Innovative Use of NLP for Building Educational Applications, pp. 38–45.
Chen, M., Zechner, K., 2011. Computing and evaluating syntactic complexity features for automated scoring of spontaneous non-native speech. In: Proceedings of ACL, pp. 722–731.
Chodorow, M., Leacock, C., 2000. An unsupervised method for detecting grammatical errors. In: Proceedings of NAACL, pp. 140–147.
Cucchiarini, C., Strik, H., Boves, L., 2000. Quantitative assessment of second language learners' fluency by means of automatic speech recognition technology. J. Acoust. Soc. Am. 107, 989–999.
Cucchiarini, C., Strik, H., Boves, L., 2002. Quantitative assessment of second language learners' fluency: comparisons between read and spontaneous speech. J. Acoust. Soc. Am. 111, 2862–2873.
Eskenazi, M., Kennedy, A., Ketchum, C., Olszewski, R., Pelton, G., 2007. The Native Accent™ pronunciation tutor: measuring success in the real world. In: Proceedings of SLaTE, pp. 124–127.
Feldman, S., Marin, M., Ostendorf, M., Gupta, M.R., 2009. Part-of-speech histograms for genre classification of text. In: Proceedings of ICASSP, pp. 4781–4784.
Foster, P., Skehan, P., 1996. The influence of planning and task type on second language performance. Stud. Second Lang. Acquis. 18, 299–324.
Foster, P., Tonkyn, A., Wigglesworth, G., 2000. Measuring spoken language: a unit for all reasons. Appl. Linguist. 21, 354–375.
Franco, H., Abrash, V., Precoda, K., Bratt, H., Rao, R., Butzberger, J., Rossier, R., Cesari, F., 2000. The SRI EduSpeak system: recognition and pronunciation scoring for language learning. In: Proceedings of InSTIL, pp. 123–128.
Franco, H., Neumeyer, L., Kim, Y., Ronen, O., 1997. Automatic pronunciation scoring for language instruction. In: Proceedings of ICASSP, pp. 1471–1474.
Grömping, U., 2006. Relative importance for linear regression in R: the package relaimpo. J. Stat. Softw. 17, 1–27.
Halleck, G.B., 1995. Assessing oral proficiency: a comparison of holistic and objective measures. Mod. Lang. J. 79, 223–234.
Hunt, K., 1970. Syntactic maturity in school children and adults. Monogr. Soc. Res. Child Develop.
Iwashita, N., 2010. Features of oral proficiency in task performance by EFL and JFL learners. In: Selected Proceedings of the Second Language Research Forum, pp. 32–47.
Iwashita, N., Brown, A., McNamara, T., O'Hagan, S., 2008. Assessed levels of second language speaking proficiency: how distinct? Appl. Linguist. 29, 24–49.
Klein, D., Manning, C.D., 2003. Accurate unlexicalized parsing. In: Proceedings of ACL, pp. 423–430.
Lu, X., 2010. Automatic analysis of syntactic complexity in second language writing. Int. J. Corpus Linguist. 15, 474–496.
Lu, X., 2012. L2 Syntactic Complexity Analyzer. <http://www.personal.psu.edu/xxl13/downloads/l2sca.html/> (retrieved 17.03.2012).
Manning, C.D., Raghavan, P., Schütze, H., 2008. Introduction to Information Retrieval. Cambridge University Press, New York, USA.
Marcus, M.P., Santorini, B., Marcinkiewicz, M.A., 1993. Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19, 313–330.
Marin, M., Feldman, S., Ostendorf, M., Gupta, M.R., 2009. Filtering web text to match target genres. In: Proceedings of ICASSP, pp. 3705–3708.
Neumeyer, L., Franco, H., Digalakis, V., Weintraub, M., 2000. Automatic scoring of pronunciation quality. Speech Commun., 88–93.
Ortega, L., 2003. Syntactic complexity measures and their relationship to L2 proficiency: a research synthesis of college-level L2 writing. Appl. Linguist. 24, 492–518.
Roark, B., Mitchell, M., Hosom, J.P., Hollingshead, K., Kaye, J., 2011. Spoken language derived measures for detecting mild cognitive impairment. IEEE Trans. Audio Speech Lang. Process. 19, 2081–2090.
Robinson, P., 2006. Task complexity and second language narrative discourse. Lang. Learn. 45, 99–140.
Stolcke, A., et al., 2002. SRILM – an extensible language modeling toolkit. In: Proceedings of INTERSPEECH.
Temple, L., 2000. Second language learner speech production. Stud. Linguist., 288–297.
Tetreault, J.R., Chodorow, M., 2008. The ups and downs of preposition error detection in ESL writing. In: Proceedings of COLING, pp. 865–872.
Witt, S., 1999. Use of Speech Recognition in Computer-Assisted Language Learning. Unpublished Dissertation, Cambridge University Engineering Department, Cambridge, UK.
Witt, S., Young, S., 1997. Performance measures for phone-level pronunciation teaching in CALL. In: Proceedings of STiLL, pp. 99–102.
Wolf-Quintero, K., Inagaki, S., Kim, H.Y., 1998. Second Language Development in Writing: Measures of Fluency, Accuracy, and Complexity. Technical Report 17. Second Language Teaching and Curriculum Center, The University of Hawai'i, Honolulu, HI.
Yoon, S.Y., Bhat, S., 2012. Assessment of ESL learners' syntactic competence based on similarity measures. In: Proceedings of EMNLP, pp. 600–608.
Yoon, S.Y., Higgins, D., 2011. Non-English response detection method for automated proficiency scoring system. In: Proceedings of the Workshop on Innovative Use of NLP for Building Educational Applications, pp. 161–169.
Zechner, K., Higgins, D., Xi, X., Williamson, D.M., 2009. Automatic scoring of non-native spontaneous speech in tests of spoken English. Speech Commun. 51, 883–895.
Zechner, K., Xi, X., Chen, L., 2011. Evaluating prosodic features for automated scoring of non-native read speech. In: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 461–466.
Zissman, M.A., 1996. Comparison of four approaches to automatic language identification of telephone speech. IEEE Trans. Speech Audio Process. 4.