Automatic Assessment of Syntactic Complexity For Spontaneous Speech Scoring
S. Bhat, S.-Y. Yoon
Speech Communication 67 (2015) 42–57
Received 12 November 2013; received in revised form 7 July 2014; accepted 12 September 2014
Available online 26 September 2014
Abstract
Expanding paradigms of language learning and testing prompt the need for developing objective methods of assessing language proficiency from spontaneous speech. In this paper, new measures of syntactic complexity are studied for use in the framework of automatic scoring systems for spontaneous second language speech. In contrast to most existing measures, which estimate competence levels indirectly based on the length of production units or the frequency of specific grammatical structures, we capture the differences in the distribution of morpho-syntactic features across learners' proficiency levels. We build score-specific models of part-of-speech (POS) tag distribution from a large corpus of spontaneous second language English utterances and use them to measure syntactic complexity. Given a speaker's response, we consider its similarity with a set of utterances scored for proficiency by humans. The comparison is made between the distribution of POS tags in the response and that of a score level. The underlying distribution of POS tags (indicative of syntactic complexity) is represented via two models: a vector-space model and a language model.

Empirical results suggest that the proposed measures of syntactic complexity show a reasonable association with human-rated proficiency scores compared to conventional measures of syntactic complexity. They are also considerably more robust against errors resulting from automatic speech recognition, making them more suitable for use in operational automated scoring applications. When used in combination with other measures of oral proficiency in a state-of-the-art scoring model, the predicted scores show improved agreement with human-assigned scores over a baseline scoring model without our proposed features.
© 2014 Elsevier B.V. All rights reserved.
Keywords: Language testing; Automated scoring; Speaking proficiency; Computer-aided language learning; Syntactic complexity; Objective measures
http://dx.doi.org/10.1016/j.specom.2014.09.005
1. Introduction

Overall spoken proficiency in a target language can be assessed by testing the abilities in various areas including fluency, pronunciation and intonation, grammar and vocabulary, and discourse structure. Currently, speech-enabled dialog systems allow learners to practice their speaking and listening with a virtual interlocutor (e.g., SpeakESL), and to receive feedback on their pronunciation [e.g., Carnegie Speech, or Native Accent (Eskenazi et al., 2007), EduSpeak from SRI (Franco et al., 2000)].

These and other spoken response scoring systems work on restricted speaking tasks such as reading a passage or answering questions with a limited range of responses (Bernstein et al., 2000; Balogh et al., 2007). In contrast to these systems that score restricted speech, scoring unstructured, unrestricted, and spontaneous responses poses a much harder problem. In addition, if the systems target learners with diverse levels of second language proficiency and varied first language backgrounds, the difficulty increases substantially.

The state-of-the-art system for scoring spontaneous speech in a testing scenario is SpeechRaterSM (Zechner et al., 2009). Although the current capability is sufficiently advanced to allow it to be used for the scoring of TOEFL® Practice Online (TPO), a low-stakes practice test product, there is room for improving its feature set by expanding the coverage of important aspects of speaking proficiency and modifying others. For instance, aspects of grammar and vocabulary sophistication are only being measured indirectly (more details on this later in this paper), and a more direct approach to measuring these aspects is necessary.

Taking into consideration the challenges posed in processing spontaneous speech automatically, we propose a set of measures of grammatical competence. This paper describes the measures and their potential for use in a state-of-the-art spontaneous speech scoring system. In Section 2 the problem being studied is placed into the context of previous work done in the related areas of written and spoken language assessment. A description of the measures studied in this paper is found in Section 3. In Section 4, we delve into the details of the implementation of our proposed measures. A description of the data is provided in Section 5. The experimental details comprise the material in Section 6 and the results are presented in Section 7. In Section 8 we discuss the results of data analyses and highlight some extensions to the study. Finally, a brief summary of the major findings of the paper is presented in Section 9.

2. Motivation

2.1. Assessment of syntactic competence in second language learning

Numerous studies in the related second language acquisition literature reveal that syntactic complexity and grammatical accuracy are regarded as some of the key skills that strongly influence second language proficiency. Thus, the study of measures that reflect language learners' command of these influential skills has been a central theme of various studies in the area of second language acquisition.

In the related literature, Ortega (2003) indicates that "the range of forms that surface in language production and the degree of sophistication of such forms" are two important areas in grammar usage, collectively termed "syntactic complexity". A vast majority of measures of syntactic complexity have been used as indicators of levels of acquisition of syntactic competence and, in turn, are suggestive of proficiency levels in ESL writing (e.g. Wolf-Quintero et al., 1998; Ortega, 2003; Lu, 2010). These measures have been broadly classified into two groups (Bardovi-Harlig and Bofman, 1989). The first group is related to the acquisition of specific grammatical expressions corresponding to various stages of language acquisition. Frequencies of negation or relative clauses (in terms of whether these expressions occurred in the test responses without errors) fall into this group (hereafter, the expression-based group). The second group, not tied to particular structures, is related to the length of clauses or the relationship between clauses (hereafter, the length-based group). Representative measures in the second group include the mean length of clause unit, the ratio of dependent clauses to the total number of clauses, and the number of verb phrases per clause.

In contrast with syntactic complexity, grammatical accuracy is the ability to generate sentences without grammatical errors. The measures in this group can be classified into two groups. Global accuracy measures include those that count all errors in sentence production and are calculated as normalized values, e.g., the percentage of error-free clauses among all clauses (Foster and Skehan, 1996). A second group of measures is more focussed on specific types of constructions such as verb tense, third-person singular forms, prepositions, and articles, and calculates the percentage of error-free clauses with respect to these constructions (Robinson, 2006; Iwashita et al., 2008).

In the area of spoken language assessment, researchers have sought the application of measures of syntactic competence and grammatical accuracy. In particular, Halleck (1995)'s study found that in the context of English as a foreign language (EFL) assessment, holistic oral proficiency scores were highly correlated with three quantitative measures (mean length of T-units,1 mean error-free T-unit length, and percentage of error-free T-units). Again, the results from a similar study that included both English and Japanese foreign language assessment confirmed the utility of these and other quantitative measures that assess grammatical accuracy and syntactic complexity, in addition to vocabulary, pronunciation, and fluency (Iwashita et al., 2008; Iwashita, 2010). However, the results were inconclusive about the strength of the relationship between the measures and the proficiency scores.

1 Hunt (1970) proposed the idea of a T-unit, which is a main clause with a subordinate clause and non-clausal units. It is different from a clause since it does not consider a subordinate clause as an independent unit.
Strong data dependencies on the participant groups were reported in (Iwashita, 2010), and the discriminative ability of these quantitative measures with respect to proficiency levels was not adequate. With proficiency levels rated on a scale, the measures could only broadly discriminate students' proficiency levels but failed to make fine-grained distinctions between adjacent levels; there were large variations within a given level and the differences between the proficiency levels were not always statistically significant. It is important to note that in all the ESL-related studies mentioned above, the measures were obtained manually.

Studies in automated speech scoring have focused on measurements of several aspects of speech production, including fluency (Cucchiarini et al., 2000, 2002), pronunciation (Witt and Young, 1997; Witt, 1999; Franco et al., 1997; Neumeyer et al., 2000), and intonation (Zechner et al., 2009). However, research on measurements related to grammar usage is still nascent.

Zechner et al. (2009) include a normalized language model score of the speech recognizer as a grammatical measure. This measures the similarity between the word distribution in the response and that in the language model,2 rather than the accuracy and diversity of grammatical expressions.

2 The language model was trained on non-native students' speech and broadcast news. Consequently, the grammar errors occurring in non-native students' speech may result in a good language model score, while grammatically correct but rare expressions may result in a bad score.

With the recent foray into the realm of automated assessment of spontaneous spoken language (Zechner et al., 2009, 2011), there is a concurrent need to develop quantitative measures of syntactic complexity and grammatical accuracy for use in such an environment. Two important factors govern the use of any measure in this realm. The first has to do with the discriminative ability of the measure with respect to the target; this aspect is related to the utility of the measure. The second has to do with the way in which the measures are obtained from the data. By the very nature of automated assessment, it is expected that these measures be obtained automatically.

Studies have only recently begun to actively investigate the usefulness of syntactic measures in the realm of automated scoring of spontaneous speech (Bernstein et al., 2010; Chen and Yoon, 2011; Chen and Zechner, 2011). These studies have used measures such as the average length of clauses or sentences (Chen and Yoon, 2011), previously studied in the context of writing assessment. In addition to these length-based measures, Chen and Zechner (2011) used parse-tree based features such as the mean depth of parse-tree levels. Using measurements from manual annotations as well as hypotheses generated by the speech recognition engine, these measures were used to predict the oral proficiency score of the learner given his/her response. These studies found that including length-based and parse-tree based features in an automated scoring model resulted in substantially degraded performance (as compared to manually derived features from manually transcribed responses). This degradation has been attributed to errors in the automatic prediction of clause and sentence boundaries as well as those from the ASR. This raises natural questions regarding the utility of those measures for use in speech scoring.

A combination of multiple difficulties encountered in the context of processing spoken responses may help explain the issue. Most measures used in these studies were based on production units such as clauses and T-units. This being the case, the task of identifying them in speech, which is naturally endowed with frequent occurrences of fragments and ellipses, is hard. Additionally, speech from language learners tends to include frequent grammatical errors, which only increase the difficulty of identifying the units. Foster et al. (2000) listed multiple examples of difficult cases, such as phrases without subjects or verbs, and showed that having reliable units representing portions of speech with which to consistently assess features such as accuracy or complexity is not easy to accomplish. A third difficulty is due to the length of spoken responses, which are typically shorter than written responses. Most measures based on sentence or sentence-like units, found to be very reliable in measuring syntactic complexity from written responses, are rendered less reliable for use in speaking tasks that elicit only a few sentences. Not surprisingly, Chen and Yoon (2011) observed a marked decrease in the correlation between syntactic measures and proficiency as response length decreased.

One is faced with even more obstacles while using syntactic complexity and grammatical accuracy measures in automated speech scoring. Spontaneous speech contains disfluencies such as repairs and repetitions, and these disfluencies need to be processed appropriately before calculating the quantitative measures. For instance, in calculating grammatical accuracy measures such as the mean length of error-free T-units, only the corrected parts found in the repairs should be considered and the repetitive parts should be excluded. This requires a high-performing automated disfluency detector. However, such a tool shows suboptimal performance even with native speech. In addition, speech recognition errors only worsen the situation. Chen and Zechner (2011) showed that a moderate correlation between the score for syntactic complexity and speech proficiency (correlation coefficient = 0.49) was drastically reduced when used with automatic speech recognition (ASR) outputs. This reduction in correlation was found to be due to speech recognition errors. Due to these problems, the existing syntactic complexity and grammatical accuracy measures do not seem reliable enough to be used in automated speech proficiency scoring.

In this study, we address the need for measures of grammatical ability, in the sense defined by Ortega (2003), to encompass range and sophistication of surface forms in speech.
The requirements are: (a) the measures should correspond to differences in proficiency levels in non-native spontaneous speech and (b) the measures should be reliable for use in the context of automated speech scoring. In (Yoon and Bhat, 2012), inspired by vector-space modeling in the area of information retrieval, a new measure of grammatical competence based on comparing the similarity of a given response to a body of learner responses was studied. This measure was found to be relatively robust against speech recognition errors and was reliable for use with short responses, making it usable for automated scoring of spontaneous speech in typical language assessment scenarios. In contrast to recent studies focusing on length-based and grammatical structure-based features (as outlined above), the focus was on capturing the differences in the distribution of grammatical expressions across proficiency levels.

In this paper, we extend the study in (Yoon and Bhat, 2012) and propose a new measure that is also similarity-based and derived from a language modeling approach to natural language processing. Comparing the two similarity-based measures side-by-side, we first study their degree of association with proficiency scores as an indication of their utility as measures of syntactic complexity. We then study the performance of an automatic scoring model that uses these features alongside features representing other aspects of language ability. Subsequently, we compare the performance of such a scoring model with that using conventional measures of syntactic complexity that have been found to be useful in the context of automated scoring for written language assessment.

3. Measures of syntactic complexity

With the eventual goal of automatically scoring overall spoken proficiency levels covering various aspects of speaking proficiency, including grammatical accuracy, syntactic complexity, vocabulary diversity, and discourse structure, our immediate goal is to measure the grammatical competence of a given second language utterance. As will be described, conventional (or typical) measures of syntactic competence use sentence or clause length-based measures. In contrast to this, our approach will be to classify the syntactic complexity of an utterance as being similar to a particular category of learner responses (proficiency class) by constructing representative models that are score-class specific.

3.1. Measures from written language assessment

We use a set of fourteen measures of syntactic complexity studied in (Lu, 2010), found to be highly predictive indices of syntactic complexity for grading second language learner essays. Although the study points out that the results cannot be readily extended to productions with a large proportion of grammatically incomplete sentences (such as those produced by beginner-level learners), we chose these measures in order to obtain a baseline, since no other measures of syntactic complexity for spoken utterances are as yet available in related studies.

In (Lu, 2010), these measures of syntactic complexity were chosen based on the literature pertaining to second language development studies. They represent a fairly complete picture of the repertoire of measures that second language development researchers draw from. They can be categorized into five types and are listed in Table 1.3

3 The study further defines the various production units and syntactic structures used in the measures; since we feel that the definitions and the associated details are beyond the scope of the current study, we omit them from our discussion and direct the interested reader to the paper for more details.

With the goal of having measures that normalize the effect of the varying length of utterances, we choose only the ratio-based measures (types II, III, IV and V) and exclude measures of the first type.

3.2. Proposed measures

Our proposed syntactic complexity measures have the following two characteristics. First, in contrast to most methods that consider scores of syntactic complexity as a combination of measures of the length of sentence-based units or the frequency of specific grammatical structures, we directly measure students' sophistication and range in grammar usage based on the distribution of syntactic constructions.

Second, instead of rating a learner's response using a scale based on native speech production, our experiments compare it with a similar body of learners' responses. Considering the variety of grammatical structures that native speakers produce, a comparison of learners' constructions to those of native speakers implies searching in a large space of possible constructions. We instead collected a large amount of learners' spoken responses and classified them into four groups according to their proficiency level (as assigned by professional raters). We then examined how distinct the proficiency classes were based on the distribution of part-of-speech (POS) tags. We hypothesized that the level of acquired grammatical proficiency is signaled by the distribution of the POS tags. Hence, given a student's response, we obtained its similarity with score-specific models that capture the distribution of POS tags at each score level. The intuition here is that the similarity of the POS distribution of a response to that of the representative response of a given score class is governed by the proficiency level of the response.

Working with POS tags as features has the advantage of a much lower dimensionality than would be needed when using lexical information. Moreover, rather than focusing on specific grammatical constructions, we utilize the more fine-grained information at the level of sequences of POS tags. This in turn provides us with a convenient way to counter the effect of differences in topics in speaker responses, as would be the case in several scenarios of practical importance.
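To make the representation concrete, the sketch below (our own illustration, not the authors' code) shows how a single response transcript could be reduced to a distribution over POS tag bigrams. NLTK's default English tagger stands in for the POS tagger described later in Section 4.2, and the function name is hypothetical; the sketch also assumes the standard NLTK tokenizer and tagger models are installed.

```python
# Minimal sketch: turn a response transcript into a POS bigram distribution,
# the representation on which both proposed similarity measures operate.
from collections import Counter

import nltk  # assumes the 'punkt' and default POS tagger models are available


def pos_bigram_distribution(transcript: str) -> Counter:
    """Return relative frequencies of POS tag bigrams for one response."""
    tokens = nltk.word_tokenize(transcript)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]   # e.g. ['PRP', 'VBP', 'DT', ...]
    bigrams = list(zip(tags, tags[1:]))               # adjacent tag pairs
    counts = Counter(bigrams)
    total = sum(counts.values()) or 1
    return Counter({bg: c / total for bg, c in counts.items()})


print(pos_bigram_distribution("I think the project may change next year"))
```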
The following features are derived from these score-specific models:

- cosi: the cosine similarity value of the test response with the representative vector of score level i = 1, …, 4.
- cosmax: the score level with the highest similarity score given the response.
- lmi: the log-probability (likelihood) of the LM of score level i = 1, …, 4.
- lmmax: the score level of the LM with the maximum log-probability given the response.

We assume that the score level having the highest cosine similarity (or log-likelihood) value for a test response is the one most similar to the test response in terms of its underlying syntactic complexity. In contrast to cosi and lmi, which have continuous values, the two max features have the advantage that they can be directly interpreted as the proficiency level of the given response, based on its syntactic complexity.
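To make the definitions above concrete, here is a minimal sketch (ours, not the authors' implementation) of how the four feature families could be computed once the score-specific models exist. The data structures and function names are assumptions; in particular, whether and how the log-probabilities are length-normalized is left to the caller.

```python
# Sketch of the feature computation: given a response's POS n-gram vector,
# four score-level vectors, and four score-level POS LMs (built as in
# Sections 4.4 and 4.5), derive cos1..cos4, cosmax, lm1..lm4 and lmmax.
import math
from collections import Counter
from typing import Callable, Dict, List


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def vsm_features(response_vec: Counter, level_vecs: Dict[int, Counter]) -> Dict[str, float]:
    cos = {i: cosine(response_vec, level_vecs[i]) for i in (1, 2, 3, 4)}
    feats = {f"cos{i}": cos[i] for i in cos}
    feats["cosmax"] = max(cos, key=cos.get)       # score level with the highest similarity
    return feats


def lm_features(tag_seq: List[str],
                level_logprob: Dict[int, Callable[[List[str]], float]]) -> Dict[str, float]:
    # level_logprob[i] returns the log-probability of the tag sequence
    # under the score-level-i POS LM.
    lm = {i: level_logprob[i](tag_seq) for i in (1, 2, 3, 4)}
    feats = {f"lm{i}": lm[i] for i in lm}
    feats["lmmax"] = max(lm, key=lm.get)          # score level with the highest log-probability
    return feats
```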
The spoken responses were first recognized using the automatic speech recognizer described in Section 4.1. All transcriptions were tagged using the POS tagger described in Section 4.2, and POS tag sequences were extracted. This was followed by the model generation phase, where the POS-VSM models and the POS-LM models were obtained from the training data with different sets of tags. Finally, the features were generated for the responses in the test dataset.

4.1. Automatic speech recognizer

An HMM recognizer was trained using approximately 733 h of non-native speech collected from 7872 speakers. A gender-independent triphone acoustic model and a combination of bigram, trigram, and four-gram language models were used. A word error rate (WER) of 27% was observed on the held-out test dataset.
Three sets of POS tags were used: the original POS set without compound units (Base), the original set with 51 compound units (Base + mi50), and the original set with 100 compound units (Base + mi100). The number of tags was 42 for the Base set, 93 for the Base + mi50 set, and 150 for the Base + mi100 set.

4.4. Building VSMs

Unigram, bigram, and trigram tags were used in VSM building. For each n-gram, three VSMs were generated (one per tag set presented in Section 4.3, with the tags as terms), each yielding four score-specific vectors (one per score class), resulting in a total of nine VSMs. For each VSM we have cosi, i = 1, …, 4, and cosmax as features, with cosi taking values in [0, 1] and cosmax taking values in {1, 2, 3, 4}. The results were based on each n-gram model separately and we did not combine any models.

4.5. Training of POS n-gram LM

Unigram, bigram, trigram, 4-gram and 5-gram LMs were used in LM building. For each n-gram, three sets of LMs (using the three tag sets) were trained for each score level using score-specific training data. We used the SRILM toolkit (Stolcke, 2002) for training the models with Witten–Bell smoothing.5 For each n-gram and tag set combination, we obtained lmi, i = 1, …, 4, and lmmax as features. While lmi takes values in (−∞, 0], lmmax takes values in the set {1, 2, 3, 4}. As with the VSMs, the results were based on each n-gram model separately and we did not combine any models.

5 This smoothing was chosen to accommodate the non-zero frequency of POS tags, which prevents the application of Good–Turing smoothing.
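The model-generation step of Sections 4.4 and 4.5 can be pictured as in the sketch below. This is our own reading rather than the authors' code: the tf-idf style weighting of the score-level vectors is suggested by the discussion of idf values in Section 8.2, and the SRILM command given in the comment is only indicative of how the score-level LMs could be trained.

```python
# Sketch of building the four score-level VSM vectors from POS-tagged training
# responses. Each score level gets one vector of idf-weighted POS bigram counts
# pooled over that level's responses. For the LM side, SRILM was used in the
# paper; a comparable command would be
#   ngram-count -order 2 -wbdiscount -text level4.tags -lm level4.lm
import math
from collections import Counter
from typing import Dict, List


def level_vectors(tagged_responses: Dict[int, List[List[str]]]) -> Dict[int, Counter]:
    """tagged_responses maps score level (1-4) to a list of POS tag sequences."""
    # term frequencies pooled per score level
    tf = {level: Counter(bg for tags in seqs for bg in zip(tags, tags[1:]))
          for level, seqs in tagged_responses.items()}

    # idf computed over individual training responses
    all_seqs = [tags for seqs in tagged_responses.values() for tags in seqs]
    df = Counter(bg for tags in all_seqs for bg in set(zip(tags, tags[1:])))
    n_docs = len(all_seqs)
    idf = {bg: math.log(n_docs / df[bg]) for bg in df}

    # idf-weighted score-level vectors, one per proficiency level
    return {level: Counter({bg: counts[bg] * idf[bg] for bg in counts})
            for level, counts in tf.items()}
```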
5. Data

The data used in this study was a proprietary collection of responses from the Test of English as a Foreign Language® internet-based test (TOEFL® iBT), an international English language assessment required for studying in an English-medium university environment. The assessment consisted of six items per speaker, where speakers were prompted to provide responses lasting between 45 and 60 s per item, resulting in approximately 5.5 min of speech per speaker. Among the six items, two items were "independent items" that asked examinees to provide information or opinions on familiar topics based on their personal experience or background knowledge. The four remaining items were "integrated items" that required reading or listening in addition to speaking skills. Test takers read and/or listen to some stimulus materials and then respond to a question based on them. All items elicited spontaneous, unconstrained natural speech. While there were important differences between these task types from an assessment point of view (i.e. whether the item required listening and/or reading skills), they were not differentiated in this study, owing to the fact that both item types elicited unconstrained speech and did not have specific target grammar expressions.

Approximately 50,000 responses were collected and split into two datasets: the ASR set and the scoring model training/test (SM) set. The ASR set comprised about 733 h of speech and was used for ASR training and POS similarity model training. The SM set comprised 44 h of speech and was used for feature evaluation and automated scoring model evaluation. There was no overlap in items between the ASR set and the SM set. Each response was rated for proficiency by trained human scorers using a 4-point scoring scale, where 1 indicated low speaking proficiency and 4 indicated high speaking proficiency. The speaker and item information and the distribution of proficiency scores are presented in Table 2.

As seen in Table 2, there is a strong bias towards the middle scores (scores 2 and 3), with approximately 83–84% of the responses belonging to these two score levels. Although the skewed distribution of students at the low and high score levels increased the difficulty of feature development and model training, we used the data without modifying the distribution, since this constitutes a typical distribution of proficiency scores in a large-scale language assessment scenario. The mean scores in the two datasets were 2.67 and 2.63, respectively, and the overall score distributions of the two datasets were comparable.

773 responses in the ASR set and 4 responses in the SM set were not scoreable due to sub-optimal response characteristics such as technical difficulties (e.g., equipment or transmission errors, loud background noise) and lack of a student's response. These problematic responses were removed from the data, as a result of which the full set of 6 responses was not available for some speakers. The average number of responses per speaker for both data sets was more than 5.9, suggesting that these failures were relatively minor.

In order to evaluate the reliability of the human ratings, approximately 5% of the ASR set (a total of 2388 responses) was scored by two raters. Both the Pearson correlation coefficient r and the weighted kappa κ were 0.62, indicating the level of subjectivity in the task of proficiency scoring.

Table 3 presents a descriptive analysis of the length of production units for each score point. For the analysis, we used a total of 200 responses from TOEFL Practice Online (TPO), an online practice test which allows students to gain familiarity with the format of our main data (TOEFL); thus the responses are similar to those in our datasets. The manual transcriptions were annotated for locations of clause boundaries (hereafter, CBs) and disfluencies such as repetitions, repairs, and sentence fragments by two annotators with a linguistics background.6

6 Approximately 15% of the data was double annotated; κ was 0.95 for CBs and 0.75 for disfluencies.
Table 2
Data size and score distribution.

Data set | No. of responses | No. of speakers | No. of items | Score mean | Score SD | Score distribution: 1 / 2 / 3 / 4
ASR      | 47,227           | 7872            | 24           | 2.67       | 0.73     | 1953 (4%) / 16,834 (36%) / 23,106 (49%) / 5,334 (11%)
SM       | 2,876            | 240              | 12           | 2.63       | 0.75     | 141 (5%) / 1132 (39%) / 1263 (44%) / 340 (12%)

Table 3
Means and standard deviations of the number of production units for each score point.

Level | N  | Words per response (M, SD) | CBs per response (M, SD) | T-units per response (M, SD)
1     | 50 | 41.78, 27.56               | 5.38, 3.58               | 3.20, 2.01
2     | 50 | 71.38, 31.48               | 7.38, 3.31               | 4.86, 2.18
3     | 50 | 85.28, 32.00               | 9.22, 3.84               | 5.50, 2.43
4     | 50 | 94.22, 38.98               | 9.48, 3.93               | 5.94, 2.97
Disfluencies frequently occur in non-native speakers' spontaneous speech and they should not be considered to be parts of a sentence, because their inclusion will result in inflated lengths of production units. Therefore, disfluencies were removed based on the manual annotations. Secondly, the transcriptions were segmented into clauses, T-units, and sentences based on the human annotations. Finally, the means and standard deviations of each production unit were calculated for each score level.

6. Experiments

As such, we need a training set to obtain the underlying score-specific models, which we then use to generate the feature values using responses from the evaluation set. Towards this end, we used the data in two modes. In the manual mode, we used the manual transcriptions (henceforth termed manual), and in the automated mode (henceforth termed auto), we used the output of the speech recognizer. This set-up allowed for an understanding of the impact of ASR on the feature generation process as well as on the feature performance. Finally, we use the features obtained in the automated mode in a multiple regression scoring model and compare its performance with that using conventional features of syntactic complexity (also obtained in the automated mode).

6.1. Dataset combination

The experiments were conducted with the following combinations of datasets. We vary the nature of the transcriptions available (auto or manual) in the training and test sets, yielding the following (train, test) combinations for experimentation. Such a split of the dataset permits us to make the following observations on the performance of the proposed features in various operating scenarios.

- Set 1: (Manual, Manual) – in this mode, we have the best-case performance of the model being considered, with manual transcriptions available for both training and evaluation.
- Set 2: (Manual, auto) – in this set-up we observe the effects of ASR output on the performance of a model that is trained with manually transcribed data and tested on the ASR output.
- Set 3: (Auto, auto) – this set-up provides the level of performance in a true operational scenario where it is difficult to obtain manually transcribed responses.

6.2. Association between proposed measures, automatic scores and human ratings

We determine the utility of the features by computing each feature's correlation with human-assigned proficiency scores, as has been done in prior related studies. These features are then used in a multiple regression scoring model (as studied in Zechner et al. (2009)) and the resulting automatic scores are compared with the human-rated scores. As a baseline, we use the model studied in Zechner et al. (2009), which is then augmented with the features we propose in this study as well as with features using prototypical measures of syntactic complexity. We then compare the resulting scoring models based on the correlations of the predicted values with the human scores.
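A minimal sketch of the feature-evaluation step just described (our own illustration with toy data; the feature values and human scores below are hypothetical placeholders for per-response measurements):

```python
# Sketch of the evaluation in Section 6.2: Pearson correlation of each candidate
# feature with the human-assigned proficiency scores.
import numpy as np
from scipy.stats import pearsonr

human_scores = np.array([2, 3, 3, 4, 1, 2, 3, 2])  # human ratings (1-4), toy data
features = {
    "cos4":  np.array([0.61, 0.74, 0.70, 0.82, 0.40, 0.55, 0.77, 0.58]),
    "lmmax": np.array([2, 3, 3, 4, 2, 2, 3, 2]),
}

for name, values in features.items():
    r, p = pearsonr(values, human_scores)
    print(f"{name}: r = {r:.2f} (p = {p:.3f})")
```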
6.3. Factors for practical consideration

A primary requirement of automated scoring systems that make measurements on automatically recognized spoken responses is that the measures be immune to the ASR errors inherent in the process. The utility of a set of automatically derived measures is better analyzed in the context of a parallel analysis that considers measurements obtained using transcriptions obtained manually and automatically.
The results will aid the assessment of the overall utility of the measures with respect to a change in the mode of measurement. Our experiments and analyses will reflect this factor of practical significance.

Another factor of importance from a practical standpoint is that of generalizability. In this respect, we are interested in seeing the extent to which the size of the training data affects model performance. Along the lines of generalizability, we also explore the extent to which our method is applicable to scoring items unseen in the training set.

6.4. Comparison with prototypical measures from written language assessment

We compare the utility of the proposed measures with that of the most representative syntactic complexity measures used in (Lu, 2010) and described in Section 3.1. As rationalized in Section 2.1, there is reason to believe that the sentence-length based measures are inherently unsuitable for use in automatic spoken language assessment. In order to test this, the performance of our features was compared with that of the syntactic complexity features proposed in (Lu, 2010). Towards this, the clause boundaries of the ASR hypotheses were automatically detected using the automated clause boundary detection method used in (Chen and Zechner, 2011).7 The utterances were then parsed using the Stanford Parser (Klein and Manning, 2003), and a total of 14 features, including both length-related features and parse-tree based features, were generated using (Lu, 2012). Finally, the Pearson correlation coefficients between these features and the human proficiency scores were calculated.

7 The automated clause boundary detection method in this study was a Maximum Entropy Model based on word bigrams, POS tag bigrams, and pause features. The method achieved an F-score of 0.60 on the non-native speakers' ASR hypotheses. A detailed description of the method is presented in (Chen and Yoon, 2011).

7. Results

7.1. POS features

We will first review the results of the VSM-based features, followed by the LM-based features.

The VSMs and the features were obtained using both modes (manual and automated). Table 4 shows feature-score correlations for the features cosmax and cos4 (features cos1, cos2 and cos3 were excluded since they showed lower correlation with human scores and were highly correlated with cos4). From the table, we make the following observations:

- The feature most correlated with human proficiency scores was cos4 using the base set with bigrams. It achieved the best correlation of 0.43 when manual transcriptions were used in model building as well as in evaluation (Set 1). The drop in score-feature correlation when used with ASR output (Set 2 and Set 3) was not statistically significant.
- Overall, cos4 not only outperformed cosmax, but was also more robust against ASR errors.
- Bigram-based features outperformed both unigram-based and trigram-based features.
- The inclusion of compound tags (see Section 4.3) did not result in an increased correlation for cos4. However, it increased the correlation (statistically significant at the 0.01 level) in the case of cosmax when obtained from unigrams.

The unigrams had good coverage but limited power in distinguishing between different score levels. Their ability to distinguish between levels was augmented with the inclusion of co-occurring POS tags. On the other hand, trigrams had the opposite characteristics – they captured more structure but did not have good coverage because of data sparseness. Bigrams seemed to strike a balance between coverage and complexity (from among the three n-gram models considered here) and may thus have resulted in the best performance for both features in both the manual and ASR modes.

It is worthwhile to emphasize that the performance of the ASR-based features was comparable to that of the transcription-based features. The best performing feature among the ASR-based features was the one using the bigram and base set, with correlations nearly the same as the best performing feature among the transcription-based features. Seeing how close the correlations were in the case of the manual transcription-based and ASR-hypothesis-based features, we conclude that the proposed measure is robust to ASR errors.
Table 4
Pearson correlation coefficients between VSM-based features and expert proficiency scores. Here we consider the features cos4 and cosmax in the VSM model. All correlations are significant at the 0.01 level; the best-performing values were marked in bold in the original.

Feature | Set    | Unigram: Base / Base+mi50 / Base+mi100 | Bigram: Base / Base+mi50 / Base+mi100 | Trigram: Base / Base+mi50 / Base+mi100
cosmax  | (M, M) | 0.08 / 0.18 / 0.18                     | 0.34 / 0.33 / 0.34                    | 0.34 / 0.33 / 0.34
cosmax  | (M, A) | 0.12 / 0.17 / 0.17                     | 0.26 / 0.24 / 0.26                    | 0.25 / 0.26 / 0.26
cosmax  | (A, A) | 0.13 / 0.18 / 0.19                     | 0.30 / 0.30 / 0.30                    | 0.31 / 0.28 / 0.27
cos4    | (M, M) | 0.30 / 0.30 / 0.33                     | 0.43 / 0.36 / 0.37                    | 0.40 / 0.32 / 0.30
cos4    | (M, A) | 0.25 / 0.27 / 0.30                     | 0.42 / 0.35 / 0.35                    | 0.37 / 0.31 / 0.28
cos4    | (A, A) | 0.30 / 0.27 / 0.30                     | 0.41 / 0.32 / 0.32                    | 0.34 / 0.28 / 0.26

Table 5
Pearson correlation coefficients between LM-based features and expert proficiency scores. All correlations are significant at the 0.01 level.

Feature | Set    | Unigram: Base / Base+mi50 / Base+mi100 | Bigram: Base / Base+mi50 / Base+mi100 | Trigram: Base / Base+mi50 / Base+mi100
lmmax   | (M, M) | 0.31 / 0.32 / 0.32                     | 0.38 / 0.34 / 0.34                    | 0.36 / 0.26 / 0.25
lmmax   | (M, A) | 0.26 / 0.28 / 0.28                     | 0.29 / 0.28 / 0.29                    | 0.29 / 0.24 / 0.26
lmmax   | (A, A) | 0.30 / 0.31 / 0.31                     | 0.33 / 0.32 / 0.30                    | 0.29 / 0.29 / 0.29
lm4     | (M, M) | 0.06 / 0.05 / 0.07                     | 0.19 / 0.17 / 0.18                    | 0.23 / 0.19 / 0.20
lm4     | (M, A) | 0.05 / 0.03 / 0.02                     | 0.15 / 0.15 / 0.14                    | 0.18 / 0.19 / 0.17
lm4     | (A, A) | 0.06 / 0.04 / 0.05                     | 0.16 / 0.12 / 0.14                    | 0.18 / 0.14 / 0.15
Table 5 shows the correlations between the LM-similarity features and expert-rated proficiency scores for experiments with n-grams of order n = 1, …, 3. Since no improvement in correlation was observed by increasing the n-gram size beyond trigrams, 4-gram and 5-gram results are excluded from this table. Additionally, the features lm1, lm2 and lm3 were excluded since they showed lower correlation with human scores and were in turn highly correlated with lm4. From the table, we make the following observations:

- The feature most correlated with human scores was lmmax using the base set with bigrams. It achieved a correlation of 0.38 when both model building and evaluation were based on manual transcriptions (Set 1). There was a substantial drop in correlation when used with the ASR output; the correlation for Set 3 (ASR-based training and evaluation) was 0.33 (a 0.05 drop in absolute correlation).
- As expected, features based on Set 1 of the dataset combinations (manually transcribed training and evaluation data) outperformed those obtained using Set 2 and Set 3. However, when used with ASR output, Set 3 was better than Set 2. In Set 2, there was a discrepancy in the train/evaluation data condition: the training data was based on the manual transcriptions while the evaluation data was based on the ASR output. This discrepancy would have resulted in the additional performance drop.
- Feature lmmax outperformed lm4 overall.
- Bigram-based features using the base tag set showed better correlation than the other n-gram features. However, for lm4 the 4-gram showed the best correlation (though not too different from the trigram-based lm4), across different tag sets and dataset combinations.
- The inclusion of compound tags did not result in an increased correlation for lm4.

The LM-based features behaved differently from the VSM-based features. First, the features were more susceptible to ASR errors, and there was a substantial performance drop in the best performing features when used with the ASR outputs. Furthermore, from Tables 4 and 5 we observe that the feature cos4 was the better performing feature for the VSM method and the feature lmmax was the better performing feature for the LM method.

7.2. Comparison of features

Table 6 shows the performance of the prototypical measures of syntactic competence from written language assessment. We observe that these measures have lower correlations with human scores compared to our proposed features. We then include the features in a scoring model for subsequent performance comparison.

As seen in the previous section, cos4 emerged as the feature of choice when using the VSM, and lmmax the feature of choice when using the POS-LM. The two features seem to represent different aspects of grammatical competence. What follows is a comparison of the two features.

In the case of cos4, the feature, in a way, captures how far the given POS tag distribution is from the representative vector of score level 4 in a high-dimensional space. While this is useful, the feature value only gives the similarity with the score level. Unlike the feature cos4, lmmax gives a score-level value to a test response, which may be construed as a score of grammatical competence. Although somewhat unrelated to the correlation with human-assigned proficiency scores, the relative ease of interpretation seems to be an advantage of the POS-LM model. However, taking into account the correlation with human judgment, the VSM model seems to be a better option.

We list the correlation coefficients between the conventional measures of syntactic complexity and the scores of proficiency in Table 6. From Table 6, we note that the best performing feature is "DCC" (the mean number of dependent clauses per clause), with a correlation r of 0.14. In addition, "DCT" also had a statistically significant correlation, but it was even weaker than that of "DCC". Our best performing feature (bigram-based cos4) widely outperformed the best of Lu (2010)'s features (with correlations approximately 0.3 apart).
Table 6
Pearson correlation coefficients between the prototypical measures of syntactic complexity and expert proficiency scores for the ASR output. Only correlations marked (*) were significant at the α = 0.01 level.

Measure     | C/S   | C/T   | CT/T  | DC/C   | DC/T   | CP/C  | CP/T  | T/S   | CN/C  | CN/T  | VPT
Correlation | 0.014 | 0.006 | 0.017 | 0.138* | 0.060* | 0.015 | 0.026 | 0.027 | 0.028 | 0.031 | 0.008
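For readers unfamiliar with the abbreviations in Table 6, the sketch below shows how such ratio-based measures are formed once production units have been counted; the counting itself (automatic clause boundary detection and parsing) is the error-prone step discussed in Section 6.4. The data class and field names here are our own illustration, not the interface of Lu's analyzer.

```python
# Illustrative only: ratio-based syntactic complexity measures derived from
# per-response counts of production units (toy numbers in the example call).
from dataclasses import dataclass


@dataclass
class UnitCounts:
    sentences: int
    clauses: int
    t_units: int
    dependent_clauses: int
    verb_phrases: int


def ratio_measures(u: UnitCounts) -> dict:
    return {
        "C/S": u.clauses / u.sentences,           # clauses per sentence
        "C/T": u.clauses / u.t_units,             # clauses per T-unit
        "DC/C": u.dependent_clauses / u.clauses,  # dependent-clause ratio ("DCC")
        "DC/T": u.dependent_clauses / u.t_units,  # dependent clauses per T-unit ("DCT")
        "T/S": u.t_units / u.sentences,           # T-units per sentence
        "VP/T": u.verb_phrases / u.t_units,       # verb phrases per T-unit
    }


print(ratio_measures(UnitCounts(sentences=5, clauses=9, t_units=6,
                                dependent_clauses=3, verb_phrases=11)))
```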
For the purpose of comparing the features and the models side-by-side, we chose the best features (those with the highest correlations with human judgments) from the models studied here (prototypical features from writing assessment, VSM-based and POS-LM based models). For the sake of completeness, we included both cases (the models trained and evaluated with manual transcriptions as well as the models trained and evaluated with automatic transcriptions). We observed that each of these methods had an associated best performing feature – the feature cos4 was the better performing feature in the VSM and the feature lmmax was the better performing feature in the POS-LM. The comparison of features is summarized in Table 7 below.

Table 7
Comparison of feature-score correlations.

Study         | Feature | Condition | Correlation
(Lu, 2010)    | DCC     | Trans     | 0.14
(Lu, 2010)    | DCC     | ASR       | 0.14
Current study | cos4    | Trans     | 0.43
Current study | cos4    | ASR       | 0.41
Current study | lmmax   | Trans     | 0.38
Current study | lmmax   | ASR       | 0.33

A plausible explanation for the poor performance of Lu (2010)'s features is that the features were generated using a multi-stage automated process, and the errors in each stage propagated to the next, resulting in low feature performance. For instance, errors in the automated clause boundary detection may result in a serious drop in performance. With the spoken responses being particularly short (a typical response in the dataset had 10 clauses on average), even one error in clause boundary detection can seriously affect the reliability of the features.

7.3. Effect of the training data size on model performance

For the VSM-based and LM-based POS-model constructions, we used approximately 47 K responses, which may be unavailable in many practical scenarios. To understand the effect of training data size on the resulting POS-model, we consider varying the available data size for POS-model building in the case of the better performing VSM-based scenario. VSM-based models are built using variable-sized training data of manual transcriptions (sampled to preserve the underlying score distribution), and the Pearson correlation coefficients of the feature cos4 (derived from ASR-based transcriptions of the responses in the SM data) with human proficiency scores are noted for every sample. From the results tabulated in Table 8 we notice that even with 500 responses sampled from the training data, we are able to achieve a feature-score correlation of 0.41, which approaches that obtained with a model trained on 47 K responses.

In order to further highlight the generalization performance of the proposed models, it may be worth noting that since the test data had no item overlap with the training set, the feature generalizes well to unseen items.

Table 8
Comparison of cos4-score correlations, r, by varying training data size (number of responses) for VSM-based POS-model training.

Data size (in 1000s) | 0.1  | 0.2  | 0.5  | 1.0  | 2.4  | 4.7  | 9.4  | 24   | 47
r                    | 0.35 | 0.38 | 0.41 | 0.42 | 0.41 | 0.42 | 0.41 | 0.42 | 0.42

7.4. Automatic scoring model comparison

The effect of the inclusion of the proposed features is best understood by studying them in an automatic scoring model. We consider a multiple linear regression scoring model that approximates the human scores by a linear combination of the proposed features, which represent various constructs of oral language proficiency. We first build and compare stand-alone multiple regression models of syntactic complexity. We then compare scoring models that include measurements of syntactic complexity in addition to the other constructs of proficiency currently being measured in the state-of-the-art scoring model (SpeechRater). For the purpose of this comparison, we perform 5-fold cross validation on the SM data described in Section 5 (N = 2876), using a training set (80%) and a held-out test set (20%) to parametrize the models and to assess the models' performance, respectively, in every run. The results are averaged over the 5 runs.
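A sketch of this evaluation protocol (our own illustration, not the authors' code): 5-fold cross-validation of a multiple linear regression scorer, with agreement reported as Pearson correlation and weighted kappa. The feature matrix, the toy scores, and the choice of linear kappa weights are assumptions, since the paper does not specify the weighting scheme.

```python
# Sketch of Section 7.4: 5-fold cross-validated multiple regression scoring
# with Pearson correlation (unrounded) and weighted kappa (rounded) agreement.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(2876, 7))                       # stand-in for the SM-set features
y = np.clip(np.round(2.6 + X[:, 0] + 0.3 * rng.normal(size=2876)), 1, 4)  # toy scores 1-4

correlations, kappas = [], []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    correlations.append(pearsonr(pred, y[test_idx])[0])            # unrounded agreement
    rounded = np.clip(np.round(pred), 1, 4).astype(int)
    kappas.append(cohen_kappa_score(rounded, y[test_idx].astype(int), weights="linear"))

print(f"mean r = {np.mean(correlations):.3f}, mean weighted kappa = {np.mean(kappas):.3f}")
```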
7.4.1. Scoring model for syntactic complexity

Although the human-assigned scores of proficiency are holistic scores of overall proficiency, in the absence of stand-alone scores of syntactic complexity we assume that the scores of proficiency are well correlated with scores of syntactic complexity. Under this assumption, we first construct automatic scoring models for syntactic complexity using the three sets of features studied here – the prototypical measures from written language assessment, the VSM-based features and the LM-based features. The grammar scoring model using only the prototypical features described in Section 3.1 will serve as a baseline (GramLu). We then compare scoring models using VSM- and LM-based features (GramVSM and GramLM, respectively). We avoid collinearity in the scoring models by excluding correlated features (r ≥ 0.8). The resulting GramLu model includes the features CN/C, CN/T, CP/C, C/T, T/S, and CT/T; GramVSM includes cosmax and cos4, while GramLM includes lmmax and lm4. The agreement between predicted scores and human scores in terms of Pearson correlation coefficient (unrounded and rounded) as well as weighted kappa is tabulated in Table 9.

Table 9
Comparison of scoring model performances using only features for grammar competence. All values are significant at the 0.01 level except the one marked with (*); the best-performing values were marked in bold in the original.

Evaluation method       | GramLu | GramLM | GramVSM
Weighted kappa          | 0.05   | 0.25   | 0.33
Correlation (unrounded) | 0.06   | 0.30   | 0.42
Correlation (rounded)   | 0.04*  | 0.23   | 0.34

From Table 9, we observe that the correlation is highest for the scoring model using VSM-based features, and that this correlation is considerably higher (by 0.36) than that of the baseline model using conventional features from written language assessment. The correlation of the scoring model using LM-based features is also better than the baseline (by 0.24). The agreement results of GramLu here are in line with the results in (Chen and Zechner, 2011), where it was observed that measures of syntactic complexity were highly sensitive to errors in ASR and are hence not reliable for use in speech scoring.

7.4.2. Scoring model for oral proficiency

We next consider an evaluation of the syntactic complexity features studied here in an augmented scoring model that includes features from SpeechRater that represent constructs of fluency, pronunciation, vocabulary and grammar. Towards this, we again consider a multiple regression model that includes the features global normalized HMM acoustic model score (amscore), speaking rate (wpsec), types per second (tpsecutt), average chunk length in words (wdpchk) and global normalized language model score (lmscore) (Zechner et al., 2009) by themselves (Base) as well as alongside the features considered in this study. Such an arrangement permits us not only to compare the relative gains in scoring model performance over the base model but also to assess the relative importance of the features considered in the model.

Here, in addition to the base model with the features found in SpeechRater, we study augmented scoring models using the three sets of features studied here – the prototypical measures from written language assessment, the VSM-based features and the LM-based features. The augmented models will be denoted by SynLu, SynVSM, and SynLM respectively. In addition, we consider a combination model, SynAll, that includes all features. The correlations (unrounded and rounded) as well as the weighted kappa for the models studied here are found in Table 10.

Table 10
Comparison of scoring model performances using the features of syntactic complexity studied in this paper along with those available in SpeechRater. Here, Base is the scoring model using just the features in SpeechRater. All correlations are significant at the 0.01 level.

Evaluation method       | Base  | SynLu | SynLM | SynVSM | SynAll
Weighted kappa          | 0.540 | 0.540 | 0.560 | 0.560  | 0.570
Correlation (unrounded) | 0.587 | 0.589 | 0.595 | 0.607  | 0.610
Correlation (rounded)   | 0.500 | 0.503 | 0.509 | 0.526  | 0.522

Comparing the agreement results in Table 10, we notice that the model SynVSM shows the highest correlation (0.607) of predicted scores with human scores among Base, SynLu, SynLM and SynVSM. Moreover, testing for the significance of the difference between two dependent correlations using Steiger's Z-test, we notice that the differences in correlations (Base–SynVSM, SynLu–SynVSM and SynLM–SynVSM) are statistically significant at the 0.01 level. When all features are combined in the SynAll model, the correlation appears marginally higher than that of SynVSM, but the difference is not statistically significant. Thus, the inclusion of measures of syntactic complexity shows improved correlation between machine and human scores compared to the state-of-the-art model (here, Base).

The agreement between human and machine scores for the scoring model SynVSM (Pearson correlation coefficient 0.607) may not seem impressive by itself, but taken in the context of the subjectivity of the task of assigning a proficiency score (as mentioned in Section 5), the correlation coefficient is seen to approach the near-human agreement of 0.62.

Our next focus is the relative importance of the features in an overall scoring model such as SynVSM that includes all available features for measuring the various constructs of oral proficiency. In Table 11 we list the relative importance of the predictors, as the R² contribution averaged over orderings among predictors (Grömping, 2006). We notice that the feature cos4 is the third most important predictor of oral proficiency, next only to tpsecutt and wdpchk, which are existing features in the SpeechRater module.
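The relative-importance computation cited from Grömping (2006) averages, over all orderings of the predictors, the increase in R² contributed by each predictor (the LMG decomposition, implemented in the R package relaimpo). A brute-force sketch of the idea is given below with synthetic data and illustrative feature names; it is practical only for a small number of predictors because it enumerates all orderings.

```python
# Sketch of relative importance as the R^2 contribution averaged over
# predictor orderings (the LMG decomposition). Data and names are illustrative.
from itertools import permutations

import numpy as np
from sklearn.linear_model import LinearRegression


def r2(X, y, cols):
    if not cols:
        return 0.0
    return LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)


def lmg_importance(X, y, names):
    p = X.shape[1]
    contrib = {name: 0.0 for name in names}
    orderings = list(permutations(range(p)))
    for order in orderings:
        used = []
        for col in order:
            before = r2(X, y, used)
            used = used + [col]
            contrib[names[col]] += r2(X, y, used) - before  # marginal R^2 of this predictor
    return {name: value / len(orderings) for name, value in contrib.items()}


rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = 0.8 * X[:, 0] + 0.4 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(size=500)
print(lmg_importance(X, y, ["tpsecutt", "wdpchk", "cos4", "wpsec"]))
```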
in the examples, 'the project may can change' and 'the others must can not be good', they are related to grammatical errors.

Finally, the comparative group includes 'RBR_JJR'. The decrease of the tag 'RBR_JJR' is related to the correct acquisition of the comparative form. 'RBR' is for comparative adverbs and 'JJR' is for comparative adjectives, and a combination of these two tags is strongly related to double-marked errors such as 'more easier'. In the course of acquiring the comparative form, learners tend to use the double-marked form. The compound tags correctly capture this erroneous stage.

The 'DEC' group also includes three Wh-related tags (WDT_NN, WDT_VBP, WRB), but the proportion is much smaller than that found in the 'INC' group. The analysis shows that the combination of original and compound POS tags correctly captures systematic changes in grammatical expressions according to changes in proficiency levels.

8.2. POS LM-based similarity assessment

In contrast to the VSM-based features, ASR errors caused significant performance drops in the LM-based features. In this section, we analyze the impact of ASR errors on the LM-based features in detail with reference to the models based on Set 1 and Set 2 (the manual transcription-based model). Since the feature lmmax had better performance than lm4, we conducted further analysis on lmmax. In addition, there was no significant difference between the manual-transcription-based model and the ASR-hypothesis-based model (Set 1 and Set 3). Hence, we limit our discussion to those features obtained using the manual transcription-based model.

The relationship between predicted scores and human scores in the case of the LM-based similarity features was further analyzed using a contingency table for the feature lmmax with Set 1 and Set 2 (recall that in both these cases the training data consists of manually transcribed utterances, and it is only the mode of the evaluation data that changes). Focusing on the best performing model (the bigram LM), the agreement between the human score and the feature was promising; for the manual transcriptions, the exact agreement between human scores and lmmax was 36% and the combination of the exact and adjacent agreement was 86%. Upon examination of the contingency table for the manual transcriptions, we notice that every predicted score level is dominated by responses from score levels 2 and 3.

However, the feature lmmax was strongly influenced by recognition errors. This is seen in the significant drop in the correlation from 0.39 with Set 1 to 0.29 with Set 2. For the ASR hypotheses, the exact agreement decreased from 36% to 32%, and the adjacent agreement from 50% to 40%.
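The contingency-table and agreement figures above can be reproduced along the following lines (a sketch with toy data; the actual per-response scores are not available in this excerpt):

```python
# Sketch of the agreement analysis for lmmax: cross-tabulate predicted score
# levels against human scores and report exact and adjacent agreement.
import numpy as np
import pandas as pd

human = np.array([2, 3, 3, 2, 4, 1, 3, 2, 3, 2])
lmmax = np.array([2, 3, 2, 2, 3, 2, 3, 3, 3, 2])   # predicted score level per response

table = pd.crosstab(pd.Series(human, name="human"), pd.Series(lmmax, name="lmmax"))
print(table)

exact = np.mean(human == lmmax)                     # same score level
adjacent = np.mean(np.abs(human - lmmax) <= 1)      # within one score level
print(f"exact agreement: {exact:.0%}, exact-or-adjacent agreement: {adjacent:.0%}")
```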
For further analysis, we calculated the mean values of lmmax (the most similar score class) for each score level and compared the ASR-based ones with the transcription-based ones. Fig. 1 shows the results. As a comparison, Fig. 2 provides the mean values of cos4 (the cosine similarity value with score level 4) for each score level.

Fig. 1. Comparison of lmmax between manual-based features and ASR-based features at each score level.
Fig. 2. Comparison of cos4 between manual-based features and ASR-based features at each score level.

For lmmax, the mean values of the ASR-based features were lower than those of the manual transcription-based features, except for score class 1. In particular, the drop in feature values for lmmax increased with increasing score levels, and this reduced the distinction between score levels. However, no such behavior was found for cos4. In fact, the mean values of the ASR-based features were slightly higher than those of the transcription-based features for all score levels, and the distinction between score levels remained steady.

We attribute the drop in performance of the ASR-based features to the following sources of errors:

1. The speech recognizer may mis-recognize grammatical errors in the response, replacing them with more frequent, correct expressions.
2. The speech recognizer may mis-recognize sophisticated expressions in a response, replacing them with more frequent, simpler expressions.
3. The recognizer errors may generate grammatical errors not existing in the actual responses.
Each of the items above, as well as any combination thereof, results in a loss of characteristics of a response, thereby reducing the discriminating power of the features. While source 1 results in inflated predicted levels, sources 2 and 3 result in underestimated proficiency levels. The current results (Fig. 1) support sources 2 and 3, since the mean values of ASR-based lmmax were generally lower than those of the manual transcription-based ones.

In order to identify the bigrams that were heavily influenced by ASR errors, bigram frequencies for manual transcriptions and ASR hypotheses were obtained and their differences were examined. Since the impact of the errors was particularly strong on responses with a score of 4, we focused on responses with that score. We identified the top 10 bigrams for which the ASR-based frequencies were substantially higher than their manual transcription-based frequencies. Among these 10 bigrams, tag repetitions ('DT–DT', 'IN–IN', 'VBZ–VBZ') and some determiner-related bigrams ('DT–IN', 'DT–PRP', 'NN–DT') were strongly associated with grammatical errors. The analysis lends support to our hypothesis that ASR generates grammatical errors that are not present in the actual responses, which depresses the scores of high-scoring responses. As a result, the proficiency level of highly proficient speakers is underestimated when the feature lmmax is used.
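The bigram-difference analysis can be sketched as follows. This is an illustrative Python fragment under our own naming and with placeholder POS sequences, not the exact procedure used in the experiments.

```python
# Minimal sketch: find POS bigrams whose relative frequency is substantially
# higher in ASR hypotheses than in manual transcriptions.
from collections import Counter

def bigram_freqs(tag_sequences):
    """Relative frequencies of POS-tag bigrams over a set of tagged responses."""
    counts = Counter()
    for tags in tag_sequences:
        counts.update(zip(tags, tags[1:]))
    total = sum(counts.values()) or 1
    return {bigram: c / total for bigram, c in counts.items()}

def inflated_bigrams(manual_sequences, asr_sequences, top_n=10):
    """Bigrams ranked by how much more frequent they are in the ASR output."""
    manual = bigram_freqs(manual_sequences)
    asr = bigram_freqs(asr_sequences)
    diffs = {bg: freq - manual.get(bg, 0.0) for bg, freq in asr.items()}
    return sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

if __name__ == "__main__":
    # Illustrative POS-tagged responses (placeholders, not corpus data).
    manual_sequences = [["PRP", "VBZ", "DT", "NN"], ["DT", "NN", "VBZ", "JJ"]]
    asr_sequences = [["PRP", "VBZ", "DT", "DT", "NN"], ["DT", "IN", "NN", "VBZ", "VBZ"]]
    for bigram, diff in inflated_bigrams(manual_sequences, asr_sequences):
        print(bigram, round(diff, 3))
```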
Surprisingly, the ASR errors that affected the performance of lmmax did not influence the performance of cos4. We found that the 10 bigrams that were highly associated with ASR errors had relatively low idf values; they ranked 85–90% of the way down from the top when the bigrams were sorted by their idf values. This low idf has the effect of reducing the impact of ASR errors on cos4. This suggests that, compared to lmmax, cos4 is better suited for use in a fully automated speech scoring system based on speech recognition.
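The effect can be illustrated with a small idf-weighted cosine computation. The sketch below uses invented idf values and counts purely to show why a bigram with low idf barely moves cos4; it is not the paper's implementation.

```python
# Minimal sketch: an ASR-induced bigram with low idf contributes little to a
# tf-idf weighted cosine similarity. All numbers are invented for illustration.
import math

def cosine(u, v):
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def tfidf(tf, idf):
    return {k: tf[k] * idf.get(k, 0.0) for k in tf}

if __name__ == "__main__":
    idf = {("PRP", "VBZ"): 2.5, ("JJ", "NN"): 2.0, ("DT", "DT"): 0.1}   # hypothetical idf values
    score_model = {("PRP", "VBZ"): 3, ("JJ", "NN"): 2}                  # hypothetical score-level counts
    manual_resp = {("PRP", "VBZ"): 2, ("JJ", "NN"): 1}                  # response, manual transcript
    asr_resp = {("PRP", "VBZ"): 2, ("JJ", "NN"): 1, ("DT", "DT"): 2}    # same response with an ASR-induced bigram
    print(cosine(tfidf(manual_resp, idf), tfidf(score_model, idf)))
    print(cosine(tfidf(asr_resp, idf), tfidf(score_model, idf)))        # barely changes: ('DT','DT') has low idf
```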
8.3. Extensions and future directions

In assessing the level of association between the set of features and the human scores for proficiency, the underlying assumption was that the overall proficiency score was also indicative of syntactic competence. In reality, the holistic manual scores for overall proficiency (used in this study and elsewhere) capture more aspects of speech than syntactic complexity alone. Thus, the use of a more analytic human score of syntactic complexity in place of the overall proficiency score may provide a more accurate picture of the quality of our features.

Future explorations in this direction will include the combination of n-gram models and interpolated language models with interpolation weights that are specific to a proficiency level, an approach that will alleviate the problem of data sparsity.
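As a rough illustration of this direction, the sketch below interpolates a score-level-specific POS bigram LM with a background LM using a level-specific weight. The probabilities, weights, and names are placeholders; in practice the weights would be estimated on held-out data rather than fixed by hand.

```python
# Minimal sketch: proficiency-level-specific interpolation of a score-level
# POS bigram LM with a general background LM. All values are placeholders.

def interpolated_prob(bigram, level_lm, background_lm, lam):
    """P(bigram) = lam * P_level(bigram) + (1 - lam) * P_background(bigram)."""
    return lam * level_lm.get(bigram, 0.0) + (1.0 - lam) * background_lm.get(bigram, 0.0)

if __name__ == "__main__":
    background_lm = {("DT", "NN"): 0.08, ("NN", "VBZ"): 0.05}     # hypothetical general LM
    level_lms = {
        2: {("DT", "NN"): 0.10, ("NN", "VBZ"): 0.03},             # hypothetical score-level LMs
        4: {("DT", "NN"): 0.06, ("NN", "VBZ"): 0.07},
    }
    lambdas = {2: 0.6, 4: 0.8}                                    # hypothetical level-specific weights
    for level in (2, 4):
        p = interpolated_prob(("NN", "VBZ"), level_lms[level], background_lm, lambdas[level])
        print(f"score level {level}: P(NN VBZ) = {p:.3f}")
```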
9. Conclusions

In this study, we proposed a set of features corresponding to two models, the VSM and the POS-LM, for measuring grammatical competence, with the requirement that they be amenable for use with automatically recognized spontaneous speech. The underlying assumption in the choice of features is that the level of acquired grammatical proficiency is signaled by the distribution of POS tags. We observed that each of these models has an associated best performing feature: the feature cos4 in the VSM and the feature lmmax in the POS-LM. The features were found to generalize well with respect to differences in item type across responses. When used alongside existing features in the state-of-the-art scoring model for oral proficiency assessment of spontaneous speech, the VSM-based features were seen to be important predictors of oral proficiency. Additionally, their inclusion in the scoring model showed improved agreement between machine and human scores compared to the state-of-the-art model.

We also observed that a set of prototypical features based on sentence- or clause-based units is ill-suited for automatic scoring on ASR output, which is in line with previous studies. In addition to a reasonable correlation with human-assigned scores, the key advantage of our proposed features is their relative immunity to ASR errors (compared to conventional measures of syntactic complexity), making them suitable for use in automatic spontaneous speech scoring systems.

Acknowledgements

The authors would like to thank Klaus Zechner, Keelan Evanini, Lei Chen, and Shasha Xie for their valuable comments, help with data preparation, and assistance with the experiments. In addition, we gratefully acknowledge the proof-reading assistance offered by Prama Pramod.

References

Attali, Y., Burstein, J., 2006. Automated essay scoring with e-rater® V.2. J. Technol. Learn. Assess. 4.
Balogh, J., Bernstein, J., Cheng, J., Townshend, B., 2007. Automated evaluation of reading accuracy: assessing machine scores. In: Proceedings of SLaTE, pp. 1–3.
Bardovi-Harlig, K., Bofman, T., 1989. Attainment of syntactic and morphological accuracy by advanced language learners. Stud. Second Lang. Acquis. 11, 17–34.
Bernstein, J., Cheng, J., Suzuki, M., 2010. Fluency and structural complexity as predictors of L2 oral proficiency. In: Proceedings of InterSpeech, pp. 1241–1244.
Bernstein, J., DeJong, J., Pisoni, D., Townshend, B., 2000. Two experiments in automated scoring of spoken language proficiency. In: Proceedings of InSTIL, pp. 57–61.
Chen, L., Yoon, S.Y., 2011. Detecting structural events for assessing non-native speech. In: Proceedings of the Workshop on Innovative Use of NLP for Building Educational Applications, pp. 38–45.
Chen, M., Zechner, K., 2011. Computing and evaluating syntactic complexity features for automated scoring of spontaneous non-native speech. In: Proceedings of ACL, pp. 722–731.
Chodorow, M., Leacock, C., 2000. An unsupervised method for detecting grammatical errors. In: Proceedings of NAACL, pp. 140–147.
Cucchiarini, C., Strik, H., Boves, L., 2000. Quantitative assessment of second language learners' fluency by means of automatic speech recognition technology. J. Acoust. Soc. Am. 107, 989–999.
Cucchiarini, C., Strik, H., Boves, L., 2002. Quantitative assessment of second language learners' fluency: comparisons between read and spontaneous speech. J. Acoust. Soc. Am. 111, 2862–2873.
Eskenazi, M., Kennedy, A., Ketchum, C., Olszewski, R., Pelton, G., 2007. The Native Accent™ pronunciation tutor: measuring success in the real world. In: Proceedings of SLaTE, pp. 124–127.
Feldman, S., Marin, M., Ostendorf, M., Gupta, M.R., 2009. Part-of-speech histograms for genre classification of text. In: Proceedings of ICASSP, pp. 4781–4784.
Foster, P., Skehan, P., 1996. The influence of planning and task type on second language performance. Stud. Second Lang. Acquis. 18, 299–324.
Foster, P., Tonkyn, A., Wigglesworth, G., 2000. Measuring spoken language: a unit for all reasons. Appl. Linguist. 21, 354–375.
Franco, H., Abrash, V., Precoda, K., Bratt, H., Rao, R., Butzberger, J., Rossier, R., Cesari, F., 2000. The SRI EduSpeak system: recognition and pronunciation scoring for language learning. In: Proceedings of InSTIL, pp. 123–128.
Franco, H., Neumeyer, L., Kim, Y., Ronen, O., 1997. Automatic pronunciation scoring for language instruction. In: Proceedings of ICASSP, pp. 1471–1474.
Grömping, U., 2006. Relative importance for linear regression in R: the package relaimpo. J. Stat. Softw. 17, 1–27.
Halleck, G.B., 1995. Assessing oral proficiency: a comparison of holistic and objective measures. Mod. Lang. J. 79, 223–234.
Hunt, K., 1970. Syntactic maturity in school children and adults. Monogr. Soc. Res. Child Develop.
Iwashita, N., 2010. Features of oral proficiency in task performance by EFL and JFL learners. In: Selected Proceedings of the Second Language Research Forum, pp. 32–47.
Iwashita, N., Brown, A., McNamara, T., O'Hagan, S., 2008. Assessed levels of second language speaking proficiency: how distinct? Appl. Linguist. 29, 24–49.
Klein, D., Manning, C.D., 2003. Accurate unlexicalized parsing. In: Proceedings of ACL, pp. 423–430.
Lu, X., 2010. Automatic analysis of syntactic complexity in second language writing. Int. J. Corpus Linguist. 15, 474–496.
Lu, X., 2012. L2 Syntactic Complexity Analyzer. <http://www.personal.psu.edu/xxl13/downloads/l2sca.html/> (retrieved 17.03.2012).
Manning, C.D., Raghavan, P., Schütze, H., 2008. Introduction to Information Retrieval. Cambridge University Press, New York, USA.
Marcus, M.P., Santorini, B., Marcinkiewicz, M.A., 1993. Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19, 313–330.
Marin, M., Feldman, S., Ostendorf, M., Gupta, M.R., 2009. Filtering web text to match target genres. In: Proceedings of ICASSP, pp. 3705–3708.
Neumeyer, L., Franco, H., Digalakis, V., Weintraub, M., 2000. Automatic scoring of pronunciation quality. Speech Commun., 88–93.
Ortega, L., 2003. Syntactic complexity measures and their relationship to L2 proficiency: a research synthesis of college-level L2 writing. Appl. Linguist. 24, 492–518.
Roark, B., Mitchell, M., Hosom, J.P., Hollingshead, K., Kaye, J., 2011. Spoken language derived measures for detecting mild cognitive impairment. IEEE Trans. Audio Speech Lang. Process. 19, 2081–2090.
Robinson, P., 2006. Task complexity and second language narrative discourse. Lang. Learn. 45, 99–140.
Stolcke, A., et al., 2002. SRILM – an extensible language modeling toolkit. In: Proceedings of INTERSPEECH.
Temple, L., 2000. Second language learner speech production. Stud. Linguist., 288–297.
Tetreault, J.R., Chodorow, M., 2008. The ups and downs of preposition error detection in ESL writing. In: Proceedings of COLING, pp. 865–872.
Witt, S., 1999. Use of Speech Recognition in Computer-Assisted Language Learning. Unpublished Dissertation, Cambridge University Engineering Department, Cambridge, UK.
Witt, S., Young, S., 1997. Performance measures for phone-level pronunciation teaching in CALL. In: Proceedings of STiLL, pp. 99–102.
Wolf-Quintero, K., Inagaki, S., Kim, H.Y., 1998. Second Language Development in Writing: Measures of Fluency, Accuracy, and Complexity. Technical Report 17. Second Language Teaching and Curriculum Center, The University of Hawai'i, Honolulu, HI.
Yoon, S.Y., Bhat, S., 2012. Assessment of ESL learners' syntactic competence based on similarity measures. In: Proceedings of EMNLP, pp. 600–608.
Yoon, S.Y., Higgins, D., 2011. Non-English response detection method for automated proficiency scoring system. In: Proceedings of the Workshop on Innovative Use of NLP for Building Educational Applications, pp. 161–169.
Zechner, K., Higgins, D., Xi, X., Williamson, D.M., 2009. Automatic scoring of non-native spontaneous speech in tests of spoken English. Speech Commun. 51, 883–895.
Zechner, K., Xi, X., Chen, L., 2011. Evaluating prosodic features for automated scoring of non-native read speech. In: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 461–466.
Zissman, M.A., 1996. Comparison of four approaches to automatic language identification of telephone speech. IEEE Trans. Speech Audio Process. 4.