Sign Transition Modeling and A Scalable Solution To Continuous Sign Language Recognition For Real-World Applications
We propose a new approach to modeling transition information between signs in continuous Sign Language
Recognition (SLR) and address some scalability issues in designing SLR systems. In contrast to Automatic
Speech Recognition (ASR) in which the transition between speech sounds is often brief and mainly addressed
by the coarticulation effect, the sign transition in continuous SLR is far from being clear and usually not easily
and exactly characterized. Leveraging upon hidden Markov modeling techniques from ASR, we propose a modeling framework for continuous SLR with the following major advantages: (i) the system
is easy to scale up to large-vocabulary SLR; (ii) modeling of signs as well as the transitions between signs
is robust even for noisy data collected in real-world SLR; and (iii) extensions to training, decoding, and
adaptation are directly applicable even with new deep learning algorithms. A pair of low-cost digital gloves
affordable for the deaf and hard of hearing community is used to collect training and testing
data for real-world SLR interaction applications. Evaluated on 1,024 testing sentences from five signers, a
word accuracy rate of 87.4% is achieved using a vocabulary of 510 words. The SLR speed is in real time,
requiring an average of 0.69s per sentence. The encouraging results indicate that it is feasible to develop
real-world SLR applications based on the proposed SLR framework.
CCS Concepts: Human-Centered Computing → Natural Language Interfaces; Human-Centered Computing → Accessibility Systems and Tools; Human-Centered Computing → Gestural Input
Additional Key Words and Phrases: Sign language recognition, transition modeling, speech recognition,
hidden Markov models
ACM Reference Format:
Kehuang Li, Zhengyu Zhou, and Chin-Hui Lee. 2016. Sign transition modeling and a scalable solution to
continuous sign language recognition for real-world applications. ACM Trans. Access. Comput. 8, 2, Article 7
(January 2016), 23 pages.
DOI: http://dx.doi.org/10.1145/2850421
1. INTRODUCTION
Sign language [Ong and Ranganath 2005] is a form of natural language commonly
used for communication in the deaf and hard of hearing community. While spoken
languages use voices to convey meanings, sign languages utilize hand shape, position,
orientation and movement, sometimes with the aid of facial expression and body move-
ment, to express thoughts [Cherry 1968]. Like spoken languages, sign languages vary
from region to region. Automatic Sign Language Recognition (SLR) is a research topic
This research was supported and funded by Bosch. The majority of this work was conducted within Bosch,
while some evaluation experiments were performed in the School of Electrical and Computer Engineering at
Georgia Institute of Technology. The first author joined this research during his internship at Bosch Research
and Technology Center North America.
Authors’ addresses: Z. Zhou, Bosch Research and Technology Center North America, 4005 Miranda Avenue,
#200, Palo Alto, CA 94304; email: [email protected]; K. Li and C.-H. Lee, School of Electrical and
Computer Engineering, Georgia Institute of Technology, 777 Atlantic Drive NW, Atlanta, GA 30332-0250;
emails: {kehle, chl}@gatech.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. Copyrights for third-party components of this
work must be honored. For all other uses, contact the Owner/Author.
2016 Copyright held by Owner/Author
1936-7228/2016/01-ART7
DOI: http://dx.doi.org/10.1145/2850421
attracting many researchers’ attention in recent years [Starner et al. 1998; Fang et al.
2007; Forster et al. 2013]. Advances in automatic SLR technologies may lead to various
Human-Computer Interaction (HCI) [Shneiderman 1986; Laurel and Mountford 1990]
systems in order to facilitate communication between people with spoken and sign
language capabilities.
SLR is similar to Automatic Speech Recognition (ASR) in the sense that they both
attempt to automatically transfer sequences of input signals into sequences of words.
For ASR, Hidden Markov Model (HMM) [Rabiner 1989] has been one of the most
prevalent statistical modeling techniques since the mid-1970s [Baker 1975]. Intense
research has been conducted in various aspects, including feature extraction [Davis and
Mermelstein 1980], acoustic and language modeling [Lee et al. 1990; Rosenfeld 2000],
decoding [Ney and Ortmanns 2000], discriminative training [Juang et al. 1997], adap-
tive learning [Lee and Huo 2000], and postprocessing. The mature HMM-based ASR
advances have supported the development of many commercial products, including
speaker-independent large-vocabulary continuous speech recognition systems. Since
2012, deep learning [Hinton et al. 2012; Vinyals et al. 2012] has been the new focus
of ASR. Its flexible learning capacity is often incorporated into the HMM-based ASR
framework.
Compared with ASR, the SLR technology is still in the early stage of development.
The majority of the research activities on SLR are on isolated sign/word recognition [Oz
and Leu 2011; Pitsikalis et al. 2011; Cooper et al. 2012; Geng et al. 2014]. The research
efforts on continuous SLR started in the late 1990s and have so far been limited.
Such efforts mostly focused on small and medium-size vocabularies [Starner et al.
1998; Dreuw et al. 2007; Zafrulla et al. 2011]. For example, a 99-sign vocabulary was
adopted in Yang et al. [2010] and the best accuracy of 73% was achieved on camera data
collected from three users. The researchers from RWTH reported a 22.1% Word Error
Rate (WER) using a vocabulary of 455 words on data from 25 signers in a restricted
laboratory camera setting [Forster et al. 2013]. They also collected a set of “real” data
from TV weather broadcast videos, on which a WER of 49.2% was reported for six
signers with a 349-word vocabulary [Forster et al. 2014].
For large-vocabulary continuous SLR, the only reported experiments were conducted
on a set of Chinese sign language data collected by expensive wearable devices, includ-
ing two CyberGloves and three position trackers [Yao et al. 2006; Fang et al. 2007]. The
vocabularies contain up to 5,113 signs. The best reported accuracies were 91.9% for
two signers [Fang et al. 2007], and at about 90% for six signers [Chen et al. 2003; Fang
et al. 2004; Gao et al. 2004]. One key reason for the high accuracies may be attributed
to the use of the high-end CyberGloves that can accurately capture hand angles. However, this success on large-vocabulary SLR has not been repeated by other research groups, and even the recent work along this line has used smaller vocabularies of 208 to 370 words with CyberGloves or Kinect [Jiang et al. 2009; Zhou et al. 2010; Chai et al. 2013], with no accuracies reported for continuous SLR.
The major research trend of continuous SLR is to move from ideal laboratory set-
tings (e.g., camera recordings requiring users to wear black clothes in front of a blue
background, CyberGloves) to real-world data (e.g., TV video, Kinect data), and from
small to medium vocabularies [Starner et al. 1998; Dreuw et al. 2007; Zafrulla et al.
2011; Chai et al. 2013; Forster et al. 2014]. In this work, we collect real-world data
using low-cost digital gloves. Compared with cameras and Kinect, digital gloves can
be used both indoors and outdoors, and will not be affected by common camera-related
problems such as illumination. Compared with high-end gloves like CyberGloves, the
low-cost gloves provide less accurate and noisier signals. However, they are much more
affordable for deaf and hard of hearing people, making real-world applications possi-
ble. Here by “real-world applications” we mean real-time applications using affordable
devices for multiple environments (e.g., indoor and outdoor). To develop such interac-
tive systems between sign language users and spoken language users, we propose a
scalable solution for continuous SLR based on the mature HMM framework of ASR
with the special characteristics of sign languages carefully considered.
Continuous SLR has leveraged upon the extensive set of technologies in ASR
[Rabiner and Juang 1993; Lee et al. 1996] in HMM training and decoding [Starner
et al. 1998; Dreuw et al. 2007]. However, as illustrated in a recent review article as
well as in other papers, there are doubts in the SLR research community about whether ASR techniques, particularly the HMM-based ASR framework, are scalable for continuous SLR tasks [Vogler and Metaxas 1999; Yang et al. 2010; Cooper et al. 2011]. This
is mainly due to two challenges related to SLR. First, the basic modeling units (i.e.,
phonemes) for signs are not well defined [Pitsikalis et al. 2011] and can be defined
in a way that leads to parallel recognition for multiple streams (e.g., left hand, right
hand) [Vogler and Metaxas 1999]. Second, the transition signals between signs lead
to a challenging coarticulation problem that is different from that of ASR [Lee 1988;
Vogler and Metaxas 1997]. To tackle these problems, variations of the conventional
HMM frameworks as well as alternative recognition structure (e.g., Nested Dynamic
Programming) have been proposed [Vogler and Metaxas 1999; Fang et al. 2007; Yang
et al. 2010]. However, system scalability and decoding efficiency are difficult topics to
address if extending such SLR systems to larger vocabularies.
We believe that the SLR-specific challenges, including transition modeling, can be
addressed within the mature HMM framework of ASR in a scalable way. In this study,
we propose such a SLR solution that can easily be scaled up to larger vocabularies, be
applied to real-world data, and be extended to other developed ASR techniques such
as adaptation and deep learning. This solution [Zhou et al. 2015] first defines/selects
sign phonemes (i.e., the basic signs that are consistent across various single/multisign
words) in the same way as in Wang et al. [2002], and merges the features from both
hands into one single feature vector for HMM modeling. Transition signals between
signs are then modeled using one universal HMM, which implicitly performs
classification for transition signals through Gaussian splitting. This is similar to the
single universal filler known as a garbage collection model in keyword spotting [Wilpon
et al. 1990] to absorb nonkeyword extraneous speech. The sign phoneme HMMs and
the transition HMM are merged into a recognition framework that is similar to that
of conventional ASR. With this SLR framework, while the transition problem is well
handled, mature ASR decoding methods can be used to support real-time recogni-
tion even for large vocabularies. Other developed ASR techniques, ranging from mod-
eling to postprocessing, become directly applicable, and may be applied to improve
system performance in the future. We also propose a new data collection approach
that is scalable toward large-vocabulary SLR and suitable for processing noisy real-world
data.
We implemented a prototype of the proposed solution on medium-vocabulary contin-
uous Chinese SLR. We started with Chinese sign language because China has the largest population of deaf and hard of hearing people in the world, while the societal support
for this community is much less than that available in developed countries like the
USA. The experimental results are encouraging. Relatively high recognition accuracy
was achieved on real-world data, that is, noisy signals from low-cost gloves, collected
from six signers.
The rest of the article is organized as follows. We first discuss the related work in
Section 2. The proposed framework to model sign transitions and to address some
scalability issues for continuous SLR is presented in Section 3. Experiments and result
analyses are described in Sections 4 and 5, respectively. Conclusion and future research
directions are given in Section 6.
2. RELATED WORK
The conventional HMM-based recognition structure that has been prevalently used
in ASR was thought to be not scalable for continuous SLR in some previous studies,
and therefore alternative architectures were adopted [Yang et al. 2007; Yang et al.
2010; Cooper et al. 2011; Kong and Ranganath 2014]. For example, Liang and Ouhy-
oung proposed a posture-based SLR approach that involves end-point detection and
posture analysis [Liang and Ouhyoung 1998]. Vogler and Metaxas proposed parallel
HMMs to process the left and right hands separately and combine probabilities from
two streams at the word end nodes during decoding [Vogler and Metaxas 1999]. Yang
et al. [2010] used a new recognition structure of nested dynamic programming instead
of the HMM modeling to handle transition signals (i.e., movement epenthesis) and
hand segmentation, and compared this method with conditional random fields. These
approaches were evaluated on relatively small vocabularies, and the scalability to
large-vocabulary SLR remains a difficult research topic for these alternative recogni-
tion structures.
For the previous studies that follow the conventional HMM recognition structures for
continuous SLR, the transition signals between signs and the collection of training data
are handled in various ways. A popular choice is to ignore the transition signals and
train the word models with various features on the training data of whole sentences
[Starner et al. 1998; Dreuw et al. 2007; Zafrulla et al. 2011; Forster et al. 2013]. One
major problem of such an SLR strategy is that the word models trained in this way
actually include the neighboring transition signals before and after the focused words. This causes modeling problems for uncommon words that are often limited in
the training data, and as a result the recognition accuracies might be seriously affected
if the transition context of a word in testing sentences is unseen or rare in the training
data. A similar case in ASR is the robustness problem to unseen conditions. If the
training data is required to well cover the large variation in transition for each word,
the needed number of training sentences will increase to a prohibitive scale. Note that
collecting sign language data is more difficult than collecting speech data due to the
relatively small population of the users.
There are also some researchers using a threshold model to distinguish between
transition parts and signs/gestures in continuous signals [Lee and Kim 1999; Kelly
et al. 2009]. The main idea is to combine the states of the sign/gesture HMMs into
a threshold model of an ergodic structure, and then to identify transitions as well as
signs/gestures by comparing the likelihoods calculated from the threshold model and
sign/gesture models. The problem is that this method is only suitable for SLR tasks
involving a small number (e.g., eight) of signs. For middle/large-vocabulary SLR, the
method is impractical because the threshold model generated will be too large.
Vogler and Metaxas addressed the transition problem in HMM-based recognition
using two different approaches [Vogler and Metaxas 1997]. The first method is to con-
duct context-dependent sign modeling, that is, to train bisign models in a similar way
as biphones in ASR [Lee 1988]. This approach has a scalability problem since the
number of bisign models can be too large to train. The second is to divide transition
signals into 64 classes based on k-means clustering, and then train the correspond-
ing transition models. The recognition network has to be redesigned to connect the
transition models with words appropriately based on the starting and ending locations
of the hands. Note that this could be challenging especially for large-vocabulary SLR
with trigram language models [Rosenfeld 2000]. Experimental results show that the
second approach works better than the first one (with a word accuracy of 95.8% vs.
91.7% for a vocabulary of 53 signs), demonstrating the benefit of transition modeling.
In that work, sentences are used as the training data. Although by explicitly modeling
transition signals, signs can be modeled without considering the transition context,
the number of training sentences needed to robustly model every sign can still be high
for large-vocabulary SLR since uncommon signs occur less frequently than common
signs.
Fang et al. [2007] proposed another way to model transition signals within the HMM
recognition structure. The transition signals were clustered into multiple classes using
a temporal clustering algorithm. The transition models were then trained jointly with
the sign models using a bootstrap training method. The transition models and the
sign models are all viewed as candidates during the decoding procedure. To handle
the increased complexity in decoding, a pruning algorithm was also proposed. The
authors adopted a vocabulary of 5,113 signs and designed training data that was
composed of two parts: isolated signs and continuous sentences. During training, they
first train initial sign models using the isolated sign samples. Then the sign models are
retrained along with the transition models using the continuous sentences. The initial
sign models will often not match the sentences well, because the collection of isolated
sign samples will certainly contain transition signals before and after the target signs
and these transitions are different from the cases in sentences. Retraining of the sign
models is in the spirit of adapting the sign models to match the conditions in continuous
sign sentences. However, similar to the cases described earlier, it is difficult to collect
enough data to adapt all the signs to sentences. In their experiments, they used 750
distinct sentences in training and testing, and each sentence contains 6.6 words on average. That means these sentences can cover at most 4,950 distinct signs out of the vocabulary of 5,113 signs. The number of distinct signs that actually appear in these sentences will be much smaller because common signs appear repeatedly across sentences.
In other words, only a portion of the signs are adapted to the sentences. If testing
sentences involve signs that are not covered by these training sentences, the accuracy
will be heavily reduced.
In this article, we define a single transition model, instead of multiple ones, to model the
transition signals within the conventional HMM recognition structure, and the tran-
sition model has the same left-to-right structure as sign models. When compared with
previous studies that used clustering methods [Vogler and Metaxas 1997; Fang et al.
2007], we let the transition HMM handle the Gaussian split classification implicitly.
The single transition model is trained and used in the same way as the sign models.
Thus it is unnecessary to modify the training and decoding procedures for the use of
the transition model. Adopting a single transition model is also more advantageous
in recognition efficiency than using multiple transition models, as will be further dis-
cussed in Section 5.1. With the proposed framework, the conventional training and
decoding techniques developed for large-vocabulary ASR are directly usable, leading
to high system scalability.
Different from previous studies that completely or partially use sentences as training
data, we only use single-sign and multisign word samples as training data. Multisign
words are also included to capture the intrasentence variability of signs across different
words, as well as to provide samples of transition signals. We manually segmented the
starting and ending points of signs as well as the transition parts in all the word
samples. Note that while it is feasible to do such labeling for word samples, it is very
difficult to do so for continuous sign sentences. Although manual labeling may not be
perfect, it can greatly benefit modeling especially for highly variable real-world data,
for which the effectiveness of forced-alignment-based reestimation of HMM may be
greatly reduced [Forster et al. 2013]. The proposed data collection methodology can easily be extended to large-vocabulary settings, since the number of word samples to collect grows only linearly with the vocabulary size; it is therefore a scalable solution.
Compared with previous approaches that explicitly use multiple transition models [Vogler and Metaxas 1997; Fang et al. 2007], the proposed transition modeling method handles the classification of transitions through a convenient training procedure, and will lead to higher recognition efficiency because in this case it is unnecessary to distinguish between competing transition models during decoding.
With the sign phoneme models and the universal transition model, the word model
can then be defined as the concatenation of the component sign models with the tran-
sition model inserted between the sign models. We can thus combine the word models
with the language model into a recognition framework in a similar way to the con-
ventional ASR framework [Lee 1988; Wilpon et al. 1990]. For ASR, as illustrated in
Figure 2, a silence model (indicated by “sil”) is placed before and after a sentence and a
short pause model (indicated by “sp”) is placed between every two sequential words. In
this SLR study, as shown in Figure 3, a start model (indicated by “st”) is placed before
a sentence and an end model (indicated by “end”) is used after a sentence, while a tran-
sition model (indicated by “tr” to be described in more detail later in Sections 4 and 5)
is inserted between the single-sign and multisign word models. Due to the similar
structure, the conventional ASR decoding algorithms are directly applicable to contin-
uous SLR, leading to real-time recognition speed even for large-vocabulary SLR tasks.
Other well-developed ASR techniques, such as discriminative training [Juang
et al. 1997], adaptive learning [Lee and Huo 2000], error processing [Zhou 2009], and
deep learning [Hinton et al. 2012], can easily be adopted to benefit continuous SLR.
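To make the composition concrete, the following minimal sketch (in Python; the model names and transition values are hypothetical, not taken from the system described here) shows how a word-level left-to-right HMM topology could be assembled by chaining the component sign phoneme HMMs with the universal transition HMM inserted between them.

```python
import numpy as np

def concat_left_to_right_hmms(hmms):
    """Chain left-to-right HMMs (each given by its state transition matrix,
    where the residual mass of the last row is the exit probability) into one
    composite left-to-right topology: the exit of model k feeds the entry of
    model k+1."""
    n = sum(m.shape[0] for m in hmms)
    T = np.zeros((n, n))
    offset = 0
    for k, m in enumerate(hmms):
        s = m.shape[0]
        T[offset:offset + s, offset:offset + s] = m
        exit_prob = 1.0 - m[-1].sum()          # mass leaving the last state
        if k + 1 < len(hmms):
            T[offset + s - 1, offset + s] = exit_prob
        offset += s
    return T          # the final exit mass would feed the word-end node of the network

# Hypothetical word = sign1 + universal transition + sign2, each a 3-state HMM.
sign = np.array([[0.6, 0.4, 0.0],
                 [0.0, 0.6, 0.4],
                 [0.0, 0.0, 0.6]])             # 0.4 exit mass from the last state
trans = sign.copy()
word_transmat = concat_left_to_right_hmms([sign, trans, sign])
print(word_transmat.shape)                     # (9, 9) composite word topology
```

In an actual decoder, such word-level topologies are linked through the start, end, and transition models shown in Figure 3 rather than built explicitly per word.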
We consider the difference between sign and spoken languages, and adopt a new
data collection method for model training. For ASR, the phone and word models have to
be trained on continuous speech utterances because the pronunciations of phones and
words highly depend on the context. However, for SLR, signs are relatively consistent
in different contexts, while the transition parts between signs are highly variable.
Due to these characteristics, we use word samples instead of sentence samples as
training data. A certain number of samples are collected for each of the sign words
in the vocabulary, including both single-sign and multisign words. To make full use
of the captured data, each word sample is then labeled into several segments, that is,
the component signs, preceded by the start part (i.e., from the beginning to the start
of the first sign), followed by the end part (i.e., from the finishing point of the last
sign to the end), and the transition parts, if existing. Each sign phoneme model can
thus be trained on all the segments of the focused signs in the training data, while
the transition, start, and end models can be trained on the corresponding segments,
respectively. The main advantages of this data collection methodology are as follows:
(i) when the vocabulary size increases, the need in training data increases only
linearly; (ii) every word in the vocabulary, including uncommon words, can be robustly
trained; and (iii) it is feasible to label word samples to provide reliable segmentation
information for training, which is especially valuable for noisy real-world data.
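As a rough illustration of this data organization (the segment labels and data layout below are assumptions, not the article's actual file format), labeled word samples can be pooled per model as follows, so that each sign phoneme model sees all of its segments while the universal transition, start, and end models see theirs.

```python
from collections import defaultdict

# Each labeled word sample is assumed to be a list of (label, frames) segments,
# e.g. [("start", F0), ("SIGN_A", F1), ("tr", F2), ("SIGN_B", F3), ("end", F4)],
# where frames is a (num_frames x feature_dim) array. Labels are hypothetical.
def pool_training_segments(word_samples):
    pools = defaultdict(list)      # model name -> list of feature segments
    for segments in word_samples:
        for label, frames in segments:
            pools[label].append(frames)
    return pools

# pools["SIGN_A"] then holds every segment of that sign phoneme across all word
# samples, and pools["tr"] holds every transition segment used to train the
# single universal transition model.
```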
O_{s,i} = q_0 Q_{s,i}^{-1} p_s Q_{s,i} q_0^{-1}, where q_0 is the rotation from the initial sensor coordinate to the global coordinate according to the sensor coordinate (Figure 5).
There are two different kinds of feature involved in our experiments: one is related to
the hand shape (Finger Related, or FR), and the other is related to the hand direction
(Global Angle of hand, or GA). The feature we used for describing the hand shape is
the cosine distance between the directions of adjacent finger bones, A_{s,r} = O_{s,i} · O_{r,i}.
For instance, if s = “2c” and r = “2b,” as in Figure 4, then As,r is about the bending of
the proximal interphalangeal joint (the second far end joint of the index finger). The
FR features are calculated per finger. On the other hand, the feature for the direction
of the hand, more specifically the direction of the palm, can be estimated by the mean
of the directions of sensors "2a," "3a," "4a," and "5a." To fully describe the direction of the hand in space, two perpendicular directions, one along the palm plane and the other along the normal of the palm plane, were used. The acceleration of the finger bones can likewise be obtained by replacing p_s with the quaternion representation of a_{s,i}; the acceleration and velocity of a hand can then be estimated.
4. EXPERIMENTS
Next, we develop an initial prototype of a medium-vocabulary continuous SLR system
using the proposed solution. Related system issues in data organization, feature usage,
training and testing procedures, and the primary experimental results are presented
in the following four subsections, respectively.
the hardware cost of two refined gloves will be about 150 US dollars, much more af-
fordable than the CyberGlove, which costs more than 17,000 US dollars per glove and
has been widely used for SLR in university laboratories [Wang et al. 2006; Fang et al.
2007; Mohandes 2010]. When a user signs a word or a sentence wearing the gloves,
the gloves send out 100 frames of sensor signals per second for both hands simultane-
ously. Sign language features can thus be extracted based on these sequential sensor
signals.
We invited six signers to participate in our data collection. Five of them were from the
deaf and hard of hearing community, with sign language as their first language. The remaining signer is a teacher involved in educating deaf and hard of hearing students, with sign language as a second language. Each signer was asked to wear
the digital gloves, and signed all words in the vocabulary three times and the sentences
twice. To reduce the modeling complexity, in this initial study, the signers were also
required to stand during the data collection process and put both hands down before
and after signing. Most data were successfully collected except for one deaf or hard
of hearing subject, for whom only one round of vocabulary words was collected due to
an electronic device failure. There were also some random electronic failures during
sentence collection. We organized the collected samples into training, development,
and testing subsets as illustrated in Table I.
We manually segmented all word samples to identify the segment boundary points
corresponding to sign phonemes, transitions between signs, and the start and end
parts. No sentence samples were segmented.
For system construction, we first developed the task grammars to build the de-
coding network. The grammars contain 3,599,139 sign language sentences, including
both text-based and natural sign language sentences that are supported by the set
of 510 vocabulary words. The 215 distinct testing sentences are all covered by the
grammar. Note that every sentence in the grammars is of equal probability and is
recognizable.
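As a toy illustration of how a compact grammar covers a large sentence set (the slots and words below are hypothetical and unrelated to the actual 510-word vocabulary), a slot-based grammar expands multiplicatively, and every expanded sentence is treated as equally probable during decoding.

```python
from itertools import product

# Hypothetical slot-based grammar: each sentence is SUBJECT VERB OBJECT.
grammar = {
    "SUBJECT": ["I", "YOU", "DOCTOR"],
    "VERB":    ["WANT", "NEED"],
    "OBJECT":  ["WATER", "HELP", "MEDICINE", "REST"],
}
sentences = [" ".join(words) for words in product(*grammar.values())]
print(len(sentences))   # 3 * 2 * 4 = 24 sentences from only 9 words
```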
We trained all the models on the word samples, each containing a start segment (i.e.,
signals before the first sign phoneme begins), one or more sign phoneme segments, zero
or more transition segments, and one ending segment (i.e., signals after the last sign
phoneme ends). The sign phoneme models were trained on the segments of the corre-
sponding signs. It is feasible to directly train the corresponding transition, start, and
end models. Alternative training methods may also be designed based on the detailed
properties of the start and end segments, which may vary for collection at different cir-
cumstances. In this study, since we require the signers to put both hands down before
and after signing, every start segment contains static signals followed by movement
signals, while every end segment contains movement signals that may or may not be
followed by static signals. The movement signals can be viewed as a type of transition.
We thus further split each start segment into a static segment and a transition segment, and split each end segment into a transition segment and a static segment, if the latter exists. The manual segmentation process is illustrated in Figure 7, where the frames
of a word sample are segmented into static, transition, and sign phoneme segments.
Although only GA features are shown in Figure 7, FR features are also consulted
in the labeling process. We train the transition model on all the transition segments,
including the ones in the start and end segments, and train an additional static model
on the static segments. The start model was built by connecting the static model with
the transition model. The end model is composed of two alternatives, one connecting
the transition model with the static model, and the other only including the transi-
tion model. All models were trained in the same way as ASR phoneme modeling. The
models are HMMs with diagonal covariance Gaussian mixtures, and are reestimated
iteratively after initialization as well as after splitting the Gaussian mixtures.
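A minimal sketch of this training step is given below using the hmmlearn library (an assumption; the article does not name its training toolkit), fitting one 3-state left-to-right HMM with diagonal-covariance Gaussian mixtures to the pooled segments of a single model.

```python
import numpy as np
from hmmlearn import hmm

def train_left_to_right_gmmhmm(segments, n_states=3, n_mix=2, n_iter=20):
    """Train one left-to-right HMM with diagonal-covariance Gaussian mixtures
    on a list of (num_frames x feature_dim) segments. Illustrative only: the
    article reestimates iteratively after splitting mixtures, which the
    mixture count n_mix stands in for here."""
    X = np.vstack(segments)
    lengths = [len(s) for s in segments]
    model = hmm.GMMHMM(n_components=n_states, n_mix=n_mix,
                       covariance_type="diag", n_iter=n_iter,
                       init_params="mcw", params="tmcw")
    # Fix a left-to-right topology; zero transition entries stay zero under EM.
    model.startprob_ = np.array([1.0, 0.0, 0.0])
    model.transmat_ = np.array([[0.5, 0.5, 0.0],
                                [0.0, 0.5, 0.5],
                                [0.0, 0.0, 1.0]])
    model.fit(X, lengths)
    return model
```

Each sign phoneme model, the transition model, and the static model would be trained independently on its own pool of segments in this fashion.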
We used the tuning data listed in Table I to determine the following set of HMM
parameters: the number of states and the number of Gaussian mixtures for each HMM
state, as well as the required number of iterations of reestimation. With the optimal
configurations observed (e.g., using three-state HMMs for the sign phoneme models and for
the transition model), we applied the resulting SLR system on the testing data. The
experimental results are reported next.
the single HMM. Next, we compare the universal transition modeling (referred to
as UniversalTranModel_2G) with other approaches, such as no explicit modeling of
transition signals and utilizing multiple transition models, two popular ways reported
in previous HMM-based SLR studies [Starner et al. 1998; Fang et al. 2007; Forster
et al. 2013]. The comparison results are listed in the following.
In Table III, NoTranModel and NoTranModel_ExtendedSigns refer to two meth-
ods that conduct recognition without any transition models. The first approach sim-
ply deletes all the universal transition models inserted within/between word models
in the recognition framework and then applies the modified recognizer to the test
data. The recognition accuracy obtained is as low as 13.9%, clearly showing that using
the trained sign models alone does not cover the transition regions well. The second
approach, NoTranModel_ExtendedSigns, includes neighboring transition context in sign modeling. It extends each training segment of a sign phoneme by including the
first half of the subsequent transition segment as well as the second half of the pre-
vious transition segment. The sign phoneme HMMs are then retuned and retrained
on the corresponding tuning and training data, respectively. The sign phoneme mod-
els trained in this way are applied in the same recognition framework as in the first
approach. The second approach achieves a recognition accuracy of 79.2%, much better
than the performance of the first approach, but still 8.2% lower than the accuracy of the
proposed SLR solution. One problem of including transition signals in sign modeling
is that collecting training data with various contexts for each sign could be difficult
especially for large-vocabulary continuous SLR.
UniversalTranModel_1G refers to the approach of using a universal transition model
but involving no state Gaussian splitting in training this model. All three states of the
transition model contain one Gaussian only, and the transition signals are implicitly in-
cluded in one class. Using this simplified universal transition model, the word accuracy
achieved is surprisingly high, at about 82.5%. This illustrates the effectiveness of the proposed SLR structure, and indicates that the sign phonemes are well modeled, able to support the recognition task even when the transition model is relatively simple.
MultiTranModels_V&M refers to the approach of performing SLR with multiple ex-
plicit transition models, whose training procedure and usage in decoding are adapted
from the method proposed by Vogler and Metaxas [1997]. Follow-
ing this previous work, we first clustered the starting and ending points of all signs in
training data using k-means clustering with the least-squares distance criterion. Four
distinct clusters were observed. The combination of different starting/ending classes
thus leads to 16 explicit transition models in total. We then trained the 16 transition
models on the corresponding transition segments in the same way as the training of the
sign models. With multiple transition models, the simplest way to apply these models
is to replace the original universal transition model within/between word models in
the search network with a confusion set of the multiple transition models. However,
with 16 transition models, the computational load of decoding with such an enlarged
search network is overwhelming, taking minutes to decode one sentence. In that work
[Vogler and Metaxas 1997], the authors attempted to solve the efficiency problem by
constraining the recognition network to use only one transition model between any two
signs. However, we noticed that on our glove-based data, multiple types of transition
segments are often observed between two signs, or in other words, multiple transition
models are often needed for the connection. There are three types of transition seg-
ments on average between two signs on the training samples of multisign words. We
constrain the search network so that, within every multisign word, two signs can only be connected by those transition models whose corresponding transition segments were observed in the training data. For the transition between single/multisign words, since we use word samples as
the training data, no data is available to estimate suitable transition models between
two words. On our tuning data of sentences, we observed that for each word, its starting/
ending positions typically have higher variance in sentences than in word samples and
the variance of the ending position is larger than that of the starting position in gen-
eral. In this study, we assume the starting points of each word in sentences are always
consistent with those in training word samples, and further constrain the transition set
between every two words in the recognition network by only allowing those transition
models ending at the starting classes of the next word that are observed in the training
data. With such a constrained recognition network, the decoding of a sentence can be
done in 13.4s on average, and a 0.4% absolute increase in accuracy can be achieved.
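A sketch of the clustering step underlying MultiTranModels_V&M is given below (the feature choice and the use of scikit-learn are assumptions): the starting and ending frames of all training sign segments are clustered with k-means under the least-squares criterion, and each transition is then indexed by its (ending class, starting class) pair, giving the 16 combinations used above when four clusters are found.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_transition_classes(sign_segments, n_clusters=4):
    """Cluster the first and last frames of all sign segments; with four
    clusters, the (end class, start class) pairs yield 16 transition classes."""
    boundary_frames = np.vstack([s[0] for s in sign_segments] +
                                [s[-1] for s in sign_segments])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    km.fit(boundary_frames)
    return km

def transition_class(km, prev_sign_seg, next_sign_seg):
    """Index the transition between two signs by the ending class of the
    previous sign and the starting class of the next sign."""
    end_cls = int(km.predict(prev_sign_seg[-1][None, :])[0])
    start_cls = int(km.predict(next_sign_seg[0][None, :])[0])
    return end_cls, start_cls
```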
To increase the decoding efficiency of MultiTranModels_V&M to an acceptable level
for real applications, we further conducted pruning, that is, limiting the maximum number of active paths allowed in decoding to 5,000. This approach is referred to
as MultiTranModels_V&M_Prune. As illustrated in Table III, although the efficiency
issue is solved by pruning, the recognition accuracy decreases to 85.8%, worse than the
performance of the proposed solution UniversalTranModel_2G.
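This pruning step can be sketched as keeping only the best-scoring hypotheses at each frame (the hypothesis representation below is hypothetical, not the decoder's actual data structure).

```python
import heapq

def prune_active_paths(hypotheses, max_active=5000):
    """Keep only the top-scoring hypotheses at the current frame. Each
    hypothesis is assumed to be a (log_score, state_id, backpointer) tuple;
    this mirrors limiting the number of active paths in decoding."""
    if len(hypotheses) <= max_active:
        return hypotheses
    return heapq.nlargest(max_active, hypotheses, key=lambda h: h[0])
```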
The comparison results show that explicitly using multiple transition models may bring higher recognition accuracy, but at a cost of increased complexity in decoding. To apply the approach in real-time applications, strategies have to be designed to constrain the recognition network, which can require substantial effort, and/or to conduct pruning. However, when applying such strategies, the possible gain
in accuracy may vanish quickly. On the other hand, our proposed simple solution of
a universal transition model may provide comparable accuracy at real-time speed, while
mature decoding systems (e.g., HTK and Kaldi) are directly applicable with no need to
specially constrain the recognition networks.
In this work, we did not compare with the context-dependent transition modeling
method mentioned in Vogler and Metaxas [1997] because we use word samples as
training data and thus are unable to train bisign models; nor with the transition modeling approach proposed in Fang et al. [2007], because our training transition segments are insufficient to train more than 500 transition models as in that work; nor with the threshold model solution for transition signals [Kelly et al. 2009], because combining the states of all sign models would lead to one threshold model of impractically large size; nor with the methods whose recognition architectures differ from HMM [Yang et al. 2007, 2010; Kong and Ranganath 2014], for which the efficiency for
middle/large-vocabulary SLR may remain an open issue.
5.2. Discussions on Scalability
With the proposed SLR solution, the system scales well in data collection, language modeling, and decoding efficiency. When extending the existing vocabulary to a
larger size, we only need to collect training word samples for the new words, and enlarge
Table IV. Recognition Performance with Different Sizes of Vocabularies and Grammars

Number of Words in Vocabulary | Number of Sentences Covered by Grammar | Word Accuracy (%) | Average Decoding Time per Sentence (second)
510 | 3,599,139 | 87.4 | 0.69
86 | 607 | 96.2 | 0.07
the grammar to include the new words. Since the tuning sentences are only used to
adjust the system configurations, it is not necessary to increase the number of tuning
sentences when adopting larger vocabularies in continuous SLR. Regarding efficiency,
since the system adopts the conventional HMM-based ASR technologies, scaling up the
proposed SLR solution would not lead to an exponential explosion in computational cost.
For instance, in developing this work, we started with a vocabulary of 86 words and
developed grammars to use this small vocabulary to cover basic conversations under
one scenario, at an accuracy of 96.2% and a decoding speed of 0.07s per sentence. We later extended
the vocabulary to 510 words to cover multiple scenarios, as presented in Section 4. The
grammar was also enlarged correspondingly. The performances for the two SLR tasks
with different vocabulary sizes are shown in Table IV.
We can see that when extending the vocabulary size from 86 to 510, the decoding
search space is significantly enlarged, from including only 607 distinct sentences to
supporting more than three million different sentences. The decoding complexity is
thus greatly increased. However, the reduction in word accuracy as well as in recog-
nition efficiency is relatively limited. The decoding speed remains real time. These
observations demonstrate the scalability of the proposed SLR solution.
When SLR is extended from medium-vocabulary to large-vocabulary tasks, a major
difficulty lies in language modeling. For large-vocabulary continuous SLR, developing
deterministic grammars, such as BNF grammars, can be infeasible and statistical n-
gram modeling [Rosenfeld 2000] will be needed. However, there is no sign language
corpus large enough for large-vocabulary n-gram modeling yet. When such corpora
become available, the n-gram models can be trained and utilized in SLR directly using
the mature n-gram-related techniques developed for ASR. The recognition efficiency of
the resulting SLR system is still expected to be real time. Note that the conventional
ASR systems can support real-time decoding for vocabulary sizes of more than 60,000
words.
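When such corpora exist, a bigram model could be estimated along the following lines (a minimal sketch with add-one smoothing over hypothetical gloss sentences, not a prescription for the eventual system).

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Estimate add-one-smoothed bigram probabilities from gloss sentences
    (lists of word tokens); <s> and </s> mark sentence boundaries."""
    unigrams, bigrams = Counter(), Counter()
    vocab = set()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        vocab.update(tokens)
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    V = len(vocab)
    def prob(prev, word):
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)
    return prob

# Hypothetical usage:
# p = train_bigram_lm([["I", "WANT", "WATER"], ["I", "NEED", "HELP"]])
# p("I", "WANT")
```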
5.3. Information Modeling for the Nondominant Hand
We also notice that a selective inclusion of the features from the nondominant hand
may be necessary for SLR. For sign languages, the usages of the dominant hand and
the nondominant hand are not equally important. For example, most frequently used
signs typically only involve the dominant hand. We would like to investigate whether
the nondominant hand should adopt a different set of features for certain cases, for
example, medium-vocabulary SLR. Since all signers in our data collection are right-
handed, for simplicity in this study, we used the right and left hands to represent the
dominant and nondominant hands, respectively.
We conducted our analysis by varying the features related to the left hand in the
feature vector. The comparison results of using different types of feature vectors are
reported next. In Table V, FR_right/left, GA_right/left, and GA_delta_right/left refer
to the FR, GA, and GA_delta features for the right or left hand. FR is the finger-related feature set, while GA and GA_delta are hand global angle-related feature sets, as
defined in Sections 3.3 and 4.2.
We noticed that using only right-hand features in modeling, the system can already
achieve a word accuracy of 77%. This reflects the fact that the right hand bears more
information than the left hand for sign languages. Adding left-hand information does
training set, the performance of the nonnative signer is comparable with that of the native
signers. This indicates that collecting training data from more nonnative signers may
improve the system performance on unseen nonnative signers. ASR technologies de-
veloped to handle the accent issue of nonnative speakers (e.g., acoustic modeling adap-
tation techniques [Wang et al. 2003]) may also be borrowed into SLR to alleviate the
nonnative issue.
ACKNOWLEDGMENTS
We thank Dr. Liu Ren, Dr. Kui Xu, Dr. Yen-Lin Chen from Bosch Research and Technology Center, North
America, Gerry Guo, Aria Jiang, Hong Luo, Ben Wang from Bosch China, Tobias Menne from RWTH Aachen
University, and Jianjie Zhang from Texas A&M University for their contributions to the whole system.
REFERENCES
Simon L. Altman. 1986. Rotations, Quaternions, and Double Groups. Dover Publications.
John Backus. 1978. Can programming be liberated from the von Neumann style?: A functional style and its
algebra of programs. Commun. ACM 21, 8 (1978), 613–641. DOI:http://dx.doi.org/10.1145/359576.359579
Lalit R. Bahl, Frederick Jelinek, and Robert L. Mercer. 1983. A maximum likelihood approach to continuous
speech recognition. IEEE Trans. Pattern Anal. Mach. Intell. 5, 2 (1983), 179–190. DOI:http://dx.doi.org/
10.1109/TPAMI.1983.4767370
James K. Baker. 1975. The DRAGON system: An overview. IEEE Trans. Acoust., Speech, Signal Process. 23, 1
(1975), 24–29. DOI:http://dx.doi.org/10.1109/TASSP.1975.1162650
Issam Bazzi. 2002. Modelling Out-of-Vocabulary Words for Robust Speech Recognition. Ph.D. dissertation.
Massachusetts Institute of Technology, Cambridge, Massachusetts.
Xiujuan Chai, Guang Li, Xilin Chen, Ming Zhou, Guobin Wu, and Hanjing Li. 2013. VisualComm: A tool to
support communication between deaf and hearing persons with the Kinect. In Proceedings of the 15th
International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS’13). ACM, New
York, NY, Article 76, 2 pages. DOI:http://dx.doi.org/10.1145/2513383.2513398
Yiqiang Chen, Wen Gao, Gaolin Fang, Changshui Yang, and Zhaoqi Wang. 2003. CSLDS: Chinese sign
language dialog system. In Proceedings of the IEEE International Workshop on Analysis and Modeling
of Faces and Gestures (AMFG’03). IEEE Computer Society, 236–237.
Collin Cherry. 1968. On Human Communications. MIT Press, Cambridge.
Helen Cooper, Brian Holt, and Richard Bowden. 2011. Sign language recognition. In Visual Analysis of
Humans. Springer, London, 539–562. DOI:http://dx.doi.org/10.1007/978-0-85729-997-0_27
Helen Cooper, Eng-Jon Ong, Nicolas Pugeault, and Richard Bowden. 2012. Sign language recognition using
sub-units. Journal Mach. Learning Res. 13, 1 (2012), 2205–2231.
Steven B. Davis and Paul Mermelstein. 1980. Comparison of parametric representations of monosyllabic
word recognition in continuously spoken sentences. IEEE Trans. Acoust., Speech, Signal Process. 28, 4
(1980), 357–366. DOI:http://dx.doi.org/10.1109/TASSP.1980.1163420
Philippe Dreuw, David Rybach, Thomas Deselaers, Morteza Zahedi, and Hermann Ney. 2007. Speech recog-
nition techniques for a sign language recognition system. Hand 60 (2007), 80.
Gaolin Fang, Wen Gao, and Debin Zhao. 2007. Large-vocabulary continuous sign language recognition based
on transition-movement models. IEEE Trans. Syst., Man, Cybern. A 37, 1 (2007), 1–9. DOI:http://dx.doi.
org/ 10.1109/TSMCA.2006.886347
Guolin Fang, Wen Gao, and Debin Zhao. 2004. Large vocabulary sign language recognition based on
fuzzy decision trees. IEEE Trans. Syst., Man, Cybern. A 34, 3 (2004), 305–314. DOI:http://dx.doi.
org/10.1109/TSMCA.2004.824852
Jens Forster, Oscar Koller, Christian Oberdörfer, Yannick Gweth, and Hermann Ney. 2013. Improving con-
tinuous sign language recognition: Speech recognition techniques and system design. In Proceedings
of the 4th Workshop on Speech and Language Processing for Assistive Technologies. Association for
Computational Linguistics, 41–46.
Jens Forster, Christoph Schmidt, Oscar Koller, Martin Bellgardt, and Hermann Ney. 2014. Extensions of the
sign language recognition and translation corpus RWTH-PHOENIX-Weather. In Proceedings of the 9th
International Conference on Language Resources and Evaluation (LREC’14). ELRA, 1911–1916.
Wen Gao, Gaolin Fang, Debin Zhao, and Yiqiang Chen. 2004. A Chinese sign language recognition system
based on SOFM/SRN/HMM. Pattern Recognition 37, 12 (2004), 2389–2402. DOI:http://dx.doi.org/
10.1016/j.patcog.2004.04.008
Lubo Geng, Xin Ma, Haibo Wang, Jason Gu, and Yibin Li. 2014. Chinese sign language recognition
with 3D hand motion trajectories and depth images. In Proceedings of the 11th World Congress on
Intelligent Control and Automation (WCICA 2014). IEEE, 1457–1461. DOI:http://dx.doi.org/10.1109/
WCICA.2014.7052933
William R. Hamilton. 1844. On a new species of imaginary quantities connected with a theory of quaternions.
Proc. Royal Irish Acad. 2, 1843 (1844), 424–434.
Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew
Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury. 2012. Deep neural
networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal
Process. Mag. 29, 6 (2012), 82–97. DOI:http://dx.doi.org/10.1109/MSP.2012.2205597
Frederick Jelinek, Lalit R. Bahl, and Robert L. Mercer. 1975. Design of a linguistic statistical decoder for the
recognition of continuous speech. IEEE Trans. Inf. Theory 21, 3 (1975), 250–256. DOI:http://dx.doi.org/
10.1109/TIT.1975.1055384
Feng Jiang, Wen Gao, Hongxun Yao, Debin Zhao, and Xilin Chen. 2009. Synthetic data generation tech-
nique in signer-independent sign language recognition. Pattern Recog. Lett. 30, 5 (2009), 513–524.
DOI:http://dx.doi.org/10.1016/j.patrec.2008.12.007
Biing-Hwang Juang, Wu Chou, and Chin-Hui Lee. 1997. Minimum classification error rate methods for
speech recognition. IEEE Trans. Speech Audio Process. 5, 3 (1997), 257–265. DOI:http://dx.doi.org/
10.1109/89.568732
Daniel Kelly, John McDonald, and Charles Markham. 2009. Recognizing spatiotemporal gestures and
movement epenthesis in sign language. In Machine Vision and Image Processing Conference 2009,
DOI:10.1109/IMVIP.2009.33.
Ron Kohavi. 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In
IJCAI 14, 2 (Aug. 1995), 1137–1145.
W. W. Kong and Surendra Ranganath. 2014. Towards subject independent continuous sign language
recognition: A segment and merge approach. Pattern Recog. 47, 3 (2014), 1294–1308. DOI:10.1016/
j.patcog.2013.09.014.
Brenda Laurel and S. Joy Mountford. 1990. The Art of Human-Computer Interface Design. Addison-Wesley
Longman Publishing Co., Inc.
Chin-Hui Lee, Lawrence R. Rabiner, Roberto Pieraccini, and Jay G. Wilpon. 1990. Acoustic modeling for
large vocabulary speech recognition. Comput. Speech Language 4, 2 (1990), 127–165. DOI:http://dx.doi.
org/10.1016/0885-2308(90)90002-N
Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal (eds). 1996. Automatic Speech and Speaker Recognition:
Advanced Topics, Vol. 355. Springer Science & Business Media.
Chin-Hui Lee and Qiang Huo. 2000. On adaptive decision rules and decision parameter adaptation for au-
tomatic speech recognition. Proc. IEEE 88, 8 (2000), 1241–1269. DOI:http://dx.doi.org/10.1109/5.880082
Hyeon-Kyu Lee and Jin H. Kim. 1999. An HMM-based threshold model approach for gesture recognition. IEEE Trans. Pattern Anal. Mach. Intell. 21, 10 (1999), 961–973. DOI:10.1109/34.799904.
Kai-Fu Lee. 1988. Large-Vocabulary Speaker-Independent Continuous Speech Recognition: The Sphinx Sys-
tem. Ph.D. Dissertation. Carnegie Mellon University, Pittsburgh, Pennsylvania.
Rung-Huei Liang and Ming Ouhyoung. 1998. A real-time continuous gesture recognition system for sign
language. In Proceedings of the 3rd IEEE International Conference on Automatic Face and Gesture
Recognition. IEEE, 558–567. DOI:http://dx.doi.org/10.1109/AFGR.1998.671007
Bruce T. Lowerre. 1976. The Harpy Speech Recognition System. Ph.D. dissertation. Carnegie-Mellon Univer-
sity, Department of Computer Science, Pittsburgh, Pennsylvania.
Mohamed A. Mohandes. 2010. Recognition of two-handed Arabic signs using the CyberGlove. In Proceedings
of the 4th International Conference on Advanced Engineering Computing and Applications in Sciences
(ADVCOMP’10). IARIA, 124–129.
Hermann Ney and Stefan Ortmanns. 2000. Progress in dynamic programming search for LVCSR. Proc. IEEE
88, 8 (2000), 1224–1240. DOI:http://dx.doi.org/10.1109/5.880081
Sylvie C. W. Ong and Surendra Ranganath. 2005. Automatic sign language analysis: A survey and
the future beyond lexical meaning. IEEE Trans. Pattern Anal. Mach. Intell. 27, 6 (2005), 873–891.
DOI:http://dx.doi.org/10.1109/TPAMI.2005.112
Cemil Oz and Ming C. Leu. 2011. American Sign Language word recognition with a sensory glove us-
ing artificial neural networks. Eng. Appl. Artific. Intell. 24, 7 (2011), 1204–1213. DOI:http://dx.doi.org/
10.1016/j.engappai.2011.06.015
Vassilis Pitsikalis, Stavros Theodorakis, Christian Vogler, and Petros Maragos. 2011. Advances in phonetics-
based sub-unit modeling for transcription alignment and sign language recognition. In Proceedings
of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
(CVPRW’11).
Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukáš Burget, Ondřej Glembek, Nagendra Goel, Mirko
Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel
Vesely. 2011. The Kaldi speech recognition toolkit. In Proceedings of the IEEE Workshop on Automatic
Speech Recognition and Understanding (ASRU’11). IEEE Signal Processing Society, 1–4.
Lawrence R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recog-
nition. Proc. IEEE 77, 2 (1989), 257–286. DOI:http://dx.doi.org/10.1109/5.18626
Lawrence R. Rabiner and Biing-Hwang Juang, 1993. Fundamentals of Speech Recognition, Vol. 14. Prentice-
Hall.
Roni Rosenfeld. 2000. Two decades of statistical language modeling: Where do we go from here? Proc. IEEE
88, 8 (2000), 1270–1278. DOI:http://dx.doi.org/10.1109/5.880083
Claude E. Shannon. 1948. A mathematical theory of communication. Bell Syst. Techn. J. 27 (July and October
1948), 379–423 and 623–656.
Ben Shneiderman. 1986. Designing the User Interface: Strategies for Effective Human-Computer Interaction.
Pearson Education India.
Thad Starner, Joshua Weaver, and Alex Pentland. 1998. Real-time American sign language recognition
using desk and wearable computer based video. IEEE Trans. Pattern Anal. Mach. Intell. 20, 12 (1998),
1371–1375. DOI:http://dx.doi.org/10.1109/34.735811
Oriol Vinyals, Suman V. Ravuri, and Daniel Povey. 2012. Revisiting recurrent neural networks for robust
ASR. In Proceedings of 2012 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP’12). IEEE, 4085–4088. DOI:http://dx.doi.org/10.1109/ICASSP.2012.6288816
Andrew J. Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding al-
gorithm. IEEE Trans. Inf. Theory 13, 2 (1967), 260–269. DOI:http://dx.doi.org/10.1109/TIT.1967.1054010
Christian Vogler and Dimitris Metaxas. 1997. Adapting hidden Markov models for ASL recognition by
using three-dimensional computer methods. In IEEE International Conference on Computational Cy-
bernetics and Simulation Systems, Man, and Cybernetics, Vol. 1. IEEE, 156–161. DOI:http://dx.doi.org/
10.1109/ICSMC.1997.625741
Christian Vogler and Dimitris Metaxas. 1999. Parallel hidden Markov models for American sign language
recognition. In Proceedings of the 7th IEEE International Conference on Computer Vision, Vol. 1. IEEE,
116–122. DOI:http://dx.doi.org/10.1109/ICCV.1999.791206
Beng Wang, Xiaohua Wang, Hong Luo, Liu Ren, Jianjie Zhang, Kui Xu, Yen-Lin Chen, Zhengyu Zhou, and
Wenwei Guo. 2014a. The glove to capture data for sign language recognition (in Chinese). Invention
Patent, Application No. 201410410413.7, Filed August 2014, State Intellectual Property Office of The
People’s Republic of China.
Beng Wang, Xiaohua Wang, Hong Luo, Liu Ren, Jianjie Zhang, Kui Xu, Yen-Lin Chen, Zhengyu Zhou, and
Wenwei Guo. 2014b. The glove to capture data for sign language recognition (in Chinese), Utility Patent,
Application No. CN 204044747 U, Approved in December 2014, State Intellectual Property Office of The
People’s Republic of China.
Chunli Wang, Wen Gao, and Shiguang Shan. 2002. An approach based on phonemes to large vocabulary
Chinese sign language recognition. In Proceedings of the 5th IEEE International Conference on Automatic
Face and Gesture Recognition. IEEE, 411–416. DOI:http://dx.doi.org/10.1109/AFGR.2002.1004188
Honggang Wang, Ming C. Leu, and Cemil Oz. 2006. American sign language recognition using multi-
dimensional hidden Markov models. J. Inf. Sci. Eng. 22, 5 (2006), 1109–1123.
Zhirong Wang, Tanja Schultz, and Alex Waibel. 2003. Comparison of acoustic model adaptation techniques
on non-native speech. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech,
and Signal Processing. Vol. 1, pp. I-540–I-543.
Jay G. Wilpon, Lawrence R. Rabiner, Chin-Hui Lee, and E. R. Goldman. 1990. Automatic recognition of
keywords in unconstrained speech using hidden Markov models. IEEE Trans. Acoust., Speech, Signal
Process. 38, 11 (1990), 1870–1878. DOI:http://dx.doi.org/10.1109/29.103088
Ruiduo Yang, Sudeep Sarkar, and Barbara Loeding. 2007. Enhanced level building algorithm for the move-
ment epenthesis problem in sign language recognition. In IEEE Conference on Computer Vision and
Pattern Recognition, 2007. DOI:10.1109/CVPR.2007.383347.
Ruiduo Yang, Sudeep Sarkar, and Barbara Loeding. 2010. Handling movement epenthesis and hand segmen-
tation ambiguities in continuous sign language recognition using nested dynamic programming. IEEE
Trans. Pattern Anal. Mach. Intell. 32, 3 (2010), 462–477. DOI:http://dx.doi.org/10.1109/TPAMI.2009.26
Guilin Yao, Hongxun Yao, Xin Liu, and Feng Jiang. 2006. Real time large vocabulary continuous sign
language recognition based on OP/Viterbi algorithm. In Proceedings of the 18th International Conference
on Pattern Recognition (ICPR’06), Vol. 3. IEEE, 312–315. DOI:http://dx.doi.org/10.1109/ICPR.2006.954
Steve Young, Gunnar Evermann, Mark Gales, Thomas Hain, Dan Kershaw, Xunying A. Liu, Gareth Moore,
Julian Odell, Dave Ollason, Dan Povey, Valtcho Valtchev, and Phil Woodland. 2006. The HTK Book (for
HTK version 3.4), (2006), Retrieved August 20, 2011 from http://htk.eng.cam.ac.uk/.
Zahoor Zafrulla, Helene Brashear, Thad Starner, Harley Hamilton, and Peter Presti. 2011. American sign
language recognition with the Kinect. In Proceedings of the 13th International Conference on Multimodal
Interfaces (ICMI’11). ACM, New York, NY, 279–286. DOI:http://dx.doi.org/10.1145/2070481.2070532
Yu Zhou, Debin Zhao, Hongxun Yao, and Wen Gao. 2010. Adaptive sign language recognition with exem-
plar extraction and MAP/IVFS. IEEE Signal Process. Lett. 17, 3 (2010), 297–300. DOI:http://dx.doi.org/
10.1109/LSP.2009.2038251
Zhengyu Zhou. 2009. An Error Detection and Correction Framework to Improve Large Vocabulary Continuous
Speech Recognition. Ph.D. dissertation. The Chinese University of Hong Kong, Hong Kong, China.
Zhengyu Zhou, Tobias Menne, Kehuang Li, Kui Xu, and Zhe Feng. 2015. System and method for automated
sign language recognition. Invention Patent (Provisional), Application No. 62148204, Filed April, 2015.