
Sign Transition Modeling and a Scalable Solution to Continuous Sign

Language Recognition for Real-World Applications


KEHUANG LI, Georgia Institute of Technology
ZHENGYU ZHOU, Research and Technology Center, Robert Bosch LLC
CHIN-HUI LEE, Georgia Institute of Technology

We propose a new approach to modeling transition information between signs in continuous Sign Language
Recognition (SLR) and address some scalability issues in designing SLR systems. In contrast to Automatic
Speech Recognition (ASR) in which the transition between speech sounds is often brief and mainly addressed
by the coarticulation effect, the sign transition in continuous SLR is far from clear and usually cannot be easily
and exactly characterized. Leveraging upon hidden Markov modeling techniques from ASR, we propose
a modeling framework for continuous SLR having the following major advantages, namely: (i) the system
is easy to scale up to large-vocabulary SLR; (ii) modeling of signs as well as the transitions between signs
is robust even for noisy data collected in real-world SLR; and (iii) extensions to training, decoding, and
adaptation are directly applicable even with new deep learning algorithms. A pair of low-cost digital gloves
affordable for the deaf and hard of hearing community is used to collect training and testing
data for real-world SLR interaction applications. Evaluated on 1,024 testing sentences from five signers, a
word accuracy rate of 87.4% is achieved using a vocabulary of 510 words. The SLR speed is in real time,
requiring an average of 0.69s per sentence. The encouraging results indicate that it is feasible to develop
real-world SLR applications based on the proposed SLR framework.
CCS Concepts: • Human-Centered Computing → Natural Language Interfaces; • Human-Centered
Computing → Accessibility Systems and Tools; • Human-Centered Computing → Gestural Input
Additional Key Words and Phrases: Sign language recognition, transition modeling, speech recognition,
hidden Markov models
ACM Reference Format:
Kehuang Li, Zhengyu Zhou, and Chin-Hui Lee. 2016. Sign transition modeling and a scalable solution to
continuous sign language recognition for real-world applications. ACM Trans. Access. Comput. 8, 2, Article 7
(January 2016), 23 pages.
DOI: http://dx.doi.org/10.1145/2850421

1. INTRODUCTION
Sign language [Ong and Ranganath 2005] is a form of natural language commonly
used for communication in the deaf and hard of hearing community. While spoken
languages use voices to convey meanings, sign languages utilize hand shape, position,
orientation and movement, sometimes with the aid of facial expression and body move-
ment, to express thoughts [Cherry 1968]. Like spoken languages, sign languages vary
from region to region. Automatic Sign Language Recognition (SLR) is a research topic

This research was supported and funded by Bosch. The majority of this work was conducted within Bosch,
while some evaluation experiments were performed in the School of Electrical and Computer Engineering at
Georgia Institute of Technology. The first author has been involved in this research since his internship at Bosch Research
and Technology Center North America.
Authors’ addresses: Z. Zhou, Bosch Research and Technology Center North America, 4005 Miranda Avenue,
#200, Palo Alto, CA 94304; email: [email protected]; K. Li and C.-H. Lee, School of Electrical and
Computer Engineering, Georgia Institute of Technology, 777 Atlantic Drive NW, Atlanta, GA 30332-0250;
emails: {kehle, chl}@gatech.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. Copyrights for third-party components of this
work must be honored. For all other uses, contact the Owner/Author.
2016 Copyright held by Owner/Author
1936-7228/2016/01-ART7
DOI: http://dx.doi.org/10.1145/2850421


attracting many researchers’ attention in recent years [Starner et al. 1998; Fang et al.
2007; Forster et al. 2013]. Advances in automatic SLR technologies may lead to various
Human-Computer Interaction (HCI) [Shneiderman 1986; Laurel and Mountford 1990]
systems that facilitate communication between spoken language users and sign language users.
SLR is similar to Automatic Speech Recognition (ASR) in the sense that they both
attempt to automatically transfer sequences of input signals into sequences of words.
For ASR, Hidden Markov Model (HMM) [Rabiner 1989] has been one of the most
prevalent statistical modeling techniques since the mid-1970s [Baker 1975]. Intense
research has been conducted in various aspects, including feature extraction [Davis and
Mermelstein 1980], acoustic and language modeling [Lee et al. 1990; Rosenfeld 2000],
decoding [Ney and Ortmanns 2000], discriminative training [Juang et al. 1997], adap-
tive learning [Lee and Huo 2000], and postprocessing. The mature HMM-based ASR
advances have supported the development of many commercial products, including
speaker-independent large-vocabulary continuous speech recognition systems. Since
2012, deep learning [Hinton et al. 2012; Vinyals et al. 2012] has been the new focus
of ASR. Its flexible learning capacity is often incorporated into the HMM-based ASR
framework.
Compared with ASR, the SLR technology is still in the early stage of development.
The majority of the research activities on SLR are on isolated sign/word recognition [Oz
and Leu 2011; Pitsikalis et al. 2011; Cooper et al. 2012; Geng et al. 2014]. The research
efforts on continuous SLR started in the late 1990s and have so far been limited.
Such efforts mostly focused on small and medium-size vocabularies [Starner et al.
1998; Dreuw et al. 2007; Zafrulla et al. 2011]. For example, a 99-sign vocabulary was
adopted in Yang et al. [2010] and the best accuracy of 73% was achieved on camera data
collected from three users. The researchers from RWTH reported a 22.1% Word Error
Rate (WER) using a vocabulary of 455 words on data from 25 signers in a restricted
laboratory camera setting [Forster et al. 2013]. They also collected a set of “real” data
from TV weather broadcast videos, on which a WER of 49.2% was reported for six
signers with a 349-word vocabulary [Forster et al. 2014].
For large-vocabulary continuous SLR, the only reported experiments were conducted
on a set of Chinese sign language data collected by expensive wearable devices, includ-
ing two CyberGloves and three position trackers [Yao et al. 2006; Fang et al. 2007]. The
vocabularies contain up to 5,113 signs. The best reported accuracies were 91.9% for
two signers [Fang et al. 2007], and at about 90% for six signers [Chen et al. 2003; Fang
et al. 2004; Gao et al. 2004]. One key reason for the high accuracies may be attributed
to the use of the high-end CyberGloves that can accurately capture hand angles. How-
ever, this success on large-vocabulary SLR has not been repeated by other research groups; the closest attempts, made recently by the authors of this article, used smaller vocabularies of 208 to 370 words with CyberGloves or Kinect [Jiang et al. 2009; Zhou et al. 2010; Chai et al. 2013], though no accuracies for continuous SLR were reported.
The major research trend of continuous SLR is to move from ideal laboratory set-
tings (e.g., camera recordings requiring users to wear black clothes in front of a blue
background, CyberGloves) to real-world data (e.g., TV video, Kinect data), and from
small to medium vocabularies [Starner et al. 1998; Dreuw et al. 2007; Zafrulla et al.
2011; Chai et al. 2013; Forster et al. 2014]. In this work, we collect real-world data
using low-cost digital gloves. Compared with cameras and Kinect, digital gloves can
be used both indoors and outdoors, and will not be affected by common camera-related
problems such as illumination. Compared with high-end gloves like CyberGloves, the
low-cost gloves provide less accurate and noisier signals. However, they are much more
affordable for deaf and hard of hearing people, making real-world applications possi-
ble. Here by “real-world applications” we mean real-time applications using affordable


devices for multiple environments (e.g., indoor and outdoor). To develop such interac-
tive systems between sign language users and spoken language users, we propose a
scalable solution for continuous SLR based on the mature HMM framework of ASR
with the special characteristics of sign languages carefully considered.
Continuous SLR has leveraged upon the extensive set of technologies in ASR
[Rabiner and Juang 1993; Lee et al. 1996] in HMM training and decoding [Starner
et al. 1998; Dreuw et al. 2007]. However, as illustrated in a recent review article as
well as in other papers, there are doubts in the SLR research community about whether ASR techniques, particularly the HMM-based ASR framework, can scale to continuous SLR tasks [Vogler and Metaxas 1999; Yang et al. 2010; Cooper et al. 2011]. This
is mainly due to two challenges related to SLR. First, the basic modeling units (i.e.,
phonemes) for signs are not well defined [Pitsikalis et al. 2011] and can be defined
in a way that leads to parallel recognition for multiple streams (e.g., left hand, right
hand) [Vogler and Metaxas 1999]. Second, the transition signals between signs lead
to a challenging coarticulation problem that is different from that of ASR [Lee 1988;
Vogler and Metaxas 1997]. To tackle these problems, variations of the conventional
HMM frameworks as well as alternative recognition structure (e.g., Nested Dynamic
Programming) have been proposed [Vogler and Metaxas 1999; Fang et al. 2007; Yang
et al. 2010]. However, system scalability and decoding efficiency are difficult topics to
address when extending such SLR systems to larger vocabularies.
We believe that the SLR-specific challenges, including transition modeling, can be
addressed within the mature HMM framework of ASR in a scalable way. In this study,
we propose such an SLR solution that can easily be scaled up to larger vocabularies, be
applied to real-world data, and be extended to other developed ASR techniques such
as adaptation and deep learning. This solution [Zhou et al. 2015] first defines/selects
sign phonemes (i.e., the basic signs that are consistent across various single/multisign
words) in the same way as in Wang et al. [2002], and merges the features from both
hands into one single feature vector for HMM modeling. Transition signals between
signs are then modeled using one universal HMM model, which implicitly performs
classification for transition signals through Gaussian splitting. This is similar to the
single universal filler known as a garbage collection model in keyword spotting [Wilpon
et al. 1990] to absorb nonkeyword extraneous speech. The sign phoneme HMMs and
the transition HMM are merged into a recognition framework that is similar to that
of conventional ASR. With this SLR framework, while the transition problem is well
handled, mature ASR decoding methods can be used to support real-time recogni-
tion even for large vocabularies. Other developed ASR techniques, ranging from mod-
eling to postprocessing, become directly applicable, and may be applied to improve
system performance in the future. We also propose a new data collection approach
that is scalable toward large-vocabulary SLR and suitable to process noisy real-world
data.
We implemented a prototype of the proposed solution on medium-vocabulary contin-
uous Chinese SLR. We started with the Chinese sign language because China has the
largest population of deaf and hard of hearing in the world, while the societal support
for this community is much less than that available in developed countries like the
USA. The experimental results are encouraging. Relatively high recognition accuracy
was achieved on real-world data, that is, noisy signals from low-cost gloves, collected
from six signers.
The rest of the article is organized as follows. We first discuss the related work in
Section 2. The proposed framework to model sign transitions and to address some
scalability issues for continuous SLR is presented in Section 3. Experiments and result
analyses are described in Sections 4 and 5, respectively. Conclusion and future research
directions are given in Section 6.


2. RELATED WORK
The conventional HMM-based recognition structure that has been prevalently used
in ASR was thought to be not scalable for continuous SLR in some previous studies,
and therefore alternative architectures were adopted [Yang et al. 2007; Yang et al.
2010; Cooper et al. 2011; Kong and Ranganath 2014]. For example, Liang and Ouhy-
oung proposed a posture-based SLR approach that involves end-point detection and
posture analysis [Liang and Ouhyoung 1998]. Vogler and Metaxas proposed parallel
HMMs to process the left and right hands separately and combine probabilities from
two streams at the word end nodes during decoding [Vogler and Metaxas 1999]. Yang
et al. [2010] used a new recognition structure of nested dynamic programming instead
of the HMM modeling to handle transition signals (i.e., movement epenthesis) and
hand segmentation, and compared this method with conditional random fields. These
approaches were evaluated on relatively small vocabularies, and the scalability to
large-vocabulary SLR remains a difficult research topic for these alternative recogni-
tion structures.
For the previous studies that follow the conventional HMM recognition structures for
continuous SLR, the transition signals between signs and the collection of training data
are handled in various ways. A popular choice is to ignore the transition signals and
train the word models with various features on the training data of whole sentences
[Starner et al. 1998; Dreuw et al. 2007; Zafrulla et al. 2011; Forster et al. 2013]. One
major problem of such an SLR strategy is that the word models trained in this way
actually include the neighboring transition signals before and after the focused words. This causes modeling problems for uncommon words, whose examples are often limited in the training data, and as a result the recognition accuracy may be seriously affected
if the transition context of a word in testing sentences is unseen or rare in the training
data. A similar case in ASR is the robustness problem to unseen conditions. If the
training data is required to well cover the large variation in transition for each word,
the needed number of training sentences will increase to a prohibitive scale. Note that
collecting sign language data is more difficult than collecting speech data due to the
relatively small population of the users.
There are also some researchers using a threshold model to distinguish between
transition parts and signs/gestures in continuous signals [Lee and Kim 1999; Kelly
et al. 2009]. The main idea is to combine the states of the sign/gesture HMMs into
a threshold model of an ergodic structure, and then to identify transitions as well as
signs/gestures by comparing the likelihoods calculated from the threshold model and
sign/gesture models. The problem is that this method is only suitable for SLR tasks
involving a small number (e.g., eight) of signs. For middle/large-vocabulary SLR, the
method is impractical because the threshold model generated will be too large.
Vogler and Metaxas addressed the transition problem in HMM-based recognition
using two different approaches [Vogler and Metaxas 1997]. The first method is to con-
duct context-dependent sign modeling, that is, to train bisign models in a similar way
as biphones in ASR [Lee 1988]. This approach has a scalability problem since the
number of bisign models can be too large to train. The second is to divide transition
signals into 64 classes based on k-means clustering, and then train the correspond-
ing transition models. The recognition network has to be redesigned to connect the
transition models with words appropriately based on the starting and ending locations
of the hands. Note that this could be challenging especially for large-vocabulary SLR
with trigram language models [Rosenfeld 2000]. Experimental results show that the
second approach works better than the first one (with a word accuracy of 95.8% vs.
91.7% for a vocabulary of 53 signs), demonstrating the benefit of transition modeling.
In that work, sentences were used as the training data. Although by explicitly modeling


transition signals, signs can be modeled without considering the transition context,
the number of training sentences needed to robustly model every sign can still be high
for large-vocabulary SLR since uncommon signs occur less frequently than common
signs.
Fang et al. [2007] proposed another way to model transition signals within the HMM
recognition structure. The transition signals were clustered into multiple classes using
a temporal clustering algorithm. The transition models were then trained jointly with
the sign models using a bootstrap training method. The transition models and the
sign models are all viewed as candidates during the decoding procedure. To handle
the increased complexity in decoding, a pruning algorithm was also proposed. The
authors adopted a vocabulary of 5,113 signs and designed training data that was
composed of two parts: isolated signs and continuous sentences. During training, they
first train initial sign models using the isolated sign samples. Then the sign models are
retrained along with the transition models using the continuous sentences. The initial
sign models will often not match the sentences well, because the collection of isolated
sign samples will certainly contain transition signals before and after the target signs
and these transitions are different from the cases in sentences. Retraining of the sign
models is in the spirit of adapting the sign models to match the conditions in continuous
sign sentences. However, similar to the cases described earlier, it is difficult to collect
enough data to adapt all the signs to sentences. In their experiments, they used 750
distinct sentences in training and testing, and each sentence contains 6.6 words on
average. That means these sentences cover at most 4,950 signs out of the vocabulary
of 5,113 signs. The number of distinct signs that really appeared in these sentences
will be quite small because common signs will repeatedly appear in these sentences.
In other words, only a portion of the signs are adapted to the sentences. If testing
sentences involve signs that are not covered by these training sentences, the accuracy
will be heavily reduced.
In this article, we define a single instead of multiple transition models to model the
transition signals within the conventional HMM recognition structure, and the tran-
sition model has the same left-to-right structure as sign models. When compared with
previous studies that used clustering methods [Vogler and Metaxas 1997; Fang et al.
2007], we let the transition HMM handle the Gaussian split classification implicitly.
The single transition model is trained and used in the same way as the sign models.
Thus it is unnecessary to modify the training and decoding procedures for the use of
the transition model. Adopting a single transition model is also more advantageous
in recognition efficiency than using multiple transition models, as will be further dis-
cussed in Section 5.1. With the proposed framework, the conventional training and
decoding techniques developed for large-vocabulary ASR are directly usable, leading
to high system scalability.
Different from previous studies that completely or partially use sentences as training
data, we only use single-sign and multisign word samples as training data. Multisign
words are also included to capture the intrasentence variability of signs across different
words, as well as to provide samples of transition signals. We manually segmented the
starting and ending points of signs as well as the transition parts in all the word
samples. Note that while it is feasible to do such labeling for word samples, it is very
difficult to do so for continuous sign sentences. Although manual labeling may not be
perfect, it can greatly benefit modeling especially for highly variable real-world data,
for which the effectiveness of forced-alignment-based reestimation of HMM may be
greatly reduced [Forster et al. 2013]. The proposed data collection methodology can easily be extended to large-vocabulary situations, since the number of word samples to collect grows only linearly with the vocabulary size; it is therefore a scalable solution.


3. A SCALABLE SYSTEM FOR CONTINUOUS SIGN LANGUAGE RECOGNITION


Next, we propose an SLR solution that explicitly models transition signals in a scalable
way. We first present the SLR formulation in Section 3.1, and give more detail on sign
and transition modeling in Section 3.2. The feature extraction method is then described
in Section 3.3.

3.1. SLR Formulation


The proposed solution defines the continuous SLR problem in a way similar to ASR. We
find the most likely sequence of words, W∗ , in SLR based on a maximum a posteriori
Bayes decision formulation [Jelinek et al. 1975; Bahl et al. 1983], also known as an
optimal channel decoding solution [Shannon 1948], for a given sequence of feature
vectors, X, as follows:
$$W^* = \arg\max_{W} P(W \mid X) = \arg\max_{W \in \Omega} P(X \mid W)\,P(W), \qquad (1)$$

where $X = \{x_1, x_2, \ldots, x_T\}$ is a sequence of $T$ observed feature vectors corresponding to the unknown input sentence, $W = \{w_1, w_2, \ldots, w_M\}$ is any sequence of $M$ words into which the input may be transcribed, and $\Omega$ is the set of all allowable sentences
representing the search space. Note that a word here refers to the smallest unit that bears linguistic meaning. For sign languages, a word may contain one or more signs,
and similar to the case in ASR, the definition of words can be quite flexible. To solve
this problem in Equation (1), three major tasks are required. First, words have to
be modeled in order to evaluate the feature probability, P(X|W). Second, language
models have to be used to calculate the prior sentence probability, P(W). Third, the
word models and language models have to be combined to form networks of HMM
states [Ney and Ortmanns 2000] in which the decoding for the most likely word string
W∗ in the search space can be performed.
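To make the decision rule concrete, the following minimal Python sketch scores a finite set of candidate word sequences by combining a feature log-likelihood with a prior log-probability and returns the arg max. The two scoring callbacks are hypothetical stand-ins for the word models and the language model described below; in practice the maximization is carried out implicitly by Viterbi decoding over a network of HMM states rather than by enumeration.

import math

def map_decode(X, candidates, loglik_fn, logprior_fn):
    """Pick W* = argmax_W P(X|W) P(W) over a finite candidate set (Equation (1)).

    X           -- the sequence of observed feature vectors x_1..x_T
    candidates  -- an iterable of word sequences, i.e., the search space
    loglik_fn   -- callback returning log P(X|W), e.g., an HMM forward score
    logprior_fn -- callback returning log P(W), e.g., from a grammar or n-gram model
    """
    best_W, best_score = None, -math.inf
    for W in candidates:
        score = loglik_fn(X, W) + logprior_fn(W)
        if score > best_score:
            best_W, best_score = W, score
    return best_W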
As shown in Equation (1), we view the input signals of SLR as a sequential input
of feature vectors. Note that the sign language involves multiple input channels, con-
sisting of signals from the right hand, the left hand, the facial expression, and the
body movement [Cherry 1968]. We combine the observations from various channels at
a given time into one single feature vector. In this work, we focus on hand signals. The
missing information from facial expressions and body movements may be partially obtained directly by humans in real-world, face-to-face SLR applications.
For language modeling, n-gram modeling [Rosenfeld 2000] has been a popular choice
for previous SLR experiments. However, for real-world applications, the current avail-
able corpora of sign languages are too small to estimate meaningful n-gram probabil-
ities. Adapting text corpora into sign language corpora is also not feasible in reality,
because natural sign languages are different from natural spoken languages in word
order and are not well studied from linguistic aspects. In China, although the government encourages the deaf and hard of hearing population to follow text grammar when signing, that is, to perform text-based sign language, the community still prefers natural sign language, which is not yet well regulated and differs from one area to the next
in detail. All these factors make statistical n-gram modeling difficult for real-world
SLR. This work adopts a realistic solution: we manually created grammars, in the form of Backus Naur Form (BNF) grammars [Backus 1978], as the language model, with the aim of covering various natural sign language usages as well as text-based sign language. This method is feasible for small to medium-vocabulary recognition tasks. Note
that grammars are still widely used in various commercial ASR products. When large
sign language corpora become available, adapting the SLR system to n-gram language
models is straightforward.
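As an illustration, a fragment of such a hand-written grammar, in the BNF-like notation accepted by HTK's HParse tool, might look as follows; the word names and sentence patterns are hypothetical and are not taken from the actual vocabulary used in this work.

$place   = HOSPITAL | PHARMACY | REGISTRATION;
$request = WHERE $place | $place WHERE;
( SENT-START ( HELLO | THANKS | $request ) SENT-END )

Listing both word orders of a request as alternatives is one simple way to cover text-based as well as natural sign language phrasings within a single grammar.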


Fig. 1. A Chinese sign language sentence “Where to register?”

3.2. Sign and Transition Modeling


In this study, we adopt the left-to-right HMM structure [Rabiner 1989] to train all the
related models, including the sign phoneme, the transition, the start and end models,
as shown in Figures 2 and 3, to be described in more detail later.
For sign modeling, some previous studies in SLR [Liang and Ouhyoung 1998; Vogler
and Metaxas 1999] used detailed components, such as posture, orientation, and motion,
of signs as the modeling unit. Since such units may occur simultaneously and can be
different between left and right hands, the sequential single-channel HMM framework
of ASR is thus not applicable and it is necessary to design new recognition frameworks
in which scalability and decoding efficiency may still be challenging topics. In this
work, we follow Wang et al. [2002] to define and select sign phonemes as the modeling
units. The sign phonemes are the smallest contrastive units that bear some meaning
and distinguish one word from another, that is, are the basic signs contained in the
single/multisign words in vocabulary. A sign phoneme may occur in different sign
words. Based on the combined observations from the left and right hands, the sign
phonemes can be modeled in the same way as modeling acoustic phonemes in ASR.
Note that the combined observations are obtained by concatenating the features from the left hand with those from the right hand at the same time stamp into one feature vector.
Transition modeling is a challenging topic in SLR [Fang et al. 2007; Yang et al. 2010].
One critical difference between ASR and SLR is that the coarticulation effect is differ-
ent for the two tasks. For ASR, a phoneme in speech is affected by the previous and
subsequent phones, and triphone modeling [Lee 1988] is the prevalent method to ad-
dress the coarticulation issue. For SLR, there are highly variable transitions inserted
between signs in sign languages. As illustrated in the example of a sign language sen-
tence shown in Figure 1, after finishing one sign, the hands often need to move to the
starting position of the next sign before the next hand gesture begins. Such additional
movements between signs introduce the transition signals. While the signs are rel-
atively consistent across the sign language signals, the transitions between any two
sequential signs are not defined and the possible hand movements are highly diversi-
fied. Training bisign or trisign models [Vogler and Metaxas 1997] like training biphone
or triphone models [Lee 1988] in ASR may partially relieve the transition problem, but
the number of bisign and trisign models to train can be prohibitively large when the
vocabulary size increases. In this work, we train a universal transition model for all the
transition signals between two signs. This is similar to using a universal filler model,
known as a garbage collector model in keyword spotting [Wilpon et al. 1990], to absorb
all nonkeyword speech segments. Through the standard training procedure of itera-
tive reestimation and Gaussian splitting, the transition model may contain multiple
Gaussian mixture components in each state and thus implicitly perform classifica-
tion on the transition signals. Compared with previous studies that first use certain
additional procedures to explicitly classify transition signals and then train multiple


Fig. 2. A conventional HMM framework for continuous ASR.

Fig. 3. The proposed modeling framework for continuous SLR.

transition models [Vogler and Metaxas 1997; Fang et al. 2007], the proposed transition
modeling method handles the classification of transitions through a convenient train-
ing procedure, and will lead to higher recognition efficiency because in this case it is
unnecessary to distinguish between competing transition models during decoding.
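The implicit classification performed by Gaussian splitting can be illustrated with the following toy numpy sketch for a single HMM state with diagonal covariances: each split perturbs a component mean by a fraction of its standard deviation, and a few EM iterations then let the components specialize to different groups of transition frames. This is only a conceptual sketch of the mechanism, with function names of our own choosing; the actual system relies on the standard re-estimation and mixture-splitting tools of the HMM toolkit.

import numpy as np

def split_components(means, variances, weights, eps=0.2):
    """Double the number of Gaussians by perturbing each mean by +/- eps * stddev,
    the usual mixture-splitting heuristic."""
    new_m, new_v, new_w = [], [], []
    for m, v, w in zip(means, variances, weights):
        d = eps * np.sqrt(v)
        new_m += [m + d, m - d]
        new_v += [v.copy(), v.copy()]
        new_w += [w / 2.0, w / 2.0]
    return np.array(new_m), np.array(new_v), np.array(new_w)

def reestimate(frames, means, variances, weights, iters=10, floor=1e-4):
    """A few EM iterations of a diagonal-covariance GMM over transition frames."""
    for _ in range(iters):
        # E-step: responsibility of each component for each frame
        logp = np.stack(
            [np.log(w) - 0.5 * np.sum((frames - m) ** 2 / v + np.log(2 * np.pi * v), axis=1)
             for m, v, w in zip(means, variances, weights)], axis=1)
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update means, variances, and weights from the soft assignments
        for k in range(len(weights)):
            r = resp[:, k:k + 1]
            n_k = r.sum()
            means[k] = (r * frames).sum(axis=0) / n_k
            variances[k] = np.maximum((r * (frames - means[k]) ** 2).sum(axis=0) / n_k, floor)
            weights[k] = n_k / len(frames)
    return means, variances, weights

Applying one such split to each emitting state of the transition HMM yields the two-component states analyzed in Section 5.1, which implicitly partition a transition into coarse classes without any separate clustering step.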
With the sign phoneme models and the universal transition model, the word model
can then be defined as the concatenation of the component sign models with the tran-
sition model inserted between the sign models. We can thus combine the word models
with the language model into a recognition framework in a similar way to the con-
ventional ASR framework [Lee 1988; Wilpon et al. 1990]. For ASR, as illustrated in
Figure 2, a silence model (indicated by “sil”) is placed before and after a sentence and a
short pause model (indicated by “sp”) is placed between every two sequential words. In
this SLR study, as shown in Figure 3, a start model (indicated by “st”) is placed before
a sentence and an end model (indicated by “end”) is used after a sentence, while a tran-
sition model (indicated by “tr” to be described in more detail later in Sections 4 and 5)


is inserted between the single-sign and multisign word models. Due to the similar
structure, the conventional ASR decoding algorithms are directly applicable to contin-
uous SLR, leading to real-time recognition speed even for large-vocabulary SLR tasks.
Applying other well-developed ASR techniques, such as discriminative training [Juang
et al. 1997], adaptive learning [Lee and Huo 2000], error processing [Zhou 2009], and
deep learning [Hinton et al. 2012], can easily be adopted to benefit continuous SLR.
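In practice, this concatenation can be expressed through an ASR-style pronunciation dictionary in which each word is spelled out as its component sign phonemes with the universal transition model "tr" inserted between them; one common HTK-style choice is to also map boundary pseudo-words to "st" and "end", as in Figure 3. The entries below are hypothetical and only illustrate the layout, since the actual sign phoneme names depend on the vocabulary definition.

HELLO       sign_hello
WHERE       sign_where
HOSPITAL    sign_hosp_1 tr sign_hosp_2
REGISTER    sign_reg_1 tr sign_reg_2 tr sign_reg_3
SENT-START  st
SENT-END    end

Between words, the decoding network inserts "tr" in the same position where a conventional ASR network would place the short-pause model "sp" (compare Figures 2 and 3).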
We consider the difference between sign and spoken languages, and adopt a new
data collection method for model training. For ASR, the phone and word models have to
be trained on continuous speech utterances because the pronunciations of phones and
words highly depend on the context. However, for SLR, signs are relatively consistent
in different contexts, while the transition parts between signs are highly variable.
Due to these characteristics, we use word samples instead of sentence samples as
training data. A certain number of samples are collected for each of the sign words
in the vocabulary, including both single-sign and multisign words. To make full use
of the captured data, each word sample is then labeled into several segments, that is,
the component signs, preceded by the start part (i.e., from the beginning to the start
of the first sign), followed by the end part (i.e., from the finishing point of the last
sign to the end), and the transition parts, if existing. Each sign phoneme model can
thus be trained on all the segments of the focused signs in the training data, while
the transition, start, and end models can be trained on the corresponding segments,
respectively. The main advantages of this data collection methodology are as follows: (i) when the vocabulary size increases, the need for training data increases only
linearly; (ii) every word in the vocabulary, including uncommon words, can be robustly
trained; and (iii) it is feasible to label word samples to provide reliable segmentation
information for training, which is especially valuable for noisy real-world data (an example label file is sketched below).
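For reference, a labeled word sample of the kind described in (iii) can be stored in a standard HTK-style master label file, one line per segment with boundary times in HTK's 100 ns units; the file name, boundaries, and sign names below are invented purely to illustrate the format.

#!MLF!#
"*/word_0042.lab"
0        5000000  start
5000000  14000000 sign_hosp_1
14000000 18000000 tr
18000000 27000000 sign_hosp_2
27000000 32000000 end
.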

3.3. Feature Extraction


Gyroscopes and accelerometers were used to track the rotation and acceleration of the hands and fingers; the sensors were placed on the gloves roughly at the center of each finger bone [Wang et al. 2014b], as illustrated in Figure 4. Here, we only discuss
feature extraction for the right hand, and that for the left hand is almost the same.

Let us denote the sampled angular velocity given by sensor $s$ at time $t_i$ as $\omega^{s}_{t_i}$; the angle rotated in each sample period can then be estimated by

$$\theta_{s,i} = (t_i - t_{i-1})\,\frac{\omega^{s}_{t_i} + \omega^{s}_{t_{i-1}}}{2}. \qquad (2)$$

If the rotation quaternion [Altman 1986] of $\theta_{s,i}$ is $q_{s,i}$, then given some initial attitude quaternion $p_s$, the current orientation after $M$ accumulated rotations is $Q_{s,M}^{-1}\, p_s\, Q_{s,M}$, where $Q_{s,M} = q_{s,M}\, q_{s,M-1} \cdots q_{s,1}$ and $Q_{s,M}^{-1} = q_{s,1}^{-1}\, q_{s,2}^{-1} \cdots q_{s,M}^{-1}$. Note that in
this study, the product of quaternions is the Hamilton product [Hamilton 1844].
Here we used the direction of gravity as the initial attitude quaternion $p_s$. Since users were asked to let their arms, hands, and fingers hang naturally at the very beginning of each data sequence, such $p_s$'s roughly represent the initial directions of the finger bones yet require no calibration, which would otherwise be difficult since we know neither the angle between each finger bone and its corresponding sensor nor the bending of the fingers when the user's hand is relaxed.

Let the sampled acceleration be $a^{s}_{t_i}$. The gravity was estimated by the mean of the first $k$ samples, $g_s = \frac{1}{k}\sum_{i=0}^{k-1} a^{s}_{t_i}$, and $p_s$ is the unit quaternion of $\hat{g}_s = g_s / |g_s|$. Then we can estimate the directions the finger bones point to, in the global coordinate system, as


Fig. 4. The distribution of sensors on the back of the hand.

Fig. 5. Coordinate on each sensor. Subscript G indicates global or absolute coordinate.

$O_{s,i} = q_0\, Q_{s,i}^{-1}\, p_s\, Q_{s,i}\, q_0^{-1}$, where $q_0$ is the rotation from the initial sensor coordinate to the global coordinate according to the sensor coordinate (Figure 5).
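A minimal numpy sketch of our reading of this orientation estimation is given below; the helper names and parameters are ours, the quaternion product follows the Hamilton convention as in the text, and q0 (the sensor-to-global rotation) is assumed to be given.

import numpy as np

def qmul(q, r):
    """Hamilton product of quaternions stored as (w, x, y, z)."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2])

def qconj(q):
    """Conjugate, which equals the inverse for unit quaternions."""
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def from_rotation_vector(theta):
    """Quaternion for a rotation of |theta| radians about the axis theta/|theta|."""
    angle = np.linalg.norm(theta)
    if angle < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    axis = theta / angle
    return np.concatenate(([np.cos(angle / 2)], np.sin(angle / 2) * axis))

def pure(v):
    """Embed a 3-vector as a pure quaternion (0, v)."""
    return np.concatenate(([0.0], v))

def bone_directions(t, omega, acc, q0, k=10):
    """Estimate the global direction of one finger bone at every sample.

    t     : (N,) sample times in seconds
    omega : (N, 3) angular velocity from the gyro, in the sensor frame
    acc   : (N, 3) acceleration from the accelerometer, in the sensor frame
    q0    : quaternion rotating the initial sensor frame to the global frame
    """
    g = acc[:k].mean(axis=0)
    p = pure(g / np.linalg.norm(g))           # initial attitude p_s from gravity
    Q = np.array([1.0, 0.0, 0.0, 0.0])        # accumulated rotation Q_{s,i}
    dirs = []
    for i in range(1, len(t)):
        theta = (t[i] - t[i-1]) * 0.5 * (omega[i] + omega[i-1])    # Equation (2)
        Q = qmul(from_rotation_vector(theta), Q)                   # q_{s,i} ... q_{s,1}
        o = qmul(qmul(qmul(qmul(q0, qconj(Q)), p), Q), qconj(q0))  # O_{s,i}
        dirs.append(o[1:])                     # keep the vector part only
    return np.array(dirs)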
There are two different kinds of feature involved in our experiments: one is related to
the hand shape (Finger Related, or FR), and the other is related to the hand direction
(Global Angle of hand, or GA). The feature we used for describing the hand shape is
the cosine distance between the directions of adjacent finger bones, $A^{i}_{s,r} = O_{s,i} \cdot O_{r,i}$. For instance, if $s$ = “2c” and $r$ = “2b,” as in Figure 4, then $A_{s,r}$ describes the bending of the proximal interphalangeal joint (the second joint from the tip of the index finger). The


Fig. 6. An initial prototype of digital gloves installed with low-cost sensors.

FR features are calculated per finger. On the other hand, the feature for the direction of the hand, more specifically the direction of the palm, can be estimated by the mean of the directions of sensors “2a,” “3a,” “4a,” and “5a.” To fully describe the direction of the hand in space, two perpendicular directions, one along the palm plane and the other along the normal of the palm plane, were used. It is also possible to obtain the acceleration of the finger bones by replacing $p_s$ with the quaternion representation of $a^{s}_{t_i}$, from which the acceleration and velocity of a hand can be estimated.
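Given per-frame bone directions $O_{s,i}$, the two feature families can be computed roughly as in the sketch below. The list of adjacent bone pairs and the choice of the second palm axis are our own assumptions for illustration, with sensor names taken from Figure 4.

import numpy as np

# Adjacent bone pairs per finger, named after the sensors in Figure 4
# (e.g., the "2c"-"2b" pair captures the proximal interphalangeal joint of the index finger).
ADJACENT_BONES = [("2c", "2b"), ("2b", "2a"), ("3c", "3b"), ("3b", "3a"),
                  ("4c", "4b"), ("4b", "4a"), ("5c", "5b"), ("5b", "5a")]
PALM_SENSORS = ["2a", "3a", "4a", "5a"]

def fr_features(directions):
    """Finger Related features: cosine between adjacent bone directions.
    directions maps a sensor name to its unit direction at one time frame."""
    return np.array([np.dot(directions[s], directions[r]) for s, r in ADJACENT_BONES])

def ga_features(directions):
    """Global Angle features: the palm direction (mean of the proximal-bone
    directions) plus one perpendicular direction along the palm normal."""
    palm = np.mean([directions[s] for s in PALM_SENSORS], axis=0)
    palm = palm / np.linalg.norm(palm)
    # One hedged choice for the normal: the cross product of the index and
    # little finger proximal-bone directions.
    normal = np.cross(directions["2a"], directions["5a"])
    normal = normal / np.linalg.norm(normal)
    return np.concatenate([palm, normal])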

4. EXPERIMENTS
Next, we develop an initial prototype of a medium-vocabulary continuous SLR system
using the proposed solution. Related system issues in data organization, feature usages,
training and testing procedures, and the primary experimental results are presented
in the following four subsections, respectively.

4.1. Data Organization


We adopted a medium-size vocabulary of 510 distinct words in Chinese sign language
to cover simple general-domain conversations. Note that each word may bear one or
more linguistic meanings, and the selection of meanings for a given word in a sentence
context is implemented in the grammar development. Among the 510 words, 353 are single-sign words, while the remaining are multisign words. The
vocabulary involves 490 unique sign phonemes (i.e., unique basic signs) in total. We
collected samples for each word in the vocabulary as the training data. Each word
sample was signed in isolation by a signer.
We also selected 215 distinct test sentences, which are typical examples from ba-
sic sign language conversations covering all the 510 words in the vocabulary with no
Out-Of-Vocabulary (OOV) [Bazzi 2002] words involved. These OOV issues often cause
problems in designing even mature ASR applications. We will defer such issues to fu-
ture studies. Both text-based and natural sign language sentence examples are tested.
Each sentence contains 4.2 words on average, ranging from two to 10 words. A
portion of the sentence samples is reserved for parameter tuning, used to estimate the
optimal numbers of Gaussian splitting and reestimation [Young et al. 2006].
The word and sentence samples were collected using a pair of digital gloves installed
with low-cost sensors [Wang et al. 2014a, 2014b]. As illustrated in Figure 6, this pair
of gloves was just an initial prototype, for which sensors were mounted on the outside
surface of the glove cloths and distributed on the back of hands as shown in Figure 4.
The gloves can be improved with finer designs in the future, and we estimate that


Table I. Simulation Configuration


Data Type | Content
Training | 16 samples per word from six signers (involving 510 distinct words)
Tuning (Development) | 1,028 sentences from five signers (involving 215 distinct sentences)
Testing | 1,024 sentences from five signers (involving 215 distinct sentences)

the hardware cost of two refined gloves will be about 150 US dollars, much more af-
fordable than the CyberGlove, which costs more than 17,000 US dollars per glove and
has been widely used for SLR in university laboratories [Wang et al. 2006; Fang et al.
2007; Mohandes 2010]. When a user signs a word or a sentence wearing the gloves,
the gloves send out 100 frames of sensor signals per second for both hands simultane-
ously. Sign language features can thus be extracted based on these sequential sensor
signals.
We invited six signers to participate in our data collection. Five of them were from the
deaf and hard of hearing community with sign languages as their first language. The
remaining signer was a teacher involved in educating deaf and hard of hearing students, with sign language as a second language. Each signer was asked to wear
the digital gloves, and signed all words in the vocabulary three times and the sentences
twice. To reduce the modeling complexity, in this initial study, the signers were also
required to stand during the data collection process and put both hands down before
and after signing. Most data were successfully collected except for one deaf or hard
of hearing subject, for whom only one round of vocabulary words was collected due to
an electronic device failure. There were also some random electronic failures during
sentence collection. We organized the collected samples into training, development,
and testing subsets as illustrated in Table I.
We manually segmented all word samples to identify the segment boundary points
corresponding to sign phonemes, transitions between signs, and the start and end
parts. No sentence samples were segmented.

4.2. Feature Usages and Selection


We extracted three sets of features, including FR, GA, and GA’s delta (i.e., the differ-
ences between the current and the previous frames, referred to as GA_delta), from the
raw data for both the left and right hands. No normalization was conducted since the
extracted features are naturally in the range of [−1, 1]. Decorrelation was also not
performed in this work for simplicity. In modeling, while all three feature sets were
used for the right hand, only GA was adopted as the features for the left hand. We
adopt different features for the two hands mainly because the left hand and right hand
are not equally important. Note that among the 490 distinct sign phonemes, only 241
of them required both hands in signing, and the remaining phonemes were signed
by the right hand alone. Another reason was that by deleting relatively unimportant
features, models can be more robustly trained on the given training data due to the
reduced feature dimension. As will be illustrated in Section 5.3, different feature usage
will greatly affect the system performance.
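Putting the pieces together, the per-frame observation vector used for HMM modeling can be assembled as in the sketch below, with GA_delta taken as the first difference of consecutive GA frames. The shapes are placeholders corresponding to our illustrative feature sketches above, not the exact configuration of the system.

import numpy as np

def assemble_features(fr_right, ga_right, ga_left):
    """Build the observation sequence for one recording.

    fr_right : (T, d_fr) FR features of the right (dominant) hand
    ga_right : (T, d_ga) GA features of the right hand
    ga_left  : (T, d_ga) GA features of the left (nondominant) hand
    Returns a (T, d_fr + 3*d_ga) matrix: FR + GA + GA_delta for the right hand,
    and GA only for the left hand.
    """
    ga_delta = np.vstack([np.zeros_like(ga_right[:1]), np.diff(ga_right, axis=0)])
    return np.hstack([fr_right, ga_right, ga_delta, ga_left])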

4.3. Training and Testing Procedures


With the proposed SLR solution, the system can be directly implemented by various
ASR toolkits such as HTK [Young et al. 2006] and Kaldi [Povey et al. 2011]. In this
study, we chose HTK, a popular HMM-based ASR platform, to train all the models, to build the decoding network (i.e., based on the word network illustrated in Figure 3) from sentence grammars, and to conduct testing.


Fig. 7. Manual segmentation of a word sample with GA features shown.

For system construction, we first developed the task grammars to build the de-
coding network. The grammars contain 3,599,139 sign language sentences, including
both text-based and natural sign language sentences that are supported by the set
of 510 vocabulary words. The 215 distinct testing sentences are all covered by the
grammar. Note that every sentence in the grammars is of equal probability and is
recognizable.
We trained all the models on the word samples, each containing a start segment (i.e.,
signals before the first sign phoneme begins), one or more sign phoneme segments, zero
or more transition segments, and one ending segment (i.e., signals after the last sign
phoneme ends). The sign phoneme models were trained on the segments of the corre-
sponding signs. It is feasible to directly train the corresponding transition, start, and
end models. Alternative training methods may also be designed based on the detailed
properties of the start and end segments, which may vary for data collected under different circumstances. In this study, since we require the signers to put both hands down before
and after signing, every start segment contains static signals followed by movement
signals, while every end segment contains movement signals that may or may not be
followed by static signals. The movement signals can be viewed as a type of transition.
We thus further label each start segment into a static segment and a transition seg-
ment, and label each end segment into a transition segment and a static segment, if it
exists. The manual segmentation process is illustrated in Figure 7, where the frames
of a word sample are segmented into static, transition, and sign phoneme segments.
Although only GA features are shown in Figure 7, FR features are referred to as well
in the labeling process. We train the transition model on all the transition segments,
including the ones in the start and end segments, and train an additional static model
on the static segments. The start model was built by connecting the static model with
the transition model. The end model is composed of two alternatives, one connecting
the transition model with the static model, and the other only including the transi-
tion model. All models were trained in the same way as ASR phoneme modeling. The
models are HMMs with diagonal covariance Gaussian mixtures, and are reestimated
iteratively after initialization as well as after splitting the Gaussian mixtures.
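The way the manual labels feed the different models can be summarized by the following sketch, which simply pools the frames of every labeled segment under the corresponding model name; the data structures are hypothetical, and the composite start and end models are then assembled from the static and transition HMMs exactly as described above.

from collections import defaultdict

def pool_training_segments(samples):
    """Group training frames by model name from manually labeled word samples.

    samples: list of (frames, labels) pairs, where labels is a list of
             (begin_frame, end_frame, name) tuples and name is 'static',
             'tr', or a sign phoneme name such as 'sign_hosp_1'.
    Returns a dict mapping each model name to its list of frame segments.
    """
    pools = defaultdict(list)
    for frames, labels in samples:
        for begin, end, name in labels:
            pools[name].append(frames[begin:end])
    return pools

# Each sign phoneme HMM is trained on its own pool; the universal transition
# HMM on pools['tr'], which includes the movement parts of the start and end
# segments; and the static HMM on pools['static'].  The start model is the
# static HMM followed by the transition HMM, and the end model is the
# transition HMM optionally followed by the static HMM.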
We used the tuning data listed in Table I to determine the following set of HMM
parameters: the number of states and the number of Gaussian mixtures for each HMM
state, as well as the required number of iterations of reestimation. With the optimal
configurations observed (e.g., using three-state HMM for sign phoneme models and for
the transition model), we applied the resulting SLR system on the testing data. The
experimental results are reported next.


Table II. System Performance on Testing Sentences


Distribution of the Number of Word Errors
Number of Hits | Substitutions | Insertions | Deletions | Word Accuracy (%)
3,925 | 294 | 177 | 85 | 87.4

4.4. Evaluation of System Performance


On the testing dataset of 1,024 sentences, the configured system achieved a word
accuracy (HTK [Young et al. 2006]) of 87.4%. Details about the word errors are listed
in Table II.
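For reference, the HTK word accuracy combines the counts of hits ($H$), substitutions ($S$), deletions ($D$), and insertions ($I$) in Table II as

$$\text{Word Accuracy} = \frac{H - I}{N} \times 100\%, \qquad N = H + S + D,$$

where $N$ is the total number of words in the reference transcriptions.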
The recognition accuracies achieved here are relatively high compared to the results
reported in other medium-vocabulary continuous SLR systems [Dreuw et al. 2007;
Forster et al. 2013, 2014]. It is especially encouraging since the test was based on noisy
real-world data collected from affordable devices (i.e., the low-cost digital gloves). Note
that as reported in previous studies [Forster et al. 2013], changing from ideal data
to real-world data may lead to a significant performance degradation. This robustness
enhancement in our proposed solution may be attributed to two key reasons. First, with
the proposed solution, each word in the vocabulary is guaranteed to have a certain amount of training data, 16 training word samples per word in our case, as shown in Table I.
This contributes to the performance robustness across different sentences, including
those unseen in data collection as long as they are covered by the decoding grammars.
Second, the manual segmentation of the training word samples provides relatively
accurate sign boundaries for modeling of sign phonemes as well as transition, start,
and end models compared with automatic segmentation via Viterbi forced alignment
[Viterbi 1967; Young et al. 2006]. If we discard the manual labels and conduct the forced-
alignment-based iterative training, using “flat start” [Young et al. 2006] implicitly, the testing accuracy drops to 85.0%. In other words, our 24-hour total effort to manually segment all training data led to a 2.4% absolute increase in accuracy.
We also evaluated the system efficiency. Using an ordinary desktop with a dual-core Core i5 at 3.4GHz, it took only 0.69s on average to recognize a sentence. This high decoding
efficiency is due to the fact that the proposed SLR solution models signs as well as
transitions in a framework similar to the conventional HMM-based ASR systems so
that advanced ASR techniques (e.g., the probability caching techniques to improve the
decoding efficiency [Young et al. 2006]) can be directly utilized.
Given the acceptable word accuracy, high decoding efficiency, reasonable vocabulary
size, and the use of low-cost digital gloves, we can see that it is already practical to
develop medium-scale, real-world SLR applications to benefit the deaf and hard of
hearing communities.

5. ANALYSIS AND DISCUSSIONS


In this section, we conduct a comparative study on transition modeling and discuss
the system scalability issues in Sections 5.1 and 5.2, respectively. We also analyze the
special characteristics related to modeling of the nondominant hand in Section 5.3. The
system performance on unseen signers is also investigated in Section 5.4.

5.1. A Comparative Study on Transition Modeling


In the proposed SLR solution, a universal transition model of Gaussian-mixture-based
HMM was trained on all segments of transition signals. The obtained model contained
three states and two Gaussian components per state. In other words, the transition
segment was roughly divided into three parts, each of which was clustered into two
classes represented by two Gaussians, respectively. The whole segment of the transition
signals was thus clustered into eight (i.e., 2 × 2 × 2) classes in an implicit way using


Table III. Comparison Result of Different Transition Modeling Methods


Method | Average Decoding Time per Sentence (second) | Word Accuracy (%): Total | Signer1 | Signer3 | Signer4 | Signer5 | Signer6
NoTranModel | 0.53 | 13.9 | 15.0 | 12.4 | 18.2 | 16.6 | 7.4
NoTranModel_ExtendedSigns | 0.75 | 79.2 | 69.4 | 82.5 | 90.3 | 85.2 | 68.3
UniversalTranModel_1G | 0.68 | 82.5 | 74.4 | 87.1 | 90.2 | 84.3 | 76.3
UniversalTranModel_2G | 0.69 | 87.4 | 78.8 | 91.7 | 93.1 | 86.7 | 86.4
MultiTranModels_V&M | 13.4 | 87.8 | 79.6 | 92.2 | 92.3 | 87.1 | 87.7
MultiTranModels_V&M_Prune | 0.72 | 85.8 | 78.5 | 89.4 | 89.2 | 86.7 | 85.3

the single HMM. Next, we compare the universal transition modeling (referred to
as UniversalTranModel_2G) with other approaches, such as no explicit modeling of
transition signals and utilizing multiple transition models, two popular ways reported
in previous HMM-based SLR studies [Starner et al. 1998; Fang et al. 2007; Forster
et al. 2013]. The comparison results are listed in the following.
In Table III, NoTranModel and NoTranModel_ExtendedSigns refer to two meth-
ods that conduct recognition without any transition models. The first approach sim-
ply deletes all the universal transition models inserted within/between word models in the recognition framework and then applies the modified recognizer to the test
data. The recognition accuracy obtained is as low as 13.9%, clearly showing that using
the trained sign models alone does not cover the transition regions well. The second
approach, NoTranModel_ExtendedSigns, includes neighborhood transition context in
sign modeling. It extends each training segment of sign phoneme by including the
first half of the subsequent transition segment as well as the second half of the pre-
vious transition segment. The sign phoneme HMMs are then retuned and retrained
on the corresponding tuning and training data, respectively. The sign phoneme mod-
els trained in this way are applied in the same recognition framework as in the first
approach. The second approach achieves a recognition accuracy of 79.2%, much better
than the performance of the first approach, but still 8.2% lower than the accuracy of the
proposed SLR solution. One problem of including transition signals in sign modeling
is that collecting training data with various contexts for each sign could be difficult
especially for large-vocabulary continuous SLR.
UniversalTranModel_1G refers to the approach of using a universal transition model
but involving no state Gaussian splitting in training this model. All three states of the
transition model contain one Gaussian only, and the transition signals are implicitly in-
cluded in one class. Using this simplified universal transition model, the word accuracy
achieved is surprisingly high, at about 82.5%. This illustrates the effectiveness
of the proposed SLR structure, and indicates that the sign phonemes are well
modeled, being able to support the recognition tasks even when the transition model
was relatively simple.
MultiTranModels_V&M refers to the approach of performing SLR with multiple ex-
plicit transition models, whose training procedure and usage in decoding are adapted
from the method proposed by Vogler and Metaxas in [Vogler and Metaxas 1997]. Follow-
ing this previous work, we first clustered the starting and ending points of all signs in
training data using k-means clustering with the least-squares distance criterion. Four
distinct clusters were observed. The combination of different starting/ending classes
thus leads to 16 explicit transition models in total. We then trained the 16 transition
models on the corresponding transition segments in the same way as the training of the
sign models. With multiple transition models, the simplest way to apply these models
is to replace the original universal transition model within/between word models in


the search network with a confusion set of the multiple transition models. However,
with 16 transition models, the computational load of decoding with such an enlarged
search network is overwhelming, taking minutes to decode one sentence. In Vogler and Metaxas [1997], the authors attempted to solve the efficiency problem by
constraining the recognition network to use only one transition model between any two
signs. However, we noticed that on our glove-based data, multiple types of transition
segments are often observed between two signs, or in other words, multiple transition
models are often needed for the connection. There are three types of transition seg-
ments on average between two signs on the training samples of multisign words. We
constrain the search network to allow those transition models with corresponding tran-
sition segments observed in the training data to connect two signs in every multisign
word. For the transition between single/multisign words, since we use word sample as
the training data, no data is available to estimate suitable transition models between
two words. On our tuning data of sentences, we observed that for each word, its starting/
ending positions typically have higher variance in sentences than in word samples and
the variance of the ending position is larger than that of the starting position in gen-
eral. In this study, we assume the starting points of each word in sentences are always
consistent with those in training word samples, and further constrain the transition set
between every two words in the recognition network by only allowing those transition
models ending at the starting classes of the next word that are observed in the training
data. With such a constrained recognition network, the decoding of a sentence can be
done in 13.4s on average, and a 0.4% absolute increase in accuracy can be achieved.
To increase the decoding efficiency of MultiTranModels_V&M to an acceptable level
for real applications, we further conducted pruning, that is, limiting the maximum number of active paths allowed in decoding to 5,000. This approach is referred to
as MultiTranModels_V&M_Prune. As illustrated in Table III, although the efficiency
issue is solved by pruning, the recognition accuracy decreases to 85.8%, worse than the
performance of the proposed solution UniversalTranModel_2G.
The comparison results show that explicitly training multiple transition models may bring higher recognition accuracy, but at the cost of increased complexity in decoding. To apply the approach in real-time applications, strategies have to be designed to constrain the recognition network, which can demand considerable effort,
and/or to conduct pruning. However, when applying such strategies, the possible gain
in accuracy may vanish quickly. On the other hand, our proposed simple solution of a universal transition model may provide comparable accuracy at real-time speed while
mature decoding systems (e.g., HTK and Kaldi) are directly applicable with no need to
specially constrain the recognition networks.
In this work, we did not compare with the context-dependent transition modeling
method mentioned in Vogler and Metaxas [1997] because we use word samples as
training data and thus are unable to train bisign models; nor with the transition modeling approach proposed in Fang et al. [2007], because our training transition segments are insufficient to train more than 500 transition models as in that work; nor with the threshold model solution for transition signals [Kelly et al. 2009], because combining the states of all sign models would lead to one threshold model of impractically large size; nor with the methods whose recognition architectures differ from HMM [Yang et al. 2007, 2010; Kong and Ranganath 2014], for which the efficiency for
middle/large-vocabulary SLR may remain an open issue.
5.2. Discussions on Scalability
With the proposed SLR solution, the system scales well in terms of data collection, language modeling, and decoding efficiency. When extending the existing vocabulary to a larger size, we only need to collect training word samples for the new words and enlarge the grammar to include the new words.

Table IV. Recognition Performance with Different Sizes of Vocabularies and Grammars

  Number of Words    Number of Sentences    Word            Average Decoding Time
  in Vocabulary      Covered by Grammar     Accuracy (%)    per Sentence (second)
  510                3,599,139              87.4            0.69
  86                 607                    96.2            0.07

Since the tuning sentences are only used to
adjust the system configurations, it is not necessary to increase the number of tuning
sentences when adopting larger vocabularies in continuous SLR. Regarding efficiency, since the system adopts conventional HMM-based ASR technologies, scaling up the proposed SLR solution would not lead to an exponential explosion in computational cost.
For instance, in developing this work, we started with a vocabulary of 86 words and developed grammars over this small vocabulary to cover basic conversations under one scenario, achieving an accuracy of 96.2% and a decoding speed of 0.07s per sentence. We later extended the vocabulary to 510 words to cover multiple scenarios, as presented in Section 4. The
grammar was also enlarged correspondingly. The performances for the two SLR tasks
with different vocabulary sizes are shown in Table IV.
We can see that when extending the vocabulary size from 86 to 510, the decoding
search space is significantly enlarged, from including only 607 distinct sentences to
supporting more than three million different sentences. The decoding complexity is
thus greatly increased. However, the reduction in word accuracy as well as in recog-
nition efficiency is relatively limited. The decoding speed remains real time. These
observations demonstrate the scalability of the proposed SLR solution.
When SLR is extended from medium-vocabulary to large-vocabulary tasks, a major
difficulty lies in language modeling. For large-vocabulary continuous SLR, developing
deterministic grammars, such as BNF grammars, can be infeasible and statistical n-
gram modeling [Rosenfeld 2000] will be needed. However, there is no sign language
corpus large enough for large-vocabulary n-gram modeling yet. When such corpora
become available, the n-gram models can be trained and utilized in SLR directly using
the mature n-gram-related techniques developed for ASR. The recognition efficiency of
the resulting SLR system is still expected to be real time. Note that the conventional
ASR systems can support real-time decoding for vocabulary sizes of more than 60,000
words.
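To make the language modeling step concrete, the toy sketch below trains a bigram model with add-one smoothing over sign/word sequences. It is only illustrative; a practical large-vocabulary system would rely on the mature smoothing methods (e.g., Kneser-Ney) and toolkits developed for ASR, and the corpus format shown is an assumption:

    import math
    from collections import Counter

    def train_bigram_lm(sign_sentences):
        """Train a toy bigram model with add-one smoothing over sign/word sequences.
        `sign_sentences` is a list of token lists, e.g. [["you", "want", "what"], ...]."""
        unigrams, bigrams, vocab = Counter(), Counter(), set()
        for sent in sign_sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            vocab.update(tokens)
            unigrams.update(tokens[:-1])             # history counts
            bigrams.update(zip(tokens, tokens[1:]))  # bigram counts

        def logprob(prev, word, V=len(vocab)):
            # Laplace smoothing; a real system would use Kneser-Ney or similar.
            return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + V))

        return logprob

The returned function can then score sign sequences during decoding, in the same way an n-gram language model scores word sequences in conventional ASR.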
5.3. Information Modeling for the Nondominant Hand
We also notice that a selective inclusion of the features from the nondominant hand
may be necessary for SLR. For sign languages, the usages of the dominant hand and
the nondominant hand are not equally important. For example, the most frequently used signs typically involve only the dominant hand. We would like to investigate whether
the nondominant hand should adopt a different set of features for certain cases, for
example, medium-vocabulary SLR. Since all signers in our data collection are right-
handed, for simplicity in this study, we used the right and left hands to represent the
dominant and nondominant hands, respectively.
We conducted our analysis by varying the features related to the left hand in the
feature vector. The comparison results of using different types of feature vectors are
reported next. In Table V, FR_right/left, GA_right/left, and GA_delta_right/left refer to the FR, GA, and GA_delta features for the right or left hand. FR is the finger-related feature set, while GA and GA_delta are the hand global angle-related feature sets, as defined in Sections 3.3 and 4.2.
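The three feature configurations compared in Table V can be summarized by the per-frame feature assembly sketched below; the array shapes and names are assumptions for illustration and do not reproduce the exact feature pipeline of the prototype:

    import numpy as np

    def assemble_features(fr_r, ga_r, ga_d_r, fr_l, ga_l, ga_d_l, variant="best"):
        """Assemble a per-frame feature vector for the three configurations compared
        in Table V. The inputs are 1-D arrays: fr_* are finger-related features,
        ga_* are global hand-angle features, and ga_d_* are their frame deltas,
        for the right (r) and left (l) hand."""
        if variant == "right_only":     # Table V, row 1: no left-hand information
            parts = [fr_r, ga_r, ga_d_r]
        elif variant == "best":         # Table V, row 2: add only the left global angles
            parts = [fr_r, ga_r, ga_d_r, ga_l]
        else:                           # Table V, row 3: all left-hand features
            parts = [fr_r, ga_r, ga_d_r, fr_l, ga_l, ga_d_l]
        return np.concatenate(parts)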
Table V. Comparison on Left-Hand Feature Usages

  Features Used                                                           Word Accuracy (%)
  FR_right, GA_right, GA_delta_right (no left-hand information)           77.0
  FR_right, GA_right, GA_delta_right, GA_left                             87.4
  FR_right, GA_right, GA_delta_right, FR_left, GA_left, GA_delta_left     82.8

Table VI. Results on Unseen Signers in the Cross-Validation Experiments

  Signer                                  Word Accuracy (%)
  Signer 1                                70.0
  Signer 3                                72.2
  Signer 4                                89.5
  Signer 5                                74.8
  Signer 6 (nonnative)                    58.4
  Average accuracy for all signers        73.0
  Average accuracy for native signers     76.6

We noticed that when only right-hand features are used in modeling, the system can already achieve a word accuracy of 77%. This reflects the fact that the right hand bears more information than the left hand for sign languages. Adding left-hand information does benefit the performance, as illustrated in Table V. However, selecting a subset of the left-hand features for modeling may lead to a better performance than including all
available left-hand features. In our case, the best recognition accuracy (i.e., 87.4%) was
achieved when only GA features were used for the left hand. This indicates that for
the noisy real-world data we used, the left-hand details, such as finger information, introduced confusion rather than benefit. These observations suggest that for glove-based small-to-medium vocabulary SLR tasks, replacing the left-hand glove with a simpler device with fewer sensors, or only using the right-hand glove, may be an option.
Another characteristic related to the left hand is that the left-hand signals of the
right-hand-only signs are influenced by the neighboring signs. For example, for the
right-hand-only sign “where” (i.e., the second sign in Figure 1), the left-hand signals of this sign differ between the case shown in Figure 1 and the case in which the previous and/or following signs are also right-hand-only signs. We will further investigate this
issue in the future.

5.4. Performance Investigation on Unseen Signers


This work involved six signers in data collection. Though data from the six signers is
insufficient to train robust signer-independent SLR systems, we performed a preliminary
study on the system performance for unseen signers in this subsection. We conducted
this set of experiments through cross-validation [Kohavi 1995]. For each signer, we
trained all the models based on the data from other signers and evaluated the system
performance on the testing sentences from the focused signer. The results are listed in
Table VI. Note that there is no result for the second signer because no sentences were
collected for this signer due to an electronic failure in data collection.
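The protocol is standard leave-one-signer-out cross-validation, sketched below; the training and scoring functions are placeholders, and only the data-splitting logic is illustrated:

    def leave_one_signer_out(data_by_signer, train_fn, evaluate_fn):
        """Leave-one-signer-out cross-validation. `data_by_signer` maps a signer id to
        a (training_word_samples, test_sentences) pair; `train_fn` and `evaluate_fn`
        are placeholders for model training and word-accuracy scoring."""
        results = {}
        for held_out, (_, test_sentences) in data_by_signer.items():
            # Pool the training word samples from every other signer.
            train_samples = [sample
                             for signer, (samples, _) in data_by_signer.items()
                             if signer != held_out
                             for sample in samples]
            model = train_fn(train_samples)
            results[held_out] = evaluate_fn(model, test_sentences)
        return results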
From Table VI, we can see that the word accuracy achieved varies greatly from
signer to signer, which is consistent with the results reported in previous SLR studies [Yang et al. 2010; Zafrulla et al. 2011]. The best accuracy obtained was 89.5%, reasonably close to the performance (i.e., 93.1% as in Table III) of the corresponding seen signer. It comes from a deaf or hard of hearing signer who performed the signs in a relatively standard way. Meanwhile, the worst accuracy was less than 60%, coming from the signer who uses sign language as a second language, referred to as a nonnative signer. The relatively poor recognition performance for the nonnative signer was due to the mismatch between this signer's signing patterns and those of the native signers. This is analogous to ASR, where nonnative speakers with heavy accents normally obtain low recognition accuracies. As shown in Table III, when the nonnative signer is added into the


training set, the performance of the nonnative signer becomes comparable with that of the native signers. This indicates that collecting training data from more nonnative signers may improve the system performance on unseen nonnative signers. ASR technologies developed to handle the accent issue of nonnative speakers (e.g., acoustic model adaptation techniques [Wang et al. 2003]) may also be borrowed into SLR to alleviate the nonnative issue.
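As one example of such borrowable techniques, the sketch below shows MAP adaptation of a Gaussian mean, a standard acoustic model adaptation method in ASR. It is not part of the current system, and the prior weight and array shapes are illustrative assumptions:

    import numpy as np

    def map_adapt_mean(prior_mean, frames, occupancies, tau=10.0):
        """MAP adaptation of a single Gaussian mean from adaptation data.
        `frames` has shape (T, D), `occupancies` are the per-frame state/Gaussian
        occupation probabilities (shape (T,)), and `tau` weights the prior mean."""
        frames = np.asarray(frames, dtype=float)
        gamma = np.asarray(occupancies, dtype=float)
        occ = gamma.sum()
        weighted_sum = (gamma[:, None] * frames).sum(axis=0)
        return (tau * np.asarray(prior_mean, dtype=float) + weighted_sum) / (tau + occ)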

6. CONCLUSION AND FUTURE DIRECTIONS


We propose a scalable HMM-based solution to continuous sign language recognition.
The idea lies in training a single universal transition model to effectively characterize the highly variable transition signals between signs, and in combining it with the sign phoneme models within a framework similar to conventional HMM-based ASR, so that the system can be scaled up easily without much loss in efficiency. Compared with those techniques that explicitly classify transition signals and adopt multiple transition models in HMM-based recognition, our approach handles the transition classification implicitly through Gaussian splitting, and the use of a single universal transition model leads to high decoding efficiency due to the significantly reduced search space during recognition. We also propose a new data collection method that is suitable for SLR and is scalable even for noisy real-world data.
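For context on the Gaussian splitting mentioned above, the sketch below shows the common mixture-splitting heuristic of perturbing a Gaussian mean by a fraction of its standard deviation (as in HTK-style mixture incrementing); the exact procedure and constants used in this work are not claimed here:

    import numpy as np

    def split_gaussian(mean, var, weight, perturb=0.2):
        """Split one diagonal-covariance Gaussian into two by perturbing its mean by
        +/- `perturb` standard deviations and halving its mixture weight."""
        offset = perturb * np.sqrt(np.asarray(var, dtype=float))
        mean = np.asarray(mean, dtype=float)
        return [(mean + offset, var, weight / 2.0),
                (mean - offset, var, weight / 2.0)]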
In this work, we develop a prototype system based on the proposed SLR solution for
medium-vocabulary real-world applications. A pair of low-cost digital gloves is adopted
as the signal collection devices. The sensor signals from both hands are transformed into feature vectors using a novel feature extraction method, which bypasses the process of recovering the exact hand shape, position, and orientation, and computes features online using a sequence of Hamilton products. The prototype adopts a vocabulary of
510 words from Chinese sign language. A set of grammars is designed to cover basic
general conversations and supports both text-based and natural sign language usages.
Six signers are involved in data collection.
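As a reminder of the quaternion algebra behind the online feature computation, a minimal Hamilton product sketch is given below; it illustrates the operation itself [Hamilton 1844], not the prototype's feature extraction code:

    def hamilton_product(q1, q2):
        """Hamilton product of two quaternions given as (w, x, y, z) tuples."""
        w1, x1, y1, z1 = q1
        w2, x2, y2, z2 = q2
        return (w1*w2 - x1*x2 - y1*y2 - z1*z2,
                w1*x2 + x1*w2 + y1*z2 - z1*y2,
                w1*y2 - x1*z2 + y1*w2 + z1*x2,
                w1*z2 + x1*y2 - y1*x2 + z1*w2)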
The experimental results are encouraging. Using the proposed solution, a word ac-
curacy of 87.4% is achieved on the testing sentences. The decoding speed is 0.69s per
sentence on average, that is, in real time. Note that the data used in this study is noisy real-world data, not data from ideal laboratory settings with devices like CyberGloves. The relatively high accuracy and real-time decoding achieved demonstrate
the effectiveness of the proposed SLR solution. They also indicate that developing
medium-vocabulary real-world SLR applications to benefit the deaf and hard of hear-
ing community is already feasible. Based on the prototype, we further conduct a series
of analyses. Several different strategies to model transition signals are compared and
the advantage of adopting a single transition model with multiple Gaussians is il-
lustrated. The scalability of the proposed SLR method is discussed with an example
of extending the vocabulary from 86 words to 510 words. The special characteristics
related to modeling of the nondominant hand as well as the system performance on
unseen signers are also analyzed.
In the future, when large sign language corpora become available for meaningful n-gram modeling, developing large-vocabulary continuous SLR systems using the proposed solution will be straightforward. There are still many open research issues to investigate, including the optimal selection of sign phonemes, the modeling of left-hand variability for the right-hand-only signs, and the processing of OOV words. The prototype itself can be improved in multiple ways, from glove design and feature selection to adopting additional ASR techniques, such as speaker adaptation, into SLR to benefit
the robustness and performance. Note that with the proposed solution, many advanced
ASR techniques, including deep learning, are also applicable.


ACKNOWLEDGMENTS
We thank Dr. Liu Ren, Dr. Kui Xu, Dr. Yen-Lin Chen from Bosch Research and Technology Center, North
America, Gerry Guo, Aria Jiang, Hong Luo, Ben Wang from Bosch China, Tobias Menne from RWTH Aachen
University, and Jianjie Zhang from Texas A&M University for their contributions to the whole system.

REFERENCES
Simon L. Altman. 1986. Rotations, Quaternions, and Double Groups. Dover Publications.
John Backus. 1978. Can programming be liberated from the von Neumann style?: A functional style and its
algebra of programs. Commun. ACM 21, 8 (1978), 613–641. DOI:http://dx.doi.org/10.1145/359576.359579
Lalit R. Bahl, Frederick Jelinek, and Robert L. Mercer. 1983. A maximum likelihood approach to continuous
speech recognition. IEEE Trans. Pattern Anal. Mach. Intell. 5, 2 (1983), 179–190. DOI:http://dx.doi.org/
10.1109/TPAMI.1983.4767370
James K. Baker. 1975. The DRAGON system: An overview. IEEE Trans. Acoust., Speech, Signal Process. 23, 1
(1975), 24–29. DOI:http://dx.doi.org/10.1109/TASSP.1975.1162650
Issam Bazzi. 2002. Modelling Out-of-Vocabulary Words for Robust Speech Recognition. Ph.D. dissertation.
Massachusetts Institute of Technology, Cambridge, Massachusetts.
Xiujuan Chai, Guang Li, Xilin Chen, Ming Zhou, Guobin Wu, and Hanjing Li. 2013. VisualComm: A tool to
support communication between deaf and hearing persons with the Kinect. In Proceedings of the 15th
International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS’13). ACM, New
York, NY, Article 76, 2 pages. DOI:http://dx.doi.org/10.1145/2513383.2513398
Yiqiang Chen, Wen Gao, Gaolin Fang, Changshui Yang, and Zhaoqi Wang. 2003. CSLDS: Chinese sign
language dialog system. In Proceedings of the IEEE International Workshop on Analysis and Modeling
of Faces and Gestures (AMFG’03). IEEE Computer Society, 236–237.
Collin Cherry. 1968. On Human Communications. MIT Press, Cambridge.
Helen Cooper, Brian Holt, and Richard Bowden. 2011. Sign language recognition. In Visual Analysis of
Humans. Springer, London, 539–562. DOI:http://dx.doi.org/10.1007/978-0-85729-997-0_27
Helen Cooper, Eng-Jon Ong, Nicolas Pugeault, and Richard Bowden. 2012. Sign language recognition using
sub-units. Journal Mach. Learning Res. 13, 1 (2012), 2205–2231.
Steven B. Davis and Paul Mermelstein. 1980. Comparison of parametric representations of monosyllabic
word recognition in continuously spoken sentences. IEEE Trans. Acoust., Speech, Signal Process. 28, 4
(1980), 357–366. DOI:http://dx.doi.org/10.1109/TASSP.1980.1163420
Philippe Dreuw, David Rybach, Thomas Deselaers, Morteza Zahedi, and Hermann Ney. 2007. Speech recog-
nition techniques for a sign language recognition system. Hand 60 (2007), 80.
Gaolin Fang, Wen Gao, and Debin Zhao. 2007. Large-vocabulary continuous sign language recognition based
on transition-movement models. IEEE Trans. Syst., Man, Cybern. A 37, 1 (2007), 1–9. DOI:http://dx.doi.
org/ 10.1109/TSMCA.2006.886347
Guolin Fang, Wen Gao, and Debin Zhao. 2004. Large vocabulary sign language recognition based on
fuzzy decision trees. IEEE Trans. Syst., Man, Cybern. A 34, 3 (2004), 305–314. DOI:http://dx.doi.
org/10.1109/TSMCA.2004.824852
Jens Forster, Oscar Koller, Christian Oberdörfer, Yannick Gweth, and Hermann Ney. 2013. Improving con-
tinuous sign language recognition: Speech recognition techniques and system design. In Proceedings
of the 4th Workshop on Speech and Language Processing for Assistive Technologies. Association for
Computational Linguistics, 41–46.
Jens Forster, Christoph Schmidt, Oscar Koller, Martin Bellgardt, and Hermann Ney. 2014. Extensions of the
sign language recognition and translation corpus RWTH-PHOENIX-Weather. In Proceedings of the 9th
International Conference on Language Resources and Evaluation (LREC’14). ELRA, 1911–1916.
Wen Gao, Gaolin Fang, Debin Zhao, and Yiqiang Chen. 2004. A Chinese sign language recognition system
based on SOFM/SRN/HMM. Pattern Recognition 37, 12 (2004), 2389–2402. DOI:http://dx.doi.org/
10.1016/j.patcog.2004.04.008
Lubo Geng, Xin Ma, Haibo Wang, Jason Gu, and Yibin Li. 2014. Chinese sign language recognition
with 3D hand motion trajectories and depth images. In Proceedings of the 11th World Congress on
Intelligent Control and Automation (WCICA 2014). IEEE, 1457–1461. DOI:http://dx.doi.org/10.1109/
WCICA.2014.7052933
William R. Hamilton. 1844. On a new species of imaginary quantities connected with a theory of quaternions.
Proc. Royal Irish Acad. 2, 1843 (1844), 424–434.
Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew
Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury. 2012. Deep neural


networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal
Process. Mag. 29, 6 (2012), 82–97. DOI:http://dx.doi.org/10.1109/MSP.2012.2205597
Frederick Jelinek, Lalit R. Bahl, and Robert L. Mercer. 1975. Design of a linguistic statistical decoder for the
recognition of continuous speech. IEEE Trans. Inf. Theory 21, 3 (1975), 250–256. DOI:http://dx.doi.org/
10.1109/TIT.1975.1055384
Feng Jiang, Wen Gao, Hongxun Yao, Debin Zhao, and Xilin Chen. 2009. Synthetic data generation tech-
nique in signer-independent sign language recognition. Pattern Recog. Lett. 30, 5 (2009), 513–524.
DOI:http://dx.doi.org/10.1016/j.patrec.2008.12.007
Biing-Hwang Juang, Wu Chou, and Chin-Hui Lee. 1997. Minimum classification error rate methods for
speech recognition. IEEE Trans. Speech Audio Process. 5, 3 (1997), 257–265. DOI:http://dx.doi.org/
10.1109/89.568732
Daniel Kelly, John McDonald, and Charles Markham. 2009. Recognizing spatiotemporal gestures and
movement epenthesis in sign language. In Machine Vision and Image Processing Conference 2009,
DOI:10.1109/IMVIP.2009.33.
Ron Kohavi. 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In
IJCAI 14, 2 (Aug. 1995), 1137–1145.
W. W. Kong and Surendra Ranganath. 2014. Towards subject independent continuous sign language
recognition: A segment and merge approach. Pattern Recog. 47, 3 (2014), 1294–1308. DOI:10.1016/
j.patcog.2013.09.014.
Brenda Laurel and S. Joy Mountford. 1990. The Art of Human-Computer Interface Design. Addison-Wesley
Longman Publishing Co., Inc.
Chin-Hui Lee, Lawrence R. Rabiner, Roberto Pieraccini, and Jay G. Wilpon. 1990. Acoustic modeling for
large vocabulary speech recognition. Comput. Speech Language 4, 2 (1990), 127–165. DOI:http://dx.doi.
org/10.1016/0885-2308(90)90002-N
Chin-Hui Lee, Frank K. Soong, and Kuldip K. Paliwal (eds). 1996. Automatic Speech and Speaker Recognition:
Advanced Topics, Vol. 355. Springer Science & Business Media.
Chin-Hui Lee and Qiang Huo. 2000. On adaptive decision rules and decision parameter adaptation for au-
tomatic speech recognition. Proc. IEEE 88, 8 (2000), 1241–1269. DOI:http://dx.doi.org/10.1109/5.880082
Hyeon-Kyu Lee and Jin H. Kim. 1999. An HMM-Based threshold model approach for gesture recognition.
IEEE Trans. Pattern Anal. Mach. Intell. 21, 10 (1999), 961–73. DOI:10.1109/34.799904.
Kai-Fu Lee. 1988. Large-Vocabulary Speaker-Independent Continuous Speech Recognition: The Sphinx Sys-
tem. Ph.D. Dissertation. Carnegie Mellon University, Pittsburgh, Pennsylvania.
Rung-Huei Liang and Ming Ouhyoung. 1998. A real-time continuous gesture recognition system for sign
language. In Proceedings of the 3rd IEEE International Conference on Automatic Face and Gesture
Recognition. IEEE, 558–567. DOI:http://dx.doi.org/10.1109/AFGR.1998.671007
Bruce T. Lowerre. 1976. The Harpy Speech Recognition System. Ph.D. dissertation. Carnegie-Mellon Univer-
sity, Department of Computer Science, Pittsburgh, Pennsylvania.
Mohamed A. Mohandes. 2010. Recognition of two-handed Arabic signs using the CyberGlove. In Proceedings
of the 4th International Conference on Advanced Engineering Computing and Applications in Sciences
(ADVCOMP’10). IARIA, 124–129.
Hermann Ney and Stefan Ortmanns. 2000. Progress in dynamic programming search for LVCSR. Proc. IEEE
88, 8 (2000), 1224–1240. DOI:http://dx.doi.org/10.1109/5.880081
Sylvie C. W. Ong and Surendra Ranganath. 2005. Automatic sign language analysis: A survey and
the future beyond lexical meaning. IEEE Trans. Pattern Anal. Mach. Intell. 27, 6 (2005), 873–891.
DOI:http://dx.doi.org/10.1109/TPAMI.2005.112
Cemil Oz and Ming C. Leu. 2011. American Sign Language word recognition with a sensory glove us-
ing artificial neural networks. Eng. Appl. Artific. Intell. 24, 7 (2011), 1204–1213. DOI:http://dx.doi.org/
10.1016/j.engappai.2011.06.015
Vassilis Pitsikalis, Stavros Theodorakis, Christian Vogler, and Petros Maragos. 2011. Advances in phonetics-
based sub-unit modeling for transcription alignment and sign language recognition. In Proceedings
of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
(CVPRW’11).
Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukáš Burget, Ondřej Glembek, Nagendra Goel, Mirko
Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel
Vesely. 2011. The Kaldi speech recognition toolkit. In Proceedings of the IEEE Workshop on Automatic
Speech Recognition and Understanding (ASRU’11). IEEE Signal Processing Society, 1–4.
Lawrence R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recog-
nition. Proc. IEEE 77, 2 (1989), 257–286. DOI:http://dx.doi.org/10.1109/5.18626


Lawrence R. Rabiner and Biing-Hwang Juang. 1993. Fundamentals of Speech Recognition, Vol. 14. Prentice-
Hall.
Roni Rosenfeld. 2000. Two decades of statistical language modeling: Where do we go from here? Proc. IEEE
88, 8 (2000), 1270–1278. DOI:http://dx.doi.org/10.1109/5.880083
Claude E. Shannon. 1948. A mathematical theory of communication. Bell Syst. Techn. J. 27 (July and October
1948), 379–423 and 623–656.
Ben Shneiderman. 1986. Designing the user interface-strategies for effective human-computer interaction.
Pearson Education India.
Thad Starner, Joshua Weaver, and Alex Pentland. 1998. Real-time American sign language recognition
using desk and wearable computer based video. IEEE Trans. Pattern Anal. Mach. Intell. 20, 12 (1998),
1371–1375. DOI:http://dx.doi.org/10.1109/34.735811
Oriol Vinyals, Suman V. Ravuri, and Daniel Povey. 2012. Revisiting recurrent neural networks for robust
ASR. In Proceedings of 2012 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP’12). IEEE, 4085–4088. DOI:http://dx.doi.org/10.1109/ICASSP.2012.6288816
Andrew J. Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding al-
gorithm. IEEE Trans. Inf. Theory 13, 2 (1967), 260–269. DOI:http://dx.doi.org/10.1109/TIT.1967.1054010
Christian Vogler and Dimitris Metaxas. 1997. Adapting hidden Markov models for ASL recognition by
using three-dimensional computer methods. In IEEE International Conference on Computational Cy-
bernetics and Simulation Systems, Man, and Cybernetics, Vol. 1. IEEE, 156–161. DOI:http://dx.doi.org/
10.1109/ICSMC.1997.625741
Christian Vogler and Dimitris Metaxas. 1999. Parallel hidden Markov models for American sign language
recognition. In Proceedings of the 7th IEEE International Conference on Computer Vision, Vol. 1. IEEE,
116–122. DOI:http://dx.doi.org/10.1109/ICCV.1999.791206
Beng Wang, Xiaohua Wang, Hong Luo, Liu Ren, Jianjie Zhang, Kui Xu, Yen-Lin Chen, Zhengyu Zhou, and
Wenwei Guo. 2014a. The glove to capture data for sign language recognition (in Chinese). Invention
Patent, Application No. 201410410413.7, Filed August 2014, State Intellectual Property Office of The
People’s Republic of China.
Beng Wang, Xiaohua Wang, Hong Luo, Liu Ren, Jianjie Zhang, Kui Xu, Yen-Lin Chen, Zhengyu Zhou, and
Wenwei Guo. 2014b. The glove to capture data for sign language recognition (in Chinese), Utility Patent,
Application No. CN 204044747 U, Approved in December 2014, State Intellectual Property Office of The
People’s Republic of China.
Chunli Wang, Wen Gao, and Shiguang Shan. 2002. An approach based on phonemes to large vocabulary
Chinese sign language recognition. In Proceedings of the 5th IEEE International Conference on Automatic
Face and Gesture Recognition. IEEE, 411–416. DOI:http://dx.doi.org/10.1109/AFGR.2002.1004188
Honggang Wang, Ming C. Leu, and Cemil Oz. 2006. American sign language recognition using multi-
dimensional hidden Markov models. J. Inf. Sci. Eng. 22, 5 (2006), 1109–1123.
Zhirong Wang, Tanja Schultz, and Alex Waibel. 2003. Comparison of acoustic model adaptation techniques
on non-native speech. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech,
and Signal Processing. Vol. 1, pp. I-540–I-543.
Jay G. Wilpon, Lawrence R. Rabiner, Chin-Hui Lee, and E. R. Goldman. 1990. Automatic recognition of
keywords in unconstrained speech using hidden Markov models. IEEE Trans. Acoust., Speech, Signal
Process. 38, 11 (1990), 1870–1878. DOI:http://dx.doi.org/10.1109/29.103088
Ruiduo Yang, Sudeep Sarkar, and Barbara Loeding. 2007. Enhanced level building algorithm for the move-
ment epenthesis problem in sign language recognition. In IEEE Conference on Computer Vision and
Pattern Recognition, 2007. DOI:10.1109/CVPR.2007.383347.
Ruiduo Yang, Sudeep Sarkar, and Barbara Loeding. 2010. Handling movement epenthesis and hand segmen-
tation ambiguities in continuous sign language recognition using nested dynamic programming. IEEE
Trans. Pattern Anal. Mach. Intell. 32, 3 (2010), 462–477. DOI:http://dx.doi.org/10.1109/TPAMI.2009.26
Guilin Yao, Hongxun Yao, Xin Liu, and Feng Jiang. 2006. Real time large vocabulary continuous sign
language recognition based on OP/Viterbi algorithm. In Proceedings of the 18th International Conference
on Pattern Recognition (ICPR’06), Vol. 3. IEEE, 312–315. DOI:http://dx.doi.org/10.1109/ICPR.2006.954
Steve Young, Gunnar Evermann, Mark Gales, Thomas Hain, Dan Kershaw, Xunying A. Liu, Gareth Moore,
Julian Odell, Dave Ollason, Dan Povey, Valtcho Valtchev, and Phil Woodland. 2006. The HTK Book (for HTK version 3.4). Retrieved August 20, 2011 from http://htk.eng.cam.ac.uk/.
Zahoor Zafrulla, Helene Brashear, Thad Starner, Harley Hamilton, and Peter Presti. 2011. American sign
language recognition with the Kinect. In Proceedings of the 13th International Conference on Multimodal
Interfaces (ICMI’11). ACM, New York, NY, 279–286. DOI:http://dx.doi.org/10.1145/2070481.2070532


Yu Zhou, Debin Zhao, Hongxun Yao, and Wen Gao. 2010. Adaptive sign language recognition with exem-
plar extraction and MAP/IVFS. IEEE Signal Process. Lett. 17, 3 (2010), 297–300. DOI:http://dx.doi.org/
10.1109/LSP.2009.2038251
Zhengyu Zhou. 2009. An Error Detection and Correction Framework to Improve Large Vocabulary Continuous
Speech Recognition. Ph.D. dissertation. The Chinese University of Hong Kong, Hong Kong, China.
Zhengyu Zhou, Tobias Menne, Kehuang Li, Kui Xu, and Zhe Feng. 2015. System and method for automated
sign language recognition. Invention Patent (Provisional), Application No. 62148204, Filed April, 2015.

Received April 2015; revised November 2015; accepted November 2015
