Sign Language Recognition Using Convolutional Neural Networks
1 Introduction
Very few people understand sign language. Moreover, contrary to popular belief,
it is not an international language. Obviously, this further complicates commu-
nication between the Deaf community and the hearing majority. The alternative
of written communication is cumbersome, because the Deaf community is gen-
erally less skilled in writing a spoken language [17]. Furthermore, this type of
communication is impersonal and slow in face-to-face conversations. For example, when an accident occurs, it is often necessary to communicate quickly with the emergency physician, and written communication is not always possible in such situations.
The purpose of this work is to contribute to the field of automatic sign
language recognition. We focus on the recognition of the signs or gestures. There
are two main steps in building an automated recognition system for human
actions in spatio-temporal data [15]. The first step is to extract features from
the frame sequences. This will result in a representation consisting of one or
more feature vectors, also called descriptors. This representation will aid the
computer to distinguish between the possible classes of actions. The second step
is the classification of the action. A classifier will use these representations to
discriminate between the different actions (or signs). In our work, the feature
extraction is automated by using convolutional neural networks (CNNs). An
artificial neural network (ANN) is used for classification.
2 Related Work
In our work, we build on the results of Roel Verschaeren [18]. He proposes a CNN
model that recognizes a set of 50 different signs in the Flemish Sign Language
with an error of 2.5%, using the Microsoft Kinect. Unfortunately, this work is
limited in the sense that it considers only a single person in a fixed environment.
In [19], an American Sign Language (ASL) recognition system is presented with a vocabulary of 30 words. Appearance-based representations and a hand tracking system are combined and classified with a hidden Markov model (HMM). An error rate of 10.91% is achieved on the RWTH-BOSTON-50 database.
The approach in [4] uses the Microsoft Kinect to extract appearance-based
hand features and track the position in 2D and 3D. The classification results are
obtained by comparing a hidden Markov model (HMM) approach with sequential
pattern boosting (SP-boosting). This resulted in an accuracy of 99.9% on 20
different isolated gestures on their specifically constructed data set and 85.1%
on a more realistic one with 40 gestures.
The Microsoft Kinect is also used in [2], which proposes a recognition system for 239 words of the Chinese Sign Language (CSL). Here, the 3D movement trajectory of the hands is used, together with a language model, to construct sentences. This trajectory is aligned and matched with a gallery of known trajectories. The top-1 and top-5 recognition rates are 83.51% and 96.32%, respectively.
Fig. 1. Data set of the CLAP14 gesture spotting challenge [5]: (a) RGB, (b) depth map, (c) user index, (d) skeleton
3 Methodology
3.1 Data
We use the data set from the ChaLearn Looking at People 2014 (CLAP14) challenge [5] in this work, more specifically Track 3: Gesture Spotting. This data set consists of 20 different Italian gestures, performed by 27 users with variations in
surroundings, clothing, lighting and gesture movement. The videos are recorded
with a Microsoft Kinect. As a result, we have access to the depth map, user
index (location of the user in the depth map) and the joint positions (Figure 1).
We use 6600 gestures in the development set of CLAP14 for our experiments:
4600 for the training set and 2000 for the validation set. The test set of CLAP14
is also considered as the test set for this work and consists of 3543 samples. The
users and backgrounds in the validation set are not contained in the training
set. The users and backgrounds in the test set can occur in the training and the
validation set.
3.2 Preprocessing
Our first step in the preprocessing stage is to crop the highest hand and the upper body using the given joint information. We found the highest hand to be the most informative: if both hands are used, they perform the same (mirrored) movement, and if only one hand is used, it is always the highest one. If the left hand is used, the video is mirrored, so the model only needs to learn one side.
The preprocessing results in four video samples (hand and body with depth
and gray-scale) of resolution 64x64x32 (32 frames of size 64x64). Furthermore,
the noise in the depth maps is reduced with thresholding, background removal
using the user index, and median filtering. The outcome is shown in Figure 2.
Fig. 2. Preprocessing
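To make these steps concrete, the following is a minimal sketch of such a preprocessing pipeline for the hand crop, written with NumPy and SciPy. The function name, the crop size, the depth threshold and the array layout are illustrative assumptions, not the exact values used in this work; the same steps would be repeated for the upper-body crop to obtain the four 64x64x32 streams.

import numpy as np
from scipy.ndimage import median_filter, zoom

def preprocess_hand_crop(gray, depth, user_index, hand_centers,
                         crop_size=128, depth_threshold=2000):
    # gray, depth, user_index: arrays of shape (frames, height, width).
    # hand_centers: per-frame (y, x) position of the highest hand, taken
    # from the Kinect skeleton joints. Clips signed with the left hand are
    # assumed to be mirrored (np.flip along the width axis) beforehand.
    half = crop_size // 2
    frames = []
    for t in range(gray.shape[0]):
        # Background removal with the user index, followed by depth
        # thresholding and median filtering to reduce the depth-map noise.
        d = np.where(user_index[t] > 0, depth[t], 0)
        d = np.where(d < depth_threshold, d, 0)
        d = median_filter(d, size=3)

        # Crop a square region around the highest hand.
        y = int(np.clip(hand_centers[t][0], half, gray.shape[1] - half))
        x = int(np.clip(hand_centers[t][1], half, gray.shape[2] - half))
        g_crop = gray[t, y - half:y + half, x - half:x + half]
        d_crop = d[y - half:y + half, x - half:x + half]

        # Rescale both channels to 64x64.
        scale = 64.0 / crop_size
        frames.append(np.stack([zoom(g_crop, scale), zoom(d_crop, scale)]))

    # Shape (32, 2, 64, 64) for a 32-frame sample: gray-scale and depth.
    return np.stack(frames)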
3.3 Convolutional Neural Networks
CNNs (based on [13]) are feature extraction models in deep learning that have recently proven to be very successful at image recognition [12], [3], [20], [7]. The models are now used by various industry leaders such as Google, Facebook and Amazon, and researchers at Google have recently applied CNNs to video data [11].
CNNs are inspired by the visual cortex of the human brain. The artificial
neurons in a CNN will connect to a local region of the visual field, called a
receptive field. This is accomplished by performing discrete convolutions on the
image with filter values as trainable weights. Multiple filters are applied for each
channel, and together with the activation functions of the neurons, they form
feature maps. This is followed by a pooling scheme, where each feature map is downsampled so that only its most salient information is retained. These operations are performed in multiple layers, as shown in Figure 3.
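As a toy illustration of these two building blocks (not the architecture of Figure 3), the sketch below applies one trainable filter to an image, a ReLU activation and a 2x2 max pooling, using plain NumPy:

import numpy as np

def conv2d_valid(image, kernel):
    # Discrete 2D convolution in 'valid' mode; the kernel values are the
    # trainable weights. (The kernel flip is omitted, as in most
    # deep-learning implementations.)
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    # Non-overlapping max pooling: keep only the strongest activation in
    # each size x size region of the feature map.
    h = feature_map.shape[0] - feature_map.shape[0] % size
    w = feature_map.shape[1] - feature_map.shape[1] % size
    fm = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return fm.max(axis=(1, 3))

image = np.random.rand(64, 64)                  # one input channel
kernel = 0.04 * np.random.randn(5, 5)           # trainable filter weights
feature_map = np.maximum(conv2d_valid(image, kernel), 0.0)  # ReLU activation
pooled = max_pool(feature_map)                  # 60x60 -> 30x30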
The learning rate is initialized at 0.003 and decreased by 5% after each epoch. The weights of the
CNNs are randomly initialized with a normal distribution with μ = 0 and
σ = 0.04, and σ = 0.02 for the weights of the ANN. The biases of the CNNs are
initialized at 0.2 and the biases of the ANN at 0.1.
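A minimal sketch of this initialization scheme and learning-rate schedule, assuming plain NumPy and placeholder layer shapes (the 5% decrease is interpreted here as multiplicative decay):

import numpy as np

rng = np.random.RandomState(0)

def init_conv_layer(filter_shape):
    # CNN layers: weights drawn from N(0, sigma=0.04), biases set to 0.2.
    # filter_shape = (num_filters, channels, height, width) is a placeholder.
    return rng.normal(0.0, 0.04, size=filter_shape), np.full(filter_shape[0], 0.2)

def init_dense_layer(n_in, n_out):
    # ANN (fully connected) layers: weights from N(0, sigma=0.02), biases 0.1.
    return rng.normal(0.0, 0.02, size=(n_in, n_out)), np.full(n_out, 0.1)

def learning_rate(epoch, initial=0.003, decay=0.05):
    # Learning rate starts at 0.003 and decreases by 5% after each epoch.
    return initial * (1.0 - decay) ** epoch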
Experiments are conducted on a single machine with a hexa-core processor (Intel Core i7-3930K), 32 GB of SDRAM and an NVIDIA GeForce GTX 680 GPU with 4096 MB of memory. The models are implemented using the Python libraries Theano [1] and PyLearn2 [8], which provides the fast implementation of 2D convolutions by Alex Krizhevsky [12].
4 Results
Our most notable experiments are the models with ReLUs, dropout, LCN and
data augmentation. The validation results of these experiments are shown in
Table 1. We observe a validation accuracy of 91.70% (8.30% error rate) for our
best model. Furthermore, ReLUs prove to be very effective with an improvement
of 23.8% with respect to tanh units.
The accuracy on the test set is 95.68%, and we observe a 4.13% false positive rate, caused by noise movements (movements that do not belong to any of the 20 gesture classes). Note that the test result is higher than the validation result, because the validation set, unlike the test set, does not contain users and backgrounds that also appear in the training set.
The final score for the CLAP14 competition is the mean Jaccard index over all gestures and video samples. The Jaccard index measures the overlap in frames between the predicted and the ground-truth gesture intervals. The validation score of our best model is 0.789675 and the final score is 0.788804, which ranks us fifth among the 17 qualified teams.
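As a sketch of how such a score can be computed for one gesture in one video, assuming binary per-frame masks for the prediction and the ground truth (the exact evaluation format of the challenge may differ):

import numpy as np

def jaccard_index(prediction, ground_truth):
    # Frame-wise Jaccard index: |intersection| / |union| of the frames
    # labelled with a given gesture in the prediction and the ground truth.
    prediction = np.asarray(prediction, dtype=bool)
    ground_truth = np.asarray(ground_truth, dtype=bool)
    union = np.logical_or(prediction, ground_truth).sum()
    if union == 0:
        return 0.0
    return np.logical_and(prediction, ground_truth).sum() / float(union)

# Example: ground truth spans frames 10-39, the prediction spans frames 15-44.
gt = np.zeros(100, dtype=bool); gt[10:40] = True
pred = np.zeros(100, dtype=bool); pred[15:45] = True
print(jaccard_index(pred, gt))   # 25 / 35, approximately 0.714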
5 Conclusion
This work shows that convolutional neural networks can be used to accurately
recognize different signs of a sign language, with users and surroundings not
included in the training set. This generalization capacity of CNNs in spatio-
temporal data can contribute to the broader research field on automatic sign
language recognition.
References
1. Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., Bengio, Y.: Theano: a CPU and GPU math expression compiler. In: Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral presentation
2. Chai, X., Li, G., Lin, Y., Xu, Z., Tang, Y., Chen, X., Zhou, M.: Sign Language Recognition and Translation with Kinect (2013). http://vipl.ict.ac.cn/sites/default/files/papers/files/2013FG_xjchai_Sign_Language_Recognition_and_Translation_with_Kinect.pdf
3. Cireşan, D., Meier, U., Schmidhuber, J.: Multi-column deep neural networks for
image classification. In: IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR), pp. 3642–3649. IEEE (2012)
4. Cooper, H., Ong, E.J., Pugeault, N., Bowden, R.: Sign language recognition using
sub-units. The Journal of Machine Learning Research 13(1), 2205–2231 (2012)
5. Escalera, S., Baró, X., Gonzàlez, J., Bautista, M.A., Madadi, M., Reyes, M., Ponce, V., Escalante, H.J., Shotton, J., Guyon, I.: ChaLearn Looking at People Challenge 2014: Dataset and results. In: ECCV Workshops (2014)
6. Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, vol. 15, pp. 315–323 (2011). http://eprints.pascal-network.org/archive/00008596/
7. Goodfellow, I.J., Bulatov, Y., Ibarz, J., Arnoud, S., Shet, V.: Multi-digit number
recognition from street view imagery using deep convolutional neural networks
(2013). arXiv preprint arXiv:1312.6082
8. Goodfellow, I.J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu,
R., Bergstra, J., Bastien, F., Bengio, Y.: Pylearn2: a machine learning research
library (2013). arXiv preprint arXiv:1308.4214. http://arxiv.org/abs/1308.4214
9. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.:
Improving neural networks by preventing co-adaptation of feature detectors (2012).
arXiv preprint arXiv:1207.0580
10. Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.: What is the best multi-stage architecture for object recognition? In: IEEE 12th International Conference on Computer Vision, pp. 2146–2153 (2009). http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5459469
11. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-
scale video classification with convolutional neural networks. In: CVPR (2014)
12. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS), pp. 1–9 (2012). http://books.nips.cc/papers/files/nips25/NIPS2012_0534.pdf
13. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to
document recognition. Proceedings of the IEEE 86(11) (1998)
14. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pp. 807–814 (2010)
15. Poppe, R.: A survey on vision-based human action recognition. Image and Vision
Computing 28(6), 976–990 (2010)
16. Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initializa-
tion and momentum in deep learning. In: Proceedings of the 30th International
Conference on Machine Learning (ICML 2013), pp. 1139–1147 (2013)
17. Van Herreweghe, M.: Prelinguaal dove jongeren en nederlands: een syntactisch
onderzoek. Universiteit Gent, Faculteit Letteren en Wijsbegeerte (1996)
18. Verschaeren, R.: Automatische herkenning van gebaren met de microsoft kinect
(2012)
19. Zaki, M.M., Shaheen, S.I.: Sign language recognition using a combination of new
vision based features. Pattern Recognition Letters 32(4), 572–577 (2011)
20. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks (2013). arXiv preprint arXiv:1311.2901