Staticsign CNN

Keywords: Artificial Neural Network, ASL, Convolutional Neural Network, Deep Learning, GPU, PDNN, Pyttsx, Theano

I. INTRODUCTION
In daily life, communication between different communities depends heavily on human translation services. Involving human expertise is difficult and expensive, so automatic sign language recognition makes it possible to understand the meaning of different signs without help from expert interpreters.
In general, a sign language recognition system contains several modules: object tracking, skin segmentation, feature extraction, and recognition. The first two modules extract and locate the hands in the video frames; the later modules perform feature extraction, classification, and recognition. Fig. 1 shows a general system architecture overview for an SLR system. Based on the segmented hands, we can extract hand shape and orientation based features. Finally, classifiers are trained to recognize the signs. A sign language to speech converter bridges the communication gap between hearing people and deaf or mute people.
Sign language recognition is still a challenging problem despite many research efforts during the last few decades. It requires understanding a combination of multi-modal information such as hand pose and movement, facial expression, and human body posture. Moreover, even the same signs have significantly different appearances for different signers and different viewpoints.
In this paper, we focus on American Sign Language (ASL) recognition from static depth images. There are more than a hundred sign languages in use around the world. ASL is used throughout the U.S. and Canada, as well as other regions of the world, including western regions of Africa and southeastern regions of Asia. Approximately 500,000 people use ASL as a primary language in the U.S. Fig. 2 shows the ASL alphabets and numbers. The visual similarity of different signs makes recognition difficult, so it has become a challenging area in computer vision. Depth sensors enable us to capture additional information to improve accuracy and/or processing time.
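The keyword list above names Pyttsx as the text-to-speech component of such a converter. As a minimal, illustrative sketch (assuming the pyttsx3 package, the Python 3 fork of Pyttsx, is installed; this is not the exact code used in this work), speaking a recognized label could look like this:

import pyttsx3  # Python 3 fork of the Pyttsx library named in the keywords

def speak_sign(label):
    """Speak a recognized sign label (e.g. 'A' or '5') out loud."""
    engine = pyttsx3.init()      # initialize the default text-to-speech driver
    engine.say("Recognized sign " + str(label))
    engine.runAndWait()          # block until speech has finished

speak_sign("A")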
Also, with the recent improvement of GPUs, CNNs have been applied to many computer vision problems. The reason is the reduced training and testing time on a GPU compared to a CPU. So in this work a fast, fully parameterizable GPU configuration of a CNN is used to train on the hand gestures for good feature extraction and classification.

A. Motivation
The various advantages of building such a system include:
• A sign-to-text/speech translation system or dialog system for use in specific public domains such as airports, post offices, or hospitals.
• SLR can help translate video to text or speech, enabling communication between hearing and deaf people.

B. Problem Statement
Sign language uses many gestures, so it looks like a movement language consisting of a series of hand and arm motions. There are different sign language standards for different countries. It should also be noted that some unknown words are translated by simply showing the gesture for each alphabet in the word.
In addition, sign language includes a specific gesture for each alphabet in the English dictionary and for each number between 0 and 9. Based on this, sign languages are made up of two groups, namely static gestures and dynamic gestures. Static gestures are used for alphabet and number representation, whereas dynamic gestures are used for specific concepts and also include words, sentences, etc. A static gesture consists of a pose of the hand, whereas the latter includes motion of the hands, the head, or both. Sign language is a visual language and consists of three major components: finger-spelling, word-level sign vocabulary, and non-manual features. Finger-spelling is used to spell words letter by letter, whereas the latter two are keyword based.
The design of a sign language translator remains quite challenging despite many research efforts during the last few decades. It requires understanding a combination of multi-modal information such as hand pose and movement, facial expression, and human body posture. Moreover, even the same signs have significantly different appearances for different signers and different viewpoints.

C. Objectives
The main objective of this project is to contribute to the field of automatic sign language recognition. We focus on the recognition of static sign language gestures. This work takes a deep learning approach to recognize 24 alphabets and the numbers 0-9. We created a convolutional neural network classifier that can recognize static sign language gestures with high accuracy. We trained the network under different configurations and analyzed and tabulated the obtained results. The results show that accuracy improves as we include more data from different subjects during training. We have also created a simple Java GUI application to test our classifier.

Fig. 2. ASL finger alphabets and numbers

II. LITERATURE REVIEW
Byeongkeun et al. proposed real-time sign language fingerspelling recognition from depth maps using convolutional neural networks [1]. The work focuses on static fingerspelling in American Sign Language, a small but important part of sign language recognition. Even though they used depth sensors, which enable them to capture additional information to improve accuracy and processing time, their Caffe architecture is very complicated [1]. A method has also been proposed for implementing a sign language to text/voice conversion system without handheld gloves and sensors, by capturing gestures continuously and converting them to voice; in this method only a few images were captured for recognition. The design of a communication aid for the physically challenged [2] has been created as a prototype.
That system was developed in the MATLAB environment. It consists of two main phases, a training phase and a testing phase. In the training phase the author used a feed-forward neural network with 200 neurons in the hidden layer and 10 in the output layer, which takes 58 epochs to train. In the testing phase, real-time footage of a sign language gesture is captured, segmented, and then compared with the database created. If a match is found by the neural network, the text output of the corresponding gesture is produced.
The problem of interpreting hand-gesture patterns includes the following issues:
1. Identifying and tracking the characteristics of hand gestures.
2. Training the captured gestures using a feed-forward neural network.
3. Segmentation of the hand gestures, which form a continuous stream.
4. Interpretation of the attribute patterns constituting the gestural segment.
5. Integrating the concurrent attributes as a whole.

Sruthi Upendran et al. [3] introduced an "American Sign Language Interpreter System for Deaf and Dumb Individuals". The discussed procedures could recognize 20 out of 24 static ASL alphabets. The alphabets A, M, N and S could not be recognized due to occlusion problems, and only a limited number of images were used.
The same can be implemented with an optimized approach using the well-known Viola-Jones algorithm with LBP features for hand gesture recognition in a real-time environment. Using this algorithm, an Indian Sign Language interpreter with an Android implementation was created [4]. The advantage of this approach is that it takes less computational power to detect the gestures.
Another work in this field created a sign language recognition system using pattern matching [5]. The main aim of that work is to create a system which works on sign language recognition. Many researchers have already introduced various sign language recognition systems implemented using different techniques and methods. The proposed system focuses on an approach in which the SLR system works on signs as well as text (understandable both by deaf and mute persons and by hearing persons). The main task is performed in two ways by the system: it takes input from the user in the form of text, which is then matched with the corresponding sign, and vice versa.
The first way is when the user gives the input as text: the system matches it against the already created database entries and their corresponding signs, and then outputs that sign to the requesting user. The same technique is used to process letters and numbers as well as words and, eventually, phrases. The second way involves image processing [4]: the input given by another user as a sign (in image format) is processed by the system on the basis of the outer portion of the fingers and the hand region of the image. If the sign is valid, the system generates its text form, which is shown on screen to the user.
Chenyang Zhang and Yingli Tian et al. [5] presented multi-modality American Sign Language recognition. The main features of the system are twofold: 1) it considers multiple signal modalities, including the sequence of depth images, RGB image-based hand shapes, facial expression attributes, and key-point detections for both body joints and facial landmarks; 2) by learning from signing sequences performed by fluent ASL signers and annotations provided by professional linguists, their system can recognize different components such as English words and special ASL grammar components, such as facial expressions or head movements that carry grammatical meaning within sentences.
Real-time tracking [16] of gestures from hand movement is more difficult than face recognition [6]. A comparison of still and moving image recognition has been done in [7]. Sign language recognition has also been tried extensively using mathematical models; [8] explains the deployment of a Support Vector Machine for sign language recognition.

III. IMPLEMENTATION
A. Dataset
For the system implementation, 33,000 images were collected from the available dataset, captured with a Creative Senz3D depth camera at a resolution of 320x240. Compared to RGB images, more information can be collected from depth images, so the proposed system takes advantage of the depth images to attain maximum efficiency.
The dataset consists of 1,000 images for each of the 33 different hand signs from five subjects. The 33 hand signs include all the fingerspellings of both alphabets and numbers except J and Z, which require temporal information for classification. Since (2/V) and (6/W) are differentiated based on context, only one class is used to represent both one alphabet and one number in each pair.
The collected dataset images need to be modified according to our needs. Our aim was to create a lightweight CNN classifier that can be used with resource-constrained embedded devices, so we downscaled the dataset images to 28x28 grayscale images. This helps to reduce the number of input nodes in the first layer; we use only 784 features from each image for both training and testing. For simplicity, we pickled all the images using Python's pickle module.
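As an illustrative sketch of this preprocessing step (not the exact scripts used in this work), the downscaling and pickling could be done with Pillow, NumPy, and the standard pickle module; the directory layout, file names, and [0, 1] scaling below are assumptions:

import os
import pickle

import numpy as np
from PIL import Image

def preprocess_images(src_dir, out_file, size=(28, 28)):
    """Downscale every image in src_dir to 28x28 grayscale and pickle the result."""
    samples = []
    for name in sorted(os.listdir(src_dir)):
        img = Image.open(os.path.join(src_dir, name)).convert("L")  # grayscale
        img = img.resize(size)                                      # 28x28 -> 784 features
        # scale pixel values to [0, 1]; this normalization choice is ours
        samples.append(np.asarray(img, dtype=np.float32).flatten() / 255.0)
    with open(out_file, "wb") as f:
        pickle.dump(np.stack(samples), f)                           # one (N, 784) array

# e.g. preprocess_images("depth_images/A", "train_A.pkl")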
Fig. 3. Convolution with single layer

B. Classification
Architecture: In our work we used the PDNN implementation of the CNN. PDNN is a Python deep learning toolkit developed on top of the Theano environment [14]. The architecture consists of a single convolutional layer with 20 feature maps, a local filter of size 5x5, and a pooling size of 2x2. The input to the architecture is one feature map with a dimension of 28x28. The output of the network is flattened, and the number of targets (output classes) is 33. We trained the network with fully connected hidden layers ranging from 1 to 4, using a learning rate of 0.1.
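The actual experiments were run with PDNN; purely as an illustration, an equivalent single-convolution network (one 28x28 input map, 20 feature maps with 5x5 filters, 2x2 pooling, a flattened output, one fully connected hidden layer, and 33 softmax outputs) can be sketched in Keras as follows. The hidden-layer width and the activation functions are assumptions, not values taken from this work:

from tensorflow import keras
from tensorflow.keras import layers

def build_model(hidden_units=128):
    """Sketch of the single-convolution architecture described above.

    hidden_units and the tanh/softmax activations are illustrative assumptions;
    only the 20 feature maps, 5x5 filter, 2x2 pooling, 33 classes, and the 0.1
    learning rate come from the description above.
    """
    model = keras.Sequential([
        layers.Input(shape=(28, 28, 1)),               # one 28x28 input feature map
        layers.Conv2D(20, (5, 5), activation="tanh"),  # 20 feature maps, 5x5 local filter
        layers.MaxPooling2D((2, 2)),                   # 2x2 pooling
        layers.Flatten(),                              # flatten before the FC layers
        layers.Dense(hidden_units, activation="tanh"), # one of the 1-4 FC hidden layers
        layers.Dense(33, activation="softmax"),        # 33 output classes
    ])
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model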
Feature Extraction: We extracted a 784-element feature vector from each preprocessed depth image. The images were then grouped into training, validation, and testing sets, and each set was pickled.
Training: We trained and tested the neural networks in twenty-four different operating modes, varying the number of hidden layers from 1 to 4 and the number of epochs from 500 to 1000 in each case; the number of nodes in each hidden layer also varies. Out of the 1,000 samples of each sign, we used 900 images for the training set, 60 for validation, and 40 for testing. The data sets were pickled and used for training. Figure 4 shows the flowchart of the training process. After the model was trained, the network parameters were saved so that they could be used in the testing phase, both to measure the accuracy of the model and to classify input symbols.
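A small sketch of the 900/60/40 per-class split and the pickling of the three sets is given below; the array and file names are placeholders rather than those used in this work, and the assumption is that each class has exactly 1,000 samples:

import pickle
import numpy as np

def split_and_pickle(images, labels, out_prefix="asl"):
    """Split the 1,000 samples per sign into 900 train / 60 validation / 40 test
    sets and pickle each set, mirroring the split described above."""
    train, valid, test = [], [], []
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]      # indices of this sign's samples
        train.append(idx[:900])
        valid.append(idx[900:960])
        test.append(idx[960:1000])
    for name, parts in (("train", train), ("valid", valid), ("test", test)):
        sel = np.concatenate(parts)
        with open(f"{out_prefix}_{name}.pkl", "wb") as f:
            pickle.dump((images[sel], labels[sel]), f)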
[14] http://deeplearning.net/software/theano/
[15] http://note.sonots.com/SciSoftware/haartraining.html
[16] Anju M. Nair, S. Joshua Daniel, "Design of Wireless Sensor Networks for Pilgrims Tracking and Monitoring", International Journal of Innovations in Scientific and Engineering Research (IJISER), Vol. 1, No. 2, pp. 82-87, 2014.