
International Journal of Pure and Applied Mathematics
Volume 117 No. 20 2017, 9-15
ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version)
url: http://www.ijpam.eu
Special Issue

Automatic Sign Language Finger Spelling Using Convolution Neural Network: Analysis

Beena M.V., Asst. Professor, CSE Dept., Vidya Academy of Science and Technology, Thrissur - 680501, India
Dr. M.N. Agnisarman Namboodiri, Dean, P G Studies, Vidya Academy of Science and Technology, Thrissur - 680501, India

Abstract—Very few people understand sign language. Moreover, contrary to popular belief, it is not an international language, which further complicates communication between the deaf community and the hearing majority. The alternative of written communication is cumbersome, because the deaf community is generally less skilled at writing a spoken language; for example, when an accident occurs it is often necessary to communicate quickly with the emergency physician, and written communication is not always possible. The purpose of this work is to contribute recognition of American Sign Language (ASL), with maximum efficiency, to the field of automatic sign language recognition. This paper focuses on the recognition of static ASL gestures collected from a Kinect sensor. The most challenging part in the design of an automatic sign language translator is a good classifier that can classify the input static gestures with high accuracy. In the proposed system, the classifier for sign language recognition is a CNN operating on Kinect depth images. We trained CNNs for the classification of 24 alphabets and the numbers 0-9 using 33,000 images, trained the classifier with different parameter configurations, and tabulated the results. Compared with previous literature, the proposed classifier attained an accuracy of 94.6774%. We also created a simple Java GUI application to test the classifier. We designed the network to be lightweight so that it can be incorporated easily into embedded devices with limited resources. The results show that accuracy improves as more data from different subjects is included during training.

Keywords: Artificial Neural Network, ASL, Convolutional Neural Network, Deep Learning, GPU, PDNN, Pyttsx, Theano

I. INTRODUCTION

In daily life, communication between different communities depends heavily on human translation services. Involving human expertise for translation is difficult and expensive. Automatic sign language recognition makes it possible to understand the meaning of different signs without help from expert persons.

In general, a sign language recognition system contains several modules: object tracking, skin segmentation, feature extraction, and recognition. The first two modules extract and locate the hands in the video frames; the remaining modules perform feature extraction, classification, and recognition. Fig. 1 shows a general system architecture for an SLR system. Based on the segmented hands, we can extract hand shape and orientation based features. Finally, classifiers are trained to recognize the signs. The sign-language-to-speech converter narrows the communication gap between hearing people and deaf or mute people.

Fig. 1. General system architecture

Sign language recognition is still a challenging problem despite many research efforts during the last few decades. It requires understanding a combination of multi-modal information such as hand pose and movement, facial expression, and human body posture. Moreover, even the same signs have significantly different appearances for different signers and different viewpoints.

In this paper, we focus on American Sign Language (ASL) recognition from static depth images. There are more than a hundred sign languages in the world. ASL is used throughout the U.S. and Canada, as well as in other regions of the world, including western Africa and southeastern Asia. Approximately 500,000 people use ASL as a primary language in the U.S. Fig. 2 shows the ASL alphabets and numbers. The visual similarity of different signs makes recognition difficult, which makes this a challenging computer vision task. Depth sensors enable us to capture additional information to improve accuracy and/or processing time.


Also, with the recent improvement of GPUs, CNNs have been applied to many computer vision problems. The reason is the reduced training and testing time on a GPU compared with a CPU. In this work, a fast, fully parameterizable GPU configuration of a CNN is therefore used to train on the hand gestures for good feature extraction and classification.

A. Motivation

The advantages of building such a system include:
• Sign-to-text/speech translation or dialog systems for use in specific public domains such as airports, post offices, or hospitals.
• SLR can help translate video to text or speech, enabling communication between hearing and deaf people.

B. Problem Statement

Sign language uses many gestures, so it resembles a movement language consisting of a series of hand and arm motions. There are different standards for sign languages in different countries. Note also that some unknown words are translated by simply showing the gesture for each letter in the word. In addition, sign language includes a specific gesture for each letter of the English alphabet and for each number between 0 and 9. On this basis, sign languages are made up of two groups, namely static gestures and dynamic gestures. Static gestures are used for alphabet and number representation, whereas dynamic gestures are used for specific concepts; the latter also include words, sentences, etc. A static gesture is a pose of the hand, whereas a dynamic gesture involves motion of the hands, the head, or both. Sign language is a visual language and consists of three major components: finger-spelling, word-level sign vocabulary, and non-manual features. Finger-spelling is used to spell words letter by letter, whereas the other components are keyword based.

The design of a sign language translator remains quite challenging despite many research efforts during the last few decades. It requires understanding a combination of multi-modal information such as hand pose and movement, facial expression, and human body posture. Moreover, even the same signs have significantly different appearances for different signers and different viewpoints. This work focuses on the creation of a static sign language translator using a Convolutional Neural Network. We created a lightweight network that can be used with embedded devices having limited resources.

C. Objectives

The main objective of this project is to contribute to the field of automatic sign language recognition. We focus on the recognition of static sign language gestures, taking a deep learning approach to recognize 24 alphabets and the numbers 0-9. We created a convolutional neural network classifier that recognizes static sign language gestures with high accuracy. We trained the network under different configurations and analyzed and tabulated the results, which show that accuracy improves as more data from different subjects is included during training. We also created a simple Java GUI application to test the classifier.

II. LITERATURE REVIEW

Fig. 2. ASL finger alphabets and numbers

Byeongkeun et al. proposed real-time sign language finger-spelling recognition using convolutional neural networks on depth maps [1]. The work focuses on static finger spelling in American Sign Language, a small but important part of sign language recognition. Even though the depth sensor enabled them to capture additional information to improve accuracy and processing time, their Caffe architecture is very complicated.

A method for implementing a sign-language-to-text/voice conversion system without handheld gloves and sensors, by capturing gestures continuously and converting them to voice, has been created as a prototype communication aid for the physically challenged [2]. In this method only a few images were captured for recognition. The system was developed under the MATLAB environment and consists of two phases, a training phase and a testing phase. In the training phase the authors used a feed-forward neural network with 200 neurons in the hidden layer and 10 in the output layer, which took 58 epochs to train. In the testing phase, real-time footage of a sign language gesture is captured, segmented, and compared with the created database. If a match is found by the neural network, the text output of the corresponding gesture is produced.

Pattern interpretation of hand gestures involves the following issues:
1. Identifying and tracking the characteristics of hand gestures.
2. Training the captured gestures using a feed-forward neural network.
3. Segmentation of the hand gestures, which form a continuous stream.
4. Interpretation of the attribute patterns constituting the gestural segment.
5. Integrating the concurrent attributes as a whole.

Sruthi Upendran et al. [3] introduced an "American Sign Language Interpreter System for Deaf and Dumb Individuals". The discussed procedures could recognize 20 out of 24 static ASL alphabets; the alphabets A, M, N and S could not be recognized due to an occlusion problem, and only a limited number of images were used.


The same can be implemented using an optimized approach: the well-known Viola-Jones algorithm with LBP features for hand gesture recognition in a real-time environment. Using this algorithm, an Indian sign language interpreter with an Android implementation was created [4]. The advantage of this approach is that it takes less computational power to detect the gestures.

Another work related to this field created a sign language recognition system using pattern matching [11]. The main aim of that work is a system which operates on signs as well as text (understandable by deaf and mute persons and also by hearing persons). Many researchers have already introduced various sign language recognition systems implemented with different techniques and methods. The system performs its main task in two ways. The first way is when the user gives text as input: the system matches it against the already created database entries and their corresponding signs, and then outputs that sign to the requesting user. The same technique is used to process letters and numbers as well as words, and eventually phrases. The second way involves image processing [4]: the input given by another user as a sign (in image format) is processed by the system on the basis of the outer portions of the fingers and hands in the image. If the sign is valid, the system generates its text format and outputs it on screen to the user.

Chenyang Zhang, Yingli Tian et al. [5] presented multi-modality American Sign Language recognition. The main features of the system are twofold: 1) it considers multiple signal modalities, including sequences of depth images, RGB image-based hand shapes, facial expression attributes, and key point detections for both body joints and facial landmarks; 2) by learning from signing sequences performed by fluent ASL signers and annotations provided by professional linguists, the system can recognize different components such as English words and special ASL grammar components, such as facial expressions or head movements that carry grammatical meaning within sentences.

Real-time tracking of gestures from hand movement [16] is more difficult than face recognition [6]. A comparison of still and moving image recognition has been done in [7]. Sign language recognition has also been extensively attempted using mathematical models; [8] explains the deployment of a Support Vector Machine for sign language recognition.

III. IMPLEMENTATION

A. Dataset

For the system implementation, 33,000 images were collected from the available dataset, captured using a Creative Senz3D depth camera at a resolution of 320x240. Compared with RGB images, more information can be collected from depth images, so the proposed system takes advantage of Kinect depth images to attain maximum efficiency.

The dataset consists of 1,000 images for each of 33 different hand signs from five subjects. The 33 hand signs include all the finger spellings of both alphabets and numbers, except J and Z, which require temporal information for classification. Since (2/V) and (6/W) are differentiated based on context, a single class is used to represent each of these alphabet/number pairs.

The collected dataset images need to be modified according to our needs. Our aim was to create a lightweight CNN classifier that can be used with resource-constrained embedded devices, so we downscaled the dataset images to 28x28 grayscale images. This helps reduce the number of input nodes in the first layer: we use only 784 features from each image for both training and testing. For simplicity we pickled all the images using Python's pickle function.
Fig. 3. Convolution with a single layer
B. Classification

Architecture: In our work we used the PDNN implementation of the CNN. PDNN is a Python deep learning toolkit developed under the Theano environment [14]. The architecture consists of a single convolutional layer with 20 feature maps, a local filter of size 5x5, and a pooling size of 2x2. The input to the architecture is one feature map with a dimension of 28x28. The output of the network is flattened, and the number of targets or output classes is 33. We trained the network with fully connected (FC) hidden layers ranging in number from 1 to 4, using a learning rate of 0.1.
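For illustration, the forward pass of this architecture can be sketched in plain numpy. This is a reading of the textual description above (28x28 input, one 5x5 convolutional layer with 20 maps, sigmoid activations, 2x2 max pooling, one FC hidden layer, 33-way softmax), not the PDNN source; the hidden-layer size of 500 and all weight names are assumptions:

    # Plain-numpy sketch of the described forward pass; not PDNN code.
    import numpy as np

    rng = np.random.RandomState(0)
    W_conv = rng.randn(20, 5, 5) * 0.1            # 20 local 5x5 filters
    W_fc = rng.randn(20 * 12 * 12, 500) * 0.1     # FC hidden layer (500 nodes assumed)
    W_out = rng.randn(500, 33) * 0.1              # 33 output classes

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def forward(x):  # x: 28x28 preprocessed depth image
        # Valid 5x5 convolution -> 20 feature maps of 24x24.
        conv = np.zeros((20, 24, 24))
        for k in range(20):
            for i in range(24):
                for j in range(24):
                    conv[k, i, j] = np.sum(x[i:i+5, j:j+5] * W_conv[k])
        conv = sigmoid(conv)
        # Non-overlapping 2x2 max pooling -> 20 maps of 12x12.
        pooled = conv.reshape(20, 12, 2, 12, 2).max(axis=(2, 4))
        # Flatten, FC hidden layer, then 33-way softmax output.
        h = sigmoid(pooled.ravel().dot(W_fc))
        return softmax(h.dot(W_out))

    probs = forward(rng.rand(28, 28))   # predicted class: probs.argmax()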
Feature Extraction: We extracted a 784-element feature vector from each preprocessed depth image. The images were then grouped into training, validation, and testing sets, and each set was pickled.

Training: We train and test neural networks in twenty-four different operating modes. We trained the model by varying the number of hidden layers from 1 to 4 and by varying the number of epochs from 500 to 1000 in each case; the number of nodes in each hidden layer also varies. Out of the 1,000 samples of each sign, we used 900 images for the training set, 60 for validation, and 40 for testing. The data sets were pickled and used for training. Figure 4 shows the flowchart of the training process. After the model was trained, the net parameters were saved so that they could be used in the testing phase, both for testing the accuracy of the model and for classification of the input symbols.
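The per-class 900/60/40 split and the pickling can be sketched as follows (a minimal example; array and file names are hypothetical):

    # Sketch of the per-class 900/60/40 split and pickling described above.
    # `vectors` is assumed to hold the 1,000 preprocessed 784-element
    # vectors of one hand sign; the pickle file name is hypothetical.
    import pickle
    import numpy as np

    def split_class(vectors, label):
        labels = np.full(len(vectors), label)
        train = (vectors[:900], labels[:900])
        valid = (vectors[900:960], labels[900:960])
        test = (vectors[960:1000], labels[960:1000])
        return train, valid, test

    # After concatenating the splits of all 33 classes:
    # with open('asl_depth.pkl', 'wb') as f:
    #     pickle.dump((train_set, valid_set, test_set), f)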


Fig. 4. Training phase

Testing: In the testing phase, the accuracy of the trained model is evaluated. The saved network parameters are loaded, the test dataset is run through the model, and the accuracy is determined. The same method was used for all test cases, and the obtained accuracies were tabulated. A simple Java GUI application was created to test our classifier for the purpose of static sign language translation. The application allows the user to select images of the static sign language gestures that need to be classified. It uses our trained CNN to classify these symbols and produce their corresponding labels. The labels are then turned into their corresponding alphabets/numbers, which are grouped to form words or sentences. The words/sentences are then spoken out using the Python pyttsx text-to-speech module. A screenshot of the GUI is shown in Fig. 5.

Fig. 5. GUI interface of the application
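The text-to-speech step can be as simple as the following pyttsx usage; the recognized string here is a placeholder:

    # Speaking out the recognized words with pyttsx, as described above.
    # `recognized_text` stands in for the grouped classifier output.
    import pyttsx

    recognized_text = "HELLO"      # e.g. labels grouped into a word
    engine = pyttsx.init()         # select the platform's TTS driver
    engine.say(recognized_text)    # queue the utterance
    engine.runAndWait()            # block until speech finishes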
of mathematical operations theano includes CUDA code generators
IV. EXPERIMENTAL RESULTS

As mentioned in Sec. III, we train and test under twenty-four different experimental settings. The results are shown in Fig. 6. Our system achieves 94.6774% accuracy when the training and validation data contain samples corresponding to the test subject, with a training set of 33,000 images, a validation set of 1,980 images, and a test set of 1,320 images. In this experiment we used 98% of the images in the dataset: out of the 1,000 images belonging to a class, 900 are used for training, 60 for validation, and 40 for testing.

Fig. 6. Accuracy table

It is clear that the accuracy of the model can be improved by increasing the number of samples in the training set. In our work, a GPU-enabled system with Theano is used for training. Theano offers features for complicated optimization algorithms such as conjugate gradient, CNNs, etc. For fast calculation and implementation of mathematical operations, Theano includes CUDA code generators and n-dimensional (dense) arrays located in GPU memory, with Python bindings. The processing time is about 5 seconds per epoch using an Nvidia GeForce GTX 970M. Without the GPU, each epoch took about 1.30 minutes on an Intel 6th-generation i7 processor, so the training time was drastically reduced by using Theano with the GPU.
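Moving such a training run to the GPU in Theano is a matter of setting its device flags before the library is imported; a minimal sketch (the older backend's device name is 'gpu', while newer backends use 'cuda'):

    # Requesting GPU execution in Theano; THEANO_FLAGS must be set
    # before the import. floatX=float32 is needed for GPU execution.
    import os
    os.environ['THEANO_FLAGS'] = 'device=gpu,floatX=float32'
    import theano
    print(theano.config.device)   # confirm which device was selected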
We have trained the model for 33 symbols. We considered all number of epochs. We have achieved a high accuracy of 94.6774%
possible combinations of subjects for training, validation, and test for the model with 1 hidden layer and 700 epoch. We have used up
and the final reported accuracy is the average of all. We have trained to 4 hidden layers. The hidden layer configuration used for our
the model with different configurations. We have trained the experiment are, for one hidden layer we have used 5x5 convolution
model up to 4 hidden layers. For each hidden layer we have mask with 20 local filters and a max-pooling layer of 2x2.The accuracy
used epochs ranging from 500 to 1000. For all the cases we vs hidden layer plot is shown in fig.8. The calculated values of
have used sigmoidal function as the activation function. We precision and recall of alphabets and numerals for the model
have used a learning rate of 0.1 for all of the above cases. with the highest accuracy is shown in fig. 9.
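The per-class precision and recall reported in Fig. 9 follow the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)); a small numpy sketch with hypothetical prediction arrays:

    # Standard per-class precision and recall, as reported in Fig. 9.
    # `y_true` and `y_pred` are hypothetical arrays of class indices 0..32.
    import numpy as np

    def precision_recall(y_true, y_pred, n_classes=33):
        precision = np.zeros(n_classes)
        recall = np.zeros(n_classes)
        for c in range(n_classes):
            tp = np.sum((y_pred == c) & (y_true == c))
            fp = np.sum((y_pred == c) & (y_true != c))
            fn = np.sum((y_pred != c) & (y_true == c))
            precision[c] = tp / float(tp + fp) if tp + fp else 0.0
            recall[c] = tp / float(tp + fn) if tp + fn else 0.0
        return precision, recall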


Fig. 7. Accuracy vs epoch

Fig. 8. Accuracy vs number of hidden layers

Fig. 9. Precision and recall
V. CONCLUSION

Communication between deaf-mute and hearing people has always been a challenging task. The goal of this project is to reduce this communication barrier by contributing to the field of automatic sign language recognition.

Through this work, a CNN classifier capable of recognizing static sign language gestures was constructed, and a basic GUI application was created to test the classifier. The application allows users to select static sign gestures as input, and it speaks out the words or sentences corresponding to the gestures. We trained our model for 33 symbols, which include alphabets and numbers, and achieved an accuracy of 94.6774% with our CNN classifier.

REFERENCES

[1] Kang, Byeongkeun, Subarna Tripathi, and Truong Q. Nguyen. "Real-time sign language fingerspelling recognition using convolutional neural networks from depth map." arXiv preprint arXiv:1509.03001 (2015).
[2] Suganya, R., and T. Meeradevi. "Design of a communication aid for physically challenged." In Electronics and Communication Systems (ICECS), 2015 2nd International Conference on, pp. 818-822. IEEE, 2015.
[3] Sruthi Upendran and Thamizharasi A. "American Sign Language Interpreter System for Deaf and Dumb Individuals." 2014 International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT), 978-1-4799-4190-2, IEEE, 2014.
[4] Pramada, Sawant, Deshpande Saylee, Nale Pranita, Nerkar Samiksha, and M. S. Vaidya. "Intelligent Sign Language Recognition Using Image Processing." IOSR Journal of Engineering (IOSRJEN) 3, no. 2 (2013): 45-51.
[5] Chenyang Zhang, Yingli Tian, and Matt Huenerfauth. "Multi-modality American Sign Language recognition." ICIP 2016, 978-1-4673-9961-6.
[6] Thad Starner and Alex Pentland. "Real-time American Sign Language recognition from video using hidden Markov models." IEEE, 1995, pp. 265-271.
[7] D. Kumarage, S. Fernando, P. Fernando, D. Madushanka, and R. Samarasinghe. "Real-time sign language recognition using still image comparison & motion recognition." Sixth International Conference on Industrial and Information Systems, Sri Lanka, 2011.
[8] M.S. Sinith, Soorej G. Kamal, Nisha B., Nayana S., Kiran Surendran, and Jith P. S. "Sign language recognition using support vector machine." International Conference on Advances in Computing and Communications, IEEE, 2012, pp. 122-129.
[9] Yeo, Hui-Shyong, Byung-Gook Lee, and Hyotaek Lim. "Hand tracking and gesture recognition system for human-computer interaction using low-cost hardware." Multimedia Tools and Applications 74, no. 8 (2015): 2687-2715.
[10] Garg, Pragati, Naveen Aggarwal, and Sanjeev Sofat. "Vision based hand gesture recognition." World Academy of Science, Engineering and Technology 49, no. 1 (2009): 972-977.
[11] Amruta S. Talreja, Darshana Tekade, Shailesh Bharad, and Lovely Mutneja. "Sign Language Recognition System by Pattern Matching." International Journal of Innovative and Emerging Research in Engineering, vol. 2, issue 2, 2015.
[12] Nagarajan, S., and T. S. Subashini. "Static hand gesture recognition for sign language alphabets using edge oriented histogram and multi class SVM." International Journal of Computer Applications 82, no. 4 (2013).
[13] Pigou, Lionel, Sander Dieleman, Pieter-Jan Kindermans, and Benjamin Schrauwen. "Sign language recognition using convolutional neural networks." In Computer Vision - ECCV 2014 Workshops, pp. 572-578. Springer International Publishing.


[14] http://deeplearning.net/software/theano/
[15] http://note.sonots.com/SciSoftware/haartraining.html
[16] Anju M. Nair and S. Joshua Daniel. "Design of Wireless Sensor Networks for Pilgrims Tracking and Monitoring." International Journal of Innovations in Scientific and Engineering Research (IJISER), Vol. 1, No. 2, pp. 82-87, 2014.

