Sign Language Recognition Using Deep Learning and Computer Vision
I. Introduction
People with impaired speech and hearing use sign language as their primary form of communication. They use sign language gestures as a tool of non-verbal communication to express their emotions and thoughts to others. However, people unfamiliar with sign language find these gestures difficult to understand, so trained sign language interpreters are needed during medical and legal appointments and in educational and training sessions. Over the past few years, the demand for such services has grown. Other services, such as video remote human interpreting over a high-speed Internet connection, have been introduced to provide easy-to-use sign language interpretation; these services are useful and beneficial, yet they still have major limitations.
To address this, we use a custom CNN model to recognize sign language gestures. A convolutional neural network of 11 layers is constructed: four convolution layers, three max-pooling layers, two dense layers, one flattening layer and one dropout layer. We use the American Sign Language MNIST dataset, which contains the features of different augmented gestures, to train the model to identify each gesture. The custom CNN (Convolutional Neural Network) model is then used to identify the sign from a video frame captured with OpenCV.
Initially, the feature-extracted dataset is used to train the custom 11-layer model at a default image size.
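A minimal sketch of such an 11-layer network in Keras is shown below, assuming 28x28 grayscale inputs and 24 output classes; the filter counts, kernel sizes and exact layer ordering are illustrative choices rather than values reported in this paper, and the Adam optimizer and categorical cross-entropy loss are common choices for this setup.

# Hedged sketch of an 11-layer CNN: 4 conv, 3 max-pool, 1 flatten,
# 1 dropout and 2 dense layers. Hyperparameters are illustrative.
from tensorflow.keras import layers, models

def build_sign_cnn(input_shape=(28, 28, 1), num_classes=24):
    model = models.Sequential([
        # four convolution layers interleaved with three max-pooling layers
        layers.Conv2D(32, (3, 3), activation='relu', padding='same',
                      input_shape=input_shape),
        layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        # one flatten layer, two dense layers and one dropout layer
        layers.Flatten(),
        layers.Dense(256, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model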
The rest of this paper is organized as follows: Section 2 summarizes the literature survey; Section 3 describes the dataset and its characteristics; Section 4 overviews the structure of the proposed model; Section 5 presents the experiments and observations; Section 6 discusses the issues faced by the model; and Section 7 outlines possible future developments.
The part left out during this process contained the palm with the gesture. However, since this dataset had several consistency issues, it was not used for training the CNN. Unsupervised learning was employed using the K-means clustering algorithm, with SIFT mapping and Gaussian masks used to extract features and train on the dataset. The final accuracy was over 90%.
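As a rough illustration of that feature pipeline, the hedged sketch below extracts SIFT descriptors after Gaussian smoothing and clusters them with K-means into a bag of visual words; the cluster count, smoothing kernel and function names are assumptions for illustration, not the cited work's settings.

# Sketch: SIFT descriptors + K-means clustering (bag of visual words).
import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_bag_of_words(gray_images, n_clusters=50):
    sift = cv2.SIFT_create()
    per_image = []
    for img in gray_images:
        # Gaussian smoothing before keypoint detection
        blurred = cv2.GaussianBlur(img, (5, 5), 0)
        _, desc = sift.detectAndCompute(blurred, None)
        per_image.append(desc if desc is not None
                         else np.empty((0, 128), np.float32))
    # Cluster all descriptors into visual words (unsupervised step)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10)
    kmeans.fit(np.vstack(per_image))
    # Build a histogram of visual-word occurrences per image
    hists = np.zeros((len(gray_images), n_clusters), np.float32)
    for i, desc in enumerate(per_image):
        if len(desc):
            for w in kmeans.predict(desc):
                hists[i, w] += 1
    return hists, kmeans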
The sign recognition in [2] is accomplished with PCA (Principal Component Analysis); recognition with neural networks is also proposed in the paper. The data was acquired with a 3 MP camera, so the image quality was poor, and the dataset contains only 15 images per sign. The results were not satisfactory because of this considerably small dataset. Simple boundary pixel analysis was performed by segmenting the images and separating the RGB components. The authors note that better output can be achieved with neural networks than with the results obtained by combining the fingertip algorithm with PCA.
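For illustration, the sketch below shows one hedged way to build a PCA-based recognizer of this kind with scikit-learn, pairing the projection with a simple nearest-neighbour classifier; the component count and classifier choice are assumptions for illustration and are not taken from [2].

# Sketch: PCA projection of flattened gesture images + 1-NN classification.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

def train_pca_classifier(images, labels, n_components=20):
    X = np.asarray([img.ravel() for img in images], dtype=np.float32)
    pca = PCA(n_components=n_components)
    X_reduced = pca.fit_transform(X)
    clf = KNeighborsClassifier(n_neighbors=1).fit(X_reduced, labels)
    return pca, clf

def predict_sign(pca, clf, image):
    x = pca.transform(image.ravel().reshape(1, -1).astype(np.float32))
    return clf.predict(x)[0]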
The paper by Nandy et al. [3] classifies gestures by splitting the data into segmented features and employing Euclidean distance and K-Nearest Neighbours. Similar work by Kumud et al. [4] shows how to perform continuous recognition. That paper proposes extraction of frames from videos, data pre-processing, extraction of key frames and other features, recognition, and optimization. Pre-processing is accomplished by converting the video into RGB frames of the same dimensions. Skin colour segmentation in the HSV colour space was used to extract the skin region, which was then converted to binary form. Key frames are extracted by computing the gradient between frames, and features are extracted with an orientation histogram. Classification is achieved with several distance measures such as Euclidean, Manhattan and chessboard distance.
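The HSV skin segmentation step can be sketched as follows; the threshold values and the morphological clean-up are illustrative assumptions rather than the settings used in [4].

# Sketch: HSV skin-colour segmentation producing a binary mask.
import cv2
import numpy as np

def skin_binary_mask(frame_bgr):
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    lower = np.array([0, 40, 60], dtype=np.uint8)     # assumed lower skin bound
    upper = np.array([25, 255, 255], dtype=np.uint8)  # assumed upper skin bound
    mask = cv2.inRange(hsv, lower, upper)
    # Remove small noise and fill holes with morphological operations
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask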
The MNIST data consists of 60,000 training images and 10,000 testing images. Half of the training set and half of the test set were taken from NIST's training images, while the remaining halves were drawn from NIST's testing images. The American Sign Language collection of hand-gesture images poses a multi-class classification problem with 24 classes of letters; J and Z are excluded because they require dynamic, motion-based gestures.
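Assuming the commonly distributed CSV layout of the Sign Language MNIST data (one label column followed by 784 pixel columns for 28x28 grayscale images), a hedged loading sketch looks like this; the file names are placeholders, and the labels are remapped to a compact 0-23 range because label 9 (J) never occurs in the static alphabet.

# Sketch: load Sign Language MNIST CSVs into normalized image tensors
# and one-hot labels for the 24 static letter classes.
import numpy as np
import pandas as pd
from tensorflow.keras.utils import to_categorical

def load_sign_mnist(csv_path):
    df = pd.read_csv(csv_path)
    labels = df['label'].values
    # Label 9 (J) is absent, so shift labels above 9 down by one to get
    # 24 contiguous classes (0-23) covering A-Y without J and Z.
    labels = np.where(labels > 9, labels - 1, labels)
    pixels = df.drop(columns=['label']).values
    images = pixels.reshape(-1, 28, 28, 1).astype('float32') / 255.0
    return images, to_categorical(labels, num_classes=24)

# Placeholder file names for the locally stored training and test CSVs.
x_train, y_train = load_sign_mnist('sign_mnist_train.csv')
x_test, y_test = load_sign_mnist('sign_mnist_test.csv')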
As shown above in Figure 4, the image is first segmented from the video input captured by the webcam. Frames are cropped from the video with a region of interest (a threshold square box) to avoid background conflicts. A custom CNN model with 11 layers is used. The gesture image segmented from the video frame is converted to grayscale, because the model is trained on grayscale features: the MNIST dataset is a pre-processed dataset of RGB images that have been converted to grayscale. The converted image is then scaled to the size of the images with which the model was trained. After scaling and transformation, the image is fed into the pre-trained custom CNN model. The gesture prediction obtained from the CNN model is classified based on its categorical label, and the classified gesture is displayed as text.
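A hedged sketch of this inference loop with OpenCV is given below; the ROI coordinates, saved model file name and label ordering are illustrative assumptions rather than values taken from the paper.

# Sketch: webcam capture -> ROI crop -> grayscale -> resize -> CNN -> text.
import cv2
import numpy as np
from tensorflow.keras.models import load_model

LABELS = list('ABCDEFGHIKLMNOPQRSTUVWXY')   # 24 static letters (no J, Z)
model = load_model('sign_cnn.h5')           # placeholder for the trained model

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    x0, y0, x1, y1 = 100, 100, 300, 300     # threshold square box (ROI)
    roi = frame[y0:y1, x0:x1]
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)   # match grayscale training data
    small = cv2.resize(gray, (28, 28)) / 255.0     # match training image size
    pred = model.predict(small.reshape(1, 28, 28, 1), verbose=0)
    letter = LABELS[int(np.argmax(pred))]
    cv2.rectangle(frame, (x0, y0), (x1, y1), (0, 255, 0), 2)
    cv2.putText(frame, letter, (x0, y0 - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
    cv2.imshow('Sign Language Recognition', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()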
The model lacks accuracy on noisy images cropped from the video frame. Performance was also below expectations when a person wore ornaments such as rings, since the dataset used to train the model was clean and did not include any ornaments.
IX. Conclusion
This paper introduces a CNN-based approach for the recognition and classification of sign language using computer vision. Compared with the other approaches discussed, this approach yields better accuracy and considerably fewer false positives. Other possible extensions of this work, including dynamic gesture recognition [VIII], are being carried out.
References
[1] Aditya Das, Shantanu Gawde, Khyati Suratwala and Dhananjay Kalbande, "Facial Expression Recognition from Video Sequences: Temporal and Static Modelling", Computer Vision and Image Understanding, Vol. 91, February 2018.
[2] Zafar Ahmed Ansari and Gaurav Harit, "Nearest Neighbour Classification of Indian Sign Language Gestures using Kinect Camera", Sadhana, Vol. 41, No. 2, February 2016, pp. 161-182.
[3] Anup Nandy, Jay Shankar Prasad, Soumik Mondal, Pavan Chakraborty and G. C. Nandi, "Recognition of Isolated Indian Sign Language Gesture in Real Time", Communications in Computer and Information Science (CCIS), Vol. 70.
[4] Kumud Tripathi, Neha Baranwal and G. C. Nandi, "Continuous Dynamic Indian Sign Language Gesture Recognition with Invariant Backgrounds", 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI).
[5] Diederik P. Kingma and Jimmy Ba, "Adam: A Method for Stochastic Optimization", 3rd International Conference on Learning Representations (ICLR), San Diego, 2015.
[6] S. Tamura and S. Kawasaki, "Recognition of Sign Language Motion Images", Pattern Recognition, Vol. 21, pp. 343-353, 1988.