Abstract—Gesture recognition plays an important role in communication through sign language. It is a fast growing domain within computer vision and has attracted significant research owing to its widespread social impact. To tackle the difficulties faced by the hearing impaired, it is the need of the hour to develop a system which translates sign language into text that can be easily understood. In this paper, a static hand gesture recognition system is developed for American Sign Language using a deep Convolutional Neural Network. The system architecture is lightweight, making the system easily deployable and mobile. In order to achieve high accuracy in live scenarios, we employ a number of image processing techniques which assist in appropriate background subtraction and frame segregation. Our approach focuses on mobility, zero cost and easy deployment in low computational environments. Our system achieved a testing accuracy of 96%.

Index Terms—Sign Recognition, Gesture Recognition, Computer Vision, Convolutional Neural Networks.

I. INTRODUCTION

Communication is the imparting, sharing and conveying of information, news, ideas and feelings. Among its forms, sign language is a way of non-verbal communication which is gaining impetus and a strong foothold due to its applications in a large number of fields. The most prominent application of this method is its usage by differently abled persons such as deaf and mute people, who can communicate with non-signing people without the help of a translator or interpreter. Other applications lie in the automotive sector, the transit sector, the gaming sector, and in unlocking a smartphone [1]. Sign gesture recognition can be done in two ways: static gesture and dynamic gesture [2]. While communicating, a static gesture makes use of hand shapes, while a dynamic gesture makes use of the movements of the hand [2]. Our paper focuses on static gestures. Hand gesture recognition is a way of understanding and then classifying the movements made by the hands. But human hands have very complex articulations with the human body, and therefore a lot of errors can arise [3]; thus it is tough to recognize hand gestures. Our paper focuses on detecting and recognizing hand gestures using different methods, finding the accuracy achieved by those methods, and examining the performance, convenience and issues related to each method. Currently a lot of methods and technologies are being used for sign and gesture recognition. Among them, the most common are Hand Glove Based Analysis, Microsoft Kinect Based Analysis, Support Vector Machines and Convolutional Neural Networks. One objective of these methods is to bridge the communication gap between speech and hearing impaired people and other people, and to enable the successful and smooth integration of these differently abled people into society. In our research paper we build a real time communication system using the advancements in Machine Learning. Currently, the systems in existence either work on a small dataset and achieve stable accuracy, or work on a large dataset with unstable accuracy. We try to resolve this problem by applying a Convolutional Neural Network (CNN) to a fairly large dataset to achieve good and stable accuracy.

II. LITERATURE SURVEY

In order to bridge the communication gap between hearing and speech impaired people and others, different approaches have been used by researchers for the recognition of various hand gestures. These approaches can be broadly divided into three categories: the Hand Segmentation Approach, the Gesture Recognition Approach and the Feature Extraction Approach.

Two categories of visual-based hand gesture recognition can be used. The first one is a 3-D hand gesture model that works by comparing input frames and makes use of sensors such as gloves and helmets [4]. The other one is Microsoft Kinect based analysis, which makes use of the Kinect camera; the Kinect hardware gives accurate tracking of several user joints. The 3-D hand gesture model requires a huge dataset and also has a higher hardware cost due to the sensors involved.
We propose a computer vision based approach to recognize static hand gestures. The system analyzes a video feed, recognizes the hand gesture and outputs the correct class label. For each sign performed by the user, the system outputs one of 36 class labels comprising the ASL gestures for alphabets and numbers. The system takes in a video feed, which can be pre-recorded or come live from an input device, segregates each sign and outputs the sign label accordingly. The major challenges identified were performing precise background subtraction and achieving accuracy high enough for the system to be used in formulating sentences. Background subtraction required tackling changing illumination and foreground noise in the input images; operations such as Gaussian Mixture based segmentation and image morphology are performed in order to reduce background noise. Our system architecture therefore employs three phases, viz. frame segregation, image processing and image recognition. Fig. 1 illustrates the system block diagram.
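Before detailing each phase, the sketch below shows how the three phases could fit together in code. It is a minimal illustration, not the paper's implementation: the helper names (segregate_frames, subtract_and_denoise, mask_and_grayscale) are assumptions, sketched in the subsections that follow, and a trained Keras model with a 28 * 28 grayscale input is assumed as described in Section V-C.

    # A minimal end-to-end sketch of the three-phase pipeline (assumed names,
    # not from the paper): frame segregation -> image processing -> recognition.
    import cv2

    def recognize_signs(video_path, model, labels):
        """Classify each distinct sign frame found in a video feed."""
        predictions = []
        for frame in segregate_frames(video_path):           # phase 1: frame segregation
            mask = subtract_and_denoise(frame)               # phase 2a: background subtraction
            gray = mask_and_grayscale(frame, mask)           # phase 2b: feature-preserving mask
            small = cv2.resize(gray, (28, 28)) / 255.0       # match the CNN input size
            probs = model.predict(small[None, :, :, None])   # phase 3: CNN recognition
            predictions.append(labels[int(probs.argmax())])
        return predictions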
V. IMPLEMENTATION

A. Frame Segregation

Frame segregation is the first stage, which involves identifying the frames that contain the sign gesture and segregating those frames for further processing and recognition. In order to extract each individual frame from the video feed, we perform frame by frame comparison by computing the Structural Similarity Index (SSIM) between two adjacent frames, and based upon a threshold we select the distinct frames. The video feed taken into consideration is captured from a webcam at a resolution of 900 * 900 and 23 fps. The feed is reduced down to 12 fps, and the user is given an ROI in which to perform the sign gesture, so the final images after cropping are of size 300 * 300. For two images x and y, SSIM is calculated according to equation 1.

SSIM(x, y) = ((2 μx μy + C1)(2 σxy + C2)) / ((μx^2 + μy^2 + C1)(σx^2 + σy^2 + C2))    (1)

where μx is the average of x, μy is the average of y, σx^2 is the variance of x, σy^2 is the variance of y, σxy is the covariance of x and y, and C1 = (k1 L)^2 and C2 = (k2 L)^2 are two variables that stabilize the division when the denominator is weak. L represents the dynamic range of the pixel values, and k1 = 0.01 and k2 = 0.03 by default [12].
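As a quick worked example of the stabilizing constants: for 8-bit images the dynamic range is L = 255, giving C1 = (0.01 * 255)^2 ≈ 6.50 and C2 = (0.03 * 255)^2 ≈ 58.52.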
Fig. 2: SSIM between intermediate frames

An SSIM of 1 indicates perfect similarity. Through testing, the threshold value of SSIM for two images to be considered distinct was identified to be 0.45. Fig. 2 illustrates the SSIM between intermediate frames. As is evident from the calculated values, the SSIM between two similar frames is greater than the threshold value. For dissimilar frames such as frame 1 and frame 2, the calculated SSIM is less than the threshold value, but since the transition is from background to foreground, frame 2 is not considered distinct. On the contrary, frame 5 is considered a distinct frame, as it moves from foreground to background.
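As a concrete illustration of this selection rule, the following sketch computes the SSIM between consecutive ROI crops with scikit-image and keeps a frame only when the score drops below the 0.45 threshold. It is a simplified version: the ROI coordinates and function names are assumptions, and the background-to-foreground transition check described above is omitted.

    # Simplified frame segregation via SSIM (assumed ROI box; the paper's
    # transition check for background/foreground frames is not reproduced).
    import cv2
    from skimage.metrics import structural_similarity as ssim

    SSIM_THRESHOLD = 0.45           # threshold identified in the paper
    ROI = (100, 100, 400, 400)      # assumed (x1, y1, x2, y2), a 300 * 300 crop

    def segregate_frames(video_path):
        """Return the distinct sign frames from a video feed."""
        cap = cv2.VideoCapture(video_path)
        distinct, last_kept = [], None
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            x1, y1, x2, y2 = ROI
            crop = frame[y1:y2, x1:x2]                       # 300 * 300 ROI crop
            gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
            if last_kept is not None and ssim(last_kept, gray) >= SSIM_THRESHOLD:
                continue                                     # too similar: same gesture frame
            distinct.append(crop)                            # low SSIM: a new distinct frame
            last_kept = gray
        cap.release()
        return distinct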
B. Image Processing

After the frame segregation phase, the next phase is processing the output frames. The first step is to perform background subtraction. Since our system is mobile, the input images vary a lot in terms of background and lighting conditions. Background subtraction is performed using the Gaussian Mixture based background segmentation algorithm, which uses a mixture of K Gaussian distributions to model each background pixel [13][14]. The output frames are compared against the background model, and a resulting image is obtained after the subtraction. Plain background subtraction was ruled out due to its low tolerance to dynamic conditions. Noise reduction was then performed, since some observable noise was present in the resulting image. The two main types of noise observed were spatial noise due to motion and salt-and-pepper noise due to changes in lighting conditions. In order to remove the spatial noise, low pass spatial filtering was used with a kernel of size 3, and to reduce the other kinds of noise, morphological opening was performed on the subtracted image with a structuring element of size 5. Morphological opening performs erosion followed by dilation, which is useful in removing noise.
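These processing steps map directly onto OpenCV primitives. The sketch below is one possible arrangement, assuming the low pass filter is applied to the input frame before subtraction; it uses OpenCV's MOG2 subtractor, a Gaussian-mixture model in the spirit of [13][14], with the kernel sizes stated in the text.

    # Background subtraction and noise reduction (one assumed arrangement):
    # low pass filter (kernel 3) -> Gaussian-mixture subtraction -> opening (element 5).
    import cv2
    import numpy as np

    subtractor = cv2.createBackgroundSubtractorMOG2()        # mixture-of-Gaussians model

    def subtract_and_denoise(frame):
        """Return a noise-reduced binary foreground mask for one frame."""
        smoothed = cv2.blur(frame, (3, 3))                   # low pass spatial filter, kernel size 3
        mask = subtractor.apply(smoothed)                    # foreground mask (shadows marked 127)
        _, mask = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)  # binarize, drop shadows
        kernel = np.ones((5, 5), np.uint8)                   # structuring element of size 5
        return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)  # erosion followed by dilation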
The resulting binary image after performing these operations is illustrated in figure 3. This image alone won't yield high accuracy in image recognition tasks, because important positional features of the hand and fingers are lost when background subtraction is performed. Thus, in order to retain those positional features, an AND operation is performed between the original image and the noise-reduced subtracted image. This results in the white pixels of the binary image acting as a filter for the RGB image. The resulting RGB image is then converted to grayscale, which eliminates any bias due to the user's skin tone or foreground lighting during recognition.
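In code, this filtering step is a single bitwise AND followed by a color space conversion; the function below is a small sketch with assumed names.

    # The binary mask filters the original RGB frame; the result is grayscaled
    # to remove skin-tone and lighting bias (names are illustrative).
    import cv2

    def mask_and_grayscale(frame, mask):
        """Keep only the pixels where the mask is white, then convert to grayscale."""
        filtered = cv2.bitwise_and(frame, frame, mask=mask)  # white mask pixels pass through
        return cv2.cvtColor(filtered, cv2.COLOR_BGR2GRAY)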
C. Image Recognition

Fig. 4: CNN Architecture

The last phase in the system is the image recognition phase. In order to achieve higher accuracy than existing systems while keeping the system computationally lightweight, we make use of a Convolutional Neural Network (CNN) for image recognition. Convolutional neural networks are a class of feed-forward artificial neural networks commonly used for visual analysis tasks. They comprise neurons which act as learnable parameters, each with its own weights and biases. The network learns with the help of a loss function, and a learning rate is used to fine-tune the learning. The input layer takes in the data, which is propagated through the various layers until an output is generated; the generated output is compared with the actual output, and the network updates its weights and biases to correct itself. This crucial step is known as backpropagation, and performing it iteratively is called training. The training duration of a CNN depends on the size of the data, the number of layers and the learning rate.

The CNN was trained using the ASL sign language image dataset consisting of around 35K images, with each class having a minimum of 800 images. The dataset consisted of grayscale static sign images of alphabets and numbers. Fig. 5 shows some images from the dataset. The character labels associated with each image were converted into binary vectors using one hot encoding, thus converting categorical values into numbers.

The proposed architecture of our CNN is illustrated in figure 4. It consists of three convolutional layers with 32, 64 and 128 filters respectively, with intermediate max-pooling layers and ReLU activations. A kernel of size 3 and a pool size of 2 were used. The last layers consist of a flattening layer and fully connected layers, with dropout layers in between in order to avoid overfitting. The final dense layer is of size 36, corresponding to the number of class labels, with softmax activation. The input to the CNN is the grayscale processed image resized to 28 * 28 as per the dataset, and the output is a probability distribution over the classes, with values between 0 and 1. The loss function used for training was categorical cross entropy and the optimizer used was RMSprop. The training was conducted for 250 epochs with a batch size of 512. For any image fed into the CNN, it outputs a probability distribution; the node containing the highest probability is taken as the output node, and the label associated with that node is output. In this way the system determines which sign the user performed.
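A Keras sketch of the described network is given below. The three convolutional layers (32/64/128 filters, kernel size 3), the 2 * 2 max-pooling, the 36-way softmax output, the categorical cross entropy loss and the RMSprop optimizer all follow the text; the width of the hidden dense layer (128) and the dropout rate (0.5) are assumptions, since the paper does not state them.

    # CNN sketch matching the stated architecture; the dense width and dropout
    # rate are assumed values not given in the paper.
    from tensorflow.keras import layers, models

    def build_model(num_classes=36):
        model = models.Sequential([
            layers.Input(shape=(28, 28, 1)),                 # 28 * 28 grayscale input
            layers.Conv2D(32, 3, activation="relu"),
            layers.MaxPooling2D(2),
            layers.Conv2D(64, 3, activation="relu"),
            layers.MaxPooling2D(2),
            layers.Conv2D(128, 3, activation="relu"),
            layers.Flatten(),
            layers.Dense(128, activation="relu"),            # assumed hidden width
            layers.Dropout(0.5),                             # assumed dropout rate
            layers.Dense(num_classes, activation="softmax"), # 36 class labels
        ])
        model.compile(optimizer="rmsprop",
                      loss="categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    # Training as described in the text: one-hot labels (e.g. via
    # keras.utils.to_categorical), 250 epochs, batch size 512:
    # model.fit(x_train, y_train, epochs=250, batch_size=512)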
VI. RESULTS

Fig. 6: CNN performance graphs ((a) Model Accuracy, (b) Model Loss)

The recognition accuracy of the CNN obtained on the test set was 96.36%. Figure 6 illustrates the model accuracy and loss during training and validation. From the figure it is evident that the model loss converges to almost zero, ruling out under-fitting, i.e. the model is capable of generalizing to the dataset; the test accuracy obtained likewise rules out over-fitting, i.e. the system classifies unseen data correctly. Tested in a live setting, our system yielded 38 correct predictions out of a set of 40 trials, an accuracy of 95%. Frame segregation was able to pick out the distinct frames correctly, and the system performed well even when the lighting conditions were changed. Given the constraints posed by static gesture recognition, our system produced the best results compared to existing systems, which rely on feature extraction or on costly gloves. It can be deployed as an application on a minimal system and comes at zero cost.
VII. CONCLUSION AND FUTURE WORK

In this study, a system to classify static gestures was designed and implemented using a Convolutional Neural Network. Our system is adaptive and performs robustly under varied lighting and background conditions. The proposed system has a low computational cost and can be deployed in a mobile setting, which makes it suitable for real time applications. Research on vision based gesture recognition is still in progress, and our future work will focus on further improving the accuracy, expanding the classification dictionary and supporting dynamic gesture recognition.
REFERENCES

[1] Gesture Recognition (2018, October 4). Wikipedia [Online]. Available:
[2] Priyanka C Pankajakshan and Thilagavathi B, "Sign Language Recognition System," IEEE Sponsored 2nd International Conference on Innovations in Information Embedded and Communication Systems (ICIIECS'15).
[3] Jobin Francis and Anoop B K, "Significance of Hand Gesture Recognition Systems in Vehicular Automation-A Survey," International Journal of Computer Applications, 99(7):50-55, August 2014.
[4] T. Starner and A. Pentland, "Real-time American sign language recognition from video using hidden Markov models," Technical Report No. 375, M.I.T. Media Laboratory Perceptual Computing Section, 1995.
[5] F. Camastra and D. De Felice, "LVQ-based hand gesture recognition using a data glove," Neural Nets and Surroundings, Springer Berlin Heidelberg, 2013, pp. 159-168.
[6] S. Lang, M. Block and R. Rojas, "Sign Language Recognition Using Kinect," in L. Rutkowski et al. (Eds.), Springer Berlin/Heidelberg, pp. 394-402, 2011.
[7] V. K. Verma, S. Srivastava and N. Kumar, "A comprehensive review on automation of Indian sign language," IEEE Int. Conf. Adv. Comput. Eng. Appl., Mar. 2015, pp. 138-142.
[8] Sanaa Khudayer Jadwaa, "Feature Extraction for Hand Gesture Recognition: A Review," International Journal of Scientific & Engineering Research, Volume 6, Issue 7, July 2015.
[9] George Karidakis et al., Feature Extraction, Shodhganga.
[10] Archana Ghotkar and Gajanan K. Kharate, "Hand Segmentation Techniques to Hand Gesture Recognition for Natural Human Computer Interaction," International Journal of Human Computer Interaction.
[11] Rafiqul Zaman Khan and Noor Adnan Ibraheem, "Comparative Study of Hand Gesture Recognition System," SIPM, FCST, ITCA, WSE, ACSIT, CS & IT 06, pp. 203-213, 2012.
[12] Structural Similarity (2018, August 27). Wikipedia [Online]. Available:
[13] P. KaewTraKulPong and R. Bowden, "An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection," in Proc. 2nd European Workshop on Advanced Video Based Surveillance Systems (AVBS01), Sept. 2001; published in Video Based Surveillance Systems: Computer Vision and Distributed Processing.
[14] Background Subtraction, Open Source Computer Vision (OpenCV) documentation.