
2018 4th International Conference on Computing Communication and Automation (ICCCA)

Hand Gesture Recognition System using Convolutional Neural Networks

Raj Patel, Jash Dhakad, Kashish Desai, Tanay Gupta, Prof. Stevina Correia
Information Technology, Dwarkadas J. Sanghvi College of Engg, Mumbai, India
[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract—Gesture recognition plays an important role in communication through sign language. It is a fast-growing domain within computer vision and has attracted significant research due to its widespread social impact. To address the difficulties faced by people with hearing impairments, there is a pressing need for a system that translates sign language into text. In this paper, a static hand gesture recognition system is developed for American Sign Language using a deep Convolutional Neural Network. The system architecture is lightweight, making the system easily deployable and mobile. In order to achieve high accuracy in live scenarios, we employ a number of image processing techniques that assist in appropriate background subtraction and frame segmentation. Our approach focuses on mobility, zero cost and easy deployment in low-computation environments. Our system achieved a testing accuracy of 96%.

Index Terms—Sign Recognition, Gesture Recognition, Computer Vision, Convolutional Neural Networks.

I. INTRODUCTION

Communication is the imparting, sharing and conveying of information, news, ideas and feelings. Sign language is one form of non-verbal communication that is gaining impetus and a strong foothold due to its applications in a large number of fields. Its most prominent application is its use by differently abled persons, such as deaf and mute people, who can communicate with non-signing people without the help of a translator or interpreter. Other applications lie in the automotive sector, the transit sector, gaming, and unlocking a smartphone [1]. Sign gesture recognition can be done in two ways: static gestures and dynamic gestures [2]. While communicating, static gestures make use of hand shapes, while dynamic gestures make use of hand movements [2]. Our paper focuses on static gestures. Hand gesture recognition is a way of understanding and then classifying movements made by the hands. However, human hands have very complex articulations with the human body, so many errors can arise [3], which makes hand gestures difficult to recognize. Our paper focuses on detecting and recognizing hand gestures using different methods, measuring the accuracy achieved by those methods, and examining the performance, convenience and issues associated with each. Currently, many methods and technologies are used for sign and gesture recognition; the most common are hand-glove-based analysis, Microsoft Kinect-based analysis, Support Vector Machines and Convolutional Neural Networks. One objective of these methods is to bridge the communication gap between speech- and hearing-impaired people and hearing people, and to support the smooth integration of differently abled people into society. In our research we build a real-time communication system using advances in Machine Learning. The systems currently in existence either work on a small dataset and achieve stable accuracy, or work on a large dataset with unstable accuracy. We try to resolve this problem by applying a Convolutional Neural Network (CNN) to a fairly large dataset to achieve good, stable accuracy.

II. LITERATURE SURVEY

In order to bridge the communication gap between hearing- and speech-impaired people, researchers have used different approaches for the recognition of hand gestures. These approaches can be broadly divided into three categories: the Hand Segmentation Approach, the Gesture Recognition Approach and the Feature Extraction Approach.

Two categories of visual-based hand gesture recognition can be used. The first is a 3-D hand gesture model that works by comparing input frames and makes use of sensors such as gloves, helmets, etc. [4]. The other is Microsoft Kinect-based analysis, which makes use of the Kinect camera; the Kinect hardware gives accurate tracking of several user joints. The 3-D hand gesture model requires a huge dataset and also has a higher hardware cost due to the sensors on the gloves.

978-1-5386-6947-1/18/$31.00 ©2018 IEEE


This glove-based model for American Sign Language was proposed by Starner and Pentland [5]. However, it is not practical for the user to wear gloves continuously.

The 2-D hand gesture model makes use of an image dataset for feature extraction and detection. Many other approaches are used for image-based gesture recognition, such as ANN (Artificial Neural Network), HMM (Hidden Markov Model), eigenvalue-based and perceptual-colour-based methods. The feature vector extracted from the image is input into an HMM [6]. For classification, particle filtering and segmentation methods such as the Support Vector Machine (SVM) are used, where the image frame is converted into HSV colour space as it is less sensitive to lighting effects [7]. Feature extraction can be performed using various methods; one of the most widely used is the Contour Shape Technique, which extracts the boundary information of the sign.

III. CURRENTLY USED METHODOLOGIES

A. Feature Extraction

A feature is a function of one or more measurements, computed so that it quantifies some significant characteristic of the object [9]. Feature extraction is a special form of dimensionality reduction. In pattern recognition and image processing, if the input is too large to process, it is suspected to be redundant, and the input data is transformed into a reduced representation set of features [8]. Feature extraction can thus be defined as the process of transforming input data into a set of features. The general expectation is that, if the features are chosen carefully, the feature set will extract the relevant information from the input data, so that the desired task can be performed using this reduced representation instead of the full-size input.

There are some issues with feature extraction. First, the features should carry enough information about the image and should not require any domain-specific knowledge for their extraction [9]. Second, the features should be easy to compute, to make feature extraction feasible for large image collections and rapid retrieval. They should also relate well to human perceptual characteristics, since users ultimately judge the suitability of the retrieved images.

B. Hand Segmentation Approach

Hand tracking and segmentation should always be done efficiently, as they are the keys to success for any gesture recognition system, given the challenges that vision-based methods face, such as continuous variation in lighting intensity, many (possibly complex) objects in the background, and detection of skin color. Color is a very powerful descriptor for object detection; color information is therefore used for segmentation, as it is invariant to rotation and geometric variation of the hand [10]. Humans perceive color through components such as saturation, hue and brightness rather than through percentages of the primary colors red, green and blue [10]. These color models represent a particular color in a standardized way: a space-coordinate system in which any specified color is represented by a single point. Three techniques using different color spaces were introduced for robust hand detection and segmentation; the hand tracking and segmentation (HTS) technique using the HSV color space was identified for the pre-processing of the HGR system.

There are some issues with hand segmentation. First, some irrelevant objects might overlap with the hand. Also, the performance of the hand segmentation algorithm degrades when the distance between the user and the camera exceeds 1.5 meters [11]. Lastly, hand segmentation restricts the user to making gestures in a particular manner: gestures must be made with the right hand only, the arm should be vertical, the palm should face the camera, and the background should be clear and uniform.

C. Glove-based Hand Gesture Recognition

Glove-based approaches make use of gesture or capacitive touch sensors embedded in gloves to recognize hand gestures. The widely used methods convey hand signs through hand motion, which is tracked and translated to text. Hand motions are categorized using clustering techniques such as k-means. Other approaches use charge-transfer touch sensors for translation, using on/off binary signals. These approaches achieve high accuracy but incur a high cost due to the necessary hardware.

Fig. 1: System Block Diagram

IV. PROPOSED METHODOLOGY

In view of the limitations posed by the approaches mentioned above, our system focuses on mitigating those shortcomings. The system has to be deployable on a mobile or web application for far reach and easy accessibility, so it has to be lightweight and computationally capable enough to recognize the signs accurately.
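The colour-space thresholding described in Section III-B can be sketched in a few lines. The HSV bounds below are illustrative assumptions for this sketch, not the tuned values of the HTS technique in [10]:

```python
import numpy as np

# Illustrative HSV thresholds (OpenCV-style ranges: H in 0..179,
# S and V in 0..255). These bounds are assumptions for the sketch,
# not the tuned values of the HTS technique.
H_MAX = 25
S_MIN, V_MIN = 40, 60

def skin_mask_hsv(hsv):
    """Return a binary mask (1 = candidate hand pixel) from an
    HSV image of shape (height, width, 3)."""
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    mask = (h <= H_MAX) & (s >= S_MIN) & (v >= V_MIN)
    return mask.astype(np.uint8)
```

In a full pipeline the mask would then be cleaned up morphologically and used to crop the hand region before recognition.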

Fig. 2: SSIM between intermediate frames

Fig. 3: Image Processing Steps

We propose a computer-vision-based approach to recognize static hand gestures. The system analyzes a video feed, recognizes the hand gesture and then outputs the correct class label. For each sign performed by the user, the system outputs one of 36 class labels, comprising the ASL gestures for alphabets and numbers. The video feed can be pre-recorded or come live from an input device; the system segregates each sign and outputs the sign label accordingly. The major challenges identified were performing precise background subtraction and achieving high accuracy, so that the system can be used to formulate sentences. Background subtraction requires tackling changing illumination and foreground noise in the input images; operations such as Gaussian-Mixture-based segmentation and image morphology are performed to reduce background noise. Our system architecture therefore employs three phases, viz. frame segregation, image processing and image recognition. Fig. 1 illustrates the system block diagram.

V. IMPLEMENTATION

A. Frame Segregation

Frame segregation is the first stage. It involves identifying the frames which contain the sign gesture and segregating those frames for further processing and recognition. To extract each individual frame from the video feed, we perform a frame-by-frame comparison by computing the Structural Similarity Index (SSIM) between adjacent frames, and we select distinct frames based on a threshold. The video feed is captured from a webcam at a resolution of 900 × 900 and 23 fps. The feed is reduced to 12 fps and the user is given an ROI in which to perform the sign gesture; the final images after cropping are of size 300 × 300. For two images x and y, SSIM is calculated according to equation (1):

SSIM(x, y) = ((2μxμy + C1)(2σxy + C2)) / ((μx² + μy² + C1)(σx² + σy² + C2))    (1)

where μx is the average of x, μy is the average of y, σx² is the variance of x, σy² is the variance of y, σxy is the covariance of x and y, C1 = (k1 L)² and C2 = (k2 L)² are two variables that stabilize the division when the denominator is weak, L is the dynamic range of the pixel values, and k1 = 0.01 and k2 = 0.03 by default [12].

An SSIM of 1 indicates perfect similarity. Through testing, the threshold value of SSIM below which two images are considered distinct
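Equation (1) can be evaluated directly. The sketch below computes SSIM as a single global statistic over two grayscale frames, which is a simplification (SSIM is more commonly computed over local windows and averaged), and applies the paper's distinct-frame threshold of 0.45:

```python
import numpy as np

def ssim(x, y, L=255, k1=0.01, k2=0.03):
    """Structural Similarity Index per equation (1), computed
    globally over two equally sized grayscale images."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

SSIM_THRESHOLD = 0.45  # frames below this are treated as distinct

def is_distinct(frame_a, frame_b):
    return ssim(frame_a, frame_b) < SSIM_THRESHOLD
```

An identical pair of frames yields an SSIM of 1, while strongly dissimilar frames fall below the threshold and trigger distinct-frame selection.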

Fig. 4: CNN Architecture

was identified to be 0.45. Fig. 2 illustrates the SSIM between intermediate frames. As is evident from the calculated values, the SSIM between two similar images is greater than the threshold value. For dissimilar frames such as frame 1 and frame 2, the calculated SSIM is less than the threshold value, but since the transition is from background to foreground, frame 2 is not considered distinct. On the contrary, frame 5 is considered a distinct frame, moving from foreground to background.

B. Image Processing

After the frame segregation phase, the next phase is processing the output frames. The first step is to perform background subtraction. Since our system is mobile, the input images vary a lot in terms of background and lighting conditions. Background subtraction is performed using the Gaussian-Mixture-based background segmentation algorithm, which uses a mixture of K Gaussian distributions to model each background pixel [13][14]. The output frames are compared against the background image, and a resulting image is obtained after the background subtraction. Normal background subtraction was ruled out due to its low tolerance to dynamic conditions. Noise reduction was then performed, since some observable noise was present in the resulting image. The two main types of noise observed were spatial noise due to motion, and salt-and-pepper noise due to changes in lighting conditions. To remove spatial noise, low-pass spatial filtering was used with a kernel of size 3; to reduce the other kinds of noise, morphological opening was performed on the subtracted image with a structuring element of size 5. Morphological opening performs erosion followed by dilation, which is useful for removing noise.

The resulting binary image after these operations is illustrated in figure 3. This image alone will not yield high accuracy in image recognition tasks, because important positional features of the hand and fingers are lost when background subtraction is performed. Thus, in order to retain those positional features, an AND operation is performed between the original image and the noise-reduced subtracted image. This results in the white pixels of the binary image acting as a filter for the RGB image. The resulting RGB image is then converted to grayscale, to eliminate any bias due to the user's skin tone or foreground lighting during recognition.
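The noise-reduction and masking steps above can be sketched without OpenCV (whose BackgroundSubtractorMOG2, morphologyEx and bitwise_and would normally perform them). The grayscale weights below are the standard BT.601 coefficients, an assumption since the text does not specify the conversion:

```python
import numpy as np

def _shift_combine(mask, k, combine, init):
    """Slide a k x k square element over the mask, combining
    shifted copies (min for erosion, max for dilation)."""
    pad = k // 2
    p = np.pad(mask, pad, constant_values=0)
    out = init(mask)
    h, w = mask.shape
    for dy in range(k):
        for dx in range(k):
            out = combine(out, p[dy:dy + h, dx:dx + w])
    return out

def erode(mask, k):
    return _shift_combine(mask, k, np.minimum, np.ones_like)

def dilate(mask, k):
    return _shift_combine(mask, k, np.maximum, np.zeros_like)

def opening(mask, k=5):
    """Morphological opening = erosion followed by dilation;
    removes speckles smaller than the k x k structuring element."""
    return dilate(erode(mask, k), k)

def mask_to_gray(frame_rgb, mask):
    """White mask pixels act as a filter on the RGB frame (the AND
    step); the result is converted to grayscale (BT.601 weights,
    an assumed choice)."""
    filtered = frame_rgb * mask[..., None]
    gray = (0.299 * filtered[..., 0] + 0.587 * filtered[..., 1]
            + 0.114 * filtered[..., 2])
    return np.rint(gray).astype(np.uint8)
```

Opening with a size-5 element, as in the text, removes salt noise left over from the Gaussian-Mixture foreground mask before the mask is applied to the original frame.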
C. Image Recognition

The last phase in the system is the image recognition phase. In order to achieve higher accuracy than existing systems while keeping the system computationally lightweight, we make use of a Convolutional Neural Network (CNN) for image recognition. Convolutional neural networks are a class of feed-forward artificial neural networks commonly used for visual analysis tasks. They comprise neurons which act as learnable parameters, each having its own weights and biases. The network learns with the help of a loss function, and a learning rate is used to fine-tune the learning. The input layer takes in the data, which is propagated through the various layers, and an output is generated; the generated output is compared with the actual output, and the network updates its weights and biases to correct itself. This crucial step is known as backpropagation, and repeating this process iteratively is called training. The training duration of a CNN depends on the dataset size, the number of layers and the learning rate. The CNN was trained on an ASL sign language image dataset consisting of around 35K images, with each class having a minimum of 800 images. The dataset consists of grayscale static sign images of alphabets and numbers; Fig. 5 shows some images from the dataset. The character labels associated with each image were converted into binary vectors using one-hot encoding, thus converting categorical values into numbers.

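The label handling described above (one-hot encoding of the 36 classes, and decoding a prediction by taking the highest-probability node) can be sketched as follows; ordering the digits before the letters is an assumption, as the dataset's class order is not stated:

```python
import numpy as np

# Assumed label set: digits 0-9 followed by letters a-z (36 classes).
LABELS = [str(d) for d in range(10)] + \
         [chr(c) for c in range(ord('a'), ord('z') + 1)]

def one_hot(label):
    """Binary vector of length 36 with a 1 at the label's index."""
    vec = np.zeros(len(LABELS), dtype=np.float32)
    vec[LABELS.index(label)] = 1.0
    return vec

def decode(probabilities):
    """Map the CNN's output probability distribution back to a
    character by picking the node with the highest probability."""
    return LABELS[int(np.argmax(probabilities))]
```

During training the targets are one-hot vectors; at inference the argmax of the softmax output selects the predicted sign.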
The proposed architecture of our CNN is illustrated in figure 4. It consists of three convolutional layers with 32, 64 and 128 filters respectively, with intermediate max-pooling layers and ReLU activations. A kernel size of 3 and a pool size of 2 were used. The last three layers consist of a flattening layer and fully connected layers, with dropout layers in between in order to avoid overfitting. The final dense layer is of size 36, corresponding to the number of class labels, with softmax activation. The input to the CNN is the grayscale processed image resized to 28 × 28, as per the dataset, and the output of the CNN is a probability distribution that classifies the image with probabilistic values between 0 and 1. The loss function used for training was categorical cross-entropy and the optimizer used was RMSprop. Training was conducted for 250 epochs with a batch size of 512. For any image fed into the CNN, the network outputs a probability distribution; the node containing the highest probability value is taken as the output node, and the label corresponding to that node is output. In this way the system determines which sign the user performed.

Fig. 5: Images from dataset
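The layer dimensions implied by this architecture can be checked with a short shape walk. Padding and stride are not stated, so the sketch assumes unpadded (valid) stride-1 convolutions and non-overlapping 2 × 2 pooling; the sizes of the intermediate fully connected layers are likewise not given, so only the final 36-way layer is counted:

```python
def conv_out(size, kernel=3):
    """Spatial size after an unpadded stride-1 convolution."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling (floor)."""
    return size // pool

def shape_walk(size=28, filters=(32, 64, 128)):
    """Trace the 28 x 28 grayscale input through the three
    conv + max-pool stages and report the flattened length."""
    shapes = []
    for f in filters:
        size = pool_out(conv_out(size))
        shapes.append((size, size, f))
    flat = size * size * filters[-1]
    return shapes, flat

shapes, flat = shape_walk()
# Parameters of the final 36-way dense layer (weights + biases).
dense_params = flat * 36 + 36
```

Under these assumptions the feature maps shrink to 13 × 13 × 32, 5 × 5 × 64 and 1 × 1 × 128, so the flattened vector entering the dense layers has length 128.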
VI. RESULTS

The recognition accuracy of the CNN on the test set was 96.36%. Figure 6 illustrates the model accuracy and loss during training and validation. From the figure it is evident that the model loss converges to almost zero, which rules out under-fitting, i.e. the model is capable of generalizing to the dataset; the test accuracy obtained also rules out the suspicion of over-fitting, i.e. the system classifies unseen data correctly. Tested in a live setting, our system yielded 38 correct predictions out of a set of 40 trials, an accuracy of 95%. Frame segregation picked out the distinct frames correctly, and the system performed well even when the lighting conditions were changed. Under the constraints posed by the static gesture recognition scenario, our system produced better results than existing systems that rely on feature extraction or on costly gloves, and it can be deployed as an application on a minimal system at zero cost.

Fig. 6: CNN performance graphs. (a) Model Accuracy. (b) Model Loss.

VII. CONCLUSION AND FUTURE WORK

In this study a system to classify static gestures was designed and implemented using a Convolutional Neural Network. Our system is adaptive and performs robustly under varied lighting and background conditions. The proposed system has a low computational cost and can be deployed in a mobile setting, which makes it suitable for real-time applications. Research on vision-based gesture recognition is still in progress, and our future work will focus on further improving the accuracy, expanding the classification dictionary and employing dynamic gestures for recognition.

REFERENCES

[1] Gesture Recognition (2018, October 4). Wikipedia [Online].
[2] Priyanka C. Pankajakshan and Thilagavathi B., "Sign Language Recognition System," IEEE Sponsored 2nd International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS'15).
[3] Jobin Francis and Anoop B. K., "Significance of Hand Gesture Recognition Systems in Vehicular Automation - A Survey," International Journal of Computer Applications 99(7):50-55, August 2014.
[4] T. Starner and A. Pentland, "Real-time American sign language recognition from video using hidden Markov models," Technical Report No. 375, M.I.T. Media Laboratory Perceptual Computing Section, 1995.
[5] F. Camastra and D. De Felice, "LVQ-based hand gesture recognition using a data glove," Neural Nets and Surroundings, Springer Berlin Heidelberg, 2013, pp. 159-168.

[6] S. Lang, B. Marco and R. Raul, "Sign Language Recognition Using Kinect," in L. Rutkowski, M. Korytkowski, R. Scherer, R. Tadeusiewicz, L. Zadeh and J. Zurada (Eds.), Springer Berlin/Heidelberg, pp. 394-402, 2011.
[7] V. K. Verma, S. Srivastava and N. Kumar, "A comprehensive review on automation of Indian sign language," IEEE Int. Conf. Adv. Comput. Eng. Appl., March 2015, pp. 138-142.
[8] Sanaa Khudayer Jadwaa, "Feature Extraction for Hand Gesture Recognition: A Review," International Journal of Scientific & Engineering Research, Volume 6, Issue 7, July 2015.
[9] George Karidakis et al., "Feature Extraction," Shodhganga.
[10] Archana Ghotkar and Gajanan K. Kharate, "Hand Segmentation Techniques to Hand Gesture Recognition for Natural Human Computer Interaction," International Journal of Human Computer Interaction.
[11] Rafiqul Zaman Khan and Noor Adnan Ibraheem, "Comparative Study of Hand Gesture Recognition System," SIPM, FCST, ITCA, WSE, ACSIT, CS & IT 06, pp. 203-213, 2012.
[12] Structural Similarity (2018, August 27). Wikipedia [Online].
[13] P. KaewTraKulPong and R. Bowden, "An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection," in Proc. 2nd European Workshop on Advanced Video Based Surveillance Systems (AVBS01), September 2001.
[14] Background Subtraction, Open Source Computer Vision (OpenCV) documentation.
