Sign Language Detection From Hand Gesture Images Using Deep Multi-Layered Convolution Neural Network
Rajarshi Bhadra
Department of Electrical Engineering
Future Institute of Engineering and Management
Kolkata, India
[email protected]

Subhajit Kar
Department of Electrical Engineering
Future Institute of Engineering and Management
Kolkata, India
[email protected]
Abstract—Automatic detection of sign language from hand gesture images is crucial nowadays. Accurate detection and classification of sign language can help people with hearing and speech disorders. In this paper, a deep multi-layered convolution neural network is proposed for this purpose. In the proposed approach, 32 convolution filters with a 3 x 3 kernel, the LeakyReLU activation function and a 2 x 2 max pooling operation have been used in each layer of the deep multi-layered CNN structure. The SoftMax activation function has been used in the output layer. The proposed approach has been evaluated on a database containing both static (54000 images and 36 classes) and dynamic (49613 images and 23 classes) hand gesture images. Experimental results demonstrate the efficacy of the proposed methodology in the sign language detection task.

Index Terms—Sign Language Detection, Deep Learning, multi-layered CNN

I. INTRODUCTION

Sign or signed languages are generally used for conveying meaning through the visual modality [1]. These languages are expressed through manual articulations in combination with non-manual elements. They are full-fledged natural languages with their own grammar and lexicon. In this context, intelligent methodologies for sign language recognition from hand gesture images are important. Such methodologies can help the large number of people in the world suffering from hearing loss and speech disorders. In the literature, different methodologies have been reported for recognizing sign languages accurately [2]–[13].

A real-time hand gesture detection and recognition methodology using Bag-of-Features has been proposed by Dardas et al. [2], where features have been extracted by the SIFT technique and a multi-class support vector machine model has been used for classification. Kurdyumov et al. have classified English sign language alphabets using support vector machine (SVM) and k-nearest neighbor classifiers [3]. Gray-scale features have been proposed by them for enhancing the classification performance. Convolutional Neural Network (CNN)-based sign language classification has been proposed by Lionel et al. [4]. A total of 6600 images of Italian gestures have been used in their work. They have utilized 4600 images for training and the rest of the images for validation. Das et al. have also proposed a deep learning-based approach to recognize sign languages from static gesture images [5]. They have classified 24 classes containing around 100 images each, using the Inception V3 model to train the classifier. Huang et al. proposed a 3D convolutional neural network (CNN) to extract discriminative spatial-temporal features from raw video streams [6]. An inception model has been proposed by Bantupalli et al. which takes a video sequence and extracts temporal and spatial features from it [7]. Patel and Ambekar used PNN and KNN classifiers on their own dataset, extracting features using the 7 Hu moments technique [8]. A framework for Sign Language Recognition for the Bangla alphabet was proposed by Uddin and Chowdhury using an SVM classifier on 2 different datasets consisting of 2400 images of Bengali Sign Language [9]. Rao et al. proposed a CNN architecture for classifying selfie sign language gestures covering 200 ISL signs and 300000 sign video frames. Their CNN model was trained on 180000 video frames and returned a satisfactory accuracy [10]. M. R. Abid et al. proposed a Dynamic Sign Language Recognition system using Stochastic Linear Formal Grammar; a Bag-of-Features and local part model approach has been used for recognizing individual words of sign language [11]. A DLSTM-based hand gesture recognition method has been proposed by Avola et al. [12], where an RNN is trained using the angles formed by the finger bones of the human hands. The features are extracted using a Leap Motion Controller (LMC) sensor. This method achieved satisfactory accuracy compared with state-of-the-art methods. A wearable hand device has been proposed as a smart sign language interpretation system by Lee et al. [13]. The entire system consists of a fusion of a pressure sensor module, a processing module and a display unit mobile application module. An SVM classifier is used for analyzing the data collected from the sensors.
The methodologies reported in the literature [2]–[13] perform detection and classification on either static images or dynamic images separately. In [2], [3], [9], [13], feature extraction has been performed and an SVM classifier has been used for the detection and classification task. Neural network-based methodologies have also been proposed in the literature for the same task [4]–[8], [10], [12]. However, the methodologies proposed in the literature [2]–[13] are complex and the number of data samples is also not sufficient. In contrast, a deep multi-layered CNN [14] architecture has been proposed in this paper that operates on both static gestures and dynamic gestures with a huge number of data samples collected from multiple sources. The static gestures represent alphabets and numbers, and the dynamic gestures represent emotional states. 36 classes of static gestures and 23 classes of dynamic gestures have been used as input to the proposed deep multi-layered CNN structure. Moreover, busy backgrounds have been used in the input images and the dataset has been split into training, validation and test sets in the ratio of 60 : 20 : 20 to make the system more robust. The experimental results demonstrate that the proposed methodology performs satisfactorily in classifying both static gestures and dynamic gestures.

The rest of the paper has been organized as follows. The proposed methodology has been described in Section II. Section III describes the database preparation. The experimental results have been demonstrated in Section IV and finally Section V concludes the paper.

II. PROPOSED METHODOLOGY

The block diagram of the proposed methodology has been shown in Fig. 1. The images in the database are non-uniform in size and have constant backgrounds. Therefore, in this paper, the images have been rescaled and busy backgrounds have been added to make the proposed system more robust. Thereafter, all the images with uniform size and busy backgrounds have been used as input to the proposed deep multi-layered CNN structure. Finally, the output layer performs the classification task.

A. Data Preprocessing

The hand gesture images have been first rescaled to 50 x 50 size. Thereafter, the foregrounds of the images have been extracted [15]. Then random images of different cities have been added to the foreground images to create busy backgrounds [16]. The images with busy backgrounds have then been used as input to the proposed system for the classification task. Fig. 2 and Fig. 3 show the data preprocessing techniques used in our proposed methodology; a code sketch of the pipeline follows the figure captions below.

Fig. 2. Image Rescaling. (a) Original Image (b) Rescaled Image

Fig. 3. Busy Background. (a) Original Image (b) Eliminated Foreground (c) Image with Busy Background
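As a rough illustration of this pipeline, the following OpenCV sketch rescales a gesture image, builds a foreground mask, and composites the hand onto a randomly chosen city image. Otsu thresholding here is only a stand-in for the foreground-extraction method cited as [15], and all paths and function names are hypothetical.

```python
import random
import cv2

def preprocess(gesture_path, background_paths, size=(50, 50)):
    img = cv2.resize(cv2.imread(gesture_path), size)            # rescale to 50 x 50
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Simple Otsu mask as a stand-in for the foreground extraction in [15]
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    bg = cv2.resize(cv2.imread(random.choice(background_paths)), size)
    fg = cv2.bitwise_and(img, img, mask=mask)                   # keep the hand pixels
    bg = cv2.bitwise_and(bg, bg, mask=cv2.bitwise_not(mask))    # clear the hand region
    return cv2.add(fg, bg)                                      # image with busy background
```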
B. Proposed Deep Multi-layered CNN Architecture

The proposed deep multi-layered CNN having 5 layers has been elaborated in Fig. 4. A convolution operation using 32 filters with 3 x 3 kernels has been performed in each layer with a unit-stride sliding window and single padding. The Leaky Rectified Linear Unit (LeakyReLU) [17] has been used as the activation function. A 2 x 2 max pooling operation has also been performed to condense the result of the convolution operation in each layer into a more compact tensor. In the proposed technique, the three-dimensional tensors have been converted to a one-dimensional feature vector of 512 neurons after the multi-layered structure. Thereafter, a dense layer of 128 neurons and an output layer of 59 neurons (static gestures: 36; dynamic gestures: 23) have been used. LeakyReLU [17] and SoftMax [18] have been used as the activation functions in the dense layer and the output layer respectively.
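A minimal Keras sketch of this architecture, under our reading of the text, is shown below. The five convolution blocks, filter count, kernel and pooling sizes, and the 512- and 128-neuron dense stages follow the description above; the input shape, the default LeakyReLU slope, and realizing the 512-neuron feature vector as a dense layer after flattening are our assumptions, not details confirmed by the paper.

```python
from tensorflow.keras import layers, models

def build_model(num_classes, input_shape=(50, 50, 3)):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    # Five blocks: 32 filters with a 3 x 3 kernel, unit stride and
    # 'same' (single) padding, LeakyReLU, then 2 x 2 max pooling.
    for _ in range(5):
        model.add(layers.Conv2D(32, (3, 3), strides=1, padding="same"))
        model.add(layers.LeakyReLU())
        model.add(layers.MaxPooling2D(pool_size=(2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(512))    # 512-neuron feature vector
    model.add(layers.LeakyReLU())
    model.add(layers.Dense(128))    # dense layer of 128 neurons
    model.add(layers.LeakyReLU())
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model

model = build_model(num_classes=36)   # 36 static classes; use 23 for dynamic gestures
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```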
LeakyReLU [17] is defined as:

f(x) = \begin{cases} \alpha x & \text{for } x < 0 \\ x & \text{for } x \geq 0 \end{cases}    (1)

where x is the input to the activation, i.e. the weighted sum from the pre-activation layer, and α is the slope that prevents the dying ReLU problem [19]. The SoftMax [18] activation function is defined as

\hat{y}_i = \frac{e^{y_i}}{\sum_{j=1}^{k} e^{y_j}}    (2)

where y_i is the i-th logit value, k is the total number of logits, and \hat{y}_i denotes the predicted probability of a particular sample.
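For concreteness, equations (1) and (2) can be written in NumPy as follows; the slope α = 0.01 and the max-subtraction stability trick are illustrative choices, not values reported in the paper.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Equation (1): alpha * x for x < 0, x for x >= 0
    return np.where(x < 0, alpha * x, x)

def softmax(y):
    # Equation (2); subtracting the max is a standard numerical-stability trick
    e = np.exp(y - np.max(y))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))   # probabilities summing to 1; index 0 has the highest value
```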
The probabilities of each class have been evaluated in the output layer of the proposed multi-layered CNN structure; the output node (logit) with the highest probability value represents the predicted class. A first-order gradient-based optimization process known as the Adam optimizer [20] has been used in the training phase of the proposed methodology. The Adam optimizer is defined as:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t    (3)

v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2    (4)

where m_t and v_t are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, g_t is the gradient at time t, and \beta_1 and \beta_2 are the exponential decay rates for the moment estimates. The \beta_1 and \beta_2 values have been selected experimentally as 0.9 and 0.999 respectively.
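A bare-bones NumPy sketch of one Adam update using equations (3) and (4) with the β values above is given next; the bias-correction and parameter-update steps are the standard ones from [20], and the learning rate and ε are assumed defaults rather than values reported in the paper.

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * g        # equation (3): first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g ** 2   # equation (4): second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)           # bias correction (standard Adam)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```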
The categorical cross-entropy loss [21] has been used as the cost function in the proposed algorithm. The cost function J(\theta) can be defined as:

J(\theta) = -\sum_{c=1}^{M} y_{i,c} \log(p_{i,c})    (5)

where M denotes the number of classes, y_{i,c} is a binary indicator (0 or 1) of whether c is the correct class, and p_{i,c} denotes the predicted probability between 0 and 1.
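For a single sample, equation (5) amounts to the short NumPy function below, where y_true is the one-hot indicator vector and y_pred is the SoftMax output; the clipping constant is only there to avoid log(0).

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # Equation (5): sum over the M classes of -y_{i,c} * log(p_{i,c})
    return -np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0)))

y_true = np.array([0, 1, 0])          # correct class is index 1
y_pred = np.array([0.1, 0.8, 0.1])    # SoftMax output
print(categorical_cross_entropy(y_true, y_pred))   # ~0.223
```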
III. DATASET USED

In this paper, static and dynamic hand gesture images have been used for the detection and classification task. The hand gesture images can be described as follows.

A. Static Hand Gestures

The dataset [22] consists of hand gesture representations of digits (0 - 9) and alphabets (A - Z). The dataset contains 54000 images and 36 classes (digits: 10 and alphabets: 26). The pre-processed static hand gesture samples with busy backgrounds are shown in Fig. 5 (a) and Fig. 5 (b).

B. Dynamic Hand Gestures

The dataset [23], [24] consists of hand gesture representations of emotions (positions: above, below etc.; emotions: alone, afraid etc.; misc.: bring, drink etc.). The dataset contains 49613 images and 23 classes. The pre-processed dynamic hand gesture samples with busy backgrounds are shown in Fig. 5 (c) and Fig. 5 (d).

IV. EXPERIMENTAL RESULT ANALYSIS

A. Experimental setup

TABLE I. Classification Performance (table contents not recoverable from the extracted text)

TABLE II. Comparison with different CNN models (Batch size = 64, Epochs = 11)

Model            Accuracy   Training Time
VGG16 [25]       97.17%     544 seconds
VGG19 [26]       98.25%     646 seconds
MobileNet [27]   99.21%     231 seconds
ResNet50 [28]    98.39%     649 seconds
Proposed         99.89%     194 seconds