
2021 IEEE Second International Conference on Control, Measurement and Instrumentation (CMI), India

Sign Language Detection from Hand Gesture Images using Deep Multi-layered Convolution Neural Network

DOI: 10.1109/CMI50323.2021.9362897

Rajarshi Bhadra
Department of Electrical Engineering
Future Institute of Engineering and Management
Kolkata, India
[email protected]

Subhajit Kar
Department of Electrical Engineering
Future Institute of Engineering and Management
Kolkata, India
[email protected]

Abstract—Automatic detection of sign language from hand gesture images is crucial nowadays. Accurate detection and classification of sign language can help people with hearing and speech disorders. In this paper, a deep multi-layered convolution neural network is proposed for this purpose. In the proposed approach, 32 convolution filters with 3 x 3 kernels, a LeakyReLU activation function and a 2 x 2 max pooling operation have been used in the deep multi-layered CNN structure. The SoftMax activation function has been used in the output layer. The proposed approach has been evaluated on a database containing both static (54000 images and 36 classes) and dynamic (49613 images and 23 classes) hand gesture images. Experimental results demonstrate the efficacy of the proposed methodology in the sign language detection task.

Index Terms—Sign Language Detection, Deep Learning, multi-layered CNN

I. INTRODUCTION

Sign or signed languages are generally used for conveying meaning through the visual modality [1]. These languages are expressed through manual articulations in combination with non-manual elements. They are full-fledged natural languages with their own grammar and lexicon. In this context, intelligent methodologies for sign language recognition from hand gesture images are important. Such methodologies can help the large number of people in the world suffering from hearing loss and speech disorders. In the literature, different methodologies have been reported for recognizing sign languages accurately [2]–[13].

A real-time hand gesture detection and recognition methodology using Bag-of-Features has been proposed by Dardas et al. [2], where features have been extracted by the SIFT technique and a multi-class support vector machine has been used for classification. Kurdyumov et al. have classified English sign language alphabets using support vector machine (SVM) and k-nearest neighbour classifiers [3]. Grayscale features have been proposed by them for enhancing the classification performance. Convolutional Neural Network (CNN)-based sign language classification has been proposed by Pigou et al. [4]. A total of 6600 images of Italian gestures have been used in their work; 4600 images have been used for training and the rest for validation. Das et al. have also proposed a deep learning-based approach to recognize sign languages from static gesture images [5]. They have classified 24 classes containing around 100 images each, using the Inception V3 model for training. Huang et al. proposed a 3D convolutional neural network (CNN) to extract discriminative spatial-temporal features from raw video streams [6]. An inception model has been proposed by Bantupalli et al. which takes video sequences and extracts temporal and spatial features from them [7]. Patel and Ambekar used PNN and KNN classifiers on their own dataset, extracting features using the 7-Hu moment technique [8]. A framework for sign language recognition for the Bangla alphabet was proposed by Uddin and Chowdhury using an SVM classifier on 2 different datasets consisting of 2400 images of Bengali Sign Language [9]. Rao et al. proposed a CNN architecture for classifying selfie sign language gestures of 200 ISL signs, generating 300000 sign video frames. Their CNN model was trained on 180000 video frames and returned a satisfactory accuracy [10]. M. R. Abid et al. proposed a dynamic sign language recognition system using stochastic linear formal grammar; a Bag-of-Features and local part model approach has been used for recognizing individual words of sign language [11]. A DLSTM-based hand gesture recognition method has been proposed by Avola et al. [12], where an RNN is trained using the angles formed by the finger bones of the human hand. The features are extracted using a Leap Motion Controller (LMC) sensor. This method achieved satisfactory accuracy compared with state-of-the-art methods. A wearable hand device is proposed as a smart sign language interpretation system by Lee et al. [13]. The entire system consists of a fusion of a pressure sensor module, a processing module and a display unit mobile application module. An SVM classifier is used for analyzing the data collected from the sensor.

The methodologies reported in the literature [2]–[13] perform detection and classification on either static images or dynamic images separately. Feature extraction has been performed and SVM classifiers have been used for the detection and classification task in [2], [3], [9], [13]. Neural-network-based


Fig. 1. Block Diagram of Proposed Methodology.

methodologies have also been proposed in the literature for the same task [4]–[8], [10], [12]. However, the methodologies proposed in the literature [2]–[13] are complex and the number of data samples is also not sufficient. In contrast, a deep multi-layered CNN [14] architecture has been proposed in this paper that operates on both static gestures & dynamic gestures with a huge number of data samples collected from multiple sources. The static gestures represent alphabets and numbers, and the dynamic gestures represent emotional states. 36 classes of static gestures and 23 classes of dynamic gestures have been used as input to the proposed deep multi-layered CNN structure. Moreover, busy backgrounds have been used in the input images and the dataset has been split into training, validation and test sets in the ratio of 60 : 20 : 20 to make the system more robust. The experimental results demonstrate that the proposed methodology performs satisfactorily in classifying both static and dynamic gestures.

The rest of the paper is organized as follows. The proposed methodology is described in Section II. Section III describes the database preparation. The experimental results are demonstrated in Section IV and finally Section V concludes the paper.
II. PROPOSED METHODOLOGY

The block diagram of the proposed methodology is shown in Fig. 1. The images in the database are non-uniform in size and have constant backgrounds. Therefore, in this paper, the images have been rescaled and busy backgrounds have been added to make the proposed system more robust. Thereafter, all the images, with uniform size and busy backgrounds, have been used as input to the proposed deep multi-layered CNN structure. Finally, the output layer performs the classification task.

Fig. 2. Image Rescaling. (a) Original Image (b) Rescaled Image
A. Data Preprocessing

The hand gesture images have been first rescaled to 50 x 50 pixels. Thereafter, the foregrounds of the images have been extracted [15]. Then random images of different cities have been added behind the foregrounds to create busy backgrounds [16]. The images with busy backgrounds have then been used as input to the proposed system for the classification task. Fig. 2 & Fig. 3 show the data preprocessing techniques used in our proposed methodology.

Fig. 3. Busy Background. (a) Original Image (b) Eliminated Foreground (c) Image with Busy Background
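For illustration, the preprocessing pipeline described above can be sketched in Python with OpenCV. The rescaling, GrabCut foreground extraction [15] and busy-background compositing [16] follow the steps in this subsection; the file paths, the GrabCut initialization rectangle and the iteration count are assumptions, not details taken from the paper.

import cv2
import numpy as np

def preprocess(gesture_path, background_path, size=(50, 50)):
    img = cv2.resize(cv2.imread(gesture_path), size)        # rescale to 50 x 50

    # GrabCut foreground extraction [15], initialized with a rectangle
    # that assumes the hand occupies most of the frame.
    mask = np.zeros(img.shape[:2], np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    rect = (1, 1, size[0] - 2, size[1] - 2)
    cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)
    fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)

    # Composite the extracted foreground onto a random city image to
    # create the busy background [16].
    bg = cv2.resize(cv2.imread(background_path), size)
    return np.where(fg[:, :, None] == 1, img, bg)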

B. Proposed Deep Multi-layered CNN Architecture

The proposed deep multi-layered CNN, having 5 layers, is elaborated in Fig. 4. A convolution operation using 32 filters of 3 x 3 kernels has been performed in each layer with a unit-stride sliding window and single padding. The Leaky Rectified Linear Unit (LeakyReLU) [17] has been used as the activation function. A 2 x 2 max pooling operation has also been performed to condense the result of the convolution operation in each layer into a more compact tensor. In the proposed technique, the three-dimensional tensors have been converted to one-dimensional feature vectors consisting of 512 neurons after the multi-layered structure. Thereafter, a dense layer of 128 neurons and an output layer consisting of 59 neurons (static gestures: 36; dynamic gestures: 23) have been used. LeakyReLU [17] and SoftMax [18] have been used as the activation functions in the dense layer and output layer respectively.

Fig. 4. Proposed Multi-layered CNN Architecture.
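A minimal Keras sketch of this architecture is given below. The five convolution layers, 32 filters of 3 x 3 kernels, unit stride, LeakyReLU activations, 2 x 2 max pooling, the 128-neuron dense layer and the 59-neuron SoftMax output follow the description above; the 50 x 50 RGB input shape follows Section II-A, while the LeakyReLU slope is an assumed value and the exact flattened width (the 512-unit vector reported above) depends on pooling details not fully specified in the paper.

from tensorflow.keras import layers, models

def build_model(input_shape=(50, 50, 3), num_classes=59):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for _ in range(5):                                   # five convolution layers
        model.add(layers.Conv2D(32, (3, 3), strides=1, padding='same'))
        model.add(layers.LeakyReLU(alpha=0.1))           # slope is an assumption
        model.add(layers.MaxPooling2D((2, 2)))           # 2 x 2 max pooling
    model.add(layers.Flatten())
    model.add(layers.Dense(128))                         # dense layer of 128 neurons
    model.add(layers.LeakyReLU(alpha=0.1))
    model.add(layers.Dense(num_classes, activation='softmax'))  # 59-way output
    return model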

LeakyReLU [17] is defined as:

f(x) = \begin{cases} \alpha x & \text{for } x < 0 \\ x & \text{for } x \geq 0 \end{cases} \qquad (1)

where x is the input to the activation, i.e. the weighted sum from the pre-activation layer, and \alpha is the slope which prevents the dying ReLU problem [19]. The SoftMax [18] activation function is defined as:

\hat{y}_i = \frac{e^{y_i}}{\sum_{j=1}^{k} e^{y_j}} \qquad (2)

where y_i is the i-th logit value, k is the total number of logits, and \hat{y}_i denotes the predicted probability of a particular sample. The probabilities of each class have been evaluated in the output layer of the proposed multi-layered CNN structure. A specific output node (logit) having the highest probability value represents the classification of the corresponding class.
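To make Eqs. (1) and (2) concrete, a small numpy illustration follows; the slope value alpha = 0.01 is an assumed example, as the paper does not report the value it uses.

import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x < 0, alpha * x, x)       # Eq. (1)

def softmax(y):
    e = np.exp(y - np.max(y))                  # shifted for numerical stability
    return e / e.sum()                         # Eq. (2)

logits = np.array([2.0, -1.0, 0.5])
probs = softmax(logits)
print(probs, probs.argmax())                   # probabilities; the highest one is the predicted class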
A first-order gradient-based optimization process known as the Adam optimizer [20] has been used in the training phase of the proposed methodology. The Adam optimizer is defined as:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \qquad (3)

v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \qquad (4)

where m_t and v_t are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, g_t is the gradient at time t, and \beta_1 and \beta_2 are exponential decay rates for the moment estimates. The \beta_1 and \beta_2 values have been selected experimentally as 0.9 and 0.999 respectively. The categorical cross-entropy loss [21] has been used as the cost function in the proposed algorithm. The cost function J(\theta) can be defined as:

J(\theta) = -\sum_{c=1}^{M} y_{i,c} \log(p_{i,c}) \qquad (5)

where M denotes the number of classes, y_{i,c} is a binary indicator (0 or 1) of whether c is the correct class for sample i, and p_{i,c} denotes the corresponding predicted probability, between 0 and 1.
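The moment updates of Eqs. (3)-(4) and the loss of Eq. (5) can be written directly in numpy as a short sketch; the beta values are the ones stated above, while the function names are illustrative.

import numpy as np

def adam_moments(m_prev, v_prev, g, beta1=0.9, beta2=0.999):
    m = beta1 * m_prev + (1 - beta1) * g            # Eq. (3): first moment
    v = beta2 * v_prev + (1 - beta2) * g ** 2       # Eq. (4): second moment
    return m, v

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true is one-hot; eps guards against log(0)
    return -np.sum(y_true * np.log(y_pred + eps))   # Eq. (5)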

III. DATASET USED

In this paper, static and dynamic hand gesture images have been used for the detection and classification task. The hand gesture images can be described as follows.

A. Static Hand Gestures

This dataset [22] consists of hand gesture representations of digits (0 - 9) and alphabets (A - Z). It contains 54000 images and 36 classes (digits: 10 and alphabets: 26). The pre-processed static hand gesture samples with busy backgrounds are shown in Fig. 5 (a) & Fig. 5 (b).

B. Dynamic Hand Gestures

This dataset [23], [24] consists of hand gesture representations of positions (above, below etc.), emotions (alone, afraid etc.) and miscellaneous words (bring, drink etc.). It contains 49613 images and 23 classes. The pre-processed dynamic hand gesture samples with busy backgrounds are shown in Fig. 5 (c) & Fig. 5 (d).

Fig. 5. Data Samples. (a) 1 (b) A (c) Bring (d) Afraid

TABLE I
CLASSIFICATION PERFORMANCE

Training Accuracy | Validation Accuracy | Testing Accuracy
99.96%            | 99.92%              | 99.89%
Precision | Recall | F1-Score | Sensitivity | Specificity
99.80%    | 99.78% | 99.85%   | 99.78%      | 99.99%

TABLE II
COMPARISON WITH DIFFERENT CNN MODELS (BATCH SIZE = 64, EPOCHS = 11)

Model          | Accuracy | Training Time
VGG16 [25]     | 97.17%   | 544 seconds
VGG19 [26]     | 98.25%   | 646 seconds
MobileNet [27] | 99.21%   | 231 seconds
ResNet50 [28]  | 98.39%   | 649 seconds
Proposed       | 99.89%   | 194 seconds

IV. EXPERIMENTAL RESULT ANALYSIS

A. Experimental setup

The proposed methodology has been evaluated on the following hardware platform: AMD Ryzen 5 2400G (4 cores, 8 threads) CPU, NVIDIA RTX 2060 graphics card (6 GB GDDR6 VRAM, 192-bit bus) and 16 GB of DDR4 RAM clocked at 3000 MHz. The training phase takes 11 epochs to converge, and the cross-entropy loss has been monitored for 5 more epochs. The initial learning rate has been selected as 0.0001 experimentally. However, the learning rate is further decreased by a factor of 10 after every 10 epochs. The batch size of the proposed algorithm has been selected as 64 experimentally. The proposed algorithm is shown in Algorithm 1.
B. Results Analysis

In this paper, the training, validation and testing have been performed on 62187, 20713 and 20713 images respectively. The training, validation and test sets contain both static and dynamic hand gesture images. The test set has been evaluated as a blind test, i.e. the test images have never been used for training or validation. This blind testing formulation makes the proposed system more robust. The training and validation performance is shown in Fig. 6. The classification results achieved by the proposed technique are shown in Table I, which demonstrates that the proposed methodology performs satisfactorily on the blind test dataset. The experimental results have also been compared with different CNN architectures, and the comparison is shown in Table II.
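A sketch of the 60 : 20 : 20 split with a held-out blind test set, as used above, might look as follows; the use of scikit-learn and the placeholder arrays are assumptions.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 50, 50, 3)     # placeholder for the 103613 gesture images
y = np.random.randint(0, 59, 1000)      # placeholder labels for the 59 classes

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
# The blind test images (X_test) are never used for training or validation.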

Fig. 6. Training and Validation Performance.

Algorithm 1: Proposed Training Algorithm
• Create a 5-layered CNN with 1 convolution layer & 1 max-pooling layer in each layer.
• Train using the Adam optimizer.
• Reduce the learning rate by a factor of 10 after every 10 epochs.
• If the training loss does not reduce for 5 consecutive epochs, stop training.
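A hedged Keras sketch of this training procedure, reusing build_model() and the split arrays from the earlier sketches: the initial learning rate, batch size, decay schedule and stopping rule follow the paper, while the epoch ceiling is an assumption (training converged in 11 epochs in the paper).

from tensorflow.keras.callbacks import EarlyStopping, LearningRateScheduler
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical

model = build_model()
model.compile(optimizer=Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999),
              loss='categorical_crossentropy', metrics=['accuracy'])

callbacks = [
    # reduce the learning rate by a factor of 10 after every 10 epochs
    LearningRateScheduler(lambda epoch, lr: lr * 0.1 if epoch > 0 and epoch % 10 == 0 else lr),
    # stop if the training loss does not reduce for 5 consecutive epochs
    EarlyStopping(monitor='loss', patience=5),
]
model.fit(X_train, to_categorical(y_train, 59),
          validation_data=(X_val, to_categorical(y_val, 59)),
          batch_size=64, epochs=50, callbacks=callbacks)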
V. CONCLUSION

In this paper, a deep CNN architecture consisting of 5 layers has been proposed to detect and classify sign languages from hand gesture images. The proposed methodology uses both static (0 - 9 and A - Z) and dynamic (alone, afraid, anger etc.) gestures in the training, validation and blind testing phases to make the system more robust. A blind test accuracy of 99.89% has been obtained by the proposed technique. However, real-time detection and classification of hand gestures has not been implemented in this paper. Therefore, future research should be directed towards the development of intelligent methodologies for real-time sign language detection. In this regard, region-based CNNs can be implemented for more convenient detection of sign languages.


REFERENCES

[1] K. Snoddon, "Wendy Sandler and Diane Lillo-Martin, Sign Language and Linguistic Universals. Cambridge: Cambridge University Press, 2006," Language in Society, vol. 37, Oct. 2008.
[2] N. H. Dardas and N. D. Georganas, "Real-time hand gesture detection and recognition using bag-of-features and support vector machine techniques," IEEE Transactions on Instrumentation and Measurement, vol. 60, no. 11, pp. 3592–3607, 2011.
[3] R. Kurdyumov, P. Ho, and J. K. Ng, "Sign language classification using webcam images," 2011.
[4] L. Pigou, S. Dieleman, P.-J. Kindermans, and B. Schrauwen, "Sign language recognition using convolutional neural networks," in Lecture Notes in Computer Science. Springer, 2015, pp. 572–578. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-16178-5_40
[5] A. Das, S. Gawde, K. Suratwala, and D. Kalbande, "Sign language recognition using deep learning on custom processed static gesture images," in 2018 International Conference on Smart City and Emerging Technology (ICSCET). IEEE, 2018, pp. 1–6.
[6] J. Huang, W. Zhou, H. Li, and W. Li, "Sign language recognition using 3D convolutional neural networks," in 2015 IEEE International Conference on Multimedia and Expo (ICME), 2015, pp. 1–6.
[7] K. Bantupalli and Y. Xie, "American sign language recognition using deep learning and computer vision," in 2018 IEEE International Conference on Big Data (Big Data), 2018, pp. 4896–4899.
[8] U. Patel and A. G. Ambekar, "Moment based sign language recognition for Indian languages," in 2017 International Conference on Computing, Communication, Control and Automation (ICCUBEA), 2017, pp. 1–6.
[9] M. A. Uddin and S. A. Chowdhury, "Hand sign language recognition for Bangla alphabet using support vector machine," in 2016 International Conference on Innovations in Science, Engineering and Technology (ICISET), 2016, pp. 1–4.
[10] G. A. Rao, K. Syamala, P. V. V. Kishore, and A. S. C. S. Sastry, "Deep convolutional neural networks for sign language recognition," in 2018 Conference on Signal Processing and Communication Engineering Systems (SPACES), 2018, pp. 194–197.
[11] M. R. Abid, E. M. Petriu, and E. Amjadian, "Dynamic sign language recognition for smart home interactive application using stochastic linear formal grammar," IEEE Transactions on Instrumentation and Measurement, vol. 64, no. 3, pp. 596–605, 2015.
[12] D. Avola, M. Bernardi, L. Cinque, G. L. Foresti, and C. Massaroni, "Exploiting recurrent neural networks and leap motion controller for the recognition of sign language and semaphoric hand gestures," IEEE Transactions on Multimedia, vol. 21, no. 1, pp. 234–245, 2019.
[13] B. G. Lee and S. M. Lee, "Smart wearable hand device for sign language interpretation system with sensors fusion," IEEE Sensors Journal, vol. 18, no. 3, pp. 1224–1232, 2018.
[14] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA: MIT Press, 2016.
[15] C. Rother, V. Kolmogorov, and A. Blake, "'GrabCut': Interactive foreground extraction using iterated graph cuts," ACM Transactions on Graphics (TOG), vol. 23, no. 3, pp. 309–314, 2004.
[16] J. Minichino and J. Howse, Learning OpenCV 3 Computer Vision with Python. Packt Publishing Ltd, 2015.
[17] B. Xu, N. Wang, T. Chen, and M. Li, "Empirical evaluation of rectified activations in convolutional network," arXiv preprint arXiv:1505.00853, 2015.
[18] E. Jang, S. Gu, and B. Poole, "Categorical reparameterization with Gumbel-Softmax," arXiv preprint arXiv:1611.01144, 2016.
[19] L. Lu, Y. Shin, Y. Su, and G. E. Karniadakis, "Dying ReLU and initialization: Theory and numerical examples," arXiv preprint arXiv:1903.06733, 2019.
[20] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[21] Z. Zhang and M. Sabuncu, "Generalized cross entropy loss for training deep neural networks with noisy labels," in Advances in Neural Information Processing Systems, 2018, pp. 8778–8788.
[22] A. Khan, "Sign language gesture images dataset," https://www.kaggle.com/ahmedkhanak1995/sign-language-gesture-images-dataset.
[23] A. Nandy, S. Mondal, J. S. Prasad, P. Chakraborty, and G. Nandi, "Recognizing & interpreting Indian sign language gesture for human robot interaction," in 2010 International Conference on Computer and Communication Technology (ICCCT). IEEE, 2010, pp. 712–717.
[24] A. Nandy, J. S. Prasad, S. Mondal, P. Chakraborty, and G. C. Nandi, "Recognition of isolated Indian sign language gesture in real time," in International Conference on Business Administration and Information Processing. Springer, 2010, pp. 102–107.
[25] S. Tammina, "Transfer learning using VGG-16 with deep convolutional neural network for classifying images," International Journal of Scientific and Research Publications (IJSRP), vol. 9, p. 9420, Oct. 2019.
[26] M. Shaha and M. Pawar, "Transfer learning for image classification," in 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA), 2018, pp. 656–660.
[27] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," CoRR, vol. abs/1704.04861, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861
[28] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015. [Online]. Available: http://arxiv.org/abs/1512.03385

