
Journal of Scientific & Industrial Research
Vol. 81, November 2022, pp. 1186–1194
DOI: 10.56042/jsir.v81i11.52657

Real Time Static and Dynamic Sign Language Recognition using Deep Learning

P Jayanthi1, Ponsy R K Sathia Bhama1*, K Swetha2 & S A Subash2
1Department of Computer Technology, MIT, Anna University, Chennai 600 044, Tamil Nadu, India
2Department of Information Technology, MIT, Anna University, Chennai 600 044, Tamil Nadu, India

Received 20 July 2021; revised 29 October 2022; accepted 29 October 2022

Sign language recognition systems enable communication between deaf-mute people and normal users. Spatial localization of the hands is a challenging task when the hands occupy only 10% of the entire image. This is overcome by designing a real-time, efficient system capable of performing the extraction, recognition, and classification tasks within a single deep convolutional network. Recognition is performed on static image datasets with simple and complex backgrounds and on a dynamic video dataset. The static image datasets are trained and tested using a 2D deep convolutional neural network, whereas the dynamic video dataset is trained and tested using a 3D deep convolutional neural network. Spatial augmentation is applied to increase the number of images in the static datasets, and key-frame extraction is used to extract the key frames from the videos of the dynamic dataset. To improve the system performance and accuracy, a batch-normalization layer is added to the convolutional network. The accuracy is nearly 99% for the dataset with a simple background, 92% for the dataset with a complex background, and 84% for the video dataset. With this accuracy, the system proves to be efficient at recognizing and interpreting sign language gestures in real time.

Keywords: Deaf-mute people, Human-machine interaction, Inception deep-convolution network, Key frame extraction,
Video analytics

Introduction

Hand gesture recognition systems have become the heart of technology and are used significantly in Human-Computer Interaction1,2 (HCI), sign language recognition systems,3–7 commanding electronic devices,8 gaming, android applications, etc. Vision-based hand gesture recognition systems are used in human-computer interaction, in which hand tracking is followed by gesture recognition based on hand features extracted using background subtraction methods. The Hidden Markov Model (HMM) detects movement and skin colour information solely through visual cues. Many real-time applications were developed using colour gloves,9 pyroelectric sensors,10 and Kinect11 (depth) sensors to gain input, followed by feature extraction and classification. Other real-time, vision-based recognition systems (HCI)12,13 extracted the hand regions from the entire image using object marking approaches,14 and recognition was done only on the extracted regions using bounding box collection, sliding window strategies, and region proposal algorithms,15 followed by classification. Neural networks were found to be more efficient for recognition when trained with a sufficient amount of data. Hence, recognition systems were built using the Convolutional Neural Network (CNN),16 where extraction was done using Region Proposal Networks (RPN) and classification was carried out using Faster Region-CNN.17 Videos need to be processed in three dimensions to achieve a better learning and recognition rate; hence 3D CNNs were developed18,19 that combine a CNN to train spatial features with an RNN (LSTM) to train temporal features for hand gesture recognition from videos. The CNN can act directly on raw inputs: the inputs are split into channels, the feature representation combines these channels, and regularization20 is also applied to improve the performance.

Deep neural networks typically require enormous data to learn; however, problems like overfitting or underfitting can occur, with overfitting being the more common. It can be prevented using dropout, in which neurons are randomly dropped. Furthermore, to improve the performance of the CNN and to reduce time complexity, batch normalization is applied. In the existing literature, the use of gloves21 is wearisome, as users have to wear motion-sensor gloves whenever signs need to be translated.22 With this knowledge, researchers try to develop systems without the need for gloves, but such systems have problems with tracking hand movements.

——————
*Author for Correspondence
E-mail: [email protected]
Static Image Recognition Systems
Sign language gestures are stereotyped into two kinds: static and dynamic signs. Static signs are those in which a single image states the gesture. Examples of static signs of Sign Language (SL) are:
1. The English alphabet, except J and Z.
2. OK.
3. Pray.
4. House.
5. Know, etc.
The static image can be interpreted against complex and simple backgrounds; an approach23 that does not use a localization mechanism was developed for improving the accuracy of the system. Finger-spelling signs are recognised using statistical methods24 of machine learning and deep learning after the signs are processed using Gabor filters, feature mixtures, etc. Most of the literature states that static sign recognition is done only on finger spelling.25–29

Dynamic Video Recognition Systems
Videos contain two different dimensions, the spatial and the temporal, and they have to be processed in three dimensions. The compact representation of dynamic signs was extracted using a block-based histogram of optical flow.30 Neural networks with a deep CNN along with key-frame extraction could be used for recognizing the spatial features of a video.31,32 Masood et al.6 proposed a real-time sign language recognition system where the spatial features are trained using the Inception-model deep CNN while the temporal features are trained using an RNN. At the outset, the input is sent as a video, and the processing is done on the images after the video is split up into frames. A 3D CNN is implemented to process the spatial and temporal features of dynamic gestures.

2D Deep CNN
The 2D convolutions apply a 2D filter moving in the (x, y) directions.15 The directions specify the spatial dimensions. This can be used for image detection mechanisms, but it cannot cover the temporal dimension present in videos. A deep CNN typically requires a larger dataset19 to increase the learning rate, which has been accomplished using offline and online dataset augmentation techniques.

3D Deep CNN
The 3D convolutions apply a 3D filter moving in the (x, y, z) directions, which specify the spatial and temporal dimensions. The 3D-CNN5,6,18 is mainly used for event detection in videos, though it can also be used for image detection. Before the video dataset is processed by the network, it is split up into key frames.27 These frames are used in training the 3D-CNN model, and the gestures are efficiently determined in the testing phase.

Regularization Techniques
Regularization is the mechanism that improves CNN performance by reducing the time complexity and refining accuracy. This can be done with the dropout mechanism,20 which suppresses the overfitting that usually occurs in deep CNNs, and by reducing the internal covariate shift by adding a batch-normalization layer between the deep CNN layers. The fundamental notion of dropout is to arbitrarily drop units along with their links in the course of training: during training, dropout samples an exponential number of diverse "thinned" nets, and during testing the effect is approximated by averaging the predictions of all thinned nets, which eases the overfitting effects. The Batch Normalization (BN)15 technique is capable of improving the performance efficiency of the convolutional and non-convolutional layers of a deep CNN architecture. Batch normalization takes multiple inputs from the previous layer as a mini-batch for the normalization task: first it calculates the mean and the variance values, then it performs normalization using scaling and shifting factors. The batch-normalization algorithm consists of a forward pass and a backward pass, as depicted in Fig. 1. Batch-normalization layers can be split into sub-layers using a fission mechanism and later fused with the preceding conv layer, the ReLU, and the following conv layer; this process is commonly known as Batch Normalization Fission and Fusion (BNFF), and it reduces the number of memory sweeps from 3 to 1. Batch normalisation supports various optimizations such as data-reuse optimization, pruning and approximate computing, fusing and blending layers, and training acceleration.
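To make the dropout mechanism concrete, the following NumPy sketch shows an inverted-dropout forward pass. The drop probability, array sizes, and function name are illustrative assumptions, not details taken from this work.

```python
import numpy as np

def dropout_forward(h, p_drop=0.5, training=True):
    # Inverted dropout: during training, randomly zero units (with their
    # links) and rescale the survivors by 1/keep_prob, so the expected
    # activation matches test time; this approximates averaging the
    # predictions of the exponentially many "thinned" networks.
    if not training or p_drop == 0.0:
        return h  # testing: the full network is used unchanged
    keep_prob = 1.0 - p_drop
    mask = (np.random.rand(*h.shape) < keep_prob) / keep_prob
    return h * mask

# Example: a mini-batch of 4 samples with 8 hidden activations each
h = np.random.randn(4, 8)
print(dropout_forward(h, p_drop=0.5, training=True))
```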
From all these observations it is clear that the existing recognition systems had to implement different algorithms for each phase. The proposed work therefore focuses on a recognition mechanism without localization, which produces better results: the major idea is to combine the detection, extraction, recognition, and classification tasks within a single neural network for both static and dynamic hand gestures. This makes the network different from all the other CNN networks, which implement the extraction and classification tasks separately.
Fig. 1 — Batch normalization algorithm: (a) Forward pass, (b) Backward pass

Fig. 2 — Image dataset processing model with 2DD-CNN: regularization and interpretation of gestures for simple and complex background image datasets

Experimental Details

Sign Language Recognition - Static Image Dataset
The proposed system avoids the localization task and minimizes the workload of training on the large dataset by using a GPU machine and an Inception deep-CNN network.

Model for Sign Language Interpretation
The image dataset needs processing only in two dimensions, the spatial dimensions; the proposed system model for processing the image dataset with the 2D convolution network is shown in Fig. 2. The static images are augmented using various spatial-augmentation techniques and sent to the convolution layers, where the detection of embedded patterns and hand-region shrinking occur. Later, regularization techniques like dropout20 and batch-normalization layers are added. Finally, classification of the gestures is carried out.

Spatial Augmentation
Spatial augmentation is usually done when the dataset size is small. A deep CNN is usually designed to study and discriminate very large datasets, providing good accuracy. Overfitting can also be avoided by using a larger dataset in the training model, thereby increasing the system performance in the testing phase. Spatial dataset augmentation can be done in two ways:
• Offline dataset augmentation.
• Online dataset augmentation.

Offline Dataset Augmentation
Offline dataset augmentation can be done by performing operations such as scaling, translation, rotation, flipping, adding noise, random crops, lighting changes, perspective transformation, and reversal. Here, data augmentation is done to increase the size of the dataset using the flipping, lighting-condition, random-crop, and reverse-ordering operations, as in the sketch below. In turn, the system can be trained more efficiently, leading to increased learning and good accuracy.
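As an illustration of these offline operations, the sketch below generates augmented copies of a single sign image with OpenCV and NumPy. The paper does not state its exact parameter ranges, so the flip direction, brightness shift, crop fraction, noise level, and file name are assumptions.

```python
import cv2
import numpy as np

def augment_offline(image):
    """Produce spatially augmented copies of one sign image.
    All parameter values here are illustrative assumptions."""
    h, w = image.shape[:2]
    out = []
    out.append(cv2.flip(image, 1))                              # horizontal flip
    out.append(cv2.convertScaleAbs(image, alpha=1.0, beta=40))  # lighting change
    y = np.random.randint(0, max(1, h // 10))                   # random crop offset
    x = np.random.randint(0, max(1, w // 10))
    crop = image[y:y + int(0.9 * h), x:x + int(0.9 * w)]
    out.append(cv2.resize(crop, (w, h)))                        # crop, back to size
    noisy = image.astype(np.float32) + np.random.normal(0, 10, image.shape)
    out.append(np.clip(noisy, 0, 255).astype(np.uint8))         # additive noise
    return out

img = cv2.imread("sign_A.png")   # hypothetical file name
extra = augment_offline(img)     # one image becomes four more training images
```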
2DD-CNN
A 2D convolution network is capable of learning the image datasets. The designed network, depicted in Fig. 3, learns only the spatial dimensions of the image and hence cannot be applied to the dynamic video dataset. The network is an Inception model capable of performing the recognition task within a single network architecture. The first layer has a kernel size of 5 × 5 with strides of 3 × 3, and the remaining conv layers have a kernel size of 3 × 3 with strides of 1 × 1. The key idea is to design a CNN capable of detecting the hands in the image without performing any localization task. Thus, the implemented system consists of an Inception deep CNN whose convolution layers detect the embedded patterns in the image, that is, the hand region performing the gesture, while excluding all other areas. The hand region performing the gesture occupies only about 10% of the image and is embedded in the image along with the rest of the human body. The designed CNN architecture (Fig. 3) is adept at learning the spatial features, so it can detect the hand gestures without first extracting them from the image. For the hand-region shrinking mechanism, the system uses 9 conv layers and 4 pooling layers. The overfitting the CNN might face is prevented by adding the dropout mechanism, and the flow of the deep CNN for processing the static image dataset is shown in Fig. 3.

Fig. 3 — 2D-CNN flow: takes an image input, performs recognition, with classification done by fully-connected layers and output based on probability
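The paper fixes the kernel sizes, strides, and the 9-conv/4-pool layout, but not the filter counts, input resolution, or training configuration, and it does not name a framework. The minimal Keras sketch below is therefore a plain sequential approximation of the described layer counts under those assumptions, not the authors' exact Inception topology.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_2dd_cnn(input_shape=(128, 128, 3), n_classes=24):
    """Sketch of the 2DD-CNN: first conv 5x5 / stride 3x3, remaining convs
    3x3 / stride 1x1, 9 conv and 4 pooling layers in total, with batch
    normalization and dropout. Filter counts and input size are assumed;
    24 classes correspond to A-Z without J and Z."""
    m = models.Sequential([layers.Input(shape=input_shape)])
    m.add(layers.Conv2D(32, 5, strides=3, padding="same", activation="relu"))
    for i, f in enumerate([32, 64, 64, 128, 128, 256, 256, 512]):  # 8 more convs
        m.add(layers.Conv2D(f, 3, strides=1, padding="same", activation="relu"))
        m.add(layers.BatchNormalization())
        if i % 2 == 1:                     # 4 pooling layers in total
            m.add(layers.MaxPooling2D(2))
    m.add(layers.GlobalAveragePooling2D())
    m.add(layers.Dropout(0.5))             # dropout against overfitting
    m.add(layers.Dense(n_classes, activation="softmax"))
    m.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
    return m

model = build_2dd_cnn()
model.summary()
```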

Regularization Using Batch-normalization
This is one of the regularization techniques applied to improve the performance of the network. The input is processed as a mini-batch: the mini-batch mean and variance are computed, normalization is performed, and finally scaling and shifting operations are applied. Batch normalization is used to mitigate the internal covariate shift; it regularizes the model and reduces the need for dropout. In the designed network, the output of a conv layer is given as input to the batch-normalization layer. Since the BN layer processes its input as a mini-batch, it helps reduce the processing time, and it also increases the accuracy and efficiency of the system. The BN layer also plays a major role in the multi-class classification in the network, where it is applied after the dense layer and before the activation; its main purpose is to reduce the time complexity by processing the output as mini-batches and to reduce the internal covariate shift. The mini-batch mean is first calculated, where the input is taken as a batch: the inputs x_i come from the convolution layer and m is the mini-batch size. The mean \mu_\beta is calculated using Eq. 1:

\mu_\beta = \frac{1}{m} \sum_{i=1}^{m} x_i … (1)

After calculating the mean, the variance is determined as illustrated in Eq. 2:

\sigma_\beta^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_\beta)^2 … (2)

Normalization is then performed using Eq. 3 to obtain \hat{x}_i, after which scale and shift operations are applied as illustrated in Eq. 4 to produce the batch-normalization output:

\hat{x}_i = \frac{x_i - \mu_\beta}{\sqrt{\sigma_\beta^2 + \epsilon}} … (3)

y_i = \gamma \hat{x}_i + \beta \equiv BN_{\gamma,\beta}(x_i) … (4)

where \gamma and \beta are the scale and shift values. The output y_i of Eq. 4 is sent to the next layers of the CNN for further processing.
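A NumPy sketch of the forward pass of Eqs (1)–(4) is given below; the backward pass of Fig. 1 is omitted, since in practice a framework BN layer provides both. The batch and feature sizes are illustrative.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch-normalization forward pass over a mini-batch x of shape
    (m, features), directly following Eqs (1)-(4)."""
    mu = x.mean(axis=0)                      # Eq. (1): mini-batch mean
    var = ((x - mu) ** 2).mean(axis=0)       # Eq. (2): mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # Eq. (3): normalize
    return gamma * x_hat + beta              # Eq. (4): scale and shift

m, d = 32, 64                                # assumed batch and feature sizes
x = np.random.randn(m, d) * 3.0 + 5.0
y = batch_norm_forward(x, gamma=np.ones(d), beta=np.zeros(d))
print(y.mean(axis=0)[:3], y.std(axis=0)[:3]) # ~zero mean, ~unit variance
```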
Sign Language Recognition for Dynamic Video Dataset
The proposed system for classifying dynamic gestures is depicted in Fig. 4. For processing the video dataset, the system has a 3DD-CNN architecture. The variation with respect to the 2DD-CNN is that some conv layers are removed and the Softmax activation is introduced instead; the number of pooling layers remains the same. The keyframes extracted from the input video dataset are used in the training phase: the dynamic videos are given as input and split up into keyframes, which are processed through the 3D deep-CNN network, followed by detection of embedded patterns and hand-region shrinking. Regularization is done using dropout and batch normalization. Finally, classification is based on the probability, and the interpretation of the gestures is carried out.

Fig. 4 — Architecture of SLR for dynamic video dataset

3D Deep CNN
In the 3D CNN network, the first layer is a conv layer with 3 dimensions (x, y, and z) for the spatial and temporal dimensions, capable of learning the video datasets. To perform the conv operation, a kernel of size 7 × 7 × 7 with strides of 3 × 3 × 3 is used; the second conv layer has a kernel size of 5 × 5 × 5 with strides of 3 × 3 × 3; and the remaining conv layers have filter size 3 × 3 × 3 with strides of 3 × 3 × 3. A CNN has a convolutional configuration that limits the neural connections between layers and shares the same weights within a layer. A BN layer is added after each of the 7 conv layers in the 3D CNN network and has the same functioning as in the 2D CNN network (Fig. 3). The Softmax activation function is used because of two important properties: the calculated values lie in the range 0 to 1, and the sum of all the probabilities equals 1. ReLU activation takes values from 0 to ∞, whereas the Softmax activation restricts the outputs based on the greater probability, reducing the workload of the system during processing and thus favouring learning. The hand-region shrinking and the detection of embedded patterns are done as in the 2DD-CNN model. The gesture recognition system is designed to interpret the sign language signs, and to improve real-time efficiency it is trained on a GPU machine: the online workspace "Colaboratory" is a freely available GPU-based environment with a good storage capacity.
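The text fixes the 3D kernel sizes and strides, the 7-conv count, the per-conv batch normalization, and the 30-class softmax output, but not the filter counts, clip length, or frame resolution. Those are assumptions in this Keras sketch, which is a plain sequential approximation rather than the authors' exact network.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_3dd_cnn(input_shape=(16, 112, 112, 3), n_classes=30):
    """Sketch of the 3DD-CNN: 7x7x7, then 5x5x5, then 3x3x3 kernels,
    all with stride 3x3x3 as stated in the text, each conv followed by
    batch normalization, ending in a softmax over the 30 LSA64 signs.
    Filter counts, clip length and frame size are assumptions."""
    m = models.Sequential([layers.Input(shape=input_shape)])
    m.add(layers.Conv3D(32, 7, strides=3, padding="same", activation="relu"))
    m.add(layers.BatchNormalization())
    m.add(layers.Conv3D(64, 5, strides=3, padding="same", activation="relu"))
    m.add(layers.BatchNormalization())
    for f in (64, 128, 128, 256, 256):       # 5 more convs, 7 in total
        m.add(layers.Conv3D(f, 3, strides=3, padding="same", activation="relu"))
        m.add(layers.BatchNormalization())
    m.add(layers.GlobalAveragePooling3D())
    m.add(layers.Dropout(0.5))
    # softmax: outputs in [0, 1] that sum to 1, as described in the text
    m.add(layers.Dense(n_classes, activation="softmax"))
    m.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
    return m

model = build_3dd_cnn()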
Keyframe Extraction
Keyframe extraction is a powerful technique for summarizing video content; it is categorized into 4 types, based on shot boundary, visual information, movement analysis, and clustering. The main idea behind keyframe extraction is to convert the video into a one-dimensional signal, which eases the training and testing phases when processed using the 3D-CNN network. This pre-processing phase is meant to process the video in such a manner that the computational complexity of the classification models is reduced. Keyframe extraction is the key aspect of feature extraction, since it extracts those frames responsible for the key movement of the hand signs and eliminates repetition of frames. The following algorithm illustrates the mechanism used to extract keyframes (a sketch of the same procedure is given after the steps).

Keyframe Extraction Algorithm
1. Calculate the inter-frame differences (inter-diff) of a video.
2. Calculate the sum of pixels in each frame.
3. Calculate the mean for each inter-diff frame.
4. Convolve the mean array with the 'Hamming window' array obtained using the formula: w(n) = 0.5 − 0.5 × cos(2πn/(M−1)), 0 ≤ n ≤ M−1.
5. Calculate the relative local extrema of the resulting array.
6. The respective frame indexes are obtained from the local extrema.
7. The keyframes are extracted from the video using the frame indices.
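A possible Python implementation of steps 1–7 is sketched below, using OpenCV for frame I/O and SciPy for the local extrema. The window length and the use of local maxima (rather than all extrema) are assumptions, and the video file name is hypothetical.

```python
import cv2
import numpy as np
from scipy.signal import argrelextrema

def extract_keyframes(video_path, window_len=25):
    """Local-maxima keyframe extraction following the steps above;
    window_len is an assumed parameter."""
    cap = cv2.VideoCapture(video_path)
    frames, diffs = [], []
    ok, prev = cap.read()
    while ok:
        ok, frame = cap.read()
        if not ok:
            break
        # Steps 1-3: inter-frame difference, averaged over all pixels
        diffs.append(cv2.absdiff(frame, prev).mean())
        frames.append(frame)
        prev = frame
    cap.release()

    # Step 4: convolve the mean array with w(n) = 0.5 - 0.5*cos(2*pi*n/(M-1))
    n = np.arange(window_len)
    w = 0.5 - 0.5 * np.cos(2 * np.pi * n / (window_len - 1))
    smooth = np.convolve(np.array(diffs), w / w.sum(), mode="same")

    # Steps 5-7: relative local maxima give the keyframe indices
    idx = argrelextrema(smooth, np.greater)[0]
    return [frames[i] for i in idx]

keyframes = extract_keyframes("lsa64_sample.mp4")  # hypothetical file name
```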
Results and Discussion

Dataset Preparation
The dataset for sign language with a simple background was collected from the project "Sign language and Static gesture recognition using scikit-learn". The benchmark LSA64 dataset is used for the dynamic video processing; LSA64 consists of 64 signs and includes both one-handed (R: right hand) and two-handed (B: both hands) signs.

Static Signs Dataset
The American Sign Language (ASL) dataset with simple backgrounds has high-resolution images in which the human hand takes up about 35% of the image and the rest is background. It initially contained only 1500 images of 14 different users posing different signs from A–Z, except J and Z, on a simple background. To increase the number of images in the dataset, the spatial augmentation technique is used; hence 26880 images were obtained, with 1920 images from each of the 14 users. The dataset for sign language with a complex background contains 10 different ASL signs, 0–9; here the human hand performing the sign occupies only 10% of the image. It was collected from 14 different users, for a total of 1400 images. The size of the dataset was increased to 11200 images by spatial augmentation; it contains 80 images per gesture and 800 images per user.

Dynamic Signs Dataset
The proposed work uses the first 30 signs of LSA64: 1500 videos of 10 signers with 5 repetitions of each gesture. The list of gestures is shown in Table 1, where Name depicts the gesture and H designates whether only one hand (right hand, R) or both hands (B) are involved in the gesture.

Table 1 — List of gestures for the dynamic gesture recognition
ID Name H ID Name H ID Name H ID Name H
01 Opaque R 09 Women R 17 Call R 24 Argentina R
02 Red R 10 Enemy R 18 Skimmer R 25 Uruguay R
03 Green R 11 Son R 19 Bitter R 26 Country R
04 Yellow R 12 Man R 20 Sweet milk R 27 Last name R
05 Bright R 13 Away R 21 Milk R 28 Where R
06 Light-Blue R 14 Drawer R 22 Water R 29 Mock B
07 Colors R 15 Born R 23 Food R 30 Birthday R
08 Photo B 16 Learn R

Results
The dataset is processed on the online workspace "Colaboratory", which supports the Python and OpenCV library packages and runs on a GPU; 12 GB of memory is available for processing, along with 3 GB of RAM. The static datasets were processed by loading each dataset separately, while the video dataset was split into 3 parts of 2.8 GB each: the first part is used to train the system and the model is saved, then the second batch of data is used to continue training from the saved model, and so on (a sketch of this staged training is given below). This is done 3 times, and the accuracy obtained is 84%, wherein testing correctly interpreted 80% of the gestures. The accuracy obtained with the designed 2DD-CNN and 3DD-CNN networks is illustrated in Fig. 5 and Fig. 6, respectively. How the designed network works better than other benchmark networks for the static and complex-background datasets can be observed from Table 2.
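The staged training just described can be sketched as follows; the checkpoint and data file names, epoch count, and batch size are hypothetical, and build_3dd_cnn refers to the 3DD-CNN sketch given earlier.

```python
import numpy as np
from tensorflow.keras.models import load_model

model = build_3dd_cnn()  # 3DD-CNN sketch from the earlier section
for part in ("keyframes_part1.npz", "keyframes_part2.npz", "keyframes_part3.npz"):
    data = np.load(part)                           # one ~2.8 GB split of the videos
    model.fit(data["x"], data["y"], epochs=10, batch_size=16)
    model.save("sign3d_checkpoint.keras")          # save the trained state
    model = load_model("sign3d_checkpoint.keras")  # resume from it for the next part
```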

Fig. 5 — Accuracy and loss for simple and complex background of static 2DD-CNN: (a) Accuracy (simple background), (b) Loss (simple background), (c) Accuracy (complex background), (d) Loss (complex background)

Fig. 6 — Results of dynamic video dataset processing for different numbers of videos: (a) Accuracy (500), (b) Loss (500), (c) Accuracy (1000), (d) Loss (1000), (e) Accuracy (1500), (f) Loss (1500)

Table 2 — Comparison study of various systems and the proposed network for static and dynamic datasets

Types                   Network   Accuracy  Precision  Recall  F1 Score
Dynamic Dataset         3D-CNN    0.838     0.894      0.843   0.840
                        AlexNet   0.782     0.776      0.784   0.780
                        VGG19     0.738     0.740      0.754   0.747
Static Simple Dataset   2D-CNN    0.992     0.991      0.990   0.991
                        AlexNet   0.959     0.961      0.957   0.957
                        VGG19     0.968     0.969      0.968   0.961
Static Complex Dataset  2D-CNN    0.919     0.903      0.919   0.911
                        AlexNet   0.909     0.892      0.909   0.900
                        VGG19     0.920     0.903      0.920   0.912

Keyframe extraction is a pre-processing phase which reduces the computational complexity of the classification model; the local-maxima key extraction method extracts the frames responsible for the hand movements and eliminates the repetition of frames. The 3DD-CNN network was initially trained on the keyframes of the first 500 videos, then its learning was extended with 1000 videos, and finally all 1500 videos were considered. The system converges to a better accuracy as the batch is increased. Since the proposed work uses batch normalization and the regularization technique, both are also applied in the AlexNet and VGG-19 architectures for comparison.

Interpretation
The classification in the CNN is done based on probability by the fully connected layer, which performs the multi-class classification; the gestures are separated into classes and, after training, they are interpreted. The interpretation is in text format, which makes the normal user understand the gestures performed by the deaf-mute people (a minimal sketch of this step follows).
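The sketch below illustrates the interpretation step: the softmax probability vector from the fully connected layer is mapped to a text label. The abbreviated label list follows Table 1, and the example probabilities are invented for illustration.

```python
import numpy as np

# First few LSA64 sign names from Table 1 (the full list has 30 entries)
LABELS = ["Opaque", "Red", "Green", "Yellow", "Bright"]

def interpret(probabilities):
    """Return the text interpretation of one softmax output vector."""
    i = int(np.argmax(probabilities))          # most probable class
    return f"{LABELS[i]} ({probabilities[i]:.0%} confidence)"

probs = np.array([0.02, 0.05, 0.88, 0.03, 0.02])  # example softmax output
print(interpret(probs))                            # -> "Green (88% confidence)"
```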
The interpretation of the gestures in the form of text output is given in Fig. 7, and Table 2 presents the performance metrics of the proposed 2D-CNN and 3D-CNN networks together with AlexNet and VGG19. The designed network works better than the other benchmark networks for the static, complex, and video datasets; there is a significant increase in accuracy when detecting static signs and dynamic gestures using the proposed method. Evaluation of these performance metrics for the CNN model indicates that the proposed method outperforms the other two networks for both the static and the dynamic datasets.

Fig. 7 — The interpretation of the gesture in the form of text after classification based on the probability during training: (a) Opaque; (b) Water

Conclusions
The proposed methodology is a stimulating technique for gesture recognition of sign language on a simple-background static image dataset, a complex-background static image dataset, and a dynamic video dataset, with accuracies of 99%, 92%, and 84% respectively. The static signs are interpreted with the 2D-CNN, using augmentation to overcome overfitting, while the 3D-CNN recognizes the dynamic gestures after performing the local-maxima keyframe extraction, which is crucial for the recognition of signs. The system performance is further improved by adding a batch-normalization layer in both the 2D-CNN and the 3D-CNN. The proposed system is capable of recognizing and interpreting the gestures in both images and videos efficiently, while its limitation is that it handles only the static signs A–Z, except J and Z. Additionally, it could be deployed as a web application, where the client performs the signs and requests the server to predict the recognition of the sign, in the form of edge computing.

References
1 Bhatt R, Fernandes N & Dhage A, Vision based hand gesture recognition for human computer interaction, Int J Innov Sci Eng Technol, 2 (2013) 110–114.
2 Rautaray S S & Agrawal A, Vision based hand gesture recognition for human computer interaction: A survey, Artif Intell Rev, 43 (2015) 1–54.
3 Starner T E, Visual Recognition of American Sign Language Using Hidden Markov Models, MS dissertation, Massachusetts Institute of Technology, USA, 1995.
4 Anjo M D S, Pizzolato E B & Feuerstack S, A real-time system to recognize static gestures of Brazilian sign language (libras) alphabet using Kinect, Brazilian Symp on Human Factors in Computer Systems (Brazil) 2012, 259–268.
5 Huang J, Zhou W, Li H & Li W, Sign language recognition using 3D convolutional neural networks, Proc IEEE Int Conf Multimedia Expo (Torino, Italy) 2015, 1–6.
6 Masood S, Srivastava A, Thuwal H C & Ahmad M, Real-time sign language gesture (word) recognition from video sequences using CNN and RNN, Proc Int Conf Front Intell Comput: Theory Appl (Odisha, India) 2018, 623–632.
7 Joys J, Balakrishnan K & Sreeraj M, Sign quiz: A quiz based tool for learning finger spelled signs in Indian sign language using ASLR, IEEE Access, 7 (2019) 28363–28371.
8 Lee D & Park Y, Vision-based remote control system by motion detection and open finger counting, IEEE Trans Consum Electron, 55 (2009) 2308–2313.
9 Lamberti L & Camastra F, Real-time hand gesture recognition using a color glove, Proc Int Conf Image Analysis & Processing (ICIAP) (Ravenna, Italy) 2011, 365–373.
10 Erden F & Çetin A E, Hand gesture based remote control system using infrared sensors and a camera, IEEE Trans Consum Electron, 60 (2014) 2308–2313.
11 Wang Y & Yang R, Real-time hand posture recognition based on hand dominant line using Kinect, Proc IEEE Int Conf Multimed Expo Worksh (ICMEW) (San Jose, CA, USA) 2013, 1–4.
12 Mishra S R, Krishna D, Sanyal G & Sarkar A, A feature weighting technique on SVM for human action recognition, J Sci Ind Res, 79 (2020) 626–630.
13 Chen Z H, Kim J T, Liang J, Zhang J & Yuan Y B, Real-time hand gesture recognition using finger segmentation, Sci World J, 2014 (2014) 2456–2459.
14 Gokgoz K, The Nature of Object Marking in ASL, PhD Thesis, Purdue University, United States, 2013.
15 Girshick R, Donahue J, Darrell T & Malik J, Region based convolutional networks for accurate object detection and segmentation, IEEE Trans Pattern Anal Mach Intell, 38 (2015) 142–158.
16 Mahajan P, Abrol P & Lehana P K, Scene based classification of aerial images using convolution neural networks, J Sci Ind Res, 79 (2020) 1087–1094.
17 Kopuklu O, Gunduz A, Kose N & Rigoll G, Real-time hand gesture detection and classification using convolutional neural networks, Proc Int Conf Automatic Face Gesture Recognit (Lille, France) 2019, 1–8.
18 Molchanov P, Gupta S, Kim K & Kautz J, Hand gesture recognition with 3D convolutional neural networks, IEEE Int Conf Comput Vis Pattern Recognit Worksh (ICCVPR) 2015, 1–7.
19 Ji S, Xu W, Yang M & Yu K, 3D convolutional neural networks for human action recognition, IEEE Trans Pattern Anal Mach Intell, 35 (2012) 221–231.
20 Srivastava N, Hinton G, Krizhevsky A, Sutskever I & Salakhutdinov R, Dropout: A simple way to prevent neural networks from overfitting, J Mach Learn Res, 15 (2014) 1929–1958.
21 Kishore P V V, Anil Kumar D, Chandra Sekhara Sastry A S & Kumar K, Motionlets matching with adaptive kernel for 3-D Indian sign language recognition, IEEE Sens J, 18 (2018) 3327–3337.
22 Pan J, Luo Y, Li Y, Khong C, Chun-Huat T, Aaron H & Thean V-Y, Wireless multi-channel capacitive sensor system for efficient glove-based gesture recognition with AI at the edge, IEEE Trans Circuits Syst, 67 (2020) 1624–1628.
23 Bao P, Maqueda A I, Del-Blanco C R & Garcia N, Tiny hand gesture recognition without localization via a deep convolutional network, IEEE Trans Consum Electron, 63 (2017) 251–257.
24 Kanchana P, Kosin C & Jing-Ming G, Signer independence finger alphabet recognition using discrete wavelet transform and area level run lengths, J Vis Commun Image Represent, 38 (2016) 658–677.
25 Nandy A, Prasad J S, Mondal S, Chakraborty P & Nandi G C, Recognition of isolated Indian sign language gesture in real time, J Commun Comput Inf Sci, 70 (2010) 102–107.
26 Li Y & Zhang P, Static hand gesture recognition based on hierarchical decision and classification of finger features, Sci Prog, 105(1) (2022) 163–170.
27 Gupta S, Jaafar J & Ahmad W F W, Static hand gesture recognition using local Gabor filter, Procedia Engineering, Int Symp Robot Intell Sensors (Kuching, Sarawak, Malaysia) 2012, 827–832.
28 Naveed M, Quratulain Q & Shaukat A, Comparison of GLCM based hand gesture recognition systems using multiple classifiers, Proc IEEE Int Conf Robot Autom (Xi'an, China) 2021, 1–5.
29 Ghosh D K & Ari S, Static hand gesture recognition using mixture of features and SVM classifier, 5th Int Conf Commun Syst Netw (Gwalior, MP, India) 2015, 1094–1099.
30 Lim K M, Tan A W C & Tan S C, Block based histogram of optical flow for isolated sign language recognition, J Vis Commun Image Represent, 40 (2016) 538–545.
31 Nielson M, Neural Networks and Deep Learning (Determination Press, San Francisco, CA, USA) 2015.
32 Pan W, Zhang X & Zhongfu Ye, Attention-based sign language recognition network utilizing key frame sampling and skeletal features, IEEE Access, 8 (2020) 215592–215602.