Real Time Static and Dynamic Sign Language Recognition Using Deep Learning
P Jayanthi1, Ponsy R K Sathia Bhama1*, K Swetha2 & S A Subash2
1Department of Computer Technology, MIT, Anna University, Chennai 600 044, Tamil Nadu, India
2Department of Information Technology, MIT, Anna University, Chennai 600 044, Tamil Nadu, India
Sign language recognition systems enable communication between deaf-mute people and normal users. Spatial localization of the hands can be a challenging task when the hands occupy only 10% of the entire image. This is overcome by designing a real-time efficient system capable of performing the extraction, recognition, and classification tasks within a single deep convolution network. Recognition is performed on static image datasets with simple and complex backgrounds and on a dynamic video dataset. The static image datasets are trained and tested using a 2D deep-convolution neural network, whereas the dynamic video dataset is trained and tested using a 3D deep-convolution neural network. Spatial augmentation is done to increase the number of images in the static datasets, and key-frame extraction is used to extract the key frames from the videos of the dynamic dataset. To improve the system performance and accuracy, a Batch-Normalization layer is added to the convolution network. The accuracy is nearly 99% for the dataset with a simple background, 92% for the dataset with a complex background, and 84% for the video dataset. With this accuracy, the system proves efficient at recognizing and interpreting sign language gestures in real time.
Keywords: Deaf-mute people, Human-machine interaction, Inception deep-convolution network, Key frame extraction, Video analytics
Fig. 1 — Batch normalization algorithm: (a) Forward pass, (b) Backward pass
Fig. 2 — Image dataset processing model with 2DD-CNN regularization and interpretation of gestures for simple and complex background image dataset
recognition, and classification tasks within a single neural network, without localization, for both static and dynamic hand gestures. This makes the network distinct from all the other CNN networks, which implement the extraction and classification tasks separately.

Experimental Details

Sign Language Recognition - Static Image Dataset
The proposed system avoids the localization task and minimizes the workload of training on the large dataset by using a GPU machine and an inception deep-CNN network.

Model for Sign Language Interpretation
The image dataset needs processing only in two dimensions, known as the spatial dimensions; the proposed system model for processing the image dataset with the 2D convolution network is shown in Fig. 2. The static images are augmented using various spatial-augmentation techniques and sent to the convolution layers, where the detection of embedded patterns and hand-region shrinking occur. Later, regularization techniques such as dropout20 and batch-normalization layers are added. Finally, classification of the gestures is carried out.

Spatial Augmentation
Spatial augmentation is usually done when the dataset size is small. Deep-CNNs are usually designed to learn from very large datasets, which provides good accuracy. Overfitting can also be avoided by using a larger dataset in the training model, thereby increasing the system performance in the testing phase. Spatial dataset augmentation can be done in two ways:
• Offline dataset augmentation
• Online dataset augmentation

Offline Dataset Augmentation
Offline dataset augmentation can be done by performing operations such as scaling, translation, rotation, flipping, adding noise, random crops, lighting changes, perspective transformation, and reversal. In this work, data augmentation is done to increase the size of the dataset with flipping, lighting-condition, random-crop, and reverse-ordering operations, as sketched below. In turn, the system can be trained more efficiently, thus leading to increased learning and good accuracy.
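As an illustration, the following minimal sketch applies the flipping, lighting-change, random-crop, and reverse-ordering operations named above to a single image held as a NumPy array; the function name, crop ratio, and brightness factor are illustrative assumptions, not values from the original system.

import numpy as np

def augment_offline(image: np.ndarray) -> list:
    """Return offline augmentations of one H x W x C uint8 image."""
    h, w = image.shape[:2]
    top = np.random.randint(0, max(1, h // 10))   # random crop offsets
    left = np.random.randint(0, max(1, w // 10))
    return [
        np.fliplr(image),                                             # flip
        np.clip(image.astype(np.float32) * 1.3, 0, 255).astype(np.uint8),  # lighting
        image[top:top + (9 * h) // 10, left:left + (9 * w) // 10],    # random crop (~90%)
        image[::-1, ::-1],                                            # reverse ordering
    ]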
2DD-CNN
A 2D convolution network is capable of learning the image datasets. The designed network, depicted in Fig. 3, can learn only the spatial dimensions of an image and hence cannot be applied to the dynamic video dataset. The network is an inception model capable of performing the recognition task within a single network architecture. The first
layer has a kernel size of 5 × 5 and strides equal to 3 × 3, and the remaining conv layers have a kernel size equal to 3 × 3 and strides equal to 1 × 1. The key idea is to design a CNN network capable of detecting the hands in the image without performing any localization task. Thus, the implemented system consists of an inception deep-CNN network that can detect only the hand region in the entire image through the designed convolution layers, which are capable of detecting the embedded patterns in the image, namely the hand region performing the gesture, excluding the other areas. It is well known that the hand region performing the gesture occupies only 10% of the image, and this part is embedded in the image along with the rest of the human body. The designed CNN architecture (Fig. 3) is adept at learning the spatial features, so it can detect the hand gestures without extracting them from the image. For the hand-region shrinking mechanism, the system uses 9 conv layers and 4 pooling layers. While training, the CNN might face the problem of overfitting, which is prevented by adding the dropout mechanism; the flow of the deep CNN for processing the static image dataset is shown in Fig. 3, and a sketch of the layer stack is given below.
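For concreteness, a hypothetical Keras sketch of such a stack follows: 9 conv layers with the stated kernel sizes and strides, 4 pooling layers, and the dropout and batch-normalization regularizers discussed next. The filter counts, input size, and pooling positions are assumptions, since the paper does not list them.

from tensorflow.keras import layers, models

def build_2dd_cnn(num_classes, input_shape=(224, 224, 3)):
    """Sketch of the 2DD-CNN: 9 conv layers, 4 pooling layers, BN, dropout."""
    m = models.Sequential()
    # First conv layer: 5 x 5 kernel, 3 x 3 strides
    m.add(layers.Conv2D(32, 5, strides=3, padding='same', activation='relu',
                        input_shape=input_shape))
    m.add(layers.BatchNormalization())
    # Remaining 8 conv layers: 3 x 3 kernels, 1 x 1 strides
    for i, filters in enumerate([32, 64, 64, 128, 128, 256, 256, 512]):
        m.add(layers.Conv2D(filters, 3, strides=1, padding='same',
                            activation='relu'))
        m.add(layers.BatchNormalization())
        if i % 2 == 1:                    # 4 pooling layers in total
            m.add(layers.MaxPooling2D(2))
    m.add(layers.GlobalAveragePooling2D())
    m.add(layers.Dropout(0.5))            # dropout regularization
    m.add(layers.Dense(num_classes, activation='softmax'))
    return m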
Regularization Using Batch-normalization
This is one of the regularization techniques applied to improve the performance of the network. The input is processed as a mini-batch; the mini-batch mean and variance are then taken, normalization is performed, and finally scaling and shifting operations are applied. Batch normalization is used to mitigate the internal covariate shift; it regularizes the model and reduces the need for dropout. In the designed network, the output from the conv layer is given as input to the batch-normalization layer. The input is a mini-batch that is processed by the BN layer, which is thus helpful in reducing the processing time. It is also used to normalize the layer inputs: the mini-batch mean is first computed as illustrated in Eq. 1.

$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i$ … (1)

After calculating the mean, the variance is determined using the formula illustrated in Eq. 2.

$\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_B\right)^2$ … (2)

Further, normalization is performed using the formula given in Eq. 3, and the value of $\hat{x}_i$ is found. After that, scale and shift operations are performed, as illustrated in Eq. 4, to obtain the batch-normalization output.

$\hat{x}_i = \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$ … (3)

$y_i = \gamma\,\hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i)$ … (4)

where γ and β are the learnable scale and shift values. The output $y_i$, as illustrated in Eq. 4, is sent to the next layers of the CNN for further processing.
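Eqs. (1) to (4) map directly onto a few lines of NumPy; the following is a minimal sketch of the training-time forward pass for a mini-batch of shape (m, d), with an illustrative function name.

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch-normalization forward pass for a mini-batch x of shape (m, d)."""
    mu = x.mean(axis=0)                    # Eq. (1): mini-batch mean
    var = x.var(axis=0)                    # Eq. (2): mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # Eq. (3): normalize
    return gamma * x_hat + beta            # Eq. (4): scale and shift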
Sign Language Recognition for Dynamic Video Dataset
The proposed system meant for classifying dynamic gestures is depicted in Fig. 4. For processing the video dataset, the system has a 3DD-CNN architecture. The variation with respect to the 2DD-CNN is that conv layers are removed and the Softmax activation is introduced instead; the number of pooling layers remains the same. However, the keyframes are taken from the input video dataset and used in the training phase. The dynamic videos are given as input and later split up into keyframes. The keyframes are processed through the 3D Deep-CNN network, followed by the detection of embedded patterns and hand-region shrinking. Regularization is done using dropout and batch normalization. Finally, classification is based on the probability, and interpretation of the gestures is carried out.
3D Deep CNN
In the 3D CNN network, the first layer is the conv layer, which has 3 dimensions (x, y, and z) covering the spatial and temporal dimensions and is capable of learning the video datasets. To perform the conv operation, a kernel size equal to 7 × 7 × 7 with strides of 3 × 3 × 3 has been used; the second conv layer has a kernel size of 5 × 5 × 5 with strides equal to 3 × 3 × 3, and the remaining conv layers have a filter size of 3 × 3 × 3 with strides equal to 3 × 3 × 3. A CNN has a convolutional configuration that limits the neural connections between layers and shares the same weights within a layer. A BN layer, with the same functioning as in the 2D CNN network (Fig. 3), is added after each of the 7 conv layers in the 3D CNN network. The Softmax activation function is used because it incorporates two important properties: the calculated values lie in the range 0 to 1, and the sum of all the probabilities equals 1. ReLU activation takes values from 0 to ∞, whereas the Softmax activation restricts the outputs based on the greater probability, thus reducing the workload of the system during processing and favouring learning. The hand-region shrinking and the detection of embedded patterns are done in the same way as in the 2DD-CNN model. The gesture recognition system is designed to interpret the sign language signs; to improve real-time efficiency, the system is trained on a GPU machine using the freely available online workspace Colaboratory, a GPU-backed environment with good storage capacity. A minimal sketch of the 3D network is given below.
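As with the 2D model, the following Keras sketch is hypothetical in its filter counts and input clip size; only the kernel sizes, strides, the 7 conv layers each followed by BN, and the Softmax classifier follow the description above. The pooling layers are omitted here because the stride-3 convolutions already downsample aggressively, which is a simplifying assumption of this sketch.

from tensorflow.keras import layers, models

def build_3dd_cnn(num_classes, input_shape=(16, 112, 112, 3)):
    """Sketch of the 3DD-CNN for keyframe clips: 7 conv layers with BN."""
    m = models.Sequential()
    # First conv layer: 7 x 7 x 7 kernel, 3 x 3 x 3 strides
    m.add(layers.Conv3D(32, 7, strides=3, padding='same', activation='relu',
                        input_shape=input_shape))
    m.add(layers.BatchNormalization())
    # Second conv layer: 5 x 5 x 5 kernel, 3 x 3 x 3 strides
    m.add(layers.Conv3D(64, 5, strides=3, padding='same', activation='relu'))
    m.add(layers.BatchNormalization())
    # Remaining conv layers: 3 x 3 x 3 kernels, 3 x 3 x 3 strides
    for filters in [64, 128, 128, 256, 256]:
        m.add(layers.Conv3D(filters, 3, strides=3, padding='same',
                            activation='relu'))
        m.add(layers.BatchNormalization())
    m.add(layers.GlobalAveragePooling3D())
    m.add(layers.Dropout(0.5))
    m.add(layers.Dense(num_classes, activation='softmax'))  # probabilities in [0, 1], summing to 1
    return m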
Fig. 4 — Architecture of SLR for dynamic video dataset

Keyframe Extraction
Keyframe extraction is a powerful technique for summarizing video content; it is categorized into 4 types, based on shot boundaries, visual information, movement analysis, and clustering methods. The main idea behind keyframe extraction is to convert the video into a one-dimensional signal, which eases the training and testing phases when the video is processed using the D-CNN network. The pre-processing phase is meant to process the video in such a manner that the computational complexity of the classification models is reduced. Keyframe extraction is the key aspect of feature extraction, since it extracts the frames responsible for the key movements of the hand signs and eliminates repetition of frames. The keyframe extraction algorithm below illustrates the mechanism followed to extract the keyframes; a code sketch follows the list.
Keyframe Extraction Algorithm
1. Calculate the interframe differences (inter-diff) of a video.
2. Calculate the sum of pixels in each frame.
3. Calculate the mean for each inter-diff frame.
4. Convolve the mean array with the 'Hamming window' array obtained by using the formula w(n) = 0.5 − 0.5 × cos(2πn/(M − 1)), 0 ≤ n ≤ M − 1.
5. Calculate the relative local extrema from the array obtained by the convolution.
6. The respective frame indexes are obtained using the local extrema.
7. The keyframes are extracted from the video using the frame indices.
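A minimal sketch of these seven steps, assuming OpenCV for decoding, a window length M = 25, and the mean of the absolute inter-frame difference as the per-frame value of steps 2 and 3, could look as follows; SciPy's argrelextrema stands in for the relative local extrema of step 5.

import cv2
import numpy as np
from scipy.signal import argrelextrema

def extract_keyframes(path: str, M: int = 25):
    """Sketch of the keyframe extraction algorithm (steps 1-7)."""
    cap = cv2.VideoCapture(path)
    frames, diffs = [], []
    ok, prev = cap.read()
    while ok:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
        # Steps 1-3: inter-frame difference; its pixel sum averaged per frame
        diffs.append(cv2.absdiff(frame, prev).mean())
        prev = frame
    cap.release()
    # Step 4: convolve with the window w(n) = 0.5 - 0.5 cos(2*pi*n/(M - 1))
    n = np.arange(M)
    w = 0.5 - 0.5 * np.cos(2 * np.pi * n / (M - 1))
    smoothed = np.convolve(np.asarray(diffs), w / w.sum(), mode='same')
    # Steps 5-6: relative local maxima of the smoothed signal give the indexes
    idx = argrelextrema(smoothed, np.greater)[0]
    # Step 7: extract the keyframes at those indices
    return [frames[i] for i in idx]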
Results and Discussion

Dataset Preparation
The dataset for sign language with a simple background was collected from the project "Sign language and Static gesture recognition using scikit-learn". The benchmark LSA64 dataset is used for the dynamic video processing; the LSA64 dataset consists of 64 signs and includes both one-handed (R: right hand) and two-handed (B: both hands) signs.

Static Signs Dataset
The American Sign Language (ASL) dataset with simple backgrounds consists of high-resolution images in which the human hand occupies 35% of the image and the rest is background. It initially contained only 1500 images of 14 different users posing different signs from A to Z (except J and Z) against a simple background. To increase the number of images in the dataset, the spatial …

Fig. 5 — Accuracy and loss for simple and complex background of static 2DD-CNN: (a) Accuracy (simple background), (b) Loss (simple background), (c) Accuracy (complex background), (d) Loss (complex background)

Fig. 6 — Results of dynamic video dataset processing for different number of videos: (a) Accuracy (500), (b) Loss (500), (c) Accuracy (1000), (d) Loss (1000), (e) Accuracy (1500), (f) Loss (1500)
Table 2 — Comparison study of various systems and the proposed network for static and dynamic datasets

Types                   Network   Accuracy  Precision  Recall  F1 Score
Dynamic Dataset         3D-CNN    0.838     0.894      0.843   0.840
                        AlexNet   0.782     0.776      0.784   0.780
                        VGG19     0.738     0.740      0.754   0.747
Static Simple Dataset   2D-CNN    0.992     0.991      0.990   0.991
                        AlexNet   0.959     0.961      0.957   0.957
                        VGG19     0.968     0.969      0.968   0.961
Static Complex Dataset  2D-CNN    0.919     0.903      0.919   0.911
                        AlexNet   0.909     0.892      0.909   0.900
                        VGG19     0.920     0.903      0.920   0.912
Table 2. Keyframe extraction is a pre-processing phase which reduces the computational complexity of the classification model. The local-maxima keyframe extraction method extracts the frames which are responsible for the hand movements and eliminates the repetition of frames. The 3DD-CNN network was initially trained on the keyframes of the first 500 videos, the learning was then extended to 1000 videos, and finally all 1500 videos were considered. The system converges to a better accuracy as the batch is increased. Since the proposed work uses batch normalization and regularization techniques, both are also applied in the AlexNet and VGG-19 architectures for comparison; the metrics reported in Table 2 can be computed as sketched below.
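For reference, the Table 2 metrics can be reproduced from a trained model's predictions with scikit-learn; the macro averaging below is an assumption, since the paper does not state the averaging mode.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def table2_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for a multi-class classifier."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='macro', zero_division=0)
    return {'Accuracy': accuracy_score(y_true, y_pred),
            'Precision': precision, 'Recall': recall, 'F1 Score': f1}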
Interpretation
The classification in the CNN is done based on the probability computed by the fully connected layer, which performs the multi-class classification; the gestures are separated into classes and, after training, they are interpreted. The interpretation is in text format. This makes the gestures performed by deaf-mute people understandable to normal users. The interpretation …
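A minimal sketch of this interpretation step for the static alphabet signs (the label list follows the A to Z, excluding J and Z, classes described earlier) might be:

import numpy as np

LABELS = [chr(c) for c in range(ord('A'), ord('Z') + 1)
          if chr(c) not in ('J', 'Z')]   # 24 static ASL classes

def interpret(probs: np.ndarray) -> str:
    """Map the Softmax probability vector to its text interpretation."""
    return LABELS[int(np.argmax(probs))]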
20 Srivastava N, Hinton G, Krizhevsky A, Sutskever I & Salakhutdinov R, Dropout: a simple way to prevent neural networks from overfitting, J Mach Learn Res, 15 (2014) 1929–1958.
21 Kishore P V V, Anil Kumar D, Chandra Sekhara Sastry A S & Kumar K, Motionlets matching with adaptive kernel for 3-D Indian sign language recognition, IEEE Sens J, 18 (2018) 3327–3337.
22 Pan J, Luo Y, Li Y, Khong C, Chun-Huat T, Aaron H & Thean V-Y, Wireless multi-channel capacitive sensor system for efficient glove-based gesture recognition with AI at the edge, IEEE Trans Circuits Syst, 67 (2020) 1624–1628.
23 Bao P, Maqueda A I, Del-Blanco C R & Garcia N, Tiny hand gesture recognition without localization via a deep convolutional network, IEEE Trans Consum Electron, 63 (2017) 251–257.
24 Kanchana P, Kosin C & Jing-Ming G, Signer independence finger alphabet recognition using discrete wavelet transform and area level run lengths, J Vis Commun Image Represent, 38 (2016) 658–677.
25 Nandy A, Prasad J S, Mondal S, Chakraborty P & Nandi G C, Recognition of isolated Indian sign language gesture in real time, Commun Comput Inf Sci, 70 (2010) 102–107.
26 Li Y & Zhang P, Static hand gesture recognition based on hierarchical decision and classification of finger features, Sci Prog, 105(1) (2022) 163–170.
27 Gupta S, Jaafar J & Ahmad W F W, Static hand gesture recognition using local Gabor filter, Procedia Eng, Int Symp Robot Intell Sensors (Kuching, Sarawak, Malaysia) 2012, 827–832.
28 Naveed M, Quratulain Q & Shaukat A, Comparison of GLCM based hand gesture recognition systems using multiple classifiers, Proc IEEE Int Conf Robot Autom (Xi'an, China) 2021, 1–5.
29 Ghosh D K & Ari S, Static hand gesture recognition using mixture of features and SVM classifier, 5th Int Conf Commun Syst Netw (Gwalior, MP, India) 2015, 1094–1099.
30 Lim K M, Tan A W C & Tan S C, Block based histogram of optical flow for isolated sign language recognition, J Vis Commun Image Represent, 40 (2016) 538–545.
31 Nielsen M, Neural Networks and Deep Learning (Determination Press, San Francisco, CA, USA) 2015.
32 Pan W, Zhang X & Zhongfu Ye, Attention-based sign language recognition network utilizing key frame sampling and skeletal features, IEEE Access, 8 (2020) 215592–215602.