Deep Learning-Based Sign Language Recognition System For Static Signs
https://doi.org/10.1007/s00521-019-04691-y
S.I.: Hybrid Artificial Intelligence and Machine Learning Technologies
Received: 3 December 2018 / Accepted: 18 December 2019 / Published online: 1 January 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020
Abstract
Sign language is an efficacious means of communication for humans, and active research on its recognition is in progress in computer vision. The earliest work on Indian Sign Language (ISL) recognition considered only hand signs that are easy to differentiate, and therefore typically selected a small subset of signs from ISL for recognition. This paper deals with robust modeling of static signs in the context of sign language recognition using deep learning-based convolutional neural networks (CNNs). In this research, a total of 35,000 sign images of 100 static signs were collected from different users. The efficiency of the proposed system is evaluated on approximately 50 CNN models. The results are also evaluated with respect to different optimizers, and it has been observed that the proposed approach achieved the highest training accuracy of 99.72% and 99.90% on colored and grayscale images, respectively. The performance of the proposed system has also been evaluated on the basis of precision, recall and F-score. The system also demonstrates its effectiveness over earlier works in which only a few hand signs were considered for recognition.
Keywords Sign language · Data acquisition · Convolutional neural network · Max-pooling · Softmax · Optimizer
human signs. Networks based on deep learning paradigms deal with architectures and learning algorithms that are biologically inspired, in distinction to conventional networks. Generally, the training of deep networks occurs in a layer-wise manner and depends on more distributed features, as present in the human visual cortex. In this, the abstract features from the collected signs in the first layer are grouped into primary features in the second layer, which are further combined into more defined features in the next layer. These features are then combined into more engrossing features in the following layers, which helps in the better recognition of different signs [2].

Sign language presents a huge variability in the postures that a hand can have, which makes this discipline a particularly complex problem. To deal with this, a correct generation of the static postures is necessary. In addition, because each region has a specific language grammar, it is required to develop an Indian Sign Language database, which has not been available so far.

Most of the research work in sign language recognition based on deep learning techniques has been performed on sign languages other than Indian Sign Language. Recently, this area has been gaining popularity among researchers. The earliest reported work on sign language recognition is mainly based on machine learning techniques. These methods result in low accuracy as they do not extract features automatically. The main goal of deep learning techniques is automatic feature engineering, i.e., to automatically learn a set of features from raw data that can be useful in sign language recognition. In this manner, the manual process of handcrafted feature engineering is avoided.

There exist many reported research systems for sign language recognition based on deep learning and machine learning techniques. Nagi et al. [3] proposed a max-pooling CNN for vision-based hand gesture recognition. They employed color segmentation to retrieve the hand contour and morphological image processing to remove noisy edges. The experiments were performed on 6000 sign images collected from only six gesture classes and achieved an accuracy of 96%.

Rioux-Maldague and Giguere [4] presented a feature extraction technique for the recognition of hand pose using depth and intensity images captured with Kinect. They employed a threshold on the maximum hand depth for segmentation, and resized and centralized the images for preprocessing. The results were evaluated on known and unseen users using a deep belief network. Recall and precision of 99% were achieved for known users, while 77% recall and 79% precision were achieved for unseen users.

Huang et al. [5] presented a Kinect-based sign language recognition system using 3D convolutional neural networks. They used a 3D CNN to capture spatial-temporal features from raw data, which helps in extracting authentic features that adapt to the large variations of hand gestures. The model was validated on a real dataset collected from 25 signs, with a recognition rate of 94.2%. Huang et al. [6] proposed a RealSense-based sign language recognition system. They collected a total of 65,000 image frames containing 26 alphabet signs, out of which 52,000 were used for training and 13,000 for testing. The model was trained and classified using a deep belief network and achieved an accuracy of 98.9% with RealSense and 97.8% with Kinect. Pigou et al. [7] contributed a Microsoft Kinect and CNN-based recognition system. In this system, they used thresholding, background removal and median filtering for preprocessing. They implemented the Nesterov's Accelerated Gradient descent (NAG) optimizer and achieved a validation accuracy of 91.7% in recognizing Italian gestures. Molchanov et al. [8] presented a multi-sensor system for gesture recognition of the driver's hand. They calibrated the data received from depth, radar and optical sensors, and used a CNN to classify ten different gestures. The experimental results showed that the system achieved its best accuracy of 94.1% using a combination of all three sensors. Tang et al. [9] proposed a hand posture recognition system for sign language recognition using the Kinect sensor. They employed hand detection and tracking algorithms for preprocessing of the captured data. The proposed system was trained on 36 different hand postures using a LeNet-5 CNN-based model. The testing was performed using a Deep Belief Network (DBN) and a CNN, and it was found that the DBN outperformed the CNN with an overall average accuracy of 98.12%.

Yang and Zhu [10] presented video-based Chinese Sign Language (CSL) recognition using a CNN. They collected data using 40 daily vocabularies and showed that the developed method simplifies hand segmentation and avoids information loss while extracting features. They used the Adagrad and Adadelta optimizers for learning the CNN and found that Adadelta outperformed Adagrad. Tushar et al. [11] proposed a numerical hand sign recognition method using a deep CNN. They presented a layer-wise optimized architecture in which batch normalization contributes to faster training convergence and the dropout technique alleviates over-fitting. The collected American Sign Language (ASL) images were optimized using the Adadelta optimizer and resulted in an accuracy of 98.50%. Oyedotun and Khashman [2] developed a vision-based static hand gesture recognition system for recognizing 24 American Sign Language alphabets. The complete hand gestures were obtained from the publicly available Thomas Moeslund's gesture recognition database.
They implemented a CNN network and a Stacked Denoising Autoencoder (SDAE) network and achieved accuracies of 91.33% and 92.83% on the testing data, respectively. Bheda and Radpour [12] presented an American Sign Language-based recognition system for letters and digits. The proposed CNN-based architecture consists of three groups of convolutional layers, followed by a max-pool layer and a dropout layer, and two groups of fully connected layers. The collected images were preprocessed using a background subtraction technique, and the system achieved an accuracy of 82.5% on alphabets and 97% on digits using the stochastic gradient descent optimizer.

Rao et al. [13] developed a selfie-based sign language recognition system using a deep CNN. They created a dataset in which 200 signs are performed at different angles and under various background environments. They adopted mean-pooling, max-pooling and stochastic-pooling strategies on the CNN, and it was observed that stochastic pooling outperformed the other pooling strategies, with a recognition rate of 92.88%. Koller et al. [14] proposed a hybrid approach that combines the strong discriminative qualities of CNNs with the sequence modeling property of the Hidden Markov Model (HMM) for the recognition of continuous signs. The collected data were preprocessed using a dynamic programming-based approach. It was observed that the hybrid CNN-HMM approach outperforms the other state-of-the-art approaches.

Kumar et al. [15] proposed a two-stream CNN architecture which takes two color-coded images, the joint distance topographic descriptor (JDTD) and the joint angle topographical descriptor (JATD), as input. They collected and developed a dataset of 50,000 sign videos of Indian Sign Language and achieved an accuracy of 92.14%.

Based on the requirements mentioned above, this paper aims to develop a complete system based on deep learning models to recognize static signs of Indian Sign Language collected from different users. It presents an effective method for the recognition of Indian Sign Language digits, alphabets and words used in day-to-day life. The deep learning-based convolutional neural network (CNN) architecture is constructed using convolutional layers followed by other layers. A web camera-based dataset of static signs has been created under different environmental conditions. The performance of the proposed system has been evaluated using different deep learning models and optimizers, and in terms of precision, recall and F-score.

The paper is organized as follows. Section 2 describes the generalized CNN architecture used for classification. The proposed system design and architecture are demonstrated in Sect. 3. Section 4 describes the experimental results and analysis. Finally, the research is concluded in Sect. 5.

2 CNN architecture components

The objective of a CNN is to learn higher-order features present in the data using convolutions. The CNN architecture works well for the recognition of objects in images: CNNs can recognize individuals, faces, street signs and other facets of visual data. There exist a number of CNN variations, but each of them is based on the pattern of layers present, as shown in Fig. 1.

A CNN architecture consists of different components, which include different types of layers and activation functions. The purpose and functioning of some commonly used layers are discussed below.

Convolutional layer The core building block of a CNN architecture is the convolutional layer. Convolutional layers (Conv) modify the input data with the help of a patch of locally connected neurons from the previous layer. The layer computes the dot product between the region of the neurons present in the input layer and the weights to which they are locally connected in the output layer.

A convolution is a mathematical operation that describes a rule for merging two sets of information. The convolution operation takes the input, applies a convolution filter or kernel, and returns a feature map as output, as shown in Fig. 2. The operation slides the kernel across the input data, producing the convolved output: at each step, the input values within the kernel boundaries are multiplied by the kernel and summed to create a single value in the output feature map.

Let us suppose an input image of frame size W ∈ R^{w×h}. A convolutional filter of size F is used for convolution with a stride of S and padding P on the input image boundary. The size of the output of the convolution layer is given by Eq. (1):

Output = (W − F + 2P)/S + 1   (1)

For example, suppose there is one neuron with a receptive field size of F = 3, the input size is W = 128, and there is zero padding of P = 1. The neuron strides across the input with stride S = 1, giving an output of size (128 − 3 + 2)/1 + 1 = 128.
The output of a convolutional layer is denoted by the standardized Eq. (2):

a_j^n = f( Σ_{i∈C_j} y_i^{n−1} * k_{ij}^n + b_j^n )   (2)

where * is the convolution operation, n represents the nth layer, a_j^n is the jth output map, y_i^{n−1} represents the ith input map in the (n−1)th layer, the convolutional kernel is represented by k_{ij}^n, b_j^n represents the bias, C_j represents the set of input maps and f is an activation function [10].
For example, suppose that the input volume has size [128 × 128 × 3]. If the filter size is 3 × 3, then each neuron in the convolution layer will have weights to a [3 × 3 × 3] region of the input volume, for a total of 3 × 3 × 3 = 27 weights plus 1 bias parameter.
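To make the sliding-kernel computation of Eq. (2) concrete, the following NumPy sketch implements a naive single-map "valid" convolution (an illustration of the operation only; the function name and random inputs are our own):

import numpy as np

def conv2d_valid(x: np.ndarray, k: np.ndarray, b: float = 0.0) -> np.ndarray:
    """Slide kernel k over input map x and sum the elementwise products (Eq. 2, one input map)."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # elementwise product of the kernel with the current patch, then sum
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k) + b
    return out

x = np.random.rand(128, 128)     # one 128 x 128 input map
k = np.random.rand(3, 3)         # one 3 x 3 kernel
print(conv2d_valid(x, k).shape)  # -> (126, 126), i.e., (128 - 3)/1 + 1 = 126 without padding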
The main objective of the other feature extraction layers is to reduce the dimensions of the output generated by the convolutional layers. After convolution, the max-method is used over a region of some specific size for subsampling of the feature map. This operation is given by Eq. (3):

a_j^n = s(a_i^{n−1}), ∀i ∈ V_j   (3)

where s is the subsampling operation and V_j is the jth region of subsampling in the nth input map [10].

Pooling layer Pooling layers help in gradually reducing the representation of data over the network and in controlling over-fitting. The pooling layer operates independently on every depth slice of the input. The max() operation used by the pooling layer helps in resizing the input data spatially (width, height); this operation is called max-pooling. The down-sampling in this layer is performed by applying filters to the input data.

For example, an input volume of size [126 × 126 × 16] pooled with filter size 2 and stride 2 gives an output volume of size [63 × 63 × 16].
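The pooling example above can be reproduced with a short NumPy sketch (again our own illustrative helper, assuming non-overlapping windows):

import numpy as np

def max_pool2d(x: np.ndarray, size: int = 2, stride: int = 2) -> np.ndarray:
    """Max-pooling applied independently to every depth slice of a (H, W, C) volume."""
    h, w, c = x.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.empty((oh, ow, c))
    for i in range(oh):
        for j in range(ow):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size, :]
            out[i, j, :] = window.max(axis=(0, 1))  # max over each spatial window
    return out

x = np.random.rand(126, 126, 16)
print(max_pool2d(x).shape)  # -> (63, 63, 16), as in the example above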
ReLU layer ReLU stands for Rectified Linear Unit. The ReLU layer applies an element-wise activation function that thresholds the input data at zero, for example max(0, x), giving an output of the same dimensions as the input to the layer. The usage of ReLU layers does not affect the receptive field of the convolution layer and at the same time provides nonlinearity to the network. This nonlinear property of the function helps in the better generalization of the classifier. The nonlinear function f(x) used in the ReLU layer is shown in Eq. (4):

f(x) = max(0, x)   (4)

The sigmoid function and the hyperbolic tangent are some other activation functions that can also be used to introduce
nonlinearity in the network. The usage of ReLU is preferred because the derivative of the function helps backpropagation work considerably faster, without making any noticeable difference to the generalization accuracy [16].

Fully connected layer/output layer The fully connected layer is used to compute scores of the different features for classification. The dimensions of its output volume are [1 × 1 × N], where N represents the number of output classes to be evaluated. Each output neuron is connected to all neurons in the previous layer with its own set of weights. Furthermore, the fully connected layer is a set of convolutions in which each feature map is connected with every field of the consecutive layer and the filters have the same size as the input image [16].

For example, flattening a [63 × 63 × 16] volume for a fully connected layer gives an output volume of [1 × 1 × 63,504].

The final layer is the classification layer. As sign language recognition is a multi-class classification problem, the softmax function is used in the output layer for classification. Finally, a last fully connected layer computes the class scores, with one neuron per class; for example, with 1000 classes in the dataset this layer has 1000 neurons.
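As an illustration of this classification step, a numerically stable softmax over raw class scores can be sketched as follows (our own example with three hypothetical classes; the proposed system itself uses 100 classes):

import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = scores - scores.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.1])  # raw scores for three hypothetical classes
print(softmax(scores))              # -> approximately [0.659, 0.242, 0.099]
print(softmax(scores).sum())        # -> 1.0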
Generally, the CNN architecture consists of four main layers: the convolutional layer, the pooling layer, the ReLU layer and the fully connected or output layer. The proposed sign language recognition system has been tested on approximately 50 CNN models by varying hyperparameters such as the filter size, stride and padding, as presented in Sect. 3. The system has also been tested by changing the number of convolutional and pooling layers. To enhance the effectiveness of the results, one more layer, i.e., a dropout layer, is also added in the proposed approach; dropout is a regularization technique that ignores randomly selected neurons during training and helps in reducing the chances of over-fitting.

3 System design and rationale

The proposed sign language recognition system includes four major phases: data acquisition, image preprocessing, and training and testing of the CNN classifier. Figure 3 shows the data flow diagram depicting the working model of the system. The first phase is data acquisition, in which the RGB data of static signs are collected using a camera. The collected sign images are then preprocessed using image resizing and normalization. These normalized images are stored in the data store for future use. In the next phase, the proposed system is trained using the CNN classifier, and the trained model is then used to perform testing. The last phase is the testing phase, in which the CNN architecture parameters are fine-tuned until the results match the desired accuracy.

3.1 Data acquisition

The three-channel (RGB) image frames are retrieved from the camera, and these images are then passed to the image preprocessing module. The dataset consists of RGB images of different static signs. It comprises 35,000 images, with 350 images for each static sign. There are 100 distinct sign classes, which include 23 English alphabets, 10 digits (0–9) and 67 commonly used words (e.g., bowl, water, stand, hand, fever). The dataset consists of static sign images of various sizes and colors, taken under different environmental conditions to assist in the better generalization of the classifier. A few examples from the dataset are shown in Fig. 4.

3.2 Data preprocessing

Data preprocessing is the application of different morphological operations that remove noise from the data. In this phase, the sign images are preprocessed using two methods: image resizing and normalization. In image resizing, each image is resized to 128 × 128. These images are then normalized to change the range of pixel intensity values so that they have mean 0 and variance 1.
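A minimal sketch of this preprocessing step is given below. The paper does not name an image library, so the use of OpenCV for resizing is an assumption, and the dummy frame stands in for a captured sign image:

import cv2
import numpy as np

def preprocess_sign(image: np.ndarray) -> np.ndarray:
    """Resize a sign image to 128 x 128 and standardize it to mean 0, variance 1."""
    resized = cv2.resize(image, (128, 128)).astype(np.float32)  # step 1: image resizing
    return (resized - resized.mean()) / resized.std()           # step 2: normalization

# dummy camera frame standing in for a captured RGB sign image
frame = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
x = preprocess_sign(frame)
print(x.shape, float(x.mean()), float(x.std()))  # (128, 128, 3), ~0.0, ~1.0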
3.3 Model training

Model training is based upon convolutional neural networks. The proposed model is trained using a Tesla K80 Graphical Processing Unit (GPU) with 12 GB memory, 64 GB Random Access Memory (RAM) and a 100 GB Solid State Drive (SSD). The classifier takes the preprocessed sign images and classifies them into the corresponding categories. The classifier is trained on the dataset of different ISL signs. The dataset is shuffled and divided into training and validation sets, with the training set being 80% of the whole dataset. Shuffling the dataset is very significant in terms of adding randomness to the neural network training process, which prevents the network from becoming biased toward certain parameters. The configuration of the CNN architecture used in the proposed system is described in Table 1.
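The layer stack of Table 1 can be expressed as a short Keras sketch. The kernel sizes, padding and dropout rate are not stated explicitly in the paper; the values below are inferred from the output shapes and parameter counts in Table 1 (3 × 3 kernels, "same" padding for the first convolution, "valid" for the second) or are assumptions (the dropout rate), so this is our reading of the table rather than the authors' published code:

from tensorflow.keras import layers, models

def build_model(num_classes: int = 100) -> models.Model:
    """CNN consistent with Table 1: two conv layers, max-pooling, dropout, two dense layers."""
    return models.Sequential([
        layers.Input(shape=(128, 128, 3)),
        layers.Conv2D(16, (3, 3), padding="same", activation="relu"),   # -> (128, 128, 16)
        layers.Conv2D(16, (3, 3), padding="valid", activation="relu"),  # -> (126, 126, 16)
        layers.MaxPooling2D(pool_size=(2, 2)),                          # -> (63, 63, 16)
        layers.Dropout(0.5),                                            # rate not given in the paper
        layers.Flatten(),                                               # -> 63,504
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),                # 100 sign classes
    ])

build_model().summary()  # layer output shapes should mirror Table 1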
3.4 Testing

The developed sign language recognition system has been tested on approximately 50 convolutional neural network models. The algorithms with different optimizers are used
to train the network for a maximum of 100 epochs with categorical cross-entropy as the loss function. Some of the other parameters, which were used to fine-tune the network architecture based upon the preliminary results and after applying some heuristics to increase the accuracy and find an optimal CPU/GPU computing usage, are described in Table 2.

It can be observed from Table 2 that the accuracy of the proposed model increases as we limit the number of layers in the CNN architecture. The training and validation accuracy increase to 99.17% and 98.80%, respectively, on reducing the number of layers from 8 to 4. On the other hand, the accuracy decreases as we increase the number of filters from 16 to 32 and then to 64 with 20 epochs. It has been observed that the recognition rate is high with only 20 epochs.

Optimizers are used to tweak the parameters or weights of the model, which helps in minimizing the loss function and predicting results as accurately as possible. In this paper, the proposed model is tested on different optimizers, namely Adaptive Moment Estimation (Adam), Adagrad, Adadelta, RMSprop and Stochastic Gradient Descent (SGD). The model was first trained using AlexNet with Adam as the optimizer and achieved training and validation accuracies of 10% and 5%, respectively. It took a total of 4 h to train this model, and it was observed that the obtained model is highly under-fitted. In the next step, we reduced the number of layers from 8 to 5, and the training and validation accuracies increased to 42% and 26%, respectively, using Adam as the optimizer and 16 filters. The proposed model achieved its best result, with training and validation accuracies of 99.17% and 98.80%, respectively, using a total of 4 layers, 16 filters and Adam as the optimizer.

The proposed model is tested using different optimizers.
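A hedged sketch of the training setup described above (categorical cross-entropy, up to 100 epochs, an 80/20 shuffled split, interchangeable optimizers) is given below; the placeholder data and the batch size are our own, and build_model is the Table 1 sketch from Sect. 3.3:

import numpy as np
from tensorflow.keras.optimizers import SGD, Adam, Adagrad, Adadelta, RMSprop
from tensorflow.keras.utils import to_categorical

model = build_model()  # the Table 1 sketch from Sect. 3.3

# Any of the optimizers compared in the paper can be swapped in here.
model.compile(optimizer=SGD(),  # or Adam(), Adagrad(), Adadelta(), RMSprop()
              loss="categorical_crossentropy",  # loss function used in the paper
              metrics=["accuracy"])

# Placeholder data standing in for the preprocessed sign images and labels.
x = np.random.rand(320, 128, 128, 3).astype("float32")
y = to_categorical(np.random.randint(0, 100, size=320), num_classes=100)

history = model.fit(x, y,
                    validation_split=0.2,  # 80% training / 20% validation split
                    shuffle=True,          # shuffling adds randomness and avoids bias
                    epochs=100,            # maximum of 100 epochs
                    batch_size=32)         # assumption; batch size is not reported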
Table 1 Proposed system architecture

Layer type        Output size       Parameters
Input             128 × 128 × 3     –
Conv2d_1          (128, 128, 16)    448
Conv2d_2          (126, 126, 16)    2320
Maxpooling2d_1    (63, 63, 16)      0
Dropout           (63, 63, 16)      0
Flatten           63,504            0
Dense_1 (FC1)     64                4,064,272
Dense_2 (FC2)     100               6500

Total parameters: 4,073,540
Trainable parameters: 4,073,540
Non-trainable parameters: 0
Experimental results with respect to the optimizers on the colored image dataset are presented in Table 3. It has been observed that SGD outperformed RMSProp, Adam and the other optimizers with 16 filters and 4 layers. The proposed model obtained training and validation accuracies of 99.72% and 98.56%, respectively, using the SGD optimizer. A distinct advantage of SGD is that it performs faster calculations and performs updates more frequently on massive datasets.

The proposed model is also tested on grayscale data. The results obtained with respect to different optimizers, using 16 filters and 4 layers on the grayscale image dataset, are given in Table 4. It has been observed that the model achieved training and validation accuracies of 99.24% and 98.85%, respectively, using the Adam optimizer. The system achieved training and validation accuracies of 99.76% and 98.35%, respectively, using RMSProp, and it has been found that the SGD optimizer outperformed Adam, RMSProp and the other optimizers with training and validation accuracies of 99.90% and 98.70%, respectively, on the grayscale image dataset.

4 Experimental results and analysis

The performance of the Indian Sign Language recognition system is evaluated on the basis of two different experiments. Firstly, the parameters used in training the model are fine-tuned, in which the number of layers, the number of filters and the optimizers are changed. In the second experiment, the performance of the trained model is
evaluated on the color as well as the grayscale image dataset. The average precision, recall, F1-score and accuracy of the ISL recognition system have also been computed.

Precision is defined as

Precision = TP/(TP + FP)   (5)

where TP and FP are the numbers of true and false positives, respectively.

Recall is defined as

Recall = TP/(TP + FN)   (6)

where FN is the number of false negatives.

The F1-score is defined as

F1-score = 2 × Precision × Recall/(Precision + Recall)   (7)
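Equations (5)-(7) correspond directly to standard library routines; a small sketch using scikit-learn (our own illustration, with toy labels) shows how the per-class values can be averaged:

from sklearn.metrics import precision_score, recall_score, f1_score

# toy example: true and predicted sign classes for six samples
y_true = ["A", "B", "A", "Water", "B", "A"]
y_pred = ["A", "B", "Water", "Water", "B", "A"]

# macro-averaging computes Eqs. (5)-(7) per sign class and then averages
print(precision_score(y_true, y_pred, average="macro"))
print(recall_score(y_true, y_pred, average="macro"))
print(f1_score(y_true, y_pred, average="macro"))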
The classification performance for some of the grayscale sign samples, showing precision, recall and F1-score, is given in Table 5. The complete results for all the signs are given in "Appendix".

The training accuracy and loss range from about 12% and 3.623 after the third epoch to 99.90% and 0.012 after the 20th epoch, whereas the validation accuracy and loss range from 14% and 3.458 to 98.70% and 0.023 during the first 20 epochs, as described in Fig. 5. An early stopping mechanism is also applied in case the validation accuracy stops improving before the completion of the maximum of 30 epochs, to avoid over-fitting. The training concluded after the 20th epoch due to stagnation in the improvement of the validation loss.
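In Keras terms, this early stopping behavior can be sketched with the standard callback. This is a sketch of the mechanism described above, not the authors' exact configuration; in particular, the patience value is an assumption:

from tensorflow.keras.callbacks import EarlyStopping

# model, x and y as in the training sketch of Sect. 3.4
early_stop = EarlyStopping(monitor="val_loss",        # stop on stagnating validation loss
                           patience=3,                # assumption; not reported in the paper
                           restore_best_weights=True)

history = model.fit(x, y,
                    validation_split=0.2,
                    epochs=30,             # maximum of 30 epochs, as described above
                    callbacks=[early_stop])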
to find the optimal parameter values (number of layers,
in Fig. 5. The early stopping mechanism is also applied in
kernel size) for the implementation of the algorithm.
case the validation accuracy stops improving before the
completion of maximum of 30 epochs to avoid over-fitting.
Table 5 Classification performance

Sign     Precision  Recall  F1-score  Sign     Precision  Recall  F1-score
A 1.00 0.96 0.98 Me 1.00 1.00 1.00
Afraid 0.97 0.97 0.97 Nose 0.98 1.00 0.99
B 1.00 1.00 1.00 Oath 1.00 1.00 1.00
Bent 0.97 1.00 0.99 Open 1.00 0.97 0.98
Coolie 0.97 0.94 0.96 P 1.00 0.97 0.98
Claw 1.00 1.00 1.00 Pray 1.00 1.00 1.00
D 0.79 0.97 0.87 Q 0.97 1.00 0.99
Doctor 0.98 1.00 0.99 S 0.95 1.00 0.97
Eight 0.96 0.90 0.93 Sick 1.00 1.00 1.00
Eye 1.00 1.00 1.00 Strong 0.97 1.00 0.98
Fever 0.95 1.00 0.97 T 0.99 1.00 0.99
Fist 0.97 0.98 0.97 Tongue 0.99 1.00 0.99
Gun 0.97 1.00 0.99 Trouble 1.00 0.95 0.97
H 1.00 1.00 1.00 U 1.00 0.99 0.99
Hand 0.97 1.00 0.98 V 1.00 1.00 1.00
I 1.00 1.00 1.00 West 1.00 0.93 0.96
Jain 0.99 1.00 0.99 Water 0.93 0.98 0.95
Fig. 5 Accuracy and loss curves for training and validation datasets
5 Conclusion and future scope

In this research, an effective method for the recognition of ISL digits, alphabets and words used in daily routine is presented. The proposed CNN architecture is designed with convolutional layers, followed by ReLU and max-pooling layers. Each convolutional layer consists of different filtering window sizes, which helps in improving the speed and accuracy of recognition. A web camera-based dataset of 35,000 images of 100 static signs has been generated under different environmental conditions. The proposed architecture has been tested on approximately 50 deep learning models using different optimizers. The system achieves the highest training and validation accuracies of 99.17% and 98.80%, respectively, with respect to changes in parameters such as the number of layers and the number of filters. The proposed system is also tested using different optimizers, and it has been found that SGD outperformed the Adam and RMSProp optimizers, with training and validation accuracies of 99.90% and 98.70%, respectively, on the grayscale image dataset. The results of the proposed system have also been evaluated on the basis of precision, recall and F-score. It has been found that the system outperformed other existing systems even with a smaller number of epochs.

The major challenge in sign language recognition is the capability of recognition systems to adequately process a large number of different manual signs while executing with low error rates. For this condition, it has been shown that the proposed system is robust enough to learn 100 different static manual signs with low error rates, in contrast to the recognition systems described in other works, in which only a few hand signs are considered for recognition.

For future work, there is a need to collect more data to refine the recognition method. Furthermore, experimentation is ongoing on the trained CNN model to recognize signs in real time. In addition, the system will be extended to recognize dynamic signs, which requires the collection and development of a video-based dataset; the system will then be tested using a CNN architecture by dividing the videos into frames. A video sequence contains temporal as well as spatial features. Firstly, a hand object is focused on to reduce the time and space complexity of the network. After that, the spatial features are extracted from the video frames, and the temporal features are extracted by relating the video frames over time. The frames of the training set will be given to the CNN model for the training process. Finally, the trained model will be used as a future reference to make predictions on the training and test data. The work will also be extended to develop a mobile-based application for the recognition of different signs in real time.

Acknowledgements This publication is an outcome of R&D work undertaken in the project under the Visvesvaraya PhD Scheme of Ministry of Electronics and Information Technology, Government of India, being implemented by Digital India Corporation (formerly Media Lab Asia). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan XP GPU used for this research.

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest.

Appendix
S no.  Sign  Precision  Recall  F1-score  S no.  Sign  Precision  Recall  F1-score
References

10. Yang S, Zhu Q (2017) Video-based Chinese sign language recognition using convolutional neural network. In: IEEE 9th international conference on communication software and networks (ICCSN), pp 929–934
11. Tushar AK, Ashiquzzaman A, Islam MR (2017) Faster convergence and reduction of overfitting in numerical hand sign recognition using DCNN. In: Humanitarian technology conference (R10-HTC), IEEE Region 10, pp 638–641
12. Bheda V, Radpour D (2017) Using deep convolutional networks for gesture recognition in American sign language. arXiv preprint arXiv:1710.06836
13. Rao GA, Syamala K, Kishore PVV, Sastry ASCS (2018) Deep convolutional neural networks for sign language recognition. In: IEEE conference on signal processing and communication engineering systems (SPACES), pp 194–197
14. Koller O, Zargaran S, Ney H, Bowden R (2018) Deep sign: enabling robust statistical continuous sign language recognition via hybrid CNN-HMMs. Int J Comput Vis 126(12):1311–1325
15. Kumar EK, Kishore PVV, Kiran Kumar MT (2019) 3D sign language recognition with joint distance and angular coded color topographical descriptor on a 2-stream CNN. Neurocomputing 372:40–54
16. Prabhu R (2018) Understanding of convolutional neural network (CNN) — deep learning. https://medium.com/@RaghavPrabhu/understanding-of-convolutional-neural-network-cnn-deep-learning-99760835f148. Accessed 4 Mar 2018
17. Rahaman MA, Jasim M, Ali MH, Hasanuzzaman M (2014) Real-time computer vision-based Bengali Sign Language recognition. In: 17th IEEE international conference on computer and information technology (ICCIT), pp 192–197
18. Uddin MA, Chowdhury SA (2016) Hand sign language recognition for Bangla alphabet using support vector machine. In: IEEE international conference on innovations in science, engineering and technology (ICISET), pp 1–4
19. Rao GA, Kishore PVV (2017) Selfie video based continuous Indian sign language recognition system. Ain Shams Eng J 9(4):1929–1939

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.