
Volume 9, Issue 5, May – 2024 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165 https://doi.org/10.38124/ijisrt/IJISRT24MAY1891

Convolutional Neural Networks for Indian Sign Language Recognition
Manpreet Kaur Sidhu1; Snehal Hon2; Sandesh Marathe3; Tushar A. Rane4
Department of Information Technology
SCTR’s Pune Institute of Computer Technology Pune, India

Abstract:- Sign Language has been a crucial means of communication for the deaf and mute communities worldwide for ages. In India alone, 1 percent of the population consists of hard of hearing and mute individuals. Hence, to help support these marginalized communities, it is important to make use of technological advancements such as deep learning, computer vision and neural network technologies to create systems and applications that can not only help create sign language recognition software for the deaf community, but also provide means to educate others about sign languages around the world. In this paper, we present a system that utilizes Convolutional Neural Networks to recognize the alphabets A-Z of Indian Sign Language (ISL) by accepting real-time hand signs performed by the user as input from the user's camera feed, and then displays the recognized alphabet label as output in the form of text and speech. We created a custom Indian Sign Language dataset for all 26 alphabets for this experimentation. The extraction of key features was performed using CNN, background removal, hand segmentation and thresholding.

Keywords:- Convolutional Neural Network (CNN), Indian Sign Language (ISL), Deep Learning, Sign Language Recognition (SLR).

I. INTRODUCTION

The increase in hearing impairments in densely populated countries such as India emphasizes the need for better communication options. Advances in speech-to-text and sign language recognition are critical for enhancing accessibility. This study investigates these technologies, focusing on the need for inclusive solutions for Indian Sign Language (ISL) users.

Deaf and hard of hearing people experience difficulties due to the restricted use of ISL. Traditional interpretation methods are expensive and inconvenient. We present a camera-based system for ISL that uses Convolutional Neural Networks (CNN) to recognize real-time ISL motions. This strategy improves affordability and accessibility.

Our sign language recognition system uses a CNN to identify ISL motions in real time, boosting mobility by depending on just a camera-based setup that eliminates the need for extra electronics. This improves communication and social integration for ISL users.

II. LITERATURE SURVEY

There have been various methods used to recognize sign language, and a significant number of these methods used CNNs for recognition.

A proposed three-layer CNN achieves the highest recognition accuracy in [1] for numerals and alphabets, and among pre-trained models, ResNet152V2 performs the best. The fine-tuning of pre-trained models and the impact of hyperparameters like learning rate, batch size, and momentum on performance are studied. Human action recognition is explored in [2] using convolutional neural networks (CNNs) based on skeleton heatmaps extracted from a two-stage pose estimation model. This method combines an improved single shot detector (SSD) with convolutional pose machines (CPM) to accurately capture human skeleton data. It uses ResNet as the backbone of the SSD and incorporates multiscale transformations for better skeleton keypoint detection. A novel CNN model for Sign Language Recognition (SLR) is proposed in [3], which automatically extracts spatial-temporal features from raw video streams, utilizing multiple input channels including color, depth, and body joint positions from Microsoft Kinect. This approach eliminates the need for hand-crafted features. CNNs and Microsoft Kinect were used in [4], and their model automates feature extraction from video data and achieves high accuracy in recognizing 20 Italian gestures by using techniques like data augmentation and dropout.

A combination of CNN and Long Short-Term Memory (LSTM) networks is utilized in [5], and the system achieved high accuracies for recognizing nine common Egyptian Sign Language (ESL) words. Using the LSA16 and RWTH-PHOENIX-Weather datasets, the authors in [6] evaluated models including LeNet, VGG16, ResNet-34, All Convolutional, and Inception, along with transfer learning techniques. The results showed that VGG16 performed the best, closely followed by LeNet, and accuracy improved significantly when hands were pre-segmented from the background. A framework for Arabic Sign Language (ArSL) recognition was introduced in [7], employing transfer learning with various deep learning models and vision transformers. The methodology encompasses pretrained models like VGG, ResNet, MobileNet, Inception, DenseNet, Big Transfer, ViT, and Swin, alongside CNN architectures with data augmentation and batch normalization.


The limitations of traditional methods and the introduction of modern approaches like Convolutional Neural Networks (CNNs), YOLO, and Transfer Learning for more efficient recognition were discussed in [8]. Notably, CNNs and YOLO demonstrate impressive real-time recognition speed.

A system that detects hand poses and gestures in Indian Sign Language (ISL) in real time in [9] uses a Grid-based Feature Extraction approach to represent the hand's pose as a Feature Vector. Hand poses are then identified using the k-Nearest Neighbors method. For gesture classification, the motion and intermediate hand posture observation sequences are fed into Hidden Markov Model chains that correspond to the 12 pre-selected gestures in ISL. Google's MediaPipe tool was used in [10] to obtain hand landmarks, and a custom dataset of American Sign Language (ASL) was constructed for the experiment. An LSTM was used for hand gesture recognition. The neural network used in [11] is also a Convolutional Neural Network (CNN), which improves the predictability of the American Sign Language alphabet. Their interface system can decode sign language movements and hand positions into natural English. A recognition approach using Microsoft Kinect, CNNs, and GPU acceleration was presented in [12]. CNNs prioritise automated feature building over hand-crafted features. It is highly accurate in recognizing motions and sign language, and uses a three-module approach: Sign Language Detection, Text to Speech, and Text Scraping.

The proposed system in [13] makes use of smart gloves. The recommended solution uses a flex sensor and a microprocessor to transform sign language into text and audio. The system analyzes the sign input using a MATLAB image processing technique and classifies it for recognized identification. Various computer vision techniques are used in [14], such as grayscale conversion, dilation, and mask operation. They use a Convolutional Neural Network (CNN) to detect images taken from the user's webcam. Publications on decision support and intelligent systems for sign language recognition (SLR) were retrieved and analysed from the Scopus database in [15]. The retrieved articles are analysed using the bibliometric VOSviewer software to derive temporal and regional publication distributions, establish collaboration networks between affiliations and authors, and identify productive institutions in this context. Deep neural networks were used in [16] to provide a framework for recognizing sign language (SL) signals and converting them into English text and voice. CNN and RNN are used for spatial and temporal analysis, respectively.

III. PROPOSED METHODOLOGY

To create a real-time sign language recognition system, we wanted to work with Indian Sign Language. Hence, we have created our own image dataset and model for the same.

A. Dataset Collection
For our system, we decided to create and use our own Indian Sign Language (ISL) dataset for all 26 alphabets. All three members of our group have contributed to creating the dataset, performing the signs in different hand positions for added variety. We have tried to maintain a consistent background for our dataset, and avoided too many confounding parameters (detailed background, miscellaneous items in frame, etc.) so as not to confuse the CNN model. Our dataset consists of approximately 40,000 images across all alphabets, which is around 1,500 images per alphabet, and all images are collected as grayscale. According to the Indian Sign Language Research and Training Center (ISLRTC) website, the English alphabets H, I, J and Y in Indian Sign Language involve movement when signing. Hence, for our experiment we have chosen static poses for those four alphabets, using the manual alphabet poses provided on ISLRTC's website as reference. All images were taken using the standard digital camera provided by our computer.

B. Data Preprocessing
We collected our original images at 128x128 resolution. This is done to help us reduce storage and avoid higher resolution data. Initial preprocessing steps include taking the grayscale image and adding Gaussian blur, which helps us reduce unnecessary noise and high-frequency parts of the image. Our next step is thresholding, which is applied to our blurred grayscale images and converts them into binary inverted images. Here, we make use of Otsu's method, as it helps us divide our images into foreground and background and gives us the optimal threshold value we need. Another variation of our dataset was experimented with using other procedures, like contours for hand segmentation and preprocessing, which were discarded for the final preprocessed dataset as they didn't match our requirements. After thresholding, our dataset consists of polarized images, where the hand area of the image is white, thus creating a segmented hand for every image. The lighting had to be balanced the most out of all parameters, to help create the best hand segments for our dataset. (A minimal sketch of this pipeline is given at the end of this subsection.)

We applied various data augmentation techniques such as rotation, horizontal flipping, shearing, zoom, etc. to our dataset. We performed augmentation for a specific number of images in each class by using an interval between them, and hence created multiple variations of those images, which were then combined with our original dataset. The interval was added because augmenting every image would increase our dataset too much in size, which would not be manageable due to our limitations. We have made use of the ImageDataGenerator class provided by Keras in Python for all augmentation steps. We also experimented with other provided options like 'shift_range' and 'fill_mode' on another copy of our dataset, but they resulted in excessive variation within a class, which we wanted to avoid. We shuffled all images before training the model to create more variety within each training batch. Images were put into batches of size 128 before training, and our dataset was divided 80/20 into training and validation sets respectively. More augmentation was applied to classes that had fewer images, to balance them with respect to all other classes and to maintain an average of approximately 1,500 images per class.
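To make the pipeline concrete, the following is a minimal sketch of the preprocessing described above, assuming OpenCV is used for image handling; the 5x5 blur kernel is an illustrative choice, as the exact kernel size is not stated.

```python
import cv2

def preprocess(frame):
    """Convert a captured frame into the binary hand segment used for training.

    Follows the steps described above: grayscale -> Gaussian blur ->
    inverted binary threshold with Otsu's method. The 5x5 blur kernel
    is an assumption; the exact size is not specified above.
    """
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # grayscale image
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)      # suppress noise / high frequencies
    # Otsu's method picks the optimal threshold separating foreground
    # from background; THRESH_BINARY_INV makes the hand region white.
    _, binary = cv2.threshold(
        blurred, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU
    )
    return cv2.resize(binary, (128, 128))            # dataset resolution
```

Applying the same function to live frames at prediction time keeps the training and inference inputs consistent.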

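The augmentation stage can likewise be sketched with the Keras ImageDataGenerator named above. Note that we augmented a sampled subset offline and merged it back into the dataset, whereas this sketch shows the more common on-the-fly usage of the same class; the parameter ranges and directory name are illustrative assumptions.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Transformation types follow the description above; the ranges
# themselves are illustrative assumptions.
augmenter = ImageDataGenerator(
    rotation_range=15,       # small random rotations
    horizontal_flip=True,    # mirror the hand sign
    shear_range=0.1,
    zoom_range=0.1,
    rescale=1.0 / 255,       # scale pixel values to [0, 1]
    validation_split=0.2,    # 80/20 train/validation split
)

train_gen = augmenter.flow_from_directory(
    "isl_dataset/",          # hypothetical directory, one folder per alphabet
    target_size=(128, 128),
    color_mode="grayscale",
    batch_size=128,          # batch size used before training
    class_mode="categorical",
    subset="training",
    shuffle=True,            # shuffle for variety within each batch
)
```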

Fig 1: Screenshots of ISL Alphabet Prediction in Real Time

C. Model Building and Training
Convolutional Neural Networks are widely used for image classification, as they are good at discerning, extracting, and then classifying key and important features from images. We use the CNN model both for extracting features and for classification, as Convolutional Neural Networks can additionally perform good feature extraction, and we wanted to experiment with a CNN's working and results for the same. CNNs lean towards better image recognition when the neural network and feature extraction get deeper, i.e. gain more layers. Hence, we make use of multiple layers for our model. Our initial ISL dataset consisted of 13,000 images, and hence the model considered for it was a simple 2-layer CNN model, which yielded good results; but as we continued to increase our dataset, this model had to be increased in complexity to provide good results for a much larger dataset.

For the latest and final version of our dataset of 40,000 images, our CNN model initially consisted of 3 convolutional layers, which didn't yield the best results for our real-time predictions. Hence, we changed our batch size to 128 and, after analyzing different versions of the model, the final CNN model for our experiment consists of 4 convolutional layers and 3 dense layers. Our first layer is a convolutional layer with 32 filters and a kernel size of 3x3, followed by a max pooling layer and a dropout layer. Our next layer consists of another convolutional layer with 64 filters, followed by a max pooling and dropout layer. The last two layers follow this same pattern, with the third convolutional layer using 128 filters and the final layer using 256 filters. For all layers, a kernel size of 3x3 was maintained, and the activation function used was 'relu', as it provides the model efficiency and non-linearity. All max pooling layers use a pool size of 2x2, and the dropout value in all these layers is 0.2. After flattening these layers, we add 3 dense layers, with the first dense layer consisting of 256 neurons followed by a dropout layer of value 0.5. The second dense layer consists of 128 neurons and is also followed by another dropout layer of value 0.5. Both dense layers use the 'relu' activation function. Our final dense layer consists of 26 neurons, corresponding to our 26 classes, and uses the 'softmax' activation function. For compiling our model, we use the Adam optimizer, as it proves efficient for a large number of parameters.
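This architecture maps directly onto a Keras Sequential model. The following sketch uses the stated hyperparameters; the 128x128x1 grayscale input shape is inferred from our dataset, and the loss function shown is an assumption, as only the Adam optimizer is specified above.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv2D, MaxPooling2D, Dropout,
                                     Flatten, Dense)

model = Sequential([
    # Four convolutional blocks: 32 -> 64 -> 128 -> 256 filters,
    # each with a 3x3 kernel, 2x2 max pooling and 0.2 dropout.
    Conv2D(32, (3, 3), activation="relu", input_shape=(128, 128, 1)),
    MaxPooling2D((2, 2)),
    Dropout(0.2),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Dropout(0.2),
    Conv2D(128, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Dropout(0.2),
    Conv2D(256, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Dropout(0.2),
    Flatten(),
    # Three dense layers: 256 and 128 units with 0.5 dropout,
    # then a 26-way softmax, one output per ISL alphabet.
    Dense(256, activation="relu"),
    Dropout(0.5),
    Dense(128, activation="relu"),
    Dropout(0.5),
    Dense(26, activation="softmax"),
])
# Loss assumed to be categorical cross-entropy for the 26-class softmax.
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Training then proceeds with model.fit on the training and validation generators prepared in Section III-B.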


D. Real Time Prediction
After creating and training our model, we perform our final output and prediction in real time. The ISL signs are performed in front of the device camera, the live video frames are extracted from the camera feed, and each frame goes through blurring and thresholding as a processing step, similar to how it was performed on the training dataset. The CNN model then predicts the performed sign based on its training, and the predicted label is displayed alongside its accuracy percentage.

The final user interface allows the user to press a button to display the predicted alphabet, and to string the displayed alphabets together to form words. Users can also use our provided GUI to create their own dataset and train the model on custom sign language data. During the ongoing real-time prediction, the user simply has to adjust the background to be plain and make sure the lighting is good enough to create a hand segment; this is possible because we also display the thresholded view, so the user can manage the environment according to the binary feed. We make use of the tkinter package provided by Python to create our interface. We also provide the user a text-to-speech feature, where the predicted alphabet can be converted to speech and spelled out for the user, and they can also view the sign language images for all alphabets if they wish to see how to perform the signs. These features provide better communication with the user in our interface. (A condensed sketch of this prediction loop is given at the end of Section IV.)

Fig 2: Loss Graph of our CNN Model

Fig 3: Accuracy Graph of our CNN Model

IV. RESULT ANALYSIS

Screenshot examples of our final real-time prediction output in a suitable environment are displayed in Fig. 1. Our proposed CNN model best predicts in real time alphabets like A, L, P, T, U, V etc., due to their distinct positioning. Our model struggles with the alphabets M, N and R because the hand positions for these signs are similar, as seen in Fig. 1, and hence the user must adjust hand positions accurately for these signs. We evaluated our model and output using evaluation metrics such as accuracy, precision, etc.

A. Evaluation Metrics Used
Metrics like accuracy and precision are commonly used to evaluate the performance of machine learning and deep learning models, particularly in classification. The terms included are True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN).

 Accuracy: Accuracy measures the overall correctness of the model's predictions across all classes. It gives us the proportion of correctly classified instances among all instances in the dataset. Higher accuracy means a model is giving good predictions, but it alone is not enough to assess the performance of a model.

Accuracy = (TP + TN) / (TP + TN + FP + FN) (1)

 Precision: Precision measures the ability of the model to avoid false positives. It provides the proportion of true positive predictions among all instances predicted as positive by the model. Precision indicates how many of the instances predicted as positive are actually positive. A high precision means that the model has a low false positive rate, which is desirable in applications where false positives are costly.

Precision = TP / (TP + FP) (2)

 Recall: Recall measures the ability of the model to identify all relevant instances. It provides us with the proportion of true positive predictions among all actual positive instances in the dataset. Recall indicates how many of the actual positive instances the model can successfully identify. A high recall means that the model can capture most of the positive instances, which is crucial in applications where missing positive instances is undesirable.

Recall = TP / (TP + FN) (3)


 F1 Score: The F1 score is the harmonic mean of precision and recall. It provides a single score that balances both precision and recall. The F1 score is especially useful if one has imbalanced classes. As it combines precision and recall into a single metric, it gives equal importance to both, providing a balanced measure of a model's performance, particularly in situations where precision and recall are equally important.

F1 Score = 2 x (Precision x Recall) / (Precision + Recall) (4)

We also observed during experimentation that the less data we had within our dataset, the simpler the model needed to be to get good accuracy. A complex model for a smaller number of images in a dataset led to training errors like overfitting. Hence, balancing the model complexity against the size of our dataset was an important part of our process. A 4-layer deep CNN model gave us the best accuracy out of all the versions of the CNN experimented with. The training and validation loss and accuracy curves for the model are displayed in Fig. 2 and Fig. 3 respectively.

Our CNN model gives us an overall accuracy of above 99%. The model's accuracy on the training dataset is above 97%, with a validation accuracy of above 99%. The precision, recall, F1 score and accuracy of our CNN model can be seen in Table 1.

Table 1: Metrics Table


Model Accuracy Precision Recall F1 Score
4-layer CNN 99.4% 99.5% 99% 99.2%
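For reference, metrics like those in Table 1 can be computed from the validation predictions with scikit-learn; the arrays below are dummy stand-ins for the true labels and the argmax of the model's softmax outputs, and macro averaging across the 26 classes is an assumption.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Hypothetical stand-ins: true class indices and predicted class indices.
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]

print(accuracy_score(y_true, y_pred))                    # Eq. (1)
print(precision_score(y_true, y_pred, average="macro"))  # Eq. (2), averaged over classes
print(recall_score(y_true, y_pred, average="macro"))     # Eq. (3)
print(f1_score(y_true, y_pred, average="macro"))         # Eq. (4)
```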

For further evaluation, we considered two different custom test datasets. Dataset 1 led to an accuracy of 99.96% and Dataset 2 led to an accuracy of 98.75%, as seen in Fig. 4 and Fig. 5.

Our CNN model is trained on both single-handed and double-handed signs and is able to provide good accuracy for both types of hand signs.

Fig 4: Test Accuracy of Model on Dataset 1

Fig 5: Test Accuracy of Model on Dataset 2
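Returning to the real-time pipeline of Section III-D, the capture-predict-display loop can be condensed into the following sketch. It assumes OpenCV for camera access and reuses the preprocess helper sketched in Section III-B; the model path and overlay details are illustrative.

```python
import cv2
import numpy as np
from tensorflow.keras.models import load_model

LABELS = [chr(c) for c in range(ord("A"), ord("Z") + 1)]  # 26 ISL alphabet classes
model = load_model("isl_cnn.h5")    # hypothetical path to the trained model

cap = cv2.VideoCapture(0)           # default device camera
while True:
    ok, frame = cap.read()
    if not ok:
        break
    binary = preprocess(frame)      # blur + Otsu inverse threshold, as in training
    x = binary.reshape(1, 128, 128, 1).astype("float32") / 255.0
    probs = model.predict(x, verbose=0)[0]
    label, conf = LABELS[int(np.argmax(probs))], float(np.max(probs))
    # Overlay the predicted letter and its confidence on the live feed,
    # and show the thresholded view so the user can adjust the scene.
    cv2.putText(frame, f"{label} ({conf:.0%})", (10, 40),
                cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 2)
    cv2.imshow("ISL Recognition", frame)
    cv2.imshow("Thresholded view", binary)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```

Our tkinter GUI, word-building buttons and text-to-speech features are layered on top of this basic loop.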

V. LIMITATIONS AND FUTURE SCOPE

Our current system can recognise 26 alphabets with a database of 40,000 images, but due to device limitations, training the model was drawn out and time consuming. Hence, further upgrades such as a GPU can help make the process smoother and provide faster training, which can lead to better testing and more time for experimenting. Further, we faced certain environment limitations in our real-time experiments, such as background and lighting, and applying methods that can deal with those issues can create better sign language recognition systems. Sign language recognition can also be implemented into video calls to provide further accessibility features to deaf and mute people. Our model works with a database of images of static poses, which can be expanded into working with a video database of hand poses that have movement.


Our model can also be used to recognise alphabets in other sign languages, such as American Sign Language (ASL), British Sign Language (BSL), etc., if trained with respective datasets.

VI. CONCLUSION

In our paper, we present our experimentation and approach to recognise Indian Sign Language input in real time using Convolutional Neural Networks, with an overall model accuracy above 99%. Our model provides quality and favourable recognition for both one-handed and two-handed signs performed by the user with limited real-time delay, and can be expanded to work with larger datasets. It can be determined that multiple layers of Convolutional Neural Networks provide better accuracy for larger datasets, as they did for our custom created dataset. CNN proves to be very competent and useful for working with image datasets, extracting relevant features well enough to predict proficiently in real time. Thus, through our research we can establish that using neural network technologies like CNN can help create systems for deaf, mute and disabled communities which make communication easier for them through such technological advancements.

REFERENCES

[1]. Prachi Sharma, Radhey Shyam Anand. "A comprehensive evaluation of deep models and optimizers for Indian sign language recognition". Graphics and Visual Computing 5 (2021) 200032. https://doi.org/10.1016/j.gvc.2021.200032
[2]. Ruiqi Sun, Qin Zhang, Chuang Luo, Jiamin Guo, Hui Chai. "Human action recognition using a convolutional neural network based on skeleton heatmaps from two-stage pose estimation". Biomimetic Intelligence and Robotics 2 (2022) 100062. https://doi.org/10.1016/j.birob.2022.100062
[3]. Prabhakara R. Uyyala. "Sign Language Recognition Using Convolutional Neural Networks". Journal of Interdisciplinary Cycle Research, Volume XIV, Issue I, January 2022. ISSN: 0022-1945.
[4]. Lionel Pigou, Sander Dieleman, Pieter-Jan Kindermans, Benjamin Schrauwen. "Sign Language Recognition Using Convolutional Neural Networks". ELIS, Ghent University, Ghent, Belgium (2015).
[5]. Ahmed Adel Gomaa Elhagry, Rawan Gla Elrayes. "Egyptian Sign Language Recognition Using CNN and LSTM". Computer Vision and Pattern Recognition (2021).
[6]. Quiroga, Facundo; Antonio, Ramiro; Ronchetti, Franco; Lanzarini, Laura Cristina; Rosete, Alejandro. "A Study of Convolutional Architectures for Handshape Recognition applied to Sign Language". CACIC 2017.
[7]. Nojood M. Alharthi, Salha M. Alzahrani. "Vision Transformers and Transfer Learning Approaches for Arabic Sign Language Recognition". Applied Sciences (2023).
[8]. Xianwei Jiang, Yanqiong Zhang, Juan Lei, Yudong Zhang. "A Survey on Chinese Sign Language Recognition: From Traditional Methods to Artificial Intelligence". Computer Modeling in Engineering & Sciences, Tech Science Press (2024).
[9]. Kartik Shenoy, Tejas Dastane, Varun Rao, Devendra Vyavaharkar. "Real-time Indian Sign Language (ISL) Recognition". 9th ICCCNT 2018, IISc, Bengaluru.
[10]. Sundar B., Bagyammal T. "American Sign Language Recognition for Alphabets Using MediaPipe and LSTM". 4th International Conference on Innovative Data Communication Technology and Application. https://doi.org/10.1016/j.procs.2022.12.066
[11]. Ahmed Kasapbasi, Ahmed Eltayeb Ahmed Elbushra, Omar Al-Hardanee, Arif Yilmaz. "DeepASLR: A CNN based human computer interface for American Sign Language recognition for hearing-impaired individuals". Computer Methods and Programs in Biomedicine Update 2 (2022) 100048. https://doi.org/10.1016/j.cmpbup.2021.100048
[12]. Piyush Kapoor, Hema N. "Sign Language and Common Gesture Using CNN". International Journal of Advanced Trends in Computer Science and Engineering. ISSN 2278-3091.
[13]. Karan Bhavsar, Raj Ghatiya, Aarti Gohil, Devanshi Thakkar, Bhumi Shah. "Sign Language Recognition". International Journal of Research Publication and Reviews, Vol. 2, Issue 9 (2021), pp. 771-777.
[14]. Rachana Patil, Vivek Patil, Abhishek Bahuguna, Gaurav Datkhile. "Indian Sign Language Recognition using Convolutional Neural Network". ITM Web of Conferences 40, 03004 (2021), ICACC-2021.
[15]. I.A. Adeyanju, O.O. Bello, M.A. Adegboyega. "Machine learning methods for sign language recognition: A critical review and analysis". Intelligent Systems with Applications 12 (2021) 200056.
[16]. Sai Bharath Padigala, Gogineni Hrushikesh Madhav, Saranu Kishore Kumar, Narayanamoorthy M. "Video Based Sign Language Recognition Using CNN-LSTM". International Research Journal of Engineering and Technology (IRJET), e-ISSN: 2395-0056.
