Hand Gesture Based Sign Language Recognition Using Deep Learning
Abstract— Sign Language Recognition (SLR) plays a very important role in creating a bridge between the deaf/mute community and conventional society. Sign language communicates linguistic information through facial, head, arm, and hand motions. Since the hands are the body parts most commonly used for communication, this paper presents a hand gesture based sign language recognition system. The proposed method performs very well not only in detecting hand landmarks but also in continuously tracking hand gestures using Google's MediaPipe. An AlexNet classifier is used to classify different gestures of Indian Sign Language (ISL), and the proposed method is compared with several existing SLR models. The dataset is a self-generated dataset containing 15 word classes, captured with a stable webcam. The proposed system achieves a recognition accuracy of 98.9%, and its effectiveness is increased by translating the recognized hand gestures into readable text.

Keywords— Sign Language Recognition, Indian Sign Language, Convolutional Neural Network, MediaPipe.

I. INTRODUCTION

The World Health Organization has predicted that by 2050 around 2.5 billion people will suffer from hearing loss to some extent, with at least 700 million requiring hearing rehabilitation [1]. This is a significant public health issue that needs to be addressed. Hearing loss can greatly affect one's quality of life, resulting in social isolation, depression, and cognitive decline. It can also lead to economic losses due to reduced productivity and increased healthcare costs.

Therefore, it is essential that the processing of images and video to identify correct and discriminating features is done effectively. Although SLR research has increased significantly in recent years [4], [5], [6], creating an automated SLR system is still an open research challenge.

Both static and dynamic movements are used in sign language. A static gesture is primarily defined as the fingerspelling of a certain alphabetic or numerical symbol using a fixed finger configuration and hand position [5]. In contrast, dynamic gestures can be single words or sentences [7]. Sign language recognition systems use either contact-based or contactless systems. In contact-based systems, the locations of the skeleton joints in a series of postures or bodily motions serve as the signal representing the gesture data, which is then recorded [3], [8]. Contactless approaches, in contrast, capture gestures through a still or video camera. Flexibility, a user-friendly approach, and a more natural way of communicating are advantages of contactless approaches over contact-based systems, which can be uncomfortable and involve the risk of physical contact with sensors.

Deep learning techniques are currently advancing, and this has led to promising results in a variety of vision-based disciplines, including object detection, image categorization, and action recognition [9], [10].
[Fig. 1: Block diagram of the proposed methodology: input frame → hand detection → hand landmark detection → dataset generation → classification. Sample word classes shown: Alright, Call Me, Fist, Good, Hello, Live Long, No, Okay, Peace, Rock, Smile, Stop.]

Fig. 2: The process flow for the classification stage, including deep feature extraction using the AlexNet classifier (11×11, 5×5, and 3×3 convolutions with 3×3 max pooling, yielding 13×13×384, 13×13×256, and 6×6×256 feature maps, followed by fully connected layers of 4096 and 4096 units and 15 output classes).
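The paper provides no implementation, but the AlexNet configuration of Fig. 2 (convolutional feature extraction followed by two 4096-unit fully connected layers and 15 outputs) can be sketched with torchvision. This is a minimal sketch, assuming a PyTorch setup and ImageNet-pretrained weights, neither of which is stated in the paper:

```python
import torch.nn as nn
import torchvision

# Standard AlexNet; the pretrained weights are an assumption, not a paper detail.
model = torchvision.models.alexnet(
    weights=torchvision.models.AlexNet_Weights.DEFAULT)

# Replace the final 1000-way ImageNet layer with the paper's 15 word classes;
# the 4096-dimensional penultimate activations act as the deep features of Fig. 2.
model.classifier[6] = nn.Linear(4096, 15)
```

Only the last layer needs changing, because the rest of the network already matches the 11×11/5×5/3×3 convolution stack and 4096-4096 fully connected layers shown in Fig. 2.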
Google's MediaPipe provides pre-trained models for detecting and tracking various parts of the human body. These models have been trained on large and diverse datasets from Google and can detect and track key points on different parts of the body as nodes and edges, with the nodes representing normalized three-dimensional coordinate points. Fig. 1 illustrates the fundamental block diagram of the proposed methodology, which consists of data acquisition; pre-processing, comprising hand detection and hand landmark detection; dataset generation; and an AlexNet classifier [11] for classification at the output stage.

The paper is organized as follows: related studies are briefly examined in Section II, the proposed work is discussed in Section III, dataset generation is described in Section IV, results and discussion are presented in Section V, and Section VI concludes the proposed work with future scope.
II. RELATED WORKS

Understanding sign language is a challenging task because it varies from region to region. Research has been carried out for various languages, but ISL recognition still needs improvement. This section briefly reviews some of the latest research on gesture-based sign language recognition systems.

In [12], the authors recognized continuous Chinese sign language hand movements using a Video Transformer Network and Long Short-Term Memory (LSTM). Recent research on sentence recognition [13] used armbands with inertial sensors to collect electromyogram signals; the main limitations of such contact-based approaches are that they are hazardous to health and uncomfortable. In [14], human–machine interaction for a surgical robot was implemented using LSTM and the 3D positions of the user's fingers, recognizing ten Chinese sign language gestures. The limitation of this research is that it only detects the user's fingers, whereas it should combine multiple parameters such as hand position and the signer's body. In [15], the proposed technique generates the predicted labels as both text and speech outputs after recognizing ISL alphabets and numbers in real time. The main limitation of that study is its use of single-handed gestures, whereas in practice gestures can be made with both hands. In [16], Bengali sign language (BSL) alphabet recognition was introduced using three single-handed gesture datasets, achieving accuracies of 94%, 99.06%, and 99.06%, respectively.

From the literature, it is evident that there is a significant research gap in recognizing Indian Sign Language (ISL) words using hand gestures; this work therefore aims to address that gap and improve the recognition accuracy for ISL words.

III. PROPOSED METHODOLOGY

The first step is to generate a dataset, as no dataset of ISL words is publicly available. Signs were captured from a webcam and the dataset was generated. The commonly used words in which one hand represents a particular word are 'okay', 'yes', 'peace', 'thumbs up', 'call me', 'stop', 'live long', 'fist', 'smile', 'thumbs down', and 'rock'; the words that use both hands are 'alright', 'hello', 'good', and 'no'. After 2536 images were captured with a stable camera, they were converted into frames. The dataset images are divided into 75% for training and 25% for testing, as sketched in the first code example below.

The second step is to pass the video frames to the MediaPipe framework. Google's MediaPipe Hands [11] is a solution for accurate hand and finger tracking. It uses machine learning (ML) to infer 21 3D hand landmarks from a single frame, as shown in Fig. 3. Various existing state-of-the-art approaches [17], [18] rely on desktop environments for inference, whereas the proposed approach achieves real-time performance even on a mobile phone and scales to multiple hands. MediaPipe Hands consists of two models for extracting the hand and its landmarks. The first is a palm detection model, which locates the hand region in the video frame. After that, the hand landmark model localizes the 21 key points within the detected hand region; the second code example below sketches this step.
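As a minimal sketch of the 75/25 split described above (the use of scikit-learn and a stratified split are assumptions, not details from the paper):

```python
from sklearn.model_selection import train_test_split

def split_dataset(image_paths, labels):
    """Split captured frames into 75% training / 25% testing sets."""
    # Stratifying keeps each of the 15 word classes in the same
    # proportion in both partitions (an assumed, common choice).
    return train_test_split(image_paths, labels,
                            test_size=0.25, stratify=labels, random_state=0)
```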
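The landmark-extraction step can be sketched with MediaPipe's Python solutions API; the camera index, confidence thresholds, and two-hand limit below are illustrative defaults, not settings reported by the authors:

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

cap = cv2.VideoCapture(0)  # stable webcam, as in the dataset capture setup
with mp_hands.Hands(static_image_mode=False,
                    max_num_hands=2,
                    min_detection_confidence=0.5,
                    min_tracking_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV captures frames in BGR.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            for hand in results.multi_hand_landmarks:
                # 21 landmarks per hand, each with x/y normalized to the
                # frame and z as relative depth from the wrist.
                coords = [(lm.x, lm.y, lm.z) for lm in hand.landmark]
cap.release()
```

Each detected hand yields the 21 normalized 3D points described above; these landmark-annotated frames form the inputs to the dataset-generation stage.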
Also, a comparison of the proposed model with other state-of-the-art methods is shown in Table I, and the evaluation parameters of the proposed method are shown in Table II. It is evident from Table I that, compared with [4], [5], [19], the proposed method attains 98.9% accuracy using the MediaPipe technique and the AlexNet classifier. The confusion matrix is shown in Fig. 5, and the corresponding recognition accuracy and loss curves are shown in Fig. 6 and Fig. 7, respectively.
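Table II's metrics can be reproduced from a set of test-set predictions as in the brief sketch below; macro averaging across the 15 classes is an assumption, since the paper does not state how the per-class precision, recall, and F1 values are aggregated:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Compute the Table II metrics for the 15 word classes."""
    acc = accuracy_score(y_true, y_pred)
    # average="macro" weights every class equally (assumed aggregation).
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro")
    return acc, prec, rec, f1
```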
Fig. 7. Recognition loss curve for the proposed method (Val_loss: validation loss; Test_loss: test loss).

TABLE I. COMPARISON WITH SOME EXISTING METHODS.

Ref        Dataset                                  Technique & Classifier                          Accuracy
[4]        American Sign Language (ASL) dataset     Microsoft Kinect sensor                         91%
[5]        Self-generated dataset of 5 words (EMG)  EMG and SVM                                     90%
[16]       BdSL Alphabet                            Concatenated segmentation and BenSignNet CNN    94.0%
[13]       Chinese Sign Language (CSL) dataset      sEMG and CNN                                    94.2%
Proposed   Self-generated dataset of ISL words      MediaPipe and AlexNet classifier                98.9%
Model

TABLE II. EVALUATION PARAMETERS FOR THE PROPOSED MODEL.

Ref              Accuracy   Precision   Recall    F1-score
Proposed Model   98.9%      99.33%      99.07%    99.16%

VI. CONCLUSION

The proposed method recognizes 15 gestures, which are then converted into readable text. These gestures represent ISL words that are generally used in daily life. For detecting hand landmarks, Google's MediaPipe is used, which is very effective under low-illumination and complex-background conditions. In combination with the AlexNet classifier used for recognition, it yields an accuracy of 98.9%. As no standard dataset of words is publicly available, the dataset used in this method is a self-generated dataset of 15 words. The limitation of this method is that it covers only 15 word classes, leaving scope for future research with a greater number of classes. The dataset can also be extended with universally accepted words in ISL.

REFERENCES

[1] "Deafness and hearing loss." https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss (accessed Feb. 21, 2023).
[2] D. M. Eberhard, G. F. Simons, and C. D. Fennig, Eds., Ethnologue: Languages of the World, 23rd ed. SIL International, 2020. https://scirp.org/reference/referencespapers.aspx?referenceid=3057024 (accessed Feb. 27, 2023).
[3] U. Côté-Allard et al., "Deep Learning for Electromyographic Hand Gesture Signal Classification Using Transfer Learning," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 27, no. 4, pp. 760–771, Jan. 2019, doi: 10.1109/TNSRE.2019.2896269.
[4] H. V. Verma, E. Aggarwal, and S. Chandra, "Gesture recognition using Kinect for sign language translation," in 2013 IEEE Second International Conference on Image Information Processing (ICIIP-2013), 2013, pp. 96–100, doi: 10.1109/ICIIP.2013.6707563.
[5] B. Divya, J. Delpha, and S. Badrinath, "Public speaking words (Indian sign language) recognition using EMG," in 2017 International Conference on Smart Technologies for Smart Nation (SmartTechCon), 2017, pp. 798–800, doi: 10.1109/SMARTTECHCON.2017.8358482.
[6] M. Al-Hammadi, G. Muhammad, W. Abdul, M. Alsulaiman, M. A. Bencherif, and M. A. Mekhtiche, "Hand Gesture Recognition for Sign Language Using 3DCNN," IEEE Access, vol. 8, pp. 79491–79509, 2020, doi: 10.1109/ACCESS.2020.2990434.
[7] X. Zhang, X. Chen, Y. Li, V. Lantz, K. Wang, and J. Yang, "A framework for hand gesture recognition based on accelerometer and EMG sensors," IEEE Trans. Syst., Man, Cybern. A, Syst. Humans, vol. 41, no. 6, pp. 1064–1076, Nov. 2011, doi: 10.1109/TSMCA.2011.2116004.
[8] N. Sarhan and S. Frintrop, "Transfer Learning for Videos: From Action Recognition to Sign Language Recognition," in Proc. IEEE International Conference on Image Processing (ICIP), Oct. 2020, pp. 1811–1815, doi: 10.1109/ICIP40778.2020.9191289.
[9] M. Shamim Hossain, M. Al-Hammadi, and G. Muhammad, "Automatic Fruit Classification Using Deep Learning for Industrial Applications," IEEE Trans. Ind. Informat., vol. 15, no. 2, pp. 1027–1034, Feb. 2019, doi: 10.1109/TII.2018.2875149.
[10] H. Altaheri et al., "Deep learning techniques for classification of electroencephalogram (EEG) motor imagery (MI) signals: a review," Neural Comput. Appl., 2021, doi: 10.1007/S00521-021-06352-5.
[11] "MediaPipe Hands." https://google.github.io/mediapipe/solutions/hands
[12] W. Qin, X. Mei, Y. Chen, Q. Zhang, Y. Yao, and S. Hu, "Sign Language Recognition and Translation Method based on VTN," in 2021 International Conference on Digital Society and Intelligent Systems (DSInS), 2021, pp. 111–115, doi: 10.1109/DSINS54396.2021.9670588.
[13] Z. Wang et al., "Hear Sign Language: A Real-Time End-to-End Sign Language Recognition System," IEEE Trans. Mobile Comput., vol. 21, no. 7, pp. 2398–2410, Jul. 2022, doi: 10.1109/TMC.2020.3038303.
[14] W. Qi, S. E. Ovur, Z. Li, A. Marzullo, and R. Song, "Multi-Sensor Guided Hand Gesture Recognition for a Teleoperated Robot Using a Recurrent Neural Network," IEEE Robot. Autom. Lett., vol. 6, no. 3, pp. 6039–6045, Jul. 2021, doi: 10.1109/LRA.2021.3089999.
[15] S. Katoch, V. Singh, and U. S. Tiwary, "Indian Sign Language recognition system using SURF with SVM and CNN," Array, vol. 14, Jul. 2022, doi: 10.1016/J.ARRAY.2022.100141.
[16] A. S. M. Miah, J. Shin, M. A. M. Hasan, and M. A. Rahim, "BenSignNet: Bengali Sign Language Alphabet Recognition Using Concatenated Segmentation and Convolutional Neural Network."