Sign Language Recognition System
Registration: H190196E
Abstract
The deaf community uses sign language to communicate with non-deaf people, but it can be challenging for the general public to understand the gestures used in sign language. Sign language can, however, be translated into a form that is easily understood by the general public. This research is based on image and video capture, preprocessing, hand gesture classification, landmark extraction, and classification techniques. In order to identify the most promising methods for future study, this paper reviews and analyzes the techniques used to create datasets, sign language recognition systems, and classification algorithms. Much of the currently available work contributes classification approaches based on deep learning, owing to the growth in classification methods. This research is focused on these methods and techniques.
In this paper, I present a sign language translation system that translates sign language gestures into written words. The system uses a webcam to capture video frames, OpenCV for frame capturing, MediaPipe Holistic for feature extraction, and an LSTM network for gesture recognition and translation. The system aims to help bridge the communication gap between deaf and non-deaf individuals, allowing for more effective communication.
Related Work
British Sign Language
BOBSL is a substantial British Sign Language (BSL) dataset that includes approximately 1,400 hours of BSL-interpreted and English-subtitled BBC programming. It covers a wide variety of programming, including shows on food, business, travel, history, science, nature, and medicine. The dataset also includes a test partition for the ECCV SLRTP 2022 workshop challenge with 272 episodes. For signer-independent evaluation, the dataset contains 37 signers, and distinct signers appear in the training, validation, test, and challenge sets.
Sign language grammar is not standardized or uniform and varies from country to country. For instance, hand gestures, body gestures, and facial expressions are all used in Portuguese Sign Language (PSL). SLR systems offer an accurate and effective method for translating sign language into text or voice for a variety of purposes, including enabling young children to use computers through sign language. However, gesture models that capture both spatial and temporal information must be developed.
Skin color segmentation is handled using the RGB, HSV, HSI, and YCbCr color models, although color segmentation is made more challenging by sensitivity to lighting, cameras, and skin tone. The color of the hand makes it simple to distinguish the arm from the palm, which is why the HSV color model is so popular. One study used the HSV and YCbCr color models to segment individuals' palms and faces, while the RGB color model was used to segment the skin color of the hands and to match and compare the skin tone in a given video or image.
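As a rough illustration of this kind of color-based skin segmentation (not code from the surveyed works), the sketch below thresholds a frame in the YCrCb space; the threshold ranges are commonly quoted approximate values and would need tuning for real lighting conditions.

import cv2
import numpy as np

def segment_skin(frame_bgr):
    # Convert to YCrCb and keep pixels inside an approximate skin-tone range.
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    lower = np.array([0, 135, 85], dtype=np.uint8)
    upper = np.array([255, 180, 135], dtype=np.uint8)
    mask = cv2.inRange(ycrcb, lower, upper)
    # Clean up the mask with a small morphological opening.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    return cv2.bitwise_and(frame_bgr, frame_bgr, mask=mask)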
Segmentation techniques can also be grouped according to whether or not contextual segmentation is used. Contextual segmentation, as used by edge detection algorithms, considers the spatial relationship between features, whereas non-contextual segmentation groups pixels based on global characteristics rather than spatial relationships. Combining skin detection with hand movement tracking produces more focused results. Using colored gloves to give the hands distinctive characteristics, similar to skin detection, makes hand segmentation easier.
Harris corner detection, using motion data matrices to track the motion and articulation points of the hands, is a novel method for segmenting hands. 2D hand signature analysis can also be used for hand segmentation. Morphological processing extracts the parts of an image that help to represent and describe the shape of a region; skeletons, borders, and convex hulls are examples of these parts.
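To make the shape-descriptor idea concrete, here is a minimal OpenCV sketch (an illustration, not code from the cited work) that extracts the hand contour and its convex hull from a binary mask such as the one produced by the skin segmentation step above.

import cv2

def hand_shape_descriptors(mask):
    # Find external contours in the binary mask and keep the largest blob,
    # assuming it corresponds to the hand.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None, None
    hand = max(contours, key=cv2.contourArea)
    hull = cv2.convexHull(hand)
    return hand, hull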
Classification
The final step, classification, is largely responsible for the effectiveness of gesture recognition. In sign language, a series of gestures are combined to form words and sentences. This strategy, however, may cause problems if devices are placed in high-risk or uncontrolled environments. The class that best matches the gesture sequence can be chosen from a list of candidates to determine which gesture was performed. In gesture recognition research, analysts use two approaches: some use machine learning classifiers based on Hidden Markov Models, while others use feature extraction methods such as template matching. Because it was more likely to produce the desired set of signs, the latter was used in this study. Signs are characterized and distinguished by phonetic attributes, and support vector machines are used as a multiclass classifier to classify them. SVM classifiers choose the best hyperplane for a decision function after being trained on images of specific gestures, which gives them the ability to choose the right sign. Classification and regression problems can also be addressed with Random Forest-type machine learning techniques; when a new sample is classified, it is evaluated against the parameters and attributes learned during training.
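As a rough sketch of the multiclass SVM classification described above (not the actual code of any cited system), the example below assumes each gesture sample is already a fixed-length vector of landmark features; the data here is random placeholder data.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical data: each row is a flattened landmark feature vector,
# each label is the index of a sign class.
X = np.random.rand(300, 126)           # e.g. 2 hands x 21 landmarks x 3 coords
y = np.random.randint(0, 5, size=300)  # 5 example sign classes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# A multiclass SVM with an RBF kernel; scikit-learn handles the
# one-vs-one decomposition internally.
clf = SVC(kernel="rbf", C=10.0, gamma="scale")
clf.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))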
Methodology
The sign language translation system I developed performs the following steps to translate sign
language gestures into written words:
Pre-processing
The system first captures video frames from the webcam using OpenCV. The frames are pre-
processed by resizing them to a standard size and normalizing pixel values to improve the accuracy
of the feature extraction.
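A minimal sketch of this capture and pre-processing step is shown below; the 640x480 target size is an assumption, as the actual resolution used is not specified here.

import cv2

cap = cv2.VideoCapture(0)  # default webcam

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Resize to a standard size and normalize pixel values to [0, 1];
    # the normalized frame would be passed on to feature extraction.
    frame = cv2.resize(frame, (640, 480))
    normalized = frame.astype("float32") / 255.0

    cv2.imshow("frame", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()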
Feature Extraction
The system uses MediaPipe Holistic to extract features from the pre-processed video frames.
MediaPipe Holistic provides pre-trained neural networks for detecting and tracking human body
landmarks, including facial landmarks, hand landmarks, and body pose estimation. The system
uses the hand landmarks to detect and track hand gestures.
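The sketch below illustrates this landmark-extraction step with MediaPipe Holistic; the choice to keep only the two hands' (x, y, z) coordinates, giving 126 values per frame, is an assumption for illustration rather than the system's exact feature layout.

import cv2
import mediapipe as mp

mp_holistic = mp.solutions.holistic

def landmarks_to_vector(hand_landmarks, n_points=21):
    # Flatten (x, y, z) for each landmark; zeros if the hand is not detected.
    if hand_landmarks is None:
        return [0.0] * (n_points * 3)
    return [v for lm in hand_landmarks.landmark for v in (lm.x, lm.y, lm.z)]

def extract_hand_features(frame_bgr, holistic):
    # MediaPipe expects RGB input, while OpenCV supplies BGR frames.
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    results = holistic.process(rgb)
    return landmarks_to_vector(results.left_hand_landmarks) + \
           landmarks_to_vector(results.right_hand_landmarks)

# Usage: create one Holistic instance and reuse it for every frame.
# with mp_holistic.Holistic(min_detection_confidence=0.5,
#                           min_tracking_confidence=0.5) as holistic:
#     features = extract_hand_features(frame, holistic)  # 126 values per frame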
LSTM Training
The system uses TensorFlow/Keras to implement and train an LSTM neural network for gesture
recognition and translation. The LSTM network takes the extracted features as input and maps
them to corresponding text labels for each sign language gesture. I trained the network on my own
dataset.
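An illustrative Keras model in the spirit of this description is sketched below; the sequence length (30 frames), feature size (126), number of classes (5), and layer sizes are placeholders, not the values actually used in the system.

import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, N_FEATURES, N_CLASSES = 30, 126, 5  # assumed placeholder values

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN, N_FEATURES)),
    layers.LSTM(64, return_sequences=True),   # keep the sequence for stacking
    layers.LSTM(128, return_sequences=False), # summarize the whole gesture
    layers.Dense(64, activation="relu"),
    layers.Dense(N_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

# Training would then look like:
# model.fit(X_train, y_train, epochs=100, validation_data=(X_val, y_val))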
Output Generation
The system generates the output in the form of translated text. The translated text is displayed on
the screen in real-time.
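One way to render this output, assuming the model, label list, and feature sequence from the previous steps (a sketch, not the system's actual display code), is to overlay the predicted label on the live frame with OpenCV:

import cv2
import numpy as np

def render_prediction(frame, model, sequence, labels):
    # Predict the class of the current feature sequence and draw the label.
    probs = model.predict(np.expand_dims(sequence, axis=0), verbose=0)[0]
    text = labels[int(np.argmax(probs))]
    cv2.putText(frame, text, (10, 40),
                cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 2)
    return frame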
The system's functionality enables accurate translation of sign language gestures into written words. However, during the development of the system, I faced several technical challenges that affected its accuracy.
Results
During the development of the sign language translation system, I produced some results but also faced some technical challenges, which I outline below along with the solutions I employed to overcome them.
Dataset Creation
As a result of dataset creation, I was able to successfully train an LSTM network to recognize a
set of specific sign language gestures. Despite the challenge of creating a suitable dataset, the
process of manually capturing and labeling video clips allowed me to ensure that the data was
representative and included all the signs required for my application.
By using the resulting dataset to train the LSTM network, I was able to achieve a fair degree of
accuracy in recognizing the targeted sign language gestures. This has significant implications for
improving accessibility and communication for individuals who are deaf or hard of hearing, as
well as opening up new possibilities for sign language recognition in areas such as education,
healthcare, and entertainment.
LSTM Training
As a result of the training process for the LSTM network, I was able to overcome the challenge
of achieving consistent accuracy in recognizing sign language gestures. While the quality and
quantity of training data proved to be a key factor in the accuracy of the network, I found that
adjusting certain hyperparameters, such as the number of layers, hidden units, and learning rate,
could also have a significant impact.
By adjusting these factors and techniques, I was able to achieve a fair level of accuracy in
recognizing sign language gestures with the LSTM network. This has important implications for
improving communication accessibility for individuals who are deaf or hard of hearing, and
demonstrates the importance of careful parameter tuning and regularization in the training of
machine learning models.
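The hyperparameters mentioned above (number of layers, hidden units, learning rate) can be explored with a simple grid search; the sketch below uses placeholder values and a simplified single-layer model rather than the settings actually tried during training.

import itertools
from tensorflow.keras import layers, models, optimizers

# Hypothetical search grid; the actual values explored are not reported here.
units_options = [32, 64, 128]
lr_options = [1e-2, 1e-3, 1e-4]

def build_model(units, lr, seq_len=30, n_features=126, n_classes=5):
    model = models.Sequential([
        layers.Input(shape=(seq_len, n_features)),
        layers.LSTM(units),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=lr),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

for units, lr in itertools.product(units_options, lr_options):
    model = build_model(units, lr)
    # history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
    #                     epochs=50, verbose=0)
    # Track validation accuracy for each (units, lr) combination.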
Real-time Performance
As a result of addressing the challenge of real-time performance, I was able to significantly
reduce the delay between recognizing and translating sign language gestures. By optimizing the
code for speed, including using efficient algorithms and data structures, I was able to reduce the
amount of time required for computations and minimize input and output operations.
These optimizations helped to improve the real-time performance of the system, ensuring that
sign language gestures were translated into text with minimal delay. This has important
implications for improving communication accessibility for individuals who are deaf or hard of
hearing, as it allows for more seamless and efficient communication through sign language
recognition and translation.
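One simple way to verify that such optimizations actually reduce delay is to time the per-frame pipeline; the helper below is only an illustrative sketch, where process_frame stands in for the combined capture, landmark extraction, and prediction step and is not part of the system's actual instrumentation.

import time

def measure_latency(process_frame, n_frames=100):
    # Run the per-frame pipeline repeatedly and report the average latency.
    start = time.perf_counter()
    for _ in range(n_frames):
        process_frame()
    elapsed = time.perf_counter() - start
    print(f"avg per-frame time: {1000 * elapsed / n_frames:.1f} ms "
          f"({n_frames / elapsed:.1f} FPS)")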
Discussion
The vision-based method uses gesture classification to extract elements from images. SVM and ANN are the two most widely used classification algorithms, and SVM performs better than ANN in comparative evaluations. The hidden Markov model (HMM) is a general method for understanding sign language because it is used in statistical approaches to capture spatiotemporal data. When examining the available data, most models obtain environmental information from sensors. Neural networks are frequently used in vision-based approaches to photos and videos, whereas HMMs and SVMs are used for classification. This is because more data sources are now easily accessible.
Another important processing method for identifying sign language is the neural network model. To interpret sign language, a CNN passes the input image through convolution, pooling, activation functions, and fully connected layers. 3D-CNNs are used to derive motion information from depth variation across frames and features. LSTM-based algorithms can use sign language videos to model temporal sequence data. BLSTM-3D ResNet, an advance over LSTM, can localize palms and hands in video sequences. In vision-based systems, two important classification techniques are HMM and SVM.
Convolutional neural networks have gained prominence in recent work on vision-based sign language recognition. Pre-trained models can be used immediately, without having to provide any data or undergo any training. Researchers have given pre-trained models a great deal of attention lately because of their potential, not least because they reduce the expense of training and creating datasets. To reduce the cost of training, many researchers have used pre-trained models such as gesture-based VGG16, GoogLeNet, and AlexNet. Cross-entropy and other loss functions are used when training on datasets to reduce loss errors; the most frequently used loss function is multiclass categorical cross-entropy. The widely used SGD and Adam optimizers work together with the loss function during training.
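A minimal sketch of this pre-trained-model approach with Keras is shown below; fine-tuning VGG16 on static gesture images with categorical cross-entropy and Adam is an illustration of the general technique, and the input size and five-class output are placeholder assumptions, not values taken from the works surveyed here.

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Reuse the pre-trained convolutional features and train only a small head.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(5, activation="softmax"),  # 5 gesture classes as a placeholder
])

# Multiclass categorical cross-entropy with the Adam optimizer, as discussed.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])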
The deep neural network (DNN) produces better outcomes but requires the largest training dataset because it can learn and form associations on its own. Modern hardware now provides the computing power to run such applications on huge datasets, and implementing new algorithms and improving current ones allows for better performance.
Conclusion
I presented a sign language translation system that can recognize and translate sign language
gestures into written words in real-time. The system utilized OpenCV to capture frames,
MediaPipe Holistic to extract features, and an LSTM neural network trained on my own dataset
to recognize and translate the signs.
The system achieved fair accuracy and real-time performance. It has the potential to improve
communication and accessibility for individuals with hearing disabilities by providing real-time
translation of sign language gestures into written words.
I also identified several technical challenges, including the difficulty of gathering enough data to
create a robust dataset and the accuracy issues when training the LSTM neural network. However,
I was able to overcome some of these challenges by creating my own dataset and implementing
various techniques to improve the accuracy of the system.
In future work, I plan to expand my dataset and improve accuracy on complex signs. I also plan to explore additional deep learning techniques, combining CNN and LSTM neural networks, to try to increase the accuracy and make the system more robust.
The sign language translation system has the potential to make a significant impact in the lives of
individuals with hearing disabilities, and I am excited to continue working on its development and
improvement.
References
Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735-1780. doi: 10.1162/neco.1997.9.8.1735
Bradski, G. (2000). The OpenCV Library. Dr. Dobb's Journal of Software Tools.
Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.
Chollet, F. (2015). Keras: Deep Learning library for Theano and TensorFlow. Retrieved from https://keras.io/
Lane, D., Bhattacharya, S., & Blackwell, A. F. (2015). What can we learn from designing interactive systems for the developing world? [Conference presentation]. British HCI 2015 Conference Proceedings, Lincoln, UK. Retrieved from https://www.robots.ox.ac.uk/~davidlane/bhattacharya-blackwell-bhci2015.pdf
Wikipedia. (n.d.). Sign language. In Wikipedia. Retrieved April 22, 2023, from https://en.wikipedia.org/wiki/Sign_language
Texas Woman's University. (n.d.). American sign language image database. Retrieved April 22, 2023, from http://vlm1.uta.edu/~srujana/ASLID/ASL_Image_Dataset.html
Duro, R. J., Rebelo, A., & Rodrigues, J. M. F. (2014). Vision-based Portuguese sign language recognition system. Pattern Recognition Letters, 49, 48-57. doi: 10.1016/j.patrec.2014.04.020
Kaur, P., & Singh, K. (2012). A video based Indian sign language recognition system (INSLR) using wavelet transform and fuzzy logic. International Journal of Engineering and Technology, 4(5), 662-669. Retrieved from http://www.ijetch.org/papers/581-IT010.pdf
Lamptey, R. (2019). BOBSL: A Bangla online sign language recognition system [Master's thesis]. Oxford University, UK. Retrieved from https://www.cs.ox.ac.uk/files/13219/Rishabh_Lamptey_Thesis_Final.pdf
Pundlik, S., & Blackwell, A. F. (2022). Towards improved automatic recognition of sign language: A review. Signal Processing: Image Communication, 105, 117977. doi: 10.1016/j.image.2022.117977
Akakandelwa, A. (2022). Sign language recognition using machine learning: A survey. Signal Processing: Image Communication, 105, 117980. doi: 10.1016/j.image.2022.117980
Lee, J., Lee, D., Lee, Y. K., & Chung, W. Y. (2016). Sign language recognition based on deep learning using Kinect v2 sensor. In Proceedings of the 2016 IEEE International Conference on Consumer Electronics-Taiwan.
National Institute on Deafness and Other Communication Disorders. (n.d.). American Sign Language. Retrieved from https://www.nidcd.nih.gov/health/american-sign-language