Survey of Technical Articles


Topic: Sign Language to Text Conversion
Abstract
Sign language is a vital form of communication for the deaf and hard-of-hearing community.
However, due to the limited number of people who are proficient in sign language, a
communication gap often exists between sign language users and the hearing population.
Sign language to text conversion aims to bridge this gap by utilizing advanced technologies
like computer vision, machine learning, and natural language processing (NLP). This report
provides an in-depth analysis of the current state of sign language to text conversion
systems, exploring various techniques, algorithms, and challenges in the field.

1. Introduction
Sign language is a rich, complex system of communication that uses visual-manual modality
to convey meaning. Each country or region often has its own variant of sign language, such
as American Sign Language (ASL), British Sign Language (BSL), and Indian Sign Language (ISL).
The need for automatic sign language translation has grown due to the global push for
inclusivity and accessibility for the deaf community.
The conversion of sign language to text involves interpreting hand gestures, facial
expressions, and body movements, translating them into meaningful text. This process
integrates various technologies, including computer vision, artificial intelligence, and
linguistics.

2. Sign Language Structure


Sign language has its own grammar, syntax, and lexicon, differing significantly from spoken
languages. Some key characteristics include:
 Manual components: Hand shapes, movements, locations, and orientations.
 Non-manual markers: Facial expressions, head movements, and body posture.
 Time-based structure: Temporal aspects of gestures, such as duration and pauses.
Because of its multimodal nature, sign language is complex to convert into a linear text form.

3. Technological Overview
The process of sign language to text conversion can be broken down into three major steps:
 Gesture Recognition: Recognizing signs based on hand movements, shapes, and
orientations.
 Gesture Classification: Mapping recognized gestures to specific signs or words.
 Text Generation: Converting recognized signs into grammatically correct sentences.
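Purely to illustrate how these three stages compose, the Python skeleton below chains them as placeholder functions; none of these names correspond to a real API from the systems surveyed.

# Illustrative pipeline skeleton; each stage stands in for a real
# model discussed in the subsections below. All names are placeholders.

def recognize_gestures(frames):
    """Stage 1: detect hand regions and extract gesture features per frame."""
    ...

def classify_gestures(gesture_features):
    """Stage 2: map extracted features to sign/word labels."""
    ...

def generate_text(sign_labels):
    """Stage 3: assemble recognized signs into grammatical text."""
    ...

def sign_to_text(frames):
    return generate_text(classify_gestures(recognize_gestures(frames)))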
3.1 Gesture Recognition
Gesture recognition involves identifying and understanding hand signs. This can be done
using two major approaches:
1. Sensor-based Approach: This method uses gloves, accelerometers, or specialized
sensors to track the movement and shape of hands. An example of this is data
gloves, which capture hand positions and finger bends. However, these systems tend
to be expensive and cumbersome for widespread use.
2. Vision-based Approach: This method uses cameras and computer vision algorithms
to detect hand gestures. Vision-based techniques include color detection,
background subtraction, and depth sensors like Microsoft Kinect. The advantage is
that users do not need specialized equipment, only a camera.
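As a hedged illustration of the vision-based approach, the sketch below combines background subtraction with a rough HSV skin-colour threshold using OpenCV; the threshold values are common heuristics, not parameters taken from the surveyed systems.

# Minimal vision-based hand-segmentation sketch (assumed dependency:
# opencv-python). Background subtraction isolates motion; an HSV
# threshold keeps only skin-coloured pixels.
import cv2

cap = cv2.VideoCapture(0)                       # default webcam
bg_subtractor = cv2.createBackgroundSubtractorMOG2()

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = bg_subtractor.apply(frame)        # moving-foreground mask
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    skin_mask = cv2.inRange(hsv, (0, 40, 60), (25, 255, 255))  # rough skin range
    hand_mask = cv2.bitwise_and(fg_mask, skin_mask)
    cv2.imshow("hand", cv2.bitwise_and(frame, frame, mask=hand_mask))
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()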

3.1.1 Deep Learning Techniques


In recent years, deep learning techniques have gained traction in gesture recognition.
Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and their
variants, like Long Short-Term Memory (LSTM) networks, are used to identify complex
patterns in video sequences.
 Convolutional Neural Networks (CNNs): Primarily used for image classification tasks, CNNs have been adapted to recognize static signs by learning features from images of hand gestures.
 Recurrent Neural Networks (RNNs): Particularly useful for sequential data, RNNs (and LSTMs) help identify dynamic gestures, where the temporal aspect of sign language is crucial.
 3D CNNs: These models extend CNNs to process video sequences by taking spatio-temporal features into account, enabling the identification of dynamic signs.
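Where the temporal dimension matters, a common pattern is a per-frame CNN feeding an LSTM. The following is a minimal Keras sketch; the clip length, image size, layer widths, and class count are illustrative assumptions, not values from the surveyed systems.

# CNN+LSTM sketch for dynamic gesture recognition in Keras.
from tensorflow.keras import layers, models

NUM_FRAMES, H, W, C = 16, 64, 64, 3   # short video clip (assumed shape)
NUM_CLASSES = 26                      # e.g. finger-spelled letters

model = models.Sequential([
    # A small CNN applied to every frame extracts spatial features...
    layers.TimeDistributed(layers.Conv2D(32, (3, 3), activation="relu"),
                           input_shape=(NUM_FRAMES, H, W, C)),
    layers.TimeDistributed(layers.MaxPooling2D((2, 2))),
    layers.TimeDistributed(layers.Conv2D(64, (3, 3), activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D((2, 2))),
    layers.TimeDistributed(layers.Flatten()),
    # ...and an LSTM models the temporal evolution of the gesture.
    layers.LSTM(128),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()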
3.2 Gesture Classification
After recognizing gestures, the next step is to classify these into corresponding signs or
words. This is usually achieved through machine learning algorithms:
 Support Vector Machines (SVMs): Commonly used in early sign recognition systems for
classifying different gestures.
 K-Nearest Neighbors (KNN): A simple, yet effective classification algorithm used for
grouping similar gestures.
 Deep Neural Networks (DNNs): More recent systems employ deep learning models due
to their superior ability to handle large datasets and learn intricate features.
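As a hedged illustration of this classification step, the sketch below trains an SVM and a KNN on stand-in feature vectors with scikit-learn; the random features merely take the place of real hand-shape descriptors.

# Comparing SVM and KNN on pre-extracted gesture features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))        # 500 gestures, 64-dim features (stand-in)
y = rng.integers(0, 26, size=500)     # 26 sign classes

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for clf in (SVC(kernel="rbf"), KNeighborsClassifier(n_neighbors=5)):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))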

The classification task also needs to address challenges like:


 Sign Variability: Different users may sign the same word differently due to personal
styles.
 Ambiguity: Some signs may look similar, leading to potential confusion.

3.3 Text Generation


Text generation is the final step, where recognized signs are converted into human-readable
text. Because sign language does not follow the grammar of spoken languages, translation
systems must do more than transcribe signs one-to-one; they need to generate grammatically
correct text. Natural Language Processing (NLP) techniques such as sequence-to-sequence
models, attention mechanisms, and language modeling play a crucial role in this process.
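To make the sequence-to-sequence idea concrete, here is a minimal Keras encoder-decoder skeleton that maps a sequence of recognized sign tokens ("glosses") to English words. The vocabulary sizes and dimensions are illustrative assumptions; a production system would add attention and proper tokenization.

# Encoder-decoder (seq2seq) skeleton for gloss-to-text generation.
from tensorflow.keras import layers, models

SIGN_VOCAB, WORD_VOCAB, EMBED, HIDDEN = 1000, 2000, 128, 256  # assumed sizes

# Encoder: reads the gloss sequence and summarizes it in its final state.
enc_in = layers.Input(shape=(None,))
enc_emb = layers.Embedding(SIGN_VOCAB, EMBED)(enc_in)
_, state_h, state_c = layers.LSTM(HIDDEN, return_state=True)(enc_emb)

# Decoder: generates the English sentence conditioned on that state.
dec_in = layers.Input(shape=(None,))
dec_emb = layers.Embedding(WORD_VOCAB, EMBED)(dec_in)
dec_out, _, _ = layers.LSTM(HIDDEN, return_sequences=True,
                            return_state=True)(dec_emb,
                                               initial_state=[state_h, state_c])
probs = layers.Dense(WORD_VOCAB, activation="softmax")(dec_out)

model = models.Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()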

3.4 Dataset
The dataset used for this work was based on ISL. To the best of the authors' knowledge, no
authentic and complete dataset exists covering all 26 letters of the English alphabet in ISL.
Our dataset was therefore prepared manually by capturing multiple images of each
finger-spelled letter and applying different data augmentation techniques. The final dataset
contained over 150,000 images across all 26 categories, with approximately 5,500 images per
letter. To keep the data consistent, the same background was used for most of the images; the
images were also captured under different lighting conditions to train a model robust to such
changes in the surroundings. The images were taken with the 20-megapixel camera of a Redmi
Note 5 Pro, and all RGB images were resized to 144×144 pixels to remove the possibility of
varying sizes. Fig. 2 shows a few sample images from this dataset.
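As an illustration of the augmentation step, the sketch below uses Keras' ImageDataGenerator; the specific transform parameters are assumptions for demonstration, not the settings used to build this dataset.

# Data augmentation sketch for the 144x144 sign images.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=10,            # small rotations
    width_shift_range=0.1,        # horizontal jitter
    height_shift_range=0.1,       # vertical jitter
    zoom_range=0.1,
    brightness_range=(0.7, 1.3),  # simulate varying lighting conditions
)

images = np.random.rand(8, 144, 144, 3)   # stand-in for dataset images
batch = next(augmenter.flow(images, batch_size=8))
print(batch.shape)                         # (8, 144, 144, 3)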

Methodology
In this section, we discuss the architectures of various self-developed and pre-trained deep
neural networks and machine learning algorithms, along with their corresponding performance
on the task of converting hand gestures to text. The complete implementation was done in
Keras with TensorFlow as the backend. A pictorial overview of our entire framework is
presented in Fig. 1. The three individual models are briefly discussed as follows.
• Pre-trained VGG16 Model: Under this approach, gestures were classified using a VGG16
model pre-trained on the ImageNet dataset. We truncated its final layer and added
custom-designed layers on top to provide a baseline for comparison with state-of-the-art
networks (a sketch of this setup appears after this list).
• Natural Language Based Output Network: For this model, a Deep Convolutional Neural
Network (DCNN) with 26 output categories was developed. Its output was then fed to an
English-corpus-based model that corrects classification errors, based on the probability of
the candidate word occurring in the English vocabulary. Only the top-3 predictions of the
neural network were considered in this model.
• Hierarchical Network: Our final approach comprises a novel hierarchical classification
model resembling a tree-like structure. Gestures are first classified into two categories
(one-hand or two-hand) and then fed into further deep neural networks, whose outputs
categorize them into the 26 English letters.
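A minimal sketch of the VGG16 baseline described in the first bullet, assuming a 144×144 input (matching the dataset above) and an illustrative custom head; the exact custom layers used by the authors are not specified.

# VGG16 transfer-learning baseline in Keras.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(144, 144, 3))
base.trainable = False                       # keep the pre-trained features fixed

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),    # custom-designed layers (assumed sizes)
    layers.Dropout(0.4),
    layers.Dense(26, activation="softmax"),  # one output per letter
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])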
Algorithm for Forming Valid English Words from a Given Sequence of Letters

The natural language based output network was developed to rectify errors made by the CNN
model. Its main purpose is to correct falsely predicted outcomes during ISL conversation: a
misspelled word can be corrected by an algorithm that considers the possible English words
formable from the predicted letters by intelligently changing a letter or two. Such
algorithms are useful in practical terms for overcoming the flaws of the CNN.

A 13-layer CNN was developed to receive these images, with their pixels scaled between -1 and
+1. The network was a simple one comprising 3×3 convolutional filters followed by
max-pooling. The later layers included dropout (0.3-0.4) and batch normalisation to avoid
overfitting. The Adam optimiser with a learning rate of 0.0002 was used to minimise the
categorical cross-entropy loss function. The softmax layer produced 26 probabilities, each
corresponding to the output being a particular letter.

Exploiting this characteristic of the softmax layer, we calculated the total probability of a
given word, which is a sequence of letters, from the per-letter probabilities. For each signed
letter, the probabilities of the top-3 predicted letters are saved; for example, if a user
finger-spells 'cat', the top-3 candidates are stored for each of 'c', 'a', and 't', and the
sum of the highest per-letter probabilities gives the overall probability of the word. Now,
suppose the letters the CNN ranked highest spell 'cet'. This word is searched in a corpus of
English words of length 3. If no such word exists, the algorithm changes the letters (one at a
time) to the next most probable candidate and checks the dictionary again; every valid word
found is stored along with its total probability. The model finally outputs the dictionary
word with the highest probability as its prediction. This model works on the idea that a user
conversing in finger-spelled ISL is likely to depict a word that exists in the English
dictionary (apart from unusual proper nouns).
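A minimal sketch of this correction step, assuming per-letter top-3 (letter, probability) candidates from the softmax layer and a set-based English dictionary. For brevity it scores every combination of candidates rather than changing one letter at a time as described above; all names are illustrative.

# Dictionary-based word correction over per-letter top-3 candidates.
from itertools import product

def correct_word(top3, dictionary):
    """Return the valid dictionary word with the highest total probability.

    top3: list over letter positions; each entry is a list of
          (letter, probability) pairs sorted by descending probability.
    """
    best_word, best_score = None, float("-inf")
    # Try every combination of the per-position candidates
    # (3^N combinations for a word of length N).
    for combo in product(*top3):
        word = "".join(letter for letter, _ in combo)
        score = sum(prob for _, prob in combo)   # summed letter probabilities
        if word in dictionary and score > best_score:
            best_word, best_score = word, score
    return best_word

# Example: CNN outputs whose highest-ranked letters spell 'cet'.
top3 = [
    [("c", 0.90), ("e", 0.05), ("o", 0.03)],
    [("e", 0.55), ("a", 0.40), ("o", 0.02)],
    [("t", 0.95), ("f", 0.02), ("l", 0.01)],
]
print(correct_word(top3, {"cat", "cot", "cut"}))  # -> 'cat'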
4. State of the Art Systems
Numerous systems and research projects have focused on converting sign language to text.
Some notable systems include:
 SignAll: A vision-based system that translates ASL into text using multiple cameras and
deep learning algorithms to recognize signs.
 Google Translate’s Hand Gesture Recognition: Google’s AI research division has
experimented with hand gesture recognition models using mobile phone cameras for
real-time sign-to-text translation.
 DeepASL: This project uses LSTM-based deep learning techniques to identify ASL
gestures and convert them into text, focusing on dynamic gestures.

5. Challenges and Limitations


Despite technological advancements, several challenges remain in sign language to text
conversion:

5.1 Gesture Complexity


Sign language is multimodal, requiring recognition of not only hand movements but also
facial expressions, which significantly contribute to meaning. Systems that fail to incorporate
these non-manual markers often provide incomplete or inaccurate translations.

5.2 Variability and Dialects


Sign language varies significantly between regions, and even within a single variant, there is
substantial variation in how individuals sign. A system trained on one dataset may struggle
to recognize signs from a different region or user.

5.3 Continuous Sign Language Recognition


Unlike isolated sign recognition, continuous sign language involves recognizing signs in a
natural, flowing sequence. This poses additional challenges as signs blend together, and the
system must distinguish between individual signs and interpret context.

5.4 Real-time Processing


For sign language conversion to be useful in everyday applications, it must occur in real-time
with low latency. This requires highly optimized algorithms that can process video frames
quickly, a significant challenge for resource-constrained devices like smartphones.
5.5 Limited Datasets
A major bottleneck for the development of effective sign language recognition systems is the
scarcity of large, labeled datasets. Collecting and annotating sign language data is time-
consuming and expensive, limiting the amount of training data available for machine
learning models.

6. Future Directions
Advancements in artificial intelligence, particularly in deep learning and NLP, offer promising
avenues for improving sign language to text conversion. Some future trends include:
 Multimodal Learning: Incorporating facial expressions, body posture, and even gaze
tracking to improve the accuracy and context understanding of sign language systems.
 Transfer Learning: Leveraging pre-trained models on large image or video datasets and
fine-tuning them for specific sign language tasks.
 Augmented Reality (AR) and Wearables: The development of AR glasses or wearables
that provide real-time sign-to-text conversion could significantly enhance
communication accessibility for the deaf community.
 Sign Language Data Augmentation: To address the limited availability of data,
researchers are exploring ways to synthetically generate or augment sign language
datasets using techniques like Generative Adversarial Networks (GANs).

7. Conclusion
Sign language to text conversion is a crucial technological innovation for bridging
communication gaps between the deaf and hearing communities. While current systems
demonstrate promising results, there are still significant challenges to overcome, particularly
in the areas of real-time processing, continuous sign recognition, and incorporating non-
manual markers. Future advancements in AI and multimodal learning could pave the way for
more accurate, scalable, and accessible sign language translation systems, improving
inclusivity for millions of people worldwide.

8. References
 https://journal.ijresm.com/index.php/ijresm/article/view/748/720
 https://ijeast.com/papers/135-139,Tesma512,IJEAST.pdf
 https://github.com/yatharth77/Indian-Sign-Language-Gesture-Recognition
 https://www.researchgate.net/publication/362331604_Real_Time_Sign_Language_Translation_Systems_A_review_study
 https://www.mdpi.com/2079-9292/12/12/2678
This report provides a comprehensive overview of sign language to text conversion
technologies and explores the potential for future innovations in this vital field.
