Report
1. Introduction
Sign language is a rich, complex system of communication that uses visual-manual modality
to convey meaning. Each country or region often has its own variant of sign language, such
as American Sign Language (ASL), British Sign Language (BSL), and Indian Sign Language (ISL).
The need for automatic sign language translation has grown due to the global push for
inclusivity and accessibility for the deaf community.
The conversion of sign language to text involves interpreting hand gestures, facial
expressions, and body movements, and translating them into meaningful text. This process
integrates various technologies, including computer vision, artificial intelligence, and
linguistics.
3. Technological Overview
The process of sign language to text conversion can be broken down into three major steps:
Gesture Recognition: Recognizing signs based on hand movements, shapes, and
orientations.
Gesture Classification: Mapping recognized gestures to specific signs or words.
Text Generation: Converting recognized signs into grammatically correct sentences.
3.1 Gesture Recognition
Gesture recognition involves identifying and understanding hand signs. This can be done
using two major approaches:
1. Sensor-based Approach: This method uses gloves, accelerometers, or specialized
sensors to track the movement and shape of hands. An example of this is data
gloves, which capture hand positions and finger bends. However, these systems tend
to be expensive and cumbersome for widespread use.
2. Vision-based Approach: This method uses cameras and computer vision algorithms
to detect hand gestures. Vision-based techniques include color detection,
background subtraction, and depth sensors such as the Microsoft Kinect. The advantage is
that users do not need specialized equipment, only a camera; a minimal illustrative
sketch follows below.
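As a rough sketch of the vision-based approach, the snippet below combines background subtraction with a skin-colour threshold in HSV space using OpenCV. The HSV range and the choice of MOG2 background subtraction are illustrative assumptions rather than a prescribed pipeline, and would need tuning for real cameras and lighting.

```python
# Illustrative vision-based hand segmentation with OpenCV.
# The HSV skin-colour range below is an assumption and usually
# needs tuning per camera and lighting setup.
import cv2
import numpy as np

cap = cv2.VideoCapture(0)                      # default webcam
bg_subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)

while True:
    ok, frame = cap.read()
    if not ok:
        break

    # Foreground mask removes the (mostly static) background.
    fg_mask = bg_subtractor.apply(frame)

    # Skin-colour mask in HSV space (illustrative range).
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    skin_mask = cv2.inRange(hsv, np.array([0, 30, 60]), np.array([20, 150, 255]))

    # Keep pixels that are both moving foreground and skin-coloured.
    hand_mask = cv2.bitwise_and(fg_mask, skin_mask)
    hand = cv2.bitwise_and(frame, frame, mask=hand_mask)

    cv2.imshow("hand region", hand)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```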
3.4 Dataset
The dataset used for this work was based on ISL. To the best of the authors' knowledge,
no authentic and complete ISL dataset exists covering all 26 letters of the English
alphabet. Our dataset was therefore prepared manually by capturing multiple images of
each finger-spelled letter and applying different data augmentation techniques.
In the end, the dataset contained over 150,000 images across the 26 categories, with
approximately 5,500 images per letter. To keep the data consistent, the same background
was used for most of the images. The images were also captured under different lighting
conditions to train a model robust to such changes in the surroundings. The images were
taken with the 20-megapixel camera of a Redmi Note 5 Pro. All the RGB images were resized
to 144×144 pixels to ensure a uniform input size. Fig. 2 shows a few sample images from
this dataset.
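A minimal Keras sketch of the resize-and-augment step described above. The directory name isl_dataset/, the one-folder-per-letter layout, and the specific augmentation parameters are assumptions for illustration, not the exact settings used for this dataset.

```python
# Sketch of the resize-and-augment step, assuming one sub-folder per letter
# (A/ .. Z/) under a hypothetical isl_dataset/ directory. The augmentation
# parameters are illustrative, not the exact values used for this dataset.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    preprocessing_function=lambda x: x / 127.5 - 1.0,  # scale pixels to [-1, +1]
    rotation_range=10,              # small rotations
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    brightness_range=(0.7, 1.3),    # mimic the varied lighting conditions
    validation_split=0.2,
)

train_gen = datagen.flow_from_directory(
    "isl_dataset/",
    target_size=(144, 144),         # resize to 144x144 as in the text
    class_mode="categorical",       # 26 letter categories
    batch_size=32,
    subset="training",
)
```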
Methodology
In this section, we discuss the architectures of various self-developed and pre-trained
deep neural networks and machine learning algorithms, along with their corresponding
performance on the task of hand-gesture-to-audio and audio-to-hand-gesture recognition.
The complete implementation was done in Keras with TensorFlow as the backend. A pictorial
overview of our entire framework is presented in Fig. 1. The three individual models are
briefly discussed as follows.
• Pre-trained VGG16 Model: Under this approach, gestures were classified using a VGG16
model pre-trained on the ImageNet dataset. We truncated its last layer and added
custom-designed layers to provide a baseline for comparison with state-of-the-art
networks; a Keras sketch of this step follows the list below.
• Natural Language Based Output Network: For this model, a Deep Convolutional Neural
Network (DCNN) with 26 output categories was developed. Its output was then fed to an
English-corpus-based model that corrects classification errors, based on the probability
of occurrence of the particular word in the English vocabulary. Moreover, only the top-3
predictions provided by the neural network were considered in this model.
• Hierarchical Network: Our final approach comprises a novel hierarchical model for
classification that resembles a tree-like structure. Gestures are first classified into
two categories (one-hand or two-hand) and then fed into further deep neural networks,
whose outputs are used to categorize them into the 26 letters of the English alphabet.
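A hedged Keras sketch of the VGG16 baseline mentioned in the first bullet: the ImageNet convolutional base is kept, the original classifier head is dropped, and custom layers are added for the 26 letters. The sizes of the added layers are assumptions for illustration, not the exact configuration used.

```python
# Keras sketch of the pre-trained VGG16 baseline: the ImageNet convolutional
# base is kept, its classifier head is removed (include_top=False), and custom
# dense layers are added for the 26 letters. The head sizes are assumptions.
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

base = VGG16(weights="imagenet", include_top=False, input_shape=(144, 144, 3))
base.trainable = False                      # keep the pre-trained features fixed

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),   # custom-designed head (illustrative size)
    layers.Dropout(0.4),
    layers.Dense(26, activation="softmax"), # one output per finger-spelled letter
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```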
Algorithm for Forming Valid English Words from a Given Sequence of Letters
The natural language based output network was developed to rectify errors made by the CNN
model. Its main purpose is to correct falsely predicted outcomes during ISL conversation.
A misspelled word can therefore be corrected by an algorithm that considers the possible
English words that can be formed from the predicted letters by intelligently changing a
letter or two. Such algorithms are useful in practice for overcoming the flaws of the CNN.
A 13-layer CNN was developed which received these images, with their pixel values scaled
between -1 and +1. The network was a simple one comprising 3×3 convolutional filters
followed by max-pooling. The later layers included dropout (0.3-0.4) and batch
normalisation to avoid overfitting. The Adam optimiser with a learning rate of 0.0002 was
used to minimise the categorical cross-entropy loss function. The softmax layer produced
26 output probabilities, each corresponding to the input being a particular letter.
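The exact 13-layer layout is not fully specified, so the block below is only a sketch consistent with the description: 3×3 convolutions followed by max-pooling, dropout (0.3-0.4) and batch normalisation in the later layers, Adam at a learning rate of 0.0002, and a 26-way softmax. The filter counts and exact depth are assumptions.

```python
# Sketch of a CNN consistent with the description above. Filter counts and
# exact depth are assumptions; pixels are assumed pre-scaled to [-1, +1].
from tensorflow.keras import layers, models, optimizers

model = models.Sequential([
    layers.Input(shape=(144, 144, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.BatchNormalization(),              # later layers: batch norm + dropout
    layers.Dropout(0.3),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.4),
    layers.Dense(26, activation="softmax"),   # probability for each letter
])

model.compile(optimizer=optimizers.Adam(learning_rate=2e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```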
Exploiting this characteristic of the softmax layer, we calculated the total probability of
a given word, which is a sequence of letters, as the sum of the probabilities of the highest
predicted output for each letter. For example, if a user finger-spells the word ‘cat’, then
for each of the letters ‘c’, ‘a’ and ‘t’ the probabilities of the top-3 predicted letters
are saved, and their sum gives the overall probability of the word being ‘cat’. Now, if the
letters predicted by the CNN correspond to ‘cet’, this word is searched for in a corpus of
three-letter English words. If no such word exists, the algorithm changes the letters (one
at a time) to the next most probable letter and checks the dictionary again. If such a word
exists, it is stored along with its total probability. The model outputs the word with the
highest probability that belongs to the English dictionary as the final prediction. This
model works on the idea that a user conversing in finger-spelled ISL is likely to depict a
word that exists in the English dictionary (apart from unusual proper nouns).
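A simplified Python sketch of this correction step. It brute-forces all combinations of the per-letter top-3 candidates rather than changing one letter at a time as described above, and it assumes the top-3 (letter, probability) pairs and a word list are already available; both inputs are illustrative.

```python
# Simplified sketch of the dictionary-based correction step. `top3_per_letter`
# holds, for every signed position, the CNN's top-3 (letter, probability)
# pairs; `english_words` stands in for a word list of matching length.
from itertools import product

def correct_word(top3_per_letter, english_words):
    best_word, best_score = None, -1.0
    # Try every combination of the top-3 candidates per position
    # (changing "a letter or two" relative to the top-1 prediction).
    for combo in product(*top3_per_letter):
        word = "".join(letter for letter, _ in combo)
        score = sum(prob for _, prob in combo)   # total word probability
        if word in english_words and score > best_score:
            best_word, best_score = word, score
    return best_word  # None if no valid English word can be formed

# Example: the CNN's top-1 letters spell "cet", but "cat" is recoverable.
top3 = [
    [("c", 0.90), ("o", 0.05), ("e", 0.03)],
    [("e", 0.55), ("a", 0.40), ("o", 0.03)],
    [("t", 0.95), ("f", 0.03), ("l", 0.01)],
]
print(correct_word(top3, {"cat", "cot", "cut"}))   # -> "cat" ("cet" is not a word)
```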
4. State of the Art Systems
Numerous systems and research projects have focused on converting sign language to text.
Some notable systems include:
SignAll: A vision-based system that translates ASL into text using multiple cameras and
deep learning algorithms to recognize signs.
Google Translate’s Hand Gesture Recognition: Google’s AI research division has
experimented with hand gesture recognition models using mobile phone cameras for
real-time sign-to-text translation.
DeepASL: This project uses LSTM-based deep learning techniques to identify ASL
gestures and convert them into text, focusing on dynamic gestures.
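As a rough illustration of the LSTM-based approach used for dynamic gestures, the sketch below maps a sequence of per-frame keypoint vectors to a gesture class in Keras. The sequence length, feature size, and number of classes are assumptions, not DeepASL's actual configuration.

```python
# Illustrative Keras LSTM classifier for dynamic gestures: a sequence of
# per-frame keypoint/feature vectors is mapped to one gesture class.
# All dimensions below are assumptions for illustration.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(30, 63)),            # e.g. 30 frames x 21 keypoints x 3 coords
    layers.LSTM(64, return_sequences=True),
    layers.LSTM(32),
    layers.Dense(64, activation="relu"),
    layers.Dense(100, activation="softmax"), # assumed gesture vocabulary size
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```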
6. Future Directions
Advancements in artificial intelligence, particularly in deep learning and NLP, offer promising
avenues for improving sign language to text conversion. Some future trends include:
Multimodal Learning: Incorporating facial expressions, body posture, and even gaze
tracking to improve the accuracy and context understanding of sign language systems.
Transfer Learning: Leveraging pre-trained models on large image or video datasets and
fine-tuning them for specific sign language tasks.
Augmented Reality (AR) and Wearables: The development of AR glasses or wearables
that provide real-time sign-to-text conversion could significantly enhance
communication accessibility for the deaf community.
Sign Language Data Augmentation: To address the limited availability of data,
researchers are exploring ways to synthetically generate or augment sign language
datasets using techniques like Generative Adversarial Networks (GANs).
7. Conclusion
Sign language to text conversion is a crucial technological innovation for bridging
communication gaps between the deaf and hearing communities. While current systems
demonstrate promising results, there are still significant challenges to overcome, particularly
in the areas of real-time processing, continuous sign recognition, and incorporating non-
manual markers. Future advancements in AI and multimodal learning could pave the way for
more accurate, scalable, and accessible sign language translation systems, improving
inclusivity for millions of people worldwide.
This report provides a comprehensive overview of sign language to text conversion
technologies and explores the potential for future innovations in this vital field.