
Generating sign language by utilizing the LSTM algorithm to process and convert spoken
language inputs into corresponding ISL

1. G. Santhanakrishnan, Research Scholar (Part Time), Department of Computer Science, Sri Ramakrishna Mission
Vidyalaya College of Arts & Science (Affiliated to Bharathiar University), Coimbatore-20
2. Dr. S. Kumaravel, HOD & Associate Professor, Department of Computer Science, Sri Ramakrishna Mission Vidyalaya
College of Arts & Science (Affiliated to Bharathiar University), Coimbatore-20

Abstract:

We aim to achieve the first-ever faultless production of sign language directly from spoken
words. Previous initiatives in this field concentrated on producing sign language from text
transcripts manually annotated by people and did not take other modes of dialogue into
consideration. Nonetheless, generating sign language directly from spoken words offers a
workable solution for communicating with people who are hard of hearing or deaf. It removes
the requirement for text as a foundation, enabling techniques that capture speech that is more
natural, organic, and uninterrupted, and that is enriched with a vast vocabulary. We tackle the
problem using a Recurrent Neural Network (RNN) speech recognition algorithm as part of our
proposed methodology to produce sign language straight from spoken words. We are also
creating and making available the first dataset for Indian Sign Language, which consists of text
transcripts, speech-level annotations, and the associated sign language output.
Keywords: Sign Language, ISL, RNN, LSTM, Voice to Sign
Introduction:

Since the dawn of humanity, communication has been vital to our advancement. In
today's society, it is challenging to manage daily activities and business without a common
language understood by all parties involved. To circumvent linguistic barriers, individuals with
hearing impairments have developed gesture-based communication systems. However, they
still encounter a variety of challenges in their daily lives. While two individuals with hearing
impairments, without any other significant disabilities, can communicate effectively through
gesture-based communication, challenges arise when a hearing person and a person with
hearing impairment interact. Facilitating this interaction requires an interpretation process that
converts spoken language into gesture-based communication and vice versa.

Although human interpreters can facilitate this linguistic exchange, they are often costly
and not always available. As a result, a practical solution is needed to fulfil the requirements
of individuals with hearing impairments for routine communication. By automating the
interpretation process, machine interpretation offers a way to overcome these linguistic
barriers. For individuals with hearing impairments, this technological advancement is crucial
as it enables communication between them and hearing individuals, providing the former with
equal access to information and opportunities as the latter [1]. There are several steps involved
in translating speech into Indian Sign Language (ISL). Initially, speech recognition
technology captures spoken words and transcribes them into text. Next, the context and
semantics of the message in this transcribed text are understood through the application of
natural language processing, or NLP. Finally, when translating the processed text into ISL,
consideration is given to the distinct grammar and syntax of sign language. This translation
process may employ sophisticated models or a database of ISL signs to ensure that the content
being delivered is accurate and suitable for ISL users' cultural context.
Speech recognition, also known as computer speech recognition, involves enabling a
computer to understand spoken language. Essentially, it means programming a computer to
listen to us and respond appropriately. By "understand," we mean converting spoken words
into a comprehensible format, such as text. In this sense, speech recognition is also referred to
as the speech-to-text conversion process. Typically, this process involves a computer to carry
out the task, a speech recognition program, and a microphone to capture human speech [2].
After the speech is captured, it undergoes semantic processing using natural language
processing (NLP) and the proposed algorithm to convert the recognised text into Indian Sign
Language (ISL).
In the proposed research, we describe a procedure for translating spoken language into
sign language. As words are spoken, the framework maps them to the corresponding signs.
By enabling speech-to-sign conversion, this method helps close communication barriers.

Related Work:

Recurrent neural networks (RNNs) are a useful model for sequential data. For
sequence labelling tasks where the alignment between input and output sequences is unknown,
RNNs can be trained end-to-end using techniques such as Connectionist Temporal
Classification. Combining these methods with the Long Short-Term Memory (LSTM) RNN
architecture has produced cutting-edge results for the recognition of cursive handwriting. In
speech recognition, however, deep feedforward networks have so far outperformed RNNs [3].
Hand motions can be recorded by a camera and then translated into sign language to help
nonverbal users. The same system can convert images to text and recognise speech, turning
live readings into spoken output. A microphone can also help those who have trouble hearing
by translating spoken words into text that appears on a screen. Thanks to its ability to transform
speech into text and provide speech output, this technology can help people who are visually
impaired as well as those who are hearing impaired [4]. The biggest difficulty in translating
spoken English to sign language is producing sentences that correctly follow the grammar of
ASL. One project tackles this by removing non-ASL terms, tokenising sentences, and
identifying parts of speech using a rule-based methodology. Proper names have a distinct
format, with each letter rendered as an individual sign (Goa, for example, becomes G-O-A);
the system stores proper nouns in this letter-by-letter form separated by hyphens and enforces
further rules, such as verb correction. Even though this approach increases productivity and
decreases repetition, it remains error-prone and time-consuming, especially when moving from
ASL to ISL [5]. Another effort builds an application that uses Google's speech API to convert
speech input into text, which is then processed with NLTK libraries. The system tokenises the
text, performs lemmatisation and stemming, and applies rules to translate English into ISL
gloss. After processing, the output is passed through SiGML to render the sign language with
avatars, and the Hamburg Sign Language Notation System (HamNoSys) is used to retrieve
data from a database. By transforming speech into three-dimensional avatar animations and
presenting Hindi-language signs in place of GIFs, photos, or videos, this communication
system seeks to help those with disabilities while managing memory more efficiently [6]. The
suggested model creates Indian Sign Language (ISL) gloss after using a Hidden Markov Model
to translate speech into English text. Pre-processing procedures such as stemming,
tokenisation, selective stop word removal, and punctuation removal are involved, and the
Leacock-Chodorow metric is used to measure word similarity. Despite attaining a 68%
accuracy rate, the system has difficulties when synonyms for certain nouns are absent from the
ISL dictionary, resulting in redundant information in the ISL gloss; similarly, words that appear
identical in WordNet may not carry the intended meaning [7]. Following a great deal of
research, it is now widely recognised that sign languages are real languages that share traits
with spoken languages and are part of a single, cohesive natural language system. There are,
however, important distinctions between spoken and signed languages, such as the fact that
sign languages use two hands, which has a large impact on linguistic structure. The study of
Al-Sayyid Bedouin Sign Language, which emphasises the distinct yet related nature of spoken
and signed languages within the same language faculty, lends further credence to this idea.
Understanding human language in its entirety requires recognising both forms of language as
well as the critical role the body plays in shaping language form [8].
Proposed Methodology:

The proposed methodology involves three main processes for converting spoken words
into an ISL display, with a Long Short-Term Memory (LSTM) network as the core component
of the speech-to-text stage:

Figure: Overall pipeline of the proposed system. Speech input is processed by LSTM-adapted
speech recognition to produce text; the recognised text is converted into a formal ISL sentence;
the ISL sentence is tokenised and mapped to the appropriate sign images; and the tokenised
sign images are assembled into an animated ISL display.

Speech Input Collection: First, spoken words are recorded using a microphone. The
system's input is provided by this unprocessed speech data.
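As an illustration of this speech-capture step, the following sketch records a few seconds of
audio from the default microphone and saves it to a file; the sounddevice and soundfile
packages, the file name, the duration, and the sampling rate are all assumptions for illustration
rather than details specified in the paper.

# Minimal sketch of the speech-capture stage, assuming the sounddevice and
# soundfile packages; the file name, duration, and sampling rate are illustrative.
import sounddevice as sd
import soundfile as sf

SAMPLE_RATE = 16000   # assumed sampling rate in Hz
DURATION = 5          # record five seconds of speech (assumption)

# Record raw speech from the default microphone as a mono array.
recording = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
sd.wait()  # block until the recording has finished

# Persist the unprocessed speech so the recognition stage can consume it.
sf.write("speech_input.wav", recording, SAMPLE_RATE)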
Converting Speech to Text with the LSTM Algorithm: An LSTM-based neural network
is used to process the recorded speech. LSTM networks, a kind of Recurrent Neural Network
(RNN), are especially well suited to sequence prediction problems such as speech recognition.
The speech signal is processed by the LSTM network in the following stages:
Feature extraction first converts the raw speech into a digital format and then transforms
it into a collection of features, commonly represented as spectrograms or Mel-frequency
cepstral coefficients (MFCCs). These features capture significant speech attributes, such as
loudness and frequency, over time.
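As a sketch of this feature-extraction step, the snippet below computes MFCCs from the
captured audio file using the librosa library; the sampling rate, number of coefficients, and file
name are assumptions rather than values reported in the paper.

# Feature-extraction sketch, assuming librosa is available and speech has been
# saved to "speech_input.wav" (a hypothetical file name).
import librosa

def extract_features(audio_path, n_mfcc=13):
    # Load the raw speech and resample to a fixed rate (16 kHz is an assumption).
    signal, sample_rate = librosa.load(audio_path, sr=16000)
    # Compute MFCCs: one n_mfcc-dimensional feature vector per analysis frame.
    mfccs = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc)
    # Each row becomes the feature vector for one time step of the LSTM input.
    return mfccs.T  # shape: (num_frames, n_mfcc)

features = extract_features("speech_input.wav")
print(features.shape)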

Algorithm ContinuousSpeechRecognition(speech_stream):

Step 1: Preprocess the speech stream
    speech_frames = CaptureSpeechFrames(speech_stream)
    features = ExtractFeatures(speech_frames)

Step 2: Load the pretrained LSTM model
    lstm_model = LoadModel('path_to_pretrained_lstm_model')

Step 3: Initialise variables
    sentence = ""
    previous_output = None

Step 4: Process the features with the LSTM model
    for each feature_vector in features:
        # Carry the previous prediction forward so the model keeps context
        if previous_output is not None:
            input_vector = Concatenate(previous_output, feature_vector)
        else:
            input_vector = feature_vector

        # Predict the next part of the sentence
        output = lstm_model.Predict(input_vector)

        # Append the predicted word/character to the sentence
        sentence += DecodeOutput(output)

        # Update previous_output for the next time step
        previous_output = output

Step 5: Post-process the output sentence
    sentence = CleanUpSentence(sentence)
    Return sentence

LSTM Processing: The LSTM network receives the extracted characteristics as input.
Because LSTMs can manage long-range dependencies in sequential data, they are perfect for
context-sensitive jobs like speech recognition. Memory cells are specialised units found in
LSTMs that are capable of updating and maintaining data over extended periods of time. Three
gates—an input gate, a forget gate, and an output gate—control the information flow in each
memory cell.
The LSTM network can learn which segments of the input sequence are crucial for
generating predictions thanks to these gates, which aid in the selective remembering or
forgetting of information.
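To make the role of these gates concrete, the sketch below implements one time step of a
standard LSTM cell in NumPy; the weight matrices W, U and biases b are placeholders supplied
by the caller, not parameters from the paper's trained model.

# A minimal NumPy sketch of one LSTM time step using the standard gate equations;
# the parameter dictionaries are illustrative placeholders.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold parameters for the input (i), forget (f), and output (o)
    # gates, plus the candidate cell update (g).
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # candidate memory
    c_t = f * c_prev + i * g    # selectively forget old and admit new information
    h_t = o * np.tanh(c_t)      # expose part of the memory as the hidden state
    return h_t, c_t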
Temporal Context Handling: While processing the speech feature sequence, the LSTM
picks up on patterns and temporal correlations within the data. This capacity allows the LSTM
to efficiently convert the speech features into a string of characters or words, capturing the
temporal dynamics of speech.
Output Generation: In the final phase, the text output is produced from the LSTM's
predictions. After the LSTM network has processed the input sequence, its output is passed
through a dense (fully connected) layer and a softmax activation function. The dense layer
maps the LSTM outputs onto the target text classes (words or characters), and the softmax
layer provides a probability distribution over all potential characters or words, from which the
most likely sequence is chosen. The resulting text is an accurate representation of the spoken
words and can be used for a number of purposes, including voice-activated systems and
transcription services.
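A model of this shape can be sketched with Keras as follows; the layer sizes, vocabulary size,
and input feature dimension are illustrative assumptions rather than the configuration used in
the proposed system.

# Hedged Keras sketch of an LSTM acoustic model with a dense layer and softmax
# output; all sizes below are assumptions for illustration.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 1000   # assumed vocabulary of output words/characters
NUM_FEATURES = 13    # e.g. 13 MFCCs per frame (assumption)

model = models.Sequential([
    # Variable-length sequence of per-frame feature vectors.
    layers.Input(shape=(None, NUM_FEATURES)),
    # The LSTM layer maintains temporal context across the feature sequence.
    layers.LSTM(128, return_sequences=True),
    # The dense layer maps each hidden state onto the target text classes,
    # and softmax gives a probability distribution over them.
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()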
Tokenisation is the initial stage in the conversion of the recognised text into ISL gloss. In
this step, the text is divided into discrete words, or tokens. After tokenisation, part-of-speech
(POS) tagging identifies the grammatical role of each word (adjective, verb, noun, and so on),
and lemmatisation reduces words to their base or root forms. The next stage maps these terms
to matching Indian Sign Language (ISL) glosses. NLP approaches are used in this mapping to
achieve an accurate representation while taking into account ISL's distinct grammar and syntax.
The result of this stage is an organised ISL gloss that represents the material in a format that
can be signed.
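The following sketch illustrates this tokenisation, POS-tagging, and lemmatisation step with
NLTK; the example sentence and the simple tag-to-WordNet mapping are assumptions for
illustration, not the exact pipeline used in the proposed system.

# Tokenisation, POS tagging, and lemmatisation sketch using NLTK; the example
# sentence and tag mapping are illustrative assumptions.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()

def to_wordnet_pos(treebank_tag):
    # Map Penn Treebank tags to the coarse WordNet POS classes.
    if treebank_tag.startswith("V"):
        return "v"
    if treebank_tag.startswith("J"):
        return "a"
    if treebank_tag.startswith("R"):
        return "r"
    return "n"

sentence = "What is your name"          # example recognised text (assumption)
tokens = nltk.word_tokenize(sentence)   # split the text into discrete words
tagged = nltk.pos_tag(tokens)           # identify grammatical roles
lemmas = [lemmatizer.lemmatize(w.lower(), to_wordnet_pos(t)) for w, t in tagged]
print(tagged)
print(lemmas)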

Algorithm 1.b [Conversion of recognised speech text to ISL text]:

def ConvertEnglishToISLWithSignImages(sentence):
    # Apply all ISL grammar rules in one function
    words = reorder_all(Tokenize(sentence))
    isl_dictionary = LoadISLDictionary()
    # Map each word to its sign image; if a word is missing from the dictionary,
    # fall back to fingerspelling it character by character
    sign_images = [
        [isl_dictionary[word]] if word in isl_dictionary
        else [isl_dictionary.get(c, "UNKNOWN_SIGN")
              for c in Tokenize(word, by_character=True)]
        for word in words
    ]
    sign_images = [item for sublist in sign_images for item in sublist]  # Flatten the list of lists
    isl_video = CreateVideoFromImages(sign_images)
    DisplayVideo(isl_video)

def reorder_all(words):
    # Apply every ISL grammar rule in sequence, from basic reordering to
    # topicalisation, question handling, negation, agreement, and emphasis
    rules = [
        reorder_simple_sentence, reorder_to_topicalization,
        reorder_to_time_topic_comment, reorder_for_yes_no_question,
        reorder_for_wh_question, apply_negation_structure,
        apply_classifier_construction, apply_role_shifting,
        reorder_conditional_sentence, apply_comparative_structure,
        apply_non_manual_emphasis, apply_pronoun_pointing,
        apply_verb_agreement, apply_pluralization,
        apply_possession_structure, apply_existential_structure,
        apply_rhetorical_structure, apply_spatial_agreement,
        apply_verb_tense_structure, apply_topic_comment_structure,
    ]
    for rule in rules:
        words = rule(words)
    return words

Algorithm 1.b converts English sentences to Indian Sign Language (ISL) text.
The ConvertEnglishToISLWithSignImages function first tokenizes the sentence and applies
all ISL grammar rules through the reorder_all function, which adjusts the word order based
on ISL syntax. The reorder_all function sequentially applies a series of ISL grammar rules,
such as topic-comment structure, verb tense, spatial agreement, rhetorical structures, and more,
to ensure the sentence conforms to ISL grammar. It encapsulates all the necessary
transformations, such as handling topicalization, conditional sentences, and negations, among
others, into a single pipeline. After reordering, the function loads an ISL dictionary and maps
each word to its corresponding sign image. If a word isn't found in the dictionary, it is broken
down into characters, and each character is individually mapped. The resulting sign images are
then flattened into a single list, which is used to create a video representing the ISL sentence.
Finally, this video is displayed as the output, effectively translating the English sentence into
ISL.
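As a concrete illustration of one such rule, the sketch below shows how a wh-question
reordering function might behave, consistent with the test case reported later in which "What
is your name" becomes "your name what"; the word lists and the rule itself are simplified
assumptions, not the paper's full rule set.

# Hedged sketch of a single ISL reordering rule: drop copulas and move wh-words
# to the end of the sentence; the word lists are illustrative assumptions.
WH_WORDS = {"what", "where", "when", "who", "why", "how"}
COPULAS = {"is", "am", "are", "was", "were"}

def reorder_for_wh_question(words):
    lowered = [w.lower() for w in words]
    if not any(w in WH_WORDS for w in lowered):
        return words  # rule does not apply to this sentence
    wh = [w for w in words if w.lower() in WH_WORDS]
    rest = [w for w in words if w.lower() not in WH_WORDS and w.lower() not in COPULAS]
    return rest + wh  # topic first, question word last

print(reorder_for_wh_question(["What", "is", "your", "name"]))
# ['your', 'name', 'What']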

After the ISL text is formed, each token is tagged with the corresponding sign-image gloss
from the sign-image dictionary.

Algorithm 3: ConvertEnglishToISLWithSignImages(sentence):

Step 1: Tokenise the sentence
    words = Tokenize(sentence)

Step 2: Initialise the ISL dictionary
    isl_dictionary = LoadISLDictionary()

Step 3: Initialise the list of sign images
    sign_images = []

Step 4: Map words to ISL signs
    for each word in words:
        if word in isl_dictionary:
            sign_image = isl_dictionary[word]
            add sign_image to sign_images
        else:
            # Fingerspell: map the word character by character
            characters = Tokenize(word, by_character=True)
            for each character in characters:
                if character in isl_dictionary:
                    sign_image = isl_dictionary[character]
                    add sign_image to sign_images
                else:
                    # Character not in the ISL dictionary; handling is
                    # application-dependent
                    add "UNKNOWN_SIGN" to sign_images

Step 5: Create a video from the buffered sign images
    isl_video = CreateVideoFromImages(sign_images)

Step 6: Display the ISL video
    DisplayVideo(isl_video)

The system receives the ISL gloss as input during the ISL gloss to sign display step. The
first task is retrieving the matching ISL signs from a pre-built lexicon or database, which holds
an extensive collection of ISL signs, each linked to particular words or phrases. Following ISL
conventions, the retrieval procedure ensures that every gloss element is paired with the
appropriate sign. After identifying the relevant signs, the system creates visual representations
of them. These representations can be animated pictures, 3D avatars, or videos that show the
signs. The intention is to provide precise and unambiguous visual cues for the ISL signs.
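As a sketch of how the retrieved sign images could be assembled into such a representation,
the snippet below uses OpenCV to write a sequence of sign-image files to a video; the file
names, frame rate, and frame size are assumptions, and the paper does not prescribe a
particular library.

# Hedged OpenCV sketch of a CreateVideoFromImages-style helper; file names,
# frame size, and frame rate are illustrative assumptions.
import cv2

def create_video_from_images(image_paths, output_path="isl_output.mp4",
                             fps=1, frame_size=(640, 480)):
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(output_path, fourcc, fps, frame_size)
    for path in image_paths:
        frame = cv2.imread(path)
        if frame is None:
            continue  # skip unreadable images (e.g. an UNKNOWN_SIGN placeholder)
        # Resize every sign image to a common frame size before writing it.
        writer.write(cv2.resize(frame, frame_size))
    writer.release()
    return output_path

# Example: one image per ISL gloss token for "YOUR NAME WHAT" (hypothetical files).
create_video_from_images(["your.png", "name.png", "what.png"])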

The procedure culminates in the presentation of ISL signs, which serve as a means of
communication for users. The display can reach a large audience because it can be shown on
different screens or digital interfaces. The visual portrayal of ISL signs enables effective
communication, particularly for individuals who use sign language as their primary form of
expression. This method can also be incorporated into accessibility solutions for the hearing
impaired, instructional resources, and translation services. By utilising NLP techniques and
visual technology, the system bridges the gap between spoken and sign language and
facilitates inclusive communication.

Experimental Results:

Test case 1:

Given Speech Input: What is your name

ISL Text: your name what

Output:

YOUR NAME WHAT

Test case 2:

Given Speech Input: My Car is red

ISL Text: Me Car red

Output:

ME CAR RED

Conclusion and Future Work:

The use of LSTM (Long Short-Term Memory) algorithms in speech-to-sign language
systems is a major development in inclusive communication technologies. LSTM networks,
with their ability to detect and handle long-term dependencies, are well suited to speech
recognition. By precisely transcribing spoken words into text, these systems offer a dependable
basis for translating voice into sign language and improve accessibility for the deaf and hard-
of-hearing populations. Everyday spoken sentences have been converted to Indian Sign
Language (ISL) with high accuracy, highlighting the effectiveness of these systems. However,
future work should refine the speech-to-ISL text algorithms to enhance this accuracy further.
This technology not only facilitates greater inclusion in spheres such as education and
employment but also encourages social interaction. As improvements continue, LSTM-based
systems are likely to become even more effective, increasing their impact and promoting a
more inclusive society.

References:

[1] Quach, L., Nguyen, C.-N., "Conversion of the Vietnamese grammar into sign language
structure using the example-based machine translation algorithm," International Conference on
Advanced Technologies for Communications, pp. 27-31, 2018.
https://doi.org/10.1109/ATC.2018.8587584
[2] Bhushan C. Kamble, "Speech recognition using artificial neural network," International Journal
of Computing, Communications & Instrumentation Engg. (IJCCIE), Vol. 3, Issue 1, 2016.
[3] Graves, Alex, Abdel-rahman Mohamed, and Geoffrey Hinton, "Speech recognition with
deep recurrent neural networks," 2013 IEEE International Conference on Acoustics, Speech and
Signal Processing, pp. 6645-6649, IEEE, 2013.
[4] Mallika, G. Siva, "Information conveying system for disabled persons," Industrial Engineering
Journal, ISSN: 0970-2555, Volume 52, Issue 4, April 2023.
[5] F. Shaikh, S. Darunde, N. Wahie, and S. Mali, "Sign Language Translation System for
Railway Station Announcements," 2019 IEEE Bombay Section Signature Conference (IBSSC), 2019,
pp. 1-6. DOI: 10.1109/IBSSC47189.2019.8973041
[6] B. D. Patel, H. B. Patel, M. A. Khanvilkar, N. R. Patel and T. Akilan, "ES2ISL: An Advancement
in Speech to Sign Language Translation using 3D Avatar Animator," 2020 IEEE Canadian Conference
on Electrical and Computer Engineering (CCECE), 2020, pp. 1-5. DOI:
10.1109/CCECE47787.2020.9255783
[7] K. Saija, S. Sangeetha and V. Shah, "WordNet Based Sign Language Machine Translation:
from English Voice to ISL Gloss," 2019 IEEE 16th India Council International Conference (INDICON),
2019, pp. 1-4. DOI: 10.1109/INDICON47234.2019.9029074
[8] Sandler, Wendy, "Speech and sign: the whole human language," Theoretical Linguistics, vol. 50,
no. 1-2, pp. 107-124, 2024.
