Final Year Project Report (Final)
SIGN LANGUAGE TO TEXT
BACHELOR OF TECHNOLOGY
In
Computer Science and Engineering (Artificial Intelligence & Machine Learning)
DECLARATION
We hereby declare that this submission is our own work and that, to the best of our
knowledge and belief, it contains no material previously published or written
by another person nor material which to a substantial extent has been accepted
for the award of any other degree or diploma of the university or other institute
of higher learning, except where due acknowledgment has been made in the
text.
Signature :
Name: Ishu Singh
Roll No.: 2000681530027
Date:
CERTIFICATE
This is to certify that the Project Report entitled “Developing a system for converting sign
language to text using Machine Learning” which is submitted by Ishu Singh
(2000681530027), Raman Baliyan (2000681530039), Ritik Chauhan (2000681530040) and
Harsh Tyagi (2000681530024) in partial fulfillment of the requirement for the award of
degree B. Tech. in Department of Computer Science and Engineering (Artificial Intelligence
& Machine Learning) of Dr. A.P.J. Abdul Kalam Technical University, U.P., Lucknow, is a
record of the candidates’ own work carried out by them under my supervision. The matter
embodied in this Project report is original and has not been submitted for the award of any
other degree.
ACKNOWLEDGEMENT
It gives us a great sense of pleasure to present the report of the B. Tech Project
undertaken during the B. Tech. final year. We owe a special debt of gratitude to
our guide Mr. Umesh Kumar, Assistant Professor, Department of Computer
Science and Engineering (Artificial Intelligence), Meerut Institute of
Engineering and Technology, Meerut, for his constant support and guidance
throughout the course of our work. His sincerity, thoroughness and
perseverance have been a constant source of inspiration for us. It is only his
cognizant efforts that have brought our endeavors to the light of day.
Signature : Signature :
Date : Date :
Signature : Signature :
Date : Date :
ABSTRACT
To train the proposed Sign Language Detection System, we compiled a large dataset
of sign language gestures performed by numerous signers. Extensive trials show
that the proposed system is effective and robust in correctly recognizing diverse
sign language gestures across a variety of ambient conditions, including varying
illumination, background clutter, and signer-specific variations.
The proposed Sign Language Detection System shows great potential in improving
accessibility for the Deaf and Hard of Hearing Community by offering real-time
interpretation of sign language motions. Future work will entail expanding the
system to accommodate multi-person sign language communications and
incorporating it into wearable devices for on-the-go accessibility.
LIST OF TABLES
3.1 Libraries 20
LIST OF FIGURES
3.1 Flowchart
3.4 Labeled Dataset
3.5 Grayscale Dataset
3.8 CNN Architecture
3.9 Training Loss and Accuracy
TABLE OF CONTENTS
Page No.
DECLARATION............................................................................................................i
CERTIFICATE..............................................................................................................ii
ACKNOWLEDGEMENT.............................................................................................iii
ABSTRACT....................................................................................................................iv
LIST OF TABLES..........................................................................................................v
LIST OF FIGURES........................................................................................................vi
CHAPTER 1 INTRODUCTION..................................................................................1
1.1 INTRODUCTION..........................................................................................1
1.4 OBJECTIVES.................................................................................................6
CHAPTER 2 LITERATURE........................................................................................7
CHAPTER 4 RESULTS.................................................................................................21
RESULTS.........................................................................................................................21
CHAPTER 5 SYSTEM REQUIREMENTS..................................................................23
REFERENCES......................................................................................................................26
APPENDIX..........................................................................................................................27
CHAPTER-1 INTRODUCTION
1.1 Introduction
In the contemporary era of rapid technological advancements, the quest for innovative
solutions that foster seamless communication for individuals with diverse linguistic abilities
remains a pivotal focal point. Within this context, the development of a Hand Sign Language
to Text and Speech Conversion system using advanced Convolutional Neural Networks (CNN)
represents a significant stride towards inclusivity and accessibility. This groundbreaking
system stands as a testament to the fusion of state-of-the-art image processing, machine
learning methodologies, and intuitive user interfaces, all converging to bridge the gap between
conventional spoken language and the intricate nuances of sign language.
Amidst its multifaceted capabilities, one of the primary objectives of this system is the accurate
detection and interpretation of an extensive range of hand signs, encompassing not only the 26
letters of the English alphabet but also the blank (space) symbol, a crucial
component for seamless textual communication. By harnessing the power of CNN, the system
demonstrates an unprecedented accuracy rate exceeding 99%, enabling the precise translation
of intricate hand gestures into their corresponding textual representations.
The core architecture of the system integrates the robust OpenCV library for intricate image
processing and gesture recognition, coupled with the flexible Keras library, serving as the
backbone for the streamlined implementation of the CNN model.
The comprehensive workflow of the system encompasses real-time video input capturing,
sophisticated image preprocessing, and informed predictions based on the robust CNN model,
reflecting a harmonious blend of cutting-edge technology and user-centric design.
Furthermore, the system is equipped with a highly intuitive Graphical User Interface (GUI)
that showcases the captured video feed and the recognized hand sign, providing users with a
seamless experience to interact with the system effortlessly. Users are presented with an array
of options, including the ability to select suggested words or effortlessly clear the recognized
sentence, fostering an environment of interactive and dynamic communication. Additionally,
the integration of text-to-speech functionality empowers users to not only visualize but also
audibly comprehend the recognized sentence, enhancing the overall accessibility and user
experience.
Through rigorous and extensive testing, the efficacy and precision of the proposed system have
been extensively validated, underscoring its immense potential for real-world applications
across a diverse spectrum of contexts. By facilitating the seamless conversion of intricate hand
gestures into coherent textual and auditory output, this system paves the way for enhanced
communication and inclusivity, catering to the diverse needs of individuals with varying
linguistic abilities and promoting a more connected and accessible society.
1.2 Need of the Project
Creating a project for sign language to text conversion and speech synthesis could
greatly benefit the Deaf and Hard of Hearing community by providing them with a
more accessible means of communication. Here's an outline of how such a project
could be structured:
Research and Data Collection: Study various sign languages, their gestures, and common
vocabulary. Gather a diverse dataset of sign language gestures, possibly including videos or
images, along with their corresponding text translations.
User Interface Design: Develop a user-friendly interface that allows users to input sign
language gestures, either through live video feeds from cameras or pre-recorded
videos/images. Display the recognized text translations in real time, providing immediate
feedback to users. Integrate speech synthesis to audibly output the translated text.
Testing and Evaluation: Conduct extensive testing with individuals proficient in sign
language to ensure the accuracy and reliability of gesture recognition. Gather feedback from
users to improve the usability and accessibility of the system.
Accessibility Considerations: Ensure that the application is compatible with assistive
technologies and adheres to accessibility standards. Provide customization options for font
size, color schemes, and other visual preferences.
Localization and Language Support: Support multiple sign languages and spoken languages
to cater to a global audience.
1.3 Utility in Market
A reliable system for detecting sign language has a large and varied market potential.
Such a system can be used in education, healthcare, customer service, entertainment,
and other fields in addition to meeting the requirements of those with hearing
impairments.
Additionally, the use of sign language recognition technology in the retail and
customer service sectors can help companies better serve clients who have hearing
loss, creating an environment that is more welcoming and inclusive.
Sign language recognition can create new opportunities for accessibility and content
production in the entertainment sector, giving hard of hearing and deaf people access
to a greater variety of multimedia content, such as movies, TV series, and
internet videos.
1.4 Objectives
1. Statistics indicate that more than 80% of individuals with disabilities are unable to
read or write. In response, our system endeavors to narrow the communication
divide between individuals with different abilities, including those who are
hearing-impaired or visually impaired. It achieves this by converting a significant
portion of sign language into text and speech, thereby facilitating effective
communication among diverse groups.
2. Individuals who are deaf or hard of hearing can utilize understandable hand
movements to express their messages.
3. Those without visual impairments can utilize the software to understand sign
language and communicate proficiently with individuals who are deaf or hard of
hearing. Similarly, individuals who are visually impaired can also engage in
communication when the predicted text from sign language is converted into
speech.
4. This project aims to close the existing gap in understanding sign language, making
it more accessible and comprehensible to a wider audience.
CHAPTER-2 LITERATURE SURVEY
In the domain of sign language recognition and translation, Convolutional Neural Networks
(CNNs) have emerged as a prominent technique, particularly for American Sign Language
(ASL) recognition. Researchers like Hsien-I Lin et al. have utilized image segmentation to
extract hand gestures and achieved high accuracy levels, around 95%, using CNN models
trained on specific hand motions. Similarly, Garcia et al. developed a real-time ASL translator
using pre-trained models like GoogLeNet, achieving accurate letter classification.
In Das et al.'s study [1], they developed an SLR system utilizing deep learning techniques,
specifically training an Inception V3 CNN on a dataset comprising static images of ASL
motions. Their dataset consisted of 24 classes representing alphabets from A to Z, except for J.
Achieving an average accuracy rate of over 90%, with the best validation accuracy reaching
98%, their model demonstrated the effectiveness of the Inception V3 architecture for static sign
language detection.
Sahoo et al. [2] focused on identifying Indian Sign Language (ISL) gestures related to numbers
0 to 9. They employed machine learning methods such as Naive Bayes and k-Nearest Neighbor
on a dataset captured using a digital RGB sensor. Their models achieved impressive average
accuracy rates of 98.36% and 97.79%, respectively, with k-Nearest Neighbor slightly
outperforming Naive Bayes.
Ansari et al. [3] investigated ISL static gestures using both 3D depth data and 2D images
captured with Microsoft Kinect. They utilized K-means clustering for classification and
achieved an average accuracy rate of 90.68% for recognizing 16 alphabets, demonstrating the
efficacy of incorporating depth information into the classification process.
Rekha et al. [4] analyzed a dataset containing static and dynamic signs in ISL, employing skin
color segmentation techniques for hand detection. They trained a multiclass Support Vector
Machine (SVM) using features such as edge orientation and texture, achieving a success rate of
86.3%. However, due to its slow processing speed, this method was deemed unsuitable for
real-time gesture detection.
Bhuyan et al. [5] utilized a dataset of ISL gestures and employed a skin color-based
segmentation approach for hand detection. They achieved a recognition rate of over 90% using
the nearest neighbor classification method, showcasing the effectiveness of simple yet robust
techniques.
Pugeault et al. [6] developed a real-time ASL recognition system utilizing a large dataset of 3D
depth photos collected through a Kinect sensor. Their system achieved highly accurate
classification rates by incorporating Gabor filters and multi-class random forests,
demonstrating the effectiveness of integrating advanced feature extraction techniques.
Keskin et al. [7] focused on recognizing ASL numerals using an object identification technique
based on components. With a dataset comprising 30,000 observations categorized into ten
classes, their approach demonstrated strong performance in numeral recognition.
Sundar B et al. [8] presented a vision-based approach for recognizing ASL alphabets using the
MediaPipe framework. Their system achieved an impressive 99% accuracy in recognizing 26
ASL alphabets through hand gesture recognition using LSTM. The proposed approach offers
valuable applications in human-computer interaction (HCI) by converting hand gestures into
text, highlighting its potential for enhancing accessibility and communication.
Jyotishman Bora et al. [9] developed a machine learning approach for recognizing Assamese
Sign Language gestures. They utilized a combination of 2D and 3D images along with the
MediaPipe hand tracking solution, training a feed-forward neural network. Their model
achieved 99% accuracy in recognizing Assamese gestures, demonstrating the effectiveness of
their method and suggesting its applicability to other local Indian languages. The lightweight
nature of the MediaPipe solution allows for implementation on various devices without
compromising speed and accuracy.
In terms of continuous sign language recognition, systems
have been developed to automate training sets and identify compound sign gestures using
noisy text supervision. Statistical models have also been explored to convert speech data into
sign language, with evaluations based on metrics like Word Error Rate (WER), BLEU, and
NIST scores.
Overall, research in sign language recognition and translation spans various techniques and
languages, aiming to improve communication accessibility for individuals with hearing
impairments.
CHAPTER-3 METHODOLOGY AND IMPLEMENTATION
The proposed system aims to develop a robust and efficient Hand Sign Language to Text and
Speech Conversion system using advanced Convolutional Neural Networks (CNN). With a
primary focus on recognizing hand signs, including the 26 alphabets and the blank (space)
character, the system integrates cutting-edge technologies to ensure accurate translation and
interpretation. Leveraging the OpenCV library for streamlined image processing and gesture
recognition, and the Keras library for the implementation of the CNN model, the system
guarantees high precision in sign language interpretation.
The system involves the real-time capture of video input showcasing hand gestures, which are
then pre-processed to enhance the quality of the images. These pre-processed images are then
fed into the trained CNN model, enabling precise predictions and accurate translation of the
gestures into corresponding text characters. The integration of a user-friendly Graphical User
Interface (GUI) provides an intuitive display of the captured video and the recognized hand
sign, empowering users with the option to choose suggested words or clear the recognized
sentence effortlessly.
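As a rough illustration of this loop (a minimal sketch rather than the project's exact code; the model file name, region of interest, input size and class ordering are assumptions), the following Python snippet captures webcam frames with OpenCV, applies the grayscale, blur and thresholding preprocessing described later in this chapter, and queries a trained Keras CNN:

import cv2
import numpy as np
from string import ascii_uppercase
from tensorflow.keras.models import load_model

model = load_model("model.h5")                  # assumed file name of the trained CNN
CLASSES = ["blank"] + list(ascii_uppercase)     # assumed class ordering (blank + A-Z)

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    roi = frame[100:400, 100:400]                         # assumed region of interest for the hand
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)          # grayscale conversion
    blur = cv2.GaussianBlur(gray, (5, 5), 0)              # noise reduction
    _, binary = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    x = cv2.resize(binary, (128, 128)).astype("float32") / 255.0
    pred = model.predict(x.reshape(1, 128, 128, 1), verbose=0)
    label = CLASSES[int(np.argmax(pred))]                 # predicted sign for this frame
    cv2.putText(frame, label, (100, 90), cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
    cv2.imshow("Sign Language to Text", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()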
Furthermore, the system is equipped with text-to-speech functionality, allowing users to listen
to the recognized sentence, thereby enhancing the overall accessibility and usability of the
system. The proposed system is designed with a focus on real-world applications, ensuring its
effectiveness and accuracy through extensive testing and validation. The system's robust
architecture and accurate translation capabilities position it as a promising solution for bridging
communication gaps and facilitating seamless interaction for individuals using sign language.
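The text-to-speech feature can be provided by an offline speech engine; a minimal sketch using pyttsx3 (the library used in the Appendix) is shown below, with the speaking rate and sample sentence as illustrative values.

import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 100)            # speaking rate, matching the Appendix configuration
voices = engine.getProperty("voices")
engine.setProperty("voice", voices[0].id)  # select the first installed voice

def speak(sentence):
    # Read the recognized sentence aloud.
    engine.say(sentence)
    engine.runAndWait()

speak("HELLO HOW ARE YOU")                 # example recognized sentence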
The camera is used to capture the hand gestures in the vision-based method, which avoids
the practical difficulties of the glove-based method. The paper “Hand Talk: a sign language
recognition system based on accelerometer and sEMG data” introduces American Sign
Language conventions: ASL is part of the “Deaf culture” and includes its own system of
puns, inside jokes, and idioms. Just as an English speaker finds it very difficult to follow
spoken Japanese, a user of Swedish Sign Language cannot readily understand ASL, since
different sign languages are not mutually intelligible.
A computer vision system must also decide whether to differentiate objects using colour or
grayscale images and, if colour is used, which colour space to adopt (red, green, blue, or
hue, saturation, luminosity).
Fig. 3.1 – Flowchart
3.1 Data Collection
For the project we tried to find already made datasets but we couldn’t find
dataset in the form of raw images that matched our requirements. All we could
find were the datasets in the form of RGB values. Hence, we decided to create
our own data set. Steps we followed to create our data set are as follows.
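A minimal sketch of the kind of OpenCV capture loop used to build such a dataset is shown below; the output folder, class name, region of interest and key bindings are illustrative assumptions rather than the exact values used in the project.

import os
import cv2

DATA_DIR = "dataset"             # assumed output folder (one sub-folder per sign class)
CLASS_NAME = "A"                 # the sign currently being recorded
x, y, w, h = 100, 100, 300, 300  # assumed capture region

os.makedirs(os.path.join(DATA_DIR, CLASS_NAME), exist_ok=True)
cap = cv2.VideoCapture(0)
count = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)   # show the capture region
    cv2.imshow("Data collection - 's' to save, 'q' to quit", frame)
    key = cv2.waitKey(1) & 0xFF
    if key == ord('s'):                                            # save one cropped sample
        cv2.imwrite(os.path.join(DATA_DIR, CLASS_NAME, f"{count}.jpg"), frame[y:y + h, x:x + w])
        count += 1
    elif key == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()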
Fig. 3.4- Labeled Dataset
The rescale parameter plays a crucial role in normalizing pixel values by dividing them
by 255, a common practice that confines the pixel range to [0, 1]. Another important
parameter, rotation_range, defines the extent of random rotations applied to the images,
with a maximum rotation angle of 15 degrees. The width_shift_range and
height_shift_range parameters dictate the allowable horizontal and vertical shifts,
proportionate to the image dimensions. For instance, setting width_shift_range to 0.1
permits horizontal shifts up to 10% of the image width. shear_range controls shear
transformations, which slant images along the x or y axis. The zoom_range parameter
manages zooming effects, allowing images to be zoomed in (up to 20%) or out (up to
20%). Lastly, horizontal_flip enables horizontal flipping of images, a valuable technique
for enhancing the model’s capacity to recognize features from diverse orientations. We used
OpenCV to convert images to grayscale, apply Gaussian blur, and create binary training
images for CNN model comparison. MediaPipe was also employed to extract hand
landmarks, enhancing the analysis.
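These augmentation settings can be expressed with Keras's ImageDataGenerator. The sketch below uses the parameter values quoted above; the directory layout, image size and the exact shear and height-shift values are assumptions for illustration.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,         # normalize pixel values to [0, 1]
    rotation_range=15,         # random rotations up to 15 degrees
    width_shift_range=0.1,     # horizontal shifts up to 10% of the image width
    height_shift_range=0.1,    # vertical shifts (assumed to mirror the horizontal setting)
    shear_range=0.1,           # shear transformations along the x/y axis (assumed value)
    zoom_range=0.2,            # zoom in or out by up to 20%
    horizontal_flip=True,      # mirror images horizontally
)

# Assumed folder layout: one sub-directory per sign class under dataset/train.
train_generator = train_datagen.flow_from_directory(
    "dataset/train",
    target_size=(128, 128),    # assumed CNN input size
    color_mode="grayscale",
    class_mode="categorical",
    batch_size=32,
)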
Fig. 3.5 – Grayscaled DataSet
For model training, we began by splitting the preprocessed dataset into training,
validation, and testing sets. This division ensured that the model could learn from a
diverse range of examples during training while also allowing for unbiased
evaluation of its performance on unseen data.
The architecture comprised multiple convolutional layers followed by max-pooling
layers to extract and downsample features from the input images. Batch
normalization layers were incorporated to accelerate training and improve model
stability. Dropout regularization was also applied to mitigate overfitting by randomly
deactivating a fraction of neurons during training. The activation function used
throughout the network was the Rectified Linear Unit (ReLU), known for its simplicity
and effectiveness in promoting nonlinear transformations.
After defining the CNN architecture, we compiled the model using the TensorFlow
and Keras frameworks. TensorFlow provides a comprehensive suite of tools for
building and training deep learning models, while Keras offers a high-level API for
quickly prototyping neural networks.
We configured the model with appropriate loss function, optimizer, and evaluation
metrics. Categorical cross-entropy loss was chosen as the loss function for multi-
class classification tasks, while the Adam optimizer was selected for its efficiency
and adaptability. Accuracy was used as the evaluation metric to assess the model's
performance on both training and validation data.
Finally, we initiated the training process using TensorFlow and Keras, feeding the
preprocessed training data into the model and iteratively adjusting its parameters to
minimize the defined loss function. The training progress was monitored using the
validation set to prevent overfitting and ensure generalization to unseen data.
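A minimal Keras sketch of the kind of architecture and training setup described above is given below. The number of filters, layer sizes, input resolution and epoch count are illustrative assumptions; the building blocks (convolution, max pooling, batch normalization, dropout, ReLU, a softmax output, categorical cross-entropy and the Adam optimizer) follow the description in this section.

from tensorflow.keras import layers, models

NUM_CLASSES = 27                 # assumed: 26 letters plus a blank/space class
INPUT_SHAPE = (128, 128, 1)      # assumed grayscale input resolution

model = models.Sequential([
    # Block 1: convolution -> batch normalization -> ReLU -> max pooling
    layers.Conv2D(32, (3, 3), padding="same", input_shape=INPUT_SHAPE),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPooling2D((2, 2)),
    # Block 2
    layers.Conv2D(64, (3, 3), padding="same"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPooling2D((2, 2)),
    # Block 3
    layers.Conv2D(128, (3, 3), padding="same"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPooling2D((2, 2)),
    # Classifier head with dropout regularization
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Categorical cross-entropy loss, Adam optimizer and accuracy metric, as described above.
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# train_generator / val_generator are assumed to come from the augmentation step above:
# history = model.fit(train_generator, validation_data=val_generator, epochs=25)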
3.3.1 Convolutional Neural Network Architecture (CNN)
Convolutional Neural Networks are a type of deep learning model specifically designed
for processing and analysing visual data, such as images and videos. They are inspired by
the visual processing mechanism of the human brain and have proven to be highly
effective in tasks such as image recognition, object detection, and image classification.
CNNs are composed of multiple layers, including convolutional layers, pooling layers,
and fully connected layers. These layers work together to extract relevant features from
the input data and make accurate predictions. The key operations within a CNN include
convolution, pooling, and fully connected layers.
Pooling layers: Pooling layers work to reduce the spatial dimensions of the data. They
do this by down sampling the feature maps, thereby decreasing the computational load
and controlling overfitting. Common types of pooling include max pooling and average
pooling, which retain the most significant features while discarding the less relevant
information.
Fully connected layers: The fully connected layers interpret the features extracted by
the convolutional and pooling layers. They use this information to make predictions
based on the learned representations. These layers provide the final output of the CNN,
enabling the network to classify the input data into different categories based on the
learned features.
One of the primary advantages of CNNs is their ability to automatically learn hierarchical
representations from raw pixel data. Unlike traditional machine learning models, which
require manual feature extraction, CNNs can learn complex patterns and relationships
between pixels on their own. This capability makes them well suited for tasks that involve
complex visual patterns and intricate spatial relationships.
To improve their performance, CNNs use techniques such as backpropagation and gradient
descent during the training process. These techniques allow the network to adjust its
parameters, optimizing its ability to recognize and classify images accurately.
In recent years, CNNs have become a foundational component in various computer vision
applications. Their capability to capture spatial hierarchies and local patterns within images
has significantly contributed to advancements in image understanding and pattern
recognition. Researchers continue to explore ways to enhance CNN architectures, such as by
incorporating residual connections, batch normalization, and attention mechanisms, to
further improve their performance and generalizability across different visual tasks.
Fig. 3.8 – CNN Architecture
The model was compiled using the Adam optimizer once the necessary
hyperparameters were determined, employing categorical cross-entropy as the loss
function. Adam optimizer offers a straightforward implementation, computational
efficiency, and requires less memory. This optimization technique estimates the first
and second moments of the gradient to adaptively adjust the learning rate for each
parameter, contributing to efficient parameter updates.
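For reference, the moment estimates mentioned above follow the standard Adam update rule. With gradient g_t, decay rates \beta_1 and \beta_2, learning rate \eta and a small constant \epsilon:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
\hat{m}_t = m_t / (1 - \beta_1^t), \quad \hat{v}_t = v_t / (1 - \beta_2^t)
\theta_t = \theta_{t-1} - \eta \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)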
Now, let's delve deeper into the analysis of the layers within the model:
Convolutional Layer : The output size of a convolutional layer is determined by
parameters such as kernel size, stride, and padding. It is used to extract features
from the input images, acting as a learned form of data pre-processing.
Output Layer : The output layer comprises neurons activated by the softmax
function. Softmax generates a probability distribution over the classes,
indicating the likelihood of the input image belonging to each class.
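In standard notation, the spatial output width of a convolution with input width W, kernel size K, padding P and stride S, and the softmax activation used in the output layer over C classes, are:

W_{out} = \lfloor (W - K + 2P) / S \rfloor + 1
\mathrm{softmax}(z)_i = e^{z_i} / \sum_{j=1}^{C} e^{z_j}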
The model training and optimization module involve training the Convolutional Neural
Network (CNN) using the pre-processed dataset and optimizing the network's
architecture and parameters to achieve superior performance. This module includes
procedures such as model configuration, hyperparameter tuning, and cross-validation to
enhance the CNN's learning capabilities and generalization to various hand sign gestures.
By conducting comprehensive model training and optimization, the module ensures the
CNN's ability to accurately recognize and classify a wide range of hand sign language
gestures with high precision and reliability.
Fig 3.9- Training loss and accuracy
Python: To develop and run the model, we used Python for its rich libraries
and its efficiency in running machine learning programs.
Deep Learning: We used deep learning to classify the data as per the
project’s requirements.
Google Colab IDE: To run and implement our code, we used the Google Colab
IDE to train and develop our model.
3.5 Libraries Used
CHAPTER 4 RESULTS
4.1 Results
We cleaned the ASL dataset before using 4500 photos per class to train our model. There
were 166K photos in the original collection. An 80% training set and a 20% test set were
created from the dataset. In order to train the model, we used a range of hyperparameters,
including learning rate, batch size, and the number of epochs.
Our test-set evaluation metrics demonstrate the trained model's remarkable performance.
It correctly identified every sample in the test set, achieving an accuracy score of 100%.
The classification report's precision, recall, and F1-score values are all 1.00, showing
that the model correctly identified each class's samples without making any errors.
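A sketch of how such an evaluation can be produced with Keras and scikit-learn is shown below; the model and test_generator names mirror the earlier sketches and are assumptions rather than the project's exact code (the test generator must be created with shuffle=False so that labels align with predictions).

import numpy as np
from sklearn.metrics import classification_report

# Overall accuracy on the held-out test set.
test_loss, test_acc = model.evaluate(test_generator)
print(f"Test accuracy: {test_acc:.4f}")

# Per-class precision, recall and F1-score, as reported in the table below.
probs = model.predict(test_generator)
y_pred = np.argmax(probs, axis=1)
y_true = test_generator.classes
class_names = list(test_generator.class_indices.keys())
print(classification_report(y_true, y_pred, target_names=class_names))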
Class          Precision   Recall   F1-score   Support
R              1.00        1.00     1.00       912
S              1.00        1.00     1.00       861
T              1.00        1.00     1.00       895
U              1.00        1.00     1.00       878
V              1.00        1.00     1.00       901
W              1.00        1.00     1.00       917
X              1.00        1.00     1.00       952
Y              1.00        1.00     1.00       897
Z              1.00        1.00     1.00       904
Accuracy                            1.00       23400
Macro avg      1.00        1.00     1.00       23400
Weighted avg   1.00        1.00     1.00       23400
CHAPTER 5 SYSTEM REQUIREMENTS
Camera
SD Card
HDD 180 GB
RAM 4 GB
CHAPTER 6 CONCLUSION AND FUTURE SCOPE
In summary, our ASL recognition model stands out with an extraordinary accuracy rate of
99.50% in real-time Sign Language Recognition (SLR). This achievement is primarily
attributed to the sophisticated combination of Mediapipe for feature extraction and
Convolutional Neural Networks (CNN) for classification. By leveraging these advanced
techniques, our model offers a robust and precise solution for interpreting ASL hand
gestures.
Central to the success of our model is the meticulous curation and preprocessing of the
dataset. From an initial collection of 13,000 photos, we carefully selected 500
representative images per class, ensuring a balanced and diverse training corpus. This
meticulous approach enabled our model to generalize effectively, recognizing a broad
spectrum of ASL gestures with remarkable accuracy.
Moreover, we envision expanding the scope of our model to encompass a wider range of
sign languages and gestures, fostering inclusivity and accessibility on a global scale. This
expansion will involve extensive research and development efforts, including the
integration of machine learning algorithms capable of comprehending entire sign
language sentences and phrases.
Ultimately, our goal is to realize a comprehensive suite of SLR systems that transcend
linguistic barriers, enabling seamless communication between sign language users and the
broader community. With continued innovation and collaboration, we believe that this
technology has the potential to revolutionize communication and accessibility for
individuals with hearing impairments, facilitating greater integration and participation in
society.
In addition, community engagement efforts that seek feedback and input from the deaf and
hard of hearing community help guarantee that system design is culturally sensitive and
inclusive. Finally, the project's future orientation lies in ongoing research and
collaboration, with the objective of making sign language detection technology more
accessible, intuitive, and powerful for all users.
REFERENCES
1. Das, S. Gawde, K. Suratwala and D. Kalbande. (2018) "Sign language recognition using
deep learning on custom processed static gesture images," in International Conference on
Smart City and Emerging Technology (ICSCET).
7. C. Keskin, F. Kıraç, Y. E. Kara and L. Akarun. (2013) "Real time hand pose estimation
using depth sensors," in Consumer Depth Cameras for Computer Vision, Springer, pp. 119–137.
8. Sundar, B., & Bagyammal, T. (2022). American Sign Language Recognition for Alphabets
Using MediaPipe and LSTM. Procedia Computer Science, 215, 642–651.
https://doi.org/10.1016/j.procs.2022.12.066
9. Bora, J., Dehingia, S., Boruah, A., Chetia, A. A., & Gogoi, D. (2023). Real-time Assamese
Sign Language Recognition using MediaPipe and Deep Learning. Procedia Computer
Science, 218, 1384–1393. https://doi.org/10.1016/j.procs.2023.01.117
APPENDIX
# Imports required by this listing
import math
import cv2
import numpy as np
import pyttsx3
import enchant
import tkinter as tk
from PIL import Image, ImageTk
from string import ascii_uppercase
from keras.models import load_model
from cvzone.HandTrackingModule import HandDetector

offset = 29
hs = enchant.Dict("en-US")        # spell checker used for word suggestions
hd = HandDetector(maxHands=1)     # detector run on the full camera frame
hd2 = HandDetector(maxHands=1)    # detector run on the cropped hand image
class Application:
def __init__(self):
self.vs = cv2.VideoCapture(0)
self.current_image = None
self.model = load_model('model.h5')
self.speak_engine=pyttsx3.init()
self.speak_engine.setProperty("rate",100)
voices=self.speak_engine.getProperty("voices")
self.speak_engine.setProperty("voice",voices[0].id)
self.ct = {}
self.ct['blank'] = 0
self.blank_flag = 0
self.space_flag=False
self.next_flag=True
self.prev_char=""
self.count=-1
self.ten_prev_char=[]
for i in range(10):
self.ten_prev_char.append(" ")
for i in ascii_uppercase:
self.ct[i] = 0
self.root = tk.Tk()
self.root.title("Sign Language To Text Conversion")
self.root.protocol('WM_DELETE_WINDOW', self.destructor)
self.root.geometry("1300x700")
self.panel = tk.Label(self.root)
self.panel.place(x=100, y=3, width=480, height=640)
self.T = tk.Label(self.root)
self.T.place(x=60, y=5)
self.T.config(text="Sign Language To Text Conversion", font=("Courier", 30, "bold"))
self.T1 = tk.Label(self.root)
self.T1.place(x=10, y=580)
self.T1.config(text="Character :", font=("Courier", 30, "bold"))
self.T3 = tk.Label(self.root)
self.T3.place(x=10, y=632)
self.T3.config(text="Sentence :", font=("Courier", 30, "bold"))
self.T4 = tk.Label(self.root)
self.T4.place(x=10, y=700)
self.T4.config(text="Suggestions :", fg="red", font=("Courier", 30, "bold"))
self.b1=tk.Button(self.root)
self.b1.place(x=390,y=700)
self.b2 = tk.Button(self.root)
self.b2.place(x=590, y=700)
self.b3 = tk.Button(self.root)
self.b3.place(x=790, y=700)
self.b4 = tk.Button(self.root)
self.b4.place(x=990, y=700)
self.speak = tk.Button(self.root)
self.speak.place(x=1305, y=630)
self.speak.config(text="Speak", font=("Courier", 20), wraplength=100,
command=self.speak_fun)
self.clear = tk.Button(self.root)
self.clear.place(x=1205, y=630)
self.clear.config(text="Clear", font=("Courier", 20), wraplength=100,
command=self.clear_fun)
self.word1=" "
self.word2 = " "
self.word3 = " "
self.word4 = " "
self.video_loop()
def video_loop(self):
try:
ok, frame = self.vs.read()
cv2image = cv2.flip(frame, 1)
hands = hd.findHands(cv2image, draw=False, flipType=True)
cv2image_copy=np.array(cv2image)
cv2image = cv2.cvtColor(cv2image, cv2.COLOR_BGR2RGB)
self.current_image = Image.fromarray(cv2image)
imgtk = ImageTk.PhotoImage(image=self.current_image)
self.panel.imgtk = imgtk
self.panel.config(image=imgtk)
if hands:
hand = hands[0]                     # first detected hand (a cvzone hand dictionary)
lmList = hand['lmList']             # 21 hand landmark points
bbox = hand['bbox']                 # bounding box as (x, y, w, h)
x, y, w, h = bbox
white = 255 * np.ones((400, 400, 3), dtype=np.uint8)   # blank canvas for the skeleton (assumed 400x400)
# self.pts (landmarks of the cropped hand, produced by a second hd2.findHands pass) is set in code omitted here
os = ((400 - w) // 2) - 15          # offsets used to centre the skeleton on the canvas
os1 = ((400 - h) // 2) - 15
for t in range(0, 4, 1):
cv2.line(white, (self.pts[t][0] + os, self.pts[t][1] + os1), (self.pts[t + 1][0] + os, self.pts[t + 1][1]
+ os1),
(0, 255, 0), 3)
for t in range(5, 8, 1):
cv2.line(white, (self.pts[t][0] + os, self.pts[t][1] + os1), (self.pts[t + 1][0] + os, self.pts[t + 1][1]
+ os1),
(0, 255, 0), 3)
for t in range(9, 12, 1):
cv2.line(white, (self.pts[t][0] + os, self.pts[t][1] + os1), (self.pts[t + 1][0] + os, self.pts[t + 1][1]
+ os1),
(0, 255, 0), 3)
for t in range(13, 16, 1):
cv2.line(white, (self.pts[t][0] + os, self.pts[t][1] + os1), (self.pts[t + 1][0] + os, self.pts[t + 1][1]
+ os1),
(0, 255, 0), 3)
for t in range(17, 20, 1):
cv2.line(white, (self.pts[t][0] + os, self.pts[t][1] + os1), (self.pts[t + 1][0] + os, self.pts[t + 1][1]
+ os1),
(0, 255, 0), 3)
cv2.line(white, (self.pts[5][0] + os, self.pts[5][1] + os1), (self.pts[9][0] + os, self.pts[9][1] +
os1), (0, 255, 0),
3)
cv2.line(white, (self.pts[9][0] + os, self.pts[9][1] + os1), (self.pts[13][0] + os, self.pts[13][1] +
os1), (0, 255, 0),
3)
cv2.line(white, (self.pts[13][0] + os, self.pts[13][1] + os1), (self.pts[17][0] + os, self.pts[17][1]
+ os1),
(0, 255, 0), 3)
cv2.line(white, (self.pts[0][0] + os, self.pts[0][1] + os1), (self.pts[5][0] + os, self.pts[5][1] +
os1), (0, 255, 0),
3)
cv2.line(white, (self.pts[0][0] + os, self.pts[0][1] + os1), (self.pts[17][0] + os, self.pts[17][1] +
os1), (0, 255, 0),
3)
for i in range(21):
cv2.circle(white, (self.pts[i][0] + os, self.pts[i][1] + os1), 2, (0, 0, 255), 1)
res=white
self.predict(res)
self.current_image2 = Image.fromarray(res)
imgtk = ImageTk.PhotoImage(image=self.current_image2)
self.panel2.imgtk = imgtk
self.panel2.config(image=imgtk)
def distance(self,x,y):
return math.sqrt(((x[0] - y[0]) ** 2) + ((x[1] - y[1]) ** 2))
def action1(self):
idx_space = self.str.rfind(" ")
idx_word = self.str.find(self.word, idx_space)
last_idx = len(self.str)
self.str = self.str[:idx_word]
self.str = self.str + self.word1.upper()
def action2(self):
idx_space = self.str.rfind(" ")
idx_word = self.str.find(self.word, idx_space)
last_idx = len(self.str)
self.str=self.str[:idx_word]
self.str=self.str+self.word2.upper()
#self.str[idx_word:last_idx] = self.word2
def action3(self):
idx_space = self.str.rfind(" ")
idx_word = self.str.find(self.word, idx_space)
last_idx = len(self.str)
self.str = self.str[:idx_word]
self.str = self.str + self.word3.upper()
def action4(self):
idx_space = self.str.rfind(" ")
idx_word = self.str.find(self.word, idx_space)
last_idx = len(self.str)
self.str = self.str[:idx_word]
self.str = self.str + self.word4.upper()
def speak_fun(self):
self.speak_engine.say(self.str)
self.speak_engine.runAndWait()
def clear_fun(self):
self.str=" "
self.word1 = " "
self.word2 = " "
self.word3 = " "
self.word4 = " "
pl = [ch1, ch2]
if (self.pts[5][0] < self.pts[4][0]):
ch1 = 0
print("++++++++++++++++++")
# print("00000")
if pl in l:
if self.pts[6][1] > self.pts[8][1] and self.pts[14][1] < self.pts[16][1] and self.pts[18][1] <
self.pts[20][1] and self.pts[0][0] < self.pts[8][
0] and self.pts[0][0] < self.pts[12][0] and self.pts[0][0] < self.pts[16][0] and self.pts[0][0] <
self.pts[20][0]:
ch1 = 3
print("33333c")
self.pts[6][1] > self.pts[8][1] and self.pts[10][1] < self.pts[12][1] and self.pts[14][1] <
self.pts[16][1] and self.pts[18][1] <
self.pts[20][1]):
ch1 = 4
# print("44444")
if (self.pts[6][1] > self.pts[8][1] and self.pts[10][1] < self.pts[12][1] and self.pts[14][1] <
self.pts[16][1] and self.pts[18][1] < self.pts[20][
1]) and self.pts[4][1] > self.pts[10][1]:
ch1 = 5
print("55555b")
l = [[5, 7], [5, 2], [5, 6]]
pl = [ch1, ch2]
if pl in l:
if self.pts[3][0] < self.pts[0][0]:
ch1 = 7
# print("77777")
l = [[7, 2]]
pl = [ch1, ch2]
if pl in l:
if self.pts[18][1] < self.pts[20][1] and self.pts[8][1] < self.pts[10][1]:
ch1 = 6
# print("666662")
[6, 3], [6, 4], [7, 5], [7, 2]]
pl = [ch1, ch2]
if pl in l:
if (self.pts[6][1] > self.pts[8][1] and self.pts[10][1] > self.pts[12][1] and self.pts[14][1] >
self.pts[16][1] and self.pts[18][1] > self.pts[20][
1]):
ch1 = 1
# print("111111")
l = [[6, 1], [6, 0], [4, 2], [4, 1], [4, 6], [4, 4]]
pl = [ch1, ch2]
if pl in l:
if (self.pts[10][1] > self.pts[12][1] and self.pts[14][1] > self.pts[16][1] and
self.pts[18][1] > self.pts[20][1]):
ch1 = 1
# print("111112")
if ((self.pts[6][1] > self.pts[8][1] and self.pts[10][1] < self.pts[12][1] and self.pts[14][1] <
self.pts[16][1] and
self.pts[18][1] < self.pts[20][1]) and (self.pts[2][0] < self.pts[0][0]) and self.pts[4][1] >
self.pts[14][1]):
ch1 = 1
# print("111113")
l = [[3, 4], [3, 0], [3, 1], [3, 5], [3, 6]]
pl = [ch1, ch2]
if pl in l:
if ((self.pts[6][1] > self.pts[8][1] and self.pts[10][1] < self.pts[12][1] and self.pts[14][1] <
self.pts[16][1] and
self.pts[18][1] < self.pts[20][1]) and (self.pts[2][0] < self.pts[0][0]) and self.pts[14][1] <
self.pts[4][1]):
ch1 = 1
# print("1111mmm3")
# con for [i][pqz]
l = [[5, 4], [5, 5], [5, 1], [0, 3], [0, 7], [5, 0], [0, 2], [6, 2], [7, 5], [7, 1], [7, 6], [7, 7]]
pl = [ch1, ch2]
if pl in l:
if ((self.pts[6][1] < self.pts[8][1] and self.pts[10][1] < self.pts[12][1] and self.pts[14][1] <
self.pts[16][1] and
self.pts[18][1] > self.pts[20][1])):
ch1 = 1
if pl in l:
if not (self.pts[0][0] + fg < self.pts[8][0] and self.pts[0][0] + fg < self.pts[12][0] and self.pts[0]
[0] + fg < self.pts[16][0] and
self.pts[0][0] + fg < self.pts[20][0]) and not (
self.pts[0][0] > self.pts[8][0] and self.pts[0][0] > self.pts[12][0] and self.pts[0][0] > self.pts[16]
[0] and self.pts[0][0] > self.pts[20][
0]) and self.distance(self.pts[4], self.pts[11]) < 50:
ch1 = 1
# print("111116")
if self.pts[4][0] > self.pts[6][0] and self.pts[4][0] > self.pts[10][0] and self.pts[4][1] <
self.pts[18][1] and self.pts[4][1] < self.pts[14][1]:
ch1 = 'N'
if ch1 == 2:
if self.distance(self.pts[12], self.pts[4]) > 42:
ch1 = 'C'
else:
ch1 = 'O'
if ch1 == 3:
if (self.distance(self.pts[8], self.pts[12])) > 72:
ch1 = 'G'
else:
ch1 = 'H'
if ch1 == 7:
if self.distance(self.pts[8], self.pts[4]) > 42:
ch1 = 'Y'
else:
ch1 = 'J'
if ch1 == 4:
ch1 = 'L'
if ch1 == 6:
ch1 = 'X'
if ch1 == 5:
if self.pts[4][0] > self.pts[12][0] and self.pts[4][0] > self.pts[16][0] and self.pts[4][0] >
self.pts[20][0]:
if self.pts[8][1] < self.pts[5][1]:
ch1 = 'Z'
else:
ch1 = 'Q'
else:
ch1 = 'P'
if ch1 == 1:
if (self.pts[6][1] > self.pts[8][1] and self.pts[10][1] > self.pts[12][1] and self.pts[14][1] >
self.pts[16][1] and self.pts[18][1] > self.pts[20][
1]):
ch1 = 'B'
if (self.pts[6][1] > self.pts[8][1] and self.pts[10][1] < self.pts[12][1] and self.pts[14][1] <
self.pts[16][1] and self.pts[18][1] < self.pts[20][
1]):
ch1 = 'D'
if (self.pts[6][1] < self.pts[8][1] and self.pts[10][1] > self.pts[12][1] and self.pts[14][1] >
self.pts[16][1] and self.pts[18][1] > self.pts[20][
1]):
ch1 = 'F'
if (self.pts[6][1] < self.pts[8][1] and self.pts[10][1] < self.pts[12][1] and self.pts[14][1] <
self.pts[16][1] and self.pts[18][1] > self.pts[20][
1]):
ch1 = 'I'
if (self.pts[6][1] > self.pts[8][1] and self.pts[10][1] > self.pts[12][1] and self.pts[14][1] >
self.pts[16][1] and self.pts[18][1] < self.pts[20][
1]):
ch1 = 'W'
if (self.pts[6][1] > self.pts[8][1] and self.pts[10][1] > self.pts[12][1] and self.pts[14][1] <
self.pts[16][1] and self.pts[18][1] < self.pts[20][
1]) and self.pts[4][1] < self.pts[9][1]:
ch1 = 'K'
if ((self.distance(self.pts[8], self.pts[12]) - self.distance(self.pts[6], self.pts[10])) < 8) and (
self.pts[6][1] > self.pts[8][1] and self.pts[10][1] > self.pts[12][1] and self.pts[14][1] <
self.pts[16][1] and self.pts[18][1] <
self.pts[20][1]):
ch1 = 'U'
if ((self.distance(self.pts[8], self.pts[12]) - self.distance(self.pts[6], self.pts[10])) >= 8) and (
self.pts[6][1] > self.pts[8][1] and self.pts[10][1] > self.pts[12][1] and self.pts[14][1] <
self.pts[16][1] and self.pts[18][1] <
self.pts[20][1]) and (self.pts[4][1] > self.pts[9][1]):
ch1 = 'V'
if ch1 == 1 or ch1 =='E' or ch1 =='S' or ch1 =='X' or ch1 =='Y' or ch1 =='B':
if (self.pts[6][1] > self.pts[8][1] and self.pts[10][1] < self.pts[12][1] and self.pts[14][1] <
self.pts[16][1] and self.pts[18][1] > self.pts[20][1]):
ch1=" "
if self.ten_prev_char[(self.count - 2) % 10] != "next":
char_to_append = self.ten_prev_char[(self.count - 2) % 10]
# Replace "Backspace" with a space character
if char_to_append == "space":
char_to_append = " "
self.str += char_to_append
else:
char_to_append = self.ten_prev_char[(self.count - 0) % 10]
# Replace "Backspace" with a space character
if char_to_append == "space":
char_to_append = " "
self.str += char_to_append
self.prev_char=ch1
self.current_symbol=ch1
self.count += 1
self.ten_prev_char[self.count%10]=ch1
if len(self.str.strip())!=0:
st=self.str.rfind(" ")
ed=len(self.str)
word=self.str[st+1:ed]
self.word=word
print("----------word = ",word)
if len(word.strip())!=0:
hs.check(word)
lenn = len(hs.suggest(word))
if lenn >= 4:
self.word4 = hs.suggest(word)[3]
if lenn >= 3:
self.word3 = hs.suggest(word)[2]
if lenn >= 2:
self.word2 = hs.suggest(word)[1]
if lenn >= 1:
self.word1 = hs.suggest(word)[0]
else:
self.word1 = " "
self.word2 = " "
self.word3 = " "
self.word4 = " "
def destructor(self):
print("Closing Application...")
# print(self.ten_prev_char)
self.root.destroy()
self.vs.release()
cv2.destroyAllWindows()
print("Starting Application...")
(Application()).root.mainloop()
OUTPUT: