DTI Report
A PROJECT REPORT
BACHELOR OF TECHNOLOGY
in
Computer Science & Engineering
By
MALLIPUDI AASHIK SAGAR (22L31A05C4)
2024-2025
VIGNAN’S INSTITUTE OF INFORMATION TECHNOLOGY
VISAKHAPATNAM
CERTIFICATE
This is to certify that the project report entitled –
SIGN LANGUAGE RECOGNITION (TO TEXT & SPEECH)
is a bona fide record of the work done by MALLIPUDI AASHIK SAGAR (22L31A05C4),
CHINDADA DAVID LIVING STON (23L31A0543), BOLLIPOGU SHYAM (23L31A0531),
BV THRIDHAR (23L31A0537), and BHUKYA PAVAN KUMAR NAIK (23L31A0524).
EXTERNAL EXAMINER
ACKNOWLEDGEMENT
I take this opportunity to express my deep and sincere gratitude to our guide,
Sri Dr. Ch. V V Ramana, Associate Professor, Department of Computer Science & Engineering, for his
guidance, support, and encouragement in the completion of this project work.
I also convey my sincere thanks to our Principal for his kind cooperation in
providing all the facilities needed for the successful completion of the project.
ABSTRACT
Communication plays a crucial role in human interaction, yet for individuals with hearing and
speech impairments, expressing thoughts becomes a major challenge. Sign language is their
primary mode of communication, but understanding it requires specialized knowledge,
creating a communication gap with the broader community. This project aims to bridge that
gap by developing a system that converts American Sign Language (ASL) gestures into
corresponding text and speech. Using a vision-based approach, the system captures hand
gestures through a webcam, processes them with the help of Mediapipe hand landmark detection,
and classifies them using a Convolutional Neural Network (CNN) model. By mapping the
gestures to alphabets and forming words, the system outputs the recognized text and
simultaneously generates its audio using a text-to-speech engine. This application offers a
cost-effective, accessible, and efficient solution that empowers deaf and mute individuals to
communicate seamlessly with the hearing population, enhancing their inclusion and
independence in society. With real-time processing, high accuracy even under varying
lighting and background conditions, and a user-friendly interface, the project represents a
significant step towards inclusive communication technology.
KEY FEATURES
1. Real-time Hand Gesture Detection:
Captures hand gestures instantly using a webcam, enabling real-time recognition and
translation.
2. Mediapipe-based Hand Landmark Extraction:
Uses Google's Mediapipe framework to accurately detect hand landmarks, ensuring
robustness across different backgrounds and lighting conditions (a minimal sketch follows this list).
3. CNN-based Gesture Classification:
Employs a Convolutional Neural Network (CNN) model trained on pre-processed
skeletal hand landmarks to classify American Sign Language (ASL) gestures with
high accuracy.
4. Text Conversion Module:
Recognizes individual gestures, forms words by combining alphabets, and displays
the output as readable text on the screen.
5. Speech Conversion Module:
Integrates a text-to-speech (TTS) engine (pyttsx3 library) to convert recognized text
into clear, audible speech, facilitating two-way communication.
6. Robust Preprocessing Techniques:
Applies image preprocessing techniques such as hand segmentation, grayscale conversion,
adaptive thresholding, and landmark drawing to improve model performance and
reliability.
7. Background and Lighting Adaptability:
Ensures accurate detection and classification even in varying lighting conditions and
complex backgrounds by focusing on hand landmarks rather than raw images.
8. User-Friendly GUI Interface:
Provides an intuitive desktop application built using Python Tkinter for easy
interaction, supporting gesture recognition, text display, and speech output.
9. Error Handling and Suggestions:
Includes features like special gestures for adding space between words and moving to
the next alphabet, ensuring a smooth user experience.
10. High Accuracy:
Achieves gesture recognition accuracy of up to 97% in normal conditions and up to
99% under clean backgrounds and good lighting.
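The following is a minimal sketch, assuming the opencv-python, numpy, and mediapipe packages, of how features 1, 2, and 6 fit together: frames are read from the webcam, hand landmarks are detected, and the skeleton is drawn on a plain white canvas so that the classifier never sees the raw background. The window name and parameter values are illustrative, not taken from the project's actual code.

# Minimal sketch: capture webcam frames, detect hand landmarks with Mediapipe,
# and draw the skeleton on a plain white canvas (parameter values are illustrative).
import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils
hands = mp_hands.Hands(static_image_mode=False, max_num_hands=1,
                       min_detection_confidence=0.7)

cap = cv2.VideoCapture(0)                      # default webcam
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # Mediapipe expects RGB

    # Draw the detected landmarks on a white background so the classifier
    # sees only the hand skeleton, not the raw scene.
    canvas = np.full_like(frame, 255)
    if result.multi_hand_landmarks:
        for hand_landmarks in result.multi_hand_landmarks:
            mp_draw.draw_landmarks(canvas, hand_landmarks,
                                   mp_hands.HAND_CONNECTIONS)

    cv2.imshow("Hand skeleton", canvas)
    if cv2.waitKey(1) & 0xFF == ord('q'):      # press 'q' to quit
        break

cap.release()
cv2.destroyAllWindows()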
BENEFITS
Communication is essential for expressing thoughts, emotions, and ideas. While verbal
communication is the most common form, individuals who are deaf or mute rely heavily on
sign language as their primary means of interaction. However, not everyone in society
understands sign language, leading to significant communication barriers.
The Conversion of Sign Language to Text and Speech project aims to bridge this gap by
developing a system that can automatically recognize hand gestures of American Sign
Language (ASL) and convert them into corresponding text and speech outputs. Using a
vision-based approach integrated with machine learning techniques, the system captures hand
gestures via webcam, processes them through hand landmark extraction, classifies them using
a trained Convolutional Neural Network (CNN), and generates text and audible speech
outputs.
This project not only empowers deaf and mute individuals by giving them a voice but also
enhances inclusivity by allowing non-sign language users to understand and interact with
them effectively.
OBJECTIVE
The primary objectives of the project are:
To design and develop a real-time system capable of detecting and recognizing static
ASL hand gestures accurately.
To translate recognized gestures into corresponding English alphabets, and then into
meaningful text and speech outputs.
To build a system that works reliably across varying backgrounds and lighting
conditions without the need for specialized or costly hardware.
To create a user-friendly desktop application that can be used by individuals with no
technical expertise.
To contribute to social inclusivity by reducing communication barriers for individuals
with hearing and speech impairments.
SCOPE
The scope of the Conversion of Sign Language to Text and Speech project includes:
Developing a complete real-time gesture recognition pipeline starting from hand
detection to gesture classification, text generation, and audio output.
Training the model specifically for the American Sign Language (ASL) alphabet
gestures (A-Z), allowing users to form words by combining letters.
Providing both text display and speech output, ensuring that the system supports users
in different communication scenarios.
Designing a scalable system where additional gestures, full words, dynamic signs, and
multi-language support can be added in the future.
Focusing on affordable, accessible technologies, ensuring that the system requires
only basic hardware like a webcam and standard computer setup.
PROCEDURE
Flow Chart Diagram:
REQUIREMENTS
Hardware Requirements:
Webcam: Integrated or USB external webcam for real-time hand gesture capturing.
Computer/Laptop:
 - Minimum 4 GB RAM.
 - Intel i3 processor or higher recommended.
 - At least 10 GB of free disk space.
Software Requirements:
Operating System: Windows 8 or above.
Programming Environment: Python 3.9 installed.
Libraries/Frameworks: OpenCV, Mediapipe, TensorFlow, Keras, NumPy, pyttsx3,
Tkinter (a quick import check follows this section).
IDE/Code Editor: PyCharm (preferred), Visual Studio Code, or Jupyter Notebook.
Additional Tools:
 - pip package manager for installing dependencies.
 - Audio drivers for enabling speech output.
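As a quick sanity check of the environment listed above, the short snippet below simply imports the required libraries and prints the versions they expose; it is only an illustration and assumes the packages were already installed with pip.

# Minimal sketch: confirm the required libraries are importable and show their versions.
import cv2, mediapipe, tensorflow, numpy, pyttsx3, tkinter

print("OpenCV     :", cv2.__version__)
print("Mediapipe  :", mediapipe.__version__)
print("TensorFlow :", tensorflow.__version__)
print("NumPy      :", numpy.__version__)
print("Tk         :", tkinter.TkVersion)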
DESIGN THINKING MODEL
EMPATHY
In the empathy phase, we focus on understanding the needs, challenges, and experiences of
deaf and mute individuals. Communication barriers often lead to isolation, dependency on
interpreters, and difficulty in expressing basic needs.
By putting ourselves in their situation, we recognize that traditional methods like sign
language are effective among those who understand it, but ineffective in general society.
Many hearing individuals are unfamiliar with sign language, which limits social, educational,
and professional opportunities for the deaf and mute community.
Thus, the need arises for a real-time system that translates sign language into text and speech,
making communication smoother, faster, and accessible to everyone.
DEFINE
Problem Statement:
Deaf and mute individuals face communication challenges because most of the
population is not familiar with sign language, limiting their interactions in daily life.
Key Needs Identified:
A system that can accurately recognize sign language gestures.
A system that translates those gestures into readable text and audible speech.
A low-cost, easily accessible solution that does not require specialized equipment or
training.
Thus, the defined problem is:
"How can we enable deaf and mute individuals to communicate more effectively with
the hearing population in real-time without needing human interpreters?"
IDEATE
Possible ideas brainstormed to solve the problem:
Use wearable devices like smart gloves embedded with sensors.
Develop a mobile app using camera-based gesture recognition.
Create a real-time desktop system using computer vision and AI to detect hand
gestures.
Use landmark-based skeletal analysis to overcome lighting and background issues.
Integrate text-to-speech modules to provide audio output for the translated gestures.
After evaluating feasibility, cost, and ease of use, the best idea chosen was:
Building a real-time webcam-based system using Mediapipe for hand landmark
detection and a CNN model for gesture classification, combined with text-to-speech
generation.
SOLUTION
The proposed solution is a vision-based desktop application that captures hand gestures
using a standard webcam, processes the image to extract hand landmarks, classifies the
gesture using a trained CNN model, and then outputs the corresponding text and speech.
The solution requires only a basic computer and webcam, making it affordable and accessible
to everyone.
Main features include:
Real-time hand gesture recognition.
Robust performance across various lighting and background conditions.
Conversion of recognized gestures into both text and speech outputs.
User-friendly interface for smooth interaction.
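As an illustration of the user-friendly interface mentioned above, the following is a minimal Tkinter sketch: a label that the recognition loop could update with the recognized text and a button that speaks it through pyttsx3. The widget layout and the names output_var and speak are assumptions, not the project's actual GUI code.

# Minimal sketch: a Tkinter window with a text display and a "Speak" button
# wired to pyttsx3 (layout and names are illustrative).
import tkinter as tk
import pyttsx3

engine = pyttsx3.init()

def speak():
    text = output_var.get()
    if text:
        engine.say(text)
        engine.runAndWait()

root = tk.Tk()
root.title("Sign Language to Text & Speech")

output_var = tk.StringVar(value="")            # updated by the recognition loop
tk.Label(root, text="Recognized text:").pack(padx=10, pady=(10, 0))
tk.Label(root, textvariable=output_var, font=("Arial", 18)).pack(padx=10, pady=5)
tk.Button(root, text="Speak", command=speak).pack(pady=(0, 10))

root.mainloop()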
DFD Level – 0 Diagram
PROTOTYPE
Prototype Development Steps:
1. Data Collection:
Captured images of ASL gestures for each alphabet (A-Z) using webcam and
Mediapipe.
2. Preprocessing:
Applied grayscale conversion, Gaussian blur, thresholding, and hand landmark
drawing on a plain background to standardize input images.
3. Model Training:
Developed a CNN model using Keras and TensorFlow to classify gestures into the
respective alphabets (a minimal sketch follows this list).
4. GUI Interface:
Created a basic Tkinter GUI to display recognized alphabets and allow users to
interact easily.
5. Speech Integration:
Integrated the pyttsx3 library to convert recognized text into audible speech output.
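The following is a minimal Keras sketch of the kind of CNN described in step 3. The input size (128x128 grayscale skeleton images), the layer sizes, the number of epochs, the data/ folder layout with one sub-folder per letter, and the saved file name asl_cnn.h5 are all assumptions made for illustration rather than the project's exact configuration.

# Minimal sketch: a small CNN that classifies 26 ASL alphabet gestures from
# pre-processed skeleton images (sizes and file names are assumptions).
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 26                               # A-Z
IMG_SIZE = (128, 128)                          # assumed skeleton image size

model = keras.Sequential([
    layers.Input(shape=(*IMG_SIZE, 1)),
    layers.Rescaling(1.0 / 255),               # scale pixel values to [0, 1]
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Assumes the collected images are stored one folder per letter: data/A, data/B, ...
train_ds = keras.utils.image_dataset_from_directory(
    "data", label_mode="categorical", color_mode="grayscale",
    image_size=IMG_SIZE, batch_size=32)

model.fit(train_ds, epochs=10)
model.save("asl_cnn.h5")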
TESTING
Testing was carried out in multiple phases:
Unit Testing:
Verified that each module (image capture, preprocessing, model prediction, text
display, speech generation) worked independently.
Integration Testing:
Tested the system as a whole by connecting all modules and ensuring proper
communication between them.
User Testing:
Conducted practical tests by asking users to perform random ASL alphabets in
different lighting and background settings.
Observed the system's real-time performance, recognition speed, and accuracy.
Result:
Achieved an overall accuracy of 97% under normal conditions and up to 99% under
clean backgrounds.
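For reference, a held-out test set could be scored as in the short sketch below. The model file name, test folder layout, and image size are assumptions carried over from the training sketch; the accuracy figures reported above come from the project's own testing, not from this snippet.

# Minimal sketch: evaluate the trained model on a held-out test set.
from tensorflow import keras

model = keras.models.load_model("asl_cnn.h5")  # assumed file name

test_ds = keras.utils.image_dataset_from_directory(
    "test_data", label_mode="categorical", color_mode="grayscale",
    image_size=(128, 128), batch_size=32)

loss, accuracy = model.evaluate(test_ds)
print(f"Test accuracy: {accuracy * 100:.1f}%")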
Challenges Found During Testing:
Difficulty in recognition under very poor lighting.
Confusion between very similar-looking gestures when fingers are not clearly separated.
Solutions:
Encouraged use in well-lit environments when possible.
Suggested clear finger positioning for better landmark detection.
IMPLEMENTATION
The final system was implemented as a standalone Python desktop application with the
following workflow:
Webcam captures real-time hand gesture.
Mediapipe detects hand landmarks and extracts key points.
Pre-processed skeleton image is passed to the trained CNN model.
Model predicts the corresponding alphabet and outputs it as text on the GUI.
Recognized text is converted into speech using pyttsx3.
Users can form words by sequentially showing gestures and use special gestures for
space and next operations.
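To make the workflow above concrete, the following is a minimal sketch of one way the pieces could be wired into a single real-time loop. It assumes the CNN was saved as asl_cnn.h5 and trained on A-Z plus two extra classes for the special "space" and "next" gestures, uses a simple frame-count debounce so that one sign adds one character, and speaks the accumulated sentence when the 's' key is pressed; these details are illustrative, not the project's exact implementation.

# Minimal sketch of the real-time loop: capture -> landmarks -> skeleton image
# -> CNN prediction -> text buffer -> speech. The label list, the extra
# "space"/"next" classes, and the model file name are assumptions.
import cv2
import numpy as np
import mediapipe as mp
import pyttsx3
from tensorflow import keras

LABELS = [chr(ord('A') + i) for i in range(26)] + ["space", "next"]  # assumed class order

model = keras.models.load_model("asl_cnn.h5")  # assumed file name
engine = pyttsx3.init()
mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils
hands = mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.7)

sentence = ""
stable_label, stable_count = None, 0

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break

    result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if result.multi_hand_landmarks:
        # Draw the skeleton on a white canvas and resize it for the CNN.
        canvas = np.full_like(frame, 255)
        mp_draw.draw_landmarks(canvas, result.multi_hand_landmarks[0],
                               mp_hands.HAND_CONNECTIONS)
        gray = cv2.cvtColor(cv2.resize(canvas, (128, 128)), cv2.COLOR_BGR2GRAY)
        pred = model.predict(gray.reshape(1, 128, 128, 1), verbose=0)
        label = LABELS[int(np.argmax(pred))]

        # Simple debounce: accept a gesture only after it stays stable for
        # ~15 consecutive frames, so one sign adds one character.
        if label == stable_label:
            stable_count += 1
        else:
            stable_label, stable_count = label, 1
        if stable_count == 15:
            if label == "space":
                sentence += " "
            elif label == "next":
                stable_label = None    # lets the same letter be entered again
            else:
                sentence += label

    cv2.putText(frame, sentence, (10, 40),
                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2)
    cv2.imshow("Sign Language to Text & Speech", frame)

    key = cv2.waitKey(1) & 0xFF
    if key == ord('s'):                # press 's' to speak the sentence so far
        engine.say(sentence)
        engine.runAndWait()
    elif key == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()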
The system provides an effective, user-friendly, and low-cost solution to bridge the
communication gap for deaf and mute individuals, fulfilling the core objective of the project.
IMAGES OF THE PROJECT