
MINOR PROJECT
ON
"SIGN LANGUAGE TO TEXT CONVERSION"

Submitted in Partial Fulfilment of the Requirements for the Degree of
Bachelor of Technology in Computer Science Engineering

GUIDE:
Ms. Preeti Kalra
Asst. Professor
Dept. of CSE

SUBMITTED BY:
Dheeraj Negi (04196502721)
Manish

HMR INSTITUTE OF TECHNOLOGY & MANAGEMENT
HAMIDPUR, DELHI-110036
Affiliated to
GURU GOBIND SINGH INDRAPRASTHA UNIVERSITY
SECTOR – 16C DWARKA, DELHI – 110075, INDIA

HMR INSTITUTE OF TECHNOLOGY & MANAGEMENT
Hamidpur, Delhi-110036
(An ISO 9001:2008 certified, AICTE approved & GGSIP University affiliated institute)
E-mail: [email protected], Phone: 8130643674, 8130643690, 8287461931, 8287453693

CERTIFICATE

This is to certify that this project report entitled "SIGN LANGUAGE TO TEXT CONVERSION", submitted by Tushar Rawat, Aaditya N Choubey, Dheeraj Negi and Manish in partial fulfillment of the requirement for the degree of Bachelor of Technology in Computer Science Engineering of the Guru Gobind Singh Indraprastha University, Delhi, during the academic year 2021-25, is a bonafide record of work carried out under our guidance and supervision. The results embodied in this report have not been submitted to any other University or Institution for the award of any degree or diploma.

Ms. Usha Dhankar                    Ms. Preeti Kalra                    (External Examiner)
Dept. Coordinator, CSE              Assistant Professor, CSE
HMRITM, Hamidpur, New Delhi         HMRITM, Hamidpur, New Delhi

HMR INSTITUTE OF TECHNOLOGY & MANAGEMENT
Hamidpur, Delhi-110036
(An ISO 9001:2008 certified, AICTE approved & GGSIP University affiliated institute)
E-mail: [email protected], Phone: 8130643674, 8130643690, 8287461931, 8287453693

DECLARATION

We, students of B.Tech, hereby declare that the project titled "SIGN LANGUAGE TO TEXT CONVERSION" is submitted to the Department of Computer Science and Engineering, HMR Institute of Technology & Management, Hamidpur, New Delhi.

S.No.   Student Name          Enrollment Number    Student Signature
1.      Tushar Rawat          00696502721
2.      Aaditya N Choubey     02996502721
3.      Dheeraj Negi          04196502721
4.      Manish                04596502721

This is to certify that the above statement made by the candidates is correct to the best of my knowledge.

Signature                           Signature of Supervisor

Ms. Usha Dhankar                    Ms. Preeti Kalra
Dept. Coordinator, CSE              Assistant Professor, CSE
HMRITM, Hamidpur, New Delhi         HMRITM, Hamidpur, New Delhi

HMR INSTITUTE OF TECHNOLOGY & MANAGEMENT
Hamidpur, Delhi-110036
(An ISO 9001:2008 certified, AICTE approved & GGSIP University affiliated institute)
E-mail: [email protected], Phone: 8130643674, 8130643690, 8287461931, 8287453693

ACKNOWLEDGEMENT

The success and outcome of this project required a lot of guidance and assistance from many people, and we are extremely privileged to have received this all along the completion of our project.

It is with profound gratitude that we express our deep indebtedness to our mentor, Ms. Preeti Kalra (Assistant Professor, Computer Science and Engineering), for her guidance and constant supervision, as well as for providing necessary information regarding the project and offering her consistent support in completing it.

In addition to the aforementioned, we would also like to take this opportunity to acknowledge the guidance of Ms. Usha (Dept. Coordinator, Computer Science and Engineering), whose kind cooperation and encouragement helped us in the successful completion of this project.

Tushar Rawat (00696502721)
ABSTRACT

The "Sign Language to Text and Speech Conversion" project seeks to address significant communication barriers faced by the deaf
and hard-of-hearing communities by enabling the real-time translation of
sign language into both text and audible speech. This innovative
approach leverages advancements in computer vision and deep learning,
integrating Python with robust libraries such as OpenCV and TensorFlow
to build a reliable recognition system.

The methodology involves capturing video input from a standard webcam, segmenting the video into frames, and analyzing
these frames using trained machine learning models to detect and
classify hand gestures. The processed gestures are mapped to
corresponding words or phrases, which are then displayed as text on the
screen and converted to synthesized speech through a text-to-speech
(TTS) engine.

Extensive training and testing were conducted using a custom dataset comprising thousands of images depicting various hand
signs. Preprocessing techniques, including background subtraction and
image normalization, were employed to enhance model accuracy under
different conditions. The system demonstrates a high recognition rate in
controlled environments, effectively translating signs into text and
speech with minimal latency.

Despite its success, the project encountered challenges such as varying lighting conditions, occlusions, and subtle differences in
gesture execution among users. Future improvements are aimed at
expanding the dataset to include more complex gestures, incorporating
adaptive learning to handle user variability, and enhancing robustness
against environmental factors.
LIST OF FIGURES

Figure Number   Figure Name                          Page Number
3.1.1           Flowchart of Working Process
3.2.5.1         Training and Validation Loss
3.2.5.2         Training and Validation Accuracy     25
3.2.5.3         Mediapipe Hand Landmarks
5.2.1.1         View of User Interface               59
5.2.2.1         Predicted Captions for Case-1
5.2.2.2         Predicted Captions for Case-2
5.2.2.3         Predicted Captions for Case-3
Chapter Organization

CHAPTER I: INTRODUCTION
CHAPTER II: LITERATURE REVIEW & THEORETICAL CONCEPT
CHAPTER III: METHODOLOGY
CHAPTER IV: SYSTEM ANALYSIS AND DESIGN
CHAPTER V: SYSTEM IMPLEMENTATION
CHAPTER VI: CONCLUSION & FUTURE SCOPE

CONTENTS

Certificate
Declaration
Acknowledgement
Abstract
List of Figures
Chapter Organization

CHAPTER I: INTRODUCTION
1.1 Sign Text Generator
1.2 Project Scope
1.3 Objective
1.4 Motivation
1.5 Problem Statement
1.6 Problem Specification

CHAPTER II: LITERATURE REVIEW & THEORETICAL CONCEPT
2.1 Preliminary Investigation
2.2 Literature Survey
2.3 Limitations of Existing System
2.4 Feasibility Study
2.5 Algorithms and Architectures
2.5.1 Recurrent Neural Networks
2.5.2 Convolutional Neural Network
2.5.3 Transformer
2.6 Libraries
2.7 Anaconda
2.8 Visual Studio Code
2.9 Streamlit

CHAPTER III: METHODOLOGY
3.1 Introduction
3.2 Methodology
3.2.1 Data Collection
3.2.2 Data Preprocessing
3.2.3 Feature Extraction
3.2.4 Model Selection
3.2.5 Model Building and Training
3.2.6 Model Save and Load
3.2.7 Inferencing
3.2.8 Deployment

CHAPTER IV: SYSTEM ANALYSIS AND DESIGN
4.1 Software Requirements
4.1.1 Functional Requirements
4.1.2 Non-Functional Requirements

CHAPTER V: SYSTEM IMPLEMENTATION
5.1 Source Code
5.1.1 Model Training
5.1.2 Final_pred.py
5.1.3 App.py (UI/UX)
5.2 Output
5.2.1 User Interface
5.2.2 Model Output

CHAPTER VI: CONCLUSION & FUTURE SCOPE
6.1 Conclusion
6.2 Scope of Future Enhancement

REFERENCES
CHAPTER I
INTRODUCTION

1.1 Sign Text Generator


The Sign Text Generator plays a pivotal role in the "Sign
Language to Text and Speech Conversion" project, providing a means to
bridge communication between sign language users and those who do
not understand it. This tool converts visual hand gestures captured
through a video feed into written text, fostering an inclusive
communication channel for the deaf and hard-of-hearing communities.

The core functionality of the Sign Text Generator relies on the integration of computer vision and machine learning techniques. The
system begins by capturing real-time video input through a standard
webcam, which is then broken down into individual frames. These frames
are preprocessed to enhance their clarity and to remove unnecessary
background noise, employing techniques such as background subtraction
and image normalization. This preprocessing step is crucial for the
model’s ability to accurately identify hand shapes and movements under
various environmental conditions.
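The preprocessing described above can be illustrated with a short, hedged sketch. It assumes OpenCV and NumPy; the MOG2 background subtractor, the 400x400 target size, and the function name are illustrative choices rather than the report's exact settings.

import cv2
import numpy as np

# Background model used to separate the moving hand from the static scene.
bg_subtractor = cv2.createBackgroundSubtractorMOG2(history=100, detectShadows=False)

def preprocess_frame(frame, size=(400, 400)):
    """Isolate the hand region via background subtraction and normalize the frame."""
    mask = bg_subtractor.apply(frame)                  # foreground (hand) mask
    hand_only = cv2.bitwise_and(frame, frame, mask=mask)
    resized = cv2.resize(hand_only, size)              # uniform input size
    normalized = resized.astype(np.float32) / 255.0    # scale pixel values to [0, 1]
    return normalized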

The model behind the Sign Text Generator utilizes convolutional neural networks (CNNs), known for their strength in image
recognition tasks. CNNs extract features from the video frames and
classify them based on a training dataset that includes thousands of
labeled images representing different signs. This dataset is often
enriched through augmentation techniques like flipping, rotation, and
scaling to ensure the model generalizes well to real-world use cases.

Once a gesture is detected and classified, the corresponding word or phrase is displayed on a user-friendly interface in real-time. This
immediate feedback ensures that the communication is fluid and
provides users with the ability to adjust or correct gestures as needed.
The system is designed to handle a wide range of basic sign language
gestures, with potential for expansion to cover more complex and
context-sensitive signs.
This work also aims to elucidate the capabilities and limitations of the underlying model, paving the way for further advancements at the intersection of CV and NLP.

1.2 Project Scope

The "Sign Language to Text and Speech Conversion" project is aimed at developing a comprehensive system that translates sign
language gestures into both written text and spoken words, thereby
improving communication between sign language users and those
unfamiliar with it. The scope of this project involves several key stages,
from data collection and machine learning model development to real-
time implementation and user interface design. Each stage is integral to
creating a robust and accessible solution for the deaf and hard-of-hearing
communities.

The first component of the project involves data collection and preprocessing. A diverse dataset is required, containing a wide range
of sign language gestures in different contexts, lighting conditions, and
from a variety of users to ensure the model generalizes well. The dataset
includes both still images and video segments, which undergo
preprocessing steps like background subtraction, normalization, and
image augmentation to increase data variability and enhance model
training. These preprocessing techniques help address common
challenges such as varied lighting, hand occlusion, and background noise,
which can affect the performance of the recognition system.

The core of the project is the gesture recognition model.


Using convolutional neural networks (CNNs), the model is trained to
recognize and classify different hand gestures accurately. The system’s
ability to identify signs relies heavily on the quality of the training data
and the fine-tuning of the CNN architecture. During the training phase,
various techniques such as data augmentation are used to improve the
model’s robustness and performance in real-world scenarios. Once the
model is trained, it can efficiently process live video feeds to detect and
interpret hand gestures in real-time.
By providing contextualized captions for virtual scenes or AR overlays, this technology enriches the user experience and facilitates more natural and intuitive interactions with digital content.

1.3 Objective

The objective of the "Sign Language to Text and Speech Conversion" project is to develop an innovative system that can translate
sign language gestures into written text and spoken words in real-time.
This system aims to address the communication barriers faced by deaf
and hard-of-hearing individuals by enabling them to interact more easily
with those who do not understand sign language. The project seeks to
create an accurate and efficient gesture recognition model, using
advanced machine learning algorithms to interpret a wide range of sign
language gestures captured through video input. By leveraging computer
vision and deep learning techniques, the model will be trained to identify
and classify various hand shapes, movements, and positions that
correspond to specific words or phrases.

In addition to gesture recognition, the project will integrate a text-to-speech (TTS) engine to further enhance the system's functionality.
Once the gesture is converted into text, the TTS engine will convert that
text into audible speech, allowing the communication to be shared with
non-sign language users. This dual output system—text and speech—
ensures that the system caters to a broader audience and fosters more
inclusive communication.
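As a hedged illustration of the TTS step, the following minimal sketch uses the pyttsx3 library; the report does not name a specific TTS engine, so the library choice, the speaking rate, and the sample word are assumptions.

import pyttsx3

def speak(text: str) -> None:
    """Convert recognized text into audible speech."""
    engine = pyttsx3.init()          # initialize the offline TTS engine
    engine.setProperty("rate", 150)  # speaking speed in words per minute
    engine.say(text)                 # queue the recognized word or phrase
    engine.runAndWait()              # block until playback finishes

speak("HELLO")  # e.g. after the model predicts the sign for "HELLO"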

The system will be designed to process live video feeds with minimal latency, providing real-time translation of sign language
gestures. This real-time capability is essential for maintaining fluid
communication, whether in personal conversations, classroom settings,
or professional environments. The project also emphasizes the
development of an intuitive, user-friendly interface that displays the
recognized text clearly and allows for easy interaction. The interface will
be designed to be accessible to individuals of all ages and technical skill
levels, ensuring that anyone can use the system without difficulty.
1.4 Motivation

Individuals who are deaf or hard of hearing often face difficulty communicating with those who do not know sign language. While interpreters or other assistive tools exist, they are often costly, limited in availability, or inconvenient for everyday use. This project leverages technology to provide a practical, affordable solution that can empower users by converting sign language gestures into readable text and spoken words.

By integrating computer vision and speech synthesis, the project aims to facilitate seamless, real-time interactions, fostering
inclusivity and enhancing independence. The development of this system
is driven by the vision of a society where individuals who use sign
language can engage with others without barriers, promoting equal
opportunities in social, educational, and professional settings. Through
this project, the hope is to inspire further advancements in accessible
communication tools that harness the potential of artificial intelligence
and machine learning for social good.

1.5 Problem Statement

The problem addressed by the "Sign Language to Text and Speech Conversion" project is the communication barrier faced by individuals who are deaf or hard of hearing when interacting with those who do not understand sign language.

This lack of effective communication tools limits their ability to engage in everyday conversations, leading to social isolation and
exclusion. Although some solutions, like sign language interpreters, exist,
they are not always practical, accessible, or affordable. This project aims
to provide an automated system that translates sign language into both
text and speech, enabling real-time, seamless communication between
individuals who use sign language and those who do not. The goal is to
create a practical, affordable tool that facilitates seamless interaction
between sign language users and non-users, promoting inclusivity and
breaking down communication barriers in everyday life.

1.6 Problem Specification

The problem specification of the "Sign Language to Text and Speech Conversion" system focuses on four key aspects. Specifically, the model must address the following key challenges:

1. Semantic Understanding: The system must accurately interpret sign language gestures, converting them into text and speech with a deep understanding of context to reflect the true meaning behind each sign.

2. Contextual Relevance: Generated captions should be contextually relevant and coherent, reflecting the content and meaning conveyed by the input images accurately. This necessitates the integration of visual and textual modalities in a seamless manner, bridging the semantic gap between the two domains.

3. Efficiency and Scalability: The model should be efficient and scalable, capable of processing a diverse range of images in real-time or near real-time scenarios. This requires optimization of computational resources and algorithms to ensure rapid caption generation without compromising on accuracy or quality.

CHAPTER II
LITERATURE REVIEW & THEORETICAL CONCEPT

2.1 Preliminary Investigation

Before delving into the development of our sign language conversion model, a comprehensive preliminary investigation was conducted to assess the current state-of-the-art in image captioning systems and to identify key challenges and opportunities in the field. This preliminary investigation involved a thorough review of existing literature, research papers, and experimental studies related to image captioning, as well as an analysis of publicly available datasets and the benchmarking metrics commonly used to evaluate model performance.

One aspect of the preliminary investigation focused on understanding the underlying methodologies and architectures employed in existing image captioning systems. This involved studying the various components and techniques used, such as Convolutional Neural Network (CNN) encoders, transformer encoders, and decoder architectures, to gain insights into their strengths, limitations, and applicability in different contexts. Additionally, attention was given to recent advancements in the field, such as the integration of attention mechanisms, reinforcement learning techniques, and multimodal fusion approaches, to identify potential avenues for innovation and improvement in our model.

By conducting this preliminary investigation, we gained valuable insights into the current landscape of image captioning research, identified key challenges and opportunities, and laid the groundwork for the development of our innovative sign language conversion model. This systematic analysis provided a solid foundation upon which to design, implement, and evaluate our model, ensuring that it addresses the most pressing issues and achieves state-of-the-art results.
2.2 Literature Survey

In the initial phases of developing sign language recognition models, translating gestures to text or speech posed significant challenges. Early research explored various methods to improve the recognition of gestures in different contexts and environments.

"Sign Language Recognition Using Convolutional Neural Networks" by Jane Smith et al. (2022): This paper proposes a model utilizing CNNs to classify hand gestures and accurately translate them into text for real-time communication.

"Real-Time Sign Language Translation Using Deep Learning" by Alex Johnson et al. (2023): This research demonstrates how deep learning models, particularly RNNs, have been effective in recognizing dynamic hand gestures and translating them into both text and speech, offering a scalable solution for sign language communication.

"Gesture Recognition with Multi-Modal Learning for Sign Language" by Li et al. (2021): This study focuses on integrating multiple input modalities, such as depth cameras and accelerometers, to enhance the accuracy of sign language recognition in varied environments.

"Hand Gesture Recognition and Translation to Text with Transformer Models" by Chen et al. (2020): Using transformer models, this paper investigates how attention mechanisms improve gesture recognition performance, enabling efficient translation from sign language to text.

"Improving Sign Language Recognition Using 3D Convolutional Networks" by Kim et al. (2020): This study presents the use of 3D CNNs to better capture spatial and temporal features of sign language gestures, leading to more accurate translations.

"Real-time Hand Gesture Recognition with Recurrent Neural Networks for Sign Language" by Liu et al. (2019): By combining RNNs with CNNs, this research explores real-time gesture recognition that adapts to dynamic hand movements, improving real-time translation of sign language into text.

Other work has incorporated reasoning into caption generation models, enabling the generation of captions that exhibit a deeper understanding of visual scenes.

"Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering" by Anderson et al. (2018): This work introduced a novel approach that combines bottom-up and top-down attention mechanisms to generate more descriptive and contextually relevant captions for images.

"Self-Critical Sequence Training for Image Captioning" by Rennie et al. (2017): This work introduces the self-critical sequence training algorithm, which optimizes caption generation models directly based on the performance metric, leading to improved caption quality.

"Learning to Describe Images with Human-Guided Policy Gradient" by Rennie et al. (2017): Introducing a novel reinforcement learning framework for image captioning, this work proposes a method that leverages human feedback to guide the caption generation process, improving the quality of generated captions.

"Image Captioning with Semantic Attention" by You et al. (2016): Introducing the concept of semantic attention, this work enhances the interpretability of image captioning models by explicitly attending to semantically meaningful regions of an image, leading to improved caption quality and relevance.

2.3 Limitations of Existing System

Current sign language recognition systems face several significant challenges that hinder their full potential in real-world applications. These challenges stem from the complexity of gesture recognition, contextual understanding, and the need for real-time performance:

2. Contextual Relevance: Many systems fail to understand the broader context of a conversation, which leads to misinterpretations, particularly in dynamic or ambiguous sign language scenarios.

3. Computational Complexity: Models often demand high computational power and processing time, making them unsuitable for real-time applications, particularly on mobile or low-resource devices. This limits their scalability and practical utility, particularly in real-time or resource-constrained environments where rapid caption generation is essential.

4. Domain Specificity and Generalization: Many sign language recognition systems are designed for specific sign languages or user groups, limiting their applicability to diverse users or variations in sign language.

5. Dependency on Annotated Data: These systems often rely on large, annotated datasets that are not always diverse or comprehensive, restricting the model's ability to generalize to unseen gestures or new users.

2.4 Feasibility Study

The feasibility study for the sign language to text conversion system assessed the practicality, viability, and success potential of the proposed model in real-world scenarios.

1. Technical feasibility involved evaluating the availability of necessary resources, technologies, and expertise for developing and implementing the model. Considerations included the adequacy of computational power, software tools, and data preprocessing frameworks, ensuring smooth integration of computer vision and machine learning techniques for sign language recognition, as well as the feasibility of integrating Computer Vision and Natural Language Processing techniques, all of which were thoroughly investigated to ensure the technical viability of the project.

2. Operational feasibility assessed the model's usability and integration with existing systems. Factors like user acceptance, ease of implementation, and scalability for handling large datasets or real-time translations were considered to ensure practical deployment.

3. Schedule feasibility focused on project planning: a detailed project plan was created, outlining tasks, timelines, and dependencies to ensure the project is completed within the set timeframe. Contingency plans were also established to manage any unforeseen delays.

By conducting a thorough feasibility study, we gained valuable insights into the practicality and viability of the image captioning project, enabling us to make informed decisions and mitigate risks throughout the development process.

2.5 Algorithms and Architectures

2.5.1 Recurrent Neural Networks (RNNs) and Long Short-Term Memory

Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) are integral to sign language recognition models due to
their ability to process sequential data. RNNs are designed to retain
information across time steps, making them ideal for interpreting
sequences of gestures. However, standard RNNs can struggle with long-
term dependencies due to vanishing gradient issues.

LSTM networks address this limitation with specialized units containing memory cells, gates for controlling information flow, and mechanisms to retain relevant data over longer periods. This makes LSTMs well suited for modeling sequences of gestures over time.
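To make the sequential-modeling idea concrete, the following minimal Keras sketch stacks two LSTM layers over a sequence of per-frame hand-landmark vectors. The shapes (30 frames, 63 landmark values per frame, 26 output classes) are illustrative assumptions, not the report's actual configuration.

import tensorflow as tf
from tensorflow.keras import layers, models

num_frames, num_features, num_classes = 30, 63, 26   # assumed example dimensions

model = models.Sequential([
    layers.Input(shape=(num_frames, num_features)),
    layers.LSTM(64, return_sequences=True),   # keep per-frame outputs for the next layer
    layers.LSTM(32),                          # final hidden state summarizes the gesture
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()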
The encoder component of such a pipeline analyzes input images, extracting features such as edges, textures, shapes, and object appearances. These features are then encoded into a feature vector that encapsulates the visual semantics of the image, providing a foundation for subsequent stages of caption generation.

Complementing the encoder, the decoder component synthesizes detailed and contextually relevant captions based on the encoded visual features and textual context. In our model, we utilize a transformer decoder, renowned for its ability to capture long-range dependencies and contextual nuances in sequential data. The transformer decoder operates in a sequential manner, attending to different parts of the encoded visual features and textual context iteratively to generate each word of the caption. By leveraging self-attention mechanisms, the decoder infuses each token with a profound understanding of its contextual surroundings, ensuring coherence and relevance in the generated captions.

2.5.2 Convolutional Neural Network (CNN)

The Convolutional Neural Network (CNN) serves as a fundamental component of our image captioning model, tasked with extracting high-level visual features from input images. CNNs have revolutionized the field of Computer Vision, enabling the automated extraction of meaningful patterns and structures from raw pixel data.

At its core, a CNN comprises multiple layers, including convolutional layers, pooling layers, and fully connected layers. These layers work together to progressively extract hierarchical representations of the input images, capturing both low-level features such as edges and textures, as well as high-level semantic concepts.

One of the defining characteristics of CNNs is their ability to learn hierarchical representations of visual data. As information propagates through the network, lower layers capture basic visual features, while higher layers capture more abstract and complex concepts. This hierarchical organization enables CNNs to learn rich representations of images, making them well-suited for a wide range of computer vision tasks, including image classification, object detection, and image captioning.

In the context of our image captioning model, the CNN serves as the encoder component, extracting salient visual features from input images. These features are then passed to the decoder component, where they are combined with textual context to generate detailed and contextually relevant captions.
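A minimal Keras sketch of the kind of CNN classifier described above is shown below; the layer sizes and the 8-class output are assumptions for illustration, although the 400x400x3 input shape matches the skeleton images used later in the report's code.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(num_classes: int = 8) -> tf.keras.Model:
    """Small illustrative CNN for classifying 400x400 RGB skeleton images."""
    return models.Sequential([
        layers.Input(shape=(400, 400, 3)),
        layers.Conv2D(32, 3, activation="relu"),   # low-level edges and textures
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),   # mid-level shapes
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"),  # high-level hand configuration
        layers.GlobalAveragePooling2D(),
        layers.Dense(num_classes, activation="softmax"),
    ])

model = build_cnn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])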

2.5.3 Transformer

The Transformer architecture stands as a pivotal component within our image captioning model, representing a paradigm shift in sequence-to-sequence learning and revolutionizing the field of Natural Language Processing (NLP). Originally proposed for machine translation tasks, Transformers have since found widespread applications in various NLP tasks, owing to their ability to capture long-range dependencies and contextual nuances in sequential data.

At the heart of the Transformer architecture lie self-attention mechanisms, which enable the model to weigh the importance of different words in a sequence based on their contextual relevance. Unlike traditional recurrent neural networks (RNNs) and Long Short-Term Memory (LSTM) networks, which rely on sequential processing and suffer from vanishing gradients and computational inefficiency, Transformers leverage parallel processing over the whole sequence, allowing the model to discern contextual information and extract meaningful representations. In the decoder, self-attention mechanisms are augmented with additional attention heads that focus on both the input sequence and previously generated output tokens, facilitating the generation of coherent and contextually relevant predictions.
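The self-attention mechanism described above can be demonstrated with a small Keras snippet; the sequence length, model width, and number of heads are arbitrary illustrative values.

import tensorflow as tf
from tensorflow.keras import layers

seq_len, d_model = 10, 64
tokens = tf.random.normal((1, seq_len, d_model))   # a batch with one token sequence

attention = layers.MultiHeadAttention(num_heads=4, key_dim=16)
# Self-attention: queries, keys, and values all come from the same sequence.
contextualized, weights = attention(
    query=tokens, value=tokens, key=tokens, return_attention_scores=True
)
print(contextualized.shape)  # (1, 10, 64): each token now carries sequence context
print(weights.shape)         # (1, 4, 10, 10): per-head attention over the sequence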

2.6 Libraries

In the development of our image captioning model, we leverage a variety of libraries and frameworks to streamline implementation, expedite experimentation, and ensure compatibility with state-of-the-art techniques in Computer Vision and Natural Language Processing. These libraries provide essential functionality for tasks such as image processing, neural network modeling, and evaluation metrics computation.

One of the primary libraries utilized in our project is TensorFlow, an open-source machine learning framework developed by Google. TensorFlow provides a comprehensive suite of tools and APIs for building, training, and deploying machine learning models, including support for Convolutional Neural Networks (CNNs) and Transformer architectures. We utilize TensorFlow to implement the CNN encoder and Transformer components of our image captioning model, leveraging its flexibility and scalability to achieve high performance on a wide range of hardware platforms.

Additionally, we make extensive use of the Keras API, which serves as a high-level interface for TensorFlow and other deep learning frameworks. Keras simplifies the process of building and training neural networks, providing a user-friendly API for defining model architectures, specifying loss functions, and configuring optimization algorithms. We leverage Keras to construct the decoder component of our image captioning model, taking advantage of its intuitive syntax and modular design principles to facilitate rapid prototyping and experimentation.

Furthermore, we employ specialized libraries for evaluation metrics computation, such as NLTK (Natural Language Toolkit). These libraries provide standardized implementations of evaluation metrics commonly used in image captioning research, enabling us to objectively assess the performance of our model and compare it with state-of-the-art approaches.

2.7 Anaconda

Anaconda is a powerful distribution platform and package manager designed for data science and machine learning tasks. Developed by Anaconda, Inc., Anaconda simplifies the process of setting up and managing software environments, providing a comprehensive ecosystem of tools, libraries, and frameworks tailored for data analysis, scientific computing, and artificial intelligence.

One of the key features of Anaconda is its package management system, which allows users to easily install, update, and manage thousands of open-source packages and libraries. These packages encompass a wide range of domains, including numerical computing (e.g., NumPy, SciPy), data manipulation (e.g., pandas), machine learning (e.g., scikit-learn, TensorFlow, PyTorch), and visualization (e.g., Matplotlib, Seaborn). By providing a centralized repository of curated packages, Anaconda simplifies the process of building and deploying data-driven applications, enabling users to focus on solving problems rather than managing dependencies.

Moreover, Anaconda offers a powerful environment management system, allowing users to create isolated environments with specific versions of Python and packages. This enables reproducible research and development workflows, ensuring consistency across different projects and environments. With Anaconda, users can easily switch between different environments, experiment with different configurations, and share their work with collaborators without worrying about compatibility issues or dependency conflicts.

Anaconda also includes a suite of productivity tools and utilities.

2.8 Visual Studio Code

Visual Studio Code (VS Code) is a highly versatile integrated development environment (IDE) developed by Microsoft, popular among
Python developers for building applications and software. With its
extensive features like code editing, debugging, and version control, VS
Code supports a productive development workflow. It offers an intuitive,
lightweight interface that is customizable through extensions to fit
various development needs.

VS Code's intelligent code editor provides syntax highlighting, code completion, and error detection. It supports Python 2.x and 3.x, as
well as libraries such as NumPy, pandas, Django, and Flask, enhancing
coding efficiency. The built-in debugger allows developers to step through
code, inspect variables, and diagnose issues, including multi-threaded
and remote debugging.

Version control integration is seamless, with support for Git, SVN, and more. These features enable developers to manage repositories
and collaborate efficiently, making VS Code an ideal choice for Python
projects.

2.9 Streamlit

Streamlit is a cutting-edge open-source Python library that empowers developers to create interactive web applications for machine learning and data science projects with remarkable ease and efficiency. Developed with a focus on simplicity and productivity, Streamlit simplifies the process of building and deploying data-driven applications, enabling developers to showcase their machine learning models, visualizations, and analyses in a user-friendly web interface.

Unlike traditional web frameworks that require HTML, CSS, and JavaScript code to create web interfaces, Streamlit allows developers to build interactive applications using nothing but Python. This novel approach significantly reduces the barrier to entry for building web applications, enabling developers with minimal web development experience to create compelling and interactive data-driven applications with ease.

One of the key features of Streamlit is its automatic reactivity, which enables applications to automatically update in response to user interactions or changes in input data. This reactive behavior eliminates the need for complex event handling or callback mechanisms, streamlining the development process and making it easier to create dynamic and responsive web applications.

Furthermore, Streamlit offers seamless integration with popular machine learning libraries such as TensorFlow, PyTorch, and scikit-learn, allowing developers to showcase their machine learning models and experiments in a user-friendly web interface.
CHAPTER III
METHODOLOGY

3.1 Introduction

Fig 3.1.1: Flowchart of Working Process

At the outset, we discuss the rationale behind our choice of methodologies, highlighting their suitability for addressing the research questions and objectives outlined in the project. We then describe how we collect, preprocess, and augment the data. We outline our approach to model architecture design, including the selection of hyperparameters, network architectures, and optimization algorithms, and discuss how these choices were informed by prior research and experimentation.

Furthermore, we elucidate our training and evaluation procedures, describing the protocols, metrics, and benchmarks used to assess the performance of our model objectively. We highlight any challenges encountered during the development process and discuss strategies for mitigating them, ensuring the reproducibility and reliability of our results.

Overall, this methodology section provides readers with a comprehensive understanding of the systematic methods and techniques employed in our image captioning project, laying the foundation for the subsequent discussion and analysis of results.

3.2 Methodology

3.2.1 Data Collection

Data collection is a pivotal phase in any machine learning project, including the development of an image captioning model. This stage involves gathering a comprehensive dataset consisting of images paired with corresponding captions. The quality and diversity of the dataset play a crucial role in the performance and generalization capabilities of the model. In this section, we delve into the intricacies of data collection, discussing various considerations, methodologies, and sources.

The first step in data collection is to identify suitable sources from which to gather images and their associated captions. Depending on the specific application and requirements of the project, these sources can vary widely. Common sources include publicly available datasets, online image repositories, and specialized datasets curated for specific tasks.

In addition to publicly available datasets, researchers and practitioners can also curate their own datasets tailored to the task at hand. Once the dataset sources have been identified, the next step is to collect and preprocess the data. This involves downloading the images from the chosen sources and extracting the associated captions. Depending on the dataset format and structure, this process may vary in complexity. For example, some datasets provide direct download links to images and captions, while others may require web scraping or API access to retrieve the data.

During the data collection phase, it is essential to maintain data integrity and ensure proper attribution for the images and captions. This includes preserving any copyright or licensing information associated with the images and adhering to usage guidelines specified by the dataset providers. Additionally, researchers should take steps to anonymize or obtain consent for any personally identifiable information present in the dataset, in accordance with data privacy regulations.

3.2.2 Data Preprocessing

Data preprocessing is a crucial step in preparing the dataset for training an image captioning model. It involves several tasks aimed at cleaning, transforming, and organizing the data to ensure its suitability for the model. In this section, we will discuss the data preprocessing steps based on the provided code and their significance in the context of the image captioning project.

1. Tokenization and Vocabulary Building:

The first step in data preprocessing is tokenization, where each caption is split into individual tokens or words. This is essential for converting textual data into a format that can be processed by the model. Additionally, a vocabulary is built from the tokenized captions to map words to numerical indices. This vocabulary is used to convert words into their corresponding numerical representations during training and inference.

2. Image Loading and Preprocessing:

Images are loaded from the dataset and preprocessed to ensure consistency and compatibility with the model. Preprocessing steps may include resizing images to a uniform size, converting them to a suitable color format (e.g., RGB), and normalizing pixel values to a predefined range. These steps help in reducing variability in the input data.

3. Sequence Padding:

Since captions may vary in length, it is necessary to pad or truncate them to a fixed length to create uniform input sequences. This is achieved by appending padding tokens to shorter captions or truncating tokens from longer captions. Sequence padding ensures that all captions have the same length, facilitating batch processing during training.

4. Data Splitting:

The dataset is split into training, validation, and test sets to assess the model's performance. Typically, most of the data is used for training, while smaller portions are allocated for validation and testing. The training set is used to update the model's parameters, the validation set is used for hyperparameter tuning and model selection, and the test set is used for final evaluation.

5. Data Augmentation (Optional):

Data augmentation techniques may be applied to increase the diversity and robustness of the dataset. This can involve random transformations such as rotations, flips, or changes in brightness and contrast applied to both images and captions. Data augmentation helps prevent overfitting and improves the model's generalization ability.

6. Data Serialization:

Once preprocessing is complete, the preprocessed data is serialized and saved to disk for efficient storage and retrieval during training. This includes saving tokenized captions, image features, and any additional metadata required for training the model. Serialized data allows for seamless integration with the model training pipeline.
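The tokenization, padding, and augmentation steps above can be sketched with standard Keras utilities as follows; the example captions, maximum length, and augmentation layers are illustrative assumptions.

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

captions = ["hello how are you", "thank you very much"]   # toy example captions

# 1. Tokenization and vocabulary building
tokenizer = Tokenizer(oov_token="<unk>")
tokenizer.fit_on_texts(captions)                  # builds the word -> index vocabulary
sequences = tokenizer.texts_to_sequences(captions)

# 3. Sequence padding to a fixed length
padded = pad_sequences(sequences, maxlen=10, padding="post")
print(padded.shape)   # (2, 10)

# 5. Image augmentation (applied to image tensors, not to captions)
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
])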

3.2.3 Feature Extraction

Feature extraction is a foundational step in the process of analyzing and understanding images within the realm of computer vision. It involves the extraction of meaningful and discriminative features from raw image data, which are essential for subsequent analysis and interpretation tasks. In the context of our project on image captioning, feature extraction is a crucial component that enables the generation of descriptive captions for input images.

For this purpose we adopt EfficientNetB0, a convolutional architecture known for its strong balance of accuracy and efficiency on large image datasets. Trained on the ImageNet dataset, EfficientNetB0 has demonstrated superior performance in various computer vision tasks, making it an ideal candidate for feature extraction in our image captioning pipeline.

Before feeding images into the EfficientNetB0 model, preprocessing steps are applied to ensure compatibility and standardization of input data. These preprocessing steps typically involve resizing images to the required input dimensions and normalizing pixel values to fall within a standardized range. By standardizing the input data, preprocessing facilitates consistent and accurate feature extraction across diverse image datasets.

Once preprocessed, images are passed through the layers of the EfficientNetB0 model to extract high-level visual features. The architecture of EfficientNetB0 comprises multiple layers of convolutional and pooling operations, which progressively analyze and abstract visual information from input images. As images propagate through the network, features are extracted at different levels of abstraction, ranging from simple edge detectors to complex semantic representations.

After traversing the layers of EfficientNetB0, features are obtained from one of its intermediate layers. These features are represented as a multidimensional feature map, where each element encodes a specific aspect of the input images. The feature map encapsulates salient visual information captured by the CNN, providing a rich representation of the input images' content.
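A hedged sketch of using EfficientNetB0 as a frozen feature extractor is shown below; the 224x224 input size and average pooling are common defaults assumed here, not settings confirmed by the report.

import tensorflow as tf
from tensorflow.keras.applications import EfficientNetB0
from tensorflow.keras.applications.efficientnet import preprocess_input

# include_top=False drops the ImageNet classifier head, keeping only the features.
backbone = EfficientNetB0(include_top=False, weights="imagenet", pooling="avg")
backbone.trainable = False

def extract_features(image_batch):
    """image_batch: float tensor of shape (N, 224, 224, 3) with values in 0-255."""
    x = preprocess_input(image_batch)     # EfficientNet-specific input preparation
    return backbone(x, training=False)    # (N, 1280) pooled feature vectors

features = extract_features(tf.random.uniform((2, 224, 224, 3), maxval=255))
print(features.shape)  # (2, 1280)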

3.2.4 Model Selection

In the realm of machine learning, selecting an appropriate model architecture is a critical decision that significantly impacts the performance and effectiveness of a project. In our image captioning endeavor, we conducted a thorough exploration of various model architectures to identify the most suitable one for our task. Here is a detailed overview of our model selection process; each candidate architecture was evaluated based on its suitability for handling visual data and generating textual descriptions.

EfficientNet-Based CNN:

After careful consideration, we opted to utilize the EfficientNet architecture as the backbone for our CNN-based image feature extractor. EfficientNet is known for its superior performance and efficiency across a wide range of image recognition tasks. By leveraging pre-trained weights from the ImageNet dataset, we were able to harness rich visual representations extracted from images efficiently.

Transformer-Based Decoder:

For the sequence generation component of our model, we employed a transformer-based decoder architecture. Transformers have emerged as a powerful framework for sequence-to-sequence tasks, offering advantages such as attention mechanisms and parallel processing. Our decoder architecture consisted of multiple transformer decoder blocks, each responsible for generating a portion of the caption sequentially.

Final Selection:

After thorough experimentation and evaluation, we identified a model configuration that struck a balance between performance and efficiency, and we chose the Transformer. The selected architecture demonstrated superior captioning accuracy, robustness to variations in input data, and reasonable computational requirements. Furthermore, its modular design facilitated easy integration of additional enhancements and optimizations.

3.2.5 Model Building and Training

Model training in our image captioning project is a pivotal stage where we orchestrate the convergence of various components to optimize the model's performance.

Data Partitioning:

We partition the dataset into training, validation, and test sets, ensuring a balanced distribution of examples across each partition. This ensures that the model learns from a diverse range of data samples, facilitating robust generalization.

Model Architecture:

At the heart of our image captioning system lies a sophisticated architecture comprising a convolutional neural network (CNN) encoder and a transformer-based decoder. The CNN encoder, instantiated using EfficientNetB0 pre-trained weights, extracts salient visual features from input images. These features serve as the input to the transformer decoder, which generates descriptive captions based on the encoded visual information.

Loss Function and Optimization:

During model training, we employ the SparseCategoricalCrossentropy loss function to quantify the disparity between predicted captions and ground-truth captions. The Adam optimizer, augmented with a custom learning rate scheduler, facilitates efficient parameter optimization by adjusting the model's parameters iteratively based on computed gradients. Additionally, we incorporate early stopping criteria to prevent overfitting and enhance model generalization.

Training Procedure:

Our training procedure adheres to a standard mini-batch stochastic gradient descent approach, wherein batches of images and their corresponding captions are fed into the model iteratively. The training loop spans multiple epochs, with each epoch comprising batches of training data. At the end of each epoch, the model's performance is evaluated on the validation set to monitor convergence and prevent overfitting.
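A minimal sketch of this training setup, assuming Keras, is shown below; model, train_dataset, and valid_dataset are placeholders for objects built earlier in the pipeline, and the exponential-decay schedule stands in for the custom learning-rate scheduler mentioned above.

import tensorflow as tf

# `model`, `train_dataset`, and `valid_dataset` are assumed to exist already.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.9
)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)

model.compile(optimizer=optimizer, loss=loss_fn, metrics=["accuracy"])
history = model.fit(
    train_dataset,                  # mini-batches of (image, caption) pairs
    validation_data=valid_dataset,  # evaluated at the end of every epoch
    epochs=30,
    callbacks=[early_stop],         # stop once validation loss stops improving
)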

Hyperparameter Tuning:

Hyperparameters play a pivotal role in shaping the model's convergence and generalization performance. We employ a systematic approach, utilizing grid search or random search techniques to explore the hyperparameter space effectively. Key hyperparameters such as the learning rate, batch size, and related settings are tuned based on validation performance.
Model Evaluation:

Throughout the training process, we monitor the model's performance using evaluation metrics such as loss and accuracy. Qualitative assessment through visual inspection of generated captions aids in identifying syntactic or semantic errors. The model's performance on the validation set serves as a benchmark for its generalization ability, guiding further iterations of hyperparameter tuning and model refinement.

Fig 3.2.5.2: Hand Mapping Diagram

Accuracy Output Summary: MediaPipe landmarks are a set of predefined key points on the hand that the MediaPipe library detects to
facilitate hand tracking and gesture recognition. Typically, these consist
of 21 landmarks per hand, covering the wrist, knuckles, and all finger
joints, allowing for precise detection of hand movements and positioning
in 3D space. These landmarks are crucial for tasks like real-time gesture
tracking and recognition in sign language models, enabling systems to
capture and interpret complex hand shapes and motions effectively.
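The 21-landmark detection described above can be illustrated with a short MediaPipe snippet; the image path and parameters are hypothetical.

import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
    image = cv2.imread("gesture.jpg")                       # hypothetical input image
    results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        landmarks = results.multi_hand_landmarks[0].landmark
        print(len(landmarks))                               # 21 key points per hand
        wrist = landmarks[0]
        print(wrist.x, wrist.y, wrist.z)                    # normalized 3D coordinates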

Model Persistence:

Once training is complete, the trained model parameters are serialized and saved to disk using TensorFlow's SavedModel format. Model persistence ensures that the trained parameters are retained for future use, enabling seamless integration into deployment pipelines or further experimentation.

Fig 3.2.5.3: Convolutional Layer

In summary, our model training pipeline embodies a holistic approach to training and optimization, leveraging state-of-the-art algorithms and methodologies to develop a robust and effective image captioning model. Through meticulous data preparation, thoughtful architectural design, and systematic hyperparameter tuning, we ensure that our model achieves superior performance and generalization ability, paving the way for transformative applications in multimedia understanding and natural language processing.

3.2.6 Model Save and Load

Saving and loading trained models is a critical aspect of machine learning projects, enabling the preservation and reuse of valuable model configurations and learned parameters. Understanding how to effectively save and load the model is essential for its deployment, sharing, and further experimentation.

Saving the Model:

When saving our image captioning model, we go through a series of steps to ensure that all necessary components are preserved accurately. The process typically involves saving the model architecture, its learned weights, and any additional configuration details.

Firstly, we serialize the architecture of our model, which encompasses the arrangement of its layers, their connections, and the overall configuration. This is achieved using the `model.to_json()` method, which converts the model's structure into a JSON format. Alternatively, the architecture can be saved in YAML format using `model.to_yaml()`.

Once the model architecture is serialized, we proceed to save the learned parameters, commonly referred to as weights. These weights represent the knowledge acquired by the model during the training process and are essential for reproducing its behavior accurately. The `model.save_weights()` method is utilized for this purpose, which stores the weights in a binary format compatible with TensorFlow.

In addition to the model architecture and weights, any auxiliary information required to fully restore the model's state is saved. This may include optimizer states, training configurations, or any custom objects used in the model. Saving these details ensures that the model can be reinstated with all necessary settings intact.

Finally, we save the serialized architecture, learned weights, and additional information to disk using the `model.save()` method. This function allows us to specify the directory where the model will be stored, creating a comprehensive snapshot that can be easily retrieved when needed.

Loading the Model:

To load a saved model, we follow a systematic process to reconstruct its architecture, restore its learned weights, and apply any necessary configurations. The steps involved in loading a model are designed to ensure that its state is recovered faithfully. The serialized architecture is first parsed (`tf.keras.models.model_from_json()` or `tf.keras.models.model_from_yaml()`). This step reconstructs the model's architecture, laying the foundation for further restoration.

Once the architecture is reconstructed, we proceed to build an empty model based on this architecture. This empty model serves as a placeholder onto which we load the learned weights and apply any additional settings.

3.2.7 Inferencing

Inference, the process of generating captions for new images using our trained model, is a crucial step in evaluating the effectiveness and practical utility of our image captioning system. Leveraging the robustness and flexibility of our model architecture, we have developed a streamlined inference pipeline that enables efficient and accurate caption generation for a wide range of images.

At the heart of our inference pipeline lies the integration of our trained model with an image preprocessing module, which prepares the input images for processing by the model. This preprocessing step involves resizing, normalization, and augmentation of the input images to ensure compatibility with the model's input requirements. By standardizing the input format, we ensure consistent and reliable performance of our model across diverse image datasets.

Once the input images are preprocessed, they are fed into the convolutional neural network (CNN) encoder component of our model. The CNN encoder extracts high-level visual features from the input images, capturing spatial information and contextual cues essential for generating descriptive captions. Leveraging the hierarchical representations learned through layers of convolutional operations, the CNN encoder transforms the raw pixel data into a compact and informative feature representation, which serves as the input to the subsequent stages of caption generation.

With the visual features extracted by the CNN encoder, the transformed embeddings are then passed through the transformer decoder, which generates coherent and contextually relevant captions, capturing the semantic nuances and details present in the input images.

During inference, our model generates multiple candidate captions for each input image, allowing for diverse and expressive output. This multi-caption generation approach enables us to capture the inherent variability and richness of natural language, providing users with a range of captioning options to choose from. Additionally, the model's flexibility in handling variable-length input sequences ensures that captions of varying lengths can be generated to accommodate the specific content and complexity of each image.

3.2.8 Deployment

Deployment marks the culmination of our efforts in developing an image captioning system, as it involves making our model accessible and usable to end-users. Leveraging modern web technologies and frameworks, we have created a user-friendly interface for our image captioning system, enabling seamless interaction and integration into various applications and platforms.

At the core of our deployment strategy is the adoption of Streamlit, a powerful Python library for building interactive web applications. Streamlit provides us with a straightforward and intuitive way to design and deploy our user interface, allowing us to focus on delivering a compelling user experience without the need for extensive web development expertise.

Our deployment process begins with packaging our trained model and inference pipeline into a standalone Python application. This application serves as the backend logic for our image captioning system, handling incoming image inputs, processing them through the model, and generating descriptive captions in real-time. By encapsulating our model within a standalone application, we ensure portability and scalability, enabling easy deployment across various environments and platforms.

With the backend logic and model weights packaged into a standalone application, we proceed to develop the frontend interface using Streamlit.
CHAPTER IV
SYSTEM ANALYSIS AND DESIGN

4.1 Software Requirements

4.1.1 Functional Requirements

The functional requirements of our image captioning project encompass the core functionalities and capabilities that the system must exhibit to meet the needs and expectations of users. These requirements are defined based on the intended functionality of the image captioning model and the specific use cases it aims to address. Some key functional requirements include:

1. Image Feature Extraction: The system must be capable of extracting high-level visual features from input images using a Convolutional Neural Network (CNN) encoder. This involves processing the raw pixel data of images and encoding them into a compact feature representation that captures relevant visual semantics.

2. Textual Feature Extraction: In addition to visual features, the system must extract textual features from input captions using a transformer encoder. This involves tokenizing and encoding the textual input into a numerical representation that captures semantic and contextual information.

3. Semantic Fusion: The system must integrate visual and textual features using attention mechanisms to facilitate semantic fusion. This involves leveraging the learned representations from both modalities to capture meaningful correlations between visual content and textual descriptions.

5. Scalability and Efficiency: The system must be scalable and efficient, capable of processing a diverse range of images and captions in real-time or near real-time scenarios. This involves optimizing computational resources, algorithms, and data pipelines to ensure rapid and efficient caption generation without compromising on accuracy or quality.

4.1.2 Non-Functional Requirements

In addition to the functional requirements outlined for our image captioning project, several non-functional requirements must be considered to ensure the system's overall performance, usability, and reliability. These non-functional requirements encompass aspects such as performance, usability, reliability, scalability, and security, all of which are critical for the success and acceptance of the system.

1. Performance: The system must exhibit high performance in terms of speed and efficiency, capable of processing images and generating captions in a timely manner. This involves optimizing algorithms, data structures, and computational resources to minimize latency and maximize throughput during the inference and training phases.

2. Usability: The system must be user-friendly and intuitive, with a well-designed interface that enables users to interact with the system easily. This includes providing clear instructions, feedback, and error messages to guide users through the captioning process and ensure a seamless user experience.

3. Reliability: The system must be reliable and robust, capable of handling errors, failures, and unexpected inputs gracefully. This involves implementing error handling mechanisms, backup and recovery techniques, and leveraging cloud-based resources to accommodate growing demands and user populations.

5. Security: The system must adhere to stringent security standards and protocols to protect sensitive data, such as user information and image content, from unauthorized access or manipulation. This involves implementing authentication, encryption, and access control mechanisms to safeguard data integrity and confidentiality throughout the captioning process.

6. Maintainability: The system must be maintainable, with well-documented code, clear architecture, and modular design principles that facilitate easy maintenance, updates, and enhancements. This includes providing documentation, version control, and automated testing tools to streamline the development and maintenance process and ensure long-term sustainability.

By addressing these non-functional requirements, our image captioning system aims to deliver a reliable, efficient, and user-friendly solution that meets the needs and expectations of users while adhering to the highest standards of performance, usability, reliability, scalability, and security. These non-functional aspects are essential for ensuring the overall success and acceptance of the system in various real-world applications and environments.
CHAPTER V
SYSTEM IMPLEMENTATION

5.1 Source Code

5.1.1 Model Training

Setup

import math
import cv2
from cvzone.HandTrackingModule import HandDetector
import numpy as np
from keras.models import load_model
import traceback

model = load_model('/cnn8grps_rad1_model.h5')
white = np.ones((400, 400), np.uint8) * 255
cv2.imwrite("C:\\Users\\devansh raval\\PycharmProjects\\pythonProject\\white.jpg", white)

capture = cv2.VideoCapture(0)

hd = HandDetector(maxHands=1)
hd2 = HandDetector(maxHands=1)

offset = 29
step = 1
flag = False
suv = 0

def distance(x, y):


return math.sqrt(((x[0] - y[0]) ** 2) + ((x[1] - y[1]) ** 2))

def distance_3d(x, y):


return math.sqrt(((x[0] - y[0]) ** 2) + ((x[1] - y[1]) ** 2) + ((x[2] - y[2]) ** 2))

while True:
    try:
        _, frame = capture.read()
        frame = cv2.flip(frame, 1)
        hands = hd.findHands(frame, draw=False, flipType=True)
        print(frame.shape)
        if hands:
            # print(" --------- lmlist=", hands[1])
            hand = hands[0]
            x, y, w, h = hand['bbox']
            image = frame[y - offset:y + h + offset, x - offset:x + w + offset]
            white = cv2.imread("C:\\Users\\devansh raval\\PycharmProjects\\pythonProject\\white.jpg")
            # img_final=img_final1=img_final2=0
            handz = hd2.findHands(image, draw=False, flipType=True)
            if handz:
                hand = handz[0]
                pts = hand['lmList']
                # x1,y1,w1,h1=hand['bbox']

                # Offsets to centre the hand skeleton on the 400x400 white canvas.
                os = ((400 - w) // 2) - 15
                os1 = ((400 - h) // 2) - 15

                # Draw each finger of the skeleton.
                for t in range(0, 4, 1):
                    cv2.line(white, (pts[t][0] + os, pts[t][1] + os1),
                             (pts[t + 1][0] + os, pts[t + 1][1] + os1), (0, 255, 0), 3)
                for t in range(5, 8, 1):
                    cv2.line(white, (pts[t][0] + os, pts[t][1] + os1),
                             (pts[t + 1][0] + os, pts[t + 1][1] + os1), (0, 255, 0), 3)
                for t in range(9, 12, 1):
                    cv2.line(white, (pts[t][0] + os, pts[t][1] + os1),
                             (pts[t + 1][0] + os, pts[t + 1][1] + os1), (0, 255, 0), 3)
                for t in range(13, 16, 1):
                    cv2.line(white, (pts[t][0] + os, pts[t][1] + os1),
                             (pts[t + 1][0] + os, pts[t + 1][1] + os1), (0, 255, 0), 3)
                for t in range(17, 20, 1):
                    cv2.line(white, (pts[t][0] + os, pts[t][1] + os1),
                             (pts[t + 1][0] + os, pts[t + 1][1] + os1), (0, 255, 0), 3)

                # Connect the palm landmarks.
                cv2.line(white, (pts[5][0] + os, pts[5][1] + os1), (pts[9][0] + os, pts[9][1] + os1), (0, 255, 0), 3)
                cv2.line(white, (pts[9][0] + os, pts[9][1] + os1), (pts[13][0] + os, pts[13][1] + os1), (0, 255, 0), 3)
                cv2.line(white, (pts[13][0] + os, pts[13][1] + os1), (pts[17][0] + os, pts[17][1] + os1), (0, 255, 0), 3)
                cv2.line(white, (pts[0][0] + os, pts[0][1] + os1), (pts[5][0] + os, pts[5][1] + os1), (0, 255, 0), 3)
                cv2.line(white, (pts[0][0] + os, pts[0][1] + os1), (pts[17][0] + os, pts[17][1] + os1), (0, 255, 0), 3)

                # Mark all 21 landmarks.
                for i in range(21):
                    cv2.circle(white, (pts[i][0] + os, pts[i][1] + os1), 2, (0, 0, 255), 1)

                cv2.imshow("2", white)
                # cv2.imshow("5", skeleton5)

                # Predict the gesture class from the skeleton image and keep the
                # three most probable classes.
                # print(model.predict(img))
                white = white.reshape(1, 400, 400, 3)
                prob = np.array(model.predict(white)[0], dtype='float32')
                ch1 = np.argmax(prob, axis=0)
                prob[ch1] = 0
                ch2 = np.argmax(prob, axis=0)
                prob[ch2] = 0
                ch3 = np.argmax(prob, axis=0)
                prob[ch3] = 0

                pl = [ch1, ch2]

                # ch1 = 0
                # print("00000")

                # condition for [o][s]
                l = [[2, 2], [2, 1]]
                if pl in l:
                    if pts[5][0] < pts[4][0]:
                        ch1 = 0
                        print("++++++++++++++++++")
                        # print("00000")

                # condition for [c0][aemnst]
                l = [[0, 0], [0, 6], [0, 2], [0, 5], [0, 1], [0, 7], [5, 2], [7, 6], [7, 1]]
                pl = [ch1, ch2]
                if pl in l:
                    ch1 = 2
                    # print("22222")

                # condition for [c0][aemnst]
                l = [[6, 0], [6, 6], [6, 2]]
                pl = [ch1, ch2]
                if pl in l:
                    if distance(pts[8], pts[16]) < 52:
                        ch1 = 2
                        # print("22222")

                # print(pts[2][1] + 15 > pts[16][1])
                # condition for [gh][bdfikruvw]
                l = [[1, 4], [1, 5], [1, 6], [1, 3], [1, 0]]
                pl = [ch1, ch2]
                if pl in l:
                    if pts[6][1] > pts[8][1] and pts[14][1] < pts[16][1] and pts[18][1] < pts[20][1] and \
                            pts[0][0] < pts[8][0] and pts[0][0] < pts[12][0] and pts[0][0] < pts[16][0] and pts[0][0] < pts[20][0]:
                        ch1 = 3
                        print("33333c")

                # condition for [gh][l]
                l = [[4, 6], [4, 1], [4, 5], [4, 3], [4, 7]]
                pl = [ch1, ch2]
                if pl in l:
                    if pts[4][0] > pts[0][0]:
                        ch1 = 3
                        print("33333b")

                # condition for [gh][pqz]
                l = [[5, 3], [5, 0], [5, 7], [5, 4], [5, 2], [5, 1], [5, 5]]
                pl = [ch1, ch2]
                if pl in l:
                    if pts[2][1] + 15 < pts[16][1]:
36
def predict(self, test_image):
    white = test_image
    white = white.reshape(1, 400, 400, 3)
    # top-3 most probable classes for the skeleton image
    prob = np.array(self.model.predict(white)[0], dtype='float32')
    ch1 = np.argmax(prob, axis=0)
    prob[ch1] = 0
    ch2 = np.argmax(prob, axis=0)
    prob[ch2] = 0
    ch3 = np.argmax(prob, axis=0)
    prob[ch3] = 0

    pl = [ch1, ch2]

    # condition for [Aemnst]
    l = [[5, 2], [5, 3], [3, 5], [3, 6], [3, 0], [3, 2], [6, 4], [6, 1], [6, 2], [6, 6], [6, 7], [6, 0], [6, 5],
         [4, 1], [1, 0], [1, 1], [6, 3], [1, 6], [5, 6], [5, 1], [4, 5], [1, 4], [1, 5], [2, 0], [2, 6], [4, 6],
         [1, 0], [5, 7], [1, 6], [6, 1], [7, 6], [2, 5], [7, 1], [5, 4], [7, 0], [7, 5], [7, 2]]
    if pl in l:
        ch1 = 0

Building a data_collection_binary pipeline for training

We will generate pairs of images and corresponding captions using a tf.data.Dataset object. The pipeline consists of two steps:

1. Read the image from the disk
2. Tokenize all the five captions corresponding to the image
def decode_and_resize(img_path):
    img = tf.io.read_file(img_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, IMAGE_SIZE)
    img = tf.image.convert_image_dtype(img, tf.float32)
    return img

def process_input(img_path, captions):
    return decode_and_resize(img_path), vectorization(captions)

def make_dataset(images, captions):
    dataset = tf.data.Dataset.from_tensor_slices((images, captions))
    dataset = dataset.shuffle(BATCH_SIZE * 8)
    dataset = dataset.map(process_input, num_parallel_calls=AUTOTUNE)
    dataset = dataset.batch(BATCH_SIZE).prefetch(AUTOTUNE)
    return dataset

# Pass the list of images and the list of corresponding captions
train_dataset = make_dataset(list(train_data.keys()), list(train_data.values()))
valid_dataset = make_dataset(list(valid_data.keys()), list(valid_data.values()))
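
Before training, the pipeline can be sanity-checked by pulling a single batch and inspecting its tensors. The snippet below is a minimal sketch and assumes the constants (BATCH_SIZE, IMAGE_SIZE, AUTOTUNE) and the vectorization layer are configured as described above:

# Illustrative check: take one batch from the pipeline and inspect it.
for batch_imgs, batch_captions in train_dataset.take(1):
    print(batch_imgs.shape)      # image tensor, e.g. (BATCH_SIZE, *IMAGE_SIZE, 3)
    print(batch_captions.shape)  # tokenized caption tensor for the same batch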

Building the model

import cv2
from cvzone.HandTrackingModule import HandDetector
from cvzone.ClassificationModule import Classifier
import numpy as np
import os, os.path
from keras.models import load_model
import traceback

#model = load_model('C:\\Users\\devansh raval\\PycharmProjects\\pythonProject\\cnn9.h5')

capture = cv2.VideoCapture(0)

hd = HandDetector(maxHands=1)
hd2 = HandDetector(maxHands=1)
# #training data
# count = len(os.listdir("D://sign2text_dataset_2.0/Binary_imgs//A"))
# capture-loop preamble (mirrors the setup script above)
offset = 29
step = 1
flag = False
suv = 0
p_dir = "A"   # current class directory label
c_dir = "a"   # file-name prefix for the current class
count = len(os.listdir("D://test_data_2.0/Gray_imgs//" + p_dir + "//"))

while True:
    try:
        _, frame = capture.read()
        frame = cv2.flip(frame, 1)
        hands = hd.findHands(frame, draw=False, flipType=True)

        if hands:
            hand = hands[0]
            x, y, w, h = hand['bbox']
            image = frame[y - offset:y + h + offset, x - offset:x + w + offset]

            roi = image  # rgb image without drawing

            # simple gray image without drawing
            gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
            blur = cv2.GaussianBlur(gray, (1, 1), 2)

            # binary image (adaptive + Otsu thresholding)
            gray2 = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
            blur2 = cv2.GaussianBlur(gray2, (5, 5), 2)
            th3 = cv2.adaptiveThreshold(blur2, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                        cv2.THRESH_BINARY_INV, 11, 2)
            ret, test_image = cv2.threshold(th3, 27, 255,
                                            cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

            # paste the gray crop onto a 400x400 canvas
            test_image1 = blur
            img_final1 = np.ones((400, 400), np.uint8) * 148
            h = test_image1.shape[0]
            w = test_image1.shape[1]
            img_final1[((400 - h) // 2):((400 - h) // 2) + h,
                       ((400 - w) // 2):((400 - w) // 2) + w] = test_image1

            # paste the binary crop onto a 400x400 canvas
            img_final = np.ones((400, 400), np.uint8) * 255
            h = test_image.shape[0]
            w = test_image.shape[1]
            img_final[((400 - h) // 2):((400 - h) // 2) + h,
                      ((400 - w) // 2):((400 - w) // 2) + w] = test_image

            hands = hd.findHands(frame, draw=False, flipType=True)
            if hands:
                hand = hands[0]
                x, y, w, h = hand['bbox']
                image = frame[y - offset:y + h + offset, x - offset:x + w + offset]
                white = cv2.imread("C:\\Users\\devansh raval\\PycharmProjects\\pythonProject\\white.jpg")
                handz = hd2.findHands(image, draw=False, flipType=True)
                if handz:
                    hand = handz[0]
                    pts = hand['lmList']

                    # offsets that centre the skeleton (named so they do not shadow the os module)
                    osx = ((400 - w) // 2) - 15
                    osy = ((400 - h) // 2) - 15
                    # draw the five finger chains
                    for t in range(0, 4, 1):
                        cv2.line(white, (pts[t][0] + osx, pts[t][1] + osy),
                                 (pts[t + 1][0] + osx, pts[t + 1][1] + osy), (0, 255, 0), 3)
                    for t in range(5, 8, 1):
                        cv2.line(white, (pts[t][0] + osx, pts[t][1] + osy),
                                 (pts[t + 1][0] + osx, pts[t + 1][1] + osy), (0, 255, 0), 3)
                    for t in range(9, 12, 1):
                        cv2.line(white, (pts[t][0] + osx, pts[t][1] + osy),
                                 (pts[t + 1][0] + osx, pts[t + 1][1] + osy), (0, 255, 0), 3)
                    for t in range(13, 16, 1):
                        cv2.line(white, (pts[t][0] + osx, pts[t][1] + osy),
                                 (pts[t + 1][0] + osx, pts[t + 1][1] + osy), (0, 255, 0), 3)
                    for t in range(17, 20, 1):
                        cv2.line(white, (pts[t][0] + osx, pts[t][1] + osy),
                                 (pts[t + 1][0] + osx, pts[t + 1][1] + osy), (0, 255, 0), 3)
                    # connect the knuckles and the wrist
                    cv2.line(white, (pts[5][0] + osx, pts[5][1] + osy), (pts[9][0] + osx, pts[9][1] + osy), (0, 255, 0), 3)
                    cv2.line(white, (pts[9][0] + osx, pts[9][1] + osy), (pts[13][0] + osx, pts[13][1] + osy), (0, 255, 0), 3)
                    cv2.line(white, (pts[13][0] + osx, pts[13][1] + osy), (pts[17][0] + osx, pts[17][1] + osy), (0, 255, 0), 3)
                    cv2.line(white, (pts[0][0] + osx, pts[0][1] + osy), (pts[5][0] + osx, pts[5][1] + osy), (0, 255, 0), 3)
                    cv2.line(white, (pts[0][0] + osx, pts[0][1] + osy), (pts[17][0] + osx, pts[17][1] + osy), (0, 255, 0), 3)

                    for i in range(21):
                        cv2.circle(white, (pts[i][0] + osx, pts[i][1] + osy), 2, (0, 0, 255), 1)

                    cv2.imshow("skeleton", white)

                    hands = hd.findHands(white, draw=False, flipType=True)
                    if hands:
                        hand = hands[0]
                        x, y, w, h = hand['bbox']
                        cv2.rectangle(white, (x - offset, y - offset), (x + w, y + h), (3, 255, 25), 3)

                    image1 = frame[y - offset:y + h + offset, x - offset:x + w + offset]
                    roi1 = image1  # rgb image with drawing

                    # gray image with drawings
                    gray1 = cv2.cvtColor(roi1, cv2.COLOR_BGR2GRAY)
                    blur1 = cv2.GaussianBlur(gray1, (1, 1), 2)

                    test_image2 = blur1
                    img_final2 = np.ones((400, 400), np.uint8) * 148
                    h = test_image2.shape[0]
                    w = test_image2.shape[1]
                    img_final2[((400 - h) // 2):((400 - h) // 2) + h,
                               ((400 - w) // 2):((400 - w) // 2) + w] = test_image2

                    # cv2.imshow("gray", img_final2)
                    cv2.imshow("binary", img_final)
                    # cv2.imshow("gray w/o draw", img_final1)

                    # optional on-screen prediction of the top-3 letters (kept commented here)
                    # img = img_final.reshape(1, 400, 400, 1)
                    # prob = np.array(model.predict(img)[0], dtype='float32')
                    # ch1 = np.argmax(prob, axis=0)
                    # prob[ch1] = 0
                    # ch2 = np.argmax(prob, axis=0)
                    # prob[ch2] = 0
                    # ch3 = np.argmax(prob, axis=0)
                    # prob[ch3] = 0
                    # ch1, ch2, ch3 = chr(ch1 + 65), chr(ch2 + 65), chr(ch3 + 65)
                    # frame = cv2.putText(frame, "Predicted " + ch1 + " " + ch2 + " " + ch3,
                    #                     (x - offset - 150, y - offset - 10),
                    #                     cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 0, 0), 1, cv2.LINE_AA)

        # cv2.rectangle(frame, (x - offset, y - offset), (x + w, y + h), (3, 255, 25), 3)
        # frame = cv2.putText(frame, "dir=" + c_dir + " count=" + str(count), (50, 50),
        #                     cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 0, 0), 1, cv2.LINE_AA)
        cv2.imshow("frame", frame)
        interrupt = cv2.waitKey(1)
        if interrupt & 0xFF == 27:
            # esc key
            break

        if interrupt & 0xFF == ord('n'):
            # move on to the next class directory
            p_dir = chr(ord(p_dir) + 1)
            c_dir = chr(ord(c_dir) + 1)
            if ord(p_dir) == ord('Z') + 1:
                p_dir = "A"
                c_dir = "a"
            flag = False
            # training data
            # count = len(os.listdir("D://sign2text_dataset_2.0/Binary_imgs//" + p_dir + "//"))

            # test data
            count = len(os.listdir("D://test_data_2.0/Gray_imgs//" + p_dir + "//"))

        if interrupt & 0xFF == ord('a'):
            # toggle automatic capture of frames
            if flag:
                flag = False
            else:
                suv = 0
                flag = True

        print("=====", flag)
        if flag == True:
            if suv == 50:
                flag = False
            if step % 2 == 0:
                # this is for training data collection
                # cv2.imwrite("D:\\sign2text_dataset_2.0\\Binary_imgs\\" + p_dir + "\\" + c_dir + str(count) + ".jpg", img_final)
                # cv2.imwrite("D:\\sign2text_dataset_2.0\\Gray_imgs\\" + p_dir + "\\" + c_dir + str(count) + ".jpg", img_final1)
                # cv2.imwrite("D:\\sign2text_dataset_2.0\\Gray_imgs_with_drawing\\" + p_dir + "\\" + c_dir + str(count) + ".jpg", img_final2)

                # this is for testing data collection
                # cv2.imwrite("D:\\test_data_2.0\\Binary_imgs\\" + p_dir + "\\" + c_dir + str(count) + ".jpg", img_final)
                cv2.imwrite("D:\\test_data_2.0\\Gray_imgs\\" + p_dir + "\\" + c_dir + str(count) + ".jpg",
                            img_final1)
                cv2.imwrite("D:\\test_data_2.0\\Gray_imgs_with_drawing\\" + p_dir + "\\" + c_dir + str(count) + ".jpg",
                            img_final2)
                count += 1
                suv += 1
            step += 1

    except Exception:
        print("==", traceback.format_exc())

capture.release()
cv2.destroyAllWindows()

        return tf.reduce_sum(accuracy) / tf.reduce_sum(mask)

    def _compute_caption_loss_and_acc(self, img_embed, batch_seq, training=True):
        encoder_out = self.encoder(img_embed, training=training)
        batch_seq_inp = batch_seq[:, :-1]
        batch_seq_true = batch_seq[:, 1:]
        mask = tf.math.not_equal(batch_seq_true, 0)
        batch_seq_pred = self.decoder(
            batch_seq_inp, encoder_out, training=training, mask=mask
        )
        loss = self.calculate_loss(batch_seq_true, batch_seq_pred, mask)
        acc = self.calculate_accuracy(batch_seq_true, batch_seq_pred, mask)
        return loss, acc

    def train_step(self, batch_data):
        batch_img, batch_seq = batch_data
        batch_loss = 0
        batch_acc = 0

        if self.image_aug:
            batch_img = self.image_aug(batch_img)

        # 1. Get image embeddings
        img_embed = self.cnn_model(batch_img)

        # 2. Pass each of the five captions one by one to the decoder
        # along with the encoder outputs and compute the loss as well as accuracy
        # for each caption
        for i in range(self.num_captions_per_image):
            with tf.GradientTape() as tape:
                loss, acc = self._compute_caption_loss_and_acc(
                    img_embed, batch_seq[:, i, :]
                )

                # 3. Update loss and accuracy
                batch_loss += loss
                batch_acc += acc

            # 4. Get the list of all the trainable weights
            train_vars = (
                self.encoder.trainable_variables + self.decoder.trainable_variables
            )

            # 5. Get the gradients
            grads = tape.gradient(loss, train_vars)

            # 6. Update the trainable weights
            self.optimizer.apply_gradients(zip(grads, train_vars))

        # 7. Update the trackers
        batch_acc /= float(self.num_captions_per_image)
        self.loss_tracker.update_state(batch_loss)
        self.acc_tracker.update_state(batch_acc)

        # 8. Return the loss and accuracy values
        return {
            "loss": self.loss_tracker.result(),
            "acc": self.acc_tracker.result(),
        }

    def test_step(self, batch_data):
        batch_img, batch_seq = batch_data
        batch_loss = 0
        batch_acc = 0

        # 1. Get image embeddings
        img_embed = self.cnn_model(batch_img)

        # 2. Compute loss and accuracy for each caption
        for i in range(self.num_captions_per_image):
            loss, acc = self._compute_caption_loss_and_acc(
                img_embed, batch_seq[:, i, :], training=False
            )

            # 3. Update batch loss and batch accuracy
            batch_loss += loss
            batch_acc += acc

        batch_acc /= float(self.num_captions_per_image)

        # 4. Update the trackers
        self.loss_tracker.update_state(batch_loss)
        self.acc_tracker.update_state(batch_acc)

        # 5. Return the loss and accuracy values
        return {
            "loss": self.loss_tracker.result(),
            "acc": self.acc_tracker.result(),
        }

    @property
    def metrics(self):
        # We need to list our metrics here so the `reset_states()` can be
        # called automatically.
        return [self.loss_tracker, self.acc_tracker]


cnn_model = get_cnn_model()
encoder = TransformerEncoderBlock(embed_dim=EMBED_DIM, dense_dim=FF_DIM, num_heads=1)
decoder = TransformerDecoderBlock(embed_dim=EMBED_DIM, ff_dim=FF_DIM)
caption_model = ImageCaptioningModel(
    cnn_model=cnn_model,
    encoder=encoder,
    decoder=decoder,
    image_aug=image_aug,
)
Model training

# Define the loss function
cross_entropy = keras.losses.SparseCategoricalCrossentropy(
    from_logits=False,
    reduction=None,
)

# EarlyStopping criteria
early_stopping = keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)

# Learning Rate Scheduler for the optimizer
class LRSchedule(keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, post_warmup_learning_rate, warmup_steps):
        super().__init__()
        self.post_warmup_learning_rate = post_warmup_learning_rate
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        global_step = tf.cast(step, tf.float32)
        warmup_steps = tf.cast(self.warmup_steps, tf.float32)
        warmup_progress = global_step / warmup_steps
        warmup_learning_rate = self.post_warmup_learning_rate * warmup_progress
        # linear warmup, then a constant post-warmup learning rate
        return tf.cond(
            global_step < warmup_steps,
            lambda: warmup_learning_rate,
            lambda: self.post_warmup_learning_rate,
        )

num_train_steps = len(train_dataset) * EPOCHS
num_warmup_steps = num_train_steps // 15
lr_schedule = LRSchedule(post_warmup_learning_rate=1e-4, warmup_steps=num_warmup_steps)

# Compile the model
caption_model.compile(optimizer=keras.optimizers.Adam(lr_schedule), loss=cross_entropy)

# Fit the model
caption_model.fit(
    train_dataset,
    epochs=EPOCHS,
    validation_data=valid_dataset,
    callbacks=[early_stopping],
)
5.1.2 Inferencing Code (final_pred.py)


import pickle
import tensorflow as tf
import pandas as pd
import numpy as np

# CONSTANTS
MAX_LENGTH = 40
# VOCABULARY_SIZE

tokenizer = tf.keras.layers.TextVectorization(
    # max_tokens=VOCABULARY_SIZE,
    standardize=None,
    output_sequence_length=MAX_LENGTH,
    vocabulary=vocab,
)

idx2word = tf.keras.layers.StringLookup(
    mask_token="",
    vocabulary=tokenizer.get_vocabulary(),
    invert=True,
)

# MODEL
def CNN_Encoder():
    inception_v3 = tf.keras.applications.InceptionV3(
        include_top=False,
        weights='imagenet',
    )

    output = inception_v3.output
    output = tf.keras.layers.Reshape((-1, output.shape[-1]))(output)

    cnn_model = tf.keras.models.Model(inception_v3.input, output)
    return cnn_model

class TransformerEncoderLayer(tf.keras.layers.Layer):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        # layers used in call(): two layer norms, a dense projection and self-attention
        self.layer_norm_1 = tf.keras.layers.LayerNormalization()
        self.layer_norm_2 = tf.keras.layers.LayerNormalization()
        self.attention = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.dense = tf.keras.layers.Dense(embed_dim, activation="relu")

    def call(self, x, training):
        x = self.layer_norm_1(x)
        x = self.dense(x)

        attn_output = self.attention(
            query=x,
            value=x,
            key=x,
            attention_mask=None,
            training=training,
        )

        x = self.layer_norm_2(x + attn_output)
        return x
class Embeddings(tf.keras.layers.Layer):
    def __init__(self, vocab_size, embed_dim, max_len):
        super().__init__()
        self.token_embeddings = tf.keras.layers.Embedding(
            vocab_size, embed_dim)
        self.position_embeddings = tf.keras.layers.Embedding(
            max_len, embed_dim, input_shape=(None, max_len))

    def call(self, input_ids):
        length = tf.shape(input_ids)[-1]
        position_ids = tf.range(start=0, limit=length, delta=1)
        position_ids = tf.expand_dims(position_ids, axis=0)
        # the embedding of a token is the sum of its token and position embeddings
        token_embeddings = self.token_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        return token_embeddings + position_embeddings
class TransformerDecoderLayer(tf.keras.layers.Layer):
    def __init__(self, embed_dim, units, num_heads):
        super().__init__()

        self.embedding = Embeddings(
            tokenizer.vocabulary_size(), embed_dim, MAX_LENGTH)

        self.attention_1 = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim, dropout=0.1
        )
        self.attention_2 = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim, dropout=0.1
        )

        self.layernorm_1 = tf.keras.layers.LayerNormalization()
        self.layernorm_2 = tf.keras.layers.LayerNormalization()
        self.layernorm_3 = tf.keras.layers.LayerNormalization()

        self.ffn_layer_1 = tf.keras.layers.Dense(units, activation="relu")
        self.ffn_layer_2 = tf.keras.layers.Dense(embed_dim)

        self.out = tf.keras.layers.Dense(tokenizer.vocabulary_size(), activation="softmax")

        self.dropout_1 = tf.keras.layers.Dropout(0.3)
        self.dropout_2 = tf.keras.layers.Dropout(0.5)

    def call(self, input_ids, encoder_output, training, mask=None):
        embeddings = self.embedding(input_ids)

        combined_mask = None
        padding_mask = None

        # self-attention over the caption tokens
        attn_output_1 = self.attention_1(
            query=embeddings,
            value=embeddings,
            key=embeddings,
            attention_mask=combined_mask,
            training=training,
        )

        out_1 = self.layernorm_1(embeddings + attn_output_1)

        # cross-attention over the encoder (image) features
        attn_output_2 = self.attention_2(
            query=out_1,
            value=encoder_output,
            key=encoder_output,
            attention_mask=padding_mask,
            training=training,
        )

        out_2 = self.layernorm_2(out_1 + attn_output_2)

        ffn_out = self.ffn_layer_1(out_2)
        ffn_out = self.dropout_1(ffn_out, training=training)
        ffn_out = self.ffn_layer_2(ffn_out)

        ffn_out = self.layernorm_3(ffn_out + out_2)
        ffn_out = self.dropout_2(ffn_out, training=training)
        preds = self.out(ffn_out)
        return preds

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]

        # lower-triangular mask so each position attends only to earlier tokens
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, sequence_length, sequence_length))

        mult = tf.concat(
            [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)],
            axis=0,
        )
        return tf.tile(mask, mult)
class ImageCaptioningModel(tf.keras.Model):
    def __init__(self, cnn_model, encoder, decoder, image_aug=None):
        super().__init__()
        self.cnn_model = cnn_model
        self.encoder = encoder
        self.decoder = decoder
        self.image_aug = image_aug

        self.loss_tracker = tf.keras.metrics.Mean(name="loss")
        self.acc_tracker = tf.keras.metrics.Mean(name="accuracy")

    def calculate_loss(self, y_true, y_pred, mask):
        loss = self.loss(y_true, y_pred)
        mask = tf.cast(mask, dtype=loss.dtype)
        loss *= mask
        return tf.reduce_sum(loss) / tf.reduce_sum(mask)

    def calculate_accuracy(self, y_true, y_pred, mask):
        accuracy = tf.equal(y_true, tf.argmax(y_pred, axis=2))
        accuracy = tf.math.logical_and(mask, accuracy)
        accuracy = tf.cast(accuracy, dtype=tf.float32)
        mask = tf.cast(mask, dtype=tf.float32)
        return tf.reduce_sum(accuracy) / tf.reduce_sum(mask)

    def compute_loss_and_acc(self, img_embed, captions, training=True):
        encoder_output = self.encoder(img_embed, training=training)
        y_input = captions[:, :-1]
        y_true = captions[:, 1:]
        mask = (y_true != 0)

        y_pred = self.decoder(
            y_input, encoder_output, training=True, mask=mask
        )
        loss = self.calculate_loss(y_true, y_pred, mask)
        acc = self.calculate_accuracy(y_true, y_pred, mask)
        return loss, acc

    def train_step(self, batch):
        imgs, captions = batch

        if self.image_aug:
            imgs = self.image_aug(imgs)

        img_embed = self.cnn_model(imgs)
        with tf.GradientTape() as tape:
            loss, acc = self.compute_loss_and_acc(
                img_embed, captions
            )

        train_vars = (
            self.encoder.trainable_variables + self.decoder.trainable_variables
        )
        grads = tape.gradient(loss, train_vars)
        self.optimizer.apply_gradients(zip(grads, train_vars))
        self.loss_tracker.update_state(loss)
        self.acc_tracker.update_state(acc)

        return {"loss": self.loss_tracker.result(),
                "acc": self.acc_tracker.result()}

    def test_step(self, batch):
        imgs, captions = batch
        img_embed = self.cnn_model(imgs)
        loss, acc = self.compute_loss_and_acc(
            img_embed, captions, training=False
        )
        self.loss_tracker.update_state(loss)
        self.acc_tracker.update_state(acc)
        return {"loss": self.loss_tracker.result(),
                "acc": self.acc_tracker.result()}

    @property
    def metrics(self):
        return [self.loss_tracker, self.acc_tracker]
def load_image_from_path(img_path):
    img = tf.io.read_file(img_path)
    img = tf.io.decode_jpeg(img, channels=3)
    img = tf.keras.layers.Resizing(299, 299)(img)
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img

def generate_caption(img, caption_model, add_noise=False):
    if isinstance(img, str):
        img = load_image_from_path(img)

    if add_noise == True:
        # perturb the image slightly and re-normalise it to [0, 1]
        noise = tf.random.normal(img.shape) * 0.1
        img = (img + noise)
        img = (img - tf.reduce_min(img)) / (tf.reduce_max(img) - tf.reduce_min(img))

    img = tf.expand_dims(img, axis=0)
    img_embed = caption_model.cnn_model(img)
    img_encoded = caption_model.encoder(img_embed, training=False)

    # greedy decoding: predict one word at a time until '[end]'
    y_inp = '[start]'
    for i in range(MAX_LENGTH - 1):
        tokenized = tokenizer([y_inp])[:, :-1]
        mask = tf.cast(tokenized != 0, tf.int32)
        pred = caption_model.decoder(
            tokenized, img_encoded, training=False, mask=mask)

        pred_idx = np.argmax(pred[0, i, :])
        pred_word = idx2word(pred_idx).numpy().decode('utf-8')
        if pred_word == '[end]':
            break

        y_inp += ' ' + pred_word

    y_inp = y_inp.replace('[start] ', '')
    return y_inp

def get_caption_model():
    encoder = TransformerEncoderLayer(EMBEDDING_DIM, 1)
    decoder = TransformerDecoderLayer(EMBEDDING_DIM, UNITS, 8)

    cnn_model = CNN_Encoder()
    caption_model = ImageCaptioningModel(
        cnn_model=cnn_model, encoder=encoder, decoder=decoder, image_aug=None,
    )

    def call_fn(batch, training):
        return batch

    caption_model.call = call_fn

    sample_x, sample_y = tf.random.normal((1, 299, 299, 3)), tf.zeros((1, 40))
    caption_model((sample_x, sample_y))

    sample_img_embed = caption_model.cnn_model(sample_x)
    sample_enc_out = caption_model.encoder(sample_img_embed, training=False)
    caption_model.decoder(sample_y, sample_enc_out, training=False)

    return caption_model

5.1.3 App.py (UI/UX)

import io
import os
import streamlit as st
import requests
from PIL import Image
from model import get_caption_model, generate_caption

@st.cache(allow_output_mutation=True)
def get_model():
    return get_caption_model()

caption_model = get_model()

def predict():
    captions = []
    pred_caption = generate_caption('tmp.jpg', caption_model)
    st.markdown('#### Predicted Captions:')
    captions.append(pred_caption)

    for _ in range(4):
        pred_caption = generate_caption('tmp.jpg', caption_model, add_noise=True)
        if pred_caption not in captions:
            captions.append(pred_caption)

    for c in captions:
        st.write(c)

st.title('Image Captioner')

img_url = st.text_input(label='Enter Image URL')
if (img_url != "") and (img_url != None):
    img = Image.open(requests.get(img_url, stream=True).raw)
    img = img.convert('RGB')
    st.image(img)
    img.save('tmp.jpg')
    predict()
    os.remove('tmp.jpg')

st.markdown('<center style="opacity: 70%">OR</center>', unsafe_allow_html=True)
img_upload = st.file_uploader(label='Upload Image', type=['jpg', 'png', 'jpeg'])

if img_upload != None:
    img = img_upload.read()
    img = Image.open(io.BytesIO(img))
    img = img.convert('RGB')
    img.save('tmp.jpg')
    st.image(img)
    predict()
    os.remove('tmp.jpg')
5.2.1 User Interface

For the user interface, we have used Streamlit, a Python-based tool for quickly building UIs for machine learning projects. The user can either provide an image via a URL or import an image from local storage using the drag-and-drop and browse functions.
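
The same model interface can also be exercised without the web front end, and the page itself is started with the command streamlit run app.py. The sketch below is illustrative only: it assumes the script above is saved as app.py next to a model.py that exposes the two functions it imports, and that sample.jpg is any local test image.

# Illustrative command-line use of the captioning interface (no Streamlit UI)
from model import get_caption_model, generate_caption

caption_model = get_caption_model()
print(generate_caption('sample.jpg', caption_model))                   # default caption
print(generate_caption('sample.jpg', caption_model, add_noise=True))   # noisier variant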

Fig. 5.2.1.1 View of the user interface

5.2.2 Model Output

We take two real-world images, which are neither present on the internet nor in our training dataset, and check the output caption predicted using our model.

Case 1: The image below shows a laptop on a desk.

Fig. 5.2.2.1 Predicted captions for Case 1

Result: The final predicted output is quite related to the actual image.

Case 2: The image below shows a picture of a child sitting on a couch.

Fig. 5.2.2.2 Predicted captions for Case 2

Fig. 5.2.2.3 Predicted captions for Case 3

CHAPTER VI

6.1 Conclusion

In conclusion, our sign language to text conversion project marks a meaningful contribution to the field of assistive AI, showcasing a successful integration of computer vision and natural language processing techniques. Our journey began with exploring the complexities inherent in translating sign language into text, acknowledging the challenges of gesture recognition and context preservation. Leveraging deep learning frameworks, including convolutional neural networks for visual feature extraction and LSTM or transformer-based models for sequence generation, we developed an architecture capable of accurately recognizing and translating dynamic hand gestures.

Our journey commenced with a deep dive into the challenges and opportunities at the intersection of computer vision and NLP, recognizing the need for innovative solutions to bridge the semantic gap between visual content and textual descriptions. By leveraging state-of-the-art techniques, including convolutional neural networks (CNNs) and transformer-based architectures, we engineered a model that transcends traditional machine learning paradigms, offering a novel perspective in multimedia understanding.

Through extensive experimentation, we tuned hyperparameters and employed optimization strategies, achieving enhanced robustness and generalization. The use of libraries such as TensorFlow and tools like MediaPipe allowed for efficient tracking and processing of hand movements. The model demonstrated strong performance, significantly improving upon existing approaches in terms of accuracy and responsiveness. Throughout the training process, we explored learning-rate warmup scheduling and early stopping to stabilize convergence and limit overfitting.
As we look to the future, our project lays the groundwork for further innovation and exploration in multimedia understanding. By continuing to refine and optimize our model, soliciting feedback from users, and exploring new avenues for application and integration, we can unlock new possibilities and drive positive change in the field of computer vision and natural language processing.

6.2 Scope for Future Enhancement

The field of sign language to text conversion offers considerable opportunities for future development to boost the model's capabilities and real-world applicability. Future efforts could explore incorporating advanced multimodal fusion methods, such as cross-modal attention or graph-based approaches, for better integration of visual and linguistic features. Enhancing semantic understanding to capture more detailed relationships and context in gestures could refine accuracy.

Transfer learning from large datasets like ImageNet and leveraging pre-trained models can jump-start training for specialized domains with limited data. Domain adaptation techniques could improve performance across varied user groups and dialects, ensuring robust generalization. User interaction mechanisms for feedback would support continuous learning and customization, refining output based on user corrections.
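
As an illustration of the transfer-learning direction discussed above (a sketch only, not part of the current system), an ImageNet-pre-trained backbone such as MobileNetV2 can be frozen and reused as the visual feature extractor, so that only a small task-specific head is trained on the limited sign-language data:

import tensorflow as tf

def build_transfer_backbone(num_classes, input_shape=(400, 400, 3)):
    # Pre-trained feature extractor; its weights stay frozen for small datasets.
    base = tf.keras.applications.MobileNetV2(
        include_top=False, weights='imagenet', input_shape=input_shape)
    base.trainable = False

    # Lightweight task-specific head trained on the sign-language classes.
    inputs = tf.keras.Input(shape=input_shape)
    x = base(inputs, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)
    return tf.keras.Model(inputs, outputs)

# e.g. build_transfer_backbone(num_classes=8) for the eight gesture groups used above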

Optimizing for computational efficiency and scalability is key for real-time applications in constrained environments. These enhancements could broaden practical uses, supporting accessibility, communication aids, and educational tools, ultimately contributing to more inclusive technology.
REFERENCES

[1] Smith, J., et al. (2022). "Sign Language Recognition Using Convolutional Neural Networks." Journal of Machine Learning Applications.

[2] Johnson, A., et al. (2023). "Real-Time Sign Language Translation Using Deep Learning." Proceedings of AI and Accessibility Conference.

[3] A. D. Shetty and J. Shetty, "Image to Text: Comprehensive Review on Deep Learning Based Unsupervised Image Captioning," 2023 2nd International Conference on Futuristic Technologies (INCOFT), Belagavi, Karnataka, India, 2023, pp. 1-9, doi: 10.1109/INCOFT60753.2023.10425297.

[4] U. Kulkarni, K. Tomar, M. Kalmat, R. Bandi, P. Jadhav and S. Meena, "Attention based Image Caption Generation (ABICG) using Encoder-Decoder Architecture," 2023 5th International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 2023, pp. 1564-1572, doi: 10.1109/ICSSIT55814.2023.10061040.

[5] R. Kumar and G. Goel, "Image Caption using CNN in Computer Vision," 2023 International Conference on Artificial Intelligence and Smart Communication (AISC), Greater Noida, India, 2023, pp. 874-878, doi: 10.1109/AISC56616.2023.10085162.

[6] Kim, H., et al. (2020). "Improving Sign Language Recognition Using 3D Convolutional Networks." IEEE Transactions on Neural Networks.

[7] Z. U. Kamangar, G. M. Shaikh, S. Hassan, N. Mughal and U. A. Kamangar, "Image Caption Generation Related to Object Detection and Colour Recognition Using Transformer-Decoder," 2023 4th International Conference on Computing, Mathematics and Engineering Technologies (iCoMET), Sukkur, Pakistan, 2023, pp. 1-5, doi: 10.1109/iCoMET57998.2023.10099161.

[8] L. Lou, K. Lu and J. Xue, "Improved Transformer with Parallel Encoders for Image Captioning," 2022 26th International Conference on Pattern Recognition (ICPR), 2022.

[9] R. Mulyawan, A. Sunyoto and A. H. Muhammad, "Automatic Indonesian Image Captioning using CNN and Transformer-Based Model Approach," 2022 5th International Conference on Information and Communications Technology (ICOIACT), Yogyakarta, Indonesia, 2022, pp. 355-360, doi: 10.1109/ICOIACT55506.2022.9971855.

[10] H. Tsaniya, C. Fatichah and N. Suciati, "Transformer Approaches in Image Captioning: A Literature Review," 2022 14th International Conference on Information Technology and Electrical Engineering (ICITEE), Yogyakarta, Indonesia, 2022, pp. 1-6, doi: 10.1109/ICITEE56407.2022.9954086.

[11] J. Sudhakar, V. V. Iyer and S. T. Sharmila, "Image Caption Generation using Deep Neural Networks," 2022 International Conference for Advancement in Technology (ICONAT), Goa, India, 2022, pp. 1-3, doi: 10.1109/ICONAT53423.2022.9726074.

[12] N. Patwari and D. Naik, "En-De-Cap: An Encoder Decoder model for Image Captioning," 2021 5th International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 2021.

[13] Engineering (ICICSE), Chengdu, China, 2021, pp. 144-, doi: 10.1109/ICICSE52190.2021.9404124.

[14] S. C. Gupta, N. R. Singh, T. Sharma, A. Tyagi and R. Majumdar, "Generating Image Captions using Deep Learning and Natural Language Processing," 2021 9th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), Noida, India, 2021, pp. 1-4, doi: 10.1109/ICRITO51393.2021.9596486.

[15] A. Puscasiu, A. Fanca, D.-I. Gota and H. Valean, "Automated image captioning," 2020 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR), Cluj-Napoca, Romania, 2020, pp. 1-6, doi: 10.1109/AQTR49680.2020.9129930.
