PROJECT
ON
"SIGN LANGUAGE TO TEXT CONVERSION"

GUIDE:
Ms. Preeti Kalra
Asst. Professor
Dept. of CSE

SUBMITTED BY:
Dheeraj Negi (04196502721)
Manish
CERTIFICATE

This is to certify that this project report entitled "SIGN LANGUAGE TO TEXT CONVERSION", submitted by Tushar Rawat, Aaditya N Choubey, Dheeraj Negi and Manish in partial fulfillment of the requirement for the degree of Bachelor of Technology in Computer Science Engineering of the Guru Gobind Singh Indraprastha University, Delhi, during the academic year 2021-25, is a bonafide record of work carried out under our guidance and supervision.

The results embodied in this report have not been submitted to any other University or Institution for the award of any degree or diploma.
DECLARATION

We, students of B.Tech, hereby declare that the major project titled "SIGN LANGUAGE TO TEXT CONVERSION" is submitted to the Department of Computer Science and Engineering, HMR Institute of Technology & Management.

This is to certify that the above statement made by the candidates is correct to the best of my knowledge.

New Delhi
HMR INSTITUTE OF TECHNOLOGY & MANAGEMENT
Hamidpur, Delhi-110036
(An ISO 9001:2008 certified, AICTE approved & GGSIP University affiliated institute)
E-mail: [email protected], Phone: 8130643674, 8130643690, 8287461931, 8287453693
ACKNOWLEDGEMENT

The success and outcome of this project required a lot of guidance and assistance from many people, and we are extremely privileged to have received this support throughout the completion of our project.

It is with profound gratitude that we express our deep indebtedness to our mentor, Ms. Preeti Kalra (Assistant Professor, Computer Science and Engineering), for her guidance and constant supervision, for providing the necessary information regarding the project, and for her consistent support in completing the project.

In addition, we would also like to take this opportunity to acknowledge the guidance of Ms. Usha (Dept. Coordinator, Computer Science and Engineering), whose kind cooperation and encouragement helped us in the successful completion of this project.

Tushar Rawat (00696502721)
ABSTRACT
CHAPTER I: INTRODUCTION
CHAPTER II: LITERATURE REVIEW & THEORETICAL CONCEPT
CHAPTER III: METHODOLOGY
CHAPTER IV: SYSTEM ANALYSIS AND DESIGN
CHAPTER V: SYSTEM IMPLEMENTATION
CHAPTER VI: CONCLUSION & FUTURE SCOPE
CONTENTS
Certificate
Declaration
Acknowledgement
Abstract
List of Figures
Chapter Organization
CHAPTER I: INTRODUCTION
1.1 Sign Text Generator
1.2 Project Scope
1.3 Objective
1.4 Motivation
1.5 Problem Statement
1.6 Problem Specification

CHAPTER V: SYSTEM IMPLEMENTATION
5.1 Source Code
5.1.1 Model Training
5.1.2 Final_pred.py
5.1.3 App.py (UI/UX)
5.2 Output
5.2.1 User Interface
5.2.2 Model Output

REFERENCES
CHAPTER I
INTRODUCTION
1.2 Project Scope
1.3 Objective
1.5 Problem Statement
1.6 Problem Specification
2. Contextual Relevance: Generated captions should be contextually relevant and coherent, reflecting the content and meaning conveyed by the input images accurately. This necessitates the integration of visual and textual modalities in a seamless manner, bridging the semantic gap between the two domains.

3. Efficiency and Scalability: The model should be efficient and scalable, capable of processing a diverse range of images in real-time or near real-time scenarios. This requires optimization of computational resources and algorithms to ensure rapid caption generation without compromising on accuracy or quality.
CHAPTER II
LITERATURE REVIEW / THEORETICAL CONCEPT

2.1 Preliminary Investigation
Before delving into the development of our sign language conversion model, a comprehensive preliminary investigation was conducted to assess the current state-of-the-art in image captioning systems and identify key challenges and opportunities in the field. This preliminary investigation involved a thorough review of existing literature, research papers, and experimental studies related to image captioning, as well as an analysis of publicly available datasets and benchmarking metrics commonly used to evaluate model performance.

One aspect of the preliminary investigation focused on understanding the underlying methodologies and architectures employed in existing image captioning systems. This involved studying the various components and techniques used, such as Convolutional Neural Network (CNN) encoders, transformer encoders, and decoder architectures, to gain insights into their strengths, limitations, and applicability in different contexts. Additionally, attention was given to recent advancements in the field, such as the integration of attention mechanisms, reinforcement learning techniques, and multimodal fusion approaches, to identify potential avenues for innovation and improvement in our model.
By conducting this preliminary investigation, we gained valuable insights into the current landscape of image captioning research, identified key challenges and opportunities, and laid the groundwork for the development of our innovative sign language conversion model. This systematic analysis provided a solid foundation upon which to design, implement, and evaluate our model, ensuring that it addresses the most pressing issues and achieves state-of-the-art performance.
2.2 Literature Survey

"Self-Critical Sequence Training for Image Captioning" by Rennie et al. (2017): This work introduces the self-critical sequence training algorithm, which optimizes caption generation models directly based on the performance metric, leading to improved caption quality.

"Learning to Describe Images with Human-Guided Policy Gradient" by Rennie et al. (2017): Introducing a novel reinforcement learning framework for image captioning, this work proposes a method that leverages human feedback to guide the caption generation process, improving the quality of generated captions.

"Image Captioning with Semantic Attention" by You et al. (2016): Introducing the concept of semantic attention, this work enhances the interpretability of image captioning models by explicitly attending to semantically meaningful regions of an image, leading to improved caption quality and relevance.
3. Computational Complexity: These models often demand high computational power and processing time, making them unsuitable for real-time applications, particularly on mobile or low-resource devices. This limits their scalability and practical utility in real-time or resource-constrained environments where rapid caption generation is essential.

4. Domain Specificity and Generalization: Many sign language recognition systems are designed for specific sign languages or user groups, limiting their applicability to diverse users or variations in sign language.

5. Dependency on Annotated Data: These systems often rely on large, annotated datasets that are not always diverse or comprehensive, restricting the model's ability to generalize to unseen gestures or new users.
2.4 Feasibility Study

By conducting a thorough feasibility study, we gained valuable insights into the practicality and viability of the image captioning project, enabling us to make informed decisions and mitigate risks throughout the development process.

2.5 Algorithms and Architectures

2.5.2 Convolutional Neural Network (CNN)

The Convolutional Neural Network (CNN) serves as a fundamental component of our image captioning model, tasked with extracting high-level visual features from input images. CNNs have revolutionized the field of Computer Vision, enabling the automated extraction of meaningful patterns and structures from raw pixel data.
At its core, a CNN comprises multiple layers, including convolutional layers, pooling layers, and fully connected layers. These layers work together to progressively extract hierarchical representations of the input images, capturing both low-level features such as edges and textures, as well as high-level semantic concepts.
One of the defining characteristics of CNNs is their ability to learn hierarchical representations of visual data. As information propagates through the network, lower layers capture basic visual features, while higher layers capture more abstract and complex concepts. This hierarchical organization enables CNNs to learn rich representations of images, making them well-suited for a wide range of computer vision tasks, including image classification, object detection, and image captioning.
In the context of our image captioning model, the CNN serves as the encoder component, extracting salient visual features from input images. These features are then passed to the decoder component, where they are combined with textual context to generate detailed and contextually relevant captions.
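As a concrete illustration of this encoder role, the following is a minimal sketch rather than the project's actual network; the layer sizes and the 256-dimensional feature vector are assumptions chosen for the example.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn_encoder(input_shape=(224, 224, 3), feature_dim=256):
    # Stacked convolution + pooling blocks: early layers pick up edges and
    # textures, deeper layers capture more abstract visual concepts.
    inputs = keras.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, activation="relu")(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(128, 3, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    # A compact feature vector that a caption decoder could consume.
    outputs = layers.Dense(feature_dim)(x)
    return keras.Model(inputs, outputs, name="cnn_encoder")

encoder = build_cnn_encoder()
features = encoder(tf.random.uniform((1, 224, 224, 3)))   # shape: (1, 256)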
2.5.3 Transformer

The Transformer architecture stands as a pivotal component within our image captioning model, representing a paradigm shift in sequence-to-sequence learning and revolutionizing the field of Natural Language Processing (NLP). Originally proposed for machine translation tasks, Transformers have since found widespread applications in various NLP tasks, owing to their ability to capture long-range dependencies and contextual nuances in sequential data.
At the heart of the Transformer architecture lie self-attention mechanisms, which enable the model to weigh the importance of different words in a sequence based on their contextual relevance. Unlike traditional recurrent neural networks (RNNs) and Long Short-Term Memory (LSTM) networks, which rely on sequential processing and suffer from vanishing gradients and computational inefficiency, Transformers leverage parallel processing over the whole sequence, allowing the model to discern contextual information and extract meaningful representations. In the decoder, self-attention mechanisms are augmented with additional attention heads that focus on both the input sequence and previously generated output tokens, facilitating the generation of coherent and contextually relevant predictions.
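As a hedged illustration of the self-attention idea described above, the snippet below applies Keras' MultiHeadAttention to a dummy sequence; the embedding size, head count, and causal mask are assumptions for the example, not the project's actual configuration.

import tensorflow as tf
from tensorflow.keras import layers

embed_dim, num_heads, seq_len = 256, 2, 10
attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)

x = tf.random.normal((1, seq_len, embed_dim))      # a batch of token embeddings
# Self-attention: query, key and value all come from the same sequence.
encoded = attention(query=x, value=x, key=x)

# Decoder-style causal mask: each position may only attend to itself and
# earlier positions, mirroring generation of one token at a time.
causal_mask = tf.linalg.band_part(tf.ones((1, seq_len, seq_len)), -1, 0)
decoded = attention(query=x, value=x, key=x, attention_mask=causal_mask)
print(encoded.shape, decoded.shape)                # both (1, 10, 256)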
2.6 Libraries

In the development of our image captioning model, we leverage a variety of libraries and frameworks to streamline implementation, expedite experimentation, and ensure compatibility with state-of-the-art techniques in Computer Vision and Natural Language Processing. These libraries provide essential functionality for tasks such as image processing, neural network modeling, and evaluation metrics computation.
Additionally, we make extensive use of the Keras API, which serves as a high-level interface for TensorFlow and other deep learning frameworks. Keras simplifies the process of building and training neural networks, providing a user-friendly API for defining model architectures, specifying loss functions, and configuring optimization algorithms. We leverage Keras to construct the decoder component of our image captioning model, taking advantage of its intuitive syntax and modular design principles to facilitate rapid prototyping and experimentation.
Furthermore, we employ specialized libraries for evaluation metrics computation, such as NLTK (Natural Language Toolkit). These libraries provide standardized implementations of evaluation metrics commonly used in image captioning research, enabling us to objectively assess the performance of our model and compare it with state-of-the-art approaches.
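For instance, a sentence-level BLEU score can be computed with NLTK as in the minimal sketch below; the reference and candidate captions are made-up examples rather than outputs of our model.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# One (or more) tokenized reference captions and one candidate caption.
reference = [["a", "person", "signing", "the", "letter", "a"]]
candidate = ["a", "person", "signing", "letter", "a"]

# Smoothing avoids zero scores when higher-order n-grams have no overlap.
smooth = SmoothingFunction().method1
score = sentence_bleu(reference, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")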
2.7 Anaconda

Anaconda is a powerful distribution platform and package manager designed for data science and machine learning tasks. Developed by Anaconda, Inc., Anaconda simplifies the process of setting up and managing software environments, providing a comprehensive ecosystem of tools, libraries, and frameworks tailored for data analysis, scientific computing, and artificial intelligence.
One of the key features of Anaconda is its package management system, which allows users to easily install, update, and manage thousands of open-source packages and libraries. These packages encompass a wide range of domains, including numerical computing (e.g., NumPy, SciPy), data manipulation (e.g., pandas), machine learning (e.g., scikit-learn, TensorFlow, PyTorch), and visualization (e.g., Matplotlib, Seaborn). By providing a centralized repository of curated packages, Anaconda simplifies the process of building and deploying data-driven applications, enabling users to focus on solving problems rather than managing dependencies.
Moreover, Anaconda offers a powerful environment management system, allowing users to create isolated environments with specific versions of Python and packages. This enables reproducible research and development workflows, ensuring consistency across different projects and environments. With Anaconda, users can easily switch between different environments, experiment with different configurations, and share their work with collaborators without worrying about compatibility issues or dependency conflicts.
2.9 Streamlit

Streamlit is a cutting-edge open-source Python library that empowers developers to create interactive web applications for machine learning and data science projects with remarkable ease and efficiency. Developed with a focus on simplicity and productivity, Streamlit simplifies the process of building and deploying data-driven applications, enabling developers to showcase their machine learning models, visualizations, and analyses in a user-friendly web interface.
Unlike traditional web frameworks that require HTML, CSS, and JavaScript code to create web interfaces, Streamlit allows developers to build interactive applications using nothing but Python. This novel approach significantly reduces the barrier to entry for building web applications, enabling developers with minimal web development experience to create compelling and interactive data-driven applications with ease.
One of the key features of Streamlit is its automatic reactivity, which enables applications to automatically update in response to user interactions or changes in input data. This reactive behavior eliminates the need for complex event handling or callback mechanisms, streamlining the development process and making it easier to create dynamic and responsive web applications.
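The tiny sketch below illustrates this reactive model: the whole script re-runs whenever a widget value changes, so no callbacks are registered explicitly. The widgets and the threshold shown here are hypothetical and unrelated to the actual App.py listed in Chapter V.

import streamlit as st

threshold = st.slider("Confidence threshold", 0.0, 1.0, 0.5)
uploaded = st.file_uploader("Upload a hand-sign image", type=["jpg", "png", "jpeg"])

if uploaded is not None:
    st.image(uploaded)
    # On every slider move or new upload, Streamlit re-executes the script and
    # this branch runs again with the fresh widget values.
    st.write(f"Predictions below {threshold:.2f} confidence would be hidden.")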
Furthermore, Streamlit offers seamless integration with popular machine learning libraries such as TensorFlow, PyTorch, and scikit-learn, allowing developers to showcase their machine learning models and experiments in a user-friendly web interface.
CHAPTER III
METHODOLOGY

3.1 Introduction
We outline our approach to model architecture design, including the selection of hyperparameters, network architectures, and optimization algorithms, and discuss how these choices were informed by prior research and experimentation.
3.2 Methodology

3.2.1 Data Collection
Data collection is a pivotal phase in any machine learning project, including the development of an image captioning model. This stage involves gathering a comprehensive dataset consisting of images paired with corresponding captions. The quality and diversity of the dataset play a crucial role in the performance and generalization capabilities of the model. In this section, we delve into the intricacies of data collection, discussing various considerations, methodologies, and sources.
The first step in data collection is to identify suitable sources from which to gather images and their associated captions. Depending on the specific application and requirements of the project, these sources can vary widely. Common sources include publicly available datasets, online image repositories, and specialized datasets curated for specific tasks.
In addition to publicly available datasets, researchers and practitioners may also curate their own datasets tailored to the task at hand.

Once the dataset sources have been identified, the next step is to collect and preprocess the data. This involves downloading the images from the chosen sources and extracting the associated captions. Depending on the dataset format and structure, this process may vary in complexity. For example, some datasets provide direct download links to images and captions, while others may require web scraping or API access to retrieve the data.
3.2.2 Data Preprocessing

Data preprocessing is a crucial step in preparing the dataset for training an image captioning model. It involves several tasks aimed at cleaning, transforming, and organizing the data to ensure its suitability for the model. In this section, we discuss the data preprocessing steps based on the provided code and their significance in the context of the image captioning project.
1. Tokenization:

The first step in data preprocessing is tokenization, where each caption is split into individual tokens or words. This is essential for converting textual data into a format that can be processed by the model. Additionally, a vocabulary is built from the tokenized captions to map words to numerical indices. This vocabulary is used to convert words into their corresponding numerical representations during training and inference; a minimal sketch of this step is shown below.
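The following hedged sketch uses Keras' TextVectorization layer to tokenize captions and build a vocabulary; the sample captions, the vocabulary cap, and the sequence length are illustrative assumptions.

import tensorflow as tf

captions = ["a person signing hello", "a hand forming the letter b"]   # toy examples

vectorization = tf.keras.layers.TextVectorization(
    max_tokens=5000,                # assumed vocabulary cap
    output_sequence_length=40,      # pad/truncate every caption to a fixed length
)
vectorization.adapt(captions)       # builds the word-to-index vocabulary

tokens = vectorization(captions)    # integer indices, one row per caption
print(tokens.shape, vectorization.get_vocabulary()[:6])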
2. Image Preprocessing:

Images are loaded from the dataset and preprocessed to ensure consistency and compatibility with the model. Preprocessing steps may include resizing images to a uniform size, converting them to a suitable color format (e.g., RGB), and normalizing pixel values to a predefined range. These steps help in reducing variability across the dataset.
3. Sequence Padding:

Since captions may vary in length, it is necessary to pad or truncate them to a fixed length to create uniform input sequences. This is achieved by appending padding tokens to shorter captions or truncating tokens from longer captions. Sequence padding ensures that all captions have the same length, facilitating batch processing during training, as the sketch below illustrates.
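A minimal sketch of explicit padding with the Keras utility; the token ids are placeholders and the fixed length of 40 simply mirrors the MAX_LENGTH constant used later in the implementation.

import tensorflow as tf

tokenized_captions = [[2, 15, 7, 3], [2, 15, 7, 9, 21, 3]]   # variable-length token ids
padded = tf.keras.utils.pad_sequences(tokenized_captions, maxlen=40,
                                      padding="post", truncating="post", value=0)
print(padded.shape)   # (2, 40): every caption now has the same length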
4. Data Splitting:

The dataset is split into training, validation, and test sets to assess the model's performance, as sketched below. Typically, most of the data is used for training, while smaller portions are allocated for validation and testing. The training set is used to update the model's parameters, the validation set is used for hyperparameter tuning and model selection, and the test set is used for final evaluation.
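As an example of an 80/10/10 split, the hedged sketch below uses scikit-learn on placeholder lists; the ratios and variable names are assumptions rather than the project's actual configuration.

from sklearn.model_selection import train_test_split

image_paths = [f"img_{i}.jpg" for i in range(100)]   # placeholder file names
captions = [f"caption {i}" for i in range(100)]      # placeholder captions

# Hold out 20% of the data, then split that portion evenly into validation and test.
train_x, rest_x, train_y, rest_y = train_test_split(
    image_paths, captions, test_size=0.2, random_state=42)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.5, random_state=42)

print(len(train_x), len(val_x), len(test_x))         # 80 10 10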
5. Data Augmentation (Optional):

Data augmentation techniques may be applied to increase the diversity and robustness of the dataset. This can involve random transformations such as rotations, flips, or changes in brightness and contrast applied to the images. Data augmentation helps prevent overfitting and improves the model's generalization ability; a sketch of typical image augmentations follows.
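A minimal sketch of image augmentation with Keras preprocessing layers; the particular transforms and their strengths are assumptions chosen for illustration.

import tensorflow as tf
from tensorflow.keras import layers

image_aug = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),     # mirror images left/right
    layers.RandomRotation(0.05),         # small random rotations
    layers.RandomContrast(0.2),          # mild contrast jitter
])

images = tf.random.uniform((4, 224, 224, 3))    # dummy batch of images
augmented = image_aug(images, training=True)    # augmentation is active only in training mode
print(augmented.shape)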
6. Data Serialization:

Once preprocessing is complete, the preprocessed data is serialized and saved to disk for efficient storage and retrieval during training. This includes saving tokenized captions, image features, and any additional metadata required for training the model. Serialized data allows for seamless integration with the model training pipeline.
3.2.3 Feature Extraction

Feature extraction is a foundational step in the process of analyzing and understanding images within the realm of computer vision. It involves the extraction of meaningful and discriminative features from raw image data, which are essential for subsequent analysis and interpretation tasks. In the context of our project on image captioning, feature extraction is a crucial component that enables the generation of descriptive captions for input images.
For this purpose we adopt EfficientNetB0, a convolutional architecture known for its strong balance of accuracy and efficiency on large image datasets. Trained on the ImageNet dataset, EfficientNetB0 has demonstrated superior performance in various computer vision tasks, making it an ideal candidate for feature extraction in our image captioning pipeline.
Before feeding images into the EfficientNetB0 model, preprocessing steps are applied to ensure compatibility and standardization of input data. These preprocessing steps typically involve resizing images to the required input dimensions and normalizing pixel values to fall within a standardized range. By standardizing the input data, preprocessing facilitates consistent and accurate feature extraction across diverse image datasets.
Once preprocessed, images are passed through the layers of the EfficientNetB0 model to extract high-level visual features. The architecture of EfficientNetB0 comprises multiple layers of convolutional and pooling operations, which progressively analyze and abstract visual information from input images. As images propagate through the network, features are extracted at different levels of abstraction, ranging from simple edge detectors to complex semantic representations.
After traversing the layers of EfficientNetB0, features are obtained from one of its intermediate layers. These features are represented as a multidimensional feature map, where each element encodes a specific aspect of the input images. The feature map encapsulates salient visual information captured by the CNN, providing a rich representation of the input images' content.
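The hedged sketch below shows how such a feature map could be obtained from a pre-trained EfficientNetB0 in Keras; the 224x224 input size, the frozen weights, and the reshape into per-region vectors are assumptions for illustration.

import tensorflow as tf

base = tf.keras.applications.EfficientNetB0(include_top=False, weights="imagenet")
base.trainable = False                                      # frozen feature extractor

image = tf.random.uniform((1, 224, 224, 3), maxval=255.0)   # dummy image batch
image = tf.keras.applications.efficientnet.preprocess_input(image)

feature_map = base(image, training=False)                   # spatial feature map, e.g. (1, 7, 7, 1280)
# Flatten the spatial grid so a decoder can attend over one vector per image region.
features = tf.keras.layers.Reshape((-1, feature_map.shape[-1]))(feature_map)
print(features.shape)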
3.2.4 Model Selection

In the realm of machine learning, selecting an appropriate model architecture is a critical decision that significantly impacts the performance and effectiveness of a project. In our image captioning endeavor, we conducted a thorough exploration of various model architectures to identify the most suitable one for our task. Here is a detailed overview of our model selection process:
Each candidate architecture was evaluated based on its suitability for handling visual data and generating textual descriptions.
EfficientNet-Based CNN:

After careful consideration, we opted to utilize the EfficientNet architecture as the backbone for our CNN-based image feature extractor. EfficientNet is known for its superior performance and efficiency across a wide range of image recognition tasks. By leveraging pre-trained weights from the ImageNet dataset, we were able to harness rich visual representations extracted from images efficiently.
Transformer-Based Decoder:

For the sequence generation component of our model, we employed a transformer-based decoder architecture. Transformers have emerged as a powerful framework for sequence-to-sequence tasks, offering advantages such as attention mechanisms and parallel processing. Our decoder architecture consisted of multiple transformer decoder blocks, each responsible for generating a portion of the caption sequentially.
Final Selection:

After thorough experimentation and evaluation, we identified a model configuration that struck a balance between performance and efficiency, and we chose the Transformer. The selected architecture demonstrated superior captioning accuracy, robustness to variations in input data, and reasonable computational requirements. Furthermore, its modular design facilitated easy integration of additional enhancements and optimizations.
3.2.5 Model Training

Model training in our image captioning project is a pivotal stage where we orchestrate the convergence of various components to optimize the model's performance. We partition the dataset into training, validation, and test sets, ensuring a balanced distribution of examples across each partition. This ensures that the model learns from a diverse range of data samples, facilitating robust generalization.
Model Architecture:

At the heart of our image captioning system lies a sophisticated architecture comprising a convolutional neural network (CNN) encoder and a transformer-based decoder. The CNN encoder, instantiated using EfficientNetB0 pre-trained weights, extracts salient visual features from input images. These features serve as the input to the transformer decoder, which generates descriptive captions based on the encoded visual information.
During model training, we employ the Sparse Categorical Crossentropy loss function to quantify the disparity between predicted captions and ground-truth captions. The Adam optimizer, augmented with a custom learning rate scheduler, facilitates efficient parameter optimization by adjusting the model's parameters iteratively based on computed gradients. Additionally, we incorporate early stopping criteria to prevent overfitting and enhance model generalization.
Training Procedure:

Hyperparameter Tuning:

Throughout the training process, we monitor the model's performance using evaluation metrics such as loss and accuracy. Qualitative assessment through visual inspection of generated captions aids in identifying syntactic or semantic errors. The model's performance on the validation set serves as a benchmark for its generalization ability, guiding further iterations of hyperparameter tuning and model refinement.
Fig. 3.2.5.2
Fig. 3.2.5.3: Convolutional Layer
In summary, our model training pipeline embodies a holistic approach to training and optimization, leveraging state-of-the-art algorithms and methodologies to develop a robust and effective image captioning model. Through meticulous data preparation, thoughtful architectural design, and systematic hyperparameter tuning, we ensure that our model achieves superior performance and generalization ability, paving the way for transformative applications in multimedia understanding and natural language processing.
Model Persistence:

Saving and loading trained models is a critical aspect of machine learning projects, enabling the preservation and reuse of valuable model configurations and learned parameters. Understanding how to effectively save and load the model is essential for its deployment, sharing, and further experimentation.
Saving the Model:

When saving our image captioning model, we go through a series of steps to ensure that all necessary components are preserved accurately. The process typically involves saving the model architecture, its learned weights, and any additional configuration details.
Firstly, we serialize the architecture of our model, which encompasses the arrangement of its layers, their connections, and the overall configuration. This is achieved using the `model.to_json()` method, which converts the model's structure into a JSON format. Alternatively, the architecture can be saved in YAML format using `model.to_yaml()`.
Once the model architecture is serialized, we proceed to save the learned parameters, commonly referred to as weights. These weights represent the knowledge acquired by the model during the training process and are essential for reproducing its behavior accurately. The `model.save_weights()` method is utilized for this purpose, which stores the weights in a binary format compatible with TensorFlow.
In addition to the model architecture and weights, any auxiliary information required to fully restore the model's state is saved. This may include optimizer states, training configurations, or any custom objects used in the model. Saving these details ensures that the model can be reinstated with all necessary settings intact.
Finally, we save the serialized architecture, learned weights, and additional information to disk using the `model.save()` method. This function allows us to specify the location where the model will be stored, creating a comprehensive snapshot that can be easily retrieved when needed.
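A minimal sketch of this saving workflow, assuming a small stand-in Keras model; the file names are illustrative, not the project's.

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(8,))])   # stand-in model

# Architecture only, as JSON.
with open("model_architecture.json", "w") as f:
    f.write(model.to_json())

# Learned parameters only.
model.save_weights("model.weights.h5")

# Architecture, weights, and training configuration in a single file.
model.save("full_model.h5")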
Loading the Model:

To load a saved model, we follow a systematic process to reconstruct its architecture, restore its learned weights, and apply any necessary configurations. The steps involved in loading a model are designed to ensure that its state is restored accurately. First, the serialized architecture is read from disk and deserialized (`tf.keras.models.model_from_json()` or `tf.keras.models.model_from_yaml()`). This step reconstructs the model's architecture, laying the foundation for further restoration.
Once the architecture is reconstructed, we proceed to build an empty model based on this architecture. This empty model serves as a placeholder onto which we load the learned weights and apply any additional settings.
3.2.7 Inferencing

Inference, the process of generating captions for new images using our trained model, is a crucial step in evaluating the effectiveness and practical utility of our image captioning system. Leveraging the robustness and flexibility of our model architecture, we have developed a streamlined inference pipeline that enables efficient and accurate caption generation for a wide range of images.
At the heart of our inference pipeline lies the integration of our trained model with an image preprocessing module, which prepares the input images for processing by the model. This preprocessing step involves resizing, normalization, and augmentation of the input images to ensure compatibility with the model's input requirements. By standardizing the input format, we ensure consistent and reliable performance of our model across diverse image datasets.
With the visual features extracted by the CNN encoder, the transformed embeddings are then passed through the transformer decoder, which produces coherent and contextually relevant captions, capturing the semantic nuances and details present in the input images.
During inference, our model generates multiple candidate captions for each input image, allowing for diverse and expressive output. This multi-caption generation approach enables us to capture the inherent variability and richness of natural language, providing users with a range of captioning options to choose from. Additionally, the model's flexibility in handling variable-length input sequences ensures that captions of varying lengths can be generated to accommodate the specific content and complexity of each image.
3.2.8 Deployment

Deployment marks the culmination of our efforts in developing an image captioning system, as it involves making our model accessible and usable to end-users. Leveraging modern web technologies and frameworks, we have created a user-friendly interface for our image captioning system, enabling seamless interaction and integration into various applications and platforms.
At the core of our deployment strategy is the adoption of Streamlit, a powerful Python library for building interactive web applications. Streamlit provides us with a straightforward and intuitive way to design and deploy our user interface, allowing us to focus on delivering a compelling user experience without the need for extensive web development expertise.
Our deployment process begins with packaging our trained model and inference pipeline into a standalone Python application. This application serves as the backend logic for our image captioning system, handling incoming image inputs, processing them through the model, and generating descriptive captions in real-time. By encapsulating our model within a standalone application, we ensure portability and scalability, enabling easy deployment across various environments and platforms.
With the backend logic and model weights packaged into a standalone application, we proceed to develop the frontend interface using Streamlit.
CHAPTER IV
SYSTEM ANALYSIS AND DESIGN
4.1 Software Requirements

4.1.1 Functional Requirements

The functional requirements of our image captioning project encompass the core functionalities and capabilities that the system must exhibit to meet the needs and expectations of users. These requirements are defined based on the intended functionality of the image captioning model and the specific use cases it aims to address. Some key functional requirements include:
1. Image Feature Extraction: The system must be capable of extracting high-level visual features from input images using a Convolutional Neural Network (CNN) encoder. This involves processing the raw pixel data of images and encoding them into a compact feature representation that captures relevant visual semantics.
2. Textual Feature Extraction: In addition to visual features, the system must extract textual features from input captions using a transformer encoder. This involves tokenizing and encoding the textual input into a numerical representation that captures semantic and contextual information.
3. Semantic Fusion: The system must integrate visual and textual features using attention mechanisms to facilitate semantic fusion. This involves leveraging the learned representations from both modalities to capture meaningful correlations between visual content and textual descriptions.
5. Scalability and Efficiency: The system must be scalable and efficient, capable of processing a diverse range of images and captions in real-time or near real-time scenarios. This involves optimizing computational resources, algorithms, and data pipelines to ensure rapid and efficient caption generation without compromising on accuracy or quality.
4.1.2 Non-Functional Requirements

In addition to the functional requirements outlined for our image captioning project, several non-functional requirements must be considered to ensure the system's overall performance, usability, and reliability. These non-functional requirements encompass aspects such as performance, usability, reliability, scalability, and security, all of which are critical for the success and acceptance of the system.
1. Performance: The system must exhibit high performance in terms of speed and efficiency, capable of processing images and generating captions in a timely manner. This involves optimizing algorithms, data structures, and computational resources to minimize latency and maximize throughput during inference and training phases.
2. Usability: The system must be user-friendly and intuitive, with a well-designed interface that enables users to interact with the system easily. This includes providing clear instructions, feedback, and error messages to guide users through the captioning process and ensure a seamless user experience.
3. Reliability: The system must be reliable and robust, capable of handling errors, failures, and unexpected inputs gracefully. This involves implementing error handling mechanisms, backup and recovery techniques, and leveraging cloud-based resources to accommodate growing demands and user populations.
5. Security: The system must adhere to stringent security standards and protocols to protect sensitive data, such as user information and image content, from unauthorized access or manipulation. This involves implementing authentication, encryption, and access control mechanisms to safeguard data integrity and confidentiality throughout the captioning process.
6. Maintainability: The system must be maintainable, with well-documented code, clear architecture, and modular design principles that facilitate easy maintenance, updates, and enhancements. This includes providing documentation, version control, and automated testing tools to streamline the development and maintenance process and ensure long-term sustainability.
By addressing these non-functional requirements, our image captioning system aims to deliver a reliable, efficient, and user-friendly solution that meets the needs and expectations of users while adhering to the highest standards of performance, usability, reliability, scalability, and security. These non-functional aspects are essential for ensuring the overall success and acceptance of the system in various real-world applications and environments.
CHAPTER V
SYSTEM IMPLEMENTATION

5.1 Source Code

5.1.1 Model Training

Setup
import math
import cv2
from cvzone.HandTrackingModule import HandDetector
import numpy as np
from keras.models import load_model
import traceback
model = load_model('/cnn8grps_rad1_model.h5')
white = np.ones((400, 400), np.uint8) * 255
cv2.imwrite("C:\\Users\\devansh raval\\PycharmProjects\\pythonProject\\white.jpg", white)
capture = cv2.VideoCapture(0)
hd = HandDetector(maxHands=1)
hd2 = HandDetector(maxHands=1)
offset = 29
step = 1
flag = False
suv = 0
while True:
    try:
        _, frame = capture.read()
        frame = cv2.flip(frame, 1)
        hands = hd.findHands(frame, draw=False, flipType=True)
        print(frame.shape)
        if hands:
            hand = hands[0]
            x, y, w, h = hand['bbox']
            image = frame[y - offset:y + h + offset, x - offset:x + w + offset]
            white = cv2.imread("C:\\Users\\devansh raval\\PycharmProjects\\pythonProject\\white.jpg")
            handz = hd2.findHands(image, draw=False, flipType=True)
            if handz:
                hand = handz[0]
                pts = hand['lmList']
                os = ((400 - w) // 2) - 15
                os1 = ((400 - h) // 2) - 15

                # draw the skeleton of each finger on the white canvas
                for t in range(0, 4, 1):
                    cv2.line(white, (pts[t][0] + os, pts[t][1] + os1),
                             (pts[t + 1][0] + os, pts[t + 1][1] + os1), (0, 255, 0), 3)
                for t in range(5, 8, 1):
                    cv2.line(white, (pts[t][0] + os, pts[t][1] + os1),
                             (pts[t + 1][0] + os, pts[t + 1][1] + os1), (0, 255, 0), 3)
                for t in range(9, 12, 1):
                    cv2.line(white, (pts[t][0] + os, pts[t][1] + os1),
                             (pts[t + 1][0] + os, pts[t + 1][1] + os1), (0, 255, 0), 3)
                for t in range(13, 16, 1):
                    cv2.line(white, (pts[t][0] + os, pts[t][1] + os1),
                             (pts[t + 1][0] + os, pts[t + 1][1] + os1), (0, 255, 0), 3)
                for t in range(17, 20, 1):
                    cv2.line(white, (pts[t][0] + os, pts[t][1] + os1),
                             (pts[t + 1][0] + os, pts[t + 1][1] + os1), (0, 255, 0), 3)

                # connect the palm landmarks
                cv2.line(white, (pts[5][0] + os, pts[5][1] + os1), (pts[9][0] + os, pts[9][1] + os1), (0, 255, 0), 3)
                cv2.line(white, (pts[9][0] + os, pts[9][1] + os1), (pts[13][0] + os, pts[13][1] + os1), (0, 255, 0), 3)
                cv2.line(white, (pts[13][0] + os, pts[13][1] + os1), (pts[17][0] + os, pts[17][1] + os1), (0, 255, 0), 3)
                cv2.line(white, (pts[0][0] + os, pts[0][1] + os1), (pts[5][0] + os, pts[5][1] + os1), (0, 255, 0), 3)
                cv2.line(white, (pts[0][0] + os, pts[0][1] + os1), (pts[17][0] + os, pts[17][1] + os1), (0, 255, 0), 3)
                for i in range(21):
                    cv2.circle(white, (pts[i][0] + os, pts[i][1] + os1), 2, (0, 0, 255), 1)

                cv2.imshow("2", white)

                # predict probabilities for the skeleton image and keep the top classes
                white = white.reshape(1, 400, 400, 3)
                prob = np.array(model.predict(white)[0], dtype='float32')
                ch1 = np.argmax(prob, axis=0)
                prob[ch1] = 0
                ch2 = np.argmax(prob, axis=0)
                prob[ch2] = 0
                ch3 = np.argmax(prob, axis=0)
                prob[ch3] = 0

                pl = [ch1, ch2]
                # ch1 = 0
                if pl in l:          # `l` is defined earlier in the full script
                    if pts[5][0] < pts[4][0]:
                        ch1 = 0
                        print("++++++++++++++++++")

                # condition for [c0][aemnst]
                l = [[0, 0], [0, 6], [0, 2], [0, 5], [0, 1], [0, 7], [5, 2], [7, 6], [7, 1]]
                pl = [ch1, ch2]
                if pl in l:
                    ch1 = 2

                # condition for [gh][bdfikruvw]
                l = [[1, 4], [1, 5], [1, 6], [1, 3], [1, 0]]
                pl = [ch1, ch2]
                if pl in l:
                    if (pts[6][1] > pts[8][1] and pts[14][1] < pts[16][1] and pts[18][1] < pts[20][1]
                            and pts[0][0] < pts[8][0] and pts[0][0] < pts[12][0]
                            and pts[0][0] < pts[16][0] and pts[0][0] < pts[20][0]):
                        ch1 = 3
                        print("33333c")
def predict(self, test_image):
    white = test_image
    white = white.reshape(1, 400, 400, 3)
    prob = np.array(self.model.predict(white)[0], dtype='float32')
    ch1 = np.argmax(prob, axis=0)
    prob[ch1] = 0
    ch2 = np.argmax(prob, axis=0)
    prob[ch2] = 0
    ch3 = np.argmax(prob, axis=0)
    prob[ch3] = 0
    pl = [ch1, ch2]
1. Read the image from the disk
2. Tokenize all the five captions corresponding to the image
def decode_and_resize(img_path):
    img = tf.io.read_file(img_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, IMAGE_SIZE)
    img = tf.image.convert_image_dtype(img, tf.float32)
    return img


def process_input(img_path, captions):
    return decode_and_resize(img_path), vectorization(captions)


def make_dataset(images, captions):
    dataset = tf.data.Dataset.from_tensor_slices((images, captions))
    dataset = dataset.shuffle(BATCH_SIZE * 8)
    dataset = dataset.map(process_input, num_parallel_calls=AUTOTUNE)
    dataset = dataset.batch(BATCH_SIZE).prefetch(AUTOTUNE)
    return dataset
Building the model
import cv2
from cvzone.HandTrackingModule import HandDetector
from cvzone.ClassificationModule import Classifier
import numpy as np
import os, os.path
from keras.models import load_model
import traceback
capture = cv2.VideoCapture(0)
hd = HandDetector(maxHands=1)
hd2 = HandDetector(maxHands=1)
# #training data
# count = len(os.listdir("D://sign2text_dataset_2.0/Binary_imgs//A"))
if hands:
    hand = hands[0]
    x, y, w, h = hand['bbox']
    image = frame[y - offset:y + h + offset, x - offset:x + w + offset]
    # image1 = imgg[y - offset:y + h + offset, x - offset:x + w + offset]

if hands:
    hand = hands[0]
    x, y, w, h = hand['bbox']
    image = frame[y - offset:y + h + offset, x - offset:x + w + offset]
    white = cv2.imread("C:\\Users\\devansh raval\\PycharmProjects\\pythonProject\\white.jpg")
    handz = hd2.findHands(image, draw=False, flipType=True)
    if handz:
        hand = handz[0]
        pts = hand['lmList']
        os = ((400 - w) // 2) - 15
        os1 = ((400 - h) // 2) - 15

        # draw the skeleton of each finger on the white canvas
        for t in range(0, 4, 1):
            cv2.line(white, (pts[t][0] + os, pts[t][1] + os1),
                     (pts[t + 1][0] + os, pts[t + 1][1] + os1), (0, 255, 0), 3)
        for t in range(5, 8, 1):
            cv2.line(white, (pts[t][0] + os, pts[t][1] + os1),
                     (pts[t + 1][0] + os, pts[t + 1][1] + os1), (0, 255, 0), 3)
        for t in range(9, 12, 1):
            cv2.line(white, (pts[t][0] + os, pts[t][1] + os1),
                     (pts[t + 1][0] + os, pts[t + 1][1] + os1), (0, 255, 0), 3)
        for t in range(13, 16, 1):
            cv2.line(white, (pts[t][0] + os, pts[t][1] + os1),
                     (pts[t + 1][0] + os, pts[t + 1][1] + os1), (0, 255, 0), 3)
        for t in range(17, 20, 1):
            cv2.line(white, (pts[t][0] + os, pts[t][1] + os1),
                     (pts[t + 1][0] + os, pts[t + 1][1] + os1), (0, 255, 0), 3)

        # connect the palm landmarks
        cv2.line(white, (pts[5][0] + os, pts[5][1] + os1), (pts[9][0] + os, pts[9][1] + os1), (0, 255, 0), 3)
        cv2.line(white, (pts[9][0] + os, pts[9][1] + os1), (pts[13][0] + os, pts[13][1] + os1), (0, 255, 0), 3)
        cv2.line(white, (pts[13][0] + os, pts[13][1] + os1), (pts[17][0] + os, pts[17][1] + os1), (0, 255, 0), 3)
        cv2.line(white, (pts[0][0] + os, pts[0][1] + os1), (pts[5][0] + os, pts[5][1] + os1), (0, 255, 0), 3)
        cv2.line(white, (pts[0][0] + os, pts[0][1] + os1), (pts[17][0] + os, pts[17][1] + os1), (0, 255, 0), 3)
        for i in range(21):
            cv2.circle(white, (pts[i][0] + os, pts[i][1] + os1), 2, (0, 0, 255), 1)

        cv2.imshow("skeleton", white)

        hands = hd.findHands(white, draw=False, flipType=True)
        if hands:
            hand = hands[0]
            x, y, w, h = hand['bbox']
            cv2.rectangle(white, (x - offset, y - offset), (x + w, y + h), (3, 255, 25), 3)

        test_image2 = blur1                          # `blur1` is computed earlier in the full script
        img_final2 = np.ones((400, 400), np.uint8) * 148
        h = test_image2.shape[0]
        w = test_image2.shape[1]
        img_final2[((400 - h) // 2):((400 - h) // 2) + h, ((400 - w) // 2):((400 - w) // 2) + w] = test_image2

        # cv2.imshow("gray", img_final2)
        cv2.imshow("binary", img_final)              # `img_final` is computed earlier in the full script
# ch3 = np.argmax(prob, axis=0)
# prob[ch3] = 0
# ch1 = chr(ch1 + 65)
# ch2 = chr(ch2 + 65)
# ch3 = chr(ch3 + 65)
# frame = cv2.putText(frame, "Predicted " + ch1 + " " + ch2 + " " + ch3,
#                     (x - offset - 150, y - offset - 10),
#                     cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 0, 0), 1, cv2.LINE_AA)
# test data
count = len(os.listdir("D://test_data_2.0/Gray_imgs//" + p_dir + "//"))
print("=====", flag)
if flag == True:
    if suv == 50:
        flag = False
    if step % 2 == 0:
        # this is for training data collection
        # cv2.imwrite("D:\\sign2text_dataset_2.0\\Binary_imgs\\" + p_dir + "\\" + c_dir + str(count) + ".jpg", img_final)
        # cv2.imwrite("D:\\sign2text_dataset_2.0\\Gray_imgs\\" + p_dir + "\\" + c_dir + str(count) + ".jpg", img_final1)
        # cv2.imwrite("D:\\sign2text_dataset_2.0\\Gray_imgs_with_drawing\\" + p_dir + "\\" + c_dir + str(count) + ".jpg", img_final2)
        count += 1
        suv += 1
    step += 1

except Exception:
    print("==", traceback.format_exc())

capture.release()
cv2.destroyAllWindows()
        return tf.reduce_sum(accuracy) / tf.reduce_sum(mask)

        batch_seq_true = batch_seq[:, 1:]
        mask = tf.math.not_equal(batch_seq_true, 0)
        batch_seq_pred = self.decoder(
        )
        loss = self.calculate_loss(batch_seq_true, batch_seq_pred, mask)
        acc = self.calculate_accuracy(batch_seq_true, batch_seq_pred, mask)
        return loss, acc

    def train_step(self, batch_data):
        batch_img, batch_seq = batch_data
        batch_loss = 0
        batch_acc = 0

        if self.image_aug:
            batch_img = self.image_aug(batch_img)

        # 1. Get image embeddings
        img_embed = self.cnn_model(batch_img)

        batch_acc += acc

        # 4. Get the list of all the trainable weights
        train_vars = (
            self.encoder.trainable_variables + self.decoder.trainable_variables
        )

        # 5. Get the gradients
        grads = tape.gradient(loss, train_vars)

        # 6. Update the trainable weights
        self.optimizer.apply_gradients(zip(grads, train_vars))

        # 7. Update the trackers
        batch_acc /= float(self.num_captions_per_image)
        self.loss_tracker.update_state(batch_loss)
        self.acc_tracker.update_state(batch_acc)

        # 8. Return the loss and accuracy values
        return {
            "loss": self.loss_tracker.result(),
            "acc": self.acc_tracker.result(),
        }

    def
        # for each caption.
        for i in range(self.num_captions_per_image):
            loss, acc = self._compute_caption_loss_and_acc(

        batch_acc /= float(self.num_captions_per_image)

        # 4. Update the trackers
        self.loss_tracker.update_state(batch_loss)
        self.acc_tracker.update_state(batch_acc)

        # 5. Return the loss and accuracy values
        return {
            "loss": self.loss_tracker.result(),
            "acc": self.acc_tracker.result(),
        }

    @property
    def metrics(self):
        # We need to list our metrics here so that `reset_states()` can be
        # called automatically.
        return [self.loss_tracker, self.acc_tracker]
cnn_model = get_cnn_model()
encoder = TransformerEncoderBlock(embed_dim=EMBED_DIM, dense_dim=FF_DIM, num_heads=1)
decoder = TransformerDecoderBlock(embed_dim=EMBED_DIM, ff_dim=FF_DIM)
caption_model = ImageCaptioningModel(
    cnn_model=cnn_model,
    encoder=encoder,
    decoder=decoder,
    image_aug=image_aug,
)

Model training

# Define the loss function
cross_entropy = keras.losses.SparseCategoricalCrossentropy(
    from_logits=False,
    reduction=None,
)

# EarlyStopping criteria
early_stopping = keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)
        super().__init__()
        self.post_warmup_learning_rate = post_warmup_learning_rate
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        global_step = tf.cast(step, tf.float32)
        warmup_steps = tf.cast(self.warmup_steps, tf.float32)
            global_step < warmup_steps,
            lambda: warmup_learning_rate,


num_train_steps = len(train_dataset) * EPOCHS
num_warmup_steps = num_train_steps // 15
lr_schedule = LRSchedule(post_warmup_learning_rate=1e-4, warmup_steps=num_warmup_steps)

# Compile the model
caption_model.compile(optimizer=keras.optimizers.Adam(lr_schedule), loss=cross_entropy)

# Fit the model
caption_model.fit(
    train_dataset,
    epochs=EPOCHS,
    validation_data=valid_dataset,
    callbacks=[early_stopping],
)
# CONSTANTS
MAX_LENGTH = 40
# VOCABULARY_SIZE

tokenizer = tf.keras.layers.TextVectorization(
    # max_tokens=VOCABULARY_SIZE,
    standardize=None,
    output_sequence_length=MAX_LENGTH,
    vocabulary=vocab,
)

idx2word = tf.keras.layers.StringLookup(
    mask_token="",
    vocabulary=tokenizer.get_vocabulary(),
    invert=True,
)

# MODEL
def CNN_Encoder():
    inception_v3 = tf.keras.applications.InceptionV3(
        include_top=False,
        weights='imagenet'
    )
    output = inception_v3.output
    output = tf.keras.layers.Reshape((-1, output.shape[-1]))(output)
    cnn_model = tf.keras.models.Model(inception_v3.input, output)
    return cnn_model
class TransformerEncoderLayer(tf.keras.layers.Layer):
    def __init__(self, embed_dim, num_heads):
        super().__init__()

    def call(self, x, training):
        x = self.layer_norm_1(x)
        x = self.dense(x)
        attn_output = self.attention(
            query=x,
            value=x,
            key=x,
            attention_mask=None,
            training=training
        )
        x = self.layer_norm_2(x + attn_output)
        return x


class Embeddings(tf.keras.layers.Layer):
    def __init__(self, vocab_size, embed_dim, max_len):
        super().__init__()
        self.token_embeddings = tf.keras.layers.Embedding(
            vocab_size, embed_dim)
        self.position_embeddings = tf.keras.layers.Embedding(
            max_len, embed_dim, input_shape=(None, max_len))

    def call(self, input_ids):
        length = tf.shape(input_ids)[-1]
        position_ids = tf.range(start=0, limit=length, delta=1)
        position_ids = tf.expand_dims(position_ids, axis=0)
        token_embed
        super().__init__()
        self.embedding = Embeddings(
            tokenizer.vocabulary_size(), embed_dim, MAX_LENGTH)
        self.attention_1 = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim, dropout=0.1
        )
        self.attention_2 = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim, dropout=0.1
        )
        self.layernorm_1 = tf.keras.layers.LayerNormalization()
        self.layernorm_2 = tf.keras.layers.LayerNormalization()
        self.layernorm_3 = tf.keras.layers.LayerNormalization()
        self.ffn_layer_1 = tf.keras.layers.Dense(units, activation="relu")
        self.ffn_layer_2 = tf.keras.layers.Dense(embed_dim)
        self.out = tf.keras.layers.Dense(tokenizer.vocabulary_size(), activation="softmax")
        self.dropout_1 = tf.keras.layers.Dropout(0.3)
        self.dropout_2 = tf.keras.layers.Dropout(0.5)

    def call(self, input_ids, encoder_output, training, mask=None):
        embeddings = self.embedding(input_ids)
        combined_mask = None
        padding_mask = None
        attn_output_1 = self.attention_1(
            query=embeddings,
            value=embeddings,
            key=embeddings,
            attention_mask=combined_mask,
            training=training
        )
        out_1 = self.layernorm_1(embeddings + attn_output_1)
        attn_output_2 = self.attention_2(
            query=out_1,
            value=encoder_output,
            key=encoder_output,
            attention_mask=padding_mask,
            training=training
        )
        out_2 = self.layernorm_2(out_1 + attn_output_2)
        ffn_out = self.ffn_layer_1(out_2)
        ffn_out = self.dropout_1(ffn_out, training=training)
        ffn_out = self.ffn_layer_2(ffn_out)
        return preds

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        )
        return tf.tile(mask, mult)
class ImageCaptioningModel(tf.keras.Model):
        self.loss_tracker = tf.keras.metrics.Mean(name="loss")
        self.acc_tracker = tf.keras.metrics.Mean(name="accuracy")

        loss = self.loss(y_true, y_pred)
        mask = tf.cast(mask, dtype=loss.dtype)
        loss *= mask

        y_pred = self.decoder(
            y_input, encoder_output, training=True, mask=mask
        )
        loss = self.calculate_loss(y_true, y_pred, mask)
        acc = self.calculate_accuracy(y_true, y_pred, mask)
        return loss, acc

    def train_step(self, batch):
        imgs, captions = batch

        if self.image_aug:
            imgs = self.image_aug(imgs)

        img_embed = self.cnn_model(imgs)
        with tf.GradientTape() as tape:

        train_vars = (
            self.encoder.trainable_variables + self.decoder.trainable_variables
        )
        grads = tape.gradient(loss, train_vars)
        self.optimizer.apply_gradients(zip(grads, train_vars))
        self.loss_tracker.update_state(loss)
        self.acc_tracker.update_state(acc)

    def metrics(self):
        return [self.loss_tracker, self.acc_tracker]
def load_image_from_path(img_path):
    img = tf.io.read_file(img_path)
    img = tf.io.decode_jpeg(img, channels=3)
    img = tf.keras.layers.Resizing(299, 299)(img)
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img


    if isinstance(img, str):
        img = load_image_from_path(img)

    if add_noise == True:
        noise = tf.random.normal(img.shape) * 0.1
        img = (img + noise)
        img = (img - tf.reduce_min(img)) / (tf.reduce_max(img) - tf.reduce_min(img))

    img = tf.expand_dims(img, axis=0)
    img_embed = caption_model.cnn_model(img)
    img_encoded = caption_model.encoder(img_embed, training=False)

    y_inp = '[start]'
    for i
        tokenized, img_encoded, training=False, mask=mask)
        if pred_word == '[end]':
            break


def get_caption_model():
    encoder = TransformerEncoderLayer(EMBEDDING_DIM, 1)
    decoder = TransformerDecoderLayer(EMBEDDING_DIM, UNITS, 8)
    cnn_model = CNN_Encoder()
    caption_model = ImageCaptioningModel(
        cnn_model=cnn_model, encoder=encoder, decoder=decoder, image_aug=None,
    )

    def call_fn(batch, training):
        return batch

    caption_model.call = call_fn
    sample_x, sample_y = tf.random.normal((1, 299, 299, 3)), tf.zeros((1, 40))
    caption_model((sample_x, sample_y))
    sample_img_embed = caption_model.cnn_model(sample_x)
    sample_enc_out = caption_model.encoder(sample_img_embed, training=False)
    caption_model.decoder(sample_y, sample_enc_out, training=False)

    return caption_model
5.1.3 App.py (UI/UX)

import io
import os
import streamlit as st
import requests
from PIL import Image
from model import get_caption_model, generate_caption


@st.cache(allow_output_mutation=True)
def get_model():
    return get_caption_model()


caption_model = get_model()


def predict():
    captions = []
    pred_caption = generate_caption('tmp.jpg', caption_model)
    st.markdown('#### Predicted Captions:')
    captions.append(pred_caption)
    for _ in range(4):
        pred_caption = generate_caption('tmp.jpg', caption_model, add_noise=True)
    st.write(c)


st.title('Image Captioner')
img_url = st.text_input(label='Enter Image URL')

if (img_url != "") and (img_url != None):
    img = Image.open(requests.get(img_url, stream=True).raw)
    img = img.convert('RGB')
    st.image(img)
    img.save('tmp.jpg')
    predict()
    os.remove('tmp.jpg')

st.markdown('<center style="opacity: 70%">OR</center>', unsafe_allow_html=True)
img_upload = st.file_uploader(label='Upload Image', type=['jpg', 'png', 'jpeg'])

if img_upload != None:
    img = img_upload.read()
    img = Image.open(io.BytesIO(img))
    img = img.convert('RGB')
    img.save('tmp.jpg')
    st.image(img)
    predict()
    os.remove('tmp.jpg')
5.2.1 User Interface

For the user interface, we have used Streamlit, which is a Python-based tool for quickly building UIs for machine learning projects. The user can either upload their image via URL or import an image from local storage using the drag-and-drop and browse functions.
5.2.2 Model Output

We take two real-world images, which are neither present on the internet nor in our training dataset, and check the output caption predicted by our model.

Fig. 5.2.2.1: Predicted captions for Case-1
Case 2: The image below shows a picture of a child sitting on a couch.

Fig. 5.2.2.2: Predicted captions for Case-2
CHAPTER VI
CONCLUSION & FUTURE SCOPE

6.1 Conclusion
Our journey commenced with a deep dive into the challenges and opportunities at the intersection of computer vision and NLP, recognizing the need for innovative solutions to bridge the semantic gap between visual content and textual descriptions. By leveraging state-of-the-art techniques, including convolutional neural networks (CNNs) and transformer-based architectures, we engineered a model that transcends traditional machine learning paradigms, offering a novel perspective in multimedia understanding.
REFERENCES

[3] A. D. Shetty and J. Shetty, "Image to Text: Comprehensive Review on Deep Learning Based Unsupervised Image Captioning," 2023 2nd International Conference on Futuristic Technologies (INCOFT), Belagavi, Karnataka, India, 2023, pp. 1-9, doi: 10.1109/INCOFT60753.2023.10425297.

[4] U. Kulkarni, K. Tomar, M. Kalmat, R. Bandi, P. Jadhav and S. Meena, "Attention based Image Caption Generation (ABICG) using Encoder-Decoder Architecture," 2023 5th International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 2023, pp. 1564-1572, doi: 10.1109/ICSSIT55814.2023.10061040.

[5] R. Kumar and G. Goel, "Image Caption using CNN in Computer Vision," 2023 International Conference on Artificial Intelligence and Smart Communication (AISC), Greater Noida, India, 2023, pp. 874-878, doi: 10.1109/AISC56616.2023.10085162.

[7] Z. U. Kamangar, G. M. Shaikh, S. Hassan, N. Mughal and U. A. Kamangar, "Image Caption Generation Related to Object Detection and Colour Recognition Using Transformer-Decoder," 2023 4th International Conference on Computing, Mathematics and Engineering Technologies (iCoMET), Sukkur, Pakistan, 2023, pp. 1-5, doi: 10.1109/iCoMET57998.2023.10099161.

[8] L. Lou, K. Lu and J. Xue, "Improved Transformer with Parallel Encoders for Image Captioning," 2022 26th International Conference on Pattern Recognition (ICPR).

[9] R. Mulyawan, A. Sunyoto and A. H. Muhammad, "Automatic Indonesian Image Captioning using CNN and Transformer-Based Model Approach," 2022 5th International Conference on Information and Communications Technology (ICOIACT), Yogyakarta, Indonesia, 2022, pp. 355-360, doi: 10.1109/ICOIACT55506.2022.9971855.

[10] H. Tsaniya, C. Fatichah and N. Suciati, "Transformer Approaches in Image Captioning: A Literature Review," 2022 14th International Conference on Information Technology and Electrical Engineering (ICITEE), Yogyakarta, Indonesia, 2022, pp. 1-6, doi: 10.1109/ICITEE56407.2022.9954086.

[11] J. Sudhakar, V. V. Iyer and S. T. Sharmila, "Image Caption Generation using Deep Neural Networks," 2022 International Conference for Advancement in Technology (ICONAT), Goa, India, 2022, pp. 1-3, doi: 10.1109/ICONAT53423.2022.9726074.

[12] N. Patwari and D. Naik, "En-De-Cap: An Encoder Decoder model for Image Captioning," 2021 5th International Conference on Computing Methodologies and Communication (ICCMC), Erode,

[13] Engineering (ICICSE), Chengdu, China, 2021, pp. 144-, doi: 10.1109/ICICSE52190.2021.9404124.

[14] S. C. Gupta, N. R. Singh, T. Sharma, A. Tyagi and R. Majumdar, "Generating Image Captions using Deep Learning and Natural Language Processing," 2021 9th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), Noida, India, 2021, pp. 1-4, doi: 10.1109/ICRITO51393.2021.9596486.

[15] A. Puscasiu, A. Fanca, D.-I. Gota and H. Valean, "Automated image captioning," 2020 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR), Cluj-Napoca, Romania, 2020, pp. 1-6, doi: 10.1109/AQTR49680.2020.9129930.