Implementing Complexity in Automatic Image Caption Generator Using Recurrent Neural Network Over Long Short-Term Memory
Abstract
Aim: To grasp the context of a picture and describe it in a natural language, such as English, using an image caption generator and natural language processing ideas. Materials and Methods: Performance was analysed for the highest accuracy in image caption generation using beam search (N=10) and long short-term memory (N=10), with 70% and 30% split sizes for the training and test datasets and G-power setting parameters α=0.05 and power=0.86. Results: RNN attained better accuracy (91%) than long short-term memory (76%), with a significance value of 0.670 (two-tailed, p>0.05). Conclusion: The recurrent neural network achieved better classification than long short-term memory for generating a description of the image.
Keywords: Deep Learning, Recurrent Neural Network, Long Short-Term Memory, Accuracy, Novel Image Caption, Encoder-Decoder.
DOI: 10.47750/pnr.2022.13.S03.014
INTRODUCTION
Automatic caption generation is a tough undertaking that can aid visually challenged persons in understanding the content of web images (Bai and An 2018). It may also have a significant impact on search engines and robots. This problem is substantially more difficult than image categorization or object recognition, both of which have been extensively researched (Mishra and Banerjee 2020). Since researchers have long been engaged in finding effective strategies to generate better predictions, we explored a few techniques to produce good results (Kameswari 2021). To create a good model, we used deep neural networks and machine learning techniques. We used the Flickr8k dataset, which contains approximately 8,000 example photographs with five captions each (Wang et al. 2016). Applications include editing apps, novel caption generation in virtual assistants, encoder-decoder systems, picture indexing, assistance for visually impaired people, social media, and a variety of other natural language processing tasks; all of these aid in the creation of an image caption (Dehaqi, Seydi, and Madadi 2021).
The LSTM and simple RNN have been used in different ways, and recent articles sparked our interest. Approximately 175 papers were located in IEEE Xplore, while 213 papers were identified in the ScienceDirect database (Han and Choi 2020; Agrawal et al. 2021). The Python libraries utilized throughout the development included Keras, which features a VGG net for image recognition, and TensorFlow (Brownlee 2018). We tested numerous encoder-decoder models on our system to determine how they affect caption development and to demonstrate various application cases (Vo, n.d.). A unique parallel-fusion RNN and LSTM architecture has been developed for the image caption generator (Verma et al. 2021); the proposed technique improves performance and efficiency. A survey of caption generation that splits photo-captioning approaches into groups based on the strategy used in each method was quite beneficial in learning how to implement novel image captions with the Flickr8k dataset of images (Tan and Chan 2019). Our team has extensive knowledge and research experience that has translated into high-quality publications (Bhansali et al. 2021; Jayanth et al. 2021; Sudhakar, Ravel, and Perumal 2021; Sathiyamoorthi et al. 2021; Deepanraj et al. 2021; Raju et al. 2021; Arun Prakash et al. 2020; Kamath et al. 2020; Shanmugam et al. 2021; Rajasekaran et al. 2020; Adhinarayanan et al. 2020; Rajesh et al. 2020; Aurtherson et al. 2021).
The topic of improving feature extraction and RNN classifier efficiency has been thoroughly covered. In novel image caption generation, the long short-term memory classifier used to train Flickr8k data has produced better results. The research gap in the existing system is its lower degree of accuracy. The aim of this research is to increase classification accuracy by applying an RNN and comparing its performance to that of an LSTM using encoder-decoder models (Aghav 2020). With the use of novel image captions and deep learning techniques, the proposed model improves the classifiers to better discriminate objects (Kinghorn, Zhang, and Shao 2018).
MATERIALS AND METHODS
The Flickr8k dataset, which contains approximately 8,000 example photographs with five captions each, was used as the dataset. The encoder-decoder model used a collection of roughly 680 images with descriptions for the novel captions generated. Recurrent neural networks were used to extract the captions, which were then preprocessed. The RNN algorithm, which accomplishes classification by forming groups for every single class in the data, is the first group in this study. The RNN classifier uses k groups as its input size and attempts to classify them at the value of significance. The proposed work was designed and implemented with the help of Google Colab software. The platform used to assess deep learning was the Windows 10 OS. The hardware configuration was an Intel Core i7 processor with 8 GB of RAM on a 64-bit system. The code was implemented in the Python programming language, and during execution the Flickr8k dataset was processed to produce accuracy as the output.
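As a concrete illustration of this pipeline, a merge-style encoder-decoder captioner of the kind described above can be sketched in Keras. This is a minimal sketch under stated assumptions, not the exact implementation used in this study: the 4096-dimensional VGG feature size, vocab_size, and max_len are placeholder assumptions.

# A minimal sketch (not the authors' exact code) of a merge-style
# encoder-decoder captioner, assuming image features have already been
# extracted with a pretrained VGG16 (4096-d fc2 vectors) and the
# captions tokenized; vocab_size and max_len are placeholder values.
from tensorflow.keras.layers import (Input, Dense, Dropout, Embedding,
                                     SimpleRNN, add)
from tensorflow.keras.models import Model

vocab_size = 8000   # assumed vocabulary size after preprocessing
max_len = 34        # assumed maximum caption length in tokens

# Encoder branch: project the CNN feature vector down to 256 units.
img_in = Input(shape=(4096,))
img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Decoder branch: partial caption -> embedding -> recurrent layer.
# SimpleRNN(256) gives the RNN model; replacing it with LSTM(256)
# yields the compared LSTM model.
txt_in = Input(shape=(max_len,))
txt_emb = Embedding(vocab_size, 256, mask_zero=True)(txt_in)
txt_vec = SimpleRNN(256)(Dropout(0.5)(txt_emb))

# Merge both branches and predict the next word of the caption.
merged = add([img_vec, txt_vec])
out = Dense(vocab_size, activation="softmax")(Dense(256, activation="relu")(merged))

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")

At inference time, captions are decoded word by word, feeding each predicted token back into the text branch; beam search, as mentioned in the abstract, keeps the N best partial captions at each step instead of only the single most probable one.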
Statistical Analysis
SPSS software was used for the statistical analysis of the recurrent neural network and long short-term memory. The independent variables are images, the caption generator, vocabulary, preprocessed words, and description length. The dependent variables are accuracy and precision. An independent-samples t-test was carried out to compare the accuracy of both methods.
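For readers without SPSS, the same analysis can be reproduced in Python; the sketch below is an assumed equivalent, not the workflow actually used, applied to the per-run loss values listed in Table 2.

# A minimal Python sketch (statistics/SciPy, not the SPSS workflow
# used in this study) reproducing the group statistics of Table 3 and
# the independent-samples t-test of Table 4 on the Table 2 loss values.
import math
import statistics
from scipy import stats

rnn_loss = [9.00, 18.32, 25.44, 13.75, 21.36,
            14.22, 31.06, 9.44, 15.64, 23.75]
lstm_loss = [22.00, 32.79, 38.22, 26.44, 36.25,
             40.86, 42.44, 24.88, 39.47, 43.15]

for name, loss in [("RNN", rnn_loss), ("LSTM", lstm_loss)]:
    sd = statistics.stdev(loss)        # sample standard deviation
    sem = sd / math.sqrt(len(loss))    # standard error of the mean
    print(f"{name}: N={len(loss)}, mean={statistics.mean(loss):.3f}, "
          f"SD={sd:.5f}, SEM={sem:.5f}")

# equal_var=False mirrors the "equal variances not assumed" row of
# Table 4 (t = -4.939, df = 17.901, mean difference = -16.452).
t, p = stats.ttest_ind(rnn_loss, lstm_loss, equal_var=False)
print(f"t = {t:.3f}, two-tailed p = {p:.3f}")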
Results
With a sample size of 10, the proposed RNN algorithm and LSTM were run in Google Colab at different times. Table 1 shows the variable definitions used for the encoder-decoder models' anticipated novel image caption accuracy and recognition of novel image caption production. These ten data samples, along with their loss values, were utilized to compute statistical values that can be compared between the algorithms. According to the data, the mean accuracy of the RNN algorithm was 91%, while that of the LSTM method was 76%. The RNN and LSTM mean accuracy values are shown in Table 3. The RNN's mean value is higher than the LSTM's, with standard deviations of 7.16608 and 7.71992, respectively. Table 4 presents the RNN and LSTM independent-samples t-test data, with a significance value of 0.670 (two-tailed, p>0.05). Fig. 1 compares RNN and LSTM in terms of mean accuracy and loss.
The group statistics, namely the mean, standard deviation, and standard error of the mean, are also reported for the two techniques. The loss of the two algorithms, RNN and LSTM, is presented graphically for comparative analysis. This shows that the recurrent neural network is substantially better, with 91% accuracy, compared to the 76% accuracy of long short-term memory.
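As an illustration only (Fig. 1 itself was produced in SPSS), the same comparison can be drawn with matplotlib using the reported means and standard deviations:

# A small matplotlib sketch of the Fig. 1 comparison, using the mean
# accuracies reported above; the error bars (+/- 1 SD) use the
# standard deviations quoted in the Results section.
import matplotlib.pyplot as plt

algorithms = ["RNN", "LSTM"]
mean_acc = [91.0, 76.0]           # reported mean accuracy (%)
std_dev = [7.16608, 7.71992]      # reported standard deviations

plt.bar(algorithms, mean_acc, yerr=std_dev, capsize=8)
plt.xlabel("RNN vs LSTM Machine Algorithm")
plt.ylabel("Mean accuracy (%)")
plt.title("Mean accuracy of RNN and LSTM")
plt.show()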
Discussion
The significance value achieved in the present study is 0.670 because of the large number of datasets with fewer parameters (two-tailed, p>0.05), implying that RNN appears to be superior to LSTM. The RNN classifier has a 91% accuracy rate, while the LSTM classifier has a 76% accuracy rate. In this work, a previous comparison of RNN versus LSTM is shown (Alahmadi, Park, and Hahn 2019). When compared to the LSTM classifier, this clearly shows that RNN appears to be the stronger classifier. This research compares the accuracy of RNN and LSTM, shown in Table 2, finding that RNN has 91% accuracy and LSTM has 76% accuracy (Poghosyan and Sarukhanyan 2017). An RNN is a type of artificial neural network used in deep learning to create captions for new images using previously saved datasets.
The RNN forms the relationship between these two hidden layers (Ly, Traore, and Dia 2021). The output layer can receive data from both the past and the future at the same time (Huang 2020). Similarly, an LSTM can carry relevant data through the interpretation of inputs, and it can discard unrelated information using a forget gate (K. 2020). Recommendations in editing apps, novel caption generation in automated systems, encoder-decoder systems, picture indexing, assistance for visually impaired people, social media, and various other natural language processing applications were among these uses; all of these aid in the creation of an image caption (Tomar et al. 2022).
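As a toy illustration of the forget-gate behaviour just described, the gate can be written out in a few lines of NumPy; the weights below are random stand-ins rather than trained parameters, and the hidden size is an arbitrary choice.

# A toy NumPy sketch of the LSTM forget gate described above;
# W_f, U_f, and b_f are random stand-ins, not trained weights.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

hidden = 4
rng = np.random.default_rng(0)
W_f = rng.normal(size=(hidden, hidden))   # input-to-gate weights
U_f = rng.normal(size=(hidden, hidden))   # hidden-to-gate weights
b_f = np.zeros(hidden)

def forget_step(x_t, h_prev, c_prev):
    """f_t in (0, 1) scales the old cell state: values near 0
    discard unrelated information, values near 1 retain it."""
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)
    return f_t * c_prev

# Example: one step with random input and previous states.
x = rng.normal(size=hidden)
h = rng.normal(size=hidden)
c = rng.normal(size=hidden)
print(forget_step(x, h, c))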
The study's drawbacks include the fact that training a convolutional neural network takes a long time, especially with Flickr8k datasets in deep learning (Yang et al. 2020). The dataset has several attributes that the classifier can utilize to improve prediction accuracy and work more effectively toward achieving the vision. In the future scope of image caption generators, accuracy and precision figures can be raised as a result of features like these. The system should also be enhanced to accommodate a larger number of photos while spending less time training the dataset.
Conclusion
This proposed work used both the RNN and LSTM algorithms to predict accuracy. The RNN-LSTM model was created with the goal of automatically generating captions for the input images, and it can be applied to a wide range of situations. We studied the RNN and LSTM models and verified that the model is capable of creating captions for the input images. It is observed that the RNN gives the best accuracy, at 91%, compared to the LSTM at 76%.
DECLARATIONS
Conflicts of Interests
The authors declare no conflict of interest in this manuscript.
Authors Contribution
Author ST was involved in data collection, data analysis, and manuscript writing. Author RG
was involved in conceptualization, data validation, and critical reviews of manuscripts.
Acknowledgment
The authors would like to express their gratitude towards Saveetha School of Engineering, Saveetha Institute of
Medical and Technical Sciences (formerly known as Saveetha University)
for providing the necessary infrastructure to carry out this work successfully.
Funding: We thank the following organizations for providing financial support that enabled us
to complete the study.
References
1. Adhinarayanan, Rajesh, Aravindh Ramakrishnan, Gopal Kaliyaperumal, Melvin Victor De Poures, Rajesh Kumar Babu, and
Damodharan Dillikannan. 2020. “Comparative Analysis on the Effect of 1-Decanol and Di-N-Butyl Ether as Additive with
diesel/LDPE Blends in Compression Ignition Engine.” Energy Sources, Part A: Recovery, Utilization, and Environmental Effects,
June, 1–18.
2. Aghav, Jagannath. 2020. “Image Captioning Using Deep Learning.” International Journal for Research in Applied Science and
Engineering Technology. https://doi.org/10.22214/ijraset.2020.6232.
3. Agrawal, Vaishnavi, Shariva Dhekane, Neha Tuniya, and Vibha Vyas. 2021. “Image Caption Generator Using Attention
Mechanism.” 2021 12th International Conference on Computing Communication and Networking Technologies (ICCCNT).
https://doi.org/10.1109/icccnt51525.2021.9579967.
4. Alahmadi, Rehab, Chung Hyuk Park, and James Hahn. 2019. “Sequence-to-Sequence Image Caption Generator.” Eleventh
International Conference on Machine Vision (ICMV 2018). https://doi.org/10.1117/12.2523174.
5. Arun Prakash, V. R., J. Francis Xavier, G. Ramesh, T. Maridurai, K. Siva Kumar, and R. Blessing Sam Raj. 2020. “Mechanical,
Thermal and Fatigue Behaviour of Surface-Treated Novel Caryota Urens Fibre–reinforced Epoxy Composite.” Biomass Conversion
and Biorefinery, August. https://doi.org/10.1007/s13399-020-00938-0.
6. Aurtherson, P. Babu, Bhanu Teja Nalla, Karthikeyan Srinivasan, Kulmani Mehar, and Yuvarajan Devarajan. 2021. “Biofuel
Production from Novel Prunus Domestica Kernel Oil: Process Optimization Technique.” Biomass Conversion and Biorefinery,
May. https://doi.org/10.1007/s13399-021-01551-5.
7. Bai, Shuang, and Shan An. 2018. “A Survey on Automatic Image Caption Generation.” Neurocomputing.
https://doi.org/10.1016/j.neucom.2018.05.080.
8. Bhansali, Karan J., Kamlesh R. Balinge, Subodh U. Raut, Shubham A. Deshmukh, M. Senthil Kumar, C. Ramesh Kumar, and
Pundlik R. Bhagat. 2021. “Visible Light Assisted Sulfonic Acid-Functionalized Porphyrin Comprising Benzimidazolium Moiety for
Photocatalytic Transesterification of Castor Oil.” Fuel 304 (November): 121490.
9. Brownlee, Jason. 2018. Deep Learning for Time Series Forecasting: Predict the Future with MLPs, CNNs and LSTMs in Python.
Machine Learning Mastery.
10. Deepanraj, B., N. Senthilkumar, D. Mala, and A. Sathiamourthy. 2021. “Cashew Nut Shell Liquid as Alternate Fuel for CI
Engine—optimization Approach for Performance Improvement.” Biomass Conversion and Biorefinery, February.
https://doi.org/10.1007/s13399-021-01312-4.
11. Dehaqi, Ali Mollaahmadi, Vahid Seydi, and Yeganeh Madadi. 2021. “Adversarial Image Caption Generator Network.” SN
Computer Science. https://doi.org/10.1007/s42979-021-00486-y.
12. Han, Seung-Ho, and Ho-Jin Choi. 2020. “Domain-Specific Image Caption Generator with Semantic Ontology.” 2020 IEEE
International Conference on Big Data and Smart Computing (BigComp). https://doi.org/10.1109/bigcomp48618.2020.00-12.
13. Huang, Chien-Lin. 2020. “Speaker Characterization Using TDNN, TDNN-LSTM, TDNN-LSTM-Attention Based Speaker
Embeddings for NIST SRE 2019.” The Speaker and Language Recognition Workshop (Odyssey 2020).
https://doi.org/10.21437/odyssey.2020-60.
14. Jayanth, Bellappu Venkat, Melvin Victor De Poures, Gopal Kaliyaperumal, Damodharan Dillikannan, Dilipsingh Jawahar,
Kumaran Palani, and Ganesha Prasad Meravanigee Shivappa. 2021. “A Comprehensive Study on the Effects of Multiple Injection
Strategies and Exhaust Gas Recirculation on Diesel Engine Characteristics That Utilize Waste High Density Polyethylene Oil.”
Energy Sources, Part A: Recovery, Utilization, and Environmental Effects, June, 1–18.
15. Kamath, Manjunath, Subha Krishna Rao, Jaison, Sridhar, Kasthuri, Gopinath, Sivaperumal, and Shantanu Patil. 2020. “Melatonin
Delivery from PCL Scaffold Enhances Glycosaminoglycans Deposition in Human Chondrocytes – Bioactive Scaffold Model for
Cartilage Regeneration.” Process Biochemistry 99 (December): 36–47.
16. Kameswari, A. V. N. 2021. “Image Caption Generator Using Deep Learning.” International Journal for Research in Applied
Science and Engineering Technology. https://doi.org/10.22214/ijraset.2021.38652.
17. Kinghorn, Philip, Li Zhang, and Ling Shao. 2018. “A Region-Based Image Caption Generator with Refined Descriptions.”
Neurocomputing. https://doi.org/10.1016/j.neucom.2017.07.014.
18. K., Sahityabhilash. 2020. “Impact of Loss Function Using M-LSTM Classifier for Sequence Data.” International Journal of
Psychosocial Rehabilitation. https://doi.org/10.37200/ijpr/v24i5/pr202059.
19. Ly, Racine, Fousseini Traore, and Khadim Dia. 2021. Forecasting Commodity Prices Using Long-Short-Term Memory Neural
Networks. Intl Food Policy Res Inst.
20. Mishra, Sanjukta, and Minakshi Banerjee. 2020. “Automatic Caption Generation of Retinal Diseases with Self-Trained RNN Merge
Model.” Advances in Intelligent Systems and Computing. https://doi.org/10.1007/978-981-15-2930-6_1.
21. Poghosyan, Aghasi, and Hakob Sarukhanyan. 2017. “Short-Term Memory with Read-Only Unit in Neural Image Caption
Generator.” 2017 Computer Science and Information Technologies (CSIT). https://doi.org/10.1109/csitechnol.2017.8312163.
22. Rajasekaran, S., D. Damodharan, K. Gopal, B. Rajesh Kumar, and Melvin Victor De Poures. 2020. “Collective Influence of 1-
Decanol Addition, Injection Pressure and EGR on Diesel Engine Characteristics Fueled with diesel/LDPE Oil Blends.” Fuel 277
(October): 118166.
23. Rajesh, A., K. Gopal, De Poures Melvin Victor, B. Rajesh Kumar, A. P. Sathiyagnanam, and D. Damodharan. 2020. “Effect of
Anisole Addition to Waste Cooking Oil Methyl Ester on Combustion, Emission and Performance Characteristics of a DI Diesel
Engine without Any Modifications.” Fuel 278 (October): 118315.
24. Raju, P., K. Raja, K. Lingadurai, T. Maridurai, and S. C. Prasanna. 2021. “Glass/Caryota Urens Hybridized Fibre-Reinforced
nanoclay/SiC Toughened Epoxy Hybrid Composite: Mechanical, Drop Load Impact, Hydrophobicity and Fatigue Behaviour.”
Biomass Conversion and Biorefinery, March. https://doi.org/10.1007/s13399-021-01427-8.
25. Sathiyamoorthi, Ramalingam, Gomathinayakam Sankaranarayanan, Dinesh Babu Munuswamy, and Yuvarajan Devarajan. 2021.
“Experimental Study of Spray Analysis for Palmarosa Biodiesel‐diesel Blends in a Constant Volume Chamber.” Environmental
Progress & Sustainable Energy 40 (6). https://doi.org/10.1002/ep.13696.
26. Shanmugam, Rajasekaran, Damodharan Dillikannan, Gopal Kaliyaperumal, Melvin Victor De Poures, and Rajesh Kumar Babu.
2021. “A Comprehensive Study on the Effects of 1-Decanol, Compression Ratio and Exhaust Gas Recirculation on Diesel Engine
Characteristics Powered with Low Density Polyethylene Oil.” Energy Sources, Part A: Recovery, Utilization, and Environmental
Effects 43 (23): 3064–81.
27. Sudhakar, M. P., Merlyn Ravel, and K. Perumal. 2021. “Pretreatment and Process Optimization of Bioethanol Production from
Spent Biomass of Ganoderma Lucidum Using Saccharomyces Cerevisiae.” Fuel 306 (December): 121680.
28. Tan, Ying Hua, and Chee Seng Chan. 2019. “Phrase-Based Image Caption Generator with Hierarchical LSTM Network.”
Neurocomputing. https://doi.org/10.1016/j.neucom.2018.12.026.
29. Tomar, Dimpal, Pradeep Tomar, Arpit Bhardwaj, and G. R. Sinha. 2022. “Deep Learning Neural Network Prediction System
Enhanced with Best Window Size in Sliding Window Algorithm for Predicting Domestic Power Consumption in a Residential
Building.” Computational Intelligence and Neuroscience 2022 (March): 7216959.
30. Verma, Akash, Harshit Saxena, Mugdha Jaiswal, and Poonam Tanwar. 2021. “Intelligence Embedded Image Caption Generator
Using LSTM Based RNN Model.” 2021 6th International Conference on Communication and Electronics Systems (ICCES).
https://doi.org/10.1109/icces51350.2021.9489253.
31. Vo, Tham. n.d. “FuzzSemNIC: A Deep Fuzzy Neural Network Semantic-Enhanced Approach of Neural Image Captioning.”
https://doi.org/10.21203/rs.3.rs-610265/v1.
32. Wang, Minsi, Li Song, Xiaokang Yang, and Chuanfei Luo. 2016. “A Parallel-Fusion RNN-LSTM Architecture for Image Caption
Generation.” 2016 IEEE International Conference on Image Processing (ICIP). https://doi.org/10.1109/icip.2016.7533201.
33. Yang, Min, Junhao Liu, Ying Shen, Zhou Zhao, Xiaojun Chen, Qingyao Wu, and Chengming Li. 2020. “An Ensemble of
Generation- and Retrieval-Based Image Captioning with Dual Generator Generative Adversarial Network.” IEEE Transactions on
Image Processing: A Publication of the IEEE Signal Processing Society PP (October). https://doi.org/10.1109/TIP.2020.3028651.
Table 1. Variable definitions (Group, Accuracy, and Loss) for the novel image caption generator, using 8 columns with width-8 data.
S.No  Name  Type  Width  Decimal  Columns  Measure  Role
Table 2. Accuracy and Loss Analysis of the recurrent neural network and long short-term memory.
GROUP  ACCURACY  LOSS
RNN    91.00      9.00
RNN    81.68     18.32
RNN    74.56     25.44
RNN    86.25     13.75
RNN    78.64     21.36
RNN    85.78     14.22
RNN    68.94     31.06
RNN    90.56      9.44
RNN    84.36     15.64
RNN    76.25     23.75
LSTM   78.00     22.00
LSTM   67.21     32.79
LSTM   61.78     38.22
LSTM   73.56     26.44
LSTM   63.75     36.25
LSTM   59.14     40.86
LSTM   57.56     42.44
LSTM   75.12     24.88
LSTM   60.53     39.47
LSTM   56.85     43.15
Table 3. Group Statistical Analysis of RNN and LSTM. Mean, Standard Deviation, and Standard Error Mean
are obtained for 10 samples. RNN has higher mean accuracy and lower mean loss when compared to LSTM.
Name  GROUP  N  Mean  Std. Deviation  Std. Error Mean
Table 4. Independent samples t-test: RNN is insignificantly better than LSTM with a p-value of 0.670 (two-tailed, p>0.05).
Name  Variances                    F  Sig.  t       df      Sig. (2-tailed)  Mean Difference  Std. Error Difference  Lower      Upper
LOSS  Equal variances not assumed  -  -     -4.939  17.901  .000             -16.45200        3.33091                -23.45276  -9.45124
Fig. 1. Simple bar chart of mean accuracy for the RNN and LSTM machine algorithms, comparing the mean accuracy of RNN (91%) and LSTM (76%). X-axis: RNN vs LSTM machine algorithm. Y-axis: mean accuracy. The error bars are shown at the 95% level for both algorithms, with standard deviation error bars of +/- 1 SD.