Apply Deep Learning-Based CNN and LSTM For Visual Image Caption Generator
V. Ramachandran
Computer Science and Engineering
GITAM School of Computing
GITAM (Deemed to be University)
Bengaluru, India
[email protected]

P. Rajaram
Computer Science and Engineering
GITAM School of Computing
GITAM (Deemed to be University)
Bengaluru, India
[email protected]
Abstract—Image captioning is becoming a necessity. Deep neural network models enable integrated applications to generate and caption images. Image captioning describes an image: it requires identifying an image's main objects, their properties, and their relationships, and it must produce correct sentences. The creation of image caption generators, which examine the content of an image and offer pertinent descriptions, makes use of deep learning and computer vision techniques. One element of this method involves classifying objects in the image using English keywords derived from training datasets. The caption generator is created by combining an LSTM with a CNN. In this paper, we propose a deep-learning model that uses computer vision and machine translation to characterize images and generate captions. The model successfully recognizes and labels visual objects and their relationships. Transfer Learning is used to demonstrate the proposed experiment, coupled with the Flickr8k dataset and the Python3 programming language. This study also looks into the functions and construction of neural networks. The proposed model achieves a BLEU score of 69.8.

Keywords—Deep learning, CNN, RNN, LSTM

I. INTRODUCTION

Our brain has the ability to label or annotate any image that is presented to us, and image captioning is one of the most frequently used technologies in today's world. Integrated applications that produce photos and caption them are powered by deep neural network models. The act of providing a description of a picture is known as image captioning. It entails recognizing the important objects in an image, their qualities, and the relationships between those attributes and the key objects, and it results in phrases that are correct both syntactically and semantically. Using computer vision and machine translation, this study presents a deep learning model for automatically generating image descriptions and captions. The model can identify objects inside an image, determine the relationships between those objects, and produce captions for the image. The Xception model is used to illustrate Transfer Learning, the dataset used is Flickr8k, and the programming language used is Python3. In addition, the functions and topologies of neural networks are discussed in this study. The production of image captions draws on both computer vision and natural language processing.

Image segmentation, which is used by both Facebook and Google Photos, can be applied to individual video frames by image caption generators. These systems intend to automate the process of image interpretation, not to mention the enormous potential they have to help people who are visually impaired. Using natural language to automatically describe the content of photographs is a fundamental and difficult task with a significant potential impact. It could, for example, assist visually challenged people in better understanding the content of web images [1]. It could also deliver more accurate and concise picture/video information in contexts such as image sharing in social networks or video surveillance systems.

Humans typically use natural language that is brief and succinct when describing a scene [11]. Machine vision systems, on the other hand, capture a scene as a two-dimensional array of pixels [2]. The first part of the proposed system is a convolutional neural network (CNN), which deciphers the picture. Questions like "What are the objects?", "Where are the objects?" and "How do the objects interact?" are common in computer vision, and they are all answered through image understanding. The second part, an RNN, is used to create the descriptive phrase. Long short-term memory (LSTM) networks are a type of RNN (recurrent neural network) that perform exceptionally well on sequence prediction problems: the next word is predicted from the context. By overcoming the RNN's memory limitations, the LSTM stands out as the superior technology. An LSTM equipped with a forget gate can selectively retain and discard data as it processes inputs to maximize processing speed [12].

A. Problem Statement

The challenge posits a captioning task that requires a computer vision system to recognize and characterize, in natural language, the significant regions in photos. This work is proposed as a solution to that problem. The image captioning challenge generalizes object detection, where the descriptions are only a single word long. Using knowledge of the images' subjects and context, the system must determine which of the available labels best describes each picture in the collection(s). The goal is to build a platform that can be utilized by customers and that is powered by CNN and LSTM for the purpose of automatically generating an image's description.
B. Objective

By automatically creating captions, the image caption generator's main goal is to improve user experience. This has applications in social media, image indexing, assistive technology for the blind, and various other areas of natural language processing. State-of-the-art results on caption generation problems have been achieved using deep learning techniques [3]. The most remarkable aspect of these methods is that, rather than requiring complex data preparation or a pipeline of specially created models, a unified end-to-end model can be developed to predict a caption given a photo. Based on the image we supply, this tool automatically generates a caption using a trained model. When applied, users receive automated captions on social media.
II. RELATED WORK

The primary challenge of artificial intelligence is the automatic description of an image's content. In the past, image annotations (nouns and adjectives) were created first [3] and then sentences.

Using a recurrent convolutional architecture, Donahue et al. demonstrated its usefulness on three tasks, image description, video recognition, and video description, showing its potential for use in large-scale visual learning [4]. These comprehensive, trainable models incorporate time-varying network states. According to Venugopalan et al., the inability to fully comprehend the intermediate result is the main obstacle. Video text generation is one area where LRCN has been put to use. Instead of using a single architecture for all three tasks as in LRCN, Vinyals et al. introduced a neural image caption (NIC) model designed specifically for caption generation. GoogLeNet and an LSTM are used to maximize the model's likelihood of producing the target description sentence from the training photographs. Model performance is evaluated both qualitatively and quantitatively. In the Microsoft Captioning Competition (MS COCO, 2015), human judges placed this method first. There are three differences between LRCN and NIC, and these could explain a performance gap. First, while NIC uses GoogLeNet, LRCN uses VGGNet. Second, whereas LRCN feeds visual feature information to all LSTM units, NIC only feeds it to the first unit. Third, NIC's RNN architecture (a single-layer LSTM) is more straightforward than that of LRCN (two factored LSTM layers). There is mathematical congruence between LRCN's and NIC's approaches to image captioning. LRCN's performance suffers from the fact that it was designed for three different jobs, which forces the network to trade off ease of use against flexibility.

Fang et al. proposed a visual-concepts-based approach to learning. Nouns, verbs, and adjectives were all included in the training of the visual detectors for captions using multiple instance learning [5]. To obtain word frequency statistics, they employed over 400 thousand image descriptions to train a language model. Using sentence-level features and a deep multi-modal similarity model, they re-ranked the candidate captions. In 34% of cases they are superior to humans. Due to human intervention, the method's parameters are not easily reproducible. Microsoft's CaptionBot may employ this method.

VSA was proposed by Karpathy and his team. The method creates word/sentence descriptions of image portions. When it comes to aligning retrieved visual features to picture regions, the method uses Region-based Convolutional Networks (R-CNN) rather than a plain CNN [5]. The produced descriptions outperform retrieval baselines for both complete images and a new set of region-level annotations. Similar methods, such as LRCN and NIC, are less precise and limited in scope. Two models are used in the process. The scope of this technology has been broadened to include image-based question-and-answer systems as well as rich captioning [6].
A. Data set

The Flickr8k dataset is widely available and serves as a benchmark for evaluating image-to-sentence description procedures. It contains 8,000 pictures, each accompanied by five descriptions. The photos were culled from various sources on the photo-sharing website Flickr. The captions for these photos give excellent descriptions of the people, places, and things depicted. The dataset is more extensive because it depicts non-famous persons and places in addition to well-known individuals and landmarks. Of these images, 6,000 are used for training, with the remainder spread between the development and testing phases. Advantages of this dataset for this task include the following: having several labels for a single image boosts the model's generality and reduces overfitting, and the large variety of training images bolsters the versatility of the model used for image processing and analysis.
III. METHODOLOGY

A. Convolution Neural Networks (CNN)

CNNs are specialized deep neural networks that can process data with a 2D matrix input structure. Images can easily be represented as a 2D matrix, so CNNs are extremely useful in image processing [7]. CNNs are mostly used for image classification, for example determining whether an image is of a bird, a plane, or Superman.

B. LSTM

Long short-term memory (LSTM) is a form of RNN (recurrent neural network) that is ideally suited for sequence prediction challenges [15]. We can guess what the following word will be based on the preceding text. It has outperformed regular RNNs in terms of effectiveness by overcoming the short-term memory limitations of RNNs [16-23]. A forget gate allows the LSTM to ignore irrelevant information while still processing useful information, as seen in Fig. 1.

Fig. 1. Sentence generation in the image caption generator
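For reference, the gating behaviour described above corresponds to the standard LSTM update equations; the notation below is the conventional one and is not reproduced from the paper:

\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}

Here the forget gate f_t decides how much of the previous cell state c_{t-1} is kept, which is the mechanism that lets the network discard irrelevant information while retaining what is useful.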
C. Steps to build the image caption generator
• Import all necessary packages.
• Perform data cleansing.
• Extract the feature vector.
• Load the dataset for model training.
• Tokenize the vocabulary.
• Build a data generator.
Libraries

pip install tensorflow keras pillow numpy tqdm jupyterlab

These libraries are necessary to build the image caption generator.

1) TensorFlow: TensorFlow has a unique capability for picture identification, and the recognized images are kept in a particular folder. For security applications, this algorithm is simple to implement with very similar images [17]. The relevant images are contained in the dataset and must be loaded.

2) Keras: The original data inputs are used to feed the Keras ImageDataGenerator module, which randomly modifies the input and returns a result containing only the newly transformed data [18].

3) Pillow: The Pillow library contains all of the core image-processing features. Images can be modified, rotated, and resized. The Pillow module's histogram function can be used to extract statistics from an image, which can then be utilized to perform automatic contrast enhancement and statistical analysis.

4) NumPy: Since arrays can also be used to represent images, NumPy may be used to perform many image-analysis operations from scratch [19].
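As a reading aid (not code reproduced from the paper), the imports corresponding to the steps that follow might look like this:

import os
import string
import numpy as np
from pickle import dump, load
from PIL import Image
from tqdm import tqdm
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model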
Fig. 3. Image of Flickr8k lemma tokens

Five cleaning functions are provided:

1. Open the document file using load_fp(filename) and extract the contents into a string.

2. Use the img_capt(filename) method to create a description dictionary that maps each image to all five of its captions.

3. The txt_cleaning(descriptions) method takes all descriptions as input and cleans the data. When working with textual data, we must perform a variety of cleaning operations, such as converting uppercase to lowercase, removing punctuation, and removing words containing digits.

4. To generate a vocabulary, use the function txt_vocab(descriptions). It extracts all the unique words from the descriptions.

5. Use the method save_descriptions(descriptions, filename) to save all of the preprocessed descriptions into one file.
1) Feature vector: The Xception pre-trained model, which has been trained on a large amount of data, is used to extract features from the images. Xception was trained on the ImageNet dataset with 1,000 classes to classify photos [8]. Using keras.applications, we are able to import this model. To include the Xception model in our model, a few changes are needed: the final classification layer is removed in order to retrieve the 2048-dimensional feature vectors, and the images are resized to the 299*299*3 input size required by the Xception model.
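A hedged sketch of this step using keras.applications is shown below; the directory path and the helper name extract_features are illustrative, not taken from the paper:

import os
import numpy as np
from pickle import dump
from PIL import Image
from tensorflow.keras.applications.xception import Xception, preprocess_input

def extract_features(directory):
    # Xception without its final classification layer; average pooling
    # yields a 2048-dimensional feature vector per image.
    model = Xception(include_top=False, pooling="avg")
    features = {}
    for name in os.listdir(directory):
        image = Image.open(os.path.join(directory, name)).convert("RGB")
        image = image.resize((299, 299))
        image = np.expand_dims(np.array(image).astype("float32"), axis=0)
        image = preprocess_input(image)
        features[name.split(".")[0]] = model.predict(image, verbose=0)
    return features

features = extract_features("Flicker8k_Dataset")
dump(features, open("features.p", "wb"))  # reuse later instead of recomputing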
To load the dataset for training the model, the following functions are used:

• load_photos(fname): takes a file name as a parameter, loads the text file into a string, and produces a list of image names.

• load_clean_descriptions(fname, photos): the captions for each image in the set of photos are saved by this function to a dictionary. We add the <start> and <end> identifiers to each caption in order to make it easier for the LSTM model to recognize the start and end of a caption.

• load_features(photos): this function returns the feature-extraction vectors obtained from the Xception model for the images in the photo dictionary.
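The following is a minimal sketch of these loaders under the file layout produced earlier (descriptions.txt from save_descriptions and features.p from the Xception step); the split-file name and formats are assumptions where the paper does not spell them out:

from pickle import load

def load_photos(fname):
    # One image identifier per line in the split file.
    with open(fname, "r") as f:
        return [line.split(".")[0] for line in f.read().strip().split("\n")]

def load_clean_descriptions(fname, photos):
    # Rebuild the description dictionary for the given photos only,
    # wrapping each caption with <start> and <end> markers.
    descriptions = {}
    with open(fname, "r") as f:
        for line in f.read().strip().split("\n"):
            words = line.split()
            image_id, caption = words[0], words[1:]
            if image_id in photos:
                descriptions.setdefault(image_id, []).append(
                    "<start> " + " ".join(caption) + " <end>")
    return descriptions

def load_features(photos):
    # Load the pre-computed Xception vectors and keep only the given photos.
    all_features = load(open("features.p", "rb"))
    return {k: all_features[k] for k in photos}

train_imgs = load_photos("Flickr_8k.trainImages.txt")
train_descriptions = load_clean_descriptions("descriptions.txt", train_imgs)
train_features = load_features(train_imgs)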
Because machines cannot grasp the complexities of the English language, they require a simple numerical representation in order to process the model data [9]. As a result, we assign a unique index value to each word in the vocabulary. Keras has an in-built tokenizer function that creates tokens from the vocabulary; the tokenizer can be saved to the "tokenizer.p" pickle file.
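A short sketch of this step with the Keras Tokenizer is given below; the helper dict_to_list is an assumed convenience function, not something named in the paper:

from pickle import dump
from tensorflow.keras.preprocessing.text import Tokenizer

def dict_to_list(descriptions):
    # Flatten the {image_id: [captions]} dictionary into a list of captions.
    return [cap for captions in descriptions.values() for cap in captions]

def create_tokenizer(descriptions):
    # Fit a Keras tokenizer so every vocabulary word gets a unique index.
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(dict_to_list(descriptions))
    return tokenizer

tokenizer = create_tokenizer(train_descriptions)
dump(tokenizer, open("tokenizer.p", "wb"))
vocab_size = len(tokenizer.word_index) + 1  # +1 for the padding index 0
max_length = max(len(cap.split()) for cap in dict_to_list(train_descriptions))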
For training, each sample produced by the data generator holds the image feature vector, the input text sequence, and the anticipated output text sequence, respectively.
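A sketch of such a generator is shown below, reusing the tokenizer, max_length, and vocab_size computed above; each caption of length n is expanded into n-1 training pairs, and the helper names are illustrative:

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def create_sequences(tokenizer, max_length, caption_list, feature, vocab_size):
    # Turn one image's captions into (feature, partial sequence) -> next-word pairs.
    X1, X2, y = [], [], []
    for caption in caption_list:
        seq = tokenizer.texts_to_sequences([caption])[0]
        for i in range(1, len(seq)):
            in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]
            out_seq = to_categorical([seq[i]], num_classes=vocab_size)[0]
            X1.append(feature)
            X2.append(in_seq)
            y.append(out_seq)
    return np.array(X1), np.array(X2), np.array(y)

def data_generator(descriptions, features, tokenizer, max_length, vocab_size):
    # Yield one image's worth of training pairs at a time, indefinitely.
    while True:
        for image_id, caption_list in descriptions.items():
            feature = features[image_id][0]
            X1, X2, y = create_sequences(
                tokenizer, max_length, caption_list, feature, vocab_size)
            yield [X1, X2], y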
Features Extraction: Features are extracted from the photos; vector features, also called embeddings, are the result. After features are extracted from the original images using the CNN model [13,14], the feature vectors are downsized and made RNN-friendly.

Tokenization: After the CNN generates the feature vectors, they are given to an RNN, which decodes the tokens. In this case, the captions are created based on a predicted word order.

Prediction is the final process that occurs after tokenization.
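The paper does not list the exact layer configuration, but a CNN-LSTM merge architecture consistent with the description (a dense projection of the 2048-dimensional Xception feature on one branch, an embedding plus LSTM over the partial caption on the other, merged by addition before a softmax over the vocabulary) can be sketched as follows; all layer sizes are assumptions:

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

def define_model(vocab_size, max_length):
    # Image-feature branch: compress the 2048-d Xception vector.
    inputs1 = Input(shape=(2048,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation="relu")(fe1)
    # Text-sequence branch: embed the partial caption and run an LSTM.
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # Decoder: merge both branches and predict the next word.
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation="relu")(decoder1)
    outputs = Dense(vocab_size, activation="softmax")(decoder2)
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model

model = define_model(vocab_size, max_length)
model.fit(
    data_generator(train_descriptions, train_features, tokenizer,
                   max_length, vocab_size),
    epochs=10, steps_per_epoch=len(train_descriptions))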
IV. RESULTS: QUALITATIVE ANALYSIS

In this section, a number of representative captions created by the attention model are shown. These captions have varied degrees of success in identifying visual characteristics and accurately characterizing the scene. It is unusual to find captions that have nothing to do with the image; nonetheless, a persistent problem is that uncommon subtypes of object classes (for example, an alpaca or an ethnic cuisine item) are frequently mislabeled as subtypes that are more typically encountered in the training data (e.g., a cow or a hamburger).

For simplicity, only two images have been subjected to testing, and the results can be seen in Fig. 5 and Fig. 6.

Path: Flicker8k_Dataset/1115339222_05e56d5a20.jpg
Output:
Path: Flicker8k_Dataset/1335245722_08e54d5b20.jpg
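For completeness, a hedged sketch of greedy caption prediction and of a corpus-level BLEU evaluation (the kind of measurement behind the score reported in the abstract) is given below; test_descriptions and test_features are assumed to be built with the same loaders as the training split, and the helper names are illustrative:

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from nltk.translate.bleu_score import corpus_bleu

def word_for_id(index, tokenizer):
    # Reverse lookup from an integer index to its vocabulary word.
    for word, idx in tokenizer.word_index.items():
        if idx == index:
            return word
    return None

def generate_caption(model, tokenizer, feature, max_length):
    # Greedy decoding: the Keras tokenizer's default filters strip "<" and ">",
    # so the <start>/<end> markers are stored as the words "start" and "end".
    in_text = "start"
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([in_text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = int(np.argmax(model.predict([feature, seq], verbose=0)))
        word = word_for_id(yhat, tokenizer)
        if word is None or word == "end":
            break
        in_text += " " + word
    return " ".join(in_text.split()[1:])  # drop the "start" seed

# Corpus BLEU over a held-out split; the five human captions are the references.
references, hypotheses = [], []
for image_id, caption_list in test_descriptions.items():
    prediction = generate_caption(model, tokenizer, test_features[image_id], max_length)
    references.append([c.replace("<start>", "").replace("<end>", "").split()
                       for c in caption_list])
    hypotheses.append(prediction.split())
print("BLEU-1:", corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0)))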
REFERENCES

[11] Saravanan, T., Sathish, T., & Keerthika, K. (2022, December). Forecasting Economy using Machine Learning Algorithm. In 2022 Fourth International Conference on Emerging Research in Electronics, Computer Science and Technology (ICERECT) (pp. 1-5). IEEE.

[12] Herdade, S., Kappeler, A., Boakye, K., & Soares, J. (2019). Image captioning: Transforming objects into words. Advances in Neural Information Processing Systems, 32. Saravanan, T., & Saravanakumar, S. (2021, December). Privacy Preserving using Enhanced Shadow Honeypot technique for Data Retrieval in Cloud Computing. In 2021 3rd International Conference on Advances in Computing, Communication Control and Networking (ICAC3N) (pp. 1151-1154). IEEE.

[13] Raypurkar, M., Supe, A., Bhumkar, P., Borse, P., & Sayyad, S. (2021). Deep learning-based image caption generator. International Research Journal of Engineering and Technology (IRJET), 8(03).

[14] Saravanan, T., Jhaideep, T., & Bindu, N. H. (2022, April). Detecting depression using Hybrid models created using Google's BERT and Facebook's FastText Algorithms. In 2022 2nd International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE) (pp. 415-421). IEEE.

[15] Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[16] Saravanan, T., & Nithya, N. S. (2019). Design of dynamic source routing with the aid of fuzzy logic for cross layered mobile ad hoc networks. International Journal of Engineering and Advanced Technology, 8(6), 4241–4247. https://doi.org/10.35940/ijeat.F9024.088619

[17] Divyapushpalakshmi, M., & Ramalakshmi, R. (2021). An efficient sentimental analysis using hybrid deep learning and optimization technique for Twitter using parts of speech (POS) tagging. International Journal of Speech Technology.

[18] Divyapushpalakshmi, M., & Ramalakshmi, R. (2022). Hybrid machine learning approach for community and overlapping community detection in social network. Transactions on Emerging Telecommunications Technologies.

[19] Srigurulekha, K., & Ramachandran, V. (2020, January). Food image recognition using CNN. In 2020 International Conference on Computer Communication and Informatics (ICCCI) (pp. 1-7). IEEE.

[20] Ajay, P., Nagaraj, B., Arun Kumar, R., Huang, R., & Ananthi, P. (2022). Unsupervised hyperspectral microscopic image segmentation using deep embedded clustering algorithm. Scanning, 2022, Article ID 1200860, 9 pages. https://doi.org/10.1155/2022/1200860

[21] Ajay, P., Nagaraj, B., & Jaya, J. (2022). Bi-level energy optimization model in smart integrated engineering systems using WSN. Energy Reports, 8, 2490-2495.

[22] Ajay, P., Nagaraj, B., Pillai, B. M., Suthakorn, J., & Bradha, M. (2022). Intelligent ecofriendly transport management system based on IoT in urban areas. Environment, Development and Sustainability, 1-8.

[23] Rajendran, A., Balakrishnan, N., & Ajay, P. (2022). Deep embedded median clustering for routing misbehaviour and attacks detection in ad-hoc networks. Ad Hoc Networks, 126, 102757.