0% found this document useful (0 votes)
18 views8 pages

Image Caption Generation Research Paper

This document presents a novel technique for image caption generation that combines Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks, enhancing the contextual accuracy of generated captions. The proposed method has been tested on benchmark datasets, demonstrating its ability to produce detailed and insightful descriptions that closely mimic human perception. The study also highlights the limitations of existing methods and emphasizes the importance of utilizing multiple evaluation metrics to assess model performance.

Uploaded by

acreations221
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
18 views8 pages

Image Caption Generation Research Paper

This document presents a novel technique for image caption generation that combines Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks, enhancing the contextual accuracy of generated captions. The proposed method has been tested on benchmark datasets, demonstrating its ability to produce detailed and insightful descriptions that closely mimic human perception. The study also highlights the limitations of existing methods and emphasizes the importance of utilizing multiple evaluation metrics to assess model performance.

Uploaded by

acreations221
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 8

Describing the Unseen: A novel Technique for Image

Caption Generation using Deep Learning


1
Ganta Srija, 1Goje Sri Akshar, 1Jale Nikhitha, 1Jillela Likhitha, 2Dr.Sujit Das and 3Dr.Thayyaba Khatoon
1
UG Students, 2Assistant Professor, 3Professor & HoD
Department of Artificial Intelligence and Machine Learning, School of Engineering,
Malla Reddy University, Maisammaguda, Dulapally,
Hyderabad, Telangana 500100

[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]

Abstract

The field of image caption creation has undergone a revolution in recent years due to the integration of deep learning techniques,
specifically Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs). This work introduces a
revolutionary method that combines CNN and LSTM to describe visual content with previously unheard-of contextual and
accuracy. This method combines the sequential modelling capabilities of LSTMs with the image feature extraction strengths of
CNNs to provide captions for visible elements while also capturing subtleties and unseen information in images. We prove the
efficacy of our method on benchmark datasets by extensive testing and assessment, highlighting its capacity to generate detailed
and insightful descriptions that closely resemble human perception. This innovative method not only progresses. This innovative
strategy opens up new possibilities for deep learning approaches to comprehend and analyze visual content in addition to pushing
the boundaries of picture captioning.

Keywords: Long Short-Term Memory, Convolutional Neural Networks, Sequential Modelling.

variety of functions. In the study, one application of AI has


1. Introduction been covered. They identified the face using deep learning
techniques. The primary goal of automatic image caption
generation is to produce coherent sentences that explain the
image's content and the relationship between the items that are
In computer vision, automatic picture caption generation is
recognized in the image. These sentences can then be utilized
an ongoing research area. Because this issue combines two
as suggestions in a variety of applications.
of the primary domains of artificial intelligence (AI),
computer vision and natural language processing, and has a It can be applied to a number of natural language processing
wide range of practical applications, researchers are drawn tasks, including social media recommendation, image
to it. Understanding an image is the first step in creating a indexing, virtual assistants, and visually impaired people. The
meaningful sentence out of it, and object identification and creation of picture captions can aid machines in
image classification can help with this. Actually, object comprehending the content of images. It involves more than
detection and image classification are easier tasks than just finding objects in a picture; it also entails figuring out how
automatic caption synthesis. By looking at photos, people the objects that have been found relate to one another.
can create captions for them. Finding visual things that are
represented in a picture and determining the relationship Image captioning techniques are divided into three categories
between those images are inherent human abilities. Caption by researchers: deep neural network-based, retrieval-based,
generation is a skill that is acquired via learning and and template-based [36]. First, attributes, objects, and actions
experience. are identified from the image using template-based
approaches. Next, predefined templates with a certain number
It is theorized that machines can learn to comprehend the of blank spaces are filled in. Retrieving an image that
relationships between objects and attain accuracy resembles the input image is how retrieval-based approaches
comparable to humans by using a variety of datasets for generate captions. These techniques provide syntactically
training. With the significant advancements in Artificial accurate captions; yet, picture specificity and accuracy in
Intelligence (AI), photos are now being used as input for a meaning is not assured. Using a linguistic model, captions are
generated once the image has been encoded in deep neural yields state-of-the-art results for image captioning. The
network-based approaches. The deep neural network-based enhanced outcomes on a pair of datasets demonstrate the
approach has the potential to produce semantically more applicability of the suggested model. Second, the captions that
accurate captions for the provided photos in comparison to the suggested model generates for real-time sample
the previous two methods. Most existing picture captioning photographs confirm the effectiveness of the suggested model
models use deep neural network architecture. The first to once more. Thirdly, the writers of the majority of studies
define a multi-modal log-bilinear model for image published in the literature have only reported findings using a
captioning with a fixed context window was Kiros. CNNs limited set of evaluation criteria. However, these particular
and RNNs are primarily utilized in encoder-decoder image measurements don't give a fair assessment of their different
captioning models. RNN is utilized as a language model or strategies. On the other hand, in order to present a just and fair
decoder and CNN as an encoder to create a 1-D array review, the suggested model is assessed using every metric
representation of the input image. the picture description. available. Based on the results, it is evident that the suggested
Determining appropriate CNN and RNN models is a model performs well across all criteria, indicating its
difficult problem. The following is a synopsis of our effectiveness and uniqueness.
research contributions:

• For automatic picture captioning, this study suggests using


VGG16 Hybrid Places 1365 and LSTM as the encoder and 2. Literature Survey
decoder, respectively. The suggested model performs better
than any cutting-edge methods. A wide range of research has been carried out previously in the
field of image captioning and content generation for image
• Using the Flickr8k and MSCOCO Captions datasets, this captioning. Tn the paper [4], authors proposed content selection
study offers experimental results using all prominent methods for image caption generation. In this paper, Mainly
metrics, including BLEU, ROUGE L, METEOR, and three categories of features i.e geometric, conceptual, and visual
GLEU. are used for content generation. Also, variety of methods have
been proposed for image caption generation in the past. They
• The study presents the suggested model's validation results may be classified in broadly three categories i.e., Template-
using a random sample of live images. based methods [14, 30, 51, 37], Retrieval-based methods [40,
43, 16, 46, 20], and Deep neural network based (Encoder-
Regarding the first contribution, the encoder used in this
decoder) methods [49, 3, 15, 12]. These models are often built
work is the VGG16 Hybrid Places 1365 model. The
using CNN to encode the image & extract visual information
ImageNet and Places 365 datasets are used to train this
whereas RNN is used to decode the visual information into a
model. As a result, 1365 output class It is anticipated that
sentence.
the training set's greater sample size and class diversity will
improve generalizability and yield more precise results for A detail study has been done on deep neural network-based
the picture caption creation challenge. This aligns with the image captioning models in the paper [32]. In this paper, authors
outcomes of the trials that were conducted. surveyed the deep learning-based models used for image
captioning on MS-COCO and Flickr30k dataset. In template-
Regarding contribution 2, a number of studies have been
based approach, the process of caption generation is performed
conducted in the literature that have only employed a subset
using predefined templates with a number of blank spaces that
of the available metrics. This can cause the results to be
are filled with objects, actions, and attributes recognised in the
unfairly evaluated. Thus, the current study reports the
input image.
outcomes in all commonly used metrics to prevent such a
circumstance. In the paper [14], the authors proposed the template slots for
generating captions that are filled with the predicted triplet
The suggested model is tested for ten representative photos
(object, action, scene) of visual components. Again, in the paper
from the dataset together with their ground-truth reference
[30], the authors used a Conditional Random Field (CRF) based
captions in order to show Further supporting the model's
technique to derive the objects (people, cars etc. or things like
purported efficacy is the outcome on these arbitrary photos.
trees, roads etc.), attributes, and prepositions. The model is
The following reasons contribute to the suggested model's
evaluated on PASCAL dataset using BLEU and ROUGE scores.
originality. First, early tests were conducted on the publicly
In this work, the best BLEU score obtained was 0.18 and
accessible Flickr8k dataset to examine a number of
corresponding best ROUGE score was 0.25.
alternative models and their hybrid variations before
deciding on the suggested model. It was determined to The authors of the paper [9] presented a method for caption
improve the approach suggested in this paper in light of generation by selecting valuable phrases from existing captions
those preliminary findings. On publicly available datasets and combines them carefully to create a new caption. It used 1
like Flickr8k and MSCOCO Captions, the suggested model
million captioned pictures corpus dataset, with 1000 images curriculum learning for image captioning. They evaluated
put aside as a test set to compute BLEU(0.189) and METEOR results on MS-COCO dataset with achieves BLEU-1 score of
(0.101) score. These methods are basically hard to design and 82.2 and a BLEU-2 score of 67.6.
depend on pre-defined template. Due to their dependence on
such tempates, these methods are not able to generate The authors of the paper [49] proposed the NIC (Neural Image
sentences or captions with variable lengths. Template-based Caption) model based on encoder-decoder architecture. In this
approaches are capable of generating captions that are model, CNN is utilised as the encoder, and the final layer of
grammatically correct. However, the length of the generated CNN is linked to the RNN decoder, which generates the text
captions is fixed due to the predefined templates. captions. In this model, LSTM is utilized as RNN. Junhua Mao
et al. [33] proposed the m-RNN model to accomplish the task of
In the paper [8], authors proposed a memory enhanced generating captions and the task of image and sentence retrieval.
captioning model for image captioning. They introduce This model provided a BLEU-1 score of 0.5050 on the Flickr8k
external memory based past knowledge to encode the dataset. Chetan Amritkar et al. [3] designed a neural network
caption, and further use the decoder to generate correct model which generates captions for images in natural language
caption. They evaluate the propose model on MS-COCO and yielded a score of 0.535 (BLEU-1) on Flickr8k dataset.
dataset with 3.5% improved CIDr. In retrieval based
captioning methods , the captions for images are generated by In the method of [50], an image is encoded into a numerical
collecting visually similar images. This type of approaches representation using convolutional neural network (CNN) and
find captions for visually comparable images from the the output of CNN is used as input in the decoder (RNN) to
training dataset after discovering visually similar images and generate the captions but one word at a time. Yan Chu et al. [10]
use those captions to return the caption of the query image. put forward a model using ResNet50 and LSTM with soft
attention that produced a BLEU-1 score of 0.619 on the Flickr8K
On the basis of millions of photos and their descriptions, the dataset. Sulabh Katiyar et al. [45] proposed two types of models,
authors of the paper [40] developed a model for finding a simple encoder-decoder model and an encoder-decoder model
similar images among the large number of images in the with attention. These models generate BLEU-1 scores of 0.6373
dataset and returning the descriptions of these retrieved and 0.6532 respectively on Flickr8k dataset.
images to query image.
In the paper [21], authors investigated region-based image
In the paper [35], the authors used density estimation method captioning method along with knowledge graph along with
to generate captions and got a BLEU score of approximately encoder-decode of model to validate the generated captions.
0.35. Again, in the paper [46], the authors used visual and They evaluated the proposed work on MS-COCO and Flickr30k
semantic similarity scores to cluster similar images. They dataset. A hierarchical deep neural network is proposed for
merge the images together, retrieve caption of the input image automatic image captioning in the paper [44]. The experimental
from captions of similar images in the same cluster. Some results are evaluated on MS-COCO dataset. In the paper [7],
researchers have proposed a ranking-based framework to authors proposed Bag-LSTM methods for automatic image
generate captions for each image by exposing it to sentence- captioning on MS-COCO dataset. Also, they proposed variants
based image captioning [20]. of LSTM on mentioned dataset and concluded that Bag-LSTM
perform better on CIDEr value. In the paper [17], authors
In the paper [18], authors proposed text based visual attention proposed fusion based text feature extraction for image
(TBVA) model for identifying salient object automatically. captioning using DNN(deep neural network) with LSTM. They
They evaluated proposed model on MSCOCO and Flickr30k evaluated the proposed model on Fliker30k dataset. A semantic
dataset. In the paper [39], authors proposed data-driven based embedding as global guidance and attention model have
approach for image description generation using retrieval- proposed in the paper [22].
based technique. They concluded that proposed method
provides efficient and relevant result to produce image The Experiments were conducted on Flickr8k, Flickr30k and
captions. Although these strategies provide syntactically MS-COCO to validate the proposed work. A R-CNN base top-
valid and generic sentences, they fail to produce image- down and bottom-up approach is proposed in paper [6]. The
specific and semantically correct sentences. Due to the model is improved by re-ranked caption using beam search
success of encoder-decoder architectures in machine decoders and explanatory features. A Reference based Long
translation, a similar encoder-decoder (neural network-based Short-Term Memory (R-LSTM) based method is proposed in
method) architecture has been successfully used in image the paper [11], for automatic image caption generation. They
captioning as well. These models depend on deep neural used weighting scheme between words and image to define
networks for producing description of input images, that are relevant caption. Validation of proposed model was done on
considered more precise than those generated by the other Fliker30k and MS-COCO dataset, and they commented on their
two categories of methods. In the paper[13],authors proposed experiment that CIDr value on MS-COCO dataset was increased
dual graph convolutional networks with transformer and 10.37%. After going through the studies on image caption
generation, mentioned above, some significant research • Limited Incorporation of Deep Learning Techniques:
points were identified. Firstly (RP1), CNN models used in Previous systems may not fully leverage the advancements in
most state-of-the-arts are pre-trained on ImageNet dataset deep learning, such as CNNs and RNNs, for image feature
which is object specific and not scene specific. Consequently, extraction and sequence generation, limiting their performance
those models produce object specific results. Secondly (RP2), compared to newer approaches.
most papers reported their results in terms of one or two
evaluation metrics only, such as accuracy and BLEU-1 score. • Lack of Attention Mechanisms: Existing systems may not
To examine RP1, VGG16 Hybrid Places1365 model is used incorporate attention mechanisms, which can help the model
as CNN in the proposed model to provide object and scene focus on relevant regions of the image while generating
specific result. This model has been pre-trained on both, the captions, resulting in captions that may not accurately reflect the
ImageNet and Places datasets. Furthermore, in order to content of the image.
address RP2, we evaluated the results using multiple
evaluation metrics such as BLEU, METEOR, ROUGE and
GLEU measures. 3.2 Proposed System

In order to overcome the difficulties and constraints of the


current captioning systems, a state-of-the-art application of deep
3.Proposed Methodology
learning techniques is presented in the proposed picture
captioning system. Fundamentally, the system makes use of an
3.1 Existing System
intricate architecture that blends recurrent neural networks
The current image captioning system mostly uses (RNNs), more especially Long Short-Term Memory (LSTM)
conventional techniques, which may not always yield precise models, for sequential caption creation, with convolutional
and contextually appropriate captions. These techniques neural networks (CNNs) for reliable picture feature extraction.
frequently rely on rule-based and manual feature engineering, With the help of this architecture, which is meant to learn the
which may be insufficient to capture the finer points and complex links between textual descriptions and visual data
subtleties of visual material. Furthermore, prior systems retrieved from images, the model is able to provide captions that
could not have fully utilised the advances in deep learning are both contextually relevant and descriptive. Moreover, the
methods for visual feature extraction and sequence model incorporates attention processes to dynamically
generation, such as convolutional neural networks (CNNs) concentrate on visually significant areas of the picture while
and recurrent neural networks (RNNs). Moreover, captioning.
multimodal fusion methods and attention mechanisms may
This improves the alignment of written descriptions with visual
not be included in current systems to improve captioning
content, leading to more precise and cohesive captions. The
performance. All things considered, the shortcomings of the
suggested method makes use of transfer learning strategies,
current systems highlight the necessity of creating a more
which is a crucial feature. The system may harness the
reliable and effective picture captioning system that makes
information from large-scale datasets and tailor it to the
use of cutting-edge deep learning techniques.
particular goal of picture captioning by finetuning pretrained
models. As a result, the model performs better and is better able
to generalise to new tasks and domains—even in situations when
3.1.1 Disadvantages of Existing System there is a shortage of training data. The suggested system has a
strong backend architecture and an easy-to-use interface that
• Limited Context Understanding: Conventional was created with the Streamlit framework. With this interface,
approaches often struggle to understand the context of the users can upload photos or enter image URLs with ease and get
image fully, leading to captions that may lack relevance or instantly produced, evocative captions. Through a smooth and
fail to capture the intended meaning. user-friendly interface, the system seeks to make sophisticated
picture captioning technology more accessible to a wider
• Manual Feature Engineering: Many existing systems audience. All things considered, the suggested approach is a
rely on manual feature engineering, which can be time- major breakthrough in the image captioning space, utilising
consuming and may not capture all relevant visual features cutting-edge deep learning methods to provide precise,
effectively, leading to suboptimal caption quality. contextually appropriate captions for a variety of images. The
• Rule-based Approaches: Rule-based approaches used system aims to push the frontiers of image captioning and
contribute to the development of more advanced AI systems for
in traditional systems may lack flexibility and adaptability,
comprehending and interpreting visual content through its
resulting in captions that are rigid and unable to handle
creative architecture and user-friendly interface.
variations or complexities in visual content.
3.2.1 Advantages of Proposed System 3.2.2 System Architecture
The proposed system for image captioning offers several
advantages over existing systems, leveraging state-of-the-art
deep learning techniques and user-friendly interface design to
enhance performance and usability:

• Accurate and Contextually Relevant Captions: By


utilizing a combination of convolutional neural networks
(CNNs) and recurrent neural networks (RNNs) with attention
mechanisms, the proposed system generates descriptive
captions that accurately capture the content and context of
images. The attention mechanisms enable dynamic focusing
on relevant regions of the image, improving the alignment
between visual features and textual descriptions.

• Transfer Learning for Improved Generalization: The


system incorporates transfer learning techniques by fine-
tuning pretrained models, allowing it to leverage knowledge
from large-scale datasets and adapt it to the task of image
captioning. This enhances the system's ability to generalize to
new tasks and domains , even with limited training data,
leading to improved captioning performance.
4.Results and Discussions

• Real-Time Caption Generation: With a user-friendly


interface built using the Streamlit framework, the proposed
system enables users to upload images or input image URLs
and receive descriptive captions in real-time. This real-time
caption generation capability enhances user experience and
facilitates quick and efficient captioning of images.

• Democratized Access to Advanced Technology: By


providing a seamless and intuitive user interface, the
proposed system democratizes access to advanced image
captioning technology. Users with varying levels of technical
expertise can easily utilize the system to generate descriptive
captions for images, without requiring specialized knowledge
in deep learning or computer vision. This is the User Interface we will get after executing the code.
• Enhanced Usability and Interactivity: The user-
friendly interface of the proposed system enhances usability
and interactivity, allowing users to interact with the system in
a more intuitive and engaging manner. The interface provides
clear instructions for uploading images or inputting image
URLs, making the captioning process straightforward and
accessible to a wide range of users.

• Potential for Further Development and


Customization: The modular design of the proposed system
allows for further development and customization.
Developers can extend the system's capabilities by
integrating additional features, such as multilingual
captioning or image search functionalities, to meet specific
user needs and preferences. Click on the Browse files button to browse the files in your
system to upload an image. Then click any image and click open
button to upload the image in the app.
After clicking Open button, the image will be Uploaded. Now After clicking Enter the image will be loaded.
click on Generate Caption Button to predict the Caption for
uploaded image.

Now click on Generate Caption Button to predict the Caption


for loaded image.

After Clicking the Generate Caption button we will get the


predicted caption along with the speech option. This option
enables us to hear the predicted caption.

After Clicking the Generate Caption button we will get the


Click On Image URL option and enter the path of the image predicted caption along with the speech option. This option
in the provide space and click enter to load the image. enables us to hear the predicted caption.
BLEU Score: 5.1 Future Scope
The image captioning project sets the stage for future
BLEU1 0.540437 exploration and enhancement in multimodal deep learning,
paving the way for advancements in both technical capabilities
BLEU2 0.316454 and real-world applications. One avenue for future
improvement lies in the integration of more sophisticated
attention mechanisms.

5. Conclusion Exploring novel architectures, particularly transformer-based


models, presents another promising direction for future
The culmination of our image captioning project exemplifies research. These architectures have shown great potential in
the transformative prowess inherent in deep learning various natural language processing tasks and could potentially
methodologies, notably convolutional neural networks surpass the performance of traditional CNN-RNN architectures
(CNNs) and recurrent neural networks (RNNs), particularly in image captioning. By leveraging transformer-based models
Long Short-Term Memory (LSTM) networks. By seamlessly for both image feature extraction and sequence generation,
integrating these cutting-edge technologies, our system researchers can unlock new levels of efficiency and
excels in automatically generating descriptive and effectiveness in caption generation.
contextually relevant captions for images. Leveraging
pretrained models like VGG16 for image feature extraction Expanding the scope and diversity of datasets used for training
and incorporating attention mechanisms for enhanced is crucial for improving the generalization capability of image
captioning accuracy, we've showcased the symbiotic captioning systems. Incorporating larger and more diverse
relationship between computer vision and natural language datasets, such as Open Images or Conceptual Captions, would
processing. expose the system to a wider range of visual and textual
contexts, ultimately leading to more robust and adaptable
Moreover, the utilization of Streamlit as our frontend models.
framework underscores our commitment to user-centric
design, facilitating effortless interaction for users to upload Furthermore, fine-tuning pre-trained language models like
images or input image URLs. This intuitive interface elevates BERT or GPT holds promise for enhancing the system's
the overall user experience and accessibility of our system, understanding of language semantics and generating higher-
ensuring that our technology remains inclusive and user- quality captions. By leveraging the rich semantic
friendly. representations learned by these models, image captioning
systems can produce more nuanced and contextually relevant
Throughout our journey, we've underscored the critical descriptions. Beyond technical advancements, there is vast
importance of data preprocessing, model loading, and potential for applying image captioning systems in real-world
optimization techniques in achieving optimal performance. scenarios. These systems can play a crucial role in content
Evaluation metrics such as BLEU serve as litmus tests, indexing, enabling efficient retrieval and organization of visual
validating our system's efficacy in caption generation. information. Additionally, they have the potential to serve as
Additionally, our exploration of different architectures accessibility technologies for visually impaired individuals,
highlights the nuanced considerations involved in model providing auditory descriptions of visual content. Integration
selection and parameter tuning, laying the groundwork for with augmented reality (AR) and virtual reality (VR)
future advancements in multimodal deep learning technologies could further enhance user experiences, enabling
applications. immersive interactions with captioned images in virtual
environments.
Looking forward, our project sets the stage for further
innovation, with potential applications ranging from content In summary, the future of image captioning is characterized by
indexing to accessibility technologies and image-centric user ongoing research and development efforts aimed at enhancing
experiences. By pushing the boundaries of artificial model architectures, expanding datasets, and exploring new
intelligence, we've contributed to the ongoing evolution of applications. By addressing these challenges, image captioning
intelligent systems capable of understanding and systems have the potential to become indispensable tools for
communicating with humans in increasingly nuanced and understanding and interpreting visual content in diverse
contextually relevant ways. contexts, driving innovation and advancement in the field of
multimodal deep learning.
5.2 References and Pattern Recognition (pp. 2533-2542).

[1] Anderson, P., He, X., Buehler, C., Teney, D., Johnson, [11] Redmon, J., Divvala, S., Girshick, R., & Farhadi, A.
M., Gould, S., & Zhang, L. (2018). Bottom-Up and Top- (2016). You only look once: Unified, real-time object detection.
Down Attention for Image Captioning and Visual Question In Proceedings of the IEEE Conference on Computer Vision
Answering. In Proceedings of the IEEE Conference on and Pattern Recognition (pp. 779-788).
Computer Vision and Pattern Recognition (pp. 6077-6086).
[12] Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-
[2] Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., cnn: Towards real-time object detection with region proposal
& Yuille, A. L. (2018). DeepLab: Semantic image networks. In Advances in neural information processing
segmentation with deep convolutional nets, atrous systems (pp. 91-99).
convolution, and fully connected crfs. IEEE transactions on
[13] Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015).
pattern analysis and machine intelligence, 40(4), 834-848.
Show and tell: A neural image caption generator. In Proceedings
[3] Donahue, J., Anne Hendricks, L., Guadarrama, S., of the IEEE Conference on Computer Vision and Pattern
Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. Recognition (pp. 3156-3164).
(2015). Long-term recurrent convolutional networks for
[14] Xu, J., Mei, T., Yao, T., & Rui, Y. (2015). MSR-VTT: A
visual recognition and description. In Proceedings of the
large video description dataset for bridging video and language.
IEEE conference on computer vision and pattern recognition
In Proceedings of the IEEE Conference on Computer Vision
(pp. 2625-2634).
and Pattern Recognition (pp. 4105-4114).
[4] Fang, H., Gupta, S., Iandola, F., Srivastava, R. K., Deng,
[15] Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X.,
L., Dollár, P., ... & He, K. (2017). From captions to visual
& Metaxas, D. N. (2017). StackGAN: Text to photo-realistic
concepts and back. In Proceedings of the IEEE Conference
image synthesis with stacked generative adversarial networks.
on Computer Vision and Pattern Recognition (pp. 1473-
In Proceedings of the IEEE International Conference on
1482).
Computer Vision (pp. 5907-5915).
[5] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep
residual learning for image recognition. In Proceedings of the
IEEE Conference on Computer Vision and Pattern
Recognition (pp. 770-778).

[6] Johnson, J., Alahi, A., & Fei-Fei, L. (2016). Perceptual


losses for real-time style transfer and super-resolution. In
European Conference on Computer Vision (pp. 694-711).

[7] Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic


alignments for generating image descriptions. In Proceedings
of the IEEE Conference on Computer Vision and Pattern
Recognition (pp. 3128-3137).

[8] Li, Y., Gong, Y., Zhang, X., & Huang, J. (2018).
Generating diverse and natural text-to-image via conditional
generative adversarial networks. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition
(pp. 8837-8846).

[9] Lu, J., Xiong, C., Parikh, D., & Socher, R. (2016).
Knowing When to Look: Adaptive Attention via A Visual
Sentinel for Image Captioning. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition
(pp. 375-383).

[10] Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., &
Yuille, A. (2016). Learning like a child: Fast novel visual
concept learning from sentence descriptions of images. In
Proceedings of the IEEE Conference on Computer Vision

You might also like