0% found this document useful (0 votes)
17 views9 pages

Image Caption Generation Research Paper

This document presents a novel technique for image caption generation that integrates Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks to enhance the contextual accuracy of captions. The proposed method demonstrates superior performance on benchmark datasets like Flickr8k and MSCOCO, showcasing its ability to generate detailed and human-like descriptions of images. The study highlights the limitations of existing methods and emphasizes the importance of using advanced deep learning techniques and comprehensive evaluation metrics in image captioning research.

Uploaded by

acreations221
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
17 views9 pages

Image Caption Generation Research Paper

This document presents a novel technique for image caption generation that integrates Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks to enhance the contextual accuracy of captions. The proposed method demonstrates superior performance on benchmark datasets like Flickr8k and MSCOCO, showcasing its ability to generate detailed and human-like descriptions of images. The study highlights the limitations of existing methods and emphasizes the importance of using advanced deep learning techniques and comprehensive evaluation metrics in image captioning research.

Uploaded by

acreations221
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 9

Describing the Unseen: A novel Technique for Image

Caption Generation using Deep Learning


1
Ganta Srija, 1Goje Sri Akshar, 1Jale Nikhitha, 1Jillela Likhitha, 2Dr.Sujit Das and 3Dr.Thayyaba Khatoon
1
UG Students, 2Assistant Professor, 3Professor & HoD
Department of Artificial Intelligence and Machine Learning, School of Engineering,
Malla Reddy University, Maisammaguda, Dulapally,
Hyderabad, Telangana 500100

[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]

Abstract

The field of image caption creation has undergone a revolution in recent years due to the integration of deep learning techniques,
specifically Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs). This work introduces a
revolutionary method that combines CNN and LSTM to describe visual content with previously unheard-of contextual and
accuracy. This method combines the sequential modelling capabilities of LSTMs with the image feature extraction strengths of
CNNs to provide captions for visible elements while also capturing subtleties and unseen information in images. We prove the
efficacy of our method on benchmark datasets by extensive testing and assessment, highlighting its capacity to generate detailed
and insightful descriptions that closely resemble human perception. This innovative method not only progresses. This innovative
strategy opens up new possibilities for deep learning approaches to comprehend and analyze visual content in addition to
pushing the boundaries of picture captioning.

Keywords: Long Short-Term Memory, Convolutional Neural Networks, Sequential Modelling.

variety of functions. In the study, one application of AI has


1. Introduction been covered. They identified the face using deep learning
techniques. The primary goal of automatic image caption
generation is to produce coherent sentences that explain the
In computer vision, automatic picture caption generation is image's content and the relationship between the items that
an ongoing research area. Because this issue combines two are recognized in the image. These sentences can then be
of the primary domains of artificial intelligence (AI), utilized as suggestions in a variety of applications.
computer vision and natural language processing, and has
It can be applied to a number of natural language processing
a wide range of practical applications, researchers are
tasks, including social media recommendation, image
drawn to it. Understanding an image is the first step in
indexing, virtual assistants, and visually impaired people. The
creating a meaningful sentence out of it, and object
creation of picture captions can aid machines in
identification and image classification can help with this.
comprehending the content of images. It involves more than
Actually, object detection and image classification are
just finding objects in a picture; it also entails figuring out
easier tasks than automatic caption synthesis. By looking
how the objects that have been found relate to one another.
at photos, people can create captions for them. Finding
visual things that are represented in a picture and Image captioning techniques are divided into three categories
determining the relationship between those images are by researchers: deep neural network-based, retrieval-based,
inherent human abilities. Caption generation is a skill that and template-based [36]. First, attributes, objects, and actions
is acquired via learning and experience. are identified from the image using template-based
approaches. Next, predefined templates with a certain number
It is theorized that machines can learn to comprehend the
of blank spaces are filled in. Retrieving an image that
relationships between objects and attain accuracy
resembles the input image is how retrieval-based approaches
comparable to humans by using a variety of datasets for
generate captions. These techniques provide syntactically
training. With the significant advancements in Artificial
accurate captions; yet, picture specificity and accuracy in
Intelligence (AI), photos are now being used as input for a
meaning is not assured. Using a linguistic model, captions are
generated once the image has been encoded in deep neural like Flickr8k and MSCOCO Captions, the suggested model
network-based approaches. The deep neural network-based yields state-of-the-art results for image captioning. The
approach has the potential to produce semantically more enhanced outcomes on a pair of datasets demonstrate the
accurate captions for the provided photos in comparison to applicability of the suggested model. Second, the captions
the previous two methods. Most existing picture that the suggested model generates for real-time sample
captioning models use deep neural network architecture. photographs confirm the effectiveness of the suggested model
The first to define a multi-modal log-bilinear model for once more. Thirdly, the writers of the majority of studies
image captioning with a fixed context window was Kiros. published in the literature have only reported findings using a
CNNs and RNNs are primarily utilized in encoder-decoder limited set of evaluation criteria. However, these particular
image captioning models. RNN is utilized as a language measurements don't give a fair assessment of their different
model or decoder and CNN as an encoder to create a 1-D strategies. On the other hand, in order to present a just and
array representation of the input image. the picture fair review, the suggested model is assessed using every
description. Determining appropriate CNN and RNN metric available. Based on the results, it is evident that the
models is a difficult problem. The following is a synopsis suggested model performs well across all criteria, indicating
of our research contributions: its effectiveness and uniqueness.

• For automatic picture captioning, this study suggests


using VGG16 Hybrid Places 1365 and LSTM as the
encoder and decoder, respectively. The suggested model 2. Literature Survey
performs better than any cutting-edge methods.
A wide range of research has been carried out previously in
• Using the Flickr8k and MSCOCO Captions datasets, this the field of image captioning and content generation for image
study offers experimental results using all prominent captioning. Tn the paper [4], authors proposed content
metrics, including BLEU, ROUGE L, METEOR, and selection methods for image caption generation. In this paper,
GLEU. Mainly three categories of features i.e geometric, conceptual,
and visual are used for content generation. Also, variety of
• The study presents the suggested model's validation methods have been proposed for image caption generation in
results using a random sample of live images. the past. They may be classified in broadly three categories i.e.,
Template-based methods [14, 30, 51, 37], Retrieval-based
Regarding the first contribution, the encoder used in this
methods [40, 43, 16, 46, 20], and Deep neural network based
work is the VGG16 Hybrid Places 1365 model. The
(Encoder-decoder) methods [49, 3, 15, 12]. These models are
ImageNet and Places 365 datasets are used to train this
often built using CNN to encode the image & extract visual
model. As a result, 1365 output class It is anticipated that
information whereas RNN is used to decode the visual
the training set's greater sample size and class diversity
information into a sentence.
will improve generalizability and yield more precise
results for the picture caption creation challenge. This A detail study has been done on deep neural network-based
aligns with the outcomes of the trials that were conducted. image captioning models in the paper [32]. In this paper,
authors surveyed the deep learning-based models used for
Regarding contribution 2, a number of studies have been
image captioning on MS-COCO and Flickr30k dataset. In
conducted in the literature that have only employed a
template-based approach, the process of caption generation is
subset of the available metrics. This can cause the results
performed using predefined templates with a number of blank
to be unfairly evaluated. Thus, the current study reports the
spaces that are filled with objects, actions, and attributes
outcomes in all commonly used metrics to prevent such a
recognised in the input image.
circumstance.
In the paper [14], the authors proposed the template slots for
The suggested model is tested for ten representative photos
generating captions that are filled with the predicted triplet
from the dataset together with their ground-truth reference
(object, action, scene) of visual components. Again, in the
captions in order to show Further supporting the model's
paper [30], the authors used a Conditional Random Field
purported efficacy is the outcome on these arbitrary
(CRF) based technique to derive the objects (people, cars etc.
photos. The following reasons contribute to the suggested
or things like trees, roads etc.), attributes, and prepositions. The
model's originality. First, early tests were conducted on the
model is evaluated on PASCAL dataset using BLEU and
publicly accessible Flickr8k dataset to examine a number
ROUGE scores. In this work, the best BLEU score obtained
of alternative models and their hybrid variations before
was 0.18 and corresponding best ROUGE score was 0.25.
deciding on the suggested model. It was determined to
improve the approach suggested in this paper in light of The authors of the paper [9] presented a method for caption
those preliminary findings. On publicly available datasets generation by selecting valuable phrases from existing captions
and combines them carefully to create a new caption. It used description of input images, that are considered more precise
1 million captioned pictures corpus dataset, with 1000 than those generated by the other two categories of methods. In
images put aside as a test set to compute BLEU(0.189) and the paper[13],authors proposed dual graph convolutional
METEOR (0.101) score. These methods are basically hard networks with transformer and curriculum learning for image
to design and depend on pre-defined template. Due to their captioning. They evaluated results on MS-COCO dataset with
dependence on such tempates, these methods are not able to achieves BLEU-1 score of 82.2 and a BLEU-2 score of 67.6.
generate sentences or captions with variable lengths.
Template-based approaches are capable of generating The authors of the paper [49] proposed the NIC (Neural Image
captions that are grammatically correct. However, the length Caption) model based on encoder-decoder architecture. In this
of the generated captions is fixed due to the predefined model, CNN is utilised as the encoder, and the final layer of
templates. CNN is linked to the RNN decoder, which generates the text
captions. In this model, LSTM is utilized as RNN. Junhua Mao
In the paper [8], authors proposed a memory enhanced et al. [33] proposed the m-RNN model to accomplish the task
captioning model for image captioning. They introduce of generating captions and the task of image and sentence
external memory based past knowledge to encode the retrieval. This model provided a BLEU-1 score of 0.5050 on
caption, and further use the decoder to generate correct the Flickr8k dataset. Chetan Amritkar et al. [3] designed a
caption. They evaluate the propose model on MS-COCO neural network model which generates captions for images in
dataset with 3.5% improved CIDr. In retrieval based natural language and yielded a score of 0.535 (BLEU-1) on
captioning methods , the captions for images are generated Flickr8k dataset.
by collecting visually similar images. This type of
approaches find captions for visually comparable images In the method of [50], an image is encoded into a numerical
from the training dataset after discovering visually similar representation using convolutional neural network (CNN) and
images and use those captions to return the caption of the the output of CNN is used as input in the decoder (RNN) to
query image. generate the captions but one word at a time. Yan Chu et al.
[10] put forward a model using ResNet50 and LSTM with soft
On the basis of millions of photos and their descriptions, the attention that produced a BLEU-1 score of 0.619 on the
authors of the paper [40] developed a model for finding Flickr8K dataset. Sulabh Katiyar et al. [45] proposed two types
similar images among the large number of images in the of models, a simple encoder-decoder model and an encoder-
dataset and returning the descriptions of these retrieved decoder model with attention. These models generate BLEU-1
images to query image. scores of 0.6373 and 0.6532 respectively on Flickr8k dataset.

In the paper [35], the authors used density estimation In the paper [21], authors investigated region-based image
method to generate captions and got a BLEU score of captioning method along with knowledge graph along with
approximately 0.35. Again, in the paper [46], the authors encoder-decode of model to validate the generated captions.
used visual and semantic similarity scores to cluster similar They evaluated the proposed work on MS-COCO and
images. They merge the images together, retrieve caption of Flickr30k dataset. A hierarchical deep neural network is
the input image from captions of similar images in the same proposed for automatic image captioning in the paper [44]. The
cluster. Some researchers have proposed a ranking-based experimental results are evaluated on MS-COCO dataset. In
framework to generate captions for each image by exposing the paper [7], authors proposed Bag-LSTM methods for
it to sentence-based image captioning [20]. automatic image captioning on MS-COCO dataset. Also, they
proposed variants of LSTM on mentioned dataset and
In the paper [18], authors proposed text based visual concluded that Bag-LSTM perform better on CIDEr value. In
attention (TBVA) model for identifying salient object the paper [17], authors proposed fusion based text feature
automatically. They evaluated proposed model on extraction for image captioning using DNN(deep neural
MSCOCO and Flickr30k dataset. In the paper [39], authors network) with LSTM. They evaluated the proposed model on
proposed data-driven based approach for image description Fliker30k dataset. A semantic embedding as global guidance
generation using retrieval-based technique. They concluded and attention model have proposed in the paper [22].
that proposed method provides efficient and relevant result
to produce image captions. Although these strategies The Experiments were conducted on Flickr8k, Flickr30k and
provide syntactically valid and generic sentences, they fail MS-COCO to validate the proposed work. A R-CNN base top-
to produce image-specific and semantically correct down and bottom-up approach is proposed in paper [6]. The
sentences. Due to the success of encoder-decoder model is improved by re-ranked caption using beam search
architectures in machine translation, a similar encoder- decoders and explanatory features. A Reference based Long
decoder (neural network-based method) architecture has Short-Term Memory (R-LSTM) based method is proposed in
been successfully used in image captioning as well. These the paper [11], for automatic image caption generation. They
models depend on deep neural networks for producing used weighting scheme between words and image to define
relevant caption. Validation of proposed model was done on in traditional systems may lack flexibility and adaptability,
Fliker30k and MS-COCO dataset, and they commented on resulting in captions that are rigid and unable to handle
their experiment that CIDr value on MS-COCO dataset was variations or complexities in visual content.
increased 10.37%. After going through the studies on image
caption generation, mentioned above, some significant • Limited Incorporation of Deep Learning Techniques:
research points were identified. Firstly (RP1), CNN models Previous systems may not fully leverage the advancements in
used in most state-of-the-arts are pre-trained on ImageNet deep learning, such as CNNs and RNNs, for image feature
dataset which is object specific and not scene specific. extraction and sequence generation, limiting their performance
Consequently, those models produce object specific results. compared to newer approaches.
Secondly (RP2), most papers reported their results in terms
• Lack of Attention Mechanisms: Existing systems may
of one or two evaluation metrics only, such as accuracy and
not incorporate attention mechanisms, which can help the
BLEU-1 score. To examine RP1, VGG16 Hybrid
model focus on relevant regions of the image while generating
Places1365 model is used as CNN in the proposed model to
captions, resulting in captions that may not accurately reflect
provide object and scene specific result. This model has
the content of the image.
been pre-trained on both, the ImageNet and Places datasets.
Furthermore, in order to address RP2, we evaluated the
results using multiple evaluation metrics such as BLEU,
METEOR, ROUGE and GLEU measures. 3.2 Proposed System
In order to overcome the difficulties and constraints of the
current captioning systems, a state-of-the-art application of
3.Proposed Methodology deep learning techniques is presented in the proposed picture
captioning system. Fundamentally, the system makes use of an
3.1 Existing System intricate architecture that blends recurrent neural networks
(RNNs), more especially Long Short-Term Memory (LSTM)
The current image captioning system mostly uses
models, for sequential caption creation, with convolutional
conventional techniques, which may not always yield
neural networks (CNNs) for reliable picture feature extraction.
precise and contextually appropriate captions. These
With the help of this architecture, which is meant to learn the
techniques frequently rely on rule-based and manual feature
complex links between textual descriptions and visual data
engineering, which may be insufficient to capture the finer
retrieved from images, the model is able to provide captions
points and subtleties of visual material. Furthermore, prior
that are both contextually relevant and descriptive. Moreover,
systems could not have fully utilised the advances in deep
the model incorporates attention processes to dynamically
learning methods for visual feature extraction and sequence
concentrate on visually significant areas of the picture while
generation, such as convolutional neural networks (CNNs)
captioning.
and recurrent neural networks (RNNs). Moreover,
multimodal fusion methods and attention mechanisms may This improves the alignment of written descriptions with visual
not be included in current systems to improve captioning content, leading to more precise and cohesive captions. The
performance. All things considered, the shortcomings of the suggested method makes use of transfer learning strategies,
current systems highlight the necessity of creating a more which is a crucial feature. The system may harness the
reliable and effective picture captioning system that makes information from large-scale datasets and tailor it to the
use of cutting-edge deep learning techniques. particular goal of picture captioning by finetuning pretrained
models. As a result, the model performs better and is better
able to generalise to new tasks and domains—even in
3.1.1 Disadvantages of Existing System situations when there is a shortage of training data. The
suggested system has a strong backend architecture and an
• Limited Context Understanding: Conventional easy-to-use interface that was created with the Streamlit
approaches often struggle to understand the context of the framework. With this interface, users can upload photos or
image fully, leading to captions that may lack relevance or enter image URLs with ease and get instantly produced,
fail to capture the intended meaning. evocative captions. Through a smooth and user-friendly
interface, the system seeks to make sophisticated picture
• Manual Feature Engineering: Many existing systems captioning technology more accessible to a wider audience. All
rely on manual feature engineering, which can be time- things considered, the suggested approach is a major
consuming and may not capture all relevant visual features breakthrough in the image captioning space, utilising cutting-
effectively, leading to suboptimal caption quality. edge deep learning methods to provide precise, contextually
appropriate captions for a variety of images. The system aims
• Rule-based Approaches: Rule-based approaches used
to push the frontiers of image captioning and contribute to integrating additional features, such as multilingual captioning
the development of more advanced AI systems for or image search functionalities, to meet specific user needs and
comprehending and interpreting visual content through its preferences.
creative architecture and user-friendly interface.

3.2.1 Advantages of Proposed System


3.2.2 System Architecture
The proposed system for image captioning offers several
advantages over existing systems, leveraging state-of-the-art
deep learning techniques and user-friendly interface design
to enhance performance and usability:

• Accurate and Contextually Relevant Captions: By


utilizing a combination of convolutional neural networks
(CNNs) and recurrent neural networks (RNNs) with
attention mechanisms, the proposed system generates
descriptive captions that accurately capture the content and
context of images. The attention mechanisms enable
dynamic focusing on relevant regions of the image,
improving the alignment between visual features and textual
descriptions.

• Transfer Learning for Improved Generalization:


The system incorporates transfer learning techniques by
fine-tuning pretrained models, allowing it to leverage
knowledge from large-scale datasets and adapt it to the task
of image captioning. This enhances the system's ability to
generalize to new tasks and domains , even with limited 4.Results and Discussions
training data, leading to improved captioning performance.

• Real-Time Caption Generation: With a user-friendly


interface built using the Streamlit framework, the proposed
system enables users to upload images or input image URLs
and receive descriptive captions in real-time. This real-time
caption generation capability enhances user experience and
facilitates quick and efficient captioning of images.

• Democratized Access to Advanced Technology: By


providing a seamless and intuitive user interface, the
proposed system democratizes access to advanced image
captioning technology. Users with varying levels of
technical expertise can easily utilize the system to generate
descriptive captions for images, without requiring
specialized knowledge in deep learning or computer vision. This is the User Interface we will get after executing the code.

• Enhanced Usability and Interactivity: The user-


friendly interface of the proposed system enhances usability
and interactivity, allowing users to interact with the system
in a more intuitive and engaging manner. The interface
provides clear instructions for uploading images or inputting
image URLs, making the captioning process straightforward
and accessible to a wide range of users.

• Potential for Further Development and


Customization: The modular design of the proposed
system allows for further development and customization.
Developers can extend the system's capabilities by
Click on the Browse files button to browse the files in your
system to upload an image. Then click any image and click
open button to upload the image in the app. Click On Image URL option and enter the path of the image in
the provide space and click enter to load the image.

After clicking Open button, the image will be Uploaded. Now


click on Generate Caption Button to predict the Caption for After clicking Enter the image will be loaded.
uploaded image.

Now click on Generate Caption Button to predict the


Caption for loaded image.
After Clicking the Generate Caption button we will get the
predicted caption along with the speech option. This option
enables us to hear the predicted caption.
Additionally, our exploration of different architectures
highlights the nuanced considerations involved in model
selection and parameter tuning, laying the groundwork for
future advancements in multimodal deep learning applications.

Looking forward, our project sets the stage for further


innovation, with potential applications ranging from content
indexing to accessibility technologies and image-centric user
experiences. By pushing the boundaries of artificial
intelligence, we've contributed to the ongoing evolution of
intelligent systems capable of understanding and
communicating with humans in increasingly nuanced and
contextually relevant ways.

After Clicking the Generate Caption button we will get


the predicted caption along with the speech option. This
option enables us to hear the predicted caption. 5.1 Future Scope

BLEU Score: The image captioning project sets the stage for future
exploration and enhancement in multimodal deep learning,
paving the way for advancements in both technical capabilities
and real-world applications. One avenue for future
BLEU1 0.540437
improvement lies in the integration of more sophisticated
BLEU2 0.316454 attention mechanisms.

Exploring novel architectures, particularly transformer-based


5. Conclusion models, presents another promising direction for future
research. These architectures have shown great potential in
The culmination of our image captioning project various natural language processing tasks and could potentially
exemplifies the transformative prowess inherent in deep surpass the performance of traditional CNN-RNN
learning methodologies, notably convolutional neural architectures in image captioning. By leveraging transformer-
networks (CNNs) and recurrent neural networks (RNNs), based models for both image feature extraction and sequence
particularly Long Short-Term Memory (LSTM) networks. generation, researchers can unlock new levels of efficiency
By seamlessly integrating these cutting-edge technologies, and effectiveness in caption generation.
our system excels in automatically generating descriptive
Expanding the scope and diversity of datasets used for training
and contextually relevant captions for images. Leveraging
is crucial for improving the generalization capability of image
pretrained models like VGG16 for image feature extraction
captioning systems. Incorporating larger and more diverse
and incorporating attention mechanisms for enhanced
datasets, such as Open Images or Conceptual Captions, would
captioning accuracy, we've showcased the symbiotic
expose the system to a wider range of visual and textual
relationship between computer vision and natural language
contexts, ultimately leading to more robust and adaptable
processing.
models.
Moreover, the utilization of Streamlit as our frontend
Furthermore, fine-tuning pre-trained language models like
framework underscores our commitment to user-centric
BERT or GPT holds promise for enhancing the system's
design, facilitating effortless interaction for users to upload
understanding of language semantics and generating higher-
images or input image URLs. This intuitive interface
quality captions. By leveraging the rich semantic
elevates the overall user experience and accessibility of our
representations learned by these models, image captioning
system, ensuring that our technology remains inclusive and
systems can produce more nuanced and contextually relevant
user-friendly.
descriptions. Beyond technical advancements, there is vast
Throughout our journey, we've underscored the critical potential for applying image captioning systems in real-world
importance of data preprocessing, model loading, and scenarios. These systems can play a crucial role in content
optimization techniques in achieving optimal performance. indexing, enabling efficient retrieval and organization of visual
Evaluation metrics such as BLEU serve as litmus tests, information. Additionally, they have the potential to serve as
validating our system's efficacy in caption generation. accessibility technologies for visually impaired individuals,
providing auditory descriptions of visual content. [8] Li, Y., Gong, Y., Zhang, X., & Huang, J. (2018).
Integration with augmented reality (AR) and virtual reality Generating diverse and natural text-to-image via conditional
(VR) technologies could further enhance user experiences, generative adversarial networks. In Proceedings of the IEEE
enabling immersive interactions with captioned images in Conference on Computer Vision and Pattern Recognition (pp.
virtual environments. 8837-8846).

In summary, the future of image captioning is characterized [9] Lu, J., Xiong, C., Parikh, D., & Socher, R. (2016).
by ongoing research and development efforts aimed at Knowing When to Look: Adaptive Attention via A Visual
enhancing model architectures, expanding datasets, and Sentinel for Image Captioning. In Proceedings of the IEEE
exploring new applications. By addressing these challenges, Conference on Computer Vision and Pattern Recognition (pp.
image captioning systems have the potential to become 375-383).
indispensable tools for understanding and interpreting
visual content in diverse contexts, driving innovation and [10] Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., & Yuille,
advancement in the field of multimodal deep learning. A. (2016). Learning like a child: Fast novel visual concept
learning from sentence descriptions of images. In Proceedings
of the IEEE Conference on Computer Vision and Pattern
Recognition (pp. 2533-2542).
5.2 References
[11] Redmon, J., Divvala, S., Girshick, R., & Farhadi, A.
[1] Anderson, P., He, X., Buehler, C., Teney, D., Johnson, (2016). You only look once: Unified, real-time object
M., Gould, S., & Zhang, L. (2018). Bottom-Up and Top- detection. In Proceedings of the IEEE Conference on
Down Attention for Image Captioning and Visual Question Computer Vision and Pattern Recognition (pp. 779-788).
Answering. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (pp. 6077-6086). [12] Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-
cnn: Towards real-time object detection with region proposal
[2] Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., networks. In Advances in neural information processing
& Yuille, A. L. (2018). DeepLab: Semantic image systems (pp. 91-99).
segmentation with deep convolutional nets, atrous
convolution, and fully connected crfs. IEEE transactions on [13] Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015).
pattern analysis and machine intelligence, 40(4), 834-848. Show and tell: A neural image caption generator. In
Proceedings of the IEEE Conference on Computer Vision and
[3] Donahue, J., Anne Hendricks, L., Guadarrama, S., Pattern Recognition (pp. 3156-3164).
Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T.
(2015). Long-term recurrent convolutional networks for [14] Xu, J., Mei, T., Yao, T., & Rui, Y. (2015). MSR-VTT: A
visual recognition and description. In Proceedings of the large video description dataset for bridging video and
IEEE conference on computer vision and pattern language. In Proceedings of the IEEE Conference on
recognition (pp. 2625-2634). Computer Vision and Pattern Recognition (pp. 4105-4114).

[4] Fang, H., Gupta, S., Iandola, F., Srivastava, R. K., Deng, [15] Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang,
L., Dollár, P., ... & He, K. (2017). From captions to visual X., & Metaxas, D. N. (2017). StackGAN: Text to photo-
concepts and back. In Proceedings of the IEEE Conference realistic image synthesis with stacked generative adversarial
on Computer Vision and Pattern Recognition (pp. 1473- networks. In Proceedings of the IEEE International
1482). Conference on Computer Vision (pp. 5907-5915).

[5] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep
residual learning for image recognition. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition (pp. 770-778).

[6] Johnson, J., Alahi, A., & Fei-Fei, L. (2016). Perceptual


losses for real-time style transfer and super-resolution. In
European Conference on Computer Vision (pp. 694-711).

[7] Karpathy, A., & Fei-Fei, L. (2015). Deep visual-


semantic alignments for generating image descriptions. In
Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (pp. 3128-3137).

You might also like