
Deep Learning Approaches on Image Captioning: A Review

arXiv:2201.12944v5 [cs.CV] 22 Aug 2023

Taraneh Ghandi                Hamidreza Pourreza                  Hamidreza Mahyar*
Faculty of Engineering        Computer Engineering Department     Faculty of Engineering
McMaster University           Ferdowsi University of Mashhad      McMaster University
Hamilton, Ontario, Canada     Mashhad, Iran                       Hamilton, Ontario, Canada
[email protected]              [email protected]                    [email protected]

* Corresponding author

August 23, 2023

ABSTRACT
Image captioning is a research area of immense importance, aiming to generate natural language descriptions for visual content in the form of still images. The advent of deep learning and more recently vision-language pre-training techniques has revolutionized the field, leading to more sophisticated methods and improved performance. In this survey paper, we provide a structured review of deep learning methods in image captioning by presenting a comprehensive taxonomy and discussing each method category in detail. Additionally, we examine the datasets commonly employed in image captioning research, as well as the evaluation metrics used to assess the performance of different captioning models. We address the challenges faced in this field by emphasizing issues such as object hallucination, missing context, illumination conditions, contextual understanding, and referring expressions. We rank different deep learning methods' performance according to widely used evaluation metrics, giving insight into the current state of the art. Furthermore, we identify several potential future directions for research in this area, which include tackling the information misalignment problem between image and text modalities, mitigating dataset bias, incorporating vision-language pre-training methods to enhance caption generation, and developing improved evaluation tools to accurately measure the quality of image captions.

Keywords: Image Captioning, Deep Learning, Text Generation, Neural Networks, Machine Translation

1 Introduction
Automatic image captioning is a critical research problem with numerous complexities, attracting a significant amount
of work with extensive applications across various domains such as human-computer interaction [81, 34, 155], medical
image captioning and prescription [108, 64, 9], traffic data analysis [79], quality control in industry [94], and especially
assistive technologies for visually impaired individuals [52, 123, 29, 2, 96]. The field has undergone a revolutionary
transformation with the development and growth of deep learning techniques [128, 112], resulting in the emergence of
advanced methods and enhanced performance. Automatic image captioning lies at the intersection of natural language
processing and computer vision. This field of research deals with the creation of textual descriptions for images without
human intervention. Given an input image I, the goal is to generate a caption C describing the visual contents present inside the given image, with C being a set of sentences C = {c_1, c_2, ..., c_n}, where each c_i is a sentence of the generated caption C.
Given the recent advancements in the domain of image captioning, an updated review of the more recent research
works can assist researchers in keeping up with the latest progress in this field. There exist numerous literature reviews
and surveys on image captioning, providing an extensive collection of research conducted in previous years. Notably,
Hossain et al. [58] authored a comprehensive survey paper that served as an inspiration for this work’s structure.

However, instead of a pairwise comparison like in [58], we have organized our paper to feature a separate section
for each method category and follow the same order of category in our discussion (section 5). Furthermore, most
surveys typically cover works dating from 2018 and earlier, while more recent research is yet to be addressed. Some
surveys [31] are limited in the number of research works covered, while others [21] do not delve into the methodologies’
specific details. Additionally, considering the recent advancements of vision language pre-training methods, image
captioning methods that fall under this category must be addressed and discussed. This category has seldom been
explored in previous survey works. The field of image captioning can be classified into multiple categories which differ
in the captioning settings, such as dense captioning methods which provide captions for each entity presented in the
image or whole image captioning methods that provide captions for the entirety of the input image. Here, we focus on
reviewing "whole image" captioning methods.
In this paper, we discuss various methods of image captioning introduced in papers published from 2018 to 2022,
followed by the most common problems and challenges of image captioning. We provide a comprehensive analysis of
each method, covering widely used datasets and evaluation metrics. We also compare the performance of the different
covered methods before exploring future directions in the field. The section on problems and challenges provides a
detailed overview of the inherent difficulties in image captioning and provides insight into potential solutions to address
them. We hope to provide a thorough understanding of image captioning through this review and encourage continued
progress in the field.

2 Common Solutions and Techniques

Automatic image captioning is usually computationally intensive and structurally complicated. Therefore, it is necessary
to study and observe the different methods of solving this problem in order to propose a practical and efficient solution.
Despite the recent remarkable advances in hardware design and optimization techniques, utilizing sensible methods and
tools is still of vital importance. This section discusses some of the standard solutions and techniques used in image
captioning.

2.1 Most Common Solutions

Image captioning is presented chiefly as a sequence-to-sequence problem in machine vision. In sequence-to-sequence problems, the goal is to convert a specific sequence to the appropriate corresponding sequence. One of the essential
sequence-to-sequence problems is machine translation. In machine translation, a sequence of words (e.g., a sentence) in
a language is translated to its alternative in another language. In order to learn the correspondence between the sequences,
the sequences are mapped into a common space in which the distance between two sequences with close meaning is
small. One of the most common solutions to the image captioning problem is inspired by machine translation and has
given promising results according to the performance metrics. In this class of methods, known as the "encoder-decoder"
methods, the input image is mapped to an intermediate representation of the image contents. It is then converted to
a sequence of words that make up the caption of the image. In the encoding stage, Convolutional Neural Networks
(CNNs) are frequently used to detect objects in the image since the last convolutional layer of these networks provides
a rich representation of an image. This layer is used as a feature vector (or multiple feature vectors obtained from
different regions in the image). In the decoding stage, recurrent neural networks (RNNs) are commonly used due to their
ability to give a proper representation of the human language and texts. After the image and its corresponding caption
in the dataset are mapped into a common space, the correspondence between the two representations is learned and
new captions are generated for new images. Despite significant results, these methods usually give general and vague
captions for images and do not describe image contents appropriately since all information is compressed into a single
vector. This causes problems with learning the information at the beginning of the sequence and the deeper relations
between image contents. Many new methods have been proposed to solve these problems, most of them having the
encoder-decoder structure as their core component. These methods and their features and possible flaws are discussed
later in this survey. In addition to these methods, other methods, such as dense captioning [68], have been proposed to
solve the image captioning problem. However, in recent works on image captioning dating from 2018 to 2022, these
methods are seldom used, and methods based on the attention mechanism and graphs have been used more frequently.
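
To make the encoder-decoder formulation described above concrete, the minimal PyTorch sketch below encodes an image into a single feature vector with a pre-trained CNN and decodes it into a word sequence with an LSTM. It is an illustrative simplification rather than any specific published model, and all class and variable names are hypothetical.

```python
# A minimal encoder-decoder captioning sketch (illustrative only).
import torch
import torch.nn as nn
import torchvision.models as models

class SimpleCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])   # drop the final fc layer
        self.img_proj = nn.Linear(cnn.fc.in_features, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        with torch.no_grad():                                      # frozen CNN encoder
            feats = self.encoder(images).flatten(1)                # (B, 2048)
        img_emb = self.img_proj(feats).unsqueeze(1)                # image vector starts the sequence
        word_emb = self.embed(captions[:, :-1])                    # teacher forcing on ground-truth words
        hidden, _ = self.lstm(torch.cat([img_emb, word_emb], dim=1))
        return self.out(hidden)                                    # logits over the vocabulary
```

Training such a sketch with cross-entropy against the reference captions reproduces, in miniature, the pipeline discussed in this section: all visual information is compressed into one vector, which is exactly the limitation the later methods try to address.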

2.2 Some of the Widely Used Techniques in Image Captioning

Before introducing various methods of image captioning in detail, we discuss some of the frequently used techniques in
image captioning methods.

2
Deep Learning Approaches on Image Captioning: A Review

2.2.1 R-CNNs
In object detection, a specially trained Convolutional Neural Network must not only recognize objects but also localize each object's bounding box inside the image. If a simple CNN is used for object detection, one way to detect object bounding boxes is to place a grid over the image and process the individual cells of the grid. However, objects appear in images with various shapes and sizes and can be located anywhere inside the image; therefore, a single grid with fixed cell sizes will not give desirable results. Resolving this would require grids with different cell sizes for the different settings, which is computationally intensive.
To solve this issue, Girshick et al. introduced Region-based CNNs (R-CNN) [42]. These networks use selective search to extract only about 2000 candidate regions, termed region proposals, from the image. The selective search algorithm generates many candidate regions by segmenting the input image; these regions are merged recursively into larger regions, which are then selected as the final region proposals. Since no actual learning takes place inside the selective search algorithm, it may produce incorrect region proposals.
Fast R-CNN: The same authors that introduced R-CNN later introduced Fast R-CNN [41]. Its operation is very similar to that of R-CNN; however, instead of feeding the region proposals to a CNN, the whole input image is fed to the CNN to produce a convolutional feature map. Region proposals obtained with the selective search algorithm are then mapped onto this feature map. Fast R-CNN is faster than R-CNN in that the image is convolved only once to extract a feature map, as opposed to R-CNN, which feeds each of the roughly 2000 region proposals through a CNN.
Faster R-CNN: Despite Fast R-CNN being faster than R-CNN, both use the selective search algorithm, which is
time-consuming and affects the network’s performance. In [114], a new method was presented in which the network
learns the region proposals and does not use the selective search algorithm. Similar to Fast R-CNN, an image is given
to the CNN as input, and a convolutional feature map is extracted. Instead of the selective search algorithm, a separate
network is used to predict region proposals. Faster R-CNN is significantly faster than R-CNN and Fast R-CNN and
can be used in real-time object detection. Faster R-CNN is used in many papers covered in this survey to generate a representation of the input image.
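
As a hedged illustration, the snippet below uses an off-the-shelf Faster R-CNN from torchvision to obtain the detections (boxes, labels, and confidence scores) that captioning encoders commonly build on; the confidence threshold is an arbitrary choice for the example.

```python
# Running a pre-trained Faster R-CNN detector to obtain region proposals/detections.
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=torchvision.models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT
)
model.eval()

image = torch.rand(3, 480, 640)            # a dummy RGB image with values in [0, 1]
with torch.no_grad():
    detections = model([image])[0]         # dict with "boxes", "labels", "scores"

keep = detections["scores"] > 0.7          # keep confident regions only
boxes = detections["boxes"][keep]          # (N, 4) bounding boxes of the detected regions
# In captioning pipelines, features pooled from these regions (or from the detector's
# backbone) typically serve as the image representation fed to the decoder.
```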

2.2.2 RNNs
Recurrent neural networks (RNNs) [117, 69] are a type of artificial neural network that also have internal memory.
These networks are recurrent in the sense that they perform the same operation for each input data, and the output from
the current input data is dependent on the computations from previous steps. After the output is computed, it is fed back into the network, so an RNN uses the current input together with what it learned from the previous step. In contrast to feed-forward neural networks, RNNs use their internal memory (also known as the internal state) to process a sequence of inputs. This feature enables RNNs to perform tasks such as handwriting or speech recognition. In other words, RNNs can be used in applications in which the inputs are related to each other in some
way and are not independent. On the other hand, these networks are prone to problems such as "Vanishing Gradient,"
difficulty in training, and the inability to process long sequences. If a sequence is lengthy, RNN networks might lose
parts of the information at the beginning of the sequence.

2.2.3 LSTMs and GRUs


An improved version of RNN networks is the "Long Short-Term Memory Network (LSTM)"[56]. LSTMs have been
explicitly designed to resolve long-term dependency problems [40]. Since LSTMs are effective at capturing long-term
temporal dependencies without such optimization issues, they have been used to solve many challenging problems,
including handwriting recognition/generation, language modeling/translation, speech synthesis, acoustic modeling of
speech, protein secondary structure prediction, and audio and video analysis [45]. These networks have an internal
"Gate" mechanism that can control and modify the flow of information. These gates can learn which data is essential
in a sequence or must be ignored. Therefore, important information is stored all through the sequence. Despite the
advantages of LSTMs over RNN, LSTMs ignore the hierarchical structure in a sentence. Also, LSTMs need plenty of
storage space due to their memory cells. LSTM networks are widely used in "Encoder-Decoder"-based methods in
image captioning to generate representations of the captions, which are textual data. Another network similar to LSTMs
is the "Gated Recurrent Unit (GRU)"[20]. These networks are similar to LSTMs in structure but use fewer gates to
control the flow of information. Fewer parameters allow GRUs to train easier and faster than their LSTM counterparts.
GRUs have been shown to perform better on specific smaller and less frequent datasets [46].
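
The two gated recurrent layers discussed above are available as standard modules; the brief sketch below contrasts them on the same input sequence and shows that the GRU, having fewer gates, also has fewer parameters.

```python
# Comparing the LSTM and GRU layers on a (batch, time, features) sequence.
import torch
import torch.nn as nn

x = torch.randn(8, 20, 128)                # 8 sequences, 20 time steps, 128 features

lstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)
gru = nn.GRU(input_size=128, hidden_size=256, batch_first=True)

lstm_out, (h_n, c_n) = lstm(x)             # the LSTM keeps a separate cell state c_n
gru_out, h_n_gru = gru(x)                  # the GRU keeps only a hidden state

print(sum(p.numel() for p in lstm.parameters()))   # more parameters ...
print(sum(p.numel() for p in gru.parameters()))    # ... than the GRU
```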

2.2.4 ResNet
Residual neural networks, or ResNets, are widely used for image classification and object detection [53]. The structure of these networks is inspired by the pyramidal cells in the cerebral cortex and uses skip connections to connect non-adjacent layers. Usually, ResNets are implemented with skip connections that span two or three layers. ResNets are built with "Residual Blocks" placed on top of each other; for example, ResNet-50 consists of 50 layers organized into residual blocks. Optimization of this structure has been shown to be easier and faster compared to a plain network of the same depth (without skip connections and residual blocks). Some of the research works presented in this survey have used ResNet to detect objects and generate image representations.
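
The structural idea can be sketched with a minimal residual block, in which the skip connection adds the block's input to its output; this is an illustrative sketch, not a specific ResNet variant.

```python
# A minimal residual block: the skip connection adds the input to the block's output.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)            # the skip connection

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))        # shape is preserved: (1, 64, 56, 56)
```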

3 Deep Learning-Based Image Captioning


In this section, we have organized and classified the different frameworks, methods, and approaches which were
extensively used in recent research works based on their core structure. Some terms and notations in the covered papers
have been altered to maintain consistency throughout this review. A figure demonstrating the taxonomy provided in this paper is shown in Figure 1.

Figure 1: The taxonomy of the image captioning methods covered in this survey paper: attention-based, graph-based, graph-based combined with attention, transformer-based, transformer-based combined with graphs, convolutional network-based, unsupervised approaches and reinforcement learning, multi-style caption generation, and vision-language pre-training (VLP).

3.1 Attention-Based Methods

The methods that fall under the attention-based category utilize attention mechanisms to emphasize the most relevant
parts of the input image when generating captions.
Attention-based methods [10] are inspired by the human attention pattern and the way the human eye focuses on images.
When inspecting images, humans focus more on the image's salient features, and the same mechanism is implemented in attention-based methods. During the training process, the model learns "where to look." To understand the mechanism of attention-based methods, one can imagine a sequential decoder in which, in addition to the previous cell's output and internal state, there is also a context vector, denoted c. The vector c_i is the weighted sum of the hidden states of the encoder:
    c_i = \sum_{j=1}^{T_x} a_{ij} h_j                                            (1)

In the statement above, a_{ij} is the "amount of attention" that output i must pay to input j, and h_j is the encoder state for input j. a_{ij} is obtained by calculating a softmax over the attention scores e_{ij} computed for output i over all inputs:

    a_{ij} = softmax(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}        (2)

    e_{ij} = f(S_{i-1}, h_j)                                                     (3)


where f is the model that determines how much input j and output i are correlated, and S_{i-1} is the hidden state
from the previous time step. The model f can be estimated with a small neural network and can be optimized with any
gradient-based optimization techniques, such as gradient descent. A presentation of the attention mechanism used in an
encoder-decoder framework typically used in machine translation is shown in Figure 2.
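
A minimal PyTorch sketch of this additive attention (Equations (1)-(3)) is given below: a small network scores the compatibility of the previous decoder state with each encoder state, a softmax turns the scores into weights, and the context vector is their weighted sum. Names and dimensions are illustrative.

```python
# Additive ("Bahdanau"-style) attention corresponding to Equations (1)-(3).
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim=256):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, attn_dim)
        self.w_dec = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, enc_states, dec_state):
        # enc_states: (B, T_x, enc_dim) -- the h_j;  dec_state: (B, dec_dim) -- S_{i-1}
        scores = self.v(torch.tanh(self.w_enc(enc_states) +
                                   self.w_dec(dec_state).unsqueeze(1)))   # e_ij: (B, T_x, 1)
        weights = torch.softmax(scores, dim=1)                            # a_ij
        context = (weights * enc_states).sum(dim=1)                       # c_i: (B, enc_dim)
        return context, weights.squeeze(-1)
```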

Figure 2: The attention mechanism in an encoder-decoder framework as typically used in machine translation [70].

In short, attention-based image captioning methods generate a weighted sum of extracted feature vectors at each time
step in their decoder that guides the decoder module. Similar to the encoder-decoder framework, attention-based
methods were first introduced for the machine translation problem in [10]. In most of the attention-based methods, a
CNN or a region-based CNN is used in the encoding stage to provide a representation of the image, and an RNN is
usually used in the decoding stage. A block diagram of the basis of attention-based methods (which was first proposed
by Xu et al. [140]) is shown in Figure 3. The last convolutional layer of a Convolutional Neural Network (here, VGGNet by Simonyan et al. [124]), just before max pooling, has been used to extract features from the image. The LSTM network [56] with
attention has been used as the decoder. The multiple images surrounding the LSTM shown in this figure demonstrate
the attention values over different regions of the image. The lighter areas mean a higher attention value. The colored
outline of the generated words in the caption corresponds to the regions outlined by the same colors.


Figure 3: The basis of attention-based methods: features are extracted from the input image using a convolutional neural network, and an LSTM with attention over image regions is used as the decoder to generate the caption (best viewed in color).

Multi-Head Attention Multi-head attention [128] is a module for attention mechanisms that runs the attention mechanism several times in parallel. The attention outputs obtained this way are then concatenated and linearly transformed into the expected dimension. Multiple attention heads make it possible to attend to parts of the sequence that are different in nature, e.g., longer-term dependencies versus shorter-term dependencies.
can be defined as:
    MultiHead(Q, K, V) = [head_1, ..., head_h] W^O,
    where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)                          (4)

where the set of queries is packed together into the matrix Q, the keys and values are packed into matrices K and V, and W_i^Q, W_i^K, W_i^V, and W^O are the parameter matrices: W_i^Q ∈ R^{d_model × d_k}, W_i^K ∈ R^{d_model × d_k}, W_i^V ∈ R^{d_model × d_v}, and W^O ∈ R^{h d_v × d_model}, with h the number of parallel attention heads.
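
In practice, multi-head attention is available as a built-in module; the brief sketch below applies Equation (4) with PyTorch's nn.MultiheadAttention, which internally applies the per-head projections and the output projection described above. The dimensions are illustrative.

```python
# Multi-head attention: words (queries) attending over image region features (keys/values).
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

regions = torch.randn(2, 36, d_model)      # e.g., 36 region features per image
words = torch.randn(2, 12, d_model)        # e.g., 12 partially generated word states

attended, attn_weights = mha(query=words, key=regions, value=regions)
print(attended.shape)                      # (2, 12, 512)
```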
Soft Attention and Hard Attention Attention is usually implemented in two forms: soft or hard attention [140]. In soft attention, attention-weighted image features are used as input to the LSTM [56] instead of the raw image. Soft attention disregards irrelevant areas by multiplying the corresponding feature map with a low weight. High-attention areas keep their original values, while low-attention areas get closer to 0 (become dark in the visualization) [140].
Hard attention uses a stochastic sampling model. Sampling is performed to estimate the gradient during backpropagation, and the results are averaged using the Monte Carlo method, which runs end-to-end episodes to compute an average over all sampling results. In hard attention, the accuracy depends on the number and quality of the samples. Soft attention, in contrast, applies the regular backpropagation method to compute the gradient, which is easier to calculate; its accuracy is subject to the assumption that the weighted average is a good representation of the area of attention. Attention-based methods are widely used in the encoder-decoder framework.
Most of the research works discussed in this survey have used it as their primary framework or have combined it with
other methods to improve its performance.
Vinyals et al. [130] were the first to incorporate a deep learning-based encoder-decoder framework for image captioning. The model presented in their work is inspired by machine translation, based on the finding that, given a powerful sequence model, it is possible to achieve remarkable results by directly maximizing the probability of the correct translation. CNNs can produce a rich representation of an input image by embedding it into a fixed-length
vector. Vinyals et al. [130] have presented a model that uses CNN as an image "encoder" by pre-training it for an image
classification task first and using the last hidden layer as an input to an RNN "decoder" that generates sentences. The
model is trained to maximize the likelihood of the target description sentence given the training image. This work has
been used by many other researchers as a basis to expand upon and refine using other modules and techniques [140].
Anderson et al. proposed the "bottom-up and top-down" method in [7]. The bottom-up module proposes the salient
regions in the image, and each of the proposed regions is represented as a convolutional feature vector. This module
is implemented using Faster R-CNN [114], which was discussed previously. Faster R-CNN works well as a "hard"
attention mechanism since a small number of bounding box features are selected from a large number of configurations.
Faster R-CNN network is initialized with ResNet-101 [53] pre-trained for image classification on the ImageNet dataset.
Faster R-CNN is then trained using the Visual Genome [73] dataset. The top-down module, designed to caption
images, contains two LSTM networks [56] with the standard implementation. The first LSTM network operates as
a top-down visual attention model, and the second LSTM network operates as a language model. The top-down
visual attention module estimates a distribution of attention over regions and calculates the extracted feature vector as a weighted sum over total region proposals. The captioning model takes a variably-sized set of k image features V = {v_1, ..., v_k}, v_i ∈ R^D, as input. Each image feature encodes a salient region of the image. These image features can be defined as the output of the bottom-up attention model or as the spatial output layer of a CNN. The input vector to the attention LSTM at each time step consists of the previous output of the language LSTM, the mean-pooled image features \bar{v} = (1/k) \sum_i v_i, and an encoding of the previously generated word, all concatenated together. The input to the language model LSTM is composed of the attended image feature concatenated with the output of the attention LSTM.
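
The following is a hedged PyTorch sketch of one decoding step of such a two-LSTM decoder (an attention LSTM followed by a language LSTM); it is an illustrative simplification rather than the exact model of [7], and all module and variable names are hypothetical.

```python
# One step of a two-LSTM "top-down" decoder over a set of region features V.
import torch
import torch.nn as nn

class TopDownStep(nn.Module):
    def __init__(self, feat_dim, embed_dim, hidden_dim):
        super().__init__()
        self.att_lstm = nn.LSTMCell(hidden_dim + feat_dim + embed_dim, hidden_dim)
        self.lang_lstm = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
        self.att_v = nn.Linear(feat_dim, hidden_dim)
        self.att_h = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)

    def forward(self, V, word_emb, h_att, c_att, h_lang, c_lang):
        v_mean = V.mean(dim=1)                                    # mean-pooled image features
        x_att = torch.cat([h_lang, v_mean, word_emb], dim=1)
        h_att, c_att = self.att_lstm(x_att, (h_att, c_att))
        scores = self.att_out(torch.tanh(self.att_v(V) + self.att_h(h_att).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)                      # attention over the k regions
        v_hat = (alpha * V).sum(dim=1)                            # attended image feature
        x_lang = torch.cat([v_hat, h_att], dim=1)
        h_lang, c_lang = self.lang_lstm(x_lang, (h_lang, c_lang))
        return h_lang, (h_att, c_att, h_lang, c_lang)             # h_lang feeds the word classifier
```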
The two-layer LSTM [56] structure has also been used by Yao et al. [149] as the attention mechanism in the final stage (more details on this work are discussed in "Combining Attention-Based and Graph-Based Methods", Section 3.3).
Gu et al. [47] have presented a multi-stage coarse-to-fine structure for image captioning. This structure contains
multiple decoders that each work on the output of the decoder in the previous step, making the captions richer in every
step. This paper has used the LSTM network [56] as the decoder. The structure comprises three LSTM networks, with
the first LSTM presenting the coarse details at the first stage and reducing computations in the later stages. The other
LSTMs operate as fine-level decoders. At each stage, attention weights and hidden vectors generated by the previous
stage decoder are used as input to the next stage decoder. The operation of the coarse decoder is based on the general
and global features of the image. However, in many cases, each word belongs to a small region of the image only.
Using the general features of the image might yield improper results due to the possible noise from unrelated regions.
Therefore, a "Stacked Attention Model [147]" is used to improve the performance of this coarse-to-fine structure. This
model enables the structure to extract visual information from finer details for future word predictions. The stacked
model generates a spatial map that determines the region of each predicted word. Using this stacked attention model,
finer and more precise details are extracted, and noise is gradually reduced. Also, regions that are highly relevant to the
words are determined.
Huang et al. [65] have introduced a new attention-based structure containing one more level of attention. The structure
named "Attention on Attention (AoA)" generates an "Information Vector" and an "Attention Gate" with two linear
transformations. An attention module f_att(Q, K, V) operates on some queries, keys, and values, denoted by Q, K, and V respectively, and generates weighted average vectors denoted by V̂. The attention module measures the similarity between Q and K and uses this similarity score to calculate weighted average vectors over V, which is formulated as:

    a_{i,j} = f_sim(q_i, k_j),    α_{i,j} = \frac{\exp(a_{i,j})}{\sum_j \exp(a_{i,j})}        (5)

    \hat{v}_i = \sum_j α_{i,j} v_j                                                            (6)

where q_i ∈ Q is the i-th query, and k_j ∈ K and v_j ∈ V are the j-th key/value pair. f_sim is a function that computes the similarity score of each k_j and q_i, and \hat{v}_i is the attended vector for the query q_i. Since the attention module produces a weighted average for each query regardless of the relation between Q and K/V, the weighted average vector can be irrelevant or misleading. The AoA module measures the relevance between the attention results and
the query. The information vector i is generated with a linear transformation on current content (caption) and results
from the attention component, and stores both parts’ data. The attention gate g is generated from the content and the
result from the attention component using sigmoid activation. The value inside each part (also called a channel) of this
attention gate determines the level of importance of the channel in the information vector. Both the information vector and the attention gate are conditioned on the attention result and the current context (i.e., the query q).
adds another level of attention with element-wise multiplication of the attention gate and the information vector and
finally produces the attended information, which contains useful data. The AoA structure is applied to both the encoder
and decoder (termed AoANet): AoA is applied to the encoder after extracting image features to obtain the relation
between objects present inside the image. AoA is also applied to the decoder to remove the attention results which
are unrelated to the actual output or are ambiguous and leave the essential and useful results. The AoA structure has
been introduced as an addition to the attention-based methods, and it can be applied to any attention method. In the
experiments conducted by the authors, a Faster R-CNN [114] pre-trained on the ImageNet [27] and Visual Genome
[73] datasets is used to extract feature vectors from the image.
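
The core AoA computation can be sketched as follows: an information vector and a sigmoid attention gate are computed from the concatenated query and attention result, and their element-wise product gives the attended information. This is an illustrative sketch, not the authors' implementation.

```python
# "Attention on Attention": gating the attention result by its relevance to the query.
import torch
import torch.nn as nn

class AoA(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.info = nn.Linear(2 * dim, dim)    # information vector i
        self.gate = nn.Linear(2 * dim, dim)    # attention gate g

    def forward(self, query, attn_result):
        x = torch.cat([query, attn_result], dim=-1)
        i = self.info(x)
        g = torch.sigmoid(self.gate(x))
        return g * i                           # attended information

aoa = AoA(dim=512)
q = torch.randn(2, 36, 512)                    # queries
v_hat = torch.randn(2, 36, 512)                # attention results for the queries
out = aoa(q, v_hat)                            # (2, 36, 512)
```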
Jiang et al. propose a novel recurrent fusion network (RFNet) in [67] for the image captioning task, which uses multiple
CNNs as encoders, and a recurrent fusion process is inserted after the encoders to produce better representations for
the decoder. Each representation extracted from an individual image CNN can be regarded as an individual view
depicting the input image. The fusion procedure consists of two stages; the first stage produces multiple sets of "thought
vectors" by exploiting the interactions among the representations from multiple CNNs. The second stage performs

7
Deep Learning Approaches on Image Captioning: A Review

multi-attention on the sets of thought vectors and generates a new set of thought vectors for the decoder. For the
experiments, they use ResNet [53], DenseNet [63], Inception-V3 [125], Inception-V4 [126] and Inception-ResNet-V2
[126] as encoders to extract 5 groups of representations. Having considered reinforcement learning (RL) as a method to
improve image captioning performance, they have trained their model with cross-entropy loss and fine-tuned the trained
model with CIDEr optimization using reinforcement learning.
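
As an illustration of such RL fine-tuning, the following hedged sketch follows the spirit of self-critical sequence training: the CIDEr score of a greedily decoded caption serves as a baseline for sampled captions. model.sample, model.greedy, and cider_score are hypothetical placeholders, not an actual library API.

```python
# REINFORCE-style loss for CIDEr optimization with a greedy-decoding baseline (sketch).
import torch

def rl_caption_loss(model, images, references, cider_score):
    sampled_caps, log_probs = model.sample(images)     # stochastic sampling + per-word log-probs
    with torch.no_grad():
        greedy_caps = model.greedy(images)             # baseline captions
    reward = torch.tensor([cider_score(s, r) - cider_score(g, r)
                           for s, g, r in zip(sampled_caps, greedy_caps, references)])
    # Push up sampled captions that beat the greedy baseline, push down the rest.
    return -(reward.to(log_probs.device) * log_probs.sum(dim=1)).mean()
```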
Incorporating attention in image captioning has transformed the field considerably, enabling more accurate and natural caption generation. However, attention-based methods do not come without flaws. One problem with classic attention-based image captioning models is that they do not consider the relations between the objects detected inside the image.

3.1.1 Injecting Spatial and Semantic Relation Information into Attention-Based Methods

A group of attention-based methods have included the spatial and semantic relations in an image to describe the content
more appropriately.
In the method proposed by Pan et al. [105], a new type of attention in the form of a unified block called "X-Linear
Attention Block" is introduced, which uses bilinear pooling to emphasize salient image features and multimodal
reasoning. This structure uses spatial and channel-wise bilinear attention to extract second-order interactions. The
second-order interaction is obtained by calculating the outer product of the key (mapped image features) and the
query (internal state of the sentence decoder) using bilinear pooling to consider all second-order interactions between
keys and queries. After bilinear pooling, two embedding layers are used to predict the attention weights belonging to
each region, and a softmax layer is then used to normalize the spatial attention vector. Also, a "squeeze-excitation"
operation is performed on the embedded outer product (feature map). The squeezing process aggregates the feature
map across spatial regions to produce a channel descriptor. The excitation process performs the self-gating mechanism
with a sigmoid on the channel descriptor to obtain the channel-wise attention vector. Finally, the outer product of the key and query and the value from bilinear pooling are summed with weights given by the spatial attention vector. Then,
the channel-wise multiplication of this weighted sum and the channel attention vector is calculated and taken as
the attended features. Higher-order interactions can be computed by combining the X-linear attention blocks. This
research work has used Faster R-CNN [114] to detect a set of regions. A stack of X-linear attention blocks is then
used to encode region-level features of the image and the higher-order interactions between them to produce a set of
enhanced region-level and image-level features. These attention blocks are used in the sentence decoder for multimodal
reasoning.
Cornia et al. [23] presented a method capable of describing an image by focusing on different regions in different orders
following a given conditioning. By means of analyzing the syntactic dependencies between words, a higher level of
abstraction can be recovered in which words can be organized into a tree-like structure. In a dependency tree, each word
is linked together with its modifiers. Given a dependency tree, nouns can be grouped with their modifiers, thus building
noun chunks. The proposed model is built on a recurrent architecture which considers the decomposition of a sentence
into noun chunks and models the relationship between image regions and textual chunks to ground the generation
process on image regions explicitly. The model is conditioned on the input image I, and an ordered sequence of region
sets R, which acts as a control signal and jointly predicts two output distributions corresponding to the word-level and
chunk-level representation of the sentence. During the generation, the model keeps a pointer to the current region and
can shift to the next element in R using a boolean chunk-shifting gate g_t. To generate the output caption, a recurrent neural network with adaptive attention is used. The probability of switching to another chunk, p(g_t | R), is calculated in an adaptive mechanism in which an LSTM [56] computes a compatibility function between its internal state and a latent representation modeling the state of memory at the end of a chunk. The compatibility score is compared to that of attending one of the regions r_t, and the result is used as an indicator to switch to the next region set in R.
The addition of spatial and semantic relations to the attention-based framework has significantly improved the quality of
the captions generated by the models. Despite the improvements achieved by this addition, some problems still remain,
including the ambiguity of the captions, the lack of grounding, heavy computations associated with the object detectors,
and the requirement of bounding-box annotations. In order to resolve some of these issues, other approaches to the
image captioning problem have been introduced, which are explained and discussed in the following sections.

3.2 Graph-Based Methods for Spatial and Semantic Relations between Image Elements

The methods discussed in this section utilize scene graphs to better model the spatial and semantic relations between
image elements.
Due to their ability to represent relations between elements, graphs are used in applications in which the relations
between elements are important [14, 132]. Studies have shown the effectiveness of incorporating semantic information
and object attributes in generating captions of higher quality [150, 137, 36, 148, 158]. Some research works on image
captioning have used graphs to incorporate the spatial and semantic relations between the elements inside an image. In
order to utilize graphs in caption generation, two types of graph extraction are usually used: scene graph extraction
from images [139, 144, 154, 83, 127, 50, 25] and scene graph extraction from textual data [136, 6]. Once a scene
is abstracted into symbols, the language generation is almost independent of visual perception [145]. Given scene
abstractions "helmet-on-human" and "road dirty," humans can infer "a man with a helmet in the countryside" by using
common sense knowledge like "countryside road dirty." This can be considered as the inductive bias that enables
humans to perform better than machines.
Yang et al. [145] have integrated the inductive bias of language generation into the encoder-decoder framework
commonly used in image captioning. The proposed method uses scene graphs to connect the image and text modalities.
A scene graph G is a unified representation that connects the objects, their attributes, and their relationships in an
image I or a sentence S by directed edges. To encode the language prior, Yang et al. [145] proposed the Scene Graph
Auto-Encoder (SGAE), which is a sentence self-reconstruction network used in the I → G → D → S training pipeline.
The I → G module is a visual scene graph detector. A multi-modal GCN is introduced and used in the G → D module
to complement the visual cues that may be ignored due to imperfect visual detection. D can be considered as a working
memory [131] that assists in re-keying the encoded nodes from I to S to a more generic representation with smaller
domain gaps. The proposed SGAE-based image captioning model is implemented using Faster R-CNN [114], and the
language decoder proposed by [7] with RL-based training strategy [115]. The proposed framework is formulated as
follows:

    Encoder:  V ← I,
    Map:      V̂ ← R(V, G; D),   G ← V,                                           (7)
    Decoder:  S ← V̂.

where V denotes the extracted image features (usually extracted by a Convolutional Neural Network (CNN)). The mapping module frequently used in the encoder-decoder framework for image captioning is the module that encodes the visual features from the image into a representation which is later taken as input by the language decoder. This mapping module has been modified according to the formulation in Equation (7) by introducing the scene graph G into a re-encoder
R parametrized by a shared dictionary D. The Scene Graph Auto-Encoder (SGAE) learns the dictionary D, which
embeds the language inductive bias from sentence-to-sentence reconstruction. Next, the encoder-decoder framework is
equipped with SGAE to form the overall image captioner.
Gu et al. [49] have introduced a particular framework for training an image captioning model in an unsupervised manner
and without image-caption pairs. The framework uses a scene graph to generate an intermediate representation of
images and captions and maps these scene graphs to their feature space using "Cycle-Consistent Adversarial Training"
[160]. This paper has used an image scene graph generator, a sentence scene graph generator, and a feature mapping
module in charge of mapping the image and caption modalities together. To align scene graphs and captions, CycleGAN [160] is used. The unrelated image and sentence scene graphs are first encoded using the scene graph encoder trained on the sentence corpus. Next, unsupervised cross-modal mapping is performed for feature alignment with CycleGAN. This work is closely related to [145]; the main difference is that the framework in [145] is based on a paired setting. CycleGAN is generally used for image-to-image translation, for example transforming an apple into an orange or a horse into a zebra.


Gao et al. [37] proposed a scene-graph-based semantic representation method by embedding the scene graph as an
intermediate state. The task of image captioning is divided into two phases termed: concept cognition and sentence
reconstruction. In the first phase, a vocabulary of semantic concepts is built, and a novel CNN-RNN-SVM framework
is used to generate a scene-graph-based semantic representation, which is used as the input for an RNN generating
captions in the second phase. The CNN part extracts visual features, the RNN part models image/concept relationships
and concept dependency, and the SVM part classifies the semantic concepts and outputs the relevant concepts for the
scene-graph-based sequence.
The general workflow of the graph-based methods is displayed in Figure 4. Usually, a Convolutional Neural Network
is used to extract visual features from the image, and the semantic and spatial graph is built on the detected regions.
The vertices denote regions, and the edges denote the relationships between the regions. Next, Graph Convolutional
Networks (GCNs) [72] encode the regions and relationships in the scene graph. The obtained feature vector is then
passed to LSTM [56] decoders to generate captions.

Figure 4: The general workflow of graph-based methods: objects, attributes, and relations are extracted from the input image into a scene graph, encoded, and passed to an LSTM decoder that generates the caption.
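
A minimal sketch of the GCN encoding step in this workflow is given below: each region's features are updated by aggregating the features of its scene-graph neighbours before the result is handed to the decoder. A real system would also encode the relationship types on the edges; this version only illustrates the idea.

```python
# A simple graph-convolution layer over detected regions connected by scene-graph edges.
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, adj):
        # node_feats: (N, in_dim) region features; adj: (N, N) scene-graph adjacency matrix
        adj_hat = adj + torch.eye(adj.size(0))             # add self-loops
        deg = adj_hat.sum(dim=1, keepdim=True)
        agg = adj_hat @ node_feats / deg                   # mean over each node's neighbours
        return torch.relu(self.linear(agg))

regions = torch.randn(5, 2048)        # 5 detected regions (e.g., Faster R-CNN features)
adj = torch.zeros(5, 5)
adj[0, 1] = adj[1, 0] = 1.0           # an edge for one detected relationship
encoded = SimpleGCNLayer(2048, 512)(regions, adj)          # (5, 512), fed to the decoder
```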

3.3 Combining Attention-Based Methods and Graph-Based Methods

In order to address some of the remaining issues in image captioning, as well as the shortcomings of attention-based and graph-based methods, some recent research works have introduced structures that combine the two methodologies.
As previously mentioned, the visual relations between image elements give insight into their relative positions or
interactions. To detect the visual relations between image elements, one not only needs to detect object locations inside
the image, but also needs to detect all sorts of interaction between pairs of elements. Using these visual relations will
allow for a more in-depth comprehension of images. However, the considerable diversity in object sizes and their
locations will make the interaction detection task more difficult.
Yao et al. [149] use a combination of Graph Convolutional Networks [72] and LSTMs [56] to incorporate the relations
between image elements while also taking the attention-based encoder-decoder framework into account. Spatial
and semantic relations have been integrated to enrich image representations in the image encoder, and learning the
relationships has been considered a classification problem. Faster R-CNN [114] has been used for region proposals.
Two spatial and semantic graphs are built to represent spatial and semantic relations between image contents. These
two graphs are generated from the detected image regions, with regions being graph nodes and the relations between
them as the edges of the graph. In the spatial graph, spatial relations are considered as edges, and in the semantic graph,
the semantic relations are considered as edges. The semantic graph is trained using the Visual Genome [73] dataset. To
represent the image, Graph Convolutional Networks [72] are used which incorporate the semantic and spatial relations
obtained from their corresponding graphs. The combination of the enhanced image region representations and their
semantic and spatial relations is then fed into an LSTM [56] decoder to generate the caption sentences. During inference, to combine the outputs of the spatial and semantic decoders, the word distributions generated by the two decoders are combined through a weighted linear sum at each time step, and the word with the highest probability is selected.
The proposed model by Zhong et al. [157] decomposes the image scene graph into a set of sub-graphs. Each sub-graph
captures a semantic component of the input image. Zhong et al. [157] designed a sub-graph proposal network (sGPN)
that learns to detect meaningful sub-graphs. An attention-based LSTM then decodes the selected sub-graphs for
generating sentences. Given an input image I, a scene graph G = (V, E) is extracted from I using MotifNet [152],
where V represents the nodes corresponding to the detected objects in I and E represents the set of edges corresponding
to the relationships between object pairs. The goal is to generate a set of sentences S = {S_j} to describe different components of the image using the scene graph G. Sub-graphs are defined as {G_i^c = (V_i^c, E_i^c)}, where V_i^c ⊆ V and E_i^c ⊆ E. The method aims to model the joint probability P(C_{ij} = (G, G_i^c, S_j) | I), where P(C_{ij} | I) = 1 when the sub-graph G_i^c can be used to decode the sentence S_j and P(C_{ij} | I) = 0 otherwise. P(C_{ij} | I) can be decomposed into three parts:

    P(C_{ij} | I) = P(G | I) P(G_i^c | G, I) P(S_j | G_i^c, G, I)                        (8)

P(G | I) can be interpreted as the scene graph extraction phase, P(G_i^c | G, I) as the scene graph decomposition phase and the selection of important sub-graphs for sentence generation, and P(S_j | G_i^c, G, I) as the decoding phase, in which a selected sub-graph G_i^c is decoded into its corresponding sentence S_j and the tokens in S_j are associated with the nodes V_i^c of the sub-graph G_i^c (the image regions in I).
Wang et al. [133] have used a Graph Neural Network [118] to represent the relation between image elements and have
used a novel content-based attention framework to store image regions previously attended by the attention module
as well. A ResNet-101 [53] network trained on the ImageNet [27] dataset is first used to extract image features. The
non-linear activations of the last convolutional layer of this network are used as the image representation and are
denoted as:

    V = {v_1, v_2, ..., v_n | v_i ∈ R^m}                                                 (9)

where v_i represents each of the non-linear activations of the last convolutional layer. A Graph Convolutional Network [72], f_gnn, is initialized using the image features belonging to different image regions to explore relations between the visual objects in the image. This graph initializes each node with a spatial representation and, to derive the implicit relation-aware representation R = {r_1, r_2, ..., r_n | r_i ∈ R^m} (where r_i represents the nodes inside the graph), recursively updates the value of each node with hidden representations from other nodes. The visual representations R are forwarded into a context-aware attention model f_att. Unlike some other attention-based models, this novel attention framework uses an LSTM [56] to store the previously attended regions. Storing these regions will aid the attention module in its future region selections. Next, a language model based on LSTM, f_lstm, uses the previous hidden state h_{t-1}, the previously generated word embedding X_t, and the output \bar{v}_t from the attention model as input and produces the current hidden state h_t as the output to predict the next word.
Chen et al. [19] have proposed a model to generate controllable image captions which actively consider user intentions.
The paper introduces a more fine-grained control signal called Abstract Scene Graph (ASG), a directed graph composed
of three types of abstract nodes grounded in the image: object, attribute, and relationship. The caption generation model
is based on the encoder-decoder framework, consisting of a role-aware graph encoder and a language decoder that
considers both the context and structure of nodes for attention. The decoder utilizes a two-layer LSTM [56] structure,
including an attention LSTM and a language LSTM. The model gradually updates the graph representation during
decoding to fully cover information in ASG without omission or repetition and keep track of graph access status. The
role-aware graph encoder contains a role-aware node embedding to distinguish node intentions and a multi-relational
Graph Convolutional Network for contextual encoding.
Aiming to employ knowledge in scene graphs for image captioning explicitly, Li et al. [80] introduce a framework based
on scene graphs. First, the scene graph for the input image is generated using the method proposed in [139]. A set of
initial bounding boxes should be produced to generate the scene graph. Li et al. have used the region proposal network
(RPN) proposed by Girshick et al. [41] to produce a set of object proposals for the image. To capture the visual features,
the VGG-16 network is used to extract CNN features from the corresponding regions of object entities. Semantic features are also obtained by extracting triplets (lexeme sequences that describe object relationships) from the graph and embedding them into fixed-length vectors. To utilize both types of information, a hierarchical attention-based
fusion module is introduced which determines when and what to attend to during sentence generation.
Xu et al. [141] proposed a framework to embed the scene graph into a compact representation capable of capturing
explicit semantic concepts and graph topology information. An input image I is processed by a CNN to generate the
image features. A set of modules detect the objects, attributes, and related components to infer the scene graph. Next,
an external vocabulary compiles the scene graph into the vector V_con(I). An adjacency matrix is constructed in which the objects and relationships of the graph are used as vertices and edges. A fixed-length vector V_topo(I) is extracted to capture the topological information from the adjacency matrix. Xu et al. proposed an attention extraction mechanism that extracts sub-graphs and selects an attention graph with the corresponding region by computing cluster nodes in the adjacency matrix. The attention region is denoted as V_att(I). The four vectors are combined into a single representation for the scene graph, which is fed into the LSTM-based [56] language model.
Lee et al. [75] have extended the top-down captioner introduced in [7] and have added an attention component for
relation features. No graph convolutions are used in the proposed model. The authors state that using visual relations from scene graphs directly is an alternative to Graph Convolutional Networks (GCNs), and avoids expensive graph
convolutions.
There is a different set of challenges associated with the use of scene-graphs. Scene graph extraction is a difficult task
on its own, and the relations between the elements are not always as simple as pairwise relationships. Graph parsers
still need improvement as well.

3.4 Convolutional Network-Based Methods

Convolutional network-based methods utilize convolutional neural networks to extract image features and generate
captions using output from a language model. Thanks to the recent advances in convolutional architectures on other
sequence-to-sequence tasks such as convolutional image generation [101] and machine translation [38, 39], it is possible
to consider CNNs as an effective solution to many vision-language tasks. The methods discussed in this section have
incorporated CNNs into their proposed systems.
Inspired by the advances of CNNs in vision-language tasks, Aneja et al. [8] have presented a convolutional model
containing three main components. The first and the last components are input/output word embeddings, respectively.
While the middle component contains LSTM [56] or GRU [20] units in other methods, masked convolutions are used in
the proposed approach. This component is feed-forward without any recurrent functions, unlike the RNN approaches.
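
The causal (masked) convolution at the core of such decoders can be sketched as follows: the convolution over the word sequence is left-padded and truncated so that position t never sees later tokens, replacing the recurrence of an RNN decoder. This is an illustrative sketch under those assumptions, not the exact model of [8].

```python
# A causal (masked) 1D convolution layer for left-to-right caption decoding.
import torch
import torch.nn as nn

class MaskedConvDecoderLayer(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):
        # x: (batch, channels, time) word embeddings (plus any image conditioning)
        out = self.conv(nn.functional.pad(x, (self.pad, 0)))   # left-pad only => causal
        return torch.relu(out)

words = torch.randn(4, 512, 15)                # 15 already-generated word embeddings
layer = MaskedConvDecoderLayer(512)
print(layer(words).shape)                      # (4, 512, 15); stacking layers widens the context
```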
Wang et al. [134] proposed a framework relying on Convolutional Neural Networks only to generate captions. The
framework consists of four modules: a vision module, a language module, an attention module, and a prediction module.
The vision module is a CNN without the fully connected layer, for which VGG-16 has been used. The language module
is based on a CNN without pooling. RNNs use a recurrent path to memorize context, whereas CNNs use kernels and
stack multiple layers to model the context. The prediction module is a one-hidden layer neural network as well. Since
different levels of the language CNN represent different levels of concept, a hierarchical attention module has been
employed where attention vectors are calculated at each level of the language model and fed into the next level. Since
the attention maps are computed in a bottom-up manner as opposed to the RNN-based model, it is possible to train the
model in parallel over all words in the sentence. The authors observed the effect of several hyper-parameters, such as
the number of layers and the width of the kernel belonging to the language CNN. The receptive field of the language
CNN can be increased by stacking more layers or increasing the width of the kernel. The experiments showed that
increasing the kernel width is a better choice.
Less attention has been paid to convolutional network-based methods compared to the categories discussed above.
Convolutional network-based models help generate more entropy and as a result, more caption diversity. Also, they
perform better in classification tasks and do not suffer from vanishing gradients. However, these methods still need
improvement in terms of performance according to the evaluation metrics.

3.5 Transformer-Based Methods

Many current works have utilized Transformers to build more robust solutions for the captioning problem. RNNs and
LSTMs have been criticized due to their inflexibility, limitations regarding expression ability, and other complexities.
Due to their recurrent nature, RNNs have difficulty memorizing inputs from many steps earlier, which leads to generating high-frequency phrase fragments without regard to the visual cues [77]. The limitations posed by LSTMs and RNNs as language
models have led researchers to use alternatives such as Transformers.
Some recent works have studied the application of Transformers [128], mainly as the language model. Herdade et al. [55] utilized Transformers in their proposed "Object Relation Transformer" model, an encoder-decoder structure that incorporates spatial relationships between detected objects using geometric attention. The object relation module presented
by Hu et al. [59] represents the spatial relations in the encoder. The combination of Faster R-CNN [114], and
ResNet-101 [53] as the base Convolutional Neural Network is used for object detection and feature extraction. Every
image feature vector is processed through an input embedding layer consisting of a fully connected layer to reduce
the dimension, followed by a ReLU and a dropout layer. The first encoder layer of the Transformer model uses the
embedded feature vectors as input, and the subsequent layers use the output tokens of the previous encoder layers. Each
encoder layer is composed of a multi-head self-attention layer followed by a small feed-forward neural network.
Using the intermediate feature maps obtained from ResNet-101 [53] as input, a "Region Proposal Network (RPN)"
generates bounding boxes for the objects proposed by the network. Multiple neural network layers are added to predict
the corresponding class for each region and correct the bounding box for each of the proposed regions. Also, to
implement geometric attention, the value of attention weight matrices changes: bounding box properties (such as center,
width, and height) are combined with their corresponding attention weights using a high-dimensional embedding [128].
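How box geometry can modulate attention weights is sketched below; this is a hedged simplification of the geometric-attention idea rather than the released implementation, and the small MLP that embeds the four relative-geometry features, as well as all names, are ours.

```python
# Sketch: pairwise box geometry becomes an additive bias on appearance-based
# attention logits, so spatially unrelated region pairs can be down-weighted.
import torch
import torch.nn as nn

def box_relation_features(boxes):
    """boxes: (N, 4) tensor of (cx, cy, w, h); returns (N, N, 4) relative features."""
    cx, cy, w, h = boxes.unbind(-1)
    dx = torch.log(torch.abs(cx[:, None] - cx[None, :]).clamp(min=1e-3) / w[:, None])
    dy = torch.log(torch.abs(cy[:, None] - cy[None, :]).clamp(min=1e-3) / h[:, None])
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)

class GeometricAttentionBias(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, boxes, appearance_logits):    # appearance_logits: (N, N)
        geom = self.embed(box_relation_features(boxes)).squeeze(-1)
        # ReLU keeps the geometric term non-negative before the log, acting as a gate.
        return torch.softmax(appearance_logits + torch.log(torch.relu(geom) + 1e-6), dim=-1)
```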


Liu et al. [88] introduce the "Global-and-Local Information Exploring-and-Distilling (GLIED)" approach that explores
and distills the cross-modal source information. The Transformer-based structure globally captures the inherent spatial
and relational groupings of the individual image regions and attribute words for an aspect-based image representation.
Afterward, it extracts fine-grained source information locally for precise and accurate word selection. They used the
RCNN-based visual features provided by Anderson et al. [7] for image regions extracted by Faster R-CNN [114].
Cornia et al. [24] introduce a fully attentive model called M², a Meshed Transformer with Memory for Image
Captioning. The architecture is inspired by the Transformer model for machine translation and learns a multi-level
representation of the relationships between image regions, integrating learned a priori knowledge. The model incorporates
two novelties: (1) image regions and their relationships are encoded in a multi-level fashion, in which both low-level
and high-level relations are considered, and the model learns and encodes a priori knowledge using persistent memory
vectors; (2) the sentence generation, done with a multi-layer architecture, exploits both low- and high-level visual
relationships via a learned gating mechanism, which weights multi-level contributions at each stage. This creates a
mesh-like connection between the encoder and decoder layers. The encoder is in charge of processing regions in the
input image and the relationships between them. Simultaneously, the decoder reads the output of each encoding layer
and generates the caption word by word. All interactions between word and image-level features are modeled via scaled
dot-product attention without using recurrence.
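The memory-augmented scaled dot-product attention at the core of this encoder can be sketched as follows; this is our own simplified reading, and the layer names and number of memory slots are illustrative.

```python
# Sketch: learnable "persistent memory" slots are appended to keys and values,
# letting regions attend to a priori knowledge not present in the image itself.
import math
import torch
import torch.nn as nn

class MemoryAugmentedAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_memory: int = 40):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.mem_k = nn.Parameter(torch.randn(n_memory, d_model) / math.sqrt(d_model))
        self.mem_v = nn.Parameter(torch.randn(n_memory, d_model) / math.sqrt(d_model))

    def forward(self, regions):                    # regions: (batch, n_regions, d_model)
        b = regions.size(0)
        q = self.q(regions)
        k = torch.cat([self.k(regions), self.mem_k.unsqueeze(0).expand(b, -1, -1)], dim=1)
        v = torch.cat([self.v(regions), self.mem_v.unsqueeze(0).expand(b, -1, -1)], dim=1)
        weights = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(q.size(-1)), dim=-1)
        return weights @ v                         # (batch, n_regions, d_model)
```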
Huang et al. [65] used a Transformer-like encoder paired with an LSTM decoder. Li et al. [77] investigated
a Transformer-based sequence modeling framework named "ETA-Transformer." They have proposed EnTangled
Attention (ETA), which enables the Transformer to benefit from both semantic and visual information simultaneously. Liu
et al. [91] introduce the CaPtion TransformeR (CPTR), which takes sequentialized raw images as input to the Transformer.
As an encoder-decoder framework, CPTR is a full Transformer network that replaces the commonly used CNN in the
encoder part with the Transformer encoder. A purely Transformer-based architecture, PureT, is designed by Wang et al.
[135]. In PureT, SwinTransformer [93] replaces Faster-RCNN, and the architecture features a refining encoder and
decoder. Fang et al. [32] introduce a fully VIsion Transformer-based image CAPtioning model (ViTCAP) along with
a lightweight Concept Token Network (CTN), which is used to produce concept tokens. The structure uses a vision
transformer backbone as the stem image encoder, which produces grid features. CTN is then applied to predict semantic
concepts. A multi-modal module uses grid representations and Top-K concept tokens as input to perform the decoding
process. Pseudo ground-truth concepts are extracted from the image captions using a simple classification task, and
CTN is optimized to predict them during training. Li et al. [84] designed a Transformer-style encoder-decoder structure
called Comprehending and Ordering Semantics Networks (COS-Net). A CLIP model (image encoder and text encoder)
[112] is used as a cross-modal retrieval model which retrieves sentences semantically similar to the input image. The
semantic words in retrieved sentences are treated as the primary semantic cues. A novel semantic comprehender is also
introduced by the authors, which removes the irrelevant semantic words in those primary cues and simultaneously infers
missing words visually grounded in the image. Afterward, a semantic ranker sorts the semantic words in linguistic
order. Zeng et al. [153] propose a Spatial-aware Pseudo-supervised (SP) module which uses a number of learnable
semantic clusters to quantize grid features with multiple centroids without direct supervision. These centroids aim
to integrate grid features of similar semantic information together. In addition to the SP module, a simple weighted
residual connection is introduced, named Scale-wise Reinforcement (SR) module. This module explores both low and
high-level encoded features concurrently.
Nguyen et al. [100] present a Transformer-only neural architecture titled GRIT (Grid- and Region-based Image
captioning Transformer), which uses a DETR-based detector along with grid- and region-based features. Hu et al. [60]
have proposed ExpansionNet v2, which utilizes a novel technique titled the Block Static Expansion layer. This layer
processes the input by distributing it over a collection of sequences of different lengths, probing whether the input
length constitutes a performance bottleneck in deep learning methods, and is designed to improve the quality of
feature refinement and ultimately increase the effectiveness of the static expansion. The architecture of
ExpansionNet v2 follows the standard encoder-decoder structure and is implemented on top of the Swin Transformer [93].

3.6 Combining Transformers and Scene Graphs

A number of works have experimented with model designs that incorporate both Transformers and scene graphs.
He et al. [54] aimed to employ the spatial relations between detected regions inside an image. In their proposed
model, each Transformer layer implements multiple sub-transformers to encode relations between regions and decode
information. The encoding method combines a visual semantic graph and a spatial graph. In another architecture
introduced by Chen et al. [18], the encoder consists of two sub-encoders for visual and semantic information. Faster-
RCNN proposes image regions, and a scene graph is built using the detected regions. GCN is then used to enrich the
graph representation. A semantic matrix is learned from the scene graph and fed into a multi-modal attention module in
the decoder. This module is used to leverage multi-modal representation in caption generation. Yang et al. [146] have


proposed an architecture called ReFormer, which generates features with relation information embedded. ReFormer
explicitly expresses the pair-wise relationships between objects present inside an image. ReFormer combines scene
graph generation and image captioning using one modified Transformer model.
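As an illustration of how a GCN enriches scene-graph node features before they enter a Transformer encoder or decoder, a schematic layer is sketched below (a simplified mean-aggregation variant of ours, not any particular paper's implementation).

```python
# Sketch: each scene-graph node (object, attribute, or relation) updates its
# feature by aggregating the features of its graph neighbours.
import torch
import torch.nn as nn

class SceneGraphGCNLayer(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.w_self = nn.Linear(dim, dim)
        self.w_neigh = nn.Linear(dim, dim)

    def forward(self, nodes, adj):
        # nodes: (n_nodes, dim) node features; adj: (n_nodes, n_nodes) 0/1 adjacency.
        degree = adj.sum(dim=1, keepdim=True).clamp(min=1)       # avoid division by zero
        neighbour_msg = (adj @ self.w_neigh(nodes)) / degree     # mean over neighbours
        return torch.relu(self.w_self(nodes) + neighbour_msg)
```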

3.7 Vision Language Pre-Training Methods for Image Captioning

Some recent works have attempted pre-training paradigms to lessen the reliance of the models on fully-supervised
learning. A large-scale model is pre-trained on a dataset with an enormous amount of data by self-supervised learning.
The pre-trained model is then generalized to various downstream tasks.
One widely used pre-trained model is CLIP (Contrastive Language-Image Pre-Training). CLIP is designed to provide a
shared representation for both image and text prompts [99]. It has been trained on numerous images and captions using
a contrastive loss, allowing for more consistency and correlation between its visual and textual representations. One
of the recent works utilizing CLIP in the proposed method is ClipCap by Mokady et al. [99]. The authors introduce
a model that produces a prefix for each caption by applying a mapping network over the CLIP embeddings. Next,
a pre-trained language model (GPT-2 [111]) is fine-tuned to generate captions. This approach is inspired by Li et
al. [82], who discussed the possibility of adapting a language model for new tasks by concatenating a learned prefix.
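A hedged sketch of this prefix idea is given below, using the public Hugging Face CLIP and GPT-2 checkpoints; the plain MLP mapping network is a stand-in of ours (the authors use a more elaborate mapper), and only the mapper would be trained with the usual language-modeling loss on the caption tokens.

```python
# Sketch: a frozen CLIP image embedding is mapped to a sequence of prefix
# embeddings that GPT-2 continues as a caption.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")

prefix_len, gpt_dim = 10, gpt2.config.n_embd
mapper = nn.Sequential(                       # the only trainable part in this sketch
    nn.Linear(clip.config.projection_dim, prefix_len * gpt_dim), nn.Tanh())

def caption_logits(pil_image, caption_text):
    with torch.no_grad():
        pixels = clip_proc(images=pil_image, return_tensors="pt")["pixel_values"]
        image_emb = clip.get_image_features(pixel_values=pixels)      # (1, proj_dim)
    prefix = mapper(image_emb).view(1, prefix_len, gpt_dim)
    token_ids = gpt2_tok(caption_text, return_tensors="pt")["input_ids"]
    word_emb = gpt2.transformer.wte(token_ids)                        # token embeddings
    inputs = torch.cat([prefix, word_emb], dim=1)
    return gpt2(inputs_embeds=inputs).logits   # LM loss is applied on the caption part
```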
Barraco et al. [12] investigate the role of CLIP features in image captioning by devising an architecture composed
of an encoder-decoder Transformer architecture. Hu et al. [61] present VIsual VOcabulary pre-training (VIVO),
which aims to learn a joint representation of visual and textual inputs. Unlike existing VLP models, which use image-caption
pairs to pre-train, VIVO uses image-tag pairs for pre-training. In the pre-training stage, an image captioning model
first learns to label image regions using image-tag pairs as training data. In the fine-tuning stage, the model learns to
map an image to a sentence conditioned on the detected objects using image-caption pairs and their corresponding
object tags. The sentences are learned from image-caption pairs, while object tags may refer to novel objects that
do not exist in image-caption pairs. The addition of object tags allows for zero-shot generalization to novel visual
objects for image captioning. Xia et al. [138] highlight that while recent pre-training methods for vision-language
(VL) understanding tasks have achieved state-of-the-art performance, they cannot be directly applied to generation
tasks. Xia et al. present Cross-modal Generative Pre-Training for Image Captioning (XGPT), which uses a cross-modal
encoder-decoder architecture and is directly optimized for generation tasks.
Li et al. [81] have proposed a pre-training method that leverages salient objects, which are usually present in both image
and caption as anchor points. The method uses object tags as anchor points to align image and language modalities
in a shared semantic space. The training samples are defined as triplets, each consisting of a word sequence, a set
of object tags, and a set of image region features. This pre-training method can be applied to many vision-language
tasks, including image-text retrieval, Visual Question Answering (VQA), and image captioning. Many vision-language
pre-training methods, including [81], are built upon Bidirectional Encoder Representations from Transformers (BERT)
[28]. These models use a two-stage training scheme in which the model first learns the contextualized vision-language
representations by predicting the masked words or image regions based on their intra-modality or cross-modality
relationships on large amounts of image-text pairs. To counteract the problem of pre-training a single, unified model
that is applicable to a wide range of vision-language tasks via fine-tuning, Zhou et al. [159] have introduced a new
pre-training method for a unified representation for both encoding and decoding. The unified encoder-decoder model,
called the Vision-Language Pre-training (VLP) model, can be fine-tuned for both vision-language generation (e.g.,
image captioning) and understanding tasks (e.g., visual question answering). This model uses a shared multi-layer
Transformer network for encoding and decoding, which is pre-trained on large amounts of image-caption pairs. The
VLP model is optimized for two unsupervised vision-language prediction tasks: bidirectional and sequence-to-sequence
(seq2seq) masked language prediction. These two tasks only differ in what context the prediction conditions are on,
which is controlled by specific self-attention masks for the shared Transformer network. The context of the masked
caption word, which is the target of prediction, consists of all the image regions and all words on its right and left in the
caption in bidirectional prediction. In contrast, in the seq2seq task, the context consists of all the image regions and the
words on the left of the to-be-predicted word in the caption.
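The difference between the two objectives can be illustrated by how the self-attention mask over the concatenated [image regions; caption] sequence is built (an illustrative sketch of the idea, not the released code).

```python
# Sketch: True means "may attend". Bidirectional prediction lets a masked word
# see every region and every caption word; the seq2seq variant restricts each
# caption position to the regions plus the words on its left.
import torch

def vlp_attention_mask(n_regions: int, n_words: int, seq2seq: bool) -> torch.Tensor:
    n = n_regions + n_words
    mask = torch.ones(n, n, dtype=torch.bool)
    if seq2seq:
        causal = torch.tril(torch.ones(n_words, n_words, dtype=torch.bool))
        mask[n_regions:, n_regions:] = causal      # words: left-context only
        mask[:n_regions, n_regions:] = False       # regions do not peek at words
    return mask

bidirectional_mask = vlp_attention_mask(4, 6, seq2seq=False)
seq2seq_mask = vlp_attention_mask(4, 6, seq2seq=True)
```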
Li et al. [76] present mPLUG; a novel vision-language foundation model designed for both cross-modal understanding
and generation. mPLUG aims to counteract some of the problems commonly witnessed in pre-training models, such
as low computational efficiency and information asymmetry by novel cross-modal skip-connections. These skip-
connections generate inter-layer shortcuts that skip a specific number of layers. This method is used to improve the
slow full self-attention on the vision side. Liu et al. [90] present Prismer, a vision-language model that uses a group of
domain experts to combine their knowledge and apply it to different vision-language reasoning tasks. Prismer performs
well in fine-tuned and few-shot learning, while requiring significantly less training data compared to other models.


3.8 Unsupervised Methods and Reinforcement Learning

There has been a recent trend toward relaxing the reliance on paired image-caption datasets for image captioning. Many
of the current research works employ reinforcement learning methods due to their unsupervised nature. The interest in
unsupervised methods stems from the fact that supervised models rely almost entirely on the quality and volume of the
image-caption pairs in datasets.
One early work by Gu et al. [48] involved generating captions in a pivot language and translating the caption to a
target language. This method requires a paired image-caption dataset for the pivot language but does not use a paired
dataset with captions being in the target language. Another research paper in this field used reinforcement learning
with gradient policy along with RNNs in 2016 [113]. Shetty et al. [120] proposed the first study that explored using
conditional GANs [98] to generate human-like and diverse descriptions.
Feng et al. [33] use a set of images, a sentence corpus, and a visual concept detector for unsupervised training. The
images and the sentence corpus are projected into a common latent space such that they can reconstruct each other. The
sentence corpus is prepared using the captions available on Shutterstock [122], which is a photo-sharing platform. On
this platform, each image is uploaded with a caption. This corpus is not related to the images and is independent. The
proposed structure comprises an image encoder, a sentence generator, and a discriminator. The Inception-V4 [126] is
used as the image encoder, and the sentence generator and discriminator are both LSTMs [56].
Since no image-caption pairs exist, three training objectives have been introduced to guide and evaluate the
model. First, an adversarial objective: the discriminator distinguishes a real sentence from the corpus from a sentence
generated by the model, and the generator is rewarded at each time step; by maximizing this reward, the generator
learns to produce plausible sentences. This alone is not sufficient, however, since the quality of the generated
sentence and its relevance to the image must also be evaluated; to do so, the model must learn the visual contents of the
image. Generated words are therefore rewarded when the caption contains words whose corresponding visual concept
is detected inside the image; this is called the "concept reward." Finally, since the performance of the model
depends heavily on the visual concept detector, and such detectors only cover a limited number of
objects, images and captions are projected into a common latent space such that they can reconstruct each other.
Chen et al. [16] proposed an image captioning framework based on conditional generative adversarial nets as an
extension of the reinforcement learning-based encoder-decoder architecture. Highlighting that the conventional encoder-
decoder structures directly optimize one metric, which cannot guarantee improvement in all metrics, the paper designed
a discriminator network to decide if a caption is human-described or machine-generated based on the idea of GANs.
Two discriminator models have been designed and tested: a CNN-based discriminator model that uses the conditional
CNN for real or fake sentence classification, and an RNN-based discriminator model that consists of the standard LSTM
[56], a fully connected linear layer, and a softmax output layer. The CNN-based framework was shown to improve the
performance more than the RNN-based framework, while the RNN-based framework can save 30% training time. It
was finally concluded that the ensemble results of 4 CNN-based (denoted as CNN-GAN) and 4 RNN-based (denoted as
RNN-GAN) models could noticeably improve the performance of a single model.
Liu et al. [92] have introduced an image captioning module and a self-retrieval module. A Convolutional Neural
Network extracts image features, and an LSTM [56] decodes a sequence of words based on these features. The
self-retrieval module evaluates the similarity between the generated captions, the input image, and some "distractors." If
the caption generator module generates distinct and proper captions, the relevance between these captions and their
corresponding images must be more than the relevance between the generated captions and unrelated, distracting
images. This condition is represented as the text-to-image retrieval error and improves the performance of the image
captioning module with back-propagation and the REINFORCE algorithm.
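The policy-gradient update used with such sentence-level rewards can be summarized in a few lines; this is a generic REINFORCE sketch rather than the authors' code, and the baseline term is optional.

```python
# Sketch: sample a caption, compute a scalar sentence-level reward (e.g., a
# retrieval or CIDEr score), and scale the log-probabilities of the sampled
# words by (reward - baseline).
import torch

def reinforce_loss(log_probs, sampled_ids, reward, baseline=0.0):
    """log_probs: (T, vocab) log-softmax outputs for one sampled caption;
    sampled_ids: (T,) indices of the sampled words; reward: scalar."""
    picked = log_probs.gather(1, sampled_ids.unsqueeze(1)).squeeze(1)    # (T,)
    return -(reward - baseline) * picked.sum()

# loss.backward() then pushes the decoder toward captions with higher reward,
# e.g., a lower text-to-image retrieval error in the self-retrieval module.
```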
Guo et al. [51] used a discriminator structure similar to that of [33]. The discriminator distinguishes whether the
generated sentence is real and rewards the learner based on how real the sentences seem. Another discriminator
distinguishes the style of the generated captions. Also, the LSTM [56] decoder used in [47] has been used as a
reinforcement learning agent making an action (prediction of the next word). After a sentence is completed, the agent
will observe a sentence-level reward and update its internal state.
A block diagram of the general workflow of the unsupervised methods is shown in Figure 5. VGGNet [124] has been
used as the image encoder, and the caption generator is an LSTM network [56]. Therefore, the overall design follows
the typical encoder-decoder structure. The discriminator is also an LSTM network, which determines if the given
caption is real (from the sentence corpus) or generated by the model. The generator is rewarded accordingly by the
discriminator.


Figure 5: The general workflow of unsupervised methods (Input Image → Image Encoder → Caption Generator → Discriminator → Real/Fake decision and Reward).

Taking the issues related to supervised settings into account, such as the tedious process of dataset preparation and the
difficult training process, the unsupervised setting has been the focus of many recent works and is expected to become
more favored in the future as well.

3.9 Generating Multi-Style Captions

The papers discussed so far generate captions with a neutral tone. These generated captions usually describe factual
data about image contents. Meanwhile, humans use many styles and tones in their daily speech to communicate with
one another. Some of these styles and tones are humorous, hostile, and poetic. Incorporating these styles can help
humans interact with the caption more and make the captions more attractive. Stylized captions can also be used in
applications such as photo-sharing and Chatbots.
Shuster et al. [121] have built a dataset that annotates each image and its appropriate captions with a tone and style (personality)
feature. This paper has introduced a novel structure called TransResNet, which projects images, captions, and their
corresponding personality traits into a shared space using an encoder-decoder framework. Two classes of models have
been considered: retrieval models and generative models. The retrieval model considers any caption in the entire dataset
as a possible candidate response, whereas the generative model produces captions word by word via the aforementioned
structure. The retrieval model has given better results.
A structure consisting of five modules for caption generation in different styles has been introduced by Guo et al. [51].
The first module is a plain image encoder. Next is a caption generator module that outputs a sentence conditioned on
a specific style. The following module is a caption discriminator that distinguishes a real sentence from a generated
sentence. This discriminator is trained in an adversarial manner to encourage the learner to generate more convincing
captions closer to the human language. Afterward, a style discriminator module that determines the style of the
generated caption is used. Inspired by the fact that there is some content consistency between neutral captions and
stylized captions, another module called "The Back-Translation Module" is also used. This module translates a stylized
caption into a neutral one. (If a stylized caption is generated and translated to a factual and neutral caption, we should
arrive at the real factual caption.) This process is implemented using multi-lingual neural machine translation (NMT),
in which the stylized captions are considered input and neutral captions are considered output.
A figure consisting of some example captions from sections 3.1, 3.1.1, 3.3, and 3.9 in this survey is shown in Figure 6.
Each row belongs to a specific category in which two images are displayed, along with the captions describing them.
For each image, the ground-truth caption and the generated caption are shown.

Figure 6: Sample captions generated by multiple methods in different categories. "GT" indicates "Ground Truth Caption," and "G" indicates "Generated Caption." The captions are generated by Huang et al. [65] (top-left), Wang et al. [133] (top-right), Li et al. [81] (bottom-left), and Shuster et al. [121] (bottom-right).


4 Problems in Image Captioning


In image captioning, researchers are usually confronted with a set of problems, some of which are commonly experienced
in many artificial intelligence tasks, such as the exposure bias problem [113], the loss-evaluation mismatch problem
[47, 89, 115, 142], the vanishing gradient problem [57], and the exploding gradient problem [107, 43]. In addition,
image captioning poses certain challenges unique to the task, including object hallucination, illumination
conditions, contextual understanding, and referring expressions. In this section, we first review some of the problems
common to many intelligent tasks, followed by persistent problems in image captioning that stem from
the nature of the task itself.

4.1 The Exposure Bias Problem

This problem occurs in language models when the model is only exposed to the training data and not its own predictions.
The standard RNN models are trained to predict the next word according to the previous words in the ground truth
sequence. In contrast, in testing time, the ground truth data is no longer available, and the model uses its own previous
predictions for its following predictions. This problem will gradually produce more errors in the model’s output [113].
The structure proposed by Gu et al. [47] has employed reinforcement learning for optimization. This structure uses
each intermediate decoder’s output in testing time and also the output of the previous decoder to normalize the reward,
which solves the exposure bias and the loss-evaluation mismatch problem at the same time.
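The mismatch can be made concrete with a toy decoding loop that either feeds the ground-truth word (teacher forcing, as in training) or the model's own prediction (as at test time); the probabilistic mixing shown, known as scheduled sampling, is a common mitigation included purely for illustration, and `step_fn` is a hypothetical single-step decoder.

```python
# Sketch: exposure bias arises because training feeds targets[t] as the next
# input, while inference must feed the model's own (possibly wrong) prediction.
import torch

def decode(step_fn, start_token, targets, own_pred_prob=0.0):
    """step_fn(prev_token, state) -> (logits, new_state); targets: (T,) ground truth."""
    token, state, outputs = start_token, None, []
    for t in range(len(targets)):
        logits, state = step_fn(token, state)
        outputs.append(logits)
        prediction = logits.argmax(dim=-1)
        # own_pred_prob = 0.0 is pure teacher forcing; 1.0 is free running.
        token = prediction if torch.rand(()) < own_pred_prob else targets[t]
    return torch.stack(outputs)
```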

4.2 The Loss-Evaluation Mismatch Problem

The language models are usually trained to minimize the cross-entropy loss at each time step. Meanwhile, during
testing, the generated captions are evaluated using sentence-level metrics (discussed in "Evaluation Metrics for Image
Captioning Methods"). These metrics are non-differentiable and cannot be used directly as a test-time error [47].
Multiple efforts have been made to optimize these metrics using reinforcement learning [89, 115, 142].

4.3 The Vanishing Gradient Problem

The vanishing gradient problem happens in neural networks that train with gradient methods and back-propagation. In
these methods, each of the neural network’s weights is updated according to the partial derivative of the error function
based on the current value inside the weight at each iteration. In some cases, the value of the gradient is so minuscule
that there is no change in weights. In the worst case, this might halt the training process completely [57].

4.4 The Exploding Gradient Problem

Another problem associated with gradients is the "exploding gradient" problem. In deep or recurrent neural networks,
error gradients can build up and accumulate during an update, resulting in enormous gradient values. Consequently,
the network's weights receive very large updates, and the network becomes highly unstable. At best, a deep
multilayer Perceptron network cannot learn from the training data; at worst, the weights overflow into NaN (Not a Number)
values that can no longer be updated. Likewise, in recurrent networks, exploding gradients produce an unstable network
that, at best, cannot learn over long input sequences of data [107, 43].
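A standard, generic safeguard against exploding gradients, not tied to any captioning model surveyed here, is gradient-norm clipping between the backward pass and the optimizer step, e.g. in PyTorch:

```python
# Sketch: rescale the gradient vector whenever its global norm exceeds a threshold.
import torch

def training_step(model, optimizer, loss, max_norm=5.0):
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)   # clip before stepping
    optimizer.step()
```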

4.5 Object Hallucination

Object hallucination [116] is a persistent problem for image captioning models, wherein the model detects objects
that are not present in the input image. This can lead to poor performance in visually-impaired users, who require
accurate and concise captions. According to a study by MacLeod et al. [95], for many visually impaired people who
prefer correctness over coverage, hallucination is a severe disadvantage for a captioning model and an obvious concern.
Furthermore, object hallucination indicates an internal issue of the model. Rohrbach et al. have proposed a new metric
to measure object hallucination, CHAIR (Caption Hallucination Assessment with Image Relevance), which measures
the proportion of generated words that correspond to objects in the input image, according to ground truth sentences
and object segmentations. The CHAIR metric has both per-instance and per-sentence variants, denoted as CHAIRi
(equation 10) and CHAIRs (equation 11), respectively.

$$\mathrm{CHAIR}_i = \frac{|\{\text{hallucinated objects}\}|}{|\{\text{all objects mentioned}\}|} \qquad (10)$$

$$\mathrm{CHAIR}_s = \frac{|\{\text{sentences with hallucinated objects}\}|}{|\{\text{all sentences}\}|} \qquad (11)$$
According to the study performed by Rohrbach et al., models that perform better on standard evaluation metrics (such
as BLEU [106] and SPICE [6]) generally perform better on CHAIR, although this is not always the case: models
optimized for CIDEr were found to hallucinate more frequently. Models with attention also tended to perform better on
the CHAIR metric than models that did not incorporate attention; however, this gain was primarily due to these models'
access to the underlying convolutional features rather than the attention mechanism itself. Finally, GAN-based models
decreased hallucination, implying that the GAN loss is beneficial in this respect: the
GAN loss encourages sentences to resemble human-generated captions, so the presence of a hallucinated object likely
signals that a sentence is machine-generated, and the discriminator dismisses the caption containing it.
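Equations (10) and (11) translate directly into code; the sketch below is our own, and the official CHAIR implementation additionally maps synonyms and plural forms onto MS COCO object categories before counting.

```python
# Sketch of the per-instance and per-sentence CHAIR scores.
def chair_i(mentioned_objects, ground_truth_objects):
    hallucinated = [o for o in mentioned_objects if o not in ground_truth_objects]
    return len(hallucinated) / max(len(mentioned_objects), 1)

def chair_s(objects_per_sentence, ground_truth_objects):
    """objects_per_sentence: one list of mentioned objects per generated sentence."""
    bad = sum(any(o not in ground_truth_objects for o in objs) for objs in objects_per_sentence)
    return bad / max(len(objects_per_sentence), 1)

# Example: a caption mentioning "dog" and "frisbee" for an image that only
# contains a dog yields CHAIR_i = 0.5.
assert chair_i(["dog", "frisbee"], {"dog"}) == 0.5
```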

4.6 Illumination Conditions

Illumination conditions are a critical factor that can impact the accuracy and reliability of the generated captions,
particularly when the image is captured in low-light conditions or indoors. Poor lighting can result in images with
reduced contrast, making it difficult for the captioning model to discern fine details and recognize objects, people, or
scenes. Moreover, the presence of shadows or uneven illumination can further hinder the model's ability to accurately
analyze the visual content. For example, an image of a black cat in a dimly lit room with uneven illumination may be difficult for the
captioning model to recognize as a cat. To overcome these challenges, researchers have been actively exploring various
techniques to improve the visual quality of the images [3, 4], including contrast enhancement, color correction, and
low-light image enhancement. These techniques aim to mitigate the challenges posed by poor illumination and improve
the accuracy of generated captions.

4.7 Contextual Understanding

Image captioning models also require the ability to understand the context of the scene, including the relationships
between objects, the spatial arrangement, and the overall atmosphere [150, 137, 36, 148, 158]. This contextual
understanding can be difficult to achieve, as it requires the caption generation model to have a deep understanding of
the visual content and the ability to perform reasoning given the visual content.

4.8 Referring Expressions

Another problem in image captioning is the use of referring expressions, such as "the girl with the red hair" or "the dog
in the corner." These expressions require the captioning models to identify and link the appropriate objects in the image.
This can pose a challenge, especially if the objects are partially obscured or if there are multiple similar objects in the
scene, and requires a combination of visual and linguistic understanding [22]. Referring expressions are important
for improving caption quality, as they provide more detailed and informative descriptions of the objects in the image,
allowing the model to generate more accurate and nuanced captions.

5 Discussion
This section provides a comprehensive critical analysis of the methods falling into the different categories overviewed
in section 3. Each method inevitably possesses advantages and disadvantages; nevertheless, considering these
characteristics aids researchers in adopting a suitable solution. The technical details of the structures and methods
discussed in this section have been explained in section 3.

5.1 Using Attention

Attention-based methods attempt to imitate the human attention mechanism by showing the model "where to look at"
during the training process. Attention is widely used in encoder-decoder architectures where CNNs are typically used
in combination with LSTMs to produce a representation for the given image and generate captions, respectively. Some
of the papers focusing on attention-based methods have mentioned low precision in region selection for attention as a
flaw of the attention-based methods. They claim that most of the attention-based methods presented choose regions of
the same size and shape without considering image contents. They have also mentioned that determining the optimal
number of region proposals will bring about an unresolvable trade-off between small or large amounts of detail (or,
representing the image coarsely or finely). One solution to this problem was proposed by Anderson et al. in [7] as the


"bottom-up and top-down" method. Another problem of the attention-based methods is the "single-stage" structure.
Most of these methods are only a single encoder-decoder attention structure, which cannot provide rich captions for
the images. In the multi-stage coarse-to-fine structure proposed by Gu et al. [47], at each stage, attention weights and
hidden vectors generated by the previous stage decoder are used as input to the next stage decoder, reducing ambiguity
in the captions. This structure allows for a richer caption at each stage. Another problem associated with attention-based
methods for image captioning is that a proper correlation between the vectors obtained from attention and caption is not
guaranteed, and it might lead to improper results. If feature vectors do not contain valuable information, the attention
model still generates a vector that is a weighted sum over candidate vectors and is unrelated to the correct caption.
To solve this issue, Huang et al. [65] have introduced an attention-based structure (Attention on Attention, or AoA)
which contains one more level of attention. The authors have compared AoA with LSTM [56] and GRU networks
[20]: internal states, memories, and gates are used in LSTMs and GRUs to implement the attention mechanism. AoA
only performs two linear transformations and does not require hidden states, making it computationally reasonable
while outperforming LSTM. The combination of LSTM and AoA has been reported to be unstable since it can reach
a sub-optimal point. This means that increasing the volume of the stack and the number of gates to improve the
performance is futile. Jiang et al. [67] state that the existing encoder-decoder models employ only one kind of CNN to
describe image content. Consequently, the image contents will be described from only one specific viewpoint, and the
semantic meaning of the input image cannot be comprehensively understood, which will restrict the performance. In
order to improve the image captioning model, the model introduced by Jiang et al. [67] extracts diverse representations
from multiple encoders. The novel recurrent fusion network (RFNet) proposed in the paper uses multiple CNNs as
encoders. Each representation extracted from an individual CNN can act as an individual view of the image content.
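The extra level of attention in AoA can be sketched as follows, based on our reading of [65]; dimensions and names are illustrative.

```python
# Sketch: the attended vector and the query are concatenated and passed through
# two linear maps, producing an "information" vector and a sigmoid gate; the
# gated information replaces the raw attention result.
import torch
import torch.nn as nn

class AttentionOnAttention(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.info = nn.Linear(2 * dim, dim)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, query, attended):            # both: (..., dim)
        x = torch.cat([query, attended], dim=-1)
        return torch.sigmoid(self.gate(x)) * self.info(x)
```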

5.1.1 Injecting Spatial and Semantic Relation Information into Attention-Based Methods

One of the significant downsides of the methods that only use the attention mechanism as their main solution for image
captioning is that these methods fail to consider the spatial and semantic relations between image elements. Spatial
and semantic relations in an image are integral to comprehension of the image contents [150, 137, 36, 148, 158]. For
example, spatial relations in an image could help differentiate between "a person riding a horse" and "a person standing
on a horse’s back." Also, relative size can help differentiate between objects with their most significant difference being
their size, like violins and cellos. In addition to that, incorporating these relations makes the object detection task more
precise. As a possible solution, Herdade et al. [55] have introduced the "Object Relation Transformer". Pan et al. [105]
have mentioned another problem about the attention-based image captioning methods: in most of these methods, only
the first-order interactions between objects inside the image are observed. Their paper has claimed that since the image
captioning problem involves multi-modal data (image and text), multi-modal reasoning is needed, and observing the
first-order interaction between features only will render more in-depth reasoning impossible. The structure proposed by
Pan et al. uses spatial and channel-wise bilinear attention to extract second-order interactions. Liu et al. [88] claim that
there is still great difficulty in deep image understanding; because the systems tend to view one image as unrelated
individual segments and are not guided to comprehend the relationships between the objects inside the image. They
argue that such understanding requires adequate attention to correlated image regions and coherent attributes of interest.
To do so, they have presented the "Global-and-Local Information Exploring-and-Distilling" (GLIED) approach.
To pre-train the models and methods discussed so far, existing methods mostly concatenate detected regions and textual
features and use self-attention to learn the semantic alignments between the two modalities. These methods suffer from
two main issues: ambiguity and lack of grounding [81].
Ambiguity: The methods that utilize the spatial and semantic relations between objects in images use object detectors
to locate salient objects. These detectors usually generate redundant regions, and the visual features extracted using
these detectors are extracted from overly sampled regions. The regions belonging to multiple objects might overlap
heavily and cause ambiguity in the extracted visual embeddings.
Lack of grounding: the object tags used in previous methods are not associated with both object regions and word
embeddings, resulting in a lack of grounding. Also, the attention models so far do not focus on the same regions as
a human would when looking at an image [26]. However, salient objects are usually present in both image and the
corresponding caption, which can be used as anchor points to ease the process of training in the vision-language tasks.
The pre-training method proposed by Li et al. [81] leverages these anchor points to tackle the mentioned issues. In
many vision-language pre-training methods, the pre-trained model is fine-tuned for downstream tasks. However, it is
challenging to pre-train a single, unified model that is universally applicable to a wide range of vision-language tasks
via fine-tuning. Zhou et al. presented a pre-training method for a unified representation for encoding and decoding
in [159]. The Vision-Language Pre-training (VLP) model proposed in this paper has the advantage of unifying the
encoder and decoder and learning a more universal contextualized vision-language representation, which can be
fine-tuned for generation and understanding tasks easily. This unified procedure results in a single model architecture
for the two distinct vision-language prediction tasks (bidirectional and seq2seq). This alleviates the need to train


multiple pre-training models for different tasks without significant performance loss. In order to fine-tune for the image
captioning task, the VLP model is fine-tuned on the target dataset using the seq2seq objective. Cornia et al. [23]
claim that an attention-based architecture implicitly selects which regions to focus on, but it does not provide a way of
controlling which regions are described and what importance is given to each region. The model suggested in their
paper is able to focus on different regions in different orders following a given condition. Words can be organized into a
tree-like structure, and a higher level of abstraction can be recovered considering the syntactic dependencies between
words.

5.2 Using Graphs for Spatial and Semantic Relations

Graphs have been used extensively in many image captioning methods due to their ability to cohesively represent
the relation between multiple elements. These methods have utilized graphs in two ways: scene graphs extracted
from images and scene graphs extracted from textual data. Scene graphs have been used as a component inside
encoder-decoder-based or unsupervised frameworks, and some have employed scene graphs along with Transformers.
Graph-based methods pose challenges of their own. Yang et al. [145] rightfully state that an ever-present problem
has never been substantially resolved: the different variants of the encoder-decoder-based framework, when fed an
unseen image, usually produce a simple and trivial caption about the salient objects in the image, which is no better
than a list of detected objects. The model presented by Yang et al. adds the inductive bias of language generation to the
encoder-decoder framework and uses scene graphs to connect the image and text modalities.
Gu et al. [49] argue that the majority of image captioning studies are conducted in English, and preparing image-caption
paired datasets in other languages requires human expertise and is time-consuming. The method introduced in their
paper uses scene graphs as an intermediate representation of the image and sentence and maps the scene graphs in their
feature space using cycle-consistent adversarial training.

5.3 Using Attention and Graphs

Considering how mutual correlations or interactions between objects are the natural basis for image description, Yao
et al. [149] study the visual relationships between objects and how they can be utilized for this matter. They have
built semantic and spatial correlations on image regions and used Graph Convolutions to learn richer representations.
One major challenge of image captioning is the problem of grounded captioning. Most models do not focus on the
same image regions as a human would while observing an image, which may lead to object hallucination [116]. Zhong
et al. [157] addressed this problem by revisiting the representation of image scene graphs. The key idea is to select
essential sub-graphs and only decode a single target sentence from a chosen sub-graph. The model can link the decoded
tokens back into the image regions, demonstrating noticeable results for caption grounding. Another downside of
the attention-based methods is that they do not incorporate the regions previously attended by the attention model.
These regions can be used in the module’s following region selections. Wang et al. [133] have integrated this point as
well as the semantic relations between image elements in their proposed structure, which uses a novel content-based
attention framework to store previously attended image regions. Chen et al. have discussed in [19] that even though
some methods focus on controlling expressive styles or attempt to control the description contents (discussed in section
3.9), they can only handle a coarse-level signal. Their method uses a directed graph consisting of three node types
grounded in the image, which allows for incorporating user intentions. Li et al. [80] argue that most methods that devise
semantic concepts treat entities in images individually and lack helpful, structured information. Therefore, they have
utilized scene graphs along with CNN features from the bounding box offsets of object entities. Another work by Xu et
al. [141] also addresses the lack of structured information in current systems. The authors propose the Scene Graph
Captioner (SGC), which is divided into three major components: The graph embedding model, the attention extraction
model, and the language model. The attention extraction model is inspired by the concept of small world in the human
brain. The work proposed by Lee et al. [75] uses visual relations from scene graphs directly instead of GCNs, claiming
that it will avoid expensive graph convolutions. While the performance of some GCN-based models is slightly better,
evading graph convolutions may be reasonable in some frameworks.

5.4 Using Convolutional Network-Based Methods

LSTM networks [56] have been considered the standard for vision-language tasks such as image captioning and
visual question answering due to their impressive ability to memorize long-term dependencies through a memory cell.
However, training such networks can be considerably challenging due to the complex addressing and overwriting
mechanism combined with the required processing being inherently sequential, and the significant storage required
in the process. LSTMs [56] also require more careful engineering when considering a novel task [8]. Earlier, CNNs
could not perform as well as LSTMs on vision-language tasks. The recent advances in convolutional structures on other


sequence-to-sequence tasks have enabled researchers to use CNNs in many other vision-language tasks. Also, CNNs
produce more entropy [8], which can be helpful for diverse predictions, have better classification accuracy, and do not
suffer from vanishing gradients. Aneja et al. [8] proposed a convolutional model which uses masked convolutions
instead of LSTM or GRU units. This work also experimented with attention by forming an attended image vector
and adding it to the word embedding at every layer. Doing so, the model has outperformed the attention baseline
[140]. With attention, the model could identify salient objects for the given image. Arguing that RNNs or LSTMs [56],
which are widely used in image captioning, cannot be computed in parallel and also ignore the underlying hierarchical
structure of the sentences, Wang et al. [134] designed a framework entirely relying on CNNs. The proposed model can
be computed in parallel and is faster to train. However, convolutional-network based methods still need improvement in
terms of performance.

5.5 Using Transformers

The encoder-decoder framework continues to dominate the image captioning world, with the models only varying
in details and sub-modules. The recent success of Transformers in natural language processing tasks has inspired
many researchers to replace the RNN model with Transformer in the decoders, aiming to benefit from its excellent
performance and the possibility of parallel training. Transformers have been the center of attention in the computer
vision field as well, with models such as DETR [15], ViT [30], SETR [156], and IPT [17].
Liu et al. [91] have proposed the CaPtion TransformeR (CPTR), a full Transformer network to replace the widely used
CNN in the encoder part of the encoder-decoder framework. Fang et al. [32] criticize the use of object detectors as a
tool to provide visual representation, stating that it may lead to heavy computational load and that they require box
annotations. Fang et al. [32] have introduced the detector-free ViTCAP model with a fully Transformer architecture,
which uses grid representations without regional operations.
Nguyen et al. [100] mention another issue with CNN-based detectors. CNN-based detectors use non-maximum
suppression (NMS) at the last stage of computation to remove redundant bounding boxes. As a consequence, end-to-end
training of an entire model consisting of detector and decoder modules becomes difficult. To overcome this problem
and to reduce their high computation cost, Nguyen et al. employ the Deformable DETR [161] and replace the CNN
backbone in the original design with Swin Transformer. The COS-Net model (Comprehending and Ordering Semantics
Network) proposed by Li et al. [84] aims to unify semantic comprehending and ordering. COS-Net uses a CLIP
model (image encoder and text encoder) [112] is utilized as a cross-modal retrieval model which retrieves sentences
semantically similar to the input image.
Zeng et al. [153] argue that directly operating at grid features may lead to the loss of spatial information caused by the
flattening operation. The objective of the Pseudo-supervised (SP) module designed by the authors is to resolve this
issue. Also, the Scale-wise Reinforcement (SR) module has been introduced to maintain the model size and improve
performance. Wang et al. [135] argue that using a network such as Faster-RCNN as the encoder divides the captioning
task into two stages and thus limits it. The PureT model built by the authors is a pure Transformer-based structure that
integrates the captioning task into one stage and enables end-to-end training.
ExpansionNet v2 introduced by Hu et al. [60] aims to solve the problem of performance bottlenecks in the input length
in Deep Learning methods for image captioning. To address this issue, the authors introduce a new technique called
Block Static Expansion, which distributes and processes the input over a collection of sequences with different lengths.
This method helps to improve the quality of features refinement and ultimately increase the effectiveness of the static
expansion.

5.5.1 Using Graphs and Transformers


Some current captioning encoders use a GCN to represent the relation information. Yang et al. [146] highlight that
these encoders are ineffective in image captioning for two reasons: they are trained with Maximum Likelihood Estimation
rather than a relation-centric loss, and they obtain relationships from pre-trained models instead of the encoder itself,
which limits model explainability. Yang et al. propose the ReFormer architecture, which applies the objective of scene
graph generation and image captioning by means of one modified Transformer. Chen et al. [18] use the Transformer
as their base architecture in the model SGGC (Scene Graph Guiding Captioning). The encoder is composed of two
sub-encoders named visual encoder and semantic encoder. In the visual encoder sub-component, a Transformer encoder
consisting of N identical encoding layers has been used instead of the general CNN-based encoder to capture the
relationships between visual regions better. Scene graphs have been used as additional guidance for decoder generation.
While Transformers are suitable for self-supervised pretext tasks on large-scale data, training can become expensive
and burdensome. There is a need for more economic Transformer-based large-scale multi-modal models which can be
achieved by means of incorporating more inductive bias about vision and language data [143].


5.6 Using Vision-Language Pre-Training for Image Captioning

Vision-language pre-training (VLP) has remarkably contributed to the recent advances in image captioning and is
currently the dominant training method for vision-language (VL) tasks. In VLP approaches, a large-scale model
is usually pre-trained on massive amounts of data using self-supervised learning, and then generalized to adapt to
downstream tasks. Studies by [81, 155, 159] have extensively observed the effect of pre-training objective methods and
architectures. The scale of the pre-training dataset is also believed to be a crucial factor in outstanding performance.
VLP helps alleviate some of the problems experienced in conventional image captioning methods. The conventional
methods typically need to minimize the gap between the visual and textual modals and are therefore resource-hungry
[99]. Excessive training time and numerous trainable parameters are also required, reducing their practicality. On the
other hand, given new samples, the models need to be updated to adapt to new inputs. This brings about the need
for lightweight models with faster training times and fewer parameters. It has recently been observed that powerful
vision-language pre-trained models improve zero-shot performance dramatically and reduce training time. One such
pre-trained model is CLIP (Contrastive Language-Image Pre-Training). Mokady et al. [99] have leveraged CLIP
encoding as prefix to the captions in their ClipCap model. A lightweight Transformer-based mapping network is trained
from the CLIP embedding space and a learned constant. The GPT-2 network is used as the language model to generate
captions given the prefix embeddings. Hu et al. [61] point out that while many VLP methods have been introduced
that learn vision-language representations through training large-scale Transformer models, most are designed for
understanding tasks. The few solutions that can be applied to image captioning [81, 159] use paired image-caption
data for pre-training, which cannot improve zero-shot performance. VIVO (VIsual VOcabulary pre-training), proposed
by Hu et al., learns vision-language alignment on image-tag pairs. Since caption annotations are not needed, many
existing vision datasets originally prepared for tasks such as image tagging or object detection can be used. Xia et al.
[138] emphasize that VL generation tasks necessitate the ability to learn generation capabilities as well as the ability to
understand cross-modal representations. Also, Xia et al. explain that the pre-trained models developed for understanding
tasks only provide the encoder, and separate decoders need to be trained to enable generation. In addition to this
deficiency, none of the pre-training tasks is designed for whole-sentence generation. The XGPT (Cross-modal
Generative Pre-Training for Image Captioning) takes advantage of a cross-modal encoder-decoder architecture and is
directly optimized for VL generation tasks. Three generative pre-training tasks have been designed to countervail the
lack of pre-training objectives for generation tasks, namely: Image-conditioned Masked Language Modeling (IMLM),
Image-conditioned Denoising Autoencoding (IDA), and Text-conditioned Image Feature Generation (TIFG).
Li et al. [76] mention computational inefficiency and information asymmetry as some of the shortcomings of existing
pre-trained models. Li et al. [76] have proposed the mPLUG model, which incorporates a novel cross-modal fusion
mechanism with cross-modal skip-connections to alleviate these problems. Liu et al. [90] point out the data inefficiency
and heavy computation associated with current vision-language models, and take a different approach in
their model Prismer to learn domain knowledge via distinct and separate sub-networks referred to as experts. Prismer
includes modality-specific experts that encode multiple types of visual information directly from their corresponding
network outputs. The expert models are pre-trained and frozen individually and are connected through lightweight
trainable components. This approach results in a significant reduction in total network parameters.

5.7 Using Unsupervised Methods and Reinforcement Learning

The research works discussed in the aforementioned categories used a combination of images and their corresponding
captions to train the structures they introduced and generated captions for new images while optimizing metrics.
Training these supervised methods is challenging and involves some problems. One problem is that most of the research
on image captioning has only worked on generating captions in the English language, and a proper dataset consisting of
captions in multiple languages is not available. Preparing such a dataset requires the skills of human experts and is
very time-consuming. Preparing a dataset of images and their corresponding captions is generally a difficult task. The
Microsoft COCO dataset [87], which is widely used in image captioning is much smaller than other datasets specifically
designed for the object detection task, such as ImageNet and Open Images [24]. The Microsoft COCO dataset [87] has only 91
object classes; consequently, models trained on this dataset fail to generalize to new images containing objects that were not
covered in the dataset. A considerable part of image captioning research is moving towards unsupervised methods to
solve these issues. The early works improved the diversity of the captions; however, they sacrificed overall performance.
Feng et al. [33] have used a sentence corpus, a visual concept detector, and a set of images for unsupervised training.
The model is composed of an image encoder, a sentence generator, and a discriminator. The results obtained from this
research work have been criticized by Gu et al. [49] (discussed in section 3.2). It has been explained that considering the
limitations imposed by forgoing supervised learning, this research work has not achieved significant results, and the performance
of the proposed model is not satisfactory. Gu et al. [49] use an unsupervised method (CycleGAN) to align the scene
graph and the captions. Chen et al. [16] point out one issue with conventional encoder-decoder structures: many
directly optimize one or a combination of metrics. This can not guarantee consistent improvement over all metrics. As


a solution, Chen et al. have designed a discriminator network based on the idea of GANs, which judges if a caption is
human-generated or produced by a machine. Liu et al. [92] have introduced a system consisting of a captioning module
and a self-retrieval module. The notable part of this work is the self-retrieval module (which uses the REINFORCE
algorithm) that improves the performance of the aforementioned structure while only training on partially labeled data.
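As a rough illustration of how the REINFORCE algorithm is typically applied in such setups, the sketch below computes a policy-gradient loss from sampled captions, a sentence-level reward, and a baseline; the tensor shapes, function names, and the abstract reward function are assumptions for illustration and do not reproduce the exact formulation of [92].

import torch

def reinforce_caption_loss(log_probs, rewards, baseline, mask):
    # log_probs: (batch, seq_len) log-probabilities of the sampled words
    # rewards:   (batch,) sentence-level reward, e.g. a CIDEr or retrieval score
    # baseline:  (batch,) baseline reward, e.g. from a greedily decoded caption
    # mask:      (batch, seq_len) 1 for real tokens, 0 for padding
    advantage = (rewards - baseline).unsqueeze(1)   # (batch, 1)
    per_token = -advantage * log_probs * mask       # negated for gradient ascent on the reward
    return per_token.sum(dim=1).mean()              # average loss over the batch

# Hypothetical usage inside a training step:
# sampled_ids, log_probs, mask = model.sample(images)   # stochastic decoding
# greedy_ids = model.greedy_decode(images)              # baseline captions
# rewards  = reward_fn(sampled_ids, references)
# baseline = reward_fn(greedy_ids, references)
# loss = reinforce_caption_loss(log_probs, rewards, baseline, mask)
# loss.backward()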

5.8 Captioning in Multiple Styles

Some of the papers covered in this survey generate captions in multiple styles, with some of these styles being humorous
or hostile. The structure called "TransResNet" presented by Shuster et al. [121] considers two classes of models:
retrieval and generative. While the retrieval model has given better results, a disadvantage of retrieval models for caption generation is that they do not produce a new caption and only choose one from a massive dataset, which tends to yield generic and repetitive captions. This limitation pushes many researchers towards unsupervised methods. Guo et al. [51] have stated that incorporating appropriate styles into captions enriches their clarity and appeal and allows for user engagement and social interactions. The structure presented by Guo et al. is composed of
five modules for caption generation in different styles.
Stylized captions can help improve user interaction. However, since neutral captions that report factual data are
more appropriate for visually impaired individuals, stylized captions may not be the best choice to utilize in assistive
technologies.

6 Datasets and Performance Comparison

The methods discussed in previous sections use various datasets and are evaluated with multiple evaluation metrics. In
this section, we review the datasets and metrics widely used in recent research works in depth. The available datasets
for the image captioning task are still small compared to that of object detection, and the evaluation metrics have many
limitations. Considering the increasing importance of the image captioning task, preparing richer datasets and more
accurate metrics can be vital to the growth and improvement of the task.

6.1 Datasets Used by Recent Works

6.1.1 Microsoft COCO

The MS COCO dataset [87] is a vast dataset for object detection, image segmentation, and image captioning. It contains 328,000 images across 91 object classes, with segmentation annotations and five captions for each image.
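As a brief illustration of how these caption annotations are commonly accessed, the sketch below uses the COCO API (pycocotools); the annotation file path is a placeholder for a locally downloaded file.

from pycocotools.coco import COCO

# Placeholder path to a locally downloaded caption annotation file.
coco_caps = COCO("annotations/captions_train2017.json")

img_ids = coco_caps.getImgIds()                     # all image ids in this split
ann_ids = coco_caps.getAnnIds(imgIds=[img_ids[0]])  # annotations of the first image
annotations = coco_caps.loadAnns(ann_ids)

# Each image comes with (roughly) five reference captions.
for ann in annotations:
    print(ann["caption"])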

6.1.2 Flickr30K

This dataset [151] was introduced for the automatic image captioning and grounded language understanding task. It contains 31,000 images collected from the Flickr website, along with 158 thousand captions written by humans. Its baseline approaches include a detector for everyday objects, a color classifier, and a bias toward selecting larger objects. The dataset does not specify official training, validation, and test splits, so researchers use whatever split settings they desire.

6.1.3 Flickr30K Entities

The Flickr30K Entities dataset [109] is based on the Flickr30K dataset and contains 158k captions from Flickr30K with
244k coreference chains which link mentions of the same entities in images. The dataset also contains 276k manually
annotated bounding boxes corresponding to each entity.

6.1.4 Visual Genome

Unlike the other datasets discussed, which provide captions for the entire image, this dataset [73] presents a separate caption for each image region. The dataset comprises seven parts: region descriptions, objects, attributes, relationships, region graphs, scene graphs, and question-answer pairs. The Visual Genome dataset contains more than 108 thousand images,
with each image having an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects.


6.1.5 FlickrStyle10k

This dataset [35] contains 10 thousand images with captions of varying styles. Training data consists of 7 thousand
images, and the testing and evaluation data consist of 2 thousand and 1 thousand images, respectively. Each image has
captions in different styles, such as romantic, humorous, and neutral (factual).
In addition to the datasets commonly used by many research works, some works have collected and prepared their own datasets.

6.1.6 TextCaps

This dataset [123] aims to help train visual assistants for visually impaired individuals, focusing on presenting captions
for images with written text inside them. According to the data reported in [13], 21 percent of the questions asked by
visually impaired individuals involved written texts inside images. This dataset presents 145 thousand captions for 28
thousand images.

6.1.7 VizWiz-Captions

This dataset [52] has been introduced as a dataset appropriate for image captioning for visually impaired individuals. It consists of 23,431 training images with 117,155 training captions, 7,750 validation images with 38,750 validation captions, and 8,000 test images with 40,000 test captions. The images have been taken directly by visually impaired individuals.
Some recently introduced datasets are not yet used by many works but are potential choices for future research.

6.1.8 Google’s Conceptual Captions

This dataset [119] consists of approximately 3.3 million images and captions. The images have been collected from the
Internet, along with the "alt-text" associated with them. These image-caption pairs have then been filtered and processed to extract appropriate captions that describe the image contents. This dataset is split into training
and evaluation splits. There are 3,318,333 image-caption pairs in the training split and 15,840 image-caption pairs in
the evaluation split.

6.1.9 nocaps

The "Novel Object Captioning at Scale (nocaps)" dataset [1] has been presented to encourage the development of
captioning models that can surpass the limitation of visual concepts in existing datasets. The introduced benchmark is
composed of 166,100 human-generated captions describing 15,100 images from the Open Images validation and test
sets. The training data consists of Open Images image-level labels and object bounding boxes in addition to COCO
image-caption pairs. Considering that Open Images contains many more object classes not present in COCO, about 400
object classes in test images have almost no associated training captions.

6.1.10 Open Images V6: Visual Relationships

The Open Images dataset [74] contains various sections for object detection, image segmentation, object relationships,
and more. The dataset includes approximately 9 million images in 600 different classes. Each image contains an
average of 8.4 objects. One section of this dataset is the Visual Relationships section, which contains 329 ternary (subject-relation-object) relationships annotated over 375 thousand images. These relationships are in the forms of human-object (for example, "a person
holding a microphone"), object-object (for example, "a dog inside a car"), and object-attribute (for example, "bench is
wooden," "handbag is made of leather"). The most recent version of this dataset is available at [44].

6.1.11 Open Images V6: Localized Narratives

In 2020 and the sixth version of the Open Images dataset, a new section under the name of "Localized Narratives"
was added [110]. This section covers 671 thousand images from the Open Images dataset. A human annotator has described each image via a voice recording while moving their computer mouse over the regions they were describing. Since the words of the caption are in sync with the mouse movements, the location
associated with each word is available.


6.1.12 SentiCap
SentiCap [97] is a sentiment captioning dataset based on the MS COCO dataset [87]. There are three positive and three
negative sentiment captions for each image. The positive sentiment subset consists of 2,873 sentences and 998 images
for training and another 2,019 sentences over 673 images for testing. The negative sentiment subset consists of 2,468
sentences and 997 images for training and another 1,509 sentences over 503 images for testing.

6.1.13 SBU Captions Dataset


The SBU dataset [104] consists of 1 million images and the corresponding descriptions given by their uploaders on Flickr. The captions are not guaranteed to be visual or unbiased; therefore, this dataset contains more noise than the other datasets.

6.1.14 The Karpathy Split


In order to train and evaluate machine learning models, a dataset is usually split into training, validation, and test sets. One typical split for the datasets widely used in recent works is the Karpathy split [71], which reserves 5,000 images for offline testing and 5,000 images for offline validation and uses the rest for training. Most research works use this splitting method to remain comparable with other works.
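For illustration, the widely circulated Karpathy split files store a "split" field for every image; the sketch below partitions such a file into the usual subsets. The file name is a placeholder, and folding the "restval" portion into training follows common practice.

import json
from collections import defaultdict

# Placeholder path to the Karpathy split file (often distributed as dataset_coco.json).
with open("dataset_coco.json") as f:
    data = json.load(f)

splits = defaultdict(list)
for image in data["images"]:
    split = image["split"]        # one of: train, restval, val, test
    if split == "restval":        # convention: merge restval into training
        split = "train"
    splits[split].append(image)

for name, images in splits.items():
    print(name, len(images))      # expect 5,000 val and 5,000 test images
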
The details regarding the datasets discussed in this section are summarized in Table 1.

Table 1: Most common datasets: details (R.D. indicates "Region Descriptions", L.N. indicates "Localized Narratives",
and V.R. indicates "Visual Relationships")
Dataset Total Images Objects/Image Object Classes Captions/Image
Visual Genome [73] 108,077 36.17 80,138 5.4m R.D.
MS COCO[87] 330,000 7.57 91 5
Flickr30K Entities[109] 31,783 8.7 44,518 5
OpenImagesV6:V.R.[44] 375,000 8.4 - 1
Flickr30K[151] 31,000 - - 5
FlickrStyle10K[35] 10,000 - - 2
OpenImagesV6:L.N.[110] 849,000 - - 1
SBU Captions[104] 1 mil - - 1
SentiCap[97] 3171 - - 6
TextCaps[123] 28,408 - - 5
VizWiz-Captions[52] 39,181 - - 5
nocaps[1] 15,100 - 680 11
Conceptual Captions[119] >3 mil - - 1

6.2 Evaluation Metrics for Image Captioning Methods

The metrics discussed below fall into two categories: the text evaluation metrics and the caption evaluation metrics. The
text evaluation metrics evaluate machine-generated text portions independently. Most of these metrics were introduced
for evaluating the text generated by machine translation models. The caption evaluation metrics evaluate the captions
generated by the models and have been designed specifically for the image captioning task.

6.2.1 BLEU (Bilingual Evaluation Understudy)


BLEU [106] is an evaluation metric for machine-generated texts. Separate parts of a text are compared against a set of
reference texts, and each part receives a score. The overall score is an average over these scores; however, the syntactical
correctness is not evaluated. The performance of this metric varies based on the references used and the size of the
generated text. BLEU is widely used because it was a pioneer in evaluating machine-generated text, is language-independent, simple, fast, and cheap to compute, and correlates reasonably well with human judgment. BLEU counts the matching n-grams between the machine-generated text and the reference texts. In computational linguistics and probability, an n-gram is a contiguous sequence of n items in a text; these items can be phonemes, syllables, letters, words, or base pairs. The number n determines the length of the word sequences compared against each other. Usually, BLEU-1, BLEU-2, BLEU-3, and BLEU-4 are reported. To compute BLEU-n, the precisions of 1-grams up to n-grams are computed, each is assigned a single weight, and the weighted geometric mean of these precisions is calculated. For example, when computing BLEU-4, the precisions of 1-grams to 4-grams are calculated, each is given a weight of 0.25, and the geometric mean is computed over these values. This metric does have some disadvantages, such as the fact that the computed scores tend to be high only when the generated text is short. Also, in some cases, a high score achieved with this metric is unreliable and does not indicate a higher-quality text.
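For illustration, BLEU-4 for a single generated caption can be computed with NLTK's implementation (assuming NLTK is installed); a smoothing function is used because short captions often have no matching higher-order n-grams.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a man is riding a skateboard down a ramp".split(),
    "a skateboarder performs a trick on a ramp".split(),
]
candidate = "a man riding a skateboard on a ramp".split()

# Equal weights over 1-grams to 4-grams yield BLEU-4.
score = sentence_bleu(
    references,
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {score:.3f}")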

6.2.2 ROUGE (Recall-Oriented Understudy for Gisting Evaluation)


ROUGE [85] is a set of metrics that evaluate the quality of text summarization. It determines the quality of a summary by comparing it to ideal human-created summaries: the number of overlapping units, such as n-grams, word sequences, and word pairs, between the machine-generated summary and the ideal summaries is counted. Multiple measures are introduced: ROUGE-N (which counts the overlap of n-grams between the machine-generated summary and the ideal summary; ROUGE-1 and ROUGE-2 are instances of ROUGE-N), ROUGE-L (which is based on longest-common-subsequence statistics and therefore naturally captures sentence-level structural similarity by finding the longest in-sequence co-occurring word sequence), ROUGE-W (which is based on a weighted longest common subsequence that favors consecutive matches), ROUGE-S (which is based on skip-bigram co-occurrence, with a skip-bigram being any pair of words in their sentence order), and ROUGE-SU (which is based on skip-bigram plus unigram co-occurrence), with each being suited to a specific application. ROUGE does not perform well for evaluating multi-document summaries.
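As a simplified example of the idea behind ROUGE-L, the function below computes the longest common subsequence (LCS) between a candidate and a reference and derives precision, recall, and the F-measure from it; stemming, stopword handling, and the weighted variant are intentionally left out.

def rouge_l(candidate: str, reference: str, beta: float = 1.2):
    # Simplified ROUGE-L based on the longest common subsequence.
    c, r = candidate.split(), reference.split()

    # Dynamic-programming table for the LCS length.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i in range(1, len(c) + 1):
        for j in range(1, len(r) + 1):
            if c[i - 1] == r[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]

    precision = lcs / len(c) if c else 0.0
    recall = lcs / len(r) if r else 0.0
    if precision == 0 or recall == 0:
        return 0.0
    return ((1 + beta ** 2) * precision * recall) / (recall + beta ** 2 * precision)

print(rouge_l("a man rides a skateboard", "a man is riding a skateboard"))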

6.2.3 METEOR (Metric for Evaluation of Translation with Explicit Ordering)


This metric [11] compares word segments against reference texts. This method is based on the harmonic mean of
unigram precision and recall (recall is weighted higher than precision). METEOR has features such as stemming and
synonymy matching in addition to the standard exact word matching. Compared to BLEU, this metric correlates better with human judgment at the sentence or segment level.

6.2.4 CIDEr (Consensus-based Image Description Evaluation)


This metric [129] is explicitly designed for evaluating image captions and descriptions. In contrast to other metrics that work with only five captions per image (which makes them unsuitable for evaluating the consensus between generated captions and human judgments), CIDEr measures this consensus using term-frequency inverse document frequency (TF-IDF) weighting. CIDEr automatically computes how well a candidate caption agrees with the consensus of a set of human-written reference captions. A measure
of consensus encodes how often n-grams in the candidate sentence are present in the reference sentences. Also, n-grams
not present in the reference sentences must not exist in the candidate sentences. Furthermore, lower weight must be
given to n-grams frequently appearing across all images in the dataset since they are likely to contain less information.
To encode this, Vedantam et al. [129] performed a Term-Frequency Inverse Document Frequency (TF-IDF) weighting
for each n-gram. A version of CIDEr called CIDEr-D exists as a part of the Microsoft COCO evaluation server.
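The heavily simplified sketch below illustrates the TF-IDF weighting at the core of CIDEr: candidate and reference captions are mapped to TF-IDF vectors over their n-grams and compared with cosine similarity. The toy corpus, the restriction to unigrams, and the omission of stemming, of the average over n = 1 to 4, and of the CIDEr-D length penalty are all simplifying assumptions.

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_vector(tokens, n, doc_freq, num_images):
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values()) or 1
    # Term frequency times inverse document frequency over the reference corpus.
    return {g: (c / total) * math.log(num_images / max(doc_freq.get(g, 1), 1))
            for g, c in counts.items()}

def cosine(u, v):
    dot = sum(u[g] * v.get(g, 0.0) for g in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Document frequencies are counted over the reference captions of all images
# (here: a toy corpus of two images, unigrams only).
corpus = {
    "img1": ["a man rides a skateboard".split()],
    "img2": ["a dog sits inside a car".split()],
}
n = 1
doc_freq = Counter(g for refs in corpus.values()
                   for g in set().union(*[set(ngrams(r, n)) for r in refs]))

candidate = "a man riding a skateboard".split()
refs = corpus["img1"]
cand_vec = tfidf_vector(candidate, n, doc_freq, len(corpus))
score = sum(cosine(cand_vec, tfidf_vector(r, n, doc_freq, len(corpus)))
            for r in refs) / len(refs)
print(f"simplified CIDEr-style score: {score:.3f}")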

6.2.5 SPICE (Semantic Propositional Image Caption Evaluation)


The SPICE metric [6] is a metric for evaluating image captions based on semantic context. This metric measures how
well objects, attributes, and the relations between them are covered in image captions. A scene graph is used to extract
the names of different objects, attributes, and the relationships between them from image captions. The metric utilizes
semantic representations produced by this graph.
The discussed metrics are still far from human judgment in terms of quality due to various factors. Using external
knowledge databases along with evaluation metrics can help improve evaluation quality.

6.3 Performance Comparison Based on MSCOCO Test Servers

Microsoft COCO has presented an online server for testing and evaluation purposes to enable a fairer and more uniform
testing platform. In Tables 2 and 3, the results from the research works which have used the Microsoft COCO test
servers have been listed. The Microsoft COCO test servers report two numbers for each evaluation metric: c5 and c40.
c5 is computed using five reference captions, and c40 is computed using 40. The best performances are highlighted
with boldface font.
According to the results collected in Tables 2 and 3, the model proposed by Nguyen et al. [100] has achieved the best c5 and
c40 results in most metrics. The model proposed by Aneja et al. [8] (a convolutional-network based model) has achieved
the lowest results in both c5 and c40 for almost all metrics. The model designed by Guo et al. [51] has achieved the
worst results in BLEU-2 c5 and BLEU-3 c5. Achieving worse results is natural, considering that the model operates


Table 2: The reported results obtained from Microsoft COCO servers - Top 10 methods (B:BLEU [106])
Reference B-1 B-2 B-3 B-4
c5 c40 c5 c40 c5 c40 c5 c40
Nguyen et al. [100] 84.1 97.6 69.4 93.5 54.9 86.3 42.5 76.8
Hu et al. [60] 83.3 96.9 68.8 92.6 54.4 85.0 42.1 75.3
Anderson et al. [7] 80.2 95.2 64.1 88.8 49.1 79.4 36.9 68.5
Chen et al. [16] 81.9 95.6 66.3 90.1 51.7 81.7 39.6 71.5
Cornia et al. [24] 81.6 96.0 66.4 90.8 51.8 82.7 39.7 72.8
Jiang et al. [67] 80.4 95.0 64.9 89.3 50.1 80.1 38.0 69.2
Li et al. [84] 83.3 96.8 68.6 92.3 54.2 84.5 42 74.7
Liu et al. [88] 80.1 94.6 64.7 88.9 50.2 80.4 38.5 70.3
Pan et al. [105] (X-LAN) 81.4 95.7 66.5 90.5 52.0 82.4 40.0 72.4
Pan et al. [105] (X-Transformer) 81.9 95.7 66.9 90.5 52.4 82.5 40.3 72.4
Yang et al. [145] - - - - - - 38.5 69.7
Yao et al. [149] - - 65.5 89.3 50.8 80.3 38.7 69.7
Zeng et al. [153] 82.2 96.5 67 91.4 52.4 83.3 40.1 73.5

Table 3: The reported results obtained from Microsoft COCO servers (M:METEOR [11], R:ROUGE [85], C:CIDEr
[129], S:SPICE [6])
Reference R M C S
c5 c40 c5 c40 c5 c40 c5 c40
Nguyen et al. [100] 61.2 77.1 30.9 41.0 141.3 143.8 - -
Hu et al. [60] 60.8 76.4 30.4 40.1 138.5 140.8 - -
Anderson et al. [7] 57.1 72.4 27.6 36.7 117.9 120.5 21.5 71.5
Chen et al. [16] 59.0 74.4 28.7 38.2 123.1 124.3 - -
Cornia et al. [24] 59.2 74.8 29.4 39.0 129.3 132.1 - -
Huang et al. [65] 58.9 74.5 29.1 38.5 126.9 129.6 - -
Jiang et al. [67] 58.2 73.1 28.2 37.2 122.9 125.1 - -
Li et al. [84] 60.6 76.4 30.4 40.1 136.7 138.3 - -
Liu et al. [88] 58.3 73.8 28.6 37.9 123.3 125.6 - -
Pan et al. [105] (X-LAN) 59.5 75.2 29.7 39.3 130.2 132.8 - -
Pan et al. [105] (X-Transformer) 59.5 75.0 29.6 39.2 131.1 133.5 - -
Yang et al. [145] 58.6 73.6 28.2 37.2 123.8 126.5 - -
Yao et al. [149] 58.5 73.4 28.5 37.6 125.3 126.5 - -
Zeng et al. [153] 59.5 75 29.6 39.3 132.6 135 - -

with unpaired text, as there is no total consistency between image and text. The model proposed by Wang et al. [134]
falls at the lower part of the performance list in most metrics as well, which shows that convolutional-network based
methods still need much more improvement to reach the performance of the other methods, such as the attention-based
ones. Still, the benefits and advantages of convolutional-network based methods, such as a more straightforward
training process (discussed more thoroughly in "Convolutional-network based methods," section 3.4), encourage further
research in this field.

6.4 Comparing Independent Results

Many research works have reported their results independently, as well as the results reported by the Microsoft COCO
servers. A large number of these research works have used the code publicly available in [86] to evaluate their
performance. This codebase evaluates a model's performance using the BLEU [106], METEOR [11], ROUGE [85], CIDEr
[129] and SPICE [6] metrics.
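For reference, the sketch below shows how this codebase is commonly invoked, assuming its pip packaging (pycocoevalcap) is installed; the annotation and result file names are placeholders, and the result file is expected to follow the COCO result format of [{"image_id": ..., "caption": ...}] entries.

from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Placeholder file names: ground-truth annotations and model-generated captions.
coco = COCO("annotations/captions_val2014.json")
coco_res = coco.loadRes("results/generated_captions.json")

coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params["image_id"] = coco_res.getImgIds()   # score only the captioned images
coco_eval.evaluate()

# Typically prints BLEU-1..4, METEOR, ROUGE_L, CIDEr, and (if configured) SPICE.
for metric, value in coco_eval.eval.items():
    print(f"{metric}: {value:.3f}")
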
In this section, we list the results reported independently (not obtained by Microsoft COCO servers) by the works
covered in this survey in Tables 4, 5, 6, and 7. We have listed the best results for the research works that reported
results under different settings (for example, optimization using different loss functions). The best performances are
highlighted with boldface font.


Table 4: The independent results - Top 10 methods - BLEU-1 and BLEU-2 (B:BLEU [106], Ref: Reference)
B-1 B-2
Ref Score Ref Score
Zhong et al.[157] 90.7 Li et al.[84] 69.1
Nguyen et al.[100] 84.2 Pan et al.[105] 66.8
Hu et al. [60] 83.5 Liu et al.[91] 66.6
Li et al.[84] 83.5 Jiang et al.[67] 64.7
Yang et al.[146] 82.3 Li et al.[80] 63.2
Cornia et al.[24] 82.0 Liu et al.[92] 63.1
Liu et al.[91] 81.7 Gu et al.[47] 62.5
Pan et al.[105] 81.7 Chen et al.[18] 60.7
Huang et al.[65] 81.6 Wang et al.[133] 60.3
Li et al.[77] 81.5 Aneja et al.[8] 55.3

Table 5: The independent results - Top 10 methods (B:BLEU [106], Ref: Reference)
B-3 B-4
Ref Score Ref Score
Li et al.[84] 54.9 Zhong et al.[157] 59.3
Pan et al.[105] 52.6 Li et al. [76] 46.5
Liu et al.[91] 52.2 Li et al.[84] 42.9
Jiang et al.[67] 50.0 Hu et al.[60] 42.7
Li et al.[80] 48.3 Nguyen et al. [100] 42.4
Liu et al.[92] 48.0 Li et al.[81] 41.7
Gu et al.[47] 47.9 Pan et al.[105] 40.7
Wang et al.[133] 46.5 Cornia et al.[24] 40.5
Chen et al.[18] 46.2 Liu et al. [90] 40.4
Aneja et al.[8] 41.8 Huang et al.[65] 40.2

Table 6: The independent results - Top 10 methods ( M:METEOR [11], R:ROUGE [85], Ref: Reference)
M R
Ref Score Ref Score
Zhong et al.[157] 40.1 Zhong et al.[157] 71.5
Li et al.[76] 32.0 Hu et al.[60] 61.1
Liu et al.[90] 31.4 Li et al.[84] 61.0
Hu et al.[62] 31.4 Nguyen et al.[100] 60.7
Li et al.[84] 30.8 Fang et al.[32] 60.1
Nguyen et al.[100] 30.6 Barraco et al.[12] 59.9
Hu et al.[60] 30.6 Yang et al.[146] 59.8
Li et al.[81] 30.6 Pan et al.[105] 59.7
Fang et al.[32] 30.1 Cornia et al.[24] 59.5
Barraco et al.[12] 30.0 Liu et al.[91] 59.4

Among the research works covered in this survey paper, [157] (which introduces a sub-graph proposal network along
with an attention-based LSTM decoder) has had the best results in BLEU-1 (90.7) and BLEU-4 (59.3), as well as
METEOR [11] (40.1), and ROUGE [85] (71.5), while [84] (COS-Net, a model which uses CLIP image and text encoder
as a cross-modal retrieval model) has had the best results in BLEU-2 (69.1) and BLEU-3 (54.9). Also, [23] (Show,
Control and Tell) has achieved the best CIDEr and SPICE results (209.7 and 48.5, respectively).
A recurring pattern among the best-performing methods presented in Tables 2 and 3 is the application of Transformers,
scene graphs, and vision language pre-training methods [12, 18, 19, 24, 32, 54, 62, 65, 77, 80, 81, 84, 91, 105, 146,
145, 153, 157]. These methods owe their performance to the capabilities of these three techniques. Transformers are capable of capturing complex relationships between objects and their
surroundings, making them particularly effective in handling long-range dependencies in image sequences. Scene
graphs, on the other hand, represent the relationships between objects within an image and allow for efficient inference


Table 7: The independent results - Top 10 methods (C:CIDEr [129], S:SPICE [6], Ref: Reference)
C S
Ref Score Ref Score
Cornia et al.[23] 209.7 Cornia et al.[23] 48.5
Chen et al.[19] 204.2 Chen et al.[19] 42.1
Zhong et al.[157] 166.7 Zhong et al.[157] 30.1
Li et al.[76] 155.1 Li et al.[76] 26.0
Hu et al.[62] 145.5 Hu et al.[62] 25.5
Nguyen et al.[100] 144.2 Hu et al.[60] 24.7
Hu et al.[60] 143.7 Li et al.[84] 24.7
Li et al.[84] 143.0 Li et al.[81] 24.5
Li et al.[81] 140.0 Liu et al. [90] 24.4
Barraco et al.[12] 139.4 Nguyen et al.[100] 24.3

of the visual content. Another desirable feature of graphs is their ability to represent composite and unstructured data
types, as they provide a flexible and efficient way to model the complex relationships and interconnections between
various entities within a system. In addition to Transformers and scene graphs, some of the high-performing image
captioning methods in Tables 2 and 3 also utilize vision language pre-training techniques [12, 62, 81, 84]. These
methods involve training a model on large datasets that consist of both visual and textual information, allowing the
model to learn a joint embedding space. By pre-training on such datasets and acquiring knowledge from multiple
modalities, the model can effectively learn to understand visual content and generate natural language descriptions. The
integration of these techniques in captioning models has led to a notable improvement in their overall performance, as
evidenced by the results presented in Tables 2 and 3.
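As a small illustration of why graphs are such a convenient intermediate representation for captioning, a scene graph can be stored as a set of (subject, relation, object) triplets, for example with networkx; the objects, attributes, and relations below are made up.

import networkx as nx

# A toy scene graph: nodes are detected objects, edges carry relations.
scene = nx.DiGraph()
scene.add_edge("man", "skateboard", relation="riding")
scene.add_edge("man", "helmet", relation="wearing")
scene.add_edge("skateboard", "ramp", relation="on")

# Attributes can be attached directly to the object nodes.
scene.nodes["helmet"]["attribute"] = "red"

# Triplets of this form are what a caption decoder can be conditioned on.
for subj, obj, data in scene.edges(data=True):
    print(f"({subj}, {data['relation']}, {obj})")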

7 Challenges and the Future Directions

Despite the abundance of solutions and methods presented to solve the image captioning problem, some challenges
and open problems remain. The performance of the supervised methods relies significantly on the quality of the
datasets. However, datasets cannot cover the real world regardless of how massive they are, and the applicability of
supervised methods is limited to the set of objects the detector is trained to distinguish. On the other hand, datasets
with image-caption pairs inevitably contain more examples of a specific situation (one example being: "man riding
a skateboard"). These examples in the training data falsely bias the model towards generating more captions similar
to those examples rather than including actual detected objects [80]. The supervised paradigm overly relies on the
language priors, which can lead to the object-hallucination phenomenon as well [84].
The problems associated with the supervised methods encourage researchers to devise unsupervised techniques. On the
other hand, due to the different properties of image and text modalities, the encoders of image and sentence cannot be
shared. Therefore, the critical challenge in an unpaired setting is the information misalignment between images and sentences [49]. Current unsupervised image captioning methods still lag behind in performance rankings.
One promising direction of research is using scene graphs for image captioning. However, despite the many possibilities
unveiled by scene graphs, discussed extensively in the previous sections, utilizing them comes with challenges.
Constructing scene graphs is a complicated task in itself, and due to the interactions between objects being beyond
simple pairwise relations, integrating scene graphs is quite tedious [141]. Also, scene graph parsers are still not powerful enough [145, 132]. According to some of the works that studied the impact of scene graphs on the quality of the
captions, scene graphs are effective only if pre-training of the scene graph generators is done with visually relevant
relation data [75].
VLP methods have been used to resolve some of the flaws with supervised methods and object detector-based designs.
However, most VLP approaches are catered to understanding tasks, and generation tasks such as image captioning
demand more capabilities. A number of the recent works covered in this paper have aimed to fulfill this need; however, this field needs more investigation and analysis. Moreover, detector-free designs are rising in popularity. In these
designs, the detector is removed for the vision-language pre-training in an end-to-end fashion [32]. Also, a general
visual encoder replaces the detector and is used to produce grid features for later cross-modal fusion. However, the
construction of a stronger detector-free image captioning model still needs investigation. Despite the challenges faced
when working with scene graphs, vision-language pre-training methods, and Transformers, almost every one of the best-performing models according to evaluation metrics uses one or a combination of these techniques, as shown in section 6.4. This further proves the potential of these techniques in solving the image captioning problem and makes them promising tools for the future of generative tasks. Specifically, graphs are valuable in representing complex relationships
and interconnections between different entities, particularly for composite, semi-structured, and unstructured data that
may not be easily handled by other types of data models. Considering the recent advancements in generative artificial
intelligence such as large language models (LLMs) [103] and multimodal language models (MLLMs) [102, 66, 5, 78],
the need for representation methods capable of handling such data types will become increasingly visible in the near future.
Another gap in the literature is the lack of focus on the application of image captioning for the visually impaired.
Describing images can be the core of a vision assistant designed to aid the visually impaired in their daily lives: one can
be informed of potential dangers in their environment and have a general understanding of what is happening around
them. Considering the issues and gaps mentioned earlier, unsupervised learning and the unpaired setting hold great potential. Also, the graph-based approach is expected to become even more popular in the near future. LLMs, MLLMs,
and Transformers in combination with vision-language pre-training methods are also very likely to become standard
practice.

8 Conclusion

This paper has covered recent image captioning methods, provided a taxonomy of the approaches, and mentioned
their features and properties. We also discussed the common problems in image captioning, reviewed datasets and
evaluation metrics, compared the performance of the covered methods and algorithms in terms of experimental results,
and highlighted the challenges and future directions in image captioning. Despite the numerous methods and solutions
presented for the image captioning problem, there are still some major problems and challenges for which few solutions
have been suggested. On the other hand, the generated captions still need to be higher in quality and are far from
human-generated captions. Also, the datasets cannot cover the infinite real world. The evaluation metrics still need to
be improved and are still not ideal for evaluating the precise performance of the models. However, Vision-Language
Pre-Training (VLP) methods are frequently used in recent works and have shown promising performance. VLP methods
and Transformers are likely to be inseparable components of models in the future of image captioning.
Moreover, more research needs to be done on visual assistants for visually impaired individuals. Preparing such an
assistant requires certain features to be implemented, making it different from the other applications of image captioning.
The best models presented by the research works do not perform well as visual assistants and do not consider the
specific demands and needs of visually impaired people. A proper caption for a visually impaired person includes the
most important aspects of the image first and the other noticeable details afterward. The surroundings and finer details
must also be described, such as details about the textures and the position of objects relative to each other. Therefore, a
caption appropriate for the needs of visually impaired individuals is denser and contains much more detail compared to
the captions generated by conventional methods and models. Also, the caption generation process may be altered in a
way that the initial caption provided to the user can be more general and shorter. The caption may become denser and
more detailed upon the user asking more questions about the image. Considering the importance of the aforementioned
issues and the growing number of visually impaired individuals, a noticeable lack of an efficient solution remains.
Valuable research work in this field would be automatic image captioning with a particular focus on creating a visual
assistant for visually impaired individuals.

References
[1] H. Agrawal, K. Desai, Y. Wang, X. Chen, R. Jain, M. Johnson, D. Batra, D. Parikh, S. Lee, and P. Anderson.
nocaps: novel object captioning at scale. In 2019 IEEE/CVF International Conference on Computer Vision
(ICCV), pages 8948–8957, Manhattan, New York, U.S., 2019. IEEE. doi:10.1109/ICCV.2019.00904. URL
http://doi.org/10.1109/ICCV.2019.00904.
[2] H. Ahsan, D. Bhatt, K. Shah, and N. Bhalla. Multi-modal image captioning for the visually impaired. In
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational
Linguistics: Student Research Workshop, pages 53–60, Stroudsburg, PA, USA, jun 2021. Association for
Computational Linguistics. doi:10.18653/v1/2021.naacl-srw.8. URL http://doi.org/10.18653/v1/2021.
naacl-srw.8.
[3] R. Al Sobbahi and J. Tekli. Comparing deep learning models for low-light natural scene image enhancement
and their impact on object detection and classification: Overview, empirical evaluation, and challenges. Signal
Processing: Image Communication, page 116848, 2022.


[4] R. Al Sobbahi and J. Tekli. Low-light image enhancement using image-to-frequency filter learning. In
Image Analysis and Processing–ICIAP 2022: 21st International Conference, Lecce, Italy, May 23–27, 2022,
Proceedings, Part II, pages 693–705. Springer, 2022.
[5] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds,
et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing
Systems, 35:23716–23736, 2022.
[6] P. Anderson, B. Fernando, M. Johnson, and S. Gould. Spice: Semantic propositional image caption evaluation. In
B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Computer Vision – ECCV 2016, pages 382–398, Manhattan,
New York, USA, 2016. Springer International Publishing. ISBN 978-3-319-46454-1. doi:10.1007/978-3-319-
46454-1_24. URL https://doi.org/10.1007/978-3-319-46454-1_24.
[7] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down
attention for image captioning and visual question answering. In 2018 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pages 6077–6086, Los Alamitos, CA, USA, jun 2018. IEEE Computer
Society. doi:10.1109/CVPR.2018.00636. URL https://doi.ieeecomputersociety.org/10.1109/CVPR.
2018.00636.
[8] J. Aneja, A. Deshpande, and A. G. Schwing. Convolutional image captioning. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 5561–5570, Manhattan, New York, U.S., June
2018. IEEE. doi:10.1109/CVPR.2018.00583. URL http://doi.org/10.1109/CVPR.2018.00583.
[9] H. Ayesha, S. Iqbal, M. Tariq, M. Abrar, M. Sanaullah, I. Abbas, A. Rehman, M. F. K. Niazi, and S. Hussain.
Automatic medical image interpretation: State of the art and future directions. Pattern Recognition, 114, June
2021. ISSN 0031-3203. doi:10.1016/j.patcog.2021.107856. URL https://doi.org/10.1016/j.patcog.
2021.107856.
[10] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In
Y. Bengio and Y. LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San
Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[11] S. Banerjee and A. Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human
judgments. In J. Goldstein, A. Lavie, C.-Y. Lin, and C. Voss, editors, Proceedings of the acl workshop on intrinsic
and extrinsic evaluation measures for machine translation and/or summarization, volume 29, pages 65–72,
Stroudsburg, PA, USA, 2005. Association for Computational Linguistics. URL https://aclanthology.org/
W05-09.
[12] M. Barraco, M. Cornia, S. Cascianelli, L. Baraldi, and R. Cucchiara. The unreasonable effectiveness of
clip features for image captioning: An experimental analysis. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 4662–4670, Los Alamitos, CA, USA, jun 2022. IEEE
Computer Society. doi:10.1109/CVPRW56347.2022.00512. URL https://doi.ieeecomputersociety.
org/10.1109/CVPRW56347.2022.00512.
[13] J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, R. Miller, A. Tatarowicz, B. White, S. White,
et al. Vizwiz: nearly real-time answers to visual questions. In Proceedings of the 23nd Annual ACM Symposium
on User Interface Software and Technology, UIST ’10, pages 333–342, New York, NY, USA, 2010. Association
for Computing Machinery. ISBN 9781450302715. doi:10.1145/1866029.1866080. URL https://doi.org/
10.1145/1866029.1866080.
[14] J. A. Bondy and U. S. R. Murty. Graph theory with applications, volume 290. North-Holland, Amsterdam,
Netherlands, 1976. ISBN 0-444-19451-7.
[15] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko. End-to-end object detection with
transformers. In European conference on computer vision, pages 213–229, Cham, 2020. Springer, Springer
International Publishing.
[16] C. Chen, S. Mu, W. Xiao, Z. Ye, L. Wu, and Q. Ju. Improving image captioning with conditional generative
adversarial nets. 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of
Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial
Intelligence, EAAI 2019, 33(01):8142–8150, July 2019. ISSN 2159-5399. doi:10.1609/aaai.v33i01.33018142.
URL https://ojs.aaai.org/index.php/AAAI/article/view/4823.
[17] H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, C. Xu, and W. Gao. Pre-trained image processing
transformer. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12294–
12305, Los Alamitos, CA, USA, jun 2021. IEEE Computer Society. doi:10.1109/CVPR46437.2021.01212. URL
https://doi.ieeecomputersociety.org/10.1109/CVPR46437.2021.01212.


[18] H. Chen, Y. Wang, X. Yang, and J. Li. Captioning transformer with scene graph guiding. In
2021 IEEE international conference on image processing (ICIP), pages 2538–2542. IEEE, 2021.
doi:10.1109/ICIP42928.2021.9506193. URL https://doi.org/10.1109/ICIP42928.2021.9506193.
[19] S. Chen, Q. Jin, P. Wang, and Q. Wu. Say as you wish: Fine-grained control of image caption generation
with abstract scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 9962–9971, Manhattan, New York, U.S., 2020. IEEE. doi:10.1109/CVPR42600.2020.00998.
URL http://doi.org/10.1109/CVPR42600.2020.00998.
[20] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learn-
ing phrase representations using rnn encoder-decoder for statistical machine translation. In Proceedings of
the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), page 1724–1734,
Stroudsburg, PA, USA, 2014. Association for Computational Linguistics. doi:10.3115/v1/d14-1179. URL
https://aclanthology.org/D14-1179.
[21] M. Chohan, A. Khan, M. S. Mahar, S. Hassan, A. Ghafoor, and M. Khan. Image captioning using deep learning:
A systematic literature review. International Journal of Advanced Computer Science and Applications, 11(5),
2020. doi:10.14569/IJACSA.2020.0110537. URL http://dx.doi.org/10.14569/IJACSA.2020.0110537.
[22] E. Coppock, D. Dionne, N. Graham, E. Ganem, S. Zhao, S. Lin, W. Liu, and D. Wijaya. Informativity in image
captions vs. referring expressions. In Proceedings of the Probability and Meaning Conference (PaM 2020),
pages 104–108, 2020.
[23] M. Cornia, L. Baraldi, and R. Cucchiara. Show, control and tell: A framework for generating controllable
and grounded captions. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
pages 8299–8308, Manhattan, New York, U.S., 2019. IEEE. doi:10.1109/CVPR.2019.00850. URL http:
//doi.org/10.1109/CVPR.2019.00850.
[24] M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara. Meshed-memory transformer for image captioning. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10578–10587,
Manhattan, New York, U.S., 2020. IEEE. doi:10.1109/CVPR42600.2020.01059. URL http://doi.org/10.
1109/CVPR42600.2020.01059.
[25] B. Dai, Y. Zhang, and D. Lin. Detecting visual relationships with deep relational networks. In Proceedings of the
IEEE conference on computer vision and Pattern recognition, pages 3298–3308, Manhattan, New York, U.S.,
2017. IEEE. doi:10.1109/CVPR.2017.352. URL http://doi.org/10.1109/CVPR.2017.352.
[26] A. Das, H. Agrawal, L. Zitnick, D. Parikh, and D. Batra. Human attention in visual question answering: Do
humans and deep networks look at the same regions? Computer Vision and Image Understanding, 163:90–100,
Oct. 2017. ISSN 1077-3142. doi:10.1016/j.cviu.2017.10.001. URL http://doi.org/10.1016/j.cviu.
2017.10.001. Language in Vision.
[27] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image
database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, Manhattan,
New York, U.S., 2009. IEEE. doi:10.1109/CVPR.2009.5206848. URL http://doi.org/10.1109/CVPR.
2009.5206848.
[28] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for
language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–
4186, Stroudsburg, PA, USA, 2019. Association for Computational Linguistics. doi:10.18653/v1/N19-1423.
URL http://doi.org/10.18653/v1/N19-1423.
[29] P. Dognin, I. Melnyk, Y. Mroueh, I. Padhi, M. Rigotti, J. Ross, Y. Schiff, R. A. Young, and B. Belgodere.
Image captioning as an assistive technology: Lessons learned from vizwiz 2020 challenge. arXiv preprint
arXiv:2012.11696, 2021.
[30] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer,
G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for
image recognition at scale. In International Conference on Learning Representations, September 2021. URL
https://openreview.net/forum?id=YicbFdNTTy.
[31] A. Elhagry and K. Kadaoui. A thorough review on recent deep learning methodologies for image captioning.
2021. doi:10.48550/ARXIV.2107.13114. URL https://arxiv.org/abs/2107.13114v1.
[32] Z. Fang, J. Wang, X. Hu, L. Liang, Z. Gan, L. Wang, Y. Yang, and Z. Liu. Injecting semantic con-
cepts into end-to-end image captioning. In 2022 IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 17988–17998, Los Alamitos, CA, USA, jun 2022. IEEE Computer Society. doi:10.1109/CVPR52688.2022.01748. URL https://doi.ieeecomputersociety.org/10.1109/CVPR52688.2022.01748.
[33] Y. Feng, L. Ma, W. Liu, and J. Luo. Unsupervised image captioning. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), volume 2019-June, pages 4125–4134, Manhattan, New
York, U.S., June 2019. IEEE. doi:10.1109/CVPR.2019.00425. URL http://doi.org/10.1109/CVPR.2019.
00425.
[34] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling
for visual question answering and visual grounding. In Proceedings of the 2016 Conference on Empirical
Methods in Natural Language Processing, pages 457—-468, Stroudsburg, PA, USA, 2016. Association for
Computational Linguistics. doi:10.18653/v1/D16-1044. URL http://doi.org/10.18653/v1/D16-1044.
[35] C. Gan, Z. Gan, X. He, J. Gao, and L. Deng. Stylenet: Generating attractive visual captions with
styles. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 955–964,
Los Alamitos, CA, USA, July 2017. IEEE Computer Society. doi:10.1109/CVPR.2017.108. URL https:
//doi.ieeecomputersociety.org/10.1109/CVPR.2017.108.
[36] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng. Semantic compositional networks
for visual captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages
1141–1150, Manhattan, New York, USA, 2017. IEEE. doi:10.1109/CVPR.2017.127. URL http://doi.org/
10.1109/CVPR.2017.127.
[37] L. Gao, B. Wang, and W. Wang. Image captioning with scene-graph based semantic concepts. In Pro-
ceedings of the 2018 10th International Conference on Machine Learning and Computing, ICMLC 2018,
pages 225–229, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450363532.
doi:10.1145/3195106.3195114. URL https://doi.org/10.1145/3195106.3195114.
[38] J. Gehring, M. Auli, D. Grangier, and Y. N. Dauphin. A convolutional encoder model for neural machine
translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), pages 123–135, Stroudsburg, PA, USA, July 2017. Association for Computational Linguistics.
doi:10.18653/v1/P17-1012. URL https://doi.org/10.18653/v1/P17-1012.
[39] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional sequence to sequence learning.
In International Conference on Machine Learning, pages 1243–1252. PMLR, 2017.
[40] F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural
Computation, 12(10):2451–2471, 10 2000. ISSN 0899-7667. doi:10.1162/089976600300015015. URL
https://doi.org/10.1162/089976600300015015.
[41] R. Girshick. Fast r-cnn. In 2015 IEEE International Conference on Computer Vision (ICCV), volume 1, pages
1440–1448, Los Alamitos, CA, USA, dec 2015. IEEE Computer Society. doi:10.1109/ICCV.2015.169. URL
https://doi.ieeecomputersociety.org/10.1109/ICCV.2015.169.
[42] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and
semantic segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1,
pages 580–587, Los Alamitos, CA, USA, jun 2014. IEEE Computer Society. doi:10.1109/CVPR.2014.81. URL
https://doi.ieeecomputersociety.org/10.1109/CVPR.2014.81.
[43] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning (adaptive computation and machine learning series).
The MIT Press, Cambridge, MA, USA, 2016. ISBN 978-0262035613.
[44] Google. Open images dataset v6 + extensions, August 2021. URL https://storage.googleapis.com/
openimages/web/index.html.
[45] K. Greff, R. K. Srivastava, J. Koutnik, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey.
IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222–2232, oct 2017. ISSN 2162-2388.
doi:10.1109/tnnls.2016.2582924. URL https://doi.org/10.1109/TNNLS.2016.2582924.
[46] N. Gruber and A. Jockisch. Are gru cells more specific and lstm cells more sensitive in motive classification
of text? Frontiers in artificial intelligence, 3:40, 2020. ISSN https://portal.issn.org/resource/ISSN/26248212.
doi:10.3389/frai.2020.00040. URL https://dx.doi.org/10.3389/frai.2020.00040.
[47] J. Gu, J. Cai, G. Wang, and T. Chen. Stack-captioning: Coarse-to-fine learning for image captioning. In 32nd
AAAI Conference on Artificial Intelligence, AAAI 2018, pages 6837–6845, Palo Alto, California USA, 2018.
AAAI Press.
[48] J. Gu, S. Joty, J. Cai, and G. Wang. Unpaired image captioning by language pivoting. In V. Ferrari, M. Hebert,
C. Sminchisescu, and Y. Weiss, editors, Computer Vision – ECCV 2018, pages 519–535, Manhattan, New York, USA, 2018. Springer International Publishing. ISBN 978-3-030-01246-5. doi:10.1007/978-3-030-01246-5_31. URL http://doi.org/10.1007/978-3-030-01246-5_31.
[49] J. Gu, S. Joty, J. Cai, H. Zhao, X. Yang, and G. Wang. Unpaired image captioning via scene graph alignments.
In Proceedings of the IEEE/CVF International Conference on Computer Vision, volume 2019-October, pages
10323–10332, Manhattan, New York, U.S., 2019. IEEE. doi:10.1109/ICCV.2019.01042. URL http://doi.
org/10.1109/ICCV.2019.01042.
[50] J. Gu, H. Zhao, Z. Lin, S. Li, J. Cai, and M. Ling. Scene graph generation with external knowledge and
image reconstruction. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
pages 1969–1978, Manhattan, New York, U.S., 2019. IEEE. doi:10.1109/CVPR.2019.00207. URL http:
//doi.org/10.1109/CVPR.2019.00207.
[51] L. Guo, J. Liu, P. Yao, J. Li, and H. Lu. Mscap: Multi-style image captioning with unpaired stylized text.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, volume 2019-
June, pages 4204–4213, Manhattan, New York, U.S., 2019. IEEE. doi:10.1109/CVPR.2019.00433. URL
http://doi.org/10.1109/CVPR.2019.00433.
[52] D. Gurari, Y. Zhao, M. Zhang, and N. Bhattacharya. Captioning images taken by people who are blind. In
A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, editors, Computer Vision – ECCV 2020, pages 417–434, Man-
hattan, New York, USA, 2020. Springer International Publishing. ISBN 978-3-030-58520-4. doi:10.1007/978-3-
030-58520-4_25. URL http://doi.org/10.1007/978-3-030-58520-4_25.
[53] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pages 770–778, Manhattan, New York, U.S., 2016. IEEE.
doi:10.1109/CVPR.2016.90. URL http://doi.org/10.1109/CVPR.2016.90.
[54] S. He, W. Liao, H. R. Tavakoli, M. Yang, B. Rosenhahn, and N. Pugeault. Image captioning through image
transformer. In Proceedings of the Asian Conference on Computer Vision (ACCV), page 153–169, Berlin,
Heidelberg, November 2020. Springer-Verlag. ISBN 978-3-030-69537-8. doi:10.1007/978-3-030-69538-5_10.
URL https://doi.org/10.1007/978-3-030-69538-5_10.
[55] S. Herdade, A. Kappeler, K. Boakye, and J. Soares. Image captioning: Transforming objects into
words. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors,
Advances in Neural Information Processing Systems, volume 32, pages 11137–11147, Red Hook, New
York, USA, 2019. Curran Associates, Inc. URL https://proceedings.neurips.cc/paper/2019/file/
680390c55bbd9ce416d1d69a9ab4760d-Paper.pdf.
[56] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, nov
1997. ISSN 0899-7667. doi:10.1162/neco.1997.9.8.1735. URL https://doi.org/10.1162/neco.1997.9.
8.1735.
[57] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, et al. Gradient flow in recurrent nets: the difficulty of
learning long-term dependencies. Wiley-IEEE Press, Piscataway, New Jersey, USA, 2001. ISBN 978-0-780-
35369-5. doi:10.1109/9780470544037.ch14.
[58] M. Z. Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga. A comprehensive survey of deep learning for
image captioning. ACM Comput. Surv., 51(6), Feb. 2019. ISSN 0360-0300. doi:10.1145/3295748. URL
https://doi.org/10.1145/3295748.
[59] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei. Relation networks for object detection. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 3588–3597, Manhattan, New York, USA, 2018.
IEEE. doi:10.1109/CVPR.2018.00378. URL http://doi.org/10.1109/CVPR.2018.00378.
[60] J. C. Hu, R. Cavicchioli, and A. Capotondi. Expansionnet v2: Block static expansion in fast end to end training
for image captioning. arXiv preprint arXiv:2208.06551, 2022.
[61] X. Hu, X. Yin, K. Lin, L. Zhang, J. Gao, L. Wang, and Z. Liu. Vivo: Visual vocabulary pre-training for novel
object captioning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(2):1575–1583, May 2021.
doi:10.1609/aaai.v35i2.16249. URL https://ojs.aaai.org/index.php/AAAI/article/view/16249.
[62] X. Hu, Z. Gan, J. Wang, Z. Yang, Z. Liu, Y. Lu, and L. Wang. Scaling up vision-language pretraining for image
captioning. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17959–
17968, Los Alamitos, CA, USA, jun 2022. IEEE Computer Society. doi:10.1109/CVPR52688.2022.01745. URL
https://doi.ieeecomputersociety.org/10.1109/CVPR52688.2022.01745.
[63] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4700–
4708, Manhattan, New York, U.S., 2017. IEEE Computer Society. doi:10.1109/CVPR.2017.243. URL http:
//doi.org/10.1109/CVPR.2017.243.


[64] J.-H. Huang, T.-W. Wu, and M. Worring. Contextualized keyword representations for multi-modal retinal
image captioning. In Proceedings of the 2021 International Conference on Multimedia Retrieval, ICMR ’21,
page 645–652, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450384636.
doi:10.1145/3460426.3463667. URL https://doi.org/10.1145/3460426.3463667.
[65] L. Huang, W. Wang, J. Chen, and X.-Y. Wei. Attention on attention for image captioning. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, volume 2019-October, pages 4634–4643, Manhattan,
New York, U.S., 2019. IEEE. doi:10.1109/ICCV.2019.00473. URL http://doi.org/10.1109/ICCV.2019.
00473.
[66] S. Huang, L. Dong, W. Wang, Y. Hao, S. Singhal, S. Ma, T. Lv, L. Cui, O. K. Mohammed, Q. Liu, et al. Language
is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023.
[67] W. Jiang, L. Ma, Y.-G. Jiang, W. Liu, and T. Zhang. Recurrent fusion network for image captioning. In Computer
Vision – ECCV 2018), pages 510–526, Manhattan, New York, USA, 2018. Springer International Publishing.
doi:10.1007/978-3-030-01216-8_31. URL http://doi.org/10.1007/978-3-030-01216-8_31.
[68] J. Johnson, A. Karpathy, and L. Fei-Fei. Densecap: Fully convolutional localization networks for dense
captioning. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4565–4574,
Manhattan, New York, U.S., 2016. IEEE. doi:10.1109/CVPR.2016.494. URL http://doi.org/10.1109/
CVPR.2016.494.
[69] M. I. Jordan. Chapter 25 - Serial Order: A Parallel Distributed Processing Approach, volume 121 of Advances in
Psychology, pages 471–495. North-Holland, Amsterdam, Netherlands, 1997. doi:10.1016/S0166-4115(97)80111-
2. URL https://doi.org/10.1016/S0166-4115(97)80111-2.
[70] R. Karim. Attn: Illustrated attention. attention in gifs and how it is used in. . . , Jan 2019. URL https:
//towardsdatascience.com/attn-illustrated-attention-5ec4ad276ee3.
[71] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 39:664–676, Apr. 2017. ISSN 01628828.
doi:10.1109/TPAMI.2016.2598339. URL http://doi.org/10.1109/TPAMI.2016.2598339.
[72] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International
Conference on Learning Representations (ICLR), 2017.
[73] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al.
Visual genome: Connecting language and vision using crowdsourced dense image annotations. International
journal of computer vision, 123(1):32–73, May 2017. ISSN 15731405. doi:10.1007/s11263-016-0981-7.
[74] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci,
A. Kolesnikov, T. Duerig, and V. Ferrari. The open images dataset v4: Unified image classification, object
detection, and visual relationship detection at scale. International Journal of Computer Vision, 128:1956–
1981, July 2020. ISSN 15731405. doi:10.1007/s11263-020-01316-z. URL http://doi.org/10.1007/
s11263-020-01316-z.
[75] K.-H. Lee, H. Palangi, X. Chen, H. Hu, and J. Gao. Learning visual relation priors for image-text matching
and image captioning with neural scene graph generators. arXiv preprint arXiv:1909.09953, 2019. URL
https://arxiv.org/abs/1909.09953.
[76] C. Li, H. Xu, J. Tian, W. Wang, M. Yan, B. Bi, J. Ye, H. Chen, G. Xu, Z. Cao, et al. mplug: Effective and
efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005, 2022.
[77] G. Li, L. Zhu, P. Liu, and Y. Yang. Entangled transformer for image captioning. In Proceedings of the IEEE/CVF
international conference on computer vision, pages 8928–8937, Los Alamitos, CA, USA, nov 2019. IEEE
Computer Society. doi:10.1109/ICCV.2019.00902. URL https://doi.ieeecomputersociety.org/10.
1109/ICCV.2019.00902.
[78] J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image
encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
[79] W. Li, Z. Qu, H. Song, P. Wang, and B. Xue. The traffic scene understanding and prediction based on image
captioning. IEEE Access, 9:1420–1427, Dec. 2020. ISSN 2169-3536. doi:10.1109/ACCESS.2020.3047091.
URL http://doi.org/10.1109/ACCESS.2020.3047091.
[80] X. Li and S. Jiang. Know more say less: Image captioning based on scene graphs. IEEE Transactions
on Multimedia, 21(8):2117–2130, Jan. 2019. ISSN 1941-0077. doi:10.1109/TMM.2019.2896516. URL
http://doi.org/10.1109/TMM.2019.2896516.

[81] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al. Oscar: Object-
semantics aligned pre-training for vision-language tasks. In A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm,
editors, Computer Vision – ECCV 2020, pages 121–137, Manhattan, New York, USA, 2020. Springer, Springer
International Publishing. ISBN 978-3-030-58577-8. doi:10.1007/978-3-030-58577-8_8. URL https://doi.
org/10.1007/978-3-030-58577-8_8.
[82] X. L. Li and P. Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint
arXiv:2101.00190, 2021. doi:10.48550/ARXIV.2101.00190. URL https://arxiv.org/abs/2101.00190.
[83] Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang. Scene graph generation from objects, phrases and region
captions. In Proceedings of the IEEE international conference on computer vision, pages 1270–1279, Manhattan,
New York, U.S., 2017. IEEE. doi:10.1109/ICCV.2017.142. URL http://doi.org/10.1109/ICCV.2017.
142.
[84] Y. Li, Y. Pan, T. Yao, and T. Mei. Comprehending and ordering semantics for image captioning. In 2022
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17969–17978, Los Alamitos,
CA, USA, jun 2022. IEEE Computer Society. doi:10.1109/CVPR52688.2022.01746. URL https://doi.
ieeecomputersociety.org/10.1109/CVPR52688.2022.01746.
[85] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. In Proceedings of the workshop on
text summarization branches out (WAS 2004), pages 74–81, Stroudsburg, PA, USA, 2004. Association for
Computational Linguistics. URL https://aclanthology.org/W04-1000.
[86] T.-Y. Lin. Microsoft coco caption evaluation, August 2021. URL https://github.com/tylin/
coco-caption.
[87] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco:
Common objects in context. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, Computer Vision –
ECCV 2014, pages 740–755, Manhattan, New York, USA, 2015. Springer, Springer International Publishing.
ISBN 978-3-319-10602-1.
[88] F. Liu, X. Ren, Y. Liu, K. Lei, and X. Sun. Exploring and distilling cross-modal information for image
captioning. In IJCAI International Joint Conference on Artificial Intelligence, volume 2019-August, pages
5095–5101, California, USA, Aug. 2019. International Joint Conferences on Artificial Intelligence Organization.
doi:10.24963/ijcai.2019/708. URL https://doi.org/10.24963/ijcai.2019/708.
[89] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy. Improved image captioning via policy gradient
optimization of spider. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 873–881,
Manhattan, New York, U.S., 2017. IEEE. doi:10.1109/ICCV.2017.100. URL http://doi.org/10.1109/
ICCV.2017.100.
[90] S. Liu, L. Fan, E. Johns, Z. Yu, C. Xiao, and A. Anandkumar. Prismer: A vision-language model with an
ensemble of experts. arXiv preprint arXiv:2303.02506, 2023.
[91] W. Liu, S. Chen, L. Guo, X. Zhu, and J. Liu. Cptr: Full transformer network for image captioning. arXiv preprint
arXiv:2101.10804, 2021. doi:10.48550/ARXIV.2101.10804. URL https://arxiv.org/abs/2101.10804.
[92] X. Liu, H. Li, J. Shao, D. Chen, and X. Wang. Show, tell and discriminate: Image captioning by self-retrieval with
partially labeled data. In Computer Vision – ECCV 2018, pages 353–369, Manhattan, New York, USA, September
2018. Springer International Publishing. ISBN 978-3-030-01267-0. doi:10.1007/978-3-030-01267-0_21. URL
http://doi.org/10.1007/978-3-030-01267-0_21.
[93] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchi-
cal vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Confer-
ence on Computer Vision, pages 10012–10022, Los Alamitos, CA, USA, oct 2021. IEEE Computer So-
ciety. doi:10.1109/ICCV48922.2021.00986. URL https://doi.ieeecomputersociety.org/10.1109/
ICCV48922.2021.00986.
[94] R. C. Luo, Y.-T. Hsu, Y.-C. Wen, and H.-J. Ye. Visual image caption generation for service robotics and
industrial applications. In 2019 IEEE International Conference on Industrial Cyber Physical Systems (ICPS),
pages 827–832, Manhattan, New York, U.S., 2019. IEEE. doi:10.1109/ICPHYS.2019.8780171. URL http:
//doi.org/10.1109/ICPHYS.2019.8780171.
[95] H. MacLeod, C. L. Bennett, M. R. Morris, and E. Cutrell. Understanding blind people’s experiences with
computer-generated captions of social media images. In Proceedings of the 2017 CHI Conference on Human
Factors in Computing Systems, CHI ’17, page 5988–5999, New York, NY, USA, 2017. Association for Computing
Machinery. ISBN 9781450346559. doi:10.1145/3025453.3025814. URL https://doi.org/10.1145/
3025453.3025814.

[96] B. Makav and V. Kılıç. A new image captioning approach for visually impaired people. In 2019 11th International
Conference on Electrical and Electronics Engineering (ELECO), pages 945–949, Manhattan, New York, U.S.,
2019. IEEE. doi:10.23919/ELECO47770.2019.8990630. URL http://doi.org/10.23919/ELECO47770.
2019.8990630.
[97] A. Mathews, L. Xie, and X. He. Senticap: Generating image descriptions with sentiments. In Proceedings of the
Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, page 3574–3580, Palo Alto, California, USA,
2016. AAAI Press.
[98] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[99] R. Mokady, A. Hertz, and A. H. Bermano. Clipcap: Clip prefix for image captioning. arXiv preprint
arXiv:2111.09734, 2021. doi:10.48550/ARXIV.2111.09734. URL https://arxiv.org/abs/2111.09734.
[100] V.-Q. Nguyen, M. Suganuma, and T. Okatani. Grit: Faster and better image captioning transformer using dual
visual features. In European Conference on Computer Vision, pages 167–184. Springer, 2022.
[101] A. v. d. Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional image
generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, volume 29, pages
4790–4798, Red Hook, New York, USA, 2016. Curran Associates, Inc. URL https://proceedings.neurips.
cc/paper/2016/file/b1301141feffabac455e1f90a7de2054-Paper.pdf.
[102] OpenAI. GPT-4, Mar 2023. URL https://openai.com/research/gpt-4.
[103] OpenAI. Introducing chatgpt, Nov 2022. URL https://openai.com/blog/chatgpt.
[104] V. Ordonez, G. Kulkarni, and T. Berg. Im2text: Describing images using 1 million captioned pho-
tographs. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Ad-
vances in Neural Information Processing Systems, volume 24, pages 1143–1151, Red Hook, New York,
USA, 2011. Curran Associates, Inc. URL https://proceedings.neurips.cc/paper/2011/file/
5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.pdf.
[105] Y. Pan, T. Yao, Y. Li, and T. Mei. X-linear attention networks for image captioning. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10971–10980, Manhattan, New
York, U.S., 2020. IEEE. doi:10.1109/CVPR42600.2020.01098. URL http://doi.org/10.1109/CVPR42600.
2020.01098.
[106] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation.
In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–
318, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics. doi:10.3115/1073083.1073135.
URL https://doi.org/10.3115/1073083.1073135.
[107] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In Proceedings
of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML'13,
pages III–1310–III–1318. JMLR.org, 2013.
[108] J. Pavlopoulos, V. Kougia, and I. Androutsopoulos. A survey on biomedical image captioning. In Proceedings
of the Second Workshop on Shortcomings in Vision and Language, pages 26–36, Stroudsburg, PA, USA, 2019.
Association for Computational Linguistics. doi:10.18653/v1/w19-1803. URL http://doi.org/10.18653/
v1/w19-1803.
[109] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k
entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. International Journal
of Computer Vision, 123(1):74–93, May 2017. ISSN 1573-1405. doi:10.1007/s11263-016-0965-7. URL
https://doi.org/10.1007/s11263-016-0965-7.
[110] J. Pont-Tuset, J. Uijlings, S. Changpinyo, R. Soricut, and V. Ferrari. Connecting vision and language with
localized narratives. In A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, editors, Computer Vision – ECCV 2020,
pages 647–664, Manhattan, New York, USA, 2020. Springer International Publishing. ISBN 978-3-030-58558-7.
doi:10.1007/978-3-030-58558-7_38. URL http://doi.org/10.1007/978-3-030-58558-7_38.
[111] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised
multitask learners. OpenAI blog, 1(8):9, 2019.
[112] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark,
et al. Learning transferable visual models from natural language supervision. In M. Meila and T. Zhang, editors,
Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine
Learning Research, pages 8748–8763. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/
v139/radford21a.html.

[113] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. In
4th International Conference on Learning Representations, ICLR 2016 - Conference Track Proceedings, 2016.
[114] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal
networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems -
Volume 1, NIPS’15, pages 91–99, Cambridge, MA, USA, 2015. MIT Press.
[115] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical sequence training for image captioning.
In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1179–1195, Manhattan,
New York, U.S., 2017. IEEE. doi:10.1109/CVPR.2017.131. URL http://doi.org/10.1109/CVPR.2017.
131.
[116] A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko. Object hallucination in image captioning. In
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, page 4035–4045,
Stroudsburg, PA, USA, 2018. Association for Computational Linguistics. doi:10.18653/v1/D18-1437. URL
http://doi.org/10.18653/v1/D18-1437.
[117] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation.
Morgan Kaufmann, Burlington, Massachusetts, 1988. ISBN 978-1-4832-1446-7.
doi:10.1016/B978-1-4832-1446-7.50035-2. URL https://www.sciencedirect.com/
science/article/pii/B9781483214467500352.
[118] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE
Transactions on Neural Networks, 20(1):61–80, Dec. 2008. ISSN 1941-0093. doi:10.1109/TNN.2008.2005605.
URL http://doi.org/10.1109/TNN.2008.2005605.
[119] P. Sharma, N. Ding, S. Goodman, and R. Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text
dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, Stroudsburg, PA, USA, July 2018.
Association for Computational Linguistics. doi:10.18653/v1/P18-1238. URL https://aclanthology.org/
P18-1238.
[120] R. Shetty, M. Rohrbach, L. Anne Hendricks, M. Fritz, and B. Schiele. Speaking the same language: Matching
machine to human captions by adversarial training. In 2017 IEEE International Conference on Computer
Vision (ICCV), pages 4155–4164, Manhattan, New York, USA, 2017. IEEE. doi:10.1109/ICCV.2017.445. URL
http://doi.org/10.1109/ICCV.2017.445.
[121] K. Shuster, S. Humeau, H. Hu, A. Bordes, and J. Weston. Engaging image captioning via personality. In
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume
2019-June, pages 12516–12526, Manhattan, New York, U.S., 2019. IEEE. doi:10.1109/CVPR.2019.01280. URL
http://doi.org/10.1109/CVPR.2019.01280.
[122] Shutterstock. Stock images, photos, vectors, video and music | shutterstock, Sep 2019. URL https://www.
shutterstock.com/.
[123] O. Sidorov, R. Hu, M. Rohrbach, and A. Singh. Textcaps: a dataset for image captioning with reading
comprehension. In A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, editors, Computer Vision – ECCV 2020,
pages 742–758, Manhattan, New York, USA, 2020. Springer International Publishing. ISBN 978-3-030-58536-5.
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In 3rd
International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015,
Conference Track Proceedings, May 2015.
[125] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer
vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826,
Manhattan, New York, U.S., 2016. IEEE. doi:10.1109/CVPR.2016.308. URL http://doi.org/10.1109/
CVPR.2016.308.
[126] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual
connections on learning. In Thirty-first AAAI conference on artificial intelligence, AAAI’17, page 4278–4284,
Palo Alto, California, USA, 2017. AAAI Press. URL https://www.aaai.org/ocs/index.php/AAAI/
AAAI17/paper/view/14806.
[127] K. Tang, H. Zhang, B. Wu, W. Luo, and W. Liu. Learning to compose dynamic tree structures for visual
contexts. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6612–
6621, Manhattan, New York, U.S., 2019. IEEE. doi:10.1109/CVPR.2019.00678. URL http://doi.org/10.
1109/CVPR.2019.00678.

[128] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention
is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett,
editors, Advances in neural information processing systems, volume 30, pages 5998–6008, Red Hook, New York,
USA, 2017. Curran Associates, Inc.
[129] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In
Proceedings of the IEEE conference on computer vision and pattern recognition, volume 07-12-June-2015, pages
4566–4575, Los Alamitos, CA, USA, 2015. IEEE Computer Society. doi:10.1109/CVPR.2015.7299087. URL
https://doi.ieeecomputersociety.org/10.1109/CVPR.2015.7299087.
[130] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, Los
Alamitos, CA, USA, jun 2015. IEEE Computer Society. doi:10.1109/CVPR.2015.7298935. URL https:
//doi.ieeecomputersociety.org/10.1109/CVPR.2015.7298935.
[131] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In D. Lee,
M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in neural information processing
systems, volume 29, pages 3630–3638, Red Hook, New York, USA, 2016. Curran Associates, Inc. URL https:
//proceedings.neurips.cc/paper/2016/file/90e1357833654983612fb05e3ec9148c-Paper.pdf.
[132] D. Wang, D. Beck, and T. Cohn. On the role of scene graphs in image captioning. In Proceedings of the Beyond
Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN), pages 29–34, Stroudsburg, PA, USA,
2019. Association for Computational Linguistics. doi:10.18653/v1/D19-6405. URL http://doi.org/10.
18653/v1/D19-6405.
[133] J. Wang, W. Wang, L. Wang, Z. Wang, D. D. Feng, and T. Tan. Learning visual relationship and
context-aware attention for image captioning. Pattern Recognition, 98, Feb. 2020. ISSN 0031-3203.
doi:10.1016/j.patcog.2019.107075. URL https://doi.org/10.1016/j.patcog.2019.107075.
[134] Q. Wang and A. B. Chan. Cnn+cnn: Convolutional decoders for image captioning. CoRR, abs/1805.09019,
2018. URL http://arxiv.org/abs/1805.09019.
[135] Y. Wang, J. Xu, and Y. Sun. End-to-end transformer based model for image captioning. Proceedings of the
AAAI Conference on Artificial Intelligence, 36(3):2585–2594, Jun. 2022. doi:10.1609/aaai.v36i3.20160. URL
https://ojs.aaai.org/index.php/AAAI/article/view/20160.
[136] Y.-S. Wang, C. Liu, X. Zeng, and A. Yuille. Scene graph parsing as dependency parsing. In Proceedings of
the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long Papers), pages 397–407, Stroudsburg, PA, USA, 2018. Association for
Computational Linguistics. doi:10.18653/v1/N18-1037. URL http://doi.org/10.18653/v1/N18-1037.
[137] Q. Wu, C. Shen, L. Liu, A. Dick, and A. Van Den Hengel. What value do explicit high level concepts have
in vision to language problems? In 2016 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 203–212, Manhattan, New York, USA, 2016. IEEE. doi:10.1109/CVPR.2016.29. URL http:
//doi.org/10.1109/CVPR.2016.29.
[138] Q. Xia, H. Huang, N. Duan, D. Zhang, L. Ji, Z. Sui, E. Cui, T. Bharti, and M. Zhou. Xgpt: Cross-modal
generative pre-training for image captioning. In Natural Language Processing and Chinese Computing: 10th
CCF International Conference, NLPCC 2021, Qingdao, China, October 13–17, 2021, Proceedings, Part I, pages
786–797, Berlin, Heidelberg, 2021. Springer, Springer-Verlag. ISBN 978-3-030-88479-6. doi:10.1007/978-3-
030-88480-2_63. URL https://doi.org/10.1007/978-3-030-88480-2_63.
[139] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei. Scene graph generation by iterative message passing. In 2017 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pages 3097–3106, Manhattan, New York,
U.S., 2017. IEEE. doi:10.1109/CVPR.2017.330. URL http://doi.org/10.1109/CVPR.2017.330.
[140] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and
tell: Neural image caption generation with visual attention. In F. Bach and D. Blei, editors, Proceedings of the
32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research,
pages 2048–2057, Lille, France, 07–09 Jul 2015. PMLR.
[141] N. Xu, A.-A. Liu, J. Liu, W. Nie, and Y. Su. Scene graph captioner: Image captioning based on structural visual
representation. Journal of Visual Communication and Image Representation, 58:477–485, Jan. 2019. ISSN
1047-3203. doi:10.1016/j.jvcir.2018.12.027. URL http://doi.org/10.1016/j.jvcir.2018.
12.027.
[142] N. Xu, H. Zhang, A.-A. Liu, W. Nie, Y. Su, J. Nie, and Y. Zhang. Multi-level policy and reward-based deep
reinforcement learning framework for image captioning. IEEE Transactions on Multimedia, 22(5):1372–1383,
Sept. 2020. doi:10.1109/TMM.2019.2941820. URL http://doi.org/10.1109/TMM.2019.2941820.

[143] Y. Xu, L. Li, H. Xu, S. Huang, F. Huang, and J. Cai. Image captioning in the transformer age. arXiv preprint
arXiv:2204.07374, 2022. doi:10.48550/ARXIV.2204.07374. URL https://arxiv.org/abs/2204.07374.
[144] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh. Graph r-cnn for scene graph generation. In V. Ferrari, M. Hebert,
C. Sminchisescu, and Y. Weiss, editors, Computer Vision – ECCV 2018, pages 690–706, Manhattan, New York,
USA, 2018. Springer International Publishing. ISBN 978-3-030-01246-5. doi:10.1007/978-3-030-01246-5_41.
URL http://doi.org/10.1007/978-3-030-01246-5_41.
[145] X. Yang, K. Tang, H. Zhang, and J. Cai. Auto-encoding scene graphs for image captioning. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10685–10694, Manhattan, New York,
U.S., 2019. IEEE. doi:10.1109/CVPR.2019.01094. URL http://doi.org/10.1109/CVPR.2019.01094.
[146] X. Yang, Y. Liu, and X. Wang. Reformer: The relational transformer for image captioning. In Proceedings
of the 30th ACM International Conference on Multimedia, MM ’22, page 5398–5406, New York, NY, USA,
2022. Association for Computing Machinery. ISBN 9781450392037. doi:10.1145/3503161.3548409. URL
https://doi.org/10.1145/3503161.3548409.
[147] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In
Proceedings of the IEEE conference on computer vision and pattern recognition, volume 2016-December, pages
21–29, Manhattan, New York, U.S., 2016. IEEE. doi:10.1109/CVPR.2016.10. URL https://doi.org/10.
1109/CVPR.2016.10.
[148] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei. Boosting image captioning with attributes. In 2017 IEEE Interna-
tional Conference on Computer Vision (ICCV), pages 4904–4912, Manhattan, New York, USA, 2017. IEEE.
doi:10.1109/ICCV.2017.524. URL http://doi.org/10.1109/ICCV.2017.524.
[149] T. Yao, Y. Pan, Y. Li, and T. Mei. Exploring visual relationship for image captioning. In V. Ferrari, M. Hebert,
C. Sminchisescu, and Y. Weiss, editors, Computer Vision – ECCV 2018, pages 711–727, Manhattan, New York,
USA, 2018. Springer International Publishing. ISBN 978-3-030-01264-9. doi:10.1007/978-3-030-01264-9_42.
[150] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pages 4651–4659, Manhattan, New York,
USA, 2016. IEEE. doi:10.1109/CVPR.2016.503. URL http://doi.org/10.1109/CVPR.2016.503.
[151] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity
metrics for semantic inference over event descriptions. Transactions of the Association for Computational
Linguistics, 2:67–78, February 2014. ISSN 2307-387X. doi:10.1162/tacl_a_00166. URL https://doi.org/
10.1162/tacl_a_00166.
[152] R. Zellers, M. Yatskar, S. Thomson, and Y. Choi. Neural motifs: Scene graph parsing with global context. In 2018
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5831–5840, Manhattan, New York,
U.S., 2018. IEEE. doi:10.1109/CVPR.2018.00611. URL http://doi.org/10.1109/CVPR.2018.00611.
[153] P. Zeng, H. Zhang, J. Song, and L. Gao. S2 transformer for image captioning. In L. D. Raedt, editor, Proceed-
ings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 1608–1614.
International Joint Conferences on Artificial Intelligence Organization, July 2022. doi:10.24963/ijcai.2022/224.
URL https://doi.org/10.24963/ijcai.2022/224.
[154] H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua. Visual translation embedding network for visual relation
detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3107–3115,
Manhattan, New York, U.S., 2017. IEEE. doi:10.1109/CVPR.2017.331. URL http://doi.org/10.1109/
CVPR.2017.331.
[155] P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, and J. Gao. Vinvl: Revisiting visual representations
in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 5579–5588, Manhattan, New York, U.S., 2021. IEEE. doi:10.1109/CVPR46437.2021.00553.
URL http://doi.org/10.1109/CVPR46437.2021.00553.
[156] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr, et al. Rethinking semantic
segmentation from a sequence-to-sequence perspective with transformers. In 2021 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), pages 6877–6886, Los Alamitos, CA, USA, jun 2021. IEEE
Computer Society. doi:10.1109/CVPR46437.2021.00681. URL https://doi.ieeecomputersociety.org/
10.1109/CVPR46437.2021.00681.
[157] Y. Zhong, L. Wang, J. Chen, D. Yu, and Y. Li. Comprehensive image captioning via scene graph decomposition.
In A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, editors, Computer Vision – ECCV 2020, pages
211–229, Manhattan, New York, USA, 2020. Springer International Publishing. ISBN 978-3-030-58568-6.
doi:10.1007/978-3-030-58568-6_13. URL https://doi.org/10.1007/978-3-030-58568-6_13.

[158] L. Zhou, C. Xu, P. Koch, and J. J. Corso. Watch what you just said: Image captioning with text-conditional
attention. In Proceedings of the on Thematic Workshops of ACM Multimedia 2017, Thematic Workshops ’17,
page 305–313, New York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450354165.
doi:10.1145/3126686.3126717. URL https://doi.org/10.1145/3126686.3126717.
[159] L. Zhou, H. Palangi, L. Zhang, H. Hu, J. Corso, and J. Gao. Unified vision-language pre-training for image
captioning and vqa. AAAI 2020 - 34th AAAI Conference on Artificial Intelligence, 34(07):13041–13049, Apr. 2020.
ISSN 2159-5399. doi:10.1609/aaai.v34i07.7005. URL http://doi.org/10.1609/aaai.v34i07.7005.
[160] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent
adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232,
Manhattan, New York, U.S., 2017. IEEE. doi:10.1109/ICCV.2017.244. URL http://doi.org/10.1109/
ICCV.2017.244.
[161] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai. Deformable detr: Deformable transformers for end-to-end
object detection. arXiv preprint arXiv:2010.04159, 2020.
