including N-cut, color-based segmentation and hybrid engines. It also discussed how model engineering and incorporating more hyper-parameters improve the overall pipeline and yield the best accuracy for such models. In the same year, another study [6] reviewed the literature from 2017 to 2019, discussing the different datasets and architectures. They stated that CNN-LSTM models outperformed CNN-RNN ones, and that the most frequently used evaluation metric was BLEU (1 to 4). They also found that the best methods for implementing such models are the encoder-decoder structure and the attention mechanism, and mentioned that a combination of both can help improve the results on this task. Image captioning remains an active research area, and new methodologies keep being published up to this moment. This was one of the main motivations for writing this review paper: to cover the recent advances of the past few years, including 2020.
2. Methods

2.1. UpDown

Most common mechanisms that rely on visual attention today are of the top-down kind: they are fed the partially finished caption at each time step to gain context. The issue with these models, however, is that there is no deliberation as to which regions of an image will receive attention. This affects the quality of the captions, since focusing on salient object regions provides descriptions that are closer to the ones given by humans [23].

[4] introduce Up-Down, a model that joins an entirely visual bottom-up mechanism and a task-specific, context-driven top-down one. The former proposes image regions that it deems salient, while the latter uses context to compute an attention distribution over them, thus allowing attention to be directed to the important objects in the input image.

Implementation Details The bottom-up mechanism employs the Faster R-CNN [20] object detection model, responsible for recognizing object classes and surrounding them with bounding boxes. For pre-training, it is initialized with Resnet-101 [8] and trained on the Visual Genome dataset [14]. The top-down mechanism uses a visual attention LSTM and a language LSTM. The attention LSTM is fed the previous language LSTM output, the word generated at time t-1 and the mean-pooled image features to decide which regions should receive attention. The caption generated up to that point is then used to compute the conditional distribution over potential output words, and the product of all of these conditional distributions gives the distribution over complete captions.
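To make the interplay between the two LSTMs concrete, the sketch below implements one decoding step of an Up-Down style captioner in PyTorch. The layer sizes, the soft-attention formulation and all names are our own illustrative assumptions, not the exact configuration of [4].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpDownDecoderStep(nn.Module):
    """One decoding step of an Up-Down style captioner (illustrative sketch)."""

    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Attention LSTM input: previous language-LSTM state, mean-pooled features, previous word.
        self.attn_lstm = nn.LSTMCell(hidden_dim + feat_dim + embed_dim, hidden_dim)
        # Soft attention over the bottom-up region features.
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.h_proj = nn.Linear(hidden_dim, attn_dim)
        self.attn_score = nn.Linear(attn_dim, 1)
        # Language LSTM input: attended feature + attention-LSTM state.
        self.lang_lstm = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, regions, prev_word, attn_state, lang_state):
        # regions: (B, R, feat_dim) bottom-up features from the object detector.
        mean_feat = regions.mean(dim=1)
        x_attn = torch.cat([lang_state[0], mean_feat, self.embed(prev_word)], dim=1)
        h_attn, c_attn = self.attn_lstm(x_attn, attn_state)

        # Attention distribution over regions, conditioned on the attention-LSTM state.
        scores = self.attn_score(torch.tanh(self.feat_proj(regions) + self.h_proj(h_attn).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)             # (B, R, 1)
        attended = (alpha * regions).sum(dim=1)      # (B, feat_dim)

        x_lang = torch.cat([attended, h_attn], dim=1)
        h_lang, c_lang = self.lang_lstm(x_lang, lang_state)
        word_logits = self.out(h_lang)               # conditional distribution over the next word
        return word_logits, (h_attn, c_attn), (h_lang, c_lang)
```

Calling this step repeatedly, feeding back the chosen word and the two LSTM states, yields the product of conditional distributions described above.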
Figure 1: The semantic space used by OSCAR [16]. In the example of a dog sitting on a couch, "couch" and "dog" are close in region features since they are roughly in the same area of the image, but they are farther apart in word embeddings because of their different meanings.

2.2. OSCAR

Vision-language pre-training (VLP) is widely used for learning cross-modal representations. It suffers, however, from two issues [16]: a difficulty in discerning features due to the overlap of their image regions, and a lack of alignment between caption words and their corresponding image regions. [16] remedy this by using object tags as "anchor points". More specifically, they use triples as inputs, composed of image region features, object tags and the word sequence (caption). This helps because when one channel is incomplete or noisy, the other might complete the information (an object can be described both through the image and through language). It therefore becomes simple to make the alignments, because the most important elements in the image appear in the matching caption and are also the ones expected to receive the most attention.

Implementation Details OSCAR detects object tags using Faster R-CNN [20] and presents a two-view perspective:
(1) A dictionary view with a linguistic semantic space encompassing the tags and caption tokens, and a visual semantic space where the image regions lie (Fig. 1).
(2) A modality view that consists of an image modality containing image features and tags, and a language modality with caption tokens. The total pre-training loss is defined as the sum of a masked token loss for predicting masked tokens from the linguistic semantic space and a contrastive loss for predicting whether an image-tag sequence is polluted (i.e. contains replaced tags).

At inference time, the input consists of image regions and tags. At each time step of the generation, a [MASK] token is appended to the sequence and then replaced by a token from the vocabulary, until the [STOP] token is generated.
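As a rough illustration of this masked-token decoding scheme, the loop below sketches how a caption could be generated one token at a time. The model call, the token ids and the greedy selection are placeholders for whatever BERT-style network and vocabulary are actually used; they are not OSCAR's real interface.

```python
import torch

MASK_ID, STOP_ID, MAX_LEN = 103, 102, 20  # placeholder ids for an assumed BERT-style vocabulary

def generate_caption(model, region_feats, tag_ids):
    """Greedy masked-token decoding: append [MASK], predict it, repeat until [STOP]."""
    caption = []
    for _ in range(MAX_LEN):
        # Input = tokens generated so far + a fresh [MASK] slot, plus the tags and region features.
        tokens = torch.tensor([caption + [MASK_ID]])
        logits = model(tokens, tag_ids, region_feats)  # assumed signature: returns (B, seq_len, vocab)
        next_id = logits[0, -1].argmax().item()        # fill the [MASK] position (greedy for simplicity)
        if next_id == STOP_ID:
            break
        caption.append(next_id)
    return caption
```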
Figure 2: Visual vocabulary used by VIVO [10]. Objects that are semantically similar are closer together. o represents regions and + represents tags. Yellow objects and tags are novel.

2.3. VIVO

In the nocaps [1] challenge, the only allowed image-caption dataset is the MS COCO [17] one, making conventional VLP methods inapplicable [10]. For that reason, [10] came up with VIVO (VIsual VOcabulary pre-training). What it does differently is define a "visual vocabulary", a joint embedding space of tags and image region features in which the vectors of semantically close objects (e.g. accordion and instrument) lie close to each other (Fig. 2). After pre-training the vocabulary, the model is fine-tuned with image-caption pairs using the MS COCO dataset [17]. The key difference between VIVO and other VLP models is that VIVO is only pre-trained on image-tag pairs: no captions are involved before fine-tuning. This can prove very useful, since tags are easier to generate automatically, which allows a huge number of them to be used at no annotation cost.

Implementation Details VIVO uses a multi-layer Transformer, responsible for aligning tags with their corresponding image region features, followed by a linear layer and a softmax. During pre-training, image region features are extracted from the input image using UpDown's object detector [4] and fed to the Transformer along with a set of image-tag pairs. One or more tags are randomly masked, and the model predicts them based on the remaining tags and the image regions.

In fine-tuning, the model is fed a triplet of image regions, tags and a caption, where some of the caption's tokens are randomly masked and the model learns to predict them. A uni-directional attention mask is applied, and the parameters are optimized using a cross-entropy loss.

At inference time, image region features are extracted from the input image and tags are detected. A caption is then generated one token at a time in an auto-regressive manner (using the previous tokens as input) until the end token is generated or the caption reaches its maximum length.
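To give a feel for this masked-tag pre-training step, the sketch below shows one way the prediction head could be wired up. The Transformer encoder, its sizes, the masking strategy and the loss computation are our own simplifying assumptions, not VIVO's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedTagPretrainer(nn.Module):
    """Illustrative VIVO-style head: predict masked tags from image regions + remaining tags."""

    def __init__(self, tag_vocab_size, feat_dim=2048, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.region_proj = nn.Linear(feat_dim, d_model)        # project detector features
        self.tag_embed = nn.Embedding(tag_vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.classifier = nn.Linear(d_model, tag_vocab_size)   # linear layer (+ softmax in the loss)

    def forward(self, region_feats, tag_ids, mask_positions):
        # region_feats: (B, R, feat_dim); tag_ids: (B, T); mask_positions: (B, T) boolean.
        tags = self.tag_embed(tag_ids)
        tags = tags.masked_fill(mask_positions.unsqueeze(-1), 0.0)   # hide the masked tags
        x = torch.cat([self.region_proj(region_feats), tags], dim=1)
        h = self.encoder(x)
        tag_states = h[:, region_feats.size(1):]                     # states aligned with the tag slots
        return self.classifier(tag_states)                           # (B, T, tag_vocab_size)

# Training objective on the masked positions only, e.g.:
#   logits = model(region_feats, tag_ids, mask_positions)
#   loss = F.cross_entropy(logits[mask_positions], tag_ids[mask_positions])
```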
Figure 3: SPICE scene graph for the caption "A young girl standing on top of a tennis court" [3]. The objects are marked in red, the relations in blue and the attributes in green.

2.4. Meta Learning

One of the drawbacks of reinforcement learning is the reward hacking problem, in other words overfitting on the reward function, which occurs when the agent finds a way to maximize the score without generating captions of better quality. When using CIDEr optimization [21], for example, common phrases are given less weight and captions that are too short are penalized. As a result, when a short caption is generated, common phrases are appended to make it longer, ending up with unnatural sentence endings such as "a little girl holding a cat in a of a." [15]

[15] introduce meta learning, which consists in learning a meta model that is able to optimize and adapt to several different tasks [7]. In this case, the model simultaneously optimizes the reward function (reinforcement task) and uses supervision from the ground truth (supervision task) by taking gradient steps in both directions. This guarantees the distinctiveness of the captions as well as their propositional correctness, and results in sound, human-like sentences.

Additionally, they import the SPICE [3] metric and add it to the CIDEr [27] reward term, since it performs semantic propositional evaluation using a scene graph. What this means is that an unusual caption ending will show up as an object-relation pair in the scene graph without a match (Fig. 3). Unfortunately, SPICE has a reward hacking issue of its own, since it allows duplicate tuples. It is therefore not easy to develop an ideal evaluation metric.

Implementation Details [15] use the UpDown architecture [4] as outlined above. The two tasks that the model needs to optimize are the maximum likelihood estimate (MLE) task and the reinforcement learning task. In other words, it needs to take two gradient steps in order to update the parameter θ. In the first step, the model adapts θ to the two tasks and calculates their respective losses. Then, θ is updated in what is called a "meta update". Doing things this way, the model learns a parameter θ that optimizes both tasks, instead of simply taking a step in between the two gradients obtained when adding up the losses.
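Schematically, the meta update can be written as a MAML-style inner/outer step over the two tasks. The snippet below is a conceptual sketch under our own notation: mle_loss_fn and rl_loss_fn are assumed helpers that return differentiable losses for the supervision task and a surrogate (REINFORCE/SCST-style) reward task; this is not the authors' actual training code.

```python
import torch

def meta_step(theta, mle_loss_fn, rl_loss_fn, inner_lr=1e-4, meta_lr=1e-4):
    """One schematic meta update over two tasks: supervised MLE and RL reward maximization.

    theta: list of parameter tensors with requires_grad=True.
    *_loss_fn: callables mapping a parameter list to a scalar, differentiable loss (assumed helpers).
    """
    meta_grads = [torch.zeros_like(p) for p in theta]
    for loss_fn in (mle_loss_fn, rl_loss_fn):
        # Inner step: adapt the parameters to this task and measure the loss after adaptation.
        grads = torch.autograd.grad(loss_fn(theta), theta, create_graph=True)
        adapted = [p - inner_lr * g for p, g in zip(theta, grads)]
        adapted_loss = loss_fn(adapted)
        # Accumulate the gradient of the post-adaptation loss with respect to the original theta.
        for mg, g in zip(meta_grads, torch.autograd.grad(adapted_loss, theta)):
            mg += g
    # Meta update: move theta towards a point that improves both tasks after adaptation,
    # rather than simply stepping along the sum of the two raw task gradients.
    with torch.no_grad():
        for p, mg in zip(theta, meta_grads):
            p -= meta_lr * mg
```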
2.5. Conditional GAN-Based Model

To overcome reward hacking, [5] use discriminator networks to decide whether a generated caption comes from a human or from a machine. Since they do not give their model a name, we will call it IC-GAN (Image Captioning GAN) for the sake of practicality.

Implementation Details [5] experimented with two different architectures for the discriminator: one using a CNN with a fully connected layer and a sigmoid transformation, and the other an RNN (LSTM) with a fully connected layer and a softmax. They also experiment with an ensemble of 4 CNNs and 4 RNNs. For the generator, a number of different architectures were used, but in the results we will focus on the generator that uses the UpDown architecture [4]. In all cases, the generator and the discriminator need to be pre-trained before being alternately fine-tuned.
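As an illustration of the CNN variant of the discriminator, the sketch below scores a caption together with an image feature and outputs the probability that the caption is human-written. The embedding size, convolution widths and the way the image feature is injected are assumptions made for the example, not the exact design of [5].

```python
import torch
import torch.nn as nn

class CNNCaptionDiscriminator(nn.Module):
    """Illustrative CNN discriminator: (caption tokens, image feature) -> P(human-written)."""

    def __init__(self, vocab_size, embed_dim=256, feat_dim=2048, n_filters=128, widths=(2, 3, 4)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # 1-D convolutions act as n-gram detectors over the caption.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, n_filters, kernel_size=w) for w in widths)
        # Fully connected layer + sigmoid, as described for the CNN variant.
        self.fc = nn.Linear(n_filters * len(widths) + feat_dim, 1)

    def forward(self, caption_ids, image_feat):
        # caption_ids: (B, T); image_feat: (B, feat_dim)
        x = self.embed(caption_ids).transpose(1, 2)             # (B, embed_dim, T)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        h = torch.cat(pooled + [image_feat], dim=1)
        return torch.sigmoid(self.fc(h)).squeeze(1)             # probability the caption is human-written
```

During adversarial fine-tuning, the generator would be rewarded for captions that drive this probability up, while the discriminator is trained to tell generated captions from ground-truth ones.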
2.6. Evaluation Metrics

To compare the quality of the generated captions to the ground truth, a number of evaluation metrics are used, the most common ones being CIDEr, SPICE, BLEU and METEOR. The metrics common to all the covered literature are CIDEr and SPICE, which is why we will be using them. CIDEr [27] is an image captioning evaluation metric that uses term frequency-inverse document frequency (TF-IDF) [22] weighting to measure consensus with human-written reference captions. SPICE [3] is a newer, semantic concept-based caption assessment metric built on the scene graph, a graph-based semantic representation (Fig. 3) [11][24].
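As a toy illustration of the TF-IDF idea behind CIDEr (the real metric works on n-grams up to length 4 with corpus-level statistics and further refinements), the snippet below computes a TF-IDF-weighted cosine similarity between a candidate caption and a set of references using unigrams only; it is a simplified sketch, not the official implementation.

```python
import math
from collections import Counter

def tfidf_cosine(candidate, references, corpus):
    """Toy unigram TF-IDF cosine similarity between a candidate caption and references."""
    def idf(word):
        docs_with_word = sum(word in doc.split() for doc in corpus)
        return math.log((1 + len(corpus)) / (1 + docs_with_word))  # ubiquitous words -> ~0 weight

    def vec(text):
        return {w: c * idf(w) for w, c in Counter(text.split()).items()}

    def cosine(a, b):
        dot = sum(a[w] * b.get(w, 0.0) for w in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    cand = vec(candidate)
    # Consensus: average the similarity against every human reference caption.
    return sum(cosine(cand, vec(r)) for r in references) / len(references)

# Common words ("a", "on") get low IDF weight, so agreement on content words ("frisbee") dominates.
corpus = ["a dog catches a frisbee", "a man rides a horse", "two cats on a couch"]
print(tfidf_cosine("a dog catches a frisbee",
                   ["a dog jumping to catch a frisbee", "a dog catches a frisbee outside"],
                   corpus))
```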
2.7. Benchmarks

[1] develop a benchmark called nocaps which, in addition to the image captioning dataset Microsoft COCO Captions [17], makes use of the Open Images object detection dataset [13] to introduce novel objects not seen in the former. The nocaps benchmark is made up of 166,100 captions that describe 15,100 images from the Open Images validation and test sets. The OSCAR [16] and VIVO [10] methods are evaluated on the nocaps validation set [1]. Karpathy splits [12] are used in the evaluation of the meta learning model [15] and of IC-GAN [5]. Finally, UpDown [4] is evaluated on both benchmarks.

3. Results

3.1. MS COCO Karpathy Splits Benchmark

[4] initially ran experiments on both an ablated Resnet [8] baseline model and the UpDown model to measure the impact of bottom-up attention. Since the meta learning model [15] and IC-GAN [5] use the UpDown architecture themselves, we have decided to include the Resnet baseline in the comparison (Table 1) to get an idea of the individual effect of the bottom-up + top-down approach.

Method                       CIDEr   SPICE
Resnet Baseline              111.1   20.2
UpDown                       120.1   21.4
MLE Maximization             110.2   20.3
*RL Maximization             120.4   21.3
*MLE + RL Maximization       119.3   21.2
*Meta Learning               121.0   21.7
IC-GAN (UpDown/CNN-GAN)      123.2   22.1
IC-GAN (UpDown/RNN-GAN)      122.2   22.0
IC-GAN (UpDown/ensemble)     125.9   22.3

Table 1: Results of the overall performance on the MS COCO Karpathy test split [15][5]. Methods marked with * use reinforcement learning with CIDEr optimization.

We notice that UpDown shows an important gain in performance, going from 111.1 to 120.1 in CIDEr and from 20.2 to 21.4 in SPICE. This represents a relative improvement of 8% in CIDEr and 6% in SPICE. Adding bottom-up attention therefore has an important positive impact on image captioning.

Concerning the experiments in [15], we observe that the model that uses meta learning reaches a CIDEr score of 121.0 and a SPICE score of 21.7. It is the most performant on both evaluation metrics compared to maximizing the maximum likelihood estimate, reinforcement learning, and the MLE+RL maximization (which relies on simply adding up the gradients from the supervision and reinforcement tasks). It also shows a slight improvement over the UpDown model without meta learning.

IC-GAN, on the other hand, shows the highest performance, with a significant improvement over all three models. The relative improvements compared to the UpDown model range between 1.7% (RNN-GAN) and 4.6% (ensemble). Additionally, compared with conventional reinforcement learning approaches, the proposed adversarial learning method boosts the performance by 8.9% to 16.3% [5].

It is important to note that although using a CNN-GAN slightly improves the score compared to using an RNN-GAN, the latter can save up to 30% of training time compared to the former [5].

Evaluation scores aside, IC-GAN is able to generate human-like captions and avoids mistakes commonly made by traditional reinforcement learning methods, such as duplicated words and logical errors like "a group of people standing on top of a clock" [5].
Method                in-domain       near-domain     out-of-domain   overall
                      CIDEr   SPICE   CIDEr   SPICE   CIDEr   SPICE   CIDEr   SPICE
Validation Set
UpDown (2019)          78.1    11.6    57.7    10.3    31.3     8.3    55.3    10.1
UpDown + CBS           80.0    12.0    73.6    11.3    66.4     9.7    73.1    11.1
UpDown + ELMo + CBS    79.3    12.4    73.8    11.4    71.7     9.9    74.3    11.2
OSCAR (2020)           79.6    12.3    66.1    11.5    45.3     9.7    63.8    11.2
OSCAR + CBS            80.0    12.1    80.4    12.2    75.3    10.6    79.3    11.9
OSCAR + SCST + CBS     83.4    12.0    81.6    12.0    77.6    10.6    81.1    11.7
VIVO (2020)            88.8    12.9    83.2    12.6    71.1    10.6    81.5    12.2
VIVO + CBS             90.4    13.0    84.9    12.5    83.0    10.7    85.3    12.2
VIVO + SCST + CBS      92.2    12.9    87.8    12.6    87.5    11.5    88.3    12.4
Human                  84.4    14.3    85.0    14.3    95.7    14.0    87.1    14.2

Table 2: CIDEr and SPICE scores on the nocaps validation set [1], broken down into the in-domain, near-domain and out-of-domain subsets.
3.2. nocaps Benchmark

The OSCAR model [16] is characterized by being highly parameter-efficient, thanks to the anchor points making the learning of semantic alignments easier. When used on its own, it outperforms the UpDown model on all in-domain, near-domain and out-of-domain subsets. By adding Constrained Beam Search [2] and Self-Critical Sequence Training (SCST) [21], the performance improves tremendously, particularly on the out-of-domain subset, going from 45.3 to 77.6 in CIDEr and from 9.7 to 10.6 in SPICE.

OSCAR does not score as high as VIVO, however, as shown in Table 2. In the in-domain case, VIVO on its own outperforms the combinations OSCAR+SCST+CBS and UpDown+ELMo+CBS [19] by a margin ranging from 5.4 to 9.5 in CIDEr and from 0.5 to 0.9 in SPICE. The VIVO+SCST+CBS version shows the highest performance, with CIDEr scores that even surpass the human ones on the in-domain and near-domain subsets. The out-of-domain results still show strong scores that surpass all other models.

4. Discussion

It seems that the current research on image captioning is heavily focused on deep learning techniques, and for a good reason. Image captioning is a very complex task that combines both computer vision and natural language processing, and it therefore needs powerful techniques that can handle that level of complexity. Attention mechanisms, along with deep reinforcement and adversarial learning, appear to be actively researched methods for this task, as showcased in this paper. Faster R-CNN is a popular network choice, along with the LSTM. In particular, the UpDown model seems to be used as a basis for multiple papers published between 2018 and 2020, which gives it a key role and a notable impact on the advances in the field. Novel object captioning also seems to be gathering a lot of interest after proving its usefulness.

Although the VIVO and OSCAR models do not show scores as high as the ones that use meta learning and adversarial learning, they are superior in terms of their usability "in the wild", since the MS COCO dataset [17] on which all of these models are trained contains only a small part of the objects that we run into in real life. We should therefore not take evaluation scores at face value, especially after demonstrating the reward hacking problem in reinforcement learning. In the future, additional effort should be put into making more robust reward functions and into more research on novel object captioning, especially into replacing human-annotated object detection datasets with fully machine-generated tags.

5. Conclusions

Image captioning is an active research topic that creates space for competition among researchers. We notice that the existing review papers do not cover some of the important recent advances, even though new methodologies with strong performance keep appearing. The ongoing research on image captioning is focused on deep learning-based methods, where attention mechanisms are used alongside deep reinforcement and adversarial learning. Our paper discusses the recent methods and their implementations. State-of-the-art techniques include UpDown, OSCAR, VIVO, Meta Learning and a GAN-based model. The GAN-based model is the most performant, UpDown has had the most impact, and OSCAR and VIVO are the most useful "in the wild". We hope this review will provide the community with a complementary guideline, alongside the existing review papers, for further research on the image captioning topic.
References

[1] H. Agrawal, K. Desai, Y. Wang, X. Chen, R. Jain, M. Johnson, D. Batra, D. Parikh, S. Lee, and P. Anderson. nocaps: novel object captioning at scale. CoRR, abs/1812.08658, 2018.
[2] P. Anderson, B. Fernando, M. Johnson, and S. Gould. Guided open vocabulary image captioning with constrained beam search. CoRR, abs/1612.00576, 2016.
[3] P. Anderson, B. Fernando, M. Johnson, and S. Gould. SPICE: Semantic propositional image caption evaluation. Computer Vision – ECCV 2016, Lecture Notes in Computer Science, pages 382–398, 2016.
[4] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and VQA. CoRR, abs/1707.07998, 2017.
[5] C. Chen, S. Mu, W. Xiao, Z. Ye, L. Wu, and Q. Ju. Improving image captioning with conditional generative adversarial nets. Proceedings of the AAAI Conference on Artificial Intelligence, 33:8142–8150, 2019.
[6] M. Chohan, A. Khan, M. Saleem, S. Hassan, A. Ghafoor, and M. Khan. Image captioning using deep learning: A systematic literature review. International Journal of Advanced Computer Science and Applications, 11(5), 2020.
[7] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[9] M. Z. Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga. A comprehensive survey of deep learning for image captioning, 2018.
[10] X. Hu, X. Yin, K. Lin, L. Wang, L. Zhang, J. Gao, and Z. Liu. VIVO: Visual vocabulary pre-training for novel object captioning, 2021.
[11] J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. A. Shamma, M. S. Bernstein, and L. Fei-Fei. Image retrieval using scene graphs. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[12] A. Karpathy and F. Li. Deep visual-semantic alignments for generating image descriptions. CoRR, abs/1412.2306, 2014.
[13] I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, A. Veit, S. Belongie, V. Gomes, A. Gupta, C. Sun, G. Chechik, D. Cai, Z. Feng, D. Narayanan, and K. Murphy. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages, 2017.
[14] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, M. S. Bernstein, and F. Li. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. CoRR, abs/1602.07332, 2016.
[15] N. Li, Z. Chen, and S. Liu. Meta learning for image captioning. Proceedings of the AAAI Conference on Artificial Intelligence, 33:8626–8633, 2019.
[16] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al. OSCAR: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pages 121–137. Springer, 2020.
[17] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014.
[18] K. C. Nithya and V. V. Kumar. A review on automatic image captioning techniques. In 2020 International Conference on Communication and Signal Processing (ICCSP), pages 0432–0437, 2020.
[19] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. CoRR, abs/1802.05365, 2018.
[20] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
[21] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical sequence training for image captioning. CoRR, abs/1612.00563, 2016.
[22] S. Robertson. Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5):503–520, 2004.
[23] B. J. Scholl. Objects and attention: The state of the art. Cognition, 80(1).
[24] S. Schuster, R. Krishna, A. Chang, L. Fei-Fei, and C. D. Manning. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. Proceedings of the Fourth Workshop on Vision and Language, 2015.
[25] R. Staniūtė and D. Šešok. A systematic literature review on image captioning. Applied Sciences, 9(10):2024, 2019.
[26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
[27] R. Vedantam, C. L. Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.