including N-cut, color-based segmentation and hybrid engines. It also discussed how model engineering and incorporating more hyper-parameters improve the overall pipeline and yield the best accuracy for such models. In the same year, another study [6] reviewed the literature from 2017 to 2019, discussing the different datasets and architectures. They stated that CNN-LSTM models outperformed CNN-RNN ones, and that the most frequently used evaluation metric was BLEU (1 to 4). They also found that the best methods for implementing such models are the encoder-decoder structure and the attention mechanism, and mentioned that a combination of both can help improve the results on this task. Image captioning remains an active research area, and new methodologies keep being published up to this moment. This was one of the main motivations for writing this review paper: to cover the recent advances of the past few years, including 2020.
2. Methods

2.1. UpDown

Most common mechanisms that rely on visual attention today are of the top-down kind: they are fed the partially finished caption at each time step to gain context. The issue with these models, however, is that there is no deliberation as to which regions of an image will receive attention. This affects the quality of the captions, since focusing on salient object regions provides descriptions that are closer to the ones given by humans [23].

[4] introduce Up-Down, a model that joins an entirely visual bottom-up mechanism and a task-specific, context-driven top-down one. The former proposes image regions that it deems salient, while the latter uses context to compute an attention distribution over them, thus allowing attention to be directed to the important objects in the input image.

Implementation Details The bottom-up mechanism employs the Faster R-CNN [20] object detection model, responsible for recognizing object classes and surrounding them with bounding boxes. For pre-training, it is initialized with Resnet-101 [8] and trained on the Visual Genome dataset [14]. The top-down mechanism uses a visual attention LSTM and a language LSTM. The attention LSTM is fed the previous language LSTM output, the word generated at time t-1 and the mean-pooled image features to decide which regions should receive attention. The caption generated up to that point is then used to compute the conditional distribution over potential output words, and the product of all of these conditional distributions gives the distribution over complete captions.
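To make the interplay between the two LSTMs concrete, the sketch below implements one decoding step of an Up-Down style captioner in PyTorch. The layer sizes, the soft-attention formulation and all names are our own illustrative assumptions, not the exact configuration of [4].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpDownDecoderStep(nn.Module):
    """One decoding step of an Up-Down style captioner (illustrative sketch)."""

    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Attention LSTM input: previous language-LSTM state, mean-pooled features, previous word.
        self.attn_lstm = nn.LSTMCell(hidden_dim + feat_dim + embed_dim, hidden_dim)
        # Soft attention over the bottom-up region features.
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.h_proj = nn.Linear(hidden_dim, attn_dim)
        self.attn_score = nn.Linear(attn_dim, 1)
        # Language LSTM input: attended feature + attention-LSTM state.
        self.lang_lstm = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, regions, prev_word, attn_state, lang_state):
        # regions: (B, R, feat_dim) bottom-up features from the object detector.
        mean_feat = regions.mean(dim=1)
        x_attn = torch.cat([lang_state[0], mean_feat, self.embed(prev_word)], dim=1)
        h_attn, c_attn = self.attn_lstm(x_attn, attn_state)

        # Attention distribution over regions, conditioned on the attention-LSTM state.
        scores = self.attn_score(torch.tanh(self.feat_proj(regions) + self.h_proj(h_attn).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)             # (B, R, 1)
        attended = (alpha * regions).sum(dim=1)      # (B, feat_dim)

        x_lang = torch.cat([attended, h_attn], dim=1)
        h_lang, c_lang = self.lang_lstm(x_lang, lang_state)
        word_logits = self.out(h_lang)               # conditional distribution over the next word
        return word_logits, (h_attn, c_attn), (h_lang, c_lang)
```

Calling this step repeatedly, feeding back the chosen word and the two LSTM states, yields the product of conditional distributions described above.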
Figure 1: The semantic space used by OSCAR [16]. In the example of a dog sitting on a couch, "couch" and "dog" are close in region features since they are roughly in the same area of the image, but they are farther apart in word embeddings because of their different meanings.

2.2. OSCAR

Vision-language pre-training (VLP) is widely used for learning cross-modal representations. It suffers, however, from two issues [16]: a difficulty in discerning features due to the overlap of their image regions, and a lack of alignment between caption words and their corresponding image regions. [16] remedy this by using object tags as "anchor points". More specifically, they use triples as inputs, composed of image region features, object tags and the word sequence (caption). This helps because when one channel is incomplete or noisy, the other might complete the information (an object can be described both through the image and through language). It therefore becomes simple to make the alignments, because the most important elements in the image appear in the matching caption and are also the ones expected to receive the most attention.

Implementation Details OSCAR detects object tags using Faster R-CNN [20] and presents a two-view perspective:
(1) A dictionary view with a linguistic semantic space encompassing the tags and caption tokens, and a visual semantic space where the image regions lie (Fig. 1).
(2) A modality view that consists of an image modality containing image features and tags, and a language modality with caption tokens. The total pre-training loss is defined as the sum of a masked token loss for predicting masked tokens from the linguistic semantic space and a contrastive loss for predicting whether an image-tag sequence is polluted (i.e. contains replaced tags).

At inference time, the input consists of image regions and tags. At each time step of the generation, a [MASK] token is appended to the sequence and then replaced by a token from the vocabulary, until the [STOP] token is generated.
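As a rough illustration of this masked-token decoding scheme, the loop below sketches how a caption could be generated one token at a time. The model call, the token ids and the greedy selection are placeholders for whatever BERT-style network and vocabulary are actually used; they are not OSCAR's real interface.

```python
import torch

MASK_ID, STOP_ID, MAX_LEN = 103, 102, 20  # placeholder ids for an assumed BERT-style vocabulary

def generate_caption(model, region_feats, tag_ids):
    """Greedy masked-token decoding: append [MASK], predict it, repeat until [STOP]."""
    caption = []
    for _ in range(MAX_LEN):
        # Input = tokens generated so far + a fresh [MASK] slot, plus the tags and region features.
        tokens = torch.tensor([caption + [MASK_ID]])
        logits = model(tokens, tag_ids, region_feats)  # assumed signature: returns (B, seq_len, vocab)
        next_id = logits[0, -1].argmax().item()        # fill the [MASK] position (greedy for simplicity)
        if next_id == STOP_ID:
            break
        caption.append(next_id)
    return caption
```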
Figure 2: Visual vocabulary used by VIVO [10]. Objects that are semantically similar are closer together. o represents regions and + represents tags. Yellow objects and tags are novel.

2.3. VIVO

In the nocaps [1] challenge, the only allowed image-caption dataset is the MS COCO [17] one, making conventional VLP methods inapplicable [10]. For that reason, [10] came up with VIVO (VIsual VOcabulary pre-training). What it does differently is define a "visual vocabulary", a joint embedding space of tags and image region features in which the vectors of semantically close objects (e.g. accordion and instrument) lie close to each other (Fig. 2). After pre-training the vocabulary, the model is fine-tuned with image-caption pairs using the MS COCO dataset [17]. The key difference between VIVO and other VLP models is that VIVO is only pre-trained on image-tag pairs: no captions are involved before fine-tuning. This can prove very useful, since tags are easier to generate automatically, which allows a huge number of them to be used at no annotation cost.

Implementation Details VIVO uses a multi-layer Transformer, responsible for aligning tags with their corresponding image region features, followed by a linear layer and a softmax. During pre-training, image region features are extracted from the input image using UpDown's object detector [4] and fed to the Transformer along with a set of image-tag pairs. One or more tags are randomly masked, and the model predicts them based on the remaining tags and the image regions.

In fine-tuning, the model is fed a triplet of image regions, tags and a caption, where some of the caption's tokens are randomly masked and the model learns to predict them. A uni-directional attention mask is applied, and the parameters are optimized using a cross-entropy loss.

At inference time, image region features are extracted from the input image and tags are detected. A caption is then generated one token at a time in an auto-regressive manner (using the previous tokens as input) until the end token is generated or the caption reaches its maximum length.
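To give a feel for this masked-tag pre-training step, the sketch below shows one way the prediction head could be wired up. The Transformer encoder, its sizes, the masking strategy and the loss computation are our own simplifying assumptions, not VIVO's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedTagPretrainer(nn.Module):
    """Illustrative VIVO-style head: predict masked tags from image regions + remaining tags."""

    def __init__(self, tag_vocab_size, feat_dim=2048, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.region_proj = nn.Linear(feat_dim, d_model)        # project detector features
        self.tag_embed = nn.Embedding(tag_vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.classifier = nn.Linear(d_model, tag_vocab_size)   # linear layer (+ softmax in the loss)

    def forward(self, region_feats, tag_ids, mask_positions):
        # region_feats: (B, R, feat_dim); tag_ids: (B, T); mask_positions: (B, T) boolean.
        tags = self.tag_embed(tag_ids)
        tags = tags.masked_fill(mask_positions.unsqueeze(-1), 0.0)   # hide the masked tags
        x = torch.cat([self.region_proj(region_feats), tags], dim=1)
        h = self.encoder(x)
        tag_states = h[:, region_feats.size(1):]                     # states aligned with the tag slots
        return self.classifier(tag_states)                           # (B, T, tag_vocab_size)

# Training objective on the masked positions only, e.g.:
#   logits = model(region_feats, tag_ids, mask_positions)
#   loss = F.cross_entropy(logits[mask_positions], tag_ids[mask_positions])
```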
Figure 3: SPICE scene graph for the caption "A young girl standing on top of a tennis court" [3]. The objects are marked in red, the relations in blue and the attributes in green.

2.4. Meta Learning

One of the drawbacks of reinforcement learning is the reward hacking problem, in other words overfitting on the reward function, which occurs when the agent finds a way to maximize the score without generating captions of better quality. When using CIDEr optimization [21], for example, common phrases are given less weight and captions that are too short are penalized. As a result, when a short caption is generated, common phrases are appended to make it longer, ending up with unnatural sentence endings such as "a little girl holding a cat in a of a." [15]

[15] introduce meta learning, which consists in learning a meta model that is able to optimize and adapt to several different tasks [7]. In this case, the model simultaneously optimizes the reward function (reinforcement task) and uses supervision from the ground truth (supervision task) by taking gradient steps in both directions. This guarantees the distinctiveness of the captions as well as their propositional correctness, and results in sound, human-like sentences.

Additionally, they import the SPICE [3] metric and add it to the CIDEr [27] reward term, since it performs semantic propositional evaluation using a scene graph. What this means is that an unusual caption ending will show up as an object-relation pair in the scene graph without a match (Fig. 3). Unfortunately, SPICE has a reward hacking issue of its own, since it allows duplicate tuples. It is therefore not easy to develop an ideal evaluation metric.

Implementation Details [15] use the UpDown architecture [4] as outlined above. The two tasks that the model needs to optimize are the maximum likelihood estimate (MLE) task and the reinforcement learning task. In other words, it needs to take two gradient steps in order to update the parameter θ. In the first step, the model adapts θ to the two tasks and calculates their respective losses. Then, θ is updated in what is called a "meta update". Doing things this way, the model learns a parameter θ that optimizes both tasks, instead of simply taking a step in between the two gradients obtained when adding up the losses.
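Schematically, the meta update can be written as a MAML-style inner/outer step over the two tasks. The snippet below is a conceptual sketch under our own notation: mle_loss_fn and rl_loss_fn are assumed helpers that return differentiable losses for the supervision task and a surrogate (REINFORCE/SCST-style) reward task; this is not the authors' actual training code.

```python
import torch

def meta_step(theta, mle_loss_fn, rl_loss_fn, inner_lr=1e-4, meta_lr=1e-4):
    """One schematic meta update over two tasks: supervised MLE and RL reward maximization.

    theta: list of parameter tensors with requires_grad=True.
    *_loss_fn: callables mapping a parameter list to a scalar, differentiable loss (assumed helpers).
    """
    meta_grads = [torch.zeros_like(p) for p in theta]
    for loss_fn in (mle_loss_fn, rl_loss_fn):
        # Inner step: adapt the parameters to this task and measure the loss after adaptation.
        grads = torch.autograd.grad(loss_fn(theta), theta, create_graph=True)
        adapted = [p - inner_lr * g for p, g in zip(theta, grads)]
        adapted_loss = loss_fn(adapted)
        # Accumulate the gradient of the post-adaptation loss with respect to the original theta.
        for mg, g in zip(meta_grads, torch.autograd.grad(adapted_loss, theta)):
            mg += g
    # Meta update: move theta towards a point that improves both tasks after adaptation,
    # rather than simply stepping along the sum of the two raw task gradients.
    with torch.no_grad():
        for p, mg in zip(theta, meta_grads):
            p -= meta_lr * mg
```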
2.5. Conditional GAN-Based Model

To overcome reward hacking, [5] use discriminator networks to decide whether a generated caption comes from a human or from a machine. Since they do not give their model a name, we will call it IC-GAN (Image Captioning GAN) for the sake of practicality.

Implementation Details [5] experimented with two different architectures for the discriminator: one using a CNN with a fully connected layer and a sigmoid transformation, and the other an RNN (LSTM) with a fully connected layer and a softmax. They also experiment with an ensemble of 4 CNNs and 4 RNNs. For the generator, a number of different architectures were used, but in the results we will focus on the generator that uses the UpDown architecture [4]. In all cases, the generator and the discriminator need to be pre-trained before being alternately fine-tuned.
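As an illustration of the CNN variant of the discriminator, the sketch below scores a caption together with an image feature and outputs the probability that the caption is human-written. The embedding size, convolution widths and the way the image feature is injected are assumptions made for the example, not the exact design of [5].

```python
import torch
import torch.nn as nn

class CNNCaptionDiscriminator(nn.Module):
    """Illustrative CNN discriminator: (caption tokens, image feature) -> P(human-written)."""

    def __init__(self, vocab_size, embed_dim=256, feat_dim=2048, n_filters=128, widths=(2, 3, 4)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # 1-D convolutions act as n-gram detectors over the caption.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, n_filters, kernel_size=w) for w in widths)
        # Fully connected layer + sigmoid, as described for the CNN variant.
        self.fc = nn.Linear(n_filters * len(widths) + feat_dim, 1)

    def forward(self, caption_ids, image_feat):
        # caption_ids: (B, T); image_feat: (B, feat_dim)
        x = self.embed(caption_ids).transpose(1, 2)             # (B, embed_dim, T)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        h = torch.cat(pooled + [image_feat], dim=1)
        return torch.sigmoid(self.fc(h)).squeeze(1)             # probability the caption is human-written
```

During adversarial fine-tuning, the generator would be rewarded for captions that drive this probability up, while the discriminator is trained to tell generated captions from ground-truth ones.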
2.6. Evaluation Metrics

To compare the quality of the generated captions to the ground truth, a number of evaluation metrics are used, the most common ones being CIDEr, SPICE, BLEU and METEOR. The metrics common to all the covered literature are CIDEr and SPICE, which is why we will be using them. CIDEr [27] is an image captioning evaluation metric that uses term frequency-inverse document frequency (TF-IDF) [22] weighting to measure consensus with human-written reference captions. SPICE [3] is a newer, semantic concept-based caption assessment metric built on the scene graph, a graph-based semantic representation (Fig. 3) [11][24].
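As a toy illustration of the TF-IDF idea behind CIDEr (the real metric works on n-grams up to length 4 with corpus-level statistics and further refinements), the snippet below computes a TF-IDF-weighted cosine similarity between a candidate caption and a set of references using unigrams only; it is a simplified sketch, not the official implementation.

```python
import math
from collections import Counter

def tfidf_cosine(candidate, references, corpus):
    """Toy unigram TF-IDF cosine similarity between a candidate caption and references."""
    def idf(word):
        docs_with_word = sum(word in doc.split() for doc in corpus)
        return math.log((1 + len(corpus)) / (1 + docs_with_word))  # ubiquitous words -> ~0 weight

    def vec(text):
        return {w: c * idf(w) for w, c in Counter(text.split()).items()}

    def cosine(a, b):
        dot = sum(a[w] * b.get(w, 0.0) for w in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    cand = vec(candidate)
    # Consensus: average the similarity against every human reference caption.
    return sum(cosine(cand, vec(r)) for r in references) / len(references)

# Common words ("a", "on") get low IDF weight, so agreement on content words ("frisbee") dominates.
corpus = ["a dog catches a frisbee", "a man rides a horse", "two cats on a couch"]
print(tfidf_cosine("a dog catches a frisbee",
                   ["a dog jumping to catch a frisbee", "a dog catches a frisbee outside"],
                   corpus))
```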
2.7. Benchmarks

[1] develop a benchmark called nocaps which, in addition to the image captioning dataset Microsoft COCO Captions [17], makes use of the Open Images object detection dataset [13] to introduce novel objects not seen in the former. The nocaps benchmark is made up of 166,100 captions that describe 15,100 images from the Open Images validation and test sets. The OSCAR [16] and VIVO [10] methods are evaluated on the nocaps validation set [1]. Karpathy splits [12] are used in the evaluation of the meta learning model [15] and of IC-GAN [5]. Finally, UpDown [4] is evaluated on both benchmarks.

3. Results

3.1. MS COCO Karpathy Splits Benchmark

[4] initially ran experiments on both an ablated Resnet [8] baseline model and the UpDown model to measure the impact of bottom-up attention. Since the meta learning model [15] and IC-GAN [5] use the UpDown architecture themselves, we have decided to include the Resnet baseline in the comparison (Table 1) to get an idea of the individual effect of the bottom-up + top-down approach.

Method                       CIDEr   SPICE
Resnet Baseline              111.1   20.2
UpDown                       120.1   21.4
MLE Maximization             110.2   20.3
*RL Maximization             120.4   21.3
*MLE + RL Maximization       119.3   21.2
*Meta Learning               121.0   21.7
IC-GAN (UpDown/CNN-GAN)      123.2   22.1
IC-GAN (UpDown/RNN-GAN)      122.2   22.0
IC-GAN (UpDown/ensemble)     125.9   22.3

Table 1: Results of the overall performance on the MS COCO Karpathy test split [15][5]. Methods marked with * use reinforcement learning with CIDEr optimization.

We notice that UpDown shows an important gain in performance, going from 111.1 to 120.1 in CIDEr and from 20.2 to 21.4 in SPICE. This represents a relative improvement of 8% in CIDEr and 6% in SPICE. Adding bottom-up attention therefore has an important positive impact on image captioning.

Concerning the experiments in [15], we observe that the model that uses meta learning reaches a CIDEr score of 121.0 and a SPICE score of 21.7. It is the most performant on both evaluation metrics compared to maximizing the maximum likelihood estimate, reinforcement learning, and the MLE+RL maximization (which relies on simply adding up the gradients from the supervision and reinforcement tasks). It also shows a slight improvement over the UpDown model without meta learning.

IC-GAN, on the other hand, shows the highest performance, with a significant improvement over all three models. The relative improvements compared to the UpDown model range between 1.7% (RNN-GAN) and 4.6% (ensemble). Additionally, compared with conventional reinforcement learning approaches, the proposed adversarial learning method boosts the performance by 8.9% to 16.3% [5].

It is important to note that although using a CNN-GAN slightly improves the score compared to using an RNN-GAN, the latter can save up to 30% of training time compared to the former [5].

Evaluation scores aside, IC-GAN is able to generate human-like captions and avoids mistakes commonly made by traditional reinforcement learning methods, such as duplicated words and logical errors like "a group of people standing on top of a clock" [5].
Method                in-domain       near-domain     out-of-domain   overall
                      CIDEr   SPICE   CIDEr   SPICE   CIDEr   SPICE   CIDEr   SPICE
Validation Set
UpDown (2019)          78.1    11.6    57.7    10.3    31.3     8.3    55.3    10.1
UpDown + CBS           80.0    12.0    73.6    11.3    66.4     9.7    73.1    11.1
UpDown + ELMo + CBS    79.3    12.4    73.8    11.4    71.7     9.9    74.3    11.2
OSCAR (2020)           79.6    12.3    66.1    11.5    45.3     9.7    63.8    11.2
OSCAR + CBS            80.0    12.1    80.4    12.2    75.3    10.6    79.3    11.9
OSCAR + SCST + CBS     83.4    12.0    81.6    12.0    77.6    10.6    81.1    11.7
VIVO (2020)            88.8    12.9    83.2    12.6    71.1    10.6    81.5    12.2
VIVO + CBS             90.4    13.0    84.9    12.5    83.0    10.7    85.3    12.2
VIVO + SCST + CBS      92.2    12.9    87.8    12.6    87.5    11.5    88.3    12.4
Human                  84.4    14.3    85.0    14.3    95.7    14.0    87.1    14.2

Table 2: CIDEr and SPICE scores on the nocaps validation set [1], broken down into the in-domain, near-domain and out-of-domain subsets.
3.2. nocaps Benchmark

The OSCAR model [16] is characterized by being highly parameter-efficient, thanks to the anchor points making the learning of semantic alignments easier. When used on its own, it outperforms the UpDown model on all in-domain, near-domain and out-of-domain subsets. By adding Constrained Beam Search [2] and Self-Critical Sequence Training (SCST) [21], the performance improves tremendously, particularly on the out-of-domain subset, going from 45.3 to 77.6 in CIDEr and from 9.7 to 10.6 in SPICE.

OSCAR does not score as high as VIVO, however, as shown in Table 2. In the in-domain case, VIVO on its own outperforms the combinations OSCAR+SCST+CBS and UpDown+ELMo+CBS [19] by a margin ranging from 5.4 to 9.5 in CIDEr and from 0.5 to 0.9 in SPICE. The VIVO+SCST+CBS version shows the highest performance, with CIDEr scores that even surpass the human ones on the in-domain and near-domain subsets. The out-of-domain results still show strong scores that surpass all other models.

4. Discussion

It seems that the current research on image captioning is heavily focused on deep learning techniques, and for a good reason. Image captioning is a very complex task that combines both computer vision and natural language processing, and it therefore needs powerful techniques that can handle that level of complexity. Attention mechanisms, along with deep reinforcement and adversarial learning, appear to be actively researched methods for this task, as showcased in this paper. Faster R-CNN is a popular network choice, along with the LSTM. In particular, the UpDown model seems to be used as a basis for multiple papers published between 2018 and 2020, which gives it a key role and a notable impact on the advances in the field. Novel object captioning also seems to be gathering a lot of interest after proving its usefulness.

Although the VIVO and OSCAR models do not show scores as high as the ones that use meta learning and adversarial learning, they are superior in terms of their usability "in the wild", since the MS COCO dataset [17] on which all of these models are trained contains only a small part of the objects that we run into in real life. We should therefore not take evaluation scores at face value, especially after demonstrating the reward hacking problem in reinforcement learning. In the future, additional effort should be put into making more robust reward functions and into more research on novel object captioning, especially into replacing human-annotated object detection datasets with fully machine-generated tags.

5. Conclusions

Image captioning is an active research topic that creates space for competition among researchers. We notice that the existing review papers do not cover some of the important recent advances, even though new methodologies with strong performance keep appearing. The ongoing research on image captioning is focused on deep learning-based methods, where attention mechanisms are used alongside deep reinforcement and adversarial learning. Our paper discusses the recent methods and their implementations. State-of-the-art techniques include UpDown, OSCAR, VIVO, Meta Learning and a GAN-based model. The GAN-based model is the most performant, UpDown has had the most impact, and OSCAR and VIVO are the most useful "in the wild". We hope this review will provide the community with a complementary guideline, alongside the existing review papers, for further research on the image captioning topic.
References

[1] H. Agrawal, K. Desai, Y. Wang, X. Chen, R. Jain, M. Johnson, D. Batra, D. Parikh, S. Lee, and P. Anderson. nocaps: novel object captioning at scale. CoRR, abs/1812.08658, 2018.
[2] P. Anderson, B. Fernando, M. Johnson, and S. Gould. Guided open vocabulary image captioning with constrained beam search. CoRR, abs/1612.00576, 2016.
[3] P. Anderson, B. Fernando, M. Johnson, and S. Gould. SPICE: Semantic propositional image caption evaluation. Computer Vision – ECCV 2016, Lecture Notes in Computer Science, pages 382–398, 2016.
[4] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and VQA. CoRR, abs/1707.07998, 2017.
[5] C. Chen, S. Mu, W. Xiao, Z. Ye, L. Wu, and Q. Ju. Improving image captioning with conditional generative adversarial nets. Proceedings of the AAAI Conference on Artificial Intelligence, 33:8142–8150, 2019.
[6] M. Chohan, A. Khan, M. Saleem, S. Hassan, A. Ghafoor, and M. Khan. Image captioning using deep learning: A systematic literature review. International Journal of Advanced Computer Science and Applications, 11(5), 2020.
[7] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[9] M. Z. Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga. A comprehensive survey of deep learning for image captioning, 2018.
[10] X. Hu, X. Yin, K. Lin, L. Wang, L. Zhang, J. Gao, and Z. Liu. VIVO: Visual vocabulary pre-training for novel object captioning, 2021.
[11] J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. A. Shamma, M. S. Bernstein, and L. Fei-Fei. Image retrieval using scene graphs. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[12] A. Karpathy and F. Li. Deep visual-semantic alignments for generating image descriptions. CoRR, abs/1412.2306, 2014.
[13] I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, A. Veit, S. Belongie, V. Gomes, A. Gupta, C. Sun, G. Chechik, D. Cai, Z. Feng, D. Narayanan, and K. Murphy. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages, 2017.
[14] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, M. S. Bernstein, and F. Li. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. CoRR, abs/1602.07332, 2016.
[15] N. Li, Z. Chen, and S. Liu. Meta learning for image captioning. Proceedings of the AAAI Conference on Artificial Intelligence, 33:8626–8633, 2019.
[16] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al. OSCAR: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pages 121–137. Springer, 2020.
[17] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014.
[18] K. C. Nithya and V. V. Kumar. A review on automatic image captioning techniques. In 2020 International Conference on Communication and Signal Processing (ICCSP), pages 0432–0437, 2020.
[19] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. CoRR, abs/1802.05365, 2018.
[20] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
[21] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical sequence training for image captioning. CoRR, abs/1612.00563, 2016.
[22] S. Robertson. Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5):503–520, 2004.
[23] B. J. Scholl. Objects and attention: The state of the art. Cognition, 80(1).
[24] S. Schuster, R. Krishna, A. Chang, L. Fei-Fei, and C. D. Manning. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. Proceedings of the Fourth Workshop on Vision and Language, 2015.
[25] R. Staniūtė and D. Šešok. A systematic literature review on image captioning. Applied Sciences, 9(10):2024, 2019.
[26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
[27] R. Vedantam, C. L. Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.