
Expert Systems With Applications 235 (2024) 121168


Review

A survey on multimodal bidirectional machine learning translation of image and natural language processing
Wongyung Nam, Beakcheol Jang ∗
Graduate School of Yonsei University, 50 Yonsei-ro Seodaemun-gu, Seoul, 03722, South Korea

ARTICLE INFO

Keywords: Computer vision and natural language processing, Deep learning, Image captioning, Image synthesis, Machine learning, Multimodal

ABSTRACT

Advances in multimodal machine learning help artificial intelligence to resemble human intellect more closely, which perceives the world from multiple modalities. We surveyed state-of-the-art research on the modalities of bidirectional machine learning translation of image and natural language processing (NLP), which address a considerable proportion of human life. Recently, with the advances in deep learning model architectures and learning methods in the fields of image and NLP, considerable progress has been made in multimodal machine learning translations that can be built by integrating image and NLP. Our goal is to explore and summarize state-of-the-art research on multimodal machine learning translation and present a taxonomy for the multimodal bidirectional machine learning translation of image and NLP. Furthermore, we review the evaluation metrics and compare the state-of-the-art approaches that influence this field. We believe that this survey will become a cornerstone of future research by discussing the challenges in multimodal machine learning translation and the direction of future research based on an understanding of state-of-the-art research in the field.

1. Introduction

Multimodal machine learning translation is the task of using artificial intelligence to automatically translate content between different modalities. A sensory modality is defined as one aspect of a stimulus, or of what is perceived after a stimulus, for example, light, sound, temperature, taste, and pressure. Based on these concepts, we understand that the human world comprises a combination of multiple modalities. When an object is placed in front of a human being, humans recognize it through the stimulation of various senses, such as its temperature, smell, color, and texture. Expressing a smell perceived by the nose through language or text, or expressing a scene perceived by vision through an image or gesture, are translations between different modalities. Therefore, in artificial intelligence, freely designing and building architectures using multimodality is an important challenge for understanding the human world. This study defines the scope of multimodal machine learning translation as the translation between two modalities: the vision modality, which is represented with an image in computer vision, and the natural language modality, which is represented in text through machine translation and neural machine translation. Applications that perform image-to-text translation include Image Captioning (Anderson et al., 2018; Dai, Fidler, Urtasun, & Lin, 2017; Elliott & Keller, 2013; Herdade, Kappeler, Boakye, & Soares, 2019; Hu et al., 2021; Kulkarni et al., 2013; Li, Kulkarni, Berg, Berg, & Choi, 2011; Xu et al., 2015; Zhou et al., 2020), Text Retrieval (Karpathy, Joulin, & Li, 2014; Karpathy & Li, 2017; Lee, Chen, Hua, Hu, & He, 2018; Ordonez, Kulkarni, & Berg, 2011; Socher, Karpathy, Le, Manning, & Ng, 2014), Question Generation (QG) (Changpinyo et al., 2022), and Multimodal Named Entity Recognition (MNER) (Moon, Neves, & Carvalho, 2018; Yu, Jiang, Yang, & Xia, 2020). Moreover, applications that perform text-to-image translation include Image Synthesis (Ramesh et al., 2021; Xu et al., 2018; Yu et al., 2022; Zhang et al., 2017, 2019; Zhu, Pan, Chen, & Yang, 2019) and Image Retrieval (Karpathy et al., 2014; Lee et al., 2018; Socher et al., 2014). We summarize the state-of-the-art models of these applications.

Previous studies related to machine learning translation between computer vision and Natural Language Processing (NLP) have presented multimodal machine learning in one direction, from image to natural language or from natural language to image. Our survey is worthwhile in that we summarize the state-of-the-art research in the field of multimodal bidirectional machine learning translation of image and NLP. Another related research trend is that the scope of multimodal machine learning research has been broad, including computer vision, natural language, and sound. However, our survey narrows the modality scope to two modalities, computer vision, especially image, and NLP, which are considered most closely related to human life. This makes it possible to understand in more detail what the characteristics of each modality are and how they are reflected in the algorithms.

∗ Corresponding author.
E-mail addresses: [email protected] (W. Nam), [email protected] (B. Jang).

https://doi.org/10.1016/j.eswa.2023.121168
Received 22 February 2023; Received in revised form 31 July 2023; Accepted 8 August 2023
Available online 12 August 2023
0957-4174/© 2023 Elsevier Ltd. All rights reserved.

Fig. 1. Overview of the multimodal bidirectional machine learning translation of image and natural language processing (NLP) tasks and taxonomies of the most relevant approaches.

We suggest a taxonomy of four categories according to the type of model architecture and the representation method of image and text features: retrieval-based, template-based, encoder–decoder-based, and generative-based models. We also review the model structure and features of the suggested taxonomy. The models in each taxonomy encode visual and text features in different manners, follow different objective functions, and evaluate model performance using different evaluation metrics. We then compare each model and identify further useful models that have built effective pipelines to transform the input modality into an output modality without information loss or distortion. This study aims to provide a better understanding of this field by reviewing state-of-the-art research. The summary and taxonomy are presented in Fig. 1.

The main contributions of this survey can be summarized as follows:

• We summarize the state-of-the-art research in the field of multimodal bidirectional machine learning translation of image and NLP.
• We suggest a taxonomy that classifies models into four categories according to the representation method, model architecture, and evaluation method.
• We review the datasets and evaluation metrics commonly used to explore the applications of multimodal bidirectional machine learning translation.
• We give an overview and discussion of the challenges and future directions.

In Section 2, we review previous research on multimodal machine learning translation and discuss the differences between our study and previous research. In Sections 3 to 6, we classify state-of-the-art models into four categories: retrieval-, template-, encoder–decoder-, and generative-based models. In Section 7, we discuss the methodology used to evaluate each model. We review datasets commonly used to evaluate model performance and discuss the metrics used as indicators to evaluate model performance for each category. In Section 8, we compare the results of state-of-the-art models and, in Sections 9 and 10, we conclude with a discussion on future directions.

2. Related work

Studies have been conducted on multimodal machine learning translation from image to NLP or from NLP to image (Agnese, Herrera, Tao, & Zhu, 2020; Baltrusaitis, Ahuja, & Morency, 2018; Bernardi et al., 2017; Frolov, Hinz, Raue, Hees, & Dengel, 2021; Guo, Wang, & Wang, 2019; Hossain, Sohel, Shiratuddin, & Laga, 2019; Stefanini et al., 2022), as it is a research field that continues to be of interest.

Bernardi et al. (2017), Hossain et al. (2019) and Stefanini et al. (2022) reviewed image captioning, a problem related to image-to-text applications. It refers to an application that takes an image as input and outputs a text that describes the input image. Therefore, it is a representative task of multimodal machine learning translation of image and NLP. Bernardi et al. (2017) sorted the models related to image captioning into three categories (direct generation, retrieval from visual space, and retrieval from multimodal space): direct generation models and retrieval-based models, where the latter have two subgroups, retrieval from visual space and retrieval from multimodal space. Direct generation models first predict the image content in terms of objects, attributes, scene type, and actions based on visual features. This content information is then used to generate text that describes the image. A retrieval-based model is based on searching for similar images in a certain database and retrieving text relative to that image. Subsequently, a text description for a novel image is generated based on the text retrieved. It is then classified into subgroups according to whether the visual feature mapping space is visual or multimodal. The survey in Stefanini et al. (2022) focused on state-of-the-art models for image captioning, focusing on methods of visual encoding and text generation, including the pretraining paradigm and masked language model loss. Image representation methods focus on the following four approaches: non-attentive, additive attention, graph-based, and self-attention. Moreover, text generation is focused on long short-term memory (LSTM), convolutional neural network (CNN), transformer-based, and image-to-text early fusion methods, such as bidirectional encoder representations from transformers (BERT). The survey in Hossain et al. (2019) focused on deep learning models of the image captioning field according to the number of captions, model architecture, type of learning, feature mapping space type, and language model type.


Frolov et al. (2021) surveyed the text-to-image task that generates images from text descriptions based on generative adversarial networks (GANs) (Goodfellow et al., 2014) in an unsupervised manner. Image synthesis from text is a representative machine learning translation application that takes text as input and outputs an image, unlike image captioning. In the survey of Frolov et al. (2021), only models with GAN-based architectures were surveyed and evaluated as excellent for image generation tasks. In addition to methods for text-to-image tasks that use a single text as input, Frolov et al. (2021) surveyed models that used additional information, such as multiple captions, semantic masks, and scene graphs, as additional input values.

Agnese et al. (2020) proposed a taxonomy and reviewed state-of-the-art deep learning models based on GANs and encoder–decoder neural networks that perform text-to-image synthesis. In Agnese et al. (2020), the authors focused on introducing GAN-based models in four enhancement categories: semantic, resolution, diversity, and motion.

Our survey contributes to the literature by covering not only unidirectional multimodal machine learning translation but also bidirectional multimodal machine learning translation of image and NLP. Baltrusaitis et al. (2018) and Guo et al. (2019) presented surveys on multimodal machine learning among three modalities: vision, NLP, and sound. However, the nature of each modality is different, and this significantly influences the direction in which they should be represented in the machine learning model. Therefore, we focus on the multimodal machine learning translation between two modalities, vision and NLP, which are the most actively studied in the artificial intelligence (AI) field.

Bernardi et al. (2017) noted some challenges remaining in machine learning for image and NLP. Concerning vision, an important task is that objects not depicted in the image should be describable with text: some descriptions may require background knowledge in certain situations, such as describing the work of a famous painter. In addition, vision must consider the discrimination of important and less important elements in an image. Moreover, concerning NLP, an important task remains in the ability to convert the image representation into a sentence and to choose which content in the image is important or less important. In addition, challenges remain such as deciding which verbs and pronouns are used and whether grammatical accuracy is correctly reflected. We investigated the challenge of vision by reviewing how the transformer and attention mechanism work in terms of image representation, and how to overcome the challenges of NLP through the type of language model and deep neural network learning method used to improve the performance of multimodal machine learning translation of vision and NLP.

3. Retrieval-based models

From Sections 3 to 6, the characteristics of each taxonomy and the models belonging to them are reviewed. A summary of the key contributions and technologies of the described models of all taxonomies is shown in Table 1.

The first taxonomy, in which we classify state-of-the-art models according to the characteristics of the method, is retrieval-based models. In the past, retrieval-based machine learning translation was used for image or text retrieval, which adequately describes a given image or text in the translation field between image and NLP. An overview of the models corresponding to the retrieval-based category is reported in Fig. 1(b), and a simple workflow of the retrieval-based models is shown in Fig. 2.

Fig. 2. Simple structured workflow of the retrieval-based models.

Ordonez et al. (2011) presented a relatively simple retrieval-based model that calculates the global similarity from a dataset C consisting of an image and image caption set, retrieves the most similar image, and then outputs the corresponding image caption. The first process involves creating a similar dataset M, which is created by extracting the data most similar to the given image I. The image content of dataset M is re-ranked by estimating the similarity with the given image I based on five types of object weights: object, stuff, people, scenes, and term frequency and inverse document frequency (TFIDF). Finally, based on the bilingual evaluation understudy (BLEU) score, this re-ranking result is trained with two models, a linear regression model and a linear support vector machine (SVM), which then consists of matching the appropriate caption for the test data image.
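To make the retrieval-based workflow above concrete, the following is a minimal sketch of global-similarity caption retrieval: images are reduced to global feature vectors, the nearest database image to a query is found by cosine similarity, and its caption is returned. The feature extractor and variable names are illustrative assumptions, not the implementation of Ordonez et al. (2011).

```python
import numpy as np

def cosine_sim(query, database):
    # Cosine similarity between one query vector and a matrix of database vectors.
    query = query / (np.linalg.norm(query) + 1e-8)
    database = database / (np.linalg.norm(database, axis=1, keepdims=True) + 1e-8)
    return database @ query

def retrieve_caption(query_feat, db_feats, db_captions, top_k=5):
    """Return the captions of the top-k most similar database images."""
    scores = cosine_sim(query_feat, db_feats)           # (N,)
    ranked = np.argsort(-scores)[:top_k]                # indices of best matches
    return [(db_captions[i], float(scores[i])) for i in ranked]

# Usage with toy global features (in practice, e.g., GIST or CNN descriptors).
db_feats = np.random.rand(1000, 512)
db_captions = [f"caption {i}" for i in range(1000)]
query_feat = np.random.rand(512)
print(retrieve_caption(query_feat, db_feats, db_captions, top_k=3))
```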
Karpathy et al. (2014) and Socher et al. (2014) introduced the deep neural network models SDT-RNN and SDTRNN-FG, which perform image and text retrieval tasks. Text features are represented based on the dependency tree parse proposed by De Marneffe, MacCartney, Manning, et al. (2006). Dependency tree parsing is a method of inferring a more detailed level of sentence fragments by calculating the influence between words in a sentence. It was proposed as a manner of overcoming the limitations of previous text feature representations, which assign equal weight to all words or do not distinguish between active and passive representations. SDT-RNN (Socher et al., 2014) is based on recurrent neural networks (RNN) for mapping the text information represented by a dependency tree parse to the common embedding space. Image features are represented through a deep neural network proposed by Le (2013) and then mapped to the same embedding space. When an image is provided, a sentence corresponding to the given image is retrieved, or when a sentence is provided, an image corresponding to this text is retrieved from this embedding space. SDTRNN-FG (Karpathy et al., 2014) extracts more detailed text information based on a triplet map of the object, action, and scene to the embedding space. The fragments of the image are extracted by the region convolutional neural network (RCNN) proposed by Girshick, Donahue, Darrell, and Malik (2014). This is pretrained with ImageNet (Deng et al., 2009) and finetuned with 200 classes of the ImageNet Detection Challenge (Russakovsky et al., 2015). The objective function is calculated by adding the global ranking objective and the fragment alignment objective.
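The common idea behind these models is to project image and sentence features into a shared embedding space and train it with a ranking objective so that matching pairs score higher than mismatched pairs. The sketch below shows a generic max-margin ranking loss over two linear projections; it is a simplified stand-in under assumed feature dimensions, not the exact SDT-RNN or SDTRNN-FG objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Project image and sentence features into a shared space."""
    def __init__(self, img_dim=4096, txt_dim=300, emb_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)
        self.txt_proj = nn.Linear(txt_dim, emb_dim)

    def forward(self, img_feats, txt_feats):
        v = F.normalize(self.img_proj(img_feats), dim=-1)
        s = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return v, s

def ranking_loss(v, s, margin=0.2):
    """Max-margin ranking loss: matched (i, i) pairs should outscore mismatched pairs."""
    scores = v @ s.t()                                   # (B, B) similarity matrix
    pos = scores.diag().unsqueeze(1)                     # scores of the correct pairs
    cost_s = (margin + scores - pos).clamp(min=0)        # image -> wrong sentence
    cost_v = (margin + scores - pos.t()).clamp(min=0)    # sentence -> wrong image
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_s.masked_fill(mask, 0).mean() + cost_v.masked_fill(mask, 0).mean()

model = JointEmbedding()
v, s = model(torch.randn(8, 4096), torch.randn(8, 300))
print(ranking_loss(v, s).item())
```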


Table 1
Summary of key contributions and technologies of each model.
Model Year Key attributes
BabyTalk 2013 Suggested techniques that extract image features through DPM and a linear SVM and language features through n-grams, and train the
(Kulkarni et al., 2013) model with CRF learning and inference.
Web-scale 2011 Presented a technique for generating image descriptions based on web-scale n-gram data, which utilizes the frequency
(Li et al., 2011) counts of every possible n-gram sequence.
VDDR 2013 Proposed the assumption that utilizing the spatial relationships within the visual dependency grammar is a more
(Elliott & Keller, 2013) efficient method for generating image captions and proved it.
LRLSVM 2011 Suggested the method that generate image description based on calculate the global similarity of a query image and
(Ordonez et al., 2011) direct estimation of image content.
SDT-RNN 2014 Utilized the collapsed tree formalism from Stanford dependency parser.
(Socher et al., 2014)
BRNN 2017 Introduced an approach that utilizes multimodal embedding space and a structured objective to infer the latent
(Karpathy & Li, 2017) alignment between segments of a sentence and the regions of an image.
SDTRNN-FG 2014 Suggested using a finer level embedding method to map fragments of images and fragments of sentences into a
(Karpathy et al., 2014) common space and to directly associate these fragments across modalities.
SCAN 2018 Introduced Stacked Cross Attention, a method that discovers full latent visual-semantic alignments by utilizing both
(Lee et al., 2018) image regions and words, and infers the similarity between images and text.
UMT 2020 Suggested the applying Transformer to MNER and proposed a Multimodal Interaction Module that incorporate a
(Yu et al., 2020) text-based entity span detection, which aims to alleviate the bias of the visual content in MNER.
MNER-MA 2018 Presented the incorporation of visual contexts for named entity recognition tasks by taking both image and text as input, with
(Moon et al., 2018) an architecture consisting of a Bi-LSTM with CRF learning for multimodal NER.
Show, Attend and Tell 2015 Two variants (hard and soft attention) were attempted. Soft attention is trained by standard back-propagation and
(Xu et al., 2015) hard attention is trained by maximizing an approximate variational lower bound or, equivalently, by reinforcement
learning.
Up-down 2018 Proposed a visual attention technique that combines both bottom-up and top-down approaches.
(Anderson et al., 2018)
UnifiedVLP 2020 A shared multi-layer transformer network was proposed pretrained on a large amount dataset.
(Zhou et al., 2020)
ORT 2019 The geometric attention mechanism is utilized to incorporate information regarding the spatial relationships between
(Herdade et al., 2019) detected objects.
LEMON 2021 Proposed the Vision Language Pre-training scaling rule for image captioning and demonstrated impressive results in
(Hu et al., 2021) recognizing a wide range of long-tail visual objects in a zero-shot manner.
StackGAN-v1 2017 Stacked GAN and a novel Conditioning Augmentation(CA) were presented. CA aimed at encouraging smoothness in the
(Zhang et al., 2017) latent conditioning manifold to improve the diversity and stability.
StackGAN-v2 2019 Proposed an advanced multi-stage GAN architecture for both conditional and unconditional generative tasks. The
(Zhang et al., 2019) architecture includes multiple generators that share their parameters in a tree-like structure.
CGAN 2017 Conditional GAN is used to learn the generator instead of using MLE.
(Dai et al., 2017)
AttnGAN 2018 Proposed a novel attentional generative network with DAMSM, which allows for multi-stage refinement and is used for
(Xu et al., 2018) fine-grained text-to-image synthesis task.
DM-GAN 2019 Proposed a combined model that utilizes a GAN and dynamic memory component to generate high-quality images and
(Zhu et al., 2019) introduced a memory writing gate and response gate
DALL-E 2021 Proposed a training method to autoregressively model the text and image tokens as a single stream of data using two
(Ramesh et al., 2021) training procedures.
Parti 2022 Proposed a Parti model that uses a transformer-based image tokenizer, ViT-VQGAN, and scales the transformer model
(Yu et al., 2022) up to 20B parameter.

The introduction of a bidirectional recurrent neural network (BRNN) (Schuster & Paliwal, 1997) by Karpathy and Li (2017), which retrieves text for dense image annotation, achieved a richer description of the regional level of an image. The first step is to map text and image features to one embedding space, as in previous retrieval-based models (Karpathy et al., 2014; Socher et al., 2014). Image features are represented through an RCNN, and text features are represented based on a BRNN. A BRNN improves the ranking performance of words in sentences by calculating word representations, transforming the sequence of N words in a sentence into an h-dimensional vector for each word. Although the dimensions of words in each sentence are fixed to 50 and 200 for SDT-RNN and SDTRNN-FG, respectively, the BRNN contributes by aligning continuous sentence segments whose lengths are not fixed. The BRNN calculates both streams of the weights and biases, in order from the left word to the right word of the sentence and vice versa. The objective function follows the global ranking objective; however, the image-sentence alignment score, which measures the alignment similarity between images and sentences, was further simplified.
et al. (2018) focused on the more important part of the image and box proposal. Text features were represented through a Bidirectional
measured the similarity of the corresponding word unlike previous GRU (Bahdanau, Cho, & Bengio, 2014; Schuster & Paliwal, 1997) and
studies, wherein the similarity between the details of an image and all mapped to the embedding space to which the image features were
words was measured using attention mechanisms. A SCAN represents mapped. Subsequently, the cosine similarity of the two modality vector


A summary of the applications, representation methods, and evaluation metrics of the models corresponding to all categories is given in Table 2.

4. Template-based models

We classify the models that perform translation between image and NLP after representing text features according to a specific grammatical frame as template-based models. Unlike retrieval-based models, which search entire sentences and match them to the corresponding images, a sentence is created by combining a frame of object, adjective, and preposition according to the predefined template of each sentence. Most studies on template-based models have been based on the Support Vector Machine (SVM) rather than deep neural networks. An overview of the models corresponding to the template-based category is reported in Fig. 1(a), and a simple workflow of the template-based models is shown in Fig. 3.

Fig. 3. Simple structured workflow of the template-based models.

Kulkarni et al. (2013) and Li et al. (2011) proposed models for image captioning designed based on the prepositional phrases and modifiers that are used to describe specific objects and their relative positions and relationships when people describe a scene. Previous studies had been limited to the description of one object, verb, and scene, whereas Kulkarni et al. (2013) presented a method that contributed to creating sentences describing multiple objects, modifiers, and their relationships. Objects were detected through image detection, and each candidate object was separated using attribute classifiers. Each candidate object was processed as a prepositional relation function, and a node Conditional Random Field (CRF) that infers objects, attributes, and prepositions was constructed. Then, the CRF was calculated by reflecting the parameters of three values: the weights of the text-based potentials, the parameter weights of the image-based potentials, and the weights calculated as the trade-off between these two. The correct answer label was predicted as a graph based on the CRF calculation value, and a sentence inferred from this labeling graph was formed. Multi-scale deformable part models (DPMs) (Felzenszwalb et al., 0000; Felzenszwalb, Girshick, McAllester, & Ramanan, 2009) have been used to recognize objects. The linear SVM trained by Farhadi et al. (2009) was used to recognize stuff detectors in an input image according to the object, attribute, and preposition templates, and the N-gram architecture was used as the language model. An RBF kernel SVM was trained for the visual attribute classifiers.
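To illustrate the template idea, the sketch below fills a fixed <adjective, object, preposition, adjective, object> frame from detected objects, attributes, and a pairwise preposition, which is the general flavor of template-based generation; the slot layout and inputs are assumptions for illustration, not the exact CRF-driven template of Kulkarni et al. (2013).

```python
def template_caption(detections, preposition="against"):
    """Fill a fixed sentence template from detector output.

    detections: list of (object_name, attribute) tuples, ordered by confidence.
    Returns a caption such as 'There is a brown cow on the green grass.'
    """
    assert len(detections) >= 2, "need at least two detected objects"
    (obj1, attr1), (obj2, attr2) = detections[0], detections[1]
    return f"There is a {attr1} {obj1} {preposition} the {attr2} {obj2}."

# Usage with toy (object, attribute) pairs from an assumed detector.
print(template_caption([("cow", "brown"), ("grass", "green")], preposition="on"))
```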

Elliott and Keller (2013) presented a study of a model using Visual Dependency Representations (VDRs) and an N-gram language model. In the VDRs, the relationship between pixel overlaps, angle, and image region is classified into eight types, and the spatial relationship of the image regions is defined as a visual dependency grammar. It is divided into eight types according to the extent to which the pixels and angles of image regions X and Y overlap and the symmetrical relationship in which they are placed. The template for language generation comprises the subject region, the object region, and the relationship between the two. The sentence template is divided into five types. The dependency parse for text features follows the method proposed by McDonald, Crammer, and Pereira (2005). If one node exists, according to the number of nodes in the VDRs, the head-child relationship can be created according to the sentence template based on the Euclidean distance calculation, but if multiple nodes exist, sentence fragments must be created and combined by the model. This is expressed as a model structure. However, this model structure has a limitation in that it lacks a manner of expressing a verb. To overcome this limitation, a parallel model that can express a verb using two distributions was introduced.

5. Encoder–decoder-based models

In this section, we classify the research that performed multimodal machine learning translation with a model consisting of an encoder and decoder structure using a deep neural network as encoder–decoder based models. The encoder–decoder architecture using deep neural networks overcomes the limited flexibility in composing a sentence with retrieval-based or template-based models, because retrieval-based and template-based models search the entire sentence and match the image or follow a set template. Cho, Van Merriënboer, Bahdanau, and Bengio (2014) presented a machine translation model of an encoder–decoder structure based on neural networks as an approach for statistical machine translation. The model (Cho et al., 2014) was introduced with a structure consisting of an encoder that extracts a representation of a fixed length from the input text and a decoder that generates the correct translation from the extracted representation. These encoder–decoder structures are more effective in generating new text or images compared with the previous retrieval- or template-based methods. An overview of the models corresponding to the encoder–decoder-based category is reported in Fig. 1(c), and a simple workflow of the encoder–decoder based models is shown in Fig. 4.
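As a concrete reference point for this encoder–decoder pattern, the sketch below condenses an input sequence into a fixed-length vector with a recurrent encoder and unrolls a recurrent decoder from that vector; it is a generic, minimal seq2seq skeleton under assumed dimensions, not the exact model of Cho et al. (2014).

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: fixed-length code -> token-by-token generation."""
    def __init__(self, src_vocab=8000, tgt_vocab=8000, emb=256, hidden=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, code = self.encoder(self.src_embed(src_ids))   # fixed-length representation
        dec_states, _ = self.decoder(self.tgt_embed(tgt_ids), code)
        return self.out(dec_states)                       # (B, T, tgt_vocab) logits

model = Seq2Seq()
logits = model(torch.randint(0, 8000, (4, 15)), torch.randint(0, 8000, (4, 12)))
print(logits.shape)   # torch.Size([4, 12, 8000])
```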
Xu et al. (2015) presented a model that performs image captioning, transforming the image modality into the text modality using an encoder–decoder structure with an attention mechanism. In the encoder, the image feature is extracted using the Oxford VGGnet (Simonyan & Zisserman, 2014) pretrained on ImageNet without finetuning, and the image feature is encoded as a D-dimensional representation corresponding to a part of the image. The decoder consists of the LSTM network proposed by Hochreiter and Schmidhuber (1997), and the encoded feature is reflected as a context vector that contributes to the prediction of the word. In the decoder, one word is generated at each time step using the value extracted from the hidden state at the previous time step and the word generated by the LSTM at the previous time step as input. In the model, when given an image, the image feature is passed through a CNN that consists of L layers with D neurons in each layer and is then output as a feature vector a for each layer location.


Table 2
Overview of the multimodal bidirectional machine learning translation models. We present a taxonomy and summarize the characteristics of the most relevant models of multimodal
bidirectional machine learning translation of image and NLP. The approach indicates the taxonomy provided in this study. (Approach: R = Retrieval-based models, T = Template-based
models, E = Encoder–decoder based models, G = Generative-based models, Application: IC = image captioning, TR = text retrieval, IR = image retrieval, MNER = multimodal
named entity recognition, IS = image synthesis, Evaluation Metrics: BL = BLEU, HE = human evaluation, RK = Recall@K, ME = METEOR, R-p = R-precision, CI = CIDEr, RO =
ROUGE, IS = Inception Score, FID = Frechet Inception Distance.)
Model Modality Approach Application Representation Evaluation metrics
R T E G Image Language
BabyTalk Image → Text ✓ IC DPM (Felzenszwalb, Girshick, n_gram BL, HE
(Kulkarni et al., 2013) & McAllester, 0000),
Linear SVM (Farhadi, Endres,
Hoiem, & Forsyth, 2009)
Web-scale Image → Text ✓ IC DPM (Felzenszwalb et al., n_gram BL, HE
(Li et al., 2011) 0000),
Linear SVM (Farhadi et al.,
2009)
VDDR Image → Text ✓ IC VDRs MST-Parser BL, HE
(Elliott & Keller, 2013) (Schuster & Paliwal, 1997) (Schuster & Paliwal,
1997)
LRLSVM Image → Text ✓ IC, TR Gist Feature Extraction TFIDF BL
(Ordonez et al., 2011)
SDT-RNN Image ↔ Text ✓ IC, TR, IR Le et al., 2012 SDT-RNN RK
(Socher et al., 2014) (De Marneffe et al., 2006)
BRNN Image → Text ✓ IC, TR RCNN BRNN BL, ME, CI, RK
(Karpathy & Li, 2017) (Girshick et al., 2014) (Schuster & Paliwal,
1997)
SDTRNN-FG Image ↔ Text ✓ IC, TR, IR RCNN SDT-RNN RK
(Karpathy et al., 2014) (Girshick et al., 2014) (Socher et al., 2014)
SCAN Image ↔ Text ✓ IC, TR, IR ResNet-101 (He, Zhang, Ren, Bidirectional GRU RK
(Lee et al., 2018) & Sun, 2016a), (Bahdanau et al., 2014;
Faster R-CNN (Ren et al., Schuster & Paliwal,
2017) 1997)
UMT Image → Text ✓ MNER ResNet BERT RK, R-p, RO
(Yu et al., 2020)
MNER-MA Image → Text ✓ MNER Inception-v3 Bi-LSTM RK, R-p, RO
(Moon et al., 2018) (Nilsback & Zisserman, 2008)
Show, Attend and Tell Image → Text ✓ IC Oxford VGGnet LSTM BL, ME
(Xu et al., 2015) (Simonyan & Zisserman, 2014) (Hochreiter &
Schmidhuber, 1997)
Up-down Image → Text ✓ IC ResNet-101(He et al., 2016a), LSTM BL, ME, SP, CI, RO
(Anderson et al., 2018) Faster-R-CNN (Ren et al., (Hochreiter &
2017) Schmidhuber, 1997)
UnifiedVLP Image → Text ✓ IC Faster R-CNN (Ren et al., BERT BL, ME, SP, CI
(Zhou et al., 2020) 2017), (Devlin, Chang, Lee, &
ResNet-101 FPN (Xie, Toutanova, 2019)
Girshick, Dollar, Tu, & He,
2017)
ORT Image → Text ✓ IC ResNet-101(He et al., 2016a), Transformer BL, ME, SP, CI
(Herdade et al., 2019) Faster-R-CNN (Ren et al.,
2017)
LEMON Image → Text ✓ IC Faster R-CNN BERT BL, ME, SP, CI
(Hu et al., 2021) (Ren et al., 2017) (Devlin et al., 2019)
StackGAN-v1 Image ← Text ✓ IS PixelCNN LSTM IS, HE
(Zhang et al., 2017) (Oord, Kalchbrenner, & (Hochreiter &
Kavukcuoglu, 2016) Schmidhuber, 1997)
StackGAN-v2 Image ← Text ✓ IS PixelCNN LSTM IS, FID, HE
(Zhang et al., 2019) (Oord et al., 2016) (Hochreiter &
Schmidhuber, 1997)
CGAN Image → Text ✓ IC Oxford VGGnet LSTM BL, ME, SP, CI, HE
(Dai et al., 2017) (Simonyan & Zisserman, 2014) (Hochreiter &
Schmidhuber, 1997)
AttnGAN Image ← Text ✓ IS Inception-v3 BRNN R-p, IS
(Xu et al., 2018) (Nilsback & Zisserman, 2008) (Schuster & Paliwal,
1997)
DM-GAN Image ← Text ✓ IS Dynamic Memory BRNN R-p, IS, FID
(Zhu et al., 2019) (Schuster & Paliwal,
1997)
DALL-E Image ← Text ✓ IS dVAE BPE IS, FID, HE
(Ramesh et al., 2021) (LeCun, Bottou, Bengio,
& Haffner, 1998)
Parti Image ← Text ✓ IS ViT-VQGAN BERT FID
(Yu et al., 2022) (Yu et al., 2021) (Devlin et al., 2019)


The following three values are used as inputs to the LSTM in the decoder: an embedding matrix m as a dimensional vector embedding of the caption y(t-1) generated at time step t-1, the hidden state h(t-1) passed from the previous time step, and the context vector z calculated by applying the attention score to the output values of the encoder. Stochastic hard attention and deterministic soft attention have been proposed to obtain the attention scores.
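Deterministic soft attention of the kind described here computes a weight for every spatial location of the encoder's feature map from the decoder's previous hidden state and takes the weighted average of the location features as the context vector. The sketch below is a minimal version of that step under assumed dimensions; it is not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Compute a context vector as an attention-weighted sum of image features."""
    def __init__(self, feat_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, h_prev):
        # feats: (B, L, feat_dim), one vector per image location
        # h_prev: (B, hidden_dim), decoder hidden state from the previous step
        e = self.score(torch.tanh(self.feat_proj(feats) +
                                  self.hidden_proj(h_prev).unsqueeze(1)))  # (B, L, 1)
        alpha = F.softmax(e, dim=1)                 # attention weights over locations
        context = (alpha * feats).sum(dim=1)        # (B, feat_dim) context vector z
        return context, alpha.squeeze(-1)

attn = SoftAttention()
context, alpha = attn(torch.randn(2, 196, 512), torch.randn(2, 512))
print(context.shape, alpha.shape)    # torch.Size([2, 512]) torch.Size([2, 196])
```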
Anderson et al. (2018) introduced bottom-up attention, which extracts features by identifying them at the image pixel unit, based on the encoder–decoder structure. A Faster R-CNN (Ren et al., 2017) with ResNet (He et al., 2016a) was used. With this model, it is revealed that hard attention is more efficient. To predict the regional characteristics of an image more effectively, the mean-pooled convolutional feature that learns the embedding of the real object class is passed to the output layer as a softmax distribution concatenated with the embedding learning result of the real object class. Another peculiarity of this model is that it has two LSTM layers (Hochreiter & Schmidhuber, 1997): a top-down attention LSTM and a language LSTM.

Herdade et al. (2019) developed an Object Relation Transformer (ORT) model that represents an advancement over top-down and bottom-up attention research, as it is capable of detecting the relative position and size of objects. It has a structure that detects spatial relationships between input values by detecting objects through geometric attention. The basis of object detection is that the image feature vectors are extracted using a Faster R-CNN (Girshick et al., 2014) with ResNet-101 (He et al., 2016a). This feature vector is used as the input of the Transformer (Vaswani et al., 2017). The output of the object detector is first input to a fully connected embedding layer, used as an input token of the encoder layer, and passed through six encoder layers. Each encoder layer consists of multi-head self-attention with eight identical heads, and the output passed through these eight heads is combined and used as an input value for the decoder. To reflect the relative position and size of the objects, geometric features λ for the bounding boxes are calculated and reflected.
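The geometric attention used in ORT-style models augments the usual attention weights between detected objects with features derived from their bounding boxes, such as relative displacement and relative size. The sketch below computes such pairwise box features and folds them into an attention matrix; the exact feature set and scaling are assumptions for illustration rather than the published formulation.

```python
import torch
import torch.nn.functional as F

def box_relation_features(boxes, eps=1e-3):
    """Pairwise geometric features between detected boxes.

    boxes: (N, 4) as (cx, cy, w, h). Returns (N, N, 4) log-scaled relations.
    """
    cx, cy, w, h = boxes.unbind(-1)
    dx = torch.log(torch.abs(cx[:, None] - cx[None, :]).clamp(min=eps) / w[:, None])
    dy = torch.log(torch.abs(cy[:, None] - cy[None, :]).clamp(min=eps) / h[:, None])
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)

def geometric_attention(q, k, v, boxes, w_g):
    """Scaled dot-product attention biased by box geometry.

    q, k, v: (N, D) object features; w_g: linear layer mapping 4 box relations to 1 weight.
    """
    appearance = (q @ k.t()) / q.size(-1) ** 0.5                      # (N, N)
    geometry = F.relu(w_g(box_relation_features(boxes))).squeeze(-1)  # (N, N)
    weights = F.softmax(appearance + torch.log(geometry.clamp(min=1e-6)), dim=-1)
    return weights @ v

w_g = torch.nn.Linear(4, 1)
feats, boxes = torch.randn(36, 512), torch.rand(36, 4) + 0.1
print(geometric_attention(feats, feats, feats, boxes, w_g).shape)  # torch.Size([36, 512])
```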
Unlike studies that utilized randomly initialized models or pretrained models only for the language representation, Hu et al. (2021) and Zhou et al. (2020) presented multimodal machine learning translation models that employ a model pretrained with vision and language representations. The model proposed by Zhou et al. (2020) performs image captioning and visual question answering tasks through Unified Vision-Language Pre-training (VLP), whose structure has the encoder and decoder share a multi-layer transformer network, unlike previous encoder–decoder based multimodal translation models that use different models for the encoder and decoder. The image feature is represented by a Faster R-CNN (Ren et al., 2017) with a ResNet-101 FPN backbone (Xie et al., 2017) and the text feature is represented by BERT (Devlin et al., 2019). The architecture of BERT (Devlin et al., 2019) consists of 12 transformer blocks of an encoder–decoder, and each layer has a masked self-attention layer and a feed-forward module. Two types of self-attention masks are used, depending on the objectives. Hu et al. (2021) presented a model called LargE-scale iMage captiONer (LEMON) that improves the performance of image captioning through vision-language pre-training. To extract the image features, a Faster R-CNN (Ren et al., 2017) pretrained by Zhang et al. (2021) was used. In addition, a sequence-to-sequence attention mask (Zhou et al., 2020) was used for the encoder layer. In particular, note that large datasets and large-scale model parameters were introduced, using six to 32 layers with 200 million image-text pairs automatically collected from the web, unlike previous research that demonstrated performance using six or 12 transformer layers with approximately four million data points. This model revealed that model size does not significantly affect performance when the dataset is small but has a significant effect on large datasets.

Fig. 4. Simple structured workflow of the encoder–decoder based models.

6. Generative-based models

In the deep learning field, discriminative models have received more attention than generative models, which have been classified as unsupervised machine learning. However, with the introduction of GANs, which overcome the difficulties of approximating probabilistic computations and exploit the advantages of piecewise linear units in generative models (Goodfellow et al., 2014), research on generative models has been actively pursued. The fourth taxonomy into which we categorize state-of-the-art multimodal machine learning translation is generative-based models. Research published on the translation of vision and NLP using models with an architecture based on GANs (Goodfellow et al., 2014), Variational Autoencoders (VAEs) (Kingma & Welling, 2013), and autoregressive models, which are recognized as generative model approaches, is included in this discussion. An overview of the models corresponding to the generative-based category is reported in Fig. 1(d), and a simple workflow of the generative-based models is shown in Fig. 5.

Zhang et al. (2017, 2019) created high-resolution photo-realistic images by introducing conditioning augmentation to overcome the limitation of generating unnatural images in the existing GAN models. This research began by noting the reason why previous GAN models failed to produce natural images: the natural image and implied model distributions may not overlap in the high-dimensional pixel space (Arjovsky & Bottou, 2017; Sønderby, Caballero, Theis, Shi, & Huszár, 2016).

In the architecture of the Stacked Generative Adversarial Networks (StackGAN) model (Zhang et al., 2017), the Conditioning Augmentation (CA) enables a resampling process and outputs the phi (φ) vector created through the text embedding proposed by Reed, Akata, Lee, and Schiele (2016). In the learning of the generator, the discontinuity may appear large because the phi (φ) vector is large. Therefore, this phi (φ) vector is passed through a fully connected layer and sampled from a normal distribution with mean mu and standard deviation sigma. The vector passed through the CA block is passed through the Stage-I and Stage-II GAN blocks. In the Stage-I GAN, the basic shape and colors of low-resolution objects are sketched based on the given text description. The Stage-II GAN uses the result of the Stage-I GAN and the input text description to generate a high-resolution image with detailed information. The generator of the Stage-I GAN follows the structure of a general GAN generator; however, upsampling is performed using a deconvolution layer. The discriminator of the Stage-I GAN uses the fake images generated by the generator and real images as inputs and reduces their dimensionality through a down-sampling block.


Through this process, the Stage-I GAN classifies fake and real images from the spatial replication, which concatenates the previously output phi (φ) variable. The generator of the Stage-II GAN generates high-resolution images with the phi (φ) variable and a low-resolution image as input. It undergoes a spatial replication process similar to the Stage-I GAN and uses residual blocks to effectively transfer gradients for learning the relationship between image and text. The high-resolution fake image generated in this manner passes through the Stage-II GAN discriminator, which has the same structure as the Stage-I GAN discriminator.
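The Conditioning Augmentation step in StackGAN described above can be read as a reparameterized Gaussian sampling around the text embedding: a fully connected layer predicts a mean and a log-variance, and the conditioning vector is sampled from that distribution to smooth the conditioning manifold. The sketch below shows that step only, under assumed dimensions; it is not the full StackGAN implementation.

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Sample a smoothed conditioning vector around the text embedding."""
    def __init__(self, text_dim=1024, cond_dim=128):
        super().__init__()
        self.fc = nn.Linear(text_dim, cond_dim * 2)  # predicts mu and log-variance

    def forward(self, text_embedding):
        mu, logvar = self.fc(text_embedding).chunk(2, dim=-1)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        c_hat = mu + eps * std          # reparameterized sample fed to the generator
        return c_hat, mu, logvar

ca = ConditioningAugmentation()
c_hat, mu, logvar = ca(torch.randn(4, 1024))
# A KL term toward N(0, I) is typically added to the generator loss to keep the
# conditioning manifold smooth (assumed here, following the CA description).
kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1, dim=-1).mean()
print(c_hat.shape, kl.item())
```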
StackGAN++ (Zhang et al., 2019), which is an upgraded StackGAN (Zhang et al., 2017), was proposed as a model with the architecture of a multi-stage generation network for conditional and unconditional generation tasks. StackGAN++ (Zhang et al., 2019) has a tree structure that generates the same image from multiple sources through multiple generator and discriminator structures. This further improves the image quality compared with StackGAN (Zhang et al., 2017) and stabilizes the training by jointly approximating multiple distributions.

In encoder–decoder based models, studies (Anderson et al., 2018; Herdade et al., 2019; Xu et al., 2015) have been conducted on improving performance by producing a regional representation of the image using attention mechanisms. However, in generative-based models, no studies had been published that introduce attention mechanisms for the translation of language and vision modalities. Xu et al. (2018) presented a model that performs text-to-image translation by introducing an attention mechanism based on GANs (Goodfellow et al., 2014) and reflects the word-context vector of the text description to generate images. In addition, the concept of the Deep Attentional Multimodal Similarity Model (DAMSM) was introduced, which is an image-to-text loss that calculates the similarity between the images and the sentences generated by the generator.

Zhu et al. (2019) presented the Dynamic Memory Generative Adversarial Network (DM-GAN). The previously proposed generative-based models (Xu et al., 2018; Zhang et al., 2017, 2019) generate incorrect images in the second stage if an incorrect image is created in the first stage when generating images using the multi-stage method. In addition, they are limited in their inability to reflect the difference in the information level that each word carries about the image, because the representations of all words are processed equally. The DM-GAN (Zhu et al., 2019) is primarily composed of an initial image generation stage and dynamic memory-based image refinement. The text feature is encoded in the initial image generation step, which passes through a deep convolutional generator to create the first image x0. In the dynamic memory-based image refinement, a memory writing gate stores text information as key values; key addressing and value reading play the role of reading features from the memory module to refine the features of the low-resolution image; and a response gate controls the mixing of the features read from memory with the current image features. The memory writing gate also plays a role in selecting highly related words by calculating the similarity between the image and each word to refine the original image. A bidirectional LSTM layer pretrained by Xu et al. (2018) is used for the text encoder. The DM-GAN (Zhu et al., 2019) is a model that has achieved significant results in that it updates image features and adjusts the information flow through the gate mechanism.
GAN (Zhang et al., 2017), was proposed as a model with the archi- text data is trained, the model takes two inputs: image features and
tecture of a multi-stage generation model network for conditional and random text vector z. The image feature is passed to the CNN layer,
unconditional generation tasks. StackGAN++ (Zhang et al., 2019) has and the random vector 𝑧 is passed through the LSTM layer to generate
a tree structure that generates the same image from multiple sources word 𝑤𝑡 using conditional probability. When training the generator, the
through multiple generators and discriminator structures. This further feedback from the evaluator is trained using policy gradient (Sutton,
improves the image quality compared with StackGAN (Zhang et al., McAllester, Singh, & Mansour, 1999). In training the evaluator, three
2017) and stabilizes the training by jointly approximating multiple descriptions are considered: a human-generated description of a given
distributions. image, a description generated by a generator, and a human-generated
In encoder–decoder based models, the studies (Anderson et al., description of another image that is unrelated to the given image.
2018; Herdade et al., 2019; Xu et al., 2015) have been conducted on Generating a description of an image is extended to create paragraphs
improving the performance by performing a regional representation of from a single sentence by designing a hierarchical LSTM consisting of
the image using attention mechanisms. However, in generative-based sentence- and word-level LSTMs. First, a vector sequence is generated
models, no studies have been published on the translation of language by encoding the subject of a sentence for an image. For each sentence,
and vision modalities by introducing attention mechanisms. Xu et al. a word conditioned on the generated subject and text vector z is gener-
(2018) presented a model that performs text-to-image translation by ated. The evaluation of paragraph P first embeds each sentence through
introducing an attention mechanism based on GANs (Goodfellow et al., the word-level LSTM and then embeds the paragraph by combining the
2014) and reflects the word-context vector of the text description sentences embedded through the sentence-level LSTM. It then takes the
to generate images. In addition, the concept of the Deep Attentional dot product for paragraph P and the image feature and calculates the
Multimodal Similarity Model (DAMSM) was introduced, which is an score by applying the sigmoid activation function.
image-to-text loss that calculates the similarity between images and Yu et al. (2022) introduced Pathways Autoregressive Text-to-Image
sentences generated by the generator. (Parti), which is a regression model that supports realistic image gen-
Zhu et al. (2019) presented a Dynamic Memory Generative Adver- eration and rich content synthesis by learning with a large-scale model
sarial Networks (DM-GAN). The models (Xu et al., 2018; Zhang et al., and data compared with previous models. This model has 16 encoder
2017, 2019) previously proposed generative-based models generate layers and 64 decoder layers. Although it showed good performance
incorrect images in the second stage if an incorrect image is created in with a large size, difficulties remained in text rendering and, homonyms
the first stage when generating images using the multi-stage method. In that do not reflect the realistic size between objects, reflect numbers,
addition, they are limited in their inability to reflect the difference in reflect ironic expressions.


Table 3
Overview of the main datasets for multimodal bidirectional machine learning translation of image and NLP. (Application: IC = image captioning, TR = text retrieval, IR = image
retrieval, MNER = multimodal named entity recognition, IS = image synthesis, Evaluation Metrics: BL = BLEU, HE = human evaluation, RK = Recall@K, ME = METEOR, CI =
CIDEr, RO = ROUGE, IS = Inception Score, FID = Frechet Inception Distance).
Dataset Number of images Application Reference.
Pascal1K 1K IC, IS, IR, TR Kulkarni et al. (2013), and Li et al. (2011),
(Rashtchian, Young, Hodosh, & Hockenmaier, Karpathy et al. (2014), and Socher et al. (2014)
2010)
Oxford-102 8K IC, IS, IR, TR Zhang et al. (2017, 2019)
(Nilsback & Zisserman, 2008)
CUB 12K IC, IS, IR, TR Xu et al. (2018), and Zhang et al. (2017, 2019),
(Wah, Branson, Welinder, Perona, & Belongie, Ramesh et al. (2021), and Zhu et al. (2019)
2011)
Flicker8K 8K IC, IS, IR, TR Karpathy and Li (2017), Xu et al. (2015), and Karpathy et al.
(Hodosh, Young, & Hockenmaier, 2013) (2014)
Flicker30K 31K IC, IS, IR, TR Dai et al. (2017), Xu et al. (2015), and Zhou et al. (2020)
(Young, Lai, Hodosh, & Hockenmaier, 2014) Karpathy et al. (2014), Karpathy and Li (2017), and Lee et al.
(2018)
MSCOCO 123K IC, IS, IR, TR Anderson et al. (2018), Herdade et al. (2019), Xu et al. (2015),
(Lin et al., 2014) and Zhou et al. (2020),
(Chen et al., 2015) Hu et al. (2021), Karpathy et al. (2014), Karpathy and Li
(2017), and Zhang et al. (2017),
Ramesh et al. (2021), Xu et al. (2018), Zhang et al. (2019), and
Zhu et al. (2019)
SBU 1M IC, IS, IR, TR Ordonez et al. (2011)
(Ordonez et al., 2011)
CC 3M 3.1M IC, IS, IR, TR Hu et al. (2021), and Ramesh et al. (2021)
(Sharma et al., 2018)
CC 12M 12.2M IC, IS, IR, TR Hu et al. (2021)
(Changpinyo, Sharma, Ding, & Soricut, 2021)
ALT200M 203M IC, IS, IR, TR Hu et al. (2021)
(Hu et al., 2021)
SnapChat 10K MNER Yu et al. (2020)
(Zhang, Fu, Liu, & Huang, 2018)
Tweet 4.3M MNER Moon et al. (2018)
(Lu, Neves, Carvalho, Zhang, & Ji, 2018)

7. Evaluation methodology

7.1. Dataset

For experiments on multimodal machine learning translation between image and NLP, data in which an image is paired with a text describing that image are required. In general, some datasets are more frequently used in state-of-the-art research. A summary of the datasets commonly used in the field of image and NLP multimodal bidirectional machine learning translation is reported in Table 3.

The most commonly used datasets in multimodal bidirectional machine learning translation of image and NLP are MSCOCO (Chen et al., 2015; Lin et al., 2014), Flicker8K (Hodosh et al., 2013), Flicker30K (Young et al., 2014), and CUB (Wah et al., 2011). MSCOCO (Chen et al., 2015; Lin et al., 2014) was first released with 164K images in 2014, and 40K test images were added in 2015. In 2017, it was updated to 123K images through community feedback. Flicker8K (Hodosh et al., 2013) and Flicker30K (Young et al., 2014) are datasets collected from Flickr with five human-annotated descriptions per image; the 8K version contains 8,000 images, and the 30K version contains 31,000 images. The CUB (Wah et al., 2011) dataset concerns birds and contains 11,788 images of 200 subcategories of birds. Ten descriptions are included for each image. Each dataset contains a different number of text descriptions per image. For the Oxford-102 (Nilsback & Zisserman, 2008) and CUB (Wah et al., 2011) datasets, the total number of images is small; however, the number of text description sentences per image is relatively large. SBU (Ordonez et al., 2011), CC3M (Sharma et al., 2018), CC12M (Changpinyo et al., 2021), and ALT200M (Hu et al., 2021) include only one description per image, but these are considered large datasets because they include millions of images.

Concerning retrieval- and template-based models, experiments are performed using relatively small datasets such as Pascal1K (Rashtchian et al., 2010); however, a larger dataset is required for experiments with encoder–decoder- and generative-based models. For the MNER task, the SnapChat and Tweet datasets (Lu et al., 2018; Zhang et al., 2018), which contain named entities annotated by humans, have been used for experiments.

7.2. Evaluation metrics

The methods for measuring the performance of multimodal machine learning translation can be broadly divided into human evaluation, in which a human directly evaluates and scores the result, and automatic evaluation, in which a score is automatically measured based on a standard. The models in Elliott and Keller (2013), Kulkarni et al. (2013), Li et al. (2011), Ramesh et al. (2021), Xu et al. (2018), and Zhang et al. (2017, 2019) were subjected to human evaluation; their results were directly evaluated by human annotators. However, this evaluation method has the disadvantage that the same evaluation can differ depending on the annotators' motives or work objectives. In addition, studies have reported that the results of human evaluation differ when the annotators provide feedback on their mistakes, as they then note the flaws in the image more pessimistically (Denton, Chintala, Fergus, et al., 2015). Therefore, in most cases, either human and automatic evaluation are used together, or automatic evaluation is used as the evaluation index. Based on state-of-the-art research, we classify the automatic evaluation metrics as follows: ranking-, n-gram-, scene graph-, and visual-based metrics. Ranking-based metrics are primarily used as an index to evaluate retrieval-based models, n-gram- and scene graph-based metrics are commonly used as measurement indexes for template- or encoder–decoder-based models, and visual-based metrics are used to evaluate generative-based models.

Ranking-Based Metrics Recall@K and R-precision are ranking-based metrics, and these are evaluation metrics that have been widely used to measure performance in recommendation systems (Hodosh et al., 2013). The precision and recall for the top-K items are universally calculated because recommendation systems are generally interested in recommending the top-K items to users. Hodosh et al. (2013) introduced Recall@K and R-precision to measure the performance of image captioning. They are used to evaluate models based on the search method, not models that generate image captions with new words, and were introduced as indexes to measure how well the model understands the relationship between the image and the space of appropriate captions describing it. Recall@K measures the recall at a fixed rank K: it reports not only the median rank r of the correct response among all test queries, but also the percentage of test queries that had a correct response among the top-K results. In the research of Xu et al. (2018), R-precision was used as an index to measure the performance of generating images from text. It measures the retrieval of relevant text given an image query by calculating the cosine similarity between a global image vector and 100 candidate sentence vectors.
N-Gram-Based Metrics ROUGE (Lin, 2004), BLEU marily used to evaluate the performance of generative models, as
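The clipped (modified) n-gram precision described above can be sketched as follows. This is a simplified illustration with a single reference and no brevity penalty or geometric mean over n-gram orders, so it is not a substitute for the official BLEU implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n=1):
    """Clipped n-gram precision: each candidate n-gram count is capped by its count
    in the reference (the Max-Ref-Count clipping discussed in the text)."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

candidate = "the the the the the".split()
reference = "the cat is on the mat".split()
print(modified_precision(candidate, reference, n=1))  # 0.4 instead of 1.0, thanks to clipping
```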
METEOR (Banerjee & Lavie, 2005) is also widely used to evaluate the results of machine translation. First, matching unigrams are aligned between the machine-predicted sentence (the hypothesis) and the correct sentence (the reference), and among the possible alignments, the one with the fewest crossing links is selected. The unigram precision is then calculated as the number of matched words divided by the number of words in the hypothesis, and the unigram recall is obtained by dividing the number of matched words by the number of words in the reference. Precision and recall are then combined using a harmonic mean in which recall is weighted nine times more heavily than precision.
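A minimal sketch of the recall-weighted harmonic mean at the heart of METEOR, using exact unigram matching only; the stemming and synonym-matching modules and the fragmentation penalty of the full metric are omitted here.

```python
from collections import Counter

def meteor_fmean(hypothesis, reference):
    """Recall-weighted harmonic mean F = 10PR / (R + 9P) over exact unigram matches."""
    hyp, ref = hypothesis.split(), reference.split()
    matches = sum((Counter(hyp) & Counter(ref)).values())  # multiset intersection of unigrams
    if matches == 0:
        return 0.0
    precision = matches / len(hyp)
    recall = matches / len(ref)
    return 10 * precision * recall / (recall + 9 * precision)

print(meteor_fmean("the cat sat on the mat", "the cat is on the mat"))
```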
CIDEr (Vedantam et al., 2014) was designed to measure how well a candidate caption matches the consensus of the reference captions of an image; in the underlying annotation protocol, triplets consisting of two candidate sentences and one correct sentence, drawn from the full set of reference captions of the images, are used to collect human consensus judgments. The score itself is based on the Term Frequency-Inverse Document Frequency (TF-IDF) values of the n-gram elements of a sentence. After obtaining the TF-IDF vectors for the n-grams of the candidate sentence and the TF-IDF vectors for the n-grams of the correct (reference) sentences, the score is calculated using the cosine similarity of these two vectors.
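A toy sketch of the TF-IDF-weighted cosine similarity at the core of CIDEr, restricted to unigrams and a single reference; the actual metric averages this similarity over n-gram lengths 1-4 and over all reference captions, and the corpus, smoothing, and variable names below are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_vector(sentence, doc_freq, num_docs):
    # Term frequency of each unigram, weighted by a smoothed inverse document frequency.
    counts = Counter(sentence.split())
    total = sum(counts.values())
    return {w: (c / total) * math.log(num_docs / max(1.0, doc_freq.get(w, 0)))
            for w, c in counts.items()}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Document frequencies are collected over the reference captions of the corpus.
corpus_refs = ["a man rides a horse", "a dog runs on the grass", "a man rides a bike"]
df = Counter(w for ref in corpus_refs for w in set(ref.split()))

candidate = "a man rides a brown horse"
print(round(cosine(tfidf_vector(candidate, df, len(corpus_refs)),
                   tfidf_vector(corpus_refs[0], df, len(corpus_refs))), 3))
```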
Scene Graph-Based Metrics. The score metric called SPICE (Anderson, Fernando, Johnson, & Gould, 2016) evaluates the performance of image captions by converting both the candidate sentence and the correct answer sentence into a graph-based semantic representation called a scene graph. The previously described evaluation metrics respond sensitively to overlapping n-grams and return a high score if the verb and several prepositional phrases match, even when the subject and object differ considerably. In contrast, a scene graph encodes the relationships between objects and attributes within images and image captions. The SPICE (Anderson et al., 2016) evaluation metric operates in two stages. In the first stage, the dependencies between words in the captions are defined using a dependency parser (Klein & Manning, 2003). In the second stage, the dependency tree is mapped to a scene graph using a rule-based system. Then, given a candidate sentence and a correct answer sentence, an F-score is calculated over the logical tuples representing semantic propositions in the scene graph. However, SPICE is limited in its inability to properly capture grammatical and structural errors in a language.
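Once both captions have been turned into scene-graph tuples, the final step reduces to an F-score over the two tuple sets. In the sketch below the tuples are written by hand to stand in for the parser output; in the actual metric they come from the dependency parse and the rule-based mapping described above.

```python
def spice_like_f1(candidate_tuples, reference_tuples):
    """F1 over semantic-proposition tuples such as (object,), (object, attribute),
    or (subject, relation, object) extracted from the two scene graphs."""
    cand, ref = set(candidate_tuples), set(reference_tuples)
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision, recall = matched / len(cand), matched / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = {("girl",), ("ball",), ("girl", "young"), ("girl", "kicks", "ball")}
candidate = {("girl",), ("ball",), ("girl", "kicks", "ball")}
print(round(spice_like_f1(candidate, reference), 3))  # 0.857
```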
Visual-Based Metrics. We classified the Inception Score (IS) (Salimans et al., 2016) and Frechet Inception Distance (FID) (Heusel, Ramsauer, Unterthiner, Nessler, & Hochreiter, 2017), which are primarily used to evaluate the performance of generative models, as visual-based metrics because they are computed from the distribution of image classifications and from the distance between image feature statistics. The IS (Salimans et al., 2016) takes a list of generated images and scores them in terms of image quality and diversity. For both quality and diversity, a high score is given when the degree of completion is high and a low score when it is low; the lower bound of the score is zero and the theoretical upper bound is infinite, although in practice the upper bound is not reached. The metric classifies the generated images using an Inception model (Szegedy, Vanhoucke, Ioffe, Shlens, & Wojna, 2016): a meaningful image object should have a conditional label distribution with low entropy, whereas the model must have a high-entropy marginal label distribution to generate diverse images, and the IS combines these two conditions. However, Barratt and Sharma (2018) identified five limitations of IS. The score is inevitably low (1) when the generator produces images that are not covered by the dataset on which Inception V3 was pretrained (e.g., Oxford-102 flowers (Nilsback & Zisserman, 2008)), and (2) when images are generated using a different set of labels than the classifier training data. Conversely, a high IS can be obtained even though the result is not good: (3) when the classifier network focuses on texture rather than image content, (4) when the same high-quality image is created multiple times, and (5) when the GAN duplicates the training data.
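The IS can be sketched directly from its definition, exp(E_x[KL(p(y|x) || p(y))]), once the class-probability outputs of a pretrained classifier are available; obtaining those probabilities from Inception V3 is assumed to happen elsewhere, and the split-averaging used in the reference implementation is omitted.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: array of shape (num_images, num_classes), each row p(y|x) from a
    pretrained classifier. Returns exp of the mean KL(p(y|x) || p(y))."""
    p_y = probs.mean(axis=0, keepdims=True)                       # marginal label distribution
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

rng = np.random.default_rng(0)
sharp = rng.dirichlet(alpha=[0.05] * 10, size=500)   # confident, varied predictions -> higher IS
uniform = np.full((500, 10), 0.1)                    # uniform predictions -> IS of 1
print(inception_score(sharp), inception_score(uniform))
```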
The FID (Heusel et al., 2017) is a metric that calculates the distance between the real (label) images and the generated images, that is, how similar the two groups are statistically. The feature vectors output by a pretrained Inception V3 are used, and the distance between the two multivariate normal distributions fitted to these features is used for the calculation.
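Assuming the Inception V3 feature vectors of the real and generated images have already been extracted, the Frechet distance between the two fitted Gaussians is ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2(C_r C_g)^{1/2}), which can be sketched as follows (the feature-extraction step itself is not shown).

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, gen_feats):
    """Frechet distance between Gaussians fitted to two sets of feature vectors
    (rows are images, columns are Inception V3 feature dimensions)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):          # numerical noise can produce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(1000, 16))
fake = rng.normal(0.5, 1.2, size=(1000, 16))
print(fid(real, real), fid(real, fake))   # ~0 for identical sets, larger for shifted ones
```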
8. Comparison

Among the models surveyed in this study, the text-to-image and image-to-text model performances occupy the most important proportion and are compared in Tables 4 and 5. Each performance indicator in the tables was extracted from the performance evaluation provided in the study of each model.

First, in the text-to-image translation part, we compared the models evaluated with IS (Salimans et al., 2016), FID (Heusel et al., 2017), and R-precision (Hodosh et al., 2013) as evaluation metrics on the MSCOCO (Chen et al., 2015; Lin et al., 2014) and CUB (Wah et al., 2011) datasets. For each dataset, the highest score for each evaluation metric is indicated in bold. Although StackGAN did not provide an FID score for the results of its experiments, StackGAN-v2, which was proposed by the same authors, reported FID scores for both StackGAN and StackGAN-v2. Although DALL-E presents its FID score as an approximate value, Parti reports and compares the FID scores of both DALL-E and Parti.


Table 4
Performance of models on text-to-image: performance analysis of text-to-image translation of remarkable models for each dataset. The performance of the best-performing model for each dataset is indicated in bold. The MSCOCO dataset (Lin et al., 2014) was used. (Approach: R = Retrieval-based models, ED = Encoder–decoder-based models, G = Generative-based models).

Model | IS (MSCOCO / CUB) | FID (MSCOCO / CUB) | R-precision (MSCOCO / CUB) | Approach
StackGAN-v1 (Zhang et al., 2017) | 8.45 / 3.70 | 74.05 / 51.89 | – / – | G
StackGAN-v2 (Zhang et al., 2019) | 8.30 / 4.04 | 81.59 / 15.30 | – / – | G
AttnGAN (Xu et al., 2018) | 25.89 / 4.36 | 35.49 / – | 85.47 / 62.82 | G
DM-GAN (Zhu et al., 2019) | 30.49 / 4.75 | 32.64 / 16.09 | 88.56 / 72.31 | G
DALL-E (Ramesh et al., 2021) | – / – | 28.00 / – | – / – | G
Parti (Yu et al., 2022) | – / – | 7.23 / – | – / – | G

When we compared the IS scores, we determined that DM-GAN showed the best performance on both the MSCOCO (Chen et al., 2015; Lin et al., 2014) and CUB (Wah et al., 2011) datasets. When we compared the FID scores, Parti significantly outperformed the other models. DALL-E and Parti, which showed the best performance on the MSCOCO (Lin et al., 2014) dataset, were not evaluated on the CUB (Wah et al., 2011) dataset. DM-GAN showed the best IS score on the CUB (Wah et al., 2011) dataset. In addition, the performance difference between StackGAN and StackGAN-v2 is significant. The generative-based models evaluated with the R-precision metric are AttnGAN and DM-GAN, and DM-GAN exhibited higher performance than AttnGAN. We recorded the best performance of each model in Table 4.

In image-to-text translation, the BLEU (Papineni et al., 2002) and METEOR (Banerjee & Lavie, 2005) metrics have been most commonly used as performance evaluation metrics. Similarly, we indicate the best performance for each dataset and evaluation metric in bold. The encoder–decoder-based models show good performance. The CGAN model, which attempts the image-to-text task with a generative-based method, is meaningful in that this had never been attempted with a generative-based structure before; however, it does not show a significantly high performance. The Show, Attend and Tell model achieved BLEU-1 scores of 66.70 and 66.90 points with soft and hard attention, respectively, on Flickr30k, whereas for the METEOR (Banerjee & Lavie, 2005) score, the model with soft attention achieved 18.49 points and the model with hard attention achieved 18.46 points. This indicates that the soft attention mechanism shows slightly better performance when evaluated using the METEOR metric. We recorded the best performance of the model in Table 5 (66.90 and 18.49 points). On the MSCOCO (Chen et al., 2015; Lin et al., 2014) dataset, the models with soft and hard attention obtained 70.70 and 71.80 points in terms of the BLEU-1 score, respectively, and 23.90 and 23.01 points in terms of the METEOR score, respectively; the higher performance is recorded in Table 5. For the performance of LEMON, the results of the largest of the three models presented in that study, which shows the best performance, are recorded in Table 5. For the MSCOCO (Chen et al., 2015; Lin et al., 2014) dataset, the performance of the model that used the version updated in 2015 is indicated by an * mark.

Table 5
Performance of models on image-to-text: performance analysis of image-to-text translation of remarkable models for each dataset. The performance of the best-performing model for each dataset is indicated in bold. * indicates the models trained on the MSCOCO (Chen et al., 2015) dataset updated in 2015. (Approach: R = Retrieval-based models, ED = Encoder–decoder-based models, G = Generative-based models).

Model | BLEU-1 (MSCOCO / Flickr30k) | BLEU-4 (MSCOCO / Flickr30k) | METEOR (MSCOCO / Flickr30k) | Approach
BRNN (Karpathy & Li, 2017) | 62.50 / 57.30 | – / – | 19.50 / – | R
Show, Attend and Tell (Xu et al., 2015) | 71.80 / 66.90 | 25.00 / 19.90 | 23.90 / 18.49 | ED
Up-down (Anderson et al., 2018) | 77.20 / – | 36.20 / – | 27.00 / – | ED
UnifiedVLP (Zhou et al., 2020) | – / – | 36.5* / 30.10 | 28.4* / 23.00 | ED
ORT (Herdade et al., 2019) | 80.50 / – | 38.60 / – | 28.70 / – | ED
LEMON (Hu et al., 2021) | – / – | 41.5* / – | 30.8* / – | ED
CGAN (Dai et al., 2017) | – / – | 20.70 / 8.80 | 22.40 / 13.20 | G

9. Future research direction

To conclude this survey, we discuss the following aspects of future research directions.


9.1. Establishment of fair evaluation metrics

We reviewed the automatic evaluation metrics commonly used in multimodal bidirectional machine learning translation of image and NLP as indexes to evaluate model performance. However, studies (Elliott & Keller, 2013; Kulkarni et al., 2013; Li et al., 2011; Ramesh et al., 2021; Zhang et al., 2017, 2019) used both human and automatic evaluation metrics because the performance of these automatic evaluation metrics has not yet caught up with human evaluation. Sajjadi, Bachem, Lucic, Bousquet, and Gelly (2018) noted that an obstacle to future research is the lack of quantitative evaluation methods with which to accurately evaluate the quality of trained models. N-gram- and graph-based metrics cannot reflect homonyms or catch the grammatical and structural problems of a language. The IS provides a high score even if the same image is created multiple times, and the evaluation focuses on the texture of the generated image rather than its content, whereas the FID uses limited statistics to calculate the score. To compensate for these shortcomings, Kynkäänniemi, Karras, Laine, Lehtinen, and Aila (2019) and Sajjadi et al. (2018) published studies on precision and recall metrics modified to suit the evaluation of generative-based models. In addition, Yu et al. (2022) identified the cherry-picking problem, in which model performance is measured by selecting only the good results, a problem that exists in language models, machine translation, and image synthesis tasks. In conclusion, as a direction for future research, a method for measuring the performance of models in an objective and fair manner is necessary, as well as automatic measurements that are closer to human evaluation.

9.2. Recognizing AI ethical issues and responsible development

With the development of AI and machine learning, the ethical issues of AI have become a concern. Brundage et al. (2018) explained the ethical issues that can arise in the AI field through the keywords of digital, physical, and political safety, and discussed countermeasures. Amodei et al. (2016) noted the various problems caused by AI models; for instance, AI models designed to pursue a given objective may not consider the surrounding environment or safety standards, or may take detrimental actions when placed in a previously unexperienced environment. Brown et al. (2020) noted the disinformation problem caused by deepfakes and the associated safety and ethical issues. As pretrained models with large datasets are increasingly used, discrimination and prejudice against the various races represented in such pretrained datasets may be learned. Wolf, Miller, and Grodzinsky (2017) reported the seriousness of AI ethics issues through a chatbot case that caused serious verbal violence and sexual harassment. To prevent this problem, filtering words containing bias and violence is important, particularly when using a pretrained model with large datasets. As many studies have concluded, the development of responsible AI is believed to be important in recognizing ethical issues in the AI field and in future research.

9.3. Unsupervised learning

As most human and animal learning is unsupervised, the development direction of machine learning is expected to move from supervised learning to unsupervised learning (LeCun, Bengio, & Hinton, 2015). Indeed, many studies are moving in this direction. The zero-shot and few-shot learning demonstrated by language models (Brown et al., 2020; Chowdhery et al., 2022; Kojima, Gu, Reid, Matsuo, & Iwasawa, 2022) has also been carried over to the multimodal machine learning translation of vision and NLP, where models trained with unsupervised learning have succeeded in performing image captioning and image synthesis (Ramesh, Dhariwal, Nichol, Chu, & Chen, 2022; Ramesh et al., 2021; Saharia et al., 2022; Yu et al., 2022). In addition, a novel object captioning technique has been introduced (Agrawal et al., 2019; Hendricks et al., 2016; Venugopalan et al., 2017), which is said to further increase the possibility of zero-shot learning because it describes objects that do not appear in the training set (Stefanini et al., 2022), again demonstrating the progress of unsupervised learning. Unsupervised learning is expected to be developed more actively in multimodal machine learning translation and is expected to be the direction in which AI that mimics human learning should be developed.

9.4. Expansion of available possibilities

For future research on multimodal bidirectional machine learning, the applications of image and NLP translation are endless, such as in medicine and the arts. Li, Liang, Hu, and Xing (2018) presented research on the task of generating long and coherent reports by perceiving visual patterns from medical images using an encoder–decoder-based model. In the arts, Frans, Soros, and Witkowski (2022) succeeded in creating various styles of artistic drawings through CLIPdraw, which can be used to inspire artists or provide new methods of innovation, such as setting themes or creating unique visual interactions (Yu et al., 2022). Therefore, it is possible to expand the use of multimodal machine learning translation from a domain-specific perspective rather than approaching it only from a general perspective. We expect multimodal bidirectional machine learning translation to be further developed and to improve the quality of human life as its applicability is confirmed in all aspects of human life, such as productivity, creativity, and convenience.

10. Conclusion

We surveyed and summarized state-of-the-art research on multimodal bidirectional machine learning translation of image and NLP. Research on the conversion from images to text has been ongoing for a long time. Recently, owing to the power of algorithms and deep networks, significant changes and developments have occurred in the conversion of text to image. The combination of image and NLP has a wide range of usefulness in everyday life compared with the utilization of a single modality. This is because understanding the learning process of multimodality, rather than the learning of a single modality, more accurately reflects the human learning process, as integrating knowledge across multiple domains is a core competency of human intelligence. In addition, transforming one or more modalities is a valuable research area in the field of AI, as it more closely approximates human and animal thinking abilities. We hope our survey and taxonomy suggestions have made a significant contribution to understanding the research on multimodal machine learning translation and can serve as a foundation for future research.

CRediT authorship contribution statement

Wongyung Nam: Conceptualization, Methodology, Formal analysis, Writing – original draft, Visualization, Writing – review & editing. Beakcheol Jang: Supervision, Project administration, Conceptualization, Writing – review & editing, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


Data availability Elliott, D., & Keller, F. (2013). Image description using visual dependency representa-
tions. In Proceedings of the 2013 conference on empirical methods in natural language
processing (pp. 1292–1302).
No data was used for the research described in the article
Farhadi, A., Endres, I., Hoiem, D., & Forsyth, D. (2009). Describing objects by their
attributes. In 2009 IEEE conference on computer vision and pattern recognition (pp.
Acknowledgments 1778–1785). IEEE.
Felzenszwalb, P. F., Girshick, R. B., & McAllester, D. Discriminatively Trained
This work was supported by the National Research Foundation Deformable Part Models, Release 4.
of Korea Fund under Grant NRF-2022R1F1A1063961 and the Yonsei Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2009). Object
detection with discriminatively trained part-based models. IEEE Transactions on
University Research Fund, South Korea of 2023-22-0104. Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.
Frans, K., Soros, L. B., & Witkowski, O. (2022). Clipdraw: Exploring text-to-drawing syn-
References thesis through language-image encoders. Advances in Neural Information Processing
Systems, 35, 5207–5218.
Agnese, J., Herrera, J., Tao, H., & Zhu, X. (2020). A survey and taxonomy of adversarial Frolov, S., Hinz, T., Raue, F., Hees, J., & Dengel, A. R. (2021). Adversarial text-to-image
neural networks for text-to-image synthesis. Wiley Interdisciplinary Reviews: Data synthesis: A review. Neural Networks : the Official Journal of the International Neural
Mining and Knowledge Discovery, 10(4), Article e1345. Network Society, 144(C), 187–209. http://dx.doi.org/10.1016/j.neunet.2021.07.019.
Agrawal, H., Desai, K., Wang, Y., Chen, X., Jain, R., Johnson, M., et al. (2019). Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for
Nocaps: Novel object captioning at scale. In Proceedings of the IEEE/CVF international accurate object detection and semantic segmentation. In Proceedings of the IEEE
conference on computer vision (pp. 8948–8957). conference on computer vision and pattern recognition (pp. 580–587).
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et
Concrete problems in AI safety. arXiv preprint arXiv:1606.06565. al. (2014). Generative adversarial nets. Advances in Neural Information Processing
Anderson, P., Fernando, B., Johnson, M., & Gould, S. (2016). Spice: Semantic propo- Systems, 27.
sitional image caption evaluation. In Computer Vision–ECCV 2016: 14th European Guo, W., Wang, J., & Wang, S. (2019). Deep multimodal representation learning: A
conference, Amsterdam, the Netherlands, october 11-14, 2016, proceedings, part V 14 survey. IEEE Access, 7, 63373–63394.
(pp. 382–398). Springer. He, K., Zhang, X., Ren, S., & Sun, J. (2016a). Deep residual learning for image
Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., et al. (2018). recognition. In Proceedings of the IEEE conference on computer vision and pattern
Bottom-up and top-down attention for image captioning and visual question recognition (pp. 770–778).
answering. In Proceedings of the IEEE conference on computer vision and pattern He, K., Zhang, X., Ren, S., & Sun, J. (2016b). Identity mappings in deep residual
recognition (pp. 6077–6086). networks. In Computer Vision–ECCV 2016: 14th European conference, Amsterdam, the
Arjovsky, M., & Bottou, L. (2017). Towards principled methods for training generative Netherlands, october 11–14, 2016, proceedings, part IV 14 (pp. 630–645). Springer.
adversarial networks. arXiv preprint arXiv:1701.04862. Hendricks, L. A., Venugopalan, S., Rohrbach, M., Mooney, R. J., Saenko, K., & Dar-
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly rell, T. (2016). Deep compositional captioning: Describing novel object categories
learning to align and translate. arXiv preprint arXiv:1409.0473. without paired training data. 2016 IEEE Conference on Computer Vision and Pattern
Baltrusaitis, T., Ahuja, C., & Morency, L.-P. (2018). Multimodal machine learning: A Recognition (CVPR), 1–10.
survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, Herdade, S., Kappeler, A., Boakye, K., & Soares, J. (2019). Image captioning: Trans-
41(2), 423–443. forming objects into words. Advances in Neural Information Processing Systems,
Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with 32.
improved correlation with human judgments. In Proceedings of the acl workshop on Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). Gans
intrinsic and extrinsic evaluation measures for machine translation and/or summarization trained by a two time-scale update rule converge to a local nash equilibrium.
(pp. 65–72). Advances in Neural Information Processing Systems, 30.
Barratt, S., & Sharma, R. (2018). A note on the inception score. arXiv preprint Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation,
arXiv:1801.01973. 9(8), 1735–1780.
Bernardi, R., Cakici, R., Elliott, D., Erdem, A., Erdem, E., Ikizler-Cinbis, N., et al.
Hodosh, M., Young, P., & Hockenmaier, J. (2013). Framing image description as a
(2017). Automatic description generation from images: a survey of models, datasets,
ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence
and evaluation measures. In Proceedings of the 26th international joint conference on
Research, 47, 853–899.
artificial intelligence (pp. 4970–4974).
Hossain, M. Z., Sohel, F., Shiratuddin, M. F., & Laga, H. (2019). A comprehensive survey
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., et al. (2020).
of deep learning for image captioning. ACM Computing Surveys (CsUR), 51(6), 1–36.
Language models are few-shot learners. Advances in Neural Information Processing
Hu, X., Gan, Z., Wang, J., Yang, Z., Liu, Z., Lu, Y., et al. (2021). Scaling up vision-
Systems, 33, 1877–1901.
language pretraining for image captioning. 2022 IEEE/CVF Conference on Computer
Brundage, M., Avin, S., Clark, J., Toner, H., Eckersley, P., Garfinkel, B., et al. (2018).
Vision and Pattern Recognition (CVPR), 17959–17968.
The malicious use of artificial intelligence: Forecasting, prevention, and mitigation.
Karpathy, A., Joulin, A., & Li, F.-F. (2014). Deep fragment embeddings for bidirectional
arXiv preprint arXiv:1802.07228.
image sentence mapping. In Proceedings of the 27th international conference on
Changpinyo, S., Kukliansky, D., Szpektor, I., Chen, X., Ding, N., & Soricut, R. (2022).
neural information processing systems - volume 2 (pp. 1889–1897). MIT Press, arXiv:
All you may need for VQA are image captions. arXiv preprint arXiv:2205.01883.
1406.5679.
Changpinyo, S., Sharma, P. K., Ding, N., & Soricut, R. (2021). Conceptual 12m:
Karpathy, A., & Li, F.-F. (2017). Deep visual-semantic alignments for generating image
Pushing web-scale image-text pre-training to recognize long-tail visual concepts.
descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4),
In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
664–676. http://dx.doi.org/10.1109/TPAMI.2016.2598339.
(pp. 3558–3568).
Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Dollár, P., et al. (2015). Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. URL https:
Microsoft coco captions: Data collection and evaluation server. arXiv preprint //arxiv.org/abs/1312.6114.
arXiv:1504.00325. Klein, D., & Manning, C. D. (2003). Accurate unlexicalized parsing. In Proceedings of
Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties the 41st annual meeting of the association for computational linguistics (pp. 423–430).
of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv: Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large language
1409.1259. models are zero-shot reasoners. Advances in Neural Information Processing Systems,
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., et al. (2022). 35, 22199–22213.
Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., et al. (2017).
Dai, B., Fidler, S., Urtasun, R., & Lin, D. (2017). Towards diverse and natural image Visual genome: Connecting language and vision using crowdsourced dense image
descriptions via a conditional GAN. In Proceedings of the IEEE international conference annotations. International Journal of Computer Vision, 123(1), 32–73.
on computer vision (pp. 2970–2979). Kulkarni, G., Premraj, V., Ordonez, V., Dhar, S., Li, S., Choi, Y., et al. (2013). BabyTalk:
De Marneffe, M.-C., MacCartney, B., Manning, C. D., et al. (2006). Generating typed Understanding and generating simple image descriptions. IEEE Transactions on
dependency parses from phrase structure parses.. vol. 6, In Lrec (pp. 449–454). Pattern Analysis and Machine Intelligence, 35(12), 2891–2903. http://dx.doi.org/10.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Li, F.-F. (2009). Imagenet: A large-scale 1109/TPAMI.2012.162.
hierarchical image database. In 2009 IEEE conference on computer vision and pattern Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., & Aila, T. (2019). Improved
recognition (pp. 248–255). Ieee, http://dx.doi.org/10.1109/CVPR.2009.5206848. precision and recall metric for assessing generative models. Advances in Neural
Denton, E. L., Chintala, S., Fergus, R., et al. (2015). Deep generative image models Information Processing Systems, 32.
using a laplacian pyramid of adversarial networks. Advances in Neural Information Le, Q. V. (2013). Building high-level features using large scale unsupervised learning.
Processing Systems, 28. In 2013 IEEE international conference on acoustics, speech and signal processing (pp.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep 8595–8598). IEEE.
bidirectional transformers for language understanding. ArXiv, arXiv:1810.04805. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.


LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., et Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale
al. (1989). Backpropagation applied to handwritten zip code recognition. Neural image recognition. arXiv preprint arXiv:1409.1556.
Computation, 1(4), 541–551. Socher, R., Karpathy, A., Le, Q. V., Manning, C. D., & Ng, A. (2014). Grounded com-
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied positional semantics for finding and describing images with sentences. Transactions
to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. of the Association for Computational Linguistics, 2, 207–218.
Lee, K.-H., Chen, X., Hua, G., Hu, H., & He, X. (2018). Stacked cross attention for Sønderby, C. K., Caballero, J., Theis, L., Shi, W., & Huszár, F. (2016). Amortised MAP
image-text matching. In Proceddings of the European conference on computer vision inference for image super-resolution. arXiv preprint arXiv:1610.04490.
(pp. 201–216). Stefanini, M., Cornia, M., Baraldi, L., Cascianelli, S., Fiameni, G., & Cucchiara, R.
Li, S., Kulkarni, G., Berg, T. L., Berg, A. C., & Choi, Y. (2011). Composing simple image (2022). From show to tell: a survey on deep learning-based image captioning. IEEE
descriptions using web-scale n-grams. In Proceedings of the fifteenth conference on Transactions on Pattern Analysis and Machine Intelligence, 45(1), 539–559.
computational natural language learning (pp. 220–228). Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). Policy gradient
Li, Y., Liang, X., Hu, Z., & Xing, E. P. (2018). Hybrid retrieval-generation reinforced methods for reinforcement learning with function approximation. Advances in
agent for medical image report generation. Advances in Neural Information Processing Neural Information Processing Systems, 12.
Systems, 31. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the
Lin, C.-Y. (2004). Rouge: A package for automatic evaluation of summaries. In Text inception architecture for computer vision. 2016 IEEE Conference on Computer Vision
summarization branches out (pp. 74–81). and Pattern Recognition (CVPR), 2818–2826.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., et al. (2014). Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., et al. (2016).
Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th YFCC100m: The new data in multimedia research. Communications of the ACM,
European conference, Zurich, Switzerland, september 6-12, 2014, proceedings, part V 59(2), 64–73.
13 (pp. 740–755). Springer. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al.
Lu, D., Neves, L., Carvalho, V., Zhang, N., & Ji, H. (2018). Visual attention model (2017). Attention is all you need. Advances in Neural Information Processing Systems,
for name tagging in multimodal social media. In Proceedings of the 56th annual 30.
meeting of the association for computational linguistics (volume 1: long papers) (pp. Vedantam, R., Zitnick, C. L., & Parikh, D. (2014). CIDEr: Consensus-based image
1990–1999). description evaluation. 2015 IEEE Conference on Computer Vision and Pattern
McDonald, R., Crammer, K., & Pereira, F. (2005). Online large-margin training of Recognition (CVPR), 4566–4575.
dependency parsers. In Proceedings of the 43rd annual meeting of the association Venugopalan, S., Hendricks, L. A., Rohrbach, M., Mooney, R. J., Darrell, T., & Saenko, K.
for computational linguistics (ACL’05) (pp. 91–98). (2017). Captioning images with diverse objects. 2017 IEEE Conference on Computer
Moon, S., Neves, L., & Carvalho, V. (2018). Multimodal named entity recognition Vision and Pattern Recognition (CVPR), 1170–1178.
for short social media posts. http://dx.doi.org/10.48550/arXiv.1802.07862, URL Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The caltech-ucsd
http://arxiv.org/abs/1802.07862. birds-200–2011 dataset.
Nilsback, M.-E., & Zisserman, A. (2008). Automated flower classification over a large Wolf, M. J., Miller, K., & Grodzinsky, F. S. (2017). Why we should have seen that
number of classes. In 2008 Sixth Indian conference on computer vision, graphics & coming: comments on microsoft’s tay" experiment," and wider implications. Acm
image processing (pp. 722–729). IEEE. Sigcas Computers and Society, 47(3), 54–64.
Oord, A. V. D., Kalchbrenner, N., & Kavukcuoglu, K. (2016). Pixel recurrent neural Xie, S., Girshick, R., Dollar, P., Tu, Z., & He, K. (2017). Aggregated residual transfor-
networks. In International conference on machine learning (pp. 1747–1756). PMLR. mations for deep neural networks. In Proceedings of the IEEE conference on computer
Ordonez, V., Kulkarni, G., & Berg, T. (2011). Im2text: Describing images using 1 million vision and pattern recognition (pp. 1492–1500).
captioned photographs. Advances in Neural Information Processing Systems, 24. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., et al. (2015).
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: a method for automatic Show, attend and tell: Neural image caption generation with visual attention. In
evaluation of machine translation. In Proceedings of the 40th annual meeting of the International conference on machine learning (pp. 2048–2057). PMLR.
association for computational linguistics (pp. 311–318). Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., et al. (2018). Attngan: Fine-
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical grained text to image generation with attentional generative adversarial networks.
text-conditional image generation with clip latents. arXiv preprint arXiv:2204. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp.
06125. 1316–1324).
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., et al. (2021). Zero- Young, P., Lai, A., Hodosh, M., & Hockenmaier, J. (2014). From image descriptions
shot text-to-image generation. In International conference on machine learning (pp. to visual denotations: New similarity metrics for semantic inference over event
8821–8831). PMLR. descriptions. Transactions of the Association for Computational Linguistics, 2, 67–78.
Rashtchian, C., Young, P., Hodosh, M., & Hockenmaier, J. (2010). Collecting image Yu, J., Jiang, J., Yang, L., & Xia, R. (2020). Improving multimodal named entity
annotations using amazon’s mechanical turk. In Proceedings of the NAACL HLT 2010 recognition via entity span detection with unified multimodal transformer. In
workshop on creating speech and language data with Amazon’s mechanical turk (pp. Annual meeting of the association for computational linguistics (pp. 3342–3352).
139–147). Association for Computational Linguistics.
Reed, S. E., Akata, Z., Lee, H., & Schiele, B. (2016). Learning deep representations of Yu, J., Li, X., Koh, J. Y., Zhang, H., Pang, R., Qin, J., et al. (2021). Vector-quantized
fine-grained visual descriptions. In Proceedings of the IEEE conference on computer image modeling with improved VQGAN. ArXiv, arXiv:2110.04627.
vision and pattern recognition (pp. 49–58). Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang, Z., et al. Scaling Autoregressive
Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards real-time object Models for Content-Rich Text-to-Image Generation, arXiv preprint arXiv:2206.
detection with region proposal networks. IEEE Transactions on Pattern Analysis and 10789, 2(3), 5.
Machine Intelligence, 39(06), 1137–1149. http://dx.doi.org/10.1109/TPAMI.2016. Zhang, Q., Fu, J., Liu, X., & Huang, X. (2018). Adaptive co-attention network for named
2577031. entity recognition in tweets. vol. 32, In Proceedings of the AAAI conference on artificial
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). intelligence.
ImageNet large scale visual recognition challenge. International Journal of Computer Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L., et al. (2021). Vinvl: Revisiting
Vision (IJCV), 115(3), 211–252. http://dx.doi.org/10.1007/s11263-015-0816-y. visual representations in vision-language models. In Proceedings of the IEEE/CVF
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., et al. (2022). conference on computer vision and pattern recognition (pp. 5579–5588).
Photorealistic text-to-image diffusion models with deep language understanding. Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., et al. (2017). Stackgan: Text
Advances in Neural Information Processing Systems, 35, 36479–36494. to photo-realistic image synthesis with stacked generative adversarial networks. In
Sajjadi, M. S., Bachem, O., Lucic, M., Bousquet, O., & Gelly, S. (2018). Assessing Proceedings of the IEEE international conference on computer vision (pp. 5907–5915).
generative models via precision and recall. Advances in Neural Information Processing Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., et al. (2019). Stack-
Systems, 31. gan++: Realistic image synthesis with stacked generative adversarial networks.
Salimans, T., Goodfellow, I. J., Zaremba, W., Cheung, V., Radford, A., & Chen, X. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8), 1947–1962.
(2016). Improved techniques for training GANs. Advances in Neural Information http://dx.doi.org/10.1109/TPAMI.2018.2856256.
Proceeding Systems, 29. Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J., & Gao, J. (2020). Unified vision-
Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE language pre-training for image captioning and vqa. vol. 34, In Proceedings of the
Transactions on Signal Processing, 45(11), 2673–2681. AAAI conference on artificial intelligence (pp. 13041–13049).
Sennrich, R., Haddow, B., & Birch, A. (2015). Neural machine translation of rare words Zhu, M., Pan, P., Chen, W., & Yang, Y. (2019). DM-GAN: Dynamic memory generative
with subword units. arXiv preprint arXiv:1508.07909. adversarial networks for text-to-image synthesis. In Proceedings of the IEEE/CVF
Sharma, P., Ding, N., Goodman, S., & Soricut, R. (2018). Conceptual captions: A conference on computer vision and pattern recognition (pp. 5802–5810).
cleaned, hypernymed, image alt-text dataset for automatic image captioning. In
Proceedings of the 56th annual meeting of the association for computational linguistics
(volume 1: long papers) (pp. 2556–2565).

