Unpaired Image Captioning by Language Pivoting
1 Introduction
Recent years have witnessed unprecedented advancements in automatic
image caption generation. This progress can be attributed (i) to the invention
of novel deep learning frameworks that learn to generate natural language de-
scriptions of images in an end-to-end fashion, and (ii) to the availability of large
annotated corpora of images paired with captions, such as MSCOCO [30], to train
these models. The dominant methods are based on an encoder-decoder frame-
work, which uses a deep convolutional neural network (CNN) to encode the image
into a feature vector, and then uses a recurrent neural network (RNN) to gener-
ate the caption from the encoded vector [29, 27, 44]. More recently, approaches
using attention mechanisms and reinforcement learning have dominated the
MSCOCO captioning leaderboard [1, 39, 18].
Despite the impressive results achieved by these deep learning frameworks, one
performance bottleneck is the availability of large paired datasets, because neu-
ral image captioning models are generally annotation-hungry, requiring a large
number of annotated image-caption pairs to achieve effective results [19]. How-
ever, in many applications and languages, such large-scale annotations are not
readily available, and are expensive and slow to acquire. In these scenarios,
unsupervised methods that can generate captions from unpaired data or semi-
supervised methods that can exploit paired annotations from other domains or
languages are highly desirable [5]. In this paper, we pursue the latter research
avenue, where we assume that we have access to image-caption paired instances
in one language (Chinese), and our goal is to transfer this knowledge to a target
language (English) for which we do not have such image-caption paired datasets.
We also assume that we have access to a separate source-target (Chinese-English)
parallel corpus to help us with the transformation. In other words, we wish to
use the source language (Chinese) as a pivot language to bridge the gap between
an input image and a caption in the target language (English).
The concept of using a pivot language as an intermediary language has been
studied previously in machine translation (MT) to translate between a resource-
rich language and a resource-scarce language [46, 42, 25, 6]. The translation task
in this strategy is performed in two steps. A source-to-pivot MT system first
translates a source sentence into the pivot language, which is in turn translated
to the target language using a pivot-to-target MT system. Although related,
image captioning with the help of a pivot language is fundamentally different
from MT, since it involves putting together two different tasks – captioning and
translation. In addition, the pivot-based pipelined approach to MT suffers from
two major problems when it comes to image captioning. First, the conventional
pivot-based MT methods assume that the datasets for source-to-pivot and pivot-
to-target translations come from the same (or similar) domain(s) with similar
styles and word distributions. However, when it comes to image captioning, cap-
tions in the pivot language (Chinese) and sentences in the (Chinese-English)
parallel corpus are quite different in styles and word distributions. For instance,
the MSCOCO captioning dataset mostly consists of images of scenes with
object instances (nouns), whereas language parallel corpora are more generic.
Second, the errors made in the source-to-pivot translation get propagated to the
pivot-to-target translation module in the pipelined approach.
In this paper, we present an approach that can effectively capture the charac-
teristics of an image captioner from the source language and align it to the target
language using another source-target parallel corpus. More specifically, our pivot-
based image captioning framework comprises an image captioner (image-to-pivot),
an encoder-decoder model that learns to describe images in the pivot language,
and a pivot-to-target translation model, another encoder-decoder model that
translates sentences in the pivot language to the target language; the two
models are trained on two separate datasets. We tackle the variations in writ-
ing styles and word distributions in the two datasets by adapting the language
translation model to the captioning task. This is achieved by adapting both the
encoder and the decoder of the pivot-to-target translation model. In particular,
we regularize the word embeddings of the encoder (of the pivot language) and the
decoder (of the target language) to make them similar to those of image captions.
We also introduce a joint training algorithm to connect the two models and
enable them to interact with each other during training. We use AIC-ICC [47]
and AIC-MT [47] as the training datasets and two datasets (MSCOCO and
Flickr30K [37]) as the validation datasets. The results show that our approach
yields substantial gains over the baseline methods on the validation datasets.
2 Background
mechanism [26, 41, 35, 10]. The attention-based translation model proposed by
Kalchbrenner et al. [26] is an early attempt to train an end-to-end NMT model.
Luong et al. [33] extend the basic encoder-decoder framework to multiple en-
coders and decoders. However, large-scale parallel corpora are usually not easy
to obtain for some language pairs. This is unfortunate because NMT usually
needs a large amount of data for training. As a result, improving NMT on resource-
scarce language pairs has attracted much attention [55, 16].
Recently, much work has been done on pivot strategies for
NMT [11, 46, 42, 3, 53, 14, 25]. Pivot-based approaches introduce a third language,
called the pivot language, for which source-pivot and pivot-target paral-
lel corpora exist. Translation in pivot-based approaches proceeds in two
steps: the sentence in the source language is first translated into a sentence in
the pivot language, which is then translated to a sentence in the target language.
However, such pivot-based approaches have a major problem: errors made
by the source-to-pivot model are forwarded to the pivot-to-target model.
More recently, Cheng et al. [7] introduce an autoencoder to reconstruct monolingual
corpora. They further improve it in [8], where they propose a joint training
approach for pivot-based NMT.
where θ_{i→y} are the model parameters to be learned in the absence of any paired
data (i^{(n_i)}, y^{(n_y)}). We use the pivot language x to learn the mapping in two stages:
i → x with parameters θ_{i→x}, followed by x → y with parameters θ_{x→y}.
Note that image-to-pivot (D_{i,x}) and pivot-to-target (D_{x,y}) in our
setting are two distinct datasets with possibly no common elements.
Fig. 1 illustrates our pivot-based image captioning approach. We have an
image captioning model P (x|i; θi→x ) to generate a caption in the pivot language
from an image and a NMT model P (y|x; θx→y ) to translate this caption into
the target language. In addition, we have an autoencoder in the target language
P (ŷ|ŷ; θŷ→ŷ ) that guides the target language decoder to produce caption-like
sentences. We train these components jointly so that they interact with each
other. During inference, given an unseen image i to be described, we use the
joint decoder:
y ∼ arg max_y P(y | i; θ_{i→x}, θ_{x→y})                                    (2)
Fig. 1. Pictorial depiction of our pivot-based unpaired image captioning setting. Here,
i, x, y, and ŷ denote source image, pivot language sentence, target language sentence,
and ground truth captions in target language, respectively. We use a dashed line to
denote that there is no parallel corpus available for the pair. Solid lines with arrows
represent decoding directions. Dashed lines inside a language (circle) denote stylistic
and distributional differences between caption and translation data.
In the following, we first give an overview of neural methods for image cap-
tioning and machine translation using paired (parallel) data. Then, we present
our approach that extends these standard models for unpaired image captioning
with a pivot language.
Standard Image Captioning. For image captioning in the paired setting, the
goal is to generate a caption x̃ from an image i such that x̃ is as similar as possible to
the ground truth caption x. We use P_x(x|i; θ_{i→x}) to denote a standard encoder-
decoder based image captioning model with θi→x being the parameters. We first
encode the given image to the image features v with a CNN-based image encoder:
v = CNN(i). Then, we predict the image description x from the global image
feature v. The training objective is to maximize the probability of the ground
truth caption words given the image:
θ̃_{i→x} = arg max_{θ_{i→x}} L_{i→x}                                                    (3)

        = arg max_{θ_{i→x}} Σ_{n_i=0}^{N_i−1} Σ_{t=0}^{M^{(n_i)}−1} log P_x(x_t^{(n_i)} | x_{0:t−1}^{(n_i)}, i^{(n_i)}; θ_{i→x})    (4)
where N_i is the number of image-caption pairs, M^{(n_i)} is the length of the caption
x^{(n_i)}, x_t^{(n_i)} denotes a word in the caption, and P_x(x_t^{(n_i)} | x_{0:t−1}^{(n_i)}, i^{(n_i)}) corresponds
to the activation of the Softmax layer. The decoded word is drawn from:
x_t ∼ arg max_{x ∈ V_{i→x}} P(x_t | x_{0:t−1}; i)                                       (5)

where V_{i→x} is the vocabulary of words in the image-caption dataset D_{i,x}.
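As a concrete illustration of the objective in Equations (3)-(5), the following is a minimal PyTorch-style sketch of cross-entropy (teacher-forcing) training for an LSTM captioner conditioned on a global CNN feature. The class and function names, and the padding index, are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Minimal LSTM decoder: predicts caption words given a global image feature."""
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)     # map CNN feature to embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, captions):
        # Prepend the projected image feature as the first decoder input (teacher forcing).
        v = self.img_proj(img_feat).unsqueeze(1)           # (B, 1, E)
        w = self.embed(captions[:, :-1])                   # inputs x_0 .. x_{T-2}
        h, _ = self.lstm(torch.cat([v, w], dim=1))
        return self.out(h[:, 1:, :])                       # logits for x_1 .. x_{T-1}

def xe_loss(decoder, img_feat, captions, pad_idx=0):
    """Cross-entropy (XE) loss corresponding to Eq. (4): maximize log P(x_t | x_{0:t-1}, i)."""
    logits = decoder(img_feat, captions)                   # (B, T-1, V)
    targets = captions[:, 1:]                              # ground-truth next words
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1),
        ignore_index=pad_idx)                              # pad index 0 is an assumption
```

In this sketch the projected image feature plays the role of the first decoder input, and the loss averages −log P_x(x_t | x_{0:t−1}, i) over all non-padding positions, mirroring Equation (4).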
where M and N are the lengths of the source and target sentences, respectively.
The maximum-likelihood training objective of the model can be expressed as:
θ̃_{x→y} = arg max_{θ_{x→y}} L_{x→y}                                                    (7)

        = arg max_{θ_{x→y}} Σ_{n_x=0}^{N_x−1} Σ_{t=0}^{N^{(n_x)}−1} log P_y(y_t^{(n_x)} | y_{0:t−1}^{(n_x)}, x^{(n_x)}; θ_{x→y})    (8)
During inference we calculate the probability of the next symbol given the
source sentence encoding and the decoded target sequence so far, and draw the
word from the dictionary according to the maximum probability:
y_t ∼ arg max_{y ∈ V_{x→y}} P(y_t | y_{0:t−1}; x_{0:M−1})                               (9)

where V_{x→y} is the vocabulary of the target language in the translation dataset
D_{x,y}.
Unpaired Image Captioning by Language Pivoting. In the unpaired set-
ting, our goal is to generate a description y in the target language for an im-
age i without any pair information. We assume there is a second language x,
called the “pivot”, for which we have (separate) image-pivot and pivot-target paired
datasets. The image-to-target model in the pivot-based setting can be decom-
posed into two sub-models by treating the pivot sentence as a latent variable:
P(y | i; θ_{i→x}, θ_{x→y}) = Σ_x P_x(x | i; θ_{i→x}) P_y(y | x; θ_{x→y})                (10)
where Px (x|i; θi→x ) and Py (y|x; θx→y ) are the image captioning and NMT mod-
els, respectively. Due to the exponential search space in the pivot language, we
approximate the captioning process with two steps. The first step translates the
image i into a pivot language sentence x̃. Then, the pivot language sentence is
translated to a target language sentence ỹ. To learn such a pivot-based model,
a simple approach is to combine the two loss functions in Equations (4) and (8),
i.e., to maximize L_{i→x} + L_{x→y} over the two respective datasets. Here, x̃ is the
image description generated from i in the pivot language, ỹ is the translation of x̃,
and θ̃_{i→x} and θ̃_{x→y} are the learned model parameters.
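For clarity, the two-step pipelined decoding described above can be sketched as follows. The `captioner` and `translator` objects and their `beam_search` method are hypothetical placeholders, not the paper's actual API.

```python
def pipeline_caption(image, captioner, translator, beam_size_caption=5, beam_size_nmt=10):
    """Two-step pivot-based decoding: i -> x~ (pivot caption) -> y~ (target caption).

    Both models are assumed to expose a beam_search routine that returns the
    highest-scoring token sequence.
    """
    pivot_caption = captioner.beam_search(image, beam_size=beam_size_caption)        # x~ ~ argmax_x P(x | i)
    target_caption = translator.beam_search(pivot_caption, beam_size=beam_size_nmt)  # y~ ~ argmax_y P(y | x~)
    return target_caption
```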
However, this pipelined approach to image caption generation in the target
language suffers from a couple of key limitations. First, image captioning and ma-
chine translation are two different tasks. The image-to-pivot and pivot-to-target
models are quite different in terms of vocabulary and parameter space because
they are trained on two possibly unrelated datasets. Image captions contain de-
scriptions of objects in a given scene, whereas machine translation data is more
generic, in our case containing news event descriptions, movie subtitles, and
conversational texts. The two datasets thus belong to different domains with differences
in writing styles and word distributions. As a result, the captions generated by the pipelined
approach may not be similar to human-authored captions. Fig. 1 distinguishes
between the two domains of pivot and target sentences: caption domain and
translation domain (see second and third circles). The second limitation is that
the errors made by the image-to-pivot captioning model get propagated to the
pivot-to-target translation model.
To overcome the limitations of the pivot-based caption generation, we pro-
pose to reduce the discrepancy between the image-to-pivot and pivot-to-target
models, and to train them jointly so that they learn better models by interacting
with each other during training. Fig. 2 illustrates our approach. The two models
share some common aspects that we can exploit to connect them as we describe
below.
Fig. 2. Illustration of our image captioning model with pivot language. The image
captioning model first transforms an image into latent pivot sentences, from which our
machine translation model generates the target caption.
To adapt the encoder of the NMT model to the caption domain, we minimize the
l2 distance between the two embedding vectors of each word w_x in the pivot language
that is shared by the two embedding matrices, where θ_{i→x}^{w_x} ∈ R^d denotes the vector
representation of w_x in the source-to-pivot model and θ_{x→y}^{w_x} ∈ R^d denotes the vector
representation of w_x in the pivot-to-target model. Note that here we adapt θ_{x→y}^{w_x}
towards θ_{i→x}^{w_x}; that is, θ_{i→x}^{w_x} is already learned and kept fixed during adaptation.
Adapting the encoder embeddings of the NMT model does not guarantee
that the decoder of the model will produce caption-like sentences. For this, we
need to also adapt the decoder embeddings of the NMT model to the caption
data. We first use the target-target parallel corpus D_{ŷ,ŷ} = {(ŷ^{(n_ŷ)}, ŷ^{(n_ŷ)})}_{n_ŷ=0}^{N_ŷ−1}
to train an autoencoder P(ŷ | ŷ; θ_{ŷ→ŷ}), where θ_{ŷ→ŷ} are the parameters of the
autoencoder. The maximum-likelihood training objective of the autoencoder can be
expressed as:
θ̃_{ŷ→ŷ} = arg max_{θ_{ŷ→ŷ}} L_{ŷ→ŷ}                                                    (15)
where Lŷ→ŷ is the cross-entropy (XE) loss. The autoencoder then “teaches” the
decoder of the translation model P (y|x; θx→y ) to learn similar word representa-
tions. This is again achieved by minimizing the l2 distance between the two embedding
vectors:

R_{x→ŷ}(θ_{x→y}^{w_y}, θ_{ŷ→ŷ}^{w_y}) = − Σ_{w_y ∈ V_{x→y}^y ∩ V_{ŷ→ŷ}^y} ||θ_{x→y}^{w_y} − θ_{ŷ→ŷ}^{w_y}||_2        (16)

where V_{ŷ→ŷ}^y is the vocabulary of y in D_{ŷ,ŷ}, and w_y is a word in the target
language that is shared by the two embedding matrices. By optimizing Equa-
tion (16), we encourage the generated captions to share a similar style with the target
captions.
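A minimal sketch of such a connection term follows, written as a penalty to minimize (Equation (16) carries a minus sign because the overall objective is maximized). The dictionary-based vocabulary handling and function name are assumptions for illustration.

```python
import torch

def embedding_connection_loss(emb_adapted, emb_fixed, vocab_adapted, vocab_fixed):
    """l2 connection term in the spirit of Eq. (16): sum, over words shared by the two
    vocabularies, of the distance between the adapted embedding (e.g., the NMT decoder)
    and the fixed reference embedding (e.g., the caption autoencoder).
    `vocab_*` are dicts mapping word -> embedding row index."""
    shared = set(vocab_adapted) & set(vocab_fixed)
    if not shared:
        return torch.tensor(0.0)
    rows_a = torch.tensor([vocab_adapted[w] for w in shared])
    rows_f = torch.tensor([vocab_fixed[w] for w in shared])
    diff = emb_adapted[rows_a] - emb_fixed[rows_f].detach()   # reference embeddings kept fixed
    return diff.norm(dim=1).sum()                             # positive penalty for minimization
```

The same function covers both adaptation directions: the NMT encoder embeddings adapted towards the (fixed) image captioner embeddings, and the NMT decoder embeddings adapted towards the (fixed) caption autoencoder embeddings.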
Joint Training. In training, our goal is to find a set of source-to-target model
parameters that maximizes the training objective:
J_{i→x,x→y,y→ŷ} = L_{i→x} + L_{x→y} + L_{ŷ→ŷ} + λ R_{i→x,x→y,y→ŷ}                      (17)

R_{i→x,x→y,y→ŷ} = R_{i→y}(θ_{i→x}^{w_x}, θ_{x→y}^{w_x}) + R_{x→ŷ}(θ_{x→y}^{w_y}, θ_{ŷ→ŷ}^{w_y})    (18)
where λ is the hyper-parameter used to balance the preference between the loss
terms and the connection terms. Since both the captioner Px (x|i; θi→x ) and the
translator Py (y|x; θx→y ) have large vocabulary sizes (see Table 1), it is hard to
train the joint model from a random initialization. Thus, in practice, we pre-
train the captioner, translator and autoencoder first, and then jointly optimize
them with Equation (17).
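Assuming the loss and regularizer sketches above, the joint objective of Equations (17)-(18) could be assembled roughly as follows. The attribute and batch-field names are hypothetical, and each mini-batch is assumed to draw samples from all three datasets.

```python
def joint_loss(batch, captioner, translator, autoencoder, vocabs, lam=1.0):
    """Joint objective of Eq. (17), written as a loss to minimize."""
    l_i2x = xe_loss(captioner, batch["image_feat"], batch["pivot_caption"])        # L_{i->x}, image-caption data
    l_x2y = translator.xe_loss(batch["pivot_sent"], batch["target_sent"])          # L_{x->y}, translation data
    l_y2y = autoencoder.xe_loss(batch["target_caption"], batch["target_caption"])  # L_{y^->y^}, caption data

    # Connection terms of Eq. (18): adapt the NMT encoder towards the captioner's pivot
    # embeddings, and the NMT decoder towards the caption autoencoder's target embeddings.
    r_enc = embedding_connection_loss(translator.enc_embed.weight, captioner.embed.weight,
                                      vocabs["nmt_src"], vocabs["caption_pivot"])
    r_dec = embedding_connection_loss(translator.dec_embed.weight, autoencoder.embed.weight,
                                      vocabs["nmt_tgt"], vocabs["caption_tgt"])
    return l_i2x + l_x2y + l_y2y + lam * (r_enc + r_dec)
```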
4 Experiments
Datasets. In our experiments, we use two independent datasets from
AI Challenger (AIC) [47]: AIC Image Chinese Captioning (AIC-ICC) and
Table 1. Statistics of the datasets used in our experiments, where “im” denotes the
image, “zh” denotes Chinese, and “en” denotes English.

                                     Source                       Target
           Dataset     Lang.         # Image/Sent.  Vocab. Size   # Sent.   Vocab. Size
Training   AIC-ICC     im → zh       240K           −             1,200K    4,461
           AIC-MT      zh → en       10,000K        50,004        10,000K   50,004
Testing    MSCOCO      im → en       123K           −             615K      9,487
           Flickr30K   im → en       30K            −             150K      7,000
Architecture. As can be seen in Fig. 2, our image captioner consists of three models.
The first model, i2tim→zh, learns to generate the Chinese caption x
from a given image i. It is a standard CNN-RNN architecture [44], where the word
output at the previous time step is taken as the input for the current
time step. We encode each image with ResNet-101 [21] and apply
average pooling to obtain a 2,048-dimensional feature vector, which is then mapped
through a linear projection to a 512-dimensional vector.
The decoder is implemented as an LSTM network. The dimensions of the
LSTM hidden states and word embeddings are fixed to 512 for all of the models
discussed in this paper. Each sentence starts with a special BOS token and ends
with an EOS token.
The second model nmtzh→en learns to translate the Chinese sentence x to
the English sentence y. It has three components: a sentence encoder, a sentence
decoder, and an attention module. The words in the pivot language are first
mapped to word vectors and then fed into a bidirectional LSTM network. The
decoder predicts the target language words based on the encoded vector of the
source sentence as well as its previous outputs. The encoder and the decoder are
connected through an attention module which allows the decoder to focus on
different regions of the source sentence during decoding.
The third model t2ten→en learns to produce the caption-style English sen-
tence ŷ. It is essentially an autoencoder trained on a set of image descriptions
extracted from MSCOCO, where the encoder and the decoder are one-layer
LSTM networks. The encoder reads the whole sentence as input, and the
decoder reconstructs the input sentence.
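The following structural skeleton summarizes the three modules under the stated dimensions (512-d embeddings and hidden states, 2,048-d pooled ResNet-101 features). It is a sketch of one plausible PyTorch layout, not the authors' code; the attention layer in the translation model is only a placeholder.

```python
import torch.nn as nn
import torchvision.models as models

EMBED, HIDDEN = 512, 512   # all models use 512-d embeddings and hidden states

class ImageEncoder(nn.Module):
    """ResNet-101 features, average-pooled to 2,048-d, then projected to 512-d."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet101()                                     # load ImageNet weights in practice
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])    # keep global average pooling
        self.proj = nn.Linear(2048, EMBED)
    def forward(self, images):
        feats = self.backbone(images).flatten(1)                        # (B, 2048)
        return self.proj(feats)                                         # (B, 512)

class PivotDecoder(nn.Module):
    """One-layer LSTM decoder that emits the Chinese (pivot) caption word by word."""
    def __init__(self, vocab_zh):
        super().__init__()
        self.embed = nn.Embedding(vocab_zh, EMBED)
        self.lstm = nn.LSTM(EMBED, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, vocab_zh)

class TranslationModel(nn.Module):
    """Bidirectional LSTM encoder + attention + LSTM decoder (zh -> en)."""
    def __init__(self, vocab_zh, vocab_en):
        super().__init__()
        self.enc_embed = nn.Embedding(vocab_zh, EMBED)
        self.encoder = nn.LSTM(EMBED, HIDDEN, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * HIDDEN + HIDDEN, 1)                   # placeholder attention scorer
        self.dec_embed = nn.Embedding(vocab_en, EMBED)
        self.decoder = nn.LSTM(EMBED + 2 * HIDDEN, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, vocab_en)

class CaptionAutoencoder(nn.Module):
    """One-layer LSTM encoder/decoder that reconstructs MSCOCO captions (en -> en)."""
    def __init__(self, vocab_en):
        super().__init__()
        self.embed = nn.Embedding(vocab_en, EMBED)
        self.encoder = nn.LSTM(EMBED, HIDDEN, batch_first=True)
        self.decoder = nn.LSTM(EMBED, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, vocab_en)
```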
Training Setting. All the modules are randomly initialized before training
except the image CNN, for which we use a pre-trained model on ImageNet. We
first independently train the image Chinese captioner, the Chinese-to-English
translator, and the autoencoder with the cross-entropy loss on AIC-ICC, AIC-
MT, and MSCOCO corpora, respectively. During this stage, we use the Adam [28]
algorithm to update the models with a mini-batch size of 100. The initial learning
rate is 4e−4, and the momentum is 0.9. The best models are selected according
to the validation scores and are then used for the subsequent joint training.
Specifically, we combine the pre-trained models with the connection terms and
conduct a joint training with Equation (17). We set the hyper-parameter λ to
1.0, and train the joint model using Adam optimizer with a mini-batch size of
64 and an initial learning rate of 2e−4 . Weight decay and dropout are applied in
this training phase to prevent over-fitting.
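A sketch of this two-stage optimization schedule (Adam with learning rate 4e−4 and mini-batch size 100 for pre-training, then 2e−4 and mini-batch size 64 for joint training). The weight-decay value below is an assumed placeholder, since the text does not specify it, and for Adam the stated momentum of 0.9 is taken to mean β1 = 0.9.

```python
import torch

def build_optimizers(captioner, translator, autoencoder):
    # Stage 1: pre-train each module independently with the XE loss
    # (Adam, lr 4e-4, beta1 = 0.9, mini-batch size 100).
    pretrain_opts = {
        "captioner": torch.optim.Adam(captioner.parameters(), lr=4e-4, betas=(0.9, 0.999)),
        "translator": torch.optim.Adam(translator.parameters(), lr=4e-4, betas=(0.9, 0.999)),
        "autoencoder": torch.optim.Adam(autoencoder.parameters(), lr=4e-4, betas=(0.9, 0.999)),
    }
    # Stage 2: joint training with the connected objective of Eq. (17)
    # (Adam, lr 2e-4, mini-batch size 64; weight decay and dropout for regularization).
    joint_params = (list(captioner.parameters()) + list(translator.parameters())
                    + list(autoencoder.parameters()))
    joint_opt = torch.optim.Adam(joint_params, lr=2e-4, weight_decay=1e-5)  # decay value assumed
    return pretrain_opts, joint_opt
```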
Testing Setting. During testing, the output image description is first formed by
drawing words in the pivot language from i2tim→zh until an EOS token is reached,
and is then translated with nmtzh→en to the target language. We use beam
search for both inference procedures. Beam search is an efficient decoding
method for RNN-based models, which keeps the top-k hypotheses at each time
step and uses them as candidates to generate the top-k hypotheses
at the next time step. We set a fixed beam size of k = 5 for i2tim→zh
and k = 10 for nmtzh→en . We evaluate the quality of the generated image de-
scriptions with the standard evaluation metrics: BLEU [36], METEOR [12], and
CIDEr [43]. Since BLEU aims to assess how similar two sentences are, we also
evaluate the diversity of the generated sentences with Self-BLEU [54], which takes
each generated sentence as the hypothesis and the other generated sentences as
references, and calculates a BLEU score for every generated sentence. The final
Self-BLEU score is defined as the average of these BLEU scores.
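Self-BLEU as described above can be computed with a short script like the following, a sketch using NLTK's sentence-level BLEU. The smoothing choice is an assumption; the original Texygen implementation [54] may differ in details.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(generated, max_n=4):
    """Self-BLEU over a list of tokenized generated captions.
    Lower values indicate higher diversity."""
    weights = tuple(1.0 / max_n for _ in range(max_n))
    smooth = SmoothingFunction().method1
    scores = []
    for i, hypothesis in enumerate(generated):
        references = generated[:i] + generated[i + 1:]   # all other generated sentences
        scores.append(sentence_bleu(references, hypothesis,
                                    weights=weights, smoothing_function=smooth))
    return sum(scores) / len(scores) if scores else 0.0

# Example: self_bleu([["a", "dog", "runs"], ["a", "cat", "sits"], ["a", "dog", "sits"]])
```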
are trained on AIC-ICC and AIC-MT, respectively. We also report the results
of our implementation of FC-2K [39], which adopts a similar architecture.
with online Google translation in terms of B@n and CIDEr metrics, while ob-
taining significant improvements over the lower bound. This demonstrates the
effectiveness of the connection term on the pivot language. Moreover, by adding
the connection term on the target language, our model with the two connection
terms (Ri→x,x→y,y→ŷ) further improves the performance. This suggests that a
small corpus in the target domain is able to make the decoder generate image
descriptions that are more caption-like. The connection terms help to bridge
the word representations of the two different domains. The captions generated
by the Google translator have a higher METEOR score, which we attribute to the
following reasons. First, Google Translator generates longer captions than ours. Since METEOR
computes the score not only on the basis of n-gram precision but also of uni-gram
recall, its default parameters favor longer translations than other metrics [4]. Sec-
ond, in addition to exact word matching, METEOR considers matching of word
stems and synonyms. Since Google translator is trained on a much larger corpus
than ours, it generates more synonymous words. Table 4 also shows the results
of unpaired image English captioning on Flickr30K, where we can draw similar
conclusions.
We further evaluate the diversity of the generated image descriptions using
Self-BLEU metric. Table 5 shows the detailed Self-BLEU scores. It can be seen
that our method generates image descriptions with the highest diversity, com-
pared with the upper and lower bounds. For reference, we also report the Self-BLEU
scores computed on the ground truth captions.
Table 5. Self-BLEU scores on MSCOCO 5K test split. Note that lower Self-BLEU
scores imply higher diversity of the image descriptions.
datasets and by joint training of the model components. For example, with the
detected people in the first image, our model generates the sentence with “a
bunch of people in sports suits”, which is more diverse than the sentence with
“a group of baseball players” generated by the paired model.
Fig. 3. Examples of the generated sentences on MSCOCO test images, where i2tim→zh
is the image captioner trained on AIC-ICC, i2tim→en is the image captioner trained
on MSCOCO, i2tim→zh→en (Ri→x,x→y,y→ŷ ) and i2tim→zh→en (Ri→x,x→y ) are our pro-
posed models for unpaired image captioning, and GT stands for ground truth caption.
5 Conclusion
In this paper, we have proposed an approach to unpaired image captioning with
the help of a pivot language. Our method couples an image-to-pivot caption-
ing model with a pivot-to-target NMT model in a joint learning framework.
The coupling is done by adapting the word representations in the encoder and
the decoder of the NMT model to produce caption-like sentences. Empirical
evaluation demonstrates that our method consistently outperforms the baseline
methods on MSCOCO and Flickr30K image captioning datasets. In our future
work, we plan to explore the idea of ‘back-translation’ to create pseudo Chinese-
English translation data for English captions, and adapt our decoder language
model by training on this pseudo dataset.
Acknowledgments
This research was carried out at the Rapid-Rich Object Search (ROSE) Lab
at the Nanyang Technological University, Singapore. The ROSE Lab is sup-
ported by the National Research Foundation, Singapore, and the Infocomm Me-
dia Development Authority, Singapore. We gratefully acknowledge the support
of NVIDIA AI Tech Center (NVAITC) for our research at NTU ROSE Lab,
Singapore.
References
1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning
to align and translate. In: ICLR (2015)
2. Bengio, S., Vinyals, O., Jaitly, N., Shazeer, N.: Scheduled sampling for sequence
prediction with recurrent neural networks. In: NIPS. pp. 1171–1179 (2015)
3. Bertoldi, N., Barbaiani, M., Federico, M., Cattoni, R.: Phrase-based statistical
machine translation with pivot languages. In: IWSLT. pp. 143–149 (2008)
4. Cer, D., Manning, C.D., Jurafsky, D.: The best lexical metric for phrase-based
statistical mt system optimization. In: NAACL. pp. 555–563 (2010)
5. Chen, T.H., Liao, Y.H., Chuang, C.Y., Hsu, W.T., Fu, J., Sun, M.: Show, adapt and
tell: Adversarial training of cross-domain image captioner. In: ICCV. pp. 521–530
(2017)
6. Chen, Y., Liu, Y., Li, V.O.: Zero-resource neural machine translation with multi-
agent communication game. In: AAAI. pp. 5086–5093 (2018)
7. Cheng, Y., Xu, W., He, Z., He, W., Wu, H., Sun, M., Liu, Y.: Semi-supervised
learning for neural machine translation. In: ACL. pp. 1965–1974 (2016)
8. Cheng, Y., Yang, Q., Liu, Y., Sun, M., Xu, W.: Joint training for pivot-based
neural machine translation. In: IJCAI. pp. 3974–3980 (2017)
9. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk,
H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for
statistical machine translation. In: EMNLP. pp. 1724–1734 (2014)
10. Cohn, T., Hoang, C.D.V., Vymolova, E., Yao, K., Dyer, C., Haffari, G.: Incorpo-
rating structural alignment biases into an attentional neural translation model. In:
ACL. pp. 876–885 (2016)
11. Cohn, T., Lapata, M.: Machine translation by triangulation: Making effective use
of multi-parallel corpora. In: ACL. pp. 728–735 (2007)
12. Denkowski, M., Lavie, A.: Meteor universal: Language specific translation evalua-
tion for any target language. In: ACL. pp. 376–380 (2014)
13. Ding, H., Jiang, X., Shuai, B., Liu, A.Q., Wang, G.: Context contrasted feature
and gated multi-scale aggregation for scene segmentation. In: CVPR. pp. 2393–
2402 (2018)
14. El Kholy, A., Habash, N., Leusch, G., Matusov, E., Sawaf, H.: Language indepen-
dent connectivity strength features for phrase pivot statistical machine translation.
In: ACL. pp. 412–418 (2013)
15. Fang, H., Gupta, S., Iandola, F., Srivastava, R.K., Deng, L., Dollár, P., Gao, J.,
He, X., Mitchell, M., Platt, J.C., et al.: From captions to visual concepts and back.
In: CVPR. pp. 1473–1482 (2015)
16. Firat, O., Sankaran, B., Al-Onaizan, Y., Vural, F.T.Y., Cho, K.: Zero-resource
translation with multi-lingual neural machine translation. In: EMNLP. pp. 268–
277 (2016)
17. Gu, J., Cai, J., Joty, S., Niu, L., Wang, G.: Look, imagine and match: Improving
textual-visual cross-modal retrieval with generative models. In: CVPR. pp. 7181–
7189 (2018)
18. Gu, J., Cai, J., Wang, G., Chen, T.: Stack-captioning: Coarse-to-fine learning for
image captioning. In: AAAI. pp. 6837–6844 (2018)
19. Gu, J., Wang, G., Cai, J., Chen, T.: An empirical study of language cnn for image
captioning. In: ICCV. pp. 1222–1231 (2017)
20. Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang,
X., Wang, G., Cai, J., et al.: Recent advances in convolutional neural networks.
Pattern Recognition pp. 354–377 (2017)
21. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: CVPR. pp. 770–778 (2016)
22. Hitschler, J., Schamoni, S., Riezler, S.: Multimodal pivots for image caption trans-
lation. In: ACL. pp. 2399–2409 (2016)
23. Jean, S., Cho, K., Memisevic, R., Bengio, Y.: On using very large target vocabulary
for neural machine translation. In: ACL. pp. 1–10 (2015)
24. Jia, X., Gavves, E., Fernando, B., Tuytelaars, T.: Guiding long-short term memory
for image caption generation. In: ICCV. pp. 2407–2415 (2015)
25. Johnson, M., Schuster, M., Le, Q.V., Krikun, M., Wu, Y., Chen, Z., Thorat, N.,
Viégas, F., Wattenberg, M., Corrado, G., et al.: Google’s multilingual neural ma-
chine translation system: enabling zero-shot translation. TACL pp. 339–352 (2016)
26. Kalchbrenner, N., Blunsom, P.: Recurrent continuous translation models. In:
EMNLP. pp. 1700–1709 (2013)
27. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image
descriptions. In: CVPR. pp. 3128–3137 (2015)
28. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: ICLR (2015)
29. Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A.C., Berg, T.L.: Baby
talk: Understanding and generating image descriptions. In: CVPR. pp. 1601–1608
(2011)
30. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P.,
Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV. pp. 740–755
(2014)
31. Liu, C., Sun, F., Wang, C., Wang, F., Yuille, A.: Mat: A multimodal attentive
translator for image captioning. In: IJCAI. pp. 4033–4039 (2017)
32. Liu, S., Zhu, Z., Ye, N., Guadarrama, S., Murphy, K.: Improved image captioning
via policy gradient optimization of spider. In: ICCV. pp. 873–881 (2017)
33. Luong, M.T., Le, Q.V., Sutskever, I., Vinyals, O., Kaiser, L.: Multi-task sequence
to sequence learning. In: ICLR (2016)
34. Luong, M.T., Sutskever, I., Le, Q.V., Vinyals, O., Zaremba, W.: Addressing the
rare word problem in neural machine translation. In: ACL. pp. 11–19 (2015)
35. Mi, H., Sankaran, B., Wang, Z., Ittycheriah, A.: Coverage embedding models for
neural machine translation. In: EMNLP. pp. 955–960 (2016)
36. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic
evaluation of machine translation. In: ACL. pp. 311–318 (2002)
37. Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazeb-
nik, S.: Flickr30k entities: Collecting region-to-phrase correspondences for richer
image-to-sentence models. In: ICCV. pp. 2641–2649 (2015)
38. Ranzato, M., Chopra, S., Auli, M., Zaremba, W.: Sequence level training with
recurrent neural networks. In: ICLR (2016)
39. Rennie, S.J., Marcheret, E., Mroueh, Y., Ross, J., Goel, V.: Self-critical sequence
training for image captioning. In: CVPR. pp. 7008–7024 (2017)
40. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural
networks. In: NIPS. pp. 3104–3112 (2014)
41. Tu, Z., Lu, Z., Liu, Y., Liu, X., Li, H.: Modeling coverage for neural machine
translation. In: ACL. pp. 76–85 (2016)
42. Utiyama, M., Isahara, H.: A comparison of pivot methods for phrase-based statis-
tical machine translation. In: NAACL. pp. 484–491 (2007)
43. Vedantam, R., Lawrence Zitnick, C., Parikh, D.: Cider: Consensus-based image
description evaluation. In: CVPR. pp. 4566–4575 (2015)
44. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image
caption generator. In: CVPR. pp. 3156–3164 (2015)
45. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: Lessons learned from
the 2015 mscoco image captioning challenge. PAMI pp. 652–663 (2017)
46. Wu, H., Wang, H.: Pivot language approach for phrase-based statistical machine
translation. Machine Translation pp. 165–181 (2007)
47. Wu, J., Zheng, H., Zhao, B., Li, Y., Yan, B., Liang, R., Wang, W., Zhou, S., Lin,
G., Fu, Y., et al.: Ai challenger: A large-scale dataset for going deeper in image
understanding. arXiv preprint arXiv:1711.06475 (2017)
48. Wu, Q., Shen, C., Liu, L., Dick, A., Hengel, A.v.d.: What value do explicit high
level concepts have in vision to language problems? In: CVPR. pp. 203–212 (2016)
49. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R.S.,
Bengio, Y.: Show, attend and tell: Neural image caption generation with visual
attention. In: ICML. pp. 2048–2057 (2015)
50. Yang, X., Zhang, H., Cai, J.: Shuffle-then-assemble: Learning object-agnostic visual
relationship features. In: ECCV (2018)
51. Yao, T., Pan, Y., Li, Y., Qiu, Z., Mei, T.: Boosting image captioning with at-
tributes. In: ICCV. pp. 22–29 (2017)
52. You, Q., Jin, H., Wang, Z., Fang, C., Luo, J.: Image captioning with semantic
attention. In: CVPR. pp. 4651–4659 (2016)
53. Zahabi, S.T., Bakhshaei, S., Khadivi, S.: Using context vectors in improving a
machine translation system with bridge language. In: ACL. pp. 318–322 (2013)
54. Zhu, Y., Lu, S., Zheng, L., Guo, J., Zhang, W., Wang, J., Yu, Y.: Texygen: A bench-
marking platform for text generation models. In: SIGIR. pp. 1097–1100 (2018)
55. Zoph, B., Yuret, D., May, J., Knight, K.: Transfer learning for low-resource neural
machine translation. In: EMNLP. pp. 1568–1575 (2016)