
A Multi-instance Multi-label Dual Learning Approach for

Video Captioning

WANTING JI and RUILI WANG, School of Natural and Computational Sciences, Massey University,
Auckland, New Zealand

Video captioning is a challenging task in the field of multimedia processing, which aims to generate informa-
tive natural language descriptions/captions to describe video contents. Previous video captioning approaches
mainly focused on capturing visual information in videos using an encoder-decoder structure to generate
video captions. Recently, a new encoder-decoder-reconstructor structure was proposed for video captioning,
which captured the information in both videos and captions. Based on this, this article proposes a novel multi-
instance multi-label dual learning approach (MIMLDL) to generate video captions based on the encoder-
decoder-reconstructor structure. Specifically, MIMLDL contains two modules: caption generation and video
reconstruction modules. The caption generation module utilizes a lexical fully convolutional neural network
(Lexical FCN) with a weakly supervised multi-instance multi-label learning mechanism to learn a translatable
mapping between video regions and lexical labels to generate video captions. Then the video reconstruction
module synthesizes visual sequences to reproduce raw videos using the outputs of the caption generation
module. A dual learning mechanism fine-tunes the two modules according to the gap between the raw and the
reproduced videos. Thus, our approach can minimize the semantic gap between raw videos and the generated
captions by minimizing the differences between the reproduced and the raw visual sequences. Experimental
results on a benchmark dataset demonstrate that MIMLDL can improve the accuracy of video captioning.
CCS Concepts: • Computing methodologies → Machine learning; Neural networks; Video summariza-
tion;
Additional Key Words and Phrases: Deep neural networks, Dual learning, Multiple instance learning, Multi-
media processing, Video captioning
ACM Reference format:
Wanting Ji and Ruili Wang. 2021. A Multi-instance Multi-label Dual Learning Approach for Video Captioning.
ACM Trans. Multimedia Comput. Commun. Appl. 17, 2s, Article 72 (June 2021), 18 pages.
https://doi.org/10.1145/3446792

1 INTRODUCTION
Video captioning is a challenging task in the field of multimedia processing, bridging vision
and language. It aims to generate informative natural language descriptions/captions to describe
video contents [1-3]. Compared with other captioning tasks, such as image captioning [4-6],

Updated author affiliation: Wanting Ji, School of Information, Liaoning University, Shenyang, Liaoning, China; Ruili Wang,
School of Natural and Computational Sciences, Massey University, Auckland, New Zealand.
Authors’ address: W. Ji and R. Wang (corresponding author), School of Natural and Computational Sciences, Massey
University, Auckland, New Zealand; emails: [email protected], [email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2021 Association for Computing Machinery.
1551-6857/2021/06-ART72 $15.00
https://doi.org/10.1145/3446792


video captioning is more challenging. This is because a video contains much more complicated
information, such as objects, actions, and scenes, than a still image does. So far, various video
captioning approaches have been developed, and many of them have been applied to real-world
applications. For example, movie transcription leverages video captioning approaches to convert
a movie into a textual story.
Early research [7-9] on video captioning was usually based on template-based language models,
which generated video captions using predefined language templates. Specifically, these template-
based language models defined several language templates, which could be separated into several
phrases, such as object, subject, and verb, according to specific grammar rules [8]. Each part of a
video was aligned with a word using object detection methods, and then the detected words were
placed into different phrases of a predefined template to form a sentence describing the video contents.
However, these models cannot textualize all key information in videos, i.e., these models cannot
map all key information of a video to words.
Recently, with the development of numerous deep neural networks, many advanced neural
networks [10-12, 47] have been developed for video captioning and achieved significant success.
These deep neural network-based video captioning approaches were usually based on an encoder-
decoder structure [13-16], because the encoder-decoder structure allows the training process of a
video captioning model to work in an end-to-end fashion. Specifically, the encoder of a video
captioning model extracted features from video frames, and then
the extracted features would be passed to the decoder to generate descriptive sentences for video
contents [17].
The encoder-decoder structure has a limitation in generating video captions. Specifically, to
train a video captioning model, ground-truth captions/words are used as the input for the decoder.
However, in the test process, this input will be replaced by the words generated by the decoder.
Since the generated words may be different from the ground truth, a difference/bias can be
generated, which leads to the accumulation of errors in the test process. In other words,
once a “bad” word is generated by the decoder in the test process, this error cannot be detected or
ignored by the decoder, and then this error will be accumulated and propagated as the length of
the generated word sequence increases [1].
More recently, a new encoder-decoder-reconstructor structure [1] was proposed for video cap-
tioning to overcome the aforementioned limitation. Specifically, this structure exploited the bi-
directional flows (i.e., forward and backward flows) between captions and video contents. In the
forward flow (i.e., video-to-caption), an encoder-decoder structure was used to produce video cap-
tions using the encoded video features. In the backward flow (i.e., caption-to-video), the proposed
reconstructors were devised to reproduce raw video features. Thus, the loss yielded by the two
flows was used to train the proposed video captioning model in an end-to-end manner. Although
this model jointly utilized video features and semantic information to generate video captions
and outperformed the existing encoder-decoder models, it cannot capture key frames/information
from videos to align with semantic information closely.
Multi-instance learning is an effective solution to this problem, which is widely applied in mul-
timedia applications. For example, Shen et al. [18] proposed a weakly supervised video captioning
approach, which utilized multi-instance learning to link video frame regions with lexical labels,
and then the labeled video region-sequences were fed into the decoder to generate video captions.
Thus, a weak association could be built between video region-sequences and the generated video
captions. However, their approach was based on the encoder-decoder structure rather than a better
encoder-decoder-reconstructor structure.
In this article, we propose a novel multi-instance multi-label dual learning approach
(MIMLDL) to generate video captions based on the encoder-decoder-reconstructor structure.


Specifically, MIMLDL contains two modules: caption generation and video reconstruction mod-
ules. The caption generation module utilizes a lexical fully convolutional neural network
(Lexical FCN) with a weakly supervised multi-instance multi-label learning mechanism to learn
a translatable mapping between video regions and lexical labels to generate video captions. Then
the video reconstruction module utilizes the output of the caption generation module to synthesize
visual sequences to reproduce raw videos. A dual learning mechanism fine-tunes the two modules
according to the gap between the raw and the reproduced videos. In other words, our approach
can minimize the semantic gap between raw videos and the generated captions by minimizing
the differences between the reproduced and the raw visual sequences. Experimental results on a
benchmark dataset demonstrate that the proposed approach can improve the accuracy of video
captioning.
The contributions of this article are twofold: (i) We propose an effective approach that can gener-
ate accurate captions to describe video contents. (ii) The proposed encoder-decoder-reconstructor-
based multi-instance multi-label dual learning approach can learn a translatable mapping between
video regions and lexical labels and minimize the semantic gap between raw videos and the gener-
ated captions by minimizing the differences between the reproduced and the raw visual sequences.
The rest of this article is organized as follows: The related work on video captioning and multi-
instance learning is reviewed in Section 2. Section 3 presents the proposed video captioning ap-
proach. The experimental setting details and experimental results are provided and discussed in
Section 4. The conclusion and our future work are presented in Section 5.

2 RELATED WORK
In this section, we briefly provide a literature review of video captioning approaches in Section 2.1,
and then review the applications of multi-instance learning in Section 2.2. Finally, we introduce
the dual learning mechanism in Section 2.3.

2.1 Video Captioning


Producing captions from videos is a challenging task that has received wide attention in the field of
multimedia processing. Over the past few years, various approaches have been developed for video
captioning. Most of these approaches can be implemented through the encoder-decoder structure.
The key issue in video captioning is to generate informative and accurate descriptions for videos.
A common structure of the encoder-decoder based video captioning models is to combine con-
volutional neural networks (CNNs) and recurrent neural networks (RNNs) according to
their advantages, that is, using CNNs as an encoder to extract features (i.e., compact representa-
tional vectors [49]) from an input video, and then using RNNs as a decoder to construct a language
model for video caption generation based on the extracted vectors. Venugopalan et al. [13] cap-
tured video representation vectors by averaging the CNN features of each video frame, and then
fed the captured vectors into a Long Short Term Memory (LSTM) network to generate video
captions.
Moreover, since the temporal dynamics of video sequences is also important for video caption-
ing, Venugopalan et al. [14] proposed the well-known video captioning approach Sequence to
Sequence Video to Text (S2VT). In S2VT, in addition to extracting spatial features from video
frames, they also used optical flow to extract temporal/motion features, and then utilized LSTMs
to process the extracted spatial and temporal features for video caption generation. Additionally,
Zhang and Tian [15] proposed a two-stream network to capture spatial and temporal information
from videos to generate video captions. Tian et al. [50] proposed a dense captioning approach using
hourglass-structured residual learning. By incorporating dense connected networks and residual
learning, their approach could generate discriminant feature maps for dense captioning.


Recently, attention mechanisms have been considered as an effective way to improve the per-
formance of video captioning approaches based on the encoder-decoder structure. Various atten-
tion mechanisms were developed and introduced to video captioning approaches. Yao et al. [16]
further proposed an attention mechanism to assign weights to spatial features, and then fused
the weighted features as video representation vectors. Yan et al. [3] proposed a spatial-temporal
attention mechanism for video captioning and used the proposed attention mechanism on the
extracted spatial and temporal features to select the significant regions from videos for video cap-
tioning. However, these video captioning approaches only considered the spatial and temporal
features in videos to generate video captions.
To overcome this problem, the multimodal learning mechanism was introduced to video cap-
tioning approaches. Since a video contains multiple modalities (e.g., visual modality, audio modal-
ity, and textual modality), multimodal features can be fused and used to generate video captions.
Wang et al. [2] proposed a video captioning approach named Multimodal Memory Model (M3)
based on visual modal features and textual modal features, which could address the visual-textual
alignment issue. In M3, a visual and textual shared memory was proposed, which could be used to
model long-term visual-textual dependency and guide visual attention for video captioning based
on the interaction between videos and captions.
To summarize, current approaches for video captioning mainly rely on the visual information
in videos but ignore the use of the generated captions. These approaches can perform well in gen-
erating simple sentences to describe video contents. However, in some cases, such as when videos
contain many details, these approaches cannot achieve performance comparable to humans. To
solve this problem, this article aims to develop a novel video captioning approach that can capture
detailed information in videos by using multiple instance learning and dual learning mechanisms.
In the following sections, we will introduce these two mechanisms in detail.

2.2 Multi-instance Learning


Multi-instance learning (MIL) is a form of weakly supervised learning. It assumes that each data
sample in a dataset is regarded as a bag [18-21]. Each bag is a collection of instances. In the case where each
instance can be labeled by a binary label, if the labels of all instances in a bag are negative, then the
bag will be labeled as negative; if at least one of the instances in the bag has a positive label, then
the bag will be labeled as positive. By using multiple instance learning, all bags in the dataset can
be labeled according to the labels of their instances. This is different from classical supervised learning,
where the label of each sample is known.
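As a minimal illustration of this bag-labeling rule (a sketch on hypothetical data, not taken from the cited works), the following Python snippet labels a bag as positive if any of its instances is positive:

# Minimal sketch of the multi-instance labeling rule: a bag is positive
# if at least one of its instances is positive, otherwise negative.
from typing import List

def label_bag(instance_labels: List[int]) -> int:
    """Return 1 (positive) if any instance label is 1, else 0 (negative)."""
    return int(any(label == 1 for label in instance_labels))

# Hypothetical bags of binary instance labels.
bags = [[0, 0, 0], [0, 1, 0], [1, 1, 0]]
print([label_bag(bag) for bag in bags])  # [0, 1, 1]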
Multiple instance learning was first proposed by Dietterich et al. [19] to predict drug perfor-
mance for drug design. They analyzed molecules that are known to be useful or not useful for a
drug to predict the performance of new molecules for this drug. When a new molecule is input
into their approach, how closely the molecule is coupled to a target “binding area” can be pre-
dicted. If the shape of the molecule and the binding area can be tightly coupled, then the molecule
can be used for synthesizing this drug; otherwise, it cannot.
Recently, multiple instance learning has been widely applied in multimedia applications, such
as image processing and video processing.
Shamsolmoali et al. [20] proposed an adversarial multi-instance learning-based human
pose estimation approach (AMIL), which overcomes pose estimation problems resulting
from joint obstructions and overlapping human bodies. They proposed a structure-aware
network to integrate the priors of human body structures. The learning model of AMIL was a gen-
erative adversarial network that contained two residual multiple instance learning sub-models as
the generator and the discriminator, respectively. In the training process, the discriminator was


used to distinguish the ground-truth from the generated human poses, and the adversarial loss
would be backpropagated to the generator. This process could help the generator to learn reason-
able body configurations to improve the accuracy of pose estimation.
Later, Shen et al. [18] proposed a weakly supervised dense video captioning approach that gen-
erated dense captions using video-level annotations for training. Specifically, three modules were
contained in their approach: (i) a visual module that created a weak mapping between the regions
in video frames and the words in annotations based on multi-instance multi-label learning mech-
anism; (ii) a region-sequence module that produced informative region-sequences based on the
outcomes of the visual module by connecting the regions between video frames; (iii) a language
module that produced dense video captions using the generated region-sequences.
Further, Zhang et al. [21] proposed a multi-instance multi-label approach that could recognize
and localize actions in untrimmed videos. Since most of the background contents in an untrimmed
video were not related to the actions of interest and could reduce the accuracy of action recogni-
tion, they spatially and temporally segmented each untrimmed video into person-centric clips using
pose estimation and tracking techniques. Then the action recognition problem could be formulated
as a multi-instance multi-label learning problem by associating the bag-of-instances structure with
video-level labels.
In this article, we use the multi-instance multi-label learning mechanism to build a weak map-
ping between the words in annotations and the regions in video frames.

2.3 Dual Learning


Dual learning is an effective mechanism that has been widely used in various machine learning
applications, such as image segmentation [22], sentiment analysis [23], image-to-image transfor-
mation [24-26], machine translation [27-30], and so on. The main idea of dual learning is very
intuitive. It aims to leverage the duality between two related tasks as a feedback signal to boost
the performances of both tasks [31, 32].
To leverage the duality between two related tasks, a dual learning framework usually consists
of two agents: a primal model and a dual model. The primal model maps an x from one domain to
another, while the dual model maps it back. The mapping functions between these two domains
are trained simultaneously. Thus, one function can be close to the inverse of the other. For example,
when we apply the dual learning mechanism to machine translation, if we first translate a sentence
from English to French, then we can get an English sentence that is the same or similar to the
original English sentence when translating the French sentence back to English.
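The round-trip idea can be sketched in a few lines of Python; the primal and dual mappings below are toy placeholders (linear maps), not the translation models of Reference [27]:

# Sketch of the dual-learning feedback signal: map x forward with the primal
# model, map the result back with the dual model, and penalize the gap
# between x and its reconstruction. The two models here are placeholders.
import numpy as np

def round_trip_loss(x, primal, dual):
    """Reconstruction distortion used as the duality feedback signal."""
    y = primal(x)          # e.g., video -> caption in this article
    x_rec = dual(y)        # e.g., caption -> reconstructed video features
    return float(np.mean((np.asarray(x) - np.asarray(x_rec)) ** 2))

# Toy example with linear maps standing in for the two models.
A = np.array([[2.0, 0.0], [0.0, 0.5]])
x = np.array([1.0, 2.0])
loss = round_trip_loss(x, primal=lambda v: A @ v, dual=lambda v: np.linalg.inv(A) @ v)
print(loss)  # ~0.0 when the dual map is the inverse of the primal map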
He et al. [27] first proposed the dual learning mechanism and applied it to machine translation.
In their approach, two dual translators were updated in a reinforcement learning manner, and a
reconstruction distortion was used as the feedback signal. After that, Wang et al. [28] and Xia
et al. [33] exploited the joint distribution constraint in the dual learning mechanism. Their
research showed that the joint distribution of samples over the two domains should remain invariant
no matter from which domain it is computed. Further, Xia et al. [34] proposed a model-level dual
learning approach to share the components between the primal model and the dual model.
In addition, Zhao et al. [35] proposed a cross-domain image captioning approach using dual
learning to overcome the problem of lack of image-text pairs in the training set. Wang et al. [31]
proposed a multi-agent dual learning framework, which consisted of multiple primal and dual
models, for machine translation and image translation.
In this article, our proposed approach utilizes attention-based dual learning for video caption-
ing. Unlike the existing encoder-decoder model that only contains a video-to-caption forward
flow, we also build a caption-to-video backward flow. In other words, by fully considering the


Fig. 1. The framework of our MIMLDL approach for video captioning.

bidirectional training between videos and captions, our proposed approach can further improve
the video captioning accuracy.

3 THE PROPOSED APPROACH


This section details the proposed multi-instance multi-label dual learning approach
(MIMLDL) for video captioning. As illustrated in Figure 1, MIMLDL consists of two modules:
a caption generation module and a video reconstruction module. In this section, a brief introduc-
tion of Lexical-FCN is provided in Section 3.1, the two modules are presented in Section 3.2 and
Section 3.3, respectively, and the loss function of MIMLDL used for training is presented in Section 3.4.

3.1 Lexical Fully Convolutional Neural Network (Lexical-FCN)


Lexical-FCN [18] is a multi-instance multi-label learning based deep neural network, which can
create a mapping between lexical labels and video frame regions. In this article, a lexical vocab-
ulary is first built from the video caption training set. Then the mapping between the lexical vo-
cabulary and video frame regions can be established using the multi-instance multi-label learning
mechanism.
3.1.1 Lexical Vocabulary. To build a lexical vocabulary from video captions, we first use part-of-
speech tagging [36] to extract words from the video caption training set. The captured words can belong to any
part of a sentence, such as nouns, pronouns, verbs, and adjectives. Following the setting in
Reference [18], some of the most frequent function words (such as “is,” “are,” “to,” “with,” “and,”
“in,” “at,” and “on”) are considered as stop words, which are removed from the built lexical
vocabulary. For the remaining captured words, if a word appears at least three times in the
training set, then it is kept in the lexical vocabulary; otherwise, it is removed. This
threshold is selected by our experiments. We have tested our approach with the threshold set to 1, 3, 5,
and 7, and our approach achieved the best performance when the threshold is 3. When the frequency

of a word is required to be higher, the richness of the generated lexical vocabulary will be reduced,
but the words in the vocabulary are more likely to be the keywords for describing video contents.
Finally, all words that meet the above conditions compose our lexical vocabulary V.
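A minimal sketch of this vocabulary-building step is given below; the stop-word list and frequency threshold follow the text above, while the whitespace tokenization is a simplification of the part-of-speech processing in Reference [36]:

# Build a lexical vocabulary from training captions: drop frequent function
# words and keep words that appear at least `min_count` times (3 in this article).
from collections import Counter

STOP_WORDS = {"is", "are", "to", "with", "and", "in", "at", "on"}

def build_lexical_vocabulary(captions, min_count=3):
    counts = Counter(
        word
        for caption in captions
        for word in caption.lower().split()   # simplified tokenization
        if word not in STOP_WORDS
    )
    return {word for word, count in counts.items() if count >= min_count}

# Hypothetical training captions.
captions = ["a child is running in a park", "a kid runs in the park",
            "a child plays in a park"]
print(build_lexical_vocabulary(captions, min_count=2))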
3.1.2 Lexical Mapping. To map video frame regions with the lexical vocabulary using multi-
instance multi-label learning mechanism, we follow the setting in Reference [18], which utilized
a word detection method [37] to achieve multi-instance learning and a deep lexical classification
method [38] to achieve multi-label learning.
Multi-instance learning. Multi-instance learning is a form of weakly supervised learning. Given
a bag of instances $X_i = \{x_{i1}, \ldots, x_{ij}\}$, where $x_{ij} \in \mathbb{R}^d$ denotes a d-dimensional feature vector for
the jth instance in the ith bag, if no instance in the bag $X_i$ contains the word $w$, then this bag
is regarded as negative and the word label $y_i^w$ is set to 0; if at least one of the instances in $X_i$
contains the word $w$, then this bag is regarded as positive and the word label $y_i^w$ is set to 1. For
example, as shown in Figure 2(a), given a video frame, the regions in the frame can be associated
with word labels such as “child,” “running,” and “park” in the lexical vocabulary through multi-
instance learning.
Multi-label learning. Multi-label learning is a machine learning mechanism that is used to asso-
ciate an instance with multiple class labels simultaneously [48]. For this article, given an instance
$x_i$, it will be associated with word labels $y_i = \{y_i^1, \ldots, y_i^k\}$ using the multi-label learning mechanism,
where $k$ is the number of labels. For example, as shown in Figure 2(b), given a video frame, the
little boy in the frame can be associated with labels such as “child,” “kid,” or other synonyms in
the lexical vocabulary through multi-label learning.
Multi-instance multi-label learning. Multi-instance multi-label learning can be considered as
a generalization of multi-instance learning. For this article, given a bag of instances $X_i =
\{x_{i1}, \ldots, x_{ij}\}$, each instance in this bag can be labeled with one or multiple word labels using the
multi-instance multi-label learning mechanism. For example, as shown in Figure 2(c), given a video
frame, the regions in this frame are first associated with word labels such as “child,” “running,”
and “park” through multi-instance learning. Then by using multi-label learning, other labels such
as “kid” and other synonyms in the lexical vocabulary will also be associated with this little boy.
The loss function of multi-instance multi-label learning. Following the setting in Reference [18],
we utilize cross-entropy loss to evaluate/measure the performance of multi-instance multi-label
learning. The loss function of a bag of instances can be defined as:
$$L\left(X, y; \theta\right) = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \cdot \log p_i + \left(1 - y_i\right) \cdot \log\left(1 - p_i\right) \right], \qquad (1)$$
where $y_i$ denotes the label vector for the bag $X_i$; $\theta$ denotes the model parameters; $N$ denotes the number
of bags; $p_i$ denotes the corresponding probability vector.
Since the instances in a bag can be labeled as positive or negative, a noisy-OR [39] formulation
is used to combine the probabilities of the word $w$ in the $i$th bag. Mathematically, the probability
$\hat{p}_i^w$ can be defined as:
$$\hat{p}_i^w = P\left(y_i^w = 1 \mid X_i; \theta\right) = 1 - \prod_{x_{ij} \in X_i}\left(1 - P\left(y_i^w = 1 \mid x_{ij}; \theta\right)\right). \qquad (2)$$

Further, a sigmoid function is used to model the word probability as:
$$P\left(y_i^w = 1 \mid x_{ij}; \theta\right) = \sigma\left(W_w x_{ij} + b_w\right), \qquad (3)$$
where $W_w$ and $b_w$ denote the weight matrix and bias of the word $w$, respectively, $\sigma(\cdot)$
denotes the sigmoid function, and the output of the last mean pooling layer of ResNet-50 (i.e.,
pool5 for ResNet-50) is used to represent the instance $x_{ij}$.
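A minimal PyTorch-style sketch of the noisy-OR bag probability in Equation (2) and the cross-entropy bag loss in Equation (1) is shown below; the instance features and word weights are random placeholders rather than actual ResNet-50 pool5 outputs:

# Sketch of the noisy-OR multi-instance probability (Eq. 2) and the
# binary cross-entropy bag loss (Eq. 1) for a single word w. Features and
# weights are random placeholders, not actual ResNet-50 pool5 features.
import torch

def bag_word_probability(instance_feats, w, b):
    """p_hat_i^w = 1 - prod_j (1 - sigmoid(w^T x_ij + b))."""
    instance_probs = torch.sigmoid(instance_feats @ w + b)     # (num_instances,)
    return 1.0 - torch.prod(1.0 - instance_probs)

def mil_loss(bags, labels, w, b):
    """Mean binary cross-entropy over bags (Eq. 1)."""
    probs = torch.stack([bag_word_probability(x, w, b) for x in bags])
    return torch.nn.functional.binary_cross_entropy(probs, labels)

# Toy example: two bags of 2048-d instance features for one word label.
torch.manual_seed(0)
bags = [torch.randn(5, 2048), torch.randn(3, 2048)]
labels = torch.tensor([1.0, 0.0])
w, b = torch.randn(2048) * 0.01, torch.zeros(1)
print(mil_loss(bags, labels, w, b))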

Fig. 2. An example of (a) multi-instance learning, (b) multi-label learning, and (c) multi-instance multi-label
learning.

3.1.3 Region Sequence Generation. A region sequence is generated by matching and sequen-
tially connecting the same or similar regions in different video frames. Since Lexical-FCN can link
each region in a video frame with a lexical label/description, the process of region sequence generation can be
formulated as a subset selection problem [40]. Specifically, we initialize an empty subset and then
sequentially add one most informative and coherent region in each video frame into the subset.
Mathematically, we define $S_v$ as the set of all possible region sequences in a video $v$, and $A$ is
defined as a region sequence subset of $S_v$, i.e., $A \subseteq S_v$. Thus, the optimized region sequence $A^*$
can be represented as:
$$A^* = \arg\max_{A \subseteq S_v} R\left(x_v, A\right), \qquad (4)$$
where $x_v$ denotes all region feature representations of the video $v$, and $R(\cdot)$ is an objective function
for optimization. Following the setting in Reference [18], we utilize a linear combination objective
to optimize $A$:
$$R\left(x_v, A\right) = W_v^T f\left(x_v, A\right), \qquad (5)$$

where $f(\cdot)$ denotes an objective function to evaluate a region sequence. In this article, we also
evaluate region sequences from three aspects, i.e., informativeness, coherence, and diversity. Thus,
$f = [f_{\mathrm{inf}}, f_{\mathrm{div}}, f_{\mathrm{coh}}]^T$.
Informativeness evaluates how many video contents can be described, coherence evaluates the
temporal coherence between region sequences, and diversity evaluates the differences between a
region sequence and other region sequences. Assuming that the probability distribution of a region
sequence can be represented as $\{p_i^w\}_{i=1}^{N}$, and $q^w$ denotes the probability distribution of a candidate
region-sequence, the three aspects of a region sequence can be defined as:
$$f_{\mathrm{inf}}\left(x_v, A_t\right) = \sum_{w} p^w; \quad p^w = \max_{i \in A_t} p_i^w, \qquad (6)$$
$$f_{\mathrm{div}} = \sum_{i=1}^{N} \int_{w} p_i^w \log \frac{p_i^w}{q^w}\, dw, \qquad (7)$$
$$f_{\mathrm{coh}} = \sum_{r_s \in A_{t-1}} \langle x_{r_t}, x_{r_s} \rangle, \qquad (8)$$
where $A_0 = \emptyset$; $r_t$ denotes the $t$th region added into the region sequence $A$; $x_{r_t}$ is the feature vector
of the region $r_t$; $x_{r_s}$ represents the feature vector of the region $r_s$; $r_s$ denotes a region
added into the region sequence $A$ before $r_t$; $\langle \cdot, \cdot \rangle$ is a dot-product operation between two
normalized feature vectors.
Further, the objective function $f$ in Equation (4) is set to be a monotone submodular function and
$W_v$ is set to be non-negative. This allows us to find a near-optimal solution efficiently. Following the
setting in Reference [18], we utilize submodular maximization to learn $W_v$. Thus, the marginal
gain function can be defined as:
$$L\left(W_v; r\right) = R\left(A_{t-1} \cup \{r\}\right) - R\left(A_{t-1}\right) = W_v^T f\left(x_v, A_{t-1} \cup \{r\}\right) - W_v^T f\left(x_v, A_{t-1}\right). \qquad (9)$$
To maximize the marginal gain, $A_t$ will meet the following conditions:
$$A_t = A_{t-1} \cup \{r_t\}; \quad r_t = \arg\max_{r \in R_t} L\left(W_v; r\right). \qquad (10)$$

Thus, $W_v$ can be optimized by:
$$\min_{W_v \ge 0} \frac{1}{N} \sum_{i=1}^{N} \max_{r \in r_i} L_i\left(W_v; r\right) + \frac{\lambda}{2} \|W_v\|^2, \qquad (11)$$
where the max term of Equation (11) is a generalized hinge loss. Besides, since words are associ-
ated with regions, the generated region sequences should include high-scored words for caption
generation. The matching score will be calculated by:
$$f_i = \sum_{w \in V_S;\, p_i^w \ge \theta} p_i^w, \qquad (12)$$

where $V_S$ denotes a lexical subset that is formed by the lexical labels in sentence $S$ based on the
lexical vocabulary $V$, and $\theta$ is the threshold of $p_i^w$. Therefore, the process of video region sequence
generation can be achieved through the following five steps:
generation can be achieved through the following five steps:
(i) initialize W v = 1 (i.e., set all elements in W v equal to 1);
(ii) get a region sequence with current W v using submodular maximization;
(iii) weakly associate a sentence S with region sequence using a winner-takes-all scheme [18];
(iv) update W v with the output of step (iii);
(v) repeat steps (ii)–(iv) until W v is converged.
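A simplified sketch of the greedy selection in step (ii) is shown below; the informativeness, diversity, and coherence terms are reduced to toy surrogates on random features, so it illustrates the marginal-gain loop of Equations (9) and (10) rather than the full procedure of Reference [18]:

# Greedy region-sequence selection sketch: at each frame, add the region with
# the largest objective value R(A ∪ {r}) under R = W_v^T [f_inf, f_div, f_coh]
# (equivalently, the largest marginal gain, since R(A) is constant over candidates).
import numpy as np

def objective(selected, candidate, word_probs, feats, w_v):
    """Score a region sequence after adding `candidate` = (frame_idx, region_idx)."""
    chosen = selected + [candidate]
    probs = np.stack([word_probs[f][r] for f, r in chosen])            # (len(chosen), vocab)
    f_inf = float(probs.max(axis=0).sum())                             # max-pooled word scores
    f_div = float(np.std(np.stack([feats[f][r] for f, r in chosen])))  # crude diversity proxy
    f_coh = 0.0
    if selected:                                                       # coherence with the previous region
        prev_f, prev_r = selected[-1]
        f_coh = float(feats[candidate[0]][candidate[1]] @ feats[prev_f][prev_r])
    return float(w_v @ np.array([f_inf, f_div, f_coh]))

def greedy_region_sequence(word_probs, feats, w_v):
    selected = []
    for frame in range(len(feats)):
        gains = [objective(selected, (frame, r), word_probs, feats, w_v)
                 for r in range(len(feats[frame]))]
        selected.append((frame, int(np.argmax(gains))))                # Eq. (10)
    return selected

rng = np.random.default_rng(0)
num_frames, regions_per_frame, vocab, dim = 4, 3, 10, 16
word_probs = rng.random((num_frames, regions_per_frame, vocab))
feats = rng.standard_normal((num_frames, regions_per_frame, dim))
print(greedy_region_sequence(word_probs, feats, w_v=np.ones(3)))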

3.2 Caption Generation Module


The proposed caption generation module aims to generate captions from videos. In this module, a
Lexical FCN is used to extract features from video frames and learn a translatable mapping between
video regions and lexical labels. Then the labeled video region-sequences will be used to generate
video captions using an LSTM-based decoder.
To produce a descriptive sentence $S = \{s_1, s_2, \ldots, s_n\}$ to depict the content of a video $v$, conven-
tional encoder-decoder models usually calculate the caption generation probability word-by-word:
$$P\left(S \mid v\right) = \prod_{t=1}^{n} P\left(s_t \mid s_{<t}, v; \vartheta\right), \qquad (13)$$
where $n$ denotes the length of the sentence $S$; $s_t$ is the word generated at timestep $t$; $s_{<t}$ denotes
the previously generated caption $\{s_1, s_2, \ldots, s_{t-1}\}$; $\vartheta$ represents the parameters of the encoder-decoder
model.
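A minimal PyTorch sketch of this word-by-word factorization with an LSTM decoder is shown below; the vocabulary size, the assumption that the <BOS> token has id 0, and the use of a single mean-pooled video feature are illustrative choices, not the exact setup of this article:

# Sketch of Eq. (13): the caption probability factorizes over timesteps,
# each word conditioned on previously generated words and the video code.
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=468, hidden_dim=512, video_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(video_dim, hidden_dim)    # condition on the video
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def caption_log_prob(self, video_feat, caption_ids):
        """Sum of log P(s_t | s_<t, v) over the caption."""
        h = torch.tanh(self.init_h(video_feat))
        c = torch.zeros_like(h)
        log_prob = 0.0
        prev = torch.zeros_like(caption_ids[:1])          # <BOS> assumed to be id 0
        for t in range(caption_ids.shape[0]):
            h, c = self.lstm(self.embed(prev), (h, c))
            logits = self.out(h)
            log_prob = log_prob + torch.log_softmax(logits, dim=-1)[0, caption_ids[t]]
            prev = caption_ids[t:t + 1]
        return log_prob

decoder = TinyDecoder()
video_feat = torch.randn(1, 2048)                         # mean-pooled video feature
caption = torch.tensor([5, 42, 7])                        # hypothetical word ids
print(decoder.caption_log_prob(video_feat, caption))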
Encoder. The encoder is used to extract features from videos to generate video representation.
Previous video captioning approaches usually utilized predefined CNNs (such as AlexNet [13],
GoogleNet [30], and VGG19 [41]) as the encoder to extract features from videos. This is because
these neural networks can convert each video frame into a fixed-length video representation. In
this article, we utilize Lexical FCN as an encoder to extract features from raw videos due to the
advantages of Lexical FCN.
Decoder. The decoder utilizes the generated video representation to generate video captions
word-by-word. Since LSTM can model long-term temporal dependencies, it is usually used as an
effective decoder to convert video representations into video captions. Furthermore, various at-
tention mechanisms are usually used in the decoder to capture the most salient regions in videos
for video captioning. In this article, we also use LSTM networks as the decoder to generate video
captions and introduce a multi-head dot product attention mechanism (MHDPA) [42] to help
the decoder capture key information from videos.
At timestep t, to generate or predict a word using an LSTM-based decoder, the probability of
the predicted word can be calculated by:
$$P\left(s_t \mid s_{<t}, v; \vartheta\right) \propto \exp\left(\phi\left(s_{t-1}, h_t, e_t; \vartheta\right)\right), \qquad (14)$$
where $h_t$ and $e_t$ are the LSTM hidden state and the MHDPA context vector calculated at timestep
$t$, respectively, and $\phi(\cdot)$ denotes an activation function of the LSTM-based decoder. Further, since
this article utilizes MHDPA to assign attention weights to the video representation of each video
frame, the context vector $e_t$ can be calculated as:
$$e_t = \sum_{j=1}^{m} \alpha_j^t v_j, \qquad (15)$$
where $m$ denotes the number of frames of the video $v$; $\alpha_j^t$ represents the $j$th attention weight at timestep
$t$; $v_j$ denotes the $j$th video representation of $v$.
MHDPA is a self-attention mechanism proposed in Reference [42], which utilizes three matrices
$Q$, $K$, and $V$ to store all queries, keys, and values, respectively. By using a linear projection, the
value of these three matrices can be represented as:
$$Q = MW_Q, \qquad (16)$$
$$K = MW_K, \qquad (17)$$
$$V = MW_V, \qquad (18)$$


where $W_*$ denotes the weights of these three matrices, and $M$ is a matrix of memories that is
randomly initialized. Then the attention of a network can be obtained by computing a set of queries
simultaneously. Thus, the dot products of a query (i.e., dot-product attention) can be computed by:
$$A\left(Q, K, V\right) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_K}}\right)V, \qquad (19)$$
where $d_K$ denotes the dimensionality of the key vectors. Then the softmax function is used to get
the weights on the values. Therefore, the dot-product attention can be represented as:
$$A_\omega\left(M\right) = \mathrm{softmax}\left(\frac{MW_Q\left(MW_K\right)^T}{\sqrt{d_K}}\right)MW_V, \qquad (20)$$
where $\omega = (W_Q, W_K, W_V)$. We represent the output of $A_\omega(M)$ as $\widetilde{M}$, which is a matrix with the
same dimensionality as $M$. $\widetilde{M}$ is an update of $M$. In other words, each element $\widetilde{m}$ in $\widetilde{M}$ includes
information from $M$. Thus, the information in the memory can be shuttled/transferred from
memory to memory via the parameters $W_Q$, $W_K$, and $W_V$. At each step of the attention mecha-
nism, every memory can be updated according to the information memorized in other memories.
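A compact sketch of the scaled dot-product attention in Equations (19) and (20), extended to multiple heads in the spirit of Reference [42], is given below; the head count and dimensions are placeholders, and the relational-memory updates of the original formulation are omitted:

# Multi-head dot-product attention sketch (Eqs. 19-20): queries, keys, and
# values are linear projections of the memory matrix M, and the scaled
# softmax(QK^T / sqrt(d_K)) V update is computed per head.
import torch
import torch.nn as nn

class MHDPA(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.d_k = num_heads, dim // num_heads
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)

    def forward(self, memory):                        # memory: (num_slots, dim)
        n, dim = memory.shape
        def split(x):                                 # -> (heads, slots, d_k)
            return x.view(n, self.num_heads, self.d_k).transpose(0, 1)
        q, k, v = split(self.w_q(memory)), split(self.w_k(memory)), split(self.w_v(memory))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        attended = torch.softmax(scores, dim=-1) @ v  # (heads, slots, d_k)
        return attended.transpose(0, 1).reshape(n, dim)  # updated memory

attn = MHDPA(dim=512, num_heads=8)
memory = torch.randn(30, 512)                         # e.g., 30 frame representations
print(attn(memory).shape)                             # torch.Size([30, 512])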
In this article, to generate realistic captions to describe video contents, our caption genera-
tion module is trained by minimizing the negative log-likelihood. Mathematically, this can be
written as:
$$\min_{\omega} \sum_{t=1}^{N} -\log P\left(S_t \mid v_t; \omega\right). \qquad (21)$$

3.3 Video Reconstruction Module


As shown in Figure 1, the proposed video reconstruction module is used to reproduce videos.
In other words, according to the hidden state sequences of the decoder, this module generates
feature vectors to represent video contents. Then a dual learning mechanism is used to fine-tune
the proposed two modules according to the gap between the raw and the reproduced videos. In
other words, our approach minimizes the semantic gap between raw videos and the generated
captions by minimizing the differences between the reproduced and the raw visual sequences.
Due to the high dimension and diversity of raw video frames, we cannot directly reproduce
video frames based on the hidden states generated by the caption generation module. A simple
way to solve this problem is to reproduce video representations generated by the encoder using
the hidden state sequences of the decoder. Thus, the hidden states $H = \{h_1, h_2, \ldots, h_n\}$ will be
fed into the proposed video reconstruction module to reproduce video representations.
Further, building such a video reconstruction module can help improve the performance of the
decoder. This is because to make a reconstructed video more similar or even the same as its raw
video, more useful information is required to be extracted from the raw video by the decoder. Thus,
the captions generated by the decoder can be further enhanced.
Similar to the proposed caption generation module, the proposed video reconstruction module
is also composed of LSTM networks and the MHDPA attention mechanism. At each timestep, the
reproduced video representations are calculated by the hidden states of the decoder chosen by
MHDPA:
$$\mu_t = \sum_{j=1}^{n} \beta_j^t h_j, \qquad (22)$$
where $\beta_j^t$ represents the attention weight of the $j$th hidden state calculated by MHDPA at
timestep $t$. Thus, at timestep $t$, the reconstructed video representation $z_t$ can be measured
by $\mu_t$ and the previous reconstructed video representations $\{z_1, z_2, \ldots, z_{t-1}\}$. In other words, at
each timestep, this helps the video reconstruction module to dynamically generate contextual

information $\mu_t$ and to selectively process the hidden states based on the attention weights $\beta_j^t$.
Therefore, the video reconstruction module can further employ the word composition and the
temporal dynamics of the whole video captions. This can enhance the relationships between the
raw videos and the generated video captions.
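A simplified sketch of the reconstruction step is given below: attention weights over the decoder hidden states form the context vector of Equation (22), and an LSTM emits a reconstructed frame representation at each timestep. The dimensions are placeholders, and the attention scorer is a simple learned layer rather than MHDPA:

# Sketch of the video reconstruction module: at each timestep, attend over
# the decoder hidden states (Eq. 22) and emit a reconstructed 2048-d frame
# representation from an LSTM state.
import torch
import torch.nn as nn

class Reconstructor(nn.Module):
    def __init__(self, hidden_dim=512, video_dim=2048):
        super().__init__()
        self.score = nn.Linear(2 * hidden_dim, 1)       # simple attention scorer
        self.lstm = nn.LSTMCell(hidden_dim, hidden_dim)
        self.to_video = nn.Linear(hidden_dim, video_dim)

    def forward(self, decoder_states, num_frames):      # decoder_states: (n, hidden)
        h = torch.zeros(1, decoder_states.shape[1])
        c = torch.zeros_like(h)
        frames = []
        for _ in range(num_frames):
            expanded = h.expand(decoder_states.shape[0], -1)
            weights = torch.softmax(self.score(torch.cat([decoder_states, expanded], -1)), 0)
            mu = (weights * decoder_states).sum(0, keepdim=True)   # context vector (Eq. 22)
            h, c = self.lstm(mu, (h, c))
            frames.append(self.to_video(h))
        return torch.cat(frames, dim=0)                  # (num_frames, video_dim)

recon = Reconstructor()
decoder_states = torch.randn(12, 512)                    # hidden states of a 12-word caption
print(recon(decoder_states, num_frames=30).shape)        # torch.Size([30, 2048])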

3.4 Loss Function


Since the proposed video reconstruction module produces video representations frame-by-frame,
we define the video reconstruction loss function as:
$$L_{rec} = \frac{1}{m} \sum_{j=1}^{m} \psi\left(z_j, v_j\right), \qquad (23)$$
where $z_j$ denotes the $j$th reconstructed video representation; $v_j$ denotes the $j$th raw video repre-
sentation; $\psi(\cdot)$ is the Euclidean distance measure function.
We train the proposed video captioning approach by minimizing the entire loss function of
our approach. The entire loss function consists of two parts: the loss function of the caption
generation module and the loss function of the video reconstruction module. We calculate the
loss function of the caption generation module using the forward likelihood and calculate the loss
function of the video reconstruction module using Equation (23). Thus, the loss function of our
proposed approach can be defined as:
$$L\left(\theta, \theta_{rec}\right) = \sum_{j=1}^{N}\left(-\log P\left(S_j \mid v_j; \theta\right) + \lambda L_{rec}\left(v_j, z_j; \theta_{rec}\right)\right), \qquad (24)$$
where the caption generation loss $-\log P(S_j \mid v_j; \theta)$ is calculated by Equation (21); the video recon-
struction loss $L_{rec}(v_j, z_j; \theta_{rec})$ is calculated by Equation (23); and $\lambda$ denotes a hyper-parameter that is
used to find a compromise between the proposed two modules. The larger the difference between
the ground truth and the generated results, the greater the gradient of the loss function and the
faster the convergence rate.
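A minimal sketch of this joint objective is given below: the caption negative log-likelihood plus the λ-weighted Euclidean reconstruction loss of Equation (23), combined as in Equation (24). The inputs are placeholder tensors:

# Sketch of the overall training loss (Eq. 24): caption negative log-likelihood
# plus lambda times the mean Euclidean reconstruction distance (Eq. 23).
import torch

def reconstruction_loss(reconstructed, raw):
    """Per-video mean Euclidean distance between reconstructed and raw frame features (Eq. 23)."""
    return torch.norm(reconstructed - raw, dim=-1).mean(dim=-1)

def total_loss(caption_log_probs, reconstructed, raw, lam=0.1):
    """-log P(S|v) + lambda * L_rec, averaged over the batch (Eq. 24)."""
    return (-caption_log_probs + lam * reconstruction_loss(reconstructed, raw)).mean()

# Hypothetical batch: per-video caption log-likelihoods and 30 frame features each.
caption_log_probs = torch.tensor([-25.3, -31.8])
reconstructed = torch.randn(2, 30, 2048)
raw = torch.randn(2, 30, 2048)
print(total_loss(caption_log_probs, reconstructed, raw, lam=0.1))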
Algorithm 1 shows the proposed video captioning approach. Our approach contains two steps
for training:
In the first step, the caption generation module is trained using the caption generation loss (i.e.,
the forward likelihood), and the early stopping strategy is used to terminate the training process
of this module.
In the second step, we jointly train the proposed two modules according to the entire loss func-
tion of the proposed approach.

ALGORITHM 1: MIMLDL Algorithm


Input: Training pairs <video, ground-truth caption>
Output: Generated video captions
1 Randomly initialize parameters;
2 Extract features from videos using Lexical-FCN;
3 for each epoch do
4 Generate captions using the caption generation module;
5 Reconstruct videos using the video reconstruction module;
6 Calculate loss function;
7 Fine-tune the two modules
8 end for
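A schematic Python version of this two-step schedule is given below; the module, data loader, optimizer, and validation interfaces are hypothetical stand-ins, not the released implementation:

# Two-step training schedule sketch: (1) pre-train the caption generation
# module with early stopping on a validation score (e.g., CIDEr), then
# (2) jointly fine-tune both modules with the combined loss of Eq. (24).
import torch

def train_mimldl(caption_module, recon_module, loader, optimizer,
                 validate, lam=0.1, patience=20, max_epochs=200):
    best_score, stale = float("-inf"), 0
    # Step 1: caption generation module only (forward likelihood).
    for epoch in range(max_epochs):
        for video_feats, captions in loader:
            loss = -caption_module.log_prob(captions, video_feats)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        score = validate(caption_module)              # validation CIDEr (hypothetical hook)
        best_score, stale = (score, 0) if score > best_score else (best_score, stale + 1)
        if stale >= patience:                         # early stopping
            break
    # Step 2: joint fine-tuning with the reconstruction loss added.
    for epoch in range(max_epochs):
        for video_feats, captions in loader:
            log_p, hidden = caption_module(captions, video_feats)
            rec = recon_module(hidden, num_frames=video_feats.shape[1])
            loss = -log_p + lam * torch.norm(rec - video_feats, dim=-1).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()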


4 EXPERIMENTS
We evaluate the proposed MIMLDL video captioning approach on Microsoft Research video
to text (MSR-VTT) [41] dataset. To demonstrate the effectiveness of MIMLDL, we utilize the
popular evaluation metrics including METEOR [43], BLEU-4 [44], and ROUGE-L [45] with the
codes released on the Microsoft COCO evaluation server [46].

4.1 Dataset and Experimental Setting


Dataset: MSR-VTT is one of the largest datasets for video captioning, which consists of 10K video
clips from 20 categories. Each video clip in MSR-VTT is annotated with approximately 20 sen-
tences. Similar to Reference [1], this article utilizes 6,513 video clips to form a training set, 497
video clips to form a validation set, and 2,990 video clips to form a test set. The embedding size of
the words in the dataset is set to 468.
Hardware and Software Environment: All experiments in this article are done on a deep learning
workstation with Intel Core i9 CPU, four GTX 1080 Ti GPUs, and 128 GB RAM. Our approach is
implemented in Python.
In the caption generation module, we first sample 30 frames for each video. Then the sampled
frames are fed into the Lexical-FCN (Resnet50) networks. In this way, frame features are reshaped
to the standard size 320 × 320, and the semantic feature of each frame can be extracted from the
last pooling layer with 2,048 dimensions. The input dimension of the decoder is set to 468, which is
equal to the dimension of the word embedding. Besides, 512 units are contained in a hidden layer.
In the video reconstruction module, the hidden state of the decoder is taken as the input, the
dimension of which is set to 512. To simplify the calculation of the reconstruction loss function,
we set the size of the hidden layer to the same size as the video representation, i.e., 2,048 dimensions.
For a dual learning-based approach, the selection of the hyper-parameter λ, which is used to
balance the contributions of the two modules (i.e., the caption generation module and the video
reconstruction module), is very important. Wang et al. [1] have verified that although adding the
reconstruction loss can improve the performance of video captioning, a too-large λ may cause an
obvious decrease in the performance of video caption generation. Thus, the selection of the hyper-
parameter λ is crucial. In this article, we set λ to 0.1, 0.3, and 0.5, and our approach achieves the
best performance when λ is set to 0.1. The learning rate of the caption generation module is set
to 2e-4; the learning rate of the video reconstruction module is set to 4e-5; and the batch size of
the whole approach is set to 16. Furthermore, when the CIDEr value on the validation set stops
growing for 20 consecutive epochs, the training process is stopped.
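For reference, the experimental settings reported above can be collected into a single configuration dictionary; the key names below are ours and do not correspond to a released codebase:

# Summary of the experimental settings reported in this section.
config = {
    "frames_per_video": 30,
    "frame_size": (320, 320),
    "frame_feature_dim": 2048,        # last pooling layer of ResNet-50
    "word_embedding_dim": 468,
    "decoder_hidden_units": 512,
    "reconstructor_hidden_units": 2048,
    "lambda_reconstruction": 0.1,     # best among {0.1, 0.3, 0.5}
    "lr_caption_module": 2e-4,
    "lr_reconstruction_module": 4e-5,
    "batch_size": 16,
    "early_stopping_patience": 20,    # epochs without CIDEr improvement
}
print(config)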
In addition, to guarantee that the generated captions exactly match the content of the given video,
we introduce three popular natural language evaluation metrics (i.e., METEOR [43], BLEU-4 [44],
and ROUGE-L [45]) to measure the quality of the generated captions. The three metrics achieve
high values if the generated captions contain the key information in the videos.

4.2 Experimental Result Discussion


We test the proposed video captioning approach on the MSR-VTT dataset. Table 1 shows the abla-
tion studies of the proposed approach. We implement the proposed approach in three situations:
(i) caption generation module, (ii) caption generation module + video reconstruction module, and
(iii) caption generation module + video reconstruction module + multi-instance learning. Exper-
imental results demonstrate that the encoder-decoder-reconstructor structure and multi-instance
learning can effectively improve the performance of video captioning approaches.
Table 2 shows the quantitative experimental results on the dataset. Our approach is compared
with several classical encoder-decoder-based video captioning approaches and the state-of-the-art


Table 1. Ablation Studies of MIMLDL in Terms of METEOR, BLEU-4, and ROUGE-L Scores
on the MSR-VTT Dataset (%)

Approaches                                                                                    METEOR  BLEU-4  ROUGE-L
MIMLDL (caption generation module + multi-instance learning)                                   25.9    33.1    57.0
MIMLDL (caption generation module + video reconstruction module)                               24.7    35.9    58.3
MIMLDL (caption generation module + video reconstruction module + multi-instance learning)     27.1    39.3    59.5

Table 2. Experimental Results of Different Video Captioning Approaches in Terms of METEOR, BLEU-4,
and ROUGE-L Scores on the MSR-VTT Dataset (%)

Approaches METEOR BLEU-4 ROUGE-L


MP-LSTM (AlexNet) [1] 23.4 32.3 -
MP-LSTM (GoogleNet) [1] 24.6 34.6 -
MP-LSTM (VGG19) [1] 24.7 34.8 -
SA-LSTM (AlexNet) [1] 23.8 34.8 -
SA-LSTM (GoogleNet) [1] 25.2 35.2 -
SA-LSTM (VGG19) [1] 25.4 35.6 -
SA-LSTM (Inception-V4) [1] 25.5 36.3 58.3
RecNetlocal (SA-LSTM) [1] 26.6 39.1 59.3
Bi-directional MIL [18] 23.3 28.7 53.1
Bi-directional MIMLL [18] 25.9 33.7 56.9
Our MIMLDL 27.1 39.3 59.5

video captioning approaches, including MP-LSTM [1], SA-LSTM [1], RecNet [1], Bi-directional
MIL [18], and Bi-directional MIMLL [18].
MP-LSTM [1] is an encoder-decoder-based video captioning approach. It uses AlexNet,
GoogleNet, or VGG19 as an encoder to extract features from video frames. Then the mean pooling
result of the output of the encoder is fed into an LSTM-based decoder to generate video captions.
SA-LSTM [1] is an encoder-decoder based video captioning approach. It uses AlexNet,
GoogleNet, VGG19, or Inception-V4 as an encoder to extract features from video frames. Then
an attention mechanism is used to fuse the output of the encoder. The output of the attention
mechanism is fed into an LSTM-based decoder to generate video captions.
RecNet [1] is an encoder-decoder-reconstructor-based video captioning approach. It uses
Inception-V4 as an encoder to extract features from video frames. Then a spatial attention mech-
anism is used to fuse the output of the encoder. The output of the attention mechanism is fed into
an LSTM-based decoder to generate video captions. After that, the hidden states of the decoder are
processed by the attention mechanism and fed into a reconstructor to reproduce video features.
Bi-directional MIL [18] is an encoder-decoder based video captioning approach. It uses Resnet50
as an encoder to extract features from video frames. Then the extracted features are used to train
a Lexical-FCN network using multi-instance learning. The output of Lexical-FCN is fed into a
decoder to generate video captions.


Fig. 3. Visualization of some video captioning examples on the MSR-VTT dataset.

Bi-directional MIMLL [18] is an encoder-decoder-based video captioning approach similar to Bi-directional MIL: the
extracted features are used to train a Lexical-FCN network using multi-instance multi-label learn-
ing. The output of Lexical-FCN is fed into a decoder to generate video captions.
As illustrated in Table 2, the performance of SA-LSTM is better than MP-LSTM when using
the same encoder (such as AlexNet encoder). This is because SA-LSTM used an attention mech-
anism rather than mean-pooling for feature fusion. Further, the performance of RecNet is bet-
ter than that of SA-LSTM when using the same encoder (i.e., the Inception-V4 encoder). This is because
RecNet contains an additional reconstructor for video captioning. In addition, the performance of
Bi-directional MIMLL is better than SA-LSTM, which verifies the effectiveness of multi-instance
multi-label learning for video captioning. In addition, compared with all reference video caption-
ing approaches, our MIMLDL approach achieves the highest performance for video captioning,
since it combines multi-instance multi-label learning mechanism with an attention-based encoder-
decoder-reconstructor structure for video captioning. Figure 3 shows qualitative examples of video
captions generated by our approach. We compared the generated captions with a reference ap-
proach and the ground truths (GT).

5 CONCLUSION
Video captioning is a task that can generate captions from videos. It has been used to solve many
real-world problems, such as helping blind people follow the contents of a movie plot.
This article proposes a novel encoder-decoder-reconstructor-based multi-instance multi-
label dual learning approach (MIMLDL) for video captioning. MIMLDL contains two modules:
caption generation and video reconstruction modules. Specifically, a weakly supervised multi-
instance multi-label learning-based lexical fully convolutional neural network (Lexical-
FCN) is used in the caption generation module to learn a translatable mapping between video
regions and lexical labels for caption generation. Then the hidden states of the decoder are used
by the proposed video reconstruction module to synthesize visual sequences to reproduce raw
videos. According to the gap between the raw and the reproduced videos, the two modules are
fine-tuned through a dual learning mechanism. A multi-head attention mechanism is also used in
the two modules to capture the most effective information from raw videos and generated cap-
tions. Thus, our approach can minimize the semantic gap between a raw video and the generated
caption by minimizing the differences between the reproduced and the raw visual sequences.


We test MIMLDL on the MSR-VTT dataset. Experimental results demonstrate that our approach
can improve the accuracy of video captioning. Our research also verifies the effectiveness of multi-
instance learning-based dual learning in generating high-quality video captions.
The proposed approach can also be further improved. For example, this article utilizes an existing multi-
head attention mechanism rather than developing a novel attention mechanism to capture the in-
formation from videos and captions. In the future, we will propose better video captioning ap-
proaches and more appropriate attention mechanisms for video captioning. Besides, we intend to
apply video captioning to a wider application field for solving real-world problems.

REFERENCES
[1] B. Wang, L. Ma, W. Zhang, and W. Liu. 2018. Reconstruction network for video captioning. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 7622–7631.
[2] J. Wang, W. Wang, Y. Huang, L. Wang, and T. Tan. 2018. M3: Multimodal memory modelling for video captioning. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7512–7520.
[3] C. Yan, Y. Tu, X. Wang, Y. Zhang, X. Hao, Y. Zhang, and Q. Dai. 2019. STAT: Spatial-temporal attention mechanism
for video captioning. IEEE Trans. Multim. 22, 1 (2019), 229–241.
[4] A. Wang, H. Hu, and L. Yang. 2018. Image captioning with affective guiding and selective attention. ACM Trans.
Multim. Comput., Commun., Applic. 14, 3 (2018), 1–15.
[5] L. Yang, H. Hu, S. Xing, and X. Lu. 2020. Constrained LSTM and residual attention for image captioning. ACM Trans.
Multim. Comput., Commun., Applic. 16, 3 (2020), 1–18.
[6] J. Wu, H. Hu, and L. Yang. 2019. Pseudo-3D attention transfer network with content-aware strategy for image cap-
tioning. ACM Trans. Multim. Comput., Commun., Applic. 15, 3 (2019), 1–19.
[7] A. Kojima, T. Tamura, and K. Fukunaga. 2002. Natural language description of human activities from video images
based on concept hierarchy of actions. Int. J. Comput. Vis. 50, 2 (2002), 171–184.
[8] M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. 2013. Translating video content to natural language
descriptions. In Proceedings of the IEEE International Conference on Computer Vision. 433–440.
[9] R. Xu, C. Xiong, W. Chen, and J. J. Corso. 2015. Jointly modeling deep video and compositional text to bridge vision
and language in a unified framework. In Proceedings of the 29th AAAI Conference on Artificial Intelligence. 1–7.
[10] J. Ma, R. Wang, W. Ji, H. Zheng, E Zhu, and J. Yin. 2019. Relational recurrent neural networks for polyphonic sound
event detection. Multim. Tools Applic. 78, 20 (2019), 29509–29527.
[11] Y. Wu, X. Ji, W. Ji, Y. Tian, and H. Zhou. 2020. CASR: A context-aware residual network for single-image superreso-
lution. Neural Comput. Applic. 32, 6 (2020), 14533–14548.
[12] Z. Liu, Z. Li, M. Zong, W. Ji, R. Wang, and Y. Tian. 2019. Spatiotemporal saliency based multi-stream networks for
action recognition. In Proceedings of the Asian Conference on Pattern Recognition. 74–84.
[13] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. 2015. Translating videos to natural
language using deep recurrent neural networks. In Proceedings of the Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies. 1494–1504.
[14] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. 2015. Sequence to sequence-video
to text. In Proceedings of the IEEE International Conference on Computer Vision. 4534–4542.
[15] C. Zhang and Y. Tian. 2016. Automatic video description generation via LSTM with joint two-stream encoding. In
Proceedings of the 23rd International Conference on Pattern Recognition. 2924–2929.
[16] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. 2015. Describing videos by exploiting
temporal structure. In Proceedings of the IEEE International Conference on Computer Vision. 4507–4515.
[17] Z. Wu, T. Yao, Y. Fu, and Y. Jiang. 2017. Deep learning for video classification and captioning. In Frontiers of Multimedia
Research. ACM, 3–29.
[18] Z. Shen, J. Li, Z. Su, M. Li, Y. Chen, Y. Jiang, and X. Xue. 2017. Weakly supervised dense video captioning. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition. 1916–1924.
[19] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez. 1997. Solving the multiple instance problem with axis-parallel
rectangles. Artif. Intell. 89, 1-2 (1997), 31–71.
[20] P. Shamsolmoali, M. Zareapoor, H. Zhou, and J. Yang. 2020. AMIL: Adversarial multi-instance learning for human
pose estimation. ACM Trans. Multim. Comput., Commun., Applic. 16, 1 (2020), 1–23.
[21] X. Zhang, H. Shi, C. Li, and P. Li. 2020. Multi-instance multi-label action recognition and localization based on spatio-
temporal pre-trimming for untrimmed videos. In Proceedings of the AAAI Conference on Artificial Intelligence. 12886–
12893.


[22] P. Luo, G. Wang, L. Lin, and X. Wang. 2017. Deep dual learning for semantic image segmentation. In Proceedings of
the IEEE International Conference on Computer Vision. 2718–2726.
[23] Y. Xia, J. Bian, T. Qin, N. Yu, and T. Liu. 2017. Dual inference for machine learning. In Proceedings of the International
Joint Conferences on Artificial Intelligence. 3112–3118.
[24] Z. Yi, H. Zhang, P. Tan, and M. Gong. 2017. Dualgan: Unsupervised dual learning for image-to-image translation. In
Proceedings of the IEEE International Conference on Computer Vision. 2849–2857.
[25] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim. 2017. Learning to discover cross-domain relations with generative
adversarial networks. In Proceedings of the 34th International Conference on Machine Learning. 1857–1865.
[26] J. Zhu, T. Park, P. Isola, and A. A. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial
networks. In Proceedings of the IEEE International Conference on Computer Vision. 2223–2232.
[27] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W. Ma. 2016. Dual learning for machine translation. In Proceedings
of the International Conference on Advances in Neural Information Processing Systems. 820–828.
[28] Y. Wang, Y. Xia, L. Zhao, J. Bian, T. Qin, G. Liu, and T. Liu. 2018. Dual transfer learning for neural machine translation
with marginal distribution regularization. In Proceedings of the AAAI Conference on Artificial Intelligence. 1–7.
[29] G. Lample, A. Conneau, L. Denoyer, and M. A. Ranzato. 2018. Unsupervised machine translation using monolingual
corpora only. In Proceedings of the International Conference on Learning Representations. 1–14.
[30] M. Artetxe, G. Labaka, E. Agirre, and K. Cho. 2018. Unsupervised neural machine translation. In Proceedings of the
International Conference on Learning Representations. 1–12.
[31] Y. Wang, Y. Xia, T. He, F. Tian, T. Qin, C. Zhai, and T. Liu. 2019. Multi-agent dual learning. In Proceedings of the
International Conference on Learning Representations. 1–15.
[32] Z. Zhao, Y. Xia, T. Qin, and T. Liu. 2019. Dual learning: Theoretical study and algorithmic extensions. In Proceedings
of the International Conference on Learning Representations. 1–16.
[33] Y. Xia, T. Qin, W. Chen, J. Bian, N. Yu, and T. Liu. 2017. Dual supervised learning. In Proceedings of the International
Conference on Machine Learning. 3789–3798.
[34] Y. Xia, X. Tan, F. Tian, T. Qin, N. Yu, and T. Liu. 2018. Model-level dual learning. In Proceedings of the International
Conference on Machine Learning. 5383–5392.
[35] W. Zhao, W. Xu, M. Yang, J. Ye, Z. Zhao, Y. Feng, and Y. Qiao. 2017. Dual learning for cross-domain image captioning.
In Proceedings of the ACM on Conference on Information and Knowledge Management. 29–38.
[36] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. 2003. Feature-rich part-of-speech tagging with a cyclic depen-
dency network. In Proceedings of the Conference of the North American Chapter of the Association for Computational
Linguistics on Human Language Technology. 173–180.
[37] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao et al. 2015. From captions to visual concepts
and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1473–1482.
[38] L. A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell. 2016. Deep compositional
captioning: Describing novel object categories without paired training data. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. 1–10.
[39] D. Heckerman. 1990. A tractable inference algorithm for diagnosing multiple diseases. In Mach. Intell. Pattern Recog.
10 (1990), 163–171.
[40] M. Gygli, H. Grabner, and L. V. Gool. 2015. Video summarization by learning submodular mixtures of objectives. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3090–3098.
[41] J. Xu, T. Mei, T. Yao, and Y. Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5288–5296.
[42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. 2017. Attention
is all you need. In Proceedings of the International Conference on Advances in Neural Information Processing Systems.
5998–6008.
[43] S. Banerjee and A. Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with
human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine
Translation and/or Summarization. 65–72.
[44] K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2002. BLEU: A method for automatic evaluation of machine translation.
In Proceedings of the 40th Meeting on Association for Computational Linguistics. 311–318.
[45] C. Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out. Associ-
ation for Computational Linguistics, 74–81.
[46] X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. 2015. Microsoft COCO captions: Data
collection and evaluation server. arXiv preprint arXiv:1504.00325. (2015).
[47] C. Tang, X. Liu, S. An, and P. Wang. 2020. BR2Net: Defocus blur detection via bidirectional channel attention residual
refining network. IEEE Trans. Multim. DOI: 10.1109/TMM.2020.2985541.


[48] C. Tang, X. Liu, P. Wang, C. Zhang, M. Li, and L. Wang. 2019. Adaptive hypergraph embedded semi-supervised
multi-label image annotation. IEEE Trans. Multim. 21, 11 (2019), 2837–2849.
[49] X. Liu, L. Wang, J. Zhang, J. Yin, and H. Liu. 2013. Global and local structure preservation for feature selection. IEEE
Trans. Neural Netw. Learn. Syst. 25, 6 (2013), 1083–1095.
[50] Y. Tian, X. Wang, J. Wu, R. Wang, and B. Yang. 2019. Multi-scale hierarchical residual network for dense captioning.
J. Artif. Intell. Res. 64 (2019), 181–196.

Received July 2020; revised December 2020; accepted January 2021

