A Multi-Instance Multi-Label Dual Learning Approach For
Video Captioning
WANTING JI and RUILI WANG, School of Natural and Computational Sciences, Massey University,
Auckland, New Zealand
Video captioning is a challenging task in the field of multimedia processing, which aims to generate informa-
tive natural language descriptions/captions to describe video contents. Previous video captioning approaches
mainly focused on capturing visual information in videos using an encoder-decoder structure to generate
video captions. Recently, a new encoder-decoder-reconstructor structure was proposed for video captioning,
which captured information from both videos and captions. Building on this idea, this article proposes a novel multi-instance multi-label dual learning approach (MIMLDL) that generates video captions based on the encoder-decoder-reconstructor structure. Specifically, MIMLDL contains two modules: caption generation and video
reconstruction modules. The caption generation module utilizes a lexical fully convolutional neural network
(Lexical FCN) with a weakly supervised multi-instance multi-label learning mechanism to learn a translatable
mapping between video regions and lexical labels to generate video captions. Then the video reconstruction
module synthesizes visual sequences to reproduce raw videos using the outputs of the caption generation
module. A dual learning mechanism fine-tunes the two modules according to the gap between the raw and the
reproduced videos. Thus, our approach can minimize the semantic gap between raw videos and the generated
captions by minimizing the differences between the reproduced and the raw visual sequences. Experimental
results on a benchmark dataset demonstrate that MIMLDL can improve the accuracy of video captioning.
CCS Concepts: • Computing methodologies → Machine learning; Neural networks; Video summariza-
tion;
Additional Key Words and Phrases: Deep neural networks, Dual learning, Multiple instance learning, Multi-
media processing, Video captioning
ACM Reference format:
Wanting Ji and Ruili Wang. 2021. A Multi-instance Multi-label Dual Learning Approach for Video Captioning.
ACM Trans. Multimedia Comput. Commun. Appl. 17, 2s, Article 72 (June 2021), 18 pages.
https://doi.org/10.1145/3446792
1 INTRODUCTION
Video captioning is a challenging task in the field of multimedia processing, bridging vision
and language. It aims to generate informative natural language descriptions/captions to describe
video contents [1-3]. Compared with other captioning tasks, such as image captioning [4-6],
video captioning is more challenging. This is because a video contains much more complicated information, such as objects, actions, and scenes, than a still image does. So far, various video captioning approaches have been developed, and many of them have been applied to real-world applications. For example, movie transcription leverages video captioning approaches to convert a movie into a textual story.
Early research [7-9] on video captioning was usually based on template-based language models, which generated video captions using predefined language templates. Specifically, these template-based language models defined several language templates, each of which could be separated into several phrases, such as subject, verb, and object phrases, according to specific grammar rules [8]. Each part of a video was aligned with a word using object detection methods, and then the detected words were placed into the corresponding phrases of a predefined template to form a sentence describing the video contents. However, these models cannot textualize all key information in videos, i.e., they cannot map all key information of a video to words.
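To make the template-filling idea concrete, the following minimal Python sketch slots detected words into a fixed subject-verb-object template; the slot names and the detected words are hypothetical illustrations, not taken from References [7-9].

```python
def fill_template(detections):
    """Slot detected words into a fixed subject-verb-object template (illustrative)."""
    return "A {subject} is {verb} {object}.".format(**detections)

# Hypothetical detector output for a cooking video.
print(fill_template({"subject": "woman", "verb": "slicing", "object": "a tomato"}))
# -> A woman is slicing a tomato.
```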
Recently, with the development of deep learning, many advanced neural networks [10-12, 47] have been developed for video captioning and achieved significant success. These deep neural network-based video captioning approaches are usually built on an encoder-decoder structure [13-16]. This is because the encoder-decoder structure allows the training process of a video captioning model to work in an end-to-end fashion. Specifically, the encoder of a video captioning model extracts features from video frames, and the extracted features are then passed to the decoder to generate descriptive sentences for the video contents [17].
The encoder-decoder structure has a limitation in generating video captions. Specifically, to train a video captioning model, ground-truth captions/words are used as the input of the decoder. However, in the test process, this input is replaced by the words generated by the decoder. Since the generated words may differ from the ground truth, a difference/bias arises, which leads to the accumulation of errors during testing. In other words, once a "bad" word is generated by the decoder at test time, the decoder can neither detect nor discard this error, so the error accumulates and propagates as the length of the generated word sequence increases [1].
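The sketch below illustrates this train/test input mismatch (often called exposure bias). It uses a generic GRU decoder written purely for illustration; the token ids, dimensions, and architecture are assumptions, not the model used in this article or in Reference [1].

```python
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    """Generic GRU decoder used only to illustrate the train/test input mismatch."""
    def __init__(self, vocab_size=1000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRUCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, video_feature, targets=None, max_len=10):
        h = video_feature                                   # (batch, hidden) encoder output
        word = torch.zeros(h.size(0), dtype=torch.long)     # assumed <bos> token id 0
        logits = []
        for t in range(max_len):
            h = self.rnn(self.embed(word), h)
            step_logits = self.out(h)
            logits.append(step_logits)
            if targets is not None:                         # training: feed the ground-truth word
                word = targets[:, t]
            else:                                           # test: feed the decoder's own prediction,
                word = step_logits.argmax(dim=-1)           # so an early mistake propagates
        return torch.stack(logits, dim=1)

# Training-style call (teacher forcing) vs. test-style call (free running).
decoder = TinyDecoder()
feat = torch.randn(2, 256)
gt = torch.randint(0, 1000, (2, 10))
train_logits = decoder(feat, targets=gt)
test_logits = decoder(feat)                                 # consumes its own predictions
```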
More recently, a new encoder-decoder-reconstructor structure [1] was proposed for video cap-
tioning to overcome the aforementioned limitation. Specifically, this structure exploited the bi-
directional flows (i.e., forward and backward flows) between captions and video contents. In the
forward flow (i.e., video-to-caption), an encoder-decoder structure was used to produce video cap-
tions using the encoded video features. In the backward flow (i.e., caption-to-video), the proposed
reconstructors were devised to reproduce raw video features. Thus, the loss yielded by the two
flows was used to train the proposed video captioning model in an end-to-end manner. Although this model jointly utilized video features and semantic information to generate video captions and outperformed existing encoder-decoder models, it cannot capture key frames/information from videos and align them closely with semantic information.
Multi-instance learning, which is widely applied in multimedia applications, is an effective solution to this problem. For example, Shen et al. [18] proposed a weakly supervised video captioning approach, which utilized multi-instance learning to link video frame regions with lexical labels; the labeled video region-sequences were then fed into the decoder to generate video captions. Thus, a weak association could be built between video region-sequences and the generated video captions. However, their approach was based on the encoder-decoder structure rather than the better-performing encoder-decoder-reconstructor structure.
In this article, we propose a novel multi-instance multi-label dual learning approach
(MIMLDL) to generate video captions based on the encoder-decoder-reconstructor structure.
Specifically, MIMLDL contains two modules: caption generation and video reconstruction mod-
ules. The caption generation module utilizes a lexical fully convolutional neural network
(Lexical FCN) with a weakly supervised multi-instance multi-label learning mechanism to learn
a translatable mapping between video regions and lexical labels to generate video captions. Then
the video reconstruction module utilizes the output of the caption generation module to synthesize
visual sequences to reproduce raw videos. A dual learning mechanism fine-tunes the two modules
according to the gap between the raw and the reproduced videos. In other words, our approach
can minimize the semantic gap between raw videos and the generated captions by minimizing
the differences between the reproduced and the raw visual sequences. Experimental results on a
benchmark dataset demonstrate that the proposed approach can improve the accuracy of video
captioning.
The contributions of this article are twofold: (i) We propose an effective approach that can gener-
ate accurate captions to describe video contents. (ii) The proposed encoder-decoder-reconstructor-
based multi-instance multi-label dual learning approach can learn a translatable mapping between
video regions and lexical labels and minimize the semantic gap between raw videos and the gener-
ated captions by minimizing the differences between the reproduced and the raw visual sequences.
The rest of this article is organized as follows: The related work on video captioning and multi-
instance learning is reviewed in Section 2. Section 3 presents the proposed video captioning ap-
proach. The experimental setting details and experimental results are provided and discussed in
Section 4. The conclusion and our future work are presented in Section 5.
2 RELATED WORK
In this section, we briefly provide a literature review of video captioning approaches in Section 2.1,
and then review the applications of multi-instance learning in Section 2.2. Finally, we introduce the dual learning mechanism in Section 2.3.
Recently, attention mechanisms have been considered an effective way to improve the performance of video captioning approaches based on the encoder-decoder structure. Various attention mechanisms have been developed and introduced into video captioning approaches. Yao et al. [16]
further proposed an attention mechanism to assign weights to spatial features, and then fused
the weighted features as video representation vectors. Yan et al. [3] proposed a spatial-temporal
attention mechanism for video captioning and used the proposed attention mechanism on the
extracted spatial and temporal features to select the significant regions from videos for video cap-
tioning. However, these video captioning approaches only considered the spatial and temporal
features in videos to generate video captions.
To overcome this problem, the multimodal learning mechanism was introduced to video cap-
tioning approaches. Since a video contains multiple modalities (e.g., visual modality, audio modal-
ity, and textual modality), multimodal features can be fused and used to generate video captions.
Wang et al. [2] proposed a video captioning approach named Multimodal Memory Model (M3) based on visual and textual modal features, which could address the visual-textual alignment issue. In M3, a visual and textual shared memory was proposed, which could be used to
model long-term visual-textual dependency and guide visual attention for video captioning based
on the interaction between videos and captions.
To summarize, current approaches for video captioning mainly rely on the visual information
in videos but ignore the use of the generated captions. These approaches can perform well in gen-
erating simple sentences to describe video contents. However, in some cases, such as when a video contains many details, these approaches cannot achieve performance comparable to that of humans. To
solve this problem, this article aims to develop a novel video captioning approach that can capture
detailed information in videos by using multiple instance learning and dual learning mechanisms.
In the following sections, we will introduce these two mechanisms in detail.
used to distinguish the ground-truth from the generated human poses, and the adversarial loss
would be backpropagated to the generator. This process could help the generator to learn reason-
able body configurations to improve the accuracy of pose estimation.
Later, Shen et al. [18] proposed a weakly supervised dense video captioning approach that gen-
erated dense captions using video-level annotations for training. Specifically, three modules were
contained in their approach: (i) a visual module that created a weak mapping between the regions
in video frames and the words in annotations based on a multi-instance multi-label learning mechanism; (ii) a region-sequence module that produced informative region-sequences based on the
outcomes of the visual module by connecting the regions between video frames; (iii) a language
module that produced dense video captions using the generated region-sequences.
Further, Zhang et al. [21] proposed a multi-instance multi-label approach that could recognize
and localize actions in untrimmed videos. Since most of the background contents in an untrimmed
video were not related to the actions of interest and could reduce the accuracy of action recognition, they spatially and temporally segmented each untrimmed video into person-centric clips using pose estimation and tracking techniques. Then the action recognition problem could be formulated
as a multi-instance multi-label learning problem by associating the bag-of-instances structure with
video-level labels.
In this article, we use the multi-instance multi-label learning mechanism to build a weak map-
ping between the words in annotations and the regions in video frames.
bidirectional training between videos and captions, our proposed approach can further improve
the video captioning accuracy.
of a word is required to be higher, the richness of the generated lexical vocabulary will be reduced,
but the words in the vocabulary are more likely to be the keywords for describing video contents.
Finally, all words that meet the above conditions compose our lexical vocabulary V.
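A minimal sketch of this frequency-based filtering is shown below; the whitespace tokenization and the threshold value are assumptions for illustration, not the exact conditions used in this article.

```python
from collections import Counter

def build_lexical_vocabulary(captions, min_freq=5):
    """Keep words whose corpus frequency reaches min_freq (threshold is illustrative).

    Raising min_freq shrinks the vocabulary but keeps more keyword-like words,
    as discussed above.
    """
    counts = Counter(word.lower() for caption in captions for word in caption.split())
    return {word for word, count in counts.items() if count >= min_freq}

# Toy usage with two training captions.
print(build_lexical_vocabulary(["a child is running in a park",
                                "a child plays in a park"], min_freq=2))
```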
3.1.2 Lexical Mapping. To map video frame regions to the lexical vocabulary using a multi-instance multi-label learning mechanism, we follow the setting in Reference [18], which utilized
a word detection method [37] to achieve multi-instance learning and a deep lexical classification
method [38] to achieve multi-label learning.
Multi-instance learning. Multi-instance learning is a form of weakly supervised learning. Consider a bag of instances $X_i = \{x_{i1}, \ldots, x_{ij}\}$, where $x_{ij} \in \mathbb{R}^d$ denotes a $d$-dimensional feature vector for the $j$th instance in the $i$th bag. If no instance in the bag $X_i$ contains the word $w$, then this bag is regarded as negative and the word label $y_i^w$ is set to 0; if at least one of the instances in $X_i$ contains the word $w$, then this bag is regarded as positive and the word label $y_i^w$ is set to 1. For
example, as shown in Figure 2(a), given a video frame, the regions in the frame can be associated
with word labels such as “child,” “running,” and “park” in the lexical vocabulary through multi-
instance learning.
Multi-label learning. Multi-label learning is a machine learning mechanism that is used to asso-
ciate an instance with multiple class labels simultaneously [48]. For this article, given an instance
$x_i$, it will be associated with word labels $y_i = \{y_i^1, \ldots, y_i^k\}$ using a multi-label learning mechanism, where $k$ is the number of labels. For example, as shown in Figure 2(b), given a video frame, the
little boy in the frame can be associated with labels such as “child,” “kid,” or other synonyms in
the lexical vocabulary through multi-label learning.
Multi-instance multi-label learning. Multi-instance multi-label learning can be considered as
a generalization of multi-instance learning. For this article, given a bag of instances $X_i = \{x_{i1}, \ldots, x_{ij}\}$, each instance in this bag can be labeled with one or multiple word labels using a multi-instance multi-label learning mechanism. For example, as shown in Figure 2(c), given a video
frame, the regions in this frame are first associated with word labels such as “child,” “running,”
and “park” through multi-instance learning. Then by using multi-label learning, other labels such
as “kid” and other synonyms in the lexical vocabulary will also be associated with this little boy.
The loss function of multi-instance multi-label learning. Following the setting in Reference [18],
we utilize cross-entropy loss to evaluate/measure the performance of multi-instance multi-label
learning. The loss function of a bag of instances can be defined as:
$$L(X, y; \theta) = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \cdot \log p_i + (1 - y_i) \cdot \log(1 - p_i) \,\right], \qquad (1)$$
where $y_i$ denotes the label vector of the bag $X_i$; $\theta$ denotes the model parameters; $N$ denotes the number of bags; $p_i$ denotes the corresponding probability vector.
Since the instances in a bag can be labeled as positive or negative, a noisy-OR [39] formulation
is used to combine the probabilities of the word $w$ in the $i$th bag. Mathematically, the probability $\hat{p}_i^w$ can be defined as:
$$\hat{p}_i^w = P\left(y_i^w = 1 \mid X_i; \theta\right) = 1 - \prod_{x_{ij} \in X_i}\left(1 - P\left(y_i^w = 1 \mid x_{ij}; \theta\right)\right). \qquad (2)$$
Fig. 2. An example of (a) multi-instance learning, (b) multi-label learning, and (c) multi-instance multi-label
learning.
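A compact sketch of Equations (1) and (2) is given below. It assumes that instance-level word probabilities (e.g., sigmoid outputs of the Lexical-FCN) are already available; the tensor shapes and the toy data are illustrative, not those of the actual implementation.

```python
import torch

def noisy_or_bag_probability(instance_probs):
    """Eq. (2): combine instance-level word probabilities of one bag into
    bag-level probabilities with the noisy-OR formulation.

    instance_probs: (num_instances, vocab_size) word probabilities for the
    regions (instances) of one bag. Returns a (vocab_size,) tensor of p_hat_i^w.
    """
    return 1.0 - torch.prod(1.0 - instance_probs, dim=0)

def miml_loss(bag_probs, bag_labels):
    """Eq. (1): cross-entropy between bag-level probabilities and 0/1 bag labels,
    summed over words and averaged over bags."""
    eps = 1e-7
    bag_probs = bag_probs.clamp(eps, 1.0 - eps)
    return -(bag_labels * bag_probs.log()
             + (1.0 - bag_labels) * (1.0 - bag_probs).log()).sum(dim=1).mean()

# Toy usage: 3 bags, 8 region instances each, a 100-word lexical vocabulary.
instance_probs = torch.sigmoid(torch.randn(3, 8, 100))
bag_probs = torch.stack([noisy_or_bag_probability(p) for p in instance_probs])
bag_labels = (torch.rand(3, 100) > 0.95).float()
print(miml_loss(bag_probs, bag_labels))
```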
3.1.3 Region Sequence Generation. A region sequence is generated by matching and sequen-
tially connecting the same or similar regions in different video frames. Since Lexical-FCN can link
each region in a video frame with a lexical label/description, the process of region sequence generation can be formulated as a subset selection problem [40]. Specifically, we initialize an empty subset and then
sequentially add one most informative and coherent region in each video frame into the subset.
Mathematically, we define $S_v$ as the set of all possible region sequences in a video $v$, and $A$ is defined as a region sequence subset of $S_v$, i.e., $A \subseteq S_v$. Thus, the optimized region sequence $A^{*}$ can be represented as:
$$A^{*} = \arg\max_{A \subseteq S_v} R(x_v, A), \qquad (4)$$
where $x_v$ denotes all region feature representations of the video $v$, and $R(\cdot)$ is an objective function for optimization. Following the setting in Reference [18], we utilize a linear combination objective to optimize $A$:
$$R(x_v, A) = W_v^{T} f(x_v, A), \qquad (5)$$
where $f(\cdot)$ denotes an objective function to evaluate a region sequence. In this article, we also evaluate region sequences from three aspects, i.e., informativeness, coherence, and diversity. Thus, $f = [f_{inf}, f_{div}, f_{coh}]^{T}$.
Informativeness evaluates how much video content can be described, coherence evaluates the temporal coherence between region sequences, and diversity evaluates the differences between a
region sequence and other region sequences. Assuming that the probability distribution of a region sequence can be represented as $\{p_i^w\}_{i=1}^{N}$, and $q^w$ denotes the probability distribution of a candidate region sequence, the diversity and coherence terms can be defined as:
$$f_{div} = \sum_{i=1}^{N} \int_{w} p_i^w \log \frac{p_i^w}{q^w}\, dw, \qquad (7)$$
$$f_{coh} = \sum_{r_s \in A_{t-1}} \langle x_{r_t}, x_{r_s} \rangle, \qquad (8)$$
where $A_0 = \emptyset$; $r_t$ denotes the $t$th region added into the region sequence $A$; $x_{r_t}$ is the feature vector of the region $r_t$; $x_{r_s}$ represents the feature vector of the region $r_s$; $r_s$ denotes a region added into the region sequence $A$ before $r_t$; $\langle \cdot, \cdot \rangle$ is a dot-product operation between two normalized feature vectors.
Further, the objective function $f$ in Equation (4) is set to be a monotone submodular function and $W_v$ is set to be non-negative. This allows us to find a near-optimal solution efficiently. Following the setting in Reference [18], we utilize submodular maximization to learn $W_v$. Thus, the marginal gain function can be defined as:
$$L(W_v; r) = R(A_{t-1} \cup \{r\}) - R(A_{t-1}) = W_v^{T} f(x_v, A_{t-1} \cup \{r\}) - W_v^{T} f(x_v, A_{t-1}). \qquad (9)$$
To maximize the marginal gain, $A_t$ will meet the following conditions:
$$A_t = A_{t-1} \cup \{r_t\}; \quad r_t = \arg\max_{r \in R_t} L(W_v; r), \qquad (10)$$
The weights $W_v$ are learned by solving:
$$\min_{W_v \geq 0}\; \frac{1}{N}\sum_{i=1}^{N} \max_{r \in r_i} L_i(W_v; r) + \frac{\lambda}{2}\left\| W_v \right\|^2, \qquad (11)$$
where the max term of Equation (11) is a generalized hinge loss. Besides, since words are associ-
ated with regions, the generated region sequences should include high-scored words for caption
generation. The matching score will be calculated by:
$$f_i = \sum_{w \in V_S;\; p_i^w \geq \theta} p_i^w, \qquad (12)$$
where $V_S$ denotes a lexical subset that is formed by the lexical labels in sentence $S$ based on the lexical vocabulary $V$, and $\theta$ is the threshold of $p_i^w$. Therefore, the process of video region sequence
generation can be achieved through the following five steps:
(i) initialize $W_v = 1$ (i.e., set all elements in $W_v$ equal to 1);
(ii) obtain a region sequence with the current $W_v$ using submodular maximization;
(iii) weakly associate a sentence $S$ with the region sequence using a winner-takes-all scheme [18];
(iv) update $W_v$ with the output of step (iii);
(v) repeat steps (ii)–(iv) until $W_v$ converges.
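The sketch below illustrates steps (i) and (ii) with a greedy pass that maximizes a simplified marginal gain in the spirit of Equations (9) and (10). The informativeness proxy, the omission of the diversity term, and the toy data are assumptions made for illustration, and the hinge-loss update of $W_v$ in Equation (11) is not shown.

```python
import numpy as np

def greedy_region_sequence(frame_regions, w_v):
    """Greedily build a region sequence (one region per frame) by maximizing a
    simplified marginal gain (cf. Eqs. (9)-(10)). f(.) here scores only
    informativeness (sum of word probabilities) and coherence (Eq. (8));
    the diversity term is omitted for brevity.

    frame_regions: list over frames of (features, word_probs) with shapes
      (num_regions, d) and (num_regions, vocab_size).
    w_v: non-negative weights for [informativeness, coherence].
    """
    sequence, chosen_features = [], []
    for features, word_probs in frame_regions:
        gains = []
        for r in range(features.shape[0]):
            x_r = features[r] / (np.linalg.norm(features[r]) + 1e-8)
            f_inf = word_probs[r].sum()                              # informativeness proxy
            f_coh = sum(float(x_r @ x) for x in chosen_features)     # Eq. (8)
            gains.append(w_v[0] * f_inf + w_v[1] * f_coh)            # simplified marginal gain
        best = int(np.argmax(gains))
        sequence.append(best)
        chosen_features.append(features[best] / (np.linalg.norm(features[best]) + 1e-8))
    return sequence

# Step (i): W_v = 1; step (ii): greedy selection on toy data (5 frames, 4 regions each).
rng = np.random.default_rng(0)
toy_video = [(rng.standard_normal((4, 16)), rng.random((4, 50))) for _ in range(5)]
print(greedy_region_sequence(toy_video, w_v=np.ones(2)))
```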
where $m$ denotes the number of frames of the video $v$; $\alpha_{jt}$ denotes the $j$th attention weight at timestep $t$; $v_j$ denotes the $j$th video representation of $v$.
MHDPA is a self-attention mechanism proposed in Reference [42], which utilizes three matrices
$Q$, $K$, and $V$ to store all queries, keys, and values, respectively. Using linear projections, these three matrices can be represented as:
$$Q = MW_Q, \qquad (16)$$
$$K = MW_K, \qquad (17)$$
$$V = MW_V, \qquad (18)$$
where $W_{*}$ denotes the weight matrix of the corresponding projection, and $M$ is a matrix of memories that is randomly initialized. Then the attention of a network can be obtained by computing a set of queries simultaneously. Thus, the scaled dot-product attention can be computed by:
$$A(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_K}}\right) V, \qquad (19)$$
where $d_K$ denotes the dimensionality of the key vectors, and the $\mathrm{softmax}$ function is used to obtain the weights on the values. Therefore, the dot-product attention over the memory matrix can be represented as:
$$A_{\omega}(M) = \mathrm{softmax}\!\left(\frac{MW_Q (MW_K)^{T}}{\sqrt{d_K}}\right) MW_V, \qquad (20)$$
where $\omega = (W_Q, W_K, W_V)$. We represent the output of $A_{\omega}(M)$ as $\widetilde{M}$, which is a matrix with the same dimensionality as $M$; $\widetilde{M}$ is an update of $M$. In other words, each element $\widetilde{m}_e$ in $\widetilde{M}$ includes information from $M$. Thus, the information in the memory can be shuttled/transferred from memory to memory via the parameters $W_Q$, $W_K$, and $W_V$. At each step of the attention mechanism, every memory can be updated according to the information memorized in other memories.
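A single-head sketch of Equations (16)-(20) is given below; the memory size, dimensions, and random initialization are illustrative, and the $\sqrt{d_K}$ scaling follows Reference [42].

```python
import torch

def dot_product_attention(M, W_Q, W_K, W_V):
    """Single-head sketch of Eqs. (16)-(20): project the memory matrix M into
    queries, keys, and values, then update every memory slot using all others."""
    Q, K, V = M @ W_Q, M @ W_K, M @ W_V                  # Eqs. (16)-(18)
    d_k = K.size(-1)
    scores = (Q @ K.transpose(0, 1)) / (d_k ** 0.5)      # QK^T / sqrt(d_K)
    return torch.softmax(scores, dim=-1) @ V             # Eqs. (19)-(20)

# Toy usage: 6 randomly initialized memory slots of dimension 32.
d_model, d_k = 32, 32
M = torch.randn(6, d_model)
W_Q, W_K, W_V = (torch.randn(d_model, d_k) for _ in range(3))
M_updated = dot_product_attention(M, W_Q, W_K, W_V)      # same shape as M here
print(M_updated.shape)
```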
In this article, to generate realistic captions to describe video contents, our caption genera-
tion module is trained by minimizing the negative log-likelihood. Mathematically, this can be written as:
$$\min_{\omega} \sum_{t=1}^{N} -\log P\left(S_t \mid v_t; \omega\right), \qquad (21)$$
information $\mu_t$ and to selectively process the hidden states based on the attention weight $\beta_{tj}$.
Therefore, the video reconstruction module can further employ the word composition and the
temporal dynamics of the whole video captions. This can enhance the relationships between the
raw videos and the generated video captions.
where $z_j$ denotes the $j$th reconstructed video representation; $v_j$ denotes the $j$th raw video representation; $\psi(\cdot)$ is the Euclidean distance measure function.
We train the proposed video captioning approach by minimizing the entire loss function of
our approach. The entire loss function consists of two phases: the loss function of the caption
generation module and the loss function of the video reconstruction module. We calculate the
loss function of the caption generation module using the forward likelihood and calculate the loss
function of the video reconstruction module using Equation (23). Thus, the loss function of our
proposed approach can be defined as:
$$L(\theta, \theta_{rec}) = \sum_{j=1}^{N}\left(-\log P\left(S_j \mid v_j; \theta\right) + \lambda L_{rec}\left(v_j, z_j; \theta_{rec}\right)\right), \qquad (24)$$
where the caption generation loss $-\log P(S_j \mid v_j; \theta)$ is calculated by Equation (21); the video reconstruction loss $L_{rec}(v_j, z_j; \theta_{rec})$ is calculated by Equation (23); $\lambda$ denotes a hyper-parameter that is
used to balance the two proposed modules. The larger the difference between
the ground truth and the generated results, the greater the gradient of the loss function and the
faster the convergence rate.
Algorithm 1 shows the proposed video captioning approach. Our approach contains two steps
for training:
In the first step, the caption generation module is trained using the caption generation loss (i.e.,
the forward likelihood), and the early stopping strategy is used to terminate the training process
of this module.
In the second step, we jointly train the proposed two modules according to the entire loss func-
tion of the proposed approach.
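The sketch below summarizes the joint objective of Equation (24) and the two-step schedule. The tensor shapes, the use of a mean Euclidean distance for $L_{rec}$, and the value of $\lambda$ are assumptions for illustration rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def total_loss(caption_logits, caption_targets, reconstructed, raw_features, lam=0.2):
    """Sketch of Eq. (24): caption negative log-likelihood plus a lambda-weighted
    reconstruction term.

    caption_logits: (B, T, vocab) decoder outputs; caption_targets: (B, T) word ids.
    reconstructed, raw_features: (B, m, d) reconstructed and raw video representations.
    lam stands in for the trade-off hyper-parameter lambda.
    """
    # Forward (caption generation) loss, cf. Eq. (21).
    nll = F.cross_entropy(caption_logits.reshape(-1, caption_logits.size(-1)),
                          caption_targets.reshape(-1))
    # Backward (video reconstruction) loss: mean Euclidean distance between the
    # reconstructed and raw frame representations, cf. Eq. (23).
    rec = (reconstructed - raw_features).norm(dim=-1).mean()
    return nll + lam * rec

# Two-step schedule described above (models and optimizer omitted):
#   step 1: train the caption generation module with the NLL term only (early stopping);
#   step 2: jointly fine-tune both modules by minimizing total_loss.
```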
4 EXPERIMENTS
We evaluate the proposed MIMLDL video captioning approach on Microsoft Research video
to text (MSR-VTT) [41] dataset. To demonstrate the effectiveness of MIMLDL, we utilize the popular evaluation metrics METEOR [43], BLEU-4 [44], and ROUGE-L [45], computed with the code released for the Microsoft COCO evaluation server [46].
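For reference, these metrics can be computed with the publicly released COCO caption evaluation code [46]; the sketch below assumes the `pycocoevalcap` Python package is installed (METEOR additionally needs a Java runtime) and uses made-up captions.

```python
# Assumes the `pycocoevalcap` package (the COCO caption evaluation code [46]) is
# installed; the METEOR scorer additionally requires a Java runtime.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge

# Both dicts map a video id to a list of tokenized, lower-cased sentences (made-up here).
references = {"video0": ["a man is playing a guitar", "a person plays guitar"]}
candidates = {"video0": ["a man plays a guitar"]}

for name, scorer in [("BLEU-4", Bleu(4)), ("METEOR", Meteor()), ("ROUGE-L", Rouge())]:
    score, _ = scorer.compute_score(references, candidates)
    score = score[-1] if isinstance(score, list) else score  # Bleu returns BLEU-1..BLEU-4
    print(f"{name}: {score:.4f}")
```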
Table 1. Ablation Studies of MIMLDL in Terms of METEOR, BLEU-4, and ROUGE-L Scores
on the MSR-VTT Dataset (%)
Table 2. Experimental Results of Different Video Captioning Approaches in Terms of METEOR, BLEU-4,
and ROUGE-L Scores on the MSR-VTT Dataset (%)
video captioning approaches, including MP-LSTM [1], SA-LSTM [1], RecNet [1], Bi-directional
MIL [18], and Bi-directional MIMLL [18].
MP-LSTM [1] is an encoder-decoder-based video captioning approach. It uses AlexNet,
GoogleNet, or VGG19 as an encoder to extract features from video frames. Then the mean pooling
result of the output of the encoder is fed into an LSTM-based decoder to generate video captions.
SA-LSTM [1] is an encoder-decoder based video captioning approach. It uses AlexNet,
GoogleNet, VGG19, or Inception-V4 as an encoder to extract features from video frames. Then
an attention mechanism is used to fuse the output of the encoder. The output of the attention
mechanism is fed into an LSTM-based decoder to generate video captions.
RecNet [1] is an encoder-decoder-reconstructor-based video captioning approach. It uses
Inception-V4 as an encoder to extract features from video frames. Then a spatial attention mech-
anism is used to fuse the output of the encoder. The output of the attention mechanism is fed into
an LSTM-based decoder to generate video captions. After that, the hidden states of the decoder are
processed by the attention mechanism and fed into a reconstructor to reproduce video features.
Bi-directional MIL [18] is an encoder-decoder-based video captioning approach. It uses ResNet50
as an encoder to extract features from video frames. Then the extracted features are used to train
a Lexical-FCN network using multi-instance learning. The output of Lexical-FCN is fed into a
decoder to generate video captions.
5 CONCLUSION
Video captioning is a task that generates captions for videos. It has been used to solve many real-world problems, such as helping blind people follow the plot of a movie.
This article proposes a novel encoder-decoder-reconstructor-based multi-instance multi-
label dual learning approach (MIMLDL) for video captioning. MIMLDL contains two modules:
caption generation and video reconstruction modules. Specifically, a weakly supervised multi-
instance multi-label learning-based lexical fully convolutional neural network (Lexical-
FCN) is used in the caption generation module to learn a translatable mapping between video
regions and lexical labels for caption generation. Then the hidden states of the decoder are used
by the proposed video reconstruction module to synthesize visual sequences to reproduce raw
videos. According to the gap between the raw and the reproduced videos, the two modules are
fine-tuned through a dual learning mechanism. A multi-head attention mechanism is also used in
the two modules to capture the most effective information from raw videos and generated cap-
tions. Thus, our approach can minimize the semantic gap between a raw video and the generated
caption by minimizing the differences between the reproduced and the raw visual sequences.
We test MIMLDL on the MSR-VTT dataset. Experimental results demonstrate that our approach
can improve the accuracy of video captioning. Our research also verifies the effectiveness of multi-
instance learning-based dual learning in generating high-quality video captions.
The proposed approach can also be further improved. For example, this article utilizes an existing multi-head attention mechanism rather than developing a novel attention mechanism to capture the information from videos and captions. In the future, we will propose better video captioning approaches and more appropriate attention mechanisms for video captioning. In addition, we intend to apply video captioning to a wider range of applications for solving real-world problems.
REFERENCES
[1] B. Wang, L. Ma, W. Zhang, and W. Liu. 2018. Reconstruction network for video captioning. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 7622–7631.
[2] J. Wang, W. Wang, Y. Huang, L. Wang, and T. Tan. 2018. M3: Multimodal memory modelling for video captioning. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7512–7520.
[3] C. Yan, Y. Tu, X. Wang, Y. Zhang, X. Hao, Y. Zhang, and Q. Dai. 2019. STAT: Spatial-temporal attention mechanism
for video captioning. IEEE Trans. Multim. 22, 1 (2019), 229–241.
[4] A. Wang, H. Hu, and L. Yang. 2018. Image captioning with affective guiding and selective attention. ACM Trans.
Multim. Comput., Commun., Applic. 14, 3 (2018), 1–15.
[5] L. Yang, H. Hu, S. Xing, and X. Lu. 2020. Constrained LSTM and residual attention for image captioning. ACM Trans.
Multim. Comput., Commun., Applic. 16, 3 (2020), 1–18.
[6] J. Wu, H. Hu, and L. Yang. 2019. Pseudo-3D attention transfer network with content-aware strategy for image cap-
tioning. ACM Trans. Multim. Comput., Commun., Applic. 15, 3 (2019), 1–19.
[7] A. Kojima, T. Tamura, and K. Fukunaga. 2002. Natural language description of human activities from video images
based on concept hierarchy of actions. Int. J. Comput. Vis. 50, 2 (2002), 171–184.
[8] M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. 2013. Translating video content to natural language
descriptions. In Proceedings of the IEEE International Conference on Computer Vision. 433–440.
[9] R. Xu, C. Xiong, W. Chen, and J. J. Corso. 2015. Jointly modeling deep video and compositional text to bridge vision
and language in a unified framework. In Proceedings of the 29th AAAI Conference on Artificial Intelligence. 1–7.
[10] J. Ma, R. Wang, W. Ji, H. Zheng, E Zhu, and J. Yin. 2019. Relational recurrent neural networks for polyphonic sound
event detection. Multim. Tools Applic. 78, 20 (2019), 29509–29527.
[11] Y. Wu, X. Ji, W. Ji, Y. Tian, and H. Zhou. 2020. CASR: A context-aware residual network for single-image superreso-
lution. Neural Comput. Applic. 32, 6 (2020), 14533–14548.
[12] Z. Liu, Z. Li, M. Zong, W. Ji, R. Wang, and Y. Tian. 2019. Spatiotemporal saliency based multi-stream networks for
action recognition. In Proceedings of the Asian Conference on Pattern Recognition. 74–84.
[13] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. 2015. Translating videos to natural
language using deep recurrent neural networks. In Proceedings of the Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies. 1494–1504.
[14] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. 2015. Sequence to sequence-video
to text. In Proceedings of the IEEE International Conference on Computer Vision. 4534–4542.
[15] C. Zhang and Y. Tian. 2016. Automatic video description generation via LSTM with joint two-stream encoding. In
Proceedings of the 23rd International Conference on Pattern Recognition. 2924–2929.
[16] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. 2015. Describing videos by exploiting
temporal structure. In Proceedings of the IEEE International Conference on Computer Vision. 4507–4515.
[17] Z. Wu, T. Yao, Y. Fu, and Y. Jiang. 2017. Deep learning for video classification and captioning. In Frontiers of Multimedia
Research. ACM, 3–29.
[18] Z. Shen, J. Li, Z. Su, M. Li, Y. Chen, Y. Jiang, and X. Xue. 2017. Weakly supervised dense video captioning. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition. 1916–1924.
[19] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez. 1997. Solving the multiple instance problem with axis-parallel
rectangles. Artif. Intell. 89, 1-2 (1997), 31–71.
[20] P. Shamsolmoali, M. Zareapoor, H. Zhou, and J. Yang. 2020. AMIL: Adversarial multi-instance learning for human
pose estimation. ACM Trans. Multim. Comput., Commun., Applic. 16, 1 (2020), 1–23.
[21] X. Zhang, H. Shi, C. Li, and P. Li. 2020. Multi-instance multi-label action recognition and localization based on spatio-
temporal pre-trimming for untrimmed videos. In Proceedings of the AAAI Conference on Artificial Intelligence. 12886–
12893.
[22] P. Luo, G. Wang, L. Lin, and X. Wang. 2017. Deep dual learning for semantic image segmentation. In Proceedings of
the IEEE International Conference on Computer Vision. 2718–2726.
[23] Y. Xia, J. Bian, T. Qin, N. Yu, and T. Liu. 2017. Dual inference for machine learning. In Proceedings of the International
Joint Conferences on Artificial Intelligence. 3112–3118.
[24] Z. Yi, H. Zhang, P. Tan, and M. Gong. 2017. Dualgan: Unsupervised dual learning for image-to-image translation. In
Proceedings of the IEEE International Conference on Computer Vision. 2849–2857.
[25] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim. 2017. Learning to discover cross-domain relations with generative
adversarial networks. In Proceedings of the 34th International Conference on Machine Learning. 1857–1865.
[26] J. Zhu, T. Park, P. Isola, and A. A. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial
networks. In Proceedings of the IEEE International Conference on Computer Vision. 2223–2232.
[27] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W. Ma. 2016. Dual learning for machine translation. In Proceedings
of the International Conference on Advances in Neural Information Processing Systems. 820–828.
[28] Y. Wang, Y. Xia, L. Zhao, J. Bian, T. Qin, G. Liu, and T. Liu. 2018. Dual transfer learning for neural machine translation
with marginal distribution regularization. In Proceedings of the AAAI Conference on Artificial Intelligence. 1–7.
[29] G. Lample, A. Conneau, L. Denoyer, and M. A. Ranzato. 2018. Unsupervised machine translation using monolingual
corpora only. In Proceedings of the International Conference on Learning Representations. 1–14.
[30] M. Artetxe, G. Labaka, E. Agirre, and K. Cho. 2018. Unsupervised neural machine translation. In Proceedings of the
International Conference on Learning Representations. 1–12.
[31] Y. Wang, Y. Xia, T. He, F. Tian, T. Qin, C. Zhai, and T. Liu. 2019. Multi-agent dual learning. In Proceedings of the
International Conference on Learning Representations. 1–15.
[32] Z. Zhao, Y. Xia, T. Qin, and T. Liu. 2019. Dual learning: Theoretical study and algorithmic extensions. In Proceedings
of the International Conference on Learning Representations. 1–16.
[33] Y. Xia, T. Qin, W. Chen, J. Bian, N. Yu, and T. Liu. 2017. Dual supervised learning. In Proceedings of the International
Conference on Machine Learning. 3789–3798.
[34] Y. Xia, X. Tan, F. Tian, T. Qin, N. Yu, and T. Liu. 2018. Model-level dual learning. In Proceedings of the International
Conference on Machine Learning. 5383–5392.
[35] W. Zhao, W. Xu, M. Yang, J. Ye, Z. Zhao, Y. Feng, and Y. Qiao. 2017. Dual learning for cross-domain image captioning.
In Proceedings of the ACM on Conference on Information and Knowledge Management. 29–38.
[36] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. 2003. Feature-rich part-of-speech tagging with a cyclic depen-
dency network. In Proceedings of the Conference of the North American Chapter of the Association for Computational
Linguistics on Human Language Technology. 173–180.
[37] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao et al. 2015. From captions to visual concepts
and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1473–1482.
[38] L. A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell. 2016. Deep compositional
captioning: Describing novel object categories without paired training data. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. 1–10.
[39] D. Heckerman. 1990. A tractable inference algorithm for diagnosing multiple diseases. In Mach. Intell. Pattern Recog.
10 (1990), 163–171.
[40] M. Gygli, H. Grabner, and L. V. Gool. 2015. Video summarization by learning submodular mixtures of objectives. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3090–3098.
[41] J. Xu, T. Mei, T. Yao, and Y. Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5288–5296.
[42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. 2017. Attention
is all you need. In Proceedings of the International Conference on Advances in Neural Information Processing Systems.
5998–6008.
[43] S. Banerjee and A. Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with
human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine
Translation and/or Summarization. 65–72.
[44] K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2002. BLEU: A method for automatic evaluation of machine translation.
In Proceedings of the 40th Meeting on Association for Computational Linguistics. 311–318.
[45] C. Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out. Associ-
ation for Computational Linguistics, 74–81.
[46] X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. 2015. Microsoft COCO captions: Data
collection and evaluation server. arXiv preprint arXiv:1504.00325. (2015).
[47] C. Tang, X. Liu, S. An, and P. Wang. 2020. BR2Net: Defocus blur detection via bidirectional channel attention residual
refining network. IEEE Trans. Multim. DOI: 10.1109/TMM.2020.2985541.
[48] C. Tang, X. Liu, P. Wang, C. Zhang, M. Li, and L. Wang. 2019. Adaptive hypergraph embedded semi-supervised
multi-label image annotation. IEEE Trans. Multim. 21, 11 (2019), 2837–2849.
[49] X. Liu, L. Wang, J. Zhang, J. Yin, and H. Liu. 2013. Global and local structure preservation for feature selection. IEEE
Trans. Neural Netw. Learn. Syst. 25, 6 (2013), 1083–1095.
[50] Y. Tian, X. Wang, J. Wu, R. Wang, and B. Yang. 2019. Multi-scale hierarchical residual network for dense captioning.
J. Artif. Intell. Res. 64 (2019), 181–196.