A Multimodal Framework For Video Caption Generation
This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3202526
Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2017.DOI
ABSTRACT Video captioning is a highly challenging computer vision task that automatically describes
the video clips using natural language sentences with a clear understanding of the embedded semantics.
In this work, a video caption generation framework consisting of discrete wavelet convolutional neural
architecture along with multimodal feature attention is proposed. Here global, contextual and temporal
features in the video frames are taken into account and separate attention networks are integrated in the
visual attention predictor network to capture multiple attentions from these features. These attended features
with textual attention are employed in the visual-to-text translator for caption generation. The experiments
are conducted on two benchmark video captioning datasets, MSVD and MSR-VTT. The results show
improved performance of the method, with CIDEr scores of 91.7 and 52.2 on these datasets,
respectively.
INDEX TERMS Video Captioning, Discrete Wavelet Convolutional Model, Multimodal Feature
Extraction, Visual Attention Predictor.
VOLUME 4, 2016 1
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3202526
relationship between them in the entire video sequence. While generating the descriptions, the contextual information also needs to be given adequate importance along with the visual and semantic information. To achieve this goal, we propose a deep neural network architecture utilizing a Discrete Wavelet Transform (DWT) based CNN for extracting finer visual details from the video frames, which enables better video caption generation. The new architecture is able to exploit spectral information in the video frames along with the spatial, semantic and temporal details for caption generation. The proposed model includes a Global Feature Extractor (GFE) incorporating a DWT based Convolutional Neural Network (2D-WCNN) for the extraction of global features from the video frames, a Contextual Object Relationship Extractor (CORE) for finding the contextual relationship between different objects in the frames and a Temporal Feature Extractor (TFE) that consists of a 3DCNN model for obtaining the dynamic features from the video. A visual attention predictor network is also incorporated that extracts the attention from these features, and finally visual-to-text translation is done using a caption decoder network as in the transformer [17]. The self attention mechanism largely resolves the long-range dependency issues of sequential models and allows parallel computing. Due to the presence of self attention layers, the decoder network can enhance the quality of visual-to-text translation by considering the word-to-word, object-to-object as well as object-to-word interactions in the input sequences. This network utilizes global dependencies between input and output to provide improved performance compared to RNN and LSTM.
The main contributions of the proposed framework are:
1) A 2D-WCNN structure using two-level DWT decomposition and CNN layers is employed for extracting the global features in the video frames. The utilization of DWT helps to include the fine grained spatial, spectral and semantic details in the frames.
2) A Contextual Object Relationship Extractor (CORE) which makes use of the feature maps obtained from the 2D-WCNN for predicting region proposals and computes the contextual relationship between the different frames in the video.
3) A multimodal visual feature attention network that concurrently computes global, contextual and temporal feature attention, capable of increasing the efficiency and prediction accuracy of the entire methodology.
The effectiveness of the proposed method is evaluated using two benchmark datasets - MSVD and MSR-VTT and the results are compared with the existing state-of-the-art methods using the evaluation metrics BLEU, METEOR and CIDEr.
The paper is organized as follows: Section II gives a brief review of the existing research in this area. The details of the proposed architecture and the experimental results are described in Sections III and IV, respectively, and Section V concludes the paper.
II. LITERATURE SURVEY
In early approaches, caption generation for videos was done using classical template-based techniques that employed SVO-triplets - Subject (S), Verb (V), and Object (O) [18]. These triplets are found individually and are then combined to form a sentence. Many encoder-decoder architectures have been proposed that use 2DCNN/3DCNN structures as the encoder for generating feature representations and sequential models like RNN, LSTM and GRU as the decoder for language translation [19], [20]. A two-step captioning approach that learns the correspondence between semantic representation labels and verbalization before translating it to natural language is introduced in [21]. An unsupervised Multirate Visual Recurrent Model (MVRM) that is capable of handling motion speed variations in video frames with a bidirectional reconstruction technique is presented in [22]. A Dual Memory Recurrent Model (DMRM) utilizing global and temporal details along with semantic supervision for accurate detection of the region-of-interest is proposed in [23]. Captions are also generated with latent topic guidance [24], a Time Boundary-aware LSTM cell [25], Boosted and Parallel Long Short-Term Memory Networks (BP-LSTM) [26] and an Object Relational Graph with Teacher-Recommended Learning (ORG-TRL) system [27]. Another technique utilizes two steps - video Part-of-Speech (POS) tagging and visual cue translation [28]. This is accomplished using a mixture model for converting visual features to lexical words and sentence templates comprising POS tags.
A few algorithms are developed by considering both the spatial and temporal features simultaneously along with attention mechanisms. A multimodal stochastic recurrent neural network (MS-RNN) that makes use of latent stochastic variables is presented in [29] for video captioning. Hierarchical encoder structures that give more attention to the temporal details of the video are proposed in [30] and [31]. Descriptions are also generated by employing attention in the decoder section [15], [16] as well as multimodal fusion mechanisms with aural features in the video [32]. A multimodal temporal attention mechanism incorporating image, motion, and audio features is given in [33]. This architecture is developed by assuming that different modalities carry different task-relevant information at different time instances. In another work, captions are produced using a co-attention model based recurrent neural network (CAM-RNN) consisting of a visual attention module, a text attention module and a balancing gate [34]. This algorithm is capable of adaptively detecting the most relevant regions in the image and thus concentrating on the relevant words or phrases in the generated sentence.
Recently, transformer based decoder models are proposed for the generation of video descriptions. Masked transformers can be used for generating end-to-end video captions which uses masking network to produce
FIGURE 1: Architecture of the proposed model. GFE, CORE and TFE generate multimodal features for text generation.
TABLE 1: Details of various convolutional layers in 2D-WCNN model. The input to the network is of size 224×224×3. The residual blocks are mentioned in square brackets.
Layer Name   Kernel size/No. of filters, stride                                       Output size
L1           3 × 3/64, stride 1                                                       112 × 112 × 64
L2           3 × 3/128, stride 1                                                      56 × 56 × 128
CONV1        3 × 3/64, stride 1                                                       224 × 224 × 64
             3 × 3/64, stride 1                                                       224 × 224 × 64
             3 × 3/64, stride 2                                                       112 × 112 × 64
CONV2        3 × 3/128, stride 1                                                      112 × 112 × 128
             3 × 3/128, stride 1                                                      112 × 112 × 128
             3 × 3/128, stride 2                                                      56 × 56 × 128
             [1 × 1/128, stride 1; 3 × 3/128, stride 1; 1 × 1/256, stride 1] × 2      56 × 56 × 256
CONV3        [1 × 1/128, stride 2; 3 × 3/128, stride 1; 1 × 1/512, stride 1] × 4      28 × 28 × 512
CONV4        [1 × 1/256, stride 2; 3 × 3/256, stride 1; 1 × 1/1024, stride 1] × 6     14 × 14 × 1024
CONV5        [1 × 1/512, stride 2; 3 × 3/512, stride 1; 1 × 1/2048, stride 1] × 3     7 × 7 × 2048
In level-1 wavelet decomposition, the low pass and high pass filtering of the input frame produces an approximation sub-band ($ll_1$) and three detail sub-bands ($lh_1$, $hl_1$ and $hh_1$). The $ll_1$ sub-band is further decomposed into four sub-bands - $ll_2$, $lh_2$, $hl_2$ and $hh_2$ in level-2 decomposition. The sub-bands of the R, G and B components obtained from the two levels are then stacked together and concatenated with the maxpooled outputs of the CONV1 and CONV2 layers, respectively, as shown in Fig. 2. All four sub-bands need to be included for feature extraction in the convolutional layers because each sub-band carries distinguishable features, which are essential for a good visual representation of the frame. The configuration of levels CONV3 to CONV5 is the same as that of the ResNet-50 network, with three-layer bottleneck blocks along with residual connections. The first convolutional layers of the levels CONV3 to CONV5 have stride 2. The details of the filters used in the various convolutional layers along with their output sizes are summarized in Table 1. Batch normalization together with the ReLU activation function is used in all the layers. Padding is also employed in every layer. The extracted global feature maps from the CONV5 layer are given to the CORE for finding the relationship between the various objects in the frames. These feature maps are also given to the VAP for computing global attention.
(2) CORE: Better captions can be generated only by considering the contextual relationship between the objects in the video. The CORE employs a modified configuration of the Faster RCNN structure [39]. It utilizes the feature maps produced by the CONV5 layer of the 2D-WCNN network in the GFE to predict the region proposals for detecting the object relationships as shown in Fig. 3. The object regions are identified using the 2D-WCNN network and a Region Proposal Network (RPN), in a similar fashion to Faster RCNN, along with classifier and regression layers for creating bounding boxes. The identified objects are then paired and different sub images are created, each containing one of the identified object pairs. For uniformity, these sub images are resized to 32×32 and are given to CNN layers having two sets of 64 filters, each with receptive field 3×3, as shown in Fig. 3. The obtained feature maps highlight the spatial relationship between the object pairs. The spatial relation feature maps of all object pairs are then stacked together and given to a 1×1×64 convolution layer to form the contextual spatial relation feature map. These features of CORE are passed through a fully connected layer to produce a 2048-dimensional feature vector, $V_{ci}$.
(3) TFE: The temporal features are extracted using 3DCNN (C3D network).
The three feature vectors - $V_{gi}$, $V_{ci}$ and $V_{ti}$ representing global features, local features and motion features in the video, respectively, are then provided to the VAP module.
B. VISUAL ATTENTION PREDICTOR
Content rich video caption generation necessitates a clear understanding of the semantics in the video. To accomplish this, a VAP network is incorporated in the model that utilizes Scaled Dot-product Attention to compute the global, local and temporal attention features. It consists of a multi-head attention mechanism having $H$ parallel attention layers or heads, each computing Scaled Dot-product Attention on an input having a set of queries ($Q$), keys ($K$) and values ($V$), each in $\mathbb{R}^{d_i}$. In the case of the global attention network, $Q$, K$ $and $V$ are all set equal to $V_{gi}$ as shown in Fig. 1. Thus the output of the global attention network is given by the expression,
$$F^{g}_{att} = (h_1 \oplus h_2 \oplus \dots \oplus h_H) W^{o} \tag{1}$$
$$h_i = G_{att}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}) \tag{2}$$
$$G_{att} = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_i}}\right) V \tag{3}$$
where $\oplus$ denotes concatenation, $G_{att}$ represents global Scaled Dot-product Attention with independent head projection matrices, $W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$, for $i = 1, 2, \dots, H$ in $\mathbb{R}^{\frac{d_i}{H} \times d_i}$. $W^{o} \in \mathbb{R}^{d_i \times d_i}$ is the output projection matrix that combines the output from the various heads, each having a dimension of $\frac{d_i}{H}$. Similarly, the outputs for the CORE and temporal attention networks, $F^{c}_{att}$ and $F^{t}_{att}$, are computed as in equations 1 through 3 with the $Q$, $K$ and $V$ inputs set to $V_{ci}$ and $V_{ti}$, respectively.
In the VAP module, $N$ identical attention sub networks are stacked together, separately for computing the global, contextual and temporal attention features. The attended feature output from the $(i-1)$th stage, $F^{i-1}_{att}$, is used to produce the attended features of the next stage, $F^{i}_{att}$, in a recursive manner. The features, $F^{g}_{att}$ are given to
the normalization ($\mathrm{Norm}$) and feedforward ($\mathrm{FF}$) networks having two fully-connected layers, a ReLU activation function and dropout layers. This introduces non-linearity in the network. In this work, the dropout ratio is set as 0.1. Residual connections and layer normalization are included in all the network layers. Thus $V_{gA}$ is computed as,
$$V_{gA} = \mathrm{Norm}\big[\mathrm{FF}(\mathrm{Norm}[F^{g}_{att} + V_{gi}]) + \mathrm{Norm}[F^{g}_{att} + V_{gi}]\big] \tag{4}$$
Similarly, the contextual and temporal attention features are obtained as,
$$V_{cA} = \mathrm{Norm}\big[\mathrm{FF}(\mathrm{Norm}[F^{c}_{att} + V_{ci}]) + \mathrm{Norm}[F^{c}_{att} + V_{ci}]\big] \tag{5}$$
$$V_{tA} = \mathrm{Norm}\big[\mathrm{FF}(\mathrm{Norm}[F^{t}_{att} + V_{ti}]) + \mathrm{Norm}[F^{t}_{att} + V_{ti}]\big] \tag{6}$$
The attended output features so obtained from the global, local and temporal attention networks are multiplied together and given to the visual-to-text translator for further processing.
C. VISUAL-TO-TEXT TRANSLATOR
The attended visual representations from the three attention networks of the VAP, together with the attended ground truth caption word embeddings, $X_{cap}$, are fed to the multi-head attention networks of the VTT section of the architecture as shown in Fig. 1. The VTT consists of one masked attention network computing the self-attention within the word embeddings. It uses a mask matrix during training so that each word attends only to the words in the previous positions of the output sequence. This self attention layer of word embeddings is followed by multi-head attention, $MH^{D}_{att}$, which computes the guided attention on the word embeddings in accordance with the attended visual representations. The $MH^{D}_{att}$ consists of four VTT sublayers stacked together to produce the attended visual-to-text features, $VT_{att}$, as given below,
$$VT_{att} = \mathrm{Norm}\big[\mathrm{FF}(\mathrm{Norm}(F_{Datt} + X_{cap})) + \mathrm{Norm}(F_{Datt} + X_{cap})\big] \tag{7}$$
The $VT_{att}$ features are fed to a linear network and finally, the prediction of words is performed by a softmax layer. During the training phase, the cross-entropy loss $L_{CE}$ from all time-steps is used, which is expressed as,
$$L_{CE}(\theta) = -\sum_{t=1}^{T} \log\big(p_{\theta}(W_t \mid W_{1:t-1}, V_{gi}, V_{ci}, V_{ti})\big) \tag{8}$$
where $W_{1:t-1}$ represents the ground truth word sequence preceding time-step $t$ and $\theta$ denotes the parameters.
The model so designed has to undergo exhaustive evaluation to reveal its effectiveness in video captioning, as discussed below.
IV. EXPERIMENTS AND RESULTS
Both qualitative and quantitative analysis of the proposed framework have been carried out with different datasets and performance evaluation metrics. Results of this analysis and a comparative study with the state-of-the-art video captioning techniques are presented in this section.
A. DATASETS USED
Experiments are conducted on two benchmark datasets for video captioning: Microsoft Research Video Description Corpus dataset (MSVD) [40] and MSR-Video to Text dataset (MSR-VTT) [41].
1) MSVD dataset: It consists of 1,970 YouTube video clips having an average of 40 manually annotated captions per clip. For fair comparison, we have used the split proposed in [12], which consists of 1,200 videos for training, 100 videos for validation and 670 videos for testing.
2) MSR-VTT dataset: It is the largest video captioning dataset having 10K video clips, each annotated with 20 sentences. The standard split mentioned in [41] is adopted for this dataset - 6,513 videos for training, 497 for validation and 2,990 videos for testing.
B. PERFORMANCE EVALUATION METRICS
The performance evaluation of the methodology is done using the evaluation metrics - BiLingual Evaluation Understudy (BLEU@4) [42], Metric for Evaluation of Translation with Explicit ORdering (METEOR) [43] and Consensus-based Image Description Evaluation (CIDEr)
[44]. These will be denoted as B@4, MT and CD, respectively, in this work. The B@4 metric is a commonly used metric for evaluating machine translation that measures the 4-gram based precision. The MT metric measures the harmonic mean of unigram precision and recall between the candidate and the reference sentences, that is, the word-level correspondence between the two sentences. The CD metric evaluates the consensus between the generated sentence and the human-written reference sentences. Hence these three metrics can effectively calculate the consistency between occurrences of words in the generated caption and the ground-truth descriptions.
C. IMPLEMENTATION DETAILS
Since the video datasets include videos with frame rates varying from 6 to 60 fps, the input is resampled to a uniform frame rate for smooth working of the algorithm on the datasets under consideration. Hence in our method, the video clips are resampled at 10 fps and 30 uniformly spaced frames are chosen from each video clip, keeping in mind that adjacent frames in the short clips included in the datasets differ very little in terms of information content. The 2D-WCNN model in the GFE module is pre-trained using the ImageNet dataset [45] and the C3D model in the TFE module is pre-trained using the Sports-1M dataset [46]. For temporal feature extraction, we have considered non-overlapping sequences of 16 frames, the same as the default settings. All the visual features are given to individual fully-connected layers with 512 units, to match the feature dimensionality of the attention networks in the model. In VAP, four stages of attention networks are used for the extraction of attended features.
The pre-processing of all the textual descriptions is done by tokenizing with the NLTK toolkit, which splits the sentences into words, converting all the words to lowercase and removing punctuation. All words occurring fewer than 3 times are removed. Each word in the caption is represented as a word vector using the 300 dimensional GloVe word embeddings [47] pre-trained on a large-scale corpus. For dimensionality matching, the GloVe embeddings are given to an LSTM network with 512 hidden units. The maximum sentence length is limited to 20. In the visual-to-text translator network, four caption decoder stages with model dimension $d_i$ set as 512 are employed. The number of heads, $H$, for $MH_{att}$ is taken as 4, so the dimension of each head is 128. The hidden size of the $\mathrm{FF}$ networks is set as 1024. Also, sine and cosine functions are used as positional encodings with the word embeddings [17], which provide information regarding the position of the tokens in the sequence. The weight initialization of the model is done using the Xavier method and the model is optimized with ADAM employing an initial learning rate of 1e-5 for MSVD and 4e-5 for MSR-VTT, with default exponential decay rates of (0.9, 0.999). To train the proposed method, an Nvidia Tesla V100 GPU with 16 GB of memory and 5120 CUDA cores is used. The implementation of the method is done using TensorFlow 2.3.
The training is carried out with a batch size of 8 for MSVD and 16 for MSR-VTT. In order to avoid overfitting, dropout and early stopping are used in the method. The self-critical training strategy is employed in the implementation, where the model is trained initially for 50 epochs with the cross-entropy loss and is further fine-tuned for 25 epochs using the self-critical loss to achieve the best CD score on the validation set. This helps to tackle the exposure bias problem that arises during optimization with the cross-entropy loss alone. During the testing phase, the BeamSearch strategy is adopted to select the best caption from a few selected candidates. The beam size is chosen as 5. The time cost for the training process with the MSVD dataset is 2.1 hours/epoch and that of MSR-VTT is 4.6 hours/epoch. The average time cost for the testing phase of the model is 9.3 sec. Thus for each video with 30 sampled frames, the average testing speed is 3.2 frames per second.
D. SELECTION OF APPROPRIATE MOTHER WAVELET
To choose the appropriate mother wavelet, the performance of the proposed method is analyzed with two levels of wavelet decomposition on ten different mother wavelet functions of four different wavelet families - Daubechies wavelets (dbN) [48], biorthogonal wavelets (biorNr.Nd) [49], Coiflets (coifN) [50] and Symlets (symN) [49], where N represents the number of vanishing moments, and Nr and Nd denote the number of vanishing moments in the reconstruction and decomposition filters, respectively. The detailed performance results of our method for MSVD and MSR-VTT datasets are given in Table 2. A baseline method (BM) consisting of two-level DWT decomposition based CNN along with a single attention network in the VAP module is used in the experiment.
TABLE 2: Performance results of the proposed method for different mother wavelets. Here BM denotes the baseline method.
                   MSVD                      MSR-VTT
Mother wavelet     B@4     MT      CD        B@4     MT      CD
BM                 49.32   31.64   85.53     38.44   25.15   48.27
db1                50.49   32.18   86.65     39.38   26.25   49.05
db4                50.17   32.47   86.81     39.73   25.81   48.76
bior1.5            51.74   33.01   87.93     40.54   26.33   49.35
bior2.4            51.12   33.21   87.58     40.49   25.91   49.24
bior3.5            50.61   32.56   87.27     40.14   26.03   49.31
bior5.5            51.08   32.94   87.81     39.95   26.27   49.08
Coif2              49.96   32.48   86.69     39.43   25.74   48.82
Coif5              50.29   32.64   86.15     39.81   25.82   48.78
Sym2               50.35   32.47   86.37     38.98   25.59   48.95
Sym4               49.98   32.29   86.45     39.36   25.77   49.17
For the MSVD dataset, bior2.4 secures the highest MT score of 33.21, but bior1.5 gives the best B@4 and CD values among the wavelets. For MSR-VTT, bior1.5 achieves the best B@4, MT and CD scores of 40.54, 26.33 and 49.35, respectively. Hence for the 2D-WCNN network, we have chosen the bior1.5 wavelet for both MSVD and MSR-VTT datasets.
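The two-level decomposition evaluated in this section can be illustrated with a small sketch. It is not the authors' implementation: it assumes the Haar wavelet (db1, one of the wavelets listed in Table 2) rather than the selected bior1.5, works on a single-channel frame stored as nested lists, and the function names (`haar_dwt2`, `two_level_dwt`) and the sub-band naming order are hypothetical choices made for illustration. It shows how each level halves the spatial size, so a 224×224 frame would yield 112×112 sub-bands at level 1 and 56×56 sub-bands at level 2.

```python
"""Two-level 2D Haar (db1) decomposition of one frame channel (illustrative sketch)."""


def haar_1d(values):
    """One Haar analysis step: scaled pairwise sums (low pass) and differences (high pass)."""
    scale = 2 ** 0.5
    low = [(values[i] + values[i + 1]) / scale for i in range(0, len(values), 2)]
    high = [(values[i] - values[i + 1]) / scale for i in range(0, len(values), 2)]
    return low, high


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def filter_rows(matrix):
    """Apply the 1D Haar step to every row; returns (low-pass rows, high-pass rows)."""
    lows, highs = [], []
    for row in matrix:
        low, high = haar_1d(row)
        lows.append(low)
        highs.append(high)
    return lows, highs


def haar_dwt2(frame):
    """Return the (ll, lh, hl, hh) sub-bands, each half the height and width of `frame`.

    The lh/hl naming convention used here is an assumption; conventions differ.
    """
    row_low, row_high = filter_rows(frame)          # filter along the width
    ll_t, lh_t = filter_rows(transpose(row_low))    # then along the height
    hl_t, hh_t = filter_rows(transpose(row_high))
    return transpose(ll_t), transpose(lh_t), transpose(hl_t), transpose(hh_t)


def two_level_dwt(frame):
    """Level 1 decomposes the frame; level 2 decomposes the level-1 approximation (ll1)."""
    level1 = haar_dwt2(frame)
    level2 = haar_dwt2(level1[0])
    return level1, level2


# Toy 8x8 "frame"; a real frame channel would be 224x224.
frame = [[float((r * 3 + c * 5) % 7) for c in range(8)] for r in range(8)]
level1, level2 = two_level_dwt(frame)
print(len(level1[0]), len(level1[0][0]))  # 4 4
print(len(level2[0]), len(level2[0][0]))  # 2 2
```

In practice a wavelet library would be used so that other mother wavelets such as bior1.5 can be swapped in; the hand-written Haar filter above only serves to make the sub-band structure explicit.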
E. SELECTION OF DWT DECOMPOSITION STAGES
A detailed study of the results obtained for various DWT decomposition levels is carried out and the results are given in Table 3.
TABLE 3: Performance results of the proposed method for different number of decomposition levels
                         MSVD                      MSR-VTT
Decomposition levels     B@4     MT      CD        B@4     MT      CD
1-level                  52.87   35.14   90.39     43.83   28.68   51.37
2-level                  53.64   36.53   91.71     44.92   29.84   52.22
3-level                  53.69   36.90   91.89     45.03   29.89   52.51
The method with two-level decomposition yields better results than with 1-level DWT decomposition, improving the B@4, MT and CD values by 0.77, 1.39 and 1.32 percentage points, respectively, for the MSVD dataset and by 1.09, 1.16 and 0.85 percentage points, respectively, for the MSR-VTT dataset. The method yields only slight further improvements in the performance metric scores with the inclusion of three-level decomposition. Hence, considering the computational complexity, the method with two-level decomposition is preferred in the proposed work.
F. PERFORMANCE EVALUATION
Both quantitative and qualitative analysis of the methodology are carried out using the evaluation metrics and the results are compared with the state-of-the-art methods as detailed below.
1) Quantitative results
Table 4 shows the performance results of our method on the MSVD dataset along with a comparison with the state-of-the-art methods: HRNE [30], h-RNN [31], LSTM-TSA [51], M3 [52], PickNet [53], RecNet [54], DS-RNN [55], MS-RNN [56], GRU-EVE [57], GFN-POS [58], TRGCN [59], STAT [14] and SBAT [36] in video captioning. From Table 4, it can be noted that our algorithm outperforms the existing methods with improved MT and CD scores of 36.5% and 91.7%, respectively. It also secures a B@4 of 53.6%, which is second only to GFN-POS [58] (53.9%). This indicates the ability of our method to highlight the finer details in the input video clip. Table 5 shows the quantitative comparison results of the proposed method on the MSR-VTT dataset with the existing state-of-the-art methods. These include the methods ranked in the top-3 positions of the Leaderboard of the MSR-VTT Challenge 2017 - VideoLAB [60], Aalto [61] and v2t-Navigator [62] along with the methods MTVC [63], PickNet [53], TVT [64], DS-RNN [55], MS-RNN [56], GRU-EVE [57], GFN-POS [58], TRGCN [59], MARN [65], STAT [14] and SBAT [36] in video captioning. For this dataset also, it achieves B@4, MT and CD scores of 44.9%, 29.8% and 52.2%, respectively, each the highest in Table 5, which indicates the better performance of our method compared to the existing methods. Inclusion of DWT in the architecture helps to extract the fine visual details present in the video clips more efficiently compared to the other methods. The method extracts three different features from the video for multimodal video representation. Then attentions are captured from these three different modalities simultaneously and are combined to acquire all the attentive regions in the video that highlight the underlying video semantics. The textual attention, interleaved with the aforementioned attention, also helps to generate captions which are on par with the human generated ones.
TABLE 4: Performance comparison of our method with other state-of-the-art methods on MSVD dataset. All the values are reported as % and HIGH is good in all columns. (-) indicates that the metric is not reported.
Method               B@4    MT     CD
HRNE [30]            43.8   33.1   -
h-RNN [31]           49.9   32.6   65.8
LSTM-TSA [51]        52.8   33.5   74.0
M3 [52]              52.8   33.3   -
PickNet [53]         52.3   33.3   76.5
RecNet [54]          52.3   34.1   80.3
DS-RNN [55]          53.0   34.7   79.4
MS-RNN [56]          53.3   33.8   74.8
GRU-EVE [57]         47.9   35.0   78.1
GFN-POS [58]         53.9   34.9   91.0
TRGCN [59]           52.6   36.3   89.6
STAT [14]            52.0   33.3   73.8
SBAT [36]            53.1   35.3   89.5
Ours                 53.6   36.5   91.7
TABLE 5: Performance comparison of our method with other state-of-the-art methods on MSR-VTT dataset. All the values are reported as % and HIGH is good in all columns. (-) indicates that the metric is not reported.
Method               B@4    MT     CD
VideoLAB [60]        39.1   27.7   44.1
Aalto [61]           39.8   26.9   45.7
v2t-Navigator [62]   40.8   28.2   44.8
MTVC [63]            40.8   28.8   47.1
PickNet [53]         41.3   27.7   44.1
TVT [64]             40.1   27.9   47.7
DS-RNN [55]          42.3   29.4   46.1
MS-RNN [56]          39.8   26.1   40.9
GRU-EVE [57]         38.3   28.4   48.1
GFN-POS [58]         41.7   27.8   48.5
TRGCN [59]           44.6   29.5   51.4
MARN [65]            40.4   28.1   47.1
STAT [14]            39.3   27.1   43.9
SBAT [36]            42.9   28.9   51.6
ORG-TRL [66]         43.6   28.8   50.9
Ours                 44.9   29.8   52.2
2) Qualitative results
Fig. 4 illustrates the qualitative comparison of the captions generated by STAT [14] and the proposed method for sample videos from both the datasets. From the generated captions, it is evident that the proposed method understands the visual concepts in the video better and generates captions reflecting the underlying semantics such as ‘sits on a sofa’, ‘drinks from a bottle’, ‘ride a motorbike’, ‘down the road’, ‘blue and white paper’, ‘eyes’ and ‘brush’, thus conveying more details of the video, close to the human generated ground truth captions.
TABLE 6: Results of different configurations of the proposed method for visual feature representation for MSVD and MSR-VTT datasets. ∗ denotes the first phase experimental study using single attention network in VAP module whereas ∗∗ represents second phase study conducted with three separate attention networks in VAP. Here Glob, Temp and Context represents global, temporal and contextual features. All the values are reported as % and HIGH is good in all columns.
                         Cross-Entropy loss              Self-Critical loss
                         MSVD          MSR-VTT           MSVD          MSR-VTT
Configuration            B@4    CD     B@4    CD         B@4    CD     B@4    CD
Glob+Temp∗               48.8   86.3   39.4   47.4       51.7   87.9   40.5   49.3
Context+Temp∗            48.4   85.8   38.3   46.8       51.0   87.2   39.8   48.5
Glob+Context+Temp∗       49.5   88.1   39.6   49.1       52.4   89.6   42.9   50.4
Glob+Temp∗∗              51.2   88.2   40.8   49.4       53.1   90.8   43.7   51.5
Context+Temp∗∗           50.6   87.8   41.7   48.7       52.8   90.1   43.3   50.8
Glob+Context+Temp∗∗      52.9   89.8   42.6   51.1       53.6   91.7   44.9   52.2
Ours(without WCNN)       50.1   87.8   39.9   50.0       52.3   89.2   42.4   50.6
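The ∗∗ configurations in Table 6 use three separate attention networks in the VAP, whose attended outputs are multiplied together before being passed to the visual-to-text translator. The sketch below is a simplified illustration of that idea and not the authors' implementation. It assumes a single attention head with no learned projection matrices, omits the feedforward sublayer (so each stream only computes the inner Norm[F_att + V] term), and interprets "multiplied together" as an element-wise product; all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d)) V for lists of d-dimensional vectors."""
    d = len(keys[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return attended


def layer_norm(vector, eps=1e-6):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(vector) / len(vector)
    var = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(var + eps) for x in vector]


def attend_stream(features):
    """Self-attention with Q = K = V = features, then residual connection and Norm."""
    attended = scaled_dot_product_attention(features, features, features)
    return [
        layer_norm([a + f for a, f in zip(att_row, feat_row)])
        for att_row, feat_row in zip(attended, features)
    ]


def fuse_streams(global_feats, context_feats, temporal_feats):
    """Attend each stream separately, then combine by element-wise product (assumption)."""
    streams = [attend_stream(s) for s in (global_feats, context_feats, temporal_feats)]
    return [
        [g * c * t for g, c, t in zip(g_row, c_row, t_row)]
        for g_row, c_row, t_row in zip(*streams)
    ]


# Toy inputs: 3 time steps with 4-dimensional features per stream.
global_feats = [[0.1 * (i + j) for j in range(4)] for i in range(3)]
context_feats = [[0.2 * (i - j) for j in range(4)] for i in range(3)]
temporal_feats = [[0.05 * (i * j + 1) for j in range(4)] for i in range(3)]
fused = fuse_streams(global_feats, context_feats, temporal_feats)
print(len(fused), len(fused[0]))  # 3 4
```

A single-network variant (the ∗ rows) would instead run one attention network over the combined features; a full model would also add per-head projections, the feedforward sublayer and stacked stages, which are left out here for brevity.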
FIGURE 5: Evaluation of the performance on proposed method for different number of attention blocks in the VAP-VTT for MSVD and MSR-VTT datasets. Panels: (a) B@4 - MSVD dataset, (b) MT - MSVD dataset, (c) CD - MSVD dataset, (d) B@4 - MSR-VTT dataset, (e) MT - MSR-VTT dataset, (f) CD - MSR-VTT dataset.
FIGURE 6: Illustration of captions generated by the proposed method in comparison with the baseline method.
4 with B@4 and CD values of about 53.6% and 91.7% for the MSVD dataset and 44.9% and 52.2% for the MSR-VTT dataset, respectively.
The quality of the captions generated by the method is illustrated with a few sample video clips from both datasets, as shown in Fig. 6. Here the baseline method is the network without WCNN, as mentioned above.
FIGURE 7: Samples of negative results.

H. LIMITATIONS OF THE PROPOSED WORK
Even though the proposed method performs better on the reported evaluation metrics, it still has some limitations. The method fails to generate correct contextual descriptions for a few video clips because it misinterprets the visual content. In the sample frames of the first video clip in Fig. 7, objects with reflections are visible. In this clip, a baby in a red dress is looking at himself in the mirror and kissing it. Our method describes this as "two babies in red dress are playing", which is incorrect. The method also fails to extract the correct visual interpretation or semantics from videos containing complex, high-motion events, such as the second sample video in Fig. 7, where a motorcyclist loses control of the bike, crashes, and finally falls into the water. The method wrongly describes this activity as the motorcyclist "flying a bike". This is because the attention-based multimodal features may interact with each other, which degrades the performance of the method through false interpretations of the underlying semantics in the video. Another limitation of the proposed model is the increase in time cost due to the addition of the discrete wavelet pre-processing stage and the VAP network, which computes the global, local and temporal attention features separately.
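
To make the extra work of the wavelet pre-processing stage concrete, the following is a minimal sketch of a two-level 2D discrete wavelet decomposition of a single-channel frame. It assumes the Haar wavelet purely for illustration; the wavelet family, the function names `haar_dwt2` and `haar_dwt2_two_level`, and the list-of-lists frame layout are assumptions and do not describe the actual implementation of the proposed method.

```python
def haar_dwt2(frame):
    """One level of an orthonormal 2D Haar transform (illustrative only).

    `frame` is a list of rows with even height and width. Returns four
    half-size sub-bands: (approx, detail_cols, detail_rows, detail_diag).
    """
    approx, detail_cols, detail_rows, detail_diag = [], [], [], []
    for r in range(0, len(frame), 2):
        a_row, c_row, r_row, d_row = [], [], [], []
        for c in range(0, len(frame[0]), 2):
            # The four pixels of the current 2x2 block.
            p00, p01 = frame[r][c], frame[r][c + 1]
            p10, p11 = frame[r + 1][c], frame[r + 1][c + 1]
            a_row.append((p00 + p01 + p10 + p11) / 2.0)  # low-pass average
            c_row.append((p00 - p01 + p10 - p11) / 2.0)  # change across columns
            r_row.append((p00 + p01 - p10 - p11) / 2.0)  # change across rows
            d_row.append((p00 - p01 - p10 + p11) / 2.0)  # diagonal change
        approx.append(a_row)
        detail_cols.append(c_row)
        detail_rows.append(r_row)
        detail_diag.append(d_row)
    return approx, detail_cols, detail_rows, detail_diag


def haar_dwt2_two_level(frame):
    """Decompose `frame`, then decompose its approximation band again."""
    level1 = haar_dwt2(frame)
    level2 = haar_dwt2(level1[0])
    return level1, level2


# Usage: an 8x8 synthetic frame yields 4x4 level-1 and 2x2 level-2 sub-bands.
frame = [[float(r + c) for c in range(8)] for r in range(8)]
level1, level2 = haar_dwt2_two_level(frame)
print(len(level1[0]), len(level1[0][0]), len(level2[0]), len(level2[0][0]))
```

In this sketch each level visits every input pixel once, so the second level adds roughly a quarter of the first level's work, and the whole decomposition is repeated for every frame and channel before any feature extraction takes place.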

V. CONCLUSION
In this work, a deep neural network architecture is introduced for video caption generation that exploits multimodal feature attention in the video. In this method, the inclusion of two-level discrete wavelet decomposition in the 2D visual feature representation helps to extract additional information contained in the spatial, temporal and spectral domains of the video. The adoption of three separate attention networks in the visual attention predictor is responsible for extracting more attentive features, leading to more semantically meaningful captions in the visual-to-text translator. The performance of the method is evaluated on two benchmark datasets and compared with existing state-of-the-art methods in video captioning. The results obtained highlight the effectiveness of the method in generating meaningful video captions. The method can be further improved by also exploiting audio features to generate more meaningful captions. The
[22] L. Zhu, Z. Xu, and Y. Yang, “Bidirectional multirate reconstruction for [47] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors
temporal modeling in videos,” 2017 IEEE Conf. on Computer Vision and for word representation,” in Empirical Methods in Natural Language
Pattern Recognition (CVPR), pp. 1339–1348, 2017. Processing (EMNLP), 2014, pp. 1532–1543.
[23] Z. Yang, Y. Han, and Z. Wang, “Catching the temporal regions-of-interest [48] I. Daubechies, “Ten lectures on wavelets,” Society for Industrial and
for video captioning,” Proc. of the 25th ACM Int. Conf. on Multimedia, Applied Mathematics, USA, 1992.
2017. [49] A. Karoui and R. Vaillancourt, “Families of biorthogonal wavelets,”
[24] S. Chen, Q. Jin, J. Chen, and A. G. Hauptmann, “Generating video Computers Mathematics with Applications, vol. 28, no. 4, pp. 25–39,
descriptions with latent topic guidance,” IEEE Trans. on Multimedia, vol. 1994.
21, no. 9, pp. 2407–2418, 2019. [50] G. Beylkin, R. R. Coifman, and V. Rokhlin, “Fast wavelet transforms
[25] L. Baraldi, C. Grana, and R. Cucchiara, “Hierarchical boundary-aware and numerical algorithms I,” Communications on Pure and Applied
neural encoder for video captioning,” in CVPR, 2017, pp. 3185–3194. Mathematics, vol. 44, no. 2, pp. 141–183, 1991.
[26] M. Nabati and A. Behrad, “Video captioning using boosted and [51] Y. Pan, T. Yao, H. Li, and T. Mei, “Video captioning with transferred
parallel long short-term memory networks,” Computer Vision and Image semantic attributes,” in 2017 IEEE Conf. on Computer Vision and Pattern
Understanding, vol. 190, pp. 102840, 2020. Recognition (CVPR), 2017, pp. 984–992.
[52] J. Wang, W. Wang, Y. Huang, L. Wang, and T. Tan, “M3: Multimodal
[27] Z. Zhang, Y. Shi, C. Yuan, B. Li, P. Wang, W. Hu, and Z. Zha, “Object
memory modelling for video captioning,” in CVPR, 2018, pp. 7512–7520.
relational graph with teacher-recommended learning for video captioning,”
[53] Y. Chen, S. Wang, and Q. Zhang, W.and Huang, “Less is more: Picking
in CVPR, 2020, pp. 13275–13285.
informative frames for video captioning,” in ECCV, 2018.
[28] J. Hou, X. Wu, W. Zhao, J. Luo, and Y. Jia, “Joint syntax representation [54] B. Wang, L. Ma, W. Zhang, and W. Liu, “Reconstruction network for
learning and visual cue translation for video captioning,” in 2019 video captioning,” in CVPR, 2018, pp. 7622–7631.
IEEE/CVF Int. Conf. on Computer Vision (ICCV), 2019, pp. 8917–8926. [55] N. Xu, A. Liu, Y. Wong, Y. Zhang, W. Nie, Y. Su, and M. Kankanhalli,
[29] J. Song, Y. Guo, L. Gao, X. Li, A. Hanjalic, and H. T. Shen, “Dual-stream recurrent neural network for video captioning,” IEEE Trans.
“From deterministic to generative: Multimodal stochastic rnns for video on Circuits and Systems for Video Technology, vol. 29, no. 8, pp. 2482–
captioning,” IEEE Trans. on Neural Networks and Learning Systems, vol. 2493, 2019.
30, no. 10, pp. 3047–3058, 2019. [56] J. Song, Y. Guo, L. Gao, X. Li, A. Hanjalic, and H. T. Shen,
[30] P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang, “Hierarchical recurrent “From deterministic to generative: Multimodal stochastic rnns for video
neural encoder for video representation with application to captioning,” in captioning,” IEEE Trans. on Neural Networks and Learning Systems, vol.
2016 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 30, no. 10, pp. 3047–3058, 2019.
jun 2016, pp. 1029–1038. [57] N. Aafaq, N. Akhtar, W. Liu, S. Z. Gilani, and A. Mian, “Spatio-
[31] L. Baraldi, C. Grana, and R. Cucchiara, “Hierarchical boundary-aware temporal dynamics and semantic attribute enriched visual encoding for
neural encoder for video captioning,” in 2017 IEEE Conf. on Computer video captioning,” in CVPR, 2019, pp. 12479–12488.
Vision and Pattern Recognition (CVPR), 2017, pp. 3185–3194. [58] B. Wang, L. Ma, W. Zhang, W. Jiang, J. Wang, and W. Liu, “Controllable
[32] C. Wu, Y. Wei, X. Chu, S. Weichen, F. Su, and L. Wang, video captioning with pos sequence guidance based on gated fusion
“Hierarchical attention-based multimodal fusion for video captioning,” network,” in 2019 IEEE/CVF Int. Conf. on Computer Vision (ICCV),
Neurocomputing, vol. 315, pp. 362–370, 2018. 2019, pp. 2641–2650.
[33] C. Hori, T. Hori, T. Lee, Z. Zhang, B. Harsham, J. R. Hershey, T. K. Marks, [59] X. Xiao, Y. Zhang, R. Feng, T. Zhang, S. Gao, and W. Fan, “Video
and K. Sumi, “Attention-based multimodal fusion for video description,” captioning with temporal and region graph convolution network,” in
in 2017 IEEE Int. Conf. on Computer Vision (ICCV), 2017, pp. 4203– ICME, 2020, pp. 1–6.
4212. [60] V. Ramanishka, A. Das, D. H. Park, S. Venugopalan, L. A. Hendricks,
[34] B. Zhao, X. Li, and X. Lu, “CAM-RNN: Co-attention model based rnn for M. Rohrbach, and K. Saenko, “Multimodal video description,” in Proc. of
video captioning,” IEEE Trans. on Image Processing, vol. 28, no. 11, pp. the 24th ACM Int. Conf. on Multimedia, 2016, p. 1092–1096.
5552–5565, 2019. [61] R. Shetty and J. Laaksonen, “Frame- and segment-level features and
[35] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong, “End-to-end dense candidate pool evaluation for video caption generation,” in Proc. of the
video captioning with masked transformer,” in 2018 IEEE/CVF Conf. on 24th ACM Int. Conf. on Multimedia, 2016, MM ’16, p. 1073–1076.
Computer Vision and Pattern Recognition, 2018, pp. 8739–8748. [62] Q. Jin, J. Chen, S. Chen, Y. Xiong, and A. Hauptmann, “Describing
[36] T. Jin, S. Huang, M. Chen, Y. Li, and Z. Zhang, “SBAT: Video captioning videos using multi-modal fusion,” in Proc. of the 24th ACM Int. Conf.
with sparse boundary-aware transformer,” in IJCAI, 2020. on Multimedia, 2016, p. 1087–1091.
[63] R. Pasunuru and M. Bansal, “Multi-task video captioning with video and
[37] J. Lei, L. Wang, Y. Shen, D. Yu, T. L. Berg, and M. Bansal, “MART:
entailment generation,” in 55th Annual Meeting of the Association for
Memory-augmented recurrent transformer for coherent video paragraph
Computational Linguistics, 2017, vol. 1, pp. 1273–1283.
captioning,” 2020.
[64] M. Chen, Y. Li, Z. Zhang, and S. Huang, “TVT: Two-view transformer
[38] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
network for video captioning,” in Proc. of The 10th Asian Conf. on
recognition,” in CVPR, 2016, pp. 770–778.
Machine Learning, 14–16 Nov 2018, vol. 95, pp. 847–862.
[39] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real- [65] W. Pei, J. Zhang, X. Wang, L. Ke, X. Shen, and Y. Tai, “Memory-attended
time object detection with region proposal networks,” in Proc. of the 28th recurrent network for video captioning,” in cVPR, 2019, pp. 8339–8348.
Int. Conf. on Neural Information Processing Systems - Volume 1. 2015, [66] Z. Zhang, Y. Shi, C. Yuan, B. Li, P. Wang, W. Hu, and Z. Zha, “Object
NIPS’15, p. 91–99, MIT Press. relational graph with teacher-recommended learning for video captioning,”
[40] D. Chen and W. Dolan, “Collecting highly parallel data for paraphrase in 2020 IEEE/CVF Conf. on Computer Vision and Pattern Recognition
evaluation,” in Proc. of ACL, 2011, pp. 190–200. (CVPR), 2020, pp. 13275–13285.
[41] J. Xu, T. Mei, T. Yao, and Y. Rui, “MSR-VTT: A large video description
dataset for bridging video and language,” in CVPR, 2016, pp. 5288–5296.
[42] K. Papineni, S. Roukos, T. Ward, and W. Zhu, “BLEU: A method for
automatic evaluation of machine translation,” in Proc. of the 40th Annual
Meeting on Association for Computational Linguistics, 2002, p. 311–318.
[43] A. Lavie and A. Agarwal, “Meteor: An automatic metric for mt evaluation
with high levels of correlation with human judgments,” in Proc. of the
Second Workshop on Statistical Machine Translation, 2007, p. 228–231.
[44] R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based
image description evaluation.,” in CVPR, 2015, pp. 4566–4575.
[45] J. Deng, W. Dong, R. Socher, L. Li, Kai L., and Li F., “ImageNet: A
large-scale hierarchical image database,” in CVPR, 2009, pp. 248–255.
[46] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-
Fei, “Large-scale video classification with convolutional neural networks,”
in CVPR, 2014.