A Multimodal Framework For Video Caption Generation

The article presents a novel multimodal framework for video caption generation that utilizes a discrete wavelet convolutional neural network (DWT-CNN) along with multimodal feature attention to enhance the understanding of video semantics. The proposed method effectively captures global, contextual, and temporal features from video frames, achieving improved performance on benchmark datasets MSVD and MSR-VTT with CIDEr scores of 91.7 and 52.2, respectively. The framework aims to refine video captioning by integrating various attention mechanisms and deep learning architectures to produce more accurate natural language descriptions.

This article has been accepted for publication in IEEE Access.

This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3202526


A Multimodal Framework For Video Caption Generation

RESHMI S. BHOOSHAN (1) and SURESH K. (2), (Senior Member, IEEE)
(1) College of Engineering, Thiruvananthapuram, APJ Abdul Kalam Technological University, Kerala, India (e-mail: [email protected])
(2) Government Engineering College, Wayanad, APJ Abdul Kalam Technological University, Kerala, India (e-mail: [email protected])
Corresponding author: Reshmi S. Bhooshan (e-mail: reshmibhooshan@cet.ac.in).

ABSTRACT Video captioning is a highly challenging computer vision task that automatically describes video clips using natural language sentences, which requires a clear understanding of the embedded semantics. In this work, a video caption generation framework consisting of a discrete wavelet convolutional neural architecture along with multimodal feature attention is proposed. Global, contextual and temporal features in the video frames are taken into account, and separate attention networks are integrated in the visual attention predictor network to capture multiple attentions from these features. These attended features, together with textual attention, are employed in the visual-to-text translator for caption generation. The experiments are conducted on two benchmark video captioning datasets, MSVD and MSR-VTT. The results show improved performance, with CIDEr scores of 91.7 and 52.2 on MSVD and MSR-VTT, respectively.

INDEX TERMS Video Captioning, Discrete Wavelet Convolutional Model, Multimodal Feature
Extraction, Visual Attention Predictor.

I. INTRODUCTION
Video caption generation aims to automatically generate meaningful natural language descriptions of a video. This requires a clear understanding of the semantic details as well as the contextual visual relationships between the different objects present in the video. Many algorithms have been developed by researchers in this area of computer vision for generating descriptions closer to the level of human perception. Video caption generation is important in a variety of real-world applications such as content-based video retrieval, video comprehension generation, automatic assistance devices for the visually impaired, subtitle creation in videos, intelligent driving assistance systems, video surveillance and so on [1] - [7].
Most of the deep learning frameworks employed for caption generation use an encoder-decoder structure. The encoder utilizes a Convolutional Neural Network (CNN) or Recurrent Neural Network (RNN) for extracting the visual and semantic details in the video. It generates a feature vector representation corresponding to the visual content in the video, which is then given to a decoder structure having sequential models such as RNN, Long Short Term Memory (LSTM) or Gated Recurrent Unit (GRU) that performs the visual to natural language translation [8] - [10]. In an early attempt made by Venugopal et al., a novel video to text generation methodology is presented, which extends image captioning methods by incorporating a 2D-CNN network along with mean pooling and an RNN decoder structure [11]. But this method fails to use the temporal details present in the video for caption generation. Subsequently, an S2VT model is proposed in [12] that uses a stacked LSTM network to learn the temporal information in a sequence of frames and then produce a sequence of words. Later, attention mechanisms are included in the spatial as well as temporal domain to achieve better performance [13], [14]. Video descriptions can also be generated by employing attention in the decoder section as well as using multimodal fusion mechanisms of visual, text and audio features [15], [16].
The caption generation of video clips is more challenging compared to image captioning because of the involvement of shots, scenes, activities, random motion of objects of different categories or classes, attributes and varying illumination conditions. The aforesaid attention mechanisms have effectively produced video captions. But the generated descriptions still need refinement to completely describe the video clip. This can be accomplished by extracting the diverse perceptive nature of the objects present in the video and highlighting both the semantic and contextual


This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

relationship between them in the entire video sequence. While generating the descriptions, the contextual information also needs to be given adequate importance along with the visual and semantic information. To achieve this goal, we propose a deep neural network architecture utilizing a Discrete Wavelet Transform (DWT) based CNN for extracting finer visual details from the video frames, which enables better video caption generation. The new architecture is able to exploit spectral information in the video frames along with the spatial, semantic and temporal details for caption generation. The proposed model includes a Global Feature Extractor (GFE) incorporating a DWT based Convolutional Neural Network (2D-WCNN) for the extraction of global features from the video frames, a Contextual Object Relationship Extractor (CORE) for finding the contextual relationship between different objects in the frames and a Temporal Feature Extractor (TFE) that consists of a 3DCNN model for obtaining the dynamic features from the video. A visual attention predictor network is also incorporated that extracts the attention from these features, and finally visual-to-text translation is done using a caption decoder network as in the transformer [17]. With its self attention mechanism, this largely resolves the long-range dependency issues of sequential models and allows parallel computing. Due to the presence of self attention layers, the decoder network can enhance the quality of visual-to-text translation by considering the word-to-word, object-to-object as well as object-to-word interactions in the input sequences. This network utilizes global dependencies between input and output to provide improved performance compared to RNN and LSTM.
The main contributions of the proposed framework are:
1) A 2D-WCNN structure using two-level DWT decomposition and CNN layers is employed for extracting the global features in the video frames. The utilization of DWT helps to include the fine grained spatial, spectral and semantic details in the frames.
2) A Contextual Object Relationship Extractor (CORE) which makes use of the feature maps obtained from the 2D-WCNN for predicting region proposals and computes the contextual relationship between the different frames in the video.
3) A multimodal visual feature attention network that concurrently computes global, contextual and temporal feature attention, capable of increasing the efficiency and prediction accuracy of the entire methodology.
The effectiveness of the proposed method is evaluated using two benchmark datasets, MSVD and MSR-VTT, and the results are compared with the existing state-of-the-art methods using the evaluation metrics BLEU, METEOR and CIDEr.
The paper is organized as follows: Section II gives a brief review of the existing research works in this area. The details regarding the proposed architecture and the experimental results are described in Sections III and IV, respectively, and Section V concludes the paper.

II. LITERATURE SURVEY
In early approaches, caption generation of videos is done using classical template-based techniques that employ SVO-triplets - Subject (S), Verb (V), and Object (O) [18]. These triplets are found individually and then combined to form a sentence. Many encoder-decoder architectures have been proposed that use 2DCNN/3DCNN structures as the encoder for generating feature representations and sequential models like RNN, LSTM and GRU as the decoder for language translation [19], [20]. A two-step captioning approach that learns the correspondence between semantic representation labels and verbalization before translating it to natural language is introduced in [21]. An unsupervised Multirate Visual Recurrent Model (MVRM) is presented by [22] that is capable of handling motion speed variations in video frames with a bidirectional reconstruction technique. A Dual Memory Recurrent Model (DMRM) utilizing global and temporal details along with semantic supervision for accurate detection of region-of-interest is proposed in [23]. Captions are also generated with latent topic guidance [24], a Time Boundary-aware LSTM cell [25], Boosted and Parallel Long Short-Term Memory Networks (BP-LSTM) [26] and an Object Relational Graph with Teacher-Recommended Learning (ORG-TRL) system [27]. Another technique utilizes two steps - video Part-of-Speech (POS) tagging and visual cue translation [28]. This can be accomplished using a mixture model for converting visual features to lexical words and sentence templates comprising POS tags.
A few algorithms are developed by considering both the spatial as well as temporal features simultaneously along with attention mechanisms. Multimodal stochastic recurrent neural networks (MS-RNN) that make use of latent stochastic variables are presented by [29] for video captioning. Hierarchical encoder structures are also proposed by [30] and [31] that give more attention to the temporal details of the video. Descriptions are made by employing attention in the decoder section [15], [16] as well as multimodal fusion mechanisms with aural features in the video [32]. A multimodal temporal attention mechanism incorporating image, motion, and audio features is given in [33]. This architecture is developed by assuming that different modalities carry different task-relevant information at different time instances. In another work, captions are produced using a co-attention model based recurrent neural network (CAM-RNN) consisting of a visual attention module, a text attention module and a balancing gate [34]. This algorithm is capable of adaptively detecting the most relevant regions in the image and thus concentrating on the relevant words or phrases in the generated sentence.
Recently, transformer based decoder models have been proposed for the generation of video descriptions. Masked transformers can be used for generating end-to-end video captions, using a masking network to produce a

FIGURE 1: Architecture of the proposed model. GFE, CORE and TFE generate multimodal features for text generation.

differentiable mask from event proposals, thereby maintaining consistency between the proposal and captioning during training [35]. A sparse boundary-aware transformer (SBAT) aligned with a cross-modal encoding scheme can be used to enhance the multimodal interaction, thereby providing better captions, as mentioned by [36]. Coherent paragraph generation can be made possible using the Memory-Augmented Recurrent Transformer (MART), which has a memory module to acquire a highly summarized memory state from the video segments and the sentence history [37].
In all these methods, textual descriptions are created by considering only the spatial and temporal features in the video. Enhanced video captions can be achieved by extracting spectral information using DWT and contextual information through visual relationship detection between objects. In order to obtain a good degree of video understanding, a 2D multiresolution discrete wavelet convolutional model is used in this work for global and contextual visual feature extraction. These features are in turn fed to the visual attention predictor (VAP), along with the temporal features, for identifying the attentive regions in the video. These multimodal features are combined in the visual-to-text translator (VTT) for producing semantically improved natural language captions. The following section describes the proposed model that integrates the aforementioned techniques.

III. PROPOSED METHOD
Fig. 1 illustrates the overall framework of the proposed methodology. In this, video captions are generated using a sequence of visual feature representations obtained from the visual feature extractor (VFE) and the VAP. The caption generation is accomplished using the VTT, consisting of a multimodal attention network along with a softmax layer, that describes the video with a sequence of encoded words, $S_v = [W_1, W_2, \dots, W_l]$, with $W_i \in \mathbb{R}^{N_v}$, where $l$ is the length of the generated caption and $N_v$ is the vocabulary size. The detailed description of the different elements of the proposed model is given in the following subsections.

FIGURE 2: Structure of 2D-WCNN model.

A. VISUAL FEATURE EXTRACTOR
The VFE comprises three submodules: GFE, CORE and TFE for the generation of multimodal features.
(1) GFE: It employs a 2D-WCNN network having a modified ResNet-50 structure [38] with two-level DWT decomposition to provide a better time-frequency representation of the frames. In DWT decomposition, each frame in the video is decomposed into four sub-bands, which highlight frequency details in the image. Hence the utilization of a DWT pre-processing stage together with a convolutional neural network helps to extract some of the distinctive spectral features that are more predominant in the sub-band levels of the frames, in addition to the spatial, semantic as well as channel details. The detailed structure of the 2D-WCNN network is shown in Fig. 2. Each of the input frames, resized to 224×224, is subjected to two-level multi-resolution decomposition using DWT and is further fed to the CNN structure comprising five levels, $CONV_1$ to $CONV_5$.
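As a rough illustration of the two-level decomposition used in the GFE, the sketch below applies a 2D Haar transform (the db1 wavelet) to one colour channel of a toy 8×8 frame in plain Python. The helper names `haar_dwt2` and `two_level_dwt` are hypothetical, and the Haar filters, the averaging normalisation and the sub-band naming are assumptions chosen for brevity; a real pipeline would more likely call a wavelet library with the wavelet family chosen for the model.

```python
def haar_dwt2(channel):
    """One level of a 2D Haar DWT on a 2D list with even height and width.

    Returns (ll, lh, hl, hh), each half the input size. Averaging
    normalisation is an assumption made for readability.
    """
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, len(channel), 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for c in range(0, len(channel[0]), 2):
            a, b = channel[r][c], channel[r][c + 1]          # top pixel pair
            d, e = channel[r + 1][c], channel[r + 1][c + 1]  # bottom pixel pair
            ll_row.append((a + b + d + e) / 4.0)  # local average (approximation)
            lh_row.append((a + b - d - e) / 4.0)  # top row minus bottom row
            hl_row.append((a - b + d - e) / 4.0)  # left column minus right column
            hh_row.append((a - b - d + e) / 4.0)  # diagonal difference
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh


def two_level_dwt(channel):
    """Level 1 on the input, then level 2 on the level-1 approximation band."""
    level1 = haar_dwt2(channel)
    level2 = haar_dwt2(level1[0])  # level1[0] is the ll1 sub-band
    return level1, level2


# Toy 8x8 channel standing in for one colour plane of a resized frame.
frame_channel = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
(ll1, lh1, hl1, hh1), (ll2, lh2, hl2, hh2) = two_level_dwt(frame_channel)
print(len(ll1), len(ll1[0]))  # level-1 sub-band size
print(len(ll2), len(ll2[0]))  # level-2 sub-band size
```

Each level halves the spatial size, so a 224×224 frame gives 112×112 sub-bands at level 1 and 56×56 sub-bands at level 2, which is consistent with the spatial sizes listed for L1 and L2 in Table 1.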

TABLE 1: Details of various convolutional layers in 2D-WCNN model. The input to the network is of size 224×224×3. The residual blocks are mentioned in square brackets.

| Layer name | Kernel size / no. of filters, stride | Output size |
|---|---|---|
| L1 | 3×3 / 64, stride 1 | 112×112×64 |
| L2 | 3×3 / 128, stride 1 | 56×56×128 |
| CONV1 | 3×3 / 64, stride 1 | 224×224×64 |
| | 3×3 / 64, stride 1 | 224×224×64 |
| | 3×3 / 64, stride 2 | 112×112×64 |
| CONV2 | 3×3 / 128, stride 1 | 112×112×128 |
| | 3×3 / 128, stride 1 | 112×112×128 |
| | 3×3 / 128, stride 2 | 56×56×128 |
| | [1×1 / 128, stride 1; 3×3 / 128, stride 1; 1×1 / 256, stride 1] ×2 | 56×56×256 |
| CONV3 | [1×1 / 128, stride 2; 3×3 / 128, stride 1; 1×1 / 512, stride 1] ×4 | 28×28×512 |
| CONV4 | [1×1 / 256, stride 2; 3×3 / 256, stride 1; 1×1 / 1024, stride 1] ×6 | 14×14×1024 |
| CONV5 | [1×1 / 512, stride 2; 3×3 / 512, stride 1; 1×1 / 2048, stride 1] ×3 | 7×7×2048 |

In level-1 wavelet decomposition, the low pass and high pass filtering of the input frame produces an approximation sub-band ($ll_1$) and three detail sub-bands ($lh_1$, $hl_1$ and $hh_1$). The $ll_1$ sub-band is further filtered into four sub-bands, $ll_2$, $lh_2$, $hl_2$ and $hh_2$, in level-2 decomposition. The sub-bands of the R, G and B components obtained from each level are then stacked together, and the level-1 and level-2 stacks are concatenated with the maxpooled outputs of the $CONV_1$ and $CONV_2$ layers, respectively, as shown in Fig. 2. All four sub-bands need to be included for feature extraction in the convolutional layers because each sub-band carries distinguishable features, which are essential for a good visual representation of the frame. The configuration of levels $CONV_3$ to $CONV_5$ is the same as that of the ResNet-50 network, with three-layer bottleneck blocks along with residual connections. The first convolutional layers of the levels $CONV_3$ to $CONV_5$ have stride 2. The details regarding the filters used in the various convolutional layers along with their output sizes are summarized in Table 1. Batch normalization together with the ReLU activation function is used in all the layers. Padding is also employed in every layer. The extracted global feature maps from the $CONV_5$ layer are given to the CORE for finding the relationship between the various objects in the frames. These feature maps are also given to the VAP for computing global attention.
(2) CORE: Better captions can be generated only by considering the contextual relationship between the objects in the video. The CORE employs a modified configuration of the Faster RCNN structure [39]. It utilizes the feature maps produced by the $CONV_5$ layer of the 2D-WCNN network in the GFE to predict the region proposals for detecting the object relationships, as shown in Fig. 3. The object regions are identified using the 2D-WCNN network and a Region Proposal Network (RPN), in a similar fashion to Faster RCNN, along with classifier and regression layers for creating bounding boxes. The identified objects are then paired and different sub-images are created, each containing one identified object pair. For uniformity, these sub-images are resized to 32×32 and are given to CNN layers having two sets of 64 filters, each with a 3×3 receptive field, as shown in Fig. 3. The obtained feature maps highlight the spatial relationship between the object pairs. The spatial relation feature maps of all object pairs are then stacked together and given to a 1×1×64 convolution layer to form the contextual spatial relation feature map. These features of the CORE are passed through a fully connected layer to produce a 2048-dimensional feature vector, $V_{ci}$.
(3) TFE: The temporal features are extracted using 3DCNN (C3D network).
The three feature vectors $V_{gi}$, $V_{ci}$ and $V_{ti}$, representing global features, local features and motion features in the video, respectively, are then provided to the VAP module.

B. VISUAL ATTENTION PREDICTOR
Content rich video caption generation necessitates a clear understanding of the semantics in the video. To accomplish this, a VAP network is incorporated in the model that utilizes Scaled Dot-product Attention to compute the global, local and temporal attention features. It consists of a multi-head attention mechanism having $H$ parallel attention layers or heads, each computing Scaled Dot-product Attention on an input having a set of queries ($Q$), keys ($K$) and values ($V$), each in $\mathbb{R}^{d_i}$. In the case of the global attention network, $Q$, K and $V$ are all set equal to $V_{gi}$, as shown in Fig. 1. Thus the output of the global attention network is given by the expression,

$$F_{att}^{g} = (h_1 \oplus h_2 \oplus \dots \oplus h_H)\, W^{o} \tag{1}$$

$$h_i = G_{att}(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}) \tag{2}$$

$$G_{att} = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_i}}\right) V \tag{3}$$

where $\oplus$ denotes concatenation and $G_{att}$ represents global Scaled Dot-product Attention with independent head projection matrices $W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$, for $i = 1, 2, \dots, H$, in $\mathbb{R}^{\frac{d_i}{H} \times d_i}$. $W^{o} \in \mathbb{R}^{d_i \times d_i}$ is the output projection matrix that combines the outputs from the various heads, each having a dimension of $\frac{d_i}{H}$. Similarly, the outputs of the CORE and temporal attention networks, $F_{att}^{c}$ and $F_{att}^{t}$, are computed as in equations (1) through (3) with the $Q$, $K$ and $V$ inputs set to $V_{ci}$ and $V_{ti}$, respectively.
In the VAP module, $N$ identical attention sub-networks are stacked together, separately for each of the global, contextual and temporal attention features. The attended feature output from the $(i-1)$th stage, $F_{att}^{i-1}$, is used to produce the attended features of the next stage, $F_{att}^{i}$, in a recursive manner. The features, $F_{att}^{g}$, are given to

FIGURE 3: Structure of Contextual Object Relationship Extractor.

the normalization ($Norm$) and feedforward ($FF$) networks having two fully-connected layers, ReLU activation function and dropout layers. This introduces non-linearity in the network. In this work, the dropout ratio is set to 0.1. Residual connections and layer normalization are included in all the network layers. Thus $V_{gA}$ is computed as,

$$V_{gA} = Norm\big[\,FF(Norm[F_{att}^{g} + V_{gi}]) + (Norm[F_{att}^{g} + V_{gi}])\,\big] \tag{4}$$

Similarly, the contextual and temporal attention features are obtained as,

$$V_{cA} = Norm\big[\,FF(Norm[F_{att}^{c} + V_{ci}]) + (Norm[F_{att}^{c} + V_{ci}])\,\big] \tag{5}$$

$$V_{tA} = Norm\big[\,FF(Norm[F_{att}^{t} + V_{ti}]) + (Norm[F_{att}^{t} + V_{ti}])\,\big] \tag{6}$$

The attended output features so obtained from the global, local and temporal attention networks are multiplied together and given to the visual-to-text translator for further processing.

C. VISUAL-TO-TEXT TRANSLATOR
The attended visual representations from the three attention networks of the VAP, together with the attended ground truth caption word embeddings, $X_{cap}$, are fed to the multi-head attention networks of the VTT section of the architecture, as shown in Fig. 1. The VTT consists of one masked attention network computing the self-attention within the word embeddings. It uses a mask matrix for improving the self attention learning process in the caption word embedding during training, so that each word attends only to the words in the previous positions of the output sequence. This self attention layer of word embeddings is followed by multi-head attention, $MH_{att}^{D}$, which computes the guided attention on the word embeddings in accordance with the attended visual representations. The $MH_{att}^{D}$ consists of four VTT sublayers stacked together to produce the attended visual-to-text features, $VT_{att}$, as given below,

$$VT_{att} = Norm\big[\,FF(Norm(F_{Datt} + X_{cap})) + Norm(F_{Datt} + X_{cap})\,\big] \tag{7}$$

The $VT_{att}$ features are fed to a linear network and finally, the prediction of words is performed by a softmax layer. During the training phase, the cross-entropy loss $L_{CE}$ from all time-steps is used, which is expressed as,

$$L_{CE}(\theta) = -\sum_{t=1}^{T} \log\big(p_{\theta}(W_t \mid W_{1:t-1}, V_{gi}, V_{ci}, V_{ti})\big) \tag{8}$$

where $W_{1:t-1}$ represents the ground truth sequence at time-step $t$ and $\theta$ denotes the parameters.
The model so designed is subjected to exhaustive evaluation to reveal its effectiveness in video captioning, as discussed below.

IV. EXPERIMENTS AND RESULTS
Both qualitative and quantitative analysis of the proposed framework has been carried out with different datasets and performance evaluation metrics. Results of this analysis and a comparative study with the state-of-the-art video captioning techniques are presented in this section.

A. DATASETS USED
Experiments are conducted on two benchmark datasets for video captioning: Microsoft Research Video Description Corpus dataset (MSVD) [40] and MSR-Video to Text dataset (MSR-VTT) [41].
1) MSVD dataset: It consists of 1,970 YouTube video clips having an average of 40 manually annotated captions per clip. For fair comparison, we have used the split proposed in [12], which consists of 1,200 videos for training, 100 videos for validation and 670 videos for testing.
2) MSR-VTT dataset: It is the largest video captioning dataset having 10K video clips, each annotated with 20 sentences. The standard split mentioned in [41] is adopted for this dataset: 6,513 videos for training, 497 for validation and 2,990 videos for testing.

B. PERFORMANCE EVALUATION METRICS
The performance evaluation of the methodology is done using the evaluation metrics BiLingual Evaluation Understudy (BLEU@4) [42], Metric for Evaluation of Translation with Explicit ORdering (METEOR) [43] and Consensus-based Image Description Evaluation (CIDEr)

[44]. These will be denoted as B@4, MT and CD, respectively, in this work. The B@4 metric is a commonly used metric for evaluating machine translation that measures the 4-gram based accuracy. The MT metric measures the harmonic mean of unigram precision and recall between the candidate and the reference sentences. It actually computes the word correlation between the two sentences. The CD metric evaluates the consensus in the generated sentence as assessed by humans. Hence these three metrics can effectively calculate the consistency between occurrences of words in the generated caption and the ground-truth descriptions.

TABLE 2: Performance results of the proposed method for different mother wavelets. Here BM denotes the baseline method.

| Mother wavelet | MSVD B@4 | MSVD MT | MSVD CD | MSR-VTT B@4 | MSR-VTT MT | MSR-VTT CD |
|---|---|---|---|---|---|---|
| BM | 49.32 | 31.64 | 85.53 | 38.44 | 25.15 | 48.27 |
| db1 | 50.49 | 32.18 | 86.65 | 39.38 | 26.25 | 49.05 |
| db4 | 50.17 | 32.47 | 86.81 | 39.73 | 25.81 | 48.76 |
| bior1.5 | 51.74 | 33.01 | 87.93 | 40.54 | 26.33 | 49.35 |
| bior2.4 | 51.12 | 33.21 | 87.58 | 40.49 | 25.91 | 49.24 |
| bior3.5 | 50.61 | 32.56 | 87.27 | 40.14 | 26.03 | 49.31 |
| bior5.5 | 51.08 | 32.94 | 87.81 | 39.95 | 26.27 | 49.08 |
| Coif2 | 49.96 | 32.48 | 86.69 | 39.43 | 25.74 | 48.82 |
| Coif5 | 50.29 | 32.64 | 86.15 | 39.81 | 25.82 | 48.78 |
| Sym2 | 50.35 | 32.47 | 86.37 | 38.98 | 25.59 | 48.95 |
| Sym4 | 49.98 | 32.29 | 86.45 | 39.36 | 25.77 | 49.17 |
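To make the metric descriptions in Section IV-B more concrete, the following sketch computes two of the underlying quantities on a toy caption pair: the clipped 4-gram precision that B@4 builds on, and the harmonic mean of unigram precision and recall that MT builds on. It is a simplified illustration and not the official scorers: the full BLEU@4 also combines 1- to 4-gram precisions and applies a brevity penalty, and the full METEOR uses stem and synonym matching, a recall-weighted mean and a fragmentation penalty. The function names and the example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def unigram_f_mean(candidate, reference):
    """Plain harmonic mean of unigram precision and recall (exact matches only)."""
    cand, ref = Counter(candidate), Counter(reference)
    matches = sum(min(count, ref[tok]) for tok, count in cand.items())
    if matches == 0:
        return 0.0
    precision = matches / len(candidate)
    recall = matches / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy example (made-up sentences, not taken from MSVD or MSR-VTT).
generated = "a man is playing a guitar".split()
ground_truth = "a man is playing the guitar on stage".split()
print(clipped_ngram_precision(generated, ground_truth, 4))
print(unigram_f_mean(generated, ground_truth))
```

In this toy pair only one of the three candidate 4-grams appears in the reference, while five of the six candidate words are matched, which shows why 4-gram accuracy is a much stricter measure than unigram overlap.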
C. IMPLEMENTATION DETAILS
Since the video datasets include videos with different frame rates varying from 6 to 60 fps, the input is resampled to a uniform frame rate for smooth working of the algorithm on the datasets under consideration. Hence in our method, the video clips are resampled at 10 fps and 30 uniformly spaced frames are chosen from each video clip, keeping in mind that adjacent frames in the short clips included in the datasets differ very little in terms of information content. The 2D-WCNN model in the GFE module is pre-trained using the ImageNet dataset [45] and the C3D model in the TFE module is pre-trained using the Sports-1M dataset [46]. For temporal feature extraction, we have considered non-overlapping sequences of 16 frames, same as the default settings. All the visual features are given to individual fully-connected layers with 512 units, to match the feature dimensionality of the attention networks in the model. In the VAP, four stages of attention networks are used for the extraction of attended features.
The pre-processing of all the textual descriptions is done by tokenizing with the NLTK toolkit, which splits the sentences into words, converting all the words to lowercase and removing punctuation. All words occurring fewer than 3 times are removed. Each word in the caption is represented as a word vector using the 300-dimensional GloVe word embeddings [47] pre-trained on a large-scale corpus. For dimensionality matching, the GloVe embeddings are given to an LSTM network with 512 hidden units. The maximum sentence length is limited to 20. In the visual-to-text translator network, four caption decoder stages with model dimension $d_i$ set to 512 are employed. The number of heads, $H$, for $MH_{att}$ is taken as 4, so the dimension of each head is 128. The hidden size of the $FF$ networks is set to 1024. Also, sine and cosine functions are used as positional encodings with the word embeddings [17], which provide information regarding the position of the tokens in the sequence. The weight initialization of the model is done using the Xavier method, and the model is optimized with ADAM employing an initial learning rate of 1e-5 for MSVD and 4e-5 for MSR-VTT, with default exponential decay rates of (0.9, 0.999). To train the proposed method, an Nvidia Tesla V100 GPU with 16 GB of memory and 5120 CUDA cores is used. The implementation of the method is done using TensorFlow 2.3.
The training is carried out with a batch size of 8 for MSVD and 16 for MSR-VTT. In order to avoid overfitting, dropout and early stopping are used in the method. The self-critical training strategy is employed in the implementation, where the model is trained initially for 50 epochs with the cross-entropy loss and is further fine-tuned for 25 epochs using the self-critical loss to achieve the best CD score on the validation set. This helps to tackle the exposure bias problem that arises when optimizing with the cross-entropy loss alone. During the testing phase, a beam search strategy is adopted to select the best caption from a few selected candidates. The beam size is chosen as 5. The time cost of the training process is 2.1 hours/epoch for the MSVD dataset and 4.6 hours/epoch for MSR-VTT. The average time cost for the testing phase of the model is 9.3 sec. Thus for each video with 30 sampled frames, the average testing speed is 3.2 frames per second.

D. SELECTION OF APPROPRIATE MOTHER WAVELET
To choose the appropriate mother wavelet, the performance of the proposed method is analyzed with two levels of wavelet decomposition on ten different mother wavelet functions from four different wavelet families: Daubechies wavelets (dbN) [48], biorthogonal wavelets (biorNr.Nd) [49], Coiflets (coifN) [50] and Symlets (symN) [49], where N represents the number of vanishing moments, and Nr and Nd denote the number of vanishing moments in the reconstruction and decomposition filters, respectively. The detailed performance results of our method for the MSVD and MSR-VTT datasets are given in Table 2. A baseline method (BM) consisting of a two-level DWT decomposition based CNN along with a single attention network in the VAP module is used in the experiment.
For the MSVD dataset, bior2.4 secures the highest MT score of 33.21, but bior1.5 scores better values of B@4 and CD than the other wavelets. For MSR-VTT, bior1.5 achieves the best B@4, MT and CD scores of 40.54, 26.33 and 49.35, respectively. Hence for the 2D-WCNN network, we have chosen the bior1.5 wavelet for both the MSVD and MSR-VTT datasets.
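The wavelet comparison in Section IV-D can be reproduced directly from the numbers in Table 2. The sketch below is only an illustration of that selection step: the dictionary layout and the function name `best_wavelet_per_metric` are hypothetical, the values are copied from Table 2, and the BM baseline row is left out because it is not a wavelet choice.

```python
METRICS = ("B@4", "MT", "CD")

# Scores copied from Table 2: (B@4, MT, CD) per dataset for each mother wavelet.
TABLE_2 = {
    "db1":     {"MSVD": (50.49, 32.18, 86.65), "MSR-VTT": (39.38, 26.25, 49.05)},
    "db4":     {"MSVD": (50.17, 32.47, 86.81), "MSR-VTT": (39.73, 25.81, 48.76)},
    "bior1.5": {"MSVD": (51.74, 33.01, 87.93), "MSR-VTT": (40.54, 26.33, 49.35)},
    "bior2.4": {"MSVD": (51.12, 33.21, 87.58), "MSR-VTT": (40.49, 25.91, 49.24)},
    "bior3.5": {"MSVD": (50.61, 32.56, 87.27), "MSR-VTT": (40.14, 26.03, 49.31)},
    "bior5.5": {"MSVD": (51.08, 32.94, 87.81), "MSR-VTT": (39.95, 26.27, 49.08)},
    "Coif2":   {"MSVD": (49.96, 32.48, 86.69), "MSR-VTT": (39.43, 25.74, 48.82)},
    "Coif5":   {"MSVD": (50.29, 32.64, 86.15), "MSR-VTT": (39.81, 25.82, 48.78)},
    "Sym2":    {"MSVD": (50.35, 32.47, 86.37), "MSR-VTT": (38.98, 25.59, 48.95)},
    "Sym4":    {"MSVD": (49.98, 32.29, 86.45), "MSR-VTT": (39.36, 25.77, 49.17)},
}


def best_wavelet_per_metric(table, dataset):
    """Return {metric: wavelet name with the highest score} for one dataset."""
    best = {}
    for idx, metric in enumerate(METRICS):
        best[metric] = max(table, key=lambda name: table[name][dataset][idx])
    return best


for dataset in ("MSVD", "MSR-VTT"):
    print(dataset, best_wavelet_per_metric(TABLE_2, dataset))
```

On these numbers bior1.5 wins five of the six dataset and metric combinations, and bior2.4 leads only on MT for MSVD, which matches the choice of bior1.5 stated in the text.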

E. SELECTION OF DWT DECOMPOSITION STAGES
A detailed study of the results obtained for various DWT decomposition levels is carried out and the results are given in Table 3.

TABLE 3: Performance results of the proposed method for different number of decomposition levels

| Decomposition levels | MSVD B@4 | MSVD MT | MSVD CD | MSR-VTT B@4 | MSR-VTT MT | MSR-VTT CD |
|---|---|---|---|---|---|---|
| 1-level | 52.87 | 35.14 | 90.39 | 43.83 | 28.68 | 51.37 |
| 2-level | 53.64 | 36.53 | 91.71 | 44.92 | 29.84 | 52.22 |
| 3-level | 53.69 | 36.90 | 91.89 | 45.03 | 29.89 | 52.51 |

The method with two-level decomposition yields better results than with 1-level DWT decomposition, improving the B@4, MT and CD values by about 0.77, 1.39 and 1.32 percentage points, respectively, for the MSVD dataset and by about 1.09, 1.16 and 0.85 percentage points, respectively, for the MSR-VTT dataset. The method yields only slight improvements in the performance metric scores with the inclusion of three-level decomposition. Hence, considering the computational complexity, the method with two-level decomposition is preferred in the proposed work.

F. PERFORMANCE EVALUATION
Both quantitative and qualitative analysis of the methodology are carried out using the evaluation metrics, and the results are compared with the state-of-the-art methods as detailed below.

TABLE 4: Performance comparison of our method with other state-of-the-art methods on MSVD dataset. All the values are reported as % and higher is better in all columns. (-) indicates that the metric is not reported.

| Method | B@4 | MT | CD |
|---|---|---|---|
| HRNE [30] | 43.8 | 33.1 | - |
| h-RNN [31] | 49.9 | 32.6 | 65.8 |
| LSTM-TSA [51] | 52.8 | 33.5 | 74.0 |
| M3 [52] | 52.8 | 33.3 | - |
| PickNet [53] | 52.3 | 33.3 | 76.5 |
| RecNet [54] | 52.3 | 34.1 | 80.3 |
| DS-RNN [55] | 53.0 | 34.7 | 79.4 |
| MS-RNN [56] | 53.3 | 33.8 | 74.8 |
| GRU-EVE [57] | 47.9 | 35.0 | 78.1 |
| GFN-POS [58] | 53.9 | 34.9 | 91.0 |
| TRGCN [59] | 52.6 | 36.3 | 89.6 |
| STAT [14] | 52.0 | 33.3 | 73.8 |
| SBAT [36] | 53.1 | 35.3 | 89.5 |
| Ours | 53.6 | 36.5 | 91.7 |

TABLE 5: Performance comparison of our method with other state-of-the-art methods on MSR-VTT dataset. All the values are reported as % and higher is better in all columns. (-) indicates that the metric is not reported.

| Method | B@4 | MT | CD |
|---|---|---|---|
| VideoLAB [60] | 39.1 | 27.7 | 44.1 |
| Aalto [61] | 39.8 | 26.9 | 45.7 |
| v2t-Navigator [62] | 40.8 | 28.2 | 44.8 |
| MTVC [63] | 40.8 | 28.8 | 47.1 |
| PickNet [53] | 41.3 | 27.7 | 44.1 |
| TVT [64] | 40.1 | 27.9 | 47.7 |
| DS-RNN [55] | 42.3 | 29.4 | 46.1 |
| MS-RNN [56] | 39.8 | 26.1 | 40.9 |
| GRU-EVE [57] | 38.3 | 28.4 | 48.1 |
| GFN-POS [58] | 41.7 | 27.8 | 48.5 |
| TRGCN [59] | 44.6 | 29.5 | 51.4 |
| MARN [65] | 40.4 | 28.1 | 47.1 |
| STAT [14] | 39.3 | 27.1 | 43.9 |
| SBAT [36] | 42.9 | 28.9 | 51.6 |
| ORG-TRL [66] | 43.6 | 28.8 | 50.9 |
| Ours | 44.9 | 29.8 | 52.2 |

1) Quantitative results
Table 4 shows the performance results of our method on the MSVD dataset along with a comparison with the state-of-the-art methods in video captioning: HRNE [30], h-RNN [31], LSTM-TSA [51], M3 [52], PickNet [53], RecNet [54], DS-RNN [55], MS-RNN [56], GRU-EVE [57], GFN-POS [58], TRGCN [59], STAT [14] and SBAT [36]. From Table 4, it can be noted that our algorithm outperforms the existing methods
with an improved M T and CD score of 36.5% and 91.7%, captured from these three different modalities simultaneously
respectively. It also secures a B@4 of 53.6%. This proves the and are combined to acquire all the attentive regions in the
ability of our method in highlighting the finer details in the video that highlights the underlying video semantics. The
input video clip. Table 5 shows the quantitative comparison textual attention is also interleaved with the aforementioned
results of the proposed method on MSR-VTT dataset with attention helps to generate captions which are at par with
the existing state-of-the-art methods. These includes the human generated ones.
methods ranked in top-3 positions of the Leaderboard of
MSR-VTT Challenge 2017 - VideoLAB [60], Aalto [61]
2) Qualitative results
and v2t-Navigator [62] along with the methods, M T VC
[63], PickNet [53], TVT [64], DS-RNN [55], MS-RNN [56], Fig. 4 illustrates the qualitative comparison of the captions
GRU-EVE [57], GFN-POS [58], TRGCN [58], MARN [65], generated by STAT [14] and the proposed method for sample
STAT [14] and SBAT [36] in video captioning. For this videos from both the datasets. From the generated captions,
dataset also, it achieves an impressive B@4, M T and CD it is evident that the proposed method understands the visual
scores of about 44.9%, 29.8% and 52.2%, respectively, which concepts in the video in a superior manner and generates
indicates the better performance of our method compared to captions reflecting the underlying semantics such as ‘sits on
the existing methods. Inclusion of DWT in the architecture, a sofa’, ‘drinks from a bottle’, ‘ride a motorbike’, ‘down
helps to extract the fine visual details present in the video the road’, ‘blue and white paper’, ‘eyes’ and ‘brush’, thus
clips more efficiently compared to the other methods. The conveying more details of the video close to the human
method extracts three different features from the video generated ground truth captions.
for multimodal video representation. Then attentions are

FIGURE 4: Results of qualitative comparison.
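The choice of two-level decomposition in Section E can be checked directly against the values in Table 3. The short sketch below recomputes the gain from one to two levels and from two to three levels; the dictionary layout and the helper name `gains` are illustrative choices, while the numbers are copied from Table 3.

```python
METRICS = ("B@4", "MT", "CD")

# Scores (%) copied from Table 3, keyed by dataset and decomposition level.
TABLE3 = {
    "MSVD": {
        1: (52.87, 35.14, 90.39),
        2: (53.64, 36.53, 91.71),
        3: (53.69, 36.90, 91.89),
    },
    "MSR-VTT": {
        1: (43.83, 28.68, 51.37),
        2: (44.92, 29.84, 52.22),
        3: (45.03, 29.89, 52.51),
    },
}


def gains(dataset, from_level, to_level):
    """Absolute gain (percentage points) per metric between two levels."""
    low = TABLE3[dataset][from_level]
    high = TABLE3[dataset][to_level]
    return {name: round(b - a, 2) for name, a, b in zip(METRICS, low, high)}


if __name__ == "__main__":
    for dataset in TABLE3:
        print(dataset, "1->2:", gains(dataset, 1, 2))
        print(dataset, "2->3:", gains(dataset, 2, 3))
```

For every metric on both datasets, the gain from the third level is smaller than the gain from the second, which is consistent with preferring two levels once computational cost is taken into account.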

TABLE 6: Results of different configurations of the proposed method for visual feature representation for the MSVD and MSR-VTT datasets. * denotes the first-phase experimental study using a single attention network in the VAP module, whereas ** denotes the second-phase study conducted with three separate attention networks in VAP. Here Glob, Temp and Context denote global, temporal and contextual features. All the values are reported as %, and higher is better in all columns.

                        Cross-Entropy loss              Self-Critical loss
Configuration           MSVD          MSR-VTT           MSVD          MSR-VTT
                        B@4    CD     B@4    CD         B@4    CD     B@4    CD
Glob+Temp*              48.8   86.3   39.4   47.4       51.7   87.9   40.5   49.3
Context+Temp*           48.4   85.8   38.3   46.8       51.0   87.2   39.8   48.5
Glob+Context+Temp*      49.5   88.1   39.6   49.1       52.4   89.6   42.9   50.4
Glob+Temp**             51.2   88.2   40.8   49.4       53.1   90.8   43.7   51.5
Context+Temp**          50.6   87.8   41.7   48.7       52.8   90.1   43.3   50.8
Glob+Context+Temp**     52.9   89.8   42.6   51.1       53.6   91.7   44.9   52.2
Ours (without WCNN)     50.1   87.8   39.9   50.0       52.3   89.2   42.4   50.6

G. ABLATION STUDY
Experimental studies were conducted to validate the enhanced performance of the method with the inclusion of global, contextual and temporal features in the video, with and without discrete wavelet decomposition. An ablation study was also conducted to analyze the effectiveness of including multiple attention stages in the VAP and VTT networks. Table 6 shows the B@4 and CD scores for various network configurations on the MSVD and MSR-VTT datasets. The experiments were conducted in two phases. In the first phase, we used a single attention network with four sub-layers to handle the concatenated visual input features (global, contextual and temporal). For this, experiments were carried out with multiple configurations as in Table 6. The configuration with Glob+Context+Temp achieves the maximum CD values of about 89.6% and 50.4% for the MSVD and MSR-VTT datasets, respectively, with self-critical loss. In the second phase, we used three separate attention networks, each having four attention blocks, for computing the attention of the global, local and temporal features. Here also, three different configurations of features are considered, and higher CD values of 91.7% and 52.2% are obtained for the MSVD and MSR-VTT datasets, respectively, with self-critical loss. The self-critical training strategy improves the CD value by about 1.9 and 1.1 percentage points for MSVD and MSR-VTT, respectively. As listed in Table 6, a configuration without WCNN was also evaluated, consisting of ResNet-50 for capturing the global features of the frames, a Faster R-CNN network for extracting contextual information and a C3D model for capturing temporal details in the video. It achieved CD values of about 89.2% and 50.6% for MSVD and MSR-VTT, respectively, with self-critical loss. The gain of the full model over this configuration is attributed to the extraction of spectral information along with the spatial, temporal and semantic details in the input video.

An experimental study was also conducted to find the optimal number of attention blocks with the introduction of self-critical loss. The results are shown in Fig. 5. The system performance improves with the number of attention blocks in the VAP-caption decoder stages of the transformer network, but it saturates after a particular number of attention blocks. Thus, for the proposed model, the optimal number of attention blocks is found to be 4, with B@4 and CD values of about 53.6% and 91.7% for the MSVD dataset and 44.9% and 52.2% for the MSR-VTT dataset, respectively.

The quality of the captions generated by the method is illustrated with a few sample video clips from both datasets, as shown in Fig. 6. Here the baseline method is the network without WCNN mentioned above.

FIGURE 5: Evaluation of the performance of the proposed method for different numbers of attention blocks in the VAP-VTT for the MSVD and MSR-VTT datasets. Panels: (a) B@4, (b) MT and (c) CD on the MSVD dataset; (d) B@4, (e) MT and (f) CD on the MSR-VTT dataset.

FIGURE 6: Illustration of captions generated by the proposed method in comparison with the baseline method.
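The CD gains discussed in the ablation study can be recomputed from Table 6. The sketch below is only an arithmetic check: the dictionary keys (`single_attention`, `three_attention`, `without_wcnn`) and the helper `cd_gain` are hypothetical labels introduced here, and the numbers are the CD columns of Table 6 for the Glob+Context+Temp configurations and the configuration without WCNN.

```python
# CD scores (%) copied from Table 6 as (MSVD, MSR-VTT) pairs.
# "CE" = cross-entropy loss, "SC" = self-critical loss.
CD = {
    "single_attention": {"CE": (88.1, 49.1), "SC": (89.6, 50.4)},  # Glob+Context+Temp*
    "three_attention": {"CE": (89.8, 51.1), "SC": (91.7, 52.2)},   # Glob+Context+Temp**
    "without_wcnn": {"CE": (87.8, 50.0), "SC": (89.2, 50.6)},      # Ours (without WCNN)
}


def cd_gain(better, worse):
    """Absolute CD gain in percentage points for (MSVD, MSR-VTT)."""
    return tuple(round(b - w, 1) for b, w in zip(better, worse))


if __name__ == "__main__":
    full = CD["three_attention"]
    # Effect of self-critical training on the full configuration.
    print("SC vs CE:", cd_gain(full["SC"], full["CE"]))
    # Effect of three separate attention networks versus a single one.
    print("three vs single attention:",
          cd_gain(full["SC"], CD["single_attention"]["SC"]))
    # Effect of the wavelet CNN, compared with the configuration without WCNN.
    print("with vs without WCNN:",
          cd_gain(full["SC"], CD["without_wcnn"]["SC"]))
```

Keeping the three comparisons separate makes it easier to see how much each design choice (training loss, attention layout, wavelet features) contributes to the final CD score on each dataset.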

FIGURE 7: Samples of negative results.

H. LIMITATIONS OF THE PROPOSED WORK
Even though the proposed method gives better performance in the reported evaluation metrics, it still has some limitations. The method fails to generate correct contextual descriptions of a few video clips because of wrong interpretations of the visual content. In the sample frames of the first video clip in Fig. 7, objects with reflections are visible. In this video clip, we can observe a baby in a red dress looking at himself in the mirror and kissing. Our method identifies this as "two babies in red dress are playing", producing a false result. The method also fails to extract the correct visual interpretations or semantics from videos containing complex, high-motion events, similar to the one shown in the frames of the second sample video in Fig. 7, where a motorcyclist meets with an accident by losing control of the bike and finally falls into the water. The method wrongly identifies this activity as the motorcyclist "flying a bike". This may be because the attention-based multimodal features interact with each other in a way that degrades the performance of the method, leading to false interpretations of the underlying semantics in the video. Another limitation of the proposed model is the increase in time cost with the addition of the discrete wavelet pre-processing stage and the VAP network, which computes the global, local and temporal attention features separately.

V. CONCLUSION
In this work, a deep neural network architecture is introduced for video caption generation by exploiting multimodal feature attention in the video. In this method, the inclusion of two-level discrete wavelet decomposition in the 2D visual feature representation helps to extract additional information contained in the spatial, temporal and spectral domains of the video. The adoption of three separate attention networks in the visual attention predictor is responsible for extracting more attentive features, leading to more semantic captions in the visual-to-text translator. The performance of the method is evaluated on two benchmark datasets and compared with existing state-of-the-art methods in video captioning. The results obtained highlight the efficiency of the method in generating meaningful video captions. This method can be further improved by also exploiting audio features to generate more meaningful captions. The method can be extended to generate textual descriptions of lengthy videos, such as those from surveillance video systems, to highlight the abnormal events contained in them.

REFERENCES
[1] W. Li, Z. Qu, H. Song, P. Wang, and B. Xue, “The traffic scene understanding and prediction based on image captioning,” IEEE Access, vol. 9, pp. 1420–1427, 2021.
[2] S. Amirian, A. Farahani, H. R. Arabnia, K. M. Rasheed, and T. R. Taha, “The use of video captioning for fostering physical activity,” 2020 Int. Conf. on Computational Science and Computational Intelligence (CSCI), pp. 611–614, 2020.
[3] N. Xu, A. Liu, W. Nie, and Y. Su, “Attention-in-attention networks for surveillance video understanding in internet of things,” IEEE Internet of Things Journal, vol. 5, no. 5, pp. 3419–3429, 2018.
[4] C. Gurrin, “Content-based video retrieval,” Encyclopedia of Database Systems, pp. 466–473, 2009.
[5] S. Ding, S. Qu, Y. Xi, and S. Wan, “A long video caption generation algorithm for big video data retrieval,” Future Generation Computer Systems, vol. 93, pp. 583–595, 2019.
[6] S. Fujita, T. Hirao, H. Kamigaito, M. Okumura, and M. Nagata, “SODA: Story oriented dense video captioning evaluation framework,” Computer Vision – ECCV 2020, Lecture Notes in Computer Science, vol. 12351, pp. 517–531, 2020.
[7] R. Aditya, R. Asmita, V. Vidya, and P. V. R. Badri, “Automatic subtitle generation for videos,” in 2020 6th Int. Conf. on Advanced Computing and Communication Systems (ICACCS), 2020, pp. 132–135.
[8] L. Gao, X. Li, J. Song, and H. T. Shen, “Hierarchical LSTMs with adaptive attention for visual captioning,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 42, no. 5, pp. 1112–1131, 2020.
[9] L. Gao, X. Wang, J. Song, and Y. Liu, “Fused GRU with semantic-temporal attention for video captioning,” Neurocomputing, vol. 395, pp. 222–228, 2020.
[10] P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang, “Hierarchical recurrent neural encoder for video representation with application to captioning,” in CVPR, 2016, pp. 1029–1038.
[11] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko, “Translating videos to natural language using deep recurrent neural networks,” in Proc. of the 2015 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 1494–1504.
[12] S. Venugopalan, M. Rohrbach, J. Donahue, R. J. Mooney, T. Darrell, and K. Saenko, “Sequence to sequence – video to text,” 2015 IEEE Int. Conf. on Computer Vision (ICCV), pp. 4534–4542, 2015.
[13] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville, “Describing videos by exploiting temporal structure,” in 2015 IEEE Int. Conf. on Computer Vision (ICCV), Dec. 2015, pp. 4507–4515.
[14] C. Yan, Y. Tu, X. Wang, Y. Zhang, X. Hao, Y. Zhang, and Q. Dai, “STAT: Spatial-temporal attention mechanism for video captioning,” IEEE Trans. on Multimedia, vol. 22, no. 1, pp. 229–241, 2020.
[15] L. Gao, X. Li, J. Song, and H. T. Shen, “Hierarchical LSTMs with adaptive attention for visual captioning,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 42, pp. 1112–1131, 2020.
[16] X. Shi, J. Cai, J. Gu, and S. Joty, “Video captioning with boundary-aware hierarchical language decoding and joint video prediction,” Neurocomputing, vol. 417, pp. 347–356, 2020.
[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. of the 31st Int. Conf. on NIPS, 2017, pp. 6000–6010.
[18] A. Kojima, T. Tamura, and K. Fukunaga, “Natural language description of human activities from video images based on concept hierarchy of actions,” Int. Journal of Computer Vision, vol. 50, pp. 171–184, 2004.
[19] S. Mukherjee, S. Ghosh, S. Ghosh, P. Kumar, and P. P. Roy, “Predicting video-frames using encoder-ConvLSTM combination,” in ICASSP, 2019, pp. 2027–2031.
[20] S. Liu, Z. Ren, and J. Yuan, “SibNet: Sibling convolutional encoder for video captioning,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 43, no. 9, pp. 3259–3272, 2021.
[21] M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele, “Translating video content to natural language descriptions,” in 2013 IEEE Int. Conf. on Computer Vision, 2013, pp. 433–440.


[22] L. Zhu, Z. Xu, and Y. Yang, “Bidirectional multirate reconstruction for temporal modeling in videos,” 2017 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 1339–1348, 2017.
[23] Z. Yang, Y. Han, and Z. Wang, “Catching the temporal regions-of-interest for video captioning,” Proc. of the 25th ACM Int. Conf. on Multimedia, 2017.
[24] S. Chen, Q. Jin, J. Chen, and A. G. Hauptmann, “Generating video descriptions with latent topic guidance,” IEEE Trans. on Multimedia, vol. 21, no. 9, pp. 2407–2418, 2019.
[25] L. Baraldi, C. Grana, and R. Cucchiara, “Hierarchical boundary-aware neural encoder for video captioning,” in CVPR, 2017, pp. 3185–3194.
[26] M. Nabati and A. Behrad, “Video captioning using boosted and parallel long short-term memory networks,” Computer Vision and Image Understanding, vol. 190, pp. 102840, 2020.
[27] Z. Zhang, Y. Shi, C. Yuan, B. Li, P. Wang, W. Hu, and Z. Zha, “Object relational graph with teacher-recommended learning for video captioning,” in CVPR, 2020, pp. 13275–13285.
[28] J. Hou, X. Wu, W. Zhao, J. Luo, and Y. Jia, “Joint syntax representation learning and visual cue translation for video captioning,” in 2019 IEEE/CVF Int. Conf. on Computer Vision (ICCV), 2019, pp. 8917–8926.
[29] J. Song, Y. Guo, L. Gao, X. Li, A. Hanjalic, and H. T. Shen, “From deterministic to generative: Multimodal stochastic RNNs for video captioning,” IEEE Trans. on Neural Networks and Learning Systems, vol. 30, no. 10, pp. 3047–3058, 2019.
[30] P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang, “Hierarchical recurrent neural encoder for video representation with application to captioning,” in 2016 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 1029–1038.
[31] L. Baraldi, C. Grana, and R. Cucchiara, “Hierarchical boundary-aware neural encoder for video captioning,” in 2017 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3185–3194.
[32] C. Wu, Y. Wei, X. Chu, S. Weichen, F. Su, and L. Wang, “Hierarchical attention-based multimodal fusion for video captioning,” Neurocomputing, vol. 315, pp. 362–370, 2018.
[33] C. Hori, T. Hori, T. Lee, Z. Zhang, B. Harsham, J. R. Hershey, T. K. Marks, and K. Sumi, “Attention-based multimodal fusion for video description,” in 2017 IEEE Int. Conf. on Computer Vision (ICCV), 2017, pp. 4203–4212.
[34] B. Zhao, X. Li, and X. Lu, “CAM-RNN: Co-attention model based RNN for video captioning,” IEEE Trans. on Image Processing, vol. 28, no. 11, pp. 5552–5565, 2019.
[35] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong, “End-to-end dense video captioning with masked transformer,” in 2018 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 2018, pp. 8739–8748.
[36] T. Jin, S. Huang, M. Chen, Y. Li, and Z. Zhang, “SBAT: Video captioning with sparse boundary-aware transformer,” in IJCAI, 2020.
[37] J. Lei, L. Wang, Y. Shen, D. Yu, T. L. Berg, and M. Bansal, “MART: Memory-augmented recurrent transformer for coherent video paragraph captioning,” 2020.
[38] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
[39] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proc. of the 28th Int. Conf. on Neural Information Processing Systems - Volume 1, NIPS’15, MIT Press, 2015, pp. 91–99.
[40] D. Chen and W. Dolan, “Collecting highly parallel data for paraphrase evaluation,” in Proc. of ACL, 2011, pp. 190–200.
[41] J. Xu, T. Mei, T. Yao, and Y. Rui, “MSR-VTT: A large video description dataset for bridging video and language,” in CVPR, 2016, pp. 5288–5296.
[42] K. Papineni, S. Roukos, T. Ward, and W. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proc. of the 40th Annual Meeting on Association for Computational Linguistics, 2002, pp. 311–318.
[43] A. Lavie and A. Agarwal, “Meteor: An automatic metric for MT evaluation with high levels of correlation with human judgments,” in Proc. of the Second Workshop on Statistical Machine Translation, 2007, pp. 228–231.
[44] R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based image description evaluation,” in CVPR, 2015, pp. 4566–4575.
[45] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in CVPR, 2009, pp. 248–255.
[46] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in CVPR, 2014.
[47] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[48] I. Daubechies, “Ten lectures on wavelets,” Society for Industrial and Applied Mathematics, USA, 1992.
[49] A. Karoui and R. Vaillancourt, “Families of biorthogonal wavelets,” Computers & Mathematics with Applications, vol. 28, no. 4, pp. 25–39, 1994.
[50] G. Beylkin, R. R. Coifman, and V. Rokhlin, “Fast wavelet transforms and numerical algorithms I,” Communications on Pure and Applied Mathematics, vol. 44, no. 2, pp. 141–183, 1991.
[51] Y. Pan, T. Yao, H. Li, and T. Mei, “Video captioning with transferred semantic attributes,” in 2017 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 984–992.
[52] J. Wang, W. Wang, Y. Huang, L. Wang, and T. Tan, “M3: Multimodal memory modelling for video captioning,” in CVPR, 2018, pp. 7512–7520.
[53] Y. Chen, S. Wang, W. Zhang, and Q. Huang, “Less is more: Picking informative frames for video captioning,” in ECCV, 2018.
[54] B. Wang, L. Ma, W. Zhang, and W. Liu, “Reconstruction network for video captioning,” in CVPR, 2018, pp. 7622–7631.
[55] N. Xu, A. Liu, Y. Wong, Y. Zhang, W. Nie, Y. Su, and M. Kankanhalli, “Dual-stream recurrent neural network for video captioning,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 29, no. 8, pp. 2482–2493, 2019.
[56] J. Song, Y. Guo, L. Gao, X. Li, A. Hanjalic, and H. T. Shen, “From deterministic to generative: Multimodal stochastic RNNs for video captioning,” IEEE Trans. on Neural Networks and Learning Systems, vol. 30, no. 10, pp. 3047–3058, 2019.
[57] N. Aafaq, N. Akhtar, W. Liu, S. Z. Gilani, and A. Mian, “Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning,” in CVPR, 2019, pp. 12479–12488.
[58] B. Wang, L. Ma, W. Zhang, W. Jiang, J. Wang, and W. Liu, “Controllable video captioning with POS sequence guidance based on gated fusion network,” in 2019 IEEE/CVF Int. Conf. on Computer Vision (ICCV), 2019, pp. 2641–2650.
[59] X. Xiao, Y. Zhang, R. Feng, T. Zhang, S. Gao, and W. Fan, “Video captioning with temporal and region graph convolution network,” in ICME, 2020, pp. 1–6.
[60] V. Ramanishka, A. Das, D. H. Park, S. Venugopalan, L. A. Hendricks, M. Rohrbach, and K. Saenko, “Multimodal video description,” in Proc. of the 24th ACM Int. Conf. on Multimedia, 2016, pp. 1092–1096.
[61] R. Shetty and J. Laaksonen, “Frame- and segment-level features and candidate pool evaluation for video caption generation,” in Proc. of the 24th ACM Int. Conf. on Multimedia, MM ’16, 2016, pp. 1073–1076.
[62] Q. Jin, J. Chen, S. Chen, Y. Xiong, and A. Hauptmann, “Describing videos using multi-modal fusion,” in Proc. of the 24th ACM Int. Conf. on Multimedia, 2016, pp. 1087–1091.
[63] R. Pasunuru and M. Bansal, “Multi-task video captioning with video and entailment generation,” in 55th Annual Meeting of the Association for Computational Linguistics, 2017, vol. 1, pp. 1273–1283.
[64] M. Chen, Y. Li, Z. Zhang, and S. Huang, “TVT: Two-view transformer network for video captioning,” in Proc. of The 10th Asian Conf. on Machine Learning, 14–16 Nov 2018, vol. 95, pp. 847–862.
[65] W. Pei, J. Zhang, X. Wang, L. Ke, X. Shen, and Y. Tai, “Memory-attended recurrent network for video captioning,” in CVPR, 2019, pp. 8339–8348.
[66] Z. Zhang, Y. Shi, C. Yuan, B. Li, P. Wang, W. Hu, and Z. Zha, “Object relational graph with teacher-recommended learning for video captioning,” in 2020 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 13275–13285.


RESHMI S. BHOOSHAN received the B.Tech. and M.Tech. degrees in Applied Electronics and Instrumentation Engineering in 2001 and 2005, respectively, from the College of Engineering Thiruvananthapuram, University of Kerala, India. She is currently pursuing the Ph.D. degree at APJ Abdul Kalam Technological University, Kerala, India. She joined as Faculty in Government Engineering College, Department of Technical Education, in 2008. Currently, she is working as Assistant Professor in the Department of Electronics and Communication Engineering, Government Engineering College, Barton Hill, Thiruvananthapuram, Kerala, India. Her research interests include signal processing, image and video analytics, machine learning, computer vision and related areas.

SURESH K. (Senior Member, IEEE) received the Ph.D. degree in Signal Processing in 2010 and the M.E. degree in Signal Processing in 2000 from the Indian Institute of Science, Bangalore. He completed his B.Tech. degree in Electronics and Communication at the College of Engineering Thiruvananthapuram, University of Kerala, in 1994. He joined as Faculty in Government Engineering College, Department of Technical Education, in 1995. Currently, he is working as Professor in Government Engineering College Wayanad, Kerala, India. His research interests include signal processing, audio signal processing, image/video analytics, machine learning and related areas.


This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Random Forest Classifier
No ratings yet
Random Forest Classifier
18 pages
Twin Support Vector Machines Models Extensions and Applications
No ratings yet
Twin Support Vector Machines Models Extensions and Applications
221 pages
It 8 Sem Machine Learning 3705 Summer 2019
No ratings yet
It 8 Sem Machine Learning 3705 Summer 2019
2 pages
2K22 - B17 - 49 PRIYANSHU NANDAN - Multi Layer Perceptrons Reference
No ratings yet
2K22 - B17 - 49 PRIYANSHU NANDAN - Multi Layer Perceptrons Reference
32 pages
Data Science Interview Questions With Answers ?
No ratings yet
Data Science Interview Questions With Answers ?
16 pages
Deep Neural Nets - 33 Years Ago and 33 Years From Now
No ratings yet
Deep Neural Nets - 33 Years Ago and 33 Years From Now
17 pages
Neural Network (Machine Learning) - Wikipedia
No ratings yet
Neural Network (Machine Learning) - Wikipedia
40 pages
Brown - Applied AI & DS
No ratings yet
Brown - Applied AI & DS
25 pages
T SNE
No ratings yet
T SNE
11 pages
Essay GenAI
No ratings yet
Essay GenAI
3 pages
ML Unit Wise Important Questions
No ratings yet
ML Unit Wise Important Questions
2 pages
Module 1
No ratings yet
Module 1
66 pages
Electronics: Identification of Plant-Leaf Diseases Using CNN and Transfer-Learning Approach
No ratings yet
Electronics: Identification of Plant-Leaf Diseases Using CNN and Transfer-Learning Approach
19 pages
19 Query Rewriting For Rag
No ratings yet
19 Query Rewriting For Rag
13 pages
Is Chatgpt A Financial Expert? Evaluating Language Models On Financial Natural Language Processing
No ratings yet
Is Chatgpt A Financial Expert? Evaluating Language Models On Financial Natural Language Processing
9 pages
Unit1-Introduction To AI (Refference)
No ratings yet
Unit1-Introduction To AI (Refference)
6 pages
Hybrid AI Agent On 2d Racing Game Using Neural Networks and Reinforcement Learning
No ratings yet
Hybrid AI Agent On 2d Racing Game Using Neural Networks and Reinforcement Learning
7 pages
Iris Liveness Detection Using Transfer Learning With MobileNets: Strengthening Cybersecurity in Biometric Identification
No ratings yet
Iris Liveness Detection Using Transfer Learning With MobileNets: Strengthening Cybersecurity in Biometric Identification
17 pages

This article has been accepted for publication in IEEE Access.

I. INTRODUCTION

made by Venugopal et al., a novel video to text generation

Reshmi S. Bhooshan et al.: A Multimodal Framework For Video Caption Generation

differentiable mask from event proposals and thereby

FIGURE 3: Structure of Contextual Object Relationship Extractor.

FIGURE 4: Results of qualitative comparison.

G. ABLATION STUDY

of features are considered and obtained higher CD values

method can be extended to generate textual descriptions

RESHMI S. BHOOSHAN received the B.Tech

SURESH K. (Senior Member, IEEE) received
