Cross-Domain Modality Fusion For Dense Video Captioning
Fig. 2. Illustration of the proposed DVC framework: First, we contextualize the visual contents in a shared high-level semantic space learned by the SC-Net.
The neuronwise time-series signals of the SC-Net activations are encoded with temporal information using the descriptor transformer. The resulting descriptor is fed to the
proposal generation network for detecting semantically meaningful event boundaries. Each detected event's hidden state and the descriptors within that
event are fused using the attention mechanism. The attentive representation of the proposal is then fed to the LSTM-based caption generation module. The event
proposal and caption generation modules are trained jointly in an end-to-end manner.
efficient in terms of processing videos in their entirety. The former must explicitly compute a subset of windows for event detection using techniques such as dictionary learning [67] or recursive networks [46], whereas the latter [45], [46] only requires a forward pass through the network model. More recently, Wang et al. [68] have employed a bidirectional recurrent network and passed the video twice to improve the quality of event proposals over [45]. However, none of the aforementioned methods takes advantage of the available linguistic information, which is the key novelty of our technique.

B. Visual-Semantic Information Integration

There have been previous works that integrate visual and linguistic information to benefit from their joint modeling. For instance, Karpathy and Fei-Fei [52] model the correspondence between language and visual data by learning an alignment model. They assume that contiguous segments in a sentence refer to spatial locations in the image. Hence, they propose a model that is able to align the sentence segments with the spatial locations by associating the two modalities through a multimodal embedding space. For that purpose, they employ a region-based convolutional neural network (R-CNN) [4] to detect objects in the image. Then, using bidirectional image sentence mapping [69], they retain the top-19 object locations and learn representations for all the 19 object bounding boxes and for the whole image. A bidirectional recurrent neural network (Bi-RNN) is used in their method to compute the words' representation from its hidden state. Their model learns to score the similarity between words and regions of an image as a function of the R-CNN object detections and the outputs of the Bi-RNN. Xu et al. [53] introduce an attention model that learns to gaze at and describe the salient objects in the image. Unlike Karpathy and Fei-Fei [52], who use an object detector to obtain the regional representations, they extract feature vectors of an image from fully connected as well as lower layers of the convolutional neural network (CNN). By this, they capture the correspondences between the regional feature vectors and 2-D image portions.

Pan et al. [70] propose to integrate the visual and semantic information by incorporating relevance and coherence joint learning alongside long short-term memory (LSTM) training. The relevance loss captures the relationship between the visual content and the semantics of the entire sentence, whereas the coherence loss captures the contextual relationship among the generated words of the sentence. The model is jointly trained with coherence and relevance losses to generate semantically rich sentences.

The aforementioned methods mainly deal with images or short single-event videos as opposed to the long multievent videos that are considered in our work. Furthermore, they employ visual and linguistic information for spatial region representation learning. Although Pan et al. [70] improve the semantics of generated sentences with visual content, in sharp contrast to our technique, they altogether ignore the temporal dynamics of the video. However, for video representations, detecting the temporal time stamps in long untrimmed videos is more crucial. This is one of the key differences between our proposed semantic contextualization for visual content and the spatial representation learning in image captioning. Moreover, existing techniques either introduce an attention mechanism or learn a joint space projection of the two modalities, whereas we fuse the linguistic and visual cues by learning a network in a supervised manner. Finally, these methods also differ from our work in linguistic information exploitation. We leverage the sentences in the sense that we first cluster them into similar semantic concepts and assign a unique ID to each cluster for supervised training of the SC-Net. Then, similar visual features and corresponding semantic concepts are fused together using the SC-Net.

More recently, Mun et al. [44] and Zhou et al. [71] have proposed to incorporate linguistic information and context in the dense video captioning framework. However, both methods use the two modalities implicitly while training the model in an end-to-end manner. This way, the representations used by their proposal generation models lack associated linguistic information. On the other hand, our method explicitly integrates
the linguistic and visual information for proposal generation. The implicit guidance of language information comes as an additional advantage to our technique during the end-to-end training.

C. Video Captioning

Before the pervasive use of deep learning, traditional video captioning methods [72]–[74] employed the classical approach of detecting the subject (S), verb (V), and object (O) in a video and describing them in natural language using template-based techniques. With the availability of large-scale datasets [75], [76], more recent video captioning techniques strongly rely on neural networks for this task, with encoder–decoder schemes being the widely employed backbone. Such frameworks first encode visual inputs with CNNs and then decode them into natural language sentences using recurrent networks. More recent methods augment this scheme with advanced concepts, e.g., reinforcement learning (RL) [77], object and action modeling [35], [78], [79], incorporating the Fourier transform (FT) with the CNN [80], attention mechanisms [81], [82], semantic attribute learning [33], [83], multimodal memory [84], [85], and audio integration [86], [87], for improved performance. Notwithstanding their superior performance, these methods are limited to processing short single-event videos and describing them in a single sentence. A few attempts to describe videos with multiple sentences/paragraphs have also been made that employ event proposal or captioning modules hierarchically to generate multiple sentences [88]–[90]. Compared to the problem of describing a short video by a single or, rarely, by multiple sentences, the challenges of DVC are multifold, as noted in Section I. This renders most of the techniques for conventional video captioning ineffective for the task of DVC in their original form.

D. Dense Video Captioning

The task of DVC was introduced by Krishna et al. [43]. In contrast to video captioning, which describes short videos in a single sentence, DVC first involves detecting multiple, possibly overlapping, events in long videos and then describing all the detected events in natural language. Most contemporary works tackle the problem as a detection and description framework in a supervised [43], [44], [47], [68], [71], [90]–[92] or weakly supervised manner [93]–[95]. Most of the methods address the challenges of this task with a bimodule framework. The two modules include a proposal generation module to detect events in the input video and a captioning module to generate the captions for the detected events.

Krishna et al. [43] incorporate a multiscale proposal generation network [46] in the aforementioned framework and propose an attention-based captioning network to capture the event context. Wang et al. [68] employ a bidirectional proposal generation network to improve the proposals' generation accuracy by better contextualizing the events within the video. Li et al. [91] propose temporal coordinates and descriptiveness regression to localize the proposals in the video and employ an attribute-augmented captioning network [40] for improved performance. Zhou et al. [71] propose to adopt the transformer [5] as the captioning module to tackle long dependencies. Mun et al. [44] propose an event sequence generation network that reduces the number of proposals and exploits visual and linguistic context implicitly while training the model. Iashin and Rahtu [92] incorporate audio and speech modalities to further improve the performance of the DVC framework.

As human annotations of videos are laborious and expensive, weakly supervised methods attempt to mitigate this problem. These methods do not use temporal segment annotations for training of the DVC model. Instead, these techniques rely on the assumption that each caption describes one temporal segment and each segment has one caption. Duan et al. [93] take labeled captions as input for weakly supervised dense event captioning by adopting the cycle-consistent learning strategy. Shen et al. [94] propose multi-instance multilabel learning to weakly link video regions with lexical labels to generate diverse and informative captions. Rahman et al. [95] adopt two modalities, i.e., audio and video, focusing on the role of audio in DVC.

While promising results are produced by the aforementioned techniques, there is an apparent disassociation between event detection and description generation among all methods. A representation defined over both visual information and associated semantics can intrinsically couple the detection and captioning subtasks for DVC. This concept forms the basis of our technique, which further takes advantage of temporal refinement of visual cues, proposal representation with attentive fusion, and end-to-end training of the detection and captioning modules.

III. METHODOLOGY

This section introduces our architecture for DVC, as shown in Fig. 2. First, we discuss our SC-Net, followed by the descriptor transformer, the proposal generation network, and the caption generation network. We summarize the symbols and notations used in the text in Table I for ready reference.

A. Semantic Contextualization Network

We propose an SC-Net to learn a representation that is defined jointly over the visual contents and the caption semantic space of the videos. The SC-Net maps video visual contents to a quantitative representation of the semantic notions in video captions. To represent the semantic notions, we first learn universal caption embeddings. To that end, we use the Sent2Vec model [96] pretrained on Tweets (19.7B words), Wikipedia sentences (1.7B words), and the Toronto book corpus (0.9B words). The model learns a source embedding E_w for each word w in the vocabulary V with embedding dimension H. A sentence embedding is computed as the average of the embeddings of the constituent words, which are learned not only with unigrams but also with n-grams (i.e., n = 2, 3, 4). Formally, the embedding E_c for a given caption is modeled as

E_c := (1 / |L(s)|) β i_{L(s)} = (1 / |L(s)|) Σ_{w ∈ L(s)} E_w    (1)

where L(s) is the list of n-grams, including the unigrams, present in the caption, β ∈ R^{H×|V|} is the matrix of learned embeddings, and i_{L(s)} ∈ {0, 1}^{|V|} denotes a binary vector encoding L(s).
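To make (1) concrete, the following minimal sketch computes a caption embedding as the average of its unigram and n-gram embeddings. It assumes a toy random embedding table in place of the pretrained Sent2Vec model [96], so the dimension H and the helper names are illustrative only.

```python
# Minimal sketch of eq. (1): a caption embedding is the average of the
# embeddings of its unigrams and n-grams. The random `embeddings` table is a
# stand-in for the pretrained Sent2Vec model; it is not the model used in the paper.
import numpy as np

H = 8  # embedding dimension (hypothetical; the pretrained model uses a larger H)

def ngrams(tokens, n_max=4):
    """Return unigrams plus n-grams (n = 2, 3, 4) of a tokenized caption, i.e., L(s)."""
    grams = list(tokens)
    for n in range(2, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def caption_embedding(caption, embeddings):
    """E_c = (1 / |L(s)|) * sum of the embeddings of all items in L(s)."""
    L_s = ngrams(caption.lower().split())
    vecs = [embeddings[g] for g in L_s if g in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(H)

rng = np.random.default_rng(0)
caption = "the players talk during the match"
embeddings = {g: rng.standard_normal(H) for g in ngrams(caption.lower().split())}
print(caption_embedding(caption, embeddings).shape)  # (H,)
```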
TABLE I
SUMMARY OF SYMBOLS AND NOTATIONS USED IN THE TEXT
B. Descriptor Transformer

To that end, we transform the visual stream into a descriptor vector, capturing its temporal dynamics. We feed the time-series activations of δ frame sequences to construct the descriptor vector.

Let F = {f_1, f_2, ..., f_L} denote the features of a given video X frame sequence extracted with the SC-Net, where f_i ∈ R^m. We use a temporal resolution of δ = 64 and create a set of visual streams S = {S_1, S_2, ..., S_T} using the feature set, where T = |F|/δ. Here, S_i ∈ R^{m×δ} is a matrix, with each column representing a contextualized vector in the form of SC-Net activations. We compute a descriptor vector ξ for each visual stream. For that, we hierarchically process S_i in a neuronwise manner with a short FT Φ(.) [101]. In the first level of hierarchical processing, we take the δ activations of the jth neuron, i.e., α_j = [α_{j,1}, α_{j,2}, ..., α_{j,δ}] ∈ R^δ, and use the first κ coefficients of Φ(α_j) to construct ξ_j^1 ∈ R^κ, where the superscript 1 indicates the first level of the hierarchy. In the next level, α_j is divided into two components, i.e., α_j^{21} = {α_{j,1}, α_{j,2}, ..., α_{j,δ/2}} and α_j^{22} = {α_{j,δ/2+1}, α_{j,δ/2+2}, ..., α_{j,δ}}. The two components are separately processed to compute Φ(α_j^{21}) and Φ(α_j^{22}). We again retain the κ low-frequency coefficients of the transformations. These coefficients are concatenated to form ξ_j^2 ∈ R^{2κ} at the second level of the hierarchy. For l levels, we have ξ_j^l ∈ R^{2^{(l−1)}×κ}. In this work, we let l = {1, 2} based on empirical grounds. The descriptor ξ_j ∈ R^{3κ} for the jth neuron is computed by concatenating ξ_j^1 and ξ_j^2. The process of hierarchically computing the Fourier coefficients is repeated for all the m neurons, and the descriptor for the whole visual stream S_i is constructed as ξ ∈ R^{(3×κ×m)×1} by concatenating ξ_j for all neurons. This process is repeated for all the T streams of a video to generate T descriptors ξ_i, i = {1, ..., T}, that comprehensively encode the temporal dynamics of the video.

C. Proposal Generation Network

Our proposal module receives the series of descriptors ξ_i, i = {1, ..., T}, for a video. The descriptors are enriched with the associated semantics of the video, enabling the proposal module to detect semantically meaningful proposals.

To train the network, we generate densely sampled video segments, which are significantly longer than the temporal proposals we aim to detect. For instance, for a training video X_i with L_i frames and T_i visual streams, video segments are extracted by running a sliding window of length T_w = L_w/δ with a stride δ. We keep L_w ≫ kδ so that the training samples are long enough to simulate the long untrimmed videos, where k is the number of generated proposals. The network, at each time step t, takes the hidden state of the sequence encoder and outputs confidence scores for k proposals. Formally, confidence scores {η_t^j}_{j=1}^k, corresponding to the k proposals, are generated. At each time step, the model considers proposals of sizes 1, 2, ..., k time steps, corresponding to δ, 2δ, ..., kδ frames, respectively. This is done in a single forward pass at each time step. Thus, the model is able to consider multiple time scales in a single pass of the video without the need to rerun the video for various temporal scales. We use the compact visual descriptors, i.e., ξ = {ξ_i}_{i=1}^T, and their associated labels Y = {y_i}_{i=1}^T indicating which temporal intervals correspond to which actions in the video. At time step t, the ground truth (GT) label y_t ∈ R^k contains binary entries. The jth entry y_t^j is 1 if the corresponding proposal interval has a temporal intersection over union (tIoU) with the GT larger than 0.5. During training, the network is penalized for errors according to a weighted multilabel cross entropy to balance the positive and negative proposals. At any time step t, the network loss is computed as

L(η, t, ξ, Y) = − Σ_{j=1}^{k} [ w_0^j y_t^j log η_t^j + w_1^j (1 − y_t^j) log(1 − η_t^j) ]    (3)

where the weights w_0^j and w_1^j are calculated based on the number of positive and negative proposals in the training set, and η_t^j is the output prediction for the jth proposal at time step t. Hence, the total proposal loss for all the training videos is obtained by averaging along the time stamps, i.e.,

L_prop = (1 / |X|) Σ_{(X,y) ∈ X} (1 / T_w) Σ_{t=1}^{T_w} L(c, t, ξ, y)    (4)

where T_w is the length of the running sliding window and |.| indicates the cardinality of the set. The overall architecture of the proposed module follows the encoder–decoder framework similar to SST [45]. However, our encoder is composed of Φ(Ψ(.)), in contrast to C3D (e.g., SST and DAP). Moreover, we employ an LSTM, instead of a gated recurrent unit, as the decoder and set δ = 64, instead of 16. All other settings are similar to SST.
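The neuronwise hierarchical short-FT descriptor of Section III-B can be sketched as follows. This is a minimal illustration that assumes numpy's real FFT for Φ(.) and magnitude coefficients with κ = 2; random activations stand in for the SC-Net outputs, and the actual implementation details (windowing, coefficient selection) may differ.

```python
# Sketch of the two-level descriptor transformer: per neuron, keep the first κ
# low-frequency Fourier coefficients of the full δ-length activation signal,
# then of its two halves, and concatenate everything over all m neurons.
import numpy as np

def neuron_descriptor(alpha, kappa=2):
    """alpha: the δ activations of one neuron. Returns ξ_j ∈ R^{3κ}."""
    delta = len(alpha)
    level1 = np.abs(np.fft.rfft(alpha))[:kappa]                       # ξ_j^1 ∈ R^κ
    first, second = alpha[:delta // 2], alpha[delta // 2:]
    level2 = np.concatenate([np.abs(np.fft.rfft(first))[:kappa],      # ξ_j^2 ∈ R^{2κ}
                             np.abs(np.fft.rfft(second))[:kappa]])
    return np.concatenate([level1, level2])                           # 3κ values

def stream_descriptor(S, kappa=2):
    """S: m x δ matrix of SC-Net activations for one visual stream S_i."""
    return np.concatenate([neuron_descriptor(S[j], kappa) for j in range(S.shape[0])])

rng = np.random.default_rng(0)
m, delta = 4, 64                      # m neurons, temporal resolution δ = 64
S = rng.standard_normal((m, delta))   # stand-in for SC-Net activations
xi = stream_descriptor(S)
print(xi.shape)                       # (3 * κ * m,) = (24,)
```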
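The weighted multilabel cross entropy in (3) and its window-level average in (4) can be written compactly as below. The confidences, labels, and class weights are randomly generated for illustration, and the outer average over the training set X is omitted.

```python
# Sketch of the proposal losses (3) and (4): η holds sigmoid confidences for the
# k proposal scales at every time step, y holds the binary tIoU-based labels,
# and w0/w1 are the class-balancing weights (taken as given here).
import numpy as np

def step_loss(eta_t, y_t, w0, w1, eps=1e-8):
    """Weighted multilabel cross entropy at one time step, eq. (3)."""
    return -np.sum(w0 * y_t * np.log(eta_t + eps)
                   + w1 * (1.0 - y_t) * np.log(1.0 - eta_t + eps))

def proposal_loss(eta, y, w0, w1):
    """Average of eq. (3) over the T_w steps of one training window, eq. (4)."""
    return np.mean([step_loss(eta[t], y[t], w0, w1) for t in range(eta.shape[0])])

rng = np.random.default_rng(0)
Tw, k = 10, 3                                         # window length, proposal scales
eta = rng.uniform(0.05, 0.95, size=(Tw, k))           # predicted confidences
y = (rng.uniform(size=(Tw, k)) > 0.7).astype(float)   # tIoU > 0.5 labels
w0, w1 = np.full(k, 3.0), np.full(k, 1.0)             # up-weight the scarce positives
print(proposal_loss(eta, y, w0, w1))
```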
D. Caption Generation Network

For decoding event proposals and generating their captions, we follow the common practice in the existing related literature [43], [44], [68], originally introduced by Krishna et al. [43], and adapt an LSTM network [103] for the caption generation. The LSTM has an excellent ability to model longer sequences, which is required in DVC. In our captioning network, we also incorporate the temporal dynamic attention presented in [68]. However, our caption generation model is specifically induced to generate words from SC-Net features, which we employ after temporal refinement by the descriptor transformer. This aspect is unique to our network:

[Γ_t^f; Γ_t^i; Γ_t^o; c̃_t] = [σ; σ; σ; tanh] ( W_d [H_t; ρ_t] )    (5)

c_t = Γ_t^f ⊙ c_{t−1} + Γ_t^i ⊙ c̃_t    (6)

h_t = Γ_t^o ⊙ tanh(c_t)    (7)

where Γ_t^f, Γ_t^i, Γ_t^o, c_t, and h_t are the forget gate, input gate, output gate, memory cell, and current hidden state of the LSTM, respectively. W_d is the transformation matrix to be learned, and H_t = [ω_t, h_{t−1}] is the concatenation of the input word embedding ω_t and the previous hidden state h_{t−1} at time step t. ρ_t is computed from the last hidden state h_{i,t} of the ith proposal e_i and the corresponding descriptor vectors. For that matter, let V_{e_i,t} denote the descriptor vectors for the ith detected event proposal e_i at time step t, so

V_{e_i,t} = {ξ_start, ξ_{start+1}, ..., ξ_end}.    (8)

We can write

φ_{i,t} = W_a^T · tanh(W_d [V_{e_i,t}, h_{i,t}] + b)    (9)

A_{i,t} = exp(φ_{i,t}) / Σ_{k=1}^{d} exp(φ_{k,t})    (10)

where d is the length of the ith event descriptors in e_i and h_{i,t} represents the last hidden state for the ith proposal at time step t. The final attentive event representation is obtained by

ρ_t = Σ_{i=1}^{d} A_{i,t} · [V_{e_i,t}, h_{i,t}].    (11)

For our captioning module, the captioning loss L_cap is defined as the sum of the negative log likelihoods of the correct words in a sentence with W words, averaged over all generated proposals

L_cap = − (1 / |P|) Σ_{j=1}^{|P|} Σ_{i=1}^{W} log p(w_i)    (12)

where w_i is the ith word in the GT caption, and P is the set of proposals.
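A compact sketch of the attentive fusion in (9)–(11) is given below; random matrices stand in for the learned W_a, W_d, and b, and the dimensions are illustrative rather than those of the actual network.

```python
# Sketch of eqs. (9)-(11): score each [descriptor, hidden state] pair of a
# detected event, normalize the scores with a softmax, and return the
# attention-weighted fusion ρ_t that conditions the captioning LSTM.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def attentive_event_representation(V, h, Wa, Wd, b):
    """V: d x m descriptors of one proposal, h: its last hidden state."""
    fused = np.concatenate([V, np.tile(h, (V.shape[0], 1))], axis=1)  # [V_ei,t , h_i,t]
    phi = np.tanh(fused @ Wd + b) @ Wa                                # eq. (9)
    A = softmax(phi)                                                  # eq. (10)
    return A @ fused                                                  # eq. (11)

rng = np.random.default_rng(0)
d, m, hdim, p = 5, 6, 4, 8          # descriptors per event, feature/hidden/projection sizes
V = rng.standard_normal((d, m))     # descriptors ξ within the detected event
h = rng.standard_normal(hdim)       # last hidden state of the proposal
Wd = rng.standard_normal((m + hdim, p))
b = rng.standard_normal(p)
Wa = rng.standard_normal(p)
rho = attentive_event_representation(V, h, Wa=Wa, Wd=Wd, b=b)
print(rho.shape)                    # (m + hdim,)
```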
E. Joint Training

In our dense captioning framework, the proposal and captioning modules are trained jointly in an end-to-end manner. However, for better initialization, we first train the proposal module alone for ten epochs. Later, both modules are trained in an end-to-end manner with the loss function defined as follows:

L_total = λ_1 L_prop + λ_2 L_cap    (13)

where λ_1 and λ_2 are hyperparameters that balance the contributions of the two modules. In our experiments, we empirically set λ_1 = 1 and λ_2 = 2 using cross validation. In order to induce better models, during the end-to-end training, we only use those proposals that have tIoU ≥ 0.8 with the GT proposals.
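The joint objective (13), with the reported λ_1 = 1 and λ_2 = 2, and the two-stage schedule described above amount to the following sketch; the per-epoch loss values are dummies and the actual training steps are not shown.

```python
# Sketch of eq. (13) and the two-stage schedule: ten warm-up epochs on the
# proposal loss alone, then end-to-end training on the combined objective
# (captioning restricted to proposals with tIoU >= 0.8 against the GT).
def total_loss(l_prop, l_cap, lam1=1.0, lam2=2.0):
    """L_total = λ1 * L_prop + λ2 * L_cap."""
    return lam1 * l_prop + lam2 * l_cap

for epoch in range(60):
    l_prop, l_cap = 0.8 / (epoch + 1), 2.5 / (epoch + 1)  # dummy loss values
    if epoch < 10:
        loss = l_prop                      # stage 1: proposal module warm-up
    else:
        loss = total_loss(l_prop, l_cap)   # stage 2: joint end-to-end training
```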
IV. EVALUATION

A. Datasets

We evaluate our technique using the large-scale ActivityNet Captions [43] dataset and the YouCook-II [26] dataset for DVC. ActivityNet Captions comprises ∼100 k natural language sentences describing ∼20 k untrimmed real-world videos. Each video consists of at least two and on average three annotated events, with one human-provided caption per segment along with the start and end times of each event. The annotated sentences in the dataset contain 13.48 words on average, describing about 36 s of a video. Furthermore, there is almost 10% temporal overlap of the events in a video, which makes the dataset interesting and challenging. The dataset is split into 50%, 25%, and 25% (i.e., 10024/4926/5044) as training, validation, and test videos, respectively. The GT annotations of the test split are withheld for the online competition. YouCook-II comprises 2000 open-domain cooking videos. Each video is further divided into 3–16 segments elaborating the cooking steps. The segments are annotated and time stamped for temporal localization. The average length of each video is approximately 600 s. Following [26], we split the dataset into train, validation, and test sets with the ratio of 66%:23%:10%, respectively.

B. Evaluation Metrics

We perform evaluations with four widely used metrics: BLEU@N, METEOR, CIDEr, and ROUGE_L. The evaluation metrics for dense captioning consider the proposal accuracy as well as the captioning accuracy. We use the dense captioning evaluation source code provided by Krishna et al. [43] to measure event localization precision and the quality of the generated captions.¹ To evaluate the detected event proposals, we measure the recall and precision of the generated proposals. We report the metric scores and the scores of the proposals as averages taken over tIoU thresholds of 0.3, 0.5, 0.7, and 0.9 with the GT proposals.

¹ [Online]. Available: https://github.com/ranjaykrishna/densevid_eval
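The tIoU-based matching and the averaging over the thresholds {0.3, 0.5, 0.7, 0.9} described above can be sketched as follows. The proposal and GT segments are invented for illustration, and the per-threshold score shown is a simple recall rather than the full captioning metrics.

```python
# Sketch of temporal IoU matching and threshold-averaged scoring.
def tiou(seg_a, seg_b):
    """Temporal intersection over union of two (start, end) segments, in seconds."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = max(seg_a[1], seg_b[1]) - min(seg_a[0], seg_b[0])
    return inter / union if union > 0 else 0.0

def recall_at_threshold(proposals, gts, thr):
    """Fraction of GT segments matched by at least one proposal with tIoU >= thr."""
    hit = sum(any(tiou(p, g) >= thr for p in proposals) for g in gts)
    return hit / len(gts)

proposals = [(0.0, 12.0), (10.0, 30.0), (40.0, 55.0)]   # invented detections
gts = [(1.0, 11.0), (12.0, 31.0), (42.0, 60.0)]         # invented GT events
thresholds = [0.3, 0.5, 0.7, 0.9]
avg_recall = sum(recall_at_threshold(proposals, gts, t) for t in thresholds) / len(thresholds)
print(round(avg_recall, 3))
```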
TABLE II
EVENT DETECTION PERFORMANCE OF THE PROPOSAL NETWORK WITH AND WITHOUT CONTEXTUALIZATION AND REFINEMENT WITH THE SC-NET AND THE DESCRIPTOR TRANSFORMER
Results on four thresholds for tIoU on the ActivityNet Captions validation set are reported. For brevity, we refer to both contextualization and refinement as SC-Net.
C. Setup Details

We extract the framewise features of the ActivityNet Captions dataset videos using the activations of the avgpool layer of ResNet-152 [104]. We utilize these features and the K semantic notions from the caption corpora of the respective dataset to train our SC-Net. We select K = 2000 and 200 for the ActivityNet Captions and YouCook-II datasets, respectively. We approximate K from DBSCAN [105], which is able to capture data association and structure, by varying minPoints and eps such that the minimum number of outliers is produced (we refer to the original work [105] for algorithmic details). We select minPoints = 3 and eps = 0.8 in our approximation process. We then pass on the cluster number to the K-means algorithm to generate K clusters. We chose the popular kmeans++ for centroid initialization, which is relatively consistent and faster as compared to random and Forgy initialization techniques. The performance of SC-Net is subjective and is not highly sensitive to K within reasonable bounds.

Various techniques exist for exploring different hyperparameters, e.g., the number of units in each layer and the number of layers of deep networks [106]. To optimize the number of layers in the SC-Net, we tested networks with an increasing number of layers [107] and stopped where the performance peaked on our validation data. The hidden layer sizes varied in the interval [128, 2048]. During SC-Net training, dropout is set to 0.5 for better generalization, and the network is trained for 20 epochs with a batch size of 250. From the trained SC-Net, the activations of p(4) are used to generate the time-series signals for the subsequent temporal refinement. We take two coefficients with a two-level hierarchy (six coefficients per neuron stream) in comparison to three coefficients in a three-level hierarchy (21 coefficients per neuron stream) in [80]. Note that with a smaller number of coefficients, more noise (similar visual content in this case) cancellation is applied, which is desired in our case as it allows us to discriminate between visual content with subtle differences. The output feature is reduced to 500 dimensions with principal component analysis (PCA). For the proposal and caption generation networks, we incorporate two LSTM layers with a hidden state size of 512 each.

Before the end-to-end training of the proposal and captioning modules, we pretrain the proposal module individually for ten epochs and then perform end-to-end training with the captioning network. We use the Adam [108] optimizer with a dynamic learning rate, starting at 1 × 10⁻³ and reduced after every ten epochs. We adopt stochastic gradient descent for training in our experiments and train the models end-to-end for 50 epochs. We employ the TensorFlow framework and an NVIDIA Titan XP 1080 GPU for the development.

We generate 1000 proposals with the jointly trained model and pass the features of the high-confidence proposals (tIoU ≥ 0.8) to the caption generation network to generate the descriptions. Later, we organize the generated caption and associated proposal pairs based on their scores in descending order to select the top 100 proposal/caption pairs for evaluation.
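The K-estimation procedure described above can be sketched with scikit-learn as follows. Random vectors stand in for the Sent2Vec caption embeddings, so the number of clusters recovered here is not meaningful; only the eps and minPoints values mirror the ones reported.

```python
# Sketch of estimating K with DBSCAN (eps = 0.8, minPoints = 3) and then
# assigning semantic-concept IDs with k-means using kmeans++ initialization.
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

rng = np.random.default_rng(0)
caption_embeddings = rng.standard_normal((500, 16))   # stand-in for Sent2Vec vectors

db = DBSCAN(eps=0.8, min_samples=3).fit(caption_embeddings)
K = max(len(set(db.labels_) - {-1}), 1)               # ignore the DBSCAN noise label (-1)

kmeans = KMeans(n_clusters=K, init="k-means++", n_init=10, random_state=0)
concept_ids = kmeans.fit_predict(caption_embeddings)  # one semantic-concept ID per caption
print(K, concept_ids[:10])
```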
D. Results and Comparisons

1) Event Proposals: Before benchmarking the final captioning performance of our overall framework, we first demonstrate the performance boost achieved only by the event detection module with the proposed contextualization and refinement of the visual video contents in Table II. In the table, we refer to the results achieved by our "Contextualization & Temporal Refinement" module (see Fig. 2) as "with SC-Net" for brevity. The other variant replaces the module with C3D to extract the input video features. This variant is termed "w/o SC-Net." The anchor sizes and other experimental settings are kept the same for both models for a fair comparison. In the table, notice the significant improvement across all tIoU thresholds. The event detection with the SC-Net achieves an overall 11.55% gain in recall and a 12.52% gain in precision. This ascertains the efficacy of our proposed technique in event detection. We emphasize that the network and the descriptor transformer of our method work hand in hand to complete the pipeline. Both components are integral parts of the overall module and cannot be split for further ablative analysis.

2) Dense Video Captioning: To benchmark our technique, we compare its performance with the state-of-the-art methods, including Krishna's [43], Duan's [93], JEDDi-Net [47], TDA-CG [68], Masked Transformer [71], DVC [91], SDVC [44], and MFCC [95]. Table III presents the performance comparison of our method with the existing techniques on learned proposals and GT proposals using the validation set of the ActivityNet Captions dataset. We note that, among the existing methods, TDA-CG [68] must pass the video twice, to capture the past as well as the future context, which makes it computationally expensive. Moreover, Masked Transformer [71] must use optical flow features in addition to visual features to achieve the reported results. MFCC [95] additionally incorporates audio features and investigates their effect on DVC. In contrast, our proposed network does not require any additional modalities, such as optical flow or audio, and outperforms all these techniques. Note that incorporating optical flow or audio in our technique would further boost its performance.

In Table III, we also report the results of different variants of our technique for an ablative analysis of the strengths of the proposed method. The Base (C) model is implemented using the 3-D CNN to extract input video features. This variant does not utilize the contextualization and temporal refinement module (see Fig. 2), and the events are directly detected using the 3-D CNN features. The ResNet + MP variant employs the strategy of mean pooling the 2-D ResNet features to aggregate temporal information across the frames and performs event detection. It is apparent from the results that a 2-D CNN representation followed by mean pooling is more desirable than the 3-D CNN representation alone.

Hence, we eventually selected the activations of the 2-D ResNet's avgpool layer to encode the input videos. The mentioned results are based on the same layer. The ResNet + TE variant replaces the mean pooling strategy with the temporal encoding proposed in this work for the intraevent temporal modeling. As evident from the table, the proposed temporal modeling boosts model performance significantly across all metrics, i.e., B4, METEOR, CIDEr, and ROUGE, by 128.0%, 3.7%, 71.2%, and 9.1%, respectively.

Finally, SA-DVC uses both the temporal encoding and the visual-semantic contextualization. We refer to our technique as SA-DVC for semantics-aware dense video captioning. In the table, SA-DVC (1KP) selects the top 1K proposals to evaluate the performance. This number is 100 for SA-DVC (100P). We also provide detailed results on all tIoU values for our best performing variant of Table III in Table VIII (see Section VI-A).

In Table III, results are obtained using the initially proposed evaluation metrics as used by the ActivityNet evaluation server. These metrics have since been updated. Instead of one out of multiple incorrect predictions for a video, the updated metrics also account for the remaining incorrect ones.
TABLE III
DVC RESULTS ON ACTIVITYNET CAPTIONS [43] WITH THE LEARNED PROPOSALS (LEFT) AND GT PROPOSALS (RIGHT)
Average scores of BLEU (B1–B4), METEOR (M), CIDEr (C), and ROUGE (R) are reported across tIoU of 0.3, 0.5, 0.7, and 0.9. * indicates the use of additional modalities (e.g., optical flow, attributes).
TABLE IV
DVC RESULTS ON ACTIVITYNET CAPTIONS [43] WITH LEARNT PROPOSALS ON THE UPDATED METRIC
TABLE V
SERVER COMPUTED METEOR SCORES ON THE TEST SET OF ACTIVITYNET CAPTIONS [43]
Hence, the scores are generally lower. The updated metric scores are being gradually reported by researchers; however, their slow adoption currently precludes comprehensive benchmarking. Nevertheless, we still report our results on the updated metric in Table IV. Although the literature is currently void of the updated-metric performance of many techniques [43], [47], [68], [91], [93], a clear correlation between the updated and original metric scores is observable. Our technique achieves highly competitive results on both metrics. Note that the reasons for the performance gap with [44] on the METEOR metric are twofold. First, Mun et al. [44] additionally use RL to specifically optimize performance for METEOR. The ablation analysis of [44] shows a significant gain due to this additional optimization (from 6.92 (without RL) to 8.82 (with RL), as depicted in Table IV). The average number of proposals for the above is still 2.85. Second, when the number of proposals in [44] is increased from 2.85 to 77.9 (a number relatively closer to ours, i.e., 100), its score drops from 6.92 to 4.58. Despite the aforementioned facts, we outperform [44] in the B-4 and R metrics, as shown in Table IV.

In Table V, we report the performance of our technique on the test split of the ActivityNet Captions, as computed by the remote server, which only provides the METEOR score. In the table, Masked Transformer [71] must use ensemble features that employ optical flow in addition to the visual features. The proposed SA-DVC achieves highly competitive performance using only the visual features.

TABLE VI
DVC RESULTS ON THE YOUCOOK-II DATASET [26] WITH THE PREDICTED PROPOSALS
Average scores of BLEU B4, METEOR (M), CIDEr (C), and ROUGE (R) are reported across tIoU of 0.3, 0.5, 0.7, and 0.9.

In Table VI, we report the performance of our model on the YouCook-II [26] dataset. For the baseline comparison, we re-evaluate Masked Transformer [71] and report its score on the updated metrics. For a fair comparison, we select the top 100 proposals from [71] for evaluation. As depicted in Table VI, our technique clearly outperforms [71] across all metrics. In contrast, the performance of our model is not as strong as on the ActivityNet Captions [43] dataset. One possible reason could be the small object sizes or subtle actions in the YouCook-II dataset. Most DVC frameworks employ abstract visual features; therefore, it is hard to differentiate between, e.g., carrot and cucumber or peel and cut, resulting in a severe penalty by the evaluation metrics.
Fig. 4. Visualization of the computed sentence/caption embedding cloud projected to 3-D using PCA. (a) Semantic map of the "the players talk during the match" embedding and its nearest neighbors. For readability, only 5000 vectors are projected, and we select the 20 nearest neighbors of the query sentence. (b) The nearest neighbor points in isolation for a better view.
TABLE VII
MEMORY FOOTPRINT AND COMPUTATIONAL ANALYSIS OF OUR TECHNIQUE WITH EXISTING METHODS
TABLE VIII
DVC RESULTS OF THE PROPOSED METHOD AT DIFFERENT TIOU THRESHOLDS ON THE ACTIVITYNET CAPTIONS [43] DATASET ON LEARNED PROPOSALS
has failed to detect all the events and has also failed to differentiate and detect add pasta and add bacon as two separate events (recipe steps).

Fig. 7. Qualitative results of DVC by the proposed technique on the ActivityNet Captions [43] dataset. The bars indicate GT events (blue), events detected by the baseline method (brown), and events detected by the proposed method (green). Captions relevant to each method are presented under it separately. e_i corresponds to the event number.

VII. CONCLUSION

DVC is a challenging task that is currently handled by event detection followed by caption generation. We proposed a technique that couples these two tasks with visual-semantic contextualization. Our method additionally refines the resulting representation with a hierarchical FT. The refined representation is used by our event detector. The detected event representation is computed by attentively fusing the event's final hidden state with the descriptors. This way, we are able to make it more discriminative for utilization by the caption generation network. A thorough evaluation of our technique on the large-scale ActivityNet Captions dataset and the YouCook-II dataset shows competitive or better performance of our technique across multiple metrics.
ACKNOWLEDGMENT
The views and conclusions contained in this article are those
of the authors and should not be interpreted as representing
the official policies, either expressed or implied, of the Army
Research Office or the U.S. Government. The U.S. Government
is authorized to reproduce and distribute reprints for government
purposes notwithstanding any copyright notation herein.
[14] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, [37] X. Yang, K. Tang, H. Zhang, and J. Cai, “Auto-encoding scene graphs
“Generative adversarial text to image synthesis,” in Proc. Int. Conf. Mach. for image captioning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
Learn., 2016, pp. 1060–1069. Recognit., Jun. 2019, pp. 10677–10686.
[15] P. Anderson et al., “Vision-and-language navigation: Interpreting [38] L. Zhou, Y. Zhang, Y.-G. Jiang, T. Zhang, and W. Fan, “Re-Caption:
visually-grounded navigation instructions in real environments,” in Proc. Saliency-enhanced image captioning through two-phase learning,” IEEE
IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 3674– Trans. Image Process., vol. 29, pp. 694–709, Jul. 2019.
3683. [39] S. Ye, J. Han, and N. Liu, “Attentive linear transformation for image
[16] N. Aafaq, A. Mian, W. Liu, S. Z. Gilani, and M. Shah, “Video description: captioning,” IEEE Trans. Image Process., vol. 27, no. 11, pp. 5514–5524,
A survey of methods, datasets, and evaluation metrics,” ACM Comput. Nov. 2018.
Surv., vol. 52, no. 6, pp. 1–37, 2019. [40] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei, “Boosting image captioning with
[17] D. Gurari et al., “VizWiz grand challenge: Answering visual questions attributes,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 4904–4912.
from blind people,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., [41] Y. Wang, Z. Lin, X. Shen, S. Cohen, and G. W. Cottrell, “Skeleton key:
2018, pp. 3608–3617. Image captioning by skeleton-attribute decomposition,” in Proc. IEEE
[18] J. P. Bigham et al., “VizWiz: Nearly real-time answers to visual ques- Conf. Comput. Vis. Pattern Recognit., 2017, pp. 7378–7387.
tions,” in Proc. 23nd Annu. ACM Symp. User Interface Softw. Technol., [42] J. Xu, T. Mei, T. Yao, and Y. Rui, “MSR-VTT: A large video description
2010, pp. 333–342. dataset for bridging video and language,” in Proc. IEEE Conf. Comput.
[19] R. Tapu, B. Mocanu, and T. Zaharia, “Dynamic subtitles: A multimodal Vis. Pattern Recognit., 2016, pp. 5288–5296.
video accessibility enhancement dedicated to deaf and hearing impaired [43] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles, “Dense-
users,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops, 2019, captioning events in videos,” in Proc. IEEE Int. Conf. Comput. Vis., 2017,
pp. 2558–2566. pp. 706–715.
[20] X. Che, S. Luo, H. Yang, and C. Meinel, “Automatic lecture subtitle [44] J. Mun, L. Yang, Z. Ren, N. Xu, and B. Han, “Streamlined dense video
generation and how it helps,” in Proc. IEEE 17th Int. Conf. Adv. Learn. captioning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.,
Technol., 2017, pp. 34–38. 2019, pp. 6581–6590.
[21] Z. Xu, C. Hu, and L. Mei, “Video structured description technology [45] S. Buch, V. Escorcia, C. Shen, B. Ghanem, and J. C. Niebles, “SST:
based intelligence analysis of surveillance videos for public security Single-stream temporal action proposals,” in Proc. IEEE Conf. Comput.
applications,” Multimedia Tools Appl., vol. 75, no. 19, pp. 12 155–12 172, Vis. Pattern Recognit., 2017, pp. 6373–6382.
2016. [46] V. Escorcia, F. C. Heilbron, J. C. Niebles, and B. Ghanem, “Daps: Deep
[22] S. Xiao, Z. Zhao, Z. Zhang, Z. Guan, and D. Cai, “Query-biased self- action proposals for action understanding,” in Proc. Eur. Conf. Comput.
attentive network for query-focused video summarization,” IEEE Trans. Vis., 2016, pp. 768–784.
Image Process., vol. 29, pp. 5889–5899, Apr. 2020. [47] H. Xu, B. Li, V. Ramanishka, L. Sigal, and K. Saenko, “Joint event
[23] H. Wei, B. Ni, Y. Yan, H. Yu, X. Yang, and C. Yao, “Video summarization detection and description in continuous video streams,” in Proc. IEEE
via semantic attended networks,” in Proc. AAAI Conf. Artif. Intell., 2018, Winter Conf. Appl. Comput. Vis. Workshops, 2019, pp. 25–26.
pp. 216–223. [48] C. Baldassano, J. Chen, A. Zadbood, J. W. Pillow, U. Hasson, and K. A.
[24] S. Zhang, Y. Zhu, and A. K. Roy-Chowdhury, “Context-aware surveil- Norman, “Discovering event structure in continuous narrative perception
lance video summarization,” IEEE Trans. Image Process., vol. 25, no. 11, and memory,” Neuron, vol. 95, no. 3, pp. 709–721, 2017.
pp. 5469–5478, Nov. 2016. [49] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation
[25] C. T. Dang and H. Radha, “Heterogeneity image patch index and its appli- of word representations in vector space,” in 1st Int. Conf. Learning
cation to consumer video summarization,” IEEE Trans. Image Process., Representations, ICLR, Scottsdale, Arizona, USA, 2013.
vol. 23, no. 6, pp. 2704–2718, Jun. 2014. [50] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word
[26] L. Zhou, C. Xu, and J. J. Corso, “Towards automatic learning of proce- vectors with subword information,” Trans. Assoc. Comput. Linguistics,
dures from web instructional videos,” in Proc. AAAI Conf. Artif. Intell., vol. 5, pp. 135–146, 2017.
2018, pp. 7590–7598. [51] O. Duchenne, I. Laptev, J. Sivic, F. R. Bach, and J. Ponce, “Automatic
[27] J.-B. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev, and annotation of human actions in video,” in Proc. IEEE 12th Int. Conf.
S. Lacoste-Julien, “Unsupervised learning from narrated instruction Comput. Vis., 2009, pp. 1491–1498.
videos,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, [52] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for gen-
pp. 4575–4583. erating image descriptions,” in Proc. IEEE Conf. Comput. Vis. Pattern
[28] P. Bojanowski et al., “Weakly supervised action labeling in videos under Recognit., 2015, pp. 3128–3137.
ordering constraints,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 628– [53] K. Xu et al., “Show, attend and tell: Neural image caption generation with
643. visual attention,” in Proc. Int. Conf. Mach. Learn., 2015, pp. 2048–2057.
[29] H. Su, W. Qi, Z. Li, Z. Chen, G. Ferrigno, and E. De Momi, “Deep [54] P. Nguyen, T. Liu, G. Prasad, and B. Han, “Weakly supervised action
neural network approach in EMG-based force estimation for human- localization by sparse temporal pooling network,” in Proc. IEEE/CVF
robot interaction,” IEEE Trans. Artif. Intell., vol. 2, no. 5, pp. 404–412, Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6752–6761.
Oct. 2021. [55] K. Simonyan and A. Zisserman, “Two-stream convolutional networks
[30] B. Zhao, X. Li, and X. Lu, “CAM-RNN: Co-attention model based for action recognition in videos,” in Proc. Int. Conf. Neural Inf. Process.
RNN for video captioning,” IEEE Trans. Image Process., vol. 28, no. 11, Syst., 2014, pp. 568–576.
pp. 5552–5565, Nov. 2019. [56] R. Hou, C. Chen, and M. Shah, “Tube convolutional neural network
[31] L. Gao, X. Li, J. Song, and H. T. Shen, “Hierarchical LSTMs with adaptive (T-CNN) for action detection in videos,” in Proc. IEEE Int. Conf. Comput.
attention for visual captioning,” IEEE Trans. Pattern Anal. Mach. Intell., Vis., 2017, pp. 5823–5832.
vol. 42, no. 5, pp. 1112–1131, May 2020. [57] G. Yu and J. Yuan, “Fast action proposals for human action detection
[32] N. Aafaq, N. Akhtar, W. Liu, and A. Mian, “Empirical autopsy of deep and search,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015,
video captioning encoder-decoder architecture,” Array, vol. 9, p. 100052, pp. 1302–1311.
2021, doi: 10.1016/j.array.2020.100052. [58] P. Weinzaepfel, Z. Harchaoui, and C. Schmid, “Learning to track for
[33] Y. Pan, T. Yao, H. Li, and T. Mei, “Video captioning with transferred spatio-temporal action localization,” in Proc. IEEE Int. Conf. Comput.
semantic attributes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Vis., 2015, pp. 3164–3172.
2017, pp. 6504–6512. [59] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao, “A multi-
[34] J. Song, Y. Guo, L. Gao, X. Li, A. Hanjalic, and H. T. Shen, “From stream bi-directional recurrent neural network for fine-grained action
deterministic to generative: Multimodal stochastic RNNs for video detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016,
captioning,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 10, pp. 1961–1970.
pp. 3047–3058, Oct. 2019. [60] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei, “End-to-end learning
[35] B. Pan et al., “Spatio-temporal graph for video captioning with knowl- of action detection from frame glimpses in videos,” in Proc. IEEE Conf.
edge distillation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recog- Comput. Vis. Pattern Recognit., 2016, pp. 2678–2687.
nit., Jun. 2020, pp. 10867–10876. [61] Z. Shou, D. Wang, and S.-F. Chang, “Temporal action localization in
[36] Y. Feng, L. Ma, W. Liu, and J. Luo, “Unsupervised image captioning,” untrimmed videos via multi-stage CNNs,” in Proc. IEEE Conf. Comput.
in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2019, Vis. Pattern Recognit., 2016, pp. 1049–1058.
pp. 4120–4129.
[62] A. Montes, A. Salvador, S. Pascual, and X. Giro-i Nieto, “Tempo- [85] W. Pei, J. Zhang, X. Wang, L. Ke, X. Shen, and Y.-W. Tai, “Memory-
ral activity detection in untrimmed videos with recurrent neural net- attended recurrent network for video captioning,” in Proc. IEEE/CVF
works,” in Proc. 1st NIPS Workshop Large Scale Comput. Vis. Syst., Conf. Comput. Vis. Pattern Recognit., 2019, pp. 8339–8348.
2016. [86] W. Hao, Z. Zhang, and H. Guan, “Integrating both visual and audio cues
[63] S. Ma, L. Sigal, and S. Sclaroff, “Learning activity progression in LSTMs for enhanced video caption,” in Proc. Proc. 32nd AAAI Conf. Artif. Intell.,
for activity detection and early detection,” in Proc. IEEE Conf. Comput. 2018, pp. 6894–6901.
Vis. Pattern Recognit., 2016, pp. 1942–1950. [87] J. Xu, T. Yao, Y. Zhang, and T. Mei, “Learning multimodal attention
[64] J. Gao, C. Sun, Z. Yang, and R. Nevatia, “TALL: Temporal activity LSTM networks for video captioning,” in Proc. 25th ACM Int. Conf.
localization via language query,” in Proc. IEEE Int. Conf. Comput. Vis., Multimedia, 2017, pp. 537–545.
2017, pp. 5277–5285. [88] A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and
[65] L. A. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. B. Schiele, “Coherent multi-sentence video description with variable
Russell, “Localizing moments in video with natural language,” in Proc. level of detail,” in Proc. German Conf. Pattern Recognit., 2014,
IEEE Int. Conf. Comput. Vis., 2017, pp. 5804–5813. pp. 184–195.
[66] J. Gao, Z. Yang, K. Chen, C. Sun, and R. Nevatia, “TURN TAP: Temporal [89] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu, “Video paragraph
unit regression network for temporal action proposals,” in Proc. IEEE Int. captioning using hierarchical recurrent neural networks,” in Proc. IEEE
Conf. Comput. Vis., 2017, pp. 3648–3656. Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4584–4593.
[67] F. Caba Heilbron, J. Carlos Niebles, and B. Ghanem, “Fast temporal [90] Y. Xiong, B. Dai, and D. Lin, “Move forward and tell: A progressive
activity proposals for efficient detection of human actions in untrimmed generator of video descriptions,” in Proc. Eur. Conf. Comput. Vis., 2018,
videos,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 468–483.
pp. 1914–1923. [91] Y. Li, T. Yao, Y. Pan, H. Chao, and T. Mei, “Jointly localizing and
[68] J. Wang, W. Jiang, L. Ma, W. Liu, and Y. Xu, “Bidirectional attentive fu- describing events for dense video captioning,” in Proc. IEEE/CVF Conf.
sion with context gating for dense video captioning,” in Proc. IEEE/CVF Comput. Vis. Pattern Recognit., 2018, pp. 7492–7500.
Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7190–7198. [92] V. Iashin and E. Rahtu, “Multi-modal dense video captioning,” in Proc.
[69] A. Karpathy, A. Joulin, and L. F. Fei-Fei, “Deep fragment embeddings IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops, 2020,
for bidirectional image sentence mapping,” in Proc. Int. Conf. Neural Inf. pp. 4117–4126.
Process. Syst., 2014, pp. 1889–1897. [93] X. Duan, W. Huang, C. Gan, J. Wang, W. Zhu, and J. Huang, “Weakly
[70] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui, “Jointly modeling embedding supervised dense event captioning in videos,” in Proc. Int. Conf. Neural
and translation to bridge video and language,” in Proc. IEEE Conf. Inf. Process. Syst., 2018, pp. 3059–3069.
Comput. Vis. Pattern Recognit., 2016, pp. 4594–4602. [94] Z. Shen et al., “Weakly supervised dense video captioning,” in Proc.
[71] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong, “End-to-end IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5159–5167.
dense video captioning with masked transformer,” in Proc. IEEE/CVF [95] T. Rahman, B. Xu, and L. Sigal, “Watch, listen and tell: Multi-modal
Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8739–8748. weakly supervised dense event captioning,” in Proc. IEEE/CVF Int. Conf.
[72] A. Kojima, T. Tamura, and K. Fukunaga, “Natural language description Comput. Vis., 2019, pp. 8907–8916.
of human activities from video images based on concept hierarchy of [96] M. Pagliardini, P. Gupta, and M. Jaggi, “Unsupervised learning of sen-
actions,” Int. J. Comput. Vis., vol. 50, no. 2, pp. 171–184, 2002. tence embeddings using compositional N-gram features,” in Proc. Conf.
[73] P. Das, C. Xu, R. F. Doell, and J. J. Corso, “A thousand frames in just a North Amer. Chapter Assoc. Comput. Linguistics, 2018, pp. 528–540.
few words: Lingual description of videos through latent topics and sparse [97] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q.
object stitching,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., V. Le, “XLNet: Generalized autoregressive pretraining for language
2013, pp. 2634–2641. understanding,” in Proc. 33rd Int. Conf. Neural Inf. Process. Syst., 2019,
[74] N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and pp. 5753–5763.
S. Guadarrama, “Generating natural-language video descriptions using [98] R. Kiros et al., “Skip-thought vectors,” in Proc. 28th Int. Conf. Neural
text-mined knowledge,” in Proc. 27th AAAI Conf. Artif. Intell., 2013, Inf. Process. Syst., 2015, pp. 3294–3302.
pp. 541–547. [99] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTscore:
[75] S. Gella, M. Lewis, and M. Rohrbach, “A dataset for telling the stories of Evaluating text generation with BERT,” in Proc. Int. Conf. Learn. Rep-
social media videos,” in Proc. Conf. Empirical Methods Natural Lang. resent., 2020.
Process., 2018, pp. 968–974. [100] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning
[76] L. Zhou, Y. Kalantidis, X. Chen, J. J. Corso, and M. Rohrbach, “Grounded spatiotemporal features with 3D convolutional networks,” in Proc. IEEE
video description,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recog- Int. Conf. Comput. Vis., 2015, pp. 4489–4497.
nit., 2019, pp. 6571–6580. [101] A. V. Oppenheim, Discrete-Time Signal Processing. New Delhi, India:
[77] X. Wang, W. Chen, J. Wu, Y.-F. Wang, and W. Y. Wang, “Video caption- Pearson Education India, 1999.
ing via hierarchical reinforcement learning,” in Proc. IEEE/CVF Conf. [102] J. Wang, Z. Liu, Y. Wu, and J. Yuan, “Learning actionlet ensemble for
Comput. Vis. Pattern Recognit., 2018, pp. 4213–4222. 3D human action recognition,” IEEE Trans. Pattern Anal. Mach. Intell.,
[78] Q. Zheng, C. Wang, and D. Tao, “Syntax-aware action targeting for video vol. 36, no. 5, pp. 914–927, May 2014.
captioning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., [103] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
Jun. 2020, pp. 13093–13102. Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[79] Z. Zhang et al., “Object relational graph with teacher-recommended [104] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
learning for video captioning,” in Proc. IEEE/CVF Conf. Comput. Vis. recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016,
Pattern Recognit., Jun. 2020, pp. 13275–13285. pp. 770–778.
[80] N. Aafaq, N. Akhtar, W. Liu, S. Z. Gilani, and A. Mian, “Spatio-temporal [105] M. Ester et al., “A density-based algorithm for discovering clusters
dynamics and semantic attribute enriched visual encoding for video in large spatial databases with noise,” in Proc. 2nd Int. Conf. Knowl.
captioning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Discovery Data Mining, 1996, pp. 226–231.
2019, pp. 12479–12488. [106] Y. Bengio, “Practical recommendations for gradient-based training of
[81] C. Yan et al., “STAT: Spatial-temporal attention mechanism for video deep architectures,” in Neural Networks: Tricks of the Trade. New York,
captioning,” IEEE Trans. Multimedia, vol. 22, no. 1, pp. 229–241, NY, USA: Springer, 2012, pp. 437–478.
Jan. 2019. [107] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio,
[82] L. Yao et al., “Describing videos by exploiting temporal structure,” in “An empirical evaluation of deep architectures on problems with many
Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 4507–4515. factors of variation,” in Proc. 24th Int. Conf. Mach. Learn., 2007,
[83] Z. Gan et al., “Semantic compositional networks for visual cap- pp. 473–480.
tioning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, [108] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
pp. 1141–1150. in Proc. Int. Conf. Learn. Represent., 2015, p. 13.
[84] J. Wang, W. Wang, Y. Huang, L. Wang, and T. Tan, “M3: Multimodal [109] N. Akhtar and A. Mian, “Threat of adversarial attacks on deep learning
memory modelling for video captioning,” in Proc. IEEE/CVF Conf. in computer vision: A survey,” IEEE Access, vol. 6, pp. 14410–14430,
Comput. Vis. Pattern Recognit., 2018, pp. 7512–7520. 2018.
Nayyer Aafaq received the B.E. degree (with distinction) in avionics from the College of Aeronautical Engineering, National University of Sciences and Technology (NUST), Islamabad, Pakistan, in 2007, and the M.S. degree (with high distinction) in systems engineering from the Queensland University of Technology, Brisbane, QLD, Australia, in 2012. He is currently working toward the Ph.D. degree with the School of Computer Science and Software Engineering, University of Western Australia (UWA), Crawley, WA, Australia.
His research in computer vision and pattern recognition has been published in prestigious venues of the field, including the IEEE/CVF Conference on Computer Vision and Pattern Recognition and ACM Computing Surveys. He is a recipient of the Scholarship for International Research Fees at UWA. He was a Research Assistant with STG Research Institute, Pakistan, from 2007 to 2011, and a Lecturer with NUST from 2013 to 2017. His current research interests include deep learning, video analysis, and the intersection of natural language processing, computer vision, and machine learning.

Naveed Akhtar received the master's degree in computer science from Hochschule Bonn-Rhein-Sieg, Sankt Augustin, Germany, in 2012, and the Ph.D. degree in computer vision from the University of Western Australia (UWA), Crawley, WA, Australia, in 2017.
He has been a Research Fellow with UWA since 2017. He was a Research Fellow with the Australian National University, Canberra, ACT, Australia. He was a recipient of multiple scholarships during his Ph.D. research. His research in computer vision and pattern recognition has been published in prestigious venues of the field, including the IEEE Conference on Computer Vision and Pattern Recognition and the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE. He was also a reviewer for these venues. His current research interests include adversarial machine learning, action recognition, and hyperspectral image analysis.
Dr. Akhtar was a runner-up for the Canon Extreme Imaging Competition in 2015.