Cross-Domain Modality Fusion for Dense Video Captioning

Nayyer Aafaq, Ajmal Mian, Senior Member, IEEE, Wei Liu, Member, IEEE, Naveed Akhtar, and Mubarak Shah, Fellow, IEEE

Abstract—Dense video captioning requires localization and description of multiple events in long videos. Prior works detect events in videos solely relying on the visual content and completely ignore the semantics (captions) related to the events. This is undesirable because human-provided captions often also describe events that are visually nonpresent or subtle to detect. In this research, we propose to capitalize on this natural kinship between events and their human-provided descriptions. We propose a semantic contextualization network to encode the visual content of videos by representing it in a semantic space. The representation is further refined to incorporate temporal information and transformed into event descriptors using a hierarchical application of the short Fourier transform. Our proposal network exploits the fusion of semantic and visual content, enabling it to generate semantically meaningful event proposals. For each proposed event, we attentively fuse its hidden state and descriptors to compute a discriminative representation for the subsequent captioning network. Thorough experiments on the standard large-scale ActivityNet Captions dataset and additionally on the YouCook-II dataset show that our method achieves competitive or better performance on multiple popular metrics for the problem.

Impact Statement—Artificial intelligence (AI) has unlocked myriad possibilities to help people with disabilities, e.g., giving voice to nonverbal people, sign language translation, overcoming autism, and other motor disabilities. Recently, integration of vision and language has further enabled AI to assist nearly 2.2 billion people with vision impairment. Such AI models must comprehend both visual and language domains to provide solutions for daily life challenges of the visually impaired, e.g., navigation, reading, and understanding of their surrounding events. Dense video captioning (DVC) is one of the challenges that the vision and language research communities jointly tackle to describe visual events in natural languages. Our algorithm proposes to leverage both modalities and enhance the comprehension capability of a DVC framework.

Index Terms—Context modeling, dense video captioning (DVC), event localization, language and vision, video captioning.

Manuscript received 31 July 2021; revised 24 October 2021; accepted 4 December 2021. Date of publication 10 December 2021; date of current version 17 October 2022. This work was supported in part by the Australian Research Council under Discovery Grant DP190102443 and in part by the Army Research Office under Grant W911NF-19-1-0356. The work of Naveed Akhtar was supported by the Australian Government through an Office of National Intelligence National Intelligence Postdoctoral Grant under Project NIPG-2021-001. This article was recommended for publication by Associate Editor Guilherme Nelson DeSouza upon evaluation of the reviewers' comments. (Corresponding author: Nayyer Aafaq.)

Nayyer Aafaq, Ajmal Mian, Wei Liu, and Naveed Akhtar are with the University of Western Australia, Crawley, WA 6009, Australia (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). Mubarak Shah is with the University of Central Florida, Orlando, FL 32816 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/TAI.2021.3134190

I. INTRODUCTION

ADVANCEMENTS in deep learning [1] have led to tremendous progress in computer vision and natural language processing (NLP). Various deep learning techniques have resulted in unprecedented performance for computer vision tasks, e.g., image classification [2], [3], object detection [4], and NLP tasks, e.g., machine translation [5], representation learning [6], question answering [7], [8], dialogue generation [9], and reasoning [10]. Based on these advancements, there is a desire to explore and solve new problems that enable concept comprehension and reasoning capabilities at the intersection of machine vision and natural language. To name a few, such problems include visual question answering [11], visual dialogue [12], multimodal machine translation [13], text-to-image synthesis [14], and vision-and-language navigation [15]. One such problem that is rapidly attracting interest from the computer vision and NLP communities is dense video captioning (DVC). The challenge in this problem is to detect multiple complex and possibly overlapping events in a long untrimmed video and then generate their human-meaningful descriptions. The application areas are boundless [16] and are not limited to assistance to the visually impaired [17], [18], automatic video subtitling [19], [20], video surveillance [21], video summarization [22]–[25], generation of written procedures from instructional videos [26]–[28], and assistive robotics [29].

The task of DVC is far more involved than conventional video captioning [30]–[34], and even more so when compared with image captioning [13], [35]–[41]. In contrast to video captioning, which intends to describe a short video clip (e.g., up to 30 s long in MSR-VTT [42]) in a single sentence, DVC requires understanding of a much longer and complex video (e.g., up to 600 s long in ActivityNet Captions [43]). It also demands addressing challenges that are not encountered in conventional captioning. For instance, DVC requires explicating precise boundaries of events of interest that may range across multiple time scales to describe them. Events are normally not even a concern in conventional video captioning. Currently, the challenges of DVC are addressed by dividing this task into subtasks of event detection and event captioning [44]. The former is performed with event proposal networks [45], [46], whereas the latter employs independent captioning modules. The modules performing these subtasks are subsequently unified by either training them in an alternating fashion [43] or in an end-to-end [47] manner. Although such a strategy encourages
induction of correlation between the modules. However, taking Fig. 1 for example, such divide and conquer strategies will not differentiate event 1 (e1) from event 2 (e2) or event 3 (e3) from event 4 (e4), as they are visually similar.

Fig. 1. Example from the ActivityNet Captions [43] dataset depicting challenges of DVC. Visual content shows a group of people standing on a ground. However, semantically, it has been described as four different events by a human caption provider. For a DVC technique that relies solely on visual content, it is hard to successfully identify these events that interest humans.

Intuitively, event segmentation in video and in language are strongly correlated. Research shows that human brain activity is naturally structured into semantically meaningful events [48]. Therefore, human description of a video is a natural source of information to localize events with similar visual content. For example, e1 & e2 and e3 & e4 in Fig. 1 have similar visual content, respectively. Such pairing of events results in more natural descriptions that involve coreference. On the other hand, if we take the coreference as a language cue, then we can better detect event boundaries. What is more prevalent is that there are events and subtleties that are not visually present. Take e2 in Fig. 1 for example; there is no visual indicator of what they are conversing about; instead, the caption states that it is about "the next move." However, in e3 and e4, even though "coaches" and "injured players" are present, they are not easily detectable from visual content alone due to their lack of apparent differences from normal players. This inspires a key strategy in our technique. We propose a semantic contextualization network (SC-Net), which is designed to employ both linguistic and visual cues for event proposal generation. During its training stage, it explicitly accounts for human-provided captions. Rather than incorporating individual captions into each video during training, which would produce an overfitted model, we choose to cluster similar captions into an optimal number of clusters. Thanks to the success of word embeddings [49], [50] in capturing semantic similarity between words and documents, we group the caption embeddings into clusters. The cluster IDs are then used as proxy labels to provide supervision to the SC-Net, which produces semantically contextualized visual embeddings. These are further encoded in a descriptor transformer (see Fig. 2) to capture the temporal cues among the framewise features. This enables our event proposal module to discern semantically meaningful events even when there are no visually distinguishable activities present in the video (see Section VI-B).

Prior art generally uses a sliding window strategy [51] for localizing events in a video. This strategy does not scale well to longer events. Hence, more recent methods [43], [44] adopt dense action proposals (DAP) and single-stream temporal action proposals (SST) that run through a video only once to make proposal predictions. While attractive, these methods altogether ignore the semantic information and only consider visual information for proposal predictions. Few methods [52], [53] have attempted to integrate the two modalities in the image captioning framework. However, they employ visual and linguistic information for spatial region representation learning, which is not the case in our framework. For detailed differences, refer to Section II-B.

Once the proposals are detected, we need to generate captions for the selected event proposals using a sequential captioning network. To that end, contemporary methods represent the detected proposals via the final hidden state of the proposal network. Since the detected events are dense (e.g., 1000 per video) and they overlap with each other significantly, such representation is insufficient to distinguish the overlapped events. To address this, we propose to compute event features more precisely using attentive fusion of the final hidden state of the proposal network and the descriptors (i.e., temporally refined features from SC-Net) within that proposal. We utilize this fused proposal representation to feed into the captioning network that generates semantically rich captions.

We summarize the contributions of our work as follows.
1) We propose a framework that contextualizes visual content using linguistic information for DVC. The proposed SC-Net is designed to employ both linguistic and visual cues. This allows the proposal generation network to generate semantically meaningful event proposals.
2) We present the descriptor transformer to capture temporal cues among the framewise SC-Net contextualized output, which requires no training. The semantically and temporally enriched output from the descriptor transformer boosts event localization and the quality of description for each event.
3) We evaluate the proposed technique for DVC on the large-scale ActivityNet Captions [43] dataset. We also report the first DVC results (evaluating on updated metrics) on the YouCook-II [26] dataset. We achieve competitive or better results on multiple metrics on both datasets.
II. RELATED WORK

In the following, we discuss the related literature for event detection, visual-semantic joint modeling, video captioning, and DVC. These subtasks are integral parts of the overall problem at hand. More recent DVC techniques follow this discussion.

A. Event Detection in Videos

Event detection is an integral component of the long-standing problem of untrimmed video understanding [54]–[56]. In recent years, event detection has been mainly handled as a spatiotemporal [57], [58] or temporal-only detection task [46], [59]–[63]. For the former, events are localized in spatiotemporal cubes, while temporal-only methods directly focus on predicting the start and end time stamps of the activities in videos, ignoring the spatial localization of objects. Predominantly, temporal-only methods are extended into language and vision tasks, such as dense captioning [43] and query-based activity localization [64], [65]. Compared to the more conventional sliding window strategy for event detection [61], [66], feedforward neural networks are more efficient in terms of processing videos in their entirety. The former must explicitly compute a subset of windows for event detection using techniques such as dictionary learning [67] or recursive networks [46], whereas the latter [45], [46] only require a forward pass through the network model. More recently, Wang et al. [68] have employed a bidirectional recurrent network and passed the video twice to improve the quality of event proposals over [45]. However, none of the aforementioned methods takes advantage of the available linguistic information, which is the key novelty of our technique.

Fig. 2. Illustration of the proposed DVC framework: First, we contextualize the visual contents in a shared high-level semantic space learned by the SC-Net. The neuronwise time-series signal of SC-Net activations is encoded with temporal information using the descriptor transformer. The resulting descriptor is fed to the proposal generation network for detecting semantically meaningful event boundaries. The detected events' hidden state and the descriptors within the detected event are fused using the attention mechanism. The attentive representation of the proposal is then fed to the LSTM-based caption generation module. The event proposal and caption generation modules are trained jointly in an end-to-end manner.

B. Visual-Semantic Information Integration

There have been previous works that integrate visual and linguistic information to gain benefit from their joint modeling. For instance, Karpathy and Fei-Fei [52] model the correspondence between language and visual data by learning an alignment model. They assume that contiguous segments in a sentence refer to spatial locations in the image. Hence, they propose a model that is able to align the sentence segments with the spatial locations by associating the two modalities through a multimodal embedding space. For that purpose, they employ a region-based convolutional neural network (R-CNN) [4] to detect objects in the image. Then, using bidirectional image sentence mapping [69], they retain the top-19 object locations and learn representations for all the 19 object bounding boxes and of the whole image. A bidirectional recurrent neural network (Bi-RNN) is used in their method to compute the words' representation from its hidden state. Their model learns to score the similarity between words and regions of an image as a function of R-CNN object detection with the outputs of the Bi-RNN. Xu et al. [53] introduce an attention model that learns to gaze at and describe the salient objects in the image. Unlike Karpathy and Fei-Fei [52], who use an object detector to get the regional representations, they extract feature vectors of an image from fully connected as well as lower layers of the convolutional neural network (CNN). By this, they capture the correspondences between the regional feature vectors and 2-D image portions.

Pan et al. [70] propose to integrate the visual and semantic information by incorporating relevance and coherence joint learning alongside long short-term memory (LSTM) training. The relevance loss captures the relationship between the visual content and the semantics of the entire sentence, whereas the coherence loss captures the contextual relationship among the generated words of the sentence. The model is jointly trained with the coherence and relevance losses to generate semantically rich sentences.

The aforementioned methods mainly deal with images or short single-event videos as opposed to the long multievent videos considered in our work. Furthermore, they employ visual and linguistic information for spatial region representation learning. Although Pan et al. [70] improve the semantics of generated sentences with visual content, in sharp contrast to our technique, they altogether ignore the temporal dynamics of the video. However, for video representations, detecting the temporal time stamps in long untrimmed videos is more crucial. This is one of the key differences between our proposed semantic contextualization for visual content and the spatial representation learning in image captioning. Moreover, existing techniques either introduce an attention mechanism or learn a joint space projection of the two modalities, whereas we fuse the linguistic and visual cues by learning a network in a supervised manner. Finally, these methods also differ from our work in linguistic information exploitation. We leverage sentences in the sense that we first cluster them into similar semantic concepts and assign a unique ID to each cluster for supervised training of the SC-Net. Then, similar visual features and the corresponding semantic concepts are fused together using the SC-Net.

More recently, Mun et al. [44] and Zhou et al. [71] have proposed to incorporate linguistic information and context in the dense video captioning framework. However, both methods use the two modalities implicitly while training the model in an end-to-end manner. This way, the representations used by their proposal generation models lack associated linguistic information. On the other hand, our method explicitly integrates the linguistic and visual information for proposal generation. The implicit guidance of language information comes as an additional advantage to our technique during the end-to-end training.
C. Video Captioning

Before the pervasive use of deep learning, traditional video captioning methods [72]–[74] employed the classical approach of detecting the subject (S), verb (V), and object (O) in a video and described them in natural language using template-based techniques. With the availability of large-scale datasets [75], [76], more recent video captioning techniques strongly rely on neural networks for this task, with encoder–decoder schemes being the widely employed backbone. Such frameworks first encode visual inputs with CNNs and then decode them into natural language sentences using recurrent networks. More recent methods augment this scheme with advanced concepts of, e.g., reinforcement learning (RL) [77], object and action modeling [35], [78], [79], incorporating the Fourier transform (FT) with the CNN [80], attention mechanisms [81], [82], semantic attribute learning [33], [83], multimodal memory [84], [85], and audio integration [86], [87] for improved performance. Notwithstanding the superior performance, these methods are limited to processing short single-event videos and describing them in a single sentence. Few attempts to describe videos with multiple sentences/a paragraph have also been made that employ event proposal or captioning modules hierarchically to generate multiple sentences [88]–[90]. Compared to the problem of describing a short video by a single or rarely by multiple sentences, the challenges of DVC are multifold, as noted in Section I. This renders most of the techniques for conventional video captioning ineffective for the task of DVC in their original form.

D. Dense Video Captioning

The task of DVC was introduced by Krishna et al. [43]. In contrast to video captioning, which describes short videos in a single sentence, DVC first involves detection of multiple, possibly overlapping, events in long videos and then describes all the detected events in natural language. Most contemporary works tackle the problem as a detection and description framework in a supervised [43], [44], [47], [68], [71], [90]–[92] or weakly supervised manner [93]–[95]. Most of the methods address the challenges of this task with a bimodule framework. The two modules include a proposal generation module to detect events in the input video and a captioning module to generate the captions for the detected events.

Krishna et al. [43] incorporate a multiscale proposal generation network [46] in the aforementioned framework and proposed an attention-based captioning network to capture the event context. Wang et al. [68] employ a bidirectional proposal generation network to improve the proposals' generation accuracy by better contextualizing the events within the video. Li et al. [91] propose temporal coordinates and descriptiveness regression to localize the proposals in the video and employed an attribute-augmented captioning network [40] for improved performance. Zhou et al. [71] propose to adopt the transformer [5] as the captioning module to tackle long dependencies. Mun et al. [44] propose an event sequence generation network that reduces the number of proposals and exploits visual and linguistic context implicitly while training the model. Iashin and Rahtu [92] incorporate audio and speech modalities to further improve the performance of the DVC framework.

As human annotations of videos are laborious and expensive, weakly supervised methods attempt to mitigate this problem. These methods do not use temporal segment annotations for training of the DVC model. Instead, these techniques rely on the assumption that each caption describes one temporal segment and each segment has one caption. Duan et al. [93] take labeled captions as input for weakly supervised dense event captioning by adopting a cycle-consistent learning strategy. Shen et al. [94] propose multi-instance multilabel learning to weakly link video regions with lexical labels to generate diverse and informative captions. Rahman et al. [95] adopt two modalities, i.e., audio and video, focusing on the role of audio in DVC.

While promising results are produced by the aforementioned techniques, there is an apparent disassociation between event detection and description generation among all methods. A representation defined over both visual information and the associated semantics can intrinsically couple the detection and captioning subtasks for DVC. This concept forms the basis of our technique, which further takes advantage of temporal refinement of visual cues, proposal representation with attentive fusion, and end-to-end training of the detection and captioning modules.

III. METHODOLOGY

This section introduces our architecture for DVC, as shown in Fig. 2. First, we discuss our SC-Net, followed by the descriptor transformer, proposal generation network, and caption generation network. We summarize the symbols and notations used in the text in Table I for ready reference.

TABLE I: SUMMARY OF SYMBOLS AND NOTATIONS USED IN THE TEXT

A. Semantic Contextualization Network

We propose an SC-Net to learn a representation that is defined jointly over the visual contents and the caption semantic space of the videos. The SC-Net maps video visual contents to a quantitative representation of the semantic notions in video captions. To represent the semantic notions, we first learn universal caption embeddings. To that end, we use the Sent2Vec model [96] pretrained on Tweets (19.7B words), Wikipedia sentences (1.7B words), and the Toronto book corpus (0.9B words). The model learns a source embedding E_w for each word w in the vocabulary V with embedding dimension H. A sentence embedding is computed as the average of the embeddings of the constituent words, which are learned not only with unigrams but also with n-grams (i.e., n = 2, 3, 4). Formally, the embedding E_c for a given caption is modeled as

E_c := \frac{1}{|L(s)|} \, \beta \, i_{L(s)} = \frac{1}{|L(s)|} \sum_{w \in L(s)} E_w    (1)

where L(s) is the list of n-grams, including unigrams, present in the caption, β ∈ R^{H×|V|}, and i_{L(s)} ∈ {0, 1}^{|V|} denotes a binary vector encoding the given caption. Simple arithmetic combination (e.g., summation and mean) of word embeddings (e.g., BERT [6], XLNET [97], or skip-thoughts [98]) is currently the most popular approach for obtaining vector representations of n-grams or sentences in the captioning literature [99]. However, we use Sent2Vec as the algorithm is characterized by its low computational complexity (O(1) vector operations per word processed), while simultaneously showing good performance on a wide range of evaluation tasks. We emphasize that the embeddings can be replaced with any state-of-the-art sentence or document embedding algorithm for potential performance improvement.

We cluster the computed embeddings with K-means to form a semantic codebook, where the centroids of the clusters correspond to abstract popular notions in the video captions. We use the indices of the clusters as proxy labels for the supervised training of the SC-Net. The SC-Net is designed to learn the semantics associated with the videos in a nonlinear manner, and when deployed, it maps a visual representation to a proxy label.
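As a concrete illustration of this step, the sketch below builds such a codebook, assuming precomputed word vectors stand in for the pretrained Sent2Vec model (which additionally averages learned n-gram embeddings). The names word_vectors, captions, and K are illustrative and not part of the released implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def caption_embedding(caption, word_vectors, dim):
    """Average the embeddings of the caption's constituent tokens, as in Eq. (1).
    Sent2Vec also averages learned n-gram (n = 2, 3, 4) embeddings; only
    unigrams are used here to keep the sketch short."""
    vecs = [word_vectors[w] for w in caption.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def build_semantic_codebook(captions, word_vectors, dim, K):
    """Cluster caption embeddings into K semantic notions; the cluster
    indices serve as proxy labels for supervising the SC-Net."""
    E = np.stack([caption_embedding(c, word_vectors, dim) for c in captions])
    kmeans = KMeans(n_clusters=K, init="k-means++", n_init=10, random_state=0)
    proxy_labels = kmeans.fit_predict(E)      # one cluster ID per caption
    return E, proxy_labels, kmeans.cluster_centers_
```

Each training frame then inherits the proxy label of the caption cluster associated with the event it belongs to, which is the supervision signal used below.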
Fig. 3. Proposed SC-Net: The network maps similar visual contents in videos to the associated semantic notions. The notions are quantitatively encoded with compositional n-gram caption features, over which a semantic codebook is generated with K-means clustering. The SC-Net maps features of the input video frames (avgpool layer activations of ResNet-152) to the proxy labels in the codebook. The activations of the highlighted layer are further processed by the subsequent modules, as shown in Fig. 2.

Depicted in Fig. 3, the SC-Net is a Q-layer perceptron, where Q = 5 in our experiments. The numbers of neurons in the qth fully connected layer p(q) are as follows: p(1) = 2048, p(2) = 1024, p(3) = 512, p(4) = 256, and p(5) = 2000. We use rectified linear unit activations for the layers. The final layer is followed by a softmax layer of K units. We define the loss function of our network as follows:

\mathcal{L}(\Theta, \mathcal{X}) = \mathbb{E}_{X_i \in \mathcal{X}} \left[ \frac{1}{L_i} \sum_{j=1}^{L_i} \left\| z_i^j - \Psi\!\left(\Upsilon(x_i^j)\right) \right\|_2^2 \right]    (2)

where x_i^j ∈ X_i is the jth frame of the ith video X_i, Υ(.) is a visual representation of the frame, Ψ(.) is the SC-Net transformation, L_i is the number of frames in the ith video, Θ represents the network parameters, and z_i^j is the proxy label of the said frame. The SC-Net learns a semantically contextualized visual representation of the visual contents that is used to extract features from video frames. For the frames corresponding to overlapping events, we assign the frames to the event with the shorter temporal length and the respective proxy label. We use the layer p(4) to produce a series of m = 256 neuron activations per video frame.
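The paper reports a TensorFlow implementation; the following Keras sketch mirrors the stated layer sizes and exposes the p(4) activations used downstream. It is a minimal sketch under stated assumptions: the cross-entropy on proxy labels stands in for the ℓ2 objective of Eq. (2), and the dropout placement is illustrative.

```python
import tensorflow as tf

def build_sc_net(K=2000, feature_dim=2048):
    """Q = 5 fully connected layers mapping a frame's ResNet-152 avgpool
    feature to the K proxy labels (softmax output)."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(feature_dim,)),
        tf.keras.layers.Dense(2048, activation="relu"),             # p(1)
        tf.keras.layers.Dense(1024, activation="relu"),             # p(2)
        tf.keras.layers.Dense(512, activation="relu"),              # p(3)
        tf.keras.layers.Dense(256, activation="relu", name="p4"),   # p(4): contextualized frame features
        tf.keras.layers.Dropout(0.5),                               # dropout 0.5 during training (Sec. IV-C)
        tf.keras.layers.Dense(K, activation="softmax"),             # p(5) followed by softmax over K notions
    ])

sc_net = build_sc_net(K=2000)
sc_net.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# sc_net.fit(frame_features, frame_proxy_labels, epochs=20, batch_size=250)

# After training, the 256-d p(4) activations form the per-frame time series
# that the descriptor transformer refines.
p4_extractor = tf.keras.Model(sc_net.inputs, sc_net.get_layer("p4").output)
```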
Most of the contemporary baseline methods rely on C3D [100] for spatiotemporal encoding. The proposed SC-Net has a more comprehensive representation. Despite being more sophisticated, our representation has a lower memory footprint, which also makes our method more efficient—established with the results in Table VII. The SC-Net structure specifically leverages language information for subsequent proposal generation. By construction, this is not possible for the existing baseline methods. In the overall pipeline of our method, the enriched representation of the SC-Net strengthens the coupling between language and visual information while training in isolation and in an end-to-end manner.

B. Descriptor Transformer

We transform frame-level activations into visual streams using a temporal resolution of 64, which also significantly reduces the computational cost of the proposal and captioning models. The short-time FT [80], [101], [102] is able to capture the temporal dynamics of time-series data without training and allows us to achieve competitive performance simultaneously. In contrast to [80] and [102], we use the FT in different settings in terms of its application, neural stream temporal length, hierarchical architecture, and coefficient distribution (see Section IV-C). We employ this technique for the event detection module. To that end, we transform the visual stream into a descriptor vector, capturing its temporal dynamics. We feed the time-series activations of δ frame sequences to construct the descriptor vector.

Let F = {f_1, f_2, ..., f_L} denote the features of a given video X frame sequence extracted with the SC-Net, where f_i ∈ R^m. We use a temporal resolution of δ = 64 and create a set of visual streams S = {S_1, S_2, ..., S_T} using the feature set, where T = |F|/δ. Here, S_i ∈ R^{m×δ} is a matrix, with each column representing a contextualized vector in the form of SC-Net activations. We compute a descriptor vector ξ for each visual stream. For that, we hierarchically process S_i in a neuronwise manner with a short FT Φ(.) [101]. In the first level of hierarchical processing, we take the δ activations of the jth neuron, i.e., α_j = [α_{j,1}, α_{j,2}, ..., α_{j,δ}] ∈ R^δ, and use the first κ coefficients of Φ(α_j) to construct ξ_j^1 ∈ R^κ, where the superscript 1 indicates the first level of the hierarchy. In the next level, α_j is divided into two components, i.e., α_j^{21} = {α_{j,1}, α_{j,2}, ..., α_{j,δ/2}} and α_j^{22} = {α_{j,δ/2+1}, α_{j,δ/2+2}, ..., α_{j,δ}}. The two components are separately processed to compute Φ(α_j^{21}) and Φ(α_j^{22}). We again retain the κ low-frequency coefficients of the transformations. These coefficients are concatenated to form ξ_j^2 ∈ R^{2κ} at the second level of the hierarchy. For l levels, we have ξ_j^l ∈ R^{2^{(l−1)}×κ}. In this work, we let l = {1, 2} based on empirical grounds. The descriptor ξ_j ∈ R^{3κ} for the jth neuron is computed by concatenating ξ_j^1 and ξ_j^2. The process of hierarchically computing the Fourier coefficients is repeated for all the m neurons, and the descriptor for the whole visual stream S_i is constructed as ξ ∈ R^{(3×κ×m)×1} by concatenating ξ_j for all neurons. This process is repeated for all the T streams of a video to generate T descriptors ξ_{i={1,...,T}} that comprehensively encode the temporal dynamics of the video.
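A numpy sketch of this two-level computation for a single stream is given below, assuming κ = 2 as in Section IV-C. The treatment of the complex FFT output (taking magnitudes of the low-frequency coefficients) is an assumption of the sketch, not a detail stated in the paper.

```python
import numpy as np

def stream_descriptor(S, kappa=2):
    """Hierarchical short-Fourier descriptor of one visual stream.

    S     : (m, delta) array of SC-Net activations for m neurons over a
            delta-frame stream (delta = 64 in the paper).
    kappa : number of low-frequency coefficients retained per transform.

    Returns a vector of length 3 * kappa * m (levels 1 and 2 concatenated).
    """
    m, delta = S.shape
    half = delta // 2
    parts = []
    for j in range(m):
        alpha = S[j]                                        # activations of neuron j
        xi1 = np.abs(np.fft.rfft(alpha))[:kappa]            # level 1: whole stream
        xi21 = np.abs(np.fft.rfft(alpha[:half]))[:kappa]    # level 2: first half
        xi22 = np.abs(np.fft.rfft(alpha[half:]))[:kappa]    # level 2: second half
        parts.append(np.concatenate([xi1, xi21, xi22]))     # 3 * kappa values per neuron
    return np.concatenate(parts)

def video_descriptors(frame_features, delta=64, kappa=2):
    """Split frame-level SC-Net features (L x m) into T = L // delta streams
    and compute one descriptor per stream."""
    L, m = frame_features.shape
    T = L // delta
    return np.stack([stream_descriptor(frame_features[t * delta:(t + 1) * delta].T, kappa)
                     for t in range(T)])
```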
C. Proposal Generation Network

Our proposal module receives the series of descriptors ξ_{i={1,...,T}} for a video. The descriptors are enriched with the associated semantics of the video, enabling the proposal module to detect semantically meaningful proposals.

To train the network, we generate densely sampled video segments, which are significantly longer than the temporal proposals we aim to detect. For instance, for a training video X_i with L_i frames and T_i visual streams, video segments are extracted by running a sliding window of length T_w = L_w/δ with a stride δ. We keep L_w ≫ kδ so that the training samples are long enough to simulate the long untrimmed videos, where k is the number of generated proposals. At each time step t, the network takes the hidden state of the sequence encoder and outputs confidence scores for k proposals. Formally, confidence scores {η_t^j}_{j=1}^k, corresponding to the k proposals, are generated. At each time step, the model considers proposals of sizes 1, 2, ..., k time steps corresponding to δ, 2δ, ..., kδ frames, respectively. This is done in a single forward pass at each time step. Thus, the model is able to consider multiple time scales in a single pass of the video without the need to rerun the video for various temporal scales. We use the compact visual descriptors, i.e., ξ = {ξ_i}_{i=1}^T, and their associated labels Y = {y_i}_{i=1}^T indicating which temporal intervals correspond to which actions in the video. At time step t, the ground truth (GT) label y_t ∈ R^k contains binary entries. The jth entry y_t^j is 1 if the corresponding proposal interval has a temporal intersection over union (tIoU) with the GT larger than 0.5. During training, the network is penalized for errors according to a weighted multilabel cross entropy to balance the positive and negative proposals. At any time step t, the network loss is computed as

\mathcal{L}(\eta, t, \xi, Y) = -\sum_{j=1}^{k} \left[ w_0^j \, y_t^j \log \eta_t^j + w_1^j \, (1 - y_t^j) \log(1 - \eta_t^j) \right]    (3)

where the weights w_0^j and w_1^j are calculated based on the numbers of positive and negative proposals in the training set, and η_t^j is the output prediction for the jth proposal at time step t. Hence, the total proposal loss for all the training videos is obtained by averaging along the time stamps, i.e.,

\mathcal{L}_{prop} = \frac{1}{|\mathcal{X}|} \sum_{(X,y) \in \mathcal{X}} \frac{1}{T_w} \sum_{t=1}^{T_w} \mathcal{L}(\eta, t, \xi, y)    (4)

where T_w is the length of the running sliding window and |.| indicates the cardinality of the set. The overall architecture of the proposed module follows an encoder–decoder framework similar to SST [45]. However, our encoder is composed of Φ(Ψ(.)) in contrast to C3D (e.g., SST and DAP). Moreover, we employ an LSTM, instead of a gated recurrent unit, as the decoder and set δ = 64, instead of 16. All other settings are similar to SST.
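A vectorized sketch of this weighted objective is shown below, assuming score and label tensors of shape (T_w, k) for one training segment and precomputed per-anchor weights; it illustrates Eqs. (3) and (4), not the released training code.

```python
import tensorflow as tf

def proposal_loss_segment(eta, y, w0, w1, eps=1e-8):
    """Weighted multilabel cross entropy for one training segment.

    eta : (Tw, k) predicted confidence scores for the k anchor proposals
    y   : (Tw, k) binary GT labels (1 if the anchor has tIoU > 0.5 with a GT event)
    w0  : (k,) weights for positive anchors; w1 : (k,) weights for negative anchors

    Returns the per-step loss of Eq. (3) averaged over the Tw steps, i.e., the
    inner term of Eq. (4); averaging over training segments gives L_prop.
    """
    pos = w0 * y * tf.math.log(eta + eps)
    neg = w1 * (1.0 - y) * tf.math.log(1.0 - eta + eps)
    per_step = -tf.reduce_sum(pos + neg, axis=-1)   # Eq. (3) at each time step
    return tf.reduce_mean(per_step)                 # time average in Eq. (4)
```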
D. Caption Generation Network

For decoding event proposals and generating their captions, we follow the common practice in the existing related literature [43], [44], [68], originally introduced by Krishna et al. [43], and adapt an LSTM network [103] for the caption generation. The LSTM has an excellent ability to model longer sequences, which is required in DVC. In our captioning network, we also incorporate the temporal dynamic attention presented in [68]. However, our caption generation model is specifically induced to generate words from SC-Net features, which we employ after temporal refinement by the descriptor transformer. This aspect is unique to our network:

\begin{pmatrix} \Gamma_t^f \\ \Gamma_t^i \\ \Gamma_t^o \\ \tilde{c}_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} W_d \begin{pmatrix} H_t \\ \rho_t \end{pmatrix}    (5)

c_t = \Gamma_t^f \odot c_{t-1} + \Gamma_t^i \odot \tilde{c}_t    (6)

h_t = \Gamma_t^o \odot \tanh(c_t)    (7)

where Γ_t^f, Γ_t^i, Γ_t^o, c_t, and h_t are the forget gate, input gate, output gate, memory cell, and current hidden state of the LSTM, respectively. W_d is the transformation matrix to be learned and H_t = [ω_t, h_{t−1}] is the concatenation of the input word embedding ω_t and the previous hidden state h_{t−1} at time step t. ρ_t is computed from the last hidden state h_{i,t} of the ith proposal e_i and the corresponding descriptor vectors. For that matter, let V_{e_i,t} denote the descriptor vectors for the ith detected event proposal e_i at time step t so that

V_{e_i,t} = \{\xi_{start}, \xi_{start+1}, ..., \xi_{end}\}.    (8)

We can write

\varphi_{i,t} = W_a^T \cdot \tanh\!\left(W_d [V_{e_i,t}, h_{i,t}] + b\right)    (9)

A_{i,t} = \frac{\exp(\varphi_{i,t})}{\sum_{k=1}^{d} \exp(\varphi_{k,t})}    (10)

where d is the length of the ith event's descriptors in e_i and h_{i,t} represents the last hidden state for the ith proposal at time step t. The final attentive event representation is obtained by

\rho_t = \sum_{i=1}^{d} A_{i,t} \cdot [V_{e_i,t}, h_{i,t}].    (11)

For our captioning module, the captioning loss L_cap is defined as the sum of the negative log likelihoods of the correct word in a sentence with W words, averaged over all generated proposals

\mathcal{L}_{cap} = -\frac{1}{|\mathcal{P}|} \sum_{j=1}^{|\mathcal{P}|} \sum_{i=1}^{W} \log(p(w_i))    (12)

where w_i is the ith word in the GT caption and P is the set of proposals.
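The attentive fusion of Eqs. (9)–(11) can be sketched as follows; W_d, W_a, and b are the trainable parameters introduced above, and the tensor shapes are assumptions made for illustration.

```python
import tensorflow as tf

def attentive_event_representation(V, h, W_d, W_a, b):
    """Fuse an event's descriptors with its last hidden state (Eqs. (9)-(11)).

    V   : (d, f) descriptor vectors lying inside the detected proposal
    h   : (hdim,) last hidden state of the proposal network for this event
    W_d : (f + hdim, a) transformation, W_a : (a,) scoring vector, b : (a,) bias

    Returns rho : (f + hdim,) attentive representation fed to the captioning LSTM.
    """
    d = tf.shape(V)[0]
    h_rep = tf.tile(h[tf.newaxis, :], [d, 1])            # pair h with every descriptor
    VH = tf.concat([V, h_rep], axis=-1)                  # [V_{e_i,t}, h_{i,t}]
    phi = tf.linalg.matvec(tf.tanh(tf.matmul(VH, W_d) + b), W_a)   # Eq. (9)
    A = tf.nn.softmax(phi)                               # Eq. (10)
    return tf.reduce_sum(A[:, tf.newaxis] * VH, axis=0)  # Eq. (11)
```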
E. Joint Training

In our dense captioning framework, the proposal and captioning modules are trained jointly in an end-to-end manner. However, for better initialization, we first train the proposal module alone for ten epochs. Later, both modules are trained in an end-to-end manner with the loss function defined as follows:

\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{prop} + \lambda_2 \mathcal{L}_{cap}    (13)

where λ_1 and λ_2 are hyperparameters that balance the contributions of the two modules. In our experiments, we empirically set λ_1 = 1 and λ_2 = 2 using cross validation. In order to induce better models, during the end-to-end training, we only use those proposals that have tIoU ≥ 0.8 with the GT proposals.

IV. EVALUATION

A. Datasets

We evaluate our technique using the large-scale ActivityNet Captions [43] dataset and the YouCook-II [26] dataset for DVC. ActivityNet Captions comprises ∼100 k natural language sentences describing ∼20 k untrimmed real-world videos. Each video consists of at least two and on average three annotated events with one human-provided caption per segment, along with the start and end times of each event. The annotated sentences in the dataset contain 13.48 words on average, describing about 36 s of a video. Furthermore, there is almost 10% temporal overlap of the events in the video, which makes the dataset really interesting and challenging. The dataset is split into 50%, 25%, and 25% (i.e., 10024/4926/5044) as training, validation, and test videos, respectively. The GT annotations of the test split are withheld for the online competition. YouCook-II comprises 2000 open-domain cooking videos. Each video is further divided into 3–16 segments elaborating cooking steps. The segments are annotated and time stamped for temporal localization. The average length of each video is approximately 600 s. Following [26], we split the dataset into train, validation, and test sets with the ratio of 66%:23%:10%, respectively.

B. Evaluation Metrics

We perform evaluations with four widely used metrics: BLEU@N, METEOR, CIDEr, and ROUGE_L. The evaluation metrics for dense captioning consider the proposal accuracy as well as the captioning accuracy. We use the dense captioning evaluation source code provided by Krishna et al. [43] to measure event localization precision and the quality of the generated captions.¹ To evaluate the detected event proposals, we measure the recall and precision of the generated proposals. We report the metric scores and the scores of the proposals as averages taken over the tIoU thresholds of 0.3, 0.5, 0.7, and 0.9 with the GT proposals.

¹[Online]. Available: https://github.com/ranjaykrishna/densevid_eval
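For orientation, the proposal-side part of this protocol can be sketched as below: temporal IoU between segments, with recall and precision averaged over the four thresholds. The reported numbers use the official densevid_eval code; this snippet only illustrates the averaging.

```python
import numpy as np

def tiou(p, g):
    """Temporal IoU between a proposal p = (start, end) and a GT segment g."""
    inter = max(0.0, min(p[1], g[1]) - max(p[0], g[0]))
    union = max(p[1], g[1]) - min(p[0], g[0])
    return inter / union if union > 0 else 0.0

def proposal_recall_precision(proposals, gt_segments, thresholds=(0.3, 0.5, 0.7, 0.9)):
    """Recall and precision of predicted proposals, averaged over tIoU thresholds."""
    recalls, precisions = [], []
    for th in thresholds:
        matched_gt = sum(any(tiou(p, g) >= th for p in proposals) for g in gt_segments)
        matched_pr = sum(any(tiou(p, g) >= th for g in gt_segments) for p in proposals)
        recalls.append(matched_gt / max(len(gt_segments), 1))
        precisions.append(matched_pr / max(len(proposals), 1))
    return np.mean(recalls), np.mean(precisions)
```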
C. Setup Details

We extract the framewise features of the ActivityNet Captions dataset videos using the activations of the avgpool layer of ResNet-152 [104]. We utilize these features and the K semantic notions from the caption corpora of the respective dataset to train our SC-Net. We select K = 2000 and 200 for the ActivityNet Captions and YouCook-II datasets, respectively. We approximate K from DBSCAN [105], which is able to capture data association and structure, by varying minPoints and eps such that the minimum number of outliers is produced (we refer to the original work [105] for algorithmic details). We select minPoints = 3 and eps = 0.8 in our approximation process. We then pass the cluster count on to the K-means algorithm to generate K clusters. We chose the popular k-means++ for centroid initialization, which is relatively consistent and faster compared to the random and Forgy initialization techniques. The performance of the SC-Net is subjective and is not highly sensitive to K within reasonable bounds.
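As a complement to the clustering sketch in Section III-A, the DBSCAN-based approximation of K could look as follows, with E being the caption-embedding matrix and the eps and min_samples values following the stated settings.

```python
from sklearn.cluster import DBSCAN

def approximate_K(E, eps=0.8, min_points=3):
    """Estimate the number of semantic clusters in the caption embeddings E
    with DBSCAN; noise points (label -1) are not counted as a cluster."""
    labels = DBSCAN(eps=eps, min_samples=min_points).fit_predict(E)
    return len(set(labels)) - (1 if -1 in labels else 0)

# The estimate is then passed as n_clusters to the k-means++ initialized
# K-means step shown in Section III-A.
```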
AAFAQ et al.: CROSS-DOMAIN MODALITY FUSION FOR DENSE VIDEO CAPTIONING 773

TABLE VIII
DVC RESULTS OF THE PROPOSED METHOD AT DIFFERENT TIOU THRESHOLDS
ON THE ACTIVITYNET CAPTIONS [43] DATASET ON LEARNED PROPOSALS

Fig. 6. Adverse effects of increasing proposal numbers on evaluation metric


scores. Subsets of the same ordered proposals are used for our technique. CIDEr BLEU (B1–B4), METEOR (M), CIDEr (C), and ROUGE (R) metrics are used to
seems heavily influenced by varying the number of proposals primarily because report the score of the final model.
of different scale, i.e., 1000 in contrast to other metrics that are scaled out of 100.
ROUGE also improves considerably with reduced numbers without changing TABLE IX
the overall captions set. (Right) Effects of proposals variation from 1000 to 100. HUMAN EVALUATION
(Left) Effect of proposals variation below 100—shown separately due to scale
variability.

Visual relevance (VR)—percentage of generated sets of sentences that best describe


the scores of all four metrics. CIDEr is comparatively more the video content.
sensitive to the increasing number of proposals. We follow the Human relevance (HR)—percentage of sets of generated sentences that are at par
or better than human-provided sentences.
standard practice of reporting results on 1000 and 100 proposals
in Tables III and IV. Techniques such as SDVC [44] compute
very few (on average 2.85) proposals, which boosts their metric B. Human Evaluations
scores primarily due to their smaller number of proposals.
To further validate the effectiveness of our technique and
to better understand how satisfactory are the generated sen-
tences for localized temporal event proposals from different
E. Limitations and Future Work methods, we also conducted a human study to compare our
We have employed a 2-D CNN for feature extraction in our technique, followed by temporal refinement of the spatial features using the short-time FT. This alleviates the need for training the latter part and reduces the memory footprint, while still achieving excellent results. However, this process is conducted offline as it is not end-to-end differentiable. We expect that end-to-end training may boost the performance of our approach. Nevertheless, that enhancement would also compromise the adversarial robustness of the method, which is currently a big concern for the research community [109]. Being nondifferentiable, the current modeling provides an inherent robustness to adversarial attacks in the visual domain. Instead of the full pixel-based features, we only employ the low-frequency coefficients to capture the temporal dynamics of the visual features. Leveraging the removal of such irrelevant partial information from the input for adversarial robustness can be an interesting research direction for future work. Moreover, it will be interesting to explore the contextualization performance with BERT (and its variants) embeddings. If computational cost is not the bottleneck, this option is the preferred choice for a performance gain.
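As a rough illustration of the offline temporal refinement mentioned above, the sketch below applies a short-time Fourier transform along the temporal axis of per-frame CNN features and keeps only a few low-frequency magnitude coefficients. The window length, hop size, number of retained bins, and feature dimensions are illustrative assumptions, not the exact settings of our hierarchical application.

```python
import numpy as np
from scipy.signal import stft

def low_freq_descriptors(features, nperseg=16, noverlap=8, keep=4):
    """Summarize frame-level CNN features with low-frequency STFT coefficients.

    features: (T, D) array holding T frames of D-dimensional features.
    Returns a real-valued descriptor built from the magnitudes of the
    lowest `keep` frequency bins of every channel, averaged over windows.
    """
    x = np.asarray(features, dtype=np.float64).T              # (D, T)
    _, _, Zxx = stft(x, nperseg=nperseg, noverlap=noverlap)   # (D, F, W)
    low = np.abs(Zxx[:, :keep, :])                            # low-frequency bins only
    return low.mean(axis=-1).reshape(-1)                      # (D * keep,) descriptor

# Example with 128 frames of hypothetical 2048-D appearance features.
frame_features = np.random.randn(128, 2048)
descriptor = low_freq_descriptors(frame_features)
print(descriptor.shape)  # (8192,)
```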
VI. ADDITIONAL RESULTS

A. Quantitative Results

We provide detailed results of our model for all the tIoU thresholds [0.3, 0.5, 0.7, 0.9] in Table VIII. The results show that the performance of the model improves at higher tIoU across all metrics. This further underscores the need to generate more accurately localized events.
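For reference, the temporal IoU between a predicted event segment and a ground-truth segment can be computed as in the generic sketch below (this is not the official evaluation-toolkit code); a proposal is typically counted as a match at a given threshold, e.g., 0.3, 0.5, 0.7, or 0.9, if its tIoU with a ground-truth event reaches that threshold.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments given in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    intersection = max(0.0, min(pe, ge) - max(ps, gs))
    union = max(pe, ge) - min(ps, gs)   # equals len1 + len2 - intersection when the segments overlap
    return intersection / union if union > 0 else 0.0

# A 6 s prediction overlapping a 6 s ground-truth event by 4 s:
print(temporal_iou((2.0, 8.0), (4.0, 10.0)))   # 0.5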
B. Human Evaluations

To further validate the effectiveness of our technique, and to better understand how satisfactory the generated sentences are for the temporal event proposals localized by different methods, we also conducted a human study to compare our method against two approaches, i.e., Masked Transformer [71] and TDA-CG [68]. In this task, 12 evaluators participated. The participants are divided equally into two groups. The first group is provided with 15 randomly selected video clips along with the temporally localized captions generated by all the methods. Each evaluator in the group is then asked to select the captions that best describe the video clip. We allow the selection of multiple best captions if the participant thinks that multiple captions are equally good. To reduce human subjectivity, each video is evaluated by six evaluators. Captions selected by more than two evaluators are considered as the final best captions.

The second group is provided with the captions generated by all the methods along with the human annotations and is asked whether the system-generated captions of the temporally localized sentences resemble the human annotations. From all the received responses, we calculate two metrics: 1) visual relevance (VR), the percentage of sets of sentences that best describe the video content, and 2) human relevance (HR), the percentage of sets of sentences that are at par with or better than the human-provided sentences. The results of the human study are presented in Table IX. It is evident from the table that in most cases, i.e., 78.6%, the captions generated by SA-DVC best describe the visual content. Our SA-DVC also outperforms the best competitor by 4.1% on human relevance.
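A minimal sketch of how the two human-study statistics can be aggregated from raw evaluator choices is given below. The vote threshold follows the protocol above (a caption set counts as a final best if more than two of the six evaluators pick it), while the data structures, function names, and toy inputs are hypothetical assumptions rather than our actual study scripts.

```python
from collections import Counter

def visual_relevance(votes, method="ours", min_votes=3):
    """Percentage of videos whose winning caption set belongs to `method`.

    votes: dict mapping video_id to the list of method names chosen by the
        six evaluators (an evaluator may pick several methods).
    """
    wins = 0
    for picks in votes.values():
        counts = Counter(picks)
        # A caption set is a final "best" if more than two evaluators select it.
        if counts.get(method, 0) >= min_votes:
            wins += 1
    return 100.0 * wins / len(votes)

def human_relevance(judgements, method="ours"):
    """Percentage of caption sets judged at par with or better than the
    human-provided sentences.

    judgements: dict mapping video_id to the set of methods judged human-like.
    """
    hits = sum(method in judged for judged in judgements.values())
    return 100.0 * hits / len(judgements)

# Hypothetical toy usage with two videos.
votes = {"v1": ["ours", "ours", "ours", "tda_cg"], "v2": ["masked_tr", "ours"]}
judgements = {"v1": {"ours"}, "v2": set()}
print(visual_relevance(votes), human_relevance(judgements))  # 50.0 50.0
```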
C. Qualitative Results

1) Qualitative Results on ActivityNet Captions: In Fig. 7, we show representative examples of captions generated by the proposed technique for the ActivityNet Captions [43] dataset.

We compare the performance of our baseline and the proposed DVC model. As can be seen in the figures, the events generated by the proposed method (green bars) are better aligned with the GT events (blue bars) than those of the baseline (brown bars), which represents the approach adopted by most methods. Note that the visual-only method is unable to detect the events described by the annotations from the visually similar content. Moreover, the descriptions of both of its detected events are similar. In contrast, the proposed technique, which leverages linguistic information in the proposal generation module, is able to capture such events, e.g., standing, doing karate, and continue hitting, from visually similar content. Furthermore, our technique also generates more relevant captions. We show the captions whose proposals have the highest overlap with the GT events.

Fig. 7. Qualitative results of DVC by the proposed technique on the ActivityNet Captions [43] dataset. The bars indicate GT events (blue), events detected by the baseline method (brown), and events detected by the proposed method (green). Captions relevant to each method are presented under it separately. e_i corresponds to the event number.

2) Qualitative Results on YouCook-II: We provide qualitative results for the YouCook-II [26] dataset in Fig. 8. Here, again, we compare the performance of our two models, i.e., the baseline and the proposed model. As can be seen from the figures, our proposed technique generates more plausible events along with higher quality captions. The difference between the two models is more prominent here because the YouCook-II dataset involves more subtle actions, e.g., cutting and peeling, that need to be detected and differentiated. We can see that the visual-only (baseline) model has failed to detect all the events and has also failed to differentiate and detect add pasta and add bacon as two separate events (recipe steps).

Fig. 8. Qualitative results of DVC by the proposed technique on the YouCook-II [26] dataset. The bars indicate GT events (blue), events detected by the baseline method (brown), and events detected by the proposed event detector (green). Captions relevant to each method are presented under it separately. e_i corresponds to the event number.

VII. CONCLUSION

DVC is a challenging task that is currently handled by event detection followed by caption generation. We proposed a technique that couples these two tasks through visual-semantic contextualization. Our method additionally refines the resulting representation with a hierarchical FT. The refined representation is used by our event detector. The representation of each detected event is computed by attentively fusing the event's final hidden state with the descriptors, which makes it more discriminative for the subsequent caption generation network. A thorough evaluation of our technique on the large-scale ActivityNet Captions dataset and the YouCook-II dataset shows competitive or better performance across multiple metrics.

ACKNOWLEDGMENT
The views and conclusions contained in this article are those
of the authors and should not be interpreted as representing
the official policies, either expressed or implied, of the Army
Research Office or the U.S. Government. The U.S. Government
is authorized to reproduce and distribute reprints for government
purposes notwithstanding any copyright notation herein.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2012, vol. 25, pp. 1097–1105.
[3] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014, arXiv:1409.1556.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 580–587.
[5] A. Vaswani et al., “Attention is all you need,” in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 6000–6010.
[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. NAACL-HLT, 2019.
[7] C. Qiu, G. Zhou, Z. Cai, and S. Anders, “A global-local attentive relation detection model for knowledge base question answering,” IEEE Trans. Artif. Intell., vol. 2, no. 2, pp. 200–212, Apr. 2021.
[8] J. Berant, A. Chou, R. Frostig, and P. Liang, “Semantic parsing on Freebase from question-answer pairs,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2013, pp. 1533–1544.
[9] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky, “Deep reinforcement learning for dialogue generation,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2016, pp. 1192–1202.
[10] R. Socher, D. Chen, C. D. Manning, and A. Ng, “Reasoning with neural tensor networks for knowledge base completion,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2013, pp. 926–934.
[11] S. Antol et al., “VQA: Visual question answering,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 2425–2433.
[12] A. Das et al., “Visual dialog,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1080–1089.
[13] J. Donahue et al., “Long-term recurrent convolutional networks for visual recognition and description,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 2625–2634.

[14] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 1060–1069.
[15] P. Anderson et al., “Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 3674–3683.
[16] N. Aafaq, A. Mian, W. Liu, S. Z. Gilani, and M. Shah, “Video description: A survey of methods, datasets, and evaluation metrics,” ACM Comput. Surv., vol. 52, no. 6, pp. 1–37, 2019.
[17] D. Gurari et al., “VizWiz grand challenge: Answering visual questions from blind people,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3608–3617.
[18] J. P. Bigham et al., “VizWiz: Nearly real-time answers to visual questions,” in Proc. 23rd Annu. ACM Symp. User Interface Softw. Technol., 2010, pp. 333–342.
[19] R. Tapu, B. Mocanu, and T. Zaharia, “Dynamic subtitles: A multimodal video accessibility enhancement dedicated to deaf and hearing impaired users,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops, 2019, pp. 2558–2566.
[20] X. Che, S. Luo, H. Yang, and C. Meinel, “Automatic lecture subtitle generation and how it helps,” in Proc. IEEE 17th Int. Conf. Adv. Learn. Technol., 2017, pp. 34–38.
[21] Z. Xu, C. Hu, and L. Mei, “Video structured description technology based intelligence analysis of surveillance videos for public security applications,” Multimedia Tools Appl., vol. 75, no. 19, pp. 12155–12172, 2016.
[22] S. Xiao, Z. Zhao, Z. Zhang, Z. Guan, and D. Cai, “Query-biased self-attentive network for query-focused video summarization,” IEEE Trans. Image Process., vol. 29, pp. 5889–5899, Apr. 2020.
[23] H. Wei, B. Ni, Y. Yan, H. Yu, X. Yang, and C. Yao, “Video summarization via semantic attended networks,” in Proc. AAAI Conf. Artif. Intell., 2018, pp. 216–223.
[24] S. Zhang, Y. Zhu, and A. K. Roy-Chowdhury, “Context-aware surveillance video summarization,” IEEE Trans. Image Process., vol. 25, no. 11, pp. 5469–5478, Nov. 2016.
[25] C. T. Dang and H. Radha, “Heterogeneity image patch index and its application to consumer video summarization,” IEEE Trans. Image Process., vol. 23, no. 6, pp. 2704–2718, Jun. 2014.
[26] L. Zhou, C. Xu, and J. J. Corso, “Towards automatic learning of procedures from web instructional videos,” in Proc. AAAI Conf. Artif. Intell., 2018, pp. 7590–7598.
[27] J.-B. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev, and S. Lacoste-Julien, “Unsupervised learning from narrated instruction videos,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4575–4583.
[28] P. Bojanowski et al., “Weakly supervised action labeling in videos under ordering constraints,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 628–643.
[29] H. Su, W. Qi, Z. Li, Z. Chen, G. Ferrigno, and E. De Momi, “Deep neural network approach in EMG-based force estimation for human-robot interaction,” IEEE Trans. Artif. Intell., vol. 2, no. 5, pp. 404–412, Oct. 2021.
[30] B. Zhao, X. Li, and X. Lu, “CAM-RNN: Co-attention model based RNN for video captioning,” IEEE Trans. Image Process., vol. 28, no. 11, pp. 5552–5565, Nov. 2019.
[31] L. Gao, X. Li, J. Song, and H. T. Shen, “Hierarchical LSTMs with adaptive attention for visual captioning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 5, pp. 1112–1131, May 2020.
[32] N. Aafaq, N. Akhtar, W. Liu, and A. Mian, “Empirical autopsy of deep video captioning encoder-decoder architecture,” Array, vol. 9, p. 100052, 2021, doi: 10.1016/j.array.2020.100052.
[33] Y. Pan, T. Yao, H. Li, and T. Mei, “Video captioning with transferred semantic attributes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6504–6512.
[34] J. Song, Y. Guo, L. Gao, X. Li, A. Hanjalic, and H. T. Shen, “From deterministic to generative: Multimodal stochastic RNNs for video captioning,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 10, pp. 3047–3058, Oct. 2019.
[35] B. Pan et al., “Spatio-temporal graph for video captioning with knowledge distillation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2020, pp. 10867–10876.
[36] Y. Feng, L. Ma, W. Liu, and J. Luo, “Unsupervised image captioning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 4120–4129.
[37] X. Yang, K. Tang, H. Zhang, and J. Cai, “Auto-encoding scene graphs for image captioning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 10677–10686.
[38] L. Zhou, Y. Zhang, Y.-G. Jiang, T. Zhang, and W. Fan, “Re-Caption: Saliency-enhanced image captioning through two-phase learning,” IEEE Trans. Image Process., vol. 29, pp. 694–709, Jul. 2019.
[39] S. Ye, J. Han, and N. Liu, “Attentive linear transformation for image captioning,” IEEE Trans. Image Process., vol. 27, no. 11, pp. 5514–5524, Nov. 2018.
[40] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei, “Boosting image captioning with attributes,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 4904–4912.
[41] Y. Wang, Z. Lin, X. Shen, S. Cohen, and G. W. Cottrell, “Skeleton key: Image captioning by skeleton-attribute decomposition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 7378–7387.
[42] J. Xu, T. Mei, T. Yao, and Y. Rui, “MSR-VTT: A large video description dataset for bridging video and language,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 5288–5296.
[43] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles, “Dense-captioning events in videos,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 706–715.
[44] J. Mun, L. Yang, Z. Ren, N. Xu, and B. Han, “Streamlined dense video captioning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 6581–6590.
[45] S. Buch, V. Escorcia, C. Shen, B. Ghanem, and J. C. Niebles, “SST: Single-stream temporal action proposals,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6373–6382.
[46] V. Escorcia, F. C. Heilbron, J. C. Niebles, and B. Ghanem, “DAPs: Deep action proposals for action understanding,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 768–784.
[47] H. Xu, B. Li, V. Ramanishka, L. Sigal, and K. Saenko, “Joint event detection and description in continuous video streams,” in Proc. IEEE Winter Conf. Appl. Comput. Vis. Workshops, 2019, pp. 25–26.
[48] C. Baldassano, J. Chen, A. Zadbood, J. W. Pillow, U. Hasson, and K. A. Norman, “Discovering event structure in continuous narrative perception and memory,” Neuron, vol. 95, no. 3, pp. 709–721, 2017.
[49] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” in Proc. 1st Int. Conf. Learn. Represent., Scottsdale, AZ, USA, 2013.
[50] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” Trans. Assoc. Comput. Linguistics, vol. 5, pp. 135–146, 2017.
[51] O. Duchenne, I. Laptev, J. Sivic, F. R. Bach, and J. Ponce, “Automatic annotation of human actions in video,” in Proc. IEEE 12th Int. Conf. Comput. Vis., 2009, pp. 1491–1498.
[52] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3128–3137.
[53] K. Xu et al., “Show, attend and tell: Neural image caption generation with visual attention,” in Proc. Int. Conf. Mach. Learn., 2015, pp. 2048–2057.
[54] P. Nguyen, T. Liu, G. Prasad, and B. Han, “Weakly supervised action localization by sparse temporal pooling network,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6752–6761.
[55] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2014, pp. 568–576.
[56] R. Hou, C. Chen, and M. Shah, “Tube convolutional neural network (T-CNN) for action detection in videos,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 5823–5832.
[57] G. Yu and J. Yuan, “Fast action proposals for human action detection and search,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1302–1311.
[58] P. Weinzaepfel, Z. Harchaoui, and C. Schmid, “Learning to track for spatio-temporal action localization,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 3164–3172.
[59] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao, “A multi-stream bi-directional recurrent neural network for fine-grained action detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1961–1970.
[60] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei, “End-to-end learning of action detection from frame glimpses in videos,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2678–2687.
[61] Z. Shou, D. Wang, and S.-F. Chang, “Temporal action localization in untrimmed videos via multi-stage CNNs,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1049–1058.

[62] A. Montes, A. Salvador, S. Pascual, and X. Giro-i Nieto, “Temporal activity detection in untrimmed videos with recurrent neural networks,” in Proc. 1st NIPS Workshop Large Scale Comput. Vis. Syst., 2016.
[63] S. Ma, L. Sigal, and S. Sclaroff, “Learning activity progression in LSTMs for activity detection and early detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1942–1950.
[64] J. Gao, C. Sun, Z. Yang, and R. Nevatia, “TALL: Temporal activity localization via language query,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 5277–5285.
[65] L. A. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell, “Localizing moments in video with natural language,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 5804–5813.
[66] J. Gao, Z. Yang, K. Chen, C. Sun, and R. Nevatia, “TURN TAP: Temporal unit regression network for temporal action proposals,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 3648–3656.
[67] F. Caba Heilbron, J. Carlos Niebles, and B. Ghanem, “Fast temporal activity proposals for efficient detection of human actions in untrimmed videos,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1914–1923.
[68] J. Wang, W. Jiang, L. Ma, W. Liu, and Y. Xu, “Bidirectional attentive fusion with context gating for dense video captioning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7190–7198.
[69] A. Karpathy, A. Joulin, and L. F. Fei-Fei, “Deep fragment embeddings for bidirectional image sentence mapping,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2014, pp. 1889–1897.
[70] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui, “Jointly modeling embedding and translation to bridge video and language,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4594–4602.
[71] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong, “End-to-end dense video captioning with masked transformer,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8739–8748.
[72] A. Kojima, T. Tamura, and K. Fukunaga, “Natural language description of human activities from video images based on concept hierarchy of actions,” Int. J. Comput. Vis., vol. 50, no. 2, pp. 171–184, 2002.
[73] P. Das, C. Xu, R. F. Doell, and J. J. Corso, “A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 2634–2641.
[74] N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadarrama, “Generating natural-language video descriptions using text-mined knowledge,” in Proc. 27th AAAI Conf. Artif. Intell., 2013, pp. 541–547.
[75] S. Gella, M. Lewis, and M. Rohrbach, “A dataset for telling the stories of social media videos,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2018, pp. 968–974.
[76] L. Zhou, Y. Kalantidis, X. Chen, J. J. Corso, and M. Rohrbach, “Grounded video description,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 6571–6580.
[77] X. Wang, W. Chen, J. Wu, Y.-F. Wang, and W. Y. Wang, “Video captioning via hierarchical reinforcement learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4213–4222.
[78] Q. Zheng, C. Wang, and D. Tao, “Syntax-aware action targeting for video captioning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2020, pp. 13093–13102.
[79] Z. Zhang et al., “Object relational graph with teacher-recommended learning for video captioning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2020, pp. 13275–13285.
[80] N. Aafaq, N. Akhtar, W. Liu, S. Z. Gilani, and A. Mian, “Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 12479–12488.
[81] C. Yan et al., “STAT: Spatial-temporal attention mechanism for video captioning,” IEEE Trans. Multimedia, vol. 22, no. 1, pp. 229–241, Jan. 2019.
[82] L. Yao et al., “Describing videos by exploiting temporal structure,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 4507–4515.
[83] Z. Gan et al., “Semantic compositional networks for visual captioning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1141–1150.
[84] J. Wang, W. Wang, Y. Huang, L. Wang, and T. Tan, “M3: Multimodal memory modelling for video captioning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7512–7520.
[85] W. Pei, J. Zhang, X. Wang, L. Ke, X. Shen, and Y.-W. Tai, “Memory-attended recurrent network for video captioning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 8339–8348.
[86] W. Hao, Z. Zhang, and H. Guan, “Integrating both visual and audio cues for enhanced video caption,” in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 6894–6901.
[87] J. Xu, T. Yao, Y. Zhang, and T. Mei, “Learning multimodal attention LSTM networks for video captioning,” in Proc. 25th ACM Int. Conf. Multimedia, 2017, pp. 537–545.
[88] A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele, “Coherent multi-sentence video description with variable level of detail,” in Proc. German Conf. Pattern Recognit., 2014, pp. 184–195.
[89] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu, “Video paragraph captioning using hierarchical recurrent neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4584–4593.
[90] Y. Xiong, B. Dai, and D. Lin, “Move forward and tell: A progressive generator of video descriptions,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 468–483.
[91] Y. Li, T. Yao, Y. Pan, H. Chao, and T. Mei, “Jointly localizing and describing events for dense video captioning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7492–7500.
[92] V. Iashin and E. Rahtu, “Multi-modal dense video captioning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops, 2020, pp. 4117–4126.
[93] X. Duan, W. Huang, C. Gan, J. Wang, W. Zhu, and J. Huang, “Weakly supervised dense event captioning in videos,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2018, pp. 3059–3069.
[94] Z. Shen et al., “Weakly supervised dense video captioning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5159–5167.
[95] T. Rahman, B. Xu, and L. Sigal, “Watch, listen and tell: Multi-modal weakly supervised dense event captioning,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 8907–8916.
[96] M. Pagliardini, P. Gupta, and M. Jaggi, “Unsupervised learning of sentence embeddings using compositional N-gram features,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, 2018, pp. 528–540.
[97] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized autoregressive pretraining for language understanding,” in Proc. 33rd Int. Conf. Neural Inf. Process. Syst., 2019, pp. 5753–5763.
[98] R. Kiros et al., “Skip-thought vectors,” in Proc. 28th Int. Conf. Neural Inf. Process. Syst., 2015, pp. 3294–3302.
[99] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating text generation with BERT,” in Proc. Int. Conf. Learn. Represent., 2020.
[100] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 4489–4497.
[101] A. V. Oppenheim, Discrete-Time Signal Processing. New Delhi, India: Pearson Education India, 1999.
[102] J. Wang, Z. Liu, Y. Wu, and J. Yuan, “Learning actionlet ensemble for 3D human action recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 5, pp. 914–927, May 2014.
[103] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[104] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[105] M. Ester et al., “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proc. 2nd Int. Conf. Knowl. Discovery Data Mining, 1996, pp. 226–231.
[106] Y. Bengio, “Practical recommendations for gradient-based training of deep architectures,” in Neural Networks: Tricks of the Trade. New York, NY, USA: Springer, 2012, pp. 437–478.
[107] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio, “An empirical evaluation of deep architectures on problems with many factors of variation,” in Proc. 24th Int. Conf. Mach. Learn., 2007, pp. 473–480.
[108] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learn. Represent., 2015, p. 13.
[109] N. Akhtar and A. Mian, “Threat of adversarial attacks on deep learning in computer vision: A survey,” IEEE Access, vol. 6, pp. 14410–14430, 2018.

Nayyer Aafaq received the B.E. degree (with distinction) in avionics from the College of Aeronautical Engineering, National University of Sciences and Technology (NUST), Islamabad, Pakistan, in 2007, and the M.S. degree (with high distinction) in systems engineering from the Queensland University of Technology, Brisbane, QLD, Australia, in 2012. He is currently working toward the Ph.D. degree with the School of Computer Science and Software Engineering, University of Western Australia (UWA), Crawley, WA, Australia.
His research in computer vision and pattern recognition has been published in prestigious venues of the field, including the IEEE/CVF Conference on Computer Vision and Pattern Recognition and ACM Computing Surveys. He is a recipient of the Scholarship for International Research Fees at UWA. He was a Research Assistant with STG Research Institute, Pakistan, from 2007 to 2011, and a Lecturer with NUST from 2013 to 2017. His current research interests include deep learning, video analysis, and the intersection of natural language processing, computer vision, and machine learning.

Ajmal Mian (Senior Member, IEEE) received the Ph.D. degree (with distinction) in computer science from the University of Western Australia, Crawley, WA, Australia, in 2006.
He is currently a Professor of Computer Science with the University of Western Australia. He has authored or coauthored more than 200 scientific papers in reputable journals and conferences. He has secured ten Australian Research Council Grants, a National Health and Medical Research Council Grant, a DAAD German Australian Research Cooperation Grant, and two U.S. Department of Defense Grants. His research interests include computer vision, machine learning, adversarial deep learning, 3-D shape analysis, and video analysis.
Dr. Mian received the Australasian Distinguished Doctoral Dissertation Award from the Computing Research and Education Association of Australasia. He received the prestigious Australian Postdoctoral and Australian Research Fellowships in 2008 and 2011, respectively. He received the University of Western Australia Outstanding Young Investigator Award in 2011, the West Australian Early Career Scientist of the Year Award in 2012, the Vice-Chancellor's Mid-Career Research Award in 2014, the Aspire Professional Development Award in 2016, and the Excellence in Research Supervision Award in 2017. He is an Associate Editor for the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, and the Pattern Recognition journal.

Wei Liu (Member, IEEE) received the Ph.D. degree in computer science from the University of Newcastle, Callaghan, NSW, Australia, in 2003.
She is currently with the Department of Computer Science and Software Engineering, University of Western Australia, Crawley, WA, Australia, and co-leads the Faculty's Big Data Research Group. Her research impact in the field of knowledge discovery from natural language text data is evident from a series of highly cited papers in reputable top data mining and knowledge management journals and conferences, including ACM Computing Surveys, Journal of Data Mining and Knowledge Discovery, Knowledge and Information Systems, the International Conference on Data Engineering, and the ACM International Conference on Information and Knowledge Management. Her current research interests include deep learning methods for knowledge graph construction from natural language text, sequential data mining, and text mining.
Dr. Liu has won three Australian Research Council Grants and several industry grants.

Naveed Akhtar received the master's degree in computer science from Hochschule Bonn-Rhein-Sieg, Sankt Augustin, Germany, in 2012, and the Ph.D. degree in computer vision from the University of Western Australia (UWA), Crawley, WA, Australia, in 2017.
He has been a Research Fellow with UWA since 2017. He was a Research Fellow with the Australian National University, Canberra, ACT, Australia. He was a recipient of multiple scholarships during his Ph.D. research. His research in computer vision and pattern recognition has been published in prestigious venues of the field, including the IEEE Conference on Computer Vision and Pattern Recognition and the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE. He has also been a reviewer for these venues. His current research interests include adversarial machine learning, action recognition, and hyperspectral image analysis.
Dr. Akhtar was a runner-up for the Canon Extreme Imaging Competition in 2015.

Mubarak Shah (Fellow, IEEE) received the M.S. and Ph.D. degrees in computer engineering from Wayne State University, Detroit, MI, USA, in 1982 and 1986, respectively.
He is the Trustee Chair Professor of Computer Science and the Founding Director of the Center for Research in Computer Vision, University of Central Florida (UCF), Orlando, FL, USA. His research interests include video surveillance, visual tracking, human activity recognition, visual analysis of crowded scenes, video registration, and unmanned aerial vehicle video analysis.
Dr. Shah is an ACM Distinguished Speaker. He was an IEEE Distinguished Visitor Speaker from 1997 to 2000 and received the IEEE Outstanding Engineering Educator Award in 1997. In 2006, he received the Pegasus Professor Award, the highest award at UCF. He received the Harris Corporation's Engineering Achievement Award in 1999, TOKTEN awards from the United Nations Development Program in 1995, 1997, and 2000, the Teaching Incentive Program Award in 1995 and 2003, the Research Incentive Award in 2003 and 2009, Millionaires' Club Awards in 2005 and 2006, the University Distinguished Researcher Award in 2007, and an honorable mention for the ICCV 2005 Where Am I? Challenge Problem. He was nominated for the Best Paper Award at the ACM Multimedia Conference in 2005. He is an Editor for an international book series on video computing. He was the Editor-in-Chief of Machine Vision and Applications and an Associate Editor for ACM Computing Surveys. He was the Program Co-Chair of the 2008 Conference on Computer Vision and Pattern Recognition, an Associate Editor for the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, and a Guest Editor for the special issue of the International Journal of Computer Vision on Video Computing. He is a Fellow of the American Association for the Advancement of Science, the International Association for Pattern Recognition, and the International Society for Optical Engineers.
