
DRAFT 1

Exploring global diverse attention via pairwise temporal relation for video summarization
Ping Li, Qinghao Ye, Luming Zhang, Li Yuan, Xianghua Xu, and Ling Shao

arXiv:2009.10942v1 [cs.CV] 23 Sep 2020

This work was supported in part by the National Natural Science Foundation of China under Grants 61872122 and 61502131, and in part by the Natural Science Foundation of Zhejiang Province under Grant LY18F020015. (Corresponding author: Ping Li.)
P. Li, Q. Ye and X. Xu are with the School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China (e-mail: [email protected]).
L. Zhang is with the College of Computer Science, Zhejiang University, Hangzhou 310027, China.
L. Yuan is with the National University of Singapore, Singapore 119077.
L. Shao is with Inception Institute of Artificial Intelligence, Abu Dhabi, UAE.

Abstract: Video summarization is an effective way to facilitate video searching and browsing. Most existing systems employ encoder-decoder based recurrent neural networks, which fail to explicitly diversify the system-generated summary frames while requiring intensive computations. In this paper, we propose an efficient convolutional neural network architecture for video SUMmarization via Global Diverse Attention called SUM-GDA, which adapts the attention mechanism in a global perspective to consider pairwise temporal relations of video frames. In particular, the GDA module has two advantages: 1) it models the relations within paired frames as well as the relations among all pairs, thus capturing the global attention across all frames of one video; 2) it reflects the importance of each frame to the whole video, leading to diverse attention on these frames. Thus, SUM-GDA is beneficial for generating diverse frames to form a satisfactory video summary. Extensive experiments on three data sets, i.e., SumMe, TVSum, and VTW, have demonstrated that SUM-GDA and its extension outperform other competing state-of-the-art methods with remarkable improvements. In addition, the proposed models can be run in parallel with significantly less computational costs, which helps the deployment in highly demanding applications.

Index Terms: Global diverse attention, pairwise temporal relation, video summarization, convolutional neural networks

I. INTRODUCTION

In the era of big data, video has become one of the most important carriers of data as the number and the volume have both increased rapidly in daily life. Video summarization, as a good way to manage these videos (e.g., video searching and video browsing) [1], [2], has received much interest in the field of computer vision and pattern recognition. It essentially selects the key shots of a video as the summary, and these key shots are expected to convey the most important information of the video. To obtain the summary, traditional methods often use hand-crafted features which are then processed by unsupervised models [3], [4] or supervised models [5], [6]. Nevertheless, these models cannot be trained efficiently in an end-to-end manner, and hand-crafted features are incapable of encoding high-level semantic information of video, so they may not work well for videos with diverse and complex scenes in real-world applications.

In the last decade, many contributions have been devoted to exploring the recurrent encoder-decoder architecture which utilizes Recurrent Neural Networks (RNNs) [7] and Long Short-Term Memory (LSTM) models [8], [9]. These models are able to learn high-level feature representations of data, thus generating video summaries with high quality. Indeed, recurrent neural networks can inherently capture temporal dependency by encoding sequence information of video frames, and enjoy widespread success in practical tasks, e.g., machine translation [10] and action recognition [11]. However, there are several drawbacks: (1) it is difficult for RNNs to make full use of GPU parallelization, because the generation of the hidden state $h_t$ for the $t$-th frame depends on the previous hidden state $h_{t-1}$; (2) gated RNN models like LSTMs cannot model long-range dependency across video frames well, since the gating mechanism leads to serious decay of the history information inherited from frames appearing early in the sequence; (3) in some scenarios, e.g., a video news broadcast on one topic that consists of several edited videos from different sources, there may exist semantic discontinuity within video sequences, which is very challenging and which RNNs cannot resolve well.

To alleviate the above problems, we exploit the temporal context relations among video frames from the pairwise relation perspective. In particular, a pairwise similarity matrix is designed to model multi-scale temporal relations across frames by storing the context information. The rationale behind this idea is that temporal relations of video frames can be modeled by the pairwise similarity between two frames regardless of their distance. That is to say, unlike RNNs that model long-range dependency using history information of early frames, our scheme does not need to go through all the intermediate frames between the source frame and the target frame. Instead, we compute the pairwise similarity matrix directly at limited computational cost, which makes it quite efficient and suitable for GPU parallelization.

In another aspect, a good video summary should include those shots with the most diversity and the most representativeness, i.e., the selected key shots with higher importance scores should reflect diversified semantic information of the video. Humans tend to summarize a video after scanning all the frames, which inspires us to imitate this process and summarize video by fully attending to the complete video, as the attention mechanism is prevailing and successful in sequence modeling [12]. Therefore, we develop an efficient video SUMmarization model with Global Diverse Attention called SUM-GDA using pairwise temporal relation, to quantify the importance of each video frame and simultaneously promote

the diversity among these frames. Concretely, the proposed GDA mechanism is used to model the temporal relations among frames by leveraging the pairwise similarity matrix, where each row vector is closely related to the attention level and a lower entry indicates a higher dissimilarity weight. Hence, more emphasis should be put on the corresponding pairwise frames in order to promote the diversity among video frames. As a result, all stacked row vectors can reflect different attention levels of the whole video from a global perspective.

The overview of SUM-GDA is depicted in Figure 1, where the key shots of the input video are selected according to the frame scores that reflect the importance of each frame to the whole video. These scores are obtained by score regression with the global diverse attention mechanism, which is the core component of the proposed model. In this work, we have explored both supervised and unsupervised variants of this model, which are evaluated by experiments on several video data sets including SumMe [6], TVSum [13], and VTW [14], in different data settings, i.e., Canonical, Augmented, and Transfer. Empirical studies demonstrate that the proposed method outperforms other state-of-the-art approaches in both supervised and unsupervised scenarios. Furthermore, the selected key shots for some randomly sampled videos are visually shown to further validate the effectiveness of the global diverse attention mechanism.

Fig. 1. The Overview of SUM-GDA. It exploits the global diverse attention mechanism to diversify the selected frames as the final summary. In Row 3: The heights of colorized bars represent different measurements of frames, i.e., attention weights, which are computed as dissimilarity scores based on pairwise information within the video. Such different attention weights can reflect the diversity of frames in a global view. These diverse attention weights are utilized to predict the importance score of each frame in the video. In Row 2: The light blue bars represent the predicted importance scores of video frames, and the dark blue bars indicate the frames selected for generating the summary.

In short, the main contributions of this work can be highlighted in the following aspects:
• A global diverse attention mechanism is developed to model the temporal dependency of video by using pairwise relations between every two frames regardless of their stride magnitude, which helps handle the long-range dependency problem of RNN models.
• By directly calculating the pairwise similarity matrix that reflects the relations between source frame and target frame, SUM-GDA only needs very limited computational costs, so it is inherently much more efficient than other competing approaches.
• The proposed SUM-GDA model is explored in supervised, unsupervised and semi-supervised scenarios. Empirical studies in terms of both quantitative and qualitative views are provided.
• The diversity of generated summaries and the influence of optical flow features are both investigated.

The remaining parts are organized as follows. Sec. II reviews some closely related works and Sec. III introduces the global diverse attention mechanism and the SUM-GDA model in two scenarios. Then, Sec. IV describes a number of experiments carried out on several data sets and reports both quantitative and qualitative results as well as rigorous analysis. Finally, we conclude this paper.

II. RELATED WORK

Video summarization [15] has been a long-standing problem in multimedia analysis with great practical potential, and many relevant works have been explored in recent years.

Traditional approaches can be generally divided into two categories, i.e., unsupervised learning and supervised learning. Unsupervised methods generally choose key shots in terms of heuristic criteria, such as relevance, representativeness, and diversity. Among them, cluster-based approaches [16] aggregate visually similar shots into the same group, and the obtained group centers are selected as the final summary. In earlier works, clustering algorithms are directly applied to video summarization [17], and later some researchers combine domain knowledge to improve performance [16], [4]. Besides, dictionary learning [1] is another stream of unsupervised methods, and it finds key shots to build a dictionary as the representation of video in addition to preserving the local structure of data when necessary. Supervised methods mainly utilize human-labeled summaries as training data to learn how humans would summarize videos, which helps obtain impressive results. For instance, Ghosh et al. [5] and Gygli et al. [6] treat video summarization as a scoring problem which depends on the interestingness and the importance of video frames respectively, where the shots with higher scores are chosen to generate the video summary. Lu and Grauman [18] proposed a story-driven model to tell the story of an egocentric video. Furthermore, there are also some attempts to exploit auxiliary information such as web images and video categories to enhance the summarization performance.

Recently, deep learning approaches have attracted increasing interest for summarizing videos. For example, Yao et al. [19] proposed a deep ranking model based on convolutional neural networks to encode the input video and output the ranking scores according to the relationship between highlight and non-highlight video segments; Zhang et al. [20] applied bidirectional LSTM to video summarization, and it can predict the probability of each selected shot; Zhang et al. [20] also developed DPP-LSTM, which further introduces Determinantal Point Process (DPP) to vsLSTM; Zhao et al. [21] put forward a hierarchical architecture of LSTMs to model the long
LI et al.: EXPLORING GLOBAL DIVERSE ATTENTION VIA PAIRWISE TEMPORAL RELATION FOR VIDEO SUMMARIZATION 3

temporal dependency among video frames; Yuan et al. [3] designed cycle-consistent adversarial LSTM networks to sufficiently encode the temporal information in an unsupervised manner, resulting in promising summarization performance. In addition, Zhou et al. [22] employed reinforcement learning and Rochan et al. [23] adapted the convolutional sequence model to promote the quality of video summary; Rochan et al. [24] attempted to train the model with unpaired samples by adopting a key frame selector and a summary discriminator network in an unsupervised manner, and the generated summary may come from different source videos.

Inspired by human visual perception [25], the attention mechanism is exploited to guide the feed forward process of neural networks. To this end, attention-based models have gained satisfactory performance in various domains, such as person re-identification [26], visual tracking [27], video question answering [28], and action recognition [29]. Among them, the self-attention [12] mechanism leverages the relation among all points of a single sequence to generate the corresponding representation of data, which has led to its wide application in real-world tasks [30]; e.g., Fajtl et al. [31] utilized self-attention to only measure the importance of each frame while neglecting frame diversity. In contrast, our work takes advantage of the self-attention mechanism to model the temporal pairwise relation to quantify the importance of each frame and also to promote the diversity among the frames, leading to boosted summarization performance.

III. THE PROPOSED APPROACH

In this section, we first introduce the proposed video summarization model with global diverse attention (SUM-GDA) in a supervised manner, which employs convolutional neural networks as backbone. The developed model is expected to be much more efficient than typical recurrent neural networks like LSTMs, since existing RNN-based methods always have difficulty in modeling the long-range dependency within videos, especially in surveillance environments. To overcome this difficulty, we propose the global diverse attention mechanism by adapting self-attention [12] to the video summarization task. In particular, we design a pairwise similarity matrix to accommodate diverse attention weights of video frames, which can encode temporal relations between every two frames over a wide range of strides (i.e., the gap between source frame and target frame). Diverse attention weights are then transformed to importance scores for every frame, and the importance score indicates the importance of a frame that may characterize some critical scene or objects within the given video.

Essentially, SUM-GDA leverages the global diverse attention mechanism to derive the dissimilarity representation of video frames, which makes it possible to accomplish video summarization even when the sequence order of video frames is disrupted in some situations, e.g., several videos regarding the same news are combined together by careful editing. Thus, SUM-GDA can be trained in parallel on multiple GPUs, which significantly reduces computational costs.

Fig. 2. The architecture of SUM-GDA. Note that $W^Q$, $W^K$ and $W^V$ are three parameters to be learned by training the model, $d$ is the normalized pairwise dissimilarity vector, and $c_i$ indicates the context of frame $x_i$.

A. Global Diverse Attention

Global diverse attention mainly exploits diversity information to enhance the performance of video summarization. In particular, GDA adopts self-attention to encode the pairwise temporal information within video frames, and then aggregates the pairwise information of all frame pairs in a video to globally evaluate the importance of each frame to the whole video. Afterwards, GDA transforms importance scores into pairwise dissimilarity scores which indicate the appearance variation between two frames.

A given video $V$ consists of $N$ frames $\{v_1, v_2, \ldots, v_N\}$. For its $i$-th frame and $j$-th frame, $i, j \in \{1, 2, \ldots, N\}$, the corresponding feature representation vectors $x_i \in \mathbb{R}^D$ and $x_j \in \mathbb{R}^D$ are derived from pre-trained CNN networks like GoogLeNet [32]. The pairwise attention matrix $A \in \mathbb{R}^{N \times N}$ essentially reveals the underlying temporal relation across frame pairs of the video, and the entry $A_{ij}$ between the $i$-th frame and the $j$-th frame is

$$A_{ij} = \frac{1}{\sqrt{q}} (W^Q x_i)^T (W^K x_j), \qquad (1)$$

where $q > 0$ is a constant; the two linear projections $W^Q \in \mathbb{R}^{D \times D}$ and $W^K \in \mathbb{R}^{D \times D}$ correspond to the paired video frames $x_i$ and $x_j$ respectively, and they are parameters to be learned by training the model. Instead of directly using the dot product, we scale the output attention value with an empirical scaling factor $\frac{1}{\sqrt{q}}$ ($q = D$), because the model is very likely to generate very small gradients after applying a softmax

function without scaling, especially during the back-propagation process.

The obtained pairwise attention weights $A_{ij}$ are then converted to corresponding normalized weights $\alpha_{ij}$ by using the following softmax function, i.e.,

$$\alpha_{ij} = \frac{\exp(A_{ij})}{\sum_{r=1}^{N} \exp(A_{rj})}. \qquad (2)$$

These pairwise attention weights only reveal the importance of each frame to the video, while a good summary is expected to be composed of diverse frames or key shots. Hence, to diversify video frames globally, GDA attends to those frames that are dissimilar to the other frames in the whole video. Mathematically, we compute the pairwise dissimilarity vector $\hat{d} = [\hat{d}_1, \hat{d}_2, \ldots, \hat{d}_N] \in \mathbb{R}^N$ by

$$\hat{d}_i = \prod_{j=1}^{N} (1 - \alpha_{ij}), \qquad d = \frac{\hat{d}}{\|\hat{d}\|_1}, \qquad (3)$$

where $\|\cdot\|_1$ denotes the $\ell_1$-norm and the operator $\prod$ represents the product of multiple elements. The vector $d$ denotes the normalized pairwise dissimilarity and its elements $\{d_i\}_{i=1}^{N}$ are the global diverse attention weights, which indicate how the $i$-th frame $x_i$ differs from the whole input video and simultaneously suggest the diversity of the frames.

To capture frame semantic information modeled by the weighted context matrix $C \in \mathbb{R}^{D \times N}$, we apply a linear mapping to the normalized pairwise dissimilarity vectors as

$$c_i = d_i \otimes (W^V x_i), \qquad (4)$$

where the linear projection matrix $W^V \in \mathbb{R}^{D \times D}$ is the parameter to be learned, $c_i \in \mathbb{R}^D$ is the weighted vector reflecting the context of the $i$-th frame, and $\otimes$ is the element-wise product.

B. The SUM-GDA Model

Incorporating global diverse attention, the proposed SUM-GDA model includes the score regression module and the linear embedding module following the feed forward layer, and its architecture is illustrated in Figure 2. The model first extracts feature vectors by a pre-trained CNN and computes the global diverse attention matrix $A$. Then, the weighted features are handled by two fully-connected layers including the linear embedding function $\phi(\cdot)$ and the score regression function $y(\cdot)$. Those frames with high regression scores are selected to form the final summary. In the feed forward layer, for the weighted context matrix $C$, a linear transformation is performed with the dropout layer using the layer normalization technique. For score regression, we compute frame scores $y \in \mathbb{R}^N$ using two linear layers with ReLU activation function, dropout, and layer normalization in between.

The proposed model can thus predict the likelihood of one video frame being included in the final summary. To further diversify the selected frames, we adopt the Determinantal Point Process (DPP) [33] technique, which defines a distribution that measures the negative correlation over all the subsets. DPP has been widely used in summarization methods [34] as it can promote diversity within the selected subsets.

Given the index set $\mathcal{Y} = \{1, 2, \ldots, N\}$, the positive semi-definite kernel matrix $L \in \mathbb{R}^{N \times N}$ is computed to represent the frame-level pairwise similarity, and the probability of a subset $Y_{sub} \subseteq \mathcal{Y}$ is

$$P_L(Y = Y_{sub}; L) = \frac{\det(L_{Y_{sub}})}{\det(L + I_N)}, \qquad (5)$$

where $Y$ is the random variable that takes an index value, $I_N$ is an $N \times N$ identity matrix, and $\det(L + I_N)$ is a normalization constant. If two items within the same subset are similar, then the probability $P(\{i, j\} \subseteq Y; L) = L_{ii} L_{jj} - L_{ij}^2$ will be close to zero, i.e., the probability of the subset is driven to zero. Otherwise, a high probability indicates that the subset has high variation or diversity.

Inspired by the quality-diversity decomposition [33], we enhance DPP by explicitly modeling the variation, defining the kernel matrix $L$ as

$$L_{ij} = y_i y_j \Phi_{ij} = y_i y_j \exp(-\beta \|\phi_i - \phi_j\|_2^2), \qquad (6)$$

where the pairwise similarity between the paired frames $x_i$ and $x_j$ is derived from two linear transformations $\phi_i$ and $\phi_j$, whose output frame scores are $y_i$ and $y_j$.

a) Variation Loss: Since DPP enforces the high diversity constraint on the selection of frame subsets, the redundancy of the video summary can be reduced. Such diversity can be evaluated by the variation loss, i.e.,

$$\mathcal{L}_{var} = -\sum_{Y_{sub} \subseteq \mathcal{Y}} \log P_L(Y = Y_{sub}; L). \qquad (7)$$

b) Keyframe Loss: In the supervised setting, we use ground-truth annotations of key frames $\hat{y}$ during training and define the key frame loss as

$$\mathcal{L}_{key} = -\sum_{i=1}^{N} \left( \hat{y}_i \log y_i + (1 - \hat{y}_i) \log(1 - y_i) \right). \qquad (8)$$

By combining Eq. (7) and Eq. (8), we can obtain the supervised loss of the proposed SUM-GDA model:

$$\mathcal{L}_{sup} = \mathcal{L}_{key} + \mathcal{L}_{var}. \qquad (9)$$

During model training, the above loss functions are optimized iteratively. By incorporating DPP with score regression, we can build a unified end-to-end deep neural network architecture SUM-GDA for video summarization, and its training procedures are summarized in Algorithm 1.

C. Unsupervised SUM-GDA

In many practical tasks, SUM-GDA$_{unsup}$ can learn from a set of untrimmed videos without supervision.

Generally speaking, it is difficult for different users who are asked to evaluate the same video to reach a consensus on the final summary. Instead of using the oracle, we replace the key frame loss in Eq. (8) with the length regularization $\mathcal{L}_{len}$ balanced by the summary ratio $\sigma$, shown below:

$$\mathcal{L}_{len} = \left\| \frac{1}{N} \sum_{i=1}^{N} y_i - \sigma \right\|_2. \qquad (10)$$

Algorithm 1 SUM-GDA Model Training
Input: Set of $M$ videos $\mathcal{V} = \{V_1, V_2, \cdots, V_M\}$, learning rate $\eta$.
Output: Learned model parameters $\Theta$.
1: Initialize all parameters denoted by $\Theta$ using Xavier.
2: Extract frame-level features $\{X_m\}_{m=1}^{M}$ for all videos.
3: repeat
4:   for $m = 1$ to $M$ do
5:     Use $X_m$ to calculate global pairwise attention matrix $A \in \mathbb{R}^{N \times N}$ using Eq. (1).
6:     Calculate the normalized pairwise dissimilarity vector $d \in \mathbb{R}^N$ by Eqs. (2)(3).
7:     Compute weighted context vector $c_i$ by Eq. (4).
8:     Calculate frame score $y_i$ by score regression $y(\cdot)$.
9:     Obtain transformed feature vector $\phi_i$ by linear embedding function $\phi(\cdot)$.
10:    Compute the loss $\mathcal{L}$ using Eq. (9) or Eq. (12).
11:    $\Theta \leftarrow \Theta - \eta \nabla_\Theta \mathcal{L}(\Theta)$.
12:  end for
13: until convergence
14: return $\Theta$.

Algorithm 2 Video Summary Generation
Input: Test video $V$ and model parameters $\Theta$.
Output: Video summary $\mathcal{S}$.
1: Initialization: $\mathcal{S} \leftarrow \emptyset$.
2: Extract frame features $X = [x_1, x_2, \cdots, x_N] \in \mathbb{R}^{D \times N}$ of $V$ via pre-trained CNN model.
3: Use KTS [36] to partition the test video into different segments $\{S_k\}_{k=1}^{T}$.
4: Compute global pairwise attention matrix $A \in \mathbb{R}^{N \times N}$ using Eq. (1).
5: Calculate the normalized dissimilarity vector $d \in \mathbb{R}^N$ by Eqs. (2)(3).
6: Compute weighted feature $c_i \in \mathbb{R}^D$ via Eq. (4).
7: Calculate frame score $y_i$ for each context feature $c_i$ by passing the feed forward layer and score regression module.
8: Get $p_k$ for each shot $S_k$ generated by Eq. (13).
9: for all shots $S_k$ in video $V$ do
10:   if $p_k = 1$ then
11:     $\mathcal{S} \leftarrow \mathcal{S} \cup \{S_k\}$.
12:   end if
13: end for
14: return $\mathcal{S}$.
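The attention steps shared by both algorithms (Eqs. (1)-(4)) and the shot selection step of Algorithm 2 (the 0/1 knapsack of Eq. (13)) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function names, the random toy weights standing in for learned $W^Q$, $W^K$, $W^V$, and the toy shot scores and lengths are hypothetical, and the feed forward and score regression layers are omitted.

```python
import numpy as np

def global_diverse_attention(X, Wq, Wk, Wv):
    """X: (D, N) frame features, one column per frame.
    Returns the normalized dissimilarity d (N,) and context matrix C (D, N)."""
    D, _ = X.shape
    # Eq. (1) with q = D: A[i, j] = (W^Q x_i)^T (W^K x_j) / sqrt(q)
    A = (Wq @ X).T @ (Wk @ X) / np.sqrt(D)
    # Eq. (2): softmax over the row index r for every column j
    # (subtracting the column maximum does not change the result)
    E = np.exp(A - A.max(axis=0, keepdims=True))
    alpha = E / E.sum(axis=0, keepdims=True)
    # Eq. (3): d_hat_i = prod_j (1 - alpha_ij), then l1 normalization
    d_hat = np.prod(1.0 - alpha, axis=1)
    d = d_hat / np.sum(np.abs(d_hat))
    # Eq. (4): c_i = d_i * (W^V x_i); broadcasting scales column i by d[i]
    C = (Wv @ X) * d
    return d, C

def select_shots(scores, lengths, budget):
    """0/1 knapsack of Eq. (13) by dynamic programming.
    scores: mean shot scores s_k; lengths: integer shot lengths l_k;
    budget: maximum summary length l. Returns the indicator list p_k."""
    T = len(scores)
    best = np.zeros((T + 1, budget + 1))
    for k in range(1, T + 1):
        l, s = lengths[k - 1], scores[k - 1]
        for cap in range(budget + 1):
            best[k, cap] = best[k - 1, cap]
            if l <= cap and best[k - 1, cap - l] + s > best[k, cap]:
                best[k, cap] = best[k - 1, cap - l] + s
    # Backtrack to recover which shots were taken
    p, cap = [0] * T, budget
    for k in range(T, 0, -1):
        if best[k, cap] != best[k - 1, cap]:
            p[k - 1] = 1
            cap -= lengths[k - 1]
    return p

# Toy usage with random stand-ins for the learned projections.
rng = np.random.default_rng(0)
D, N = 8, 5
X = rng.standard_normal((D, N))
Wq, Wk, Wv = (0.1 * rng.standard_normal((D, D)) for _ in range(3))
d, C = global_diverse_attention(X, Wq, Wk, Wv)
p = select_shots(scores=[0.9, 0.5, 0.8], lengths=[4, 3, 2], budget=6)
print(d.round(3), C.shape, p)  # p is [1, 0, 1]
```

Because every entry of $A$ is obtained from matrix products rather than from a recurrence over frames, the whole attention computation maps onto a few dense operations, which is what makes it suitable for parallel hardware.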

Besides, in order to fully represent the diversity of the selected frames without summary annotations, we use the repelling loss $\mathcal{L}_{rep}$ [35] to enhance the diversity among video frames. This loss is calculated as the mean value of pairwise similarities for all $N$ video frames, i.e.,

$$\mathcal{L}_{rep} = \frac{1}{N(N-1)} \sum_{i} \sum_{j \ne i} \frac{\phi_i^T \phi_j}{\|\phi_i\|_2 \|\phi_j\|_2}. \qquad (11)$$

The merit of the repelling loss is that a diverse subset leads to a lower value of $\mathcal{L}_{rep}$. Now, we can obtain the unsupervised loss of SUM-GDA$_{unsup}$ as follows:

$$\mathcal{L}_{unsup} = \mathcal{L}_{len} + \mathcal{L}_{rep}. \qquad (12)$$

SUM-GDA$_{unsup}$ has great potential in practice since it does not require ground-truth annotations, which are hard to collect because human labeling is expensive.

D. Summary Generation

We generate the final summary of an input video by selecting a set of key shots. Following [20], we first generate a set of change points using Kernel Temporal Segmentation (KTS) [36], which indicates the key shot segments, and then constrain the summary length $l$ to the proportion of user summary length to the original video length. After that, we select key shots by the 0/1 Knapsack algorithm, which is formulated as

$$\max_{p_k} \sum_{k=1}^{T} p_k s_k, \quad \text{s.t.} \quad \sum_{k=1}^{T} p_k l_k \le l, \quad s_k = \frac{1}{l_k} \sum_{t=1}^{l_k} y_{kt}, \quad p_k \in \{0, 1\}, \qquad (13)$$

where $s_k$ represents the mean score of a specific key shot within $T$ key shots with length $l_k$, and the key shots with $p_k = 1$ are selected to generate the final summary. Algorithm 2 gives the detailed steps to generate a video summary using the proposed SUM-GDA model.

E. Summary Diversity Metric

To examine the quality of a generated summary, most existing methods adopt the popular F-score [37], [20] metric, which computes the overlapping shots between the generated summary and the user summary. However, this criterion fails to reflect the diversity within the generated summary, and such diversity can be regarded as another way to evaluate the quality of the generated summary. Ideally, the key shots forming the summary should be as diverse as possible. In this work, we propose a summary diversity metric ($\zeta$) to assess the diversity within generated summaries, i.e.,

$$\zeta = \frac{1}{M \cdot T} \sum_{m=1}^{M} \sum_{i=1}^{T} \min_{S_q \in \mathcal{S},\, q = 1, \ldots, |\mathcal{S}|} \left( \| X_m^{S_q} - X_m^{S_i} \|_2 \right), \qquad (14)$$

where $\|\cdot\|_2$ is the $\ell_2$ norm and $\min(\cdot)$ is a function that yields the minimum value; $M$ indicates the number of videos; $\{S_i\}_{i=1}^{T}$ is the set of all shots in a video; $\mathcal{S}$ is the set of all selected key shots in a video and $|\mathcal{S}|$ is the number of selected key shots; and $X_m^{S_q} \in \mathbb{R}^D$ denotes the shot feature vector obtained by averaging feature vectors across all frames in the shot $S_q$ of video $m$ ($V_m$).

The summary diversity metric ($\zeta$) measures the average Euclidean distance between each video shot and its nearest cluster center, i.e., key shot. A smaller $\zeta$ indicates higher diversity within the generated summary. The rationale behind this is that the key shot can be regarded as the cluster center

from the clustering viewpoint. Hence, more closely aggregated clusters suggest more diversified clusters, and thus more diversified key shots. Usually, the selected key shots are more representative than the unselected ones.

IV. EXPERIMENTS

This section mainly explores the summarization performance of the proposed SUM-GDA model on several data sets. First, we give some statistics of the data sets and describe the evaluation metrics as well as the experimental settings. Then, the implementation details are given, followed by the reported results, the ablation study, and further discussions on the semi-supervised scenario, summary diversity and optical flow features.

A. Data Sets

We have evaluated different summarization methods on two benchmark data sets, i.e., SumMe [6] and TVSum [13]. Two additional data sets, i.e., Open Video Project (OVP) [38] and YouTube [38], are used to augment and transfer the training set. SumMe and TVSum include cases where the scene changes quickly as well as slowly, which is challenging. SumMe consists of 25 user videos and different annotations. TVSum has 50 user videos and each video is annotated by 20 users with frame-level importance scores; OVP contains 50 videos while YouTube has 39 videos.

We also examined the performance on the VTW [14] database, which is much larger and contains 2,529 videos collected from the YouTube web site. The videos in VTW are mostly shorter than those in SumMe and TVSum. Each video is annotated with highlighted key shots.

B. Evaluation Metric

Following the protocols in [37], [20], the similarity of the generated summary is evaluated by measuring its agreement with the user-annotated summary. We use the harmonic mean of precision and recall, i.e., F-score, as the evaluation metric. The correct parts are the overlapped key shots between user annotations and the generated summary. Formally, the precision is $P = \frac{\text{overlap}}{\text{prediction}}$ and the recall is $R = \frac{\text{overlap}}{\text{user annotation}}$. Then the F-score is formulated as:

$$F = \frac{2 \times P \times R}{P + R} \times 100\%. \qquad (15)$$

C. Evaluation Settings

For SumMe and TVSum, we evaluate the proposed model in three settings including Canonical (C), Augmented (A), and Transfer (T), and the details are shown in Table I. The data set is divided into two parts: 80% for training and 20% for testing. We adopt five-fold cross validation and report the results by averaging the F-scores over five testing splits. Following the protocol in [6], [13], we take the maximum of F-score over different reference summaries for SumMe, and compute the average of F-score over those summaries for TVSum. For VTW, we evaluate the model in the canonical setting, where the data set is split into two parts including 2,000 videos for training and the rest for testing, in line with [39].

TABLE I
EVALUATION SETTINGS FOR SUMME. TO EVALUATE TVSUM, WE SWAP SUMME AND TVSUM.

Setting     Training                             Testing
Canonical   80% SumMe                            20% SumMe
Augmented   80% SumMe + OVP + YouTube + TVSum    20% SumMe
Transfer    OVP + YouTube + TVSum                SumMe

D. Implementation Details

For fair comparison, we use the pool5 features of GoogLeNet pre-trained on ImageNet as the feature vector $x \in \mathbb{R}^{1024}$ for each frame. The number of hidden units is set to 1024, the dropout rate is set to 0.6, and the L2 weight decay coefficient is set to $10^{-5}$. For SUM-GDA$_{unsup}$, $\sigma$ is 0.3. We train our model using the Adam optimizer with the initial learning rate $5 \times 10^{-5}$ for SumMe, $10^{-4}$ for TVSum and $5 \times 10^{-4}$ for VTW, and training is terminated after 200 epochs. All experiments were conducted on a machine with an NVIDIA GTX 1080Ti GPU using the PyTorch platform.

E. Quantitative Results

We compare the proposed model with a number of state-of-the-art video summarization methods including both supervised and unsupervised approaches, for which we directly use the records reported in their original papers. The results are reported in Table II and Table III respectively for supervised and unsupervised methods.

TABLE II
PERFORMANCE COMPARISON (F-SCORE %) WITH SUPERVISED METHODS ON SUMME AND TVSUM.

                          SumMe               TVSum
Method                    C     A     T       C     A     T
Bi-LSTM [20]              37.6  41.6  40.7    54.2  57.9  56.9
DPP-LSTM [20]             38.6  42.9  41.8    54.7  59.6  58.7
SUM-GAN$_{sup}$ [40]      41.7  43.6  -       56.3  61.2  -
DR-DSN$_{sup}$ [22]       42.1  43.9  42.6    58.1  59.8  58.9
SUM-FCN [23]              47.5  51.1  44.1    56.8  59.2  58.2
HSA-RNN [39]              -     44.1  -       -     59.8  -
CSNet$_{sup}$ [41]        48.6  48.7  44.1    58.5  57.1  57.4
VASNet [31]               49.7  51.1  -       61.4  62.4  -
M-AVS [42]                44.4  46.1  -       61.0  61.8  -
SUM-GDA                   52.8  54.4  46.9    58.9  60.1  59.0

TABLE III
PERFORMANCE COMPARISON (F-SCORE %) WITH UNSUPERVISED METHODS ON SUMME AND TVSUM.

                          SumMe               TVSum
Method                    C     A     T       C     A     T
SUM-GAN$_{rep}$ [40]      38.5  42.5  -       51.9  59.3  -
SUM-GAN$_{dpp}$ [40]      39.1  43.4  -       51.7  59.5  -
DR-DSN [22]               41.4  42.8  42.4    57.6  58.4  57.8
CSNet [41]                51.3  52.1  45.1    58.8  59.0  59.2
Cycle-SUM [3]             41.9  -     -       57.6  -     -
UnpairedVSN [24]          47.5  -     -       55.6  -     -
SUM-GDA$_{unsup}$         50.0  50.2  46.3    59.6  60.5  58.8

On SumMe and TVSum, the compared supervised methods contain Bi-LSTM [20], DPP-LSTM [20], SUM-GAN$_{sup}$ [40], DR-DSN$_{sup}$ [22], SUM-FCN [23], HSA-RNN [39], CSNet$_{sup}$ [41], VASNet [31], and M-AVS [42]; the compared unsupervised methods include SUM-GAN$_{rep}$/SUM-GAN$_{dpp}$ [40], DR-DSN [22], CSNet [41], Cycle-SUM [3], and UnpairedVSN [24]. Bi-LSTM and DPP-LSTM employ LSTMs
LI et al.: EXPLORING GLOBAL DIVERSE ATTENTION VIA PAIRWISE TEMPORAL RELATION FOR VIDEO SUMMARIZATION 7

TABLE IV
P ERFORMANCE COMPARISON (F- SCORE %) ON VTW.
Method Precision Recall F-score
HD-VS [19] 39.2 48.3 43.3
DPP-LSTM [20] 39.7 49.5 44.3
HSA-RNN [39] 44.3 54.8 49.1
SUM-GDAunsup 47.8 48.6 47.9
SUM-GDA 50.1 50.7 50.2

to encode the temporal information of video. Meanwhile,


SUM-GAN and its variants that adopt generative adversarial Fig. 3. F-score(%) of SUM-GDAunsup for different summary ratios σ on
networks utilize the generated summary to reconstruct video SumMe, TVSum, and VTW.
frames, and Cycle-SUM additionally uses reconstructed video
to construct summary to further promote the performance.
Moreover, reinforcement learning is introduced in DR-DSN
to solve the summarization problem; SUM-FCN uses a fully
convolutional sequence network to encode the information;
HSA-RNN utilizes hierarchical structure RNN to model the
information of different levels; CSNet exploits different strides
and chunks to model the temporal relation; VASNet leverages
self-attention to only measure the importance of each frame;
M-AVS adopts the encoder-decoder network with attention;
UnpariedVSN trains the model with unpaired data via key
frame selector network and summary discriminator network.
On VTW, the compared methods are all supervised methods,
since to our best knowledge there are no works using unsu-
pervised method on this database. The compared approaches
contain HD-VS [19], DPP-LSTM [20], and HSA-RNN [39]. Fig. 4. Time comparison in terms of training and test time with F-score(%)
on SumMe and TVSum in Canonical setting.
Among them, HD-VS uses two-stream CNNs to summarize
the video. The results are reported in Table IV.
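
The precision, recall and F-score columns of Table IV follow Eq. (15), and the per-video aggregation follows the protocol of the evaluation settings (maximum over reference summaries for SumMe, average for TVSum). The following is a minimal sketch of that computation, not the authors' evaluation code. It assumes that each summary is given as a binary per-frame mask (1 means the frame lies in a selected key shot); this representation and the function names `prf` and `video_f_score` are hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def prf(pred: Sequence[int], ref: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall and F-score (in %) of one prediction against one reference.

    Assumption: both summaries are binary per-frame masks of equal length.
    """
    if len(pred) != len(ref):
        raise ValueError("prediction and reference must have the same length")
    overlap = sum(1 for p, r in zip(pred, ref) if p and r)
    n_pred = sum(1 for p in pred if p)
    n_ref = sum(1 for r in ref if r)
    precision = overlap / n_pred if n_pred else 0.0
    recall = overlap / n_ref if n_ref else 0.0
    if precision + recall == 0.0:
        return 0.0, 0.0, 0.0
    f_score = 2 * precision * recall / (precision + recall)  # Eq. (15) before scaling
    return precision * 100, recall * 100, f_score * 100


def video_f_score(pred: Sequence[int], refs: List[Sequence[int]], how: str) -> float:
    """Aggregate the F-score over several user summaries of one video.

    how="max" mirrors the SumMe protocol, how="avg" the TVSum protocol.
    """
    scores = [prf(pred, ref)[2] for ref in refs]
    if how == "max":
        return max(scores)
    if how == "avg":
        return sum(scores) / len(scores)
    raise ValueError("how must be 'max' or 'avg'")


if __name__ == "__main__":
    prediction = [1, 1, 0, 0]
    user_summaries = [[1, 0, 1, 0], [1, 1, 0, 0]]
    print(video_f_score(prediction, user_summaries, "max"))
    print(video_f_score(prediction, user_summaries, "avg"))
```

Taking the maximum rewards agreement with any single annotator, whereas averaging rewards agreement with all annotators at once, which is why the two protocols can rank methods differently.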
From Tables II, III and IV, several interesting observations can be made as
follows.
• SUM-GDA outperforms most of the competing models by large margins in
  different data settings. Our model is better than the rest by at least
  2.8% on SumMe, i.e., boosted by 3.1%, 3.3% and 2.8% in canonical,
  augmented and transfer settings respectively. Moreover, the unsupervised
  model SUM-GDAunsup yields the best performance, 1.2% higher than the
  second-best one, in transfer setting on SumMe, while it gains the top
  performance on TVSum in both canonical and augmented settings. This has
  well validated the advantages of the developed GDA component which models
  the long-range temporal relations among video frames, and those selected
  frames are further diversified by our framework.
• SUM-GDA achieves larger performance improvement than that of SUM-GDAunsup
  on SumMe, but slightly smaller than that on TVSum. This might be due to
  the fact that SumMe is more challenging and it adopts the highest F-score
  among several users, which shows more concentration when doing evaluation.
  Therefore, using supervised approaches tends to make the prediction close
  to the specific ground truth. However, the evaluation on TVSum averages
  F-scores among several users, and the ground truth will be possibly misled
  as different users may not reach a consensus. In consequence, the
  unsupervised SUM-GDAunsup model will provide a step closer to the user
  summaries.
• The proposed SUM-GDA obtains larger gains on VTW compared to other methods
  because global diverse attention exploits the temporal relations among all
  frame pairs and can better encode the importance of each frame.

Moreover, we have examined the sensitivity of the summarization performance
to the summary ratio σ for SUM-GDAunsup. Figure 3 shows F-score on all three
databases. We can observe that when σ is too large or too small, the
performance degrades sharply. The best results are obtained when σ = 0.3. In
addition, due to the large variance in summary proportion (i.e., the ratio
of the summary length to the video length) on VTW, SUM-GDAunsup enjoys
robust performance for σ. As shown by Figure 3, our model achieves the best
when σ = 0.4, which is close to the mean value of all summary proportions on
VTW.

F. Computational Issues
We provide both training time and test time comparison for several
state-of-the-art methods on SumMe and TVSum data sets in Figure 4, which
shows SUM-GDA and SUM-GDAunsup are much more efficient. We found that
SUM-GDAunsup reduces both training and test time significantly, which
verifies that SUM-GDAunsup achieves better trade-off between efficiency and
performance than existing models. In the following, we make some analysis on
computational costs of the proposed method and RNN based methods.
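
As a rough numerical companion to the cost analysis in this subsection, the sketch below counts multiply-accumulate operations for a sequence of n frames with d-dimensional features: on the order of $n d^2$ for a plain recurrent layer and on the order of $n^2 d$ for pairwise attention. The constant factors (two $d \times d$ products per recurrent step, one score pass and one weighted sum for attention), the function names, and the example length n = 300 are assumptions made for illustration; learned projections inside the GDA module are ignored, so the numbers are not measurements of the actual model.

```python
def rnn_macs(n: int, d: int) -> int:
    """Multiply-accumulates of a plain recurrent layer over n steps.

    Assumption: each step applies two d x d matrices (hidden-to-hidden and
    input-to-hidden), i.e. 2 * d * d operations per step.
    """
    return n * 2 * d * d


def attention_macs(n: int, d: int) -> int:
    """Multiply-accumulates of pairwise attention over n frames.

    Assumption: n * n dot products of length d for the scores, plus a
    weighted sum of n vectors of length d for each of the n outputs.
    """
    return 2 * n * n * d


def sequential_steps(n: int, recurrent: bool) -> int:
    """Minimum number of operations that must run one after another."""
    return n if recurrent else 1


if __name__ == "__main__":
    d = 1024  # embedding dimension mentioned in the text
    for n in (300, 1024, 2048):  # hypothetical numbers of sampled frames
        ratio = attention_macs(n, d) / rnn_macs(n, d)
        print(n, rnn_macs(n, d), attention_macs(n, d), round(ratio, 3))
```

Under these assumptions the ratio of the two counts is exactly n/d, so pairwise attention is cheaper whenever the number of sampled frames is below the feature dimension, and the gap closes as sampling becomes denser.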
Given an input sequence $\{z_1, \cdots, z_n\}$ which includes n frame images
with $z_i \in \mathbb{R}^d$, suppose the dimension of the hidden state vector
$h_t$ is d. Then the time complexity of RNN is $O(nd^2)$ since
$h_t = \tanh(W_{hh} h_{t-1} + W_{zh} z_t)$, where
$W_{zh}, W_{hh} \in \mathbb{R}^{d \times d}$ [43]. However, the GDA module of
our approach only requires $O(n^2 d)$, which is faster than RNN when the
sequence length n is smaller than d. This is mostly true for frame image
representations used by state-of-the-art video summarization models [20],
[22], [3], which sample frames at 2 fps from video. Regarding the benchmark
databases SumMe, TVSum, and VTW tested above, the length of video is usually
smaller than the dimension of its embedding representation, e.g., 1024.
Moreover, RNNs [7] and LSTM models [8] suffer from the gradient vanishing and
exploding problem due to the use of hyperbolic tangent and sigmoid activation
functions, which leads to gradient decay over time steps during the training
process. Li et al. [43] found in empirical studies that RNN and LSTM cannot
keep the long-term temporal information of the sequence when its length is
greater than 1000. In addition, the amount of computation that can be
parallelized can be measured by the minimum number of sequentially executed
operations of the module [12], since sequentially executed operations cannot
be parallelized. For RNN based methods, $O(n)$ sequential operations are
required while the GDA module only needs $O(1)$ sequential operations if
enough GPU cards are available. This implies that our GDA module is more
efficient by taking advantage of parallelization. Hence, SUM-GDA is actually
much faster than recurrent neural network based methods.

G. Ablation Study
To examine the influences of different loss terms, we conducted ablation
study on the proposed model and the results are shown in Table V.

TABLE V
F-SCORE (%) OF ALL CASES ON SUMME AND TVSUM IN CANONICAL SETTING.
Row   Method                   SumMe   TVSum
1     SUM-GDAunsup w/o Llen    47.4    57.7
2     SUM-GDAunsup w/o Lrep    48.9    58.9
3     SUM-GDAunsup w/o GDA     43.0    55.7
4     SUM-GDAunsup             50.0    59.6
5     SUM-GDA w/o Lvar         51.6    58.1
6     SUM-GDA w/o GDA          45.1    53.1
7     SUM-GDA                  52.8    58.9

As can be seen from the table, removing length regularization (Row 1)
deteriorates the summarization performance, as the length of generated
summary should be constrained in a sensible range. Besides, the performance
of the model with repelling loss (Row 4) is improved by 1.1% and 0.7% on
SumMe and TVSum respectively in comparison with that without such loss
(Row 2). We can attribute this to the fact that when the repelling loss is
added to the unsupervised extension besides length regularization, pairwise
similarity between partial frames will be reduced, which would lower the
importance scores of those frames in the video. Hence, the global diverse
attention on video frames will be enhanced in some degree. Furthermore, the
model with variation loss, i.e., SUM-GDA (Row 7), is better than that without
it (Row 5) by 1.2% and 0.8% on SumMe and TVSum respectively, because
variation loss helps generate diverse subsets. In addition, both
SUM-GDAunsup and SUM-GDA are improved by adopting the GDA module, as shown
in Row 3 and Row 6. Therefore, different regularization terms or losses play
different roles in promoting global diverse attention on video frames.

H. Qualitative Results
We visualize the selected key shots of different videos on TVSum generated
by our SUM-GDA model. From Figure 5, it can be clearly seen that SUM-GDA
selects most peak points according to frame scores using the 0/1 Knapsack
algorithm. As depicted by Figure 6, SUM-GDA and SUM-GDAunsup yield different
frame scores. The bottom figure for SUM-GDAunsup generates frame scores with
more sparsity, as we constrain the mean of frame scores to be close to the
summary ratio, which may lead to sparsity.
To verify the effectiveness of SUM-GDA, we visualize global diverse
attention weights on TVSum in Figure 7, which describes a traffic accident
where a train crashed into a car. Comparing the middle image and the
right-most image, they look visually similar but their attention weights are
very different. This does help achieve the goal of diversifying the selected
frames because there exists redundancy in similar frames, only one of which
is required to form the final summary in practice.

Fig. 5. Visualization of SUM-GDA generated summaries for different videos in
TVSum: (a) Video 1: changing tire; (b) Video 10: production workshop;
(c) Video 12: daily life of one woman; (d) Video 41: motorcycle flipping.
Light blue bars represent ground-truth scores, and dark blue bars denote
generated summaries.

Fig. 6. Visualization of selected key-shots for test video 30 (motorcycle
show) in TVSum: (a) SUM-GDA; (b) SUM-GDAunsup.

I. Semi-Supervised Scenario
In many practical applications, it often appears that only partial videos
have labels due to costly human labeling, while a large number of unlabeled
videos are easily available and are useful for training the summarization
model. Regarding this scenario, we additionally examined the proposed
methods in semi-supervised setting (which has not been explored in previous
works). Specifically, for SumMe, we use 80% SumMe + OVP + YouTube + TVSum
for training (labeled videos) and 20% SumMe for testing (test videos). For
TVSum, the above SumMe and TVSum are swapped. Those unlabeled videos are
sampled from VTW by ignoring the corresponding annotations. Here, test
videos are only used during test, which is slightly different from
transductive learning which considers the test videos as unlabeled samples
used for learning the model.
The results on SumMe and TVSum are shown in Table VI, from which it can be
observed that the summarization performance is consistently improved with
the increasing number of unlabeled samples. This indicates that unlabeled
data can also benefit the model learning and more unlabeled data can provide
more consolidated information to enhance the quality of generated summaries.

TABLE VI
PERFORMANCE (F-SCORE %) WITH DIFFERENT NUMBERS OF UNLABELED SAMPLES IN
SEMI-SUPERVISED SETTING. THE FIRST ROW DENOTES THE NUMBER OF UNLABELED
VIDEOS FROM VTW DURING TRAINING.
# of unlabeled   0      500    1000   1500   2000
SumMe            54.4   55.4   55.6   56.0   56.3
TVSum            60.1   60.4   60.5   60.6   60.7

J. Diversity of Generated Summaries
To evaluate diversity of summaries generated by different methods, we use
the Diversity Metric (ζ) defined in Eq. (14), and the results are recorded in
Table VII. As the table shows, the proposed SUM-GDA approach has the lowest
value ζ, indicating it can generate the most diverse summaries on both of
the two benchmark data sets. It can be attributed to the fact that SUM-GDA
and SUM-GDAunsup can benefit from global diverse attention mechanism, which
encourages the diversity of generated summaries.

TABLE VII
SUMMARY DIVERSITY ON SUMME AND TVSUM MEASURED BY DIVERSITY METRIC ζ (↓);
SMALLER ζ MEANS HIGHER DIVERSITY.
Database   DPP-LSTM [20]   DR-DSN [22]   SUM-GDA   SUM-GDAunsup
SumMe      0.133           0.109         0.099     0.108
TVSum      0.319           0.312         0.304     0.308

Note that lower ζ means the video shots are closer to the corresponding key
shot, i.e., the shots are more densely distributed. To illustrate the
summary diversity metric ζ, here we examine our method on two randomly
selected videos from SumMe and TVSum respectively. The video shot
distribution results are depicted in Figure 8, which are obtained by using
t-SNE [44] to project all the shot features into two-dimensional data space
after inputting the given video to the SUM-GDA model. In this figure, the
red solid circle denotes the feature vector of a key shot, the blue solid
circle denotes the feature vector of an unselected shot, and the dashed
irregular circle denotes the group including one key shot and its
surrounding shots which are the closest ones.
From Figure 8, it is vividly shown that with regard to different video shot
groups (dashed circles), our model tends to select those shots leading to
smaller distances with their neighbors (i.e., unselected shots) as the key
shots to form the final summary. Besides, we can observe that the selected
key shot and its group neighbors are densely clustered, which indicates the
small ζ value according to the definition in Sec. III-E. Actually, the
selected key shot acts as a representative shot in the group and may be
linearly reconstructed from its group neighbors. Thus, the selected key
shots from different groups are not only far away but also visually diverse.
In other words, when the group shot points obtained by using the model are
closer, the generated summaries will be more diverse.

K. Optical Flow Features
Existing methods often use RGB images as the video input, which do not
consider motion information within the video. To capture the video motion,
we investigate the influences of optical flow features on generated
summaries. Here we extract optical flow features of each video using the
TV-L1 [45] package and we use the pool5 features of GoogLeNet for fairness.
We conduct the experiments on three data sets with three kinds of different
inputs including RGB, Optical Flow, and the combination of the two. The
experimental setting follows the Canonical setting described in Table I. The
performance results are shown in Table VIII.

Fig. 7. Visualization of frame scores and attention weights on Video 7 in TVSum by SUM-GDA. Light blue bars represent predicted scores, and dark blue
bars denote generated summaries. The magnitude of the attention weights, normalized to the same range as the corresponding scores, is plotted with a light
red line. Purple dashed lines are the change points generated by KTS [36].

(a) Video 14 from SumMe: Notre Dame de Paris

(b) Video 30 from TVSum: Paper Wasp Removal

Fig. 8. Diversity illustration using t-SNE [44] representation over the video shot features using SUM-GDA. The denser the group points (dashed circle), the
more diverse the generated summaries. In (a), since $dis_1 < dis_2$, the blue circle belongs to the right group; similarly, in (b), $dis_3 < dis_4$ and $dis_5 < dis_6$,
which decides the group the corresponding blue circle belongs to. Here dis computes the distance between two circle points.
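
The grouping rule described in the caption of Fig. 8 (an unselected shot joins the group of the key shot it is closest to) can be sketched as follows. This is only an illustration under stated assumptions: shots are given as low-dimensional points (for example after a t-SNE projection), distances are Euclidean, and `mean_distance_to_key` is a simple compactness proxy in the spirit of "lower ζ means the shots are closer to the corresponding key shot". It is not the Diversity Metric ζ of Eq. (14), which is not reproduced here, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Point = Sequence[float]


def assign_groups(shots: List[Point], key_idx: List[int]) -> Dict[int, List[int]]:
    """Assign every unselected shot to its nearest key shot (Euclidean distance)."""
    groups: Dict[int, List[int]] = {k: [] for k in key_idx}
    selected = set(key_idx)
    for i, point in enumerate(shots):
        if i in selected:
            continue
        nearest = min(key_idx, key=lambda k: math.dist(point, shots[k]))
        groups[nearest].append(i)
    return groups


def mean_distance_to_key(shots: List[Point], groups: Dict[int, List[int]]) -> float:
    """Average distance between unselected shots and the key shot of their group."""
    dists = [
        math.dist(shots[i], shots[k])
        for k, members in groups.items()
        for i in members
    ]
    return sum(dists) / len(dists) if dists else 0.0


if __name__ == "__main__":
    # Hypothetical 2-D shot features; shots 0 and 2 are the selected key shots.
    shots = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (9.0, 0.0), (4.0, 0.0)]
    groups = assign_groups(shots, key_idx=[0, 2])
    print(groups)
    print(mean_distance_to_key(shots, groups))
```

A smaller average distance means each key shot sits inside a tight cluster of the shots it represents, which is the situation the dashed circles in Fig. 8 are meant to show.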
TABLE VIII
PERFORMANCE COMPARISON WITH DIFFERENT INPUT FEATURES IN TERMS OF PRECISION,
RECALL, F-SCORE (%).
Database   Method         RGB   Opt.Flow   Precision   Recall   F-score
SumMe      SUM-GDA        ✓     -          52.5        53.6     52.8
SumMe      SUM-GDA        -     ✓          46.9        49.8     47.9
SumMe      SUM-GDA        ✓     ✓          47.4        50.0     48.2
SumMe      SUM-GDAunsup   ✓     -          49.6        52.3     50.0
SumMe      SUM-GDAunsup   -     ✓          47.6        52.0     49.8
SumMe      SUM-GDAunsup   ✓     ✓          48.2        51.6     49.4
TVSum      SUM-GDA        ✓     -          59.0        58.9     58.9
TVSum      SUM-GDA        -     ✓          59.2        59.3     59.2
TVSum      SUM-GDA        ✓     ✓          61.0        60.9     61.0
TVSum      SUM-GDAunsup   ✓     -          59.5        59.7     59.6
TVSum      SUM-GDAunsup   -     ✓          59.4        59.1     59.2
TVSum      SUM-GDAunsup   ✓     ✓          60.2        60.3     60.2
VTW        SUM-GDA        ✓     -          50.1        50.7     50.2
VTW        SUM-GDA        -     ✓          41.7        49.5     43.3
VTW        SUM-GDA        ✓     ✓          40.9        47.4     42.1
VTW        SUM-GDAunsup   ✓     -          47.9        48.6     47.9
VTW        SUM-GDAunsup   -     ✓          37.2        42.1     38.1
VTW        SUM-GDAunsup   ✓     ✓          38.1        43.8     39.2

From Table VIII, we can observe that optical flow features improve the
performance on TVSum (e.g., the F-score of SUM-GDA rises from 58.9% to 61.0%
when they are combined with RGB features). This is because videos in this
data set mostly contain a large portion of actions (e.g., motorcycle
flipping, daily life of one woman) and the motion information plays an
important role in learning the video summarization model. However, for SumMe
and VTW data sets, the performance drops when incorporating optical flow
features. This might be because videos in these two data sets contain many
different scenarios with fewer actions or slow-moving actions, while optical
flow mainly captures the motion information and neglects the still
backgrounds which are also essential for learning the effective model.
Sometimes, optical flow features will even provide misleading direction to
capture the motion information of video in such situations, resulting in
less promising summaries.

V. CONCLUSION AND FUTURE WORK
This paper has proposed a novel video summarization model called SUM-GDA,
which exploits the global diverse attention mechanism to model pairwise
temporal relations among video frames. In a global perspective, the mutual
relations among different frame pairs in videos can be sufficiently
leveraged for better obtaining informative key frames. To select the optimal
subset of key frames, our model adopts determinantal point process to
enhance the diversity of chosen frames. Particularly, determinantal point
process results in different frame groups revealing diversified semantics in
videos. Those chosen frames, indicated by high frame scores, are regarded as
the concise collection of source video with good completeness and least
redundancy. Moreover, we extend SUM-GDA to the unsupervised scenario where
the heavy cost of human labeling can be saved, and this can facilitate a
variety of practical tasks with no supervised information. To investigate
the summarization performance of the proposed models, we conducted
comprehensive experiments on three publicly available video databases, i.e.,
SumMe, TVSum, and VTW. Empirical results have verified that both SUM-GDA and
SUM-GDAunsup can yield more promising video summaries compared with several
state-of-the-art approaches. In addition, we have examined the diversity of
generated summaries and visualized the results, which suggest that the
global diverse attention mechanism indeed benefits a lot in diversifying the
key frames chosen for constituting video summary. Also, the idea of global
diverse attention might be helpful in training base learners for deep
ensemble learning.
While we test our model on optical flow features of video which provide
temporal dynamics, they do not always promote the performance and sometimes
even reduce the summarization quality. Since temporal information is
critical for encoding sequential data, it is worth putting more emphasis on
improving the way of extracting optical flow from video and also exploring
other possible ways to better capture temporal structure of video data in
future. On the other hand, our model is advantageous in efficiency when the
sequence length is less than the dimension of embedding representation,
which is mostly true as frame sampling is adopted as preprocessing. However,
this might not be the case when denser sampling is required in some
situations, which needs large computations. Therefore, it is a promising
direction to further boost the efficiency of the summarization model so as
to adapt it to wider scenarios.

REFERENCES
[1] S. Mei, G. Guan, Z. Wang, W. Shuai, M. He, and D. D. Feng, “Video summarization via minimum sparse reconstruction,” Pattern Recognition, vol. 48, no. 2, pp. 522–533, 2015.
[2] X. Li, B. Zhao, and X. Lu, “Key frame extraction in the summary space,” IEEE Transactions on Cybernetics, vol. 48, no. 6, pp. 1923–1934, 2017.
[3] L. Yuan, F. Tay, P. Li, L. Zhou, and J. Feng, “Cycle-sum: Cycle-consistent adversarial lstm networks for unsupervised video summarization,” in Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI), 2019, pp. 9143–9150.
[4] S. E. F. D. Avila, “Vsumm: A mechanism designed to produce static video summaries and a novel evaluation method,” Pattern Recognition Letters, vol. 32, no. 1, pp. 56–68, 2011.
[5] Y. J. Lee, J. Ghosh, and K. Grauman, “Discovering important people and objects for egocentric video summarization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 1346–1353.
[6] M. Gygli, H. Grabner, H. Riemenschneider, and L. J. V. Gool, “Creating summaries from user videos,” in Proceedings of the IEEE Conference on 13th European Conference on Computer Vision (ECCV), 2014, pp. 505–520.
[7] C. L. Giles, G. M. Kuhn, and R. J. Williams, “Dynamic recurrent neural networks: Theory and applications,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 153–156, 1994.
[8] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[9] K.-Y. Huang, C.-H. Wu, and M.-H. Su, “Attention-based convolutional neural network and long short-term memory for short-term detection of mood disorders based on elicited speech responses,” Pattern Recognition, vol. 88, pp. 668–678, 2019.
[10] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724–1734.
[11] J. Donahue, L. Anne Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 677–691, 2016.
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 5998–6008.
[13] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, “Tvsum: Summarizing web videos using titles,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 5179–5187.
[14] K.-H. Zeng, T.-H. Chen, J. C. Niebles, and M. Sun, “Generation for user generated videos,” in Proceedings of the IEEE Conference on European Conference on Computer Vision (ECCV), 2016, pp. 609–625.
[15] H. Fang, J. Jiang, and Y. Feng, “A fuzzy logic approach for detection of video shot boundaries,” Pattern Recognition, vol. 39, no. 11, pp. 2092–2100, 2006.
[16] P. Mundur, R. Yong, and Y. Yesha, “Keyframe-based video summarization using delaunay clustering,” International Journal on Digital Libraries, vol. 6, no. 2, pp. 219–232, 2006.
[17] Y. Zhuang, Y. Rui, T. S. Huang, and S. Mehrotra, “Adaptive key frame extraction using unsupervised clustering,” in Proceedings of the IEEE Conference on International Conference on Image Processing (ICIP), 1998, pp. 866–870.
[18] Z. Lu and K. Grauman, “Story-driven summarization for egocentric video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2714–2721.
[19] T. Yao, T. Mei, and Y. Rui, “Highlight detection with pairwise deep ranking for first-person video summarization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 982–990.
[20] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Video summarization with long short-term memory,” in Proceedings of the IEEE Conference on 14th European Conference on Computer Vision (ECCV), 2016, pp. 766–782.
[21] B. Zhao, X. Li, and X. Lu, “Hierarchical recurrent neural network for video summarization,” in Proceedings of the ACM Conference on Multimedia Conference (MM), 2017, pp. 863–871.
[22] K. Zhou, Y. Qiao, and T. Xiang, “Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward,” in Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI), 2018, pp. 7582–7589.
[23] M. Rochan, L. Ye, and Y. Wang, “Video summarization using fully convolutional sequence networks,” in Proceedings of the 15th European Conference on Computer Vision (ECCV’18), 2018, pp. 358–374.
[24] M. Rochan and Y. Wang, “Video summarization by learning from unpaired data,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7902–7911.
[25] J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recognition with visual attention,” in Proceedings of the IEEE International Conference on Learning Representations (ICLR), 2015.
[26] F. Yang, K. Yan, S. Lu, H. Jia, X. Xie, and W. Gao, “Attention driven person re-identification,” Pattern Recognition, vol. 86, pp. 143–155, 2019.
[27] B. Chen, P. Li, C. Sun, D. Wang, G. Yang, and H. Lu, “Multi attention module for visual tracking,” Pattern Recognition, vol. 87, pp. 80–93, 2019.
[28] W. Wang, Y. Huang, and L. Wang, “Long video question answering: a matching-guided attention model,” Pattern Recognition, vol. 102, p. 107258, 2020.
[29] J. Li, X. Liu, M. Zhang, and D. Wang, “Spatio-temporal deformable 3d convnets with attention for action recognition,” Pattern Recognition, vol. 98, p. 107037, 2020.
[30] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7794–7803.
[31] J. Fajtl, H. S. Sokeh, V. Argyriou, D. Monekosso, and P. Remagnino, “Summarizing videos with attention,” in Asian Conference on Computer Vision Workshop, 2018, pp. 39–54.
[32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
[33] A. Kulesza and B. Taskar, “Determinantal point processes for machine learning,” Foundations and Trends in Machine Learning, vol. 5, no. 2-3, pp. 123–286, 2012.
[34] B. Gong, W. Chao, K. Grauman, and F. Sha, “Diverse sequential subset selection for supervised video summarization,” in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2069–2077.
[35] J. Zhao, M. Mathieu, and Y. LeCun, “Energy-based generative adversarial network,” in Proceedings of 5th International Conference on Learning Representations (ICLR), 2017.
[36] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid, “Category-specific video summarization,” in Proceedings of the IEEE Conference on 13th European Conference on Computer Vision (ECCV), 2014, pp. 540–555.
[37] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Summary transfer: Exemplar-based subset selection for video summarization,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1059–1067.
[38] S. E. F. De Avila, A. P. B. Lopes, A. da Luz Jr, and A. de Albuquerque Araújo, “Vsumm: A mechanism designed to produce static video summaries and a novel evaluation method,” Pattern Recognition Letters, vol. 32, no. 1, pp. 56–68, 2011.
[39] B. Zhao, X. Li, and X. Lu, “Hsa-rnn: Hierarchical structure-adaptive rnn for video summarization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7405–7414.
[40] B. Mahasseni, M. Lam, and S. Todorovic, “Unsupervised video summarization with adversarial lstm networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2017, pp. 2982–2991.
[41] Y. Jung, D. Cho, D. Kim, S. Woo, and I. S. Kweon, “Discriminative feature learning for unsupervised video summarization,” in Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI), 2019, pp. 8537–8544.
[42] Z. Ji, K. Xiong, Y. Pang, and X. Li, “Video summarization with attention-based encoder-decoder networks,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 6, pp. 1709–1717, 2020.
[43] S. Li, W. Li, C. Cook, C. Zhu, and Y. Gao, “Independently recurrent neural network (indrnn): Building a longer and deeper rnn,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5457–5466.
[44] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
[45] C. Zach, T. Pock, and H. Bischof, “A duality based approach for realtime tv-l1 optical flow,” in Proceedings of the 29th DAGM Symposium on Pattern Recognition. Springer, 2007, pp. 214–223.