Exploring Global Diverse Attention via Pairwise Temporal Relation for Video Summarization
Abstract—Video summarization is an effective way to facilitate video searching and browsing. Most existing systems employ encoder-decoder based recurrent neural networks, which fail to explicitly diversify the system-generated summary frames while requiring intensive computations. In this paper, we propose an efficient convolutional neural network architecture for video SUMmarization via Global Diverse Attention, called SUM-GDA, which adapts the attention mechanism in a global perspective to consider pairwise temporal relations of video frames. The GDA module has two advantages: 1) it models the relations within paired frames as well as the relations among all pairs, thus capturing the global attention across all frames of one video; 2) it reflects the importance of each frame to the whole video, leading to diverse attention on these frames. Thus, SUM-GDA is beneficial for generating diverse frames to form a satisfactory video summary. Extensive experiments on three data sets, i.e., SumMe, TVSum, and VTW, have demonstrated that SUM-GDA and its extension outperform other competing state-of-the-art methods with remarkable improvements. In addition, the proposed models can be run in parallel with significantly less computational cost, which helps deployment in highly demanding applications.

Index Terms—Global diverse attention, pairwise temporal relation, video summarization, convolutional neural networks.

This work was supported in part by the National Natural Science Foundation of China under Grants 61872122 and 61502131, and in part by the Natural Science Foundation of Zhejiang Province under Grant LY18F020015. (Corresponding author: Ping Li.)
P. Li, Q. Ye, and X. Xu are with the School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China (e-mail: [email protected]).
L. Zhang is with the College of Computer Science, Zhejiang University, Hangzhou 310027, China.
L. Yuan is with the National University of Singapore, Singapore 119077.
L. Shao is with the Inception Institute of Artificial Intelligence, Abu Dhabi, UAE.

I. INTRODUCTION

In the era of big data, video has become one of the most important carriers of data, as both the number and the volume of videos have increased rapidly in daily life. Video summarization, as a good way to manage these videos (e.g., for video searching and video browsing) [1], [2], has received much interest in the field of computer vision and pattern recognition. It essentially selects the key shots of a video as the summary, and these key shots are expected to convey the most important information of the video. To obtain the summary, traditional methods often use hand-crafted features which are then processed by unsupervised models [3], [4] or supervised models [5], [6]. Nevertheless, these models cannot be trained efficiently in an end-to-end manner, and hand-crafted features are incapable of encoding high-level semantic information of video, so they may not work well for videos with diverse and complex scenes in real-world applications.

In the last decade, many contributions have been devoted to exploring the recurrent encoder-decoder architecture, which utilizes Recurrent Neural Networks (RNNs) [7] and Long Short-Term Memory (LSTM) models [8], [9]. These models are able to learn high-level feature representations of data, thus generating video summaries of high quality. Indeed, recurrent neural networks can inherently capture temporal dependency by encoding the sequence information of video frames, and they enjoy widespread success in practical tasks, e.g., machine translation [10] and action recognition [11]. However, they have several drawbacks: (1) it is difficult for RNNs to make full use of GPU parallelization, because the generation of the hidden state h_t for the t-th frame depends on the previous hidden state h_{t-1}; (2) gated RNN models like LSTMs cannot well model long-range dependency across video frames, since the gating mechanism leads to serious decay of the history information inherited from frames appearing at early time; (3) in some scenarios, e.g., when a video news broadcast on one topic consists of several edited videos from different sources, there may exist semantic discontinuity within video sequences, which is very challenging and cannot be well resolved by RNNs.

To alleviate the above problems, we exploit the temporal context relations among video frames from the pairwise relation perspective. In particular, a pairwise similarity matrix is designed to model multi-scale temporal relations across frames by storing the context information. The rationale behind this idea is that temporal relations of video frames can be modeled by the pairwise similarity between two frames regardless of their distance. That is to say, unlike RNNs that model long-range dependency using history information of early frames, it is unnecessary for our scheme to go through all the intermediate frames between the source frame and the target frame. Instead, we compute the pairwise similarity matrix directly at limited computational cost, which makes it quite efficient and suitable for GPU parallelization.

In another aspect, a good video summary should include the shots with the most diversity and the most representativeness, i.e., the selected key shots with higher importance scores should reflect diversified semantic information of the video. Human beings tend to summarize a video after scanning all the frames, which inspires us to imitate this process and summarize the video by fully attending to the complete video, as the attention mechanism is prevailing and successful in sequence modeling [12]. Therefore, we develop an efficient video SUMmarization model with Global Diverse Attention called SUM-GDA using pairwise temporal relation, to quantify the importance of each video frame and simultaneously promote the diversity among the frames.
Fig. 2. The architecture of SUM-GDA. Note that W^Q, W^K, and W^V are three parameter matrices to be learned during training, d is the normalized pairwise dissimilarity vector, and c_i indicates the context of frame x_i.

Recurrent models have also been exploited to capture temporal dependency among video frames; Yuan et al. [3] designed cycle-consistent adversarial LSTM networks to sufficiently encode the temporal information in an unsupervised manner, resulting in promising summarization performance. In addition, Zhou et al. [22] employed reinforcement learning and Rochan et al. [23] adapted the convolutional sequence model to promote the quality of video summaries; Rochan et al. [24] attempted to train the model with unpaired samples by adopting a key frame selector and a summary discriminator network in an unsupervised manner, where the generated summary may come from different source videos.

Inspired by human visual perception [25], the attention mechanism is exploited to guide the feed-forward process of neural networks. To this end, attention-based models have gained satisfactory performance in various domains, such as person re-identification [26], visual tracking [27], video question answering [28], and action recognition [29]. Among them, the self-attention mechanism [12] leverages the relation among all points of a single sequence to generate the corresponding representation of the data, which makes it widely applicable to real-world tasks [30]; e.g., Fajtl et al. [31] utilized self-attention to only measure the importance of each frame while neglecting frame diversity. In contrast, our work takes advantage of the self-attention mechanism to model the temporal pairwise relation, so as to quantify the importance of each frame and also to promote the diversity among the frames, leading to boosted summarization performance.

III. THE PROPOSED APPROACH

In this section, we first introduce the proposed video summarization model with global diverse attention (SUM-GDA) in a supervised manner, which employs convolutional neural networks as the backbone. The developed model is expected to be much more efficient than typical recurrent neural networks like LSTMs, since existing RNN-based methods always have difficulty in modeling the long-range dependency within videos, especially in surveillance environments. To overcome this difficulty, we propose the global diverse attention mechanism by adapting self-attention [12] to the video summarization task. In particular, we design a pairwise similarity matrix to accommodate diverse attention weights of video frames, which can well encode temporal relations between every two frames over a wide range of strides (i.e., the gap between source frame and target frame). Diverse attention weights are then transformed into importance scores for every frame, and the importance score indicates the importance of the frame, which may characterize some critical scene or objects within the given video.

Essentially, SUM-GDA leverages the global diverse attention mechanism to derive the dissimilarity representation of video frames, which makes it possible to accomplish video summarization even when the sequence order of video frames is disrupted in some situations, e.g., when several videos regarding the same news are combined together by careful editing. Thus, SUM-GDA can be trained in parallel on multiple GPUs, which significantly reduces computational costs.

A. Global Diverse Attention

Global diverse attention mainly exploits diversity information to enhance the performance of video summarization. Particularly, GDA adopts self-attention to encode the pairwise temporal information within video frames, and then aggregates the pairwise information of all frame pairs in a video to globally evaluate the importance of each frame to the whole video. Afterwards, GDA transforms importance scores into pairwise dissimilarity scores which indicate the appearance variation between two frames.

Given a video V consisting of N frames {v_1, v_2, ..., v_N}, for its i-th and j-th frames, i, j ∈ {1, 2, ..., N}, the corresponding feature representation vectors x_i ∈ R^D and x_j ∈ R^D are derived from a pre-trained CNN such as GoogLeNet [32]. The pairwise attention matrix A ∈ R^{N×N} essentially reveals the underlying temporal relation across frame pairs of the video, and the entry A_ij between the i-th frame and the j-th frame is

    A_{ij} = \frac{1}{\sqrt{q}} (W^Q x_i)^\top (W^K x_j),    (1)

where q > 0 is a constant; the two linear projections W^Q ∈ R^{D×D} and W^K ∈ R^{D×D} are applied to the paired video frames x_i and x_j respectively, and they are parameters to be learned during training. Instead of directly using the dot product, we scale the attention value with an empirical scaling factor 1/√q (q = D), because without scaling the model is very likely to generate very small gradients after applying a softmax
function, especially during the back-propagation process.

The obtained pairwise attention weights A_ij are then converted to normalized weights α_ij by the following softmax function, i.e.,

    \alpha_{ij} = \frac{\exp(A_{ij})}{\sum_{r=1}^{N} \exp(A_{rj})}.    (2)

These pairwise attention weights only reveal the importance of each frame to the video, while a good summary is expected to be composed of diverse frames or key shots. Hence, to diversify video frames globally, GDA attends to those frames that are dissimilar to the other frames in the whole video. Mathematically, we compute the pairwise dissimilarity vector d̂ = [d̂_1, d̂_2, ..., d̂_N] ∈ R^N by

    \hat{d}_i = \prod_{j=1}^{N} (1 - \alpha_{ij}), \qquad \mathbf{d} = \frac{\hat{\mathbf{d}}}{\|\hat{\mathbf{d}}\|_1},    (3)

where ‖·‖_1 denotes the ℓ1-norm and Π denotes the product of multiple elements. The vector d is the normalized pairwise dissimilarity, and its elements {d_i}_{i=1}^{N} are the global diverse attention weights, which indicate how the i-th frame x_i differs from the whole input video and simultaneously suggest the diversity of the frames.

To capture the frame semantic information modeled by the weighted context matrix C ∈ R^{D×N}, we apply a linear mapping to the normalized pairwise dissimilarity as

    c_i = d_i \otimes (W^V x_i),    (4)

where the linear projection matrix W^V ∈ R^{D×D} is a parameter to be learned, c_i ∈ R^D is the weighted vector reflecting the context of the i-th frame, and ⊗ is the element-wise product.
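To make Eqs. (1)–(4) concrete, the following is a minimal PyTorch-style sketch of the GDA computation. The module name, the (N, D) tensor layout for stacked frame features, and the small epsilon added before normalization are illustrative assumptions, not details taken from the authors' released code.

```python
import torch
import torch.nn as nn

class GDAModule(nn.Module):
    """Sketch of global diverse attention, Eqs. (1)-(4); names are illustrative."""

    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)  # W^Q
        self.w_k = nn.Linear(dim, dim, bias=False)  # W^K
        self.w_v = nn.Linear(dim, dim, bias=False)  # W^V
        self.scale = dim ** -0.5                    # 1/sqrt(q) with q = D

    def forward(self, x):
        # x: (N, D) frame features from a pre-trained CNN (e.g., GoogLeNet pool5)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        attn = self.scale * (q @ k.t())             # Eq. (1): pairwise attention A, shape (N, N)
        alpha = torch.softmax(attn, dim=0)          # Eq. (2): normalize over the first index r for each j
        # Eq. (3): dissimilarity of frame i to the whole video, then l1-normalize
        d_hat = torch.prod(1.0 - alpha, dim=1)      # product over j of (1 - alpha_ij)
        d = d_hat / (d_hat.norm(p=1) + 1e-8)
        # Eq. (4): diversity-weighted context c_i = d_i * (W^V x_i)
        context = d.unsqueeze(1) * v                # (N, D)
        return context, d, alpha
```

Note that the softmax of Eq. (2) normalizes over the first index r for a fixed j (hence dim=0), while the product of Eq. (3) runs over the second index j.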
B. The SUM-GDA Model

Incorporating global diverse attention, the proposed SUM-GDA model includes a score regression module and a linear embedding module following the feed-forward layer, and its architecture is illustrated in Figure 2. The model first extracts feature vectors with a pre-trained CNN and computes the global diverse attention matrix A. Then, the weighted features are handled by two fully-connected layers, including the linear embedding function φ(·) and the score regression function y(·). Frames with high regression scores are selected to form the final summary. In the feed-forward layer, a linear transformation is applied to the weighted context matrix C, together with dropout and layer normalization. For score regression, we compute the frame scores y ∈ R^N using two linear layers with a ReLU activation function, dropout, and layer normalization in between.
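Under the description above, the scoring head could be sketched as follows; the hidden width, dropout rate, sigmoid output, and exact placement of layer normalization are assumptions, since the text only specifies two linear layers with a ReLU activation, dropout, and layer normalization.

```python
import torch
import torch.nn as nn

class SumGDAHead(nn.Module):
    """Sketch of the feed-forward layer and score regression of Sec. III-B.
    Hidden size, dropout rate, and sigmoid output are assumed, not taken from the paper."""

    def __init__(self, dim, hidden=512, p_drop=0.5):
        super().__init__()
        # linear embedding phi(.) applied to the weighted context C, with dropout + layer norm
        self.embed = nn.Sequential(
            nn.Linear(dim, dim), nn.Dropout(p_drop), nn.LayerNorm(dim))
        # score regression y(.): two linear layers with ReLU, dropout, and layer norm in between
        self.score = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.LayerNorm(hidden), nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, context):
        # context: (N, D) diversity-weighted features from the GDA module
        phi = self.embed(context)        # embedding phi_i, also usable in the DPP kernel of Eq. (6)
        y = self.score(phi).squeeze(-1)  # frame scores y in [0, 1], shape (N,)
        return y, phi
```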
Admittedly, the proposed model can predict the likelihood of a video frame being included in the final summary. To further diversify the selected frames, we adopt the Determinantal Point Process (DPP) [33] technique, which defines a distribution that measures the negative correlation over all subsets. DPP has been widely used in summarization methods [34] as it can promote diversity within the selected subsets.

Given the index set Y = {1, 2, ..., N}, a positive semi-definite kernel matrix L ∈ R^{N×N} is computed to represent the frame-level pairwise similarity, and the probability of a subset Y_sub ⊆ Y is

    P_L(Y = Y_{sub}; L) = \frac{\det(L_{Y_{sub}})}{\det(L + I_N)},    (5)

where Y is the random variable taking an index value, I_N is the N × N identity matrix, and det(L + I_N) is a normalization constant. If two items within the same subset are similar, the probability P({i, j} ⊆ Y; L) = L_ii L_jj − L_ij^2 will be close to zero, i.e., the probability of that subset is driven to zero. Otherwise, a high probability indicates that the subset has high variation or diversity.

Inspired by the quality-diversity decomposition [33], we enhance DPP by explicitly modeling the variation, defining the kernel matrix L as

    L_{ij} = y_i y_j \Phi_{ij} = y_i y_j \exp(-\beta \|\phi_i - \phi_j\|_2^2),    (6)

where the pairwise similarity between the paired frames x_i and x_j is derived from the two linear transformations φ_i and φ_j, whose output frame scores are y_i and y_j.

a) Variation Loss: Since DPP enforces a high-diversity constraint on the selection of frame subsets, the redundancy of the video summary can be reduced. Such diversity can be evaluated by the variation loss, i.e.,

    \mathcal{L}_{var} = -\sum_{Y_{sub} \subseteq Y} \log P_L(Y = Y_{sub}; L).    (7)

b) Keyframe Loss: In the supervised setting, we use ground-truth annotations of key frames ŷ during training and define the key frame loss as

    \mathcal{L}_{key} = -\sum_{i=1}^{N} \big[ \hat{y}_i \log y_i + (1 - \hat{y}_i) \log(1 - y_i) \big].    (8)

By combining Eq. (7) and Eq. (8), we obtain the supervised loss of the proposed SUM-GDA model:

    \mathcal{L}_{sup} = \mathcal{L}_{key} + \mathcal{L}_{var}.    (9)

During model training, the above loss functions are optimized iteratively. By incorporating DPP with score regression, we build a unified end-to-end deep neural network architecture, SUM-GDA, for video summarization; its training procedure is summarized in Algorithm 1.

C. Unsupervised SUM-GDA

In many practical tasks, SUM-GDAunsup can learn from a set of untrimmed videos without supervision. Generally speaking, it is difficult for different users who are asked to evaluate the same video to reach a consensus on the final summary. Instead of using the oracle, we replace the key frame loss in Eq. (8) with the length regularization L_len, balanced by the summary ratio σ, shown below:

    \mathcal{L}_{len} = \Big\| \frac{1}{N} \sum_{i=1}^{N} y_i - \sigma \Big\|_2.    (10)
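A minimal sketch of these objectives is given below, assuming frame scores y, embeddings φ, and binary key-frame annotations are available as tensors. Since summing over all subsets in Eq. (7) is intractable, the sketch uses one tractable instantiation, namely the negative DPP log-likelihood of the annotated key-frame subset; the numerical jitter eps and the default summary ratio σ = 0.15 are assumptions rather than values stated in the text.

```python
import torch
import torch.nn.functional as F

def dpp_kernel(y, phi, beta=1.0):
    """Eq. (6): L_ij = y_i * y_j * exp(-beta * ||phi_i - phi_j||_2^2)."""
    sq_dist = torch.cdist(phi, phi, p=2) ** 2
    return y.unsqueeze(1) * y.unsqueeze(0) * torch.exp(-beta * sq_dist)

def variation_loss(y, phi, keyframe_idx, beta=1.0, eps=1e-6):
    """One tractable instantiation of Eq. (7): negative DPP log-likelihood (Eq. (5) in
    log space) of the annotated key-frame subset, with a small jitter for stability."""
    L = dpp_kernel(y, phi, beta)
    n = L.size(0)
    L_sub = L[keyframe_idx][:, keyframe_idx] + eps * torch.eye(len(keyframe_idx))
    log_p = torch.logdet(L_sub) - torch.logdet(L + torch.eye(n))
    return -log_p

def keyframe_loss(y, y_gt):
    """Eq. (8): binary cross-entropy against ground-truth key-frame annotations."""
    return F.binary_cross_entropy(y, y_gt, reduction='sum')

def length_loss(y, sigma=0.15):
    """Eq. (10): keep the mean frame score close to the summary ratio sigma
    (the l2-norm of a scalar reduces to its absolute value)."""
    return (y.mean() - sigma).abs()

# Supervised objective, Eq. (9): L_sup = keyframe_loss(...) + variation_loss(...).
# The unsupervised variant replaces the keyframe loss with length_loss(...).
```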
in the clustering viewpoint. Hence, more closely aggregated clusters suggest more diversified clusters, and thus more diversified key shots. Usually, the selected key shots are more representative than the unselected ones.

IV. EXPERIMENTS

This section explores the summarization performance of the proposed SUM-GDA model on several data sets. First, we give some statistics of the data sets and describe the evaluation metrics as well as the experimental settings. Then the implementation details are given, followed by the reported results, the ablation study, and further discussions on the semi-supervised scenario, summary diversity, and optical flow features.

A. Data Sets

We have evaluated different summarization methods on two benchmark data sets, i.e., SumMe [6] and TVSum [13]. Two additional data sets, OVP and YouTube [38], are used to augment the training data in the Augmented and Transfer settings of Table I.

TABLE I
Evaluation settings for SumMe. To evaluate TVSum, we swap SumMe and TVSum.

Setting      Training                                Testing
Canonical    80% SumMe                               20% SumMe
Augmented    80% SumMe + OVP + YouTube + TVSum       20% SumMe
Transfer     OVP + YouTube + TVSum                   SumMe

TABLE II
Performance comparison (F-score %) with supervised methods on SumMe and TVSum. C, A, and T denote the Canonical, Augmented, and Transfer settings of Table I.

Method              SumMe (C / A / T)       TVSum (C / A / T)
Bi-LSTM [20]        37.6 / 41.6 / 40.7      54.2 / 57.9 / 56.9
DPP-LSTM [20]       38.6 / 42.9 / 41.8      54.7 / 59.6 / 58.7
SUM-GANsup [40]     41.7 / 43.6 /  -        56.3 / 61.2 /  -
DR-DSNsup [22]      42.1 / 43.9 / 42.6      58.1 / 59.8 / 58.9
SUM-FCN [23]        47.5 / 51.1 / 44.1      56.8 / 59.2 / 58.2
HSA-RNN [39]         -   / 44.1 /  -         -   / 59.8 /  -
CSNetsup [41]       48.6 / 48.7 / 44.1      58.5 / 57.1 / 57.4
VASNet [31]         49.7 / 51.1 /  -        61.4 / 62.4 /  -
M-AVS [42]          44.4 / 46.1 /  -        61.0 / 61.8 /  -
SUM-GDA             52.8 / 54.4 / 46.9      58.9 / 60.1 / 59.0
TABLE IV
Performance comparison (F-score %) on VTW.

Method           Precision    Recall    F-score
HD-VS [19]       39.2         48.3      43.3
DPP-LSTM [20]    39.7         49.5      44.3
HSA-RNN [39]     44.3         54.8      49.1
SUM-GDAunsup     47.8         48.6      47.9
SUM-GDA          50.1         50.7      50.2
TABLE V
F-score (%) of all cases on SumMe and TVSum in the Canonical setting.

Row   Method                   SumMe    TVSum
1     SUM-GDAunsup w/o Llen    47.4     57.7
2     SUM-GDAunsup w/o Lrep    48.9     58.9
3     SUM-GDAunsup w/o GDA     43.0     55.7
4     SUM-GDAunsup             50.0     59.6
5     SUM-GDA w/o Lvar         51.6     58.1
6     SUM-GDA w/o GDA          45.1     53.1
7     SUM-GDA                  52.8     58.9
Given an input sequence {z_1, ..., z_n} of n frame images with z_i ∈ R^d, suppose the dimension of the hidden state vector h_t is also d. The time complexity of an RNN is O(nd^2), since h_t = tanh(W_{hh} h_{t−1} + W_{zh} z_t), where W_{zh}, W_{hh} ∈ R^{d×d} [43]. In contrast, the GDA module of our approach only requires O(n^2 d), which is faster than an RNN when the sequence length n is smaller than d. This mostly holds for the frame representations used by state-of-the-art video summarization models [20], [22], [3], which sample frames from the video at 2 fps. For the benchmark databases SumMe, TVSum, and VTW tested above, the length of a video is usually smaller than the dimension of its embedding representation, e.g., 1024. Moreover, RNNs [7] and LSTM models [8] suffer from the gradient vanishing and exploding problem due to the use of the hyperbolic tangent and sigmoid activation functions, which leads to gradient decay over time steps during training. Li et al. [43] found empirically that RNN and LSTM cannot keep the long-term temporal information of a sequence when its length exceeds 1000. In addition, the amount of parallelizable computation can be measured by the minimum number of sequentially executed operations of a module [12], since sequential operations cannot be parallelized. RNN-based methods require O(n) sequential operations, while the GDA module only needs O(1) sequential operations given sufficient GPU resources. This implies that our GDA module is more efficient by taking advantage of parallelization. Hence, SUM-GDA is much faster than recurrent neural network based methods.
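As a small illustration of this argument (a sketch with made-up sizes, not a benchmark of the actual model), the GDA-style pairwise attention reduces to batched matrix products, whereas an RNN has to unroll a loop whose steps cannot be reordered:

```python
import torch

N, D = 320, 1024                      # ~2 fps sampling typically keeps N below the feature dimension D
X = torch.randn(N, D)                 # frame features
W_q, W_k = torch.randn(D, D), torch.randn(D, D)
W_hh, W_zh = torch.randn(D, D), torch.randn(D, D)

# GDA-style pairwise attention: O(N^2 * D) work, but O(1) sequential steps (one batched matmul).
A = (X @ W_q) @ (X @ W_k).t() / D ** 0.5

# RNN-style recurrence: O(N * D^2) work, but N sequential steps, since h_t depends on h_{t-1}.
h = torch.zeros(D)
for t in range(N):
    h = torch.tanh(W_hh @ h + W_zh @ X[t])
```

With a few hundred sampled frames and d = 1024, the attention path does more arithmetic in total but exposes all of it to the GPU at once, which matches the O(1) versus O(n) sequential-operation argument above.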
G. Ablation Study

To examine the influence of the different loss terms, we conducted an ablation study on the proposed model; the results are shown in Table V. As can be seen from the table, removing the length regularization (Row 1) deteriorates the summarization performance, as the length of the generated summary should be constrained within a sensible range. Besides, the performance of the model with the repelling loss (Row 4) is improved by 1.1% and 0.7% on SumMe and TVSum respectively, compared with that without this loss (Row 2). We attribute this to the fact that when the repelling loss is added to the unsupervised extension besides the length regularization, the pairwise similarity between some frames is reduced, which lowers the importance scores of those frames in the video. Hence, the global diverse attention on video frames is enhanced to some degree. Furthermore, the model with the variation loss, i.e., SUM-GDA (Row 7), is better than the model without it (Row 5) by 1.2% and 0.8% on SumMe and TVSum respectively, because the variation loss helps generate diverse subsets. In addition, both SUM-GDAunsup and SUM-GDA are improved by adopting the GDA module, as shown by comparison against Row 3 and Row 6. Therefore, the different regularization terms and losses play different roles in promoting global diverse attention on video frames.
H. Qualitative Results

We visualize the key shots selected by our SUM-GDA model for different videos in TVSum. From Figure 5, it can be clearly seen that SUM-GDA selects most of the peak points according to the frame scores, using the 0/1 knapsack algorithm (a sketch of this selection step is given below). As depicted in Figure 6, SUM-GDA and SUM-GDAunsup yield different frame scores. The bottom figure, for SUM-GDAunsup, shows sparser frame scores, since we constrain the mean of the frame scores to be close to the summary ratio, which may lead to sparsity.

To verify the effectiveness of SUM-GDA, we visualize the global diverse attention weights on TVSum in Figure 7, which describes a traffic accident where a train crashed into a car. Comparing the middle image and the right-most image, they look visually similar but their attention weights are very different. This helps achieve the goal of diversifying the selected frames, because there is redundancy among similar frames, only one of which is needed to form the final summary in practice.
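The selection step referenced above can be sketched as follows: per-frame scores are averaged within temporal segments (e.g., between KTS [36] change points) and segments are chosen by 0/1 knapsack under a length budget. The 15% budget and this plain dynamic-programming routine are common choices in the video summarization literature, assumed here rather than taken from the paper.

```python
def select_key_shots(frame_scores, segments, budget_ratio=0.15):
    """Pick shots by 0/1 knapsack: maximize summed shot scores under a frame budget.
    frame_scores: list of per-frame scores; segments: list of (start, end) index pairs."""
    lengths = [end - start for start, end in segments]
    values = [sum(frame_scores[start:end]) / (end - start) for start, end in segments]
    capacity = int(budget_ratio * len(frame_scores))

    # standard 0/1 knapsack dynamic programme over shots
    n = len(segments)
    dp = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(capacity + 1):
            dp[i][c] = dp[i - 1][c]
            if lengths[i - 1] <= c:
                dp[i][c] = max(dp[i][c], dp[i - 1][c - lengths[i - 1]] + values[i - 1])

    # backtrack to recover the selected shots
    picked, c = [], capacity
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            picked.append(i - 1)
            c -= lengths[i - 1]
    return sorted(picked)
```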
I. Semi-Supervised Scenario

In many practical applications, it often happens that only part of the videos have labels, due to costly human labeling, while a large number of unlabeled videos are easily available and useful for training the summarization model. Regarding this scenario, we additionally examined the proposed methods in a semi-supervised setting (not explored in previous works). Specifically, for SumMe, we use 80% SumMe + OVP + YouTube + TVSum for training (labeled videos) and 20% SumMe for testing (test videos). For TVSum, SumMe and TVSum above are swapped. The unlabeled videos are sampled from VTW by ignoring the corresponding annotations. Here, test videos are only used during testing, which is slightly different from transductive learning, where the test videos are considered as unlabeled samples used for learning the model. The results on SumMe and TVSum are shown in Table VI, from which it can be observed that the summarization performance is consistently improved with an increasing number of unlabeled samples. This indicates that unlabeled data can also benefit model learning, and more unlabeled data provide more consolidated information to enhance the quality of generated summaries.

J. Diversity of Generated Summaries

To evaluate the diversity of the summaries generated by different methods, we use the Diversity Metric (ζ) defined in Eq. (14),
and the results are recorded in Table VII. As the table shows, the proposed SUM-GDA approach has the lowest value of ζ, indicating that it generates the most diverse summaries on both benchmark data sets. This can be attributed to the fact that SUM-GDA and SUM-GDAunsup benefit from the global diverse attention mechanism, which encourages the diversity of generated summaries.

Note that a lower ζ means the video shots are closer to the corresponding key shot, i.e., the shots are more densely distributed. To illustrate the summary diversity metric ζ, we examine our method on two randomly selected videos from SumMe and TVSum respectively. The video shot distributions are depicted in Figure 8, obtained by using t-SNE [44] to project all the shot features into a two-dimensional data space after inputting the given video to the SUM-GDA model. In this figure, a red solid circle denotes the feature vector of a key shot, a blue solid circle denotes the feature vector of an unselected shot, and a dashed irregular circle denotes the group including one key shot and its surrounding shots which are the closest ones.

From Figure 8, it is vividly shown that, with regard to different video shot groups (dashed circles), our model tends to select those shots with smaller distances to their neighbors (i.e., unselected shots) as the key shots to form the final summary. Besides, we can observe that the selected key shot and its group neighbors are densely clustered, which indicates a small ζ value according to the definition in Sec. III-E. Actually, the selected key shot acts as a representative shot of the group and may be linearly reconstructed from its group neighbors. Thus, the selected key shots from different groups are not only far away from each other but also visually diverse. In other words, the closer the group shot points obtained by the model, the more diverse the generated summaries.

TABLE VI
Performance (F-score %) with different numbers of unlabeled samples in the semi-supervised setting. The first row denotes the number of unlabeled videos from VTW during training.

# of unlabeled    0       500     1000    1500    2000
SumMe             54.4    55.4    55.6    56.0    56.3
TVSum             60.1    60.4    60.5    60.6    60.7

TABLE VII
Summary diversity on SumMe and TVSum measured by the Diversity Metric ζ (↓); smaller ζ means higher diversity.

Database    DPP-LSTM [20]    DR-DSN [22]    SUM-GDA    SUM-GDAunsup
SumMe       0.133            0.109          0.099      0.108
TVSum       0.319            0.312          0.304      0.308

K. Optical Flow Features

Existing methods often use RGB images as the video input, which does not consider motion information within the video. To capture video motion, we investigate the influence of optical flow features on the generated summaries. We extract the optical flow of each video using the TV-L1 [45] package, and we use the pool5 features of GoogLeNet for fairness. We conduct experiments on the three data sets with three kinds of inputs: RGB, Optical Flow, and the combination of the two. The experimental setting follows the Canonical setting described in Table I. The performance results are shown in Table VIII.
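As a rough illustration of this preprocessing step (not the authors' exact pipeline), TV-L1 flow can be computed per frame pair with OpenCV; the cv2.optflow module path assumes an opencv-contrib-python build, and the clipping/rescaling used to turn flow fields into CNN-ready images is a common convention rather than a value from the paper.

```python
import cv2
import numpy as np

# TV-L1 dense optical flow (assumes the contrib build exposing cv2.optflow)
tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()

def flow_between(prev_bgr, next_bgr):
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    flow = tvl1.calc(prev_gray, next_gray, None)      # (H, W, 2) displacement field
    # clip and rescale to 8-bit "flow images" so the same CNN (e.g., GoogLeNet) can encode them;
    # the [-20, 20] range is a common convention, not a value from the paper
    flow = np.clip(flow, -20, 20)
    return ((flow + 20) / 40.0 * 255).astype(np.uint8)
```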
Fig. 5. Visualization of SUM-GDA generated summaries for different videos in TVSum. Light blue bars represent ground-truth scores, and dark blue bars denote generated summaries. Panels include (c) Video 12: daily life of one woman, and (d) Video 41: motorcycle flipping.

Fig. 6. Visualization of selected key shots for test video 30 (motorcycle show) in TVSum: (a) SUM-GDA, (b) SUM-GDAunsup.

Fig. 7. Visualization of frame scores and attention weights on Video 7 in TVSum by SUM-GDA. Light blue bars represent predicted scores, and dark blue bars denote generated summaries. The magnitude of the attention weights, normalized to the same range as the corresponding scores, is plotted with a light red line. Purple dashed lines are the change points generated by KTS [36].

Fig. 8. Diversity illustration using the t-SNE [44] representation of the video shot features from SUM-GDA. The denser the group points (dashed circles), the more diverse the generated summaries. In (a), since dis1 < dis2, the blue circle belongs to the right group; similarly, in (b), dis3 < dis4 and dis5 < dis6, which decides the group each corresponding blue circle belongs to. Here dis computes the distance between two circle points.
TABLE VIII
Performance comparison with different input features in terms of Precision, Recall, and F-score (%).

Database    Method          Input               Precision    Recall    F-score
SumMe       SUM-GDA         RGB                 52.5         53.6      52.8
SumMe       SUM-GDA         Opt. Flow           46.9         49.8      47.9
SumMe       SUM-GDA         RGB + Opt. Flow     47.4         50.0      48.2
SumMe       SUM-GDAunsup    RGB                 49.6         52.3      50.0
SumMe       SUM-GDAunsup    Opt. Flow           47.6         52.0      49.8
SumMe       SUM-GDAunsup    RGB + Opt. Flow     48.2         51.6      49.4
TVSum       SUM-GDA         RGB                 59.0         58.9      58.9
TVSum       SUM-GDA         Opt. Flow           59.2         59.3      59.2
TVSum       SUM-GDA         RGB + Opt. Flow     61.0         60.9      61.0
TVSum       SUM-GDAunsup    RGB                 59.5         59.7      59.6
TVSum       SUM-GDAunsup    Opt. Flow           59.4         59.1      59.2
TVSum       SUM-GDAunsup    RGB + Opt. Flow     60.2         60.3      60.2
VTW         SUM-GDA         RGB                 50.1         50.7      50.2
VTW         SUM-GDA         Opt. Flow           41.7         49.5      43.3
VTW         SUM-GDA         RGB + Opt. Flow     40.9         47.4      42.1
VTW         SUM-GDAunsup    RGB                 47.9         48.6      47.9
VTW         SUM-GDAunsup    Opt. Flow           37.2         42.1      38.1
VTW         SUM-GDAunsup    RGB + Opt. Flow     38.1         43.8      39.2
From Table VIII, we can observe that optical flow features greatly improve the performance on TVSum. This is because videos in this data set mostly contain a large portion of actions (e.g., motorcycle flipping, the daily life of one woman), and the motion information plays an important role in learning the video summarization model. However, for the SumMe and VTW data sets, the performance drops when incorporating optical flow features. This might be because videos in these two data sets contain many different scenarios with few or slow-moving actions, whereas optical flow mainly captures motion information and neglects the still backgrounds, which are also essential for learning an effective model. Sometimes the optical flow features even give a misleading signal about the motion in the video in such situations, resulting in less promising summaries.
V. CONCLUSION AND FUTURE WORK

This paper has proposed a novel video summarization model called SUM-GDA, which exploits the global diverse attention mechanism to model pairwise temporal relations among video frames. From a global perspective, the mutual relations among different frame pairs in videos can be sufficiently leveraged to better obtain informative key frames. To select the optimal subset of key frames, our model adopts the determinantal point process to enhance the diversity of the chosen frames. In particular, the determinantal point process yields different frame groups revealing diversified semantics in videos. The chosen frames, indicated by high frame scores, are regarded as a concise collection of the source video with good completeness and little redundancy. Moreover, we extend SUM-GDA to the unsupervised scenario, where the heavy cost of human labeling can be saved, which can facilitate a variety of practical tasks with no supervised information. To investigate the summarization performance of the proposed models, we conducted comprehensive experiments on three publicly available video databases, i.e., SumMe, TVSum, and VTW. Empirical results have verified that both SUM-GDA and SUM-GDAunsup yield more promising video summaries compared with several state-of-the-art approaches. In addition, we have examined the diversity of the generated summaries and visualized the results, which suggests that the global diverse attention mechanism indeed helps a lot in diversifying the key frames chosen to constitute the video summary. The idea of global diverse attention might also be helpful in training base learners for deep ensemble learning.

While we tested our model on optical flow features of video, which provide temporal dynamics, they do not always promote the performance and sometimes even reduce the summarization quality. Since temporal information is critical for encoding sequential data, it is worth putting more emphasis on improving the way optical flow is extracted from video, and also exploring other possible ways to better capture the temporal structure of video data in the future. On the other hand, our model is advantageous in efficiency when the sequence length is less than the dimension of the embedding representation, which is mostly true since frame sampling is adopted as preprocessing. However, this might not be the case when denser sampling is required in situations that need large amounts of computation. Therefore, it is a promising direction to further boost the efficiency of the summarization model so as to adapt it to wider scenarios.

REFERENCES

[1] S. Mei, G. Guan, Z. Wang, W. Shuai, M. He, and D. D. Feng, "Video summarization via minimum sparse reconstruction," Pattern Recognition, vol. 48, no. 2, pp. 522–533, 2015.
[2] X. Li, B. Zhao, and X. Lu, "Key frame extraction in the summary space," IEEE Transactions on Cybernetics, vol. 48, no. 6, pp. 1923–1934, 2017.
[3] L. Yuan, F. Tay, P. Li, L. Zhou, and J. Feng, "Cycle-sum: Cycle-consistent adversarial lstm networks for unsupervised video summarization," in Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI), 2019, pp. 9143–9150.
[4] S. E. F. D. Avila, "Vsumm: A mechanism designed to produce static video summaries and a novel evaluation method," Pattern Recognition Letters, vol. 32, no. 1, pp. 56–68, 2011.
[5] Y. J. Lee, J. Ghosh, and K. Grauman, "Discovering important people and objects for egocentric video summarization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 1346–1353.
[6] M. Gygli, H. Grabner, H. Riemenschneider, and L. J. V. Gool, "Creating summaries from user videos," in Proceedings of the 13th European Conference on Computer Vision (ECCV), 2014, pp. 505–520.
[7] C. L. Giles, G. M. Kuhn, and R. J. Williams, "Dynamic recurrent neural networks: Theory and applications," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 153–156, 1994.
[8] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[9] K.-Y. Huang, C.-H. Wu, and M.-H. Su, "Attention-based convolutional neural network and long short-term memory for short-term detection of mood disorders based on elicited speech responses," Pattern Recognition, vol. 88, pp. 668–678, 2019.
[10] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724–1734.
[11] J. Donahue, L. Anne Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 677–691, 2016.
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 5998–6008.
[13] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, "Tvsum: Summarizing web videos using titles," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 5179–5187.
[14] K.-H. Zeng, T.-H. Chen, J. C. Niebles, and M. Sun, "Title generation for user generated videos," in Proceedings of the 14th European Conference on Computer Vision (ECCV), 2016, pp. 609–625.
[15] H. Fang, J. Jiang, and Y. Feng, "A fuzzy logic approach for detection of video shot boundaries," Pattern Recognition, vol. 39, no. 11, pp. 2092–2100, 2006.
[16] P. Mundur, R. Yong, and Y. Yesha, "Keyframe-based video summarization using delaunay clustering," International Journal on Digital Libraries, vol. 6, no. 2, pp. 219–232, 2006.
[17] Y. Zhuang, Y. Rui, T. S. Huang, and S. Mehrotra, "Adaptive key frame extraction using unsupervised clustering," in Proceedings of the IEEE International Conference on Image Processing (ICIP), 1998, pp. 866–870.
[18] Z. Lu and K. Grauman, "Story-driven summarization for egocentric video," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2714–2721.
[19] T. Yao, T. Mei, and Y. Rui, "Highlight detection with pairwise deep ranking for first-person video summarization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 982–990.
[20] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, "Video summarization with long short-term memory," in Proceedings of the 14th European Conference on Computer Vision (ECCV), 2016, pp. 766–782.
[21] B. Zhao, X. Li, and X. Lu, "Hierarchical recurrent neural network for video summarization," in Proceedings of the ACM Conference on Multimedia (MM), 2017, pp. 863–871.
[22] K. Zhou, Y. Qiao, and T. Xiang, "Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward," in Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI), 2018, pp. 7582–7589.
[23] M. Rochan, L. Ye, and Y. Wang, "Video summarization using fully convolutional sequence networks," in Proceedings of the 15th European Conference on Computer Vision (ECCV), 2018, pp. 358–374.
[24] M. Rochan and Y. Wang, "Video summarization by learning from unpaired data," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7902–7911.
[25] J. Ba, V. Mnih, and K. Kavukcuoglu, "Multiple object recognition with visual attention," in Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[26] F. Yang, K. Yan, S. Lu, H. Jia, X. Xie, and W. Gao, "Attention driven person re-identification," Pattern Recognition, vol. 86, pp. 143–155, 2019.
[27] B. Chen, P. Li, C. Sun, D. Wang, G. Yang, and H. Lu, "Multi attention module for visual tracking," Pattern Recognition, vol. 87, pp. 80–93, 2019.
[28] W. Wang, Y. Huang, and L. Wang, "Long video question answering: a matching-guided attention model," Pattern Recognition, vol. 102, p. 107258, 2020.
[29] J. Li, X. Liu, M. Zhang, and D. Wang, "Spatio-temporal deformable 3d convnets with attention for action recognition," Pattern Recognition, vol. 98, p. 107037, 2020.
[30] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7794–7803.
[31] J. Fajtl, H. S. Sokeh, V. Argyriou, D. Monekosso, and P. Remagnino, "Summarizing videos with attention," in Asian Conference on Computer Vision Workshop, 2018, pp. 39–54.
[32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
[33] A. Kulesza and B. Taskar, "Determinantal point processes for machine learning," Foundations and Trends in Machine Learning, vol. 5, no. 2-3, pp. 123–286, 2012.
[34] B. Gong, W. Chao, K. Grauman, and F. Sha, "Diverse sequential subset selection for supervised video summarization," in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2069–2077.
[35] J. Zhao, M. Mathieu, and Y. LeCun, "Energy-based generative adversarial network," in Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.
[36] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid, "Category-specific video summarization," in Proceedings of the 13th European Conference on Computer Vision (ECCV), 2014, pp. 540–555.
[37] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, "Summary transfer: Exemplar-based subset selection for video summarization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1059–1067.
[38] S. E. F. De Avila, A. P. B. Lopes, A. da Luz Jr, and A. de Albuquerque Araújo, "Vsumm: A mechanism designed to produce static video summaries and a novel evaluation method," Pattern Recognition Letters, vol. 32, no. 1, pp. 56–68, 2011.
[39] B. Zhao, X. Li, and X. Lu, "Hsa-rnn: Hierarchical structure-adaptive rnn for video summarization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7405–7414.
[40] B. Mahasseni, M. Lam, and S. Todorovic, "Unsupervised video summarization with adversarial lstm networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2017, pp. 2982–2991.
[41] Y. Jung, D. Cho, D. Kim, S. Woo, and I. S. Kweon, "Discriminative feature learning for unsupervised video summarization," in Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI), 2019, pp. 8537–8544.
[42] Z. Ji, K. Xiong, Y. Pang, and X. Li, "Video summarization with attention-based encoder-decoder networks," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 6, pp. 1709–1717, 2020.
[43] S. Li, W. Li, C. Cook, C. Zhu, and Y. Gao, "Independently recurrent neural network (indrnn): Building a longer and deeper rnn," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5457–5466.
[44] L. v. d. Maaten and G. Hinton, "Visualizing data using t-sne," Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
[45] C. Zach, T. Pock, and H. Bischof, "A duality based approach for realtime tv-l1 optical flow," in Proceedings of the 29th DAGM Symposium on Pattern Recognition. Springer, 2007, pp. 214–223.