
Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward

Kaiyang Zhou,1,2 Yu Qiao,1* Tao Xiang2
1 Guangdong Key Lab of Computer Vision and Virtual Reality, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China
2 Queen Mary University of London, UK
[email protected], [email protected], [email protected]

* Corresponding author. Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

arXiv:1801.00054v3 [cs.CV] 13 Feb 2018

Abstract

Video summarization aims to facilitate large-scale video browsing by producing short, concise summaries that are diverse and representative of original videos. In this paper, we formulate video summarization as a sequential decision-making process and develop a deep summarization network (DSN) to summarize videos. DSN predicts for each video frame a probability, which indicates how likely a frame is selected, and then takes actions based on the probability distributions to select frames, forming video summaries. To train our DSN, we propose an end-to-end, reinforcement learning-based framework, where we design a novel reward function that jointly accounts for diversity and representativeness of generated summaries and does not rely on labels or user interactions at all. During training, the reward function judges how diverse and representative the generated summaries are, while DSN strives to earn higher rewards by learning to produce more diverse and more representative summaries. Since labels are not required, our method can be fully unsupervised. Extensive experiments on two benchmark datasets show that our unsupervised method not only outperforms other state-of-the-art unsupervised methods, but also is comparable to or even superior to most published supervised approaches.

Introduction

Driven by the exponential growth in the amount of online videos in recent years, research in video summarization has gained increasing attention, leading to various methods proposed to facilitate large-scale video browsing (Gygli et al. 2014; Gygli, Grabner, and Van Gool 2015; Zhang et al. 2016a; Song et al. 2015; Panda and Roy-Chowdhury 2017; Mahasseni, Lam, and Todorovic 2017; Potapov et al. 2014).

Recently, recurrent neural networks (RNNs), especially those with the long short-term memory (LSTM) cell (Hochreiter and Schmidhuber 1997), have been exploited to model the sequential patterns in video frames, as well as to tackle the end-to-end training problem. Zhang et al. (Zhang et al. 2016b) proposed a deep architecture that combines a bidirectional LSTM network with a Determinantal Point Process (DPP) module that increases diversity in summaries, referred to as DPP-LSTM. They trained DPP-LSTM with supervised learning, using both video-level summaries and frame-level importance scores. At test time, DPP-LSTM predicts importance scores and outputs feature vectors simultaneously, which are together used to construct a DPP matrix. Due to the DPP modeling, DPP-LSTM needs to be trained in a two-stage manner.

Although DPP-LSTM (Zhang et al. 2016b) has shown state-of-the-art performance on several benchmarks, we argue that supervised learning cannot fully explore the potential of deep networks for video summarization because there does not exist a single ground truth summary for a video. This is grounded in the fact that humans have subjective opinions on which parts of a video should be selected as the summary. Therefore, devising more effective summarization methods that rely less on labels is still in demand.

Mahasseni et al. (Mahasseni, Lam, and Todorovic 2017) developed an adversarial learning framework to train DPP-LSTM. During the learning process, DPP-LSTM selects keyframes and a discriminator network is used to judge whether a synthetic video constructed from the keyframes is real or not, in order to enforce DPP-LSTM to select more representative frames. Although their framework is unsupervised, the adversarial nature makes the training unstable, which may result in model collapse. In terms of increasing diversity, DPP-LSTM cannot benefit maximally from the DPP module without the help of labels. Since an RNN-based encoder-decoder network following DPP-LSTM for video reconstruction requires pretraining, their framework requires multiple training stages, which is not efficient in practice.

In this paper, we formulate video summarization as a sequential decision-making process and develop a deep summarization network (DSN) to summarize videos. DSN has an encoder-decoder architecture, where the encoder is a convolutional neural network (CNN) that performs feature extraction on video frames and the decoder is a bidirectional LSTM network that produces probabilities based on which actions are sampled to select frames. To train our DSN, we propose an end-to-end, reinforcement learning-based framework with a diversity-representativeness (DR) reward function that jointly accounts for diversity and representativeness of generated summaries, and does not rely on labels or user interactions at all.

The DR reward function is inspired by the general criteria of what properties a high-quality video summary should have. Specifically, the reward function consists of a diversity reward and a representativeness reward. The diversity reward measures how dissimilar the selected frames are to each other, while the representativeness reward computes distances between frames and their nearest selected frames, which is essentially the k-medoids problem. These two rewards complement each other and work jointly to encourage DSN to produce diverse, representative summaries. The intuition behind this learning strategy is closely concerned with how humans summarize videos. To the best of our knowledge, this paper is the first to apply reinforcement learning to unsupervised video summarization.

The learning objective of DSN is to maximize the expected rewards over time. The rationale for using reinforcement learning (RL) to train DSN is two-fold. Firstly, we use an RNN as part of our model and focus on the unsupervised setting. An RNN needs to receive supervision signals at each temporal step, but our rewards are computed over the whole video sequence, i.e., they can only be obtained after a sequence finishes. To provide supervision from a reward that is only available at the end of a sequence, RL becomes a natural choice. Secondly, we conjecture that DSN can benefit more from RL because RL essentially aims to optimize the action (frame-selection) mechanism of an agent by iteratively enforcing the agent to take better and better actions. However, optimizing the action mechanism is not particularly highlighted in a normal supervised/unsupervised setting.

As the training process does not require labels, our method can be fully unsupervised. To fit the case where labels are available, we further extend our unsupervised method to a supervised version by adding a supervised objective that directly maximizes the log-probability of selecting annotated keyframes. By learning the high-level concepts encoded in labels, our DSN can recognize globally important frames and produce summaries that highly align with human-annotated summaries.

We conduct extensive experiments on two datasets, SumMe (Gygli et al. 2014) and TVSum (Song et al. 2015), to quantitatively and qualitatively evaluate our method. The quantitative results show that our unsupervised method not only outperforms other state-of-the-art unsupervised alternatives, but also is comparable to or even superior to most published supervised methods. More impressively, the qualitative results illustrate that DSN trained with our unsupervised learning algorithm can identify important frames that coincide with human selections.

The main contributions of this paper are summarized as follows: (1) We develop an end-to-end, reinforcement learning-based framework for training DSN, where we propose a label-free reward function that jointly accounts for diversity and representativeness of generated summaries. To the best of our knowledge, our work is the first to apply reinforcement learning to unsupervised video summarization. (2) We extend our unsupervised approach to a supervised version to leverage labels. (3) We conduct extensive experiments on two benchmark datasets to show that our unsupervised method not only outperforms other state-of-the-art unsupervised methods, but also is comparable to or even superior to most published supervised approaches.

Related Work

Video summarization. Research in video summarization has been significantly advanced in recent years, leading to approaches of various characteristics. Lee et al. (Lee, Ghosh, and Grauman 2012) identified important objects and people in summarizing videos. Gygli et al. (Gygli et al. 2014) learned a linear regressor to predict the degree of interestingness of video frames and selected keyframes with the highest interestingness scores. Gygli et al. (Gygli, Grabner, and Van Gool 2015) cast video summarization as a subset selection problem and optimized submodular functions with multiple objectives. Ejaz et al. (Ejaz, Mehmood, and Baik 2013) applied an attention-modeling technique to extracting keyframes of visual saliency. Zhang et al. (Zhang et al. 2016a) developed a nonparametric approach to transfer structures of known video summaries to new videos with similar topics. Auxiliary resources have also been exploited to aid the summarization process, such as web images/videos (Song et al. 2015; Khosla et al. 2013; Chu, Song, and Jaimes 2015) and category information (Potapov et al. 2014). Most of these non-deep summarization methods processed video frames independently, thus ignoring the inherent sequential patterns. Moreover, non-deep summarization methods usually do not support end-to-end training, which causes extra costs at test time. To address the aforementioned issues, we model video summarization via a deep RNN to capture long-term dependencies in video frames, and propose a reinforcement learning-based framework to train the network end to end.

Reinforcement learning (RL). RL has become an increasingly popular research area due to its effectiveness in various tasks. Mnih et al. (Mnih et al. 2013) successfully approximated the Q function with a deep CNN, and enabled their agent to beat a human expert in several Atari games. Later on, many researchers have applied RL algorithms to vision-related applications such as image captioning (Xu et al. 2015) and person re-identification (Lan et al. 2017). In the domain of video summarization, our work is not the first to use RL. Previously, Song et al. (Song et al. 2016) applied RL to training a summarization network for selecting category-specific keyframes. Their learning framework requires keyframe labels and category information of training videos. However, our work significantly differs from the work of Song et al. and other RL-based work in that labels or user interactions are not required at all during the learning process, which is attributed to our novel reward function. Therefore, our summarization method can be fully unsupervised and is more practical to deploy for large-scale video summarization.

Proposed Approach

We formulate video summarization as a sequential decision-making process. In particular, we develop a deep summarization network (DSN) to predict probabilities for video frames and make decisions on which frames to select based on the predicted probability distributions. We present an end-to-end, reinforcement learning-based framework for training our DSN, where we design a diversity-representativeness reward function that directly assesses how diverse and representative the generated summaries are. Figure 1 illustrates the overall learning process.
Figure 1: Training the deep summarization network (DSN) via reinforcement learning. DSN receives a video V_i = {v_t}_{t=1}^T and takes actions A = {a_t | a_t ∈ {0, 1}, t = 1, ..., T} (i.e., a sequence of binary variables) on which parts of the video are selected as the summary S = {v_{y_i} | a_{y_i} = 1, i = 1, 2, ...}. The feedback reward R(S) = R_div + R_rep is computed based on the quality of the summary, i.e., its diversity and representativeness.

Deep Summarization Network

We adopt the encoder-decoder framework for our deep summarization network (DSN). The encoder is a convolutional neural network (CNN) that extracts visual features {x_t}_{t=1}^T from the input video frames {v_t}_{t=1}^T of length T. The decoder is a bidirectional recurrent neural network (BiRNN) topped with a fully connected (FC) layer. The BiRNN takes as input the entire sequence of visual features {x_t}_{t=1}^T and produces corresponding hidden states {h_t}_{t=1}^T. Each h_t is the concatenation of the forward hidden state h_t^f and the backward hidden state h_t^b, which encapsulate the future and past information with a strong emphasis on the parts surrounding the t-th frame. The FC layer, which ends with the sigmoid function, predicts for each frame a probability p_t, from which a frame-selection action a_t is sampled:

p_t = σ(W h_t),   (1)
a_t ∼ Bernoulli(p_t),   (2)

where σ represents the sigmoid function and a_t ∈ {0, 1} indicates whether the t-th frame is selected or not. The bias in Eq. (1) is omitted for brevity. A video summary is composed of the selected frames, S = {v_{y_i} | a_{y_i} = 1, i = 1, 2, ...}.

In practice, we use GoogLeNet (Szegedy et al. 2015) pretrained on ImageNet (Deng et al. 2009) as the CNN model. The visual feature vectors {x_t}_{t=1}^T are extracted from the penultimate layer of GoogLeNet. For the RNN cell, we employ long short-term memory (LSTM) to enhance the RNN's ability to capture long-term dependencies in video frames. During training, we only update the decoder.
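To make the decoder concrete, here is a minimal sketch in PyTorch (the authors' released implementation is in Theano); the class name, the 1024-dimensional feature size and the toy 120-frame input are illustrative choices, while the 256 hidden units match the Implementation details given later.

```python
import torch
import torch.nn as nn

class DSN(nn.Module):
    """Minimal sketch of the DSN decoder: BiLSTM + FC + sigmoid, Eqs. (1)-(2)."""
    def __init__(self, feat_dim=1024, hidden_dim=256):
        super().__init__()
        self.birnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, 1)  # forward/backward states are concatenated

    def forward(self, feats):
        # feats: (1, T, feat_dim) CNN features of one video
        h, _ = self.birnn(feats)             # (1, T, 2 * hidden_dim)
        probs = torch.sigmoid(self.fc(h))    # p_t in Eq. (1), shape (1, T, 1)
        return probs.squeeze(-1)             # (1, T)

# Sampling frame-selection actions, Eq. (2)
dsn = DSN()
feats = torch.randn(1, 120, 1024)            # stand-in for GoogLeNet features of a 120-frame video
probs = dsn(feats)
dist = torch.distributions.Bernoulli(probs)
actions = dist.sample()                      # a_t in {0, 1}
log_probs = dist.log_prob(actions)           # needed later for the policy gradient
```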
Diversity-Representativeness Reward Function

During training, DSN receives a reward R(S) that evaluates the quality of generated summaries, and the objective of DSN is to maximize the expected rewards over time by producing high-quality summaries. In general, a high-quality video summary is expected to be both diverse and representative of the original video so that temporal information across the entire video can be maximally preserved. To this end, we propose a novel reward that assesses the degree of diversity and representativeness of generated summaries. The proposed reward is composed of a diversity reward R_div and a representativeness reward R_rep, which we detail as follows.

Diversity reward. We evaluate the degree of diversity of a generated summary by measuring the dissimilarity among the selected frames in the feature space. Letting the indices of the selected frames be Y = {y_i | a_{y_i} = 1, i = 1, ..., |Y|}, we compute R_div as the mean of the pairwise dissimilarities among the selected frames:

R_div = 1 / (|Y|(|Y| − 1)) Σ_{t∈Y} Σ_{t'∈Y, t'≠t} d(x_t, x_{t'}),   (3)

where d(·, ·) is the dissimilarity function calculated by

d(x_t, x_{t'}) = 1 − (x_t^T x_{t'}) / (||x_t||_2 ||x_{t'}||_2).   (4)

Intuitively, the more diverse (or more dissimilar) the selected frames are to each other, the higher the diversity reward the agent can receive. However, Eq. (3) treats video frames as randomly permutable items, which ignores the temporal structure inherent in sequential data. In fact, the similarity between two temporally distant frames should be ignored because such frames are essential to the storyline construction (Gong et al. 2014). To overcome this problem, we set d(x_t, x_{t'}) = 1 if |t − t'| > λ, where λ controls the degree of temporal distance. We will validate this hypothesis in the Experiments section.
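A compact NumPy sketch of R_div with the λ-technique follows; the function and argument names are ours, and temp_dist = 20 mirrors the λ value reported later in the Implementation details.

```python
import numpy as np

def diversity_reward(feats, pick_idx, temp_dist=20):
    """Eqs. (3)-(4): mean pairwise dissimilarity among selected frames,
    with d(x_t, x_t') forced to 1 when |t - t'| > lambda (temp_dist)."""
    if len(pick_idx) < 2:
        return 0.0
    x = feats[pick_idx]                                   # (|Y|, D)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)      # unit-normalize features
    dissim = 1.0 - x @ x.T                                # d(x_t, x_t') for all pairs
    t = np.asarray(pick_idx)
    far_apart = np.abs(t[:, None] - t[None, :]) > temp_dist
    dissim[far_apart] = 1.0                               # the lambda-technique
    n = len(pick_idx)
    off_diag = ~np.eye(n, dtype=bool)
    return float(dissim[off_diag].sum() / (n * (n - 1)))
```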
Representativeness reward. This reward measures how well the generated summary can represent the original video. To this end, we formulate the degree of representativeness of a video summary as the k-medoids problem (Gygli, Grabner, and Van Gool 2015). In particular, we want the agent to select a set of medoids such that the mean of squared errors between video frames and their nearest medoids is minimal. Therefore, we define R_rep as

R_rep = exp(−(1/T) Σ_{t=1}^{T} min_{t'∈Y} ||x_t − x_{t'}||_2).   (5)

With this reward, the agent is encouraged to select frames that are close to the cluster centers in the feature space. An alternative formulation of R_rep could be the inverse of the reconstruction errors achieved by the selected frames, but this formulation is too computationally expensive.

Diversity-representativeness reward. R_div and R_rep complement each other and work jointly to guide the learning of DSN:

R(S) = R_div + R_rep.   (6)

During training, R_div and R_rep are similar in order of magnitude; this is non-trivial to maintain, and it ensures that neither of them dominates the gradient computation. We give zero reward to DSN when no frames are selected, i.e., when the sampled actions are all zeros.
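Continuing the sketch above (and reusing diversity_reward from it), R_rep and the combined reward of Eq. (6), including the zero-reward rule for empty selections, might look as follows; the names are again illustrative.

```python
import numpy as np

def representativeness_reward(feats, pick_idx):
    """Eq. (5): exp of the negative mean distance between every frame
    and its nearest selected frame (a k-medoids-style objective)."""
    dists = np.linalg.norm(feats[:, None, :] - feats[pick_idx][None, :, :], axis=2)  # (T, |Y|)
    return float(np.exp(-dists.min(axis=1).mean()))

def dr_reward(feats, actions, temp_dist=20):
    """Eq. (6): R(S) = R_div + R_rep, with zero reward if nothing is selected."""
    pick_idx = np.flatnonzero(actions)
    if pick_idx.size == 0:
        return 0.0
    return diversity_reward(feats, pick_idx, temp_dist) + representativeness_reward(feats, pick_idx)
```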
Training with Policy Gradient

The goal of our summarization agent is to learn a policy function π_θ with parameters θ by maximizing the expected rewards

J(θ) = E_{p_θ(a_{1:T})}[R(S)],   (7)

where p_θ(a_{1:T}) denotes the probability distribution over possible action sequences, and R(S) is computed by Eq. (6). π_θ is defined by our DSN.

Following the REINFORCE algorithm proposed by Williams (Williams 1992), we can compute the derivative of the objective function J(θ) w.r.t. the parameters θ as

∇_θ J(θ) = E_{p_θ(a_{1:T})}[ R(S) Σ_{t=1}^{T} ∇_θ log π_θ(a_t | h_t) ],   (8)

where a_t is the action taken by DSN at time t and h_t is the hidden state from the BiRNN.

Since Eq. (8) involves an expectation over high-dimensional action sequences, which is hard to compute directly, we approximate the gradient by running the agent for N episodes on the same video and then taking the average gradient

∇_θ J(θ) ≈ (1/N) Σ_{n=1}^{N} Σ_{t=1}^{T} R_n ∇_θ log π_θ(a_t | h_t),   (9)

where R_n is the reward computed at the n-th episode. Eq. (9) is also known as the episodic REINFORCE algorithm.

Although the gradient in Eq. (9) is a good estimate, it may have high variance, which makes the network hard to converge. A common countermeasure is to subtract a constant baseline b from the reward, so the gradient becomes

∇_θ J(θ) ≈ (1/N) Σ_{n=1}^{N} Σ_{t=1}^{T} (R_n − b) ∇_θ log π_θ(a_t | h_t),   (10)

where b is simply computed as the moving average of rewards experienced so far, for computational efficiency.

Regularization

Since selecting more frames also increases the reward, we impose a regularization term on the probability distributions p_{1:T} produced by DSN in order to constrain the percentage of frames selected for the summary. Inspired by (Mahasseni, Lam, and Todorovic 2017), we minimize the following term during training:

L_percentage = || (1/T) Σ_{t=1}^{T} p_t − ε ||^2,   (11)

where ε determines the percentage of frames to be selected. In addition, we also add an ℓ2 regularization term on the weight parameters θ to avoid overfitting:

L_weight = Σ_{i,j} θ_{i,j}^2.   (12)

Optimization

We optimize the policy function's parameters θ via a stochastic gradient-based method. By combining the gradients computed from Eq. (10), Eq. (11) and Eq. (12), we update θ as

θ = θ − α ∇_θ (−J + β_1 L_percentage + β_2 L_weight),   (13)

where α is the learning rate, and β_1 and β_2 are hyperparameters that balance the weighting. In practice, we use Adam (Kingma and Ba 2014) as the optimization algorithm. As a result of learning, the log-probability of actions taken by the network that have led to high rewards is increased, while that of actions that have resulted in low rewards is decreased.

Extension to Supervised Learning

Given the keyframe indices for a video, Y* = {y_i* | i = 1, ..., |Y*|}, we use Maximum Likelihood Estimation (MLE) to maximize the log-probability of selecting the keyframes specified by Y*, log p(t; θ) where t ∈ Y*; p(t; θ) is computed from Eq. (1). The objective is formalized as

L_MLE = Σ_{t∈Y*} log p(t; θ).   (14)
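Eqs. (10)-(13) translate into a surrogate loss whose gradient equals the policy gradient. Below is a minimal single-video training step in PyTorch (the released code uses Theano), reusing the DSN class and dr_reward function sketched earlier; ε = 0.5 and N = 5 follow the Implementation details given later, whereas the learning rate, β_1, the weight decay standing in for β_2, and the 0.9/0.1 moving-average coefficients for the baseline are placeholder values, and the supervised L_MLE term of Eq. (14) is omitted.

```python
import torch

def train_step(dsn, optimizer, feats_np, feats, baseline,
               num_episodes=5, eps=0.5, beta1=0.01):
    """One video, N episodes: REINFORCE with a moving-average baseline (Eq. (10)),
    plus the selection-percentage regularizer (Eq. (11)). The l2 term of Eq. (12)
    is handled here by the optimizer's weight_decay."""
    probs = dsn(feats)                              # (1, T), p_t from Eq. (1)
    dist = torch.distributions.Bernoulli(probs)
    cost = beta1 * (probs.mean() - eps) ** 2        # L_percentage
    rewards = []
    for _ in range(num_episodes):
        actions = dist.sample()
        log_prob = dist.log_prob(actions).sum()
        reward = dr_reward(feats_np, actions.squeeze(0).numpy())   # R_n, Eq. (6)
        cost = cost - (reward - baseline) * log_prob / num_episodes
        rewards.append(reward)
    optimizer.zero_grad()
    cost.backward()
    optimizer.step()
    # return the updated moving-average baseline b
    return 0.9 * baseline + 0.1 * (sum(rewards) / len(rewards))

# usage sketch (names and values are illustrative):
# dsn = DSN(); optimizer = torch.optim.Adam(dsn.parameters(), lr=1e-4, weight_decay=1e-5)
# baseline = 0.0
# for feats_np in dataset:                       # (T, 1024) GoogLeNet features per video
#     feats = torch.from_numpy(feats_np).float().unsqueeze(0)
#     baseline = train_step(dsn, optimizer, feats_np, feats, baseline)
```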
Summary Generation

For a test video, we apply the trained DSN to predict the frame-selection probabilities as importance scores. We compute shot-level scores by averaging frame-level scores within the same shot. For temporal segmentation, we use KTS proposed by (Potapov et al. 2014). To generate a summary, we select shots by maximizing the total score while ensuring that the summary length does not exceed a limit, which is usually 15% of the video length. The maximization step is essentially the 0/1 knapsack problem, which is known to be NP-hard. We obtain a near-optimal solution via dynamic programming (Song et al. 2015).

Besides evaluating generated summaries in the Experiments part, we also qualitatively analyze the raw predictions of DSN so as to exclude the effect of this summary generation step, by which we can better understand what DSN has learned.
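The shot-selection step can be sketched as a standard 0/1 knapsack dynamic program over shots. The snippet below assumes KTS shot boundaries are given; treating the mean frame score of a shot as its value and the shot length as its weight is one plausible reading, not a guaranteed match to the released code.

```python
import numpy as np

def generate_summary(frame_scores, shot_bounds, budget_ratio=0.15):
    """Pick shots maximizing total shot-level score under a length budget
    (0/1 knapsack solved by dynamic programming over integer frame counts)."""
    # shot_bounds: list of (start, end) frame indices (end exclusive) from KTS
    shot_scores = [frame_scores[s:e].mean() for s, e in shot_bounds]
    shot_lens = [e - s for s, e in shot_bounds]
    capacity = int(budget_ratio * sum(shot_lens))

    n = len(shot_bounds)
    dp = np.zeros((n + 1, capacity + 1))
    for i in range(1, n + 1):
        w, v = shot_lens[i - 1], shot_scores[i - 1]
        for c in range(capacity + 1):
            dp[i, c] = dp[i - 1, c]
            if w <= c:
                dp[i, c] = max(dp[i, c], dp[i - 1, c - w] + v)

    # backtrack to recover the selected shots
    selected, c = [], capacity
    for i in range(n, 0, -1):
        if dp[i, c] != dp[i - 1, c]:
            selected.append(i - 1)
            c -= shot_lens[i - 1]

    summary = np.zeros(len(frame_scores), dtype=bool)
    for i in selected:
        s, e = shot_bounds[i]
        summary[s:e] = True
    return summary
```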
Experiments

Experimental Setup

Datasets. We evaluate our methods on SumMe (Gygli et al. 2014) and TVSum (Song et al. 2015). SumMe consists of 25 user videos covering various topics such as holidays and sports. Each video in SumMe ranges from 1 to 6 minutes and is annotated by 15 to 18 persons, so there are multiple ground truth summaries for each video. TVSum contains 50 videos, which cover topics such as news and documentaries. The duration of each video varies from 2 to 10 minutes. Similar to SumMe, each video in TVSum has 20 annotators that provide frame-level importance scores. Following (Song et al. 2015; Zhang et al. 2016b), we convert importance scores to shot-based summaries for evaluation. In addition to these two datasets, we exploit two other datasets, OVP (Open Video Project: https://open-video.org/), which has 50 videos, and YouTube (De Avila et al. 2011), which has 39 videos excluding cartoon videos, to evaluate our method in the settings where training data is augmented (Zhang et al. 2016b; Mahasseni, Lam, and Todorovic 2017).

Evaluation metric. For fair comparison with other approaches, we follow the commonly used protocol from (Zhang et al. 2016b) and compute the F-score as the metric to assess the similarity between automatic summaries and ground truth summaries. We also follow (Zhang et al. 2016b) to deal with multiple ground truth summaries.

Evaluation settings. We use three settings as suggested in (Zhang et al. 2016b) to evaluate our method. (1) Canonical: we use the standard 5-fold cross validation (5FCV), i.e., 80% of videos for training and the rest for testing. (2) Augmented: we still use the 5FCV but we augment the training data in each fold with OVP and YouTube. (3) Transfer: for a target dataset, e.g. SumMe or TVSum, we use the other three datasets as the training data to test the transfer ability of our model.

Implementation details. We downsample videos to 2 fps, as done in (Zhang et al. 2016b). We set the temporal distance λ to 20, the ε in Eq. (11) to 0.5, and the number of episodes N to 5. The other hyperparameters α, β1 and β2 in Eq. (13) are optimized via cross-validation. We set the dimension of the hidden state in the RNN cell to 256 throughout this paper. Training is stopped when it reaches a maximum number of epochs (60 in our case). Early stopping is executed when the reward ceases to increase for a period of time (10 epochs in our experiments). We implement our method based on Theano (Al-Rfou et al. 2016); code is available at https://github.com/KaiyangZhou/vsumm-reinforce.

Comparison. To compare with other approaches, we implement Uniform sampling, K-medoids and Dictionary selection (Elhamifar, Sapiro, and Vidal 2012) ourselves. We retrieve results of other approaches, including Video-MMR (Li and Merialdo 2010), Vsumm (De Avila et al. 2011), Web image (Khosla et al. 2013), Online sparse coding (Zhao and Xing 2014), Co-archetypal (Song et al. 2015), Interestingness (Gygli et al. 2014), Submodularity (Gygli, Grabner, and Van Gool 2015), Summary transfer (Zhang et al. 2016a), Bi-LSTM and DPP-LSTM (Zhang et al. 2016b), and GANdpp and GANsup (Mahasseni, Lam, and Todorovic 2017), from the published papers. Due to space limits, we do not include these citations in tables.

Quantitative Evaluation

We first compare our method with several baselines that differ in learning objectives. Then, we compare our methods with current state-of-the-art unsupervised/supervised approaches in the three evaluation settings.

Comparison with baselines. We set the baseline models as the ones trained with R_div only and R_rep only, denoted by D-DSN and R-DSN, respectively. We denote the model trained with the two rewards jointly as DR-DSN. The model that is extended to the supervised version is denoted by DR-DSNsup. We also validate the effectiveness of the proposed technique (which we call the λ-technique from now on) that ignores distant similarities when computing R_div. We denote the D-DSN trained without the λ-technique as D-DSN w/o λ. To verify that DSN can benefit more from reinforcement learning than from supervised learning, we add another baseline, the DSN trained with the cross-entropy loss using keyframe annotations, where a confidence penalty (Pereyra et al. 2017) is imposed on the output distributions as a regularization term. This model is denoted by DSNsup.

Table 1: Results (%) of different variants of our method on SumMe and TVSum.

Method        SumMe   TVSum
DSNsup        38.2    54.5
D-DSN w/o λ   39.3    55.7
D-DSN         40.5    56.2
R-DSN         40.7    56.9
DR-DSN        41.4    57.6
DR-DSNsup     42.1    58.1
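For reference, the per-pair F-score underlying all comparison numbers can be sketched as follows; how scores are aggregated over the multiple ground-truth summaries of a video (averaging versus taking the maximum) follows the per-dataset convention of (Zhang et al. 2016b) and is deliberately left out of this sketch.

```python
import numpy as np

def f_score(pred, gt):
    """Harmonic mean of precision and recall between a predicted summary and
    one ground-truth summary, both given as boolean per-frame masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    overlap = np.logical_and(pred, gt).sum()
    if overlap == 0:
        return 0.0
    precision = overlap / pred.sum()
    recall = overlap / gt.sum()
    return 2 * precision * recall / (precision + recall) * 100  # reported in %
```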
Figure 2: Video summaries generated by different variants of our approach for video 18 in TVSum (indexed as in (Song et al. 2015)). (a) Example frames from the video; (b) DR-DSNsup; (c) DR-DSN; (d) R-DSN; (e) D-DSN. The light-gray bars in (b) to (e) correspond to ground truth importance scores, while the colored areas correspond to the parts selected by the different models.

Table 1 reports the results of different variants of our method on SumMe and TVSum. We can see that DR-DSN clearly outperforms D-DSN and R-DSN on both datasets, which demonstrates that by using R_div and R_rep collaboratively, we can better teach DSN to produce high-quality summaries that are diverse and representative. Comparing the unsupervised model with the supervised one, we see that DR-DSN significantly outperforms DSNsup on the two datasets (41.4 vs. 38.2 on SumMe and 57.6 vs. 54.5 on TVSum), which justifies our assumption that DSN can benefit more from reinforcement learning than from supervised learning.

By adding the supervision signal of L_MLE (Eq. (14)) to DR-DSN, the summarization performance is further improved (a 1.7% improvement on SumMe and a 0.9% improvement on TVSum). This is because labels encode a high-level understanding of the video content, which is exploited by DR-DSNsup to learn more useful patterns.

The performance of R-DSN is slightly better than that of D-DSN on the two datasets, because diverse summaries usually contain redundant information that is irrelevant to the video subject. We observe that the performance of D-DSN is better than that of D-DSN w/o λ, which does not consider temporally distant frames. When using the λ-technique in training, around 50%–70% of the distance matrix was set to 1 (varying across different videos) at the early stage. As the training epochs increased, the percentage went up too, eventually staying around 80%–90%. This makes sense because selecting temporally distant frames can lead to higher rewards, and DSN is encouraged to do so by the diversity reward function.

Comparison with unsupervised approaches. Table 2 shows the results of DR-DSN against other unsupervised approaches on SumMe and TVSum. It can be seen that DR-DSN outperforms the other unsupervised approaches on both datasets by large margins. On SumMe, DR-DSN is 5.9% better than the current state of the art, GANdpp. On TVSum, DR-DSN substantially beats GANdpp by 11.4%.

Although our reward functions are conceptually analogous to the objectives of GANdpp, ours directly model the diversity and representativeness of selected frames in the feature space, which is more useful for guiding DSN to find good solutions. In addition, the training performances of DR-DSN are 40.2% on SumMe and 57.2% on TVSum, which suggests that the model did not overfit to the training data (note that we do not explicitly optimize the F-score metric in the training objective function).

Table 2: Results (%) of unsupervised approaches on SumMe and TVSum. Our DR-DSN performs the best, especially on TVSum where it exhibits a huge advantage over the others.

Method                SumMe   TVSum
Video-MMR             26.6    -
Uniform sampling      29.3    15.5
K-medoids             33.4    28.8
Vsumm                 33.7    -
Web image             -       36.0
Dictionary selection  37.8    42.0
Online sparse coding  -       46.0
Co-archetypal         -       50.0
GANdpp                39.1    51.7
DR-DSN                41.4    57.6

Comparison with supervised approaches. Table 3 reports the results of our supervised model, DR-DSNsup, and other supervised approaches. In terms of LSTM-based methods, our DR-DSNsup beats the others, i.e., Bi-LSTM, DPP-LSTM and GANsup, by 1.0%–12.0% on SumMe and 3.2%–7.2% on TVSum, respectively. It is also interesting to see that the summarization performance of our unsupervised method, DR-DSN, is even superior to the state-of-the-art supervised approach on TVSum (57.6 vs. 56.3), and is better than most of the supervised approaches on SumMe.
Figure 3: Ground truth (top) and importance scores predicted by DR-DSN (middle) and DSNsup (bottom) for (a) video 11 in TVSum (DR-DSN: F-score = 64.3, XCorr = 83.16; DSNsup: F-score = 58.7, XCorr = 78.06) and (b) video 10 in TVSum (DR-DSN: F-score = 41.9, XCorr = 91.84; DSNsup: F-score = 41.7, XCorr = 90.13). Besides the F-score for each prediction, we also compute the cross-correlation (XCorr) between each prediction and the ground truth to give a quantitative measure of similarity between two 1D arrays. The higher the XCorr, the more similar the two arrays are to each other.
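The caption above reports a cross-correlation (XCorr) score between predicted and ground-truth importance curves; since the exact normalization is not spelled out in the text, the sketch below uses a zero-lag, Pearson-style normalized cross-correlation scaled to 0–100 purely as an illustration of such a similarity measure.

```python
import numpy as np

def xcorr_score(pred, gt):
    """Zero-lag normalized cross-correlation between two equal-length 1D score
    arrays, scaled to a 0-100 range (an illustrative stand-in for XCorr)."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    pred = (pred - pred.mean()) / (pred.std() + 1e-8)
    gt = (gt - gt.mean()) / (gt.std() + 1e-8)
    return float(np.dot(pred, gt) / len(pred) * 100)
```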

These results strongly prove the efficacy of our learning framework.

Table 3: Results (%) of supervised approaches on SumMe and TVSum. Our DR-DSNsup performs the best.

Method            SumMe   TVSum
Interestingness   39.4    -
Submodularity     39.7    -
Summary transfer  40.9    -
Bi-LSTM           37.6    54.2
DPP-LSTM          38.6    54.7
GANsup            41.7    56.3
DR-DSNsup         42.1    58.1

Comparison in the Augmented (A) and Transfer (T) settings. Table 4 compares our methods with current state-of-the-art LSTM-based methods in the A and T settings. The results in the Canonical setting are also provided to exhibit the improvements obtained with increased training data. In the A setting, DR-DSNsup performs marginally better than GANsup on SumMe (43.9 vs. 43.6), whereas it is defeated by GANsup on TVSum (59.8 vs. 61.2). This may be because the LSTM model in GANsup has more hidden units (1024 vs. our 256). In the T setting, DR-DSNsup performs the best on both datasets, suggesting that our model is able to transfer knowledge between datasets. Furthermore, it is interesting to see that our unsupervised model, DR-DSN, is superior or comparable to the other methods in both settings. Overall, we firmly believe that by using a larger model and/or designing a better network architecture, we can obtain better summarization performance with our learning framework.

Table 4: Results (%) of the LSTM-based approaches on SumMe and TVSum in the Canonical (C), Augmented (A) and Transfer (T) settings, respectively.

              SumMe               TVSum
Method        C     A     T       C     A     T
Bi-LSTM       37.6  41.6  40.7    54.2  57.9  56.9
DPP-LSTM      38.6  42.9  41.8    54.7  59.6  58.7
GANdpp        39.1  43.4  -       51.7  59.5  -
GANsup        41.7  43.6  -       56.3  61.2  -
DR-DSN        41.4  42.8  42.4    57.6  58.4  57.8
DR-DSNsup     42.1  43.9  42.6    58.1  59.8  58.9

We also experiment with different gated RNN units, i.e., LSTM vs. GRU (Cho et al. 2014), and find that LSTM-based models consistently beat GRU-based models (see Table 5). This may be because the memory mechanism in LSTM has a higher degree of complexity, allowing more complex patterns to be learned.

Table 5: Results (%) of using different gated recurrent units.

              SumMe          TVSum
Method        LSTM   GRU     LSTM   GRU
DR-DSN        41.4   41.2    57.6   56.7
DR-DSNsup     42.1   41.5    58.1   57.8

Qualitative Evaluation

Video summaries. We provide qualitative results for an exemplar video about a man making a spicy sausage sandwich in Figure 2. In general, all four methods produce high-quality summaries that span the temporal structure, with only small variations observed in some frames. The peak regions of the ground truth are almost all captured. Nevertheless, the summary produced by the supervised model, DR-DSNsup, is much closer to the complete storyline conveyed by the original video, i.e., from food preparation to cooking. This is because DR-DSNsup benefits from labels that allow high-level concepts to be better captured.

Predicted importance scores. We visualize the raw predictions by DR-DSN and DSNsup in Figure 3. By comparing predictions with ground truth, we can understand in more depth how well DSN has learned. It is worth highlighting that the curves of importance scores predicted by the unsupervised model resemble those predicted by the supervised model in several parts. More importantly, these parts coincide with the ones also considered important by humans. This strongly demonstrates that reinforcement learning with our diversity-representativeness reward function can well imitate the human learning process and effectively teach DSN to recognize important frames.
Conclusion

In this paper, we proposed a label-free reinforcement learning algorithm to tackle unsupervised video summarization. Extensive experiments on two benchmark datasets showed that using reinforcement learning with our unsupervised reward function outperformed other state-of-the-art unsupervised alternatives, and produced results comparable to or even superior to most supervised methods.

Acknowledgments

We thank Ke Zhang and Wei-Lun Chao for discussions of details of their paper (Zhang et al. 2016b). This work was supported in part by the National Key Research and Development Program of China (2016YFC1400704) and the National Natural Science Foundation of China (U1613211, 61633021).

References

[Al-Rfou et al. 2016] Al-Rfou, R.; Alain, G.; Almahairi, A.; Angermueller, C.; Bahdanau, D.; Ballas, N.; Bastien, F.; Bayer, J.; Belikov, A.; Belopolsky, A.; et al. 2016. Theano: A python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688.

[Cho et al. 2014] Cho, K.; Van Merriënboer, B.; Bahdanau, D.; and Bengio, Y. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

[Chu, Song, and Jaimes 2015] Chu, W.-S.; Song, Y.; and Jaimes, A. 2015. Video co-summarization: Video summarization by visual co-occurrence. In CVPR, 3584–3592.

[De Avila et al. 2011] De Avila, S. E. F.; Lopes, A. P. B.; da Luz, A.; and de Albuquerque Araújo, A. 2011. Vsumm: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognition Letters 32(1):56–68.

[Deng et al. 2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In CVPR, 248–255. IEEE.

[Ejaz, Mehmood, and Baik 2013] Ejaz, N.; Mehmood, I.; and Baik, S. W. 2013. Efficient visual attention based framework for extracting key frames from videos. Signal Processing: Image Communication 28(1):34–44.

[Elhamifar, Sapiro, and Vidal 2012] Elhamifar, E.; Sapiro, G.; and Vidal, R. 2012. See all by looking at a few: Sparse modeling for finding representative objects. In CVPR, 1600–1607. IEEE.

[Gong et al. 2014] Gong, B.; Chao, W.-L.; Grauman, K.; and Sha, F. 2014. Diverse sequential subset selection for supervised video summarization. In NIPS, 2069–2077.

[Gygli et al. 2014] Gygli, M.; Grabner, H.; Riemenschneider, H.; and Van Gool, L. 2014. Creating summaries from user videos. In ECCV, 505–520. Springer.

[Gygli, Grabner, and Van Gool 2015] Gygli, M.; Grabner, H.; and Van Gool, L. 2015. Video summarization by learning submodular mixtures of objectives. In CVPR, 3090–3098.

[Hochreiter and Schmidhuber 1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.

[Khosla et al. 2013] Khosla, A.; Hamid, R.; Lin, C.-J.; and Sundaresan, N. 2013. Large-scale video summarization using web-image priors. In CVPR, 2698–2705.

[Kingma and Ba 2014] Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. In ICLR.

[Lan et al. 2017] Lan, X.; Wang, H.; Gong, S.; and Zhu, X. 2017. Deep reinforcement learning attention selection for person re-identification. In BMVC.

[Lee, Ghosh, and Grauman 2012] Lee, Y. J.; Ghosh, J.; and Grauman, K. 2012. Discovering important people and objects for egocentric video summarization. In CVPR, 1346–1353. IEEE.

[Li and Merialdo 2010] Li, Y., and Merialdo, B. 2010. Multi-video summarization based on video-mmr. In WIAMIS, 1–4. IEEE.

[Mahasseni, Lam, and Todorovic 2017] Mahasseni, B.; Lam, M.; and Todorovic, S. 2017. Unsupervised video summarization with adversarial lstm networks. In CVPR.

[Mnih et al. 2013] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

[Panda and Roy-Chowdhury 2017] Panda, R., and Roy-Chowdhury, A. K. 2017. Collaborative summarization of topic-related videos. In CVPR.

[Pereyra et al. 2017] Pereyra, G.; Tucker, G.; Chorowski, J.; Kaiser, Ł.; and Hinton, G. 2017. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548.

[Potapov et al. 2014] Potapov, D.; Douze, M.; Harchaoui, Z.; and Schmid, C. 2014. Category-specific video summarization. In ECCV, 540–555. Springer.

[Song et al. 2015] Song, Y.; Vallmitjana, J.; Stent, A.; and Jaimes, A. 2015. Tvsum: Summarizing web videos using titles. In CVPR, 5179–5187.

[Song et al. 2016] Song, X.; Chen, K.; Lei, J.; Sun, L.; Wang, Z.; Xie, L.; and Song, M. 2016. Category driven deep recurrent neural network for video summarization. In ICMEW, 1–6. IEEE.

[Szegedy et al. 2015] Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In CVPR, 1–9.

[Williams 1992] Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8(3-4):229–256.

[Xu et al. 2015] Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2048–2057.

[Zhang et al. 2016a] Zhang, K.; Chao, W.-L.; Sha, F.; and Grauman, K. 2016a. Summary transfer: Exemplar-based subset selection for video summarization. In CVPR, 1059–1067.

[Zhang et al. 2016b] Zhang, K.; Chao, W.-L.; Sha, F.; and Grauman, K. 2016b. Video summarization with long short-term memory. In ECCV, 766–782. Springer.

[Zhao and Xing 2014] Zhao, B., and Xing, E. P. 2014. Quasi real-time summarization for consumer videos. In CVPR, 2513–2520.
