Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward
Kaiyang Zhou,1,2 Yu Qiao,1∗ Tao Xiang2
1 Guangdong Key Lab of Computer Vision and Virtual Reality, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China
2 Queen Mary University of London, UK
[email protected], [email protected], [email protected]
...
Figure 1: Training deep summarization network (DSN) via reinforcement learning. DSN receives a video Vi and takes actions
A (i.e., a sequence of binary variables) on which parts of the video are selected as the summary S. The feedback reward R(S)
is computed based on the quality of the summary, i.e., diversity and representativeness.
where σ represents the sigmoid function and a_t ∈ {0, 1} indicates whether the t-th frame is selected or not. The bias in Eq. (1) is omitted for brevity. A video summary is composed of the selected frames, S = {v_{y_i} | a_{y_i} = 1, i = 1, 2, ...}.

In practice, we use the GoogLeNet (Szegedy et al. 2015) pretrained on ImageNet (Deng et al. 2009) as the CNN model. The visual feature vectors {x_t}_{t=1}^{T} are extracted from the penultimate layer of the GoogLeNet. For the RNN cell, we employ long short-term memory (LSTM) to enhance the RNN's ability to capture long-term dependencies across video frames. During training, we only update the decoder.
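To make the architecture concrete, here is a minimal PyTorch-style sketch of the decoder described above: a bidirectional LSTM run over precomputed GoogLeNet features, followed by a fully connected layer with a sigmoid that produces per-frame selection probabilities. The class name, layer sizes and tensor shapes are illustrative assumptions (the authors' implementation is in Theano), not a reproduction of their code.

```python
import torch
import torch.nn as nn

class DSN(nn.Module):
    """Decoder sketch: BiLSTM over CNN features -> per-frame selection probabilities."""
    def __init__(self, feat_dim=1024, hidden_dim=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, 1)

    def forward(self, x):                 # x: (1, T, feat_dim) precomputed GoogLeNet features
        h, _ = self.rnn(x)                # h: (1, T, 2*hidden_dim) BiRNN hidden states
        p = torch.sigmoid(self.fc(h))     # p: (1, T, 1) frame-selection probabilities
        return p.squeeze(-1)              # (1, T)

# During training, actions a_t ~ Bernoulli(p_t) decide which frames enter the summary.
```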
...

where d(·, ·) is the dissimilarity function, calculated as

d(x_t, x_{t'}) = 1 - \frac{x_t^\top x_{t'}}{\|x_t\|_2 \|x_{t'}\|_2}.    (4)

Intuitively, the more diverse (or more dissimilar) the selected frames are to each other, the higher the diversity reward the agent receives. However, Eq. (3) treats video frames as randomly permutable items, which ignores the temporal structure inherent in sequential data. In fact, the similarity between two temporally distant frames should be ignored, because such frames are essential to the storyline construction (Gong et al. 2014). To overcome this problem, we set d(x_t, x_{t'}) = 1 if |t - t'| > λ, where λ controls the degree of temporal distance. We will validate this hypothesis in the Experiments section.
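As a concrete illustration, the following numpy sketch computes the pairwise dissimilarity of Eq. (4) with the λ cutoff applied. The aggregation into the diversity reward R_div (Eq. (3), which falls outside this excerpt) is assumed here to be the mean pairwise dissimilarity among the selected frames; treat the function name and that aggregation as assumptions rather than a quotation of the paper.

```python
import numpy as np

def diversity_reward(features, picks, lam=20):
    """Mean pairwise dissimilarity among selected frames (assumed form of Eq. (3)).

    features: (T, D) array of frame features x_t
    picks:    indices of selected frames (a_t = 1)
    lam:      temporal-distance threshold; distant pairs get d = 1
    """
    if len(picks) < 2:
        return 0.0
    x = features[picks]                                   # (|Y|, D)
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    dissim = 1.0 - x @ x.T                                # Eq. (4): 1 - cosine similarity
    t = np.asarray(picks)
    far_apart = np.abs(t[:, None] - t[None, :]) > lam
    dissim[far_apart] = 1.0                               # ignore similarity of temporally distant frames
    n = len(picks)
    off_diag = dissim.sum() - np.trace(dissim)
    return off_diag / (n * (n - 1))
```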
Representativeness reward. This reward measures how well the generated summary can represent the original video. To this end, we formulate the degree of representativeness of a video summary as the k-medoids problem (Gygli, Grabner, and Van Gool 2015). In particular, we want the agent to select a set of medoids such that the mean of squared errors between the video frames and their nearest medoids is minimal. Therefore, we define R_rep as

R_{rep} = \exp\left( -\frac{1}{T} \sum_{t=1}^{T} \min_{t' \in \mathcal{Y}} \|x_t - x_{t'}\|_2 \right),    (5)

where \mathcal{Y} denotes the set of indices of the selected frames. With this reward, the agent is encouraged to select frames that are close to the cluster centers in the feature space. An alternative formulation of R_rep could be the inverse of the reconstruction error achieved by the selected frames, but that formulation is too computationally expensive.

Diversity-representativeness reward. R_div and R_rep complement each other and work jointly to guide the learning of DSN:

R(S) = R_{div} + R_{rep}.    (6)

During training, R_div and R_rep are similar in order of magnitude. In fact, it is important to keep R_div and R_rep at the same order of magnitude during training so that neither of them dominates the gradient computation. We give zero reward to DSN when no frames are selected, i.e., when the sampled actions are all zeros.
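The representativeness reward of Eq. (5) and the combined reward of Eq. (6), including the zero-reward case for empty selections, translate directly into a few lines of numpy. The sketch below is illustrative only; it reuses the hypothetical diversity_reward helper from the previous sketch.

```python
import numpy as np

def representativeness_reward(features, picks):
    """Eq. (5): exp of negative mean distance from each frame to its nearest selected frame."""
    if len(picks) == 0:
        return 0.0
    dists = np.linalg.norm(features[:, None, :] - features[None, picks, :], axis=-1)  # (T, |Y|)
    return float(np.exp(-dists.min(axis=1).mean()))

def combined_reward(features, actions, lam=20):
    """Eq. (6): R(S) = R_div + R_rep, with zero reward if no frame is selected."""
    picks = np.flatnonzero(actions)
    if len(picks) == 0:
        return 0.0
    return diversity_reward(features, picks, lam) + representativeness_reward(features, picks)
```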
Training with Policy Gradient

The goal of our summarization agent is to learn a policy function π_θ with parameters θ by maximizing the expected rewards

J(\theta) = \mathbb{E}_{p_\theta(a_{1:T})}[R(S)],    (7)

where p_θ(a_{1:T}) denotes the probability distribution over possible action sequences and R(S) is computed by Eq. (6). π_θ is defined by our DSN.

Following the REINFORCE algorithm proposed by Williams (Williams 1992), we can compute the derivative of the objective function J(θ) w.r.t. the parameters θ as

\nabla_\theta J(\theta) = \mathbb{E}_{p_\theta(a_{1:T})}\left[ R(S) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid h_t) \right],    (8)

where a_t is the action taken by DSN at time t and h_t is the hidden state from the BiRNN.

Since Eq. (8) involves an expectation over high-dimensional action sequences, which is hard to compute directly, we approximate the gradient by running the agent for N episodes on the same video and then taking the average gradient:

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T} R_n \nabla_\theta \log \pi_\theta(a_t \mid h_t),    (9)

where R_n is the reward computed at the n-th episode. Eq. (9) is also known as the episodic REINFORCE algorithm.

Although the gradient in Eq. (9) is a good estimate, it may have high variance, which makes the network hard to converge. A common countermeasure is to subtract a constant baseline b from the reward, so the gradient becomes

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T} (R_n - b) \nabla_\theta \log \pi_\theta(a_t \mid h_t),    (10)

where b is simply computed as the moving average of rewards experienced so far, for computational efficiency.
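A hedged PyTorch-style sketch of one training step for Eqs. (9) and (10) follows: sample N Bernoulli action sequences from the predicted probabilities, compute the reward for each episode, subtract a moving-average baseline, and backpropagate the resulting surrogate loss so that autograd produces the gradient estimate of Eq. (10). The dsn model and combined_reward function are the illustrative helpers sketched earlier; the argument names and the per-episode baseline update are assumptions, not the authors' exact procedure.

```python
import torch

def train_step(dsn, optimizer, feats_np, feats_torch, baseline,
               n_episodes=5, lam=20, momentum=0.9):
    """One REINFORCE update with a moving-average baseline (sketch of Eqs. (9)-(10))."""
    probs = dsn(feats_torch)                        # (1, T) frame-selection probabilities p_t
    dist = torch.distributions.Bernoulli(probs)
    loss = 0.0
    for _ in range(n_episodes):                     # N episodes on the same video
        actions = dist.sample()                     # a_t ~ Bernoulli(p_t), shape (1, T)
        log_prob = dist.log_prob(actions).sum()     # sum_t log pi_theta(a_t | h_t)
        reward = combined_reward(feats_np, actions.squeeze(0).long().numpy(), lam)
        loss = loss - (reward - baseline) * log_prob
        baseline = momentum * baseline + (1.0 - momentum) * reward  # moving average of rewards
    optimizer.zero_grad()
    (loss / n_episodes).backward()                  # autograd yields the estimate in Eq. (10)
    optimizer.step()
    return baseline
```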
t=1 sulted in low rewards is decreased.
where at is the action taken by DSN at time t and ht is the
hidden state from the BiRNN. Extension to Supervised Learning
Since Eq. (8) involves the expectation over high- Given the keyframe indices for a video, Y ∗ = {yi∗ |i =
dimensional action sequences, which is hard to compute di- 1, ..., |Y ∗ |}, we use Maximum Likelihood Estimation
rectly, we approximate the gradient by running the agent for (MLE) to maximize the log-probability of selecting
N episodes on the same video and then taking the average keyframes specified by Y ∗ , log p(t; θ) where t ∈ Y ∗ . p(t; θ)
gradient is computed from Eq. (1). The objective is formalized as
N
1 XX
T X
Oθ J(θ) ≈ Rn Oθ log πθ (at |ht ), (9) LMLE = log p(t; θ). (14)
N n=1 t=1 t∈Y ∗
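The two regularizers and the combined update of Eq. (13) can be assembled as below. The epsilon default matches Eq. (11); the learning rate and the beta coefficients shown in the usage comment are placeholders rather than values taken from the paper.

```python
import torch

def regularizers(probs, dsn, epsilon=0.5):
    """L_percentage (Eq. (11)) and L_weight (Eq. (12)) for one video."""
    l_percentage = (probs.mean() - epsilon) ** 2                # keep roughly epsilon of the frames
    l_weight = sum((p ** 2).sum() for p in dsn.parameters())    # squared l2 penalty on the parameters
    return l_percentage, l_weight

# Illustrative combined objective and optimizer (Eq. (13)); beta1, beta2 and lr are placeholders:
#   optimizer = torch.optim.Adam(dsn.parameters(), lr=1e-4)
#   loss = policy_gradient_loss + beta1 * l_percentage + beta2 * l_weight
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```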
Extension to Supervised Learning

Given the keyframe indices for a video, Y* = {y_i* | i = 1, ..., |Y*|}, we use Maximum Likelihood Estimation (MLE) to maximize the log-probability of selecting the keyframes specified by Y*, log p(t; θ) with t ∈ Y*, where p(t; θ) is computed from Eq. (1). The objective is formalized as

L_{MLE} = \sum_{t \in \mathcal{Y}^*} \log p(t; \theta).    (14)
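In code, the supervised variant simply swaps the policy-gradient loss for a maximum-likelihood term over the annotated keyframes; a minimal sketch (the keyframe tensor and the clamping constant are illustrative) is:

```python
import torch

def mle_loss(probs, keyframes):
    """Eq. (14): negative log-likelihood of selecting the annotated keyframes.

    probs:     (1, T) frame-selection probabilities from DSN (Eq. (1))
    keyframes: 1-D LongTensor of ground-truth keyframe indices Y*
    """
    p = probs.squeeze(0)[keyframes].clamp(min=1e-8)   # p(t; theta) for t in Y*
    return -torch.log(p).sum()                        # minimize the negative of L_MLE
```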
Summary Generation

For a test video, we apply the trained DSN to predict the frame-selection probabilities, which serve as importance scores. We compute shot-level scores by averaging the frame-level scores within the same shot. For temporal segmentation, we use KTS proposed by (Potapov et al. 2014). To generate a summary, we select shots by maximizing the total score while ensuring that the summary length does not exceed a limit, which is usually 15% of the video length. The maximization step is essentially the 0/1 knapsack problem, which is known to be NP-hard. We obtain a near-optimal solution via dynamic programming (Song et al. 2015).

Besides evaluating the generated summaries in the Experiments section, we also qualitatively analyze the raw predictions of DSN so as to exclude the effect of this summary-generation step, which allows us to better understand what DSN has learned.
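The shot-selection step can be written as a standard 0/1 knapsack dynamic program with shot durations as weights, shot-level scores as values, and 15% of the video length as the capacity. The following self-contained sketch illustrates the procedure; it is not the authors' exact implementation.

```python
def knapsack_select(values, durations, capacity):
    """0/1 knapsack DP: pick shots maximizing total score under a length budget.

    values:    shot-level importance scores
    durations: shot lengths in frames (integers)
    capacity:  allowed summary length in frames (e.g., 15% of the video)
    Returns the indices of the selected shots.
    """
    n = len(values)
    dp = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(capacity + 1):
            dp[i][c] = dp[i - 1][c]
            if durations[i - 1] <= c:
                cand = dp[i - 1][c - durations[i - 1]] + values[i - 1]
                if cand > dp[i][c]:
                    dp[i][c] = cand
    # Backtrack to recover the chosen shots.
    picks, c = [], capacity
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            picks.append(i - 1)
            c -= durations[i - 1]
    return sorted(picks)

# Example (capacity = 15% of a 600-frame video): knapsack_select([0.9, 0.2, 0.7], [40, 30, 50], 90) -> [0, 2]
```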
Experiments

Experimental Setup

Datasets. We evaluate our methods on SumMe (Gygli et al. 2014) and TVSum (Song et al. 2015). SumMe consists of 25 user videos covering various topics such as holidays and sports. Each video in SumMe ranges from 1 to 6 minutes and is annotated by 15 to 18 persons, so there are multiple ground-truth summaries for each video. TVSum contains 50 videos, covering topics such as news and documentaries. The duration of each video varies from 2 to 10 minutes. Similar to SumMe, each video in TVSum is annotated by 20 users who provide frame-level importance scores. Following (Song et al. 2015; Zhang et al. 2016b), we convert the importance scores to shot-based summaries for evaluation. In addition to these two datasets, we exploit two other datasets, OVP (Open Video Project: https://open-video.org/), which has 50 videos, and YouTube (De Avila et al. 2011), which has 39 videos excluding cartoon videos, to evaluate our method in the settings where the training data is augmented (Zhang et al. 2016b; Mahasseni, Lam, and Todorovic 2017).

Evaluation metric. For fair comparison with other approaches, we follow the commonly used protocol from (Zhang et al. 2016b) and compute the F-score as the metric to assess the similarity between automatic summaries and ground-truth summaries. We also follow (Zhang et al. 2016b) to deal with multiple ground-truth summaries.
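For reference, the per-pair F-score in this protocol is the harmonic mean of precision and recall computed from the temporal overlap between the machine summary and a user summary, as sketched below on binary frame-selection vectors; how scores over multiple ground-truth summaries are aggregated follows (Zhang et al. 2016b) and is not reproduced here.

```python
import numpy as np

def f_score(pred, gt):
    """Harmonic mean of precision and recall over temporally overlapped frames.

    pred, gt: binary (T,) arrays marking frames included in the machine and user summaries.
    """
    overlap = float(np.logical_and(pred, gt).sum())
    if overlap == 0:
        return 0.0
    precision = overlap / pred.sum()
    recall = overlap / gt.sum()
    return 2 * precision * recall / (precision + recall)
```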
Evaluation settings. We use three settings as suggested in (Zhang et al. 2016b) to evaluate our method. (1) Canonical: we use the standard 5-fold cross-validation (5FCV), i.e., 80% of the videos for training and the rest for testing. (2) Augmented: we still use 5FCV, but we augment the training data in each fold with OVP and YouTube. (3) Transfer: for a target dataset, e.g., SumMe or TVSum, we use the other three datasets as the training data to test the transfer ability of our model.

Implementation details. We downsample videos to 2 fps as done in (Zhang et al. 2016b). We set the temporal distance λ to 20, the ε in Eq. (11) to 0.5, and the number of episodes N to 5. The other hyperparameters α, β1 and β2 in Eq. (13) are optimized via cross-validation. We set the dimension of the hidden state in the RNN cell to 256 throughout this paper. Training is stopped when it reaches a maximum number of epochs (60 in our case). Early stopping is executed when the reward ceases to increase for a period of time (10 epochs in our experiments). We implement our method based on Theano (Al-Rfou et al. 2016); code is available at https://github.com/KaiyangZhou/vsumm-reinforce.

Comparison. To compare with other approaches, we implement Uniform sampling, K-medoids and Dictionary selection (Elhamifar, Sapiro, and Vidal 2012) ourselves. We retrieve the results of other approaches, including Video-MMR (Li and Merialdo 2010), Vsumm (De Avila et al. 2011), Web image (Khosla et al. 2013), Online sparse coding (Zhao and Xing 2014), Co-archetypal (Song et al. 2015), Interestingness (Gygli et al. 2014), Submodularity (Gygli, Grabner, and Van Gool 2015), Summary transfer (Zhang et al. 2016a), Bi-LSTM and DPP-LSTM (Zhang et al. 2016b), and GANdpp and GANsup (Mahasseni, Lam, and Todorovic 2017), from the published papers. Due to the space limit, we do not include these citations in the tables.

Quantitative Evaluation

We first compare our method with several baselines that differ in learning objectives. Then, we compare our methods with current state-of-the-art unsupervised/supervised approaches in the three evaluation settings.

Comparison with baselines. We set the baseline models as the ones trained with R_div only and R_rep only, denoted by D-DSN and R-DSN, respectively. We denote the model trained with the two rewards jointly as DR-DSN, and the model extended to the supervised version as DR-DSNsup. We also validate the effectiveness of the proposed technique (which we call the λ-technique from now on) that ignores the distant similarity when computing R_div. We denote the D-DSN trained without the λ-technique as D-DSNw/o λ. To verify that DSN can benefit more from reinforcement learning than from supervised learning, we add another baseline: the DSN trained with the cross-entropy loss using keyframe annotations, where a confidence penalty (Pereyra et al. 2017) is imposed on the output distributions as a regularization term. This model is denoted by DSNsup.

Table 1: Results (%) of different variants of our method on SumMe and TVSum.

Method        SumMe   TVSum
DSNsup         38.2    54.5
D-DSNw/o λ     39.3    55.7
D-DSN          40.5    56.2
R-DSN          40.7    56.9
DR-DSN         41.4    57.6
DR-DSNsup      42.1    58.1
Figure 2: Video summaries generated by different variants of our approach for video 18 in TVSum (indexed as in (Song et al. 2015)). Panel (a) shows example frames from the video. The light-gray bars in (b) to (e) correspond to ground-truth importance scores, while the colored areas correspond to the parts selected by the different models.
Table 1 reports the results of the different variants of our method on SumMe and TVSum. We can see that DR-DSN clearly outperforms D-DSN and R-DSN on both datasets, which demonstrates that by using R_div and R_rep collaboratively, we can better teach DSN to produce high-quality summaries that are diverse and representative. Comparing the unsupervised model with the supervised one, we see that DR-DSN significantly outperforms DSNsup on the two datasets (41.4 vs. 38.2 on SumMe and 57.6 vs. 54.5 on TVSum), which justifies our assumption that DSN can benefit more from reinforcement learning than from supervised learning.

By adding the supervision signals of L_MLE (Eq. (14)) to DR-DSN, the summarization performance is further improved (1.7% improvement on SumMe and 0.9% improvement on TVSum). This is because the labels encode a high-level understanding of the video content, which is exploited by DR-DSNsup to learn more useful patterns.

The performances of R-DSN are slightly better than those of D-DSN on the two datasets, which is because diverse summaries usually contain redundant information that is irrelevant to the video subject. We observe that the performances of D-DSN are better than those of D-DSNw/o λ, which does not apply the λ-technique to temporally distant frames. When using the λ-technique in training, around 50% ∼ 70% of the distance matrix was set to 1 (varying across different videos) at the early stage. As training progressed, the percentage went up, eventually staying around 80% ∼ 90%. This makes sense because selecting temporally distant frames can lead to higher rewards, and DSN is encouraged to do so by the diversity reward function.

Comparison with unsupervised approaches. Table 2 shows the results of DR-DSN against other unsupervised approaches on SumMe and TVSum. It can be seen that DR-DSN outperforms the other unsupervised approaches on both datasets by large margins. On SumMe, DR-DSN is 5.9% better than the current state of the art, GANdpp. On TVSum, DR-DSN substantially beats GANdpp by 11.4%.

Although our reward functions are analogous to the objectives of GANdpp in concept, ours directly model the diversity and representativeness of the selected frames in the feature space, which is more useful for guiding DSN to find good solutions. In addition, the training performances of DR-DSN are 40.2% on SumMe and 57.2% on TVSum, which suggests that the model did not overfit to the training data (note that we do not explicitly optimize the F-score metric in the training objective function).

Table 2: Results (%) of unsupervised approaches on SumMe and TVSum. Our DR-DSN performs the best, especially on TVSum where it exhibits a huge advantage over the others.

Method                 SumMe   TVSum
Video-MMR               26.6     -
Uniform sampling        29.3    15.5
K-medoids               33.4    28.8
Vsumm                   33.7     -
Web image                -      36.0
Dictionary selection    37.8    42.0
Online sparse coding     -      46.0
Co-archetypal            -      50.0
GANdpp                  39.1    51.7
DR-DSN                  41.4    57.6

Comparison with supervised approaches. Table 3 reports the results of our supervised model, DR-DSNsup, and other supervised approaches. Among LSTM-based methods, our DR-DSNsup beats the others, i.e., Bi-LSTM, DPP-LSTM and GANsup, by 1.0% ∼ 12.0% on SumMe and 3.2% ∼ 7.2% on TVSum, respectively. It is also interesting to see that the summarization performance of our unsupervised method, DR-DSN, is even superior to the state-of-the-art supervised approach on TVSum (57.6 vs. 56.3), and is better than most of the supervised approaches on SumMe. These results strongly prove the efficacy of our learning framework.
Figure 3: Ground truth (top) and importance scores predicted by DR-DSN (middle) and DSNsup (bottom). Besides the F-score for each prediction, we also compute the cross-correlation (XCorr) between each prediction and the ground truth to give a quantitative measure of similarity between two 1D series; the higher the XCorr, the more similar the two arrays are. For the two example videos: (1) DR-DSN F-score = 64.3, XCorr = 83.16; DSNsup F-score = 58.7, XCorr = 78.06; (2) DR-DSN F-score = 41.9, XCorr = 91.84; DSNsup F-score = 41.7, XCorr = 90.13.
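The exact normalization behind XCorr is not specified in this excerpt, so the sketch below shows only one plausible formulation: the zero-lag normalized cross-correlation between the predicted and ground-truth score sequences, scaled to a percentage.

```python
import numpy as np

def xcorr_percent(pred, gt):
    """Zero-lag normalized cross-correlation between two 1D score arrays, in percent.

    This is an assumed formulation; the paper's exact XCorr normalization is not given here.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    num = float(np.dot(pred, gt))
    den = np.linalg.norm(pred) * np.linalg.norm(gt) + 1e-8
    return 100.0 * num / den
```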
Table 3: Results (%) of supervised approaches on SumMe and TVSum. Our DR-DSNsup performs the best.

Method              SumMe   TVSum
Interestingness      39.4     -
Submodularity        39.7     -
Summary transfer     40.9     -
Bi-LSTM              37.6    54.2
DPP-LSTM             38.6    54.7
GANsup               41.7    56.3
DR-DSNsup            42.1    58.1

Table 4: Results (%) of the LSTM-based approaches on SumMe and TVSum in the Canonical (C), Augmented (A) and Transfer (T) settings, respectively.

                 SumMe               TVSum
Method          C     A     T       C     A     T
Bi-LSTM        37.6  41.6  40.7    54.2  57.9  56.9
DPP-LSTM       38.6  42.9  41.8    54.7  59.6  58.7
GANdpp         39.1  43.4   -      51.7  59.5   -
GANsup         41.7  43.6   -      56.3  61.2   -
DR-DSN         41.4  42.8  42.4    57.6  58.4  57.8
DR-DSNsup      42.1  43.9  42.6    58.1  59.8  58.9

Table 5: Results (%) of using different gated recurrent units.