Motion Saliency Using CNN
Abstract—Data-driven saliency detection has attracted strong interest as a result of applying convolutional neural networks to the detection of eye fixations. Although a number of image-based salient object and fixation detection models have been proposed, video fixation detection still requires more exploration. Different from image analysis, motion and temporal information is a crucial factor affecting human attention when viewing video sequences. Although existing models based on local contrast and low-level features have been extensively researched, they failed to simultaneously consider interframe motion and temporal information across neighboring video frames, leading to unsatisfactory performance when handling complex scenes. To this end, we propose a novel and efficient video eye fixation detection model to improve the saliency detection performance. By simulating the memory mechanism and visual attention mechanism of human beings when watching a video, we propose a step-gained fully convolutional network by combining the memory information on the time axis with the motion information on the space axis while storing the saliency information of the current frame. The model is obtained through hierarchical training, which ensures the accuracy of the detection. Extensive experiments in comparison with 11 state-of-the-art methods are carried out, and the results show that our proposed model outperforms all 11 methods across a number of publicly available datasets.

Index Terms—Eye fixation detection, fully convolutional neural networks, video saliency.

Manuscript received January 2, 2018; revised March 9, 2018 and April 9, 2018; accepted April 19, 2018. This work was supported by the National Natural Science Foundation of China under Grant 61572351, Grant 61772360, Grant 61732011, and Grant 61620106008. This paper was recommended by Associate Editor H. Lu. (Corresponding authors: Zheng Wang; Jianmin Jiang.) M. Sun, Z. Zhou, and Q. Hu are with the School of Computer Science, Tianjin University, Tianjin 300350, China (e-mail: [email protected]; [email protected]; [email protected]). Z. Wang is with the School of Software Engineering, Tianjin University, Tianjin 300350, China (e-mail: [email protected]). J. Jiang is with the Research Institute for Future Media Computing, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China (e-mail: [email protected]). Digital Object Identifier 10.1109/TCYB.2018.2832053

I. INTRODUCTION

WHEN VIEWING visual images or videos, the human visual attention mechanism helps humans selectively choose salient areas or points upon which to fixate their gaze. When observing a static image, features such as color, contour, and luminance may be dominant factors influencing the point of focus. When watching a video, however, the motion information mainly affects the human gaze. If the viewing is task driven, the human brain memory mechanism will be activated, and people will focus on a target object with high-level semantic information. In computer science, saliency detection has been widely researched in recent years to further understand and simulate the human attention mechanism. The overall efforts in this field can be divided into two main categories. The first category is salient object detection, which aims at accurately extracting objects that grab a person's attention. The second category is called eye fixation detection, which focuses on selecting a number of locations and points that may attract attention.

Images have always been the focus of computer vision research, including hyperspectral images [1], [2], and ordinary images of three channels. Saliency detection, as a preprocessing step, is an important branch in the study of images, which has received more attention over recent years and is widely used in many visual applications, including image retrieval [3], object segmentation [4], scene classification [5], object detection [6], and target tracking [7]–[11]. As for video saliency detection, to detect the significance of each frame more accurately, intraframe saliency detection needs to be carried out along with a simultaneous consideration of interframe motion and temporal information.

This paper essentially focuses on eye fixation prediction inside a video sequence. Differing from image analysis, the analysis of video sequences presents more challenges due to the fact that the motion and temporal information affects the attention of the viewer. In addition, movie videos with complex scenes and moving objects make eye fixation detection even more difficult. Although some models based on local contrast and low-level feature information have been reported in the literature, such models often lack consideration of the interframe motion and temporal information, leading to an unsatisfactory performance when handling complex scenes, such as those with fast moving objects or a moving lens. In contrast with existing methods, we consider the motion and memory information simultaneously with the spatial information. As shown in Fig. 1, our proposed model primarily uses the designed step-gained fully convolutional network with expanded information [model SGF(E)] for video fixation detection. Our model takes the saliency prediction of the previous frame, the moving object boundary map between two adjacent frames, and the current frame as the input, and computes the spatiotemporal saliency probability to produce a saliency detection output, without requiring any preprocessing. Some sample predictions are given in Fig. 2.
Fig. 1. Flow chart of our proposed model, in which we use the proposed model SGF for capturing the spatial and temporal information simultaneously.
SGF(3) is used to handle the first frame because neither motion nor temporal information is available. From the next frame onward, the SGF(E) model
takes EF(1) from SGF(3), a fast moving object edge map B(2) from the OPB algorithm, and the current frame (2) as the input, and directly outputs the
spatiotemporal prediction EF(2). Section IV provides further details regarding the proposed model.
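The per-frame flow shown in Fig. 1 can be summarized by the following hedged Python sketch; `sgf3`, `sgf_e`, and `opb_boundary` are placeholder callables standing in for the SGF(3) model, the SGF(E) model, and the OPB boundary-map step detailed later, not the authors' released code.

```python
# Hedged sketch of the per-frame inference flow in Fig. 1 (not the authors' code).
# sgf3, sgf_e and opb_boundary are placeholder callables for the SGF(3) model,
# the SGF(E) model and the OPB boundary-map step described in Section IV.
def predict_video(frames, sgf3, sgf_e, opb_boundary):
    """frames: list of H x W x 3 arrays; returns one fixation map per frame."""
    fixation_maps = []
    prev_map = sgf3(frames[0])                       # first frame: spatial model only
    fixation_maps.append(prev_map)
    for prev_frame, frame in zip(frames[:-1], frames[1:]):
        boundary = opb_boundary(prev_frame, frame)   # moving-object edge map B(i)
        prev_map = sgf_e(frame, prev_map, boundary)  # spatiotemporal prediction EF(i)
        fixation_maps.append(prev_map)
    return fixation_maps
```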
visual saliency model, which first forms activation maps on certain feature channels, and then normalizes them by highlighting the conspicuity. Seo and Milanfar [16] employed a matrix cosine similarity to compute the local regression kernels and measure the likeness of a pixel to its surroundings. Guo et al. [17] believed that the phase spectrum of the Fourier transform obtains the location of salient areas, and Wang et al. [18] considered both the motion and appearance information through a quaternion feature representation. Other methods such as sparse coding [19], [20] have also been applied in certain models. However, all of the above methods use handcrafted features, which require domain-specific knowledge. Some deep-learning-based methods have recently been proposed for more accurate and simpler feature extraction, including autoencoders [43], convolutional neural networks [35]–[41], and long short-term memory [44]. For example, Han et al. [43] developed a stacked denoising autoencoder to represent features from raw images. Wang et al. [35] used predetection results to enhance the estimation through a recurrent fully convolutional network.

After feature extraction, a significant step is to integrate the contrast for obtaining the final prediction. Lin et al. [14] computed the final saliency by using the Lm-norm and combined the super features using the winner-take-all mechanism. Wang et al. [18] used an inverse quaternion Fourier transform to reconstruct the final abnormal saliency map, and Tavakoli et al. [31] proposed a framework based on interimage similarities and an ensemble of extreme learning machines. Li et al. [41] set up a multitask learning scheme for exploring the intrinsic correlations between saliency detection and semantic image segmentation, and used a graph Laplacian regularized nonlinear regression model for saliency refinement. Tavakoli and Laaksonen [42] employed independent subspace analysis to obtain a hierarchical representation and exploited local and global saliency concepts to achieve salient detection. Methods such as [25]–[28] utilize the fact that multiple images with a common foreground can be detected simultaneously; feature extraction and integration are performed within the range of multiple images, which leads to the problem of co-saliency detection. In addition to the above methods, using end-to-end models [35]–[44] directly to produce the prediction has remained an interesting research direction in recent years.

B. Video Saliency Detection

According to the input, saliency models can be further categorized into static and dynamic saliency models. A dynamic saliency model takes video sequences or continual frames as an input to obtain a patch of saliency detection results. This task is also known as video saliency detection, which has recently attracted significant interest. The reason behind this growth in popularity is the importance of video saliency detection as a preprocessing step for many different tasks, including video compression and summarization.

The existing video saliency models can be further divided into salient object detection models [48]–[50] and eye fixation prediction models [15], [32]–[34], [51]. The model proposed in [48] uses intraframe boundary information and interframe motion information, and also combines a gradient flow field with energy optimization to achieve the spatiotemporal consistency of the output saliency maps. Wang et al. [49] introduced an unsupervised and geodesic distance-based salient video object segmentation method. They consider the spatial edges and temporal motion boundaries as indicators of foreground object locations in order to attain both spatially and temporally coherent object segmentation. Zhou et al. [50] presented a temporal filter to enhance the rendering of salient motion. Zhang and Sclaroff [51] proposed a Boolean map-based saliency model, computing the frame feature as a set of binary images and obtaining saliency maps by analyzing the topological structure of the Boolean maps. Zhong et al. [32] believed that the flow information is quite important in video detection tasks, and thus proposed a fast flow model, constructing spatial and temporal saliency maps and fusing them together to create the final attention map. Li and Li [33] mainly focused on compressed-domain video eye fixation detection, and presented an algorithm based on the residual DCT coefficient norm and operational block description length; the fixation prediction is obtained through a Gaussian model whose center is determined based on the feature values. Harel et al. [15] took the motion and flicker information into account and computed the final eye fixation by adding extra channels, and Han et al. [34] built two models: a spatial attention model for predicting locations in a frame, and a temporal attention model that measures the most important frame in a video sequence.

Unlike the models mentioned above, our proposed model does not separate the spatial information detection from the temporal information detection and then fuse them. Instead, we calculate the spatial–temporal information simultaneously through a step-gained FCN (SGF) model, which is inspired by the correlation of neighboring frames inside a video sequence. In addition, we take the detection result of the previous frame as auxiliary information, and then compute the boundaries of the moving object by exploiting the flow gradient of adjacent frames. In this way, significant advantages can be achieved in that the information of the current frame, the saliency map of the previous frame, and the moving object boundary map are considered simultaneously to ensure consistency in both time and space.

III. GROUND TRUTH COMPUTATION

In this section, we introduce how the corresponding ground truth is obtained for a video frame being observed, which is crucial for our proposed model.

A. Dataset Introduction

For model training and performance evaluations, the ground truth fixation maps for raw videos are required. Owing to the fact that a high level of correlation exists between the human visual attention area and eye movements, we can record the eye movement data for multiple subjects using an eye tracker, and hence calculate the ground truth according to certain existing algorithms. Some representative algorithms have been reported in [34]. To implement a better training process, we compute the ground truth using two datasets: 1) the Hollywood2 dataset [45] and 2) the UCF-sports dataset [52].
For effective training, we divide these two datasets into a training set and a test set. As to the Hollywood2 dataset, the training set contains 62 000 samples and the test set contains 4500 samples. The UCF dataset consists of 21 600 training samples and 2300 testing samples.

B. Ground Truth Computation

Similar to the work reported in [34], we propose to calculate the ground truth using the following method. For a given video, assume there are S subjects, each of which has a total of I eye fixation tracking records per frame, where the total number of videos is V. Specifically, the ground truth value of the Hollywood2 and UCF-sports datasets can be calculated through the following three steps:

$$\left(x_{S_i}^{j},\; y_{S_i}^{j},\; k\right) = \left(x_{S_i}^{j}\cdot\frac{VR_x(j)}{SR_x},\;\left(y_{S_i}^{j} - \frac{SR_y - \frac{VR_y(j)\cdot SR_x}{VR_x(j)}}{2}\right)\cdot\frac{VR_x(j)}{SR_x},\;\frac{currT}{10^{6}}\cdot fps(j)\right). \tag{1}$$

Through (1), we obtain the true fixations from the S subjects per frame, where $S_i$ represents the ith subject. In addition, $x_{S_i}^{j}$ and $y_{S_i}^{j}$ represent the eye location coordinates for the ith subject of the jth video, respectively, and k represents the frame number of the jth video addressed by the current coordinates. Moreover, $VR_x(j)$ and $VR_y(j)$, respectively, represent the true resolution of the jth video, and $SR_x$ and $SR_y$ denote the horizontal and vertical resolution of the display. currT indicates the time stamp of the gaze sample (in microseconds), and fps(j) is the frame rate of the current video sequence

$$\mathrm{mygauss} = \frac{\alpha\cdot\exp\!\left(-\,\beta\cdot\dfrac{(R-r)^{2} + (C-c)^{2}}{w^{2}}\right)}{\pi\cdot w} \tag{2}$$

$$G_k^{j} = \sum_{i=1}^{S}\sum_{t=1}^{I}\mathrm{mygauss}\!\left[\,r - y_{S_i}^{j} : 2r - y_{S_i}^{j} - 1,\; r - x_{S_i}^{j} : 2r - x_{S_i}^{j} - 1\,\right]. \tag{3}$$

Here, (2) indicates the proposed Gaussian model, where the value of w is empirically set to 35, indicating that a gaze point is mapped to the surrounding 35 pixels on the graph. The values of α and β are empirically set, r and c denote the horizontal and vertical resolution of the jth video, respectively, and R and C are the matrices generated from r and c, the dimensions of which are (2r + 1, 2c + 1). The final fixation map $G_k^{j}$ for the kth frame inside the jth video sequence can be calculated through (3).

IV. MODEL STRUCTURE

To extract both temporal and spatial information from a video sequence simultaneously, and to take the human memory mechanism into account to reflect the fact that a past frame in the video sequence will form a relatively significant object in the human brain, we propose an SGF for spatial–temporal eye fixation detection.

A. Spatial Branch: SGF(i)

The spatial branch takes a single frame as the input and produces a fixation map of the same size. We employ a fully convolutional neural network to model this process. First, we employ the first five convolutional blocks of VGGNet [53]; by adding deconvolutional layers, the model ensures end-to-end detection. Second, we design three different network structures and train them individually, so that the next model is designed based on the previous one. To reflect such a feature, we call the model SGF. As the model contains three different network structures, we can obtain features of different types and scales from the previous layers, which are useful for further training. In addition, as the deconvolutional layers have different kernel sizes, the model considers not only the global information but also the local information, producing a more accurate eye fixation map.

As shown in Fig. 4, the bottom of the network is a stack of convolutional layers. To learn more global information, we build several deconvolutional layers on top of the 16th convolutional layer. The three models have different upsampling factors. For SGF(1), the first five convolution blocks are initialized with the weights of VGGNet, which is originally trained over 1.3 million images of the ImageNet dataset [54], and the kernel sizes of the deconvolutional layers are set to ×19 and ×10 with a stride of 4, respectively. The SGF(2) model is designed based on the SGF(1) model, where the parameters are initialized from SGF(1). To achieve a smoother detection, we add a deconvolutional layer and modify the kernel sizes to ×15 with a stride of 5, ×13 with a stride of 3, and ×22 with a stride of 2. Similarly, the SGF(3) model is based on SGF(2), where the first deconvolutional layer has a ×5 upsampling factor, and the second, third, and last deconvolutional layers have ×9, ×10, and ×22 upsampling factors, respectively.

In the convolutional blocks, each convolutional layer takes an h1 × w1 × c1 input and outputs a feature map with a size of h2 × w2 × c2, where hi, wi, ci (i = 1, 2) denote the height, width, and channel number, respectively. The first convolutional layer takes h × w × 3 raw frames as input, and produces a feature map after a linear transformation with a bias term. Assuming that each convolutional layer has a kernel with its weights set to W and the offset term set to b, the feature map is calculated as follows:

$$x_j^{l} = f\!\left(\sum_{i\in M_j} x_i^{l-1} * w_{ij}^{l} + b_j^{l}\right) \tag{4}$$

where $M_j$ denotes the set of feature maps selected from the previous layer (l − 1), $x_i^{l-1}$ represents the ith feature map of the (l − 1)th layer, and f is a nonlinear activation function. We choose ReLU as the activation function and embed max pooling in the convolutional layers. After a convolution operation, the feature maps are sparse and down-sampled. For up-sampling, the deconvolutional layers are applied to the top of the model

$$y = U_s\!\left(f_s(\zeta, \theta_{conv}),\; \theta_{deconv}\right) \tag{5}$$

where ζ indicates the input frame data, $f_s(\cdot)$ is the convolutional operation with parameter $\theta_{conv}$, $U_s(\cdot)$ denotes the deconvolutional operation with parameter $\theta_{deconv}$, and the kernel parameter s is set for up-sampling. More details on their specific implementations are described in Section V.
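To make the convolution–deconvolution pattern of (4) and (5) concrete, the following is a minimal PyTorch-style sketch; it is not the authors' Caffe model, and the channel counts, kernel sizes, and strides are illustrative placeholders rather than the ×19/×10/×15/×13/×22 settings listed above.

```python
import torch
import torch.nn as nn

# Minimal sketch of the conv/deconv pattern in (4)-(5); layer sizes are illustrative.
class TinySGFBranch(nn.Module):
    def __init__(self):
        super().__init__()
        # convolutional part f_s(. ; theta_conv): feature extraction + downsampling
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        # deconvolutional part U_s(. ; theta_deconv): learned upsampling to input size
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, frame):                 # frame: N x 3 x H x W
        return torch.sigmoid(self.decoder(self.encoder(frame)))

# usage sketch: TinySGFBranch()(torch.randn(1, 3, 500, 500)) -> 1 x 1 x 500 x 500 map
```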
Fig. 3. Structure of the model SGF(E). As shown in the flowchart, the input data is a tensor of h × w × 4. At the top of the model, we add an Eltwise layer with function MAX[big map(i), boundary map(i)] before the Sigmoid function.
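As a concrete reading of this wiring, the sketch below shows the four-channel input and the MAX fusion before the Sigmoid; `backbone` is a hypothetical placeholder for the SGF(E) network body, and the sketch is PyTorch-style rather than the authors' Caffe model.

```python
import torch

# Hedged sketch of the SGF(E) input/output wiring in Fig. 3 (not the released code).
# `backbone` is a placeholder network mapping an N x 4 x H x W tensor to an
# N x 1 x H x W logit map ("big map"); boundary_map is the OPB output B(i).
def sgf_e_forward(backbone, frame, prev_saliency, boundary_map):
    x = torch.cat([frame, prev_saliency.unsqueeze(1)], dim=1)   # N x 4 x H x W input
    big_map = backbone(x)                                       # deep prediction
    fused = torch.max(big_map, boundary_map.unsqueeze(1))       # Eltwise MAX fusion
    return torch.sigmoid(fused)                                 # final fixation map
```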
B. Temporal Branch: SGF(E)

The temporal branch has a similar structure to the SGF(2) model shown in Fig. 4; it takes the current frame (i), the saliency map (i − 1) of the previous frame, and the moving object boundary map (i) as input, and outputs the final fixation map of frame (i). This design is motivated by the fact that, when viewing a video sequence, not only the moving object but also the fixations from the previous frame will influence the eye location in the current frame. The saliency information of the previous frame added to the input, together with the boundary information of the moving object combined before the final detection, enables the model to achieve a more comprehensive and accurate eye fixation identification; such improvement is achieved not only by considering the memory information regarding the previous salient object, but also by exploiting the most significant movement information of the object.

Specifically, as the first convolutional layer has an input with four dimensions, we need to concatenate the saliency map (i − 1) and the current frame (i) to form the input data. Hence, the first convolutional operation is modified as

$$f\!\left(F_i,\; pre_{i-1};\; W_{F_i},\; W_{p_{i-1}}\right) = W_{F_i} * F_i + W_{p_{i-1}} * pre_{i-1} + b \tag{6}$$

where $W_{F_i}$ denotes the weight matrix corresponding to the input frame data $F_i$, and $W_{p_{i-1}}$ denotes the weight matrix corresponding to the saliency map $pre_{i-1}$ of the previous frame. The remaining layers are set exactly like those of the SGF(2) model except for the last layer. Further, we use an Eltwise layer to combine the motion boundary map B(i) with the deep map before the Sigmoid function is applied. The structure of our proposed spatiotemporal network, called SGF(E), is illustrated in Fig. 3.

Algorithm 1 OPB for Moving Object Contour Detection
Input: frame Fi, frame Fi−1
Output: boundary map Bi
1: Obtain the color gradient map CGi:
  1.1: generate super-pixels SpFi inside the frame Fi through SLIC [55];
  1.2: compute the super-pixel segmentation map Si and the color gradient magnitude CGi using Eq. (7).
2: Obtain the optical flow gradient map Mi:
  2.1: generate the optical flow gradient magnitude MFi from the optical flow map OGFi through LDME [56];
  2.2: set a threshold θ to obtain a motion area with a higher magnitude than θ.
3: Combine CGi and Mi to obtain the boundary map Bi using Eq. (10).

C. OPB: Motion Information From Interframes

Our observation reveals that moving objects are more eye-catching, even when the object does not present any significant difference in comparison with the surrounding background. In other words, motion is the most crucial cue for video eye fixation detection, which makes it important to mine deeper interframe information. Following the spirit of the work reported in [48], we propose an OPB algorithm for deep interframe information mining. As shown in Figs. 1 and 3, OPB is used to extract the contour information of the most significant moving objects through three steps, the details of which are given below.

Step 1: Extract the super-pixel information of the current frame to preserve the original structural elements of the video content, while effectively simplifying and ignoring some useless details. The superpixels $S_{F_i} = \{S_{F_i}^{1}, S_{F_i}^{2}, \ldots, S_{F_i}^{P}\}$ are distinguished by strong edges that characterize the most important content of the frame, where P represents the number of super-pixels. Fig. 5(b) illustrates the super-pixel segmentation map $S_i$; the color gradient magnitude at position z = (x, y) can be calculated as follows:
Fig. 4. Three different structures. In the SGF(1) model, convolutional blocks are initialized from VGGNet. For SGF(2) and SGF(3), convolutional blocks
for the next model are initialized from the previous model. The deconvolution layers of the three models use different sizes of upsampling factors, taking into
account the local and global information.
Fig. 5. Some detection results obtained through the OPB model. (a) Three different frames from different video sequences. (b) Super-pixel segmentation
map S using SLIC. (c) Color gradient magnitude CG from (b). (d) Visualized optical flow map O of the frames in (a). (e) Optical flow gradient magnitude
M from (d). (f) Significant moving object boundary map B achieved by fusing (c) and (e).
$$CG_i(x, y) = \nabla S_i(x, y). \tag{7}$$

Step 2: Obtain an optical flow gradient by extracting the optical flow between the current and previous frames, and then choose those areas with a larger gradient by setting a threshold. Fig. 5(d) illustrates some examples of such a derived optical flow; the optical flow gradient at position z can be obtained through (8), and its magnitude $M_i(z)$ through (9)

$$\left(OG_{F_i}^{x}(z),\; OG_{F_i}^{y}(z)\right) = \left(\nabla O_{F_i}^{z}(x),\; \nabla O_{F_i}^{z}(y)\right) \tag{8}$$

$$M_i(z) = \begin{cases} \sqrt{OG_{F_i}^{x}(z)^{2} + OG_{F_i}^{y}(z)^{2}} & \text{if } M_i(z) > \theta \\ 0 & \text{if } M_i(z) \le \theta \end{cases} \tag{9}$$

where $OG_{F_i}^{x}$ and $OG_{F_i}^{y}$ represent the results of the optical flow gradient along the x- and y-axes, respectively.

Step 3: Combine $CG_i(z)$ with $M_i(z)$ to compute the final boundary map $B_i(z)$ at position z through

$$B_i(z) = \begin{cases} CG_i(z) * \left(1 - e^{-\alpha M_i(z)}\right) & \text{if } i = 1 \\ \mu B_{i-1}(z) + \lambda Pr_i & \text{if } i \ge 1 \;\&\; B_{i-1}(z) > \sigma \end{cases}$$
$$Pr_i = CG_i(z) * \left(1 - e^{-\alpha M_i(z)}\right) * \min\!\left(\nabla B_{i-1}(z)\right) \tag{10}$$

where α is a weighting factor used to decide how much boundary information of the optical flow gradient magnitude $M_i(z)$ needs to be preserved; in our implementation, we empirically set it to 0.75. Here, μ and λ are two scaling factors used to coordinate the calculation results. The larger μ is, the greater the influence of the previous frame; in contrast, the larger λ is, the smaller the effect of the previous frame. Here, σ is a threshold parameter, which is also empirically set to 0.3. For the convenience of our presentation and description, a pseudo-code summary of our proposed algorithm is given in Algorithm 1, and some detection results are illustrated in Fig. 5.
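To make the three OPB steps concrete, the following is a rough NumPy/OpenCV sketch under stated substitutions: SLIC from scikit-image stands in for [55], Farneback optical flow replaces the LDME flow of [56], and the Pr_i refinement of (10) is simplified by dropping the min(∇B_{i−1}) factor; the constants mirror the empirically set values quoted above.

```python
import numpy as np
import cv2
from skimage.segmentation import slic

# Rough sketch of OPB steps (7)-(10) for one frame pair; not the authors' code.
def opb_boundary(prev_frame, frame, prev_boundary=None,
                 theta=1.0, alpha=0.75, mu=0.5, lam=0.5, sigma=0.3):
    # Step 1: color gradient CG_i of the super-pixel segmentation map, as in (7)
    seg = slic(frame, n_segments=300).astype(np.float32)
    gy, gx = np.gradient(seg)
    cg = np.hypot(gx, gy)

    # Step 2: optical-flow gradient magnitude M_i thresholded by theta, as in (8)-(9)
    g0 = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g0, g1, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    fy_u, fx_u = np.gradient(flow[..., 0])
    fy_v, fx_v = np.gradient(flow[..., 1])
    m = np.hypot(np.hypot(fx_u, fy_u), np.hypot(fx_v, fy_v))
    m[m <= theta] = 0.0

    # Step 3: fuse color and motion gradients into the boundary map B_i, as in (10)
    b = cg * (1.0 - np.exp(-alpha * m))
    if prev_boundary is not None:
        keep = prev_boundary > sigma                # carry over strong previous edges
        b[keep] = mu * prev_boundary[keep] + lam * b[keep]
    return b
```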
V. TRAINING METHOD

In this section, we elaborate on the proposed two-stage learning to predict the human eye fixations in a video sequence. We first pretrain our model proposed in Section IV-A using image saliency detection datasets, which enables the model to learn the features of salient objects and grab the salient regions inside a single image. We then fine-tune the model on the two video eye fixation datasets mentioned in Section III and another image eye fixation dataset reported in [57]. In this way, we ensure that the model can precisely predict the eye fixation locations. The logic behind this is that most of the benchmarks used for image saliency detection are open-sourced for public access, and the resolution of the images is relatively high with a richer variety; pretraining on these benchmarks helps to improve the model's generalization capability.

A. Implementation Details

The SGF model is implemented based on the Caffe [58] toolbox. We initialize the first 13 convolutional layers of SGF with those of the pretrained VGG 16-layer net, and transfer the learned representations by fine-tuning to both saliency detection and eye fixation detection. We then construct the deconvolutional layers for upsampling, the parameters of which are initialized from a Gaussian distribution and iteratively updated during the training. For training purposes, all images and the ground-truth maps are resized to 500 × 500 pixels, and the SGD learning procedure is accelerated using an NVIDIA GeForce GTX 1080ti GPU. In stage one, the momentum parameter is set to 0.99, the learning rate is set to $10^{-10}$, and the weight decay is 0.0005. In stage two, the momentum parameter is set to 0.999, the learning rate is set to $10^{-11}$, and the weight decay is 0.00005. The loss functions for the two stages are designed as

$$L_1(P, G) = \frac{1}{2}\sum_{i=1}^{w}\sum_{j=1}^{h}\left\|G_{i,j} - P_{i,j}\right\|_2^2 \tag{11}$$

$$L_2(P, G) = \frac{1}{2}\sum_{i=1}^{w}\sum_{j=1}^{h}\left\|G_{i,j} - P_{i,j}\right\|_2^2 + \eta\sum_{i=1}^{w\times h}\left[G_{i,j}\log P_{i,j} + \left(1 - G_{i,j}\right)\log\left(1 - P_{i,j}\right)\right] \tag{12}$$

where η is a weighting factor to show the importance of the corresponding loss item, which is set empirically along with the others mentioned in earlier sections.

B. Stage One

In this stage, we use six benchmarks related to image saliency detection. Table I shows the basic information of all six datasets. To use a cross-validation method for training, we divide all images into ten groups; nine groups are randomly selected each time as the training set, with the remaining group applied as the test set. This design is applied to the training of SGF(1), SGF(2), and SGF(3), the details of which are summarized in Algorithm 2, where $\omega_{SGF(i)}^{c}$ and $\omega_{SGF(i)}^{d}$ represent the convolutional parameters and the deconvolutional parameters of the SGF(i) model, respectively.

Algorithm 2 Training Method for Stage One
Input: image pair (I, G) for saliency detection
Output: pixel-wise binary map P
1. for i = 1 : 3
2.   If i = 1: initialize the parameters $\omega_{SGF(i)}^{c}$ of the shared fully convolutional part using the pre-trained VGGNet;
3.   Else: initialize the parameters $\omega_{SGF(i)}^{c}$ from $\omega_{SGF(i-1)}^{c}$;
4.   Initialize the parameters $\omega_{SGF(i)}^{d}$ of the deconvolutional part randomly from the Gaussian distribution;
5.   Based on $\omega_{SGF(i)}^{c}$ and $\omega_{SGF(i)}^{d}$, utilize SGD and BP to train SGF(i) by minimizing the training loss using Eq. (11).
6. end for

C. Stage Two

In this phase, we use the Hollywood2 and UCF datasets to train the SGF model based on the operational process completed in the first stage. The two datasets have more than 80 000 frames altogether from different video sequences and the corresponding ground truth maps. Similar to stage one, we train the three models individually and use the loss function in (12) to fine-tune the parameters. The detailed training procedure is summarized in Algorithm 3, in which the stage-two convolutional and deconvolutional parameters of the SGF(i) model are treated in the same manner as in Algorithm 2.

Algorithm 3 Training Method for Stage Two
Input: frame pair (F, G) for eye fixation detection
Output: pixel-wise probability map P
1. for i = 1 : 2
2.   If i = 1: initialize the convolutional parameters of SGF(i) from the stage-one parameters $\omega_{SGF(i)}^{c}$;
3.   Else: initialize the convolutional parameters of SGF(i) from those of SGF(i − 1);
4.   Initialize the deconvolutional parameters of SGF(i) randomly from the Gaussian distribution;
5.   Utilize SGD and BP to train SGF(i) by minimizing the training loss using Eq. (12).
6. end for

VI. EXPERIMENTAL RESULTS

In this section, we report the experimental results of the proposed approach for video eye fixation detection. First, we describe the five datasets and evaluation metrics used in this paper. Second, we provide the experimental results to demonstrate the advantages of our approach. We compared our method with 11 existing state-of-the-art methods, and both qualitative and quantitative analyses of the experimental results are presented.
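As a compact restatement of the hierarchical initialization in Algorithms 2 and 3 above, the following Python sketch shows the parameter handover between successive SGF models; `build_sgf` and `train_with_sgd` are hypothetical placeholders, the `encoder` attribute name is an assumption, and this is not the authors' Caffe code.

```python
# Hedged sketch of the step-gained (hierarchical) training in Algorithms 2 and 3:
# each SGF(i) copies the convolutional weights of its predecessor, keeps randomly
# initialized deconvolutional layers, and is then trained with SGD.
# build_sgf, train_with_sgd and the .encoder attribute are placeholders.
def hierarchical_training(build_sgf, train_with_sgd, num_models=3):
    models = []
    for i in range(num_models):
        model = build_sgf(i)                         # SGF(i) architecture
        if i > 0:                                    # inherit the shared conv part
            model.encoder.load_state_dict(models[-1].encoder.state_dict())
        # decoder (deconvolutional) parameters keep their Gaussian initialization
        train_with_sgd(model)                        # minimize Eq. (11) or Eq. (12)
        models.append(model)
    return models
```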
TABLE I
INFORMATION ON THE SIX IMAGE SALIENCY DATASETS
TABLE II
INFORMATION OF THE FIVE VIDEO DATASETS
Fig. 6. Precision-recall and ROC curves gained from 12 models. (a) and (c) Hollywood2 dataset. (b) and (d) UCF-sports dataset. We can clearly see that
our proposed model achieves an advanced performance compared with the others.
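For reference, precision-recall points of the kind plotted in Fig. 6 are typically produced by sweeping a threshold over the predicted map; the sketch below follows that common protocol and is not the paper's own evaluation script.

```python
import numpy as np

# Illustrative threshold-sweep for precision-recall points; standard protocol only.
def pr_points(saliency, gt_mask, num_thresholds=255):
    s = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-12)
    gt = gt_mask > 0
    precisions, recalls = [], []
    for t in np.linspace(0.0, 1.0, num_thresholds):
        pred = s >= t
        tp = np.logical_and(pred, gt).sum()
        precisions.append(tp / (pred.sum() + 1e-12))
        recalls.append(tp / (gt.sum() + 1e-12))
    return np.array(precisions), np.array(recalls)
```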
TABLE III
MODEL PERFORMANCE COMPARISON USING FIVE DIFFERENT METRICS ON HOLLYWOOD2
Fig. 7. Fixation prediction maps of our proposed SGF(E) model and 11 benchmarks, first two rows belong to HOLLYWOOD2, the third and fourth rows
belong to UCF-sports, the fifth row belongs to VAGBA, the sixth row belongs to DIEM, and the last row belongs to CRCNS.
curves for SGF and the 11 state-of-the-art methods (Fig. 6), and calculated the corresponding shuffled-AUC, NSS, SIM, and EMD scores on the Hollywood2 and UCF datasets. Tables III and IV summarize the experimental results, and the results of the comparative experiments on the VAGBA, CRCNS, and DIEM datasets are shown in Fig. 7.
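Two of the reported metrics, NSS and SIM, have standard definitions that can be sketched as follows; this is given only as a reference implementation and does not reproduce the authors' evaluation code.

```python
import numpy as np

# Standard definitions of two of the reported metrics (NSS and SIM).
def nss(saliency, fixation_mask):
    """Mean normalized saliency at human fixation locations (binary mask)."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-12)
    return s[fixation_mask > 0].mean()

def sim(saliency, gt_density):
    """Histogram intersection between the two maps, each normalized to sum to 1."""
    p = saliency / (saliency.sum() + 1e-12)
    q = gt_density / (gt_density.sum() + 1e-12)
    return np.minimum(p, q).sum()
```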
TABLE IV
MODEL PERFORMANCE COMPARISON USING FIVE DIFFERENT METRICS ON UCF
TABLE V
DIFFERENCES BETWEEN THE FOUR MENTIONED MODELS
TABLE VI
ABLATION ANALYSIS OF THE PROPOSED METHOD ON TWO DATASETS
It can be seen that our model shows a better generalization capability in dealing with complicated scenes, and can accurately find eye fixation points to make predictions that are closest to the ground truth.

2) Ablation Study: To evaluate the performance of SGF in comparison with the four models proposed in this paper, we summarize the main differences among the four models in Table V, and the results of their performance comparisons are given in Table VI.

As shown, the SGF(E) model, which combines the motion information from OPB and the memory information from the SGF(3) model, achieves the best performance on both datasets. Correspondingly, two important conclusions can be drawn: 1) the memory information from the previous frames is useful for detections in the current frame and 2) the motion information across neighboring frames plays a constructive role in improving the overall performance through a fusion of this information with the current detections.

VII. CONCLUSION

In this paper, we proposed a robust deep model for the detection of video eye fixations. By studying the mechanisms of human visual attention and memory, we simulated the process of viewing video sequences by human beings, and added both memory and motion information to enable the model to capture the salient points across neighboring frames. With this process, both the previous detection and the motion information were taken into account to achieve the maximum probability of eye fixations, which improves the accuracy of the detection results. Intensive experiments validated the superiority of our proposed model in comparison with 11 representative existing state-of-the-art methods.

Finally, we highlight our main contributions as follows.
1) We proposed a deep model for video saliency detection without the need of any preprocessing operations.
2) The memory information was exploited to enhance the model generalization by considering the fact that changes between two adjacent frames inside a video are limited within a certain range, and hence the corresponding eye fixations should remain correlated.
3) Extensive experiments were carried out and comparative results were reported, which not only supported that our proposed model is superior in comparison with previous methods but also validated the robustness of our proposed approaches.

Further research can be identified to focus more on human brain activities and explore in detail the mechanism of human memory, thereby achieving more accurate and robust detections of eye fixation points as well as their saliencies.
REFERENCES

[1] X. Q. Lu, B. Q. Wang, X. L. Li, and X. T. Zheng, "Exploring models and data for remote sensing image caption generation," IEEE Geosci. Remote Sens. Mag., vol. 55, no. 4, pp. 2183–2195, Apr. 2018.
[2] X. Q. Lu, W. X. Zhang, and X. L. Li, "A hybrid sparsity and distance-based discrimination detector for hyperspectral images," IEEE Geosci. Remote Sens. Mag., vol. 56, no. 3, pp. 1704–1717, Mar. 2018.
[3] X. Q. Lu, Y. X. Chen, and X. L. Li, "Hierarchical recurrent neural hashing for image retrieval with hierarchical convolutional features," IEEE Trans. Image Process., vol. 27, no. 1, pp. 106–120, Jan. 2018.
[4] J. Han, R. Quan, D. Zhang, and F. Nie, "Robust object co-segmentation using background prior," IEEE Trans. Image Process., vol. 27, no. 4, pp. 1639–1651, Apr. 2018.
[5] X. Q. Lu, X. Zheng, and Y. Yuan, "Remote sensing scene classification by unsupervised representation learning," IEEE Geosci. Remote Sens. Mag., vol. 55, no. 9, pp. 5148–5157, Sep. 2017.
[6] Y. Lin, Y. Tong, Y. Cao, Y. Zhou, and S. Wang, "Visual-attention-based background modeling for detecting infrequently moving objects," IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 6, pp. 1208–1221, Jun. 2017.
[7] D. Du et al., "Geometric hypergraph learning for visual tracking," IEEE Trans. Cybern., vol. 47, no. 12, pp. 4182–4195, Dec. 2017.
[8] Q. Liu, J. Yang, K. Zhang, and Y. Wu, "Adaptive compressive tracking via online vector boosting feature selection," IEEE Trans. Cybern., vol. 47, no. 12, pp. 4289–4301, Dec. 2017.
[9] J. Chen, J. H. Li, S. H. Yang, and F. Deng, "Weighted optimization-based distributed Kalman filter for nonlinear target tracking in collaborative sensor networks," IEEE Trans. Cybern., vol. 47, no. 11, pp. 3892–3905, Nov. 2017.
[10] L. Wang, L. Zhang, and Z. Yi, "Trajectory predictor by using recurrent neural networks in visual tracking," IEEE Trans. Cybern., vol. 47, no. 10, pp. 3172–3183, Oct. 2017.
[11] L. Zhang and P. N. Suganthan, "Visual tracking with convolutional random vector functional link network," IEEE Trans. Cybern., vol. 47, no. 10, pp. 3243–3253, Oct. 2017.
[12] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 11, pp. 1254–1259, Nov. 1998.
[13] T. Judd, K. Ehinger, F. Durand, and A. Torralba, "Learning to predict where humans look," in Proc. IEEE 12th Int. Conf. Comput. Vis., Kyoto, Japan, 2009, pp. 2106–2113.
[14] Y. Lin et al., "A visual-attention model using earth mover's distance-based saliency measurement and nonlinear feature combination," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 2, pp. 314–328, Feb. 2013.
[15] J. Harel, C. Koch, and P. Perona, "Graph-based visual saliency," in Proc. 19th Int. Conf. Neural Inf. Process. Syst., 2007, pp. 545–552.
[16] H. J. Seo and P. Milanfar, "Static and space-time visual saliency detection by self-resemblance," J. Vis., vol. 9, no. 12, pp. 1–27, Nov. 2009.
[17] C. Guo, Q. Ma, and L. Zhang, "Spatio-temporal saliency detection using phase spectrum of quaternion Fourier transform," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Anchorage, AK, USA, 2008, pp. 1–8.
[18] H. Wang, H. Guo, X. Y. Wu, and J. Li, "Temporal and spatial anomaly detection using phase spectrum of quaternion Fourier transform," in Proc. IEEE Int. Conf. Inf. Autom., Lijiang, China, 2015, pp. 657–662.
[19] X. Hou and L. Zhang, "Dynamic visual attention: Searching for coding length increments," in Proc. Conf. Adv. Neural Inf. Process. Syst., Vancouver, BC, Canada, 2008, pp. 681–688.
[20] J. Han et al., "An object-oriented visual saliency detection framework based on sparse coding representations," IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 12, pp. 2009–2021, Dec. 2013.
[21] R. Achanta and S. Süsstrunk, "Saliency detection using maximum symmetric surround," IEEE Trans. Image Process., vol. 119, no. 9, pp. 2653–2656, Sep. 2010.
[22] L. Y. Zhang, M. H. Tong, T. K. Marks, H. H. Shan, and G. W. Cottrell, "SUN: A Bayesian framework for saliency using natural statistics," J. Vis., vol. 8, no. 7, pp. 1–20, 2008.
[23] X. Hou, J. Harel, and C. Koch, "Image signature: Highlighting sparse salient regions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 1, pp. 194–201, Jan. 2011.
[24] H. R. Tavakoli, E. Rahtu, and J. Heikkila, "Fast and efficient saliency detection using sparse sampling and kernel density estimation," in Proc. Scandinavian Conf. Image Anal., vol. 6688, no. 2, 2011, pp. 666–675.
[25] H. Fu, D. Xu, B. Zhang, S. Lin, and R. K. Ward, "Object-based multiple foreground video co-segmentation via multi-state selection graph," IEEE Trans. Image Process., vol. 24, no. 11, pp. 3415–3424, Nov. 2015.
[26] H. Fu, X. Cao, and Z. Tu, "Cluster-based co-saliency detection," IEEE Trans. Image Process., vol. 22, no. 10, pp. 3766–3778, Oct. 2013.
[27] X. Yao, J. Han, D. Zhang, and F. Nie, "Revisiting co-saliency detection: A novel approach based on two-stage multi-view spectral rotation co-clustering," IEEE Trans. Image Process., vol. 26, no. 7, pp. 3196–3209, Jul. 2017.
[28] H. Yu et al., "Co-saliency detection within a single image," in Proc. AAAI Conf. Artif. Intell., New Orleans, LA, USA, 2018.
[29] X. C. Cao, C. Q. Zhang, H. Z. Fu, X. J. Guo, and Q. Tian, "Saliency-aware nonparametric foreground annotation based on weakly labeled data," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 6, pp. 1253–1265, Jun. 2016.
[30] C. Q. Zhang, Z. Q. Tao, X. X. Wei, and X. C. Cao, "A flexible framework of adaptive method selection for image saliency detection," Pattern Recognit. Lett., vol. 63, pp. 66–70, Oct. 2015.
[31] H. R. Tavakoli, A. Borji, J. Laaksonen, and E. Rahtu, "Exploiting inter-image similarity and ensemble of extreme learners for fixation prediction using deep features," Neurocomputing, vol. 244, pp. 10–18, Jun. 2017.
[32] S.-H. Zhong, Y. Liu, F. Ren, J. Zhang, and T. Ren, "Video saliency detection via dynamic consistent spatio-temporal attention modelling," in Proc. 27th AAAI Conf. Artif. Intell., Bellevue, WA, USA, 2013, pp. 1063–1069.
[33] Y. Li and Y. Li, "A fast and efficient saliency detection model in video compressed-domain for human fixations prediction," Multimedia Tools Appl., vol. 76, no. 24, pp. 26273–26295, 2017.
[34] J. Han, L. Sun, X. Hu, J. Han, and L. Shao, "Spatial and temporal visual attention prediction in videos using eye movement data," Neurocomputing, vol. 145, pp. 140–153, Dec. 2014.
[35] L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan, "Saliency detection with recurrent fully convolutional networks," in Proc. Eur. Conf. Comput. Vis., vol. 2, Amsterdam, The Netherlands, 2016, pp. 825–841.
[36] L. Zhang, C. Yang, H. Lu, X. Ruan, and M.-H. Yang, "Ranking saliency," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 9, pp. 1892–1904, Sep. 2017.
[37] T. Wang, A. Borji, L. Zhang, P. Zhang, and H. Lu, "A stagewise refinement model for detecting salient objects in images," in Proc. IEEE Int. Conf. Comput. Vis., Venice, Italy, 2017, pp. 4039–4048.
[38] L. Wang et al., "Learning to detect salient objects with image-level supervision," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Honolulu, HI, USA, 2017, pp. 3796–3805.
[39] J. Han, H. Chen, N. Liu, C. Yan, and X. Li, "CNNs-based RGB-D saliency detection via cross-view transfer and multiview fusion," IEEE Trans. Cybern., to be published, doi: 10.1109/TCYB.2017.2761775.
[40] J. Han, D. Zhang, G. Cheng, N. Liu, and D. Xu, "Advanced deep-learning techniques for salient and category-specific object detection: A survey," IEEE Signal Process. Mag., vol. 35, no. 1, pp. 84–100, Jan. 2018.
[41] X. Li et al., "DeepSaliency: Multi-task deep neural network model for salient object detection," IEEE Trans. Image Process., vol. 25, no. 8, pp. 3919–3930, Aug. 2016.
[42] H. R. Tavakoli and J. Laaksonen, "Bottom-up fixation prediction using unsupervised hierarchical models," in Proc. Int. Asian Conf. Comput. Vis., Taipei, Taiwan, 2016, pp. 287–302.
[43] J. Han et al., "Two-stage learning to predict human eye fixations via SDAEs," IEEE Trans. Cybern., vol. 46, no. 2, pp. 487–498, Feb. 2016.
[44] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara, "Predicting human eye fixations via an LSTM-based saliency attentive model," arXiv preprint arXiv:1611.09571, 2017.
[45] S. Mathe and C. Sminchisescu, "Actions in the eye: Dynamic gaze datasets and learnt saliency models for visual recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 7, pp. 1408–1424, Jul. 2015.
[46] A. Nantheera, K. Daniels, and B. Jeremy, "Fixation prediction and visual priority maps for biped locomotion," IEEE Trans. Cybern., to be published.
[47] P. Zhang, D. Wang, H. Lu, H. Wang, and B. Yin, "Learning uncertain convolutional features for accurate saliency detection," in Proc. IEEE Int. Conf. Comput. Vis., Venice, Italy, 2017, pp. 212–221.
[48] W. Wang, J. Shen, and L. Shao, "Consistent video saliency using local gradient flow optimization and global refinement," IEEE Trans. Image Process., vol. 24, no. 11, pp. 4185–4196, Nov. 2015.
[49] W. Wang, J. Shen, and F. Porikli, "Saliency-aware geodesic video object segmentation," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Boston, MA, USA, Jun. 2015, pp. 3395–3402.
[50] F. Zhou, S. B. Kang, and M. F. Cohen, "Time-mapping using space-time saliency," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Columbus, OH, USA, Jun. 2014, pp. 3358–3365.
[51] J. Zhang and S. Sclaroff, "Saliency detection: A Boolean map approach," in Proc. IEEE Int. Conf. Comput. Vis., Sydney, NSW, Australia, 2014, pp. 153–160.
[52] K. Soomro and A. R. Zamir, "Action recognition in realistic sports videos," in Computer Vision in Sports (Advances in Computer Vision and Pattern Recognition), T. Moeslund, G. Thomas, and A. Hilton, Eds. Cham, Switzerland: Springer, 2014.
[53] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556, 2014.
[54] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
[55] R. Achanta et al., "SLIC superpixels," EPFL, Lausanne, Switzerland, Rep. EPFL-REPORT-149300, 2010.
[56] T. Brox and J. Malik, "Large displacement optical flow: Descriptor matching in variational motion estimation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 3, pp. 500–513, Mar. 2010.
[57] S. Ramanathan, H. Katti, N. Sebe, M. Kankanhalli, and T.-S. Chua, "An eye fixation database for saliency detection in images," in Proc. Eur. Conf. Comput. Vis., vol. 6314, 2010, pp. 30–43.
[58] Y. Jia et al., "Caffe: Convolutional architecture for fast feature embedding," in Proc. ACM Multimedia, Orlando, FL, USA, 2014, pp. 675–678.
[59] R. J. Peters, A. Iyer, L. Itti, and C. Koch, "Components of bottom-up gaze allocation in natural images," Vis. Res., vol. 45, no. 18, pp. 2397–2416, 2005.
[60] D. J. Berg, S. E. Boehnke, R. A. Marino, D. P. Munoz, and L. Itti, "Free viewing of dynamic stimuli by humans and monkeys," J. Vis., vol. 9, no. 5, pp. 1–15, 2009.
[61] L. Itti, "Automatic foveation for video compression using a neurobiological model of visual attention," IEEE Trans. Image Process., vol. 13, no. 10, pp. 1304–1318, Oct. 2004.
[62] P. K. Mital, T. J. Smith, R. L. Hill, and J. M. Henderson, "Clustering of gaze during dynamic scene viewing is predicted by motion," Cogn. Comput., vol. 3, no. 1, pp. 5–24, 2011.
[63] B. W. Tatler, R. J. Baddeley, and I. D. Gilchrist, "Visual correlates of fixation selection: Effects of scale and time," Vis. Res., vol. 45, no. 5, pp. 643–659, 2005.
[64] A. Toet, "Computational versus psychophysical bottom-up image saliency: A comparative evaluation study," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 11, pp. 2131–2148, Nov. 2011.

Ziqi Zhou is currently pursuing the M.S. degree in computer science with the School of Computer Science and Technology, Tianjin University, Tianjin, China. Her current research interests include computer image/video processing, and particularly deep learning.

Qinghua Hu received the B.S., M.S., and Ph.D. degrees in computer science from the Harbin Institute of Technology, Harbin, China, in 1999, 2002, and 2008, respectively. He was a Postdoctoral Fellow with the Department of Computing, Hong Kong Polytechnic University, Hong Kong, from 2009 to 2011. He is currently a Full Professor and the Vice Dean of the School of Computer Science and Technology, Tianjin University, Tianjin, China. He has authored over 100 journal and conference papers in the areas of granular computing-based machine learning, reasoning with uncertainty, pattern recognition, and fault diagnosis. His current research interests include rough sets, granular computing, and data mining for classification and regression. Prof. Hu was the Program Committee Co-Chair of the International Conference on Rough Sets and Current Trends in Computing in 2010, the Chinese Rough Set and Soft Computing Society in 2012 and 2014, and the International Conference on Rough Sets and Knowledge Technology, the International Conference on Machine Learning and Cybernetics in 2014, and the General Co-Chair of IJCRS 2015. He is currently the PC-Co-Chair of CCML 2017 and CCCV 2017.

Zheng Wang received the Ph.D. degree in computer science from the School of Computer Science, Tianjin University (TJU), Tianjin, China, in 2009. He is currently an Associate Professor with the School of Computer Software, TJU. He was a Visiting Scholar with INRIA Institute, Paris, France, from 2007 to 2008. His current research interests include video analysis, hyperspectral imaging, and computer graphics.