A Unified Spatiotemporal Prior Based On Geodesic Distance For Video Object Segmentation
Abstract—Video saliency, aiming for estimation of a single dominant object in a sequence, offers strong object-level cues for
unsupervised video object segmentation. In this paper, we present a geodesic distance based technique that provides reliable and
temporally consistent saliency measurement of superpixels as a prior for pixel-wise labeling. Using undirected intra-frame and
inter-frame graphs constructed from spatiotemporal edges of appearance and motion, and a skeleton abstraction step to further
enhance saliency estimates, our method formulates the pixel-wise segmentation task as an energy minimization problem on a function
that consists of unary terms of global foreground and background models, dynamic location models, and pairwise terms of label
smoothness potentials. We perform extensive quantitative and qualitative experiments on benchmark datasets. Our method achieves
superior performance in comparison to the current state-of-the-art in terms of accuracy and speed.
Index Terms—Video saliency, video object segmentation, geodesic distance, spatiotemporal prior.
1 INTRODUCTION
Fig. 1. Overview of our video object segmentation framework. The input frame is over-segmented into superpixels, and a spatiotemporal edge map is produced by combining the static edge probability with the optical flow gradient magnitude. For each superpixel, we compute its object probability and the refined saliency estimate via the intra-frame graph and the inter-frame graph, respectively. An object skeleton abstraction method is further derived to obtain the final saliency estimates by biasing the central skeleton regions with higher saliency values. Finally, we combine the spatiotemporal saliency priors, global appearance models and dynamic location models, using motion information from a few subsequent frames, to produce the pixel-wise segmentation.
feature as another cue within their image saliency models [13], [14], [15], lacking an elegant framework to incorporate intra-frame and inter-frame information in a unified fashion.

In this paper, we aim to partition the foreground objects from their backgrounds in all frames of a given video sequence without any user assistance or contextual assumptions. To this end, we propose a video object segmentation method that consists of a superpixel based spatiotemporal saliency prior detection stage and a pixel based binary labeling stage that runs in a recursive fashion.

Our proposed video segmentation framework is depicted in Figure 1. We first introduce a unified spatiotemporal saliency prior that combines motion boundaries and spatial edges into a unified model designed to align with object boundaries, yielding a simple yet powerful representation. Our model offers a reliable and temporally consistent region-level prior for object segmentation by employing psychophysically motivated low-level features that incorporate spatial stimulus and temporal coherence. Saliency of a region is measured by its shortest geodesic distance to background regions in two graphs, an inter-frame and an intra-frame graph, which are constructed from the intensity edge and motion boundary cues, as well as the background information across adjacent frames. The geodesic distance has the power of abstracting object structure to efficiently determine its central regions by assigning higher saliency values to more representative regions. It has been shown to be effective for supervised segmentation where the user provides seeds [16], [17], [18], [19]. Such user interaction, however, is impractical in streaming video applications. Our method extends the geodesic distance to unsupervised video segmentation. Hence, we design a skeleton abstraction method that explicitly incorporates weak object structure and emphasizes the saliency values of the central skeleton regions based on geodesic distances. After obtaining the spatiotemporal saliency prior, we integrate saliency cues, dynamic location models as well as global appearance models into an energy minimization that is optimized via graph-cuts to generate highly accurate foreground object boundaries over the entire video sequence.

To summarize, our main contributions are:
• A unified framework that incorporates video saliency for unsupervised pixel-wise labeling of foreground objects, using an energy function that contains three unary and two pairwise terms (Section 4).
• A new formulation for spatiotemporal saliency by exploiting intra-frame and inter-frame relevancy via undirected graphs on superpixels. For the intra-frame stimulus, we employ the geodesic distance on spatiotemporal edges within a single frame. We construct the inter-frame graph for temporal coherence between consecutive frames (Section 3).
• A geodesic distance based weighting of intra-frame and inter-frame graphs, based on the observation that salient regions have higher geodesic distances to background regions (Sections 3.1 and 3.3).
• A greedy skeleton abstraction scheme for iteratively selecting confident foreground regions (Section 3.4).

Our method achieves state-of-the-art performance on four large benchmarks. This paper builds upon and extends our recent work in [20] with a more in-depth discussion of the algorithm and expanded evaluation results. We further introduce a new geodesic distance based skeleton region abstraction method that regularizes the original object regions with higher saliency.

The remainder of this paper is organized as follows: An overview of the related work is given in Section 2. The spatiotemporal saliency stage is explained in Section 3. Intermediate processes of the video segmentation method are articulated in Section 4. Experimental evaluations are presented in Section 5. Discussions and limitations are given in Section 6. Concluding remarks can be found in Section 7.

2 RELATED WORK

In this section, we give a brief overview of recent works in unsupervised video segmentation and saliency detection.

2.1 Unsupervised Video Segmentation

A variety of techniques have been proposed for unsupervised video segmentation in the past decade. Most approaches are based on bottom-up models using low-level features such as motion, color, and edge orientation.
In particular, the importance of motion information was emphasized in many works [1], [2], [3], [21], [22], [23]. While the use of short-duration motion boundaries in pairs of subsequent frames is not uncommon [22], several methods [1], [2], [3], [21], [23] argued that motion should be analyzed over longer periods, as such long-term analysis is able to decrease the intra-object variance of motion relative to the inter-object variance and to propagate motion information to frames in which the object remains static. For this, [2] grouped pixels with coherent motion computed via long-range motion vectors from the past and future frames. Similarly, the work in [1] offered a framework for trajectory-based video segmentation through building an affinity matrix between pairs of trajectories. In [3], discontinuities of embedding density between spatially neighboring trajectories were detected. Incorporating higher order motion models, a clustering method for point tracks was proposed in [23]. In general, motion based methods suffer difficulties when different parts of an object exhibit nonhomogeneous motion patterns. This problem is exacerbated further by the absence of a strong object prior. Moreover, these approaches require careful selection of a suitable model, especially for the trajectory clustering process, which often comes with a high computational complexity, as [7] pointed out.

There were previous efforts [4], [5], [6], [24] that presented optimization frameworks for bottom-up segmentation employing both appearance and motion cues. Several methods [7], [8], [9], [25], [26] proposed to select primary object regions in the object proposal domain based on the notion of what a generic object looks like. These approaches benefit from the work on object hypothesis proposals [10], [11], [12], which offer a large number of object candidates in every frame. Therefore, segmenting the video object is transformed into an object region selection problem. In this selection process, both motion and appearance cues are used to measure the objectness of a proposal. More specifically, a clustering process was introduced for finding objects by [7], a constrained maximum weight cliques technique was imposed to model the selection process [8], and a layered directed acyclic graph based framework was presented by [9]. The work of [25] segmented moving objects by ranking spatiotemporal segment proposals with a moving objectness detector trained on image and motion fields. In [26], tracking and segmentation were integrated into a unified framework to detect the primary object proposal and handle the video segmentation task. The main drawbacks of the proposal based algorithms are their high computational cost associated with proposal generation and their complicated object inference schemes.

Recently, some methods [27], [28] were proposed to exploit temporal correlations over the entire video, which produce globally optimal segments.

2.2 Saliency Detection for Image and Video

The human visual system is remarkably effective in localizing the regions in a scene that stand out from their neighbors. Saliency detection [29] is originally a task of simulating the human visual system for predicting scene locations where a human observer may fixate. Recent research has shown that extracting salient objects or regions is more useful and beneficial to a wide range of computer vision applications. The output of salient object detection is usually a saliency map where the intensity of each pixel represents the probability of that pixel belonging to the salient object.

Saliency detection methods in general can be categorized as either bottom-up or top-down approaches. Top-down approaches [30], [31] are goal-directed and require an explicit understanding of the context of the image. Supervised learning with a specific class is therefore a frequently adopted principle. Most of the saliency detection methods are based on bottom-up visual attention mechanisms, which are independent of the knowledge of the content in the image and utilize various low-level features, such as intensity, color and orientation.

Inspired by visual perception studies that indicate contrast is a major factor in visual attention mechanisms, numerous bottom-up models have been proposed based on different mathematical formulations of contrast. Many methods assumed that globally infrequent features are more salient, and frequency analysis is carried out in the spectral domain. For example, [32] proposed a saliency detection algorithm using spectral residuals based on the log spectra representation of images. The phase spectrum of the Fourier transform is considered to be the key element in obtaining the location of salient regions in [33]. Later, [34] introduced a frequency-tuned approach to estimate center-surround contrast using color and luminance features. Other methods attempted to detect saliency in the spatial domain, usually adopting several visual cues. A graph-based dissimilarity measure was used in [30]. In [35], a content-aware saliency detection method that considers the contrast from both local and global perspectives was built. [36] presented a framework for saliency detection based on the efficient fusion of different feature channels and the local center-surround hypothesis. In [37], two saliency indicators, global appearance contrast and spatially compact distribution, were considered. Recently, several approaches [38], [39], [40] exploited background information, called the boundary prior. These methods use image boundaries as background, further enhancing saliency computation.

While image saliency detection has been extensively studied, computing spatiotemporal saliency for videos is a relatively new problem. Different from image saliency detection, moving objects catch more of the attention of human beings than static ones, even if the static objects have large contrast to their neighbors. In other words, motion is the most important cue for video saliency detection, which makes deeper exploration of the inter-frame information crucial. Gao et al. [13] extended their image saliency model [41] by adding the motion channel for prediction of human eye fixations in dynamic scenes based on the center-surround hypothesis. Similarly, Mahadevan et al. [14] combined center-surround saliency with dynamic textures for spatiotemporal saliency using the saliency model in [41]. In [15], Seo et al. computed so-called local regression kernels from the given video, measuring the likeness of a pixel (or voxel) to its surroundings. They extended their model to video saliency detection straightforwardly by extracting a feature vector from each spatiotemporal 3-D cube. Recently, [5] used a statistical framework and local feature contrast in illumination, color, and motion for formulating final saliency maps. [42] proposed a cluster-based saliency method, where three visual attention cues, contrast, spatial, and global correspondence, are devised to measure the cluster saliency. [43] adopted space-time saliency to generate a low-frame-rate video from a high-frame-rate input using various low-level features and region-based contrast analysis.

It can be seen that video saliency detection is still an emerging and challenging research problem to be further investigated. The existing methods, however, usually build their systems with a simple combination of image saliency models with motion cues, lacking an efficient framework to fully explore intra-frame and inter-frame information together.
Fig. 2. Overview of our geodesic distance based spatiotemporal saliency prior. (a) Input frame F^k. (b) Oversegmentation of F^k into superpixels Y^k. (c) Spatial edge probability map E_c^k of F^k. (d) Gradient magnitude E_o^k of the optical flow of F^k. (e) Superpixel-wise spatiotemporal edge map E^k computed via Equation 1. (f) Object estimation result P^k via the intra-frame graph. (g) Saliency result S^k via the inter-frame graph. (h) Final video saliency via the proposed skeleton abstraction method.
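The caption references Equation 1, which is not reproduced in this excerpt. The following is a minimal sketch of one plausible way to obtain a superpixel-wise spatiotemporal edge value by combining a static edge probability map with the optical flow gradient magnitude, as described in the captions of Figures 1 and 2; the multiplicative fusion and the helper names are assumptions of this sketch, not the paper's exact formulation.

```python
import numpy as np

def spatiotemporal_edge_map(edge_prob, flow, superpixel_labels):
    """Combine a static edge probability map with the optical flow gradient
    magnitude and pool the result over superpixels (cf. Fig. 2 (c)-(e)).

    edge_prob         : (H, W) static edge probability in [0, 1]
    flow              : (H, W, 2) optical flow field V^k
    superpixel_labels : (H, W) integer superpixel index per pixel
    """
    # Gradient magnitude of the flow field, pooled over the two flow channels.
    dy_u, dx_u = np.gradient(flow[..., 0])
    dy_v, dx_v = np.gradient(flow[..., 1])
    flow_grad = np.sqrt(dx_u**2 + dy_u**2 + dx_v**2 + dy_v**2)
    flow_grad /= flow_grad.max() + 1e-12

    # Assumed multiplicative fusion of the two cues (the paper's Equation 1 may
    # differ); pixels on both a spatial edge and a motion boundary score high.
    pixel_edge = edge_prob * flow_grad

    # Pool to a per-superpixel spatiotemporal edge value E^k.
    n_sp = superpixel_labels.max() + 1
    sums = np.bincount(superpixel_labels.ravel(),
                       weights=pixel_edge.ravel(), minlength=n_sp)
    counts = np.bincount(superpixel_labels.ravel(), minlength=n_sp)
    return sums / np.maximum(counts, 1)
```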
Fig. 3. Illustration of our inter-frame graph construction. (a) Frame F^k. (b) Optical flow field V^k. (c) When the optical flow estimation is not accurate (which is unfortunately the common case), the object probabilities P^k are degraded. (d) Frame F^k is decomposed into background regions B^k and object-like regions U^k by the self-adaptive threshold σ^k defined in Equation 6. The black regions indicate the background regions B^k, while the bright regions indicate the object-like regions U^k. (e) The decomposition of the prior frame F^{k-1}. (f) The object-like regions U^{k-1} of frame F^{k-1} are projected onto frame F^k. (g) Spatiotemporal saliency result S^k for frame F^k with consideration of (d) and (e). (h) Spatiotemporal saliency result S^k for frame F^k with consideration of (e) and (f).
3.2 Intra-frame Graph Construction

To highlight the foreground regions that have high spatiotemporal edge values or are surrounded by regions with high spatiotemporal edge values, we employ the geodesic distance to compute a probability map.

The geodesic distance d_geo(v_1, v_2, G) between any two nodes v_1, v_2 in a graph G is the smallest integral of a weight function W over all possible paths between v_1 and v_2:

d_{geo}(v_1, v_2, G) = \min_{C_{v_1,v_2}} \int_{v_2}^{v_1} \lvert W(m) \cdot \dot{C}_{v_1,v_2}(m) \rvert \, dm, \quad (2)

where C_{v_1,v_2}(m) is a path connecting the nodes v_1, v_2.

For frame F^k, we construct an undirected weighted graph G^k = {V^k, E^k} with the superpixels Y^k as nodes V^k and the links between adjacent nodes as edges E^k. Based on the graph structure, we derive a |V^k| × |V^k| weight matrix W^k, where |V^k| is the number of nodes in V^k. The (m, n)-th element of W^k indicates the weight of the edge e^k_{mn} ∈ E^k between adjacent superpixels Y^k_m and Y^k_n:

W^k_{mn} = \lVert E^k(Y^k_m) - E^k(Y^k_n) \rVert, \quad (3)

where E^k(Y^k_m) and E^k(Y^k_n) correspond to the spatiotemporal boundary probabilities of superpixels Y^k_m and Y^k_n, respectively.

For superpixel y^k_n, the probability P^k(y^k_n) of being foreground is computed via the shortest geodesic distance to the image boundaries using

P^k(y^k_n) = \min_{q \in Q^k} d_{geo}(y^k_n, q, G^k), \quad (4)

where Q^k indicates the superpixels along the four boundaries of frame F^k. The geodesic distance d_geo(v_1, v_2, G^k) between any two superpixels v_1, v_2 ∈ V^k in the graph G^k can be computed in discrete form:

d_{geo}(v_1, v_2, G^k) = \min_{C_{v_1,v_2}} \sum_{m,n} W^k_{mn}, \quad m, n \in C_{v_1,v_2}, \quad (5)

which can be seen as the accumulated edge weights along the shortest path on the graph G^k.

If a superpixel is outside the desired object, its probability value is small because there exists a pathway to the image boundaries that does not pass through regions with high spatiotemporal edge values. In contrast, if a superpixel is inside the object, it is surrounded by regions with large edge probabilities, which increases its geodesic distance to the image boundaries. We normalize the probability map P^k to [0, 1] for each frame. Since our graph is very sparse, the shortest paths of all superpixels are efficiently computed by the Johnson algorithm [49].

3.3 Inter-frame Graph Construction

The foreground probability map P^k reveals the foreground object region, but it is not complete and precise. In particular, probability values of true background regions near the object boundary may be high due to the oversegmentation process. Besides, inaccurate optical flow estimation may result in erroneous values. By the definition of saliency, foreground and background regions should be visually different, and object regions should be temporally continuous between consecutive frames. These observations motivate us to estimate saliency between pairs of adjacent frames.

For each pair of adjacent frames F^k and F^{k+1}, we construct an undirected weighted graph G'^k = {V'^k, E'^k}. The nodes V'^k consist of the superpixels Y^k of frame F^k and the superpixels Y^{k+1} of frame F^{k+1}. There are two types of edges: intra-frame edges that link spatially adjacent superpixels and inter-frame edges that connect temporally adjacent superpixels. Superpixels are spatially connected if they are adjacent in the same frame. Temporally adjacent superpixels refer to superpixels which belong to different frames but overlap. We assign the edge weight as the Euclidean distance between their mean colors in the CIE-Lab color space.
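As an illustration of Equations 3-5, the following is a minimal sketch of the per-frame geodesic computation on the sparse superpixel graph; the array layout and helper names are assumptions, and SciPy's Dijkstra routine is used as a stand-in for the Johnson algorithm mentioned above. The same shortest-path machinery applies to the inter-frame graph of Section 3.3, with color-difference edge weights and background superpixels as sources.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def foreground_probability(edge_value, adjacency, boundary_idx):
    """Geodesic foreground probability of superpixels (cf. Equations 3-5).

    edge_value   : (N,) spatiotemporal edge value E^k of each superpixel
    adjacency    : iterable of (m, n) index pairs of spatially adjacent superpixels
    boundary_idx : indices of superpixels touching the image border (the set Q^k)
    """
    n = len(edge_value)
    rows, cols, weights = [], [], []
    for m, q in adjacency:
        rows.append(m)
        cols.append(q)
        # Equation 3: edge weight is the difference of spatiotemporal edge values;
        # the small epsilon keeps zero-difference edges explicit in the sparse graph.
        weights.append(abs(edge_value[m] - edge_value[q]) + 1e-9)
    graph = csr_matrix((weights, (rows, cols)), shape=(n, n))

    # Equation 5: accumulated edge weights along shortest paths; the graph is
    # sparse, so running Dijkstra from the |Q^k| boundary superpixels is cheap.
    dist = dijkstra(graph, directed=False, indices=np.asarray(boundary_idx))

    prob = dist.min(axis=0)                # Equation 4: min over boundary superpixels
    return prob / (prob.max() + 1e-12)     # normalize P^k to [0, 1]
```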
Fig. 4. Illustration of the skeleton abstraction process. (a) Frame F^k. (b) Saliency result S^k of (a) obtained via Equation 8. (c) Frame F^k is decomposed into background regions B'^k and object-like regions U'^k by the self-adaptive threshold σ'^k defined in Equation 9. The black regions indicate the background regions B'^k, while the bright regions indicate the object-like regions U'^k. (d) The red region corresponds to the base skeleton region, which is the first skeleton region selected through Equation 10. (e) The three yellow regions correspond to the subsequently selected skeleton regions through Equation 11. (f) We iteratively find and add skeleton regions until the number of selected skeleton regions reaches 10% of the number of object-like regions U'^k. (g) The blue regions are the other skeleton regions that lie on the shortest geodesic path between the base and the selected skeleton regions. (h) The saliency values of the skeleton regions are enhanced.
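Returning to the inter-frame graph of Section 3.3, a minimal sketch of its construction is given below; the input formats (per-superpixel mean Lab colors and precomputed adjacency and overlap pairs) are assumptions of this sketch. The resulting sparse matrix can be fed to the same shortest-path routine used above.

```python
import numpy as np
from scipy.sparse import csr_matrix

def interframe_graph(colors_k, colors_k1, spatial_pairs_k, spatial_pairs_k1, temporal_pairs):
    """Build the inter-frame graph G'^k of Section 3.3 over the superpixels of
    frames F^k and F^{k+1}; edge weights are Euclidean distances between mean
    CIE-Lab colors.

    colors_k, colors_k1 : (N_k, 3), (N_{k+1}, 3) mean Lab color per superpixel
    spatial_pairs_*     : (m, n) pairs of spatially adjacent superpixels per frame
    temporal_pairs      : (m, n) pairs with m in F^k and n in F^{k+1} that overlap
    """
    n_k = len(colors_k)
    colors = np.vstack([colors_k, colors_k1])        # joint node set V'^k
    rows, cols = [], []
    for m, n in spatial_pairs_k:                     # intra-frame edges of F^k
        rows.append(m); cols.append(n)
    for m, n in spatial_pairs_k1:                    # intra-frame edges of F^{k+1}
        rows.append(n_k + m); cols.append(n_k + n)   # offset indices of frame k+1
    for m, n in temporal_pairs:                      # inter-frame (overlap) edges
        rows.append(m); cols.append(n_k + n)
    rows, cols = np.asarray(rows), np.asarray(cols)
    weights = np.linalg.norm(colors[rows] - colors[cols], axis=1) + 1e-9
    size = len(colors)
    return csr_matrix((weights, (rows, cols)), shape=(size, size))
```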
For each frame, we use a self-adaptive threshold to decompose frame F^k into background regions B^k and object-like regions U^k through the probability map P^k. The threshold σ^k for frame F^k is computed as

\sigma^k = \mu(P^k), \quad (6)

where µ(·) is the mean probability over all pixels within the frame F^k. We assign the object-like regions U^k and the background regions B^k of the k-th frame as

U^k = \{ y^k_n \mid P^k(y^k_n) > \sigma^k \} \cup \{ y^k_n \mid y^k_n \text{ is temporally connected to } U^{k-1} \}, \qquad B^k = Y^k - U^k. \quad (7)

In a causal system, previously determined object regions offer valuable information to eliminate artifacts due to inaccurate optical flow estimation. Therefore, we project the object-like regions of the prior frame F^{k-1} onto frame F^k. Our motivation can be observed in Figure 3. The object estimation result of frame F^k (Figure 3-c) is not ideal, due to the incorrect optical flow estimation (Figure 3-b). If F^k is segmented using only the self-adaptive threshold σ^k defined in Equation 7, an inferior decomposition is generated (Figure 3-d), further leading to an incorrect saliency result (Figure 3-g). When the previous estimation is projected, a more correct decomposition can be obtained (Figure 3-f), and more consistent saliency can be attained (Figure 3-h).

Based on the graph G'^k, we compute the saliency map S^k (S^{k+1}) for frame F^k (F^{k+1}) as follows:

S^k(y^k_n) = \min_{b \in B^k \cup B^{k+1}} d_{geo}(y^k_n, b, G'^k), \qquad S^{k+1}(y^{k+1}_n) = \min_{b \in B^k \cup B^{k+1}} d_{geo}(y^{k+1}_n, b, G'^k). \quad (8)

The rationale behind Equation 8 is that the saliency value of a superpixel is measured by its shortest path to background regions in color space, considering both spatial and temporal information. We update P^k and P^{k+1} for frames F^k and F^{k+1} with S^k and S^{k+1}, and keep iterating this process for the following two adjacent frames F^{k+1} and F^{k+2} until the final frame.

3.4 Skeleton Abstraction

To further refine the saliency estimates above, we use a geodesic distance based abstraction scheme that augments core regions with higher saliency values.

We decompose (Figure 4-c) frame F^k into two parts, background regions B'^k and object-like regions U'^k, using a threshold similar to the one in Equation 6 yet computed from the saliency result S^k:

\sigma'^k = \mu(S^k), \quad U'^k = \{ y^k_n \mid S^k(y^k_n) > \sigma'^k \}, \quad B'^k = Y^k - U'^k. \quad (9)

As the saliency result S^k is more accurate than P^k (this is quantitatively verified in our experiments), we decompose frame F^k through this efficient thresholding strategy.

The skeleton region abstraction is an iterative process based on the undirected weighted graph G^k defined in Section 3.2. The base skeleton region should have two properties. First, this region should be as far away from the background regions B'^k as possible; second, it should be close to the foreground regions U'^k. Based on these conditions, the base skeleton region is selected by

O^k \leftarrow \operatorname*{argmin}_{o \in U'^k} \left\{ \frac{\max_{u' \in U'^k} d_{geo}(o, u', G^k)}{\min_{b' \in B'^k} d_{geo}(o, b', G^k)} \right\}. \quad (10)
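To make the preceding steps concrete, here is a minimal sketch of the mean-threshold decomposition (Equations 6, 7 and 9), the geodesic saliency of Equation 8, and the base-skeleton selection of Equation 10, reusing the sparse-graph ideas of the earlier sketches. The omission of the temporal-projection term of Equation 7 and the assumed `pairwise_geodesic` input (an all-pairs geodesic distance matrix on G^k) are simplifications; the greedy extension of Equation 11 is described after Figure 5.

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra

def decompose(values):
    """Mean-threshold split into object-like (True) and background (False)
    superpixels (Equations 6-7 and 9); the temporal projection of U^{k-1}
    in Equation 7 is omitted for brevity."""
    return values > values.mean()

def interframe_saliency(color_graph, background_idx):
    """Equation 8: saliency of every node of G'^k as its shortest geodesic
    (color-difference) distance to any background superpixel.

    color_graph    : sparse (N, N) matrix of CIE-Lab edge weights for the
                     superpixels of frames F^k and F^{k+1} (see earlier sketch)
    background_idx : indices of the background superpixels B^k and B^{k+1}
    """
    dist = dijkstra(color_graph, directed=False, indices=np.asarray(background_idx))
    sal = dist.min(axis=0)
    return sal / (sal.max() + 1e-12)

def base_skeleton(pairwise_geodesic, object_idx, background_idx):
    """Equation 10: pick the object-like superpixel that is close to the other
    object-like regions but far from the background."""
    object_idx = np.asarray(object_idx)
    background_idx = np.asarray(background_idx)
    far_from_objects = pairwise_geodesic[np.ix_(object_idx, object_idx)].max(axis=1)
    near_background = pairwise_geodesic[np.ix_(object_idx, background_idx)].min(axis=1)
    return object_idx[np.argmin(far_from_objects / (near_background + 1e-12))]
```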
Fig. 5. Illustration of pixel labeling. (a) Input frame F^k. (b) Spatiotemporal saliency map S^k. (c) The regions within the red boundaries are the superpixels with saliency values larger than the adaptive threshold, which are used for establishing the foreground histogram model. The regions between the green boundaries and the red boundaries are used for building the background histogram model. (d) Global appearance models with color histograms H_f and H_b for foreground and background via (c). (e) Probability map for the foreground computed via the global appearance model. (f) Accumulated optical flow gradient magnitude Ê^k for frame F^k yields the trajectory of the object within a few subsequent frames. (g) Dynamic location prior L^k obtained via the intra-frame graph construction method described in Section 3.2. (h) Final segmentation result by Equation 12, which consists of the saliency term (b), the appearance term (e), the location term (g), and two pairwise terms.
After obtaining the base skeleton region (Figure 4-d), we select the other skeleton regions. These regions should be as far away from the background regions B'^k and the previously selected skeleton regions as possible. This induces the skeleton regions to cover object regions that may have different appearances. Therefore, the skeleton regions are selected in a greedy fashion:

O^k \leftarrow O^k \cup \operatorname*{argmax}_{o \in U'^k} \left\{ \min_{o' \in O^k} d_{geo}(o, o', G^k) \cdot \min_{b' \in B'^k} d_{geo}(o, b', G^k) \right\}. \quad (11)

As shown in Figure 4-e, each of the subsequent skeleton regions is selected to maximize its geodesic distance to the background and to the previously selected skeleton regions. This process continues until a small percentage (10%) of the object-like regions U'^k are selected as skeleton. All object-like regions that lie on the shortest geodesic path between the base and subsequently chosen skeleton regions are also selected as skeleton regions. Finally, we increase the saliency values of the skeleton regions (in all experiments 2×), as in Figure 4-h. A quantitative evaluation of the effectiveness and improvement of each step of our saliency scheme is presented in Section 5.4.

4 PIXEL LABELING ENERGY FUNCTION

In the second stage of our segmentation method, we perform binary video segmentation based on the saliency results from Section 3. Separate global appearance models for foreground and background are established using the saliency results. A dynamic location model for each frame is estimated from motion information extracted from subsequent frames. Finally, the spatiotemporal saliency maps, global appearance models and dynamic location models are combined into an energy function for binary segmentation.

We formulate the segmentation task as a pixel labeling problem. Each pixel x^k_i in frame F^k can take a label l^k_i ∈ {0, 1}, where 0 corresponds to background and 1 corresponds to foreground. A labeling L = {l^k_i}_{k,i} of the pixels from all frames represents a partitioning of the entire video. Similar to other segmentation works [7], [50], we define an energy function for the labeling L of all the pixels as

F(L) = \sum_{k,i} U^k(l^k_i) + \lambda_1 \sum_{k,i} A^k(l^k_i) + \lambda_2 \sum_{k,i} L^k(l^k_i) + \lambda_3 \sum_{(i,j) \in N_s} V^k(l^k_i, l^k_j) + \lambda_4 \sum_{(i,j) \in N_t} W^k(l^k_i, l^{k+1}_j), \quad (12)

where the spatial pixel neighborhood N_s consists of the 8 neighboring pixels within the same frame, the temporal pixel neighborhood N_t consists of the forward-backward 9 neighbors in adjacent frames, and i, j are indices of pixels. This energy function consists of three unary terms, U^k, A^k and L^k, and two pairwise terms, V^k and W^k, which depend on the labels of spatially and temporally neighboring pixels.

The purpose of U^k is to evaluate how likely a pixel is foreground according to the spatiotemporal saliency maps computed in the previous step. The unary appearance term A^k encourages labeling pixels that have similar colors according to their global appearance models. The third unary term L^k is for labeling pixels according to the location priors estimated from the dynamic location models. The pairwise terms V^k and W^k encourage spatial and temporal smoothness, respectively. All the terms are described in detail next. The scalar parameters λ weight the various terms and can be set according to the characteristics of the video content. In our implementation, we empirically set λ_1 = 0.5, λ_2 = 0.2, λ_3 = λ_4 = 100.
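A minimal per-frame sketch of how such an energy can be assembled and minimized with a graph cut is given below. It assumes the PyMaxflow library and three per-pixel foreground probability maps feeding the unary terms defined in the following subsections; the uniform pairwise weight stands in for the contrast-modulated Potts terms of Equations 17-19, and the temporal term is omitted, so this is an illustrative simplification rather than the paper's exact implementation.

```python
import numpy as np
import maxflow  # PyMaxflow (an assumed dependency): Boykov-Kolmogorov max-flow/min-cut

def label_frame(saliency, appearance_fg, location, lam1=0.5, lam2=0.2, pairwise_weight=2.0):
    """Hedged per-frame sketch of minimizing Equation 12 with a graph cut.

    saliency, appearance_fg, location : (H, W) foreground probabilities used by the
    unary terms U^k, A^k and L^k (Equations 13, 14 and 16).
    """
    eps = 1e-6
    cost_fg = (-np.log(saliency + eps)
               - lam1 * np.log(appearance_fg + eps)
               - lam2 * np.log(location + eps))          # cost of assigning label 1
    cost_bg = (-np.log(1.0 - saliency + eps)
               - lam1 * np.log(1.0 - appearance_fg + eps)
               - lam2 * np.log(1.0 - location + eps))    # cost of assigning label 0

    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(saliency.shape)
    # Spatial smoothness (4-neighborhood by default; the paper uses 8 neighbors
    # and contrast-modulated weights, Equation 17).
    g.add_grid_edges(nodes, weights=pairwise_weight, symmetric=True)
    g.add_grid_tedges(nodes, cost_bg, cost_fg)   # t-links carry the two unary costs
    g.maxflow()
    # Source-side pixels take label 1; swap the two unary arrays above if the
    # segments come out inverted under a different PyMaxflow convention.
    return ~g.get_grid_segments(nodes)
```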
Having described the separate terms of the complete energy function F below, we use graph-cuts [51] to compute the optimal binary labeling and obtain the final segmentation (Figure 5-h).

Saliency term U^k: The unary saliency term U^k is based on the saliency estimation results and penalizes assignments of pixels with low saliency to the foreground. The term U^k has the following form:

U^k(l^k_i) = \begin{cases} -\log(1 - S^k(x^k_i)) & \text{if } l^k_i = 0, \\ -\log(S^k(x^k_i)) & \text{if } l^k_i = 1. \end{cases} \quad (13)

Appearance term A^k: To model the foreground and background appearance, two weighted color histograms H_f and H_b are computed in RGB color space. Each color channel is uniformly quantized into 10 bins, and there is a total of 10^3 bins in the joint histogram. Each pixel contributes to the histograms H_f and H_b according to its color values with weights S^k(x) and 1 - S^k(x), respectively.

To construct the foreground (background) histogram, we only use pixels from the superpixels that are spatially connected to the former foreground (background) superpixels and have saliency values larger (smaller) than the adaptive threshold, which is defined as the mean value of the spatiotemporal saliency map. This strategy makes better use of the information in the spatiotemporal saliency results and minimizes the adverse effect of background regions with colors similar to the foreground contaminating the foreground histogram (Figure 5-c,e). Finally, the histograms are normalized. Denoting c(x^k_i) as the histogram bin index of the RGB color value at pixel x^k_i, the unary appearance term A^k is defined as:

A^k(l^k_i) = \begin{cases} -\log \dfrac{H_b(c(x^k_i))}{H_f(c(x^k_i)) + H_b(c(x^k_i))} & \text{if } l^k_i = 0, \\[6pt] -\log \dfrac{H_f(c(x^k_i))}{H_f(c(x^k_i)) + H_b(c(x^k_i))} & \text{if } l^k_i = 1. \end{cases} \quad (14)

Location term L^k: For the cases of cluttered scenes and background regions having appearance models similar to the foreground, the object motion consistency provides a valuable prior to locate the areas likely to contain the object. Thus, we estimate the location of the foreground object with respect to motion information from a small number of neighboring frames.

For the k-th frame, we accumulate the optical flow gradient magnitudes within a temporal window of ±t frames to obtain relatively longer term motion information of the foreground regions as

\hat{E}^k = \sum_{i=k-t}^{k+t} E_o^i = \sum_{i=k-t}^{k+t} \lVert \nabla V^i \rVert. \quad (15)

Having a larger temporal window provides some robustness to individual pixel-wise unreliable optical flow estimates. However, it may also cause Ê^k to lose its discriminative power, since the motion cue spans out over too many frames. In our experiments, we set t = 5. We use the intra-frame graph construction described in Section 3.2 to compute a dynamic location model for each frame (Figure 5-f,g). Finally, we determine the location prior L^k for pixel x^k_i and the unary location term L^k as

L^k(l^k_i) = \begin{cases} -\log(1 - L^k(x^k_i)) & \text{if } l^k_i = 0, \\ -\log(L^k(x^k_i)) & \text{if } l^k_i = 1. \end{cases} \quad (16)

Pairwise terms V^k, W^k: These terms impose label smoothness by constraining the segmentation labels to be both spatially and temporally consistent. They are contrast-modulated Potts potentials [7], [22], [50], which favor assigning the same label to neighboring pixels that have similar color. The spatial consistency term V^k computed between spatially adjacent pixels x_i and x_j is defined as

V^k(l^k_i, l^k_j) = \delta(l^k_i, l^k_j) \exp\!\big( -\theta \lVert C(x^k_i) - C(x^k_j) \rVert^2 \big), \quad (17)

where C(x^k_i) is the color vector of the pixel x^k_i and δ(·) denotes the Dirac delta function, which is 0 when l^k_i ≠ l^k_j. The constant θ is chosen [50] to be

\theta = \Big( 2 \sum_{(i,j) \in N_s} \lVert C(x^k_i) - C(x^k_j) \rVert^2 \Big)^{-1}, \quad (18)

to ensure the exponential term in Equation 17 switches appropriately between high and low contrast. Similarly, the temporal consistency term W^k is defined as

W^k(l^k_i, l^{k+1}_j) = \delta(l^k_i, l^{k+1}_j) \exp\!\big( -\theta \lVert C(x^k_i) - C(x^{k+1}_j) \rVert^2 \big). \quad (19)

5 EXPERIMENTAL EVALUATIONS

Even though it is not the ultimate goal of the proposed algorithm, we first evaluate the effectiveness of our spatiotemporal saliency estimation method by comparing against some state-of-the-art saliency methods (Section 5.1). After that, in Section 5.2, we compare both quantitatively and qualitatively our overall segmentation method with several well-known video segmentation approaches. Then we offer a more detailed exploration and dissect various parts of our approach. In Section 5.3, we assess its computational load. In Section 5.4, we investigate the impact of important parameters, verify basic assumptions of the proposed algorithm and evaluate the effectiveness of each step of the proposed framework. In our comparisons, we use the implementations provided by the respective authors and set their free parameters to maximize their performance.

For quantitative and qualitative analyses, we use four benchmark datasets: SegTrack [52], extended SegTrack [53], the Freiburg-Berkeley Motion Segmentation dataset (FBMS) [1] and DAVIS [54]. The SegTrack dataset contains 6 videos in total, where full pixel-level segmentation ground-truth for each frame is available. We follow the common protocol [7], [8], [22] and use 5 video sequences (Birdfall, Cheetah, Girl, Monkeydog and Parachute) for evaluation (the last video, Penguin, is not usable for saliency since only a single penguin is labeled in a colony of penguins).

While the SegTrack dataset is widely popular, the extended SegTrack dataset is more challenging. It was originally introduced for evaluating object tracking algorithms, yet it is also suitable for video object segmentation. The extended SegTrack dataset consists of 8 additional sequences, which have complex backgrounds and varying object motion patterns. We select five sequences (Bird of Paradise, Frog, Monkey, Soldier and Worm), each of which contains a single dominant object.

The FBMS dataset, containing 59 video sequences, is widely used for video segmentation and covers various challenges such as large foreground and background appearance variation, significant shape deformation, and large camera motion.

We also report our performance on the newly developed DAVIS dataset, which is one of the most challenging video segmentation benchmarks. It comprises a total of 50 high-resolution sequences spanning a wide degree of difficulty, such as occlusions, fast motion and appearance changes.
Fig. 6. Comparison with 6 alternative saliency detection methods using the SegTrack dataset [52] (top), the extended SegTrack dataset [53] (middle) and the FBMS dataset [1] (bottom) with pixel-level ground-truth: (a) average precision-recall curve obtained by segmenting saliency maps using fixed thresholds, (b) F-score, (c) average MAE. Notice that our algorithm significantly outperforms the other methods in terms of the precision-recall curve and F-score. Our method achieves more than 75% improvement over the best previous method in terms of MAE.
5.1 Evaluation of Spatiotemporal Saliency

Since spatiotemporal saliency detection is an important step of our video segmentation approach, we assess its performance against existing saliency methods. Using the original implementations obtained from the corresponding authors, we make comparisons with 6 alternative approaches, including the manifold ranking saliency model (MR) [39], saliency filter (SF) [55], self-resemblance based saliency (SS) [15], saliency via quaternion Fourier transform (QS) [33], cluster-based co-saliency (CS) [42], and space-time saliency for time-mapping (TS) [43]. The former two of these methods aim at image saliency, while the latter four are designed for video saliency.

We report results on three widely used performance measures, including the precision-recall (PR) curve, F-score [34], and MAE (mean absolute error). Precision is the fraction of correctly labeled foreground pixels among all pixels labeled as foreground by the algorithm, while recall is the fraction of correctly labeled foreground pixels among the ground-truth foreground pixels. We generate binary saliency maps from each method and plot the corresponding PR curves by varying the operating point threshold.

In general, a high recall response may come at the expense of reduced precision, and vice versa. Therefore, we also estimate the F-score, which evaluates precision and recall simultaneously and is defined as

\text{F-score} = \frac{(1 + \beta^2) \times \text{precision} \times \text{recall}}{\beta^2 \times \text{precision} + \text{recall}}, \quad (20)

where we set β² to 0.3 to assign a higher importance to precision, as suggested in [34].
Fig. 7. Qualitative comparison against the state-of-the-art methods on the SegTrack benchmark videos [52], the extended SegTrack [53] sequences and the FBMS dataset [1] with pixel-level ground-truth labels. Our saliency method renders the entire objects as salient in complex scenarios, yielding continuous saliency maps that are most similar to the ground-truth.
For a complete analysis, we follow [55] and evaluate the mean absolute error (MAE) between a real-valued saliency map S and the binary ground-truth G over all image pixels:

\text{MAE} = \frac{1}{N} \sum_{x} \lvert S(x) - G(x) \rvert, \quad (21)

where N is the number of pixels. The MAE estimates the degree of approximation between the saliency map and the ground-truth map, and it is normalized to [0, 1]. The MAE provides a better estimate of conformity between estimated and ground-truth maps.
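As a reference implementation of these measures (a generic sketch, not code from the paper), precision, recall, the F-score of Equation 20 with β² = 0.3, and the MAE of Equation 21 can be computed per frame as follows.

```python
import numpy as np

def saliency_metrics(saliency, gt, threshold=0.5, beta2=0.3):
    """Precision, recall, F-score (Eq. 20) and MAE (Eq. 21) for one frame.

    saliency : (H, W) real-valued saliency map in [0, 1]
    gt       : (H, W) binary ground-truth mask
    """
    pred = saliency >= threshold                 # binarize at one operating point
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    fscore = ((1 + beta2) * precision * recall /
              max(beta2 * precision + recall, 1e-12))
    mae = np.abs(saliency - gt.astype(np.float64)).mean()   # Eq. 21
    return precision, recall, fscore, mae
```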
Fig. 8. Our segmentation results on the SegTrack [52] (Cheetah and Girl) and the extended SegTrack dataset [53] (Bird of Paradise, Monkey and
Soldier ) with pixel-level ground-truth masks. The pixels within the green boundaries are segmented as foreground.
The precision-recall curves of all methods are reported in Figure 6-a. As shown, our method significantly outperforms the state-of-the-art. The minimum recall value in these curves can also be regarded as an indicator of robustness. A high precision score at the minimum recall value means a good separation between the foreground and background confidence values, as most of the high-confidence saliency values (close to 1) fall on the correct foreground object.

As can be seen, when the threshold is close to 255, the recall scores of the other saliency models become very small, and the recall scores of SS [15] and QS [33] shrink to 0. This is because those saliency maps do not correspond to the ground-truth objects. To our advantage, the minimum recall of our method does not drop to 0. This demonstrates that our saliency maps align better with the correct objects. In addition, our saliency method achieves the best precision rates, above 0.9, which shows that it is more precise and responsive to the actual salient information. Similar conclusions can be drawn from the F-score, as shown in Figure 6-b. Our F-score is well above the performance of the other methods. The MAE results are presented in Figure 6-c. As shown, our saliency maps successfully reduce the MAE by almost 75% compared to the second best method (which is SF [55]). In summary, our method consistently produces superior results.

Figure 7 shows a qualitative comparison of different methods, where brighter pixels indicate higher saliency probabilities.
Fig. 9. Our segmentation results on the FBMS [1] (Horse and Camel) and the DAVIS dataset [54] (Kite-walk, Mallard-fly and Parkour ) with pixel-level
ground-truth masks. The pixels within the green boundaries are segmented as foreground.
It is observed that image saliency methods (MR [39], SF [55]) applied independently to each frame produce unstable outputs; some saliency maps even completely miss the foreground object, mainly because temporal coherence in video conveys important information for identifying salient objects. In contrast, the video saliency methods SS [15], QS [33], CS [42], and TS [43] perform relatively better as they utilize motion information. However, saliency maps from previous video saliency models are often generated at lower pixel precision and tend to assign lower foreground probabilities to pixels inside the salient objects. This is due to the fact that optical flow estimations are unreliable.

Based on the above, we draw two important conclusions: (1) motion information gives effective guidance for detecting the foreground object; (2) making methods rely heavily on motion information is not the optimal choice. Comprehensive utilization of various features in spatial and temporal space (color, edges, motion, etc.)
TABLE 1
Average per-frame pixel error on the SegTrack dataset [52] compared to the ground-truth. Lower values are better. The best and the second best results are boldfaced and underlined, respectively.

                          unsupervised                                          |      supervised
video       frames |  Ours    [1]    [7]    [8]    [9]   [23]   [22]   [28]  [53] |  [56]   [52]   [57]
Birdfall      30   |   140    217    288    468    155    606    189    144   199 |   468    252    454
Cheetah       29   |   622    890    905   1175    633  11210    806    617   599 |  1968   1142   1217
Girl          21   |   991   3859   1785   5683   1488  26409   1698   1195  1164 |  7595   1304   1755
Monkeydog     71   |   350    284    521   1434    365  12662    472    354   322 |  1434    563    683
Parachute     51   |   195    855    201   1595    220  40251    221    200   242 |  1113    235    502
Avg.          --   |   459   1221    740   2071    572  18228    677    502   505 |  2516    699    922
TABLE 2
IoU scores on the SegTrack dataset [52] and the extended SegTrack dataset [53] compared to the ground-truth. Higher values are better. The best and the second best results are boldfaced and underlined, respectively.

                                              unsupervised                       |       supervised
dataset            video            frames |  Ours   [7]    [9]   [22]   [26]   [28]   [24] |  [58]   [59]   [60]   [61]
SegTrack           Birdfall           30   |  74.5  48.7   71.4   37.4   72.5   73.2   57.4 |  78.7   57.4   56.0   32.5
                   Cheetah            29   |  64.3  43.4   58.8   40.9   61.2   64.2   24.4 |  66.1   33.8   46.1   33.1
                   Girl               21   |  88.7  77.5   81.9   71.2   86.4   86.7   31.9 |  84.6   87.9   53.6   52.4
                   Monkeydog          71   |  78.0  64.3   74.2   73.6   74.0   76.1   68.3 |  82.2   54.4   61.0   22.1
                   Parachute          51   |  94.8  94.3   93.9   88.1   95.9   94.6   69.1 |  94.4   94.5   85.6   69.9
Extended SegTrack  Bird of Paradise   98   |  94.5  22.3   35.2   85.4   90.0   93.9   86.8 |  93.0   95.2    5.1   44.3
                   Frog              279   |  83.3  71.0   76.3   69.4   80.2   81.5   67.1 |  56.3   81.4   14.5   45.2
                   Monkey             31   |  84.1  38.6   61.4   69.6   83.1   63.9   61.9 |  86.0   88.6   73.1   61.7
                   Soldier            32   |  79.2  10.0   51.4   47.4   76.3   36.8   66.5 |  81.1   86.4   70.7   43.0
                   Worm              243   |  74.8  40.5   53.9   73.0   82.4   61.7   34.7 |  79.3   89.6   36.8   27.4
                   Avg.               --   |  81.6  51.1   65.8   65.6   80.2   73.3   56.8 |  80.1   76.9   50.2   43.1
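For reference, the intersection-over-union (IoU) score reported in Tables 2 and 4 can be computed per frame as below; this is the generic Jaccard definition, not code taken from the paper.

```python
import numpy as np

def iou_score(pred_mask, gt_mask):
    """Intersection-over-union (Jaccard) between a predicted foreground mask
    and the ground-truth mask, both given as (H, W) binary arrays."""
    pred_mask = pred_mask.astype(bool)
    gt_mask = gt_mask.astype(bool)
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0:          # both masks empty: define the score as 1
        return 1.0
    return np.logical_and(pred_mask, gt_mask).sum() / union
```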
TABLE 4
IoU scores on a representative subset of the DAVIS dataset [54], and
the average computed over the 50 video sequences. Higher values are
better. The best and the second best results are boldfaced and
underlined, respectively.
sequences of the DAVIS dataset. As may be seen in Table 4, our method still performs comparably to or better than other concurrent approaches.

Representative pixel labeling results are shown in Figure 8 (SegTrack, the extended SegTrack) and Figure 9 (FBMS, DAVIS). Our method has the ability to segment objects with fast motion patterns (Cheetah and Horse) or large shape deformation (Parkour). It produces accurate segmentation maps even when the foreground undergoes appearance changes (Mallard-fly), contains various motion patterns (Soldier), or has color cues similar to the background (Monkey). In contrast, existing approaches [7], [9], [22] either mislabel background pixels as foreground or miss foreground pixels. In our experiments we observed that target foregrounds in various scenarios can be segmented accurately by our algorithm.

Fig. 10. Computational load of our method and the state-of-the-art for 340×240 video. (a) Execution time of the video saliency estimation stage compared against other video saliency methods [15], [33], [42], [43]. (b) Execution time of the overall method compared against other video segmentation methods [7], [9], [22]. (c) Execution time of each intermediate step. Step1 and Step2 are the saliency estimations via the intra-frame graph and the inter-frame graph, respectively. Step3 is the final saliency step.

The execution time of each part of our whole scheme is shown in Figure 10-c. The whole segmentation pipeline takes about 3.5 seconds per frame, where over 60% of the runtime is spent on the edge generation [46]. Saliency detection takes a total of 1.2 seconds: 0.38 seconds for computing the saliency via the intra-frame graph (Step1), 0.59 seconds for improving the saliency results via the inter-frame graph (Step2), and 0.23 seconds for generating the final saliency via abstracting skeleton regions (Step3).
Fig. 11. Parameter selection for the number of superpixels K using (a) the SegTrack dataset and (b) the extended SegTrack dataset. The MAE of the saliency results and the IoU score of the segmentation results are plotted as functions of a variety of values of K.
TABLE 5
Validation of spatiotemporal edge generation on the SegTrack dataset
[52] and extended SegTrack dataset [53].
Fig. 13. Video object segmentation and salient object detection results under object occlusion. (a) For objects with partial occlusions, the proposed method can still produce a reliable spatiotemporal saliency prior and generate accurate segments. (b) When heavy object occlusions occur, the proposed method may suffer difficulties, since it insists on finding a salient object, following the basic assumption of saliency detection.
6 DISCUSSIONS AND LIMITATIONS

The proposed algorithm has a few limitations. Its performance is limited by the accuracy of the spatiotemporal saliency estimation. Saliency estimation is the cornerstone of our method for determining where the primary object is. If the importance analysis is misleading, it might negatively affect the segmentation results. For example, our spatiotemporal saliency method may not be well suited to scenes that have multiple salient objects or a primary foreground object that occupies a large portion of the image. In these scenarios, it is likely to produce sub-optimal results, as the underlying assumption of saliency detection is that only a part of the scene attracts most of the human attention. In our approach, we formulate the local dynamic location prior and the global appearance information in the proposed segmentation energy function (Equation 12), which alleviates this problem.

Another difficulty for the current method is handling objects with occlusion, which is a common challenge in video segmentation. As the proposed spatiotemporal saliency prior relies on object continuity between adjacent frames, it is able to handle common scenarios with small or short occlusions in a bottom-up fashion (Figure 13-a). In some extremely difficult scenarios with complete occlusions, such as the bmx sequence in Figure 13-b, the proposed method may still locate a part of the scene as a salient region even though the object has been occluded. This follows from the basic assumption of saliency detection that an important object should exist. One promising direction for improving the segmentation is the use of long-range connectivity of objects, such as motion trajectories. Other advances may come from adopting occlusion-aware tracking techniques or from developing more powerful representations beyond regions, such as supervoxels and video object proposals.

7 CONCLUSIONS AND FUTURE WORK

We have presented an unsupervised approach that incorporates geodesic distance into saliency-aware video object segmentation. As opposed to traditional video segmentation methods that rely heavily on cumbersome object inference and motion analysis, our method emphasizes the importance of video saliency, which offers strong and reliable cues for pixel labeling of foreground video objects.

The proposed method incorporates static edge and motion boundary information into a spatiotemporal edge map and uses the geodesic distance on intra-frame and inter-frame graphs to measure the saliency score of each superpixel. In the intra-frame graph, the geodesic distance between a superpixel and the frame boundary is exploited to estimate the foreground probability. In the inter-frame graph, the geodesic distance to the estimated background is utilized to update the spatiotemporal saliency map for each pair of adjacent frames. The geodesic distance is also employed to extract the base and supporting foreground superpixels in the skeleton abstraction step to further enhance the saliency scores. In the pixel labeling stage, an energy function that combines global appearance models, dynamic location models and spatiotemporal saliency maps is defined and efficiently minimized via graph-cuts to obtain the final segmentation results.

We have evaluated our method on four benchmarks, namely SegTrack [52], extended SegTrack [53], FBMS [1] and DAVIS [54]. The extensive experimental evaluations show that our approach can generate high quality saliency maps in a relatively short time and achieves consistently higher performance scores than many other existing methods. Compared with other video segmentation methods, our approach generates both quantitatively and qualitatively superior segmentation results.

For future work, we will apply the proposed approach to other applications, such as video resizing, video summarization, and video compression. Additionally, our work provides important hints toward combining the spatiotemporal saliency prior with more effective video representations, such as trajectories and supervoxels.
REFERENCES
[1] T. Brox and J. Malik, “Object segmentation by long term analysis of point trajectories,” in European Conference on Computer Vision (ECCV), 2010.
[2] J. Lezama, K. Alahari, J. Sivic, and I. Laptev, “Track to the future: Spatio-temporal video segmentation with long-range motion cues,” in Computer Vision and Pattern Recognition (CVPR), 2011.
[3] K. Fragkiadaki, G. Zhang, and J. Shi, “Video segmentation by tracing discontinuities in a trajectory embedding,” in Computer Vision and Pattern Recognition (CVPR), 2012.
[4] W. Brendel and S. Todorovic, “Video object segmentation by tracking regions,” in Computer Vision (ICCV), 2009.
[5] E. Rahtu, J. Kannala, M. Salo, and J. Heikkilä, “Segmenting salient objects from images and videos,” in European Conference on Computer Vision (ECCV), 2010.
[6] C. Xu, C. Xiong, and J. J. Corso, “Streaming hierarchical video segmentation,” in European Conference on Computer Vision (ECCV), 2012.
[7] Y. J. Lee, J. Kim, and K. Grauman, “Key-segments for video object segmentation,” in Computer Vision (ICCV), 2011.
[8] T. Ma and L. J. Latecki, “Maximum weight cliques with mutex constraints for video object segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2012.
[9] D. Zhang, O. Javed, and M. Shah, “Video object segmentation through spatially accurate and temporally dense extraction of primary object regions,” in Computer Vision and Pattern Recognition (CVPR), 2013.
[10] I. Endres and D. Hoiem, “Category independent object proposals,” in European Conference on Computer Vision (ECCV), 2010.
[11] B. Alexe, T. Deselaers, and V. Ferrari, “What is an object?” in Computer Vision and Pattern Recognition (CVPR), 2010.
[12] J. Carreira and C. Sminchisescu, “Constrained parametric min-cuts for automatic object segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2010.
[13] D. Gao, V. Mahadevan, and N. Vasconcelos, “The discriminant center-surround hypothesis for bottom-up saliency,” Advances in Neural Information Processing Systems, pp. 497–504, 2008.
[14] V. Mahadevan and N. Vasconcelos, “Spatiotemporal saliency in dynamic scenes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 171–177, Jan 2010.
[15] H. J. Seo and P. Milanfar, “Static and space-time visual saliency detection by self-resemblance,” Journal of Vision, vol. 9, no. 12, p. 15, 2009.
[16] X. Bai and G. Sapiro, “A geodesic framework for fast interactive image and video segmentation and matting,” in Computer Vision (ICCV), 2007.
[17] B. Price, B. Morse, and S. Cohen, “Geodesic graph cut for interactive image segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2010.
[18] A. Criminisi, T. Sharp, and A. Blake, “GeoS: Geodesic image segmentation,” in European Conference on Computer Vision (ECCV), 2008.
[19] A. Criminisi, T. Sharp, C. Rother, and P. Perez, “Geodesic image and video editing,” ACM Transactions on Graphics, vol. 29, no. 5, 2010.
[20] W. Wang, J. Shen, and F. Porikli, “Saliency-aware geodesic video object segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2015.
[21] P. Ochs and T. Brox, “Object segmentation in video: A hierarchical variational approach for turning point trajectories into dense regions,” in Computer Vision (ICCV), 2011.
[22] A. Papazoglou and V. Ferrari, “Fast object segmentation in unconstrained video,” in Computer Vision (ICCV), 2013.
[23] P. Ochs and T. Brox, “Higher order motion models and spectral clustering,” in Computer Vision and Pattern Recognition (CVPR), 2012.
[24] M. Grundmann, V. Kwatra, M. Han, and I. Essa, “Efficient hierarchical graph-based video segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2010.
[25] K. Fragkiadaki, P. Arbeláez, P. Felsen, and J. Malik, “Learning to segment moving objects in videos,” in Computer Vision (ICCV), 2015.
[26] F. Xiao and Y. J. Lee, “Track and segment: An iterative unsupervised approach for video object proposals,” in Computer Vision and Pattern Recognition (CVPR), 2016.
[27] A. Faktor and M. Irani, “Video segmentation by non-local consensus voting,” in British Machine Vision Conference (BMVC), 2014.
[28] W.-D. Jang, C. Lee, and C.-S. Kim, “Primary object segmentation in videos via alternate convex optimization of foreground and background distributions,” in Computer Vision and Pattern Recognition (CVPR), 2016.
[29] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.
[30] J. Harel, C. Koch, and P. Perona, “Graph-based visual saliency,” Advances in Neural Information Processing Systems, pp. 545–552, 2006.
[31] A. Borji, D. Sihite, and L. Itti, “Probabilistic learning of task-specific visual attention,” in Computer Vision and Pattern Recognition (CVPR), 2012.
[32] X. Hou and L. Zhang, “Saliency detection: A spectral residual approach,” in Computer Vision and Pattern Recognition (CVPR), 2007.
[33] C. Guo, Q. Ma, and L. Zhang, “Spatio-temporal saliency detection using phase spectrum of quaternion Fourier transform,” in Computer Vision and Pattern Recognition (CVPR), 2008.
[34] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, “Frequency-tuned salient region detection,” in Computer Vision and Pattern Recognition (CVPR), 2009.
[35] S. Goferman, L. Zelnik-Manor, and A. Tal, “Context-aware saliency detection,” in Computer Vision and Pattern Recognition (CVPR), 2010.
[36] D. Klein and S. Frintrop, “Center-surround divergence of feature statistics for salient object detection,” in Computer Vision (ICCV), 2011.
[37] M. Cheng, J. Warrell, and W. Lin, “Efficient salient region detection with soft image abstraction,” in Computer Vision (ICCV), 2013.
[38] Y. Wei, F. Wen, W. Zhu, and J. Sun, “Geodesic saliency using background priors,” in European Conference on Computer Vision (ECCV), 2012.
[39] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, “Saliency detection via graph-based manifold ranking,” in Computer Vision and Pattern Recognition (CVPR), 2013.
[40] W. Zhu, S. Liang, Y. Wei, and J. Sun, “Saliency optimization from robust background detection,” in Computer Vision and Pattern Recognition (CVPR), 2014.
[41] D. Gao and N. Vasconcelos, “Bottom-up saliency is a discriminant process,” in Computer Vision and Pattern Recognition (CVPR), 2007.
[42] H. Fu, X. Cao, and Z. Tu, “Cluster-based co-saliency detection,” IEEE Transactions on Image Processing, vol. 22, no. 10, pp. 3766–3778, Oct 2013.
[43] F. Zhou, S. B. Kang, and M. F. Cohen, “Time-mapping using space-time saliency,” in Computer Vision and Pattern Recognition (CVPR), 2014.
[44] A. M. Treisman and G. Gelade, “A feature-integration theory of attention,” Cognitive Psychology, vol. 12, no. 1, pp. 97–136, 1980.
[45] P. Mital, T. J. Smith, S. Luke, and J. Henderson, “Do low-level visual features have a causal influence on gaze during dynamic scene viewing?” Journal of Vision, vol. 13, no. 9, pp. 144–144, 2013.
[46] M. Leordeanu, R. Sukthankar, and C. Sminchisescu, “Efficient closed-form solution to generalized boundary detection,” in European Conference on Computer Vision (ECCV), 2012.
[47] T. Brox and J. Malik, “Large displacement optical flow: Descriptor matching in variational motion estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 3, pp. 500–513, March 2011.
[48] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2274–2282, 2012.
[49] D. B. Johnson, “Efficient algorithms for shortest paths in sparse networks,” Journal of the ACM, vol. 24, no. 1, pp. 1–13, Jan. 1977.
[50] C. Rother, V. Kolmogorov, and A. Blake, “GrabCut: Interactive foreground extraction using iterated graph cuts,” ACM Transactions on Graphics, vol. 23, no. 3, pp. 309–314, 2004.
[51] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph cuts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 12, pp. 1222–1239, 2001.
[52] D. Tsai, M. Flagg, A. Nakazawa, and J. M. Rehg, “Motion coherent tracking using multi-label MRF optimization,” International Journal of Computer Vision, vol. 100, no. 2, pp. 190–202, 2012.
[53] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg, “Video segmentation by tracking many figure-ground segments,” in Computer Vision (ICCV), 2013.
[54] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung, “A benchmark dataset and evaluation methodology for video object segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2016.
[55] F. Perazzi, P. Krahenbuhl, Y. Pritch, and A. Hornung, “Saliency filters: Contrast based filtering for salient region detection,” in Computer Vision and Pattern Recognition (CVPR), 2012.
[56] O. Barnich and M. Van Droogenbroeck, “ViBe: A universal background subtraction algorithm for video sequences,” IEEE Transactions on Image Processing, vol. 20, no. 6, pp. 1709–1724, June 2011.
[57] P. Chockalingam, N. Pradeep, and S. Birchfield, “Adaptive fragments-based tracking of non-rigid objects using level sets,” in Computer Vision (ICCV), 2009.
[58] L. Wen, D. Du, Z. Lei, S. Z. Li, and M.-H. Yang, “JOTS: Joint online tracking and segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2015.
[59] Y.-H. Tsai, M.-H. Yang, and M. J. Black, “Video segmentation via object flow,” in Computer Vision and Pattern Recognition (CVPR), 2016.
[60] M. Godec, P. M. Roth, and H. Bischof, “Hough-based tracking of non-rigid objects,” in Computer Vision (ICCV), 2011.
[61] S. Wang, H. Lu, F. Yang, and M.-H. Yang, “Superpixel tracking,” in Computer Vision (ICCV), 2011.
[62] B. Taylor, V. Karasev, and S. Soatto, “Causal video object segmentation from persistence of occlusions,” in Computer Vision and Pattern Recognition (CVPR), 2015.

Wenguan Wang received the B.S. degree in computer science and technology from the Beijing Institute of Technology in 2013. He is currently working toward the Ph.D. degree in the School of Computer Science, Beijing Institute of Technology, Beijing, China. His current research interests include salient object detection and object segmentation for image and video.

Ruigang Yang received the MS degree from Columbia University in 1998 and the PhD degree from the University of North Carolina, Chapel Hill in 2003. He is currently a full professor of Computer Science at the University of Kentucky. His research interests span computer vision and computer graphics, in particular 3D reconstruction and 3D data analysis. He has published more than 100 papers, which, according to Google Scholar, have received close to 6,000 citations with an h-index of 37 (as of 2014). He has received a number of awards, including the US National Science Foundation Faculty Early Career Development (CAREER) Program Award in 2004, the Best Demonstration Award at CVPR 2007, and the Dean's Research Award at the University of Kentucky in 2013. He is currently an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence and a senior member of the IEEE.

Fatih Porikli is an IEEE Fellow and a Professor with the Research School of Engineering, Australian National University, Canberra, ACT, Australia. He is also acting as the Computer Vision Group Leader at NICTA, Australia. He received his PhD from New York University (NYU), New York, in 2002. Previously, he served as a Distinguished Research Scientist at Mitsubishi Electric Research Laboratories (MERL), Cambridge. Before joining MERL in 2000, he developed satellite imaging solutions at HRL, Malibu, CA, and 3D display systems at AT&T Research Laboratories, Middletown, NJ. He has contributed broadly to object and motion detection, tracking, image-based representations, and video analytics. Prof. Porikli was the recipient of the R&D 100 Scientist of the Year Award in 2006. He has won four best paper awards at premier IEEE conferences, including the Best Paper Runner-Up at IEEE CVPR in 2007, the Best Paper at the IEEE Workshop on Object Tracking and Classification beyond Visible Spectrum (OTCBVS) in 2010, the Best Paper at IEEE AVSS in 2011, and the Best Poster Award at IEEE AVSS in 2014. He has received five other professional prizes. Prof. Porikli has authored more than 130 publications and invented 61 patents. He has served as an associate editor of five premier journals for the past eight years, including IEEE Signal Processing Magazine (impact rate 6.0) and SIAM Journal on Imaging Sciences (rank 2 out of 236 applied math journals). He has served on the organizing committees of several flagship conferences, including ICCV, ECCV, CVPR, ICIP, AVSS, ICME, ISVC, and ICASSP. He served as Area Chair of ICCV 2015, CVPR 2009, ICPR 2010, IV 2008, and ICME 2006. He served as a judge on several NSF panels from 2006 to 2013 and has given keynotes at various symposiums.