A Unified Spatiotemporal Prior Based On Geodesic Distance For Video Object Segmentation
Abstract—Video saliency, aiming for estimation of a single dominant object in a sequence, offers strong object-level cues for
unsupervised video object segmentation. In this paper, we present a geodesic distance based technique that provides reliable and
temporally consistent saliency measurement of superpixels as a prior for pixel-wise labeling. Using undirected intra-frame and
inter-frame graphs constructed from spatiotemporal edges of appearance and motion, and a skeleton abstraction step to further
enhance saliency estimates, our method formulates the pixel-wise segmentation task as an energy minimization problem on a function
that consists of unary terms of global foreground and background models, dynamic location models, and pairwise terms of label
smoothness potentials. We perform extensive quantitative and qualitative experiments on benchmark datasets. Our method achieves
superior performance in comparison to the current state-of-the-art in terms of accuracy and speed.
Index Terms—Video saliency, video object segmentation, geodesic distance, spatiotemporal prior.
1 INTRODUCTION
Fig. 1. Overview of our video object segmentation framework. The input frame is over-segmented into superpixels, and a spatiotemporal edge map is produced by combining the static edge probability with the optical flow gradient magnitude. For each superpixel, we compute its object probability and the refined saliency estimate via the intra-frame graph and the inter-frame graph, respectively. An object skeleton abstraction method is further derived to obtain the final saliency estimates by biasing the central skeleton regions with higher saliency values. Finally, we combine the spatiotemporal saliency priors, global appearance models and dynamic location models, using motion information from a few subsequent frames, to produce the pixel-wise segmentation.
feature as another cue within their image saliency models [13], [14], [15], lacking an elegant framework to incorporate intra-frame and inter-frame information in a unified fashion.

In this paper, we aim to partition the foreground objects from their backgrounds in all frames of a given video sequence without any user assistance or contextual assumptions. To this end, we propose a video object segmentation method that consists of a superpixel based spatiotemporal saliency prior detection stage and a pixel based binary labeling stage that runs in a recursive fashion.

Our proposed video segmentation framework is depicted in Figure 1. We first introduce a unified spatiotemporal saliency prior that combines motion boundaries and spatial edges into a unified model designed to align with object boundaries, yielding a simple yet powerful representation. Our model offers a reliable and temporally consistent region-level prior for object segmentation by employing psychophysically motivated low-level features that incorporate spatial stimulus and temporal coherence. Saliency of a region is measured by its shortest geodesic distance to background regions in two graphs, an inter-frame and an intra-frame graph, which are constructed from the intensity edge and motion boundary cues, as well as the background information across adjacent frames. The geodesic distance has the power of abstracting object structure to efficiently determine its central regions by assigning higher saliency values to more representative regions. It has been shown to be effective for supervised segmentation where the user provides seeds [16], [17], [18], [19]. Such user interaction, however, is impractical in streaming video applications. Our method extends the geodesic distance to unsupervised video segmentation. Hence, we design a skeleton abstraction method that explicitly incorporates weak object structure and emphasizes the saliency values of the central skeleton regions based on geodesic distances. After obtaining the spatiotemporal saliency prior, we integrate saliency cues, dynamic location models as well as global appearance models into an energy minimization that is optimized via graph-cuts to generate highly accurate foreground object boundaries over the entire video sequence.

To summarize, our main contributions are:
• A unified framework that incorporates video saliency for unsupervised pixel-wise labeling of foreground objects, using an energy function that contains three unary and two pairwise terms (Section 4).
• A new formulation for spatiotemporal saliency by exploiting intra-frame and inter-frame relevancy via undirected graphs on superpixels. For the intra-frame stimulus, we employ the geodesic distance on spatiotemporal edges within a single frame. We construct the inter-frame graph for temporal coherence between consecutive frames (Section 3).
• A geodesic distance based weighting of intra-frame and inter-frame graphs, based on the observation that salient regions have higher geodesic distances to background regions (Sections 3.1 and 3.3).
• A greedy skeleton abstraction scheme for iteratively selecting confident foreground regions (Section 3.4).

Our method achieves state-of-the-art performance on four large benchmarks. This paper builds upon and extends our recent work in [20] with a more in-depth discussion of the algorithm and expanded evaluation results. We further introduce a new geodesic distance based skeleton region abstraction method that regularizes the original object regions with higher saliency.

The remainder of this paper is organized as follows: An overview of the related work is given in Section 2. The spatiotemporal saliency stage is explained in Section 3. Intermediate processes of the video segmentation method are articulated in Section 4. Experimental evaluations are presented in Section 5. Discussions and limitations are given in Section 6. Concluding remarks can be found in Section 7.

2 RELATED WORK

In this section, we give a brief overview of recent works in unsupervised video segmentation and saliency detection.

2.1 Unsupervised Video Segmentation

A variety of techniques have been proposed for unsupervised video segmentation in the past decade. Most approaches are based on bottom-up models using low-level features such as motion, color, and edge orientation.
In particular, the importance of motion information was emphasized in many works [1], [2], [3], [21], [22], [23]. While the use of short-duration motion boundaries in pairs of subsequent frames is not uncommon [22], several methods [1], [2], [3], [21], [23] argued that motion should be analyzed over longer periods, as such long-term analysis is able to decrease the intra-object variance of motion relative to the inter-object variance and to propagate motion information to frames in which the object remains static. For this, [2] grouped pixels with coherent motion computed via long-range motion vectors from the past and future frames. Similarly, the work in [1] offered a framework for trajectory-based video segmentation through building an affinity matrix between pairs of trajectories. In [3], discontinuities of embedding density between spatially neighboring trajectories were detected. Incorporating higher order motion models, a clustering method for point tracks was proposed in [23]. In general, motion based methods suffer difficulties when different parts of an object exhibit nonhomogeneous motion patterns. This problem is exacerbated further by the absence of a strong object prior. Moreover, these approaches require careful selection of a suitable model, especially for the trajectory clustering process, which often comes with a high computational complexity, as [7] pointed out.

There were previous efforts [4], [5], [6], [24] that presented optimization frameworks for bottom-up segmentation employing both appearance and motion cues. Several methods [7], [8], [9], [25], [26] proposed to select primary object regions in the object proposal domain based on the notion of what a generic object looks like. These approaches benefit from the work on object hypothesis proposals [10], [11], [12], which offer a large number of object candidates in every frame. Therefore, segmenting the video object is transformed into an object region selection problem. In this selection process, both motion and appearance cues are used to measure the objectness of a proposal. More specifically, a clustering process was introduced for finding objects by [7], a constrained maximum weight cliques technique was imposed to model the selection process [8], and a layered directed acyclic graph based framework was presented by [9]. The work of [25] segmented moving objects by ranking spatiotemporal segment proposals with a moving objectness detector trained on image and motion fields. In [26], tracking and segmentation were integrated into a unified framework to detect the primary object proposal and handle the video segmentation task. The main drawbacks of the proposal based algorithms are their high computational cost associated with proposal generation and their complicated object inference schemes.

Recently, some methods [27], [28] were proposed to exploit temporal correlations over the entire video, which produce globally optimal segments.

2.2 Saliency Detection for Image and Video

The human visual system is remarkably effective in localizing the regions in a scene that stand out from their neighbors. Saliency detection [29] is originally a task of simulating the human visual system for predicting scene locations where a human observer may fixate. Recent research has shown that extracting salient objects or regions is more useful and beneficial to a wide range of computer vision applications. The output of salient object detection is usually a saliency map where the intensity of each pixel represents the probability of that pixel belonging to the salient object.

Saliency detection methods in general can be categorized as either bottom-up or top-down approaches. Top-down approaches [30], [31] are goal-directed and require an explicit understanding of the context of the image. Supervised learning with a specific class is therefore a frequently adopted principle. Most of the saliency detection methods are based on bottom-up visual attention mechanisms, which are independent of the knowledge of the content in the image and utilize various low-level features, such as intensity, color and orientation.

Inspired by visual perception studies that indicate contrast is a major factor in visual attention mechanisms, numerous bottom-up models have been proposed based on different mathematical formulations of contrast. Many methods assumed that globally infrequent features are more salient, and frequency analysis is carried out in the spectral domain. For example, [32] proposed a saliency detection algorithm using spectral residuals based on the log spectra representation of images. The phase spectrum of the Fourier transform is considered to be the key element in obtaining the location of salient regions in [33]. Later, [34] introduced a frequency-tuned approach to estimate center-surround contrast using color and luminance features. Other methods attempted to detect saliency in the spatial domain, usually adopting several visual cues. A graph-based dissimilarity measure was used in [30]. In [35], a content-aware saliency detection method that considers the contrast from both local and global perspectives was built. [36] presented a framework for saliency detection based on the efficient fusion of different feature channels and the local center-surround hypothesis. In [37], two saliency indicators, global appearance contrast and spatially compact distribution, were considered. Recently, several approaches [38], [39], [40] exploited background information, called the boundary prior. These methods use image boundaries as background, further enhancing saliency computation.

While image saliency detection has been extensively studied, computing spatiotemporal saliency for videos is a relatively new problem. Different from image saliency detection, moving objects catch more of the attention of human beings than static ones, even if the static objects have large contrast to their neighbors. In other words, motion is the most important cue for video saliency detection, which makes deeper exploration of the inter-frame information crucial. Gao et al. [13] extended their image saliency model [41] by adding the motion channel for prediction of human eye fixations in dynamic scenes based on the center-surround hypothesis. Similarly, Mahadevan et al. [14] combined center-surround saliency with dynamic textures for spatiotemporal saliency using the saliency model in [41]. In [15], Seo et al. computed so-called local regression kernels from the given video, measuring the likeness of a pixel (or voxel) to its surroundings. They extended their model to video saliency detection straightforwardly by extracting a feature vector from each spatiotemporal 3-D cube. Recently, [5] used a statistical framework and local feature contrast in illumination, color, and motion for formulating final saliency maps. [42] proposed a cluster-based saliency method, where three visual attention cues, contrast, spatial, and global correspondence, are devised to measure the cluster saliency. [43] adopted space-time saliency to generate a low-frame-rate video from a high-frame-rate input using various low-level features and region-based contrast analysis.

It can be seen that video saliency detection is still an emerging and challenging research problem to be further investigated. The existing methods, however, usually build their systems with a simple combination of image saliency models with motion cues, lacking an efficient framework to fully explore intra-frame and inter-frame information together.
Fig. 2. Overview of our geodesic distance based spatiotemporal saliency prior. (a) Input frame F^k. (b) Oversegmentation of F^k into superpixels Y^k. (c) Spatial edge probability map E_c^k of F^k. (d) Gradient magnitude E_o^k of the optical flow of F^k. (e) Superpixel-wise spatiotemporal edge map E^k computed via Equation 1. (f) Object estimation result P^k via the intra-frame graph. (g) Saliency result S^k via the inter-frame graph. (h) Final video saliency via the proposed skeleton abstraction method.
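The caption references Equation 1, which is not reproduced in this excerpt. The following is a minimal sketch of one plausible way to obtain a superpixel-wise spatiotemporal edge value by combining a static edge probability map with the optical flow gradient magnitude, as described in the captions of Figures 1 and 2; the multiplicative fusion and the helper names are assumptions of this sketch, not the paper's exact formulation.

```python
import numpy as np

def spatiotemporal_edge_map(edge_prob, flow, superpixel_labels):
    """Combine a static edge probability map with the optical flow gradient
    magnitude and pool the result over superpixels (cf. Fig. 2 (c)-(e)).

    edge_prob         : (H, W) static edge probability in [0, 1]
    flow              : (H, W, 2) optical flow field V^k
    superpixel_labels : (H, W) integer superpixel index per pixel
    """
    # Gradient magnitude of the flow field, pooled over the two flow channels.
    dy_u, dx_u = np.gradient(flow[..., 0])
    dy_v, dx_v = np.gradient(flow[..., 1])
    flow_grad = np.sqrt(dx_u**2 + dy_u**2 + dx_v**2 + dy_v**2)
    flow_grad /= flow_grad.max() + 1e-12

    # Assumed multiplicative fusion of the two cues (the paper's Equation 1 may
    # differ); pixels on both a spatial edge and a motion boundary score high.
    pixel_edge = edge_prob * flow_grad

    # Pool to a per-superpixel spatiotemporal edge value E^k.
    n_sp = superpixel_labels.max() + 1
    sums = np.bincount(superpixel_labels.ravel(),
                       weights=pixel_edge.ravel(), minlength=n_sp)
    counts = np.bincount(superpixel_labels.ravel(), minlength=n_sp)
    return sums / np.maximum(counts, 1)
```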
Fig. 3. Illustration of our inter-frame graph construction. (a) Frame F^k. (b) Optical flow field V^k. (c) When the optical flow estimation is not accurate (which is unfortunately the common case), the object probabilities P^k are degraded. (d) Frame F^k is decomposed into background regions B^k and object-like regions U^k by the self-adaptive threshold σ^k defined in Equation 6. The black regions indicate the background regions B^k, while the bright regions indicate the object-like regions U^k. (e) The decomposition of the prior frame F^{k-1}. (f) The object-like regions U^{k-1} of frame F^{k-1} are projected onto frame F^k. (g) Spatiotemporal saliency result S^k for frame F^k with consideration of (d) and (e). (h) Spatiotemporal saliency result S^k for frame F^k with consideration of (e) and (f).
3.2 Intra-frame Graph Construction

To highlight the foreground regions that have high spatiotemporal edge values or are surrounded by regions with high spatiotemporal edge values, we employ the geodesic distance to compute a probability map.

The geodesic distance d_geo(v_1, v_2, G) between any two nodes v_1, v_2 in a graph G is the smallest integral of a weight function W over all possible paths between v_1 and v_2:

d_{geo}(v_1, v_2, G) = \min_{C_{v_1,v_2}} \int_{v_2}^{v_1} \lvert W(m) \cdot \dot{C}_{v_1,v_2}(m) \rvert \, dm, \quad (2)

where C_{v_1,v_2}(m) is a path connecting the nodes v_1, v_2.

For frame F^k, we construct an undirected weighted graph G^k = {V^k, E^k} with the superpixels Y^k as nodes V^k and the links between adjacent nodes as edges E^k. Based on the graph structure, we derive a |V^k| × |V^k| weight matrix W^k, where |V^k| is the number of nodes in V^k. The (m, n)-th element of W^k indicates the weight of the edge e^k_{mn} ∈ E^k between adjacent superpixels Y^k_m and Y^k_n:

W^k_{mn} = \lVert E^k(Y^k_m) - E^k(Y^k_n) \rVert, \quad (3)

where E^k(Y^k_m) and E^k(Y^k_n) correspond to the spatiotemporal boundary probabilities of superpixels Y^k_m and Y^k_n, respectively.

For superpixel y^k_n, the probability P^k(y^k_n) of being foreground is computed via the shortest geodesic distance to the image boundaries using

P^k(y^k_n) = \min_{q \in Q^k} d_{geo}(y^k_n, q, G^k), \quad (4)

where Q^k indicates the superpixels along the four boundaries of frame F^k. The geodesic distance d_geo(v_1, v_2, G^k) between any two superpixels v_1, v_2 ∈ V^k in the graph G^k can be computed in discrete form:

d_{geo}(v_1, v_2, G^k) = \min_{C_{v_1,v_2}} \sum_{m,n} W^k_{mn}, \quad m, n \in C_{v_1,v_2}, \quad (5)

which can be seen as the accumulated edge weights along the shortest path on the graph G^k.

If a superpixel is outside the desired object, its probability value is small because there exists a pathway to the image boundaries that does not pass through regions with high spatiotemporal edge values. In contrast, if a superpixel is inside the object, it is surrounded by regions with large edge probabilities, which increases its geodesic distance to the image boundaries. We normalize the probability map P^k to [0, 1] for each frame. Since our graph is very sparse, the shortest paths of all superpixels are efficiently computed by the Johnson algorithm [49].

3.3 Inter-frame Graph Construction

The foreground probability map P^k reveals the foreground object region, but it is not complete and precise. In particular, probability values of true background regions near the object boundary may be high due to the oversegmentation process. Besides, inaccurate optical flow estimation may result in erroneous values. By the definition of saliency, foreground and background regions should be visually different, and object regions should be temporally continuous between consecutive frames. These observations motivate us to estimate saliency between pairs of adjacent frames.

For each pair of adjacent frames F^k and F^{k+1}, we construct an undirected weighted graph G'^k = {V'^k, E'^k}. The nodes V'^k consist of the superpixels Y^k of frame F^k and the superpixels Y^{k+1} of frame F^{k+1}. There are two types of edges: intra-frame edges that link spatially adjacent superpixels and inter-frame edges that connect temporally adjacent superpixels. Superpixels are spatially connected if they are adjacent in the same frame. Temporally adjacent superpixels refer to superpixels which belong to different frames but overlap. We assign the edge weight as the Euclidean distance between their mean colors in the CIE-Lab color space.
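As an illustration of Equations 3-5, the following is a minimal sketch of the per-frame geodesic computation on the sparse superpixel graph; the array layout and helper names are assumptions, and SciPy's Dijkstra routine is used as a stand-in for the Johnson algorithm mentioned above. The same shortest-path machinery applies to the inter-frame graph of Section 3.3, with color-difference edge weights and background superpixels as sources.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def foreground_probability(edge_value, adjacency, boundary_idx):
    """Geodesic foreground probability of superpixels (cf. Equations 3-5).

    edge_value   : (N,) spatiotemporal edge value E^k of each superpixel
    adjacency    : iterable of (m, n) index pairs of spatially adjacent superpixels
    boundary_idx : indices of superpixels touching the image border (the set Q^k)
    """
    n = len(edge_value)
    rows, cols, weights = [], [], []
    for m, q in adjacency:
        rows.append(m)
        cols.append(q)
        # Equation 3: edge weight is the difference of spatiotemporal edge values;
        # the small epsilon keeps zero-difference edges explicit in the sparse graph.
        weights.append(abs(edge_value[m] - edge_value[q]) + 1e-9)
    graph = csr_matrix((weights, (rows, cols)), shape=(n, n))

    # Equation 5: accumulated edge weights along shortest paths; the graph is
    # sparse, so running Dijkstra from the |Q^k| boundary superpixels is cheap.
    dist = dijkstra(graph, directed=False, indices=np.asarray(boundary_idx))

    prob = dist.min(axis=0)                # Equation 4: min over boundary superpixels
    return prob / (prob.max() + 1e-12)     # normalize P^k to [0, 1]
```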
Fig. 4. Illustration of the skeleton abstraction process. (a) Frame F^k. (b) Saliency result S^k of (a) obtained via Equation 8. (c) Frame F^k is decomposed into background regions B'^k and object-like regions U'^k by the self-adaptive threshold σ'^k defined in Equation 9. The black regions indicate the background regions B'^k, while the bright regions indicate the object-like regions U'^k. (d) The red region corresponds to the base skeleton region, which is the first skeleton region selected through Equation 10. (e) The three yellow regions correspond to the subsequently selected skeleton regions through Equation 11. (f) We iteratively find and add skeleton regions until the number of selected skeleton regions reaches 10% of the number of object-like regions U'^k. (g) The blue regions are the other skeleton regions that lie on the shortest geodesic path between the base and the selected skeleton regions. (h) The saliency values of the skeleton regions are enhanced.
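Returning to the inter-frame graph of Section 3.3, a minimal sketch of its construction is given below; the input formats (per-superpixel mean Lab colors and precomputed adjacency and overlap pairs) are assumptions of this sketch. The resulting sparse matrix can be fed to the same shortest-path routine used above.

```python
import numpy as np
from scipy.sparse import csr_matrix

def interframe_graph(colors_k, colors_k1, spatial_pairs_k, spatial_pairs_k1, temporal_pairs):
    """Build the inter-frame graph G'^k of Section 3.3 over the superpixels of
    frames F^k and F^{k+1}; edge weights are Euclidean distances between mean
    CIE-Lab colors.

    colors_k, colors_k1 : (N_k, 3), (N_{k+1}, 3) mean Lab color per superpixel
    spatial_pairs_*     : (m, n) pairs of spatially adjacent superpixels per frame
    temporal_pairs      : (m, n) pairs with m in F^k and n in F^{k+1} that overlap
    """
    n_k = len(colors_k)
    colors = np.vstack([colors_k, colors_k1])        # joint node set V'^k
    rows, cols = [], []
    for m, n in spatial_pairs_k:                     # intra-frame edges of F^k
        rows.append(m); cols.append(n)
    for m, n in spatial_pairs_k1:                    # intra-frame edges of F^{k+1}
        rows.append(n_k + m); cols.append(n_k + n)   # offset indices of frame k+1
    for m, n in temporal_pairs:                      # inter-frame (overlap) edges
        rows.append(m); cols.append(n_k + n)
    rows, cols = np.asarray(rows), np.asarray(cols)
    weights = np.linalg.norm(colors[rows] - colors[cols], axis=1) + 1e-9
    size = len(colors)
    return csr_matrix((weights, (rows, cols)), shape=(size, size))
```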
For each frame, we use a self-adaptive threshold to decompose frame F^k into background regions B^k and object-like regions U^k through the probability map P^k. The threshold σ^k for frame F^k is computed as

\sigma^k = \mu(P^k), \quad (6)

where µ(·) is the mean probability over all pixels within the frame F^k. We assign the object-like regions U^k and the background regions B^k of the k-th frame as

U^k = \{ y^k_n \mid P^k(y^k_n) > \sigma^k \} \cup \{ y^k_n \mid y^k_n \text{ is temporally connected to } U^{k-1} \}, \qquad B^k = Y^k - U^k. \quad (7)

In a causal system, previously determined object regions offer valuable information to eliminate artifacts due to inaccurate optical flow estimation. Therefore, we project the object-like regions of the prior frame F^{k-1} onto frame F^k. Our motivation can be observed in Figure 3. The object estimation result of frame F^k (Figure 3-c) is not ideal, due to the incorrect optical flow estimation (Figure 3-b). If F^k is segmented using only the self-adaptive threshold σ^k defined in Equation 7, an inferior decomposition is generated (Figure 3-d), further leading to an incorrect saliency result (Figure 3-g). When the previous estimation is projected, a more correct decomposition can be obtained (Figure 3-f), and more consistent saliency can be attained (Figure 3-h).

Based on the graph G'^k, we compute the saliency map S^k (S^{k+1}) for frame F^k (F^{k+1}) as follows:

S^k(y^k_n) = \min_{b \in B^k \cup B^{k+1}} d_{geo}(y^k_n, b, G'^k), \qquad S^{k+1}(y^{k+1}_n) = \min_{b \in B^k \cup B^{k+1}} d_{geo}(y^{k+1}_n, b, G'^k). \quad (8)

The rationale behind Equation 8 is that the saliency value of a superpixel is measured by its shortest path to background regions in color space, considering both spatial and temporal information. We update P^k and P^{k+1} for frames F^k and F^{k+1} with S^k and S^{k+1}, and keep iterating this process for the following two adjacent frames F^{k+1} and F^{k+2} until the final frame.

3.4 Skeleton Abstraction

To further refine the saliency estimates above, we use a geodesic distance based abstraction scheme that augments core regions with higher saliency values.

We decompose (Figure 4-c) frame F^k into two parts, background regions B'^k and object-like regions U'^k, using a threshold similar to the one in Equation 6 yet computed from the saliency result S^k:

\sigma'^k = \mu(S^k), \quad U'^k = \{ y^k_n \mid S^k(y^k_n) > \sigma'^k \}, \quad B'^k = Y^k - U'^k. \quad (9)

As the saliency result S^k is more accurate than P^k (this is quantitatively verified in our experiments), we decompose frame F^k through this efficient thresholding strategy.

The skeleton region abstraction is an iterative process based on the undirected weighted graph G^k defined in Section 3.2. The base skeleton region should have two properties. First, this region should be as far away from the background regions B'^k as possible; second, it should be close to the foreground regions U'^k. Based on these conditions, the base skeleton region is selected by

O^k \leftarrow \operatorname*{argmin}_{o \in U'^k} \left\{ \frac{\max_{u' \in U'^k} d_{geo}(o, u', G^k)}{\min_{b' \in B'^k} d_{geo}(o, b', G^k)} \right\}. \quad (10)
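To make the preceding steps concrete, here is a minimal sketch of the mean-threshold decomposition (Equations 6, 7 and 9), the geodesic saliency of Equation 8, and the base-skeleton selection of Equation 10, reusing the sparse-graph ideas of the earlier sketches. The omission of the temporal-projection term of Equation 7 and the assumed `pairwise_geodesic` input (an all-pairs geodesic distance matrix on G^k) are simplifications; the greedy extension of Equation 11 is described after Figure 5.

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra

def decompose(values):
    """Mean-threshold split into object-like (True) and background (False)
    superpixels (Equations 6-7 and 9); the temporal projection of U^{k-1}
    in Equation 7 is omitted for brevity."""
    return values > values.mean()

def interframe_saliency(color_graph, background_idx):
    """Equation 8: saliency of every node of G'^k as its shortest geodesic
    (color-difference) distance to any background superpixel.

    color_graph    : sparse (N, N) matrix of CIE-Lab edge weights for the
                     superpixels of frames F^k and F^{k+1} (see earlier sketch)
    background_idx : indices of the background superpixels B^k and B^{k+1}
    """
    dist = dijkstra(color_graph, directed=False, indices=np.asarray(background_idx))
    sal = dist.min(axis=0)
    return sal / (sal.max() + 1e-12)

def base_skeleton(pairwise_geodesic, object_idx, background_idx):
    """Equation 10: pick the object-like superpixel that is close to the other
    object-like regions but far from the background."""
    object_idx = np.asarray(object_idx)
    background_idx = np.asarray(background_idx)
    far_from_objects = pairwise_geodesic[np.ix_(object_idx, object_idx)].max(axis=1)
    near_background = pairwise_geodesic[np.ix_(object_idx, background_idx)].min(axis=1)
    return object_idx[np.argmin(far_from_objects / (near_background + 1e-12))]
```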
Fig. 5. Illustration of pixel labeling. (a) Input frame F^k. (b) Spatiotemporal saliency map S^k. (c) The regions within the red boundaries are the superpixels with saliency values larger than the adaptive threshold, which are used for establishing the foreground histogram model. The regions between the green boundaries and the red boundaries are used for building the background histogram model. (d) Global appearance models with color histograms H_f and H_b for foreground and background via (c). (e) Probability map for the foreground computed via the global appearance model. (f) Accumulated optical flow gradient magnitude Ê^k for frame F^k yields the trajectory of the object within a few subsequent frames. (g) Dynamic location prior L^k obtained via the intra-frame graph construction method described in Section 3.2. (h) Final segmentation result by Equation 12, which consists of the saliency term (b), the appearance term (e), the location term (g), and two pairwise terms.
After obtaining the base skeleton region (Figure 4-d), we select the other skeleton regions. These regions should be as far away from the background regions B'^k and the previously selected skeleton regions as possible. This induces the skeleton regions to cover object regions that may have different appearances. Therefore, the skeleton regions are selected in a greedy fashion:

O^k \leftarrow O^k \cup \operatorname*{argmax}_{o \in U'^k} \left\{ \min_{o' \in O^k} d_{geo}(o, o', G^k) \cdot \min_{b' \in B'^k} d_{geo}(o, b', G^k) \right\}. \quad (11)

As shown in Figure 4-e, each of the subsequent skeleton regions is selected to maximize its geodesic distance to the background and to the previously selected skeleton regions. This process continues until a small percentage (10%) of the object-like regions U'^k are selected as skeleton. All object-like regions that lie on the shortest geodesic path between the base and subsequently chosen skeleton regions are also selected as skeleton regions. Finally, we increase the saliency values of the skeleton regions (in all experiments 2×), as in Figure 4-h. A quantitative evaluation of the effectiveness and improvement of each step of our saliency scheme is presented in Section 5.4.

4 PIXEL LABELING ENERGY FUNCTION

In the second stage of our segmentation method, we perform binary video segmentation based on the saliency results from Section 3. Separate global appearance models for foreground and background are established using the saliency results. A dynamic location model for each frame is estimated from motion information extracted from subsequent frames. Finally, the spatiotemporal saliency maps, global appearance models and dynamic location models are combined into an energy function for binary segmentation.

We formulate the segmentation task as a pixel labeling problem. Each pixel x^k_i in frame F^k can take a label l^k_i ∈ {0, 1}, where 0 corresponds to background and 1 corresponds to foreground. A labeling L = {l^k_i}_{k,i} of the pixels from all frames represents a partitioning of the entire video. Similar to other segmentation works [7], [50], we define an energy function for the labeling L of all the pixels as

F(L) = \sum_{k,i} U^k(l^k_i) + \lambda_1 \sum_{k,i} A^k(l^k_i) + \lambda_2 \sum_{k,i} L^k(l^k_i) + \lambda_3 \sum_{(i,j) \in N_s} V^k(l^k_i, l^k_j) + \lambda_4 \sum_{(i,j) \in N_t} W^k(l^k_i, l^{k+1}_j), \quad (12)

where the spatial pixel neighborhood N_s consists of the 8 neighboring pixels within the same frame, the temporal pixel neighborhood N_t consists of the forward-backward 9 neighbors in adjacent frames, and i, j are indices of pixels. This energy function consists of three unary terms, U^k, A^k and L^k, and two pairwise terms, V^k and W^k, which depend on the labels of spatially and temporally neighboring pixels.

The purpose of U^k is to evaluate how likely a pixel is foreground according to the spatiotemporal saliency maps computed in the previous step. The unary appearance term A^k encourages labeling pixels that have similar colors according to their global appearance models. The third unary term L^k is for labeling pixels according to the location priors estimated from the dynamic location models. The pairwise terms V^k and W^k encourage spatial and temporal smoothness, respectively. All the terms are described in detail next. The scalar parameters λ weight the various terms and can be set according to the characteristics of the video content. In our implementation, we empirically set λ_1 = 0.5, λ_2 = 0.2, λ_3 = λ_4 = 100.
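A minimal per-frame sketch of how such an energy can be assembled and minimized with a graph cut is given below. It assumes the PyMaxflow library and three per-pixel foreground probability maps feeding the unary terms defined in the following subsections; the uniform pairwise weight stands in for the contrast-modulated Potts terms of Equations 17-19, and the temporal term is omitted, so this is an illustrative simplification rather than the paper's exact implementation.

```python
import numpy as np
import maxflow  # PyMaxflow (an assumed dependency): Boykov-Kolmogorov max-flow/min-cut

def label_frame(saliency, appearance_fg, location, lam1=0.5, lam2=0.2, pairwise_weight=2.0):
    """Hedged per-frame sketch of minimizing Equation 12 with a graph cut.

    saliency, appearance_fg, location : (H, W) foreground probabilities used by the
    unary terms U^k, A^k and L^k (Equations 13, 14 and 16).
    """
    eps = 1e-6
    cost_fg = (-np.log(saliency + eps)
               - lam1 * np.log(appearance_fg + eps)
               - lam2 * np.log(location + eps))          # cost of assigning label 1
    cost_bg = (-np.log(1.0 - saliency + eps)
               - lam1 * np.log(1.0 - appearance_fg + eps)
               - lam2 * np.log(1.0 - location + eps))    # cost of assigning label 0

    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(saliency.shape)
    # Spatial smoothness (4-neighborhood by default; the paper uses 8 neighbors
    # and contrast-modulated weights, Equation 17).
    g.add_grid_edges(nodes, weights=pairwise_weight, symmetric=True)
    g.add_grid_tedges(nodes, cost_bg, cost_fg)   # t-links carry the two unary costs
    g.maxflow()
    # Source-side pixels take label 1; swap the two unary arrays above if the
    # segments come out inverted under a different PyMaxflow convention.
    return ~g.get_grid_segments(nodes)
```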
Having described the separate terms of the complete energy function F below, we use graph-cuts [51] to compute the optimal binary labeling and obtain the final segmentation (Figure 5-h).

Saliency term U^k: The unary saliency term U^k is based on the saliency estimation results and penalizes assignments of pixels with low saliency to the foreground. The term U^k has the following form:

U^k(l^k_i) = \begin{cases} -\log(1 - S^k(x^k_i)) & \text{if } l^k_i = 0, \\ -\log(S^k(x^k_i)) & \text{if } l^k_i = 1. \end{cases} \quad (13)

Appearance term A^k: To model the foreground and background appearance, two weighted color histograms H_f and H_b are computed in RGB color space. Each color channel is uniformly quantized into 10 bins, and there is a total of 10^3 bins in the joint histogram. Each pixel contributes to the histograms H_f and H_b according to its color values with weights S^k(x) and 1 - S^k(x), respectively.

To construct the foreground (background) histogram, we only use pixels from the superpixels that are spatially connected to the former foreground (background) superpixels and have saliency values larger (smaller) than the adaptive threshold, which is defined as the mean value of the spatiotemporal saliency map. This strategy makes better use of the information in the spatiotemporal saliency results and minimizes the adverse effect of background regions with colors similar to the foreground contaminating the foreground histogram (Figure 5-c,e). Finally, the histograms are normalized. Denoting c(x^k_i) as the histogram bin index of the RGB color value at pixel x^k_i, the unary appearance term A^k is defined as:

A^k(l^k_i) = \begin{cases} -\log \dfrac{H_b(c(x^k_i))}{H_f(c(x^k_i)) + H_b(c(x^k_i))} & \text{if } l^k_i = 0, \\[6pt] -\log \dfrac{H_f(c(x^k_i))}{H_f(c(x^k_i)) + H_b(c(x^k_i))} & \text{if } l^k_i = 1. \end{cases} \quad (14)

Location term L^k: For the cases of cluttered scenes and background regions having appearance models similar to the foreground, the object motion consistency provides a valuable prior to locate the areas likely to contain the object. Thus, we estimate the location of the foreground object with respect to motion information from a small number of neighboring frames.

For the k-th frame, we accumulate the optical flow gradient magnitudes within a temporal window of ±t frames to obtain relatively longer term motion information of the foreground regions as

\hat{E}^k = \sum_{i=k-t}^{k+t} E_o^i = \sum_{i=k-t}^{k+t} \lVert \nabla V^i \rVert. \quad (15)

Having a larger temporal window provides some robustness to individual pixel-wise unreliable optical flow estimates. However, it may also cause Ê^k to lose its discriminative power, since the motion cue spans out over too many frames. In our experiments, we set t = 5. We use the intra-frame graph construction described in Section 3.2 to compute a dynamic location model for each frame (Figure 5-f,g). Finally, we determine the location prior L^k for pixel x^k_i and the unary location term L^k as

L^k(l^k_i) = \begin{cases} -\log(1 - L^k(x^k_i)) & \text{if } l^k_i = 0, \\ -\log(L^k(x^k_i)) & \text{if } l^k_i = 1. \end{cases} \quad (16)

Pairwise terms V^k, W^k: These terms impose label smoothness by constraining the segmentation labels to be both spatially and temporally consistent. They are contrast-modulated Potts potentials [7], [22], [50], which favor assigning the same label to neighboring pixels that have similar color. The spatial consistency term V^k computed between spatially adjacent pixels x_i and x_j is defined as

V^k(l^k_i, l^k_j) = \delta(l^k_i, l^k_j) \exp\!\big( -\theta \lVert C(x^k_i) - C(x^k_j) \rVert^2 \big), \quad (17)

where C(x^k_i) is the color vector of the pixel x^k_i and δ(·) denotes the Dirac delta function, which is 0 when l^k_i ≠ l^k_j. The constant θ is chosen [50] to be

\theta = \Big( 2 \sum_{(i,j) \in N_s} \lVert C(x^k_i) - C(x^k_j) \rVert^2 \Big)^{-1}, \quad (18)

to ensure the exponential term in Equation 17 switches appropriately between high and low contrast. Similarly, the temporal consistency term W^k is defined as

W^k(l^k_i, l^{k+1}_j) = \delta(l^k_i, l^{k+1}_j) \exp\!\big( -\theta \lVert C(x^k_i) - C(x^{k+1}_j) \rVert^2 \big). \quad (19)

5 EXPERIMENTAL EVALUATIONS

Even though it is not the ultimate goal of the proposed algorithm, we first evaluate the effectiveness of our spatiotemporal saliency estimation method by comparing against some state-of-the-art saliency methods (Section 5.1). After that, in Section 5.2, we compare both quantitatively and qualitatively our overall segmentation method with several well-known video segmentation approaches. Then we offer a more detailed exploration and dissect various parts of our approach. In Section 5.3, we assess its computational load. In Section 5.4, we investigate the impact of important parameters, verify basic assumptions of the proposed algorithm and evaluate the effectiveness of each step of the proposed framework. In our comparisons, we use the implementations provided by the respective authors and set their free parameters to maximize their performance.

For quantitative and qualitative analyses, we use four benchmark datasets: SegTrack [52], extended SegTrack [53], the Freiburg-Berkeley Motion Segmentation dataset (FBMS) [1] and DAVIS [54]. The SegTrack dataset contains 6 videos in total, where full pixel-level segmentation ground-truth for each frame is available. We follow the common protocol [7], [8], [22] and use 5 video sequences (Birdfall, Cheetah, Girl, Monkeydog and Parachute) for evaluation (the last video, Penguin, is not usable for saliency since only a single penguin is labeled in a colony of penguins).

While the SegTrack dataset is widely popular, the extended SegTrack dataset is more challenging. It was originally introduced for evaluating object tracking algorithms, yet it is also suitable for video object segmentation. The extended SegTrack dataset consists of 8 additional sequences, which have complex backgrounds and varying object motion patterns. We select five sequences (Bird of Paradise, Frog, Monkey, Soldier and Worm), each of which contains a single dominant object.

The FBMS dataset, containing 59 video sequences, is widely used for video segmentation and covers various challenges such as large foreground and background appearance variation, significant shape deformation, and large camera motion.

We also report our performance on the newly developed DAVIS dataset, which is one of the most challenging video segmentation benchmarks. It comprises a total of 50 high-resolution sequences spanning a wide degree of difficulty, such as occlusions, fast motion and appearance changes.
Fig. 6. Comparison with 6 alternative saliency detection methods using the SegTrack dataset [52] (top), the extended SegTrack dataset [53] (middle) and the FBMS dataset [1] (bottom) with pixel-level ground-truth: (a) average precision-recall curve obtained by segmenting saliency maps using fixed thresholds, (b) F-score, (c) average MAE. Notice that our algorithm significantly outperforms the other methods in terms of the precision-recall curve and F-score. Our method achieves more than 75% improvement over the best previous method in terms of MAE.
5.1 Evaluation of Spatiotemporal Saliency

Since spatiotemporal saliency detection is an important step of our video segmentation approach, we assess its performance against existing saliency methods. Using the original implementations obtained from the corresponding authors, we make comparisons with 6 alternative approaches, including the manifold ranking saliency model (MR) [39], saliency filter (SF) [55], self-resemblance based saliency (SS) [15], saliency via quaternion Fourier transform (QS) [33], cluster-based co-saliency (CS) [42], and space-time saliency for time-mapping (TS) [43]. The former two of these methods aim at image saliency, while the latter four are designed for video saliency.

We report results on three widely used performance measures, including the precision-recall (PR) curve, F-score [34], and MAE (mean absolute error). Precision is the fraction of correctly labeled foreground pixels among all pixels labeled as foreground by the algorithm, while recall is the fraction of correctly labeled foreground pixels among the ground-truth foreground pixels. We generate binary saliency maps from each method and plot the corresponding PR curves by varying the operating point threshold.

In general, a high recall response may come at the expense of reduced precision, and vice versa. Therefore, we also estimate the F-score, which evaluates precision and recall simultaneously and is defined as

\text{F-score} = \frac{(1 + \beta^2) \times \text{precision} \times \text{recall}}{\beta^2 \times \text{precision} + \text{recall}}, \quad (20)

where we set β² to 0.3 to assign a higher importance to precision, as suggested in [34].
Fig. 7. Qualitative comparison against the state-of-the-art methods on the SegTrack benchmark videos [52], the extended SegTrack [53] sequences and the FBMS dataset [1] with pixel-level ground-truth labels. Our saliency method renders the entire objects as salient in complex scenarios, yielding continuous saliency maps that are most similar to the ground-truth.
For a complete analysis, we follow [55] and evaluate the mean absolute error (MAE) between a real-valued saliency map S and the binary ground-truth G over all image pixels:

\text{MAE} = \frac{1}{N} \sum_{x} \lvert S(x) - G(x) \rvert, \quad (21)

where N is the number of pixels. The MAE estimates the degree of approximation between the saliency map and the ground-truth map, and it is normalized to [0, 1]. The MAE provides a better estimate of conformity between estimated and ground-truth maps.
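As a reference implementation of these measures (a generic sketch, not code from the paper), precision, recall, the F-score of Equation 20 with β² = 0.3, and the MAE of Equation 21 can be computed per frame as follows.

```python
import numpy as np

def saliency_metrics(saliency, gt, threshold=0.5, beta2=0.3):
    """Precision, recall, F-score (Eq. 20) and MAE (Eq. 21) for one frame.

    saliency : (H, W) real-valued saliency map in [0, 1]
    gt       : (H, W) binary ground-truth mask
    """
    pred = saliency >= threshold                 # binarize at one operating point
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    fscore = ((1 + beta2) * precision * recall /
              max(beta2 * precision + recall, 1e-12))
    mae = np.abs(saliency - gt.astype(np.float64)).mean()   # Eq. 21
    return precision, recall, fscore, mae
```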
Fig. 8. Our segmentation results on the SegTrack [52] (Cheetah and Girl) and the extended SegTrack dataset [53] (Bird of Paradise, Monkey and
Soldier ) with pixel-level ground-truth masks. The pixels within the green boundaries are segmented as foreground.
The precision-recall curves of all methods are reported in Figure 6-a. As shown, our method significantly outperforms the state-of-the-art. The minimum recall value in these curves can also be regarded as an indicator of robustness. A high precision score at the minimum recall value means a good separation between the foreground and background confidence values, as most of the high-confidence saliency values (close to 1) fall on the correct foreground object.

As can be seen, when the threshold is close to 255, the recall scores of the other saliency models become very small, and the recall scores of SS [15] and QS [33] shrink to 0. This is because those saliency maps do not correspond to the ground-truth objects. To our advantage, the minimum recall of our method does not drop to 0. This demonstrates that our saliency maps align better with the correct objects. In addition, our saliency method achieves the best precision rates, above 0.9, which shows that it is more precise and responsive to the actual salient information. Similar conclusions can be drawn from the F-score, as shown in Figure 6-b. Our F-score is well above the performance of the other methods. The MAE results are presented in Figure 6-c. As shown, our saliency maps successfully reduce the MAE by almost 75% compared to the second best method (which is SF [55]). In summary, our method consistently produces superior results.

Figure 7 shows a qualitative comparison of different methods, where brighter pixels indicate higher saliency probabilities.
Fig. 9. Our segmentation results on the FBMS [1] (Horse and Camel) and the DAVIS dataset [54] (Kite-walk, Mallard-fly and Parkour ) with pixel-level
ground-truth masks. The pixels within the green boundaries are segmented as foreground.
It is observed that image saliency methods (MR [39], SF [55]) applied independently to each frame produce unstable outputs; some saliency maps even completely miss the foreground object, mainly because temporal coherence in video conveys important information for identifying salient objects. In contrast, the video saliency methods SS [15], QS [33], CS [42], and TS [43] perform relatively better as they utilize motion information. However, saliency maps from previous video saliency models are often generated at lower pixel precision and tend to assign lower foreground probabilities to pixels inside the salient objects. This is due to the fact that optical flow estimations are unreliable.

Based on the above, we draw two important conclusions: (1) motion information gives effective guidance for detecting the foreground object; (2) making methods rely heavily on motion information is not the optimal choice. Comprehensive utilization of various features in spatial and temporal space (color, edges, motion, etc.)
TABLE 1
Average per-frame pixel error on the SegTrack dataset [52] compared to the ground-truth. Lower values are better. The best and the second best results are boldfaced and underlined, respectively.

                          unsupervised                                          |      supervised
video       frames |  Ours    [1]    [7]    [8]    [9]   [23]   [22]   [28]  [53] |  [56]   [52]   [57]
Birdfall      30   |   140    217    288    468    155    606    189    144   199 |   468    252    454
Cheetah       29   |   622    890    905   1175    633  11210    806    617   599 |  1968   1142   1217
Girl          21   |   991   3859   1785   5683   1488  26409   1698   1195  1164 |  7595   1304   1755
Monkeydog     71   |   350    284    521   1434    365  12662    472    354   322 |  1434    563    683
Parachute     51   |   195    855    201   1595    220  40251    221    200   242 |  1113    235    502
Avg.          --   |   459   1221    740   2071    572  18228    677    502   505 |  2516    699    922
TABLE 2
IoU scores on the SegTrack dataset [52] and the extended SegTrack dataset [53] compared to the ground-truth. Higher values are better. The best and the second best results are boldfaced and underlined, respectively.

                                              unsupervised                       |       supervised
dataset            video            frames |  Ours   [7]    [9]   [22]   [26]   [28]   [24] |  [58]   [59]   [60]   [61]
SegTrack           Birdfall           30   |  74.5  48.7   71.4   37.4   72.5   73.2   57.4 |  78.7   57.4   56.0   32.5
                   Cheetah            29   |  64.3  43.4   58.8   40.9   61.2   64.2   24.4 |  66.1   33.8   46.1   33.1
                   Girl               21   |  88.7  77.5   81.9   71.2   86.4   86.7   31.9 |  84.6   87.9   53.6   52.4
                   Monkeydog          71   |  78.0  64.3   74.2   73.6   74.0   76.1   68.3 |  82.2   54.4   61.0   22.1
                   Parachute          51   |  94.8  94.3   93.9   88.1   95.9   94.6   69.1 |  94.4   94.5   85.6   69.9
Extended SegTrack  Bird of Paradise   98   |  94.5  22.3   35.2   85.4   90.0   93.9   86.8 |  93.0   95.2    5.1   44.3
                   Frog              279   |  83.3  71.0   76.3   69.4   80.2   81.5   67.1 |  56.3   81.4   14.5   45.2
                   Monkey             31   |  84.1  38.6   61.4   69.6   83.1   63.9   61.9 |  86.0   88.6   73.1   61.7
                   Soldier            32   |  79.2  10.0   51.4   47.4   76.3   36.8   66.5 |  81.1   86.4   70.7   43.0
                   Worm              243   |  74.8  40.5   53.9   73.0   82.4   61.7   34.7 |  79.3   89.6   36.8   27.4
                   Avg.               --   |  81.6  51.1   65.8   65.6   80.2   73.3   56.8 |  80.1   76.9   50.2   43.1
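For reference, the intersection-over-union (IoU) score reported in Tables 2 and 4 can be computed per frame as below; this is the generic Jaccard definition, not code taken from the paper.

```python
import numpy as np

def iou_score(pred_mask, gt_mask):
    """Intersection-over-union (Jaccard) between a predicted foreground mask
    and the ground-truth mask, both given as (H, W) binary arrays."""
    pred_mask = pred_mask.astype(bool)
    gt_mask = gt_mask.astype(bool)
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0:          # both masks empty: define the score as 1
        return 1.0
    return np.logical_and(pred_mask, gt_mask).sum() / union
```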
TABLE 4
IoU scores on a representative subset of the DAVIS dataset [54], and
the average computed over the 50 video sequences. Higher values are
better. The best and the second best results are boldfaced and
underlined, respectively.
sequences of the DAVIS dataset. As may be seen in Table 4, our method still performs comparably to or better than other concurrent approaches.

Representative pixel labeling results are shown in Figure 8 (SegTrack, the extended SegTrack) and Figure 9 (FBMS, DAVIS). Our method has the ability to segment objects with fast motion patterns (Cheetah and Horse) or large shape deformation (Parkour). It produces accurate segmentation maps even when the foreground undergoes appearance changes (Mallard-fly), contains various motion patterns (Soldier), or has color cues similar to the background (Monkey). In contrast, existing approaches [7], [9], [22] either mislabel background pixels as foreground or miss foreground pixels. In our experiments we observed that target foregrounds in various scenarios can be segmented accurately by our algorithm.

Fig. 10. Computational load of our method and the state-of-the-art for 340×240 video. (a) Execution time of the video saliency estimation stage compared against other video saliency methods [15], [33], [42], [43]. (b) Execution time of the overall method compared against other video segmentation methods [7], [9], [22]. (c) Execution time of each intermediate step. Step1 and Step2 are the saliency estimations via the intra-frame graph and the inter-frame graph, respectively. Step3 is the final saliency step.

The execution time of each part of our whole scheme is shown in Figure 10-c. The whole segmentation pipeline takes about 3.5 seconds per frame, where over 60% of the runtime is spent on the edge generation [46]. Saliency detection takes a total of 1.2 seconds: 0.38 seconds for computing the saliency via the intra-frame graph (Step1), 0.59 seconds for improving the saliency results via the inter-frame graph (Step2), and 0.23 seconds for generating the final saliency via abstracting skeleton regions (Step3).
Fig. 11. Parameter selection for the number of superpixels K using (a) the SegTrack dataset and (b) the extended SegTrack dataset. The MAE of the saliency results and the IoU score of the segmentation results are plotted as functions of a variety of values of K.
TABLE 5
Validation of spatiotemporal edge generation on the SegTrack dataset
[52] and extended SegTrack dataset [53].
Fig. 13. Video object segmentation and salient object detection results under object occlusion. (a) For objects with partial occlusions, the proposed method can still produce a reliable spatiotemporal saliency prior and generate accurate segments. (b) When heavy object occlusions occur, the proposed method may suffer difficulties, since it insists on finding a salient object, following the basic assumption of saliency detection.
6 DISCUSSIONS AND LIMITATIONS

The proposed algorithm has a few limitations. Its performance is limited by the accuracy of the spatiotemporal saliency estimation. Saliency estimation is the cornerstone of our method for determining where the primary object is. If the importance analysis is misleading, it might negatively affect the segmentation results. For example, our spatiotemporal saliency method may not be well suited to scenes that have multiple salient objects or a primary foreground object that occupies a large portion of the image. In these scenarios, it is likely to produce sub-optimal results, as the underlying assumption of saliency detection is that only a part of the scene attracts most of the human attention. In our approach, we formulate the local dynamic location prior and the global appearance information in the proposed segmentation energy function (Equation 12), which alleviates this problem.

Another difficulty for the current method is handling objects with occlusion, which is a common challenge in video segmentation. As the proposed spatiotemporal saliency prior relies on object continuity between adjacent frames, it is able to handle common scenarios with small or short occlusions in a bottom-up fashion (Figure 13-a). In some extremely difficult scenarios with complete occlusions, such as the bmx sequence in Figure 13-b, the proposed method may still locate a part of the scene as a salient region even though the object has been occluded. This follows from the basic assumption of saliency detection that an important object should exist. One promising direction for improving the segmentation is the use of long-range connectivity of objects, such as motion trajectories. Other advances may come from adopting occlusion-aware tracking techniques or from developing more powerful representations beyond regions, such as supervoxels and video object proposals.

7 CONCLUSIONS AND FUTURE WORK

We have presented an unsupervised approach that incorporates geodesic distance into saliency-aware video object segmentation. As opposed to traditional video segmentation methods that rely heavily on cumbersome object inference and motion analysis, our method emphasizes the importance of video saliency, which offers strong and reliable cues for pixel labeling of foreground video objects.

The proposed method incorporates static edge and motion boundary information into a spatiotemporal edge map and uses the geodesic distance on intra-frame and inter-frame graphs to measure the saliency score of each superpixel. In the intra-frame graph, the geodesic distance between a superpixel and the frame boundary is exploited to estimate the foreground probability. In the inter-frame graph, the geodesic distance to the estimated background is utilized to update the spatiotemporal saliency map for each pair of adjacent frames. The geodesic distance is also employed to extract the base and supporting foreground superpixels in the skeleton abstraction step to further enhance the saliency scores. In the pixel labeling stage, an energy function that combines global appearance models, dynamic location models and spatiotemporal saliency maps is defined and efficiently minimized via graph-cuts to obtain the final segmentation results.

We have evaluated our method on four benchmarks, namely SegTrack [52], extended SegTrack [53], FBMS [1] and DAVIS [54]. The extensive experimental evaluations show that our approach can generate high quality saliency maps in a relatively short time and achieves consistently higher performance scores than many other existing methods. Compared with other video segmentation methods, our approach generates both quantitatively and qualitatively superior segmentation results.

For future work, we will apply the proposed approach to other applications, such as video resizing, video summarization, and video compression. Additionally, our work provides important hints toward combining the spatiotemporal saliency prior with more effective video representations, such as trajectories and supervoxels.
REFERENCES
[1] T. Brox and J. Malik, “Object segmentation by long term analysis of point trajectories,” in European Conference on Computer Vision (ECCV), 2010.
[2] J. Lezama, K. Alahari, J. Sivic, and I. Laptev, “Track to the future: Spatio-temporal video segmentation with long-range motion cues,” in Computer Vision and Pattern Recognition (CVPR), 2011.
[3] K. Fragkiadaki, G. Zhang, and J. Shi, “Video segmentation by tracing discontinuities in a trajectory embedding,” in Computer Vision and Pattern Recognition (CVPR), 2012.
[4] W. Brendel and S. Todorovic, “Video object segmentation by tracking regions,” in Computer Vision (ICCV), 2009.
[5] E. Rahtu, J. Kannala, M. Salo, and J. Heikkilä, “Segmenting salient objects from images and videos,” in European Conference on Computer Vision (ECCV), 2010.
[6] C. Xu, C. Xiong, and J. J. Corso, “Streaming hierarchical video segmentation,” in European Conference on Computer Vision (ECCV), 2012.
[7] Y. J. Lee, J. Kim, and K. Grauman, “Key-segments for video object segmentation,” in Computer Vision (ICCV), 2011.
[8] T. Ma and L. J. Latecki, “Maximum weight cliques with mutex constraints for video object segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2012.
[9] D. Zhang, O. Javed, and M. Shah, “Video object segmentation through spatially accurate and temporally dense extraction of primary object regions,” in Computer Vision and Pattern Recognition (CVPR), 2013.
[10] I. Endres and D. Hoiem, “Category independent object proposals,” in European Conference on Computer Vision (ECCV), 2010.
[11] B. Alexe, T. Deselaers, and V. Ferrari, “What is an object?” in Computer Vision and Pattern Recognition (CVPR), 2010.
[12] J. Carreira and C. Sminchisescu, “Constrained parametric min-cuts for automatic object segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2010.
[13] D. Gao, V. Mahadevan, and N. Vasconcelos, “The discriminant center-surround hypothesis for bottom-up saliency,” Advances in Neural Information Processing Systems, pp. 497–504, 2008.
[14] V. Mahadevan and N. Vasconcelos, “Spatiotemporal saliency in dynamic scenes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 171–177, Jan 2010.
[15] H. J. Seo and P. Milanfar, “Static and space-time visual saliency detection by self-resemblance,” Journal of Vision, vol. 9, no. 12, p. 15, 2009.
[16] X. Bai and G. Sapiro, “A geodesic framework for fast interactive image and video segmentation and matting,” in Computer Vision (ICCV), 2007.
[17] B. Price, B. Morse, and S. Cohen, “Geodesic graph cut for interactive image segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2010.
[18] A. Criminisi, T. Sharp, and A. Blake, “GeoS: Geodesic image segmentation,” in European Conference on Computer Vision (ECCV), 2008.
[19] A. Criminisi, T. Sharp, C. Rother, and P. Perez, “Geodesic image and video editing,” ACM Transactions on Graphics, vol. 29, no. 5, 2010.
[20] W. Wang, J. Shen, and F. Porikli, “Saliency-aware geodesic video object segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2015.
[21] P. Ochs and T. Brox, “Object segmentation in video: A hierarchical variational approach for turning point trajectories into dense regions,” in Computer Vision (ICCV), 2011.
[22] A. Papazoglou and V. Ferrari, “Fast object segmentation in unconstrained video,” in Computer Vision (ICCV), 2013.
[23] P. Ochs and T. Brox, “Higher order motion models and spectral clustering,” in Computer Vision and Pattern Recognition (CVPR), 2012.
[24] M. Grundmann, V. Kwatra, M. Han, and I. Essa, “Efficient hierarchical graph-based video segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2010.
[25] K. Fragkiadaki, P. Arbeláez, P. Felsen, and J. Malik, “Learning to segment moving objects in videos,” in Computer Vision (ICCV), 2015.
[26] F. Xiao and Y. J. Lee, “Track and segment: An iterative unsupervised approach for video object proposals,” in Computer Vision and Pattern Recognition (CVPR), 2016.
[27] A. Faktor and M. Irani, “Video segmentation by non-local consensus voting,” in British Machine Vision Conference (BMVC), 2014.
[28] W.-D. Jang, C. Lee, and C.-S. Kim, “Primary object segmentation in videos via alternate convex optimization of foreground and background distributions,” in Computer Vision and Pattern Recognition (CVPR), 2016.
[29] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.
[30] J. Harel, C. Koch, and P. Perona, “Graph-based visual saliency,” Advances in Neural Information Processing Systems, pp. 545–552, 2006.
[31] A. Borji, D. Sihite, and L. Itti, “Probabilistic learning of task-specific visual attention,” in Computer Vision and Pattern Recognition (CVPR), 2012.
[32] X. Hou and L. Zhang, “Saliency detection: A spectral residual approach,” in Computer Vision and Pattern Recognition (CVPR), 2007.
[33] C. Guo, Q. Ma, and L. Zhang, “Spatio-temporal saliency detection using phase spectrum of quaternion Fourier transform,” in Computer Vision and Pattern Recognition (CVPR), 2008.
[34] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, “Frequency-tuned salient region detection,” in Computer Vision and Pattern Recognition (CVPR), 2009.
[35] S. Goferman, L. Zelnik-Manor, and A. Tal, “Context-aware saliency detection,” in Computer Vision and Pattern Recognition (CVPR), 2010.
[36] D. Klein and S. Frintrop, “Center-surround divergence of feature statistics for salient object detection,” in Computer Vision (ICCV), 2011.
[37] M. Cheng, J. Warrell, and W. Lin, “Efficient salient region detection with soft image abstraction,” in Computer Vision (ICCV), 2013.
[38] Y. Wei, F. Wen, W. Zhu, and J. Sun, “Geodesic saliency using background priors,” in European Conference on Computer Vision (ECCV), 2012.
[39] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, “Saliency detection via graph-based manifold ranking,” in Computer Vision and Pattern Recognition (CVPR), 2013.
[40] W. Zhu, S. Liang, Y. Wei, and J. Sun, “Saliency optimization from robust background detection,” in Computer Vision and Pattern Recognition (CVPR), 2014.
[41] D. Gao and N. Vasconcelos, “Bottom-up saliency is a discriminant process,” in Computer Vision and Pattern Recognition (CVPR), 2007.
[42] H. Fu, X. Cao, and Z. Tu, “Cluster-based co-saliency detection,” IEEE Transactions on Image Processing, vol. 22, no. 10, pp. 3766–3778, Oct 2013.
[43] F. Zhou, S. B. Kang, and M. F. Cohen, “Time-mapping using space-time saliency,” in Computer Vision and Pattern Recognition (CVPR), 2014.
[44] A. M. Treisman and G. Gelade, “A feature-integration theory of attention,” Cognitive Psychology, vol. 12, no. 1, pp. 97–136, 1980.
[45] P. Mital, T. J. Smith, S. Luke, and J. Henderson, “Do low-level visual features have a causal influence on gaze during dynamic scene viewing?” Journal of Vision, vol. 13, no. 9, pp. 144–144, 2013.
[46] M. Leordeanu, R. Sukthankar, and C. Sminchisescu, “Efficient closed-form solution to generalized boundary detection,” in European Conference on Computer Vision (ECCV), 2012.
[47] T. Brox and J. Malik, “Large displacement optical flow: Descriptor matching in variational motion estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 3, pp. 500–513, March 2011.
[48] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2274–2282, 2012.
[49] D. B. Johnson, “Efficient algorithms for shortest paths in sparse networks,” Journal of the ACM, vol. 24, no. 1, pp. 1–13, Jan. 1977.
[50] C. Rother, V. Kolmogorov, and A. Blake, “GrabCut: Interactive foreground extraction using iterated graph cuts,” ACM Transactions on Graphics, vol. 23, no. 3, pp. 309–314, 2004.
[51] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph cuts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 12, pp. 1222–1239, 2001.
[52] D. Tsai, M. Flagg, A. Nakazawa, and J. M. Rehg, “Motion coherent tracking using multi-label MRF optimization,” International Journal of Computer Vision, vol. 100, no. 2, pp. 190–202, 2012.
[53] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg, “Video segmentation by tracking many figure-ground segments,” in Computer Vision (ICCV), 2013.
[54] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung, “A benchmark dataset and evaluation methodology for video object segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2016.
[55] F. Perazzi, P. Krahenbuhl, Y. Pritch, and A. Hornung, “Saliency filters: Contrast based filtering for salient region detection,” in Computer Vision and Pattern Recognition (CVPR), 2012.
[56] O. Barnich and M. Van Droogenbroeck, “ViBe: A universal background subtraction algorithm for video sequences,” IEEE Transactions on Image Processing, vol. 20, no. 6, pp. 1709–1724, June 2011.
[57] P. Chockalingam, N. Pradeep, and S. Birchfield, “Adaptive fragments-based tracking of non-rigid objects using level sets,” in Computer Vision (ICCV), 2009.
[58] L. Wen, D. Du, Z. Lei, S. Z. Li, and M.-H. Yang, “JOTS: Joint online tracking and segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2015.
[59] Y.-H. Tsai, M.-H. Yang, and M. J. Black, “Video segmentation via object flow,” in Computer Vision and Pattern Recognition (CVPR), 2016.
[60] M. Godec, P. M. Roth, and H. Bischof, “Hough-based tracking of non-rigid objects,” in Computer Vision (ICCV), 2011.
[61] S. Wang, H. Lu, F. Yang, and M.-H. Yang, “Superpixel tracking,” in Computer Vision (ICCV), 2011.
[62] B. Taylor, V. Karasev, and S. Soatto, “Causal video object segmentation from persistence of occlusions,” in Computer Vision and Pattern Recognition (CVPR), 2015.

Wenguan Wang received the B.S. degree in computer science and technology from the Beijing Institute of Technology in 2013. He is currently working toward the Ph.D. degree in the School of Computer Science, Beijing Institute of Technology, Beijing, China. His current research interests include salient object detection and object segmentation for image and video.

Ruigang Yang received the MS degree from Columbia University in 1998 and the PhD degree from the University of North Carolina, Chapel Hill in 2003. He is currently a full professor of Computer Science at the University of Kentucky. His research interests span computer vision and computer graphics, in particular 3D reconstruction and 3D data analysis. He has published more than 100 papers, which, according to Google Scholar, have received close to 6,000 citations with an h-index of 37 (as of 2014). He has received a number of awards, including the US National Science Foundation Faculty Early Career Development (CAREER) Program Award in 2004, the Best Demonstration Award at CVPR 2007, and the Dean's Research Award at the University of Kentucky in 2013. He is currently an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence and a senior member of the IEEE.

Fatih Porikli is an IEEE Fellow and a Professor with the Research School of Engineering, Australian National University, Canberra, ACT, Australia. He is also acting as the Computer Vision Group Leader at NICTA, Australia. He received his PhD from New York University (NYU), New York, in 2002. Previously, he served as a Distinguished Research Scientist at Mitsubishi Electric Research Laboratories (MERL), Cambridge. Before joining MERL in 2000, he developed satellite imaging solutions at HRL, Malibu, CA, and 3D display systems at AT&T Research Laboratories, Middletown, NJ. He has contributed broadly to object and motion detection, tracking, image-based representations, and video analytics. Prof. Porikli was the recipient of the R&D 100 Scientist of the Year Award in 2006. He has won four best paper awards at premier IEEE conferences, including the Best Paper Runner-Up at IEEE CVPR in 2007, the Best Paper at the IEEE Workshop on Object Tracking and Classification beyond Visible Spectrum (OTCBVS) in 2010, the Best Paper at IEEE AVSS in 2011, and the Best Poster Award at IEEE AVSS in 2014. He has received five other professional prizes. Prof. Porikli has authored more than 130 publications and invented 61 patents. He has served as an associate editor of five premier journals for the past eight years, including IEEE Signal Processing Magazine (impact rate 6.0) and SIAM Journal on Imaging Sciences (rank 2 out of 236 applied math journals). He has served on the organizing committees of several flagship conferences, including ICCV, ECCV, CVPR, ICIP, AVSS, ICME, ISVC, and ICASSP. He served as Area Chair of ICCV 2015, CVPR 2009, ICPR 2010, IV 2008, and ICME 2006. He served as a judge on several NSF panels from 2006 to 2013 and has given keynotes at various symposiums.