Segmentation of Moving Objects in Image Sequence: A Review
Dengsheng Zhang and Guojun Lu
Gippsland School of Computing and Information Technology
Monash University, Churchill, Vic 3842, Australia
{dengsheng.zhang, guojun.lu}@infotech.monash.edu.au
Abstract
Segmentation of objects in image sequences is very important in many aspects of multimedia
applications. In second-generation image/video coding, images are segmented into objects to achieve
efficient compression by coding the contour and texture separately. As the purpose is to achieve high
compression performance, the objects segmented may not be semantically meaningful to human observers.
The more recent applications, such as content-based image/video retrieval and image/video composition,
require that the segmented objects be semantically meaningful. Indeed, the recent multimedia standard
MPEG-4 specifies that a video is composed of meaningful video objects. Although many segmentation
techniques have been proposed in the literature, fully automatic segmentation tools for general applications
are currently not achievable. This paper provides a review of this important and challenging area of
segmentation of moving objects. We describe common approaches including temporal segmentation,
spatial segmentation and the combination of temporal-spatial segmentation. As an example, a complete
segmentation scheme, which is an informative part of MPEG-4, is summarized.
1. Introduction
The ideal goal of segmentation is to identify the semantically meaningful components of an image and
to group the pixels belonging to such components. While it is not yet possible to segment static objects in
images at the present stage, it is more practical to segment moving objects from a dynamic scene with the
aid of the motion information it contains. Segmentation of moving objects in image sequences plays an
important role in image sequence processing and analysis. Once the moving objects are detected or
extracted, they can serve a variety of purposes.
The development of techniques for the segmentation of moving objects has mostly been driven by the
so-called second-generation coding [KIK85, TKP96]. Second-generation coding techniques use image
representations based on the human visual system (HVS) rather than the conventional canonical form,
which takes pixels or blocks of pixels as the basic entities to be coded. By taking the human visual system
into account, natural images are treated as a composition of objects defined not by a set of pixels regularly
spaced in all dimensions but by their shape and color. With second-generation coding techniques, the
original image is broken down into regions of homogeneous characteristics, or "objects" of arbitrary shape;
these "objects" are then contour- and texture-encoded.
Although much work has been done on segmentation-based coding within second-generation coding to
achieve very low bit-rate video streams (MORPHCO [SBCP96], SESAME [Salembier et al. 97]),
compression efficiency is the primary goal of coding, and content-based functionalities, such as object
identification and retrieval, are not addressed. The "objects" in second-generation coding are different from
the semantic objects corresponding to real-world objects. Second-generation coding decomposes images
into regions of homogeneous characteristics, which can be intensity, color, motion, directional components
or other predefined visual patterns. In the real world, objects rarely appear homogeneous. As a result, a
human body with differently moving parts, say, is likely to be segmented into several parts to achieve more
prediction gain [Diehl91]. This is in contrast with segmentation for content-based functionalities, which
aims at identifying meaningful objects corresponding to real-world objects.
Due to the rapid progress in micro-electronics and computer technology, together with the creation of
networks operating at various channel capacities, the last decade has seen the emergence of new
multimedia applications such as Internet multimedia, Video on Demand (VOD), interpersonal
communications (videoconference, videophone) and digital libraries. The importance of visual
communications has increased tremendously. A new international standard, MPEG-4, has just come into
being to address issues associated with multimedia applications. As a part of the MPEG-4 standard,
MPEG-4 video differs from other video standards in two main aspects: very low bitrate and content-based
functionalities. The content-based functionalities in MPEG-4 video represent a revolution in representing
digital video and will have a tremendous influence on the future of the visual world. With content-based
functionalities, the video bitstream can be manipulated to achieve personalized video. In MPEG-4 video,
the bitstream is composed of Video Object Planes (VOPs) which can be used to assemble a real-world
scene [N2172]. Each VOP is coded independently of other objects by its texture, motion and shape. VOPs
form the basic elements of MPEG-4 video bitstreams, and VOP extraction is a key issue in efficiently
applying the MPEG-4 coding scheme. Although the MPEG-4 standard does not specify how to obtain
VOPs, it is widely recognised that segmentation-based techniques are essential to, and will therefore
dominate, VOP generation. This is because most visual information, existing or being generated, is in the
form of frames or images. To achieve content-based functionalities, these frames and images have to be
decomposed into individual objects before being fed into an MPEG-4 video encoder.
Although it is premature to seek an automatic solution for general segmentation purposes at present
[Meier98], many video segmentation approaches have been proposed in the literature for this purpose.
They can be classified into motion-based segmentation (based on motion information only) and
spatio-temporal segmentation (a combination of temporal and spatial segmentation). Motion-based
techniques suffer from boundary inaccuracy. For content-based purposes, it appears that the spatio-temporal
approach is the most appropriate. This paper provides a review of some of the most important segmentation
techniques.
The rest of the paper is organized as follows. In Section 2, we give an overview of segmentation of
moving objects. Section 3 reviews 2D methods. In Section 4, a number of 3D methods are discussed.
Section 5 is devoted to spatio-temporal approaches. In Section 6, we summarize the discussions. In Section 7,
we describe a complete segmentation scheme which combines temporal and spatial segmentation. Section 8
concludes the paper.
segmentation criteria, which can be maximum a posteriori (MAP), the Hough transform, or
expectation-maximization (EM). Therefore, a typical motion segmentation algorithm consists of three steps
corresponding to these three issues. However, due to noise and the motion complexity of the scene, real
motion segmentation/clustering schemes are usually much more complex than this, in that the motion
estimation in the motion representation stage and the segmentation are usually recursive processes.
A simplified motion-based segmentation is given in Figure 1(a).
Motion-based segmentation algorithms can be classified either based on their motion representations or
based on their clustering criteria. Motion representation plays such a crucial role in motion segmentation
that motion segmentation techniques generally focus on the design of motion estimation algorithms.
Therefore, motion segmentation methods are best identified and distinguished by the motion representation
they adopt. Within each subgroup identified by its motion representation, the methods are further
distinguished by their clustering criteria.
Traditional motion-based segmentation methods, which employ motion information only, usually deal
with scenes containing rigid motion or piecewise rigid motion. The comparatively new spatio-temporal
segmentation techniques, which employ both the spatial and temporal information embedded in the sequence
and directly target the emerging multimedia applications and generic situations, are often left out of the
classification categories in the literature. By combining both motion and spatial information, these
techniques aim to overcome the over-segmentation problem of image segmentation and the
noise-sensitivity and inaccuracy problems of motion-based segmentation. Spatio-temporal segmentation
is classified here under motion segmentation because it employs the same motion estimation techniques as
motion-based segmentation, and temporal segmentation is usually used to guide the overall segmentation
results. However, this group of segmentation algorithms differs from motion-based segmentation in that
it makes use of spatial information to rectify and improve the temporal segmentation results. In this way,
it not only overcomes the above-mentioned problems of motion-based segmentation but can also be
applied to non-rigid motion and, therefore, to more generic scenes. A simplified spatio-temporal segmentation is
shown in Figure 1(b).
Based on the above discussions, this paper adopts the classification of segmentation of moving objects
as shown in Figure 2.
Figure 1. (a) Simplified motion-based segmentation and (b) simplified spatio-temporal segmentation
In the following sections, we discuss each of the segmentation techniques shown in Figure 2.
3. 2D motion-based segmentation methods
2D motion-based segmentation methods can be divided into segmentation based on optical flow
discontinuities and segmentation based on change detection.
This group of methods performs segmentation based on the displacement or optical flow of image pixels.
The displacement or optical flow of a pixel is a motion vector representing the motion between a pixel in
one frame and its corresponding pixel in the following frame. Optical flow is itself a very active research
topic; for the various optical flow estimation methods, readers are referred to [An88, BB95, Tekalp95, SK99].
Early work on segmentation tries to segment images into moving objects using local measurements.
Potter [Potter75] uses velocity as a cue to segmentation. The work is based on the assumption that all parts
of an object have the same velocity when they are moving. Potter’s approach to motion extraction was
based on the measurement of the movement of edges. He assumed that since the pictures of a scene were
taken in very close time, the edges could be correlated between the pictures by their spatial positions alone.
A motion measurement for a given reference point of a superimposed grid was obtained by determining the
differences (between pictures) of the displacements of edges from the point. The point was classified into
one of three classes—body, shadow and background. Points were grouped within classes on the basis of
identical motion measurement values. In his later work [Potter77], Potter determined the approximate
velocity of a pixel using template matching; the templates he chose were cross-, T- and L-shaped templates.
That work showed that the template matching process provides more accurate velocity information from
more complex scenes. The main advantage of using templates for velocity extraction is that they are object
independent, but since the template features do not appear everywhere in the picture, the resulting velocity
field is sparse. Without a spatial analysis, it is impossible to segment out the whole object.
Spoerri and Ullman [SU87] recognize that the computation of motion and the detection of motion
boundaries is a "chicken and egg" dilemma. In order to break this dilemma, they try to detect motion
boundaries before a full flow field is found, using local flow discontinuity tests. The input to the bimodality
tests is a local normal flow histogram constructed from a circular neighborhood at each image point. The
bimodality test detects motion boundaries by computing the degree of bimodality, or two peaks of equal
strength, present in the local histogram. Hildreth [Hildreth84] makes use of the fact that if two adjacent
objects undergo different motions v1 and v2, then the normal flow components whose orientations lie between
the directions of v1 and v2 will change in sign and/or magnitude across the boundary. Therefore she uses
zero-crossings of the normal flow to detect motion boundaries. These methods are suitable for situations
where the estimation of motion is difficult but the motion boundaries can still be perceived. Only very
simple images are used for testing. The advantage of a local flow test is that it can detect boundaries
locally without knowing the motion of the rest of the moving object; this is helpful in moving object
segmentation when it is not possible to apply a test criterion, such as the bimodality test, to the whole
flow field, which can happen when the flow field of the moving object is not uniform. However, the choice
of input data primitives is a challenge for these algorithms, since intensity values are sensitive to noise and
changes in illumination while edge features tend to be sparse and their density is non-uniform. A similar
approach has also been adopted by Nagel et al. [NSKO94].
Overington [Overington87] also utilizes the normal components of flow computed at edges to find
discontinuities in the normal flow component. The discontinuities are used to detect moving objects in a
scene, taken from a static camera.
Thompson et al [TMB85] apply motion edge detection to the image flow field to find object boundaries,
a natural extension of the classical intensity-based edge detection method. For this purpose, the Marr-
Hildreth edge detector is used to detect moving objects' boundaries based on the Laplacian-of-Gaussian
(LoG) smoothed flow field. Clocksin [Clocksin80] proposed the use of the Laplacian operator
for detecting sharp changes in the velocity field generated when an observer translates in a static
environment. He shows that, in such circumstances, discontinuities in the magnitude of the flow can be
detected with a Laplacian operator; in particular, singularities in the Laplacian occur at discontinuities in
the flow. A similar approach has been taken by Schunck [Schunck89], who applies motion edge detection
to a surface-based smoothed optical flow field resulting from the clustering of optical flow constraint lines.
These algorithms suffer from the same over-segmentation drawback as approaches based on
intensity-gradient edge detection. A solution is to combine temporal information with spatial
information such as color and texture, so that the over-segmentation in the motion field can be overcome by
a spatial segmentation.
While segmentation based on finding flow discontinuities is straightforward, it is unlikely to achieve the
expected results. In essence, the optical flow field has the same statistical characteristics as the intensity
field of the image. Based on experience from image segmentation, it is not difficult to recognize that
high-level information and rules are needed to aid the analysis.
In the above methods, optical flow or velocity is usually computed at every image point in the frame.
Since the percentage of points having zero motion or a simple global motion is usually large, it is more
efficient and economical to locate and focus the analysis on areas that are changing. More importantly,
segmentation of moving objects is usually a multi-stage or iterative process, and the elimination of a
large number of potential noise points can prevent many errors from propagating further. Since the
background is usually stationary or has a simple global motion, it is possible to remove it by simple
differencing or motion-compensated differencing. This method is predominantly used in segmentation for
object-based coding and segmentation for content-based functionalities [HT88, Diehl91, MW98].
Segmentation using change detection avoids the unreliable computation of differential gradients required
for optical flow estimation. These algorithms begin with change detection to distinguish between the
temporally changed and unchanged regions of two successive images k-1 and k; the moving object is then
separated from the changed regions. The decision whether a spatial position x = (x, y) belongs to a changed
or to an unchanged image part is based on the evaluation of the frame difference (FD)

FDk(x) = Ik(x) - Ik-1(x)

where Ik denotes the luminance of frame k.
The FD is usually evaluated over a measurement window rather than at a single pixel. In order to distinguish
between relevant changes, due to motion of objects or brightness changes, and irrelevant temporal changes
due to noise, the frame difference has to be compared to a threshold Tch. A reliable decision that a spatial
position x belongs to a changed region is only possible if the frame difference exceeds this threshold. The
binary change detection mask C(x), indicating changed (C(x) = 1) and unchanged (C(x) = 0) regions, is
provided by the change detection algorithm at each spatial position x. Hence, the performance of a change
detector essentially depends on two parameters: the first is the choice of the threshold separating
changed from unchanged luminance picture elements, and the second is a reasonable criterion that
eliminates small regions, e.g. small unchanged regions within large changed regions [TB89].
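As a rough illustration of this basic decision, the following Python sketch (the threshold Tch and the window size are illustrative values, not taken from any of the cited papers) computes a binary change detection mask by thresholding the windowed mean absolute frame difference of two grey-level frames:

    import numpy as np
    from scipy.ndimage import uniform_filter

    def change_detection_mask(frame_prev, frame_curr, t_ch=10.0, win=5):
        # Mean absolute frame difference, evaluated over a win x win window.
        fd = np.abs(frame_curr.astype(float) - frame_prev.astype(float))
        fd_win = uniform_filter(fd, size=win)
        # C(x) = 1 where the windowed FD exceeds the threshold Tch, else 0.
        return (fd_win > t_ch).astype(np.uint8)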
Jain et al have made intensive studies of change detection using the accumulative difference picture (ADP)
technique [JN79, JMA79, Jain81, JJ83, Jain84a, Jain84b]. An accumulative difference picture is formed by
comparing every frame of an image sequence to a reference frame and increasing the entry in the
accumulative difference picture by 1 whenever the difference for the pixel exceeds the threshold, see
Figure 3. Thus an accumulative difference picture ADPk is computed over k frames, starting from ADP0 = 0,
by comparing each frame with the reference frame [JKS95].
Figure 3. Accumulative difference picture; the moving object moves right one pixel per frame [JN79].
ADPk(x, y) works as a counter for each pixel in the accumulative image. Segmentation can be
carried out by finding high counter values, which are likely to correspond to actual moving objects.
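A minimal sketch of this accumulation, assuming a grey-level sequence held as a list of NumPy arrays and an illustrative fixed threshold, could look as follows:

    import numpy as np

    def accumulative_difference_picture(frames, threshold=15.0):
        # The first frame serves as the reference; ADP_0 is all zeros.
        reference = frames[0].astype(float)
        adp = np.zeros(reference.shape, dtype=np.int32)
        for frame in frames[1:]:
            diff = np.abs(frame.astype(float) - reference)
            # Increment the counter wherever the difference exceeds the threshold.
            adp += (diff > threshold).astype(np.int32)
        return adp

    # Pixels with high counter values are likely to belong to moving objects, e.g.
    # mask = accumulative_difference_picture(frames) > len(frames) // 2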
In [JN79, Jain81], Jain and Nagel try to extract rigid moving objects from a scene taken with a
stationary camera by analyzing accumulative difference pictures. The key idea in their approaches is to
reconstruct a stationary scene component, or the whole background, from the image sequence. Once this
background is reconstructed, moving objects can be detected by comparing the following frames with this
stationary scene. The difference pictures are created by a simple thresholding of the change between two
frames, where the threshold is a likelihood ratio constructed from the mean grey value and its variance for
sample areas (measurement windows) of the two frames. In the next stage, the first frame is selected as the
reference frame and the difference picture for the reference frame is initialized to zero for all its elements;
subsequent difference pictures are then accumulated with the first difference picture. When the object has
been completely displaced from its original location, a reference frame comprising only stationary
components can be formed, and moving objects in the following frames can be segmented out by comparison
with this reference frame. It is difficult to apply the algorithm to generic scenes due to some strong
assumptions, such as a stationary background, rigid motion and monotonicity of object motion.
Furthermore, the reconstruction may not be possible because an occluding object may enter the scene
before the object under analysis has moved away from its original location; the method also needs a long
sequence of frames for the motion analysis.
Difference accumulation can only be applied in limited applications due to these strong assumptions.
However, if the assumptions are satisfied, difference accumulation can be a very convenient way to
recover moving objects. An alternative to difference accumulation is the accumulation of
similarity, which can be used to recover the stationary component and for scene mosaicing, as exploited by
Wang and Adelson [WA94], who recover different layers from a panoramic scene after each
layer has been identified by its motion characteristics.
In [JJ83], Jayaramamurthy and Jain proposed an approach to segment dynamic scenes containing
textured objects by combining pixel velocity and difference pictures. This multistage approach first uses
differencing to obtain active regions in the frame which contain moving objects. The threshold chosen to
detect the active region is a preset value of 10% of the peak intensity value found in the frames. In the next
stage, a Hough transform on an optical flow field is used to determine the motion parameters associated
with each active region. Finally, the intensity changes and the motion parameters are combined to obtain
the masks of the moving objects. This approach combines the strengths of local and global
methods. It is known that applying a global method such as the Hough transform alone to a scene containing
several moving objects cannot yield useful results, because the individual peaks contributed by different
moving objects interfere in the parameter space. They resolve this difficulty by successive refinement using
local confidence tests. Like their other algorithms, this one is based on strong assumptions such as rigid and
translational motion, a stationary camera, etc., limiting its application. The pixel-to-pixel matching
exploited in the pixel velocity estimation makes the algorithm especially impractical in most situations.
Change detection through simple thresholding can lead to significant errors and inaccuracy in general
situations. For this reason, change detection is usually embedded into a hierarchical or a relaxation
algorithm. In this case, the initial change detection is refined by a motion compensated prediction/update or
a threshold evaluation-update process (see Figure 4 for example).
Figure 4. Block diagram of the change detector (FD: frame difference; C, Ci: change detection masks)
Figure 5. Separation of the change mask into moving object, covered and uncovered background. The
moving object is represented by motion vectors whose tails and heads both fall within the changed region.
Thoma and Bierling [TB89] combine change detection with optical flow to carry out object
segmentation, and incorporate a median filter to eliminate small elements in the change mask. The change
detection based segmentation algorithm is adopted from [HT88] and is an iterative three-step process
(Figure 4). In the first step, the threshold operation is performed on a measurement window using the mean
absolute frame difference; the threshold is initially chosen to be a fixed value of 3/256. Then, a
two-dimensional median filter is used to smooth the boundaries of the changed regions. The last step
eliminates small isolated regions of the change detection mask. After these three steps, the initial
threshold is re-evaluated: it is adapted to the standard deviation of the noise estimated from the available
unchanged regions. The process repeats with the new threshold until the system is stable, resulting in a
change detection mask. The segmentation of the moving object is achieved by separating the covered and
uncovered background from the change detection mask based on the motion field previously estimated
with a hierarchical block matching technique (Figure 5). A problem can arise here: some spatial positions
in the uncovered background of the current frame are not addressed by any motion vector and hence cannot
be identified; this affects boundary accuracy in most situations.
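A minimal Python sketch of such an iterative change detector is given below; it follows the three-step structure described above, but the window size, minimum region area and the factor relating the threshold to the noise standard deviation are illustrative assumptions rather than the values used in [HT88] or [TB89]:

    import numpy as np
    from scipy.ndimage import uniform_filter, median_filter, label

    def iterative_change_detection(frame_prev, frame_curr, t_init=3.0,
                                   win=5, min_area=50, max_iter=10):
        fd = np.abs(frame_curr.astype(float) - frame_prev.astype(float))
        t = t_init
        mask = np.zeros(fd.shape, dtype=bool)
        for _ in range(max_iter):
            # Step 1: threshold the mean absolute FD over a measurement window.
            new_mask = uniform_filter(fd, size=win) > t
            # Step 2: median filtering to smooth the boundaries of changed regions.
            new_mask = median_filter(new_mask.astype(np.uint8), size=win).astype(bool)
            # Step 3: eliminate small isolated regions of the mask.
            labels, n = label(new_mask)
            for region in range(1, n + 1):
                if (labels == region).sum() < min_area:
                    new_mask[labels == region] = False
            if np.array_equal(new_mask, mask):
                break                                   # mask is stable, stop
            mask = new_mask
            # Re-evaluate the threshold from the noise in the unchanged regions
            # (the factor 2.5 is an illustrative choice).
            if (~mask).any():
                t = 2.5 * fd[~mask].std()
        return mask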
Aach et al [AKM93] propose a change detection technique using MAP estimation and relaxation. The algorithm
starts by computing the grey-level difference image D. An initial change detection mask is then computed
by a threshold operation on the squared normalized difference image. The thresholding is carried out by
performing a significance test on the noise hypothesis for the luminance difference image D, which is
modeled as Gaussian camera noise with variance σ². Since global thresholding can result in small
isolated regions and irregular boundaries, an optimization mechanism based on MAP is designed to modify or
update the change mask so as to eliminate small elements. The MAP criterion is then put into a
deterministic relaxation to refine the object boundary. While the algorithm overcomes the 'corona' effect (a
blurring effect of some filters which reduces the spatial resolution), the choice of the significance level for
the significance test is still arbitrary, which can make the algorithm image dependent. The resulting object
mask is too scattered due to the large number of small areas within the object region. A mechanism for
eliminating small regions is apparently necessary; Mech and Wollborn [MW98] overcome this shortcoming
with a morphological closing operation.
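The initial thresholding can be read as a per-window significance test; in the simplified sketch below (window size and significance level are assumed for illustration), the windowed sum of squared differences, normalized by the camera-noise variance, is compared against a chi-square quantile under the noise hypothesis:

    import numpy as np
    from scipy.ndimage import uniform_filter
    from scipy.stats import chi2

    def significance_test_mask(frame_prev, frame_curr, sigma, alpha=1e-4, win=5):
        d = frame_curr.astype(float) - frame_prev.astype(float)
        # Sum of squared differences over a win x win window, normalized by
        # the camera-noise variance sigma^2.
        test_stat = uniform_filter(d ** 2, size=win) * (win * win) / sigma ** 2
        # Under the "no change" (pure noise) hypothesis the statistic is
        # chi-square distributed with win*win degrees of freedom.
        threshold = chi2.ppf(1.0 - alpha, df=win * win)
        return test_stat > threshold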
4. 3D motion-based methods
The 2D approaches described above analyze apparent motion on the 2D image plane, and the analysis is
performed only according to the information available in the frames, without taking into account the structure
and the "real" motion of the moving objects in space. 2D motion models are simple, but less realistic. A
practical and robust motion segmentation scheme must take into account objects' structure and motion in
space. As a result, 3D motion segmentation is employed in most practical segmentation systems.
Within the 3D methods, two groups can be distinguished among the variety of algorithms: the
structure from motion (SFM) approach and the parametric approach. SFM usually deals with 3D scenes
containing significant depth information, while significant depth is not assumed in the parametric approach.
Another important difference between the two approaches is the motion rigidity assumption in SFM,
whereas the parametric approach only assumes piecewise motion rigidity of the scene.
The structure from motion problem refers to recovering 3D geometry in space from 2D motion on
image plane. The idea of structure from motion is inspired by the phenomenon that human beings act in a
three dimensional world while they only sense 2D projection of it. Early structure from motion efforts were
focused on structure from stereo, which deals with binocular scenes. In this paper we only discuss SFM
with monocular scenes, which is effectively equivalent to stereo with a single camera. For structure from
stereo, readers are referred to [Scharstein97, Fusiello98, Tekalp95].
SFM is preferred in applications where recovery of the "real" motion in the environment is essential,
such as robotic navigation and object tracking. SFM also finds application in other areas such as
animation, active vision, 3D coding and mosaicing.
Recovering 3D structure from 2D motion is a difficult problem, since the observation data
have one dimension fewer than the unknown environment to be estimated. Several simplifying
assumptions are usually made to the general problem of recovering 3D models from 2D imagery in order to
formulate the SFM task. One key assumption is that objects in the scene are moving rigidly or, equivalently,
that only the camera is allowed to move. An additional simplification is that feature points have been located
and correspondences have been established between feature points in the two frames.
SFM techniques differ in their camera (geometry) model, linearity (in parameters estimation), number
of features, and restrictions on camera motion and scene structure.
Most SFM techniques are based on exploiting geometric or algebraic properties that are invariant under
projection to multiple images, from which camera motion information is easily extracted. For example, the
essential matrix, the fundamental matrix and factorization method all exploit various invariants of
perspective or parallel projection to recover the relative extrinsic camera parameters from two or three
views. SFM methods are then divided into linear and non-linear approaches based
on the optimization methods employed to estimate the geometric parameters. Linear approaches try to solve
the SFM problem by casting it as a least-squares optimization. However, with the exception
of the factorization method, these techniques are rarely scalable to multiple images, which limits the extent
to which the solution can be made robust. Though the factorization method is scalable to a number of
images, it is not recursive, and it assumes an orthographic model, which severely restricts camera motion and
scene structure. These linear methods fail when the geometric constraints degenerate, for
example when the motion between images is small. For such cases, non-linear approaches using an extended
Kalman filter (EKF) [JAP99] and projection error refinement [Bestor98] have been proposed; robust structure
and motion are recovered by minimizing a nonlinear cost function over the image sequence. A full review of
SFM methods is beyond the scope of this paper; for the variety of SFM algorithms, readers are referred to
[JAP99, Bestor98, Tekalp95]. In the following, we discuss two different 3D segmentation schemes using
the SFM method.
Torr [Torr95] developed a stochastic 3D motion segmentation scheme that makes use of the epipolar lines
generated from the fundamental matrix. In the two-view case, according to epipolar geometry, a point p1 in one
frame is related to its corresponding point p2 in the other frame by a fundamental matrix F (which
encapsulates all the information on camera motion, including camera translation) through

p2^T F p1 = 0        (4.1.1)

and the corresponding point p2 can only fall on the epipolar line Fp1 (Figure 6).
Figure 6. The perspective projection of an object point P onto the two image planes defines the epipolar geometry.
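Under this convention, and assuming homogeneous image coordinates, the consistency of a candidate correspondence with a given F can be scored by the distance of p2 to the epipolar line Fp1, as in the following small sketch:

    import numpy as np

    def epipolar_residual(F, p1, p2):
        # Epipolar line in the second view induced by p1: l = F @ p1,
        # with l = (a, b, c) describing the line a*x + b*y + c = 0.
        a, b, c = F @ p1
        # Perpendicular distance of the candidate correspondence p2 to that line.
        return abs(a * p2[0] + b * p2[1] + c) / np.hypot(a, b)

    # A small residual supports clustering (p1, p2) with the motion described by F:
    # r = epipolar_residual(F, np.array([120.0, 80.0, 1.0]), np.array([118.0, 83.0, 1.0]))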
The algorithm involves four steps: feature matching, model generation, cluster pruning and a multiple
hypothesis test to determine the correct segmentation. The whole algorithm and the model generation are given
in Figure 7.
In the first step, the feature matching and the geometry estimation form a recursive process. Once the
epipolar geometry has been estimated, it is used to aid feature matching: all the corners are rematched after
the estimation of the epipolar geometry, and the epipolar geometry may then be further refined. When there
is more than one moving object in the scene, it is assumed that the motion of the camera relative to the i-th
object is specified by the relative motion parameters with an associated fundamental matrix F. Since the rigid
3D feature point sets (corners) of the two views are linked by the fundamental matrix F, the segmentation
problem is transformed into that of clustering the features in the image consistently with distinct fundamental
matrices, each of which constrains an epipolar line for each feature point. So, in the second step, clustering is
carried out in a model generation module (Figure 7). It starts by randomly selecting 7 pairs of correspondences
and calculating F. More pairs are added with an eye to consistency: each new pair is tested to see whether it
meets a maximum likelihood (ML) criterion of being at least 95% likely to belong to the cluster. The ML cost
function is created by modeling the distance of a feature to its estimated epipolar line, constrained by F, as
white Gaussian noise with zero mean, and the a priori probability as a geometric distribution. In the third step,
small clusters may be pruned depending on the likelihood that they were randomly generated, and similar
clusters are merged. A special cluster, which uses a uniform probability density function, is used to capture
data points that are not well modeled by the other clusters. The final step is a multiple hypothesis test to
determine which particular combination of the many feasible clusters is most likely to represent the actual motions.
Due to the sparse features exploited in the segmentation, object boundaries cannot be detected. Furthermore,
the estimation of the fundamental matrix needs a larger displacement between frames (views).
MacLean [MacLean96] proposed a scheme based on the expectation-maximization (EM)
algorithm. The input data to the segmentation algorithm are linear constraints on the 3D translational motion,
generated from 7 bilinear constraints on 3D translation and rotation using the subspace method. In these
constraints, Ω and T are the rotation velocity and the translation velocity of the object, V = Ω×X + T, and u(x) =
(u(x, y), v(x, y)) is the 2D optical flow. The linear constraints cancel the rotational part of the motion,
leaving only translational constraints from which the depth information of the scene, favored in the
application, can be recovered.
The algorithm adopts a top-down multi-process approach. Starting with an initial guess that there is a
single translational motion in the scene, the translational parameters are estimated by a non-linear
optimization. In the second step, the EM algorithm is used to cluster constraints between the single
translation process and an outlier (rejection) population. The EM algorithm is an iterative two-step
method: (1) the expectation step assigns features to the motion parameters they are most consistent with;
(2) the maximization step estimates the parameters of the models from the features found consistent with them.
The third step examines the outlier population for evidence of other translation processes according to an
ownership probability criterion. The fourth step checks the new processes and merges small processes with
the larger ones, or discards them if they are too small. The procedure repeats from step 2 until no new process
emerges. The problem with the algorithm is the initialization: since methods for optimizing non-linear
equations seldom guarantee a global minimum, the initial guess is critical, and a poor guess can lead to an
undesired result. Torr [Torr95, TM93] has shown that the EM algorithm may improve a segmentation if the
segmentation is already good; otherwise the algorithm's convergence properties are poor, e.g. if two models
are nearly the same then they will each grab elements of the other's set.
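The two-step loop itself is generic; the following Python sketch illustrates it for the simpler case of clustering 2D flow vectors between affine motion models (standing in for MacLean's translational constraints), with Gaussian residual models, a fixed number of processes and an assumed random initialization:

    import numpy as np

    def em_motion_clustering(points, flows, n_models=2, n_iter=20, sigma=1.0):
        n = len(points)
        # Affine model: u(x) = [x, y, 1] @ theta, with theta of shape (3, 2).
        A = np.column_stack([points, np.ones(n)])
        rng = np.random.default_rng(0)
        thetas = [rng.normal(scale=0.1, size=(3, 2)) for _ in range(n_models)]
        for _ in range(n_iter):
            # E-step: soft ownership of each flow vector by each motion model,
            # from Gaussian residual likelihoods.
            resid = np.stack([((flows - A @ th) ** 2).sum(axis=1) for th in thetas], axis=1)
            w = np.exp(-(resid - resid.min(axis=1, keepdims=True)) / (2 * sigma ** 2))
            w /= w.sum(axis=1, keepdims=True)
            # M-step: weighted least-squares refit of each model's parameters.
            for k in range(n_models):
                AtW = A.T * w[:, k]
                thetas[k] = np.linalg.solve(AtW @ A + 1e-9 * np.eye(3), AtW @ flows)
            sigma = max(np.sqrt((w * resid).sum() / (2 * w.sum())), 1e-6)
        return thetas, w.argmax(axis=1)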
Both methods adopt a joint motion estimation and segmentation approach. However, there are several
differences between the second algorithm and the first one. First, MacLean uses a dense 2D optical
flow field as input data, which allows object boundaries to be detected. Second, the second segmentation is
performed on constraints that relate solely to the translational direction, and a mixture model is used to
represent multiple motion processes. Third, the top-down approach of the second algorithm also differs from
the first one.
Parametric methods relax the rigidity assumption of the SFM method to piecewise rigidity. Parametric
models are built by making use of object motion in space and explicitly assume a physical structure in the
scene. The three-dimensional motion of the object is usually modeled as a 3D affine motion described by a
rotation matrix R and a translation vector T: X' = RX + T, where X and X' are the positions of an object
point at times t and t+1 respectively. The two physical structure models usually assumed in the
parametric methods are a planar surface or a parabolic surface, which are acceptable approximations
to the structure of real objects in natural scenes. By combining the structure and motion constraints with one of
the two projection geometries, parallel or perspective, the following parametric models are obtained
[Diehl91, Tekalp95]:
(1) 6-parameter (affine) model, corresponding to a planar surface under parallel projection:

x' = a1 x + a2 y + a3
y' = a4 x + a5 y + a6        (4.2.1.1)

(2) 8-parameter model, corresponding to a planar surface under perspective projection:

x' = (a1 x + a2 y + a3) / (a7 x + a8 y + 1)
y' = (a4 x + a5 y + a6) / (a7 x + a8 y + 1)        (4.2.1.2)

(3) 12-parameter model, corresponding to a parabolic surface under parallel projection:

x' = a1 x^2 + a2 y^2 + a3 xy + a4 x + a5 y + a6
y' = b1 x^2 + b2 y^2 + b3 xy + b4 x + b5 y + b6        (4.2.1.3)

where x = (x, y) and x' = (x', y') are corresponding points at times t and t+1 respectively, assumed to be
established by a 2D optical flow vector. Another 8-parameter model, called quadratic flow, is obtained by
combining the 3D object velocity model V = Ω×X + T with the planar surface and perspective projection
constraints:

x' = a1 + a2 x + a3 y + a7 x^2 + a8 xy
y' = a4 + a5 x + a6 y + a7 xy + a8 y^2        (4.2.1.4)
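To make the notation concrete, the following sketch (illustrative only) applies the 6-parameter model (4.2.1.1) and the 8-parameter planar-perspective model (4.2.1.2) to an array of image coordinates:

    import numpy as np

    def affine_warp(points, a):
        # 6-parameter model (4.2.1.1): x' = a1*x + a2*y + a3, y' = a4*x + a5*y + a6.
        x, y = points[:, 0], points[:, 1]
        return np.stack([a[0] * x + a[1] * y + a[2],
                         a[3] * x + a[4] * y + a[5]], axis=1)

    def perspective_warp(points, a):
        # 8-parameter model (4.2.1.2): planar surface under perspective projection.
        x, y = points[:, 0], points[:, 1]
        denom = a[6] * x + a[7] * y + 1.0
        return np.stack([(a[0] * x + a[1] * y + a[2]) / denom,
                         (a[3] * x + a[4] * y + a[5]) / denom], axis=1)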
Adiv [Adiv85] first proposed segmenting the scene into planar patches; the idea was later adopted by
many researchers. The segmentation is a hierarchically structured three-stage algorithm in which objects in
the scene are decomposed into planar patches that are moving rigidly. In the first stage of the segmentation,
optical flow vectors are grouped into components using a multi-pass modified Hough transform on the
parameter space, where a component is a connected set of vectors which support the same affine
transformation (4.2.1.1). In the modified Hough transform, each flow vector u(x) = (v1(x), v2(x)) votes
for the set of quantized parameters which minimizes an error measure built from the residuals
δx(x) = v1(x) - a1 x - a2 y - a3 and δy(x) = v2(x) - a4 x - a5 y - a6. The parameter sets that receive the
most votes are likely to represent candidate motions. Three techniques are used to alleviate the computation
involved in the high-dimensional Hough transform: i) multi-resolution, where at each resolution level the
parameter space is quantized around the estimates obtained at the previous level; ii) decomposition of the
parameter space into the two subspaces {a1, a2, a3} and {a4, a5, a6}; and iii) multi-pass processing, where
the flow vectors which are most consistent with the candidate parameters are grouped first.
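The voting idea can be sketched very roughly as follows; this is a brute-force illustration in which the parameter quantization and the residual used for voting are assumptions, and none of Adiv's multi-resolution or multi-pass refinements are included:

    import numpy as np
    from itertools import product

    def hough_vote_affine(points, flows, a_bins):
        acc_x, acc_y = {}, {}
        for (x, y), (vx, vy) in zip(points, flows):
            # Each flow vector votes for the quantized triple that minimizes its
            # residual, independently in the {a1,a2,a3} and {a4,a5,a6} subspaces.
            best_x = min(product(a_bins, repeat=3),
                         key=lambda a: (vx - a[0] * x - a[1] * y - a[2]) ** 2)
            best_y = min(product(a_bins, repeat=3),
                         key=lambda a: (vy - a[0] * x - a[1] * y - a[2]) ** 2)
            acc_x[best_x] = acc_x.get(best_x, 0) + 1
            acc_y[best_y] = acc_y.get(best_y, 0) + 1
        # The most-voted triples represent candidate affine motions.
        return max(acc_x, key=acc_x.get), max(acc_y, key=acc_y.get)

    # Example with a coarse quantization of the parameter range:
    # params = hough_vote_affine(points, flows, np.linspace(-2.0, 2.0, 9))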
The second stage is a merging of components. Adjacent components created in the first stage are then
merged into segments if they obey the same eight-parameter quadratic flow model (4.2.1.4), called Ψ
transform. In the last stage, flow vectors that are not consistent with any Ψ transform are merged into
neighbouring segments if they are consistent with the corresponding Ψ transform, resulting in the final
segmentation. Many other researchers also use this idea to segment the world into planar facets [MB87,
MW86, TM93, IRP92]. These methods could be expected to be useful in many man-made environments
where planar surfaces occur in abundance. However, when combined with other methods, these methods
can be applied to many other situations as can be seen in the integrated methods in Section 4.2.4.
The Bayesian method is among the most widely used segmentation techniques. Its objective is to maximize
the posterior probability of the unknown label field (segmentation) X given the observed motion field D.
In order to maximize the a posteriori (MAP) probability P(X | D), two probability distributions must be
specified according to Bayes' theorem: the conditional probability P(D|X) and the a priori probability P(X).
To determine P(X), the label field X is usually modeled as a Markov random field (MRF), while P(D|X) is
modeled as white Gaussian noise with zero mean and variance σ². Due to the equivalence between MRFs and
Gibbs distributions [GG84], the MAP problem reduces to minimizing a global cost function, which consists
of two terms: a close-to-data term and a smoothness term:
E = (1/(2σ²)) Σxi || u(xi) - ũ(xi) ||^2 + Σxi Σxj∈Nxi VC(X(xi), X(xj))        (4.2.3.1)

where u and ũ are the observed flow and the synthesized or estimated flow at each pixel, Nxi denotes the
neighborhood system of the label field, and the potential function is

VC(X(xi), X(xj)) = -β if X(xi) = X(xj), and β otherwise, with β > 0.        (4.2.3.2)
The varieties of Bayesian segmentation techniques then differ in the choice of different cost functions.
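For concreteness, the following sketch evaluates the energy (4.2.3.1) over a label field with a 4-neighbourhood system; the neighbourhood choice and the way the synthesized flow is supplied are assumptions for illustration:

    import numpy as np

    def map_energy(flow, flow_synth, labels, sigma, beta):
        # Close-to-data term of (4.2.3.1): squared difference between the
        # observed flow u and the synthesized flow u~, scaled by 1/(2*sigma^2).
        data_term = ((flow - flow_synth) ** 2).sum() / (2.0 * sigma ** 2)
        # Smoothness term: potential V_C of (4.2.3.2) over a 4-neighbourhood,
        # -beta for equal neighbouring labels, +beta otherwise.
        smooth_term = 0.0
        for axis in (0, 1):
            same = labels == np.roll(labels, -1, axis=axis)
            valid = np.ones_like(same, dtype=bool)
            if axis == 0:
                valid[-1, :] = False          # drop the wrap-around pairs
            else:
                valid[:, -1] = False
            smooth_term += np.where(same, -beta, beta)[valid].sum()
        return data_term + smooth_term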
It is well known that motion estimation and segmentation are interdependent: motion estimation
requires knowledge of the motion boundaries, while segmentation needs the estimated motion field to
identify those boundaries. Joint motion estimation and segmentation algorithms are a way to break this
cycle. For that, the MAP estimation is usually put into a recursive process to iteratively optimize the motion
parameters and the segmentation, as illustrated in Figure 8(a).
Figure 8. (a) Joint motion estimation and segmentation; (b) the ICM scheme used in [BF93]
Murray and Buxton [MB87] first proposed a MAP segmentation algorithm where the parametric model
is the quadratic flow of (4.2.1.4) and the segmentation field is modeled by a Gibbs distribution. In order to
compute the observation probability P(D|X), the eight parameters of the quadratic flow model (4.2.1.4) are
calculated for each region by linear regression on a normal flow field. The interpretation of this flow field,
i.e. the label field X, is modelled as an MRF. The cost function of the corresponding Gibbs distribution has
the two components of (4.2.3.1): (1) a close-to-data term, measuring how well the estimated motion
model approximates the observed optical flow field, using as criterion the sum of squared normalized
differences between the original flow vectors and the estimated motion vectors; (2) a spatial smoothness term,
represented by the sum of the potential functions (4.2.3.2) and of line process potentials that allow for motion
discontinuities across the motion boundary. The potential function is constructed on two-site cliques
C, and the line process potential is a 0-1 function reflecting the cost of various line configurations,
allowing neighboring sites to have different interpretations. The line process compensates for the
boundary-blurring side effect introduced by the smoothness term. The cost function is minimized
iteratively using simulated annealing (SA), starting with an initial labeling X: first, the mapping
parameters are calculated for each region using least squares and the initial temperature T is set. In the
second step, the pixel sites are scanned; at each site, the label Xi = X(xi) is perturbed randomly and the
perturbation is accepted or rejected according to the cost function. The third step re-estimates the mapping
parameters for each region in the least-squares sense once all pixel sites have been visited, based on the new
segmentation label configuration. In the last step, the temperature T is lowered and the process repeats from
step 2 until a stopping criterion is satisfied. The drawback of this algorithm is the prohibitive computation
involved in the SA.
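The annealing loop can be sketched as follows; the energy function is treated as a black box, the cooling schedule and acceptance rule are illustrative, and a practical implementation would evaluate only the local change in energy at each site rather than the full cost:

    import numpy as np

    def anneal_labels(labels, energy_fn, n_labels, t0=4.0, cooling=0.95, sweeps=50):
        rng = np.random.default_rng(0)
        t = t0
        e = energy_fn(labels)
        for _ in range(sweeps):
            for idx in np.ndindex(labels.shape):
                old = labels[idx]
                labels[idx] = rng.integers(n_labels)    # random perturbation of the label
                e_new = energy_fn(labels)
                # Accept downhill moves always, uphill moves with probability exp(-dE/T).
                if e_new <= e or rng.random() < np.exp((e - e_new) / t):
                    e = e_new
                else:
                    labels[idx] = old                   # reject and restore
            t *= cooling                                # lower the temperature
        return labels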
Chang et al [CTS94] proposed a Bayesian approach based on a representation of the motion field as the
sum of a parametric field and a residual field. The parameters of the eight-parameter model (4.2.1.2) are
obtained for each region in the least-squares sense from the dense field. The cost function to be minimized,
resulting from the MAP criterion, consists of four terms, each derived from an MRF. The first term U1 is the
temporal continuity term, measuring how good the prediction is; it is minimized when both the
synthesized and the dense motion field minimize the displaced frame difference (DFD):

U1 = α Σx [ Ik(x) - Ik+1(x + ũ(x)) ]^2        (4.2.5.3)
where Ik and Ik+1 are the two frames under analysis and α is a normalization factor. The second term U2
is the close-to-data term, the same as the first term of (4.2.3.1); it is minimized if the parametric
representation is consistent with the dense flow field. The third term U3 is a piecewise smoothness term,
intended to replace the line process of [MB87] by enforcing spatial smoothness only on flow vectors
generated by a single object:

U3 = β Σxi Σxj∈Nxi || u(xi) - u(xj) ||^2 δ( X(xi) - X(xj) )        (4.2.5.4)
The fourth term U4 is a standard spatial smoothness term, the same as the second term of
(4.2.3.1), represented by a potential function that enforces a smooth label field. Since the number of unknowns
is three times higher when the motion field has to be estimated as well, the computational complexity is
significantly larger. Chang et al decompose the cost function into two parts and alternate between
estimating the motion field and the segmentation labels using highest confidence first (HCF) and iterated
conditional modes (ICM), respectively. Compared with the approach of [MB87], the cost function has two
additional terms: the piecewise smoothness term and the temporal continuity term. The use of ICM alleviates
the prohibitive computation of the SA.
Bouthemy and Francois [BF93] also embed MAP estimation in ICM. The algorithm is a top-down approach,
starting with an initial guess to obtain an initial segmentation. Two initializations are tested: one starts with
region growing based on a likelihood ratio test on each motion block; the other treats the entire
image as a single region. In the second step, the motion parameters, represented by the affine model
(4.2.1.1), are estimated making use of the initial segmentation. The next step labels each pixel with one of the
estimated affine models according to a MAP cost function similar to that in [MB87]. In the fourth step, the
motion parameters are re-estimated based on the new segmentation field. The fifth step performs motion
prediction between the two frames under analysis to check consistency with the observed frames; if the
motion between the two frames is not compensated well, the procedure repeats from step three. The block
diagram of the algorithm is illustrated in Figure 8(b). Both [BF93] and [CTS94] take temporal continuity
into account, but Bouthemy and Francois adopt a different approach: instead of putting the temporal
continuity criterion into the cost function, it is handled in a separate step of the algorithm.
Although deterministic relaxation algorithms such as ICM are computationally less expensive than SA,
they do not guarantee a global minimum; the minimization is likely to be trapped in a local minimum near
the initial state. A good initialization is therefore essential to the overall performance of the segmentation.
Bayesian motion estimation and segmentation algorithms are criticized for their computational
expense. To alleviate this problem, the Bayesian approach is often conducted in a multi-resolution way.
Stiller [Stiller97] employs a deterministic relaxation technique over a multi-scale pyramid. The
relaxation technique is similar to ICM. The cost function for the MAP criterion consists of all the
terms of the cost function in [CTS94] except the second one.
While Bayesian segmentation algorithms handle the noise problem and work well at finding
homogeneous motion regions, they have no mechanism to solve the over-segmentation problem. They can,
however, be integrated into other segmentation techniques to overcome this limitation. In the segmentation
for content-based functionalities, Mech and Wollborn [MW98] use this method for the creation of the change
detection mask.
4.2.4 Integrated methods: Segmentation for object-based video coding
In the above segmentation techniques, scenes are segmented into planar patches representing different,
or more precisely disconnected, motion regions. These disconnected regions do not necessarily differ in
motion; regions with the same motion may be scattered across the resulting segmentation. In
segmentation-based coding, however, it is preferable to group these scattered regions of the same motion
into a single motion "layer". More gain can be obtained by representing a motion layer with a single set
of motion parameters than by representing each region with an individual set of parameters, and by grouping
the motions in the scene into different motion layers, more efficient coding can be achieved in a
hierarchical manner. Besides, segmentation for coding purposes needs more accurate motion boundaries to
avoid prediction artifacts. To achieve this, the algorithms designed for this purpose focus on finding global
rather than local motion homogeneity, and integrated methods combining 2D and 3D methods, as well as
layered segmentation methods, have been proposed.
Figure 9. Block diagram of the hierarchically structured segmentation: change detection, object definition
and parameter estimation, separation into changed and unchanged regions, prediction, and intra-frame
segmentation
Hötter and Thoma [HT88] proposed a hierarchically structured segmentation algorithm aimed at
object-based coding, where each segmented motion region is described by one set of motion parameters.
The algorithm is a three-step approach, see Figure 9. (1) In the first step, change detection is applied to the two
input frames It and It+1 to segment them into changed and unchanged regions, resulting in the
change detection mask. The change detection mechanism is an iterative three-step process which has been
discussed in Section 3.2 (Figure 4). (2) In the second step, each connected changed region is provisionally
treated as one moving object. Then an eight-parameter motion model (4.2.1.2) is estimated for each region
using the direct method proposed by Tsai and Huang [TH81]. Using the displacement field derived from
the estimated parameters, the change detection mask is verified, resulting in a segmentation mask where the
moving objects are separated from uncovered background or background about to be covered. The separation
mechanism is discussed in Section 3.2 (Figure 5). (3) In the next step of the hierarchically structured
segmentation, the mapping parameters are used to perform a motion-compensated prediction in the
changed regions using a displaced frame difference (DFD) criterion. The image regions not correctly
described by the mapping parameters are detected by the change detector evaluating the prediction
result. These detected regions are then treated like the changed regions of the initialization step. This
procedure is hierarchically repeated until all separately moving objects are described by their mapping
parameters.
Diehl [Diehl91] extends Hötter and Thoma's method into a spatio-temporal algorithm in which both
contour and texture information from the single images and information from successive images are used to
split a scene into various objects. The overall procedure of the algorithm is similar to that of Hötter and
Thoma, but in every stage the segmentation result is refined by an intra-frame counterpart based on
single images. The inter-frame segmentation controls this combination, as motion is used as the main
segmentation criterion. Using the inter-frame segmentation, each of the contour regions is assigned to a
moving region, and adjacent regions of the same type are then merged to obtain the final objects.
The differences between Hötter and Thoma's method and Diehl's method lie in three aspects. Firstly,
Diehl's method is a spatio-temporal one while Hötter and Thoma's method is a motion-based
segmentation. Secondly, the two methods differ in the parametric models employed and in the
estimation of the model parameters: Hötter and Thoma employ the parametric model (4.2.1.2) while Diehl
employs the parametric model (4.2.1.3), which is a more accurate approximation to the object surface; for
the parameter estimation, Hötter and Thoma use the direct method while Diehl optimizes the parameter
estimation using a more complex modified Newton algorithm. Thirdly, Hötter and Thoma use the segmentation
history to support object tracking, which is useful for recovering meaningful moving objects from the scene,
as can be seen later in the segmentation for content-based functionalities. The block diagram of these two
algorithms is given in Figure 9; the central part, marked in bold, is shared by both methods, Diehl's
intra-frame segmentation is drawn on the right with dotted lines, and the two additional blocks of Hötter and
Thoma's algorithm are drawn on the left.
A similar approach to the above two methods has also been taken by Musmann et al [MHO89]. The
segmentation techniques described in these algorithms can be naturally exploited for background-
foreground separation, where all the moving parts in the foreground are regarded as a single moving object.
Because of its suitability for obtaining an integrated moving foreground object, the main idea of these
algorithms was adopted by Mech and Wollborn in their proposal to MPEG-4 [m1949]. The idea of
representing the motions in the scene at different levels was later adopted by Wang and Adelson [WA94] and
Borshukov et al [BBAT97] to represent the scene as layers.
It is worth pointing out here that the concept of object used in the above segmentation
algorithms differs from its meaning in segmentation for content-based functionalities, where
objects represent meaningful, real-world objects. In segmentation for object-based coding, or more
properly segmentation-based coding [DM95], the final segmented objects are regions of homogeneous
motion which can be described by a single set of motion parameters; these regions of uniform motion are
often called objects in these algorithms. It is clear that the final results of the segmentation for object-based
coding are not meaningful objects in the sense used by the content-based functionality methods, because the
motion of real objects is rarely uniform.
There are cases where the scene can be separated into layers. Wang and Adelson [WA94] proposed a
segmentation scheme to separate panoramic scenes into layers. The idea underlying this algorithm is to
align the scene by compensating out the global motion and then to accumulate the aligned frames to recover
the layer(s) of interest. They assume that regions undergoing a common affine motion are part of the same
physical object in the scene. The objective is to derive a single representative image for each layer. The
algorithm starts by estimating an optical flow field and then subdivides the frame into square blocks. The
affine motion parameters are computed for each block by linear regression to get an initial set of motion
models or hypotheses. The pixels are then grouped using an iterative adaptive K-means clustering
algorithm: pixel x is assigned to hypothesis or layer i if the difference between the optical flow at x and the
flow vector synthesized from the affine parameters of that layer is smaller than for any other hypothesis.
Obviously, this does not enforce spatial continuity of the label field. To construct the layers, information
from a longer sequence is necessary because of the accumulation process. The frames are warped according
to the affine motion of the layers such that coherently moving objects are aligned, and a temporal median
filter is then applied to the aligned sequence to enhance the image. By accumulating all the aligned frames,
the layer can be recovered. A similar approach has been proposed by Torres et al [TGM97] and was later
improved by Borshukov et al into a multi-stage affine classification algorithm [BBAT97]. Hsu et al
[HAP94] also adopted this idea to segment the scene into layers of coherent motion with the objective of
coding.
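The layer-assignment step can be sketched as below, assuming the affine hypotheses are supplied as 3x2 parameter matrices; the warping, median filtering and accumulation stages are omitted:

    import numpy as np

    def assign_layers(points, flows, affine_params):
        # Synthesize the flow of every hypothesis at every pixel: u = [x, y, 1] @ theta.
        design = np.column_stack([points, np.ones(len(points))])
        residuals = np.stack(
            [((flows - design @ theta) ** 2).sum(axis=1) for theta in affine_params],
            axis=1)
        # Each pixel goes to the layer whose synthesized flow is closest to the
        # observed optical flow.
        return residuals.argmin(axis=1)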
Some researchers regard these algorithms as VOP segmentation approaches, but the situations to which they
can be applied are rare. In essence, these algorithms for representing the scene as different layers are methods
of representing the scene as different levels of motion, similar in spirit to Hötter and Thoma's idea
[HT88]. In terms of separating scenes into layers of meaningful objects, these algorithms can be
applied only in very limited situations, such as panoramic scenes, due to the strong conditions needed to
recover the layers; such conditions include rigid, monotonically incremental motion, significant depth
variation and a long sequence of frames. Therefore, these approaches are more suitable for coding than
for content-based functionalities, as demonstrated by the results of [BBAT97, HAP94].
5. Spatial-temporal segmentation
Recently, due to emerging multimedia applications such as MPEG-4 and MPEG-7, there is a need
to segment scenes into meaningful objects, or meaningful moving objects, to facilitate the so-called
content-based functionalities. For example, the new international standard MPEG-4 defines a video scene as
consisting of video object planes (VOPs) to support content-based functionalities such as object-based spatial
and temporal scalability and user interaction with the scene content. Although many motion segmentation
techniques are available, techniques for segmenting meaningful objects from generic scenes do not
exist. From the preceding discussion, we note that the integrated methods employed in segmentation
for object-based coding are especially suitable for this purpose; for example, Diehl's spatio-temporal
approach combined with Hötter and Thoma's memory mechanism is a possible way to achieve segmentation
of meaningful objects from a scene. Several schemes have attempted to segment moving objects from the
background using a spatio-temporal approach. In the following, we discuss these algorithms by
analyzing their temporal and spatial parts separately.
Mech and Wollborn [MW98, m1949] propose a segmentation scheme based on Hötter and Thoma’s
algorithm [HT88] and Diehl’s algorithm [Diehl91]. The algorithm is implemented in four steps, which are
illustrated in the left part of Figure 10.
(1) In the first step, any apparent camera motion is estimated and compensated using the eight-parameter
motion model of (4.2.1.2). (2) In the second step, an apparent scene cut or strong camera pan is
detected by evaluating the mean squared error (MSE) between the two successive frames, considering
only background regions of the previous frame. In case of a scene cut the algorithm is reset. (3) The third
step is a change detection mask (CDM) module, see the leftmost part of Figure 9. First, an initial CDM
called CDMi between two successive frames is generated by a relaxation technique (see the discussion of
[AKM93] in Section 3.2), using local thresholds which consider the state of neighboring pixels. In order to
get temporally stable object regions, a memory for change detection masks (CDM) is then applied to make
use of the previous segmentation results. The updated CDM (CDMu) is then simplified using a morphological
closing operator to generate the final CDM for object detection. (4) In the fourth step, the same technique
used by Thoma and Bierling [TB89] is used to obtain an initial moving object mask (OMi) from the CDM
(see Figure 5 in Section 3.2). It is then adapted to luminance edges of the corresponding frame, resulting in
the final object region. The key idea in this algorithm is to get an initial object mask from the change
detection, which can be used to improve and track the object of interest at a later stage. In order to create the
initial object mask, instead of first performing change detection followed by global motion estimation
as is done in [HT88], the algorithm first applies global motion compensation to eliminate motion
caused by the camera, after which change detection can be applied. A similar method has also been used by
Dufaux et al [DML95]. The algorithm eliminates the assumption of a stationary background made in
[HT88] and [Diehl91] by allowing camera panning and zooming. In the global motion estimation, pixels
whose distance from the left or right border is less than 10 pixels are used as observation points, assuming
that there is no moving object near the left and right image border. This assumption can easily be violated
in most natural scenes. In order to overcome the limitations in the motion estimation, a more effective motion
estimation method has to be found.
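A rough sketch of the change detection step is given below, assuming camera motion has already been compensated; the global threshold and the structuring-element size are placeholders for the locally adaptive relaxation thresholds actually used in [MW98].

```python
import numpy as np
from scipy import ndimage

def change_detection_mask(prev_frame, curr_frame, thresh=15.0, close_size=5):
    """Threshold the absolute frame difference and simplify the result
    with a morphological closing, giving a rough change detection mask."""
    diff = np.abs(curr_frame.astype(np.float64) - prev_frame.astype(np.float64))
    cdm = diff > thresh                                   # initial CDMi
    structure = np.ones((close_size, close_size), bool)   # structuring element
    return ndimage.binary_closing(cdm, structure=structure)
```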
Figure 10. Temporal segmentation in [MW98]: frames It and It+1 are fed to CDM generation, giving the initial CDMi, which is simplified into the final CDM and then the object mask OM.
Meier [Meier98, m2238] proposes a VOP segmentation scheme with a procedure similar to that of [MW98]. The
algorithm starts with the estimation of a dense motion field using block matching. The global motion is
then modeled by the six-parameter affine transformation of (4.2.1.1), with the parameters obtained
using a robust least median of squares algorithm. However, without a pre-segmentation of the dense
motion field, the result of global motion estimation can be far from the correct global motion, because
pixels in the moving object region are selected as observation points just as background pixels are,
whereas the observation points used for the parameter estimation should be restricted to the background.
The initial object model, or initial object mask, is obtained by combining the motion
segmentation result from a complex morphological motion filtering with a spatial segmentation using the
Canny operator. Instead of the memory tracking exploited in [MW98], object tracking is done
using the Hausdorff distance together with a model update process. The VOP extraction is applied as a
post-processing step. Since some key parameters in the motion segmentation process need input from the user, this is
not a fully automatic VOP segmentation algorithm.
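The global motion estimation step can be illustrated with an ordinary least-squares fit of the six-parameter affine model to motion vectors at background observation points; Meier's scheme uses a robust least median of squares estimator instead, which this sketch does not reproduce.

```python
import numpy as np

def fit_affine_global_motion(xs, ys, us, vs):
    """Fit u = a0 + a1*x + a2*y and v = a3 + a4*x + a5*y to the motion
    vectors (us, vs) observed at background pixel positions (xs, ys)."""
    ones = np.ones_like(xs, dtype=np.float64)
    A = np.stack([ones, xs, ys], axis=1)          # design matrix, one row per point
    ax, *_ = np.linalg.lstsq(A, us, rcond=None)   # a0, a1, a2
    ay, *_ = np.linalg.lstsq(A, vs, rcond=None)   # a3, a4, a5
    return np.concatenate([ax, ay])
```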
Neri et al [NCRT98, m2365] proposed a segmentation algorithm based on high order statistics (HOS).
The algorithm produces the segmentation map of each frame f_k of the sequence by processing a group of
frames {f_{k-i}, i = 0, …, n}. The number of frames n varies on the basis of the estimated object velocity. Any
global motion component is removed by a pre-processing stage that aligns f_{k-j} to f_k, j = 1, …, n. For each frame f_k,
the algorithm proceeds in three steps, as illustrated in Figure 11.
(1) In the first step, the frame differences { d_{k-j}(x,y) = f_{k-j}(x,y) − f_{k-n}(x,y), j = 0, …, n−1 } of each frame of the
group with respect to the first frame f_{k-n} are evaluated in order to detect the changed areas due to object
motion, uncovered background and noise. In order to reject the luminance variations due to noise, a
Higher Order Statistics test is performed. Namely, for each pixel (x,y) the fourth-order central moment
m^(4)_{d_{k-j}}(x,y) of each inter-frame difference d_{k-j}(x,y) is estimated on a 3×3 window; it is compared with a
threshold adaptively set on the basis of the estimated background activity, and set to zero if it is below the
threshold. On the sequence of the thresholded fourth-order moment maps, a motion detection procedure is
performed. (2) This step aims at distinguish changed areas representing uncovered background (which
stands still in the HOS maps) and moving objects (moving in the HOS maps). At the j-th iteration, the pair
of thresholded HOS maps m ~ ( 4) ( x, y), m
~ ( 4) ( x, y ) is examined. For each pixel (x,y) the displacement of it
dk − j d k − j −1
is evaluated on a 3x3 window, adopting a SAD criterion (Sum of Absolute Differences), and if the
displacement is not null the pixel is classified as moving. Then, the lag j is increased (i.e. the pair of maps
slides) and the motion analysis is repeated, until j=n-2. Pixels presenting null displacements on all the
observed pairs are classified as still. Note that, from a computational point of view, at each iteration, only
pixels that were not already classified as moving are examined. Moreover, the matching is not necessary on
the pixels which are zero and are included in a 3x3 zero window, and they are assumed to be still. Thus, at
the expense of some comparisons, the matching is performed on few pixels, and at the expense of
examining more than a pair of difference, the search window is small. (3) Still regions, internal to moving
regions, are re-assigned to the foreground. The regularization algorithm then refines the results by imposing a priori local
connectivity constraints on both the background and the foreground. Namely, topological constraints on
the size of object irregularities such as holes, isthmi, gulfs and isles are imposed by means of
morphological filtering. Five morphological operations, with a circular structuring element, are applied.
From a topological point of view, the regularization supports multiple moving regions whose size exceeds
the size of the structuring element, corresponding to different moving objects. Finally, a post-processing
operation refines the results on the basis of spatial edges.
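The HOS test itself can be sketched as follows: the fourth-order central moment of the inter-frame difference is estimated over a 3×3 window at every pixel and then thresholded; the fixed threshold below stands in for the adaptive, background-activity-based threshold of [NCRT98].

```python
import numpy as np
from scipy.ndimage import uniform_filter

def hos_map(diff, thresh):
    """Per-pixel fourth-order central moment of the frame difference,
    estimated on a 3x3 window, with values below `thresh` set to zero."""
    d = diff.astype(np.float64)
    m1 = uniform_filter(d, size=3)              # local mean E[d]
    m2 = uniform_filter(d ** 2, size=3)         # E[d^2]
    m3 = uniform_filter(d ** 3, size=3)         # E[d^3]
    m4 = uniform_filter(d ** 4, size=3)         # E[d^4]
    # central moment E[(d - E[d])^4] expanded in raw moments
    central4 = m4 - 4 * m3 * m1 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4
    central4[central4 < thresh] = 0.0
    return central4
```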
The temporal segmentation in Choi et al’s algorithm [CLK97, m2091] also results from camera
motion compensated change detection. The change detection mask, resulting from a Neyman-Pearson test
based on the statistical characteristics of the observation window, is overlaid on the current segmented
regions resulting from the morphological segmentation based on spatial information. When the majority
of a segmented region belongs to the changed region in the change detection mask, the whole area of the
segmented region is declared foreground, otherwise background.
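The combination rule can be sketched as a simple majority vote per spatial region; the 0.5 majority fraction is an assumption, as the exact decision parameters of [CLK97] are not given here.

```python
import numpy as np

def classify_regions(region_labels, change_mask, majority=0.5):
    """Declare a spatially segmented region foreground when more than
    `majority` of its pixels fall inside the change detection mask."""
    foreground = np.zeros_like(change_mask, dtype=bool)
    for label in np.unique(region_labels):
        region = region_labels == label
        if change_mask[region].mean() > majority:
            foreground |= region
    return foreground
```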
Up to now, the algorithms discussed in this section all ignore depth information in the scene. This is
because in applications such as coding and content-based functionalities, the scenes under analysis are
usually assumed to be 2D scenes. In a 2D scene, the scene is viewed from a camera at such a distance that it
can be approximated by a flat 2D surface, and the camera undergoes only rotations and zooms. In these
situations, camera translation or ego-motion is not significant, therefore the depth variations
are not significant and are often ignored. But in a generic scene the camera can be close to the moving objects or
can be translating, which induces significant depth variations in the images. Depth
information can then be employed to help analyze motion events such as occlusion.
Pardás [Pardás97] proposed a segmentation scheme that uses motion coherence information together with
depth level information. The scheme is a two-level bottom-up approach. The bottom level is based on
the grey level information, while the top level uses the relative depth of the regions in order to merge
regions from the previous level. Temporal continuity is achieved by means of a region tracking procedure
implicit in the grey level segmentation and by filtering the depth estimation. The bottom level segmentation
uses a time-recursive segmentation scheme relying on the grey level information. In the top level, a
classification of the regions obtained in the previous level is done relying on an estimation of the relative
depth between these regions. Neighbouring regions which are found to be at the same depth level are
considered a unique region at this segmentation level. The relative depth of the regions of the grey level
segmentation is estimated by considering the occlusions between regions and the motion coherence
between neighbouring regions. This estimation procedure is performed in four steps: motion estimation,
motion parameter comparison, overlapping computation and depth level assignment. Only image sequences
with simple scenes are tested in the work.
The segmentation algorithms described in this section are all automatic approaches that try to segment the
scene into meaningful objects. More precisely, these algorithms separate the scene into background and moving
objects in the foreground. The segmented foreground is not processed further, although it may still be
separable. These algorithms directly target the content-based functionalities addressed by MPEG-4. Most
of these algorithms base the segmentation of moving objects on change detection instead of on the motion
field. This is because the frame difference field is more reliable than the motion field, as explained at the
beginning of Section 3.2.
Although these approaches are far from the ultimate goal of segmentation, namely separating the scene into
semantically meaningful objects, they are promising.
Segmentation based on motion information is unlikely to achieve an accurate result without the help of
spatial information. Therefore, many of the effective segmentation schemes proposed are spatio-temporal
schemes. While in these approaches motion information is usually used as the main criterion to guide the
segmentation process, spatial segmentation also plays an important role in the algorithms. Two types of
spatial segmentation methods are usually exploited: contour based and region based.
In [MW98, m1949], spatial segmentation is applied twice in the process. A morphological closing
operator is used to eliminate potential small regions in the change detection mask (CDM) resulting from the
change detection process in the previous stage. In order to avoid mistakenly merging small regions into
the main body, a special ternary mask is created to record false change points. Within this ternary region,
small regions with size below a certain threshold are eliminated. After this simplification process, an object
mask (OM) is created by eliminating the uncovered background from the segmented CDM. This mask
may still include undesired background points or pieces around its boundaries due to noise. For this reason,
edge information extracted from the current frame using the Sobel operator is exploited to improve the
boundaries of the object mask. Motion boundaries within a certain radius of the local edges are adapted to
those edges. This results in the final moving object.
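The elimination of small regions from a binary mask can be sketched with connected-component labeling; the size threshold is a placeholder for the threshold used in [MW98].

```python
import numpy as np
from scipy import ndimage

def remove_small_regions(mask, min_size=50):
    """Drop connected components of a binary mask whose area is below min_size."""
    labels, n = ndimage.label(mask)
    sizes = ndimage.sum(mask, labels, index=np.arange(1, n + 1))
    keep = np.zeros(n + 1, dtype=bool)
    keep[1:] = sizes >= min_size          # background label 0 stays False
    return keep[labels]
```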
While edges play a post-processing role in [MW98], edge information plays a key role in the
segmentation process proposed by Meier [Meier98, m2238]. Instead of using an intensity map for the
object mask, Meier uses an edge map to represent the object mask, on the grounds that grey level representations
are not reliable due to their sensitivity to changes in illumination. This representation leads to expensive
distance comparisons in the tracking stage and a close-fill procedure in the final object extraction stage. For this purpose,
the change detection mask (CDM) resulting from the change detection process in the previous stage is
adapted to the edge map extracted from the current frame using the Canny operator. The binary object mask
(OM) of a moving object is then created by selecting all edge points that belong to the CDM. The subsequent
operations of tracking, updating and video object plane (VOP) extraction are all based on this binary
model. The binary OM is tracked through the sequence using the Hausdorff distance. A model update that
accommodates both rigid and non-rigid moving parts of an object (referred to as slowly changing and
rapidly changing components, respectively) follows. The VOP extraction is a close-fill process, where
the close process is an erosion of the OM boundaries. After this, the VOP can be created by simply filling in
the closed OM. The close process is realized by examining a 3×3 neighborhood using an adjacency code
(AC) combined with a look-up table. Boundary gaps are dealt with using Dijkstra’s shortest path. However,
there are still some gaps in the final model that cannot be connected, as can be seen from the results
presented in the work. This can cause problems in the filling-in process, and the author fails to point out how it
is dealt with. Besides, the empirical setting of the distance values for different types of points in the gap filling
stage is ad hoc.
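Tracking with the Hausdorff distance can be sketched with a distance transform: for two binary edge maps, the directed distance from model to image is the maximum (or, for a partial Hausdorff distance, a high percentile) of the image's edge distance transform sampled at the model's edge pixels. This is a generic sketch, not Meier's exact formulation.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def directed_hausdorff(model_edges, image_edges, quantile=1.0):
    """Directed (partial) Hausdorff distance from model edge pixels to the
    nearest image edge pixel; quantile=1.0 gives the classical distance."""
    # distance of every pixel to the nearest image edge pixel
    dist_to_image = distance_transform_edt(~image_edges)
    return np.quantile(dist_to_image[model_edges], quantile)

def hausdorff(model_edges, image_edges, quantile=1.0):
    """Symmetric Hausdorff distance between two binary edge maps."""
    return max(directed_hausdorff(model_edges, image_edges, quantile),
               directed_hausdorff(image_edges, model_edges, quantile))
```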
In Choi et al’s segmentation scheme [CLK97, m2091], spatial segmentation constitutes a core part of the
algorithm. The spatial segmentation is based on a morphological segmentation which utilizes
morphological filters and the watershed algorithm as basic tools. In the first step, images (or motion
compensated images if there is global motion) are simplified by morphological open-close by
reconstruction filters. These filters remove regions that are smaller than a given size but preserve the
contours of the remaining objects in the image. In the second step, the spatial gradient of the simplified
image is approximated by the use of a morphological gradient operator. In order to increase robustness,
color information is also incorporated into the gradient computation, and the estimated gradient is
thresholded. The spatial gradient is used as input to the watershed algorithm to partition an image into
homogeneous intensity regions. In the third step, the boundary decision is taken through the use of the
watershed algorithm. The watershed algorithm is a region growing algorithm which assigns pixels in the
uncertainty area to the most similar region according to some segmentation criterion, such as the difference of intensity
values. The final result of the watershed algorithm is usually an over-segmented tessellation of the input
image. To overcome this problem, region merging follows. In the fourth step, small regions are merged in
order to yield larger and meaningful regions which are homogeneous and different from their neighbors. For
this purpose, a joint similarity measure T is chosen to compare a region R under consideration with its
neighbors; T is the weighted sum, with weight factor β, of the average sum of absolute differences (ASAD) of the
two corresponding regions between two frames and of the average intensity of the region in the current frame.
The region under consideration is then merged with the neighboring region for which the difference between the
two similarity measures is smallest. A drawback of this algorithm is the need for preset values for the
thresholds and parameters employed in the gradient approximation and region merging stages.
Morphological operations and the watershed algorithm prove to be very powerful segmentation tools. In order
to make them efficient to implement, much research has been carried out in this area
[SP94, VS91].
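A minimal sketch of the morphological-gradient-plus-watershed pipeline is shown below, assuming scikit-image is available; the simplification filters, the colour gradient and the marker selection of [CLK97] are omitted.

```python
import numpy as np
from scipy.ndimage import grey_dilation, grey_erosion
from skimage.segmentation import watershed

def morphological_watershed(image, grad_thresh=5.0, size=3):
    """Partition an intensity image into regions by flooding its
    (thresholded) morphological gradient; typically over-segments."""
    img = image.astype(np.float64)
    gradient = grey_dilation(img, size=(size, size)) - grey_erosion(img, size=(size, size))
    gradient[gradient < grad_thresh] = 0.0      # suppress noisy gradients
    return watershed(gradient)                  # labels of catchment basins
```

The over-segmentation noted above is visible in practice as many small catchment basins, which is why a region merging step follows.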
Wardhani and Gonzalez [WG99] proposed a split and merge image sequence segmentation scheme for
content based retrieval. The algorithm first splits the frame into regions of homogeneous color. It then
implements merging by applying the Gestalt laws of proximity, similarity, good continuation, closure,
surroundedness, relative size, symmetry and common fate. In their algorithm, size, color and texture are
used as proximity and similarity criteria; lines or edges are used as high level information for the good
continuation grouping criterion. Surroundedness grouping can eliminate regions of holes, isthmi, gulfs and
isles of small size. Symmetry grouping is effective for symmetrical objects. Finally, based on the
motion information obtained by a weighted comparison of colour, size and location of the segmented
regions between frames, an inter-frame grouping is applied as the common fate grouping criterion. This
algorithm is mainly based on image segmentation techniques; since region growing techniques have over-
segmentation problems, if part of the object of interest in the scene does not move, the algorithm can still
have problems extracting an integrated or meaningful object.
Meier et al [MNG85] proposed a Bayesian segmentation based on highest confidence first (HCF), which is
an improvement on Pappas’ method [Pappas92]. In this work, the cost function is composed of three parts.
The first term is the close-to-data term originating from the conditional probability P(O|X), which is
modeled as the squared difference between observation points and the mean of the region. The second term
consists of two parts: the first part is the sum of two-point clique potential functions and the second part is
the sum of three-point clique potential functions based on the edge image. The algorithm starts from
seed points, or initial regions, which are selected on a grid with spacing d. Based on these seeds, a modified
HCF technique labels pixels in order of confidence. Regions that were missed by the initialization grid
remain uncommitted after this stage. For these regions a new label is created, so that in the second stage all
pixels can be assigned to a region by HCF, resulting in the final partition. The approach improves on and differs
from Pappas’ method in two aspects. Firstly, different regions carry different labels to ensure that only
pixels belonging to the same region are included in the calculation of a region’s mean grey level.
Secondly, there is no need to preset the labels or the number of classes K, and no need for an initial estimate of the
segmentation X.
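The flavour of such a cost function can be sketched as a close-to-data term (squared difference to the region mean) plus a simple two-point clique potential that penalizes neighbouring pixels carrying different labels; the actual potentials in [MNG85], including the edge-based three-point cliques, are more elaborate than this sketch.

```python
import numpy as np

def segmentation_energy(image, labels, beta=1.0):
    """Close-to-data term plus a Potts-like two-point clique term over
    horizontal and vertical neighbours (a simplified MRF energy)."""
    img = image.astype(np.float64)
    data_term = 0.0
    for lab in np.unique(labels):
        region = labels == lab
        data_term += np.sum((img[region] - img[region].mean()) ** 2)
    # two-point cliques: count label disagreements between 4-neighbours
    cliques = (np.sum(labels[:, 1:] != labels[:, :-1])
               + np.sum(labels[1:, :] != labels[:-1, :]))
    return data_term + beta * cliques
```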
As mentioned at the beginning of this section, image segmentation algorithms are either based on
contour finding or on region growing. For contour based segmentation, edge detection techniques are the
most widely adopted; the Canny and Sobel operators are the most commonly used edge detection methods due to their
resilient performance. For region growing techniques, three typical segmentation methods have been
presented: the classic hybrid approach [WG99], the method based on morphological filters and the
watershed algorithm [m2091], and the Bayesian method [MNG85]. Arguably, Bayesian segmentation is the
most widely used segmentation method, but because it usually requires a relaxation minimization process
such as simulated annealing, it is computationally expensive. In recent research, region growing
methods based on morphological filters and the watershed algorithm are dominantly used. For a complete
review of image segmentation, readers are referred to [PP93, HS85, RMB97, GJJ96, Dougherty92].
It can be seen from the table that all robust segmentation algorithms are achieved at the cost of high
computational complexity. It can also be concluded from the table that the spatio-temporal approaches perform
comparatively better than the other methods. In the following we give a complete segmentation scheme
reflecting the latest spatial and temporal segmentation techniques.
the temporal and spatial segmentation results are combined to obtain moving object boundaries. The block
diagram is illustrated in Fig. 12. In the following, we summarize each of these steps.
Figure 12. Block diagram of the combined temporal and spatial segmentation framework: the input video sequence undergoes camera motion compensation, temporal segmentation and spatial segmentation are then performed, and their results are combined.
Camera motion, or global motion, is modeled by the eight-parameter motion model of (4.2.1.2); the
parameters are estimated by regression considering only pixels within background regions of the previous
image. After the estimation of the motion parameters, the motion vector of every pixel is known. A
post-processing step then accounts for model failures in background regions due to the assumption of a rigid plane.
In this post-processing step, the estimated motion vector of every background pixel is improved by
performing a full search within a square area of limited size. The frame is then motion compensated
according to the estimated motion.
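Compensating camera motion with an eight-parameter (perspective) model can be sketched as follows; the parameterization used here, x' = (a0 + a1 x + a2 y)/(1 + a6 x + a7 y) and y' = (a3 + a4 x + a5 y)/(1 + a6 x + a7 y), is the common form of such a model and is assumed to match (4.2.1.2), which is not reproduced in this section.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def compensate_global_motion(prev_frame, params):
    """Warp the previous frame towards the current one with an
    eight-parameter perspective motion model."""
    a0, a1, a2, a3, a4, a5, a6, a7 = params
    h, w = prev_frame.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    denom = 1.0 + a6 * xs + a7 * ys
    xp = (a0 + a1 * xs + a2 * ys) / denom      # source column for each pixel
    yp = (a3 + a4 * xs + a5 * ys) / denom      # source row for each pixel
    return map_coordinates(prev_frame, [yp, xp], order=1, mode='nearest')
```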
There are normally dramatic content changes at scene cuts. Thus temporal segmentation should
only be used between frames within the same shot. The scene cut detector evaluates whether the difference
between the current original image and the camera motion compensated previous image exceeds a given
threshold. The evaluation is only performed within the background regions of the previous frame.
The temporal segmentation algorithm, which is mainly based on a change detector in [MW98] and has
been discussed in Section 5.1, can be summarized into the following four steps, assuming that a possible
camera motion has already been compensated.
(i) An initial change detection mask (CDMi) between two successive frames is generated by
thresholding the difference image.
(ii) Boundaries of the CDMi are smoothed by a relaxation on a MAP detector [AKM93], using
local thresholds which consider the state of neighboring pels. This results in a change
detection mask (CDM). The CDM is simplified by use of a morphological closing operator
and elimination of small regions. In order to get temporally stable object regions, a memory
for the CDM is applied, denoted as MEM. The temporal depth of MEM adapts automatically
to the sequence.
(iii) An initial moving object mask (OMi) is estimated by eliminating the uncovered background
from the CDM [HT88]. Displacement information for pels within the CDM is used. The
displacement of each pel within the CDM is calculated by hierarchical block matching. After
the estimation of the displacement vector for every pel in the CDM, motion vectors whose
‘head’ and ‘foot’ are both within the CDM are marked as motion vectors corresponding to
object pels; this results in the OMi (see the sketch after this list). All other pels are treated as
covered/uncovered background.
(iv) In the last step, the OMi is adapted to the luminance edges of the corresponding frame,
resulting in the final object mask (OM).
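Step (iii) can be sketched as a head-and-foot test on the displacement field: a pel inside the CDM is kept in the initial object mask only if the pel it is displaced to also lies inside the CDM. The array names and the integer rounding of displacements are assumptions of this sketch.

```python
import numpy as np

def initial_object_mask(cdm, disp_x, disp_y):
    """Mark CDM pels whose displacement vector starts and ends inside the
    CDM as object pels; the rest are treated as covered/uncovered background."""
    h, w = cdm.shape
    ys, xs = np.nonzero(cdm)                       # 'foot' positions inside the CDM
    heads_x = np.clip(xs + np.round(disp_x[ys, xs]).astype(int), 0, w - 1)
    heads_y = np.clip(ys + np.round(disp_y[ys, xs]).astype(int), 0, h - 1)
    om = np.zeros_like(cdm, dtype=bool)
    om[ys, xs] = cdm[heads_y, heads_x]             # keep only if 'head' is also in CDM
    return om
```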
Spatial segmentation splits the entire image into homogeneous regions in terms of intensity. The
different homogeneous regions are distinguished by their encompassing boundaries. The spatial
segmentation algorithm is implemented in four steps.
(i) The input images (or motion compensated images if there is global motion) are simplified by
morphological open-close by reconstruction filters. These filters remove regions that are
smaller than a given size but preserve the contours of the remaining objects in the image.
(ii) The spatial gradient of the simplified image is approximated by the use of a morphological
gradient operator. In order to increase robustness, color information is also incorporated into
the gradient computation, and the estimated gradient is thresholded to remove noisy gradients.
The spatial gradient is used as input to the watershed algorithm to partition an image into
homogeneous intensity regions.
(iii) The boundary decision is taken through the use of the watershed algorithm. The watershed
algorithm is a region growing algorithm which assigns pixels in the uncertainty area to the
most similar region according to some segmentation criterion, such as the difference of intensity values.
The watershed algorithm is highly sensitive to gradient noise, which yields many catchment
basins; the final result of the algorithm is usually an over-segmented tessellation of the input
image. To overcome this problem, region merging follows.
(iv) In the region merging step, small regions are merged in order to yield larger and meaningful
regions which are homogeneous and different from their neighbors. For this purpose, a similarity
measure T is chosen to compare a region under consideration with its neighbors. T is the
sum of the average sum of absolute differences (ASAD) of the two corresponding regions between
two frames and the average intensity of the region in the current frame. A region under
consideration is then merged with the neighboring region for which the difference between the two similarity
measures is smallest.
In the last step, the OMs, or moving object boundaries, are obtained by combining the spatially segmented
regions with the object mask (OM) obtained from the temporal segmentation. First, the OM is overlaid on
top of the spatial segmentation mask; when the majority of a spatially segmented region belongs to the
OM, the whole area of the segmented region is declared part of the OM, otherwise background. Next,
the spatially segmented regions are also overlaid on the previous OM in the MEM; if the majority of a
region under consideration belongs to the previous OM, the region is also declared part of the OM.
7.6 Discussion
This framework represents the current state of the art of automatic segmentation for MPEG-4 video
applications. The work is a spatio-temporal approach in which motion is utilized as the main segmentation
criterion for the final decision on the object definition. The resulting foreground (OMs) and background
are not treated further. Due to the frequent use of ad hoc thresholding and preset parameters, and because some
system values need to be adjusted for different scenes, the framework can be very unstable. The final results are
promising, but are not completely semantically meaningful, especially when part of the object is not
moving. Since spatial segmentation, such as morphological operations and edge information, has already
been incorporated in the temporal segmentation to refine the object mask (OM) and its boundaries, it is not
clear whether another process of spatial segmentation is necessary and can do better; no results
demonstrating this necessity are given. Furthermore, there are several underlying assumptions in the
algorithm. Firstly, it is assumed that there is no relatively abrupt motion or significant camera translation
between two consecutive frames (at least between the first two frames), otherwise the global motion cannot
be compensated by the global motion model adopted. Secondly, it is assumed that the background is
composed of rigid planar surfaces; if there are non-rigid moving elements such as moving clouds, smoke or
flowing water, they may be included in the foreground objects. Thirdly, it is assumed that all parts of the
objects of interest are in motion (not necessarily uniform motion, though); if part of an object has not moved, it
is missed in the final segmentation. Finally, there is a further assumption in the global motion
estimation that moving objects are at least 10 pixels away from three of the frame’s boundaries (upper, left and
right). These assumptions can affect performance when the framework is applied to generic scenes. In recognition
of these limitations of current automatic techniques, two semi-automatic methods that require human
intervention are also included in the annex of MPEG-4 [N2553].
8. Conclusion
In this paper we have reviewed current techniques for the segmentation of moving objects in image
sequences. Emphasis has been given to techniques of segmentation for content-based functionalities.
The most promising approach combines temporal and spatial segmentation.
It is clear from the review that although great advances have been made in image/video segmentation
techniques, there are still challenges in achieving fully automatic segmentation/extraction of semantically
meaningful objects from generic scenes.
Reference:
1. [Adiv85] Gilad Adiv. Determining three-dimensional motion and structure from optical flow
generated by several moving objects. IEEE PAMI 7(4):384-401, July 1985.
2. [AKM93] Til Aach, Andre Kaup and Rudolf Mester. Statistical Model-based Change Detection in
Moving Video. Signal Processing 31(2):165-180, 1993.
3. [AN88] J. K. Aggarwal and N. Nandhakumar. On the Computation of Motion from Sequences of
Images: A Review. Proc. of the IEEE, 76(8):917-935, 1988.
4. [BB95] S. S. Beauchemin and J. L. Barron. The Computation of Optical Flow. ACM Computing
Surveys, 27(3), Sept. 1995.
5. [BBAT97] Georgi D. Borshukov, Gozde Bozdagi, Yucel Altunbasak and A. Murat Tekalp.
Motion Segmentation by Multi-stage Affine Classification. IEEE Trans. on Image Processing
6(11):1591-1594, Nov. 1997.
6. [BF93] Patrick Bouthemy and Edouard Francois. Motion Segmentation and Qualitative Dynamic
Scene Analysis from an Image Sequence. International Journal of Computer Vision, 10:2, 157-
182, 1993.
7. [Bestor98] Gareth S. Bestor. Recovering Feature and Observer Position By Projected Error
Refinement. Ph.D thesis, University of Wisconsin- Madison, 1998.
8. [CLK97] Jae Gark Choi, Si-Woong Lee and Seong-Dae Kim. Spatio-Temporal Video
Segmentation Using a Joint Similarity Measure. IEEE Trans. on Circuits and Systems for Video
Technology vol.7 No.2 April 1997, pp.279-286.
9. [Clocksin80] W. F. Clocksin. Perception of Surface Slant and Edge Labels from Optical Flow: A
Computational Approach, Perception 9:253-269, 1980.
10. [CTS94] Michael M. Chang, A. Murat Tekalp and M. Ibrahim Sezan. An Algorithm for
Simultaneous Motion Estimation and Segmentation. IEEE Int. Conf. On Acoustics, Speech and
Signal Processing, ICASSP’94, Adelaide, Australia, April 1994, vol. V pp.221-224.
11. [Diehl91] Norbert Diehl. Object-Oriented Motion Estimation and Segmentation in Image
Sequences. Signal Processing: Image Communication 3:23-56, 1991.
12. [DM95] F. Dufaux, F. Moscheni. Segmentation-based motion estimation for second generation
video coding techniques. In Video Coding: the Second Generation Approach, L. Torres and M.
Kunt Eds., Kluwer Academic Publishers, pp. 219-263, 1995.
13. [DML95] Frederic Dufaux, Fabrice Moscheni and Andrew Lippman. Spatio-Temporal
Segmentation Based On Motion And Static Segmentation. Proc. Int. Conf. On Image Processing
vol. I Oct. 1995 Washington D.C. pp.306-309.
14. [Dougherty92] E. R. Dougherty. An introduction to morphological image processing. SPIE,
Bellingham, Washington, 1992.
15. [Fusiello98] Andrea Fusiello. Three-Dimensional Vision For Structure and Motion Estimation.
Ph.D thesis, University of Udine, Italy, November 1998.
16. [GG84] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian
restoration of images. IEEE Trans. on Pattern Analysis and Machine Intelligence PAMI-6(6):721-
741, 1984.
17. [GJJ96] E. Gose, R. Johnsonbaugh and Steve Jost. Pattern Recognition and Image Analysis.
Prentice Hall PTR, Upper Saddle River, NJ, 1996.
18. [HAP94] S. Hsu, P. Anandan, and S. Peleg. Accurate computation of optical flow by using layered
motion representation. International Conference on Pattern Recognition, Jerusalem, Oct.
1994.
19. [Heldreth84] E. Hildreth. The Measurement of Visual Motion. MIT press, 1984.
20. [HS81] Berthold K. P. Horn and Brian G. Schunck. Determining Optical Flow. Artificial
Intelligence 17(1-3): 185-203 1981.
21. [HS85] Robert M. Haralick and Linda G. Shapiro. Survey: Image Segmentation Techniques.
Computer Vision, Graphics and Image Processing 29:100-132, 1985.
22. [HT88] Michael Hötter and Robert Thoma. Image Segmentation Based on Object Oriented
Mapping Parameter Estimation. Signal Processing 15, 1988 315-334.
23. [IRP92] M. Irani, B. Rousso and S. Peleg. Detecting and tracking multiple moving objects using
temporal integration. In G. Sandini, editor, Proc. 2nd European Conference on Computer Vision,
LNCS 588, pp.282-287 Springer-Verlag, 1992.
24. [Jain81] Ramesh Jain. Dynamic Scene Analysis Using Pixel-Based Processes. Computer vol.14
No.8 1981 pp. 12-18
25. [Jain84a] Ramesh Jain. Difference and accumulative difference pictures in dynamic scene
analysis. Image and Vision Computing 2(2): 98-108 1984.
26. [Jain84b] Ramesh C. Jain. Segmentation of Frame Sequences Obtained by a Moving Observer.
IEEE Trans. on PAMI vol. PAMI-6 No.5 Sep.1984 pp. 624-629.
27. [JAP99] Tony Jebara, Ali Azarbayejani and Alex Pentland. 3D Structure from 2D Motion. IEEE
Signal Processing Magazine, 16(3):66-84, May 1999.
28. [JJ83] S. N. Jayaramamurthy and Ramesh Jain. An Approach to the Segmentation of Textured
Dynamic Scenes. Computer Vision, Graphics and Image Processing 21, 239-261, 1983.
29. [JKS95] R. Jain, R. Kasturi and B. G. Schunck. Machine Vision. McGraw-Hill, Inc, 1995.
30. [JMA79] Ramesh Jain, W. N. Martin and J. K. Aggarwal. Segmentation through the Detection of
Changes Due to Motion. Computer Graphics and Image Processing vol.11 No.1 1979 pp. 13-34.
31. [JN79] Ramesh Jain and H.-H. Nagel. On the Analysis of Accumulative Difference Pictures from
Image Sequences of Real World Scenes. IEEE Trans. on PAMI vol. PAMI-1 No.2 April 1979 pp.
206-214.
32. [KD92] J. Konrad and E. Dubois. Bayesian estimation of motion vector field. IEEE Trans. on
Pattern Analysis and Machine Intelligence 14:910-927, 1992.
33. [KIK85] M. Kunt, A. Ikonomopoulos and M. Kocher. Second generation image coding
techniques. Proceedings of the IEEE 73(4):549-575, 1985.
34. [m571] S. Colonnese, A. Neri, G. Russo and P. Talone (FUB). Moving objects versus still
background classification: a spatial temporal segmentation tool for MPEG-4. ISO/IEC
JTC1/SC29/WG11 MPEG96/m571 Munich, January 1996
35. [m1831] R. Mech, and P. Gerken (UH). Automatic segmentation of moving objects (Core
Experiment N2). ISO/IEC JTC1/SC29/WG11 MPEG97/m1831 Sevilla, ES February 1997.
36. [m1949] R. Mech and P. Gerken (UH). Automatic segmentation of moving objects (Partial results
of core experiment N2). ISO/IEC JTC1/SC29/WG11 MPEG97.
37. [m2091] J.G. Choi, M. Kim, M. H. Lee and C. Ahn (ETRI). Automatic segmentation based on
spatio-temporal information. ISO/IEC JTC1/SC29/WG11 MPEG97/m2091, Bristol, GB April
1997.
38. [m2238] T. Meier and King N. Ngan (UWA). Automatic Segmentation Based on Hausdorff Object
Tracking. ISO/IEC JTC1/SC29/WG11 MPEG97/m2238, Stockholm, July 1997.
39. [m2365] S. Colonnese, A. Neri, G. Russo and P. Talone (FUB). Core Experiment N2: Preliminary
FUB results on combination of automatic segmentation techniques. ISO/IEC JTC1/SC29/WG11
MPEG97/m2365, July 1997.
40. [m2383] J.G. Choi, M. Kim, M. H. Lee and C. Ahn (ETRI) S. Colonnese, U. Mascia, G. Russo
and P. Talone (FUB). Merging of temporal and spatial segmentation. ISO/IEC JTC1/SC29/WG11
MPEG97/m2383, July 1997.
41. [m2641] J.G. Choi, M. Kim, M. H. Lee and C. Ahn (ETRI). New ETRI results on core experiment
N2 on automatic segmentation techniques. ISO/IEC JTC1/SC29/WG11 MPEG97/m2641, October
1997.
42. [m2803] Munchurl Kim, Jae Gark Choi, Myoung Ho Lee and Chieteuk Ahn. User-assisted
segmentation for moving objects of interest. ISO/IEC JTC1/SC29/WG11 MPEG97/m2803, Oct.
1997.
43. [m3093] S. Colonnese, G. Russo (FUB). Segmentation techniques: towards a semi-automatic
approach. ISO/IEC JTC1/SC29/WG11 MPEG98/m3093, February 1998.
44. [m3320] S. Colonnese, G. Russo (FUB). User interactions modes in semi-automatic segmentation:
development of a flexible graphical user interface in Java. ISO/IEC JTC1/SC29/WG11
MPEG98/m3320, March 1998.
45. [m3349] Munchurl Kim, Jae Gark Choi, Myoung Ho Lee and Chieteuk Ahn. User-assisted Video
Object Segmentation by Multiple Object Tracking with a Graphical User Interface. ISO/IEC
JTC1/SC29/WG11 MPEG98/m3349, March 1998.
46. [m3935] Munchurl Kim, Jae Gark Choi, Myoung Ho Lee and Chieteuk Ahn. User’s guide for a
user-assisted video object segmentation tool. ISO/IEC JTC1/SC29/WG11 MPEG98/m3935, Oct.
1998.
47. [m4047] G. Russo (FUB). Results of FUB user assisted segmentation environment. ISO/IEC
JTC1/SC29/WG11 MPEG98/m4047, Oct. 1998.
48. [MB87] David Murray and Bernard Buxton. Scene segmentation from visual motion using
global optimization. IEEE PAMI 9(2), 1987.
49. [MacLean96] Wallace James MacLean. Recovery of Egomotion and Segmentation of Independent
Object Motion Using The EM-Algorithm. Ph.D thesis, University of Toronto, 1996.
50. [Meier98] Thomas Meier. Segmentation for Video Object Plane Extraction and Reduction of
Coding Artifacts. Ph.D thesis, Dept. Of Electrical and Electronic Engineering, The University of
Western Australia, 1998.
51. [MHO89] Hans Georg Musmann, Michael Hötter and Jörn Ostermann. Object-Oriented Analysis-
Synthesis Coding of Moving Images. Signal Processing: Image Communication 1, 1989, pp.117-138.
52. [MNG85] T. Meier, K. N. Ngan and G. Crebbin. A robust Markovian segmentation based on
highest confidence first (HCF). In IEEE Int. Conf. On Image Processing, ICIP’97, Santa Barbara,
CA, USA, Oct. 1997, vol. I pp.216-219.
53. [MW86] D. W. Murray and N. S. Williams. Detecting the image boundaries between optical flow
fields from several moving planar facets. Pattern Recognition Letters 4:87-92, 1986.
54. [MW98] Roland Mech and Michael Wollborn. A noise robust method for 2D shape estimation of
moving objects in video sequences considering a moving camera. Signal Processing 66(2):203-
217, 1998.
55. [N2172] ISO/IEC JTC1/SC29/WG11 MPEG98/N2172. MPEG-4 Video Verification Model.
Version 11.0, Tokyo, March 1998.
56. [N2553] ISO/IEC JTC1/SC29/WG11 MPEG98/N2553. Version 2 Visual Working Draft Revision
6.0. Rome, 1998.
57. [N2995] ISO/IEC JTC1/SC29/WG11 MPEG99/N2995. MPEG-4 Overview. Melbourne Oct.
1999.
58. [NCRT98] A. Neri, S. Colonnese, G. Russo and P. Talone. Automatic Moving Object and
Background Separation. Signal Processing 66:219-232, 1998.
59. [NSKO94] H.-H. Nagel, G. Socher, H. Kollnig and M. Otte. Motion Boundary detection in image
sequences by local stochastic test. In Proc. 3rd European Conf. On Computer Vision, LNCS
800/801,Vol. II, pp.305-314, May 1994.
60. [Overington87] I. Overington. Gradient-based flow segmentation and location of the focus of
expansion. In Proc. 3rd Alvey Vision Conference pp. 169-177, 1987.
61. [Pappas92] Thrasyvoulos N. Pappas. An Adaptive Clustering Algorithm for Image Segmentation.
IEEE Trans. on Signal Processing 40(4): 901-914, 1992.
62. [Pardás97] M. Pardás. Relative depth estimation and segmentation in monocular schemes.
Picture Coding Symposium, PCS 97, Berlin, Germany, Sept. 1997, pp.367-372.
63. [Potter75] Jerry L. Potter. Velocity as a Cue to Segmentation. IEEE Trans. on Systems, Man and
Cybernetics May 1975 pp. 390-394.
64. [Potter77] Jerry L. Potter. Scene Segmentation Using Motion Information. Computer Graphics
and Image Processing 6, 558-581 (1977).
65. [PP93] Nikhil R. Pal and Sankar K. Pal. A Review on Image Segmentation Techniques. Pattern
Recognition 26(9):1277-1294, 1993.
66. [RMB97] M. M. Reid, R. J. Millar and N. D. Black. Second Generation Image Coding: An
Overview. ACM Computing Surveys 29(1):3-29, 1997.
67. [Salembier et al 97] P. Salembier, F. Marqués, M. Pardàs, R. Morros, I. Corset, S. Jeannin, L.
Bouchard, F. Meyer, and B. Marcotegui. Segmentation-based video coding system allowing the
manipulation of objects. IEEE Trans. on Circuits and Systems for Video Technology, 7(1):60-74,
February 1997.
68. [SBCP96] P. Salembier, P. Brigger, J.R. Casas, and M. Pardàs. Morphological operators for image
and video compression. IEEE Transactions on Image Processing, 5(6):881-898, June 1996.
69. [Scharstein97] Daniel Scharstein. View Synthesis Using Stereo Vision. Ph.D thesis, Cornell
University, 1997.
70. [Schunck89] Brian G. Schunck. Image Flow Segmentation and Estimation by Constraint Line
Clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence 11(10):1010-1027, 1989.
71. [SK99] C. Stiller and J. Konrad. Estimating motion in image sequences: A tutorial on modeling
and computation of 2D motion. IEEE Signal Process. Magazine 16:70-91, July 1999.
72. [SP94] Philippe Salembier and Montse Pardas. Hierarchical Morphological Segmentation for
Image Sequence Coding. IEEE Transactions on Image Processing 3(5): 639-651, 1994.
73. [Stiller97] Christoph Stiller. Object-Based Estimation of Dense Motion Fields. IEEE Transactions
on Image Processing vol.6 No.2 Feb. 1997 pp.234-250.
74. [SU87] A. Spoerri and S. Ullman. The early detection of motion boundaries. Proc. 1st
International Conference on Computer Vision pp. 209-218, 1987.
75. [TB89] Robert Thoma and Matthias Bierling. Motion Compensating Interpolation Considering
Covered and Uncovered Background. Signal Processing: Image Communication 1 (1989) 191-
212.
76. [Tekalp95] A. Murat Tekalp. Digital Video Processing. Prentice Hall PTR.
77. [TGM97] L. Torres, D. García and A. Mates. On the Use of Layers for Video Coding and Object
Manipulation. 2nd Erlangen Symposium, Advances in Digital Image Communication, Erlangen,
Germany, pages 65 -73, April 25, 1997.
78. [TH81] Roger Y. Tsai and Thomas S. Huang. Estimating Three-Dimensional Motion Parameters
of a Rigid Planar Patch. IEEE Trans. on Acoustics, Speech and Signal Processing ASSP-
29(6):1147-1152, 1981.
79. [TKP96] L.Torres, M. Kunt and F. Pereira. Second Generation Video Coding Schemes and their
Role in MPEG-4. European Conference on Multimedia Applications, Services and Techniques,
pages 799 - 824, Louvain-la-Neuve, Belgium, May 28-30, 1996.
80. [TM93] P. H. S. Torr and D. W. Murray. Statistical detection of independent movement from a
moving camera Image and Vision Computing 1(4):180-187, May 1993.
81. [TMB85] William B. Thompson, Kathleen M. Mutch and Valdis A. Berzins. Dynamic Occlusion
Analysis in Optical Flow Fields. IEEE Trans. on Pattern Analysis and Machine Intelligence
PAMI-7(4):374-383, 1985.
82. [Torr95] P. H. S. Torr. Motion Segmentation and Outlier Detection. Ph.D thesis, University of
Oxford, 1995.
83. [VS91] Luc Vincent and Pierre Soille. Watersheds in Digital Spaces: An Efficient Algorithm Based
on Immersion Simulations. IEEE Trans. on Pattern Analysis and Machine Intelligence 13(6):583-
598, 1991.
84. [WA94] John Y. A. Wang and Edward H. Adelson. Representing Moving Images with Layers.
IEEE Trans. on Image Processing vol.3, no. 5, pp.625-638, Sept. 1994.
85. [WG99] Aster Wardhani and Ruben Gonzalez. Image Structure Analysis for CBIR. Proc. Digital
Image Computing: Techniques and Applications, DICTA'99, Dec. 1999, Perth, Australia, pp.166-
168.