Object Tracking: A Survey
Alper Yilmaz
Ohio State University
Omar Javed
ObjectVideo, Inc.
and
Mubarak Shah
University of Central Florida
The goal of this article is to review the state-of-the-art tracking methods, classify them into different
categories, and identify new trends. Object tracking, in general, is a challenging problem. Difficulties in tracking
objects can arise due to abrupt object motion, changing appearance patterns of both the object and the scene,
nonrigid object structures, object-to-object and object-to-scene occlusions, and camera motion. Tracking is
usually performed in the context of higher-level applications that require the location and/or shape of the
object in every frame. Typically, assumptions are made to constrain the tracking problem in the context of
a particular application. In this survey, we categorize the tracking methods on the basis of the object and
motion representations used, provide detailed descriptions of representative methods in each category, and
examine their pros and cons. Moreover, we discuss the important issues related to tracking including the use
of appropriate image features, selection of motion models, and detection of objects.
Categories and Subject Descriptors: I.4.8 [Image Processing and Computer Vision]: Scene Analysis
Tracking
General Terms: Algorithms
Additional Key Words and Phrases: Appearance models, contour evolution, feature selection, object detection,
object representation, point tracking, shape tracking
ACM Reference Format:
Yilmaz, A., Javed, O., and Shah, M. 2006. Object tracking: A survey. ACM Comput. Surv. 38, 4, Article 13
(Dec. 2006), 45 pages. DOI = 10.1145/1177352.1177355 http://doi.acm.org/10.1145/1177352.1177355
This material is based on work funded in part by the US Government. Any opinions, findings, and conclusions
or recommendations expressed in this material are those of the authors and do not necessarily reflect the
views of the US Government.
Authors' addresses: A. Yilmaz, Department of CEEGS, Ohio State University; email: [email protected];
O. Javed, ObjectVideo, Inc., Reston, VA 20191; email: [email protected]; M. Shah, School of EECS,
University of Central Florida; email: [email protected].
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or direct commercial advantage and that
copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights
for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted.
To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component
of this work in other works requires prior specific permission and/or a fee. Permissions may be requested
from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax 1 (212)
869-0481, or [email protected].
© 2006 ACM 0360-0300/2006/12-ART13 $5.00 DOI: 10.1145/1177352.1177355 http://doi.acm.org/10.1145/1177352.1177355.
ACM Computing Surveys, Vol. 38, No. 4, Article 13, Publication date: December 2006.
1. INTRODUCTION
Object tracking is an important task within the field of computer vision. The proliferation
of high-powered computers, the availability of high quality and inexpensive
video cameras, and the increasing need for automated video analysis have generated a
great deal of interest in object tracking algorithms. There are three key steps in video
analysis: detection of interesting moving objects, tracking of such objects from frame
to frame, and analysis of object tracks to recognize their behavior. Therefore, the use of
object tracking is pertinent in the tasks of:
motion-based recognition, that is, human identification based on gait, automatic object
detection, etc.;
automated surveillance, that is, monitoring a scene to detect suspicious activities or
unlikely events;
video indexing, that is, automatic annotation and retrieval of the videos in multimedia
databases;
human-computer interaction, that is, gesture recognition, eye gaze tracking for data
input to computers, etc.;
traffic monitoring, that is, real-time gathering of traffic statistics to direct traffic flow;
vehicle navigation, that is, video-based path planning and obstacle avoidance
capabilities.
In its simplest form, tracking can be defined as the problem of estimating the trajectory
of an object in the image plane as it moves around a scene. In other words, a
tracker assigns consistent labels to the tracked objects in different frames of a video.
Additionally, depending on the tracking domain, a tracker can also provide object-centric
information, such as orientation, area, or shape of an object. Tracking objects can be
complex due to:
loss of information caused by projection of the 3D world on a 2D image,
noise in images,
complex object motion,
nonrigid or articulated nature of objects,
partial and full object occlusions,
complex object shapes,
scene illumination changes, and
real-time processing requirements.
One can simplify tracking by imposing constraints on the motion and/or appearance
of objects. For example, almost all tracking algorithms assume that the object motion
is smooth with no abrupt changes. One can further constrain the object motion to
be of constant velocity or constant acceleration based on a priori information. Prior
knowledge about the number and the size of objects, or the object appearance and
shape, can also be used to simplify the problem.
Numerous approaches for object tracking have been proposed. These primarily differ
from each other based on the way they approach the following questions: Which object
representation is suitable for tracking? Which image features should be used? How
should the motion, appearance, and shape of the object be modeled? The answers to
these questions depend on the context/environment in which the tracking is performed
and the end use for which the tracking information is being sought. A large number
of tracking methods have been proposed which attempt to answer these questions for
a variety of scenarios. The goal of this survey is to group tracking methods into broad
categories and provide comprehensive descriptions of representative methods in each
category. We aspire to give readers, who require a tracker for a certain application,
the ability to select the most suitable tracking algorithm for their particular needs.
Moreover, we aim to identify new trends and ideas in the tracking community and hope
to provide insight for the development of new tracking methods.
Our survey is focused on methodologies for tracking objects in general and not on
trackers tailored for specific objects, for example, person trackers that use human kinematics
as the basis of their implementation. There has been substantial work on tracking
humans using articulated object models that has been discussed and categorized in
the surveys by Aggarwal and Cai [1999], Gavrilla [1999], and Moeslund and Granum
[2001]. We will, however, include some works on the articulated object trackers that
are also applicable to domains other than articulated objects.
We follow a bottom-up approach in describing the issues that need to be addressed
when one sets out to build an object tracker. The first issue is defining a suitable representation
of the object. In Section 2, we will describe some common object shape
representations, for example, points, primitive geometric shapes and object contours,
and appearance representations. The next issue is the selection of image features used
as an input for the tracker. In Section 3, we discuss various image features, such as color,
motion, edges, etc., which are commonly used in object tracking. Almost all tracking
algorithms require detection of the objects either in the first frame or in every frame.
Section 4 summarizes the general strategies for detecting the objects in a scene. The
suitability of a particular tracking algorithm depends on object appearances, object
shapes, number of objects, object and camera motions, and illumination conditions. In
Section 5, we categorize and describe the existing tracking methods and explain their
strengths and weaknesses in a summary section at the end of each category. In Section
6, important issues relevant to object tracking are discussed. Section 7 presents future
directions in tracking research. Finally, concluding remarks are sketched in Section 8.
2. OBJECT REPRESENTATION
In a tracking scenario, an object can be defined as anything that is of interest for further
analysis. For instance, boats on the sea, fish inside an aquarium, vehicles on a road,
planes in the air, people walking on a road, or bubbles in the water are a set of objects
that may be important to track in a specific domain. Objects can be represented by
their shapes and appearances. In this section, we will first describe the object shape
representations commonly employed for tracking and then address the joint shape and
appearance representations.
Points. The object is represented by a point, that is, the centroid (Figure 1(a))
[Veenman et al. 2001] or by a set of points (Figure 1(b)) [Serby et al. 2004]. In general,
the point representation is suitable for tracking objects that occupy small regions in
an image (see Section 5.1).
Primitive geometric shapes. Object shape is represented by a rectangle, ellipse
(Figure 1(c), (d)) [Comaniciu et al. 2003], etc. Object motion for such representations
is usually modeled by translation, affine, or projective (homography) transformation
(see Section 5.2 for details). Though primitive geometric shapes are more suitable for
representing simple rigid objects, they are also used for tracking nonrigid objects.
Object silhouette and contour. Contour representation defines the boundary of an
object (Figure 1(g), (h)). The region inside the contour is called the silhouette of the
object (see Figure 1(i)). Silhouette and contour representations are suitable for tracking
complex nonrigid shapes [Yilmaz et al. 2004].
Fig. 1. Object representations. (a) Centroid, (b) multiple points, (c) rectangular
patch, (d) elliptical patch, (e) part-based multiple patches, (f) object skeleton, (g)
complete object contour, (h) control points on object contour, (i) object silhouette.
Articulated shape models. Articulated objects are composed of body parts that are
held together with joints. For example, the human body is an articulated object with
torso, legs, hands, head, and feet connected by joints. The relationships between the
parts are governed by kinematic motion models, for example, joint angle, etc. In order
to represent an articulated object, one can model the constituent parts using cylinders
or ellipses as shown in Figure 1(e).
Skeletal models. Object skeleton can be extracted by applying medial axis transform
to the object silhouette [Ballard and Brown 1982, Chap. 8]. This model is commonly
used as a shape representation for recognizing objects [Ali and Aggarwal 2001].
Skeleton representation can be used to model both articulated and rigid objects (see
Figure 1(f)).
There are a number of ways to represent the appearance features of objects. Note
that shape representations can also be combined with the appearance representations
[Cootes et al. 2001] for tracking. Some common appearance representations in the
context of object tracking are:
Probability densities of object appearance. The probability density estimates of the
object appearance can either be parametric, such as Gaussian [Zhu and Yuille 1996]
and a mixture of Gaussians [Paragios and Deriche 2002], or nonparametric, such as
Parzen windows [Elgammal et al. 2002] and histograms [Comaniciu et al. 2003]. The
probability densities of object appearance features (color, texture) can be computed
from the image regions specified by the shape models (interior region of an ellipse or
a contour).
Templates. Templates are formed using simple geometric shapes or silhouettes
[Fieguth and Terzopoulos 1997]. An advantage of a template is that it carries both
spatial and appearance information. Templates, however, only encode the object ap-
pearance generated from a single view. Thus, they are only suitable for tracking
objects whose poses do not vary considerably during the course of tracking.
Active appearance models. Active appearance models are generated by simultaneously
modeling the object shape and appearance [Edwards et al. 1998]. In general,
the object shape is defined by a set of landmarks. Similar to the contour-based representation,
the landmarks can reside on the object boundary or, alternatively, they can
reside inside the object region. For each landmark, an appearance vector is stored
which is in the form of color, texture, or gradient magnitude. Active appearance
models require a training phase where both the shape and its associated appearance
are learned from a set of samples using, for instance, the principal component
analysis.
Multiview appearance models. These models encode different views of an object. One
approach to represent the different object views is to generate a subspace from the
given views. Subspace approaches, for example, Principal Component Analysis (PCA)
and Independent Component Analysis (ICA), have been used for both shape and
appearance representation [Mughadam and Pentland 1997; Black and Jepson 1998].
Another approach to learn the different views of an object is by training a set of classifiers,
for example, the support vector machines [Avidan 2001] or Bayesian networks
[Park and Aggarwal 2004]. One limitation of multiview appearance models is that the
appearances in all views are required ahead of time.
In general, there is a strong relationship between the object representations and the
tracking algorithms. Object representations are usually chosen according to the ap-
plication domain. For tracking objects, which appear very small in an image, point
representation is usually appropriate. For instance, Veenman et al. [2001] use the
point representation to track the seeds in a moving dish sequence. Similarly, Shafique
and Shah [2003] use the point representation to track distant birds. For the objects
whose shapes can be approximated by rectangles or ellipses, primitive geometric shape
representations are more appropriate. Comaniciu et al. [2003] use an elliptical shape
representation and employ a color histogram computed from the elliptical region for
modeling the appearance. Black and Jepson [1998] use eigenvectors, generated from
rectangular object templates, to represent the appearance. For
tracking objects with complex shapes, for example, humans, a contour or a silhouette-
based representation is appropriate. Haritaoglu et al. [2000] use silhouettes for object
tracking in a surveillance application.
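
The combination of an elliptical shape model with a histogram appearance model can be illustrated with a short sketch. It is only a toy illustration of the idea of computing a color histogram from the interior of an ellipse, not the formulation of Comaniciu et al. [2003]; the function names, the axis-aligned ellipse, the 8 bins per channel, and the synthetic image are all assumptions made for this example.

```python
import numpy as np

def ellipse_mask(shape, center, axes):
    """Boolean mask of the pixels inside an axis-aligned ellipse.

    shape  -- (rows, cols) of the image
    center -- (row, col) of the ellipse center
    axes   -- (semi-axis along rows, semi-axis along cols)
    """
    rows, cols = np.ogrid[:shape[0], :shape[1]]
    return ((rows - center[0]) / axes[0]) ** 2 + ((cols - center[1]) / axes[1]) ** 2 <= 1.0

def color_histogram(image, mask, bins_per_channel=8):
    """Normalized joint RGB histogram of the masked pixels (values in 0..255)."""
    pixels = image[mask].astype(float)          # shape (number of masked pixels, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins_per_channel,) * 3, range=[(0, 256)] * 3)
    return hist / hist.sum()

# Toy example (assumed data): a red ellipse on a blue background.
image = np.zeros((40, 40, 3), dtype=np.uint8)
image[..., 2] = 255
mask = ellipse_mask(image.shape[:2], center=(20, 20), axes=(10, 6))
image[mask] = (255, 0, 0)
hist = color_histogram(image, mask)
```

Because the histogram discards the spatial layout of the pixels inside the ellipse, it tolerates moderate nonrigid deformation, which is one reason primitive shapes with histogram appearance models are also applied to nonrigid objects.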
3. FEATURE SELECTION FOR TRACKING
Selecting the right features plays a critical role in tracking. In general, the most de-
sirable property of a visual feature is its uniqueness so that the objects can be easily
distinguished in the feature space. Feature selection is closely related to the object rep-
resentation. For example, color is used as a feature for histogram-based appearance
representations, while for contour-based representation, object edges are usually used
as features. In general, many tracking algorithms use a combination of these features.
The details of common visual features are as follows.
Color. The apparent color of an object is influenced primarily by two physical factors,
1) the spectral power distribution of the illuminant and 2) the surface reflectance
properties of the object. In image processing, the RGB (red, green, blue) color space
is usually used to represent color. However, the RGB space is not a perceptually uniform
color space, that is, the differences between the colors in the RGB space do not
correspond to the color differences perceived by humans [Paschos 2001]. Additionally,
the RGB dimensions are highly correlated. In contrast, $L^{*}u^{*}v^{*}$ and $L^{*}a^{*}b^{*}$ are
perceptually uniform color spaces, while HSV (Hue, Saturation, Value) is an approximately
uniform color space. However, these color spaces are sensitive to noise [Song
et al. 1996]. In summary, there is no last word on which color space is more efficient;
therefore, a variety of color spaces have been used in tracking.
Edges. Object boundaries usually generate strong changes in image intensities. Edge
detection is used to identify these changes. An important property of edges is that
they are less sensitive to illumination changes compared to color features. Algorithms
that track the boundary of the objects usually use edges as the representative feature.
Because of its simplicity and accuracy, the most popular edge detection approach is the
Canny Edge detector [Canny 1986]. An evaluation of the edge detection algorithms
is provided by Bowyer et al. [2001].
Optical Flow. Optical flow is a dense field of displacement vectors which defines the
translation of each pixel in a region. It is computed using the brightness constraint,
which assumes brightness constancy of corresponding pixels in consecutive frames
[Horn and Schunck 1981]. Optical flow is commonly used as a feature in motion-based
segmentation and tracking applications. Popular techniques for computing dense
optical flow include methods by Horn and Schunck [1981], Lucas and Kanade [1981],
Black and Anandan [1996], and Szeliski and Coughlan [1997]. For the performance
evaluation of the optical flow methods, we refer the interested reader to the survey
by Barron et al. [1994].
Texture. Texture is a measure of the intensity variation of a surface which quantifies
properties such as smoothness and regularity. Compared to color, texture requires
a processing step to generate the descriptors. There are various texture descriptors:
Gray-Level Cooccurrence Matrices (GLCMs) [Haralick et al. 1973] (a 2D histogram
which shows the cooccurrences of intensities in a specified direction and distance),
Laws' texture measures [Laws 1980] (twenty-five 2D filters generated from five 1D
filters corresponding to level, edge, spot, wave, and ripple), wavelets [Mallat 1989]
(orthogonal bank of filters), and steerable pyramids [Greenspan et al. 1994]. Similar
to edge features, the texture features are less sensitive to illumination changes
compared to color.
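
To make the GLCM descriptor concrete, the following sketch counts the cooccurrences of gray levels for one direction and distance. It is a minimal, unoptimized illustration; the function name, the non-symmetric counting convention, and the 4-level toy image are assumptions made for this example.

```python
import numpy as np

def glcm(image, offset=(0, 1), levels=4):
    """Count cooccurrences of gray levels at the given (row, col) offset.

    Entry [i, j] is the number of pixel pairs in which a pixel with level i
    has a neighbor with level j at the specified offset.
    """
    dr, dc = offset
    rows, cols = image.shape
    matrix = np.zeros((levels, levels), dtype=int)
    for r in range(max(0, -dr), min(rows, rows - dr)):
        for c in range(max(0, -dc), min(cols, cols - dc)):
            matrix[image[r, c], image[r + dr, c + dc]] += 1
    return matrix

# Toy image with four gray levels (assumed data).
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 2, 2, 2],
                  [2, 2, 3, 3]])
counts = glcm(image)   # horizontal neighbors at distance 1
```

Texture statistics such as contrast or homogeneity are then computed from the normalized matrix; changing `offset` selects a different direction and distance.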
Features are mostly chosen manually by the user depending on the application
domain. However, the problem of automatic feature selection has received significant
attention in the pattern recognition community. Automatic feature selection
methods can be divided into filter methods and wrapper methods [Blum and Langley
1997]. The filter methods try to select the features based on general criteria, for example,
the features should be uncorrelated. The wrapper methods select the features
based on the usefulness of the features in a specific problem domain, for example,
the classification performance using a subset of features. Principal Component Analysis
(PCA) is an example of the filter methods for the feature reduction. PCA involves
transformation of a number of (possibly) correlated variables into a (smaller) number of
uncorrelated variables called the principal components. The first principal component
accounts for as much of the variability in the data as possible, and each succeeding
component accounts for as much of the remaining variability as possible. A wrapper
method of selecting the discriminatory features for tracking a particular class of objects
is the Adaboost [Tieu and Viola 2004] algorithm. Adaboost is a method for finding
a strong classifier based on a combination of moderately inaccurate weak classifiers.
Given a large set of features, one classifier can be trained for each feature. Adaboost,
as discussed in Section 4.4, will discover a weighted combination of classifiers (representing
features) that maximizes the classification performance of the algorithm. The
higher the weight of the feature, the more discriminatory it is. One can use the first n
highest-weighted features for tracking.
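
The PCA step described above can be sketched in a few lines. This is a generic illustration based on the singular value decomposition rather than a procedure prescribed by any of the cited works; the function name `pca` and the synthetic three-feature data set are assumptions made for this example.

```python
import numpy as np

def pca(features, n_components):
    """Project feature vectors onto their leading principal components.

    features -- array of shape (n_samples, n_features)
    Returns (projected data, components as rows, variance along each component).
    """
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variances = singular_values ** 2 / (len(features) - 1)
    components = vt[:n_components]
    return centered @ components.T, components, variances[:n_components]

# Toy data (assumed): the second feature is almost a copy of the first.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
features = np.column_stack([x, x + 0.01 * rng.normal(size=200), rng.normal(size=200)])
projected, components, variances = pca(features, n_components=2)
```

In this toy data set two of the three features are nearly redundant, so two uncorrelated principal components retain almost all of the variability.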
Table I. Object Detection Categories

Categories              Representative Work
----------------------  --------------------------------------------------------------
Point detectors         Moravec's detector [Moravec 1979],
                        Harris detector [Harris and Stephens 1988],
                        Scale Invariant Feature Transform [Lowe 2004],
                        Affine Invariant Point Detector [Mikolajczyk and Schmid 2002].
Segmentation            Mean-shift [Comaniciu and Meer 1999],
                        Graph-cut [Shi and Malik 2000],
                        Active contours [Caselles et al. 1995].
Background Modeling     Mixture of Gaussians [Stauffer and Grimson 2000],
                        Eigenbackground [Oliver et al. 2000],
                        Wallflower [Toyama et al. 1999],
                        Dynamic texture background [Monnet et al. 2003].
Supervised Classifiers  Support Vector Machines [Papageorgiou et al. 1998],
                        Neural Networks [Rowley et al. 1998],
                        Adaptive Boosting [Viola et al. 2003].
Among all features, color is one of the most widely used features for tracking.
Comaniciu et al. [2003] use a color histogram to represent the object appearance. Despite
the popularity of color, most color bands are sensitive to illumination variation. Hence,
in scenarios where this effect is inevitable, other features are incorporated to model
object appearance. Cremers et al. [2003] use optical flow as a feature for contour tracking.
Jepson et al. [2003] use steerable filter responses for tracking. Alternatively, a
combination of these features is also utilized to improve the tracking performance.
4. OBJECT DETECTION
Every tracking method requires an object detection mechanism either in every frame or
when the object first appears in the video. A common approach for object detection is to
use information in a single frame. However, some object detection methods make use of
the temporal information computed from a sequence of frames to reduce the number of
false detections. This temporal information is usually in the form of frame differencing,
which highlights changing regions in consecutive frames. Given the object regions in
the image, it is then the tracker's task to perform object correspondence from one frame
to the next to generate the tracks.
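
Frame differencing itself amounts to thresholding the absolute difference of two consecutive frames. The sketch below is a minimal illustration for grayscale frames; the function name, the threshold of 25 gray levels, and the synthetic frames are assumptions made for this example.

```python
import numpy as np

def frame_difference(previous, current, threshold=25):
    """Mark the pixels whose gray level changed by more than `threshold`."""
    # Cast to a signed type so that the subtraction of uint8 frames cannot wrap around.
    diff = np.abs(current.astype(np.int16) - previous.astype(np.int16))
    return diff > threshold

# Toy sequence (assumed data): a bright 3x3 block moves two pixels to the right.
previous = np.zeros((8, 8), dtype=np.uint8)
current = np.zeros((8, 8), dtype=np.uint8)
previous[2:5, 1:4] = 200
current[2:5, 3:6] = 200
changed = frame_difference(previous, current)
```

Note that the column where the old and new object positions overlap is not flagged, since its intensity did not change. This is one reason why frame differencing highlights changing regions rather than complete objects.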
We tabulate several common object detection methods in Table I. Although the object
detection itself requires a survey of its own, here we outline the popular methods in
the context of object tracking for the sake of completeness.
4.1. Point Detectors
Point detectors are used to find interest points in images which have an expressive
texture in their respective localities. Interest points have been long used in the context
of motion, stereo, and tracking problems. A desirable quality of an interest point
is its invariance to changes in illumination and camera viewpoint. In the literature,
commonly used interest point detectors include Moravec's interest operator [Moravec
1979], Harris interest point detector [Harris and Stephens 1988], KLT detector [Shi
and Tomasi 1994], and SIFT detector [Lowe 2004]. For a comparative evaluation of
interest point detectors, we refer the reader to the survey by Mikolajczyk and Schmid
[2003].
To find interest points, Moravec's operator computes the variation of the image intensities
in a $4 \times 4$ patch in the horizontal, vertical, diagonal, and antidiagonal directions
and selects the minimum of the four variations as the representative value for the window.
A point is declared interesting if the intensity variation is a local maximum in a
$12 \times 12$ patch.
Fig. 2. Interest points detected by applying (a) the Harris, (b) the KLT, and (c) SIFT operators.
The Harris detector computes the first-order image derivatives, $(I_x, I_y)$, in the $x$ and $y$
directions to highlight the directional intensity variations; then a second moment matrix,
which encodes this variation, is evaluated for each pixel in a small neighborhood:

$$M = \begin{pmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{pmatrix}. \tag{1}$$
An interest point is identified using the determinant and the trace of $M$, which measures
the variation in a local neighborhood: $R = \det(M) - k \cdot \mathrm{tr}(M)^2$, where $k$ is constant.
The interest points are marked by thresholding $R$ after applying nonmaxima suppression
(see Figure 2(a) for results). The source code for Harris detector is available at
HarrisSrc. The same moment matrix $M$ given in Equation (1) is used in the interest
point detection step of the KLT tracking method. Interest point confidence, $R$, is computed
using the minimum eigenvalue of $M$, $\lambda_{\min}$. Interest point candidates are selected
by thresholding $R$. Among the candidate points, KLT eliminates the candidates that
are spatially close to each other (Figure 2(b)). Implementation of the KLT detector is
available at KLTSrc.
Quantitatively, both Harris and KLT emphasize the intensity variations using very
similar measures. For instance, $R$ in Harris is related to the characteristic polynomial
used for finding the eigenvalues of $M$: $\lambda^2 + \det(M) - \lambda \cdot \mathrm{tr}(M) = 0$, while KLT computes
the eigenvalues directly. In practice, both of these methods find almost the same interest
points. The only difference is the additional KLT criterion that enforces a predefined
spatial distance between detected interest points.
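
The close relation between the two measures can be seen by writing both in terms of the eigenvalues $\lambda_1, \lambda_2$ of the second moment matrix: since $\det(M) = \lambda_1 \lambda_2$ and $\mathrm{tr}(M) = \lambda_1 + \lambda_2$, the Harris response is $\lambda_1 \lambda_2 - k(\lambda_1 + \lambda_2)^2$, while KLT uses $\min(\lambda_1, \lambda_2)$. The sketch below computes both per pixel. It is an illustrative simplification, not the code distributed at HarrisSrc or KLTSrc: the unweighted $3 \times 3$ summation window, central-difference derivatives, $k = 0.04$, and the function names are assumptions made for this example, and thresholding, nonmaxima suppression, and the KLT spacing criterion are omitted.

```python
import numpy as np

def box_sum(values, radius):
    """Sum of `values` over a (2*radius+1)^2 window around each pixel (zero padded)."""
    padded = np.pad(values, radius)
    size = 2 * radius + 1
    out = np.zeros_like(values, dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + values.shape[0], dx:dx + values.shape[1]]
    return out

def corner_measures(image, radius=1, k=0.04):
    """Return the Harris response and the minimum eigenvalue of M for every pixel."""
    iy, ix = np.gradient(image.astype(float))   # derivatives along rows (y) and columns (x)
    sxx = box_sum(ix * ix, radius)
    syy = box_sum(iy * iy, radius)
    sxy = box_sum(ix * iy, radius)
    det = sxx * syy - sxy ** 2
    trace = sxx + syy
    harris = det - k * trace ** 2
    # Smaller root of the characteristic polynomial lambda^2 - trace*lambda + det = 0.
    lambda_min = 0.5 * (trace - np.sqrt(np.maximum(trace ** 2 - 4.0 * det, 0.0)))
    return harris, lambda_min

# Toy image (assumed data): a bright square whose corners should respond strongly.
image = np.zeros((20, 20))
image[5:15, 5:15] = 1.0
harris, lambda_min = corner_measures(image)
```

On this toy image both measures are positive at the corner pixel (5, 5). At the edge pixel (10, 5) the minimum eigenvalue is zero and the Harris response is negative, and in the flat regions both are zero.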
In theory, the M matrix is invariant to both rotation and translation. However, it
is not invariant to affine or projective transformations. In order to introduce robust
detection of interest points under different transformations, Lowe [2004] introduced
the SIFT (Scale Invariant Feature Transform) method which is composed of four steps.
First, a scale space is constructed by convolving the image with Gaussian filters at
different scales. Convolved images are used to generate difference-of-Gaussians (DoG)
images. Candidate interest points are then selected from the minima and maxima of
the DoG images across scales. The next step updates the location of each candidate by
interpolating the color values using neighboring pixels. In the third step, low contrast
candidates as well as the candidates along the edges are eliminated. Finally, remaining
interest points are assigned orientations based on the peaks in the histograms of
gradient directions in a small neighborhood around a candidate point. SIFT detector
generates a greater number of interest points compared to other interest point detec-
tors. This is due to the fact that the interest points at different scales and different
resolutions (pyramid) are accumulated. Empirically, it has been shown in Mikolajczyk
and Schmid [2003] that SIFT outperforms most point detectors and is more resilient
to image deformations. Implementation of the SIFT detector is available at SIFTSrc.
Fig. 3. Mixture of Gaussian modeling for background subtraction. (a) Image from a sequence
in which a person is walking across the scene. (b) The mean of the highest-weighted Gaussians
at each pixel's position. These means represent the most temporally persistent per-pixel color
and hence should represent the stationary background. (c) The means of the Gaussian with
the second-highest weight; these means represent colors that are observed less frequently. (d)
Background subtraction result. The foreground consists of the pixels in the current frame that
matched a low-weighted Gaussian.
4.2. Background Subtraction
Object detection can be achieved by building a representation of the scene called the
background model and then finding deviations from the model for each incoming frame.
Any significant change in an image region from the background model signifies a moving
object. The pixels constituting the regions undergoing change are marked for further
processing. Usually, a connected component algorithm is applied to obtain connected
regions corresponding to the objects. This process is referred to as the background
subtraction.
Frame differencing of temporally adjacent frames has been well studied since the
late 1970s [Jain and Nagel 1979]. However, background subtraction became popular
following the work of Wren et al. [1997]. In order to learn gradual changes in time, Wren
et al. propose modeling the color of each pixel, $I(x, y)$, of a stationary background
with a single 3D (Y, U, and V color space) Gaussian, $I(x, y) \sim N(\mu(x, y), \Sigma(x, y))$. The
model parameters, the mean $\mu(x, y)$ and the covariance $\Sigma(x, y)$, are learned from the
color observations in several consecutive frames. Once the background model is derived,
for every pixel $(x, y)$ in the input frame, the likelihood of its color coming from
$N(\mu(x, y), \Sigma(x, y))$ is computed, and the pixels that deviate from the background model
are labeled as the foreground pixels. However, a single Gaussian is not a good model for
outdoor scenes [Gao et al. 2000] since multiple colors can be observed at a certain location
due to repetitive object motion, shadows, or reflectance. A substantial improvement
in background modeling is achieved by using multimodal statistical models to describe
per-pixel background color. For instance, Stauffer and Grimson [2000] use a mixture
of Gaussians to model the pixel color. In this method, a pixel in the current frame is
checked against the background model by comparing it with every Gaussian in the
model until a matching Gaussian is found. If a match is found, the mean and variance
of the matched Gaussian are updated; otherwise, a new Gaussian with the mean
equal to the current pixel color and some initial variance is introduced into the mixture.
Each pixel is classified based on whether the matched distribution represents the
background process. Moving regions, which are detected using this approach, along
with the background models are shown in Figure 3.
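
The match, update, and replace cycle of the mixture model can be sketched for a single grayscale pixel as follows. This is a deliberately simplified illustration and not the exact algorithm of Stauffer and Grimson [2000]: the class name `PixelMixture`, the scalar intensity instead of a color vector, the single learning rate shared by weights and parameters, the 2.5 standard deviation matching test, the variance floor, and the rule that ranks components by weight over standard deviation are all assumptions made for this example.

```python
import numpy as np

class PixelMixture:
    """Simplified mixture-of-Gaussians background model for one grayscale pixel."""

    def __init__(self, n_components=3, learning_rate=0.05, initial_variance=900.0,
                 min_variance=4.0, background_ratio=0.7):
        self.weights = np.zeros(n_components)
        self.means = np.zeros(n_components)
        self.variances = np.full(n_components, initial_variance)
        self.learning_rate = learning_rate
        self.initial_variance = initial_variance
        self.min_variance = min_variance
        self.background_ratio = background_ratio

    def _background_components(self):
        # Rank components by weight / standard deviation and keep the shortest
        # prefix whose cumulative weight reaches background_ratio.
        order = np.argsort(-self.weights / np.sqrt(self.variances))
        cumulative = np.cumsum(self.weights[order])
        count = int(np.searchsorted(cumulative, self.background_ratio)) + 1
        return order[:count]

    def update(self, value):
        """Fold `value` into the model; return True if it is classified as background."""
        alpha = self.learning_rate
        distances = np.abs(value - self.means)
        matches = (self.weights > 0) & (distances <= 2.5 * np.sqrt(self.variances))
        if matches.any():
            # Use the closest matching component and classify before updating it.
            k = int(np.argmin(np.where(matches, distances, np.inf)))
            is_background = bool(np.isin(k, self._background_components()))
            self.weights *= 1.0 - alpha
            self.weights[k] += alpha
            delta = value - self.means[k]
            self.means[k] += alpha * delta
            new_variance = self.variances[k] + alpha * (delta ** 2 - self.variances[k])
            self.variances[k] = max(new_variance, self.min_variance)
        else:
            # No match: replace the least probable component with a new Gaussian.
            k = int(np.argmin(self.weights))
            self.weights[k] = alpha
            self.means[k] = value
            self.variances[k] = self.initial_variance
            is_background = False
        self.weights /= self.weights.sum()
        return is_background

# Toy sequence (assumed data): a stable background value, one intruding value, then background.
model = PixelMixture()
stable = [model.update(100.0) for _ in range(50)]
intruder = model.update(200.0)
after = model.update(100.0)
```

In the toy sequence the first observation only initializes a component, the later observations of the stable value are labeled background, and the single intruding value is labeled foreground without disturbing the dominant background component.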
Another approach is to incorporate region-based (spatial) scene information instead
of only using color-based information. Elgammal and Davis [2000] use nonparametric
kernel density estimation to model the per-pixel background. During the subtraction
process, the current pixel is matched not only to the corresponding pixel in
the background model, but also to the nearby pixel locations. Thus, this method can
handle camera jitter or small movements in the background. Li and Leung [2002]
fuse the texture and color features to perform background subtraction over blocks of
$5 \times 5$ pixels. Since texture does not vary greatly with illumination changes, the method
is less sensitive to illumination. Toyama et al. [1999] propose a three-tiered algorithm
to deal with the background subtraction problem. In addition to the pixel-level subtraction,
the authors use the region and the frame-level information. At the pixel level,
the authors propose to use Wiener filtering to make probabilistic predictions of the
expected background color. At the region level, foreground regions consisting of homogeneous
color are filled in. At the frame level, if most of the pixels in a frame exhibit
sudden change, it is assumed that the pixel-based color background models are no
longer valid. At this point, either a previously stored pixel-based background model is
swapped in, or the model is reinitialized.
Fig. 4. Eigenspace decomposition-based background subtraction (space is constructed with objects in the
FOV of camera): (a) an input image with objects, (b) reconstructed image after projecting input image onto
the eigenspace, (c) difference image. Note that the foreground objects are clearly identifiable.
An alternate approach for background subtraction is to represent the intensity vari-
ations of a pixel in an image sequence as discrete states corresponding to the events
in the environment. For instance, for tracking cars on a highway, image pixels can
be in the background state, the foreground (car) state, or the shadow state. Rittscher
et al. [2000] use Hidden Markov Models (HMM) to classify small blocks of an image
as belonging to one of these three states. In the context of detecting light on and off
events in a room, Stenger et al. [2001] use HMMs for the background subtraction. The
advantage of using HMMs is that certain events, which are hard to model correctly
using unsupervised background modeling approaches, can be learned using training
samples.
Instead of modeling the variation of individual pixels, Oliver et al. [2000] propose
a holistic approach using the eigenspace decomposition. For $k$ input frames,
$I^i : i = 1, \ldots, k$, of size $n \times m$, a background matrix $B$ of size $k \times l$ is formed by
cascading the $m$ rows in each frame one after the other, where $l = n \times m$, and eigenvalue
decomposition is applied to the covariance of $B$, $C = B^T B$. The background is then
represented by the most descriptive $\eta$ eigenvectors, $u_i$, where $i < \eta < k$, that encompass
all possible illuminations in the field of view (FOV). Thus, this approach is less sensitive
to illumination. The foreground objects are detected by projecting the current image to
the eigenspace and finding the difference between the reconstructed and actual images.
We show detected object regions using the eigenspace approach in Figure 4.
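As a rough sketch of the detection step, the projection and the difference image can be written out explicitly. The symbols $U$, $x$, $\hat{x}$, $d_j$, and the threshold $\tau$ are introduced here for illustration only and are not the notation of Oliver et al. [2000]; the sketch also assumes that the retained eigenvectors are orthonormal and that no mean frame is subtracted.

```latex
% U: l-by-(number of retained eigenvectors) matrix with the eigenvectors u_i as columns
% x: current frame written as a vector of length l
\hat{x} = U U^{T} x, \qquad
d_j = \left| x_j - \hat{x}_j \right|, \qquad
\text{pixel } j \text{ is labeled foreground if } d_j > \tau .
```

Because $\hat{x}$ lies in the span of the background eigenvectors, it reproduces the learned background well, so large residuals $d_j$ mark pixels that the background subspace cannot explain.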
Fig. 5. Segmentation of the image shown in (a), using mean-shift segmentation (b) and normalized cuts (c).
One limitation of the aforementioned approaches is that they require a static background.
This limitation is addressed by Monnet et al. [2003], and Zhong and Sclaroff
[2003]. Both of these methods are able to deal with time-varying background (e.g., the
waves on the water, moving clouds, and escalators). These methods model the image regions
as autoregressive moving average (ARMA) processes which provide a way to learn
and predict the motion patterns in a scene. An ARMA process is a time series model
that is made up of sums of autoregressive and moving-average components, where an
autoregressive process can be described as a weighted sum of its previous values and
a white noise error.
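For reference, one standard textbook way of writing such a model is given below. The orders $p$ and $q$ and the coefficients $\phi_i$ and $\theta_j$ are generic notation chosen for this sketch, not the notation of Monnet et al. [2003] or Zhong and Sclaroff [2003].

```latex
% ARMA(p, q): autoregressive part + white noise + moving-average part
x_t = \sum_{i=1}^{p} \phi_i \, x_{t-i} \;+\; \varepsilon_t \;+\; \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}
```

Here $\varepsilon_t$ is white noise. Setting $q = 0$ leaves the purely autoregressive process, that is, a weighted sum of previous values plus a white noise error; the second sum is the moving-average component.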
In summary, most state-of-the-art tracking methods for fixed cameras, for example,
Haritaoglu et al. [2000] and Collins et al. [2001], use background subtraction methods
to detect regions of interest. This is because recent subtraction methods have the
capabilities of modeling the changing illumination, noise, and the periodic motion of the
background regions and, therefore, can accurately detect objects in a variety of
circumstances. Moreover, these methods are computationally efficient. In practice, background
subtraction provides incomplete object regions in many instances, that is, the objects
may be split into several regions, or there may be holes inside the object since there are
no guarantees that the object features will be different from the background features.
The most important limitation of background subtraction is the requirement of stationary
cameras. Camera motion usually distorts the background models. These methods
can be applied to video acquired by mobile cameras by regenerating background models
for small temporal windows, for instance, three frames, from scratch [Kanade et al.
1998] or by compensating sensor motion, for instance, creating background mosaics
[Rowe and Blake 1996; Irani and Anandan 1998]. However, both of these solutions
require assumptions of planar scenes and small motion in successive frames.
4.3. Segmentation
The aim of image segmentation algorithms is to partition the image into perceptually
similar regions. Every segmentation algorithm addresses two problems: the criteria
for a good partition and the method for achieving efficient partitioning [Shi and Malik
2000]. In this section, we will discuss recent segmentation techniques that are relevant
to object tracking.
4.3.1. Mean-Shift Clustering. For the image segmentation problem, Comaniciu and
Meer [2002] propose the mean-shift approach to find clusters in the joint spatial+color
space, [l, u, v, x, y], where [l, u, v] represents the color and [x, y] represents the spatial
location. Given an image, the algorithm is initialized with a large number of hypothesized
cluster centers randomly chosen from the data. Then, each cluster center is
moved to the mean of the data lying inside the multidimensional ellipsoid centered on
the cluster center. The vector defined by the old and the new cluster centers is called
the mean-shift vector. The mean-shift vector is computed iteratively until the cluster
centers do not change their positions. Note that during the mean-shift iterations, some
clusters may get merged. In Figure 5(b), we show the segmentation using the mean-shift
approach generated using the source code available at MeanShiftSegmentSrc.
Mean-shift clustering is scalable to various other applications such as edge detection,
image regularization [Comaniciu and Meer 2002], and tracking [Comaniciu et al. 2003].
Mean-shift based segmentation requires fine tuning of various parameters to obtain
better segmentation; for instance, the selection of the color and spatial kernel bandwidths
and the threshold for the minimum size of the region considerably affect the resulting
segmentation.
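The iteration described above can be sketched in a few lines. This is a simplified illustration rather than the implementation of Comaniciu and Meer [2002]: it assumes a flat kernel with a single spherical window of radius `bandwidth` instead of an ellipsoid with separate color and spatial bandwidths, it starts one center at every data point instead of a random subset, and the merge distance `bandwidth / 2` is an arbitrary choice. The function name `mean_shift` and its parameters are hypothetical.

```python
import math

def mean_shift(points, bandwidth, tol=1e-6, max_iter=100):
    """Flat-kernel mean shift on a list of equal-length tuples.

    Every center is moved to the mean of the data inside its window
    until no center moves by more than `tol`.
    """
    centers = [tuple(p) for p in points]      # one hypothesized center per point
    for _ in range(max_iter):
        largest_shift = 0.0
        new_centers = []
        for c in centers:
            # data lying inside the window centered on the current center
            window = [p for p in points if math.dist(p, c) <= bandwidth]
            mean = tuple(sum(coord) / len(window) for coord in zip(*window))
            # length of the mean-shift vector (old center -> new center)
            largest_shift = max(largest_shift, math.dist(mean, c))
            new_centers.append(mean)
        centers = new_centers
        if largest_shift < tol:
            break
    # merge centers that converged to (almost) the same position
    modes = []
    for c in centers:
        if all(math.dist(c, m) > bandwidth / 2 for m in modes):
            modes.append(c)
    return modes

# Two well-separated groups of 2-D feature vectors (made-up data).
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
modes = mean_shift(points, bandwidth=1.0)
print(modes)   # one mode per group
```

The sketch makes the parameter sensitivity mentioned above easy to observe: with a bandwidth large enough to cover both groups, all centers collapse to a single mode.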
4.3.2. Image Segmentation Using Graph-Cuts. Image segmentation can also be formulated
as a graph partitioning problem, where the vertices (pixels), $V = \{u, v, \ldots\}$, of a
graph (image), $G$, are partitioned into $N$ disjoint subgraphs (regions), $A_i$, with
$\bigcup_{i=1}^{N} A_i = V$ and $A_i \cap A_j = \emptyset$ for $i \neq j$, by pruning the weighted edges of the graph. The total weight of the
pruned edges between two subgraphs is called a cut. The weight is typically computed
by color, brightness, or texture similarity between the nodes. Wu and Leahy [1993] use
the minimum cut criterion, where the goal is to find the partitions that minimize a cut.
In their approach, the weights are defined based on the color similarity. One limitation
of minimum cut is its bias toward oversegmenting the image. This effect is due to the
increase in cost of a cut with the number of edges going across the two partitioned
segments.
Shi and Malik [2000] propose the normalized cut to overcome the oversegmentation
problem. In their approach, the cut not only depends on the sum of edge weights in the
cut, but also on the ratio of the total connection weights of nodes in each partition to
all nodes of the graph. For image-based segmentation, the weights between the nodes
are defined by the product of the color similarity and the spatial proximity. Once the
weights between each pair of nodes are computed, a weight matrix $W$ and a diagonal
matrix $D$, where $D_{i,i} = \sum_{j=1}^{N} W_{i,j}$, are constructed. The segmentation is performed
first by computing the eigenvectors and the eigenvalues of the generalized eigensystem
$(D - W)y = \lambda D y$; then the eigenvector with the second-smallest eigenvalue is used to
divide the image into two segments. For each new segment, this process is recursively
performed until a threshold is reached. In Figure 5(c), we show the segmentation results
obtained by the normalized cuts approach.
In normalized cuts-based segmentation, the solution to the generalized eigensystem
for large images can be expensive in terms of processing and memory requirements.
However, this method requires fewer manually selected parameters, compared to mean-
shift segmentation. Normalized cuts have also been used in the context of tracking
object contours [Xu and Ahuja 2002].
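To make the difference between the two criteria concrete, the sketch below evaluates both on a toy graph by exhaustive search. It is not the eigenvector-based solver described above: brute force is only feasible for a handful of nodes, and the graph, the helper names (`make_weights`, `cut`, `ncut`, `best_bipartition`), and the edge weights are invented for illustration. The normalized cut value follows the definition of Shi and Malik [2000], that is, the cut divided by each side's total connection weight to all nodes.

```python
from itertools import combinations

def make_weights(n, edges):
    """Symmetric weight matrix from (i, j, weight) triples."""
    W = [[0.0] * n for _ in range(n)]
    for i, j, w in edges:
        W[i][j] = W[j][i] = w
    return W

def cut(W, A, B):
    """Total weight of the edges that go between node sets A and B."""
    return sum(W[i][j] for i in A for j in B)

def ncut(W, A, B):
    """Normalized cut: cut weight relative to each side's total connection
    weight to all nodes (assumes neither side is completely disconnected)."""
    V = range(len(W))
    return cut(W, A, B) / cut(W, A, V) + cut(W, A, B) / cut(W, B, V)

def best_bipartition(W, score):
    """Exhaustive search over two-way partitions (tiny graphs only)."""
    nodes = list(range(len(W)))
    best = None
    for size in range(1, len(nodes) // 2 + 1):
        for A in combinations(nodes, size):
            B = tuple(n for n in nodes if n not in A)
            s = score(W, A, B)
            if best is None or s < best[0]:
                best = (s, set(A), set(B))
    return best

edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),   # tight group {0, 1, 2}
         (3, 4, 1.0),                              # tight pair {3, 4}
         (3, 5, 0.1), (4, 5, 0.1),                 # node 5 is weakly attached
         (2, 3, 0.3)]                              # link between the two groups
W = make_weights(6, edges)
print(best_bipartition(W, cut))    # minimum cut splits off the single node 5
print(best_bipartition(W, ncut))   # normalized cut separates {0, 1, 2} from {3, 4, 5}
```

On this graph the minimum cut prefers to isolate the weakly attached node because that removes the least edge weight, whereas the normalization penalizes a partition whose one side has very little total connection weight.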
4.3.3. Active Contours. In an active contour framework, object segmentation is
achieved by evolving a closed contour to the object's boundary, such that the contour
tightly encloses the object region. Evolution of the contour is governed by an energy
functional which defines the fitness of the contour to the hypothesized object region.
The energy functional for contour evolution has the following common form:
$$E(\Gamma) = \int_0^1 \left[ E_{int}(v) + E_{im}(v) + E_{ext}(v) \right] ds, \qquad (2)$$
where $s$ is the arc-length of the contour $\Gamma$, $E_{int}$ includes regularization constraints, $E_{im}$
includes appearance-based energy, and $E_{ext}$ specifies additional constraints. $E_{int}$ usually
includes a curvature term and first-order ($\nabla v$) or second-order ($\nabla^2 v$) continuity terms to
find the shortest contour. Image-based energy, $E_{im}$, can be computed locally or globally.
Local information is usually in the form of an image gradient and is evaluated around
the contour [Kass et al. 1988; Caselles et al. 1995]. In contrast, global features are computed
inside and outside of the object region. Global features include color [Zhu and
Yuille 1996; Yilmaz et al. 2004; Ronfard 1994] and texture [Paragios and Deriche 2002].
Different researchers have used different energy terms in Equation (2). In 1995,
Caselles et al. excluded $E_{ext}$ and used only the image gradient as the image energy,
$E_{im} = g(|\nabla I|)$, where $g$ is the sigmoid function. Compared to the gradient, a function
of the gradient defines the object contour as a geodesic curve in the Riemannian space
[Caselles et al. 1995]. However, image gradients provide very local information and are
sensitive to local minima. In order to overcome this problem, researchers introduced
region-based image energy terms. In 1996, Zhu and Yuille proposed using region information
instead of the image gradient. However, the use of regional terms in the energy
functional does not result in good localization of the object contour. Recently, methods
that combine both region-based and gradient-based image energy have become popular.
Paragios and Deriche [2002] propose using a convex combination of the gradient and
region-based energies, $E_{image} = \lambda E_{boundary} + (1 - \lambda) E_{region}$. In particular, the authors
model the appearance in $E_{region}$ by a mixture of Gaussians. Contour evolution is first
performed globally, then locally by varying $\lambda$ from 0 to 1 at each iteration.
An important issue in contour-based methods is the contour initialization. In im-
age gradient-based approaches, a contour is typically placed outside the object region
and shrunk until the object boundary is encountered [Kass et al. 1988; Caselles et al.
1995]. This constraint is relaxed in region-based methods such that the contour can be
initialized either inside or outside the object so that the contour can either expand or
shrink, respectively, to fit the object boundary. However, these approaches require prior
object or background knowledge [Paragios and Deriche 2002]. Using multiple frames
or a reference frame, initialization can be performed without building region priors.
For instance, in Paragios and Deriche [2000], the authors use background subtraction
to initialize the contour.
Besides the selection of the energy functional and the initialization, another important
issue is selecting the right contour representation. The object contour, $\Gamma$, can be
represented either explicitly (control points, $v$) or implicitly (level sets, $\phi$). In the explicit
representation, the relation between the control points is defined by spline equations.
In the level sets representation, the contour is represented on a spatial grid which
encodes the signed distances of the grid points from the contour with opposite signs for the
object and the background regions. The contour is implicitly defined as the zero crossings
in the level set grid. The evolution of the contour is governed by changing the grid
values according to the energy computed using Equation (2), evaluated at each grid position.
The changes in the grid values result in new zero crossings and, hence, new contour
positions (more details are given in Section 5.3). The source code for generic level sets,
which can be used for various applications (for instance, segmentation, tracking, and heat
flow) by specifying the contour evolution speed, is available at LevelSetSrc. The
most important advantage of implicit representation over the explicit representation
is its flexibility in allowing topology changes (split and merge).
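The topology change can be illustrated with a very small level set grid. The sketch below is not the LevelSetSrc code; it assumes a sign convention (negative inside the object, positive outside), two circular objects, and the simplest possible evolution, a constant outward speed. For a signed-distance grid the gradient magnitude is one almost everywhere, so under that assumption the update reduces to subtracting `speed * dt` from every grid value. All function names are hypothetical.

```python
import math

def signed_distance(n, centers, radius):
    """n-by-n level set grid for a union of circles:
    negative inside a circle, positive outside (assumed convention)."""
    return [[min(math.hypot(x - cx, y - cy) - radius for cx, cy in centers)
             for x in range(n)] for y in range(n)]

def evolve(phi, speed, dt):
    """Constant-speed outward motion of the zero crossings.
    With |grad(phi)| = 1 the update phi <- phi - speed*dt*|grad(phi)|
    is a uniform shift of the grid values."""
    return [[value - speed * dt for value in row] for row in phi]

def count_regions(phi):
    """Number of 4-connected regions enclosed by the contour (phi < 0)."""
    n = len(phi)
    seen = set()
    regions = 0
    for y in range(n):
        for x in range(n):
            if phi[y][x] < 0 and (x, y) not in seen:
                regions += 1
                seen.add((x, y))
                stack = [(x, y)]
                while stack:                      # flood fill one region
                    px, py = stack.pop()
                    for qx, qy in ((px + 1, py), (px - 1, py), (px, py + 1), (px, py - 1)):
                        if (0 <= qx < n and 0 <= qy < n
                                and phi[qy][qx] < 0 and (qx, qy) not in seen):
                            seen.add((qx, qy))
                            stack.append((qx, qy))
    return regions

phi = signed_distance(40, [(12, 20), (28, 20)], radius=5.0)
print(count_regions(phi))                              # two separate contours
print(count_regions(evolve(phi, speed=1.0, dt=4.0)))   # contours have merged
```

No special handling is needed for the merge: the grid values change, and the number of zero-crossing curves follows from them. An explicit control-point representation would need additional logic to detect and repair such an event.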
4.4. Supervised Learning
Object detection can be performed by learning different object views automatically from
a set of examples by means of a supervised learning mechanism. Learning of different
object views waives the requirement of storing a complete set of templates. Given a set of
learning examples, supervised learning methods generate a function that maps inputs
to desired outputs. A standard formulation of supervised learning is the classification
problem where the learner approximates the behavior of a function by generating an
output in the form of either a continuous value, which is called regression, or a class
label, which is called classification. In the context of object detection, the learning examples
are composed of pairs of object features and an associated object class where both of
these quantities are manually defined.
Selection of features plays an important role in the performance of the classification;
hence, it is important to use a set of features that discriminate one class from the
other. In addition to the features discussed in Section 3, it is also possible to use other
features such as object area, object orientation, and object appearance in the form of
a density function, for example, histogram. Once the features are selected, different
appearances of an object can be learned by choosing a supervised learning approach.
These learning approaches include, but are not limited to, neural networks [Rowley
et al. 1998], adaptive boosting [Viola et al. 2003], decision trees [Grewe and Kak 1995],
and support vector machines [Papageorgiou et al. 1998]. These learning methods compute
a hypersurface that separates one object class from the other in a high dimensional
space.
Supervised learning methods usually require a large collection of samples from each
object class. Additionally, this collection must be manually labeled. A possible approach
to reducing the amount of manually labeled data is to accompany cotraining with
supervised learning [Blum and Mitchell 1998]. The main idea behind cotraining is to
train two classifiers using a small set of labeled data where the features used for each
classifier are independent. After training is achieved, each classifier is used to assign
unlabeled data to the training set of the other classifier. It was shown that, starting
from a small set of labeled data with two sets of statistically independent features,
cotraining can provide a very accurate classification rule [Blum and Mitchell 1998].
Cotraining has been successfully used to reduce the amount of manual interaction
required for training in the context of Adaboost [Levin et al. 2003] and support vector
machines [Kockelkorn et al. 2003]. In the following, we discuss adaptive boosting and
support vector machines due to their applicability to object tracking.
4.4.1. Adaptive Boosting. Boosting is an iterative method of finding a very accurate
classifier by combining many base classifiers, each of which may only be moderately
accurate [Freund and Schapire 1995]. In the training phase of the Adaboost algorithm,
the first step is to construct an initial distribution of weights over the training set.
The boosting mechanism then selects a base classifier that gives the least error, where
the error is proportional to the weights of the misclassified data. Next, the weights
associated with the data misclassified by the selected base classifier are increased. Thus
the algorithm encourages the selection of another classifier that performs better on the
misclassified data in the next iteration. For interested readers, tutorials on boosting are
available at http://www.boosting.org.
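The training loop described above can be sketched on one-dimensional toy data. The sketch assumes the common discrete Adaboost formulation with labels in {+1, -1} and the vote $\alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$ for a base classifier with weighted error $\epsilon$; the base classifiers are simple thresholds, and the names `train_adaboost` and `classify` as well as the data are made up for illustration.

```python
import math

def train_adaboost(xs, ys, rounds):
    """Discrete Adaboost with threshold base classifiers on 1-D data.
    Returns a list of (alpha, threshold, polarity) triples."""
    n = len(xs)
    weights = [1.0 / n] * n                       # initial distribution over the samples
    values = sorted(set(xs))
    # candidate thresholds: one below all data, then midpoints between neighbors
    thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    ensemble = []
    for _ in range(rounds):
        best = None
        for thr in thresholds:
            for pol in (1, -1):
                preds = [pol if x > thr else -pol for x in xs]
                # weighted error: total weight of the misclassified samples
                err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, thr, pol, preds)
        err, thr, pol, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)     # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)   # vote of the selected classifier
        ensemble.append((alpha, thr, pol))
        # increase the weights of misclassified samples, decrease the others
        weights = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def classify(ensemble, x):
    """Weighted vote of all selected base classifiers."""
    score = sum(alpha * (pol if x > thr else -pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]   # no single threshold separates these labels
ensemble = train_adaboost(xs, ys, rounds=3)
print([classify(ensemble, x) for x in xs] == ys)
```

On this data the best single threshold still misclassifies three samples, while the weighted vote of three thresholds reproduces all training labels, which is the sense in which moderately accurate base classifiers are combined into a more accurate one.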
In the context of object detection, weak classifiers can be simple operators such as
a set of thresholds, applied to the object features extracted from the image. In 2003,
Viola et al. used the Adaboost framework to detect pedestrians. In their approach,
perceptrons were chosen as the weak classifiers which are trained on image features
extracted by a combination of spatial and temporal operators. The operators for feature
extraction are in the form of simple rectangular filters, and are shown in Figure 6.
The operators in the temporal domain are in the form of frame differencing, which encodes
some form of motion information. Frame differencing, when used as an operator in the
temporal domain, reduces the number of false detections by enforcing object detection
in the regions where the motion occurs.
4.4.2. Support Vector Machines. As a classifier, Support Vector Machines (SVM) are
used to separate data into two classes by finding the maximum-margin hyperplane that
separates one class from the other [Boser et al. 1992]. The margin of the hyperplane,
which is maximized, is defined by the distance between the hyperplane and the closest
data points. The data points that lie on the boundary of the margin of the hyperplane are
called the support vectors. In the context of object detection, these classes correspond
to the object class (positive samples) and the nonobject class (negative samples). From
manually generated training examples labeled as object and nonobject, computation of
the hyperplane from among an infinite number of possible hyperplanes is carried out
by means of quadratic programming.
Fig. 6. A set of rectangular filters used by Viola et al. [2003] to extract features used in the Adaboost
framework. Each filter is composed of three regions: white, light gray, and dark gray, with associated
weights 0, −1, and 1, respectively. In order to compute the feature in a window, these filters are convolved
with the image.
Despite being a linear classifier, SVM can also be used as a nonlinear classifier by applying
the kernel trick to the input feature vector extracted from the input. Applying
the kernel trick to a set of data that is not linearly separable transforms the data to
a higher dimensional space in which it is likely to be separable. The kernels used for the
kernel trick are polynomial kernels, radial basis functions (for example, the Gaussian kernel),
and two-layer perceptrons (for instance, a sigmoid function). However, the selection of
the right kernel for the problem at hand is not easy. Once a kernel is chosen, one has to
test the classification performance for a set of parameters which may not work as well
when new observations are introduced to the sample set.
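A small numerical check may help to see what the kernel trick does. The sketch below uses a homogeneous polynomial kernel of degree 2 on two-dimensional inputs, for which the higher dimensional space can be written out explicitly; the function names and the four sample points are chosen for illustration and are not taken from any of the cited works. It does not train an SVM, it only shows that the kernel value equals a dot product in the mapped space and that data which is not linearly separable in the input space becomes separable after the mapping.

```python
import math

def poly_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2 for 2-D inputs."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def feature_map(x):
    """Explicit 3-D feature space whose dot product equals poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def gaussian_kernel(x, y, sigma=1.0):
    """Radial basis function (Gaussian) kernel; its feature space is
    infinite dimensional, so only the kernel value is ever computed."""
    squared_distance = sum((p - q) ** 2 for p, q in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))

# XOR-like data: no straight line in the plane separates the two classes.
samples = [((1.0, 1.0), 1), ((-1.0, -1.0), 1), ((1.0, -1.0), -1), ((-1.0, 1.0), -1)]

a, b = (1.0, 2.0), (3.0, -1.0)
print(poly_kernel(a, b), dot(feature_map(a), feature_map(b)))  # equal up to rounding

# In the mapped space the second coordinate alone separates the classes.
for x, label in samples:
    print(label, feature_map(x)[1] > 0)
```

The classifier never needs `feature_map` itself; the quadratic program only uses kernel values between pairs of samples, which is why kernels with very high or infinite dimensional feature spaces remain practical.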
In the context of object detection, Papageorgiou et al. [1998] use SVM for detecting
pedestrians and faces in images. The features used to discriminate between the classes
are extracted by applying Haar wavelets to the sets of positive and negative training
examples. In order to reduce the search space, temporal information is utilized by computing
the optical flow field in the image. Particularly, the discontinuities in the optical
flow field are used to initiate the search for possible objects, resulting in a decreased
number of false positives.
5. OBJECT TRACKING
The aim of an object tracker is to generate the trajectory of an object over time by locating
its position in every frame of the video. An object tracker may also provide the complete
region in the image that is occupied by the object at every time instant. The tasks of detecting
the object and establishing correspondence between the object instances across
frames can either be performed separately or jointly. In the first case, possible object
regions in every frame are obtained by means of an object detection algorithm, and then
the tracker corresponds objects across frames. In the latter case, the object region and
correspondence are jointly estimated by iteratively updating object location and region
information obtained from previous frames. In either tracking approach, the objects
are represented using the shape and/or appearance models described in Section 2. The
model selected to represent object shape limits the type of motion or deformation it can
undergo. For example, if an object is represented as a point, then only a translational
model can be used. In the case where a geometric shape representation like an ellipse is
used for the object, parametric motion models like affine or projective transformations
are appropriate. These representations can approximate the motion of rigid objects in
the scene. For a nonrigid object, silhouette or contour is the most descriptive representation
and both parametric and nonparametric models can be used to specify their motion.
Fig. 7. Taxonomy of tracking methods.
Table II. Tracking Categories
Categories                         Representative Work
Point Tracking
  Deterministic methods            MGE tracker [Salari and Sethi 1990],
                                   GOA tracker [Veenman et al. 2001].
  Statistical methods              Kalman filter [Broida and Chellappa 1986],
                                   JPDAF [Bar-Shalom and Foreman 1988],
                                   PMHT [Streit and Luginbuhl 1994].
Kernel Tracking
  Template and density based       Mean-shift [Comaniciu et al. 2003],
  appearance models                KLT [Shi and Tomasi 1994],
                                   Layering [Tao et al. 2002].
  Multi-view appearance models     Eigentracking [Black and Jepson 1998],
                                   SVM tracker [Avidan 2001].
Silhouette Tracking
  Contour evolution                State space models [Isard and Blake 1998],
                                   Variational methods [Bertalmio et al. 2000],
                                   Heuristic methods [Ronfard 1994].
  Matching shapes                  Hausdorff [Huttenlocher et al. 1993],
                                   Hough transform [Sato and Aggarwal 2004],
                                   Histogram [Kang et al. 2004].
In view of the aforementioned discussion, we provide a taxonomy of tracking methods
in Figure 7. Representative work for each category is tabulated in Table II. We now
briefly introduce the main tracking categories, followed by a detailed section on each
category.
Point Tracking. Objects detected in consecutive frames are represented by points,
and the association of the points is based on the previous object state which can
include object position and motion. This approach requires an external mechanism
to detect the objects in every frame. An example of object correspondence is shown in
Figure 8(a).
Fig. 8. Different tracking approaches. (a) Multipoint correspondence, (b) parametric transformation of a
rectangular patch, (c, d) two examples of contour evolution.
Kernel Tracking. Kernel refers to the object shape and appearance. For example,
the kernel can be a rectangular template or an elliptical shape with an associated
histogram. Objects are tracked by computing the motion of the kernel in consecutive
frames (Figure 8(b)). This motion is usually in the form of a parametric transformation
such as translation, rotation, and affine.
Silhouette Tracking. Tracking is performed by estimating the object region in each
frame. Silhouette tracking methods use the information encoded inside the object
region. This information can be in the form of appearance density and shape models
which are usually in the form of edge maps. Given the object models, silhouettes are
tracked by either shape matching or contour evolution (see Figure 8(c), (d)). Both of
these methods can essentially be considered as object segmentation applied in the
temporal domain using the priors generated from the previous frames.
5.1. Point Tracking
Tracking can be formulated as the correspondence of detected objects represented by
points across frames. Point correspondence is a complicated problem, especially in the
presence of occlusions, misdetections, entries, and exits of objects. Overall, point correspondence
methods can be divided into two broad categories, namely, deterministic
and statistical methods. The deterministic methods use qualitative motion heuristics
[Veenman et al. 2001] to constrain the correspondence problem. On the other hand,
statistical methods explicitly take the object measurement and model uncertainties
into account to establish correspondence.
5.1.1. Deterministic Methods for Correspondence. Deterministic methods for point correspondence
define a cost of associating each object in frame $t-1$ to a single object in
frame $t$ using a set of motion constraints. Minimization of the correspondence cost is
formulated as a combinatorial optimization problem. A solution, which consists of one-to-one
correspondences (Figure 9(b)) among all possible associations (Figure 9(a)), can
be obtained by optimal assignment methods, for example, the Hungarian algorithm [Kuhn
1955], or greedy search methods. The correspondence cost is usually defined by using a
combination of the following constraints.
Proximity assumes the location of the object would not change notably from one frame
to the other (see Figure 10(a)).
Maximum velocity defines an upper bound on the object velocity and limits the possible
correspondences to the circular neighborhood around the object (see Figure 10(b)).
Small velocity change (smooth motion) assumes the direction and speed of the object
do not change drastically (see Figure 10(c)).
Common motion constrains the velocity of objects in a small neighborhood to be similar
(see Figure 10(d)). This constraint is suitable for objects represented by multiple
points.
Rigidity assumes that objects in the 3D world are rigid; therefore, the distance between
any two points on the actual object will remain unchanged (see Figure 10(e)).
Proximal uniformity is a combination of the proximity and the small velocity change
constraints.
Fig. 9. Point correspondence. (a) All possible associations of a point (object) in frame $t-1$ with points
(objects) in frame $t$, (b) unique set of associations plotted with bold lines, (c) multiframe correspondences.
Fig. 10. Different motion constraints. (a) Proximity, (b) maximum velocity ($r$ denotes radius), (c) small
velocity change, (d) common motion, (e) rigidity constraints. Distinct markers denote the object position
at frames $t-2$, $t-1$, and $t$.
We should, however, note that these constraints are not specific to the deterministic
methods, and they can also be used in the context of point tracking using statistical
methods.
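As a minimal sketch of how two of these constraints enter the combinatorial formulation, the following code finds the one-to-one assignment with the smallest total displacement (proximity) while rejecting any assignment that moves a point farther than a given radius (maximum velocity). It enumerates all permutations, which is only practical for a handful of points and stands in for an optimal assignment method such as the Hungarian algorithm; for larger problems a library routine such as SciPy's `linear_sum_assignment` would be the usual choice. The function name `correspond`, the radius, and the coordinates are illustrative assumptions, and the sketch assumes equal numbers of points in both frames, that is, no occlusions, entries, or exits.

```python
import math
from itertools import permutations

def correspond(prev_pts, curr_pts, max_dist):
    """Return {index in prev_pts: index in curr_pts} with the smallest total
    displacement, or None if every one-to-one assignment violates the
    maximum-velocity gate."""
    if len(prev_pts) != len(curr_pts):
        raise ValueError("this sketch assumes the same number of points in both frames")
    best_cost, best_assignment = math.inf, None
    for perm in permutations(range(len(curr_pts))):
        dists = [math.dist(prev_pts[i], curr_pts[j]) for i, j in enumerate(perm)]
        if max(dists, default=0.0) > max_dist:    # maximum velocity constraint
            continue
        cost = sum(dists)                         # proximity cost
        if cost < best_cost:
            best_cost, best_assignment = cost, dict(enumerate(perm))
    return best_assignment

prev_pts = [(0.0, 0.0), (10.0, 0.0), (5.0, 5.0)]       # detections in frame t-1
curr_pts = [(10.5, 0.5), (5.2, 5.1), (0.3, -0.2)]      # detections in frame t
print(correspond(prev_pts, curr_pts, max_dist=2.0))    # {0: 2, 1: 0, 2: 1}
```

The smooth motion and common motion constraints would enter the same way, as additional terms in `cost` computed from the velocities in earlier frames.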
Here we present a sample of different methods proposed in the literature in this
category. Sethi and Jain [1987] solve the correspondence by a greedy approach based
on the proximity and rigidity constraints. Their algorithm considers two consecutive
frames and is initialized by the nearest neighbor criterion. The correspondences are
exchanged iteratively to minimize the cost. A modified version of the same algorithm
which computes the correspondences in the backward direction (from the last frame
to the first frame) in addition to the forward direction is also analyzed. This method
cannot handle occlusions, entries, or exits. Salari and Sethi [1990] handle these problems
by first establishing correspondence for the detected points and then extending
the tracking of the missing objects by adding a number of hypothetical points. Rangarajan
and Shah [1991] propose a greedy approach, which is constrained by proximal
uniformity. Initial correspondences are obtained by computing optical flow in the first
two frames. The method does not address entry and exit of objects. If the number of
detected points decreases, occlusion or misdetection is assumed. Occlusion is handled
by establishing the correspondence for the detected objects in the current frame. For
the remaining objects, position is predicted based on a constant velocity assumption. In
the work by Intille et al. [1997], which uses a slightly modified version of Rangarajan
and Shah [1991] for matching object centroids, the objects are detected by using background
subtraction. The authors explicitly handle the change in the number of objects
by examining specific regions in the image, for example, a door, to detect entries/exits
before computing the correspondence.
Fig. 11. Results of two point correspondence algorithms. (a) Tracking using the algorithm
proposed by Veenman et al. [2001] in the rotating dish sequence; color segmentation was used to
detect black dots on a white dish (© 2001 IEEE). (b) Tracking birds using the algorithm proposed
by Shafique and Shah [2003]; birds are detected using background subtraction (© 2003 IEEE).
Veenman et al. [2001] extend the work of Sethi and Jain [1987], and Rangarajan
and Shah [1991] by introducing the common motion constraint for correspondence. The
common motion constraint provides a strong constraint for coherent tracking of points
that lie on the same object; however, it is not suitable for points lying on isolated objects
moving in different directions. The algorithm is initialized by generating the initial
tracks using a two-pass algorithm, and the cost function is minimized by the Hungarian
assignment algorithm in two consecutive frames. This approach can handle occlusion
and misdetection errors; however, it is assumed that the number of objects is the same
throughout the sequence, that is, there are no object entries or exits. See Figure 11(a) for
tracking results.
Shafique and Shah [2003] propose a multiframe approach to preserve temporal coherency
of the speed and position (Figure 9(c)). They formulate the correspondence as
a graph theoretic problem. Multiple frame correspondence relates to finding the best
unique path $P_i = \{x^0, \ldots, x^k\}$ for each point (the superscript represents the frame number).
For misdetected or occluded objects, the path will consist of missing positions in
corresponding frames. The directed graph, which is generated using the points in $k$
frames, is converted to a bipartite graph by splitting each node (object) into two ($+$
and $-$) nodes and representing directed edges as undirected edges from $+$ to $-$ nodes.
The correspondence is then established by a greedy algorithm. They use a window of
frames during point correspondence to handle occlusions whose durations are shorter
than the temporal window used to perform matching. See Figure 11(b) for results of
this algorithm for the tracking of birds.
5.1.2. Statistical Methods for Correspondence. Measurements obtained from video sensors
invariably contain noise. Moreover, the object motions can undergo random perturbations,
for instance, maneuvering vehicles. Statistical correspondence methods solve
these tracking problems by taking the measurement and the model uncertainties into
account during object state estimation. The statistical correspondence methods use the
state space approach to model the object properties such as position, velocity, and acceleration.
Measurements usually consist of the object position in the image, which is
obtained by a detection mechanism. In the following, we discuss the state estimation
methods in the context of point tracking; however, it should be noted that these methods
can be used in general to estimate the state of any time-varying system. For example,
these methods have extensively been used for tracking contours [Isard and Blake 1998],
activity recognition [Vaswani et al. 2003], object identification [Zhou et al. 2003], and
structure from motion [Matthies et al. 1989].
Consider a moving object in the scene. The information representing the object, for
example, location, is defined by a sequence of states $X_t : t = 1, 2, \ldots$. The change in
state over time is governed by the dynamic equation,

$$X_t = f_t(X_{t-1}) + W_t, \qquad (3)$$

where $W_t : t = 1, 2, \ldots$ is white noise. The relationship between the measurement and
the state is specified by the measurement equation $Z_t = h_t(X_t, N_t)$, where $N_t$ is the
white noise and is independent of $W_t$. The objective of tracking is to estimate the state
$X_t$ given all the measurements up to that moment or, equivalently, to construct the
probability density function $p(X_t | Z_{1,\ldots,t})$. A theoretically optimal solution is provided
by a recursive Bayesian filter which solves the problem in two steps. The prediction
step uses a dynamic equation and the already computed pdf of the state at time $t-1$
to derive the prior pdf of the current state, that is, $p(X_t | Z_{1,\ldots,t-1})$. Then, the correction
step employs the likelihood function $p(Z_t | X_t)$ of the current measurement to compute
the posterior pdf $p(X_t | Z_{1,\ldots,t})$. In the case where the measurements only arise due to
the presence of a single object in the scene, the state can be simply estimated by the
two steps as defined. On the other hand, if there are multiple objects in the scene,
measurements need to be associated with the corresponding object states. We now
discuss the two cases.
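The two-step recursion can be made concrete on a small discrete state space. The following is a minimal sketch, not the formulation of any cited method: it assumes a hypothetical world of three cells, a hand-picked transition table, and a hand-picked likelihood vector, and the names `predict` and `correct` are illustrative only.

```python
def predict(belief, transition):
    """Prediction step: prior[i] = sum_j p(X_t = i | X_{t-1} = j) * belief[j]."""
    n = len(belief)
    return [sum(transition[j][i] * belief[j] for j in range(n)) for i in range(n)]


def correct(prior, likelihood):
    """Correction step: multiply the prior by p(Z_t | X_t) and renormalize."""
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Assumed dynamics: the object moves one cell to the right with probability 0.8,
# otherwise it stays; transition[j][i] = p(X_t = i | X_{t-1} = j).
transition = [
    [0.2, 0.8, 0.0],
    [0.0, 0.2, 0.8],
    [0.0, 0.0, 1.0],
]
belief = [1.0, 0.0, 0.0]            # posterior at time t-1: object surely in cell 0
likelihood = [0.1, 0.8, 0.1]        # assumed p(Z_t | X_t = cell) for a detection near cell 1

prior = predict(belief, transition)     # p(X_t | Z_1..t-1)
posterior = correct(prior, likelihood)  # p(X_t | Z_1..t)
```

With these assumed numbers the prior places most of its mass on cell 1, and the measurement sharpens that belief further. The Kalman and particle filters discussed next can be read as two ways of carrying out the same recursion when the state is continuous.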
5.1.2.1. Single Object State Estimation. For the single object case, if $f_t$ and $h_t$ are linear
functions and the initial state $X_1$ and noise have a Gaussian distribution, then the
optimal state estimate is given by the Kalman Filter. In the general case, that is, object
state is not assumed to be a Gaussian, state estimation can be performed using particle
filters [Tanizaki 1987].
Kalman Filters. A Kalman filter is used to estimate the state of a linear system where
the state is assumed to be distributed by a Gaussian. Kalman filtering is composed
of two steps, prediction and correction. The prediction step uses the state model to
predict the new state of the variables:

$$\bar{X}_t = D X_{t-1} + W,$$
$$\bar{\Sigma}_t = D \Sigma_{t-1} D^T + Q_t,$$

where $\bar{X}_t$ and $\bar{\Sigma}_t$ are the state and the covariance predictions at time $t$. $D$ is the state
transition matrix which defines the relation between the state variables at time $t$
and $t-1$. $Q$ is the covariance of the noise $W$. Similarly, the correction step uses the
current observations $Z_t$ to update the object's state:

$$K_t = \bar{\Sigma}_t M^T [M \bar{\Sigma}_t M^T + R_t]^{-1}, \qquad (4)$$
$$X_t = \bar{X}_t + K_t \underbrace{[Z_t - M \bar{X}_t]}_{v}, \qquad (5)$$
$$\Sigma_t = \bar{\Sigma}_t - K_t M \bar{\Sigma}_t,$$
where $v$ is called the innovation, $M$ is the measurement matrix, $R$ is the covariance of
the measurement noise, and $K$ is the Kalman gain, computed by the Riccati Equation (4)
and used for propagation of the state models. Note that the updated state, $X_t$, is still
distributed by a Gaussian. In case the functions $f_t$ and $h_t$ are nonlinear, they can be
linearized using the Taylor series expansion to obtain the extended Kalman filter
[Bar-Shalom and Foreman 1988]. Similar to the Kalman filter, the extended Kalman filter
assumes that the state is distributed by a Gaussian.
The Kalman filter has been extensively used in the vision community for tracking.
Broida and Chellappa [1986] used the Kalman filter to track points in noisy images.
In stereo camera-based object tracking, Beymer and Konolige [1999] use the Kalman
filter for predicting the object's position and speed in $x$-$z$ dimensions. Rosales and
Sclaroff [1999] use the extended Kalman filter to estimate 3D trajectory of an object
from 2D motion. A Matlab toolbox for Kalman filtering is available at KalmanSrc.
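As an illustration of the prediction and correction steps, the following sketch runs a Kalman filter for a point moving along one axis. It is written under stated assumptions rather than taken from any of the cited systems: the state is [position, velocity], the transition matrix is the constant-velocity model D = [[1, dt], [0, 1]], only the position is measured (M = [1, 0]), and the noise levels `q` and `r` are arbitrary illustrative values. Because the measurement is scalar, the matrix inverse in Equation (4) reduces to a division.

```python
def kalman_step(x, P, z, dt=1.0, q=1e-3, r=0.25):
    """One predict/correct cycle for a 1D constant-velocity model.

    x = [position, velocity], P = 2x2 covariance (list of lists), z = measured position.
    Assumed model: D = [[1, dt], [0, 1]], M = [1, 0], Q = q * I, R = r.
    """
    # Prediction: x_bar = D x and P_bar = D P D^T + Q (written out element by element).
    x_bar = [x[0] + dt * x[1], x[1]]
    p00 = P[0][0] + dt * (P[0][1] + P[1][0]) + dt * dt * P[1][1] + q
    p01 = P[0][1] + dt * P[1][1]
    p10 = P[1][0] + dt * P[1][1]
    p11 = P[1][1] + q

    # Correction: with M = [1, 0], M P_bar M^T + R is the scalar p00 + r.
    s = p00 + r
    k = [p00 / s, p10 / s]                   # Kalman gain, Equation (4)
    v = z - x_bar[0]                         # innovation
    x_new = [x_bar[0] + k[0] * v, x_bar[1] + k[1] * v]   # Equation (5)
    # Covariance update: P = P_bar - K M P_bar.
    P_new = [[p00 - k[0] * p00, p01 - k[0] * p01],
             [p10 - k[1] * p00, p11 - k[1] * p01]]
    return x_new, P_new


# Usage: noise-free positions 1, 2, ..., 20 of an object moving at unit speed.
x, P = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
for t in range(1, 21):
    x, P = kalman_step(x, P, z=float(t))
```

Starting from a deliberately wrong velocity of zero, the estimate should approach position 20 and velocity 1 as the measurements accumulate, while the entries of `P` shrink.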
Particle Filters. One limitation of the Kalman filter is the assumption that the state
variables are normally distributed (Gaussian). Thus, the Kalman filter will give poor
estimations of state variables that do not follow Gaussian distribution. This limitation
can be overcome by using particle filtering [Tanizaki 1987]. In particle filtering,
the conditional state density $p(X_t | Z_t)$ at time $t$ is represented by a set of samples
$\{s_t^{(n)} : n = 1, \ldots, N\}$ (particles) with weights $\pi_t^{(n)}$ (sampling probability). The weights
define the importance of a sample, that is, its observation frequency [Isard and Blake
1998]. To decrease computational complexity, for each tuple $(s^{(n)}, \pi^{(n)})$, a cumulative
weight $c^{(n)}$ is also stored, where $c^{(N)} = 1$. The new samples at time $t$ are drawn
from $S_{t-1} = \{(s_{t-1}^{(n)}, \pi_{t-1}^{(n)}, c_{t-1}^{(n)}) : n = 1, \ldots, N\}$ at the previous time step $t-1$ based on
different sampling schemes [MacKay 1998]. The most common sampling scheme is
importance sampling which can be stated as follows.
(1) Selection. Select $N$ random samples $\hat{s}_t^{(n)}$ from $S_{t-1}$ by generating a random number
$r \in [0, 1]$, finding the smallest $j$ such that $c_{t-1}^{(j)} > r$ and setting $\hat{s}_t^{(n)} = s_{t-1}^{(j)}$.
(2) Prediction. For each selected sample $\hat{s}_t^{(n)}$, generate a new sample by
$s_t^{(n)} = f(\hat{s}_t^{(n)}, W_t^{(n)})$, where $W_t^{(n)}$ is a zero mean Gaussian error and $f$ is a non-negative
function, i.e. $f(s) = s$.
(3) Correction. Weights $\pi_t^{(n)}$ corresponding to the new samples $s_t^{(n)}$ are computed using
the measurements $z_t$ by $\pi_t^{(n)} = p(z_t | x_t = s_t^{(n)})$, where $p(\cdot)$ can be modeled as a
Gaussian density.
Using the new samples $S_t$, one can estimate the new object position by
$\varepsilon_t = \sum_{n=1}^{N} \pi_t^{(n)} f(s_t^{(n)}, W)$. Particle filter-based trackers can be initialized by either
using the first measurements, $s_0^{(n)} \sim X_0$, with weight $\pi_0^{(n)} = \frac{1}{N}$ or by training the system
using sample sequences. In addition to keeping track of the best particles, an additional
resampling is usually employed to eliminate samples with very low weights. Note that
the posterior density does not have to be a Gaussian. Particle filters recently became
popular in computer vision. They are especially used for object detection and tracking.
A Matlab toolbox for tracking using particle filtering is available at ParticleFltSrc.
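The selection, prediction, and correction steps can be sketched for a scalar state as follows. This is an illustration under assumptions, not the implementation of any cited tracker: the dynamics are taken to be the identity plus zero-mean Gaussian noise, the likelihood is an unnormalized Gaussian around the measurement, the position estimate is the weighted mean of the new samples, and the function name and the parameters `motion_std` and `meas_std` are hypothetical.

```python
import bisect
import math
import random


def particle_filter_step(samples, weights, z, motion_std=1.0, meas_std=1.0, rng=random):
    """One selection/prediction/correction cycle for a scalar state."""
    n = len(samples)
    # Cumulative weights c^(n), normalized so that c^(N) = 1.
    total = sum(weights)
    cumulative, running = [], 0.0
    for w in weights:
        running += w / total
        cumulative.append(running)
    cumulative[-1] = 1.0  # guard against round-off in the last entry

    new_samples, new_weights = [], []
    for _ in range(n):
        # (1) Selection: smallest j such that c^(j) > r, with r drawn from [0, 1).
        r = rng.random()
        j = bisect.bisect_right(cumulative, r)
        # (2) Prediction: assumed dynamics f(s, W) = s + W with zero-mean Gaussian W.
        s = samples[j] + rng.gauss(0.0, motion_std)
        # (3) Correction: Gaussian likelihood p(z | x = s), up to a constant factor.
        w = math.exp(-0.5 * ((z - s) / meas_std) ** 2)
        new_samples.append(s)
        new_weights.append(w)

    total = sum(new_weights)
    new_weights = [w / total for w in new_weights]
    estimate = sum(w * s for w, s in zip(new_weights, new_samples))
    return new_samples, new_weights, estimate


# Usage: particles start spread around 0 while the measurements stay at 3.
rng = random.Random(0)
samples = [rng.gauss(0.0, 2.0) for _ in range(500)]
weights = [1.0 / len(samples)] * len(samples)
for z in [3.0, 3.0, 3.0, 3.0, 3.0]:
    samples, weights, estimate = particle_filter_step(samples, weights, z, rng=rng)
```

The cumulative weights make the selection step a binary search, which is the reason for storing them alongside each sample. Selecting in proportion to the weights also acts as the resampling that removes samples with very low weights.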
Note that the Kalman filter and particle filter described assume a single measurement
at each time instant, that is, the state of a single object is estimated. Tracking
multiple objects requires a joint solution of data association and state estimation
problems.
5.1.2.2. Multiobject Data Association and State Estimation. When tracking multiple objects
using Kalman or particle filters, one needs to deterministically associate the most
likely measurement for a particular object to that object's state, that is, the correspondence
problem needs to be solved before these filters can be applied. The simplest
method to perform correspondence is to use the nearest neighbor approach. However,
if the objects are close to each other, then there is always a chance that the correspondence
is incorrect. An incorrectly associated measurement can cause the filter to fail
to converge. There exist several statistical data association techniques to tackle this
problem. A detailed review of these techniques can be found in the book by Fortmann
and Bar-Shalom [1988] or in the survey by Cox [1993]. Joint Probability Data Associa-
tion Filtering (JPDAF) and Multiple Hypothesis Tracking (MHT) are two widely used
techniques for data association. We give a brief description of these techniques in the
following.
Joint Probability Data Association Filter. Let a track be defined as a sequence of
measurements that are assumed to originate from the same object. Suppose we have
$N$ tracks and, at time $t$, $Z(t) = z_1(t), \ldots, z_{m_t}(t)$ are the $m_t$ measurements. We need to
assign these measurements to the existing tracks. Let $\eta$ be a set of assignments. It
is assumed that the number of tracks will remain constant over time. Let $v_{i,l}$ be the
innovation (see the discussion on the Kalman Filter) associated with the track $l$ due
to the measurement $z_i$. The JPDAF associates all measurements with each track.
The combined weighted innovation is given by

$$v_l = \sum_{i=1}^{m_t} \beta_i^l v_{i,l}, \qquad (6)$$

where $\beta_i^l$ is the posterior probability that the measurement $i$ originated from the
object associated with track $l$ and is given as:

$$\beta_i^l = \sum_{\eta} P[\eta_l(t) | Z_t] \, \tau_{i,l}(\eta), \qquad (7)$$

where $\tau_{i,l}$ is the indicator variable, with $i = 1, \ldots, m_t$ and $l = 1, \ldots, N$. It is equal
to one if the measurement $z_i(t)$ is associated with track $l$, otherwise it is zero. The
weighted innovation given in Equation (6) can be plugged into the Kalman filter
update Equations (5) for each track $l$.
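To show how the association probabilities and the combined innovation fit together, the sketch below enumerates joint assignments for a toy case. It is a deliberately simplified version and not the full JPDAF: it assumes as many measurements as tracks, treats every joint assignment as a one-to-one pairing (no clutter and no missed detections), and scores an assignment by a product of unnormalized isotropic Gaussian likelihoods with a single assumed variance `var`. The function names are hypothetical.

```python
import itertools
import math


def gaussian_likelihood(v, var):
    """Unnormalized isotropic Gaussian likelihood of a 2D innovation v."""
    return math.exp(-0.5 * (v[0] ** 2 + v[1] ** 2) / var)


def jpda_weighted_innovations(predictions, measurements, var=1.0):
    """Return (beta, combined) for a simplified joint association.

    beta[track][meas] plays the role of the posterior probability in Equation (7);
    combined[track] is the weighted innovation of Equation (6).
    Assumes len(measurements) == len(predictions) and one-to-one assignments.
    """
    n = len(predictions)
    # v[meas][track] = measurement minus the predicted measurement of the track.
    v = [[(z[0] - p[0], z[1] - p[1]) for p in predictions] for z in measurements]
    beta = [[0.0] * n for _ in range(n)]
    total = 0.0
    for assignment in itertools.permutations(range(n)):   # assignment[track] = meas
        prob = 1.0
        for track, meas in enumerate(assignment):
            prob *= gaussian_likelihood(v[meas][track], var)
        total += prob
        for track, meas in enumerate(assignment):
            beta[track][meas] += prob    # the indicator is 1 for this pairing
    beta = [[b / total for b in row] for row in beta]
    combined = [
        (sum(beta[track][m] * v[m][track][0] for m in range(n)),
         sum(beta[track][m] * v[m][track][1] for m in range(n)))
        for track in range(n)
    ]
    return beta, combined


# Usage: two well separated tracks, each with one nearby measurement.
predictions = [(0.0, 0.0), (4.0, 0.0)]
measurements = [(0.5, 0.0), (3.0, 0.0)]
beta, combined = jpda_weighted_innovations(predictions, measurements)
```

When the tracks are far apart, as here, almost all of the probability falls on the obvious pairing and the combined innovation is close to the nearest neighbor innovation. The enumeration grows factorially with the number of tracks, so this form is only practical for very small examples.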
JPDAF is used by Chang and Aggarwal [1991] to perform 3D structure reconstruction
from a video sequence. Rasmussen and Hager [2001] use a constrained JPDAF filter to
track regions. The major limitation of the JPDAF algorithm is its inability to handle
new objects entering the field of view (FOV) or already tracked objects exiting the FOV.
Since the JPDAF algorithm performs data association of a fixed number of objects
tracked over two frames, serious errors can arise if there is a change in the number of
objects. The MHT algorithm, which is explained next, does not have this shortcoming.
Multiple Hypothesis Tracking (MHT). If motion correspondence is established using
only two frames, there is always a finite chance of an incorrect correspondence. Better
tracking results can be obtained if the correspondence decision is deferred until
several frames have been examined. The MHT algorithm maintains several correspondence
hypotheses for each object at each time frame [Reid 1979]. The final track
of the object is the most likely set of correspondences over the time period of its ob-
servation. The algorithm has the ability to create new tracks for objects entering the
FOV and terminate tracks for objects exiting the FOV. It can also handle occlusions,
that is, continuation of a track even if some of the measurements from an object are
missing.
MHT is an iterative algorithm. An iteration begins with a set of current track hypotheses.
Each hypothesis is a collection of disjoint tracks. For each hypothesis, a prediction
of each object's position in the next frame is made. The predictions are then compared
with actual measurements by evaluating a distance measure. A set of correspondences
(associations) are established for each hypothesis based on the distance measure which
introduces new hypotheses for the next iteration. Each new hypothesis represents a
new set of tracks based on the current measurements. Note that each measurement
can belong to a new object entering the FOV, a previously tracked object, or a spurious
measurement. Moreover, a measurement may not be assigned to an object because the
object may have exited the FOV, or a measurement corresponding to an object may
not be obtained. The latter happens because either the object is occluded or it is not
detected due to noise.
Note that MHT makes associations in a deterministic sense and exhaustively
enumerates all possible associations. To reduce the computational load, Streit and
Luginbuhl [1994] propose a probabilistic MHT (PMHT) in which the associations are
considered to be statistically independent random variables and thus there is no requirement
for exhaustive enumeration of associations. Recently, particle filters that
handle multiple measurements to track multiple objects have been proposed by Hue
et al. [2002]. In their method, data association is handled in a similar way as in PMHT;
however, the state estimation is achieved through particle filters.
The MHT algorithm is computationally exponential both in memory and time. To
overcome this limitation, Cox and Hingorani [1996] use Murty's [1968] algorithm to
determine k-best hypotheses in polynomial time for tracking interest points. Cham and
Rehg [1999] use the multiple hypothesis framework to track the complete human body.
5.1.3. Discussion. Point tracking methods can be evaluated on the basis of whether
they generate correct point trajectories. Given a ground truth, the performance can be
evaluated by computing precision and recall measures. In the context of point tracking,
precision and recall measures can be defined as:

$$\text{precision} = \frac{\#\text{ of correct correspondences}}{\#\text{ of established correspondences}}, \qquad (8)$$

$$\text{recall} = \frac{\#\text{ of correct correspondences}}{\#\text{ of actual correspondences}}, \qquad (9)$$
where actual correspondences denote the correspondences available in the ground
truth. Additionally, a qualitative comparison of object trackers can be made based on
their ability to
- deal with entries of new objects and exits of existing objects,
- handle the missing observations (occlusion), and
- provide an optimal solution to the cost function minimization problem used for
  establishing correspondence.
In Table III, we provide a qualitative comparison based on these properties.
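The precision and recall measures of Equations (8) and (9) reduce to set operations once correspondences are given a comparable encoding. The sketch below assumes a hypothetical encoding in which each correspondence is a tuple (frame, point id in that frame, point id in the next frame); any hashable encoding would serve equally well.

```python
def precision_recall(established, ground_truth):
    """Precision and recall of point correspondences, Equations (8) and (9).

    Both arguments are sets of hashable correspondences.
    """
    correct = len(established & ground_truth)
    precision = correct / len(established) if established else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Usage with made-up data: the tracker finds 3 correspondences, 2 of them correct,
# while the ground truth contains 4.
truth = {(0, 1, 1), (0, 2, 2), (1, 1, 1), (1, 2, 2)}
tracker = {(0, 1, 1), (0, 2, 2), (1, 1, 2)}
p, r = precision_recall(tracker, truth)
```

Here precision is 2/3 and recall is 1/2: the tracker is mostly right about what it reports but misses half of the actual correspondences.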
One important issue in the context of point trackers is the handling of missing or
noisy observations. To address these problems, deterministic point trackers often use
a combination of motion-based constraints addressed in Section 5.1.1, that is, common
motion [Veenman et al. 2001] or proximal uniformity [Rangarajan and Shah 1991].
Statistical point tracking methods explicitly handle noise by taking model uncertain-
ties into consideration. These uncertainties are usually assumed to be in the form of
normally distributed noise. However, the assumption that measurements are normally
distributed around their predicted position may not hold. Moreover, in many cases, the
noise parameters are not known. In the case of valid assumptions on distributions and
ACM Computing Surveys, Vol. 38, No. 4, Article 13, Publication date: December 2006.
24 A. Yilmaz et al.
Table III. Qualitative Comparison of Point Trackers (#: number of objects, M: multiple
objects, S single object. Symbols
and denote whether the tracker can or cannot handle
occlusions, object entries object exits, and provide the optimal solution.)
# Entry Exit Occlusion Optimal
GE [Sethi and Jain 1987] M
MGE [Salari and Sethi 1990] M
GOA [Veenman et al. 2001] M
MFT [Shaque and Shah 2003] M
Kalman [Bar-Shalom and Foreman 1988] S
JPDAF [Bar-Shalom and Foreman 1988] M
MHT [Cox and Hingorani 1996] M
noise, Kalman lters [Bar-Shalom and Foreman 1988] and MHT [Reid 1979] give opti-
mal solutions. Another possible approach to handling noise and missing observations
is to enforce constraints that dene the objects 3D structure. For instance, multibody
factorization methods can be used for handling noisy observations by enforcing the
object points to t into the 3D object shape. This is addressed for the nonrigid object
by Bregler et al. [Torresani and Bregler 2002; Bregler et al. 2000] where the authors
rst dene a set of shape bases from a set of reliable tracks which has minimum or
no appearance error on the points trajectory. Computed shape basis then serves as a
constraint on the remaining point trajectories that are labeled as unreliable.
Point trackers are suitable for tracking very small objects which can be represented
by a single point (single point representation). Multiple points are needed to track larger
objects. In the context of tracking objects using multiple points, automatic clustering
of points that belong to the same object is an important problem. This is due to the
need to distinguish between multiple objects and between objects and background.
Motion-based clustering or segmentation approaches [Vidal and Ma 2004; Black and
Anandan 1996; Wang and Adelson 1994] usually assume that the points being tracked
lie on rigid bodies in order to simplify the segmentation problem.
5.2. Kernel Tracking
Kernel tracking is typically performed by computing the motion of the object, which is
represented by a primitive object region, from one frame to the next. The object motion
is generally in the form of parametric motion (translation, conformal, affine, etc.) or the
dense flow field computed in subsequent frames. These algorithms differ in terms of the
appearance representation used, the number of objects tracked, and the method used
to estimate the object motion. We divide these tracking methods into two subcategories
based on the appearance representation used, namely, templates and density-based
appearance models, and multiview appearance models.
5.2.1. Tracking Using Template and Density-Based Appearance Models. Templates and
density-based appearance models (see Section 2) have been widely used because of
their relative simplicity and low computational cost. We divide the trackers in this
category into two subcategories based on whether the objects are tracked individually
or jointly.
5.2.1.1. Tracking Single Objects. The most common approach in this category is template
matching. Template matching is a brute force method of searching the image, $I_w$, for a
region similar to the object template, $O_t$, defined in the previous frame. The position of
the template in the current image is computed by a similarity measure, for example,
cross correlation:

$$\arg\max_{dx,dy} \frac{\sum_x \sum_y \left( O_t(x, y) \times I_w(x + dx, y + dy) \right)}{\sqrt{\sum_x \sum_y O_t^2(x, y)}},$$

where $(dx, dy)$ specify the candidate template position. Usually image intensity or color
features are used to form the templates. Since image intensity is very sensitive to
illumination changes, image gradients [Birchfield 1998] can also be used as features. A
limitation of template matching is its high computation cost due to the brute force search.
To reduce the computational cost, researchers usually limit the object search to the
vicinity of its previous position. Also, more efficient algorithms for template matching
have been proposed [Schweitzer et al. 2002].

Fig. 12. Mean-shift tracking iterations. (a) estimated object location at time $t-1$, (b) frame at time
$t$ with initial location estimate using the previous object position, (c), (d), (e) location update using
mean-shift iterations, (f) final object position at time $t$.
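A brute force template search restricted to the vicinity of the previous position can be sketched in a few lines. This is an illustration only: images are plain 2D lists of intensities, the score is the cross correlation divided by the template norm, and the function name and the `radius` parameter are hypothetical. Because this score is not normalized by the image window, it favors bright regions, which is one reason normalized correlation is often preferred in practice.

```python
import math


def match_template(image, template, prev_xy, radius):
    """Search for the template within `radius` pixels of its previous position.

    image and template are 2D lists (rows of columns) and the template must not
    be all zeros; prev_xy = (x, y) is the previous top-left corner of the template.
    Returns the best top-left corner and its score.
    """
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    norm = math.sqrt(sum(v * v for row in template for v in row))
    best_score, best_xy = float("-inf"), prev_xy
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x0, y0 = prev_xy[0] + dx, prev_xy[1] + dy
            if x0 < 0 or y0 < 0 or x0 + tw > iw or y0 + th > ih:
                continue  # candidate window falls outside the image
            score = sum(
                template[y][x] * image[y0 + y][x0 + x]
                for y in range(th) for x in range(tw)
            ) / norm
            if score > best_score:
                best_score, best_xy = score, (x0, y0)
    return best_xy, best_score


# Usage: a 2x2 pattern that moved from (3, 2) to (4, 3) on an empty 8x8 image.
template = [[1.0, 2.0], [3.0, 4.0]]
image = [[0.0] * 8 for _ in range(8)]
for y in range(2):
    for x in range(2):
        image[3 + y][4 + x] = template[y][x]
best_xy, best_score = match_template(image, template, prev_xy=(3, 2), radius=2)
```

The cost is proportional to the number of candidate displacements times the template area, which is why the search radius matters so much for running time.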
Note that instead of templates, other object representations can be used for tracking,
for instance, color histograms or mixture models can be computed by using the appear-
ance of pixels inside the rectangular or ellipsoidal regions. Fieguth and Terzopoulos
[1997] generate object models by finding the mean color of the pixels inside the rectangular
object region. To reduce computational complexity, they search the object in
eight neighboring locations. The similarity between the object model, M, and the hy-
pothesized position, H, is computed by evaluating the ratio between the color means
computed from M and H. The position which provides the highest ratio is selected as
the current object location.
Comaniciu and Meer [2003] use a weighted histogram computed from a circular region
to represent the object. Instead of performing a brute force search for locating the
object, they use the mean-shift procedure (Section 4.3). The mean-shift tracker maximizes
the appearance similarity iteratively by comparing the histograms of the object,
$Q$, and the window around the hypothesized object location, $P$. Histogram similarity
is defined in terms of the Bhattacharyya coefficient, $\sum_{u=1}^{b} \sqrt{P(u) Q(u)}$, where $b$ is the
number of bins. At each iteration, the mean-shift vector is computed such that the
histogram similarity is increased. This process is repeated until convergence is achieved,
which usually takes five to six iterations. For histogram generation, the authors use a
weighting scheme defined by a spatial kernel which gives higher weights to the pixels
closer to the object center. Comaniciu [2002] extended the mean-shift tracking approach
to use a joint spatial-color histogram (Section 4.3) instead of just a color histogram. An
example of mean-shift tracking is given in Figure 12. An obvious advantage of the
mean-shift tracker over the standard template matching is the elimination of a brute force
search, and the computation of the translation of the object patch in a small number of
iterations. However, mean-shift tracking requires that a portion of the object is inside
the circular region upon initialization (part of the object has to be inside the white
ellipse in Figure 12(b)). Implementation of the mean-shift tracker is available in OpenCV
as CAMSHIFT at MeanShiftSegmentSrc.
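The histogram comparison at the core of the mean-shift tracker can be illustrated as follows. The sketch covers only the similarity measure, not the mean-shift iterations themselves, and it makes assumptions of its own: intensities are integers below `max_value`, bins are of equal width, and the optional `weights` argument stands in for the spatial kernel weights. The helper names are hypothetical.

```python
import math


def histogram(values, bins, weights=None, max_value=256):
    """Normalized (optionally kernel-weighted) histogram of integer intensities."""
    hist = [0.0] * bins
    if weights is None:
        weights = [1.0] * len(values)
    for v, w in zip(values, weights):
        hist[v * bins // max_value] += w
    total = sum(hist)
    return [h / total for h in hist]


def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two normalized histograms."""
    return sum(math.sqrt(pu * qu) for pu, qu in zip(p, q))


# Usage with made-up pixel values.
model = histogram([10, 20, 200, 210], bins=4)       # Q: object model
similar = histogram([15, 25, 205, 215], bins=4)     # P: candidate with similar appearance
different = histogram([100, 110, 120, 130], bins=4)  # P: candidate with different appearance
```

The coefficient is 1 for identical normalized histograms and 0 when they share no occupied bin, so `bhattacharyya(model, similar)` is 1 and `bhattacharyya(model, different)` is 0 for the values above.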
Jepson et al. [2003] propose an object tracker that tracks an object as a three-component
mixture, consisting of the stable appearance features, transient features
and a noise process. The stable component identifies the most reliable appearance for
motion estimation, that is, the regions of the object whose appearance does not quickly
change over time. The transient component identifies the quickly changing pixels. The
noise component handles the outliers in the object appearance that arise due to noise.
An online version of the EM algorithm is used to learn the parameters of this three-component
mixture. The authors use the phase of the steerable filter responses as features
for appearance representation. The object shape is represented by an ellipse. The
motion of the object is computed in terms of warping the tracked region from one frame
to the next one. The warping transformation consists of translation, $(t_x, t_y)$, rotation
$(a, b)$, and scale, $s$, parameters:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = s \begin{pmatrix} a & b \\ -b & a \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}. \qquad (10)$$
A weighted combination of the stable and transient components is used to determine
the warping parameters. The advantage of learning stable and transient features is
that one can give more weight to stable features for tracking, for example, if the face
of a person who is talking is being tracked, then the forehead or nose region can give a
better match to the face in the next frame as opposed to the mouth of the person (see
Figure 13).
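For reference, the warp of Equation (10) applied to a single point can be written out as below. The parameterization of the rotation terms by an angle, a = cos(theta) and b = sin(theta), is an assumption made for this sketch, as is the function name.

```python
import math


def warp_point(x, y, s, theta, tx, ty):
    """Apply the scale-rotation-translation warp of Equation (10) to (x, y).

    Assumes a = cos(theta) and b = sin(theta) for the rotation parameters.
    """
    a, b = math.cos(theta), math.sin(theta)
    x_new = s * (a * x + b * y) + tx
    y_new = s * (-b * x + a * y) + ty
    return x_new, y_new


# Usage: no rotation, scale 2, translation (3, 4).
moved = warp_point(1.0, 0.0, s=2.0, theta=0.0, tx=3.0, ty=4.0)
```

With four parameters (scale, angle, and two translations) this warp keeps the elliptical region's shape and only changes its size, orientation, and position.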
Another approach to track a region defined by a primitive shape is to compute its
translation by use of an optical flow method. Optical flow methods are used for generating
dense flow fields by computing the flow vector of each pixel under the brightness
constancy constraint, $I(x, y, t) - I(x + dx, y + dy, t + dt) = 0$ [Horn and Schunk 1981].
This computation is always carried out in the neighborhood of the pixel either algebraically
[Lucas and Kanade 1981] or geometrically [Schunk 1986]. Extending optical
flow methods to compute the translation of a rectangular region is trivial. In 1994,
Shi and Tomasi proposed the KLT tracker which iteratively computes the translation
$(du, dv)$ of a region (e.g., $25 \times 25$ patch) centered on an interest point (for interest point
detection, see Section 4.1):

$$\begin{pmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{pmatrix} \begin{pmatrix} du \\ dv \end{pmatrix} = \begin{pmatrix} \sum I_x I_t \\ \sum I_y I_t \end{pmatrix},$$

where the sums are taken over the pixels of the region. This equation is similar in
construction to the optical flow method proposed by Lucas and Kanade [1981]. Once the
new location of the interest point is obtained, the KLT tracker evaluates the quality of
the tracked patch by computing the affine transformation

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}, \qquad (11)$$

between the corresponding patches in consecutive frames. If the sum of square difference
between the current patch and the projected patch is small, they continue tracking
the feature, otherwise the feature is eliminated. The implementation of the KLT tracker
is available at KLTSrc. The results obtained by the KLT tracker are shown in Figure 14.
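A single translation estimate in the spirit of the KLT system above can be sketched as follows. This is not the KLT implementation: it performs one linearized solve instead of iterating, uses central differences for the spatial gradients, and assumes the sign convention I_t = previous frame minus current frame, so that the solution is the displacement of the patch from the previous to the current frame. The function names and the synthetic blob images are made up for the example.

```python
import math


def klt_translation(prev, curr, cx, cy, half):
    """One Lucas-Kanade style translation estimate for a square patch.

    prev and curr are 2D lists of intensities; the patch spans cx-half..cx+half
    and cy-half..cy+half and must lie at least one pixel inside the image.
    """
    gxx = gxy = gyy = ex = ey = 0.0
    for y in range(cy - half, cy + half + 1):
        for x in range(cx - half, cx + half + 1):
            ix = (curr[y][x + 1] - curr[y][x - 1]) / 2.0   # central differences
            iy = (curr[y + 1][x] - curr[y - 1][x]) / 2.0
            it = prev[y][x] - curr[y][x]                   # assumed sign convention
            gxx += ix * ix
            gxy += ix * iy
            gyy += iy * iy
            ex += ix * it
            ey += iy * it
    det = gxx * gyy - gxy * gxy
    if abs(det) < 1e-12:
        raise ValueError("patch has too little texture to estimate motion")
    # Solve the 2x2 system by Cramer's rule.
    du = (gyy * ex - gxy * ey) / det
    dv = (gxx * ey - gxy * ex) / det
    return du, dv


def blob(cx, cy, size=21, sigma=4.0):
    """Synthetic image containing a Gaussian blob centered at (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(size)] for y in range(size)]


# Usage: the blob moves by (0.3, -0.2) pixels between the two frames.
prev = blob(10.0, 10.0)
curr = blob(10.3, 9.8)
du, dv = klt_translation(prev, curr, cx=10, cy=10, half=5)
```

For a small shift of a smooth pattern, one solve should recover the displacement closely; larger motions need the iteration (and usually an image pyramid), because the linearization only holds locally. The determinant check reflects why interest points are used: a patch without gradients in two directions makes the system ill-conditioned.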
Fig. 13. Results of the robust online tracking method by Jepson et al.
[2003] (a) The target region in different frames. (b) The mixing proba-
bility of the stable component. Note that the probabilities around the
mouth and eyebrow regions change, while they remain the same in the
other regions (© 2003 IEEE).
Fig. 14. Tracking features using the KLT tracker.
5.2.1.2. Tracking Multiple Objects. Modeling objects individually does not take into ac-
count the interaction between multiple objects and between objects and background
during the course of tracking. An example interaction between objects can be one ob-
ject partially or completely occluding the other. The tracking methods given in the
following model the complete image, that is, the background and all moving objects are
explicitly tracked.
Tao et al. [2002] propose an object tracking method based on modeling the whole
image, $I_t$, as a set of layers. This representation includes a single background layer
and one layer for each object. Each layer consists of shape priors (ellipse), $\Phi$, motion
model (translation and rotation), $\Theta$, and layer appearance, $A$, (intensity modeled using
a single Gaussian). Layering is performed by first compensating the background motion
modeled by projective motion such that the object's motion can be estimated from
the compensated image using 2D parametric motion. Then, each pixel's probability of
belonging to a layer (object), $p_l$, is computed based on the object's previous motion and
shape characteristics. Any pixel far from a layer is assigned a uniform background
probability, $p_b$. Later, the object's appearance (intensity, color) probability $p_a$ is coupled
with $p_l$ to obtain the final layer estimate. The model parameters $(\Phi_t, \Theta_t, A_t)$ that
maximize the probability of observing a layer at time $t$ are estimated iteratively using an
expectation maximization algorithm. However, due to the difficulty in simultaneous
estimation of the parameters, the authors individually estimate one set, while fixing the
others. For instance, they first estimate layer ownership using the intensity for each pixel,
then they estimate the motion (rotation and translation) using appearance probabilities,
and finally update layer ownership using this motion. The unknowns for each object
are iteratively estimated until the layer ownership probabilities are maximized.
Isard and MacCormick [2001] propose joint modeling of the background and fore-
ground regions for tracking. The background appearance is represented by a mix-
ture of Gaussians. Appearance of all foreground objects is also modeled by mixture
of Gaussians. The shapes of objects are modeled as cylinders. They assume the ground
plane is known, thus the 3D object positions can be computed. Tracking is achieved
by using particle filters where the state vector includes the 3D position, shape and the
velocity of all objects in the scene. They propose a modified prediction and correction
scheme for particle filtering which can increase or decrease the size of the state vector
to include or remove objects. The method can also tolerate occlusion between objects.
However, the maximum number of objects in the scene is required to be predefined.
Another limitation of the approach is the use of the same appearance model for all
foreground objects, and it requires training to model the foreground regions.
5.2.2. Tracking Using Multiview Appearance Models. In the previous tracking methods, the
appearance models, that is, histograms, templates etc., are usually generated online.
Thus these models represent the information gathered about the object from the most
recent observations. The objects may appear different from different views, and if the
object view changes dramatically during tracking, the appearance model may no longer
be valid, and the object track might be lost. To overcome this problem, different views
of the object can be learned offline and used for tracking.
In 1998, Black and Jepson proposed a subspace-based approach, that is, eigenspace,
to compute the affine transformation from the current image of the object to the image
reconstructed using eigenvectors. First, a subspace representation of the appearance of
an object is built using Principal Component Analysis (PCA), then the transformation
from the image to the eigenspace is computed by minimizing the so-called subspace
constancy equation which evaluates the difference between the image reconstructed
using the eigenvectors and the input image. Minimization is performed in two steps:
finding subspace coefficients and computing affine parameters. In the first step, the
affine parameters are fixed, and the subspace coefficients are computed. In the second
step, using the new subspace coefficients, affine parameters are computed. Based on
this, tracking is performed by estimating the affine parameters iteratively until the
difference between the input image and the projected image is minimized. Note that the
use of eigenspace for similarity computation is a useful alternative to standard template
matching techniques such as SSD and normalized correlation. The eigenspace-based
similarity computation is equivalent to matching with a linear combination of eigen
templates. This allows for distortions in the templates, for example, distortion caused
by illumination changes in images.
In a similar vein, Avidan [2001] used a Support Vector Machine (SVM) classifier for
tracking. SVM is a general classification scheme that, given a set of positive and negative
training examples, finds the best separating hyperplane between the two classes
[1998]. During testing, the SVM gives a score to the test data indicating the degree of
membership of the test data to the positive class. For SVM-based trackers, the positive
examples consist of the images of the object to be tracked, and the negative examples
consist of all things that are not to be tracked. Generally, negative examples consist of
background regions that could be confused with the object. Avidan's tracking method,
instead of minimizing the intensity difference of a template from the image regions,
maximizes the SVM classification score over image regions in order to estimate the
position of the object. One advantage of this approach is that knowledge about background
objects (negative examples that are not to be tracked) is explicitly incorporated
in the tracker.
5.2.3. Discussion. The main goal of the trackers in this category is to estimate the
object motion. With the region-based object representation, computed motion implicitly
defines the object region as well as the object orientation in the next frame since, for each
point of the object in the current frame, its location in the next frame can be determined
using the estimated motion model. Depending on the context in which these trackers are
being used, only one of these three properties might be more important. For instance,
in the case of analyzing the object behavior based on the object trajectory, only the
motion is adequate. However, to identify an object, the region it encompasses is also
important. In order to evaluate the performance of the trackers in this category, one
can define measures based on what is expected from the tracker. In the case when the
tracker is expected to provide only object motion, the evaluation can be performed by
computing a distance measure between the estimated and actual motion parameters.
An example of a distance measure can be the angular distance, $d = \frac{A \cdot B}{|A||B|}$, between
the motion vectors, $A$ and $B$. For applications when the tracker is required to provide
the correct object region in addition to its trajectory, the tracker's performance can
be evaluated by computing the precision and the recall measures. Both the precision
and the recall measure are defined in terms of the intersection of the hypothesized
and correct object region. In particular, precision is the ratio of the intersection to the
hypothesized regions. The recall is the ratio of the intersection to the ground truth.
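Both evaluation measures are easy to compute once regions and motions are given a concrete form. The sketch below assumes regions are represented as sets of (x, y) pixel coordinates and motion vectors as tuples; the helper names are hypothetical. Note that the angular measure as defined equals the cosine of the angle between the two vectors, so a value of 1 indicates identical directions.

```python
import math


def region_precision_recall(hypothesized, ground_truth):
    """Precision and recall from two regions given as sets of (x, y) pixels."""
    overlap = len(hypothesized & ground_truth)
    precision = overlap / len(hypothesized) if hypothesized else 0.0
    recall = overlap / len(ground_truth) if ground_truth else 0.0
    return precision, recall


def angular_distance(a, b):
    """d = A.B / (|A||B|) for two nonzero motion vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def box(x0, y0, x1, y1):
    """Pixels of an axis-aligned box with x0 <= x < x1 and y0 <= y < y1."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


# Usage: a tracked box shifted 5 pixels to the right of the true 10x10 box.
truth = box(0, 0, 10, 10)
tracked = box(5, 0, 15, 10)
p, r = region_precision_recall(tracked, truth)
```

In this example half of the tracked region overlaps the true region, so both precision and recall are 0.5. A tracker that reports an oversized box would keep recall high while precision drops, which is why the two measures are reported together.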
A qualitative comparison of kernel trackers can be obtained based on
- tracking single or multiple objects,
- ability to handle occlusion,
- requirement of training,
- type of motion model, and
- requirement of a manual initialization.
In Table IV, we provide the qualitative comparison of the methods discussed in this
section.
Table IV. Qualitative Comparison of Geometric Model-Based Trackers (Init. denotes
initialization. #: number of objects, M: multiple objects, S: single object, A: affine or
homography, T: translational motion, S: scaling, R: rotation, P: partial occlusion, F: full
occlusion. The Training and Init. columns denote whether the tracker requires or does not
require training or initialization.)

                                          #    Motion   Training   Occ.   Init.
Simple template matching                  S    T                   P
Mean-shift [Comaniciu et al. 2003]        S    T S                 P
KLT [Shi and Tomasi 1994]                 S    A                   P
Appearance Tracking [Jepson et al. 2003]  S    T S R               P
Layering [Tao et al. 2002]                M    T S R               F
Bramble [Isard and MacCormick 2001]       M    T S R               F
EigenTracker [Black and Jepson 1998]      S    A                   P
SVM [Avidan 2001]                         S    T                   P
Use of primitive geometric shapes to represent objects is very common due to the real-time
applicability of the state-of-the-art methods. Because of the rigidity constraint,
tracking methods in this category compute parametric motion of the object. This motion
is usually in the form of a translation, conformal affine, affine, or projective
transformation. Motion of the object can be estimated by maximizing the object appearance
similarity between the previous and current frames. The estimation process can take the
form of a brute force search or of a gradient ascent (descent)-based maximization
(minimization) process. Object trackers based on the gradient ascent (descent) approach
require that at least some part of the object is visible inside the chosen shape whose
location is defined by the previous object position. To eliminate this requirement, a possible
approach is to use the Kalman filtering or particle filtering discussed in the context of
point trackers to predict the location of the object in the next frame. Given the object
state defined in terms of the velocity and acceleration of the object centroid, these filters
will estimate the position of the object centroid such that the likelihood of observing part
of the object inside the kernel is increased [Comaniciu et al. 2003]. This requirement can
also be met by performing global motion compensation, assuming that the objects are
far from the camera and that the camera motion can be estimated by an affine or projective
transformation [Yilmaz et al. 2003].
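As an illustration of the brute force variant of this search, the following minimal sketch
estimates a purely translational motion by exhaustively testing every offset in a window
around the previous object position. It assumes grayscale images stored as lists of rows,
uses the sum of squared differences as the (dis)similarity measure, and introduces the
hypothetical names `ssd`, `brute_force_match`, and `radius` for this example only; it is
not the implementation of any particular tracker in Table IV.

```python
def ssd(image, template, top, left):
    """Sum of squared differences between the template and an image patch."""
    total = 0
    for r, row in enumerate(template):
        for c, value in enumerate(row):
            diff = image[top + r][left + c] - value
            total += diff * diff
    return total


def brute_force_match(image, template, prev_top, prev_left, radius):
    """Return the (top, left) position minimizing SSD near the previous position.

    Every translation within `radius` pixels of the previous position is
    tested; positions where the template would leave the image are skipped.
    """
    t_rows, t_cols = len(template), len(template[0])
    i_rows, i_cols = len(image), len(image[0])
    best = None
    for top in range(max(0, prev_top - radius),
                     min(i_rows - t_rows, prev_top + radius) + 1):
        for left in range(max(0, prev_left - radius),
                          min(i_cols - t_cols, prev_left + radius) + 1):
            score = ssd(image, template, top, left)
            if best is None or score < best[0]:
                best = (score, top, left)
    if best is None:
        raise ValueError("search window does not overlap the image")
    return best[1], best[2]


# Toy example: a 2x2 object placed at row 3, column 2 of an empty 6x6 frame.
frame = [[0] * 6 for _ in range(6)]
frame[3][2], frame[3][3] = 9, 8
frame[4][2], frame[4][3] = 7, 6
new_position = brute_force_match(frame, [[9, 8], [7, 6]], 2, 2, radius=2)
```

The cost of this search grows with the square of `radius`, which is one reason gradient-based
alternatives are attractive when the object moves only slightly between frames.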
One of the limitations of primitive geometric shapes for object representation is that
parts of the objects may be left outside of the defined shape while parts of the background
may reside inside it. This phenomenon can be observed for both rigid objects (when
the object pose changes) and nonrigid objects (when local motion results in changes in
object appearance). In such cases, the object motion estimated by maximizing model
similarity may not be correct. To overcome this limitation, one approach is to force
the kernel to reside inside the object rather than encapsulating the complete shape.
Another approach is to model the object appearance by probability density functions of
color/texture and assign weights to the pixels residing inside the primitive shape based
on the conditional probability of the observed color/texture.
5.3. Silhouette Tracking
Objects may have complex shapes, for example, hands, head, and shoulders (see
Figure 15(a)) that cannot be well described by simple geometric shapes. Silhouette-based
methods provide an accurate shape description for these objects. The goal of a
silhouette-based object tracker is to find the object region in each frame by means of
an object model generated using the previous frames. This model can be in the form
of a color histogram, object edges, or the object contour. We divide silhouette trackers
into two categories, namely, shape matching and contour tracking. Shape matching
approaches search for the object silhouette in the current frame. Contour tracking
approaches, on the other hand, evolve an initial contour to its new position in the current
frame either by using state space models or by direct minimization of some energy
functional.
Fig. 15. (a) Edge observations along the contour normals (© 1998 Kluwer).
(b) Level set contour representation: each grid position encodes the Euclidean
distance between a grid point and the point on the contour; gray levels
represent the values of the grid.
5.3.1. Shape Matching. Shape matching can be performed similar to tracking based
on template matching (Section 5.2), where an object silhouette and its associated model
are searched for in the current frame. The search is performed by computing the similarity
of the object with the model generated from the hypothesized object silhouette based
on the previous frame. In this approach, the silhouette is assumed to only translate from
the current frame to the next; therefore, nonrigid object motion is not explicitly handled.
The object model, which is usually in the form of an edge map, is reinitialized
in every frame after the object is located in order to handle appearance changes. This update
is required to overcome tracking problems related to viewpoint and lighting condition
changes as well as nonrigid object motion. In 1993, Huttenlocher et al. performed shape
matching using an edge-based representation. The authors use the Hausdorff distance
to construct a correlation surface from which the minimum is selected as the new object
position. The Hausdorff metric is a mathematical measure for comparing two sets of
points $A = \{a_1, a_2, \ldots, a_n\}$ and $B = \{b_1, b_2, \ldots, b_m\}$ in terms of the least
similar members [Hausdorff 1962]:
$$H(A, B) = \max\{h(A, B), h(B, A)\}, \qquad (12)$$
where $h(A, B) = \sup_{a \in A} \inf_{b \in B} \| a - b \|$ and $\| \cdot \|$ is the norm of choice. In the context
of matching using an edge-based model, the Hausdorff distance measures the most
mismatched edges. Due to this, the method emphasizes parts of the edge map that are not
drastically affected by object motion. For instance, in the case of a walking person, the
head and the torso do not change their shape much, whereas the motion of the arms and
legs will result in drastic shape change, such that removing the edges corresponding to
arms and legs will improve the tracking performance. In a similar vein, Li et al. [2001]
propose using the Hausdorff distance for verification of the trajectories and for the pose
estimation problem. Tracking is achieved by evaluating the optical flow vectors computed
inside the hypothesized silhouette such that the average flow provides the new object
position. For verification, an edge-based model is kept for each tracked object for all
possible object poses. The authors then apply a distance transform to the
hypothesized object edges to speed up the computation of the $L_2$ norm in the Hausdorff
distance computation.
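For finite point sets, such as the edge pixels of a model and of an image, the supremum and
infimum in the definition of the Hausdorff distance reduce to a maximum and a minimum. The
sketch below shows this computation under the assumption that the Euclidean ($L_2$) norm is
the norm of choice and that edge points are given as coordinate tuples; the function names
are illustrative and it evaluates all point pairs directly, without the distance transform
speedup mentioned above.

```python
import math


def directed_hausdorff(a, b):
    """h(A, B): the largest distance from a point of A to its nearest point of B."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """H(A, B) = max{h(A, B), h(B, A)} for two non-empty finite point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Toy example: B contains every point of A plus one outlying edge point.
model_edges = [(0, 0), (1, 0)]
image_edges = [(0, 0), (1, 0), (5, 0)]
forward = directed_hausdorff(model_edges, image_edges)
backward = directed_hausdorff(image_edges, model_edges)
distance = hausdorff(model_edges, image_edges)
```

In this example the forward directed distance is zero because every model point has an exact
match, while the single outlying image point determines the overall distance. This
sensitivity to the worst mismatch is the property the text refers to when it says that the
Hausdorff distance measures the most mismatched edges.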
Another approach to match shapes is to find corresponding silhouettes detected in
two consecutive frames. Establishing silhouette correspondence, or in short silhouette
matching, can be considered similar to point matching discussed in Section 5.1. However,
the main difference between silhouette matching and point matching is the object
representations and the object models used. In particular, silhouette matching uses
the complete object region, in contrast to using points. In addition, silhouette matching
makes use of an object's appearance features, whereas point matching uses only motion
and position-based features. Silhouette detection is usually carried out by background
subtraction (see Section 4.2 for discussion). Once the object silhouettes are extracted,
matching is performed by computing some distance between the object models associated
with each silhouette. Object models are usually in the form of density functions
(color or edge histograms), silhouette boundary (closed or open object contour), object
edges, or a combination of these models. In 2004, Kang et al. used histograms of color and
edges as the object models. In contrast to traditional histograms, they proposed generating
histograms from concentric circles with various radii centered on a set of control
points on a reference circle. The reference circle is chosen as the smallest circle
encapsulating the object silhouette. Use of concentric circles implicitly encodes the spatial
information, which in a regular histogram is only possible when the spatial (x, y)
coordinates are included in the observation vector [Comaniciu and Meer 2002]. The resulting
color and edge histograms are rotation, translation, and scale invariant and hence provide the
same matching score for objects transformed by a conformal affine transform. The matching
score can be computed using several distance measures including cross-correlation,
the Bhattacharyya distance, and the Kullback-Leibler divergence. Among these three measures,
the authors conclude that the Bhattacharyya distance and the Kullback-Leibler divergence
perform similarly, and both perform better than the correlation-based measure.
To match silhouettes in consecutive frames, Haritaoglu et al. [2000] model the object
appearance by the edge information obtained inside the object silhouette. In particular,
the edge model is used to refine the translation of the object using the constant velocity
assumption. This refinement is carried out by performing binary correlation between
the object edges in the consecutive frames.
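Two of the histogram distance measures named above can be sketched as follows. The sketch
makes several assumptions that are not taken from the text: histograms are plain lists of
non-negative bin counts with the same number of bins, the Bhattacharyya distance is taken
in its common form $-\ln \sum_i \sqrt{p_i q_i}$ (other forms exist), and empty bins in the
second histogram are floored at a small hypothetical constant `eps` so that the
Kullback-Leibler divergence stays finite.

```python
import math


def normalize(hist):
    """Scale bin counts so that they sum to one."""
    total = sum(hist)
    if total <= 0:
        raise ValueError("histogram must contain at least one count")
    return [h / total for h in hist]


def bhattacharyya_distance(p, q):
    """-ln of the Bhattacharyya coefficient sum_i sqrt(p_i * q_i)."""
    p, q = normalize(p), normalize(q)
    coefficient = min(sum(math.sqrt(a * b) for a, b in zip(p, q)), 1.0)
    return math.inf if coefficient == 0.0 else -math.log(coefficient)


def kl_divergence(p, q, eps=1e-12):
    """sum_i p_i * ln(p_i / q_i); empty bins of q are floored at eps."""
    p, q = normalize(p), normalize(q)
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0.0)


# Toy example: compare a model histogram with a candidate histogram.
model_hist = [1, 1]
candidate_hist = [3, 1]
d_bhattacharyya = bhattacharyya_distance(model_hist, candidate_hist)
d_kl = kl_divergence(model_hist, candidate_hist)
```

Both measures are zero for identical normalized histograms and grow as the histograms
diverge. Unlike the Bhattacharyya distance, the Kullback-Leibler divergence is not symmetric
in its arguments, so the order of the model and candidate histograms matters.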
In contrast to looking for possible silhouette matches in consecutive frames, tracking
silhouettes can be performed by computing the flow vectors for each pixel inside the
silhouette such that the flow that is dominant over the entire silhouette is used to
generate the silhouette trajectory. Following this observation, Sato and Aggarwal [2004]
proposed generating object tracks by applying the Hough transform in the velocity space
to the object silhouettes in consecutive frames. Binary object silhouettes are detected
using background subtraction (see Section 4.2). Then, from a spatio-temporal window
around each moving region pixel, a velocity Hough transform is applied to compute
voting matrices for the vertical flow v and the horizontal flow u. These voting matrices
provide the so-called Temporal Spatio-Velocity (TSV) image in 4D (x, y, u, v) per frame.
The TSV image encodes the dominant motion of a moving region pixel and its likelihood
in terms of the number of votes such that a thresholding operation will provide regions
with similar motion patterns. In contrast to appearance-based matching of silhouettes,
TSV provides a motion-based matching of the object silhouettes and is less sensitive to
appearance variations due to different object views (e.g., the front and back of the object
may look different).
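The idea of selecting the dominant flow by voting can be illustrated with a heavily
simplified sketch. It is not the TSV transform of Sato and Aggarwal: it assumes that a flow
vector (u, v) is already available for every silhouette pixel, quantizes the velocities
into bins of a hypothetical width `bin_size`, and returns the bin that receives the most
votes.

```python
from collections import Counter


def dominant_flow(flow_vectors, bin_size=1.0):
    """Return ((u, v), votes) for the most voted quantized velocity.

    flow_vectors: iterable of (u, v) flow estimates, one per silhouette pixel.
    """
    votes = Counter()
    for u, v in flow_vectors:
        # Each pixel casts one vote for the velocity bin nearest to its flow.
        votes[(round(u / bin_size), round(v / bin_size))] += 1
    if not votes:
        raise ValueError("at least one flow vector is required")
    (bin_u, bin_v), count = votes.most_common(1)[0]
    return (bin_u * bin_size, bin_v * bin_size), count


# Toy example: three pixels move right by about two pixels, one is an outlier.
flows = [(2.1, 0.0), (1.9, 0.1), (2.0, -0.2), (-3.0, 1.0)]
velocity, support = dominant_flow(flows)
```

Because the outlier only receives a single vote, it does not influence the selected
velocity, which is the robustness that a voting scheme offers over averaging the flow.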
5.3.2. Contour Tracking. Contour tracking methods, in contrast to shape matching
methods, iteratively evolve an initial contour in the previous frame to its new position
in the current frame. This contour evolution requires that some part of the object
in the current frame overlap with the object region in the previous frame. Tracking
by evolving a contour can be performed using two different approaches. The first
approach uses state space models to model the contour shape and motion. The second
approach directly evolves the contour by minimizing the contour energy using direct
minimization techniques such as gradient descent.
5.3.2.1. Tracking Using State Space Models. The object's state is defined in terms of the
shape and the motion parameters of the contour. The state is updated at each time
instant such that the contour's a posteriori probability is maximized. The posterior
probability depends on the prior state and the current likelihood, which is usually
defined in terms of the distance of the contour from observed edges.
Terzopoulos and Szeliski [1992] define the object state by the dynamics of the control
points. The dynamics of the control points are modeled in terms of a spring model,
which moves the control points based on the spring stiffness parameters. The new state
(spring parameters) of the contour is predicted using the Kalman filter. The correction
step uses the image observations which are defined in terms of the image gradients. In
1998, Isard and Blake defined the object state in terms of spline shape parameters and
affine motion parameters. The measurements consist of image edges computed in the
normal direction to the contour (see Figure 15(a)). The state is updated using a particle
filter. In order to obtain initial samples for the filter, they compute the state variables
from the contours extracted in consecutive frames during a training phase. During the
testing phase, the current state variables are estimated through particle filtering based
on the edge observations along normal lines at the control points on the contour.
In 2000, MacCormick and Blake extended the particle filter-based object tracker in
Isard and Blake [1998] to track multiple objects by including the exclusion principle
for handling occlusion. The exclusion principle integrates into the sampling step of the
particle filtering framework such that, for two objects, if a feature lies in the observation
space of both objects, then it contributes more to the samples of the object which is
occluding the other object. Since the exclusion principle is only defined between two
objects, this approach can track at most two objects undergoing occlusion at any time
instant.
Chen et al. [2001] propose a contour tracker where the contour is parameterized as
an ellipse. Each contour node has an associated HMM, and the states of each HMM
are defined by the points lying on the lines normal to the contour control point. The
observation likelihood of the contour depends on the background and the foreground
partitions defined by the edge along the normal line on contour control points. The
state transition probabilities of the HMM are estimated using the JPDAF. Given the
observation likelihood and the state transition probabilities, the current contour state
is estimated using the Viterbi algorithm [Viterbi 1967]. After the contour is approximated,
an ellipse is fit to enforce the elliptical shape constraint.
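The decoding step can be illustrated with a generic sketch of the Viterbi algorithm for a
discrete HMM. It is not the tracker of Chen et al.: in their method the states are points
along the contour normals and the transition probabilities come from the JPDAF, whereas
here the states, the probability tables, and the observation sequence are hypothetical toy
values chosen only to show how the most likely state sequence is recovered.

```python
def viterbi(initial, transition, emission, observations):
    """Return the most likely state sequence of a discrete HMM.

    initial[s]: probability of starting in state s
    transition[r][s]: probability of moving from state r to state s
    emission[s][o]: probability of observing symbol o in state s
    """
    n_states = len(initial)
    # score[s] is the probability of the best path that ends in state s.
    score = [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda r: score[r] * transition[r][s])
            new_score.append(score[best_prev] * transition[best_prev][s]
                             * emission[s][obs])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the most likely final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy two-state model with "sticky" transitions and noisy observations.
initial = [0.5, 0.5]
transition = [[0.9, 0.1], [0.1, 0.9]]
emission = [[0.8, 0.2], [0.2, 0.8]]
best_path = viterbi(initial, transition, emission, [0, 0, 1, 1, 1])
```

For long observation sequences the products of probabilities become very small, so practical
implementations usually work with sums of log probabilities instead.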
The methods discussed above represent the contours using an explicit representation,
for example, a parametric spline. Explicit representations do not allow topology
changes such as region split or merge [Sethian 1999]. Next, we will discuss contour
tracking methods based on direct minimization of an energy functional. These methods
can use implicit representations and allow topology changes.
5.3.2.2. Tracking by Direct Minimization of Contour Energy Functional. In the context of
contour evolution, there is an analogy between the segmentation methods discussed in
Section 4.3 and the contour tracking methods in this category. Both the segmentation
and tracking methods minimize the energy functional either by greedy methods or by
gradient descent. The contour energy is defined in terms of temporal information in
the form of either the temporal gradient (optical flow) [Bertalmio et al. 2000; Mansouri
2002; Cremers and Schnorr 2003], or appearance statistics generated from the object
and the background regions [Yilmaz et al. 2004; Ronfard 1994].
Fig. 16. Car tracking using the level sets method (© 2002 IEEE).
Contour tracking using temporal image gradients is motivated by the extensive work
on computing the optical flow. The optical flow constraint is derived from the brightness
constancy constraint: $I_{t+1}(x, y) - I_t(x - u, y - v) = 0$, where $I$ is the image, $t$
is the time, and $(u, v)$ is the flow vector in the $x$ and the $y$ directions. Bertalmio et
al. [2000] use this constraint to evolve the contour in consecutive frames. Their objective
was to compute $u$ and $v$ iteratively for each contour position using the level set
representation (see Figure 15(b)). At each iteration, contour speed in the normal direction,
$\vec{n}$, is computed by projecting the gradient magnitude $|\nabla I_t|$ on $\vec{n}$. The authors
use two energy functionals, one for contour tracking, $E_t$, and another one for intensity
morphing, $E_m$: $E_m(\Gamma) = \int_0^1 E_{im}(v)\,ds$ and $E_t(\Gamma) = \int_0^1 E_{ext}(v)\,ds$, where $E_{ext}$
is computed based on $E_m$. The intensity morphing functional, which minimizes intensity
changes in the current and the previous frames, $\nabla I_t = I_t - I_{t-1}$, on the hypothesized
object contour, $\frac{\partial F(x, y)}{\partial t} = \nabla I_t(x, y) \, \| \nabla F(x, y) \|$, is coupled with the
contour tracking equation,
(x, y)
t
= I
t
(x, y) n
F
n