Tracking Multiple Skin-colored Objects
1 Introduction
An essential building block of many vision systems is one that permits tracking
objects of interest in a temporal sequence of images. This is the case, for example,
for systems that interpret the activities of humans, in which we are particularly
interested. Such systems depend on effective and efficient tracking of a human
operator, or of parts of his body (e.g. hands, face), as he performs a certain task.
In this context, vision-based tracking needs to provide answers to the following
fundamental questions. First, how is a human modeled and how are instances
of the employed model detected in an image? Second, how are instances of the
detected model associated temporally in sequences of images?
The human body is a complex, non-rigid structure with many degrees of free-
dom. Therefore, the type and complexity of the models employed for tracking
vary dramatically [4, 3], depending heavily on the requirements of the applica-
tion domain under consideration. For example, tracking people in an indoors
2 A. Argyros, M. Lourakis
A recent survey [22] includes a very interesting overview of the use of color for
face (and, therefore, skin-color) detection. A major decision towards providing
a model of skin color is the selection of the color space to be employed. Sev-
eral color spaces have been proposed including RGB [8], normalized RGB [12,
10], HSV [15], YCrCb [2], YUV [20], etc. Color spaces efficiently separating the
chrominance from the luminance components of color are typically considered
preferable. This is due to the fact that by employing chrominance-dependent
components of color only, some degree of robustness to illumination changes can
be achieved. Terrillon et al. [18] review different skin chrominance models and
evaluate their performance.
Having selected a suitable color space, the simplest approach for defining
what constitutes skin color is to employ bounds on the coordinates of the se-
lected space [2]. These bounds are typically selected empirically, i.e. by examining
the distribution of skin colors in a preselected set of images. Another approach
is to assume that the probabilities of skin colors follow a distribution that can
be learned either off-line or by employing an on-line iterative method [15]. In
the case of non-parametric approaches, the learnt distribution is represented by
means of a color probability histogram. Other, so-called parametric, approaches
are based either on a unimodal Gaussian probability density function [12, 20] or
multimodal Gaussian mixtures [9, 14] that model the probability distribution
of skin color. The parameters of a unimodal Gaussian density function are es-
timated by maximum likelihood estimation techniques. Multi-modal Gaussian
mixtures require the Expectation-Maximization (EM) algorithm to be employed.
According to Yang et al. [21], a mixture of Gaussians is preferable to
a single Gaussian distribution. Still, [10] argues that histogram models provide
better accuracy and incur lower computational costs compared to mixture mod-
els for the detection of skin-colored areas in an image. A few of the proposed
methods perform some sort of adaptation to become insensitive to changes in
the illumination conditions. For example in [14] it has been suggested to adapt a
Tracking Multiple Skin-colored Objects 3
2 Method description
The proposed method for tracking multiple skin-colored objects operates as fol-
lows. At each time instance, the camera acquires an image on which skin-colored
blobs (i.e. connected sets of skin-colored pixels) are detected. The method also
maintains a set of object hypotheses that have been tracked up to this instance
in time. The detected blobs, together with the object hypotheses, are then asso-
ciated in time. The goal of this association is (a) to assign a new, unique label
to each new object that enters the camera's field of view for the first time, and
(b) to propagate in time the labels of already detected objects. What follows is
a more detailed description of the approach adopted to solve the aforementioned
subproblems.
Then, the probability of each image point being skin-colored can be determined
and all image points with probability P (s|c) > Tmax are considered as being
skin-colored. These points constitute the seeds of potential blobs. More specif-
ically, image points with probability P (s|c) > Tmin, where Tmin < Tmax, that
are immediate neighbors of skin-colored image points are recursively added to
each blob. The rationale behind this region growing operation is that an image
point with relatively low probability of being skin-colored should be considered
as such in the case that it is a neighbor of an image point with high probability
of being skin-colored. This hysteresis thresholding type of operation has been
very successfully applied to edge detection [1] and also proves extremely useful
in the context of robust detection of skin-colored blobs. Indicative values for the
thresholds Tmax and Tmin are 0.5 and 0.15, respectively. A connected compo-
nents labeling algorithm is then responsible for assigning different labels to the
image points of different blobs. Size filtering on the derived connected compo-
nents is also performed to eliminate small, isolated blobs that are attributed
to noise and do not correspond to interesting skin-colored regions. Each of the
remaining connected components corresponds to a skin-colored blob. The final
step in skin color detection is the computation of up to second order moments
for each blob that will be used in the tracking process.
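As an illustration, the seeded region-growing step described above might be sketched as follows. This is a minimal sketch, not the authors' implementation: the function name, the use of a plain 2-D list as the probability map, and 4-connectivity are all assumptions.

```python
from collections import deque

def detect_blobs(prob, t_max=0.5, t_min=0.15):
    """Hysteresis-threshold a skin-probability map into connected blobs.

    Pixels with P(s|c) > t_max seed blobs; neighboring pixels with
    P(s|c) > t_min are recursively grown into them (4-connectivity).
    Threshold defaults follow the indicative values from the text.
    Returns a list of blobs, each a list of (row, col) coordinates.
    """
    rows, cols = len(prob), len(prob[0])
    label = [[0] * cols for _ in range(rows)]  # 0 means unassigned
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if prob[r][c] > t_max and label[r][c] == 0:
                blob, queue = [], deque([(r, c)])
                label[r][c] = len(blobs) + 1
                while queue:
                    y, x = queue.popleft()
                    blob.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and label[ny][nx] == 0
                                and prob[ny][nx] > t_min):
                            label[ny][nx] = len(blobs) + 1
                            queue.append((ny, nx))
                blobs.append(blob)
    return blobs
```

The size-filtering step mentioned in the text would then simply discard blobs whose pixel count falls below a threshold.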
the evidence that the system gathers during the w most recent frames. Clearly,
the second set better reflects the recent appearance of skin-colored objects and
is better adapted to the current illumination conditions. Skin color detection is
then performed based on:

    P(s|c) = γ P(s|c) + (1 − γ) Pw(s|c)    (2)

where, on the right-hand side, P(s|c) and Pw(s|c) are both given by eq. (1) but involve prior probabilities
that have been computed from the whole training set and from the detection
results in the last w frames, respectively. In eq. (2), γ is a sensitivity parameter
that controls the influence of the training set in the detection process. Setting
w = 5 and γ = 0.8 gave rise to very good results in a series of experiments
involving gradual variations of illumination.
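The weighted combination of eq. (2) reduces to a one-line blend of the two probability estimates. The sketch below assumes this functional form and uses illustrative names; γ = 0.8 follows the value reported in the text.

```python
def adapted_skin_probability(p_train, p_recent, gamma=0.8):
    """Blend the off-line trained skin probability P(s|c) with the
    probability Pw(s|c) estimated from the w most recent frames,
    following the description of eq. (2). gamma weighs the influence
    of the training set against the recent detection results.
    """
    return gamma * p_train + (1.0 - gamma) * p_recent
```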
Fig. 1. Various cases of the relation between skin-colored blobs (b1, b2, b3) and object hypotheses (h1, h2, h3, h4).
the problem. In this particular example there are three blobs (b1, b2 and b3),
while there are four object hypotheses (h1, h2, h3 and h4) from the previous
frame.
What follows is an algorithm that can cope effectively with the data associa-
tion problem. The proposed algorithm needs to address three different subprob-
lems: (a) object hypothesis generation (i.e. an object appears in the field of view
for the first time) (b) object hypothesis tracking in the presence of multiple,
potential occluding objects (i.e. previously detected objects move arbitrarily in
the field of view) and (c) object model hypothesis removal (i.e. a tracked object
disappears from the field of view).
where (xc, yc) is the center of ellipse h, α and β are the lengths of its two axes,
θ is its orientation and

    v = ( ((x − xc) cos θ + (y − yc) sin θ) / α ,  (−(x − xc) sin θ + (y − yc) cos θ) / β ),

so that D(p, h) = √(v · v).
From the definition of D(p, h) it turns out that the value of this metric is less
than 1.0, equal to 1.0 or greater than 1.0 depending on whether point p is inside,
on, or outside ellipse h, respectively. Consider now a model ellipse h and a point
p belonging to a blob b. In the case that D(p, h) < 1.0, we conclude that the
point p and the blob b support the existence of the object hypothesis h and that
object hypothesis h predicts blob b. Consider now a blob b such that:

    D(p, h) > 1.0, for all p ∈ b and for all hypotheses h ∈ H    (4)
Equation (4) describes a blob with empty intersection with all ellipses of the
existing object hypotheses. Blob b1 in Fig. 1 is such a case. This implies that
none of the existing object hypotheses accounts for the existence of this blob.
For each such blob, a new object hypothesis is generated. The parameters of
the generated object hypothesis can be derived directly from the statistics of
the distribution of points belonging to the blob. The center of the ellipse of the
object hypothesis becomes equal to the centroid of the blob and the rest of the
ellipse parameters can be computed from the covariance matrix of the bivariate
distribution of the location of blob points. More specifically, it can be shown
that if the covariance matrix of the blob's point distribution is

    Σ = [ σxx  σxy
          σxy  σyy ],

then an ellipse can be defined with parameters:

    α = √λ1 ,   β = √λ2 ,   θ = tan⁻¹( σxy / (λ1 − σyy) )    (5)

where λ1 = (σxx + σyy + Λ)/2, λ2 = (σxx + σyy − Λ)/2 and Λ = √((σxx − σyy)² + 4σxy²).
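Numerically, eq. (5) amounts to an eigen-decomposition of the 2×2 covariance matrix. A minimal sketch follows; the function name and argument order are assumptions.

```python
import math

def ellipse_from_covariance(sxx, syy, sxy):
    """Ellipse parameters (alpha, beta, theta) from the covariance of a
    blob's pixel distribution, per eq. (5): alpha and beta are the square
    roots of the covariance eigenvalues (semi-axes), theta orients the
    major axis via tan(theta) = sxy / (lambda1 - syy)."""
    big_lam = math.sqrt((sxx - syy) ** 2 + 4.0 * sxy ** 2)
    l1 = (sxx + syy + big_lam) / 2.0   # larger eigenvalue
    l2 = (sxx + syy - big_lam) / 2.0   # smaller eigenvalue
    # atan2 handles the sxy = 0 (axis-aligned, major axis horizontal) case.
    theta = math.atan2(sxy, l1 - syy)
    return math.sqrt(l1), math.sqrt(l2), theta
```

For instance, an axis-aligned blob with σxx = 4, σyy = 1, σxy = 0 yields α = 2, β = 1, θ = 0, and the same blob rotated by 45 degrees (σxx = σyy = 2.5, σxy = 1.5) yields θ = π/4.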
Algorithmically, at each time t, all detected blobs are tested against the
criterion of eq. (4). For all qualifying blobs, an object hypothesis is formed and
8 A. Argyros, M. Lourakis
the corresponding ellipse parameters are determined based on eq. (5). Moreover,
all such blobs are excluded from further consideration in the subsequent steps
of object tracking.
Object hypothesis tracking. After new object hypotheses have been formed
as described in the previous section, all the remaining blobs must support the
existence of past object hypotheses. The main task of the tracking algorithm
amounts to associating blob pixels to object hypotheses. There are two rules
governing this association:
Rule 1: If a skin-colored pixel of a blob is located within the ellipse of some
object hypothesis (i.e. supports the existence of the hypothesis) then this
pixel is considered as belonging to this hypothesis.
Rule 2: If a skin-colored pixel is outside all ellipses corresponding to the
object hypotheses, then it is assigned to the object hypothesis that is closest
to it, according to the distance metric of eq. (3).
Formally, the set o of skin-colored pixels that are associated with an object
hypothesis h is given by o = R1 ∪ R2, where R1 = {p ∈ B | D(p, h) < 1.0} and
R2 = {p ∈ B | D(p, h) = min_{k∈H} D(p, k)}.
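The two rules can be sketched as follows, assuming D(p, h) is the normalized ellipse distance implied by the text (rotate the point into the ellipse frame, scale by the axes, take the norm, so that the value is below 1 inside the ellipse and above 1 outside). Function names and the (xc, yc, alpha, beta, theta) tuple layout are illustrative assumptions.

```python
import math

def ellipse_distance(p, h):
    """D(p, h): < 1 inside, = 1 on, > 1 outside ellipse h."""
    x, y = p
    xc, yc, alpha, beta, theta = h
    dx, dy = x - xc, y - yc
    vx = (math.cos(theta) * dx + math.sin(theta) * dy) / alpha
    vy = (-math.sin(theta) * dx + math.cos(theta) * dy) / beta
    return math.hypot(vx, vy)

def assign_pixels(blob_pixels, hypotheses):
    """Rule 1: a pixel inside some ellipse supports every hypothesis
    whose ellipse contains it. Rule 2: a pixel outside all ellipses goes
    to the hypothesis with the smallest D(p, h). Returns a map from
    hypothesis index to its list of supporting pixels."""
    support = {i: [] for i in range(len(hypotheses))}
    for p in blob_pixels:
        dists = [ellipse_distance(p, h) for h in hypotheses]
        inside = [i for i, d in enumerate(dists) if d < 1.0]
        for i in inside or [min(range(len(dists)), key=dists.__getitem__)]:
            support[i].append(p)
    return support
```

Note that, as in the text, a pixel lying in the intersection of two ellipses is assigned to both hypotheses.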
In the example of Fig. 1, two different object hypotheses (h2 and h3 ) are
competing for the skin-colored area of blob b2. According to rule 1 above,
all skin pixels within the ellipse of h2 will be assigned to it. According to the
same rule, the same will happen for skin pixels within the ellipse of h3. Note that
pixels in the intersection of these ellipses will be assigned to both hypotheses
h2 and h3 . According to rule 2, pixels of blob b2 that are not within any of the
ellipses, will be assigned to their closest ellipse which is determined by eq. (3).
Another interesting case is that of a hypothesis that is supported by more
than one blob (see for example hypothesis h4 in Fig. 1). Such cases may arise
when, for example, two objects are connected at the time they first appear in
the scene and later split. To cope with situations where a hypothesis h receives
support from several blobs, the following strategy is adopted. If there exists only
one blob b that is predicted by h and, at the same time, not predicted by any
other hypothesis, then h is assigned to b. Otherwise, h is assigned to the blob
with which it shares the largest number of skin-colored points. In the example
of Fig. 1, hypothesis h4 gets support from blobs b2 and b3 . Based on the above
rule, it will be finally assigned to blob b3 .
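The blob-selection strategy above can be sketched as follows. The data layouts (a map from blob id to the set of hypotheses predicting it, and a map from blob id to the number of skin pixels shared with h) are illustrative assumptions, not the authors' data structures.

```python
def choose_blob(h, predicted_by, shared_pixels):
    """Pick the blob that hypothesis h tracks when several blobs support
    it: prefer the unique blob predicted by h and by no other hypothesis;
    otherwise take the blob sharing the most skin-colored pixels with h.
    predicted_by maps blob id -> set of hypothesis ids predicting it;
    shared_pixels maps blob id -> pixel count shared with h."""
    exclusive = [b for b, hs in predicted_by.items() if hs == {h}]
    if len(exclusive) == 1:
        return exclusive[0]
    return max(shared_pixels, key=shared_pixels.get)
```

On the Fig. 1 configuration (h4 predicting b2 and b3, with b2 also predicted by h2), the first branch fires and h4 is assigned to b3, matching the outcome described in the text.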
After having assigned skin pixels to object hypotheses, the parameters of the
object hypotheses h_i are re-estimated based on the statistics of the pixel sets o_i
that have been assigned to them.
Object hypothesis removal. Consider now a hypothesis h whose set o of supporting pixels is empty, i.e.

    o = ∅    (6)

Equation (6) essentially describes hypotheses that are not supported by any
skin-colored image points. Hypothesis h1 in Fig. 1 is such a case. In practice,
we permit an object hypothesis to survive for a certain amount of time, even
in the absence of any support, so that we account for the case of possibly poor
skin-color detection. In our implementation, this time interval has been set to
half a second, which approximately amounts to fourteen image frames.
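The removal policy can be sketched with a per-hypothesis miss counter; a hypothesis unsupported for more than ~14 consecutive frames (about half a second at the reported frame rate) is dropped. The dict-based layout and function name are illustrative assumptions.

```python
def prune_hypotheses(hypotheses, supported_ids, max_misses=14):
    """Drop hypotheses that have gone unsupported (eq. (6)) for more
    than max_misses consecutive frames; a supported hypothesis has its
    miss counter reset. Returns the surviving hypotheses."""
    survivors = []
    for i, h in enumerate(hypotheses):
        h['misses'] = 0 if i in supported_ids else h.get('misses', 0) + 1
        if h['misses'] <= max_misses:
            survivors.append(h)
    return survivors
```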
3 Experimental results
In this section, representative results from a prototype implementation of the
proposed tracker are provided. The reported experiment consists of a long (3825
frames) sequence that has been acquired and processed on-line and in real-time
on a Pentium 4 laptop computer at 2.56 GHz running MS Windows. A web
camera with an IEEE 1394 (Firewire) interface has been used for this experiment.
For the reported experiment, the initial, seed training set contained 20
images and was later refined in a semi-automatic manner using 80 additional
images. The training set contains images of four different persons that have
been acquired under various lighting conditions.
Figure 2 provides a few characteristic snapshots of the experiment. For vi-
sualization purposes, the contour of each tracked object hypothesis is shown.
Different contour colors correspond to different object hypotheses.
When the experiment starts, the camera is still and the tracker correctly
asserts that there are no skin-colored objects in the scene (Fig. 2(a)). Later,
the hand of a person enters the field of view of the camera and starts moving
at various depths, directions and speeds in front of it. At some point in time,
the camera also starts moving in a very jerky way; the camera is mounted on
the laptop's monitor, which is being moved back and forth. The person's second
hand enters the field of view; hands now move in overlapping trajectories. Then,
the person's face enters the field of view. Hands disappear and then reappear in
the scene. All three objects move independently along disjoint trajectories and at
varying speeds ((b)-(d)), ranging from slow to fast; at some point in time the
person starts dancing, jumping and moving his hands very fast. The experiment
proceeds with hands moving in crossing trajectories. Initially hands cross each
other slowly and then very fast ((e)-(g)). Later on, the person starts applauding,
which results in his hands touching but not crossing each other ((h)-(j)). Right
after, the person starts crossing his hands as if tying knots ((k)-(o)). Next,
the hands cross each other and stay like this for a considerable amount of time;
then the person starts moving, still keeping his hands crossed ((p)-(r)). Then, the
person waves and crosses his hands in front of his face ((s)-(u)). The experiment
concludes with the person turning the light on and off ((v)-(x)), while greeting
towards the camera (Fig. 2(x)).
As can be verified from the snapshots, the labeling of the object hypothe-
ses is consistent throughout the whole sequence, which indicates that they are
correctly tracked. Thus, the proposed tracker performs very well in all the above
cases, some of which are challenging. Note also that no images of the person
depicted in this experiment were contained in the training set. With respect to
computational performance, the 3825-frame sequence presented previously has
been acquired and processed at an average frame rate of 28.45 fps (320×240 im-
ages). The time required for grabbing a single video frame from the IEEE 1394
interface dominates the tracker's cycle time. When prerecorded image sequences
are loaded from disk, considerably higher tracking frame rates can be achieved.
Besides the reported example, the proposed tracker has also been extensively
tested with different cameras and in different settings involving different scenes
and humans. Demonstration videos including the reported experiment can be
found at http://www.ics.forth.gr/cvrl/demos.html.
4 Discussion
In this paper, a new method for tracking multiple skin-colored objects has been
presented. The proposed method can cope successfully with multiple objects
moving in complex patterns as they dynamically enter and exit the field of view
of a camera. Since the tracker is not based on explicit background modeling
and subtraction, it may operate even with images acquired by a moving camera.
Ongoing research efforts are currently focused on (1) combining the proposed
method with binocular stereo processing in order to derive 3D information re-
garding the tracked objects, (2) providing means for discriminating various types
of skin-colored areas (e.g. hands, faces, etc.) and (3) developing methods that
build upon the proposed tracker in order to track interesting parts of
skin-colored areas (e.g. eyes for faces, fingertips for hands, etc.).
Acknowledgements
This work was partially supported by EU IST-2001-32184 project ActIPret.
References
1. J.F. Canny. A computational approach to edge detection. IEEE Trans. on PAMI,
8(6):679–698, 1986.
2. D. Chai and K.N. Ngan. Locating facial region of a head-and-shoulders color image.
In Proc. of FG'98, pages 124–129, 1998.
3. Q. Delamarre and O. Faugeras. 3d articulated models and multi-view tracking with
physical forces. Computer Vision and Image Understanding, 81:328–357, 2001.
4. D.M. Gavrila. The visual analysis of human movement: A survey. Computer Vision
and Image Understanding, 73(1):82–98, 1999.
5. C. Hue, J.-P. Le Cadre, and P. Perez. Sequential monte carlo methods for multiple
target tracking and data fusion. IEEE Trans. on Signal Proc., 50(2):309–325, 2002.
6. M. Isard and A. Blake. Icondensation: Unifying low-level and high-level tracking
in a stochastic framework. In Proc. of ECCV'98, pages 893–908, 1998.
7. O. Javed and M. Shah. Tracking and object classification for automated surveil-
lance. In Proc. of ECCV'02, pages 343–357, 2002.
8. T.S. Jebara and A. Pentland. Parameterized structure from motion for 3d adaptive
feedback tracking of faces. In Proc. of CVPR'97, pages 144–150, 1997.
9. T.S. Jebara, K. Russel, and A. Pentland. Mixture of eigenfeatures for real-time
structure from texture. In Proc. of ICCV'98, pages 128–135, 1998.
10. M.J. Jones and J.M. Rehg. Statistical color models with application to skin detec-
tion. In Proc. of CVPR'99, volume 1, pages 274–280, 1999.
11. R.E. Kalman. A new approach to linear filtering and prediction problems. Trans-
actions of the ASME-Journal of Basic Engineering, pages 35–45, 1960.
12. S.H. Kim, N.K. Kim, S.C. Ahn, and H.G. Kim. Object oriented face detection
using range and color information. In Proc. of FG'98, pages 76–81, 1998.
13. E. Koller-Meier and F. Ade. Tracking multiple objects using the condensation
algorithm. Journal of Robotics and Autonomous Systems, 34(2-3):93–105, 2001.
14. S. McKenna, Y. Raja, and S. Gong. Tracking color objects using adaptive mixture
models. IVC journal, 17(3-4):225–231, 1999.
15. D. Saxe and R. Foulds. Toward robust skin identification in video images. In Proc.
of FG'96, pages 379–384, 1996.
16. N.T. Siebel and S. Maybank. Fusion of multiple tracking algorithms for robust
people tracking. In Proc. of ECCV'02, pages 373–387, 2002.
17. M. Spengler and B. Schiele. Multi object tracking based on a modular knowledge
hierarchy. In Proc. of International Conference on Computer Vision Systems, pages
373–387, 2003.
18. J.C. Terrillon, M.N. Shirazi, H. Fukamachi, and S. Akamatsu. Comparative per-
formance of different skin chrominance models and chrominance spaces for the
automatic detection of human faces in color images. In Proc. of FG'00, pages
54–61, 2000.
19. J. Triesch and C. von der Malsburg. Democratic integration: Self-organized inte-
gration of adaptive cues. Neural Computation, 13(9):2049–2074, 2001.
20. M.H. Yang and N. Ahuja. Detecting human faces in color images. In Proc. of
ICIP'98, volume 1, pages 127–130, 1998.
21. M.H. Yang and N. Ahuja. Face Detection and Gesture Recognition for Human-
Computer Interaction. Kluwer Academic Publishers, New York, 2001.
22. M.H. Yang, D.J. Kriegman, and N. Ahuja. Detecting faces in images: A survey.
IEEE Trans. on PAMI, 24(1):34–58, 2002.