3D Modeling
and Animation:
Synthesis and Analysis
Techniques for the
Human Body
Nikos Sarris
Michael G. Strintzis
IRM Press
3D Modeling
and
Animation:
Synthesis and Analysis
Techniques for the
Human Body
Nikos Sarris
Informatics & Telematics Institute, Greece
Michael G. Strintzis
Informatics & Telematics Institute, Greece
IRM Press
Publisher of innovative scholarly and professional
information technology titles in the cyberage
Hershey • London • Melbourne • Singapore
Acquisition Editor: Mehdi Khosrow-Pour
Senior Managing Editor: Jan Travers
Managing Editor: Amanda Appicello
Development Editor: Michele Rossi
Copy Editor: Bernard J. Kieklak, Jr.
Typesetter: Amanda Appicello
Cover Design: Shane Dillow
Printed at: Integrated Book Technology
Copyright © 2005 by Idea Group Inc. All rights reserved. No part of this book may be repro-
duced in any form or by any means, electronic or mechanical, including photocopying, without
written permission from the publisher.
All work contributed to this book is new, previously-unpublished material. The views expressed in
this book are those of the authors, but not necessarily of the publisher.
3D Modeling and
Animation:
Synthesis and Analysis
Techniques for the Human Body
Table of Contents
Preface ............................................................................................................. vi
Nikos Sarris, Informatics & Telematics Institute, Greece
Michael G. Strintzis, Informatics & Telematics Institute, Greece
Chapter I
Advances in Vision-Based Human Body Modeling .................................. 1
Angel Sappa, Computer Vision Center, Spain
Niki Aifanti, Informatics & Telematics Institute, Greece
Nikos Grammalidis, Informatics & Telematics Institute, Greece
Sotiris Malassiotis, Informatics & Telematics Institute, Greece
Chapter II
Virtual Character Definition and Animation within the
MPEG-4 Standard ......................................................................................... 27
Marius Preda, GET/Institut National des Télécommunications,
France
Ioan Alexandru Salomie, ETRO Department of the Vrije Universiteit
Brussel, Belgium
Françoise Preteux, GET/Institut National des Télécommunications,
France
Gauthier Lafruit, MICS-DESICS/Interuniversity MicroElectronics
Center (IMEC), Belgium
Chapter III
Camera Calibration for 3D Reconstruction and View
Transformation .............................................................................................. 70
B. J. Lei, Delft University of Technology, The Netherlands
E. A. Hendriks, Delft University of Technology, The Netherlands
Aggelos K. Katsaggelos, Northwestern University, USA
Chapter IV
Real-Time Analysis of Human Body Parts and Gesture-Activity
Recognition in 3D ...................................................................................... 130
Burak Ozer, Princeton University, USA
Tiehan Lv, Princeton University, USA
Wayne Wolf, Princeton University, USA
Chapter V
Facial Expression and Gesture Analysis for Emotionally-Rich
Man-Machine Interaction ........................................................................ 175
Kostas Karpouzis, National Technical University of Athens,
Greece
Amaryllis Raouzaiou, National Technical University of Athens,
Greece
Athanasios Drosopoulos, National Technical University of Athens,
Greece
Spiros Ioannou, National Technical University of Athens, Greece
Themis Balomenos, National Technical University of Athens,
Greece
Nicolas Tsapatsoulis, National Technical University of Athens,
Greece
Stefanos Kollias, National Technical University of Athens, Greece
Chapter VI
Techniques for Face Motion & Expression Analysis on
Monocular Images ..................................................................................... 201
Ana C. Andrés del Valle, Institut Eurécom, France
Jean-Luc Dugelay, Institut Eurécom, France
Chapter VII
Analysis and Synthesis of Facial Expressions ....................................... 235
Peter Eisert, Fraunhofer Institute for Telecommunications,
Germany
Chapter VIII
Modeling and Synthesis of Realistic Visual Speech in 3D .................. 266
Gregor A. Kalberer, BIWI – Computer Vision Lab, Switzerland
Pascal Müller, BIWI – Computer Vision Lab, Switzerland
Luc Van Gool, BIWI – Computer Vision Lab, Switzerland and
VISICS, Belgium
Chapter IX
Automatic 3D Face Model Adaptation with Two Complexity
Modes for Visual Communication ........................................................... 295
Markus Kampmann, Ericsson Eurolab Deutschland GmbH,
Germany
Liang Zhang, Communications Research Centre, Canada
Chapter X
Learning 3D Face Deformation Model: Methods and Applications ..... 317
Zhen Wen, University of Illinois at Urbana Champaign, USA
Pengyu Hong, Harvard University, USA
Jilin Tu, University of Illinois at Urbana Champaign, USA
Thomas S. Huang, University of Illinois at Urbana Champaign,
USA
Chapter XI
Synthesis and Analysis Techniques for the Human Body:
R&D Projects ............................................................................................. 341
Nikos Karatzoulis, Systema Technologies SA, Greece
Costas T. Davarakis, Systema Technologies SA, Greece
Dimitrios Tzovaras, Informatics & Telematics Institute,
Greece
Preface
Realistic human animation is also a matter of ongoing research and, in the case
of human cloning, relies on accurate tracking of the 3D motion of a human,
which has to be duplicated by his 3D model. The inherently complex articula-
tion of the human body imposes great difficulties in both the tracking and ani-
mation processes, which are being tackled by specific techniques, such as mod-
eling languages, as well as by standards developed for these purposes. Particu-
larly the human face and hands present the greatest difficulties in modeling and
animation due to their complex articulation and communicative importance in
expressing the human language and emotions.
Within the context of this book, we present the state-of-the-art methods for
analyzing the structure and motion of the human body in parallel with the most
effective techniques for constructing realistic synthetic models of virtual hu-
mans.
The level of detail that follows is such that the book can prove useful to students,
researchers and software developers. That is, the level is low enough to describe
modeling methods and algorithms without getting into image processing and
programming principles, which are not considered prerequisites for the target
audience.
The main objective of this book is to provide a reference for the state-of-the-
art methods delivered by leading researchers in the area, who contribute to the
appropriate chapters according to their expertise. The reader is presented with
the latest, research-level, techniques for the analysis and synthesis of still and
moving human bodies, with particular emphasis on facial and gesture charac-
teristics.
Attached to this preface, the reader will find an introductory chapter which
reviews the state-of-the-art in established methods and standards for the analysis
and synthesis of images containing humans. The most recent vision-based hu-
man body modeling techniques are presented, covering the topics of 3D human
body coding standards, motion tracking, recognition and applications. Although
this chapter, as well as the whole book, examines the relevant work in the
context of computer vision, references to computer graphics techniques are
given, as well.
The detection of the human body and the recognition of human activities and
hand gestures from multiview images are examined by Ozer, Lv and Wolf in
Chapter 4. Introducing the subject, the authors provide a review of the main
components of three-dimensional and multiview visual processing techniques.
The real-time aspects of these techniques are discussed and the ways in which
these aspects affect the software and hardware architectures are shown. The
authors also present the multiple-camera system developed by their group to
investigate the relationship between the activity recognition algorithms and the
architectures required to perform these tasks in real-time.
The face, being the most expressive and complex part of the human body, is the
object of discussion in the following five chapters as well. Chapter 6 examines
techniques for the analysis of facial motion, aiming mainly at the understanding
of expressions from monoscopic images or image sequences. In Chapter 7
Eisert also addresses the same problem with his methods, paying particular
attention to understanding and normalizing the illumination of the scene.
Kalberer, Müller and Van Gool present their work in Chapter 8, extending the
state-of-the-art in creating highly realistic lip and speech-related facial motion.
The book concludes with Chapter 11, by Karatzoulis, Davarakis and Tzovaras,
providing a reference to current relevant R&D projects worldwide. This clos-
ing chapter presents a number of promising applications and provides an over-
view of recent developments and techniques in the area of analysis and synthe-
sis techniques for the human body. Technical details are provided for each
project, and the reported results are also discussed and evaluated.
Chapter I
Advances in
Vision-Based Human
Body Modeling
Angel Sappa
Computer Vision Center, Spain
Niki Aifanti
Informatics & Telematics Institute, Greece
Nikos Grammalidis
Informatics & Telematics Institute, Greece
Sotiris Malassiotis
Informatics & Telematics Institute, Greece
Abstract
This chapter presents a survey of the most recent vision-based human body
modeling techniques. It includes sections covering the topics of 3D human
body coding standards, motion tracking, recognition and applications.
Short summaries of various techniques, including their advantages and
disadvantages, are introduced. Although this work is focused on computer
vision, some references from computer graphics are also given. Considering
that it is impossible to find a method valid for all applications, this chapter
Introduction
Human body modeling is experiencing a continuous and accelerated growth. This
is partly due to the increasing demand from computer graphics and computer
vision communities. Computer graphics pursue a realistic modeling of both the
human body geometry and its associated motion. This will benefit applications
such as games, virtual reality or animations, which demand highly realistic
Human Body Models (HBMs). At present, the cost of generating realistic
human models is very high; therefore, their application is currently limited to the
movie industry, where HBMs' movements are predefined and well studied
(usually manually produced). The automatic generation of a realistic and fully
configurable HBM is still nowadays an open problem. The major constraint
involved is the computational complexity required to produce realistic models
with natural behaviors. Computer graphics applications are usually based on
motion capture devices (e.g., magnetic or optical trackers) as a first step, in order
to accurately obtain the human body movements. Then, a second stage involves
the manual generation of HBMs by using editing tools (several commercial
products are available on the market).
Recently, computer vision technology has been used for the automatic genera-
tion of HBMs from a sequence of images by incorporating and exploiting prior
knowledge of the human appearance. Computer vision also addresses human
body modeling, but in contrast to computer graphics it seeks more for an efficient
than an accurate model for applications such as intelligent video surveillance,
motion analysis, telepresence or human-machine interface. Computer vision
applications rely on vision sensors for reconstructing HBMs. Obviously, the rich
information provided by a vision sensor, containing all the necessary data for
generating a HBM, needs to be processed. Approaches such as tracking-
segmentation-model fitting or motion prediction-segmentation-model fitting
or other combinations have been proposed showing different performances
according to the nature of the scene to be processed (e.g., indoor environments,
studio-like environments, outdoor environments, single-person scenes, etc). The
challenge is to produce a HBM able to faithfully follow the movements of a real
person.
Vision-based human body modeling combines several processing techniques
from different research areas which have been developed for a variety of
conditions (e.g., tracking, segmentation, model fitting, motion prediction, the
Figure 1. A stick representation of the human body (left) and a cardboard model (right), with joints annotated by their degrees of freedom (1 DOF or 3 DOF).
each arm and each leg. The illustration presented in Figure 1 (left) corresponds
to an articulated model defined by 22 DOF.
On the contrary, in computer graphics, highly accurate representations consist-
ing of more than 50 DOF are generally selected. Aubel, Boulic & Thalmann
(2000) propose an articulated structure composed of 68 DOF. They correspond
to the real human joints, plus a few global mobility nodes that are used to orient
and position the virtual human in the world.
The simplest 3D articulated structure is a stick representation with no associated
volume or surface (Figure 1 (left)). Planar 2D representations, such as the
cardboard model, have also been widely used (Figure 1 (right)). However,
volumetric representations are preferred in order to generate more realistic
models (Figure 2). Different volumetric approaches have been proposed,
depending upon whether the application is in the computer vision or the computer
graphics field. On one hand, in computer vision, where the model is not the
purpose, but the means to recover the 3D world, there is a trade-off between
accuracy of representation and complexity. The utilized models should be quite
realistic, but they should have a low number of parameters in order to be
processed in real-time. Volumetric representations such as parallelepipeds,
1989). A mathematical model will include the parameters that describe the links,
as well as information about the constraints associated with each joint. A model
that only includes this information is called a kinematic model and describes the
possible static states of a system. The state vector of a kinematic model consists
of the model state and the model parameters. A system in motion is modeled
when the dynamics of the system are modeled as well. A dynamic model
describes the state evolution of the system over time. In a dynamic model, the
state vector includes linear and angular velocities, as well as position (Wren &
Pentland, 1998).
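As a minimal illustration of this distinction (the field names below are invented for the example, not taken from any cited system), a kinematic state holds only the pose and the fixed model parameters, while a dynamic state adds positions and linear and angular velocities:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class KinematicState:
        # static description: joint angles plus fixed model parameters (e.g., limb lengths)
        joint_angles: List[float]
        limb_lengths: List[float]

    @dataclass
    class DynamicState(KinematicState):
        # a dynamic model additionally tracks how the configuration evolves over time
        joint_velocities: List[float] = field(default_factory=list)
        root_position: List[float] = field(default_factory=lambda: [0.0, 0.0, 0.0])
        root_linear_velocity: List[float] = field(default_factory=lambda: [0.0, 0.0, 0.0])
        root_angular_velocity: List[float] = field(default_factory=lambda: [0.0, 0.0, 0.0])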
After selecting an appropriate model for a particular application, it is necessary
to develop a concise mathematical formulation for a general solution to the
kinematics and dynamics problem, which are non-linear problems. Different
formalisms have been proposed in order to assign local reference frames to the
links. The simplest approach is to introduce joint hierarchies formed by indepen-
dent articulation of one DOF, described in terms of Euler angles. Hence, the body
posture is synthesized by concatenating the transformation matrices associated
with the joints, starting from the root. Despite the fact that this formalism suffers
from singularities, Delamarre & Faugeras (2001) propose the use of composi-
tions of translations and rotations defined by Euler angles. They solve the
singularity problems by reducing the number of DOFs of the articulation.
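The concatenation of joint transformations described above can be sketched as follows; the two-link chain, its link lengths and the single rotation axis per joint are hypothetical choices made only for the example:

    import numpy as np

    def rot_z(angle):
        """4x4 homogeneous rotation about the z axis (a one-DOF joint)."""
        c, s = np.cos(angle), np.sin(angle)
        return np.array([[c, -s, 0, 0],
                         [s,  c, 0, 0],
                         [0,  0, 1, 0],
                         [0,  0, 0, 1]])

    def trans(x, y, z):
        """4x4 homogeneous translation (the constant offset of a link)."""
        m = np.eye(4)
        m[:3, 3] = [x, y, z]
        return m

    def chain_poses(joint_angles, link_lengths):
        """Concatenate the per-joint transforms starting from the root; return each link end position."""
        pose = np.eye(4)
        ends = []
        for angle, length in zip(joint_angles, link_lengths):
            pose = pose @ rot_z(angle) @ trans(length, 0.0, 0.0)
            ends.append(pose[:3, 3].copy())
        return ends

    # Example: a planar two-link arm with both joints bent by 45 degrees.
    print(chain_poses([np.pi / 4, np.pi / 4], [0.30, 0.25]))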
The Web3D H-anim working group (H-anim) was formed so that developers
could agree on a standard naming convention for human body parts and joints.
The human form has been studied for centuries and most of the parts already
have medical (or Latin) names. This group has produced the Humanoid
Animation Specification (H-anim) standards, describing a standard way of
representing humanoids in VRML. These standards allow humanoids created
using authoring tools from one vendor to be animated using tools from another.
H-anim humanoids can be animated using keyframing, inverse kinematics,
performance animation systems and other techniques. The three main design
goals of H-anim standards are:
The MPEG-4 SNHC (Synthetic and Natural Hybrid Coding) group has standard-
ized two types of streams in order to animate avatars:
and the new 3D position of each point in the global seamless mesh is computed
as a weighted combination of the related bone motions.
The skinned model definition can also be enriched with inverse kinematics-
related data. Then, bone positions can be determined by specifying only the
position of an end effector, e.g., a 3D point on the skinned model surface. No
specific inverse kinematics solver is imposed, but specific constraints at bone
level are defined, e.g., related to the rotation or translation of a bone in a certain
direction. Also muscles, i.e., NURBS curves with an influence region on the
model skin, are supported. Finally, interpolation techniques, such as simple linear
interpolation or linear interpolation between two quaternions (Preda & Prêteux,
2001), can be exploited for key-value-based animation and animation compres-
sion.
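The quaternion interpolation mentioned above can be sketched as a normalized linear interpolation between two key orientations; this is an illustrative implementation, not the normative MPEG-4 procedure:

    import numpy as np

    def lerp_quaternion(q0, q1, t):
        """Normalized linear interpolation between two unit quaternions given as (w, x, y, z)."""
        q0, q1 = np.asarray(q0, float), np.asarray(q1, float)
        if np.dot(q0, q1) < 0.0:      # flip one key so the interpolation follows the shorter arc
            q1 = -q1
        q = (1.0 - t) * q0 + t * q1
        return q / np.linalg.norm(q)

    # Key-value animation example: halfway between identity and a 90-degree rotation about z.
    key0 = [1.0, 0.0, 0.0, 0.0]
    key1 = [np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)]
    print(lerp_quaternion(key0, key1, 0.5))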
history of motion (e.g., more recently moving pixels are brighter). MEI and MHI
temporal templates are then matched to stored instances of views of known
actions.
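A minimal sketch of the motion history image update implied here, where recently moving pixels receive the latest timestamp and older entries are forgotten; the threshold and history duration are arbitrary values chosen for the example:

    import numpy as np

    def update_mhi(mhi, frame, prev_frame, timestamp, duration=1.0, diff_threshold=30):
        """One MHI update: stamp pixels that moved, clear pixels older than `duration`."""
        motion_mask = np.abs(frame.astype(int) - prev_frame.astype(int)) > diff_threshold
        mhi = mhi.copy()
        mhi[motion_mask] = timestamp
        mhi[mhi < timestamp - duration] = 0.0      # forget motion outside the temporal window
        return mhi

    # Toy example on two 4x4 grayscale frames in which a single pixel changes.
    prev = np.zeros((4, 4), dtype=np.uint8)
    cur = prev.copy()
    cur[1, 1] = 200
    print(update_mhi(np.zeros((4, 4)), cur, prev, timestamp=1.0))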
A technique for human motion recognition in an unconstrained environment,
incorporating hypotheses which are probabilistically propagated across space
and time, is presented in Bregler (1997). EM clustering, recursive Kalman filtering and
Hidden Markov Models are used as well. The feasibility of this method is tested
on classifying human gait categories (running, walking and skipping). HMMs are
quite often used for classifying and recognizing human dynamics. In Pavlovic &
Rehg (2000), HMMs are compared with switching linear dynamic systems
(SLDS) towards human motion analysis. It is argued that the SLDS framework
demonstrates greater descriptive power and consistently outperforms standard
HMMs on classification and continuous state estimation tasks, although the
learning-inference mechanism is complicated.
Finally, a novel approach for the identification of human actions in an office
(entering the room, using a computer, picking up the phone, etc.) is presented in
Ayers & Shah (2001). The novelty of this approach consists in using prior
knowledge about the layout of the room. Action identification is modeled by a
state machine consisting of various states and the transitions between them. The
performance of this system is affected if the skin area of the face is occluded,
if two people get too close and if prior knowledge is not sufficient. This approach
may be applicable in surveillance systems like those described in the next
section.
Applications
3D HBMs have been used in a wide spectrum of applications. This section is only
focused on the following four major application areas: a) Virtual reality; b)
Surveillance systems; c) User interface; and d) Medical or anthropometric
applications. A brief summary is given below.
Virtual Reality
The efficient generation of 3D HBMs is one of the most important issues in all
virtual reality applications. Models with a high level of detail are capable of
conveying emotions through facial animation (Aubel, Boulic & Thalmann, 2000).
However, it is still nowadays very hard to strike the right compromise between
realism and animation speed. Balcisoy et al. (2000) present a combination of
Surveillance Systems
walking and running speeds and direction of motion. One of the constraints is that
the motion must be front-parallel. Gavrila & Philomin (1999) present a shape-
based object detection system, which can also be included into the surveillance
category. The system detects and distinguishes, in real-time, pedestrians from a
moving vehicle. It is based on a template-matching approach. Some of the
system’s limitations are related to the segmentation algorithm or the position of
pedestrians (the system cannot work with pedestrians very close to the camera).
Recently Yoo, Nixon & Harris (2002) have presented a new method for
extracting human gait signatures by studying kinematics features. Kinematics
features include linear and angular position of body articulations, as well as their
displacements and time derivatives (linear and angular velocities and accelera-
tions). One of the most distinctive characteristics of human gait is that it is
individualistic. It can be used in vision-based surveillance systems, allowing the
identification of a human by means of his or her gait motion.
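The kinematics features listed above (positions, displacements and their time derivatives) can be assembled, for instance, by finite differencing joint-angle trajectories; the frame rate and the toy trajectory below are invented for the example and are not the features used by Yoo, Nixon & Harris:

    import numpy as np

    def kinematic_features(joint_angles, fps=25.0):
        """Stack angles, angular velocities and accelerations from a (frames x joints) trajectory."""
        dt = 1.0 / fps
        velocities = np.gradient(joint_angles, dt, axis=0)
        accelerations = np.gradient(velocities, dt, axis=0)
        return np.hstack([joint_angles, velocities, accelerations])

    # Toy gait trajectory: hip and knee angles (radians) over five frames.
    angles = np.array([[0.00, 0.10],
                       [0.05, 0.20],
                       [0.10, 0.35],
                       [0.15, 0.45],
                       [0.20, 0.50]])
    print(kinematic_features(angles).shape)   # -> (5, 6)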
User Interface
Conclusions
Human body modeling is a relatively recent research area with a higher
complexity than the classical rigid object modeling. It takes advantage of most
of the techniques proposed within the rigid object modeling community, together
with a prior-knowledge of human body movements based on a kinematics and
dynamics study of the human body structure. The huge amount of articles
published during the last years involving 3D human body modeling demonstrates
the increasing interest in this topic and its wide range of applications. In spite of
this, many issues are still open (e.g., unconstrained image segmentation, limita-
tions in tracking, development of models including prior knowledge, modeling of
multiple person environments, real-time performance). Each one of these topics
represents a stand-alone problem and their solutions are of interest not only to
human body modeling research, but also to other research fields.
Unconstrained image segmentation remains a challenge to be overcome. An-
other limitation of today’s systems is that commonly the motion of a person is
constrained to simple movements with a few occlusions. Occlusions, which
comprise a significant problem yet to be thoroughly solved, may lead to erroneous
tracking. Since errors can occur and accumulate, the systems must become
robust enough to recover from any loss of tracking. Similarly,
techniques must be able to automatically self-tune the model's shape parameters.
References
Aggarwal, J. K. & Cai, Q. (1999). Human motion analysis: A review. Computer
Vision and Image Understanding, 73(3), 428-440.
Ali, A. & Aggarwal, J.K. (2001). Segmentation and recognition of continuous
human activity. IEEE Workshop on Detection and Recognition of
Events in Video. Vancouver, Canada.
Aubel, A., Boulic, R. & Thalmann D. (2000). Real-time display of virtual
humans: Levels of details and impostors. IEEE Trans. on Circuits and
Systems for Video Technology, Special Issue on 3D Video Technology,
10(2), 207-217.
Ayers, D. & Shah, M. (2001). Monitoring human behavior from video taken in
an office environment. Image and Vision Computing, 19(12), 833-846.
Balcisoy, S., Torre, R., Ponder, M., Fua, P. & Thalmann, D. (2000). Augmented
reality for real and virtual humans. Symposium on Virtual Reality Soft-
ware Technology. Geneva, Switzerland.
Iwasawa, S., Ohya, J., Takahashi, K., Sakaguchi, T., Ebihara, K. & Morishima,
S. (2000). Human body postures from trinocular camera images. 4th IEEE
International Conference on Automatic Face and Gesture Recogni-
tion. Grenoble, France.
Kanade, T., Rander, P. & Narayanan, P. (1997). Virtualized reality: Construct-
ing virtual worlds from real scenes. IEEE Multimedia.
Marzani, F., Calais, E. & Legrand, L. (2001). A 3-D marker-free system for the
analysis of movement disabilities-an application to the Legs. IEEE Trans.
on Information Technology in Biomedicine, 5(1), 18-26.
Marzani, F., Maliet, Y., Legrand, L. & Dusserre, L. (1997). A computer model
based on superquadrics for the analysis of movement disabilities. 19th
International Conference of the IEEE, Engineering in Medicine and
Biology Society. Chicago, IL.
Moeslund, T. B. & Granum, E. (2001). A survey of computer vision-based
human motion capture. Computer Vision and Image Understanding,
81(3), 231-268.
Nurre, J., Connor, J., Lewark, E. & Collier, J. (2000). On segmenting the three-
dimensional scan data of a human body. IEEE Trans. on Medical Imaging,
19(8), 787-797.
Okada, R., Shirai, Y. & Miura, J. (2000). Tracking a person with 3D motion by
integrating optical flow and depth. 4th IEEE International Conference on
Automatic Face and Gesture Recognition. Grenoble, France.
Park, S. & Aggarwal, J. K. (2000). Recognition of human interaction using
multiple features in grayscale images. 15th International Conference on
Pattern Recognition. Barcelona, Spain.
Paul, R. (1981). Robot manipulators: mathematics, programming and con-
trol. Cambridge, MA: MIT Press.
Pavlovic, V. & Rehg, J. M. (2000). Impact of dynamic model learning on
classification of human motion. IEEE International Conference on
Computer Vision and Pattern Recognition. Hilton Head Island, SC.
Plänkers, R. & Fua, P. (2001). Articulated soft objects for video-based body
modelling. IEEE International Conference on Computer Vision.
Vancouver, Canada.
Preda, M. (Ed.). (2002). MPEG-4 Animation Framework eXtension (AFX) VM
9.0.
Preda, M. & Prêteux, F. (2001). Advanced virtual humanoid animation frame-
work based on the MPEG-4 SNHC Standard. Euroimage ICAV 3D 2001
Conference. Mykonos, Greece.
Rosales, R., Siddiqui, M., Alon, J. & Sclaroff, S. (2001). Estimating 3D Body
Pose using Uncalibrated Cameras. IEEE International Conference on
Computer Vision and Pattern Recognition. Kauai Marriott, Hawaii.
Sato, K. & Aggarwal, J. K. (2001). Tracking and recognizing two-person
interactions in outdoor image sequences. IEEE Workshop on Multi-
Object Tracking. Vancouver, Canada.
Sidenbladh, H., Black, M. J. & Sigal, L. (2002). Implicit probabilistic models of
human motion for synthesis and tracking. European Conf. on Computer
Vision. Copenhagen, Denmark.
Sminchisescu, C. & Triggs, B. (2001). Covariance scaled sampling for monocu-
lar 3D body tracking. IEEE International Conference on Computer
Vision and Pattern Recognition. Kauai Marriott, Hawaii.
Solina, F. & Bajcsy, R. (1990). Recovery of parametric models from range
images: the case for superquadrics with global deformations. IEEE Trans.
on Pattern Analysis and Machine Intelligence, 12(2), 131-147.
Tognola, G., Parazini, M., Ravazzani, P., Grandori, F. & Svelto, C. (2002).
Simple 3D laser scanner for anatomical parts and image reconstruction
from unorganized range data. IEEE International Conference on Instru-
mentation and Measurement Technology. Anchorage, AK.
Utsumi, A., Mori, H., Ohya, J. & Yachida, M. (1998). Multiple-view-based
tracking of multiple humans. 14th International Conference on Pattern
Recognition. Brisbane, Qld., Australia.
Wachter, S. & Nagel, H. (1999). Tracking persons in monocular image se-
quences. Computer Vision and Image Understanding, 74(3), 174-192.
Weng, N., Yang, Y. & Pierson, R. (1996). 3D surface reconstruction using
optical flow for medical imaging. IEEE Nuclear Science Symposium.
Anaheim, CA.
Werghi, N. & Xiao, Y. (2002). Wavelet moments for recognizing human body
posture from 3D scans. Int. Conf. on Pattern Recognition. Quebec City,
Canada.
Wingbermuehle, J., Weik, S., & Kopernik, A. (1997). Highly realistic modeling
of persons for 3D videoconferencing systems. IEEE Workshop on Mul-
timedia Signal Processing. Princeton, NJ, USA.
Wren, C. & Pentland, A. (1998). Dynamic models of human motion. IEEE
International Conference on Automatic Face and Gesture Recogni-
tion. Nara, Japan.
Wren, C., Azarbayejani, A., Darrell, T. & Pentland, A. (1997). Pfinder: real-time
tracking of the human body. IEEE Trans. on Pattern Analysis and
Machine Intelligence, 19(7), 780-785.
Yamamoto, M., Sato, A., Kawada, S., Kondo, T. & Osaki, Y. (1998). Incremen-
tal tracking of human actions from multiple views. IEEE International
Conference on Computer Vision and Pattern Recognition, Santa Bar-
bara, CA.
Yoo, J., Nixon, M. & Harris, C. (2002). Extracting human gait signatures by body
segment properties. 5th IEEE Southwest Symposium on Image Analysis
and Interpretation, Santa Fe, NM.
Chapter II
Virtual Character
Definition and
Animation within the
MPEG-4 Standard
Marius Preda
GET/Institut National des Télécommunications, France
Françoise Preteux
GET/Institut National des Télécommunications, France
Gauthier Lafruit
MICS-DESICS/Interuniversity MicroElectronics Center (IMEC), Belgium
Abstract
Besides being one of the well-known audio/video coding techniques,
MPEG-4 provides additional coding tools dedicated to virtual character
animation. The motivation of considering virtual character definition and
animation issues within MPEG-4 is first presented. Then, it is shown how
MPEG-4, Amendment 1 offers an appropriate framework for virtual human
described. The next section describes in detail the first avatar animation
framework, adopted in MPEG-4 in 1998, i.e., the FBA framework. Here, the
avatar body is structured as a collection of segments individually specified by
using IndexedFaceSet. The avatar face is a unique object animated by deforma-
tion controlled by standardized feature points. The section, Virtual Characters
in MPEG-4 Part 16, introduces a generic deformation model, recently adopted
by MPEG-4 (December, 2002), called BBA. It is shown how this model is
implemented through two deformation controllers: bones and muscles. The
generality of the model allows it to directly animate the seamless object mesh or
the space around it. Moreover, hierarchical animation is possible when consid-
ering the BBA technique and specific geometry representations, such as
Subdivision Surfaces or MeshGrid. This advanced animation is presented in the
section, Hierarchic Animation: Subdivision Surface and MeshGrid.
A key concept in the MPEG-4 standard is the definition of the scene, where text,
2D and 3D graphics, audio and video data can (co)exist and (inter)act. A scene
is represented as a tree, where each object in the scene is the instantiation of a
node or a set of nodes. The compressed representation of the scene is done through
the BInary Format for Scene (BIFS) specification (ISOIEC, 2001). Special
transformations and grouping capabilities of the scene make it possible to cope
with spatial and temporal relationships between objects.
The first version of the standard addresses the animation issue of a virtual human
face, while Amendment 1 contains specifications related to virtual human body
animation. In order to define and animate a human-like virtual character, MPEG-4
introduces the so-called FBA Object. Conceptually, the FBA object consists of
two collections of nodes in a scene graph grouped under the so-called Face node
and Body node (Figure 1), and a dedicated compressed stream. The next
paragraph describes how these node hierarchies include the definition of the
geometry, the texture, the animation parameters and the deformation behaviour.
The structure of the Face node (Figure 1a) allows the geometric representation
of the head as a collection of meshes, where the face consists of a unique mesh
(Figure 2a). The shape and the appearance of the face is controlled by the FDP
(Facial Definition Parameter) node through the faceSceneGraph node for the
geometry, and the textureCoord and useOrthoTexture fields for the texture.
Moreover, a standardized number of control points are attached to the face mesh
through the featurePointsCoord field as shown in Figure 3. These points control
the face deformation. The deformation model is enriched by attaching
parameterisation of the deformation function within the neighbourhood of the
control points through the faceDefTables node.
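To make the nesting of the fields just named easier to follow, here is a purely illustrative data-structure sketch; it reuses the field names mentioned in the text but is not the normative BIFS node syntax:

    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    @dataclass
    class FDP:
        """Facial Definition Parameters: shape, texture and control points of the face."""
        faceSceneGraph: Optional[object] = None       # geometry of the head as a collection of meshes
        textureCoord: List[Tuple[float, float]] = field(default_factory=list)
        useOrthoTexture: bool = False
        featurePointsCoord: List[Tuple[float, float, float]] = field(default_factory=list)
        faceDefTables: Optional[object] = None        # deformation behaviour around the control points

    @dataclass
    class Face:
        """Face object: its definition (FDP) and the animation parameters updated by the FBA decoder."""
        fdp: FDP
        fap: List[float] = field(default_factory=list)   # Face Animation Parameter values for one frame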
Figure 2. The face as a unique mesh (a) and the head as a collection of
meshes (b), (c) and (d).
The face expressions and animation are controlled by the FAP (Face Animation
Parameter) node, which is temporal and updated by the FBA decoder. Anima-
tions can be performed at a high level, using a standardized number of
expressions and visemes, as well as at a low level by directly controlling the
feature points. In this case, a standardized number of key points (84), corre-
sponding to the human features (e.g., middle point of upper lip) is defined on the
face surface (Figure 3a and b). The complete animation is then performed by
deforming the mesh in the vicinity of the key points (Doenges, 1997; Escher,
1998; Lavagetto, 1999).
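Low-level FAP animation can be pictured as follows: each feature point receives a displacement derived from its FAP value and drags nearby mesh vertices with a weight that decays with distance. This is only a schematic stand-in for the faceDefTables mechanism; the radius and the linear fall-off are assumptions made for the example:

    import numpy as np

    def deform_around_feature_point(vertices, feature_point, displacement, radius=0.05):
        """Displace mesh vertices near one feature point, with a linear fall-off inside `radius`."""
        d = np.linalg.norm(vertices - feature_point, axis=1)
        weights = np.clip(1.0 - d / radius, 0.0, 1.0)     # 1 at the feature point, 0 outside the radius
        return vertices + weights[:, None] * displacement

    # Toy patch of three vertices; the feature point stands in for the middle of the upper lip.
    verts = np.array([[0.00, 0.0, 0.0], [0.01, 0.0, 0.0], [0.10, 0.0, 0.0]])
    lip_point = np.array([0.0, 0.0, 0.0])
    print(deform_around_feature_point(verts, lip_point, np.array([0.0, 0.005, 0.0])))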
A virtual body object is represented in the scene graph as a collection of nodes
grouped under the so-called Body node (Figure 1b).
The BDP (Body Definition Parameters) node controls the intrinsic properties of
each anatomical segment of the avatar body. It includes information related to
the avatar body representation as a static object composed by anatomical
segments (bodySceneGraph node), and deformation behaviour (bodyDefTables
and bodySegmentConnectionHint nodes) (ISOIEC, 2001). Body definition pa-
rameters are virtual character-specific. Hence the complete morphology of an
avatar can readily be altered by overriding the current BDP node. Geometrically,
the static definition of the virtual body object is a hierarchical graph consisting
of nodes associated with anatomical segments and edges. This representation
could be compressed using the MPEG-4 3D Mesh coding (3DMC) algorithm
(Taubin, 1998a) defining subpart relationships, grouped under the bodySceneGraph
node. The MPEG-4 virtual avatar is defined as a segmented virtual character,
using the H-Anim V2.0 nodes and hierarchy: Humanoid, Joint, Segment, and Site
nodes.
The BAP (Body Animation Parameters) node contains angular values and
defines the animation parameters as extrinsic properties of an anatomical
segment, i.e., its 3D pose with respect to a reference frame attached to the
parent segment. The orientation of any anatomical segment is expressed as the
Since the methods for obtaining the content are various and evolve quickly over
time, the FBA specifications do not mandate any specific method for obtaining
a real description of a 3D human body and its associated animation parameters.
Defining only the representation format, the specifications allow a free develop-
ment of content creation. To address the avatar modeling issue, we developed
the Virtual Human Modeling (VHM) authoring tool to help a designer to obtain
— from a scanned geometric model of a human-like avatar — an articulated
version, compliant with the FBA specifications. The authoring tool is made up of
three parts:
Figure 6. The 3AI: (a) BAP editing using a dedicated user interface that allows
the value of each BAP to be tuned; (b) interactive tracking of gestures
in image sequences.
The FBA specifications provide, for both face and body animation parameters,
two encoding methods (predictive and DCT-based).
In the first method (Figure 7a), FAPs/BAPs are coded with a predictive coding
scheme. For each parameter to be coded in frame n, the decoded value of this
Table 1. Bit-rates [kbps] for the DCT-based coding scheme. Q denotes the
global quantization value.
Sign              Q=1     Q=2     Q=4     Q=8     Q=12    Q=12    Q=24    Q=31
Bitrate [kbps]    12.08   10.41   8.5     7.29    6.25    5.41    3.95    3.33
We have tested the FBA compression techniques on a data set representing real
sign language content. In this sequence, both arms and the body of the avatar are
animated. The frame-rate for this content is 20 fps. When dealing with a wide
range of target bit rates, the DCT-based method has to be used. Since the DCT-
based method uses a 16-frame temporal buffer, an animation delay occurs. For
sign language applications, this delay introduces a slight desynchronisation, but
does not affect the message comprehension. In the case of applications that
require near loss-less compression and exact synchronization of the animation
with another media, the use of the frame predictive-based method is recom-
mended. In order to increase the efficiency of the arithmetic encoder, the
MPEG-4 FBA specifications standardize a set of ranges for each animation
parameter. The global quantization step is used here for scaling the value to be
encoded in the corresponding range. Each animation parameter is encoded with
the same number of bits inside this range. If the obtained scaled value is outside
of the range, a higher quantization step has to be used.
In our tests related to sign language, when using the frame predictive-based
method, a quantization value bigger than four has to be used and the obtained bit-
rate is close to 6.2 kbps. The compression results for the DCT-based method are
presented in Table 1.
The low bit-rate, less than 15 kbps, obtained by compressing the animation
parameters, while keeping visual degradation at a satisfactory level, allows
animation transmission in a low bit-rate network.
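The range-and-quantization logic described above can be sketched as follows; the parameter range in the example is invented, the normative ranges being those tabulated in the FBA specification:

    def scale_parameter(value, param_range, quant_step):
        """Scale an animation parameter by the global quantization step into its standardized range.

        Returns the integer to be entropy-coded, or None when a higher quantization step is needed.
        """
        lo, hi = param_range
        scaled = round(value / quant_step)
        if scaled < lo or scaled > hi:
            return None                 # outside the range: the encoder must retry with a larger step
        return scaled                   # every parameter is coded with the same number of bits in its range

    # Hypothetical BAP with a made-up range of [-1500, 1500], at two quantization steps.
    print(scale_parameter(9000, (-1500, 1500), quant_step=4))   # -> None (out of range)
    print(scale_parameter(9000, (-1500, 1500), quant_step=8))   # -> 1125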
Local Deformations
The segmented nature of an FBA compliant avatar has the main disadvantage
that during the animation seams will occur at the joints between the segments.
To overcome this limitation, a special tool based on the so-called Body Defor-
mation Tables (BDTs) has been introduced in MPEG-4. The principle consists
in adding small displacements for the vertices near the joint. Thereby, during the
animation the borders of two segments remain connected. BDTs specify a list
of vertices of the 3D model, as well as their local displacements as functions of
BAPs (ISOIEC, 2001). An example of BDTs’ use is described in Preda (2002).
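A schematic reading of the BDT mechanism, in which the displacement of one vertex near a joint is looked up (and here linearly interpolated) as a function of the driving BAP value; the table contents and the interpolation are assumptions for illustration, not the normative semantics:

    import numpy as np

    def bdt_displacement(bap_value, bap_keys, displacements):
        """Interpolate a per-vertex displacement from a Body Deformation Table.

        bap_keys: sorted BAP key values; displacements: matching (len(keys) x 3) local displacements.
        """
        keys = np.asarray(bap_keys, float)
        disps = np.asarray(displacements, float)
        return np.array([np.interp(bap_value, keys, disps[:, axis]) for axis in range(3)])

    # Hypothetical table for one vertex near the elbow: its displacement grows with flexion.
    keys = [0, 500, 1000]
    disps = [[0.000, 0.000, 0.0],
             [0.002, 0.001, 0.0],
             [0.005, 0.002, 0.0]]
    print(bdt_displacement(750, keys, disps))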
A key issue pointed out in the previous section refers to realistic animation that
FBA tools cannot efficiently achieve. One of the main reasons for this comes
from considering the avatar as a segmented mesh and performing the animation
∀ v ∈ Ω_i:   ϕ_i(v) = µ_i(v) · Σ_{ξ_k ∈ ψ_i(v)} ω_k [T_i(ξ_k) − ξ_k]                    (1)
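The pages defining the symbols of Equation (1) did not survive extraction, so the following is only one plausible reading, stated as an assumption: the displacement of a vertex v in the influence region Ω_i is the affectedness µ_i(v) times a weighted sum, over support points ξ_k associated with v, of the displacement each point undergoes under the bone transform T_i.

    import numpy as np

    def vertex_displacement(mu_v, weights, support_points, bone_transform):
        """phi_i(v) = mu_i(v) * sum_k w_k * (T_i(xi_k) - xi_k), for one bone i and one vertex v."""
        pts = np.asarray(support_points, float)                 # the points xi_k, shape (k, 3)
        hom = np.hstack([pts, np.ones((len(pts), 1))])          # homogeneous coordinates
        moved = (bone_transform @ hom.T).T[:, :3]               # T_i(xi_k)
        w = np.asarray(weights, float)[:, None]
        return mu_v * np.sum(w * (moved - pts), axis=0)

    # Toy case: one fully weighted support point, a small translation as the bone transform.
    T = np.eye(4)
    T[:3, 3] = [0.01, 0.0, 0.0]
    print(vertex_displacement(0.8, [1.0], [[0.1, 0.2, 0.3]], T))   # ~[0.008, 0, 0]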
Figure 10. (a) Forearm bone influence volume. (b) The muscle influence volume.

The bone influence volume is defined as the support of the affectedness measure
µ. Here µ is expressed as a family of functions (µ_d)_{d∈[0,l]} ∪ {µ_{0−}} ∪ {µ_{l+}}. µ_d is
defined on the perpendicular plane located at distance d from the bone origin.
The support of µ_d is partitioned into three specific zones (Z_{d,in}, Z_{d,mid} and Z_{d,ext}) by
two concentric circles characterised by their respective radii r_d and R_d (Figure
10). µ_d is then defined as follows:
µ_d(x) = 1                                      if x ∈ Z_{d,in}
µ_d(x) = f( δ(x, Z_{d,ext}) / (R_d − r_d) )      if x ∈ Z_{d,mid}                        (3)
µ_d(x) = 0                                      if x ∈ Z_{d,ext}
where δ(x, Z_{d,ext}) denotes the Euclidean distance from x to Z_{d,ext} and f(·) is a
user-specified fall-off function to be chosen among x³, x², x, sin(πx/2), x^{1/2} and
x^{1/3}. This set of functions allows a large choice for designing the influence
volume and ensures the generality of the model.
The affectedness measure µ_{0−} (respectively µ_{l+}) is defined in the same manner,
but using two half-spheres of radii r_0 and R_0 (respectively r_l and R_l), as
illustrated in Figure 10a.
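Equation (3) can be transcribed directly for one cross-section of the influence volume; the x² fall-off used below is just one of the admissible functions listed above:

    import numpy as np

    def mu_d(x, bone_axis_point, r_d, R_d, falloff=lambda t: t ** 2):
        """Affectedness of a point x lying in the plane at distance d along the bone (Equation (3))."""
        dist = np.linalg.norm(np.asarray(x, float) - np.asarray(bone_axis_point, float))
        if dist <= r_d:          # inner zone Z_{d,in}: fully affected
            return 1.0
        if dist >= R_d:          # outer zone Z_{d,ext}: not affected
            return 0.0
        # middle zone Z_{d,mid}: fall-off of the distance to the outer zone, normalized by R_d - r_d
        return falloff((R_d - dist) / (R_d - r_d))

    print(mu_d([0.03, 0.0], [0.0, 0.0], r_d=0.02, R_d=0.05))   # a value strictly between 0 and 1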
Once the bone influence volume is defined, animating the virtual character
consists of deforming its mesh by translating its vertices according to the bone
transformation.
Here only affine transformations are applied to the bone controller. In virtual
character animation, the most widely used geometric transformation consists in
changing the orientation of the bone with respect to its parent in the skeleton
hierarchy. Thus, the bone can be rotated with respect to an arbitrary axis.
However, when special effects are needed, the bone can also be translated. For
instance, in cartoon-like animations, thinning and thickening the skin envelope are
frequently used. For such effects, the bone transformation must contain a scale
component specified with respect to a pre-defined direction.
The general form of the geometric transformation of a bone b is expressed as
a 4 × 4 element matrix T obtained as follows:
where TR_wb, R_wb and S_wb give the bone translation, rotation and scale, respectively,
expressed in the world coordinate system.
where the matrix C_b maps the bone local coordinate system to the world
coordinate system.
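The two formulas that accompanied this passage were lost in extraction; under the assumption of a standard translation-rotation-scale composition and a change of basis through C_b, they would take a form such as

    T_w = TR_wb · R_wb · S_wb        (world-expressed bone motion, assumed composition order)
    T   = C_b⁻¹ · T_w · C_b          (the same motion expressed via the bone local frame)

Neither line should be read as the normative MPEG-4 equation; they are only a sketch consistent with the surrounding description.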
Once the bone geometric transformation is computed, it is possible to compute
the new position of the skin vertices in the bone influence volume according to
Equation (2).
The SMS framework does not limit the number of bones used for animating a
virtual character. Consequently, local deformation effects can be obtained by
creating and animating additional bones. While such an approach can be used
with some success for simulating limited local deformations, its efficiency may
decrease strongly when considering realistic animations. To address this kind of
animation, the second controller, ensuring muscle-like deformation, was intro-
duced.
The muscle influence volume is constructed as a tubular surface generated by
a circle of radius r moving along the NURBS curve (Figure 10b). The
affectedness function is then defined as follows:
µ(v_i) = 0                                        if δ(v_i, ψ(v_i)) > r
µ(v_i) = f( (r − δ(v_i, ψ(v_i))) / r )             if δ(v_i, ψ(v_i)) ≤ r                  (6)

where δ denotes the Euclidean distance, f(·) is to be chosen among the functions
x³, x², x, sin(πx/2), x^{1/2} and x^{1/3}, and ψ is the function assigning to v_i its
corresponding point on the muscle curve.
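Equation (6) can be transcribed with the muscle curve approximated by sampled points, so that ψ(v_i) reduces to a nearest-point search; the sampling is an implementation shortcut introduced here, not part of the definition:

    import numpy as np

    def muscle_affectedness(v, curve_samples, r, falloff=lambda t: t ** 2):
        """mu(v_i) following Equation (6): fall-off of the distance from v_i to its closest curve point."""
        v = np.asarray(v, float)
        samples = np.asarray(curve_samples, float)          # points sampled along the NURBS muscle curve
        dist = np.min(np.linalg.norm(samples - v, axis=1))  # delta(v_i, psi(v_i))
        if dist > r:
            return 0.0
        return falloff((r - dist) / r)

    # Toy muscle curve running along the x axis, with a 2 cm influence radius.
    curve = [[x, 0.0, 0.0] for x in np.linspace(0.0, 0.1, 11)]
    print(muscle_affectedness([0.05, 0.01, 0.0], curve, r=0.02))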
A muscle is designed as a curve, together with an influence volume on the virtual
character’s skin. In order to build a flexible and compact representation of the
muscle shape, a NURBS-based modeling is used. Animating the muscle consists
here of updating the NURBS parameters. The main advantage of NURBS-
based modeling is the accurate representation of complex shapes. The control
One of the main purposes of the SMS framework is to allow the definition and
the animation of a virtual character within a hybrid 3D scene. In this context, a
scene graph architecture to describe the SMS elements is proposed. This
architecture is built according to the VRML and MPEG-4 scene graph definition
rules. The structure of the proposed architecture, therefore, relies on the
definition of scene graph nodes.
At the root of the SMS-related node hierarchy, a SBVCAnimation node is
defined. The main purpose of this node is to group together a subset of virtual
characters of the scene graph and to attach to this group an animation resource
(textual or binary). An SMS virtual character is defined as a SBSkinnedModel
node and it is related to a collection of bones, each one defined as a SBBone node,
together with a collection of muscles, defined as SBMuscle nodes. An optimal
modeling issue is addressed by defining the SBSegment node. In addition, the
SBSite node allows defining semantic regions in the space of the virtual character.
The extensive description of the node interfaces is outside the scope of this
chapter and can be found in the MPEG-4 standard Part 16 published by ISO
(ISOIEC, 2003). Nevertheless, we briefly present the SBBone, SBMuscle and
SBSkinnedModel in order to illustrate the concepts discussed above.
The fields of the SBBone node are illustrated in Figure 11a.
The SBBone node specifies four types of information, namely: semantic data,
bone-skin influence volume, bone geometric transformation, and bone IK
constraints. Each of these components is further detailed in the following.
The SBBone node is used as a building block in order to describe the hierarchy
of the articulated virtual character by attaching one or more child objects. The
children field has the same semantics as in MPEG-4 BIFS. During the animation,
each bone can be addressed by using its identifier, boneID. This field is also
present in the animation resource (textual or binary). If two bones share the same
Figure 11. The fields of the SBBone (a) and SBMuscle (b) node.
in the node data, to the Euler angles-based representation, used in the animation
resource (file or stream), and vice versa (Shoemake, 1994).
The Inverse Kinematics information related to a bone refers to the position of the
bone in a kinematics chain and to the definition of possible movement constraints
of this bone.
The fields of the SBMuscle node are illustrated in Figure 11b. The SBMuscle
node enables us to add information relative to local deformation for simulating
muscle activity at the skin level. Mainly two kinds of information are represented
by this node: the influence volume and the curve form. The muscle influence
volume described above is supported as follows. The specification of the affected vertices
and of the measure of affectedness is performed by instantiating the vertices list
(skinCoordIndex) and the affectedness list (skinCoordWeight). The influence
volume is computed by using the radius and falloff fields. The radius field
specifies the maximum distance for which the muscle will affect the skin. The
falloff field specifies the choice of the measure of affectedness function as
follows: −1 for x³, 0 for x², 1 for x, 2 for sin(πx/2), 3 for x^{1/2} and 4 for x^{1/3}.
The animation of the muscle curve is based on the NurbsCurve structure as defined
in Grahn (2001) and uses the following fields: controlPoint (containing the
coordinates and the weights) and knot. The main fields of the SBSkinnedModel
node are illustrated in Figure 12.
The SBSkinnedModel node is the root used to define one SMS virtual character
and it contains the definition parameters of the entire seamless model or of a
seamless part of the model. Mainly, this node contains the model geometry and
the skeleton hierarchy. The geometry is specified by skinCoord field — (a list
containing the 3D coordinates of all the vertices of the seamless model) and the
skin field — (a collection of shapes which share the same skinCoord). This
mechanism allows us to consider the model as a continuous mesh and, at the
same time, to attach different attributes (e.g., colour, texture, etc.) to different
parts of the model. The skeleton field contains the root of the bone hierarchy. All
the bones and muscles belonging to the skinned model are contained in dedicated
lists.
Once the skinned model is defined in a static position, the animation is obtained
by updating, at time samples, the geometric transformation of the bones and
muscles. In order to ensure a compact representation of these parameters, the
MPEG-4 standard specifies a dedicated stream, the so-called BBA stream.
The key point for ensuring a compact representation of the SMS animation
parameters consists of decomposing the geometric transformations into elemen-
tary motions. Thus, when only using, for example, the rotation component of the
bone geometric transformation, a binary mask indicates that the other compo-
nents are not involved. In order to deform a muscle only by translating a control
point, a binary mask has to specify that weight factors and basis functions are
not used. Since the animation system does not systematically use all of the
elements of the transformations associated with bones and muscles, this
approach produces a very compact representation of the animation stream.
Moreover, the compactness of the animation stream can still be improved when
dealing with rotations. During the animation, rotating a bone with respect to its parent is a commonly used operation. In the definition of the bone node, the
rotation is represented as a quaternion. However, many motion editing systems
use the rotation decomposition with respect to Euler angles. In practice, when fewer than three angles describe a joint transformation due to the nature of the joint, an Euler angle-based representation is more appropriate. Thus, to obtain a more compact animation stream, a rotation is represented, in the animation resource, as an Euler angle-based decomposition.
In Craig (1989), it is shown that there are 24 different ways to specify a rotation
by using a triplet of angles. By introducing a parameter characterizing the 24
possible combinations of the Euler’s angles, Shoemake (1994) demonstrates that
there is a one-to-one mapping between the quaternion (or rotation matrix)
representation and the pair given by the Euler’s angles and the introduced
parameter. In order to take this into account, a parameter called rotationOrder
has been introduced into the bone node.
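To illustrate why a rotationOrder-style parameter is needed to disambiguate an Euler-angle decomposition, the sketch below (our own illustration, not the normative MPEG-4 syntax) rebuilds a quaternion by composing elementary axis rotations in the order named by a hypothetical axis string such as "XYZ". The same three angles produce different rotations for different orders, which is exactly the ambiguity the rotationOrder parameter resolves.

```python
import math

def axis_quaternion(axis, angle):
    """Unit quaternion (w, x, y, z) for a rotation of `angle` radians about X, Y or Z."""
    half = 0.5 * angle
    s = math.sin(half)
    v = {"X": (s, 0.0, 0.0), "Y": (0.0, s, 0.0), "Z": (0.0, 0.0, s)}[axis]
    return (math.cos(half),) + v

def quat_multiply(q1, q2):
    """Hamilton product q1 * q2."""
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return (w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2)

def euler_to_quaternion(angles, order="XYZ"):
    """Compose three Euler angles (radians), taken in `order`, into one quaternion.

    `order` plays the role of the rotationOrder parameter discussed above:
    the multiplication order of the elementary rotations depends on it.
    """
    q = (1.0, 0.0, 0.0, 0.0)
    for axis, angle in zip(order, angles):
        q = quat_multiply(q, axis_quaternion(axis, angle))
    return q
```

With this convention, euler_to_quaternion((a, b, c), "XYZ") and euler_to_quaternion((a, b, c), "ZYX") generally differ, so the angles alone are not enough to reconstruct the quaternion stored in the node.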
For the rest of the bone transformation components (translation, scale, etc.), the
representation in the animation resource is identical to the representation in the
nodes.
The issue of temporal frame interpolation has been often addressed in the
computer animation literature (Foley, 1992; O’Rourke, 1998). From simple linear
interpolation, appropriate for translations, to more complex schemes based on
high-degree polynomials, or quaternions, which take orientation into account, a
large number of techniques are available. The advantages and the disadvantages
of each one are well known. Many of these techniques are supported by most
of the current animation software packages. Temporal frame interpolation is
intensively used to perform animation from a textual description or from interactive authoring. First, in order to reduce the size of the transmitted data, and second, to ease authoring, it is allowed to specify the animation parameters only for key-frames and not frame-by-frame. However, in order to ensure the
Animation frame
In an SMS textual or binary format, for each key frame, two types of information
are defined: a vector corresponding to the animation mask, called
animationMaskVector, which indicates the components of the geometrical
transformation to be updated in the current frame; and a vector corresponding
to the animation values called animationValueVector which specifies the new
values of the components to be updated.
Let us describe the content of each of these vectors. For the exact syntax, one can refer to ISOIEC (2003).
• animationMaskVector
In the animation mask of a key-frame, a positive integer KeyFrameIndex
indicates to the decoder the number of frames which have to be obtained
by temporal interpolation. If this number is zero, the decoder sends the
frame directly to the animation engine. Otherwise, the decoder computes
n intermediate frames (n=KeyFrameIndex) and sends them to the anima-
tion engine, together with the content of the received key-frame.
Some bones or muscles of the SMS virtual character may not be animated
in all frames. The boneIDs and muscleIDs of the updated bones and
muscles, respectively, are parts of the animationMaskVector. In addition,
animationMaskVector contains the animation mask of each bone,
boneAnimationMaskVector, and the animation mask of each muscle,
muscleAnimationMaskVector. These vectors are detailed below.
• animationValueVector
The animationValueVector contains the new values of each bone and
muscle geometric transformation that have to be transmitted and it is
obtained by concatenation of all the boneAnimationValueVector and
muscleAnimationValueVector fields.
For compression efficiency, SMS stream specifications limit the maximum
number of bones and muscle nodes to 1,024 each. These bone and muscle
nodes can belong to one or more skinned models and are grouped in a
SBVCAnimation node. Thus, the fields boneID and muscleID must be
unique in the scene graph and their values must lie in the interval [0, …, 1,023].
• boneAnimationMaskVector
To address high compression efficiency, a hierarchical representation of
the bone motion is used. At the first level, the bone motion is decomposed
into translation, rotation, scale, scale orientation and center transformation.
At the second level, all of these components that are set to 1 in the bone
mask, are individually decomposed in elementary motions (e.g., translation
along the X axis, rotation with respect to Y axis). This hierarchical
processing makes it possible to obtain short mask vectors. The size of the
boneAnimationMaskVector can vary from two bits (corresponding to a
single elementary motion) to 21 bits (all the components of the local
transformation of the bone change with respect to the previous key-frame).
• boneAnimationValueVector
The boneAnimationValueVector contains the values to be updated corresponding to all elementary motions with a mask value of 1. The order of the elements in the boneAnimationValueVector is obtained by analyzing the boneAnimationMaskVector (a minimal sketch of this pairing follows this list).
• muscleAnimationMaskVector
The muscle animation parameters in the SMS stream are coordinates of the
control points of the NURBS curve, weights of the control points and/or
knot values.
The number of control points and the number of elements of the knot
sequence are integers between 0 and 63 and they are encoded in the
muscleAnimationMaskVector, after the muscleID field. As in the case of
the bone, a hierarchical processing is used to represent the mask.
• muscleAnimationValueVector
The muscleAnimationValueVector contains the new values of the muscle
animation parameters. As in the case of a bone, this vector is ordered
according to the muscleAnimationMaskVector.
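As a rough illustration of the mask/value pairing described in the list above (and not the normative SMS bitstream syntax), the following Python sketch pairs a boneAnimationMaskVector-like bit mask with the corresponding ordered value list. The list of elementary motions and their ordering are our own assumptions.

```python
# Hypothetical elementary motions of a bone local transformation, in the order
# they would be scanned when reading a bone animation mask (illustrative only).
ELEMENTARY_MOTIONS = [
    "tx", "ty", "tz",        # translation
    "rx", "ry", "rz",        # rotation (Euler angles)
    "sx", "sy", "sz",        # scale
    # ... scale orientation and center components would follow in a real stream
]

def decode_bone_animation(mask_bits, values):
    """Pair each elementary motion whose mask bit is 1 with its new value.

    `mask_bits` mimics a boneAnimationMaskVector and `values` a
    boneAnimationValueVector: the values appear in the same order as the
    mask bits that are set.
    """
    update = {}
    value_iter = iter(values)
    for motion, bit in zip(ELEMENTARY_MOTIONS, mask_bits):
        if bit:
            update[motion] = next(value_iter)
    return update

# Example: only the translation along X and the rotation about Y change.
print(decode_bone_animation([1, 0, 0, 0, 1, 0, 0, 0, 0], [0.25, 1.57]))
# -> {'tx': 0.25, 'ry': 1.57}
```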
gives this freedom to the designer, thus making it possible to achieve muscle-like deformations on any part of the virtual character's skin in order to obtain a realistic animation.
Both frameworks address streaming animation and provide low-bit-rate com-
pression schemes. Both FBA and SMS allow the above-mentioned compression
methods, frame-based and DCT-based. Moreover, to improve the compression
ratio, SMS supports advanced animation techniques, such as temporal frame
interpolation and inverse kinematics. For both FBA and SMS, the bit-rate of the
compressed stream depends on the movement complexity (number of segments/
joints involved in motion) and generally lies in the range of 5-40 kbps, for a frame
rate of 25fps.
In the FBA framework, the animation stream contains information relative to the
animation of a single human virtual character, while, in the SMS framework, it
is possible to animate several characters by using a unique stream. Moreover,
the SMS supports the definition and animation of generic 2D/3D objects. This
property is very useful when dealing with a scene where a large number of
avatars or generic objects are present.
In SMS animation, more complex computations are required than in the case of
FBA animation. Thus, concerning the terminal capabilities, dedicated 3D hard-
ware or software optimization is well-suited for implementing SMS animation.
However, the SMS deformation mechanism is in line with the development of
graphics APIs and graphics hardware.
The deformation based on the bone and muscle controllers can be applied in combination with advanced geometry definition techniques, allowing hierarchical animation. The following section describes such representation techniques (Subdivision Surfaces and MESHGRID) and shows the use of BBA to control the animation of a synthetic object by affecting its surrounding space.
Subdivision Surfaces
Subdivision surfaces, originally introduced by Catmull and Clark (1978) and Doo
and Sabin (1978), have recently emerged as a useful tool for modeling free-form
surfaces. A number of other subdivision schemes have been devised over the
years, including Loop’s (1987), Dyn et al.’s (known as the “butterfly” scheme)
(1990) or Kobbelt’s (2000). Subdivision is a recursive refinement process that
splits the facets or vertices of a polygonal mesh (the initial “control hull”) to yield
a smooth limit surface. The refined mesh obtained after each subdivision step is
used as the control hull for the next step, and so all successive (and hierarchically
nested) meshes can be regarded as control hulls. The refinement of a mesh is
performed both on its topology, as the vertex connectivity is made richer and
richer, and on its geometry, as the new vertices are positioned in such a way that
the angles formed by the new facets are smaller than those formed by the old
facets. The interest in considering subdivision surfaces for animation purposes is related to the hierarchical structure: the animation parameters directly affect
only the base mesh vertex positions and, for higher resolutions, the vertices are
obtained through a subdivision process. Three subdivision surfaces schemes are
supported by the MPEG-4 standard: Catmull-Clark, Modified Loop and Wave-
let-based. For a detailed description of these methods and how they are
implemented in MPEG-4, the reader is referred to ISOIEC (2003).
MESHGRID
[Figure 13: (a) humanoid model; (b) connectivity-wireframe (CW); (c) reference grid; (d) hierarchical mesh; (e) hierarchical reference system with reference surface sets SU, SV, SW]
A hierarchical mesh, which can be used to render the object at the appropriate level of detail, can be obtained
from the connectivity-wireframe. The reference-grid is the result of a hierarchi-
cal reference system as shown in Figure 13e.
Starting from the humanoid model of Figure 13a, the following sections will
discuss the design particularities of the connectivity-wireframe, reference-grid,
and their relationship, such that the model can be animated using a hierarchical
skeleton-based approach.
The reference system consists of three sets of reference surfaces SU, SV, SW, as
labeled in Figure 13e. For a better understanding, the reference system has been
chosen uniformly distributed. Notice that, usually in a real case, as shown in
Figure 13c, the reference grid is non-uniformly distributed. The reference grid
is defined by the intersection points between the three sets of reference surfaces
SU, SV, SW, as given by Equation (7).
$$RG = \bigcap\Big(\bigcup_{U} S_U,\ \bigcup_{V} S_V,\ \bigcup_{W} S_W\Big) \qquad (7)$$

$$CW = \bigcap\Big(\bigcup_{U} C(S_U),\ \bigcup_{V} C(S_V),\ \bigcup_{W} C(S_W)\Big) \qquad (8)$$
The discrete position (u, v, w) of a reference grid point represents the indices of
the reference surfaces {S U, S V, SW} intersecting at that point, while the
coordinate (x, y, z) of a reference grid point is equal to the coordinate of the
computed intersection point.
There is a constraint imposed on the reference surfaces, however. They must
be chosen in such a way that the reference surfaces from one set do not intersect
each other, but intersect the reference surfaces from the other sets. To obtain
the connectivity-wireframe, the TRISCAN method performs the contouring of the
object in each of the reference surfaces SU, SV, SW. Any intersection between
two contours defines a vertex. The connectivity-wireframe consists of the set
of all vertices generated by the intersections between contours, and the
connectivity between these vertices. A mathematical definition of the connec-
tivity-wireframe is given by Equation (8).
In the general case, the connectivity-wireframe is heterogeneous and can be
seen as a net of polygonal shapes ranging from triangles to heptagons, where,
except for the triangles, the other polygons may not be planar. Therefore, to
triangulate the connectivity-wireframe in a consistent way, a set of connectivity
rules has been designed especially for that purpose, as explained in Salomie (2002a, 2002b).
As one might have noticed, there exists a relationship between the vertices and
the reference grid, since a vertex is the intersection point between two contours,
therefore, belonging to two reference surfaces from different sets. This relation-
ship can be followed in the 2D cross-section (see Figure 14), inside a reference
surface, intersecting the object. Any vertex (label 4), lying on a contour of the
object (label 5), is located on a reference grid line (label 1) in between two
reference grid points, one inside the object (label 3) and one outside the object
(label 2). Notice that a reference grid line is the intersection curve of two
[Figure 14: (a) vertex V attached to grid point G1, lying on the grid line between G1 and G2; (b) the updated position of V after a grid point moves]

$$\mathrm{offset} = \frac{\big\|\overrightarrow{G_1 V}\big\|}{\big\|\overrightarrow{G_1 G_2}\big\|}, \quad \text{with } \mathrm{offset} \in [0,1) \qquad (9)$$

$$\overrightarrow{G_1 V} = \overrightarrow{G_1 G_2}\cdot \mathrm{offset} \qquad (10)$$
reference surfaces from different sets. As illustrated in Figure 14a, each vertex
is attached to a grid position G 1, and the relative position of vertex V with respect
to G1 and G 2 is given by the scalar offset (see Equation (9)). When either G1 or
G2 moves during the animation (shown in Figure 14b), the coordinates (x, y, z) of
V can be updated as given by Equation (10).
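A minimal numerical sketch of the update in Equation (10), assuming G1, G2 and V are given as 3-D NumPy vectors (the function and variable names are ours):

```python
import numpy as np

def update_vertex(g1, g2, offset):
    """Recompute vertex V from its attached grid points (Equation (10)).

    V keeps the same relative position `offset` (in [0, 1)) on the grid-line
    segment G1-G2, so moving either grid point drags the vertex along.
    """
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    return g1 + offset * (g2 - g1)

# Example: the grid points move during the animation, the offset is unchanged.
print(update_vertex([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], 0.3))   # [0.3 0.  0. ]
print(update_vertex([0.0, 1.0, 0.0], [1.0, 1.0, 0.0], 0.3))   # [0.3 1.  0. ]
```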
A multi-resolution model can be designed by choosing a multi-resolution refer-
ence system, each resolution level having its corresponding reference grid. The
multi-resolution reference system has a hierarchical structure (see Figure 13e),
which allows obtaining from the last resolution level reference system any lower-
resolution-level reference system by removing the appropriate reference sur-
faces.
The connectivity-wireframe obtained from a model (Figure 13b), by scanning it
according to a hierarchical reference system (Figure 13e), has a hierarchical
structure, as well. This can be proven considering that all the vertices from any
lower-resolution-level Rl are preserved in the immediate higher-resolution-level
Rl+1, since the reference system of resolution level Rl is a sub-set of the reference
system of resolution level Rl+1. In addition, resolution level Rl+1 will insert new
vertices and, therefore, alter the connectivity between the vertices of resolution
level Rl. A hierarchical connectivity-wireframe can be decomposed into single-
resolution connectivity-wireframes, and each of them can be triangulated to
obtain the corresponding mesh, as shown in Figure 13d.
effects by modifying the value of the offset, see Equation (10); and (2) reshaping
of the regular reference grid. The latter form of animation can be done on a
hierarchical multi-resolution basis, and will be exploited for the bone-based
animation of the humanoid.
A particularity of the MESHGRID representation is that the hierarchical structure
of the reference grid allows the coordinates of the reference grid points of any
resolution-level Rl+1 to be recomputed whenever the coordinates of the reference
grid points of the lower resolution-level Rl are modified, for instance by the bone-
based animation script. For that purpose, “Dyn’s four-point scheme for curves”
interpolation (Dyn, 1987) is applied.
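For reference, one refinement step of Dyn's four-point interpolatory scheme on an open polyline of grid-line points can be sketched as below; the standard tension value w = 1/16 and the treatment of the end segments are our own assumptions.

```python
import numpy as np

def four_point_refine(points, w=1.0 / 16.0):
    """One step of Dyn's four-point interpolatory subdivision for curves.

    Existing points are kept; a new point is inserted between p[i] and p[i+1]
    as (1/2 + w)(p[i] + p[i+1]) - w(p[i-1] + p[i+2]).  End segments reuse the
    boundary point in place of the missing neighbour (a simplifying assumption).
    """
    p = np.asarray(points, dtype=float)
    refined = []
    n = len(p)
    for i in range(n - 1):
        refined.append(p[i])
        pm1 = p[max(i - 1, 0)]
        pp2 = p[min(i + 2, n - 1)]
        refined.append((0.5 + w) * (p[i] + p[i + 1]) - w * (pm1 + pp2))
    refined.append(p[-1])
    return np.array(refined)

# Example: refining a coarse grid line twice.
line = [[0, 0, 0], [1, 0, 0], [2, 1, 0], [3, 1, 0]]
print(four_point_refine(four_point_refine(line)).shape)   # (13, 3)
```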
not need a separate compression tool, in contrast to WSS for which the base mesh is coded separately, typically with low-compression tools like INDEXEDFACESET. Therefore, and together with its dedicated animation capabilities, MESHGRID is preferred over WSS for virtual character animation.
Figure 16. Snapshot of the humanoid knee during the animation. The
reference grid lines are displayed in black. The surface at the second
resolution level is displayed as a wireframe in (a) and Gouraud-shaded in
(b).
number of vertices contained in the surface mesh of the second resolution level
is already quite high (216 vertices), while the reference grid at the lowest level
only consists of 27 points at the height of the knee (three planes defined by nine
points each). Although the number of grid points will be higher when applying the long interpolation filter, the fact that the grid is defined on a regular space seriously simplifies the interactive selection of the grid points, since it is possible to determine the majority of these points automatically once a few key points have been chosen. Moreover, the same animation script can animate: (1) any resolution of the MESHGRID model, due to its hierarchical construction; and (2) any model with a reference grid that is defined in a compatible way with the reference model used in the animation script. Compatible MESHGRID models are characterized by: (1) the same number of reference surfaces defining the reference system; and (2) the same reference surfaces passing through the same anatomical positions in the different models. The drawback when animating an INDEXEDFACESET model is the need for a different script at each resolution level of each model.
In practice, for achieving efficient hierarchical animation with the BBA approach, an appropriate technique is to animate the reference grid points, i.e., the space around the model. Moreover, the BBA hierarchical animation technique based on the MESHGRID representation method offers serious advantages in terms of compactness, design simplicity and computational load, compared to a bone-based animation defined for a complex single-resolution model which is described as an INDEXEDFACESET.
The major advantage compared to animation techniques defined for other hierarchical models (e.g., Subdivision Surfaces) is that it is more intuitive to address the regularly defined grid points than the vertices and that it is possible, with the same script, to animate compatible MESHGRID models. The reference
grid points contained in the lowest resolution level represent only a small
percentage of the total number of points present in the final level and, in addition,
only a limited number of grid points are animated, i.e., only those related to
vertices. Hence, such a hierarchical approach is very advantageous, since the
animation can be designed in terms of a limited number of points and since the
number of computations needed to apply the BBA transformation to each of
these points will be reduced.
Conclusions
This chapter is devoted to the standardization of virtual character animation. In particular, the MPEG-4 Face and Body, as well as the Skeleton, Muscle, and Skin, animation frameworks have been presented.
Acknowledgments
This chapter has grown out of numerous discussions with members of the
MPEG-4 AFX standardization committee and fruitful suggestions proposed by
numerous colleagues, who have provided inspiration during the germination of
the chapter. In particular, Prof. Jan Cornelis, Dr. Rudi Deklerck and Dr.
Augustin Gavrilescu from the ETRO Department of the Vrije Universiteit
Brussel have contributed to many improvements in the Humanoid animation framework based on the MESHGRID representation. Some parts of the work, like the FBA and SMS, have matured thanks to assessments through projects, amongst which the IST European Project ViSiCAST has probably contributed the most
to the quality of the outcome.
References
Capin, T. K., & Thalmann, D. (1999). Controlling and Efficient coding of MPEG-
4 Compliant Avatars, Proceedings of IWSNHC3DI’99, Santorini, Greece.
Catmull, E. & Clark, J. (1978). Recursively generated B-spline surfaces on
arbitrary topological meshes. Computer-Aided Design, 10, 350-355.
Craig, J., Jr. (1989). Introduction to Robotics: Mechanics and Control, 2nd
edition. Reading, MA: Addison Wesley.
Doenges, P., Capin, T., Lavagetto, F., Ostermann, J., Pandzic, I. & Petajan, E.
(1997). MPEG-4: Audio/video and synthetic graphics/audio for real-time,
interactive media delivery. Image Communications Journal, 5(4), 433-
463.
Doo, D. & Sabin, M. (1978). Behaviour of recursive division surfaces near
extraordinary points. Computer-Aided Design, 10(6), 356-360.
Dyn, N., Levin, D. & Gregory, J. A. (1987). A four-point interpolatory
subdivision scheme for curve design. Computer-Aided Geometric De-
sign, 4, 257-268.
Dyn, N., Levin, D. & Gregory, J. A (1990). A Butterfly Subdivision Scheme for
Surface Interpolation with Tension Control. ACM Transactions on Graph-
ics, 9(2), 160-169.
Escher, M., Pandzic, I. & Magnenat-Thalmann, N. (1998). Facial Animation and
Deformation for MPEG-4, Proceedings of Computer Animation’98.
Foley, J. D., van Dam, A., Feiner, S. K., & Hughes, J. F. (1992). Computer
Graphics – Principles and Practice, 2nd edition. Reading, MA: Addison
Wesley.
Grahn, H., Volk, T. & Wolters, H. J. (2001). NURBS Extension for VRML97,
2/2001, © Bitmanagement Software. Retrieved from the World Wide Web
at: http://www.bitmanagement.de/developer/contact/nurbs/overview.html.
ISOIEC 14496-1:2001 (2001). Information technology. Coding of audio-visual
objects. Part 1: Systems, International Organization for Standardization,
Switzerland.
ISOIEC 14496-1:2003 (2003). Information technology. Coding of audio-visual
objects. Part 16: Animation Framework eXtension. International Organiza-
tion for Standardization, Switzerland.
Kim, M., Wood S. & Cheok, L. T. (2000). Extensible MPEG-4 Textual Format
(XMT), in Proceedings of the 2000 ACM workshops in Multimedia, 71-
74, Los Angeles, CA.
Taubin, G., Guéziec, A., Horn, W. & Lazarus, F. (1998b). Progressive forest
split compression. Proceedings of SIGGRAPH’98, 123-132.
Thalmann-Magnenat, N. & Thalmann, D. (2000). Virtual Reality Software and
Technology. Encyclopedia of Computer Science and Technology.
Marcel Dekker, 41.
Endnotes
1. Living actor technology, http://www.living-actor.com/.
2. Scotland government web page, http://www.scotland.gov.uk/pages/news/junior/introducing_seonaid.aspx.
3. Walt Disney Pictures & Pixar. Geri’s game, Toy Story (1995), A Bug’s Life (1998), Toy Story 2 (1999) and Monsters, Inc. (2001).
4. Vandrea news presenter, Channel 5, British Broadcasting Television.
5. Eve Solal, Attitude Studio, www.evesolal.com.
6. blaxxun Community, VRML - 3D - Avatars - Multi-User Interaction, http://www.blaxxun.com/vrml/home/ccpro.htm.
7. 3D Studio Max™ Discreet, http://www.discreet.com/index-nf.html.
8. Maya™ Alias/Wavefront, http://www.aliaswavefront.com/en/news/home.shtml.
9. The Virtual Reality Modeling Language, International Standard ISO/IEC 14772-1:1997, www.vrml.org.
10. H-Anim – Humanoid Animation Working Group, www.h-anim.org.
11. SNHC - Synthetic and Natural Hybrid Coding, www.sait.samsung.co.kr/snhc.
12. MPEG Page at NIST, mpeg.nist.gov.
13. Face2Face Inc. www.f2f-inc.com.
Chapter III

Camera Calibration
E. A. Hendriks
Delft University of Technology, The Netherlands
Aggelos K. Katsaggelos
Northwestern University, USA
Abstract
Coordinate Systems
$$\mathbf{x}^w = \mathbf{R}_o^w\,\mathbf{x}^o + \mathbf{t}_o^w, \qquad (1)$$
a rigid body transformation, in which only rotation and translation are permitted,
but scaling is not allowed (Euclidean geometry). This kind of transformation is
called Euclidean transformation.
A similar relation exists between $\mathbf{x}^w$ and the corresponding $\mathbf{x}^c$:

$$\mathbf{x}^w = \mathbf{R}_c^w\,\mathbf{x}^c + \mathbf{t}_c^w \quad \text{or} \quad \mathbf{x}^c = \left(\mathbf{R}_c^w\right)^{T}\left(\mathbf{x}^w - \mathbf{t}_c^w\right) \qquad (2)$$
The relation between the camera coordinates (in CCS) and the metric projection
coordinates (in PCS) is inferred from the principle of lens projection (modeled
as a pinhole, see Figure 1b). This perspective transformation is a kind of
projective mapping (ref. Figure 1a).
In Figure 1a, the optical center, denoted by O, is the center of the focus of
projection. The distance between the image plane and O is the focal length,
which is a camera constant and denoted by f. The line going through O that is
perpendicular to the image plane is called the optical axis. The intersection of
the optical axis and the image plane is denoted by o, and is termed the principal
point or image center. The plane going through O that is parallel to the image
plane is called the focal plane.
The perspective projection from the 3-D space (in CCS) onto the image plane
(in IMCS) through the PCS can be formulated as:
$$z^c \begin{bmatrix} x^{im} \\ y^{im} \\ 1 \end{bmatrix} = z^c \begin{bmatrix} 1/s_x & 0 & x_0 \\ 0 & 1/s_y & y_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x^m \\ y^m \\ 1 \end{bmatrix} = \begin{bmatrix} -f_x & 0 & x_0 \\ 0 & -f_y & y_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x^c \\ y^c \\ z^c \end{bmatrix}, \qquad (3)$$

$$\begin{bmatrix} x^{im} \\ y^{im} \\ 1 \end{bmatrix} \cong \begin{bmatrix} -f_x & 0 & x_0 \\ 0 & -f_y & y_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} (\mathbf{R}_c^w)^T & -(\mathbf{R}_c^w)^T \mathbf{t}_c^w \end{bmatrix} \begin{bmatrix} x^w \\ y^w \\ z^w \\ 1 \end{bmatrix},$$
Thus, three transform matrices can be identified to project a 3-D world point onto its 2-D correspondence in an image. These three matrices are termed the intrinsic transform matrix $\tilde{\mathbf{K}}$ (ITM, encoding all intrinsic camera parameters), the extrinsic transform matrix $\tilde{\mathbf{M}}$ (ETM, encoding all extrinsic camera parameters), and the projection matrix $\tilde{\mathbf{P}}$ (PM, encoding all linear camera parameters), and are given by:

$$\tilde{\mathbf{K}} = \begin{bmatrix} -f_x & 0 & x_0 \\ 0 & -f_y & y_0 \\ 0 & 0 & 1 \end{bmatrix}, \quad \tilde{\mathbf{M}} = \begin{bmatrix} (\mathbf{R}_c^w)^T & -(\mathbf{R}_c^w)^T\mathbf{t}_c^w \end{bmatrix}, \quad \text{and} \quad \tilde{\mathbf{P}} = \tilde{\mathbf{K}}\tilde{\mathbf{M}}. \qquad (4)$$
Thus, for the projective coordinates $\tilde{\mathbf{x}}^{im} = \begin{bmatrix} x^{im} & y^{im} & 1 \end{bmatrix}^T$ and $\tilde{\mathbf{x}}^{w} = \begin{bmatrix} x^{w} & y^{w} & z^{w} & 1 \end{bmatrix}^T$, we have:

$$\tilde{\mathbf{x}}^{im} \cong \tilde{\mathbf{P}}\cdot\tilde{\mathbf{x}}^{w}. \qquad (5)$$
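A compact NumPy sketch of equations 3-5, assuming the intrinsic parameters and the camera pose are already known (all function and variable names, as well as the example numbers, are ours):

```python
import numpy as np

def projection_matrix(fx, fy, x0, y0, R_cw, t_cw):
    """Build P = K M from equation 4 (linear pinhole model, no distortion)."""
    K = np.array([[-fx, 0.0, x0],
                  [0.0, -fy, y0],
                  [0.0, 0.0, 1.0]])
    M = np.hstack([R_cw.T, -R_cw.T @ t_cw.reshape(3, 1)])   # 3 x 4 extrinsic matrix
    return K @ M

def project(P, x_world):
    """Project a 3-D world point to image coordinates (equation 5)."""
    xh = P @ np.append(np.asarray(x_world, dtype=float), 1.0)
    return xh[:2] / xh[2]

# Example with an identity pose and a point in front of the camera (z^c < 0
# here because of the sign convention used in the intrinsic matrix above).
P = projection_matrix(800.0, 800.0, 320.0, 240.0, np.eye(3), np.zeros(3))
print(project(P, [0.1, 0.05, -2.0]))   # -> [360. 260.]
```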
Through this projection, a 3-D straight line will map as a 2-D straight line. From
this observation, this pure pinhole modeling is called linear modeling.
The perfect pinhole model is only an approximation of the real camera system.
It is, therefore, not valid when high accuracy is required. The nonlinear
components (skew and distortion) of the model need to be taken into account in
order to compensate for the mismatch between the perfect pinhole model and the
real situation.
In applications for which highly accurate calibration is not necessary, distortion
is not considered. Instead, in this case a parameter u characterizing the skew
(Faugeras, 1993) of the image is defined and computed, resulting in an ITM as:
$$\tilde{\mathbf{K}} = \begin{bmatrix} -f_x & u & x_0 \\ 0 & -f_y & y_0 \\ 0 & 0 & 1 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} x^{im} \\ y^{im} \end{bmatrix} = \begin{bmatrix} \hat{x}^{im} \\ \hat{y}^{im} \end{bmatrix}. \qquad (6)$$
Camera Parameters
The coefficients of the distortion model, together with the intrinsic and extrinsic
parameters, emulate the imaging process of a camera with very high accuracy.
In addition, the skew parameter could also be included. However, since the
modeled distortion already accounts for the interaction between the x and y
components, the image skew does not need to be considered explicitly, but
instead it is treated implicitly as part of the distortion.
It can be noticed in the previous equations that f and sx, respectively f and sy, are
always coupled with each other. This means that only the values of fx = f / sx and
fy = f / sy, instead of the actual values of f, sx, and sy, are important and can be
recovered.
All these camera parameters, which can be recovered directly from a camera
calibration procedure, can be grouped into a linear parameter vector pl and a
distortion coefficient vector pd, as follows:
$$\mathbf{p}_l = \begin{bmatrix} \alpha & \beta & \gamma & x_c^w & y_c^w & z_c^w & f_x & f_y & x_0 & y_0 \end{bmatrix}^T, \qquad (7)$$

$$\mathbf{p}_d = \begin{bmatrix} \text{distortion coefficients} \end{bmatrix}^T. \qquad (8)$$

These two vectors are further combined into one camera parameter vector $\mathbf{p}_c$ as follows:

$$\mathbf{p}_c = \begin{bmatrix} \mathbf{p}_l^T & \mathbf{p}_d^T \end{bmatrix}^T. \qquad (9)$$
In general, the number of independent constraints should not be less than the
DOFs of the camera parameter space. Based on this observation, a counting
argument has been presented for self-calibration (Pollefeys et al., 1999).
Minimal constraint requirements for different calibration purposes can also be
identified (Torr & Zisserman, 1996). Often, however, to improve numerical
robustness, many more independent constraints than needed are utilized (over-
constrained problem) (Hartley & Zisserman, 2000), or several equivalent
representations of the same constraint are employed simultaneously (Malm &
Heyden, 2001).
Due to the availability of useful geometric constraints, a number of different camera calibration approaches exist. Meanwhile, various types of compromises or trade-offs always have to be made. One such trade-off is, for instance, the one between the desired accuracy of the calibration results and the depth of view that can be supported by them. Requirements also differ from
Passive (Fixed) Calibration: All camera parameters are assumed fixed and,
therefore, the calibration is performed only once. This approach often uses
images of objects for which the accurate geometric and photometric
properties can be devised, and reconstructs the relationship between such
properties and the recorded images. From these relations all camera
parameters can be estimated quantitatively through some linear or nonlin-
ear optimization process (Tsai, 1987; Triggs, McLauchlan, Hartley &
Fitzgibbon, 1999).
Active Calibration: For active vision purposes, some of the intrinsic camera
parameters (typically focus and zoom) are assumed to vary actively, while
the extrinsic parameters are either changed on purpose in a controlled
fashion (e.g., pure rotation) or not at all (Willson, 1994). The investigation
of the relationship between zooming (and/or focus) and other intrinsic
parameters is at the center of this approach.
Self-Calibration: Depending on the application requirements, all or some of the
camera parameters are assumed to vary independently, while the remaining
ones are unknown constants or parameters that have already been recov-
ered via a pre-processing step (e.g., through a passive calibration tech-
nique). The purpose of self-calibration is to be able to recover the camera
parameters at different settings of the optical and geometrical configura-
tions (Pollefeys et al., 1999).
1992). Any more elaborate distortion model than a radial one could help in
increasing the accuracy, but may incur numerical instability (Tsai, 1987). Thus,
most calibration algorithms usually take into account only the radial distortion.
However, when wide-angle cameras are used, adding a non-radial distortion
component in the distortion model will improve accuracy significantly (Weng,
Cohen & Herniou, 1992).
Therefore, the complexity of the distortion model (i.e., the number of distortion
coefficients considered) should match the available computation resources and
the accuracy required by the application.
Distortion Models
Two different models have been constructed to describe the distortion phenom-
enon. They were developed for the purpose of projection and that of 3-D
reconstruction, respectively.
Imaging-distortion model
For the camera projection purpose, the distortion can be modeled as “imaging
distortion” as:
$$\begin{bmatrix} \hat{x}^{im} \\ \hat{y}^{im} \end{bmatrix} = \begin{bmatrix} x^{im} \\ y^{im} \end{bmatrix} + \begin{bmatrix} f_x\cdot\Delta_x^{Im} \\ f_y\cdot\Delta_y^{Im} \end{bmatrix}, \qquad (10)$$

where

$$\begin{bmatrix} \Delta_x^{Im} \\ \Delta_y^{Im} \end{bmatrix} = \begin{bmatrix} x'r^2 & x'r^4 & x'r^6 & r^2 + 2x'^2 & 2x'y' & r^2 & 0 \\ y'r^2 & y'r^4 & y'r^6 & 2x'y' & r^2 + 2y'^2 & 0 & r^2 \end{bmatrix}\cdot\mathbf{p}_d^{Im}, \qquad (11)$$

$$\mathbf{p}_d^{Im} = \begin{bmatrix} k_1^{Im} & k_2^{Im} & k_3^{Im} & P_1^{Im} & P_2^{Im} & s_1^{Im} & s_2^{Im} \end{bmatrix}^T,$$
$$x' = -\frac{x^c}{z^c} = \frac{x^{im} - x_0}{f_x}, \quad y' = -\frac{y^c}{z^c} = \frac{y^{im} - y_0}{f_y}, \quad \text{and} \quad r^2 = x'^2 + y'^2.$$
$k_1^{Im}$, $k_2^{Im}$, $k_3^{Im}$ (radial), $P_1^{Im}$, $P_2^{Im}$ (de-centering), and $s_1^{Im}$, $s_2^{Im}$ (thin prism) represent the imaging-distortion coefficients, while $\Delta_x^{Im}$ and $\Delta_y^{Im}$ represent distortions in the horizontal (x-axis) and vertical (y-axis) directions, respectively.
The radial distortion is caused by the fact that objects at different angular
distances from the lens axis undergo different magnifications. The de-centering
distortion is due to the fact that the optical centers of multiple lenses are not
correctly aligned with the center of the camera. The thin-prism distortion arises
from the imperfection in the lens design and manufacturing, as well as the
camera assembly.
This distortion model can be simplified by neglecting certain parameters. $k_1^{Im}$ usually accounts for about 90% of the total distortion (Slama, 1980). For example, in some cases, only radial and tangential components were taken into consideration. The effect of the thin-prism coefficients ($s_1^{Im}$ and $s_2^{Im}$) was overlooked without affecting the final accuracy, because this component only causes additional radial and tangential distortions (Weng et al., 1992).
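As a direct transcription of equations 10 and 11 (as reconstructed above) into NumPy-free Python, the sketch below applies the imaging distortion to an ideal pixel; the function and argument names are ours and serve only to make the coefficient ordering explicit.

```python
def apply_imaging_distortion(x_im, y_im, fx, fy, x0, y0, coeffs):
    """Distort ideal pixel coordinates following equations 10 and 11.

    `coeffs` = (k1, k2, k3, P1, P2, s1, s2): radial, de-centering and
    thin-prism imaging-distortion coefficients.
    """
    k1, k2, k3, p1, p2, s1, s2 = coeffs
    xp = (x_im - x0) / fx            # x' of equation 11
    yp = (y_im - y0) / fy            # y'
    r2 = xp * xp + yp * yp
    radial = k1 * r2 + k2 * r2**2 + k3 * r2**3
    dx = xp * radial + p1 * (r2 + 2 * xp * xp) + 2 * p2 * xp * yp + s1 * r2
    dy = yp * radial + 2 * p1 * xp * yp + p2 * (r2 + 2 * yp * yp) + s2 * r2
    return x_im + fx * dx, y_im + fy * dy

# Example: pure radial distortion pushes an off-centre pixel outwards.
print(apply_imaging_distortion(400.0, 300.0, 800.0, 800.0, 320.0, 240.0,
                               (0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)))
```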
Reconstruction-distortion model
Besides being modeled as imaging distortion, the distortion can also be modeled
as “reconstruction distortion” as follows:
$$\begin{bmatrix} x^{im} \\ y^{im} \end{bmatrix} = \begin{bmatrix} \hat{x}^{im} \\ \hat{y}^{im} \end{bmatrix} - \begin{bmatrix} f_x\cdot\Delta_x^{Re} \\ f_y\cdot\Delta_y^{Re} \end{bmatrix}, \qquad (12)$$

where

$$\begin{bmatrix} \Delta_x^{Re} \\ \Delta_y^{Re} \end{bmatrix} = \begin{bmatrix} \hat{x}r^2 & \hat{x}r^4 & \hat{x}r^6 & r^2 + 2\hat{x}^2 & 2\hat{x}\hat{y} & r^2 & 0 \\ \hat{y}r^2 & \hat{y}r^4 & \hat{y}r^6 & 2\hat{x}\hat{y} & r^2 + 2\hat{y}^2 & 0 & r^2 \end{bmatrix}\cdot\mathbf{p}_d^{Re}, \qquad (13)$$

$$\mathbf{p}_d^{Re} = \begin{bmatrix} k_1^{Re} & k_2^{Re} & k_3^{Re} & P_1^{Re} & P_2^{Re} & s_1^{Re} & s_2^{Re} \end{bmatrix}^T,$$
$$\hat{x} = \frac{\hat{x}^{im} - x_0}{f_x}, \quad \hat{y} = \frac{\hat{y}^{im} - y_0}{f_y}, \quad \text{and} \quad r^2 = \hat{x}^2 + \hat{y}^2.$$
The distortion center used in equations 10 and 12 could be different from the principal point employed in equation 3 (Wei & Ma, 1994). On the other hand, under radial distortion with a dominant coefficient $k_1^{Im}$ (or $k_1^{Re}$), a small shift of the distortion center is equivalent to adding two de-centering distortion terms (Ahmed & Farag, 2001). Therefore, if
the distortion is estimated independently of the calibration of other camera
parameters, the principal point for the linear perspective transformation should
be distinguished from the distortion center. However, if all camera parameters
(including distortion coefficients) are calibrated simultaneously, the distortion
center and the principal point should be treated as being the same. In this case,
it is better to take the de-centering distortion component into consideration.
Discussions
the imaging-distortion model is complete in the 3-D CCS and only associated with the 3-D coordinates, while the reconstruction-distortion model is complete in the 2-D IMCS and only associated with the 2-D coordinates (Wolberg, 1990).
Plumb-line method
With a linear camera model (ref. equation 5), a straight line in the 3-D scene is
still a straight line in the corresponding image plane. Any deviation from this
straightness should, therefore, be attributed to distortion.
Suppose there exists a straight 3-D line in the 3-D scene. Correspondingly, a
straight image line should appear in the focal plane of an ideal linear camera. Let
Let $\mathbf{x}^{im} = [x^{im} \;\; y^{im}]^T$ denote an arbitrary point on that image line. The following equation should then be satisfied:

$$x^{im}\cos\theta + y^{im}\sin\theta = \rho, \qquad (14)$$
where θ is the angle between the image line and the x-axis of the focal plane,
and ρ is the perpendicular distance from the origin to this line.
Suppose that the reconstruction-distortion model is employed, and in the distortion model assume that $f_x = f_y$ (Brown, 1971; Devernay & Faugeras, 2001). Substituting equation 12 into equation 14 leads to an expression of the form

$$f\!\left(\hat{x}^{im}, \hat{y}^{im}, x_0, y_0, s_x, s_y, k_1^{Re}, k_2^{Re}, k_3^{Re}, P_1^{Re}, P_2^{Re}, s_1^{Re}, s_2^{Re}, \theta, \rho\right) = \varepsilon, \qquad (15)$$

where $x_0$, $y_0$, $s_x$, $s_y$, $k_1^{Re}$, $k_2^{Re}$, $k_3^{Re}$, $P_1^{Re}$, $P_2^{Re}$, $s_1^{Re}$, $s_2^{Re}$, $\theta$ and $\rho$ are all unknown, and $\varepsilon$ is a random error.
If enough collinear points are available,

$$\varsigma = \sum_{i=1}^{n}\sum_{j=1}^{m} f^2\!\left(\hat{x}_{ij}^{im}, \hat{y}_{ij}^{im}, x_0, y_0, s_x, s_y, k_1^{Re}, k_2^{Re}, k_3^{Re}, P_1^{Re}, P_2^{Re}, s_1^{Re}, s_2^{Re}, \theta_i, \rho_i\right) \qquad (16)$$

can be minimized to recover $x_0$, $y_0$, $s_x$, $s_y$, $k_1^{Re}$, $k_2^{Re}$, $k_3^{Re}$, $P_1^{Re}$, $P_2^{Re}$, $s_1^{Re}$, $s_2^{Re}$, $\theta_i$ and $\rho_i$. $[\hat{x}_{ij}^{im} \;\; \hat{y}_{ij}^{im}]^T$ ($i = 1 \ldots n$, $j = 1 \ldots m$) are distorted image points whose
When the plumb-line method is applied, “straight” lines, which are distorted
straight image lines, need to be extracted first, typically by means of an edge
detection technique, before the optimization on equation 16 can be performed.
Clearly, the accuracy of the extracted lines determines the accuracy of the
estimated parameters. If the calibration set-up is carefully designed so that those
“straight” lines can be located accurately, an overall accuracy in the order of $2\times10^{-5}$ can be achieved (Fryer, Clarke & Chen, 1994). However, for irregular
natural scenes, it may be difficult to locate “straight” lines very accurately. To
tackle this problem, an iterative strategy has been adopted (Devernay &
Faugeras, 2001). According to this strategy, edge detection (step 1) and
optimization (step 2) are first performed on the original images. Based on the
estimated distortion coefficients, images are corrected (undistorted), and then
steps 1 and 2 are repeated. This iterative process continues until a small deviation
ς is reached. Applying this strategy on natural scenes, a mean distortion error
of about pixel (for a 512×512 image) can be obtained (Devernay & Faugeras,
2001). Improved results can be obtained by modifying equation 16 (dividing, for
example, the function f (...) by ρi [Swaminathan & Nayer, 2000]) and by
carefully defining the “straightness” of a line (using, for example, snakes [Kang,
2000]).
The plumb-line method explores only one invariant of the projective transforma-
tion. Other projective invariants or properties, such as the convergence of parallel
lines, can also be employed for estimating distortion in a fashion similar to the
plumb-line method. Some of these methods are summarized below. They make
use of:
Other techniques
$$\sqrt{\left(\hat{x}^{im} - x_0\right)^2 + \left(\hat{y}^{im} - y_0\right)^2} = f \cdot \ln\!\left(\frac{r}{f} + \sqrt{1 + \frac{r^2}{f^2}}\right),$$
although more research needs to be done to model other types of distortions with
high accuracy.
In Farid & Popescu (2001), large radial distortion is detected by analyzing the
correlation in the frequency-domain of a single image. The method, however,
may only work for certain types of scenes. Finding out what types of scenes are
suitable for this technique seems to be an interesting research topic.
Contrary to the nonlinear equations 11 and 13, the standard distortion model is
modified in Fitzgibbon (2001) to a projective linear, but equivalent, representation
assuming only radial distortion. Based on this, an efficient closed-form algorithm
is formulated that is guaranteed to estimate the fundamental matrix (Faugeras,
1993).
After that, some other approaches that utilize special calibration objects or
specific phenomena in the 3-D scene are summarized. Finally, the calibration is
evaluated and feature extraction issues are discussed.
$$\begin{bmatrix} t^1 \\ t^2 \\ t^3 \end{bmatrix} \cong \begin{bmatrix} p_1^1 & p_2^1 & p_3^1 & p_4^1 \\ p_1^2 & p_2^2 & p_3^2 & p_4^2 \\ p_1^3 & p_2^3 & p_3^3 & p_4^3 \end{bmatrix}\begin{bmatrix} x^w \\ y^w \\ z^w \\ 1 \end{bmatrix}, \qquad (17)$$

where $p_j^i$ is the element of the matrix $\tilde{\mathbf{P}}$ at the ith row and jth column, and

$$x^{im} = t^1/t^3, \quad y^{im} = t^2/t^3. \qquad (18)$$
Substituting equation 18 into equation 17 and expanding the matrix product yields
the following two equations with 12 unknowns:
corresponding 2-D image coordinates $\mathbf{x}_i^{im} = [x_i^{im} \;\; y_i^{im}]^T$ ($i = 1 \ldots N$) by the following equation:

$$\mathbf{A}\cdot\mathbf{p} = \mathbf{0}_{2N\times1}, \qquad (21)$$

$$\mathbf{A} = \begin{bmatrix}
x_1^w & y_1^w & z_1^w & 1 & 0 & 0 & 0 & 0 & -x_1^w x_1^{im} & -y_1^w x_1^{im} & -z_1^w x_1^{im} & -x_1^{im} \\
0 & 0 & 0 & 0 & x_1^w & y_1^w & z_1^w & 1 & -x_1^w y_1^{im} & -y_1^w y_1^{im} & -z_1^w y_1^{im} & -y_1^{im} \\
x_2^w & y_2^w & z_2^w & 1 & 0 & 0 & 0 & 0 & -x_2^w x_2^{im} & -y_2^w x_2^{im} & -z_2^w x_2^{im} & -x_2^{im} \\
0 & 0 & 0 & 0 & x_2^w & y_2^w & z_2^w & 1 & -x_2^w y_2^{im} & -y_2^w y_2^{im} & -z_2^w y_2^{im} & -y_2^{im} \\
 & & & & & & \vdots & & & & & \\
x_N^w & y_N^w & z_N^w & 1 & 0 & 0 & 0 & 0 & -x_N^w x_N^{im} & -y_N^w x_N^{im} & -z_N^w x_N^{im} & -x_N^{im} \\
0 & 0 & 0 & 0 & x_N^w & y_N^w & z_N^w & 1 & -x_N^w y_N^{im} & -y_N^w y_N^{im} & -z_N^w y_N^{im} & -y_N^{im}
\end{bmatrix},$$

$$\mathbf{p} = \begin{bmatrix} p_1^1 & p_2^1 & p_3^1 & p_4^1 & p_1^2 & p_2^2 & p_3^2 & p_4^2 & p_1^3 & p_2^3 & p_3^3 & p_4^3 \end{bmatrix}^T.$$
represents the element at the ith row and jth column, except $c_9^9 = c_{10}^{10} = c_{11}^{11} = 1$. A
closed-form solution to this problem can be obtained by the method described in
Faugeras (1993) and Hartley & Zisserman (2000).
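A standard way to obtain such a closed-form solution is to take the right singular vector associated with the smallest singular value of A; the NumPy sketch below illustrates this (the function and variable names are ours, and no geometric-validity constraint is enforced at this stage).

```python
import numpy as np

def simple_dlt(world_points, image_points):
    """Estimate the 3 x 4 projection matrix from N >= 6 correspondences.

    Builds the 2N x 12 matrix A of equation 21 and solves A p = 0 in the
    least-squares sense under ||p|| = 1 using the SVD, then reshapes p into P.
    """
    rows = []
    for (xw, yw, zw), (xi, yi) in zip(world_points, image_points):
        rows.append([xw, yw, zw, 1, 0, 0, 0, 0, -xw*xi, -yw*xi, -zw*xi, -xi])
        rows.append([0, 0, 0, 0, xw, yw, zw, 1, -xw*yi, -yw*yi, -zw*yi, -yi])
    A = np.asarray(rows, dtype=float)
    _, _, vt = np.linalg.svd(A)
    p = vt[-1]                  # right singular vector of the smallest singular value
    return p.reshape(3, 4)
```

The recovered matrix is defined only up to scale; the decomposition described next then extracts the individual camera parameters from it.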
The values of all camera parameters mentioned can be recovered directly from
the results of DLT.
Combining the three items in equation 4 gives:
where $\mathbf{r}_j$ ($j = 1\ldots3$) is the jth column of matrix $(\mathbf{R}_c^w)^T$, and $[t_x \;\; t_y \;\; t_z]^T = -(\mathbf{R}_c^w)^T \mathbf{t}_c^w$.

Let $\rho = \sqrt{(p_1^3)^2 + (p_2^3)^2 + (p_3^3)^2}$ and $\mathbf{q}'_i = \mathbf{q}_i/\rho = [p_1^i \;\; p_2^i \;\; p_3^i]^T/\rho$ ($i = 1\ldots3$).
Then, all original intrinsic and extrinsic camera parameters can be recovered as:

$$t_z = \varepsilon_z q'_{34}, \quad \mathbf{r}_3 = \varepsilon_z \mathbf{q}'_3, \quad x_0 = \mathbf{q}'^{T}_1\mathbf{q}'_3, \quad y_0 = \mathbf{q}'^{T}_2\mathbf{q}'_3,$$
$$f_y = \left\|\mathbf{q}'_2 - y_0\mathbf{q}'_3\right\|, \quad \mathbf{r}_2 = -\varepsilon_z(\mathbf{q}'_2 - y_0\mathbf{q}'_3)/f_y, \quad \mathbf{r}_1 = -\varepsilon_z(\mathbf{q}'_1 - x_0\mathbf{q}'_3)/f_x,$$
$$f_x = \left\|\mathbf{q}'_1 - x_0\mathbf{q}'_3\right\|, \quad t_y = -(\varepsilon_z q'_{24} - y_0 t_z)/f_y, \quad t_x = -(\varepsilon_z q'_{14} - x_0 t_z)/f_x, \qquad (24)$$
Employing the DLT method described, one can recover 11 independent elements of the matrix $\tilde{\mathbf{P}}$. However, according to a previous section in this chapter, $\tilde{\mathbf{P}}$ has only 10 DOFs. This means that the recovered 11 independent elements may not be geometrically valid. In other words, certain geometric constraints may not be fulfilled by the 10 intrinsic and extrinsic parameters of a camera recovered from the reconstructed $\tilde{\mathbf{P}}$ of the simple DLT. For this problem, there are three solutions.
In order to match the 11 DOFs of the projection matrix $\tilde{\mathbf{P}}$, one could add one more DOF to the camera parameter space by taking into account the skew factor u. With this change, substituting equation 6 into equation 5 yields (ref. equation 17):
$$u = -(\mathbf{q}'_1\times\mathbf{q}'_2)^T(\mathbf{q}'_2\times\mathbf{q}'_3), \quad t_z = \varepsilon_z q'_{34}, \quad \mathbf{r}_3 = \varepsilon_z\mathbf{q}'_3, \quad x_0 = \mathbf{q}'^{T}_1\mathbf{q}'_3, \quad y_0 = \mathbf{q}'^{T}_2\mathbf{q}'_3,$$
$$f_y = \left\|\mathbf{q}'_2 - y_0\mathbf{q}'_3\right\|, \quad \mathbf{r}_2 = -\varepsilon_z(\mathbf{q}'_2 - y_0\mathbf{q}'_3)/f_y, \quad \mathbf{r}_1 = \mathbf{r}_2\times\mathbf{r}_3,$$
$$f_x = \left\|\varepsilon_z\mathbf{q}'_1 - u\,\mathbf{r}_2 - x_0\mathbf{r}_3\right\|, \quad t_y = -(\varepsilon_z q'_{24} - y_0 t_z)/f_y,$$
Constrained DLT
In some cases, the skew factor need not be considered, while in others the nonlinear effects have been eliminated in advance by means of the techniques described earlier. Then an additional constraint $(\mathbf{q}'_1 \times \mathbf{q}'_2)^T (\mathbf{q}'_2 \times \mathbf{q}'_3) = 0$, which guarantees the geometric validity of the recovered parameters, can be imposed:

Minimize $\|\mathbf{A}\mathbf{p}\|$ subject to the constraints $(p_1^3)^2 + (p_2^3)^2 + (p_3^3)^2 = 1$ and $(\mathbf{q}'_1 \times \mathbf{q}'_2)^T (\mathbf{q}'_2 \times \mathbf{q}'_3) = 0$. $\qquad$ (28)
$$d_{geo} = \sum_{i=1}^{N}\left[\left(\frac{(\mathbf{q}^1)^T\mathbf{x}_i^w + q_4^1}{(\mathbf{q}^3)^T\mathbf{x}_i^w + q_4^3} - x_i^{im}\right)^2 + \left(\frac{(\mathbf{q}^2)^T\mathbf{x}_i^w + q_4^2}{(\mathbf{q}^3)^T\mathbf{x}_i^w + q_4^3} - y_i^{im}\right)^2\right].$$
After solving the optimization problem 28, all original camera parameters can be
recovered by applying equation 24.
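For reference, the geometric distance above can be written as a small cost function suitable for a generic nonlinear least-squares routine; this is a sketch with our own names, not the authors' implementation.

```python
import numpy as np

def geometric_distance(Q, world_points, image_points):
    """Sum of squared reprojection errors d_geo for a 3 x 4 matrix Q.

    Rows of Q play the role of (q^1, q4^1), (q^2, q4^2), (q^3, q4^3) in the
    expression above; `world_points` is N x 3 and `image_points` is N x 2.
    """
    xw_h = np.hstack([np.asarray(world_points, float),
                      np.ones((len(world_points), 1))])   # homogeneous 3-D points
    proj = xw_h @ np.asarray(Q, float).T                   # N x 3
    u = proj[:, 0] / proj[:, 2]
    v = proj[:, 1] / proj[:, 2]
    err = np.asarray(image_points, float) - np.stack([u, v], axis=1)
    return float(np.sum(err ** 2))
```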
Modified DLT
where $r_j^i$ ($i, j = 1\ldots3$) is the element of $\mathbf{R}_c^w$ at the ith row and jth column and $[t_x \;\; t_y \;\; t_z]^T = -(\mathbf{R}_c^w)^T \mathbf{t}_c^w$.

Therefore, as $(\mathbf{R}_c^w)^T \mathbf{R}_c^w = \mathbf{I}$, we know immediately that:
$$a_1 a_5 + a_2 a_6 + a_3 a_7 = 0. \qquad (30)$$
By further requiring that $a_{12} = 1$ or $(a_9)^2 + (a_{10})^2 + (a_{11})^2 = 1$, we can carry out a constrained non-linear search to obtain the 10 independent parameters ($x_0$, $y_0$, $a_3$, $a_4$, $a_6$, ..., $a_{11}$; in case $a_{12} = 1$) or ($x_0$, $y_0$, $a_3$, $a_4$, $a_6$, ..., $a_{10}$, $a_{12}$; in case $(a_9)^2 + (a_{10})^2 + (a_{11})^2 = 1$).
Up to this point, only linear relations in the imaging process were considered. This
already suffices if one aims more at efficiency than accuracy. However, for
highly accurate measurements, the distortion should also be taken into account.
A straightforward way to achieve this is to incorporate the distortion coefficients
directly into the DLT calculation by adding some nonlinear elements. This is the
topic of the next section.
Neither the simple DLT nor the geometrically valid DLT variants can take the distortion component into account. In this section, an innovative way of incorporating the
distortion into the DLT is discussed. Its advantages and disadvantages are
addressed, as well.
Because of the existence of two distortion models, the 3-D reconstruction and
the projection have to be handled differently.
$$\mathbf{x}^w \cong \mathbf{R}_c^w \tilde{\mathbf{K}}^{-1}\tilde{\mathbf{x}}^{im} + \mathbf{t}_c^w, \qquad (31)$$

where

$$\tilde{\mathbf{K}}^{-1} = \begin{bmatrix} -1/f_x & 0 & x_0/f_x \\ 0 & -1/f_y & y_0/f_y \\ 0 & 0 & 1 \end{bmatrix}. \qquad (32)$$
$$\mathbf{x}^w \cong \mathbf{D}\,\tilde{\hat{\mathbf{u}}}^{im}, \qquad (33)$$

where $\mathbf{D}$ is a matrix with 3 rows and 36 columns and $\tilde{\hat{\mathbf{u}}}^{im}$ is a column vector with 36 elements of the form $(\hat{x}^{im})^i(\hat{y}^{im})^j$ ($i, j = 0\ldots7$ and $i + j \leq 7$).

In fact, the dimensions of $\mathbf{D}$ and $\tilde{\hat{\mathbf{u}}}^{im}$ depend on the distortion (coefficients) considered. Several examples are shown in Table 2, where the last case is the same linear situation that was just handled in the previous section.
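A small helper showing how such a 36-element monomial vector could be assembled (an illustration only; the enumeration order of the monomials is our own assumption):

```python
def monomial_vector(x_hat, y_hat, max_degree=7):
    """All monomials x̂^i * ŷ^j with i, j >= 0 and i + j <= max_degree.

    For max_degree = 7 this yields the 36-element vector used with the
    3 x 36 matrix D above.
    """
    return [x_hat ** i * y_hat ** j
            for i in range(max_degree + 1)
            for j in range(max_degree + 1 - i)]

assert len(monomial_vector(0.1, 0.2)) == 36
```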
If enough pairs of $\mathbf{x}^w$ and corresponding $\hat{\mathbf{x}}^{im}$ (and thus $\tilde{\hat{\mathbf{u}}}^{im}$) are available, $\mathbf{D}$ can be calculated from equation 33 in the same way as illustrated. Subsequently, from $\mathbf{D}$, the line of sight of any image pixel in the current camera can be computed. Furthermore, assume that in another camera, which has transformation matrix $\mathbf{D}'$, a point $\hat{\mathbf{x}}'^{im}$ (and thus $\tilde{\hat{\mathbf{u}}}'^{im}$) can be located which is the projection of the same world point $\mathbf{x}^w$ as the point $\hat{\mathbf{x}}^{im}$. Then:
If all world points $\mathbf{x}^w = [x^w \;\; y^w \;\; z^w]^T$ belong to the same plane, it can be assumed without loss of generality that all $z^w$ equal 0. If at the same time only the distortion coefficients $k_1^{Im}$, $P_1^{Im}$, $P_2^{Im}$, $s_1^{Im}$, and $s_2^{Im}$ are considered, then the dimension of $\mathbf{D}$ decreases to $3\times10$ and that of $\tilde{\mathbf{u}}^w$ to $10\times1$.

Again, if we have enough pairs of $\hat{\mathbf{x}}^{im}$ and corresponding $\mathbf{x}^w$ (and thus $\tilde{\mathbf{u}}^w$), $\mathbf{D}$ can be calculated from equation 35. From $\mathbf{D}$, the projection of any world point into the current camera can be computed by means of equation 35 in a linear fashion.
Perspectivity
For an arbitrary point with coordinates $\breve{\mathbf{x}}_1^w = [x^{w1} \;\; y^{w1}]^T$ in a world plane $\Pi_1$, its coordinates in the WCS, whose x and y axes are the same as those of $\Pi_1$, are obviously $\tilde{\breve{\mathbf{x}}}_1^w = [x^{w1} \;\; y^{w1} \;\; 1]^T$. Following the same procedure as the one leading to equation 33, it can be derived that:

$$\tilde{\breve{\mathbf{x}}}_1^w \cong \mathbf{E}_1 \tilde{\hat{\mathbf{u}}}^{im}, \qquad (36)$$

where $\tilde{\breve{\mathbf{x}}}_1^w = [x^{w1} \;\; y^{w1} \;\; 1]^T$, $\mathbf{E}_1$ is a matrix with 3 rows and 36 columns, $[\hat{x}^{im} \;\; \hat{y}^{im}]^T$ is the projection of $\breve{\mathbf{x}}_1^w$ onto the image plane of the current camera,
and $\tilde{\hat{\mathbf{u}}}^{im}$ is a column vector with 36 elements of the form $(\hat{x}^{im})^i(\hat{y}^{im})^j$ ($i, j = 0\ldots7$ and $i + j \leq 7$).

At the same time, if a point with coordinates $\breve{\mathbf{x}}_2^w = [x^{w2} \;\; y^{w2}]^T$ in another plane $\Pi_2$ projects onto the image plane of the current camera also at $[\hat{x}^{im} \;\; \hat{y}^{im}]^T$, we similarly obtain that:

$$\tilde{\breve{\mathbf{x}}}_2^w \cong \mathbf{E}_2 \tilde{\hat{\mathbf{u}}}^{im}, \qquad (37)$$

where $\tilde{\breve{\mathbf{x}}}_2^w = [x^{w2} \;\; y^{w2} \;\; 1]^T$ and $\mathbf{E}_2$ is also a matrix with 3 rows and 36 columns.

Since they have the same projection $[\hat{x}^{im} \;\; \hat{y}^{im}]^T$, a special relation called perspectivity (Hartley & Zisserman, 2000) should exist between $\breve{\mathbf{x}}_1^w$ and $\breve{\mathbf{x}}_2^w$. This perspectivity relation between all corresponding pairs in $\Pi_1$ and $\Pi_2$ can be described by a $3\times3$ matrix $\tilde{\mathbf{C}}$ as:

$$\tilde{\breve{\mathbf{x}}}_2^w \cong \tilde{\mathbf{C}}\,\tilde{\breve{\mathbf{x}}}_1^w. \qquad (38)$$

$$\mathbf{E}_2 \cong \tilde{\mathbf{C}}\,\mathbf{E}_1. \qquad (39)$$

Therefore, instead of recovering $\mathbf{E}_1$ and $\mathbf{E}_2$ separately, $\mathbf{E}_1$ and $\tilde{\mathbf{C}}$ can be calculated first. Then equation 39 is employed to recover $\mathbf{E}_2$. Thus, it is ensured that the perspective constraint is satisfied.
Discussion
arise: First, for a high degree of distortion, the scale of the model increases
dramatically; Second, it is very difficult to take into account projective con-
straints, as addressed in a previous section for DLT, in the calibration; Third, only
when distortion is not considered, can the original camera parameters be
computed linearly with this model (Wei & Ma, 1994). Therefore, a more efficient
way of estimating the distortion coefficients is needed. For this purpose, Tsai’s
algorithm is a good and representative example.
Tsai’s Algorithm
Assuming that only the radial distortion occurs in the camera and the principal point $[x_0 \;\; y_0]^T$ is known (or can be approximated) in advance, Tsai (1987) proposed a two-stage algorithm for explicitly calibrating the camera. In this case,
proposed a two-stage algorithm for explicitly calibrating the camera. In this case,
the imaging equations used are exactly the same as those in equation 12, except
that the possible distortion is limited as follows:
$$\begin{bmatrix} \Delta_x \\ \Delta_y \end{bmatrix} = \begin{bmatrix} \hat{x}r^2 & \hat{x}r^4 & \hat{x}r^6 & \cdots \\ \hat{y}r^2 & \hat{y}r^4 & \hat{y}r^6 & \cdots \end{bmatrix}\begin{bmatrix} k_1 \\ k_2 \\ k_3 \\ \vdots \end{bmatrix}.$$
With this radial distortion, a pixel in the image is only distorted along the radial
direction, thus a radial alignment constraint (RAC) can be formed (Tsai,
1987) (ref. Equation 12):
$$\frac{s_y}{s_x}\cdot\begin{bmatrix} x^c \\ y^c \end{bmatrix}\times\begin{bmatrix} \hat{x} \\ \hat{y} \end{bmatrix} = -\frac{s_y z^c}{f}\cdot\begin{bmatrix} \hat{x}^{im} - x_0 \\ \hat{y}^{im} - y_0 \end{bmatrix}\times\begin{bmatrix} \hat{x} \\ \hat{y} \end{bmatrix} = 0. \qquad (40)$$
where $r_j^i$ ($i, j = 1\ldots3$) is the element of $\mathbf{R}_c^w$ at the ith row and jth column, and $\mathbf{t} = [t_x \;\; t_y \;\; t_z]^T = -(\mathbf{R}_c^w)^T\mathbf{t}_c^w$.
$$\begin{bmatrix} \hat{y}x^w & \hat{y}y^w & \hat{y}z^w & \hat{y} & -\hat{x}x^w & -\hat{x}y^w & -\hat{x}z^w \end{bmatrix}\cdot\mathbf{a} - \hat{x} = 0, \qquad (42)$$

$$\mathbf{a} = \begin{bmatrix} a_1 & a_2 & a_3 & a_4 & a_5 & a_6 & a_7 \end{bmatrix}^T = \begin{bmatrix} \dfrac{s_y}{s_x}t_y^{-1}r_1^1 & \dfrac{s_y}{s_x}t_y^{-1}r_2^1 & \dfrac{s_y}{s_x}t_y^{-1}r_3^1 & \dfrac{s_y}{s_x}t_y^{-1}t_x & t_y^{-1}r_1^2 & t_y^{-1}r_2^2 & t_y^{-1}r_3^2 \end{bmatrix}^T.$$
With more than seven 3-D world points at general positions, one can estimate the
seven unknowns from an over-determined linear system (by stacking multiple
equation 42). After that, $\mathbf{R}_c^w$, $t_x$, and $t_y$ can be calculated as:

$$t_y = \frac{1}{\sqrt{(a_5)^2 + (a_6)^2 + (a_7)^2}}, \quad \frac{s_y}{s_x} = t_y\cdot\sqrt{(a_1)^2 + (a_2)^2 + (a_3)^2}, \quad t_x = \frac{s_x}{s_y}\,a_4\,t_y,$$

$$\begin{bmatrix} r_1^1 & r_2^1 & r_3^1 \end{bmatrix}^T = \frac{s_x}{s_y}\,t_y\begin{bmatrix} a_1 & a_2 & a_3 \end{bmatrix}^T, \quad \begin{bmatrix} r_1^2 & r_2^2 & r_3^2 \end{bmatrix}^T = t_y\begin{bmatrix} a_5 & a_6 & a_7 \end{bmatrix}^T,$$

$$\begin{bmatrix} r_1^3 & r_2^3 & r_3^3 \end{bmatrix}^T = \begin{bmatrix} r_1^1 & r_2^1 & r_3^1 \end{bmatrix}^T \times \begin{bmatrix} r_1^2 & r_2^2 & r_3^2 \end{bmatrix}^T.$$
On the other hand, if all available points are co-planar, that is $z^w = 0$, equation 42 becomes:

$$\begin{bmatrix} \hat{y}x^w & \hat{y}y^w & \hat{y} & -\hat{x}x^w & -\hat{x}y^w \end{bmatrix}\cdot\mathbf{a}' - \hat{x} = 0, \qquad (43)$$

where $\mathbf{a}' = \begin{bmatrix} \dfrac{s_y}{s_x}t_y^{-1}r_1^1 & \dfrac{s_y}{s_x}t_y^{-1}r_2^1 & \dfrac{s_y}{s_x}t_y^{-1}t_x & t_y^{-1}r_1^2 & t_y^{-1}r_2^2 \end{bmatrix}^T$.

In this case, $\mathbf{R}_c^w$, $t_x$, and $t_y$ can be recovered from the calculated vector $\mathbf{a}'$ only if the aspect ratio $s_y/s_x$ is known.
In summary, there are two stages in this algorithm:
1. Compute the 3-D pose $\mathbf{R}_c^w$, $t_x$, $t_y$, and $s_y/s_x$ (in case enough non-co-planar points are available); and
2. Optimize the effective focal length f, radial distortion coefficients $k_1$, $k_2$, ..., and $t_z$, by employing a simple search scheme.
Because the problem has been split into these two stages, the whole computation
becomes much simpler and more efficient.
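A minimal sketch of stage 1, assuming non-co-planar calibration points and distorted image coordinates already expressed relative to the known principal point (in the normalized form used by equation 42): stack equation 42 for all points and solve the over-determined system in the least-squares sense. The function names are ours.

```python
import numpy as np

def solve_rac_system(world_points, x_hat, y_hat):
    """Stage 1 of Tsai's algorithm: solve the stacked equation 42 for a.

    `x_hat`, `y_hat` are the distorted, centred image coordinates defined in
    the chapter; at least seven non-co-planar world points are required.
    """
    rows, rhs = [], []
    for (xw, yw, zw), xh, yh in zip(world_points, x_hat, y_hat):
        rows.append([yh * xw, yh * yw, yh * zw, yh, -xh * xw, -xh * yw, -xh * zw])
        rhs.append(xh)
    a, *_ = np.linalg.lstsq(np.asarray(rows, float), np.asarray(rhs, float), rcond=None)
    return a

def recover_ty_and_aspect(a):
    """Apply the first two closed-form expressions above to the estimated a."""
    t_y = 1.0 / np.linalg.norm(a[4:7])
    sy_over_sx = t_y * np.linalg.norm(a[0:3])
    return t_y, sy_over_sx
```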
Further improvements
A requirement of Tsai’s algorithm is that the position of the principal point and
the aspect ratio (in case only co-planar points are available) are known a priori.
One practical possibility of finding the principal point accurately is to minimize
the left-hand side of equation 42 (in the non-co-planar case) or that of equation
43 (in the co-planar case) (Lenz & Tsai, 1988; Penna, 1991). The horizontal scale
factor (and thus the aspect ratio) can be measured by using the difference
between the scanning frequency of the camera sensor plane and the scanning
frequency of the image capture board frame buffer (Lenz & Tsai, 1988).
However, this scale factor estimation method is not so practical due to the
difficulty of measuring the required frequencies (Penna, 1991). A more direct
way would be to employ the image of a sphere for calculating the aspect ratio
(Penna, 1991). The power spectrum of the images of two sets of parallel lines
can also be utilized for the same purpose (Bani-Hashemi, 1991).
In addition, the RAC model requires that the angle of incidence between the optical axis of the camera and the calibration plane be at least 30° (Tsai, 1987). This ill-conditioned situation can be avoided by setting cos α = 1 and sin α = α when α ≈ 0 (Zhuang & Wu, 1996).
The above modifications improve the capabilities of the original Tsai algorithm. However, Tsai's algorithm can still only take radial distortion into consideration. Moreover, the strategy of recovering several subsets of the whole camera parameter space in separate steps may suffer from stability and convergence problems due to the tight correlation between camera parameters (Slama, 1980). It is therefore desirable to estimate all parameters simultaneously. Methods based on this idea are discussed below.
During the 1980s, camera calibration techniques were mainly either full-scale nonlinear optimizations incorporating distortion (Slama, 1980) or techniques that take into account only the linear projection relation depicted by equation 5 (Abdel-Aziz & Karara, 1971). The iterative two-phase strategy was proposed at the beginning of the 1990s to achieve a better performance by combining the above two approaches (Weng et al., 1992). With this iterative strategy, a linear estimation technique such as DLT is applied in phase 1 to approximate the imaging process by $\mathbf{p}_l$; then, in phase 2, starting with the linearly recovered $\mathbf{p}_l$ and $\mathbf{p}_d = 0$, an optimization process is performed iteratively until a best-fitting parameter point $\mathbf{p}_c$ is reached.
Because for most cameras, the linear model (ref. equation 5) is quite adequate,
and the distortion coefficients are very close to 0, it can be argued that this
iterative two-phase strategy would produce better results than pure linear
techniques or pure full-scale nonlinear search methods. Camera calibration
methods employing the iterative two-phase strategy differ mainly in the following
three aspects: 1) The adopted distortion model and distortion coefficients; 2) The
linear estimation technique; and 3) The objective function to be optimized. In Sid-
Ahmed & Boraie (1990), for example, the reconstruction-distortion model is
utilized and $k_1^{Re}$, $k_2^{Re}$, $k_3^{Re}$, $P_1^{Re}$, and $P_2^{Re}$ are considered. The DLT method introduced is directly employed in phase 1. Then, in phase 2, assuming that $[x_0\ \ y_0]^T$ is already known, the Marquardt method is used to solve a least-squares problem with respect to $\mathbf{p} = [\mathbf{p}_l^T \;\; k_1^{Re} \;\; k_2^{Re} \;\; k_3^{Re} \;\; P_1^{Re} \;\; P_2^{Re}]^T$ (ref. equation 21). All camera parameters are made implicit.
To further guarantee the geometric validity of the estimated p , one extra phase
can be introduced between phase 1 and phase 2. In this extra phase, elements
$$f(\mathbf{p}_c) = \sum \left[\left(\hat{x}_{im} - x_{im} - f_x \cdot \Delta^{Im}_x\right)^2 + \left(\hat{y}_{im} - y_{im} - f_y \cdot \Delta^{Im}_y\right)^2\right], \qquad (44)$$
where the summation is done over all available data and all variables were defined in equation 10. Instead of the linear estimation method in phase 1, a nonlinear optimization can also be carried out to get a better initial guess of $\mathbf{p}_l$, with all distortion coefficients set to zero.
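The following sketch illustrates the iterative two-phase strategy in its simplest form: a DLT-style linear estimate initializes the projection parameters, after which all parameters, including two radial distortion coefficients, are refined jointly by nonlinear least squares. It minimizes plain reprojection error rather than the exact objective of equation 44, applies distortion in the image plane around a given principal point, and uses illustrative names throughout; it is a sketch under those assumptions, not the chapter's implementation.

```python
import numpy as np
from scipy.optimize import least_squares

def dlt(world_pts, img_pts):
    """Phase 1: linear DLT estimate of the 3x4 projection matrix."""
    X = np.column_stack([world_pts, np.ones(len(world_pts))])
    rows = []
    for Xi, (u, v) in zip(X, img_pts):
        rows.append(np.hstack([Xi, np.zeros(4), -u * Xi]))
        rows.append(np.hstack([np.zeros(4), Xi, -v * Xi]))
    _, _, Vt = np.linalg.svd(np.array(rows))
    return Vt[-1].reshape(3, 4)          # right singular vector, smallest s.v.

def project(P, k, x0y0, world_pts):
    """Pinhole projection followed by a simple radial distortion applied
    in the image plane around the (assumed known) principal point."""
    x0y0 = np.asarray(x0y0, dtype=float)
    X = np.column_stack([world_pts, np.ones(len(world_pts))])
    x = (P @ X.T).T
    uv = x[:, :2] / x[:, 2:3]
    d = uv - x0y0
    r2 = np.sum(d ** 2, axis=1, keepdims=True)
    return x0y0 + d * (1 + k[0] * r2 + k[1] * r2 ** 2)

def calibrate_two_phase(world_pts, img_pts, x0y0):
    P0 = dlt(world_pts, img_pts)                       # phase 1
    p0 = np.hstack([P0.ravel() / np.linalg.norm(P0), [0.0, 0.0]])

    def residual(p):
        P, k = p[:12].reshape(3, 4), p[12:]
        return (project(P, k, x0y0, world_pts) - img_pts).ravel()

    sol = least_squares(residual, p0)                  # phase 2
    return sol.x[:12].reshape(3, 4), sol.x[12:]
```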
Optimization issue
At phase 2, each iteration of the optimization can also be split up into two steps (Weng et al., 1992): in one step the linear (projection) parameters are re-estimated while the distortion coefficients are held fixed, and in the other step the distortion coefficients are re-estimated while the linear parameters are held fixed. However, due to the tight interaction between the linear parameters and the distortion coefficients, this two-step optimization converges very slowly.
In Chatterjee, Roychowdhury & Chong (1997), the nonlinear optimization part
(phase 2) is further divided into three stages by following the Gauss-Seidel
approach. The first stage is similar to the optimization phase in Sid-Ahmed &
Boraie (1990), except that the optical center $[x_0\ \ y_0]^T$, the aspect ratio $s_y/s_x$, and all distortion coefficients are fixed. The second stage is the same as Step “a” in Weng et al. (1992). Finally, in the third stage, the function $f(\mathbf{p}_c)$ in equation 44 is minimized only w.r.t. the optical center $[x_0\ \ y_0]^T$ and the aspect ratio $s_y/s_x$, while all other camera parameters are fixed. Convergence analysis for this new
parameter space partition method has been given in Chatterjee et al. (1997).
However, no convergence speed was provided.
Conclusions
Various 2-D planar patterns have been used as calibration targets. Compared with 3-D calibration objects, 2-D planar patterns can be manufactured more accurately and fit more easily into the view volume of a camera. With known
absolute or relative poses, planar patterns are a special type of 3-D calibration
object. In this case, traditional non-co-planar calibration techniques can be
applied directly or with very little modification (Tsai, 1987). More often, a single
planar pattern is put at several unknown poses to calibrate a camera (Zhang,
2000). Each pose of the planar pattern is called a frame. It has been demon-
strated that this is already adequate for calibrating a camera. The iterative two-
phase strategy discussed can still be applied here. For planar patterns, only phase
1 is different and it is discussed below.
$$\tilde{\mathbf{x}}_{im} \cong \tilde{\mathbf{P}}\cdot\tilde{\mathbf{x}}^w = \tilde{\mathbf{K}}\cdot\mathbf{M}\cdot\tilde{\mathbf{x}}^w = \tilde{\mathbf{K}}\cdot\left[(\mathbf{R}_c^w)^T \;\; -(\mathbf{R}_c^w)^T\mathbf{t}_c^w\right]\cdot\tilde{\mathbf{x}}^w = \tilde{\mathbf{K}}\cdot\left[(\mathbf{R}_c^w)^T\mathbf{x}^w - (\mathbf{R}_c^w)^T\mathbf{t}_c^w\right]$$

$$= \tilde{\mathbf{K}}\cdot\left[(\mathbf{R}_c^w)^T\left(\mathbf{R}_o^w\mathbf{x}^o + \mathbf{t}_o^w\right) - (\mathbf{R}_c^w)^T\mathbf{t}_c^w\right] = \tilde{\mathbf{K}}\cdot\left[(\mathbf{R}_c^w)^T\mathbf{R}_o^w \;\;\; (\mathbf{R}_c^w)^T\left(\mathbf{t}_o^w - \mathbf{t}_c^w\right)\right]\cdot\tilde{\mathbf{x}}^o \qquad (45)$$

$$= \tilde{\mathbf{K}}\cdot[\mathbf{R}\;\;\mathbf{t}]\cdot\left[x^o\;\;y^o\;\;0\;\;1\right]^T = \tilde{\mathbf{K}}\cdot[\mathbf{r}_1\;\;\mathbf{r}_2\;\;\mathbf{t}]\cdot\left[x^o\;\;y^o\;\;1\right]^T,$$
where $\mathbf{R} = (\mathbf{R}_c^w)^T\cdot\mathbf{R}_o^w = [\mathbf{r}_1\;\;\mathbf{r}_2\;\;\mathbf{r}_3]$ ($\mathbf{r}_j$, $j = 1\ldots3$, is the $j$th column of the matrix $\mathbf{R}$) and $\mathbf{t} = (\mathbf{R}_c^w)^T\cdot(\mathbf{t}_o^w - \mathbf{t}_c^w)$.
Thus:

$$\tilde{\mathbf{x}}_{im} \cong \mathbf{H}\cdot\left[x^o\;\;y^o\;\;1\right]^T \quad\text{where}\quad \mathbf{H} = \tilde{\mathbf{K}}\cdot[\mathbf{r}_1\;\;\mathbf{r}_2\;\;\mathbf{t}]. \qquad (46)$$
By applying the simple DLT method introduced, with at least four pairs of corresponding $[x_{im}\;\;y_{im}]^T$ and $[x^o\;\;y^o]^T$, $\mathbf{H}$ can be determined up to a non-zero factor as $\tilde{\mathbf{H}}$, which means $\mathbf{H} \cong \tilde{\mathbf{H}}$.
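A minimal sketch of this DLT step: each correspondence between a plane point $[x^o\ y^o]^T$ and its image contributes two linear equations in the entries of $\mathbf{H}$, and the solution is taken as the right singular vector of the stacked system with the smallest singular value. Function and variable names are illustrative.

```python
import numpy as np

def estimate_homography(plane_pts, img_pts):
    """DLT estimate of H (up to scale) from >= 4 correspondences.

    plane_pts : (N, 2) points [x_o, y_o] on the calibration plane.
    img_pts   : (N, 2) corresponding image points [x_im, y_im].
    """
    rows = []
    for (xo, yo), (u, v) in zip(plane_pts, img_pts):
        p = np.array([xo, yo, 1.0])
        rows.append(np.hstack([p, np.zeros(3), -u * p]))
        rows.append(np.hstack([np.zeros(3), p, -v * p]))
    A = np.array(rows)
    _, _, Vt = np.linalg.svd(A)          # H spans the (near-)null space of A
    H = Vt[-1].reshape(3, 3)
    return H / np.linalg.norm(H)         # remove the free scale
```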
As the inverse $\tilde{\mathbf{K}}^{-1}$ of $\tilde{\mathbf{K}}$ exists (ref. equation 32), equation 46 can be rewritten as:

$$[\mathbf{r}_1\;\;\mathbf{r}_2\;\;\mathbf{t}] = \tilde{\mathbf{K}}^{-1}\cdot\mathbf{H} \cong \tilde{\mathbf{K}}^{-1}\cdot\tilde{\mathbf{H}}. \qquad (47)$$

Thus:

$$\mathbf{r}_1 = \rho\,\tilde{\mathbf{K}}^{-1}\cdot\tilde{\mathbf{h}}_1 \quad\text{and}\quad \mathbf{r}_2 = \rho\,\tilde{\mathbf{K}}^{-1}\cdot\tilde{\mathbf{h}}_2,$$

where $\rho$ is a non-zero scale factor and $\tilde{\mathbf{h}}_i$ is the $i$th column of $\tilde{\mathbf{H}}$. Since $\mathbf{r}_1$ and $\mathbf{r}_2$ are orthonormal, two constraints follow:

$$\tilde{\mathbf{h}}_1^T\,\tilde{\mathbf{K}}^{-T}\tilde{\mathbf{K}}^{-1}\,\tilde{\mathbf{h}}_2 = 0, \qquad (48)$$

$$\tilde{\mathbf{h}}_1^T\,\tilde{\mathbf{K}}^{-T}\tilde{\mathbf{K}}^{-1}\,\tilde{\mathbf{h}}_1 = \tilde{\mathbf{h}}_2^T\,\tilde{\mathbf{K}}^{-T}\tilde{\mathbf{K}}^{-1}\,\tilde{\mathbf{h}}_2, \qquad (49)$$
where $\tilde{\mathbf{K}}^{-T}\tilde{\mathbf{K}}^{-1}$ is called the Image of Absolute Conic (IAC), which has been applied successfully in self-calibration (Hartley & Zisserman, 2000). Once the IAC of a camera is located, the geometry of this camera has been determined. Equations 48 and 49 thus provide two constraints for the intrinsic matrix $\tilde{\mathbf{K}}^{-1}$ with one frame. Since $\tilde{\mathbf{K}}^{-1}$ has four DOFs (ref. equation 32), if two frames (which means two different $\mathbf{H}$) are available, $\tilde{\mathbf{K}}^{-1}$ (and all four intrinsic parameters) can then be recovered.
Once $\tilde{\mathbf{K}}^{-1}$ is determined, $\mathbf{r}_1$, $\mathbf{r}_2$, and $\mathbf{t}$ can be calculated directly from equation 47 under the constraint $\mathbf{r}_1^T\mathbf{r}_1 = \mathbf{r}_2^T\mathbf{r}_2 = 1$. It follows that $\mathbf{r}_3$ can then be computed as $\mathbf{r}_3 = \mathbf{r}_1 \times \mathbf{r}_2$. Here it is obvious that, if $\mathbf{r}_1$ and $\mathbf{r}_2$ are solutions of equation 47, $\mathbf{r}_1' = -\mathbf{r}_1$ and $\mathbf{r}_2' = -\mathbf{r}_2$ also satisfy equation 47. Again, the correct solutions can be verified by means of the oriented projective geometry.
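Given the recovered intrinsic matrix, the decomposition just described can be sketched as follows. The scale $\rho$ is fixed from the unit-norm constraint on $\mathbf{r}_1$, and $\mathbf{r}_3 = \mathbf{r}_1 \times \mathbf{r}_2$; the final SVD-based re-orthogonalization is a common practical addition (the estimated $\mathbf{H}$ is noisy) and not part of the chapter's derivation, and the sign check via oriented projective geometry is omitted.

```python
import numpy as np

def pose_from_homography(K, H):
    """Recover [r1 r2 r3 | t] from a plane-to-image homography H,
    given the intrinsic matrix K (sketch, no sign disambiguation)."""
    Kinv = np.linalg.inv(K)
    h1, h2, h3 = H[:, 0], H[:, 1], H[:, 2]
    rho = 1.0 / np.linalg.norm(Kinv @ h1)   # scale from ||r1|| = 1
    r1 = rho * (Kinv @ h1)
    r2 = rho * (Kinv @ h2)
    r3 = np.cross(r1, r2)
    t = rho * (Kinv @ h3)
    R = np.column_stack([r1, r2, r3])
    # Re-orthogonalize the (noisy) rotation estimate with an SVD.
    U, _, Vt = np.linalg.svd(R)
    return U @ Vt, t
```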
In the single-camera case, without loss of generality, it can be assumed that $\mathbf{R}_c^w = \mathbf{I}$ and $\mathbf{t}_c^w = \mathbf{0}$. Then $\mathbf{R}_o^w = \mathbf{R}$ and $\mathbf{t}_o^w = \mathbf{t}$. However, in a multiple-camera configuration, which is discussed in the next section, things are not so simple.
Conclusions
Using a planar pattern as the calibration object considerably eases the acquisition of calibration data. The corresponding calibration method is simple and efficient. However, the algorithm described above only holds for the single-camera case. If multiple cameras need to be calibrated in one system, the calibration algorithm should be modified to obtain higher accuracy (Slama, 1980). This issue is discussed next.
Suppose that there are n cameras (n > 1) and m frames (m > 1) for which the
relative poses among all cameras are to be recovered.
In the general situation, let us assume that each frame can be viewed by every
camera. The whole camera and frame set in this configuration is called
complete. By applying the linear geometry estimation techniques discussed, the
relative pose between camera i and frame j can be computed as:
Orientation: $\mathbf{R}_{ij} = (\mathbf{R}_{ci}^w)^T\cdot\mathbf{R}_{oj}^w$,
Position: $\mathbf{t}_{ij} = (\mathbf{R}_{ci}^w)^T\cdot(\mathbf{t}_{oj}^w - \mathbf{t}_{ci}^w)$,

where $i = 1\ldots n$ and $j = 1\ldots m$, $\mathbf{R}_{ci}^w$ is the orientation of camera $i$ and $\mathbf{t}_{ci}^w$ its position, and $\mathbf{R}_{oj}^w$ is the orientation of frame $j$ and $\mathbf{t}_{oj}^w$ its position.
Writing all orientation matrices into one large matrix yields:

$$\begin{bmatrix} \mathbf{R}_{11} & \cdots & \mathbf{R}_{1m} \\ \vdots & \ddots & \vdots \\ \mathbf{R}_{n1} & \cdots & \mathbf{R}_{nm} \end{bmatrix} = \underbrace{\begin{bmatrix} (\mathbf{R}_{c1}^w)^T \\ \vdots \\ (\mathbf{R}_{cn}^w)^T \end{bmatrix}}_{\mathbf{M}_c} \cdot \underbrace{\begin{bmatrix} \mathbf{R}_{o1}^w & \cdots & \mathbf{R}_{om}^w \end{bmatrix}}_{\mathbf{M}_o}.$$
Similarly, stacking all position equations into one large linear system results in:

$$\underbrace{\begin{bmatrix}
1 & 0 & \cdots & 0 & \\
\vdots & \vdots & & \vdots & -\mathbf{I}_{n\times n} \\
1 & 0 & \cdots & 0 & \\
 & & \ddots & & \vdots \\
0 & \cdots & 0 & 1 & \\
\vdots & & \vdots & \vdots & -\mathbf{I}_{n\times n} \\
0 & \cdots & 0 & 1 &
\end{bmatrix}}_{\mathbf{A}}
\cdot
\underbrace{\begin{bmatrix}
\mathbf{t}_{o1}^w \\ \vdots \\ \mathbf{t}_{om}^w \\ \mathbf{t}_{c1}^w \\ \vdots \\ \mathbf{t}_{cn}^w
\end{bmatrix}}_{\mathbf{x}}
=
\underbrace{\begin{bmatrix}
\mathbf{R}_{c1}^w\mathbf{t}_{11} \\ \vdots \\ \mathbf{R}_{cn}^w\mathbf{t}_{n1} \\ \vdots \\ \mathbf{R}_{c1}^w\mathbf{t}_{1m} \\ \vdots \\ \mathbf{R}_{cn}^w\mathbf{t}_{nm}
\end{bmatrix}}_{\mathbf{b}},$$

where each entry of $\mathbf{A}$ is understood blockwise (a 1 stands for the $3\times3$ identity matrix, and each row pattern $[1\;\;0\;\cdots\;0\;\;|\;-\mathbf{I}_{n\times n}]$ is repeated for every camera viewing the corresponding frame).
The least-squares solution of this system is

$$\mathbf{x} = \left(\mathbf{A}^T\mathbf{A}\right)^{-1}\left(\mathbf{A}^T\mathbf{b}\right),$$

where the sparsity of the matrix $\mathbf{A}^T\mathbf{A}$ can be exploited to improve the efficiency of the computation.
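The sketch below assembles and solves this system with dense matrices for readability (in practice the sparsity of A would be exploited, as noted). The gauge-fixing row that pins the first camera at the origin is an addition of this sketch, not part of the chapter's formulation; names are illustrative.

```python
import numpy as np

def solve_positions(R_cam, t_rel):
    """Solve the stacked linear system for all frame and camera
    positions in one WCS (sketch for a complete camera/frame set).

    R_cam : list of n camera orientations R_ci^w (3x3 arrays).
    t_rel : t_rel[i][j] = t_ij, the position of frame j seen from camera i.
    Returns (frame_positions (m, 3), camera_positions (n, 3)).
    """
    n, m = len(t_rel), len(t_rel[0])
    cols = 3 * (m + n)
    A, b = [], []
    for j in range(m):
        for i in range(n):
            row = np.zeros((3, cols))
            row[:, 3 * j:3 * j + 3] = np.eye(3)               # + t_oj^w
            row[:, 3 * (m + i):3 * (m + i) + 3] = -np.eye(3)  # - t_ci^w
            A.append(row)
            b.append(R_cam[i] @ t_rel[i][j])
    # Gauge fixing (assumption of this sketch): first camera at the origin.
    gauge = np.zeros((3, cols))
    gauge[:, 3 * m:3 * m + 3] = np.eye(3)
    A.append(gauge)
    b.append(np.zeros(3))
    A, b = np.vstack(A), np.hstack(b)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x[:3 * m].reshape(m, 3), x[3 * m:].reshape(n, 3)
```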
However, the camera and frame set is not always complete, which means that
some frames are not visible in certain cameras. For this situation there are two
solutions: 1) The whole set can be decomposed into several subsets that are
complete by themselves. Then, for each subset the above calculations can be
done independently and subsequently combined into one WCS; and 2) The
problem is treated as a missing data system, which can be solved by the
interpolation method proposed in Sturm (2000).
Vanishing points
Vanishing points have several properties that are useful for calibration:

1) The vanishing points of all sets of parallel lines lying on the same plane lie on the same line in the image; this line is called the vanishing line (Wang & Tsai, 1991);

2) Given the vanishing points of three sets of mutually orthogonal lines, the orthocenter of the triangle with the three vanishing points as vertices is the principal point;

3) Given the vanishing points $[x_1\;\;y_1]^T$ and $[x_2\;\;y_2]^T$ of two sets of orthogonal lines, it can be shown that $x_1x_2 + y_1y_2 + f^2 = 0$ (Guillou, Meneveaux, Maisel & Bouatouch, 2000);

4) If the camera moves, the motion of the vanishing points in the image plane depends only on the camera rotation, not on the camera translation. The vanishing points of three non-co-planar sets of parallel lines fully determine the rotation matrix (Guillou et al., 2000); and

5) Given the vanishing points $[x_1\;\;y_1]^T$, $[x_2\;\;y_2]^T$, $[x_3\;\;y_3]^T$ of three sets of mutually orthogonal lines, from equation 5 it can be verified immediately that (Cipolla, Drummond, & Robertson, 1999):

$$\begin{bmatrix} \lambda_1 x_1 & \lambda_2 x_2 & \lambda_3 x_3 \\ \lambda_1 y_1 & \lambda_2 y_2 & \lambda_3 y_3 \\ \lambda_1 & \lambda_2 & \lambda_3 \end{bmatrix} = \tilde{\mathbf{P}} \cdot \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix} = \tilde{\mathbf{K}}\left(\mathbf{R}_c^w\right)^T.$$
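As a small illustration of property 3 above, the sketch below estimates the effective focal length from the vanishing points of two orthogonal direction sets. It assumes the principal point is known (coordinates are shifted by it before applying the relation) and a unit aspect ratio; these assumptions, and the names used, are not from the chapter.

```python
import numpy as np

def focal_from_vanishing_points(v1, v2, principal_point):
    """Estimate f from two vanishing points of orthogonal directions via
    x1*x2 + y1*y2 + f^2 = 0, with coordinates taken relative to the
    principal point.  Returns None for a degenerate configuration."""
    x1, y1 = np.asarray(v1, dtype=float) - principal_point
    x2, y2 = np.asarray(v2, dtype=float) - principal_point
    f2 = -(x1 * x2 + y1 * y2)
    return float(np.sqrt(f2)) if f2 > 0 else None
```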
Other techniques
Besides algorithms for calibrating cameras, other related aspects have been
investigated to refine the performance of the camera calibration system. For
example, several suggestions on the design of the calibration object and the data
acquisition for increasing the calibration performance were given in Tu &
Dubuisson (1992).
Self-Calibration
they can be used directly for other applications, as well, such as body, face, and
gesture modeling and animation.
Manipulation process
The whole process includes five stages and can be briefly described as follows:
Figure 2. One example face image. Left: The original image. Middle: The
image after a hard thresholding. Right: Markers to be tracked through the
sequence.
3-D reconstruction
For a single calibrated camera, eliminating the unknown scale factor from the projection equation yields:

$$\begin{bmatrix} p^1_4 - p^3_4\,x_{im} \\ p^2_4 - p^3_4\,y_{im} \end{bmatrix} = \begin{bmatrix} -p^1_1 + p^3_1\,x_{im} & -p^1_2 + p^3_2\,x_{im} & -p^1_3 + p^3_3\,x_{im} \\ -p^2_1 + p^3_1\,y_{im} & -p^2_2 + p^3_2\,y_{im} & -p^2_3 + p^3_3\,y_{im} \end{bmatrix} \begin{bmatrix} x^w \\ y^w \\ z^w \end{bmatrix}. \qquad (50)$$
It is easy to see from equation 50 that no unique solution for $\mathbf{x}^w$ can be determined with one known $\mathbf{x}_{im}$ and the calculated $p^i_j$ ($i = 1\ldots3$; $j = 1\ldots4$). That is why two $\mathbf{x}_{im}$ that correspond to the same $\mathbf{x}^w$ are needed to recover it uniquely. Suppose these two $\mathbf{x}_{im}$ are denoted as $\mathbf{x}^{im}_r$ and $\mathbf{x}^{im}_l$ respectively; then:
$$\underbrace{\begin{bmatrix}
p^1_{r4} - p^3_{r4}\,x^{im}_r \\
p^2_{r4} - p^3_{r4}\,y^{im}_r \\
p^1_{l4} - p^3_{l4}\,x^{im}_l \\
p^2_{l4} - p^3_{l4}\,y^{im}_l
\end{bmatrix}}_{\mathbf{b}}
=
\underbrace{\begin{bmatrix}
-p^1_{r1} + p^3_{r1}\,x^{im}_r & -p^1_{r2} + p^3_{r2}\,x^{im}_r & -p^1_{r3} + p^3_{r3}\,x^{im}_r \\
-p^2_{r1} + p^3_{r1}\,y^{im}_r & -p^2_{r2} + p^3_{r2}\,y^{im}_r & -p^2_{r3} + p^3_{r3}\,y^{im}_r \\
-p^1_{l1} + p^3_{l1}\,x^{im}_l & -p^1_{l2} + p^3_{l2}\,x^{im}_l & -p^1_{l3} + p^3_{l3}\,x^{im}_l \\
-p^2_{l1} + p^3_{l1}\,y^{im}_l & -p^2_{l2} + p^3_{l2}\,y^{im}_l & -p^2_{l3} + p^3_{l3}\,y^{im}_l
\end{bmatrix}}_{\mathbf{B}}
\begin{bmatrix} x^w \\ y^w \\ z^w \end{bmatrix}. \qquad (51)$$

From equation 51, $\mathbf{x}^w$ can be easily calculated using least squares as:

$$\mathbf{x}^w = \left(\mathbf{B}^T\mathbf{B}\right)^{-1}\left(\mathbf{B}^T\mathbf{b}\right). \qquad (52)$$
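Equations 51 and 52 translate directly into code: for a pair of corresponding pixels and the two calibrated projection matrices, stack the four linear equations and solve for $\mathbf{x}^w$ in the least-squares sense. The sketch below follows that construction; the function and argument names are illustrative.

```python
import numpy as np

def triangulate(P_r, P_l, x_r, x_l):
    """Reconstruct a 3-D point from two projection matrices and a pair
    of corresponding image points, following equations 51-52.

    P_r, P_l : 3x4 projection matrices of the right and left cameras.
    x_r, x_l : (x_im, y_im) image coordinates in each view.
    """
    def two_rows(P, x, y):
        B = np.array([-P[0, :3] + P[2, :3] * x,
                      -P[1, :3] + P[2, :3] * y])
        b = np.array([P[0, 3] - P[2, 3] * x,
                      P[1, 3] - P[2, 3] * y])
        return B, b

    B_r, b_r = two_rows(P_r, *x_r)
    B_l, b_l = two_rows(P_l, *x_l)
    B = np.vstack([B_r, B_l])
    b = np.hstack([b_r, b_l])
    # x^w = (B^T B)^{-1} B^T b, computed via a least-squares solver.
    xw, *_ = np.linalg.lstsq(B, b, rcond=None)
    return xw
```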
Figure 3. The projection and 3-D reconstruction process of the 3-D face model. From two cameras $C_L$ and $C_R$ we get a pair of stereo images through projection. 3-D reconstruction is then applied on each pair of corresponding points $\mathbf{x}^{im}_r$ and $\mathbf{x}^{im}_l$ to obtain their original 3-D correspondence $\mathbf{x}^w$. The reconstructed 3-D model can then be projected into an arbitrary virtual camera $C_D$ to form a virtual view. This is just the traditional view transformation process. Alternatively, the 3-D model reconstruction step could be skipped and the virtual view synthesized directly from the two stereo views. This image-based-rendering idea will be explored in the next section.
Since all tracked points are important facial control points, the reconstructed sequence of 3-D coordinates of those points reflects the facial actions to a reasonable extent. Thus, this sequence can be used as a coarse 3-D face model. If more control points are selected, a finer 3-D model can be obtained.
Based on the reconstructed model, FAPs can be calculated. To show the
accuracy of the face model qualitatively, a VRML file is generated automati-
cally. Figure 4 shows three example frames of such a VRML file.
Figure 4. Three example frames of the reconstructed 3-D face model. Blue
points are control points that were reconstructed.
Telepresence
be reconstructed directly from the two views from cameras 1 and 2 of site 2
(remote site) at the local site according to the current viewpoint of participant A.
This process will be specifically discussed below. Due to the symmetry in the
system, the reconstruction for the other participants is similar.
Every 40ms, the fixed stereo set-up at the remote site acquires two images. After
segmentation, the pair of stereo views, containing only the remote participant
without background, is broadcast to the local site. Locally, the two views, based
on the information about the stereo set-up, the local display, and the pose
(position and orientation) of the local participant, are used to reconstruct a novel
view (“telepresence”) of the remote participant that is adapted to the current
local viewpoint. The reconstructed novel view is then combined with a man-
made uniform virtual environment to give the local participant the impression that
he/she is in a local conference with the remote participant. The whole processing
chain is shown in Figure 6.
Obviously, all parameters of each of the three four-camera set-ups should be
computed beforehand. The calibration is done by combining the linear estimation
technique and the Levenberg-Marquardt nonlinear optimization method.
With explicitly recovered camera parameters, the view can be transformed in a
very flexible and intuitive way, discussed briefly in the next section.
Figure 6. The processing chain for adapting the synthesized view of one
participant in line with the viewpoint change of another participant. Based
on a pair of stereo sequences, the “virtually” perceived view should be
reconstructed and integrated seamlessly with the man-made uniform
environment in real time.
View transformation
Without loss of generality, the WCS can be chosen such that:

$$\mathbf{t}_{cL} = [1\;\;0\;\;0]^T, \qquad \mathbf{t}_{cR} = [-1\;\;0\;\;0]^T, \qquad \mathbf{t}_{cD} = [x_{cD}\;\;y_{cD}\;\;z_{cD}]^T,$$

where $\mathbf{t}_{cL}$, $\mathbf{t}_{cR}$, and $\mathbf{t}_{cD}$ are the position vectors of $C_L$, $C_R$, and $C_D$, respectively. This means that the x-axis of the WCS lies on the baseline $b$ of $C_L$ and $C_R$, and points from $C_R$ to $C_L$. The origin of the WCS is at the middle point of $b$; that is, the unit of distance is $b/2$.
In the general case, the view transformation process can be divided into five
steps (see Figure 7):
1) Rectification: The two source views are warped so that the corresponding virtual cameras $C_{rectiL}$ and $C_{rectiR}$ share the same image plane. This process is known as stereo rectification (Hartley, 1999) and is intended to eliminate the photometric and orientation differences between the two source cameras, to simplify the correspondence estimation into a 1-D search problem along the scan line, and at the same time to provide parallel-processing possibilities for later steps.
2) X-interpolation: Given the disparity information, the two parallel views $V_{rectiL}$ and $V_{rectiR}$ are combined by interpolation or extrapolation (Seitz & Dyer, 1995) to produce another parallel view $V_X$. The corresponding camera $C_X$ is located at $[x_{cD}\;\;0\;\;0]^T$ with the same rotation and intrinsic parameters as $C_{rectiL}$ and $C_{rectiR}$. The y coordinate of each pixel remains the same, while the x coordinate is transformed by

$$x^X_p = x^{rectiL}_p + \frac{1 - x_{cD}}{2}\,d^{LR}_p$$

and/or (in case of occlusion)

$$x^X_p = x^{rectiR}_p + \frac{1 + x_{cD}}{2}\,d^{RL}_p,$$

where $x^*_p$ is the x coordinate of pixel $p^*$ in view $V_*$ ($* = X, rectiL, rectiR$). $p^{rectiL}$ and $p^{rectiR}$ are projections of the same 3-D point, and $d^{LR}_p$ and $d^{RL}_p$ are the disparities of $p^{rectiL}$ and $p^{rectiR}$ respectively, where $d^{LR}_p = x^{rectiR}_p - x^{rectiL}_p$ and $d^{RL}_p = x^{rectiL}_p - x^{rectiR}_p$. Note that in the case of occlusion, either $p^{rectiL}$ or $p^{rectiR}$ is not available (Lei & Hendriks, 2002). Through this step, the difference in x position with the final view $V_D$ is eliminated.
3) Y-extrapolation: The X-interpolated view $V_X$ is extrapolated (Scharstein, 1999) by shifting pixels in the y direction to produce the view $V_Y$, which comes from a virtual camera $C_Y$ located at $[x_{cD}\;\;y_{cD}\;\;0]^T$ with the same rotation and intrinsic parameters as $C_X$. In this process, the x coordinate of each pixel remains the same while the y coordinate is transformed by $y^Y_p = y^X_p - y_{cD}\cdot\frac{s_x}{s_y}\cdot d^X_p$, where $y^*_p$ is the y coordinate of pixel $p^*$ in view $V_*$
Results
Figure 9. All intermediate and final results from the view reconstruction are
listed in this figure together with the disparity data.
(c) Rectified left view $V_{rectiL}$; (d) rectified right view $V_{rectiR}$; (e) left disparity map; (f) right disparity map; (g) X-interpolated view $V_X$; (h) Y-extrapolated view $V_Y$; (i) reconstructed view $V_D$; (j) real destination view.
References
Abdel-Aziz, Y. & Karara, H. (1971). Direct linear transformation into object
space coordinates in close-range photogrammetry. In Proc. symposium on
close-range photogrammetry, Urbana, IL, 1-18.
Moons, T., Gool, L., Proesmans, M. & Pauwels, E. (1996). Affine reconstruc-
tion from perspective image pairs with a relative object-camera translation
in between. Pattern Analysis and Machine Intelligence, 18(1), 77-83.
Pedersini, F., Sarti, A. & Tubaro, S. (1999). Multi-camera systems: Calibration
and applications. IEEE Signal Processing Magazine, 16(3), 55-65.
Penna, M. (1991). Camera calibration: A quick and easy way to determine the
scale factor. IEEE Transaction PAMI, 12, 1240-1245.
Perš, J. & Kovačič, S. (2002). Nonparametric, model-based radial lens distor-
tion correction using tilted camera assumption. In H. Wildenauer & W.
Kropatsch (Eds.), Proc. the computer vision winter workshop 2002,
Bad Aussee, Austria, 286-295.
Pollefeys, M., Koch, R. & Gool, L. (1999). Self-calibration and metric recon-
struction in spite of varying and unknown intrinsic camera parameters.
International Journal of Computer Vision, 32(1), 7-25.
Press, W., Teukolsky, S., Vetterling, W. & Flannery, B. (1992). Numerical
recipes in C (Second ed.). Cambridge: Cambridge University Press.
Quan, L. & Triggs, B. (2000). A unification of autocalibration methods. In Proc.
ACCV’2000. Taipei, Taiwan.
Redert, P. (2000). Multi-viewpoint systems for 3-d visual communication.
Unpublished doctoral dissertation, Delft University of Technology.
Scharstein, D. (1999). View synthesis using stereo vision (Vol. 1583). Springer
Verlag.
Scott, T. & Mohan, T. (1995). Residual uncertainty in three-dimensional
reconstruction using two-planes calibration and stereo methods. Pattern
Recognition, 28(7), 1073-1082.
Seitz, S. & Dyer, C. (1995). Physically-valid view synthesis by image interpo-
lation. In Proc. workshop on representation of visual scenes, MIT,
Cambridge, MA, 18-25.
Sid-Ahmed, M. & Boraie, M. (1990). Dual camera calibration for 3-d machine
vision metrology. IEEE Transactions on Instrumentation and Measure-
ment, 39(3), 512-516.
Slama, C. (1980). Manual of photogrammetry (Fourth ed.). Falls Church, VA:
American Society of Photogrammetry and Remote Sensing.
Spetsakis, M. & Aloimonos, J. (1990). Structure from motion using line corre-
spondences. International Journal of Computer Vision, 4, 171-183.
Stavnitzky, J. & Capson, D. (2000). Multiple camera model-based 3-d visual
servo. IEEE Transactions on Robotics and Automation, 16(6), 732-739.
Weng, J., Cohen, P. & Herniou, M. (1992). Camera calibration with distortion
models and accuracy evaluation. IEEE Transaction PAMI, 14(10), 965-
980.
Wilczkowiak, M., Boyer, E. & Sturm, P. (2001). Camera calibration and 3d
reconstruction from single images using parallelepipeds. In Proc.
ICCV’2001, Vancouver, Canada, 142-148.
Willson, R. (1994). Modeling and calibration of automated zoom lenses.
Unpublished doctoral dissertation, Department of Electrical and Computer
Engineering, Carnegie Mellon University.
Wolberg, G. (1990). Digital image warping. Los Alamitos, CA: IEEE Com-
puter Society Press.
Xu, G., Terai, J. & Shum, H. (2000). A linear algorithm for camera self-
calibration, motion and structure recovery for multi-planar scenes from two
perspective images. In Proc. CVPR’2000, Hilton Head Island, SC, 474-
479.
Xu, L., Lei, B. & Hendriks, E. (2002). Computer vision for 3d visualization and
telepresence collaborative working environment. BT Technical Journal,
64-74.
Zhang, Z. (2000). A flexible new technique for camera calibration. IEEE
Transaction PAMI, 22(11), 1330-1334.
Zhuang, H. & Wu, W. (1996). Camera calibration with a near-parallel (ill-
conditioned) calibration board configuration. IEEE Transactions on Ro-
botics and Automation, 12, 918-921.
Chapter IV
Real-Time Analysis of
Human Body Parts and
Gesture-Activity
Recognition in 3D
Burak Ozer
Princeton University, USA
Tiehan Lv
Princeton University, USA
Wayne Wolf
Princeton University, USA
Abstract
Introduction
Three-dimensional motion estimation has a wide range of applications, from
video surveillance to virtual animation. Therefore, reconstruction of visual
information from multiple cameras and its analysis has been a research area for
many years in computer vision and computer graphics communities. Recent
advances in camera and storage systems are main factors driving the increased
popularity of multi-camera systems. Prices continue to drop on components, e.g.,
CMOS cameras, while manufacturers have added more features. Furthermore,
the evolution of digital video, especially in digital video storage and retrieval
systems, is another leading factor.
In this chapter, we focus on real-time processing of multiple views for practical
applications, such as smart rooms and video surveillance systems. The increased
importance of applications requiring fast, cheap, small and highly accurate smart
cameras necessitates research efforts to provide efficient solutions to the
problem of real-time detection of persons and classification of their activities. A
great effort has been devoted to three-dimensional human modeling and motion
estimation by using multi-camera systems in order to overcome the problems due
to the occlusion and motion ambiguities related to projection into the image plane.
However, the introduced computational complexity is the main obstacle for many
practical applications.
This chapter investigates the relationship between the activity recognition
algorithms and the architectures required to perform these tasks in real time. We
focus on the concepts of three-dimensional human detection and activity
recognition for real-time video processing. As an example, we present our real-
time human detection and activity recognition algorithm and our multi-camera test-bed architecture. We extend our previous 2D method to 3D applications and
propose a new algorithm for generating a global 3D human model and activity
classification.
• Surveillance
• Provide security in a campus, shopping mall, office complex, casino, etc.
• Detect people’s movements, gestures and postures from a security check-
point in an airport, parking garage, or other facility
• Traffic
• Monitor pedestrian activity in an uncontrolled and/or controlled crosswalk
• Smart Environments
• Entertainment
3D scene synthesis and analysis, by using visible light and multiple cameras, has
been studied by many researchers. Before considering some of these methods,
it is beneficial to review general stereo vision issues with respect to their real-
time applicability. There are three basic problems, namely correspondence
(disparity map), reconstruction, and rendering.
One well-known technique for obtaining depth information from digital images
is the stereo technique. In stereo techniques, the objective is to solve the
correspondence problem, i.e., to find the corresponding points in the left and right
image. For each scene element in one image, a matching scene element in the
other image is identified. The difference in the spatial position of the correspond-
ing points, namely disparity, is stored in a disparity map. Whenever the corre-
sponding points are determined, the depth can be computed by triangulation.
Attempts to solve the correspondence problem have produced many variations,
which can be grouped into matching pixels and matching features, e.g., edges.
The former approach produces dense depth maps while the latter produces
sparse depth maps. The specific approach desired depends on the objective of
the application. In some applications, e.g., the reconstruction of complex
surfaces, it is desirable to compute dense disparity maps defined for all pixels in
the image. Unfortunately, most of the existing dense stereo techniques are very
time consuming.
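A minimal block-matching sketch of pixel-based (dense) correspondence along rectified scan lines: for each pixel, the disparity minimizing the sum of squared differences (SSD) inside a small window is kept. This naive reference version is written for clarity rather than speed and is only an illustration of the general idea, not any of the systems discussed here.

```python
import numpy as np

def ssd_disparity(left, right, max_disp=32, half_win=3):
    """Dense disparity map from a rectified gray-level stereo pair by
    winner-take-all SSD block matching (naive reference version)."""
    h, w = left.shape
    disp = np.zeros((h, w), dtype=np.int32)
    left = left.astype(np.float32)
    right = right.astype(np.float32)
    for y in range(half_win, h - half_win):
        for x in range(half_win + max_disp, w - half_win):
            block = left[y - half_win:y + half_win + 1,
                         x - half_win:x + half_win + 1]
            best, best_d = np.inf, 0
            for d in range(max_disp):
                cand = right[y - half_win:y + half_win + 1,
                             x - d - half_win:x - d + half_win + 1]
                ssd = np.sum((block - cand) ** 2)
                if ssd < best:
                    best, best_d = ssd, d
            disp[y, x] = best_d
    return disp
```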
Even though stereo vision techniques are used in many image processing applications, the computational complexity of matching stereo images is still the main obstacle for practical applications. Therefore, computationally fast stereo techniques are required for real-time applications. Given the algorithmic complexity of stereo vision techniques, general-purpose computers are not fast enough to meet real-time requirements, which necessitates the use of parallel algorithms and/or special hardware to achieve real-time execution.
Two main performance evaluation metrics are the throughput, that is, frame rate times frame size, and the range of the disparity search, which determines the dynamic range of the distance measurement. A great deal of research is still devoted to developing stereo systems that achieve the desired performance. The PRISM3
system (Nishihara, 1990), developed by Teleos, the JPL stereo implemented on
DataCube (Matthies, 1992), CMU’s warp-based multi-baseline stereo (Webb,
1993), and INRIA’s system (Faugeras, 1993) are some of the early real-time
stereo systems. Yet, they do not provide a complete video-rate output of range
as dense as the input image with low latency. Another major problem is that the
depth maps obtained by current stereo systems are not very accurate or reliable.
At Carnegie Mellon, a video rate stereo machine was developed (Kanade et al.,
1996) where multiple images are obtained by multiple cameras to produce
different baselines in lengths and in directions. The multi-baseline stereo method
consists of three steps. The first step is the Laplacian of Gaussian (LOG) filtering
of input images. This enhances the image features, as well as removes the effect
of intensity variations among images due to the difference in camera gains,
ambient light, etc. The second step is the computation of sum-of-squared-differences (SSD) values for all stereo image pairs and the summation of the SSD values to produce the sum-of-sum-of-squared-differences (SSSD) function. Image interpolation for sub-pixel re-sampling is required in this process. The
third and final step is the identification and localization of the minimum of the
SSSD function to determine the inverse depth. Uncertainty is evaluated by
analyzing the curvature of the SSSD function at the minimum. All these
measurements are done in one-tenth sub-pixel precision. One of the advantages
of this multi-baseline stereo technique is that it is completely local in its
computation without requiring any global optimization or comparison.
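The SSSD idea can be sketched as follows: SSD values computed for each camera pair are expressed over a common inverse-depth axis (disparity scales with the baseline) and summed before the minimum is taken. The LOG prefiltering, sub-pixel interpolation, and uncertainty analysis of the original machine are omitted, the sign convention of the shift depends on the camera arrangement, and border handling is crude; this is a rough illustration, not the CMU implementation.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sssd_inverse_depth(ref, others, baselines, f, zetas, half_win=3):
    """Multi-baseline stereo (sketch): for every candidate inverse depth
    zeta, the disparity in pair k is f * B_k * zeta; the windowed SSD
    values of all pairs are summed (SSSD) and the zeta minimizing the
    SSSD is kept per pixel."""
    best = np.full(ref.shape, np.inf)
    inv_depth = np.zeros(ref.shape)
    win = 2 * half_win + 1
    for zeta in zetas:
        sssd = np.zeros(ref.shape)
        for img, B in zip(others, baselines):
            d = int(round(f * B * zeta))         # disparity for this pair
            shifted = np.roll(img, d, axis=1)    # crude shift (wraps at border)
            sssd += uniform_filter((ref - shifted) ** 2, size=win)
        better = sssd < best
        best[better] = sssd[better]
        inv_depth[better] = zeta
    return inv_depth
```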
Schreer et al. (2001) developed a real-time disparity algorithm for immersive
teleconferencing. It is a hybrid and pixel recursive disparity analysis approach,
called hybrid recursive matching (HRM). The computational time is minimized
by the efficient selection of a small number of candidate vectors, guaranteeing
both spatial and temporal consistency of disparities. The authors use cameras,
mounted around a wide screen, yielding a wide-baseline stereo geometry. The
authors compare the real-time performance of their algorithm with a pyramid
approach, based on multi-resolution images, and with a two stage hierarchical
block-matching algorithm. The proposed method can achieve a processing speed of 40 ms per frame for the HRM algorithm in the case of sparse fields with block sizes of 8 by 8 pixels.
In Koschan & Rodehorst’s (1995) work, parallel algorithms are proposed to
obtain dense depth maps from color stereo images employing a block matching
approach. The authors compare single-processor and multiple-processor performance to evaluate the benefit of parallel realizations. The authors present
computing times for block matching and edge-based stereo algorithms for
multiple processing units that run in parallel on different hardware configura-
tions.
A commercial system with small-baseline cameras has been developed by
Videre Design. From two calibrated cameras, the system generates a disparity
image in real-time by using area based stereo matching (Konolige, 1997). Their
algorithm has four major blocks, namely LOG transform, variable disparity
search, post-filtering, and interpolation. The special purpose hardware consists
of two CMOS 320x240 grayscale imagers and lenses, low power A/D convert-
ers, a digital signal processor, and a small flash memory for program storage. The
board communicates with the host PC via the parallel port. Second generation
hardware uses a DSP from Texas Instruments (TMS320C60x).
from prior information. In Yang et al. (2002), the authors use graphics hardware that effectively combines a plane-sweeping algorithm with view synthesis for real-time 3D scene acquisition.
Rendering
dynamic scenes. Image quality increases with the accuracy of the disparity maps provided with the recorded video streams. In Li et al. (2003), a simulta-
neous visual hull reconstruction and rendering algorithm is proposed by exploiting
off-the-shelf graphics hardware.
Besides special hardware, the use of parallel algorithms is unavoidable for high-speed rendering applications. Early systems, such as Pixar's
CHAP (Levinthal & Porter, 1984) and the commercially available Ikonas
platform (England, 1986), had SIMD processors that could process vertex and
pixel data in parallel. Programmable MIMD machines that could process
triangles in parallel, such as the Pixel Planes (Fuchs et al., 1989) and the SGI
InfiniteReality, had complex low-level custom microcodes and were rarely used.
CPU vendors began to introduce graphics-oriented SIMD processor extensions
into general purpose CPU designs. Examples of these extensions include Intel’s
MMX/SSE instructions, AMD’s 3DNow architecture, and Motorola’s AltiVec
technology. Although such extensions accelerate several graphics operations,
more sophisticated graphics coprocessors, e.g., processors that can support
rendering pipelines, are needed. Such a system has been developed by Sony. The
company designed a custom dual-processor SIMD architecture for graphics
called the Emotion Engine (Kunimatsu et al., 2000).
A detailed survey on graphics hardware can be found in Thompson et al. (2002).
The basic steps for image rendering are shown in Figure 1. The input of the
graphics hardware is raw geometry data specified in some local coordinate
system. The hardware transforms this geometry into world space and performs
lighting and color calculations followed by a texture step. The hardware converts
the vector-based geometry to a pixel-based raster representation, and the
resulting pixels are sent into the screen buffer.
2D
single moving person while our hierarchical and parallel graph matching and
HMM-based activity recognition algorithms enable multi-person detection and
activity recognition.
W4 is another real-time human tracking system (Haritaoglu et al., 1998) where
the background information should be collected before the system can track
foreground objects. The individual body parts are found using a cardboard model
of a walking human as reference. There are a few works that aim to obtain a
more compact representation of the human body without requiring segmentation.
Oren et al. (1997) used wavelet coefficients to find pedestrians in the images,
while Ozer & Wolf (2001) used DCT coefficients that are available in MPEG
movies to detect people and recognize their posture.
Self-occlusion makes the 2D tracking problem hard for arbitrary movements and
some of the systems assume a priori knowledge of the type of movement. The
authors (Wolf et al., 2002) developed a system by using ellipses and a graph-
matching algorithm to detect human body parts and classified the activity of the
body parts via a Hidden Markov Model-based method. The proposed system can
work in real time and has a high correct-classification rate. However, a lot of information is lost during 2D human body detection and activity classification. Generating a 3D model of the scene and of the object of interest
by using multiple cameras can minimize the effects of occlusion, as well as help
to cover a larger area of interest.
3D
One of the early works on tracking articulated objects is by O’Rourke & Badler
(1980). The authors used a 3D model of a person made of overlapping spheres.
They synthesized the model in images, analyzed the images, estimated the pose
of the model and predicted the next pose. Hogg (1983) tracked human activity
and studied periodic walking activity in monocular images. Rehg & Kanade
(1995) built a 3D articulated model of a hand with truncated cones. The authors
minimized the difference between each image and the appearance of the 3D
model. Kakadiaris & Metaxas (1995; 1996) proposed a method to generate the
3D model of an articulated object from different views. The authors used an
extended Kalman filter for motion prediction. Kuch & Huang (1995) modeled the
hand with cubic B-splines and used a tracking technique based on minimization.
Gavrila & Davis (1996) used superquadrics to model the human body. They used
dynamic time warping to recognize human motion.
Munkelt et al. (1998) used markers and stereo to estimate the joints of a 3D
articulated model. Deutscher et al. (1999) tracked the human arm by using a
Kalman filter and the condensation algorithm and compared their performances.
Bregler & Malik (1998) proposed a new method for articulated visual motion
tracking based on exponential maps and twist motions.
Most of the previous work on human detection depends highly on the segmentation results, and motion is mostly used as the cue for segmentation. Most of the activity recognition techniques rely on successful feature extraction, and the proposed approaches are generally only suitable for a specific application type. The
authors have developed a system that can detect a wide range of activities for
different applications. For this reason, our scheme detects different body parts
and their movement in order to combine them at a later stage that connects to
high-level semantics.
Real-Time 3D Analysis
This section is devoted to our proposed method of real-time 3D human motion
estimation. Multi-camera systems are used to overcome self-occlusion problems
in the estimation of articulated human body motion. Since many movements
become ambiguous when projected into the image plane and 2D information
alone can not represent 3D constraints, we use multiple views to estimate 3D
human motion. First, we discuss real-time aspects of 3D human detection and
activity recognition. In the following two subsections we show how these aspects
affect the software and hardware architectures. A detailed analysis of our 3D
human detection/activity recognition algorithm and a testbed architecture for this
particular algorithm are given in the last subsection.
Real-Time Aspects
Real-time aspects are critical for the success of the algorithm. The authors
analyze various aspects and challenges of 3D human detection and activity
recognition algorithms. These include: the instruction statistics, branch behavior,
and memory access behavior of different program parts, e.g., stereo matching,
disparity map generation, reconstruction, projection, 2D/3D human-body part
detection, 2D/3D tracking, 2D/3D activity recognition, etc., in the Section
“Algorithmic Issues.” Note that it is essential to understand the application
behavior to develop efficient hardware for a 3D camera system. Hardware
related aspects and challenges for real-time applications are discussed in the
Section “Hardware Issues.” Decisions such as the number of processors in the
system, the topology of the processors, cache parameters of each processor, the
number of arithmetic logic units, ISA (instruction set architecture), etc., all rely
on the characteristic of the application running in the system. For this purpose,
we focus on a specific algorithm, our proposed 3D human detection/activity
recognition system, and evaluate some extended aspects that are presented in
this section.
Algorithmic issues
We start with operations that are clearly signal-oriented and move steadily away
from the signal representation until the data are very far removed from a
traditional signal representation. In general, the volume of data goes down as
image processing progresses.
Hardware issues
Testbed
In this subsection, we give our testbed architecture where a single camera node
is composed of a standard camera and a TriMedia video processing board.
Designed for media processing, the TriMedia processing board allows Windows
and Macintosh platforms to take advantage of the TriMedia processor via a PCI interface. Multiple TriMedia processing boards can be installed in one host PC
Number of functional units: Constant 5, Integer ALU 5, Load/Store 2, DSP ALU 2, DSP MUL 2, Shifter 2, Branch 3, Int/Float MUL 2, Float ALU 2, Float Compare 1, Float sqrt/div 1
Number of registers: 128
Instruction cache: 32 KB, 8-way
Data cache: 16 KB, 8-way
Number of operation slots per instruction: 5
tional simulation. During the experiment, we use the TriMedia Software Development Kit version tcs2.20, which includes a compiler (tmcc), an assembler (tmas), a linker (tmld), a simulator (tmsim), an execution tool (tmrun), and a profiler (tmprof). The TriMedia system runs on a Dell Precision 210 computer with two TriMedia reference boards. The TriMedia boards can communicate via shared memory, which enables fast data communication for stereo vision applications, e.g., disparity map generation.
2D
A - Low-level Processing:
This section presents the proposed algorithm for the detection of the human body
parts. The algorithm blocks are displayed in Figure 3. A more detailed explana-
tion of our algorithm can be found in Ozer et al. (2000).
bine the meaningful adjacent segments and use them as the input of the
following algorithm steps.
Contour following: We apply a contour-following algorithm that uses a 3x3 filter, which can move in any of eight directions, to follow the edge of the component.
Ellipse fitting: Even when the human body is not occluded by another object, a body part can still be occluded in different ways due to the possible positions of the non-rigid parts. For example, the hand can occlude part of the torso or the legs. In this case, 2D approximation of parts by fitting ellipses with shape-preserving deformations provides more satisfactory results. It also helps to discard the deformations due to clothing. Global approximation methods give more satisfactory results for human detection purposes. Hence, instead of region pixels, parametric surface approximations are used to compute shape descriptors. Details of the ellipse fitting can be found in Ozer & Wolf (2002b).
Object modeling by invariant shape attributes: For object detection, it is
necessary to select part attributes which are invariant to two-dimensional
transformations and are maximally discriminating between objects. Geo-
metric descriptors for simple object segments such as area, compactness
(circularity), weak perspective invariants, and spatial relationships are
computed (Ozer et al., 2000). These descriptors are classified into two
groups: unary and binary attributes. The unary features for human bodies
are: a) compactness, and b) eccentricity. The binary features are: a) ratio
B - High-level Processing:
This section covers the proposed real-time activity recognition algorithm based
on Hidden Markov Models (HMMs). HMM is a statistical modeling tool that
helps to analyze time-varying signals. Online handwriting recognition (Sim &
Kim, 1997), video classification and speech recognition (Rose, 1992) are some
of the application areas of HMMs. Only a few researchers have used HMMs to recognize activities of body parts; they have mainly been used for hand gestures (Starner & Pentland, 1995). The parameterized HMM (Wilson & Bobick, 1999) can recognize complex events such as the interaction of two mobile objects or gestures made with two hands (e.g., “so big,” “so small”). One of the drawbacks of the parameterized HMM is that, for complex events (e.g., a combination of sub-events), the parameter training space may become very large. In our application, we assume that each body part has its own freedom of motion, and the activity recognition for each part is achieved by using several HMMs in parallel. Combining the outputs of the HMMs to generate scenarios is an application-dependent issue. In our application environment, a smart room, we use a Mahalanobis distance classifier for combining the activities of the different body parts.
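To make the "several HMMs in parallel" idea concrete, the sketch below evaluates a discrete-observation HMM with the scaled forward algorithm and assigns an observation sequence (e.g., a quantized trajectory of one body part) to the activity model with the highest likelihood. The discrete symbols, the parameterization, and the names are illustrative; the chapter's actual features and training procedure are not reproduced here.

```python
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """log P(obs | HMM) by the scaled forward algorithm.

    pi  : (S,)   initial state probabilities.
    A   : (S, S) state transition matrix, A[i, j] = P(j | i).
    B   : (S, K) emission matrix over K discrete symbols.
    obs : sequence of symbol indices.
    """
    alpha = pi * B[:, obs[0]]
    log_lik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        s = alpha.sum()
        log_lik += np.log(s)
        alpha /= s
    return log_lik

def classify_activity(models, obs):
    """Pick the activity whose HMM explains the sequence best.
    models : dict mapping activity name -> (pi, A, B)."""
    scores = {name: forward_log_likelihood(*m, obs)
              for name, m in models.items()}
    return max(scores, key=scores.get)
```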
motion of the body parts. We observe that different activity patterns can have
overlapping periods (same or similar patterns for a period) for some body parts.
Hence, the detection of start and end times of activities is crucial. To detect the
start and end time of a gesture, we use the gap between different gestures/
activities.
Eighty-six percent (86%) of the body parts in the processed frames and 90% of the activities are correctly classified, with the remainder counted as missed or false classifications. Details of the gesture recognition algorithm can be found in Ozer
& Wolf (2002a).
From 2D to 3D
After graph matching, head, torso and hand ellipses and their corresponding
attributes are sent from each processing board to the other one via the shared
memory. High-level information (ellipses corresponding to head, torso, and hand
areas) and low-level information (ellipse attributes) are used to model the best-
fit ellipsoids for each body part as shown in Figure 8. The best-fit ellipsoid
algorithm is based on Owens’s (1984) work. Figure 6 displays the orthogonal
ellipses, their attributes, and best-fit ellipsoid after iterative approximation.
The equation of an ellipse is given by:

$$\frac{x^2}{\alpha^2} + \frac{y^2}{\beta^2} = 1 .$$
$$d = \frac{\alpha^2}{\cos(\phi)^2} + \frac{(\alpha/\beta)^2}{\sin(\phi)^2}, \qquad e = \frac{\alpha^2}{\sin(\phi)^2} + \frac{(\alpha/\beta)^2}{\cos(\phi)^2} .$$
For three perpendicular surfaces, namely surf1, surf2, and surf3, the diagonal
components λ and off-diagonal components γ are calculated for the 2x2 matrices
representing the sectional ellipses:
$$\lambda_{surf1} = \tfrac{1}{2}\left(\alpha_{surf2}\cos^2(\phi_{surf2}) + \beta_{surf2}\sin^2(\phi_{surf2})\right) + \tfrac{1}{2}\left(\alpha_{surf3}\cos^2(\phi_{surf3}) + \beta_{surf3}\sin^2(\phi_{surf3})\right)$$
$$\lambda_{surf2} = \tfrac{1}{2}\left(\alpha_{surf1}\cos^2(\phi_{surf1}) + \beta_{surf1}\sin^2(\phi_{surf1})\right) + \tfrac{1}{2}\left(\alpha_{surf3}\cos^2(\phi_{surf3}) + \beta_{surf3}\sin^2(\phi_{surf3})\right)$$
$$\lambda_{surf3} = \tfrac{1}{2}\left(\alpha_{surf1}\cos^2(\phi_{surf1}) + \beta_{surf1}\sin^2(\phi_{surf1})\right) + \tfrac{1}{2}\left(\alpha_{surf2}\cos^2(\phi_{surf2}) + \beta_{surf2}\sin^2(\phi_{surf2})\right)$$
$$\gamma_{surf1} = \alpha_{surf1}\sin(\phi_{surf1})\cos(\phi_{surf1}) - \beta_{surf1}\sin(\phi_{surf1})\cos(\phi_{surf1})$$
$$\gamma_{surf2} = \alpha_{surf2}\sin(\phi_{surf2})\cos(\phi_{surf2}) - \beta_{surf2}\sin(\phi_{surf2})\cos(\phi_{surf2})$$
$$\gamma_{surf3} = \alpha_{surf3}\sin(\phi_{surf3})\cos(\phi_{surf3}) - \beta_{surf3}\sin(\phi_{surf3})\cos(\phi_{surf3})$$
Note that the diagonal component λ is doubly defined. To get an initial estimate
we average the two doubly defined terms. To get a better best-fit estimate we
define a matrix P and calculate the normalized eigenvalues Π and eigenvectors
V of the sectional ellipses by using singular value decomposition.
$$\mathbf{P} = \begin{bmatrix} \lambda_{surf1} & \gamma_{surf3} & \gamma_{surf2} \\ \gamma_{surf3} & \lambda_{surf2} & \gamma_{surf1} \\ \gamma_{surf2} & \gamma_{surf1} & \lambda_{surf3} \end{bmatrix}$$
$$\begin{bmatrix} \lambda_{surf2} & \gamma_{surf1} \\ \gamma_{surf1} & \lambda_{surf3} \end{bmatrix}, \qquad \begin{bmatrix} \lambda_{surf1} & \gamma_{surf2} \\ \gamma_{surf2} & \lambda_{surf3} \end{bmatrix}, \qquad \begin{bmatrix} \lambda_{surf1} & \gamma_{surf3} \\ \gamma_{surf3} & \lambda_{surf2} \end{bmatrix}$$

$$G = \left(\Pi_{surf1,surf2,surf3} - \left(\lambda_{surf1,surf2,surf3} / \beta_{surf1,surf2,surf3}\right)\right)^2 + \left(\kappa_{surf1,surf2,surf3} - \phi_{surf1,surf2,surf3}\right)^2$$
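The construction of the matrix P from the three sectional ellipses can be sketched as follows: each ellipse is described by its semi-axes (α, β) and orientation φ in its own surface, the diagonal and off-diagonal terms are computed as in the formulas above, and the decomposition of P yields the initial axes of the best-fit ellipsoid. A symmetric eigen-decomposition is used here in place of the SVD mentioned in the text (equivalent for a symmetric P), and the subsequent iterative refinement that minimizes G is not shown; names are illustrative.

```python
import numpy as np

def ellipsoid_from_sections(ellipses):
    """Initial best-fit ellipsoid from three orthogonal sectional
    ellipses (sketch following the lambda/gamma construction above).

    ellipses : list of three (alpha, beta, phi) tuples for surf1..surf3.
    Returns the normalized eigenvalues and the eigenvectors of P.
    """
    (a1, b1, p1), (a2, b2, p2), (a3, b3, p3) = ellipses

    def half_term(a, b, p):
        # 1/2 (alpha cos^2(phi) + beta sin^2(phi))
        return 0.5 * (a * np.cos(p) ** 2 + b * np.sin(p) ** 2)

    lam1 = half_term(a2, b2, p2) + half_term(a3, b3, p3)
    lam2 = half_term(a1, b1, p1) + half_term(a3, b3, p3)
    lam3 = half_term(a1, b1, p1) + half_term(a2, b2, p2)

    def gamma(a, b, p):
        # (alpha - beta) sin(phi) cos(phi)
        return (a - b) * np.sin(p) * np.cos(p)

    g1, g2, g3 = gamma(a1, b1, p1), gamma(a2, b2, p2), gamma(a3, b3, p3)

    P = np.array([[lam1, g3, g2],
                  [g3, lam2, g1],
                  [g2, g1, lam3]])
    eigvals, eigvecs = np.linalg.eigh(P)      # P is symmetric
    return eigvals / np.linalg.norm(eigvals), eigvecs
```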
Figure 9. Example ellipsoids from front view (top) and side view (bottom)
Figure 11. Example ellipsoids for “pointing camera1” activity from front
and side view.
A single camera view cannot detect activities such as pointing towards the camera (e.g., the change of area over time is not reliable for small body parts). However, the proposed system combines the activity directions and body pose information from multiple views and recognizes the correct activity in real time.
Figure 12 shows recognized moving left activity. Figure 13, first row, shows
moving down activity where the person turns from the first camera to the second
one during the activity period. The second row is another correctly recognized
moving down/up activity for a different person.
Figure 14 is an example of unattended object detection. The person enters the
scene with an object in his hand and leaves the object on the table. After a
predefined time, the system can detect the body parts and the object left on the
table correctly. An alarm can be generated and sent to the security personnel for
the unattended object.
For applications such as smart rooms where devices are controlled by people, the
deployment of cameras can be adjusted for optimum capture of the motion.
However, for less structured motion estimation applications such as surveillance,
self-occlusion may occur. In this case, the corresponding parts are modeled less
accurately for the duration of the occlusion. Our future work includes a feedback scheme that uses temporal dependency. Furthermore, more cameras
Note that the most time-consuming parts of the algorithm are computed in
parallel by different video processors and the integration to reconstruct the 3D
model is based on the processing of parameters as opposed to pixel processing.
This feature of the algorithm makes the integration of multiple cameras more
attractive. A detailed description of the multi-camera architecture is presented in the
next section.
Because the input frame size is not large enough to make those additional costs negligible, we convert this intra-frame data independency into instruction-level parallelism, which can be exploited by VLIW or superscalar architecture processors. The instruction-level parallelism can be explicitly expressed in the executable file, since the parallelism is available at compile time. Both VLIW and superscalar processors can exploit static instruction-level parallelism. Superscalar processors use hardware schemes to discover instruction parallelism in a program, so a superscalar processor can provide backward compatibility with older-generation processors. For this reason, most general-purpose processors are superscalar
processors. On the other hand, a VLIW processor can achieve a similar
performance on a program with explicit parallelism by using significantly less
hardware effort with dedicated compiler support. We use the VLIW processor
to exploit the instruction-level parallelism that resulted from the intra-frame data
independency, since such parallelism can be explicitly expressed at compile time.
In the following, we will introduce our process of converting intra-frame data
independency to instruction-level parallelism. Although the target is a VLIW
processor, most parts of this process can benefit from superscalar processors,
as well.
The first step is to use loop fusion, which combines two similar, adjacent loops to reduce loop overhead, and loop unrolling, which merges consecutive iterations so that several of them can be executed in the same trip. Both transformations increase the basic block size and thus increase the available instruction-level parallelism. Figure 16 shows examples of loop fusion and unrolling.
When a loop is executed, there might be dependencies between trips. The
instructions that need to be executed in different trips cannot be executed
simultaneously. The essential idea behind loop fusion and loop unrolling is to
decrease the total number of trips needed to be executed by putting more tasks
in each trip. Loop fusion merges loops together without changing the result of the
executed program. In Figure 16, two loops are merged into one loop. This change
will increase the number of instructions in each trip. Loop unrolling merges
consecutive trips together to reduce the total trip count. In this example, the trip
count is reduced from four to two as loop unrolling is performed. These source
code transformations do not change the execution results, but increase the
number of instructions located in each loop trip and thus increase the number of
instructions that can be executed simultaneously.
Both loop fusion and loop unrolling increase basic block size by merging several
basic blocks together. While loop fusion merges basic blocks in code-domain, in
that different code segments are merged, loop unrolling merges basic blocks in
time-domain, in that different loop iterations are merged. This step increases the
code size for each loop trip. However, we do not observe significant basic block
size changes. Conditional operations inside the loops, such as absolute value calculation and thresholding, limit the growth of the basic block size.
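For concreteness, the effect of the two transformations can be sketched on a toy example. The sketch below is written in Python purely for readability; in the actual system the equivalent rewriting would be applied to the C source compiled for the VLIW processor, and the array names are hypothetical.

```python
import numpy as np

a = np.arange(8)
b = np.arange(8)
s1 = np.empty(8)
s2 = np.empty(8)

# Original form: two similar, adjacent loops over the same range.
for i in range(8):
    s1[i] = a[i] + 1
for i in range(8):
    s2[i] = b[i] * 2

# Loop fusion: one loop body now contains both statements,
# enlarging the work done per trip (larger basic block).
for i in range(8):
    s1[i] = a[i] + 1
    s2[i] = b[i] * 2

# Loop unrolling (factor 2): consecutive trips are merged, halving the
# trip count and exposing independent statements for parallel scheduling.
for i in range(0, 8, 2):
    s1[i] = a[i] + 1
    s2[i] = b[i] * 2
    s1[i + 1] = a[i + 1] + 1
    s2[i + 1] = b[i + 1] * 2
```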
In the second step, we used two methods to reduce the branches, which limit the basic block size in loops. One solution is to use conditional execution instructions, which requires hardware support. The TriMedia processors offer such instructions, for example IABS, which calculates the absolute value in a single instruction. This optimization provides a significant performance improvement.
Another technique we used is to convert control flow dependency into data dependency by using look-up tables. In our contour-following algorithm, the instruction-level parallelism is limited by control flow dependency. The critical control flow structure is shown on the left-hand side of Figure 17. Although if-conversion is a general method to remove branches caused by if-else statements, it does not help much for such a control flow dependency. To increase the available parallelism, we convert the control flow structure into a table look-up, so that the branch becomes an ordinary data dependency.
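The following sketch illustrates the general idea on a hypothetical 8-connected direction code; it is not the chapter's exact contour-following code.

```python
# Control-flow version: a chain of branches limits instruction-level parallelism.
def next_offset_branchy(direction):
    if direction == 0:
        return (0, 1)
    elif direction == 1:
        return (-1, 1)
    elif direction == 2:
        return (-1, 0)
    elif direction == 3:
        return (-1, -1)
    elif direction == 4:
        return (0, -1)
    elif direction == 5:
        return (1, -1)
    elif direction == 6:
        return (1, 0)
    else:
        return (1, 1)

# Data-dependency version: the same mapping stored in a look-up table;
# the branches disappear and the access becomes a schedulable memory load.
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]

def next_offset_table(direction):
    return OFFSETS[direction]
```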
Figure 18. Instruction cycles for 10 frames (left), and available parallelism
(right).
A different level of data independency in our smart camera system is the inter-frame data independency. Since this independency lies between different input frames, it is a coarse-grained data independency. The corresponding parallelism is therefore coarse-grained as well.
The above discussion concerns the available data independencies. There is another form of parallelism resulting from the data flow structure: the algorithm stages of the low-level processing part form a pipelined process. A corresponding architecture is a pipelined multi-processor architecture (Figure 21). Figure 22 shows the projected performance of such an architecture. Series 1 shows the throughput when the communication cost is zero, while in Series 2 the communication cost is 20% of the computation cost. The additional benefit of such an architecture over other parallel architectures is that each processor can be tailored to the requirements of its stage. For example, the CPU used for background elimination does not have to carry a floating-point unit. The limiting factor of such an architecture is the granularity of the stages: when a single stage accounts for more than 50% of the overall computation time, the speed-up is limited.
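A rough back-of-the-envelope sketch of this behaviour is given below; the stage costs are invented and do not correspond to the measurements of Figure 22.

```python
def pipeline_throughput(stage_times, comm_fraction=0.0):
    """Frames per time unit for a pipelined multi-processor.

    Each stage runs on its own processor; the steady-state throughput is set
    by the slowest stage, inflated by the per-stage communication cost."""
    bottleneck = max(t * (1.0 + comm_fraction) for t in stage_times)
    return 1.0 / bottleneck

# Hypothetical per-frame stage costs (in milliseconds).
stages = [8.0, 12.0, 6.0, 10.0]

print(pipeline_throughput(stages))        # zero communication cost ("Series 1")
print(pipeline_throughput(stages, 0.2))   # communication = 20% of computation ("Series 2")

# If one stage dominates the total work, the speed-up over a single processor
# is capped near (sum of stage times) / (bottleneck stage time), which
# illustrates the granularity limit mentioned above.
```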
Conclusions
In this chapter, we review previous 3D methods for the real-time processing of multiple views for human detection and activity recognition, and we discuss the advantages and drawbacks of these algorithms with respect to real-time performance.
Chapter V
Abstract
Introduction
Current information processing and visualization systems are capable of offering
advanced and intuitive means of receiving input from and communicating output
to their users. As a result, Man-Machine Interaction (MMI) systems that utilize
multimodal information about their users’ current emotional state are presently
at the forefront of interest of the computer vision and artificial intelligence
communities. Such interfaces give the opportunity to less technology-aware
individuals, as well as handicapped people, to use computers more efficiently
and, thus, overcome related fears and preconceptions. Besides this, most
emotion-related facial and body gestures are considered universal, in the sense
that they are recognized among different cultures. Therefore, the introduction of
an “emotional dictionary” that includes descriptions and perceived meanings of
facial expressions and body gestures, so as to help infer the likely emotional state
of a specific user, can enhance the affective nature of MMI applications (Picard,
2000).
Despite the progress in related research, our intuition of what a human
expression or emotion actually represents is still based on trying to mimic the way
the human mind works while making an effort to recognize such an emotion. This
means that even though image or video input are necessary to this task, this
process cannot come to robust results without taking into account features like
speech, hand gestures or body pose. These features provide the means to convey
messages in a much more expressive and definite manner than wording, which
can be misleading or ambiguous. While a lot of effort has been invested in
individually examining these aspects of human expression, recent research
(Cowie et al., 2001) has shown that even this approach can benefit from taking
into account multimodal information. Consider a situation where the user sits in
front of a camera-equipped computer and responds verbally to written or spoken
messages from the computer: speech analysis can indicate periods of silence
from the part of the user, thus informing the visual analysis module that it can use
related data from the mouth region, which is essentially ineffective when the user
speaks. Hand gestures and body pose provide another powerful means of
communication. Sometimes, a simple hand action, such as placing one’s hands
over their ears, can pass on the message that they’ve had enough of what they
are hearing more expressively than any spoken phrase.
In this chapter, we present a systematic approach to analyzing emotional cues
from user facial expressions and hand gestures. In the Section “Affective
analysis in MMI,” we provide an overview of affective analysis of facial
expressions and gestures, supported by psychological studies describing emo-
tions as discrete points or areas of an “emotional space.” The sections “Facial
expression analysis” and “Gesture analysis” provide algorithms and experimen-
tal results from the analysis of facial expressions and hand gestures in video
sequences. In the case of facial expressions, the motion of tracked feature points
is translated to MPEG-4 FAPs, which describe their observed motion in a high-
level manner. Regarding hand gestures, hand segments are located in a video
sequence via color segmentation and motion estimation algorithms. The position
of these segments is tracked to provide the hand's position over time and is fed into an HMM architecture to provide affective gesture estimation.
In most cases, a single expression or gesture cannot help the system reach a confident decision about the user's observed emotion. As a result, a fuzzy
architecture is employed that uses the symbolic representation of the tracked
features as input. This concept is described in the section “Multimodal affective
analysis.” The decision of the fuzzy system is based on rules obtained from the
extracted features of actual video sequences showing emotional human dis-
course, as well as feature-based description of common knowledge of what
everyday expressions and gestures mean. Results of the multimodal affective
analysis system are provided here, while conclusions and future work concepts
are included in the final section “Conclusions – Future work.”
Representation of Emotion
The obvious goal for emotion analysis applications is to assign category labels
that identify emotional states. However, labels as such are very poor descrip-
tions, especially since humans use a daunting number of labels to describe
emotion. Therefore, we need to incorporate a more transparent, as well as
continuous, representation that more closely matches our conception of what
emotions are or, at least, how they are expressed and perceived.
Activation-emotion space (Whissel, 1989) is a representation that is both simple
and capable of capturing a wide range of significant issues in emotion (Cowie et
al., 2001). Perceived full-blown emotions are not evenly distributed in this space;
instead, they tend to form a roughly circular pattern. From that and related
evidence, Plutchik (1980) shows that there is a circular structure inherent in
emotionality. In this framework, emotional strength can be measured as the
distance from the origin to a given point in activation-evaluation space. The
concept of a full-blown emotion can then be translated roughly as a state where
emotional strength has passed a certain limit. A related extension is to think of
primary or basic emotions as cardinal points on the periphery of an emotion
circle. Plutchik has offered a useful formulation of that idea, the "emotion wheel" (see Figure 1).

Figure 1. The activation-evaluation "emotion wheel": archetypal emotions (fear, surprise, joy, anticipation, acceptance, sadness, disgust, anger) and intermediate terms such as irritated, panicky, pleased, cheerful, eager, amused, content, hopeful, serene, calm, gloomy, depressed, despairing, contemptuous, bored, resentful, antagonistic, critical, possessive, suspicious, trusting and delighted, arranged around the VERY NEGATIVE/VERY POSITIVE evaluation axis and the activation axis (VERY PASSIVE at the bottom).
Activation-evaluation space is a surprisingly powerful device, which is increas-
ingly being used in computationally oriented research. However, it has to be
noted that such representations depend on collapsing the structured, high-
dimensional space of possible emotional states into a homogeneous space of two
dimensions. There is inevitably loss of information. Worse still, there are
different ways of making the collapse lead to substantially different results. That
is well illustrated in the fact that fear and anger are at opposite extremes in
Plutchik’s emotion wheel, but close together in Whissell’s activation/emotion
space. Thus, extreme care is needed to ensure that collapsed representations are
used consistently.
It has been shown (Bassili, 1979) that facial expressions can be more accurately recognized
from image sequences, than from single still images. Bassili’s experiments used
point-light conditions, i.e., subjects viewed image sequences in which only white
dots on a darkened surface of the face were visible. Expressions were
recognized at above chance levels when based on image sequences, whereas
only happiness and sadness were recognized when based on still images.
The detection and interpretation of hand gestures has become an important part of man-machine interaction (MMI) in recent years (Wu & Huang, 2001).
Sometimes, a simple hand action, such as placing a person’s hands over his ears,
can pass on the message that he has had enough of what he is hearing. This is
conveyed more expressively than with any other spoken phrase.
In general, human hand motion consists of the global hand motion and local finger
motion. Hand motion capturing deals with finding the global and local motion of
hand movements. Two types of cues are often used in the localization process:
color cues (Kjeldsen & Kender, 1996) and motion cues (Freeman & Weissman,
1995). Alternatively, the fusion of color, motion and other cues, like speech or
gaze, is used (Sharma, Huang & Pavlovic, 1996).
Hand localization is locating hand regions in image sequences. Skin color offers
an effective and efficient way to fulfill this goal. According to the representation
of color distribution in certain color spaces, current techniques of skin detection
can be classified into two general approaches: nonparametric (Kjeldsen &
Kender, 1996) and parametric (Wren, Azarbayejani, Darrel & Pentland, 1997).
To capture articulate hand motion in full degree of freedom, both global hand
motion and local finger motion should be determined from video sequences.
Different methods have been proposed to approach this problem. One possible method is the appearance-based approach, in which 2-D deformable hand-shape templates are used to track a moving hand in 2-D (Darrell, Essa & Pentland, 1996). Another possible way is the 3-D model-based approach, which takes advantage of a priori knowledge built into the 3-D models.
Meaningful gestures could be represented by both temporal hand movements
and static hand postures. Hand postures express certain concepts through hand
configurations, while temporal hand gestures represent certain actions by hand
Figure 2. FDP feature points (adapted from Tekalp & Ostermann, 2000). The figure shows the numbered MPEG-4 feature point groups defined around the right eye, left eye, nose, tongue and mouth regions.
The facial feature extraction scheme used in the system proposed in this chapter
is based on a hierarchical, robust scheme, coping with large variations in the
appearance of diverse subjects, as well as the same subject in various instances
within real video sequences (Votsis, Drosopoulos & Kollias, 2003). Soft a priori
assumptions are made on the pose of the face or the general location of the
features in it. Gradual revelation of information concerning the face is supported
under the scope of optimization in each step of the hierarchical scheme,
producing a posteriori knowledge about it and leading to a step-by-step
visualization of the features in search.
Face detection is performed first through detection of skin segments or blobs,
merging them based on the probability of their belonging to a facial area, and
identification of the most salient skin color blob or segment. Following this,
primary facial features, such as eyes, mouth and nose, are dealt with as major
discontinuities on the segmented, arbitrarily rotated face. In the first step of the
method, the system performs an optimized segmentation procedure. The initial
estimates of the segments, also called seeds, are approximated through min-max
analysis and refined through the maximization of a conditional likelihood func-
tion. Enhancement is needed so that closed objects will occur and part of the
artifacts will be removed. Seed growing is achieved through expansion, utilizing
chromatic and value information of the input image. The enhanced seeds form an object set, which reveals the in-plane facial rotation through the use of active contours applied to all objects of the set. This set is then restricted to a finer set, in which the features and feature points are finally labeled according to an error minimization criterion.
Experimental Results
Figure 3 shows a characteristic frame from the “hands over the head” sequence.
After skin detection and segmentation, the primary facial features are shown in
Figure 4. Figure 5 shows the initial detected blobs, which include face and mouth.
Figure 6 shows the estimates of the eyebrow and nose positions. Figure 7 shows
the initial neutral image used to calculate the FP distances. In Figure 8 the
horizontal axis indicates the FAP number, while the vertical axis shows the
corresponding FAP values estimated through the features stated in the second
column of Table 1.
Gesture Analysis
The skin color and motion masks usually contain a large number of small objects, due to the presence of noise and of objects with color similar to the skin. To overcome this, morphological filtering is
employed on both masks to remove small objects. All described morphological
operations are carried out with a disk-structuring element with a radius of 1% of
the image width. The distance transform of the color mask is first calculated
(Figure 12) and only objects above the desired size are retained (Figure 13).
These objects are used as markers for the morphological reconstruction of the
initial color mask. The color mask is then closed to provide better centroid
calculation.
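One possible realization of this mask clean-up with SciPy/scikit-image is sketched below; the function name and the size threshold are assumptions, and only the disk radius (1% of the image width) follows the text.

```python
import numpy as np
from scipy import ndimage
from skimage import morphology

def clean_color_mask(color_mask, image_width, min_dist_frac=0.02):
    """Suppress small objects, keep large ones via the distance transform,
    reconstruct the mask from those markers and close the result."""
    radius = max(1, int(0.01 * image_width))   # disk radius = 1% of image width
    selem = morphology.disk(radius)

    # Morphological filtering to remove small, noise-like objects.
    mask = morphology.binary_opening(color_mask.astype(bool), selem)

    # Distance transform of the mask; only objects whose interior distance
    # exceeds the (assumed) threshold are retained as markers.
    dist = ndimage.distance_transform_edt(mask)
    markers = dist > (min_dist_frac * image_width)

    # Morphological reconstruction of the initial mask from the markers,
    # followed by closing for a better centroid calculation.
    rec = morphology.reconstruction(markers.astype(float), mask.astype(float)) > 0
    return morphology.binary_closing(rec, selem)
```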
The moving skin mask (msm) is then created by fusing the processed skin and
motion masks (sm, mm), through the morphological reconstruction of the color
mask using the motion mask as marker. The result of this process, after excluding
the head object, is shown in Figure 19. The moving skin mask consists of many
large connected areas. For the next frame, a new moving skin mask is created,
and a one-to-one object correspondence is performed. Object correspondence
between two frames is performed on the color mask and is based on object
centroid distance for objects of similar (at least 50%) area (Figure 20). In these
figures, crosses represent the position of the centroid of the detected right hand
of the user, while circles correspond to the left hand. In the case of hand object
merging and splitting, e.g., in the case of clapping, we establish a new matching
of the left-most candidate object to the user’s right hand and the right-most object
to the left hand (Figure 21).
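A simplified version of this correspondence heuristic might look as follows; the data layout (centroid/area dictionaries) is a hypothetical choice made for the example.

```python
import numpy as np

def match_objects(prev_objs, curr_objs):
    """Greedy one-to-one matching of mask objects between two frames.

    Each object is a dict with 'centroid' (y, x) and 'area'; a pair is
    admissible only if the smaller area is at least 50% of the larger one,
    and the closest admissible centroid wins."""
    matches, used = {}, set()
    for i, p in enumerate(prev_objs):
        best, best_d = None, np.inf
        for j, c in enumerate(curr_objs):
            if j in used:
                continue
            if min(p['area'], c['area']) < 0.5 * max(p['area'], c['area']):
                continue  # areas not similar enough
            d = np.hypot(p['centroid'][0] - c['centroid'][0],
                         p['centroid'][1] - c['centroid'][1])
            if d < best_d:
                best, best_d = j, d
        if best is not None:
            matches[i] = best
            used.add(best)
    return matches
```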
Following object matching in the subsequent moving skin masks, the mask flow
is computed, i.e., a vector for each frame depicting the motion direction and
magnitude of the frame’s objects. The described algorithm is lightweight,
allowing a rate of around 12 fps on a usual PC during our experiments, which is
enough for continuous gesture tracking. The object correspondence heuristic
makes it possible to individually track the hand segments correctly, at least during
usual meaningful gesture sequences. In addition, the fusion of color and motion information increases the robustness of the hand localization.
[Block diagram: the video sequence undergoes head & hand segmentation and head & hand tracking; the feature vector formation stage feeds the HMM classifier, which outputs the gesture classes.]
In Table 2 we present the utilized features that feed (as sequences of vectors)
the HMM classifier, as well as the output classes of the HMM classifier.
Gesture classes (Table 2): lift of the hand – low speed; lift of the hand – high speed; hands over the head – gesture; hands over the head – posture; italianate gestures.
A general diagram of the HMM classifier is shown in Figure 23. The recognizer
consists of M different HMMs corresponding to the modeled gesture classes. In
our case, M=7, as can be seen in Table 2. We use first-order left-to-right models consisting of a varying number (different for each HMM) of internal states G_{k,j} that have been identified through the learning process. For example, the third HMM, which recognizes the low-speed hand lift, consists of only three states G_{3,1}, G_{3,2} and G_{3,3}. More complex gesture classes, like hand clapping, require as many as eight states to be efficiently modeled by the corresponding HMM. Some characteristics of our HMM implementation are presented below.
b_{k,j}(O_i) = \frac{1}{(2\pi)^{K/2}\,|C_{k,j}|^{1/2}} \exp\!\left\{-\tfrac{1}{2}\,(O_i - \mu_{k,j})^{T} C_{k,j}^{-1} (O_i - \mu_{k,j})\right\} \qquad (1)
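Equation (1) translates almost directly into NumPy. The sketch below assumes a full covariance matrix C_{k,j} and is only illustrative.

```python
import numpy as np

def observation_prob(o, mu, C):
    """Multivariate Gaussian b_{k,j}(O_i) of equation (1).

    o, mu: feature vector and state mean (length K); C: K x K covariance."""
    K = o.shape[0]
    diff = o - mu
    norm = (2.0 * np.pi) ** (K / 2.0) * np.sqrt(np.linalg.det(C))
    return np.exp(-0.5 * diff @ np.linalg.inv(C) @ diff) / norm
```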
• The match score of a feature vector sequence O = O_1 O_2 … O_T given the model λ_m = (A_m, B_m, π_m), m = 1, 2, …, M, is calculated as follows:
o We compute the best state sequence Q* given the observation sequence O, using Viterbi's algorithm, i.e.:

P^{*} = P(O \mid Q^{*}, \lambda_m) \qquad (3)
It should be mentioned here that the final block of the architecture corresponds
to a hard decision system, i.e., it selects the best-matched gesture class.
However, when gesture classification is used to support the facial expression
analysis process, the probabilities of the distinct HMMs should be used instead
(soft decision system). In this case, since the HMMs work independently, their
outputs do not sum up to one.
Figure 23. The HMM classifier: the feature vector sequence O is fed to the M models λ_m = (A_m, B_m, π_m); the likelihoods P(O | λ_m) are computed and the maximum is selected, υ* = argmax_υ P(O | λ_υ).
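Conceptually, the decision block of Figure 23 reduces to an argmax over the per-model match scores. In the sketch below each model is assumed to be a callable returning P(O | λ_m) (e.g., a wrapper around a Viterbi implementation); this is an assumption made for the example, not the authors' code.

```python
import numpy as np

def classify_gesture(feature_sequence, models):
    """Hard and soft decision over a bank of M gesture HMMs.

    `models` is a list of callables, each returning the (Viterbi) match
    score P(O | lambda_m) for the feature vector sequence O."""
    scores = np.array([model(feature_sequence) for model in models])
    hard_decision = int(np.argmax(scores))   # best-matched gesture class
    soft_decision = scores                   # independent scores; do not sum to one
    return hard_decision, soft_decision
```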
Experimental Results
In the first part of our experiments, the efficiency of the features used to
discriminate the various gesture classes is illustrated (Figure 24 to Figure 27).
The first column shows a characteristic frame of each sequence and the tracked
centroids of the head and left and right hand, while the remaining two columns
show the evolution of the features described in the first row of Table 2, i.e., the
difference of the horizontal and vertical coordinates of the head and hand
segments. In the case of the first sequence, the gesture is easily discriminated
since the vertical position of the hand segments almost matches that of the head,
while in the closing frame of the sequence the three objects overlap. Overlapping
is crucial to indicate that two objects are in contact during some point of the
gesture, in order to separate this sequence from, e.g., the “lift of the hand”
gesture. Likewise, during clapping, the distance between the two hand segments
is zeroed periodically, with the length of the in-between time segments providing
a measure of frequency, while during the “italianate” gesture the horizontal
distance of the two hands follows a repetitive, sinusoidal pattern.
A recognition rate of 94.3% was achieved. The experimental results are shown in the confusion matrix (Table 3).
From the results summarized in Table 3, we observe a mutual misclassification
between “Italianate Gestures” (IG) and “Hand Clapping – High Frequency”
(HC - HF). This is mainly due to the variations on “Italianate Gestures” across
different individuals. Thus, training the HMM classifier on a personalized basis
is anticipated to improve the discrimination between these two classes.
The facial expression analysis subsystem is the main part of the presented
system. Gestures are utilized to support the outcome of this subsystem.
p_i^{(k)} = \prod_{A_{i,j}^{(k)} \in \Delta_{i,j}^{(k)}} r_{i,j}^{(k)} \quad \text{and} \quad b_i = \max_{k}\big(p_i^{(k)}\big) \qquad (4)
[Figure: membership function of a fuzzy class A_{i,j}^{(k)}, centered at c_{i,j}^{(k)} with spread s_{i,j}^{(k)}; μ_{i,j}^{(k)} denotes the corresponding membership value.]
where r_{i,j}^{(k)} = \max\{g_i \cap A_{i,j}^{(k)}\} expresses the relevance r_{i,j}^{(k)} of the i-th element of the input feature vector with respect to the class A_{i,j}^{(k)}. Actually, g = A'(G) = \{g_1, g_2, \ldots\} is the fuzzified input vector resulting from a singleton fuzzification procedure (Klir & Yuan, 1995).
The various emotion profiles correspond to the fuzzy intersection of several sets and are implemented through a t-norm of the form t(a,b) = a·b. Similarly, the belief that an observed feature vector corresponds to a particular emotion results from a fuzzy union of several sets through an s-norm, which is implemented as u(a,b) = max(a,b).
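In code, this amounts to a product over each rule's relevances followed by a maximum over the rules of a profile; the data layout below is an assumption made for the example.

```python
import numpy as np

def emotion_beliefs(relevances):
    """Compute a belief per emotion from antecedent relevances r.

    `relevances` maps an emotion label to a list of rules, each rule being a
    list of relevance values r_{i,j} in [0, 1] (hypothetical layout).
    The intersection of a rule's sets uses the product t-norm t(a, b) = a*b;
    the union over a profile's rules uses u(a, b) = max(a, b)."""
    beliefs = {}
    for emotion, rules in relevances.items():
        activations = [float(np.prod(rule)) for rule in rules]      # t-norm per rule
        beliefs[emotion] = max(activations) if activations else 0.0  # s-norm over rules
    return beliefs

# Example: two rules for "joy", one rule for "surprise".
print(emotion_beliefs({'joy': [[0.9, 0.7], [0.4, 0.8]], 'surprise': [[0.6, 0.5]]}))
```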
Gestures are utilized to support the outcome of the facial expression analysis
subsystem, since in most cases they are too ambiguous to indicate a particular
emotion. However, in a given context of interaction, some gestures are obviously
associated with a particular expression — e.g., hand clapping of high fre-
quency expresses joy, satisfaction — while others can provide indications for
the kind of the emotion expressed by the user. In particular, quantitative features
derived from hand tracking, like speed and amplitude of motion, fortify the
position of an observed emotion; for example, satisfaction turns to joy or even
to exhilaration, as the speed and amplitude of clapping increases.
As was mentioned in the section “Gesture analysis,” the position of the centroids
of the head and the hands over time forms the feature vector sequence that feeds
an HMM classifier whose outputs corresponds to a particular gesture class.
Table 5 below shows the correlation between some detectable gestures with the
six archetypal expressions.
Given a particular context of interaction, gesture classes corresponding to the
same emotional state are combined in a "logical OR" form. Table 5 shows that a particular gesture may correspond to more than one gesture class, each carrying a different emotional meaning.
In the final step of the proposed system, the facial expression analysis subsystem
and the affective gesture analysis subsystem are integrated, as shown in Figure
30, into a system which provides as a result the possible emotions of the user,
each accompanied by a degree of belief.
Figure 30. [Block diagram] The facial point detection and f ⇒ FAP stages feed the facial expression decision system; the expression profiles and the gesture profiles (G) are combined in the overall decision system.
d_k = w_1 \cdot b_k + w_2 \cdot EI_k \qquad (6)
where the weights w1 and w 2 are used to account for the reliability of the two
subsystems as far as the emotional state estimation is concerned. In this
implementation we use w1 =0.75 and w2 =0.25. These values enable the affective
gesture analysis subsystem to be important in cases where the facial expression
analysis subsystem produces ambiguous results, while at the same time leave the
latter subsystem to be the main contributing part in the overall decision system.
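Equation (6) is a simple weighted sum; a minimal sketch with the weights quoted above (function and variable names are hypothetical):

```python
def fuse_modalities(facial_beliefs, gesture_indicators, w1=0.75, w2=0.25):
    """Combine facial-expression beliefs b_k and gesture-based emotion
    indicators EI_k into an overall degree of belief d_k per emotion."""
    return {k: w1 * facial_beliefs.get(k, 0.0) + w2 * gesture_indicators.get(k, 0.0)
            for k in set(facial_beliefs) | set(gesture_indicators)}

# Hypothetical values: facial analysis favours "surprise" at 0.85 and the
# gesture subsystem also signals "surprise".
print(fuse_modalities({'surprise': 0.85, 'fear': 0.30}, {'surprise': 1.0}))
```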
For the input sequence shown in Figure 3, the affective gesture analysis
subsystem consistently provided a “surprise” selection. This was used to fortify
the output of the facial analysis subsystem, which was around 85%.
References
Bassili, J. N. (1979). Emotion recognition: The role of facial movement and the
relative importance of upper and lower areas of the face. Journal of
Personality and Social Psychology, 37, 2049-2059.
Bregler, C. (1997). Learning and recognizing human dynamics in video sequences. Proc. IEEE Conf. Computer Vision and Pattern Recognition,
568-574.
Cowie, R. et al. (2001). Emotion Recognition in Human-Computer Interaction. IEEE Signal Processing Magazine, 18(1), 32-80.
Darrell, T. & Pentland, A. (1996). Active gesture recognition using partially
observable Markov decision processes. In Proc. IEEE Int. Conf. Pattern
Recognition, 3, 984-988.
Darrell, T., Essa, I. & Pentland, A. (1996). Task-Specific Gesture Analysis in
Real-Time Using Interpolated Views. IEEE Trans. Pattern Analysis and
Machine Intelligence, 18(12), 1,236-1,242.
Davis, M. & College, H. (1975). Recognition of Facial Expressions. New
York: Arno Press.
Ekman, P. & Friesen, W. (1975). Unmasking the Face. New York: Prentice-
Hall.
Ekman, P. & Friesen, W. (1978). The Facial Action Coding System. San
Francisco, CA: Consulting Psychologists Press.
Faigin, G. (1990). The Artist’s Complete Guide to Facial Expressions. New
York: Watson-Guptill.
Freeman, W. T. & Weissman, C. D. (1995). Television Control by Hand
Gestures. Proc. Int’l Workshop on Automatic Face and Gesture Recog-
nition, Switzerland, 179-183.
Karpouzis, K., Tsapatsoulis N. & Kollias, S. (2000). Moving to Continuous
Facial Expression Space using the MPEG-4 Facial Definition Parameter
(FDP) Set. Proc. of SPIE Electronic Imaging 2000, San Jose, CA, USA.
Kjeldsen, R. & Kender, J. (1996). Finding skin in color images. Proc. 2nd Int.
Conf. Automatic Face and Gesture Recognition, 312-317.
Klir, G. & Yuan, B. (1995). Fuzzy Sets and Fuzzy Logic, Theory and
Application. New Jersey: Prentice-Hall.
Picard, R. W. (2000). Affective Computing. Cambridge, MA: MIT Press.
Plutchik, R. (1980). Emotion: A psychoevolutionary synthesis. New York:
Harper and Row.
Raouzaiou, A., Tsapatsoulis, N., Karpouzis, K. & Kollias, S. (2002). Parameter-
ized facial expression synthesis based on MPEG-4. EURASIP Journal on
Applied Signal Processing, 10, 1021-1038.
Sharma, R., Huang, T. S. & Pavlovic, V. I. (1996). A Multimodal Framework
for Interacting with Virtual Environments. In C. A. Ntuen and E. H. Park
(Eds), Human Interaction With Complex Systems. Kluwer Academic
Publishers.
Tekalp, M. & Ostermann, J. (2000). Face and 2-D mesh animation in MPEG-
4. Image Communication Journal, 15(4-5), 387-421.
Votsis, G., Drosopoulos, A. & Kollias, S. (2003). A modular approach to facial
feature segmentation on real sequences. Signal Processing, Image
Communication, 18, 67-89.
Whissel, C. M. (1989). The dictionary of affect in language. In R. Plutchnik &
H. Kellerman (Eds), Emotion: Theory, research and experience: vol-
ume 4, The measurement of emotions. New York: Academic Press.
Wren, C., Azarbayejani, A., Darrel, T. & Pentland, A. (1997). Pfinder: Real-
time tracking of the human body. IEEE Trans. Pattern Anal. Machine
Intell., 19(7), 780-785.
Wu, Y. & Huang, T. S. (2001). Hand modeling, analysis, and recognition for
vision-based human computer interaction. IEEE Signal Processing Maga-
zine, 18(3), 51-60.
Chapter VI
Techniques for
Face Motion &
Expression Analysis on
Monocular Images
Ana C. Andrés del Valle
Institut Eurécom, France
Jean-Luc Dugelay
Institut Eurécom, France
Abstract
Introduction
Researchers from the Computer Vision, Computer Graphics and Image Pro-
cessing communities have been studying the problems associated with the
analysis and synthesis of faces in motion for more than 20 years. The analysis
and synthesis techniques being developed can be useful for the definition of low bit-rate image compression algorithms (model-based coding), new cinema technologies, as well as for the deployment of virtual reality applications, videoconferencing, etc. As computers evolve towards becoming more human-oriented machines, human-computer interfaces, behavior-learning robots and disability-adapted computer environments will use face expression analysis to be
able to react to human action. The analysis of motion and expression from
monocular (single) images is widely investigated because non-stereoscopic
static images and videos are the most affordable and extensively used visual
media (i.e., webcams).
This chapter reviews current techniques for the analysis of single images to
derive face animation. These methods can be classified based upon different
criteria:
The analysis algorithms presented include those most related to face motion and
expression understanding. Specific image processing can also be used to locate
faces on images, for face recognition intended for biometrics, for general head
tracking and pose deduction, as well as for face animation synthesis. For those
readers acquainted mainly with 3D and graphics, we provide a brief overview of
the most common image processing methods and mathematical tools involved,
pointing to some sources for the algorithmic details that will not be explained or
will be assumed to be known during the description of the state-of-the-art
approaches.
The core of the chapter includes the description of the methods currently being
developed and tested to generate face animation from real face images. The
techniques herein discussed analyze static images and/or video sequences to
obtain general face expressions or explicit face motion parameters. We have
categorized these methods in three groups: “those that retrieve emotion informa-
tion,” “those that obtain parameters related to the Face Animation synthesis
used,” and “those that use explicit face synthesis during image analysis.”
Background
Many video encoders do motion analysis over video sequences to search for
motion information that will help compression. The concept of motion vectors,
first conceived at the time of the development of the first video coding
techniques, is intimately related to motion analysis. These first analysis tech-
niques help to regenerate video sequences as the exact or approximate reproduc-
tion of the original frames by using motion compensation from neighboring
pictures. They are able to compensate for, but not to understand, the actions of the objects moving in the video and, therefore, they cannot restore the objects' movements from a different orientation. Faces play an essential role in human
communication. Consequently, they have been the first objects whose motion
has been studied in order to recreate animation on synthesized models or to
interpret motion for a posteriori use.
Synthetic faces are classified into two major groups: avatars and clones.
Generally, avatars are a rough and symbolic representation of the person, and
their animation is speaker independent because it follows generic rules disre-
garding the individual that they personify. Clones are more realistic and their
animation takes into account the nature of the person and his real movements.
Whether we want to animate avatars or clones, we face a great challenge: the
automatic generation of face animation data. Manually generated animation has
long been used to create completely virtual characters and has also been applied
Figure 1. Image input is analyzed in the search for the face general
characteristics: global motion, lighting, etc. At this point, some image
processing is performed to obtain useful data that can be interpreted
afterwards to obtain face animation synthesis.
Each of the modules may be more or less complex depending on the purpose of
the analysis (i.e., from the understanding of general behavior to exact 3D-motion
extraction). If the analysis is intended for later face expression animation, the
type of FA synthesis often determines the methodology used during expression
analysis. Some systems may not go through either the first or the last stages or
some others may blend these stages in the main motion & expression image
analysis. Systems lacking the pre-motion analysis step are most likely to be
limited by environmental constraints, like special lighting conditions or pre-
determined head pose. Those systems that do not perform motion interpreta-
tion do not focus on delivering any information to perform face animation
synthesis afterwards. A system that is thought to analyze video to generate face
animation data in a robust and efficient way needs to develop all three modules.
The approaches currently under research, which will be presented in this chapter, clearly perform the facial motion & expression image analysis and, to some extent, the motion interpretation needed to animate 3D models. Nevertheless,
many of them fail to have a strong pre-motion analysis step to ensure some
robustness during the subsequent analysis.
Processing Fundamentals
Pre-Processing Techniques
The conditions under which the user may be recorded are susceptible to change
from one determined moment to the next one. Some changes may come from the
hardware equipment used, for instance, the camera, the lighting environment,
etc. Furthermore, although only one camera is used, we cannot presuppose that
the speaker’s head will remain motionless and looking straight into the camera
at any time. Therefore, pre-processing techniques must help to homogenize the
analysis conditions before studying non-rigid face motion.
Camera calibration
Accurate motion retrieval is highly dependent on the precision of the image data
we analyze. Images recorded by a camera undergo different visual deformations
due to the nature of the acquisition material. Camera calibration can be seen as
the starting point of a precise analysis.
If we want to express motion in real space, we must relate the motion measured
in terms of pixel coordinates to the real/virtual world coordinates. That is, we
need to relate the world reference frame to the image reference frame. Simply
knowing the pixel separation in an image does not allow us to determine the
distance of those points in the real world. We must derive some equations to link
the world reference frame to the image reference frame in order to find the
relationship between the coordinates of points in 3D-space and the coordinates
of the points in the image. We introduce the camera reference frame because
there is no direct relation between the previously mentioned reference frames.
Then, we can find an equation linking the camera reference frame with the image
reference frame (LinkI), and another equation linking the world reference frame
with the camera reference frame (LinkE). Identifying LinkI and LinkE is
equivalent to finding the camera’s characteristics, also known as the camera’s
extrinsic and intrinsic parameters.
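A compact way to read LinkE and LinkI is as an extrinsic rigid transform followed by an intrinsic projection matrix. The sketch below projects a world point to pixel coordinates under a simple pinhole model; all numerical values are invented.

```python
import numpy as np

def project_point(X_world, R, t, K):
    """Pinhole projection: world frame -> camera frame (LinkE: R, t),
    then camera frame -> image frame (LinkI: intrinsic matrix K)."""
    X_cam = R @ X_world + t              # extrinsic parameters
    x = K @ X_cam                        # intrinsic parameters
    return x[:2] / x[2]                  # homogeneous -> pixel coordinates

# Invented example: identity rotation, camera 2 m back, 800-pixel focal length.
R = np.eye(3)
t = np.array([0.0, 0.0, 2.0])
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
print(project_point(np.array([0.1, 0.05, 0.5]), R, t, K))
```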
Many calibration techniques exist that have been reported in the past two
decades. The developed methods can be roughly classified into two groups:
photogrammetic calibration and self-calibration. We refer the reader to Zhang
(2000) and Luong and Faugeras (1997) to obtain examples and more details
about these approaches.
Other unknown parameters during face analysis are the lighting characteristics
of the environment in which the user is being filmed. The number, origin, nature
and intensity of the light sources of the scene can significantly transform the
appearance of a face. Face reflectance is not uniform all over the face and, thus,
is very difficult to model.
There are two major categories of reflected light:
1. Diffuse Reradiation (scattering): this occurs when the incident light pen-
etrates the surface and is reflected equally in all directions.
2. Specular Reflection: light does not penetrate the object, but it is instead
directly reflected from its outer surface.
The intensity of the pixels that we get from the image of the face is the result of
the light from the recorded scene (i.e., the face) scattered towards the camera
lens. The nature of the reflection phenomenon requires the knowledge of some
vector magnitudes (Figure 2):
Figure 2. The reflected light that reaches the camera lens depends on the direction of the normal to the surface (n), the vector from the studied point to the light source (s) and the vector from the point to the camera lens (v). ϑ = θ for perfectly specular reflections. ϕ is the angular difference between the reflected beam and the camera perspective towards the object.
Due to the difficulty of deducing the great number of parameters and variables,
one common hypothesis usually taken is to consider faces as lambertian surfaces
(only reflecting diffuse light), so as to reduce the complexity of the illumination
model. Luong, Fua and Leclerc (2002) studied the light conditions of faces to be
able to obtain texture images for realistic head synthesis from video sequences
under this hypothesis. Other reflectance models are also used (Debevec et al.,
2000), although they focus more on reproducing natural lighting on synthetic
surfaces than on understanding the consequences of the lighting on the surface,
itself. In most cases, the analysis of motion and expressions on faces is more
concerned with the effect of illumination on the facial surface studied than with
the overall understanding of the lighting characteristics. A fairly extended
approach to appreciate the result of lighting on faces is to analyze illumination by
trying to synthetically reproduce it on the realistic 3D-model of the user’s head.
Phong’s reflection model is the 3D shading model most heavily used to assign
shades to each individual pixel of the synthetic face. It is characterized by
simplifying second-order reflections, introducing an ambient reflection term that
simulates the sparse (diffuse) reflection coming from sources whose light has
been so dispersed that it is very difficult to determine its origin. Whether the
lighting synthesis is used to compensate the image input (Eisert & Girod, 2002)
or to light the synthesized model used to assist the analysis (Valente & Dugelay, 2001), it proves reasonable to control how the lighting modifies the appearance of the face in the image.
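Under these hypotheses, the shade assigned to a pixel of the synthetic face reduces to the classical ambient + diffuse + specular sum. The sketch below is a generic Phong evaluation written for a single surface point, not the code of the cited systems; all coefficient values are illustrative.

```python
import numpy as np

def phong_shade(n, s, v, ka=0.2, kd=0.6, ks=0.2, shininess=16.0):
    """Phong reflection for one surface point.

    n: surface normal, s: direction to the light source, v: direction to the
    camera (as in Figure 2). Returns a scalar intensity."""
    n, s, v = (u / np.linalg.norm(u) for u in (n, s, v))
    diffuse = max(np.dot(n, s), 0.0)            # lambertian (diffuse) term
    r = 2.0 * np.dot(n, s) * n - s              # mirror direction of s about n
    specular = max(np.dot(r, v), 0.0) ** shininess
    return ka + kd * diffuse + ks * specular    # ambient + diffuse + specular

print(phong_shade(np.array([0.0, 0.0, 1.0]),
                  np.array([0.5, 0.5, 1.0]),
                  np.array([0.0, 0.0, 1.0])))
```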
Such pose-tracking systems use the pose information retrieved from one frame to analyze and derive the pose information for the next one. One of the most widespread techniques is the use of Kalman filters to predict the analysis data as well as the pose parameters themselves. We refer the reader to other research (Ström, Jebara, Basu & Pentland, 1999; Valente & Dugelay, 2001; Cordea, E. M. Petriu, Georganas, D. C. Petriu & Whalen, 2001) for the related algorithmic details.
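As a hedged sketch of the prediction step such trackers rely on, the snippet below implements a constant-velocity Kalman filter over a generic pose-parameter vector (for example, three rotations and three translations); the state layout and noise values are illustrative assumptions, not those of the cited systems.

```python
import numpy as np

class ConstantVelocityKalman:
    """Predict/update loop for a pose vector under a constant-velocity model."""

    def __init__(self, dim, process_noise=1e-3, measurement_noise=1e-2):
        self.dim = dim
        self.x = np.zeros(2 * dim)              # [pose, pose_velocity]
        self.P = np.eye(2 * dim)
        self.F = np.eye(2 * dim)
        self.F[:dim, dim:] = np.eye(dim)        # pose += velocity * dt (dt = 1 frame)
        self.H = np.hstack([np.eye(dim), np.zeros((dim, dim))])
        self.Q = process_noise * np.eye(2 * dim)
        self.R = measurement_noise * np.eye(dim)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.H @ self.x                  # predicted pose for the next frame

    def update(self, measured_pose):
        y = measured_pose - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(2 * self.dim) - K @ self.H) @ self.P

# Example with a 6-DOF pose (3 rotations, 3 translations):
kf = ConstantVelocityKalman(dim=6)
predicted = kf.predict()
kf.update(np.array([0.01, 0.0, 0.02, 1.0, 0.5, 30.0]))
```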
Optical flow
The field of displacement vectors of the objects that compose a scene cannot be computed directly: we can only recover the apparent local motion, also called optical flow, between two images.
There are two major ways to estimate the optical flow: either we match objects unambiguously from image to image, or we compute the image gradients between frames. In the first case, the goal is to determine, in one of the studied images, the group of points that can be related to their homologues in the second image, thus yielding the displacement vectors. The most difficult part of this approach is the selection of the points, or regions, to be matched. In general, the main disadvantage of this kind of method is that it determines motion in a discrete manner, so motion information is only precise for some of the pixels in the image.
The second technique, the gradient-based method, produces a denser optical flow map, providing information at the pixel level. It is based on the assumption that the intensity of a pixel I(x, y, t) remains constant across two consecutive frames and that its displacement is relatively small. Under these circumstances we verify:
(∂I/∂x)·u + (∂I/∂y)·v + ∂I/∂t = 0,   (1)
where u = ∂x/∂t and v = ∂y/∂t are the pixel displacements between the two images. Each point in the image provides one equation with two unknowns, u and v, so the motion cannot be computed directly. Different methods exist that solve (1) iteratively under additional constraints.
A complete bibliographical compilation of different optical flow methods can be
found in Wiskott (2001).
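One classic iterative scheme of this kind is Horn and Schunck's, which augments (1) with a smoothness constraint; the sketch below is a minimal NumPy/SciPy version with illustrative parameter values, not a production implementation.

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(im1, im2, alpha=1.0, n_iter=100):
    """Estimate dense optical flow (u, v) between two grayscale frames."""
    im1 = im1.astype(float)
    im2 = im2.astype(float)

    # Simple spatial and temporal derivative estimates.
    Ix = convolve(im1, np.array([[-1, 1], [-1, 1]]) * 0.25) + \
         convolve(im2, np.array([[-1, 1], [-1, 1]]) * 0.25)
    Iy = convolve(im1, np.array([[-1, -1], [1, 1]]) * 0.25) + \
         convolve(im2, np.array([[-1, -1], [1, 1]]) * 0.25)
    It = convolve(im2 - im1, np.full((2, 2), 0.25))

    avg_kernel = np.array([[1, 2, 1], [2, 0, 2], [1, 2, 1]]) / 12.0
    u = np.zeros_like(im1)
    v = np.zeros_like(im1)
    for _ in range(n_iter):
        u_avg = convolve(u, avg_kernel)
        v_avg = convolve(v, avg_kernel)
        # Correction derived from the optical flow constraint (1).
        num = Ix * u_avg + Iy * v_avg + It
        den = alpha ** 2 + Ix ** 2 + Iy ** 2
        u = u_avg - Ix * num / den
        v = v_avg - Iy * num / den
    return u, v
```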
Optical flow methods are extensively used in shape recognition, but they do not
perform well in the presence of noise. If we want to identify a more general class
of objects, it is convenient to take into account the probabilistic nature of the
object appearance and, thus, to work with the class distribution in a parametric
and compact way.
The Karhunen-Loève Transform meets these requirements. Its basis functions are the eigenvectors of the covariance matrix of the class being modeled:
Λ = Φᵀ Σ Φ,   (2)
where Σ is the covariance matrix, Λ the diagonal matrix of eigenvalues and Φ the matrix of eigenvectors. The vector basis obtained is optimal in terms of compactness (we can easily isolate vectors of low energy) and is parametric (each eigenvector is orthogonal to the others, creating a parametric eigenspace).
Elements of one class, that is, vectors of dimension M, can be represented as a linear combination of the M eigenvectors obtained for this class. The Principal Component Analysis (PCA) technique states that the same object can be reconstructed by combining only the N < M eigenvectors of greatest energy, also called principal components. It also states that the approximation error is minimized when the linear coefficients of the combination are obtained by projecting the class vector onto the subspace of principal components.
This theory applies only to objects that can be represented by vectors. Images have this property; the theory is therefore easily extended to image processing and is commonly used to model the variability of 2D objects in images such as faces.
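As a minimal sketch of this idea, the following code computes a principal-component basis from vectorized images of equal size and reconstructs a sample from its N strongest components; the array shapes, the value of N and the random stand-in data are purely illustrative.

```python
import numpy as np

def fit_pca(samples, n_components):
    """samples: (num_images, M) matrix of vectorized, same-sized images."""
    mean = samples.mean(axis=0)
    centered = samples - mean
    # Eigenvectors of the covariance matrix obtained via SVD (rows of vt).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]               # (M,), (N, M)

def project(x, mean, basis):
    return basis @ (x - mean)                    # coefficients in the eigenspace

def reconstruct(coeffs, mean, basis):
    return mean + basis.T @ coeffs               # approximation from N components

# Example with random stand-ins for 100 vectorized 32x32 face images.
faces = np.random.rand(100, 32 * 32)
mean, basis = fit_pca(faces, n_components=20)
coeffs = project(faces[0], mean, basis)
approx = reconstruct(coeffs, mean, basis)
```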
PCA is very often used to analyze and identify features of the face. It introduces some restrictions: a training stage is needed prior to the analysis, during which the basis of principal-component vectors (in this case, images) must be generated, and all images being analyzed must have the same size. Using PCA in face analysis has led to concepts such as Eigenfaces (Turk & Pentland, 1991), used for face recognition, and Eigenfeatures (Pentland, Moghaddam & Starner, 1994), used to study specific areas of the face robustly.
The book Face Image Analysis by Unsupervised Learning (Bartlett, 2001) is
a complete study of the strengths and weaknesses of methods based on
Independent Component Analysis (ICA) in contrast with PCA. It also includes
a full explanation of concepts like Eigenactions and describes recent approaches
in facial image analysis.
Active contour models, generally called snakes, are geometric curves that approximate the contours of an image by minimizing an energy function. Snakes are used to track moving contours in video sequences because they deform to stick to a contour that evolves over time.
Figure 4. By using snakes, face and feature contours are tracked on each
frame of the sequence. Images courtesy of the Image Processing Group at
the Universitat Politècnica de Catalunya.
In general, the energy function can be decomposed into two terms, an internal energy and an external energy:

Esnake = Eint + Eext.

The role of the external energy Eext is to attract the points of the snake towards the image contours. The internal energy Eint enforces a certain regularity of the snake while Eext acts, from a spatial as well as a temporal perspective. Once the energy function is defined, an iterative process is used to find its minimum. The minimum-energy point can be understood as the equilibrium position of a dynamic system subjected to the forces derived from the energy terms.
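A toy sketch of such an iterative minimization is given below: gradient descent on a discretized closed snake, where the internal term smooths the curve and the external term is supplied by a user-defined force pulling points towards contours. The force function, weights and step size are invented for illustration.

```python
import numpy as np

def snake_step(points, ext_force, alpha=0.1, gamma=0.5):
    """One gradient-descent update of a closed snake.

    points:    (N, 2) array of snake point coordinates (x, y).
    ext_force: function (x, y) -> 2-vector pulling points towards contours,
               e.g. the gradient of an edge-strength image.
    alpha:     weight of the internal (smoothness) term.
    gamma:     step size.
    """
    # Internal force: discrete second derivative pulls each point
    # towards the midpoint of its neighbours (curve regularity).
    internal = np.roll(points, 1, axis=0) - 2 * points + np.roll(points, -1, axis=0)
    external = np.array([ext_force(x, y) for x, y in points])
    return points + gamma * (alpha * internal + external)

# Example: external force that attracts every point towards a circle contour.
def toy_force(x, y, cx=50.0, cy=50.0, r=20.0):
    d = np.hypot(x - cx, y - cy)
    direction = np.array([cx - x, cy - y]) / (d + 1e-9)
    return direction * (d - r) * 0.05            # push towards the circle of radius r

angles = np.linspace(0, 2 * np.pi, 30, endpoint=False)
pts = np.column_stack([50 + 40 * np.cos(angles), 50 + 40 * np.sin(angles)])
for _ in range(200):
    pts = snake_step(pts, toy_force)
```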
Deformable models
• Elliptical: circles and ellipsoids can model the eyes (Holbert & Dugelay, 1995).
• Quadratic: parabolic curves are often used to model the lips (Leroy & Herlin, 1995); see the fitting sketch after this list.
• Splines: to develop more complex models, splines are an option. They have already been used to characterize mouth expressions (Moses, Reynard & Blake, 1995).
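As a small illustration of the quadratic case, the following sketch fits a parabola y = ax² + bx + c to lip-contour points by least squares; the sample points are, of course, made up.

```python
import numpy as np

def fit_parabola(xs, ys):
    """Least-squares fit of y = a*x**2 + b*x + c to contour points."""
    A = np.column_stack([xs ** 2, xs, np.ones_like(xs)])
    coeffs, *_ = np.linalg.lstsq(A, ys, rcond=None)
    return coeffs                                # (a, b, c)

# Hypothetical points sampled along an upper-lip edge.
xs = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
ys = np.array([42.0, 38.5, 37.0, 38.4, 41.9])
a, b, c = fit_parabola(xs, ys)
print(f"lip arch: y = {a:.4f}x^2 + {b:.3f}x + {c:.2f}")
```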
To extract motion information from specific features of the face (eyes, eyebrows, lips, etc.), we must know the animation semantics of the FA system that will synthesize the motion. Deformable models, such as snakes, deliver information about a feature in the form of the magnitudes of the parameters that control the analysis. It is then necessary to relate these parameters to the actions that must be applied to the 3D model to recreate motion and expressions. There are at least as many feature motion models as there are image-processing techniques for analyzing face features; these motion models translate the analysis results into face animation parameters.
Malciu and Prêteux (2001) track face features using snakes. Their snakes are at the same time deformable models containing the Facial Definition Parameters (FDPs) defined in the MPEG-4 standard (MPEG-4, 2000). Their technique tracks FDPs very efficiently, but it does not produce the FAPs that would animate the model to generate the observed feature motion. Chou, Chang and Chen (2001) go one step further. They present an analysis technique that searches for the points belonging to the projection of a simple 3D model of the lips, also containing the FDPs. From the projected locations they derive the FAPs that operate on them to generate the studied motion. Since one FAP may act on more than one point of their lip model, they use a least-squares solution to recover the magnitudes of the FAPs involved. Goto, Kshirsagar and Magnenat-Thalmann (1999) use a simpler approach in which image processing is reduced to edge detection and the obtained data are mapped in terms of motion interpretation: open mouth, closed mouth, half-opened mouth, etc. The magnitude of the motion is related to the location of the edges. They extend this technique to the eyes, developing their own eye motion model. Similarly, eyebrows are tracked in the image and associated with model actions.
Estimators
Linear
Let us call λ the vector of parameters obtained from the image analysis and µ the corresponding vector of FA parameters for the synthesis. The usual way to construct the linear estimator L that best satisfies µ = L·λ on the training database is to find a solution in the least-squares sense. This linear estimator is given by

L = M Λᵀ (Λ Λᵀ)⁻¹,

where M = [µ1 … µd] and Λ = [λ1 … λd] are the matrices obtained by concatenating all µ and λ vectors from the training set.
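A minimal sketch of constructing such a least-squares estimator from a training set follows; the dimensions and the random data are purely illustrative.

```python
import numpy as np

def fit_linear_estimator(mu_list, lambda_list):
    """Least-squares estimate of L such that mu ≈ L @ lam for the training pairs.

    mu_list:     list of d animation-parameter vectors (each length p).
    lambda_list: list of d analysis-parameter vectors (each length q).
    Returns L with shape (p, q).
    """
    M = np.column_stack(mu_list)                 # (p, d)
    Lam = np.column_stack(lambda_list)           # (q, d)
    # L = M Lam^T (Lam Lam^T)^-1, computed with a pseudo-inverse for stability.
    return M @ np.linalg.pinv(Lam)

# Hypothetical training set: 50 pairs of 8 analysis coefficients -> 5 action units.
rng = np.random.default_rng(0)
lams = [rng.normal(size=8) for _ in range(50)]
true_L = rng.normal(size=(5, 8))
mus = [true_L @ lam for lam in lams]
L_est = fit_linear_estimator(mus, lams)
print(np.allclose(L_est, true_L))                # True (noise-free example)
```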
Valente, Andrés del Valle and Dugelay (2001) compare the use of a linear estimator against an RBF (Radial Basis Function) network estimator. In their experiments, λ is the set of coefficients obtained by projecting an image of the feature being analyzed (an imagette) onto a PCA imagette database of that feature, recorded while making different expressions under different lighting conditions. µ contains the actions to apply to the model, in the form of AUs, to generate these different expressions. RBF networks find the relationship between pairs of examples (input and output) of different dimensions through the combination of single-variable functions whose main characteristic is that they are continuous in ℜ+ and radial (Poggio & Girosi, 1990).
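Roughly, a Gaussian RBF estimator of the same mapping could look like the sketch below; the kernel choice, width and ridge regularization are assumptions for illustration and do not reproduce the cited network.

```python
import numpy as np

class GaussianRBFEstimator:
    """Maps analysis vectors to animation vectors via Gaussian radial kernels."""

    def __init__(self, sigma=1.0, ridge=1e-6):
        self.sigma = sigma
        self.ridge = ridge

    def _kernel(self, A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * self.sigma ** 2))

    def fit(self, lambdas, mus):
        self.centers = np.asarray(lambdas)               # training inputs as centers
        K = self._kernel(self.centers, self.centers)
        K += self.ridge * np.eye(len(K))                 # regularize for stability
        self.weights = np.linalg.solve(K, np.asarray(mus))
        return self

    def predict(self, lam):
        k = self._kernel(np.atleast_2d(lam), self.centers)
        return (k @ self.weights)[0]
```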
Neural networks
Neural networks are algorithms inspired by the processing structures of the brain. They allow computers to learn a task from examples. Neural networks are typically organized in layers, which are made up of a number of interconnected "nodes," each containing an "activation function." (See Figure 5a.) Most artificial neural networks, or ANNs, contain some form of learning rule that modifies the weights of the connections according to the input patterns presented to the network. The most widely used rule is the delta rule, employed in the most common class of ANNs, the backpropagation neural networks (BPNNs). Backpropagation is an abbreviation for the backwards propagation of error.
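A bare-bones sketch of the delta rule on a single linear output unit is shown below (learning rate and data are illustrative); a full BPNN would chain this update backwards through its hidden layers.

```python
import numpy as np

def delta_rule_train(X, targets, lr=0.05, epochs=200):
    """Train a single linear unit y = w.x + b with the delta rule."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, targets):
            y = w @ x + b
            error = t - y                        # delta term
            w += lr * error * x                  # adjust weights towards the target
            b += lr * error
    return w, b

# Toy example: learn y = 2*x0 - x1.
X = np.random.default_rng(1).uniform(-1, 1, size=(100, 2))
t = 2 * X[:, 0] - X[:, 1]
w, b = delta_rule_train(X, t)
print(np.round(w, 2), round(b, 2))               # approx. [ 2. -1.] 0.0
```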
Figure 5a. Patterns are presented to the network via the "input layer," which communicates with one or more "hidden layers," where the actual processing is done via a system of weighted "connections." The hidden layers then link to an "output layer" where the answer is output.

Figure 5b. Top: a typical illustration of a two-state HMM. Circles represent states with associated observation probabilities, and arrows represent non-zero transition arcs with associated probabilities. Bottom: an illustration of a five-state HMM. The arcs under the state circles model the possibility that some states may be skipped.

ANNs complement image-processing techniques that need to interpret images, in analysis scenarios where some previous training is permitted. In Tian, Kanade and Cohn (2001), we find a fine example of the help neural networks can provide. In this article, Tian et al. explain how they have developed
the Automatic Face Analysis to analyze facial expressions. Their system takes
as input the detailed parametric description of the face features they analyze.
They use neural networks to convert these data into AUs following the motion
semantics of the Facial Action Coding System (FACS). A similar approach,
aimed at analyzing spontaneous facial behavior, is taken by Bartlett et al. (2001).
Their system also uses neural networks to describe face expressions in terms of
AUs. These two approaches differ in the image processing techniques and
parameters they use to describe the image characteristics introduced as input to
the neural network.
By collecting data from real human motion, we can model behavior patterns as
statistical densities over configuration space. Different configurations have
different observation probabilities. One very simple behavior model is the
Gaussian Mixture Model (GMM), in which the probability distribution is modeled
as a collection of Gaussians. In this case the composite density is described by:
Pr(O) = ∑_k P_k · Pr(O | λ = k),   (5)
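A small sketch of evaluating such a composite density for a one-dimensional observation follows; the component parameters are arbitrary placeholders.

```python
import numpy as np

def gmm_density(o, weights, means, variances):
    """Composite density of observation o under a 1-D Gaussian mixture.

    weights:   mixing probabilities P_k (must sum to 1).
    means:     component means.
    variances: component variances.
    """
    o = np.asarray(o, dtype=float)
    density = 0.0
    for pk, mu, var in zip(weights, means, variances):
        component = np.exp(-0.5 * (o - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        density += pk * component                # P_k * Pr(O | lambda = k)
    return density

# Example: two-component mixture evaluated at a few observations.
print(gmm_density([0.0, 1.5], weights=[0.6, 0.4], means=[0.0, 2.0], variances=[1.0, 0.5]))
```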
Fuzzy systems
Humans detect and interpret faces and facial expressions in a scene with little
or no effort. The systems we discuss in this section accomplish this task
automatically. The main concern of these techniques is to classify the observed facial expressions in terms of generic facial actions or emotion categories, not to recover the face animation that would be needed to reproduce them synthetically.
Yacoob has explored the use of local parameterized models of image motion for
recognizing the non-rigid and articulated motion of human faces. These models
provide a description of the motion in terms of a small number of parameters that
are related intuitively to the motion of some facial features under the influence
of expressions. The expression description is obtained after analyzing the spatial
distribution of the motion direction field obtained from the optical flow analysis
computed at points of high gradient values of the image of the face. This
technique gives fairly good results, although the use of optical flow requires very stable lighting conditions and very smooth head motion during the analysis. It is also computationally heavy. From the initial research
(Yacoob & Davis, 1994) to the last published results about the performance of
the system (Black & Yacoob, 1997), improvements in the tuning of the
processing have been added to make it more robust to head rotations.
Huang and Huang (1997) introduce a system developed in two parts: facial
feature extraction (for the training-learning of expressions) and facial expression
recognition. The system applies a point distribution model and a gray-level model
to find the facial features. Then, the position variations are described by ten
Action Parameters (APs). During the training phase, given 90 different expres-
sions, the system classifies the principal components of the APs into six different
clusters. In the recognition phase, given a facial image sequence, it identifies the
facial expressions by extracting the ten APs, analyzes the principal components,
and finally calculates the AP profile correlation for a higher recognition rate. To
perform the image analysis, deformable models of the face features are fitted
onto the images. The system is only trained for faces in a frontal view. It appears more robust to illumination conditions than the previous approach, but the authors do not discuss the image-processing techniques, making this point hard to evaluate.
Pantic and Rothkrantz (2000) describe another approach, which is the core of the
Integrated System for Facial Expression Recognition (ISFER). The system finds
the contour of the features with several methods suited to each feature: snakes,
binarization, deformable models, etc., making it more efficient under uncon-
trolled conditions: irregular lighting, glasses, facial hair, etc. An NN architecture
of fuzzy classifiers is designed to analyze the complex mouth movements. In their
article, they do not present a robust solution to the non-frontal view positions.
To some extent, all systems discussed have based their description of face
actions on the Facial Action Coding System (FACS) proposed by Ekman and
Friesen (1978). The importance granted to FACS is such that two research
teams, one at the University of California, San Diego (UCSD) and the Salk
Institute, and another at the University of Pittsburgh and Carnegie Mellon
University (CMU), were challenged to develop prototype systems for automatic
recognition of spontaneous facial expressions.
The system developed by the UCSD team, described in Bartlett et al. (2001),
analyzes face features after having determined the pose of the individual in front
of the camera, although tests of their expression analysis system are only
performed on frontal view faces. Features are studied using Gabor filters and
afterwards classified using a previously trained HMM. The HMM is applied in
two ways:
SVMs are used as classifiers. They are a way to achieve good generalization
rates when compared to other classifiers because they focus on maximally
informative exemplars, the support vectors. To match face features, they first
convolve them with a set of kernels (out of the Gabor analysis) to make a jet.
Then, that jet is compared with a collection of jets taken from training images,
and the similarity value for the closest one is taken. In their study, Bartlett et al.
claim an AU detection accuracy from 80% for eyebrow motion to around 98%
for eye blinks.
CMU has opted for another approach, in which face features are modeled as multi-state facial components. They use neural networks to derive the AUs associated with the observed motion. They have developed facial models for lips, eyes, brows, cheeks and furrows. In their article, Tian et al. (2001) describe this technique, giving details about the models and the use of two NNs, one for the upper part of the face and another for the lower part. (See Figure 6.)
They do not discuss the image processing involved in the derivation of the feature
model from the images. Tests are performed over a database of faces recorded
under controlled light conditions. Their system allows the analysis of faces that
are not completely in a frontal position, although most tests were performed only
on frontal view faces. The average recognition rates achieved are around 95.4%
for upper face AUs and 95.6% for lower face AUs.
Piat and Tsapatsoulis (2000) take the challenge of deducing face expression out
of images from another perspective, no longer based on FACS. Their technique
finds first the action parameters (MPEG-4 FAPs) related to the expression being
analyzed and then they formulate this expression with high-level semantics. To
do so, they have related the intensity of the most used expressions to their
associated FAPs. Other approaches (Chen & Huang, 2000) complement the
image analysis with the study of the human voice to extract more emotional
information. These studies are oriented to develop the means to create a Human-
Computer Interface (HCI) in a completely bimodal way.
The reader can find in Pantic and Rothkrantz (2000) overviews and comparative studies of many techniques, including some of those just discussed, analyzed from the HCI perspective.
Figure 6. Face features (eyes, mouth, brows, …) are extracted from the
input image; then, after analyzing them, the parameters of their deformable
models are introduced into the NNs which finally generate the AUs
corresponding to the face expression. Image courtesy of The Robotics
Institute at Carnegie Mellon University.
Some face animation systems need, as input, action parameters that specify how to open the mouth, the position of the eyelids, the orientation of the eyes, etc., in terms of parameter magnitudes associated with physical displacements. The analysis methods studied in this section try to measure displacements and feature magnitudes over the images to derive the actions to be performed on the head models. These methods do not evaluate the expression on the person's face, but extract the measurements that will permit its synthesis on a model from the image, as shown in Figure 7.
Terzopoulos and Waters (1993) developed one of the first solutions of this
nature. Their method tracks linear facial features to estimate corresponding
parameters of a three-dimensional, wireframe face model, allowing them to
reproduce facial expressions. A significant limitation of this system is that it
requires facial features to be highlighted with make-up for successful tracking.
Although active contour models are used, the system is still passive. The tracked
contour features passively shape the facial structure without any active control
based on observations.
Based on an animation system similar to that of Waters, that is, built on anatomically based muscle actions that animate a 3D face wireframe, Essa and Pentland define a suitable set of control parameters using vision-based observations. They call their solution FACS+ because it is an extension of the traditional FACS. They use optical flow analysis over time on sequences of frontal-view faces to obtain 2D velocity vectors, which are then mapped to the parameters. They point out in Essa, Basu, Darrell and Pentland (1996) that driving the physical system with inputs from noisy motion estimates can result in divergence or a chaotic physical response. This is why they use a continuous-time Kalman filter (CTKF) to better estimate uncorrupted state vectors. In their work they develop the concept of motion templates, the "corrected" or "noise-free" 2D motion fields associated with each facial expression. These templates are used to improve the optical flow analysis.
Morishima has been developing a system that succeeds in animating a generic parametric muscle model after it has been customized to the shape and texture of the person it represents. Motion data are generated by means of optical flow image analysis, complemented with speech processing. These data are translated into motion parameters after passing through a previously trained neural network. In Morishima (2001), he explains the basis of this system, as well as how to generate very realistic animation from electrical sensors placed on the face. The data obtained from this hardware-based study provide excellent training material for coupling the audio processing.
To constrain the optical flow data generated from the analysis of consecutive frames, Tang and Huang (1994) project the head-model wireframe vertices onto the images and search for 2D motion vectors only around these vertices. The model they animate is very simple and the 2D motion vectors are directly translated into 2D vertex motion. No 3D action is generated.
Almost the same procedure is used by Sarris and Strintzis (2001, 2002) in their
system for video-phoning for the hearing impaired. The rigid head motion (pose)
is obtained by fitting the projection of a 3D wireframe onto the image being
analyzed. Then, non-rigid face movements (expressions) are estimated thanks
to a feature-based approach adapted from the Kanade, Lucas and Tomasi
algorithm. The KLT algorithm is based on minimizing the sum of squared
intensity differences between a past and a current feature window, which is
performed using a Newton-Raphson minimization method. The features to track
are some of the projected points of the wireframe, the MPEG-4 FDPs. To derive
MPEG-4 FAPs from this system, they add to the KLT algorithm the information
about the degrees of freedom of motion (one or several directions) that the
combination of the possible FAPs allows on the studied feature FDPs.
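As a compact sketch of the KLT idea described above, the code below performs one Gauss-Newton (Newton-Raphson style) update of a pure-translation displacement for a single feature window; the window size, gradient computation and single-iteration simplification are assumptions for illustration.

```python
import numpy as np

def klt_translation_step(prev, curr, x, y, half=7):
    """One Gauss-Newton update of the translation (dx, dy) of a feature window.

    prev, curr: grayscale frames as 2-D float arrays.
    (x, y):     integer feature position in the previous frame.
    half:       half-size of the square tracking window.
    """
    win_prev = prev[y - half:y + half + 1, x - half:x + half + 1]
    win_curr = curr[y - half:y + half + 1, x - half:x + half + 1]

    # Spatial gradients of the previous window, temporal difference between windows.
    gy, gx = np.gradient(win_prev)
    gt = win_curr - win_prev

    # Normal equations of the sum of squared differences w.r.t. (dx, dy).
    G = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    b = -np.array([np.sum(gx * gt), np.sum(gy * gt)])
    dx, dy = np.linalg.solve(G, b)
    return dx, dy                                # displacement estimate for this step
```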
Ahlberg (2002) also presents a wireframe fitting technique to obtain the rigid head motion. He uses the new parameterized variant of the CANDIDE face model, named CANDIDE-3, which is MPEG-4 compliant. The image analysis techniques include PCA on eigentextures, which permits the analysis of more specific features that control the model deformation parameters.
More detailed feature-point tracking is developed by Chou et al. (2001). They track the projected points belonging to the mouth, eyes and nostrils. Their models are also based on the physical vertex distribution of MPEG-4's FDPs, and they are able to obtain the combination of FAPs that regenerates the expression and motion of the analyzed face. Their complete system also deals with audio input, analyzing it and complementing the animation data for the lips. The main goal of their approach is to achieve real-time analysis so that these techniques can be employed in teleconferencing applications. They do not directly obtain the pose parameters needed to also reproduce the pose of the head synthetically, but they experiment with extending their analysis to head poses other than a frontal view by roughly estimating the head pose from the image analysis and rectifying the original input image.
The MIRALab research team at the University of Geneva (Switzerland) has
developed a complete system to animate avatars in a realistic way, in order to
use them for telecommunications. In Goto et al. (2001), they review the entire
process to generate customized realistic animation. The goal of their system is
to clone face behavior. The first step in the overall process is to physically adapt
a generic head mesh model (already susceptible to being animated) to the shape
of the person to be represented. In essence, they follow the same procedure that
Morishima presents in his work. Goto et al. do this by using just a frontal and side
view picture of the individual, whereas Morishima also includes other views to
recover texture on self occlusions. Models are animated using MPEG-4 FAPs
to allow for compatibility with other telecom systems. Animation parameters are
extracted from video input of the frontal view face of the speaker and then
synthesized, either on the cloned head model or on a different one. Speech
processing is also utilized to generate more accurate mouth shapes. An interest-
ing post-processing step is added. If the analysis results do not reflect coherent
anatomical motion, they are rejected and the system searches in a probability
database for the most probable motion solution to the incoherence. In Goto,
Escher and Magnenat-Thalmann (1999), the authors give a more detailed
explanation about the image processing involved. Feature motion models for
eyes, eyebrows, and mouth allow them to extract image parameters in the form
of 2D point displacements. These displacements represent the change of the
feature from the neutral position to the instant of the analysis and are easily
converted into FAPs. Although the system presents possibilities to achieve face
cloning, the current level of animation analysis only permits instant motion
replication with little precision. We consider that face cloning is not guaranteed
even if realistic animation is.
Also aiming at telecom applications, Andrés del Valle and Dugelay (2002) have
developed a system that takes advantage of robust face feature analysis
techniques, as well as the synthesis of the realistic clone of the individual being
analyzed. We can consider their approach a hybrid between the methods
discussed in this section and those that will be presented in the next one. They
use a Kalman filter to recover the head global position and orientation. The data
predicted by the filter allows them to synthesize a highly realistic 3D model of the
speaker with the same scale, position and orientation of the individual being
recorded. These data are also useful to complement and adapt feature-analysis algorithms, initially designed to work from a frontal point of view, to any other head pose. The analysis algorithm parameters and variables are no longer defined over the 2D image plane, but over the realistic 3D head model. This solution keeps face feature analysis under control while the speaker's pose changes. Although the system uses the clone of the speaker for the analysis, the obtained parameters are general enough to be synthesized on other models or avatars. (See Figure 8.)

Figure 8. In the approach proposed by Andrés del Valle and Dugelay, the avatar not only reproduces the non-rigid motion that feature-based analysis permits, but also synthesizes the rigid motion, thanks to the use of Kalman filtering during pose prediction. Images courtesy of the Image Group at the Institut Eurécom.
Some face motion analysis techniques use the synthesized image of the head model to control or refine the analysis procedure. In general, the systems that use synthesized feedback in their analysis need a very realistic head model of the speaker, tight control of the synthesis and knowledge of the conditions under which the face is being recorded.
Li, Roivainen and Forchheimer (1993) presented one of the first works to use
resynthesized feedback. Using a 3D model — Candide — their approach is
characterized by a feedback loop connecting computer vision and computer
graphics. They prove that embedding synthesis techniques into the analysis
phase greatly improves the performance of motion estimation. A slightly
different solution is given by Ezzat and Poggio (1996a, 1996b). In their articles,
they describe image-based modeling techniques that make possible the creation
of photo-realistic computer models of real human faces. The model they use is
built using example views of the face, bypassing the need for any 3D computer
graphics. To generate the motion for this model, they use an analysis-by-
synthesis algorithm, which is capable of extracting a set of high-level parameters
from an image sequence involving facial movement using embedded image-
based models. The parameters of the models are perturbed in a local and
independent manner for each image until a correspondence-based error metric
is minimized. Their system is restricted to a limited number of expressions.
More recent research achieves much more realistic results with three-dimensional models. Eisert and Girod (1998), for instance, present a
system that estimates 3D motion from image sequences showing head and
shoulder scenes for video telephone and teleconferencing applications. They use
a very realistic 3D head model of the person in the video. The model constrains
the motion and deformation in the face to a set of FAPs defined by the MPEG-
4 standard. Using the model, they obtain a description of both global (head pose)
and local 3D head motion as a function of unknown facial parameters. Combining
the 3D information with the optical flow constraint leads to a linear algorithm that
estimates the facial animation parameters. Each synthesized image reproducing
face motion from frame t is utilized to analyze the image of frame t+1. Since
natural and synthetic frames are compared at the image level, it is necessary for
the lighting conditions of the video scene to be under control. This implies, for
example, standard, well-distributed lighting.
Pighin, Szeliski and Salesin (1999) take this approach furthest by customizing animation and analysis on a person-by-person basis. They use new techniques to automatically recover the face position and the facial expression from each frame in a video sequence. Several views of the person are used to construct the model. For the animation, realism is ensured by studying how to linearly combine 3D face models, each corresponding to a particular facial expression of the individual. Their mesh-morphing approach is detailed in Pighin, Hecker, Lischinski, Szeliski and Salesin (1998). Their face motion and expression analysis system fits the 3D model to each frame using a continuous optimization technique. During the fitting process, the parameters are tuned to achieve the most accurate model shape. Video image and synthesis are compared to determine the degree of similarity of the animated model. They have developed an optimization method whose goal is to compute the model parameters yielding a rendering of the model that best resembles the target image. Although the procedure is very slow, the animated results are impressive because they are highly realistic and very close to what we would expect from face cloning. (See Figure 9.)
Figure 9. Tracking example of Pighin’s system. The bottom row shows the
result of fitting their model to the target images on the top row. Images
courtesy of the Computer Science Department at the University of
Washington.
Table. Comparison of the analysis methods discussed, rated against the following criteria: markers?; training?; does it allow rotations? (pose understanding); realistic model?; controlled lighting?; potential real-time?; time-line (video) analysis?; possible synthesis/reproduction?
Methods that obtain emotion information: M. J. Black & Y. Yacoob [BY97] (optical flow / parametric model of image motion); C. H. Huang & Y. M. Huang [HH97] (deformable models / PCA); M. Pantic & L. J. M. Rothkrantz [PR00] (NN / fuzzy logic / deformable models); M. S. Bartlett et al. [BBL+01] (HMM / optical flow / Gabor filters / PCA / ICA).
Methods that obtain parameters related to the Face Animation synthesis afterwards used: D. Terzopoulos & K. Waters [TW93] (snakes).
Notes: ♣ Author's comment. ♦ For the face tracking, which is based on point tracking. ~ Slight rotations are permitted although there is no direct use of the pose data during image processing.
Face motion and expression analysis continues to attract interest from many research fields. A proof of this interest is how the new standard for the coding of hybrid natural-synthetic media, MPEG-4, has given special importance to facial animation (Ostermann, 2002). The standard specifies a common syntax to describe face behavior, thus permitting interoperability amongst different face animation systems. At this point in the evolution and deployment of MPEG-4-compliant applications, several concerns have appeared: has the standard given a global solution that all specific face animation systems can adopt, or does the syntax restrict the semantics of the achievable motion too much?
No matter the answer, the existence of these doubts shows that there is still a long way to go to master face animation and, more concretely, the automatic generation of realistic human-like face motion. All the analysis techniques covered in this chapter are of great help in the study of facial motion because image analysis intrudes the least into the observed scenario, thus permitting the study of real and completely natural behavior.
References
Ahlberg, J. (2002). An active model for facial feature tracking. EURASIP Journal on Applied Signal Processing, 6, 566-571.
Andrés del Valle, A. C. & Dugelay, J. L. (2002). Facial expression analysis robust to 3D head pose motion. Proceedings of the International Conference on Multimedia and Expo.
Bartlett, M. S. (2001). Face image analysis by unsupervised learning. Boston, MA: Kluwer Academic Publishers.
Bartlett et al. (2001). Automatic analysis of spontaneous facial behavior: A final project report (Tech. Rep. No. 2001.08). San Diego, CA: University of California, San Diego, MPLab.
Black, M. J. & Yacoob, Y. (1997). Recognizing facial expressions in image sequences using local parameterized models of image motion. International Journal of Computer Vision, 25(1), 23-48.
Chen, L. S. & Huang, T. S. (2000). Emotional expressions in audiovisual human computer interaction. Proceedings of the International Conference on Multimedia and Expo.
Chou, J. C., Chang, Y. J. & Chen, Y. C. (2001). Facial feature point tracking and expression analysis for virtual conferencing systems. Proceedings of the International Conference on Multimedia and Expo.
Cordea, M. D., Petriu, E. M., Georganas, N. D., Petriu, D. C. & Whalen, T. E. (2001). 3D head pose recovery for interactive virtual reality avatars.
Shimizu, I., Zhang, Z., Akamatsu, S. & Deguchi, K. (1998). Head pose determination from one image using a generic model. Proceedings of the Third International Conference on Automatic Face and Gesture Recognition, 100-105.
Ström, J., Jebara, T., Basu, S. & Pentland, A. (1999). Real time tracking and modeling of faces: An EKF-based analysis by synthesis approach. Proceedings of the Modelling People Workshop at ICCV'99.
Tang, L. & Huang, T. S. (1994). Analysis-based facial expression synthesis. Proceedings of the International Conference on Image Processing, 98-102.
Terzopoulos, D. & Waters, K. (1993, June). Analysis and synthesis of facial image sequences using physical and anatomical models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(6).
Tian, Y., Kanade, T. & Cohn, J. F. (2001, February). Recognizing action units for facial expression analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2), 97-115.
Turk, M. & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1).
Valente, S. & Dugelay, J. L. (2001). A visual analysis/synthesis feedback loop for accurate face tracking. Signal Processing: Image Communication, 16(6), 585-608.
Valente, S., Andrés del Valle, A. C. & Dugelay, J. L. (2001). Analysis and reproduction of facial expressions for realistic communicating clones. Journal of VLSI Signal Processing, 29, 41-49.
Wiskott, L. (2001, July). Optical flow estimation. Retrieved September 26, 2002, from the World Wide Web: http://www.cnl.salk.edu/~wiskott/Bibliographies/FlowEstimation.html
Yacoob, Y. & Davis, L. (1994). Computing spatio-temporal representations of human faces. Proceedings of the Computer Vision and Pattern Recognition Conference, 70-75.
Yuille, A. L. (1991). Deformable templates for face recognition. Journal of Cognitive Neuroscience, 3(1), 59-70.
Zhang, Z. (2000). A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11), 1330-1334.
Chapter VII
Analysis and Synthesis of Facial Expressions
Abstract
Introduction
are becoming more available with today’s computers. In this chapter, the state-
of-the-art in facial animation and analysis is reviewed and new techniques for the
estimation of 3-D human motion, deformation, and facial expressions from
monocular video sequences are presented. The chapter starts with an overview
of existing methods for representing human heads and facial expressions three-
dimensionally in a computer. Algorithms for the determination of facial expres-
sions from images and image sequences are reviewed, focusing on feature-
based and optical-flow based methods. For natural video capture conditions,
scene lighting often varies over time. This illumination variability has a consid-
erable influence not only on the visual appearance of the objects in the scene, but
also on the performance of the estimation algorithms. Therefore, methods for
determining lighting changes in the scene are discussed for the purpose of robust
facial analysis under uncontrolled illumination settings. After this overview, an
example of a hierarchical, gradient-based method for the robust estimation of
MPEG-4 facial animation parameters is given, illustrating the potential of model-
based coding. This method is able to simultaneously determine both global and
local motion in the face in a linear, low-complexity framework. In order to
improve the robustness against lighting changes in the scene, a new technique for
the estimation of photometric properties based on Eigen light maps is added to
the system. The performance of the presented methods is evaluated in some
experiments given in the application section. First, the concept of model-based
coding is described, where head-and-shoulder image sequences are represented
by computer graphics models that are animated according to the facial motion
and deformation extracted from real video sequences. Experiments validate that
such sequences can be encoded at less than 1 kbit/s, enabling a wide range of new
applications. Given an object-based representation of the current scene, changes
can easily be made by modifying the 3-D object models. In that context, we will
show how facial expression analysis can be used to synthesize new video
sequences of arbitrary people, who act exactly in the same way as the person
in a reference sequence, which, e.g., enables applications in facial animation for
film productions.
Facial Animation
Modeling the human face is a challenging task because of its familiarity. Already
early in life, we are confronted with faces and learn how to interpret them. We
are able to recognize individuals from a large number of similar faces and to
detect very subtle changes in facial expressions. Therefore, the general accept-
ability of synthetic face images strongly depends on the 3-D head model used for
rendering. As a result, significant effort has been spent on the accurate modeling
of a person’s appearance and his or her facial expressions (Parke et al., 1996).
Both problems are addressed in the following two sections.
In principle, most head models used for animation are based on triangle meshes
(Rydfalk, 1978; Parke, 1982). Texture mapping is applied to obtain a photorealistic
appearance of the person (Waters, 1987; Terzopoulos et al., 1993; Choi et al.,
1994; Aizawa et al., 1995; and Lee et al., 1995). With extensive use of today's
computer graphics techniques, highly realistic head models can be realized
(Pighin et al., 1998).
Modeling the shape of a human head with polygonal meshes results in a
representation consisting of a large number of triangles and vertices which have
to be moved and deformed to show facial expressions. The face of a person,
however, has a smooth surface and facial expressions result in smooth move-
ments of surface points due to the anatomical properties of tissue and muscles.
These restrictions on curvature and motion can be exploited by splines which
satisfy certain continuity constraints. As a result, the surface can be represented
by a set of spline control points that is much smaller than the original set of
vertices in a triangle mesh. This has been exploited by Hoch et al. (1994) where
B-splines with about 200 control points are used to model the shape of human
heads. In Ip et al. (1996), non-uniform rational B-splines (NURBS) represent the
facial surfaces. Both types of splines are defined on a rectangular topology and,
therefore, do not allow a local patch refinement in areas that are highly curved.
To overcome this restriction, hierarchical splines have been proposed for the
head modeling (Forsey et al., 1988) to allow a recursive subdivision of the
rectangular patches in more complex areas.
Face, eyes, teeth, and the interior of the mouth can be modeled similarly with
textured polygonal meshes, but a realistic representation of hair is still not
available. A lot of work has been done in this field to model the fuzzy shape and
reflection properties of the hair. For example, single hair strands have been
modeled with polygonal meshes (Watanabe et al., 1992) and the hair dynamics
have been incorporated to model moving hair (Anjyo et al., 1992). However,
these algorithms are computationally expensive and are not feasible for real-time
applications in the near future. Image-based rendering techniques (Gortler et al.,
1996; Levoy et al., 1996) might provide new opportunities for solving this
problem.
Once a 3-D head model is available, new views can be generated by rotating and
translating the 3-D object. However, for the synthesis of facial expressions, the
model can no longer be static. In general, two different classes of facial
expression modeling can be distinguished in model-based coding applications: the
clip-and-paste method and algorithms based on the deformation of the 3-D surfaces.
For the clip-and-paste method (Aizawa et al., 1989; Welsh et al., 1990; and
Chao et al., 1994), templates of facial features like eyes and the mouth are
extracted from previous frames and mapped onto the 3-D shape model. The
model is not deformed according to the facial expression, but remains rigid and
is used only to compensate for the global motion given by head rotation and
translation. All local variations in the face must, therefore, be described by
texture changes of the model. During encoding of a video sequence, a codebook
containing templates for different facial expressions is built. A new expression
can then be synthesized by combining several feature templates that are
specified by their position on the model and their template index from the
codebook. As a result, a discrete set of facial expressions can be synthesized.
However, the transmission of the template codebook to the decoder consumes
a large number of bits, which makes the scheme unsuitable for coding purposes
(Welsh et al., 1990). Beyond that, the localization of the facial features in the
frames is a difficult problem. Pasting of templates extracted at slightly inaccu-
rate positions leads to an unpleasant “jitter” in the resulting synthetic sequence.
The deformation method avoids these problems by using the same 3-D model
for all facial expressions. The texture remains basically constant and facial
expressions are generated by deforming the 3-D surface (Noh et al., 2001). In
order to avoid the transmission of all vertex positions in the triangle mesh, the
facial expressions are compactly represented using high-level expression pa-
rameters. Deformation rules associated with the 3-D head model describe how
certain areas in the face are deformed if a parameter value changes. The
superposition of many of these local deformations is then expected to lead to the
desired facial expression. Due to the advantages of the deformation method over
the clip-and-paste method (Welsh et al., 1990), it is used in most current
approaches for representing facial expressions. The algorithms proposed in this
chapter are also based on this technique and, therefore, the following review of
related work focuses on the deformation method for facial expression modeling.
One of the first systems of facial expression parameterization was proposed by
Hjortsjö (1970) and later extended by the psychologists Ekman and Friesen
(1978). Their facial action coding system (FACS) is widely used today for the
description of facial expressions in combination with 3-D head models (Aizawa
et al., 1989; Li, 1993; Choi et al., 1994; and Hoch et al., 1994). According to that
scheme, any facial expression results from the combined action of the 268
muscles in the face. Ekman and Friesen discovered that the human face
performs only 46 possible basic actions. Each of these basic actions is affected
by a set of muscles that cannot be controlled independently. To obtain the
deformation of the facial skin that is caused by a change of an action unit, the
motion of the muscles and their influence on the facial tissue can be simulated
using soft tissue models (Terzopoulos et al., 1993; Lee et al., 1995). Due to the
high computational complexity of muscle-based tissue simulation, many applica-
tions model the surface deformation directly (Aizawa et al., 1989; Choi et al.,
1994) using heuristic transforms between action units and surface motion.
Very similar to the FACS is the parameterization in the synthetic and natural
hybrid coding (SNHC) part of the MPEG-4 video coding standard (MPEG,
1999). Rather than specifying groups of muscles that can be controlled indepen-
dently and that sometimes lead to deformations in larger areas of the face, the
single parameters in this system directly correspond to locally limited deforma-
tions of the facial surface. There are 66 different facial animation parameters
(FAPs) that control both global and local motion.
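To make the idea of parameter-driven surface deformation concrete, here is a toy sketch in which each animation parameter displaces a set of affected vertices along predefined directions scaled by per-vertex weights; the influence table is invented and far simpler than the actual MPEG-4 FAP definitions.

```python
import numpy as np

def apply_animation_params(vertices, influences, params):
    """Deform a vertex array according to high-level animation parameters.

    vertices:   (V, 3) array of neutral-face vertex positions.
    influences: dict param_name -> list of (vertex_index, weight, direction),
                i.e. the deformation rule associated with that parameter.
    params:     dict param_name -> parameter value (0 = neutral).
    """
    deformed = vertices.copy()
    for name, value in params.items():
        for idx, weight, direction in influences.get(name, []):
            deformed[idx] += value * weight * np.asarray(direction, dtype=float)
    return deformed

# Invented example: one parameter that lowers two "jaw" vertices.
neutral = np.zeros((4, 3))
influences = {
    "open_jaw": [(2, 1.0, (0.0, -1.0, 0.0)),     # chin vertex moves straight down
                 (3, 0.5, (0.0, -1.0, 0.0))],    # neighbouring vertex moves half as much
}
print(apply_animation_params(neutral, influences, {"open_jaw": 0.8}))
```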
Instead of using facial expression descriptions that are designed with a relation
to particular muscles or facial areas, data-driven approaches are also used for
the modeling. By linearly interpolating 3-D models in a database of people
showing different facial expressions, new expressions can be created (Vetter et
al., 1998; Blanz et al., 1999). Ortho-normalizing this face-space using a KLT
leads to a compact description that allows the representation of facial expres-
sions with a small set of parameters (Hölzer, 1999; Kalberer et al., 2001).
Feature-based estimation
One common way for determining the motion and deformation in the face
between two frames of a video sequence is the use of feature points (Kaneko
et al., 1991; Terzopoulos et al., 1993; Gee et al., 1994; Huang et al., 1994; Lopez
et al., 1995; and Pei, 1998). Highly discriminant areas with large spatial
variations, such as areas containing the eyes, nostrils, or mouth corners, are
identified and tracked from frame to frame. If corresponding features are found
in two frames, the change in position determines the displacement.
How the features are searched depends on properties such as color, size, and
shape. For facial features, extensive research has been performed, especially in
the area of face recognition (Chellappa et al., 1995). Templates (Brunelli et al.,
1993), often used for finding facial features, are small reference images of
typical features. They are compared at all positions in the frame to find a good
match between the template and the current image content (Thomas et al.,
1987). The best match is said to be the corresponding feature in the second
frame. Problems with templates arise from the wide variability of captured
images due to illumination changes or different viewing positions. To compensate
for these effects, eigen-features (Moghaddam et al., 1997; Donato et al., 1999),
which span a space of possible feature variations or deformable templates
(Yuille, 1991) and reduce the features to parameterized contours, can be utilized.
Instead of estimating single feature points, the whole contour of features can also
be tracked (Huang et al., 1991; Pearson, 1995) using snakes. Snakes (Kass et al.,
1987) are parameterized active contour models that are composed of internal and
external energy terms. Internal energy terms account for the shape of the
feature and smoothness of the contour, while the external energy attracts the
snake towards feature contours in the image.
All feature-based algorithms have in common that single features, like the eyes,
can be found quite robustly. Dependent on the image content, however, only a
small number of feature correspondences can typically be determined. As a
result, the estimation of 3-D motion and deformation parameters from the
displacements lacks the desired accuracy if a feature is erroneously associated
with a different feature in the second frame.
Approaches based on optical flow information utilize the entire image informa-
tion for the parameter estimation, leading to a large number of point correspon-
dences. The individual correspondences are not as reliable as the ones obtained
with feature-based methods, but due to the large number of equations, some
mismatches are not critical. In addition, possible outliers (Black et al., 1996) can
generously be removed without obtaining an underdetermined system of equa-
tions for the determination of 3-D motion.
One way of estimating 3-D motion is the explicit computation of an optical flow
field (Horn et al., 1981; Barron et al., 1994; and Dufaux et al., 1995), which is
followed by the derivation of motion parameters from the resulting dense
displacement field (Netravali et al., 1984, Essa et al., 1994; and Bartlett et al.,
1995). Since the computation of the flow field from the optical flow constraint
equation (Horn et al., 1981), which relates image gradient information (Simoncelli,
1994) to 2-D image displacements, is an underdetermined problem, additional
smoothness constraints have to be added (Horn, 1986; Barron et al., 1994). A
non-linear cost function (Barron et al., 1994) is obtained that is numerically
minimized. The use of hierarchical frameworks (Enkelmann, 1988; Singh, 1990;
and Sezan et al., 1993) can reduce the computational complexity of the
optimization in this high-dimensional parameter space. However, even if the
global minimum is found, the heuristic smoothness constraints may lead to
deviations from the correct flow field, especially at object boundaries and depth
discontinuities.
In model-based motion estimation, the heuristic smoothness constraints are,
therefore, often replaced by explicit motion constraints derived from the 3-D
object models. For rigid body motion estimation (Kappei, 1988; Koch, 1993), the
3-D motion model, specified by three rotational and three translational degrees
of freedom, restricts the possible flow fields in the image plane. Under the
assumption of perspective projection, known object shape, and small motion
between two successive video frames, an explicit displacement field can be
derived that is linear in the six unknown degrees of freedom (Longuet, 1984;
Netravali et al., 1984; and Waxman et al., 1987). This displacement field can
easily be combined with the optical flow constraint to obtain a robust estimator
for the six motion parameters. Iterative estimation in an analysis-synthesis
framework (Li et al., 1993) removes remaining errors caused by the linearization
of image intensity and the motion model.
For facial expression analysis, the rigid body assumption can no longer be
maintained. Surface deformations due to facial expressions have to be consid-
ered additionally. Most approaches found in the literature (Ostermann, 1994;
Choi et al., 1994; Black et al., 1995; Pei, 1998; and Li et al., 1998) separate this
problem into two steps. First, global head motion is estimated under the
assumption of rigid body motion. Local motion caused by facial expressions is
regarded as noise (Li et al., 1994b) and, therefore, the textured areas around the
mouth and the eyes are often excluded from the estimation (Black et al., 1995;
and Li et al., 1994b). Given head position and orientation, the remaining residuals
of the motion-compensated frame are used to estimate local deformations and
facial expressions. In (Black et al., 1995; Black et al., 1997), several 2-D motion
models with six (affine) or eight parameters are used to model local facial
deformations. By combining these models with the optical flow constraint, the
unknown parameters are estimated in a similar way as in the rigid body case.
High-level facial animation parameters can finally be derived from the estimated
set of 2-D motion parameters. Even higher robustness can be expected by
directly estimating the facial animation parameters using more sophisticated
motion models. In Choi et al. (1994), a system is described that utilizes an explicit
3-D head model. This head model directly relates changes of facial animation
parameters to surface deformations. Orthographic projection of the motion
constraints and combination with optical flow information result in a linear
estimator for the unknown parameters. The accuracy problem of separate global
and local motion estimation is here relaxed by an iterative framework that
alternately estimates the parameters for global and local motion.
The joint estimation of global head motion together with facial expressions is
rarely addressed in the literature. In Li et al. (1993; 1994), a system for the
combined estimation of global and local motion is presented that stimulated the
approaches presented in the next section. A 3-D head model based on the
Candide (Rydfalk, 1978) model is used for image synthesis and provides explicit
3-D motion and deformation constraints. The affine motion model describes the
image displacements as a linear function of the six global motion parameters and
the facial action units from the FACS system, which are simultaneously
estimated in an analysis-synthesis framework. Another approach that allows a
joint motion and deformation estimation has been proposed by DeCarlo et al.
(1996, 1998). A deformable head model is employed that consists of ten separate
face components that are connected by spring-like forces incorporating anthro-
pometric constraints (DeCarlo et al., 1998b; Farkas, 1995). Thus, the head shape
can be adjusted similar to the estimation of local deformations. For the determi-
nation of motion and deformation, again a 3-D motion model is combined with the
optical flow constraint. The 3-D model also includes a dynamic, Lagrangian
description for the parameter changes similar to the work of Essa (Essa et al.,
1994; Essa et al., 1997). Since the head model lacks any color information, no
synthetic frames can be rendered which makes it impossible to use an analysis-
synthesis loop. Therefore, additional edge forces are added to avoid an error
accumulation in the estimation.
Illumination Analysis
In order to estimate the motion of objects between two images, most algorithms
make use of the brightness constancy assumption (Horn, 1986). This assump-
tion, which is an inherent part of all optical flow-based and many template-based
methods, implies that corresponding object points in two frames show the same
brightness. However, if the lighting in the scene changes, the brightness of
corresponding points might differ significantly. Moreover, if the orientation of the object surface relative to a light source changes due to object motion, brightness is in general not constant (Verri et al., 1989). In fact, intensity changes
due to varying illumination conditions can dominate the effects caused by object
motion (Pentland, 1991; Horn, 1986; and Tarr, 1998). For accurate and robust
extraction of motion information, lighting effects must be taken into account.
In spite of the relevance of illumination effects, they are rarely addressed in the
area of 3-D motion estimation. In order to allow the use of the optical flow
constraint for varying brightness, higher order differentials (Treves et al., 1994)
or pre-filtering of the images (Moloney, 1991) have been applied. Similarly,
lightness algorithms (Land et al., 1971; Ono et al., 1993; and Blohm, 1997)
make use of the different spectral distributions of texture and intensity changes
due to shading, in order to separate irradiance from reflectance. If the influence
of illumination cannot be suppressed sufficiently by filtering as, e.g., in image
regions depicting highlights caused by specular reflections, the corresponding
parts are often detected (Klinker et al., 1990; Stauder, 1994; and Schluens et al.,
1995) and classified as outliers for the estimation.
Rather than removing the disturbing effects, explicit information about the
illumination changes can be estimated. This not only improves the motion
estimation but also allows the manipulation and visual enhancement of the
illumination situation in an image afterwards (Blohm, 1997). Under controlled
conditions with, e.g., known object shape, light source position (Sato et al., 1997;
Sato et al., 1996; and Baribeau et al., 1992), and homogeneous non-colored
surface properties (Ikeuchi et al., 1991; Tominaga et al., 2000), parameters of
sophisticated reflection models like the Torrance-Sparrow model (Torrance et
al., 1967; Nayar et al., 1991; and Schlick, 1994) which also includes specular
reflection, can be estimated from camera views. Since the difficulty of param-
eter estimation increases significantly with model complexity, the analysis of
global illumination scenarios (Heckbert, 1992) with, e.g., inter-reflections (Forsyth
et al., 1991) is only addressed for very restricted applications (Wada et al., 1995).
In the context of motion estimation, where the exact position and shape of an
object are often not available, mostly simpler models are used that account for
the dominant lighting effects in the scene. The simplest scenario is the assump-
tion of pure ambient illumination (Foley et al., 1990). Other approaches (Gennert
et al., 1987; Moloney et al., 1991; and Negahdaripour et al., 1993) extend the
optical flow constraint by a two-parameter function to allow for
global intensity scaling and global intensity shifts between the two frames. Local
shading effects can be modeled using additional directional light sources (Foley
et al., 1990). For the estimation of the illuminant direction, surface-normal
information is required. If this information is not available as, e.g., for the large
class of shape-from-shading algorithms (Horn et al., 1989; Lee et al., 1989),
assumptions about the surface-normal distribution are exploited to derive the
direction of the incident light (Pentland, 1982; Lee et al., 1989; Zheng et al., 1991;
and Bozdagi et al., 1994).
If explicit 3-D models and with that surface-normal information are available,
more accurate estimates of the illumination parameters are obtainable (Stauder,
1995; Deshpande et al., 1996; Brunelli, 1997; and Eisert et al., 1997). In these
approaches, Lambertian reflection is assumed in combination with directional
and ambient light. Given the surface normals, the illumination parameters are
estimated using neural networks (Brunelli, 1997), linear (Deshpande et al., 1996;
Eisert et al., 1997), or non-linear (Stauder, 1995) optimization.
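As an illustration of such a linear estimation, the following Python sketch fits an ambient coefficient and a directional component under the simplifying assumptions of a single distant light source, Lambertian reflection, and pixels known to face the light; the parameterization and all data are illustrative stand-ins rather than the formulations of the cited approaches.

```python
import numpy as np

def estimate_lambertian_lighting(observed, texture, normals):
    """Estimate ambient intensity and a directional light from surface normals.

    Sketch model: observed = texture * (a_amb + n . d), where d encodes the
    direction and strength of one distant light source.  For surface points
    facing the light, the model is linear in the four unknowns (a_amb, d),
    so they follow from an ordinary least-squares fit."""
    shading = observed / np.maximum(texture, 1e-6)          # per-pixel shading ratio
    A = np.column_stack([np.ones(len(normals)), normals])   # [1, nx, ny, nz]
    params, *_ = np.linalg.lstsq(A, shading, rcond=None)
    a_amb, d = params[0], params[1:]
    strength = np.linalg.norm(d)
    direction = d / (strength + 1e-12)
    return a_amb, strength, direction

# synthetic check with a known light source
rng = np.random.default_rng(1)
normals = rng.normal(size=(500, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
true_dir = np.array([0.3, 0.5, 0.81])
true_dir /= np.linalg.norm(true_dir)
texture = rng.uniform(0.2, 1.0, size=500)
lit = normals @ true_dir > 0.1                  # keep clearly lit points only
observed = texture * (0.25 + 0.9 * (normals @ true_dir))
a_amb, strength, direction = estimate_lambertian_lighting(
    observed[lit], texture[lit], normals[lit])
print(round(a_amb, 3), round(strength, 3), np.round(direction, 3))
```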
Rather than using explicit light source and reflection models to describe
illumination effects, multiple images captured from the same viewing position,
but under varying illumination can also be exploited. Hallinan et al. (Hallinan et al., 1994; Epstein et al., 1995) showed that five eigen images computed from
a set of differently illuminated facial images are sufficient to approximate
arbitrary lighting conditions by linearly blending between the eigen images. An
analytic method for the derivation of the eigen components can be found in
Ramamoorthi (2002). This low-dimensional space of face appearances can be
represented as an illumination cone as shown by Belhumeur et al. (1998). In
Ramamoorthi et al. (2001), the reflection of light was theoretically described by
convolution in a signal-processing framework. Illumination analysis or inverse
rendering can then be considered as deconvolution. Besides the creation of
arbitrarily illuminated face images, the use of multiple input images also allows
the estimation of facial shape and thus a change of head pose in 2-D images
(Georgiades et al., 1999). Using eigen light maps of explicit 3-D models (Eisert
et al., 2002) instead of blending between eigen images, also extends the
applicability of the approach to locally deforming objects like human faces in
image sequences.
For the special application of 3-D model-based motion estimation, relatively few
approaches have been proposed that incorporate photometric effects. In Bozdagi
et al. (1994), the illuminant direction is estimated according to Zheng et al. (1991)
first without exploiting the 3-D model. Given the illumination parameters, the
optical flow constraint is extended to explicitly consider intensity changes caused
by object motion. For that purpose, surface normals are required which are
derived from the 3-D head model. The approach proposed in Stauder (1995 and
1998) makes explicit use of normal information for both illumination estimation
and compensation. Rather than determining the illuminant direction from a single
frame, the changes of surface shading between two successive frames are
exploited to estimate the parameters. The intensity of both ambient and direc-
tional light, as well as the direction of the incident light, is determined by
minimizing a non-linear cost function. Experiments performed for both ap-
proaches show that the consideration of photometric effects can significantly
improve the accuracy of estimated motion parameters and the reconstruction
quality of the motion-compensated frames (Bozdagi et al., 1994; Stauder, 1995).
The estimation starts from the optical flow constraint equation

$$\frac{\partial I(X,Y)}{\partial X}\, d_x + \frac{\partial I(X,Y)}{\partial Y}\, d_y = I(X,Y) - I'(X,Y), \qquad (1)$$

where $\partial I/\partial X$ and $\partial I/\partial Y$ are the spatial derivatives of the image intensity at pixel position [X, Y], and I′ − I denotes the temporal change of the intensity between two time instants separated by ∆t = t′ − t, corresponding to two successive frames in an image sequence. This equation, obtained by a Taylor series expansion of the image intensity up to first order, can be set up anywhere in the image. It relates the unknown 2-D motion displacement d = [d_x, d_y] to the spatial and temporal derivatives of the images.
The solution of this problem is under-determined since each equation has two
new unknowns for the displacement coordinates. For the determination of the
optical flow or motion field, additional constraints are required. Instead of using
heuristic smoothness constraints, explicit knowledge about the shape and
motion characteristics of the object is exploited. Any 2-D motion model can be
used as an additional motion constraint in order to reduce the number of
unknowns to the number of motion parameters of the corresponding model. In
that case, it is assumed that the motion model is valid for the complete object. An
over-determined system of equations is obtained that can be solved robustly for
the unknown motion and deformation parameters in a least-squares sense.
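A minimal numerical sketch of this combination is given below (Python, not from the chapter): the optical flow constraint of equation (1) is paired with a per-pixel displacement basis supplied by a parametric motion model, here a 2-D affine model with six parameters, and the stacked equations are solved in a least-squares sense on synthetic data.

```python
import numpy as np

def estimate_motion_params(Ix, Iy, dI, basis):
    """Least-squares motion estimation from the optical flow constraint.

    Each pixel contributes one equation Ix*dx + Iy*dy = dI, with the
    displacement restricted to a parametric model d = B(x) @ p, where B is
    the 2 x K per-pixel basis supplied by the motion model.  Stacking all
    pixels gives an over-determined linear system in the K parameters p."""
    grad = np.stack([Ix.ravel(), Iy.ravel()], axis=1)        # (N, 2)
    A = np.einsum('ni,nik->nk', grad, basis)                 # (N, K)
    p, *_ = np.linalg.lstsq(A, dI.ravel(), rcond=None)
    return p

# toy example: a pure 2-D affine motion model with 6 parameters
h, w = 32, 32
ys, xs = np.mgrid[0:h, 0:w]
N = h * w
# affine basis: dx = p0 + p1*x + p2*y,  dy = p3 + p4*x + p5*y
B = np.zeros((N, 2, 6))
B[:, 0, 0], B[:, 0, 1], B[:, 0, 2] = 1, xs.ravel(), ys.ravel()
B[:, 1, 3], B[:, 1, 4], B[:, 1, 5] = 1, xs.ravel(), ys.ravel()

# synthesize gradients and temporal differences for a known parameter vector
rng = np.random.default_rng(2)
Ix, Iy = rng.normal(size=(h, w)), rng.normal(size=(h, w))
p_true = np.array([0.5, 0.01, 0.0, -0.3, 0.0, 0.02])
d = np.einsum('nik,k->ni', B, p_true)                        # true displacements
dI = Ix * d[:, 0].reshape(h, w) + Iy * d[:, 1].reshape(h, w)
print(np.round(estimate_motion_params(Ix, Iy, dI, B), 3))    # recovers p_true
```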
In the case of facial expression analysis, the motion and deformation model can
be taken from the shape and the motion characteristics of the head model
description. In this context, a triangular B-spline model (Eisert et al., 1998a) is
used to represent the face of a person. For rendering purposes, the continuous
spline surface is discretized and approximated by a triangle mesh as shown in
Figure 6. The surface can be deformed by moving the spline’s control points and
thus affecting the shape of the underlying mesh. A set of facial animation
parameters (FAPs) according to the MPEG-4 standard (MPEG, 1999) charac-
terizes the current facial expression and has to be estimated from the image
sequence. By concatenating all transformations in the head model deformation
and using knowledge from the perspective camera model, a relation between
image displacements and FAPs can be analytically derived.
Combining this motion constraint with the optical flow constraint (1) leads to a
linear system of equations for the unknown FAPs. Solving this linear system in
a least squares sense, results in a set of facial animation parameters that
determines the current facial expression of the person in the image sequence.
Hierarchical Framework
Since the optical flow constraint equation (1) is derived assuming the image
intensity to be linear, it is only valid for small motion displacements between two
successive frames. To overcome this limitation, a hierarchical framework can be
used (Eisert et al., 1998a). First, a rough estimate of the facial motion and
deformation parameters is determined from sub-sampled and low-pass filtered
images, where the linear intensity assumption is valid over a wider range. The
3-D model is motion compensated and the remaining motion parameter errors are
reduced on frames having higher resolutions.
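The toy Python sketch below (illustrative only) shows the structure of such a hierarchical scheme for the simplest possible motion model, a single global 2-D translation: a rough estimate is obtained on down-sampled images, the second frame is motion compensated with it, and the remaining error is re-estimated on the finer levels.

```python
import numpy as np

def downsample(img):
    """Halve the resolution by 2x2 block averaging (stand-in for low-pass
    filtering followed by sub-sampling)."""
    h, w = img.shape
    return img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def estimate_translation(I0, I1):
    """One linearized optical-flow step for a single global 2-D translation."""
    Iy, Ix = np.gradient(I0)
    It = I1 - I0
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    d, *_ = np.linalg.lstsq(A, -It.ravel(), rcond=None)
    return d                                         # (dx, dy)

def hierarchical_translation(I0, I1, levels=3, iters=3):
    """Coarse-to-fine estimation: start on small, smoothed images where the
    linearized intensity assumption holds for larger motion, then refine the
    motion-compensated residual on the finer levels."""
    pyr0, pyr1 = [I0], [I1]
    for _ in range(levels - 1):
        pyr0.append(downsample(pyr0[-1]))
        pyr1.append(downsample(pyr1[-1]))
    d = np.zeros(2)
    for lvl in reversed(range(levels)):
        d *= 2 if lvl < levels - 1 else 1            # rescale to the finer grid
        for _ in range(iters):
            shift = np.round(d).astype(int)
            # compensate frame 1 with the current estimate, then re-estimate
            warped = np.roll(pyr1[lvl], shift=(-shift[1], -shift[0]), axis=(0, 1))
            d = shift + estimate_translation(pyr0[lvl], warped)
    return d

# toy test: a smooth blob displaced by dx = 9, dy = 5 pixels
ys, xs = np.mgrid[0:128, 0:128]
frame0 = np.exp(-((xs - 60.0) ** 2 + (ys - 70.0) ** 2) / (2 * 15.0 ** 2))
frame1 = np.roll(frame0, shift=(5, 9), axis=(0, 1))
print(np.round(hierarchical_translation(frame0, frame1), 2))   # approx. [9. 5.]
```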
For natural video capture conditions, scene lighting often varies over time. This
illumination variability has a considerable influence not only on the visual
appearance of the objects in the scene, but also on the performance of computer
vision algorithms or video-coding methods. The efficiency and robustness of
these algorithms can be significantly improved by removing the undesired effects
of changing illumination. In this section, we introduce a 3-D model-based
technique for estimating and manipulating the lighting in an image sequence
(Eisert et al., 2002). The current scene lighting is estimated for each frame
by exploiting 3-D model information, which then allows a synthetic re-lighting of the original video
frames. To provide the estimator with surface-normal information, the objects in
the scene are represented by 3-D shape models and their motion and deformation
are tracked over time using a model-based estimation method. Given the normal
information, the current lighting is estimated with a linear algorithm of low
computational complexity using an orthogonal set of light maps.
Light Maps
To model the shading of the face, the texture map of the 3-D head model is modulated by a light map L defined over the texture coordinates u,

$$I^C(u) = I^C_{tex}(u) \cdot L(u), \qquad (3)$$

with $I^C_{tex}$ denoting the texture of color channel C. The light map itself is represented as a linear combination of N basis light maps $L_i$,

$$I^C(u) = I^C_{tex}(u) \cdot \sum_{i=0}^{N-1} \alpha_i^C\, L_i(u). \qquad (4)$$
By varying the scaling parameters $\alpha_i^C$ and thus blending between different light maps $L_i$, different lighting scenarios can be created. Moreover, the light map
approach can also model wrinkles and creases which are difficult to describe by
3-D geometry (Pighin et al., 1998; Liu et al., 2001). The N light maps $L_i(u)$ can
be computed off-line with the same surface normal information n(u), but with
different light source configurations. In our experiments, we use one constant
light map L0 representing ambient illumination while the other light maps are
calculated assuming Lambert reflection and point-light sources located at infinity
having illuminant direction li
$$L_0(u) = 1, \qquad L_i(u) = \max\{-\,n(u)\cdot l_i,\ 0\}, \quad 1 \le i \le N-1. \qquad (5)$$
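A small Python sketch of this off-line precomputation and of the blending of equation (4) is given below; the spherical normal field, light directions, and texture are synthetic stand-ins for the head model data.

```python
import numpy as np

def build_light_maps(normals, light_dirs):
    """Precompute one ambient and N-1 directional Lambertian light maps,
    following equation (5): L_0 = 1 and L_i(u) = max(-n(u) . l_i, 0)."""
    H, W, _ = normals.shape
    maps = [np.ones((H, W))]                         # L_0: constant ambient term
    for l in light_dirs:
        maps.append(np.maximum(-(normals @ l), 0.0))
    return np.stack(maps)                            # (N, H, W)

def shade_texture(texture, light_maps, alphas):
    """Apply equation (4): modulate the texture by a blend of light maps."""
    L = np.tensordot(alphas, light_maps, axes=1)     # sum_i alpha_i * L_i(u)
    return texture * L

# toy usage with a spherical patch standing in for the head's normal field
ys, xs = np.mgrid[-1:1:64j, -1:1:64j]
zs = np.sqrt(np.clip(1.0 - xs**2 - ys**2, 0.0, None))
normals = np.dstack([xs, ys, zs])
normals /= np.linalg.norm(normals, axis=2, keepdims=True) + 1e-12
dirs = np.array([[0.0, 0.0, -1.0], [-1.0, 0.0, -1.0]])
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # unit illuminant directions
light_maps = build_light_maps(normals, dirs)
texture = np.full((64, 64), 0.8)
shaded = shade_texture(texture, light_maps, alphas=np.array([0.3, 0.6, 0.2]))
print(light_maps.shape, round(float(shaded.max()), 3))
```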
Figure 3: First four eigen light maps representing the dominant shading
effects.
To estimate the current lighting, an equation of the form

$$I^C_{shaded}(x) = I^C_{unshaded}(x) \cdot \sum_{i=0}^{N-1} \alpha_i^C\, L_i(u(x)) \qquad (6)$$
is set up. Since each pixel x being part of the object contributes one equation, a
highly over-determined linear system of equations is obtained that is solved for
the unknown $\alpha_i^C$'s in a least-squares sense. Rendering the 3-D object model with the texture map shaded by the estimated parameters $\alpha_i^C$ leads to a model frame that approximates the lighting of the original frame. In the same way, the inverse of this formula can be used to remove lighting variations in real video sequences, as shown in Figure 4.
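The following Python sketch illustrates this per-frame least-squares estimation and the subsequent removal of the estimated shading; the textures, light maps, and object mask are synthetic, and the light maps are assumed to be already resampled from texture space into image space.

```python
import numpy as np

def estimate_alphas(frame, unshaded, light_maps, mask):
    """Least-squares fit of the blending coefficients alpha_i of equation (6).

    Every object pixel x in `mask` gives one linear equation
    frame(x) = unshaded(x) * sum_i alpha_i * L_i(x)."""
    A = (light_maps[:, mask] * unshaded[mask]).T      # (num_pixels, N)
    b = frame[mask]
    alphas, *_ = np.linalg.lstsq(A, b, rcond=None)
    return alphas

def remove_lighting(frame, light_maps, alphas, mask, eps=1e-3):
    """Invert the estimated shading to approximate an evenly lit frame."""
    L = np.tensordot(alphas, light_maps, axes=1)
    out = frame.copy()
    out[mask] = frame[mask] / np.maximum(L[mask], eps)
    return out

# toy usage with synthetic data
rng = np.random.default_rng(4)
H, W, N = 48, 48, 3
unshaded = rng.uniform(0.2, 1.0, size=(H, W))
light_maps = np.stack([np.ones((H, W))] +
                      [rng.uniform(0, 1, size=(H, W)) for _ in range(N - 1)])
mask = np.zeros((H, W), dtype=bool)
mask[8:40, 8:40] = True                               # pixels covered by the 3-D model
true_alphas = np.array([0.3, 0.5, 0.4])
frame = unshaded * np.tensordot(true_alphas, light_maps, axes=1)
alphas = estimate_alphas(frame, unshaded, light_maps, mask)
print(np.round(alphas, 3))                            # recovers true_alphas
relit = remove_lighting(frame, light_maps, alphas, mask)
print(bool(np.abs(relit[mask] - unshaded[mask]).max() < 1e-6))
```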
Applications
In this section, two applications, model-based coding and facial animation, are
addressed which make use of the aforementioned methods for facial expression
analysis and synthesis. Experimental results from the approach in Eisert (2000)
are provided in order to illustrate the applicability of model-based techniques to
these applications.
Model-Based Coding
In recent years, several video coding standards, such as H.261, H.263, and MPEG-1/2/4, have been introduced to address the compression of digital video for storage and communication services. These standards describe a hybrid video coding scheme, which consists of block-based motion-compensated prediction (MCP) and DCT-based transform coding and quantization of the prediction error. The more recently finalized H.264 standard follows the same video coding approach. These waveform-based schemes utilize the statistics of the video signal without knowledge of the semantic content of the frames and achieve compression ratios of several hundred to one at reasonable quality.
If semantic information about a scene is suitably incorporated, higher coding
efficiency can be achieved by employing more sophisticated source models.
Model-based video codecs, e.g., use 3-D models for representing the scene
content. Figure 5 shows the structure of a model-based codec for the application
of video telephony.
The background, which is not explicitly modeled, is excluded from this evaluation. The trade-off between bit-rate, which can be controlled by adjusting the quantization of the FAPs, and reconstruction quality is shown in Figure 7.
Facial Animation
The use of different head models for analysis and synthesis of head-and-shoulder
sequences is also interesting in the field of character animation in film produc-
tions or web applications. The facial performance of an actor sitting in front of a camera
is analyzed and the resulting FAPs are used to control arbitrary 3-D models. This
way, different people, animals, or fictitious creatures can be animated realisti-
cally. The exchange of the head model to animate other people is shown in Figure
8. The upper row depicts some frames of the original sequence used for facial
expression analysis. Instead of rendering the sequence with the same 3-D head
model used for the FAP estimation, and thus reconstructing the original se-
quence, the head model is exchanged for image synthesis leading to new
sequences with different people that move according to the original sequence.
Examples of this character animation are shown in the lower two rows of Figure
8. In these experiments, the 3-D head models for Akiyo and Bush are derived
from a single image. A generic head model whose shape is controlled by a set
of parameters is roughly adjusted to the outline of the face and the position of
eyes and mouth. Then, the image is projected onto the 3-D model and used as
a texture map. Since the topology of the mesh is identical for all models, the
surface deformation description need not be changed and facial expressions can
easily be applied to different people.
Since the same generic model is used for all people, point correspondences
between surface points and texture coordinates are inherently established. This
enables the morphing between different characters by linearly blending between
the texture map and the position of the vertices. In contrast to 2-D approaches
(Liu et al., 2001), this can even be done during a video sequence due to the use of a 3-D
model. Local deformations caused by facial expressions are not affected by this
morphing. Figure 9 shows an example of a view of the morphing process between
two different people.
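Because topology and texture parameterization are shared, such a morph reduces to two linear blends, one over vertex positions and one over texture maps, as the following minimal Python sketch illustrates (vertex count, texture size, and data are placeholders).

```python
import numpy as np

def morph(verts_a, tex_a, verts_b, tex_b, t):
    """Blend two face models that share the same mesh topology and texture
    parameterization.  Since correspondences are given by the common generic
    model, morphing reduces to linear interpolation; t = 0 gives person A,
    t = 1 gives person B."""
    verts = (1.0 - t) * verts_a + t * verts_b       # (V, 3) vertex positions
    tex = (1.0 - t) * tex_a + t * tex_b             # (H, W, 3) texture maps
    return verts, tex

# toy usage: two heads with 2268 shared vertices and 256x256 textures
rng = np.random.default_rng(5)
verts_a, verts_b = rng.normal(size=(2, 2268, 3))
tex_a, tex_b = rng.uniform(0, 1, size=(2, 256, 256, 3))
for t in (0.0, 0.25, 0.5, 0.75, 1.0):               # frames of a morph sequence
    v, tx = morph(verts_a, tex_a, verts_b, tex_b, t)
print(v.shape, tx.shape)
```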
Conclusions
Methods for facial expression analysis and synthesis have received increasing
interest in recent years. The computational power of current computers and handheld devices like PDAs already allows real-time rendering of 3-D facial models, which is the basis for many new applications in the near future. Especially for handheld devices that are connected to the Internet via a wireless channel, the bit-rate available for streaming video is limited. Transmitting only facial expression parameters drastically reduces the bandwidth requirements to a few kbit/s. In the same way, face animations or new human-computer interfaces can be realized with low demands on storage capacity. At the high-quality end, film productions may benefit from new techniques for animation, realistic facial expressions, and motion capture that avoid the use of numerous sensors interfering with the actor.
Last, but not least, information about motion and symmetry of facial features can
be exploited in medical diagnosis and therapy.
All these applications have in common that accurate information about 3-D motion, deformation, and facial expressions is required. In this chapter, the state-of-the-art in facial expression analysis and synthesis has been reviewed and a new method for determining FAPs from monocular image sequences has been presented. In a hierarchical framework, the parameters are robustly found using optical flow information together with explicit knowledge about shape and motion constraints of the objects. The robustness can further be increased by incorporating photometric properties into the estimation. For this purpose, a computationally efficient algorithm for the determination of lighting effects was given. Finally, two applications that build on these techniques, model-based coding and facial animation, were discussed.
References
Aizawa, K. & Huang, T. S. (1995). Model-based image coding: Advanced video
coding techniques for very low bit-rate applications. Proc. IEEE, 83(2),
259-271.
Aizawa, K., Harashima, H. & Saito, T. (1989). Model-based analysis synthesis
image coding (MBASIC) system for a person’s face. Sig. Proc.: Image
Comm., 1(2), 139-152.
Anjyo, K., Usami, Y. & Kurihara, T. (1992). A simple method for extracting the
natural beauty of hair. SIGGRAPH, 26, 111-120.
Baribeau, R., Rioux, M. & Godin, G. (1992). Color reflectance modeling using
a polychromatic laser range sensor. IEEE Tr. PAMI, 14(2), 263-269.
Barron, J. L., Fleet, D. J. & Beauchemin, S. S. (1994). Systems and experiment:
Performance of optical flow techniques. International Journal of Comp.
Vision, 12(1), 43-77.
Bartlett, M. et al. (1995). Classifying facial action. Advances in Neural Inf.
Proc. Systems 8, MIT Press, 823-829.
Belhumeur, P. N. & Kriegman, D. J. (1998). What is the set of images of an
object under all possible illumination conditions. International Journal of
Comp. Vision, 28(3), 245-260.
Black, M. J. & Anandan, P. (1996). The robust estimation of multiple motions:
Parametric and piecewise-smooth flow fields. Computer Vision and
Image Understanding, 63(1), 75-104.
Black, M. J. & Yacoob, Y. (1995). Tracking and recognizing rigid and non-rigid
facial motions using local parametric models of image motion. Proc. ICCV,
374-381.
Black, M. J., Yacoob, Y. & Ju, S. X. (1997). Recognizing human motion using
parameterized models of optical flow. Motion-Based Recognition, 245-
269. Kluwer Academic Publishers.
Blanz, V. & Vetter, T. (1999). A morphable model for the synthesis of 3D faces.
SIGGRAPH, 187-194.
Blohm, W. (1997). Lightness determination at curved surfaces with applications to
dynamic range compression and model-based coding of facial images.
IEEE Tr. Image Proc., 6(8), 1129-1138.
Bozdagi, G., Tekalp, A. M. & Onural, L. (1994). 3-D motion estimation and wireframe adaptation including photometric effects for model-based coding of facial image sequences. IEEE Tr. CSVT, 4(3), 246-256.
Brunelli, R. (1997). Estimation of pose and illuminant direction for face process-
ing. Image-and-Vision-Computing, 15(10), 741-748.
Brunelli, R. & Poggio, T. (1993). Face recognition: Features versus templates.
IEEE Tr. PAMI, 15(10), 1042-1052.
Chao, S. & Robinson, J. (1994). Model-based analysis/synthesis image coding
with eye and mouth patch codebooks. Proc. of Vision Interface, 104-109.
Chellappa, R., Wilson, C. L. & Sirohey, S. (1995). Human and machine
recognition of faces: A survey. Proc. IEEE, 83(5), 705-740.
Choi, C., Aizawa, K., Harashima, H. & Takebe, T. (1994). Analysis and
synthesis of facial image sequences in model-based image coding. IEEE Tr.
CSVT, 4(3), 257-275.
DeCarlo, D. & Metaxas, D. (1996). The integration of optical flow and
deformable models with applications to human face shape and motion
estimation. Proc. CVPR, 231-238.
DeCarlo, D. & Metaxas, D. (1998). Deformable model-based shape and motion
analysis from images using motion residual error. Proc. ICCV, 113-119.
DeCarlo, D., Metaxas, D. & Stone, M. (1998). An anthropometric face model
using variational techniques. SIGGRAPH, 67-74.
Deshpande, S. G. & Chaudhuri, S. (1996). Recursive estimation of illuminant
motion from flow field. Proc. ICIP, 3, 771-774.
Donato, G., Bartlett, M. S., Hager, J. C., Ekman, P. & Sejnowski, T. (1999).
Classifying facial actions. IEEE Tr. PAMI, 21(10), 974-989.
Dufaux, F. & Moscheni, F. (1995). Motion estimation techniques for digital TV:
A review and a new contribution. Proc. IEEE, 83(6), 858-876.
Eisert, P. (2000). Very Low Bit-Rate Video Coding Using 3-D Models. Ph.D.
thesis, University of Erlangen, Shaker Verlag, Aachen, Germany.
Eisert, P. & Girod, B. (1997). Model-based 3D motion estimation with illumina-
tion compensation. Proc. International Conference on Image Proc. and
Its Applications, 1, 194-198.
Eisert, P. & Girod, B. (1998). Analyzing facial expressions for virtual
conferencing. IEEE Computer Graphics and Applications, 18(5), 70-78.
Eisert, P. & Girod, B. (1998b). Model-based coding of facial image sequences
at varying illumination conditions. Proc. 10th IMDSP. Workshop 98, 119-
122.
Chapter VIII
Modeling and Synthesis of Realistic Visual Speech in 3D
Gregor A. Kalberer, Pascal Müller & Luc Van Gool
BIWI – Computer Vision Lab, Switzerland
Abstract
The problem of realistic face animation is a difficult one. This is hampering further breakthroughs in several high-tech domains, such as special effects
in the movies, the use of 3D face models in communications, the use of
avatars and likenesses in virtual reality, and the production of games with
more subtle scenarios. This work attempts to improve on the current state-
of-the-art in face animation, especially for the creation of highly realistic
lip and speech-related motions. To that end, 3D models of faces are used
and — based on the latest technology — speech-related 3D face motion will
be learned from examples. Thus, the chapter subscribes to the surging field
of image-based modeling and widens its scope to include animation. The
exploitation of detailed 3D motion sequences is quite unique, thereby
Figure 1. The workflow of our system: (a) An original face is (b) captured,
(c) re-meshed, (d) analyzed and integrated for (e) an animation.
Introduction
Realistic face animation for speech still poses a number of challenges, especially
when we want to automate it to a large degree. Faces are the focus of attention
for an audience, and the slightest deviation from normal faces and face dynamics
is noticed.
There are several factors that make facial animation so elusive. First, the human
face is an extremely complex geometric form. Secondly, the face exhibits
countless tiny creases and wrinkles, as well as subtle variations in color and
texture, all of which are crucial for our comprehension and appreciation of facial expressions.
2D: For reaching photorealism, one of the most effective approaches has been
to reorder short video sequences (Bregler et al., 1997) or to 2D morph
between photographic images (Beier et al., 1992; Bregler et al., 1995; and
Ezzat et al., 2000). A problem with such techniques is that they do not allow
much freedom in face orientation, relighting or compositing with other 3D
objects.
3D: A 3D approach typically yields such flexibility. Here, a distinction can be
made between appearance-based and physics-based approaches. The
former is typically based on scans or multi-view reconstructions of the face
exterior. Animation takes the form of 3D morphs between several, static
expressions (Chen et al., 1995; Blanz et al., 1999; and Pighin et al., 1998)
or a more detailed replay of observed face dynamics (Guenter et al., 1998;
Lin et al., 2001). Physics-based approaches model the underlying anatomy
in detail, as a skull with layers of muscles and skin (Waters et al., 1995;
Pelachaud et al., 1996; Eben, 1997; and Kähler et al., 2002). The activation
of the virtual muscles drives the animation. Again, excellent results have
been demonstrated. Emphasis has often been on the animation of emotions.
Viseme Selection
Animation of speech has much in common with speech synthesis. Rather than
composing a sequence of phonemes according to the laws of co-articulation to
get the transitions between the phonemes right, the animation generates se-
quences of visemes. Visemes correspond to the basic, visual mouth expressions
that are observed in speech. Whereas there is a reasonably strong consensus
about the set of phonemes, there is less unanimity about the selection of visemes.
Approaches aimed at realistic animation of speech have used any number, from
as few as 16 (Ezzat et al., 2000) up to about 50 visemes (Scott et al., 1994). This
number is by no means the only parameter in assessing the level of sophistication
of different schemes. Much also depends on the addition of co-articulation
effects. There certainly is no simple one-to-one relation between the 52
phonemes and the visemes, as different sounds may look the same and,
therefore, this mapping is rather many-to-one. For instance /b/ and /p/ are two
bilabial stops which differ only in the fact that the former is voiced, while the
latter is voiceless. Visually, there is hardly any difference in fluent speech.
We based our selection of visemes on the work of Owens (Owens et al., 1985)
for consonants. We use his consonant groups, except for two of them, which we
combine into a single /k,g,n,l,ng,h,y/ viseme. The groups are considered as
single visemes because they yield the same visual impression when uttered. We
do not consider all the possible instances of different, neighboring vocals that
Owens distinguishes, however. In fact, we only consider two cases for each
cluster: rounded and widened, that represent the instances farthest from the
neutral expression. For instance, the viseme associated with /m/ differs depend-
ing on whether the speaker is uttering the sequence omo or umu vs. the
sequence eme or imi. In the former case, the /m/ viseme assumes a rounded
shape, while the latter assumes a more widened shape. Therefore, each
consonant was assigned to these two types of visemes. For the visemes that
correspond to vocals, we used those proposed by Montgomery et al. (1985).
As shown in Figure 2, the selection contains a total of 20 visemes: 12 representing
the consonants (boxes with “consonant” title), seven representing the monophthongs (boxes with title “monophtong”) and one representing the neutral pose (box with title “silence”). Diphthongs (box with title “diphtong”) are divided into two separate monophthongs, and their mutual influence is taken care of as a co-
articulation effect. The boxes with the smaller title “allophones” can be
discarded by the reader for the moment. The table also contains examples of
words producing the visemes when they are pronounced. This viseme selection
differs from others proposed earlier. It contains more consonant visemes than
most, mainly because the distinction between the rounded and widened shapes
is made systematically. For the sake of comparison, Ezzat and Poggio (Ezzat et
al., 2000) used six (only one for each of Owens’ consonant groups, while also
combining two of them), Bregler et al. (1997) used ten (same clusters, but they
subdivided the cluster /t,d,s,z,th,dh/ into /th,dh/ and the rest, and /k,g,n,l,
ng,h,y/ into /ng/, /h/, /y/, and the rest, which boils down to making an even more
precise subdivision for this cluster), and Massaro (1998) used nine (but this
animation was restricted to cartoon-like figures, which do not show the same
complexity as real faces). Our selection turned out to be a good compromise
between the number of visemes needed in the animation and the realism that is
obtained.
It is important to note that our speech model combines the visemes with additional
co-articulation effects. Further increases in realism are also obtained by adapting
the viseme deformations to the shape of the face. These aspects are described
in the section, Face Animation.
The first step in learning realistic, 3D face deformations for the different visemes
was to extract real deformations from talking faces. Before the data were
extracted, it had to be decided what the test person would say during the
acquisition. It was important that all relevant visemes would be observed at least
once. The subjects were asked to read a short text that contained multiple
instances of the visemes in Figure 2.
For the 3D shape extraction of the talking face, we have used a 3D acquisition
system that uses structured light (Eyetronics, 1999). It projects a grid onto the
face, and extracts the 3D shape and texture from a single image. By using a video
camera, a quick succession of 3D snapshots can be gathered. We are especially
interested in frames that represent the different visemes. These are the frames
where the lips reach their extremal positions for that sound (Ezzat and Poggio
(Ezzat et al., 2000) followed the same approach in 2D). The acquisition system
yields the 3D coordinates of several thousand points for every frame. The output
is a triangulated, textured surface. The problem is that the 3D points correspond to projected grid intersections, not to corresponding physical points on the face.
Hence, the points for which 3D coordinates are given change from frame to
frame. The next steps have to solve for the physical correspondences.
Our animation approach assumes a specific topology for the face mesh. This is
a triangulated surface with 2'268 vertices for the skin, supplemented with
separate meshes for the eyes, teeth, and tongue (another 8'848, mainly for the
teeth). Figure 3 shows the generic head and its topology.
The first step in this fitting procedure deforms the generic head by a simple
rotation, translation, and anisotropic scaling operation, to crudely align it with the
neutral shape of the example face. This transformation minimizes the average
distance between a number of special points on the example face and the model (these 10 points are indicated in Figure 4).

Figure 3. The generic head model that is fitted to the scanned 3D data of the example face. Left: Shaded version; Right: Underlying mesh.

Figure 4. A first step in the deformation of the generic head to make it fit a captured 3D face is to globally align the two. This is done using 10 feature points indicated in the left part of the figure. The right part shows the effect: Patch and head model are brought into coarse correspondence.

These have been indicated manually
on the example faces, but could be extracted automatically (Noh et al., 2001).
After this initial transformation, the salient features may not be aligned well, yet.
The eyes could, e.g., be at a different height from the nose tip.
In order to correct for such flaws, a piecewise constant vertical stretch is
applied. The face is vertically divided into five intervals, ranging from top-of-
head to eyebrows, from eyebrows to eye corners, from eye corners to nose tip,
from nose tip to mouth corners, and from mouth corners to bottom of the chin.
Each part of the transformed model is vertically scaled in order to bring the
border points of these intervals into good correspondence with the example data,
beginning from the top of the head. A final adaptation of the model consists of
the separation of the upper and lower lip, in order to allow the mouth to open. The
dividing line is defined by the midpoints of the upper and lower edges of the mouth
outline.
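The piecewise constant vertical stretch amounts to a piecewise linear remapping of the vertical coordinate between corresponding border heights, as in the following Python sketch; the border values and the assumption that y is stored in the second vertex coordinate are illustrative.

```python
import numpy as np

def piecewise_vertical_stretch(verts, model_borders, target_borders):
    """Vertically rescale the globally aligned generic model, interval by
    interval, so that six border heights (bottom of chin, mouth corners,
    nose tip, eye corners, eyebrows, top of head) match the example face.

    Remapping each y-coordinate piecewise linearly between corresponding
    border heights is equivalent to a constant vertical scale per interval.
    Border arrays must be ascending (chin first, top of head last)."""
    out = verts.copy()
    out[:, 1] = np.interp(verts[:, 1], model_borders, target_borders)
    return out

# toy usage (border heights are made-up values in model units)
model_borders = np.array([-0.60, -0.20, 0.05, 0.35, 0.55, 1.00])
target_borders = np.array([-0.65, -0.22, 0.00, 0.33, 0.50, 0.95])
rng = np.random.default_rng(6)
verts = rng.uniform(-0.6, 1.0, size=(2268, 3))       # y stored in column 1
stretched = piecewise_vertical_stretch(verts, model_borders, target_borders)
print(round(float(stretched[:, 1].min()), 3), round(float(stretched[:, 1].max()), 3))
```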
This first step fixes the overall shape of the head and is carried out only once (for
the neutral example face). The result of this process is shown in the right column
of Figure 4: starting from the 3D patch for the neutral face and the generic model
that are shown at the top, the alignment at the bottom is obtained. As can be seen,
the generic model has not yet been adapted to the precise shape of the head at
that point. The second step starts with the transformed model of the first step and
performs a local morphing. This morphing maps the topology of the generic
model head precisely onto the given shape. This process starts from the
correspondences for a few salient points. This set includes the ten points of the
previous step, but is also extended to 106 additional points, all indicated in black
in Figure 5.
After the crude matching of the previous step, most of these points on the
example face will already be close to the corresponding points on the deformed
generic model. Typically, the initial frame of the video sequence corresponds to
the neutral expression. This makes a manual drag and drop operation for the 116 points rather easy. At that point, all 116 points are in good correspondence.

Figure 5. To make the generic head model fit the captured face data precisely, a morphing step is applied using the 116 anchor points (black dots) and the corresponding Radial Basis Functions for guiding the remainder of the vertices. The right part of the figure shows a result.
Further snapshots of the example face are no longer handled manually. From the
initial frame, the points are tracked automatically throughout the video. The
tracker looks for point candidates in a neighborhood around their previous
position. A dark blob is looked for and its midpoint is taken. As data are sampled
at video rate, the motions between frames are small and this very simple tracking
procedure only required manual help at a dozen or so frames for the set of
example data. The main reason was two candidate points falling into the search
region. Using this tracker, correspondences for all points and for all frames could
be established with limited manual input.
In order to find the deformations for the visemes, the corresponding frames were
selected from the video and their 3D reconstructions were made. The 3D
positions of the 116 points served as anchor points, to map all vertices of the
generic model to the data. The result is a model with the shape and expression
of the example face and with 2'268 vertices at their correct positions. This
mapping was achieved with the help of Radial Basis Functions.
Radial Basis Functions (RBFs) have become quite popular for face model fitting
(Pighin et al., 1998; Noh et al., 2001). They offer an effective method to
interpolate between a network of known correspondences. RBFs describe the
influence that each of the 116 known (anchor) correspondences has on the nearby intermediate points in this interpolation process.
Consider the following equations,
$$y_i^{new} = y_i + \sum_{j=1}^{n} \omega_j\, d_j, \qquad (1)$$
which specify how the positions $y_i$ of the intermediate points are changed into $y_i^{new}$ under the influence of the n vertices $m_j$ of the known network (the 116 vertices
in our case). The shift is determined by the weights ωj and the virtual displacements
dj that are attributed to the vertices of the known network of correspondences. More
about these displacements is to follow. The weights depend on the distance of the
intermediate point to the known vertices:
$$\omega_j = h(s_j / r), \qquad s_j = \lVert y_i - m_j \rVert \qquad (2)$$

for $s_j \le r$, where r is a cut-off value for the distance beyond which h is set to zero, and where in the interval [0, r] the function h(x) is of one of two types:
Figure 6. In the morphing step, two types of Radial Basis Functions are
applied. (1) The hermite type is shown in the top-right part of the figure and
is applied to all dark grey points on the face. (2) The exponential type is
shown in the bottom-right part and is applied to the light grey points.
$$h_2(x) = 2x^3 - 3x^2 + 1. \qquad (4)$$
Figure 7. The selection of RBF type is adapted to the local geometry. The
figure shows the improvement that results from switching from exponential
to hermite for the central point on the forehead.
Similarly, there are places where an exponential is much more effective than a
hermite RBF. If the generic head, which is of a rather Caucasian type, has to be
mapped onto the head of an Asian person, hermite functions will tend to copy the
shape of the mesh around the eyes, whereas one wants local control in order to
narrow the eyes and keep the corners sharp. The size of the region of influence
is also determined by the scale r. Three such scales were used (for both RBF
types). These scales and their spatial distribution over the face are shown in
Figure 8(1). As can be seen, they vary with the scale of the local facial
structures.
Figure 8. (1) The RBF sizes are also adapted to local geometry. There are
three sizes, where the largest is applied to those parts that are the least
curved. (2,3) For a small subset of points lying in a cavity the cylindrical
mapping is not carried out, to preserve geometrical detail at places where
captured data quality deteriorates.
Setting up this interpolation for the vertices of the known network itself yields the linear systems

$$A\, d_{X,Y,Z} = c_{X,Y,Z}. \qquad (5)$$
In these equations, the vectors dX, Y, Z represent the column vectors containing all
the X, Y, or Z components of the virtual displacement vectors dj. The influence
matrix A contains the weights that the vertices of the known network apply to
each other. After solving these systems for cX, Y, Z, the interpolation is ready to be
applied. It is important to note that vertices on different sides of the dividing line
of the mouth are decoupled in these calculations.
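Assuming, for illustration, that the systems of equation (5) are solved for the virtual displacements, the complete RBF-driven interpolation can be sketched in a few lines of Python; a single hermite falloff of equation (4) with one radius is used here, whereas the chapter mixes two RBF types and three radii and decouples the two sides of the mouth.

```python
import numpy as np

def h_hermite(x):
    """Hermite falloff of equation (4), clipped to zero outside [0, 1]."""
    x = np.clip(x, 0.0, 1.0)
    return 2 * x**3 - 3 * x**2 + 1

def rbf_morph(anchors, anchor_disp, verts, r):
    """Move intermediate vertices according to the RBF scheme of eqs. (1)-(5).

    anchors:     (n, 3) anchor vertices m_j on the generic model
    anchor_disp: (n, 3) measured displacements of the anchors
    verts:       (V, 3) remaining model vertices y_i to be interpolated
    r:           cut-off radius of the radial basis function"""
    # influence matrix A: weight each anchor applies to every other anchor
    A = h_hermite(np.linalg.norm(anchors[:, None] - anchors[None], axis=2) / r)
    # virtual displacements d_j, one linear system per coordinate (X, Y, Z)
    d, *_ = np.linalg.lstsq(A, anchor_disp, rcond=None)
    # weights of each anchor on each intermediate vertex, then equation (1)
    W = h_hermite(np.linalg.norm(verts[:, None] - anchors[None], axis=2) / r)
    return verts + W @ d

# toy usage: 116 anchors driving 2268 vertices
rng = np.random.default_rng(7)
anchors = rng.uniform(-1, 1, size=(116, 3))
anchor_disp = 0.05 * rng.normal(size=(116, 3))
verts = rng.uniform(-1, 1, size=(2268, 3))
moved = rbf_morph(anchors, anchor_disp, verts, r=0.8)
print(moved.shape, round(float(np.abs(moved - verts).max()), 3))
```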
A third step in the processing projects the interpolated points onto the extracted
3D surface. This is achieved via a cylindrical mapping. This mapping is not
carried out for a small subset of points which lie in a cavity, however. The reason
is that the acquisition system does not always produce good data in these cavities.
The position of these points should be determined fully by the deformed head
model, and not subject to being degraded under the influence of the acquired
data. They are shown on the right side of Figure 8. On Figure 8(3), this is
illustrated for the nostril. The extracted 3D grid is too smooth there and does not
follow the sharp dip that the nose takes. The generic model dominates the fitting
procedure and caters for the desired, high curvatures, as can be seen.
Figure 9. The jaw and lower teeth rotate around the midpoint of the places
where the jaw is attached to the skull, and translated (see text).
An extreme example where the model takes absolute preference is the mouth
cavity. The interior of the mouth is part of the model, which, e.g., contains the
skin connecting the teeth and the interior parts of the lips. Typically, scarcely any
3D data will be captured for this region, and those that are captured tend to be
of low quality. The upper row of teeth is fixed rigidly to the model and has already received its position through the first step (the global transformation of the model, possibly with a further adjustment by the user). The lower teeth follow the jaw motion, which is determined as a rotation about the midpoint between the points where the jaw is attached to the skull, combined with a translation. The motion itself
is quantified by observing the motion of a point on the chin, standardized as
MPEG-4 point 2.10. These points have also been defined on the generic model,
as can be seen in Figure 9, and can be located automatically after the morph.
It has to be mentioned at this point that all the settings, like type and size of RBFs,
as well as whether vertices have to be cylindrically mapped or not, are defined
only once in the generic model as attributes of its vertices.
The previous subsection described how a generic head model was deformed to
fit 3D snapshots. Not all frames were reconstructed, but only those that
represent the visemes (i.e., the most extreme mouth positions for the different
cases of Figure 2). About 80 frames were selected from the sequence for each
of the example faces. For the representation of the corresponding visemes, the
3D reconstructions themselves (the adapted generic heads) were not taken, but rather the difference of these heads with respect to the neutral one of the same person.
These deformation fields of all the different subjects still contain a lot of
redundancy. This was investigated by applying a Principal Component Analysis.
Over 98.5% of the variance in the deformation fields was found in the space
spanned by the 16 most dominant components. We have used this statistical
method not only to obtain a very compact description of the different shapes, but
also to get rid of small acquisition inaccuracies. The different instances of the
same viseme for the different subjects cluster in this space. The centroids of the
clusters were taken as the prototype visemes used to animate these faces later
on.
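A minimal Python sketch of this analysis is given below, with an SVD-based PCA applied to randomly generated deformation fields and viseme labels standing in for the real capture data.

```python
import numpy as np

def pca(deformations, n_components=16):
    """PCA of viseme deformation fields (each field: the vertex displacements
    of one snapshot flattened to a row).  Returns the mean, the dominant
    components, the low-dimensional coordinates, and the explained variance."""
    mean = deformations.mean(axis=0)
    U, S, Vt = np.linalg.svd(deformations - mean, full_matrices=False)
    components = Vt[:n_components]                      # (16, 3 * n_vertices)
    coords = (deformations - mean) @ components.T       # (n_samples, 16)
    explained = (S[:n_components] ** 2).sum() / (S ** 2).sum()
    return mean, components, coords, explained

# toy usage: deformation snapshots from several subjects, 2268 vertices each
rng = np.random.default_rng(8)
n_samples, n_vertices = 240, 2268
deformations = rng.normal(size=(n_samples, 3 * n_vertices)) * 0.01
mean, comps, coords, explained = pca(deformations)
print(comps.shape, coords.shape, round(float(explained), 3))

# prototype viseme = centroid of all instances of that viseme in PCA space
labels = rng.integers(0, 20, size=n_samples)            # which viseme each row is
prototypes = np.stack([coords[labels == v].mean(axis=0) for v in range(20)])
print(prototypes.shape)                                  # (20, 16)
```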
Face Animation
The section, Learning Viseme Expressions, describes an approach to extract a
set of visemes from a face that could be observed in 3D, while talking. This
process is quite time consuming and one would not want to repeat it for every
single face that has to be animated. This section describes how novel faces can
be animated, using visemes which could not be observed beforehand.
Such an animation requires a number of steps, which are described in the following.
A good animation requires visemes that are adapted to the shape or “physiog-
nomy” of the face at hand. Hence, one cannot simply copy or “clone” the
deformations that have been extracted from one of the example faces to a novel
face. Although it is not precisely known at this point how the viseme deforma-
tions depend on the physiognomy, visual improvements were observed by
adapting the visemes in a simple way described in this section.
Faces can be efficiently represented as points in a so-called “Face Space”
(Blanz et al., 1999). These points actually represent their deviation from the
average face. This can be done for the neutral faces from which the example
visemes have been extracted using the procedure described in the section,
Learning Viseme Expressions, as well as for a neutral, novel face. The example
faces span a hyper-plane in Face Space. By orthogonally projecting the novel
face onto this plane, a linear combination of the example faces is found that
comes closest to the projected novel face. This procedure is illustrated in Figure
10. Suppose we put the Face Space coordinates of the face that corresponds to
this projection into a single column vector $\tilde{F}_{nov}$ and, similarly, the coordinates of
the example face i into the vector $F_i$. If the coordinates of the projected, novel face $\tilde{F}_{nov}$ are given by
$$\tilde{F}_{nov} = \sum_{i=1}^{n} \omega_i F_i, \qquad (6)$$
then the same weights $\omega_i$ are applied to the visemes of the example faces, to yield
a personalized set of visemes for the novel face. The effect is that a rounded face
will get visemes that are closer to those of the more rounded example faces, for
instance.
This step in the creation of personalized visemes is schematically represented in
Figure 11.
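The projection onto the span of the example faces and the re-weighting of their visemes can be written as a small least-squares problem; the Python sketch below is illustrative only, with made-up dimensions for the Face Space and for the viseme deformation vectors.

```python
import numpy as np

def personalize_visemes(F_examples, V_examples, F_novel):
    """Project a novel neutral face onto the span of the example faces and
    reuse the resulting weights to blend the example visemes (equation (6)).

    F_examples: (n, D) Face Space coordinates of the n example neutral faces
    V_examples: (n, K, D_v) viseme deformations of each example face
    F_novel:    (D,) Face Space coordinates of the novel neutral face"""
    # least-squares weights of the linear combination closest to F_novel
    weights, *_ = np.linalg.lstsq(F_examples.T, F_novel, rcond=None)
    F_proj = weights @ F_examples                         # projected novel face
    visemes = np.tensordot(weights, V_examples, axes=1)   # personalized set
    return weights, F_proj, visemes

# toy usage: 6 example faces, 20 visemes, small made-up dimensions
rng = np.random.default_rng(9)
F_examples = rng.normal(size=(6, 40))
V_examples = rng.normal(size=(6, 20, 3 * 2268)) * 0.01
F_novel = rng.normal(size=40)
w, F_proj, visemes = personalize_visemes(F_examples, V_examples, F_novel)
print(np.round(w, 2), visemes.shape)
```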
As the face $\tilde{F}_{nov}$ is still a bit different from the original novel face $F_{nov}$, expression
cloning is applied as a last step to the visemes found from projection (Noh et al.,
2001). We have found that the direct application of viseme cloning from an
example face to other faces yields results that are less convincing. This is
certainly the case for faces that differ substantially. According to the proposed
strategy the complete set of examples is exploited, and cloning only has to deal
with a small residue. The process of personalizing visemes is summarized in
Figure 12.
Once the visemes for a face have been determined, animation can be achieved
as a concatenation of visemes. The visemes that have to be visited, the order in which this should happen, and the time intervals in between are generated automatically from an audio track containing speech. First, a file is generated that contains the ordered list of allophones and their timing. “Allophones” correspond to a finer subdivision of phonemes. This transcription is not our own work; we used an existing tool, described in Traber (1995). The allophones are
then translated into visemes (the list of visemes is provided in Figure 2). The
vocals and silence are directly mapped to the corresponding visemes. For the
consonants, the context plays a stronger role. If they immediately follow a vocal
among /o/, /u/, and /@@/ (this is the vocal as in “bird”), then the allophone is
mapped onto a rounded consonant. If the vocal is among /i/, /a/, and /e/ then the
allophone is mapped onto a widened consonant. When the consonant is not
preceded immediately by a vocal, but the subsequent allophone is one, then a
similar decision is made. If the consonant is flanked by two other consonants, the
preceding vocal decides.
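The context rule for the consonants can be summarized in a small helper function. The sketch below is a simplified reading of the rule just described; the allophone spellings, the handling of missing context and the function interface are assumptions, not the exact tables used by the authors.

```python
ROUNDING_VOCALS = {"o", "u", "@@"}   # /@@/ is the vocal as in "bird"
WIDENING_VOCALS = {"i", "a", "e"}

def consonant_variant(prev_allo, next_allo, prev_vocal=None):
    """Decide whether a consonant maps to its rounded or widened viseme.

    prev_allo / next_allo : the immediately preceding / following allophone
    prev_vocal            : the last vocal seen before this consonant (used
                            when it is flanked by consonants on both sides)
    """
    # The preceding allophone decides first; otherwise the following one.
    for neighbour in (prev_allo, next_allo):
        if neighbour in ROUNDING_VOCALS:
            return "rounded"
        if neighbour in WIDENING_VOCALS:
            return "widened"
    # Flanked by two consonants: the preceding vocal decides.
    if prev_vocal in ROUNDING_VOCALS:
        return "rounded"
    return "widened"   # default choice when no context is known (assumption)
```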
From these data — the ordered list of visemes and their timing — the system
automatically generates an animation. The concatenation of the selected visemes
can be achieved elegantly as a navigation through a “Viseme Space,” similar to
a Face Space. The Viseme Space is obtained by applying an Independent
Component Analysis to all extracted example visemes. It turned out that the variation can be captured well with as few as 16 Independent Components. (This underlying dimensionality is determined by the PCA step that is part of our ICA implementation (Hyvärinen, 1997).)
sented as one point in this 16D Viseme Space. Animation boils down to
subsequently applying the deformations represented by the points along a
trajectory that leads from viseme to viseme, and that is influenced by co-
articulation effects. An important advantage of animating in Viseme Space is
that all visited deformations remain realistic.
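As an illustration of how such a Viseme Space can be built, the sketch below uses scikit-learn's FastICA (which, like the implementation referred to above, performs a PCA/whitening step internally) and keeps 16 components. The use of scikit-learn and the assumed data layout are our substitutions for the authors' own ICA code (Hyvärinen, 1997).

```python
import numpy as np
from sklearn.decomposition import FastICA

def build_viseme_space(visemes, n_components=16):
    """Build a Viseme Space from extracted visemes.

    visemes : (num_visemes, num_vertices * 3) array; each row is one viseme
              expressed as per-vertex displacements from the neutral face.
    """
    ica = FastICA(n_components=n_components, random_state=0)
    coords = ica.fit_transform(visemes)    # (num_visemes, n_components)
    return ica, coords

def deformation_from_coords(ica, point):
    """Map a point in Viseme Space back to a full deformation vector."""
    return ica.inverse_transform(np.asarray(point)[None, :])[0]
```

`inverse_transform` maps any point of a trajectory in the 16-D space back to vertex displacements, which is what the animation stage needs per frame.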
Performing animation as navigation through a Viseme Space of some sort is not new per se. Such an approach was already demonstrated by Kalberer and Van Gool (Kalberer et al., 2001; Kalberer et al., 2002a) and by Kshirsagar (2001), but for fewer points on the face. Moreover, their Viseme Spaces were based on PCA (Principal Component Analysis), not ICA. A justification for using ICA rather than PCA will follow later.

Figure 13. Fitting splines in the “Viseme Space” yields good co-articulation effects, after attraction forces exerted by the individual nodes (visemes) were learned from ground-truth data.
Straightforward point-to-point navigation as a way of concatenating visemes
would yield jerky motions. Moreover, when generating the temporal samples,
these may not precisely coincide with the pace at which visemes change. Both
problems are solved by fitting splines to the Viseme Space coordinates of the
visemes. This yields smoother changes and allows us to interpolate in order to
get the facial expressions needed at the fixed times of subsequent frames. We
used NURBS curves of order three.
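The smoothing and resampling step can be sketched as follows. We use SciPy's interpolating cubic splines in place of the NURBS curves of order three mentioned above, which is an approximation; the frame rate, data layout and function names are assumptions for illustration.

```python
import numpy as np
from scipy.interpolate import splrep, splev

def viseme_trajectory(key_times, key_coords, frame_rate=25.0):
    """Fit one cubic spline per Viseme Space dimension and sample it per frame.

    key_times  : (k,) times in seconds at which the visemes are reached (k >= 4)
    key_coords : (k, 16) Viseme Space coordinates of those visemes
    """
    key_times = np.asarray(key_times, dtype=float)
    key_coords = np.asarray(key_coords, dtype=float)
    frame_times = np.arange(key_times[0], key_times[-1], 1.0 / frame_rate)
    trajectory = np.empty((len(frame_times), key_coords.shape[1]))
    for dim in range(key_coords.shape[1]):
        # s=0 gives an interpolating cubic spline through the viseme keyframes.
        tck = splrep(key_times, key_coords[:, dim], k=3, s=0.0)
        trajectory[:, dim] = splev(frame_times, tck)
    return frame_times, trajectory
```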
A word on the implementation of co-articulation effects is in order here. A
distinction is made between vocals and labial consonants on the one hand, and
the remainder of the visemes on the other. The former impose their deformations
much more strictly onto the animation than the latter, which can be pronounced
with a lot of visual variation. In terms of the spline fitting, this means that the
animation trajectory will move precisely through the former visemes and will only
be attracted towards the latter. Figure 13 illustrates this for one Viseme Space
coordinate.
Initially a spline is fitted through the values of the corresponding component for
the visemes of the former category. Then, its course is modified by bending it
towards the coordinate values of the visemes in the latter category. This second
category is subdivided into three subcategories: (1) somewhat labial consonants, like those corresponding to the /ch,jh,sh,zh/ viseme, pull more strongly than (2) the /f,v/ viseme, which in turn pulls more strongly than (3) the remaining visemes of the second category. In all three cases the same influence is given to the rounded
and widened versions of these visemes. The distance between the current spline
(determined by vocals and labial consonants) and its position if it had to go
through these visemes is reduced to (1) 20%, (2) 40%, and (3) 70%, respectively.
These are also shown in Figure 13. These percentages have been set by
comparing animations against 3D ground-truth. If an example face is animated
with the same audio track used for training, such comparison can be easily made
and deviations can be minimized by optimizing these parameters. Only distances between lip positions have been taken into account so far.
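The attraction mechanism can be emulated by fitting the trajectory through the “strong” visemes (vocals and labial consonants) first, and then pulling it part of the way towards each “weak” viseme. The 20%, 40% and 70% residual distances are the figures quoted above; everything else in this sketch (names, data layout, the nearest-frame adjustment) is our own simplification.

```python
import numpy as np

# Residual distance (fraction of the gap that remains) per weak-viseme class.
RESIDUAL = {"somewhat_labial": 0.20,   # e.g., the /ch,jh,sh,zh/ viseme
            "f_v": 0.40,               # the /f,v/ viseme
            "other": 0.70}             # remaining weak visemes

def apply_attraction(trajectory, frame_times, weak_visemes):
    """Pull a Viseme Space trajectory part of the way towards weak visemes.

    trajectory   : (T, 16) coordinates sampled at frame_times, fitted through
                   the vocals and labial consonants only
    weak_visemes : iterable of (time, coords, cls) for the remaining visemes
    """
    for t, coords, cls in weak_visemes:
        idx = int(np.argmin(np.abs(frame_times - t)))   # frame closest to the viseme
        gap = coords - trajectory[idx]
        # Close the gap so that only the class-specific residual fraction remains;
        # a real implementation would refit or locally smooth the spline afterwards.
        trajectory[idx] += (1.0 - RESIDUAL[cls]) * gap
    return trajectory
```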
A tool that automatically generates a face animation which the animator then has
to take or leave is a source of frustration, rather than a help. The computer cannot
replace the creative component that the human expert brings to the animation
process. The animation tool described in this chapter only proposes the speech-
based face animation as a point of departure. The animator can thereafter still
change the different visemes and their influences, as well as the complete splines
that define the trajectory in “Viseme Space.”
In terms of the space of possible deformations, PCA and ICA basically yield the
same result. As already mentioned, PCA is part of the ICA algorithm, and
determines the degrees of freedom to be kept. The importance of ICA mainly lies
in the more intuitive, manual changes that the animator can make afterwards. A
face contains many muscles, and several will be active together to produce the
different visemes. In as far as their joint effect can be modeled as a linear
combination of their individual effects, ICA is the way to decouple the net effect
again (Kalberer et al., 2002b). Of course, this model is a bit naive but,
nevertheless, one would hope that ICA is able to yield a reasonable decompo-
sition of face deformations into components that themselves are more strongly
correlated with the facial anatomy than the principal components. This hope has
proved not to be in vain.
From a mathematical point of view, there also is a good indication that ICA may
be more appropriate than PCA to deliver the basis of a Viseme Space. The
distribution of the extracted visemes turns out to have a shape that is quite non-Gaussian, which can clearly be observed from $\chi^2$ plots.
The advantage of ICA does not lie in a more compact representation of the variation. The advantage, rather, is the more intuitive deformations that
correspond to the independent components, where each stays closer to a single,
anatomical action of the face.
Finally, on a more informal note, we found that only about one or two PCs could
be easily described, e.g., “opening the mouth.” In the case of ICs, six or so
components could be described in simple terms. Figure 15 shows a comparison
between principal and independent components. In both cases, there is a
component that one could describe as opening the mouth. When it comes to a
simple action, like rounding the mouth, there is a single IC that corresponds to this
effect. But, in the case of PCs, this rounding is never found in isolation, but is
combined with the opening of the mouth or other effects. Similar observations
can be made for the other ICs and PCs.
One could argue that animation can proceed directly and uniquely as a combi-
nation of basic modes (e.g., independent components) and that going via visemes
is an unnecessary detour. Discussions with animators made it clear, however,
that they insist on having intuitive keyframes, like visemes and basic emotions,
as the primary interface. Hence, we give animators control both at the level of
complete visemes and single independent components. Having the system work
on the basis of the same keyframes (i.e., in a Viseme Space) helps to make the
interaction with the animator more intuitive, as the animator and the animation
tool “speak” the same language.
Results
As a first example, we show some frames out of an animation sequence, where
the person is animated using his own visemes. This person was one of our
examples. Four frames are shown in Figure 16. Although it is difficult to
demonstrate the realism of animation on paper, the different facial expressions
look natural, and so does the corresponding video sequence.
A second example shows the transition between two faces (see Figure 17). In this case, the visemes of the man are simply cloned to get those for the boy (Noh et al., 2001).

Figure 16. One of the example faces uses its own visemes.

Figure 17. To see the effect of purely cloned visemes, a specific experiment was performed. The man’s visemes are kept throughout the sequence and are cloned onto the mixed face.

The animation shows a few flaws, which become more pronounced as the morph gets closer to the boy’s face. Some of these flaws are highlighted in Figure 18 (but they are more apparent when seen as a video). This example shows that cloning alone does not suffice to yield realistic animation of speech.
A third example shows the result of our full viseme personalization. The three
faces on the left-hand side of Figure 19 are three of the 10 example faces used
to create the hyper-plane. The face on the right-hand side has been animated by
first projecting it onto the hyper-plane, weighting the visemes of the examples accordingly, and finally cloning these weighted visemes to the original face.

Figure 19. Instances of the same viseme (represented by the faces in the upper row) are combined and cloned onto a novel face according to its physiognomy (face in the center).
Figure 20. The final animation can be enhanced with expressions that are
not related to speech, such as eye blinks and emotions.
As a final note, it is interesting to mention that the system proposed here has been
implemented as a plug-in for Alias/Wavefront’s Maya. Figure 1 gives a quick
overview of the processing steps.
Furthermore, we have already experimented with the superposition of speech
and emotions. Detailed displacements were measured for the six basic emotions.
We found that linear addition of displacements due to visemes and emotions
worked out well in these preliminary trials. An example is shown in Figure 20.
Trends
Technologically, the trend in face animation is towards a stronger 3D component in the modeling and animation pipeline. Entertainment is certainly one of the primary marketplaces for the work described in this chapter, and 3D technology has had a significant impact on this industry. With human characters as
one of the central elements, 3D animation of characters and special effects have
become an integral part of many blockbuster movie productions. The game
industry has in the meantime eclipsed the movie industry in terms of gross
revenues. Also, in this branch of industry, there is a trend towards more realistic
human characters. The most notable trend in 3D digital media is the convergence
of these playgrounds. Productions often target multiple markets simultaneously,
with, e.g., movies coupled to games and Web sites, as well as an extensive line
of gadgets.
Figure 21. The forecasts expect the European online game market to reach 43% by 2004, when there should be around 73 million online gamers, as shown in the chart.
Conclusions
Realistic face animation is still a challenge. We have tried to attack this problem
via the acquisition and analysis of 3D face shapes for a selection of visemes.
Such data have been captured for a number of faces, which allows the system
to at least apply a crude adaptation of the visemes to the physiognomy of a novel
face for which no such data could be obtained. The animation is organized as a
navigation through “Viseme Space,” where the influence of different visemes on
the space trajectory varies. Given the necessary input in the form of an ordered
sequence of visemes and their timing, a face animation can be created fully
automatically. In practice, such an animation will rather be a point of departure for
an animator, who still has to be able to modify the results. This remains fully
possible within the proposed framework.
The proposed method yields realistic results, with more detail in the underlying 3D models than usual. The RBF morphing, described in the section, Learning Viseme Expressions, has been implemented with special care to make this possible. The way of combining the visemes for a novel face seems to be both novel and effective. By giving the animator control over independent rather than the more usual principal components, this space can be navigated in a more intuitive way.
Although we believe that our results can already be of help to an animator,
several improvements can be imagined. The following issues are a few ex-
amples. Currently, a fixed texture map is used for all the visemes, but the 3D
acquisition method allows us to extract a separate texture map for every viseme.
This will help to create the impression of wrinkles on rounded lips, for instance.
Another issue is the selection of the visemes. For the moment, only a rounded and
widened version of the consonants has been included. In reality, an /m/ in ama
lies between that in omo and imi. There is a kind of gradual change from umu,
over omo, ama, and eme, up to imi. Accordingly, more versions of the visemes
can be considered. Another possible extension is the inclusion of tongue position
in the viseme classification. Some of the consonant classes have to be subdivided
in that case. A distinction has to be made between, e.g., /l/ and /n/ on the one
hand, and /g/ and /k/ on the other.
It is also possible to take more example images, until they span Face Space very
well. In that case, the final viseme cloning step in our viseme personalization can
probably be left out.
Last, but not least, speech-oriented animation needs to be combined with other
forms of facial deformations. Emotions are probably the most important ex-
ample. It will be interesting to see what else is needed to combine, e.g., visemes
and emotions and keep the overall impression natural. All these expressions call
on the same facial muscles. It remains to be seen whether linear superpositions
of the different deformations really suffice, as was the case in our preliminary
experiments.
Acknowledgments
This research has been supported by the ETH Research Council (Visemes
Project), the KULeuven Research Council (GOA VHS+ Project), and the EC
References
Beier, T. & Neely, S. (1992). Feature-based image metamorphosis. SIGGRAPH, ACM Press.
Blanz, V. & Vetter, T. (1999). A morphable model for the synthesis of 3D faces. SIGGRAPH, ACM Press.
Brand, M. (1999). Voice puppetry. SIGGRAPH, ACM Press.
Bregler, C. & Omohundro, S. (1995). Nonlinear image interpolation using manifold learning. NIPS, 7.
Bregler, C., Covell, M. & Slaney, M. (1997). Video rewrite: Driving visual speech with audio. SIGGRAPH, ACM Press.
Chen, D., State, A. & Banks, D. (1995). Interactive shape metamorphosis. Symposium on Interactive 3D Graphics, SIGGRAPH.
Cosatto, E. (2000). Sample-based talking-head synthesis. Ph.D. Thesis, Signal Processing Lab, Swiss Federal Institute of Technology, Lausanne, Switzerland.
Cosatto, E. & Graf, H. P. (2000). Photo-realistic talking-heads from image samples. IEEE Transactions on Multimedia, 2, 152–163.
Eben, E. (1997). Personal communication. Pixar Animation Studios.
Eyetronics. (1999). Retrieved from the WWW: http://www.eyetronics.com.
Ezzat, T. & Poggio, T. (2000). Visual speech synthesis by morphing visemes. International Journal of Computer Vision, 38, 45–57.
Guenter, B., Grimm, C., Wood, D., Malvar, H. & Pighin, F. (1998). Making faces. SIGGRAPH, ACM Press.
Hyvärinen, A. (1997). Independent component analysis by minimization of mutual information. Technical Report A46, Helsinki University of Technology.
Kähler, K., Haber, J., Yamauchi, H. & Seidel, H. P. (2002). Head shop: Generating animated head models with anatomical structure. SIGGRAPH Symposium on Computer Animation, 55–63, ACM Press.
Kalberer, G. A. & Van Gool, L. (2001). Realistic face animation for speech. Videometrics, Electronic Imaging, IS&T/SPIE, 4309, 16–25.
Chapter IX
Automatic 3D Face Model Adaptation with Two Complexity Modes for Visual Communication*
Markus Kampmann
Ericsson Eurolab Deutschland GmbH, Germany
Liang Zhang
Communications Research Centre, Canada
Abstract
This chapter introduces a complete framework for automatic adaptation of
a 3D face model to a human face for visual communication applications like
video conferencing or video telephony. First, facial features in a facial
image are estimated. Then, the 3D face model is adapted using the estimated
facial features. This framework is scalable with respect to complexity. Two
complexity modes, a low complexity and a high complexity mode, are
introduced. For the low complexity mode, only eye and mouth features are
estimated and the low complexity face model Candide is adapted. For the
high complexity mode, a more detailed face model is adapted, using eye and
mouth features, eyebrow and nose features, and chin and cheek contours.
Experimental results with natural videophone sequences show that, with this framework, automatic 3D face model adaptation with high accuracy is possible.
Introduction
In the last few years, virtual humans and especially animated virtual faces (also
called talking heads) have achieved more and more attention and are used in
various applications. In modern computer games, virtual humans act as football
players or Kung Fu fighters. In movies, highly realistic animated virtual humans
are replacing real actors (e.g., in the science fiction movie “Final Fantasy”). On
the Internet, animated virtual faces are acting as news announcers or sales
agents. In visual communication applications, like video telephony or video
conferencing, the real faces of the participants are represented by virtual face
clones of themselves. If we take a closer look at the technology behind these
animated faces, the underlying shape of a virtual face is often built from a 3D
wireframe consisting of vertices and triangles. This wireframe is textured using
textures from a real person’s facial image. Synthetic facial expressions are
generated by animating the 3D wireframe. Usually, the face is animated by
movement of the wireframe’s vertices. In order to produce natural looking facial
movements, an underlying animation structure (providing rules for animation) is
needed, simulating the behavior of a real human face.
The creation of such an animated face requires generating a well-shaped and
textured 3D wire-frame of a human face, as well as providing rules for animation
of this specific 3D wireframe. There are different ways to create an animated
face. One possibility is that an animated face is created manually by an
experienced 3D modeler or animator. However, an automatic approach is less
time consuming and is required for some applications. Dependent on the specific
application and its requirements, different ways for the automatic creation of an
animated face exist.
For 3D modeling of the shape of the head or face, i.e., for generation of the 3D
wire-frame, techniques that are common for the 3D modeling of objects in
general could be used. With a 3D scanner, a laser beam is sent out and reflected
by the object’s surface. Range data from the object can be obtained and used for
3D modeling. Other approaches use range data from multi-view images (Niem,
1994) obtained by multiple cameras for 3D modeling. All these techniques allow
a very accurate 3D modeling of an object, i.e., a human head or face. However,
the generated 3D model cannot be immediately animated, since the underlying
animation structure is missing.
An alternative approach is the use of a generic 3D face model with a built-in
animation structure. Action Units from the Facial Action Coding System
(Ekman & Friesen, 1977), MPEG-4 facial animation parameters (FAP)
(Sarris, Grammalidis & Strintzis, 2002) or muscle contraction parameters
(Fischl, Miller & Robinson, 1993) from a model of facial muscles can be used as
an animation structure for facial expression. A limited number of characteristic
feature points on a generic face model are defined, e.g., the tip of the chin or the
left corner of the mouth. At the first step of 3D modeling using a generic 3D face
model, those defined feature points are detected in facial images. Then, the
characteristic feature points of the generic 3D face model are adapted using the
detected feature points. This process is also called face model adaptation.
According to available input resources, 3D face model adaptation approaches
can be categorized as follows: (a) range image: An approach using range image
to adapt a generic face model with a physics-based muscular model for animation
in 3D is proposed in Lee, Terzopoulos & Waters (1995). From the generic 3D
face model, a planar generic mesh is created using a cylindrical projection. Based
on the range image, the planar generic face mesh adaptation is iteratively
performed to locate feature points in the range image by feature-based matching
techniques; (b) stereoscopic images/videos: An approach to using stereoscopic
images/videos for face model adaptation is proposed in Fua, Plaenkers &
Thalman (1999). Information about the surface of the human face is recovered
by using stereo matching to compute a disparity map and then by turning each
valid disparity value into a 3D point. Finally, the generic face model is deformed
so that it conforms to the cloud of those 3D points based on least-squares
adjustment; (c) orthogonal facial images: Orthogonal facial images are used
to adapt a generic face model in Lee & Magnenat-Thalmann (2000) and Sarris,
Grammalidis & Strintzis (2001). They all require two or three cameras which
must be carefully set up so that their directions are orthogonal; (d) monocular
images/videos: For face model adaptation using monocular images/videos,
facial features in the facial images are determined and the face model is adapted
(Kampmann, 2002). Since no depth information is available from monocular
images/videos, depth information for feature points is provided only in advance
by a face model and is adapted in relation to the determined 2D feature points.
In the following, we concentrate on animated faces for applications in the field
of visual communication where only monocular images are available. For visual
communication applications, like video conferencing or video telephony, a virtual
face clone represents the human participant in the video conference or in the
videophone call. Movements and facial expressions of the human participants
have to be extracted and transmitted. At the receiver side, the virtual face model
is animated using the extracted information about motion and facial expressions.
Since such information can be coded with a limited amount of bits, a video
conference or a videophone system with very low bit rates is possible (Musmann,
1995). To implement such a coding system, a generic 3D face model has to be
adapted to the particular face of the participant involved in the monocular video
telephony or video conferencing call. This adaptation must be carried out at the
beginning of the video sequence. Instead of achieving the highest 3D modeling
accuracy, it is the quality of the animated facial expressions in 2D images that
is more important for visual communication at the receiver side.
Face model adaptation for visual communication differs from other applications
in that it has to be done without human interaction and without a priori
information about the participant’s face and its facial features. It is unrealistic
to assume that a participant always has a particular facial expression, such as a neutral expression with a closed mouth, or a particular pose position in the 3D world. An algorithm for 3D face model adaptation should not only adapt the face
model to the shape of the real person’s face. An adaptation to the initial facial
expression at the beginning of the video sequence is also necessary.
Furthermore, an algorithm for 3D face model adaptation should be scalable, since
a number of different devices will likely be used for visual communication in the
future. On the one hand, there will be small mobile devices like a mobile phone
with limited computational power for image analysis and animation. The display
size is restricted, which results in less need for high quality animation. On the
other hand, there will be devices without these limitations like stationary PCs. In
the case of a device with limitations regarding power and display, the face model
adaptation algorithm would need to switch to a mode with reduced computational
complexity and less modeling accuracy. For more powerful devices, the algo-
rithm should switch to a mode with higher computational complexity and greater
modeling accuracy.
Some algorithms in the literature deal with automatic face model adaptation in
visual communication. In Kampmann & Ostermann (1997), a face model is
adapted only by means of eye and mouth center points. In addition, nose position,
and eye and mouth corner points are also used in Essa & Pentland (1997). A 3D
generic face model onto which a facial texture has previously been mapped by
hand is adapted to a person’s face in the scene by a steepest-gradient search
method (Strub & Robinson, 1995). No rotation of the face model is allowed. In
Kuo, Huang & Lin (2002), a method is proposed using anthropometric informa-
tion to adapt the 3D facial model. In Reinders et al. (1995), special facial features
like a closed mouth are at first estimated and the face model is then adapted to
these estimated facial features. Rotation of the face model is restricted.
Furthermore, no initial values for the facial animation parameters like Action
Units (Reinders et al., 1995) or muscle contraction parameters (Essa &
Pentland, 1997) have been determined by the adaptation algorithms. An approach that addresses these limitations is introduced in this chapter.
3D Face Models
For visual communication like video telephony or video conferencing, a real
human face can be represented by a generic 3D face model that must be adapted
to the face of the individual. The shape of this 3D face model is described by a
3D wireframe. Additional scaling and facial animation parameters are aligned
with the face model. Scaling parameters describe the adaptation of the face
model towards the real shape of the human face, e.g., the size of the face, the
width of the eyes or the thickness of the lips. Once determined, they remain fixed
for the whole video telephony or video conferencing session. Facial animation
parameters describe the facial expressions of the face model, e.g., local
movements of the eyes or mouth. These parameters are temporally changed with
the variations of the real face’s expressions. In this framework, face model
adaptation is carried out in two complexity modes, with a low complexity face
model and a high complexity face model. These two face models are described in
detail below.
Figure 1. 3D face models: (a)(b) Low complexity 3D face model Candide (79 vertices and 132 triangles): (a) Front view, (b) Side view; (c)(d) High complexity 3D face model (423 vertices and 816 triangles): (c) Front view, (d) Side view.
The face model Candide (Rydfalk, 1987) used for the low complexity mode is
shown in Figures 1a and 1b. This face model consists of only 79 vertices and 132
triangles. For adaptation of Candide to the shape of the real face, scaling
parameters are introduced. With these parameters, the global size of the face,
the size of the eyes and the lip thickness can be changed.
As facial animation parameters for the face model Candide, six Action Units
from the Facial Action Coding System (Ekman & Friesen, 1977) are utilized.
Each Action Unit (AU) describes the local movement of a set of vertices. Two
Action Units (AU41, AU7) are defined for the movements of the eyelids and the
remaining four Action Units are defined for the movements of the mouth corners
(AU8, AU12) and the lips (AU10, AU16).
For the high complexity mode, a derivative of the face model presented in
Fischl, Miller & Robinson (1993) is used (ref. Figures 1c and 1d). This generic
face model consists of 423 vertices and 816 triangles. Compared with the low
complexity face model Candide, more scaling parameters are introduced. Here,
the global size of the face, the size of eyes, nose and eyebrows, the lip thickness,
as well as the shape of chin and cheek contours could be scaled. As facial
animation parameters, facial muscle parameters as described in Fischl, Miller
and Robinson (1993) are utilized. These muscle parameters describe the amount
of contraction of the facial muscles within the human face. Ten different facial
muscles are considered (ref. Figure 2). Since they occur on the left and on the
right side of the face, respectively, 20 facial muscle parameters are used.
The estimation of facial features in the 2D facial images is the first part of the
face model adaptation algorithm. For the low complexity mode, eye and mouth
features are estimated (described in the next subsection). For the high complex-
ity mode, chin and cheek contours, eyebrow and nose features are additionally
estimated (described in the following subsection).
The eye and mouth features are estimated by means of 2D eye and mouth
models. In the following, subscripts r, l, u, o stand for right, left, upper and
lower, respectively.
2D eye model
is set to a fixed fraction of the eye width, i.e., the distance between the two eye corner positions $h_l$ and $h_r$. The parameters of the parabolic curves $(a_u, a_o)$ represent the opening heights of the eyelids. It is assumed that both eyes have the same opening heights. In order to represent eye features with a 2D eye model, eight parameters have to be estimated, namely: (i) four eye corner points, (ii) two pupil points, and (iii) two opening heights of the eyelids.
2D mouth models
parameters $(d_u, d_o)$ stand for the thickness of the lips. The mouth width $L$ is calculated as the distance between the left mouth corner point $h_l$ and the right mouth corner point $h_r$. To represent the mouth-open features, six parameters are needed: (i) two mouth corner points, (ii) two lip thicknesses, and (iii) two opening heights of the lips. For the representation of the mouth-closed features, five parameters are needed: (i) two mouth corner points, (ii) two lip thicknesses, and (iii) one parameter $t$ which refers to the height between the level of the corners of the mouth and the contact point of the two lips.
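To keep the parameterization explicit, the following sketch collects the eye and mouth model parameters into simple data structures. The field names mirror the symbols used above, but the grouping into Python dataclasses is ours, not part of the chapter.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]   # 2D image position (x, y)

@dataclass
class EyeModel2D:                                   # eight parameters in total
    corners: Tuple[Point, Point, Point, Point]      # four eye corner points
    pupils: Tuple[Point, Point]                     # two pupil points
    a_u: float                                      # upper eyelid opening height
    a_o: float                                      # lower eyelid opening height

@dataclass
class MouthOpenModel2D:                             # six parameters
    h_l: Point                                      # left mouth corner
    h_r: Point                                      # right mouth corner
    d_u: float                                      # upper lip thickness
    d_o: float                                      # lower lip thickness
    o_u: float                                      # upper lip opening height
    o_o: float                                      # lower lip opening height

@dataclass
class MouthClosedModel2D:                           # five parameters
    h_l: Point
    h_r: Point
    d_u: float
    d_o: float
    t: float    # height between the mouth-corner level and the lip contact point
```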
Based on the representation of eye and mouth features using the 2D parametric
models described above, the parameters of the models are separately estimated
one after another (ref. Figure 5). The search areas for eye and mouth features
are first determined using the algorithm proposed in Kampmann & Ostermann
(1997). Within these search areas, the pupil points and the corner points of the
eyes and the mouth are estimated with template matching techniques (Zhang,
1997). After that, these estimated points are utilized to fix the search areas for
the lip thickness and the opening heights of the lips and the eyelids. Since two
mouth models are exploited for representing the 2D mouth features, an appro-
priate mouth model has to be automatically selected first. The lip thickness and
the opening heights of the lips will then be estimated by minimization of cost
functions using an appropriate 2D mouth model. The eyelid opening heights are
also estimated using the 2D eye model analogous to the estimation of the lip
thickness and opening heights (Kampmann & Zhang, 1998). In the following,
only selection of an appropriate mouth model and estimation of mouth opening
heights and lip thickness are addressed in detail.
corner positions have already been estimated, only the thickness $(d_u, d_o)$ and the opening heights $(o_u, o_o)$ of the lips in the mouth-open case (ref. Figure 4), as well as the lip thickness $(d_u, d_o)$ and the value of the parameter $t$ in the mouth-closed case (ref. Figure 4) need to be estimated. A perpendicular bisector of the line $l$ connecting both mouth corner positions in the mouth area $M$ is defined as the search area for these parameters (ref. Figure 6). Different values of these
parameters create different forms of the 2D mouth models. In order to determine
these parameters, the similarity between the selected 2D mouth model with the
geometrical form of the real mouth in the image is measured using a cost
function. This cost function utilizes texture characteristics of the real mouth by
means of the selected 2D mouth model.
It is observed that there are texture characteristics existing in the mouth area:
1. The areas of the lips have almost the same chrominance value. In the
mouth-open case, an additional area, mouth-inside, exists and is strongly
distinguished from the lip areas. The mouth-inside normally has varying luminance values due to, e.g., teeth (white) and shadow areas (black), but its chrominance value is almost uniform.
2. Along the lip contours, the luminance values vary strongly.
the same, but they should be different from the mean chrominance of the mouth-inside. The variances are utilized to describe the uniformity of the chrominance values $U$ within these areas. The term $f_{open,2}(d_u, d_o, o_u, o_o)$ consists of the addends of edge strength (image gradient) $g_Y(x, y)$ along the lip contours $W_i$. The run and length of the parabolic curves are defined by the 2D mouth-open model and depend on the parameters to be estimated.
The parameters $(d_u, d_o, o_u, o_o)$ in the mouth-open model are determined by minimization of the cost $f_{open}(d_u, d_o, o_u, o_o)$. To reduce the computational complexity, the cost function is only evaluated at the already detected candidates for the lip contours (ref. Figure 6). From all possible combinations of the lip contour candidates, the combination with the least cost is determined as the estimate for the lip thickness and the lip opening heights in the mouth-open model.
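The minimization itself amounts to a small brute-force search over the detected candidates. In the sketch below, the cost function and the candidate lists are placeholders standing in for the chapter's actual cost terms and contour-candidate detection.

```python
from itertools import product

def fit_mouth_open(cand_du, cand_do, cand_ou, cand_oo, f_open):
    """Exhaustively evaluate f_open over the detected lip-contour candidates
    and return the parameter combination with the lowest cost.

    cand_* : iterables of candidate values for d_u, d_o, o_u, o_o
    f_open : callable f_open(d_u, d_o, o_u, o_o) -> cost (a placeholder here)
    """
    best, best_cost = None, float("inf")
    for d_u, d_o, o_u, o_o in product(cand_du, cand_do, cand_ou, cand_oo):
        cost = f_open(d_u, d_o, o_u, o_o)
        if cost < best_cost:
            best, best_cost = (d_u, d_o, o_u, o_o), cost
    return best, best_cost
```

Because the candidates are restricted to positions on the perpendicular bisector, the number of combinations stays small enough for this exhaustive search.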
In the case of the high complexity mode, other facial features besides the eyes and
the mouth are estimated.
For the estimation of chin and cheek contours, the approach described in
Kampmann (2002) is used. The chin and the cheeks are represented by a
parametric 2D model. This parametric model consists of four parabola branches
linked together. The two lower parabola branches represent the chin, the two
upper parabola branches define the left and the right cheek. Taking into account
the estimated eye and mouth middle positions, search areas for the chin and
cheek contours are established. Inside each search area, the probability of the
occurrence of the chin and cheek contours is calculated for each pixel. This
calculation takes advantage of the fact that the chin and cheek contours are more
likely to be located in the middle of the search area than at the borders of the
search area. An estimation rule is established using this probability and assumes
high image gradient values at the position of the chin and cheek contours. By
maximization, the position of the chin contour (ref. Figure 7a), as well as the
positions of the cheek contours (ref. Figure 7b) are estimated.
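One way to read this estimation rule is as a per-pixel score that combines a location prior (contours are more likely near the middle of the search area) with the image gradient, and takes the maximum of the score. The Gaussian form of the prior and the exact combination below are illustrative assumptions, not the chapter's precise formulation.

```python
import numpy as np

def best_contour_position(grad_mag, center_row, sigma):
    """Return the (row, col) in a search area most likely to lie on the contour.

    grad_mag   : 2D gradient-magnitude image of the search area
    center_row : row index of the middle of the search area
    sigma      : width of the location prior (assumed Gaussian)
    """
    rows = np.arange(grad_mag.shape[0])
    prior = np.exp(-0.5 * ((rows - center_row) / sigma) ** 2)  # peaks in the middle
    score = prior[:, None] * grad_mag                          # prior * gradient strength
    return np.unravel_index(np.argmax(score), score.shape)
```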
Eyebrows
Using this knowledge, eyebrow features are estimated. First, using the estimated
eye positions, search areas for the eyebrows are introduced above the eyes.
Then, a binarization of the luminance image using a threshold is carried out. In
order to determine this threshold, the upper edge of the eyebrow is identified as
the maximum value of the luminance gradient. The threshold is now the mean
value between the luminance value of the skin and the luminance value of the
eyebrow at this upper edge. After binarization, the area with a luminance value
below the threshold is checked whether it has the typical curvature, length and
thickness of an eyebrow. If the answer is yes, the eyebrow is detected. If the
answer is no, the eyebrow is covered by hair. For this case, using the
morphological image processing operations erosion and dilation, as well as
knowledge about the typical shape of an eyebrow, it is decided whether the
eyebrow is completely or only partly covered by hair. In the case that the
eyebrow is only partly covered by hair, the eyebrow is detected by removing the
covering hair. Otherwise, the eyebrow is completely covered by hair and cannot
be detected. Figure 8 shows the different stages of the eyebrow estimation.
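The threshold selection can be written down directly: locate the strongest vertical luminance edge inside the search area (the upper edge of the eyebrow) and take the mean of the luminance just above it (skin) and just below it (eyebrow). The sketch assumes a grayscale crop of the search area; the one-pixel offsets are an illustrative choice.

```python
import numpy as np

def eyebrow_threshold(search_area):
    """Binarization threshold = mean of skin and eyebrow luminance at the upper edge.

    search_area : 2D array of luminance values cropped above one eye.
    """
    grad_y = np.abs(np.diff(search_area.astype(float), axis=0))  # vertical gradient
    r, c = np.unravel_index(np.argmax(grad_y), grad_y.shape)     # strongest edge pixel
    skin_luma = float(search_area[r, c])       # just above the edge: skin
    brow_luma = float(search_area[r + 1, c])   # just below the edge: eyebrow
    threshold = 0.5 * (skin_luma + brow_luma)
    mask = search_area < threshold             # pixels darker than the threshold
    return mask, threshold
```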
Nose
As nose features, the sides of the nose are estimated (Kampmann & Zhang,
1998). Since the mouth middle position is already determined, search areas for
the sides of the nose are introduced above the mouth. Since the edges of the sides
of the nose have a specific shape, parametric 2D models of the sides of the nose
are introduced. Using this model, each position inside the search area is
evaluated as a possible position of the side of the nose. The positions with the
maximum accumulated value of the luminance gradient at the nose model’s
border are chosen as the sides of the nose.
For 3D face model adaptation with low complexity, the face model Candide is
exploited. For complexity reasons, the scaling and facial animation parameters
of Candide will be determined with eye and mouth features only (Zhang, 1998).
The scaling parameters for the face model Candide include the scaling
parameters for the face size, for the eye size and for the lip thickness. Before
determining the scaling parameters, the face model Candide is rotated in such
a way that the head tilts of the face model and of the real face in the image match.
The face size is defined by the distance between both eye middle positions, as
well as the distances between the eye and mouth middle positions of the face
model. The 3D eye and mouth middle positions of the face model are projected
onto the image plane. Comparison of the distances of the eyes and the mouth in
the image with those of the projections of Candide yields the scaling factors for
the face size. The eye size is defined as the distance between both eye corner
positions. The scaling parameter for the eye size is determined by comparing the
projections of the 3D eye corner positions of the face model onto the image plane
with the estimated 2D eye corner positions. As the scaling parameter for the lip
thickness, displacement vectors of the lip vertices of the face model are introduced. The vertices that represent the inside contours of the lips are fixed during
the scaling of the lip thickness. The vertices that describe the outside contours
of the lips are shifted outwards for the scaling of the lip thickness. The eye
opening is defined by the positions of the upper and lower eyelids. The position
of the upper eyelid can be changed by AU41 and that of the lower eyelid by AU7.
Since the 3D face model has fully opened eyes at the beginning, the eyelids of
the face model are closed down, so that the opening heights of the face model
match the estimated eyelid opening heights in the image plane. The values of
AU41 and AU7 are calculated by determining the range of the movement of the
eyelids of the face model. The movements of the mouth corners are represented by AU8 and AU12. AU8 moves the mouth corners inward and AU12 moves them
outward. For the determination of these two Action Units’ values, the estimated
mouth corner positions are utilized. The mouth opening is defined by the position
and movement of the upper and lower lips. The position of the upper lip is
determined by AU10 and that of the lower lip by AU16. For the determination of
these two Action Units’ values, the estimated 2D opening heights of the lips are
used.
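For the low complexity mode, the scaling factors thus reduce to ratios of 2D distances between estimated image features and projected model features. A minimal sketch under that reading (the feature dictionary layout and the availability of the projected model features are assumptions):

```python
import numpy as np

def candide_scale_factors(est, proj):
    """Compute Candide scaling factors from estimated 2D image features (est)
    and the image-plane projections of the corresponding model features (proj).

    Both arguments are dicts mapping feature names to 2D points (x, y);
    the key names below are our own convention.
    """
    def dist(a, b):
        return float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))

    # Face width: distance between the two eye middle positions.
    s_face_width = dist(est["eye_mid_l"], est["eye_mid_r"]) / \
                   dist(proj["eye_mid_l"], proj["eye_mid_r"])

    # Face height: distance between the eye line and the mouth middle position.
    eye_center_est = 0.5 * (np.asarray(est["eye_mid_l"]) + np.asarray(est["eye_mid_r"]))
    eye_center_prj = 0.5 * (np.asarray(proj["eye_mid_l"]) + np.asarray(proj["eye_mid_r"]))
    s_face_height = dist(eye_center_est, est["mouth_mid"]) / \
                    dist(eye_center_prj, proj["mouth_mid"])

    # Eye size: distance between the two corner positions of one eye.
    s_eye = dist(est["eye_corner_inner"], est["eye_corner_outer"]) / \
            dist(proj["eye_corner_inner"], proj["eye_corner_outer"])

    return {"face_width": s_face_width, "face_height": s_face_height, "eye": s_eye}
```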
For adaptation of the high complexity face model, all estimated facial features
(eyes, mouth, eyebrows, nose, chin and cheek contours) are taken into account.
Using all estimated facial features, scaling and initial facial animation parameters
of the high complexity face model are calculated and used for the face model
adaptation to the individual face. First, orientation, size and position of the face
model are adapted (only eye and mouth middle positions and cheek contours are
used here). Then, the jaw of the face model is rotated. In the next step, the chin
and cheek contours of the face model are adapted. Finally, scaling and facial animation
parameters for the rest of the facial features (eyes, mouth, eyebrows, nose) are
determined.
For orientation, the lateral head rotation is adapted first. The quotient of the
distances between eye middle positions and cheek contours on both sides of the
face is introduced as a measure for the lateral head rotation. For adaptation, the
face model is rotated around its vertical axis until this quotient measured
in the image plane using the estimated facial features is the same as the quotient
determined by projecting the face model in the image plane. Then, the head tilt
is adapted. The angle between a line through the eye middle positions and the
horizontal image line is a measure for the head tilt. Using the measured angle in
the image, the tilt of the face model is adapted. After that, the face size is scaled.
The distance between the eye middle positions is used for scaling the face width,
the distance between the center of the eye middle positions and the mouth middle
position for scaling the face height. The next step of face model adaptation is the
adjustment of the jaw rotation. Here, the jaw of the face model is rotated until
the projection of the face model’s mouth opening onto the image plane matches
the estimated mouth opening in the image. For scaling of the chin and cheek
contours, the chin and cheek vertices of the face model are individually shifted
so that their projections match the estimated face contour in the image. In order
to maintain the proportions of a human face, all other vertices of the face model
are shifted, too. The amount of shift is inversely proportional to the distance from the vertex
to the face model’s chin and cheek. Finally, scaling and facial animation
parameters for the rest of the facial features (eyes, mouth, eyebrows, and nose)
are calculated by comparing the estimated facial features in the image with
projections of the corresponding features of the face model. For scaling, the
width, thickness and position of the eyebrows, the width of the eyes, the size of
the iris, the width, height and depth of the nose, as well as the lip thickness are
determined. For facial animation, the rotation of the eyelids, the translation of the
irises, as well as facial muscle parameters for the mouth are calculated. These
scaling and facial animation parameters are then used for the adaptation of the
high complexity face model.
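The lateral-rotation step can be read as a one-dimensional search: rotate the model about its vertical axis until the projected quotient of eye-to-cheek distances matches the quotient measured in the image. A minimal sketch of such a search, in which the projection routine and the angle range are assumed placeholders:

```python
import numpy as np

def adapt_lateral_rotation(image_quotient, project_quotient,
                           angles=np.linspace(-45.0, 45.0, 181)):
    """Find the head rotation (in degrees) whose projected eye-to-cheek distance
    quotient best matches the quotient measured in the image.

    image_quotient   : quotient measured from the estimated 2D facial features
    project_quotient : callable angle -> quotient of the face model projected
                       into the image plane at that rotation (placeholder)
    """
    errors = [abs(project_quotient(a) - image_quotient) for a in angles]
    return float(angles[int(np.argmin(errors))])
```

The same pattern (measure in the image, compare with the projection, adjust one model parameter) carries over to the head tilt, the face size and the jaw rotation described above.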
Experimental Results
For evaluation of the proposed framework, the head-and-shoulder video sequences Akiyo and Miss America, with a resolution corresponding to CIF and a frame rate of 10 Hz, are used to test its performance.
Figure 9 shows some examples of the estimated eye and mouth features overlaid on the original images of the sequences Akiyo and Miss America. For accuracy evaluation, the true values are manually determined from the natural video sequences, and the standard deviation between the estimated and the true values is measured. The estimation error for the pupil positions is 1.2 pel on average and
Figure 9. Estimated 2D eye and mouth models (Left: Akiyo; Right: Miss
America).
Figure 10. Estimated eyebrows, nose, chin and cheek contours (Left:
Akiyo; Right: Miss America). Displayed eye and mouth middle positions are
used for determining search areas for the other facial features.
for the eye and mouth corner positions is 1.8 pel. The estimation error for the lip thickness and for the lip opening heights is 1.5 pel on average, while the error for the eyelid opening heights amounts to 2.0 pel.
Figure 10 shows some results for the estimation of the other facial features (eyebrows, nose, chin and cheek contours) of the high complexity mode. The estimation error for the eyebrows is 2.7 pel on average and for the sides of the nose is 1.8 pel. The estimation error for the chin and cheek contours is 2.7 pel on average.
Figure 11. Adapted 3D face model with low complexity over an original
image based on estimated eye and mouth features (Left: Akiyo; Right: Miss
America).
Figure 12. Adapted 3D face model with high complexity over an original
image based on all estimated facial features (Left: Akiyo; Right: Miss
America).
Figure 13. Facial animation using the face model of low complexity: (O-) Original images; (S-) Synthesized images.
Facial Animation
Figure 14. Facial animation using the face model of high complexity: (O-) Original images; (S-) Synthesized images.
onto the image plane creates the synthetic images (S-) which are shown in Figure
13. It can be seen that the quality of the synthetic faces is sufficient, especially
for smaller changes of the facial expressions compared with the original image
(O-1). For creating a higher quality synthetic face, a more detailed face model
with more triangles is necessary. This high complexity face model is textured
from the original images (O-1) in Figure 14. The synthetic images (S-) from
Figure 14 show the results of animating the high complexity face model. It can
be seen that using the high complexity face model results in a visually more
impressive facial animation, although at the expense of higher processing
complexity.
Conclusions
A framework for automatic 3D face model adaptation has been introduced
which is qualified for applications in the field of visual communication like video
telephony or video conferencing. Two complexity modes have been realized, a
low complexity mode for less powerful devices like a mobile phone and a high
complexity mode for more powerful devices such as PCs. This framework
consists of two parts. In the first part, facial features in images are estimated.
For the low complexity mode, only eye and mouth features are estimated. Here,
parametric 2D models for the eyes, the open mouth and the closed mouth are
introduced and the parameters of these models are estimated. For the high
complexity mode, additional facial features, such as eyebrows, nose, chin and
cheek contours, are estimated. In the second part of the framework, the
estimated facial features from the first part are used for adapting a generic 3D
face model. For the low complexity mode, the 3D face model Candide is used,
which is adapted using the eye and mouth features only. For the high complexity
mode a more detailed 3D face model is used, which is adapted by using all
estimated facial features. Experiments have been carried out evaluating the
different parts of the face model adaptation framework. The standard deviation
of the 2D estimation error is lower than 2.0 pel for the eye and mouth features
and 2.7 pel for all facial features. Tests with natural videophone sequences show
that an automatic 3D face model adaptation is possible with both complexity
modes. Using the high complexity mode, a better synthesis quality of the facial
animation is achieved, with the disadvantage of higher computational load.
Endnote
* This work has been carried out at the Institut für Theoretische
Nachrichtentechnik und Informationsverarbeitung, University of Hannover,
Germany.
References
Ekman, P. & Friesen, V. W. (1977). Facial action coding system. Palo Alto,
CA: Consulting Psychologists Press.
Essa, I. & Pentland, A. (1997). Coding, analysis, interpretation, and recognition
of facial expressions. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 19(7), 757-763.
Fischl, J., Miller, B. & Robinson, J. (1993). Parameter tracking in a muscle-
based analysis/synthesis coding system. Picture Coding Symposium
(PCS’93). Lausanne, Switzerland.
Fua, P., Plaenkers, R. & Thalman, D. (1999). From synthesis to analysis: fitting
human animation models to image data. Computer Graphics Interface,
Alberta, Canada.
Kampmann, M. (2002). Automatic 3-D face model adaptation for model-based
coding of videophone sequences. IEEE Transactions on Circuits and
Systems for Video Technology, 12(3), 172-182.
Kampmann, M. & Ostermann, J. (1997). Automatic adaptation of a face model
in a layered coder with an object-based analysis-synthesis layer and a
knowledge-based layer. Signal Processing: Image Communication,
9(3), 201-220.
Kampmann, M. & Zhang, L. (1998). Estimation of eye, eyebrow and nose
features in videophone sequences, International Workshop on Very Low
Bitrate Video Coding (VLBV 98), Urbana, 101-104.
Kuo, C., Huang, R. S. & Lin, T. G. (2002). 3-D facial model estimation from
single front-view facial image. IEEE Transactions on Circuits and
Systems for Video Technology, 12(3), 183-192.
Lee, Y. & Magnenat-Thalmann, N. (2000). Fast head modeling for animation.
Journal Image and Vision Computing, 18(4), 355-364.
Lee, Y., Terzopoulos, D. & Waters, K. (1995). Realistic modeling for facial
animation. Proceedings of ACM SIGGRAPH’95, Los Angeles, 55-62.
Musmann, H. (1995). A layered coding system for very low bit rate video coding.
Signal Processing: Image Communication, 7(4-6), 267-278.
Niem, W. (1994). Robust and fast modelling of 3D natural objects from multiple
views. Proceedings of Image and Video Processing II, San Jose, 2182,
388-397.
Reinders, M. J. T., van Beek, P. J. L., Sankur, B. & van der Lubbe, J. C. (1995).
Facial feature location and adaptation of a generic face model for model-
based coding. Signal Processing: Image Communication, 7(1), 57-74.
Rydfalk, R. (1987). CANDIDE, a parameterised face. Internal Report Lith-
ISY-I-0866, Linköping University, Linköping, Sweden.
Sarris, N., Grammalidis, N. & Strintzis, M. G. (2001). Building three-dimensional
head models. Graphical Models, 63(5), 333-368.
Sarris, N., Grammalidis, N. & Strintzis, M. G. (2002). FAP extraction using
three-dimensional motion estimation. IEEE Transactions on Circuits and
Systems for Video Technology, 12(10), 865-876.
Strub, L. & Robinson, J. (1995). Automated facial conformation for model-
based videophone coding. IEEE International Conference on Image
Processing II, Washington, DC, 587-590.
Yuille, A., Hallinan, P. & Cohen, D. (1992). Feature extraction from faces using
deformable templates. International Journal of Computer Vision, 8(2),
99-111.
Zhang, L. (1997). Tracking a face for knowledge-based coding of videophone
sequences. Signal Processing: Image Communication, 10(1-3), 93-114.
Zhang, L. (1998). Automatic adaptation of a face model using action units for
semantic coding of videophone sequences. IEEE Transactions on Cir-
cuits and Systems for Video Technology, 8(6), 781-795.
Copyright © 2004, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.
Learning 3D Face Deformation Model 317
Chapter X
Learning 3D Face
Deformation Model:
Methods and Applications
Zhen Wen
University of Illinois at Urbana Champaign, USA
Pengyu Hong
Harvard University, USA
Jilin Tu
University of Illinois at Urbana Champaign, USA
Thomas S. Huang
University of Illinois at Urbana Champaign, USA
Abstract
Introduction
A synthetic human face provides an effective solution for delivering and
visualizing information related to the human face. A realistic, talking face is
useful for many applications: visual telecommunication (Aizawa & Huang,
1995), virtual environments (Leung et al., 2000), and synthetic agents (Pandzic,
Ostermann & Millen, 1999).
One of the key issues of 3D face analysis (tracking and recognition) and
synthesis (animation) is to model both temporal and spatial facial deformation.
Traditionally, spatial face deformation is controlled by certain facial deformation
control models and the dynamics of the control models define the temporal
deformation. However, facial deformation is complex and often includes subtle
expressional variations. Furthermore, people are very sensitive to facial appear-
ance. Therefore, traditional models usually require extensive manual adjustment
for plausible animation. Recently, the advance of motion capture techniques has
sparked data-driven methods (e.g., Guenter et al., 1998). These techniques
achieve realistic animation by using real face motion data to drive 3D face
animation. However, the basic data-driven methods are inherently cumbersome
because they require a large amount of data for producing each animation.
Besides, it is difficult to use them for facial motion analysis.
More recently, machine learning techniques have been used to learn compact
and flexible face deformation models from motion capture data. The learned
models have been shown to be useful for realistic face motion synthesis and
efficient face motion analysis. In order to allow machine-learning-based ap-
proaches to address the problems of facial deformation modeling, analysis and synthesis
in a systematic way, a unified framework is needed. The unified framework
needs to address the following problems: (1) how to learn a compact model from
motion capture data for 3D face deformation; and (2) how to use the model for
robust facial motion analysis and flexible animation.
In this chapter, we present a unified machine-learning-based framework on
facial deformation modeling, facial motion analysis and synthesis. The frame-
work is illustrated in Figure 1.

Figure 1. The unified framework: Motion Units (MUs), holistic and parts-based, are learned from 3D facial motion capture data and adapted to a face model; MU-based facial motion analysis converts a face image sequence into an MUP sequence, and a trained real-time audio-to-visual mapping converts speech (or text, via text-to-speech) into MUPs that drive MU-based face animation.

In this framework, we first learn from extensive 3D facial motion capture data a compact set of Motion Units (MUs), which are
chosen as the quantitative visual representation of facial deformation. Then,
arbitrary facial deformation can be approximated by a linear combination of
MUs, weighted by coefficients called Motion Unit Parameters (MUPs). Based
on facial feature points and a Radial Basis Function (RBF) based interpolation,
the MUs can be adapted to new face geometry and different face mesh topology.
MU representation is used in both facial motion analysis and synthesis. Within
the framework, face animation is done by adjusting the MUPs. For facial motion
tracking, the linear space spanned by MUs is used to constrain low-level 2D
motion estimation. As a result, more robust tracking can be achieved. We also
utilize MUs to learn the correlation between speech and facial motion. A real-
time audio-to-visual mapping is learned using an Artificial Neural Network
(ANN) from an audio-visual database. Experimental results show that our
framework achieved natural face animation and robust non-rigid tracking.
Background
Analysis of human facial motion is the key component for many applications,
such as model-based, very low-bit-rate video coding for visual telecommunica-
tion (Aizawa & Huang, 1995), audio-visual speech recognition (Stork & Hennecke,
1996), and expression recognition (Cohen et al., 2002). Simple approaches only
utilize low-level image features (Goto, Kshirsagar & Thalmann, 2001). How-
ever, it is not robust enough to use low-level image features alone because errors
accumulate over time. High-level knowledge of facial deformation must be
used to handle the error accumulation problem by imposing constraints on the
possible deformed facial shapes. For 3D facial motion tracking, people have used
various 3D deformable model spaces, such as a 3D parametric model (DeCarlo,
1998), MPEG-4 FAP-based B-Spline surface (Eisert, Wiegand & Girod, 2000)
and FACS-based models (Tao, 1998). These models, however, are usually
manually defined and thus cannot capture the real motion characteristics of facial
features well. Therefore, some researchers have recently proposed to train
facial motion subspace models from real facial motion data (Basu, Oliver &
Pentland, 1999; Reveret & Essa, 2001).
Figure 2. (a) The generic model in iFACE; (b) A personalized face model
based on the cyberware scanner data; (c) The feature points defined on
generic model for MU adaptation.
Our framework uses “iFACE,” a face modeling and animation system developed in Hong, Wen &
Huang (2001). iFACE is illustrated in Figure 2.
We use motion capture data from Guenter et al. (1998). The database records
the 3D facial movements of talking subjects, as well as synchronous audio
tracks. The facial motion is captured at the 3D positions of the markers on the
faces of subjects. The motion capture data used 153 markers. Figure 3 shows
an example of the markers. For the purpose of better visualization, we build a
mesh based on those markers, illustrated by Figure 3 (b) and (c).
Figure 3. The markers. (a) The markers shown as small white dots; (b) and
(c) The mesh is shown in two different viewpoints.
An arbitrary deformed facial shape $\vec{s}$ is approximated as a linear combination of the MUs:

$$\vec{s} = \vec{s}_0 + \left(\sum_{i=1}^{M} c_i\,\vec{e}_i + \vec{e}_0\right) \qquad (1)$$

where $\vec{s}_0$ denotes the facial shape without deformation, $\vec{e}_0$ is the mean facial deformation, $\{\vec{e}_0, \vec{e}_1, \ldots, \vec{e}_M\}$ is the MU set, and $\{c_1, c_2, \ldots, c_M\}$ is the MU parameter (MUP) set.
PCA (Jolliffe, 1986) is applied to learn MUs from the facial deformations in the database. The mean facial deformation and the first seven eigenvectors are selected as the MUs; they correspond to the seven largest eigenvalues and capture 93.2% of the facial deformation variance. The first four MUs are visualized by an animated face model in Figure 4. The top row images are the frontal views of the faces and the bottom row images are the side views. The first face is the neutral face, corresponding to $\vec{s}_0$. The remaining faces are deformed by the first four MUs scaled by a constant (from left to right).
Figure 4. The neutral and deformed faces corresponding to the first four
MUs. The top row is the frontal view and the bottom row is the side view.
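To make the holistic MU construction concrete, the following is a minimal sketch (Python/NumPy, with synthetic stand-in data and hypothetical array shapes, not the authors' code) of learning MUs by PCA from a matrix of per-frame facial deformations and of reconstructing a deformation from its MUPs as in equation (1).

```python
import numpy as np

# Hypothetical data layout: T frames of K = 153 markers, each frame flattened
# to a 3K-vector of displacements from the neutral shape s0.
rng = np.random.default_rng(0)
deformations = rng.normal(size=(1000, 3 * 153))          # stand-in for real capture data

e0 = deformations.mean(axis=0)                            # mean facial deformation
centered = deformations - e0
U, S, Vt = np.linalg.svd(centered, full_matrices=False)   # PCA via SVD
num_mus = 7
E = Vt[:num_mus]                                          # eigenvector MUs e_1 .. e_M
explained = (S[:num_mus] ** 2).sum() / (S ** 2).sum()     # fraction of variance captured

def reconstruct(s0, c):
    """Equation (1): s = s0 + (sum_i c_i * e_i + e0)."""
    return s0 + E.T @ c + e0

def mups(deformation):
    """Project a deformation onto the MU basis to obtain its MUPs."""
    return E @ (deformation - e0)
```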
It is well known that the facial motion is localized, which makes it possible to
decompose the complex facial motion into parts. The decomposition helps
reduce the complexity of deformation modeling and improves the robustness of the analysis and the flexibility of the synthesis. The decomposition can be done
manually based on the prior knowledge of facial muscle distribution (Tao, 1998).
However, it may not be optimal for the linear model used because of the high
nonlinearity of facial motion. Parts-based learning techniques provide a way to
help design parts-based facial deformation models, which can better approxi-
mate real, local facial motion. Recently, Non-negative Matrix Factorization
(NMF) (Lee & Seung, 1999), a technique for learning localized representation
of data samples, has been shown to be able to learn basis images that resemble
parts of faces. In learning the subspace basis, NMF imposes non-negativity constraints, which is compatible with the intuitive notion of combining parts to form
a whole in a non-subtractive way.
In our framework, we present a parts-based face deformation model. In the
model, each part corresponds to a facial region where facial motion is mostly
generated by local muscles. The motion of each part is modeled by PCA. Then,
the overall facial deformation is approximated by summing up the deformation
in each part:
$$\Delta\vec{s} = \sum_{j=1}^{N} \Delta\vec{s}_j = \sum_{j=1}^{N} \left(\sum_{i=1}^{M_j} c_{ij}\,\vec{e}_{ij} + \vec{e}_{0j}\right),$$

where $\Delta\vec{s} = \vec{s} - \vec{s}_0$ is the deformation of the facial shape and $N$ is the number of parts. We call this representation parts-based MU, where the $j$-th part has its MU set $\{\vec{e}_{0j}, \vec{e}_{1j}, \ldots, \vec{e}_{M_j j}\}$ and MUP set $\{c_{1j}, c_{2j}, \ldots, c_{M_j j}\}$.
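As a minimal illustration of the parts-based model (assuming the per-part bases have already been learned as above, and using a hypothetical data layout), the reconstruction simply sums the per-part linear models:

```python
import numpy as np

def parts_based_deformation(parts, coeffs, dim):
    """Sum the per-part reconstructions: ds = sum_j (E_j^T c_j + e0_j).

    parts:  list of (E_j, e0_j, idx_j) where E_j is (M_j x d_j), e0_j is (d_j,) and
            idx_j indexes the coordinates belonging to part j (hypothetical layout).
    coeffs: list of per-part MUP vectors c_j.
    """
    ds = np.zeros(dim)
    for (E_j, e0_j, idx_j), c_j in zip(parts, coeffs):
        ds[idx_j] += E_j.T @ c_j + e0_j      # deformation contributed by part j
    return ds
```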
To decompose facial motion into parts, we propose an NMF-based method. In
this method, we randomly initialize the decomposition. Then, we use NMF to
reduce the linear decomposition error to a local minimum. We impose the non-
negativity constraint in the linear combination of the facial motion energy. Figure
5(a) shows some parts derived by NMF. Adjacent different parts are shown in
different colors that are overlayed on the face model. We then use prior
knowledge about facial muscle distribution to refine the learned parts. The parts can thus be: (1) more related to meaningful facial muscle distribution; and (2) less biased by individuality in the motion capture data and, thus, more easily generalized to different faces.

Figure 5. (a) NMF learned parts overlayed on the generic face model; (b) The facial muscle distribution; (c) The aligned facial muscle distribution; (d) The parts overlayed on muscle distribution; (e) The final parts.

We start with an image of human facial muscle,
illustrated in Figure 5(b) (Facial muscle image, 2002). Next, we align it with our
generic face model via image warping, based on facial feature points illustrated
in Figure 2(c). The aligned facial muscle image is shown in Figure 5(c). Then,
we overlay the learned parts on facial muscle distribution (Figure 5(d)) and
interactively adjust the learned parts such that different parts correspond to
different muscles. The final parts are shown in Figure 5(e).
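The NMF step can be sketched with the standard multiplicative-update rules of Lee & Seung (1999) applied to a non-negative matrix of per-marker motion energy; the data layout, iteration count and the final marker-to-part assignment rule are illustrative assumptions rather than the chapter's exact procedure.

```python
import numpy as np

def nmf(V, r, iters=500, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (frames x markers) as V ~ W @ H.
    The r rows of H group markers into localized parts."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + eps            # random initialization of the decomposition
    H = rng.random((r, m)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # multiplicative updates drive the
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # reconstruction error to a local minimum
    return W, H

# Example: cluster markers by the basis row with the largest weight.
motion_energy = np.abs(np.random.default_rng(1).normal(size=(1000, 153)))
W, H = nmf(motion_energy, r=10)
part_of_marker = H.argmax(axis=0)
```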
The learned parts-based MUs give more flexibility in local facial deformation
analysis and synthesis. Figure 6 shows some local deformation in the lower lips,
each of which is induced by one of the learned parts-based MUs. These locally
deformed shapes are difficult to approximate using holistic MUs. For each local
deformation shown in Figure 6, more than 100 holistic MUs are needed to
achieve 90% reconstruction accuracy. That means, although some local defor-
mation is induced by only one parts-based MU, more than 100 holistic MUs may
be needed in order to achieve good analysis and synthesis quality. Therefore, we
can have more flexibility in using parts-based MUs. For example, if we are only
interested in lip motion, we only need to learn parts-based MUs from lip motion
data. In face animation, people often want to animate local regions separately.
This task can be easily achieved by adjusting the MUPs of parts-based MUs
separately. In face tracking, people may use parts-based MUs to track only the
region of interest (e.g., the lips). Furthermore, tracking using parts-based
MUs is more robust because local error will not affect distant regions.
Figure 6. Three lower lip shapes deformed by three of the lower lip parts-
based MUs respectively. The top row is the frontal view and the bottom row
is the side view.
MU Adaptation
The learned MUs are based on the motion capture data of particular subjects.
To use the MUs for other people, they need to be fitted to the new face geometry.
Moreover, the MUs only sample the facial surface motion at the position of the
markers. The movements at other places need to be interpolated. In our
framework, we call this process “MU adaptation.”
Interpolation-based techniques for re-targeting animation to new models, such as
Noh & Neumann (2001), could be used for MU adaptation. Under more
principled guidelines, we design our MU adaptation as a two-step process: (1)
face geometry based MU-fitting; and (2) MU re-sampling. These two steps can
be improved in a systematic way if enough MU sets are collected. For example,
if MU statistics over a large set of different face geometries are available, one
can systematically derive the geometry-to-MU mapping using machine-learning
techniques. On the other hand, if multiple MU sets are available which sample
different positions of the same face, it is possible to combine them to increase the
spatial resolution of the MUs, because the markers in the MUs are usually sparser than the face geometry mesh.
The first step, called “MU fitting,” fits MUs to a face model with different
geometry. We assume that the corresponding positions of the two faces have the
same motion characteristics. Then, the “MU fitting” is done by moving the
markers of the learned MUs to their corresponding positions on the new face.
We interactively build the correspondence of facial feature points shown in
Figure 2(c) via a GUI. Then, warping is used to interpolate the remaining
correspondence.
The second step is to derive movements of facial surface points that are not
sampled by markers in MUs. We use the radial basis interpolation function. The
family of radial basis functions (RBF) is widely used in face animation (Guenter
et al., 1998; Marschner, Guenter & Raghupathy, 2000; Noh & Neumann, 2001). Using RBF, the displacement of a certain vertex $\vec{v}_i$ is of the form

$$\Delta\vec{v}_i = \sum_{j=1}^{N} w_{ij}\, h\!\left(\lVert \vec{v}_i - \vec{p}_j \rVert\right) \Delta\vec{p}_j \qquad (2)$$

where $\vec{p}_j$ (j = 1, ..., N) is the coordinate of a marker and $\Delta\vec{p}_j$ is its displacement, $h$ is a radial basis kernel function, and $w_{ij}$ are the weights. $h$ and $w_{ij}$ need to be
carefully designed to ensure the interpolation quality. For facial deformation, the
muscle influence region is local. Thus, we choose a cut-off region for each
vertex. We set the weights to be zero for markers that are outside of the cut-off
region, i.e., they are too far away to influence the vertex. In our current system,
the local influence region for the i-th vertex is heuristically assigned as a circle,
with the radius ri as the average of the distances to its two nearest neighbors.
Similar to Marschner, Guenter & Raghupathy (2000), we choose the radial basis kernel to be $h(x) = \left(1 + \cos(\pi x / r_i)\right)/2$, where $x = \lVert \vec{v}_i - \vec{p}_j \rVert$. We choose $w_{ij}$ to be a normalization factor such that $\sum_{j=1}^{N} w_{ij}\, h(\lVert \vec{v}_i - \vec{p}_j \rVert) = 1$. The lips and eyelids are
two special cases for this RBF interpolation, because the motions of the upper
parts of them are not correlated with the motions of the lower parts. To address
this problem, we add “upper” or “lower” tags to vertices and markers near the
mouth and eyes. Markers do not influence vertices with different tags. These
RBF weights need to be computed only once for one set of marker positions. The
weights are stored in a matrix, which is sparse because marker influence is local.
During synthesis, the movement of mesh vertices can be computed by one
multiplication of the sparse RBF matrix based on equation (2). Thus, the
interpolation is fast.
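A minimal sketch of the adaptation interpolation of equation (2), using the cosine kernel and per-vertex cut-off radius described above; the point-array shapes are assumptions, and the upper/lower tag handling is omitted for brevity.

```python
import numpy as np

def rbf_weights(vertices, markers, radii):
    """Precompute the sparse weight matrix: entries are w_ij * h_ij, with
    h(x) = (1 + cos(pi*x/r_i))/2 inside the cut-off radius r_i, zero outside,
    and w_ij chosen so that sum_j w_ij h_ij = 1 for each vertex."""
    W = np.zeros((len(vertices), len(markers)))
    for i, (v, r) in enumerate(zip(vertices, radii)):
        d = np.linalg.norm(markers - v, axis=1)
        h = np.where(d < r, 0.5 * (1.0 + np.cos(np.pi * d / r)), 0.0)
        s = h.sum()
        if s > 0:
            W[i] = h / s                      # normalized effective weights
    return W                                  # computed once per marker layout

def interpolate(W, marker_displacements):
    """Equation (2): all vertex displacements via one (sparse) matrix product."""
    return W @ marker_displacements           # (V x M) @ (M x 3) -> (V x 3)
```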
A temporal model is trained in which HMM states model key facial shapes. In order to get a smooth trajectory once a state sequence is found, we use NURBS (Non-Uniform Rational B-Spline) interpolation. The NURBS trajectory is defined as:

$$C(t) = \frac{\sum_{i=0}^{n} N_{i,p}(t)\, w_i P_i}{\sum_{i=0}^{n} N_{i,p}(t)\, w_i},$$

where $p$ is the order of the NURBS, $N_{i,p}$ is the basis function, $P_i$ is the control point of the NURBS, and $w_i$ is the weight of $P_i$. We use $p = 2$. The HMM states (key
facial shapes) are used as control points, which we assume to have Gaussian
distributions. We set the weight of each control point such that the trajectory has
a higher likelihood. Intuitively, it can be achieved in a way that states with small
variance pull the trajectory towards them, while states with larger variance allow
the trajectory to stay away from them. Therefore, we set the weights to be
$w_i = 1/\sigma(\vec{n}_i)$, where $\vec{n}_i$ is the trajectory normal vector that also passes through $P_i$, and $\sigma(\vec{n}_i)$ is the variance of the Gaussian distribution in the $\vec{n}_i$ direction. In practice, we approximate $\vec{n}_i$ by the normal vector $\vec{n}_i'$ of the line segment $P_{i-1}P_{i+1}$ (see $P_1P_3$ in Figure
7(a)). Compared to Brand (1999), the smooth trajectory obtained is less optimal
in terms of maximum likelihood. But, it is fast and robust, especially when the
number of states is small. It is also a natural extension of the traditional key-
frame-based spline interpolation scheme, which is easy to implement. Figure 7
shows a synthetic example comparing conventional NURBS and our statistically
weighted NURBS. The green dots are samples of facial shapes. The red dashed
line connects centers of the states. The blue solid line is the generated facial
deformation trajectory. In Figure 7(b), the trajectory is pulled towards the states
with smaller variance and thus has a higher likelihood than the trajectory in Figure
7(a).
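The statistically weighted NURBS trajectory can be sketched as follows (a generic Cox-de Boor evaluation with a clamped uniform knot vector; the knot choice and sampling are assumptions, not taken from the chapter):

```python
import numpy as np

def bspline_basis(i, p, t, knots):
    """Cox-de Boor recursion for the B-spline basis N_{i,p}(t)."""
    if p == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = right = 0.0
    if knots[i + p] != knots[i]:
        left = (t - knots[i]) / (knots[i + p] - knots[i]) * bspline_basis(i, p - 1, t, knots)
    if knots[i + p + 1] != knots[i + 1]:
        right = ((knots[i + p + 1] - t) / (knots[i + p + 1] - knots[i + 1])
                 * bspline_basis(i + 1, p - 1, t, knots))
    return left + right

def weighted_nurbs_trajectory(P, sigma, p=2, samples=100):
    """C(t) = sum_i N_{i,p}(t) w_i P_i / sum_i N_{i,p}(t) w_i, with w_i = 1/sigma_i.
    P: (n+1) x d control points (HMM state centers); sigma: per-state spread
    along the approximated trajectory normal -- both hypothetical inputs."""
    n = len(P) - 1
    w = 1.0 / np.asarray(sigma)
    # clamped uniform knot vector for a degree-p curve
    knots = np.concatenate([np.zeros(p), np.linspace(0, 1, n - p + 2), np.ones(p)])
    curve = []
    for t in np.linspace(0, 1 - 1e-9, samples):
        N = np.array([bspline_basis(i, p, t, knots) for i in range(n + 1)])
        curve.append((N * w) @ P / (N * w).sum())
    return np.array(curve)

# Example: five 2D state centers; small sigma pulls the curve towards a state.
P = np.array([[0, 0], [1, 2], [2, 1], [3, 3], [4, 2]], dtype=float)
curve = weighted_nurbs_trajectory(P, sigma=[0.5, 0.1, 0.8, 0.1, 0.5])
```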
We use the motion capture sequence to train the model. Thirty states are used
in the experiment. Each state is observed via MUPs, which are modeled using
a Gaussian model. We assume the covariance matrices of the states to be
diagonal in training. The centers of the trained states are used as key shapes (i.e.,
control points) for the NURBS interpolation scheme. The interpolation-based
temporal model is used in the key-frame-based face animation, such as text-
driven animation in iFACE.
For 3D non-rigid facial motion tracking, the 2D facial shape observed in the image is modeled as the projection

$$M\!\left(R\,(\vec{V}_0 + L\vec{P}) + \vec{T}\right) \qquad (3)$$

where $M$ is the projection matrix, $\vec{V}_0$ is the neutral face, $L\vec{P}$ defines the non-rigid deformation, $R$ is the 3D rotation determined by the three rotation angles $[w_x, w_y, w_z]^t = \vec{W}$, and $\vec{T}$ stands for the 3D translation. $L$ is an $N \times M$ matrix that contains $M$ AUs, each of which is an $N$-dimensional vector, and $\vec{P} = [p_1, \ldots, p_M]^t$ contains the coefficients of the AUs. To estimate the facial motion parameters $\{\vec{T}, \vec{W}, \vec{P}\}$ from the 2D inter-frame motion $d\vec{V}_{2D}$, the derivative of equation (3) is taken with respect to $\{\vec{T}, \vec{W}, \vec{P}\}$. Then, a linear equation between $d\vec{V}_{2D}$ and $\{d\vec{T}, d\vec{W}, d\vec{P}\}$ can be derived (see details in Tao (1998)). The system estimates $d\hat{V}_{2D}$ using
template-matching-based optical flow. The linear system is solved using the least
squares method. A multi-resolution framework is used for efficiency and
robustness.
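To illustrate the linearized estimation step, the following sketch solves for the parameter increments in the least-squares sense; the weak-perspective projection stand-in and the numerical Jacobian are simplifications, not the chapter's analytic derivation from Tao (1998).

```python
import numpy as np

def project(params, V0, L):
    """Simplified stand-in for equation (3): rotate, deform, translate, then drop
    the depth coordinate. params = [wx, wy, wz, tx, ty, tz, p_1..p_M]."""
    wx, wy, wz = params[:3]
    T, P = params[3:6], params[6:]
    Rx = np.array([[1, 0, 0], [0, np.cos(wx), -np.sin(wx)], [0, np.sin(wx), np.cos(wx)]])
    Ry = np.array([[np.cos(wy), 0, np.sin(wy)], [0, 1, 0], [-np.sin(wy), 0, np.cos(wy)]])
    Rz = np.array([[np.cos(wz), -np.sin(wz), 0], [np.sin(wz), np.cos(wz), 0], [0, 0, 1]])
    V = (V0.reshape(-1, 3) + (L @ P).reshape(-1, 3)) @ (Rz @ Ry @ Rx).T + T
    return V[:, :2].ravel()                    # observed 2D positions

def update(params, dV2d, V0, L, eps=1e-5):
    """Solve the linearized system J * dparams = dV2d in the least-squares sense."""
    base = project(params, V0, L)
    J = np.empty((base.size, params.size))
    for k in range(params.size):               # numerical Jacobian, one column per parameter
        step = np.zeros(params.size)
        step[k] = eps
        J[:, k] = (project(params + step, V0, L) - base) / eps
    dparams, *_ = np.linalg.lstsq(J, dV2d, rcond=None)
    return params + dparams
```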
In the original system, $L$ is manually designed using a Bezier volume and represented by the displacements of the vertices of the face surface mesh. To derive $L$ from the
learned MUs, the “MU adaptation” process described earlier is used. In the
current system, we use the holistic MUs. Parts-based MUs could be used if a
certain local region is the focus of interest, such as the lips in lip-reading. The
system is implemented to run on a 2.2 GHz Pentium 4 processor with 2GB
memory. The image size of the input video is 640 × 480. The system works at
14 Hz for non-rigid face tracking. The tracking results, i.e., the coefficients of the MUs, $R$ and $\vec{T}$, can be directly used to animate face models. Figure 8 shows some typical frames that were tracked, along with the animated face model to
visualize the results. It can be observed that compared with the neutral face (the
first column images), the mouth opening (the second column), subtle mouth
rounding and mouth protruding (the third and fourth columns) are captured in the
tracking results visualized by the animated face model. Because the motion units
are learned from real facial motion data, the facial animation synthesized using
tracking results looks more natural than that using handcrafted action units in Tao
(1998).
The tracking algorithm can be used in model-based face video coding (Tu et al.,
2003). We track and encode the face area using model-based coding. To encode
the residual in the face area and the background for which a priori knowledge
is not generally available, we use traditional waveform-based coding method
H.26L. This hybrid approach improves the robustness of the model-based
method at the expense of increasing bit-rate. Eisert, Wiegand & Girod (2000)
proposed a similar hybrid coding technique using a different model-based 3D
facial motion tracking approach. We capture and code videos of 352 × 240 at
30Hz. At the same low bit-rate (18 kbits/s), we compare this hybrid coding with
H.26L JM 4.2 reference software. Figure 9 shows three snapshots of a video
with 147 frames. The PSNR around the facial area for hybrid coding is 2dB
higher than H.26L. Moreover, the hybrid coding results have much higher visual
quality. Because our tracking system works in real-time, it could be used in a
real-time low-bit-rate video-phone application. Furthermore, the tracking results
Figure 9. (a) The synthesized face motion; (b) The reconstructed video
frame with synthesized face motion; (c) The reconstructed video frame
using H.26L codec.
can be used to extract visual features for audio-visual speech recognition and
emotion recognition (Cohen et al., 2002). In medical applications related to facial
motion disorder, such as facial paralysis, visual cues are important for both
diagnosis and treatment. Therefore, the facial motion analysis method can be
used as a diagnostic tool such as in Wachtman et al. (2001). Compared to other
3D non-rigid facial motion tracking approaches using a single camera, the
features of our tracking system include: (1) the deformation space is learned
automatically from data such that it avoids manual adjustments; (2) it is real-time
so that it can be used in real-time applications; and (3) it is able to recover from
temporary loss of tracking by incorporating a template-matching-based face
detection module.
Figure 10. Comparison of the estimated MUPs with the original MUPs. The content of the corresponding speech track is “A bird flew on lighthearted wing.”

Figure 10 plots the magnitude of the MUPs over time. The solid red trajectory is the original MUPs, and the dashed blue trajectory is the estimation results.
We reconstruct the facial deformation using the estimated MUPs. For both the
ground truth and the estimated results, we divide the deformation of each marker
by its maximum absolute displacement in the ground truth data. To evaluate the
performance, we calculate the Pearson product-moment correlation coefficients
(R) and the mean square error (MSE) using the normalized deformations. The
Pearson product-moment correlation ( 0.0 ≤ R ≤ 1.0 ) measures how good the
global match is between the shapes of two signal sequences. A large coefficient
means a good match. The Pearson product-moment correlation coefficient R
between the ground truth $\{\vec{d}_n\}$ and the estimated data $\{\vec{d}_n'\}$ is calculated by

$$R = \frac{\mathrm{tr}\!\left(E\!\left[(\vec{d}_n - \vec{\mu}_1)(\vec{d}_n' - \vec{\mu}_2)^T\right]\right)}{\sqrt{\mathrm{tr}\!\left(E\!\left[(\vec{d}_n - \vec{\mu}_1)(\vec{d}_n - \vec{\mu}_1)^T\right]\right)\,\mathrm{tr}\!\left(E\!\left[(\vec{d}_n' - \vec{\mu}_2)(\vec{d}_n' - \vec{\mu}_2)^T\right]\right)}} \qquad (4)$$

where $\vec{\mu}_1 = E[\vec{d}_n]$ and $\vec{\mu}_2 = E[\vec{d}_n']$. In our experiment, R = 0.952 and MSE =
0.0069 for training data and R = 0.946 and MSE = 0.0075 for testing data.
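A small sketch of the two evaluation measures on normalized deformation sequences (array shapes are assumptions):

```python
import numpy as np

def evaluate(gt, est):
    """gt, est: T x D arrays of normalized marker deformations.
    Returns the Pearson product-moment correlation R of equation (4) and the MSE."""
    d = gt - gt.mean(axis=0)                  # d_n - mu_1
    d_hat = est - est.mean(axis=0)            # d_n' - mu_2
    num = np.trace(d.T @ d_hat)
    den = np.sqrt(np.trace(d.T @ d) * np.trace(d_hat.T @ d_hat))
    mse = np.mean((gt - est) ** 2)
    return num / den, mse
```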
Figure 11. Typical animation frames. Temporal order: from left to right;
from top to bottom.
The whole animation procedure contains three steps. First, we extract audio
features from the input speech. Then, we use the trained neural networks to map
the audio features to the visual features (i.e., MUPs). Finally, we use the
estimated MUPs to animate a personalized 3D face model in iFACE. Figure 11
shows a typical animation sequence for the sentence in Figure 10.
Our real-time speech-driven animation can be used in real-time two-way
communication scenarios such as video-phone and virtual environments (Leung
et al., 2000). On the other hand, existing off-line speech-driven animation, e.g.,
Brand (1999), can be used in one-way communication scenarios, such as
broadcasting and advertising. Our approach deals with the mapping of both
vowels and consonants, thus it is more accurate than real-time approaches with
only vowel-mapping (Morishima & Harashima, 1991; Goto, Kshirsagar &
Thalmann, 2001). Compared to real-time approaches using only one neural
network for all audio features (Massaro et al., 1999; Lavagetto, 1995), our local
ANN mapping (i.e., one neural network for each audio feature cluster) is more
efficient because each ANN is much simpler. Therefore, it can be trained with
much less effort for a certain set of training data. More generally, speech-driven
animation can be used in speech and language education (Cole et al., 1999), as
a speech understanding aid for noisy environments and hard-of-hearing people
and as a rehabilitation tool for facial motion disorders.
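The “local” audio-to-visual mapping idea (one small network per audio-feature cluster) can be sketched as follows; the feature dimensions, cluster count, network sizes and the use of scikit-learn are illustrative assumptions, not the system's actual configuration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPRegressor

# audio_feats: T x A frame-level audio features; mups: T x M target MUPs
# (synthetic stand-ins here; real data would come from the audio-visual database).
rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(5000, 12))
mups = rng.normal(size=(5000, 7))

# Cluster the audio features, then train one small regressor per cluster
# instead of a single large network for all features.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(audio_feats)
nets = {}
for c in range(kmeans.n_clusters):
    idx = kmeans.labels_ == c
    nets[c] = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500,
                           random_state=0).fit(audio_feats[idx], mups[idx])

def audio_to_mups(frame_feats):
    """Map a batch of audio feature frames to MUPs via the per-cluster networks."""
    labels = kmeans.predict(frame_feats)
    out = np.zeros((len(frame_feats), mups.shape[1]))
    for c in np.unique(labels):
        out[labels == c] = nets[c].predict(frame_feats[labels == c])
    return out
```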
The synthetic talking face can be evaluated by human perception study. Here,
we describe our experiments which compare the influence of the synthetic
talking face on human emotion perception with that of the real face. We did
similar experiments for 2D MU-based speech-driven animation (Hong, Wen & Huang, 2002).
Table 1. Emotion inference based on visual-only or audio-only stimuli. “S” column: Synthetic face; “R” column: Real face.

                                Facial Expression
                               Smile           Sad
                              S      R       S      R
  Audio-visual    Same       15     16      16     16
  relation        Opposite    2      3      10     12
The second and third experiments are designed to compare the influence of a
synthetic face on bimodal human emotion perception and that of the real face.
In the second experiment, the subjects are asked to infer the emotional state
while observing the synthetic talking face and listening to the audio tracks. The
third experiment is the same as the second one except that the subjects observe
the real face, instead. In each of the experiments, the audio-visual stimuli are
presented in two groups. In the first group, audio content and visual information
represent the same kind of information (e.g., positive text with smiling expres-
sion). In the second group, the relationship is the opposite. The results are
combined in Table 2.
We can see the face movements and the content of the audio tracks jointly
influence the decisions of the subjects. If the audio content and the facial
expressions represent the same kind of information, the human perception of the
information is enhanced. For example, when the associated facial expression of
the positive-text-content audio track is smiling, nearly all subjects say that the
emotional state is happy (see Table 2). The numbers of the subjects who perceive
a happy emotional state are higher than those using only one stimulus alone (see
Table 1). However, it confuses human subjects if the facial expressions and the
audio tracks represent opposite information. An example is shown in the fifth and
sixth columns of Table 2. The audio content conveys positive information, while
the facial expression is sad. Ten subjects report sad emotion if the synthetic
talking face with a sad expression is shown. The number increases to 12 if the
real face is used. This difference shows that the subjects tend to trust the real
face more than the synthetic face when the visual information conflicts with the
audio information. Overall, the experiments show that our real-time, speech-
driven synthetic talking face successfully affects human emotion perception.
The effectiveness of the synthetic face is comparable with that of the real face,
even though it is slightly weaker.
Conclusions
This chapter presents a unified framework for learning compact facial deforma-
tion models from data and applying the models to facial motion analysis and
synthesis. This framework uses a 3D facial motion capture database to learn
compact holistic and parts-based facial deformation models called MUs. The
MUs are used to approximate arbitrary facial deformation. The learned models
are used in robust 3D facial motion analysis and real-time, speech-driven face
animation. The experiments demonstrate that robust non-rigid face tracking and
flexible, natural face animation can be achieved based on the learned models. In
the future, we plan to investigate systematic ways of adapting the learned models to new faces.
Acknowledgment
This work was supported in part by National Science Foundation Grants CDA 96-24386 and IIS-00-85980. We would like to thank Dr. Brian Guenter, Heung-Yeung Shum and Yong Rui of Microsoft Research for the face motion data.
References
Aizawa, K. & Huang, T. S. (1995). Model-based image coding. Proceedings of the IEEE, 83, 259-271.
Basu, S., Oliver, N. & Pentland, A. (1999). 3D modeling and tracking of human lip motions. Proceedings of the International Conference on Computer Vision, 337-343.
Brand, M. (1999). Voice puppetry. Proceedings of SIGGRAPH 1999, 21-28.
Cohen, I. et al. (2002). Facial expression recognition from video sequences. Proceedings of the IEEE International Conference on Multimedia and Expo, 121-124.
Cole, R. et al. (1999). New tools for interactive speech and language training:
Using animated conversational agents in the classrooms of profoundly deaf
children. Proceedings of ESCA/SOCRATES Workshop on Method and
Tool Innovations for Speech Science Education.
DeCarlo, D. (1998). Generation, Estimation and Tracking of Faces. Ph.D.
thesis, University of Pennsylvania.
Eisert, P., Wiegand, T. & Girod, B. (2000). Model-Aided Coding: A New
Approach to Incorporate Facial Animation into Motion-Compensated
Video Coding. IEEE Transactions on Circuits and Systems for Video
Technology, 10(3), 344-358.
Ekman, P. & Friesen, W. V. (1977). Facial action coding system. Palo Alto, CA: Consulting Psychologists Press.
Essa, I. & Pentland, A. (1997). Coding, analysis, interpretation, and recognition of facial expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 757-763.
Facial muscle image. (2002). Retrieved from the World Wide Web: http://sfghed.ucsf.edu/ClinicImages/anatomy.htm.
Goto, T., Kshirsagar, S. & Thalmann, N. M. (2001). Automatic Face Cloning and
Animation. IEEE Signal Processing Magazine, 18(3), 17-25.
Guenter, B. et al. (1998). Making faces. Proceedings of SIGGRAPH 1998, 55-66.
Hong, P., Wen, Z. & Huang, T. S. (2001). iFACE: A 3D synthetic talking face.
International Journal of Image and Graphics, 1(1), 19-26.
Hong, P., Wen, Z. & Huang, T. S. (2002). Real-time speech driven expressive
synthetic talking faces using neural networks. IEEE Transactions on
Neural Networks, 13(4), 916-927
Jolliffe, I. T. (1986). Principal Component Analysis. Springer-Verlag.
Kshirsagar, S., Molet, T. & Thalmann, N. M. (2001). Principal Components of
Expressive Speech Animation. Proceedings of Computer Graphics Inter-
national, 38-44.
Lavagetto, F. (1995). Converting speech into lip movements: A multimedia
telephone for hard of hearing people. IEEE Transactions on Rehabilita-
tion Engineering, 90-102.
Lee, D. D. & Seung, H. S. (1999). Learning the parts of objects by non-negative
matrix factorization. Nature, 401, 788-791.
Leung, W. H. et al. (2000). Networked Intelligent Collaborative Environment
(NetICE). Proceedings of the IEEE International Conference on Multime-
dia and Expo, 55-62.
Marschner, S. R., Guenter, B. & Raghupathy, S. (2000). Modeling and Render-
ing for Realistic Facial Animation. Proceedings of Workshop on Render-
ing, 231-242.
Massaro, D. W. (1998). Perceiving talking faces: From speech perception
to a behavioral principle. Cambridge, MA: MIT Press.
Massaro, D. W. et al. (1999). Picture my voice: audio to visual speech synthesis
using artificial neural networks. Proceedings of Audio-Visual Speech
Processing, 133-138.
Morishima, S. & Harashima, H. (1991). A media conversion from speech to
facial image for intelligent man-machine interface. IEEE Journal on
Selected Areas in Communications, 9(4), 594-599.
Morishima, S., Ishikawa, T. & Terzopoulos, D. (1998). Facial muscle parameter
decision from 2D frontal image. Proceedings of the International
Conference on Pattern Recognition, 160-162.
Chapter XI

Synthesis and Analysis Techniques for the Human Body
Costas T. Davarakis
Systema Technologies SA, Greece
Dimitrios Tzovaras
Informatics and Telematics Institute, Greece
Abstract
This chapter presents synthesis and analysis techniques for the human body reported by R&D
projects worldwide. Technical details are provided for each R&D project
and the results are discussed and evaluated.
Introduction
Humans are the most commonly seen moving objects in one’s daily life. The
ability to model and to recognize humans and their activities by vision is key for
a machine to interact intelligently and effortlessly with a human-inhabited
environment. Because of many potentially important applications, examining
human body behavior is currently one of the most active application domains in
computer vision. This survey identifies a number of promising applications and
provides an overview of recent developments in this domain (Hillis, 2002).
Hand and body modeling and animation is still an open issue in the computer
vision area. Various approaches to estimate hand gestures and body posture or
motion from video images have been previously proposed (Rehg & Kanade,
1994; Lien & Huang, 1998; Zaharia, Preda & Preteux, 1999). Most of these
techniques rely on 2-D or 3-D models (Saito, Watanabe & Ozawa, 1999; Tian,
Kanade & Cohn, 2000; Gavrila & Davis, 1996; Wren, Azarbayejani, Darrell &
Pentland, 1997) to compactly describe the degrees of freedom of hand and body
motion that has to be estimated. Most techniques use as input an intensity/color
image provided by a camera and rely on the detection of skin color to detect
useful features and to identify each body part in the image (Wren, Azarbayejani,
Darrell & Pentland, 1997). In addition, the issue of hand and body modeling and
animation has been addressed by the Synthetic/Natural Hybrid Coding (SNHC)
subgroup of the MPEG-4 standardization group, which is described in more detail in the following.
In Sullivan & Carlsson (2002), view-based activity recognition serves as an input
to a human body location tracker with the ultimate goal of 3D reanimation. The
authors demonstrate that specific human actions can be detected from single
frame postures in a video sequence. By recognizing the image of a person’s
posture as corresponding to a particular key frame from a set of stored key
frames, it is possible to map body locations from the key frames to actual frames
using a shape-matching algorithm. The algorithm is based on qualitative similarity
that computes point-to-point correspondence between shapes, together with
information about appearance.
In Sidenbladh, Black & Sigal (2002), a probabilistic approach is proposed to
address the problem of 3D human motion modeling for synthesis and tracking.
High dimensionality and non-linearity of human body movement modeling is
reconstruction of the human body and animation using image sequence process-
ing and graphical modeling. In most cases, high reconstruction accuracy is pursued in order to map the analysis of the real person onto the virtual human (humanoid) in terms of anthropometric characteristics. The latter applications
include:
Standards
The main tool introduced for the description of 3D “worlds” is the Virtual Reality
Modeling Language (VRML). Technically speaking, VRML is neither virtual
reality, nor a modeling language. Virtual reality typically implies an immersive 3D
experience (such as the one provided by a head-mounted display) and various 3D
input devices (such as digital gloves). VRML neither requires nor precludes
immersion. Furthermore, a true modeling language would contain much richer
geometric modeling primitives and mechanisms. VRML provides a bare mini-
mum of geometric modeling features and contains numerous features far beyond
the scope of a modeling language (Carrey & Bell, 1997). VRML was designed
to create a more “friendly” environment for the World Wide Web. It provides the
technology that integrates three dimensions, two dimensions, text and multimedia
into a coherent model. When these media types are combined with scripting
languages and Internet capabilities, an entirely new genre of interactive applica-
tions becomes possible (Carrey & Bell, 1997).
X3D (X3D Task Group) is the next-generation open standard for 3D on the web.
It is the result of several years of development by the Web 3D Consortium’s X3D
Task Group and the recently-formed Browser Working group. The needs that
the standard meets are:
interactive multimedia content to any platform over any network. Based on the
Virtual Reality Modeling Language (VRML) standard developed by the Web3D
Consortium, MPEG-4 has been under development since 1993 and today is ready
for use. The first generation of MPEG-4 content servers and authoring tools are
now available. Advances in the MPEG-4 standard still continue, particularly in
the area of 3D data processing, offering a unique opportunity to generate new
revenue streams by way of MPEG-4 and related MPEG standardization activities.
The Animation Framework eXtension (AFX) (MPEG-4 AFX), for example, is a
joint Web3D-MPEG effort that will define new 3D capabilities for the next
version of the MPEG-4 standard. Similarly, the MPEG group has recently
initiated an effort to develop standards for Multi-User capabilities in MPEG-4
(MPEG-4, requirements for Multi-user worlds).
The issue of hand and body modeling and animation has been addressed by the
Synthetic/Natural Hybrid Coding (SNHC) subgroup of the MPEG-4 standard-
ization group. More specifically, 296 Body Animation Parameters (BAPs) are
defined by MPEG-4 SNHC to describe almost any possible body posture, 28 of
which describe movements of the arm and hand. Most BAPs represent angles
of rotation around body joints. Due to the fact that the number of parameters is
very large, accurate estimation of these parameters from luminance or color
images is a very difficult task. However, if depth images from a calibrated
camera system are available, this problem is significantly simplified.
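As a rough illustration of how joint-rotation parameters of this kind drive a posture, consider the generic kinematic-chain sketch below; the joint names, single rotation axis and bone lengths are illustrative assumptions and not the MPEG-4 BAP definitions.

```python
import numpy as np

def rot_z(angle):
    """Rotation about the z-axis; real body parameters use joint-specific axes."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def forward_kinematics(bone_lengths, joint_angles):
    """Accumulate rotations along a simple planar arm chain (shoulder->elbow->wrist)
    to obtain joint positions from joint-angle parameters."""
    position = np.zeros(3)
    orientation = np.eye(3)
    joints = [position.copy()]
    for length, angle in zip(bone_lengths, joint_angles):
        orientation = orientation @ rot_z(angle)              # rotation around the joint
        position = position + orientation @ np.array([length, 0.0, 0.0])
        joints.append(position.copy())
    return np.array(joints)

# Example: upper arm, forearm and hand with three joint angles (in radians).
print(forward_kinematics([0.30, 0.25, 0.08], np.radians([40.0, 30.0, -10.0])))
```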
MPEG-4 originally focused on video and FBA (Face and Body Animation)
coding. The MPEG-4 FBA framework is limited to human-like virtual character
animation. Recently, the FBA specifications have been extended to the so-called
Bone-Based Animation (BBA) (Sévenier, 2002) specifications in order to
animate any articulated virtual character (Preda & Preteux, 2002).
recognition of human motion and its reconstruction and analysis for different
applications. Nevertheless, the current limitations of these tools restrict the range of application of this technology to off-line analysis.
The principal technologies used today are optical, electromagnetic, and electro-
mechanical human tracking systems. Table 1 summarizes the advantages and
disadvantages of the different human tracking systems.
These three fields, i.e., modeling, animation and transmission, represent the main
research areas in human body analysis/synthesis and are also the main focus of
the majority of R&D projects which are, of course, more application-oriented.
The images used in the following sections to illustrate the concepts and the
results of the presented projects are copyrighted by the corresponding projects.
VAKHUM
research. All of these tasks use anatomical and kinematics data, but they all
encounter the same problem: no data reflecting the high percentage of morpho-
logical variations in the human species are easily available. Frequently, only
normalized models are produced. Hence, the real relationships between the
morphology and the kinematics of a specific subject cannot be foreseen with high
accuracy. VAKHUM’s main goal is to develop a database to allow interactive
access to a broad range of data of a type not currently available, and to use this
to create tutorials on functional anatomy. The data will be made available to
industry, education and research. A source of high-quality data of both morpho-
logical and kinematics models of human joints will be created. The applied
techniques will allow data to be obtained that is of potential interest in related
fields across industry, medical education and research.
Figure 1. 3D bone models of the iliac bone. Left: Surface models using tiling
techniques; Right: Finite elements model.
Figure 2. Joint kinematics. Left: Hip joint during a motion of flexion; Right:
Knee joint flexion. Helical axes of motion are displayed, as well.
Kinematics is the study of motion. As part of the VAKHUM project, the motion
of the human lower limb will be studied during several normal activities (walking,
running, stair climbing). Several techniques can be used to study a motion, each
of them having its own advantages and disadvantages. This data, associated with
medical imaging, can bring forth new information on human kinematics (Figure
2). Unfortunately, electrogoniometry is difficult to use to study full-limb motion.
Other systems like motion-capture devices using stereophotogrammetry (e.g.,
video cameras) allow us to study the relative angular displacement of the joints
of a particular limb by tracking skin markers attached to a volunteer or patient
during some activities (Figure 3).
VAKHUM deals with combining electrogoniometry and stereophotogrammetry
to animate 3D models collected from medical imaging. This technique allows not
only a combination of different data sources, but also a comparison of results
obtained from different protocols, which currently poses an accuracy problem in
e-Tailor
This project is being developed by the HITLab at MIT. The purpose of this
project is to develop an Expert System and Natural Language Parsing module to
parse emotive expressions from textual input. The information will then be used
to set the graphical appearance of avatars in order to reduce the need to switch
between messages.
Technical Approach
The Expert System parses the text to get the emotions the user desires to portray,
taking into account cues present in text, such as: types of words used, contextual
information, length of phrases typed, use of emoticons, etc.
An agent entity displays the emotions of the person who is typing the text input and propagates these emotions to a specific recipient.
The Expert System is used to characterize the agent and his/her emotional states.
Facts about emotional effects must be specified in terms of simple data
structures or unambiguous natural language (NL) statements (e.g., if one is
irritated, then further irritation can make the person angry) (Figure 5).
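A toy sketch of such a rule base (current state plus stimulus yields a new emotional state), built only from the example given in the text; the data structure and rule set are hypothetical.

```python
# Hypothetical emotional-state transition rules of the kind an expert system could
# encode, e.g., "if one is irritated, further irritation can make the person angry".
RULES = {
    ("neutral", "irritation"): "irritated",
    ("irritated", "irritation"): "angry",
}

def update_emotion(state, stimulus):
    """Return the new emotional state, or keep the current one if no rule fires."""
    return RULES.get((state, stimulus), state)

state = "neutral"
for cue in ["irritation", "irritation"]:      # cues parsed from the text input
    state = update_emotion(state, cue)
print(state)                                   # -> "angry"
```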
FashionMe
are needed. For example, an avatar must have certain prerequisites in order to
meet these requirements. The three essential components are:
The term eGarments denotes digital, 3D models of real pieces of clothing. Most
online product catalogs only consist of 2D pictures. The easiest way to produce
eGarments is to generate data from a CAD program, which is used for the design
and the cutting construction of the clothing. The state of the art in cutting construction is, however, in most cases only two-dimensional. In order to generate a 3D volume model of the garment, these faces must be sewn together virtually and
then transferred into a 3D grid model. eGarments are produced in FashionMe in
a multistage process. A real dummy is equipped with the specific garment and
is then scanned in 3D. The basis for this method is that the naked dummy was
scanned first, so that the system knows the measurements of the dummy.
In a second step, the dummy is scanned wearing the garment. By subtracting the
known rough model from the dressed model, the necessary geometric data is
computed. The eGarment consists of the offset of the dummy, the garment’s
surface, and the graphical information that is mapped onto the surface and used as a texture.
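The subtraction idea can be sketched roughly as follows (computing, for each point of the dressed scan, its offset from the nearest point of the previously scanned naked dummy; the point-cloud representation and brute-force nearest-neighbour search are simplifying assumptions).

```python
import numpy as np

def garment_offsets(dummy_points, dressed_points):
    """For each point of the dressed scan, find the closest point of the naked
    dummy scan and return the offset vectors (the garment's geometry)."""
    offsets = np.empty_like(dressed_points)
    for i, q in enumerate(dressed_points):
        d2 = np.sum((dummy_points - q) ** 2, axis=1)   # squared distances to all dummy points
        offsets[i] = q - dummy_points[np.argmin(d2)]
    return offsets
```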
The focus of the scanning technology used in FashionMe (provided by the
AvatarMe company) basically lies on the simple and fast generation of Internet-
enabled, personalized avatars that have a realistic appearance, and not so much
on an exact rendering of the actual measurements of a person. The scanner is
accommodated in a booth that can also be set up in public places (Figure 6). The
scanning process comprises digitally photographing the model from four differ-
ent perspectives. Based on these digital views, the avatar is computed by means
of existing rough models. In order to be able to move the model realistically, the
avatar is assigned a skeleton. The assignment of all necessary points in the avatar
for the individual joints is realized using predefined avatars. After size, weight,
age and sex of the scanned person have been recorded, the most appropriate
avatar is automatically selected from about 60 different pre-defined body
models. Then, the avatar is personalized using texture mapping. This procedure
provides a first skeleton, which can be further refined manually, in order to equip
fingers with knuckles, for example.
The avatar produced using this procedure does not really comply with the exact
body measures of the respective person. The quality, however, is sufficient to
walk on the virtual catwalk and get a realistic impression of how the garment
would actually look on one’s body. The mesh of such an avatar consists of about
3,800 nodes. This figure can be considered a compromise between the desired
accuracy and the minimal amount of data. The procedure is optimized for real-
time web animation.
Moving the virtual skeleton of the user is referred to in FashionMe as “the virtual catwalk.” The virtual catwalk animates the avatar according to defined, sex-specific
motion sequences. The motion sequences contain detailed, time-dependent
descriptions of motions for every joint. These motion sequences are either
generated artificially by means of a 3D editor, or digitized by motion capturing
real movement of the users. Motion capturing (e.g., the technology of Vicon)
(Vicon, 2001) records and digitizes even complex movements of a human model
with all their irregularities in order to provide the highest possible realistic
impression.
INTERFACE
Emotional speech analysis: The aim of this task is to define a limited set of
parameters that would characterize the broad emotion classes. To achieve
that, a number of tools for low and high-level feature extraction and analysis
were developed. Calculation of low-level features included phoneme
segmentation, pitch extraction, energy measure and calculation of pitch and
energy derivatives. The high-level features were grouped into five groups:
The first group consists of high-level features that are extracted from pitch.
The second group consists of features that are extracted from energy. The
third group consists of features that are calculated from phoneme segmen-
tation. The fourth group consists of features that are calculated from
features calculated from pitch derivative. The fifth group consists of high-
level features that are calculated from energy derivatives. A set of 26
different high-level features were analyzed, in order to define a set of high-
level features that differentiated various emotions the most. All low and
high-level features that did not show any capability to differentiate the emotions were excluded from the set (a sketch of this low-level feature extraction is given below).
Emotional video analysis: The research that is carried out within the project
is focused on expression recognition techniques that are consistent with the
MPEG-4 standardized parameters for facial definition and animation, FDP
and FAP. To complement the expression recognition based on low-level
parameters, additional techniques that extract the expression from the
video sequence have also been developed. These techniques also need the
location of the face and a rough estimation of the main feature points of the
face. For this aim, there is an activity oriented to Low Level Facial
Parameter extraction and another activity aimed at High Level Facial
Parameter extraction or expression recognition. Another group of tools has
been developed for the extraction of cues important for dialogue systems,
such as gaze, specific movements like nods and shakes for approval or
disapproval, attention or lack of attention.
Emotional facial animation: The first activity carried out was related to the
enhancement of the facial expressions synthesis. With the help of an artist,
a database of high-level expressions has been developed by mixing low-
level expressions. A second activity was related to the development of a
facial animation engine based on Facial Animation Tables (FAT), in order
to accurately reproduce facial animations. For each Facial Animation
Parameter (FAP), we need the FAT to control the displacements of the
vertices corresponding to the Facial Description Parameter (FDP). The
FAT can be defined with an animation engine or by a designer. The
advantages of the FAT-based animation are that the expected deforma-
tions due to the animation are exactly mastered, the FAT can be defined by
a designer and, thus, the animation can be balanced between realistic and
cartoon-like behavior. The drawback is that each FAT is unique to a given
face and it is required to build one FAT for each face (even if the topology
of the mesh is the same), whereas an MPEG-4 animation engine can work
with any face mesh and FDP data.
Emotional body animation: The server calculates the emotion from all man-to-machine input (text, speech and facial expression), associates a gesture with this emotion and translates it into the corresponding Body Animation Parameters (BAP) file, which is finally sent over the Internet.
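A minimal sketch of this server-side step, mapping an inferred emotion label to a predefined gesture and serializing it as a sequence of BAP-style keyframes, follows. The gesture dictionary, parameter names and the plain-text output format are illustrative assumptions, not the project's actual BAP encoder.

```python
# Illustrative emotion-to-gesture lookup: each gesture is a list of keyframes,
# and each keyframe maps a few BAP-like joint parameters to amplitudes.
GESTURES = {
    "joy":     [{"l_shoulder_flexion": 200, "r_shoulder_flexion": 200},
                {"l_shoulder_flexion": 400, "r_shoulder_flexion": 400}],
    "sadness": [{"skull_tilt": -150}, {"skull_tilt": -300}],
}

def emotion_to_bap_text(emotion, frame_rate=25):
    """Serialize the gesture associated with an emotion as simple text frames
    (one line per frame), ready to be sent to the client over the network."""
    lines = [f"# gesture for emotion: {emotion}, {frame_rate} fps"]
    for i, keyframe in enumerate(GESTURES.get(emotion, [])):
        params = " ".join(f"{name}={value}" for name, value in keyframe.items())
        lines.append(f"frame {i}: {params}")
    return "\n".join(lines)

print(emotion_to_bap_text("joy"))
```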
STAR
The Service and Training through Augmented Reality project (IST STAR) is a collaboration between research institutes and companies from Europe and the USA. STAR focuses on the development of Mixed Reality techniques for training, documentation and planning purposes.
To achieve these goals, the following components were developed (Figure 7):
The worker is equipped with a camera and a wireless connection to the local computer network. The
camera captures the workspace in which the worker is operating and the video
from the camera is transmitted over the network to an expert. The 3D position
of the camera with respect to the scene is tracked automatically by the system,
using feature-matching techniques. The expert can augment the video with a
variety of relevant information (text, 3D graphics, etc.) and send the augmented view back to the worker over the network. The worker then uses this information to decide which steps to perform next.
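The pose-tracking step described above can be sketched with standard feature matching and a PnP solve, for instance with OpenCV. The sketch assumes a previously prepared reference model (ORB descriptors with associated 3D scene coordinates) and calibrated intrinsics; it only illustrates the idea and is not the STAR implementation.

```python
import numpy as np
import cv2

def track_camera_pose(frame_gray, ref_descriptors, ref_points_3d, camera_matrix):
    """Estimate the camera pose from feature matches against a prepared reference model.

    ref_descriptors: ORB descriptors of reference keypoints (M x 32, uint8)
    ref_points_3d:   corresponding 3D scene coordinates (M x 3, float32)
    camera_matrix:   3 x 3 intrinsic matrix from prior calibration
    Returns (rvec, tvec) or None if the pose cannot be estimated.
    """
    orb = cv2.ORB_create(nfeatures=1000)
    keypoints, descriptors = orb.detectAndCompute(frame_gray, None)
    if descriptors is None or len(keypoints) < 6:
        return None

    # Match live-frame descriptors against the reference descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(descriptors, ref_descriptors)
    if len(matches) < 6:
        return None

    image_points = np.float32([keypoints[m.queryIdx].pt for m in matches])
    object_points = np.float32([ref_points_3d[m.trainIdx] for m in matches])

    # Robust pose estimation from the 2D-3D correspondences.
    ok, rvec, tvec, _ = cv2.solvePnPRansac(object_points, image_points,
                                           camera_matrix, None)
    return (rvec, tvec) if ok else None
```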
BLUE-C Project
video cameras. The project started in April 2000 and its first phase is expected
to be completed by Spring 2003.
Mastering rapidly changing computing and communication resources is an
essential key to personal and professional success in a global information society.
The main challenge consists not only in accessing data, but rather in extracting
relevant information and combining it into new structures. The efficient and
collaborative deployment of applications becomes increasingly important as we
find more complex and interactive tools at our disposal. Today’s technology
enables information exchange and simple communication. However, it often fails
in the promising field of computer-enhanced collaboration in virtual reality environments. Some improvements have been made by coming-of-age virtual reality
systems that offer a variety of instrumental tools for stand-alone visual analysis.
Nevertheless, the crucial interaction between humans and virtual objects is
mostly neglected. Therefore, successful models of truly computer supported
collaborative work are still rare.
The blue-c project aims at investigating a new generation of virtual design,
modeling, and collaboration environments. 3D human representations are inte-
grated in real-time into networked virtual environments. The use of large screens
and cutting-edge projection technology creates the impression of total immer-
sion. Thus, unprecedented interaction and collaboration techniques among
humans and virtual models will become feasible.
The blue-c system foresees the simultaneous acquisition of live video streams
and the projection of virtual reality scenes. Color representations with depth
information of the users are generated using real-time image analysis. The
computer-generated graphics will be projected onto wall-sized screens sur-
rounding the user, allowing him to completely enter the virtual world. Multiple
blue-c portals, connected by high-speed networks, will allow remotely located
users to meet, communicate, and collaborate in the same virtual space. The blue-c
system includes:
ATTEST
During the introduction period, 2D and 3D-TV sets will co-exist. ATTEST will, therefore, develop coding schemes within the current MPEG-2 broadcast standards that allow the transmission of depth information in an enhancement layer, while providing full compatibility with existing 2D decoders. Perceptual quality will first be assessed with a software prototype; later, a hardware real-time decoder prototype will be developed.
At present, a suitable glasses-free 3D-TV display that enables free positioning
of the viewer is not available. Also, there is no suitable display for single users
(3D-TV on PC), or for use in a typical living room environment. ATTEST will
develop two 3D displays (single and multiple user) that allow free positioning
within an opening angle of 60 degrees. Both are based on head tracking and
project the appropriate views into the viewer’s eyes.
ATTEST will deliver a 3D-TV application running on a demonstrator platform,
with an end-to-end DVB delivery system. The 3D content will either be recorded
with the ATTEST 3D camera, or will be converted from 2D video footage using
the ATTEST 2D-to-3D conversion tools. ATTEST will build a real-time MPEG-
2 base and 3D enhancement layer decoder and demonstrate optimized 3D video
rendering on the ATTEST single and multi-user 3D displays.
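At the receiver side, the 2D base layer plus the depth enhancement layer allow virtual left and right views to be synthesized by shifting pixels according to a depth-dependent disparity (depth-image-based rendering). The sketch below illustrates only this basic warping step; the parallax range is arbitrary and hole filling, which a real renderer would need, is omitted.

```python
import numpy as np

def synthesize_view(image, depth, max_disparity=16, direction=+1):
    """Shift pixels horizontally by a depth-dependent disparity to create a virtual view.

    image: H x W x 3 uint8 colour frame (the 2D base layer)
    depth: H x W float array in [0, 1], 1 = nearest (the enhancement layer)
    direction: +1 for a right-eye view, -1 for a left-eye view
    Disoccluded pixels are left black here; a real renderer would inpaint them.
    """
    h, w = depth.shape
    disparity = (direction * max_disparity * depth).astype(int)
    view = np.zeros_like(image)
    cols = np.arange(w)
    for y in range(h):
        target = cols + disparity[y]
        valid = (target >= 0) & (target < w)
        view[y, target[valid]] = image[y, cols[valid]]
    return view

# Example: a flat grey frame with a near square in the centre.
img = np.full((120, 160, 3), 128, dtype=np.uint8)
dep = np.zeros((120, 160))
dep[40:80, 60:100] = 1.0
right_eye = synthesize_view(img, dep, direction=+1)
```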
VISICAST
Figure 8. TESSA, the TExt and Sign Support Assistant.
Figure 9. Signs being motion captured.
ShopLab
In order to clearly indicate the potential of the project, we will present in detail
the functionality of the Virtual Mirror scenario, which is part of the ShopLab
toolbox. In the case of small clothing shops, lack of space is often a factor that
detracts from the shoppers' enjoyment. When trying on different outfits in a small cubicle, one cannot see how the clothes will look in a natural environment. Certain views of the clothes are difficult to obtain; for example, one cannot easily see how trousers fit at the back. Moreover, some types of clothes, for example a snowboarding garment, are particularly difficult to evaluate in the shop environment. The following ShopLab installation overcomes these problems. On one wall of the changing room an ordinary mirror allows traditional evaluation, while on an adjacent wall a flat-screen panel displays the customer in their new clothes in an appropriate environment (Figure 11). Customers can thus easily see themselves from behind or from the side. The displayed artificial environment and lighting conditions make it easier to evaluate how the clothes would appear in a “real world” situation (Figure 12).
The ShopLab platform is mainly being developed using the ARToolKit libraries (ARToolKit Library). ARToolKit is an open-source vision-tracking library that enables the easy development of a wide range of Augmented Reality applications. The library was conceived and implemented by Professor Hirokazu Kato and Dr. Mark Billinghurst. The 3D human models will be developed based on the H-Anim standard. A portable 3D body scanner will be used for the modeling.
Figure 11. The concept of the Virtual Mirror.
Figure 12. The Virtual Mirror in action.
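The vision-tracking step that anchors the rendered garment to the customer amounts to detecting a fiducial marker and estimating its pose. Since ARToolKit itself is a C library, the Python sketch below only illustrates the pose-estimation half using OpenCV's solvePnP, assuming the marker's four corners have already been detected (e.g., by ARToolKit); the corner values, camera intrinsics and marker size in the example are made up, and the actual garment rendering and compositing are left out.

```python
import numpy as np
import cv2

def marker_pose(corner_pixels, camera_matrix, dist_coeffs, marker_length=0.10):
    """Estimate the 3D pose of a square fiducial marker from its four detected corners.

    corner_pixels: 4 x 2 array of corner positions in the image, ordered
                   top-left, top-right, bottom-right, bottom-left (detection is
                   assumed to be done by a library such as ARToolKit).
    marker_length: physical side length of the printed marker, in metres.
    Returns (rvec, tvec): the marker's rotation and translation w.r.t. the camera,
    which is what anchors the rendered garment in the Virtual Mirror view.
    """
    half = marker_length / 2.0
    # 3D corner coordinates in the marker's own coordinate system (z = 0 plane).
    object_points = np.float32([[-half,  half, 0.0],
                                [ half,  half, 0.0],
                                [ half, -half, 0.0],
                                [-half, -half, 0.0]])
    ok, rvec, tvec = cv2.solvePnP(object_points, np.float32(corner_pixels),
                                  camera_matrix, dist_coeffs)
    return (rvec, tvec) if ok else None

# Example with made-up corner detections and a simple pinhole camera:
K = np.float32([[800, 0, 320], [0, 800, 240], [0, 0, 1]])
corners = np.float32([[300, 220], [340, 220], [340, 260], [300, 260]])
print(marker_pose(corners, K, np.zeros(5)))
```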
IMUTUS
The introduction of virtual reality will allow experimentation with new pedagogical mechanisms for teaching music and, more specifically, for correcting hand positioning. Unlike simple videotapes, virtual reality makes it possible for the pupil to re-execute specific passages that are not included in the video but can be produced directly from the music score or from MIDI, which acts as a high-level coding of the finger movements. This high-level coding allows the pupil database to be updated over the Internet; it would be impossible to send MPEG videos showing the same gestures.
Accurate representation and modeling of the virtual hands can also be achieved using virtual reality gloves (e.g., the CyberGlove from Immersion Corporation) (Immersion Corp., 2002). The CyberGlove is a low-profile, lightweight glove with flexible sensors that measure the position and movement of the fingers and
wrist. It is based on a resistive, bend-sensing technology and its sensors are thin
and flexible enough to produce almost undetectable resistance to bending. The
device can provide accurate measurements for a wide range of hand sizes.
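Before such readings can drive a virtual hand, the raw resistive bend values have to be calibrated and mapped to joint angles. The sketch below uses a simple two-pose (flat hand / fist) linear calibration; the sensor counts, pose angles and sensor layout are illustrative assumptions, not Immersion's actual calibration procedure.

```python
import numpy as np

def calibrate_linear(raw_flat, raw_fist, angle_flat=0.0, angle_fist=90.0):
    """Per-sensor linear map from raw bend readings to joint angles in degrees.

    raw_flat, raw_fist: raw sensor vectors recorded in a flat-hand and a fist pose.
    Returns (gain, offset) arrays such that angle = gain * raw + offset.
    """
    raw_flat, raw_fist = np.asarray(raw_flat, float), np.asarray(raw_fist, float)
    gain = (angle_fist - angle_flat) / (raw_fist - raw_flat)
    offset = angle_flat - gain * raw_flat
    return gain, offset

def raw_to_joint_angles(raw, gain, offset):
    """Convert one frame of raw glove readings into joint angles for the virtual hand."""
    return gain * np.asarray(raw, float) + offset

# Example with four hypothetical finger-flexion sensors:
gain, offset = calibrate_linear(raw_flat=[30, 28, 35, 32], raw_fist=[210, 220, 205, 215])
print(raw_to_joint_angles([120, 124, 118, 122], gain, offset))
```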
Concepts (Figure 14) such as posture, force and repetition are conveyed by 3D animation and video effects (Figure 15).
Conclusions
This chapter presents recent results of synthesis and analysis techniques for the
human body reported by R&D projects worldwide. The examined technological
area has produced impressive research results, which have emerged as success-
ful consumer applications, especially in the media, industrial and educational
markets.
During the last decade, a very large number of articles have been published in
the research area of human body modeling. However, computer vision-based
human body analysis/synthesis techniques are not mature enough to be used for
industrial applications and, thus, semi-automatic systems are usually adopted.
Many issues are still open, including unconstrained image segmentation, real-
time motion tracking, personalization of human models, modeling of multiple
person environments and computationally efficient real-time applications. Each of these topics represents a stand-alone problem whose solution is of interest not only to human body modeling research, but also to other research fields.
Reducing processing time in order to support real-time applications is one of the major issues in human body modeling and animation. This reduction depends both on the performance of current computer systems (CPU and memory capabilities), which is constantly improving, and on the computational complexity of the technical approaches used (matching/tracking algorithms, etc.), which is not expected to decrease significantly. Thus, it is expected that in the near future algorithms that are nowadays computationally prohibitive will become feasible, giving rise to new human modeling and animation applications.
Realistic human body modeling and animation is considered essential in virtual
reality applications, as well as in remote collaboration in virtual environments.
Most R&D projects reviewed in this chapter are moving in this direction and their
results are considered to be very interesting and promising. Virtual Reality
applications today lack real-time human participation. Most applications have been demonstrated as walk-through experiences for the virtual exploration of spatial data (virtual prototypes, architecture, cultural heritage, etc.) or as user-interactive direct manipulation of data (training systems, education systems, etc.). In digital storytelling especially, applications today are limited to predefined human participation, such as animated 3D cartoons and/or the integration of pre-recorded video.
References
ARToolKit Library. Retrieved from the World Wide Web: http://mtd.fh-
hagenberg.at/depot/graphics/artoolkit/.
Blue-C Project. Retrieved from the World Wide Web: http://blue-c.ethz.ch/.
Bobick, A. F. & Davis, J. W. (2001). The recognition of human movement using
temporal templates. IEEE Trans. on PAMI, 23(3), 257-267.
Carey, R. & Bell, G. (1997). The Annotated VRML 2.0 Reference Manual. 1st edition. Addison-Wesley.
Gavrila, D. M. & Davis, L. S. (1996). 3-D Model-Based Tracking of Humans in Action: A Multi-View Approach. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA.
Hillis, D. (2002, July). The Power to Shape the World. ACM.
Humanoid Animation Working Group. Retrieved from the World Wide Web: http://h-anim.org/.
Immersion Corp. (2002). Retrieved from the World Wide Web: http://www.immersion.com/.
Intelligent Conversational Avatar Project. Retrieved from the World Wide Web:
http://www.hitl.washington.edu/research/multimodal/avatar.html.
Perales, F. J. & Torres, J. (1994). A system for human motion matching between synthetic and real images based on a biomechanical graphical model. IEEE Computer Society Workshop on Motion of Non-Rigid and Articulated Objects, Austin, TX.
Plänkers, R. & Fua, P. (2001). Articulated soft objects for video-based body
modelling. IEEE International Conference on Computer Vision.
Vancouver, Canada.
Preda, M. & Preteux, F. (2002). Advanced animation framework for virtual
character within the MPEG-4 standard. Proc. IEEE International Con-
ference on Image Processing (ICIP’2002), Rochester, NY.
Preda, M. & Preteux, F. (2002). Critic review on MPEG-4 Face and Body
Animation. Proc. IEEE International Conference on Image Processing
(ICIP’2002), Rochester, NY.
Rehg, J. M. & Kanade, T. (1994). Visual Tracking of High DOF Articulated Structures: An Application to Human Hand Tracking. Proceedings of the Third European Conference on Computer Vision, Stockholm, Sweden, 2, 37-46.
Saito, H., Watanabe, A. & Ozawa, S. (1999). Face Pose Estimating System
based on Eigen Space Analysis. ICIP99, Kobe, Japan.
Sévenier, M. B. (2002). FPDAM of ISO/IEC 14496-1/AMD4, ISO/IEC JTC1/SC29/WG11, N5471, Awaji.
Sidenbladh, H., Black, M. J. & Sigal, L. (2002). Implicit probabilistic models of
human motion for synthesis and tracking. Proc. European Conf. on
Computer Vision. Copenhagen, Denmark.
Sminchisescu, C. & Triggs, B. (2001). Covariance scaled sampling for monocu-
lar 3D body tracking. IEEE Int. Conf. on Computer Vision and Pattern
Recognition. Kauai Marriott, Hawaii.
Strauss, P. S. (1993). IRIS Inventor, A 3D Graphics Toolkit. Proceedings of the
8th Annual Conference on Object-Oriented Programming Systems,
Languages, and Applications. Edited by A. Paepcke. ACM Press, 192-
200.
Sullivan, J. & Carlsson, S. (2002). Recognizing and Tracking Human Action.
Proc. European Conf. on Computer Vision (ECCV), Copenhagen,
Denmark.
Tian, Y., Kanade, T. & Cohn, J. F. (2000). Recognizing Upper Face Action Units for Facial Expression Analysis. Computer Vision and Pattern Recognition (CVPR'00), Hilton Head, SC, 1, 1294-1298.
Tramberend, H. (2000). Avango: A Distributed Virtual Reality Framework.
GMD – German National Research Center for Information Technology.
Vicon Website. (2001). Retrieved from the World Wide Web: http://
www.metrics.co.uk/animation/.
Wren, C. R., Azarbayejani, A., Darrell, T. & Pentland, A. (1997, July). Pfinder:
Real-Time Tracking of the Human Body. IEEE Trans. PAMI., 19(7), 780-
785.
X3D Task Group. Retrieved from the World Wide Web: http://www.web3d.org/
fs_x3d.htm.
Zaharia, T., Preda, M. & Preteux, F. (1999). 3D Body Animation and Coding Within an MPEG-4 Compliant Framework. Proceedings International Workshop on Synthetic-Natural Hybrid Coding and 3D Imaging (IWSNHC3DI99), Santorini, Greece, 74-78.
About the Authors
Nikos Sarris received his Ph.D. from the Aristotle University of Thessaloniki
in “3D Modelling Techniques for the Human Face” and his Master of Engineer-
ing (M.Eng.) degree in Computer Systems Engineering from the University of
Manchester Institute of Science and Technology (UMIST). He has worked as
a Research Assistant in the Information Processing Laboratory of the Aristotle
University of Thessaloniki for four years, where he participated in several
national and international projects and coordinated a national research project
funded by the Greek General Secretariat of Research and Technology. He has
worked as a Research Fellow for the Informatics & Telematics Institute for
three years, where he participated in several national and international projects
and coordinated a Thematic Network of Excellence within the European
Commission Information Society Technologies 5th Framework Programme. Dr.
Sarris has fulfilled his military service in the Research & Informatics Corps of
the Greek Army and has been a member of the Greek Technical Chamber as a
Computer Systems and Informatics Engineer since 1996. His research interests
include 3D model-based image and video processing, image coding, image
analysis and sign language synthesis and recognition.
* * * * *
Ana C. Andrés del Valle received the Spanish State degree of Telecommunications Engineering from ETSETB, the Barcelona Technical School of Telecom at UPC, Barcelona, Spain. In September 2003, she will receive her
Ph.D. degree from Télécom Paris after doing research at the Multimedia
Communications Department of the Eurecom Institute, Sophia Antipolis, France.
As a researcher, she has cooperated with several telecom companies: AT&T
Labs – Research, New Jersey (1999) and France Telecom R&D – Rennes
(during her Ph.D.). In academics, Ana C. Andrés has supervised student
research, has written several publications and prepared specialized tutorials
related to Human Body Motion Analysis and Synthesis. In 2002, she was a
visiting professor at the Computer Science and Mathematics Department of the
University of the Balearic Islands (Spain). Her research interests are image and
video processing for multimedia applications, especially facial expression analy-
sis, virtual reality, human computer interaction and computer graphics.
Greece. His research interests include computer vision, human motion and
gesture analysis, human machine interaction, and biomedical applications. He
has published five papers in the above fields.
Peter Eisert is the Head of the Computer Vision & Graphics Group of the
Image Processing Department at the Fraunhofer Institute for Telecommunica-
tions, Heinrich Hertz Institute. He received his diploma degree in Electrical
Engineering from the University of Karlsruhe, Germany, in 1995. He then joined
the Telecommunications Institute of the University of Erlangen, Germany,
where he worked on facial animation, model-based video coding, 3D geometry
reconstruction and light field coding. He was a member of the graduate research
center “3D image analysis and synthesis” and involved in the organization of
multiple national and international workshops. After receiving his Ph.D. in 2000,
he joined the Information Systems Laboratory, Stanford University, as a post-
doc. Since 2002, he has been with the Fraunhofer Institute for Telecommunica-
tions. His present research interests focus on 3D image and image sequence
processing, as well as algorithms from computer vision and graphics. He is
currently working on facial expression analysis and synthesis for low-bandwidth
communication, 3D geometry reconstruction, and acquisition and streaming of
3D image-based scenes. He is a Lecturer at the Technical University of Berlin
and the author of numerous technical papers published in international journals
and conference proceedings.
E. A. Hendriks received his M.Sc. and Ph.D. degrees from the University of
Utrecht in 1983 and 1987, respectively, both in physics. In 1987 he joined the
Electrical Engineering faculty of Delft University of Technology (The Nether-
lands) as an Assistant Professor. In 1994 he became a member of the Information and Communication Theory Group of this faculty and since 1997 he has headed the computer vision section of this group as an Associate Professor. His interest
is in computer vision, low-level image processing, image segmentation, stereo-
scopic and 3D imaging, motion and disparity estimation, structure from motion/
disparity/silhouette and real time algorithms for computer vision applications.
Pengyu Hong received his B.Eng. and M.Eng. degrees from Tsinghua Univer-
sity, Beijing, China, and his Ph.D. degree from University of Illinois at Urbana-
Champaign, Urbana, all in Computer Science. Currently, he is a Postdoctoral
Researcher at School of Public Health, Harvard University, USA. His research
interests include human computer interaction, computer vision and pattern
recognition, machine learning and multimedia database. In 2000, he received the
Ray Ozzie Fellowship for his research work on facial motion modeling, analysis
and synthesis.
Spiros Ioannou was born in Athens, Greece, in 1975. He received the Diploma
in Electrical and Computer Engineering from the National Technical University
of Athens (NTUA), Greece, in 2000. Since 2000, he has been pursuing his Ph.D.
Nikos Karatzoulis (M.Sc.) was born in 1974. He studied at the University of Sunderland (B.A., Business Computing, 1994-1997) and at the University of Leeds (M.Sc., Distributed Multimedia Systems, 1997-1998). His main area of interest is Virtual Environments. He currently participates in several IST projects: IMUTUS, HUMODAN and SHOPLAB.
Kostas Karpouzis was born in Athens, Greece, in 1972. He graduated from the
Department of Electrical and Computer Engineering, the National Technical
University of Athens in 1998 and received his Ph.D. degree in 2001 from the
same university. His current research interests lie in the areas of human
computer interaction, image and video processing, 3D computer animation and
virtual reality. He is a member of the Technical Chamber of Greece and a
member of ACM SIGGRAPH and SIGCHI societies. Dr. Karpouzis has
published seven papers in international journals and more than 25 in proceedings
of international conferences. He is a member of the technical committee of the
International Conference on Image Processing (ICIP) and Co-editor of the
Greek Computer Society newsletter. Since 1995, he has participated in eight
research projects at Greek and European levels.
Stefanos Kollias was born in Athens in 1956. He obtained his Diploma from the
National Technical University of Athens (NTUA) in 1979, his M.Sc. in Commu-
nication Engineering in 1980 from UMIST in England and his Ph.D. in Signal
Processing from the Computer Science Division of NTUA. He has been with the
Electrical Engineering Department of NTUA since 1986, where he serves now
as a Professor. Since 1990, he has been the Director of the Image, Video and
Multimedia Systems Laboratory of NTUA. He has published more than 120
papers in the above fields, 50 of which were published in international journals.
He has been a member of the technical or advisory committee of, or an invited speaker at, 40 international conferences. He is a reviewer for 10 IEEE Transactions and 10 other journals. Ten graduate students have completed their doctorates under his supervision, while another 10 are currently working on their Ph.D. theses. He and his team have participated in 38 European and national projects.
Gauthier Lafruit was a Research Scientist with the Belgian National Founda-
tion for Scientific Research from 1989 to 1994, being mainly active in the area
of wavelet image compression implementations. Subsequently, he was a Re-
search Assistant with the VUB (Free University of Brussels, Belgium). In 1996,
he became the recipient of the Scientific Barco award and joined IMEC
(Interuniversity MicroElectronics Centre, Leuven, Belgium), where he was
involved as Senior Scientist with the design of low-power VLSI for combined
JPEG/wavelet compression engines. He is currently the Principal Scientist in the
Multimedia Image Compression Systems Group with IMEC. His main interests
Pascal Müller is a Consultant to the Computer Vision Lab of the ETH Zurich
(Switzerland) and also works as Technical Director for the production company
Centralpictures. He received his Master’s Degree in Computer Science from the
ETH Zurich in 2001. His research areas are computer animation, procedural/
physical modeling and sound-sensitive graphics.
his Ph.D. degree in Electrical and Computer Engineering from New Jersey
Institute of Technology in 2000. Currently, he is a Research Staff Member at the
Department of Electrical Engineering, Princeton University, USA. He was a
Post-Doctoral Researcher at the same department in 2001. His research
interests include real-time systems, smart cameras, surveillance systems, digital
image and video libraries, pattern recognition, video/image compression and 2D/
3D object modeling. He is a member of the IEEE and member of the Embedded
Systems Group in the Department of Electrical Engineering at Princeton
University.
Françoise Preteux graduated from the Ecole des Mines de Paris (EMP) and
received her Doctorat d’Etat ès Sciences Mathématiques from the University of
Paris VI, in 1982 and 1987, respectively. After working as a Research Engineer
at the Center for Mathematical Morphology of EMP, she held a position as
Professor at the Ecole Nationale Supérieure des Télécommunications de Paris
(1989-1993). Since 1994, she has been a Professor at the Institut National des
Télécommunications, being successively the Head of the Signal & Image
processing Department (1994-1998) and of the ARTEMIS Project Unit (1999-
present). She is the (co)-author of over 80 major scientific papers within the field
of stochastic modeling and mathematical morphology, medical imaging segmen-
tation, 3D modeling/reconstruction, indexing techniques and digital image cod-
ing. She is a regular reviewer for international journals and a member of
international conference program committees. She actively contributes to the
MPEG standardization process, being the Deputy Head of the French Delegation
for MPEG-7, the official liaison between SC29-WG11 and CEN-ISSS and the French representative at ISO SC 29.
Amaryllis Raouzaiou was born in Athens, Greece, in 1977. She graduated from
the Department of Electrical and Computer Engineering, the National Technical
University of Athens in 2000 and she is currently pursuing her Ph.D. degree at
the Image, Video, and Multimedia Systems Laboratory at the same University.
Her current research interests lie in the areas of synthetic-natural hybrid video
coding, human-computer interaction and machine vision. She is a member of the
Technical Chamber of Greece. She is with the team of IST project ERMIS
(Emotionally Rich Man-Machine Interaction Systems). She has published three
journal articles and eight conference papers in the above fields.
Ioan Alexandru Salomie received his M.Sc. degree in Mechanics and Ma-
chines Construction from the “Politechnica” University of Cluj-Napoca, Roma-
nia, in 1989 and his M.Sc. degree in Applied Computer Science from the Vrije
Universiteit Brussel (VUB), Belgium, in 1994. Since October 1995, he has been
a member of the Department of Electronics and Information Processing (ETRO)
at VUB. He has rich experience in software design and the development of tools
for image and data visualization, image analysis, and telemedicine. His research
has evolved in the direction of surface extraction, coding, and animation of
polygonal surface meshes, and he is currently finishing his Ph.D. on this topic.
Since 2000 he has been actively involved in the SNHC group (Synthetic Natural
Hybrid Coding) of MPEG-4, and is the main contributor to the MESHGRID surface
representation in SNHC.
Technical University of Athens in 1994 and received his Ph.D. degree in 2000
from the same university. His current research interests lie in the areas of human
computer interaction, machine vision, image and video processing, neural
networks and biomedical engineering. He is a member of the Technical
Chambers of Greece and Cyprus and a member of IEEE Signal Processing and
Computer societies. Dr. Tsapatsoulis has published 10 papers in international
journals and more than 30 in proceedings of international conferences. He served
as Technical Program Co-Chair for the VLBV’01 workshop. He is a reviewer
of the IEEE Transactions on Neural Networks and IEEE Transactions on
Circuits and Systems for Video Technology journals.
Jilin Tu received his B.Eng. degree and M.Eng. Degree from Huazhong
University of Science and Technology, Wuhan, China, and his M.S. degree from
Colorado State University. Currently, he is pursuing his Ph.D. degree in the
Department of Electrical Engineering at University of Illinois at Urbana-
Champaign (USA). His research interests include facial motion modeling,
analysis and synthesis; machine learning and computer vision.
Luc Van Gool is Professor for Computer Vision at the University of Leuven in
Belgium and at ETH Zurich in Switzerland. He is a member of the editorial board
of several computer vision journals and of the programme committees of
international conferences about the same subject. His research includes object
recognition, tracking, texture, 3D reconstruction, and the confluence of vision
and graphics. Vision and graphics for archaeology is among his favourite
applications.
Zhen Wen received the B.Eng. degree from Tsinghua University, Beijing,
China, and the M.S. degree from University of Illinois at Urbana-Champaign,
Urbana, both in computer science. Currently, he is pursuing his Ph.D. degree in
the Department of Computer Science at University of Illinois at Urbana-
Champaign. His research interests include facial motion modeling, analysis and
synthesis; image based modeling and rendering; machine learning and computer
vision; multimedia systems and communication.
Liang Zhang was born in Zhejiang, China in 1961. He received his B.Sc. degree
from Chengdu Institute of Radio Engineering in 1982, his M.Sc. degree from
Shanghai Jiaotong University in 1986 and his Doctoral degree in electrical
engineering from the University of Hannover, Germany, in 2000. He worked as an assistant from 1987 to 1988 and as a lecturer from 1989 to 1992
in the Department of Electrical Engineering, Shanghai Jiaotong University. From
1992 to 2000, he was a research assistant at the Institut für Theoretische
Nachrichtentechnik und Informationsverarbeitung, University of Hannover,
Germany. Since 2000, he has been with Communications Research Centre
Canada. His research interests are image analysis, computer vision, and video
coding.