Case Study: Face Animation


TOPIC: FACE ANIMATION

A Case Study for Multimedia Modelling and Specification Languages

INTRODUCTION
Specifying the components of a multimedia presentation and their
spatial/temporal relations is among the basic tasks in multimedia systems. Such
specifications are necessary when a client asks for a certain presentation to be
designed, when a media player receives input to play, and even when a search is
done to retrieve an existing multimedia file. In all these cases, the description of
the contents can include raw multimedia data (video, audio, etc.) as well as
textual commands and information. Such a description works as a Generalized
Encoding, since it represents the multimedia content in a form not necessarily
the same as the playback format, and is usually more efficient and compact. For
instance, a textual description of a scene can be a very effective encoded version
of a multimedia presentation that will be decoded by the media player when it
recreates the scene.

Face Animation, as a special type of multimedia presentation, has been a
challenging subject for many researchers. Advances in computer hardware and
software, and also new web-based applications, have recently helped intensify
these research activities. Video conferencing and online services provided by
human characters are good examples of applications using face animation.
Personalized Face Animation includes all the information and activities required
to create a multimedia presentation resembling a specific person. The input to
such a system can be a combination of audio/visual data and textual commands
and descriptions. A successful face animation system needs efficient yet powerful
solutions for providing and displaying the content, i.e. a content description
format, decoding algorithms, and finally an architecture to put the different
components together in a flexible way.

PROBLEM STATEMENT
Although new streaming technologies allow real-time download/playback of
audio/video data, bandwidth limitation and its efficient usage still are, and
probably will remain, major issues. This makes a textual description of a
multimedia presentation (e.g. facial actions) a very effective coding/compression
mechanism, provided the visual effects of these actions can be recreated with a
minimum acceptable quality. Based on this idea, some research in face animation
has been done to translate certain facial actions into a predefined set of codes.
The Facial Action Coding System (Ekman and Friesen, 1978) is probably the first
successful attempt in this regard. More recently, the MPEG-4 standard (Battista,
et al, 1999) has defined Face Animation Parameters to encode low-level facial
actions like jaw-down, and higher-level, more complicated ones like smile.

Efficient use of bandwidth is not the only advantage of multimedia content
specifications like facial action coding. In many cases, the real multimedia data
does not exist at all, and has to be created based on a description of the desired
actions. This leads to the whole new idea of representing the spatial and
temporal relations of facial actions. In a generalized view, such a description of a
facial presentation should provide a hierarchical structure with elements ranging
from low-level images to simple moves, more complicated actions, and complete
stories. We call this a Structured Content Description, which also requires means
of defining capabilities, behavioural templates, dynamic contents, and event/user
interaction. Needless to say, compatibility with existing multimedia and web
technologies is another fundamental requirement in this regard. A powerful
description and specification mechanism is also clearly valuable in search
applications, which currently suffer when looking for multimedia content. The
MPEG-7 standard (Nack and Lindsay, 1999) is the newest arrival in the group of
research projects aiming at a better multimedia retrieval mechanism.

Considering the three major issues of Content Delivery, Content Creation, and
Content Description, the following features can be considered important
requirements in a multimedia presentation system (Arya and Hamidzadeh, 2002):
1- Streaming, i.e. continuously receiving/displaying data.
2- Structured Content Description, i.e. a hierarchical way to provide information
about the required content, from high-level scene description to low-level moves,
images, and sounds.
3- Content Creation (Generalized Decoding), i.e. creating the displayable content
based on the input. This can be decoding a compressed image or making new
content based on the provided textual description.
4- Component-based Architecture, i.e. the flexibility to rearrange the system
components, and use new ones as long as a certain interface is supported.
5- Compatibility, i.e. the ability to use and work with widely accepted industry
standards in multimedia systems.
6- Minimized Database of audio/visual footage.

GENERAL DISCUSSION
RELATED WORK
Multimedia Content Description

The diverse set of works in multimedia content description involves methods for
describing the components of a multimedia presentation and their spatial and
temporal relations. Historically, one of the first technical achievements in this
regard was related to video editing, where temporal positioning of video elements
is necessary. The SMPTE (Society of Motion Picture and Television Engineers)
Time Code (Ankeney, 1995; Little, 1994), which precisely specifies the location of
audio/video events down to the frame level, is the basis for the EDL (Edit Decision
List) (Ankeney, 1995; Little, 1994), which relates pieces of recorded audio/video
for editing. Electronic Program Guides (EPGs) are another example of content
description for movies, in the form of textual information added to the multimedia
stream. More recent efforts by SMPTE are focused on the Metadata Dictionary,
which targets the definition of metadata descriptions of the content (see
http://www.smpte-ra.org/mdd). These metadata can include items from title to
subject and components. The concept of metadata description is the basis for
other similar efforts like Dublin Core (http://dublincore.org), EBU P/Meta
(http://www.ebu.ch/pmc_meta.html), and TV Anytime (http://www.tv-anytime.org).

The Moving Picture Experts Group (MPEG) is another major player in standards
for multimedia content description and delivery. The MPEG-4 standard, which
comes after MPEG-1 and MPEG-2, is one of the first comprehensive attempts to
define the multimedia stream in terms of its forming components (objects like
audio, foreground figure, and background image). Users of MPEG-4 systems can
use Object Content Information (OCI) to send textual information about these
objects. A more promising approach in content description is the MPEG-7
standard. MPEG-7 is mainly motivated by the need for a better and more powerful
search mechanism for multimedia content over the Internet, but it can be used in
a variety of other applications including multimedia authoring. The standard
extends OCI and consists of a set of Descriptors for multimedia features (similar
to metadata in other works), Description Schemes that show the structure of the
descriptors, and an XML-based Description Definition Language.

Most of these methods are not aimed at or customized for a certain type of
multimedia stream or object. This may result in a wider range of applications, but
it limits the capabilities for some frequently used subjects like the human face. To
address this issue, MPEG-4 includes Face Definition Parameters (FDPs) and Face
Animation Parameters (FAPs). FDPs define a face by giving measures for its major
parts, as shown in Figure 1. FAPs, on the other hand, encode the movements of
these facial features. Together they allow a receiver system to create a face
(using any graphic method) and animate it based on low-level commands in FAPs.
The concept of FAPs can be considered a practical extension of the Facial Action
Coding System (FACS) used earlier to code different movements of facial features
for certain expressions and actions.
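As an illustration of how low-level animation parameters of this kind can drive a
face model, the following minimal Python sketch applies hypothetical FAP-like
displacement values to a set of 2D feature points. The parameter names, units,
and scaling are illustrative assumptions, not the actual MPEG-4 FAP definitions.

```python
# Illustrative sketch only: hypothetical FAP-like parameters applied to 2D feature
# points. Names, units, and scaling are assumptions, not the MPEG-4 specification.

# Neutral positions of a few facial feature points (x, y), in arbitrary image units.
NEUTRAL_POINTS = {
    "jaw_bottom": (0.0, 100.0),
    "mouth_corner_left": (-20.0, 80.0),
    "mouth_corner_right": (20.0, 80.0),
}

# Each "FAP" moves one feature point along a fixed direction (dx, dy) per unit value.
FAP_DIRECTIONS = {
    "open_jaw": ("jaw_bottom", (0.0, 1.0)),
    "stretch_left_corner": ("mouth_corner_left", (-1.0, 0.0)),
    "stretch_right_corner": ("mouth_corner_right", (1.0, 0.0)),
}

def apply_faps(fap_values):
    """Return displaced feature points for a frame described by FAP-like values."""
    points = dict(NEUTRAL_POINTS)
    for fap_name, value in fap_values.items():
        point_name, (dx, dy) = FAP_DIRECTIONS[fap_name]
        x, y = points[point_name]
        points[point_name] = (x + value * dx, y + value * dy)
    return points

# A crude "smile" expressed as low-level parameter values for one frame.
print(apply_faps({"open_jaw": 3.0,
                  "stretch_left_corner": 5.0,
                  "stretch_right_corner": 5.0}))
```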
After a series of efforts to model temporal events in multimedia streams (Hirzalla,
et al, 1995), an important advance in multimedia content description is the
Synchronized Multimedia Integration Language (SMIL) (Bulterman, 2001), an
XML-based language designed to specify the temporal relations of the components
of a multimedia presentation, especially in web applications. SMIL can be used
quite suitably with MPEG-4 object-based streams. There have also been different
languages in the fields of Virtual Reality and computer graphics for modelling
computer-generated scenes.
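To make the idea of specifying temporal relations concrete, the sketch below
resolves a nested sequential/parallel composition, in the spirit of SMIL-style
grouping though not its actual syntax or API, into absolute start times. The
tuple-based data structure and the function are illustrative assumptions.

```python
# Illustrative sketch: resolving nested sequential/parallel groupings into start
# times. The tuple-based structure is an assumption, not SMIL syntax.

def schedule(node, start=0.0, out=None):
    """Walk a ("seq"|"par", children) tree; leaves are (name, duration).
    Returns ({name: (start_time, duration)}, total duration of the node)."""
    if out is None:
        out = {}
    kind = node[0]
    if kind not in ("seq", "par"):          # leaf: (name, duration)
        name, duration = node
        out[name] = (start, duration)
        return out, duration
    children = node[1]
    if kind == "seq":                        # children play one after another
        t = start
        for child in children:
            _, d = schedule(child, t, out)
            t += d
        return out, t - start
    # "par": children start together; group lasts as long as its longest child
    durations = [schedule(child, start, out)[1] for child in children]
    return out, max(durations) if durations else 0.0

# A short scenario: a greeting is spoken while the head nods, then a smile follows.
timeline, total = schedule(
    ("seq", [("par", [("say_hello", 2.0), ("nod", 1.5)]), ("smile", 1.0)])
)
print(timeline, total)   # say_hello and nod start at 0.0, smile at 2.0; total 3.0
```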

APPLICATIONS
Examples are the Virtual Reality Modelling Language (VRML, http://www.vrml.org)
and programming libraries like OpenGL (http://www.opengl.org). Another
important group of related works are behavioural modelling languages and tools
for virtual agents. BEAT (Cassell, et al, 2001) is an XML-based system, specifically
designed for human animation purposes. It is a toolkit for automatically
suggesting expressions and gestures, based on a given text to be spoken. BEAT
uses a knowledge base and rule set, and provides synchronization data for facial
activities, all in XML format. This enables the system to use standard XML
parsing and scripting capabilities. Although BEAT is not a general content
description tool, it demonstrates some of the advantages of XML-based
approaches. Other scripting and behavioural modelling languages for virtual
humans have been considered by other researchers as well (Funge, et al, 1999;
Kallmann and Thalmann, 1999; Lee, et al, 1999). These languages are usually
simple macros for simplifying the animation, or new languages that do not use
existing multimedia technologies. In most cases, they are not specifically
designed for face animation.

DESCRIPTION
The five major parts of this system, sketched in the code example after this list, are:
Script Reader, to receive an FML script from a disk file, an Internet address,
or any text stream provider.
Script Parser, to interpret the FML script and create separate intermediate
audio and video descriptions (e.g. words and viseme identifiers).
Video Kernel, to generate the required image frames.
Audio Kernel, to generate the required speech.
Multimedia Mixer, to synchronize audio and video streams.
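The following minimal Python sketch shows one way such a component-based
pipeline could be wired together. The five component names come from the list
above, but the method signatures and placeholder implementations are
illustrative assumptions rather than the actual ShowFace interfaces.

```python
# Illustrative sketch of the five-component pipeline; interfaces and internals are
# assumptions, not the actual ShowFace component definitions.

class ScriptReader:
    def read(self, source: str) -> str:
        """Fetch an FML script from a file path (URL/stream cases omitted for brevity)."""
        with open(source, encoding="utf-8") as f:
            return f.read()

class ScriptParser:
    def parse(self, script: str):
        """Split the script into intermediate audio and video descriptions."""
        words = script.split()                       # placeholder for real parsing
        visemes = [w[:2] for w in words]             # placeholder viseme identifiers
        return words, visemes

class VideoKernel:
    def frames(self, visemes):
        return [f"frame_for_{v}" for v in visemes]   # placeholder image frames

class AudioKernel:
    def speech(self, words):
        return [f"audio_for_{w}" for w in words]     # placeholder speech segments

class MultimediaMixer:
    def mix(self, frames, audio):
        return list(zip(frames, audio))              # naive audio/video alignment

def play(source: str):
    """End-to-end flow: read, parse, generate video and audio, then synchronize."""
    script = ScriptReader().read(source)
    words, visemes = ScriptParser().parse(script)
    return MultimediaMixer().mix(VideoKernel().frames(visemes),
                                 AudioKernel().speech(words))
```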

CONCLUSION
Reviewing the most important works in the area of multimedia specification and
presentation, it has been shown that a comprehensive framework for face
animation is a requirement that has not yet been met. Such a framework needs to
provide: a structured content description mechanism; an open, modular
architecture covering aspects from getting input in standard forms to generating
audio/video data on demand; and efficient algorithms for generating multimedia
with minimum use of existing footage and minimum computational complexity.

An approach to such a framework, the ShowFace System, is proposed. Unlike most
existing systems, ShowFace is not limited to an off-line application or a media
player object, but provides a complete and flexible architecture. The component-
based architecture uses standard interfaces to interact internally and also with
other objects provided by the underlying platform, making maximum use of
existing technologies like MPEG-4, XML, and DirectX. These components can be
used separately, or in a combination controlled by the animation application.

An XML-based Face Modelling Language (FML) is designed to describe the desired
sequence of actions in the form of a scenario. FML allows event handling, as well
as sequential or simultaneous combination of supported face states, and is parsed
into a set of MPEG-4 compatible face actions. The main contributions of FML are
its hierarchical structure, its flexibility for static and dynamic scenarios, and its
dedication to face animation. Compatibility with MPEG-4 and the use of XML as a
base are also among the important features of the language. Future extensions to
FML can include more complicated behaviour modelling and better coupling with
MPEG-4 streams.
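As a purely illustrative sketch of what such a scenario description could look
like, the fragment below uses hypothetical element names (scenario, par, seq,
talk, expression) in the spirit of the language described above, not the published
FML schema, and parses it with Python's standard ElementTree module.

```python
# Illustrative sketch: a hypothetical FML-like scenario (element names are
# assumptions, not the published FML schema) parsed with the standard library.
import xml.etree.ElementTree as ET

SCENARIO = """
<scenario>
  <par>                               <!-- simultaneous face states -->
    <talk>Hello, how can I help you?</talk>
    <expression type="smile" intensity="0.6"/>
  </par>
  <seq>                               <!-- sequential face states -->
    <expression type="raise_eyebrows" intensity="0.4"/>
    <talk>Please choose an option.</talk>
  </seq>
</scenario>
"""

root = ET.fromstring(SCENARIO)
for group in root:                     # top-level par/seq groups
    actions = [(child.tag, child.attrib, (child.text or "").strip())
               for child in group]
    print(group.tag, actions)
```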
The image-based transformations used in the video kernel are shown to be
successful in creating a variety of facial states based on a minimum of input
images. Unlike 3D approaches, these transformations do not need complicated
modelling and computations. On the other hand, compared to usual 2D
approaches, they do not use a huge database of images. They can be extended to
include facial textures for better results, and the system even allows a complete
change of image generation methods (e.g. using a 3D model), as long as the
interfaces are supported.

Better feature detection is a main objective of our future work, since any error in
detecting a feature point can directly result in a wrong transformation vector. This
effect can be seen in cases like eyebrows, where detecting the exact
corresponding points between a pair of learning images is not easy. As a result,
the learned transformation may include additive random errors which cause
non-smooth eyebrow lines in the transformed feature set and image. A
combination of pre-learned transformations is used to create more complicated
facial states. As discussed, due to the perspective nature of head movements, this
may not be a linear combination. Methods for shrinking/stretching the mapping
vectors as a function of 3D head rotation are being studied and tested. Another
approach can be defining the mapping vectors in terms of relative positions to
other points rather than numeric values. Such relational descriptions may be
invariant with respect to rotations.
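A minimal sketch of the kind of combination described above is given below,
assuming pre-learned per-feature-point displacement vectors and a simple
cosine-based shrinking of horizontal components with head yaw. The weighting
scheme and the scaling function are assumptions made for illustration, not the
method actually being studied.

```python
# Illustrative sketch: combining pre-learned feature-point mapping vectors, with a
# simple rotation-dependent scaling. The cosine scaling is an assumption made for
# illustration, not the method studied in the paper.
import math

# Pre-learned displacement vectors (dx, dy) per feature point for two facial states.
SMILE = {"mouth_corner_left": (-4.0, -2.0), "mouth_corner_right": (4.0, -2.0)}
RAISE_BROWS = {"brow_left": (0.0, -3.0), "brow_right": (0.0, -3.0)}

def combine(transforms_with_weights, yaw_degrees=0.0):
    """Weighted combination of mapping vectors, with horizontal components shrunk
    by a cosine factor as a crude stand-in for perspective effects of head yaw."""
    shrink = math.cos(math.radians(yaw_degrees))
    combined = {}
    for transform, weight in transforms_with_weights:
        for point, (dx, dy) in transform.items():
            old_dx, old_dy = combined.get(point, (0.0, 0.0))
            combined[point] = (old_dx + weight * dx * shrink,
                               old_dy + weight * dy)
    return combined

# A half-intensity smile with fully raised eyebrows, head turned 30 degrees.
print(combine([(SMILE, 0.5), (RAISE_BROWS, 1.0)], yaw_degrees=30.0))
```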
