Face Image Analysis and Synthesis For Hu
Shigeo MORISHIMA
Seikei University
3-3-1 Kichijoji-kitamachi, Musashino, Tokyo 180-8633, Japan
[email protected]
2. Face Modeling
To generate a realistic avatar's face, a generic face model is manually adjusted to the user's face image. To produce a personal 3D face model, at least a frontal face image and a profile image of the user are required. The generic face model has all of the control rules for facial expressions, defined by FACS parameters as 3D movements of grid points that modify the geometry. Fig.1 shows a personal model before and after the fitting process for the front-view image, using our original GUI-based face fitting tool. The front-view image and the profile image are given to the system, and the corresponding control points are manually moved to reasonable positions by mouse operation.

The synthesized face is produced by mapping a blended texture, generated from the user's frontal and profile images, onto the modified personal face model. However, self-occlusion sometimes occurs, so texture for the occluded part of the face cannot be captured from the front and profile images alone.

Fig.3 Multi-view fitting to oblique angle
Fig.4 Cylindrical face texture
Fig.5 Reconstructed face: a) Multi-view, b) Two-view
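As a rough illustration of the texture blending step (not the implementation used in our system), the following sketch projects each model vertex onto cylindrical texture coordinates and mixes the frontal and profile photographs by azimuth angle. The cosine weighting, the parameterization, and the sampling functions are assumptions introduced only for this example.

```python
# Minimal sketch: blend a frontal and a profile photograph into one cylindrical
# face texture. sample_front / sample_profile are assumed callables returning an
# RGB sample for texture coordinates (u, v); the cos^2 weight is illustrative.
import numpy as np

def cylindrical_uv(vertex, center, height_range):
    """Project a 3D vertex onto cylindrical texture coordinates (u, v)."""
    x, y, z = vertex - center
    theta = np.arctan2(x, z)                 # azimuth around the vertical head axis
    u = (theta + np.pi) / (2.0 * np.pi)      # 0..1 around the cylinder
    v = (y - height_range[0]) / (height_range[1] - height_range[0])
    return u, v, theta

def blended_color(vertex, center, height_range, sample_front, sample_profile):
    """Blend frontal and profile texture samples according to the azimuth angle."""
    u, v, theta = cylindrical_uv(vertex, center, height_range)
    # Weight is 1 facing the camera (theta ~ 0) and 0 toward the side (|theta| ~ pi/2).
    w_front = np.clip(np.cos(theta) ** 2, 0.0, 1.0)
    return w_front * sample_front(u, v) + (1.0 - w_front) * sample_profile(u, v)
```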
The input voice is analyzed and converted into mouth shape parameters by a neural network on a frame-by-frame basis.

A communication system for multiple users in cyberspace is constructed on a server-client basis. In this system, only a few parameters and the voice signal are transmitted through the network, and the avatar's face is synthesized from these parameters locally at each client.

Fig.6 Network for parameter conversion (L: LPC cepstrum, M: mouth shape parameter)
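The per-frame conversion of Fig.6 can be sketched as a small feed-forward mapping from an LPC cepstrum vector L to a mouth shape parameter vector M. The layer sizes, activation, and randomly initialized weights below are illustrative assumptions, not the trained network used in the system.

```python
# Sketch of one frame of voice-to-mouth-shape conversion: M = W2*sigmoid(W1*L + b1) + b2.
import numpy as np

def mouth_shape_from_cepstrum(L, W1, b1, W2, b2):
    """Map one LPC cepstrum frame L to mouth shape parameters M."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ L + b1)))   # hidden layer with sigmoid units
    return W2 @ h + b2                          # linear output: mouth shape parameters

# Example dimensions (assumed): 16 cepstrum coefficients in, 8 mouth shape parameters out.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(32, 16)), np.zeros(32)
W2, b2 = rng.normal(size=(8, 32)), np.zeros(8)
M = mouth_shape_from_cepstrum(rng.normal(size=16), W1, b1, W2, b2)
```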
... Happiness and Sadness. Each basic emotion has a specific set of facial expression parameters described by FACS (Facial Action Coding System) (Ekman and Friesen 1978).

Figure panels: a) Shape for /a/, b) Shape for /i/
... frame by frame. The avatar body is placed in cyberspace according to the location information. There are two display modes: a view from the avatar's own eyes for eye contact, and a view from the sky for finding other users in cyberspace; these views can be chosen from a menu in the window. Fig.10 shows a communication system in cyberspace with avatars.

3.4 User Adaptation
When a new user comes in, his face model and voice model have to be registered before operation. For the voice, new training of the neural network would ideally be performed; however, it takes a very long time for backpropagation to converge. To simplify face model construction and voice learning, a GUI tool for speaker adaptation is prepared. To register the face of a new user, the generic 3D face model is modified to fit the input face image. Because the expression control rules are defined on the generic model, every user's face can be modified in the same way to generate the basic expressions using the FACS-based expression control mechanism.

For voice adaptation, voice data from 75 persons, each including the 5 vowels, are pre-captured, and a database of neural network weights and voice parameters is constructed. Speaker adaptation is then performed by choosing the optimum weights from this database; training of the neural network for all 75 persons' data is finished before operation. When a new, unregistered speaker comes in, he speaks the 5 vowels into a microphone. The LPC cepstrum is calculated for each of the 5 vowels and given to the neural network. The mouth shape is then calculated with the selected weights, and the error between the true mouth shape and the generated mouth shape is evaluated. This process is applied to every entry in the database one by one, and the optimum weights are those giving the minimum error.
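The weight-selection step described above can be summarized as follows. The function names, the database layout, and the use of a summed squared error are illustrative assumptions; the text only states that the weight set giving the minimum error over the 5 vowels is chosen.

```python
# Sketch of speaker adaptation by database search: try every pre-trained weight
# set, measure how well it reproduces the new speaker's reference mouth shapes
# for the 5 vowels, and keep the best one.
import numpy as np

def select_optimum_weights(vowel_cepstra, reference_shapes, weight_database, convert):
    """Return the index of the weight set minimizing the total mouth shape error."""
    best_index, best_error = None, np.inf
    for i, weights in enumerate(weight_database):
        # Sum squared error over the 5 vowels for this candidate speaker model.
        error = sum(np.sum((convert(c, weights) - m) ** 2)
                    for c, m in zip(vowel_cepstra, reference_shapes))
        if error < best_error:
            best_index, best_error = i, error
    return best_index
```

Here `convert` is any cepstrum-to-mouth-shape conversion function, for example the one sketched in the previous section.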
A natural communication environment between multiple users in cyberspace is realized by transmitting natural voice and synthesizing the avatar's facial expression in real time. The current system runs with 3 users in an intra-network environment at a speed of 16 frames per second on an SGI Max IMPACT. An emotion model (Morishima 1996) is introduced to improve the communication environment, and gaze tracking and mouth-closing-point detection can be realized with a pan-tilt-zoom controlled camera.

3.6 Entertainment Application
When people watch a movie, they sometimes imagine their own figure overlapping the actor's image. The interactive movie system we constructed is an image creation system in which the user can control the facial expression and lip motion of his own face image inserted into a movie scene. The user speaks into a microphone and pushes keys that determine the expression and special effects, and his own video program is generated in real time.

First, a frontal face image of the visitor is captured by a camera. The 3D generic wireframe model is fitted onto the user's face image to generate a personal 3D surface model. Facial expression is synthesized by controlling the grid points of the face model and by texture mapping. For speaker adaptation, the visitor speaks the 5 vowels to choose the optimum weights from the database.

In the interactive process, a famous movie scene plays while the face of the actor or actress is replaced with the visitor's face, and the facial expression and lip shape are controlled synchronously by the captured voice. An active camera tracks the visitor's face, and the facial expression is also controlled by computer-vision-based face image analysis. Fig.11 shows the original movie clip and Fig.12 shows the result of fitting the face model into this scene. Fig.13 shows the user's face inserted into the actor's face after color correction. In this interactive movie system, any expression can be appended and any scenario can be given by the user, independent of the original story.
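The color correction step before compositing is not spelled out in this paper; a common choice, sketched here purely as an assumption, is to match the per-channel mean and standard deviation of the inserted face region to the actor's face region.

```python
# Assumed color-correction sketch (not the paper's method): shift and scale each
# RGB channel of the visitor's face so its statistics match the target region.
import numpy as np

def match_color(source_face, target_face, eps=1e-6):
    """Return source_face with per-channel mean/std matched to target_face."""
    src = source_face.astype(np.float64)
    tgt = target_face.astype(np.float64)
    out = np.empty_like(src)
    for c in range(3):
        s_mu, s_sigma = src[..., c].mean(), src[..., c].std()
        t_mu, t_sigma = tgt[..., c].mean(), tgt[..., c].std()
        out[..., c] = (src[..., c] - s_mu) * (t_sigma / (s_sigma + eps)) + t_mu
    return np.clip(out, 0, 255).astype(np.uint8)
```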
4. Muscle Constraint Face Model
Muscle-based face image synthesis is one of the most realistic approaches to realizing a life-like agent in computers. A facial muscle model is composed of facial tissue elements and simulated muscles. In this model, the forces acting on each facial tissue element are calculated from the contraction of each muscle string, so the combination of muscle contraction forces decides a specific facial expression. These muscle parameters are determined on a trial-and-error basis by comparing a sample photograph with a generated image using our Muscle-Editor. In this section, we propose a strategy for automatic estimation of facial muscle parameters from 2D optical flow using a neural network. This corresponds to markerless 3D facial motion capture from a 2D camera image under physics-based constraints.

Fig.11 Original movie clip
We introduce a multi-layer backpropagation network for the estimation of muscle parameters. The facial expression is then re-synthesized from the estimated muscle parameters to evaluate how well the impression of the original expression can be recovered. We also generated animation using data captured from an image sequence. As a result, we can capture and synthesize image sequences that give an impression very close to the original video.

Fig.12 Fitting result of face model
4.1 Layered Dynamic Tissue Model
The human skull is covered by a deformable tissue which has five distinct layers. Four layers (epidermis, dermis, subcutaneous connective tissue, and fascia) comprise the skin, and the fifth layer comprises the muscles responsible for facial expression. In accordance with the structure of real skin, we employ a synthetic tissue model constructed from the elements illustrated in Fig.14, consisting of nodes interconnected by deformable springs (the lines in the figure). The epidermal surface is defined by nodes 1, 2, and 3, which are connected by epidermal springs. The epidermal nodes are also connected by dermal-fatty layer springs to nodes 4, 5, and 6, which define the fascia surface. Fascia nodes are interconnected by fascia springs. They are also connected by muscle layer springs to skull surface nodes 7, 8, and 9.

The facial tissue model is implemented as a collection of node and spring data structures. The node data structure includes variables to represent the nodal mass, position, velocity, acceleration, and net force. The spring data structure comprises the spring stiffness, the natural length of the spring, and pointers to the data structures of the two nodes that are interconnected by the spring. Newton's laws of motion govern the response of the tissue model to force (Lee, Terzopoulos and Waters 1995). This leads to a system of coupled, second-order ordinary differential equations that relate the node positions, velocities, and accelerations to the nodal forces.

Fig.13 Reconstructed face by user's
Fig.14 Layered tissue element
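As an illustration of these dynamics (not the solver of the reference implementation), the sketch below advances the tissue nodes by one explicit Euler step under spring, damping, and external (muscle) forces. The damping term, constants, and data layout are assumptions made for this example.

```python
# Sketch of the layered tissue dynamics: each node obeys m*a = F_springs + F_ext - gamma*v,
# and the coupled second-order ODEs are integrated with explicit Euler steps.
import numpy as np

def spring_force(xi, xj, stiffness, rest_length):
    """Force on node i from a spring connecting nodes i and j (Hooke's law)."""
    d = xj - xi
    length = np.linalg.norm(d)
    return stiffness * (length - rest_length) * d / max(length, 1e-9)

def step(positions, velocities, masses, springs, external, dt, damping=0.5):
    """Advance all tissue nodes by one explicit Euler time step."""
    forces = external - damping * velocities
    for i, j, k, rest in springs:                    # (node_i, node_j, stiffness, rest_length)
        f = spring_force(positions[i], positions[j], k, rest)
        forces[i] += f
        forces[j] -= f
    accelerations = forces / masses[:, None]
    velocities = velocities + dt * accelerations
    positions = positions + dt * velocities
    return positions, velocities
```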
4.2 Facial Muscle Model
Fig.15 shows our simulated muscles. The black lines indicate the location of each facial muscle in the layered tissue face model. Each muscle is modeled as a combination of separate muscles, and the left and right parts of the face are symmetric. For example, the frontalis has a triangularly spread shape, so 3 kinds of linear muscles model the frontalis on one side. The frontalis pulls up the eyebrows and makes wrinkles in the forehead. The corrugator, which is modeled by 4 springs located between the eyes, pulls the eyebrows together and makes wrinkles between the left and right eyebrows. However, those muscles cannot pull down the eyebrows and make the eyes thin, so the orbicularis oculi muscle is also included in our model. The orbicularis oculi is separated into an upper part and a lower part. To keep muscle control simple around the eye area, the orbicularis oculi is modeled with a single function in our model. Normally, muscles are located between a bone node and a fascia node, but the orbicularis oculi has an irregular style: it is attached between fascia nodes in a ring configuration, with 8 linear muscles approximating the ring muscle. Contraction of the ring muscle makes the eye thin.

The muscles around the mouth are very important for the production of speaking scenes. Most Japanese speaking scenes are composed of vowels, so we mainly focused on the production of vowel mouth shapes as a first step and relocated the muscles around the mouth (Sera, Morishima and Terzopoulos 1996). As a result, the final facial muscle model has 14 muscle springs in the forehead area and 27 muscle springs in the mouth area.

Fig.15 Facial muscle model
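A single linear muscle in such a model can be pictured as a pull along the line from its attachment node toward the fascia node it inserts into, scaled by a contraction parameter. The force law and constants below are illustrative assumptions, not the exact formulation of our model; the resulting force could feed the external-force term of the time-step sketch in Section 4.1.

```python
# Sketch of one linear muscle string: pull the fascia (insertion) node toward the
# attachment node in proportion to the contraction parameter in [0, 1].
import numpy as np

def muscle_force(attachment, insertion, contraction, max_force=1.0):
    """Force applied at the insertion node by one contracting muscle string."""
    direction = attachment - insertion
    length = np.linalg.norm(direction)
    if length < 1e-9:
        return np.zeros(3)
    return max_force * contraction * direction / length
```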
4.3 Motion Capture on Face
A personal face model is constructed by fitting the generic control model to personal range data. Optical flow vectors are calculated over a video sequence, and the motion from the neutral expression to each expression is accumulated over the sequence. The motion vectors are then averaged within each window shown in Fig.16(b); the window locations are determined automatically in each video frame.

Fig.16 Motion feature vector on face: a) Optical flow, b) Feature window

A layered neural network finds a mapping from the motion vectors to the muscle parameters. A four-layer structure is chosen to effectively model the nonlinear behavior. The first layer is the input layer, which corresponds to the 2D motion vectors. The second and third layers are hidden layers: units of the second layer have a linear function and those of the third layer have a sigmoid function. The fourth layer is the output layer, corresponding to the muscle parameters, and it has linear units. Linear functions in the input and output layers are introduced to maintain the range of the input and output values.
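The forward pass of such a conversion network can be written compactly as below. Only the linear-linear-sigmoid-linear layer structure follows the description above; the layer widths, weights, and the flattening of the window-averaged motion vectors are illustrative assumptions.

```python
# Sketch of the four-layer conversion network: linear input, linear second layer,
# sigmoid third layer, linear output mapping 2D motion vectors to muscle parameters.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def muscle_params_from_motion(motion_vectors, W2, b2, W3, b3, W4, b4):
    """motion_vectors: (n_windows, 2) accumulated flow; returns muscle parameters."""
    x = motion_vectors.reshape(-1)      # layer 1: linear input units (flattened 2D vectors)
    h2 = W2 @ x + b2                    # layer 2: linear hidden units
    h3 = sigmoid(W3 @ h2 + b3)          # layer 3: sigmoid hidden units
    return W4 @ h3 + b4                 # layer 4: linear output, one value per muscle spring
```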
A simpler neural network structure can help to speed up the convergence of learning and reduce the calculation cost in parameter mapping, so the face area is divided into three subareas as shown in Fig.17: the mouth area, the left-eye area, and the right-eye area, each giving an independent skin motion. Three independent neural networks are prepared for these three areas.

The learning patterns are composed of the basic expressions and combinations of their muscle contraction forces. We create 6 basic facial expressions consisting of anger, disgust, fear, happiness, sadness and surprise, as well as basic mouth shapes for the vowels "a", "i", "u", "e" and "o", and a closed mouth shape.

The motion vector from each video frame is given to the neural network, and a facial animation is then generated from the output muscle parameter sequence. This tests the effect of interpolation in our parameter conversion method, based on the generalization of the neural network. Fig.18 shows a result of expression regeneration for surprise from an original video sequence.

Fig.17 Conversion from motion to animation

4.4 Muscle Control by EMG
EMG data is the voltage waveform captured by a needle inserted directly into each muscle, so the features of the wave express the contraction state of the muscle. In particular, 7 di-ball wires are inserted into the muscles around the mouth to build a model of mouth shape control.

Fig.19 Mouth shape control by EMG
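One common way to turn such an EMG channel into a contraction value, sketched here only as an assumption rather than our calibration procedure, is to rectify and smooth the signal into an envelope and normalize it to the [0, 1] range expected by the muscle model.

```python
# Assumed EMG pre-processing sketch: rectified, moving-average envelope mapped to
# a normalized muscle contraction parameter.
import numpy as np

def emg_envelope(signal, window=64):
    """Moving average of the rectified EMG signal (same length as the input)."""
    rectified = np.abs(signal - signal.mean())      # remove DC offset, rectify
    kernel = np.ones(window) / window
    return np.convolve(rectified, kernel, mode="same")

def contraction_from_emg(signal, max_activation):
    """Map an EMG channel to a [0, 1] muscle contraction value."""
    return np.clip(emg_envelope(signal) / max_activation, 0.0, 1.0)
```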
6. Conclusion
To generate a realistic avatar's face for face-to-face communication, multi-modal signals are introduced to duplicate the original facial expression. Voice is used to realize lip synchronization and expression control. Video-captured images are used to regenerate the original facial expression under facial muscle constraints. EMG data is used to control the artificial muscles directly. Finally, a hair modeling method is proposed to make the avatar more natural and believable.

Fig.21 Example of hair style
Fig.22 Copying real hair style
References

Morishima, S. and Harashima, H. (1991). A Media Conversion from Speech to Facial Image for Intelligent Man-Machine Interface. IEEE JSAC, Vol. 9, No. 4, pp. 594-600.

Morishima, S. (1997). Virtual Face-to-Face Communication Driven by Voice Through Network. Workshop on Perceptual User Interfaces, pp. 85-86.

Ekman, P. and Friesen, W.V. (1978). Facial Action Coding System. Consulting Psychologists Press Inc.

Essa, I., Darrell, T. and Pentland, A. (1994). Tracking Facial Motion. Proceedings of Workshop on Motion of Non-Rigid and Articulated Objects, pp. 36-42.

Mase, K. (1991). Recognition of Facial Expression from Optical Flow. IEICE Transactions, Vol. E74, No. 10.

Morishima, S. (1996). Modeling of Facial Expression and Emotion for Human Communication System. Displays 17, pp. 15-25, Elsevier.

Lee, Y., Terzopoulos, D. and Waters, K. (1995). Realistic modeling for facial animation. Proceedings of SIGGRAPH '95, pp. 55-62.

Sera, H., Morishima, S. and Terzopoulos, D. (1996). Physics-based muscle model for mouth shape control. Proceedings of Robot and Human Communication '96 (RO-MAN '96), pp. 207-212.