Face Image Analysis and Synthesis for Human-Computer Interaction

Shigeo MORISHIMA
Seikei University
3-3-1 Kichijoji-kitamachi, Musashino, Tokyo 180-8633, Japan
[email protected]

Abstract

In this paper, we describe recent research results on how to generate an avatar's face in a real-time process that exactly copies a real person's face. For the synthesis of a realistic avatar, it is very important to precisely duplicate the emotion and impression contained in the original face image and voice. A face fitting tool based on multi-angle camera images is introduced to make a real 3D face model with texture and geometry very close to the original. When the avatar is speaking, the voice signal is essential for deciding the mouth shape, so a real-time mouth shape control mechanism is proposed that converts speech parameters into lip shape parameters using a multi-layer neural network. For dynamic modeling of facial expression, a muscle structure constraint is introduced to generate facial expressions naturally with a few parameters. We also try to obtain the muscle parameters that decide an expression automatically from local motion vectors on the face, calculated by optical flow in a video sequence, and to control this artificial muscle model directly by EMG signals. To obtain more reality, a hair modeling method is also introduced, and the dynamics of hair in a stream of wind can be simulated with low calculation cost. By using these several kinds of multi-modal signal sources, a very natural face image and its impression can be duplicated on the avatar's face.

1. Introduction

Recently, research into creating friendly human interfaces has flourished remarkably. Such interfaces smooth communication between a computer and a human. One style is to have a virtual human, or avatar, appear on the computer terminal; it should be able to understand and express not only linguistic information but also non-verbal information, similar to human-to-human communication in a face-to-face style.
A very important factor in making an avatar look believable or alive is how precisely it can duplicate a real human's facial expression and impression. Especially in communication applications using an avatar, real-time processing with low delay is indispensable.
Our final goal is to generate a virtual space close to a real communication environment between network users, or between human and machine. An avatar in cyberspace projects the features of each user: it has a real texture-mapped face and generates facial expressions and actions controlled by multi-modal input signals. A user can also get a view of cyberspace through the avatar's eyes, so he can communicate with other people by crossing gazes.
In section 2, a face fitting tool using multi-angle camera images is introduced to make a real 3D face model with texture and geometry very close to the original. This fitting tool is a GUI-based system; easy mouse operation, picking up each feature point on the face contour and face parts, supports simple construction of a 3D personal face model.
When the avatar is speaking, the voice signal is essential for determining the mouth shape, so a real-time mouth shape control mechanism is proposed that converts speech parameters into lip shape parameters using a neural network. The network realizes an interpolation between specific mouth shapes given as learning data (Morishima and Harashima 1991)(Morishima 1997). Emotional factors can sometimes also be captured from the speech parameters. This media conversion mechanism is described in section 3.
For dynamic modeling of facial expression, a muscle structure constraint is introduced to make facial expressions naturally with a few parameters. We also try to obtain the muscle parameters automatically from local motion vectors on the face calculated by optical flow in a video sequence, so that the 3D facial expression transition is duplicated from the original person by analysis of 2D camera-captured video without landmarks on the face. These topics are presented in section 4, together with an attempt to control the artificial muscle model directly by EMG signal processing.
To add more reality to the head, a hair modeling method is introduced; the dynamics of hair in a stream of wind can be simulated with low calculation cost. This is presented in section 5. By using these several kinds of multi-modal signal sources, a very natural face image and its impression can be duplicated on the avatar's face.

2. Face Modeling

To generate a realistic avatar's face, a generic face model is manually adjusted to the user's face image. To produce a personal 3D face model, at least the user's frontal face image and profile image are necessary. The generic face model has all of the control rules for facial expressions, defined by FACS parameters as 3D movements of grid points that modify the geometry.
Fig.1 shows a personal model both before and after the fitting process for the front-view image, using our original GUI-based face fitting tool. The front-view image and profile image are given to the system, and the corresponding control points are manually moved to reasonable positions by mouse operation. The synthesized face is produced by mapping a blended texture, generated from the user's frontal and profile images, onto the modified personal face model. However, self-occlusion sometimes occurs, and then texture for the occluded part of the face cannot be captured from the front and profile images alone. Also, to construct the 3D model more accurately, we introduce a multi-view face image fitting tool.
Fig.2(b) shows the fitting result for an image captured from the bottom angle, to compensate for the texture behind the jaw. Fig.3 shows the fitting result for a face image taken from an oblique angle; the rotation angle of the face model can be controlled in the GUI preview window to achieve the best fit to a face image captured from any arbitrary angle. Fig.4 shows the full face texture projected onto cylindrical coordinates. This texture is projected onto the 3D personal model, which has been adjusted to fit the multi-view images, to synthesize a face image. Fig.5 shows examples of reconstructed faces: Fig.5(a) uses 9 views and Fig.5(b) uses only the frontal and profile views. Much better image quality is achieved by the multi-view fitting process.

Fig.1 Frontal model fitting by GUI tool: a) Initial model, b) Fitted model
Fig.2 Multi-view fitting tool: a) Profile view, b) Bottom view
Fig.3 Multi-view fitting to oblique angle
Fig.4 Cylindrical face texture
Fig.5 Reconstructed face: a) Multi-view, b) Two-view
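As a rough illustration of the cylindrical texture coordinates mentioned above, the following Python sketch projects 3D face-model vertices onto a cylinder around the vertical axis. The axis convention, vertex layout and normalization are assumptions for illustration only, not the implementation of our fitting tool.

import numpy as np

def cylindrical_uv(vertices, y_min, y_max):
    """Project 3D vertices (N x 3, head roughly centered on the vertical y axis)
    onto cylindrical texture coordinates in [0, 1] x [0, 1].
    u is the azimuth around the head, v is the normalized height.
    Illustrative sketch only; axis and range conventions are assumed."""
    x, y, z = vertices[:, 0], vertices[:, 1], vertices[:, 2]
    theta = np.arctan2(x, z)                 # angle around the head, -pi..pi
    u = (theta + np.pi) / (2.0 * np.pi)      # normalize azimuth to 0..1
    v = (y - y_min) / (y_max - y_min)        # normalize height to 0..1
    return np.stack([u, np.clip(v, 0.0, 1.0)], axis=1)

# Example: three dummy vertices near the front, right side and back of a head.
verts = np.array([[0.0, 1.6, 0.1], [0.1, 1.7, 0.0], [0.0, 1.5, -0.1]])
print(cylindrical_uv(verts, y_min=1.4, y_max=1.8))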
3. Voice Driven Talking Avatar

To realize lip synchronization, the spoken voice is analyzed and converted into mouth shape parameters with a neural network on a frame-by-frame basis. A multiple-user communication system in cyberspace is constructed on a server-client basis. In this system, only a few parameters and the voice signal are transmitted through the network, and the avatar's face is synthesized from these parameters locally at the client system.

Fig.6 Network for parameter conversion (L: LPC cepstrum, M: mouth shape parameter)

3.1 Parameter Conversion

At the server system, the voice from each client is phonetically analyzed and converted into mouth shape and expression parameters.
LPC cepstrum parameters are converted into mouth shape parameters by a neural network trained on vowel features. Fig.6 shows the neural network structure for this parameter conversion. A 20-dimensional cepstrum parameter vector is calculated every 32 ms with a 32 ms frame length. At the client system, the on-line captured voice of each user is digitized at 16 kHz and 16 bits and transmitted to the server system frame by frame. The estimated emotions include Happiness and Sadness; each basic emotion has specific facial expression parameters described by FACS (Facial Action Coding System) (Ekman and Friesen 1978).
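The sketch below illustrates, in Python, the kind of frame-by-frame mapping performed by this network: one 20-dimensional LPC cepstrum frame goes in, mouth shape parameters come out. The hidden-layer width, the random weights and the number of output mouth parameters are placeholders, not the trained network of our system.

import numpy as np

def lpc_cepstrum_to_mouth(cep, W1, b1, W2, b2):
    """Map one 20-dimensional LPC cepstrum frame to mouth shape parameters
    with a small multi-layer network (one sigmoid hidden layer).
    Layer sizes and weights are placeholders, not trained values."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ cep + b1)))    # hidden layer, sigmoid units
    return W2 @ h + b2                            # linear output: mouth shape parameters

# Toy usage with random weights: 20-dim cepstrum in, 6 assumed mouth parameters out.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(32, 20)), np.zeros(32)
W2, b2 = rng.normal(size=(6, 32)), np.zeros(6)
frame = rng.normal(size=20)                       # one 32 ms analysis frame
print(lpc_cepstrum_to_mouth(frame, W1, b1, W2, b2))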

3.2 Mouth Shape Editor


Mouth shape can be easily edited by our mouth
shape editor(Fig.7). We can change each mouth
parameter to decide a specific mouth shape on
preview window. Typical vowel mouth shapes are c) Shape for /U/ d) Shape for lo/
shown in Fig.8. Our special mouth model has Fig.8 Qpical mouth shapes
polygons for mouth inside and teeth. Tongue model
is now under construction. For parameter conversion
from LPC Cepstrum to mouth shape, only mouth
shapes for 5 vowel and nasals are defined as training
set. We have defined all of the mouth shapes for
I
Japanese phoneme and English phoneme by using this
mouth shape editor. Fig. 9 shows a synthesized
avatar's face speaking phoneme /a/.

3.3 Multiple User Communication System

Each user can walk through and fly through cyberspace by mouse control, and the current locations of all users are always observed by the server system. The avatar image is generated in a local client machine from the location information sent by the server system.
The emotion condition can always be decided from the voice, but a user can also give his avatar a specific emotion condition by pushing a function key; this process works with first priority. For example, pushing "anger" makes the avatar's angry face come out.
The location information of each avatar, the mouth shape parameters and the emotion parameters are transmitted to the client systems every 1/30 seconds. The distance between every two users is calculated from the avatar location information, and the voice of every other user is mixed and amplified with a gain according to that distance, so the voice from the nearest avatar is very loud and one from far away is silent.

Fig.9 Synthesized face speaking /a/
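A minimal sketch of this distance-dependent mixing is given below; the inverse-distance gain law and its constants are illustrative assumptions, since the text only states that the gain decreases as the avatars get farther apart.

import math

def mixing_gain(listener_pos, speaker_pos, rolloff=1.0, min_gain=0.0):
    """Toy distance-based gain for mixing another user's voice.
    The 1/(1+d) law and the constants are assumptions for illustration."""
    dx = speaker_pos[0] - listener_pos[0]
    dy = speaker_pos[1] - listener_pos[1]
    dz = speaker_pos[2] - listener_pos[2]
    d = math.sqrt(dx * dx + dy * dy + dz * dz)
    return max(min_gain, 1.0 / (1.0 + rolloff * d))

# A nearby avatar is loud, a distant avatar is nearly silent.
print(mixing_gain((0, 0, 0), (1, 0, 0)))     # ~0.5
print(mixing_gain((0, 0, 0), (50, 0, 0)))    # ~0.02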
Based on the facial expression parameters and mouth shape parameters, the avatar face is synthesized frame by frame, and the avatar body is placed in cyberspace according to the location information. There are two display modes: a view from the avatar's own eyes for eye contact, and a view from the sky to search for other users in cyberspace; these views can be chosen from a menu in the window. Fig.10 shows a communication system in cyberspace with avatars.

3.4 User Adaptation

When a new user comes in, his face model and voice model have to be registered before operation. For the voice, new training of the neural network would ideally be performed; however, it takes a very long time for backpropagation to converge. To simplify face model construction and voice learning, a GUI tool for speaker adaptation is prepared. To register the face of a new user, the generic 3D face model is modified to fit the input face image. The expression control rules are defined on the generic model, so every user's face can be modified in the same way to generate the basic expressions using the FACS-based expression control mechanism.
For voice adaptation, 75 persons' voice data including the 5 vowels are pre-captured, and a database of neural network weights and voice parameters is constructed. Speaker adaptation is then performed by choosing the optimum weight set from this database; training of the neural network for all 75 persons' data is finished before operation. When a new, non-registered speaker comes in, he has to speak the 5 vowels into the microphone. The LPC cepstrum is calculated for each of the 5 vowels and given to the neural network. The mouth shape is then calculated with each candidate weight set, and the error between the true mouth shape and the generated mouth shape is evaluated. This process is applied to the whole database one entry at a time, and the optimum weight set is selected as the one giving the minimum error.
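The following Python sketch outlines this selection procedure under simplifying assumptions: the weight database, the reference vowel mouth shapes and the squared-error criterion are placeholders used only to show the search over the 75 pre-trained weight sets.

import numpy as np

def select_speaker_weights(vowel_frames, weight_db, target_shapes, forward):
    """Pick the pre-trained weight set that best fits a new speaker.
    vowel_frames:  LPC cepstrum vectors for the 5 vowels spoken by the new user.
    weight_db:     list of per-speaker weight sets (75 in our database).
    target_shapes: reference mouth shape parameters for the same 5 vowels.
    forward:       hypothetical function(weights, cepstrum) -> mouth shape parameters.
    Data layout and error measure are assumptions for illustration."""
    best_idx, best_err = None, float("inf")
    for idx, weights in enumerate(weight_db):
        err = 0.0
        for cep, target in zip(vowel_frames, target_shapes):
            pred = forward(weights, cep)
            err += float(np.sum((np.asarray(pred) - np.asarray(target)) ** 2))
        if err < best_err:                 # keep the weight set with minimum total error
            best_idx, best_err = idx, err
    return best_idx, best_err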
3.5 System Feature

A natural communication environment between multiple users in cyberspace is realized by transmission of natural voice and real-time synthesis of the avatars' facial expressions. The current system works with 3 users in an intra-network environment at a speed of 16 frames per second on an SGI Max IMPACT. An emotion model (Morishima 1996) is introduced to improve the communication environment, and gaze tracking and mouth closing point detection can be realized by a pan-tilt-zoom control camera.

3.6 Entertainment Application

When people watch a movie, they sometimes overlap their own figure with the actor's image. The interactive movie system we constructed is an image creation system in which the user can control the facial expression and lip motion of his own face image inserted into a movie scene. The user gives voice input by microphone and pushes keys which determine the expression and special effects, and his own video program is generated in real time.
At first, a frontal face image of the visitor is captured by a camera, and the generic 3D wireframe model is fitted onto the user's face image to generate a personal 3D surface model. Facial expression is synthesized by controlling the grid points of the face model and texture mapping. For speaker adaptation, the visitor has to speak the 5 vowels to choose an optimum weight set from the database.
In the interactive process, a famous movie scene plays and the face of the actor or actress is replaced with the visitor's face. The facial expression and lip shape are controlled synchronously by the captured voice; an active camera tracks the visitor's face, and the facial expression is also controlled by computer-vision-based face image analysis. Fig.11 shows the original movie clip and Fig.12 shows the result of fitting the face model into this scene. Fig.13 shows a user's face inserted onto the actor's face after color correction. Any expression can be appended and any scenario can be given by the user, independent of the original story, in this interactive movie system.
4. Muscle Constraint Face Model

Muscle-based face image synthesis is one of the most realistic approaches to the realization of a life-like agent in computers. A facial muscle model is composed of facial tissue elements and simulated muscles. In this model, the forces acting on a facial tissue element are calculated from the contraction of each muscle string, so the combination of the muscle contraction forces decides a specific facial expression. This muscle parameter is normally determined on a trial-and-error basis by comparing a sample photograph with a generated image using our Muscle-Editor. In this section, we propose a strategy for automatic estimation of the facial muscle parameters from 2D optical flow using a neural network. This corresponds to 3D facial motion capture from a 2D camera image under physics-based conditions, without markers.
We introduce a multi-layer backpropagation network for the estimation of the muscle parameters. The facial expression is then re-synthesized from the estimated muscle parameters to evaluate how well the impression of the original expression can be recovered. We also generate animation using data captured from an image sequence. As a result, we can synthesize an image sequence which gives an impression very close to the original video.

Fig.11 Original movie clip
Fig.12 Fitting result of face model
4.1 Layered Dynamic Tissue Model

The human skull is covered by deformable tissue which has five distinct layers. Four layers (epidermis, dermis, subcutaneous connective tissue, and fascia) comprise the skin, and the fifth layer comprises the muscles responsible for facial expression. In accordance with the structure of real skin, we employ a synthetic tissue model constructed from the elements illustrated in Fig.14, consisting of nodes interconnected by deformable springs (the lines in the figure). The epidermal surface is defined by nodes 1, 2 and 3, which are connected by epidermal springs. The epidermal nodes are also connected by dermal-fatty layer springs to nodes 4, 5 and 6, which define the fascia surface. Fascia nodes are interconnected by fascia springs, and they are also connected by muscle layer springs to skull surface nodes 7, 8 and 9.
The facial tissue model is implemented as a collection of node and spring data structures. The node data structure includes variables representing the nodal mass, position, velocity, acceleration and net force. The spring data structure comprises the spring stiffness, the natural length of the spring, and pointers to the data structures of the two nodes interconnected by the spring. Newton's laws of motion govern the response of the tissue model to force (Lee, Terzopoulos and Waters 1995). This leads to a system of coupled, second-order ordinary differential equations that relate the node positions, velocities and accelerations to the nodal forces.

Fig.13 Reconstructed face by user's
Fig.14 Layered tissue element (epidermal surface nodes 1,2,3; fascia surface nodes 4,5,6; bone surface nodes 7,8,9)
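To make these dynamics concrete, the Python sketch below advances a toy node-spring system by one explicit Euler step. The integrator, the damping term and all constants are illustrative assumptions; the text only states that Newton's laws lead to coupled second-order ODEs relating node positions, velocities and accelerations to the nodal forces.

import numpy as np

def step_tissue(pos, vel, mass, springs, external, dt=1e-3, damping=0.1):
    """One explicit-Euler step of a toy node-spring tissue model.
    pos, vel: (N, 3) node positions and velocities; mass: (N,) nodal masses.
    springs:  list of (i, j, stiffness, rest_length) tuples.
    external: (N, 3) external forces such as muscle pull.
    Constants and integrator are assumptions, not the paper's solver."""
    force = external - damping * vel                # viscous damping on each node
    for i, j, k, rest in springs:
        d = pos[j] - pos[i]
        length = np.linalg.norm(d) + 1e-12
        f = k * (length - rest) * (d / length)      # linear spring force along the edge
        force[i] += f
        force[j] -= f
    acc = force / mass[:, None]                     # Newton's second law: a = F / m
    return pos + dt * vel, vel + dt * acc

# Two nodes joined by one spring, stretched beyond its rest length of 1.0.
pos = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0]])
vel = np.zeros_like(pos)
pos, vel = step_tissue(pos, vel, np.array([1.0, 1.0]), [(0, 1, 50.0, 1.0)], np.zeros_like(pos))
print(pos, vel)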
4.2 Facial Muscle Model

Fig.15 shows our simulated muscles. Each black line indicates the location of a facial muscle in the layered tissue face model. Each muscle is modeled as a combination of separate muscle springs, and the left and right parts of the face are symmetric. For example, the frontalis has a triangularly spread shape, so three linear muscles model the frontalis on each side. The frontalis pulls up the eyebrows and makes wrinkles in the forehead. The corrugator, which is modeled by 4 springs located between the eyes, pulls the eyebrows together and makes wrinkles between the left and right eyebrows. However, those muscles cannot pull the eyebrows down or make the eyes thin, so the orbicularis oculi muscle is also included in our model. The orbicularis oculi is separated into an upper part and a lower part; to keep muscle control simple around the eye area, it is modeled with a single function in our model. Normally, muscles are located between a bone node and a fascia node, but the orbicularis oculi has an irregular style, whereby it is attached between fascia nodes in a ring configuration; 8 linear muscles approximate the ring muscle, and contraction of the ring muscle makes the eye thin.
The muscles around the mouth are very important for the production of speaking scenes. Most Japanese speaking scenes are composed of vowels, so we mainly focused on the reproduction of vowel mouth shapes as a first step and relocated the muscles around the mouth (Sera, Morishima and Terzopoulos 1996). As a result, the final facial muscle model has 14 muscle springs in the forehead area and 27 muscle springs in the mouth area.

Fig.15 Facial muscle model
Fig.16 Motion feature vector on face: a) Optical flow, b) Feature window
4.3 Motion Capture on Face

A personal face model is constructed by fitting the generic control model to personal range data. Optical flow vectors are calculated in a video sequence, and the motion is accumulated over the sequence from neutral to each expression. The motion vectors are then averaged in each window shown in Fig.16(b); the window locations are determined automatically in each video frame.
A layered neural network finds a mapping from the motion vector to the muscle parameters. A four-layer structure is chosen to effectively model the nonlinear behavior. The first layer is the input layer, which corresponds to the 2D motion vector. The second and third layers are hidden layers: units of the second layer have a linear function and those of the third layer have a sigmoid function. The fourth layer is the output layer, corresponding to the muscle parameters, and it has linear units. Linear functions in the input and output layers are introduced to maintain the range of the input and output values.
A simpler neural network structure helps to speed up convergence in learning and to reduce the calculation cost of the parameter mapping, so the face area is divided into three subareas as shown in Fig.17: the mouth area, left-eye area and right-eye area, each giving an independent skin motion. Three independent neural networks are prepared for these three areas.
Learning patterns are composed of basic expressions and the combinations of their muscle contraction forces. We create 6 basic facial expressions consisting of anger, disgust, fear, happiness, sadness and surprise, basic mouth shapes for the vowels /a/, /i/, /u/, /e/ and /o/, and a closed mouth shape.
A motion vector is given to the neural network for each video frame, and a facial animation is generated from the output muscle parameter sequence. This tests the effect of interpolation in our parameter conversion method, based on the generalization of the neural network. Fig.18 shows a result of expression regeneration for surprise from an original video sequence.

Fig.17 Conversion from motion to animation: 2D optical flow is mapped by the 4-layer networks to muscle parameters driving the 3D animation
Fig.18 Face expression regeneration: a) Original, b) Regenerated
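A minimal Python sketch of this four-layer mapping is given below, with a linear second layer, a sigmoid third layer and a linear output layer as described; the input layout, the layer widths and the random weights are placeholders rather than the trained subnetworks.

import numpy as np

def motion_to_muscle(motion_vec, W1, W2, W3):
    """Four-layer mapping from windowed 2D motion vectors to muscle parameters:
    linear input, linear second layer, sigmoid third layer, linear output,
    following the layer types in the text. Sizes and weights are placeholders."""
    h1 = W1 @ motion_vec                       # second layer: linear units
    h2 = 1.0 / (1.0 + np.exp(-(W2 @ h1)))      # third layer: sigmoid units
    return W3 @ h2                             # output layer: linear muscle parameters

# Toy usage for an assumed mouth-area subnetwork: 8 windows x 2 flow components in,
# 27 mouth-area muscle springs out.
rng = np.random.default_rng(1)
flow = rng.normal(size=16)                     # flattened, window-averaged optical flow (assumed layout)
W1, W2, W3 = rng.normal(size=(24, 16)), rng.normal(size=(24, 24)), rng.normal(size=(27, 24))
print(motion_to_muscle(flow, W1, W2, W3))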
4.4 Muscle Control by EMG

EMG data is the voltage waveform captured by a needle inserted directly into each muscle, so the features of the waveform express the contraction state of the muscle. In particular, 7 wires are inserted into the muscles around the mouth to build a model of mouth shape control.
The conversion from EMG data to muscle parameters is as follows. First, the average power of the EMG waveform is calculated every 1/30 seconds; the sampling frequency of the EMG is 2.5 kHz. The maximum power is equated to the maximum muscle contraction strength by normalization. The converted contraction strength of each muscle is then given every 1/30 seconds and the facial animation scene is generated. However, the jaw is controlled directly by a marker position located on the subject's jaw, independent of the EMG data.
EMG data is captured from a subject speaking the 5 vowels. 7 kinds of waveforms are captured and 7 muscle contractions are decided. The face model is then modified and the mouth shape for each vowel is synthesized. The result is shown in Fig.19, where the camera-captured image and the synthesized image can be compared; the impressions are very close to each other. Next, the EMG time sequence is converted into muscle parameters every 1/30 seconds and a facial animation is generated. Each sample is a sentence of about 3 seconds. Good quality animation of the spoken sentence is achieved by picking up impulses from the EMG signal and activating each appropriate muscle. This data offers the precise transition feature of each muscle contraction.

Fig.19 Mouth shape control by EMG
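The frame-wise power computation described above can be sketched as follows in Python; the mean-square power measure and the peak normalization follow the text, while the synthetic test signal and windowing details are assumptions.

import numpy as np

def emg_to_contraction(emg, fs=2500, frame_rate=30):
    """Convert a raw EMG waveform for one muscle into per-frame contraction strength.
    Average power is computed over each 1/30 s frame (2.5 kHz sampling) and
    normalized so the maximum observed power maps to full contraction (1.0).
    Windowing details are assumed for illustration."""
    samples_per_frame = fs // frame_rate
    n_frames = len(emg) // samples_per_frame
    frames = np.asarray(emg[: n_frames * samples_per_frame]).reshape(n_frames, samples_per_frame)
    power = np.mean(frames ** 2, axis=1)             # average power per 1/30 s frame
    peak = power.max()
    return power / peak if peak > 0 else power       # contraction strength in [0, 1]

# Toy usage: 3 seconds of synthetic EMG with slowly increasing activity.
rng = np.random.default_rng(2)
signal = rng.normal(scale=np.linspace(0.1, 1.0, 3 * 2500))
print(emg_to_contraction(signal)[:5])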
5. Dynamic Hair Modeling

The naturalness of hair is a very important factor in the visual impression, yet hair is usually treated with a very simple model or as a texture. Because real hair has a huge number of strands with complex features, it is one of the hardest targets to model by computer graphics. In this section, a new hair style modeling system is presented. This system helps a designer create any arbitrary hair style by computer graphics with easy operations using a tuft editor. Each piece of hair has an independent model, and dynamic motion can easily be simulated by solving its equation of motion. Each piece has only a few segments and its shape is modeled by a 3D B-spline curve, so the calculation cost is not very high.

5.1 Modelling of Hair

It is necessary to model each piece of hair to generate natural dynamic motion when wind is blowing. However, hair is very complex, and a real head has more than 100,000 strands, so a polygon model of the hair would need a huge amount of memory for the whole head. In this paper, each piece has a few segments to be controlled and is expressed with a 3D B-spline curve.

5.2 Tuft Model

In our designing system, a head is composed of about 3400 polygons, and a 3D B-spline curve comes out of each polygon to create the hair style. The hair generation area is composed of about 1300 polygons, and one specific hair feature is generated and copied for each polygon. To manipulate the hair into a specific hair style, some hair features are treated together as one tuft. Each tuft is composed of more than 7 squares through which the hair segments pass. Each square is the gathering of control points of the hair segments, so manipulation of these squares can realize any hair style by rotation, shift and modification.
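As an illustration of representing a strand with a few control points, the Python sketch below samples a 3D uniform cubic B-spline; the uniform basis, the number of control points and the sampling density are assumptions, not the internal representation of our designing system.

import numpy as np

def cubic_bspline_points(ctrl, samples_per_segment=8):
    """Sample a 3D uniform cubic B-spline hair strand from a few control points (k x 3).
    One cubic segment is evaluated per 4 consecutive control points.
    Basis choice and sampling density are assumptions for illustration."""
    ctrl = np.asarray(ctrl, dtype=float)
    pts = []
    for i in range(len(ctrl) - 3):
        p0, p1, p2, p3 = ctrl[i], ctrl[i + 1], ctrl[i + 2], ctrl[i + 3]
        for t in np.linspace(0.0, 1.0, samples_per_segment, endpoint=False):
            b0 = (1 - t) ** 3 / 6.0                          # uniform cubic B-spline basis
            b1 = (3 * t ** 3 - 6 * t ** 2 + 4) / 6.0
            b2 = (-3 * t ** 3 + 3 * t ** 2 + 3 * t + 1) / 6.0
            b3 = t ** 3 / 6.0
            pts.append(b0 * p0 + b1 * p1 + b2 * p2 + b3 * p3)
    return np.array(pts)

# A strand with 5 control points gives 2 cubic segments of sampled positions.
strand = cubic_bspline_points([[0, 0, 0], [0, 1, 0.1], [0.1, 2, 0.2], [0.2, 3, 0.1], [0.2, 4, 0.0]])
print(strand.shape)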
5.3 Hair Style Designing System

There are 3 processes to generate a hair style by GUI, and each process is easily operated by mouse control. The 1st process is to decide the initial region on the head from which a tuft comes out. The 2nd process is to manipulate the hair tuft by modifying its squares through rotation, shift and expansion. The 3rd process is matching the tuft onto the surface of the head. After these processes, every polygon gets one hair feature, and in total 1300 features are produced. Finally, 20,000 to 30,000 pieces of hair for a static image, or 100,000 to 200,000 pieces for animation, are generated by copying the feature in each polygon. The GUI tool for hair designing is shown in Fig.20.

5.4 Rendering

Each hair piece is modelled with a 3D B-spline curve and is also treated as a very thin circular pipe, so a surface normal can be decided at each pixel to apply the Lambert and Phong models.
A typical hair style, "Dole Bob", is shown in Fig.21, and Fig.22 shows an example of modeling a real hair style. The impression of the generated hair is very natural.

Fig.20 GUI tool for hair designing
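One simple way to obtain a per-pixel normal for such a thin pipe is sketched below in Python: the eye direction is made perpendicular to the strand tangent and then used in standard Lambert and Phong terms. This particular normal construction and the shininess value are assumptions; the text does not specify the exact formulation.

import numpy as np

def shade_hair_point(tangent, to_light, to_eye, shininess=32.0):
    """Lambert + Phong shading for a point on a thin hair pipe.
    The pipe normal is the eye direction with its tangent component removed,
    which is one plausible choice, not necessarily the paper's."""
    t = tangent / np.linalg.norm(tangent)
    l = to_light / np.linalg.norm(to_light)
    e = to_eye / np.linalg.norm(to_eye)
    n = e - np.dot(e, t) * t                      # normal on the pipe facing the viewer
    n /= np.linalg.norm(n) + 1e-12
    diffuse = max(np.dot(n, l), 0.0)              # Lambert term
    r = 2.0 * np.dot(n, l) * n - l                # reflected light direction
    specular = max(np.dot(r, e), 0.0) ** shininess if diffuse > 0 else 0.0
    return diffuse, specular

print(shade_hair_point(np.array([0.0, 1.0, 0.0]),
                       np.array([1.0, 1.0, 1.0]),
                       np.array([0.0, 0.0, 1.0])))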

6. Conclusion

To generate a realistic avatar's face for face-to-face communication, multi-modal signals are introduced to duplicate the original facial expression. Voice is used to realize lip synchronization and expression control. Video-captured images are used to regenerate the original facial expression under the facial muscle constraint. EMG data is used to control the artificial muscles directly. Finally, a hair modeling method is proposed to make the avatar more natural and believable.

Fig.21 Example of hair style
Fig.22 Copying real hair style

References

Morishima, S. and Harashima, H. (1991). A Media Conversion from Speech to Facial Image for Intelligent Man-Machine Interface. IEEE JSAC, Vol.9, No.4, pp.594-600.

Morishima, S. (1997). Virtual Face-to-Face Communication Driven by Voice Through Network. Workshop on Perceptual User Interfaces, pp.85-86.

Ekman, P. and Friesen, W.V. (1978). Facial Action Coding System. Consulting Psychologists Press Inc.

Essa, I., Darrell, T. and Pentland, A. (1994). Tracking Facial Motion. Proceedings of the Workshop on Motion of Non-Rigid and Articulated Objects, pp.36-42.

Mase, K. (1991). Recognition of Facial Expression from Optical Flow. IEICE Transactions, Vol.E74, No.10.

Morishima, S. (1996). Modeling of Facial Expression and Emotion for Human Communication System. Displays 17, pp.15-25, Elsevier.

Lee, Y., Terzopoulos, D. and Waters, K. (1995). Realistic Modeling for Facial Animation. Proceedings of SIGGRAPH '95, pp.55-62.

Sera, H., Morishima, S. and Terzopoulos, D. (1996). Physics-based Muscle Model for Mouth Shape Control. Proceedings of Robot and Human Communication '96 (RO-MAN '96), pp.207-212.
