
Automated Blendshape Personalization for Faithful Face Animations Using Commodity Smartphones

Timo Menzel (TU Dortmund University, Dortmund, Germany)
Mario Botsch (TU Dortmund University, Dortmund, Germany)
Marc Erich Latoschik (Julius-Maximilians-Universität Würzburg, Würzburg, Germany)

ABSTRACT
Digital reconstruction of humans has various interesting use-cases. Animated virtual humans, avatars and agents alike, are the central entities in virtual embodied human-computer and human-human encounters in social XR. Here, a faithful reconstruction of facial expressions becomes paramount due to their prominent role in non-verbal behavior and social interaction. Current XR platforms, like Unity 3D or the Unreal Engine, integrate recent smartphone technologies to animate faces of virtual humans by facial motion capturing. Using the same technology, this article presents an optimization-based approach to generate personalized blendshapes as animation targets for facial expressions. The proposed method combines a position-based optimization with a seamless partial deformation transfer, necessary for a faithful reconstruction. Our method is fully automated and considerably outperforms existing solutions based on example-based facial rigging or deformation transfer, and overall results in a much lower reconstruction error. It also neatly integrates with recent smartphone-based reconstruction pipelines for mesh generation and automated rigging, further paving the way to a widespread application of human-like and personalized avatars and agents in various use-cases.

CCS CONCEPTS
• Computing methodologies → Mesh geometry models; Motion capture.

KEYWORDS
virtual humans, face animation, blendshapes, personalization, deformation transfer, facial rigging

ACM Reference Format:
Timo Menzel, Mario Botsch, and Marc Erich Latoschik. 2022. Automated Blendshape Personalization for Faithful Face Animations Using Commodity Smartphones. In 28th ACM Symposium on Virtual Reality Software and Technology (VRST '22), November 29-December 1, 2022, Tsukuba, Japan. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3562939.3565622

This work is licensed under a Creative Commons Attribution International 4.0 License. VRST '22, November 29-December 1, 2022, Tsukuba, Japan. © 2022 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-9889-3/22/11. https://doi.org/10.1145/3562939.3565622

1 INTRODUCTION
A faithful digital reconstruction of humans has various interesting use-cases throughout the media industry and beyond. Virtual actors mimicking real actors are becoming increasingly commonplace since their debut in the 1987 CGI movie Rendez-vous in Montreal, see, e.g., the late creations from the Star Wars franchise. Similarly, the fidelity of interactive Virtual Humans (VHs) [33] also significantly advanced, as demonstrated by commercial developments like Epic Games' MetaHumans. Here, high-fidelity digital reconstructions of humans open up promising applications in computer games as well as in Virtual, Augmented, and Mixed Reality (VR, AR, MR: XR for short) [7]. These applications include human-computer interactions with computer-controlled VHs, so-called virtual agents [11, 32, 37], as well as mediated human-human encounters with user-controlled VHs, so-called avatars (see, e.g., [3, 21, 22, 29, 38]), in future social XRs.

Facial expressions are a central channel of non-verbal behavior. Their prominent role in social interaction has been confirmed for quite some time now [13, 40]. There is evidence that non-verbal behavior conveys the majority of information communicated [34, 42]. Facial expressions are specifically prime conveyors of "emotions, attitudes, interpersonal roles, and severity of pathology" [13, p. 50]. Overall, facial expressions are an important modality of human-human interaction. Therefore, the synthesis and reconstruction of facial expressions of VHs and their effects on observers have been intensively researched [33], as has their role in non-verbal interaction between avatars (see, e.g., [41, 42]).

Faithful digital reconstruction of humans for interactive applications faces unique challenges. Most approaches capture the outer appearance using depth cameras [30] or photogrammetry [1, 15], and then rig the resulting mesh for subsequent animation by defining a weighted assignment of mesh vertices to skeleton bones for body animation and defining facial blendshapes for face animation. Today, this rigging process can be automated to a large extent by employing template models with predefined rigs: the skeletal rig can be transferred by non-rigid registration of the template to the target mesh [5], and the facial blendshapes are typically mapped from template to target using deformation transfer [47].

While generic template rigs have been shown to work well for body animation [31, 36], the template's facial blendshapes in general do not faithfully reconstruct a captured person's unique facial expressions. These differences are particularly problematic since humans are capable of detecting even subtle changes and deviations in human faces, which can potentially even lead to incongruent social cues [23, 42]. A solution to this problem is provided by using personalized blendshapes [10, 17, 20, 26], which are derived from a training set of facial expressions. However, existing approaches still vary considerably in reconstruction accuracy (hence faithfulness), level of automation, ease of use, and applicability.

Contribution: In this paper we leverage the capabilities of recent smartphone technologies (in particular Apple's ARKit) to generate personalized blendshapes for a given avatar in a fully automatic and easy-to-use manner. Our approach ideally complements recent smartphone-based avatar reconstructions (e.g., [51]) and perfectly matches the smartphone-based facial motion capturing provided in Unity 3D and the Unreal Engine. Using ARKit for capturing example facial expressions, we extend the widely used example-based facial rigging [26] to extract considerably more accurate facial blendshapes from these example expressions, which are then seamlessly implanted into a target avatar using a novel formulation of deformation transfer [47]. Compared to previous approaches, our extensions lead to considerably more accurate and hence more faithful reconstructions, reducing the reconstruction error for some cases down to 10%, while still being comparable in terms of computational cost.

2 RELATED WORK
While many different approaches for face animation have been proposed, such as skeleton and joint models [46] or physics-based muscle models [49], linear blendshape models [25] are still the most widely used technique – in particular in interactive XR applications. The individual blendshapes (or morph targets) are typically based on the facial action coding system (FACS) [14], giving them an anatomical as well as semantical meaning. The FLAME model of Li et al. [28] replaces the over-complete, linearly dependent FACS blendshape basis by an orthogonal PCA basis. While this is indeed better suited for tracking and reconstruction, their basis lacks semantic meaning and is therefore not suitable for several applications. To be as widely applicable as possible, our method employs standard linear facial blendshapes.

Generating the required blendshapes by scanning an actor in all these expressions is not possible in most situations. Hence the blendshapes of a generic face rig are typically transferred to the target avatar, using for instance RBF warps [19, 35, 45], non-rigid registration [16], or variants of deformation transfer [39, 47, 48]. The blendshapes generated this way match the anatomical dimensions of the target avatar, but they typically do not faithfully reconstruct the captured person's unique facial expressions.

Higher-quality approaches therefore personalize blendshapes by capturing an actor not only in the neutral pose, but also in a couple of example expressions [8, 18, 26, 50]. The optimization involved in the underlying example-based facial rigging process [26] is ill-posed, since both the blendshape meshes of the actor and the blendshape weights of the captured expressions are unknown. The method therefore depends on good initial guesses of the blendshape weights, which restricts the poses the actor can (or has to) perform. Li et al. [26] proposed a subset of 15 facial expressions the actor should perform. These expressions consist of more than one single activated blendshape; they are instead a combination of several activated blendshapes. Carrigan et al. [9] use an even smaller set of facial expressions by packing more basic blendshapes into the training expressions – making them harder to perform though.

In contrast, Ichim et al. [20] let the actor perform a sequence of dynamic facial expressions and video-record this performance using a smartphone. They detect facial feature points in each frame and non-rigidly deform their blendshapes to better fit these landmarks. However, the user has to check and correct the detected landmarks in about 25 of 1500 frames to account for errors of the facial feature detector. We also use a smartphone to record a facial performance of the actor, but leverage an iPhone's video and depth sensor and Apple's ARKit to capture blendshape weights and 3D geometry for each frame of this performance in a robust and fully automatic manner. Since our capturing process is easy and intuitive, we currently let the actor mimic all 52 ARKit blendshapes to get a maximum of personalization.

Han et al. [18] evaluated two methods to extract personalized blendshapes from expression scans. They recorded expressions and blendshape weights and fitted an autoencoder and a linear regression to obtain the blendshapes. However, while they achieve good results with respect to the reconstruction error, they do not compute semantically meaningful blendshapes, which disqualifies their approach for many applications. Li et al. [26] and Seol et al. [44] employ optimizations based on deformation gradients to extract personalized blendshapes from a set of training examples. Li et al. [26] fit the deformation gradients of the unknown blendshapes to the deformation gradients of the example expressions, regularized by (the deformation gradients of) the blendshapes of the generic rig to ensure semantically meaningful blendshapes. Blendshape weights and blendshape meshes are solved for in an alternating optimization, requiring careful initialization for convergence. Seol et al. [44] first fit a mesh to the scanned expressions and then separate the fitted expressions into the different blendshapes based on the vertex displacements in the generic rig. In our experiments, their blendshape separation method leads to strongly damped blendshapes though. Our approach leverages ARKit's face tracking and therefore avoids the alternating optimization for blendshape weights and blendshape meshes. As we show in Section 4, our fitting also leads to considerably more accurate results compared to [26].

Several approaches further increase the reconstruction accuracy by using corrective blendshapes or corrective deformation fields. The facial animation is then computed by the initial (generic) blendshape set, but it is refined with additional shapes, which add idiosyncrasies that cannot be represented by the initial blendshapes [6, 12, 17, 20, 27]. Our work focuses on computing personalized blendshapes without extra corrective shapes, to be compatible with standard blendshape pipelines in XR engines, but corrective fields could easily be added in future work. In contrast to many previous works that use a face/head avatar only, we implant the resulting personalized facial blendshapes into a full-body avatar using a modification of deformation transfer [4, 47].

3 METHOD
In this section we present our approach for extracting personalized blendshapes from a set of scanned facial expressions and how to transfer these blendshapes to an existing avatar. Figure 1 shows an overview of our pipeline. We begin by specifying the capturing process and the resulting input data (Section 3.1). From this training data we extract personalized blendshapes (Section 3.2), which are then transferred to the full-body avatar using our seamless partial deformation transfer (Section 3.3).
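As a structural sketch only, the three pipeline stages of Figure 1 can be organized as in the following Python skeleton. All function and type names (record_expressions, personalize_blendshapes, transfer_to_avatar, TrainingData) are hypothetical and are not part of the paper or of any library API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrainingData:
    """Hypothetical container for the captured input of Section 3.1."""
    neutral: np.ndarray        # B0, shape (n, 3): neutral face scan
    expressions: np.ndarray    # S_1..S_m stacked, shape (m, n, 3)
    weights: np.ndarray        # ARKit weights w_{i,j}, shape (m, m)

def record_expressions() -> TrainingData:
    """Stage 1 (Section 3.1): guided iPhone recording of the 52 ARKit expressions."""
    raise NotImplementedError("provided by the capture app")

def personalize_blendshapes(data: TrainingData) -> np.ndarray:
    """Stage 2 (Section 3.2): regularized least-squares fit of the delta-blendshapes."""
    raise NotImplementedError

def transfer_to_avatar(blendshapes: np.ndarray, avatar) -> None:
    """Stage 3 (Section 3.3): seamless partial deformation transfer onto the avatar."""
    raise NotImplementedError

def build_personalized_avatar(avatar):
    # Compose the three stages of the pipeline.
    data = record_expressions()
    blendshapes = personalize_blendshapes(data)
    transfer_to_avatar(blendshapes, avatar)
    return avatar
```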

Figure 1: Our pipeline from recording example expressions to the final avatar. Red boxes indicate user input. (Figure elements: iPhone recording, scanned expressions, neutral scan, ARKit template, example-based facial rigging, regularization, deformation transfer, blendshape approximation, seamless partial deformation transfer, initial avatar, personalized blendshapes, final avatar.)

3.1 Input Data
Our goal is to create personalized blendshapes for an existing full-body avatar. While any method could be used to generate that avatar, we employ the smartphone-based method of Wenninger et al. [51]. As a consequence, the complete avatar generation and personalization only requires a smartphone for capturing the person, which is in stark contrast to recent approaches based on complex photogrammetry rigs [1, 15] and makes the reconstruction of personalized avatars more widely available.

We capture the training data for the blendshape personalization using a custom application running on an iPhone 12 Pro (any recent ARKit-capable iOS device could be used as well). This application guides the user through the recording session, making the whole capturing process easy and intuitive. All recording is performed by the front-facing depth and color cameras, such that the user can watch instructions and get feedback while recording.

The user first scans their own face in a neutral expression, and is then asked to perform the facial expressions corresponding to the $m = 52$ ARKit blendshapes. For each of these blendshapes, the application shows a textual and pictorial explanation of the expression to be performed. The application continuously tracks the user's face using the ARKit framework and automatically captures the facial expression when the requested expression is performed to a sufficient extent. In particular, when capturing blendshape $i$ (for $i \in \{1, \dots, m\}$), we observe the blendshape weight $w_{i,i}$ corresponding to that blendshape, and once this weight exceeds a certain threshold and reaches a maximum over time (i.e., starts decreasing), we record both the current set of blendshape weights $(w_{i,1}, \dots, w_{i,52})$ as well as the current geometry $\mathbf{S}_i$ of the ARKit face mesh, where the latter consists of $n = 1220$ vertices and 2304 triangles. While we expect the blendshape weight $w_{i,i}$ to be dominant when capturing blendshape $i$, other non-vanishing blendshape weights $w_{i,j}$ do not pose a problem for our reconstruction (see next section), such that the user does not have to strictly perform the requested expression in isolation (which is hardly possible). With this automated recording procedure, the 52 requested facial expressions can be scanned in approximately 2 minutes.

3.2 Blendshape Personalization
While ARKit provides us with blendshape weights $(w_{i,1}, \dots, w_{i,m})$ and face meshes $\mathbf{S}_i$ for each expression $i \in \{1, \dots, m\}$ of the $m = 52$ ARKit blendshapes, it does not give access to the internal personalized blendshapes of the user. We therefore have to extract the personalized version of the ARKit blendshapes using a modification of example-based facial rigging [26].

The original method of Li et al. [26] fits the per-triangle deformation gradients of the unknown blendshapes to the deformation gradients of the scanned expressions in a first step, and then solves a linear least-squares system to extract the vertex positions best matching the fitted deformation gradients [4, 47]. This two-step procedure has the disadvantage that it does not directly penalize the deviation of the reconstructed vertex positions from the scanned expressions; it only indirectly encourages the vertex positions to match the target expressions. In contrast, we directly optimize for vertex positions that best match the observed expressions $\mathbf{S}_i$ in the least-squares sense (see Equation (1) below), which – as we will show in Section 4.1 – leads to more accurate results.

Given the blendshape weights $(w_{i,1}, \dots, w_{i,m})$ and face meshes $\mathbf{S}_i$ for the recorded expressions $i \in \{1, \dots, m\}$, as well as the neutral face scan $\mathbf{B}_0$, we have to compute the corresponding personalized delta-blendshapes $\mathbf{B}_1, \dots, \mathbf{B}_m$. Delta-blendshapes describe the displacement from the neutral expression to a predefined facial expression – the corresponding (non-delta) blendshape [25]. In the following, the term blendshape always refers to delta-blendshapes.

Our model then produces a facial expression from blendshape weights $w_1, \dots, w_m$ as
$$ \mathbf{B}_0 + \sum_{j=1}^{m} w_j \, \mathbf{B}_j . $$
Here, the matrix $\mathbf{B}_0 \in \mathbb{R}^{n \times 3}$ contains the $n = 1220$ vertex positions of the neutral face and the matrices $\mathbf{B}_1, \dots, \mathbf{B}_m \in \mathbb{R}^{n \times 3}$ contain the $n$ displacement vectors of the blendshapes, respectively.

We compute the blendshapes $\mathbf{B}_j$ by penalizing the distance from the observed training expressions $\mathbf{S}_i$, formulated as the cost function
$$ E_{\mathrm{fit}}(\mathbf{B}_1, \dots, \mathbf{B}_m) = \sum_{i=1}^{m} \Big\| \mathbf{B}_0 + \sum_{j=1}^{m} w_{i,j} \, \mathbf{B}_j - \mathbf{S}_i \Big\|^2 . \quad (1) $$
Since the blendshape weights $w_{i,j}$ were captured by ARKit and hence are known, minimizing (1) only requires solving a sparse linear least-squares system.
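As a concrete illustration, the following sketch solves the fitting problem (1) with NumPy, treating every vertex coordinate independently. The dense matrix layout, variable names, and toy data are our own simplifications; the paper additionally adds the regularization described next and solves a sparse system.

```python
import numpy as np

def fit_delta_blendshapes(B0, S, W):
    """Least-squares fit of delta-blendshapes as in Eq. (1), without regularization.

    B0 : (n, 3) neutral face vertices
    S  : (m, n, 3) scanned expression meshes S_1..S_m
    W  : (m, m) ARKit blendshape weights, W[i, j] = w_{i,j}
    Returns B : (m, n, 3) delta-blendshapes B_1..B_m
    """
    m, n, _ = S.shape
    # Right-hand side: per-expression displacement from the neutral scan.
    D = (S - B0[None, :, :]).reshape(m, n * 3)
    # Each coordinate decouples, so one (m x m) least-squares solve with
    # n*3 right-hand sides recovers all blendshapes at once.
    X, *_ = np.linalg.lstsq(W, D, rcond=None)
    return X.reshape(m, n, 3)

# Usage with random stand-in data (real data comes from the ARKit recording):
rng = np.random.default_rng(0)
m, n = 52, 1220
B0 = rng.normal(size=(n, 3))
B_true = 0.01 * rng.normal(size=(m, n, 3))
W = 0.8 * np.eye(m) + 0.2 * rng.uniform(size=(m, m))   # dominant diagonal, like w_{i,i}
S = B0[None] + np.einsum("ij,jkl->ikl", W, B_true)
B_est = fit_delta_blendshapes(B0, S, W)
print(np.abs(B_est - B_true).max())   # tiny, up to numerical round-off
```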
However, in order to produce semantically meaningful blendshapes the optimization has to be regularized. To this end, we transfer the blendshapes of the (non-personalized) ARKit template to the recorded neutral expression $\mathbf{B}_0$ using deformation transfer [47], resulting in the generic blendshapes $\mathbf{T}_1, \dots, \mathbf{T}_m$. Similar to Saito [43], we add virtual triangles between upper and lower eyelids to ensure that the eyes are completely closed in the transferred eye-blink blendshapes. Our regularization energy then penalizes the deviation of the personalized blendshapes $\mathbf{B}_j$ from the generic blendshapes $\mathbf{T}_j$:
$$ E_{\mathrm{reg}} = \sum_{j=1}^{m} \big\| \mathbf{D}_j \left( \mathbf{B}_j - \mathbf{T}_j \right) \big\|^2 . \quad (2) $$
Here, $\mathbf{D}_j$ are diagonal $(n \times n)$ matrices containing per-vertex regularization weights. These weights ensure that vertices that do not move in the template blendshape $\mathbf{T}_j$ also do not move in the personalized blendshape $\mathbf{B}_j$. They are computed as
$$ \left( \mathbf{D}_j \right)_{i,i} = \frac{\max_{k=1,\dots,n} \left\| \mathbf{t}_{k,j} - \mathbf{t}_{k,0} \right\|}{\left\| \mathbf{t}_{i,j} - \mathbf{t}_{i,0} \right\|} , \quad (3) $$
where $\mathbf{t}_{i,j} \in \mathbb{R}^3$ is the position of the $i$-th vertex in the template blendshape $\mathbf{T}_j$. This results in higher regularization weights for vertices with smaller displacement magnitude in the ARKit template blendshapes. We clamp the regularization weight $(\mathbf{D}_j)_{i,i}$ to $10^5$ if $\|\mathbf{t}_{i,j} - \mathbf{t}_{i,0}\|^2 < \epsilon$ to avoid division by zero.

The personalized blendshapes are finally computed by minimizing the cost function
$$ E_{\mathrm{fit}}(\mathbf{B}_1, \dots, \mathbf{B}_m) + E_{\mathrm{reg}}(\mathbf{B}_1, \dots, \mathbf{B}_m) , $$
which combines the fitting term (1) and the regularization term (2), and again only involves solving a linear least-squares system.

Li et al. [26] stated that an optimization based on vertex positions leads to visible artifacts if done naively. However, our results demonstrate quantitatively and qualitatively that this is not the case for our regularization. In fact, optimizing vertex positions leads to more accurate results, as we show in Section 4. Without our regularization though, we would get clearly noticeable artifacts.

3.3 Blendshape Transfer
Having computed the personalized ARKit blendshapes (grey face masks in Figure 1), we now transfer them to the provided full-body avatar using an extension of deformation transfer [47]. Note that our personalized ARKit blendshapes $\mathbf{B}_j$ only specify the deformation within the face area. However, this is only a part of the deformation due to face animation, since some blendshapes (e.g., jaw open) also affect the area adjacent to the face (e.g., the neck area). Therefore, we have to adjust the adjacent area accordingly.

In a first step we approximate the personalized ARKit blendshapes with the pre-existing non-personalized blendshapes of the full-body avatar. To this end, we define a correspondence map $\mathbf{M}$ between the avatar's face and the ARKit face mesh. This is achieved by fitting the avatar template to the ARKit template using non-rigid registration [1] and selecting the closest point on the ARKit mesh for each vertex of the avatar's face. Since each full-body avatar shares the topology of the full-body template model (from [51] in our case), this mapping has to be computed only once.

To obtain the optimal blendshape weights for the approximation we use an approach similar to Lewis and Anjyo [24]. First, we use the iterative closest point algorithm (ICP) with scaling [52] to find the optimal translation, rotation, and scaling to register the ARKit face mesh to the avatar's face. Second, we compute the weights of the initial avatar blendshapes by minimizing the energy
$$ E_{\mathrm{approx}}(\tilde{w}_1, \dots, \tilde{w}_k) = \Big\| \sum_{i=1}^{k} \tilde{w}_i \, \tilde{\mathbf{B}}_i - \mathbf{M} \mathbf{B}_j \Big\|^2 + \mu \sum_{i=1}^{k} \tilde{w}_i^2 , \quad (4) $$
where $\mathbf{M}$ is the pre-computed correspondence matrix that maps vertices of the ARKit face meshes to the full-body avatar, $\tilde{\mathbf{B}}_i$ denote the initial avatar's blendshapes, and $\tilde{w}_i$ are the (unknown) blendshape weights. The second term penalizes large weights to avoid extreme poses. Solving a linear least-squares system results in the blendshape weights $\tilde{w}_1, \dots, \tilde{w}_k$ for approximating the ARKit blendshape $\mathbf{B}_j$ using the initial avatar blendshapes $\tilde{\mathbf{B}}_1, \dots, \tilde{\mathbf{B}}_k$.

The resulting approximation, which can be considered an automatic facial retargeting and is denoted by $\mathbf{A}_j$, is already quite close to the desired personalized blendshape $\mathbf{B}_j$ (see Figure 2).

Figure 2: The personalized blendshape (left) and our approximation with generic blendshapes (right).
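A small sketch of the weight fit in Equation (4), written as a ridge-regularized least-squares problem in NumPy; the dense matrices, the value of mu, and the function name are illustrative assumptions rather than the paper's actual implementation.

```python
import numpy as np

def approximate_blendshape(B_tilde, MB_j, mu=1e-3):
    """Solve Eq. (4): approximate one mapped ARKit blendshape M*B_j by a
    weighted combination of the k initial avatar blendshapes.

    B_tilde : (k, n, 3) initial avatar delta-blendshapes (face region)
    MB_j    : (n, 3) personalized ARKit blendshape mapped onto the avatar's face
    mu      : Tikhonov weight penalizing large blendshape weights
    Returns w_tilde : (k,) optimal weights, and the approximation A_j
    """
    k = B_tilde.shape[0]
    A = B_tilde.reshape(k, -1).T          # (3n, k) basis matrix
    b = MB_j.reshape(-1)                  # (3n,)  target displacements
    # Normal equations of the ridge problem: (A^T A + mu I) w = A^T b
    w_tilde = np.linalg.solve(A.T @ A + mu * np.eye(k), A.T @ b)
    A_j = (A @ w_tilde).reshape(MB_j.shape)
    return w_tilde, A_j
```

As reported in Section 4.2, the weights obtained this way typically contain many non-zero entries, some even outside [0, 1], which is why a hand-crafted retargeting rarely matches them.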

Using the initial avatar's blendshapes, the approximation $\mathbf{A}_j$ provides the missing information on how to deform the avatar's face in the area not covered by the ARKit face mask (e.g., the neck area under the chin). We therefore use the approximation $\mathbf{A}_j$ as regularization when transferring the personalized blendshape $\mathbf{B}_j$ to the avatar.

This transfer proceeds in two steps (see Figure 3): First, we apply deformation transfer to only the avatar's face using the pre-computed correspondence mapping. Second, the vertices in the face area are fixed, and the vertices in the adjacent area (vertices that move in $\mathbf{A}_j$ but do not belong to the face area) are non-rigidly deformed from $\mathbf{A}_j$ [1]. This process seamlessly implants the personalized ARKit blendshapes into the target avatar.

Figure 3: Seamless partial deformation transfer of the Jaw-Open blendshape: The first step transfers the deformation of the face region only (left), the second step adjusts the adjacent neck region (right).
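For illustration, the sketch below shows only how the two regions are combined into one full-avatar delta-blendshape. The function name, array layout, and the plain copy of A_j in step 2 are our assumptions; the actual method non-rigidly deforms the adjacent region [1] so that it blends smoothly into the fixed face boundary.

```python
import numpy as np

def implant_blendshape(face_idx, adjacent_idx, dt_face_displ, A_j):
    """Two-step seamless partial transfer (simplified stand-in).

    face_idx      : indices of the avatar's face vertices
    adjacent_idx  : indices of vertices that move in A_j but lie outside the face
    dt_face_displ : (len(face_idx), 3) displacements from deformation transfer (step 1)
    A_j           : (N, 3) per-vertex displacements of the retargeted approximation
    Returns delta : (N, 3) final delta-blendshape on the full avatar
    """
    N = A_j.shape[0]
    delta = np.zeros((N, 3))
    # Step 1: the face region takes the personalized deformation-transfer result.
    delta[face_idx] = dt_face_displ
    # Step 2: with the face fixed, the adjacent region follows the approximation A_j.
    # (The paper uses a non-rigid deformation [1] here; copying A_j is a crude stand-in.)
    delta[adjacent_idx] = A_j[adjacent_idx]
    return delta
```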
The approximations $\mathbf{A}_j$ also bring facial details not included in the ARKit mesh (eyeballs and teeth) to their desired position in the $j$-th personalized avatar blendshape. However, since the ARKit mesh does not include eyeballs, the eyelid blendshapes might intersect them. We eventually repair those artifacts by moving the eyelid vertices to their closest point on the eyeball surface [2].

4 RESULTS
In the following, we present quantitative and qualitative comparisons between our blendshape personalization approach (Section 3.2), example-based facial rigging (EBFR, Li et al. [26]), and deformation transfer (DT, Sumner and Popović [47]). After that, we show comparisons of our personalized blendshapes to automatic facial retargeting and to manual facial retargeting. Automatic facial retargeting refers to the optimization of Equation (4). The resulting blendshape weights are the best fitting combinations of the avatar's initial blendshapes to approximate the personalized ARKit blendshapes. Manual facial retargeting refers to a manually optimized mapping of the ARKit blendshape set to the avatar's initial blendshape set. This mapping was hand-crafted by two PhD students in Computer Graphics with sufficient expertise in facial blendshape animation (although not being blendshape artists) and represents our best (manual) effort for retargeting the ARKit blendshapes to the initial avatar's blendshapes.

4.1 Face Mask Personalization
In order to evaluate the accuracy of the different sets of personalized ARKit blendshapes, we asked our subjects to record not only the 52 training face expressions (see Section 3.1), but also 20–30 additional test expressions, for which we record the ARKit face mesh $\mathbf{T}_i$ and the corresponding blendshape weights $w_1, \dots, w_m$. Using these weights and the blendshapes $\mathbf{B}_1, \dots, \mathbf{B}_m$ to be evaluated, we compute the root mean square error (RMSE) over the $n$ vertices w.r.t. $\mathbf{T}_i$ as
$$ \sqrt{ \frac{1}{n} \Big\| \mathbf{B}_0 + \sum_{j=1}^{m} w_j \, \mathbf{B}_j - \mathbf{T}_i \Big\|^2 } . \quad (5) $$
Our implementation of EBFR only performs the blendshape optimization step from Li et al. [26], since we already know the correct blendshape weights from ARKit. Figure 4 compares the RMSE (5) of different sets of personalized blendshapes produced with DT (blue), EBFR (orange), and our approach (green) for Subject 1. In almost all cases (except expression 19) our method yields the lowest errors. When averaging the RMSEs over all expressions and all subjects, the RMSE of our method is 58% of the RMSE of EBFR and 45% of the RMSE of DT (see supplementary material).

Figure 4: Root-mean-square reconstruction error of deformation transfer (blue), example-based facial rigging (orange), and our method (green) for Subject 1.

We also evaluate and compare the maximum error computed over all $n$ vertices, since this measure allows us to compare the worst parts of the reconstructed expressions. If only a particular region of the face moves in a test expression, the RMSE would be artificially reduced due to averaging over mostly unchanged vertex positions.

Figure 5 shows the maximum reconstruction errors of the different approaches for Subject 1. It can be seen that DT gives the worst results in all cases. Both EBFR and our method show significant improvements, with our method consistently yielding the lowest error of all three methods. Averaging the maximum error per expression over all test expressions and all subjects, our method reduces the maximum reconstruction error of EBFR and DT down to 50% and 38%, respectively. The color-coding of reconstruction errors in Figure 6 visualizes how the three reconstruction methods differ and shows that our reconstructions are the most accurate. Results for other test subjects are shown in the supplementary material.

Figure 5: Maximum reconstruction error of deformation transfer (blue), example-based facial rigging (orange), and our method (green) for Subject 1.

Figure 6: Color-coded reconstruction errors. From left to right: ground truth, our method, example-based facial rigging, deformation transfer. (Blue = 0mm, Red > 5mm)

4.2 Avatar Blendshape Personalization
The previous section evaluated different approaches for personalizing the ARKit face mask blendshapes. In this section we evaluate the final avatars produced by implanting the personalized ARKit blendshapes through our seamless partial deformation transfer.

Figure 7 compares the approximations $\mathbf{A}_j$ (computed through automatic facial retargeting using the initial avatar blendshapes) to our final avatars with personalized blendshapes. While the approximations $\mathbf{A}_j$ give reasonable results, the expressions are more accurately reproduced by our personalized blendshapes. This is most noticeable in the mouth region, where the retargeted version does not move the mouth corners far enough or cannot properly reproduce the lip shape of the target expression.

The automatic retargeting is computed by minimizing the approximation error (4) w.r.t. the blendshape weights. Upon inspection, the resulting (optimal) weights exhibit many non-zero entries, with several weights even exceeding the range [0, 1]. This explains why a manual retargeting is rather unlikely to reproduce the optimal results. Still, the manual mapping is the default method for connecting face rigs and face tracking in game engines (see, e.g., the Unity Live Capture Plugin or the Faceware Live Client for Unity).

Figure 8 compares how well different approaches can reproduce some test expressions that have been tracked through ARKit. The manual retargeting using the initial avatar blendshapes yields the worst results, with clearly visible deviations in the mouth region. Automatic retargeting produces more accurate expressions, but the most accurate results are obtained with the personalized blendshapes. In the top row the automatic retargeting does not move the mouth corners far enough to the side, visualized through the blue cross and the green line. In the bottom row the manual and automatic retargeting do not properly close the mouth and eyes, respectively, while the personalized blendshapes do. Comparisons on the full test sequences can be seen in the accompanying video.

4.3 Computation Time
Our automatic, app-guided recording of training expressions takes about 2 minutes. Afterwards, all computations of our pipeline take approximately 35 seconds per avatar. The ARKit face masks consist of 1,220 vertices and 2,304 triangles, the avatar meshes consist of 21k vertices and 42k triangles.

Table 1 shows the computation times for EBFR and our method, measured on a desktop PC with a 10-core 3.6 GHz CPU and an Nvidia RTX 3070 GPU. These timings include setting up the linear systems and solving them through sparse Cholesky factorizations. For EBFR it also includes converting the local frames back to the new blendshape basis, and for our method it includes the computation of the regularization blendshapes using deformation transfer. On average, our method is about 3× faster than EBFR. This is mainly due to the fact that EBFR performs the optimization per triangle, and hence has to solve 2,304 linear least-squares problems.

Table 1: Computation times of example-based facial rigging (EBFR) and our method, including setting up and solving all required linear systems.

Method   Subject 1   Subject 2   Subject 3   Subject 4   Avg.
EBFR     1.585s      1.594s      1.602s      1.594s      1.594s
Ours     0.500s      0.499s      0.499s      0.497s      0.499s

Figure 7: Comparison of the personalized ARKit blendshapes (middle) to an approximation obtained by automatic retargeting with the initial avatar blendshapes (left) and to our final personalized blendshapes (right), where the latter yield more accurate results.

5 DISCUSSION
As described in Section 4.1, personalized blendshapes produce more accurate reconstructions of facial expressions than generic blendshapes transferred from a template rig. Our improved approach further reduces the maximum reconstruction error compared to deformation transfer and example-based facial rigging. Considering Figure 6, the mouth corners of Subject 1 (top row) are closer to the ground truth for our method, while both DT and EBFR lead to a stronger grin expression. Even these slight differences can be problematic, since they might lead to incongruent social cues [42] or create unnatural-appearing facial expressions that do not correctly convey a tracked person's actual look and feelings (Figure 8).

Our approach also has some limitations. Both the blendshape approximation and the seamless partial deformation transfer rely on correspondence mappings, which have to be computed in a preprocess once per template model. Moreover, our seamless partial deformation transfer relies on reasonable initial blendshapes of the avatar, which are used to estimate how the area adjacent to the face (e.g., the neck) deforms during facial expressions. As a consequence, our method cannot be applied to avatars without blendshapes.

6 CONCLUSION
Faithful digital reconstruction of humans has various interesting use-cases. It may become even more prominent in future social XR encounters. Here, an accurate reconstruction of facial expressions is a necessity due to the prominent role of facial expressions in non-verbal behavior and social interaction. This article presented an optimization-based approach to generating personalized blendshapes necessary for a faithful reconstruction of facial expressions and their animation. The proposed method combines a position-based optimization with a seamless partial deformation transfer. It outperforms existing solutions and overall results in a much lower reconstruction error. It also neatly integrates with recent smartphone-based reconstruction pipelines for mesh generation and automated rigging, further paving the way to a widespread application of personalized avatars and agents in various use-cases.

In the future, we would like to use the iPhone's front-facing depth sensor to capture more accurate geometry during expression scanning. Furthermore, we would like to investigate the effect that our improved personalized blendshapes have on the perceptibility of expression semantics.

Figure 8: Comparison of the original tracked facial expression with the animated avatars. From left to right: captured image,
manual retargeting, automatic retargeting, personalized blendshapes

ACKNOWLEDGMENTS
The authors are very grateful to all scanned subjects. This research was supported by the German Federal Ministry of Education and Research (BMBF) through the project VIA-VR (ID 16SV8446).

REFERENCES
[1] Jascha Achenbach, Thomas Waltemate, Marc Erich Latoschik, and Mario Botsch. 2017. Fast generation of realistic virtual humans. In Proceedings of the 23rd ACM Symposium on Virtual Reality Software and Technology. ACM. https://doi.org/10.1145/3139131.3139154
[2] Jascha Achenbach, Eduard Zell, and Mario Botsch. 2015. Accurate Face Reconstruction through Anisotropic Fitting and Eye Correction. In Vision, Modeling & Visualization. The Eurographics Association. https://doi.org/10.2312/vmv.20151251
[3] Jim Blascovich and Jeremy Bailenson. 2011. Infinite Reality: Avatars, Eternal Life, New Worlds, and the Dawn of the Virtual Revolution. William Morrow & Co. https://doi.org/10.1162/PRES_r_00068
[4] Mario Botsch, Robert Sumner, Mark Pauly, and Markus Gross. 2006. Deformation Transfer for Detail-Preserving Surface Editing. In Proceedings of VMV 2006.
[5] Sofien Bouaziz, Andrea Tagliasacci, and Mark Pauly. 2014. Dynamic 2D/3D Registration. In Eurographics 2014 Tutorial.
[6] Sofien Bouaziz, Yangang Wang, and Mark Pauly. 2013. Online modeling for realtime facial animation. ACM Transactions on Graphics 32, 4 (2013), 1–10. https://doi.org/10.1145/2461912.2461976
[7] David Burden and Maggi Savin-Baden. 2020. Virtual humans: Today and tomorrow. Chapman and Hall/CRC.
[8] Chen Cao, Hongzhi Wu, Yanlin Weng, Tianjia Shao, and Kun Zhou. 2016. Real-time facial animation with image-based dynamic avatars. ACM Transactions on Graphics 35, 4 (2016), 1–12. https://doi.org/10.1145/2897824.2925873
[9] E. Carrigan, E. Zell, C. Guiard, and R. McDonnell. 2020. Expression Packing: As-Few-As-Possible Training Expressions for Blendshape Transfer. Computer Graphics Forum 39, 2 (2020), 219–233. https://doi.org/10.1111/cgf.13925
[10] Dan Casas, Oleg Alexander, Andrew W. Feng, Graham Fyffe, Ryosuke Ichikari, Paul Debevec, Rhuizhe Wang, Evan Suma, and Ari Shapiro. 2015. Rapid Photorealistic Blendshapes from Commodity RGB-D Sensors. In Proceedings of the 19th Symposium on Interactive 3D Graphics and Games. ACM, 134–134. https://doi.org/10.1145/2699276.2721398
[11] Justine Cassell, Catherine Pelachaud, Norman Badler, Mark Steedman, Brett Achorn, Tripp Becket, Brett Douville, Scott Prevost, and Matthew Stone. 1994.
Animated conversation: rule-based generation of facial expression, gesture & spoken intonation for multiple conversational agents. In Proceedings of the 21st annual conference on Computer graphics and interactive techniques. 413–420. https://doi.org/10.1145/192161.192272
[12] Bindita Chaudhuri, Noranart Vesdapunt, Linda Shapiro, and Baoyuan Wang. 2020. Personalized Face Modeling for Improved Face Reconstruction and Motion Retargeting. In Computer Vision – ECCV 2020. Springer International Publishing, 142–160. https://doi.org/10.1007/978-3-030-58558-7_9
[13] Paul Ekman and Wallace V Friesen. 1969. The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Semiotica 1, 1 (1969), 49–98. https://doi.org/10.1515/semi.1969.1.1.49
[14] Paul Ekman and Wallace V. Friesen. 1978. Facial Action Coding System. https://doi.org/10.1037/t27734-000
[15] Andrew Feng, Evan Suma Rosenberg, and Ari Shapiro. 2017. Just-in-time, viable, 3-D avatars from scans. Computer Animation and Virtual Worlds 28, 3-4 (2017). https://doi.org/10.1145/3084363.3085045
[16] Pablo Garrido, Levi Valgaert, Chenglei Wu, and Christian Theobalt. 2013. Reconstructing detailed dynamic face geometry from monocular video. ACM Transactions on Graphics 32, 6 (2013), 1–10. https://doi.org/10.1145/2508363.2508380
[17] Pablo Garrido, Michael Zollhöfer, Dan Casas, Levi Valgaerts, Kiran Varanasi, Patrick Pérez, and Christian Theobalt. 2016. Reconstruction of Personalized 3D Face Rigs from Monocular Video. ACM Transactions on Graphics 35, 3 (2016), 1–15. https://doi.org/10.1145/2890493
[18] Ju Hee Han, Jee-In Kim, Hyungseok Kim, and Jang Won Suh. 2021. Generate Individually Optimized Blendshapes. In 2021 IEEE International Conference on Big Data and Smart Computing (BigComp). IEEE. https://doi.org/10.1109/bigcomp51126.2021.00030
[19] Roger Blanco i Ribera, Eduard Zell, J. P. Lewis, Junyong Noh, and Mario Botsch. 2017. Facial retargeting with automatic range of motion alignment. ACM Transactions on Graphics 36, 4 (2017), 1–12. https://doi.org/10.1145/3072959.3073674
[20] Alexandru Eugen Ichim, Sofien Bouaziz, and Mark Pauly. 2015. Dynamic 3D avatar creation from hand-held video input. ACM Transactions on Graphics 34, 4 (2015), 1–14. https://doi.org/10.1145/2766974
[21] Marc Erich Latoschik, Florian Kern, Jan-Philipp Stauffert, Andrea Bartl, Mario Botsch, and Jean-Luc Lugrin. 2019. Not Alone Here?! Scalability and User Experience of Embodied Ambient Crowds in Distributed Social Virtual Reality. IEEE Transactions on Visualization and Computer Graphics (TVCG) 25, 5 (2019), 2134–2144. https://doi.org/10.1109/TVCG.2019.2899250
[22] Marc Erich Latoschik, Daniel Roth, Dominik Gall, Jascha Achenbach, Thomas Waltemate, and Mario Botsch. 2017. The Effect of Avatar Realism in Immersive Social Virtual Realities. In 23rd ACM Symposium on Virtual Reality Software and Technology (VRST). 39:1–39:10. https://doi.org/10.1145/3139131.3139156
[23] Marc Erich Latoschik and Carolin Wienrich. 2022. Congruence and Plausibility, not Presence?! Pivotal Conditions for XR Experiences and Effects, a Novel Model. Frontiers in Virtual Reality (2022). https://doi.org/10.3389/frvir.2022.694433
[24] JP Lewis and Kenichi Anjyo. 2010. Direct Manipulation Blendshapes. IEEE Computer Graphics and Applications 30, 4 (2010), 42–50. https://doi.org/10.1109/mcg.2010.41
[25] J. P. Lewis, Ken Anjyo, Taehyun Rhee, Mengjie Zhang, Fred Pighin, and Zhigang Deng. 2014. Practice and Theory of Blendshape Facial Models. In Eurographics 2014 - State of the Art Reports. The Eurographics Association. https://doi.org/10.2312/egst.20141042
[26] Hao Li, Thibaut Weise, and Mark Pauly. 2010. Example-based facial rigging. ACM Transactions on Graphics 29, 4 (2010), 1–6. https://doi.org/10.1145/1778765.1778769
[27] Hao Li, Jihun Yu, Yuting Ye, and Chris Bregler. 2013. Realtime facial animation with on-the-fly correctives. ACM Transactions on Graphics 32, 4 (2013), 1–10. https://doi.org/10.1145/2461912.2462019
[28] Tianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. 2017. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics 36, 6 (2017), 1–17. https://doi.org/10.1145/3130800.3130813
[29] Qiaoxi Liu and Anthony Steed. 2021. Social Virtual Reality Platform Comparison and Evaluation Using a Guided Group Walkthrough Method. Frontiers in Virtual Reality 2 (2021), p. 52. https://doi.org/10.3389/frvir.2021.668181
[30] Yunpeng Liu, Stephan Beck, Renfang Wang, Jin Li, Huixia Xu, Shijie Yao, Xiaopeng Tong, and Bernd Froehlich. 2015. Hybrid Lossless-Lossy Compression for Real-Time Depth-Sensor Streams in 3D Telepresence Applications. In Advances in Multimedia Information Processing – PCM 2015. Springer International Publishing, Cham, 442–452. https://doi.org/10.1007/978-3-319-24075-6_43
[31] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. 2015. SMPL: A Skinned Multi-Person Linear Model. ACM Transactions on Graphics 34, 6 (2015), 248:1–248:16. https://doi.org/10.1145/2816795.2818013
[32] Birgit Lugrin, Catherine Pelachaud, and David Traum. 2021. The Handbook on Socially Interactive Agents: 20 years of Research on Embodied Conversational Agents, Intelligent Virtual Agents, and Social Robotics Volume 1: Methods, Behavior, Cognition. (2021). https://doi.org/10.1145/3477322
[33] Nadia Magnenat-Thalmann and Daniel Thalmann. 2005. Virtual humans: thirty years of research, what next? The Visual Computer 21, 12 (2005), 997–1015. https://doi.org/10.1007/s00371-005-0363-6
[34] David Matsumoto, Mark G Frank, and Hyi Sung Hwang. 2012. Reading people. Nonverbal Communication: Science and Applications (2012), 1. https://doi.org/10.4135/9781452244037
[35] Verónica Costa Orvalho, Ernesto Zacur, and Antonio Susin. 2008. Transferring the Rig and Animations from a Character to Different Face Models. Computer Graphics Forum 27, 8 (2008), 1997–2012. https://doi.org/10.1111/j.1467-8659.2008.01187.x
[36] Ahmed A A Osman, Timo Bolkart, and Michael J. Black. 2020. STAR: A Sparse Trained Articulated Human Body Regressor. In European Conference on Computer Vision (ECCV). 598–613. https://doi.org/10.1007/978-3-030-58539-6_36
[37] Catherine Pelachaud. 2009. Modelling multimodal expression of emotion in a virtual agent. Philosophical Transactions of the Royal Society B: Biological Sciences 364, 1535 (2009), 3539–3548. https://doi.org/10.1098/rstb.2009.0186
[38] Tekla S Perry. 2015. Virtual reality goes social. IEEE Spectrum 53, 1 (2015), 56–57. https://doi.org/10.1109/MSPEC.2016.7367470
[39] Richard A. Roberts, Rafael Kuffner dos Anjos, Akinobu Maejima, and Ken Anjyo. 2021. Deformation transfer survey. Computers & Graphics 94 (2021), 52–61. https://doi.org/10.1016/j.cag.2020.10.004
[40] Erika L Rosenberg and Paul Ekman. 2020. What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS). Oxford University Press.
[41] Daniel Roth, Carola Bloch, Anne-Kathrin Wilbers, Kai Kaspar, Marc Erich Latoschik, and Gary Bente. 2015. Quantification of Signal Carriers for Emotion Recognition from Body Movement and Facial Affects. Journal of Eye Movement Research 4 (2015), 192. https://downloads.hci.informatik.uni-wuerzburg.de/2015-ecem-roth-quantification-signal-carriers.pdf
[42] Daniel Roth, Carola Bloch, Anne-Kathrin Wilbers, Marc Erich Latoschik, Kai Kaspar, and Gary Bente. 2016. What You See is What You Get: Channel Dominance in the Decoding of Affective Nonverbal Behavior Displayed by Avatars. In Presentation at the 66th Annual Conference of the International Communication Association (ICA). https://downloads.hci.informatik.uni-wuerzburg.de/2016-Roth-WYSIWYG.pdf
[43] Jun Saito. 2013. Smooth contact-aware facial blendshapes transfer. In Proceedings of the Symposium on Digital Production - DigiPro '13. ACM Press, 13–17. https://doi.org/10.1145/2491832.2491836
[44] Yeongho Seol, Wan-Chun Ma, and J. P. Lewis. 2016. Creating an actor-specific facial rig from performance capture. In Proceedings of the 2016 Symposium on Digital Production. ACM. https://doi.org/10.1145/2947688.2947693
[45] Yeongho Seol, Jaewoo Seo, Paul Hyunjin Kim, J. P. Lewis, and Junyong Noh. 2011. Artist friendly facial animation retargeting. In Proceedings of the 2011 SIGGRAPH Asia Conference - SA '11. ACM Transactions on Graphics, 1–10. https://doi.org/10.1145/2024156.2024196
[46] Antonio Susín Sergi Villagrasa. 2010. FACe! 3D Facial Animation System based on FACS. IV Iberoamerican Symposium in Computer Graphics (2010), 203–209.
[47] Robert W. Sumner and Jovan Popović. 2004. Deformation transfer for triangle meshes. ACM Transactions on Graphics 23, 3 (2004), 399–405. https://doi.org/10.1145/1015706.1015736
[48] Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. 2018. Face2Face: real-time face capture and reenactment of RGB videos. Commun. ACM 62, 1 (Dec. 2018), 96–104. https://doi.org/10.1145/3292039
[49] Keith Waters. 1987. A muscle model for animation three-dimensional facial expression. ACM SIGGRAPH Computer Graphics 21, 4 (1987), 17–24. https://doi.org/10.1145/37402.37405
[50] Thibaut Weise, Sofien Bouaziz, Hao Li, and Mark Pauly. 2011. Realtime performance-based facial animation. ACM Transactions on Graphics 30, 4 (2011), 1–10. https://doi.org/10.1145/2010324.1964972
[51] Stephan Wenninger, Jascha Achenbach, Andrea Bartl, Marc Erich Latoschik, and Mario Botsch. 2020. Realistic Virtual Humans from Smartphone Videos. In 26th ACM Symposium on Virtual Reality Software and Technology. ACM, 1–11. https://doi.org/10.1145/3385956.3418940
[52] Timo Zinßer, Jochen Schmidt, and Heinrich Niemann. 2005. Point Set Registration with Integrated Scale Estimation. In International Conference on Pattern Recognition and Image Processing (PRIP 2005).
