Automated Blendshape Personalization For Faithful Face Animations Using Commodity Smartphones
Contribution: In this paper we leverage the capabilities of recent smartphone technologies (in particular Apple's ARKit) to generate personalized blendshapes for a given avatar in a fully automatic and easy-to-use manner. Our approach ideally complements recent smartphone-based avatar reconstructions (e.g., [51]) and perfectly matches the smartphone-based facial motion capturing provided in Unity 3D and the Unreal Engine. Using ARKit for capturing example facial expressions, we extend the widely used example-based facial rigging [26] to extract considerably more accurate facial blendshapes from these example expressions, which are then seamlessly implanted into a target avatar using a novel formulation of deformation transfer [47]. Compared to previous approaches, our extensions lead to considerably more accurate and hence more faithful reconstructions, reducing the reconstruction error for some cases down to 10%, while still being comparable in terms of computational cost.

2 RELATED WORK
While many different approaches for face animation have been proposed, such as skeleton and joint models [46] or physics-based muscle models [49], linear blendshape models [25] are still the most widely used technique – in particular in interactive XR applications. The individual blendshapes (or morph targets) are typically based on the facial action coding system (FACS) [14], giving them an anatomical as well as semantic meaning. The FLAME model of Li et al. [28] replaces the over-complete, linearly dependent FACS blendshape basis by an orthogonal PCA basis. While this is indeed better suited for tracking and reconstruction, their basis lacks semantic meaning and is therefore not suitable for several applications. To be as widely applicable as possible, our method employs standard linear facial blendshapes.

Generating the required blendshapes by scanning an actor in all these expressions is not possible in most situations. Hence the blendshapes of a generic face rig are typically transferred to the target avatar, using for instance RBF warps [19, 35, 45], non-rigid registration [16], or variants of deformation transfer [39, 47, 48]. The blendshapes generated this way match the anatomical dimensions of the target avatar, but they typically do not faithfully reconstruct the captured person's unique facial expressions.

Higher-quality approaches therefore personalize blendshapes by capturing an actor not only in the neutral pose, but also in a couple of example expressions [8, 18, 26, 50]. The optimization involved in the underlying example-based facial rigging process [26] is ill-posed, since both the blendshape meshes of the actor and the blendshape weights of the captured expressions are unknown. The method therefore depends on good initial guesses of the blendshape weights, which restricts the poses the actor can (or has to) perform. Li et al. [26] proposed a subset of 15 facial expressions the actor should perform. These expressions consist of more than a single activated blendshape; they are instead a combination of several activated blendshapes. Carrigan et al. [9] use an even smaller set of facial expressions by packing more basic blendshapes into the training expressions – making them harder to perform though.

In contrast, Ichim et al. [20] let the actor perform a sequence of dynamic facial expressions and video-record this performance using a smartphone. They detect facial feature points in each frame and non-rigidly deform their blendshapes to better fit these landmarks. However, the user has to check and correct the detected landmarks in about 25 of 1500 frames to account for errors of the facial feature detector. We also use a smartphone to record a facial performance of the actor, but leverage an iPhone's video and depth sensor and Apple's ARKit to capture blendshape weights and 3D geometry for each frame of this performance in a robust and fully automatic manner. Since our capturing process is easy and intuitive, we currently let the actor mimic all 52 ARKit blendshapes to obtain maximal personalization.

Han et al. [18] evaluated two methods to extract personalized blendshapes from expression scans. They recorded expressions and blendshape weights and fitted an autoencoder and a linear regression to obtain the blendshapes. However, while they achieve good results with respect to the reconstruction error, they do not compute semantically meaningful blendshapes, which disqualifies their approach for many applications. Li et al. [26] and Seol et al. [44] employ optimizations based on deformation gradients to extract personalized blendshapes from a set of training examples. Li et al. [26] fit the deformation gradients of the unknown blendshapes to the deformation gradients of the example expressions, regularized by (the deformation gradients of) the blendshapes of a generic rig to ensure semantically meaningful blendshapes. Blendshape weights and blendshape meshes are solved for in an alternating optimization, requiring careful initialization for convergence. Seol et al. [44] first fit a mesh to the scanned expressions and then separate the fitted expressions into the different blendshapes based on the vertex displacements in the generic rig. In our experiments, their blendshape separation method leads to strongly damped blendshapes though. Our approach leverages ARKit's face tracking and therefore avoids the alternating optimization for blendshape weights and blendshape meshes. As we show in Section 4, our fitting also leads to considerably more accurate results compared to [26].

Several approaches further increase the reconstruction accuracy by using corrective blendshapes or corrective deformation fields. The facial animation is then computed by the initial (generic) blendshape set, but it is refined with additional shapes, which add idiosyncrasies that cannot be represented by the initial blendshapes [6, 12, 17, 20, 27]. Our work focusses on computing personalized blendshapes without extra corrective shapes, to be compatible with standard blendshape pipelines in XR engines, but corrective fields could easily be added in future work. In contrast to many previous works that use a face/head avatar only, we implant the resulting personalized facial blendshapes into a full-body avatar using a modification of deformation transfer [4, 47].

3 METHOD
In this section we present our approach for extracting personalized blendshapes from a set of scanned facial expressions and how to transfer these blendshapes to an existing avatar. Figure 1 shows an overview of our pipeline. We begin by specifying the capturing process and the resulting input data (Section 3.1). From this training data we extract personalized blendshapes (Section 3.2), which are then transferred to the full-body avatar using our seamless partial deformation transfer (Section 3.3).
Figure 1: Our pipeline from recording example expressions to the final avatar. Red boxes indicate user input.
3.1 Input Data
Our goal is to create personalized blendshapes for an existing full-body avatar. While any method could be used to generate that avatar, we employ the smartphone-based method of Wenninger et al. [51]. As a consequence, the complete avatar generation and personalization only requires a smartphone for capturing the person, which is in stark contrast to recent approaches based on complex photogrammetry rigs [1, 15] and makes the reconstruction of personalized avatars more widely available.

We capture the training data for the blendshape personalization using a custom application running on an iPhone 12 Pro (any recent ARKit-capable iOS device could be used as well). This application guides the user through the recording session, making the whole capturing process easy and intuitive. All recording is performed by the front-facing depth and color cameras, such that the user can watch instructions and get feedback while recording.

The user first scans their own face in a neutral expression, and is then asked to perform the facial expressions corresponding to the 𝑚 = 52 ARKit blendshapes. For each of these blendshapes, the application shows a textual and pictorial explanation of the expression to be performed. The application continuously tracks the user's face using the ARKit framework and automatically captures the facial expression when the requested expression is performed to a sufficient extent. In particular, when capturing blendshape 𝑖 (for 𝑖 ∈ {1, . . . , 𝑚}), we observe the blendshape weight 𝑤𝑖,𝑖 corresponding to that blendshape, and once this weight exceeds a certain threshold and reaches a maximum over time (i.e., starts decreasing), we record both the current set of blendshape weights (𝑤𝑖,1, . . . , 𝑤𝑖,52) and the current geometry S𝑖 of the ARKit face mesh, where the latter consists of 𝑛 = 1220 vertices and 2304 triangles. While we expect the blendshape weight 𝑤𝑖,𝑖 to be dominant when capturing blendshape 𝑖, other non-vanishing blendshape weights 𝑤𝑖,𝑗 do not pose a problem for our reconstruction (see next section), such that the user does not have to strictly perform the requested expression in isolation (which is hardly possible).

With this automated recording procedure, the 52 requested facial expressions can be scanned in approximately 2 minutes.
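The automatic capture trigger described above amounts to a simple peak detector on the streamed per-frame weights. The following Python sketch only illustrates the idea; the frame source, the threshold value, and the function name are illustrative assumptions, not the actual app code:

```python
from typing import Iterable, Optional, Sequence, Tuple

def capture_expression(
    frames: Iterable[Tuple[Sequence[float], object]],  # stream of (blendshape_weights, face_mesh) per tracked frame
    i: int,                                             # index of the requested blendshape
    threshold: float = 0.5,                             # illustrative value, not taken from the paper
) -> Optional[Tuple[Sequence[float], object]]:
    """Return the (weights, mesh) snapshot at the peak of weight w[i],
    once w[i] has exceeded the threshold and starts decreasing again."""
    best: Optional[Tuple[Sequence[float], object]] = None
    best_weight = -1.0
    armed = False  # becomes True once w[i] exceeds the threshold
    for weights, mesh in frames:
        w = weights[i]
        if w >= threshold:
            armed = True
        if armed:
            if w > best_weight:
                best_weight, best = w, (weights, mesh)
            elif w < best_weight:
                return best  # weight started decreasing: capture the peak frame
    return best  # stream ended: fall back to the best frame seen, if any
```

In practice the tracked weights are noisy, so a real implementation would additionally smooth or debounce them before detecting the peak.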
3.2 Blendshape Personalization
While ARKit provides us with blendshape weights (𝑤𝑖,1, . . . , 𝑤𝑖,𝑚) and face meshes S𝑖 for each expression 𝑖 ∈ {1, . . . , 𝑚} of the 𝑚 = 52 ARKit blendshapes, it does not give access to the internal personalized blendshapes of the user. We therefore have to extract the personalized version of the ARKit blendshapes using a modification of example-based facial rigging [26].

The original method of Li et al. [26] fits the per-triangle deformation gradients of the unknown blendshapes to the deformation gradients of the scanned expressions in a first step, and then solves a linear least-squares system to extract the vertex positions best matching the fitted deformation gradients [4, 47]. This two-step procedure has the disadvantage that it does not directly penalize the deviation of the reconstructed vertex positions from the scanned expressions; it only indirectly encourages the vertex positions to match the target expressions. In contrast, we directly optimize for vertex positions that best match the observed expressions S𝑖 in the least-squares sense (see Equation (1) below), which – as we will show in Section 4.1 – leads to more accurate results.

Given the blendshape weights (𝑤𝑖,1, . . . , 𝑤𝑖,𝑚) and face meshes S𝑖 for the recorded expressions 𝑖 ∈ {1, . . . , 𝑚}, as well as the neutral face scan B0, we have to compute the corresponding personalized delta-blendshapes B1, . . . , B𝑚. Delta-blendshapes describe the displacement from the neutral expression to a predefined facial expression – the corresponding (non-delta-)blendshape [25]. In the following, the term blendshape always refers to delta-blendshapes.
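The fitting term referenced as Equation (1) penalizes, for every recorded expression, the deviation of the blendshape-weighted reconstruction from the scan. In the notation above it can be sketched as follows (the exact norm and per-vertex weighting of the original formulation may differ):

\[
  E_\mathrm{fit}(\mathbf{B}_1, \dots, \mathbf{B}_m) \;=\; \sum_{i=1}^{m} \Bigl\| \, \mathbf{B}_0 + \sum_{j=1}^{m} w_{i,j}\,\mathbf{B}_j - \mathbf{S}_i \Bigr\|^2 \qquad (1)
\]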
Since the blendshape weights 𝑤𝑖,𝑗 were captured by ARKit and hence are known, minimizing (1) only requires solving a sparse linear least-squares system.

However, in order to produce semantically meaningful blendshapes the optimization has to be regularized. To this end, we transfer the blendshapes of the (non-personalized) ARKit template to the recorded neutral expression B0 using deformation transfer [47], resulting in the generic blendshapes T1, . . . , T𝑚. Similar to Saito [43], we add virtual triangles between upper and lower eyelids to ensure that the eyes are completely closed in the transferred eye-blink blendshapes. Our regularization energy then penalizes the deviation of the personalized blendshapes B𝑗 from the generic blendshapes T𝑗:

\[
  E_\mathrm{reg} \;=\; \sum_{j=1}^{m} \bigl\| \mathbf{D}_j \, (\mathbf{B}_j - \mathbf{T}_j) \bigr\|^2 . \qquad (2)
\]

Here, D𝑗 are diagonal (𝑛 × 𝑛) matrices containing per-vertex regularization weights. These weights ensure that vertices that do not move in the template blendshape T𝑗 also do not move in the personalized blendshape B𝑗. They are computed as

\[
  \bigl(\mathbf{D}_j\bigr)_{i,i} \;=\; \frac{\max_{k=1,\dots,n} \bigl\| \mathbf{t}_{k,j} - \mathbf{t}_{k,0} \bigr\|}{\bigl\| \mathbf{t}_{i,j} - \mathbf{t}_{i,0} \bigr\|} \, , \qquad (3)
\]

where t𝑖,𝑗 ∈ ℝ³ is the position of the 𝑖-th vertex in the template blendshape T𝑗. This results in higher regularization weights for vertices with smaller displacement magnitude in the ARKit template blendshapes. We clamp the regularization weight (D𝑗)𝑖,𝑖 to 10⁵ if ‖t𝑖,𝑗 − t𝑖,0‖² < 𝜖 to avoid division by zero.

The personalized blendshapes are finally computed by minimizing the cost function

\[
  E_\mathrm{fit}(\mathbf{B}_1, \dots, \mathbf{B}_m) + E_\mathrm{reg}(\mathbf{B}_1, \dots, \mathbf{B}_m)
\]

that combines the fitting term (1) and the regularization term (2), which again only involves solving a linear least-squares system. Li et al. [26] stated that an optimization based on vertex positions leads to visible artifacts if done naively. However, our results demonstrate quantitatively and qualitatively that this is not the case for our regularization. In fact, optimizing vertex positions leads to more accurate results, as we show in Section 4. Without our regularization though, we would get clearly noticeable artifacts.

Figure 2: The personalized blendshape (left) and our approximation with generic blendshapes (right)

3.3 Blendshape Transfer
Having computed the personalized ARKit blendshapes (grey face masks in Figure 1), we now transfer them to the provided full-body avatar using an extension of deformation transfer [47]. Note that our personalized ARKit blendshapes B𝑗 only specify the deformation within the face area. However, this is only a part of the deformation due to face animation, since some blendshapes (e.g. jaw open) also affect the area adjacent to the face (e.g. the neck area). Therefore, we have to adjust the adjacent area accordingly.

In a first step we approximate the personalized ARKit blendshapes with the pre-existing non-personalized blendshapes of the full-body avatar. To this end, we define a correspondence map M between the avatar's face and the ARKit face mesh. This is achieved by fitting the avatar template to the ARKit template using non-rigid registration [1] and selecting the closest point on the ARKit mesh for each vertex of the avatar's face. Since each full-body avatar shares the topology of the full-body template model (from [51] in our case), this mapping has to be computed only once.

To obtain the optimal blendshape weights for the approximation we use an approach similar to Lewis and Anjyo [24]. First, we use the iterative closest point algorithm (ICP) with scaling [52] to find the optimal translation, rotation, and scaling to register the ARKit face mesh to the avatar's face. Second, we compute the weights of the initial avatar blendshapes by minimizing the energy

\[
  E_\mathrm{approx}(\tilde{w}_1, \dots, \tilde{w}_k) \;=\; \Bigl\| \sum_{i=1}^{k} \tilde{w}_i \,\tilde{\mathbf{B}}_i - \mathbf{M}\,\mathbf{B}_j \Bigr\|^2 \;+\; \mu \sum_{i=1}^{k} \tilde{w}_i^2 \, , \qquad (4)
\]

where M is the pre-computed correspondence matrix that maps vertices of the ARKit face meshes to the full-body avatar, B̃𝑖 denote the initial avatar's blendshapes, and w̃𝑖 are the (unknown) blendshape weights. The second term penalizes large weights to avoid extreme poses. Solving a linear least-squares system results in the blendshape weights w̃1, . . . , w̃𝑘 for approximating the ARKit blendshape B𝑗 using the initial avatar blendshapes B̃1, . . . , B̃𝑘.
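The energy in Equation (4) is a standard ridge-regularized (Tikhonov) least-squares problem, so the weights can be obtained from its normal equations in closed form. A minimal NumPy sketch, assuming all blendshapes are flattened into 3V-dimensional displacement vectors (the names and the value of μ are illustrative, not taken from the paper's implementation):

```python
import numpy as np

def approximate_blendshape(avatar_blendshapes: np.ndarray,  # (k, 3*V): stacked avatar delta-blendshapes B~_i
                           target: np.ndarray,              # (3*V,): mapped personalized blendshape M B_j
                           mu: float = 1e-2                  # hypothetical regularization weight
                           ) -> np.ndarray:
    """Solve min_w || sum_i w_i B~_i - target ||^2 + mu * ||w||^2  (cf. Equation (4))."""
    A = avatar_blendshapes                      # rows are the basis shapes
    k = A.shape[0]
    # Normal equations of the ridge problem: (A A^T + mu I) w = A target
    lhs = A @ A.T + mu * np.eye(k)
    rhs = A @ target
    return np.linalg.solve(lhs, rhs)

# Usage sketch for one personalized ARKit blendshape j (hypothetical variable names):
# w = approximate_blendshape(B_tilde, M @ B_j)
# A_j = (w[:, None] * B_tilde).sum(axis=0)   # the retargeted approximation A_j
```

The same solve is repeated independently for each of the m personalized ARKit blendshapes.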
The resulting approximation, which can be considered an automatic facial retargeting and is denoted by A𝑗, is already quite close to the desired personalized blendshape B𝑗 (see Figure 2). Using

approaches for Subject 1. It can be seen that DT gives the worst results in all cases. Both EBFR and our method show significant improvements, with our method consistently yielding the lowest error of all three methods. Averaging the maximum error per expression over all test expressions and all subjects, our method reduces the maximum reconstruction error of EBFR and DT down to 50% and 38%, respectively. The color-coding of reconstruction errors in Figure 6 visualizes how the three reconstruction methods differ and shows that our reconstructions are the most accurate. Results for other test subjects are shown in the supplementary material.

4.2 Avatar Blendshape Personalization
The previous section evaluated different approaches for personalizing the ARKit face mask blendshapes. In this section we evaluate the final avatars produced by implanting the personalized ARKit blendshapes through our seamless partial deformation transfer.

4.3 Computation Time
Our automatic, app-guided recording of training expressions takes about 2 minutes. Afterwards, all computations of our pipeline take approximately 35 seconds per avatar. The ARKit face masks consist of 1,220 vertices and 2,304 triangles; the avatar meshes consist of 21k vertices and 42k triangles.

Table 1 shows the computation times for EBFR and our method, measured on a desktop PC with a 10-core 3.6 GHz CPU and an Nvidia RTX 3070 GPU. These timings include setting up the linear systems and solving them through sparse Cholesky factorizations. For EBFR it also includes converting the local frames back to the new blendshape basis, and for our method it includes the computation of the regularization blendshapes using deformation transfer. On average, our method is about 3× faster than EBFR. This is mainly due to the fact that EBFR performs the optimization per triangle, and hence has to solve 2,304 linear least-squares problems.
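A common pattern for such sparse solves (not necessarily the exact implementation used here) is to factor the normal equations once and reuse the factorization for every right-hand side. A SciPy sketch of this pattern, with splu standing in for a dedicated sparse Cholesky solver such as CHOLMOD:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def solve_regularized_least_squares(A: sp.spmatrix,   # sparse system matrix (fitting and regularization rows stacked)
                                    rhs: np.ndarray    # one right-hand side per column (e.g., x/y/z coordinates)
                                    ) -> np.ndarray:
    """Factor the normal equations once, then back-substitute for every column of rhs."""
    normal = (A.T @ A).tocsc()
    factor = spla.splu(normal)        # stand-in for a sparse Cholesky factorization
    b = A.T @ rhs
    if b.ndim == 1:
        return factor.solve(b)
    return np.column_stack([factor.solve(b[:, c]) for c in range(b.shape[1])])
```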
Figure 7: Comparison of the personalized ARKit blendshapes (middle) to an approximation using automatic retargeting using
the initial avatar blendshapes (left) and to our final personalized blendshapes (right), where the latter yield more accurate
results.
5 DISCUSSION
As described in Section 4.1, personalized blendshapes produce more accurate reconstructions of facial expressions than generic blendshapes transferred from a template rig. Our improved approach further reduces the maximum reconstruction error compared to deformation transfer and example-based facial rigging. Considering Figure 6, the mouth corners of Subject 1 (top row) are closer to the ground truth for our method, while both DT and EBFR lead to a stronger grin expression. Even these slight differences can be problematic, since they might lead to incongruent social cues [42] or create unnatural-appearing facial expressions that do not correctly convey a tracked person's actual look and feelings (Figure 8).

Our approach also has some limitations. Both the blendshape approximation and the seamless partial deformation transfer rely on correspondence mappings, which have to be computed in a preprocess once per template model. Moreover, our seamless partial deformation transfer relies on reasonable initial blendshapes of the avatar, which are used to estimate how the area adjacent to the face (e.g. the neck) deforms during facial expressions. As a consequence, our method cannot be applied to avatars without blendshapes.

6 CONCLUSION
Faithful digital reconstruction of humans has various interesting use-cases. It may become even more prominent in future social XR encounters. Here, an accurate reconstruction of facial expressions is a necessity due to the prominent role of facial expressions in non-verbal behavior and social interaction. This article presented an optimization-based approach to generating personalized blendshapes necessary for a faithful reconstruction of facial expressions and their animation. The proposed method combines a position-based optimization with a seamless partial deformation transfer. It outperforms existing solutions and overall results in a much lower reconstruction error. It also neatly integrates with recent smartphone-based reconstruction pipelines for mesh generation and automated rigging, further paving the way to a widespread application of personalized avatars and agents in various use-cases.

In the future, we would like to use the iPhone's front-facing depth sensor to capture more accurate geometry during expression scanning. Furthermore, we would like to investigate the effect that our improved personalized blendshapes have on the perceptibility of expression semantics.
Figure 8: Comparison of the original tracked facial expression with the animated avatars. From left to right: captured image,
manual retargeting, automatic retargeting, personalized blendshapes
ACKNOWLEDGMENTS
The authors are very grateful to all scanned subjects. This research was supported by the German Federal Ministry of Education and Research (BMBF) through the project VIA-VR (ID 16SV8446).

REFERENCES
[1] Jascha Achenbach, Thomas Waltemate, Marc Erich Latoschik, and Mario Botsch. 2017. Fast generation of realistic virtual humans. In Proceedings of the 23rd ACM Symposium on Virtual Reality Software and Technology. ACM. https://doi.org/10.1145/3139131.3139154
[2] Jascha Achenbach, Eduard Zell, and Mario Botsch. 2015. Accurate Face Reconstruction through Anisotropic Fitting and Eye Correction. In Vision, Modeling & Visualization. The Eurographics Association. https://doi.org/10.2312/vmv.20151251
[3] Jim Blascovich and Jeremy Bailenson. 2011. Infinite Reality: Avatars, Eternal Life, New Worlds, and the Dawn of the Virtual Revolution. William Morrow & Co. https://doi.org/10.1162/PRES_r_00068
[4] Mario Botsch, Robert Sumner, Mark Pauly, and Markus Gross. 2006. Deformation Transfer for Detail-Preserving Surface Editing. In Proceedings of VMV 2006.
[5] Sofien Bouaziz, Andrea Tagliasacchi, and Mark Pauly. 2014. Dynamic 2D/3D Registration. In Eurographics 2014 Tutorial.
[6] Sofien Bouaziz, Yangang Wang, and Mark Pauly. 2013. Online modeling for realtime facial animation. ACM Transactions on Graphics 32, 4 (2013), 1–10. https://doi.org/10.1145/2461912.2461976
[7] David Burden and Maggi Savin-Baden. 2020. Virtual Humans: Today and Tomorrow. Chapman and Hall/CRC.
[8] Chen Cao, Hongzhi Wu, Yanlin Weng, Tianjia Shao, and Kun Zhou. 2016. Real-time facial animation with image-based dynamic avatars. ACM Transactions on Graphics 35, 4 (2016), 1–12. https://doi.org/10.1145/2897824.2925873
[9] E. Carrigan, E. Zell, C. Guiard, and R. McDonnell. 2020. Expression Packing: As-Few-As-Possible Training Expressions for Blendshape Transfer. Computer Graphics Forum 39, 2 (2020), 219–233. https://doi.org/10.1111/cgf.13925
[10] Dan Casas, Oleg Alexander, Andrew W. Feng, Graham Fyffe, Ryosuke Ichikari, Paul Debevec, Ruizhe Wang, Evan Suma, and Ari Shapiro. 2015. Rapid Photorealistic Blendshapes from Commodity RGB-D Sensors. In Proceedings of the 19th Symposium on Interactive 3D Graphics and Games. ACM, 134–134. https://doi.org/10.1145/2699276.2721398
[11] Justine Cassell, Catherine Pelachaud, Norman Badler, Mark Steedman, Brett Achorn, Tripp Becket, Brett Douville, Scott Prevost, and Matthew Stone. 1994.