AvatarMe: Realistically Renderable 3D Facial Reconstruction "In-The-Wild"
Figure 1: From left to right: Input image; Predicted reflectance (diffuse albedo, diffuse normals, specular albedo and specular
normals); Rendered reconstruction in different environments, with detailed reflections; Rendered result with head completion.
Abstract

Over the last years, with the advent of Generative Adversarial Networks (GANs), many face analysis tasks have accomplished astounding performance, with applications including, but not limited to, face generation and 3D face reconstruction from a single "in-the-wild" image. Nevertheless, to the best of our knowledge, there is no method which can produce high-resolution photorealistic 3D faces from "in-the-wild" images, and this can be attributed to the: (a) scarcity of available data for training, and (b) lack of robust methodologies that can successfully be applied on very high-resolution data. In this paper, we introduce AvatarMe, the first method that is able to reconstruct photorealistic 3D faces from a single "in-the-wild" image with an increasing level of detail. To achieve this, we capture a large dataset of facial shape and reflectance and build on a state-of-the-art 3D texture and shape reconstruction method and successively refine its results, while generating the per-pixel diffuse and specular components that are required for realistic rendering. As we demonstrate in a series of qualitative and quantitative experiments, AvatarMe outperforms the existing arts by a significant margin and reconstructs authentic, 4K by 6K-resolution 3D faces from a single low-resolution image that, for the first time, bridges the uncanny valley.

1. Introduction

The reconstruction of a 3D face geometry and texture is one of the most popular and well-studied fields in the intersection of computer vision, graphics and machine learning. Apart from its countless applications, it demonstrates the power of recent developments in scanning, learning and synthesizing 3D objects [3, 44]. Recently, mainly due to the advent of deep learning, tremendous progress has been made in the reconstruction of a smooth 3D face geometry, even from images captured in arbitrary recording conditions (also referred to as "in-the-wild") [13, 14, 33, 36, 37]. Nevertheless, even though the geometry can be inferred somewhat accurately, in order to render a reconstructed face in arbitrary virtual environments, much more information than a 3D smooth geometry is required, i.e., skin reflectance as well as high-frequency normals. In this paper, we propose a meticulously designed pipeline for the reconstruction of high-resolution render-ready faces from "in-the-wild" images captured in arbitrary poses, lighting conditions and occlusions. A result from our pipeline is showcased in Fig. 1.

The seminal work in the field is the 3D Morphable Model (3DMM) fitting algorithm [3]. The facial texture and shape that is reconstructed by the 3DMM algorithm always lies in a space that is spanned by a linear basis which is learned by Principal Component Analysis (PCA).
The linear basis, even though remarkable in representing the basic characteristics of the reconstructed face, fails in reconstructing high-frequency details in texture and geometry. Furthermore, the PCA model fails in representing the complex structure of facial texture captured "in-the-wild". Therefore, 3DMM fitting usually fails on "in-the-wild" images. Recently, 3DMM fitting has been extended so that it uses a PCA model on robust features, i.e., Histograms of Oriented Gradients (HOGs) [8], for representing facial texture [4]. The method has shown remarkable results in reconstructing the 3D facial geometry from "in-the-wild" images. Nevertheless, it cannot reconstruct the facial texture as accurately.

With the advent of deep learning, many regression methods using an encoder-decoder structure have been proposed to infer 3D geometry, reflectance and illumination [6, 14, 33, 35, 36, 37, 39, 44]. Some of the methods demonstrate that it is possible to reconstruct shape and texture, even in real-time on a CPU [44]. Nevertheless, due to various factors, such as the use of basic reflectance models (e.g., the Lambertian reflectance model), the use of synthetic data or mesh-convolutions on colored meshes, the methods [33, 35, 36, 37, 39, 44] fail to reconstruct highly-detailed texture and shape that is render-ready. Furthermore, in many of the above methods, the reconstructed texture and shape lose many of the identity characteristics of the original image.

Arguably, the first generic method that demonstrated that it is possible to reconstruct high-quality texture and shape from single "in-the-wild" images is the recently proposed GANFIT method [14]. GANFIT can be described as an extension of the original 3DMM fitting strategy, but with the following differences: (a) instead of a PCA texture model, it uses a Generative Adversarial Network (GAN) [23] trained on large-scale high-resolution UV maps, and (b) in order to preserve the identity in the reconstructed texture and shape, it uses features from a state-of-the-art face recognition network [11]. However, the reconstructed texture and shape are not render-ready due to (a) the texture containing baked illumination, and (b) not being able to reconstruct high-frequency normals or specular reflectance.

Early attempts to infer photorealistic render-ready information from single "in-the-wild" images have been made in the line of research of [6, 20, 32, 42]. Arguably, some of the results showcased in the above noted papers are of high quality. Nevertheless, the methods do not generalize since: (a) they directly manipulate and augment the low-quality and potentially occluded input facial texture, instead of reconstructing it, and as a result, the quality of the final reconstruction always depends on the input image, (b) the employed 3D model is not very representative, and (c) a very small number of subjects (e.g., 25 [42]) were available for training of the high-frequency details of the face. Thus, while closest to our work, these approaches focus on easily creating a digital avatar rather than on high-quality render-ready face reconstruction from "in-the-wild" images, which is the goal of our work.

In this paper, we propose the first, to the best of our knowledge, methodology that produces high-quality render-ready face reconstructions from arbitrary images. In particular, our method builds upon recent reconstruction methods (e.g., GANFIT [14]) and, contrary to [6, 42], does not apply algorithms for high-frequency estimation to the original input, which could be of very low quality, but to a GAN-generated high-quality texture. Using a light stage, we have collected a large-scale dataset with samples of over 200 subjects' reflectance and geometry, and we train image translation networks that can perform estimation of (a) diffuse and specular albedo, and (b) diffuse and specular normals. We demonstrate that it is possible to produce render-ready faces from arbitrary face images (pose, occlusion, etc.), including portraits and face sketches, which can be realistically relighted in any environment.

2. Related Work

2.1. Facial Geometry and Reflectance Capture

Debevec et al. [9] first proposed employing a specialized light stage setup to acquire the reflectance field of a human face for photo-realistic image-based relighting applications. They also employed the acquired data to estimate a few view-dependent reflectance maps for rendering. Weyrich et al. [41] employed an LED sphere and 16 cameras to densely record facial reflectance and computed view-independent estimates of facial reflectance from the acquired data, including per-pixel diffuse and specular albedos and per-region specular roughness parameters. These initial works employed dense capture of facial reflectance, which is somewhat cumbersome and impractical.

Ma et al. [27] introduced polarized spherical gradient illumination (using an LED sphere) for efficient acquisition of separated diffuse and specular albedos and photometric normals of a face using just eight photographs, and demonstrated high quality facial geometry, including skin mesostructure, as well as realistic rendering with the acquired data. It was, however, restricted to a frontal viewpoint of acquisition due to the employment of a view-dependent polarization pattern on the LED sphere. Subsequently, Ghosh et al. [15] extended polarized spherical gradient illumination to multi-view facial acquisition by employing two orthogonal spherical polarization patterns. Their method allows the capture of separated diffuse and specular reflectance and photometric normals from any viewpoint around the equator of the LED sphere and can be considered the state-of-the-art in terms of high quality facial capture.
Figure 2: Overview of the proposed method. A 3DMM is fitted to an “in-the-wild” input image and a completed UV texture
is synthesized, while optimizing for the identity match between the rendering and the input. The texture is up-sampled 8
times, to synthesize plausible high-frequency details. We then use an image translation network to de-light the texture and
obtain the diffuse albedo with high-frequency details. Then, separate networks infer the specular albedo, diffuse normals
and specular normals (in tangent space) from the diffuse albedo and the 3DMM shape normals. The networks are trained on
512 × 512 patches and inference is run on 1536 × 1536 patches with a sliding window. Finally, we transfer the facial shape
and consistently inferred reflectance to a head model. Both face and head can be rendered realistically in any environment.
Recently, Kampouris et al. [22] demonstrated how to employ unpolarized binary spherical gradient illumination for estimating separated diffuse and specular albedo and photometric normals using color-space analysis. The method has the advantage of not requiring polarization and hence requires half the number of photographs compared to polarized spherical gradients, and it enables completely view-independent reflectance separation, making it faster and more robust for high quality facial capture [24].

Passive multiview facial capture has also made significant progress in recent years, from high quality facial geometry capture [2] to even detailed facial appearance estimation [17]. However, the quality of the data acquired with such passive capture methods is somewhat lower compared to active illumination techniques.

In this work, we employ two state-of-the-art active-illumination-based multiview facial capture methods [15, 24] for acquiring high quality facial reflectance data in order to build our training data.

2.2. Image-to-Image Translation

Image-to-image translation refers to the task of translating an input image to a designated target domain (e.g., turning sketches into images, or day into night scenes). With the introduction of GANs [16], image-to-image translation improved dramatically [21, 45]. Recently, with the increasing capabilities of the hardware, image-to-image translation has also been successfully attempted on high-resolution data [40]. In this work we utilize variations of pix2pixHD [40] to carry out tasks such as de-lighting and the extraction of reflectance maps at very high resolution.

2.3. Facial Geometry Estimation

Over the years, numerous methods have been introduced in the literature that tackle the problem of 3D facial reconstruction from a single input image. Early methods required a statistical 3DMM for both shape and appearance, usually encoded in a low-dimensional space constructed by PCA [3, 4]. Lately, many approaches have tried to leverage the power of Convolutional Neural Networks (CNNs) to either regress the latent parameters of a PCA model [38, 7] or utilize a 3DMM to synthesize images and formulate an image-to-image translation problem using CNNs [18, 31].

2.4. Photorealistic 3D faces with Deep Learning

Many approaches have been successful in acquiring the reflectance of materials from a single image, using deep networks with an encoder-decoder architecture [12, 25, 26]. However, they only explore 2D surfaces, and in a constrained environment, usually assuming a single point-light source.

Early applications on human faces [34, 35] used image translation networks to infer facial reflectance from an "in-the-wild" image, producing low-resolution results. Recent approaches attempt to incorporate additional facial normal and displacement maps, resulting in representations with high-frequency details [6]. Although this method demonstrates impressive results in geometry inference, it tends to fail in conditions with harsh illumination and extreme head poses, and it does not produce re-lightable results.
Saito et al. [32] proposed a deep learning approach for data-driven inference of a high resolution facial texture map of an entire face for realistic rendering, using as input a single low-resolution face image with partial facial coverage. This has been extended to the inference of facial mesostructure, given a diffuse albedo [20], and even of complete facial reflectance and displacement maps besides the albedo texture, given a partial facial image as input [42]. While closest to our work, these approaches achieve the creation of digital avatars, rather than high quality facial appearance estimation from "in-the-wild" images. In this work, we try to overcome these limitations by employing an iterative optimization framework as proposed in [14]. This optimization strategy leverages a deep face recognition network and GANs within a conventional fitting method in order to estimate high-quality geometry and texture with fine identity characteristics, which can then be used to produce high-quality reflectance maps.

3. Training Data

3.1. Ground Truth Acquisition

Figure 3: Two subjects' reflectance acquired with [15] (top) and [22, 24] (bottom): (a) diffuse albedo, (b) specular albedo, (c) diffuse normals, (d) specular normals. Specular normals are in tangent space.

We employ the state-of-the-art method of [15] for capturing high resolution pore-level reflectance maps of faces, using a polarized LED sphere with 168 lights (partitioned into two polarization banks) and 9 DSLR cameras. Half the LEDs on the sphere are vertically polarized (for parallel polarization), and the other half are horizontally polarized (for cross-polarization), in an interleaved pattern.

Using the LED sphere, we can also employ the color-space analysis from unpolarised LEDs [22] for diffuse-specular separation and the multi-view facial capture method of [24] to acquire unwrapped textures of similar quality (Fig. 3). This method requires less than half of the data to be captured (hence reduced capture time) and a simpler setup (no polarizers), enabling the acquisition of larger datasets.

3.2. Data Collection

In this work, we capture faces of over 200 individuals of different ages and characteristics under 7 different expressions. The geometry reconstructions are registered to a standard topology, like in [5], with unwrapped textures as shown in Fig. 3. We name the dataset RealFaceDB. It is currently the largest dataset of this type and we intend to make it publicly available to the scientific community¹.

¹ For the dataset and other materials we refer the reader to the project's page https://github.com/lattas/avatarme.

4. Method

Figure 4: Rendered patch ([14]-like) of a subject acquired with [15], with (a) the input rendering, and ground truth maps (top row) versus predictions of our network given the rendering as input (bottom row) for (b) diffuse albedo, (c) specular albedo, (d) diffuse normals and (e) specular normals.

To achieve photorealistic rendering of the human skin, we separately model the diffuse and specular albedo and normals of the desired geometry. Therefore, given a single unconstrained face image as input, we infer the facial geometry as well as the diffuse albedo (A_D), diffuse normals (N_D)², specular albedo (A_S) and specular normals (N_S).

² The diffuse normals N_D are not usually used in commercial rendering systems. By inferring N_D we can model the reflection as in the state-of-the-art specular-diffuse separation techniques [15, 24].

As seen in Fig. 2, we first reconstruct a 3D face (base geometry with texture) from a single image at a low resolution, using an existing 3DMM algorithm [5]. Then, the reconstructed texture map, which contains baked illumination, is enhanced by a super-resolution network, followed by a de-lighting network, to obtain a high resolution diffuse albedo A_D. Finally, we infer the other three components (A_S, N_D, N_S) from the diffuse albedo A_D in conjunction with the base geometry. The following sections explain these steps in detail.
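The stages above can be summarized in the following minimal sketch. Every function name and signature is an illustrative placeholder for the corresponding network (this is not a released API); only the way the stages of Fig. 2 compose is shown.

```python
# Minimal sketch of the inference flow in Fig. 2. Every callable is a
# placeholder supplied by the caller (e.g. a fitted GANFIT model, a
# super-resolution network, the translation networks); only the way the
# stages compose is illustrated here.
def reconstruct_render_ready_face(image, fit_3dmm, compute_normals,
                                  super_resolve, delight,
                                  specular_albedo_net, diffuse_normal_net,
                                  specular_normal_net, transfer_to_head=None):
    texture, shape = fit_3dmm(image)            # low-resolution texture T and shape S
    shape_normals = compute_normals(shape)      # shape normals in UV space

    texture_hr = super_resolve(texture)                   # 8x up-sampling
    diffuse_albedo = delight(texture_hr, shape_normals)   # A_D, illumination removed

    specular_albedo = specular_albedo_net(diffuse_albedo, shape_normals)   # A_S
    diffuse_normals = diffuse_normal_net(diffuse_albedo, shape_normals)    # N_D
    specular_normals = specular_normal_net(diffuse_albedo, shape_normals)  # N_S

    maps = {"A_D": diffuse_albedo, "A_S": specular_albedo,
            "N_D": diffuse_normals, "N_S": specular_normals}
    if transfer_to_head is not None:            # optional head completion
        return transfer_to_head(shape, maps)
    return shape, maps
```

Each placeholder corresponds to one of the networks described in the following sections; as noted in the Fig. 2 caption, inference on the full-resolution maps is performed on patches with a sliding window.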
4.1. Initial Geometry and Texture Estimation

Our method requires a low-resolution 3D reconstruction of a given face image I. Therefore, we begin with the estimation of the facial shape with n vertices, S ∈ R^{n×3}, and the texture, T ∈ R^{576×384×3}, by borrowing any state-of-the-art 3D face reconstruction approach (we use GANFIT [14]). Apart from the usage of deep identity features, GANFIT synthesizes realistic texture UV maps using a GAN as a statistical representation of the facial texture. We reconstruct the initial base shape and texture of the input image I as follows, and refer the reader to [14] for further details:

    T, S = G(I)    (1)

where G : R^{k×m×3} ↦ (R^{576×384×3}, R^{n×3}) denotes the GANFIT reconstruction method for an arbitrarily sized input image in R^{k×m×3}, and n is the number of vertices of the fixed topology.

Having acquired the prerequisites, we procedurally improve on them: from the reconstructed geometry S, we acquire the shape normals N and enhance the resolution of the facial texture T, before using them to estimate the components for physically based rendering, such as the diffuse and specular albedos and normals.

training of [14], while also having accurate ground truth of their albedo and normals. We compute a physically-based rendering for each subject from all view-points, using the predicted environment map and the predicted light sources with a random variation of their position, creating an illuminated texture map. We denote this whole simulation process by ξ : A_D ∈ R^{6144×4096×3} ↦ A_D^T ∈ R^{6144×4096×3}, which translates the diffuse albedo to the distribution of the textures with baked illumination, as shown in the following:

    A_D^T = ξ(A_D) ∼ E_{t ∈ {T_1, T_2, ..., T_n}}[t]    (3)
ular diffuse and normals. 4.3.2 Training the De-lighting Network
Figure 6: Reconstructions of our method re-illuminated under different environment maps [10] with added spot lights: (a) input, (b) Cathedral, (c) Sunset, (d) Tunnel.

normals and depth cannot be exploited to improve the quality of the generated diffuse and specular components. To alleviate the aforementioned shortcomings, we: (a) split the original high-resolution data into smaller patches of size 512 × 512. More specifically, using a stride of 256, we derive partially overlapping patches by passing through each original UV map horizontally as well as vertically; and (b) for each translation task, we utilize the shape normals and concatenate them channel-wise with the corresponding grayscale texture input (e.g., in the case of translating the diffuse albedo to the specular normals, we concatenate the grayscale diffuse albedo with the shape normals channel-wise), and thus feed a four-channel tensor ([G, X, Y, Z]) to the network. This increases the level of detail in the derived outputs, as the shape normals act as a geometric "guide". Note that during inference the patch size can be larger (e.g. 1536 × 1536), since the network is fully-convolutional.
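As an illustration of this patch preparation, the following sketch extracts partially overlapping 512 × 512 crops with a stride of 256 and concatenates the grayscale input with the shape normals channel-wise; the NumPy HWC layout and the function name are assumptions for illustration, not the training code itself.

```python
import numpy as np

# Sketch of the patch preparation described above: partially overlapping
# 512x512 crops taken with a stride of 256, each concatenating the grayscale
# input channel-wise with the three shape-normal channels ([G, X, Y, Z]).
def make_training_patches(gray_map, shape_normals, size=512, stride=256):
    height, width = gray_map.shape[:2]
    patches = []
    for y in range(0, height - size + 1, stride):
        for x in range(0, width - size + 1, stride):
            gray = gray_map[y:y + size, x:x + size, None]        # (size, size, 1)
            normals = shape_normals[y:y + size, x:x + size, :]   # (size, size, 3)
            patches.append(np.concatenate([gray, normals], axis=-1))
    return np.stack(patches)                                     # (N, size, size, 4)
```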
5.1.2 Training Setup

To train RCAN [43], we use the default hyper-parameters. For the rest of the translation models, we use a custom translation network as described earlier, which is based on pix2pixHD [40]. More specifically, we use 9 and 3 residual blocks in the global and local generators, respectively. The learning rate we employed is 0.0001, whereas the Adam betas are 0.5 for β1 and 0.999 for β2. Moreover, we do not use the VGG feature matching loss, as this slightly deteriorated the performance. Finally, we use as inputs 3- and 4-channel tensors, which include the shape normals N_O or depth D_O together with the RGB A_D or grayscale A_D^gray values.
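For reference, the settings quoted above can be gathered into a single configuration sketch; the key names below are illustrative and do not correspond to actual pix2pixHD command-line options.

```python
# The optimizer and generator settings quoted above, gathered for reference.
# Key names are illustrative only; they are not actual pix2pixHD options.
translation_training_config = {
    "global_generator_residual_blocks": 9,
    "local_generator_residual_blocks": 3,
    "learning_rate": 1e-4,
    "adam_betas": (0.5, 0.999),          # beta1, beta2
    "vgg_feature_matching_loss": False,  # slightly deteriorated performance
    "input_tensor_channels": (3, 4),     # RGB, or grayscale + normals/depth
}
# RCAN (the super-resolution network) is trained with its default hyper-parameters.
```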
5.2. Evaluation

We conduct quantitative as well as qualitative comparisons against the state-of-the-art. For the quantitative comparisons, we utilize the widely used PSNR metric [19] and report the results in Table 1. As can be seen, our method outperforms [6] and [42] by a significant margin. Moreover, using a state-of-the-art face recognition algorithm [11], we also find the highest match of facial identity to the input images when using our method. The input images were compared against renderings of the faces with reconstructed geometry and reflectance, including the eyes.
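For reference, the PSNR used for this comparison can be computed as follows; the value range assumed below ([0, max_value]) and the exact averaging protocol are illustrative assumptions.

```python
import numpy as np

# Standard PSNR between a predicted and a ground-truth map, with values
# assumed to lie in [0, max_value].
def psnr(prediction, target, max_value=1.0):
    mse = np.mean((np.asarray(prediction, dtype=np.float64) -
                   np.asarray(target, dtype=np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((max_value ** 2) / mse)
```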
For the qualitative comparisons, we perform 3D reconstructions of "in-the-wild" images. As shown in Figs. 8 and 9, our method does not produce any artifacts in the final renderings and successfully handles extreme poses and occlusions such as sunglasses. We infer the texture maps in a patch-based manner from high-resolution input, which produces higher-quality details than [6, 42], who train on high-quality scans but infer the maps for the whole face, at a lower resolution. This is also apparent in Fig. 5, which shows our reconstruction after each step of our process. Moreover, we can successfully acquire each component from black-and-white images (Fig. 9) and even from drawn portraits (Fig. 8).

Furthermore, we experiment with different environment conditions, both in the input images and while rendering. As presented in Fig. 7, the extracted normals and the diffuse and specular albedos are consistent, regardless of the illumination in the original input images. Finally, Fig. 6 shows different subjects rendered under different environments. We can realistically illuminate each subject in each scene and accurately reconstruct the environment reflectance, including detailed specular reflections and subsurface scattering.

In addition to the facial mesh, we are able to infer the entire head topology based on the Universal Head Model (UHM) [29, 30]. We project our facial mesh to a subspace, regress the head latent parameters, and finally derive the completed head model with completed textures. Some of the qualitative head completion results can be seen in Figs. 1 and 2.
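One way to realize this projection-and-regression step is sketched below as a linear least-squares fit to a PCA-like head basis; the basis layout, variable names and the purely linear solve are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

# Hedged sketch of head completion: fit the latent parameters of a linear
# (PCA-like) head model, e.g. UHM, to the reconstructed facial region, then
# decode the full head from those latents.
def complete_head(face_vertices, face_index, head_mean, head_basis):
    """face_vertices: (F, 3); face_index: (F,) indices into the head mesh;
    head_mean: (V, 3); head_basis: (3 * V, K) linear basis."""
    num_components = head_basis.shape[1]
    # Restrict the basis and mean to the facial region covered by the fit.
    basis_face = head_basis.reshape(-1, 3, num_components)[face_index]   # (F, 3, K)
    target = (face_vertices - head_mean[face_index]).reshape(-1)         # (3F,)
    latents, *_ = np.linalg.lstsq(basis_face.reshape(-1, num_components),
                                  target, rcond=None)
    head_vertices = head_mean + (head_basis @ latents).reshape(-1, 3)    # (V, 3)
    return head_vertices, latents
```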
Figure 8: Comparison of reflectance maps predicted by our method against state-of-the-art methods: (a) input, (b) texture [6], (c) normals [6], (d) albedo [42], (e) specular albedo [42], (f) our diffuse albedo, (g) our specular albedo, (h) our specular normals. The [42] reconstruction is provided by the authors and the [6] results are obtained from their open-sourced models. The last column is cropped to better show the details.