Vid2Avatar: 3D Avatar Reconstruction From Videos in The Wild Via Self-Supervised Scene Decomposition
Figure 1. We propose Vid2Avatar, a method to reconstruct detailed 3D avatars from monocular videos in the wild via self-supervised scene
decomposition. Our method does not require any groundtruth supervision or priors extracted from large datasets of clothed human scans,
nor do we rely on any external segmentation modules.
Abstract

We present Vid2Avatar, a method to learn human avatars from monocular in-the-wild videos. Reconstructing humans that move naturally from monocular in-the-wild videos is difficult. Solving it requires accurately separating humans from arbitrary backgrounds. Moreover, it requires reconstructing detailed 3D surfaces from short video sequences, making it even more challenging. Despite these challenges, our method does not require any groundtruth supervision or priors extracted from large datasets of clothed human scans, nor do we rely on any external segmentation modules. Instead, it solves the tasks of scene decomposition and surface reconstruction directly in 3D by modeling both the human and the background in the scene jointly, parameterized via two separate neural fields. Specifically, we define a temporally consistent human representation in canonical space and formulate a global optimization over the background model, the canonical human shape and texture, and per-frame human pose parameters. A coarse-to-fine sampling strategy for volume rendering and novel objectives are introduced for a clean separation of dynamic human and static background, yielding detailed and robust 3D human geometry reconstructions. We evaluate our method on publicly available datasets and show improvements over prior art. Project page: https://moygcc.github.io/vid2avatar/.

1. Introduction

Being able to reconstruct detailed avatars from readily available "in-the-wild" videos, for example recorded with a phone, would enable many applications in AR/VR, in human-computer interaction, in robotics, and in the movie and sports industry. Traditionally, high-fidelity 3D reconstruction of dynamic humans has required calibrated multi-view systems [9, 10, 19, 27, 31, 45, 49], which are expensive and require highly specialized expertise to operate. In contrast, emerging applications such as the Metaverse require more lightweight and practical solutions in order to make the digitization of humans a widely available technology. Reconstructing humans that move naturally from monocular in-the-wild videos is clearly a difficult problem. Solving it requires accurately separating humans from arbitrary backgrounds, without any prior knowledge about the scene or the subject. Moreover, it requires reconstructing detailed 3D surfaces from short video sequences, made even more challenging by depth ambiguities, the complex dynamics of human motion, and high-frequency surface details.

Traditional template-based approaches [15, 16, 58] cannot generalize to in-the-wild settings due to their requirement for a pre-scanned template and manual rigging. Methods that are based on explicit mesh representations are limited to a fixed topology and resolution [3, 8, 14, 34]. Fully-supervised methods that directly regress 3D surfaces from images [17, 18, 21, 41, 42, 55, 68] struggle with difficult out-of-distribution poses and shapes, and do not always predict temporally consistent reconstructions. Fitting neural implicit surfaces to videos has recently been demonstrated [23, 24, 40, 46, 47, 52, 67]. However, these methods depend on pre-segmented inputs; they are therefore not robust to uncontrolled visual complexity and are upper-bounded in their reconstruction quality by the segmentation method.

In this paper, we introduce Vid2Avatar, a method to learn human avatars from monocular in-the-wild videos without requiring any groundtruth supervision or priors extracted from large datasets of clothed human scans, and without relying on any external segmentation modules. We solve the tasks of scene separation and surface reconstruction directly in 3D. To achieve this, we model both the foreground (i.e., the human) and the background in the scene implicitly, parameterized via two separate neural fields. A key challenge is to associate 3D points to either of these fields without reverting to 2D segmentation. To tackle this challenge, our method builds upon the following core concepts: i) We define a single temporally consistent representation of the human shape and texture in canonical space and leverage the inverse mapping of a parametric body model to learn from deformed observations. ii) A global optimization formulation jointly optimizes the parameters of the background model, the canonical human shape and its appearance, and the pose estimates of the human subject over the entire sequence. iii) A coarse-to-fine sampling strategy for volume rendering naturally leads to a separation of dynamic foreground and static background. iv) Novel objectives further improve the scene decomposition and lead to sharp boundaries between the human and the background, even when both are in contact (e.g., around the feet), yielding better geometry and appearance reconstructions.

More specifically, we leverage an inverse-depth parameterization in spherical coordinates [66] to coarsely separate the static background from the dynamic foreground. Within the foreground sphere, we leverage a surface-guided volume rendering approach to attain densities via the conversion method proposed in [60]. Importantly, we warp all sampled points into canonical space and update the human shape field dynamically. To attain sharp boundaries between the dynamic foreground and the scene, we introduce two optimization objectives that encourage a quasi-discrete binary distribution of ray opacities and penalize non-zero opacity for rays that do not intersect with the human. The final rendering of the scene is then attained by differentiable composited volume rendering.

We show that this optimization formulation leads to clean scene decomposition and high-quality 3D reconstructions of the human subject. In detailed ablations, we shed light on the key components of our method. Furthermore, we compare to existing methods on 2D segmentation, novel view synthesis, and reconstruction tasks, showing that our method performs best across several datasets and settings. To allow for quantitative comparison across methods, we contribute a novel semi-synthetic test set that contains accurate 3D geometry of human subjects. Finally, we demonstrate the ability to reconstruct different humans in detail from online videos and hand-held mobile phone video clips. In summary, our contributions are:
• a method to reconstruct detailed 3D avatars from in-the-wild monocular videos via self-supervised scene decomposition;
• robust and detailed 3D reconstructions of the human even under challenging poses and in challenging environments, without requiring external segmentation methods; and
• a novel semi-synthetic test dataset that for the first time allows comparing monocular human reconstruction methods on realistic scenes. The dataset contains rich annotations of the 3D surface.
Code and data will be made available for research purposes.

2. Related Work

Reconstructing Humans from Monocular Video. Traditional works for monocular human performance capture require personalized rigged templates as a prior and track the pre-defined human model based on 2D observations [15, 16, 58]. These works require pre-scanning of the performer and post-processing for rigging, preventing such methods from being deployed in real-life applications. Some methods attempt to remove the need for pre-scanning and manual rigging [3, 8, 14, 34]. However, the explicit mesh representation is limited to a fixed resolution and cannot represent details like the face. Regression-based methods that directly regress 3D surfaces from images have demonstrated compelling results [4, 12, 17, 18, 21, 41, 42, 55, 68]. However, they require high-quality 3D data for supervision and cannot maintain the space-time coherence of the reconstruction over the whole sequence. Recent works fit implicit neural fields to videos via neural rendering to obtain articulated human models [23, 24, 40, 46, 47, 52, 67]. HumanNeRF [52] extends articulated NeRF to improve novel view synthesis. NeuMan [24] further adds a scene NeRF model. Both methods model the human geometry with a density field, yielding only noisy and often low-fidelity human reconstructions. SelfRecon [23] deploys neural surface rendering [61] to achieve consistent reconstruction over the sequence. However, all aforementioned methods rely on pre-segmented inputs and are therefore not robust to uncontrolled visual complexity and are upper-bounded in their reconstruction quality by the external segmentation method. In contrast, our method solves the tasks of scene decomposition and surface reconstruction jointly in 3D without using segmentation modules.
Reconstructing Humans from Multi-view/Depth. High-fidelity 3D reconstruction of dynamic humans has traditionally required calibrated dense multi-view systems [9, 10, 19, 27, 31, 45, 49], which are expensive and laborious to operate and require highly specialized expertise. Recent works [20, 22, 28, 37, 39, 51, 56, 57] attempt to reconstruct humans from sparser settings by deploying neural rendering. Depth-based approaches [6, 35, 36] reconstruct the human shape by fusing depth measurements across time. Follow-up work [7, 11, 29, 63, 64] builds upon this concept by incorporating an articulated motion prior and a parametric body shape prior. While the aforementioned methods achieve compelling results, they still require a specialized capture setup and are hence not applicable to in-the-wild settings. In contrast, our method recovers the dynamic human shape in the wild from a monocular RGB video as the sole input.

Moving Object Segmentation. Traditional research in moving object segmentation has been extensively conducted at the image level (i.e., in 2D). One line of research relies on motion cues to segment objects with different optical flow patterns [5, 38, 54, 59, 62], while another line of work, termed video matting [25, 30, 43], is trained on videos with human-annotated masks to directly regress the alpha-channel values during inference. Those approaches are not without limitations, as they focus on image-level segmentation and incorporate no 3D knowledge. Thus, they cannot handle complicated backgrounds without enough color contrast between the human and the background. Recent works learn to decompose dynamic objects and the static background simultaneously in 3D by optimizing multiple NeRFs [44, 48, 53, 65]. Such methods perform well for non-complicated dynamic objects but are not directly applicable to articulated humans with intricate motions.

3. Method

We model the human and the background in the scene via two separate neural fields which are learned jointly from images to composite the whole scene. To alleviate the ambiguity of in-contact body and scene parts and to better delineate the surfaces, we contribute novel objectives that leverage the dynamically updated human shape in canonical space to regularize the ray opacity. We parameterize the 3D geometry and texture of clothed humans as a pose-conditioned implicit signed-distance field (SDF) and texture field in canonical space (Sec. 3.1). We then model the background using a separate neural radiance field (NeRF). The human shape and appearance fields alongside the background field are learned from images jointly via differentiable composited neural volume rendering (Sec. 3.2). Finally, we leverage the dynamically updated canonical human shape to regularize the ray opacities (Sec. 3.3). The training is formulated as a global optimization to jointly optimize the dynamic foreground and static background fields, and the per-frame pose parameters (Sec. 3.4).

3.1. Implicit Neural Avatar Representation

Canonical Shape Representation. We model the human shape in canonical space to form a single, temporally consistent representation and use a neural network f^H_sdf to predict the signed distance value for any 3D point x_c in this space. To model pose-dependent local non-rigid deformations such as dynamically changing wrinkles on clothes, we concatenate the human pose θ as an additional input and model f^H_sdf as:

f^H_sdf : R^(3+n_θ) → R^(1+256).    (1)

The pose parameters θ are defined analogously to SMPL [32], with dimensionality n_θ. Furthermore, f^H_sdf outputs global geometry features z of dimension 256. With slight abuse of notation, we also use f^H_sdf to refer to the SDF value only. The canonical shape S is given by the zero-level set of f^H_sdf:

S = { x_c | f^H_sdf(x_c, θ) = 0 }.    (2)

Points are mapped between canonical and deformed space via skeletal deformation, i.e., linear blend skinning with bone transformations B_i. Here, n_b denotes the number of bones in the transformation, and w_(·) = {w^1_(·), ..., w^(n_b)_(·)} represents the skinning weights for x_(·). Deformed points x_d are associated with the average of the nearest SMPL vertices' skinning weights, weighted by the point-to-point distances in deformed space. Canonical points x_c are treated analogously.
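For illustration, the following minimal PyTorch-style sketch shows a canonical SDF field with the signature of Eq. (1) and an inverse linear-blend-skinning warp of deformed query points. It is not the authors' released implementation: layer sizes, activation choices, and all names are assumptions, and the skinning weights are assumed to be precomputed (e.g., as the distance-weighted average of the nearest SMPL vertices' weights described above).

```python
# Sketch only: canonical SDF network (Eq. 1) and SMPL-guided inverse warping.
# Architecture details and names are illustrative assumptions.
import torch
import torch.nn as nn

class CanonicalSDF(nn.Module):
    """f_sdf^H : (x_c, theta) -> (signed distance, 256-d geometry feature z)."""
    def __init__(self, n_theta=72, feat_dim=256, width=256, depth=8):
        super().__init__()
        dims = [3 + n_theta] + [width] * depth
        self.layers = nn.ModuleList([nn.Linear(i, o) for i, o in zip(dims[:-1], dims[1:])])
        self.head = nn.Linear(width, 1 + feat_dim)

    def forward(self, x_c, theta):
        # x_c: (N, 3) canonical points, theta: (n_theta,) pose parameters
        h = torch.cat([x_c, theta.expand(x_c.shape[0], -1)], dim=-1)
        for layer in self.layers:
            h = torch.relu(layer(h))
        out = self.head(h)
        return out[..., :1], out[..., 1:]  # SDF value, geometry feature z

def inverse_lbs(x_d, bone_transforms, skin_weights):
    """Warp deformed points x_d back to canonical space via linear blend skinning.
    bone_transforms: (n_b, 4, 4) bone transformations B_i,
    skin_weights:    (N, n_b) per-point weights w_d (precomputed from nearby SMPL vertices)."""
    T = torch.einsum('nb,bij->nij', skin_weights, bone_transforms)   # (N, 4, 4)
    x_h = torch.cat([x_d, torch.ones_like(x_d[..., :1])], dim=-1)    # homogeneous coords
    x_c = torch.einsum('nij,nj->ni', torch.inverse(T), x_h)[..., :3]
    return x_c
```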
Figure 2. Method Overview. Given a ray r with camera center o and ray direction v, we sample points densely (x_d) and coarsely (x_b) along the ray for the spherical inner volume and outer volume, respectively. Within the foreground sphere, we warp all sampled points into canonical space via inverse warping and evaluate the SDF of the canonical correspondences x_c via the canonical shape network f^H_sdf. We calculate the spatial gradient of the sampled points in deformed space and concatenate them with the canonical points x_c, the pose parameters θ, and the extracted geometry feature vectors z to form the input to the canonical texture network f^H_rgb, which predicts color values for x_c. We apply surface-based volume rendering for the dynamic foreground and standard volume rendering for the background, and then composite the foreground and background components to attain the final pixel color. We minimize the loss L that compares the color predictions with the image observations, along with novel scene decomposition objectives.
Canonical Texture Representation. The appearance is also modeled in canonical space via a neural network f^H_rgb that predicts color values for 3D points x_c in this space:

f^H_rgb : R^(3+3+n_θ+256) → R^3.    (5)
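As a concrete illustration of the input layout in Eq. (5), a minimal texture-MLP sketch is given below. It additionally takes the deformed-space normal n_d introduced next; the layer sizes and names are assumptions, not the released architecture.

```python
# Sketch of the canonical texture field f_rgb^H (Eq. 5): inputs are the canonical
# point x_c (3), the deformed-space normal n_d (3), the pose theta (n_theta), and
# the geometry feature z (256); the output is an RGB value. Sizes are assumptions.
import torch
import torch.nn as nn

class CanonicalTexture(nn.Module):
    def __init__(self, n_theta=72, feat_dim=256, width=256, depth=4):
        super().__init__()
        dims = [3 + 3 + n_theta + feat_dim] + [width] * depth
        self.layers = nn.ModuleList([nn.Linear(i, o) for i, o in zip(dims[:-1], dims[1:])])
        self.head = nn.Linear(width, 3)

    def forward(self, x_c, n_d, theta, z):
        h = torch.cat([x_c, n_d, theta.expand(x_c.shape[0], -1), z], dim=-1)
        for layer in self.layers:
            h = torch.relu(layer(h))
        return torch.sigmoid(self.head(h))  # RGB in [0, 1]
```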
We condition the canonical texture network on the normal n_d in deformed space, facilitating better disentanglement of geometry and appearance. The normals are given by the spatial gradient of the signed distance field w.r.t. the 3D location in deformed space. Following [67], the spatial gradient of the deformed shape is given by:

n_d = ∂f^H_sdf(x_c, θ)/∂x_d = ∂f^H_sdf(x_c, θ)/∂x_c · ∂x_c/∂x_d = ∂f^H_sdf(x_c, θ)/∂x_c · ( Σ_{i=1..n_b} w^i_d B_i )^(−1).    (6)
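A hedged sketch of Eq. (6) is shown below: the canonical SDF gradient is obtained by automatic differentiation and mapped to deformed space through the inverse of the blended bone transform. It assumes the SDF module and tensor shapes from the earlier sketches; the final normalization is an illustrative choice.

```python
# Sketch of Eq. (6): deformed-space normal from the canonical SDF gradient and
# the (inverse of the) linear-blend-skinning transform. Names are assumptions.
import torch

def deformed_normal(f_sdf, x_c, theta, skin_weights_d, bone_transforms):
    """x_c: (N, 3) canonical points; skin_weights_d: (N, n_b) weights w_d of the
    deformed points; bone_transforms: (n_b, 4, 4) bone transformations B_i."""
    x_c = x_c.detach().requires_grad_(True)
    sdf, _ = f_sdf(x_c, theta)
    grad_c = torch.autograd.grad(sdf.sum(), x_c, create_graph=True)[0]   # d f / d x_c
    # Linear part of the blended bone transform (sum_i w_d^i B_i), inverted per point:
    T = torch.einsum('nb,bij->nij', skin_weights_d, bone_transforms)[:, :3, :3]
    n_d = torch.einsum('ni,nij->nj', grad_c, torch.inverse(T))           # chain rule
    return torch.nn.functional.normalize(n_d, dim=-1)
```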
In practice, we concatenate the canonical points x_c, their normals, the pose parameters, and the extracted 256-dimensional geometry feature vectors z from the shape network to form the input to the canonical texture network. For the remainder of this paper, we denote this neural SDF with f^H_sdf(x_c) and the RGB field as f^H_rgb(x_c) for brevity.

3.2. Composited Volume Rendering

We extend the inverted sphere parametrization of NeRF++ [66] to represent the scene: an outer volume (i.e., the background) covers the complement of a spherical inner volume (i.e., the space assumed to be occupied by the human), and both are modeled by separate networks. The final pixel value is then attained via compositing.

Background. Given the origin O, each 3D point x_b = (x_b, y_b, z_b) in the outer volume is reparametrized by the quadruple x'_b = (x'_b, y'_b, z'_b, 1/r), where ||(x'_b, y'_b, z'_b)|| = 1 and (x_b, y_b, z_b) = r · (x'_b, y'_b, z'_b). Here, r denotes the magnitude of the vector from the origin O to x_b. This parameterization of background points leads to improved numerical stability and assigns lower resolution to points that are farther away. For more details, we refer to [66]. Our method is trained on videos and the background is generally not entirely static. To compensate for dynamic changes in, e.g., lighting, we condition the background network f^B on a per-frame learnable latent code t_i:

f^B : R^(4+3+n_t) → R^(1+3),    (7)

where f^B takes the 4D representation of the sampled background point x'_b, the viewing direction v, and the time encoding t_i with dimension n_t as input, and outputs the density and the view-dependent radiance.
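A minimal sketch of the inverted-sphere reparameterization used for background points follows. It assumes the scene origin O is at zero and that background samples lie outside the unit foreground sphere; the helper name is illustrative.

```python
# Sketch of the NeRF++-style inverted-sphere reparameterization:
# (x, y, z) -> (x', y', z', 1/r) with ||(x', y', z')|| = 1.
import torch

def invert_sphere(x_b, eps=1e-8):
    """x_b: (N, 3) background points outside the unit foreground sphere (origin O = 0)."""
    r = x_b.norm(dim=-1, keepdim=True).clamp_min(eps)   # distance to the origin
    unit_dir = x_b / r                                   # (x', y', z') on the unit sphere
    return torch.cat([unit_dir, 1.0 / r], dim=-1)        # 4D input to f_B (Eq. 7)
```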
Dynamic Foreground. We assume that the inner volume is occupied by a dynamic foreground: the human we seek to reconstruct. This requires different treatment compared to [66], where a static foreground is modeled via a vanilla NeRF. In contrast, we combine the implicit neural avatar representation (Sec. 3.1) with surface-based volume rendering [60]. Thus, we convert the SDF to a density σ by applying the scaled Laplace distribution's Cumulative Distribution Function (CDF) to the negated SDF values ξ(x_c) = −f^H_sdf(x_c):

σ(x_c) = α · ( 1/2 + 1/2 · sign(ξ(x_c)) · (1 − exp(−|ξ(x_c)|/β)) ),    (8)

where α, β > 0 are learnable parameters.
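For concreteness, Eq. (8) can be written as the following small helper; it is a sketch under the stated definitions, with alpha and beta treated as given learnable scalars.

```python
# Sketch of the SDF-to-density conversion in Eq. (8): the scaled Laplace CDF
# applied to the negated SDF.
import torch

def sdf_to_density(sdf, alpha, beta):
    xi = -sdf                                                        # negated SDF
    cdf = 0.5 + 0.5 * torch.sign(xi) * (1.0 - torch.exp(-xi.abs() / beta))
    return alpha * cdf                                               # sigma(x_c)
```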
Similar to [60], we sample N points on a ray r = (o, v) with camera center o and ray direction v in two stages: uniform and inverse-CDF sampling. We then map the sampled points to canonical space via skeletal deformation and use the standard numerical approximation to calculate the integral of the volume rendering equation:

C^H(r) = Σ_{i=1..N} τ_i · f^H_rgb(x^i_c),    (9)

τ_i = exp( −Σ_{j<i} σ(x^j_c) δ^j ) · (1 − exp(−σ(x^i_c) δ^i)),    (10)

where δ^i is the distance between two adjacent samples. The accumulated alpha value of a pixel, which represents the ray opacity, can be obtained by α^H(r) = Σ_{i=1..N} τ_i.
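The surface-guided quadrature of Eqs. (9)–(10) can be sketched as below; tensor shapes (R rays, N samples per ray) and names are illustrative assumptions.

```python
# Sketch of Eqs. (9)-(10): per-sample weights tau_i from densities, then the
# foreground color C^H(r) and the accumulated ray opacity alpha^H(r).
import torch

def render_foreground(sigma, rgb, deltas):
    """sigma: (R, N) densities, rgb: (R, N, 3) colors, deltas: (R, N) sample spacings."""
    free = torch.exp(-torch.cumsum(sigma * deltas, dim=-1))                   # after sample i
    trans = torch.cat([torch.ones_like(free[:, :1]), free[:, :-1]], dim=-1)   # exp(-sum_{j<i})
    tau = trans * (1.0 - torch.exp(-sigma * deltas))                          # Eq. (10)
    color = (tau.unsqueeze(-1) * rgb).sum(dim=-2)                             # Eq. (9)
    alpha = tau.sum(dim=-1)                                                   # ray opacity
    return color, alpha
```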
Scene Composition. To attain the final pixel value for a ray r, we raycast the human and background volumes separately, followed by a scene compositing step. Using the parameterization of the background, we sample along the ray r to obtain sample points in the outer volume for which we query f^B. The background component of a pixel is then given by the integrated color value C^B(r) along the ray [33]. More details can be found in the Supp. Mat. The final pixel color is then the composite of the foreground and background colors:

C(r) = C^H(r) + (1 − α^H(r)) · C^B(r).    (11)
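The following sketch pairs a standard density-based quadrature for the background with the compositing step of Eq. (11). It is illustrative only; the actual background integration follows [33] and the Supp. Mat.

```python
# Sketch: background volume rendering plus the compositing of Eq. (11).
# Shapes are illustrative (R rays, M background samples per ray).
import torch

def render_background(sigma_b, rgb_b, deltas_b):
    """Standard NeRF-style quadrature for the outer volume: sigma_b (R, M), rgb_b (R, M, 3)."""
    alpha = 1.0 - torch.exp(-sigma_b * deltas_b)
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha[:, :-1]], dim=-1), dim=-1)
    weights = trans * alpha
    return (weights.unsqueeze(-1) * rgb_b).sum(dim=-2)               # C^B(r)

def composite(color_h, alpha_h, color_b):
    """Eq. (11): C(r) = C^H(r) + (1 - alpha^H(r)) * C^B(r)."""
    return color_h + (1.0 - alpha_h).unsqueeze(-1) * color_b
```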
3.3. Scene Decomposition Objectives

Learning to decompose the scene into a dynamic human and background by simply minimizing the distance between the composited pixel value and the image RGB value is still a severely ill-posed problem. This is due to the potentially moving scene, dynamic shadows, and general visual complexity. To this end, we propose two objectives that guide the optimization towards a clean and robust decoupling of the human from the background.

Opacity Sparseness Regularization. One of the key components of our method is a loss L_sparse that regularizes the ray opacity via the dynamically updated human shape in canonical space. We first warp sampled points into canonical space and calculate the signed distance to the human shape. We then penalize non-zero ray opacities for rays that do not intersect with the subject. This ray set is denoted as R^i_off for frame i:

L^i_sparse = (1/|R^i_off|) Σ_{r ∈ R^i_off} |α^H(r)|.    (12)

Note that we conservatively update the SDF of the human shape throughout the whole training process, which leads to a precise association of human and background rays.

Self-supervised Ray Classification. Even with the shape regularization from Eq. 12, we observe that the human fields still tend to model parts of the background due to the flexibility and expressive power of MLPs, especially if the subject is in contact with the scene. To further delineate dynamic foreground and background, we introduce an additional loss term to encourage ray distributions that contain either fully transparent or fully opaque rays:

L^i_BCE = −(1/|R^i|) Σ_{r ∈ R^i} ( α^H(r) log(α^H(r)) + (1 − α^H(r)) log(1 − α^H(r)) ),    (13)

where R^i denotes the sampled rays for frame i. This term penalizes deviations of the ray opacities from a binary {0, 1} distribution via the binary cross-entropy loss. Intuitively, this encourages the opacities to be zero for rays that hit the background and one for those that hit the human shape. In practice, this significantly helps the separation of the subject and the background, in particular for difficult cases with similar pixel values across discontinuities. The final scene decomposition loss is then given by:

L_dec = λ_BCE L_BCE + λ_sparse L_sparse.    (14)
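A compact sketch of the two decomposition objectives (Eqs. 12–14) is given below; the lambda values, the clamping epsilon, and all names are illustrative assumptions rather than the paper's hyperparameters.

```python
# Sketch of Eqs. (12)-(14): opacity sparseness on rays that miss the canonical
# human shape, plus a binary-cross-entropy term pushing ray opacities to {0, 1}.
import torch

def decomposition_loss(alpha_all, alpha_off, lambda_bce=0.1, lambda_sparse=1.0, eps=1e-5):
    """alpha_all: (R,) opacities of all sampled rays for a frame;
    alpha_off:  (R_off,) opacities of rays classified as not intersecting the human."""
    l_sparse = alpha_off.abs().mean()                                      # Eq. (12)
    a = alpha_all.clamp(eps, 1.0 - eps)
    l_bce = -(a * torch.log(a) + (1.0 - a) * torch.log(1.0 - a)).mean()    # Eq. (13)
    return lambda_bce * l_bce + lambda_sparse * l_sparse                   # Eq. (14)
```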
Figure 3. Importance of scene decomposition loss. Without the scene decomposition loss, the segmentation includes undesirable background parts due to similar pixel values across discontinuities.
Figure 4. Qualitative mask comparison. Our method generates more detailed and robust segmentations compared to 2D segmentation methods by incorporating 3D knowledge.
Figure 5. Qualitative view synthesis comparison. Our method achieves comparable and even better novel view synthesis results compared to NeRF-based methods (see also Sec. 4.4).

3.4. Global Optimization

To train the models that represent the background and the dynamic foreground jointly from videos, we formulate the training as a global optimization over all frames.

Reconstruction Loss. We calculate the L1-distance between the rendered color C(r) and the pixel's RGB value Ĉ(r) to attain the reconstruction loss L^i_rgb for frame i:

L^i_rgb = (1/|R^i|) Σ_{r ∈ R^i} |C(r) − Ĉ(r)|.    (16)

Full Loss. Given a video sequence with F input frames, we minimize the combined loss function:

L(Θ) = Σ_{i=1..F} [ L^i_rgb(Θ^H, Θ^B) + λ_dec L^i_dec(Θ^H) + λ_eik L^i_eik(Θ^H) ],    (17)

where Θ^H and Θ^B are the sets of optimized parameters for the human and background model, respectively. Θ^H includes the shape network weights Θ^H_sdf, the texture network weights Θ^H_rgb, and the per-frame pose parameters θ_i. Θ^B contains the background density and radiance network weights.

4. Experiments

We first conduct ablation studies on our design choices. Next, we compare our method with state-of-the-art approaches on 2D segmentation, novel view synthesis, and reconstruction tasks. Finally, we demonstrate human reconstruction results on several in-the-wild monocular videos from different sources qualitatively.

4.1. Datasets

MonoPerfCap Dataset [58]: This dataset contains in-the-wild human performance sequences with ground-truth masks. Since our method can provide human segmentation as a by-product, we use this dataset to compare our method with other off-the-shelf 2D segmentation approaches to validate the scene decomposition quality of our method.
NeuMan Dataset [24]: This dataset includes a collection of videos captured by a mobile phone, in which a single person performs walking. We use this dataset to compare our rendering quality of humans under unseen views with other approaches.
3DPW Dataset [50]: This dataset contains challenging in-the-wild video sequences along with accurate 3D human poses recovered by using IMUs and a moving camera. Moreover, it includes registered static clothed 3D human models. By animating the human model with the ground-truth poses, we can obtain quasi ground-truth scans to evaluate the surface reconstruction performance.
SynWild Dataset: We propose a new dataset called SynWild for the evaluation of monocular human surface reconstruction methods. We capture dynamic human subjects in a multi-view system and reconstruct the detailed geometry and texture via commercial software [9]. We then place the textured 4D scans into realistic 3D scenes/HDRI panoramas and render monocular videos from virtual cameras, leveraging a high-quality game engine [2]. This is the first dataset that allows for quantitative comparison in a realistic setting via semi-synthetic data.
Evaluation Protocol: We consider precision, F1 score, and mask IoU for the human segmentation evaluation. We report volumetric IoU, Chamfer distance (cm), and normal consistency for the surface reconstruction comparison. Rendering quality is measured via SSIM and PSNR.

4.2. Ablation Study

Joint Pose Optimization: The initial pose estimate from a monocular RGB video is usually inaccurate. To evaluate the importance of jointly optimizing pose, shape, and appearance, we compare our full model to a version without joint pose optimization. Tab. 3 shows that the joint optimization significantly helps in global pose alignment and in recovering details (normal consistency); this is also confirmed by qualitative results. Please see the Supp. Mat.
Method                Precision ↑   F1 ↑    IoU ↑
SMPL Tracking         0.829         0.781   0.659
PointRend [26]        0.957         0.960   0.915
Ye et al. [62]        0.945         0.947   0.890
RVM [30]              0.975         0.977   0.950
w/o Scene Dec. Loss   0.979         0.974   0.942
Ours                  0.983         0.983   0.961
Table 1. Quantitative evaluation on MonoPerfCap. Our method outperforms all 2D segmentation baselines in all metrics.

Method            SSIM ↑   PSNR ↑
NeuMan [24]       0.958    23.9
HumanNeRF [52]    0.963    24.8
Ours              0.964    25.1
Table 2. Quantitative evaluation on NeuMan. We report the quantitative results on test views. Our method achieves on-par and even better rendering quality compared to NeRF-based methods.

Method            IoU ↑   C-ℓ2 ↓   NC ↑
ICON [55]         0.718   3.32     0.731
SelfRecon [23]    0.648   3.31     0.675
w/o Joint Opt.    0.810   3.00     0.737
Ours              0.818   2.66     0.753
Table 3. Quantitative evaluation on 3DPW. Our method consistently outperforms all baselines in all metrics (cf. Fig. 6).

Method            IoU ↑   C-ℓ2 ↓   NC ↑
ICON [55]         0.764   2.91     0.766
SelfRecon [23]    0.805   2.50     0.776
Ours              0.813   2.35     0.796
Table 4. Quantitative evaluation on SynWild. Our method consistently outperforms all baselines in all metrics (cf. Fig. 6).
… require additional human segmentation as input. Overall, we achieve comparable or even better performance quantitatively (cf. Tab. 2). As shown in Fig. 5, NeuMan and HumanNeRF have obvious artifacts around feet and arms. This is because a) off-the-shelf tools struggle to produce consistent masks, and b) NeRF-based methods are known to have "hazy" floaters in space, leading to visually unpleasant results. Our method produces more plausible renderings of the human with a clean separation from the background.
Figure 7. Qualitative results. We show qualitative results of our method from monocular in-the-wild videos.