
Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-supervised Scene Decomposition

Chen Guo¹  Tianjian Jiang¹  Xu Chen¹,²  Jie Song¹  Otmar Hilliges¹
¹ETH Zürich   ²Max Planck Institute for Intelligent Systems, Tübingen

arXiv:2302.11566v1 [cs.CV] 22 Feb 2023

Figure 1. We propose Vid2Avatar, a method to reconstruct detailed 3D avatars from monocular videos in the wild via self-supervised scene
decomposition. Our method does not require any groundtruth supervision or priors extracted from large datasets of clothed human scans,
nor do we rely on any external segmentation modules.

Abstract

We present Vid2Avatar, a method to learn human avatars from monocular in-the-wild videos. Reconstructing humans that move naturally from monocular in-the-wild videos is difficult. Solving it requires accurately separating humans from arbitrary backgrounds. Moreover, it requires reconstructing detailed 3D surface from short video sequences, making it even more challenging. Despite these challenges, our method does not require any groundtruth supervision or priors extracted from large datasets of clothed human scans, nor do we rely on any external segmentation modules. Instead, it solves the tasks of scene decomposition and surface reconstruction directly in 3D by modeling both the human and the background in the scene jointly, parameterized via two separate neural fields. Specifically, we define a temporally consistent human representation in canonical space and formulate a global optimization over the background model, the canonical human shape and texture, and per-frame human pose parameters. A coarse-to-fine sampling strategy for volume rendering and novel objectives are introduced for a clean separation of dynamic human and static background, yielding detailed and robust 3D human geometry reconstructions. We evaluate our method on publicly available datasets and show improvements over prior art. Project page: https://moygcc.github.io/vid2avatar/.

1. Introduction

Being able to reconstruct detailed avatars from readily available "in-the-wild" videos, for example recorded with a phone, would enable many applications in AR/VR, in human-computer interaction, robotics and in the movie and sports industry. Traditionally, high-fidelity 3D reconstruction of dynamic humans has required calibrated multi-view systems [9, 10, 19, 27, 31, 45, 49], which are expensive and require highly-specialized expertise to operate. In contrast, emerging applications such as the Metaverse require more light-weight and practical solutions in order to make the digitization of humans a widely available technology. Reconstructing humans that move naturally from monocular in-the-wild videos is clearly a difficult problem. Solving it requires accurately separating humans from arbitrary backgrounds, without any prior knowledge about the scene or the subject. Moreover, it requires reconstructing detailed 3D surfaces from short video sequences, made even more challenging due to depth ambiguities, the complex dynamics of human motion and the high-frequency surface details.

Traditional template-based approaches [15, 16, 58] cannot generalize to in-the-wild settings due to the requirement for a pre-scanned template and manual rigging. Methods that are based on explicit mesh representations are limited to a fixed topology and resolution [3, 8, 14, 34]. Fully-supervised methods that directly regress 3D surfaces from images [17, 18, 21, 41, 42, 55, 68] struggle with difficult out-of-distribution poses and shapes, and do not always predict temporally consistent reconstructions. Fitting neural implicit surfaces to videos has recently been demonstrated [23, 24, 40, 46, 47, 52, 67]. However, these methods depend on pre-segmented inputs and are therefore not robust to uncontrolled visual complexity and are upper-bounded in their reconstruction quality by the segmentation method.

In this paper, we introduce Vid2Avatar, a method to learn human avatars from monocular in-the-wild videos without requiring any groundtruth supervision or priors extracted from large datasets of clothed human scans, nor do we rely on any external segmentation modules. We solve the tasks of scene separation and surface reconstruction directly in 3D. To achieve this, we model both the foreground (i.e., human) and the background in the scene implicitly, parameterized via two separate neural fields. A key challenge is to associate 3D points to either of these fields without reverting to 2D segmentation. To tackle this challenge, our method builds upon the following core concepts: i) We define a single temporally consistent representation of the human shape and texture in canonical space and leverage the inverse mapping of a parametric body model to learn from deformed observations. ii) A global optimization formulation jointly optimizes the parameters of the background model, the canonical human shape and its appearance, and the pose estimates of the human subject over the entire sequence. iii) A coarse-to-fine sampling strategy for volume rendering naturally leads to a separation of dynamic foreground and static background. iv) Novel objectives further improve the scene decomposition and lead to sharp boundaries between the human and the background, even when both are in contact (e.g., around the feet), yielding better geometry and appearance reconstructions.

More specifically, we leverage an inverse-depth parameterization in spherical coordinates [66] to coarsely separate the static background from the dynamic foreground. Within the foreground sphere, we leverage a surface-guided volume rendering approach to attain densities via the conversion method proposed in [60]. Importantly, we warp all sampled points into canonical space and update the human shape field dynamically. To attain sharp boundaries between the dynamic foreground and the scene, we introduce two optimization objectives that encourage a quasi-discrete binary distribution of ray opacities and penalize non-zero opacity for rays that do not intersect with the human. The final rendering of the scene is then attained by differentiable composited volume rendering.

We show that this optimization formulation leads to clean scene decomposition and high-quality 3D reconstructions of the human subject. In detailed ablations, we shed light on the key components of our method. Furthermore, we compare to existing methods in 2D segmentation, novel view synthesis, and reconstruction tasks, showing that our method performs best across several datasets and settings. To allow for quantitative comparison across methods, we contribute a novel semi-synthetic test set that contains accurate 3D geometry of human subjects. Finally, we demonstrate the ability to reconstruct different humans in detail from online videos and hand-held mobile phone video clips. In summary, our contributions are:
• a method to reconstruct detailed 3D avatars from in-the-wild monocular videos via self-supervised scene decomposition;
• robust and detailed 3D reconstructions of the human even under challenging poses and environments, without requiring external segmentation methods; and
• a novel semi-synthetic testing dataset that for the first time allows comparing monocular human reconstruction methods on realistic scenes. The dataset contains rich annotations of the 3D surface.
Code and data will be made available for research purposes.

2. Related Work

Reconstructing Human from Monocular Video. Traditional works for monocular human performance capture require personalized rigged templates as prior and track the pre-defined human model based on 2D observations [15, 16, 58]. These works require pre-scanning of the performer and post-processing for rigging, preventing such methods from being deployed to real-life applications. Some methods attempt to avoid the need for pre-scanning and manual rigging [3, 8, 14, 34]. However, the explicit mesh representation is limited to a fixed resolution and cannot represent details like the face. Regression-based methods that directly regress 3D surfaces from images have demonstrated compelling results [4, 12, 17, 18, 21, 41, 42, 55, 68]. However, they require high-quality 3D data for supervision and cannot maintain the space-time coherence of the reconstruction over the whole sequence. Recent works fit implicit neural fields to videos via neural rendering to obtain articulated human models [23, 24, 40, 46, 47, 52, 67]. HumanNeRF [52] extends articulated NeRF to improve novel view synthesis. NeuMan [24] further adds a scene NeRF model. Both methods model the human geometry with a density field, only yielding a noisy, and often low-fidelity, human reconstruction. SelfRecon [23] deploys neural surface rendering [61] to achieve consistent reconstruction over the sequence. However, all aforementioned methods rely on pre-segmented inputs and are therefore not robust to uncontrolled visual complexity and are upper-bounded in their reconstruction quality by the external segmentation method. In contrast, our method solves the tasks of scene decomposition and surface reconstruction jointly in 3D without using segmentation modules.
Reconstructing Human from Multi-view/Depth. The high-fidelity 3D reconstruction of dynamic humans has required calibrated dense multi-view systems [9, 10, 19, 27, 31, 45, 49], which are expensive and laborious to operate and require highly-specialized expertise. Recent works [20, 22, 28, 37, 39, 51, 56, 57] attempt to reconstruct humans from sparser settings by deploying neural rendering. Depth-based approaches [6, 35, 36] reconstruct the human shape by fusing depth measurements across time. Follow-up work [7, 11, 29, 63, 64] builds upon this concept by incorporating an articulated motion prior and a parametric body shape prior. While the aforementioned methods achieve compelling results, they still require a specialized capturing setup and are hence not applicable to in-the-wild settings. In contrast, our method recovers the dynamic human shape in the wild from a monocular RGB video as the sole input.

Moving Object Segmentation. Traditional research in moving object segmentation has been extensively conducted at the image level (i.e., 2D). One line of research relies on motion cues to segment objects with different optical flow patterns [5, 38, 54, 59, 62], while another line of work, termed video matting [25, 30, 43], is trained on videos with human-annotated masks to directly regress the alpha-channel values during inference. Those approaches are not without limitations, as they focus on image-level segmentation and incorporate no 3D knowledge. Thus, they cannot handle complicated backgrounds without enough color contrast between the human and the background. Recent works learn to decompose dynamic objects and the static background simultaneously in 3D by optimizing multiple NeRFs [44, 48, 53, 65]. Such methods perform well for non-complicated dynamic objects but are not directly applicable to articulated humans with intricate motions.

3. Method

We introduce Vid2Avatar, a method for detailed geometry and appearance reconstruction of implicit neural avatars from monocular videos in the wild. Our method is schematically illustrated in Fig. 2. Reconstructing humans from in-the-wild videos is clearly challenging. Solving it requires accurately segmenting humans from arbitrary backgrounds without any prior knowledge about the appearance of the scene or the subject, and requires reconstructing detailed 3D surface and appearance from short video sequences. In contrast to prior works that utilize off-the-shelf 2D segmentation tools or manually labeled masks, we solve the tasks of scene decomposition and surface reconstruction directly in 3D. To achieve this, we model both the human and the background in the scene implicitly, parameterized via two separate neural fields which are learned jointly from images to composite the whole scene. To alleviate the ambiguity of in-contact body and scene parts and to better delineate the surfaces, we contribute novel objectives that leverage the dynamically updated human shape in canonical space to regularize the ray opacity.

We parameterize the 3D geometry and texture of clothed humans as a pose-conditioned implicit signed-distance field (SDF) and texture field in canonical space (Sec. 3.1). We then model the background using a separate neural radiance field (NeRF). The human shape and appearance fields alongside the background field are learned from images jointly via differentiable composited neural volume rendering (Sec. 3.2). Finally, we leverage the dynamically updated canonical human shape to regularize the ray opacities (Sec. 3.3). The training is formulated as a global optimization to jointly optimize the dynamic foreground and static background fields, and the per-frame pose parameters (Sec. 3.4).

Figure 2. Method Overview. Given a ray r with camera center o and ray direction v, we sample points densely (x_d) and coarsely (x_b) along the ray for the spherical inner volume and outer volume respectively. Within the foreground sphere, we warp all sampled points into canonical space via inverse warping and evaluate the SDF of the canonical correspondences x_c via the canonical shape network f^H_sdf. We calculate the spatial gradient of the sampled points in deformed space and concatenate them with the canonical points x_c, the pose parameters θ, and the extracted geometry feature vectors z to form the input to the canonical texture network f^H_rgb, which predicts color values for x_c. We apply surface-based volume rendering for the dynamic foreground and standard volume rendering for the background, and then composite the foreground and background components to attain the final pixel color. We minimize the loss L that compares the color predictions with the image observations, along with novel scene decomposition objectives.

3.1. Implicit Neural Avatar Representation

Canonical Shape Representation. We model the human shape in canonical space to form a single, temporally consistent representation and use a neural network f^H_sdf to predict the signed distance value for any 3D point x_c in this space. To model pose-dependent local non-rigid deformations such as dynamically changing wrinkles on clothes, we concatenate the human pose θ as an additional input and model f^H_sdf as:

f^H_{\mathrm{sdf}} : \mathbb{R}^{3+n_\theta} \rightarrow \mathbb{R}^{1+256}.   (1)

The pose parameters θ are defined analogously to SMPL [32], with dimensionality n_θ. Furthermore, f^H_sdf outputs global geometry features z of dimension 256. With slight abuse of notation, we also use f^H_sdf to refer to the SDF value only. The canonical shape S is given by the zero-level set of f^H_sdf:

\mathcal{S} = \{\, \mathbf{x}_c \mid f^H_{\mathrm{sdf}}(\mathbf{x}_c, \theta) = 0 \,\}.   (2)

Skeletal Deformation. Given the bone transformation matrices B_i for joints i ∈ {1, ..., n_b}, which are derived from the body pose θ, a canonical point x_c is mapped to the deformed point x_d via linear-blend skinning:

\mathbf{x}_d = \sum_{i=1}^{n_b} w_c^i B_i \mathbf{x}_c.   (3)

The canonical correspondence x_c for points x_d in deformed space is defined by the inverse of Eq. 3:

\mathbf{x}_c = \Big( \sum_{i=1}^{n_b} w_d^i B_i \Big)^{-1} \mathbf{x}_d.   (4)

Here, n_b denotes the number of bones in the transformation, and w_(·) = {w^1_(·), ..., w^{n_b}_(·)} represents the skinning weights for x_(·). Deformed points x_d are associated with the average of the nearest SMPL vertices' skinning weights, weighted by the point-to-point distances in deformed space. Canonical points x_c are treated analogously.
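A minimal sketch of the skeletal deformation of Eqs. 3–4, assuming the bone transformations B_i and per-point skinning weights (e.g., averaged from the nearest SMPL vertices as described above) are already available as tensors; function names and shapes are hypothetical.

```python
import torch

def forward_lbs(x_c: torch.Tensor, w_c: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Eq. 3: x_d = sum_i w_c^i * B_i * x_c (in homogeneous coordinates).

    x_c: (N, 3) canonical points, w_c: (N, n_b) skinning weights, B: (n_b, 4, 4) bone transforms.
    """
    T = torch.einsum("ni,ijk->njk", w_c, B)                       # (N, 4, 4) blended transform
    x_h = torch.cat([x_c, torch.ones_like(x_c[:, :1])], dim=-1)   # homogeneous coordinates
    return torch.einsum("njk,nk->nj", T, x_h)[:, :3]

def inverse_lbs(x_d: torch.Tensor, w_d: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Eq. 4: x_c = (sum_i w_d^i * B_i)^(-1) * x_d, with weights queried in deformed space."""
    T = torch.einsum("ni,ijk->njk", w_d, B)
    x_h = torch.cat([x_d, torch.ones_like(x_d[:, :1])], dim=-1)
    return torch.einsum("njk,nk->nj", torch.linalg.inv(T), x_h)[:, :3]
```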
Canonical Texture Representation. The appearance is also modeled in canonical space via a neural network f^H_rgb that predicts color values for 3D points x_c in this space:

f^H_{\mathrm{rgb}} : \mathbb{R}^{3+3+n_\theta+256} \rightarrow \mathbb{R}^{3}.   (5)

We condition the canonical texture network on the normal n_d in deformed space, facilitating better disentanglement of the geometry and appearance. The normals are given by the spatial gradient of the signed distance field w.r.t. the 3D location in deformed space. Following [67], the spatial gradient of the deformed shape is given by:

\mathbf{n}_d = \frac{\partial f^H_{\mathrm{sdf}}(\mathbf{x}_c, \theta)}{\partial \mathbf{x}_d} = \frac{\partial f^H_{\mathrm{sdf}}(\mathbf{x}_c, \theta)}{\partial \mathbf{x}_c}\, \frac{\partial \mathbf{x}_c}{\partial \mathbf{x}_d} = \frac{\partial f^H_{\mathrm{sdf}}(\mathbf{x}_c, \theta)}{\partial \mathbf{x}_c} \Big( \sum_{i=1}^{n_b} w_d^i B_i \Big)^{-1}.   (6)

In practice we concatenate the canonical points x_c, their normals, the pose parameters, and the extracted 256-dimensional geometry feature vectors z from the shape network to form the input to the canonical texture network. For the remainder of this paper, we denote this neural SDF with f^H_sdf(x_c) and the RGB field as f^H_rgb(x_c) for brevity.
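In an implementation, the deformed-space normal of Eq. 6 can be obtained with automatic differentiation rather than by assembling the Jacobian manually. The sketch below assumes a callable f_sdf(x_c, theta) returning the signed distance and the feature vector, mirroring Eq. 1; it illustrates the chain rule of Eq. 6 and is not taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def deformed_normal(f_sdf, x_c: torch.Tensor, theta: torch.Tensor,
                    w_d: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Eq. 6: n_d = (dSDF/dx_c) * (sum_i w_d^i B_i)^(-1), with dSDF/dx_c from autograd."""
    x_c = x_c.detach().requires_grad_(True)            # treat canonical points as leaves
    sdf, _ = f_sdf(x_c, theta)
    grad_c = torch.autograd.grad(sdf.sum(), x_c, create_graph=True)[0]   # (N, 3)
    T_inv = torch.linalg.inv(torch.einsum("ni,ijk->njk", w_d, B))        # (N, 4, 4)
    n_d = torch.einsum("nj,njk->nk", grad_c, T_inv[:, :3, :3])           # rotation part only
    return F.normalize(n_d, dim=-1)                     # unit normal (normalization optional)
```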
3.2. Composited Volume Rendering

We extend the inverted sphere parametrization of NeRF++ [66] to represent the scene: an outer volume (i.e., the background) covers the complement of a spherical inner volume (i.e., the space assumed to be occupied by the human), and both are modeled by separate networks. The final pixel value is then attained via compositing.

Background. Given the origin O, each 3D point x_b = (x_b, y_b, z_b) in the outer volume is reparametrized by the quadruple x'_b = (x'_b, y'_b, z'_b, 1/r), where ||(x'_b, y'_b, z'_b)|| = 1 and (x_b, y_b, z_b) = r · (x'_b, y'_b, z'_b). Here r denotes the magnitude of the vector from the origin O to x_b. This parameterization of background points leads to improved numerical stability and assigns lower resolution to farther away points. For more details, we refer to [66]. Our method is trained with videos and the background is generally not entirely static. To compensate for dynamic changes in, e.g., lighting, we condition the background network f^B on a per-frame learnable latent code t_i:

f^B : \mathbb{R}^{4+3+n_t} \rightarrow \mathbb{R}^{1+3},   (7)

where f^B takes the 4D representation of the sampled background point x'_b, the viewing direction v, and the time encoding t_i with dimension n_t as input, and outputs the density and the view-dependent radiance.
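For reference, a small sketch of the inverted-sphere reparameterization applied to background samples, assuming the scene has been normalized so that the inner (human) volume fits inside the unit sphere; the background network f^B itself is omitted.

```python
import torch

def invert_background_point(x_b: torch.Tensor) -> torch.Tensor:
    """Map an outer-volume point (x, y, z) with ||x|| = r > 1 to (x', y', z', 1/r),
    where (x', y', z') is the unit-norm direction (Sec. 3.2, following NeRF++ [66])."""
    r = x_b.norm(dim=-1, keepdim=True).clamp(min=1e-6)
    return torch.cat([x_b / r, 1.0 / r], dim=-1)   # 4D positional input to f^B

# f^B additionally takes the viewing direction and a per-frame latent code t_i (Eq. 7)
# and returns a density and a view-dependent color.
```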
Dynamic Foreground. We assume that the inner volume is occupied by a dynamic foreground – the human we seek to reconstruct. This requires different treatment compared to [66], where a static foreground is modeled via a vanilla NeRF. In contrast, we combine the implicit neural avatar representation (Sec. 3.1) with surface-based volume rendering [60]. Thus, we convert the SDF to a density σ by applying the scaled Laplace distribution's Cumulative Distribution Function (CDF) to the negated SDF values ξ(x_c) = −f^H_sdf(x_c):

\sigma(\mathbf{x}_c) = \alpha \Big( \tfrac{1}{2} + \tfrac{1}{2}\, \mathrm{sign}(\xi(\mathbf{x}_c)) \big( 1 - \exp\!\big(-\tfrac{|\xi(\mathbf{x}_c)|}{\beta}\big) \big) \Big),   (8)

where α, β > 0 are learnable parameters. Similar to [60], we sample N points on a ray r = (o, v) with camera center o and ray direction v in two stages – uniform and inverse CDF sampling. We then map the sampled points to canonical space via skeletal deformation and use standard numerical approximation to calculate the integral of the volume rendering equation:

C^H(r) = \sum_{i=1}^{N} \tau^i f^H_{\mathrm{rgb}}(\mathbf{x}_c^i),   (9)

\tau^i = \exp\Big( -\sum_{j<i} \sigma(\mathbf{x}_c^j)\, \delta^j \Big) \big( 1 - \exp(-\sigma(\mathbf{x}_c^i)\, \delta^i) \big),   (10)

where δ^i is the distance between two adjacent samples. The accumulated alpha value of a pixel, which represents the ray opacity, can be obtained by α^H(r) = \sum_{i=1}^{N} \tau^i.
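A hedged sketch of the surface-guided foreground rendering of Eqs. 8–10: the Laplace-CDF conversion from SDF to density with learnable α and β, followed by the standard quadrature that yields the per-sample weights τ^i, the foreground color C^H(r) and the ray opacity α^H(r). Ray sampling (uniform plus inverse-CDF) and the canonical warping are assumed to have happened already; tensor shapes are illustrative.

```python
import torch

def sdf_to_density(sdf: torch.Tensor, alpha: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """Eq. 8: sigma = alpha * (1/2 + 1/2 * sign(xi) * (1 - exp(-|xi| / beta))), xi = -sdf."""
    xi = -sdf
    return alpha * (0.5 + 0.5 * torch.sign(xi) * (1.0 - torch.exp(-xi.abs() / beta)))

def render_foreground(rgb: torch.Tensor, sigma: torch.Tensor, dists: torch.Tensor):
    """Eqs. 9-10 along each ray.

    rgb: (R, N, 3) colors of the canonical samples, sigma: (R, N) densities, dists: (R, N) deltas.
    Returns the foreground color C^H(r) and the ray opacity alpha^H(r) = sum_i tau^i.
    """
    free = torch.exp(-torch.cumsum(sigma * dists, dim=-1))                    # after sample i
    trans = torch.cat([torch.ones_like(free[:, :1]), free[:, :-1]], dim=-1)   # before sample i
    tau = trans * (1.0 - torch.exp(-sigma * dists))                           # Eq. 10
    c_h = (tau.unsqueeze(-1) * rgb).sum(dim=1)                                # Eq. 9
    return c_h, tau.sum(dim=-1)
```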

Scene Composition. To attain the final pixel value for a ray r, we raycast the human and background volumes separately, followed by a scene compositing step. Using the parameterization of the background, we sample along the ray r to obtain sample points in the outer volume for which we query f^B. The background component of a pixel is then given by the integrated color value C^B(r) along the ray [33]. More details can be found in the Supp. Mat. The final pixel color is then the composite of the foreground and background color:

C(r) = C^H(r) + (1 - \alpha^H(r))\, C^B(r).   (11)

3.3. Scene Decomposition Objectives

Learning to decompose the scene into a dynamic human and background by simply minimizing the distance between the composited pixel value and the image RGB value is still a severely ill-posed problem. This is due to the potentially moving scene, dynamic shadows, and general visual complexity. To this end, we propose two objectives that guide the optimization towards a clean and robust decoupling of the human from the background.

Opacity Sparseness Regularization. One of the key components of our method is a loss L_sparse to regularize the ray opacity via the dynamically updated human shape in canonical space. We first warp sampled points into the canonical space and calculate the signed distance to the human shape. We then penalize non-zero ray opacities for rays that do not intersect with the subject. This ray set is denoted as R^i_off for frame i:

\mathcal{L}^i_{\mathrm{sparse}} = \frac{1}{|\mathcal{R}^i_{\mathrm{off}}|} \sum_{r \in \mathcal{R}^i_{\mathrm{off}}} |\alpha^H(r)|.   (12)

Note that we conservatively update the SDF of the human shape throughout the whole training process, which leads to a precise association of human and background rays.

Self-supervised Ray Classification. Even with the shape regularization from Eq. 12, we observe that the human fields still tend to model parts of the background due to the flexibility and expressive power of MLPs, especially if the subject is in contact with the scene. To further delineate dynamic foreground and background, we introduce an additional loss term to encourage ray distributions that contain either fully transparent or opaque rays:

\mathcal{L}^i_{\mathrm{BCE}} = -\frac{1}{|\mathcal{R}^i|} \sum_{r \in \mathcal{R}^i} \big( \alpha^H(r) \log(\alpha^H(r)) + (1 - \alpha^H(r)) \log(1 - \alpha^H(r)) \big),   (13)

where R^i denotes the sampled rays for frame i. This term penalizes deviations of the ray opacities from a binary {0, 1} distribution via the binary cross entropy loss. Intuitively this encourages the opacities to be zero for rays that hit the background and one for those that hit the human shape. In practice, this significantly helps separation of the subject and the background, in particular for difficult cases with similar pixel values across discontinuities. The final scene decomposition loss is then given by L_dec:

\mathcal{L}_{\mathrm{dec}} = \lambda_{\mathrm{BCE}} \mathcal{L}_{\mathrm{BCE}} + \lambda_{\mathrm{sparse}} \mathcal{L}_{\mathrm{sparse}}.   (14)

3.4. Global Optimization

To train the models that represent the background and the dynamic foreground jointly from videos, we formulate the training as a global optimization over all frames.

Eikonal Loss. Following IGR [13], we leverage L^i_eik to force the shape network f^H_sdf to satisfy the Eikonal equation in canonical space:

\mathcal{L}^i_{\mathrm{eik}} = \mathbb{E}_{\mathbf{x}_c} \big( \|\nabla f^H_{\mathrm{sdf}}(\mathbf{x}_c)\| - 1 \big)^2.   (15)

Reconstruction Loss. We calculate the L1-distance between the rendered color C(r) and the pixel's RGB value Ĉ(r) to attain the reconstruction loss L^i_rgb for frame i:

\mathcal{L}^i_{\mathrm{rgb}} = \frac{1}{|\mathcal{R}^i|} \sum_{r \in \mathcal{R}^i} |C(r) - \hat{C}(r)|.   (16)

Full Loss. Given a video sequence with F input frames, we minimize the combined loss function:

\mathcal{L}(\Theta) = \sum_{i=1}^{F} \big( \mathcal{L}^i_{\mathrm{rgb}}(\Theta^H, \Theta^B) + \lambda_{\mathrm{dec}} \mathcal{L}^i_{\mathrm{dec}}(\Theta^H) + \lambda_{\mathrm{eik}} \mathcal{L}^i_{\mathrm{eik}}(\Theta^H) \big),   (17)

where Θ^H and Θ^B are the sets of optimized parameters for the human and background model respectively. Θ^H includes the shape network weights Θ^H_sdf, the texture network weights Θ^H_rgb and the per-frame pose parameters θ_i. Θ^B contains the background density and radiance network weights.
Figure 3. Importance of scene decomposition loss. Without the scene decomposition loss, the segmentation includes undesirable background parts due to similar pixel values across discontinuities.

Figure 4. Qualitative mask comparison. Our method generates more detailed and robust segmentations compared to 2D segmentation methods by incorporating 3D knowledge.

Figure 5. Qualitative view synthesis comparison (panels left to right: GT, NeuMan, HumanNeRF, Ours). Our method achieves comparable and even better novel view synthesis results compared to NeRF-based methods (see also Sec. 4.4).

4. Experiments

We first conduct ablation studies on our design choices. Next, we compare our method with state-of-the-art approaches in 2D segmentation, novel view synthesis, and reconstruction tasks. Finally, we demonstrate human reconstruction results on several in-the-wild monocular videos from different sources qualitatively.

4.1. Datasets

MonoPerfCap Dataset [58]: This dataset contains in-the-wild human performance sequences with ground-truth masks. Since our method can provide human segmentation as a by-product, we use this dataset to compare our method with other off-the-shelf 2D segmentation approaches to validate the scene decomposition quality of our method.

NeuMan Dataset [24]: This dataset includes a collection of videos captured by a mobile phone, in which a single person performs walking. We use this dataset to compare our rendering quality of humans under unseen views with other approaches.

3DPW Dataset [50]: This dataset contains challenging in-the-wild video sequences along with accurate 3D human poses recovered by using IMUs and a moving camera. Moreover, it includes registered static clothed 3D human models. By animating the human model with the ground-truth poses, we can obtain quasi ground-truth scans to evaluate the surface reconstruction performance.

SynWild Dataset: We propose a new dataset called SynWild for the evaluation of monocular human surface reconstruction methods. We capture dynamic human subjects in a multi-view system and reconstruct the detailed geometry and texture via commercial software [9]. Then we place the textured 4D scans into realistic 3D scenes/HDRI panoramas and render monocular videos from virtual cameras, leveraging a high-quality game engine [2]. This is the first dataset that allows for quantitative comparison in a realistic setting via semi-synthetic data.

Evaluation Protocol: We consider precision, F1 score, and mask IoU for human segmentation evaluation. We report volumetric IoU, Chamfer distance (cm) and normal consistency for surface reconstruction comparison. Rendering quality is measured via SSIM and PSNR.

4.2. Ablation Study

Joint Pose Optimization: The initial pose estimate from a monocular RGB video is usually inaccurate. To evaluate the importance of jointly optimizing pose, shape, and appearance, we compare our full model to a version without joint pose optimization. Tab. 3 shows that the joint optimization significantly helps in global pose alignment and in recovering details (normal consistency); this is also confirmed by qualitative results. Please see the Supp. Mat.

Scene Decomposition Loss: To demonstrate the effectiveness of our proposed scene decomposition loss, we conduct an ablation experiment without this term during optimization. Results in Tab. 1 indicate that without the scene decomposition loss, the segmentation tends to be noisy and includes parts of the background, as shown in Fig. 3.

Method                 Precision ↑   F1 ↑    IoU ↑
SMPL Tracking          0.829         0.781   0.659
PointRend [26]         0.957         0.960   0.915
Ye et al. [62]         0.945         0.947   0.890
RVM [30]               0.975         0.977   0.950
w/o Scene Dec. Loss    0.979         0.974   0.942
Ours                   0.983         0.983   0.961

Table 1. Quantitative evaluation on MonoPerfCap. Our method outperforms all 2D segmentation baselines in all metrics.

Method            SSIM ↑   PSNR ↑
NeuMan [24]       0.958    23.9
HumanNeRF [52]    0.963    24.8
Ours              0.964    25.1

Table 2. Quantitative evaluation on NeuMan. We report the quantitative results on test views. Our method achieves on-par and even better rendering quality compared to NeRF-based methods.

Method            IoU ↑   C-ℓ2 ↓   NC ↑
ICON [55]         0.718   3.32     0.731
SelfRecon [23]    0.648   3.31     0.675
w/o Joint Opt.    0.810   3.00     0.737
Ours              0.818   2.66     0.753

Table 3. Quantitative evaluation on 3DPW. Our method consistently outperforms all baselines in all metrics (cf. Fig. 6).

Method            IoU ↑   C-ℓ2 ↓   NC ↑
ICON [55]         0.764   2.91     0.766
SelfRecon [23]    0.805   2.50     0.776
Ours              0.813   2.35     0.796

Table 4. Quantitative evaluation on SynWild. Our method consistently outperforms all baselines in all metrics (cf. Fig. 6).

4.3. 2D Segmentation Comparisons

We generate human masks by extracting the pixels with a ray opacity α^H(r) value of 1. Our produced masks are compared with SMPL Tracking, PointRend [26], Ye et al. [62] and RVM [30] on the MonoPerfCap dataset [58]. [26] and [30] are trained on large datasets with human-annotated masks, while [62] relies on optical flow as a motion cue to segment objects in an unsupervised manner. SMPL Tracking uses dilated projected SMPL masks as the result. Tab. 1 shows the quantitative results. Our method consistently outperforms the other baseline methods on all metrics. Fig. 4 shows that other baselines struggle on the feet since there is not enough photometric contrast between parts of the shoes and the stairs. In contrast, our method is able to generate plausible human segmentation via decomposition from a 3D perspective.

4.4. View Synthesis Comparisons

Though not our primary goal, we also compare with HumanNeRF [52] and NeuMan [24] for the task of novel view synthesis on the NeuMan dataset. Note that both methods require additional human segmentation as input. Overall, we achieve comparable or even better performance quantitatively (cf. Tab. 2). As shown in Fig. 5, NeuMan and HumanNeRF have obvious artifacts around feet and arms. This is because, a) off-the-shelf tools struggle to produce consistent masks and b) NeRF-based methods are known to have "hazy" floaters in space, leading to visually unpleasant results. Our method produces more plausible renderings of the human with a clean separation from the background.

4.5. Reconstruction Comparisons

We compare our proposed human surface reconstruction method to several state-of-the-art approaches [23, 55] on both 3DPW [50] and SynWild. ICON (image-based) [55] reconstructs 3D clothed humans by learning a regression model from a large dataset of clothed human scans [1]. SelfRecon (video-based) [23] deploys implicit surface rendering to reconstruct avatars from monocular videos. Both methods rely on additional human masks as input. Despite this, our method outperforms [23, 55] by a substantial margin on all metrics (cf. Tab. 3, Tab. 4). The difference is more visible in the qualitative comparison shown in Fig. 6, where they tend to produce physically incorrect body reconstructions (e.g., missing arms and sunken backs). In contrast, our method generates complete human bodies and recovers more surface details (e.g., cloth wrinkles and facial features). We attribute this to the better decoupling of humans from the background by our proposed modeling and learning schemes.

4.6. Qualitative Results

We demonstrate our results on several in-the-wild monocular videos from different sources: online, datasets, and self-captured video clips (Fig. 7). Our method is able to reconstruct complex cloth deformations and personalized facial features in detail. Please refer to the Supp. Mat. for more qualitative results.
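As a small illustration of the segmentation part of the evaluation protocol (Sec. 4.1) and of the mask generation described in Sec. 4.3, the sketch below binarizes the rendered ray opacity α^H(r) into a human mask and computes precision, F1 and IoU against a ground-truth mask; the 0.5 threshold is an assumption made for robustness rather than the exact value used by the authors.

```python
import numpy as np

def opacity_to_mask(alpha_h: np.ndarray, thresh: float = 0.5) -> np.ndarray:
    """Binarize the per-pixel ray opacity alpha^H(r) into a foreground mask (Sec. 4.3)."""
    return alpha_h >= thresh

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray):
    """Precision, F1 and IoU between two binary masks (Sec. 4.1 evaluation protocol)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    iou = tp / max(tp + fp + fn, 1)
    return precision, f1, iou
```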
Figure 6. Qualitative reconstruction comparison. Data source top to bottom: 3DPW, SynWild, Online. ICON and SelfRecon produce less detailed and even physically implausible reconstructions (incomplete human bodies). In contrast, our method generates complete human bodies and achieves a detailed (e.g., cloth wrinkles) and temporally consistent shape reconstruction.

Figure 7. Qualitative results. We show qualitative results (input image and reconstructed geometry) of our method from monocular in-the-wild videos.

5. Conclusion

In this paper, we present Vid2Avatar to reconstruct detailed 3D avatars from monocular in-the-wild videos via self-supervised scene decomposition. Our method does not require any groundtruth supervision or priors extracted from large datasets of clothed human scans, nor do we rely on any external segmentation modules. With carefully designed background modeling and a temporally consistent canonical human representation, a global optimization with novel scene decomposition objectives is formulated to jointly optimize the parameters of the background field, the canonical human shape and appearance, and the human pose estimates over the entire sequence via differentiable composited volume rendering. Our method achieves robust and high-fidelity human reconstruction from monocular videos.

Limitations: Although readily available, Vid2Avatar relies on reasonable pose estimates as inputs. Furthermore, loose clothing such as skirts or free-flowing garments poses significant challenges due to their fast dynamics. We refer to the Supp. Mat. for a more detailed discussion of limitations and societal impact.
References

[1] Renderpeople, 2018. https://www.renderpeople.com.
[2] Unreal, 2020. https://www.unrealengine.com.
[3] Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. Video based reconstruction of 3d people models. CVPR 2018.
[4] Thiemo Alldieck, Mihai Zanfir, and Cristian Sminchisescu. Photorealistic monocular 3d reconstruction of humans wearing clothing. CVPR 2022.
[5] Pia Bideau and Erik Learned-Miller. It's moving! A probabilistic model for causal motion segmentation in moving camera videos. ECCV 2016.
[6] Aljaz Bozic, Pablo Palafox, Michael Zollhofer, Justus Thies, Angela Dai, and Matthias Niessner. Neural deformation graphs for globally-consistent non-rigid reconstruction. CVPR 2021.
[7] Andrei Burov, Matthias Nießner, and Justus Thies. Dynamic surface function networks for clothed human bodies. ICCV 2021.
[8] Andrés Casado-Elvira, Marc Comino Trinidad, and Dan Casas. PERGAMO: Personalized 3d garments from monocular video. Computer Graphics Forum (Proc. SCA), 2022.
[9] Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. High-quality streamable free-viewpoint video. ACM Trans. Graph. 34(4), 2015.
[10] Edilson de Aguiar, Carsten Stoll, Christian Theobalt, Naveed Ahmed, Hans-Peter Seidel, and Sebastian Thrun. Performance capture from sparse multi-view video. ACM Trans. Graph. 27(3):1–10, 2008.
[11] Zijian Dong, Chen Guo, Jie Song, Xu Chen, Andreas Geiger, and Otmar Hilliges. PINA: Learning a personalized implicit neural avatar from a single RGB-D video sequence. CVPR 2022.
[12] Qiao Feng, Yebin Liu, Yu-Kun Lai, Jingyu Yang, and Kun Li. FOF: Learning Fourier occupancy field for monocular real-time human reconstruction. NeurIPS 2022.
[13] Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. Implicit geometric regularization for learning shapes. arXiv:2002.10099, 2020.
[14] Chen Guo, Xu Chen, Jie Song, and Otmar Hilliges. Human performance capture from monocular video in the wild. 3DV 2021.
[15] Marc Habermann, Weipeng Xu, Michael Zollhoefer, Gerard Pons-Moll, and Christian Theobalt. DeepCap: Monocular human performance capture using weak supervision. CVPR 2020.
[16] Marc Habermann, Weipeng Xu, Michael Zollhöfer, Gerard Pons-Moll, and Christian Theobalt. LiveCap: Real-time human performance capture from monocular video. ACM Trans. Graph. 38(2), 2019.
[17] Tong He, John Collomosse, Hailin Jin, and Stefano Soatto. Geo-PIFu: Geometry and pixel aligned implicit functions for single-view human reconstruction. NeurIPS 2020.
[18] Tong He, Yuanlu Xu, Shunsuke Saito, Stefano Soatto, and Tony Tung. ARCH++: Animation-ready clothed human reconstruction revisited. ICCV 2021.
[19] A. Hilton and J. Starck. Multiple view reconstruction of people. 3DPVT 2004.
[20] Tao Hu, Tao Yu, Zerong Zheng, He Zhang, Yebin Liu, and Matthias Zwicker. HVTR: Hybrid volumetric-textural rendering for human avatars. 3DV 2022.
[21] Zeng Huang, Yuanlu Xu, Christoph Lassner, Hao Li, and Tony Tung. ARCH: Animatable reconstruction of clothed humans. CVPR 2020.
[22] Zhang Jiakai, Liu Xinhang, Ye Xinyi, Zhao Fuqiang, Zhang Yanshun, Wu Minye, Zhang Yingliang, Xu Lan, and Yu Jingyi. Editable free-viewpoint video using a layered neural representation. ACM SIGGRAPH 2021.
[23] Boyi Jiang, Yang Hong, Hujun Bao, and Juyong Zhang. SelfRecon: Self reconstruction your digital avatar from monocular video. CVPR 2022.
[24] Wei Jiang, Kwang Moo Yi, Golnoosh Samei, Oncel Tuzel, and Anurag Ranjan. NeuMan: Neural human radiance field from a single video. ECCV 2022.
[25] Zhanghan Ke, Jiayu Sun, Kaican Li, Qiong Yan, and Rynson W.H. Lau. MODNet: Real-time trimap-free portrait matting via objective decomposition. AAAI 2022.
[26] Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. PointRend: Image segmentation as rendering. 2019.
[27] Vincent Leroy, Jean-Sébastien Franco, and Edmond Boyer. Volume sweeping: Learning photoconsistency for multi-view shape reconstruction. IJCV 129:284–299, 2021.
[28] Ruilong Li, Julian Tanke, Minh Vo, Michael Zollhofer, Jurgen Gall, Angjoo Kanazawa, and Christoph Lassner. TAVA: Template-free animatable volumetric actors. 2022.
[29] Zhe Li, Tao Yu, Zerong Zheng, Kaiwen Guo, and Yebin Liu. PoseFusion: Pose-guided selective fusion for single-view human volumetric capture. CVPR 2021.
[30] Shanchuan Lin, Linjie Yang, Imran Saleemi, and Soumyadip Sengupta. Robust high-resolution video matting with temporal guidance. 2021.
[31] Yebin Liu, Qionghai Dai, and Wenli Xu. A point-cloud-based multiview stereo algorithm for free-viewpoint video. IEEE Trans. Vis. Comput. Graph. 16(3):407–418, 2010.
[32] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. SMPL: A skinned multi-person linear model. ACM Trans. Graph. 34(6):1–16, 2015.
[33] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. ECCV 2020.
[34] Gyeongsik Moon, Hyeongjin Nam, Takaaki Shiratori, and Kyoung Mu Lee. 3D clothed human reconstruction in the wild. ECCV 2022.
[35] Richard A Newcombe, Dieter Fox, and Steven M Seitz. DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. CVPR 2015.
[36] Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohi, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. ISMAR 2011.
[37] Atsuhiro Noguchi, Xiao Sun, Stephen Lin, and Tatsuya Harada. Neural articulated radiance field. ICCV 2021.
[38] Anestis Papazoglou and Vittorio Ferrari. Fast object segmentation in unconstrained video. ICCV 2013.
[39] Sida Peng, Junting Dong, Qianqian Wang, Shangzhan Zhang, Qing Shuai, Xiaowei Zhou, and Hujun Bao. Animatable neural radiance fields for modeling dynamic human bodies. ICCV 2021.
[40] Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Neural Body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. CVPR 2021.
[41] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. ICCV 2019.
[42] Shunsuke Saito, Tomas Simon, Jason Saragih, and Hanbyul Joo. PIFuHD: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization. CVPR 2020.
[43] Soumyadip Sengupta, Vivek Jayaram, Brian Curless, Steve Seitz, and Ira Kemelmacher-Shlizerman. Background matting: The world is your green screen. CVPR 2020.
[44] Prafull Sharma, Ayush Tewari, Yilun Du, Sergey Zakharov, Rares Ambrus, Adrien Gaidon, William T. Freeman, Fredo Durand, Joshua B. Tenenbaum, and Vincent Sitzmann. Seeing 3D objects in a single image via self-supervised static-dynamic disentanglement. 2022.
[45] Jonathan Starck and Adrian Hilton. Surface capture for performance-based animation. IEEE Computer Graphics and Applications 27(3):21–31, 2007.
[46] Shih-Yang Su, Timur Bagautdinov, and Helge Rhodin. DANBO: Disentangled articulated neural body representations via graph neural networks. ECCV 2022.
[47] Shih-Yang Su, Frank Yu, Michael Zollhöfer, and Helge Rhodin. A-NeRF: Articulated neural radiance fields for learning human shape, appearance, and pose. NeurIPS 2021.
[48] Vadim Tschernezki, Diane Larlus, and Andrea Vedaldi. NeuralDiff: Segmenting 3D objects that move in egocentric videos. 3DV 2021.
[49] Vagia Tsiminaki, Jean-Sébastien Franco, and Edmond Boyer. High resolution 3D shape texture from multiple videos. CVPR 2014.
[50] Timo von Marcard, Roberto Henschel, Michael Black, Bodo Rosenhahn, and Gerard Pons-Moll. Recovering accurate 3d human pose in the wild using IMUs and a moving camera. ECCV 2018.
[51] Shaofei Wang, Katja Schwarz, Andreas Geiger, and Siyu Tang. ARAH: Animatable volume rendering of articulated human SDFs. ECCV 2022.
[52] Chung-Yi Weng, Brian Curless, Pratul P. Srinivasan, Jonathan T. Barron, and Ira Kemelmacher-Shlizerman. HumanNeRF: Free-viewpoint rendering of moving people from monocular video. CVPR 2022.
[53] Tianhao Wu, Fangcheng Zhong, Andrea Tagliasacchi, Forrester Cole, and Cengiz Oztireli. D²NeRF: Self-supervised decoupling of dynamic and static objects from a monocular video. 2022.
[54] Christopher Xie, Yu Xiang, Zaid Harchaoui, and Dieter Fox. Object discovery in videos as foreground motion clustering. CVPR 2019.
[55] Yuliang Xiu, Jinlong Yang, Dimitrios Tzionas, and Michael J. Black. ICON: Implicit Clothed humans Obtained from Normals. CVPR 2022.
[56] Hongyi Xu, Thiemo Alldieck, and Cristian Sminchisescu. H-NeRF: Neural radiance fields for rendering and temporal reconstruction of humans in motion. NeurIPS 2021.
[57] Tianhan Xu, Yasuhiro Fujita, and Eiichi Matsumoto. Surface-aligned neural radiance fields for controllable 3d human synthesis. CVPR 2022.
[58] Weipeng Xu, Avishek Chatterjee, Michael Zollhöfer, Helge Rhodin, Dushyant Mehta, Hans-Peter Seidel, and Christian Theobalt. MonoPerfCap: Human performance capture from monocular video. ACM Trans. Graph. 37(2):27:1–27:15, 2018.
[59] Charig Yang, Hala Lamdouar, Erika Lu, Andrew Zisserman, and Weidi Xie. Self-supervised video object segmentation by motion grouping. ICCV 2021.
[60] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. NeurIPS 2021.
[61] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Basri Ronen, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. NeurIPS 2020.
[62] Vickie Ye, Zhengqi Li, Richard Tucker, Angjoo Kanazawa, and Noah Snavely. Deformable sprites for unsupervised video decomposition. CVPR 2022.
[63] Tao Yu, Kaiwen Guo, Feng Xu, Yuan Dong, Zhaoqi Su, Jianhui Zhao, Jianguo Li, Qionghai Dai, and Yebin Liu. BodyFusion: Real-time capture of human motion and surface geometry using a single depth camera. ICCV 2017.
[64] Tao Yu, Zerong Zheng, Kaiwen Guo, Jianhui Zhao, Qionghai Dai, Hao Li, Gerard Pons-Moll, and Yebin Liu. DoubleFusion: Real-time capture of human performances with inner body shapes from a single depth sensor. CVPR 2018.
[65] Wentao Yuan, Zhaoyang Lv, Tanner Schmidt, and Steven Lovegrove. STaR: Self-supervised tracking and reconstruction of rigid objects in motion with neural rendering. CVPR 2021.
[66] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. NeRF++: Analyzing and improving neural radiance fields. arXiv:2010.07492, 2020.
[67] Yufeng Zheng, Victoria Fernández Abrevaya, Marcel C. Bühler, Xu Chen, Michael J. Black, and Otmar Hilliges. I M Avatar: Implicit morphable head avatars from videos. CVPR 2022.
[68] Zerong Zheng, Tao Yu, Yebin Liu, and Qionghai Dai. PaMIR: Parametric model-conditioned implicit representation for image-based human reconstruction. IEEE TPAMI, 2021.