
1 CUHK MMLab  2 Stanford University  3 UT Austin
https://keqiangsun.github.io/projects/ponymation

Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos

Keqiang Sun 1 (ORCID 0000-0003-2900-1202), Dor Litvak 2,3 (ORCID 0009-0004-8720-618X), Yunzhi Zhang 2 (ORCID 0009-0000-3919-4883), Hongsheng Li 1 (ORCID 0000-0002-2664-7975), Jiajun Wu 2 (ORCID 0000-0002-4176-343X), Shangzhe Wu 2 (ORCID 0000-0003-1011-5963)
Abstract

We introduce a new method for learning a generative model of articulated 3D animal motions from raw, unlabeled online videos. Unlike existing approaches for 3D motion synthesis, our model requires no pose annotations or parametric shape models for training; it learns purely from a collection of unlabeled web video clips, leveraging semantic correspondences distilled from self-supervised image features. At the core of our method is a video Photo-Geometric Auto-Encoding framework that decomposes each training video clip into a set of explicit geometric and photometric representations, including a rest-pose 3D shape, an articulated pose sequence, and texture, with the objective of re-rendering the input video via a differentiable renderer. This decomposition allows us to learn a generative model over the underlying articulated pose sequences akin to a Variational Auto-Encoding (VAE) formulation, but without requiring any external pose annotations. At inference time, we can generate new motion sequences by sampling from the learned motion VAE, and create plausible 4D animations of an animal automatically within seconds given a single input image.

Keywords:
3D animal motion 4D generation Unsupervised learning
Equal contribution. Equal advising.

1 Introduction

We share the planet with a wide variety of lively animals. Similarly to humans, they navigate and interact with the physical world, demonstrating various sophisticated motion patterns. In fact, the first film in history, “The Horse in Motion,” was a sequence of photographs that captured a galloping horse, created by Eadweard Muybridge in 1878 [52]. Films capture only sequences of 2D projections of 3D animal movements. Further modeling dynamic animals in 3D is not only useful for numerous mixed reality and content creation applications, but also provides computational tools for biologists to study animal behaviors.

Figure 1: Learning 3D Animal Motions from Unlabeled Online Videos. Given a collection of monocular videos of an animal category sourced from the Internet as training data, our method learns a generative model of the articulated 3D motions together with a monocular 3D reconstruction model, without relying on any shape templates or pose annotations. At inference time, the model generates new 3D motion sequences and turns a single test image into 4D animations fully automatically.

While much effort has been invested in capturing and modeling 3D human motions using computer vision techniques, significantly less attention has been paid to animals. Existing learning-based approaches require an extensive amount of 3D scans [49, 63, 64], parametric shape models [9, 33, 35, 94, 61], multi-view videos [45, 26, 21], or geometric annotations, such as keypoints [24, 27, 23, 59, 61, 60, 69], as supervision for training. Collecting large-scale 3D training data involves specialized capture devices and intensive labor, which can only be justified for specific objects, like humans, that are of utmost value in applications.

In this work, we would like to learn a generative model of the 3D motions of an animal category, which will allow us to sample new 3D motion sequences and generate 4D animations fully automatically within seconds in a feedforward fashion. Crucially, unlike existing 3D motion synthesis approaches on human bodies [27, 23, 59, 60, 32, 85, 97], we do not rely on explicit manual supervision for training, such as keypoints or template shapes. Instead, we propose to learn this 3D motion generative model purely from raw, unlabeled videos sourced from the Internet. This task is also different from video synthesis methods [78, 67, 11] that operate purely on 2D images. We would like to obtain an explicit 3D motion representation, in the form of a 3D mesh and a sequence of articulated 3D poses, which can easily facilitate downstream applications, including fine-grained controllable 3D animation and motion pattern analysis.

Learning 3D motions from unstructured online video collections is an extremely ill-posed task, as each video clip depicts only a short sequence of 2D projections of a unique 4D instance, with unique shape, appearance, motion, and viewpoint that are not assumed to reappear in another clip. This task, therefore, requires registering these unique video clips in a single canonical 3D model to learn a distribution of the underlying 3D motions of the animals. To address this challenge, we take advantage of recent advancements in self-supervised image representation learning [12], and distill semantic correspondences across different instances from self-supervised image features produced by a pre-trained DINO-ViT [12]. Furthermore, we assume a coarse description of the motion skeleton of the animal, e.g., “quadruped,” which effectively constrains the space of deformation akin to Non-Rigid Structure-from-Motion [10] and provides a succinct representation for modeling the 3D motion.

Building on top of these insights, we design a video Photo-Geometric Auto-Encoding framework for learning 3D motion generative models from unlabeled videos. At its core is a spatio-temporal transformer that automatically decomposes a video clip into a set of geometric and photometric factors, including a rest-pose 3D mesh, appearance, viewpoint, and a motion latent code that encapsulates the 3D motion of the instance. This motion latent code is then decoded into a sequence of articulated 3D poses, which are used to animate the rest-pose mesh and re-render a 2D video clip using a differentiable renderer. This allows us to train the entire model end-to-end like a “Variational Auto-Encoder” (VAE) over the space of articulated 3D motions, using only 2D image reconstruction losses on the RGB frames, DINO features, and object masks, with pseudo-ground-truth masks obtained from off-the-shelf detectors [38].

At inference time, we can generate new 3D motion sequences by sampling from the motion VAE latent space. If further given a single image of an animal, our model can reconstruct its articulated 3D shape and appearance in a feed-forward fashion, and generate 4D animations fully automatically within seconds.

To summarize, this paper makes several contributions:

  • We propose a new method for learning a generative model of articulated 3D animal motions from unlabeled Internet videos, without any shape templates or pose annotations;

  • We design a spatio-temporal transformer architecture that effectively extracts motion information from input video clips into a latent VAE;

  • At inference time, the model generates diverse 3D motion sequences and turns a single image into 4D animations automatically in seconds.

2 Related Work

2.0.1 Learning 3D Animals from Image Collections.

While modeling dynamic 3D objects traditionally requires motion capture markers or simultaneous multi-view captures [25, 1, 18], recent learning-based approaches have demonstrated the possibility of learning 3D deformable models simply from raw single-view image collections [34, 82, 44, 92, 80, 93, 48, 70]. Most of these methods require additional geometric supervision besides object masks for training, such as keypoint [34, 43] and viewpoint annotations [68, 56, 19], template shapes [22, 40, 39], semantic correspondences [44, 92, 80, 91, 31], and strong geometric assumptions like symmetries [82, 81, 83] and viewpoint distributions [54, 65, 55, 15, 16, 73, 74]. Among these, MagicPony [80] demonstrates impressive results in learning articulated 3D animals, such as horses, using only single-view images with object masks and self-supervised image features as training supervision. However, it reconstructs static images individually, ignoring the dynamic motions of the 3D animals underlying those images. In this work, we focus on learning a generative model of 3D animal motions from videos instead of independent images.

2.0.2 Deformable Shapes from Monocular Videos.

Reconstructing deformable shapes from monocular videos is a long-standing problem in computer vision. Early approaches with Non-Rigid Structure from Motion (NRSfM) reconstruct deformable shapes from 2D correspondences, by incorporating heavy constraints on the motion patterns [10, 84, 3, 17, 13]. DynamicFusion [53] further integrates additional depth information from depth sensors. NRSfM pipelines have recently been revived with neural representations. In particular, LASR [86] and its follow-ups [87, 83, 88, 89] optimize deformable 3D shapes over a small set of monocular videos, leveraging 2D optical flows in a heavily engineered optimization procedure. DOVE [79] proposes a learning-based framework that learns a category-specific single-image 3D reconstruction model from a monocular video collection. Despite using video data for training, none of these approaches explicitly model the generative distribution of temporal motions of the objects.

2.0.3 Motion Analysis and Synthesis.

Modeling motion patterns of dynamic objects has important applications for both behavior analysis and content generation, and is instrumental to our visual perception system [6]. Computational techniques have been used for decades to study and synthesize human motions [7, 58, 76]. In particular, recent works have explored learning generative models for 3D human motions [46, 2, 51, 27, 23, 59, 60, 69, 36], leveraging parametric human shape models, like SMPL [49], and large-scale human pose annotations [30, 5]. In comparison, much less effort has been invested in modeling animal motions. Huang et al. [28] propose a hierarchical motion learning framework for animals, but it requires costly motion capture data and hardly generalizes to animals in the wild. To sidestep the collection of 3D data, BKinD [72] introduces a self-supervised method for discovering and tracking keypoints from videos, but is limited to a 2D representation. Such 2D keypoints could be lifted to 3D [71, 36], but this requires multi-view videos or ground-truth keypoints for training. Unlike these prior works, our motion learning framework does not require any pose annotations or multi-view videos for training, and is trained simply using raw monocular online videos. The recent success of image diffusion models has also led to promising generic 4D generation models [62, 96, 47, 95, 8]. However, the 3D motions generated by these models are still very limited in terms of quality and diversity, as shown in the comparisons in Section 4.2.2.

3 Method

Given a collection of raw video clips of an animal category, such as horses, our goal is to learn a generative model of its articulated 3D motions. This allows us to sample 3D motion sequences from a learned latent space, and generate 4D animations of a new animal instance automatically given only a single 2D image at test time. We train this model simply on raw online videos without relying on any external pose annotations. To do so, we design a video photo-geometric auto-encoding framework that decomposes each training video clip into a rest-pose 3D mesh, appearance, camera viewpoint, as well as a sequence of articulated 3D poses. This allows us to learn a generative model over the underlying articulated 3D pose sequences akin to a motion “Variational Auto-Encoder”, but simply using the objective of re-rendering the input frames with a differentiable renderer. Figure 2 gives an overview of the training pipeline.

Figure 2: Training Pipeline. Our method learns a generative model of articulated 3D motion sequences from a collection of unlabeled monocular videos. During training, the model encodes an input video sequence $I_{1:T}$ into a latent code $z$ in the motion VAE, and decodes from it a sequence of articulated 3D poses $\hat{\xi}_{1:T}$. This pose sequence is used to animate the reconstructed 3D shape, allowing the full pipeline to be trained simply using image reconstruction losses with unsupervised image features and object masks obtained from off-the-shelf models, without any external pose annotations.

3.1 Modeling Articulated 3D Animal Motions

Each video clip records a 2D image sequence $\{I_t\}_{t=1}^{T}$ of the underlying 3D animal motion from one camera trajectory. Since the dataset is obtained from casually-recorded Internet videos, these training clips have diverse unique motion sequences. In order to learn the distribution of the underlying animal motions from such unstructured video collections, we first need to devise a 3D representation that registers these dynamic 2D sequences onto a canonical 3D model, factoring out the 3D motion of each video instance.

Drawing inspiration from prior work on 3D human motion synthesis [49, 32, 97, 85], we leverage a category-specific skinned model to represent the deformable 3D shape of the animals, and further learn the motion distribution over the articulations of its underlying skeleton. To this end, we follow MagicPony [80] and assume a coarse description of the skeleton, e.g., “quadruped”.

Specifically, we represent the category-specific base 3D shape using a Signed Distance Function (SDF) parametrized by a coordinate Multi-Layer Perceptron (MLP), and extract an explicit mesh on the fly using Differentiable Marching Tetrahedron (DMTet) [66]. Let $V_{\text{base}} \in \mathbb{R}^{K \times 3}$ denote the list of $K$ vertices, with the triangle faces given by the triplets $F \subset \{1, \dots, K\}^{3}$. To model the slight shape variation of each animal instance in the canonical pose, we further learn an image-conditioned deformation field $f_{\Delta V}$ parametrized by another MLP that predicts small deformations of each vertex $\Delta V_{\text{ins},i} = f_{\Delta V}(V_{\text{base},i}, \phi)$, where $\phi = f_{\phi}(I)$ is a feature vector obtained from an image $I$ using a pre-trained DINO-ViT [12], and $i \in \{1, \cdots, K\}$ denotes the vertex index. Both the base shape $V_{\text{base}}$ and the instance deformation $\Delta V_{\text{ins}}$ are enforced to be bilaterally symmetric about the $yz$-plane by mirroring the query locations in the underlying MLPs.
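The mirroring of query locations can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: `make_mlp` builds a random two-layer network as a stand-in for the trained MLPs, and the names `symmetric_sdf` and `symmetric_deformation` are hypothetical. The sketch assumes the symmetry plane is $x = 0$ and that a vector-valued deformation should have its $x$-component reflected on the mirrored side.

```python
import numpy as np

def make_mlp(rng, in_dim, hidden, out_dim):
    """Tiny random two-layer MLP, a stand-in for a trained coordinate network."""
    w1 = rng.normal(size=(in_dim, hidden))
    w2 = rng.normal(size=(hidden, out_dim))
    return lambda x: np.tanh(x @ w1) @ w2

def symmetric_sdf(mlp, points):
    """Scalar field that is symmetric about the yz-plane (x = 0)."""
    mirrored = points.copy()
    mirrored[:, 0] = np.abs(mirrored[:, 0])  # fold x < 0 onto x > 0
    return mlp(mirrored)[:, 0]

def symmetric_deformation(mlp, points, phi):
    """Vector field conditioned on a feature phi, mirrored across x = 0."""
    sign = np.where(points[:, 0] < 0, -1.0, 1.0)
    mirrored = points.copy()
    mirrored[:, 0] = np.abs(mirrored[:, 0])
    cond = np.broadcast_to(phi, (len(points), phi.shape[0]))
    delta = mlp(np.concatenate([mirrored, cond], axis=1))
    delta[:, 0] *= sign  # reflect the x-offset on the mirrored side
    return delta

rng = np.random.default_rng(0)
sdf_mlp = make_mlp(rng, 3, 8, 1)
def_mlp = make_mlp(rng, 3 + 4, 8, 3)
phi = rng.normal(size=4)            # placeholder for a DINO-ViT feature
pts = rng.normal(size=(10, 3))      # placeholder query locations
sdf_values = symmetric_sdf(sdf_mlp, pts)
offsets = symmetric_deformation(def_mlp, pts, phi)
```

Because both fields only ever see $|x|$, symmetry holds by construction and does not need to be encouraged with an additional loss.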

To account for the temporal motions driven by the underlying bone structure, we then instantiate a quadrupedal skeleton in this instance shape using a simple heuristic: a chain of bones going through the two farthest end points along the $z$-axis, and four legs branching out from the body bone to the lowest point in each $xz$-quadrant. The motion sequence is thus parametrized by a sequence of articulated poses $\xi = \{\xi_t\}_{t=1}^{T}$, where each pose $\xi_t$ at a timestamp $t$ consists of a rigid pose $\xi_{t,1} \in SE(3)$ w.r.t. an identity camera pose and the rotation $\xi_{t,b} \in SO(3)$ of each bone $b = 2, \dots, B$ in the skeleton. These articulated poses are applied to the instance mesh $V_{\text{ins}}$ to obtain the final posed shape sequence using the widely-used linear blend skinning $g(V_{\text{ins}}, \xi_t)$ [49]. More details are included in the supplementary material.
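For readers unfamiliar with linear blend skinning, the following NumPy sketch shows one common form of the operation. It makes several assumptions that are not stated in the text: skinning weights are given and sum to one per vertex, each bone rotates about a fixed pivot in the rest pose, bones are ordered so that a parent precedes its children, and the global rigid pose $\xi_{t,1}$ is omitted. The function names are hypothetical.

```python
import numpy as np

def rot_z(angle):
    """3x3 rotation about the z-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rotate_about(R, pivot):
    """4x4 transform that rotates by R about a fixed pivot point."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = pivot - R @ pivot
    return T

def linear_blend_skinning(verts, weights, rotations, pivots, parents):
    """Pose rest vertices (K,3) with per-bone rotations (B,3,3).

    weights: (K,B) skinning weights, rows sum to 1.
    pivots:  (B,3) rest-pose joint locations.
    parents: length-B list, parent index per bone (-1 for the root).
    """
    G = []
    for b, parent in enumerate(parents):
        local = rotate_about(rotations[b], pivots[b])
        # Compose along the kinematic chain: parent transform applied last.
        G.append(local if parent < 0 else G[parent] @ local)
    G = np.stack(G)                                           # (B,4,4)
    homo = np.concatenate([verts, np.ones((len(verts), 1))], axis=1)
    per_bone = np.einsum('bij,kj->kbi', G, homo)              # (K,B,4)
    posed = np.einsum('kb,kbi->ki', weights, per_bone)        # blend per vertex
    return posed[:, :3]

verts = np.array([[0.5, 0.0, 0.0], [2.0, 0.0, 0.0]])
weights = np.array([[1.0, 0.0], [0.0, 1.0]])
pivots = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
parents = [-1, 0]
rotations = np.stack([np.eye(3), rot_z(np.pi / 2)])
posed = linear_blend_skinning(verts, weights, rotations, pivots, parents)
```

In this toy two-bone chain, only the second vertex moves, since it is bound entirely to the rotated child bone.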

The appearance of the instance is modeled using a texture field parametrized by an MLP $f_{\text{a}}(\mathbf{x}, \phi) \in [0, 1]^{3}$, where $\mathbf{x}$ is a 3D location. We then render the posed mesh sequence into a sequence of RGB images using deferred mesh rendering [80], querying $f_{\text{a}}$ at the corresponding 3D locations of the pixels after rasterization.

In the following, we explain the learning formulation to learn the individual components, including $V_{\text{base}}$, $f_{\Delta V}$, $f_{\text{a}}$, and most importantly, a generative model $f_{\xi}$ over the motion sequences $\xi$, purely from an unstructured video collection without external pose annotations.

3.2 Video Photo-Geometric Auto-Encoding

Unlike human motion synthesis, we do not have access to large-scale, high-quality 3D captures or pose annotations for most animal species. Hence, we must instead learn from raw Internet videos, which poses significant challenges. To this end, we design a video Photo-Geometric Auto-Encoding framework that deconstructs each training clip into the explicit photometric and geometric factors described in Section 3.1, and train the entire pipeline using the objective of re-rendering the video. At the center of this video auto-encoding pipeline is a generative model of articulated motion sequences, akin to a “Variational Auto-Encoder” (VAE), but learned purely from raw RGB frames. This is very different from simply training a conventional VAE directly in the pose sequence space, which would require explicit pose annotations in the first place.

3.2.1 Video Encoding.

To predict the instance shape deformation $\Delta V_{\text{ins}}$ and appearance of the object, we extract a feature vector $\phi_t$ for each frame of the video using a pre-trained DINO-ViT [12] with frozen weights, as mentioned previously. We assume the instance shape and appearance remain the same throughout the video, and hence take the average image features across all frames, denoted as $\bar{\phi}$, when querying the MLPs $f_{\Delta V}$ and $f_{\text{a}}$.

In order to extract the motion information more effectively from the input video clip, we design a pair of spatial and temporal transformer-based motion encoders, $E_{\text{s}}$ and $E_{\text{t}}$, that aggregate a set of bone-specific local features first spatially across each frame and then temporally across the entire sequence, eventually obtaining the distribution parameters $\hat{\mu}$ and $\hat{\Sigma}$ of the motion latent VAE.

Specifically, given each frame $I_t$ in the input clip, we first construct a bone-specific feature descriptor $\nu_{t,b} = (\phi_t, \Phi_t(\mathbf{u}_{t,b}), b, \mathbf{J}_b, \mathbf{u}_{t,b})$ for each bone $b = 2, \dots, B$ and each timestamp $t$. Here, $\phi_t$ denotes the same global image feature as before. $\mathbf{J}_b$ denotes the 3D location of the center of the bone $b$ at rest-pose, which projects to the pixel location $\mathbf{u}_{t,b}$ in the image space, given the rigid pose $\hat{\xi}_{t,1}$ predicted separately. In addition to the global feature $\phi_t$, we also sample an auxiliary bone-specific local feature vector $\Phi_t(\mathbf{u}_{t,b})$ from the DINO-ViT key token map $\Phi_t$ at the projected pixel location $\mathbf{u}_{t,b}$.
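The construction of a bone-specific descriptor can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the paper's code: a pinhole camera with the principal point at the image center, a square feature map indexed directly in pixel units, bilinear interpolation for sampling, and the bone index encoded as a plain scalar. The names `project`, `bilinear_sample`, and `bone_descriptor` are hypothetical.

```python
import numpy as np

def project(points, R, t, focal, size):
    """Pinhole projection of 3D points (N,3) to (x, y) pixel coordinates."""
    cam = points @ R.T + t
    uv = focal * cam[:, :2] / cam[:, 2:3]
    return uv + (size - 1) / 2.0  # principal point at the image center

def bilinear_sample(fmap, uv):
    """Sample a feature map (H,W,C) at continuous (x, y) locations (N,2)."""
    H, W, _ = fmap.shape
    x = np.clip(uv[:, 0], 0, W - 1)
    y = np.clip(uv[:, 1], 0, H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    top = (1 - wx) * fmap[y0, x0] + wx * fmap[y0, x1]
    bottom = (1 - wx) * fmap[y1, x0] + wx * fmap[y1, x1]
    return (1 - wy) * top + wy * bottom

def bone_descriptor(global_feat, fmap, bone_index, joint_3d, R, t, focal):
    """Concatenate (phi_t, Phi_t(u), b, J_b, u) into one vector."""
    u = project(joint_3d[None], R, t, focal, fmap.shape[1])[0]
    local = bilinear_sample(fmap, u[None])[0]
    return np.concatenate([global_feat, local, [float(bone_index)], joint_3d, u])

# Toy feature map whose two channels store the (x, y) coordinate of each cell.
ys, xs = np.meshgrid(np.arange(5), np.arange(5), indexing="ij")
fmap = np.stack([xs, ys], axis=-1).astype(float)
desc = bone_descriptor(np.zeros(4), fmap, 2, np.array([0.2, -0.1, 2.0]),
                       np.eye(3), np.zeros(3), focal=10.0)
```

In a real pipeline the bone index would more likely be passed through a learned or positional embedding before entering the transformer; the scalar encoding here only keeps the sketch short.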

The spatial transformer encoder $E_{\text{s}}$ then fuses these bone-specific feature descriptors $\{\nu_{t,b}\}_{b=2}^{B}$ into a single feature vector $\nu_{t,*}$ summarizing the articulated pose of the animal in each frame $t$:

$$\nu_{t,*} = E_{\text{s}}(\nu_{t,2}, \cdots, \nu_{t,B}). \tag{1}$$

In practice, we prepend a learnable token to the list of descriptors, and take the first output token of the transformer as the pose feature $\nu_{t,*}$. We call this $E_{\text{s}}$ a spatial transformer as it extracts the spatial geometric features in each input frame that capture the pose information, conditioned on the given skeleton.

Next, we design a second temporal transformer encoder $E_{\text{t}}$, inspired by [59], which operates along the temporal dimension and maps the entire sequence of pose features $\{\nu_{t,*}\}_{t=1}^{T}$ into the motion latent space. Similarly to $E_{\text{s}}$, $E_{\text{t}}$ fuses the pose feature sequence to predict the VAE distribution parameters:

$$(\hat{\mu}, \hat{\Sigma}) = E_{\text{t}}(\nu_{1,*}, \cdots, \nu_{T,*}). \tag{2}$$

Using the reparametrization trick [37], we then sample a latent code from the Gaussian distribution $z \sim \mathcal{N}(\hat{\mu}, \hat{\Sigma})$, which will be decoded into a sequence of articulated poses $\{\hat{\xi}_t\}_{t=1}^{T}$ characterizing the 3D motion of the animal in the clip.
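As a brief reminder of how the reparametrization trick works, the sketch below draws a sample as a deterministic function of the predicted parameters and an independent noise vector, so that gradients can flow to the encoder. It assumes a diagonal covariance parametrized by a log-variance vector, which is a common choice but is not specified in the text; the function name is hypothetical.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (diagonal covariance)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

mu = np.array([0.5, -1.0, 2.0])        # placeholder for the predicted mean
log_var = np.array([0.0, -2.0, 1.0])   # placeholder for the predicted log-variance
z = reparameterize(mu, log_var, np.random.default_rng(0))
```

At inference time, the same decoder can be driven by samples from the prior instead, which corresponds to sampling new motions from the latent space.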

3.2.2 Motion Decoding.

Symmetric to the motion encoders, the motion decoder also consists of a temporal decoder $D_{\text{t}}$ that first decodes $z$ into a sequence of pose features $\{z_t\}_{t=1}^{T}$, and a spatial decoder $D_{\text{s}}$ that further decodes each pose feature $z_t$ to a set of bone rotations $\{\hat{\xi}_{t,b}\}_{b=2}^{B}$.

Specifically, we query the temporal transformer decoder $D_{\text{t}}$ with a sequence of timestamps $\mathcal{T}$, and use $z$ as both the key token and the value token to obtain a sequence of pose features:

$$(z_1, \cdots, z_T) = D_{\text{t}}(\mathcal{T}, z), \quad \mathcal{T} = (1, \cdots, T). \tag{3}$$

Similarly, given each pose feature $z_t$, we then query the spatial transformer decoder $D_{\text{s}}$ with a sequence of bone indices $\mathcal{B}$ to produce the bone rotations:

$$(\hat{\xi}_{t,2}, \cdots, \hat{\xi}_{t,B}) = D_{\text{s}}(\mathcal{B}, z_t), \quad \mathcal{B} = (2, \cdots, B). \tag{4}$$

In practice, the rigid pose $\hat{\xi}_{t,1}$ is predicted by a separate network and is not modeled by this motion VAE, since it is entangled with arbitrary camera motions that are difficult to disentangle in dynamic scenes.
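The query-based decoding in Eq. (3) and Eq. (4) can be illustrated with a single cross-attention layer written in NumPy. This is a rough sketch rather than the actual architecture: it assumes sinusoidal embeddings of the timestamps as queries, one attention head, residual connections, a small feed-forward layer, and random weights from the hypothetical helper `init_params`. Layer normalization and multiple layers are left out.

```python
import numpy as np

def sinusoidal_embedding(positions, dim):
    """Standard sinusoidal encoding of integer positions; dim must be even."""
    pos = np.asarray(positions, dtype=float)[:, None]
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)[None, :]
    emb = np.zeros((pos.shape[0], dim))
    emb[:, 0::2] = np.sin(pos * freqs)
    emb[:, 1::2] = np.cos(pos * freqs)
    return emb

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def init_params(rng, d):
    """Random weights standing in for trained parameters."""
    names = ["wq", "wk", "wv", "w1", "w2"]
    return {n: rng.normal(size=(d, d)) / np.sqrt(d) for n in names}

def cross_attention_decode(queries, memory, params):
    """One decoder layer: queries (Q,d) attend to memory tokens (M,d)."""
    q = queries @ params["wq"]
    k = memory @ params["wk"]
    v = memory @ params["wv"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (Q,M)
    h = queries + attn @ v                           # residual connection
    return h + np.tanh(h @ params["w1"]) @ params["w2"]

rng = np.random.default_rng(0)
d, T = 16, 5
params = init_params(rng, d)
z = rng.normal(size=(1, d))                          # single latent token
time_queries = sinusoidal_embedding(np.arange(1, T + 1), d)
pose_feats = cross_attention_decode(time_queries, z, params)  # (T,d)
```

One observation from this sketch: with a single key/value token, the softmax weight is trivially one for every query, so in this simplified layer the variation across timestamps enters through the query embeddings via the residual path and the feed-forward layer. The spatial decoder would follow the same pattern, with bone-index embeddings as queries and each $z_t$ as the memory token.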

We then deform the predicted instance mesh $\hat{V}_{\text{ins}}$ using this articulated pose sequence $\{\hat{\xi}_t\}_{t=1}^{T}$ with the skinning equation $\hat{V}_t = g(\hat{V}_{\text{ins}}, \hat{\xi}_t)$, and render the RGB frames $\{\hat{I}_t\}_{t=1}^{T}$ and masks $\{\hat{M}_t\}_{t=1}^{T}$ using a differentiable renderer [42].

3.3 Learning Formulation

3.3.1 Video Re-rendering Losses.

We train the entire model by minimizing the reconstruction losses on the object masks $\hat{M}_t$ and RGB frames $\hat{I}_t$:

$$\mathcal{L}_{\text{m},t} = \|\hat{M}_t - M_t\|_2^2 + \lambda_{\text{dt}} \|\hat{M}_t \odot \texttt{dt}(M_t)\|_1, \quad \mathcal{L}_{\text{im},t} = \|\tilde{M}_t \odot (\hat{I}_t - I_t)\|_1, \tag{5}$$

where the distance transform $\texttt{dt}(\cdot)$ is used in the second term of the mask loss with a weight $\lambda_{\text{dt}}$ for more effective gradients [34, 81, 80], and $\odot$ denotes the Hadamard product. The RGB loss is only computed inside the intersection of the predicted and ground-truth masks $\tilde{M}_{t}=\hat{M}_{t}\odot M_{t}$. To exploit the temporal consistency of the motion in the videos, we further enforce a temporal smoothness constraint between the predicted poses $\hat{\xi}_{t}$ of consecutive frames: $\mathcal{R}_{\text{temp}}=\sum_{t=2}^{T}\|\hat{\xi}_{t}-\hat{\xi}_{t-1}\|_{2}^{2}$.
We also inherit the multi-hypothesis viewpoint prediction mechanism with the hypothesis loss $\mathcal{L}_{\text{hyp}}$ and the shape regularizers $\mathcal{R}_{\text{shape}}=\lambda_{\text{Eik}}\mathcal{R}_{\text{Eik}}+\lambda_{\text{art}}\mathcal{R}_{\text{art}}+\lambda_{\text{def}}\mathcal{R}_{\text{def}}$ [80] with balancing weights $\lambda$, which include the Eikonal constraint $\mathcal{R}_{\text{Eik}}$ on the SDF MLP for the base shape, and magnitude regularizers $\mathcal{R}_{\text{art}}$ on the bone rotations $\hat{\xi}_{2:B}$ and $\mathcal{R}_{\text{def}}$ on the vertex deformations $\Delta V_{\text{ins}}$.
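To make the terms of Eq. (5) and the temporal smoothness regularizer concrete, the following is a minimal pure-Python sketch on tiny 2D lists. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, images are single-channel for brevity, and the distance transform is assumed to measure the distance from each pixel to the nearest ground-truth foreground pixel, so that predicted mask mass far from the ground-truth silhouette is penalized more strongly.

```python
def distance_transform(mask):
    """Brute-force Euclidean distance from every pixel to the nearest
    foreground pixel of a binary mask (zero inside the mask).
    Assumed convention; the paper does not spell out the details."""
    fg = [(i, j) for i, row in enumerate(mask) for j, v in enumerate(row) if v > 0.5]
    if not fg:
        raise ValueError("mask has no foreground pixel")
    return [[min(((i - a) ** 2 + (j - b) ** 2) ** 0.5 for a, b in fg)
             for j in range(len(row))] for i, row in enumerate(mask)]


def mask_loss(pred_mask, gt_mask, lambda_dt):
    """Squared L2 term plus distance-transform-weighted L1 term (Eq. 5)."""
    squared = sum((p - g) ** 2
                  for pr, gr in zip(pred_mask, gt_mask) for p, g in zip(pr, gr))
    dt = distance_transform(gt_mask)
    penalty = sum(abs(p * d)
                  for pr, dr in zip(pred_mask, dt) for p, d in zip(pr, dr))
    return squared + lambda_dt * penalty


def image_loss(pred_img, gt_img, pred_mask, gt_mask):
    """L1 difference inside the intersection of predicted and GT masks.
    Images are single-channel 2D lists here for brevity."""
    total = 0.0
    for i, row in enumerate(gt_img):
        for j, g in enumerate(row):
            m = pred_mask[i][j] * gt_mask[i][j]  # intersection mask
            total += abs(m * (pred_img[i][j] - g))
    return total


def temporal_smoothness(poses):
    """Sum of squared differences between consecutive pose vectors."""
    return sum(sum((c - p) ** 2 for c, p in zip(cur, prev))
               for prev, cur in zip(poses, poses[1:]))
```

In practice these terms would be computed on tensors inside the differentiable rendering loop; the list-based version only shows which quantities enter each term.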

3.3.2 Semantic Correspondences.

Instead of relying on external pose annotations or prior shape models to learn the 3D model from monocular videos, we seek a much cheaper alternative for establishing correspondences across different instances. We distill semantic correspondences from self-supervised image features, such as DINO [12]. As shown in prior work [4, 80, 92], after a simple PCA reduction, these image features reveal robust part-level correspondences across different instances with varying poses and appearance. To exploit these correspondences, we additionally optimize a feature field in the canonical space using a coordinate MLP $\psi(\mathbf{x})\in\mathbb{R}^{D}$, which is rendered into a 2D feature image $\hat{\Phi}_{t}\in\mathbb{R}^{D\times H\times W}$ given the posed mesh $\hat{V}_{t}$, with the same procedure as rendering the appearance of the object described above.
We then encourage this rendered feature map $\hat{\Phi}_{t}$ to match the feature map $\Phi^{\prime}_{t}$ pre-extracted from the input frame $I_{t}$ using DINO-ViT with PCA reduction: $\mathcal{L}_{\text{feat},t}=\|\tilde{M}_{t}\odot(\hat{\Phi}_{t}-\Phi^{\prime}_{t})\|_{2}^{2}$. Intuitively, this forces the model to establish correspondences across all training video instances through the same canonical feature field, hence disentangling the shape and pose in each monocular frame.

3.3.3 Motion VAE.

Similarly to the conventional VAE, we also minimize the Kullback–Leibler (KL) divergence between the learned motion latent distribution and a standard Gaussian distribution:

$$\mathcal{L}_{\text{KL}}=\sum_{i}-\frac{1}{2}\left(\log\sigma_{i}^{2}-\sigma_{i}^{2}-\mu_{i}^{2}+1\right),\tag{6}$$

where $\mu_{i}$ and $\sigma_{i}^{2}$ are elements of the predicted distribution parameters $\hat{\mu}$ and $\hat{\Sigma}$.
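As a small illustration of the KL term, the sketch below evaluates the standard closed-form KL divergence between a diagonal Gaussian and a standard Gaussian, treating $\sigma_{i}^{2}$ as the per-dimension variance. The function name and the list-based inputs are assumptions made for this example only.

```python
import math


def kl_to_standard_normal(mu, var):
    """Closed-form KL( N(mu, diag(var)) || N(0, I) ), summed over dimensions.

    mu:  list of per-dimension means
    var: list of per-dimension variances (sigma_i^2), all strictly positive
    """
    return sum(-0.5 * (math.log(v) - v - m * m + 1.0) for m, v in zip(mu, var))
```

The value is zero exactly when every mean is 0 and every variance is 1, and positive otherwise, which is what pulls the motion latent distribution toward the prior used for sampling at inference time.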

3.3.4 Training Schedule.

As learning 3D articulated motions from unstructured video clips without labels is extremely ill-posed, we devise a two-stage schedule for robust and efficient training. In the first stage, we pre-train the monocular 3D reconstruction model using a single-image pose predictor $\tilde{\xi}_{t}=f_{\xi}^{\text{sin}}(\phi_{t})$. Inspired by [80], but unlike it, we train this model to re-render entire video clips with the temporal smoothness constraint $\mathcal{R}_{\text{temp}}$ and temporal feature averaging $\bar{\phi}$, rather than independent images. The total loss in the first stage is given by:

$$\mathcal{L}_{\text{vid}}=\sum_{t=1}^{T}\left(\mathcal{L}_{\text{recon},t}+\lambda_{\text{h}}\mathcal{L}_{\text{hyp},t}+\lambda_{\text{s}}\mathcal{R}_{\text{shape},t}\right)+\lambda_{\text{t}}\mathcal{R}_{\text{temp}},\tag{7}$$

where $\mathcal{L}_{\text{recon},t}=\mathcal{L}_{\text{im},t}+\lambda_{\text{m}}\mathcal{L}_{\text{m},t}+\lambda_{\text{f}}\mathcal{L}_{\text{feat},t}$ summarizes the reconstruction losses on each frame. After this stage, we obtain an accurate monocular 3D reconstruction model, which outperforms the baseline [80] as shown in Table 4, largely owing to the training on videos instead of independent images. More importantly, the model has now learned a reasonable space of articulated poses, on top of which learning a motion generative model is much more efficient.

In the second stage, we replace the monocular pose predictor $f_{\xi}^{\text{sin}}$ with the spatio-temporal transformer-based motion VAE $f_{\xi}$ detailed in Section 3.2, which encodes the entire video clip and generates the entire sequence of articulated poses at once. Empirically, training the motion VAE from scratch with an expensive rendering step in the loop is inefficient. To improve training efficiency, we recycle the pose predictions $\tilde{\xi}_{t}$ from the first stage to guide the predictions of the VAE decoder $\hat{\xi}_{t}$ using a teacher loss $\mathcal{L}_{\text{teacher}}=\sum_{t=1}^{T}\|\hat{\xi}_{t}-\tilde{\xi}_{t}\|_{2}^{2}$. The final training objective for the second stage is thus:

$$\mathcal{L}=\mathcal{L}_{\text{vid}}+\lambda_{\text{KL}}\mathcal{L}_{\text{KL}}+\lambda_{\text{teacher}}\mathcal{L}_{\text{teacher}}.\tag{8}$$
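The way the individual terms combine across the two stages (Eqs. 7 and 8) can be sketched as follows. This is only a bookkeeping illustration: the function names, the dictionary keys, and any weight values passed in are hypothetical, and the scalar inputs stand in for losses that would be computed by the renderer and the networks.

```python
def recon_loss(l_im, l_m, l_feat, lam_m, lam_f):
    """Per-frame reconstruction loss: image + weighted mask + weighted feature terms."""
    return l_im + lam_m * l_m + lam_f * l_feat


def video_loss(frames, r_temp, lam_h, lam_s, lam_t):
    """Stage-1 objective (Eq. 7).

    frames: list of dicts with per-frame scalars under the (hypothetical)
            keys "recon", "hyp" and "shape"
    r_temp: temporal smoothness regularizer for the whole clip
    """
    per_frame = sum(f["recon"] + lam_h * f["hyp"] + lam_s * f["shape"] for f in frames)
    return per_frame + lam_t * r_temp


def teacher_loss(vae_poses, teacher_poses):
    """Squared L2 distance between VAE-decoded poses and stage-1 poses, summed over frames."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, q))
               for p, q in zip(vae_poses, teacher_poses))


def stage2_loss(l_vid, l_kl, l_teacher, lam_kl, lam_teacher):
    """Stage-2 objective (Eq. 8): video loss plus KL and teacher terms."""
    return l_vid + lam_kl * l_kl + lam_teacher * l_teacher
```

Note that the teacher poses come from the frozen first-stage predictor, so in a real training loop they would be treated as constants rather than receiving gradients.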

3.3.5 3D Motion Generation.

During inference time, we can generate diverse 3D motion sequences by sampling from the learned motion VAE latent space. Furthermore, when given a single 2D image of a new animal instance unseen at training, our model can reconstruct its 3D shape and appearance in a feed-forward manner, and generate 4D animations fully automatically within a few seconds, as illustrated in Figure 3.

Table 1: Statistics of the AnimalMotion Dataset. We collect a new animal video dataset containing a total of 82.6k frames for 4 different animal species.

| Category | # Sequences | Total Length | # Frames |
|---|---|---|---|
| Horse | 640 | 28’09” | 50,682 |
| Zebra | 47 | 5’27” | 9,822 |
| Giraffe | 60 | 4’52” | 8,768 |
| Cow | 69 | 7’25” | 13,359 |
| Total | 816 | 45’54” | 82,631 |

Figure 3: 3D Motion Generation and Animation. At test time, our model generates plausible 3D motion sequences by sampling from the learned motion VAE. It can also reconstruct articulated 3D shapes from a single 2D image in a feed-forward fashion, and generate 4D animations fully automatically within seconds. Within each gray box on the right, the first row shows the textured animation, and the second row visualizes the corresponding 3D shapes with the generated bone articulations.

4 Experiments

4.1 Experimental Setup

4.1.1 Datasets.

To train our model, we collected the AnimalMotion dataset, consisting of video clips of several quadruped animal categories extracted from the Internet. The statistics of the dataset are summarized in Table 1. As pre-processing, we first detect and segment the animal instances in the videos using the off-the-shelf segmentation model PointRend [38]. To remove occlusions between different instances, we calculate the extent of mask overlap in each frame and exclude crops where two or more masks overlap with each other. We further apply a smoothing kernel to the sequence of bounding boxes to avoid jittering. The non-occluded instances are then cropped and resized to $256\times 256$. The original videos are all at 30 fps. To ensure sufficient motion in each sequence, we remove frames with minimal motion, measured by the magnitude of optical flow within the instance mask estimated by RAFT [75]. To conduct quantitative evaluations and comparisons, we also use PASCAL VOC [20], which contains 108 images of horses, and APT-36K [90], which contains 81 video clips of horses, each consisting of 15 frames. Both datasets provide 2D keypoint annotations for each animal in the image, allowing us to evaluate the geometric accuracy of the reconstructed shapes and generated motions.
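The three filtering steps in this pre-processing pipeline (occlusion check, bounding-box smoothing, low-motion frame removal) can be sketched roughly as below. The text does not specify the kernel shape, window size, or thresholds, so the moving-average window, the `radius` and `threshold` parameters, and all function names are assumptions for illustration.

```python
def masks_overlap(mask_a, mask_b):
    """True if two binary instance masks share at least one pixel."""
    return any(a and b for ra, rb in zip(mask_a, mask_b) for a, b in zip(ra, rb))


def smooth_boxes(boxes, radius=2):
    """Moving average over a sequence of (x0, y0, x1, y1) boxes.
    The window is truncated at the sequence boundaries."""
    smoothed = []
    for t in range(len(boxes)):
        window = boxes[max(0, t - radius): t + radius + 1]
        smoothed.append(tuple(sum(b[k] for b in window) / len(window) for k in range(4)))
    return smoothed


def keep_moving_frames(flow_magnitudes, threshold):
    """Indices of frames whose mean in-mask optical-flow magnitude reaches the threshold."""
    return [t for t, m in enumerate(flow_magnitudes) if m >= threshold]
```

For example, `smooth_boxes` with `radius=1` replaces each box by the average of itself and its immediate neighbours, which damps frame-to-frame jitter in the crops.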

4.1.2 Implementation Details.

The encoders and decoders of the motion VAE model ($E_{\text{s}}$, $E_{\text{t}}$, $D_{\text{s}}$, $D_{\text{t}}$) from Section 3.2 are implemented as stacked transformers [77] with 4 transformer blocks and a latent dimension of 256. We use a sinusoidal function for positional encoding following [59]. For the remaining architectures, we base our implementation on [80]. We train the model for 120 epochs in the first stage, which takes roughly 10 hours on 8 A6000 GPUs, and another 180 epochs in the second stage, which takes another 48 hours. We use a sequence length of $T=10$ for training. During inference, we can generate longer sequences by connecting multiple samples and optimizing transition latent codes for smooth interpolation. For visualization, following prior work [80], we finetune (only) the appearance network $f_{\text{a}}$ for 100 iterations on each test image, taking less than 10 seconds, as the model struggles to predict detailed texture in a single feed-forward pass. More details are included in the supplementary material.

4.2 3D Motion Generation

4.2.1 Qualitative Results.

After training, we can generate 3D motion sequences by sampling from the latent space of the motion VAE, and render 4D animations with the textured mesh reconstructed from a single 2D image, as shown in Figure 3. The model also generalizes to horse-like artifacts, such as carousel horses, which it has never seen during training. The model can be trained on a wide range of animal species besides horses, including giraffes, zebras and cows, capturing category-specific prior distributions of 3D motions, as shown in Figure 5. Because the datasets for these categories are limited in size and diversity, in the first stage of training we fine-tune from the model trained on horses, as in [80]. Additional animation results are provided in the supplementary video.

Figure 4: 4D Generation Comparisons. We compare with 4D-fy [8], a recent text-to-4D generation method distilling from 2D diffusion. Despite heavy prompt engineering and a lengthy training time (12 hours), 4D-fy still fails to produce noticeable motion, whereas our model generates diverse motion sequences in a feed-forward pass within a few seconds, with much better 3D geometry.
Table 2: Quantitative Comparison with State-of-the-Art Motion Generative Models.

| Method | Motion Strength | User Preference |
|---|---|---|
| 4D-fy [8] | 0.29 | 112 (17.0%) |
| Ponymation (Ours) | 4.66 | 548 (83.0%) |

4.2.2 Comparison with Existing Methods.

Our method is the first to learn a generative model of 3D animal motions from raw videos without pose annotations or prior shape models. We compare with one of the most recent 4D generative models, 4D-fy [8], which has publicly released code. Specifically, we provide the model with a list of prompts, which are enriched by ChatGPT [57] from a list of basic prompts describing horse motions, such as “a horse is running/walking/jumping/eating” (the full list of prompts is included in the supplementary material). We generate 20 4D instances from 4D-fy, and 20 from our method (without text condition). Note that it takes 12 hours to generate one 4D-fy instance on one GPU, whereas our model generates 4D animations within a few seconds in a single forward pass. We first compute the Motion Strength to assess the motion magnitude of the generated videos. We use FlowFormer [29] to estimate optical flow strengths between consecutive frames of a generated video, and then compute the average of the largest 5% optical flows as the Motion Strength. We then present the generated instances from the two methods in random pairs side by side to 33 participants, and ask them to select the one that shows “a more plausible 3D horse motion sequence”. As reported in Table 2, users preferred the 4D instances generated by our method over 4D-fy 83.0% of the time. We show a visual comparison in Figure 4. Notably, 4D-fy produces nearly static animals without perceptible motions despite heavy prompt engineering, whereas our method generates much more plausible motion sequences.
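The Motion Strength statistic described above can be sketched as the mean of the top 5% of optical-flow magnitudes. How the magnitudes are pooled over pixels and frames is not specified in the text, so this example simply assumes a flat list of magnitudes for one generated video; the function name and the `top_percent` parameter are hypothetical.

```python
def motion_strength(flow_magnitudes, top_percent=5):
    """Average of the largest `top_percent` percent of flow magnitudes.

    flow_magnitudes: flat list of non-negative optical-flow magnitudes
                     (assumed to be pooled over pixels and frame pairs)
    """
    n = len(flow_magnitudes)
    if n == 0:
        raise ValueError("no flow magnitudes given")
    # Integer ceiling of n * top_percent / 100, keeping at least one value.
    k = max(1, (n * top_percent + 99) // 100)
    top = sorted(flow_magnitudes, reverse=True)[:k]
    return sum(top) / k
```

Averaging only the largest values makes the statistic sensitive to moving body parts while ignoring the many near-static pixels in each frame.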

4.2.3 Quantitative Evaluation.

Further assessing the quality of the generated 3D motions quantitatively is difficult due to the lack of (1) ground-truth measurements of 3D animal motions, and (2) robust evaluation metrics for generative models. To evaluate and compare different variants of our model, we design a new metric, bi-directional Motion Chamfer Distance (MCD), computed between a set of generated motion sequences projected to 2D image space and a set of 2D keypoint sequences annotated from videos in APT-36K [90]. Since the skeleton automatically discovered by our model is different from the 17 keypoints annotated in APT-36K, we first perform 3D reconstruction on all the images in APT-36K, and optimize a linear transformation that maps the 2D projections of the predicted 3D joints to the annotated 2D keypoints following [34]. To compute MCD, we generate 1,400 random motion sequences by sampling from the learned motion VAE, each consisting of 10 frames of 3D articulated poses. We then project these generated 3D poses to 2D using the viewpoints estimated from APT-36K, and apply the previously optimized transformation to align with the annotated keypoints. For each annotated keypoint sequence in the test set, we find the closest generated motion sequence measured by keypoint MSE averaged across all frames, and vice versa for each generated sequence. We then compute MCD based on the MSE between the closest sequence pairs. In essence, MCD measures the fidelity of generated motions by comparing the sampled distribution to that of the real motion sequences annotated from videos. Table 3 compares the results of our final model with two ablated variants.
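The nearest-neighbour matching behind MCD can be sketched as follows, assuming the generated poses have already been projected to 2D and aligned to the annotated keypoints. The text does not state how the two directions are combined, so this sketch sums the two directional averages; that choice, along with the function names, is an assumption.

```python
def sequence_mse(seq_a, seq_b):
    """Mean squared 2D keypoint distance between two sequences.
    Each sequence is a list of frames; each frame is a list of (x, y) keypoints."""
    total, count = 0.0, 0
    for frame_a, frame_b in zip(seq_a, seq_b):
        for (xa, ya), (xb, yb) in zip(frame_a, frame_b):
            total += (xa - xb) ** 2 + (ya - yb) ** 2
            count += 1
    return total / count


def motion_chamfer_distance(generated, annotated):
    """Bi-directional Chamfer distance between two sets of keypoint sequences."""
    gen_to_ann = sum(min(sequence_mse(g, a) for a in annotated)
                     for g in generated) / len(generated)
    ann_to_gen = sum(min(sequence_mse(a, g) for g in generated)
                     for a in annotated) / len(annotated)
    return gen_to_ann + ann_to_gen
```

The generated-to-annotated direction penalizes implausible samples, while the annotated-to-generated direction penalizes a sampled distribution that misses real motion patterns.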

Figure 5: 3D Motion Generation Results on More Species. Our method can be trained on various animal species, such as the cows, zebras, and giraffes illustrated here. The model learns to generate plausible 3D motion sequences specific to each animal species, such as the neck motion in the first example, which is more common in giraffes than in other species.
Table 3: Motion Chamfer Distance (MCD) on APT-36K [90] for Motion Generation Evaluation. MP: MagicPony, AM: AnimalMotion dataset, TS: temporal smoothness.

| Experiment | MCD ↓ |
|---|---|
| MP + VAE | 38.77 |
| MP + VAE + AM | 38.12 |
| MP + VAE + AM + TS (final) | 38.03 |

4.3 Single-Image 3D Reconstruction

We also quantitatively evaluate the monocular 3D reconstruction results of our model and compare with existing methods [41, 44, 40, 80]. For this purpose, we use PASCAL [20], a widely used benchmarking dataset for 3D reconstruction, as well as the aforementioned APT-36K [90] dataset, both of which come with 2D keypoint annotations. We compute the commonly used keypoint transfer metric measured by Percentage of Correct Keypoints (PCK) [34, 44, 80]. Specifically, given a set of annotated visible 2D keypoints on a source image, we identify the closest vertices on the reconstructed 3D mesh, and then project those 3D vertices onto the target 2D image. We calculate the percentage of the re-projected keypoints that land within a small distance from the annotated keypoints in the target image. This margin is set to 0.1 of the image size following prior work [34, 44, 80]. Another commonly used metric is Mask Intersection over Union (MIoU) between the rendered and ground-truth masks, which measures the reconstruction quality in terms of projected 2D silhouettes.
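The final scoring step of both metrics can be sketched as below. The sketch assumes the keypoints have already been transferred to the target image (the mesh-based transfer itself is not shown), and the function names and the 0.5 binarization threshold are illustrative choices rather than details from the evaluation protocol.

```python
import math


def pck(transferred_kps, gt_kps, image_size, alpha=0.1):
    """Fraction of transferred keypoints within alpha * image_size of the annotation."""
    threshold = alpha * image_size
    hits = sum(1 for (px, py), (gx, gy) in zip(transferred_kps, gt_kps)
               if math.hypot(px - gx, py - gy) <= threshold)
    return hits / len(gt_kps)


def mask_iou(pred_mask, gt_mask):
    """Intersection over union of two masks given as 2D lists, binarized at 0.5."""
    inter = union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            p, g = p > 0.5, g > 0.5
            inter += p and g
            union += p or g
    return inter / union if union else 1.0
```

With a 256-pixel image and `alpha=0.1`, a transferred keypoint counts as correct if it lands within 25.6 pixels of its annotation.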
In addition, since APT-36K [90] provides keypoint annotations on video sequences, we also measure the temporal consistency across the reconstructions along the video sequences using a Velocity Error, computed as $\frac{1}{T}\sum_{t=1}^{T}\|\hat{\delta}_{t}-\delta_{t}\|/\delta_{t}$, where $\hat{\delta}_{t}$ and $\delta_{t}$ are the keypoint displacements between consecutive frames for the predicted and GT pose sequences, respectively. As the predicted poses are different from the GT keypoints, we use the same procedure described in Section 4.2.3 to optimize a linear mapping from the predicted poses to the GT keypoints for each method.
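One possible reading of the Velocity Error is sketched below. It assumes that the norms are taken over all keypoints of a frame, that the denominator is the norm of the ground-truth displacement, and that the average runs over the available consecutive-frame pairs; the small `eps` guarding against static ground-truth frames is also an addition for this example, not part of the stated formula.

```python
import math


def velocity_error(pred_seq, gt_seq, eps=1e-8):
    """Mean relative error between predicted and GT keypoint displacements.
    Each sequence is a list of frames; each frame is a list of (x, y) keypoints."""

    def displacements(seq):
        # Per-keypoint motion between each pair of consecutive frames.
        return [[(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(f0, f1)]
                for f0, f1 in zip(seq, seq[1:])]

    pred_disp, gt_disp = displacements(pred_seq), displacements(gt_seq)
    total = 0.0
    for frame_pred, frame_gt in zip(pred_disp, gt_disp):
        diff = math.sqrt(sum((a - c) ** 2 + (b - d) ** 2
                             for (a, b), (c, d) in zip(frame_pred, frame_gt)))
        norm = math.sqrt(sum(c ** 2 + d ** 2 for c, d in frame_gt))
        total += diff / (norm + eps)
    return total / len(gt_disp)
```

A prediction that stays static while the ground truth moves yields an error close to 1, which matches the percentage scale used when reporting the metric.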

Table 4: Comparison of Monocular 3D Reconstruction Results with Different Methods on PASCAL [20] and APT-36K [90]. Our method achieves superior reconstruction accuracy compared to the existing methods, including the recent MagicPony baseline [80].

| Method | PASCAL [20]: PCK ↑ | PASCAL [20]: Mask IoU ↑ | APT-36K [90]: PCK ↑ | APT-36K [90]: Vel. Err. ↓ |
|---|---|---|---|---|
| CSM [41] | 31.2% | - | - | - |
| UMR [44] | 24.4% | - | - | - |
| A-CSM [40] | 32.9% | - | - | - |
| MagicPony [80] | 42.8% | 64.1% | 53.9% | 57.3% |
| Ponymation (ours) | 48.0% | 71.8% | 59.9% | 49.1% |

The results are summarized in Table 4. The results of MagicPony [80] are computed using the publicly released code and models, and the results of other baselines are taken from A-CSM [40]. Our model outperforms all previous methods. In particular, compared to the MagicPony baseline, our model achieves considerable improvement by learning from videos instead of individual images.

Additional ablation studies on the architecture design, discussions of limitations, and more visualizations are included in the supplementary material.

5 Conclusions

We have presented a new method for learning generative models of articulated 3D animal motions from raw Internet videos, without relying on any pose annotations or shape templates. To this end, we have proposed a video photo-geometric auto-encoding framework that automatically learns to decompose RGB videos into the underlying 3D shape, articulated motion, and object appearance, simply with the objective of re-rendering the videos. At the core of this pipeline is a transformer-based architecture that effectively extracts the temporal and spatial structure of the video clip into a latent motion VAE, which enables sampling at inference time to generate new 3D motion sequences. Experimental results show that the proposed method learns a reasonable distribution of 3D animal motions for several animal categories. This allows us to instantly turn a single 2D image into 4D animations in a fully automatic fashion, enabling promising downstream applications in game design and movie production.

Acknowledgments.

We thank Zizhang Li, Feng Qiu, and Ruining Li for insightful discussions. The work is in part supported by the Stanford Institute for Human-Centered AI (HAI) and Samsung.

References

  • [1] de Aguiar, E., Stoll, C., Theobalt, C., Ahmed, N., Seidel, H.P., Thrun, S.: Performance capture from sparse multi-view video. ACM TOG (2008)
  • [2] Ahn, H., Ha, T., Choi, Y., Yoo, H., Oh, S.: Text2Action: Generative adversarial synthesis from language to action. In: ICRA (2018)
  • [3] Akhter, I., Sheikh, Y., Khan, S., Kanade, T.: Nonrigid structure from motion in trajectory space. In: NeurIPS (2008)
  • [4] Amir, S., Gandelsman, Y., Bagon, S., Dekel, T.: Deep vit features as dense visual descriptors. In: ECCV Workshop on What is Motion For? (2022)
  • [5] Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2d human pose estimation: New benchmark and state of the art analysis. In: CVPR (2014)
  • [6] Badler, N.: Temporal Scene Analysis: Conceptual Descriptions of Object Movements. Ph.D. thesis, Queensland University of Technology (1975)
  • [7] Badler, N.I., Phillips, C.B., Webber, B.L.: Simulating Humans: Computer Graphics, Animation, and Control. Oxford University Press (1993)
  • [8] Bahmani, S., Skorokhodov, I., Rong, V., Wetzstein, G., Guibas, L., Wonka, P., Tulyakov, S., Park, J.J., Tagliasacchi, A., Lindell, D.B.: 4D-fy: Text-to-4d generation using hybrid score distillation sampling. In: CVPR (2024)
  • [9] Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M.J.: Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In: ECCV (2016)
  • [10] Bregler, C., Hertzmann, A., Biermann, H.: Recovering non-rigid 3d shape from image streams. In: CVPR (2000)
  • [11] Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A.: Video generation models as world simulators (2024), https://openai.com/research/video-generation-models-as-world-simulators
  • [12] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: ICCV (2021)
  • [13] Cashman, T.J., Fitzgibbon, A.W.: What shape are dolphins? building 3d morphable models from 2d images. IEEE TPAMI (2012)
  • [14] Chadwick, J.E., Haumann, D.R., Parent, R.E.: Layered construction for deformable animated characters. ACM SIGGRAPH Computer Graphics (1989)
  • [15] Chan, E., Monteiro, M., Kellnhofer, P., Wu, J., Wetzstein, G.: pi-GAN: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In: CVPR (2021)
  • [16] Chan, E.R., Lin, C.Z., Chan, M.A., Nagano, K., Pan, B., De Mello, S., Gallo, O., Guibas, L., Tremblay, J., Khamis, S., Karras, T., Wetzstein, G.: Efficient geometry-aware 3D generative adversarial networks. In: CVPR (2022)
  • [17] Dai, Y., Li, H., He, M.: A simple prior-free method for non-rigid structure-from-motion factorization. In: CVPR (2012)
  • [18] Debevec, P.: The light stages and their applications to photoreal digital actors. In: SIGGRAPH Asia (2012)
  • [19] Duggal, S., Pathak, D.: Topologically-aware deformation fields for single-view 3d reconstruction. In: CVPR (2022)
  • [20] Everingham, M., Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV (2015)
  • [21] Gao, X., Yang, J., Kim, J., Peng, S., Liu, Z., Tong, X.: Mps-nerf: Generalizable 3d human rendering from multiview images. IEEE TPAMI (2022)
  • [22] Goel, S., Kanazawa, A., Malik, J.: Shape and viewpoints without keypoints. In: ECCV (2020)
  • [23] Guo, C., Zuo, X., Wang, S., Zou, S., Sun, Q., Deng, A., Gong, M., Cheng, L.: Action2motion: Conditioned generation of 3d human motions. In: ACM MM (2020)
  • [24] Habibie, I., Holden, D., Schwarz, J., Yearsley, J., Komura, T.: A recurrent variational autoencoder for human motion synthesis. In: BMVC (2017)
  • [25] Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edn. (2004)
  • [26] He, Y., Pang, A., Chen, X., Liang, H., Wu, M., Ma, Y., Xu, L.: ChallenCap: Monocular 3d capture of challenging human performances using multi-modal references. In: CVPR (2021)
  • [27] Henter, G.E., Alexanderson, S., Beskow, J.: MoGlow: Probabilistic and controllable motion synthesis using normalising flows. ACM TOG (2020)
  • [28] Huang, K., Han, Y., Chen, K., Pan, H., Zhao, G., Yi, W., Li, X., Liu, S., Wei, P., Wang, L.: A hierarchical 3d-motion learning framework for animal spontaneous behavior mapping. Nature Communications (2021)
  • [29] Huang, Z., Shi, X., Zhang, C., Wang, Q., Cheung, K.C., Qin, H., Dai, J., Li, H.: FlowFormer: A transformer architecture for optical flow. In: ECCV (2022)
  • [30] Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE TPAMI (2014)
  • [31] Jakab, T., Li, R., Wu, S., Rupprecht, C., Vedaldi, A.: Farm3D: Learning articulated 3D animals by distilling 2D diffusion. In: 3DV (2024)
  • [32] Jiang, B., Chen, X., Liu, W., Yu, J., Yu, G., Chen, T.: MotionGPT: Human motion as a foreign language. In: NeurIPS (2024)
  • [33] Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: CVPR (2018)
  • [34] Kanazawa, A., Tulsiani, S., Efros, A.A., Malik, J.: Learning category-specific mesh reconstruction from image collections. In: ECCV (2018)
  • [35] Kanazawa, A., Zhang, J.Y., Felsen, P., Malik, J.: Learning 3d human dynamics from video. In: CVPR (2019)
  • [36] Kapon, R., Tevet, G., Cohen-Or, D., Bermano, A.H.: Mas: Multi-view ancestral sampling for 3d motion generation using 2d diffusion. In: CVPR (2024)
  • [37] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: ICLR (2014)
  • [38] Kirillov, A., Wu, Y., He, K., Girshick, R.: PointRend: Image segmentation as rendering. In: CVPR (2020)
  • [39] Kokkinos, F., Kokkinos, I.: To the point: Correspondence-driven monocular 3d category reconstruction. In: NeurIPS (2021)
  • [40] Kulkarni, N., Gupta, A., Fouhey, D.F., Tulsiani, S.: Articulation-aware canonical surface mapping. In: CVPR (2020)
  • [41] Kulkarni, N., Gupta, A., Tulsiani, S.: Canonical surface mapping via geometric cycle consistency. In: ICCV (2019)
  • [42] Laine, S., Hellsten, J., Karras, T., Seol, Y., Lehtinen, J., Aila, T.: Modular primitives for high-performance differentiable rendering. ACM TOG (2020)
  • [43] Li, X., Liu, S., De Mello, S., Kim, K., Wang, X., Yang, M., Kautz, J.: Online adaptation for consistent mesh reconstruction in the wild. In: NeurIPS (2020)
  • [44] Li, X., Liu, S., Kim, K., De Mello, S., Jampani, V., Yang, M.H., Kautz, J.: Self-supervised single-view 3d reconstruction via semantic consistency. In: ECCV (2020)
  • [45] Li, Z., Dekel, T., Cole, F., Tucker, R., Snavely, N., Liu, C., Freeman, W.T.: Learning the depths of moving people by watching frozen people. In: CVPR (2019)
  • [46] Lin, X., Amer, M.R.: Human motion modeling using dvgans. arXiv preprint arXiv:1804.10652 (2018)
  • [47] Ling, H., Kim, S.W., Torralba, A., Fidler, S., Kreis, K.: Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models. In: CVPR (2024)
  • [48] Liu, D., Stathopoulos, A., Zhangli, Q., Gao, Y., Metaxas, D.: LEPARD: Learning explicit part discovery for 3d articulated shape reconstruction. In: NeurIPS (2024)
  • [49] Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: A skinned multi-person linear model. ACM TOG (2015)
  • [50] Magnenat-Thalmann, N., Primeau, E., Thalmann, D.: Abstract muscle action procedures for human face animation. The Visual Computer (1988)
  • [51] Minderer, M., Sun, C., Villegas, R., Cole, F., Murphy, K.P., Lee, H.: Unsupervised learning of object structure and dynamics from videos. In: NeurIPS (2019)
  • [52] Muybridge, E.: The horse in motion (1887)
  • [53] Newcombe, R.A., Fox, D., Seitz, S.M.: DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In: CVPR (2015)
  • [54] Nguyen-Phuoc, T., Li, C., Theis, L., Richardt, C., Yang, Y.L.: HoloGAN: Unsupervised learning of 3d representations from natural images. In: ICCV (2019)
  • [55] Niemeyer, M., Geiger, A.: GIRAFFE: Representing scenes as compositional generative neural feature fields. In: CVPR (2021)
  • [56] Niemeyer, M., Mescheder, L., Oechsle, M., Geiger, A.: Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In: CVPR (2020)
  • [57] OpenAI: ChatGPT (2023), https://chat.openai.com/
  • [58] Ormoneit, D., Black, M., Hastie, T., Kjellström, H.: Representing cyclic human motion using functional analysis. Image and Vision Computing (2005)
  • [59] Petrovich, M., Black, M.J., Varol, G.: Action-conditioned 3D human motion synthesis with transformer VAE. In: ICCV (2021)
  • [60] Petrovich, M., Black, M.J., Varol, G.: Temos: Generating diverse human motions from textual descriptions. In: ECCV (2022)
  • [61] Piao, J., Sun, K., Wang, Q., Lin, K.Y., Li, H.: Inverting generative adversarial renderer for face reconstruction. In: CVPR (2021)
  • [62] Ren, J., Pan, L., Tang, J., Zhang, C., Cao, A., Zeng, G., Liu, Z.: DreamGaussian4D: Generative 4d gaussian splatting. arXiv preprint arXiv:2312.17142 (2023)
  • [63] Saito, S., Huang, Z., Natsume, R., Morishima, S., Kanazawa, A., Li, H.: PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In: ICCV (2019)
  • [64] Saito, S., Simon, T., Saragih, J., Joo, H.: PIFuHD: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization. In: CVPR (2020)
  • [65] Schwarz, K., Liao, Y., Niemeyer, M., Geiger, A.: GRAF: Generative radiance fields for 3d-aware image synthesis. In: NeurIPS (2020)
  • [66] Shen, T., Gao, J., Yin, K., Liu, M.Y., Fidler, S.: Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis. In: NeurIPS (2021)
  • [67] Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., Parikh, D., Gupta, S., Taigman, Y.: Make-a-video: Text-to-video generation without text-video data. In: ICLR (2023)
  • [68] Sitzmann, V., Zollhöfer, M., Wetzstein, G.: Scene representation networks: Continuous 3d-structure-aware neural scene representations. In: NeurIPS (2019)
  • [69] Starke, S., Mason, I., Komura, T.: DeepPhase: Periodic autoencoders for learning motion phase manifolds. ACM TOG (2022)
  • [70] Stathopoulos, A., Pavlakos, G., Han, L., Metaxas, D.N.: Learning articulated shape with keypoint pseudo-labels from web images. In: CVPR (2023)
  • [71] Sun, J.J., Karashchuk, P., Dravid, A., Ryou, S., Fereidooni, S., Tuthill, J., Katsaggelos, A., Brunton, B.W., Gkioxari, G., Kennedy, A., et al.: BKinD-3D: Self-supervised 3d keypoint discovery from multi-view videos. In: CVPR (2023)
  • [72] Sun, J.J., Ryou, S., Goldshmid, R., Weissbourd, B., Dabiri, J., Anderson, D.J., Kennedy, A., Yue, Y., Perona, P.: Self-supervised keypoint discovery in behavioral videos. In: CVPR (2022)
  • [73] Sun, K., Wu, S., Huang, Z., Zhang, N., Wang, Q., Li, H.: Controllable 3d face synthesis with conditional generative occupancy fields. In: NeurIPS (2022)
  • [74] Sun, K., Wu, S., Zhang, N., Huang, Z., Wang, Q., Li, H.: Cgof++: Controllable 3d face synthesis with conditional generative occupancy fields. IEEE TPAMI (2023)
  • [75] Teed, Z., Deng, J.: Raft: Recurrent all-pairs field transforms for optical flow. In: ECCV (2020)
  • [76] Urtasun, R., Fleet, D.J., Lawrence, N.D.: Modeling human locomotion with topologically constrained latent variable models. In: Elgammal, A., Rosenhahn, B., Klette, R. (eds.) Human Motion – Understanding, Modeling, Capture and Animation. pp. 104–118. Springer Berlin Heidelberg, Berlin, Heidelberg (2007)
  • [77] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017)
  • [78] Wang, Y., Long, M., Wang, J., Gao, Z., Yu, P.S.: Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. In: NeurIPS (2017)
  • [79] Wu, S., Jakab, T., Rupprecht, C., Vedaldi, A.: DOVE: Learning deformable 3d objects by watching videos. IJCV (2023)
  • [80] Wu, S., Li, R., Jakab, T., Rupprecht, C., Vedaldi, A.: MagicPony: Learning articulated 3d animals in the wild. In: CVPR (2023)
  • [81] Wu, S., Makadia, A., Wu, J., Snavely, N., Tucker, R., Kanazawa, A.: De-rendering the world’s revolutionary artefacts. In: CVPR (2021)
  • [82] Wu, S., Rupprecht, C., Vedaldi, A.: Unsupervised learning of probably symmetric deformable 3D objects from images in the wild. In: CVPR (2020)
  • [83] Wu, Y., Chen, Z., Liu, S., Ren, Z., Wang, S.: CASA: Category-agnostic skeletal animal reconstruction. In: NeurIPS (2022)
  • [84] Xiao, J., Chai, J.X., Kanade, T.: A closed-form solution to non-rigid shape and motion recovery. In: ECCV (2004)
  • [85] Xie, Y., Jampani, V., Zhong, L., Sun, D., Jiang, H.: OmniControl: Control any joint at any time for human motion generation. In: ICLR (2024)
  • [86] Yang, G., Sun, D., Jampani, V., Vlasic, D., Cole, F., Chang, H., Ramanan, D., Freeman, W.T., Liu, C.: LASR: Learning articulated shape reconstruction from a monocular video. In: CVPR (2021)
  • [87] Yang, G., Sun, D., Jampani, V., Vlasic, D., Cole, F., Liu, C., Ramanan, D.: ViSER: Video-specific surface embeddings for articulated 3d shape reconstruction. In: NeurIPS (2021)
  • [88] Yang, G., Vo, M., Neverova, N., Ramanan, D., Vedaldi, A., Joo, H.: BANMo: Building animatable 3d neural models from many casual videos. In: CVPR (2022)
  • [89] Yang, G., Wang, C., Reddy, N.D., Ramanan, D.: Reconstructing animatable categories from videos. In: CVPR (2023)
  • [90] Yang, Y., Yang, J., Xu, Y., Zhang, J., Lan, L., Tao, D.: APT-36K: A large-scale benchmark for animal pose estimation and tracking. In: NeurIPS Dataset and Benchmark Track (2022)
  • [91] Yao, C.H., Hung, W.C., Li, Y., Rubinstein, M., Yang, M.H., Jampani, V.: Hi-LASSIE: High-fidelity articulated shape and skeleton discovery from sparse image ensemble. In: CVPR (2023)
  • [92] Yao, C.H., Hung, W.C., Rubinstein, M., Lee, Y., Jampani, V., Yang, M.H.: LASSIE: Learning articulated shape from sparse image ensemble via 3d part discovery. In: NeurIPS (2022)
  • [93] Yao, C.H., Raj, A., Hung, W.C., Rubinstein, M., Li, Y., Yang, M.H., Jampani, V.: ARTIC3D: Learning robust articulated 3d shapes from noisy web image collections. In: NeurIPS (2024)
  • [94] Zhang, J.Y., Felsen, P., Kanazawa, A., Malik, J.: Predicting 3d human dynamics from video. In: ICCV (2019)
  • [95] Zhao, Y., Yan, Z., Xie, E., Hong, L., Li, Z., Lee, G.H.: Animate124: Animating one image to 4d dynamic scene. arXiv preprint arXiv:2311.14603 (2023)
  • [96] Zheng, Y., Li, X., Nagano, K., Liu, S., Hilliges, O., De Mello, S.: A unified approach for text-and image-guided 4d scene generation. In: CVPR (2024)
  • [97] Zhou, Z., Wang, B.: UDE: A unified driving engine for human motion generation. In: CVPR (2023)

Appendices

Appendix 0.A Additional Qualitative Results

0.A.1 Additional Motion Generation Results

Additional generated 3D motion sequences are shown in Figures 8 and 9. Please refer to the video (https://youtu.be/poc7c-9hCvQ?si=3k874zHackOre94R) for more 3D animation visualizations. As shown in the video, by sampling from the learned motion VAE, we can generate diverse motion patterns, such as eating with the head bending towards the ground, walking with the legs moving alternately, and jumping with the front legs lifted up.

We trained our VAE model with a sequence length of 10 frames. To produce longer motion sequences as demonstrated in the video, we first sample 2 latent codes to generate 2 motion sequences, each comprising 10 frames. We then optimize 1 additional transition motion latent, encouraging the first and last frames of its decoded sequence to be consistent with the last frame of the preceding sequence and the first frame of the following sequence, respectively.
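The transition step described above can be illustrated with a small numerical sketch. Everything here is a stand-in chosen for illustration: the linear decoder `decode`, the matrix `W`, the dimensions, and the plain gradient-descent loop are assumptions, not the transformer decoder or the optimizer used in the paper. The sketch only shows the idea of fitting one latent so that its decoded clip starts where one clip ends and ends where the next one starts.

```python
import numpy as np

def decode(z, W, T):
    # Stand-in linear decoder (assumption): latent (D,) -> pose sequence (T, P).
    return (z @ W).reshape(T, -1)

def fit_transition_latent(W, prev_last, next_first, T=10, steps=500):
    """Gradient descent on ||first frame - prev_last||^2 + ||last frame - next_first||^2."""
    P = W.shape[1] // T
    # Columns of W that produce the first and the last frame of the decoded clip.
    A = np.concatenate([W[:, :P], W[:, -P:]], axis=1)      # (D, 2P)
    target = np.concatenate([prev_last, next_first])       # (2P,)
    # Step size 1/L, with L the Lipschitz constant of the gradient of this quadratic.
    lr = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)
    z = np.zeros(W.shape[0])
    losses = []
    for _ in range(steps):
        residual = z @ A - target
        losses.append(float(residual @ residual))
        z = z - lr * 2.0 * (A @ residual)
    return z, losses

rng = np.random.default_rng(0)
D, T, P = 16, 10, 6                        # hypothetical latent, clip, and pose sizes
W = rng.normal(size=(D, T * P))
clip_a = decode(rng.normal(size=D), W, T)  # first sampled clip
clip_b = decode(rng.normal(size=D), W, T)  # second sampled clip
z_mid, losses = fit_transition_latent(W, clip_a[-1], clip_b[0], T)
long_clip = np.concatenate([clip_a, decode(z_mid, W, T), clip_b])  # 30 frames
```

With a learned nonlinear decoder the same objective would typically be minimized with automatic differentiation rather than the closed-form gradient used here.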

0.A.2 Qualitative Comparison of Video Reconstruction Results

Figure 6 compares the 3D reconstruction results on video sequences obtained from the MagicPony [80] model and our proposed method. Although MagicPony predicts a plausible 3D shape in most cases, it tends to produce temporally inconsistent poses, including both the rigid pose $\hat{\xi}_{t,1}$ and bone rotations $\hat{\xi}_{t,2:B}$, as highlighted in Figure 6. In contrast, our method leverages the temporal signals in training videos, and produces temporally coherent reconstruction results.

Figure 6: Comparison of 3D Reconstruction Results with MagicPony [80]. With the video training framework, our method produces temporally coherent and more accurate pose predictions. In comparison, the baseline model of MagicPony often predicts incorrect rigid poses $\hat{\xi}_{t,1}$ (red boxes), and incorrect bone articulation $\hat{\xi}_{t,2:B}$ (blue boxes), resulting in inaccurate 3D reconstruction.

Appendix 0.B Additional Ablation Studies

Table 5: Ablation study on the architecture of the motion VAE model.
| Row | Method | [email protected] | Mask IoU |
|---|---|---|---|
| 1 | Final (with ST-Transformer) | 37.6% | 62.0% |
| 2 | without spatial Transformers $E_{\text{s}}$, $D_{\text{s}}$ | 33.4% | 58.9% |
| 3 | without Teacher Loss $\mathcal{L}_{\text{teacher}}$ | 32.4% | 57.9% |
| 4 | without motion VAE | 44.3% | 66.7% |

0.B.1 Spatio-Temporal Transformer Architecture

We conduct an ablation study to verify the effectiveness of the proposed spatio-temporal transformer architecture. In particular, we remove each individual component from the final model or replace it with a default option, train the model on the same dataset, and evaluate its performance on 3D reconstruction with the same protocol described in Section 4.3 of the main paper.

First, we remove the spatial transformer encoder and decoder, $E_{\text{s}}$ and $D_{\text{s}}$, and report the results in row 2 of Table 5. Specifically, in this variant, instead of using the spatial transformer encoder $E_{\text{s}}$ to fuse bone-specific local image features before passing them to the temporal transformer encoder $E_{\text{t}}$, we directly feed the global image features $\{\phi_1, \cdots, \phi_T\}$ into the temporal encoder. Similarly, we also remove the spatial decoder $D_{\text{s}}$, and directly decode a fixed set of bone rotations from the temporal transformer decoder $D_{\text{t}}$.

Compared to the final model with the spatio-temporal transformer architecture in row 1 of Table 5, the variant without the spatial transformers produces less accurate reconstructions, scoring lower on both metrics in row 2 (33.4% and 58.9%, versus 37.6% and 62.0%). This confirms the effectiveness of the proposed spatial transformer in extracting motion-specific spatial information from the images.

0.B.2 Teacher Loss

We also demonstrate the effect of the Teacher Loss $\mathcal{L}_{\text{teacher}}$ introduced in Section 3.3 of the main paper. We train a variant of the motion VAE model without this loss, and report its reconstruction performance in row 3 of Table 5. Without $\mathcal{L}_{\text{teacher}}$, the model fails to learn accurate poses effectively, leading to degraded reconstruction results. This is mainly because training the motion VAE from scratch is computationally inefficient with an expensive rendering step in the loop, and the Teacher Loss can significantly improve training efficiency.

Table 6: Ablation study with different sequence lengths for motion generation evaluated using Motion Chamfer Distance (MCD) on APT-36K [90].
| Sequence Length | $K=10$ | $K=20$ | $K=50$ |
|---|---|---|---|
| MCD $\downarrow$ | 38.03 | 38.25 | 39.25 |

0.B.3 Sequence Length.

We conducted experiments to understand the effect of different sequence lengths during training ($K=10, 20, 50$ frames). For a fair comparison, to evaluate the longer motion sequences generated by these variants ($K=20, 50$), we divide them into consecutive sub-sequences of 10 frames, and average the MCD metric across the sub-sequences. We use the same metric as introduced in Section 4.2 of the main paper, the Motion Chamfer Distance (MCD) calculated between generated sequences and the annotated sequences in the APT-36K dataset [90]. The results are presented in Table 6.
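The sub-sequence protocol above amounts to chunking each generated sequence and averaging a per-chunk score. The following is a minimal sketch of that bookkeeping only. The helper names are hypothetical, the `metric` callable is a placeholder for the MCD of Section 4.2 (which is not reproduced here), and dropping incomplete trailing chunks is an assumption that does not matter for $K=20$ or $K=50$.

```python
def split_into_chunks(sequence, chunk=10):
    # Consecutive, non-overlapping sub-sequences; an incomplete tail is dropped (assumption).
    n_chunks = len(sequence) // chunk
    return [sequence[i * chunk:(i + 1) * chunk] for i in range(n_chunks)]

def chunked_average(sequence, metric, chunk=10):
    # Average a per-clip metric over all sub-sequences of a longer generated sequence.
    chunks = split_into_chunks(sequence, chunk)
    if not chunks:
        raise ValueError("sequence is shorter than one chunk")
    return sum(metric(c) for c in chunks) / len(chunks)

# Toy usage: frames are plain numbers and the "metric" is the clip mean,
# a placeholder that only exercises the chunking logic.
toy_sequence = [float(i) for i in range(20)]
toy_metric = lambda clip: sum(clip) / len(clip)
score = chunked_average(toy_sequence, toy_metric)
```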

The generated sequences still look plausible as the sequence length increases from 10 to 20 (MCD 38.03 vs. 38.25). However, a notable degradation in quality is observed as the sequence length increases to 50 (MCD 39.25). This could potentially be attributed to the limited capacity of the motion VAE model as well as the limited size of the training dataset. For our final model, we set the sequence length to 10, which tends to yield the most satisfactory results with a reasonable training efficiency.

0.B.4 KL Loss Weight.

To train the motion VAE, in addition to the reconstruction losses, we also use the Kullback–Leibler (KL) divergence loss $\mathcal{L}_{\text{KL}}$ in Equation (6) in the main paper. We conducted an ablation study on its weight $\lambda_{\text{KL}}$ to assess its impact on the overall 3D reconstruction accuracy. As shown in Table 7, $\lambda_{\text{KL}}=0.001$ achieves the best reconstruction results, and is used in all experiments in the main paper.
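For readers unfamiliar with how such a weight enters the objective, here is a minimal sketch of a weighted KL term. It assumes the common VAE setup of a diagonal Gaussian posterior parameterized by a mean and a log-variance, with a standard normal prior. Equation (6) of the main paper is not reproduced here, so the exact form used by the authors may differ, and the function names are hypothetical.

```python
import math

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def total_loss(reconstruction_loss, mu, log_var, lambda_kl=0.001):
    # Reconstruction term plus the KL term scaled by lambda_KL.
    return reconstruction_loss + lambda_kl * kl_to_standard_normal(mu, log_var)
```

A larger `lambda_kl` pulls the posterior towards the prior at the expense of reconstruction accuracy, which is the trade-off the ablation probes.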

Table 7: Ablation study on the weight $\lambda_{\text{KL}}$ of the KL divergence loss $\mathcal{L}_{\text{KL}}$.
| $\lambda_{\text{KL}}$ | [email protected] | Mask IoU |
|---|---|---|
| 0.01 | 33.58% | 59.85% |
| 0.001 | 37.63% | 62.03% |
| 0.0001 | 35.75% | 61.11% |
Figure 7: Illustration of the Spatio-temporal Transformer-based Motion Encoder. For each frame, the bone-specific features $\{\nu_{t,b}\}_{b=2}^{B}$ are first extracted from image features and fused by a spatial encoder $E_{\text{s}}$ to obtain a single feature vector $\nu_{t,*}$. A temporal encoder $E_{\text{t}}$ then further fuses the feature vectors of all frames $\{\nu_{t,*}\}_{t=1}^{T}$ and produces the motion VAE distribution parameters $\hat{\mu}$ and $\hat{\Sigma}$. Please refer to Section 3.2 of the main paper for details.

Appendix 0.C Additional Technical Details

0.C.1 Architecture Details

As explained in the paper, we adopt a spatio-temporal transformer architecture for sequence feature encoding and motion decoding. To better illustrate the architecture, we depict the spatial and temporal transformer encoders in Figure 7. Also, as presented in Table 8, we use 4-layer transformers to implement the spatial and temporal transformer encoders $E_{\text{s}}$, $E_{\text{t}}$ and decoders $D_{\text{s}}$, $D_{\text{t}}$. Given the DINO features of the input image, we first concatenate the bone position as Positional Encoding to obtain the bone-specific feature descriptors $\nu_{t,b}$ with shape (BoneNum, FrameNum, FeatureDim) = ($20 \times 10 \times 640$). Then we map the feature dimension to 256 with a simple Linear layer, and concatenate an additional BoneFeatureQuery token. We use the 4-layer transformer $E_{\text{s}}$ to aggregate all the bone-specific feature descriptors into a per-frame pose feature $\nu_{t,*}$, and subsequently $E_{\text{t}}$ to aggregate all frame-specific features into the VAE distribution parameters, including the mean $\hat{\mu}$ and variance $\hat{\Sigma}$. Using the reparametrization trick, we then sample a latent code $z$ from the Gaussian distribution $z \sim \mathcal{N}(\hat{\mu}, \hat{\Sigma})$, which is decoded first by the temporal decoder $D_{\text{t}}$ and then by the spatial decoder $D_{\text{s}}$ into a final sequence of bone rotation angles $\hat{\xi}_{*,2:B} \in \mathbb{R}^{20 \times 10 \times 3}$.

Table 8: Architecture of the proposed spatio-temporal transformer VAE.
| Operation | Output Size |
|---|---|
| Positional Encoding | 20 × 10 × 640 |
| Linear(640, 256) | 20 × 10 × 256 |
| Concat BoneFeatQuery | 21 × 10 × 256 |
| TransformerLayer × 4 | 1 × 10 × 256 |
| Reshape | 10 × 1 × 256 |
| Concat muQuery and sigmaQuery | 12 × 1 × 256 |
| Positional Encoding | 12 × 1 × 256 |
| TransformerLayer × 4 | 2 × 1 × 256 |
| Reparameterization | 1 × 1 × 256 |
| TransformerLayer × 4 | 10 × 1 × 256 |
| Reshape | 1 × 10 × 256 |
| TransformerLayer × 4 | 20 × 10 × 256 |
| Linear(256, 3) | 20 × 10 × 3 |
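The encoder half of Table 8 can be followed as pure shape bookkeeping. The sketch below is not the model: `transformer_stub` is an identity placeholder for the 4-layer transformers, the weights are random, and two details are assumptions made for illustration, namely that the query tokens are prepended (rather than appended) and that the second query output is read as a log-variance. It only shows how query tokens reduce 20 bone tokens to one per-frame token, and 10 frame tokens to the two distribution parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, F, C = 20, 10, 640, 256      # bones, frames, input feature dim, model dim

def transformer_stub(tokens):
    # Identity stand-in (assumption): a real model applies 4 self-attention layers here.
    return tokens

feats = rng.normal(size=(B, T, F))                     # bone-specific descriptors
feats = feats @ (rng.normal(size=(F, C)) / np.sqrt(F)) # Linear(640, 256) -> (20, 10, 256)

bone_query = np.zeros((1, T, C))                       # BoneFeatQuery token
tokens = np.concatenate([bone_query, feats], axis=0)   # (21, 10, 256)
per_frame = transformer_stub(tokens)[:1]               # read out the query -> (1, 10, 256)

seq = per_frame.reshape(T, 1, C)                       # (10, 1, 256)
dist_queries = np.zeros((2, 1, C))                     # muQuery and sigmaQuery tokens
seq = np.concatenate([dist_queries, seq], axis=0)      # (12, 1, 256)
mu, log_var = transformer_stub(seq)[:2]                # two readouts, each (1, 256)

# Reparameterization: z = mu + sigma * eps with eps ~ N(0, I).
eps = rng.standard_normal(mu.shape)
z = (mu + np.exp(0.5 * log_var) * eps)[None]           # (1, 1, 256)
```

The decoder half of the table mirrors this sequence in reverse, expanding the latent back to 10 frame tokens and then to 20 bone tokens per frame.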

0.C.2 Articulation Model Specifications

The configuration of bone topology and skinning weights was established following MagicPony [80]. Here, we give a brief recap of the model.

0.C.2.1 Posed Shape.

The blend skinning model for posing [50, 14, 80] was utilized to articulate the skeleton into a specific pose. This model is parameterised by $B-1$ bone rotations $\xi_b \in SO(3), b = 2, \dots, B$, and the viewpoint $\xi_1 \in SE(3)$. A set of rest-pose joint locations $\mathbf{J}_b$ was initialized on the instance mesh using straightforward heuristics. Each bone $b$, excluding the root, has a single parent $\pi(b)$, thereby forming a tree structure.

Each vertex $V_i$ is linked to the bones via the skinning weights $w_{ib}$, determined based on their relative proximity to each bone. The vertices are then posed using the linear blend skinning equation:

\begin{align}
V_i(\xi) &= \left(\sum_{b=1}^{B} w_{ib}\, G_b(\xi)\, G_b(\xi^{*})^{-1}\right) V_{\text{ins},i}, \tag{9}\\
G_1 &= g_1, \quad G_b = G_{\pi(b)} \circ g_b, \quad g_b(\xi) = \begin{bmatrix} R_{\xi_b} & \mathbf{J}_b \\ 0 & 1 \end{bmatrix}, \notag
\end{align}

where $\xi^{*}$ denotes the bone rotations at the rest pose.
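Equation (9) can be made concrete with a short NumPy sketch. Several choices here are assumptions for illustration: rotations are passed as 3×3 matrices, each $\mathbf{J}_b$ is treated as an offset expressed in the parent bone's frame, bones are listed parent-before-child, and the root is handled like any other bone (in the paper the root transform $\xi_1$ is the viewpoint). The function names are hypothetical.

```python
import numpy as np

def rigid(R, t):
    # 4x4 homogeneous transform [[R, t], [0, 1]], i.e. g_b in Equation (9).
    g = np.eye(4)
    g[:3, :3] = R
    g[:3, 3] = t
    return g

def global_transforms(rotations, joints, parents):
    # G_1 = g_1 and G_b = G_parent(b) @ g_b; bones must be ordered parent-before-child.
    G = []
    for b, (R, J) in enumerate(zip(rotations, joints)):
        g = rigid(R, J)
        G.append(g if parents[b] < 0 else G[parents[b]] @ g)
    return np.stack(G)

def linear_blend_skinning(V_rest, weights, rotations, rest_rotations, joints, parents):
    G_pose = global_transforms(rotations, joints, parents)
    G_rest = global_transforms(rest_rotations, joints, parents)
    rel = G_pose @ np.linalg.inv(G_rest)                    # G_b(xi) G_b(xi*)^-1, (B, 4, 4)
    blended = np.einsum("ib,bjk->ijk", weights, rel)        # per-vertex blend, (N, 4, 4)
    V_h = np.concatenate([V_rest, np.ones((len(V_rest), 1))], axis=1)
    return np.einsum("ijk,ik->ij", blended, V_h)[:, :3]

# Two-bone chain along x; the second bone is rotated 90 degrees about z.
I3 = np.eye(3)
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
joints = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
parents = [-1, 0]
V_rest = np.array([[0.5, 0.0, 0.0], [2.0, 0.0, 0.0]])
weights = np.array([[1.0, 0.0], [0.0, 1.0]])                # each vertex bound to one bone
posed = linear_blend_skinning(V_rest, weights, [I3, Rz], [I3, I3], joints, parents)
```

In this toy chain the vertex bound to the second bone swings around the joint at $(1, 0, 0)$, while the vertex bound to the root stays in place.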

0.C.2.2 Bone Topology

For all quadrupedal animals examined in this paper, a chain of 8 bones of equal lengths was estimated. These bones lie on two line segments that extend from the centre (root) of the rest-pose mesh to the two most extreme vertices along the $z$-axis (4 bones on each side), thereby forming a “spine”. Then the root joint was slightly elevated, and 4 sets of bones were added to model the legs. The foot joints were first identified as the lowest points of the mesh (in the $y$-axis) in each of the four $xz$-quadrants. Subsequently, 4 line segments were drawn from the foot joints to their nearest spine joints, and a chain of 3 bones of equal lengths was defined on each of the segments, representing each leg.
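The heuristic above yields 8 spine bones plus 4 legs of 3 bones each, i.e. 20 articulated bones, matching the BoneNum of 20 used in the architecture. A rough sketch of such a construction is given below. It is an illustration under assumptions, not the MagicPony implementation: the mesh centre is taken as the vertex centroid, the amount of root elevation (`lift`) is a made-up parameter, each bone is identified by its end joint, and every $xz$-quadrant is assumed to contain at least one vertex.

```python
import numpy as np

def build_quadruped_skeleton(V, lift=0.05):
    # Returns joints (21, 3) and parents (21,): one root joint plus 20 bone end joints.
    centre = V.mean(axis=0)
    joints, parents = [centre.copy()], [-1]

    def add_chain(start_idx, end_point, n_bones):
        # Append n_bones equal-length bones from joint start_idx to end_point.
        start = joints[start_idx].copy()
        parent = start_idx
        for k in range(1, n_bones + 1):
            joints.append(start + (end_point - start) * k / n_bones)
            parents.append(parent)
            parent = len(joints) - 1

    # Spine: 4 bones towards each extreme vertex along the z-axis.
    for end in (V[np.argmax(V[:, 2])], V[np.argmin(V[:, 2])]):
        add_chain(0, end, 4)
    joints[0][1] += lift                       # slightly elevate the root joint

    # Legs: lowest vertex (y-axis) in each xz-quadrant, attached to the nearest spine joint.
    spine = np.array(joints)                   # root + 8 spine joints
    for x_side in (True, False):
        for z_side in (True, False):
            mask = ((V[:, 0] >= centre[0]) == x_side) & ((V[:, 2] >= centre[2]) == z_side)
            candidates = V[mask]
            foot = candidates[np.argmin(candidates[:, 1])]
            nearest = int(np.argmin(np.linalg.norm(spine - foot, axis=1)))
            add_chain(nearest, foot, 3)
    return np.array(joints), np.array(parents)

# Toy "mesh": three body points and four feet.
V = np.array([[0.0, 0.0, -1.0], [0.0, 0.0, 1.0], [0.0, 0.2, 0.0],
              [0.3, -1.0, 0.8], [-0.3, -1.0, 0.8], [0.3, -1.0, -0.8], [-0.3, -1.0, -0.8]])
skeleton_joints, skeleton_parents = build_quadruped_skeleton(V)
```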

0.C.2.3 Skinning Weight

The skinning weight $w_{i,b}$, which associates each vertex $V_{\text{ins},i}$ with the bones, was defined as follows:

\begin{align}
w_{i,b} &= \frac{e^{-d_{i,b}/\tau_{\text{s}}}}{\sum_{k=1}^{B} e^{-d_{i,k}/\tau_{\text{s}}}}, \tag{10}\\
\text{where} \quad d_{i,b} &= \min_{r \in [0,1]} \left\| V_{\text{ins},i} - r\,\tilde{\mathbf{J}}_b - (1-r)\,\tilde{\mathbf{J}}_{\pi(b)} \right\|_2^2. \notag
\end{align}

In this context, $d_{i,b}$ is the minimal (squared) distance from the vertex $V_{\text{ins},i}$ to bone $b$, which is the segment defined by the rest-pose joint locations $\tilde{\mathbf{J}}_b$ and $\tilde{\mathbf{J}}_{\pi(b)}$ in world coordinates. $\tilde{\mathbf{J}}_{\pi(b)}$ denotes the parent joint of $\tilde{\mathbf{J}}_b$. The temperature parameter $\tau_{\text{s}}$ is set to 0.5.
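Equation (10) is a softmax over negative squared point-to-segment distances. The sketch below follows the formula directly, with two assumptions for illustration: the root, which has no parent, is treated as a zero-length segment at its own joint, and the smallest distance is subtracted before exponentiation for numerical stability (this leaves the weights unchanged). The function names are hypothetical.

```python
import math

def sq_dist_to_segment(p, a, b):
    # min over r in [0, 1] of || p - r*b - (1 - r)*a ||^2, solved in closed form.
    ab = [bj - aj for aj, bj in zip(a, b)]
    ap = [pj - aj for aj, pj in zip(a, p)]
    denom = sum(c * c for c in ab)
    r = 0.0 if denom == 0.0 else min(1.0, max(0.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    return sum((x - r * y) ** 2 for x, y in zip(ap, ab))

def skinning_weights(vertices, joints, parents, tau=0.5):
    weights = []
    for v in vertices:
        d = []
        for b, joint in enumerate(joints):
            # Root bone (assumption): degenerate segment located at the root joint.
            start = joint if parents[b] < 0 else joints[parents[b]]
            d.append(sq_dist_to_segment(v, start, joint))
        d_min = min(d)                      # shift for stability; cancels in the ratio
        e = [math.exp(-(x - d_min) / tau) for x in d]
        total = sum(e)
        weights.append([x / total for x in e])
    return weights

# Three joints on the x-axis; each vertex is weighted most towards its closest bone.
chain_joints = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
chain_parents = [-1, 0, 1]
w = skinning_weights([(0.5, 0.0, 0.0), (1.5, 1.0, 0.0)], chain_joints, chain_parents)
```

A smaller temperature makes the weights closer to a hard nearest-bone assignment, while a larger one blends more bones per vertex.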

0.C.3 Text Prompts for 4D-fy Evaluation

We provide the 4D-fy [8] model with a list of text prompts, which are enriched by ChatGPT [57] from a list of basic prompts describing horse motions. The complete list is enumerated in the following:

  • A horse is running.

  • A majestic horse galloping swiftly across the verdant meadow.

  • An energetic steed dashing with unbridled enthusiasm under the azure sky.

  • A spirited horse racing with the wind, its mane flowing like waves.

  • A horse is walking.

  • A serene horse ambling gently through a misty forest at dawn.

  • An elegant steed strolling leisurely along a cobblestone path.

  • A calm equine sauntering with grace across a blooming meadow.

  • A horse is eating.

  • A serene horse gently nibbling on the lush green grass of a tranquil meadow.

  • An elegant equine gracefully bending to graze on the dew-kissed clover.

  • A peaceful steed leisurely munching on hay in the golden light of dawn.

  • A horse is jumping.

  • A majestic horse soaring effortlessly over a rustic wooden fence, its muscles rippling with power.

  • An agile steed leaping gracefully, silhouetted against the vibrant hues of the setting sun.

  • A spirited equine vaulting energetically over an obstacle, mane flowing like a river in the wind.

Appendix 0.D Limitations and Future Directions

While the model demonstrates promising results, there are several areas where further improvements can be made.

A significant limitation is that the articulated motions are learned on top of a fixed bone topology, which is pre-defined using strong heuristics, such as the number of legs. This approach may not effectively generalize across diverse animal species. A potential avenue for future research could involve the joint discovery of the articulation structure in conjunction with video training.

Additionally, the current model does not distinguish between different legs due to the nature of the DINO features. This can result in a “curious legs” problem, where the model confuses left and right legs of an animal seen from the side. This can be observed in the reconstruction results and subsequently in the generated motion sequences, and is also a common issue even with the most powerful video generation models [11]. Accurately capturing the leg ordering and precise motion is an intriguing challenge for future research in motion generation.

Figure 8: Additional Motion Generation Results on Horses. Conditioned on an input image, which can be either a real photo or a painting of a horse, our model can generate realistic 4D animations of the instance. See the supplementary video for better visualizations.
Figure 9: Additional Motion Generation Results for Other Categories. Our model can also be trained on other categories besides horses, and generates realistic motion sequences.

Appendix 0.E Societal Impact

Generating 3D motion from unlabeled videos is a fundamental challenge in computer vision and computer graphics, and is key to extending current models to the long-tail distribution of objects in the real world. As an initial exploration in this area, we aim to stimulate further interest and research in this direction. Continued advances in this field hold great potential for significantly improving the diversity and quality of 3D and 4D models of real-world objects, thereby supporting numerous downstream applications in virtual reality, robotics and scientific discovery.