
¹CUHK MMLab ²Stanford University ³UT Austin
https://keqiangsun.github.io/projects/ponymation

Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos

Keqiang Sun¹^∗ (ORCID 0000-0003-2900-1202), Dor Litvak²,³^∗ (ORCID 0009-0004-8720-618X), Yunzhi Zhang² (ORCID 0009-0000-3919-4883), Hongsheng Li¹ (ORCID 0000-0002-2664-7975),
Jiajun Wu²^† (ORCID 0000-0002-4176-343X), Shangzhe Wu²^† (ORCID 0000-0003-1011-5963)

Abstract

We introduce a new method for learning a generative model of articulated 3D animal motions from raw, unlabeled online videos. Unlike existing approaches for 3D motion synthesis, our model requires no pose annotations or parametric shape models for training; it learns purely from a collection of unlabeled web video clips, leveraging semantic correspondences distilled from self-supervised image features. At the core of our method is a video Photo-Geometric Auto-Encoding framework that decomposes each training video clip into a set of explicit geometric and photometric representations, including a rest-pose 3D shape, an articulated pose sequence, and texture, with the objective of re-rendering the input video via a differentiable renderer. This decomposition allows us to learn a generative model over the underlying articulated pose sequences akin to a Variational Auto-Encoding (VAE) formulation, but without requiring any external pose annotations. At inference time, we can generate new motion sequences by sampling from the learned motion VAE, and create plausible 4D animations of an animal automatically within seconds given a single input image.

Keywords: 3D animal motion · 4D generation · Unsupervised learning

^∗Equal contribution. ^†Equal advising.

1 Introduction

We share the planet with a wide variety of lively animals. Like humans, they navigate and interact with the physical world, demonstrating various sophisticated motion patterns. In fact, the first film in history, “The Horse in Motion,” was a sequence of photographs that captured a galloping horse, created by Eadweard Muybridge in 1878 [52]. Films capture only sequences of 2D projections of 3D animal movements. Going further and modeling dynamic animals in 3D is not only useful for numerous mixed reality and content creation applications, but also provides computational tools for biologists to study animal behaviors.

Figure 1: Learning 3D Animal Motions from Unlabeled Online Videos. Given a collection of monocular videos of an animal category sourced from the Internet as training data, our method learns a *generative* model of the articulated 3D motions together with a monocular 3D reconstruction model, without relying on any shape templates or pose annotations. At inference time, the model generates new 3D motion sequences and turns a single test image into 4D animations fully automatically.

While considerable effort has been invested in capturing and modeling 3D human motions using computer vision techniques, significantly less attention has been paid to animals. Existing learning-based approaches require an extensive amount of 3D scans [49, 63, 64], parametric shape models [9, 33, 35, 94, 61], multi-view videos [45, 26, 21], or geometric annotations, such as keypoints [24, 27, 23, 59, 61, 60, 69], as supervision for training. Collecting large-scale 3D training data involves specialized capture devices and intensive labor, which can only be justified for specific objects, like humans, that are of utmost value in applications.

In this work, we would like to learn a generative model of the 3D motions of an animal category, which will allow us to sample new 3D motion sequences and generate 4D animations fully automatically within seconds in a feedforward fashion. Crucially, unlike existing 3D motion synthesis approaches on human bodies [27, 23, 59, 60, 32, 85, 97], we do not rely on explicit manual supervision for training, such as keypoints or template shapes. Instead, we propose to learn this 3D motion generative model purely from raw, unlabeled videos sourced from the Internet. This task is also different from video synthesis methods [78, 67, 11] that operate purely on 2D images. We would like to obtain an explicit 3D motion representation, in the form of a 3D mesh and a sequence of articulated 3D poses, which can easily facilitate downstream applications, including fine-grained controllable 3D animation and motion pattern analysis.

Learning 3D motions from unstructured online video collections is an extremely ill-posed task, as each video clip depicts only a short sequence of 2D projections of a unique 4D instance, with unique shape, appearance, motion, and viewpoint that are not assumed to reappear in another clip. This task, therefore, requires registering these unique video clips in a single canonical 3D model to learn a distribution of the underlying 3D motions of the animals. To address this challenge, we take advantage of recent advancements in self-supervised image representation learning [12], and distill semantic correspondences across different instances from self-supervised image features produced by a pre-trained DINO-ViT [12]. Furthermore, we assume a coarse description of the motion skeleton of the animal, e.g., “quadruped,” which effectively constrains the space of deformation akin to Non-Rigid Structure-from-Motion [10] and provides a succinct representation for modeling the 3D motion.

Building on top of these insights, we design a video Photo-Geometric Auto-Encoding framework for learning 3D motion generative models from unlabeled videos. At its core is a spatio-temporal transformer that automatically decomposes a video clip into a set of geometric and photometric factors, including a rest-pose 3D mesh, appearance, viewpoint, and a motion latent code that encapsulates the 3D motion of the instance. This motion latent code is then decoded into a sequence of articulated 3D poses, which are used to animate the rest-pose mesh and re-render a 2D video clip using a differentiable renderer. This allows us to train the entire model end-to-end like a “Variational Auto-Encoder” (VAE) over the space of articulated 3D motions, using only 2D image reconstruction losses on the RGB frames, DINO features, and object masks, with pseudo-ground-truth masks obtained from off-the-shelf detectors [38].

At inference time, we can generate new 3D motion sequences by sampling from the motion VAE latent space. If further given a single image of an animal, our model can reconstruct its articulated 3D shape and appearance in a feed-forward fashion, and generate 4D animations fully automatically within seconds.

To summarize, this paper makes several contributions:

• We propose a new method for learning a generative model of articulated 3D animal motions from unlabeled Internet videos, without any shape templates or pose annotations;

• We design a spatio-temporal transformer architecture that effectively extracts motion information from input video clips into a latent VAE;

• At inference time, the model generates diverse 3D motion sequences and turns a single image into 4D animations automatically in seconds.

2 Related Work

2.0.1 Learning 3D Animals from Image Collections.

While modeling dynamic 3D objects traditionally requires motion capture markers or simultaneous multi-view captures [25, 1, 18], recent learning-based approaches have demonstrated the possibility of learning 3D deformable models simply from raw single-view image collections [34, 82, 44, 92, 80, 93, 48, 70]. Most of these methods require additional geometric supervision besides object masks for training, such as keypoint [34, 43] and viewpoint annotations [68, 56, 19], template shapes [22, 40, 39], semantic correspondences [44, 92, 80, 91, 31], and strong geometric assumptions like symmetries [82, 81, 83] and viewpoint distributions [54, 65, 55, 15, 16, 73, 74]. Among these, MagicPony [80] demonstrates impressive results in learning articulated 3D animals, such as horses, using only single-view images with object masks and self-supervised image features as training supervision. However, it reconstructs static images individually, ignoring the dynamic motions of the 3D animals underlying those images. In this work, we focus on learning a generative model of 3D animal motions from videos instead of independent images.

2.0.2 Deformable Shapes from Monocular Videos.

Reconstructing deformable shapes from monocular videos is a long-standing problem in computer vision. Early approaches with Non-Rigid Structure from Motion (NRSfM) reconstruct deformable shapes from 2D correspondences, by incorporating heavy constraints on the motion patterns [10, 84, 3, 17, 13]. DynamicFusion [53] further integrates additional depth information from depth sensors. NRSfM pipelines have recently been revived with neural representations. In particular, LASR [86] and its follow-ups [87, 83, 88, 89] optimize deformable 3D shapes over a small set of monocular videos, leveraging 2D optical flows in a heavily engineered optimization procedure. DOVE [79] proposes a learning-based framework that learns a category-specific single-image 3D reconstruction model from a monocular video collection. Despite using video data for training, none of these approaches explicitly model the generative distribution of temporal motions of the objects.

2.0.3 Motion Analysis and Synthesis.

Modeling motion patterns of dynamic objects has important applications for both behavior analysis and content generation, and is instrumental to our visual perception system [6]. Computational techniques have been used for decades to study and synthesize human motions [7, 58, 76]. In particular, recent works have explored learning generative models for 3D human motions [46, 2, 51, 27, 23, 59, 60, 69, 36], leveraging parametric human shape models, like SMPL [49], and large-scale human pose annotations [30, 5]. In comparison, much less effort has been invested in modeling animal motions. Huang et al. [28] propose a hierarchical motion learning framework for animals, but it requires costly motion capture data and hardly generalizes to animals in the wild. To sidestep the collection of 3D data, BKinD [72] introduces a self-supervised method for discovering and tracking keypoints from videos, but is limited to a 2D representation. Such 2D keypoints could be lifted to 3D [71, 36], but this requires multi-view videos or ground-truth keypoints for training. Unlike these prior works, our motion learning framework does not require any pose annotations or multi-view videos for training, and is trained simply using raw monocular online videos. The recent success of image diffusion models has also led to promising generic 4D generation models [62, 96, 47, 95, 8]. However, the 3D motions generated by these models are still very limited in terms of quality and diversity, as shown in the comparisons in Section 4.2.2.

3 Method

Given a collection of raw video clips of an animal category, such as horses, our goal is to learn a generative model of its articulated 3D motions. This allows us to sample 3D motion sequences from a learned latent space, and generate 4D animations of a new animal instance automatically given only a single 2D image at test time. We train this model simply on raw online videos without relying on any external pose annotations. To do so, we design a video photo-geometric auto-encoding framework that decomposes each training video clip into a rest-pose 3D mesh, appearance, camera viewpoint, as well as a sequence of articulated 3D poses. This allows us to learn a generative model over the underlying articulated 3D pose sequences akin to a motion “Variational Auto-Encoder”, but simply using the objective of re-rendering the input frames with a differentiable renderer. Figure 2 gives an overview of the training pipeline.

3.1 Modeling Articulated 3D Animal Motions

Each video clip records a 2D image sequence $\{I_{t}\}_{t=1}^{T}$ of the underlying 3D animal motion from one camera trajectory. Since the dataset is obtained from casually-recorded Internet videos, these training clips have diverse unique motion sequences. In order to learn the distribution of the underlying animal motions from such unstructured video collections, we first need to devise a 3D representation that registers these dynamic 2D sequences onto a canonical 3D model, factoring out the 3D motion of each video instance.

Drawing inspiration from prior work on 3D human motion synthesis [49, 32, 97, 85], we leverage a category-specific skinned model to represent the deformable 3D shape of the animals, and further learn the motion distribution over the articulations of its underlying skeleton. To this end, we follow MagicPony [80] and assume a coarse description of the skeleton, e.g., “quadruped”.

Specifically, we represent the category-specific base 3D shape using a Signed Distance Function (SDF) parametrized by a coordinate Multi-Layer Perceptron (MLP), and extract an explicit mesh on the fly using Differentiable Marching Tetrahedra (DMTet) [66]. Let $V_{\text{base}}\in\mathbb{R}^{K\times 3}$ denote the list of $K$ vertices, with the triangle faces given by the triplets $F\subset\{1,\dots,K\}^{3}$. To model the slight shape variation of each animal instance in the canonical pose, we further learn an image-conditioned deformation field $f_{\Delta V}$, parametrized by another MLP, that predicts a small deformation of each vertex $\Delta V_{\text{ins},i}=f_{\Delta V}(V_{\text{base},i},\phi)$, where $\phi=f_{\phi}(I)$ is a feature vector obtained from an image $I$ using a pre-trained DINO-ViT [12], and $i\in\{1,\cdots,K\}$ denotes the vertex index. The instance mesh is then given by $V_{\text{ins}}=V_{\text{base}}+\Delta V_{\text{ins}}$. Both the base shape $V_{\text{base}}$ and the instance deformation $\Delta V_{\text{ins}}$ are enforced to be bilaterally symmetric about the $yz$-plane by mirroring the query locations in the underlying MLPs.

To account for the temporal motions driven by the underlying bone structure, we then instantiate a quadrupedal skeleton in this instance shape using a simple heuristic: a chain of bones going through the two farthest end points along the $z$-axis, and four legs branching out from the body bone to the lowest point in each $xz$-quadrant. The motion sequence is thus parametrized by a sequence of articulated poses $\xi=\{\xi_{t}\}_{t=1}^{T}$, where each pose $\xi_{t}$ at a timestamp $t$ consists of a rigid pose $\xi_{t,1}\in SE(3)$ w.r.t. an identity camera pose and the rotation $\xi_{t,b}\in SO(3)$ of each bone $b=2,...,B$ in the skeleton. These articulated poses are applied to the instance mesh $V_{\text{ins}}$ to obtain the final posed shape sequence using the widely used linear blend skinning $g(V_{\text{ins}},\xi_{t})$ [49]. More details are included in the supplementary material.
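As a rough illustration of the skinning step, the following NumPy sketch poses rest-pose vertices with per-bone rotations. It is a generic linear blend skinning routine, not the paper's implementation: the skinning weights, joint locations, parent list, and the names `rodrigues` and `linear_blend_skinning` are assumptions introduced here, and the root's rigid pose is treated like any other bone rotation for brevity.

```python
import numpy as np

def rodrigues(axis_angle):
    """Convert an axis-angle vector (3,) into a 3x3 rotation matrix."""
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-8:
        return np.eye(3)
    k = axis_angle / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def linear_blend_skinning(vertices, weights, rotations, joints, parents):
    """Pose rest-pose vertices with per-bone rotations.

    vertices:  (K, 3) rest-pose vertex positions
    weights:   (K, B) skinning weights, each row summing to 1 (assumed given)
    rotations: (B, 3, 3) rotation of each bone about its rest-pose joint
    joints:    (B, 3) rest-pose joint location of each bone
    parents:   length-B list of parent indices (-1 for the root);
               parents are assumed to precede their children
    """
    num_bones = len(parents)
    world = np.zeros((num_bones, 4, 4))
    for b in range(num_bones):
        local = np.eye(4)
        local[:3, :3] = rotations[b]
        # Rotate about the joint: x -> R (x - j) + j
        local[:3, 3] = joints[b] - rotations[b] @ joints[b]
        world[b] = local if parents[b] < 0 else world[parents[b]] @ local
    homo = np.concatenate([vertices, np.ones((len(vertices), 1))], axis=1)
    per_bone = np.einsum("bij,kj->kbi", world, homo)[..., :3]  # (K, B, 3)
    # Blend the per-bone transformed positions with the skinning weights.
    return np.einsum("kb,kbi->ki", weights, per_bone)
```

With identity rotations the function returns the rest-pose vertices unchanged, which is a convenient sanity check before plugging in predicted poses.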

The appearance of the instance is modeled using a texture field parametrized by an MLP $f_{\text{a}}(\mathbf{x},\phi)\in[0,1]^{3}$ where $\mathbf{x}$ is a 3D location. We then render the posed mesh sequence into a sequence of RGB images using deferred mesh rendering [80], querying $f_{\text{a}}$ at the corresponding 3D locations of the pixels after rasterization.

In the following, we explain the learning formulation to learn the individual components, including $V_{\text{base}}$ , $f_{\Delta V}$ , $f_{\text{a}}$ , and most importantly, a generative model $f_{\xi}$ over the motion sequences $\xi$ , purely from an unstructured video collection without external pose annotations.

3.2 Video Photo-Geometric Auto-Encoding

Unlike human motion synthesis, we do not have access to large-scale, high-quality 3D captures or pose annotations for most animal species. Hence, we must instead learn from raw Internet videos, which poses significant challenges. To this end, we design a video Photo-Geometric Auto-Encoding framework that deconstructs each training clip into the explicit photometric and geometric factors described in Section 3.1, and train the entire pipeline using the objective of re-rendering the video. At the center of this video auto-encoding pipeline is a generative model of articulated motion sequences, akin to a “Variational Auto-Encoder” (VAE), but learned purely from raw RGB frames. This is very different from simply training a conventional VAE directly in the pose sequence space, which would require explicit pose annotations in the first place.

3.2.1 Video Encoding.

To predict the instance shape deformation $\Delta V_{\text{ins}}$ and appearance of the object, we extract a feature vector $\phi_{t}$ for each frame of the video using a pre-trained DINO-ViT [12] with frozen weights, as mentioned previously. We assume the instance shape and appearance remain the same throughout the video, and hence take the average image features across all frames, denoted as $\bar{\phi}$ , when querying the MLPs, $f_{\Delta V}$ and $f_{\text{a}}$ .

In order to extract the motion information more effectively from the input video clip, we design a pair of spatial and temporal transformer-based motion encoders, $E_{\text{s}}$ and $E_{\text{t}}$ , that aggregate a set of bone-specific local features first spatially across each frame and then temporally across the entire sequence, eventually obtaining the distribution parameters $\hat{\mu}$ and $\hat{\Sigma}$ of the motion latent VAE.

Specifically, given each frame $I_{t}$ in the input clip, we first construct a bone-specific feature descriptor $\nu_{t,b}=(\phi_{t},\Phi_{t}(\mathbf{u}_{t,b}),b,\mathbf{J}_{b},\mathbf{u}_{t,b})$ for each bone $b=2,...,B$ and each timestamp $t$. Here, $\phi_{t}$ denotes the same global image feature as before. $\mathbf{J}_{b}$ denotes the 3D location of the center of the bone $b$ at rest-pose, which projects to the pixel location $\mathbf{u}_{t,b}$ in the image space, given the rigid pose $\hat{\xi}_{t,1}$ predicted separately. In addition to the global feature $\phi_{t}$, we also sample an auxiliary bone-specific local feature vector $\Phi_{t}(\mathbf{u}_{t,b})$ from the DINO-ViT key token map $\Phi_{t}$ at the projected pixel location $\mathbf{u}_{t,b}$.

The spatial transformer encoder $E_{\text{s}}$ then fuses these bone-specific feature descriptors $\{\nu_{t,b}\}_{b=2}^{B}$ into a single feature vector $\nu_{t,*}$ summarizing the articulated pose of the animal in each frame $t$ :

$$\nu_{t,*}=E_{\text{s}}(\nu_{t,2},\cdots,\nu_{t,B}). \tag{1}$$

In practice, we prepend a learnable token to the list of descriptors, and take the first output token of the transformer as the pose feature $\nu_{t,*}$ . We call this $E_{\text{s}}$ a spatial transformer as it extracts the spatial geometric features in each input frame that capture the pose information, conditioned on the given skeleton.

Next, we design a second temporal transformer encoder $E_{\text{t}}$, inspired by [59], which operates along the temporal dimension and maps the entire sequence of pose features $\{\nu_{t,*}\}_{t=1}^{T}$ into the motion latent space. Similarly to $E_{\text{s}}$, $E_{\text{t}}$ fuses the pose feature sequence to predict the VAE distribution parameters:

$$(\hat{\mu},\hat{\Sigma})=E_{\text{t}}(\nu_{1,*},\cdots,\nu_{T,*}). \tag{2}$$

Using the reparametrization trick [37], we then sample a latent code from the Gaussian distribution $z\sim\mathcal{N}(\hat{\mu},\hat{\Sigma})$ , which will be decoded into a sequence of articulated poses $\{\hat{\xi}_{t}\}_{t=1}^{T}$ characterizing the 3D motion of the animal in the clip.
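The sampling step can be sketched as follows. This is a minimal NumPy illustration of the reparametrization trick, assuming a diagonal covariance parametrized by its log-variance (a common choice, not stated in the text); `sample_latent` is a hypothetical helper name.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, diag(exp(log_var))) as a deterministic function of noise.

    Writing z = mu + sigma * eps with eps ~ N(0, I) keeps the sample
    differentiable with respect to mu and log_var in an autodiff framework.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Usage: one 256-dimensional motion code from predicted parameters.
rng = np.random.default_rng(0)
z = sample_latent(np.zeros(256), np.zeros(256), rng)
```

At inference time the same function can be called with zero mean and zero log-variance, which amounts to sampling from the standard Gaussian prior.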

3.2.2 Motion Decoding.

Symmetric to the motion encoders, the motion decoder also consists of a temporal decoder $D_{\text{t}}$ that first decodes $z$ into a sequence of pose features $\{z_{t}\}_{t=1}^{T}$ , and a spatial decoder $D_{\text{s}}$ that further decodes each pose feature $z_{t}$ to a set of bone rotations $\{\hat{\xi}_{t,b}\}_{b=2}^{B}$ .

Specifically, we query the temporal transformer decoder $D_{\text{t}}$ with a sequence of timestamps $\mathcal{T}$ , and use $z$ as both the key token and the value token to obtain a sequence of pose features:

$$(z_{1},\cdots,z_{T})=D_{\text{t}}(\mathcal{T},z),\quad\mathcal{T}=(1,\cdots,T). \tag{3}$$

Similarly, given each pose feature $z_{t}$ , we then query the spatial transformer decoder $D_{\text{s}}$ with a sequence of bone indices $\mathcal{B}$ to produce the bone rotations:

$$(\hat{\xi}_{t,2},\cdots,\hat{\xi}_{t,B})=D_{\text{s}}(\mathcal{B},z_{t}),\quad\mathcal{B}=(2,\cdots,B). \tag{4}$$

In practice, the rigid pose $\hat{\xi}_{t,1}$ is predicted by a separate network and is not modeled by this motion VAE, since it is entangled with arbitrary camera motions that are difficult to disentangle in dynamic scenes.

We then deform the predicted instance mesh $\hat{V}_{\text{ins}}$ using this articulated pose sequence $\{\hat{\xi}_{t}\}_{t=1}^{T}$ with the skinning equation $\hat{V}_{t}=g(\hat{V}_{\text{ins}},\hat{\xi}_{t})$, and render the RGB frames $\{\hat{I}_{t}\}_{t=1}^{T}$ and masks $\{\hat{M}_{t}\}_{t=1}^{T}$ using a differentiable renderer [42].

3.3 Learning Formulation

3.3.1 Video Re-rendering Losses.

We train the entire model by minimizing the reconstruction losses on the object masks $\hat{M}_{t}$ and RGB frames $\hat{I}_{t}$ :

$$\mathcal{L}_{\text{m},t}=\|\hat{M}_{t}-M_{t}\|_{2}^{2}+\lambda_{\text{dt}}\|\hat{M}_{t}\odot\texttt{dt}(M_{t})\|_{1},\quad\mathcal{L}_{\text{im},t}=\|\tilde{M}_{t}\odot(\hat{I}_{t}-I_{t})\|_{1}, \tag{5}$$

where the distance transform $\texttt{dt}(\cdot)$ is used in the second term of the mask loss with a weight $\lambda_{\text{dt}}$ for more effective gradients [34, 81, 80], and $\odot$ denotes the Hadamard product. The RGB loss is only computed inside the intersection of the predicted and ground-truth masks $\tilde{M}_{t}=\hat{M}_{t}\odot M_{t}$. To exploit the temporal consistency of the motion in the videos, we further enforce a temporal smoothness constraint between the predicted poses $\hat{\xi}_{t}$ of consecutive frames: $\mathcal{R}_{\text{temp}}=\sum_{t=2}^{T}\|\hat{\xi}_{t}-\hat{\xi}_{t-1}\|_{2}^{2}$. We also inherit the multi-hypothesis viewpoint prediction mechanism with the hypothesis loss $\mathcal{L}_{\text{hyp}}$ and the shape regularizers $\mathcal{R}_{\text{shape}}=\lambda_{\text{Eik}}\mathcal{R}_{\text{Eik}}+\lambda_{\text{art}}\mathcal{R}_{\text{art}}+\lambda_{\text{def}}\mathcal{R}_{\text{def}}$ [80] with balancing weights $\lambda$, which include the Eikonal constraint $\mathcal{R}_{\text{Eik}}$ on the SDF MLP for the base shape, and magnitude regularizers $\mathcal{R}_{\text{art}}$ on the bone rotations $\hat{\xi}_{2:B}$ and $\mathcal{R}_{\text{def}}$ on the vertex deformations $\Delta V_{\text{ins}}$.
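The per-frame mask and RGB losses and the temporal smoothness term can be sketched in NumPy as below. Several details are assumptions made for illustration: the distance transform is taken to be the Euclidean distance to the nearest foreground pixel of the ground-truth mask (computed by brute force, so suitable only for small images), norms are unnormalized sums, poses are flat vectors, and all function names are hypothetical.

```python
import numpy as np

def distance_transform(mask):
    """Euclidean distance from each pixel to the nearest foreground pixel.

    Brute force, for illustration only; assumes at least one foreground pixel.
    """
    h, w = mask.shape
    ys, xs = np.nonzero(mask > 0.5)
    fg = np.stack([ys, xs], axis=1)                               # (N, 2)
    grid = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1)
    diff = grid[:, :, None, :] - fg[None, None, :, :]             # (H, W, N, 2)
    return np.sqrt((diff ** 2).sum(-1)).min(-1)

def mask_loss(pred_mask, gt_mask, lambda_dt=1.0):
    """Squared L2 mask error plus a distance-transform penalty on stray predictions."""
    l2 = ((pred_mask - gt_mask) ** 2).sum()
    dt = np.abs(pred_mask * distance_transform(gt_mask)).sum()
    return l2 + lambda_dt * dt

def image_loss(pred_img, gt_img, pred_mask, gt_mask):
    """L1 RGB error inside the intersection of predicted and ground-truth masks."""
    inter = (pred_mask * gt_mask)[..., None]                      # (H, W, 1)
    return np.abs(inter * (pred_img - gt_img)).sum()

def temporal_smoothness(poses):
    """Sum of squared differences between poses of consecutive frames; poses: (T, D)."""
    return ((poses[1:] - poses[:-1]) ** 2).sum()
```

The distance-transform term is zero wherever the predicted mask stays inside the ground-truth silhouette and grows with the distance of any stray prediction, which gives a useful gradient even when the two masks do not overlap.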

3.3.2 Semantic Correspondences.

Instead of relying on external pose annotations or prior shape models to learn the 3D model from monocular videos, we seek a much cheaper alternative solution for establishing correspondences across different instances. We distill semantic correspondences from self-supervised image features, such as DINO [12]. As shown in prior work [4, 80, 92], after a simple PCA reduction, these image features reveal robust part-level correspondences across different instances with varying poses and appearance. To exploit these correspondences, we additionally optimize a feature field in the canonical space using a coordinate MLP $\psi(\mathbf{x})\in\mathbb{R}^{D}$, which is rendered into a 2D feature image $\hat{\Phi}_{t}\in\mathbb{R}^{D\times H\times W}$ given the posed mesh $\hat{V}_{t}$, with the same procedure as rendering the appearance of the object described above. We then encourage this rendered feature map $\hat{\Phi}_{t}$ to match the feature map $\Phi^{\prime}_{t}$ pre-extracted from the input frame $I_{t}$ using DINO-ViT with PCA reduction: $\mathcal{L}_{\text{feat},t}=\|\tilde{M}_{t}\odot(\hat{\Phi}_{t}-\Phi^{\prime}_{t})\|_{2}^{2}$. Intuitively, this forces the model to establish correspondences across all training video instances through the same canonical feature field, hence disentangling the shape and pose in each monocular frame.
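The PCA reduction and the masked feature loss might look like the following NumPy sketch. The feature shapes, the SVD-based PCA, and the names `pca_reduce` and `feature_loss` are illustrative assumptions; the actual DINO-ViT feature extraction and rendering are not shown.

```python
import numpy as np

def pca_reduce(features, dim):
    """Project (N, C) feature vectors onto their top-`dim` principal components."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T

def feature_loss(rendered, target, mask):
    """Squared L2 error between (D, H, W) feature maps inside an (H, W) mask."""
    return ((mask[None] * (rendered - target)) ** 2).sum()
```

In practice the PCA basis would be fitted once over features pooled from many training frames so that all instances share the same reduced feature space.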

3.3.3 Motion VAE.

Similarly to the conventional VAE, we also minimize the Kullback–Leibler (KL) divergence between the learned motion latent distribution and a standard Gaussian distribution:

$$\mathcal{L}_{\text{KL}}=\sum_{i}-\frac{1}{2}\left(\log\sigma_{i}^{2}-\sigma_{i}^{2}-\mu_{i}^{2}+1\right), \tag{6}$$

where $\mu_{i}$ and $\sigma_{i}^{2}$ are elements of the predicted distribution parameters $\hat{\mu}$ and $\hat{\Sigma}$ .
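For reference, the closed-form KL divergence between a diagonal Gaussian and a standard Gaussian, written in terms of the variance $\sigma_{i}^{2}$, can be computed as in this sketch. Parametrizing the variance by its logarithm is an assumption made here for numerical convenience, and `kl_to_standard_normal` is a hypothetical name.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
```

The value is zero exactly when the predicted distribution already equals the standard Gaussian, and positive otherwise.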

3.3.4 Training Schedule.

As learning 3D articulated motions from unstructured video clips without labels is extremely ill-posed, we devise a two-stage schedule for robust and efficient training. In the first stage, we pre-train the monocular 3D reconstruction model using a single-image pose predictor $\tilde{\xi}_{t}=f_{\xi}^{\text{sin}}(\phi_{t})$ . Inspired by but unlike [80], we train this model to re-render entire video clips with the temporal smoothness constraint $\mathcal{R}_{\text{temp}}$ and temporal feature averaging $\bar{\phi}$ , rather than independent images. The total loss in the first stage is given by:

$$\mathcal{L}_{\text{vid}}=\sum_{t=1}^{T}\left(\mathcal{L}_{\text{recon},t}+\lambda_{\text{h}}\mathcal{L}_{\text{hyp},t}+\lambda_{\text{s}}\mathcal{R}_{\text{shape},t}\right)+\lambda_{\text{t}}\mathcal{R}_{\text{temp}}, \tag{7}$$

where $\mathcal{L}_{\text{recon},t}=\mathcal{L}_{\text{im},t}+\lambda_{\text{m}}\mathcal{L}_{\text{m},t}+\lambda_{\text{f}}\mathcal{L}_{\text{feat},t}$ summarizes the reconstruction losses on each frame. After this stage, we obtain an accurate monocular 3D reconstruction model, which outperforms the baseline [80] as shown in Table 4, largely owing to the training on videos instead of independent images. More importantly, the model has now learned a reasonable space of articulated poses, on top of which learning a motion generative model is much more efficient.

In the second stage, we replace the monocular pose predictor $f_{\xi}^{\text{sin}}$ with the spatio-temporal transformer-based motion VAE $f_{\xi}$ detailed in Section 3.2, which encodes the entire video clip and generates the entire sequence of articulated poses at once. Empirically, training the motion VAE from scratch with an expensive rendering step in the loop is inefficient. To improve training efficiency, we reuse the pose predictions $\tilde{\xi}_{t}$ from the first stage to guide the predictions of the VAE decoder $\hat{\xi}_{t}$ using a teacher loss $\mathcal{L}_{\text{teacher}}=\sum_{t=1}^{T}\|\hat{\xi}_{t}-\tilde{\xi}_{t}\|_{2}^{2}$. The final training objective for the second stage is thus:

$$\mathcal{L}=\mathcal{L}_{\text{vid}}+\lambda_{\text{KL}}\mathcal{L}_{\text{KL}}+\lambda_{\text{teacher}}\mathcal{L}_{\text{teacher}}. \tag{8}$$

3.3.5 3D Motion Generation.

During inference time, we can generate diverse 3D motion sequences by sampling from the learned motion VAE latent space. Furthermore, when given a single 2D image of a new animal instance unseen at training, our model can reconstruct its 3D shape and appearance in a feed-forward manner, and generate 4D animations fully automatically within a few seconds, as illustrated in Figure 3.

Table 1: Statistics of the AnimalMotion Dataset. We collect a new animal video dataset containing a total of 82.6k frames for 4 different animal species.

Category	# Sequences	Total Length	# Frames
Horse	640	28’09”	50,682
Zebra	47	5’27”	9,822
Giraffe	60	4’52”	8,768
Cow	69	7’25”	13,359
Total	816	45’54”	82,631

4 Experiments

4.1 Experimental Setup

4.1.1 Datasets.

To train our model, we collected an AnimalMotion dataset consisting of video clips of several quadruped animal categories extracted from the Internet. The statistics of the dataset are summarized in Table 1. As pre-processing, we first detect and segment the animal instances in the videos using the off-the-shelf segmentation model of PointRend [38]. To remove occlusion between different instances, we calculate the extent of mask overlap in each frame and exclude crops where two or more masks overlap with each other. We further apply a smoothing kernel to the sequence of bounding boxes to avoid jittering. The non-occluded instances are then cropped and resized to $256\times 256$. The original videos are all at 30fps. To ensure sufficient motion in each sequence, we remove frames with minimal motion, measured by the magnitude of optical flows within the instance mask estimated from RAFT [75]. To conduct quantitative evaluations and comparisons, we also use PASCAL VOC [20], which contains $108$ images of horses, and APT-36K [90], which contains $81$ video clips of horses, each consisting of $15$ frames. Both datasets provide 2D keypoint annotations for each animal in the image, allowing us to evaluate the geometric accuracy of the reconstructed shapes and generated motions.
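The pre-processing filters described above could be sketched as follows. This is only an illustration under stated assumptions: the moving-average kernel, its size, the motion threshold, and all function names are hypothetical, and the masks and flow magnitudes are assumed to come from the external segmentation and optical flow models.

```python
import numpy as np

def has_overlap(masks):
    """True if any pixel is covered by two or more (N, H, W) boolean instance masks."""
    return bool((masks.astype(int).sum(axis=0) >= 2).any())

def smooth_boxes(boxes, kernel_size=5):
    """Moving-average smoothing of a (T, 4) box sequence; kernel_size is assumed odd."""
    pad = kernel_size // 2
    padded = np.pad(boxes, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(kernel_size) / kernel_size
    cols = [np.convolve(padded[:, c], kernel, mode="valid") for c in range(boxes.shape[1])]
    return np.stack(cols, axis=1)

def select_moving_frames(flow_mag, masks, min_motion):
    """Indices of frames whose mean flow magnitude inside the mask exceeds min_motion.

    flow_mag: (T, H, W) flow magnitudes; masks: (T, H, W) boolean instance masks.
    """
    area = np.maximum(masks.sum(axis=(1, 2)), 1)
    inside = (flow_mag * masks).sum(axis=(1, 2)) / area
    return np.nonzero(inside > min_motion)[0]
```

Edge padding keeps the smoothed sequence the same length as the input, so each frame retains a bounding box after filtering.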

4.1.2 Implementation Details.

The encoders and decoders of the motion VAE model ($E_{\text{s}}$, $E_{\text{t}}$, $D_{\text{s}}$, $D_{\text{t}}$) from Section 3.2 are implemented as stacked transformers [77] with 4 transformer blocks and a latent dimension of 256. We use a sinusoidal function for positional encoding following [59]. For the remaining architectures, we build our implementation on top of [80]. We train the model for $120$ epochs for the first stage, which takes roughly $10$ hours on $8$ A6000 GPUs, and another $180$ epochs for the second stage, which takes another $48$ hours. We use a sequence length of $T=10$ for training. During inference, we can generate longer sequences by connecting multiple samples and optimizing transition latent codes for smooth interpolation. For visualization, following prior work [80], we finetune only the appearance network $f_{\text{a}}$ for $100$ iterations on each test image, taking less than 10 seconds, as the model struggles to predict detailed texture in a single feedforward pass. More details are included in the supplementary material.
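A sinusoidal positional encoding of the kind commonly used with transformers can be sketched as below. The base frequency of 10000 and the interleaved sine/cosine layout follow the standard formulation and are assumptions here; the exact variant used in the implementation is not specified in the text.

```python
import numpy as np

def sinusoidal_encoding(positions, dim):
    """Standard sinusoidal encoding of integer positions; dim is assumed even."""
    positions = np.asarray(positions, dtype=float)[:, None]           # (T, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)      # (dim / 2,)
    enc = np.zeros((len(positions), dim))
    enc[:, 0::2] = np.sin(positions * freqs)
    enc[:, 1::2] = np.cos(positions * freqs)
    return enc

# Usage: encode the timestamps 1..T for T = 10 with a 256-dimensional latent.
timestamp_queries = sinusoidal_encoding(np.arange(1, 11), 256)
```

The same function can encode either timestamps or bone indices, which is one plausible way to form the query sequences of the temporal and spatial decoders.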

4.2 3D Motion Generation

4.2.1 Qualitative Results.

After training, we can generate 3D motion sequences by sampling from the motion VAE latent space, and render 4D animations with the textured mesh reconstructed from a single 2D image, as shown in Figure 3. The model also generalizes to horse-like artifacts, such as carousel horses, which it has never seen during training. The model can be trained on a wide range of animal species besides horses, including giraffes, zebras and cows, capturing category-specific prior distributions of 3D motions, as shown in Figure 5. Because the datasets for these categories are limited in size and diversity, as in [80], in the first stage of the training, we fine-tune from the model trained on horses. Additional animation results are provided in the supplementary video.

Table 2: Quantitative Comparison with State-of-the-Art Motion Generative Models.

Method	Motion Strength	User Preference
4D-fy [8]	0.29	112 (17.0%)
Ponymation (Ours)	4.66	548 (83.0%)

4.2.2 Comparison with Existing Methods.

Our method is the first to learn a generative model of 3D animal motions from raw videos without pose annotations or prior shape models. We compare with one of the most recent 4D generative models, 4D-fy [8], which has publicly released code. Specifically, we provide the model with a list of prompts, which are enriched by ChatGPT [57] from a list of basic prompts describing horse motions, such as “a horse is running/walking/jumping/eating” (the full list of prompts is included in the supplementary material). We generate $20$ 4D instances from 4D-fy, and $20$ from our method (without text condition). Note that it takes $12$ hours to generate one 4D-fy instance on one GPU, whereas our model generates 4D animations within a few seconds in a single forward pass. We first compute the Motion Strength to assess the motion magnitude of the generated videos. We use FlowFormer [29] to estimate optical flow strengths between consecutive frames of a generated video, and then compute the average of the largest $5$% optical flows as the Motion Strength. We also conduct a user study: we present the results of the two methods in random pairs side by side to $33$ participants, and ask them to select the one that shows “a more plausible 3D horse motion sequence”. As reported in Table 2, our method attains a much higher Motion Strength ($4.66$ vs. $0.29$), and users preferred the 4D instances generated by our method over 4D-fy $83.0$% of the time. We show a visual comparison in Figure 4. Notably, 4D-fy produces nearly static animals without perceptible motions despite heavy prompt engineering, whereas our method generates much more plausible motion sequences.
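One way to compute the Motion Strength from precomputed flow fields is sketched below. The text does not say whether the top 5% is taken per frame pair or over the whole video; this sketch assumes the magnitudes are pooled over all pixels and frame pairs, and `motion_strength` is a hypothetical name. The flow fields themselves would come from an external optical flow estimator.

```python
import numpy as np

def motion_strength(flows, top_fraction=0.05):
    """Mean of the largest `top_fraction` of flow magnitudes.

    flows: (T - 1, H, W, 2) optical flow between consecutive frames of one video.
    """
    mags = np.linalg.norm(flows, axis=-1).ravel()
    k = max(1, int(round(top_fraction * mags.size)))
    # np.partition places the k largest values in the last k slots (unordered).
    top = np.partition(mags, mags.size - k)[-k:]
    return float(top.mean())
```

Averaging only the largest magnitudes focuses the score on the moving parts of the animal rather than the static background.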

4.2.3 Quantitative Evaluation.

Further assessing the quality of the generated 3D motions quantitatively is difficult due to the lack of (1) ground-truth measurements of 3D animal motions, and (2) robust evaluation metrics for generative models. To evaluate and compare different variants of our model, we design a new metric, bi-directional Motion Chamfer Distance (MCD), computed between a set of generated motion sequences projected to 2D image space and a set of 2D keypoint sequences annotated from videos in APT-36K [90]. Since the skeleton automatically discovered by our model is different from the 17 keypoints annotated in APT-36K, we first perform 3D reconstruction on all the images in APT-36K, and optimize a linear transformation that maps the 2D projections of the predicted 3D joints to the annotated 2D keypoints following [34]. To compute MCD, we generate $1,400$ random motion sequences by sampling from the learned motion VAE, each consisting of $10$ frames of 3D articulated poses. We then project these generated 3D poses to 2D using the viewpoints estimated from APT-36K, and apply the previously optimized transformation to align with the annotated keypoints. For each annotated keypoint sequence in the test set, we find the closest generated motion sequence measured by keypoint MSE averaged across all frames, and vice versa for each generated sequence. We then compute MCD based on the MSE between the closest sequence pairs. In essence, MCD measures the fidelity of generated motions by comparing the sampled distribution to that of the real motion sequences annotated from videos. Table 3 compares the results of our final model with two ablated variants.
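The nearest-neighbor matching behind MCD can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code of the paper: sequences are assumed to be already projected to 2D and aligned to the annotated keypoints, each sequence is a list of frames of `(x, y)` keypoints, the per-keypoint error is the squared Euclidean distance, and the two directions are combined by a simple average (the text does not specify how they are combined). All function names are hypothetical.

```python
def sequence_mse(seq_a, seq_b):
    """Mean squared keypoint distance over all frames of two sequences."""
    total, count = 0.0, 0
    for frame_a, frame_b in zip(seq_a, seq_b):
        for (xa, ya), (xb, yb) in zip(frame_a, frame_b):
            total += (xa - xb) ** 2 + (ya - yb) ** 2
            count += 1
    return total / count

def one_sided(source, target):
    """For each source sequence, the error to its closest target sequence, averaged."""
    return sum(min(sequence_mse(s, t) for t in target) for s in source) / len(source)

def motion_chamfer_distance(generated, annotated):
    """Bi-directional distance; averaging the two directions is an assumption."""
    return 0.5 * (one_sided(annotated, generated) + one_sided(generated, annotated))
```

The annotated-to-generated direction penalizes real motions that the model cannot reproduce, while the generated-to-annotated direction penalizes generated motions that are far from any real one.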

Table 3: Motion Chamfer Distance (MCD) on APT-36K [90] for Motion Generation Evaluation. MP: MagicPony, AM: AnimalMotion dataset, TS: temporal smoothness.

Experiment	MCD $\downarrow$
MP + VAE	38.77
MP + VAE + AM	38.12
MP + VAE + AM + TS (final)	38.03

4.3 Single-Image 3D Reconstruction

We also quantitatively evaluate the monocular 3D reconstruction results of our model and compare with existing methods [41, 44, 40, 80]. For this purpose, we use PASCAL [20], a widely used benchmarking dataset for 3D reconstruction, as well as the aforementioned APT-36K [90] dataset, both of which come with 2D keypoint annotations. We compute the commonly used keypoint transfer metric measured by Percentage of Correct Keypoints (PCK) [34, 44, 80]. Specifically, given a set of annotated visible 2D keypoints on a source image, we identify the closest vertices on the reconstructed 3D mesh, and then project those 3D vertices onto the target 2D image. We calculate the percentage of the re-projected keypoints that land within a small distance from the annotated keypoints in the target image. This margin is set to be $0.1$ of the image size following prior work [34, 44, 80]. Another commonly used metric is Mask Intersection over Union (MIoU) between the rendered and ground-truth masks, which measures the reconstruction quality in terms of projected 2D silhouettes. In addition, since APT-36K [90] provides keypoint annotations on video sequences, we also measure the temporal consistency across the reconstructions along the video sequences using a Velocity Error, computed as $\frac{1}{T}\sum_{t=1}^{T}\|\hat{\delta}_{t}-\delta_{t}\|/\delta_{t}$ , where $\hat{\delta}_{t}$ and $\delta_{t}$ are the keypoint displacements between consecutive frames for the predicted and ground-truth (GT) pose sequences, respectively. As the predicted poses are different from the GT keypoints, we use the same procedure described in Section 4.2.3 to optimize a linear mapping from the predicted poses to the GT keypoints for each method.
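The two keypoint-based metrics above can be sketched in a few lines. This is a minimal illustration under assumptions that go beyond the text: keypoints are `(x, y)` tuples already transferred to the target image, "image size" is passed as a single scalar, the denominator of the Velocity Error is read as the norm of the GT displacement, and a small `eps` guards against division by zero. The function names are hypothetical.

```python
import math

def pck(pred, gt, image_size, alpha=0.1):
    """Fraction of predicted keypoints within alpha * image_size of the GT."""
    threshold = alpha * image_size
    hits = sum(1 for (px, py), (gx, gy) in zip(pred, gt)
               if math.hypot(px - gx, py - gy) <= threshold)
    return hits / len(gt)

def velocity_error(pred_seq, gt_seq, eps=1e-8):
    """Mean relative error of frame-to-frame keypoint displacements."""
    def displacements(seq):
        return [[(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(f0, f1)]
                for f0, f1 in zip(seq[:-1], seq[1:])]

    def norm(vectors):
        return math.sqrt(sum(dx * dx + dy * dy for dx, dy in vectors))

    errors = []
    for d_pred, d_gt in zip(displacements(pred_seq), displacements(gt_seq)):
        diff = [(a - c, b - d) for (a, b), (c, d) in zip(d_pred, d_gt)]
        # Assumption: normalize by the norm of the GT displacement.
        errors.append(norm(diff) / (norm(d_gt) + eps))
    return sum(errors) / len(errors)
```

Note that a sequence of $T$ frames yields $T-1$ displacements in this sketch, so the average runs over frame pairs rather than frames.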

Table 4: Comparison of Monocular 3D Reconstruction Results with Different Methods on PASCAL [20] and APT-36K [90]. Our method achieves superior reconstruction accuracy compared to the existing methods, including the recent MagicPony baseline [80].

	PASCAL [20]		APT-36K [90]
Method	PCK $\uparrow$	Mask IoU $\uparrow$	PCK $\uparrow$	Vel. Err. $\downarrow$
CSM [41]	31.2%	-	-	-
UMR [44]	24.4%	-	-	-
A-CSM [40]	32.9%	-	-	-
MagicPony [80]	42.8%	64.1%	53.9%	57.3%
Ponymation (ours)	48.0%	71.8%	59.9%	49.1%

The results are summarized in Table 4. The results of MagicPony [80] are computed using the publicly released code and models, and the results of other baselines are taken from A-CSM [40]. Our model outperforms all previous methods. In particular, compared to the MagicPony baseline, our model achieves considerable improvement by learning from videos instead of individual images.

Additional ablation studies on the architecture design, discussions on limitations and more visualizations are included in the supplementary material.

5 Conclusions

We have presented a new method for learning generative models of articulated 3D animal motions from raw Internet videos, without relying on any pose annotations or shape templates. To this end, we have proposed a video photo-geometric auto-encoding framework that automatically learns to decompose RGB videos into the underlying 3D shape, articulated motion, and object appearance, simply with the objective of re-rendering the videos. At the core of this pipeline is a transformer-based architecture that effectively extracts the temporal and spatial structure of the video clip into a latent motion VAE, which enables sampling at inference time to generate new 3D motion sequences. Experimental results show that the proposed method learns a reasonable distribution of 3D animal motions for several animal categories. This allows us to instantly turn a single 2D image into 4D animations in a fully automatic fashion, enabling promising downstream applications in game design and movie production.

Acknowledgments.

We thank Zizhang Li, Feng Qiu, and Ruining Li for insightful discussions. The work is in part supported by the Stanford Institute for Human-Centered AI (HAI) and Samsung.

References

[1] de Aguiar, E., Stoll, C., Theobalt, C., Ahmed, N., Seidel, H.P., Thrun, S.: Performance capture from sparse multi-view video. ACM TOG (2008)
[2] Ahn, H., Ha, T., Choi, Y., Yoo, H., Oh, S.: Text2Action: Generative adversarial synthesis from language to action. In: ICRA (2018)
[3] Akhter, I., Sheikh, Y., Khan, S., Kanade, T.: Nonrigid structure from motion in trajectory space. In: NeurIPS (2008)
[4] Amir, S., Gandelsman, Y., Bagon, S., Dekel, T.: Deep vit features as dense visual descriptors. In: ECCV Workshop on What is Motion For? (2022)
[5] Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2d human pose estimation: New benchmark and state of the art analysis. In: CVPR (2014)
[6] Badler, N.: Temporal Scene Analysis: Conceptual Descriptions of Object Movements. Ph.D. thesis, Queensland University of Technology (1975)
[7] Badler, N.I., Phillips, C.B., Webber, B.L.: Simulating Humans: Computer Graphics, Animation, and Control. Oxford University Press (1993)
[8] Bahmani, S., Skorokhodov, I., Rong, V., Wetzstein, G., Guibas, L., Wonka, P., Tulyakov, S., Park, J.J., Tagliasacchi, A., Lindell, D.B.: 4D-fy: Text-to-4d generation using hybrid score distillation sampling. In: CVPR (2024)
[9] Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M.J.: Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In: ECCV (2016)
[10] Bregler, C., Hertzmann, A., Biermann, H.: Recovering non-rigid 3d shape from image streams. In: CVPR (2000)
[11] Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A.: Video generation models as world simulators (2024), https://openai.com/research/video-generation-models-as-world-simulators
[12] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: ICCV (2021)
[13] Cashman, T.J., Fitzgibbon, A.W.: What shape are dolphins? building 3d morphable models from 2d images. IEEE TPAMI (2012)
[14] Chadwick, J.E., Haumann, D.R., Parent, R.E.: Layered construction for deformable animated characters. ACM SIGGRAPH Computer Graphics (1989)
[15] Chan, E., Monteiro, M., Kellnhofer, P., Wu, J., Wetzstein, G.: pi-GAN: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In: CVPR (2021)
[16] Chan, E.R., Lin, C.Z., Chan, M.A., Nagano, K., Pan, B., De Mello, S., Gallo, O., Guibas, L., Tremblay, J., Khamis, S., Karras, T., Wetzstein, G.: Efficient geometry-aware 3D generative adversarial networks. In: CVPR (2022)
[17] Dai, Y., Li, H., He, M.: A simple prior-free method for non-rigid structure-from-motion factorization. In: CVPR (2012)
[18] Debevec, P.: The light stages and their applications to photoreal digital actors. In: SIGGRAPH Asia (2012)
[19] Duggal, S., Pathak, D.: Topologically-aware deformation fields for single-view 3d reconstruction. In: CVPR (2022)
[20] Everingham, M., Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes challenge: A retrospective. IJCV (2015)
[21] Gao, X., Yang, J., Kim, J., Peng, S., Liu, Z., Tong, X.: Mps-nerf: Generalizable 3d human rendering from multiview images. IEEE TPAMI (2022)
[22] Goel, S., Kanazawa, A., Malik, J.: Shape and viewpoints without keypoints. In: ECCV (2020)
[23] Guo, C., Zuo, X., Wang, S., Zou, S., Sun, Q., Deng, A., Gong, M., Cheng, L.: Action2motion: Conditioned generation of 3d human motions. In: ACM MM (2020)
[24] Habibie, I., Holden, D., Schwarz, J., Yearsley, J., Komura, T.: A recurrent variational autoencoder for human motion synthesis. In: BMVC (2017)
[25] Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edn. (2004)
[26] He, Y., Pang, A., Chen, X., Liang, H., Wu, M., Ma, Y., Xu, L.: ChallenCap: Monocular 3d capture of challenging human performances using multi-modal references. In: CVPR (2021)
[27] Henter, G.E., Alexanderson, S., Beskow, J.: MoGlow: Probabilistic and controllable motion synthesis using normalising flows. ACM TOG (2020)
[28] Huang, K., Han, Y., Chen, K., Pan, H., Zhao, G., Yi, W., Li, X., Liu, S., Wei, P., Wang, L.: A hierarchical 3d-motion learning framework for animal spontaneous behavior mapping. Nature Communications (2021)
[29] Huang, Z., Shi, X., Zhang, C., Wang, Q., Cheung, K.C., Qin, H., Dai, J., Li, H.: FlowFormer: A transformer architecture for optical flow. In: ECCV (2022)
[30] Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE TPAMI (2014)
[31] Jakab, T., Li, R., Wu, S., Rupprecht, C., Vedaldi, A.: Farm3D: Learning articulated 3D animals by distilling 2D diffusion. In: 3DV (2024)
[32] Jiang, B., Chen, X., Liu, W., Yu, J., Yu, G., Chen, T.: MotionGPT: Human motion as a foreign language. In: NeurIPS (2024)
[33] Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: CVPR (2018)
[34] Kanazawa, A., Tulsiani, S., Efros, A.A., Malik, J.: Learning category-specific mesh reconstruction from image collections. In: ECCV (2018)
[35] Kanazawa, A., Zhang, J.Y., Felsen, P., Malik, J.: Learning 3d human dynamics from video. In: CVPR (2019)
[36] Kapon, R., Tevet, G., Cohen-Or, D., Bermano, A.H.: Mas: Multi-view ancestral sampling for 3d motion generation using 2d diffusion. In: CVPR (2024)
[37] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: ICLR (2014)
[38] Kirillov, A., Wu, Y., He, K., Girshick, R.: PointRend: Image segmentation as rendering. In: CVPR (2020)
[39] Kokkinos, F., Kokkinos, I.: To the point: Correspondence-driven monocular 3d category reconstruction. In: NeurIPS (2021)
[40] Kulkarni, N., Gupta, A., Fouhey, D.F., Tulsiani, S.: Articulation-aware canonical surface mapping. In: CVPR (2020)
[41] Kulkarni, N., Gupta, A., Tulsiani, S.: Canonical surface mapping via geometric cycle consistency. In: ICCV (2019)
[42] Laine, S., Hellsten, J., Karras, T., Seol, Y., Lehtinen, J., Aila, T.: Modular primitives for high-performance differentiable rendering. ACM TOG (2020)
[43] Li, X., Liu, S., De Mello, S., Kim, K., Wang, X., Yang, M., Kautz, J.: Online adaptation for consistent mesh reconstruction in the wild. In: NeurIPS (2020)
[44] Li, X., Liu, S., Kim, K., De Mello, S., Jampani, V., Yang, M.H., Kautz, J.: Self-supervised single-view 3d reconstruction via semantic consistency. In: ECCV (2020)
[45] Li, Z., Dekel, T., Cole, F., Tucker, R., Snavely, N., Liu, C., Freeman, W.T.: Learning the depths of moving people by watching frozen people. In: CVPR (2019)
[46] Lin, X., Amer, M.R.: Human motion modeling using dvgans. arXiv preprint arXiv:1804.10652 (2018)
[47] Ling, H., Kim, S.W., Torralba, A., Fidler, S., Kreis, K.: Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models. In: CVPR (2024)
[48] Liu, D., Stathopoulos, A., Zhangli, Q., Gao, Y., Metaxas, D.: LEPARD: Learning explicit part discovery for 3d articulated shape reconstruction. In: NeurIPS (2024)
[49] Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: A skinned multi-person linear model. ACM TOG (2015)
[50] Magnenat-Thalmann, N., Primeau, E., Thalmann, D.: Abstract muscle action procedures for human face animation. The Visual Computer (1988)
[51] Minderer, M., Sun, C., Villegas, R., Cole, F., Murphy, K.P., Lee, H.: Unsupervised learning of object structure and dynamics from videos. In: NeurIPS (2019)
[52] Muybridge, E.: The horse in motion (1887)
[53] Newcombe, R.A., Fox, D., Seitz, S.M.: DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In: CVPR (2015)
[54] Nguyen-Phuoc, T., Li, C., Theis, L., Richardt, C., Yang, Y.L.: HoloGAN: Unsupervised learning of 3d representations from natural images. In: ICCV (2019)
[55] Niemeyer, M., Geiger, A.: GIRAFFE: Representing scenes as compositional generative neural feature fields. In: CVPR (2021)
[56] Niemeyer, M., Mescheder, L., Oechsle, M., Geiger, A.: Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In: CVPR (2020)
[57] OpenAI: ChatGPT (2023), https://chat.openai.com/
[58] Ormoneit, D., Black, M., Hastie, T., Kjellström, H.: Representing cyclic human motion using functional analysis. Image and Vision Computing (2005)
[59] Petrovich, M., Black, M.J., Varol, G.: Action-conditioned 3D human motion synthesis with transformer VAE. In: ICCV (2021)
[60] Petrovich, M., Black, M.J., Varol, G.: Temos: Generating diverse human motions from textual descriptions. In: ECCV (2022)
[61] Piao, J., Sun, K., Wang, Q., Lin, K.Y., Li, H.: Inverting generative adversarial renderer for face reconstruction. In: CVPR (2021)
[62] Ren, J., Pan, L., Tang, J., Zhang, C., Cao, A., Zeng, G., Liu, Z.: DreamGaussian4D: Generative 4d gaussian splatting. arXiv preprint arXiv:2312.17142 (2023)
[63] Saito, S., Huang, Z., Natsume, R., Morishima, S., Kanazawa, A., Li, H.: PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In: ICCV (2019)
[64] Saito, S., Simon, T., Saragih, J., Joo, H.: PIFuHD: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization. In: CVPR (2020)
[65] Schwarz, K., Liao, Y., Niemeyer, M., Geiger, A.: GRAF: Generative radiance fields for 3d-aware image synthesis. In: NeurIPS (2020)
[66] Shen, T., Gao, J., Yin, K., Liu, M.Y., Fidler, S.: Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis. In: NeurIPS (2021)
[67] Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., Parikh, D., Gupta, S., Taigman, Y.: Make-a-video: Text-to-video generation without text-video data. In: ICLR (2023)
[68] Sitzmann, V., Zollhöfer, M., Wetzstein, G.: Scene representation networks: Continuous 3d-structure-aware neural scene representations. In: NeurIPS (2019)
[69] Starke, S., Mason, I., Komura, T.: DeepPhase: Periodic autoencoders for learning motion phase manifolds. ACM TOG (2022)
[70] Stathopoulos, A., Pavlakos, G., Han, L., Metaxas, D.N.: Learning articulated shape with keypoint pseudo-labels from web images. In: CVPR (2023)
[71] Sun, J.J., Karashchuk, P., Dravid, A., Ryou, S., Fereidooni, S., Tuthill, J., Katsaggelos, A., Brunton, B.W., Gkioxari, G., Kennedy, A., et al.: BKinD-3D: Self-supervised 3d keypoint discovery from multi-view videos. In: CVPR (2023)
[72] Sun, J.J., Ryou, S., Goldshmid, R., Weissbourd, B., Dabiri, J., Anderson, D.J., Kennedy, A., Yue, Y., Perona, P.: Self-supervised keypoint discovery in behavioral videos. In: CVPR (2022)
[73] Sun, K., Wu, S., Huang, Z., Zhang, N., Wang, Q., Li, H.: Controllable 3d face synthesis with conditional generative occupancy fields. In: NeurIPS (2022)
[74] Sun, K., Wu, S., Zhang, N., Huang, Z., Wang, Q., Li, H.: Cgof++: Controllable 3d face synthesis with conditional generative occupancy fields. IEEE TPAMI (2023)
[75] Teed, Z., Deng, J.: Raft: Recurrent all-pairs field transforms for optical flow. In: ECCV (2020)
[76] Urtasun, R., Fleet, D.J., Lawrence, N.D.: Modeling human locomotion with topologically constrained latent variable models. In: Elgammal, A., Rosenhahn, B., Klette, R. (eds.) Human Motion – Understanding, Modeling, Capture and Animation. pp. 104–118. Springer Berlin Heidelberg, Berlin, Heidelberg (2007)
[77] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017)
[78] Wang, Y., Long, M., Wang, J., Gao, Z., Yu, P.S.: Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. In: NeurIPS (2017)
[79] Wu, S., Jakab, T., Rupprecht, C., Vedaldi, A.: DOVE: Learning deformable 3d objects by watching videos. IJCV (2023)
[80] Wu, S., Li, R., Jakab, T., Rupprecht, C., Vedaldi, A.: MagicPony: Learning articulated 3d animals in the wild. In: CVPR (2023)
[81] Wu, S., Makadia, A., Wu, J., Snavely, N., Tucker, R., Kanazawa, A.: De-rendering the world’s revolutionary artefacts. In: CVPR (2021)
[82] Wu, S., Rupprecht, C., Vedaldi, A.: Unsupervised learning of probably symmetric deformable 3D objects from images in the wild. In: CVPR (2020)
[83] Wu, Y., Chen, Z., Liu, S., Ren, Z., Wang, S.: CASA: Category-agnostic skeletal animal reconstruction. In: NeurIPS (2022)
[84] Xiao, J., xiang Chai, J., Kanade, T.: A closed-form solution to non-rigid shape and motion recovery. In: ECCV (2004)
[85] Xie, Y., Jampani, V., Zhong, L., Sun, D., Jiang, H.: OmniControl: Control any joint at any time for human motion generation. In: ICLR (2024)
[86] Yang, G., Sun, D., Jampani, V., Vlasic, D., Cole, F., Chang, H., Ramanan, D., Freeman, W.T., Liu, C.: LASR: Learning articulated shape reconstruction from a monocular video. In: CVPR (2021)
[87] Yang, G., Sun, D., Jampani, V., Vlasic, D., Cole, F., Liu, C., Ramanan, D.: ViSER: Video-specific surface embeddings for articulated 3d shape reconstruction. In: NeurIPS (2021)
[88] Yang, G., Vo, M., Neverova, N., Ramanan, D., Vedaldi, A., Joo, H.: BANMo: Building animatable 3d neural models from many casual videos. In: CVPR (2022)
[89] Yang, G., Wang, C., Reddy, N.D., Ramanan, D.: Reconstructing animatable categories from videos. In: CVPR (2023)
[90] Yang, Y., Yang, J., Xu, Y., Zhang, J., Lan, L., Tao, D.: APT-36K: A large-scale benchmark for animal pose estimation and tracking. In: NeurIPS Dataset and Benchmark Track (2022)
[91] Yao, C.H., Hung, W.C., Li, Y., Rubinstein, M., Yang, M.H., Jampani, V.: Hi-LASSIE: High-fidelity articulated shape and skeleton discovery from sparse image ensemble. In: CVPR (2023)
[92] Yao, C.H., Hung, W.C., Rubinstein, M., Lee, Y., Jampani, V., Yang, M.H.: LASSIE: Learning articulated shape from sparse image ensemble via 3d part discovery. In: NeurIPS (2022)
[93] Yao, C.H., Raj, A., Hung, W.C., Rubinstein, M., Li, Y., Yang, M.H., Jampani, V.: ARTIC3D: Learning robust articulated 3d shapes from noisy web image collections. In: NeurIPS (2024)
[94] Zhang, J.Y., Felsen, P., Kanazawa, A., Malik, J.: Predicting 3d human dynamics from video. In: ICCV (2019)
[95] Zhao, Y., Yan, Z., Xie, E., Hong, L., Li, Z., Lee, G.H.: Animate124: Animating one image to 4d dynamic scene. arXiv preprint arXiv:2311.14603 (2023)
[96] Zheng, Y., Li, X., Nagano, K., Liu, S., Hilliges, O., De Mello, S.: A unified approach for text-and image-guided 4d scene generation. In: CVPR (2024)
[97] Zhou, Z., Wang, B.: UDE: A unified driving engine for human motion generation. In: CVPR (2023)

Appendices

Appendix 0.A Additional Qualitative Results

0.A.1 Additional Motion Generation Results

Additional generated 3D motion sequences are shown in Figures 8 and 9. Please refer to the video (https://youtu.be/poc7c-9hCvQ?si=3k874zHackOre94R) for more 3D animation visualizations. As shown in the video, by sampling from the learned motion VAE, we can generate diverse motion patterns, such as eating with the head bending towards the ground, walking with the legs moving alternately, and jumping with the front legs lifted up.

We trained our VAE model with a sequence length of $10$ frames. To produce longer motion sequences as demonstrated in the video, we first sample $2$ latent codes to generate $2$ motion sequences, each comprising $10$ frames. We then optimize one additional transition motion latent such that the first frame of the decoded transition sequence is consistent with the last frame of the preceding sequence, and its last frame is consistent with the first frame of the following sequence.
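The transition step can be pictured as a small optimization over a latent code with a boundary-consistency loss. The sketch below is only an illustration of that idea: the squared-error loss, the finite-difference gradient descent, and the stand-in `toy_decode` function (which linearly interpolates between two poses stored in the latent) are all assumptions made to keep the example self-contained; the actual decoder is the learned motion VAE decoder, and the optimizer used in the paper is not specified here.

```python
def boundary_loss(z, decode, prev_last, next_first):
    """Squared error between the transition's end frames and its neighbors."""
    seq = decode(z)
    start = sum((a - b) ** 2 for a, b in zip(seq[0], prev_last))
    end = sum((a - b) ** 2 for a, b in zip(seq[-1], next_first))
    return start + end

def optimize_transition(decode, prev_last, next_first, dim,
                        steps=200, lr=0.1, h=1e-4):
    """Gradient descent on the latent using central finite differences."""
    z = [0.0] * dim
    for _ in range(steps):
        grad = []
        for i in range(dim):
            z_plus, z_minus = z[:], z[:]
            z_plus[i] += h
            z_minus[i] -= h
            grad.append((boundary_loss(z_plus, decode, prev_last, next_first)
                         - boundary_loss(z_minus, decode, prev_last, next_first))
                        / (2 * h))
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z

def toy_decode(z, frames=10):
    """Hypothetical stand-in decoder: interpolates between two poses in z."""
    d = len(z) // 2
    return [[z[j] * (1 - t / (frames - 1)) + z[d + j] * (t / (frames - 1))
             for j in range(d)] for t in range(frames)]
```

With the toy decoder, the optimized latent simply recovers the two boundary poses; with a learned decoder, the same loss would instead select a latent whose decoded motion starts and ends near the neighboring sequences.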

0.A.2 Qualitative Comparison of Video Reconstruction Results

Figure 6 compares the 3D reconstruction results on video sequences obtained from the MagicPony [80] model and our proposed method. Although MagicPony predicts a plausible 3D shape in most cases, it tends to produce temporally inconsistent poses, including both the rigid pose $\hat{\xi}_{t,1}$ and bone rotations $\hat{\xi}_{t,2:B}$ , as highlighted in Figure 6. In contrast, our method leverages the temporal signals in training videos, and produces temporally coherent reconstruction results.

Appendix 0.B Additional Ablation Studies

Table 5: Ablation study on the architecture of the motion VAE model.

Row	Method	PCK@0.1	Mask IoU
1	Final (with ST-Transformer)	37.6%	62.0%
2	without spatial Transformers $E_{\text{s}}$ , $D_{\text{s}}$	33.4%	58.9%
3	without Teacher Loss $\mathcal{L}_{\text{teacher}}$	32.4%	57.9%
4	without motion VAE	44.3%	66.7%

0.B.1 Spatio-Temporal Transformer Architecture

We conduct an ablation study to verify the effectiveness of the proposed spatio-temporal transformer architecture. In particular, we remove each individual component from the final model or replace it with a default option, train the model on the same dataset, and evaluate its performance on 3D reconstruction with the same protocol described in Section 4.3 of the main paper.

First, we remove the spatial transformer encoder and decoder, $E_{\text{s}}$ and $D_{\text{s}}$ , and report the results in row 2 of Table 5. In this variant, specifically, instead of using the spatial transformer encoder $E_{\text{s}}$ to fuse bone-specific local image features before passing them to the temporal transformer encoder $E_{\text{t}}$ , we directly feed the global image features $\{\phi_{1},\cdots,\phi_{T}\}$ into the temporal encoder. Similarly, we also remove the spatial decoder $D_{\text{s}}$ , and directly decode a fixed set of bone rotations from the temporal transformer decoder $D_{\text{t}}$ .

Compared to the final model with spatio-temporal transformer architectures in row 1 of Table 5, the variant without spatial transformer results in less accurate reconstructions, and hence lower scores on the metrics. This confirms the effectiveness of the proposed spatial transformer in extracting motion-specific spatial information from the images.

0.B.2 Teacher Loss

We also demonstrate the effect of the Teacher Loss $\mathcal{L}_{\text{teacher}}$ introduced in Section 3.3 of the main paper. We train a variant of the motion VAE model without this loss, and report its reconstruction performance in row $3$ of Table 5. Without $\mathcal{L}_{\text{teacher}}$ , the model fails to learn accurate poses effectively, leading to degraded reconstruction results. This is mainly because training the motion VAE from scratch is computationally inefficient with an expensive rendering step in the loop, and the Teacher Loss can significantly improve training efficiency.

Table 6: Ablation study with different sequence lengths for motion generation evaluated using Motion Chamfer Distance (MCD) on APT-36K [90].

Sequence Length	$K=10$	$K=20$	$K=50$
MCD $\downarrow$	38.03	38.25	39.25

0.B.3 Sequence Length.

We conducted experiments to understand the effect of different sequence lengths during training ( $K=10,20,50$ frames). For a fair comparison, to evaluate the longer motion sequences generated by these variants ( $K=20,50$ ), we divide them into consecutive sub-sequences of $10$ frames, and average the MCD metric across the subsequences. We use the same metric as introduced in Section 4.2 of the main paper, the Motion Chamfer Distance (MCD) calculated between generated sequences and the annotated sequences in the APT-36K dataset [90]. The results are presented in Table 6.
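The sub-sequence split used for this comparison can be sketched as below. This is a minimal illustration, assuming that a generated sequence is a list of frames and that any trailing frames that do not fill a complete window are dropped; the function name is hypothetical.

```python
def split_subsequences(sequence, length=10):
    """Split a sequence of frames into consecutive, non-overlapping windows."""
    return [sequence[i:i + length]
            for i in range(0, len(sequence) - length + 1, length)]
```

A sequence generated with $K=20$ thus contributes two $10$-frame windows and one generated with $K=50$ contributes five, each of which can be scored with the same $10$-frame metric.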

Upon analyzing the results, we observed that the generated sequences still look plausible as the sequence length increases from $10$ to $20$ . However, a notable degradation in quality is observed as the sequence length increases to $50$ . This could potentially be attributed to the limited capacity of the motion VAE model as well as the limited size of the training dataset. For our final model, we set the sequence length to $10$ , which tends to yield the most satisfactory results with a reasonable training efficiency.

0.B.4 KL Loss Weight.

To train the motion VAE, in addition to the reconstruction losses, we also use the Kullback–Leibler (KL) divergence loss $\mathcal{L}_{\text{KL}}$ in Equation (6) in the main paper. We conducted an ablation study on its weight $\lambda_{\text{KL}}$ to assess its impact on the overall 3D reconstruction accuracy. As shown in Table 7, $\lambda_{\text{KL}}=0.001$ achieves the best reconstruction results, and is used in all experiments in the main paper.

Table 7: Ablation study on the weight $\lambda_{\text{KL}}$ of the KL divergence loss $\mathcal{L}_{\text{KL}}$ .

	PCK@0.1	Mask IoU
$\lambda_{\text{KL}}=0.01$	33.58%	59.85%
$\lambda_{\text{KL}}=0.001$	37.63%	62.03%
$\lambda_{\text{KL}}=0.0001$	35.75%	61.11%

Appendix 0.C Additional Technical Details

0.C.1 Architecture Details

As explained in the paper, we adopt a spatio-temporal transformer architecture for sequence feature encoding and motion decoding. To better illustrate the architecture, we depict the framework of the spatial and temporal transformer encoders in Figure 7. As presented in Table 8, we use $4$ -layer transformers to implement the spatial and temporal transformer encoders $E_{\text{s}}$ , $E_{\text{t}}$ and decoders $D_{\text{s}}$ , $D_{\text{t}}$ . Given the DINO features of the input image, we first concatenate the bone position as Positional Encoding to obtain the bone-specific feature descriptors $\nu_{t,b}$ with shape (BoneNum, FrameNum, FeatureDim) = ( $20\times 10\times 640$ ). Then we map the feature dimension to $256$ with a simple Linear layer, and concatenate an additional BoneFeatureQuery token. We use the $4$ -layer transformer $E_{\text{s}}$ to aggregate all the bone-specific feature descriptors into a per-frame pose feature $\nu_{t,*}$ , and subsequently $E_{\text{t}}$ to aggregate all frame-specific features into the VAE distribution parameters, including the mean $\hat{\mu}$ and variance $\hat{\Sigma}$ . Using the reparametrization trick, we then sample a latent code $z$ from the Gaussian distribution $z\sim\mathcal{N}(\hat{\mu},\hat{\Sigma})$ , which is decoded first by the temporal decoder $D_{\text{t}}$ and then by the spatial decoder $D_{\text{s}}$ into a final sequence of bone rotation angles $\hat{\xi}_{*,2:B}\in\mathbb{R}^{20\times 10\times 3}$ .

Table 8: Architecture of the proposed spatio-temporal transformer VAE.

Operation	Output Size
Positional Encoding	20 $\times$ 10 $\times$ 640
Linear(640, 256)	20 $\times$ 10 $\times$ 256
Concat BoneFeatQuery	21 $\times$ 10 $\times$ 256
TransformerLayer $\times$ 4	1 $\times$ 10 $\times$ 256
Reshape	10 $\times$ 1 $\times$ 256
Concat muQuery and sigmaQuery	12 $\times$ 1 $\times$ 256
Positional Encoding	12 $\times$ 1 $\times$ 256
TransformerLayer $\times$ 4	2 $\times$ 1 $\times$ 256
Reparameterization	1 $\times$ 1 $\times$ 256
TransformerLayer $\times$ 4	10 $\times$ 1 $\times$ 256
Reshape	1 $\times$ 10 $\times$ 256
TransformerLayer $\times$ 4	20 $\times$ 10 $\times$ 256
Linear(256, 3)	20 $\times$ 10 $\times$ 3
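The sampling step listed as Reparameterization in Table 8 can be illustrated with a minimal sketch. It assumes a diagonal covariance, so that $\hat{\Sigma}$ is represented by a list of per-dimension variances; whether the network predicts variances, standard deviations or log-variances is not stated here, and the function name is hypothetical.

```python
import math
import random

def reparameterize(mu, var, rng=random):
    """Sample z = mu + sqrt(var) * eps with eps ~ N(0, 1), per dimension."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]
```

Writing the sample as a deterministic function of $\hat{\mu}$, $\hat{\Sigma}$ and independent noise is what allows gradients of the reconstruction loss to reach the encoder during training. At inference time, new motions are obtained by sampling latent codes from the learned motion VAE and decoding them.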

0.C.2 Articulation Model Specifications

The configuration of bone topology and skinning weights was established following MagicPony [80]. Here, we give a brief recap of the model.

0.C.2.1 Posed Shape.

The blend skinning model for posing [50, 14, 80] was utilized to articulate the skeleton into a specific pose. This model is parameterised by $B-1$ bone rotations $\xi_{b}\in SO(3),b=2,\dots,B$ , and the viewpoint $\xi_{1}\in SE(3)$ . A set of rest-pose joint locations $\mathbf{J}_{b}$ was initialized on the instance mesh using straightforward heuristics. Each bone $b$ , excluding the root, has a single parent $\pi(b)$ , thereby forming a tree structure.

Each vertex $V_{i}$ is linked to the bones via the skinning weights $w_{ib}$ , determined based on their relative proximity to each bone. The vertices are then posed using the linear blend skinning equation:

	$\displaystyle V_{i}(\xi)=\left(\sum_{b=1}^{B}w_{ib}G_{b}(\xi)G_{b}(\xi^{*})^{-1}\right)V_{\text{ins},i},$
	$\displaystyle G_{1}=g_{1},\quad G_{b}=G_{\pi(b)}\circ g_{b},\quad g_{b}(\xi)=\begin{bmatrix}R_{\xi_{b}}&\mathbf{J}_{b}\\ 0&1\end{bmatrix},$		(9)

where $\xi^{*}$ denotes the bone rotations at the rest pose.
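Equation (9) can be traced with a small, dependency-free sketch. It is an illustration under assumptions rather than the released implementation: rotations are given directly as $3\times 3$ matrices, the translation term of each $g_{b}$ (named `offsets` here) is assumed to be expressed relative to the parent bone, the rest pose $\xi^{*}$ is assumed to use identity rotations, and bones are assumed to be ordered so that parents precede their children. All function names are hypothetical.

```python
def matmul(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def rigid(R, t):
    """Homogeneous transform [[R, t], [0, 1]] from a 3x3 rotation and a translation."""
    return [list(R[i]) + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def rigid_inverse(G):
    """Inverse of a rigid transform: [[R^T, -R^T t], [0, 1]]."""
    Rt = [[G[j][i] for j in range(3)] for i in range(3)]
    t = [-sum(Rt[i][k] * G[k][3] for k in range(3)) for i in range(3)]
    return rigid(Rt, t)

def global_transforms(rotations, offsets, parents):
    """G_1 = g_1 and G_b = G_parent(b) composed with g_b (parents come first)."""
    G = []
    for b, parent in enumerate(parents):
        g = rigid(rotations[b], offsets[b])
        G.append(g if parent is None else matmul(G[parent], g))
    return G

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def pose_vertices(vertices, weights, rotations, offsets, parents):
    """Linear blend skinning: V_i = (sum_b w_ib G_b(xi) G_b(xi*)^-1) V_ins,i."""
    posed_G = global_transforms(rotations, offsets, parents)
    # Assumption: the rest pose uses identity rotations for every bone.
    rest_G = global_transforms([IDENTITY] * len(parents), offsets, parents)
    M = [matmul(p, rigid_inverse(r)) for p, r in zip(posed_G, rest_G)]
    posed = []
    for v, w in zip(vertices, weights):
        h = list(v) + [1.0]  # homogeneous coordinates
        posed.append([sum(w[b] * sum(M[b][i][k] * h[k] for k in range(4))
                          for b in range(len(M))) for i in range(3)])
    return posed
```

For example, with a root at the origin and a child bone offset by one unit along $x$, rotating the child by $90^{\circ}$ about the $z$-axis moves a vertex at $(2,0,0)$ that is fully bound to the child to $(1,1,0)$, i.e. the vertex swings around the child joint at $(1,0,0)$.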

0.C.2.2 Bone Topology

For all quadrupedal animals examined in this paper, a chain of $8$ bones of equal lengths was estimated. These bones lie on two line segments that extend from the centre (root) of the rest-pose mesh to the two most extreme vertices along the $z$ -axis ( $4$ bones on each side), thereby forming a “spine”. Then the root joint was slightly elevated, and $4$ sets of bones were added to model the legs. The foot joints were first identified as the lowest points of the mesh (in the $y$ -axis) in each of the four $xz$ -quadrants. Subsequently, $4$ line segments were drawn from the foot joints to their nearest spine joints, and a chain of $3$ bones of equal lengths was defined on each of the segments, representing each leg.
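The heuristic described above can be sketched as follows. This is a rough illustration under several assumptions that are not stated in the text: the centre of the mesh is taken as the vertex mean, the amount by which the root is elevated is exposed as a hypothetical parameter `root_lift` and applied before the spine is built, quadrants are defined relative to the centre, every quadrant is assumed to contain at least one vertex, and leg chains are oriented from the spine towards the foot. Bones within each chain have equal lengths; the two spine halves have equal bone lengths only if the two extreme vertices are equally far from the root.

```python
def chain(start, end, n):
    """n equal-length bones from start to end, as (parent_joint, child_joint) pairs."""
    points = [tuple(s + (e - s) * i / n for s, e in zip(start, end))
              for i in range(n + 1)]
    return list(zip(points[:-1], points[1:]))

def build_skeleton(vertices, root_lift=0.0):
    """Return (spine_bones, leg_bones) for a rest-pose mesh given as (x, y, z) vertices."""
    n = len(vertices)
    cx, cy, cz = (sum(v[i] for v in vertices) / n for i in range(3))
    root = (cx, cy + root_lift, cz)
    # Spine: 4 bones towards each of the two extreme vertices along the z-axis.
    front = max(vertices, key=lambda v: v[2])
    back = min(vertices, key=lambda v: v[2])
    spine = chain(root, tuple(front), 4) + chain(root, tuple(back), 4)
    spine_joints = [root] + [child for _, child in spine]
    # Legs: lowest vertex (in y) of each xz-quadrant, linked to its nearest spine joint.
    legs = []
    for sx in (1, -1):
        for sz in (1, -1):
            quadrant = [v for v in vertices
                        if (v[0] - cx) * sx >= 0 and (v[2] - cz) * sz >= 0]
            foot = min(quadrant, key=lambda v: v[1])
            nearest = min(spine_joints,
                          key=lambda j: sum((a - b) ** 2 for a, b in zip(j, foot)))
            legs += chain(nearest, tuple(foot), 3)
    return spine, legs
```

On any mesh that satisfies these assumptions, the sketch produces $8$ spine bones and $4\times 3=12$ leg bones.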

0.C.2.3 Skinning Weight

The skinning weight $w_{i,b}$ , which associates each vertex $V_{\text{ins},i}$ with the bones, was defined as follows:

	$\displaystyle w_{i,b}=\frac{e^{-d_{i,b}/\tau_{\text{s}}}}{\sum_{k=1}^{B}e^{-d_{i,k}/\tau_{\text{s}}}},$
	$\displaystyle\text{where}\quad d_{i,b}=\min_{r\in[0,1]}\|V_{\text{ins},i}-r\tilde{\mathbf{J}}_{b}-(1-r)\tilde{\mathbf{J}}_{\pi(b)}\|^{2}_{2}.$		(10)

In this context, $d_{i,b}$ is the minimal squared distance from the vertex $V_{\text{ins},i}$ to bone $b$ , i.e., to the line segment between the rest-pose joint locations $\tilde{\mathbf{J}}_{b}$ and $\tilde{\mathbf{J}}_{\pi(b)}$ in world coordinates, where $\tilde{\mathbf{J}}_{\pi(b)}$ denotes the parent joint of $\tilde{\mathbf{J}}_{b}$ . The temperature parameter $\tau_{\text{s}}$ is set to $0.5$ .
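Equation (10) amounts to a softmax over negative squared point-to-segment distances, which can be sketched as follows. The representation of each bone as a `(parent_joint, joint)` pair of 3D points and the function names are assumptions of this sketch; how the distance is defined for a root bone without a parent is not covered here. Subtracting the smallest distance before exponentiating is a standard numerical-stability step that leaves the weights unchanged.

```python
import math

def sq_dist_to_segment(v, a, b):
    """Squared distance from point v to the segment between joints a and b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    av = [vi - ai for ai, vi in zip(a, v)]
    denom = sum(c * c for c in ab)
    # Closed-form minimizer over r, clamped to [0, 1].
    r = 0.0 if denom == 0.0 else max(0.0, min(1.0, sum(x * y for x, y in zip(av, ab)) / denom))
    closest = [ai + r * c for ai, c in zip(a, ab)]
    return sum((vi - ci) ** 2 for vi, ci in zip(v, closest))

def skinning_weights(vertex, bones, tau=0.5):
    """Softmax over -d / tau, where d is the squared distance to each bone."""
    d = [sq_dist_to_segment(vertex, parent, joint) for parent, joint in bones]
    d_min = min(d)
    e = [math.exp(-(di - d_min) / tau) for di in d]
    total = sum(e)
    return [ei / total for ei in e]
```

A smaller temperature concentrates each vertex on its nearest bone, whereas a larger one blends the influence of neighboring bones more smoothly.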

0.C.3 Text Prompts for 4D-fy Evaluation

We provide the 4D-fy [8] model with a list of text prompts, which are enriched by ChatGPT [57] from a list of basic prompts describing horse motions. The complete list is enumerated in the following:

• A horse is running.
• A horse is running.
• A majestic horse galloping swiftly across the verdant meadow.
• An energetic steed dashing with unbridled enthusiasm under the azure sky.
• A spirited horse racing with the wind, its mane flowing like waves.
• A horse is walking.
• A horse is walking.
• A serene horse ambling gently through a misty forest at dawn.
• An elegant steed strolling leisurely along a cobblestone path.
• A calm equine sauntering with grace across a blooming meadow.
• A horse is eating.
• A horse is eating.
• A serene horse gently nibbling on the lush green grass of a tranquil meadow.
• An elegant equine gracefully bending to graze on the dew-kissed clover.
• A peaceful steed leisurely munching on hay in the golden light of dawn.
• A horse is jumping.
• A horse is jumping.
• A majestic horse soaring effortlessly over a rustic wooden fence, its muscles rippling with power.
• An agile steed leaping gracefully, silhouetted against the vibrant hues of the setting sun.
• A spirited equine vaulting energetically over an obstacle, mane flowing like a river in the wind.

Appendix 0.D Limitations and Future Directions

While the model demonstrates promising results, there are several areas where further improvements can be made.

A significant limitation is that the articulated motions are learned on top of a fixed bone topology, which is pre-defined using strong heuristics, such as the number of legs. This design may not generalize well across diverse animal species. A promising direction for future work is to discover the articulation structure jointly with video training.

Additionally, the current model does not distinguish between different legs due to the nature of the DINO features. This can result in a “curious legs” problem, where the model confuses left and right legs of an animal seen from the side. This can be observed in the reconstruction results and subsequently in the generated motion sequences, and is also a common issue even with the most powerful video generation models [11]. Accurately capturing the leg ordering and precise motion is an intriguing challenge for future research in motion generation.

Appendix 0.E Societal Impact

Generating 3D motion from unlabeled videos is a fundamental challenge in computer vision and computer graphics, and a step towards extending current models to the long-tail distribution of objects in the real world. As an initial exploration in this area, our aim is to stimulate further interest and research in this direction. Continued progress in this field has great potential to significantly improve the diversity and quality of 3D and 4D models of real-world objects, thereby supporting numerous downstream applications in virtual reality, robotics, and scientific discovery.