ReMoDiffuse - Retrieval-Augmented Motion Diffusion Model
Mingyuan Zhang1 , Xinying Guo1 , Liang Pan1 , Zhongang Cai12 , Fangzhou Hong1 , Huirong Li1 ,
Lei Yang2 , Ziwei Liu1
1 S-Lab, Nanyang Technological University, Singapore
2 SenseTime, China
{mingyuan001,XGUO012}@e.ntu.edu.sg, [email protected], [email protected]
ates realistic and diverse actions while allowing for multi-level motion manipulation in both spatial and temporal dimensions. MDM [26] predicts the sample itself and uses geometric losses as training constraints. While these methods have achieved impressive results, they are not versatile enough for uncommon condition signals.

Some recent works on text-to-image generation utilize retrieval methods to complement the model framework, providing a retrieval-augmented pipeline to tackle the above issue [24, 5, 4]. However, simply transferring these methods into the text-driven motion generation field is impractical due to three new challenges. Firstly, the similarity between the target motion sequence and the elements in the database is complicated. We need to evaluate both semantic and kinematic similarities to find related knowledge. Secondly, a single motion sequence usually contains several atomic actions. It is necessary to learn from the retrieved samples selectively, and in this procedure the model should be aware of the semantic difference between the given prompt and the retrieved samples. Lastly, motion diffusion models are sensitive to the guidance scale in classifier-free guidance, especially when we supply another condition, the retrieved samples.

In this paper, we propose a new text-driven motion generation pipeline, ReMoDiffuse, which addresses the above-mentioned challenges and thoroughly benefits from retrieval techniques to generate diverse and high-quality motion sequences. ReMoDiffuse includes two stages: a retrieval stage and a refinement stage. In the retrieval stage, we aim to acquire the most informative samples to provide useful guidance for the denoising process. Here we consider both semantic and kinematic similarities and suggest a Hybrid Retrieval technique to achieve this objective. In the refinement stage, we design a Semantics-Modulated Transformer to leverage knowledge retrieved from an extra multi-modal database and generate semantics-consistent motion sequences. During inference, a Condition Mixture technique enables our model to generate high-fidelity and description-consistent motion sequences. We evaluate our proposed ReMoDiffuse on two standard text-to-motion generation benchmarks, HumanML3D [8] and KIT-ML [19]. Extensive quantitative results demonstrate that ReMoDiffuse outperforms other existing motion generation pipelines by a significant margin. Additionally, we propose several new metrics for quantitative comparisons on uncommon samples. We find that ReMoDiffuse significantly improves the generation quality on rare samples, demonstrating its superior generalizability.

To summarize, our contributions are threefold: 1) We carefully design a retrieval-augmented motion diffusion model which efficiently and effectively explores the knowledge from retrieved samples; 2) We suggest new metrics to evaluate the model's generalizability under different scenarios comprehensively; 3) Extensive qualitative and quantitative experiments show that our generated motion sequences achieve higher generalizability on both common and uncommon prompts.

2. Related Work

2.1. Diffusion Models

Diffusion models [11, 16] are a new class of generative models that have achieved impressive progress on text-to-image generation tasks. Dhariwal and Nichol [6] propose a diffusion model-based generative model that first outperforms Generative Adversarial Networks (GANs) and establishes a new state of the art in text-driven image generation. The success of this advanced generative model quickly attracted attention from researchers worldwide. GLIDE [15] designs classifier-free guidance and proves its superiority compared to the CLIP guidance used in previous works. DALL-E 2 [21] attempts to bridge the text embedding and the image embedding from CLIP [20]. It includes another diffusion model that synthesizes an image embedding from the text embedding.

Recently, some works have focused on employing retrieval methods as complements to the model framework, providing a way to enhance generalizability. KNN-Diffusion [24] uses k-Nearest Neighbors (kNN) to train an efficient text-to-image model without any text, enabling the model to adapt to novel samples. RDM [4] replaces the retrieval examples with user-assigned images and can then effectively transfer the artistic style of these images into the generated one. Re-Imagen [5] leverages knowledge from an external database to free the model from memorizing rare features, striking a good balance between fidelity and diversity.

2.2. Text-Driven Motion Generation

Text-driven motion generation has witnessed significant progress recently. Earlier works focus on learning a joint embedding space between motion sequences and language descriptions deterministically. JL2P [1] attempts to create a joint embedding space by applying the same reconstruction task to both the text and motion embeddings. Specifically, JL2P encodes the input text and motion data separately with two modality-specific encoders. A motion decoder is then applied to both embeddings to reconstruct the original motion sequences, which are expected to be the same as the initial input. Ghosh et al. [7] further develop this idea by manually dividing each pose sequence into an upper part and a lower part to represent two different body regions. In addition, their method integrates a pose discriminator to further improve the generation quality. MotionCLIP [25] attempts to enhance the generalizability of text-to-motion generation.
Figure 2: Overview of the proposed ReMoDiffuse. a) Hybrid retrieval database stores various features of each training
data. The pre-processed text feature and relative difference of motion length are sent to calculate the similarity with the
given language description. The most similar ones are fed into the semantics-modulated transformer (SMT), serving as addi-
tional clues for motion generation. b) Semantics-modulated transformer incorporates N identical decoder layers, including
a semantics-modulated attention (SMA) layer and an FFN layer. The figure shows the detailed architecture of the SMA module.
CLIP’s extracted text features fprompt from the given prompt, features Rt and Rm from the retrieved samples, and current
motion features fΘ will further refine the noised motion sequence. c) To synthesize diverse and realistic motion sequences,
starting from the pure noised sample, the motion transformer repeatedly eliminates the noise. To better mix outputs under
different combinations of conditions, we suggest a training strategy to find the optimal hyper-parameters w1 , w2 , w3 and w4 .
It enforces the motion embedding to be similar to the text and image embeddings of critical poses. These two embeddings are acquired from CLIP [20], which excels at encoding texts and images into a joint space. Consequently, MotionCLIP can generate motion sequences for unseen descriptions.

To improve the diversity of generated motion sequences, previous works introduce variational mechanisms. TEMOS [18] employs a Variational Autoencoder (VAE) [12] to replace the deterministic auto-encoder structures. Besides, different from the recurrent neural networks used in previous works, both the motion encoder and the motion decoder in TEMOS are based on transformer architectures [28]. Guo et al. [8] propose an auto-regressive conditional VAE, which is conditioned on both the text feature and the previously generated frames. Given these conditions, the proposed pipeline generates four successive frames as a unit. TEACH [2] also exploits auto-regressive models but over a larger length range. It can synthesize a long motion sequence from the given description and the previous sequence, and can therefore generate motion sequences with different actions continuously. TM2T [9] regards the text-driven motion generation task as a translation task between natural languages and motion sequences. Most recently, T2M-GPT [29] quantizes motion clips into discrete tokens and uses a transformer to automatically generate later tokens.

Inspired by the success of diffusion models in text-to-image generation tasks, some recent works have adapted this advanced generative model to motion generation tasks. MotionDiffuse [30] is an efficient DDPM-based architecture for plausible and controllable text-driven motion generation. It generates realistic and diverse actions and allows for multi-level motion manipulation in both spatial and temporal dimensions. MDM [26] is a lightweight diffusion model featuring a transformer-encoder backbone. It makes predictions of the sample rather than the noise, so that geometric losses are supported as training constraints. Although these methods have outstanding performance on text-driven motion generation tasks, they are not versatile enough for uncommon condition signals. In this paper, we equip the diffusion model-based architecture with retrieval capability, enhancing its generalizability.

3. Our Approach

In this paper, we present a Retrieval-augmented Motion Diffusion model (ReMoDiffuse). We first describe the overall architecture of the proposed method in Section 3.1. The background knowledge about the motion diffusion model is discussed in Section 3.2. Then we introduce our proposed retrieval techniques and the corresponding model structure in Section 3.3. Finally, we introduce the training objective and sampling strategy in Section 3.5.
3.1. Framework Overview

Figure 2 shows the overall architecture of ReMoDiffuse. We establish the whole pipeline on the basis of MotionDiffuse [30], which incorporates diffusion models and a series of transformer decoder layers. To strengthen its generalizability, we extract features from two different modalities to establish the retrieval database. During the denoising steps, ReMoDiffuse first retrieves motion sequences based on the extracted text features and the relative motion length. These retrieved samples are then fed into the motion transformer layers. In each decoder layer, the noised sequence is refined by Semantics-Modulated Attention (SMA) layers and absorbs information from the given description and the retrieved samples. In the classifier-free generation process, we obtain distinct outputs under different condition combinations. To better fuse these outputs, we finetune our model on the training split to find the optimal combination of the hyper-parameters w1, w2, w3 and w4. We will introduce these components in the following subsections.

3.2. Diffusion Model for Motion Generation

Recently, diffusion models have been introduced into motion generation [30, 26]. Compared to VAE-based pipelines, the most popular motion-generative models in previous works, diffusion models strengthen the generation capacity through a stochastic diffusion process, as evidenced by their diverse and high-fidelity results. Therefore, in this work, we build our motion generation framework in cooperation with diffusion models.

Diffusion models can be parameterized as a Markov chain $p_\theta(\mathbf{x}_0) := \int p_\theta(\mathbf{x}_{0:T})\, d\mathbf{x}_{1:T}$, where $\mathbf{x}_1, \cdots, \mathbf{x}_T$ are the noised sequences distorted from the real data $\mathbf{x}_0 \sim q(\mathbf{x}_0)$. All $\mathbf{x}_t$, where $t = 0, 1, 2, \ldots, T$, are of the same dimensionality. In motion generation tasks, each $\mathbf{x}_t$ can be represented by a series of poses $\theta_i \in \mathbb{R}^D$, $i = 1, 2, \ldots, F$, where $D$ is the dimensionality of the pose representation and $F$ is the number of frames.

In the forward process of diffusion models, the posterior distribution $q(\mathbf{x}_{1:T} \vert \mathbf{x}_0)$ is implemented as a Markov chain that gradually adds Gaussian noise to the data according to a variance schedule $\beta_1, \cdots, \beta_T$:
\begin{aligned} &q(\mathbf{x}_{1:T} \vert \mathbf{x}_0) := \prod_{t=1}^{T} q(\mathbf{x}_t \vert \mathbf{x}_{t-1}), \\ &q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) := \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t\mathbf{I}). \end{aligned} \quad (1)

To efficiently acquire $\mathbf{x}_t$ from $\mathbf{x}_0$, Ho et al. [11] approximate $q(\mathbf{x}_t)$ as $\mathbf{x}_t := \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, where $\alpha_t := 1-\beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^{t}\alpha_s$.
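This closed form means $\mathbf{x}_t$ can be drawn in a single step instead of iterating the chain. Below is a minimal PyTorch-style sketch of this forward noising, assuming the linear beta schedule from Section 4.2 (0.0001 to 0.02, T = 1000); tensor shapes and names are illustrative, not taken from the released code.

```python
import torch

# Minimal sketch of the closed-form forward noising step described above.
# The linear beta schedule (0.0001 to 0.02, T = 1000) follows Section 4.2.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # beta_1, ..., beta_T
alphas = 1.0 - betas                             # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)        # alpha_bar_t = prod_{s<=t} alpha_s

def q_sample(x0: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) without iterating over intermediate steps.

    x0: clean motion, shape (B, F, D); t: integer timesteps, shape (B,).
    """
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1)         # broadcast over frames and pose dims
    return torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
```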
In diffusion models, the aforementioned forward Markov chain is reversed to learn the original motion distribution. Specifically, diffusion models are trained to denoise the noisy data $\mathbf{x}_t$ into clean data $\mathbf{x}_0$. Following MDM [26], we predict the clean state $\mathbf{x}_0$. The training target can be written as:

\label{eq:objective} \mathbb{E}_{x_0,\epsilon,t}\left[\Vert \mathbf{x}_0 - S(\mathbf{x}_t,t,\mathrm{retr},\mathrm{text}) \Vert_2^2\right], \quad (2)

where retr and text denote the conditions of the retrieved samples and the given prompt, respectively. Here $t \in U(0, T)$ denotes the timestamp, which is uniformly sampled from 0 to the maximum number of diffusion steps $T$, and $S(\mathbf{x}_t, t, \mathrm{retr}, \mathrm{text})$ is the estimated clean motion sequence given the four inputs.

During the sampling process, we can sample $\mathbf{x}_{t-1}$ from a Gaussian distribution $\mathcal{N}(\mu_\theta(\mathbf{x}_t, t, c), \beta_t)$, where $c$ denotes the conditions retr and text for simplicity. The mean of this distribution can be acquired from $\mathbf{x}_t$ and $S(\mathbf{x}_t, t, c)$ by the following equation:

\begin{aligned} & \mu_{\theta}(\mathbf{x}_t,t,c) = \sqrt{\bar{\alpha}_t}\, S(\mathbf{x}_t,t,c) + \sqrt{1 - \bar{\alpha}_t}\,\epsilon_{\theta}(\mathbf{x}_t,t,c) \\ & \epsilon_{\theta}(\mathbf{x}_t,t,c)=\Big(\frac{\mathbf{x}_t}{\sqrt{\bar{\alpha}_t}} - S(\mathbf{x}_t,t,c)\Big) \sqrt{\frac{1}{\bar{\alpha}_t}-1} \end{aligned} \quad (3)

Hence, on the basis of diffusion models, the text-driven motion generation pipeline should be able to predict the start sequence $\mathbf{x}_0$ given the conditions. In this paper, we propose a retrieval technique to enhance this denoising process. We will introduce how we retrieve motion sequences and how we fuse this information.
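To make the sampling step concrete, the sketch below transcribes Equation 3 literally: the clean-motion estimate $S(\mathbf{x}_t, t, c)$ is converted into $\epsilon_\theta$ and $\mu_\theta$, and $\mathbf{x}_{t-1}$ is drawn from $\mathcal{N}(\mu_\theta, \beta_t\mathbf{I})$. The `predict_x0` callable is a placeholder for the semantics-modulated transformer, not the released model.

```python
import torch

# Sketch of one reverse step, transcribing Equation 3 literally.
# `predict_x0` stands for the network S(x_t, t, c) and is a placeholder.
@torch.no_grad()
def p_sample(predict_x0, x_t, t: int, cond, betas, alpha_bars):
    x0_hat = predict_x0(x_t, t, cond)                        # S(x_t, t, c)
    a_bar = alpha_bars[t]
    eps_hat = (x_t / a_bar.sqrt() - x0_hat) * (1.0 / a_bar - 1.0).sqrt()
    mu = a_bar.sqrt() * x0_hat + (1.0 - a_bar).sqrt() * eps_hat
    if t == 0:
        return mu                                            # no noise at the final step
    return mu + betas[t].sqrt() * torch.randn_like(x_t)      # x_{t-1} ~ N(mu, beta_t I)
```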
3.3. Retrieval-Augmented Motion Generation

Basically, there are two stages in retrieval-based pipelines. The first stage is to retrieve appropriate samples from the database. The second stage is to acquire knowledge from these retrieved samples to refine the denoising process of the diffusion model. We introduce these two steps in detail.

Hybrid Retrieval. To support this process, we need to extract features for calculating the similarities between the given text description and the entities in the database. Considering that the retrieval procedure is not differentiable, we have to utilize pre-trained models instead of learnable architectures. An intuitive method is to compute text features for both the query text and the data points. Thanks to the pre-trained CLIP [20], we can easily evaluate semantic similarities between language descriptions. Formally, for each data point (text_i, Θ_i), we first calculate f_i^t = E_T(text_i) as the text-query feature, where E_T is the text encoder of the CLIP model.

Text features encourage the retrieval process to select samples with high semantic similarity and play a significant role in retrieving suitable samples. However, there is another kind of feature that is vital but easily overlooked: the relative difference between the expected motion length and the length of each entity in the database. Hence, the similarity score s_i between the i-th data point and the given description prompt with expected motion length L is defined as:

\begin{aligned} &s_i = \langle f^t_i, f^t_p \rangle \cdot e^{-\lambda \cdot \gamma}, \\ &f^t_p = E_T(\mathrm{prompt}), \quad \gamma = \frac{\Vert l_i - L \Vert}{\max\{l_i, L\}}, \end{aligned} \quad (4)

where ⟨·,·⟩ denotes the cosine similarity between the two given feature vectors and l_i is the length of the motion sequence Θ_i. The similarity score s_i becomes larger when the text-query feature is closer to the prompt feature. When the expected motion length is close to the length of an entity, the corresponding s_i also increases. This property is significant because motion sequences with a similar length can provide more informative features for generation. λ is a hyper-parameter that balances the magnitudes of these two different similarities.

To establish the retrieval database, we simply take all the training data as entities. Given the number of retrieved samples k, the prompt, and the expected motion length L, we sort all entities by the score s_i in Equation 4. The k most similar ones are then selected as the retrieved samples (text_i, Θ_i) and fed into the semantics-modulated attention components of the motion transformer. We will illustrate the detailed architecture in the next paragraph.
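Before moving on to the architecture, here is a small sketch of Equation 4 and the top-k selection, assuming the database stores pre-computed CLIP text features and motion lengths as tensors; the storage format and function names are our assumptions, not the released data layout. The default λ = 0.1 follows the ablation in Section 4.4.

```python
import torch
import torch.nn.functional as F

# Sketch of the hybrid retrieval score in Equation 4 plus top-k selection.
# `db_text_feat`: (N, D) pre-computed CLIP text features of database entries.
# `db_lengths`:   (N,)   motion lengths of the entries.
def retrieve(prompt_feat, db_text_feat, db_lengths, target_len, k, lam=0.1):
    # semantic term: cosine similarity between the prompt and each entry
    sem = F.cosine_similarity(prompt_feat.unsqueeze(0), db_text_feat, dim=-1)
    # kinematic term: relative difference between expected and stored lengths
    gamma = (db_lengths - target_len).abs() / torch.maximum(
        db_lengths, torch.full_like(db_lengths, float(target_len)))
    scores = sem * torch.exp(-lam * gamma)       # Eq. (4)
    return scores.topk(k).indices                # indices of the retrieved samples
```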
Network Architecture. Similar to MotionDiffuse [30] and MDM [26], we build our pipeline on the basis of transformer layers, as shown in Figure 2. In both the semantics-modulated attention modules and the FFN modules, following MotionDiffuse [30], we add a stylization block to fuse the timestamp t into the motion generation process. First, an embedding vector e_t is obtained from the timestamp t. It should be mentioned that the original design in MotionDiffuse also uses an embedding vector derived from the given prompt, which is not suitable for classifier-free guidance. Then, for each block, a residual shortcut is applied between the input X ∈ R^{n×d} and the output Y ∈ R^{n×d}, where n is the number of elements and d is the dimensionality.

Two major difficulties should be resolved to better explore knowledge from the retrieved samples. First, in the literature on motion diffusion models [30, 26], the resolution of motion sequences is not reduced through the denoising process. The maximum length of one motion sequence is around 200 frames in the HumanML3D [8] dataset, leading to a dramatic computational cost, especially when we expect to retrieve more samples. Hence, efficiency is highly prioritized for the information fusion component. Second, the semantic relation between the retrieved samples and the given prompt is complicated. For example, 'a person is walking forward' and 'a person is walking forward slowly' are highly similar. However, these two prompts lead to two distinct motion sequences regarding pace and intensity. Therefore, the model should know which motion features can be borrowed, guided by the difference between the language descriptions.

Based on these observations, we design two encoders to extract text features and motion features from the retrieved data, respectively. As for the motion features, we expect them to provide low-level information while keeping the computational cost acceptable. Therefore, we build a series of encoder layers with alternating Semantics-Modulated Attention (SMA) modules and FFN modules. This motion encoder processes raw motion sequences into usable features. To reduce the computational cost, we down-sample the sequence to 1/4 of the original FPS, which is denoted as R_m ∈ R^{F'·k×D}, where F' is the number of frames after down-sampling and k is the number of retrieved samples. This simple strategy greatly decreases the computation with little information lost. As for the text encoder, the feature R_t ∈ R^{k×D} from the last token is taken to represent the global semantic information. R_m and R_t constitute the features we need for retrieval-based augmentation.

Semantics-Modulated Attention. These extracted features are passed to the cross attention component, as shown in Figure 2. The noised motion sequence forms the query vector Q ∈ R^{F×D}. For the key vector K and the value vector V, we consider three sources of data: 1) the motion sequence f_Θ ∈ R^{F×D} itself. As shown in Figure 2, our proposed transformer does not contain a separate self-attention module; instead, we fold the function of self-attention into the SMA; 2) the text condition f_prompt, which semantically describes the expected motion sequence and is extracted as in MotionDiffuse [30]. Specifically, the prompt is first fed into the pre-trained CLIP model to get a feature sequence, which is further processed by two learnable transformer encoder layers; 3) the features R_m and R_t from the retrieved samples. We simply concatenate f_Θ, f_prompt and R_m for the value vector V, and f_Θ, f_prompt and [R_m; R_t] for the key vector K, where [·;·] denotes the concatenation of both terms. This design allows our proposed method to fuse low-level motion information from the retrieved samples while fully considering the semantic similarities. The acquired vectors Q, K, V are sent to a Linear Attention [23] module for efficient computation.
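The sketch below illustrates one possible reading of this key/value construction: each down-sampled retrieved frame is paired with its sample-level text feature to form [R_m; R_t], and a single-head efficient-attention kernel in the spirit of [23] is applied. The broadcasting of R_t, the fusion layer, and the single-head simplification are our assumptions, not the released SMA layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the SMA key/value assembly under the assumptions stated above.
class SMASketch(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.key_fuse = nn.Linear(2 * dim, dim)   # fuses [R_m; R_t] back to D channels

    def forward(self, f_theta, f_prompt, r_m, r_t):
        # f_theta: (B, F, D)    noised motion tokens
        # f_prompt: (B, Lt, D)  processed CLIP prompt tokens
        # r_m: (B, k, F_ds, D)  down-sampled retrieved motion tokens
        # r_t: (B, k, D)        one text feature per retrieved sample
        B, k, F_ds, D = r_m.shape
        r_m_flat = r_m.reshape(B, k * F_ds, D)
        r_t_rep = r_t.unsqueeze(2).expand(-1, -1, F_ds, -1).reshape(B, k * F_ds, D)
        retr_key = self.key_fuse(torch.cat([r_m_flat, r_t_rep], dim=-1))

        q = self.to_q(f_theta)
        k_all = self.to_k(torch.cat([f_theta, f_prompt, retr_key], dim=1))
        v_all = self.to_v(torch.cat([f_theta, f_prompt, r_m_flat], dim=1))

        # Efficient attention in the spirit of [23]: softmax over different axes
        # gives cost linear in the sequence lengths.
        q = F.softmax(q, dim=-1)
        k_all = F.softmax(k_all, dim=1)
        context = torch.einsum('bnd,bne->bde', k_all, v_all)   # (B, D, D)
        return torch.einsum('bnd,bde->bne', q, context)        # (B, F, D)
```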
Stylization Block. Similar to MotionDiffuse [30] and MDM [26], we build our pipeline on the basis of transformer layers. In both the semantics-modulated attention modules and the FFN modules, following MotionDiffuse [30], we add a stylization block to fuse the timestamp t into the motion generation process. First, an embedding vector e_t is obtained from the timestamp t.
Then, for each block, a residual shortcut is applied between the input X ∈ R^{n×d} and the output Y ∈ R^{n×d}, where n is the number of elements and d is the dimensionality. The detailed structure is shown in Figure 3.

Figure 3: Architecture of the stylization block. This module is adapted from MotionDiffuse [30]. We remove the prompt embedding from the original design to better support classifier-free guidance. The module injects the information of the current timestamp into the feature representation, which is necessary for the denoising steps. Specifically, the timestamp embedding e_t is fed into a series of transformation layers. Two embeddings are generated and serve as an additive offset and a multiplicative offset to the original feature map, respectively.
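A minimal sketch of such a stylization block follows the description in the caption of Figure 3: the timestep embedding is mapped to a multiplicative and an additive offset that modulate the feature map, with a residual shortcut around the block. The exact layer sizes and the SiLU activation are assumptions.

```python
import torch
import torch.nn as nn

# Sketch of a stylization block: scale-and-shift modulation by the timestep
# embedding, followed by a residual shortcut. Layer choices are assumptions.
class StylizationBlockSketch(nn.Module):
    def __init__(self, dim: int, time_dim: int):
        super().__init__()
        self.to_scale_shift = nn.Sequential(nn.SiLU(), nn.Linear(time_dim, 2 * dim))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, n, dim) feature map; t_emb: (B, time_dim) timestep embedding e_t
        scale, shift = self.to_scale_shift(t_emb).unsqueeze(1).chunk(2, dim=-1)
        y = self.proj(x * (1 + scale) + shift)   # multiplicative and additive offsets
        return x + y                             # residual shortcut around the block
```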
3.4. Condition Mixture

Classifier-free guidance enables us to generate motion sequences with both high fidelity and consistency with the given text description. A typical formulation is:

\epsilon = w \cdot \epsilon_{\theta}(\mathbf{x}_t,t,\mathrm{text}) - (w - 1) \cdot \epsilon_{\theta}(\mathbf{x}_t,t), \quad (5)

where w is a hyper-parameter that balances text-consistency and motion quality. In our proposed retrieval-augmented diffusion pipeline, the retrieved samples can be regarded as an additional condition. Therefore, we obtain four estimations: S(x_t, t, retr, text), S(x_t, t, retr), S(x_t, t, text), and S(x_t, t), and we need four parameters to balance these terms. To achieve a better performance, we suggest a Condition Mixture technique for this purpose. Specifically, given the pre-trained Semantics-Modulated Transformer (SMT), we optimize the values of w_1, w_2, w_3, w_4 and obtain the final output Ŝ as:

\label{eq:output} \begin{aligned} \widehat{S} =\ & w_1 \cdot S(\mathbf{x}_t,t,\mathrm{retr},\mathrm{text}) + w_2 \cdot S(\mathbf{x}_t,t,\mathrm{text})\ + \\ & w_3 \cdot S(\mathbf{x}_t,t,\mathrm{retr}) + w_4 \cdot S(\mathbf{x}_t,t). \end{aligned} \quad (6)

Empirically, we find that the tendency of the Frechet Inception Distance (FID) is similar to that of Precision when the hyper-parameters are nearly optimal. Hence, we only attempt to minimize the FID in this procedure.
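Equation 6 amounts to one weighted sum of the four conditional estimates per denoising step. In the sketch below, `model` is a placeholder for the SMT, and passing None for a condition stands in for the masked branch used in classifier-free guidance; this calling convention is an assumption.

```python
# Sketch of the Condition Mixture in Equation 6.
def condition_mixture(model, x_t, t, retr, text, w):
    # w = (w1, w2, w3, w4); the paper constrains w1 + w2 + w3 + w4 = 1.
    s_full   = model(x_t, t, retr=retr, text=text)
    s_text   = model(x_t, t, retr=None, text=text)
    s_retr   = model(x_t, t, retr=retr, text=None)
    s_uncond = model(x_t, t, retr=None, text=None)
    return w[0] * s_full + w[1] * s_text + w[2] * s_retr + w[3] * s_uncond
```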
Contrastive Model. To imitate the evaluator used in the standard evaluation process, we train our own contrastive model, which aims at encoding the paired text descriptions and motion sequences into a joint embedding space. For the motion encoder, we use a 4-layer ACTOR [17] encoder. The text encoder is identical to the one we used in ReMoDiffuse; the only difference is that we require a sentence feature instead of a sequence of word features. We train this contrastive learning model with the same loss as in Guo et al. [8]. 20K and 40K optimization steps are applied for the KIT-ML and HumanML3D datasets, respectively.

Parameter Finetuning. As mentioned before, we only use 50 denoising steps to generate motion sequences in the inference stage. However, it is impractical to back-propagate gradients through this many forward passes. To simplify the problem, we divide all denoising steps into the first 40 steps and the last ten steps. In the first part, we use grid search to find a better parameter combination. Specifically, for Equation 6, we search w_1 and w_2 over [−5, 5] with step 0.5 to find the best parameters for each model. Here we take inspiration from Re-Imagen [5] and set w_4 = 0. Besides, to retain the output's statistics, we require w_1 + w_2 + w_3 + w_4 = 1. These two properties enable us to find the optimal combination by only searching the values of w_1 and w_2. The evaluation metric is the FID between our generated sequences and the natural motion sequences in the training split, computed with our trained contrastive model. This search aims to find the combination of w_1 and w_2 that achieves the lowest FID.

In the second stage, we use an end-to-end training scheme to optimize w_1, w_2 and w_3, while w_4 is obtained as 1 − w_1 − w_2 − w_3. During training, we use the searched parameters to perform the first 40 denoising steps; after that, we auto-regressively denoise the motion sequence with the learnable w_1, w_2 and w_3. The training objective here is also to reduce the FID. We use the Adam optimizer and train on the training split for 1K steps on both the HumanML3D and KIT-ML datasets to find the best parameter combination.
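The first-stage search can be written as a plain double loop: with w_4 = 0 and the weights constrained to sum to one, only w_1 and w_2 remain free. In the sketch below, `evaluate_fid` is a placeholder that generates motions with the candidate weights and scores them with the trained contrastive evaluator.

```python
import numpy as np

# Sketch of the first-stage grid search: w4 = 0 and the weights sum to one,
# so only w1 and w2 are searched over [-5, 5] with step 0.5.
def grid_search(evaluate_fid, lo=-5.0, hi=5.0, step=0.5):
    best = (None, float('inf'))
    for w1 in np.arange(lo, hi + step, step):
        for w2 in np.arange(lo, hi + step, step):
            w3 = 1.0 - w1 - w2          # keep the weights summing to one
            w4 = 0.0                    # following Re-Imagen [5]
            fid = evaluate_fid((w1, w2, w3, w4))
            if fid < best[1]:
                best = ((w1, w2, w3, w4), fid)
    return best
```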
3.5. Training and Inference

Model Training. Inspired by the classifier-free technique, 10% of the text conditions and 10% of the retrieval conditions are independently and randomly masked to approximate p(x_0). The training objective is to minimize the mean square error between the predicted initial sequence and the ground truth, as shown in Equation 2. In the training stage, we typically use a 1000-step diffusion process.
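A sketch of this condition masking follows; the convention of signalling a dropped condition by passing None is our assumption, not the released training code.

```python
import torch

# Sketch: drop the text and retrieval conditions independently with prob. 0.1,
# so the model also learns the partially and fully unconditional branches.
def mask_conditions(text_cond, retr_cond, p: float = 0.1):
    if torch.rand(1).item() < p:
        text_cond = None
    if torch.rand(1).item() < p:
        retr_cond = None
    return text_cond, retr_cond
```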
Model Inference. During each denoising step, we use the learned coefficients w_1, w_2, w_3 and w_4 to obtain Ŝ as in Equation 6. To reduce the computation cost introduced by the retrieved samples, we pre-process all f_i^v, f_i^t, R_t and R_m so that no computation is repeated across different syntheses. Different from the training stage, we carefully reduce the whole denoising process to 50 steps during inference, which enables our model to generate high-quality motion sequences efficiently.

Table 1: Evaluation results of different evaluators.

Methods | Dataset | R Precision Top 1↑ | Top 2↑ | Top 3↑ | FID↓ | MM Dist↓ | Diversity↑
Guo et al. | HumanML3D | 0.511±.003 | 0.703±.003 | 0.797±.002 | 0.002±.000 | 2.974±.008 | 9.503±.065
Ours | HumanML3D | 0.539±.004 | 0.721±.003 | 0.810±.003 | 0.001±.000 | 1.462±.006 | 5.298±.047
Guo et al. | KIT-ML | 0.424±.005 | 0.649±.006 | 0.779±.006 | 0.031±.004 | 2.788±.012 | 11.08±.097
Ours | KIT-ML | 0.475±.006 | 0.690±.004 | 0.791±.005 | 0.002±.000 | 1.337±.012 | 6.371±.058
Table 2: Quantitative results on the HumanML3D test set. For a fair comparison, all methods use the real motion length from the ground truth as extra given information. '↑' ('↓') indicates that larger (smaller) values are better. We run all evaluations 20 times. x±y indicates that the average metric is x and the 95% confidence interval is y. The best result and the second best result are in red cells and blue cells, respectively.

Methods | R Precision Top 1↑ | Top 2↑ | Top 3↑ | FID↓ | MM Dist↓ | Diversity↑ | MultiModality↑
Real motions | 0.511±.003 | 0.703±.003 | 0.797±.002 | 0.002±.000 | 2.974±.008 | 9.503±.065 | -
Language2Pose [1] | 0.246±.002 | 0.387±.002 | 0.486±.002 | 11.02±.046 | 5.296±.008 | 7.676±.058 | -
Text2Gesture [3] | 0.165±.001 | 0.267±.002 | 0.345±.002 | 7.664±.030 | 6.030±.008 | 6.409±.071 | -
MoCoGAN [27] | 0.037±.000 | 0.072±.001 | 0.106±.001 | 94.41±.021 | 9.643±.006 | 0.462±.008 | 0.019±.000
Dance2Music [13] | 0.033±.000 | 0.065±.001 | 0.097±.001 | 66.98±.016 | 8.116±.006 | 0.725±.011 | 0.043±.001
Guo et al. [8] | 0.457±.002 | 0.639±.003 | 0.740±.003 | 1.067±.002 | 3.340±.008 | 9.188±.002 | 2.090±.083
MDM [26] | - | - | 0.611±.007 | 0.544±.044 | 5.566±.027 | 9.559±.086 | 2.799±.072
MotionDiffuse [30] | 0.491±.001 | 0.681±.001 | 0.782±.001 | 0.630±.001 | 3.113±.001 | 9.410±.049 | 1.553±.042
T2M-GPT [29] | 0.491±.003 | 0.680±.003 | 0.775±.002 | 0.116±.004 | 3.118±.011 | 9.761±.081 | 1.856±.011
Ours | 0.510±.005 | 0.698±.006 | 0.795±.004 | 0.103±.004 | 2.974±.016 | 9.018±.075 | 1.795±.043
4. Experiments

4.1. Datasets and Metrics

Datasets. We evaluate our proposed framework on the KIT dataset [19] and the HumanML3D dataset [8], two leading benchmarks for text-driven motion generation. The KIT Motion-Language Dataset is an open dataset combining human motion and natural language, which contains 3,911 motions and 6,363 natural language annotations. HumanML3D is a scripted 3D human motion dataset that originates from and textually reannotates the HumanAct12 [10] and AMASS [14] datasets. Overall, HumanML3D consists of 14,616 motions and 44,970 descriptions.

Evaluation Metrics. We follow the performance measures employed in MotionDiffuse for quantitative evaluations, namely Frechet Inception Distance (FID), R Precision, Diversity, MultiModality, and Multi-Modal Distance. (1) FID is an objective metric computing the distance between features extracted from real and generated motion sequences, and it strongly reflects generation quality. (2) R Precision measures the similarity between the text description and the generated motion sequence; it indicates the probability that the real text appears in the top k after sorting, where k is taken to be 1, 2, and 3 in this work. (3) Diversity measures the variability and richness of the generated motion sequences. (4) MultiModality measures the average variance of generated motion sequences given a single text description. (5) Multi-Modal Distance (MM Dist for short) is the average Euclidean distance between the motion feature and its corresponding text description feature.

4.2. Implementation Details

We use similar settings on the HumanML3D and KIT-ML datasets. For the motion encoder, a 4-layer transformer is used, and the latent dimension is 512. For the text encoder, a frozen text encoder from CLIP ViT-B/32, together with two additional transformer encoder layers, is built and applied. For the diffusion model, the variances βt are pre-defined to spread linearly from 0.0001 to 0.02, and the total number of noising steps is set to T = 1000. Adam is adopted as the optimizer, with a learning rate of 0.0002. One Tesla V100 is used for training, and the batch size on a single GPU is 128. Training on KIT-ML and HumanML3D is carried out for 40k and 200k steps, respectively.
Table 3: Quantitative results on the KIT-ML test set.

Methods | R Precision Top 1↑ | Top 2↑ | Top 3↑ | FID↓ | MM Dist↓ | Diversity↑ | MultiModality↑
Real motions | 0.424±.005 | 0.649±.006 | 0.779±.006 | 0.031±.004 | 2.788±.012 | 11.08±.097 | -
Language2Pose [1] | 0.221±.005 | 0.373±.004 | 0.483±.005 | 6.545±.072 | 5.147±.030 | 9.073±.100 | -
Text2Gesture [3] | 0.156±.004 | 0.255±.004 | 0.338±.005 | 12.12±.183 | 6.964±.029 | 9.334±.079 | -
MoCoGAN [27] | 0.022±.002 | 0.042±.003 | 0.063±.003 | 82.69±.242 | 10.47±.012 | 3.091±.043 | 0.250±.009
Dance2Music [13] | 0.031±.002 | 0.058±.002 | 0.086±.003 | 115.4±.240 | 10.40±.016 | 0.241±.004 | 0.062±.002
Guo et al. [8] | 0.370±.005 | 0.569±.007 | 0.693±.007 | 2.770±.109 | 3.401±.008 | 10.91±.119 | 1.482±.065
MDM [26] | - | - | 0.396±.004 | 0.497±.021 | 9.191±.022 | 10.847±.109 | 1.907±.214
MotionDiffuse [30] | 0.417±.004 | 0.621±.004 | 0.739±.004 | 1.954±.062 | 2.958±.005 | 11.10±.143 | 0.730±.013
T2M-GPT [29] | 0.416±.006 | 0.627±.006 | 0.745±.006 | 0.514±.029 | 3.007±.023 | 10.921±.108 | 1.570±.039
Ours | 0.427±.014 | 0.641±.004 | 0.765±.055 | 0.155±.006 | 2.814±.012 | 10.80±.105 | 1.239±.028
Figure 4: Visual comparison between previous works and ReMoDiffuse for the prompt "A person skips in a circle". We draw black lines to show the translation path. For both given conditions, only ReMoDiffuse conveys the action and the path condition accurately.
The pose representation in this work follows the schema used by Guo et al. [8]. The pose is defined as a tuple of length seven: (r_va, r_vx, r_vz, r_h, j_p, j_v, j_r), where r_va ∈ R is the root angular velocity along the Y-axis, and r_vx, r_vz ∈ R are the root linear velocities along the X-axis and Z-axis, respectively. r_h ∈ R is the root height. j_p, j_v ∈ R^{J×3} are the local joint positions and velocities. j_r ∈ R^{J×6} is the 6D continuous local joint rotation. J denotes the number of joints; in HumanML3D and KIT-ML, J is 22 and 21, respectively.
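For illustration, the sketch below flattens one pose tuple into a single feature vector using the component shapes stated above; the packing order is illustrative and not necessarily the exact layout used by the released code.

```python
import numpy as np

# Sketch: flatten one pose tuple into a feature vector, using the component
# shapes stated in the paragraph above (layout is illustrative only).
def flatten_pose(r_va, r_vx, r_vz, r_h, j_p, j_v, j_r):
    # r_va, r_vx, r_vz, r_h: scalars; j_p, j_v: (J, 3); j_r: (J, 6)
    return np.concatenate([
        np.array([r_va, r_vx, r_vz, r_h], dtype=np.float32),
        j_p.reshape(-1), j_v.reshape(-1), j_r.reshape(-1),
    ])

J = 22                                   # HumanML3D (21 for KIT-ML)
pose = flatten_pose(0.0, 0.0, 0.0, 0.9,
                    np.zeros((J, 3)), np.zeros((J, 3)), np.zeros((J, 6)))
print(pose.shape)                        # dimensionality implied by the schema above
```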
4.3. Main Results

Table 2 and Table 3 show the comparison between our proposed ReMoDiffuse and four other existing works, including recent diffusion model-based algorithms [26, 30], one VAE-based generative model [8], and one GPT-style generative model [29].

Compared to other diffusion model-based pipelines, our proposed ReMoDiffuse achieves a better balance between condition-consistency and fidelity. It should be noted that ReMoDiffuse is the first work to achieve state-of-the-art results on both metrics, which demonstrates the superiority of the proposed pipeline.

4.4. Ablation Study

Retrieval Techniques. First, we investigate the influence of different retrieval techniques. To directly evaluate the similarity between the target samples and the retrieved samples, we use the retrieved samples themselves as generated results and calculate the FID metric for them. We try different λ to balance the terms of semantic similarity and kinematic similarity. The results are shown in Figure 6. λ = 0 means that kinematic similarity does not influence the retrieval process, and its retrieval quality is unacceptable. This result supports our claim that kinematic similarity is significant for retrieval quality. The optimal value of λ is 0.1 for both the KIT-ML and HumanML3D datasets.

Motion Refinement. We further evaluate the proposed cross attention component of our retrieval-augmented motion generation. In Table 4, when using the text feature, the FID improves remarkably. This strongly supports our claim that text features are highly significant in hybrid retrieval, which is not discussed in text-to-image generation tasks. Besides, the proposed retrieval techniques outperform the baseline by a remarkable margin.
Figure 5: Rareness distribution of the HumanML3D test split. We split all test cases into 100 bins according to their Rareness values.
Table 5: Examples of Rareness in the HumanML3D test set.

very challenging to current methods. Hence, these examples build up a more difficult and realistic environment for method evaluation.

Results and Analysis. Table 6 shows the generalization ability of three different methods. For the baseline model, we simply drop the retrieval technique. From this table we can see that, with our proposed retrieval technique, ReMoDiffuse outperforms both the baseline model and state-of-the-art methods by a remarkable margin.

Table 6: Evaluation of generalization ability. All results are reported on the KIT test set. The best results are in bold.

Method | MM↓ | tail 5% MM↓ | balanced MM↓
MotionDiffuse | 2.958 | 5.928 | 4.285
Baseline | 3.371 | 6.173 | 4.661
Ours | 2.814 | 5.439 | 4.028
Δ | 0.557 | 0.734 | 0.633

4.6. Qualitative Results

To illustrate the effectiveness of ReMoDiffuse, we provide a qualitative comparison between previous works and ReMoDiffuse. More examples are available on the project page. As shown in Figure 4, ReMoDiffuse stands out as the only approach that effectively conveys text descriptions involving both action and path information. In contrast, the method of Guo et al. falls short in capturing path descriptions. MotionDiffuse performs well on action categories but lacks precision in providing path details. Meanwhile, MDM captures path information, but its generated actions are incorrect. In the examples evaluated, ReMoDiffuse demonstrates its capability to appropriately structure and present the content.

5. Conclusion

In this paper, we present ReMoDiffuse, a retrieval-augmented motion diffusion model for text-driven motion generation. Equipped with a multi-modality retrieval technique, the semantics-modulated attention mechanism, and a learnable condition mixture strategy, ReMoDiffuse efficiently explores and utilizes appropriate knowledge from an auxiliary database to refine the denoising process without expensive computation. Quantitative and qualitative experiments demonstrate that ReMoDiffuse achieves superior performance in text-driven motion generation, particularly for uncommon motions.

Social Impacts. This technique can be used to create fake media when combined with 3D avatar generation. The manipulated media may convey incidents that never truly happened and can serve malicious purposes.
References

[1] Chaitanya Ahuja and Louis-Philippe Morency. Language2pose: Natural language grounded pose forecasting. In 2019 International Conference on 3D Vision (3DV), pages 719-728. IEEE, 2019.
[2] Nikos Athanasiou, Mathis Petrovich, Michael J Black, and Gül Varol. Teach: Temporal action composition for 3d humans. arXiv preprint arXiv:2209.04066, 2022.
[3] Uttaran Bhattacharya, Nicholas Rewkowski, Abhishek Banerjee, Pooja Guhan, Aniket Bera, and Dinesh Manocha. Text2gestures: A transformer-based network for generating emotive body gestures for virtual agents. In 2021 IEEE Virtual Reality and 3D User Interfaces (VR), pages 1-10. IEEE, 2021.
[4] Andreas Blattmann, Robin Rombach, Kaan Oktay, and Björn Ommer. Retrieval-augmented diffusion models. arXiv preprint arXiv:2204.11824, 2022.
[5] Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491, 2022.
[6] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34, 2021.
[7] Anindita Ghosh, Noshaba Cheema, Cennet Oguz, Christian Theobalt, and Philipp Slusallek. Synthesis of compositional animations from textual descriptions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1396-1406, 2021.
[8] Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5152-5161, 2022.
[9] Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng. Tm2t: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts. arXiv preprint arXiv:2207.01696, 2022.
[10] Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng. Action2motion: Conditioned generation of 3d human motions. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2021-2029, 2020.
[11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.
[12] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[13] Hsin-Ying Lee, Xiaodong Yang, Ming-Yu Liu, Ting-Chun Wang, Yu-Ding Lu, Ming-Hsuan Yang, and Jan Kautz. Dancing to music. Advances in Neural Information Processing Systems, 32, 2019.
[14] Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5442-5451, 2019.
[15] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[16] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162-8171. PMLR, 2021.
[17] Mathis Petrovich, Michael J Black, and Gül Varol. Action-conditioned 3d human motion synthesis with transformer vae. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10985-10995, 2021.
[18] Mathis Petrovich, Michael J Black, and Gül Varol. Temos: Generating diverse human motions from textual descriptions. arXiv preprint arXiv:2204.14109, 2022.
[19] Matthias Plappert, Christian Mandery, and Tamim Asfour. The kit motion-language dataset. Big Data, 4(4):236-252, 2016.
[20] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
[21] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[22] Jiawei Ren, Mingyuan Zhang, Cunjun Yu, and Ziwei Liu. Balanced mse for imbalanced visual regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7926-7935, 2022.
[23] Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3531-3539, 2021.
[24] Shelly Sheynin, Oron Ashual, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Nachmani, and Yaniv Taigman. Knn-diffusion: Image generation via large-scale retrieval. arXiv preprint arXiv:2204.02849, 2022.
[25] Guy Tevet, Brian Gordon, Amir Hertz, Amit H Bermano, and Daniel Cohen-Or. Motionclip: Exposing human motion generation to clip space. arXiv preprint arXiv:2203.08063, 2022.
[26] Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H Bermano. Human motion diffusion model. arXiv preprint arXiv:2209.14916, 2022.
[27] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. Mocogan: Decomposing motion and content for video generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1526-1535, 2018.
[28] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[29] Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, and Xi Shen. T2m-gpt: Generating human motion from textual descriptions with discrete representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[30] Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu. Motiondiffuse: Text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001, 2022.