ReMoDiffuse - Retrieval-Augmented Motion Diffusion Model
Mingyuan Zhang1 , Xinying Guo1 , Liang Pan1 , Zhongang Cai12 , Fangzhou Hong1 , Huirong Li1 ,
Lei Yang2 , Ziwei Liu1
1 S-Lab, Nanyang Technological University, Singapore
2 SenseTime, China
{mingyuan001,XGUO012}@e.ntu.edu.sg, [email protected], [email protected]
ates realistic and diverse actions while allowing for multi-level motion manipulation in both spatial and temporal dimensions. MDM [26] predicts the sample itself and uses geometric losses as training constraints. While these methods have achieved impressive results, they are not versatile enough for uncommon condition signals.

Some recent works on text-to-image generation utilize retrieval methods to complement the model framework, providing a retrieval-augmented pipeline to tackle the above issue [24, 5, 4]. However, simply transferring these methods into the text-driven motion generation field is impractical due to three new challenges. Firstly, the similarity between the target motion sequence and the elements in the database is complicated. We need to evaluate both semantic and kinematic similarities to find related knowledge. Secondly, a single motion sequence usually contains several atomic actions. It is necessary to learn from the retrieved samples selectively, and in this procedure the model should be aware of the semantic difference between the given prompt and the retrieved samples. Lastly, motion diffusion models are sensitive to the guidance scale in classifier-free guidance, especially when we supply another condition, the retrieved samples.

In this paper, we propose a new text-driven motion generation pipeline, ReMoDiffuse, which addresses the above-mentioned challenges and thoroughly benefits from retrieval techniques to generate diverse and high-quality motion sequences. ReMoDiffuse includes two stages: a retrieval stage and a refinement stage. In the retrieval stage, we aim to acquire the most informative samples to provide useful guidance for the denoising process. Here we consider both semantic and kinematic similarities and suggest a Hybrid Retrieval technique to achieve this objective. In the refinement stage, we design a Semantics-Modulated Transformer to leverage knowledge retrieved from an extra multi-modal database and generate semantics-consistent motion sequences. During inference, a Condition Mixture technique enables our model to generate high-fidelity and description-consistent motion sequences. We evaluate our proposed ReMoDiffuse on two standard text-to-motion generation benchmarks, HumanML3D [8] and KIT-ML [19]. Extensive quantitative results demonstrate that ReMoDiffuse outperforms other existing motion generation pipelines by a significant margin. Additionally, we propose several new metrics for quantitative comparisons on uncommon samples. We find that ReMoDiffuse significantly improves the generation quality on rare samples, demonstrating its superior generalizability.

To summarize, our contributions are threefold: 1) We carefully design a retrieval-augmented motion diffusion model which efficiently and effectively explores the knowledge from retrieved samples; 2) We suggest new metrics to evaluate the model's generalizability under different scenarios comprehensively; 3) Extensive qualitative and quantitative experiments show that our generated motion sequences achieve higher generalizability on both common and uncommon prompts.

2. Related Work

2.1. Diffusion Models

Diffusion models [11, 16] are a new class of generative models that have achieved impressive progress on text-to-image generation tasks. Dhariwal and Nichol [6] propose a diffusion model-based generative model that first outperforms Generative Adversarial Networks (GANs) and establishes a new state of the art in text-driven image generation. The success of this advanced generative model quickly attracted attention from researchers worldwide. GLIDE [15] designs classifier-free guidance and proves its superiority compared to the CLIP guidance used in previous works. DALL-E 2 [21] attempts to bridge the text embedding and the image embedding from CLIP [20]. It includes another diffusion model that synthesizes an image embedding from the text embedding.

Recently, some works have focused on employing retrieval methods as complements to the model framework, providing a way to enhance generalizability. KNN-Diffusion [24] uses k-Nearest Neighbors (kNN) to train an efficient text-to-image model without any text, enabling the model to adapt to novel samples. RDM [4] replaces the retrieval examples with user-assigned images and can then effectively transfer the artistic style of these images into the generated one. Re-Imagen [5] leverages knowledge from an external database to free the model from memorizing rare features, striking a good balance between fidelity and diversity.

2.2. Text-Driven Motion Generation

Text-driven motion generation has witnessed significant progress recently. Earlier works focus on learning a joint embedding space between motion sequences and language descriptions deterministically. JL2P [1] attempts to create a joint embedding space by applying the same reconstruction task to both the text and motion embeddings. Specifically, JL2P encodes the input text and motion data separately with two modality-specific encoders. A motion decoder is then applied to both embeddings to reconstruct the original motion sequences, which are expected to be the same as the initial input. Ghosh et al. [7] further develop this idea by manually dividing each pose sequence into an upper part and a lower part to represent two different body regions. In addition, their method integrates a pose discriminator to further improve the generation quality. MotionCLIP [25] attempts to enhance the generalizability of text-to-motion generation.
Figure 2: Overview of the proposed ReMoDiffuse. a) Hybrid retrieval database stores various features of each training
data. The pre-processed text feature and relative difference of motion length are sent to calculate the similarity with the
given language description. The most similar ones are fed into the semantics-modulated transformer (SMT), serving as addi-
tional clues for motion generation. b) Semantics-modulated transformer incorporates N identical decoder layers, including
a semantics-modulated attention (SMA) layer and an FFN layer. The figure shows the detailed architecture of the SMA module.
CLIP’s extracted text features fprompt from the given prompt, features Rt and Rm from the retrieved samples, and current
motion features fΘ will further refine the noised motion sequence. c) To synthesize diverse and realistic motion sequences,
starting from the pure noised sample, the motion transformer repeatedly eliminates the noise. To better mix outputs under
different combinations of conditions, we suggest a training strategy to find the optimal hyper-parameters w1 , w2 , w3 and w4 .
It enforces the motion embedding to be similar to the text and image embeddings of critical poses. These two embeddings are acquired from CLIP [20], which excels at encoding texts and images into a joint space. Consequently, MotionCLIP can generate motion sequences for unseen descriptions.

To improve the diversity of generated motion sequences, previous works introduce variational mechanisms. TEMOS [18] employs a Variational Autoencoder (VAE) [12] to replace the deterministic auto-encoder structures. Besides, different from the recurrent neural networks used in previous works, both the motion encoder and the motion decoder in TEMOS are based on transformer architectures [28]. Guo et al. [8] propose an auto-regressive conditional VAE, which is conditioned on both the text feature and the previously generated frames. Given these conditions, the proposed pipeline generates four successive frames as a unit. TEACH [2] also exploits auto-regressive models but over a larger length range. It can synthesize a long motion sequence from the given description and the previous sequence, and can therefore generate motion sequences with different actions continuously. TM2T [9] regards the text-driven motion generation task as a translation task between natural languages and motion sequences. Most recently, T2M-GPT [29] quantizes motion clips into discrete tokens and uses a transformer to automatically generate later tokens.

Inspired by the success of diffusion models in text-to-image generation tasks, some recent works have adapted this advanced generative model to motion generation tasks. MotionDiffuse [30] is an efficient DDPM-based architecture for plausible and controllable text-driven motion generation. It generates realistic and diverse actions and allows for multi-level motion manipulation in both spatial and temporal dimensions. MDM [26] is a lightweight diffusion model featuring a transformer-encoder backbone. It makes predictions of the sample rather than the noise, so that geometric losses are supported as training constraints. Although these methods have outstanding performance on text-driven motion generation tasks, they are not versatile enough for uncommon condition signals. In this paper, we equip the diffusion model-based architecture with retrieval capability, enhancing its generalizability.

3. Our Approach

In this paper, we present a Retrieval-augmented Motion Diffusion model (ReMoDiffuse). We first describe the overall architecture of the proposed method in Section 3.1. The background knowledge about the motion diffusion model is discussed in Section 3.2. Then we introduce our proposed retrieval techniques and the corresponding model structure in Section 3.3. Finally, we introduce the training objective and sampling strategy in Section 3.5.
3.1. Framework Overview

Figure 2 shows the overall architecture of ReMoDiffuse. We establish the whole pipeline on the basis of MotionDiffuse [30], which incorporates diffusion models and a series of transformer decoder layers. To strengthen its generalizability, we extract features from two different modalities to establish the retrieval database. During the denoising steps, ReMoDiffuse first retrieves motion sequences based on the extracted text features and the relative motion length. These retrieved samples are then fed into the motion transformer layers. In each decoder layer, the noised sequence is refined by Semantics-Modulated Attention (SMA) layers and absorbs information from the given description and the retrieved samples. In the classifier-free generation process, we obtain distinct outputs under different condition combinations. To better fuse these outputs, we finetune our model on the training split to find the optimal combination of the hyper-parameters w1, w2, w3 and w4. We will introduce these components in the following subsections.

3.2. Diffusion Model for Motion Generation

Recently, diffusion models have been introduced into motion generation [30, 26]. Compared to VAE-based pipelines, the most popular motion-generative models in previous works, diffusion models strengthen the generation capacity through a stochastic diffusion process, as evidenced by their diverse and high-fidelity results. Therefore, in this work, we build our motion generation framework in cooperation with diffusion models.

Diffusion models can be parameterized as a Markov chain $p_\theta(\mathbf{x}_0) := \int p_\theta(\mathbf{x}_{0:T})\, d\mathbf{x}_{1:T}$, where $\mathbf{x}_1, \cdots, \mathbf{x}_T$ are the noised sequences distorted from the real data $\mathbf{x}_0 \sim q(\mathbf{x}_0)$. All $\mathbf{x}_t$, where $t = 0, 1, 2, \ldots, T$, are of the same dimensionality. In motion generation tasks, each $\mathbf{x}_t$ can be represented by a series of poses $\theta_i \in \mathbb{R}^D$, $i = 1, 2, \ldots, F$, where $D$ is the dimensionality of the pose representation and $F$ is the number of frames.

In the forward process of diffusion models, the posterior distribution $q(\mathbf{x}_{1:T} \vert \mathbf{x}_0)$ is implemented as a Markov chain that gradually adds Gaussian noise to the data according to a variance schedule $\beta_1, \cdots, \beta_T$:
\begin{aligned} &q(\mathbf{x}_{1:T} \vert \mathbf{x}_0) := \prod_{t=1}^{T} q(\mathbf{x}_t \vert \mathbf{x}_{t-1}), \\ &q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) := \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t\mathbf{I}). \end{aligned} \quad (1)

To efficiently acquire $\mathbf{x}_t$ from $\mathbf{x}_0$, Ho et al. [11] approximate $q(\mathbf{x}_t)$ as $\mathbf{x}_t := \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, where $\alpha_t := 1-\beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^{t}\alpha_s$.
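This closed form means $\mathbf{x}_t$ can be drawn in a single step instead of iterating the chain. Below is a minimal PyTorch-style sketch of this forward noising, assuming the linear beta schedule from Section 4.2 (0.0001 to 0.02, T = 1000); tensor shapes and names are illustrative, not taken from the released code.

```python
import torch

# Minimal sketch of the closed-form forward noising step described above.
# The linear beta schedule (0.0001 to 0.02, T = 1000) follows Section 4.2.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # beta_1, ..., beta_T
alphas = 1.0 - betas                             # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)        # alpha_bar_t = prod_{s<=t} alpha_s

def q_sample(x0: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) without iterating over intermediate steps.

    x0: clean motion, shape (B, F, D); t: integer timesteps, shape (B,).
    """
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1)         # broadcast over frames and pose dims
    return torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
```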
In diffusion models, the aforementioned forward Markov chain is reversed to learn the original motion distribution. Specifically, diffusion models are trained to denoise the noisy data $\mathbf{x}_t$ into clean data $\mathbf{x}_0$. Following MDM [26], we predict the clean state $\mathbf{x}_0$. The training target can be written as:

\label{eq:objective} \mathbb{E}_{x_0,\epsilon,t}\left[\Vert \mathbf{x}_0 - S(\mathbf{x}_t,t,\mathrm{retr},\mathrm{text}) \Vert_2^2\right], \quad (2)

where retr and text denote the conditions of the retrieved samples and the given prompt, respectively. Here $t \in U(0, T)$ denotes the timestamp, which is uniformly sampled from 0 to the maximum number of diffusion steps $T$, and $S(\mathbf{x}_t, t, \mathrm{retr}, \mathrm{text})$ is the estimated clean motion sequence given the four inputs.

During the sampling process, we can sample $\mathbf{x}_{t-1}$ from a Gaussian distribution $\mathcal{N}(\mu_\theta(\mathbf{x}_t, t, c), \beta_t)$, where $c$ denotes the conditions retr and text for simplicity. The mean of this distribution can be acquired from $\mathbf{x}_t$ and $S(\mathbf{x}_t, t, c)$ by the following equation:

\begin{aligned} & \mu_{\theta}(\mathbf{x}_t,t,c) = \sqrt{\bar{\alpha}_t}\, S(\mathbf{x}_t,t,c) + \sqrt{1 - \bar{\alpha}_t}\,\epsilon_{\theta}(\mathbf{x}_t,t,c) \\ & \epsilon_{\theta}(\mathbf{x}_t,t,c)=\Big(\frac{\mathbf{x}_t}{\sqrt{\bar{\alpha}_t}} - S(\mathbf{x}_t,t,c)\Big) \sqrt{\frac{1}{\bar{\alpha}_t}-1} \end{aligned} \quad (3)

Hence, on the basis of diffusion models, the text-driven motion generation pipeline should be able to predict the start sequence $\mathbf{x}_0$ given the conditions. In this paper, we propose a retrieval technique to enhance this denoising process. We will introduce how we retrieve motion sequences and how we fuse this information.
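To make the sampling step concrete, the sketch below transcribes Equation 3 literally: the clean-motion estimate $S(\mathbf{x}_t, t, c)$ is converted into $\epsilon_\theta$ and $\mu_\theta$, and $\mathbf{x}_{t-1}$ is drawn from $\mathcal{N}(\mu_\theta, \beta_t\mathbf{I})$. The `predict_x0` callable is a placeholder for the semantics-modulated transformer, not the released model.

```python
import torch

# Sketch of one reverse step, transcribing Equation 3 literally.
# `predict_x0` stands for the network S(x_t, t, c) and is a placeholder.
@torch.no_grad()
def p_sample(predict_x0, x_t, t: int, cond, betas, alpha_bars):
    x0_hat = predict_x0(x_t, t, cond)                        # S(x_t, t, c)
    a_bar = alpha_bars[t]
    eps_hat = (x_t / a_bar.sqrt() - x0_hat) * (1.0 / a_bar - 1.0).sqrt()
    mu = a_bar.sqrt() * x0_hat + (1.0 - a_bar).sqrt() * eps_hat
    if t == 0:
        return mu                                            # no noise at the final step
    return mu + betas[t].sqrt() * torch.randn_like(x_t)      # x_{t-1} ~ N(mu, beta_t I)
```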
3.3. Retrieval-Augmented Motion Generation

Basically, there are two stages in retrieval-based pipelines. The first stage is to retrieve appropriate samples from the database. The second stage is to acquire knowledge from these retrieved samples to refine the denoising process of the diffusion model. We introduce these two steps in detail.

Hybrid Retrieval. To support this process, we need to extract features for calculating the similarities between the given text description and the entities in the database. Considering that the retrieval procedure is not differentiable, we have to utilize pre-trained models instead of learnable architectures. An intuitive method is to compute text features for both the query text and the data points. Thanks to the pre-trained CLIP [20], we can easily evaluate semantic similarities between language descriptions. Formally, for each data point (text_i, Θ_i), we first calculate f_i^t = E_T(text_i) as the text-query feature, where E_T is the text encoder of the CLIP model.

Text features encourage the retrieval process to select samples with high semantic similarity and play a significant role in retrieving suitable samples. However, there is another kind of feature that is vital but easily overlooked: the relative difference between the expected motion length and the length of each entity in the database. Hence, the similarity score s_i between the i-th data point and the given description prompt with expected motion length L is defined as:

\begin{aligned} &s_i = \langle f^t_i, f^t_p \rangle \cdot e^{-\lambda \cdot \gamma}, \\ &f^t_p = E_T(\mathrm{prompt}), \quad \gamma = \frac{\Vert l_i - L \Vert}{\max\{l_i, L\}}, \end{aligned} \quad (4)

where ⟨·,·⟩ denotes the cosine similarity between the two given feature vectors and l_i is the length of the motion sequence Θ_i. The similarity score s_i becomes larger when the text-query feature is closer to the prompt feature. When the expected motion length is close to the length of an entity, the corresponding s_i also increases. This property is significant because motion sequences with a similar length can provide more informative features for generation. λ is a hyper-parameter that balances the magnitudes of these two different similarities.

To establish the retrieval database, we simply take all the training data as entities. Given the number of retrieved samples k, the prompt, and the expected motion length L, we sort all entities by the score s_i in Equation 4. The k most similar ones are then selected as the retrieved samples (text_i, Θ_i) and fed into the semantics-modulated attention components of the motion transformer. We will illustrate the detailed architecture in the next paragraph.
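Before moving on to the architecture, here is a small sketch of Equation 4 and the top-k selection, assuming the database stores pre-computed CLIP text features and motion lengths as tensors; the storage format and function names are our assumptions, not the released data layout. The default λ = 0.1 follows the ablation in Section 4.4.

```python
import torch
import torch.nn.functional as F

# Sketch of the hybrid retrieval score in Equation 4 plus top-k selection.
# `db_text_feat`: (N, D) pre-computed CLIP text features of database entries.
# `db_lengths`:   (N,)   motion lengths of the entries.
def retrieve(prompt_feat, db_text_feat, db_lengths, target_len, k, lam=0.1):
    # semantic term: cosine similarity between the prompt and each entry
    sem = F.cosine_similarity(prompt_feat.unsqueeze(0), db_text_feat, dim=-1)
    # kinematic term: relative difference between expected and stored lengths
    gamma = (db_lengths - target_len).abs() / torch.maximum(
        db_lengths, torch.full_like(db_lengths, float(target_len)))
    scores = sem * torch.exp(-lam * gamma)       # Eq. (4)
    return scores.topk(k).indices                # indices of the retrieved samples
```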
Network Architecture. Similar to MotionDiffuse [30] and MDM [26], we build our pipeline on the basis of transformer layers, as shown in Figure 2. In both the semantics-modulated attention modules and the FFN modules, following MotionDiffuse [30], we add a stylization block to fuse the timestamp t into the motion generation process. First, an embedding vector e_t is obtained from the timestamp t. It should be mentioned that the original design in MotionDiffuse also uses an embedding vector derived from the given prompt, which is not suitable for classifier-free guidance. Then, for each block, a residual shortcut is applied between the input X ∈ R^{n×d} and the output Y ∈ R^{n×d}, where n is the number of elements and d is the dimensionality.

Two major difficulties should be resolved to better explore knowledge from the retrieved samples. First, in the literature on motion diffusion models [30, 26], the resolution of motion sequences is not reduced through the denoising process. The maximum length of one motion sequence is around 200 frames in the HumanML3D [8] dataset, leading to a dramatic computational cost, especially when we expect to retrieve more samples. Hence, efficiency is highly prioritized for the information fusion component. Second, the semantic relation between the retrieved samples and the given prompt is complicated. For example, 'a person is walking forward' and 'a person is walking forward slowly' are highly similar. However, these two prompts lead to two distinct motion sequences regarding pace and intensity. Therefore, the model should know which motion features can be borrowed, guided by the difference between the language descriptions.

Based on these observations, we design two encoders to extract text features and motion features from the retrieved data, respectively. As for the motion features, we expect them to provide low-level information while keeping the computational cost acceptable. Therefore, we build a series of encoder layers with alternating Semantics-Modulated Attention (SMA) modules and FFN modules. This motion encoder processes raw motion sequences into usable features. To reduce the computational cost, we down-sample the sequence to 1/4 of the original FPS, which is denoted as R_m ∈ R^{F'·k×D}, where F' is the number of frames after down-sampling and k is the number of retrieved samples. This simple strategy greatly decreases the computation with little information lost. As for the text encoder, the feature R_t ∈ R^{k×D} from the last token is taken to represent the global semantic information. R_m and R_t constitute the features we need for retrieval-based augmentation.

Semantics-Modulated Attention. These extracted features are passed to the cross attention component, as shown in Figure 2. The noised motion sequence forms the query vector Q ∈ R^{F×D}. For the key vector K and the value vector V, we consider three sources of data: 1) the motion sequence f_Θ ∈ R^{F×D} itself. As shown in Figure 2, our proposed transformer does not contain a separate self-attention module; instead, we fold the function of self-attention into the SMA; 2) the text condition f_prompt, which semantically describes the expected motion sequence and is extracted as in MotionDiffuse [30]. Specifically, the prompt is first fed into the pre-trained CLIP model to get a feature sequence, which is further processed by two learnable transformer encoder layers; 3) the features R_m and R_t from the retrieved samples. We simply concatenate f_Θ, f_prompt and R_m for the value vector V, and f_Θ, f_prompt and [R_m; R_t] for the key vector K, where [·;·] denotes the concatenation of both terms. This design allows our proposed method to fuse low-level motion information from the retrieved samples while fully considering the semantic similarities. The acquired vectors Q, K, V are sent to a Linear Attention [23] module for efficient computation.
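The sketch below illustrates one possible reading of this key/value construction: each down-sampled retrieved frame is paired with its sample-level text feature to form [R_m; R_t], and a single-head efficient-attention kernel in the spirit of [23] is applied. The broadcasting of R_t, the fusion layer, and the single-head simplification are our assumptions, not the released SMA layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the SMA key/value assembly under the assumptions stated above.
class SMASketch(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.key_fuse = nn.Linear(2 * dim, dim)   # fuses [R_m; R_t] back to D channels

    def forward(self, f_theta, f_prompt, r_m, r_t):
        # f_theta: (B, F, D)    noised motion tokens
        # f_prompt: (B, Lt, D)  processed CLIP prompt tokens
        # r_m: (B, k, F_ds, D)  down-sampled retrieved motion tokens
        # r_t: (B, k, D)        one text feature per retrieved sample
        B, k, F_ds, D = r_m.shape
        r_m_flat = r_m.reshape(B, k * F_ds, D)
        r_t_rep = r_t.unsqueeze(2).expand(-1, -1, F_ds, -1).reshape(B, k * F_ds, D)
        retr_key = self.key_fuse(torch.cat([r_m_flat, r_t_rep], dim=-1))

        q = self.to_q(f_theta)
        k_all = self.to_k(torch.cat([f_theta, f_prompt, retr_key], dim=1))
        v_all = self.to_v(torch.cat([f_theta, f_prompt, r_m_flat], dim=1))

        # Efficient attention in the spirit of [23]: softmax over different axes
        # gives cost linear in the sequence lengths.
        q = F.softmax(q, dim=-1)
        k_all = F.softmax(k_all, dim=1)
        context = torch.einsum('bnd,bne->bde', k_all, v_all)   # (B, D, D)
        return torch.einsum('bnd,bde->bne', q, context)        # (B, F, D)
```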
Stylization Block. Similar to MotionDiffuse [30] and MDM [26], we build our pipeline on the basis of transformer layers. In both the semantics-modulated attention modules and the FFN modules, following MotionDiffuse [30], we add a stylization block to fuse the timestamp t into the motion generation process. First, an embedding vector e_t is obtained from the timestamp t.
Then, for each block, a residual shortcut is applied between the input X ∈ R^{n×d} and the output Y ∈ R^{n×d}, where n is the number of elements and d is the dimensionality. The detailed structure is shown in Figure 3.

Figure 3: Architecture of the stylization block. This module is adapted from MotionDiffuse [30]. We remove the prompt embedding from the original design to better support classifier-free guidance. The module injects the information of the current timestamp into the feature representation, which is necessary for the denoising steps. Specifically, the timestamp embedding e_t is fed into a series of transformation layers. Two embeddings are generated and serve as an additive offset and a multiplicative offset to the original feature map, respectively.
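A minimal sketch of such a stylization block follows the description in the caption of Figure 3: the timestep embedding is mapped to a multiplicative and an additive offset that modulate the feature map, with a residual shortcut around the block. The exact layer sizes and the SiLU activation are assumptions.

```python
import torch
import torch.nn as nn

# Sketch of a stylization block: scale-and-shift modulation by the timestep
# embedding, followed by a residual shortcut. Layer choices are assumptions.
class StylizationBlockSketch(nn.Module):
    def __init__(self, dim: int, time_dim: int):
        super().__init__()
        self.to_scale_shift = nn.Sequential(nn.SiLU(), nn.Linear(time_dim, 2 * dim))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, n, dim) feature map; t_emb: (B, time_dim) timestep embedding e_t
        scale, shift = self.to_scale_shift(t_emb).unsqueeze(1).chunk(2, dim=-1)
        y = self.proj(x * (1 + scale) + shift)   # multiplicative and additive offsets
        return x + y                             # residual shortcut around the block
```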
3.4. Condition Mixture

Classifier-free guidance enables us to generate motion sequences with both high fidelity and consistency with the given text description. A typical formulation is:

\epsilon = w \cdot \epsilon_{\theta}(\mathbf{x}_t,t,\mathrm{text}) - (w - 1) \cdot \epsilon_{\theta}(\mathbf{x}_t,t), \quad (5)

where w is a hyper-parameter that balances text-consistency and motion quality. In our proposed retrieval-augmented diffusion pipeline, the retrieved samples can be regarded as an additional condition. Therefore, we obtain four estimations: S(x_t, t, retr, text), S(x_t, t, retr), S(x_t, t, text), and S(x_t, t), and we need four parameters to balance these terms. To achieve a better performance, we suggest a Condition Mixture technique for this purpose. Specifically, given the pre-trained Semantics-Modulated Transformer (SMT), we optimize the values of w_1, w_2, w_3, w_4 and obtain the final output Ŝ as:

\label{eq:output} \begin{aligned} \widehat{S} =\ & w_1 \cdot S(\mathbf{x}_t,t,\mathrm{retr},\mathrm{text}) + w_2 \cdot S(\mathbf{x}_t,t,\mathrm{text})\ + \\ & w_3 \cdot S(\mathbf{x}_t,t,\mathrm{retr}) + w_4 \cdot S(\mathbf{x}_t,t). \end{aligned} \quad (6)

Empirically, we find that the tendency of the Frechet Inception Distance (FID) is similar to that of Precision when the hyper-parameters are nearly optimal. Hence, we only attempt to minimize the FID in this procedure.
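Equation 6 amounts to one weighted sum of the four conditional estimates per denoising step. In the sketch below, `model` is a placeholder for the SMT, and passing None for a condition stands in for the masked branch used in classifier-free guidance; this calling convention is an assumption.

```python
# Sketch of the Condition Mixture in Equation 6.
def condition_mixture(model, x_t, t, retr, text, w):
    # w = (w1, w2, w3, w4); the paper constrains w1 + w2 + w3 + w4 = 1.
    s_full   = model(x_t, t, retr=retr, text=text)
    s_text   = model(x_t, t, retr=None, text=text)
    s_retr   = model(x_t, t, retr=retr, text=None)
    s_uncond = model(x_t, t, retr=None, text=None)
    return w[0] * s_full + w[1] * s_text + w[2] * s_retr + w[3] * s_uncond
```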
Contrastive Model. To imitate the evaluator used in the standard evaluation process, we train our own contrastive model, which aims at encoding the paired text descriptions and motion sequences into a joint embedding space. For the motion encoder, we use a 4-layer ACTOR [17] encoder. The text encoder is identical to the one we used in ReMoDiffuse; the only difference is that we require a sentence feature instead of a sequence of word features. We train this contrastive learning model with the same loss as in Guo et al. [8]. 20K and 40K optimization steps are applied for the KIT-ML and HumanML3D datasets, respectively.

Parameter Finetuning. As mentioned before, we only use 50 denoising steps to generate motion sequences in the inference stage. However, it is impractical to back-propagate gradients through this many forward passes. To simplify the problem, we divide all denoising steps into the first 40 steps and the last ten steps. In the first part, we use grid search to find a better parameter combination. Specifically, for Equation 6, we search w_1 and w_2 over [−5, 5] with step 0.5 to find the best parameters for each model. Here we take inspiration from Re-Imagen [5] and set w_4 = 0. Besides, to retain the output's statistics, we require w_1 + w_2 + w_3 + w_4 = 1. These two properties enable us to find the optimal combination by only searching the values of w_1 and w_2. The evaluation metric is the FID between our generated sequences and the natural motion sequences in the training split, computed with our trained contrastive model. This search aims to find the combination of w_1 and w_2 that achieves the lowest FID.

In the second stage, we use an end-to-end training scheme to optimize w_1, w_2 and w_3, while w_4 is obtained as 1 − w_1 − w_2 − w_3. During training, we use the searched parameters to perform the first 40 denoising steps; after that, we auto-regressively denoise the motion sequence with the learnable w_1, w_2 and w_3. The training objective here is also to reduce the FID. We use the Adam optimizer and train on the training split for 1K steps on both the HumanML3D and KIT-ML datasets to find the best parameter combination.
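The first-stage search can be written as a plain double loop: with w_4 = 0 and the weights constrained to sum to one, only w_1 and w_2 remain free. In the sketch below, `evaluate_fid` is a placeholder that generates motions with the candidate weights and scores them with the trained contrastive evaluator.

```python
import numpy as np

# Sketch of the first-stage grid search: w4 = 0 and the weights sum to one,
# so only w1 and w2 are searched over [-5, 5] with step 0.5.
def grid_search(evaluate_fid, lo=-5.0, hi=5.0, step=0.5):
    best = (None, float('inf'))
    for w1 in np.arange(lo, hi + step, step):
        for w2 in np.arange(lo, hi + step, step):
            w3 = 1.0 - w1 - w2          # keep the weights summing to one
            w4 = 0.0                    # following Re-Imagen [5]
            fid = evaluate_fid((w1, w2, w3, w4))
            if fid < best[1]:
                best = ((w1, w2, w3, w4), fid)
    return best
```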
3.5. Training and Inference

Model Training. Inspired by the classifier-free technique, 10% of the text conditions and 10% of the retrieval conditions are independently and randomly masked to approximate p(x_0). The training objective is to minimize the mean square error between the predicted initial sequence and the ground truth, as shown in Equation 2. In the training stage, we typically use a 1000-step diffusion process.
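A sketch of this condition masking follows; the convention of signalling a dropped condition by passing None is our assumption, not the released training code.

```python
import torch

# Sketch: drop the text and retrieval conditions independently with prob. 0.1,
# so the model also learns the partially and fully unconditional branches.
def mask_conditions(text_cond, retr_cond, p: float = 0.1):
    if torch.rand(1).item() < p:
        text_cond = None
    if torch.rand(1).item() < p:
        retr_cond = None
    return text_cond, retr_cond
```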
Model Inference. During each denoising step, we use the learned coefficients w_1, w_2, w_3 and w_4 to obtain Ŝ as in Equation 6. To reduce the computation cost introduced by the retrieved samples, we pre-process all f_i^v, f_i^t, R_t and R_m so that no computation is repeated across different syntheses. Different from the training stage, we carefully reduce the whole denoising process to 50 steps during inference, which enables our model to generate high-quality motion sequences efficiently.

Table 1: Evaluation results of different evaluators.

Methods | Dataset | R Precision Top 1↑ | Top 2↑ | Top 3↑ | FID↓ | MM Dist↓ | Diversity↑
Guo et al. | HumanML3D | 0.511±.003 | 0.703±.003 | 0.797±.002 | 0.002±.000 | 2.974±.008 | 9.503±.065
Ours | HumanML3D | 0.539±.004 | 0.721±.003 | 0.810±.003 | 0.001±.000 | 1.462±.006 | 5.298±.047
Guo et al. | KIT-ML | 0.424±.005 | 0.649±.006 | 0.779±.006 | 0.031±.004 | 2.788±.012 | 11.08±.097
Ours | KIT-ML | 0.475±.006 | 0.690±.004 | 0.791±.005 | 0.002±.000 | 1.337±.012 | 6.371±.058
Table 2: Quantitative results on the HumanML3D test set. For a fair comparison, all methods use the real motion length from the ground truth as extra given information. '↑' ('↓') indicates that larger (smaller) values are better. We run all evaluations 20 times. x±y indicates that the average metric is x and the 95% confidence interval is y. The best result and the second best result are in red cells and blue cells, respectively.

Methods | R Precision Top 1↑ | Top 2↑ | Top 3↑ | FID↓ | MM Dist↓ | Diversity↑ | MultiModality↑
Real motions | 0.511±.003 | 0.703±.003 | 0.797±.002 | 0.002±.000 | 2.974±.008 | 9.503±.065 | -
Language2Pose [1] | 0.246±.002 | 0.387±.002 | 0.486±.002 | 11.02±.046 | 5.296±.008 | 7.676±.058 | -
Text2Gesture [3] | 0.165±.001 | 0.267±.002 | 0.345±.002 | 7.664±.030 | 6.030±.008 | 6.409±.071 | -
MoCoGAN [27] | 0.037±.000 | 0.072±.001 | 0.106±.001 | 94.41±.021 | 9.643±.006 | 0.462±.008 | 0.019±.000
Dance2Music [13] | 0.033±.000 | 0.065±.001 | 0.097±.001 | 66.98±.016 | 8.116±.006 | 0.725±.011 | 0.043±.001
Guo et al. [8] | 0.457±.002 | 0.639±.003 | 0.740±.003 | 1.067±.002 | 3.340±.008 | 9.188±.002 | 2.090±.083
MDM [26] | - | - | 0.611±.007 | 0.544±.044 | 5.566±.027 | 9.559±.086 | 2.799±.072
MotionDiffuse [30] | 0.491±.001 | 0.681±.001 | 0.782±.001 | 0.630±.001 | 3.113±.001 | 9.410±.049 | 1.553±.042
T2M-GPT [29] | 0.491±.003 | 0.680±.003 | 0.775±.002 | 0.116±.004 | 3.118±.011 | 9.761±.081 | 1.856±.011
Ours | 0.510±.005 | 0.698±.006 | 0.795±.004 | 0.103±.004 | 2.974±.016 | 9.018±.075 | 1.795±.043
4. Experiments

4.1. Datasets and Metrics

Datasets. We evaluate our proposed framework on the KIT dataset [19] and the HumanML3D dataset [8], two leading benchmarks for text-driven motion generation. The KIT Motion-Language Dataset is an open dataset combining human motion and natural language, which contains 3,911 motions and 6,363 natural language annotations. HumanML3D is a scripted 3D human motion dataset that originates from and textually reannotates the HumanAct12 [10] and AMASS [14] datasets. Overall, HumanML3D consists of 14,616 motions and 44,970 descriptions.

Evaluation Metrics. We follow the performance measures employed in MotionDiffuse for quantitative evaluations, namely Frechet Inception Distance (FID), R Precision, Diversity, MultiModality, and Multi-Modal Distance. (1) FID is an objective metric computing the distance between features extracted from real and generated motion sequences, and it strongly reflects generation quality. (2) R Precision measures the similarity between the text description and the generated motion sequence; it indicates the probability that the real text appears in the top k after sorting, where k is taken to be 1, 2, and 3 in this work. (3) Diversity measures the variability and richness of the generated motion sequences. (4) MultiModality measures the average variance of generated motion sequences given a single text description. (5) Multi-Modal Distance (MM Dist for short) is the average Euclidean distance between the motion feature and its corresponding text description feature.

4.2. Implementation Details

We use similar settings on the HumanML3D and KIT-ML datasets. For the motion encoder, a 4-layer transformer is used, and the latent dimension is 512. For the text encoder, a frozen text encoder from CLIP ViT-B/32, together with two additional transformer encoder layers, is built and applied. For the diffusion model, the variances βt are pre-defined to spread linearly from 0.0001 to 0.02, and the total number of noising steps is set to T = 1000. Adam is adopted as the optimizer, with a learning rate of 0.0002. One Tesla V100 is used for training, and the batch size on a single GPU is 128. Training on KIT-ML and HumanML3D is carried out for 40k and 200k steps, respectively.
Table 3: Quantitative results on the KIT-ML test set.

Methods | R Precision Top 1↑ | Top 2↑ | Top 3↑ | FID↓ | MM Dist↓ | Diversity↑ | MultiModality↑
Real motions | 0.424±.005 | 0.649±.006 | 0.779±.006 | 0.031±.004 | 2.788±.012 | 11.08±.097 | -
Language2Pose [1] | 0.221±.005 | 0.373±.004 | 0.483±.005 | 6.545±.072 | 5.147±.030 | 9.073±.100 | -
Text2Gesture [3] | 0.156±.004 | 0.255±.004 | 0.338±.005 | 12.12±.183 | 6.964±.029 | 9.334±.079 | -
MoCoGAN [27] | 0.022±.002 | 0.042±.003 | 0.063±.003 | 82.69±.242 | 10.47±.012 | 3.091±.043 | 0.250±.009
Dance2Music [13] | 0.031±.002 | 0.058±.002 | 0.086±.003 | 115.4±.240 | 10.40±.016 | 0.241±.004 | 0.062±.002
Guo et al. [8] | 0.370±.005 | 0.569±.007 | 0.693±.007 | 2.770±.109 | 3.401±.008 | 10.91±.119 | 1.482±.065
MDM [26] | - | - | 0.396±.004 | 0.497±.021 | 9.191±.022 | 10.847±.109 | 1.907±.214
MotionDiffuse [30] | 0.417±.004 | 0.621±.004 | 0.739±.004 | 1.954±.062 | 2.958±.005 | 11.10±.143 | 0.730±.013
T2M-GPT [29] | 0.416±.006 | 0.627±.006 | 0.745±.006 | 0.514±.029 | 3.007±.023 | 10.921±.108 | 1.570±.039
Ours | 0.427±.014 | 0.641±.004 | 0.765±.055 | 0.155±.006 | 2.814±.012 | 10.80±.105 | 1.239±.028
Figure 4: Visual comparison between previous works and ReMoDiffuse for the prompt "A person skips in a circle". We draw black lines to show the translation path. For both given conditions, only ReMoDiffuse conveys the action and the path condition accurately.
The pose representation in this work follows the schema used by Guo et al. [8]. The pose is defined as a tuple of length seven: (r_va, r_vx, r_vz, r_h, j_p, j_v, j_r), where r_va ∈ R is the root angular velocity along the Y-axis, and r_vx, r_vz ∈ R are the root linear velocities along the X-axis and Z-axis, respectively. r_h ∈ R is the root height. j_p, j_v ∈ R^{J×3} are the local joint positions and velocities. j_r ∈ R^{J×6} is the 6D continuous local joint rotation. J denotes the number of joints; in HumanML3D and KIT-ML, J is 22 and 21, respectively.
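For illustration, the sketch below flattens one pose tuple into a single feature vector using the component shapes stated above; the packing order is illustrative and not necessarily the exact layout used by the released code.

```python
import numpy as np

# Sketch: flatten one pose tuple into a feature vector, using the component
# shapes stated in the paragraph above (layout is illustrative only).
def flatten_pose(r_va, r_vx, r_vz, r_h, j_p, j_v, j_r):
    # r_va, r_vx, r_vz, r_h: scalars; j_p, j_v: (J, 3); j_r: (J, 6)
    return np.concatenate([
        np.array([r_va, r_vx, r_vz, r_h], dtype=np.float32),
        j_p.reshape(-1), j_v.reshape(-1), j_r.reshape(-1),
    ])

J = 22                                   # HumanML3D (21 for KIT-ML)
pose = flatten_pose(0.0, 0.0, 0.0, 0.9,
                    np.zeros((J, 3)), np.zeros((J, 3)), np.zeros((J, 6)))
print(pose.shape)                        # dimensionality implied by the schema above
```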
4.3. Main Results

Table 2 and Table 3 show the comparison between our proposed ReMoDiffuse and four other existing works, including recent diffusion model-based algorithms [26, 30], one VAE-based generative model [8], and one GPT-style generative model [29].

Compared to other diffusion model-based pipelines, our proposed ReMoDiffuse achieves a better balance between condition-consistency and fidelity. It should be noted that ReMoDiffuse is the first work to achieve state-of-the-art results on both metrics, which demonstrates the superiority of the proposed pipeline.

4.4. Ablation Study

Retrieval Techniques. First, we investigate the influence of different retrieval techniques. To directly evaluate the similarity between the target samples and the retrieved samples, we use the retrieved samples themselves as generated results and calculate the FID metric for them. We try different λ to balance the terms of semantic similarity and kinematic similarity. The results are shown in Figure 6. λ = 0 means that kinematic similarity does not influence the retrieval process, and its retrieval quality is unacceptable. This result supports our claim that kinematic similarity is significant for retrieval quality. The optimal value of λ is 0.1 for both the KIT-ML and HumanML3D datasets.

Motion Refinement. We further evaluate the proposed cross attention component of our retrieval-augmented motion generation. In Table 4, when using the text feature, the FID improves remarkably. This strongly supports our claim that text features are highly significant in hybrid retrieval, which is not discussed in text-to-image generation tasks. Besides, the proposed retrieval techniques outperform the baseline by a remarkable margin.
Figure 5: Rareness distribution of the HumanML3D test split. We split all test cases into 100 bins according to their Rareness values.
Table 5: Examples of Rareness in the HumanML3D test set.

very challenging to current methods. Hence, these examples build up a more difficult and realistic environment for method evaluation.

Results and Analysis. Table 6 shows the generalization ability of three different methods. For the baseline model, we simply drop the retrieval technique. From this table we can see that, with our proposed retrieval technique, ReMoDiffuse outperforms both the baseline model and state-of-the-art methods by a remarkable margin.

Table 6: Evaluation of generalization ability. All results are reported on the KIT test set. The best results are in bold.

Method | MM↓ | tail 5% MM↓ | balanced MM↓
MotionDiffuse | 2.958 | 5.928 | 4.285
Baseline | 3.371 | 6.173 | 4.661
Ours | 2.814 | 5.439 | 4.028
Δ | 0.557 | 0.734 | 0.633

4.6. Qualitative Results

To illustrate the effectiveness of ReMoDiffuse, we provide a qualitative comparison between previous works and ReMoDiffuse. More examples are available on the project page. As shown in Figure 4, ReMoDiffuse stands out as the only approach that effectively conveys text descriptions involving both action and path information. In contrast, the method of Guo et al. falls short in capturing path descriptions. MotionDiffuse performs well on action categories but lacks precision in providing path details. Meanwhile, MDM captures path information, but its generated actions are incorrect. In the examples evaluated, ReMoDiffuse demonstrates its capability to appropriately structure and present the content.

5. Conclusion

In this paper, we present ReMoDiffuse, a retrieval-augmented motion diffusion model for text-driven motion generation. Equipped with a multi-modality retrieval technique, the semantics-modulated attention mechanism, and a learnable condition mixture strategy, ReMoDiffuse efficiently explores and utilizes appropriate knowledge from an auxiliary database to refine the denoising process without expensive computation. Quantitative and qualitative experiments demonstrate that ReMoDiffuse achieves superior performance in text-driven motion generation, particularly for uncommon motions.

Social Impacts. This technique can be used to create fake media when combined with 3D avatar generation. The manipulated media may convey incidents that never truly happened and can serve malicious purposes.
References

[1] Chaitanya Ahuja and Louis-Philippe Morency. Language2pose: Natural language grounded pose forecasting. In 2019 International Conference on 3D Vision (3DV), pages 719-728. IEEE, 2019.
[2] Nikos Athanasiou, Mathis Petrovich, Michael J Black, and Gül Varol. Teach: Temporal action composition for 3d humans. arXiv preprint arXiv:2209.04066, 2022.
[3] Uttaran Bhattacharya, Nicholas Rewkowski, Abhishek Banerjee, Pooja Guhan, Aniket Bera, and Dinesh Manocha. Text2gestures: A transformer-based network for generating emotive body gestures for virtual agents. In 2021 IEEE Virtual Reality and 3D User Interfaces (VR), pages 1-10. IEEE, 2021.
[4] Andreas Blattmann, Robin Rombach, Kaan Oktay, and Björn Ommer. Retrieval-augmented diffusion models. arXiv preprint arXiv:2204.11824, 2022.
[5] Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491, 2022.
[6] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34, 2021.
[7] Anindita Ghosh, Noshaba Cheema, Cennet Oguz, Christian Theobalt, and Philipp Slusallek. Synthesis of compositional animations from textual descriptions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1396-1406, 2021.
[8] Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5152-5161, 2022.
[9] Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng. Tm2t: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts. arXiv preprint arXiv:2207.01696, 2022.
[10] Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng. Action2motion: Conditioned generation of 3d human motions. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2021-2029, 2020.
[11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.
[12] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[13] Hsin-Ying Lee, Xiaodong Yang, Ming-Yu Liu, Ting-Chun Wang, Yu-Ding Lu, Ming-Hsuan Yang, and Jan Kautz. Dancing to music. Advances in Neural Information Processing Systems, 32, 2019.
[14] Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5442-5451, 2019.
[15] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[16] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162-8171. PMLR, 2021.
[17] Mathis Petrovich, Michael J Black, and Gül Varol. Action-conditioned 3d human motion synthesis with transformer vae. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10985-10995, 2021.
[18] Mathis Petrovich, Michael J Black, and Gül Varol. Temos: Generating diverse human motions from textual descriptions. arXiv preprint arXiv:2204.14109, 2022.
[19] Matthias Plappert, Christian Mandery, and Tamim Asfour. The kit motion-language dataset. Big Data, 4(4):236-252, 2016.
[20] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
[21] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[22] Jiawei Ren, Mingyuan Zhang, Cunjun Yu, and Ziwei Liu. Balanced mse for imbalanced visual regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7926-7935, 2022.
[23] Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3531-3539, 2021.
[24] Shelly Sheynin, Oron Ashual, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Nachmani, and Yaniv Taigman. Knn-diffusion: Image generation via large-scale retrieval. arXiv preprint arXiv:2204.02849, 2022.
[25] Guy Tevet, Brian Gordon, Amir Hertz, Amit H Bermano, and Daniel Cohen-Or. Motionclip: Exposing human motion generation to clip space. arXiv preprint arXiv:2203.08063, 2022.
[26] Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H Bermano. Human motion diffusion model. arXiv preprint arXiv:2209.14916, 2022.
[27] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. Mocogan: Decomposing motion and content for video generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1526-1535, 2018.
[28] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[29] Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, and Xi Shen. T2m-gpt: Generating human motion from textual descriptions with discrete representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[30] Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu. Motiondiffuse: Text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001, 2022.