Seamless Human Motion Composition With Blended Positional Encodings

("forward kick", 2.5s) ("walk slowly", 3.2s) ("get down on ground", 3s) ("crawl", 3.3s) ("walk", 2s) ("walk", 2s) ("walk", 2s) ("walk", 2s) ... ("walk", 2s)
Figure 1. We present FlowMDM, a diffusion-based approach capable of generating seamlessly continuous sequences of human motion from textual descriptions (left). The whole sequence is generated simultaneously and does not require any postprocessing. FlowMDM also makes strides in the challenging problem of extrapolating and controlling periodic motion such as walking, jumping, or waving (right).

Abstract

Conditional human motion generation is an important topic with many applications in virtual reality, gaming, and robotics. While prior works have focused on generating motion guided by text, music, or scenes, these typically result in isolated motions confined to short durations. Instead, we address the generation of long, continuous sequences guided by a series of varying textual descriptions. In this context, we introduce FlowMDM, the first diffusion-based model that generates seamless Human Motion Compositions (HMC) without any postprocessing or redundant denoising steps. For this, we introduce the Blended Positional Encodings, a technique that leverages both absolute and relative positional encodings in the denoising chain. More specifically, global motion coherence is recovered at the absolute stage, whereas smooth and realistic transitions are built at the relative stage. As a result, we achieve state-of-the-art results in terms of accuracy, realism, and smoothness on the Babel and HumanML3D datasets. FlowMDM excels when trained with only a single description per motion sequence thanks to its Pose-Centric Cross-ATtention, which makes it robust against varying text descriptions at inference time. Finally, to address the limitations of existing HMC metrics, we propose two new metrics: the Peak Jerk and the Area Under the Jerk, to detect abrupt transitions.

1. Introduction

In the field of computer vision, recent progress has been made in developing photorealistic avatars [54] for applications like virtual reality, gaming, and robotics [62, 79]. Aside from looking visually realistic, these avatars must also move in a convincing manner. This is challenging due to the intricate nature of human motion, which is strongly influenced by various factors such as the environment, interactions, and physical contact [14]. Furthermore, complexity increases when attempting to control these motions. Recent advances include the generation of motion sequences from control signals like textual descriptions or actions [109]; however, such methods only produce isolated, standalone motions. Therefore, these approaches fail to handle scenarios where a long motion is driven by distinct control signals on different time slices. Such capability is needed to provide full control over the sequence of desired actions and their duration. In these scenarios, the generated motion needs to feature seamless and realistic transitions between actions. In this work, we tackle this problem, which we refer to as generative Human Motion Composition (HMC). In particular, we focus on generating single-human motion from text, as illustrated in Fig. 1.
One of the primary obstacles in HMC is the lack of datasets that offer long motion sequences with diverse textual annotations. Existing datasets typically feature
sequences of limited duration, often lasting only up to 10 seconds, and with just a single control signal governing the entire sequence [26, 64]. This limitation calls for innovative solutions to address the inherent complexities of the task. Prior works have tackled this problem mostly with autoregressive approaches [4, 45, 48, 66, 104]. These methods iteratively create compositions by using the current motion as a basis to generate subsequent motions. However, they require datasets with multiple consecutive annotated motions, and tend to degenerate in very long HMC scenarios due to error accumulation [107]. Other recent works have leveraged the infilling capabilities of motion diffusion models to generate motion compositions [73, 103]. However, for these, a substantial portion of each motion sequence is generated independently from adjacent motions, and generating transitions requires computing redundant denoising steps.
In this work, we propose a novel architecture designed to address these specific challenges. Our main contributions are:
• We propose FlowMDM, the first diffusion-based model that generates seamless human motion compositions without any postprocessing or extra denoising steps. To accomplish this, we introduce Blended Positional Encodings (BPE), a new technique for diffusion Transformers that combines the benefits of both absolute and relative positional encodings during sampling. In particular, the denoising first exploits absolute information to recover the global motion coherence, and then leverages relative positions to build smooth and realistic transitions between actions. As a result, FlowMDM achieves state-of-the-art results in terms of accuracy, realism, and smoothness on the HumanML3D [26] and Babel [65] datasets.
• We introduce a new attention technique tailored for HMC: the Pose-Centric Cross-ATtention (PCCAT). This layer ensures each pose is denoised based on its own condition and its neighboring poses. Consequently, FlowMDM can be trained on a dataset with only a single condition available per motion sequence and still generate realistic transitions when using multiple conditions at inference time.
• We reveal that current HMC metrics lack the sensitivity to identify discontinuous or sharp transitions, and introduce two new metrics that help to detect them: the Peak Jerk (PJ) and the Area Under the Jerk (AUJ).

2. Related work

Conditional human motion generation. Recent studies in motion generation have shown notable progress in synthesizing movements conditioned on diverse modalities such as text [21, 26, 27, 35, 40, 63, 81, 82, 100–102], music [2, 17, 47, 77, 84, 96, 110], scenes [15, 30, 87–89, 97], interactive objects [1, 18, 42, 92], and even other humans' behavior [9, 10, 28, 80, 93]. Traditionally, these approaches have been designed to generate motion sequences matching a single condition. The progress of this domain has been boosted by the release of big datasets including diverse modalities or manual annotations [12, 26, 28, 47, 51, 60, 64, 65]. Research has also focused on problems like human motion prediction [3, 53, 57, 72, 78, 83, 86, 99] and motion infilling [29, 36, 39, 49, 50, 59, 67, 69, 75, 108], which do not rely on extensive manual annotations but rather on motion itself. Both tasks share a common challenge with HMC: the synthesized motion must not only be plausible but also integrate seamlessly with the neighboring behavior, ensuring fluidity and continuity. In this context, the utilization of human motion priors has proven to be a successful technique to ensure any generated motion includes natural transitions [8, 46, 91]. In line with these approaches, our method learns a motion prior specifically tailored for HMC.
Autoregressive human motion composition. As in many other sequence modeling tasks, HMC was also first tackled with autoregressive methods. The gold standard has been pairing variational autoencoders with autoregressive decoders such as recurrent neural networks [104] or Transformers [4, 45, 48, 66]. Alternative approaches have introduced specialized reinforcement learning frameworks [52, 95, 105]. Autoregressive models rely on the availability of annotated motion transitions, a requirement that constrains the robustness of the models due to the scarcity of such data. To mitigate this issue, some methods include additional postprocessing steps like linear interpolations [4] or affine transformations [45]. However, these can distort the human motion dynamics and require a predetermined estimate of the transitions' duration. Furthermore, autoregressive approaches generate motion solely based on the preceding motion. We argue that an accurate model should mimic humans' innate capacity to anticipate their next action and adapt their current behavior accordingly [24, 43].
Diffusion-based human motion composition. Diffusion models have excelled at conditional generation [20, 32, 74]. They also possess great zero-shot capabilities for image inpainting [70], and for its equivalent in motion: motion infilling. DiffCollage [103], MultiDiffusion [7], and DoubleTake [73] proposed to modify the diffusion sampling process to simultaneously generate temporally superimposed motion sequences, and to combine the estimated noise in the overlapped regions so that an infilled transition emerges. DoubleTake complemented such overlapped sampling with a refinement step in which the emerged transition undergoes further unconditional denoising steps. All these methods share two main limitations. First, they are constrained to modeling dependencies among neighboring motion sequences. This becomes a limitation when three or more consecutive actions share semantics and collectively represent a more comprehensive action. In this case, the motion dependencies may extend beyond contiguous actions. Second, they need to set the number of frames
that each transition takes between consecutive actions, for which extra computations are required. Our work seeks to address these constraints by offering a solution able to model longer inter-sequence dynamics without imposing extra computational burdens or predefined transition durations.

3. Methodology

Problem definition. Our goal consists of generating a motion sequence of N frames, with the capability of conditioning the generated motion inside non-overlapping intervals [0, τ_1), [τ_1, τ_2), ..., [τ_j, N), with 0 < τ_1 < ... < τ_j < N. We will refer to the motion inside these intervals as motion subsequences, or S_i = {x_{τ_i}, ..., x_{τ_{i+1}-1}}, each driven by its corresponding condition c_i, and with a maximum length of L. It is essential that consecutive subsequences, influenced by different control signals, transition seamlessly and realistically. In particular, we aim at the even more challenging case where motion sequences containing several pairs of (S_i, c_i) are not necessarily available in our dataset.
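To make this setup concrete, the following small sketch (ours, purely illustrative: the names fps and build_intervals, and the example durations borrowed from Fig. 1, are not part of the method) maps a list of (description, duration) pairs to the non-overlapping frame intervals above and to a per-frame condition index:

    # Sketch: turn (text, duration-in-seconds) pairs into non-overlapping frame
    # intervals [0, tau_1), [tau_1, tau_2), ..., [tau_j, N) and a per-frame
    # condition id, matching the problem definition above.
    def build_intervals(conditions, fps=30):
        """conditions: list of (text, seconds). Returns (boundaries, frame_to_cond)."""
        boundaries, frame_to_cond, start = [], [], 0
        for idx, (text, seconds) in enumerate(conditions):
            end = start + int(round(seconds * fps))
            boundaries.append((start, end))            # subsequence S_idx spans [start, end)
            frame_to_cond.extend([idx] * (end - start))
            start = end
        return boundaries, frame_to_cond

    # Example mirroring Fig. 1 (left): three descriptions with their durations.
    ivals, cond_ids = build_intervals([("walk slowly", 3.2), ("get down on ground", 3.0), ("crawl", 3.3)])
    print(ivals)          # [(0, 96), (96, 186), (186, 285)]
    print(len(cond_ids))  # N = 285 frames, one condition index per frame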
In this section, we present FlowMDM, an architecture with strong inductive biases that promote the emergence of a robust translation-invariant motion prior. Such a motion prior is learned with a diffusion model equipped with a bidirectional (i.e., encoder-only) Transformer, similar to prior works [73, 82]. With it, we overcome the main limitations of autoregressive methods (Sec. 3.1). However, previous works are constrained in terms of motion duration. We could arguably provide extrapolation capabilities to the diffusion model by replacing the absolute positional encoding with a relative alternative, thus making the denoising of each pose translation invariant. However, this technique would fail to build complex compositional semantics that require knowledge about the start and end of each subsequence. For example, when generating the motion composition S_i → S_{i+1} with c_i = 'walking' and c_{i+1} = 'walk and sit down', S_{i+1} might only feature the action 'sit down' because, with only relative positional information, the Transformer cannot know whether the partially denoised 'walking' motion preceding the beginning of S_{i+1} belongs to S_i or S_{i+1}. To combine the benefits of both relative and absolute positional encodings, we introduce BPE (Sec. 3.2). This novel technique exploits the iterative nature of diffusion models to promote intra-subsequence global coherence in earlier denoising stages, while making later denoising stages translation invariant, ensuring that realistic and plausible transitions naturally emerge between subsequences. Still, during training, the condition remains unchanged throughout all ground truth motion sequences. In order to make our denoising model robust to having multiple conditions per sequence at inference, we introduce a new attention paradigm called PCCAT (Sec. 3.3). As a result, FlowMDM is able to simultaneously generate very long compositions of human motion subsequences, all in harmony and fostering plausible transitions between them, without explicit supervision on transition generation.

3.1. Bidirectional diffusion

The cumulative nature of errors in autoregressive models often results in a decline in performance when generating long sequences [107]. This is exacerbated in HMC, where transitions are scarce or even missing in the training corpus, and the model needs to deal with domain shifts at inference. Another limitation of autoregressive methods is that the generated S_i only depends on {S_j}_{j<i}. We discussed in Sec. 2 why this is a suboptimal solution for HMC. Thus, an appropriate model for HMC should also be able to anticipate the following motion, S_{i+1}, and possibly adapt S_i so that the transition is feasible. We argue that the iterative paradigm of diffusion models provides very appropriate inductive biases for naturally mimicking such an ability: the partially denoised S_i and S_{i+1} are refined later in successive denoising steps. By choosing a bidirectional Transformer as our denoising function [38], we enable the modeling of both past and future dependencies. Therefore, we design our framework as a bidirectional motion diffusion model, similar to MDM [82]. We refer the reader to [94] for more details on the theoretical aspects of diffusion models.

3.2. Blended positional encodings

Diffusion models can learn strong motion priors that ensure any generated motion is realistic and plausible [73]. In fact, they can also generate smooth transitions between subsequences [7, 73, 103]. However, these capabilities stem from inference-time motion infilling techniques, which we argue do not exploit the full potential of human motion priors. In fact, building a prior that extrapolates well to sequences longer than those observed during training is very challenging. The field of natural language processing has made progress in sequence extrapolation techniques, notably by substituting the absolute positional encoding (APE) with a relative (RPE) counterpart [37]. By only providing information about how far apart tokens are from each other, these methods achieve sequence-wise translation invariance and, therefore, can extrapolate their modeling capabilities to longer sequences. Yet, the absolute positions of poses within a motion, including their distances to the start and end of the action, are necessary to build the global semantics of the motion, as exemplified at the beginning of this section.
Here, we propose BPE, a novel positional encoding scheme designed for diffusion models that enables motion extrapolation while preserving the global motion semantics. Our BPE is inspired by the observation that, in motion, high frequencies encompass local fine details, whereas low frequencies capture global structures. Similar insights have been drawn for images [61].
Diffusion models excel at decomposing the generation process into recovering lower frequencies, and gradually transitioning to higher frequencies. Fig. 2 shows how, at early denoising phases, motion diffusion models prioritize global inter-frame dependencies, shifting towards local relative dependencies as the process unfolds.

Figure 2. Attention scores of a single query pose (current frame) as a function of the pose attended to (x-axis, from the start to the end of the sequence) in a diffusion-based motion generation model with a sinusoidal absolute positional encoding. Curves show the scores at each denoising step (first to last). We observe that, whereas early steps show strong global dependencies (blue), later denoising stages exhibit a clearly local behavior (red).

The proposed BPE harmonizes these dynamics during inference: at early denoising stages, our denoising model is fed with an APE and, towards the conclusion, with an RPE. A scheduler guides this transition. As a result, intra-subsequence global dependencies are recovered at the beginning of the denoising, and intra- and inter-subsequence motion smoothness and realism are promoted later. To make the model understand APE and RPE at inference, we expose it to both encodings by randomly alternating them during training. As a result, the BPE schedule can be tuned at inference time to balance the trade-off between intra-subsequence coherence and inter-subsequence realism.
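A minimal sketch of how this blending can be scheduled (our own naming; the binary step schedule and the 125/60 initial APE steps are the values reported in the implementation details of Sec. 4.1):

    import random

    # "ape" = sinusoidal absolute encoding, "rpe" = rotary relative encoding.
    def bpe_mode_train():
        # Training: expose the denoiser to both encodings by random alternation (p = 0.5).
        return "ape" if random.random() < 0.5 else "rpe"

    def bpe_mode_sample(t, T=1000, n_ape_steps=125):
        # Inference: binary step schedule. The first n_ape_steps denoising steps
        # (the noisiest ones) use APE to recover intra-subsequence semantics; the
        # remaining steps use RPE so that transitions stay translation invariant.
        # The paper reports 125 (Babel) / 60 (HumanML3D) initial APE steps.
        steps_done = T - t            # t runs from T (pure noise) down to 1
        return "ape" if steps_done < n_ape_steps else "rpe"

    # With T=1000 and n_ape_steps=125: steps t=1000..876 use APE, t=875..1 use RPE.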
Rotary Position Encoding (RoPE). Our choice for RPE is rotary embeddings [76]. RoPE integrates a position embedding into the queries and keys, ensuring that after the dot-product multiplication, the attention scores' positional information reflects only the relative pairwise distance between queries and keys. Specifically, let W_q and W_k be the projection matrices into the d-dimensional spaces of queries and keys. Then, RoPE encodes the absolute positions m and n of a pair of query (q_m = W_q x_m) and key (k_n = W_k x_n), respectively, as d-dimensional rotations R_m^d, R_n^d over the projected poses x_m, x_n. The rotation angles are parameterized by m and n so that the attention formulation becomes:

q_m^T k_n = (R_m^d W_q x_m)^T (R_n^d W_k x_n) = x_m^T W_q^T R_{n-m}^d W_k x_n.   (1)

Note that the resulting rotation R_{n-m}^d only depends on the distance between n and m, and any absolute information about n or m is removed. RoPE is a natural choice for our RPE due to its simplicity and convenient injection before the attention takes place. As a result, RoPE is compatible with faster attention techniques like FlashAttention [22, 23].
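For reference, a compact NumPy sketch of the rotary mechanism behind Eq. (1), following the RoFormer formulation [76] (the function rotary and the toy check below are ours, not the paper's implementation):

    import numpy as np

    def rotary(x, pos, base=10000.0):
        # Rotate consecutive feature pairs of x (shape [d]) by angles proportional
        # to the absolute position `pos`, as in RoPE [76].
        d = x.shape[-1]
        inv_freq = base ** (-np.arange(0, d, 2) / d)     # (d/2,)
        ang = pos * inv_freq
        cos, sin = np.cos(ang), np.sin(ang)
        x1, x2 = x[0::2], x[1::2]
        out = np.empty_like(x)
        out[0::2] = x1 * cos - x2 * sin
        out[1::2] = x1 * sin + x2 * cos
        return out

    # The attention score then depends only on the offset n - m, as in Eq. (1):
    rng = np.random.default_rng(0)
    q, k = rng.normal(size=8), rng.normal(size=8)
    s1 = rotary(q, 3) @ rotary(k, 7)     # positions (3, 7): offset 4
    s2 = rotary(q, 10) @ rotary(k, 14)   # positions (10, 14): same offset 4
    assert np.allclose(s1, s2)           # only relative positional information survives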
Sinusoidal Position Encoding. Our APE is the classic sinusoidal position encoding [85], which leverages sine and cosine functions to inject positional information. It is added to the queries, keys, and values of the attention layers.
Note that for APE, attention is limited to each subsequence, while for RPE, attention spans all frames up to the attention horizon H < L < N. Since L defines the maximum range of motion dynamics learned during RPE training, there is no advantage in setting H ≥ L (Tabs. D/E in supp. material). Leveraging both APE and RPE constraints ensures quadratic complexity over the maximum subsequence length L in both memory and computation [11]. As a result, FlowMDM's complexity is equivalent to that of other Transformer-based motion diffusion models [73, 103].
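The two attention constraints just described can be written as boolean masks; the sketch below (our own naming, not the released code) builds a block-diagonal mask for the APE stage and a banded mask with horizon H for the RPE stage:

    import torch

    def ape_mask(boundaries, n_frames):
        # APE stage: attention restricted to frames of the same subsequence.
        mask = torch.zeros(n_frames, n_frames, dtype=torch.bool)
        for start, end in boundaries:              # e.g. [(0, 96), (96, 186), ...]
            mask[start:end, start:end] = True
        return mask

    def rpe_mask(n_frames, horizon=100):
        # RPE stage: attention limited to |m - n| <= H regardless of subsequence,
        # which keeps memory and compute quadratic in L rather than in N.
        idx = torch.arange(n_frames)
        return (idx[None, :] - idx[:, None]).abs() <= horizon

    print(ape_mask([(0, 3), (3, 6)], 6).int())
    print(rpe_mask(6, horizon=2).int())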
3.3. Pose-centric cross-attention

In order to make motion generation with diffusion models efficient, we would like to simultaneously generate very long sequences. In motion Transformers, the generation is conditioned at a sequence level by injecting the condition as a token [82], or as a sequence-wise transformation in intermediate layers [102]. Therefore, they cannot be conditioned on multiple signals in different subsequences. For this reason, diffusion-based methods for HMC opted for individually generating sequences and then merging them [73, 103]. To enable such simultaneous heterogeneous conditioning without any extra postprocessing, we propose to inject the condition at every frame. However, we still need to deal with a challenge: the condition never varies at training time. Therefore, at inference time, attention scores are computed with the embeddings E_{x_m,c_m} and E_{x_n,c_n} of the pose-condition pairs (x_m, c_m) and (x_n, c_n) as:

q_m^T k_n = (W_q E_{x_m,c_m})^T (W_k E_{x_n,c_n}) = E_{x_m,c_m}^T W_q^T W_k E_{x_n,c_n}.   (2)

When c_m ≠ c_n, q_m^T k_n was never encountered during training. If, instead of injecting the condition at every frame, we used cross-attention layers, distinct conditions would also be temporally mixed, and we would face the same problem. To reduce the presence and impact of such training-inference misalignment, we introduce PCCAT (see Fig. 3), which aims at minimizing the entanglement between conditions and noisy poses. Specifically, PCCAT combines every frame's noisy pose and condition into queries, while using only noisy poses as keys and values. Thus, Eq. 2 becomes:

q_m^T k_n = (W_q E_{x_m,c_m})^T (W_k E_{x_n}) = E_{x_m,c_m}^T W_q^T W_k E_{x_n}.   (3)

With PCCAT, the attention output for pose m becomes a weighted average of the value projections of its neighboring noisy poses. A residual connection adds the PCCAT output to the noisy poses. With comprehensive coverage of the motion spectrum in the training dataset, the network observes various poses preceding and following each pose, particularly within its local neighborhood. Therefore, local relationships do not suffer from unseen intermediate representations.
Transformer with PCCAT + BPE Q K V
5
GT, TEACH_B, TEACH, DoubleTake, DiffCollage, MultiDiffusion, FlowMDM — transition jerk for Composition (Babel, HumanML3D) and Extrapolation (Babel, HumanML3D); y-axis: maximum jerk over joints.
Figure 4. Transitions smoothness. Average maximum jerk over joints at each frame of the transitions for both the motion composition (left) and extrapolation (right) tasks. While other methods show severe smoothness artifacts in the beginning and end of their transition refinement processes, FlowMDM's jerk curve has the shortest peak for composition, and an absence of peaks for extrapolation.
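The PJ and AUJ metrics reported in Tabs. 1-4 build on this jerk signal; below is a rough Python sketch of the per-frame jerk, its peak, and its accumulated area (ours — jerk_profile, peak_jerk, and area_under_jerk are illustrative names, and the exact normalization behind the reported numbers is not specified in this excerpt):

    import numpy as np

    def jerk_profile(joints, fps=30):
        # joints: [T, J, 3] joint positions of a transition segment. Jerk is the third
        # time derivative of position; per frame, we keep the maximum magnitude over
        # joints, as plotted in Fig. 4.
        jerk = np.diff(joints, n=3, axis=0) * fps ** 3          # [T-3, J, 3]
        return np.linalg.norm(jerk, axis=-1).max(axis=-1)       # [T-3]

    def peak_jerk(joints, fps=30):
        return jerk_profile(joints, fps).max()                  # PJ: single peak value

    def area_under_jerk(joints, fps=30):
        # AUJ: jerk accumulated over the transition (trapezoidal integral in this sketch).
        return np.trapz(jerk_profile(joints, fps), dx=1.0 / fps)

    motion = np.cumsum(np.random.randn(90, 22, 3) * 0.01, axis=0)   # toy 3 s clip, 22 joints
    print(peak_jerk(motion), area_under_jerk(motion))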
Columns — Subsequence: R-prec ↑, FID ↓, Div →, MM-Dist ↓ | Transition: FID ↓, Div →, PJ →, AUJ ↓
GT 0.715±0.003 0.00±0.00 8.42±0.15 3.36±0.00 0.00±0.00 6.20±0.06 0.02±0.00 0.00±0.00
TEACH B 0.703±0.002 1.71±0.03 8.18±0.14 3.43±0.01 3.01±0.04 6.23±0.05 1.09±0.00 2.35±0.01
TEACH 0.655±0.002 1.82±0.02 7.96±0.11 3.72±0.01 3.27±0.04 6.14±0.06 0.07±0.00 0.44±0.00
DoubleTake* 0.596±0.005 3.16±0.06 7.53±0.11 4.17±0.02 3.33±0.06 6.16±0.05 0.28±0.00 1.04±0.01
DoubleTake 0.668±0.005 1.33±0.04 7.98±0.12 3.67±0.03 3.15±0.05 6.14±0.07 0.17±0.00 0.64±0.01
MultiDiffusion 0.702±0.005 1.74±0.04 8.37±0.13 3.43±0.02 6.56±0.12 5.72±0.07 0.18±0.00 0.68±0.00
DiffCollage 0.671±0.003 1.45±0.05 7.93±0.09 3.71±0.01 4.36±0.09 6.09±0.08 0.19±0.00 0.84±0.01
FlowMDM 0.702±0.004 0.99±0.04 8.36±0.13 3.45±0.02 2.61±0.06 6.47±0.05 0.06±0.00 0.13±0.00
Table 1. Comparison of FlowMDM with the state of the art in Babel. Symbols ↑, ↓, and → indicate that higher, lower, or values closer to
the ground truth (GT) are better, respectively. Evaluation is run 10 times and ± specifies the 95% confidence intervals.
Columns — Subsequence: R-prec ↑, FID ↓, Div →, MM-Dist ↓ | Transition: FID ↓, Div →, PJ →, AUJ ↓
GT 0.796±0.004 0.00±0.00 9.34±0.08 2.97±0.01 0.00±0.00 9.54±0.15 0.04±0.00 0.07±0.00
DoubleTake* 0.643±0.005 0.80±0.02 9.20±0.11 3.92±0.01 1.71±0.05 8.82±0.13 0.52±0.01 2.10±0.03
DoubleTake 0.628±0.005 1.25±0.04 9.09±0.08 4.01±0.01 4.19±0.09 8.45±0.09 0.48±0.00 1.83±0.02
MultiDiffusion 0.629±0.002 1.19±0.03 9.38±0.12 4.02±0.01 4.31±0.06 8.37±0.10 0.17±0.00 1.06±0.01
DiffCollage 0.615±0.005 1.56±0.04 8.79±0.08 4.13±0.02 4.59±0.10 8.22±0.11 0.26±0.00 2.85±0.09
FlowMDM 0.685±0.004 0.29±0.01 9.58±0.12 3.61±0.01 1.38±0.05 8.79±0.09 0.06±0.00 0.51±0.01
Table 2. Comparison of FlowMDM with the state of the art in HumanML3D.
MDM, as originally proposed (DoubleTake*). TEACH and TEACH_B cannot be trained for HumanML3D due to the lack of pairs of consecutive actions and textual descriptions.
Implementation details. We tune the hyperparameters of all models with grid search. The attention horizon for RPE, H, is set to 100/150 for Babel/HumanML3D. The number of diffusion steps is 1K for all experiments. Our model is trained with the x0 parameterization [90], and minimizes the L2 reconstruction loss. During training, RPE and APE are alternated randomly at a frequency of 0.5. We use classifier-free guidance with weights 1.5/2.5 [33]. We use a binary step function to guide the BPE sampling, yielding 125/60 initial APE steps. The minimum/maximum lengths for training subsequences are set to 30/200 and 70/200 frames (i.e., 1/6.7s and 3.5/10s). For Babel, training subsequences include consecutive ground truth motions with distinct textual descriptions in order to increase the motions' variability, and make the network explicitly robust to multiple conditions. The ablation study includes two conditioning baselines: 1) concatenating each frame's condition and noisy pose, and replacing the PCCAT with vanilla self-attention (SAT), and 2) injecting the condition with cross-attention layers (CAT). See more details in supp. material Sec. A.
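The classifier-free guidance mentioned above follows the standard formulation of [33]; a minimal sketch for an x0-predicting denoiser (guided_x0 and denoise are placeholder names of ours; only the weights 1.5/2.5 come from the paper):

    # cond=None stands for the unconditional (dropped-text) branch of the model.
    def guided_x0(denoise, x_t, t, cond, w=1.5):   # w = 1.5 (Babel) / 2.5 (HumanML3D)
        x0_uncond = denoise(x_t, t, None)
        x0_cond = denoise(x_t, t, cond)
        # Push the prediction towards the conditional branch by the guidance weight.
        return x0_uncond + w * (x0_cond - x0_uncond)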
4.2. Quantitative analysis

Comparison with the state of the art on HMC. Tables 1 and 2 show the comparison of FlowMDM with current state-of-the-art models on the Babel and HumanML3D datasets, respectively. In HumanML3D, our model outperforms the other methods by a fair margin in terms of subsequence accuracy-wise metrics (R-prec and MM-Dist) and FID. In Babel, it matches the state of the art in accuracy and excels in FID score. FlowMDM produces transitions of higher quality and smoothness on both datasets, as indicated by the FID, PJ, and AUJ metrics. The lack of correlation between the FID score and the AUJ underscores the importance of the latter as a complementary metric for assessing smoothness. Fig. 4-left shows the average jerk values across the generated transitions. We observe that state-of-
Columns — Cond., Train. PE, Inf. PE | Subsequence: R-prec ↑, FID ↓, Div →, MM-Dist ↓ | Transition: FID ↓, Div →, PJ →, AUJ ↓
GT - - 0.715±0.003 0.00±0.00 8.42±0.15 3.36±0.00 0.00±0.00 6.20±0.06 0.02±0.00 0.00±0.00
PCCAT A A 0.699±0.004 1.34±0.04 8.36±0.12 3.40±0.02 4.26±0.07 5.98±0.08 1.81±0.01 3.73±0.01
PCCAT R R 0.635±0.006 1.28±0.03 8.05±0.11 4.02±0.02 2.18±0.07 6.14±0.06 0.03±0.00 0.20±0.00
PCCAT B A 0.716±0.006 1.20±0.04 8.31±0.14 3.32±0.02 3.01±0.06 6.35±0.07 1.78±0.01 3.66±0.02
PCCAT B R 0.635±0.004 0.85±0.02 8.25±0.12 3.98±0.02 2.14±0.04 6.44±0.09 0.04±0.00 0.15±0.00
SAT B B 0.681±0.004 1.52±0.04 8.22±0.11 3.61±0.02 1.91±0.03 6.41±0.07 0.06±0.00 0.12±0.00
CAT B B 0.719±0.004 1.29±0.02 8.16±0.13 3.27±0.02 2.57±0.08 6.06±0.07 0.02±0.00 0.07±0.00
PCCAT B B 0.702±0.004 0.99±0.04 8.36±0.13 3.45±0.02 2.61±0.06 6.47±0.05 0.06±0.00 0.13±0.00
Table 3. Ablation study in Babel. Cond. indicates the conditioning scheme, Train./Inf. PE specify the positional encodings (PE) used at
training/inference time, and A, R, and B refer to absolute, relative, and blended PE, respectively. ↑, ↓, and → indicate that higher, lower,
or values closer to the ground truth (GT) are better, respectively. Evaluation is run 10 times and ± specifies the 95% confidence intervals.
Columns — Cond., Train. PE, Inf. PE | Subsequence: R-prec ↑, FID ↓, Div →, MM-Dist ↓ | Transition: FID ↓, Div →, PJ →, AUJ ↓
GT - - 0.796±0.004 0.00±0.00 9.34±0.08 2.97±0.01 0.00±0.00 9.54±0.15 0.04±0.00 0.07±0.00
PCCAT A A 0.689±0.005 0.66±0.02 9.73±0.12 3.63±0.02 3.90±0.12 8.29±0.08 1.50±0.01 3.40±0.02
PCCAT R R 0.531±0.005 1.75±0.07 8.71±0.10 4.80±0.03 2.53±0.12 8.62±0.08 0.03±0.00 0.58±0.01
PCCAT B A 0.699±0.005 0.61±0.02 9.76±0.10 3.54±0.02 2.42±0.09 8.39±0.09 1.40±0.01 3.29±0.02
PCCAT B R 0.554±0.007 1.06±0.06 9.02±0.11 4.54±0.02 1.12±0.04 9.00±0.10 0.05±0.00 0.53±0.01
SAT B B 0.692±0.004 0.49±0.02 9.08±0.09 3.51±0.01 3.19±0.08 8.09±0.11 0.04±0.00 0.36±0.02
CAT B B 0.622±0.005 1.27±0.04 8.86±0.15 4.10±0.01 3.93±0.14 8.23±0.10 0.04±0.00 0.49±0.02
PCCAT B B 0.685±0.004 0.29±0.01 9.58±0.12 3.61±0.01 1.38±0.05 8.79±0.09 0.06±0.00 0.51±0.01
Table 4. Ablation study in HumanML3D.
the-art methods exhibit severe smoothness artifacts. Dur-
ing TEACH’s spherical linear interpolation, the jerk quickly
reaches values near zero. By contrast, DiffCollage leans to-
TEACH DoubleTake DiffCollage MultiDiffusion FlowMDM (Ours)
Figure 6. Qualitative analysis (Babel). A) and B) show compositions of 3 motions ('walk straight'→'side steps'→'walk backward', and 'walk'→'turn around'→'sit on the bench', respectively), and C) and D) illustrate extrapolations that repeat 6 times a static ('t-pose') and a dynamic ('step to the right') action, respectively. Solid curves match the trajectories of the global position (blue) and left/right hands (purple/green). Darker colors indicate instantaneous jerk deviations from the median value, saturating at twice the jerk's standard deviation in the dataset (black segments). Abrupt transitions manifest as black segments amidst lighter ones. FlowMDM exhibits the most fluid motion and preserves the staticity or periodicity of extrapolated actions, in contrast to other methods that show spontaneous high jerk values and fail to keep the motion coherence in extrapolations.
scores of the APE models. Fig. 5 illustrates this balance. Specifically, increasing the number of APE steps enhances the motion's congruence with the textual description, at the cost of reducing the smoothness and realism of the transitions. In HumanML3D, the SAT and CAT conditioning schemes lead to worse transitions in terms of FID and diversity. This is caused by the coexistence of different conditions in the local neighborhood of the transition at inference, which never happens during training. Our PCCAT conditioning technique effectively solves this problem. In Babel, such an effect is not present because the training motion sequences include several subsequences, thus increasing the model's robustness to transitions with varying conditions.
On the efficiency of FlowMDM. Diffusion-based state-of-the-art methods such as MultiDiffusion and DiffCollage denoise poses from the transition more than once in order to harmonize it with the adjacent motions. DoubleTake's transitions undergo an additional denoising process, which adds computational burden and cannot be parallelized. In contrast, FlowMDM does not apply redundant denoising steps to any pose. In particular, our model goes through 47.1%, 28.4%, and 16.5% fewer pose-wise denoising steps than DoubleTake, DiffCollage, and MultiDiffusion, respectively.

4.3. Qualitative results

Fig. 6 illustrates how our quantitative findings translate into visual outcomes on the human motion composition and extrapolation tasks. First, as anticipated by Fig. 4, we confirm that state-of-the-art methods produce short intervals of jerk peaks around transitions. These do not typically match long-range motion scenarios, where such jerks might be contextually appropriate. Contrarily, FlowMDM produces motion that is realistic, accurate, and smooth. Particularly, we notice that DiffCollage's bias toward producing constantly high jerk values around transitions is perceived as an overall chaotic motion. Due to the independent generation of their subsequences, DoubleTake, DiffCollage, and MultiDiffusion are unable to maintain the static or periodic nature of actions when extrapolating them. Only TEACH and FlowMDM are able to successfully extrapolate a static 't-pose', and ours is the only one capable of extrapolating a 'step to the right' sequence realistically. Finally, FlowMDM also inherits the trajectory control capabilities of motion diffusion models, as shown in Fig. 1-right.

5. Conclusion

We presented FlowMDM, the first approach that generates human motion compositions simultaneously, without undergoing postprocessing or redundant denoising diffusion steps. We also introduced the blended positional encodings to combine the benefits of absolute and relative positional encodings during the denoising chain. Finally, we presented the pose-centric cross-attention, a technique that improves the generation of transitions when training with only a single condition per motion sequence.
Limitations and future work. The absolute stage of BPE does not model relationships between subsequences.
Consequently, their low-frequency spectrum is generated independently. This limitation could be addressed in future work by incorporating an intention planning module. Finally, our method learns a strong motion prior that generates transitions between combinations of actions never seen at training time. Such capability could theoretically be used with different models leveraging different control signals, assuming they all are trained under the same framework. Future work will experimentally validate this hypothesis.

References
In Understanding Social Behavior in Dyadic and Small Group Interactions, pages 107–138. PMLR, 2022. 2
[11] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020. 4
[12] Bharat Lal Bhatnagar, Xianghui Xie, Ilya A Petrov, Cristian Sminchisescu, Christian Theobalt, and Gerard Pons-Moll. Behave: Dataset and method for tracking human object interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15935–15946, 2022. 2
[13] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it
smpl: Automatic estimation of 3d human pose and shape
[1] Vida Adeli, Mahsa Ehsanpour, Ian Reid, Juan Car-
from a single image. In Computer Vision–ECCV 2016: 14th
los Niebles, Silvio Savarese, Ehsan Adeli, and Hamid
European Conference, Amsterdam, The Netherlands, Octo-
Rezatofighi. Tripod: Human trajectory and pose dynamics
ber 11-14, 2016, Proceedings, Part V 14, pages 561–578.
forecasting in the wild. In Proceedings of the IEEE/CVF In-
Springer, 2016. 5, 20
ternational Conference on Computer Vision, pages 13390–
[14] Paulo Vinicius Koerich Borges, Nicola Conci, and Andrea
13400, 2021. 2
Cavallaro. Video-based human behavior understanding: A
[2] Simon Alexanderson, Rajmund Nagy, Jonas Beskow, and
survey. IEEE transactions on circuits and systems for video
Gustav Eje Henter. Listen, denoise, action! audio-driven
technology, 23(11):1993–2008, 2013. 1
motion synthesis with diffusion models. ACM Transactions
[15] Zhe Cao, Hang Gao, Karttikeya Mangalam, Qi-Zhi Cai,
on Graphics (TOG), 42(4):1–20, 2023. 2
Minh Vo, and Jitendra Malik. Long-term human mo-
[3] Sadegh Aliakbarian, Microsoft Fatemeh Saleh ACRV,
tion prediction with scene context. In Computer Vision–
Stephen Gould ACRV, and Anu Mathieu Salzmann CVLab.
ECCV 2020: 16th European Conference, Glasgow, UK, Au-
Contextually plausible and diverse 3d human motion pre-
gust 23–28, 2020, Proceedings, Part I 16, pages 387–404.
diction. Proceedings of the IEEE/CVF International Con-
Springer, 2020. 2
ference on Computer Vision (ICCV), 2021. 2
[16] Angela Castillo, Maria Escobar, Guillaume Jeanneret, Al-
[4] Nikos Athanasiou, Mathis Petrovich, Michael J Black, and
bert Pumarola, Pablo Arbeláez, Ali Thabet, and Artsiom
Gül Varol. Teach: Temporal action composition for 3d
Sanakoyeu. Bodiffusion: Diffusing sparse observations
humans. In 2022 International Conference on 3D Vision
for full-body human motion synthesis. arXiv preprint
(3DV), pages 414–423. IEEE, 2022. 2, 5
[5] Sivakumar Balasubramanian, Alejandro Melendez- arXiv:2304.11118, 2023. 5
[17] Kang Chen, Zhipeng Tan, Jin Lei, Song-Hai Zhang, Yuan-
Calderon, and Etienne Burdet. A robust and sensitive
Chen Guo, Weidong Zhang, and Shi-Min Hu. Choreomas-
metric for quantifying movement smoothness. IEEE
ter: choreography-oriented music-driven dance synthesis.
transactions on biomedical engineering, 59(8):2126–2136,
ACM Transactions on Graphics (TOG), 40(4):1–13, 2021.
2011. 5
[6] Sivakumar Balasubramanian, Alejandro Melendez- 2
[18] Enric Corona, Albert Pumarola, Guillem Alenya, and
Calderon, Agnes Roby-Brami, and Etienne Burdet.
Francesc Moreno-Noguer. Context-aware human motion
On the analysis of movement smoothness. Journal of
prediction. In Proceedings of the IEEE/CVF Conference
neuroengineering and rehabilitation, 12(1):1–11, 2015. 5
[7] Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. on Computer Vision and Pattern Recognition, pages 6992–
Multidiffusion: Fusing diffusion paths for controlled image 7001, 2020. 2
[19] Antonia Creswell, Tom White, Vincent Dumoulin, Kai
generation. arXiv preprint arXiv:2302.08113, 2023. 2, 3, 5
[8] German Barquero, Sergio Escalera, and Cristina Palmero. Arulkumaran, Biswa Sengupta, and Anil A Bharath. Gen-
Belfusion: Latent diffusion for behavior-driven human mo- erative adversarial networks: An overview. IEEE signal
tion prediction. In Proceedings of the IEEE/CVF Interna- processing magazine, 35(1):53–65, 2018. 5
[20] Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu,
tional Conference on Computer Vision, pages 2317–2327,
and Mubarak Shah. Diffusion models in vision: A survey.
2023. 2, 5
[9] German Barquero, Johnny Núnez, Sergio Escalera, Zhen IEEE Transactions on Pattern Analysis and Machine Intel-
Xu, Wei-Wei Tu, Isabelle Guyon, and Cristina Palmero. ligence, 2023. 2
[21] Rishabh Dabral, Muhammad Hamza Mughal, Vladislav
Didn’t see that coming: a survey on non-verbal social hu-
Golyanik, and Christian Theobalt. Mofusion: A frame-
man behavior forecasting. In Understanding Social Behav-
work for denoising-diffusion-based motion synthesis. In
ior in Dyadic and Small Group Interactions, pages 139–
Proceedings of the IEEE/CVF Conference on Computer Vi-
178. PMLR, 2022. 2
[10] German Barquero, Johnny Núñez, Zhen Xu, Sergio Es- sion and Pattern Recognition, pages 9760–9770, 2023. 2
[22] Tri Dao. Flashattention-2: Faster attention with bet-
calera, Wei-Wei Tu, Isabelle Guyon, and Cristina Palmero.
ter parallelism and work partitioning. arXiv preprint
Comparison of spatio-temporal models for human motion
arXiv:2307.08691, 2023. 4
and pose forecasting in face-to-face interaction scenarios.
[23] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christo- [38] Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina
pher Ré. Flashattention: Fast and memory-efficient exact Toutanova. Bert: Pre-training of deep bidirectional trans-
attention with io-awareness. Advances in Neural Informa- formers for language understanding. In Proceedings of
tion Processing Systems, 35:16344–16359, 2022. 4 NAACL-HLT, pages 4171–4186, 2019. 3
[24] David A Engström, JA Scott Kelso, and Tom Holroyd. [39] Jihoon Kim, Taehyun Byun, Seungyoun Shin, Jung-
Reaction-anticipation transitions in human perception- dam Won, and Sungjoon Choi. Conditional motion in-
action patterns. Human movement science, 15(6):809–832, betweening. Pattern Recognition, 132:108894, 2022. 2
1996. 2 [40] Jihoon Kim, Jiseob Kim, and Sungjoon Choi. Flame: Free-
[25] Philipp Gulde and Joachim Hermsdörfer. Smoothness met- form language-based motion synthesis & editing. In Pro-
rics in complex movement tasks. Frontiers in neurology, ceedings of the AAAI Conference on Artificial Intelligence,
9:615, 2018. 5 volume 37, pages 8255–8263, 2023. 2
[26] Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, [41] Diederik P Kingma and Jimmy Ba. Adam: A method for
Xingyu Li, and Li Cheng. Generating diverse and natural 3d stochastic optimization. arXiv preprint arXiv:1412.6980,
human motions from text. In Proceedings of the IEEE/CVF 2014. 14
Conference on Computer Vision and Pattern Recognition, [42] Nilesh Kulkarni, Davis Rempe, Kyle Genova, Abhi-
pages 5152–5161, 2022. 2, 5 jit Kundu, Justin Johnson, David Fouhey, and Leonidas
[27] Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng. Tm2t: Guibas. Nifty: Neural object interaction fields for guided
Stochastic and tokenized modeling for the reciprocal gener- human motion synthesis. arXiv preprint arXiv:2307.07511,
ation of 3d human motions and texts. In European Confer- 2023. 2
ence on Computer Vision, pages 580–597. Springer, 2022. [43] Wilfried Kunde, Katrin Elsner, and Andrea Kiesel. No
2 anticipation–no action: the role of anticipation in action and
[28] Wen Guo, Xiaoyu Bie, Xavier Alameda-Pineda, and perception. Cognitive Processing, 8:71–78, 2007. 2
Francesc Moreno-Noguer. Multi-person extreme motion [44] Caroline Larboulette and Sylvie Gibet. A review of com-
prediction. In Proceedings of the IEEE/CVF Conference on putable expressive descriptors of human motion. In Pro-
Computer Vision and Pattern Recognition, pages 13053– ceedings of the 2nd International Workshop on Movement
13064, 2022. 2 and Computing, pages 21–28, 2015. 5
[29] Félix G Harvey, Mike Yurick, Derek Nowrouzezahrai, and [45] Taeryung Lee, Gyeongsik Moon, and Kyoung Mu Lee.
Christopher Pal. Robust motion in-betweening. ACM Multiact: Long-term 3d human motion generation from
Transactions on Graphics (TOG), 39(4):60–1, 2020. 2 multiple action labels. In Proceedings of the AAAI Con-
[30] Mohamed Hassan, Duygu Ceylan, Ruben Villegas, Jun ference on Artificial Intelligence, volume 37, pages 1231–
Saito, Jimei Yang, Yi Zhou, and Michael J Black. Stochas- 1239, 2023. 2
tic scene-aware motion prediction. In Proceedings of the [46] Jiaman Li, Ruben Villegas, Duygu Ceylan, Jimei Yang,
IEEE/CVF International Conference on Computer Vision, Zhengfei Kuang, Hao Li, and Yajie Zhao. Task-generic hi-
pages 11374–11384, 2021. 2 erarchical human motion prior using vaes. In 2021 Inter-
[31] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, national Conference on 3D Vision (3DV), pages 771–781.
Bernhard Nessler, and Sepp Hochreiter. Gans trained by a IEEE, 2021. 2
two time-scale update rule converge to a local nash equilib- [47] Ruilong Li, Shan Yang, David A Ross, and Angjoo
rium. Advances in neural information processing systems, Kanazawa. Ai choreographer: Music conditioned 3d dance
30, 2017. 5 generation with aist++. In Proceedings of the IEEE/CVF In-
[32] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- ternational Conference on Computer Vision, pages 13401–
fusion probabilistic models. Advances in Neural Informa- 13412, 2021. 2
tion Processing Systems, 33:6840–6851, 2020. 2 [48] Shuai Li, Sisi Zhuang, Wenfeng Song, Xinyu Zhang, Hejia
[33] Jonathan Ho and Tim Salimans. Classifier-free diffusion Chen, and Aimin Hao. Sequential texts driven cohesive mo-
guidance. arXiv preprint arXiv:2207.12598, 2022. 6, 16 tions synthesis with natural transitions. In Proceedings of
[34] Neville Hogan and Dagmar Sternad. Sensitivity of smooth- the IEEE/CVF International Conference on Computer Vi-
ness measures to movement duration, amplitude, and ar- sion, pages 9498–9508, 2023. 2, 5
rests. Journal of motor behavior, 41(6):529–534, 2009. 5 [49] Weiyu Li, Xuelin Chen, Peizhuo Li, Olga Sorkine-
[35] Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Hornung, and Baoquan Chen. Example-based motion syn-
Tao Chen. Motiongpt: Human motion as a foreign lan- thesis via generative motion matching. arXiv preprint
guage. arXiv preprint arXiv:2306.14795, 2023. 2 arXiv:2306.00378, 2023. 2
[36] Manuel Kaufmann, Emre Aksan, Jie Song, Fabrizio Pece, [50] Yunhao Li, Zhenbo Yu, Yucheng Zhu, Bingbing Ni, Guang-
Remo Ziegler, and Otmar Hilliges. Convolutional autoen- tao Zhai, and Wei Shen. Skeleton2humanoid: Animat-
coders for human motion infilling. In 2020 International ing simulated characters for physically-plausible motion in-
Conference on 3D Vision (3DV), pages 918–927. IEEE, betweening. In Proceedings of the 30th ACM International
2020. 2 Conference on Multimedia, pages 1493–1502, 2022. 2
[37] Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Nate- [51] Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao
san Ramamurthy, Payel Das, and Siva Reddy. The impact of Zhang, Haoqian Wang, and Lei Zhang. Motion-x: A large-
positional encoding on length generalization in transform- scale 3d expressive whole-body human motion dataset. In
ers. arXiv preprint arXiv:2305.19466, 2023. 3 Thirty-seventh Conference on Neural Information Process-
ing Systems Datasets and Benchmarks Track, 2023. 2 Athanasiou, Alejandra Quiros-Ramirez, and Michael J
[52] Zhengyi Luo, Jinkun Cao, Kris Kitani, Weipeng Xu, et al. Black. Babel: Bodies, action and behavior with english
Perpetual humanoid control for real-time simulated avatars. labels. In Proceedings of the IEEE/CVF Conference on
In Proceedings of the IEEE/CVF International Conference Computer Vision and Pattern Recognition, pages 722–731,
on Computer Vision, pages 10895–10904, 2023. 2 2021. 2, 5
[53] Hengbo Ma, Jiachen Li, Ramtin Hosseini, Masayoshi [66] Yijun Qian, Jack Urbanek, Alexander G Hauptmann, and
Tomizuka, and Chiho Choi. Multi-objective diverse human Jungdam Won. Breaking the limits of text-conditioned 3d
motion prediction with knowledge distillation. Proceedings motion synthesis with elaborative descriptions. In Proceed-
of the IEEE/CVF Conference on Computer Vision and Pat- ings of the IEEE/CVF International Conference on Com-
tern Recognition, 2022. 2 puter Vision, pages 2306–2316, 2023. 2
[54] Shugao Ma, Tomas Simon, Jason Saragih, Dawei Wang, [67] Jia Qin, Youyi Zheng, and Kun Zhou. Motion in-
Yuecheng Li, Fernando De La Torre, and Yaser Sheikh. betweening via two-stage transformers. ACM Transactions
Pixel codec avatars. In Proceedings of the IEEE/CVF Con- on Graphics (TOG), 41(6):1–16, 2022. 2
ference on Computer Vision and Pattern Recognition, pages [68] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
64–73, 2021. 1 Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
[55] Antoine Maiorca, Youngwoo Yoon, and Thierry Dutoit. Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn-
Evaluating the quality of a synthesized motion with the ing transferable visual models from natural language super-
fréchet motion distance. In ACM SIGGRAPH 2022 Posters, vision. In International conference on machine learning,
pages 1–2, 2022. 5 pages 8748–8763. PMLR, 2021. 14
[56] Antoine Maiorca, Youngwoo Yoon, and Thierry Dutoit. [69] Tianxiang Ren, Jubo Yu, Shihui Guo, Ying Ma, Yutao
Validating objective evaluation metric: Is fréchet motion Ouyang, Zijiao Zeng, Yazhan Zhang, and Yipeng Qin. Di-
distance able to capture foot skating artifacts? In Proceed- verse motion in-betweening with dual posture stitching.
ings of the 2023 ACM International Conference on Interac- arXiv preprint arXiv:2303.14457, 2023. 2
tive Media Experiences, pages 242–247, 2023. 5 [70] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
[57] Wei Mao, Miaomiao Liu, and Mathieu Salzmann. Generat- Patrick Esser, and Björn Ommer. High-resolution image
ing smooth pose sequences for diverse human motion pre- synthesis with latent diffusion models. In Proceedings of
diction. Proceedings of the IEEE/CVF International Con- the IEEE/CVF conference on computer vision and pattern
ference on Computer Vision (ICCV), 2021. 2 recognition, pages 10684–10695, 2022. 2
[58] Alexander Quinn Nichol and Prafulla Dhariwal. Im- [71] Mehdi SM Sajjadi, Olivier Bachem, Mario Lucic, Olivier
proved denoising diffusion probabilistic models. In Inter- Bousquet, and Sylvain Gelly. Assessing generative models
national Conference on Machine Learning, pages 8162– via precision and recall. Advances in neural information
8171. PMLR, 2021. 14, 15 processing systems, 31, 2018. 5
[59] Boris N Oreshkin, Antonios Valkanas, Félix G Harvey, [72] Tim Salzmann, Marco Pavone, and Markus Ryll. Motron:
Louis-Simon Ménard, Florent Bocquelet, and Mark J Multimodal probabilistic human motion forecasting. Pro-
Coates. Motion in-betweening via deep delta-interpolator. ceedings of the IEEE/CVF Conference on Computer Vision
IEEE Transactions on Visualization and Computer Graph- and Pattern Recognition, 2022. 2
ics, 2023. 2 [73] Yonatan Shafir, Guy Tevet, Roy Kapon, and Amit H
[60] Cristina Palmero, German Barquero, Julio CS Jacques Ju- Bermano. Human motion diffusion as a generative prior.
nior, Albert Clapés, Johnny Núnez, David Curto, Sorina arXiv preprint arXiv:2303.01418, 2023. 2, 3, 4, 5, 14
Smeureanu, Javier Selva, Zejian Zhang, David Saeteros, [74] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan,
et al. Chalearn lap challenges on self-reported personal- and Surya Ganguli. Deep unsupervised learning using
ity recognition and non-verbal behavior forecasting during nonequilibrium thermodynamics. In International Con-
social dyadic interactions: Dataset, design, and results. In ference on Machine Learning, pages 2256–2265. PMLR,
Understanding Social Behavior in Dyadic and Small Group 2015. 2
Interactions, pages 4–52. PMLR, 2022. 2 [75] Paul Starke, Sebastian Starke, Taku Komura, and Frank
[61] Zizheng Pan, Jianfei Cai, and Bohan Zhuang. Fast vision Steinicke. Motion in-betweening with phase manifolds.
transformers with hilo attention. Advances in Neural Infor- Proceedings of the ACM on Computer Graphics and Inter-
mation Processing Systems, 35:14541–14554, 2022. 3 active Techniques, 6(3):1–17, 2023. 2
[62] Sang-Min Park and Young-Gab Kim. A metaverse: Taxon- [76] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha,
omy, components, applications, and open challenges. IEEE Bo Wen, and Yunfeng Liu. Roformer: Enhanced trans-
access, 10:4209–4251, 2022. 1 former with rotary position embedding. arXiv preprint
[63] Mathis Petrovich, Michael J Black, and Gül Varol. Temos: arXiv:2104.09864, 2021. 4, 14
Generating diverse human motions from textual descrip- [77] Guofei Sun, Yongkang Wong, Zhiyong Cheng, Mohan S
tions. In European Conference on Computer Vision, pages Kankanhalli, Weidong Geng, and Xiangdong Li. Deep-
480–497. Springer, 2022. 2, 5 dance: music-to-dance motion choreography with adver-
[64] Matthias Plappert, Christian Mandery, and Tamim Asfour. sarial learning. IEEE Transactions on Multimedia, 23:497–
The kit motion-language dataset. Big data, 4(4):236–252, 509, 2020. 2
2016. 2 [78] Jiarui Sun and Girish Chowdhary. Towards globally con-
[65] Abhinanda R Punnakkal, Arjun Chandrasekaran, Nikos sistent stochastic human motion prediction via motion dif-
fusion. arXiv preprint arXiv:2305.12554, 2023. 2 2021 International Conference on 3D Vision (3DV), pages
[79] Ryo Suzuki, Adnan Karim, Tian Xia, Hooman Hedayati, 606–616. IEEE, 2021. 2
and Nicolai Marquardt. Augmented reality and robotics: A [92] Sirui Xu, Zhengyuan Li, Yu-Xiong Wang, and Liang-Yan
survey and taxonomy for ar-enhanced human-robot inter- Gui. Interdiff: Generating 3d human-object interactions
action and robotic interfaces. In Proceedings of the 2022 with physics-informed diffusion. In Proceedings of the
CHI Conference on Human Factors in Computing Systems, IEEE/CVF International Conference on Computer Vision,
pages 1–33, 2022. 1 pages 14928–14940, 2023. 2
[80] Julian Tanke, Linguang Zhang, Amy Zhao, Chengcheng [93] Sirui Xu, Yu-Xiong Wang, and Liangyan Gui. Stochastic
Tang, Yujun Cai, Lezi Wang, Po-Chen Wu, Juergen Gall, multi-person 3d motion forecasting. In The Eleventh Inter-
and Cem Keskin. Social diffusion: Long-term multiple hu- national Conference on Learning Representations, 2022. 2
man motion anticipation. In Proceedings of the IEEE/CVF [94] Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Run-
International Conference on Computer Vision, pages 9601– sheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-
9611, 2023. 2 Hsuan Yang. Diffusion models: A comprehensive survey of
[81] Guy Tevet, Brian Gordon, Amir Hertz, Amit H Bermano, methods and applications. ACM Computing Surveys, 2022.
and Daniel Cohen-Or. Motionclip: Exposing human mo- 3, 5
tion generation to clip space. In European Conference on [95] Zhao Yang, Bing Su, and Ji-Rong Wen. Synthesizing long-
Computer Vision, pages 358–374. Springer, 2022. 2 term human motions with diffusion models via coherent
[82] Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel sampling. In Proceedings of the 31st ACM International
Cohen-or, and Amit Haim Bermano. Human motion diffu- Conference on Multimedia, pages 3954–3964, 2023. 2, 5
sion model. In The Eleventh International Conference on [96] Zijie Ye, Haozhe Wu, Jia Jia, Yaohua Bu, Wei Chen, Fanbo
Learning Representations, 2022. 2, 3, 4, 14 Meng, and Yanfeng Wang. Choreonet: Towards music to
[83] Sibo Tian, Minghui Zheng, and Xiao Liang. Transfu- dance synthesis with choreographic action unit. In Proceed-
sion: A practical and effective transformer-based diffusion ings of the 28th ACM International Conference on Multime-
model for 3d human motion prediction. arXiv preprint dia, pages 744–752, 2020. 2
arXiv:2307.16106, 2023. 2 [97] Hongwei Yi, Chun-Hao P Huang, Shashank Tripathi, Lea
[84] Jonathan Tseng, Rodrigo Castellon, and Karen Liu. Edge: Hering, Justus Thies, and Michael J Black. Mime:
Editable dance generation from music. In Proceedings of Human-aware 3d scene generation. In Proceedings of the
the IEEE/CVF Conference on Computer Vision and Pattern IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 448–458, 2023. 2 Recognition, pages 12965–12976, 2023. 2
[85] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob [98] Xinyu Yi, Yuxiao Zhou, and Feng Xu. Transpose: Real-
Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, time 3d human translation and pose estimation with six
and Illia Polosukhin. Attention is all you need. Advances inertial sensors. ACM Transactions on Graphics (TOG),
in neural information processing systems, 30, 2017. 4 40(4):1–13, 2021. 5
[86] Jacob Walker, Kenneth Marino, Abhinav Gupta, and Mar- [99] Ye Yuan and Kris Kitani. Dlow: Diversifying latent flows
tial Hebert. The pose knows: Video forecasting by gener- for diverse human motion prediction. In Computer Vision–
ating pose futures. Proceedings of the IEEE international ECCV 2020: 16th European Conference, Glasgow, UK, Au-
conference on computer vision, 2017. 2 gust 23–28, 2020, Proceedings, Part IX 16, pages 346–364.
[87] Jingbo Wang, Yu Rong, Jingyuan Liu, Sijie Yan, Dahua Springer, 2020. 2
Lin, and Bo Dai. Towards diverse and natural scene- [100] Ye Yuan, Jiaming Song, Umar Iqbal, Arash Vahdat, and Jan
aware 3d human motion synthesis. In Proceedings of the Kautz. Physdiff: Physics-guided human motion diffusion
IEEE/CVF Conference on Computer Vision and Pattern model. In Proceedings of the IEEE/CVF International Con-
Recognition, pages 20460–20469, 2022. 2 ference on Computer Vision, pages 16010–16021, 2023. 2
[88] Jiashun Wang, Huazhe Xu, Jingwei Xu, Sifei Liu, and [101] Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Yong
Xiaolong Wang. Synthesizing long-term 3d human mo- Zhang, Hongwei Zhao, Hongtao Lu, Xi Shen, and Ying
tion and interaction in 3d scenes. In Proceedings of the Shan. Generating human motion from textual descrip-
IEEE/CVF Conference on Computer Vision and Pattern tions with discrete representations. In Proceedings of the
Recognition, pages 9401–9411, 2021. IEEE/CVF Conference on Computer Vision and Pattern
[89] Jingbo Wang, Sijie Yan, Bo Dai, and Dahua Lin. Scene- Recognition, pages 14730–14740, 2023.
aware generative network for human motion synthesis. In [102] Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou
Proceedings of the IEEE/CVF Conference on Computer Vi- Hong, Xinying Guo, Lei Yang, and Ziwei Liu. Motiondif-
sion and Pattern Recognition, pages 12206–12215, 2021. fuse: Text-driven human motion generation with diffusion
2 model. arXiv preprint arXiv:2208.15001, 2022. 2, 4
[90] Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling [103] Qinsheng Zhang, Jiaming Song, Xun Huang, Yongxin
the generative learning trilemma with denoising diffusion Chen, and Ming-Yu Liu. Diffcollage: Parallel generation
GANs. In International Conference on Learning Represen- of large content with diffusion models. In Proceedings of
tations (ICLR), 2022. 6, 14 the IEEE/CVF Conference on Computer Vision and Pattern
[91] Jiachen Xu, Min Wang, Jingyu Gong, Wentao Liu, Chen Recognition, pages 10188–10198, 2023. 2, 3, 4, 5
Qian, Yuan Xie, and Lizhuang Ma. Exploring versatile [104] Yan Zhang, Michael J Black, and Siyu Tang. Perpetual mo-
prior for human motion via motion frequency guidance. In tion: Generating unbounded human motion. arXiv preprint
arXiv:2007.13886, 2020. 2
[105] Yan Zhang and Siyu Tang. The wanderings of odysseus in
3d scenes. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 20481–
20491, 2022. 2
[106] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and
Hao Li. On the continuity of rotation representations in
neural networks. In Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition, pages
5745–5753, 2019. 5
[107] Yi Zhou, Zimo Li, Shuangjiu Xiao, Chong He, Zeng
Huang, and Hao Li. Auto-conditioned recurrent networks
for extended complex human motion synthesis. In Interna-
tional Conference on Learning Representations, 2018. 2,
3
[108] Yi Zhou, Jingwan Lu, Connelly Barnes, Jimei Yang, Sitao
Xiang, et al. Generative tweening: Long-term inbetweening
of 3d human motions. arXiv preprint arXiv:2005.08891,
2020. 2
[109] Wentao Zhu, Xiaoxuan Ma, Dongwoo Ro, Hai Ci, Jinlu
Zhang, Jiaxin Shi, Feng Gao, Qi Tian, and Yizhou Wang.
Human motion generation: A survey. arXiv preprint
arXiv:2307.10894, 2023. 1
[110] Wenlin Zhuang, Congyi Wang, Jinxiang Chai, Yangang
Wang, Ming Shao, and Siyu Xia. Music2dance: Dancenet
for music-driven dance generation. ACM Transactions
on Multimedia Computing, Communications, and Applica-
tions (TOMM), 18(2):1–21, 2022. 2
13
Supplementary Material

A. Further implementation details

All values are reported as X/Y for Babel/HumanML3D, or as Z if values are equal for both. Note that motion sequences are downsampled to 30/20 fps.

State-of-the-art models. TEACH is used off-the-shelf¹ with the originally proposed alignment and spherical linear interpolation, and without them (TEACH B). DoubleTake is used off-the-shelf² from their original repository, with the parameters handshake size and blending length set to 10/20f (frames) and 10/5f, respectively. To fulfill the constraints of their method, the handshake size needs to be shorter than half the shortest sequence we want to generate, which is 30f (1s) for Babel. Since DoubleTake uses the original Motion Diffusion Model [82], whose training discarded very short sequences, it underperforms in our more comprehensive evaluation protocol (see Sec. B). For a fairer comparison, we also evaluate it using our diffusion model with absolute positional encodings (APE), and call it DoubleTake*. DoubleTake* uses the same handshake size and blending length as DoubleTake. DiffCollage and MultiDiffusion were implemented manually, and also utilize our model for the same reasons mentioned earlier. We set their sampling parameter transition length to 10/20f. For DoubleTake, DiffCollage, and MultiDiffusion, we use classifier-free guidance with weights 1.5/2.5 during sampling.

¹ https://github.com/athn-nik/teach/commit/f4285aff0fd556a5b46518a751fc90825d91e68b
² https://github.com/priorMDM/priorMDM/commit/8bc565b3120c08182f067e161e83403b0efe7cc9

FlowMDM. Our diffusion model uses 1k steps and a cosine noise schedule [58]. FlowMDM is trained with the x0 parameterization [90] and an L2 reconstruction loss. Denoising timesteps are encoded as a sinusoidal positional encoding that goes through two dense layers into a 512D vector. Textual descriptions are tokenized and embedded with CLIP [68] into 512D vectors. Poses of 135/263D are encoded by a dense layer into a sequence of 512D vectors. If the APE is active, a sinusoidal encoding is added to the embedded poses at this stage. Then, the embedded poses are taken as the keys and values of a Transformer. Embedded poses are concatenated with the sum of the timestep and text embeddings, and fed to a dense layer. The resulting 512D vectors are the queries. If the relative positional encoding (RPE) is active, rotary embeddings [76] are injected into the queries and keys at this stage. The output of the Transformer is added to the embedded poses with a residual connection. Eight of these Transformer blocks are stacked together. A final dense layer converts the pose embeddings back to a vector of 135/263D, which are the denoised poses. A dropout of 0.1 is applied to the APE, and to the inputs of the Transformers. The attention span of the Transformers is capped within each subsequence during the APE stage, and within the attention horizon H=100/150f during the RPE stage. We train with blended positional encodings (BPE), i.e., RPE and APE are alternated randomly at a frequency of 0.5. We use Adam [41] with a learning rate of 0.0001 as our optimizer, and train for 1.3M/500k steps on a single RTX 3090 (about 4/2 days). During BPE sampling, the binary step schedule transitions from absolute to relative mode after 125/60 denoising steps (out of 1k steps). Classifier-free guidance with weights 1.5/2.5 is used during sampling.
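For clarity, the block below is a minimal sketch of how one such denoising block can alternate between the absolute (APE) and relative (RPE) modes described above. It is a simplified illustration and not our released implementation: tensor shapes omit the batch dimension, and all module, helper, and argument names are assumptions.

import math
import torch
import torch.nn as nn

def sinusoidal(pos, dim):
    # standard sinusoidal encoding (used for absolute positions and diffusion timesteps)
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    ang = pos[:, None] * freqs[None, :]
    return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)

def apply_rotary(x, pos):
    # minimal rotary embedding (RoPE) variant: rotates the two halves of the feature vector
    half = x.shape[-1] // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=x.device) / half)
    ang = pos[:, None] * freqs[None, :]
    cos, sin = torch.cos(ang), torch.sin(ang)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class BPEDenoisingBlock(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.to_q = nn.Linear(2 * dim, dim)  # queries: embedded poses concatenated with (timestep + text)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, poses, cond, mode, mask):
        # poses: (T, dim) embedded poses; cond: (T, dim) per-frame sum of timestep and text embeddings
        # mode: "ape" or "rpe"; mask: (T, T) boolean attention mask
        #       (subsequence-restricted for APE, horizon-banded for RPE)
        T, dim = poses.shape
        frames = torch.arange(T, dtype=torch.float32)
        x = (poses + sinusoidal(frames, dim)) if mode == "ape" else poses
        q = self.to_q(torch.cat([x, cond], dim=-1))
        k, v = self.to_k(x), self.to_v(x)
        if mode == "rpe":
            q, k = apply_rotary(q, frames), apply_rotary(k, frames)
        att = (q @ k.T) / math.sqrt(dim)
        att = att.masked_fill(~mask, float("-inf")).softmax(dim=-1)
        return poses + self.out(att @ v)  # residual connection

During BPE sampling, mode would simply be "ape" for the first 125/60 denoising steps and "rpe" afterwards, following the binary step schedule above.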
B. Evaluation details

Generative models are difficult to evaluate and compare due to the limitations of the metrics (discussed in Sec. 4.1) and the stochasticity present during sampling. To alleviate the latter, we run all our evaluations 10 times and provide the 95% confidence intervals. However, we still face another issue in our task: the randomness in the combinations of textual descriptions. The generation difficulty for the combination ‘sit down’→‘stand up’→‘run’ is not the same as for ‘sit down’→‘run’→‘stand up’. The evaluation protocol from [73] includes 32 evaluation sequences of 32 randomly sampled textual descriptions from the test set. The generated motion needs to perform sequentially the 32 actions from each evaluation sequence. However, these descriptions are sampled differently in each evaluation run, which hinders reproducibility. In order to ensure proper replication and a fair comparison in future works, we propose a more thorough and fully reproducible evaluation protocol that enables a more fine-grained analysis based on scenarios (analysis provided in Sec. C.1):

Babel. We built two scenarios with in-distribution (50%) and out-of-distribution (50%) combinations. For the in-distribution scenario, we first selected test motion sequences showcasing at least three consecutive actions (i.e., textual descriptions) with a total duration of 1.5s. Then, we randomly sampled from them to build 32 sets of 32 combinations of textual descriptions. For the out-of-distribution scenario, 32 sets were built by autoregressively sampling 32 textual descriptions so that consecutive actions did not appear together in either the training or the test set.

HumanML3D. Since annotations in HumanML3D do not include consecutive actions, we cannot build in- and out-of-distribution scenarios. However, this dataset contains a great variability of sequence lengths (3-10s). Therefore, we decided to build four scenarios by varying the length of the subsequences included. More specifically, we created three sets of 6, 8, and 18 combinations (9.4, 12.5, 28.1%) by sampling 32 short (3-5s), medium (5-8s), and long (8-10s) test motions, respectively. Ratios were set so that, all together, they preserved the proportion of short, medium, and long subsequences in the original test set.
This is important to keep the validity of statistical measures like FID. Additionally, we included another scenario with 32 sets (50%) of 32 random motion sequences from the test set.
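As an illustration of the out-of-distribution construction for Babel, the sketch below autoregressively samples a combination so that no consecutive action pair co-occurs in the training or test sets. It uses hypothetical data structures and function names (the actual lists are released with our code), and omits the per-action durations for brevity.

import random

def sample_ood_combination(actions, seen_pairs, length=32, seed=0):
    # actions: candidate action labels (e.g., Babel test-set descriptions)
    # seen_pairs: set of (action_a, action_b) pairs observed consecutively in train/test
    rng = random.Random(seed)  # fixed seed -> reproducible evaluation combinations
    combo = [rng.choice(actions)]
    while len(combo) < length:
        candidates = [a for a in actions if (combo[-1], a) not in seen_pairs]
        if not candidates:          # degenerate case: fall back to any action
            candidates = actions
        combo.append(rng.choice(candidates))  # autoregressive sampling
    return combo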
We share the list of evaluation combinations for both the human motion composition and extrapolation tasks in our public code repository³. Note that a combination consists of a list of textual descriptions and their associated durations. The 32 textual descriptions used for the extrapolation experiments from Sec. 4 are enumerated in Tab. A.

³ https://barquerogerman.github.io/FlowMDM/
Babel | HumanML3D
walk forward | a person walks in a curved path to the left.
swim movement | a person stands still and does not move.
stretch arms | a person walks straight forward.
walk | a person does jumping jacks.
stand | a person start to dance with legs.
step backwards | person walking in an s shape.
t-pose | a person walks to his right.
throw the ball | a person slowly walked forward.
run | the person is standing still doing body stretches.
circle right arm backwards | the person is dancing the waltz.
wave right | the person is clapping.
ginga dance | walking side to side.
forward kick | a person stayed on the place.
look around | person is jogging in place.
steps to the right | a person walks backward for 3 steps.
side steps | person is running in a circle.
hop forward | the person is waving hi.
dance with arms | a person walks in a circular path.
jog | swinging arms up and down.
walk slowly | a man walks counterclockwise in a circle.
jump jacks series | the person is walking towards the left.
run in half a circle | the person is walking on the treadmill.
walk a few steps ahead | the man is moving his left arm.
move head up and down | the person is doing basketball signals.
rotate right ankle | a person remained sitting down.
play guitar | a person hits his drums.
jump forward | person is doing a dance.
move both hands around chest | a person takes some steps forward.
swing back and forth | a person slowly walks forward five steps.
wave | a person jumps in place.
shake it | this person appears to be painting.
walk in circle | a person wiping a surface with something.

Table A. Extrapolated motions for Babel and HumanML3D.

C. More experimental results

C.1. Fine-grained comparison

Tab. B shows the comparison of FlowMDM with the state of the art in both in-distribution and out-of-distribution scenarios. We observe that, while all methods maintain similar performance in both scenarios for the subsequence generation, they generate less realistic and more abrupt transitions in the out-of-distribution case. FlowMDM performs the best at most metrics in both scenarios, with an important gap with respect to the previous state of the art regarding transition smoothness. Tab. C shows the scenario-wise results for HumanML3D, where FlowMDM also performs the best in most metrics and scenarios. Interestingly, MultiDiffusion is, after ours, the most stable method in terms of transition smoothness across scenarios (PJ and AUJ), whereas DiffCollage and DoubleTake show severe transition degeneration in combinations of long sequences. Such degeneration is mostly due to their methodological need to pad the motion sequence during sampling. When dealing with long sequences, they might be extended beyond the maximum sequence length seen at training time. Therefore, given that the APE does not extrapolate well, the generation in the padded motion, or transition, tends to degenerate. Our method naturally avoids this limitation.

C.2. On the attention horizon

In Tabs. D and E, we show the effect of the attention horizon when using RPE for either a purely relative inference schedule, or our proposed BPE inference schedule. We observe how increasing it too much (H=200) makes the network perform worse at transition generation in both datasets (FID and AUJ), and also in subsequence generation for HumanML3D (R-prec and MM-Dist). Conversely, when decreasing it too much (H=50), the capacity to model long-range dynamics becomes limited, thus reducing the accuracy of the generated subsequences (R-prec and MM-Dist). As the performance with H of 100 and 150 is similar in both datasets, we chose values that are closest to the average sequence length in each dataset, i.e., 100/150f for Babel/HumanML3D.
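For reference, the two attention patterns discussed above can be realized with simple boolean masks: block-diagonal masks confined to each subsequence for the APE stage, and a banded mask limited to the attention horizon H for the RPE stage. The sketch below is our own illustration; function and variable names are assumptions.

import torch

def ape_mask(lengths):
    # Block-diagonal mask: frames only attend within their own subsequence.
    T = sum(lengths)
    mask = torch.zeros(T, T, dtype=torch.bool)
    start = 0
    for L in lengths:
        mask[start:start + L, start:start + L] = True
        start += L
    return mask

def rpe_mask(T, horizon):
    # Banded mask: frame i attends to frame j only if |i - j| <= horizon,
    # regardless of subsequence boundaries (this is what lets transitions emerge).
    idx = torch.arange(T)
    return (idx[:, None] - idx[None, :]).abs() <= horizon

# e.g., three subsequences of 60, 90, and 45 frames, with H = 100 as in Babel
masks = ape_mask([60, 90, 45]), rpe_mask(195, horizon=100)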
C.3. On the diffusion schedule

The discussion and the BPE design in Sec. 3.2 are motivated by the low-to-high frequencies decomposition during the denoising stage of diffusion models. However, the denoising process depends on how the noise is injected, i.e., the noise schedule. The linear and the cosine (our choice) noise schedules are the most common ones. The linear schedule destroys the motion very fast, reaching a non-recognizable state after going through 75% of the diffusion steps [58]. Instead, the cosine schedule destroys the motion signal more slowly and in a more evenly distributed way. Fig. A shows the performance of FlowMDM during BPE sampling with both schedules. First, we observe that FlowMDM benefits from the steadier noise injection of the cosine schedule, achieving better performance in all realism and accuracy metrics (R-prec and FID). Second, we identify a displacement in the accuracy (R-prec) and smoothness (AUJ) curves (see black arrows).
Subsequence Transition
R-prec ↑ FID ↓ Div → MM-Dist ↓ FID ↓ Div → PJ → AUJ ↓
GT 0.715±0.003 0.00±0.00 8.42±0.15 3.36±0.00 0.00±0.00 6.20±0.06 0.02±0.00 0.00±0.00
In-distribution
TEACH B 0.727±0.004 2.26±0.03 8.20±0.12 3.35±0.01 2.77±0.05 6.32±0.07 1.03±0.00 2.20±0.01
TEACH 0.665±0.003 2.09±0.03 8.06±0.09 3.73±0.02 2.78±0.06 6.31±0.07 0.07±0.00 0.42±0.01
DoubleTake* 0.620±0.006 3.04±0.06 7.49±0.07 4.19±0.02 3.04±0.12 6.21±0.06 0.28±0.00 1.01±0.01
DoubleTake 0.682±0.008 1.52±0.03 7.90±0.07 3.67±0.04 3.47±0.08 6.16±0.07 0.17±0.00 0.62±0.01
MultiDiffusion 0.724±0.008 2.00±0.05 8.36±0.10 3.38±0.02 6.33±0.13 5.91±0.06 0.17±0.00 0.65±0.01
DiffCollage 0.690±0.006 1.92±0.07 7.92±0.09 3.67±0.02 4.25±0.15 6.19±0.07 0.19±0.01 0.82±0.02
FlowMDM (Ours) 0.726±0.006 1.36±0.05 8.47±0.10 3.40±0.03 2.26±0.08 6.60±0.08 0.05±0.00 0.11±0.00
Out-of-distribution
TEACH B 0.680±0.006 1.75±0.04 8.15±0.11 3.51±0.01 3.53±0.06 6.04±0.10 1.14±0.01 2.49±0.01
TEACH 0.644±0.004 2.06±0.03 7.94±0.12 3.70±0.01 4.08±0.08 6.00±0.09 0.07±0.00 0.46±0.00
DoubleTake* 0.572±0.007 3.78±0.07 7.53±0.12 4.15±0.02 3.83±0.09 6.12±0.07 0.28±0.00 1.07±0.02
DoubleTake 0.654±0.009 1.65±0.07 8.06±0.08 3.66±0.02 2.98±0.06 6.03±0.07 0.17±0.00 0.66±0.01
MultiDiffusion 0.681±0.009 2.11±0.06 8.35±0.08 3.47±0.03 6.97±0.12 5.67±0.05 0.19±0.00 0.71±0.01
DiffCollage 0.652±0.004 1.60±0.07 7.91±0.09 3.74±0.01 4.65±0.19 6.00±0.09 0.20±0.00 0.86±0.01
FlowMDM (Ours) 0.679±0.004 1.26±0.06 8.16±0.08 3.50±0.03 3.17±0.12 6.44±0.09 0.07±0.00 0.17±0.00
Table B. Scenario-wise comparison in Babel. Symbols ↑, ↓, and → indicate that higher, lower, or values closer to the ground truth (GT)
are better, respectively. Evaluation is run 10 times and ± specifies the 95% confidence intervals.
Subsequence Transition
R-prec ↑ FID ↓ Div → MM-Dist ↓ FID ↓ Div → PJ → AUJ ↓
GT 0.796±0.004 0.00±0.00 9.34±0.08 2.97±0.01 0.00±0.00 9.54±0.15 0.04±0.00 0.07±0.00
Short
DoubleTake* 0.649±0.012 3.03±0.18 9.52±0.11 3.72±0.05 3.56±0.14 8.92±0.14 0.13±0.01 0.79±0.05
DoubleTake 0.704±0.022 4.85±0.20 10.01±0.15 3.25±0.09 4.40±0.24 8.88±0.17 0.09±0.00 0.73±0.02
MultiDiffusion 0.717±0.011 5.49±0.15 10.14±0.17 3.23±0.07 4.66±0.27 8.68±0.08 0.10±0.00 0.92±0.02
DiffCollage 0.705±0.012 4.69±0.18 9.73±0.14 3.30±0.04 4.81±0.32 8.49±0.12 0.15±0.01 1.13±0.10
FlowMDM (Ours) 0.714±0.015 4.75±0.26 9.90±0.20 3.31±0.06 3.17±0.17 9.03±0.14 0.04±0.00 0.59±0.04
Medium
DoubleTake* 0.644±0.009 2.18±0.08 9.18±0.12 3.72±0.04 3.34±0.30 8.73±0.12 0.14±0.00 0.70±0.03
DoubleTake 0.642±0.014 2.34±0.05 9.59±0.09 3.79±0.05 5.42±0.30 8.61±0.11 0.12±0.00 0.83±0.02
MultiDiffusion 0.673±0.007 3.22±0.10 9.91±0.07 3.54±0.04 6.24±0.34 8.11±0.12 0.10±0.00 1.14±0.01
DiffCollage 0.661±0.010 2.03±0.07 9.38±0.10 3.60±0.04 4.95±0.27 8.13±0.09 0.14±0.00 0.66±0.05
FlowMDM (Ours) 0.669±0.012 3.18±0.15 9.68±0.08 3.55±0.04 4.18±0.43 8.52±0.07 0.04±0.00 0.86±0.03
Long
DoubleTake* 0.616±0.006 2.51±0.09 8.77±0.08 4.09±0.03 3.38±0.18 8.50±0.11 0.89±0.02 3.52±0.07
DoubleTake 0.605±0.006 4.07±0.13 8.19±0.11 4.18±0.01 8.45±0.33 7.79±0.12 0.81±0.02 3.04±0.07
MultiDiffusion 0.569±0.012 5.02±0.15 8.07±0.07 4.49±0.05 8.56±0.32 7.91±0.10 0.23±0.01 1.16±0.01
DiffCollage 0.557±0.008 5.79±0.13 7.75±0.09 4.61±0.02 9.00±0.36 7.75±0.09 0.38±0.01 5.04±0.14
FlowMDM (Ours) 0.666±0.012 1.93±0.08 8.81±0.09 3.81±0.04 2.85±0.22 8.54±0.11 0.08±0.00 0.45±0.03
All
DoubleTake* 0.655±0.007 0.84±0.04 9.29±0.10 3.92±0.03 1.91±0.12 8.79±0.11 0.51±0.01 2.11±0.05
DoubleTake 0.621±0.006 1.49±0.07 8.91±0.04 4.13±0.02 4.75±0.13 8.39±0.06 0.47±0.01 1.84±0.03
MultiDiffusion 0.632±0.003 1.17±0.04 9.29±0.09 4.05±0.02 4.42±0.16 8.37±0.08 0.17±0.00 1.06±0.01
DiffCollage 0.615±0.007 1.73±0.07 8.73±0.05 4.18±0.04 4.98±0.24 8.09±0.06 0.26±0.00 2.71±0.12
FlowMDM 0.695±0.008 0.30±0.02 9.55±0.08 3.58±0.02 1.49±0.06 8.78±0.11 0.06±0.00 0.50±0.01
Table C. Scenario-wise comparison in HumanML3D.
Given that with the linear schedule global dependencies start being recovered later, more APE steps are needed to achieve the accuracy and smoothness reached with the cosine schedule.
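For reference, a minimal sketch of the cosine noise schedule from [58], compared against a standard DDPM-style linear beta schedule; the exact hyper-parameters (e.g., the offset s and the linear beta range) are assumptions here, not values taken from our configuration files.

import numpy as np

def cosine_alphas_cumprod(T=1000, s=0.008):
    # alpha_bar(t) = f(t)/f(0), with f(t) = cos^2(((t/T + s)/(1 + s)) * pi/2)  [58]
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1 + s)) * np.pi / 2) ** 2
    return f / f[0]

def linear_alphas_cumprod(T=1000, beta_start=1e-4, beta_end=0.02):
    # standard linear beta schedule; the signal decays much faster (cf. Fig. A)
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

Plotting the two alpha_bar curves makes the earlier signal destruction of the linear schedule visible, which is what shifts the curves marked with black arrows in Fig. A.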
C.4. On the classifier-free guidance

Classifier-free guidance is an important add-on for diffusion sampling that intensifies the conditioning signal, thus improving the quality and accuracy of the generated samples [33]. It is implemented by first computing the conditionally denoised motion x_c and the unconditionally denoised motion x. Then, the denoised sample is computed as x + w·(x_c − x). If w=1, the classifier-free guidance is deactivated. When generating motion from single textual descriptions with classifier-free guidance, we keep steering the denoising toward motions that better match the textual description.
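The guided prediction can be sketched as follows. This is a minimal illustration of the formula above; the denoiser interface and argument names are assumptions, not our released API.

def classifier_free_guidance(denoiser, x_t, t, text, w=1.5):
    # conditional and unconditional predictions of the clean motion (x0 parameterization)
    x_cond = denoiser(x_t, t, text)
    x_uncond = denoiser(x_t, t, None)   # empty/dropped text condition
    # w = 1 disables the guidance; larger w intensifies the text conditioning
    return x_uncond + w * (x_cond - x_uncond)

Because the same weight w is applied to every frame, the two conditions that meet around a transition can pull the guided prediction in different directions, which relates to the smoothness degradation for large w discussed later in this section (Fig. B).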
Subsequence Transition
H (frames) Inf. PE R-prec ↑ FID ↓ Div → MM-Dist ↓ FID ↓ Div → PJ → AUJ ↓
GT - 0.715±0.003 0.00±0.00 8.42±0.15 3.36±0.00 0.00±0.00 6.20±0.06 0.02±0.00 0.00±0.00
50 R 0.641±0.004 1.03±0.04 7.99±0.11 3.92±0.03 2.04±0.06 6.30±0.05 0.04±0.00 0.15±0.00
100 R 0.635±0.004 0.85±0.02 8.25±0.12 3.98±0.02 2.14±0.04 6.44±0.09 0.04±0.00 0.15±0.00
150 R 0.641±0.005 0.99±0.04 8.24±0.15 3.88±0.03 2.43±0.06 6.43±0.06 0.04±0.00 0.15±0.00
200 R 0.601±0.005 1.48±0.04 7.85±0.14 4.17±0.02 3.18±0.09 6.16±0.05 0.04±0.00 0.19±0.00
50 B 0.698±0.006 1.07±0.03 8.19±0.11 3.44±0.02 2.34±0.06 6.24±0.07 0.06±0.00 0.13±0.00
100 B 0.702±0.004 0.99±0.04 8.36±0.13 3.45±0.02 2.61±0.06 6.47±0.05 0.06±0.00 0.13±0.00
150 B 0.704±0.004 1.24±0.03 8.34±0.12 3.43±0.02 2.54±0.08 6.40±0.08 0.06±0.00 0.13±0.00
200 B 0.694±0.006 1.13±0.02 8.25±0.13 3.42±0.02 3.31±0.08 6.38±0.09 0.06±0.00 0.14±0.01
Table D. Attention horizon effect in Babel. All models correspond to FlowMDM, trained with BPE. Inf. PE indicates the type of positional
encoding used during sampling: B for BPE, and R for only RPE. Symbols ↑, ↓, and → indicate that higher, lower, or values closer to the
ground truth (GT) are better, respectively. Evaluation is run 10 times and ± specifies the 95% confidence intervals.
Subsequence Transition
H (frames) Inf. PE R-prec ↑ FID ↓ Div → MM-Dist ↓ FID ↓ Div → PJ → AUJ ↓
GT - 0.796±0.004 0.00±0.00 9.34±0.08 2.97±0.01 0.00±0.00 9.54±0.15 0.04±0.00 0.07±0.00
50 R 0.583±0.005 1.08±0.07 9.03±0.15 4.30±0.02 1.88±0.06 8.85±0.10 0.04±0.00 0.70±0.01
100 R 0.591±0.005 1.07±0.03 9.02±0.13 4.29±0.02 1.51±0.08 8.90±0.08 0.04±0.00 0.56±0.01
150 R 0.554±0.007 1.06±0.06 9.02±0.11 4.54±0.02 1.12±0.04 9.00±0.10 0.05±0.00 0.53±0.01
200 R 0.528±0.004 1.37±0.04 8.87±0.07 4.68±0.01 1.72±0.05 8.97±0.09 0.03±0.00 0.97±0.01
50 B 0.671±0.004 0.25±0.01 9.37±0.14 3.66±0.02 1.27±0.04 8.79±0.08 0.06±0.00 0.52±0.01
100 B 0.684±0.003 0.36±0.02 9.55±0.09 3.61±0.02 2.04±0.11 8.59±0.06 0.06±0.00 0.56±0.01
150 B 0.685±0.004 0.29±0.01 9.58±0.12 3.61±0.01 1.38±0.05 8.79±0.09 0.06±0.00 0.51±0.01
200 B 0.658±0.006 0.47±0.03 9.37±0.13 3.77±0.02 2.27±0.07 8.69±0.08 0.06±0.00 0.68±0.01
Table E. Attention horizon effect in HumanML3D. All models correspond to FlowMDM, trained with BPE. Inf. PE indicates the type of
positional encoding used during sampling: B for BPE, and R for only RPE.
Figure A. Diffusion noise schedules. The cosine noise schedule destroys the motion signal more slowly and in a more evenly distributed way than the linear schedule. As a result, FlowMDM is able to better exploit the low-to-high frequencies decomposition along the denoising chain and generate better subsequences and transitions. The faster motion destruction in the linear schedule translates to needing more APE steps to reconstruct global dependencies inside subsequences (black arrows ↔).

Figure B. Classifier-free guidance. In line with prior works, we also observe an accuracy improvement (R-prec) when increasing the strength (i.e., weight) of the classifier-free guidance (CFG). However, above certain values, the performance degrades, especially in terms of smoothness (AUJ). This is caused by the misalignment of CFG directions on each side of the transition.
However, when building human motion compositions with our method, two different conditions coexist in the neighborhood of each transition. There, the classifier-free guidance pushes the denoising towards disparate directions. As a result, if w is too high, the transition becomes sharper, and if w is too low, the subsequences might not be accurate enough. Fig. B shows these effects for FlowMDM. We notice a sweet spot around w=1.5/2.5 for Babel/HumanML3D, where FlowMDM reaches the maximum accuracy and quality for subsequences and a good trade-off between quality and smoothness of transitions.
Figure C. Qualitative examples (Babel). A-F feature six human motion compositions, and G-H two human motion extrapolations. Ac-
cording to the scenarios defined in Sec. B, A, B, C belong to in-distribution combinations, and D, E, F to out-of-distribution combinations.
Videos of all samples are also included as part of this supplementary material. Solid curves match the trajectories of the global position
(blue) and left/right hands (purple/green). Darker colors indicate instantaneous jerk deviations from the median value, saturating at twice
the jerk’s standard deviation in the dataset (black segments). Abrupt transitions manifest as black segments amidst lighter ones.
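As a guide to how this coloring can be reproduced, here is a minimal sketch that estimates per-frame jerk magnitudes with finite differences and normalizes the deviations for visualization. It is our own illustration; the frame rate and array layout are assumptions.

import numpy as np

def jerk_magnitude(positions, fps=30):
    # positions: (T, 3) trajectory of one joint (e.g., a hand or the global position)
    # jerk = third time derivative of position, estimated with finite differences
    jerk = np.diff(positions, n=3, axis=0) * fps ** 3        # (T - 3, 3)
    return np.linalg.norm(jerk, axis=-1)                     # (T - 3,)

def jerk_color_intensity(j, dataset_std):
    # deviations from the median jerk, saturated at twice the dataset's jerk std
    dev = np.abs(j - np.median(j))
    return np.clip(dev / (2 * dataset_std), 0.0, 1.0)        # 0 = light, 1 = black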
D. Qualitative results

Figs. C and D show six human motion compositions (A to F), and two extrapolations (G and H) for Babel and HumanML3D, respectively. The compositions are subsets of the evaluation combinations composed of 32 actions, so the beginning and end of these can contain partial transitions toward other actions. Motion videos are also included as part of the supplementary material.
Figure D. Qualitative examples (HumanML3D). A-F feature six human motion compositions, and G-H two human motion extrapolations.
According to the scenarios defined in Sec. B, A, B, C are samples from the short, medium, and long scenarios, respectively, and D, E, F
from the mixed scenario. Videos of all samples are also included as part of this supplementary material.
Note that we can represent the motions from Babel with SMPL body meshes thanks to its motion representation including the SMPL parameters [13]. For HumanML3D, we use skeletons, as its motion representation only includes the 3D coordinates of the joints.
Discussion. The hand trajectories and the jerk color
indicators in Figs. C and D and the videos highlight that
FlowMDM generates the smoothest transitions between
subsequences. Notably, state-of-the-art methods exhibit fre-
quent smoothness artifacts (black segments) in the bound-
aries of their transitions. We notice that the compositions
produced by TEACH lack realism due to the use of a naive
spherical linear interpolation, disrupting the motion dynam-
ics. This becomes more apparent in extrapolations G and
H of both datasets, where the periodicity of the movement
is clearly compromised. On the other hand, DoubleTake,
DiffCollage, and MultiDiffusion share two significant lim-
itations. Firstly, they adhere to a predetermined transition
length, which may not fit all situations. For example, in
Babel-A, the ‘picking’ actions occur very rapidly due to the
insufficient length for generating a natural transition. By
contrast, our approach is able to leverage more transitioning
time from either transition side if needed, without artificial
constraints. Secondly, the denoising process in these meth-
ods only considers a small portion of the neighboring sub-
sequences, leading to poor performance in dynamic motion
extrapolations. For example, in HumanML3D-G, they all
generate erratic jumping jacks. While our method also inde-
pendently generates the low-frequency motion spectrum, it
effectively rectifies inconsistencies in later stages, yielding
realistic and periodic motion. In the case of Babel-H, where
successfully extrapolating the ‘hop forward’ action requires
synchronizing each subsequence with the whole neighbor-
ing motion, our model is the only one able to generate a
smooth, coherent, and realistic extrapolation.
Limitations. However, FlowMDM is not without
its imperfections. We noticed that our method struggles
with very complex descriptions, such as the first one in
HumanML3D-B. Instead of executing the intricate descrip-
tion that includes ‘walk backwards, sit, stand, and walk for-
ward again’, it only walks backwards. Given that the par-
tial execution of actions is also observed in other methods,
we consider it a challenge associated with the broader text-
to-motion task. Indeed, our model could theoretically also
benefit from improved conditioning schemes such as using
better text embeddings. Another acknowledged limitation
of our model, discussed in Sec. 5, is the independent gen-
eration of low-frequency components. In Babel-B, for ex-
ample, a slight mismatch between the sitting and standing
positions is observed. Nonetheless, in contrast to DiffCol-
lage, MultiDiffusion, and DoubleTake which also exhibit
this effect, FlowMDM produces a smoother result.