Streamingt2V: Consistent, Dynamic, and Extendable Long Video Generation From Text
Streamingt2V: Consistent, Dynamic, and Extendable Long Video Generation From Text
Roberto Henschel1∗, Levon Khachatryan1∗, Daniil Hayrapetyan1∗, Hayk Poghosyan1, Vahram Tadevosyan1,
Zhangyang Wang1,2, Shant Navasardyan1, Humphrey Shi1,3
1 Picsart AI Research (PAIR)  2 UT Austin  3 SHI Labs @ Georgia Tech, Oregon & UIUC
arXiv:2403.14773v1 [cs.CV] 21 Mar 2024
https://github.com/Picsart-AI-Research/StreamingT2V
Figure 1. StreamingT2V is an advanced autoregressive technique that enables the creation of long videos featuring rich motion dynamics
without any stagnation. It ensures temporal consistency throughout the video, aligns closely with the descriptive text, and maintains high
frame-level image quality. Our demonstrations include successful examples of videos of up to 1200 frames, spanning 2 minutes, which can be
extended to even longer durations. Importantly, the effectiveness of StreamingT2V is not limited by the specific Text2Video model used,
indicating that improvements in base models could yield even higher-quality videos.
transitions. The key components are: (i) a short-term memory block called conditional attention module (CAM), which conditions the current generation on the features extracted from the previous chunk via an attentional mechanism, leading to consistent chunk transitions, (ii) a long-term memory block called appearance preservation module, which extracts high-level scene and object features from the first video chunk to prevent the model from forgetting the initial scene, and (iii) a randomized blending approach that makes it possible to apply a video enhancer autoregressively for infinitely long videos without inconsistencies between chunks. Experiments show that StreamingT2V generates videos with a high amount of motion. In contrast, all competing image-to-video methods are prone to video stagnation when applied naively in an autoregressive manner. With StreamingT2V, we thus propose a high-quality, seamless text-to-long-video generator that outperforms competitors in consistency and motion.

1. Introduction

In recent years, with the rise of Diffusion Models [15, 26, 28, 34], the task of text-guided image synthesis and manipulation gained enormous attention from the community. The huge success in image generation led to the further extension of diffusion models to generate videos conditioned on textual prompts [4, 5, 7, 11–13, 17, 18, 20, 32, 37, 39, 45].

Despite the impressive generation quality and text alignment, the majority of existing approaches such as [4, 5, 17, 39, 45] are mostly focused on generating short frame sequences (typically of 16 or 24 frames). However, short videos are limited in real-world use-cases such as ad making, storytelling, etc.

The naïve approach of simply training existing methods on long videos (e.g. ≥ 64 frames) is normally unfeasible. Even for generating short sequences, a very expensive training (e.g. using more than 260K steps and a batch size of 4500 [39]) is typically required. Without training on longer videos, video quality commonly degrades when short video generators are made to output long videos (see appendix). Existing approaches, such as [5, 17, 23], thus extend the baselines to autoregressively generate short video chunks conditioned on the last frame(s) of the previous chunk.

However, the straightforward long-video generation approach of simply concatenating the noisy latents of a video chunk with the last frame(s) of the previous chunk leads to poor conditioning with inconsistent scene transitions (see Sec. 5.3). Some works [4, 8, 40, 43, 48] additionally incorporate CLIP [25] image embeddings of the last frame of the previous chunk, reaching slightly better consistency, but they are still prone to inconsistent global motion across chunks (see Fig. 5), as the CLIP image encoder loses information important for perfectly reconstructing the conditional frames. The concurrent work SparseCtrl [12] utilizes a more sophisticated conditioning mechanism via a sparse encoder. Its architecture requires concatenating additional zero-filled frames to the conditioning frames before they are fed into the sparse encoder. However, this inconsistency in the input leads to inconsistencies in the output (see Sec. 5.4). Moreover, we observed that all image-to-video methods that we evaluated in our experiments (see Sec. 5.4) eventually lead to video stagnation when applied autoregressively by conditioning on the last frame of the previous chunk.

To overcome the weaknesses and limitations of current works, we propose StreamingT2V, an autoregressive text-to-video method equipped with long/short-term memory blocks that generates long videos without temporal inconsistencies.

To this end, we propose the Conditional Attention Module (CAM) which, due to its attentional nature, effectively borrows the content information from the previous frames to generate new ones, while not restricting their motion by the previous structures/shapes. Thanks to CAM, our results are smooth and exhibit artifact-free video chunk transitions.

Existing approaches are not only prone to temporal inconsistencies and video stagnation, but they also suffer from object appearance/characteristic changes and video quality degradation over time (see e.g., SVD [4] in Fig. 7). The reason is that, due to conditioning only on the last frame(s) of the previous chunk, they overlook the long-term dependencies of the autoregressive process. To address this issue, we design an Appearance Preservation Module (APM) that extracts object or global scene appearance information from an initial image (anchor frame), and conditions the video generation process of all chunks with that information, which helps to keep object and scene features across the autoregressive process.

To further improve the quality and resolution of our long video generation, we adapt a video enhancement model for autoregressive generation. For this purpose, we choose a high-resolution text-to-video model and utilize the SDEdit [22] approach for enhancing consecutive 24-frame chunks (overlapping with 8 frames) of our video. To make the chunk enhancement transitions smooth, we design a randomized blending approach for seamless blending of overlapping enhanced chunks.

Experiments show that StreamingT2V successfully generates long and temporally consistent videos from text without video stagnation. To summarize, our contributions are three-fold:
• We introduce StreamingT2V, an autoregressive approach for seamless synthesis of extended video content using short- and long-term dependencies.
• Our Conditional Attention Module (CAM) and Appearance Preservation Module (APM) ensure the natural continuity of the global scene and object characteristics
of generated videos.
• We seamlessly enhance generated long videos by introducing our randomized blending approach for consecutive overlapping chunks.

2. Related Work

Text-Guided Video Diffusion Models. Generating videos from textual instructions using Diffusion Models [15, 33] is a recently established yet very active field of research introduced by Video Diffusion Models (VDM) [17]. The approach requires massive training resources and can generate only low-resolution videos (up to 128x128), which are furthermore limited to at most 16 frames (without autoregression). Also, the training of text-to-video models is usually done on large datasets such as WebVid-10M [3] or InternVid [41]. Several methods employ video enhancement in the form of spatial/temporal upsampling [5, 16, 17, 32], using cascades with up to 7 enhancer modules [16]. Such an approach produces high-resolution and long videos. Yet, the generated content is still limited by the key frames.

Towards generating longer videos (i.e. more keyframes), Text-To-Video-Zero (T2V0) [18] and ART-V [42] employ a text-to-image diffusion model. Therefore, they can generate only simple motions. T2V0 conditions on its first frame via cross-frame attention and ART-V on an anchor frame. Due to the lack of global reasoning, this leads to unnatural or repetitive motions. MTVG [23] turns a text-to-video model into an autoregressive method by a training-free approach. It employs strong consistency priors between and among video chunks, which leads to a very low motion amount and mostly near-static backgrounds. FreeNoise [24] samples a small set of noise vectors and re-uses them for the generation of all frames, while temporal attention is performed on local windows. As the employed temporal attentions are invariant to such frame shuffling, this leads to high similarity between frames, almost always static global motion, and near-constant videos. Gen-L [38] generates overlapping short videos and aggregates them via temporal co-denoising, which can lead to quality degradation and video stagnation.

Image-Guided Video Diffusion Models as Long Video Generators. Several works condition the video generation on a driving image or video [4, 6–8, 10, 12, 21, 27, 40, 43, 44, 48]. They can thus be turned into an autoregressive method by conditioning on the frame(s) of the previous chunk.

VideoDrafter [21] uses a text-to-image model to obtain an anchor frame. A video diffusion model is conditioned on the driving anchor to independently generate multiple videos that share the same high-level context. However, no consistency among the video chunks is enforced, leading to drastic scene cuts. Several works [7, 8, 44] concatenate the (encoded) conditionings with an additional mask (which indicates which frame is provided) to the input of the video diffusion model.

In addition to concatenating the conditioning to the input of the diffusion model, several works [4, 40, 48] replace the text embeddings in the cross-attentions of the diffusion model by CLIP [25] image embeddings of the conditional frames. However, according to our experiments, their applicability for long video generation is limited. SVD [4] shows severe quality degradation over time (see Fig. 7), and both I2VGen-XL [48] and SVD [4] often generate inconsistencies between chunks, indicating that the conditioning mechanism is too weak.

Some works [6, 43] such as DynamiCrafter-XL [43] thus add to each text cross-attention an image cross-attention, which leads to better quality, but still to frequent inconsistencies between chunks.

The concurrent work SparseCtrl [12] adds a ControlNet [46]-like branch to the model, which takes the conditional frames and a frame-indicating mask as input. It requires by design to append additional frames consisting of black pixels to the conditional frames. This inconsistency is difficult for the model to compensate for, leading to frequent and severe scene cuts between frames.

Overall, only a small number of keyframes can currently be generated at once with high quality. While in-between frames can be interpolated, this does not lead to new content. Also, while image-to-video methods can be used autoregressively, their conditioning mechanisms lead either to inconsistencies, or the method suffers from video stagnation. We conclude that existing works are not suitable for high-quality and consistent long video generation without video stagnation.

3. Preliminaries

Diffusion Models. Our text-to-video model, which we term StreamingT2V, is a diffusion model that operates in the latent space of the VQ-GAN [9, 35] autoencoder D(E(·)), where E and D are the corresponding encoder and decoder, respectively. Given a video V ∈ R^{F×H×W×3}, composed of F frames with spatial resolution H × W, its latent code x_0 ∈ R^{F×h×w×c} is obtained through frame-by-frame application of the encoder. More precisely, by identifying each tensor x ∈ R^{F×ĥ×ŵ×ĉ} with the sequence (x^f)_{f=1}^{F}, x^f ∈ R^{ĥ×ŵ×ĉ}, we obtain the latent code via x_0^f := E(V^f), for all f = 1, . . . , F. The diffusion forward process gradually adds Gaussian noise ϵ ∼ N(0, I) to the signal x_0:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big), \quad t = 1, \ldots, T \qquad (1)$$

where q(x_t | x_{t−1}) is the conditional density of x_t given x_{t−1}, and {β_t}_{t=1}^{T} are hyperparameters. A high value for T is chosen such that the forward process completely destroys the initial signal x_0, resulting in x_T ∼ N(0, I).
Figure 2. The overall pipeline of StreamingT2V: In the Initialization Stage the first 16-frame chunk is synthesized by a text-to-video
model (e.g. Modelscope [39]). In the Streaming T2V Stage the new content for further frames is autoregressively generated. Finally, in
the Streaming Refinement Stage the generated long video (600, 1200 frames or more) is autoregressively enhanced by applying a high-
resolution text-to-short-video model (e.g. MS-Vid2Vid-XL [48]) equipped with our randomized blending approach.
The goal of a diffusion model is then to learn a backward process

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big) \qquad (2)$$

that step by step removes the noise again, so that new samples can be generated starting from pure noise x_T.

4. StreamingT2V

We build our StreamingT2V method by taking Modelscope (MS) [39] as a basis and turning it into an autoregressive model suitable for long video generation with high motion dynamics and consistency.
Figure 3. Method overview: StreamingT2V extends a video diffusion model (VDM) by the conditional attention module (CAM) as short-
term memory, and the appearance preservation module (APM) as long-term memory. CAM conditions the VDM on the previous chunk
using a frame encoder Econd . The attentional mechanism of CAM leads to smooth transitions between chunks and videos with high motion
amount at the same time. APM extracts high-level image features from an anchor frame and injects them into the text cross-attentions of the
VDM. APM helps to preserve object/scene features across the autoregressive video generation.
4.1. Conditional Attention Module

To train a conditional network for our Streaming T2V stage, we leverage the pre-trained power of a text-to-video model (e.g. Modelscope [39]) as a prior for long video generation in an autoregressive manner. In the following, we refer to this pre-trained text-to-(short)video model as Video-LDM. To autoregressively condition Video-LDM on short-term information from the previous chunk (see Fig. 2, mid), we propose the Conditional Attention Module (CAM), which consists of a feature extractor and a feature injector into the Video-LDM UNet, inspired by ControlNet [46]. The feature extractor utilizes a frame-wise image encoder E_cond, followed by the same encoder layers that the Video-LDM UNet uses up to its middle layer (and initialized with the UNet's weights). For the feature injection, we let each long-range skip connection in the UNet attend to corresponding features generated by CAM via cross-attention.

Let x denote the output of E_cond after zero-convolution. We use addition to fuse x with the output of the first temporal transformer block of CAM. For the injection of CAM's features into the Video-LDM UNet, we consider the UNet's skip-connection features x_SC ∈ R^{b×F×h×w×c} (see Fig. 3) with batch size b. We apply spatio-temporal group norm and a linear projection P_in on x_SC. Let x'_SC ∈ R^{(b·w·h)×F×c} be the resulting tensor after reshaping. We condition x'_SC on the corresponding CAM feature x_CAM ∈ R^{(b·w·h)×F_cond×c} (see Fig. 3), where F_cond is the number of conditioning frames, via temporal multi-head attention (T-MHA) [36], i.e. independently for each spatial position (and batch). Using learnable linear maps P_Q, P_K, P_V for queries, keys, and values, we apply T-MHA using keys and values from x_CAM and queries from
x'_SC, i.e.

$$x''_{SC} = \text{T-MHA}\big(Q = P_Q(x'_{SC}),\ K = P_K(x_{CAM}),\ V = P_V(x_{CAM})\big). \qquad (4)$$

Finally, we use a linear projection P_out. Using a suitable reshaping operation R, the output of CAM is added to the skip connection (as in ControlNet [46]):

$$x'''_{SC} = x_{SC} + R\big(P_{out}(x''_{SC})\big), \qquad (5)$$

so that x'''_{SC} is used in the decoder layers of the UNet. The projection P_out is zero-initialized, so that when training starts, CAM does not affect the base model's output, which improves training convergence.

CAM utilizes the last F_cond conditional frames of the previous chunk as input. The cross-attention enables conditioning the F frames of the base model on CAM. In contrast, the sparse encoder [12] uses convolution for the feature injection, and thus needs additional F − F_cond zero-valued frames (and a mask) as input in order to add its output to the F frames of the base model. This poses an inconsistency in the input for SparseCtrl, leading to severe inconsistencies in the generated videos (see Sect. 5.3 and Sect. 5.4).
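For concreteness, the following is a minimal PyTorch sketch of the attentional feature injection in Eq. (4)-(5); the module name, the use of nn.MultiheadAttention (whose internal projections play the role of P_Q, P_K, P_V), and the LayerNorm stand-in for the spatio-temporal group norm are illustrative assumptions, not our actual implementation.

```python
import torch
import torch.nn as nn

class TemporalCAMInjection(nn.Module):
    """Sketch of Eq. (4)-(5): temporal cross-attention from the UNet skip-connection
    features (queries) to the CAM features (keys/values), applied per spatial position."""
    def __init__(self, c: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(c)          # stand-in for the spatio-temporal group norm
        self.p_in = nn.Linear(c, c)          # P_in
        self.attn = nn.MultiheadAttention(c, n_heads, batch_first=True)  # T-MHA (holds P_Q, P_K, P_V)
        self.p_out = nn.Linear(c, c)         # P_out, zero-initialized
        nn.init.zeros_(self.p_out.weight)
        nn.init.zeros_(self.p_out.bias)

    def forward(self, x_sc: torch.Tensor, x_cam: torch.Tensor) -> torch.Tensor:
        # x_sc:  (b, F, h, w, c)   UNet skip-connection features
        # x_cam: (b*h*w, F_cond, c) CAM features of the conditioning frames
        b, F, h, w, c = x_sc.shape
        x = x_sc.permute(0, 2, 3, 1, 4).reshape(b * h * w, F, c)   # one sequence per spatial position
        q = self.p_in(self.norm(x))                                # x'_SC
        attn_out, _ = self.attn(q, x_cam, x_cam)                   # Eq. (4)
        out = self.p_out(attn_out)                                 # zero-init => no effect at start
        out = out.reshape(b, h, w, F, c).permute(0, 3, 1, 2, 4)    # R: reshape back
        return x_sc + out                                          # Eq. (5)

# toy usage: b=1, F=16, F_cond=8, 32x32 latent resolution, c=320
inj = TemporalCAMInjection(c=320)
y = inj(torch.randn(1, 16, 32, 32, 320), torch.randn(32 * 32, 8, 320))
```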
4.2. Appearance Preservation Module

Autoregressive video generators typically suffer from forgetting initial object and scene features, leading to severe appearance changes. To tackle this issue, we incorporate long-term memory by leveraging the information contained in a fixed anchor frame of the very first chunk using our proposed Appearance Preservation Module (APM). This helps to maintain scene and object features across video chunk generations (see Fig. 6).

To enable APM to balance the guidance by the anchor frame with the guidance by the text instructions, we propose (see Figure 3): (i) We mix the CLIP [25] image token of the anchor frame with the CLIP text tokens from the textual instruction by expanding the CLIP image token to k = 8 tokens using a linear layer, concatenating the text and image encodings along the token dimension, and using a projection block, leading to x_mixed ∈ R^{b×77×1024}; (ii) We introduce for each cross-attention layer l a weight α_l ∈ R (initialized as 0) to perform cross-attention using keys and values coming from a weighted sum of x_mixed and the usual CLIP text encoding x_text of the text instructions:

$$x_{cross} = \text{SiLU}(\alpha_l)\, x_{mixed} + x_{text}. \qquad (6)$$

The experiments in Section 5.3 show that the light-weight APM module helps to keep scene and identity features across the autoregressive process (see Fig. 6).
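A minimal PyTorch sketch of the APM token mixing and of Eq. (6) is given below; the module name and the simple 1D-convolution projection block are illustrative assumptions (the exact projection block differs, cf. Tab. 9 in the appendix), not our actual implementation.

```python
import torch
import torch.nn as nn

class APMMixer(nn.Module):
    """Sketch of the APM token mixing and Eq. (6)."""
    def __init__(self, dim: int = 1024, k: int = 8, n_text_tokens: int = 77):
        super().__init__()
        self.k = k
        self.expand = nn.Linear(dim, k * dim)                    # 1 CLIP image token -> k tokens
        # illustrative projection block: map (77 + k) tokens back to 77 tokens
        self.proj = nn.Conv1d(n_text_tokens + k, n_text_tokens, kernel_size=1)
        self.alpha = nn.Parameter(torch.zeros(1))                # alpha_l (one per cross-attention layer l)

    def forward(self, clip_image_token: torch.Tensor, x_text: torch.Tensor) -> torch.Tensor:
        # clip_image_token: (b, 1, dim)  CLIP image embedding of the anchor frame
        # x_text:           (b, 77, dim) CLIP text encoding of the prompt
        b, _, d = clip_image_token.shape
        img_tokens = self.expand(clip_image_token).reshape(b, self.k, d)   # (b, k, dim)
        x_mixed = self.proj(torch.cat([x_text, img_tokens], dim=1))        # (b, 77, dim)
        # Eq. (6): since alpha_l = 0 at init, SiLU(alpha_l) = 0 and APM starts with no effect
        return torch.nn.functional.silu(self.alpha) * x_mixed + x_text

mixer = APMMixer()
x_cross = mixer(torch.randn(2, 1, 1024), torch.randn(2, 77, 1024))   # keys/values for cross-attention
```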
4.3. Auto-regressive Video Enhancement

To further improve the quality and resolution of our text-to-video results, we utilize a high-resolution (1280x720) text-to-(short)video model (Refiner Video-LDM, see Fig. 3) to autoregressively enhance 24-frame chunks of generated videos. Using a text-to-video model as a refiner/enhancer of 24-frame chunks is done by adding a substantial amount of noise to the input video chunk and denoising it with the text-to-video diffusion model (SDEdit [22] approach). More precisely, we take a high-resolution text-to-video model (for example MS-Vid2Vid-XL [40, 48]) and a low-resolution video chunk of 24 frames, which is first bilinearly upscaled [2] to the target high resolution. Then we encode the frames using the image encoder E so that we obtain a latent code x_0. After that we apply T' < T forward diffusion steps (see Eq. (1)) so that x_{T'} still contains signal information (mostly about the video structure), and denoise it using the high-resolution video diffusion model.

However, the naïve approach of independently enhancing each chunk leads to inconsistent transitions (see Fig. 4 (a)). We tackle this issue by using shared noise between consecutive chunks and leveraging our randomized blending approach. Given our low-resolution long video, we split it into m chunks V_1, . . . , V_m of F = 24 frame-length such that each two consecutive chunks have an overlap of O = 8 frames. For the backward diffusion at step t, starting from T', we must sample noise to perform one denoising step (see Eq. 2). We start with the first chunk V_1 and sample noise ϵ_1 ∼ N(0, I) with ϵ_1 ∈ R^{F×h×w×c}. For each subsequent chunk V_i, i > 1, we sample noise ϵ̂_i ∼ N(0, I) with ϵ̂_i ∈ R^{(F−O)×h×w×c} and concatenate it along the frame dimension with the noise ϵ_{i−1}^{F−O:F} that was sampled for the O overlapping frames of the previous chunk, i.e.

$$\epsilon_i := \text{concat}\big([\epsilon_{i-1}^{F-O:F}, \hat{\epsilon}_i],\ \dim = 0\big), \quad \text{for all } i = 2, \ldots, m, \qquad (7)$$

so that we obtain shared noise for overlapping frames. We perform one denoising step using ϵ_i and obtain for chunk V_i the latent code x_{t−1}(i). Yet, this approach is not sufficient to remove transition misalignment (see Fig. 4 (b)).

To improve consistency significantly, we propose the randomized blending approach. Consider the latent codes x_{t−1}^{F−O:F}(i − 1) and x_{t−1}^{1:O}(i) of two consecutive chunks V_{i−1}, V_i at denoising step t − 1. The latent code x_{t−1}(i − 1) of chunk V_{i−1} possesses a smooth transition from its first frames to the overlapping frames, while the latent code x_{t−1}(i) possesses a smooth transition from the overlapping frames to the subsequent frames. Thus, we combine the two latent codes via concatenation, by randomly sampling a frame index f_thr from {0, . . . , O} and then taking from x_{t−1}^{F−O:F}(i − 1) the latent code of the first f_thr frames and from x_{t−1}^{1:O}(i) the latent code of the frames starting from f_thr + 1. Then, we update the latent code of the entire long
video x_{t−1} on the overlapping frames and perform the next denoising step. Accordingly, for a frame f ∈ {1, . . . , O} of the overlap and diffusion step t, the latent code of chunk V_{i−1} is used with probability 1 − f/(O + 1).

By using a probabilistic mixture of the latents in an overlapping region, we successfully diminish inconsistencies between chunks (see Fig. 4 (c)).
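The following is a minimal PyTorch sketch of the shared-noise construction in Eq. (7) and of the randomized blending of the overlapping latent frames; function names and toy shapes are illustrative assumptions, and the denoising step itself is omitted.

```python
import torch

def shared_noise(eps_prev: torch.Tensor, F: int = 24, O: int = 8) -> torch.Tensor:
    """Eq. (7): reuse the noise of the last O frames of the previous chunk and
    sample fresh noise for the remaining F - O frames."""
    eps_hat = torch.randn(F - O, *eps_prev.shape[1:])
    return torch.cat([eps_prev[F - O:F], eps_hat], dim=0)

def randomized_blend(x_prev: torch.Tensor, x_cur: torch.Tensor, O: int = 8) -> torch.Tensor:
    """Randomized blending of the O overlapping latent frames of chunks i-1 and i,
    both of shape (F, h, w, c). A random threshold f_thr ~ U{0, ..., O} decides how
    many overlap frames come from the previous chunk; the rest come from the current one."""
    F = x_prev.shape[0]
    f_thr = torch.randint(0, O + 1, ()).item()
    return torch.cat([x_prev[F - O: F - O + f_thr],   # first f_thr overlap frames from chunk i-1
                      x_cur[f_thr:O]], dim=0)          # frames f_thr+1, ..., O from chunk i

# toy usage for one denoising step (h = w = 16, c = 4 latent channels)
eps_1 = torch.randn(24, 16, 16, 4)
eps_2 = shared_noise(eps_1)                 # shared noise for chunk 2
x1 = torch.randn(24, 16, 16, 4)             # x_{t-1}(1), e.g. produced by the refiner
x2 = torch.randn(24, 16, 16, 4)             # x_{t-1}(2)
overlap = randomized_blend(x1, x2)          # blended overlap, written back into both chunks
```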
5. Experiments

5.1. Implementation Details

We generate F = 16 frames and condition on F_cond = 8 frames. Training is conducted on a dataset collected from publicly available sources. We sample 16 frames (during CAM training) and 32 frames (during CAM+APM training) at 3 FPS and 256x256 resolution. First, we freeze the weights of the pre-trained Video-LDM and train the new layers of CAM with batch size 8 and learning rate 5 · 10^{-5} for 400K steps. We then continue to train CAM+APM. We randomly sample an anchor frame out of the first 16 frames. For the conditioning and denoising, we use the first 8 and 16 frames, starting from frame 17, respectively. This aligns training with inference, where there is a large time gap between the conditional frames and the anchor frame. In addition, by randomly sampling an anchor frame, the model can leverage the CLIP information only for the extraction of high-level semantic information, as we do not provide a frame index to the model. We freeze the CLIP encoder and the temporal layers of the main branch, and train the remaining layers for 1K steps.

The image encoder E_cond used in CAM is composed of stacked 2D convolutions, layer norms and SiLU activations. For the video enhancer, we diffuse an input video using T' = 600 steps. Additional implementation details of StreamingT2V are provided in the appendix.

5.2. Metrics

For quantitative evaluation we employ metrics that measure temporal consistency, text alignment, and per-frame quality of our method.

For temporal consistency, we introduce SCuts, which counts the number of detected scene cuts in a video, using the AdaptiveDetector algorithm of the PySceneDetect [1] package with default parameters. In addition, we propose a new metric called motion aware warp error (MAWE), which coherently assesses motion amount and warp error, and yields a low value when the video exhibits both consistency and a substantial amount of motion. To this end, we measure the motion amount using OFS (optical flow score), which computes for a video the mean magnitude of all optical flow vectors between any two consecutive frames. Furthermore, for a video V, we consider the mean warp error [19] W(V), which measures the average squared L2 pixel distance from a frame to its warped subsequent frame, excluding occluded regions. Finally, MAWE is defined as:

$$\text{MAWE}(\mathcal{V}) := \frac{W(\mathcal{V})}{c \cdot \text{OFS}(\mathcal{V})}, \qquad (8)$$

where c aligns the different scales of the two metrics. To this end, we perform a regression analysis on a subset of our dataset validation videos and obtain c = 9.5 (details on the derivation of c are provided in the appendix). MAWE requires a high motion amount and a low warp error for a low metric value. For the metrics involving optical flow, computations are conducted by resizing all videos to 720×720 resolution.

For video textual alignment, we employ the CLIP [25] text-image similarity score (CLIP), which is applied to all frames of a video. CLIP computes for a video sequence the cosine similarity between the CLIP text encoding and the CLIP image encodings.

For per-frame quality we incorporate the Aesthetic Score [31], computed on top of CLIP image embeddings of all frames of a video.

All metrics are computed per video first and then averaged over all videos; all videos are generated with 80 frames for quantitative analysis.
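For illustration, a simplified sketch of OFS, the warp error, MAWE (Eq. (8)) and SCuts is given below, using OpenCV's Farneback optical flow and PySceneDetect; it omits the occlusion masking of [19] and the exact cut accounting, so it is not the evaluation code used in the paper.

```python
import cv2
import numpy as np

def ofs(frames):
    """OFS: mean optical-flow magnitude over consecutive grayscale uint8 frames."""
    mags = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=-1).mean())
    return float(np.mean(mags))

def warp_error(frames):
    """Simplified warp error: mean squared distance between a frame and its
    backward-warped successor (occlusion masking of [19] omitted here)."""
    errs = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        h, w = prev.shape
        gx, gy = np.meshgrid(np.arange(w), np.arange(h))
        map_x = (gx + flow[..., 0]).astype(np.float32)
        map_y = (gy + flow[..., 1]).astype(np.float32)
        warped = cv2.remap(nxt, map_x, map_y, cv2.INTER_LINEAR)
        errs.append(((warped.astype(np.float32) - prev.astype(np.float32)) ** 2).mean())
    return float(np.mean(errs))

def mawe(frames, c: float = 9.5) -> float:
    """Eq. (8): motion aware warp error."""
    return warp_error(frames) / (c * ofs(frames))

def scuts(video_path: str) -> int:
    """SCuts sketch: scene cuts reported by PySceneDetect's AdaptiveDetector
    (the exact counting convention may differ from the paper's)."""
    from scenedetect import detect, AdaptiveDetector
    return max(len(detect(video_path, AdaptiveDetector())) - 1, 0)
```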
5.3. Ablation Study

To assess our proposed components, we perform ablation studies on a randomly sampled set of 75 prompts from our validation set. We compare CAM against established conditioning approaches, analyse the impact of our long-term memory APM, and ablate our modifications for the video enhancer.

CAM for Conditioning. To analyse the importance of CAM, we compare CAM (w/o APM) with two baselines: (i) We connect the features of CAM with the skip-connection of the UNet via zero convolution, followed by addition. We zero-pad the condition frames and concatenate them with a frame-indicating mask to form the input for the modified CAM, which we denote as Add-Cond. (ii) We append the conditional frames and a frame-indicating mask to the input of Video-LDM's UNet along the channel dimension, but do not use CAM, which we denote as Conc-Cond. We train our method with CAM and the baselines on the same dataset. Architectural details (including training) of these baselines are provided in the appendix.

We obtain an SCuts score of 0.24, 0.284 and 0.03 for Conc-Cond, Add-Cond and Ours (w/o APM), respectively. This shows that the inconsistencies in the input caused by the masking lead to frequent inconsistencies in the generated videos and that concatenation to the UNet's input is too weak a conditioning. In contrast, our CAM generates consistent videos with an SCuts score that is 88% lower than the baselines.
X-T slice and video panels: (a) Naive Concatenation, (b) Shared Noise.
Figure 4. Ablation study on our video enhancer improvements. The X-T slice visualization shows that randomized blending leads to
smooth chunk transitions, while both baselines have clearly visible, severe inconsistencies between chunks.
Figure 5. Visual comparison of DynamiCrafter-XL and StreamingT2V. Both text-to-video results are generated using the same prompt.
The X-T slice visualization shows that DynamiCrafter-XL suffers from severe chunk inconsistencies and repetitive motions. In contrast,
our method shows seamless transitions and evolving content.
Long-Term Memory. We analyse the impact of utilizing long-term memory in the context of long video generation.

Fig. 6 shows that long-term memory greatly helps to keep the object and scene features across autoregressive generations. This is also supported quantitatively. We obtain a person re-identification score (definition in the appendix) of 93.42 and 94.95 for Ours w/o APM and Ours, respectively. Our APM module thus improves the identity/appearance preservation. Also the scene information is better kept, as we observe an image distance score in terms of LPIPS [47] of 0.192 and 0.151 for Ours w/o APM and Ours, respectively. We thus obtain an improvement in terms of scene preservation of more than 20% when APM is used.

Randomized Blending for Video Enhancement. We assess our randomized blending approach by comparing against two baselines: (B) enhances each video chunk independently, and (B+S) uses shared noise for consecutive chunks, with an overlap of 8 frames, but no randomized blending. We compute per sequence the standard deviation of the optical flow magnitudes between consecutive frames and average over all frames and sequences, which indicates temporal smoothness. We obtain the scores 8.72, 6.01 and 3.32 for B, B+S, and StreamingT2V, respectively. Thus, noise sharing improves chunk consistency (by 31% vs B), but it is significantly further improved by randomized blending (by 62% vs B). Also the visual results in Fig. 4 confirm this improvement.
(a) Young caucasian female couple drinking cocktails and smiling on terrace in havana, cuba. girls, teens, teenagers, women
Figure 6. Top row: CAM+APM, Bottom row: CAM. The figure shows that using long-term information via APM helps to keep identities
(e.g. the face of the left woman) and scene features, e.g. the dresses or arm clock.
Row labels, top to bottom: Ours, DynamiCrafter-XL [43], FreeNoise [24], SparseCtrl [12], SVD [4], SEINE [7], I2VGen-XL [48].
(a) Fishes swimming in ocean camera moving, cinematic (b) Close flyover over a large wheat field in the early
morning sunlight, cinematic
Figure 7. Visual comparisons of StreamingT2V with state-of-the-art methods on 80 frame-length, autoregressively generated videos. In
contrast to other methods, StreamingT2V generates long videos without suffering from motion stagnation.
Table 8. Quantitative comparison to state-of-the-art open-source text-to-long-video generators. Best performing metrics are highlighted in red, second best in blue. We also tested OpenAI's Sora with the 48 sample videos released on their website. We list the numbers here to give readers a sense of how Sora performs on these metrics, but please be advised that this test differs from the other open-source models in terms of both the test set and the prompts in this table; hence it is shown in gray and not ranked.
6. Conclusion

We have shown that current approaches for long video generation produce videos with temporal inconsistencies, or severe stagnation up to standstill. To overcome these limitations, we carefully analysed an autoregressive pipeline built on top of a vanilla video diffusion model and proposed StreamingT2V, which
utilizes novel short- and long-term dependency blocks for seamless continuation of video chunks with a high motion amount, while preserving high-level scene and object features throughout the generation process. We proposed a randomized blending approach that enables the use of a video enhancer within the autoregressive process without temporal inconsistencies. Experiments showed that StreamingT2V leads to significantly higher motion amount and temporal consistency compared with all competitors. StreamingT2V enables long video generation from text without stagnating content.

References

[1] PySceneDetect. https://www.scenedetect.com/. Accessed: 2024-03-03.
[2] Isaac Amidror. Scattered data interpolation methods for electronic imaging systems: a survey. Journal of Electronic Imaging, 11(2):157–176, 2002.
[3] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In IEEE International Conference on Computer Vision, 2021.
[4] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
[5] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
[6] Xi Chen, Zhiheng Liu, Mengting Chen, Yutong Feng, Yu Liu, Yujun Shen, and Hengshuang Zhao. Livephoto: Real image animation with text-guided motion control. arXiv preprint arXiv:2312.02928, 2023.
[7] Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffusion model for generative transition and prediction. In The Twelfth International Conference on Learning Representations, 2023.
[8] Zuozhuo Dai, Zhenghao Zhang, Yao Yao, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang. Animateanything: Fine-grained open domain image animation with motion guidance, 2023.
[9] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021.
[10] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023.
[11] Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023.
[12] Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Sparsectrl: Adding sparse controls to text-to-video diffusion models. arXiv preprint arXiv:2311.16933, 2023.
[13] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In The Twelfth International Conference on Learning Representations, 2023.
[14] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
[15] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[16] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
[17] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022.
[18] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15954–15964, 2023.
[19] Wei-Sheng Lai, Jia-Bin Huang, Oliver Wang, Eli Shechtman, Ersin Yumer, and Ming-Hsuan Yang. Learning blind video temporal consistency. In Proceedings of the European Conference on Computer Vision (ECCV), pages 170–185, 2018.
[20] Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, and Jingdong Wang. Videogen: A reference-guided latent diffusion approach for high definition text-to-video generation. arXiv preprint arXiv:2309.00398, 2023.
[21] Fuchen Long, Zhaofan Qiu, Ting Yao, and Tao Mei. Videodrafter: Content-consistent multi-scene video generation with llm. arXiv preprint arXiv:2401.01256, 2024.
[22] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022.
[23] Gyeongrok Oh, Jaehwan Jeong, Sieun Kim, Wonmin Byeon, Jinkyu Kim, Sungwoong Kim, Hyeokmin Kwon, and Sangpil Kim. Mtvg: Multi-text video generation with text-to-video models. arXiv preprint arXiv:2312.04086, 2023.
[24] Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, and Ziwei Liu. Freenoise: Tuning-free longer video diffusion via noise rescheduling. In The Twelfth International Conference on Learning Representations, 2024.
[25] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[26] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[27] Weiming Ren, Harry Yang, Ge Zhang, Cong Wei, Xinrun Du, Stephen Huang, and Wenhu Chen. Consisti2v: Enhancing visual consistency for image-to-video generation. arXiv preprint arXiv:2402.04324, 2024.
[28] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[29] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
[30] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[31] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
[32] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. In The Eleventh International Conference on Learning Representations, 2022.
[33] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
[34] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2020.
[35] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017.
[36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[37] Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual descriptions. In International Conference on Learning Representations, 2022.
[38] Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, and Hongsheng Li. Gen-l-video: Multi-text to long video generation via temporal co-denoising. arXiv preprint arXiv:2305.18264, 2023.
[39] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023.
[40] Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. Videocomposer: Compositional video synthesis with motion controllability. Advances in Neural Information Processing Systems, 36, 2024.
[41] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023.
[42] Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jianmin Bao, Yuhui Yuan, Chong Luo, Yueyi Zhang, and Zhiwei Xiong. Art•v: Auto-regressive text-to-video generation with diffusion models. arXiv preprint arXiv:2311.18834, 2023.
[43] Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Xintao Wang, Tien-Tsin Wong, and Ying Shan. Dynamicrafter: Animating open-domain images with video diffusion priors. arXiv preprint arXiv:2310.12190, 2023.
[44] Yan Zeng, Guoqiang Wei, Jiani Zheng, Jiaxin Zou, Yang Wei, Yuchen Zhang, and Hang Li. Make pixels dance: High-dynamic video generation. arXiv preprint arXiv:2311.10982, 2023.
[45] David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023.
[46] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
[47] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[48] Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qing, Xiang Wang, Deli Zhao, and Jingren Zhou. I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models. 2023.
Appendix

This appendix complements our main paper with further experiments, in which we investigate the text-to-video generation quality of StreamingT2V for even longer sequences than those assessed in the main paper, and it contains additional information on the implementation of StreamingT2V and the experiments carried out.

In Sec. 7, a user study is conducted on the test set, in which all text-to-video methods under consideration are evaluated by humans to determine the user preferences.

Sec. 8 supplements our main paper with additional qualitative results of StreamingT2V, qualitative comparisons of StreamingT2V with competing methods, and ablation studies that show the effectiveness of APM and randomized blending.

In Sec. 9, hyperparameters used in StreamingT2V and implementation details of our ablated models are provided.

Finally, Sec. 10 complements our main paper with additional information on the metrics employed in our quantitative experiments, and it provides the prompts that compose our test set.

7. User Study

8. Very Long Video Generation

Supplementing our main paper, we study the generation quality of our text-to-video method StreamingT2V and the competing methods for creating even longer videos. To this end, we generate 240-frame, and even 600-frame, videos.

Within this setting, we present qualitative results of StreamingT2V, qualitative and quantitative comparisons to existing methods, and ablation studies on our APM module and the randomized blending approach.

8.1. Qualitative Results

Fig. 10 and Fig. 11 show text-to-video results of StreamingT2V for different actions, e.g. dancing, reading, camera moving, or eating, and different characters like a tiger or Santa Claus. We can observe that scene and object features are kept across each video generation (see e.g. Fig. 10 (e) and Fig. 11 (d)), thanks to our proposed APM module. Our proposed CAM module ensures that generated videos are temporally smooth, with seamless transitions between video chunks, and not stagnating (see e.g. Fig. 10 (f)).
8.2. Comparison with Existing Methods

The closest competitor on CLIP score is FreeNoise, which achieves a similar but slightly worse CLIP score. However, our shown qualitative results have revealed that FreeNoise is heavily prone to video stagnation, so that the quality does not get worse, but the video content is also nearly not changed.

Fig. 9 also shows that StreamingT2V seamlessly connects video chunks and creates temporally consistent videos, thanks to our effective CAM module, which uses its temporal attentional mechanism to condition a new video generation on the previous chunk. The MAWE score of StreamingT2V is hardly influenced by longer video generation, thus leading to the best MAWE score among all considered methods, which is more than 50% better than the best competing method SEINE. The experiment also shows that the competing methods SVD, I2VGen-XL, SparseControl and FreeNoise are unstable. With more frames being generated, their MAWE score becomes worse. These methods are thus prone to increased temporal inconsistencies or video stagnation. This again shows that their conditioning mechanism is too weak (e.g., the concatenation of the conditional frame to the input and to the cross attention layers for SVD).

Finally, the SCuts score shows that StreamingT2V leads to the fewest scene cuts and thus the best transitions. The advantage over SparseControl, DynamiCrafter-XL and SVD is significant, especially with more frames (e.g. the MAWE score is up to 77 times lower for StreamingT2V).

Our study on long video generation has thus shown that, thanks to our CAM and APM modules, StreamingT2V is best suited for long video generation. All other competing methods heavily deteriorate in video quality when applied to long video generation.

8.3. Ablation Studies

We present additional qualitative results for the ablations on our APM module and randomized blending.

Effectiveness of APM. We complement our ablation study on APM (see Sec. 5.3 of the main paper) with additional qualitative results in Fig. 12. Thanks to the usage of long-term information via our proposed APM module, identity and scene features are preserved throughout the video. For instance, the face of the woman (including all its tiny details) is consistent across the video generation. Also, the style of the jacket and the bag are correctly generated throughout the video. Without having access to a long-term memory, these object and scene features change over time.

Randomized Blending. Fig. 13 and Fig. 14 show additional ablated results on our randomized blending approach. From the X-T slice visualizations we can see that the randomized blending leads to smooth chunk transitions. In contrast, when naively concatenating enhanced video chunks, or using shared noise, the resulting videos possess visible inconsistencies between chunks.

9. Implementation details

We provide additional implementation details for StreamingT2V and our ablated models.

9.1. Streaming T2V Stage

For the StreamingT2V stage, we use classifier-free guidance [10, 14] from text and the anchor frame. More precisely, let ϵ_θ(x_t, t, τ, a) denote the noise prediction in the StreamingT2V stage for latent code x_t at diffusion step t, text τ and anchor frame a. For text guidance and guidance by the anchor frame, we introduce weights ω_text and ω_anchor, respectively. Let τ_null and a_null denote the empty string, and the image with all pixel values set to zero, respectively. Then, we obtain the multi-conditioned classifier-free-guided noise prediction ϵ̂_θ (similar to DynamiCrafter-XL [43]) from the noise predictor ϵ_θ via

$$\hat{\epsilon}_\theta(x_t, t, \tau, a) = \epsilon_\theta(x_t, t, \tau_{null}, a_{null}) + \omega_{text}\big(\epsilon_\theta(x_t, t, \tau, a_{null}) - \epsilon_\theta(x_t, t, \tau_{null}, a_{null})\big) + \omega_{anchor}\big(\epsilon_\theta(x_t, t, \tau, a) - \epsilon_\theta(x_t, t, \tau, a_{null})\big). \qquad (9)$$

We then use ϵ̂_θ for denoising. In our experiments, we set ω_text = ω_anchor = 7.5. During training, we randomly replace τ with τ_null with 5% likelihood, the anchor frame a with a_null with 5% likelihood, and we replace at the same time τ with τ_null and a with a_null with 5% likelihood.

Additional hyperparameters for the architecture, training and inference of the Streaming T2V stage are presented in Tab. 9, where Per-Pixel Temporal Attention refers to the attention module used in CAM (see Fig. 2 of the main paper).
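A minimal sketch of the multi-conditioned classifier-free guidance in Eq. (9) is given below; the noise predictor is passed in as a callable and all argument names are illustrative.

```python
import torch

def multi_cond_cfg(eps_theta, x_t, t, text_emb, null_text_emb, anchor, null_anchor,
                   w_text: float = 7.5, w_anchor: float = 7.5) -> torch.Tensor:
    """Eq. (9): multi-conditioned classifier-free guidance from text and anchor frame.
    `eps_theta` is assumed to be a callable noise predictor eps_theta(x_t, t, text, anchor)."""
    e_uncond = eps_theta(x_t, t, null_text_emb, null_anchor)   # eps(x_t, t, tau_null, a_null)
    e_text = eps_theta(x_t, t, text_emb, null_anchor)          # eps(x_t, t, tau,      a_null)
    e_full = eps_theta(x_t, t, text_emb, anchor)               # eps(x_t, t, tau,      a)
    return e_uncond + w_text * (e_text - e_uncond) + w_anchor * (e_full - e_text)

# During training, the text and anchor conditionings are each dropped with 5% probability
# (and jointly dropped with another 5%), e.g.:
#   if torch.rand(()) < 0.05: text_emb = null_text_emb
```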
9.2. Ablation models

In Sec. 5.3 of the main paper, we consider two baselines that we compare with CAM. Here we provide additional implementation details.

The ablated model Add-Cond applies a zero-convolution to the features of CAM (i.e. the outputs of the encoder and middle layer of the ControlNet part in Fig. 2), and uses addition to fuse them with the features of the skip-connection of the UNet (similar to ControlNet [46]) (see Fig. 15). We provide here additional details to construct this model. Given a video sample V ∈ R^{F×H×W×3} with F = 16 frames, we construct a mask M ∈ {0, 1}^{F×H×W×3} that indicates which frames are used for conditioning, i.e. M^f[i, j, k] = M^f[i', j', k'] for all frames f = 1, . . . , F and for all i, j, k, i', j', k'. We require that exactly F − F_cond frames are masked, i.e.

$$\sum_{f=1}^{F} M^f[i, j, k] = F - F_{cond}, \quad \text{for all } i, j, k. \qquad (10)$$
We concatenate [V ⊙ M, M] along the channel dimension and use it as input for the image encoder E_cond, where ⊙ denotes element-wise multiplication.

During training, we randomly set the mask M. During inference, we set the mask for the first 8 frames to zero, and for the last 8 frames to one, so that the model conditions on the last 8 frames of the previous chunk.

For the ablated model Conc-Cond used in Sec. 5.3, we start from our Video-LDM's UNet, and modify its first convolution. Like for Add-Cond, we consider a video V of length F = 16 and a mask M that encodes which frames are overwritten by zeros. Now the UNet takes [z_t, E(V) ⊙ M, M] as input, where we concatenate along the channel dimension. As with Add-Cond, we randomly set M during training so that the information of 8 frames is used, while during inference, we set it such that the last 8 frames of the previous chunk are used. Here E denotes the VQ-GAN encoder (see Sec. 3.1 of the main paper).
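For illustration, the following sketch constructs the conditioning inputs of the two ablated models; the mask convention used here (M = 1 on the frames the model conditions on) and all names are assumptions made for this sketch rather than the exact implementation.

```python
import torch

def build_add_cond_input(video: torch.Tensor, f_cond: int = 8) -> torch.Tensor:
    """Sketch of the Add-Cond conditioning input [V * M, M] at inference:
    the last f_cond frames are kept, the remaining frames are zeroed out.
    video: (F, H, W, 3) pixel frames; returns (F, H, W, 4)."""
    F, H, W, C = video.shape
    mask = torch.zeros(F, H, W, 1)
    mask[F - f_cond:] = 1.0                              # condition on the last f_cond frames
    return torch.cat([video * mask, mask], dim=-1)

def build_conc_cond_input(z_t: torch.Tensor, enc_video: torch.Tensor, f_cond: int = 8) -> torch.Tensor:
    """Sketch of the Conc-Cond UNet input [z_t, E(V) * M, M] at inference.
    z_t and enc_video: (F, h, w, c) latent tensors."""
    F, h, w, c = z_t.shape
    mask = torch.zeros(F, h, w, 1)
    mask[F - f_cond:] = 1.0
    return torch.cat([z_t, enc_video * mask, mask], dim=-1)
```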
Table 9. Hyperparameters of the Streaming T2V Stage. Additional architectural hyperparameters are provided by the Modelscope report [39].

Per-Pixel Temporal Attention
  Sequence length Q: 16
  Sequence length K, V: 8
  Token dimensions: 320, 640, 1280
Appearance Preservation Module
  CLIP image embedding dim: 1024
  CLIP image embedding tokens: 1
  MLP hidden layers: 1
  MLP inner dim: 1280
  MLP output tokens: 16
  MLP output dim: 1024
  1D conv input tokens: 93
  1D conv output tokens: 77
  1D conv output dim: 1024
  Cross attention sequence length: 77
Training
  Parametrization: ϵ
Diffusion Setup
  Diffusion steps: 1000
  Noise scheduler: Linear
  β0: 0.0085
  βT: 0.0120
Sampling Parameters
  Sampler: DDIM
  Steps: 50
  η: 1.0
  ω_text: 7.5
  ω_anchor: 7.5

10. Experiment details

10.1. Metric details

In Sec. 5.3 of the main paper, we employ a re-identification score to assess the feature preservation quality of our APM module in the long-term memory experiment.

To obtain the score, let P_n = {p_i^n} be all face patches extracted from frame n using an off-the-shelf head detector [30], and let F_i^n be the corresponding facial feature of p_i^n, which we obtain from an off-the-shelf face recognition network [30]. Then, for frame n, with n_1 := |P_n| and n_2 := |P_{n+1}|, we define the re-id score re-id(n) for frame n as

$$\text{re-id}(n) := \begin{cases} \max_{i,j} \cos\Theta(F_i^n, F_j^{n+1}), & n_1 > 0 \text{ and } n_2 > 0, \\ 0, & \text{otherwise}, \end{cases} \qquad (11)$$

where cos Θ is the cosine similarity. Finally, we obtain the re-id score of a video by averaging over all frames where two consecutive frames have face detections, i.e. with m := |{n ∈ {1, .., N} : |P_n| > 0}|, we compute the weighted sum:

$$\text{re-id} := \frac{1}{m} \sum_{n=1}^{N-1} \text{re-id}(n), \qquad (12)$$

where N denotes the number of frames.
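A simplified sketch of the re-identification score in Eq. (11)-(12) is given below; it assumes the face patches have already been detected and embedded (e.g. with [30]), and all names are illustrative.

```python
import numpy as np

def reid_score(face_embeddings: list) -> float:
    """Eq. (11)-(12): `face_embeddings[n]` holds the facial feature vectors F_i^n of all
    faces detected in frame n (detection and embedding steps are omitted here)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    N = len(face_embeddings)
    m = sum(1 for feats in face_embeddings if len(feats) > 0)   # frames with detections
    if m == 0:
        return 0.0
    total = 0.0
    for n in range(N - 1):
        cur, nxt = face_embeddings[n], face_embeddings[n + 1]
        if cur and nxt:                                          # Eq. (11): best match across frames
            total += max(cos(a, b) for a in cur for b in nxt)
    return total / m                                             # Eq. (12)
```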
MAWE is based on warp errors and OF scores, which have highly dependent values. We try to counteract this effect in our MAWE score by assuming this dependency is linear, W(V) = c · OFS(V), and accounting for it in MAWE's denominator. To calculate c we randomly sample a small part of our training data with a range of optical flow scores and remove outliers applying the Z-score method. Using this dataset, c is obtained by a simple regression analysis.
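A minimal sketch of this regression is given below; the per-video (OFS, warp error) values are made up for illustration.

```python
import numpy as np

# Fit W(V) ≈ c * OFS(V) by least squares on per-video (OFS, warp error) pairs,
# after removing outliers via the Z-score method. Sample values are illustrative only.
ofs_vals = np.array([1.2, 2.5, 3.1, 4.0, 5.6, 7.3])
warp_vals = np.array([11.0, 24.1, 30.2, 37.5, 52.8, 70.1])

z = np.abs((warp_vals - warp_vals.mean()) / warp_vals.std())
keep = z < 3.0                                                        # Z-score outlier removal
c = np.linalg.lstsq(ofs_vals[keep, None], warp_vals[keep], rcond=None)[0][0]
print(f"estimated c = {c:.2f}")                                       # the paper reports c = 9.5
```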
10.2. Test prompt set

1. A camel resting on the snow field.
2. Camera following a pack of crows flying in the sky.
3. A knight riding on a horse through the countryside.
4. A gorilla eats a banana in Central Park.
5. Men walking in the rain.
6. Ants, beetles and centipede nest.
7. A squirrel on a table full of big nuts.
8. Close flyover over a large wheat field in the early morning sunlight.
9. A squirrel watches with sweet eyes into the camera.
10. Santa Claus is dancing.
11. Chemical reaction.
12. Camera moving in a wide bright ice cave, cyan.
13. Prague, Czech Republic. Heavy rain on the street.
14. Time-lapse of stormclouds during thunderstorm.
15. People dancing in room filled with fog and colorful lights.
16. Big celebration with fireworks.
17. Aerial view of a large city.
18. Wide shot of battlefield, stormtroopers running at night,
fires and smokes and explosions in background.
19. Explosion.
20. Drone flythrough of a tropical jungle with many birds.
21. A camel running on the snow field.
22. Fishes swimming in ocean camera moving.
23. A squirrel in Antarctica, on a pile of hazelnuts cinematic.
24. Fluids mixing and changing colors, closeup.
25. A horse eating grass on a lawn.
26. The fire in the car is extinguished by heavy rain.
27. Camera is zooming out and the baby starts to cry.
28. Flying through nebulas and stars.
29. A kitten resting on a ball of wool.
30. A musk ox grazing on beautiful wildflowers.
31. A hummingbird flutters among colorful flowers, its
wings beating rapidly.
32. A knight riding a horse, pointing with his lance to the
sky.
33. steampunk robot looking at the camera.
34. Drone fly to a mansion in a tropical forest.
35. Top-down footage of a dirt road in forest.
36. Camera moving closely over beautiful roses blooming
time-lapse.
37. A tiger eating raw meat on the street.
38. A beagle looking in the Louvre at a painting.
39. A beagle reading a paper.
40. A panda playing guitar on Times Square.
41. A young girl making selfies with her phone in a crowded
street.
42. Aerial: flying above a breathtaking limestone structure
on a serene and exotic island.
43. Aerial: Hovering above a picturesque mountain range on
a peaceful and idyllic island getaway.
44. A time-lapse sequence illustrating the stages of growth
in a flourishing field of corn.
45. Documenting the growth cycle of vibrant lavender flowers in a mesmerizing time-lapse.
46. Around the lively streets of Corso Como, a fearless urban rabbit hopped playfully, seemingly unfazed by the fashionable surroundings.
47. Beside the Duomo's majestic spires, a fearless falcon soared, riding the currents of air above the iconic cathedral.
48. A graceful heron stood poised near the reflecting pools
of the Duomo, adding a touch of tranquility to the vibrant
surroundings.
49. A woman with a camera in hand joyfully skipped along
the perimeter of the Duomo, capturing the essence of the
moment.
50. Beside the ancient amphitheater of Taormina, a group of friends enjoyed a leisurely picnic, taking in the breathtaking views.
Figure 8. We conduct a user study, asking humans to assess the test set results of Sect 5.4 in a one-to-one evaluation, where for any prompt
of the test set and any competing method, the results of the competing method have to be compared with the corresponding results of our
StreamingT2V method. For each comparison of our method to a competing method, we report the relative number of votes that prefer
StreamingT2V (i.e. wins), that prefer the competing method (i.e. losses), and that consider results from both methods as equal (i.e. draws).
(a) MAWE score (↓). (b) SCuts score (↓). (c) CLIP score (↑).
Figure 9. Study on how generating longer videos affects the generation quality.
Figure 10 panel prompts: (a) People dancing in room filled with fog and colorful lights; (f) Explosion.
(a) Wide shot of battlefield, stormtroopers running at night
(d) A young girl making selfies with her phone in a crowded street
Figure 12. Ablation study on the APM module. Top row is generated from StreamingT2V, bottom row is generated from StreamingT2V
w/o APM.
X-T slice and video panels: (a) Naive Concatenation, (b) Shared Noise.
Figure 13. Ablation study on different approaches for autoregressive video enhancement.
Figure 14. Ablation study on different approaches for autoregressive video enhancement.
Figure 15. Illustration of the Add-Cond baseline, which is used in Sec. 5.3 of the main paper.