Streamingt2V: Consistent, Dynamic, and Extendable Long Video Generation From Text
Streamingt2V: Consistent, Dynamic, and Extendable Long Video Generation From Text
Roberto Henschel1∗, Levon Khachatryan1∗, Daniil Hayrapetyan1∗, Hayk Poghosyan1, Vahram Tadevosyan1,
Zhangyang Wang1,2, Shant Navasardyan1, Humphrey Shi1,3
1 Picsart AI Research (PAIR)  2 UT Austin  3 SHI Labs @ Georgia Tech, Oregon & UIUC
arXiv:2403.14773v1 [cs.CV] 21 Mar 2024
https://github.com/Picsart-AI-Research/StreamingT2V
Figure 1. StreamingT2V is an advanced autoregressive technique that enables the creation of long videos featuring rich motion dynamics
without any stagnation. It ensures temporal consistency throughout the video, aligns closely with the descriptive text, and maintains high
frame-level image quality. Our demonstrations include successful examples of videos of up to 1200 frames, spanning 2 minutes, which can be
extended to even longer durations. Importantly, the effectiveness of StreamingT2V is not limited by the specific Text2Video model used,
indicating that improvements in base models could yield even higher-quality videos.
transitions. The key components are: (i) a short-term memory block called conditional attention module (CAM), which conditions the current generation on the features extracted from the previous chunk via an attentional mechanism, leading to consistent chunk transitions, (ii) a long-term memory block called appearance preservation module, which extracts high-level scene and object features from the first video chunk to prevent the model from forgetting the initial scene, and (iii) a randomized blending approach that makes it possible to apply a video enhancer autoregressively for infinitely long videos without inconsistencies between chunks. Experiments show that StreamingT2V generates videos with a high amount of motion. In contrast, all competing image-to-video methods are prone to video stagnation when applied naively in an autoregressive manner. With StreamingT2V, we thus propose a high-quality, seamless text-to-long-video generator that outperforms competitors in consistency and motion.

1. Introduction

In recent years, with the rise of Diffusion Models [15, 26, 28, 34], the task of text-guided image synthesis and manipulation gained enormous attention from the community. The huge success in image generation led to the further extension of diffusion models to generate videos conditioned on textual prompts [4, 5, 7, 11–13, 17, 18, 20, 32, 37, 39, 45].

Despite the impressive generation quality and text alignment, the majority of existing approaches such as [4, 5, 17, 39, 45] are mostly focused on generating short frame sequences (typically of 16 or 24 frames). However, short videos are limited in real-world use-cases such as ad making, storytelling, etc.

The naïve approach of simply training existing methods on long videos (e.g. ≥ 64 frames) is normally unfeasible. Even for generating short sequences, a very expensive training (e.g. using more than 260K steps and a batch size of 4500 [39]) is typically required. Without training on longer videos, video quality commonly degrades when short video generators are made to output long videos (see appendix). Existing approaches, such as [5, 17, 23], thus extend the baselines to autoregressively generate short video chunks conditioned on the last frame(s) of the previous chunk.

However, the straightforward long-video generation approach of simply concatenating the noisy latents of a video chunk with the last frame(s) of the previous chunk leads to poor conditioning with inconsistent scene transitions (see Sec. 5.3). Some works [4, 8, 40, 43, 48] additionally incorporate CLIP [25] image embeddings of the last frame of the previous chunk, reaching slightly better consistency, but they are still prone to inconsistent global motion across chunks (see Fig. 5), as the CLIP image encoder loses information important for perfectly reconstructing the conditional frames. The concurrent work SparseCtrl [12] utilizes a more sophisticated conditioning mechanism via a sparse encoder. Its architecture requires concatenating additional zero-filled frames to the conditioning frames before they are fed into the sparse encoder. However, this inconsistency in the input leads to inconsistencies in the output (see Sec. 5.4). Moreover, we observed that all image-to-video methods that we evaluated in our experiments (see Sec. 5.4) eventually lead to video stagnation when applied autoregressively by conditioning on the last frame of the previous chunk.

To overcome the weaknesses and limitations of current works, we propose StreamingT2V, an autoregressive text-to-video method equipped with long/short-term memory blocks that generates long videos without temporal inconsistencies.

To this end, we propose the Conditional Attention Module (CAM) which, due to its attentional nature, effectively borrows the content information from the previous frames to generate new ones, while not restricting their motion by the previous structures/shapes. Thanks to CAM, our results are smooth and exhibit artifact-free video chunk transitions.

Existing approaches are not only prone to temporal inconsistencies and video stagnation, but they also suffer from object appearance/characteristic changes and video quality degradation over time (see e.g., SVD [4] in Fig. 7). The reason is that, due to conditioning only on the last frame(s) of the previous chunk, they overlook the long-term dependencies of the autoregressive process. To address this issue, we design an Appearance Preservation Module (APM) that extracts object or global scene appearance information from an initial image (anchor frame), and conditions the video generation process of all chunks with that information, which helps to keep object and scene features across the autoregressive process.

To further improve the quality and resolution of our long video generation, we adapt a video enhancement model for autoregressive generation. For this purpose, we choose a high-resolution text-to-video model and utilize the SDEdit [22] approach for enhancing consecutive 24-frame chunks (overlapping with 8 frames) of our video. To make the chunk enhancement transitions smooth, we design a randomized blending approach for seamless blending of overlapping enhanced chunks.

Experiments show that StreamingT2V successfully generates long and temporally consistent videos from text without video stagnation. To summarize, our contributions are three-fold:
• We introduce StreamingT2V, an autoregressive approach for seamless synthesis of extended video content using short- and long-term dependencies.
• Our Conditional Attention Module (CAM) and Appearance Preservation Module (APM) ensure the natural continuity of the global scene and object characteristics
of generated videos.
• We seamlessly enhance generated long videos by introducing our randomized blending approach for consecutive overlapping chunks.

2. Related Work

Text-Guided Video Diffusion Models. Generating videos from textual instructions using Diffusion Models [15, 33] is a recently established yet very active field of research introduced by Video Diffusion Models (VDM) [17]. The approach requires massive training resources and can generate only low-resolution videos (up to 128x128), which are furthermore limited to at most 16 frames (without autoregression). Also, the training of text-to-video models is usually done on large datasets such as WebVid-10M [3] or InternVid [41]. Several methods employ video enhancement in the form of spatial/temporal upsampling [5, 16, 17, 32], using cascades with up to 7 enhancer modules [16]. Such an approach produces high-resolution and long videos. Yet, the generated content is still limited by the key frames.

Towards generating longer videos (i.e. more keyframes), Text-To-Video-Zero (T2V0) [18] and ART-V [42] employ a text-to-image diffusion model. Therefore, they can generate only simple motions. T2V0 conditions on its first frame via cross-frame attention and ART-V on an anchor frame. Due to the lack of global reasoning, this leads to unnatural or repetitive motions. MTVG [23] turns a text-to-video model into an autoregressive method by a training-free approach. It employs strong consistency priors between and among video chunks, which leads to a very low motion amount and mostly near-static backgrounds. FreeNoise [24] samples a small set of noise vectors and re-uses them for the generation of all frames, while temporal attention is performed on local windows. As the employed temporal attentions are invariant to such frame shuffling, this leads to high similarity between frames, almost always static global motion, and near-constant videos. Gen-L [38] generates overlapping short videos and aggregates them via temporal co-denoising, which can lead to quality degradation and video stagnation.

Image-Guided Video Diffusion Models as Long Video Generators. Several works condition the video generation on a driving image or video [4, 6–8, 10, 12, 21, 27, 40, 43, 44, 48]. They can thus be turned into an autoregressive method by conditioning on the frame(s) of the previous chunk.

VideoDrafter [21] uses a text-to-image model to obtain an anchor frame. A video diffusion model is conditioned on the driving anchor to independently generate multiple videos that share the same high-level context. However, no consistency among the video chunks is enforced, leading to drastic scene cuts. Several works [7, 8, 44] concatenate the (encoded) conditionings with an additional mask (which indicates which frame is provided) to the input of the video diffusion model.

In addition to concatenating the conditioning to the input of the diffusion model, several works [4, 40, 48] replace the text embeddings in the cross-attentions of the diffusion model by CLIP [25] image embeddings of the conditional frames. However, according to our experiments, their applicability for long video generation is limited. SVD [4] shows severe quality degradation over time (see Fig. 7), and both I2VGen-XL [48] and SVD [4] often generate inconsistencies between chunks, indicating that the conditioning mechanism is too weak.

Some works [6, 43] such as DynamiCrafter-XL [43] thus add to each text cross-attention an image cross-attention, which leads to better quality, but still to frequent inconsistencies between chunks.

The concurrent work SparseCtrl [12] adds a ControlNet [46]-like branch to the model, which takes the conditional frames and a frame-indicating mask as input. It requires by design to append additional frames consisting of black pixels to the conditional frames. This inconsistency is difficult for the model to compensate for, leading to frequent and severe scene cuts between frames.

Overall, only a small number of keyframes can currently be generated at once with high quality. While in-between frames can be interpolated, this does not lead to new content. Also, while image-to-video methods can be used autoregressively, their conditioning mechanisms lead either to inconsistencies, or the method suffers from video stagnation. We conclude that existing works are not suitable for high-quality and consistent long video generation without video stagnation.

3. Preliminaries

Diffusion Models. Our text-to-video model, which we term StreamingT2V, is a diffusion model that operates in the latent space of the VQ-GAN [9, 35] autoencoder D(E(·)), where E and D are the corresponding encoder and decoder, respectively. Given a video V ∈ R^{F×H×W×3}, composed of F frames with spatial resolution H × W, its latent code x_0 ∈ R^{F×h×w×c} is obtained through frame-by-frame application of the encoder. More precisely, by identifying each tensor x ∈ R^{F×ĥ×ŵ×ĉ} with the sequence (x^f)_{f=1}^{F}, x^f ∈ R^{ĥ×ŵ×ĉ}, we obtain the latent code via x_0^f := E(V^f), for all f = 1, . . . , F. The diffusion forward process gradually adds Gaussian noise ϵ ∼ N(0, I) to the signal x_0:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big), \quad t = 1, \ldots, T \qquad (1)$$

where q(x_t | x_{t−1}) is the conditional density of x_t given x_{t−1}, and {β_t}_{t=1}^{T} are hyperparameters. A high value for T is chosen such that the forward process completely destroys the initial signal x_0, resulting in x_T ∼ N(0, I).
Figure 2. The overall pipeline of StreamingT2V: In the Initialization Stage the first 16-frame chunk is synthesized by a text-to-video
model (e.g. Modelscope [39]). In the Streaming T2V Stage the new content for further frames is autoregressively generated. Finally, in
the Streaming Refinement Stage the generated long video (600, 1200 frames or more) is autoregressively enhanced by applying a high-
resolution text-to-short-video model (e.g. MS-Vid2Vid-XL [48]) equipped with our randomized blending approach.
The goal of a diffusion model is then to learn a backward process

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big) \qquad (2)$$

that step by step removes the noise again, so that new samples can be generated starting from pure noise x_T.

4. StreamingT2V

We build our StreamingT2V method by taking Modelscope (MS) [39] as a basis and turning it into an autoregressive model suitable for long video generation with high motion dynamics and consistency.
Figure 3. Method overview: StreamingT2V extends a video diffusion model (VDM) by the conditional attention module (CAM) as short-
term memory, and the appearance preservation module (APM) as long-term memory. CAM conditions the VDM on the previous chunk
using a frame encoder Econd . The attentional mechanism of CAM leads to smooth transitions between chunks and videos with high motion
amount at the same time. APM extracts high-level image features from an anchor frame and injects them into the text cross-attentions of the
VDM. APM helps to preserve object/scene features across the autoregressive video generation.
4.1. Conditional Attention Module

To train a conditional network for our Streaming T2V stage, we leverage the pre-trained power of a text-to-video model (e.g. Modelscope [39]) as a prior for long video generation in an autoregressive manner. In the following, we refer to this pre-trained text-to-(short)video model as Video-LDM. To autoregressively condition Video-LDM on short-term information from the previous chunk (see Fig. 2, mid), we propose the Conditional Attention Module (CAM), which consists of a feature extractor and a feature injector into the Video-LDM UNet, inspired by ControlNet [46]. The feature extractor utilizes a frame-wise image encoder E_cond, followed by the same encoder layers that the Video-LDM UNet uses up to its middle layer (and initialized with the UNet's weights). For the feature injection, we let each long-range skip connection in the UNet attend to corresponding features generated by CAM via cross-attention.

Let x denote the output of E_cond after zero-convolution. We use addition to fuse x with the output of the first temporal transformer block of CAM. For the injection of CAM's features into the Video-LDM UNet, we consider the UNet's skip-connection features x_SC ∈ R^{b×F×h×w×c} (see Fig. 3) with batch size b. We apply spatio-temporal group norm and a linear projection P_in on x_SC. Let x'_SC ∈ R^{(b·w·h)×F×c} be the resulting tensor after reshaping. We condition x'_SC on the corresponding CAM feature x_CAM ∈ R^{(b·w·h)×F_cond×c} (see Fig. 3), where F_cond is the number of conditioning frames, via temporal multi-head attention (T-MHA) [36], i.e. independently for each spatial position (and batch). Using learnable linear maps P_Q, P_K, P_V for queries, keys, and values, we apply T-MHA using keys and values from x_CAM and queries from
x'_SC, i.e.

$$x''_{SC} = \text{T-MHA}\big(Q = P_Q(x'_{SC}),\ K = P_K(x_{CAM}),\ V = P_V(x_{CAM})\big). \qquad (4)$$

Finally, we use a linear projection P_out. Using a suitable reshaping operation R, the output of CAM is added to the skip connection (as in ControlNet [46]):

$$x'''_{SC} = x_{SC} + R\big(P_{out}(x''_{SC})\big), \qquad (5)$$

so that x'''_{SC} is used in the decoder layers of the UNet. The projection P_out is zero-initialized, so that when training starts, CAM does not affect the base model's output, which improves training convergence.

CAM utilizes the last F_cond conditional frames of the previous chunk as input. The cross-attention enables conditioning the F frames of the base model on CAM. In contrast, the sparse encoder [12] uses convolution for the feature injection, and thus needs additional F − F_cond zero-valued frames (and a mask) as input in order to add its output to the F frames of the base model. This poses an inconsistency in the input for SparseCtrl, leading to severe inconsistencies in the generated videos (see Sect. 5.3 and Sect. 5.4).
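For concreteness, the following is a minimal PyTorch sketch of the attentional feature injection in Eq. (4)-(5); the module name, the use of nn.MultiheadAttention (whose internal projections play the role of P_Q, P_K, P_V), and the LayerNorm stand-in for the spatio-temporal group norm are illustrative assumptions, not our actual implementation.

```python
import torch
import torch.nn as nn

class TemporalCAMInjection(nn.Module):
    """Sketch of Eq. (4)-(5): temporal cross-attention from the UNet skip-connection
    features (queries) to the CAM features (keys/values), applied per spatial position."""
    def __init__(self, c: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(c)          # stand-in for the spatio-temporal group norm
        self.p_in = nn.Linear(c, c)          # P_in
        self.attn = nn.MultiheadAttention(c, n_heads, batch_first=True)  # T-MHA (holds P_Q, P_K, P_V)
        self.p_out = nn.Linear(c, c)         # P_out, zero-initialized
        nn.init.zeros_(self.p_out.weight)
        nn.init.zeros_(self.p_out.bias)

    def forward(self, x_sc: torch.Tensor, x_cam: torch.Tensor) -> torch.Tensor:
        # x_sc:  (b, F, h, w, c)   UNet skip-connection features
        # x_cam: (b*h*w, F_cond, c) CAM features of the conditioning frames
        b, F, h, w, c = x_sc.shape
        x = x_sc.permute(0, 2, 3, 1, 4).reshape(b * h * w, F, c)   # one sequence per spatial position
        q = self.p_in(self.norm(x))                                # x'_SC
        attn_out, _ = self.attn(q, x_cam, x_cam)                   # Eq. (4)
        out = self.p_out(attn_out)                                 # zero-init => no effect at start
        out = out.reshape(b, h, w, F, c).permute(0, 3, 1, 2, 4)    # R: reshape back
        return x_sc + out                                          # Eq. (5)

# toy usage: b=1, F=16, F_cond=8, 32x32 latent resolution, c=320
inj = TemporalCAMInjection(c=320)
y = inj(torch.randn(1, 16, 32, 32, 320), torch.randn(32 * 32, 8, 320))
```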
4.2. Appearance Preservation Module

Autoregressive video generators typically suffer from forgetting initial object and scene features, leading to severe appearance changes. To tackle this issue, we incorporate long-term memory by leveraging the information contained in a fixed anchor frame of the very first chunk using our proposed Appearance Preservation Module (APM). This helps to maintain scene and object features across video chunk generations (see Fig. 6).

To enable APM to balance the guidance by the anchor frame with the guidance by the text instructions, we propose (see Figure 3): (i) We mix the CLIP [25] image token of the anchor frame with the CLIP text tokens from the textual instruction by expanding the CLIP image token to k = 8 tokens using a linear layer, concatenating the text and image encodings along the token dimension, and using a projection block, leading to x_mixed ∈ R^{b×77×1024}; (ii) We introduce for each cross-attention layer l a weight α_l ∈ R (initialized as 0) to perform cross-attention using keys and values coming from a weighted sum of x_mixed and the usual CLIP text encoding x_text of the text instructions:

$$x_{cross} = \text{SiLU}(\alpha_l)\, x_{mixed} + x_{text}. \qquad (6)$$

The experiments in Section 5.3 show that the light-weight APM module helps to keep scene and identity features across the autoregressive process (see Fig. 6).
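A minimal PyTorch sketch of the APM token mixing and of Eq. (6) is given below; the module name and the simple 1D-convolution projection block are illustrative assumptions (the exact projection block differs, cf. Tab. 9 in the appendix), not our actual implementation.

```python
import torch
import torch.nn as nn

class APMMixer(nn.Module):
    """Sketch of the APM token mixing and Eq. (6)."""
    def __init__(self, dim: int = 1024, k: int = 8, n_text_tokens: int = 77):
        super().__init__()
        self.k = k
        self.expand = nn.Linear(dim, k * dim)                    # 1 CLIP image token -> k tokens
        # illustrative projection block: map (77 + k) tokens back to 77 tokens
        self.proj = nn.Conv1d(n_text_tokens + k, n_text_tokens, kernel_size=1)
        self.alpha = nn.Parameter(torch.zeros(1))                # alpha_l (one per cross-attention layer l)

    def forward(self, clip_image_token: torch.Tensor, x_text: torch.Tensor) -> torch.Tensor:
        # clip_image_token: (b, 1, dim)  CLIP image embedding of the anchor frame
        # x_text:           (b, 77, dim) CLIP text encoding of the prompt
        b, _, d = clip_image_token.shape
        img_tokens = self.expand(clip_image_token).reshape(b, self.k, d)   # (b, k, dim)
        x_mixed = self.proj(torch.cat([x_text, img_tokens], dim=1))        # (b, 77, dim)
        # Eq. (6): since alpha_l = 0 at init, SiLU(alpha_l) = 0 and APM starts with no effect
        return torch.nn.functional.silu(self.alpha) * x_mixed + x_text

mixer = APMMixer()
x_cross = mixer(torch.randn(2, 1, 1024), torch.randn(2, 77, 1024))   # keys/values for cross-attention
```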
4.3. Auto-regressive Video Enhancement

To further improve the quality and resolution of our text-to-video results, we utilize a high-resolution (1280x720) text-to-(short)video model (Refiner Video-LDM, see Fig. 3) to autoregressively enhance 24-frame chunks of generated videos. Using a text-to-video model as a refiner/enhancer of 24-frame chunks is done by adding a substantial amount of noise to the input video chunk and denoising it with the text-to-video diffusion model (SDEdit [22] approach). More precisely, we take a high-resolution text-to-video model (for example MS-Vid2Vid-XL [40, 48]) and a low-resolution video chunk of 24 frames, which is first bilinearly upscaled [2] to the target high resolution. Then we encode the frames using the image encoder E so that we obtain a latent code x_0. After that we apply T' < T forward diffusion steps (see Eq. (1)) so that x_{T'} still contains signal information (mostly about the video structure), and denoise it using the high-resolution video diffusion model.

However, the naïve approach of independently enhancing each chunk leads to inconsistent transitions (see Fig. 4 (a)). We tackle this issue by using shared noise between consecutive chunks and leveraging our randomized blending approach. Given our low-resolution long video, we split it into m chunks V_1, . . . , V_m of F = 24 frame-length such that each two consecutive chunks have an overlap of O = 8 frames. For the backward diffusion at step t, starting from T', we must sample noise to perform one denoising step (see Eq. 2). We start with the first chunk V_1 and sample noise ϵ_1 ∼ N(0, I) with ϵ_1 ∈ R^{F×h×w×c}. For each subsequent chunk V_i, i > 1, we sample noise ϵ̂_i ∼ N(0, I) with ϵ̂_i ∈ R^{(F−O)×h×w×c} and concatenate it along the frame dimension with the noise ϵ_{i−1}^{F−O:F} that was sampled for the O overlapping frames of the previous chunk, i.e.

$$\epsilon_i := \text{concat}\big([\epsilon_{i-1}^{F-O:F}, \hat{\epsilon}_i],\ \dim = 0\big), \quad \text{for all } i = 2, \ldots, m, \qquad (7)$$

so that we obtain shared noise for overlapping frames. We perform one denoising step using ϵ_i and obtain for chunk V_i the latent code x_{t−1}(i). Yet, this approach is not sufficient to remove transition misalignment (see Fig. 4 (b)).

To improve consistency significantly, we propose the randomized blending approach. Consider the latent codes x_{t−1}^{F−O:F}(i − 1) and x_{t−1}^{1:O}(i) of two consecutive chunks V_{i−1}, V_i at denoising step t − 1. The latent code x_{t−1}(i − 1) of chunk V_{i−1} possesses a smooth transition from its first frames to the overlapping frames, while the latent code x_{t−1}(i) possesses a smooth transition from the overlapping frames to the subsequent frames. Thus, we combine the two latent codes via concatenation, by randomly sampling a frame index f_thr from {0, . . . , O} and then taking from x_{t−1}^{F−O:F}(i − 1) the latent code of the first f_thr frames and from x_{t−1}^{1:O}(i) the latent code of the frames starting from f_thr + 1. Then, we update the latent code of the entire long
video x_{t−1} on the overlapping frames and perform the next denoising step. Accordingly, for a frame f ∈ {1, . . . , O} of the overlap and diffusion step t, the latent code of chunk V_{i−1} is used with probability 1 − f/(O + 1).

By using a probabilistic mixture of the latents in an overlapping region, we successfully diminish inconsistencies between chunks (see Fig. 4 (c)).
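The following is a minimal PyTorch sketch of the shared-noise construction in Eq. (7) and of the randomized blending of the overlapping latent frames; function names and toy shapes are illustrative assumptions, and the denoising step itself is omitted.

```python
import torch

def shared_noise(eps_prev: torch.Tensor, F: int = 24, O: int = 8) -> torch.Tensor:
    """Eq. (7): reuse the noise of the last O frames of the previous chunk and
    sample fresh noise for the remaining F - O frames."""
    eps_hat = torch.randn(F - O, *eps_prev.shape[1:])
    return torch.cat([eps_prev[F - O:F], eps_hat], dim=0)

def randomized_blend(x_prev: torch.Tensor, x_cur: torch.Tensor, O: int = 8) -> torch.Tensor:
    """Randomized blending of the O overlapping latent frames of chunks i-1 and i,
    both of shape (F, h, w, c). A random threshold f_thr ~ U{0, ..., O} decides how
    many overlap frames come from the previous chunk; the rest come from the current one."""
    F = x_prev.shape[0]
    f_thr = torch.randint(0, O + 1, ()).item()
    return torch.cat([x_prev[F - O: F - O + f_thr],   # first f_thr overlap frames from chunk i-1
                      x_cur[f_thr:O]], dim=0)          # frames f_thr+1, ..., O from chunk i

# toy usage for one denoising step (h = w = 16, c = 4 latent channels)
eps_1 = torch.randn(24, 16, 16, 4)
eps_2 = shared_noise(eps_1)                 # shared noise for chunk 2
x1 = torch.randn(24, 16, 16, 4)             # x_{t-1}(1), e.g. produced by the refiner
x2 = torch.randn(24, 16, 16, 4)             # x_{t-1}(2)
overlap = randomized_blend(x1, x2)          # blended overlap, written back into both chunks
```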
5. Experiments

5.1. Implementation Details

We generate F = 16 frames and condition on F_cond = 8 frames. Training is conducted on a dataset collected from publicly available sources. We sample 16 frames (during CAM training) and 32 frames (during CAM+APM training) at 3 FPS and 256x256 resolution. First, we freeze the weights of the pre-trained Video-LDM and train the new layers of CAM with batch size 8 and learning rate 5 · 10^{-5} for 400K steps. We then continue to train CAM+APM. We randomly sample an anchor frame out of the first 16 frames. For the conditioning and denoising, we use the first 8 and 16 frames, starting from frame 17, respectively. This aligns training with inference, where there is a large time gap between the conditional frames and the anchor frame. In addition, by randomly sampling an anchor frame, the model can leverage the CLIP information only for the extraction of high-level semantic information, as we do not provide a frame index to the model. We freeze the CLIP encoder and the temporal layers of the main branch, and train the remaining layers for 1K steps.

The image encoder E_cond used in CAM is composed of stacked 2D convolutions, layer norms and SiLU activations. For the video enhancer, we diffuse an input video using T' = 600 steps. Additional implementation details of StreamingT2V are provided in the appendix.

5.2. Metrics

For quantitative evaluation we employ metrics that measure temporal consistency, text alignment, and per-frame quality of our method.

For temporal consistency, we introduce SCuts, which counts the number of detected scene cuts in a video, using the AdaptiveDetector algorithm of the PySceneDetect [1] package with default parameters. In addition, we propose a new metric called motion aware warp error (MAWE), which coherently assesses motion amount and warp error, and yields a low value when the video exhibits both consistency and a substantial amount of motion. To this end, we measure the motion amount using OFS (optical flow score), which computes for a video the mean magnitude of all optical flow vectors between any two consecutive frames. Furthermore, for a video V, we consider the mean warp error [19] W(V), which measures the average squared L2 pixel distance from a frame to its warped subsequent frame, excluding occluded regions. Finally, MAWE is defined as:

$$\text{MAWE}(\mathcal{V}) := \frac{W(\mathcal{V})}{c \cdot \text{OFS}(\mathcal{V})}, \qquad (8)$$

where c aligns the different scales of the two metrics. To this end, we perform a regression analysis on a subset of our dataset validation videos and obtain c = 9.5 (details on the derivation of c are provided in the appendix). MAWE requires a high motion amount and a low warp error for a low metric value. For the metrics involving optical flow, computations are conducted by resizing all videos to 720×720 resolution.

For video textual alignment, we employ the CLIP [25] text-image similarity score (CLIP), which is applied to all frames of a video. CLIP computes for a video sequence the cosine similarity between the CLIP text encoding and the CLIP image encodings.

For per-frame quality we incorporate the Aesthetic Score [31], computed on top of CLIP image embeddings of all frames of a video.

All metrics are computed per video first and then averaged over all videos; all videos are generated with 80 frames for quantitative analysis.
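For illustration, a simplified sketch of OFS, the warp error, MAWE (Eq. (8)) and SCuts is given below, using OpenCV's Farneback optical flow and PySceneDetect; it omits the occlusion masking of [19] and the exact cut accounting, so it is not the evaluation code used in the paper.

```python
import cv2
import numpy as np

def ofs(frames):
    """OFS: mean optical-flow magnitude over consecutive grayscale uint8 frames."""
    mags = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=-1).mean())
    return float(np.mean(mags))

def warp_error(frames):
    """Simplified warp error: mean squared distance between a frame and its
    backward-warped successor (occlusion masking of [19] omitted here)."""
    errs = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        h, w = prev.shape
        gx, gy = np.meshgrid(np.arange(w), np.arange(h))
        map_x = (gx + flow[..., 0]).astype(np.float32)
        map_y = (gy + flow[..., 1]).astype(np.float32)
        warped = cv2.remap(nxt, map_x, map_y, cv2.INTER_LINEAR)
        errs.append(((warped.astype(np.float32) - prev.astype(np.float32)) ** 2).mean())
    return float(np.mean(errs))

def mawe(frames, c: float = 9.5) -> float:
    """Eq. (8): motion aware warp error."""
    return warp_error(frames) / (c * ofs(frames))

def scuts(video_path: str) -> int:
    """SCuts sketch: scene cuts reported by PySceneDetect's AdaptiveDetector
    (the exact counting convention may differ from the paper's)."""
    from scenedetect import detect, AdaptiveDetector
    return max(len(detect(video_path, AdaptiveDetector())) - 1, 0)
```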
5.3. Ablation Study

To assess our proposed components, we perform ablation studies on a randomly sampled set of 75 prompts from our validation set. We compare CAM against established conditioning approaches, analyse the impact of our long-term memory APM, and ablate our modifications for the video enhancer.

CAM for Conditioning. To analyse the importance of CAM, we compare CAM (w/o APM) with two baselines: (i) We connect the features of CAM with the skip-connection of the UNet via zero convolution, followed by addition. We zero-pad the condition frames and concatenate them with a frame-indicating mask to form the input for the modified CAM, which we denote as Add-Cond. (ii) We append the conditional frames and a frame-indicating mask to the input of Video-LDM's UNet along the channel dimension, but do not use CAM, which we denote as Conc-Cond. We train our method with CAM and the baselines on the same dataset. Architectural details (including training) of these baselines are provided in the appendix.

We obtain an SCuts score of 0.24, 0.284 and 0.03 for Conc-Cond, Add-Cond and Ours (w/o APM), respectively. This shows that the inconsistencies in the input caused by the masking lead to frequent inconsistencies in the generated videos and that concatenation to the UNet's input is too weak a conditioning. In contrast, our CAM generates consistent videos with an SCuts score that is 88% lower than the baselines.
X-T slice and video panels: (a) Naive Concatenation, (b) Shared Noise.
Figure 4. Ablation study on our video enhancer improvements. The X-T slice visualization shows that randomized blending leads to
smooth chunk transitions, while both baselines have clearly visible, severe inconsistencies between chunks.
Figure 5. Visual comparison of DynamiCrafter-XL and StreamingT2V. Both text-to-video results are generated using the same prompt.
The X-T slice visualization shows that DynamiCrafter-XL suffers from severe chunk inconsistencies and repetitive motions. In contrast,
our method shows seamless transitions and evolving content.
Long-Term Memory. We analyse the impact of utilizing long-term memory in the context of long video generation.

Fig. 6 shows that long-term memory greatly helps to keep the object and scene features across autoregressive generations. This is also supported quantitatively. We obtain a person re-identification score (definition in the appendix) of 93.42 and 94.95 for Ours w/o APM and Ours, respectively. Our APM module thus improves the identity/appearance preservation. Also the scene information is better kept, as we observe an image distance score in terms of LPIPS [47] of 0.192 and 0.151 for Ours w/o APM and Ours, respectively. We thus obtain an improvement in terms of scene preservation of more than 20% when APM is used.

Randomized Blending for Video Enhancement. We assess our randomized blending approach by comparing against two baselines: (B) enhances each video chunk independently, and (B+S) uses shared noise for consecutive chunks, with an overlap of 8 frames, but no randomized blending. We compute per sequence the standard deviation of the optical flow magnitudes between consecutive frames and average over all frames and sequences, which indicates temporal smoothness. We obtain the scores 8.72, 6.01 and 3.32 for B, B+S, and StreamingT2V, respectively. Thus, noise sharing improves chunk consistency (by 31% vs B), but it is significantly further improved by randomized blending (by 62% vs B). Also the visual results in Fig. 4 confirm this improvement.
(a) Young caucasian female couple drinking cocktails and smiling on terrace in havana, cuba. girls, teens, teenagers, women
Figure 6. Top row: CAM+APM, Bottom row: CAM. The figure shows that using long-term information via APM helps to keep identities
(e.g. the face of the left woman) and scene features, e.g. the dresses or arm clock.
Row labels, top to bottom: Ours, DynamiCrafter-XL [43], FreeNoise [24], SparseCtrl [12], SVD [4], SEINE [7], I2VGen-XL [48].
(a) Fishes swimming in ocean camera moving, cinematic (b) Close flyover over a large wheat field in the early
morning sunlight, cinematic
Figure 7. Visual comparisons of StreamingT2V with state-of-the-art methods on 80 frame-length, autoregressively generated videos. In
contrast to other methods, StreamingT2V generates long videos without suffering from motion stagnation.
Table 8. Quantitative comparison to state-of-the-art open-source text-to-long-video generators. Best performing metrics are highlighted in red, second best in blue. We also tested OpenAI's Sora with the 48 sample videos released on their website. We list the numbers here to give readers a sense of how Sora performs on these metrics, but please be advised that this test differs from the other open-source models in terms of both the test set and the prompts in this table; hence it is shown in gray and not ranked.
6. Conclusion

We have shown that current approaches for long video generation produce videos with temporal inconsistencies, or severe stagnation up to standstill. To overcome these limitations, we carefully analysed an autoregressive pipeline built on top of a vanilla video diffusion model and proposed StreamingT2V, which
utilizes novel short- and long-term dependency blocks for seamless continuation of video chunks with a high motion amount, while preserving high-level scene and object features throughout the generation process. We proposed a randomized blending approach that enables the use of a video enhancer within the autoregressive process without temporal inconsistencies. Experiments showed that StreamingT2V leads to significantly higher motion amount and temporal consistency compared with all competitors. StreamingT2V enables long video generation from text without stagnating content.

References

[1] PySceneDetect. https://www.scenedetect.com/. Accessed: 2024-03-03.
[2] Isaac Amidror. Scattered data interpolation methods for electronic imaging systems: a survey. Journal of Electronic Imaging, 11(2):157–176, 2002.
[3] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In IEEE International Conference on Computer Vision, 2021.
[4] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
[5] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
[6] Xi Chen, Zhiheng Liu, Mengting Chen, Yutong Feng, Yu Liu, Yujun Shen, and Hengshuang Zhao. Livephoto: Real image animation with text-guided motion control. arXiv preprint arXiv:2312.02928, 2023.
[7] Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffusion model for generative transition and prediction. In The Twelfth International Conference on Learning Representations, 2023.
[8] Zuozhuo Dai, Zhenghao Zhang, Yao Yao, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang. Animateanything: Fine-grained open domain image animation with motion guidance, 2023.
[9] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021.
[10] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023.
[11] Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023.
[12] Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Sparsectrl: Adding sparse controls to text-to-video diffusion models. arXiv preprint arXiv:2311.16933, 2023.
[13] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In The Twelfth International Conference on Learning Representations, 2023.
[14] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
[15] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[16] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
[17] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022.
[18] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15954–15964, 2023.
[19] Wei-Sheng Lai, Jia-Bin Huang, Oliver Wang, Eli Shechtman, Ersin Yumer, and Ming-Hsuan Yang. Learning blind video temporal consistency. In Proceedings of the European Conference on Computer Vision (ECCV), pages 170–185, 2018.
[20] Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, and Jingdong Wang. Videogen: A reference-guided latent diffusion approach for high definition text-to-video generation. arXiv preprint arXiv:2309.00398, 2023.
[21] Fuchen Long, Zhaofan Qiu, Ting Yao, and Tao Mei. Videodrafter: Content-consistent multi-scene video generation with llm. arXiv preprint arXiv:2401.01256, 2024.
[22] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022.
[23] Gyeongrok Oh, Jaehwan Jeong, Sieun Kim, Wonmin Byeon, Jinkyu Kim, Sungwoong Kim, Hyeokmin Kwon, and Sangpil Kim. Mtvg: Multi-text video generation with text-to-video models. arXiv preprint arXiv:2312.04086, 2023.
[24] Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, and Ziwei Liu. Freenoise: Tuning-free longer video diffusion via noise rescheduling. In The Twelfth International Conference on Learning Representations, 2024.
[25] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[26] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[27] Weiming Ren, Harry Yang, Ge Zhang, Cong Wei, Xinrun Du, Stephen Huang, and Wenhu Chen. Consisti2v: Enhancing visual consistency for image-to-video generation. arXiv preprint arXiv:2402.04324, 2024.
[28] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[29] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
[30] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[31] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
[32] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. In The Eleventh International Conference on Learning Representations, 2022.
[33] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
[34] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2020.
[35] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017.
[36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[37] Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual descriptions. In International Conference on Learning Representations, 2022.
[38] Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, and Hongsheng Li. Gen-l-video: Multi-text to long video generation via temporal co-denoising. arXiv preprint arXiv:2305.18264, 2023.
[39] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023.
[40] Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. Videocomposer: Compositional video synthesis with motion controllability. Advances in Neural Information Processing Systems, 36, 2024.
[41] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023.
[42] Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jianmin Bao, Yuhui Yuan, Chong Luo, Yueyi Zhang, and Zhiwei Xiong. Art•v: Auto-regressive text-to-video generation with diffusion models. arXiv preprint arXiv:2311.18834, 2023.
[43] Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Xintao Wang, Tien-Tsin Wong, and Ying Shan. Dynamicrafter: Animating open-domain images with video diffusion priors. arXiv preprint arXiv:2310.12190, 2023.
[44] Yan Zeng, Guoqiang Wei, Jiani Zheng, Jiaxin Zou, Yang Wei, Yuchen Zhang, and Hang Li. Make pixels dance: High-dynamic video generation. arXiv preprint arXiv:2311.10982, 2023.
[45] David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023.
[46] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
[47] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[48] Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qing, Xiang Wang, Deli Zhao, and Jingren Zhou. I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models. 2023.
Appendix

This appendix complements our main paper with further experiments, in which we investigate the text-to-video generation quality of StreamingT2V for even longer sequences than those assessed in the main paper, and it contains additional information on the implementation of StreamingT2V and the experiments carried out.

In Sec. 7, a user study is conducted on the test set, in which all text-to-video methods under consideration are evaluated by humans to determine the user preferences.

Sec. 8 supplements our main paper with additional qualitative results of StreamingT2V, qualitative comparisons of StreamingT2V with competing methods, and ablation studies that show the effectiveness of APM and randomized blending.

In Sec. 9, hyperparameters used in StreamingT2V and implementation details of our ablated models are provided.

Finally, Sec. 10 complements our main paper with additional information on the metrics employed in our quantitative experiments, and it provides the prompts that compose our test set.

7. User Study

8. Very Long Video Generation

Supplementing our main paper, we study the generation quality of our text-to-video method StreamingT2V and the competing methods for creating even longer videos. To this end, we generate 240-frame, and even 600-frame, videos.

Within this setting, we present qualitative results of StreamingT2V, qualitative and quantitative comparisons to existing methods, and ablation studies on our APM module and the randomized blending approach.

8.1. Qualitative Results

Fig. 10 and Fig. 11 show text-to-video results of StreamingT2V for different actions, e.g. dancing, reading, camera moving, or eating, and different characters like a tiger or Santa Claus. We can observe that scene and object features are kept across each video generation (see e.g. Fig. 10 (e) and Fig. 11 (d)), thanks to our proposed APM module. Our proposed CAM module ensures that generated videos are temporally smooth, with seamless transitions between video chunks, and not stagnating (see e.g. Fig. 10 (f)).
8.2. Comparison with Existing Methods

The closest competitor on CLIP score is FreeNoise, which achieves a similar but slightly worse CLIP score. However, our shown qualitative results have revealed that FreeNoise is heavily prone to video stagnation, so that the quality does not get worse, but the video content is also nearly not changed.

Fig. 9 also shows that StreamingT2V seamlessly connects video chunks and creates temporally consistent videos, thanks to our effective CAM module, which uses its temporal attentional mechanism to condition a new video generation on the previous chunk. The MAWE score of StreamingT2V is hardly influenced by longer video generation, thus leading to the best MAWE score among all considered methods, which is more than 50% better than the best competing method SEINE. The experiment also shows that the competing methods SVD, I2VGen-XL, SparseControl and FreeNoise are unstable. With more frames being generated, their MAWE score becomes worse. These methods are thus prone to increased temporal inconsistencies or video stagnation. This again shows that their conditioning mechanism is too weak (e.g., the concatenation of the conditional frame to the input and to the cross attention layers for SVD).

Finally, the SCuts score shows that StreamingT2V leads to the fewest scene cuts and thus the best transitions. The advantage over SparseControl, DynamiCrafter-XL and SVD is significant, especially with more frames (e.g. the MAWE score is up to 77 times lower for StreamingT2V).

Our study on long video generation has thus shown that, thanks to our CAM and APM modules, StreamingT2V is best suited for long video generation. All other competing methods heavily deteriorate in video quality when applied to long video generation.

8.3. Ablation Studies

We present additional qualitative results for the ablations on our APM module and randomized blending.

Effectiveness of APM. We complement our ablation study on APM (see Sec. 5.3 of the main paper) with additional qualitative results in Fig. 12. Thanks to the usage of long-term information via our proposed APM module, identity and scene features are preserved throughout the video. For instance, the face of the woman (including all its tiny details) is consistent across the video generation. Also, the style of the jacket and the bag are correctly generated throughout the video. Without having access to a long-term memory, these object and scene features change over time.

Randomized Blending. Fig. 13 and Fig. 14 show additional ablated results on our randomized blending approach. From the X-T slice visualizations we can see that the randomized blending leads to smooth chunk transitions. In contrast, when naively concatenating enhanced video chunks, or using shared noise, the resulting videos possess visible inconsistencies between chunks.

9. Implementation details

We provide additional implementation details for StreamingT2V and our ablated models.

9.1. Streaming T2V Stage

For the StreamingT2V stage, we use classifier-free guidance [10, 14] from text and the anchor frame. More precisely, let ϵ_θ(x_t, t, τ, a) denote the noise prediction in the StreamingT2V stage for latent code x_t at diffusion step t, text τ and anchor frame a. For text guidance and guidance by the anchor frame, we introduce weights ω_text and ω_anchor, respectively. Let τ_null and a_null denote the empty string, and the image with all pixel values set to zero, respectively. Then, we obtain the multi-conditioned classifier-free-guided noise prediction ϵ̂_θ (similar to DynamiCrafter-XL [43]) from the noise predictor ϵ_θ via

$$\hat{\epsilon}_\theta(x_t, t, \tau, a) = \epsilon_\theta(x_t, t, \tau_{null}, a_{null}) + \omega_{text}\big(\epsilon_\theta(x_t, t, \tau, a_{null}) - \epsilon_\theta(x_t, t, \tau_{null}, a_{null})\big) + \omega_{anchor}\big(\epsilon_\theta(x_t, t, \tau, a) - \epsilon_\theta(x_t, t, \tau, a_{null})\big). \qquad (9)$$

We then use ϵ̂_θ for denoising. In our experiments, we set ω_text = ω_anchor = 7.5. During training, we randomly replace τ with τ_null with 5% likelihood, the anchor frame a with a_null with 5% likelihood, and we replace at the same time τ with τ_null and a with a_null with 5% likelihood.

Additional hyperparameters for the architecture, training and inference of the Streaming T2V stage are presented in Tab. 9, where Per-Pixel Temporal Attention refers to the attention module used in CAM (see Fig. 2 of the main paper).
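A minimal sketch of the multi-conditioned classifier-free guidance in Eq. (9) is given below; the noise predictor is passed in as a callable and all argument names are illustrative.

```python
import torch

def multi_cond_cfg(eps_theta, x_t, t, text_emb, null_text_emb, anchor, null_anchor,
                   w_text: float = 7.5, w_anchor: float = 7.5) -> torch.Tensor:
    """Eq. (9): multi-conditioned classifier-free guidance from text and anchor frame.
    `eps_theta` is assumed to be a callable noise predictor eps_theta(x_t, t, text, anchor)."""
    e_uncond = eps_theta(x_t, t, null_text_emb, null_anchor)   # eps(x_t, t, tau_null, a_null)
    e_text = eps_theta(x_t, t, text_emb, null_anchor)          # eps(x_t, t, tau,      a_null)
    e_full = eps_theta(x_t, t, text_emb, anchor)               # eps(x_t, t, tau,      a)
    return e_uncond + w_text * (e_text - e_uncond) + w_anchor * (e_full - e_text)

# During training, the text and anchor conditionings are each dropped with 5% probability
# (and jointly dropped with another 5%), e.g.:
#   if torch.rand(()) < 0.05: text_emb = null_text_emb
```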
9.2. Ablation models

In Sec. 5.3 of the main paper, we consider two baselines that we compare with CAM. Here we provide additional implementation details.

The ablated model Add-Cond applies a zero-convolution to the features of CAM (i.e. the outputs of the encoder and middle layer of the ControlNet part in Fig. 2), and uses addition to fuse them with the features of the skip-connection of the UNet (similar to ControlNet [46]) (see Fig. 15). We provide here additional details to construct this model. Given a video sample V ∈ R^{F×H×W×3} with F = 16 frames, we construct a mask M ∈ {0, 1}^{F×H×W×3} that indicates which frames are used for conditioning, i.e. M^f[i, j, k] = M^f[i', j', k'] for all frames f = 1, . . . , F and for all i, j, k, i', j', k'. We require that exactly F − F_cond frames are masked, i.e.

$$\sum_{f=1}^{F} M^f[i, j, k] = F - F_{cond}, \quad \text{for all } i, j, k. \qquad (10)$$
We concatenate [V ⊙ M, M] along the channel dimension and use it as input for the image encoder E_cond, where ⊙ denotes element-wise multiplication.

During training, we randomly set the mask M. During inference, we set the mask for the first 8 frames to zero, and for the last 8 frames to one, so that the model conditions on the last 8 frames of the previous chunk.

For the ablated model Conc-Cond used in Sec. 5.3, we start from our Video-LDM's UNet, and modify its first convolution. Like for Add-Cond, we consider a video V of length F = 16 and a mask M that encodes which frames are overwritten by zeros. Now the UNet takes [z_t, E(V) ⊙ M, M] as input, where we concatenate along the channel dimension. As with Add-Cond, we randomly set M during training so that the information of 8 frames is used, while during inference, we set it such that the last 8 frames of the previous chunk are used. Here E denotes the VQ-GAN encoder (see Sec. 3.1 of the main paper).
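For illustration, the following sketch constructs the conditioning inputs of the two ablated models; the mask convention used here (M = 1 on the frames the model conditions on) and all names are assumptions made for this sketch rather than the exact implementation.

```python
import torch

def build_add_cond_input(video: torch.Tensor, f_cond: int = 8) -> torch.Tensor:
    """Sketch of the Add-Cond conditioning input [V * M, M] at inference:
    the last f_cond frames are kept, the remaining frames are zeroed out.
    video: (F, H, W, 3) pixel frames; returns (F, H, W, 4)."""
    F, H, W, C = video.shape
    mask = torch.zeros(F, H, W, 1)
    mask[F - f_cond:] = 1.0                              # condition on the last f_cond frames
    return torch.cat([video * mask, mask], dim=-1)

def build_conc_cond_input(z_t: torch.Tensor, enc_video: torch.Tensor, f_cond: int = 8) -> torch.Tensor:
    """Sketch of the Conc-Cond UNet input [z_t, E(V) * M, M] at inference.
    z_t and enc_video: (F, h, w, c) latent tensors."""
    F, h, w, c = z_t.shape
    mask = torch.zeros(F, h, w, 1)
    mask[F - f_cond:] = 1.0
    return torch.cat([z_t, enc_video * mask, mask], dim=-1)
```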
Table 9. Hyperparameters of the Streaming T2V Stage. Additional architectural hyperparameters are provided by the Modelscope report [39].

Per-Pixel Temporal Attention
  Sequence length Q: 16
  Sequence length K, V: 8
  Token dimensions: 320, 640, 1280
Appearance Preservation Module
  CLIP image embedding dim: 1024
  CLIP image embedding tokens: 1
  MLP hidden layers: 1
  MLP inner dim: 1280
  MLP output tokens: 16
  MLP output dim: 1024
  1D conv input tokens: 93
  1D conv output tokens: 77
  1D conv output dim: 1024
  Cross attention sequence length: 77
Training
  Parametrization: ϵ
Diffusion Setup
  Diffusion steps: 1000
  Noise scheduler: Linear
  β0: 0.0085
  βT: 0.0120
Sampling Parameters
  Sampler: DDIM
  Steps: 50
  η: 1.0
  ω_text: 7.5
  ω_anchor: 7.5

10. Experiment details

10.1. Metric details

In Sec. 5.3 of the main paper, we employ a re-identification score to assess the feature preservation quality of our APM module in the long-term memory experiment.

To obtain the score, let P_n = {p_i^n} be all face patches extracted from frame n using an off-the-shelf head detector [30], and let F_i^n be the corresponding facial feature of p_i^n, which we obtain from an off-the-shelf face recognition network [30]. Then, for frame n, with n_1 := |P_n| and n_2 := |P_{n+1}|, we define the re-id score re-id(n) for frame n as

$$\text{re-id}(n) := \begin{cases} \max_{i,j} \cos\Theta(F_i^n, F_j^{n+1}), & n_1 > 0 \text{ and } n_2 > 0, \\ 0, & \text{otherwise}, \end{cases} \qquad (11)$$

where cos Θ is the cosine similarity. Finally, we obtain the re-id score of a video by averaging over all frames where two consecutive frames have face detections, i.e. with m := |{n ∈ {1, .., N} : |P_n| > 0}|, we compute the weighted sum:

$$\text{re-id} := \frac{1}{m} \sum_{n=1}^{N-1} \text{re-id}(n), \qquad (12)$$

where N denotes the number of frames.
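A simplified sketch of the re-identification score in Eq. (11)-(12) is given below; it assumes the face patches have already been detected and embedded (e.g. with [30]), and all names are illustrative.

```python
import numpy as np

def reid_score(face_embeddings: list) -> float:
    """Eq. (11)-(12): `face_embeddings[n]` holds the facial feature vectors F_i^n of all
    faces detected in frame n (detection and embedding steps are omitted here)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    N = len(face_embeddings)
    m = sum(1 for feats in face_embeddings if len(feats) > 0)   # frames with detections
    if m == 0:
        return 0.0
    total = 0.0
    for n in range(N - 1):
        cur, nxt = face_embeddings[n], face_embeddings[n + 1]
        if cur and nxt:                                          # Eq. (11): best match across frames
            total += max(cos(a, b) for a in cur for b in nxt)
    return total / m                                             # Eq. (12)
```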
MAWE is based on warp errors and OF scores, which have highly dependent values. We try to counteract this effect in our MAWE score by assuming this dependency is linear, W(V) = c · OFS(V), and accounting for it in MAWE's denominator. To calculate c we randomly sample a small part of our training data with a range of optical flow scores and remove outliers applying the Z-score method. Using this dataset, c is obtained by a simple regression analysis.
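A minimal sketch of this regression is given below; the per-video (OFS, warp error) values are made up for illustration.

```python
import numpy as np

# Fit W(V) ≈ c * OFS(V) by least squares on per-video (OFS, warp error) pairs,
# after removing outliers via the Z-score method. Sample values are illustrative only.
ofs_vals = np.array([1.2, 2.5, 3.1, 4.0, 5.6, 7.3])
warp_vals = np.array([11.0, 24.1, 30.2, 37.5, 52.8, 70.1])

z = np.abs((warp_vals - warp_vals.mean()) / warp_vals.std())
keep = z < 3.0                                                        # Z-score outlier removal
c = np.linalg.lstsq(ofs_vals[keep, None], warp_vals[keep], rcond=None)[0][0]
print(f"estimated c = {c:.2f}")                                       # the paper reports c = 9.5
```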
10.2. Test prompt set

1. A camel resting on the snow field.
2. Camera following a pack of crows flying in the sky.
3. A knight riding on a horse through the countryside.
4. A gorilla eats a banana in Central Park.
5. Men walking in the rain.
6. Ants, beetles and centipede nest.
7. A squirrel on a table full of big nuts.
8. Close flyover over a large wheat field in the early morning sunlight.
9. A squirrel watches with sweet eyes into the camera.
10. Santa Claus is dancing.
11. Chemical reaction.
12. Camera moving in a wide bright ice cave, cyan.
13. Prague, Czech Republic. Heavy rain on the street.
14. Time-lapse of stormclouds during thunderstorm.
15. People dancing in room filled with fog and colorful lights.
16. Big celebration with fireworks.
17. Aerial view of a large city.
18. Wide shot of battlefield, stormtroopers running at night,
fires and smokes and explosions in background.
19. Explosion.
20. Drone flythrough of a tropical jungle with many birds.
21. A camel running on the snow field.
22. Fishes swimming in ocean camera moving.
23. A squirrel in Antarctica, on a pile of hazelnuts cinematic.
24. Fluids mixing and changing colors, closeup.
25. A horse eating grass on a lawn.
26. The fire in the car is extinguished by heavy rain.
27. Camera is zooming out and the baby starts to cry.
28. Flying through nebulas and stars.
29. A kitten resting on a ball of wool.
30. A musk ox grazing on beautiful wildflowers.
31. A hummingbird flutters among colorful flowers, its
wings beating rapidly.
32. A knight riding a horse, pointing with his lance to the
sky.
33. steampunk robot looking at the camera.
34. Drone fly to a mansion in a tropical forest.
35. Top-down footage of a dirt road in forest.
36. Camera moving closely over beautiful roses blooming
time-lapse.
37. A tiger eating raw meat on the street.
38. A beagle looking in the Louvre at a painting.
39. A beagle reading a paper.
40. A panda playing guitar on Times Square.
41. A young girl making selfies with her phone in a crowded
street.
42. Aerial: flying above a breathtaking limestone structure
on a serene and exotic island.
43. Aerial: Hovering above a picturesque mountain range on
a peaceful and idyllic island getaway.
44. A time-lapse sequence illustrating the stages of growth
in a flourishing field of corn.
45. Documenting the growth cycle of vibrant lavender flowers in a mesmerizing time-lapse.
46. Around the lively streets of Corso Como, a fearless urban rabbit hopped playfully, seemingly unfazed by the fashionable surroundings.
47. Beside the Duomo's majestic spires, a fearless falcon soared, riding the currents of air above the iconic cathedral.
48. A graceful heron stood poised near the reflecting pools
of the Duomo, adding a touch of tranquility to the vibrant
surroundings.
49. A woman with a camera in hand joyfully skipped along
the perimeter of the Duomo, capturing the essence of the
moment.
50. Beside the ancient amphitheater of Taormina, a group of friends enjoyed a leisurely picnic, taking in the breathtaking views.
Figure 8. We conduct a user study, asking humans to assess the test set results of Sect 5.4 in a one-to-one evaluation, where for any prompt
of the test set and any competing method, the results of the competing method have to be compared with the corresponding results of our
StreamingT2V method. For each comparison of our method to a competing method, we report the relative number of votes that prefer
StreamingT2V (i.e. wins), that prefer the competing method (i.e. losses), and that consider results from both methods as equal (i.e. draws).
(a) MAWE score (↓). (b) SCuts score (↓). (c) CLIP score (↑).
Figure 9. Study on how generating longer videos affects the generation quality.
Figure 10 panel prompts: (a) People dancing in room filled with fog and colorful lights; (f) Explosion.
(a) Wide shot of battlefield, stormtroopers running at night
(d) A young girl making selfies with her phone in a crowded street
Figure 12. Ablation study on the APM module. Top row is generated from StreamingT2V, bottom row is generated from StreamingT2V
w/o APM.
X-T slice and video panels: (a) Naive Concatenation, (b) Shared Noise.
Figure 13. Ablation study on different approaches for autoregressive video enhancement.
Figure 14. Ablation study on different approaches for autoregressive video enhancement.
Figure 15. Illustration of the Add-Cond baseline, which is used in Sec. 5.3 of the main paper.