Add SkyReels V2: Infinite-Length Film Generative Model #11518

Draft · wants to merge 213 commits into main
Conversation

tolgacangoz
Contributor

@tolgacangoz tolgacangoz commented May 7, 2025

Thanks for the opportunity to fix #11374!

Original Work

Original repo: https://github.com/SkyworkAI/SkyReels-V2
Paper: https://huggingface.co/papers/2504.13074

SkyReels V2's main contributions are summarized as follows:
• A comprehensive video captioner that understands shot language while capturing the general description of the video, dramatically improving prompt adherence.
• Motion-specific preference optimization that enhances motion dynamics with a semi-automatic data collection pipeline.
• An effective diffusion-forcing adaptation that enables ultra-long video generation and story generation, providing a robust framework for extending temporal coherence and narrative depth.
• SkyCaptioner-V1 and the SkyReels-V2 series of models, including diffusion-forcing, text2video, image2video, camera-director, and elements2video models in various sizes (1.3B, 5B, 14B), are open-sourced.

[Figure: main_pipeline]

TODOs:
- `FlowMatchUniPCMultistepScheduler`: just copy-pasted from the original repo
- `SkyReelsV2Transformer3DModel`: ~90% `WanTransformer3DModel`
- `SkyReelsV2DiffusionForcingPipeline`
- `SkyReelsV2DiffusionForcingImageToVideoPipeline`: includes FLF2V.
- `SkyReelsV2DiffusionForcingVideoToVideoPipeline`: extends a given video.
- `SkyReelsV2Pipeline`
- `SkyReelsV2ImageToVideoPipeline`
- `scripts/convert_skyreelsv2_to_diffusers.py`
- ⬜ Did you make sure to update the documentation with your changes?
- ⬜ Did you write any new necessary tests?

T2V with Diffusion Forcing (OLD)

Skywork/SkyReels-V2-DF-1.3B-540P
| | Original repo | diffusers integration |
| --- | --- | --- |
| seed 0, num_frames 97 | original_0_short.mp4 | diffusers_0_short.mp4 |
| seed 37, num_frames 97 | original_37_short.mp4 | diffusers_37_short.mp4 |
| seed 0, num_frames 257 | original_0_long.mp4 | diffusers_0_long.mp4 |
| seed 37, num_frames 257 | original_37_long.mp4 | diffusers_37_long.mp4 |
```python
!pip install git+https://github.com/tolgacangoz/diffusers.git@skyreels-v2 ftfy -q
import torch
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingPipeline
from diffusers.utils import export_to_video

vae = AutoencoderKLWan.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)
pipe = SkyReelsV2DiffusionForcingPipeline.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")
pipe.transformer.set_ar_attention(causal_block_size=5)

prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window."

output = pipe(
    prompt=prompt,
    num_inference_steps=30,
    height=544,
    width=960,
    num_frames=97,
    ar_step=5,  # Controls asynchronous inference (0 for synchronous mode)
    generator=torch.Generator(device="cpu").manual_seed(0),
    overlap_history=None,  # Number of frames to overlap for smooth transitions in long videos; 17 for long
    addnoise_condition=20,  # Improves consistency in long video generation
).frames[0]
export_to_video(output, "T2V.mp4", fps=24, quality=8)
```

"""
You can set `ar_step=5` to enable asynchronous inference. When asynchronous inference,
`causal_block_size=5` is recommended while it is not supposed to be set for
synchronous generation. Asynchronous inference will take more steps to diffuse the
whole sequence which means it will be SLOWER than synchronous mode. In our
experiments, asynchronous inference may improve the instruction following and visual consistent performance.
"""

I2V with Diffusion Forcing (OLD)

prompt="A penguin dances." diffusers integration
i2v-short.mp4
```python
#!pip uninstall diffusers -yq
#!pip install git+https://github.com/tolgacangoz/diffusers.git@skyreels-v2 ftfy -q
import torch
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

vae = AutoencoderKLWan.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)
pipe = SkyReelsV2DiffusionForcingImageToVideoPipeline.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")
#pipe.transformer.set_ar_attention(causal_block_size=5)

image = load_image("Penguin from https://huggingface.co/tasks/image-to-video")
prompt = "A penguin dances."

output = pipe(
    image=image,
    prompt=prompt,
    num_inference_steps=50,
    height=544,
    width=960,
    num_frames=97,
    #ar_step=5,  # Controls asynchronous inference (0 for synchronous mode)
    generator=torch.Generator(device="cpu").manual_seed(0),
    overlap_history=None,  # Number of frames to overlap for smooth transitions in long videos; 17 for long
    addnoise_condition=20,  # Improves consistency in long video generation
).frames[0]
export_to_video(output, "I2V.mp4", fps=24, quality=8)
```

"""
When I set `ar_step=5` and `causal_block_size=5`, then the results seem really bad.
"""

FLF2V with Diffusion Forcing (OLD)

Now, Houston, we have a problem.
I have been unable to produce good results with this task. I tried many hyperparameter combinations with the original code.
The first frame's latent (`torch.Size([1, 16, 1, 68, 120])`) is written over the first of the 25 frame latents in `latents` (`torch.Size([1, 16, 25, 68, 120])`). Then the last frame's latent is concatenated, so `latents` becomes `torch.Size([1, 16, 26, 68, 120])`. After the denoising process, the extra last-frame latent is discarded and the result is decoded by the VAE. I also tried not concatenating the last frame's latent but instead overwriting the last frame of `latents` (and not discarding anything at the end), but I still got bad results. Here are some results:

First Frame Last Frame
0.mp4
1.mp4
2.mp4
3.mp4
4.mp4
5.mp4
6.mp4
7.mp4
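For clarity, here is a shape-only sketch of the latent handling described above (hypothetical tensor names; schematic, not the actual pipeline code):

```python
import torch

# Schematic reproduction of the shapes described above (hypothetical names).
latents = torch.randn(1, 16, 25, 68, 120)            # 25 frame latents for a 97-frame video
first_frame_latent = torch.randn(1, 16, 1, 68, 120)
last_frame_latent = torch.randn(1, 16, 1, 68, 120)

# The first frame's latent overwrites the first of the 25 frame latents...
latents[:, :, :1] = first_frame_latent
# ...and the last frame's latent is concatenated, giving 26 frame latents.
latents = torch.cat([latents, last_frame_latent], dim=2)
print(latents.shape)  # torch.Size([1, 16, 26, 68, 120])

# After denoising, the extra last-frame latent is dropped before VAE decoding.
latents = latents[:, :, :-1]
print(latents.shape)  # torch.Size([1, 16, 25, 68, 120])
```

The reproduction script used for these attempts follows.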
```python
#!pip uninstall diffusers -yq
#!pip install git+https://github.com/tolgacangoz/diffusers.git@skyreels-v2 ftfy -q
import torch
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

vae = AutoencoderKLWan.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)
pipe = SkyReelsV2DiffusionForcingImageToVideoPipeline.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")
#pipe.transformer.set_ar_attention(causal_block_size=5)

prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
first_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png")
last_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png")

output = pipe(
    image=first_frame,
    last_image=last_frame,
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    height=544,
    width=960,
    num_frames=97,
    #ar_step=5,  # Controls asynchronous inference (0 for synchronous mode)
    generator=torch.Generator(device="cpu").manual_seed(0),
    overlap_history=None,  # Number of frames to overlap for smooth transitions in long videos; 17 for long
    addnoise_condition=20,  # Improves consistency in long video generation
).frames[0]
export_to_video(output, "FLF2V.mp4", fps=24, quality=8)
```

V2V with Diffusion Forcing (OLD)

This pipeline extends a given video.

| Input video | diffusers integration |
| --- | --- |
| video1.mp4 | v2v.mp4 |
```python
#!pip uninstall diffusers -yq
#!pip install git+https://github.com/tolgacangoz/diffusers.git@skyreels-v2 ftfy -q
import torch
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingVideoToVideoPipeline
from diffusers.utils import export_to_video, load_video

vae = AutoencoderKLWan.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)
pipe = SkyReelsV2DiffusionForcingVideoToVideoPipeline.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")
#pipe.transformer.set_ar_attention(causal_block_size=5)

prompt = "CG animation style, a small blue bird flaps its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its continuing flight and the vastness of the sky from a close-up, low-angle perspective."
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
video = load_video("Input video.mp4")

output = pipe(
    video=video,
    prompt=prompt,
    num_inference_steps=50,
    height=544,
    width=960,
    num_frames=120,
    base_num_frames=97,
    ar_step=0,  # Controls asynchronous inference (0 for synchronous mode)
    generator=torch.Generator(device="cpu").manual_seed(0),
    overlap_history=17,  # Number of frames to overlap for smooth transitions in long videos
    addnoise_condition=20,  # Improves consistency in long video generation
).frames[0]
export_to_video(output, "V2V.mp4", fps=24, quality=8)
```

tolgacangoz and others added 7 commits May 7, 2025 21:53
…usion forcing

- Introduced the drafts of `SkyReelsV2TextToVideoPipeline`, `SkyReelsV2ImageToVideoPipeline`, `SkyReelsV2DiffusionForcingPipeline`, and `FlowUniPCMultistepScheduler`.
@ukaprch

ukaprch commented May 8, 2025

It's about time. Thanks.

tolgacangoz added 22 commits May 8, 2025 20:01
Replaces custom attention implementations with `SkyReelsV2AttnProcessor2_0` and the standard `Attention` module.
Updates `WanAttentionBlock` to use `FP32LayerNorm` and `FeedForward`.
Removes the `model_type` parameter, simplifying model architecture and attention block initialization.
Introduces new classes `SkyReelsV2ImageEmbedding` and `SkyReelsV2TimeTextImageEmbedding` for enhanced image and time-text processing. Refactors the `SkyReelsV2Transformer3DModel` to integrate these embeddings, updating the constructor parameters for better clarity and functionality. Removes unused classes and methods to streamline the codebase.
…ds and begin reorganizing the forward pass.
…hod, integrating rotary embeddings and improving attention handling. Removes the deprecated `rope_apply` function and streamlines the attention mechanism for better integration and clarity.
…ethod by updating parameter names for clarity, integrating attention masks, and improving the handling of encoder hidden states.
…ethod by enhancing the handling of time embeddings and encoder hidden states. Updates parameter names for clarity and integrates rotary embeddings, ensuring better compatibility with the model's architecture.
@a-r-r-o-w
Member

@tolgacangoz Awesome work so far, just checking in on the progress. I'm still trying to fully wrap my head around diffusion forcing and to visually verify that the diffusers code matches the original. As a sanity check, do we know why the T2V outputs from the original code and from diffusers differ? Typically, we try to ensure that, given the same starting embeddings, seed, and other starting conditions, the outputs from different implementations match numerically within a threshold of 1e-3. I will try to help with debugging and testing 🤗

@tolgacangoz
Contributor Author

tolgacangoz commented Jun 2, 2025

The original code had a default negative prompt (which I wasn't aware of at the time); this might have been the reason, or it could be something about timestep processing. I will try to make the results as deterministically matchable as possible.
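If helpful for the numeric comparison, here is a minimal sketch (assuming the mismatch indeed comes from the original repo's implicit negative prompt) that passes the negative prompt explicitly on the diffusers side, so both runs condition on the same text:

```python
# Sketch: make classifier-free guidance see identical conditioning in both
# implementations by passing the negative prompt explicitly (assumes `pipe`
# and `prompt` are defined as in the T2V example above; whether the original
# repo uses exactly this default string is an assumption to verify).
negative_prompt = (
    "Bright tones, overexposed, static, blurred details, subtitles, style, works, "
    "paintings, images, static, overall gray, worst quality, low quality, JPEG "
    "compression residue, ugly, incomplete, extra fingers, poorly drawn hands, "
    "poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, "
    "still picture, messy background, three legs, many people in the background, "
    "walking backwards"
)

output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=30,
    height=544,
    width=960,
    num_frames=97,
    generator=torch.Generator(device="cpu").manual_seed(0),
).frames[0]
```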

…_pos_embed_from_grid` for timestep projection.
…tional parameter `flip_sin_to_cos` for flipping sine and cosine embeddings, improving flexibility in positional embedding generation.
…nclude `flip_sin_to_cos` parameter, enhancing the flexibility of time embedding generation.
…yReelsV2TransformerBlock` to ensure consistent use of `torch.float32` and `torch.bfloat16`, improving integration.
…2` for frequency calculations, ensuring consistency in data types across the model.
@tolgacangoz
Contributor Author

tolgacangoz commented Jun 2, 2025

Today I mostly followed along by reading the code. Tomorrow I will be more systematic: I will go through the inputs and outputs of each module step by step.

@tolgacangoz
Contributor Author

I have finally discovered the reasons for the discrepancy and will share them tomorrow 🥳!

Successfully merging this pull request may close these issues.

[Feature request] Integrate SkyReels-V2 support in diffusers