Add SkyReels V2: Infinite-Length Film Generative Model #11518
base: main
Conversation
…usion forcing - Introduced the drafts of `SkyReelsV2TextToVideoPipeline`, `SkyReelsV2ImageToVideoPipeline`, `SkyReelsV2DiffusionForcingPipeline`, and `FlowUniPCMultistepScheduler`.
It's about time. Thanks.
Replaces custom attention implementations with `SkyReelsV2AttnProcessor2_0` and the standard `Attention` module. Updates `WanAttentionBlock` to use `FP32LayerNorm` and `FeedForward`. Removes the `model_type` parameter, simplifying model architecture and attention block initialization.
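For context, "2_0"-style processors in diffusers delegate the actual attention computation to PyTorch 2.0's fused kernel. A minimal sketch of that core step (not the actual `SkyReelsV2AttnProcessor2_0` implementation):

```python
import torch
import torch.nn.functional as F

def sdpa_attention(
    query: torch.Tensor,  # (batch, heads, seq_len, head_dim)
    key: torch.Tensor,
    value: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
) -> torch.Tensor:
    # The fused kernel replaces a hand-rolled softmax(QK^T / sqrt(d)) @ V,
    # which is what the custom attention implementations computed explicitly.
    return F.scaled_dot_product_attention(query, key, value, attn_mask=attention_mask)
```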
Introduces new classes `SkyReelsV2ImageEmbedding` and `SkyReelsV2TimeTextImageEmbedding` for enhanced image and time-text processing. Refactors the `SkyReelsV2Transformer3DModel` to integrate these embeddings, updating the constructor parameters for better clarity and functionality. Removes unused classes and methods to streamline the codebase.
…ds and begin reorganizing the forward pass.
…hod, integrating rotary embeddings and improving attention handling. Removes the deprecated `rope_apply` function and streamlines the attention mechanism for better integration and clarity.
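As a reference point, rotary embeddings are commonly applied to the query/key tensors via a complex multiplication; a minimal sketch of that pattern (the exact integration in `SkyReelsV2Transformer3DModel` may differ):

```python
import torch

def apply_rotary_emb(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    # x: (batch, heads, seq_len, head_dim); freqs: complex64, (seq_len, head_dim // 2).
    # Pair up the last dimension, rotate each pair by its frequency, then
    # flatten back to the original layout.
    x_complex = torch.view_as_complex(x.float().unflatten(-1, (-1, 2)))
    x_rotated = torch.view_as_real(x_complex * freqs)
    return x_rotated.flatten(-2).type_as(x)
```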
…ethod by updating parameter names for clarity, integrating attention masks, and improving the handling of encoder hidden states.
…ethod by enhancing the handling of time embeddings and encoder hidden states. Updates parameter names for clarity and integrates rotary embeddings, ensuring better compatibility with the model's architecture.
@tolgacangoz Awesome work so far; just checking in on the progress. I'm still trying to fully wrap my head around diffusion forcing and to verify visually that the diffusers code matches the original. As a sanity check, do we know why the T2V output from the original code differs from the diffusers output? Typically, we try to ensure that, given the same starting embeddings, seed, and other starting conditions, the outputs of different implementations match numerically within a threshold of 1e-3. I will try to help with debugging and testing 🤗
The original code had a default negative prompt (which I wasn't aware of at the time); that, or something about the timestep processing, might be the reason. I will try to make the outputs as deterministically matchable as possible.
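A sketch of the kind of parity check described above, assuming both implementations are wrapped as callables taking a latent sample and a timestep (names and shapes here are illustrative):

```python
import torch

def assert_numerically_close(model_a, model_b, atol: float = 1e-3):
    # Fix the seed so both implementations see identical inputs.
    torch.manual_seed(0)
    sample = torch.randn(1, 16, 25, 68, 120)  # (B, C, F, H, W) latent
    timestep = torch.tensor([500])
    with torch.no_grad():
        out_a = model_a(sample, timestep)
        out_b = model_b(sample, timestep)
    max_diff = (out_a - out_b).abs().max().item()
    assert max_diff < atol, f"implementations diverge: max abs diff {max_diff:.2e}"
```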
…_pos_embed_from_grid` for timestep projection.
…tional parameter `flip_sin_to_cos` for flipping sine and cosine embeddings, improving flexibility in positional embedding generation.
…nclude `flip_sin_to_cos` parameter, enhancing the flexibility of time embedding generation.
…yReelsV2TransformerBlock` to ensure consistent use of `torch.float32` and `torch.bfloat16`, improving integration.
…2` for frequency calculations, ensuring consistency in data types across the model.
…d precision for timestep projection.
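For reference, `flip_sin_to_cos` controls whether the embedding is laid out as `[sin | cos]` or `[cos | sin]`, which matters when matching a checkpoint trained with the other convention. A minimal sketch of the idea (not diffusers' exact `get_timestep_embedding`):

```python
import math
import torch

def sincos_embedding(timesteps: torch.Tensor, dim: int,
                     flip_sin_to_cos: bool = False,
                     max_period: int = 10000) -> torch.Tensor:
    half = dim // 2
    # Frequencies computed in float32 for numerical consistency across dtypes.
    freqs = torch.exp(
        -math.log(max_period) * torch.arange(half, dtype=torch.float32) / half
    )
    args = timesteps.float()[:, None] * freqs[None, :]
    sin, cos = torch.sin(args), torch.cos(args)
    # Default layout is [sin | cos]; flipping yields [cos | sin].
    return torch.cat([cos, sin], dim=-1) if flip_sin_to_cos else torch.cat([sin, cos], dim=-1)
```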
Today I mostly followed along by reading the code. Tomorrow I will be more systematic: I will go through the inputs and outputs of each module step by step.
…ced debugging and output analysis; update `Transformer2DModelOutput` to include debug tensors.
…readability and maintain consistency in style.
…rocessor2_0` for improved performance.
…ensor shapes and values for improved debugging and analysis.
I have finally discovered the reasons for the discrepancy and will share them tomorrow 🥳!
…le in `SkyReelsV2DiffusionForcingPipeline`.
…yReelsV2DiffusionForcingPipeline`.
…2TransformerBlock` and `SkyReelsV2Transformer3DModel` for cleaner code.
Thanks for the opportunity to fix #11374!
Original Work
Original repo: https://github.com/SkyworkAI/SkyReels-V2
Paper: https://huggingface.co/papers/2504.13074
TODOs:
- ⏳ `FlowMatchUniPCMultistepScheduler`: just copy-pasted from the original repo
- ✅ `SkyReelsV2Transformer3DModel`: 90% `WanTransformer3DModel`
- ✅ `SkyReelsV2DiffusionForcingPipeline`
  - `tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers` is ready to be forked.
  - `tolgacangoz/SkyReels-V2-DF-14B-720P-Diffusers` is ready to be forked.
  - `tolgacangoz/SkyReels-V2-DF-14B-540P-Diffusers` is ready to be forked.
- ✅ `SkyReelsV2DiffusionForcingImageToVideoPipeline`: includes FLF2V.
- ✅ `SkyReelsV2DiffusionForcingVideoToVideoPipeline`: extends a given video.
- ⏳ `SkyReelsV2Pipeline`
- ⏳ `SkyReelsV2ImageToVideoPipeline`
- ⏳ `scripts/convert_skyreelsv2_to_diffusers.py`
- ⬜ Did you make sure to update the documentation with your changes?
- ⬜ Did you write any new necessary tests?
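Once merged, usage should look roughly like other diffusers video pipelines; a sketch against the 1.3B DF checkpoint listed above (call arguments are assumptions until the final API lands):

```python
import torch
from diffusers import SkyReelsV2DiffusionForcingPipeline
from diffusers.utils import export_to_video

pipe = SkyReelsV2DiffusionForcingPipeline.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

video = pipe(
    prompt="A penguin dances.",
    num_inference_steps=30,
    generator=torch.Generator("cuda").manual_seed(0),
).frames[0]
export_to_video(video, "t2v.mp4", fps=24)
```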
T2V with Diffusion Forcing (OLD)
| original | `diffusers` integration |
| --- | --- |
| original_0_short.mp4 | diffusers_0_short.mp4 |
| original_37_short.mp4 | diffusers_37_short.mp4 |
| original_0_long.mp4 | diffusers_0_long.mp4 |
| original_37_long.mp4 | diffusers_37_long.mp4 |
I2V with Diffusion Forcing (OLD)
`prompt = "A penguin dances."`

| `diffusers` integration |
| --- |
| i2v-short.mp4 |
FLF2V with Diffusion Forcing (OLD)
Now, Houston, we have a problem.
I have been unable to produce good results with this task. I tried many hyperparameter combinations with the original code.
The first frame's latent (`torch.Size([1, 16, 1, 68, 120])`) is overwritten onto the first of the 25 frame latents in `latents` (`torch.Size([1, 16, 25, 68, 120])`). Then the last frame's latent is concatenated, so `latents` becomes `torch.Size([1, 16, 26, 68, 120])`. After the denoising process, the last frame's latent is discarded and the result is decoded by the VAE. I also tried overwriting the last frame's latent onto the final frame of `latents` instead of concatenating it (and not discarding anything at the end), but still got bad results. A rough sketch of this scheme follows the results below. Here are some results:
0.mp4
1.mp4
2.mp4
3.mp4
4.mp4
5.mp4
6.mp4
7.mp4
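For reference, the latent handling described above as a standalone sketch (shapes taken from the description; the random tensors stand in for the VAE-encoded frames):

```python
import torch

latents = torch.randn(1, 16, 25, 68, 120)   # noisy video latents (B, C, F, H, W)
first = torch.randn(1, 16, 1, 68, 120)      # encoded first frame
last = torch.randn(1, 16, 1, 68, 120)       # encoded last frame

latents[:, :, :1] = first                    # overwrite frame 0 with the first frame's latent
latents = torch.cat([latents, last], dim=2)  # append the last frame -> (1, 16, 26, 68, 120)

# ... denoising loop runs over all 26 frame latents ...

latents = latents[:, :, :-1]                 # discard the appended last-frame latent
# latents is (1, 16, 25, 68, 120) again and goes to the VAE decoder
```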
V2V with Diffusion Forcing (OLD)
This pipeline extends a given video.
| input video | `diffusers` integration |
| --- | --- |
| video1.mp4 | v2v.mp4 |