Script
Presentation Overview
• Duration: 30-40 minutes
• Presenters: 5 team members
• Structure: Introduction, Model, Results, 3D Generation, Conclusion
Contents
1 Introduction (Presenter 1) - 7 minutes
Stability AI SV3D Presentation Script
[SLIDE 2: Outline]
Here’s an outline of our presentation today. We’ll start with an introduction to the problem and background,
then dive into our novel multi-view synthesis model. We’ll show results from this model before moving to the
3D generation component and its results. Finally, we’ll discuss limitations and future work.
[SLIDE 6: SV3D]
This brings us to our approach: Stable Video 3D, or SV3D. The key idea is to adapt a latent video diffusion
model, Stable Video Diffusion (SVD), for controllable novel view synthesis. The temporal consistency that
SVD learns from videos is repurposed as spatial, 3D consistency across views. We add explicit camera pose
conditioning, which lets us control the camera trajectory around an object.
Our approach offers several advantages over prior work:
• Better generalization by leveraging a large-scale pre-trained video diffusion model
1. Novel Multi-View Synthesis: We generate a consistent orbital video around an object, conditioned on a
single input image and camera trajectory.
2. 3D Optimization: We then optimize a 3D representation from these generated views, using both photometric
losses and a new masked SDS loss.
I’ll now hand it over to Presenter 2, who will explain our novel multi-view synthesis model in more detail.
1. We remove vector conditionings like ‘fps_id’ and ‘motion_bucket_id’ since they’re irrelevant for our task.
2. We concatenate the conditioning image to the noisy latent state input to the UNet.
3. We provide the CLIP-embedding matrix of the conditioning image to the cross-attention layers.
4. Most importantly, we feed camera trajectory information into the residual blocks alongside the diffusion
timestep.
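To make the fourth point concrete, here is a minimal sketch of one way camera pose could be fed in alongside the diffusion timestep. It rests on several assumptions that are not stated in this script: the pose is parameterized as an elevation and an azimuth angle, each angle gets a sinusoidal embedding, the two embeddings are concatenated and linearly projected, and the result is added to the timestep embedding. The function names, the embedding size, and the weights are all hypothetical, chosen only for illustration.

```python
import math

def sinusoidal_embedding(value, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar into `dim` features (dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * k / half) for k in range(half)]
    return [math.cos(value * f) for f in freqs] + [math.sin(value * f) for f in freqs]

def linear(x, weight, bias):
    """Plain dense layer y = W x + b, with `weight` given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def conditioning_vector(timestep, elevation_rad, azimuth_rad, weight, bias, dim=8):
    """Combine timestep and camera pose into one vector for a residual block."""
    t_emb = sinusoidal_embedding(timestep, dim)
    # List `+` concatenates the two pose embeddings into a 2*dim vector.
    pose_emb = sinusoidal_embedding(elevation_rad, dim) + sinusoidal_embedding(azimuth_rad, dim)
    # Project the pose embedding back to `dim` features (assumed learned layer).
    pose_proj = linear(pose_emb, weight, bias)
    # Assumption: pose enters the block by being added to the timestep embedding.
    return [t + p for t, p in zip(t_emb, pose_proj)]

# Hypothetical usage with placeholder weights (a trained model would learn these).
dim = 8
weight = [[0.01] * (2 * dim) for _ in range(dim)]
bias = [0.0] * dim
cond = conditioning_vector(
    timestep=500.0,
    elevation_rad=math.radians(10.0),
    azimuth_rad=math.radians(90.0),
    weight=weight,
    bias=bias,
    dim=dim,
)
```

In this sketch, each frame of the orbit gets its own conditioning vector, so the same residual block sees both the noise level and the camera pose for that frame.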
where i is the frame index and K is the total number of frames. This approach produces more details in the
back view while avoiding over-sharpening at the end.
1. SV3D^u: A pose-unconditioned model that generates static orbits around objects, conditioned only on
the input image.
2. SV3D^c: A pose-conditioned model that’s conditioned on both the input image and camera trajectory,
trained on dynamic orbits.
3. SV3D^p: A progressive training model that’s first finetuned on static orbits without pose conditioning,
then further finetuned on dynamic orbits with camera pose conditioning. This progressive approach
increases training stability.
We compared against recent state-of-the-art methods including Zero123, Zero123-XL, SyncDreamer, Stable
Zero123, Free3D, and EscherNet.
For training orbits, we generate SV3D multi-view images following a reference orbit. We use a dynamic
orbit with sine-varying elevation to ensure coverage of top and bottom views:
$$e_i = e_0 + A \sin\left(2\pi \cdot \frac{i}{K}\right) \qquad (7)$$
with A ≈ 30 degrees, which is crucial for complete 3D reconstruction.
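As a quick illustration of the sine-varying elevation, the sketch below evaluates the formula for every frame of an orbit. The frame count, the base elevation e0, and the choice to index frames from 0 to K-1 are assumptions made for this example only; the function name is hypothetical.

```python
import math

def dynamic_orbit_elevations(e0_deg, amplitude_deg, num_frames):
    """Elevation in degrees for each frame i = 0..K-1: e_i = e0 + A * sin(2*pi*i/K)."""
    return [
        e0_deg + amplitude_deg * math.sin(2.0 * math.pi * i / num_frames)
        for i in range(num_frames)
    ]

# Illustrative values: e0 = 10 degrees, A = 30 degrees, K = 21 frames.
elevations = dynamic_orbit_elevations(e0_deg=10.0, amplitude_deg=30.0, num_frames=21)
for i, e in enumerate(elevations):
    print(f"frame {i:2d}: elevation {e:6.2f} deg")
```

With these values the camera sweeps between roughly e0 + A and e0 - A over one orbit, which is how the top and bottom of the object come into view.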
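$$\mathcal{L}_{\text{sds}} = w(t)\,\big(\epsilon_\phi(z_t; I, \pi_{\text{rand}}, t) - \epsilon\big)\,\frac{\partial \hat{J}}{\partial \theta} \qquad (8)$$
where $w(t) = \sigma_t^2$ is a t-dependent weight. The SDS loss helps complete unseen regions that aren’t visible in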
the original multi-view images.
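The structure of equation (8) can be sketched in a few lines: a weighted noise residual is pushed back through the renderer's Jacobian to get an update direction for the 3D parameters θ. This is a toy illustration under stated assumptions, not our implementation: the noise prediction is passed in as a plain list instead of coming from the diffusion model, the Jacobian ∂Ĵ/∂θ is given explicitly instead of being computed by automatic differentiation, and the masking used by the masked SDS loss is left out. The function name and all numbers are hypothetical.

```python
def sds_gradient(eps_pred, eps, sigma_t, render_jacobian):
    """Return w(t) * (eps_pred - eps) chained through dJ/dtheta, with w(t) = sigma_t^2.

    eps_pred, eps: flat lists with one entry per rendered element.
    render_jacobian: rows indexed by rendered element k, columns by parameter j,
                     so render_jacobian[k][j] stands in for dJ_k / dtheta_j.
    """
    weight = sigma_t ** 2
    # Weighted difference between predicted noise and the noise that was added.
    residual = [weight * (p - e) for p, e in zip(eps_pred, eps)]
    num_params = len(render_jacobian[0])
    # Chain rule: sum over rendered elements for each parameter.
    return [
        sum(residual[k] * render_jacobian[k][j] for k in range(len(residual)))
        for j in range(num_params)
    ]

# Toy example: two rendered elements, two parameters, identity Jacobian.
grad = sds_gradient(
    eps_pred=[1.0, 2.0],
    eps=[0.0, 0.0],
    sigma_t=2.0,
    render_jacobian=[[1.0, 0.0], [0.0, 1.0]],
)
print(grad)
```

When the predicted noise matches the added noise, the residual is zero and the parameters receive no update, which is the fixed point the optimization moves toward.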
• Advanced camera control: Conditioning SV3D on full camera matrices and supporting more degrees
of freedom in camera movement.
• Enhanced shading models: Extending beyond Lambertian reflection to better handle specular and
reflective surfaces.
• Direct 3D conditioning: Integrating 3D representations directly into the diffusion model for end-to-end
optimization and even better consistency.
• Wider applications: Extending to scene-level 3D generation and animation of deformable objects.
[Closing]
Thank you for your attention! We believe SV3D represents a significant step forward in single-image 3D
reconstruction by leveraging the power of video diffusion models. We’re excited to see how this approach can
be extended and applied in the future. We’ll be happy to take any questions now.