
SV3D Presentation Script:

Novel Multi-view Synthesis and 3D Generation from a Single Image


using Latent Video Diffusion
Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz,
Dmitry Tochilkin, Christian Laforte, Robin Rombach, and Varun Jampani
Stability AI

Presentation Overview
• Duration: 30-40 minutes
• Presenters: 5 team members
• Structure: Introduction, Model, Results, 3D Generation, Conclusion

Contents
1 Introduction (Presenter 1) - 7 minutes

2 Novel Multi-view Synthesis Model (Presenter 2) - 8 minutes

3 Novel Multi-view Synthesis Results (Presenter 3) - 7 minutes

4 3D Generation Model (Presenter 4) - 7 minutes

5 3D Generation Results and Conclusion (Presenter 5) - 8 minutes


1 Introduction (Presenter 1) - 7 minutes


[SLIDE 1: Title Slide]
Good morning/afternoon everyone. Today, we’re excited to present our work titled “SV3D: Novel Multi-view
Synthesis and 3D Generation from a Single Image using Latent Video Diffusion.” This is joint work by Vikram
Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitry Tochilkin, Christian Laforte, Robin
Rombach, and Varun Jampani from Stability AI.

[SLIDE 2: Outline]
Here’s an outline of our presentation today. We’ll start with an introduction to the problem and background,
then dive into our novel multi-view synthesis model. We’ll show results from this model before moving to the
3D generation component and its results. Finally, we’ll discuss limitations and future work.

[SLIDE 3: Single-image 3D Object Reconstruction]


Let’s begin by understanding the problem we’re addressing. Single-image 3D object reconstruction is a long-
standing challenge in computer vision with numerous applications in game design, augmented reality, virtual
reality, e-commerce, and robotics.
The task is fundamentally challenging and ill-posed because it requires lifting 2D pixels to 3D space while
simultaneously reasoning about portions of the object that are not visible in the input image. More formally,
given a 2D image I ∈ R^{3×H×W}, we want to reconstruct a full 3D representation Θ of the object.
Recent progress in this domain has been largely driven by advances in generative AI, which have made it
possible to produce much more realistic and detailed 3D reconstructions than was previously feasible.

[SLIDE 4: Current Approaches to 3D Generation]


Currently, there are two main strategies leveraging generative models for 3D generation:
The first approach is direct 3D optimization, where image-based 2D generative models provide an optimiza-
tion loss for unseen novel views. Notable examples include DreamFusion and Magic3D. This can be formulated
as finding the optimal 3D representation by minimizing the expected Score Distillation Sampling loss across
various camera poses.
The second approach, which our work builds upon, is a two-stage pipeline: first performing Novel View
Synthesis (NVS), then optimizing a 3D representation based on these generated views. Examples include
Zero123, SyncDreamer, and EscherNet. This two-stage formulation involves first generating multi-view images
conditioned on the input image and camera poses, then optimizing the 3D representation to match these
generated views.
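
Written schematically, the two strategies can be summarized as below, where g(Θ, π) denotes a differentiable rendering of the 3D representation Θ from camera pose π; the notation is illustrative only, since each cited method uses its own concrete losses:

```latex
% Direct 3D optimization with a 2D diffusion prior (e.g. DreamFusion, Magic3D):
\Theta^{*} = \arg\min_{\Theta}\; \mathbb{E}_{\pi}\!\left[\mathcal{L}_{\mathrm{SDS}}\big(g(\Theta,\pi)\big)\right]

% Two-stage pipeline (e.g. Zero123, SyncDreamer, EscherNet): first NVS, then 3D fitting
\{J_k\}_{k=1}^{K} \sim p\!\left(J \mid I, \pi\right), \qquad
\Theta^{*} = \arg\min_{\Theta}\; \sum_{k=1}^{K} \mathcal{L}_{\mathrm{photo}}\big(g(\Theta,\pi_k),\, J_k\big)
```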

[SLIDE 5: Key Challenges in Previous Approaches]


However, previous approaches face several critical challenges:
First, in terms of generalization, they need to handle diverse objects and scenes, which is difficult when
relying on limited 3D training data.
Second, controllability is crucial - we need precise control over camera viewpoint generation to ensure we
can create arbitrary views of an object.
Third and perhaps most important is multi-view consistency - ensuring all generated views are coherent in
3D space, which is a significant challenge for many current methods.
These limitations manifest in several ways: previous methods often exhibit limited multi-view consistency,
poor generalization due to reliance on scarce 3D training data, inconsistent geometric and texture details, and
are frequently restricted to specific viewpoints or elevations.

[SLIDE 6: SV3D]
This brings us to our approach, SV3D. The key idea is to adapt a latent video diffusion model for controllable
novel view synthesis.
Stable Video 3D, or SV3D, adapts Stable Video Diffusion (SVD) for novel view synthesis by repurposing
temporal consistency in videos to achieve spatial or 3D consistency. We add explicit camera pose conditioning,
allowing us to control camera trajectories around objects.
Our approach offers several advantages over prior work:
• Better generalization by leveraging a large-scale pre-trained video diffusion model


• Explicit camera pose control for arbitrary viewpoints


• Superior multi-view consistency by inheriting the video model’s ability to maintain coherence across frames

[SLIDE 7: SV3D Two-stage Pipeline]


Our method follows a two-stage pipeline:

1. Novel Multi-View Synthesis: We generate a consistent orbital video around an object, conditioned on a
single input image and camera trajectory.
2. 3D Optimization: We then optimize a 3D representation from these generated views, using both photo-
metric losses and a new masked SDS loss.

I’ll now hand it over to Presenter 2, who will explain our novel multi-view synthesis model in more detail.


2 Novel Multi-view Synthesis Model (Presenter 2) - 8 minutes


[SLIDE 8: Problem Formulation]
Thank you, Presenter 1. Let me explain how we formulate the novel multi-view synthesis problem.
Given an input image I ∈ R^{3×H×W} of an object, our goal is to generate an orbital video J ∈ R^{K×3×H×W} with K = 21 frames, following a camera trajectory π = {(e_i, a_i)}_{i=1}^{K} ∈ R^{K×2}, where e_i and a_i represent the elevation and azimuth angles for each view.
We generate this orbital video by iteratively denoising samples from a learned conditional distribution p(J | I, π) parameterized by a video diffusion model. The denoising process follows the standard diffusion update equation shown on the slide, where α_t and σ_t are noise schedule parameters and ε_φ is the noise prediction model.
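
For orientation, one generic ancestral denoising step in this notation is sketched below; this is a standard DDPM-style update shown purely for illustration, and SVD's own sampler uses a different (EDM-style) parameterization:

```latex
J^{(t-1)} = \frac{1}{\sqrt{\alpha_t}}\left( J^{(t)}
            - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\phi\!\big(J^{(t)};\, I, \pi, t\big) \right)
            + \sigma_t z, \qquad z \sim \mathcal{N}(\mathbf{0},\mathbf{I}),\quad
            J^{(T)} \sim \mathcal{N}(\mathbf{0},\mathbf{I})
```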

[SLIDE 9: SV3D Architecture]


Now, let’s look at the architecture of SV3D. We start with the Stable Video Diffusion (SVD) architecture and
make several key modifications:

1. We remove vector conditionings like 'fps_id' and 'motion_bucket_id' since they're irrelevant for our task.

2. We concatenate the conditioning image to the noisy latent state input to the UNet.
3. We provide the CLIP-embedding matrix of the conditioning image to the cross-attention layers.
4. Most importantly, we feed camera trajectory information into the residual blocks alongside the diffusion
timestep.

[SLIDE 10: Camera Trajectory Conditioning]


Let’s dive deeper into how we condition on camera trajectories.
The camera pose angles (e_i, a_i) and noise timestep t are embedded using sinusoidal position embeddings according to this formula:

    ω_k(t) = exp( i · (t/τ) / 10000^{2k/d} )    (1)

The camera pose embeddings are concatenated, linearly transformed, and added to the timestep embedding to form a combined embedding that is fed to every residual block in the UNet:

    emb_total = emb_t + W [emb_e ; emb_a]    (2)


This camera pose conditioning enables explicit control over camera trajectory, generation of views from
arbitrary viewpoints, and better geometric understanding of the 3D object.
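
To make Equations (1) and (2) concrete, here is a minimal PyTorch sketch of how the elevation, azimuth, and timestep embeddings could be combined; the embedding width, module names, and the way the result is routed into the UNet are assumptions for illustration, not the actual SV3D code:

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(x: torch.Tensor, dim: int = 256, max_period: float = 10000.0) -> torch.Tensor:
    """Standard sinusoidal position embedding of a scalar per batch element (Eq. 1 style)."""
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = x[:, None].float() * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)   # (B, dim)

class CameraTimestepEmbedding(nn.Module):
    """emb_total = emb_t + W [emb_e ; emb_a]  (Eq. 2), fed to every UNet residual block."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)   # the linear map W

    def forward(self, t, elevation, azimuth):
        emb_t = sinusoidal_embedding(t)
        emb_e = sinusoidal_embedding(elevation)
        emb_a = sinusoidal_embedding(azimuth)
        return emb_t + self.proj(torch.cat([emb_e, emb_a], dim=-1))

# Usage: one embedding per frame of the orbit.
emb = CameraTimestepEmbedding()(torch.tensor([500.0]), torch.tensor([0.2]), torch.tensor([1.57]))
```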

[SLIDE 11: Static vs. Dynamic Orbits]


We designed two types of camera orbits for training and evaluation:
A static orbit has the camera circling at regularly-spaced azimuths, always at the same elevation angle as
the conditioning image. The disadvantage is that we don’t get information about the top or bottom of the
object.
A dynamic orbit, on the other hand, can have irregularly spaced azimuths and varying elevation per view.
We create dynamic orbits by adding small random noise to the azimuth angles and a weighted combination of
sinusoids to the elevation. This ensures temporal smoothness and that the camera trajectory loops back to the
starting position.
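
As an illustration, the two orbit types could be generated as in the NumPy sketch below; the jitter scale and the sinusoid weights for the elevation are placeholder values, not the ones used in training:

```python
import numpy as np

def static_orbit(cond_elevation: float, K: int = 21):
    """Camera circles at regularly spaced azimuths, fixed at the conditioning elevation."""
    azimuths = np.linspace(0.0, 2 * np.pi, K, endpoint=False)
    return np.full(K, cond_elevation), azimuths

def dynamic_orbit(cond_elevation: float, K: int = 21, az_noise: float = 0.05, seed: int = 0):
    """Irregularly spaced azimuths and varying elevation; smooth and looping back to the start."""
    rng = np.random.default_rng(seed)
    azimuths = np.linspace(0.0, 2 * np.pi, K, endpoint=False) + rng.normal(0.0, az_noise, K)
    azimuths[0] = 0.0                                   # keep the first frame at the input view
    t = np.arange(K) / K
    # Weighted combination of sinusoids: periodic in t, so the trajectory closes on itself.
    elevations = cond_elevation + 0.3 * np.sin(2 * np.pi * t) + 0.1 * np.sin(4 * np.pi * t)
    return elevations, azimuths
```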

[SLIDE 12: Triangular CFG Scaling]


We also introduce a triangular CFG scaling technique to improve generation quality.
SVD originally uses a linearly increasing scale for classifier-free guidance (CFG) from 1 to 4 across the
generated frames. However, this causes over-sharpening in the last few frames of our orbital videos.
Our solution is to use a triangle wave CFG scaling: we linearly increase CFG from 1 at the front view to
2.5 at the back view, then linearly decrease back to 1 at the front view. The equation is:
    CFG(i) = 1 + 1.5 · (2i/K)        if i ≤ K/2
    CFG(i) = 1 + 1.5 · (2(K−i)/K)    if i > K/2        (3)


where i is the frame index and K is the total number of frames. This approach produces more details in the
back view while avoiding over-sharpening at the end.
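
Equation (3) translates directly into a few lines of Python; this sketch just restates the schedule for clarity:

```python
def triangle_cfg_scale(i: int, K: int = 21, low: float = 1.0, high: float = 2.5) -> float:
    """Triangle-wave CFG (Eq. 3): rises linearly from `low` at the front view to `high`
    at the back view (around frame K/2), then falls back to `low` at the final frame."""
    amplitude = high - low                  # 1.5 with the settings above
    if i <= K / 2:
        return low + amplitude * (2 * i / K)
    return low + amplitude * (2 * (K - i) / K)

scales = [round(triangle_cfg_scale(i), 2) for i in range(21)]   # peaks near the back view
```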

[SLIDE 13: SV3D Models]


We trained three variants of SV3D:

1. SV3Du: A pose-unconditioned model that generates static orbits around objects, conditioned only on
the input image.
2. SV3Dc: A pose-conditioned model that’s conditioned on both the input image and camera trajectory,
trained on dynamic orbits.
3. SV3Dp: A progressive training model that’s first finetuned on static orbits without pose conditioning,
then further finetuned on dynamic orbits with camera pose conditioning. This progressive approach
increases training stability.

[SLIDE 14: Training Details]


For training, we used the Objaverse dataset with 150K curated CC-licensed 3D objects. We rendered images
at 576×576 resolution with a field-of-view of 33.8 degrees and random color backgrounds. For each object, we
generated 21 frames around it.
Our training schedule consisted of 105K total iterations. For SV3Dp, this was split into 55K iterations without pose conditioning and 50K iterations with it. We used an effective batch size of 64 on 4 nodes with 8 A100 80 GB GPUs each (32 GPUs total), with training taking approximately 6 days.
Now I’ll pass it over to Presenter 3, who will discuss the results of our novel multi-view synthesis.


3 Novel Multi-view Synthesis Results (Presenter 3) - 7 minutes


[SLIDE 15: Evaluation Setup]
Thank you, Presenter 2. Let’s look at how our novel multi-view synthesis performs compared to previous
methods.
For evaluation, we used three datasets:
• Google Scanned Objects (GSO) with 300 filtered objects

• OmniObject3D as an additional test set


• A collection of 22 real-world images for user study
We measured performance using several metrics:
• LPIPS: Learned Perceptual Image Patch Similarity

• PSNR: Peak Signal-to-Noise Ratio in decibels


• SSIM: Structural Similarity Index Measure
• MSE: Mean Squared Error
• CLIP-S: CLIP similarity score

We compared against recent state-of-the-art methods including Zero123, Zero123-XL, SyncDreamer, Stable
Zero123, Free3D, and EscherNet.
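
For reference, the 2D metrics could be computed per frame roughly as follows; this assumes the `lpips` and `scikit-image` packages and omits CLIP-S, which additionally needs a CLIP image encoder (the paper's actual evaluation code is not shown here):

```python
import numpy as np
import torch
import lpips                                             # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")                       # learned perceptual metric

def eval_frame(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred/gt: float images in [0, 1], shape (H, W, 3)."""
    mse = float(np.mean((pred - gt) ** 2))
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1   # [-1, 1], NCHW
    lp = float(lpips_fn(to_t(pred), to_t(gt)))
    return {"MSE": mse, "PSNR": psnr, "SSIM": ssim, "LPIPS": lp}
```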

[SLIDE 16: Quantitative Results: GSO Static Orbits]


This table shows the quantitative results on static orbits from the GSO dataset. Our three SV3D models
outperform all prior methods across all metrics. Even our pose-unconditioned model SV3Du achieves better
results than previous approaches, while our best model SV3Dp shows substantial improvements, particularly in
PSNR and LPIPS.

[SLIDE 17: Quantitative Results: GSO Dynamic Orbits]


Moving to dynamic orbits, which are more challenging since they involve varying elevations, we see that our
pose-conditioned models SV3Dc and SV3Dp continue to outperform previous methods by significant margins.
Note that only pose-conditioned models can handle dynamic orbits, so SV3Du isn’t included here.
The key observation is that SV3Dp maintains superior performance on dynamic orbits, demonstrating the
flexibility and controllability of our approach.

[SLIDE 18: Quantitative Results: OmniObject3D Dataset]


We also evaluated on the OmniObject3D dataset to test generalization to different objects. Again, our SV3D
models significantly outperform prior methods across all metrics, with improvements of around 2-4 dB in PSNR
and substantial reductions in LPIPS.

[SLIDE 19: Per-Frame Quality Analysis]


This graph shows LPIPS metric versus frame number for different methods on the GSO dataset. SV3D produces
better quality (lower LPIPS) at every frame compared to all baselines.
We observe that quality is generally worse around the back view (middle frames) and better near the
conditioning image (beginning and end frames). This pattern is intuitive - the model has most information near
the input viewpoint.

[SLIDE 20: Visual Comparison]


Looking at visual results, we can see that SV3D generates novel multi-views that are more detailed, more
faithful to the conditioning image, and more consistent across different viewpoints. Note how our approach
better preserves object geometry compared to methods like Zero123XL, which often distorts shapes.
The examples show results on GSO, OmniObject3D, and real-world images, demonstrating that our method
works well across diverse datasets.


[SLIDE 21: User Study on Real-World Images]


We also conducted a user study to evaluate performance on real-world images. 30 users evaluated orbital videos
generated from 22 real-world images, comparing SV3D with other state-of-the-art methods.
The results were overwhelming - users preferred SV3D-generated videos over:

• Zero123XL: 96% of the time


• Stable Zero123: 99% of the time
• EscherNet: 96% of the time

• Free3D: 98% of the time


This demonstrates the real-world applicability of our approach beyond synthetic datasets.
Now I’ll hand it over to Presenter 4, who will explain how we use these multi-view synthesis results for 3D
generation.


4 3D Generation Model (Presenter 4) - 7 minutes


[SLIDE 22: 3D Generation Pipeline]
Thank you, Presenter 3. Now I’ll explain how we leverage our SV3D multi-view synthesis for high-quality 3D
generation.
As shown in the diagram, our 3D optimization pipeline has two main stages:
On the left, we start with SV3D’s novel multi-view synthesis, generating views along the reference orbit.
In the middle, we have a coarse NeRF stage where we optimize a Neural Radiance Field to represent the 3D
object based on the generated views.
On the right, we have a fine DMTet stage where we extract and refine a mesh representation, applying both
photometric losses and a masked SDS loss, which I’ll explain shortly.

[SLIDE 23: 3D Generation Pipeline Details]


Our approach follows a coarse-to-fine training strategy:
In the coarse stage, we train an Instant-NGP NeRF to reconstruct the SV3D-generated images at a lower
resolution. This captures the general shape and appearance of the object.
In the fine stage, we extract a mesh from the trained NeRF using marching cubes, then adopt a DMTet
representation to refine the mesh at high resolution. We apply both photometric losses based on the SV3D-
generated views and SDS-based diffusion guidance for unseen regions.

[SLIDE 24: Disentangled Illumination Model]


One challenge we encountered is that SV3D-generated videos often have consistent but baked-in lighting, which
isn’t ideal for 3D modeling. To address this, we fit a simple illumination model using 24 Spherical Gaussians
(SG).
A Spherical Gaussian at query location x is defined by:

    G(x; µ, s, a) = a · e^{s(µ·x − 1)}    (4)

where µ ∈ R³ is the axis, s ∈ R the sharpness, and a ∈ R the amplitude. We only model white light with scalar amplitudes.
For Lambertian shading, we approximate the cosine term with another SG:

Gcosine = G(x; n, 2.133, 1.17) (5)


The lighting evaluation sums the contributions from all SGs:

    L = (1/π) · Σ_{i=1}^{24} max(G_i · G_cosine, 0)    (6)
This allows us to disentangle illumination from the base color of the object.
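
Read literally, Equations (4)-(6) amount to the small NumPy sketch below; the 24 SG parameters would come from fitting to the SV3D frames, so the ones used here are placeholders:

```python
import numpy as np

def sg(x, axis, sharpness, amplitude):
    """Spherical Gaussian G(x; mu, s, a) = a * exp(s * (mu . x - 1)), Eq. (4); unit vectors."""
    return amplitude * np.exp(sharpness * (np.dot(axis, x) - 1.0))

def shade(x, normal, light_sgs):
    """Eqs. (5)-(6): cosine term approximated by an SG with axis n, sharpness 2.133,
    amplitude 1.17; sum of the 24 white-light SG contributions, divided by pi."""
    g_cos = sg(x, normal, 2.133, 1.17)
    return sum(max(sg(x, mu, s, a) * g_cos, 0.0) for (mu, s, a) in light_sgs) / np.pi

# Usage with 24 placeholder (unfitted) SGs:
rng = np.random.default_rng(0)
axes = rng.normal(size=(24, 3)); axes /= np.linalg.norm(axes, axis=1, keepdims=True)
lights = [(a, 5.0, 0.5) for a in axes]
L = shade(np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, 1.0]), lights)
```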

[SLIDE 25: Optimization Strategies and Losses]


For 3D reconstruction, we apply several loss functions:
First, photometric losses treat SV3D-generated images as pseudo-ground truth and include:
• MSE loss at the pixel level: L_MSE = ‖Î − I‖²

• Mask loss for object silhouettes: L_mask = ‖S − Ŝ‖²

• LPIPS perceptual loss: L_LPIPS = LPIPS(Î, I)

For training orbits, we generate SV3D multi-view images following a reference orbit. We use a dynamic
orbit with sine-varying elevation to ensure coverage of top and bottom views:
    e_i = e_0 + A · sin(2π · i/K)    (7)
with A ≈ 30 degrees, which is crucial for complete 3D reconstruction.
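
A hedged PyTorch sketch of the photometric terms above is shown below, using the `lpips` package as the perceptual loss; the relative weights are illustrative, since the script does not specify them:

```python
import torch
import lpips                                             # pip install lpips

lpips_fn = lpips.LPIPS(net="vgg")

def photometric_loss(rendered, target, rendered_mask, target_mask):
    """rendered/target: (N, 3, H, W) images in [0, 1]; masks: (N, 1, H, W) silhouettes.
    SV3D-generated frames act as pseudo ground truth."""
    l_mse = torch.mean((rendered - target) ** 2)
    l_mask = torch.mean((rendered_mask - target_mask) ** 2)
    l_lpips = lpips_fn(rendered * 2 - 1, target * 2 - 1).mean()    # LPIPS expects [-1, 1]
    return l_mse + l_mask + 0.1 * l_lpips                          # assumed weighting
```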


[SLIDE 26: SV3D-Based SDS Loss]


To further improve reconstruction, especially for unseen regions, we use Score Distillation Sampling (SDS) loss.
The process works as follows:
1. Sample a random camera orbit π_rand ∉ π_ref
2. Render views Ĵ using our current 3D model with parameters θ
3. Add noise ε at level t to the latent embedding z_t of Ĵ
4. Apply the SDS loss:

    L_sds = w(t) · (ε_φ(z_t; I, π_rand, t) − ε) · ∂Ĵ/∂θ    (8)

where w(t) = σ_t² is a t-dependent weight. The SDS loss helps complete unseen regions that aren't visible in the original multi-view images.
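
In practice the SDS gradient of Eq. (8) is usually injected through a stop-gradient surrogate, as in the sketch below; `sv3d_denoiser` and the DDPM-style noise schedule are hypothetical stand-ins, not the actual SV3D interface:

```python
import torch

def sds_loss(latents, cond_image, camera_orbit, sv3d_denoiser, alphas_cumprod):
    """latents: encoded renders z of the current 3D model (gradients flow back to theta).
    Returns a surrogate whose gradient w.r.t. `latents` equals w(t) * (eps_pred - eps)."""
    t = torch.randint(50, 950, (1,), device=latents.device)        # random noise level
    eps = torch.randn_like(latents)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a_t.sqrt() * latents + (1 - a_t).sqrt() * eps          # forward diffusion
    with torch.no_grad():                                          # no gradient through the UNet
        eps_pred = sv3d_denoiser(noisy, t, cond_image, camera_orbit)
    grad = (1 - a_t) * (eps_pred - eps)                            # ~ sigma_t^2 weighting
    return (grad.detach() * latents).sum()                         # d/d(latents) = grad
```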

[SLIDE 27-28: Masked SDS Loss]


However, we found that applying naive SDS can cause unstable training and unfaithful texture. Our solution
is a masked SDS loss that only applies to unseen or occluded areas.
For each point p on the rendered surface from a random view, we:
1. Get the surface normal n at point p
2. Calculate view directions v_i from the reference cameras:

    v_i = (π̄_i^ref − p) / ‖π̄_i^ref − p‖    (9)

where π̄_i^ref ∈ R³ is the 3D position of reference camera i.

3. Find the reference camera c with maximum visibility likelihood:

    c = argmax_i (v_i · n)    (10)

4. Create a pseudo visibility mask using a smoothstep function:

    M = 1 − f_s(v_c · n; 0, 0.5)    (11)

where f_s(x; f_0, f_1) = x̂² (3 − 2x̂), with x̂ = (x − f_0)/(f_1 − f_0).

5. Apply the masked SDS loss:

    L_mask-sds = M · L_sds    (12)
This approach preserves visible texture quality while filling in unseen surfaces, resulting in more stable
optimization.
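
The visibility mask of Eqs. (9)-(11) for a single surface point can be sketched as follows; the clamp inside the smoothstep is an assumption of the standard definition:

```python
import numpy as np

def smoothstep(x, f0=0.0, f1=0.5):
    """f_s(x; f0, f1) = x_hat^2 (3 - 2 x_hat), with x_hat = clip((x - f0)/(f1 - f0), 0, 1)."""
    x_hat = np.clip((x - f0) / (f1 - f0), 0.0, 1.0)
    return x_hat * x_hat * (3.0 - 2.0 * x_hat)

def visibility_mask(p, n, ref_cam_positions):
    """Eqs. (9)-(11): view directions from each reference camera to point p, pick the
    camera best aligned with the normal n, then M = 1 - f_s(v_c . n; 0, 0.5)."""
    v = ref_cam_positions - p                            # (N_ref, 3) directions toward cameras
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    dots = v @ n                                         # v_i . n for every reference camera
    c = int(np.argmax(dots))                             # most-visible reference camera (Eq. 10)
    return 1.0 - smoothstep(dots[c])                     # ~1 for unseen points, ~0 for well-covered

# The mask then multiplies the SDS loss, L_mask-sds = M * L_sds (Eq. 12).
M = visibility_mask(np.zeros(3), np.array([0.0, 0.0, 1.0]),
                    np.array([[2.0, 0.0, 2.0], [0.0, 2.0, -1.0]]))
```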

[SLIDE 29: Geometric Priors]


We also add geometric priors to regularize the 3D reconstruction:
• A smooth depth loss from RegNeRF:

    L_depth(i, j) = (d(i, j) − d(i, j+1))² + (d(i, j) − d(i+1, j))²    (13)

• A bilateral normal smoothness loss:

    L_bilateral = e^{−3‖∇I‖} · 1 / (1 + ‖∇n‖)    (14)

• A normal estimation loss using normals predicted by Omnidata:

    L_normal = 1 − (n · n̄)    (15)

• A spatial albedo smoothness loss:

    L_albedo = (c_d(x) − c_d(x + ε))²    (16)
These priors collectively help produce high-quality meshes with clean surfaces and detailed features.
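
For concreteness, two of these priors, the smooth-depth term (Eq. 13) and the normal-estimation term (Eq. 15), could be implemented on rendered buffers roughly as follows; weights and the remaining terms are omitted:

```python
import torch

def smooth_depth_loss(depth):
    """Eq. (13): squared differences between horizontally/vertically adjacent depth values.
    depth: (H, W) rendered depth map."""
    dh = (depth[:, :-1] - depth[:, 1:]) ** 2             # d(i, j) - d(i, j+1)
    dv = (depth[:-1, :] - depth[1:, :]) ** 2             # d(i, j) - d(i+1, j)
    return dh.mean() + dv.mean()

def normal_estimation_loss(rendered_normals, reference_normals):
    """Eq. (15): 1 - n . n_bar, averaged over pixels; both (H, W, 3), unit length
    (reference normals predicted by Omnidata)."""
    return (1.0 - (rendered_normals * reference_normals).sum(dim=-1)).mean()
```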
Now I’ll pass to Presenter 5, who will show the results of our 3D generation approach.


5 3D Generation Results and Conclusion (Presenter 5) - 8 minutes


[SLIDE 30: Evaluation Setup for 3D Generation]
Thank you, Presenter 4. Let me walk you through the results of our 3D generation approach.
For evaluation, we used 50 randomly sampled objects from the GSO dataset. We measured performance
using both 2D metrics (LPIPS, PSNR, SSIM, MSE, CLIP-S) and 3D metrics (Chamfer Distance and 3D IoU).
We compared against a wide range of baseline methods including Point-E, Shap-E, One-2-3-45++, Dream-
Gaussian, SyncDreamer, EscherNet, Free3D, and Stable Zero123.
In terms of computation, our coarse stage takes about 2 minutes, the full pipeline without SDS takes 8
minutes, and the full pipeline with SDS takes 20 minutes.

[SLIDE 31: Quantitative Results for 3D Generation: 2D Metrics]


This table shows the 2D metrics for our 3D generation results. Our SV3D models substantially outperform
prior methods across all metrics. The best results come from our SV3Dp model with masked SDS loss, showing
the benefits of both our novel multi-view synthesis approach and the specialized losses we developed.
Note that even without SDS loss, our approach already outperforms previous methods, indicating the high
quality of the multi-view images generated by SV3D.

[SLIDE 32: Quantitative Results for 3D Generation: 3D Metrics]


Looking at 3D metrics, the story is similar. Our methods achieve significantly better Chamfer Distance and 3D
IoU scores compared to all prior approaches. The best performance comes from SV3Dp with our masked SDS
loss, but even the SV3Dp without SDS performs very well.
Interestingly, our results approach the quality of GT renders, showing how effective our pipeline is at
reconstructing detailed 3D geometry.

[SLIDE 33: Visual Comparison of 3D Meshes]


The visual comparisons make the improvements even clearer. SV3D produces 3D meshes that are:
• More detailed (observe fine texture and geometry)
• More faithful to input images (color and shape)
• More consistent in 3D geometry (fewer artifacts)
• Higher quality from various viewing angles
Look at the examples of the stone column and horse - our reconstructions preserve fine details that other
methods miss or distort.

[SLIDE 34: Real-World 3D Results]


Our approach also generalizes well to real-world in-the-wild images. These examples show reconstructions of
objects like strawberries, shoes, and toys from single photographs. The results maintain accurate shape and
detailed features across diverse object categories, demonstrating the practical applicability of our method.

[SLIDE 35: Ablation Studies: Static vs. Dynamic Orbits]


We conducted several ablation studies to understand the impact of different components. This one examines
the effect of training orbits on reconstruction quality.
We found that using a dynamic orbit with moderate elevation variation (sine-30) produces better 3D outputs
than a static orbit, with lower Chamfer Distance and higher 3D IoU. This makes sense as dynamic orbits cover
top and bottom views, providing more complete information for reconstruction.
However, too much elevation variation (sine-50) can increase inconsistency between views. A moderate range of about ±30° provides optimal results.

Training orbit    CD ↓      3D IoU ↑

Static            0.028     0.610
Sine-30           0.024     0.614
Sine-50           0.026     0.609


[SLIDE 36: Summary of Contributions]


To summarize our contributions:
We introduced SV3D, a latent video diffusion model for high-resolution, controllable novel multi-view syn-
thesis and high-quality 3D generation from a single image.
Our key technical contributions include:
• Camera pose conditioning in video diffusion models
• Triangle CFG scaling for improved view consistency
• Disentangled illumination modeling for clean textures

• Masked SDS loss for handling unseen regions


We demonstrated state-of-the-art performance on both novel multi-view synthesis and 3D reconstruction
across multiple metrics and datasets.

[SLIDE 37: Limitations]


Despite these advances, our approach has some limitations:
SV3D currently handles only 2 degrees of freedom (elevation and azimuth), which is sufficient for single-image
3D generation but limiting for more complex camera movements.
Our method exhibits inconsistency for mirror-like reflective surfaces, which challenge 3D reconstruction.
The current Lambertian reflection model is insufficient for such surfaces.
Training requires substantial computational resources - 32 A100 GPUs for 6 days - which could be a barrier
to wider adoption.

[SLIDE 38: Future Work]


Looking ahead, we see several promising directions for future work:

• Advanced camera control: Conditioning SV3D on full camera matrices and supporting more degrees
of freedom in camera movement.
• Enhanced shading models: Extending beyond Lambertian reflection to better handle specular and
reflective surfaces.

• Direct 3D conditioning: Integrating 3D representations directly into the diffusion model for end-to-end
optimization and even better consistency.
• Wider applications: Extending to scene-level 3D generation and animation of deformable objects.

[Closing]
Thank you for your attention! We believe SV3D represents a significant step forward in single-image 3D
reconstruction by leveraging the power of video diffusion models. We’re excited to see how this approach can
be extended and applied in the future. We’ll be happy to take any questions now.
