
SV3D Presentation Script:

Novel Multi-view Synthesis and 3D Generation from a Single Image


using Latent Video Diffusion
Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz,
Dmitry Tochilkin, Christian Laforte, Robin Rombach, and Varun Jampani
Stability AI

Presentation Overview
• Duration: 30-40 minutes
• Presenters: 5 team members
• Structure: Introduction, Model, Results, 3D Generation, Conclusion

Contents
1 Introduction (Presenter 1) - 7 minutes

2 Novel Multi-view Synthesis Model (Presenter 2) - 8 minutes

3 Novel Multi-view Synthesis Results (Presenter 3) - 7 minutes

4 3D Generation Model (Presenter 4) - 7 minutes

5 3D Generation Results and Conclusion (Presenter 5) - 8 minutes


1 Introduction (Presenter 1) - 7 minutes


[SLIDE 1: Title Slide]
Good morning/afternoon everyone. Today, we’re excited to present our work titled “SV3D: Novel Multi-view
Synthesis and 3D Generation from a Single Image using Latent Video Diffusion.” This is joint work by Vikram
Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitry Tochilkin, Christian Laforte, Robin
Rombach, and Varun Jampani from Stability AI.

[SLIDE 2: Outline]
Here’s an outline of our presentation today. We’ll start with an introduction to the problem and background,
then dive into our novel multi-view synthesis model. We’ll show results from this model before moving to the
3D generation component and its results. Finally, we’ll discuss limitations and future work.

[SLIDE 3: Single-image 3D Object Reconstruction]


Let’s begin by understanding the problem we’re addressing. Single-image 3D object reconstruction is a long-
standing challenge in computer vision with numerous applications in game design, augmented reality, virtual
reality, e-commerce, and robotics.
The task is fundamentally challenging and ill-posed because it requires lifting 2D pixels to 3D space while
simultaneously reasoning about portions of the object that are not visible in the input image. More formally,
given a 2D image I ∈ R^{3×H×W}, we want to reconstruct a full 3D representation Θ of the object.
Recent progress in this domain has been largely driven by advances in generative AI, which have made it
possible to produce much more realistic and detailed 3D reconstructions than was previously feasible.

[SLIDE 4: Current Approaches to 3D Generation]


Currently, there are two main strategies leveraging generative models for 3D generation:
The first approach is direct 3D optimization, where image-based 2D generative models provide an optimiza-
tion loss for unseen novel views. Notable examples include DreamFusion and Magic3D. This can be formulated
as finding the optimal 3D representation by minimizing the expected Score Distillation Sampling loss across
various camera poses.
The second approach, which our work builds upon, is a two-stage pipeline: first performing Novel View
Synthesis (NVS), then optimizing a 3D representation based on these generated views. Examples include
Zero123, SyncDreamer, and EscherNet. This two-stage formulation involves first generating multi-view images
conditioned on the input image and camera poses, then optimizing the 3D representation to match these
generated views.
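
Written schematically, the two strategies can be summarized as below, where g(Θ, π) denotes a differentiable rendering of the 3D representation Θ from camera pose π; the notation is illustrative only, since each cited method uses its own concrete losses:

```latex
% Direct 3D optimization with a 2D diffusion prior (e.g. DreamFusion, Magic3D):
\Theta^{*} = \arg\min_{\Theta}\; \mathbb{E}_{\pi}\!\left[\mathcal{L}_{\mathrm{SDS}}\big(g(\Theta,\pi)\big)\right]

% Two-stage pipeline (e.g. Zero123, SyncDreamer, EscherNet): first NVS, then 3D fitting
\{J_k\}_{k=1}^{K} \sim p\!\left(J \mid I, \pi\right), \qquad
\Theta^{*} = \arg\min_{\Theta}\; \sum_{k=1}^{K} \mathcal{L}_{\mathrm{photo}}\big(g(\Theta,\pi_k),\, J_k\big)
```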

[SLIDE 5: Key Challenges in Previous Approaches]


However, previous approaches face several critical challenges:
First, in terms of generalization, they need to handle diverse objects and scenes, which is difficult when
relying on limited 3D training data.
Second, controllability is crucial - we need precise control over camera viewpoint generation to ensure we
can create arbitrary views of an object.
Third and perhaps most important is multi-view consistency - ensuring all generated views are coherent in
3D space, which is a significant challenge for many current methods.
These limitations manifest in several ways: previous methods often exhibit limited multi-view consistency,
poor generalization due to reliance on scarce 3D training data, inconsistent geometric and texture details, and
are frequently restricted to specific viewpoints or elevations.

[SLIDE 6: SV3D]
This brings us to our approach, SV3D. The key idea is to adapt a latent video diffusion model for controllable
novel view synthesis.
Stable Video 3D, or SV3D, adapts Stable Video Diffusion (SVD) for novel view synthesis by repurposing
temporal consistency in videos to achieve spatial or 3D consistency. We add explicit camera pose conditioning,
allowing us to control camera trajectories around objects.
Our approach offers several advantages over prior work:
• Better generalization by leveraging a large-scale pre-trained video diffusion model


• Explicit camera pose control for arbitrary viewpoints


• Superior multi-view consistency by inheriting the video model’s ability to maintain coherence across frames

[SLIDE 7: SV3D Two-stage Pipeline]


Our method follows a two-stage pipeline:

1. Novel Multi-View Synthesis: We generate a consistent orbital video around an object, conditioned on a
single input image and camera trajectory.
2. 3D Optimization: We then optimize a 3D representation from these generated views, using both photo-
metric losses and a new masked SDS loss.

I’ll now hand it over to Presenter 2, who will explain our novel multi-view synthesis model in more detail.


2 Novel Multi-view Synthesis Model (Presenter 2) - 8 minutes


[SLIDE 8: Problem Formulation]
Thank you, Presenter 1. Let me explain how we formulate the novel multi-view synthesis problem.
Given an input image I ∈ R^{3×H×W} of an object, our goal is to generate an orbital video J ∈ R^{K×3×H×W} with K = 21 frames, following a camera trajectory π = {(e_i, a_i)}_{i=1}^{K} ∈ R^{K×2}, where e_i and a_i represent the elevation and azimuth angles for each view.
We generate this orbital video by iteratively denoising samples from a learned conditional distribution p(J | I, π) parameterized by a video diffusion model. The denoising process follows the standard diffusion update equation shown on the slide, where α_t and σ_t are noise schedule parameters and ε_φ is the noise prediction model.
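
For orientation, one generic ancestral denoising step in this notation is sketched below; this is a standard DDPM-style update shown purely for illustration, and SVD's own sampler uses a different (EDM-style) parameterization:

```latex
J^{(t-1)} = \frac{1}{\sqrt{\alpha_t}}\left( J^{(t)}
            - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\phi\!\big(J^{(t)};\, I, \pi, t\big) \right)
            + \sigma_t z, \qquad z \sim \mathcal{N}(\mathbf{0},\mathbf{I}),\quad
            J^{(T)} \sim \mathcal{N}(\mathbf{0},\mathbf{I})
```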

[SLIDE 9: SV3D Architecture]


Now, let’s look at the architecture of SV3D. We start with the Stable Video Diffusion (SVD) architecture and
make several key modifications:

1. We remove vector conditionings like 'fps_id' and 'motion_bucket_id' since they're irrelevant for our task.

2. We concatenate the conditioning image to the noisy latent state input to the UNet.
3. We provide the CLIP-embedding matrix of the conditioning image to the cross-attention layers.
4. Most importantly, we feed camera trajectory information into the residual blocks alongside the diffusion
timestep.

[SLIDE 10: Camera Trajectory Conditioning]


Let’s dive deeper into how we condition on camera trajectories.
The camera pose angles (e_i, a_i) and noise timestep t are embedded using sinusoidal position embeddings according to this formula:

    ω_k(t) = exp( i · (t/τ) / 10000^{2k/d} )    (1)

The camera pose embeddings are concatenated, linearly transformed, and added to the timestep embedding to form a combined embedding that is fed to every residual block in the UNet:

    emb_total = emb_t + W [emb_e ; emb_a]    (2)


This camera pose conditioning enables explicit control over camera trajectory, generation of views from
arbitrary viewpoints, and better geometric understanding of the 3D object.
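
To make Equations (1) and (2) concrete, here is a minimal PyTorch sketch of how the elevation, azimuth, and timestep embeddings could be combined; the embedding width, module names, and the way the result is routed into the UNet are assumptions for illustration, not the actual SV3D code:

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(x: torch.Tensor, dim: int = 256, max_period: float = 10000.0) -> torch.Tensor:
    """Standard sinusoidal position embedding of a scalar per batch element (Eq. 1 style)."""
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = x[:, None].float() * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)   # (B, dim)

class CameraTimestepEmbedding(nn.Module):
    """emb_total = emb_t + W [emb_e ; emb_a]  (Eq. 2), fed to every UNet residual block."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)   # the linear map W

    def forward(self, t, elevation, azimuth):
        emb_t = sinusoidal_embedding(t)
        emb_e = sinusoidal_embedding(elevation)
        emb_a = sinusoidal_embedding(azimuth)
        return emb_t + self.proj(torch.cat([emb_e, emb_a], dim=-1))

# Usage: one embedding per frame of the orbit.
emb = CameraTimestepEmbedding()(torch.tensor([500.0]), torch.tensor([0.2]), torch.tensor([1.57]))
```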

[SLIDE 11: Static vs. Dynamic Orbits]


We designed two types of camera orbits for training and evaluation:
A static orbit has the camera circling at regularly-spaced azimuths, always at the same elevation angle as
the conditioning image. The disadvantage is that we don’t get information about the top or bottom of the
object.
A dynamic orbit, on the other hand, can have irregularly spaced azimuths and varying elevation per view.
We create dynamic orbits by adding small random noise to the azimuth angles and a weighted combination of
sinusoids to the elevation. This ensures temporal smoothness and that the camera trajectory loops back to the
starting position.
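
As an illustration, the two orbit types could be generated as in the NumPy sketch below; the jitter scale and the sinusoid weights for the elevation are placeholder values, not the ones used in training:

```python
import numpy as np

def static_orbit(cond_elevation: float, K: int = 21):
    """Camera circles at regularly spaced azimuths, fixed at the conditioning elevation."""
    azimuths = np.linspace(0.0, 2 * np.pi, K, endpoint=False)
    return np.full(K, cond_elevation), azimuths

def dynamic_orbit(cond_elevation: float, K: int = 21, az_noise: float = 0.05, seed: int = 0):
    """Irregularly spaced azimuths and varying elevation; smooth and looping back to the start."""
    rng = np.random.default_rng(seed)
    azimuths = np.linspace(0.0, 2 * np.pi, K, endpoint=False) + rng.normal(0.0, az_noise, K)
    azimuths[0] = 0.0                                   # keep the first frame at the input view
    t = np.arange(K) / K
    # Weighted combination of sinusoids: periodic in t, so the trajectory closes on itself.
    elevations = cond_elevation + 0.3 * np.sin(2 * np.pi * t) + 0.1 * np.sin(4 * np.pi * t)
    return elevations, azimuths
```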

[SLIDE 12: Triangular CFG Scaling]


We also introduce a triangular CFG scaling technique to improve generation quality.
SVD originally uses a linearly increasing scale for classifier-free guidance (CFG) from 1 to 4 across the
generated frames. However, this causes over-sharpening in the last few frames of our orbital videos.
Our solution is to use a triangle wave CFG scaling: we linearly increase CFG from 1 at the front view to
2.5 at the back view, then linearly decrease back to 1 at the front view. The equation is:
    CFG(i) = 1 + 1.5 · (2i/K)        if i ≤ K/2
    CFG(i) = 1 + 1.5 · (2(K−i)/K)    if i > K/2        (3)


where i is the frame index and K is the total number of frames. This approach produces more details in the
back view while avoiding over-sharpening at the end.
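
Equation (3) translates directly into a few lines of Python; this sketch just restates the schedule for clarity:

```python
def triangle_cfg_scale(i: int, K: int = 21, low: float = 1.0, high: float = 2.5) -> float:
    """Triangle-wave CFG (Eq. 3): rises linearly from `low` at the front view to `high`
    at the back view (around frame K/2), then falls back to `low` at the final frame."""
    amplitude = high - low                  # 1.5 with the settings above
    if i <= K / 2:
        return low + amplitude * (2 * i / K)
    return low + amplitude * (2 * (K - i) / K)

scales = [round(triangle_cfg_scale(i), 2) for i in range(21)]   # peaks near the back view
```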

[SLIDE 13: SV3D Models]


We trained three variants of SV3D:

1. SV3Du: A pose-unconditioned model that generates static orbits around objects, conditioned only on
the input image.
2. SV3Dc: A pose-conditioned model that’s conditioned on both the input image and camera trajectory,
trained on dynamic orbits.
3. SV3Dp: A progressive training model that’s first finetuned on static orbits without pose conditioning,
then further finetuned on dynamic orbits with camera pose conditioning. This progressive approach
increases training stability.

[SLIDE 14: Training Details]


For training, we used the Objaverse dataset with 150K curated CC-licensed 3D objects. We rendered images
at 576×576 resolution with a field-of-view of 33.8 degrees and random color backgrounds. For each object, we
generated 21 frames around it.
Our training schedule consisted of 105K total iterations. For SV3Dp, this was split into 55K iterations without pose conditioning and 50K iterations with it. We used an effective batch size of 64 on 4 nodes with 8 A100 80 GB GPUs each (32 GPUs total), with training taking approximately 6 days.
Now I’ll pass it over to Presenter 3, who will discuss the results of our novel multi-view synthesis.


3 Novel Multi-view Synthesis Results (Presenter 3) - 7 minutes


[SLIDE 15: Evaluation Setup]
Thank you, Presenter 2. Let’s look at how our novel multi-view synthesis performs compared to previous
methods.
For evaluation, we used three datasets:
• Google Scanned Objects (GSO) with 300 filtered objects

• OmniObject3D as an additional test set


• A collection of 22 real-world images for user study
We measured performance using several metrics:
• LPIPS: Learned Perceptual Image Patch Similarity

• PSNR: Peak Signal-to-Noise Ratio in decibels


• SSIM: Structural Similarity Index Measure
• MSE: Mean Squared Error
• CLIP-S: CLIP similarity score

We compared against recent state-of-the-art methods including Zero123, Zero123-XL, SyncDreamer, Stable
Zero123, Free3D, and EscherNet.
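
For reference, the 2D metrics could be computed per frame roughly as follows; this assumes the `lpips` and `scikit-image` packages and omits CLIP-S, which additionally needs a CLIP image encoder (the paper's actual evaluation code is not shown here):

```python
import numpy as np
import torch
import lpips                                             # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")                       # learned perceptual metric

def eval_frame(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred/gt: float images in [0, 1], shape (H, W, 3)."""
    mse = float(np.mean((pred - gt) ** 2))
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1   # [-1, 1], NCHW
    lp = float(lpips_fn(to_t(pred), to_t(gt)))
    return {"MSE": mse, "PSNR": psnr, "SSIM": ssim, "LPIPS": lp}
```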

[SLIDE 16: Quantitative Results: GSO Static Orbits]


This table shows the quantitative results on static orbits from the GSO dataset. Our three SV3D models
outperform all prior methods across all metrics. Even our pose-unconditioned model SV3Du achieves better
results than previous approaches, while our best model SV3Dp shows substantial improvements, particularly in
PSNR and LPIPS.

[SLIDE 17: Quantitative Results: GSO Dynamic Orbits]


Moving to dynamic orbits, which are more challenging since they involve varying elevations, we see that our
pose-conditioned models SV3Dc and SV3Dp continue to outperform previous methods by significant margins.
Note that only pose-conditioned models can handle dynamic orbits, so SV3Du isn’t included here.
The key observation is that SV3Dp maintains superior performance on dynamic orbits, demonstrating the
flexibility and controllability of our approach.

[SLIDE 18: Quantitative Results: OmniObject3D Dataset]


We also evaluated on the OmniObject3D dataset to test generalization to different objects. Again, our SV3D
models significantly outperform prior methods across all metrics, with improvements of around 2-4 dB in PSNR
and substantial reductions in LPIPS.

[SLIDE 19: Per-Frame Quality Analysis]


This graph shows LPIPS metric versus frame number for different methods on the GSO dataset. SV3D produces
better quality (lower LPIPS) at every frame compared to all baselines.
We observe that quality is generally worse around the back view (middle frames) and better near the
conditioning image (beginning and end frames). This pattern is intuitive - the model has most information near
the input viewpoint.

[SLIDE 20: Visual Comparison]


Looking at visual results, we can see that SV3D generates novel multi-views that are more detailed, more
faithful to the conditioning image, and more consistent across different viewpoints. Note how our approach
better preserves object geometry compared to methods like Zero123XL, which often distorts shapes.
The examples show results on GSO, OmniObject3D, and real-world images, demonstrating that our method
works well across diverse datasets.


[SLIDE 21: User Study on Real-World Images]


We also conducted a user study to evaluate performance on real-world images. 30 users evaluated orbital videos
generated from 22 real-world images, comparing SV3D with other state-of-the-art methods.
The results were overwhelming - users preferred SV3D-generated videos over:

• Zero123XL: 96% of the time


• Stable Zero123: 99% of the time
• EscherNet: 96% of the time

• Free3D: 98% of the time


This demonstrates the real-world applicability of our approach beyond synthetic datasets.
Now I’ll hand it over to Presenter 4, who will explain how we use these multi-view synthesis results for 3D
generation.


4 3D Generation Model (Presenter 4) - 7 minutes


[SLIDE 22: 3D Generation Pipeline]
Thank you, Presenter 3. Now I’ll explain how we leverage our SV3D multi-view synthesis for high-quality 3D
generation.
As shown in the diagram, our 3D optimization pipeline has two main stages:
On the left, we start with SV3D’s novel multi-view synthesis, generating views along the reference orbit.
In the middle, we have a coarse NeRF stage where we optimize a Neural Radiance Field to represent the 3D
object based on the generated views.
On the right, we have a fine DMTet stage where we extract and refine a mesh representation, applying both
photometric losses and a masked SDS loss, which I’ll explain shortly.

[SLIDE 23: 3D Generation Pipeline Details]


Our approach follows a coarse-to-fine training strategy:
In the coarse stage, we train an Instant-NGP NeRF to reconstruct the SV3D-generated images at a lower
resolution. This captures the general shape and appearance of the object.
In the fine stage, we extract a mesh from the trained NeRF using marching cubes, then adopt a DMTet
representation to refine the mesh at high resolution. We apply both photometric losses based on the SV3D-
generated views and SDS-based diffusion guidance for unseen regions.

[SLIDE 24: Disentangled Illumination Model]


One challenge we encountered is that SV3D-generated videos often have consistent but baked-in lighting, which
isn’t ideal for 3D modeling. To address this, we fit a simple illumination model using 24 Spherical Gaussians
(SG).
A Spherical Gaussian at query location x is defined by:

    G(x; µ, s, a) = a · e^{s(µ·x − 1)}    (4)

where µ ∈ R³ is the axis, s ∈ R the sharpness, and a ∈ R the amplitude. We only model white light with scalar amplitudes.
For Lambertian shading, we approximate the cosine term with another SG:

Gcosine = G(x; n, 2.133, 1.17) (5)


The lighting evaluation sums the contributions from all SGs:

    L = (1/π) · Σ_{i=1}^{24} max(G_i · G_cosine, 0)    (6)
This allows us to disentangle illumination from the base color of the object.
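
Read literally, Equations (4)-(6) amount to the small NumPy sketch below; the 24 SG parameters would come from fitting to the SV3D frames, so the ones used here are placeholders:

```python
import numpy as np

def sg(x, axis, sharpness, amplitude):
    """Spherical Gaussian G(x; mu, s, a) = a * exp(s * (mu . x - 1)), Eq. (4); unit vectors."""
    return amplitude * np.exp(sharpness * (np.dot(axis, x) - 1.0))

def shade(x, normal, light_sgs):
    """Eqs. (5)-(6): cosine term approximated by an SG with axis n, sharpness 2.133,
    amplitude 1.17; sum of the 24 white-light SG contributions, divided by pi."""
    g_cos = sg(x, normal, 2.133, 1.17)
    return sum(max(sg(x, mu, s, a) * g_cos, 0.0) for (mu, s, a) in light_sgs) / np.pi

# Usage with 24 placeholder (unfitted) SGs:
rng = np.random.default_rng(0)
axes = rng.normal(size=(24, 3)); axes /= np.linalg.norm(axes, axis=1, keepdims=True)
lights = [(a, 5.0, 0.5) for a in axes]
L = shade(np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, 1.0]), lights)
```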

[SLIDE 25: Optimization Strategies and Losses]


For 3D reconstruction, we apply several loss functions:
First, photometric losses treat SV3D-generated images as pseudo-ground truth and include:
• MSE loss at the pixel level: L_MSE = ‖Î − I‖²

• Mask loss for object silhouettes: L_mask = ‖S − Ŝ‖²

• LPIPS perceptual loss: L_LPIPS = LPIPS(Î, I)

For training orbits, we generate SV3D multi-view images following a reference orbit. We use a dynamic
orbit with sine-varying elevation to ensure coverage of top and bottom views:
    e_i = e_0 + A · sin(2π · i/K)    (7)
with A ≈ 30 degrees, which is crucial for complete 3D reconstruction.
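
A hedged PyTorch sketch of the photometric terms above is shown below, using the `lpips` package as the perceptual loss; the relative weights are illustrative, since the script does not specify them:

```python
import torch
import lpips                                             # pip install lpips

lpips_fn = lpips.LPIPS(net="vgg")

def photometric_loss(rendered, target, rendered_mask, target_mask):
    """rendered/target: (N, 3, H, W) images in [0, 1]; masks: (N, 1, H, W) silhouettes.
    SV3D-generated frames act as pseudo ground truth."""
    l_mse = torch.mean((rendered - target) ** 2)
    l_mask = torch.mean((rendered_mask - target_mask) ** 2)
    l_lpips = lpips_fn(rendered * 2 - 1, target * 2 - 1).mean()    # LPIPS expects [-1, 1]
    return l_mse + l_mask + 0.1 * l_lpips                          # assumed weighting
```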


[SLIDE 26: SV3D-Based SDS Loss]


To further improve reconstruction, especially for unseen regions, we use Score Distillation Sampling (SDS) loss.
The process works as follows:
1. Sample a random camera orbit π_rand ∉ π_ref
2. Render views Ĵ using our current 3D model with parameters θ
3. Add noise ε at level t to the latent embedding z_t of Ĵ
4. Apply the SDS loss:

    L_sds = w(t) · (ε_φ(z_t; I, π_rand, t) − ε) · ∂Ĵ/∂θ    (8)

where w(t) = σ_t² is a t-dependent weight. The SDS loss helps complete unseen regions that aren't visible in the original multi-view images.
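
In practice the SDS gradient of Eq. (8) is usually injected through a stop-gradient surrogate, as in the sketch below; `sv3d_denoiser` and the DDPM-style noise schedule are hypothetical stand-ins, not the actual SV3D interface:

```python
import torch

def sds_loss(latents, cond_image, camera_orbit, sv3d_denoiser, alphas_cumprod):
    """latents: encoded renders z of the current 3D model (gradients flow back to theta).
    Returns a surrogate whose gradient w.r.t. `latents` equals w(t) * (eps_pred - eps)."""
    t = torch.randint(50, 950, (1,), device=latents.device)        # random noise level
    eps = torch.randn_like(latents)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a_t.sqrt() * latents + (1 - a_t).sqrt() * eps          # forward diffusion
    with torch.no_grad():                                          # no gradient through the UNet
        eps_pred = sv3d_denoiser(noisy, t, cond_image, camera_orbit)
    grad = (1 - a_t) * (eps_pred - eps)                            # ~ sigma_t^2 weighting
    return (grad.detach() * latents).sum()                         # d/d(latents) = grad
```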

[SLIDE 27-28: Masked SDS Loss]


However, we found that applying naive SDS can cause unstable training and unfaithful texture. Our solution
is a masked SDS loss that only applies to unseen or occluded areas.
For each point p on the rendered surface from a random view, we:
1. Get the surface normal n at point p
2. Calculate view directions v_i from the reference cameras:

    v_i = (π̄_i^ref − p) / ‖π̄_i^ref − p‖    (9)

where π̄_i^ref ∈ R³ is the 3D position of reference camera i.

3. Find the reference camera c with maximum visibility likelihood:

    c = argmax_i (v_i · n)    (10)

4. Create a pseudo visibility mask using a smoothstep function:

    M = 1 − f_s(v_c · n; 0, 0.5)    (11)

where f_s(x; f_0, f_1) = x̂² (3 − 2x̂), with x̂ = (x − f_0)/(f_1 − f_0).

5. Apply the masked SDS loss:

    L_mask-sds = M · L_sds    (12)
This approach preserves visible texture quality while filling in unseen surfaces, resulting in more stable
optimization.
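
The visibility mask of Eqs. (9)-(11) for a single surface point can be sketched as follows; the clamp inside the smoothstep is an assumption of the standard definition:

```python
import numpy as np

def smoothstep(x, f0=0.0, f1=0.5):
    """f_s(x; f0, f1) = x_hat^2 (3 - 2 x_hat), with x_hat = clip((x - f0)/(f1 - f0), 0, 1)."""
    x_hat = np.clip((x - f0) / (f1 - f0), 0.0, 1.0)
    return x_hat * x_hat * (3.0 - 2.0 * x_hat)

def visibility_mask(p, n, ref_cam_positions):
    """Eqs. (9)-(11): view directions from each reference camera to point p, pick the
    camera best aligned with the normal n, then M = 1 - f_s(v_c . n; 0, 0.5)."""
    v = ref_cam_positions - p                            # (N_ref, 3) directions toward cameras
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    dots = v @ n                                         # v_i . n for every reference camera
    c = int(np.argmax(dots))                             # most-visible reference camera (Eq. 10)
    return 1.0 - smoothstep(dots[c])                     # ~1 for unseen points, ~0 for well-covered

# The mask then multiplies the SDS loss, L_mask-sds = M * L_sds (Eq. 12).
M = visibility_mask(np.zeros(3), np.array([0.0, 0.0, 1.0]),
                    np.array([[2.0, 0.0, 2.0], [0.0, 2.0, -1.0]]))
```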

[SLIDE 29: Geometric Priors]


We also add geometric priors to regularize the 3D reconstruction:
• A smooth depth loss from RegNeRF:

    L_depth(i, j) = (d(i, j) − d(i, j+1))² + (d(i, j) − d(i+1, j))²    (13)

• A bilateral normal smoothness loss:

    L_bilateral = e^{−3‖∇I‖} · 1 / (1 + ‖∇n‖)    (14)

• A normal estimation loss using normals predicted by Omnidata:

    L_normal = 1 − (n · n̄)    (15)

• A spatial albedo smoothness loss:

    L_albedo = (c_d(x) − c_d(x + ε))²    (16)
These priors collectively help produce high-quality meshes with clean surfaces and detailed features.
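
For concreteness, two of these priors, the smooth-depth term (Eq. 13) and the normal-estimation term (Eq. 15), could be implemented on rendered buffers roughly as follows; weights and the remaining terms are omitted:

```python
import torch

def smooth_depth_loss(depth):
    """Eq. (13): squared differences between horizontally/vertically adjacent depth values.
    depth: (H, W) rendered depth map."""
    dh = (depth[:, :-1] - depth[:, 1:]) ** 2             # d(i, j) - d(i, j+1)
    dv = (depth[:-1, :] - depth[1:, :]) ** 2             # d(i, j) - d(i+1, j)
    return dh.mean() + dv.mean()

def normal_estimation_loss(rendered_normals, reference_normals):
    """Eq. (15): 1 - n . n_bar, averaged over pixels; both (H, W, 3), unit length
    (reference normals predicted by Omnidata)."""
    return (1.0 - (rendered_normals * reference_normals).sum(dim=-1)).mean()
```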
Now I’ll pass to Presenter 5, who will show the results of our 3D generation approach.


5 3D Generation Results and Conclusion (Presenter 5) - 8 minutes


[SLIDE 30: Evaluation Setup for 3D Generation]
Thank you, Presenter 4. Let me walk you through the results of our 3D generation approach.
For evaluation, we used 50 randomly sampled objects from the GSO dataset. We measured performance
using both 2D metrics (LPIPS, PSNR, SSIM, MSE, CLIP-S) and 3D metrics (Chamfer Distance and 3D IoU).
We compared against a wide range of baseline methods including Point-E, Shap-E, One-2-3-45++, Dream-
Gaussian, SyncDreamer, EscherNet, Free3D, and Stable Zero123.
In terms of computation, our coarse stage takes about 2 minutes, the full pipeline without SDS takes 8
minutes, and the full pipeline with SDS takes 20 minutes.

[SLIDE 31: Quantitative Results for 3D Generation: 2D Metrics]


This table shows the 2D metrics for our 3D generation results. Our SV3D models substantially outperform
prior methods across all metrics. The best results come from our SV3Dp model with masked SDS loss, showing
the benefits of both our novel multi-view synthesis approach and the specialized losses we developed.
Note that even without SDS loss, our approach already outperforms previous methods, indicating the high
quality of the multi-view images generated by SV3D.

[SLIDE 32: Quantitative Results for 3D Generation: 3D Metrics]


Looking at 3D metrics, the story is similar. Our methods achieve significantly better Chamfer Distance and 3D
IoU scores compared to all prior approaches. The best performance comes from SV3Dp with our masked SDS
loss, but even the SV3Dp without SDS performs very well.
Interestingly, our results approach the quality of GT renders, showing how effective our pipeline is at
reconstructing detailed 3D geometry.

[SLIDE 33: Visual Comparison of 3D Meshes]


The visual comparisons make the improvements even clearer. SV3D produces 3D meshes that are:
• More detailed (observe fine texture and geometry)
• More faithful to input images (color and shape)
• More consistent in 3D geometry (fewer artifacts)
• Higher quality from various viewing angles
Look at the examples of the stone column and horse - our reconstructions preserve fine details that other
methods miss or distort.

[SLIDE 34: Real-World 3D Results]


Our approach also generalizes well to real-world in-the-wild images. These examples show reconstructions of
objects like strawberries, shoes, and toys from single photographs. The results maintain accurate shape and
detailed features across diverse object categories, demonstrating the practical applicability of our method.

[SLIDE 35: Ablation Studies: Static vs. Dynamic Orbits]


We conducted several ablation studies to understand the impact of different components. This one examines
the effect of training orbits on reconstruction quality.
We found that using a dynamic orbit with moderate elevation variation (sine-30) produces better 3D outputs
than a static orbit, with lower Chamfer Distance and higher 3D IoU. This makes sense as dynamic orbits cover
top and bottom views, providing more complete information for reconstruction.
However, too much elevation variation (sine-50) can increase inconsistency between views. A moderate range of about ±30° provides optimal results.

Training orbit    CD ↓      3D IoU ↑

Static            0.028     0.610
Sine-30           0.024     0.614
Sine-50           0.026     0.609


[SLIDE 36: Summary of Contributions]


To summarize our contributions:
We introduced SV3D, a latent video diffusion model for high-resolution, controllable novel multi-view syn-
thesis and high-quality 3D generation from a single image.
Our key technical contributions include:
• Camera pose conditioning in video diffusion models
• Triangle CFG scaling for improved view consistency
• Disentangled illumination modeling for clean textures

• Masked SDS loss for handling unseen regions


We demonstrated state-of-the-art performance on both novel multi-view synthesis and 3D reconstruction
across multiple metrics and datasets.

[SLIDE 37: Limitations]


Despite these advances, our approach has some limitations:
SV3D currently handles only 2 degrees of freedom (elevation and azimuth), which is sufficient for single-image
3D generation but limiting for more complex camera movements.
Our method exhibits inconsistency for mirror-like reflective surfaces, which challenge 3D reconstruction.
The current Lambertian reflection model is insufficient for such surfaces.
Training requires substantial computational resources - 32 A100 GPUs for 6 days - which could be a barrier
to wider adoption.

[SLIDE 38: Future Work]


Looking ahead, we see several promising directions for future work:

• Advanced camera control: Conditioning SV3D on full camera matrices and supporting more degrees
of freedom in camera movement.
• Enhanced shading models: Extending beyond Lambertian reflection to better handle specular and
reflective surfaces.

• Direct 3D conditioning: Integrating 3D representations directly into the diffusion model for end-to-end
optimization and even better consistency.
• Wider applications: Extending to scene-level 3D generation and animation of deformable objects.

[Closing]
Thank you for your attention! We believe SV3D represents a significant step forward in single-image 3D
reconstruction by leveraging the power of video diffusion models. We’re excited to see how this approach can
be extended and applied in the future. We’ll be happy to take any questions now.
