Can we use video diffusion models for novel view synthesis, similar to Neural Radiance Fields or Gaussian Splatting, without retraining? Pablo Vela built a Hugging Face Space using the Rerun Gradio plugin that explores exactly this. NVS-solver uses warped views and camera poses to modulate the video diffusion process, enabling consistent view generation in the single-view, multi-view, and monocular-video settings.

Focusing on the single-view case, the pipeline looks as follows:

1. A camera trajectory is generated to specify which views we want the video diffusion network to synthesize.
2. A monocular depth estimation network (in this case DepthAnythingV2) estimates a depth map from the source image.
3. With the image, depth map, and camera trajectory in hand, forward warping places pixels from the source image into each destination view using bilinear splatting.
4. Finally, a video diffusion model (in this case Stable Video Diffusion) generates the views along the camera trajectory. This involves modulating the score function using the scene priors (warped images and camera poses) as guidance.

The critical point here is that this requires no retraining! Give it a whirl and check out the links in the comments:
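To make step 3 concrete, here is a minimal NumPy sketch of forward warping with bilinear splatting: each source pixel is unprojected with its depth, transformed by the relative camera pose, reprojected into the target view, and its color spread over the four nearest target pixels with bilinear weights. The function name `forward_warp` and its exact signature are hypothetical illustrations, not NVS-solver's actual code, which additionally handles occlusion and masking.

```python
import numpy as np

def forward_warp(src, depth, K, R, t):
    """Warp src into a target view via depth-based reprojection and bilinear splatting.

    src:   (H, W, 3) float image in the source view
    depth: (H, W) per-pixel depth in the source view
    K:     (3, 3) camera intrinsics (assumed shared by both views)
    R, t:  relative rotation (3, 3) and translation (3,) from source to target camera
    """
    H, W = depth.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Unproject pixels to 3D in the source camera frame, then move to the target frame.
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)
    pts = R @ pts + t.reshape(3, 1)

    # Project into the target image plane (perspective divide).
    proj = K @ pts
    u = proj[0] / proj[2]
    v = proj[1] / proj[2]

    out = np.zeros((H, W, 3))
    wsum = np.zeros((H, W))
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    fu, fv = u - u0, v - v0
    colors = src.reshape(-1, 3)

    # Splat each source color onto its 4 neighboring target pixels with bilinear weights.
    corners = [(0, 0, (1 - fu) * (1 - fv)), (1, 0, fu * (1 - fv)),
               (0, 1, (1 - fu) * fv), (1, 1, fu * fv)]
    for du, dv, w in corners:
        uu, vv = u0 + du, v0 + dv
        ok = (uu >= 0) & (uu < W) & (vv >= 0) & (vv < H)
        np.add.at(out, (vv[ok], uu[ok]), colors[ok] * w[ok, None])
        np.add.at(wsum, (vv[ok], uu[ok]), w[ok])

    # Normalize by accumulated weight; pixels with wsum == 0 are holes for the
    # diffusion model to fill in.
    out[wsum > 0] /= wsum[wsum > 0, None]
    return out, wsum
```

As a sanity check, an identity pose (R = I, t = 0) should reproduce the source image exactly, since every pixel splats back onto itself.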