Generative Camera Dolly:
Extreme Monocular Dynamic Novel View Synthesis

GCD Logo

1Columbia University

2Stanford University

3Toyota Research Institute


In Submission

Summary

We present GCD (short for Generative Camera Dolly), a framework for synthesizing large-angle novel viewpoints of dynamic scenes from a single monocular video. Specifically, given any color video, along with precise instructions on how to rotate and/or translate the camera, our model can imagine what that same scene would look like from another perspective. Much like a camera dolly in film-making, our approach essentially conceives a virtual camera that can move around freely, reveal portions of the environment that are otherwise unseen, and reconstruct hidden objects behind occlusions, all within complex scenes whose contents may be moving. We believe our framework can potentially unlock powerful applications in rich dynamic scene understanding, perception for robotics, and interactive 3D video viewing experiences for virtual reality.

Method

We train a neural network to predict all frames corresponding to the target viewpoint, conditioned on the input video plus relative camera pose parameters that describe the spatial relationship between the source and target extrinsics. The camera transformation is simply calculated as \( \Delta \mathcal{E} = \mathcal{E}_{src}^{-1} \cdot \mathcal{E}_{dst} \). In practice, we encode these parameters as a rotation (azimuth, elevation) and translation (radius) vector. We teach Stable Video Diffusion, a state-of-the-art diffusion model for image-to-video generation, to accept and utilize these new controls by means of finetuning.
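For illustration, the sketch below shows one way such a conditioning signal could be computed. It is a minimal reconstruction based only on the description above, assuming 4x4 world-to-camera extrinsic matrices and a scene roughly centered at the world origin; all function names are our own and not part of the released code.

# Minimal sketch (not the exact released code) of the camera conditioning.
# Assumes 4x4 world-to-camera extrinsics and a scene centered at the origin.
import numpy as np

def relative_pose(E_src: np.ndarray, E_dst: np.ndarray) -> np.ndarray:
    """Relative transformation Delta_E = E_src^{-1} @ E_dst (both 4x4)."""
    return np.linalg.inv(E_src) @ E_dst

def spherical_params(E: np.ndarray) -> tuple:
    """Express a camera position as (azimuth, elevation, radius) w.r.t. the
    world origin, using the camera center recovered from the extrinsics."""
    R, t = E[:3, :3], E[:3, 3]
    cam_center = -R.T @ t                        # camera position in world coordinates
    x, y, z = cam_center
    radius = np.linalg.norm(cam_center)
    azimuth = np.arctan2(y, x)                   # rotation around the up axis
    elevation = np.arcsin(z / max(radius, 1e-8))
    return azimuth, elevation, radius

def camera_condition(E_src: np.ndarray, E_dst: np.ndarray) -> np.ndarray:
    """Conditioning vector: deltas in azimuth, elevation, and radius."""
    az_s, el_s, r_s = spherical_params(E_src)
    az_d, el_d, r_d = spherical_params(E_dst)
    return np.array([az_d - az_s, el_d - el_s, r_d - r_s], dtype=np.float32)

During finetuning, this three-dimensional vector would simply be injected alongside the usual image conditioning of the video diffusion model.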

Representative Results

Although our model is trained on synthetic multi-view video data only, experiments show promising results in multiple domains, including robotics, object permanence, and driving environments. We showcase a mixture of in-domain as well as out-of-distribution (real-world) results. While zero-shot generalization is highly challenging and not the focus of our work, we demonstrate that our model can successfully tackle some of these real-world videos.

Amodal Completion and Object Permanence

Partial and total occlusions are very common in everyday dynamic scenes. Our network is capable of inpainting the occluded parts of objects and scenes. In the two examples below, the input camera sits at a low elevation angle, so the higher output viewpoint requires the model to correctly reconstruct the objects farther in the back. Note the paper towel roll and the brown bucket in particular.

A more advanced spatiotemporal reasoning ability is needed for objects that become completely occluded throughout the video. Our model successfully persists them in the next two examples, a skill known as object permanence. In the first video, the blue duck and the red duck disappear behind a hand and a tea box, respectively.

In this second video, the brown shoe falling to the left is temporarily hidden by the purple bag, but the output reflects an accurate continuation of its dynamics, shape, and appearance until it reappears in the observation.

Driving Scene Completion (Color + Semantic)

In embodied AI, including for autonomous vehicles, situational awareness is paramount. In this environment, we trained our model to synthesize a top-down-and-forward perspective that can give the ego car (on which only a single RGB sensor has to be mounted) a much more complete, detailed overview of its surroundings. Note how the white car on the left and the two pedestrians on the right are still visible in the generated video, despite going out-of-frame with respect to the input camera.

Our framework is in principle capable of performing any dense prediction computer vision task, as long as training annotations are available. In this example, we classify every pixel from the novel viewpoint into its corresponding semantic category.
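One simple way to repurpose an RGB video generator for such dense prediction (not necessarily our exact recipe) is to encode each semantic category as a fixed color, train on the resulting label videos, and decode generated frames back to class indices via the nearest palette color. The palette below is purely illustrative.

# Illustrative sketch: treat semantic label maps as ordinary color frames.
import numpy as np

PALETTE = np.array([          # hypothetical category -> RGB mapping
    [128,  64, 128],          # road
    [ 70,  70,  70],          # building
    [  0,   0, 142],          # vehicle
    [220,  20,  60],          # pedestrian
    [107, 142,  35],          # vegetation
], dtype=np.float32)

def labels_to_rgb(labels: np.ndarray) -> np.ndarray:
    """(H, W) integer class map -> (H, W, 3) RGB frame for training."""
    return PALETTE[labels]

def rgb_to_labels(frame: np.ndarray) -> np.ndarray:
    """(H, W, 3) generated frame -> (H, W) class map via nearest palette color."""
    dists = np.linalg.norm(frame[..., None, :] - PALETTE[None, None], axis=-1)
    return np.argmin(dists, axis=-1)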

ParallelDomain-4D Category Legend

As before, we can also control the camera viewpoint here in a fine-grained fashion. The angles are chosen randomly for demonstration purposes.


The above driving scenarios are in fact synthetic (from the ParallelDomain engine) -- next, we qualitatively visualize a couple of real-world results (from the TRI-DDAD dataset, which was unseen during training).


Gradual vs. Jumpy Trajectories

In the gallery below, we perform dynamic view synthesis while sweeping over camera control angles. The input video is in the leftmost column, and the columns to its right are outputs generated by the model for total azimuth displacements of 30, 60, and 90 degrees respectively (with respect to the center of the scene). Moreover, we depict two different model variants for each example:

  1. Gradual (top row): the camera path is linearly interpolated from the source to the target viewpoint over the course of the output video.
  2. Jumpy (bottom row): the camera is displaced directly, such that the entire output video is synthesized from the desired target viewpoint.
We observe that the gradual model generates results that are more consistent with the input video. In contrast, the jumpy model often introduces more hallucination, especially for moving objects, whose dynamics and appearance tend to diverge from the original scene. Numerical experiments confirm that the gradual model performs better overall than the direct model, including for large camera movements, when comparing only the last frame for fairness.
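To make the distinction concrete, the sketch below (our own illustration, not the released training code) shows one way the per-frame camera offset could be scheduled under each variant; the function name and the 14-frame output length in the example are assumptions.

# Illustrative sketch of the two conditioning schemes compared above.
import numpy as np

def camera_schedule(delta: np.ndarray, num_frames: int, mode: str) -> np.ndarray:
    """delta: (3,) target (azimuth, elevation, radius) offset.
    Returns a (num_frames, 3) per-frame camera offset."""
    if mode == "gradual":
        # Linearly ramp from no displacement to the full displacement.
        weights = np.linspace(0.0, 1.0, num_frames)[:, None]
        return weights * delta[None, :]
    elif mode == "jumpy":
        # Every output frame is rendered from the final target viewpoint.
        return np.tile(delta[None, :], (num_frames, 1))
    raise ValueError(f"unknown mode: {mode}")

# Example: 30 degrees of azimuth, no elevation/radius change, 14 output frames.
schedule = camera_schedule(np.array([np.deg2rad(30.0), 0.0, 0.0]), 14, "gradual")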

Datasets

We contribute two new multi-view video datasets for training and evaluation: Kubric-4D and ParallelDomain-4D. More details coming soon!

Paper

Abstract
Accurate reconstruction of complex dynamic scenes from just a single viewpoint continues to be a challenging task in computer vision. Current dynamic novel view synthesis methods typically require videos from many different camera viewpoints, necessitating careful recording setups, and significantly restricting their utility in the wild as well as in terms of embodied AI applications. In this paper, we propose GCD, a controllable monocular dynamic view synthesis pipeline that leverages large-scale diffusion priors to, given a video of any scene, generate a synchronous video from any other chosen perspective, conditioned on a set of relative camera pose parameters. Our model does not require depth as input, and does not explicitly model 3D scene geometry, instead performing end-to-end video-to-video translation in order to achieve its goal efficiently. Despite being trained on synthetic multi-view video data only, zero-shot real-world generalization experiments show promising results in multiple domains, including robotics, object permanence, and driving environments. We believe our framework can potentially unlock powerful applications in rich dynamic scene understanding, perception for robotics, and interactive 3D video viewing experiences for virtual reality.
BibTeX Citation
@article{vanhoorick2024gcd,
  title={Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis},
  author={Van Hoorick, Basile and Wu, Rundi and Ozguroglu, Ege and Sargent, Kyle and Liu, Ruoshi and Tokmakov, Pavel and Dave, Achal and Zheng, Changxi and Vondrick, Carl},
  journal={arXiv},
  year={2024}
}

More Results

Success Cases

Failure Cases

Acknowledgements

This research is based on work partially supported by the NSF CAREER Award #2046910 and the NSF Center for Smart Streetscapes (CS3) under NSF Cooperative Agreement No. EEC-2133516. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors. The webpage template was inspired by this project page.