Generative Camera Dolly:
Extreme Monocular Dynamic Novel View Synthesis

GCD Logo

1Columbia University

2Stanford University

3Toyota Research Institute


In Submission

Summary

We present GCD (short for Generative Camera Dolly), a framework for synthesizing large-angle novel viewpoints of dynamic scenes from a single monocular video. Specifically, given any color video, along with precise instructions on how to rotate and/or translate the camera, our model can imagine what that same scene would look like from another perspective. Much like a camera dolly in film-making, our approach essentially conceives a virtual camera that can move around freely, reveal portions of the environment that are otherwise unseen, and reconstruct hidden objects behind occlusions, all within complex scenes whose contents may be moving. We believe our framework can potentially unlock powerful applications in rich dynamic scene understanding, perception for robotics, and interactive 3D video viewing experiences for virtual reality.

Method

We train a neural network to predict all frames corresponding to the target viewpoint, conditioned on the input video plus relative camera pose parameters that describe the spatial relationship between the source and target extrinsics. The camera transformation is simply calculated as \( \Delta \mathcal{E} = \mathcal{E}_{src}^{-1} \cdot \mathcal{E}_{dst} \). In practice, we encode these parameters as a rotation (azimuth, elevation) and translation (radius) vector. We teach Stable Video Diffusion, a state-of-the-art diffusion model for image-to-video generation, to accept and utilize these new controls by means of finetuning.
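For illustration, the sketch below shows one way such a conditioning signal could be computed. It is a minimal reconstruction based only on the description above, assuming 4x4 world-to-camera extrinsic matrices and a scene roughly centered at the world origin; all function names are our own and not part of the released code.

# Minimal sketch (not the exact released code) of the camera conditioning.
# Assumes 4x4 world-to-camera extrinsics and a scene centered at the origin.
import numpy as np

def relative_pose(E_src: np.ndarray, E_dst: np.ndarray) -> np.ndarray:
    """Relative transformation Delta_E = E_src^{-1} @ E_dst (both 4x4)."""
    return np.linalg.inv(E_src) @ E_dst

def spherical_params(E: np.ndarray) -> tuple:
    """Express a camera position as (azimuth, elevation, radius) w.r.t. the
    world origin, using the camera center recovered from the extrinsics."""
    R, t = E[:3, :3], E[:3, 3]
    cam_center = -R.T @ t                        # camera position in world coordinates
    x, y, z = cam_center
    radius = np.linalg.norm(cam_center)
    azimuth = np.arctan2(y, x)                   # rotation around the up axis
    elevation = np.arcsin(z / max(radius, 1e-8))
    return azimuth, elevation, radius

def camera_condition(E_src: np.ndarray, E_dst: np.ndarray) -> np.ndarray:
    """Conditioning vector: deltas in azimuth, elevation, and radius."""
    az_s, el_s, r_s = spherical_params(E_src)
    az_d, el_d, r_d = spherical_params(E_dst)
    return np.array([az_d - az_s, el_d - el_s, r_d - r_s], dtype=np.float32)

During finetuning, this three-dimensional vector would simply be injected alongside the usual image conditioning of the video diffusion model.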

Representative Results

Although our model is trained on synthetic multi-view video data only, experiments show promising results in multiple domains, including robotics, object permanence, and driving environments. We showcase a mixture of in-domain as well as out-of-distribution (real-world) results. While zero-shot generalization is highly challenging and not the focus of our work, we demonstrate that our model can successfully tackle some of these real-world videos.

Amodal Completion and Object Permanence

Partial and total occlusions are very common in everyday dynamic scenes. Our network is capable of inpainting the occluded parts of objects and scenes. In the two examples below, the input camera sits at a low elevation angle, so the higher output viewpoint requires the model to correctly reconstruct the objects farther in the back. Note the paper towel roll and the brown bucket in particular.

A more advanced spatiotemporal reasoning ability is needed for objects that become completely occluded throughout the video. Our model successfully persists them in the next two examples, a skill known as object permanence. In the first video, the blue duck and the red duck disappear behind a hand and a tea box, respectively.

In this second video, the brown shoe falling to the left is temporarily hidden by the purple bag, but the output reflects an accurate continuation of its dynamics, shape, and appearance until it reappears in the observation.

Driving Scene Completion (Color + Semantic)

In embodied AI, including for autonomous vehicles, situational awareness is paramount. In this environment, we trained our model to synthesize a top-down-and-forward perspective that can give the ego car (on which only a single RGB sensor has to be mounted) a much more complete, detailed overview of its surroundings. Note how the white car on the left and the two pedestrians on the right are still visible in the generated video, despite going out-of-frame with respect to the input camera.

Our framework is in principle capable of performing any dense prediction computer vision task, as long as training annotations are available. In this example, we classify every pixel from the novel viewpoint into its corresponding semantic category.
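One simple way to repurpose an RGB video generator for such dense prediction (not necessarily our exact recipe) is to encode each semantic category as a fixed color, train on the resulting label videos, and decode generated frames back to class indices via the nearest palette color. The palette below is purely illustrative.

# Illustrative sketch: treat semantic label maps as ordinary color frames.
import numpy as np

PALETTE = np.array([          # hypothetical category -> RGB mapping
    [128,  64, 128],          # road
    [ 70,  70,  70],          # building
    [  0,   0, 142],          # vehicle
    [220,  20,  60],          # pedestrian
    [107, 142,  35],          # vegetation
], dtype=np.float32)

def labels_to_rgb(labels: np.ndarray) -> np.ndarray:
    """(H, W) integer class map -> (H, W, 3) RGB frame for training."""
    return PALETTE[labels]

def rgb_to_labels(frame: np.ndarray) -> np.ndarray:
    """(H, W, 3) generated frame -> (H, W) class map via nearest palette color."""
    dists = np.linalg.norm(frame[..., None, :] - PALETTE[None, None], axis=-1)
    return np.argmin(dists, axis=-1)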

ParallelDomain-4D Category Legend

As before, we can also control the camera viewpoint here in a fine-grained fashion. The angles are chosen randomly for demonstration purposes.


The above driving scenarios are in fact synthetic (from the ParallelDomain engine) -- next, we qualitatively visualize a couple of real-world results (from the TRI-DDAD dataset, which was unseen during training).


Gradual vs. Jumpy Trajectories

In the gallery below, we perform dynamic view synthesis while sweeping over camera control angles. The input video is in the leftmost column, and the columns to its right are outputs generated by the model for total azimuth displacements of 30, 60, and 90 degrees respectively (with respect to the center of the scene). Moreover, we depict two different model variants for each example:

  1. Gradual (top row): the camera path is linearly interpolated from the source to the target viewpoint over the course of the output video.
  2. Jumpy (bottom row): the camera is displaced directly, such that the entire output video is synthesized from the desired target viewpoint.
We observe that the gradual model generates results that are more consistent with the input video. In contrast, the jumpy model often introduces more hallucination, especially for moving objects, whose dynamics and appearance tend to diverge from the original scene. Numerical experiments confirm that the gradual model performs better overall than the direct model, including for large camera movements, when comparing only the last frame for fairness.
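To make the distinction concrete, the sketch below (our own illustration, not the released training code) shows one way the per-frame camera offset could be scheduled under each variant; the function name and the 14-frame output length in the example are assumptions.

# Illustrative sketch of the two conditioning schemes compared above.
import numpy as np

def camera_schedule(delta: np.ndarray, num_frames: int, mode: str) -> np.ndarray:
    """delta: (3,) target (azimuth, elevation, radius) offset.
    Returns a (num_frames, 3) per-frame camera offset."""
    if mode == "gradual":
        # Linearly ramp from no displacement to the full displacement.
        weights = np.linspace(0.0, 1.0, num_frames)[:, None]
        return weights * delta[None, :]
    elif mode == "jumpy":
        # Every output frame is rendered from the final target viewpoint.
        return np.tile(delta[None, :], (num_frames, 1))
    raise ValueError(f"unknown mode: {mode}")

# Example: 30 degrees of azimuth, no elevation/radius change, 14 output frames.
schedule = camera_schedule(np.array([np.deg2rad(30.0), 0.0, 0.0]), 14, "gradual")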

Datasets

We contribute two new multi-view video datasets for training and evaluation: Kubric-4D and ParallelDomain-4D. More details coming soon!

Paper

Abstract
Accurate reconstruction of complex dynamic scenes from just a single viewpoint continues to be a challenging task in computer vision. Current dynamic novel view synthesis methods typically require videos from many different camera viewpoints, necessitating careful recording setups, and significantly restricting their utility in the wild as well as in terms of embodied AI applications. In this paper, we propose GCD, a controllable monocular dynamic view synthesis pipeline that leverages large-scale diffusion priors to, given a video of any scene, generate a synchronous video from any other chosen perspective, conditioned on a set of relative camera pose parameters. Our model does not require depth as input, and does not explicitly model 3D scene geometry, instead performing end-to-end video-to-video translation in order to achieve its goal efficiently. Despite being trained on synthetic multi-view video data only, zero-shot real-world generalization experiments show promising results in multiple domains, including robotics, object permanence, and driving environments. We believe our framework can potentially unlock powerful applications in rich dynamic scene understanding, perception for robotics, and interactive 3D video viewing experiences for virtual reality.
BibTeX Citation
@article{vanhoorick2024gcd,
  title={Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis},
  author={Van Hoorick, Basile and Wu, Rundi and Ozguroglu, Ege and Sargent, Kyle and Liu, Ruoshi and Tokmakov, Pavel and Dave, Achal and Zheng, Changxi and Vondrick, Carl},
  journal={arXiv},
  year={2024}
}

More Results

Success Cases

Failure Cases

Acknowledgements

This research is based on work partially supported by the NSF CAREER Award #2046910 and the NSF Center for Smart Streetscapes (CS3) under NSF Cooperative Agreement No. EEC-2133516. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors. The webpage template was inspired by this project page.