1 Columbia University
2 Stanford University
3 Toyota Research Institute
We present GCD (for Generative Camera Dolly), a framework for synthesizing large-angle novel viewpoints of 4D dynamic scenes from a single monocular video. Specifically, given any color video, along with precise instructions on how to rotate and/or translate the camera, our model can imagine what that same scene would look like from another perspective. Much like a camera dolly in film-making, our approach essentially conceives a virtual camera that can move around freely, reveal portions of the environment that are otherwise unseen, and reconstruct hidden objects behind occlusions, all within complex dynamic scenes, even when the contents are moving. We believe our framework can potentially unlock powerful applications in rich dynamic scene understanding, perception for robotics, and interactive 3D video viewing experiences for virtual reality.
We train a neural network to predict all frames corresponding to the target viewpoint, conditioned on the input video plus relative camera pose parameters that describe the spatial relationship between the source and target extrinsics. The camera transformation is simply calculated as \( \Delta \mathcal{E} = \mathcal{E}_{src}^{-1} \cdot \mathcal{E}_{dst} \). In practice, we encode these parameters as a rotation (azimuth, elevation) and translation (radius) vector. We teach Stable Video Diffusion, a state-of-the-art diffusion model for image-to-video generation, to accept and utilize these new controls by means of finetuning.
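As a minimal sketch of the relative pose computation, the snippet below evaluates \( \Delta \mathcal{E} = \mathcal{E}_{src}^{-1} \cdot \mathcal{E}_{dst} \) and derives a relative (azimuth, elevation, radius) triple. It assumes 4x4 camera-to-world matrices, a z-up world, and a scene center at the origin; the text above fixes none of these conventions, and the function names are hypothetical rather than taken from the GCD code.

```python
import numpy as np


def relative_extrinsics(E_src: np.ndarray, E_dst: np.ndarray) -> np.ndarray:
    """Delta E = E_src^{-1} @ E_dst for 4x4 homogeneous matrices."""
    return np.linalg.inv(E_src) @ E_dst


def to_spherical(position: np.ndarray) -> tuple[float, float, float]:
    """Camera position -> (azimuth, elevation, radius), angles in radians.

    Assumption: z-up world with the scene center at the origin.
    """
    x, y, z = position
    radius = float(np.linalg.norm(position))
    azimuth = float(np.arctan2(y, x))
    elevation = float(np.arcsin(z / radius))
    return azimuth, elevation, radius


def spherical_delta(E_src: np.ndarray, E_dst: np.ndarray) -> tuple[float, float, float]:
    """Relative (azimuth, elevation, radius) between two camera poses."""
    az_s, el_s, r_s = to_spherical(E_src[:3, 3])
    az_d, el_d, r_d = to_spherical(E_dst[:3, 3])
    # Wrap the azimuth difference into [-pi, pi).
    d_az = (az_d - az_s + np.pi) % (2 * np.pi) - np.pi
    return float(d_az), el_d - el_s, r_d - r_s
```

Composing the source pose with the returned matrix recovers the target pose (`E_src @ relative_extrinsics(E_src, E_dst)` equals `E_dst`), which is a quick sanity check on whichever matrix convention is in use.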
Despite being trained on synthetic multi-view video data only, experiments show promising results in multiple domains, including robotics, object permanence, and driving environments. We showcase a mixture of in-domain as well as out-of-distribution (real-world) results. While zero-shot generalization is highly challenging and not the focus of our work, we demonstrate that our model can successfully tackle some of these videos.
Partial and total occlusions are very common in everyday dynamic scenes. Our network is capable of inpainting the occluded parts of objects and scenes. In the two examples below, the input camera resides at a low elevation angle, such that the higher output viewpoint implies having to correctly reconstruct the objects lying further in the back. Note the paper towel roll and the brown bucket in particular.
A more advanced spatiotemporal reasoning ability is needed for objects that become completely occluded during the video. Our model successfully persists them in the next two examples, a skill known as object permanence. In the first video, the blue duck and the red duck disappear behind a hand and a tea box, respectively.
In this second video, the brown shoe falling to the left is temporarily hidden by the purple bag, but the output reflects an accurate continuation of its dynamics, shape, and appearance before it reappears in the observation.
In embodied AI, including for autonomous vehicles, situational awareness is paramount. In this environment, we trained our model to synthesize a top-down-and-forward perspective that can give the ego car (on which only a single RGB sensor has to be mounted) a much more complete, detailed overview of its surroundings. Note how the white car on the left and the two pedestrians on the right are still visible in the generated video, despite going out-of-frame with respect to the input camera.
Our framework is in principle capable of running any dense predictive computer vision task as long as training annotations are available. In this example, we classify every pixel from the novel viewpoint into its corresponding semantic category.
As before, we can also control the camera viewpoint here in a fine-grained fashion. The angles are chosen randomly for demonstration purposes.
The above driving scenarios are in fact synthetic (from the ParallelDomain engine) -- next, we qualitatively visualize a couple of real-world results (from the TRI-DDAD dataset, which was unseen during training).
In the gallery below, we perform dynamic view synthesis while sweeping over camera control angles. The input video is in the leftmost column, and the three columns to the right are outputs generated by the model for a total azimuth displacement of 30, 60, and 90 degrees respectively (with respect to the center of the scene). Moreover, we depict two different model variants for each example:
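To make the azimuth sweep concrete, the sketch below rotates a camera position about the vertical axis through the scene center. It assumes a z-up world with the scene center at the origin; the starting position is a made-up illustrative value and `rotate_azimuth` is a hypothetical helper, not part of the released code.

```python
import math


def rotate_azimuth(position, degrees):
    """Rotate a camera position about the vertical (z) axis.

    Assumption: the scene center sits at the origin, so radius and
    height are unchanged and only the azimuth moves.
    """
    x, y, z = position
    a = math.radians(degrees)
    return (
        x * math.cos(a) - y * math.sin(a),
        x * math.sin(a) + y * math.cos(a),
        z,
    )


# Hypothetical source camera: 4 units from the center, 1.5 units high.
source = (4.0, 0.0, 1.5)
targets = {deg: rotate_azimuth(source, deg) for deg in (30, 60, 90)}
```

Each entry in `targets` keeps the same distance to the vertical axis and the same height as `source`, so the sweep changes only the viewing direction around the scene.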
Diffusion models are probabilistic, and hence draw samples from a conditional distribution of possible output videos, when conditioned on an input video and relative camera extrinsics matrix. This is probably why they generally create relatively sharp predictions. However, this also means that our model is able to generate multiple plausible hypotheses, due to a mixture of underlying epistemic and aleatoric uncertainty. Interestingly, we observe that the diversity among predicted samples is a function of both space and time, and often corresponds to what parts of the scene are intuitively more vs. less complex to resolve (i.e. mainly due to complex dynamics, occlusion, and/or being out-of-frame). Below, we showcase one example from Kubric-4D, and another from ParallelDomain-4D, where the purple heatmaps depict per-pixel uncertainty.
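One simple way to turn several sampled outputs into a per-pixel uncertainty heatmap is to measure how much the samples disagree at each location. The sketch below uses the standard deviation across samples, averaged over color channels. This is an assumption made for illustration: the text above does not state how the heatmaps were computed, and `per_pixel_uncertainty` is a hypothetical name.

```python
import numpy as np


def per_pixel_uncertainty(samples: np.ndarray) -> np.ndarray:
    """Disagreement among sampled videos, per frame and per pixel.

    samples: array of shape (S, T, H, W, C) holding S sampled videos.
    Returns an array of shape (T, H, W): the standard deviation over
    the S samples, averaged over the C color channels.
    """
    if samples.ndim != 5 or samples.shape[0] < 2:
        raise ValueError("expected shape (S, T, H, W, C) with S >= 2")
    return samples.std(axis=0).mean(axis=-1)
```

Because the result keeps the time axis, it can vary over both space and time, matching the observation that sample diversity depends on where and when in the video one looks.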
We contribute two new synthetic multi-view RGB-D video datasets for training and evaluation. When combined, the camera viewpoints in each scene provide a sufficiently dense, detailed coverage of the 4D scene. In the GCD data loading pipeline, we render merged point clouds from arbitrary pairs of poses to learn camera controls in both domains (see code for details).
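The merging step can be pictured as lifting each depth map into world space and concatenating the results. The sketch below assumes a pinhole camera with intrinsics `K`, depth measured along the optical axis, and 4x4 camera-to-world poses. These conventions and the function names are illustrative assumptions; the authors' code remains the reference for what the pipeline actually does.

```python
import numpy as np


def unproject(depth: np.ndarray, K: np.ndarray, cam_to_world: np.ndarray) -> np.ndarray:
    """Lift an (H, W) depth map to an (H*W, 3) world-space point cloud.

    Assumptions: pinhole intrinsics K (3x3), depth along the camera
    z axis, and a 4x4 camera-to-world pose.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T  # camera-space rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]


def merge_views(depths, Ks, poses) -> np.ndarray:
    """Concatenate the world-space point clouds of several views."""
    clouds = [unproject(d, K, P) for d, K, P in zip(depths, Ks, poses)]
    return np.concatenate(clouds, axis=0)
```

Rendering the merged cloud from a new pose would then require a projection and z-buffering step, which is omitted here.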
These 3000 scenes were generated with the Kubric simulator, and contain multi-object interactions with rich visual appearance and complicated dynamics. Each scene contains synchronized videos from 16 fixed camera viewpoints (4 high, 12 low) and 60 frames at a resolution of 576 x 384 and a frame rate of 24 FPS. The available modalities include: RGB, depth, optical flow, object coordinates, surface normals, and instance segmentation. Direct download links can be found here:
The visualization above only shows 13 out of the 16 available viewpoints for demonstration purposes.
To extract the full training set after downloading, run:
cat gcd_kubric4d_train.tar.gz.* | tar xvfz -
The entire dataset takes 7.0 TB of space in compressed form, and 7.8 TB after extraction.
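Where a shell pipeline is not convenient, the same extraction can be sketched in Python by presenting the split parts to `tarfile` as one continuous gzip stream. `ChainedReader` and `extract_split_archive` are hypothetical helpers written for this illustration, and sorting the matched file names is assumed to reproduce the order in which the shell expands the glob.

```python
import glob
import tarfile


class ChainedReader:
    """Read-only file-like object that presents several files as one stream."""

    def __init__(self, paths):
        self._paths = list(paths)
        self._index = 0
        self._current = open(self._paths[0], "rb") if self._paths else None

    def read(self, size=-1):
        chunks = []
        remaining = size
        while self._current is not None and remaining != 0:
            data = self._current.read(remaining)
            if data:
                chunks.append(data)
                if remaining > 0:
                    remaining -= len(data)
                continue
            # Current part is exhausted: close it and open the next one.
            self._current.close()
            self._index += 1
            if self._index < len(self._paths):
                self._current = open(self._paths[self._index], "rb")
            else:
                self._current = None
        return b"".join(chunks)

    def close(self):
        if self._current is not None:
            self._current.close()
            self._current = None


def extract_split_archive(pattern, dest="."):
    """Extract a .tar.gz archive that was split into several part files."""
    parts = sorted(glob.glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no files match {pattern!r}")
    reader = ChainedReader(parts)
    # Use the safer 'data' extraction filter when this Python provides it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    try:
        # "r|gz" reads the archive as a forward-only stream, like the pipe.
        with tarfile.open(fileobj=reader, mode="r|gz") as tar:
            tar.extractall(dest, **kwargs)
    finally:
        reader.close()
    return parts
```

A call such as `extract_split_archive("gcd_kubric4d_train.tar.gz.*")` would mirror the command above. Streaming avoids writing a joined copy of the archive to disk, which matters at this dataset size.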
If you wish to generate your own data, please see our repository.
These ~1500 scenes were provided by the ParallelDomain engine, and contain photorealistic driving scenarios with diverse environments, traffic patterns, vehicles, pedestrians, and weather conditions. Each scene contains synchronized videos from 19 camera viewpoints (3 ego, 16 surround) and 50 frames at a resolution of 640 x 480 and a frame rate of 10 FPS. The cameras follow the car at the center of each scene precisely. The basic modalities are: RGB, depth, semantic segmentation, instance segmentation, and 2D bounding boxes. The additional modalities are: LiDAR point clouds, optical flow, scene flow, and surface normals. Direct download links can be found here:
The visualization above shows all 19 available viewpoints, but is sped up 2x (from 10 to 20 FPS) for demonstration purposes.
The entire dataset (with basic modalities only) takes 2.3 TB of space in compressed form, and 2.4 TB after extraction.
Note that some scene folders do not exist, and some scenes have a couple of missing frames.
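For quick reference, the figures quoted for Kubric-4D and ParallelDomain-4D can be collected in a small record type that derives clip length and frame counts. The class and constant names are hypothetical, the ParallelDomain-4D scene count is the approximate value quoted above, and the derived totals are upper bounds given the missing scenes and frames just noted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MultiViewDataset:
    name: str
    scenes: int
    views: int
    frames: int
    width: int
    height: int
    fps: int

    @property
    def clip_seconds(self) -> float:
        """Duration of one video at its native frame rate."""
        return self.frames / self.fps

    @property
    def frames_per_scene(self) -> int:
        """Frames per modality across all viewpoints of one scene."""
        return self.views * self.frames

    @property
    def total_frames(self) -> int:
        """Upper bound on frames per modality across the dataset."""
        return self.scenes * self.frames_per_scene


KUBRIC_4D = MultiViewDataset("Kubric-4D", 3000, 16, 60, 576, 384, 24)
# The scene count is approximate (quoted as ~1500).
PARALLEL_DOMAIN_4D = MultiViewDataset("ParallelDomain-4D", 1500, 19, 50, 640, 480, 10)
```

With these numbers, a Kubric-4D clip lasts 2.5 seconds and a ParallelDomain-4D clip lasts 5 seconds, with 960 and 950 frames per scene and modality respectively.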
@inproceedings{vanhoorick2024gcd,
title={Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis},
author={Van Hoorick, Basile and Wu, Rundi and Ozguroglu, Ege and Sargent, Kyle and Liu, Ruoshi and Tokmakov, Pavel and Dave, Achal and Zheng, Changxi and Vondrick, Carl},
booktitle={European Conference on Computer Vision (ECCV)},
year={2024}}
This research is based on work partially supported by the NSF CAREER Award #2046910 and the NSF Center for Smart Streetscapes (CS3) under NSF Cooperative Agreement No. EEC-2133516. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors. The webpage template was inspired by this project page.