Efficient Camera-Controlled Video Generation of Static Scenes via Sparse Diffusion and 3D Rendering

University of Cambridge

TL;DR: Given an input image and a camera trajectory, our method, SRENDER, generates sparse keyframes, reconstructs the 3D scene, and renders the full video efficiently. On average, SRENDER is 43× faster than a history-guided diffusion baseline (HG) when generating a 20-second, 30-fps video from the DL3DV dataset, achieving real-time performance while maintaining comparable or better video quality.

Abstract: Modern diffusion-based video generative models can produce highly realistic clips, but they are computationally inefficient, often requiring minutes of GPU time for just a few seconds of video. This inefficiency poses a critical barrier to deploying generative video in applications that require real-time interaction, such as embodied AI and VR/AR. This paper explores a new strategy for camera-conditioned video generation of static scenes: using diffusion-based generative models to generate a sparse set of keyframes, and then synthesizing the full video through 3D reconstruction and rendering. By lifting keyframes into a 3D representation and rendering intermediate views, our approach amortizes the generation cost across hundreds of frames while enforcing geometric consistency. We further introduce a model that predicts the optimal number of keyframes for a given camera trajectory, allowing the system to adaptively allocate computation. Our final method, SRENDER, uses very sparse keyframes for simple trajectories and denser ones for complex motion. This results in video generation that is more than 40× faster than the diffusion-based baseline when generating 20 seconds of video, while maintaining high visual fidelity and temporal stability, offering a practical path toward efficient and controllable video synthesis.
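
For concreteness, the sparse-keyframes-then-render idea can be summarized in pseudocode. The sketch below only illustrates the pipeline structure; all names (keyframe_predictor, keyframe_diffusion, reconstructor, renderer, subsample_poses) are hypothetical placeholders and do not correspond to the released implementation.

```python
# Minimal sketch of the SRENDER pipeline structure.
# All component names are hypothetical placeholders, not the actual API.

def srender(input_image, trajectory, keyframe_predictor,
            keyframe_diffusion, reconstructor, renderer):
    """Generate a camera-controlled video of a static scene.

    input_image : a single RGB image of the scene.
    trajectory  : list of camera poses, one per output frame.
    """
    # 1. Predict how many keyframes this trajectory needs
    #    (sparser for simple motion, denser for complex motion).
    k = keyframe_predictor(input_image, trajectory)

    # 2. Pick k poses along the trajectory and generate the
    #    corresponding keyframes with the diffusion model.
    key_poses = subsample_poses(trajectory, k)
    keyframes = keyframe_diffusion(input_image, key_poses)

    # 3. Lift the keyframes into a 3D representation of the
    #    static scene (e.g. 3D Gaussians).
    scene_3d = reconstructor(keyframes, key_poses)

    # 4. Render every frame of the full trajectory from the 3D
    #    scene, amortizing the diffusion cost over all frames.
    return [renderer(scene_3d, pose) for pose in trajectory]


def subsample_poses(trajectory, k):
    """Evenly pick k poses along the trajectory (simplest possible strategy)."""
    if k <= 1:
        return [trajectory[0]]
    if k >= len(trajectory):
        return list(trajectory)
    step = (len(trajectory) - 1) / (k - 1)
    return [trajectory[round(i * step)] for i in range(k)]
```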


Qualitative Results


Results on the RE10K Dataset (20s @ 10 fps, 200 frames)

Below are qualitative comparisons of our method and the History-Guided Video Diffusion (HG) baseline on the RE10K dataset.
Despite being over 20× faster, our method produces videos with comparable visual quality. Notably, videos generated by the baseline often exhibit flickering artifacts, whereas our approach yields temporally coherent results due to the underlying 3D representation and rendering process.


More examples can be found at the end of the page.




Results on the DL3DV Dataset (20s @ 30 fps, 600 frames)

Below are qualitative comparisons between our method and the HG baseline for generating longer, higher-fps videos on the DL3DV dataset. Our method produces videos with comparable visual quality while being more than 40× faster. Because many more frames must be interpolated, the videos generated by the HG model exhibit more pronounced flickering and shaking artifacts, even though the temporal gap between keyframes is no larger than in the RE10K setting. In contrast, the videos produced by our method maintain consistent visual quality.


More examples can be found at the end of the page.




Comparison with Different Baselines on DL3DV
(20s @ 5 fps, 100 frames)

Here we present qualitative comparisons between our method and several baselines. The HG and Voyager models are video generation methods, while FILM and RIFE are 2D frame interpolation methods. The Voyager model fails to generate 20-second videos because it relies on depth maps estimated from the input image. The 2D interpolation methods cannot follow the specified camera trajectories for intermediate frames and consequently exhibit significant morphing artifacts.




Same Input, Different Scenes

Here we show that our model is capable of generating diverse scenes from the same input while maintaining high visual quality.




Generating Videos Along Different Camera Trajectories

Here we demonstrate that once the 3D Gaussian representation and an initial video of a scene have been generated for a given camera trajectory, videos of the same scene along alternative trajectories can be rendered in just a few seconds. This eliminates the need to regenerate the keyframes and the 3D Gaussian representation. The baseline model, however, must rerun the entire generation process, which can take several hundred seconds.
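
As a rough illustration (not the released code), once the 3D scene from the first pass is cached, rendering a new trajectory reduces to rasterizing the cached Gaussians at the new camera poses; `scene_3d` and `renderer` below stand in for whatever reconstruction and rasterizer the first pass produced.

```python
# Sketch: re-rendering a cached scene along a new trajectory.
# `scene_3d` and `renderer` are assumed to be cached from the
# first generation pass; neither the keyframe diffusion model
# nor the 3D reconstruction is run again.

def render_new_trajectory(scene_3d, renderer, new_trajectory):
    # Each frame is a single rasterization of the cached Gaussians,
    # which is why a new video takes only seconds rather than the
    # hundreds of seconds required to regenerate from scratch.
    return [renderer(scene_3d, pose) for pose in new_trajectory]
```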




Ablation Studies


Effect of the Number of Keyframes on the Generated Videos

Using too few keyframes results in underdefined scenes and visible holes in the generated videos, while using too many keyframes wastes computation without substantially improving visual quality. An optimal number of keyframes strikes a balance between scene detail and computational efficiency. The results in the "Ours" column are generated using the keyframe counts predicted by our keyframe selection model.
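
To give a feel for this trade-off, the sketch below implements a simple hand-crafted heuristic that allocates keyframes in proportion to how much the camera translates and rotates along the trajectory. It is only an illustration; it is not our learned keyframe selection model, and the budget constants are arbitrary assumptions.

```python
import numpy as np

def heuristic_keyframe_count(poses, k_min=4, k_max=48,
                             trans_budget=0.5, rot_budget=0.26):
    """Illustrative heuristic (NOT the learned predictor):
    allocate one keyframe per `trans_budget` units of translation
    or `rot_budget` radians of rotation, whichever demands more.

    poses: list of 4x4 camera-to-world matrices (numpy arrays).
    """
    trans = 0.0
    rot = 0.0
    for prev, curr in zip(poses[:-1], poses[1:]):
        # Accumulate translation between consecutive poses.
        trans += np.linalg.norm(curr[:3, 3] - prev[:3, 3])
        # Accumulate rotation angle between consecutive poses.
        r_rel = prev[:3, :3].T @ curr[:3, :3]
        cos_a = np.clip((np.trace(r_rel) - 1.0) / 2.0, -1.0, 1.0)
        rot += np.arccos(cos_a)
    k = int(np.ceil(max(trans / trans_budget, rot / rot_budget)))
    return int(np.clip(k, k_min, k_max))
```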




Effect of Temporal Chunking on the Generated Videos

Temporal chunking improves consistency within each chunk and, as a result, enhances video sharpness.
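
For readers unfamiliar with the term, temporal chunking here means processing the output frames in shorter consecutive segments rather than all at once. The sketch below is a generic illustration of splitting frame indices into chunks; the chunk size and the small overlap between chunks are assumptions for illustration, not the settings used in our experiments.

```python
def chunk_indices(num_frames, chunk_size=30, overlap=2):
    """Split frame indices into overlapping temporal chunks.

    Overlapping a few frames lets consecutive chunks share context,
    which is one common way to keep neighbouring chunks consistent.
    """
    chunks = []
    start = 0
    while start < num_frames:
        end = min(start + chunk_size, num_frames)
        chunks.append(list(range(start, end)))
        if end == num_frames:
            break
        start = end - overlap
    return chunks

# Example: 100 frames -> chunks of ~30 frames with a 2-frame overlap:
# chunk_indices(100) -> [[0..29], [28..57], [56..85], [84..99]]
```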




More Qualitative Results