Efficient Camera-Controlled Video Generation of Static Scenes via Sparse Diffusion and 3D Rendering

University of Cambridge

TL;DR: Given an input image and a camera trajectory, our method, SRENDER, generates sparse keyframes, reconstructs the 3D scene, and renders the full video efficiently. On average, SRENDER is 43× faster than a history-guided diffusion baseline (HG) when generating a 20-second, 30-fps video from the DL3DV dataset, achieving real-time performance while maintaining comparable or better video quality.

Abstract: Modern diffusion-based video generative models can produce highly realistic clips, but they are computationally inefficient, often requiring minutes of GPU time for just a few seconds of video. This inefficiency poses a critical barrier to deploying generative video in applications that require real-time interaction, such as embodied AI and VR/AR. This paper explores a new strategy for camera-conditioned video generation of static scenes: using diffusion-based generative models to generate a sparse set of keyframes, and then synthesizing the full video through 3D reconstruction and rendering. By lifting keyframes into a 3D representation and rendering intermediate views, our approach amortizes the generation cost across hundreds of frames while enforcing geometric consistency. We further introduce a model that predicts the optimal number of keyframes for a given camera trajectory, allowing the system to adaptively allocate computation. Our final method, SRENDER, uses very sparse keyframes for simple trajectories and denser ones for complex motion. This results in video generation that is more than 40× faster than the diffusion-based baseline when generating 20 seconds of video, while maintaining high visual fidelity and temporal stability, offering a practical path toward efficient and controllable video synthesis.
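
For concreteness, the sparse-keyframes-then-render idea can be summarized in pseudocode. The sketch below only illustrates the pipeline structure; all names (keyframe_predictor, keyframe_diffusion, reconstructor, renderer, subsample_poses) are hypothetical placeholders and do not correspond to the released implementation.

```python
# Minimal sketch of the SRENDER pipeline structure.
# All component names are hypothetical placeholders, not the actual API.

def srender(input_image, trajectory, keyframe_predictor,
            keyframe_diffusion, reconstructor, renderer):
    """Generate a camera-controlled video of a static scene.

    input_image : a single RGB image of the scene.
    trajectory  : list of camera poses, one per output frame.
    """
    # 1. Predict how many keyframes this trajectory needs
    #    (sparser for simple motion, denser for complex motion).
    k = keyframe_predictor(input_image, trajectory)

    # 2. Pick k poses along the trajectory and generate the
    #    corresponding keyframes with the diffusion model.
    key_poses = subsample_poses(trajectory, k)
    keyframes = keyframe_diffusion(input_image, key_poses)

    # 3. Lift the keyframes into a 3D representation of the
    #    static scene (e.g. 3D Gaussians).
    scene_3d = reconstructor(keyframes, key_poses)

    # 4. Render every frame of the full trajectory from the 3D
    #    scene, amortizing the diffusion cost over all frames.
    return [renderer(scene_3d, pose) for pose in trajectory]


def subsample_poses(trajectory, k):
    """Evenly pick k poses along the trajectory (simplest possible strategy)."""
    if k <= 1:
        return [trajectory[0]]
    if k >= len(trajectory):
        return list(trajectory)
    step = (len(trajectory) - 1) / (k - 1)
    return [trajectory[round(i * step)] for i in range(k)]
```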


Qualitative Results


Results on the RE10K Dataset (20s @ 10 fps, 200 frames)

Below are qualitative comparisons of our method and the History-Guided Video Diffusion (HG) baseline on the RE10K dataset.
Despite being over 20× faster, our method produces videos with comparable visual quality. Notably, videos generated by the baseline often exhibit flickering artifacts, whereas our approach yields temporally coherent results due to the underlying 3D representation and rendering process.


More examples can be found at the end of the page.




Results on the DL3DV Dataset (20s @ 30 fps, 600 frames)

Below are qualitative comparisons between our method and the HG baseline for generating longer, higher-fps videos on the DL3DV dataset. Our method produces videos with comparable visual quality while being more than 40× faster. Because many more frames must be interpolated, the videos generated by the HG model exhibit more pronounced flickering and shaking artifacts, even though the temporal gap between keyframes is no larger than in the RE10K setting. In contrast, the videos produced by our method maintain consistent visual quality.


More examples can be found at the end of the page.




Comparison with Different Baselines on DL3DV
(20s @ 5 fps, 100 frames)

Here we present qualitative comparisons between our method and several baselines. The HG and Voyager models are video generation methods, while FILM and RIFE are 2D frame interpolation methods. The Voyager model fails to generate 20-second videos because it relies on depth maps estimated from the input image. The 2D interpolation methods cannot follow the specified camera trajectories for intermediate frames and consequently exhibit significant morphing artifacts.




Same Input, Different Scenes

Here we show that our model is capable of generating diverse scenes from the same input while maintaining high visual quality.




Generating Videos Along Different Camera Trajectories

Here we demonstrate that once the 3D Gaussian representation and an initial video of a scene have been generated for a given camera trajectory, videos of the same scene along alternative trajectories can be rendered in just a few seconds. This eliminates the need to regenerate the keyframes and the 3D Gaussian representation. The baseline model, however, must rerun the entire generation process, which can take several hundred seconds.
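
As a rough illustration (not the released code), once the 3D scene from the first pass is cached, rendering a new trajectory reduces to rasterizing the cached Gaussians at the new camera poses; `scene_3d` and `renderer` below stand in for whatever reconstruction and rasterizer the first pass produced.

```python
# Sketch: re-rendering a cached scene along a new trajectory.
# `scene_3d` and `renderer` are assumed to be cached from the
# first generation pass; neither the keyframe diffusion model
# nor the 3D reconstruction is run again.

def render_new_trajectory(scene_3d, renderer, new_trajectory):
    # Each frame is a single rasterization of the cached Gaussians,
    # which is why a new video takes only seconds rather than the
    # hundreds of seconds required to regenerate from scratch.
    return [renderer(scene_3d, pose) for pose in new_trajectory]
```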




Ablation Studies


Effect of the Number of Keyframes on the Generated Videos

Using too few keyframes results in underdefined scenes and visible holes in the generated videos, while using too many keyframes wastes computation without substantially improving visual quality. An optimal number of keyframes strikes a balance between scene detail and computational efficiency. The results in the "Ours" column are generated using the keyframe counts predicted by our keyframe selection model.
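
To give a feel for this trade-off, the sketch below implements a simple hand-crafted heuristic that allocates keyframes in proportion to how much the camera translates and rotates along the trajectory. It is only an illustration; it is not our learned keyframe selection model, and the budget constants are arbitrary assumptions.

```python
import numpy as np

def heuristic_keyframe_count(poses, k_min=4, k_max=48,
                             trans_budget=0.5, rot_budget=0.26):
    """Illustrative heuristic (NOT the learned predictor):
    allocate one keyframe per `trans_budget` units of translation
    or `rot_budget` radians of rotation, whichever demands more.

    poses: list of 4x4 camera-to-world matrices (numpy arrays).
    """
    trans = 0.0
    rot = 0.0
    for prev, curr in zip(poses[:-1], poses[1:]):
        # Accumulate translation between consecutive poses.
        trans += np.linalg.norm(curr[:3, 3] - prev[:3, 3])
        # Accumulate rotation angle between consecutive poses.
        r_rel = prev[:3, :3].T @ curr[:3, :3]
        cos_a = np.clip((np.trace(r_rel) - 1.0) / 2.0, -1.0, 1.0)
        rot += np.arccos(cos_a)
    k = int(np.ceil(max(trans / trans_budget, rot / rot_budget)))
    return int(np.clip(k, k_min, k_max))
```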




Effect of Temporal Chunking on the Generated Videos

Temporal chunking improves consistency within each chunk and, as a result, enhances video sharpness.
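
For readers unfamiliar with the term, temporal chunking here means processing the output frames in shorter consecutive segments rather than all at once. The sketch below is a generic illustration of splitting frame indices into chunks; the chunk size and the small overlap between chunks are assumptions for illustration, not the settings used in our experiments.

```python
def chunk_indices(num_frames, chunk_size=30, overlap=2):
    """Split frame indices into overlapping temporal chunks.

    Overlapping a few frames lets consecutive chunks share context,
    which is one common way to keep neighbouring chunks consistent.
    """
    chunks = []
    start = 0
    while start < num_frames:
        end = min(start + chunk_size, num_frames)
        chunks.append(list(range(start, end)))
        if end == num_frames:
            break
        start = end - overlap
    return chunks

# Example: 100 frames -> chunks of ~30 frames with a 2-frame overlap:
# chunk_indices(100) -> [[0..29], [28..57], [56..85], [84..99]]
```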




More Qualitative Results