Diffusion models excel at short-horizon robot planning, yet scaling them to long-horizon tasks remains challenging due to computational constraints and limited training data. Existing compositional approaches stitch together short segments by denoising each component separately and averaging overlapping regions. However, this approach is unstable: the factorization assumption breaks down in noisy data space, leading to inconsistent global plans. We propose that the key to stable compositional generation lies in enforcing boundary agreement on the estimated clean data (Tweedie estimates) rather than on noisy intermediate states. Our method formulates long-horizon planning as inference over a chain-structured factor graph of overlapping video chunks, where pretrained short-horizon video diffusion models provide local priors. At inference time, we enforce boundary agreement through a novel combination of synchronous and asynchronous message passing on Tweedie estimates, producing globally consistent guidance without additional training. Our training-free framework demonstrates significant improvements over existing baselines across 100 simulation tasks spanning 4 diverse scenes, generalizing effectively to start-goal combinations unseen during training.
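For concreteness, the sketch below illustrates one way the boundary-agreement step could be realized: at each reverse-diffusion step, every chunk is denoised by its own pretrained short-horizon model, Tweedie estimates of the clean chunks are computed, and the overlapping frames of neighboring estimates are averaged before the sampler takes its next step. All names (`compose_long_horizon_plan`, `denoise_eps`), tensor shapes, and the DDIM-style update are illustrative assumptions rather than the released implementation; the asynchronous updates and diffusion-sphere guidance mentioned in the conclusion are omitted for brevity.

```python
import torch

def compose_long_horizon_plan(denoise_eps, chunks_xT, alpha_bar, overlap, num_steps):
    """Jointly denoise a chain of overlapping video chunks, reconciling
    neighbors on their Tweedie (predicted-clean) estimates.

    denoise_eps(x_t, t, i): noise prediction for chunk i from a pretrained
        short-horizon video diffusion model (hypothetical interface).
    chunks_xT: list of initial noisy chunks, each shaped (B, T, C, H, W).
    alpha_bar: cumulative noise schedule (1-D tensor indexed by timestep).
    overlap:   number of frames shared by neighboring chunks.
    """
    x_t = [c.clone() for c in chunks_xT]
    for t in reversed(range(1, num_steps)):
        # 1) Local denoising: each short-horizon model scores only its own chunk.
        eps = [denoise_eps(x, t, i) for i, x in enumerate(x_t)]

        # 2) Tweedie estimates of the clean chunks at the current noise level.
        x0 = [(x - (1 - alpha_bar[t]).sqrt() * e) / alpha_bar[t].sqrt()
              for x, e in zip(x_t, eps)]

        # 3) Synchronous message passing on the chain: replace the overlapping
        #    frames of neighboring Tweedie estimates with their average so the
        #    chunks agree at every boundary (agreement in clean space, not noisy space).
        for i in range(len(x0) - 1):
            consensus = 0.5 * (x0[i][:, -overlap:] + x0[i + 1][:, :overlap])
            x0[i][:, -overlap:] = consensus
            x0[i + 1][:, :overlap] = consensus

        # 4) DDIM-style update toward the boundary-consistent clean estimates.
        x_t = [alpha_bar[t - 1].sqrt() * c + (1 - alpha_bar[t - 1]).sqrt() * e
               for c, e in zip(x0, eps)]

    # Stitch the final clean estimates into one long video, dropping the
    # duplicated overlap frames of every chunk after the first.
    return torch.cat([x0[0]] + [c[:, overlap:] for c in x0[1:]], dim=1)
```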
We evaluate our approach on the proposed Compositional Planning Benchmark, which comprises 100 diverse tasks. We release all evaluation materials, including synthesized video plans and policy rollouts, for both our method and DiffCollage. Comprehensive qualitative results are presented below.
Our evaluation addresses three key questions: (1) whether synthesized video plans achieve high visual fidelity, illustrated through high-quality visualizations; (2) whether our approach generalizes compositionally to long-horizon, out-of-distribution tasks; and (3) whether video plans can effectively guide low-level robotic control. Together, these results demonstrate the scalability and reliability of compositional visual planning in challenging robotic settings.
For pixel-based evaluations (synthesized videos), we disable ray tracing to ensure fully observable scenes without shadows or lighting artifacts. For policy rollouts, ray tracing is enabled to provide physically realistic observations and a clearer visualization of the robot’s manipulation skills.
We introduced Compositional Visual Planning, an inference-time method that composes long-horizon plans by stitching overlapping video factors with message passing on Tweedie estimates. A chain-structured factor graph imposes global consistency, enforced via joint synchronous and asynchronous updates, while diffusion-sphere guidance balances alignment and diversity without retraining. Compositional Visual Planning is plug-and-play with any short-horizon video diffusion prediction model, scales with test-time compute, and generalizes to unseen start–goal combinations. Beyond robotics, the framework is applicable to broader domains, such as panorama image generation and long-form text-to-video synthesis, which we leave for future exploration.
Anonymous.