Compositional Visual Planning via
Inference-Time Diffusion Scaling

Anonymous Submission

Abstract

Diffusion models excel at short-horizon robot planning, yet scaling them to long-horizon tasks remains challenging due to computational constraints and limited training data. Existing compositional approaches stitch together short segments by separately denoising each component and averaging overlapping regions. However, this approach suffers from instability: the factorization assumption breaks down in noisy data space, leading to inconsistent global plans. We propose that the key to stable compositional generation lies in enforcing boundary agreement on the estimated clean data (Tweedie estimates) rather than on noisy intermediate states. Our method formulates long-horizon planning as inference over a chain-structured factor graph of overlapping video chunks, where pretrained short-horizon video diffusion models provide local priors. At inference time, we enforce boundary agreement through a novel combination of synchronous and asynchronous message passing that operates on Tweedie estimates, producing globally consistent guidance without requiring additional training. Our training-free framework demonstrates significant improvements over existing baselines across 100 simulation tasks spanning 4 diverse scenes, effectively generalizing to unseen start–goal combinations absent from the original training data.
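The sketch below illustrates the mechanism in its simplest form: every chunk is denoised by the same pretrained short-horizon model, and at each reverse step the overlapping boundary regions of the Tweedie estimates are averaged before the update. It assumes an epsilon-prediction denoiser, a variance-exploding sigma schedule, and a hard overwrite of the overlap; `denoiser`, `sigma_schedule`, and `overlap` are illustrative placeholders, and the full method additionally combines synchronous and asynchronous message passing with diffusion-sphere guidance rather than this single averaging pass.

```python
# Minimal sketch (not the full algorithm) of boundary agreement on Tweedie
# estimates for a chain of overlapping chunks. Assumes an epsilon-prediction
# denoiser and a variance-exploding sigma schedule.
import torch

def tweedie(denoiser, x_t, sigma):
    # Posterior-mean (Tweedie) estimate of the clean chunk: x0 = x_t - sigma * eps_hat.
    return x_t - sigma * denoiser(x_t, sigma)

def reconcile(x0_hats, overlap):
    # Simplified synchronous message passing: average each pair of overlapping
    # boundary regions on the *clean* estimates, not on the noisy samples.
    x0_hats = [x.clone() for x in x0_hats]
    for i in range(len(x0_hats) - 1):
        shared = 0.5 * (x0_hats[i][-overlap:] + x0_hats[i + 1][:overlap])
        x0_hats[i][-overlap:] = shared
        x0_hats[i + 1][:overlap] = shared
    return x0_hats

def compose_plan(denoiser, chunk_shapes, sigma_schedule, overlap):
    # All chunks share the same pretrained short-horizon model; the reconciled
    # Tweedie estimates guide every DDIM-style update.
    xs = [sigma_schedule[0] * torch.randn(shape) for shape in chunk_shapes]
    for sigma, sigma_next in zip(sigma_schedule[:-1], sigma_schedule[1:]):
        x0_hats = reconcile([tweedie(denoiser, x, sigma) for x in xs], overlap)
        xs = [x0 + (sigma_next / sigma) * (x - x0) for x, x0 in zip(xs, x0_hats)]
    return xs
```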

Motivating Toy Example

Figure panels: Dataset Distribution · Noisy Samples (DiffCollage / Ours) · Estimated Tweedies (DiffCollage / Ours).
Qualitative Illustration of Toy Example. To illustrate our approach, we consider a simple 2D drawing task in which the objective is to generate a three-petal “flower” by composing three 120° circular arc segments. We train a short-horizon diffusion model on arc clips from the dataset shown on the left. At test time, factors 1, 2, and 3, with partial overlaps, are composed to produce the full flower-like pattern. The right side visualizes noisy samples and their corresponding Tweedie mean estimates for both DiffCollage and our method. While DiffCollage drifts and produces boundary gaps between factors, resulting in globally inconsistent generations, our method converges rapidly by enforcing boundary agreement on the Tweedie estimates, maintaining global consistency and guiding the noisy samples to align seamlessly across factor boundaries.
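For concreteness, the sketch below contrasts a single reverse step of the two schemes on two overlapping factors, following the description above: the baseline-style step averages the overlap on the noisy samples before denoising, whereas our step averages the Tweedie estimates after denoising. The epsilon-prediction denoiser and DDIM-style update are simplifying assumptions, not the exact implementation of either method.

```python
# Schematic single-step contrast on two overlapping 1D factors (xa, xb).
# Assumes an epsilon-prediction denoiser and a variance-exploding schedule.
import torch

def step_average_noisy(denoiser, xa, xb, sigma, sigma_next, overlap):
    # Baseline-style step: agreement is imposed on the *noisy* samples.
    shared = 0.5 * (xa[-overlap:] + xb[:overlap])
    xa = torch.cat([xa[:-overlap], shared])
    xb = torch.cat([shared, xb[overlap:]])
    x0a = xa - sigma * denoiser(xa, sigma)
    x0b = xb - sigma * denoiser(xb, sigma)
    return (x0a + (sigma_next / sigma) * (xa - x0a),
            x0b + (sigma_next / sigma) * (xb - x0b))

def step_average_tweedie(denoiser, xa, xb, sigma, sigma_next, overlap):
    # Our step (simplified): agreement is imposed on the Tweedie estimates.
    x0a = xa - sigma * denoiser(xa, sigma)
    x0b = xb - sigma * denoiser(xb, sigma)
    shared = 0.5 * (x0a[-overlap:] + x0b[:overlap])
    x0a = torch.cat([x0a[:-overlap], shared])
    x0b = torch.cat([shared, x0b[overlap:]])
    return (x0a + (sigma_next / sigma) * (xa - x0a),
            x0b + (sigma_next / sigma) * (xb - x0b))
```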

Compositional Planning Benchmark

  • Cube: scene layout with 4 start states (Start 0–3)

  • Drawer: scene layout with 4 start states (Start 0–3)

  • Puzzle: scene layout with 8 start states (Start 0–7)

  • Tool-Use: scene layout with 2 start states (Start 0–1)
Qualitative Illustration of Compositional Planning Benchmark. We introduce a benchmark for compositional planning in 6-DoF robotic manipulation. Each scene contains N start states and N goal states, yielding N² possible tasks corresponding to all start–goal pairs (task layouts are shown above). The training dataset includes demonstrations for only N of these start–goal pairs. At test time, we evaluate the planner on both the N seen pairs (in-distribution) and the remaining N² – N unseen pairs (out-of-distribution). A capable planner should generalize to novel start–goal combinations if the dataset provides sufficient coverage of the constituent behavioral fragments. The goal of this benchmark is to evaluate whether an algorithm can acquire distinct high-level skills through skill-agnostic factors and compose them to solve new tasks at test time. This requires learning not only low-level motor control skills, but also high-level sequential reasoning and visual understanding of 3D geometry and physics. All experiments are conducted in the visual planning setting, where the robot receives only raw pixel observations without access to any privileged state information. We additionally visualize the task reset ranges (videos above) to illustrate the diversity of task configurations.
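As a concrete accounting of the split, the snippet below enumerates the start–goal pairs per scene; the start counts follow the layouts above, while the choice of which N pairs are demonstrated is not specified here, so the diagonal pairing used below is a placeholder assumption.

```python
# Enumerate start-goal tasks per scene and split them into seen (demonstrated)
# and unseen pairs. Start counts follow the scene layouts above; the diagonal
# "seen" pairing (start_i -> goal_i) is a placeholder assumption.
from itertools import product

SCENE_STARTS = {"Cube": 4, "Drawer": 4, "Puzzle": 8, "Tool-Use": 2}

def split_tasks(n):
    all_pairs = set(product(range(n), repeat=2))   # N^2 start-goal tasks
    seen = {(i, i) for i in range(n)}              # N demonstrated pairs (placeholder)
    return seen, all_pairs - seen                  # N^2 - N unseen pairs

total = sum(n * n for n in SCENE_STARTS.values())
print(f"total tasks: {total}")                     # 16 + 16 + 64 + 4 = 100
for scene, n in SCENE_STARTS.items():
    seen, unseen = split_tasks(n)
    print(f"{scene}: {len(seen)} seen / {len(unseen)} unseen")
```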

Evaluation on Compositional Planning Benchmark

We evaluate our approach on the proposed Compositional Planning Benchmark, which comprises 100 diverse tasks. We release all evaluation materials, including synthesized video plans and policy rollouts, for both our method and DiffCollage. Comprehensive qualitative results are presented below.

Our evaluation is designed to address three key questions: (1) how visually faithful the synthesized video plans are, illustrated through high-quality visualizations; (2) whether our approach generalizes compositionally to long-horizon, out-of-distribution tasks; and (3) to what extent the video plans can effectively guide low-level robotic control. Together, these results demonstrate the scalability and reliability of compositional visual planning in challenging robotic settings.

For pixel-based evaluations (synthesized videos), we disable ray tracing to ensure fully observable scenes without shadows or lighting artifacts. For policy rollouts, ray tracing is enabled to provide physically realistic observations and a clearer visualization of the robot’s manipulation skills.

Conclusion

We introduced Compositional Visual Planning, an inference-time method that composes long-horizon plans by stitching overlapping video factors with message passing on Tweedie estimates. A chain-structured factor graph imposes global consistency, enforced via joint synchronous and asynchronous updates, while diffusion-sphere guidance balances alignment and diversity without retraining. Compositional Visual Planning is plug-and-play with any short-horizon video diffusion prediction model, scales with test-time compute, and generalizes to unseen start–goal combinations. Beyond robotics, the framework is applicable to broader domains, such as panorama image generation and long-form text-to-video synthesis, which we leave for future exploration.


BibTeX

Anonymous.