Learning bimanual manipulation is challenging due to its high dimensionality and the tight coordination required between the two arms. Eye-in-hand imitation learning, which uses wrist-mounted cameras, simplifies perception by focusing on task-relevant views. However, collecting diverse demonstrations remains costly, motivating the need for scalable data augmentation. While prior work has explored visual augmentation in single-arm settings, extending these approaches to bimanual manipulation requires generating viewpoint-consistent observations across both arms and producing corresponding action labels that are both valid and feasible. In this work, we propose Diffusion for COordinated Dual-arm Data Augmentation (D-CODA), a method for offline data augmentation tailored to eye-in-hand bimanual imitation learning that trains a diffusion model to synthesize novel, viewpoint-consistent wrist-camera images for both arms while simultaneously generating joint-space action labels. It employs constrained optimization to ensure that augmented states involving gripper-to-object contacts adhere to constraints suitable for bimanual coordination. We evaluate D-CODA on 5 simulated and 3 real-world tasks. Our results across 2250 simulation trials and 180 real-world trials demonstrate that it outperforms baselines and ablations, showing its potential for scalable data augmentation in eye-in-hand bimanual manipulation.
(i): The diffusion model is an iterative denoiser that learns to map source wrist-camera images \( I_a^l \) and \( I_a^r \) to target wrist-camera images \( I_b^l \) and \( I_b^r \), conditioned on pose transformations \( \Delta p^{l} \) and \( \Delta p^{r} \), using the original dataset (i.e., the dataset to be augmented); see the first sketch after this list for an illustration of this conditioning.
(ii): We use SAM2 to decompose a bimanual manipulation task into contactless and contact-rich states. We uniformly sample random camera pose perturbations for contactless states (green and yellow dots). For contact-rich states (maroon dots), we use constrained optimization to sample perturbations that satisfy a set of constraints suitable for coordinated manipulation; the second sketch after this list illustrates this sampling split. We then use the trained diffusion model to synthesize novel views from the original dataset's images and the corresponding sampled perturbations, producing an augmented dataset.
(iii): We combine the original and augmented datasets to train a bimanual manipulation policy.
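Below is a minimal, illustrative sketch of how a pose-conditioned denoiser like the one in (i) could be set up in PyTorch. The tiny architecture, FiLM-style conditioning, linear noise schedule, and all names (`PoseConditionedDenoiser`, `training_step`, the batch keys) are assumptions made for illustration, not the exact D-CODA implementation.

```python
# Minimal sketch (PyTorch) of a pose-conditioned image denoiser and one
# DDPM-style training step. Architecture, conditioning, schedule, and batch
# keys are illustrative assumptions, not the exact D-CODA implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PoseConditionedDenoiser(nn.Module):
    """Predicts the noise on the target wrist images I_b^l, I_b^r, conditioned
    on the source wrist images I_a^l, I_a^r and the pose deltas dp^l, dp^r."""

    def __init__(self, pose_dim: int = 6, hidden: int = 64):
        super().__init__()
        # Left/right source + left/right noisy target images, 3 channels each.
        self.enc = nn.Conv2d(12, hidden, 3, padding=1)
        self.dec = nn.Conv2d(hidden, 6, 3, padding=1)  # noise for both targets
        # Condition on [dp^l, dp^r, diffusion timestep] via FiLM scale/shift.
        self.cond = nn.Sequential(
            nn.Linear(2 * pose_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, 2 * hidden),
        )

    def forward(self, src, noisy_tgt, dpose, t):
        h = F.silu(self.enc(torch.cat([src, noisy_tgt], dim=1)))
        scale, shift = self.cond(torch.cat([dpose, t[:, None]], dim=1)).chunk(2, dim=1)
        h = h * (1 + scale[..., None, None]) + shift[..., None, None]
        return self.dec(h)


def training_step(model, batch, T: int = 1000):
    """Noise the target wrist images and regress the added noise (epsilon-prediction)."""
    src = torch.cat([batch["I_a_l"], batch["I_a_r"]], dim=1)    # (B, 6, H, W)
    tgt = torch.cat([batch["I_b_l"], batch["I_b_r"]], dim=1)    # (B, 6, H, W)
    dpose = torch.cat([batch["dp_l"], batch["dp_r"]], dim=1)    # (B, 2 * pose_dim)

    B = tgt.shape[0]
    t = torch.randint(0, T, (B,))
    alpha_bar = 1.0 - (t.float() + 1) / T                        # toy linear schedule
    noise = torch.randn_like(tgt)
    noisy_tgt = (alpha_bar.sqrt()[:, None, None, None] * tgt
                 + (1.0 - alpha_bar).sqrt()[:, None, None, None] * noise)

    pred = model(src, noisy_tgt, dpose, t.float() / T)
    return F.mse_loss(pred, noise)
```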
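For the sampling stage in (ii), the sketch below shows one way the two perturbation regimes could be organized with NumPy/SciPy. The bounds, the equality constraint (here, preserving the relative pose between the two wrists, encoded by `rel_pose_residual`), and all helper names are hypothetical and are not D-CODA's exact constraint set.

```python
# Illustrative sketch of the perturbation-sampling stage. The bounds, the
# equality constraint, and all helper names are hypothetical assumptions.
import numpy as np
from scipy.optimize import minimize

POS_RANGE = 0.03   # assumed +/- 3 cm translation perturbation per axis
ROT_RANGE = 0.15   # assumed +/- 0.15 rad axis-angle rotation perturbation


def sample_uniform_perturbation(rng):
    """Contactless states: an independent uniform pose perturbation for one arm."""
    dpos = rng.uniform(-POS_RANGE, POS_RANGE, size=3)
    drot = rng.uniform(-ROT_RANGE, ROT_RANGE, size=3)
    return np.concatenate([dpos, drot])


def sample_constrained_perturbation(rng, rel_pose_residual):
    """Contact-rich states: stay near a random seed perturbation for both arms
    (stacked 12-vector) while satisfying rel_pose_residual(x) = 0 and the bounds."""
    x0 = np.concatenate([sample_uniform_perturbation(rng),
                         sample_uniform_perturbation(rng)])
    bounds = ([(-POS_RANGE, POS_RANGE)] * 3 + [(-ROT_RANGE, ROT_RANGE)] * 3) * 2
    res = minimize(lambda x: np.sum((x - x0) ** 2), x0, method="SLSQP",
                   bounds=bounds,
                   constraints=[{"type": "eq", "fun": rel_pose_residual}])
    return res.x


def sample_perturbations(rng, in_contact, rel_pose_residual=None):
    """Return stacked [left, right] camera pose perturbations for one state."""
    if in_contact:
        return sample_constrained_perturbation(rng, rel_pose_residual)
    return np.concatenate([sample_uniform_perturbation(rng),
                           sample_uniform_perturbation(rng)])


# Toy residual: identical perturbations on both arms preserve their relative pose.
rng = np.random.default_rng(0)
dp = sample_perturbations(rng, in_contact=True,
                          rel_pose_residual=lambda x: x[:6] - x[6:])
```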
The following examples showcase synthesized images generated by our diffusion model across various bimanual tasks.
Real-World
The top and bottom figures show examples of original and synthesized wrist-camera images from both arms using D-CODA. The first column of images (black borders) shows the original states, while the following column (red borders) shows the augmented states (perturbed versions of the originals). Each original and augmented pair is from the same timestep, and each task's images come from the same episode.
Explore our original and augmented camera positions across different real-world tasks.
Click and drag to rotate. Right-click and drag to pan. Scroll up to zoom in. Scroll down to zoom out. For more controls, see the upper-right corner.
These are rollout comparisons between the ACT Baseline without augmentation (left) and D-CODA (right). In all rollouts, ACT without augmentation froze, whereas D-CODA succeeded.
Compare results between D-CODA, ACT Baseline, Bimanual DMD, and Fine-Tuned VISTA across various tasks.
These are real-world experiments with different variations of each task using D-CODA.
Added blocks of various colors as distractors
Added blocks of various colors as distractors
Added another block above the original block
Added blocks of various colors as distractors
Failure cases of D-CODA across different tasks.