Learning bimanual manipulation is challenging due to its high dimensionality and the tight coordination required between the two arms. Eye-in-hand imitation learning, which uses wrist-mounted cameras, simplifies perception by focusing on task-relevant views. However, collecting diverse demonstrations remains costly, motivating the need for scalable data augmentation. While prior work has explored visual augmentation in single-arm settings, extending these approaches to bimanual manipulation requires generating viewpoint-consistent observations across both arms and producing corresponding action labels that are both valid and feasible. In this work, we propose Diffusion for COordinated Dual-arm Data Augmentation (D-CODA), a method for offline data augmentation tailored to eye-in-hand bimanual imitation learning that trains a diffusion model to synthesize novel, viewpoint-consistent wrist-camera images for both arms while simultaneously generating joint-space action labels. It employs constrained optimization to ensure that augmented states involving gripper-to-object contacts adhere to constraints suitable for bimanual coordination. We evaluate D-CODA on 5 simulated and 3 real-world tasks. Our results across 2250 simulation trials and 180 real-world trials demonstrate that it outperforms baselines and ablations, showing its potential for scalable data augmentation in eye-in-hand bimanual manipulation.
(i): The diffusion model is an iterative denoiser that learns to map source wrist-camera images \( I_a^l \) and \( I_a^r \) to target wrist-camera images \( I_b^l \) and \( I_b^r \), conditioned on pose transformations \( \Delta p^{l} \) and \( \Delta p^{r} \), using the original dataset (i.e., the dataset to be augmented); see the first sketch after this list for an illustration of this conditioning.
(ii): We use SAM2 to decompose a bimanual manipulation task into contactless and contact-rich states. We uniformly sample random camera pose perturbations for contactless states (green and yellow dots). For contact-rich states (maroon dots), we use constrained optimization to sample perturbations that satisfy a set of constraints suitable for coordinated manipulation; the second sketch after this list illustrates this sampling split. We then use the trained diffusion model to synthesize novel views from the original dataset's images and the corresponding sampled perturbations, producing an augmented dataset.
(iii): We combine the original and augmented datasets to train a bimanual manipulation policy.
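Below is a minimal, illustrative sketch of how a pose-conditioned denoiser like the one in (i) could be set up in PyTorch. The tiny architecture, FiLM-style conditioning, linear noise schedule, and all names (`PoseConditionedDenoiser`, `training_step`, the batch keys) are assumptions made for illustration, not the exact D-CODA implementation.

```python
# Minimal sketch (PyTorch) of a pose-conditioned image denoiser and one
# DDPM-style training step. Architecture, conditioning, schedule, and batch
# keys are illustrative assumptions, not the exact D-CODA implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PoseConditionedDenoiser(nn.Module):
    """Predicts the noise on the target wrist images I_b^l, I_b^r, conditioned
    on the source wrist images I_a^l, I_a^r and the pose deltas dp^l, dp^r."""

    def __init__(self, pose_dim: int = 6, hidden: int = 64):
        super().__init__()
        # Left/right source + left/right noisy target images, 3 channels each.
        self.enc = nn.Conv2d(12, hidden, 3, padding=1)
        self.dec = nn.Conv2d(hidden, 6, 3, padding=1)  # noise for both targets
        # Condition on [dp^l, dp^r, diffusion timestep] via FiLM scale/shift.
        self.cond = nn.Sequential(
            nn.Linear(2 * pose_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, 2 * hidden),
        )

    def forward(self, src, noisy_tgt, dpose, t):
        h = F.silu(self.enc(torch.cat([src, noisy_tgt], dim=1)))
        scale, shift = self.cond(torch.cat([dpose, t[:, None]], dim=1)).chunk(2, dim=1)
        h = h * (1 + scale[..., None, None]) + shift[..., None, None]
        return self.dec(h)


def training_step(model, batch, T: int = 1000):
    """Noise the target wrist images and regress the added noise (epsilon-prediction)."""
    src = torch.cat([batch["I_a_l"], batch["I_a_r"]], dim=1)    # (B, 6, H, W)
    tgt = torch.cat([batch["I_b_l"], batch["I_b_r"]], dim=1)    # (B, 6, H, W)
    dpose = torch.cat([batch["dp_l"], batch["dp_r"]], dim=1)    # (B, 2 * pose_dim)

    B = tgt.shape[0]
    t = torch.randint(0, T, (B,))
    alpha_bar = 1.0 - (t.float() + 1) / T                        # toy linear schedule
    noise = torch.randn_like(tgt)
    noisy_tgt = (alpha_bar.sqrt()[:, None, None, None] * tgt
                 + (1.0 - alpha_bar).sqrt()[:, None, None, None] * noise)

    pred = model(src, noisy_tgt, dpose, t.float() / T)
    return F.mse_loss(pred, noise)
```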
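For the sampling stage in (ii), the sketch below shows one way the two perturbation regimes could be organized with NumPy/SciPy. The bounds, the equality constraint (here, preserving the relative pose between the two wrists, encoded by `rel_pose_residual`), and all helper names are hypothetical and are not D-CODA's exact constraint set.

```python
# Illustrative sketch of the perturbation-sampling stage. The bounds, the
# equality constraint, and all helper names are hypothetical assumptions.
import numpy as np
from scipy.optimize import minimize

POS_RANGE = 0.03   # assumed +/- 3 cm translation perturbation per axis
ROT_RANGE = 0.15   # assumed +/- 0.15 rad axis-angle rotation perturbation


def sample_uniform_perturbation(rng):
    """Contactless states: an independent uniform pose perturbation for one arm."""
    dpos = rng.uniform(-POS_RANGE, POS_RANGE, size=3)
    drot = rng.uniform(-ROT_RANGE, ROT_RANGE, size=3)
    return np.concatenate([dpos, drot])


def sample_constrained_perturbation(rng, rel_pose_residual):
    """Contact-rich states: stay near a random seed perturbation for both arms
    (stacked 12-vector) while satisfying rel_pose_residual(x) = 0 and the bounds."""
    x0 = np.concatenate([sample_uniform_perturbation(rng),
                         sample_uniform_perturbation(rng)])
    bounds = ([(-POS_RANGE, POS_RANGE)] * 3 + [(-ROT_RANGE, ROT_RANGE)] * 3) * 2
    res = minimize(lambda x: np.sum((x - x0) ** 2), x0, method="SLSQP",
                   bounds=bounds,
                   constraints=[{"type": "eq", "fun": rel_pose_residual}])
    return res.x


def sample_perturbations(rng, in_contact, rel_pose_residual=None):
    """Return stacked [left, right] camera pose perturbations for one state."""
    if in_contact:
        return sample_constrained_perturbation(rng, rel_pose_residual)
    return np.concatenate([sample_uniform_perturbation(rng),
                           sample_uniform_perturbation(rng)])


# Toy residual: identical perturbations on both arms preserve their relative pose.
rng = np.random.default_rng(0)
dp = sample_perturbations(rng, in_contact=True,
                          rel_pose_residual=lambda x: x[:6] - x[6:])
```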
The following examples showcase synthesized images generated by our diffusion model across various bimanual tasks.
Real-World
The top and bottom figures show examples of original and synthesized wrist-camera images from both arms using D-CODA. The first column of images (black borders) shows the original states, while the following column (red borders) shows the augmented states (perturbed versions of the originals). Each original and augmented pair is from the same timestep, and each task's images come from the same episode.
Explore our original and augmented camera positions across different real-world tasks.
Click and drag to rotate. Right-click and drag to pan. Scroll up to zoom in. Scroll down to zoom out. For more controls, see the upper-right corner.
These are rollout comparisons between the ACT Baseline without augmentation (left) and D-CODA (right). In all rollouts, ACT without augmentation froze, whereas D-CODA succeeded.
Compare results between D-CODA, ACT Baseline, Bimanual DMD, and Fine-Tuned VISTA across various tasks.
These are real-world experiments with different variations of each task using D-CODA.
Added blocks of various colors as distractors
Added blocks of various colors as distractors
Added another block above the original block
Added blocks of various colors as distractors
Failure cases of D-CODA across different tasks.