GPLAC: Generalizing Vision-Based Robotic Skills using Weakly Labeled Images
Avi Singh
Larry Yang
Sergey Levine
University of California, Berkeley
ICCV 2017

We tackle the problem of learning robotic sensorimotor control policies that can generalize to visually diverse and unseen environments. Achieving broad generalization typically requires large datasets, which are difficult to obtain for task-specific interactive processes such as reinforcement learning or learning from demonstration. However, much of the visual diversity in the world can be captured through passively collected datasets of images or videos. In our method, which we refer to as GPLAC (Generalized Policy Learning with Attentional Classifier), we use both interaction data and weakly labeled image data to augment the generalization capacity of sensorimotor policies. Our method combines multitask learning on action selection and an auxiliary binary classification objective with a convolutional neural network architecture that uses an attentional mechanism to avoid distractors. We show that pairing interaction data from just a single environment with a diverse dataset of weakly labeled data greatly improves generalization to unseen environments, and that this generalization depends on both the auxiliary objective and the attentional architecture that we propose. We evaluate our method both in simulation and on a real robotic manipulator, and demonstrate substantial improvement over standard convolutional architectures and domain adaptation methods.


Our Solution

We train two convolutional neural networks: the policy and the binary classifier. The two networks share their convolutional layers (shown in blue) and have separate fully connected layers (shown in orange and magenta). Our spatial attention layer (shown in green) lies between the convolutional layers and the fully connected layers, and forms a major information bottleneck. For predicting an action, the robot’s state information (joint angles, velocities, end effector position) is also passed into the network. We train the policy using our expert demonstrations and the loss L_task, while the classifier is trained with the weakly labeled images and their binary class labels. For more details, refer to the paper.
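The data flow described above can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: the attention is modeled as a softmax over spatial locations, the task loss as mean squared error against the expert action, and the classifier loss as binary cross-entropy. All function names, weights, and shapes are hypothetical, and the convolutional layers are replaced by hand-written feature maps.

```python
import math

def spatial_softmax(score_map):
    """Turn an H x W map of attention scores into weights that sum to 1."""
    flat = [s for row in score_map for s in row]
    peak = max(flat)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in flat]
    total = sum(exps)
    width = len(score_map[0])
    return [[exps[r * width + c] / total for c in range(width)]
            for r in range(len(score_map))]

def attend(features, score_map):
    """Pool C feature maps (each H x W) into a C-vector with shared weights.

    This is the bottleneck: only one number per channel reaches the heads.
    """
    weights = spatial_softmax(score_map)
    return [sum(w * f
                for w_row, f_row in zip(weights, fmap)
                for w, f in zip(w_row, f_row))
            for fmap in features]

def linear(x, weight, bias):
    """Fully connected layer: one output per row of `weight`."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def policy_head(attended, robot_state, weight, bias):
    """Policy branch: attended visual features concatenated with robot state."""
    return linear(attended + robot_state, weight, bias)

def classifier_head(attended, weight, bias):
    """Classifier branch: probability that the image has the positive label."""
    logit = linear(attended, weight, bias)[0]
    return 1.0 / (1.0 + math.exp(-logit))

def task_loss(predicted_action, expert_action):
    """Assumed form of the task loss: mean squared error."""
    diffs = [(p - e) ** 2 for p, e in zip(predicted_action, expert_action)]
    return sum(diffs) / len(diffs)

def classifier_loss(prob, label, eps=1e-12):
    """Assumed form of the auxiliary loss: binary cross-entropy."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

# Toy inputs standing in for the shared convolutional output (2 channels, 2 x 2).
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[0.0, 0.0], [0.0, 8.0]]]
scores = [[0.0, 0.0], [0.0, 0.0]]  # uniform attention for this example

attended = attend(features, scores)

# Demonstration sample: image features plus robot state, scored by the task loss.
action = policy_head(attended, [0.5],
                     weight=[[1.0, 0.0, 0.0], [0.0, 1.0, 2.0]],
                     bias=[0.0, 0.0])
demo_loss = task_loss(action, expert_action=[2.5, 2.0])

# Weakly labeled sample: image features only, scored by the classifier loss.
prob = classifier_head(attended, weight=[[0.0, 0.0]], bias=[0.0])
weak_loss = classifier_loss(prob, label=1)
```

Both heads read the same attended feature vector, so gradients from the weakly labeled images would shape the shared layers that the policy also depends on. How the two losses are weighted and scheduled during training is not specified here; see the paper for the actual procedure.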



Avi Singh, Larry Yang, Sergey Levine. GPLAC: Generalizing Vision-Based Robotic Skills using Weakly Labeled Images.
In ICCV 2017.