We tackle the problem of learning robotic sensorimotor control policies that can generalize to visually diverse and unseen environments. Achieving broad generalization typically requires large datasets, which are difficult to obtain for task-specific interactive processes such as reinforcement learning or learning from demonstration. However, much of the visual diversity in the world can be captured through passively collected datasets of images or videos. In our method, which we refer to as GPLAC (Generalized Policy Learning with Attentional Classifier), we use both interaction data and weakly labeled image data to augment the generalization capacity of sensorimotor policies. Our method combines multitask learning on action selection and an auxiliary binary classification objective, together with a convolutional neural network architecture that uses an attentional mechanism to avoid distractors. We show that pairing interaction data from just a single environment with a diverse dataset of weakly labeled data results in greatly improved generalization to unseen environments, and that this generalization depends on both the auxiliary objective and the attentional architecture that we propose. We demonstrate our results in simulation and on a real robotic manipulator, showing substantial improvement over standard convolutional architectures and domain adaptation methods.
We train two convolutional neural networks: the policy and the binary classifier. The two networks
share their convolutional layers (shown in blue), and have separate fully connected layers (shown in orange and magenta). Our spatial
attention layer (shown in green) lies between the convolutional layers and the fully connected layers, and forms a major information bottleneck.
For predicting an action, the robot’s state information (joint angles, velocities, end effector position) is also passed into the network. We train the policy using expert demonstrations and the task loss L_task, while the classifier is trained with the weakly labeled images
and their binary class labels. For more details, refer to the paper.
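To make the role of the attention bottleneck concrete, below is a minimal numpy sketch of a spatial attention layer of the kind commonly used in visuomotor policies (a spatial softmax that reduces each convolutional feature map to an expected 2D image location). The function name and shapes are illustrative assumptions, not the paper's exact implementation; refer to the paper for the precise architecture.

```python
import numpy as np

def spatial_softmax(features):
    """Spatial attention bottleneck (illustrative sketch).

    Each of the C feature maps (H x W) is converted into the expected
    (x, y) image location of its activations, so C maps collapse to
    just 2*C numbers -- a strong information bottleneck that encourages
    the network to attend to task-relevant locations.
    """
    C, H, W = features.shape
    # Softmax over the spatial locations of each channel.
    flat = features.reshape(C, -1)
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(flat) / np.exp(flat).sum(axis=1, keepdims=True)
    probs = probs.reshape(C, H, W)
    # Expected pixel coordinates, normalized to [-1, 1].
    xs = np.linspace(-1.0, 1.0, W)
    ys = np.linspace(-1.0, 1.0, H)
    expected_x = (probs.sum(axis=1) * xs).sum(axis=1)  # marginal over rows
    expected_y = (probs.sum(axis=2) * ys).sum(axis=1)  # marginal over cols
    return np.stack([expected_x, expected_y], axis=1)  # shape (C, 2)
```

In the two-head setup described above, the policy head would concatenate these keypoints with the robot's state vector before its fully connected layers, while the classifier head operates on the keypoints alone.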