CutLER: Cut and Learn for Unsupervised Object Detection and Instance Segmentation

Xudong Wang, Rohit Girdhar, Stella X. Yu, Ishan Misra
FAIR, Meta AI    UC Berkeley / ICSI    University of Michigan
[Preprint]   [Code]   [Colab]   [Bibtex]


Abstract

We propose Cut-and-LEaRn (CutLER), a simple approach for training unsupervised object detection and segmentation models. We leverage the property of self-supervised models to "discover" objects without supervision and amplify it to train a state-of-the-art localization model without any human labels. CutLER first uses our proposed MaskCut approach to generate coarse masks for multiple objects in an image, and then learns a detector on these masks using our robust loss function. We further improve performance by self-training the model on its own predictions. Compared to prior work, CutLER is simpler, compatible with different detection architectures, and detects multiple objects. CutLER is also a zero-shot unsupervised detector and improves detection performance by over 2.7 times on 11 benchmarks across domains such as video frames, paintings, and sketches. With finetuning, CutLER serves as a low-shot detector, surpassing MoCo-v2 by 7.3% AP^box and 6.5% AP^mask on COCO when trained with 5% of the labels.


Overview


We propose a simple yet effective method to train an object detection and instance segmentation model without using any supervision. We first propose MaskCut to extract initial coarse masks from the features of a self-supervised ViT. We then learn a detector using our loss dropping strategy that is robust to objects missed by MaskCut. We further improve the model using multiple rounds of self-training. Please check our paper and code for more details.
Try out the CutLER demo using Colab!
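
To make the idea of a loss that tolerates objects missed by MaskCut more concrete, here is a minimal sketch of one way a loss dropping rule could look. It is an illustration under stated assumptions, not the released implementation: the rule used here (ignore the loss of any predicted region whose best IoU with the pseudo-ground-truth boxes is at or below a threshold), the threshold value, the averaging over kept regions, and the names `box_iou`, `drop_loss`, and `iou_thresh` are all hypothetical. Please refer to the paper and code for the actual formulation.

```python
import numpy as np

def box_iou(a, b):
    """Pairwise IoU between boxes a (N, 4) and b (M, 4), each as (x1, y1, x2, y2)."""
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def drop_loss(per_region_loss, pred_boxes, pseudo_gt_boxes, iou_thresh=0.01):
    """Average the loss over predicted regions that overlap some pseudo-mask box.

    Assumption: a prediction that overlaps no pseudo-mask may be a real object
    that the mask generator missed, so it is not penalized as background.
    Assumes at least one pseudo-ground-truth box is given.
    """
    max_iou = box_iou(pred_boxes, pseudo_gt_boxes).max(axis=1)
    keep = max_iou > iou_thresh           # regions that still contribute to the loss
    return float((per_region_loss * keep).sum() / max(keep.sum(), 1))

# Hypothetical example: the second prediction overlaps no pseudo-mask box,
# so its loss is dropped instead of being treated as a false positive.
pred = np.array([[0.0, 0.0, 10.0, 10.0], [50.0, 50.0, 60.0, 60.0]])
pseudo_gt = np.array([[0.0, 0.0, 10.0, 10.0]])
loss = drop_loss(np.array([1.0, 5.0]), pred, pseudo_gt)
```

The design intent of such a rule is that the detector is free to explore regions the coarse masks did not label, which is also what makes later rounds of self-training useful.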


MaskCut

MaskCut can discover multiple object masks in an image without supervision. We create a patch-wise similarity matrix for the image using a self-supervised DINO model's features. We apply Normalized Cuts to this matrix and obtain a single foreground object mask of the image. We then mask out the affinity matrix values using the foreground mask and repeat the process, which allows MaskCut to discover multiple object masks in a single image.
Try out the MaskCut demo using Colab!
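
The iterative procedure described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the released MaskCut code, and it makes several assumptions that the text above does not specify: the affinity is the cosine similarity between patch features clipped below at a small `eps`; the bipartition thresholds the second-smallest generalized eigenvector at its mean; the smaller side is taken as the foreground; and "masking out" is done by removing the rows and columns of already-discovered patches before the next cut. The names `ncut_bipartition`, `maskcut`, `num_masks`, and `eps` are hypothetical.

```python
import numpy as np

def ncut_bipartition(W):
    """Two-way Normalized Cut: threshold the second-smallest eigenvector
    of the generalized problem (D - W) y = lambda * D y at its mean."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)       # eigenvalues in ascending order
    y = d_inv_sqrt * vecs[:, 1]           # map back to the generalized eigenvector
    return y > y.mean()

def maskcut(features, num_masks=3, eps=1e-5):
    """Return up to `num_masks` boolean patch masks from (num_patches, dim) features."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    W = np.clip(f @ f.T, eps, None)       # assumed affinity: clipped cosine similarity
    remaining = np.arange(len(W))         # patches not yet assigned to an object
    masks = []
    for _ in range(num_masks):
        if len(remaining) < 2:
            break
        side = ncut_bipartition(W[np.ix_(remaining, remaining)])
        if side.sum() > (~side).sum():    # assumed heuristic: smaller side is foreground
            side = ~side
        mask = np.zeros(len(W), dtype=bool)
        mask[remaining[side]] = True
        masks.append(mask)
        remaining = remaining[~side]      # mask out the discovered object and repeat
    return masks

# Toy input standing in for ViT patch features: 4 patches of object A,
# 4 patches of object B (partly similar to the background), 12 background patches.
toy = np.array([[0, 1, 0]] * 4 + [[1, 0, 3]] * 4 + [[1, 0, 0]] * 12, dtype=float)
toy_masks = maskcut(toy, num_masks=2)
```

In this toy setup the first cut isolates the patches of object A and the second cut isolates object B from the background. With real DINO features, the patch masks would additionally need to be reshaped to the patch grid and upsampled to image resolution.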


Main Results on 11 Datasets


As a simple approach for training unsupervised object detection and segmentation models, CutLER outperforms the previous state of the art by 2.7 times for AP50 and 2.6 times for AR on 11 benchmarks spanning a variety of domains, including video frames, paintings, clip art, and complex scenes.


Materials

Paper Poster

Citation

If you find our work inspiring or use our codebase in your research, please cite our work:

@article{wang2023cut,
    author={Wang, Xudong and Girdhar, Rohit and Yu, Stella X and Misra, Ishan},
    title={Cut and Learn for Unsupervised Object Detection and Instance Segmentation},
    journal={arXiv preprint arXiv:2301.11320},
    year={2023},
}