Segment Anything without Supervision

BAIR, UC Berkeley

 

Can we "segment anything" without supervision? Yes! We present UnSAM, an unsupervised learning method that performs both promptable and whole-image segmentation without any human annotations.

 


Abstract

The Segment Anything Model (SAM) requires labor-intensive data labeling. We present Unsupervised SAM (UnSAM) for promptable and automatic whole-image segmentation that does not require human annotations. UnSAM utilizes a divide-and-conquer strategy to “discover” the hierarchical structure of visual scenes. We first leverage top-down clustering methods to partition an unlabeled image into instance/semantic level segments. For all pixels within a segment, a bottom-up clustering method is employed to iteratively merge them into larger groups, thereby forming a hierarchical structure. These unsupervised multi-granular masks are then utilized to supervise model training. Evaluated across seven popular datasets, UnSAM achieves results competitive with its supervised counterpart SAM, and surpasses the previous state-of-the-art in unsupervised segmentation by 11% in terms of AR. Moreover, we show that supervised SAM can also benefit from our self-supervised labels. By integrating our unsupervised pseudo-masks into SA-1B’s ground-truth masks and training UnSAM with only 1% of SA-1B, a lightly semi-supervised UnSAM+ can often segment entities overlooked by supervised SAM, exceeding SAM’s AR by over 6.7% and AP by 3.9% on SA-1B.


Method Overview

 

UnSAM consists of two primary stages: 1) generating pseudo-masks using a divide-and-conquer approach, which yields multi-granular masks without the need for human supervision, and 2) learning unsupervised segmentation models using these pseudo-masks derived from unlabeled data. The resulting segmentation model is capable of performing promptable and automatic whole-image segmentation, mirroring the capabilities of SAM.
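Before any training can happen, the discovered masks need a storable label format. As a minimal sketch, assuming pycocotools is available and a class-agnostic COCO-style annotation layout (our illustrative choice here, not something the paper prescribes), the multi-granular masks for one image could be packed like this:

import numpy as np
from pycocotools import mask as mask_utils

def masks_to_coco_annotations(binary_masks, image_id):
    """Pack binary pseudo-masks into class-agnostic COCO-style RLE records."""
    anns = []
    for i, m in enumerate(binary_masks):  # each m is an (H, W) bool array
        rle = mask_utils.encode(np.asfortranarray(m.astype(np.uint8)))
        area = float(mask_utils.area(rle))
        bbox = mask_utils.toBbox(rle).tolist()  # [x, y, w, h]
        rle["counts"] = rle["counts"].decode("ascii")  # make JSON-serializable
        anns.append({
            "id": i,
            "image_id": image_id,
            "segmentation": rle,
            "area": area,
            "bbox": bbox,
            "category_id": 1,  # class-agnostic: every segment is an "entity"
        })
    return anns

Nested part and whole masks coexist in this list; the trainer simply sees overlapping class-agnostic instances at different granularities.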

 


Divide-and-Conquer for Multi-granular Mask Generation

 

UnSAM starts by generating pseudo-masks that respect the hierarchical structure of visual scenes, without supervision. This approach is motivated by the observation that divide-and-conquer is a fundamental organizational principle the human visual system employs to efficiently process the vast complexity of natural scenes. Divide stage: we leverage a Normalized Cuts (NCuts)-based method, CutLER, to obtain semantic- and instance-level masks from unlabeled raw images. Conquer stage: for each instance-/semantic-level mask discovered in the previous stage, we iteratively merge the pixels within it into progressively larger groups, exposing the finer-grained parts that compose the coarse mask and forming a hierarchical structure.
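The conquer stage can be made concrete with a small amount of code. Below is a minimal, deliberately inefficient sketch of greedy bottom-up merging, applied to the pixels inside one coarse mask; the initial regions, the feature choice (e.g., normalized color, or self-supervised features such as DINO's), and the thresholds are all illustrative assumptions, not the paper's exact procedure:

import numpy as np

def bottom_up_merge(feats, labels, thresholds=(0.1, 0.2, 0.4)):
    """Greedy agglomerative merging of adjacent regions, coarser per threshold.
    feats:  (H, W, C) float features per pixel.
    labels: (H, W) int map of small initial regions (e.g., superpixels).
    Returns one label map per threshold; together they form a hierarchy.
    """
    levels = []
    for t in thresholds:
        while True:
            ids = np.unique(labels)
            means = {i: feats[labels == i].mean(axis=0) for i in ids}
            # Collect pairs of spatially adjacent regions.
            pairs = set()
            for a, b in ((labels[:, :-1], labels[:, 1:]),
                         (labels[:-1, :], labels[1:, :])):
                edge = a != b
                lo = np.minimum(a[edge], b[edge])
                hi = np.maximum(a[edge], b[edge])
                pairs |= set(zip(lo.tolist(), hi.tolist()))
            if not pairs:
                break  # everything merged into one region
            # Merge the most similar adjacent pair, if close enough.
            i, j = min(pairs,
                       key=lambda p: np.linalg.norm(means[p[0]] - means[p[1]]))
            if np.linalg.norm(means[i] - means[j]) >= t:
                break  # no pair is similar enough at this granularity
            labels = np.where(labels == j, i, labels)
        levels.append(labels.copy())
    return levels  # one label map per threshold, fine to coarse

Each threshold freezes one level of the hierarchy: the early, strict merges recover fine parts, and the later, looser merges recover whole objects.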

 

Unsupervised pseudo-masks generated by our divide-and-conquer pipeline not only contain precise masks for coarse-grained instances (column 5), e.g., cameras and persons, but also capture fine-grained parts (column 3), e.g., digits and icons on a tiny camera monitor that are missed by SA-1B's ground-truth labels.

 


Segment Anything without Supervision

 

UnSAM learns an image segmentation model from the masks discovered by the divide-and-conquer strategy. We observe that self-training enables the model to "clean" the pseudo-masks and predict masks of higher quality.
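One way to picture the "cleaning" effect: after a training round, the model's own confident predictions can replace the raw pseudo-masks, with low-confidence and near-duplicate masks filtered out. The sketch below shows such a filter; the scoring interface and both thresholds are illustrative assumptions rather than the paper's exact recipe:

import numpy as np

def mask_iou(a, b):
    """IoU between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def clean_pseudo_masks(masks, scores, score_thresh=0.5, iou_thresh=0.9):
    """Keep confident predictions and drop near-duplicates (greedy mask NMS)."""
    order = np.argsort(scores)[::-1]  # highest-scoring masks first
    kept = []
    for i in order:
        if scores[i] < score_thresh:
            break
        if all(mask_iou(masks[i], masks[k]) < iou_thresh for k in kept):
            kept.append(i)
    return kept  # indices of the surviving masks

Note the high IoU cutoff: nested part/whole masks have low mutual IoU, so the filter removes near-duplicates without collapsing the granularity hierarchy.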

 

UnSAM achieves dense object segmentation results competitive with the supervised SAM.

 

Qualitative comparisons of promptable image segmentation between the fully-supervised SAM, our unsupervised UnSAM, and the lightly semi-supervised UnSAM+. Both UnSAM and UnSAM+ consistently deliver high-quality, multi-granular segmentation masks in response to point prompts.
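The released predictor interface may differ, but conceptually a point prompt just selects among the model's multi-granular masks. A minimal, model-agnostic sketch, assuming the whole-image masks have already been predicted:

import numpy as np

def masks_at_point(masks, point):
    """Given predicted boolean masks and a click (row, col), return every
    mask containing the point, ordered fine-to-coarse by area."""
    r, c = point
    hits = [m for m in masks if m[r, c]]
    return sorted(hits, key=lambda m: int(m.sum()))

Sorting the hits by area reproduces the part-to-whole ordering shown in the figure above.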

 


Demo Results of UnSAM

From top to bottom are raw images, segmentation by SAM, segmentation by UnSAM, and segmentation by UnSAM+.

 


Quantitative Evaluation

UnSAM achieves state-of-the-art results on unsupervised image segmentation, using a ResNet50 backbone and training on only 1% of SA-1B data. We perform zero-shot evaluation on a variety of image segmentation benchmarks, including whole-entity datasets, e.g., COCO and ADE20K, and part segmentation datasets, e.g., PACO and PartImageNet.
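For reference, the AR metric reported here can be sketched as follows; this simplified version uses greedy one-to-one matching over a range of IoU thresholds and omits details of the COCO implementation (e.g., score-ordered matching and per-image detection caps):

import numpy as np

def average_recall(pred_masks, gt_masks, thresholds=np.arange(0.5, 1.0, 0.05)):
    """Class-agnostic AR: fraction of ground-truth masks recalled,
    averaged over IoU thresholds, with greedy one-to-one matching."""
    def iou(a, b):
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        return inter / union if union else 0.0
    recalls = []
    for t in thresholds:
        matched, hits = set(), 0
        for g in gt_masks:
            best, best_iou = None, t
            for j, p in enumerate(pred_masks):
                if j in matched:
                    continue
                v = iou(g, p)
                if v >= best_iou:
                    best, best_iou = j, v
            if best is not None:
                matched.add(best)
                hits += 1
        recalls.append(hits / max(len(gt_masks), 1))
    return float(np.mean(recalls))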

 

UnSAM+ outperforms SAM on most of the evaluated benchmarks (including SA-1B) when trained on 1% of SA-1B with both the ground-truth masks and our unsupervised labels. This demonstrates that our unsupervised pseudo-masks can serve as a powerful add-on to the densely annotated SA-1B masks!
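A minimal sketch of the integration step, assuming one image's ground-truth and pseudo-masks as boolean arrays; the overlap rule and threshold are illustrative, and the actual UnSAM+ merging criterion may differ:

import numpy as np

def merge_pseudo_into_gt(gt_masks, pseudo_masks, iou_thresh=0.5):
    """Augment ground truth with pseudo-masks that overlap no GT mask above
    `iou_thresh`, i.e., entities the annotators likely missed."""
    def iou(a, b):
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        return inter / union if union else 0.0
    extras = [p for p in pseudo_masks
              if all(iou(p, g) < iou_thresh for g in gt_masks)]
    return list(gt_masks) + extras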

 


BibTeX

@article{wang2024segment,
  title={Segment Anything without Supervision},
  author={Wang, XuDong and Yang, Jingfeng and Darrell, Trevor},
  journal={arXiv preprint arXiv:2406.20081},
  year={2024}
}