CLD: Unsupervised Feature Learning by Cross-Level Instance-Group Discrimination

Xudong Wang1,2   Ziwei Liu3   Stella Yu1,2

1University of California, Berkeley   2International Computer Science Institute   3Nanyang Technological University
[Preprint]   [PDF]   [Github]   [BibTex]
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021)

ABSTRACT

-----

Unsupervised feature learning has made great strides with contrastive learning based on instance discrimination and invariant mapping, as benchmarked on curated class-balanced datasets. However, natural data could be highly correlated and long-tail distributed. Natural between-instance similarity conflicts with the presumed instance distinction, causing unstable training and poor performance.

Our idea is to discover and integrate between-instance similarity into contrastive learning, not directly by instance grouping, but by cross-level discrimination (CLD) between instances and local instance groups. While invariant mapping of each instance is imposed by attraction within its augmented views, between-instance similarity emerges from common repulsion against instance groups. Our batch-wise and cross-view comparisons also greatly improve the positive/negative sample ratio of contrastive learning and achieve better invariant mapping. To effect both grouping and discrimination objectives, we impose them on features separately derived from a shared representation. In addition, we propose normalized projection heads and unsupervised hyper-parameter tuning for the first time.

Our extensive experimentation demonstrates that CLD is a lean and powerful add-on to existing methods (e.g., NPID, MoCo, InfoMin, BYOL) on highly correlated, long-tail, or balanced datasets. It not only achieves a new state of the art on self-supervision, semi-supervision, and transfer learning benchmarks, but also beats MoCo v2 and SimCLR on every reported performance they attain with much larger compute. CLD effectively extends unsupervised learning to natural data and brings it closer to real-world applications.

METHOD

-----

Method overview



Our goal is to learn a representation f(xi) given image xi and its alternative view x'i from data augmentation. We fork two branches from f: a fine-grained instance branch fI and a coarse-grained group branch fG. All the computation is mirrored and symmetrical with respect to the two views of the same instance.
1) Instance Branch: We apply contrastive loss (two bottom C's) between fI(xi) and vi from a global memory bank, where vi holds the prototype for xi, computed from the average feature over the augmented views of xi.
2) Group Branch: We perform local clustering of fG(xi) over a batch of instances to find k centroids {M1, ..., Mk}, with instance i assigned to centroid Γ(i). Their counterparts in the alternative view are fG(x'i), M', and Γ'.
3) Cross-Level Discrimination: We apply contrastive loss (two top C's) between the feature fG(xi) and the centroids M' according to grouping Γ', and vice versa for x'i.
4) Two similar instances xi and xj, whose similarity is not known a priori, would be pushed apart by the instance-level contrastive loss but pulled closer by the cross-level contrastive loss, as they repel common negative groups. The forces from branches fI and fG act on their common feature basis f, organizing it into one that respects both instance grouping and instance discrimination.


Fig 1. Method overview.
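
To make the cross-level term concrete, here is a minimal PyTorch-style sketch of steps 2) and 3): a naive in-batch k-means on the group-branch features of one view, followed by a contrastive loss that pulls each instance in the other view toward the centroid its counterpart was assigned to. The function names, the simple k-means routine, and the hyper-parameter values (k, temperature T) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def batch_kmeans(z, k, iters=10):
    """Naive in-batch k-means on L2-normalized features.
    Returns centroids (k x d) and hard assignments (N,). Assumes batch size >= k."""
    z = F.normalize(z, dim=1)
    centroids = z[torch.randperm(z.size(0))[:k]].clone()    # random init from the batch
    for _ in range(iters):
        assign = (z @ centroids.t()).argmax(dim=1)           # nearest centroid by cosine similarity
        for j in range(k):
            members = z[assign == j]
            if members.shape[0] > 0:                          # keep old centroid if cluster is empty
                centroids[j] = F.normalize(members.mean(dim=0), dim=0)
    return centroids, assign

def cross_level_loss(z_g, z_g_other, k=32, T=0.2):
    """Contrast group-branch features of one view against the centroids of the other view.
    The positive for instance i is the centroid its counterpart in the other view belongs to."""
    M, gamma = batch_kmeans(z_g_other.detach(), k)            # centroids M' and grouping Γ' from the other view
    logits = F.normalize(z_g, dim=1) @ M.t() / T              # similarity of each instance to all centroids
    return F.cross_entropy(logits, gamma)
```

The full objective would add the symmetric term with the two views swapped, plus the instance-level contrastive loss of step 1), weighted as described in the paper.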

Normalized Projection Heads



Existing methods map the feature onto a unit hypersphere with a projection head followed by normalization. NPID and MoCo use a single FC layer as a linear projection head. MoCo v2, SimCLR, and BYOL adopt a multi-layer perceptron (MLP) head, which helps on large datasets but can hurt on small ones. We propose to normalize both the FC layer weights W and the shared feature vector f, so that projecting f onto W simply computes their cosine similarity. The final normalized d-dimensional feature N(xi) has t-th component:

N_t(x_i) = \frac{\langle W_t, f(x_i) \rangle}{\|W_t\|\,\|f(x_i)\|}, \quad t = 1, \dots, d.
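
Read as code, the normalized projection head is a cosine-similarity layer: both the weight rows W_t and the feature f(x_i) are L2-normalized before the inner product. The sketch below is one possible PyTorch rendering of the equation above; the class name and dimensions are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedProjectionHead(nn.Module):
    """Projects the shared feature f onto row-normalized weights W, so that each
    output component is the cosine similarity <W_t, f(x_i)> / (|W_t| |f(x_i)|)."""
    def __init__(self, in_dim=2048, out_dim=128):
        super().__init__()
        self.W = nn.Parameter(torch.randn(out_dim, in_dim))

    def forward(self, f):
        W = F.normalize(self.W, dim=1)   # unit-norm rows W_t
        f = F.normalize(f, dim=1)        # unit-norm feature f(x_i)
        return f @ W.t()                 # N_t(x_i): one cosine similarity per output dimension
```

Both the instance head fI and the group head fG in Fig. 1 could be instantiated this way on top of the shared backbone feature f.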

Our work makes 4 major contributions.

  1) We extend unsupervised feature learning to natural data with high correlation and long-tail distributions.

  2) We propose cross-level discrimination between instances and local groups, to discover and integrate between-instance similarity into contrastive learning.

  3) We also propose normalized projection heads and unsupervised hyper-parameter tuning.

  4) Our experimentation demonstrates that adding CLD and normalized projection heads to existing methods incurs negligible model complexity overhead yet delivers a significant performance boost.

Results

-----

Fig 2. CLD outperforms unsupervised baselines on the high-correlation dataset Kitchen-HC (left) and on long-tailed datasets (right).
Fig 3. Consistent improvements can be observed on small- and medium-scale benchmarks.
Fig 4. Linear classifier top-1 accuracy (%) of self-supervised learning on ImageNet. CLD obtains the state of the art among all methods under 100-epoch and 200-epoch pre-training.
Fig 5. Comparison with the state of the art under 100 and 200 training epochs.
Fig 6. Top-1 accuracy of semi-supervised learning (1% and 10% label fractions) on ImageNet.
Fig 7. Transfer learning results on object detection.


CITATION

-----


@article{wang2020unsupervised,
    title={Unsupervised Feature Learning by Cross-Level Instance-Group Discrimination},
    author={Wang, Xudong and Liu, Ziwei and Yu, Stella X},
    journal={arXiv preprint arXiv:2008.03813},
    year={2020}
}