Fast Approximate Spectral Clustering

Fast approximate spectral clustering is a general framework we proposed recently for the spectral clustering of large datasets. The aim is to preserve the virtue of spectral clustering yet being able to meet the scability challenge of emerging data mining tasks. It grows out of our work on the perturbation analysis of spectral clustering [1] as well as extensive empirical studies. The main idea is to perform a distortion minimizing local transformation on the data to get a small number of representative points on which the cluster membership of all data points can be derived. There are three steps involved:

Distortion minimizing local transformation on the data to get a small set of representative points which preserve the "clusterness" of the original data.
Spectral clustering on the set of representative points.
Recovery of the cluster membership for all data points based on their correspondence to the representative points.

Since the distortion minimizing local transformation can be computed with fast procedures while the expensive spectral clustering is only performed on the representative points, large speedup can be achieved. Our theory predicts that if the distortion resulted from the data "compression" is small, then algorithms implemented within our framework will incur very little loss in clustering "accuracy".

Currently we have two implementations for distortion minimizing local transformation, one based on K-means clustering and the other on random projection trees. These two are chosen for their favorable empirical computational efficiency and the simplicity of their implementation. Evidently other approaches can be adopted if the resulting distortion can be made small. For more details on the general framework, please refer to [1].

R implementation Download
Example datasets (from UC Irvine Machine Learning Repository) Download
You are welcome to send questions, comments, suggestions, or to report bugs to us. Thank you!

Citation
[1] Donghui Yan, Ling Huang and Michael I. Jordan. Fast approximate spectral clustering 15th ACM Conference on Knowledge Discovery and Data Mining (SIGKDD), Paris, France, 2009. [Long version].