Clustering and Classification via Lossy Compression

Welcome!

This website introduces a new mathematical framework for two related classical problems in statistical learning: data clustering and data classification. The basic idea is to cast clustering or classification as a (lossy) data compression problem, in which the preferred solution is the one that minimizes the coding length of the data. The resulting clustering and classification solutions are provably optimal, and the proposed algorithms are simple, efficient, and robust. They achieve state-of-the-art clustering and classification results, especially for high-dimensional data such as images or gene expression data.

This page gives a visual introduction to some representative problems that can be solved by this method. Please follow the links at left for a basic introduction to the approach, complete technical details in the reference papers, sample code for all the algorithms, or implementation details for some real-world applications.
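The compression idea above can be made concrete with a small numerical sketch. It assumes a Gaussian-style estimate of the number of bits needed to code m zero-mean samples of dimension n up to distortion eps^2, namely (m + n)/2 · log2 det(I + n/(eps^2 m) W W^T); treat the exact form as an assumption and check it against the reference papers. The function names, the value of eps, and the synthetic data are illustrative and are not taken from the sample code on this site.

```python
import numpy as np

def coding_length(W, eps):
    """Estimated bits to code the columns of W up to distortion eps**2.

    W has shape (n, m): m samples of dimension n, treated as zero-mean.
    Assumed formula: (m + n) / 2 * log2 det(I + n / (eps**2 * m) * W @ W.T).
    """
    n, m = W.shape
    _, logdet = np.linalg.slogdet(np.eye(n) + (n / (eps**2 * m)) * (W @ W.T))
    return 0.5 * (m + n) * logdet / np.log(2.0)  # convert natural log to bits

def segmented_coding_length(groups, eps):
    """Bits when each group is coded on its own, plus bits for the group labels."""
    m = sum(G.shape[1] for G in groups)
    total = 0.0
    for G in groups:
        m_i = G.shape[1]
        total += coding_length(G, eps) + m_i * (-np.log2(m_i / m))
    return total

rng = np.random.default_rng(0)
# Two synthetic groups, each lying near a different line (1-D subspace) in R^3.
A = np.outer([1.0, 0.0, 0.0], rng.normal(size=50)) + 0.01 * rng.normal(size=(3, 50))
B = np.outer([0.0, 1.0, 0.0], rng.normal(size=50)) + 0.01 * rng.normal(size=(3, 50))

together = coding_length(np.hstack([A, B]), eps=0.1)
separate = segmented_coding_length([A, B], eps=0.1)
print(f"coded together: {together:.1f} bits, coded separately: {separate:.1f} bits")
```

On this toy data, coding the two groups separately should take fewer bits than coding them as a single group, which is the sense in which a good segmentation is a good compression.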

Unsupervised Learning

Given noisy data with multiple structures, group the data into simpler components:

Input: Multivariate mixed data

Output: Subspace-like clusters

Clustering Mixed Data (video, .avi)
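One simple way to search for a grouping with short coding length is a greedy, bottom-up merge: start with every sample in its own group and repeatedly merge the pair of groups whose merge saves the most bits, stopping when no merge saves any. The sketch below is an illustration under assumptions, not the sample code from this site: the coding-length formula is the assumed Gaussian-style estimate, and `group_cost`, `greedy_cluster`, the value of eps, and the synthetic data are hypothetical.

```python
import numpy as np

def coding_length(W, eps):
    """Assumed estimate of bits to code the columns of W (shape (n, m))."""
    n, m = W.shape
    _, logdet = np.linalg.slogdet(np.eye(n) + (n / (eps**2 * m)) * (W @ W.T))
    return 0.5 * (m + n) * logdet / np.log(2.0)

def group_cost(W, idx, eps):
    """Bits for one group: its samples plus the group label of each member."""
    m = W.shape[1]
    return coding_length(W[:, idx], eps) + len(idx) * (-np.log2(len(idx) / m))

def greedy_cluster(W, eps):
    """Merge groups pairwise while a merge reduces the total coding length."""
    groups = [[j] for j in range(W.shape[1])]
    while len(groups) > 1:
        best = None  # (change in bits, index a, index b)
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                delta = (group_cost(W, groups[a] + groups[b], eps)
                         - group_cost(W, groups[a], eps)
                         - group_cost(W, groups[b], eps))
                if best is None or delta < best[0]:
                    best = (delta, a, b)
        if best[0] >= 0:  # no merge shortens the code, so stop
            break
        _, a, b = best
        groups[a] = groups[a] + groups[b]
        del groups[b]
    return groups

rng = np.random.default_rng(0)
# 20 synthetic samples: the first 10 near one line in R^3, the last 10 near another.
coef = rng.uniform(1.0, 2.0, size=20) * rng.choice([-1.0, 1.0], size=20)
dirs = np.array([[1.0, 0.0, 0.0]] * 10 + [[0.0, 1.0, 0.0]] * 10).T
W = dirs * coef + 0.01 * rng.normal(size=(3, 20))

groups = greedy_cluster(W, eps=0.1)
print([sorted(g) for g in groups])
```

This version recomputes every pairwise cost at each step for clarity; it is only practical for small data sets. The number of groups is not specified in advance: it is determined by the distortion eps.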

Real data such as images and videos often have this kind of complicated, mixed structure. Segmentation breaks an image or video into smaller, simpler pieces, making it easier to represent, understand, or process:

Original image

Segmentations at three different scales

To read more about the theoretical aspects of this problem and its general solution, please visit our introduction page or refer to our paper, Segmentation of Multivariate Mixed Data via Lossy Coding and Compression, on the references page.

You can find more details about the image segmentation solution derived from this framework on our Natural Image Segmentation page or in our paper, Segmentation of Natural Images by Texture and Boundary Compression, on the references page.

Supervised Learning: Data Classification

Given a set of training examples whose identity is known, determine the identity of a new sample:

Training examples ( o and o ) and a test example ( x ) whose identity is unknown

Decision boundary between two classes
Test sample is classified as ( o )
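The classification rule illustrated above can be sketched in the same compression terms: assign the test sample to the class whose coding length grows the least when the sample is added to it. The sketch below assumes a coding-length estimate with a covariance term and a mean term, plus a label cost of -log2 of the class proportion; treat these formulas as assumptions to be checked against the MICL paper. The names `micl_classify`, `class_a`, and `class_b`, the value of eps, and the synthetic data are hypothetical.

```python
import numpy as np

def coding_length(X, eps):
    """Assumed estimate of bits to code the columns of X (shape (n, m)).

    Covariance term: (m + n) / 2 * log2 det(I + n / eps**2 * cov)
    Mean term:       n / 2 * log2(1 + |mu|^2 / eps**2)
    """
    n, m = X.shape
    mu = X.mean(axis=1, keepdims=True)
    cov = (X - mu) @ (X - mu).T / m
    _, logdet = np.linalg.slogdet(np.eye(n) + (n / eps**2) * cov)
    bits_data = 0.5 * (m + n) * logdet / np.log(2.0)
    bits_mean = 0.5 * n * np.log2(1.0 + (mu**2).sum() / eps**2)
    return bits_data + bits_mean

def micl_classify(x, classes, eps):
    """Return the label whose class needs the fewest extra bits to absorb x."""
    m_total = sum(X.shape[1] for X in classes.values())
    best_label, best_bits = None, np.inf
    for label, X in classes.items():
        with_x = np.hstack([X, x.reshape(-1, 1)])
        extra = (coding_length(with_x, eps) - coding_length(X, eps)
                 - np.log2(X.shape[1] / m_total))  # cost of coding the label
        if extra < best_bits:
            best_label, best_bits = label, extra
    return best_label

rng = np.random.default_rng(0)
# Two synthetic 2-D classes of 30 training examples each.
classes = {
    "class_a": rng.normal(loc=0.0, scale=0.5, size=(2, 30)),
    "class_b": rng.normal(loc=4.0, scale=0.5, size=(2, 30)),
}
print(micl_classify(np.array([3.8, 4.1]), classes, eps=0.5))
```

A test point near the second cluster barely changes that class's covariance, so it costs only a few extra bits there, while adding it to the first class would stretch that class's covariance and cost many more.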

One of the most prominent applications of this fundamental problem is automatic face recognition, where the training examples are images of human faces taken under varying expression, pose, or lighting conditions. The test example is an image of a face, possibly taken under a different pose, expression, or lighting condition. The task is to identify which of the individuals in the training database is pictured in the test image:

Training examples

Test image

Classification Result

To read more about the classification problem and its solution, please refer to our introduction page or to our paper, Classification via Minimum Incremental Coding Length (MICL), on the references page.

Credits

This website is developed and maintained by the research group of Professor Yi Ma at the University of Illinois at Urbana-Champaign.

This material is based upon work supported by the National Science Foundation under Award Nos. IIS-0347456, CRS-EHS-0509151, and CCF-TF-0514955, and by the Office of Naval Research under Award No. N0001405-1-0633. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the National Science Foundation or the Office of Naval Research.

Please direct all questions and comments about this website to the webmaster: John Wright (jnwright@uiuc.edu).