A Brief Introduction

Our approach to clustering and classification is based on the principle of Lossy Minimum Description Length. For a data matrix $X = [x_1, \ldots, x_m] \in \mathbb{R}^{n \times m}$, we measure the number of bits needed to code the data up to distortion ε² using the following formula (see our PAMI '07 paper for details):

$$L(X) \;=\; \frac{m+n}{2} \,\log_2 \det\!\left(I + \frac{n}{\varepsilon^2 m} X X^\top\right).$$

We cast clustering and classification as lossy coding problems, seeking solutions that minimize an appropriate coding length.
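
As a concrete illustration, here is a minimal NumPy sketch of this coding-length computation; the function name coding_length and the use of slogdet are our choices for the sketch, not part of the paper:

```python
import numpy as np

def coding_length(X, eps):
    """Bits needed to code the columns of X up to distortion eps^2.

    X   : (n, m) array with one sample per column.
    eps : allowable distortion.
    """
    n, m = X.shape
    # L(X) = (m + n)/2 * log2 det(I + n/(eps^2 m) X X^T)
    _, logdet = np.linalg.slogdet(np.eye(n) + (n / (eps**2 * m)) * (X @ X.T))
    return (m + n) / 2.0 * logdet / np.log(2.0)
```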

Clustering
The optimal segmentation minimizes the number of bits needed to encode the segmented data up to distortion ε².

Suppose the data are partitioned into k disjoint subsets, $X = X_1 \cup X_2 \cup \cdots \cup X_k$. If we encode each subset separately, the total number of bits required is

$$L^s(X_1, \ldots, X_k) \;=\; \sum_{i=1}^{k} \left[\, L(X_i) + |X_i| \,\log_2 \frac{m}{|X_i|} \,\right].$$

The second term in this equation counts the number of bits needed to represent the membership of the m samples in the k subsets (e.g., by Huffman coding).

Globally minimizing $L^s$ is a combinatorial problem. Nevertheless, the following agglomerative algorithm is quite effective for segmenting data lying on subspaces of varying dimension. It initially treats each sample as its own group, then iteratively merges the pair of groups whose merging most reduces the total coding length. The algorithm terminates when no merge can reduce the coding length any further.
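
A naive sketch of this merging loop, under our own naming (total_length, agglomerate) and with the coding_length function from the introduction repeated for self-containedness; it re-evaluates every candidate pair at each step for clarity, whereas a practical implementation would cache the pairwise merge costs:

```python
import numpy as np

def coding_length(X, eps):
    """Bits to code the columns of X up to distortion eps^2 (formula above)."""
    n, m = X.shape
    _, logdet = np.linalg.slogdet(np.eye(n) + (n / (eps**2 * m)) * (X @ X.T))
    return (m + n) / 2.0 * logdet / np.log(2.0)

def total_length(groups, m_total, eps):
    """Coding length of a segmentation: per-group bits plus membership bits."""
    return sum(coding_length(X, eps) + X.shape[1] * np.log2(m_total / X.shape[1])
               for X in groups)

def agglomerate(X, eps):
    """Greedily merge groups while doing so reduces the total coding length."""
    m = X.shape[1]
    groups = [X[:, [j]] for j in range(m)]        # every sample starts alone
    while len(groups) > 1:
        current = total_length(groups, m, eps)
        best_delta, best = 0.0, None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                trial = [g for i, g in enumerate(groups) if i not in (a, b)]
                trial.append(np.hstack([groups[a], groups[b]]))
                delta = total_length(trial, m, eps) - current
                if delta < best_delta:            # merging a and b saves bits
                    best_delta, best = delta, trial
        if best is None:                          # no merge helps: stop
            break
        groups = best
    return groups
```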

The algorithm automatically selects the number of groups and is robust to noise and outliers. To read further, please see our PAMI '07 paper.

Classification
Assign the test sample to the class of training data that minimizes the number of bits needed to code it together with that training data.

In this setting we are given training data $X_1, \ldots, X_k$ from k classes, and a test sample x. The number of additional bits needed to code x together with the i-th class is

$$\delta L(x, X_i) \;=\; L(X_i \cup \{x\}) - L(X_i) - \log_2 \frac{m_i}{m},$$

where $m_i$ is the number of training samples in the i-th class and m is the total number of training samples. Here, the third term counts the number of bits needed to encode the assignment of x to the i-th class.
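
In code, the decision rule is an argmin over these incremental coding lengths. A minimal sketch, again repeating the coding_length function from the introduction and using our own name micl_classify:

```python
import numpy as np

def coding_length(X, eps):
    """Bits to code the columns of X up to distortion eps^2 (formula above)."""
    n, m = X.shape
    _, logdet = np.linalg.slogdet(np.eye(n) + (n / (eps**2 * m)) * (X @ X.T))
    return (m + n) / 2.0 * logdet / np.log(2.0)

def micl_classify(x, classes, eps):
    """Assign x to the class with the smallest incremental coding length.

    classes : list of (n, m_i) arrays, one per class, samples as columns.
    """
    m = sum(Xi.shape[1] for Xi in classes)
    deltas = [coding_length(np.hstack([Xi, x[:, None]]), eps)
              - coding_length(Xi, eps)
              - np.log2(Xi.shape[1] / m)          # bits for the class label
              for Xi in classes]
    return int(np.argmin(deltas))
```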


This algorithm produces a family of classifiers, indexed by the distortion ε², that generalize the classical MAP classifier. The distortion can be viewed as inducing a regularization effect, improving performance with finite samples and degenerate distributions in high-dimensional spaces, without sacrificing asymptotic optimality. To read more, see our preprint on MICL.

The use of lossy coding distinguishes both approaches from traditional (lossless) Minimum Description Length (MDL) techniques. Lossy coding is especially relevant when the data contain degenerate structures such as low-dimensional subspaces or submanifolds. Classical approaches become ill-defined in this case, but lossy coding effectively regularizes the distribution, allowing the same algorithms to handle both degenerate and generic data.
To read more, please visit our references page.