Pixel-wise classification, grouping and short-term tracking are performed independently in each modality. Stereo processing outputs a user's silhouette defined by range regions, color processing yields a set of skin color regions within the range silhouette boundaries, and face processing returns a list of detected frontal face patterns; we describe each module in turn. Each modality also provides an independent estimate of head location.
To compute a set of user silhouettes, we rely on a dense real-time stereo system. Video from a pair of cameras is used to estimate dense range using a technique based on the census transform ; we have implemented the census algorithm on a single PCI card, multi-FPGA reconfigurable computing engine . This stereo system is capable of computing 24 stereo disparities on 320 by 240 images at 42 frames per second, or approximately 77 million pixel-disparities per second. These processing speeds compare favorably with other real-time stereo implementations such as .
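The quoted throughput follows directly from the stated image size, disparity range, and frame rate. The short check below only restates that arithmetic; the variable names are ours.

```python
# Figures quoted in the text above.
disparities = 24
width, height = 320, 240
frames_per_second = 42

# One pixel-disparity is one candidate disparity evaluated at one pixel.
pixel_disparities_per_second = disparities * width * height * frames_per_second
print(pixel_disparities_per_second)  # 77414400, i.e. roughly 77 million
```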
Our segmentation and grouping technique proceeds in several stages of processing, as illustrated in Figure 3. We first smooth the raw range signal to reduce the effect of low confidence stereo disparities using a morphological closing operator. We then compute the response of a gradient operator on the smoothed range data and threshold at a critical value based on the observed noise level in our disparity data. Connected components analysis is applied to these regions of smoothly varying range. We return all connected components whose area exceeds a minimum threshold.
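The stages above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the implementation used in the system: the 3x3 structuring element, the forward-difference gradient, 4-connectivity, the treatment of zero disparity as "no data", and all function names and default thresholds are assumptions made for the example.

```python
from collections import deque

def _filter3x3(img, reduce_fn):
    """Apply reduce_fn (max or min) over each 3x3 neighbourhood, clipped at borders."""
    h, w = len(img), len(img[0])
    return [[reduce_fn(img[yy][xx]
                       for yy in range(max(0, y - 1), min(h, y + 2))
                       for xx in range(max(0, x - 1), min(w, x + 2)))
             for x in range(w)] for y in range(h)]

def closing(img):
    """Grey-scale morphological closing: dilation (max) followed by erosion (min)."""
    return _filter3x3(_filter3x3(img, max), min)

def low_gradient_mask(img, grad_thresh):
    """True where the forward-difference gradient magnitude is at most grad_thresh."""
    h, w = len(img), len(img[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx = img[y][min(x + 1, w - 1)] - img[y][x]
            dy = img[min(y + 1, h - 1)][x] - img[y][x]
            mask[y][x] = abs(dx) + abs(dy) <= grad_thresh
    return mask

def connected_components(mask, min_area):
    """Return 4-connected regions (lists of (y, x)) whose area exceeds min_area."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for sy in range(h):
        for sx in range(w):
            if not mask[sy][sx] or seen[sy][sx]:
                continue
            seen[sy][sx] = True
            queue, pixels = deque([(sy, sx)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(pixels) > min_area:
                regions.append(pixels)
    return regions

def segment_range(disparity, grad_thresh=1, min_area=4):
    """Smooth, keep pixels of smoothly varying range, then group them into regions."""
    smoothed = closing(disparity)
    smooth = low_gradient_mask(smoothed, grad_thresh)
    # Assumption: a disparity of 0 means background or no valid stereo match.
    mask = [[ok and d > 0 for ok, d in zip(mask_row, disp_row)]
            for mask_row, disp_row in zip(smooth, smoothed)]
    return connected_components(mask, min_area)
```

Because depth discontinuities produce large gradients, two people standing at different ranges fall into separate components even when their silhouettes are adjacent in the image.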
The range processing module provides these user silhouettes, as well as estimates of head location. A candidate head is placed below the maxima of the range profile. Head position is refined in the integration stage, as described below.
Disparity estimation, segmentation, and grouping are repeated independently at each time step; range silhouettes are tracked from frame to frame based on position and size constancy. The centroid and size of each new range silhouette are compared to those of the silhouettes from the previous time step. ``Short-term'' correspondences are assigned using a greedy algorithm starting with the closest unmatched region; for each new region, the closest old region that lies within a threshold is marked as its correspondence.
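One way to realize such a greedy assignment is sketched below. The text does not specify how position and size are combined, so the particular test used here (a centroid distance limit plus an area ratio limit), the function name, and the default thresholds are all assumptions made for illustration.

```python
import math

def greedy_match(old, new, max_dist=30.0, max_size_ratio=1.5):
    """Greedily pair new regions with old ones, closest pairs first.

    old, new: lists of (cx, cy, area) tuples with positive area.
    Returns a dict mapping new-region index to old-region index.
    max_dist and max_size_ratio are hypothetical thresholds.
    """
    candidates = []
    for i, (nx, ny, na) in enumerate(new):
        for j, (ox, oy, oa) in enumerate(old):
            dist = math.hypot(nx - ox, ny - oy)
            ratio = max(na, oa) / min(na, oa)
            if dist <= max_dist and ratio <= max_size_ratio:
                candidates.append((dist, i, j))

    # Visit candidate pairs from closest to farthest; each region is used once.
    candidates.sort()
    matches, used_old = {}, set()
    for _, i, j in candidates:
        if i not in matches and j not in used_old:
            matches[i] = j
            used_old.add(j)
    return matches

previous = [(50, 100, 4000), (200, 110, 3500)]
current = [(205, 112, 3600), (54, 101, 4100), (400, 90, 500)]
print(greedy_match(previous, current))
```

In this example the third new region has no old region within the distance limit, so it is left unmatched and would be treated as a newly appeared silhouette.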
Skin color is a useful cue for tracking people's faces and other body parts. We detect skin using a classification strategy which matches skin hue but is largely invariant to intensity or saturation, as this is robust to shading due to illumination and/or the absolute amount of skin pigment in a particular person.
We apply color segmentation to images obtained from one camera. Each pixel is initially represented by its red, green, and blue channel values, and is converted into a ``log color-opponent'' space. This space can directly represent the approximate hue of skin color, as well as its log intensity value. We convert (R,G,B) tuples into tuples of the form (log(G),log(R)-log(G),log(B)-(log(R)+log(G))/2). Skin color is detected using a classifier with an empirically estimated Gaussian probability model of ``skin'' and ``not-skin'' in the log color-opponent space. When a new pixel p is presented for classification, the likelihood ratio P(p=skin)/P(p=not-skin) is computed as a classification score. Our color representation is similar to that used in , but we estimate our classification criteria from examples rather than apply hand-tuned parameters. For computational efficiency at run-time, we precompute a lookup table over all possible color values.
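The conversion and the likelihood-ratio score can be sketched as follows. Several details are assumptions made for the example and are not taken from the system described here: the Gaussian parameters are invented placeholders rather than estimated values, the model is axis-aligned and uses only the two opponent channels, 1 is added before taking logarithms to avoid log(0), the score is returned as a log ratio (a monotonic function of the ratio itself), and the lookup table is quantized to 5 bits per channel to keep the sketch fast.

```python
import math

def log_opponent(r, g, b):
    """Map 8-bit (R, G, B) to (log G, log R - log G, log B - (log R + log G) / 2)."""
    # Assumption: offset by 1 so that zero-valued channels stay finite.
    lr, lg, lb = (math.log(c + 1.0) for c in (r, g, b))
    return lg, lr - lg, lb - (lr + lg) / 2.0

def log_gaussian(x, mean, var):
    """Log density of an axis-aligned Gaussian (diagonal covariance)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

# Hypothetical (mean, variance) pairs over the two opponent channels.
SKIN = ((0.35, -0.30), (0.02, 0.02))
NOT_SKIN = ((0.0, 0.0), (0.25, 0.25))

def skin_score(r, g, b):
    """Log likelihood ratio log P(p | skin) - log P(p | not-skin)."""
    _, rg, by = log_opponent(r, g, b)  # the log-intensity channel is ignored
    return log_gaussian((rg, by), *SKIN) - log_gaussian((rg, by), *NOT_SKIN)

BITS = 5  # assumption: quantize each channel to 5 bits to keep the table small

def build_lut(bits=BITS):
    """Precompute the score for every quantized (R, G, B) value."""
    shift, n = 8 - bits, 1 << bits
    return [skin_score(r << shift, g << shift, b << shift)
            for r in range(n) for g in range(n) for b in range(n)]

def lookup(lut, r, g, b, bits=BITS):
    """Fetch the precomputed score for an 8-bit pixel."""
    shift = 8 - bits
    index = ((r >> shift) << (2 * bits)) | ((g >> shift) << bits) | (b >> shift)
    return lut[index]
```

Because the score depends only on differences of log channel values, scaling all three channels by a common factor (a change in illumination intensity) leaves it nearly unchanged, which is the invariance the representation is chosen for.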
After the lookup table has been applied, segmentation and grouping analysis are performed on the classification score image. As in the range case, we apply morphological smoothing, threshold above a critical value, and compute connected components. There is one difference, however: before smoothing we apply the low-gradient mask from the range modality. This restricts color regions to grow only within the boundaries of range regions; if spurious skin hue is present in the background, it will not adversely affect the shape of foreground skin color regions.
As with range processing, classification, segmentation, and grouping are repeated at each time step. Short-term tracking is performed on recovered color regions based on similar centroid position and region size. When a color region changes size dramatically, we check whether two regions have merged or one region has split into two. If so, we record the identity of the split or merged regions, to be used in the integration stage as described below.
Skin color regions that are above the midline of their associated range component, and which are appropriately sized at the given depth to be heads, are labeled as candidate heads and passed to the integration phase.
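The two tests in this rule can be illustrated with a short sketch. The text does not give the size criterion, so the pinhole-camera size model, the focal length, the nominal head width, the tolerance, the use of metric depth, and the dictionary field names below are all assumptions introduced for the example.

```python
def is_candidate_head(color_region, range_component,
                      focal_px=400.0, head_width_m=0.16, tolerance=0.5):
    """Return True if a skin color region plausibly corresponds to a head.

    color_region: dict with "cy" (centroid row), "width_px", and "depth_m".
    range_component: dict with "top" and "bottom" rows of its bounding box.
    All parameter defaults are hypothetical values chosen for illustration.
    """
    # Image rows grow downward, so "above the midline" means a smaller row index.
    midline = (range_component["top"] + range_component["bottom"]) / 2.0
    above_midline = color_region["cy"] < midline

    # Pinhole model: apparent width in pixels shrinks in proportion to depth.
    expected_px = focal_px * head_width_m / color_region["depth_m"]
    size_ok = abs(color_region["width_px"] - expected_px) <= tolerance * expected_px
    return above_midline and size_ok

body = {"top": 40, "bottom": 240}
face = {"cy": 70, "width_px": 30, "depth_m": 2.0}
hand = {"cy": 180, "width_px": 30, "depth_m": 2.0}
print(is_candidate_head(face, body), is_candidate_head(hand, body))
```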
To distinguish head from hands and other body parts, and to localize the face within a region containing the head, we use pattern recognition methods which directly model the statistical appearance of faces based on intensity.
We based our implementation of this module on the CMU face detector  library. This library implements a neural network which models the appearance of frontal faces in a scene, and is similar to the pattern recognition approach described in . Both methods are trained on a structured set of examples of faces and non-faces.
Face detection is initially applied over the entire image; when one or more detections are recorded, they are passed directly as candidate head locations to the integration phase. Short-term tracking is implemented by restricting the search in a new frame to windows around the locations detected in the previous frame. If a new detection is found within such a window, it is considered to be in short-term correspondence with the previous detection; if no new detection is found and the detection in the previous frame overlapped a color or range region, then the face detection is updated to move with that region (as long as it persists).
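The update rule for face tracks can be sketched as below. The detector itself is not modeled: the sketch assumes the detections for the new frame are already available as a list of positions, and the data layout, function name, and window size are hypothetical choices made for illustration.

```python
def track_faces(prev_faces, new_detections, region_motion, window=20):
    """Carry face detections forward by one frame.

    prev_faces: dict face_id -> {"pos": (x, y), "region": region id or None}
    new_detections: list of (x, y) detections found in the new frame
    region_motion: dict region id -> (dx, dy) for regions that still persist
    window: hypothetical half-width of the search window, in pixels
    """
    updated = {}
    for face_id, face in prev_faces.items():
        px, py = face["pos"]
        # New detections that fall inside the window around the old location.
        inside = [(abs(x - px) + abs(y - py), (x, y))
                  for x, y in new_detections
                  if abs(x - px) <= window and abs(y - py) <= window]
        if inside:
            # Short-term correspondence with the nearest new detection.
            updated[face_id] = {"pos": min(inside)[1], "region": face["region"]}
        elif face["region"] in region_motion:
            # No new detection: move with the overlapping region while it persists.
            dx, dy = region_motion[face["region"]]
            updated[face_id] = {"pos": (px + dx, py + dy), "region": face["region"]}
        # Otherwise the track is dropped.
    return updated
```

A face that turns away from the camera is thus kept alive by the color or range region it overlapped, and is dropped only when neither a new detection nor a persisting region supports it.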