When users are momentarily occluded or exit the scene, short-term tracking will fail, since position and size correspondences in the individual modules are unavailable. To track users over medium and long-term time scales, we rely on statistical appearance models. Each visual processing module computes an estimate of certain user attributes that are expected to be stable over longer time periods. These attributes are averaged for as long as the underlying range silhouette continues to be tracked in the short term, and are then used in a classification stage to establish medium and long-term correspondences.
Like multi-modal person detection and tracking, multi-modal person appearance classification is more robust than classification systems based on a single data modality. Height, color, and face pattern each offer independent classification data and are accompanied by similarly independent failure modes. Although face patterns are perhaps the most common data source for current passive person classification methods, it is unusual to incorporate height or color information in identification systems because they do not provide sufficient discrimination to justify their use alone. However, combined with each other and with face patterns, height and color can provide important cues to disambiguate otherwise similar people, or help classify people when only degraded data is available in other modes.
In the range module, we estimate the height of the user and use this as an attribute of identity. Height is obtained by computing the median value of the highest point of a user's silhouette in 3-D. In the color module, we compute the average color of the skin and hair regions; we plan to also add a histogram of clothing color. We define the hair region to be those pixels above the face but on the range silhouette; clothing can be defined as all other silhouette pixels not labeled as skin or hair.
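As a minimal sketch of the height attribute, the following assumes that the median is taken over frames and that each silhouette is available as a list of 3-D points whose second coordinate measures height above the floor. Both the axis convention and the function names are assumptions made for illustration, not details given above.

```python
from statistics import median


def silhouette_top(points_3d):
    """Return the largest height coordinate among a silhouette's 3-D points.

    Assumption: each point is (x, y, z) with y measured upward from the floor.
    """
    return max(point[1] for point in points_3d)


def estimate_height(frames):
    """Median, over frames, of the highest point of each frame's silhouette."""
    return median(silhouette_top(points) for points in frames)


# Three hypothetical frames; the last contains an outlier such as a raised arm.
frames = [
    [(0.0, 1.70, 2.0), (0.1, 1.20, 2.0)],
    [(0.0, 1.72, 2.1), (0.1, 0.90, 2.1)],
    [(0.0, 2.40, 2.0), (0.1, 1.10, 2.0)],
]
height = estimate_height(frames)
```

Using the median rather than the mean keeps a single outlying frame from shifting the estimate.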
In the face detector, we record an image of the actual face pattern whenever the detector records a hit. When a region is identified as a face by the face pattern detection algorithm, the face pattern (greyscale subimage) in the target region is normalized and then passed to the classification stage. For optimal classification, we want the scale, alignment, and view of detected faces to be comparable. We resize the pattern to normalize for size, and discard images that are not in canonical pose or expression; this is determined by normalized correlation with an average canonical face.
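The normalization and filtering step can be sketched as follows. The nearest-neighbor resize, the zero-mean form of normalized correlation, and the `min_corr` threshold of 0.8 are all assumptions chosen for illustration; the text does not specify them.

```python
import math


def resize_nearest(image, out_h, out_w):
    """Resize a 2-D greyscale image (list of rows) by nearest-neighbor sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def normalized_correlation(a, b):
    """Zero-mean normalized correlation of two equal-length pixel sequences."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same number of pixels")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # a constant patch carries no pattern information
    return sum(x * y for x, y in zip(da, db)) / denom


def keep_canonical(patterns, canonical, min_corr=0.8):
    """Keep only patterns that correlate well with the average canonical face."""
    return [p for p in patterns if normalized_correlation(p, canonical) >= min_corr]


canonical = [10, 20, 30, 40]   # hypothetical flattened canonical face
similar = [12, 22, 32, 42]     # same pattern, slightly brighter
inverted = [40, 30, 20, 10]    # opposite pattern
kept = keep_canonical([similar, inverted], canonical)
```

Because the correlation is computed on mean-subtracted values, a uniform brightness offset such as the one in `similar` does not lower the score.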
For ``medium-term'' tracking, e.g., over seconds or minutes of occlusion or absence, we rely on all of the above attributes. For ``long-term'' tracking, over hours or longer, we cannot rely on attributes which are not invariant with time of day or from day to day: we correct all color values with a mean color shift to account for changing illumination, and would exclude clothing color from the match computation.
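One plausible reading of the mean color shift is sketched below: the difference between a stored reference scene mean and the current scene mean is added to every measured color. How the shift is actually estimated is not stated here, so the use of a whole-scene mean and the names `scene_pixels` and `reference_mean` are assumptions.

```python
def mean_color(pixels):
    """Per-channel mean of a list of (r, g, b) pixels."""
    n = len(pixels)
    return tuple(sum(p[i] for p in pixels) / n for i in range(3))


def shift_to_reference(colors, scene_pixels, reference_mean):
    """Shift measured colors so the current scene mean matches a reference mean.

    Assumption: illumination change is modeled as one additive offset per channel.
    """
    current = mean_color(scene_pixels)
    shift = tuple(r - c for r, c in zip(reference_mean, current))
    return [tuple(v + s for v, s in zip(color, shift)) for color in colors]


# The scene is 10 units brighter than the reference, so colors are shifted down by 10.
corrected = shift_to_reference(
    [(200, 150, 90)],
    [(100, 100, 100), (120, 120, 120)],
    (100, 100, 100),
)
```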
In general, we compute statistics of these attributes while users are being tracked over the short term, and compare them against the stored statistics of all previously tracked users.
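Statistics of a scalar attribute such as height can be accumulated one frame at a time while a user remains tracked. The sketch below uses Welford's online update, which is one standard way to do this and is not necessarily the method used in the system; the class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """Online mean and sample variance of a scalar attribute (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one new per-frame measurement into the statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Sample variance; zero until at least two measurements are seen."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


height_stats = RunningStats()
for measurement in (1.70, 1.72, 1.74):  # hypothetical per-frame heights in meters
    height_stats.update(measurement)
```

The resulting mean and variance are the quantities that would be stored for a user once short-term tracking ends.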
When we observe a new person, we see if there is a previously tracked individual who could have generated the current observations. We find the previous individual most likely to have generated the new observations; if this probability is above a minimum threshold, we label the currently tracked region as being in correspondence with the previous individual. We integrate likelihood over time and modality: at time t, we find the identity estimate
$$P(F_0, \ldots, F_t \mid U_j) = P(F_0, \ldots, F_{t-1} \mid U_j)\, P(F_t \mid U_j) \qquad (4)$$
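The recursion in Eq. (4) and the thresholded match can be sketched in the log domain, where the product becomes a running sum and numerical underflow is avoided. The user labels, the per-frame likelihood values, and the threshold below are hypothetical.

```python
import math


def safe_log(p):
    """Natural log that maps non-positive probabilities to negative infinity."""
    return math.log(p) if p > 0.0 else float("-inf")


def accumulate(log_joint, frame_likelihoods):
    """Apply Eq. (4) in the log domain: add log P(F_t | U_j) for each user j."""
    return {
        user: log_joint.get(user, 0.0) + safe_log(p)
        for user, p in frame_likelihoods.items()
    }


def best_match(log_joint, min_log_likelihood):
    """Return the most likely stored user, or None if below the threshold."""
    if not log_joint:
        return None
    user = max(log_joint, key=log_joint.get)
    return user if log_joint[user] > min_log_likelihood else None


# Hypothetical per-frame likelihoods P(F_t | U_j) for two stored users.
observations = [{"alice": 0.9, "bob": 0.2}, {"alice": 0.8, "bob": 0.3}]
log_joint = {}
for frame in observations:
    log_joint = accumulate(log_joint, frame)
match = best_match(log_joint, math.log(0.5))
```

Since the joint likelihood shrinks as frames accumulate, a fixed threshold such as the one used here would need to account for the number of frames in practice; how the threshold is set is not specified in the text.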
We collect the mean and covariance of the observed user color data, and the mean and variance of user height; the likelihoods P(Fi|Uj) and P(Ci|Uj) are computed assuming a Gaussian density model. For face pattern data, we store the size- and position-normalized mean pattern for each user, and approximate P(Ft|Uj) with an empirically determined density that is a function of the normalized correlation of Ft with the mean pattern for person j.
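The Gaussian likelihoods for height and color can be sketched as below, working with log densities. The scalar form covers height (mean and variance) and the multivariate form covers color (mean and full covariance). The Cholesky-based evaluation is an implementation choice made for this sketch, and the function names are hypothetical; the empirically determined face density is not reproduced because its form is not given.

```python
import math


def gaussian_log_pdf(x, mean, variance):
    """Log density of a scalar Gaussian, e.g. for user height."""
    return -0.5 * (math.log(2.0 * math.pi * variance) + (x - mean) ** 2 / variance)


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def mv_gaussian_log_pdf(x, mean, cov):
    """Log density of a multivariate Gaussian, e.g. for mean skin or hair color."""
    n = len(x)
    lower = cholesky(cov)
    diff = [a - b for a, b in zip(x, mean)]
    # Forward substitution solves lower * y = diff; y . y is the squared
    # Mahalanobis distance of x from the mean.
    y = []
    for i in range(n):
        partial = sum(lower[i][k] * y[k] for k in range(i))
        y.append((diff[i] - partial) / lower[i][i])
    mahalanobis_sq = sum(v * v for v in y)
    log_det = 2.0 * sum(math.log(lower[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + mahalanobis_sq)


# Hypothetical stored statistics for one user and one new observation.
height_term = gaussian_log_pdf(1.74, 1.72, 0.0004)
color_term = mv_gaussian_log_pdf((1.0, 0.0), (0.0, 0.0), [[2.0, 1.0], [1.0, 2.0]])
```

Under an independence assumption between modalities, the per-modality log likelihoods would simply be added to obtain a combined score for each stored user.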