For each of the two image sequences discussed in this section, motion between all pairs of successive frames was computed using 1) the BCCE only (with measured depth), 2) the DCCE only, and 3) both constraints together. In addition, for each of these cases, parameters were computed using both the perspective and orthographic versions of the constraints. When combining the intensity and depth constraints into a single linear system, we chose the depth constraint weighting factor to be the ratio of the mean magnitudes of the I_t and Z_t values. This helps to equalize the contributions of the two sets of constraints to the least-squares solution.
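The weighting scheme described above can be sketched as follows (a minimal numpy sketch; the function and variable names are ours, and for simplicity we take the mean magnitudes of I_t and Z_t from the right-hand-side vectors of the two stacked linear systems):

```python
import numpy as np

def solve_combined(A_bcce, b_bcce, A_dcce, b_dcce):
    """Solve the combined BCCE + DCCE least-squares system.

    A_bcce, b_bcce: linear system built from the intensity constraint
                    (rows of b_bcce hold the -I_t terms).
    A_dcce, b_dcce: linear system built from the depth constraint
                    (rows of b_dcce hold the -Z_t terms).
    The DCCE rows are scaled by the ratio of mean |I_t| to mean |Z_t|
    so both constraint sets contribute comparably to the solution.
    """
    lam = np.mean(np.abs(b_bcce)) / np.mean(np.abs(b_dcce))
    A = np.vstack([A_bcce, lam * A_dcce])
    b = np.concatenate([b_bcce, lam * b_dcce])
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return theta, lam
```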
To evaluate the utility of measured depth and the DCCE, we also implemented a motion estimation system that uses only the BCCE and a simple, generic depth model, as in the class of methods that includes [2,3,4]. The particular generic depth model that we used was a plane parallel to the camera image plane, initialized at the depth of the object being tracked. We used this system to compute a fourth set of motion parameters for all of the synthetic sequences, according to both the orthographic and perspective versions of the BCCE.
For each pair of frames, the image support region for computation of motion parameters was taken to be, as a first pass, the intersection of the sets of pixels in each frame for which depth is non-background and for which no spatial derivative includes differences with background pixels. However, because sampling and object self-occlusion (e.g. of the neck by the chin, in our sequences) create large depth gradients which do not remain consistent with object pose during motion, we found it helpful to eliminate from the support map all pixels for which the magnitude of the depth gradient exceeded the mean magnitude by more than several standard deviations.
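The support-map computation above can be sketched as follows (parameter names, the zero-depth background convention, and the choice of k are our assumptions):

```python
import numpy as np

def support_mask(depth, background=0.0, k=3.0):
    """Build a support map for motion estimation from one depth image.

    A pixel is kept if (1) its depth is non-background, (2) all of the
    4-neighbours used by the spatial differences are also non-background,
    and (3) its depth-gradient magnitude is within k standard deviations
    of the mean gradient magnitude over the candidate pixels.
    """
    fg = depth != background
    # spatial derivatives must not mix foreground and background pixels
    valid = fg.copy()
    valid[1:, :] &= fg[:-1, :]
    valid[:-1, :] &= fg[1:, :]
    valid[:, 1:] &= fg[:, :-1]
    valid[:, :-1] &= fg[:, 1:]
    # reject large depth-gradient outliers (occlusion boundaries, sampling)
    gy, gx = np.gradient(depth)
    gmag = np.hypot(gx, gy)
    m, s = gmag[valid].mean(), gmag[valid].std()
    return valid & (gmag <= m + k * s)
```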
The first synthetic sequence begins with the face oriented toward the camera. The face then makes a 40 degree rotation downward about the X-axis over the course of 35 frames (approximately 1.14 degrees per frame), and returns to the starting position via the opposite rotation in the next 35 frames. The next 70 frames consist of the same rotation about the Y-axis, and the final 70 frames contain the same rotation about the Z-axis. The first two panels of Figure 1 show the intensity and depth images of the starting position for the sequence, while the third, fourth, and fifth panels show intensity images of the sequence's three extrema of rotation.
Figure 2 shows the three computed rotational pose parameters plotted against time over the course of the sequence, using each of the four methods described above, according to the perspective forms of the constraint equations. All rotational parameters are expressed in terms of Euler angles. The ground truth for the parameters, shown as solid lines in each graph, is the same for each graph: the leftmost solid triangle represents the steady rise in the X-axis rotational parameter from zero to 40 degrees and back to zero, the middle triangle represents the identical sequence of changes in the Y-axis rotational parameter, and the rightmost triangle represents these same changes in the Z-axis parameter. Only one Euler angle should be non-zero at any given time.
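Extracting the plotted Euler angles from an estimated rotation matrix can be sketched as follows (the paper does not state its exact angle convention; this sketch assumes R = Rz(c) Ry(b) Rx(a), and the function name is ours):

```python
import numpy as np

def euler_xyz(R):
    """Recover X, Y, Z Euler angles (radians) from a rotation matrix,
    assuming the composition R = Rz(c) @ Ry(b) @ Rx(a).

    With that convention, R[2,0] = -sin(b), R[2,1] = sin(a)cos(b),
    R[2,2] = cos(a)cos(b), R[0,0] = cos(c)cos(b), R[1,0] = sin(c)cos(b).
    Valid away from the gimbal-lock case cos(b) = 0.
    """
    b = np.arcsin(-R[2, 0])
    a = np.arctan2(R[2, 1], R[2, 2])
    c = np.arctan2(R[1, 0], R[0, 0])
    return a, b, c
```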
The results obtained using only the BCCE with the generic shape model
indicate that this method does not perform well for out-of-plane rotations
(i.e. rotations about the X-axis and Y-axis). All three Euler angles are
non-zero throughout these rotations, and the translational parameters (not
shown) were also very inaccurate. The second graph, on the other hand,
shows that simply adding measured depth to the BCCE greatly improves the
pose estimation. The third graph shows that using the DCCE instead of the
BCCE improves the estimation even further. The fourth graph shows that
the best results of all are obtained by using the BCCE and DCCE together.
The accumulated error at the end of the sequence is quite small despite
very large rotations of a rather complex (and incomplete) object.
The second synthetic sequence examines translation in depth, which causes difficulties for many pose and motion algorithms. The artificial face again begins the sequence oriented toward the camera, as shown in the first two panels of Figure 1, then translates steadily and directly away from the camera to the extreme position shown in the rightmost panel of this figure, and finally returns at the same speed to the starting position. The distance between the extreme positions of the face was approximately three times the width of the face model, with the extreme farthest position being about twice as far from the camera as the starting position. The translation between the two extreme positions took place in 150 frames.
Figure 3 shows the three computed translational parameters plotted against time using each of the four methods described above, according to the perspective forms of the constraint equations. The results assume the Z-axis is pointing toward the camera. The ground truth for the parameters, shown as solid lines in each graph, is the same for each graph: both the X- and Y-translations are zero throughout the sequence, while the Z-translation forms a triangle indicating its steady decrease to a position far behind the starting point and its subsequent increase back to the starting point. Again, the results using only the BCCE with a generic depth model indicate that this method does not perform well for translation in depth. Its estimates for the X- and Y-translations are quite noisy, while the Z-translation is greatly under-estimated. As for the first synthetic sequence, the graphs for the BCCE with measured depth and for the DCCE alone show significantly improved results, while the graph for the joint use of the BCCE and DCCE shows the best results of all, with very little accumulated error at the end of the sequence.
In general, orthographic projection results for the two sequences were slightly worse than the perspective results, due to the error in the camera model assumption. Of course, from the orthographic BCCE (15) it is apparent that translation in depth cannot be recovered using only the BCCE, even with accurate depth. We indeed found this shortcoming in our results for the second sequence.
As for the synthetic sequences, we computed motion estimates using 1) the BCCE only with a generic shape model, 2) the BCCE only with measured depth, 3) the DCCE only, and 4) the combined BCCE and DCCE with measured depth. We used the perspective forms of the constraint equations throughout. The weighting factor was chosen as described for the synthetic sequences. Image support regions were computed automatically by selecting large connected foreground regions with smoothly changing range data. This excludes pixels with an uncertain depth value, typically due to occlusion or low contrast. Also, unlike the synthetic sequences, real depth imagery is noisy, and we found it advantageous to smooth the depth images prior to computing gradients.
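The pre-smoothing step can be sketched as follows (the Gaussian kernel, its width, and the edge padding are our assumptions; the paper does not specify the smoothing filter):

```python
import numpy as np

def smoothed_gradients(depth, sigma=1.0):
    """Smooth a noisy real depth image before differencing.

    Builds a separable Gaussian kernel by hand (to keep the sketch
    dependency-free), convolves rows then columns with edge padding so
    the output matches the input size, and returns the smoothed depth
    together with its spatial gradients.
    """
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    k /= k.sum()
    pad = np.pad(depth, r, mode='edge')
    tmp = np.apply_along_axis(lambda v: np.convolve(v, k, 'valid'), 0, pad)
    sm = np.apply_along_axis(lambda v: np.convolve(v, k, 'valid'), 1, tmp)
    gy, gx = np.gradient(sm)
    return sm, gx, gy
```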
Figure 4 shows still frame images from the sequence overlaid with graphically rendered axes indicating our pose estimates. The original still frames have been greatly lightened to allow the axes to be seen more easily. For the first frame, shown in the top image in the figure, the axes are rendered according to our camera model so as to appear to be a few inches in front of the person's nose. Two of the axes lie in a plane parallel to the image plane, while the third (the darkest axis) is directed at the camera. We updated the position and orientation of the rendered axes for successive frames according to the recovered motion estimates. Therefore, if our pose tracking algorithm works well, we should expect the axes to continue to appear to be rigidly affixed a few inches in front of the nose as the person moves her head. The middle row of images in the figure shows pose estimates obtained at several extreme positions in the sequence, using only the BCCE with a generic shape model. The bottom row of images in the figure shows the results obtained for the same frames using the BCCE and DCCE together with measured depth.
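The axis-propagation step can be sketched as follows (a hypothetical sketch under the assumption that each recovered inter-frame motion is a rotation matrix R and translation vector t applied in camera coordinates; function and variable names are ours):

```python
import numpy as np

def accumulate_pose(motions, axes0):
    """Propagate rendered axis endpoints through a motion sequence.

    axes0:   (N, 3) array of axis endpoints in camera coordinates.
    motions: iterable of (R, t) inter-frame rigid-motion estimates.
    Returns the list of endpoint arrays, one per frame; accumulated
    drift is directly visible by comparing the last entry to the first.
    """
    axes = axes0.copy()
    history = [axes0.copy()]
    for R, t in motions:
        axes = axes @ R.T + t  # rotate each endpoint, then translate
        history.append(axes.copy())
    return history
```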
The estimates in the bottom row appear to be qualitatively correct in all frames. The final frame shows that this method accumulated very little error over the course of the 300 frame sequence. The estimation also showed very good stability during non-rigid motions, specifically opening and closing of the mouth. In contrast, the results in the middle row of images show that much greater inaccuracy is obtained by using only the BCCE with a generic shape model. This method was not able to cope with even the moderate out-of-plane rotations exhibited in this sequence, as it produced significant spurious translation and exaggerated rotation. For example, the second and fourth frames in the middle row indicate that the head has rotated by over 90 degrees toward the person's right, while the actual rotation is less than 45 degrees. In addition, the last frame in the middle row, showing axes that are far from their initial frame position, reveals that the method accumulated a large amount of error over the course of the sequence.
Results using either the BCCE or the DCCE alone, with measured depth in each case, were not as good as those obtained using the combined constraints. The DCCE alone performed more poorly, producing qualitatively correct but very noisy estimates. The noise in the estimates is likely a result of the significant noise in the depth images themselves.
The quality of the estimates obtained by using the combination of the
BCCE and DCCE is much more easily judged by viewing the movies of the above
results, which can be found at http://www.interval.com/papers/1999-006.
This site also provides result movies for other real and synthetic sequences,
as well as a color version of this paper (which allows most of its figures
and graphs to be understood more easily).