For each of the two image sequences discussed in this section, motion between all pairs of successive frames was computed using 1) the BCCE only (with measured depth), 2) the DCCE only, and 3) both constraints together. In addition, for each of these cases, parameters were computed using both the perspective and orthographic versions of the constraints. When combining the intensity and depth constraints into a single linear system, we chose the depth constraint weighting factor to be the ratio of the mean magnitudes of the I_t and Z_t values. This helps to equalize the contributions of the two sets of constraints to the least-squares solution.
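The weighting scheme described above can be sketched as follows (a minimal numpy sketch; the function and variable names are ours, and for simplicity we take the mean magnitudes of I_t and Z_t from the right-hand-side vectors of the two stacked linear systems):

```python
import numpy as np

def solve_combined(A_bcce, b_bcce, A_dcce, b_dcce):
    """Solve the combined BCCE + DCCE least-squares system.

    A_bcce, b_bcce: linear system built from the intensity constraint
                    (rows of b_bcce hold the -I_t terms).
    A_dcce, b_dcce: linear system built from the depth constraint
                    (rows of b_dcce hold the -Z_t terms).
    The DCCE rows are scaled by the ratio of mean |I_t| to mean |Z_t|
    so both constraint sets contribute comparably to the solution.
    """
    lam = np.mean(np.abs(b_bcce)) / np.mean(np.abs(b_dcce))
    A = np.vstack([A_bcce, lam * A_dcce])
    b = np.concatenate([b_bcce, lam * b_dcce])
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return theta, lam
```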
To evaluate the utility of measured depth and the DCCE, we also implemented a motion estimation system that uses only the BCCE and a simple, generic depth model, as in the class of methods that includes [2,3,4]. The particular generic depth model that we used was a plane parallel to the camera image plane, initialized at the depth of the object being tracked. We used this system to compute a fourth set of motion parameters for all of the synthetic sequences, according to both the orthographic and perspective versions of the BCCE.
For each pair of frames, the image support region for computation of motion parameters was taken to be, as a first pass, the intersection of the sets of pixels in each frame for which depth is non-background and for which no spatial derivative includes differences with background pixels. However, because sampling and object self-occlusion (e.g. of the neck by the chin, in our sequences) create large depth gradients which do not remain consistent with object pose during motion, we found it helpful to eliminate from the support map all pixels for which the magnitude of the depth gradient exceeded the mean magnitude by more than several standard deviations.
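The support-map computation above can be sketched as follows (parameter names, the zero-depth background convention, and the choice of k are our assumptions):

```python
import numpy as np

def support_mask(depth, background=0.0, k=3.0):
    """Build a support map for motion estimation from one depth image.

    A pixel is kept if (1) its depth is non-background, (2) all of the
    4-neighbours used by the spatial differences are also non-background,
    and (3) its depth-gradient magnitude is within k standard deviations
    of the mean gradient magnitude over the candidate pixels.
    """
    fg = depth != background
    # spatial derivatives must not mix foreground and background pixels
    valid = fg.copy()
    valid[1:, :] &= fg[:-1, :]
    valid[:-1, :] &= fg[1:, :]
    valid[:, 1:] &= fg[:, :-1]
    valid[:, :-1] &= fg[:, 1:]
    # reject large depth-gradient outliers (occlusion boundaries, sampling)
    gy, gx = np.gradient(depth)
    gmag = np.hypot(gx, gy)
    m, s = gmag[valid].mean(), gmag[valid].std()
    return valid & (gmag <= m + k * s)
```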
The first synthetic sequence begins with the face oriented toward the camera. The face then makes a 40 degree rotation downward about the X-axis over the course of 35 frames (approximately 1.14 degrees per frame), and returns to the starting position via the opposite rotation in the next 35 frames. The next 70 frames consist of the same rotation about the Y-axis, and the final 70 frames contain the same rotation about the Z-axis. The first two panels of Figure 1 show the intensity and depth images of the starting position for the sequence, while the third, fourth, and fifth panels show intensity images of the sequence's three extrema of rotation.
Figure 2 shows the three computed rotational pose parameters plotted against time over the course of the sequence, using each of the four methods described above, according to the perspective forms of the constraint equations. All rotational parameters are expressed in terms of Euler angles. The ground truth for the parameters, shown as solid lines in each graph, is the same for each graph: the leftmost solid triangle represents the steady rise in the X-axis rotational parameter from zero to 40 degrees and back to zero, the middle triangle represents the identical sequence of changes in the Y-axis rotational parameter, and the rightmost triangle represents these same changes in the Z-axis parameter. Only one Euler angle should be non-zero at any given time.
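Extracting the plotted Euler angles from an estimated rotation matrix can be sketched as follows (the paper does not state its exact angle convention; this sketch assumes R = Rz(c) Ry(b) Rx(a), and the function name is ours):

```python
import numpy as np

def euler_xyz(R):
    """Recover X, Y, Z Euler angles (radians) from a rotation matrix,
    assuming the composition R = Rz(c) @ Ry(b) @ Rx(a).

    With that convention, R[2,0] = -sin(b), R[2,1] = sin(a)cos(b),
    R[2,2] = cos(a)cos(b), R[0,0] = cos(c)cos(b), R[1,0] = sin(c)cos(b).
    Valid away from the gimbal-lock case cos(b) = 0.
    """
    b = np.arcsin(-R[2, 0])
    a = np.arctan2(R[2, 1], R[2, 2])
    c = np.arctan2(R[1, 0], R[0, 0])
    return a, b, c
```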
The results obtained using only the BCCE with the generic shape model
indicate that this method does not perform well for out-of-plane rotations
(i.e. rotations about the X-axis and Y-axis). All three Euler angles are
non-zero throughout these rotations, and the translational parameters (not
shown) were also very inaccurate. The second graph, on the other hand,
shows that simply adding measured depth to the BCCE greatly improves the
pose estimation. The third graph shows that using the DCCE instead of the
BCCE improves the estimation even further. The fourth graph shows that
the best results of all are obtained by using the BCCE and DCCE together.
The accumulated error at the end of the sequence is quite small despite
very large rotations of a rather complex (and incomplete) object.
The second synthetic sequence examines translation in depth, which causes difficulties for many pose and motion algorithms. The artificial face again begins the sequence oriented toward the camera, as shown in the first two panels of Figure 1, then translates steadily and directly away from the camera to the extreme position shown in the rightmost panel of this figure, and finally returns at the same speed to the starting position. The distance between the extreme positions of the face was approximately three times the width of the face model, with the extreme farthest position being about twice as far from the camera as the starting position. The translation between the two extreme positions took place in 150 frames.
Figure 3 shows the three computed translational parameters plotted against time using each of the four methods described above, according to the perspective forms of the constraint equations. The results assume the Z-axis is pointing toward the camera. The ground truth for the parameters, shown as solid lines in each graph, is the same for each graph: both the X- and Y-translations are zero throughout the sequence, while the Z-translation forms a triangle indicating its steady decrease to a position far behind the starting point and its subsequent increase back to the starting point. Again, the results using only the BCCE with a generic depth model indicate that this method does not perform well for translation in depth. Its estimates for the X- and Y-translations are quite noisy, while the Z-translation is greatly under-estimated. As for the first synthetic sequence, the graphs for the BCCE with measured depth and for the DCCE alone show significantly improved results, while the graph for the joint use of the BCCE and DCCE shows the best results of all, with very little accumulated error at the end of the sequence.
In general, orthographic projection results for the two sequences were slightly worse than the perspective results, due to the error in the camera model assumption. Of course, from the orthographic BCCE (15) it is apparent that translation in depth cannot be recovered using only the BCCE, even with accurate depth. We indeed found this shortcoming in our results for the second sequence.
As for the synthetic sequences, we computed motion estimates using 1) the BCCE only with a generic shape model, 2) the BCCE only with measured depth, 3) the DCCE only, and 4) the combined BCCE and DCCE with measured depth. We used the perspective forms of the constraint equations throughout. The weighting factor was chosen as described for the synthetic sequences. Image support regions were computed automatically by selecting large connected foreground regions with smoothly changing range data. This excludes pixels with an uncertain depth value, typically due to occlusion or low contrast. Also, unlike the synthetic sequences, real depth imagery is noisy, and we found it advantageous to smooth the depth images prior to computing gradients.
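The pre-smoothing step can be sketched as follows (the Gaussian kernel, its width, and the edge padding are our assumptions; the paper does not specify the smoothing filter):

```python
import numpy as np

def smoothed_gradients(depth, sigma=1.0):
    """Smooth a noisy real depth image before differencing.

    Builds a separable Gaussian kernel by hand (to keep the sketch
    dependency-free), convolves rows then columns with edge padding so
    the output matches the input size, and returns the smoothed depth
    together with its spatial gradients.
    """
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    k /= k.sum()
    pad = np.pad(depth, r, mode='edge')
    tmp = np.apply_along_axis(lambda v: np.convolve(v, k, 'valid'), 0, pad)
    sm = np.apply_along_axis(lambda v: np.convolve(v, k, 'valid'), 1, tmp)
    gy, gx = np.gradient(sm)
    return sm, gx, gy
```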
Figure 4 shows still frame images from the sequence overlaid with graphically rendered axes indicating our pose estimates. The original still frames have been greatly lightened to allow the axes to be seen more easily. For the first frame, shown in the top image in the figure, the axes are rendered according to our camera model so as to appear to be a few inches in front of the person's nose. Two of the axes lie in a plane parallel to the image plane, while the third (the darkest axis) is directed at the camera. We updated the position and orientation of the rendered axes for successive frames according to the recovered motion estimates. Therefore, if our pose tracking algorithm works well, we should expect the axes to continue to appear to be rigidly affixed a few inches in front of the nose as the person moves her head. The middle row of images in the figure shows pose estimates obtained at several extreme positions in the sequence, using only the BCCE with a generic shape model. The bottom row of images in the figure shows the results obtained for the same frames using the BCCE and DCCE together with measured depth.
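The axis-propagation step can be sketched as follows (a hypothetical sketch under the assumption that each recovered inter-frame motion is a rotation matrix R and translation vector t applied in camera coordinates; function and variable names are ours):

```python
import numpy as np

def accumulate_pose(motions, axes0):
    """Propagate rendered axis endpoints through a motion sequence.

    axes0:   (N, 3) array of axis endpoints in camera coordinates.
    motions: iterable of (R, t) inter-frame rigid-motion estimates.
    Returns the list of endpoint arrays, one per frame; accumulated
    drift is directly visible by comparing the last entry to the first.
    """
    axes = axes0.copy()
    history = [axes0.copy()]
    for R, t in motions:
        axes = axes @ R.T + t  # rotate each endpoint, then translate
        history.append(axes.copy())
    return history
```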
The estimates in the bottom row appear to be qualitatively correct in all frames. The final frame shows that this method accumulated very little error over the course of the 300 frame sequence. The estimation also showed very good stability during non-rigid motions, specifically opening and closing of the mouth. In contrast, the results in the middle row of images show that much greater inaccuracy is obtained by using only the BCCE with a generic shape model. This method was not able to cope with even the moderate out-of-plane rotations exhibited in this sequence, as it produced significant spurious translation and exaggerated rotation. For example, the second and fourth frames in the middle row indicate that the head has rotated by over 90 degrees toward the person's right, while the actual rotation is less than 45 degrees. In addition, the last frame in the middle row, showing axes that are far from their initial frame position, reveals that the method accumulated a large amount of error over the course of the sequence.
Results using either the BCCE or the DCCE alone, with measured depth in each case, were not as good as those obtained using the combined constraints. The DCCE alone performed more poorly, producing qualitatively correct but very noisy estimates. The noise in the estimates is likely a result of the significant noise in the depth images themselves.
The quality of the estimates obtained by using the combination of the
BCCE and DCCE is much more easily judged by viewing the movies of the above
results, which can be found at http://www.interval.com/papers/1999-006.
This site also provides result movies for other real and synthetic sequences,
as well as a color version of this paper (which allows most of its figures
and graphs to be understood more easily).