
Previous Work


The general problem of estimating object pose from image sequences has been widely studied in the computer vision literature. Here we outline only a portion of that work, focusing on direct parametric motion estimation for head and object tracking.

Rigid and affine models for direct parametric motion estimation have been extensively explored in the past decade. Horn and Weldon provided an early and comprehensive description of the brightness constraints implied by egomotion or the rigid motion of an object in the world [7]. They observed that motion estimation is in general difficult when scene depth is unknown, although it is tractable in several restricted cases. Bergen et al. [1] were among the first to demonstrate image stabilization and object tracking using an affine model with direct image intensity constraints; they used a coarse-to-fine algorithm to handle large motions.
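To make the form of such direct constraints concrete, the following is a minimal sketch, in generic notation of our own rather than that of [1] or [7], of the linearized brightness constraint paired with a six-parameter affine motion model:

\[
  I_x\,u + I_y\,v + I_t = 0,
  \qquad
  \begin{pmatrix} u \\ v \end{pmatrix}
  =
  \begin{pmatrix} a_1 + a_2 x + a_3 y \\ a_4 + a_5 x + a_6 y \end{pmatrix},
\]

where $I_x$, $I_y$, and $I_t$ are the spatial and temporal image derivatives and $(u,v)$ is the image motion at pixel $(x,y)$. Each pixel contributes one linear equation in the affine parameters $a_1,\ldots,a_6$, so the parameters can be recovered by least squares; a coarse-to-fine pyramid extends the linearization to motions larger than a pixel.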

Black and Yacoob [3] applied parametric models to track the motion of a user's head, and also employed non-rigid models to capture facial expression. For tracking gross head motion they assumed a planar face shape, which limits the accuracy and range of motion of their method. Basu and Pentland [2] proposed a similar scheme for recovering rigid motion parameters, assuming ellipsoidal shape models and perspective projection. Their method used a precomputed optic flow representation rather than direct brightness constraints. They also represented rigid motion using Euler angles, which suffer from singularities (gimbal lock) that can destabilize estimation.

More recently, Bregler and Malik [4] introduced the use of the twist representation of rigid motion. Twists, which are commonly used in the field of robotics, are more stable and efficient to compute than Euler angles. They are especially suited to the estimation of chained articulated motion, as Bregler and Malik demonstrated. They estimated twists directly from the image brightness constraint with a scaled orthographic projection model, and they used ellipsoids to model the shape of each limb of the articulated object. To recover motion in depth, they relied on constraints from this articulation and on information from multiple widely-spaced camera views.
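As a schematic illustration (generic notation, not the specific formulation of [4]): a twist $\xi = (v, \omega)$ stacks a linear velocity $v$ and an angular velocity $\omega$, and generates a rigid motion through the exponential map $g = e^{\hat{\xi}}$ in SE(3), where

\[
  \hat{\xi} =
  \begin{pmatrix}
    0 & -\omega_3 & \omega_2 & v_1 \\
    \omega_3 & 0 & -\omega_1 & v_2 \\
    -\omega_2 & \omega_1 & 0 & v_3 \\
    0 & 0 & 0 & 0
  \end{pmatrix},
  \qquad
  \dot{X} \approx \hat{\xi} X
\]

for a scene point $X$ in homogeneous coordinates under small inter-frame motion. Because the point velocity is linear in the six twist parameters, substituting it into the brightness constraint again yields one linear equation per pixel, which is what makes the representation well suited to direct estimation.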

Our approach shares goals with the derivation of the direct motion stereo equations in Shieh et al. [8], and with the tensor brightness constraint applied to motion stereo in Stein and Shashua [9]. However, these methods assume an infinitesimal baseline and rely on a coarse-to-fine solution strategy when the baseline generates disparities greater than a pixel. Our method uses the range information directly and works with any video-rate range sensor, e.g., laser scanner, structured light, or stereo correspondence.

Video-rate range information allows us to express more powerful direct constraints on image and depth gradients and to estimate pose parameters linearly, so that motion in depth is easily tracked. We can track the rigid motion of a single unconnected part from a single viewpoint, given a monocular sequence of intensity and range imagery.

