3:10-4:30 TALKS: Recent Results
Condensation-Based Recognition of Gestures and Expressions
Michael Black, Xerox PARC
Elliptical Head Tracking Using Intensity Gradients
and Color Histograms
Stan Birchfield, Stanford
Generative Lighting Models for Faces: Singular Value
Decomposition and
the Generalized Bas Relief Ambiguity
Alan Yuille, Smith-Kettlewell Eye Research Institute
A Virtual Mirror Interface using Real-time Robust
Face Tracking
Gaile Gordon, Trevor Darrell, Mike Harville,
John Woodfill and Harlyn Baker, Interval Research
4:30-5:15 BREAK / DEMOS / POSTERS
Eigenpoints & Video Rewrite
Michele Covell, Interval Research
Human Motion Learning
Luis Goncalves and Enrico Di Bernardo, Caltech
Virtual Mirror
Interval Research
Robust Pupil Detector
IBM Almaden
5:15-6:30 TALKS: Ongoing Research
Image-Based Tracking of Eye and Head Motion
Jeffrey B. Mulligan, NASA Ames Research Center
Overview: Person Tracking Research at Autodesk
Brian Burns, Richard Marks, David Beymer, Joseph Weber,
Michael Gleicher and Stan Rosenschein, Autodesk
Overview: Person Tracking Research at Intel
Gary Bradski, Mark Holler and Demetri Terzopoulos, Intel
Overview: Research in Perceptive and Multi-Modal User
Interfaces at IBM
Myron Flickner, IBM Almaden
Robust Pupil Detector, or How to Find Alligators in
a Dark Swamp Night.
Carlos Morimoto, David Koons, Arnon Amir and Myron Flickner, IBM Almaden
6:30-6:40 Wrap-up discussion
Closing Business: location/topic of future BAVM.
7:00-8:30 Dinner in Palo Alto
(For those who have previously RSVP'ed)
Cafe Pro Bono
2427 Birch Street (at California Ave., near Printer's Inc.)
The recognition of human gestures and facial expressions in image
sequences is an important and challenging problem whose solution
enables a host of human-computer interaction applications. I will
describe a framework for incremental recognition of human motion that
extends the "Condensation" algorithm proposed by Isard and Blake
(ECCV'96). Human motions are modeled as "temporal trajectories" of
some estimated parameters over time. The Condensation algorithm uses
random sampling techniques to incrementally match the trajectory
models to the multivariate input data. The recognition framework is
demonstrated with two examples. The first involves an augmented
office whiteboard with which a user can make simple hand gestures to
grab regions of the board, print them, save them, etc. The second
illustrates the recognition of human facial expressions using the
estimated parameters of a learned model of mouth motion.
This is joint work with Allan Jepson.
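
As a rough illustration (not the authors' implementation), the core
Condensation update over a learned trajectory model can be sketched as
a factored-sampling step; the state here is simply a phase index into
the model, and all parameter names are illustrative:

    import numpy as np

    def condensation_step(particles, weights, model_traj, observation, obs_std=0.05):
        # particles: (N,) phase positions along the learned trajectory model
        # model_traj: (T, D) expected parameter values at each phase
        # observation: (D,) current multivariate measurement
        n = len(particles)
        # Factored sampling: resample particles in proportion to their weights.
        particles = particles[np.random.choice(n, size=n, p=weights)]
        # Predict: advance each particle along the model and add diffusion noise.
        particles = np.clip(particles + 1 + 0.5 * np.random.randn(n),
                            0, len(model_traj) - 1)
        # Reweight by the likelihood of the observation under each particle.
        err = model_traj[particles.astype(int)] - observation
        weights = np.exp(-np.sum(err ** 2, axis=1) / (2 * obs_std ** 2))
        return particles, weights / weights.sum()
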
Elliptical Head Tracking Using Intensity Gradients
and Color Histograms
Stan Birchfield, Stanford
The ability to automatically track a person moving around an
unmodified room is important for many applications, such as
videoconferencing, distance learning, surveillance, and human-computer
interaction. A serious challenge for this task is to allow full
360-degree rotation of the body without relying on assumptions that
forbid other moving objects in the background or arbitrary camera
movement. In this talk I will present a system that is capable of
tracking a person's head using the images from a single color camera
with enough accuracy to automatically control the camera's pan, tilt,
and zoom to keep the person centered in the field of view. The
algorithm combines the output from two different vision modules, one
based on intensity gradients and the other based on color histograms,
and allows full 360-degree rotation, arbitrary camera motion,
multiple moving people in the background, and severe but brief
occlusion.
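
A minimal sketch of the module-combination idea (the weighting and
normalization below are illustrative assumptions, not the system's
actual scoring):

    import numpy as np

    def combined_score(gradient_scores, histogram_scores, alpha=0.5):
        # Normalize each module's scores over the candidate ellipses so that
        # neither module dominates, then take a convex combination.
        g = (gradient_scores - gradient_scores.min()) / (np.ptp(gradient_scores) + 1e-9)
        h = (histogram_scores - histogram_scores.min()) / (np.ptp(histogram_scores) + 1e-9)
        return alpha * g + (1 - alpha) * h

    # Pick the best of three candidate head ellipses (toy numbers).
    grad = np.array([0.2, 0.9, 0.4])     # intensity-gradient fit per candidate
    hist = np.array([10.0, 30.0, 55.0])  # color-histogram intersection per candidate
    best_candidate = int(np.argmax(combined_score(grad, hist)))
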
Our goal is to determine generative models of faces which approximate
their appearance under a range of lighting conditions. The input is a
set of images of the face under different, and unknown, illumination.
Firstly, we demonstrate that Singular Value Decomposition (SVD) can be
used to estimate shape and albedo from multiple images up to a linear
transformation. The SVD approach applies to objects for which the
dominant imaging effects are Lambertian reflectance with a point light
source and a background ambient term. To verify that this is a
reasonable approximation, we calculate the eigenvalues of the SVD on a
set of real objects under varying lighting conditions and demonstrate
that the first few eigenvalues account for most of the data, in
agreement with our predictions. We discuss alternative possibilities
and show that knowledge of the object class (i.e. generic face
knowledge) is sufficient to resolve the linear ambiguity. Secondly, we
describe the use of surface consistency for putting constraints on the
possible solutions. We prove that this constraint reduces the
ambiguities to a subspace called the generalized bas relief ambiguity
(GBR), which is inherent in the Lambertian reflectance function and
which can be shown to exist even when attached and cast shadows are
present. We demonstrate the use of surface consistency to solve for
the shape and albedo up to a GBR and describe, and implement, a
variety of additional assumptions to resolve the GBR. Thirdly, we
demonstrate an iterative algorithm that can detect and remove some
attached shadows from faces, thereby increasing the accuracy of the
reconstructed shape and albedo.
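
The SVD step can be sketched as follows, assuming a simple Lambertian
image model with the input images stacked as rows of a matrix
(variable names are illustrative):

    import numpy as np

    def svd_shape_albedo(images):
        # images: (n_images, n_pixels) matrix, one illumination per row.
        # Under a Lambertian model with a distant point source, the matrix is
        # approximately rank 3: M ~ L @ B, with L (n x 3) light directions and
        # B (3 x p) albedo-scaled surface normals, recoverable only up to an
        # invertible 3x3 transform A (L A^-1 and A B produce the same images).
        U, s, Vt = np.linalg.svd(images, full_matrices=False)
        L = U[:, :3] * np.sqrt(s[:3])          # estimated lights, up to the ambiguity
        B = np.sqrt(s[:3])[:, None] * Vt[:3]   # estimated normals*albedo, same ambiguity
        residual = s[3:].sum() / s.sum()       # how well rank 3 explains the data
        return L, B, residual
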
We describe a virtual mirror interface which can react to people using
robust, real-time face tracking. Our display can directly combine a
user's face with various graphical effects, performed only on the face
region in the image. We have demonstrated our system in crowded
environments with open and moving backgrounds. Robust performance is
achieved using multi-modal integration, combining stereo, color, and
grey-scale pattern matching modules into a single real-time
system. Stereo processing is used to isolate the figure of a user from
other objects and people in the background. Skin-hue classification
identifies and tracks likely body parts within the foreground
region. Face pattern detection discriminates and localizes the face
within the tracked body parts. We show an initial application of the
mirror where the user sees his or her face distorted into various
comic poses. Qualitatively, users of the system felt that the display
"knew" where their face was and that it provided entertaining imagery.
We discuss the failure modes of the individual components, and
quantitatively analyze the face localization performance of the
complete system with thousands of users in recent trials.
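
As a hedged illustration, the stereo/color/pattern cascade can be
pictured roughly as below; the array names and thresholds are
placeholders, not the authors' actual modules:

    import numpy as np

    def locate_face(disparity, hue_prob, pattern_score, d_near=32, skin_thresh=0.5):
        # disparity:     per-pixel stereo disparity; nearby pixels form the user's silhouette
        # hue_prob:      per-pixel probability of skin hue
        # pattern_score: per-pixel grey-scale face-pattern match score
        foreground = disparity > d_near                 # stereo isolates the user from the background
        skin = foreground & (hue_prob > skin_thresh)    # skin-hue classification within the foreground
        score = np.where(skin, pattern_score, -np.inf)  # pattern matching only on candidate body parts
        return np.unravel_index(np.argmax(score), score.shape)
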
Video Rewrite uses existing footage to automatically create new video
of a person mouthing words that she did not speak in the original
footage. This technique is useful in movie dubbing, for example, where
the movie sequence can be modified to sync the actors' lip motions to
the new soundtrack. Video Rewrite uses computer-vision techniques to
estimate head pose and to track points on the speaker's mouth in the
training footage, and uses morphing techniques to combine these mouth
gestures into the final video sequence. The new video combines the
dynamics of the original actor's articulations with the mannerisms and
setting dictated by the background footage. Video Rewrite automates
all the labeling and assembly tasks required to resync existing
footage to a new soundtrack.
Eigen-points estimates the image-plane locations of fiduciary points
on an object. By estimating multiple locations simultaneously,
eigen-points exploits the inter-dependence between these locations.
This is done by associating neighboring, inter-dependent
control-points with a model of the local appearance. The model of
local appearance is used to find the feature in new unlabeled images.
Control-point locations are then estimated from the appearance of this
feature in the unlabeled image. The estimation is done using an affine
manifold model of the coupling between the local appearance and the
local shape. Eigen-points uses models aimed specifically at recovering
shape from image appearance. The estimation equations are solved
non-iteratively, in a way that accounts for noise in the training data
and the unlabeled images, and that accounts for uncertainty in the
distribution and dependencies within these noise sources.
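
As a simplified stand-in for the coupled appearance-shape estimator
(an assumption-laden sketch, not the Eigen-points derivation itself),
a linear map from local appearance to control-point coordinates can be
fit on labeled training data and applied to new images like this:

    import numpy as np

    def fit_appearance_to_shape(appearance_train, shape_train):
        # Least-squares affine map from appearance vectors to flattened
        # control-point coordinates (training pairs from labeled images).
        n = appearance_train.shape[0]
        A = np.hstack([appearance_train, np.ones((n, 1))])  # affine offset term
        W, *_ = np.linalg.lstsq(A, shape_train, rcond=None)
        return W

    def estimate_shape(W, appearance):
        # Predict control-point locations from the appearance of a new patch.
        return np.append(appearance, 1.0) @ W
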
We propose a method for learning models of human motion based on
motion-capture data and a high-level description of the motion such as
direction of movement, style, mood of actor, age of actor, etc. In the
field of computer vision, such models can be useful for human body
motion tracking/estimation and gesture recognition. The models can
also be used to generate arbitrary realistic human motion, and may be
of help in trying to understand the mechanisms behind the perception
of biological motion by the human visual system. Some experimental
results of the learning technique applied to reaching and drawing
motions are presented.
This talk will discuss methods for high-precision tracking
of eye and head movements. Two classes of eye images
will be considered: "pupil" images, in which the pupil margin
and Purkinje images are identified and tracked, and retinal
(ophthalmoscopic) images, in which the retinal vasculature
is tracked using correlation methods. I will also present
work-in-progress on head movement tracking using
images from a head-mounted scene camera.
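
A bare-bones sketch of correlation-based patch tracking of the kind
mentioned for the retinal images (search window and scoring are
illustrative, not the talk's method):

    import numpy as np

    def track_patch(frame, template, prev_xy, search=20):
        # Track a vasculature template by normalized cross-correlation
        # within a small window around the previous (x, y) location.
        th, tw = template.shape
        t = (template - template.mean()) / (template.std() + 1e-9)
        x0, y0 = prev_xy
        best, best_xy = -np.inf, prev_xy
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = y0 + dy, x0 + dx
                if y < 0 or x < 0:
                    continue
                patch = frame[y:y + th, x:x + tw]
                if patch.shape != template.shape:
                    continue
                p = (patch - patch.mean()) / (patch.std() + 1e-9)
                score = float((p * t).mean())
                if score > best:
                    best, best_xy = score, (x, y)
        return best_xy, best
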
One of the areas we are looking at in computer vision is tracking the
human body as part of a perceptual user interface. At Intel we are
strongly interested in developing algorithms that are computationally
efficient (fast) enough to be realized on PCs that exist now or will
exist within the next 5 years. By "fast enough", we mean tracking
algorithms that are able to work in real time while absorbing 20% or
less of the computational resources. Tracking algorithms that absorb
more than this will be hard to use as part of a computer interface.
This presentation develops a simple, statistically robust and
computationally efficient computer vision motion tracking algorithm
that tracks real-time dynamic probability distributions derived from
video image sequences. This algorithm can be applied to any
distribution, such as distributions derived from motion or feature
pattern matching. In the case presented here, we use a probabilistic
model of flesh color to convert video scenes into distributions of
flesh color in the scene. The tracking algorithm developed here is an
extension of the mean shift algorithm, modified to handle dynamically
evolving probability distributions. The new algorithm is called the
Continuously Adaptive Mean Shift (CAMSHIFT) algorithm. CAMSHIFT uses
an adaptive search window to robustly (in a statistical sense) find
the mode of a dynamic probability distribution. Since CAMSHIFT adapts
its search window size to the distribution that it tracks, further
computational savings are realized by restricting the region of
calculation to a function of the window that CAMSHIFT finds.
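
A minimal sketch of the adaptive mean-shift loop on a 2-D probability
image (the window-resizing rule here is a simplified stand-in for
CAMSHIFT's moment-based sizing):

    import numpy as np

    def camshift(prob, window, n_iter=10):
        # prob:   per-pixel probability (e.g. a flesh-color back-projection)
        # window: (x, y, w, h) initial search window
        x, y, w, h = window
        for _ in range(n_iter):
            roi = prob[y:y + h, x:x + w]
            m00 = roi.sum()
            if m00 <= 0:
                break
            ys, xs = np.mgrid[0:roi.shape[0], 0:roi.shape[1]]
            cx = (xs * roi).sum() / m00        # mean shift: move the window to
            cy = (ys * roi).sum() / m00        # the centroid of probability mass
            x = max(int(round(x + cx - w / 2)), 0)
            y = max(int(round(y + cy - h / 2)), 0)
            # Adapt the window size to the mass under it (CAMSHIFT's extension
            # over plain mean shift); the scale factor is illustrative.
            side = max(int(2 * np.sqrt(m00)), 4)
            w, h = side, side
        return x, y, w, h
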
We have used CAMSHIFT with flesh color probability distributions to
track human faces as part of a computer interface for immersive
control of 3D graphics (flying over Hawaii) and for control of
commercial computer games (QUAKE). Videos of face tracking, 3D
graphics control and computer game control will be shown.
This talk gives an overview of the area of perceptive and multi-modal
interfaces and describes IBM's research projects in this area,
pointing out the vision requirements in these projects. Important
problems that vision people should be working on will be discussed.
Finally, some suggestions will be presented to researchers starting to
work in the area on good and bad things to do.
Two independent infrared (IR) light sources are used for robust pupil
detection. The even and odd frames of the camera are synchronised with
the IR light sources, so that the face is alternately illuminated by
an on-camera-axis IR source when even frames are being captured and by
an off-camera-axis IR source for odd frames. The on-axis illumination
generates a bright pupil (the red-eye effect from flash photography),
while the off-axis illumination keeps the scene at about the same
illumination but leaves the pupil dark. After the subtraction of even
and odd frames, the only significant features are pupils and motion
disparities. Thresholding the difference images, followed by
area-based filtering of connected regions, results in robust detection
of human pupils. Once the pupils are detected, other facial features
can be found using simple geometrical face models. We present
experimental results from a real-time implementation of the system.
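
The frame-differencing pipeline can be sketched as follows (the
threshold and area bounds are illustrative guesses, not the values
used in the system):

    import numpy as np
    from scipy import ndimage

    def detect_pupils(bright_frame, dark_frame, thresh=40, min_area=10, max_area=400):
        # Difference of the on-axis (bright-pupil) and off-axis (dark-pupil)
        # frames: pupils and motion disparities survive the subtraction.
        diff = bright_frame.astype(np.int16) - dark_frame.astype(np.int16)
        mask = diff > thresh
        labels, n = ndimage.label(mask)            # connected regions
        centers = []
        for i in range(1, n + 1):
            area = int((labels == i).sum())
            if min_area <= area <= max_area:       # reject motion blobs by size
                cy, cx = ndimage.center_of_mass(labels == i)
                centers.append((cx, cy))
        return centers
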