
Introduction

Visual phenomena take place over a wide range of scales and often cannot be captured by a single passive camera view. This is especially true in the domain of observing people: gestures performed by the face or hand require detailed observation, while recognition of overall body pose requires a relatively coarse-scale view. Confounding matters further, the significance of a particular detailed gesture can depend on the coarse context in which it is performed.

Techniques for observing people have recently become a topic of increasing interest in the machine vision literature. Several independent groups have reported progress on methods for discriminating faces and facial expressions, recognizing hand gestures, and tracking body pose. With the exception of face detection methods, the techniques reported for hand and face processing generally assume that high-resolution views (greater than 50x50 pixels) of the hand or face are available. With fixed cameras, these methods are limited to domains where the face or hand location is stationary and known a priori: for face processing this is possible only in certain settings, such as automobile driving or interaction with automated teller machines. For hand processing, this usually restricts the class of applications to those in which more traditional interface methods such as the mouse or touch-screen are also available (and are often preferred by users). In short, fixed camera views sharply limit where face and hand processing can be applied.

A solution to this problem lies in the use of Active Vision techniques [1,2,3]. Active visual observation is used in both biological and machine vision systems to overcome the limitation of a fixed view: actuators are added to the perceptual apparatus, together with foveal sensors that have (or approximate) non-uniform photo-detector sampling. The technology for active visual observation has become increasingly accurate and available to the average researcher: high-performance visual tracking systems whose performance rivals biological systems are being developed, while basic pan/tilt/zoom cameras, driven by teleconferencing applications, have become readily available at low cost.

The addition of active methods to the perception process raises several interesting questions, chief among them how to control the active apparatus optimally to achieve perceptual goals such as recognition, i.e., how to perform visual attention. We choose not to focus on the data-driven, bottom-up aspect of attention; our approach instead explores top-down influences on attentive control, attempting to find active strategies for recognizing particular objects. We use a decision-process formalism and a reinforcement learning solution technique for top-down control, and assume that a data-driven, bottom-up module provides candidate foveation locations that appear interesting due to a prior feature model, local statistical content, or change over time.
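To make this division of labor concrete, the following minimal sketch (with hypothetical names; it is an illustration, not our system's actual interest operator) shows a purely data-driven bottom-up stage proposing candidate foveation points by a local-variance interest measure, while all object-specific knowledge resides in the learned values consulted by the top-down selector:

    import numpy as np

    def bottom_up_candidates(image, patch=8, num_candidates=4):
        # Hypothetical bottom-up module: rank patches of a grayscale
        # image by local variance and return the top foveation points.
        h, w = image.shape
        scored = []
        for y in range(0, h - patch + 1, patch):
            for x in range(0, w - patch + 1, patch):
                scored.append((image[y:y + patch, x:x + patch].var(), (y, x)))
        scored.sort(key=lambda s: s[0], reverse=True)
        return [pt for _, pt in scored[:num_candidates]]

    def select_foveation(q_values, history, candidates):
        # Top-down control: a learned policy (here a dict of values
        # keyed on (history, candidate)) picks where to foveate next.
        return max(candidates, key=lambda c: q_values.get((history, c), 0.0))

Note that the bottom-up stage knows nothing about any particular object; everything object-specific enters only through the values learned by the top-down controller.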

Several methods have been proposed in the machine perception literature for actively controlling a camera to reduce uncertainty about whether a particular object is present [6,13]. In contrast to these methods, which assume a model of the object, we explore the case in which no model of the object or environment is available. Our active recognition system has no access to the underlying state of the world; it has only the actions and observations made by the system, and an externally provided reinforcement signal that indicates whether its assertions of object identity are correct.
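The interface this implies can be sketched with a toy environment (the class, views, and labels below are invented for illustration): the agent may foveate a view or assert an identity, receives an observation symbol per foveation, and is rewarded only for its assertion.

    from dataclasses import dataclass
    import random

    @dataclass(frozen=True)
    class Action:
        kind: str   # "foveate" or "assert"
        arg: str    # foveation target, or asserted identity label

    class ActiveRecognitionEnv:
        # Toy hidden-state environment: the true identity is never
        # exposed to the agent. Note the "top" view is perceptually
        # aliased: both objects yield the observation "rim" there.
        APPEARANCES = {
            "cup":  {"top": "rim", "side": "handle"},
            "bowl": {"top": "rim", "side": "smooth"},
        }

        def __init__(self):
            self._identity = random.choice(list(self.APPEARANCES))

        def step(self, action):
            if action.kind == "foveate":
                return self.APPEARANCES[self._identity][action.arg], 0.0
            # an assertion ends the episode: +1 if correct, -1 otherwise
            return None, 1.0 if action.arg == self._identity else -1.0

Because no single observation determines the identity here, any successful strategy must accumulate evidence over a sequence of foveations before asserting.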

To accomplish this we use a hidden-state decision process paradigm for active recognition. Since these decision-process formalisms describe both action and perception explicitly in a statistical framework, they are potentially useful for modeling aspects of visual attention. As we shall see in the following sections, hidden-state reinforcement learning can solve decision-process tasks in which only observations, rather than the underlying state, are available, overcoming the perceptual aliasing problem by modeling state as sequences of actions and observations. Reinforcement learning has recently been explored for use in active visual tasks by several authors [21,4,22,23], but none address the task of hidden-state recognition without a prior model.
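As a deliberately simplified illustration of this idea (a tabular sketch under the toy environment above, not the exact algorithm used in the following sections), the learner below keys its value estimates on the entire action-observation history rather than on the latest observation alone, so the aliased "top" views become distinguishable once a disambiguating foveation appears in the history:

    from collections import defaultdict
    import random

    def history_q_learning(actions, episodes=5000, max_steps=10,
                           alpha=0.1, gamma=0.9, epsilon=0.1):
        # Tabular Q-learning with state = action-observation history,
        # reusing the Action / ActiveRecognitionEnv sketched above.
        Q = defaultdict(float)
        for _ in range(episodes):
            env = ActiveRecognitionEnv()   # fresh hidden identity
            history = ()                   # the agent's surrogate state
            for _ in range(max_steps):
                if random.random() < epsilon:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda act: Q[(history, act)])
                obs, reward = env.step(a)
                done = obs is None         # an assertion ends the episode
                nxt = history + (a, obs)
                target = reward if done else reward + gamma * max(
                    Q[(nxt, act)] for act in actions)
                Q[(history, a)] += alpha * (target - Q[(history, a)])
                if done:
                    break
                history = nxt
        return Q

    actions = [Action("foveate", "top"), Action("foveate", "side"),
               Action("assert", "cup"), Action("assert", "bowl")]
    Q = history_q_learning(actions)

With reward given only for correct assertions, such a learner tends to foveate the discriminating "side" view before committing to an identity, the kind of action-selection strategy we later extract into a compact behavior.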

In this paper we show how these techniques can be used to solve an active recognition task, and how a concise, visual-behavior-like representation can be extracted from the learned action-selection policies. The following section defines an active recognition task within a partially observable decision process formalism. We then present the hidden-state reinforcement learning methods with which we learn action-selection policies to solve these tasks. Finally, we present a method for transforming the learned action-selection policy into a simple augmented state-machine behavior, and conclude with a discussion of the properties and limitations of our current method.

