Stuart Russell
Foundations: Rationality and Intelligence
The classical notion of rationality as maximization of expected utility is often seen as a solid theoretical foundation for AI.
Unfortunately, it is a nonexistent foundation: rational behavior in this sense is computationally infeasible in all but the simplest cases.
My research in this area is aimed at building a different foundation that is both normative and achievable. Put simply, I address the question,
"If it's not possible to calculate the right thing to do, what's the right thing to do?"
Herbert Simon wrote at length about this problem of bounded rationality; his suggested solution was to propose various simplified decision procedures
that might or might not bear some resemblance to human decision procedures. He did not, however, develop a normative theory, and the question
of how to choose among the many possible nonrational decision procedures was left open.
I.J. Good proposed "Type II rationality" according to which an agent selects its computations rationally and thereby generates good decisions quickly.
Beginning in 1988 and working with PhD student Eric Wefald, I formulated this approach as rational metareasoning: selecting computation steps according to their
expected value in improving the quality of the agent's next decision in the physical world. These computation steps are taken in a metalevel MDP with a joint state space in which each state is composed
of a computational state (e.g., a partially constructed game tree) and a physical state (e.g., a chess board and its clock). We showed that even a very simple, greedy decision procedure in this
metalevel MDP generates very effective search behavior in two-player games, often far more effective than alpha-beta search.
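In toy form, the greedy metalevel policy can be sketched as follows. Everything here is a hypothetical illustration, not the actual game-tree machinery from that work: each candidate computation carries an estimated value of computation, and the agent computes only while some computation's value exceeds its time cost.

```python
# Minimal sketch of greedy metalevel control (hypothetical setup): keep
# performing the computation whose estimated value -- the expected
# improvement in the quality of the eventual physical decision -- most
# exceeds its time cost, and act as soon as no computation is worth its cost.

def greedy_metareasoning(candidates, value_of_computation, time_cost):
    """candidates: computation steps available (e.g., leaf-node expansions).
    value_of_computation(c): estimated expected decision-quality gain of c.
    time_cost: cost of one computation step, in the same utility units."""
    performed = []
    remaining = list(candidates)
    while remaining:
        best = max(remaining, key=value_of_computation)
        if value_of_computation(best) <= time_cost:
            break                      # no step is worth its cost: act now
        performed.append(best)
        remaining.remove(best)
    return performed
```

With fixed value estimates this reduces to "expand nodes in value order until the next one is not worth its cost"; in a real search the estimates would be recomputed as the tree grows.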
To go beyond greedy metalevel decisions, we also developed the idea of metalevel reinforcement learning (Harada and Russell, 1999; Hay and Russell, 2011).
To allow metalevel RL systems to operate without waiting billions of steps for a final reward signal in the physical MDP, we
developed the theory of reward shaping (Ng, Harada, and Russell, 1999).
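The central result of that paper is that potential-based shaping functions of the form F(s, s') = γΦ(s') − Φ(s) are exactly the ones guaranteed to preserve optimal policies. A minimal sketch (the goal position and distance-based potential below are made-up examples):

```python
def shaped_reward(reward, phi, s, s_next, gamma=0.99):
    """Potential-based reward shaping (Ng, Harada, and Russell, 1999):
    adding F(s, s_next) = gamma * phi(s_next) - phi(s) to the environment
    reward leaves the set of optimal policies unchanged."""
    return reward + gamma * phi(s_next) - phi(s)

# Hypothetical example: on a 1-D line with the goal at position 10, use
# negative distance-to-goal as the potential, so every step toward the
# goal earns an immediate positive shaping bonus even though the
# environment reward itself arrives only at the goal.
phi = lambda s: -abs(s - 10)
```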
A similar approach to metalevel control appeared a few years later in the UCT algorithm for controlling Monte Carlo tree search (Kocsis and Szepesvari, 2006).
Whereas that work formulates the metalevel decision as a bandit problem, we showed that it is in fact a selection problem rather than a bandit problem,
leading to very different theoretical properties (Hay, Russell, Shimony, and Tolpin, 2012). With Nick Hay, I developed a completely rigorous formulation of the
metalevel decision problem and proved the basic result that an optimal computational policy spends no more time (in expectation) than the decision is worth.
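The flavor of the difference can be seen in a myopic value-of-information calculation (a hypothetical two-arm Bernoulli example, not the algorithm from the paper): in a selection problem only the final choice matters, so one more sample of the currently best arm can have zero myopic value while sampling an uncertain challenger has positive value, whereas a cumulative-reward bandit earns by pulling the good arm itself.

```python
def voi_sample(arms, i):
    """Myopic value of one more Bernoulli observation of arm i.
    arms: list of Beta-posterior parameters (a, b) for each arm; the final
    payoff is the posterior mean of whichever arm is ultimately selected."""
    means = [a / (a + b) for a, b in arms]
    best_other = max(m for j, m in enumerate(means) if j != i)
    a, b = arms[i]
    p_success = a / (a + b)              # predictive prob. of a success
    mean_up = (a + 1) / (a + b + 1)      # posterior mean after a success
    mean_down = a / (a + b + 1)          # posterior mean after a failure
    expected_best = (p_success * max(mean_up, best_other)
                     + (1 - p_success) * max(mean_down, best_other))
    return expected_best - max(means)
```

For arms with posteriors Beta(1, 1) and Beta(3, 2), one more sample of the leading second arm cannot change the decision (zero myopic value), while sampling the uninformed first arm can.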
Decision procedures using some form of approximate metalevel control are very interesting and still hold enormous promise as elements of generally intelligent systems,
but as a foundational framework for AI they fail in two ways: first, Type II rationality is even less feasible than classical perfect rationality, because the
metalevel state space of computational states is usually far larger than the underlying physical state space; and second, there is no theoretical reason to pick
any particular approximate metalevel controller over any other, or indeed over any other form of decision procedure.
This brings us back to the question, "If it's not possible to calculate the right thing to do, what's the right thing to do?"
The answer is that this is the wrong question, because by its very formulation it cannot have an answer. In fact, the entire project
of imposing normative constraints of rationality on individual actions of the agent is hopeless. The designer of an AI system does not control
the agent's individual actions, only its program. There is no program that acts rationally, in the classical sense, at every moment in time.
We can, however, rank programs according to how well they function in an environment when running on a given machine, and this leads to the idea of bounded optimality:
given a fixed machine, which program produces the best expected performance?
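In toy form the definition can be read off directly: fix the machine, average each program's performance over the environment distribution, and keep the argmax. All programs, environments, and the "machine" budget below are invented for illustration, and the brute-force search is feasible only for tiny program spaces.

```python
import random

def bounded_optimal(programs, environments, step_budget, trials=100, seed=0):
    """Return the program with the best average performance when run on a
    fixed machine, modeled here as a hard per-decision step budget, over a
    uniform distribution of environments."""
    rng = random.Random(seed)
    def expected_performance(program):
        total = 0.0
        for _ in range(trials):
            env = rng.choice(environments)   # sample an environment
            total += program(env, step_budget)
        return total / trials
    return max(programs, key=expected_performance)
```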
With PhD student Shlomo Zilberstein, I studied this question for programs composed from so-called anytime algorithms (algorithms whose output quality
varies with the amount of time allocated). Devika Subramanian and I developed the general theory of bounded optimality as well as an asymptotic version
(analogous to big-O notation in classical complexity theory) that allows one to ignore implementation details.
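For a flavor of the composition problem (the profiles below are made up; the actual work uses conditional performance profiles and more general program structures), allocating a fixed deadline between two pipelined anytime modules can be sketched as an exhaustive search over splits:

```python
def best_allocation(profile1, profile2, deadline):
    """Split an integer deadline between two composed anytime modules.
    profile1(t): output quality of module 1 after t time units.
    profile2(q, t): output quality of module 2 given input quality q and
    t time units. Returns the (t1, t2) split maximizing final quality."""
    return max(
        ((t, deadline - t) for t in range(deadline + 1)),
        key=lambda split: profile2(profile1(split[0]), split[1]),
    )
```

With symmetric diminishing-returns profiles such as quality 1 − 0.5^t, the best split is an even one, as expected.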
Bounded optimality offers a normative definition of intelligence that is both rigorous and feasible. Recently several research groups
have begun to pursue its implications, including Richard Lewis and Satinder Singh at Michigan, Andrew Howes at Birmingham, Falk Lieder at Tübingen, and Tom Griffiths at Princeton.
Publications on bounded optimality

Stuart Russell and Shlomo Zilberstein
``Composing Real-Time Systems.''
In Proceedings of the Twelfth International Joint
Conference on Artificial Intelligence, Sydney, Australia: Morgan
Kaufmann, 1991.

Shlomo Zilberstein and Stuart Russell
``Reasoning about optimal allocation of time using conditional
performance profiles.'' In Proc. AAAI-92 Workshop on
Implementations of Temporal Reasoning, San Jose, CA, July 1992.

Shlomo Zilberstein and Stuart Russell
``Efficient Resource-Bounded Reasoning in AT-RALPH.''
In Proceedings of the First International Conference on AI
Planning Systems, College Park, Maryland: Morgan Kaufmann, 1992.

Stuart Russell and Devika Subramanian
``Constructing bounded optimal systems.''
In Proc. AAAI Spring Symposium
on AI and NP-hard problems, Stanford, CA, March 1993.

Stuart Russell and Devika Subramanian ``On Provably RALPHs.''
In E. Baum (Ed.) Computational Learning and Cognition:
Proceedings of the Third NEC Research Symposium. SIAM Press, 1993.

Stuart Russell, Devika Subramanian, and Ronald Parr,
``Provably bounded optimal agents.''
In Proc. Thirteenth International Joint Conference
on Artificial Intelligence, Chambery, France, 1993.

S. Zilberstein and S. J. Russell.
``Anytime Sensing, Planning and Action: A Practical Model for Robot Control.''
In Proc. Thirteenth International Joint Conference
on Artificial Intelligence, Chambery, France, 1993.

Stuart Russell and Devika Subramanian
``Provably bounded-optimal agents.''
Journal of Artificial Intelligence Research, 2, 1995.

Stuart Russell,
``Rationality and Intelligence.''
Invited paper (Computers and Thought Award), in
Proc. Fourteenth International Joint Conference
on Artificial Intelligence, Montreal, Canada, 1995.

Shlomo Zilberstein and Stuart Russell
``Optimal composition of real-time systems.''
Artificial Intelligence, 82(1-2), pp. 181-213, 1996.

Stuart Russell,
``Rationality and Intelligence.''
Artificial Intelligence, 94, 57-77, 1997.

Stuart Russell,
``Rationality and Intelligence.''
In Renee Elio (Ed.), Common sense, reasoning, and rationality,
Oxford University Press, 2002.

Stuart Russell,
``Rationality and Intelligence: A Brief Update.''
In Vincent C. Müller (ed.), Fundamental Issues of Artificial Intelligence (Synthese Library). Berlin: Springer, 2014.
Publications on rational metareasoning

Stuart Russell and Eric Wefald ``Multi-Level Decision-Theoretic
Search.'' In AAAI Symposium on Computer Game-Playing,
Stanford, March, 1988.

Stuart Russell and Eric Wefald, ``Decision-Theoretic Search Control:
General Theory and an Application to Game-Playing.'' Computer Science
Division Technical Report 88/435, University of California, Berkeley,
CA, 1988.

Stuart Russell and Eric Wefald ``Principles of Meta-Reasoning.''
In Proceedings of the First International Conference on Principles of
Knowledge Representation and Reasoning, Toronto, Ontario: Morgan
Kaufmann, 1989.

Stuart Russell and Eric Wefald ``On optimal game-tree search using
rational metareasoning.''
In Proceedings of the Eleventh International Joint
Conference on Artificial Intelligence, Detroit, MI: Morgan Kaufmann, 1989.

Eric Wefald and Stuart Russell
``Adaptive Learning of Decision-Theoretic Search Control Knowledge.''
In Proceedings of the Sixth International Workshop on
Machine Learning, Ithaca, NY: Morgan Kaufmann, 1989.

Eric Wefald and Stuart Russell,
``Estimating the value of computation: The case of real-time search.''
In Proceedings of the AAAI Symposium on AI
and Limited Rationality, Stanford, March, 1989.

Stuart Russell ``An Architecture for Bounded Rationality.''
In Proceedings of the AAAI Symposium on Integrated Architectures
for Intelligent Agents, Stanford, March, 1991. Also in
Proceedings of the IJCAI Workshop on Theoretical and Practical Design of
Rational Agents, Sydney, August, 1991.

Stuart Russell and Eric H. Wefald Do the Right Thing: Studies
in Limited Rationality. Cambridge, MA: MIT Press, 1991.

Stuart Russell and Eric Wefald ``Principles of Metareasoning.''
Artificial Intelligence 49, 1991.

Daishi Harada and Stuart Russell, ``Extended abstract: Learning search strategies.''
In Proc. AAAI Spring Symposium on Search Techniques for Problem Solving
under Uncertainty and Incomplete Information, Stanford, CA, 1999.

Nicholas J. Hay and Stuart Russell,
``Metareasoning for Monte Carlo Tree Search.''
Technical Report No. UCB/EECS-2011-119, EECS Department, University of California, Berkeley, 2011.

Nicholas Hay, Stuart Russell, Solomon Eyal Shimony, and David Tolpin,
``Selecting Computations: Theory and Applications.''
In Proc. UAI-12, Catalina Island, 2012.

F. Lieder, D. Plunkett, J. Hamrick, S. Russell, N. Hay, T. Griffiths,
``Algorithm selection by rational metareasoning as a model of human strategy selection.''
In Advances in Neural Information Processing Systems 28,
MIT Press, 2015.

Nicholas Hay, Principles of Metalevel Control.
PhD thesis, Computer Science Division,
University of California, Berkeley, CA, 2016.