Stuart Russell -- Research on decision making over long time scales


Like many combinatorial algorithms for sequential decision making, AlphaGo is able to make decisions with a horizon of 15 or 20 primitive steps. A human life consists of about 20 trillion primitive steps (potential changes in muscle activation). Even something as simple as going to a conference is about a billion steps. Humans appear to manage by imposing a hierarchical structure on behavior, so that each primitive action is part of some higher-level activity, and so on up to very high-level activities such as "get a PhD" and "earn enough money to retire to Bali". We study both hierarchical reinforcement learning and hierarchical planning and lookahead, establishing fundamental results in each area and showing how to enable much faster learning and planning.

Our original result (Parr and Russell, 1998) showed that reinforcement learning could occur within a hierarchy of nondeterministic finite-state machines that provide some guidance on how to behave in a specific task environment. Nondeterminism corresponds to choice states where the machine must learn how to make each choice, dependent on the environment state and any internal state. The key observation is that the interaction of the hierarchy with the environment can be modelled in a joint [environment state, machine state] state space, yielding a semi-Markov decision process whose states are those joint states where the machine is in a choice state. An appropriately designed reinforcement learning algorithm converges to the hierarchically optimal policy, i.e., the best policy consistent with the hierarchy. Subsequent papers introduced the idea of partial programs (Andre and Russell, 2000, 2002) and concurrent behaviors (Russell and Zimdars, 2003; Marthi et al., 2005), leading to the ability to provide partial specifications of arbitrarily complex behaviors. For example, in the Stratagus domain (similar to WarCraft), learning was successful on a task requiring over 100,000 primitive actions.
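As a rough illustration of the choice-state idea (a simplified sketch, not the published algorithm or code), the following applies Q-learning over joint (environment state, machine choice) pairs in a toy corridor world. The environment, the two fixed low-level behaviors, and the learning parameters are all invented for illustration; the real framework also handles machine stacks, call states, and internal machine state.

```python
# Sketch: SMDP Q-learning at choice points of a (very small) behavior hierarchy.
# All domain details here are invented for illustration.
import random
from collections import defaultdict

# Toy corridor environment: states 0..10, start at 5, reward +1 at state 10.
N, START, GOAL = 11, 5, 10

def env_step(s, a):                      # a in {-1, +1}
    s2 = max(0, min(N - 1, s + a))
    return s2, (1.0 if s2 == GOAL else -0.01), s2 == GOAL

# One choice state with two nondeterministic choices; each choice is a fixed
# low-level behavior executed for up to 3 primitive steps before returning control.
BEHAVIORS = {"left": -1, "right": +1}
GAMMA, ALPHA, EPS = 0.95, 0.1, 0.1
Q = defaultdict(float)                   # Q[(env_state, choice)]

def run_episode():
    s, done = START, False
    while not done:
        # Choice point: the joint (environment state, machine choice state)
        # is the SMDP state; epsilon-greedy over the machine's choices.
        if random.random() < EPS:
            c = random.choice(list(BEHAVIORS))
        else:
            c = max(BEHAVIORS, key=lambda b: Q[(s, b)])
        # Execute the chosen behavior, accumulating discounted reward.
        s0, ret, disc = s, 0.0, 1.0
        for _ in range(3):
            s, r, done = env_step(s, BEHAVIORS[c])
            ret += disc * r
            disc *= GAMMA
            if done:
                break
        # SMDP Q-update: bootstrap with the discount raised to the number of
        # primitive steps elapsed between the two choice points.
        target = ret + (0.0 if done else disc * max(Q[(s, b)] for b in BEHAVIORS))
        Q[(s0, c)] += ALPHA * (target - Q[(s0, c)])

for _ in range(500):
    run_episode()
print({s: max(BEHAVIORS, key=lambda b: Q[(s, b)]) for s in range(N)})
```

The point of the sketch is that learning updates happen only at choice points, and each update discounts over however many primitive steps the chosen behavior consumed, which is exactly what makes the joint process a semi-Markov decision process.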

Subsequently, we studied the problem of lookahead decision making with hierarchical partial programs. Our initial assumption that this problem was already solved for deterministic domains in the HTN (hierarchical task network) planning literature proved false: there was no proper semantics (transition model) known for high-level actions, and therefore no notion of a provably correct high-level plan. We developed the angelic semantics for high-level actions (Marthi, Russell, and Wolfe, 2007), which describes each high-level action by the set of states it can reach from the state in which it is executed. With this concept, it is possible to derive provably correct high-level plans without examining their concrete instantiations, yielding potentially exponential reductions in planning complexity.
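A minimal sketch of this idea follows, using an invented household domain; the state space, the three high-level actions, and the goal are assumptions made purely for illustration (the published work also uses optimistic and pessimistic bounds on the reachable sets rather than exact descriptions).

```python
# Sketch: angelic semantics -- each high-level action (HLA) is described by the
# set of states reachable when it is executed in a given state.  Domain invented.
# State: (location, holding_book)

def go_to_shelf(state):
    _loc, holding = state
    return {("shelf", holding)}            # reachable from any state

def pick_up_book(state):
    loc, holding = state
    if loc == "shelf" and not holding:
        return {("shelf", True)}
    return set()                           # not applicable: nothing reachable

def go_to_desk(state):
    _loc, holding = state
    return {("desk", holding)}

def reachable(plan, init_states):
    """Compose the HLA reachable-set descriptions over a sequence of HLAs."""
    states = set(init_states)
    for hla in plan:
        states = set().union(*(hla(s) for s in states))
    return states

def achieves(plan, init_state, goal_test):
    """Under angelic semantics the agent chooses the refinement, so a high-level
    plan is provably correct if some reachable state satisfies the goal."""
    return any(goal_test(s) for s in reachable(plan, {init_state}))

init = ("desk", False)
goal = lambda s: s == ("desk", True)       # holding the book, back at the desk
print(achieves([go_to_shelf, pick_up_book, go_to_desk], init, goal))  # True
print(achieves([go_to_desk], init, goal))                             # False
```

Note that the correctness check never enumerates concrete refinements of the high-level actions; it reasons only with their reachable-set descriptions, which is where the potential exponential savings come from.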

Further exploration of these ideas in the context of task-and-motion planning for robotics led us to expand the semantics of abstract actions to include optimistic descriptions of symbolic actions in physical domains (e.g., omitting geometric preconditions for grasping), and to combine them with planning algorithms that allow abstract plans to fail and be corrected when their possible concrete implementations are simulated. This led to the first task-and-motion planners that combined off-the-shelf task planners and motion planners while still producing correct plans.
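The sketch below caricatures that plan-then-refine loop: a stub task planner proposes a symbolic plan under optimistic action descriptions, a stub motion planner tries to refine each step, and a failed refinement feeds a geometric constraint back to the task level before replanning. The domain, the stubbed planners, and the failure model are all invented; this is not the interface of any actual planner.

```python
# Sketch: optimistic symbolic planning with failure-driven correction.
# All names and domain details are invented for illustration.

def task_plan(target, blocked):
    """Stub task planner: propose a symbolic pick plan that avoids grasps
    already known (from failed refinements) to be geometrically infeasible."""
    for grasp in ("top_grasp", "side_grasp"):
        if (target, grasp) not in blocked:
            return [("move_to", target), ("pick", target, grasp)]
    return None

def refine(action):
    """Stub motion planner: succeeds unless the grasp is obstructed.
    Pretend the top grasp of block_a collides with a shelf above it."""
    if action[0] == "pick" and action[1:] == ("block_a", "top_grasp"):
        return None                        # no collision-free trajectory found
    return "trajectory_for_" + "_".join(action)

def plan_task_and_motion(target):
    blocked = set()
    while True:
        plan = task_plan(target, blocked)
        if plan is None:
            return None                    # no correct plan exists
        trajectories = []
        for action in plan:
            traj = refine(action)
            if traj is None:               # abstract plan fails in simulation:
                blocked.add(action[1:])    # record the geometric constraint
                break                      # ...and replan at the task level
            trajectories.append(traj)
        else:
            return plan, trajectories      # every step refined: plan is correct

print(plan_task_and_motion("block_a"))     # falls back to the side grasp
```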

Publications on hierarchical reinforcement learning

Publications on hierarchical planning and lookahead decision making