Stuart Russell -- Research on decision making over long time scales
Like many combinatorial algorithms for sequential decision making, AlphaGo is able to make decisions with a horizon of 15 or 20 primitive steps.
A human life consists of about 20 trillion primitive steps (potential changes in muscle activation).
Even something as simple as going to a conference is about a billion steps.
Humans appear to manage by imposing a hierarchical structure on behavior, so that each primitive action is part of some
higher-level activity, and so on up to very high-level activities such
as "get a PhD" and "earn enough money to retire to Bali".
We study both hierarchical reinforcement learning and hierarchical planning and lookahead,
establishing fundamental results in each area and showing how to enable much faster learning and planning.
Our original result (Parr and Russell, 1998) showed that reinforcement learning could occur within a hierarchy of nondeterministic finite-state machines
that provide some guidance on how to behave in a specific task environment. Nondeterminism corresponds to choice states where the machine must learn how to make each choice,
dependent on the environment state and any internal state. The key observation is that the interaction of the hierarchy with the environment
can be modelled in a joint [environment state, machine state] state space, yielding a semi-Markov decision process whose states are those joint states where the machine
is in a choice state. An appropriately designed reinforcement learning algorithm converges to the hierarchically optimal policy, i.e., the best policy consistent with the constraints imposed by the machine hierarchy.
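The joint-state construction can be illustrated with a minimal sketch: Q-values are indexed by [environment state, machine state] pairs, and semi-Markov updates occur only at choice points, with reward accumulated (and discounted) over the primitive steps executed in between. This is a hypothetical toy rendering of the idea on a five-cell corridor, not the original Parr and Russell formulation.

```python
# Toy sketch of SMDP Q-learning over joint [env state, machine state]
# choice points (hypothetical domain and machine, for illustration only).
import random
from collections import defaultdict

GAMMA, ALPHA, EPS = 0.9, 0.1, 0.1

# Corridor environment: cells 0..4, reward on reaching cell 4.
def env_step(s, a):                     # a in {-1, +1}
    s2 = max(0, min(4, s + a))
    return s2, (1.0 if s2 == 4 else -0.01)

# Q-table keyed by the joint SMDP state: ((env_state, machine_state), choice).
Q = defaultdict(float)

def run_episode():
    s = 0
    while s != 4:
        joint = (s, 'choose')           # the machine is at a choice state
        if random.random() < EPS:       # epsilon-greedy over machine choices
            c = random.choice([-1, +1])
        else:
            c = max([-1, +1], key=lambda a: Q[(joint, a)])
        # Execute the chosen behavior for 2 primitive steps (until the
        # machine returns to a choice state), accumulating discounted reward.
        r_acc, disc, s2 = 0.0, 1.0, s
        for _ in range(2):
            s2, r = env_step(s2, c)
            r_acc += disc * r
            disc *= GAMMA
            if s2 == 4:
                break
        # Semi-Markov update: bootstrap from the next joint choice point.
        target = r_acc + (0.0 if s2 == 4 else
                          disc * max(Q[((s2, 'choose'), a)] for a in [-1, +1]))
        Q[(joint, c)] += ALPHA * (target - Q[(joint, c)])
        s = s2

random.seed(0)
for _ in range(500):
    run_episode()
# After training, the greedy choice at the start state is +1 (move right).
```

The point of the construction is visible in the Q-table's keys: learning happens over joint choice-point states, not over raw environment states.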
Subsequent papers introduced the idea of partial programs (Andre and Russell, 2000, 2002)
and concurrent behaviors (Russell and Zimdars, 2003; Marthi et al, 2005), leading to the ability to provide partial specifications
of arbitrarily complex behaviors. For example, in the Stratagus domain (similar to WarCraft), learning was successful on a task requiring over 100,000 primitive actions.
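A partial program can be sketched as ordinary code in which the programmer fixes the overall structure but leaves named choice points for the learner to resolve. The `choose` callback and the task below are hypothetical stand-ins, not the actual ALisp syntax of Andre and Russell.

```python
# Illustrative sketch of a partial program: fixed structure, free choices.
# (Hypothetical API; the real work used a Lisp-based agent language.)

def navigate_and_collect(trace, choose):
    """Partially specified behavior: the programmer commits to
    'first navigate, then collect', but leaves the route and the
    collection order as learnable choice points."""
    route = choose('route', ['north_corridor', 'south_corridor'])
    trace.append(('go', route))
    order = choose('order', ['nearest_first', 'valuable_first'])
    trace.append(('collect', order))
    return trace

# A completion maps choice-point names to decisions; learning would
# search over completions, but here we simply plug one in.
policy = {'route': 'south_corridor', 'order': 'nearest_first'}
trace = navigate_and_collect([], lambda name, options: policy[name])
```

Because the program structure is fixed, the learner only searches over the choice points, which is what makes learning tractable on tasks with very long primitive-action sequences.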
Subsequently, we studied the problem of lookahead decision making with
hierarchical partial programs. Our initial assumption that this
problem was already solved for deterministic domains in the HTN
(hierarchical task network) planning literature proved false: there was
no proper semantics (transition model) known for high-level actions,
and therefore no notion of a provably correct high-level plan. We
developed the angelic semantics for high-level actions (Marthi,
Russell, and Wolfe, 2007) which provides a way to describe high-level
actions by the set of reachable states given the initial state
in which the action is executed. With this concept, it is possible to derive provably correct high-level plans without examining their
concrete instantiations, yielding potentially exponential reductions in planning complexity.
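The angelic idea can be sketched with explicit state sets: each high-level action is described by an optimistic set (states some refinement might reach) and a pessimistic set (states some refinement is guaranteed to reach). If the pessimistic reachable set of a plan intersects the goal, the plan is provably correct; if even the optimistic set misses the goal, the plan can be pruned. This finite-state rendering is hypothetical; the 2007 paper uses compact symbolic descriptions rather than explicit sets.

```python
# Sketch of angelic-style plan evaluation over explicit state sets
# (hypothetical toy domain: rooms A..D, goal is room D).

def reach(plan, start, bound):
    """Propagate a set of reachable states through a high-level plan,
    under either the 'opt' (optimistic) or 'pess' (pessimistic) bound."""
    states = {start}
    for act in plan:
        states = set().union(*(act[bound].get(s, set()) for s in states))
    return states

# go_far might reach D directly, but is only guaranteed to reach C;
# go_door is guaranteed to reach D from C.
go_far  = {'opt': {'A': {'C', 'D'}}, 'pess': {'A': {'C'}}}
go_door = {'opt': {'C': {'D'}},      'pess': {'C': {'D'}}}

goal = {'D'}
good_plan = [go_far, go_door]   # pessimistically reaches D: provably correct
short_plan = [go_far]           # only optimistically reaches D: not yet provable
```

A planner can therefore commit to `good_plan` without ever examining its concrete refinements, and discard plans whose optimistic sets miss the goal, which is the source of the exponential savings.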
Further exploration of these ideas in the context of task-and-motion planning for robotics
led us to expand the semantics of abstract actions to include optimistic descriptions for
symbolic actions in physical domains (e.g., omitting geometric preconditions for grasping),
combined with planning algorithms that allow for failure and correction of abstract plans
when simulating their possible concrete implementations. This led to the first task-and-motion planners that combined off-the-shelf task planners and motion planners while producing provably correct combined plans.
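The plan-fail-correct loop can be sketched as follows: plan optimistically at the symbolic level, simulate each step with a motion-level feasibility check, and on failure report the discovered geometric obstruction back as a symbolic fact and replan. The two planners below are hypothetical stand-ins; the real systems call off-the-shelf task and motion planners.

```python
# Sketch of the optimistic plan / simulate / correct loop in
# task-and-motion planning (hypothetical stand-in planners).

def task_plan(facts):
    # Optimistic symbolic planner: grasping ignores geometry unless a
    # blocking fact has been discovered and added to `facts`.
    if ('blocks', 'can', 'cup') in facts:
        return [('pick', 'can'), ('place', 'can'), ('pick', 'cup')]
    return [('pick', 'cup')]

def motion_feasible(action, world):
    # Stand-in for a motion planner: picking the cup fails while the
    # can occludes it.
    return not (action == ('pick', 'cup') and world['can_blocks_cup'])

def plan_with_correction(world):
    facts = set()
    while True:
        plan, ok = task_plan(facts), True
        for act in plan:
            if not motion_feasible(act, world):
                # Report the geometric failure back as a symbolic fact.
                facts.add(('blocks', 'can', 'cup'))
                ok = False
                break
            if act == ('place', 'can'):
                world = dict(world, can_blocks_cup=False)
        if ok:
            return plan

plan = plan_with_correction({'can_blocks_cup': True})
```

The first, optimistic plan fails in simulation; the correction step adds the blocking fact, and the second symbolic plan clears the obstruction before grasping.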
Publications on hierarchical reinforcement learning
- Ron Parr and Stuart Russell,
``Reinforcement Learning with
Hierarchies of Machines.''
In Advances in Neural Information Processing Systems 10,
MIT Press, 1998.
- David Andre and Stuart Russell,
``Programmable Reinforcement Learning Agents.''
In Advances in Neural Information Processing Systems 13,
MIT Press, 2001.
- David Andre and Stuart Russell,
``State Abstraction for Programmable Reinforcement Learning Agents.''
In Proc. AAAI-02, Edmonton, Alberta: AAAI Press, 2002.
- Stuart Russell and Andrew Zimdars,
``Q-Decomposition for Reinforcement Learning Agents.''
In Proc. ICML-03, Washington, DC, 2003.
- Bhaskara Marthi, Stuart Russell, David Latham, and Carlos Guestrin,
``Concurrent hierarchical reinforcement learning.''
In Proc. IJCAI-05, Edinburgh, Scotland, 2005.
- Bhaskara Marthi, Stuart Russell, and David Andre,
``A Compact, Hierarchical Q-Function Decomposition.''
In Proc. UAI-06, Cambridge, MA, 2006.
- Aijun Bai and Stuart Russell,
``Efficient Reinforcement Learning with Hierarchies of Machines by Leveraging Internal Transitions.''
In Proc. IJCAI-17, Melbourne, 2017.
Publications on hierarchical planning and lookahead decision making
- Bhaskara Marthi, Stuart Russell, and Jason Wolfe,
``Angelic Semantics for High-Level Actions.''
In Proceedings of the Seventeenth International Conference on Automated Planning and Scheduling (ICAPS 2007),
Providence, Rhode Island, 2007.
- Bhaskara Marthi, Stuart Russell, and Jason Wolfe,
``Angelic Hierarchical Planning: Optimal and Online Algorithms.''
In Proceedings of the Eighteenth International Conference on Automated Planning and Scheduling (ICAPS 2008), Sydney, 2008.
- Jason Wolfe, Bhaskara Marthi, and Stuart Russell,
``Combined Task and Motion Planning for Mobile Manipulation.''
In Proceedings of the Twentieth International Conference on Automated Planning and Scheduling (ICAPS 2010), Toronto, 2010.
- Jason Wolfe and Stuart Russell,
``Bounded Intention Planning.''
In Proc. IJCAI-11, Barcelona, 2011.
- Siddharth Srivastava, Eugene Fang, Lorenzo Riano, Rohan Chitnis, Stuart Russell, and Pieter Abbeel,
``Combined Task and Motion Planning
Through an Extensible Planner-Independent Interface Layer.''
In Proc. ICRA-14,
Hong Kong, 2014.
- Siddharth Srivastava, Shlomo Zilberstein, Abhishek Gupta, Pieter Abbeel, and Stuart Russell,
``Tractability of Planning with Loops.''
In Proc. AAAI-15,
Austin, Texas, 2015.
- Siddharth Srivastava, Stuart Russell, and Alessandro Pinto,
``Metaphysics of Planning Domain Descriptions.'' In Proc. AAAI-16, Phoenix, 2016.