CS 294-67, Spring 2011: Sequential Decisions: Planning and Reinforcement Learning
Reading list
This list is still under construction. An empty bullet item indicates more readings to come for that week.
Books
- S+B: Sutton and Barto, Reinforcement Learning: An Introduction.
- B+T: Bertsekas and Tsitsiklis, Neuro-Dynamic Programming.
- R+N: Russell and Norvig, Artificial Intelligence: A Modern Approach.
Week 1 (1/19): Agents, environments, Markov decision processes
- S+B 1, 3
Ch.1 gives a nontechnical motivation for studying RL and an extensive history. Ch.3 introduces the basic MDP definitions and Bellman equations
at some length, including some excellent examples (golf in particular). (The Bellman optimality equation is written out after this list for reference.)
- B+T 1, 2.1, 2.4
Ch.1 concisely summarizes the whole book, stressing the approach of sample-based function approximation. 2.1 and 2.4 define MDPs and give examples of MDP formulation.
- R+N 2, 17.1
Ch.2 explores the range of environment types, agent types, and agent-environment relationships. 17.1 defines MDPs less technically than B+T
and more concisely than S+B.
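For reference, the key equation that Ch.3 builds toward is the Bellman optimality equation for a discounted MDP; in one common notation (S+B write the transition and expected-reward terms slightly differently):

    V^*(s) = \max_a \sum_{s'} P(s' \mid s, a) \big[ R(s, a, s') + \gamma V^*(s') \big]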
Week 2 (1/26): Dynamic programming
- S+B 4
A nice introduction to policy evaluation, policy improvement, policy iteration, and value iteration, with some helpful diagrams. (A bare-bones value iteration sketch follows this list.)
- B+T 2.2, 2.3.
Technical introduction emphasizing operator formulations and convergence proofs. 2.2 covers episodic (stochastic
shortest-path) problems, while 2.3 covers continuing, discounted,
infinite-horizon problems. These sections are quite similar to each
other, but 2.2 is more complicated. Focus primarily on 2.3, but refer
back to 2.2 as needed for some of the lemmas.
- R+N 17.2-3
Brief analysis of value and policy iteration with simplified proofs where possible.
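To make these operations concrete, here is a minimal value iteration sketch (with greedy policy extraction) for a tabular, discounted MDP. The dictionary-based representation and the names P, R, gamma, tol are illustrative choices, not anything from the readings.

    # Minimal value iteration for a tabular, discounted MDP.
    # P[s][a] is a list of (prob, next_state) pairs, R[s][a] is the expected
    # immediate reward, gamma is the discount factor -- illustrative conventions.
    def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                q = [R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                     for a in actions]
                best = max(q)
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                return V

    def greedy_policy(states, actions, P, R, V, gamma=0.9):
        # Policy improvement step: act greedily with respect to V.
        return {s: max(actions,
                       key=lambda a: R[s][a] + gamma * sum(p * V[s2]
                                                           for p, s2 in P[s][a]))
                for s in states}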
Week 3 (2/2): Dynamic programming contd.
- Readings as for Week 2
- Optional Yinyu Ye, ``The Simplex and Policy-Iteration Methods are Strongly
Polynomial for the Markov Decision Problem with a Fixed Discount Rate.''
Unpublished manuscript, 2010.
[pdf]
Precise analysis of the runtime of policy iteration and
its connection to linear programming; good summary of the literature on this topic.
Week 4 (2/9): Partially observable MDPs
- R+N 17.4.
Explains how to reduce POMDPs to belief-state MDPs and describes a simple value iteration algorithm. (The underlying belief-update step is sketched after this list.)
- Eric Hansen, ``Solving POMDPs by searching in policy space.''
In Proc. UAI-98.
[pdf]
Introduces a direct method of solving POMDPs by building finite
automata that simultaneously represent the policy and maintain the belief state.
- Joelle Pineau, Geoff Gordon, and Sebastian Thrun, ``Point-based value iteration: An anytime algorithm for POMDPs.''
In Proc. IJCAI-03.
[pdf]
Introduces an approximate POMDP solver that generates strategies for selected points in belief-state space.
- Optional Marc Toussaint, Stefan Harmeling, and Amos Storkey, ``Probabilistic inference for solving (PO)MDPs.''
Research Report EDI-INF-RR-0934,
University of Edinburgh, School of Informatics, 2006.
[pdf]
Shows how to reduce MDP and POMDP solving to a combination of probabilistic inference and expectation-maximization (EM).
Although the basic concept is old (Shachter and Peot, UAI 1992), this instantiation is effective given the availability
of powerful inference algorithms for large DBNs.
- Optional S. Ross, J. Pineau, S. Paquet, B. Chaib-draa. ``Online Planning Algorithms for POMDPs.'' JAIR, 32, 663-704, 2008.
[pdf]
A comprehensive survey of online planning algorithms for POMDPs.
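The reduction in R+N 17.4 rests on a single Bayes-filter step that maps a belief state, action, and observation to a new belief state. A minimal sketch, assuming discrete states and illustrative dictionary names T (transition probabilities) and O (observation probabilities):

    # Belief update b'(s') = alpha * O(e | s', a) * sum_s T(s' | s, a) * b(s),
    # the step that turns a POMDP into an MDP over belief states.
    # T[s][a][s2] and O[s2][a][e] are probabilities; the names are illustrative.
    def belief_update(b, a, e, states, T, O):
        new_b = {}
        for s2 in states:
            pred = sum(T[s][a][s2] * b[s] for s in states)   # prediction step
            new_b[s2] = O[s2][a][e] * pred                   # correction step
        norm = sum(new_b.values())
        return {s2: p / norm for s2, p in new_b.items()}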
Week 5 (2/16): Early history, Monte Carlo RL
- Samuel, Arthur L., "Some studies in machine learning using the game of
checkers." IBM Journal of Research and Development 3(3), 210-229, 1959.
[pdf]
The paper that started it all. Lacks technical foundations but has most of the important ideas.
Many of the implementation details are fascinating, even if not germane to the course.
- S+B 5
Shows how to use multiple trials to estimate a value function and optimize a policy. (A first-visit Monte Carlo evaluation sketch follows this list.)
- B+T 5.1-5.2
Gives a much more careful analysis of the statistical basis for Monte Carlo policy evaluation.
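To fix ideas before reading S+B 5, here is a minimal first-visit Monte Carlo policy evaluation sketch. The episode format, a list of (state, reward) pairs produced by following the policy being evaluated, is an assumption made for illustration.

    from collections import defaultdict

    # First-visit Monte Carlo policy evaluation. Each episode is a list of
    # (state, reward) pairs, where reward is received on leaving that state;
    # this format is an illustrative assumption.
    def mc_policy_evaluation(episodes, gamma=1.0):
        returns_sum = defaultdict(float)
        returns_count = defaultdict(int)
        for episode in episodes:
            first_visit = {}
            for t, (s, _) in enumerate(episode):
                if s not in first_visit:
                    first_visit[s] = t
            # Work backwards, accumulating the return G from each time step.
            G = 0.0
            returns = [0.0] * len(episode)
            for t in reversed(range(len(episode))):
                G = episode[t][1] + gamma * G
                returns[t] = G
            for s, t in first_visit.items():
                returns_sum[s] += returns[t]
                returns_count[s] += 1
        return {s: returns_sum[s] / returns_count[s] for s in returns_sum}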
Week 6 (2/23): Basic RL algorithms: TD, Q-learning, SARSA
- B+T 4.1
Brief technical introduction to running averages, Robbins-Monro algorithms, and their convergence.
This will give you a better idea of the mathematical processes we are dealing with in TD methods.
- S+B 6.1-6.5
Introduces and motivates the basic temporal-difference algorithms: TD(0) for learning V(s)
and SARSA and Q-learning for learning Q(s,a). (A tabular Q-learning sketch appears after this list.)
- B+T 5.3
5.3 covers TD(λ), connecting it
to Monte Carlo and Robbins-Monro. We will focus largely on
λ=0. Subsections 5.3.3-5.3.7 are quite technical and can
be skipped on first reading (and on subsequent readings).
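As a concrete reference point for the readings, here is a minimal tabular Q-learning sketch with epsilon-greedy exploration. The environment interface (reset() returning a state, step(action) returning (next_state, reward, done)) is an illustrative assumption, not something taken from S+B or B+T.

    import random
    from collections import defaultdict

    # Tabular Q-learning with epsilon-greedy exploration.
    # `env` is assumed to expose reset() -> state and
    # step(action) -> (next_state, reward, done); this interface is an
    # illustrative assumption.
    def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
        Q = defaultdict(float)                       # Q[(state, action)]
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                if random.random() < eps:
                    a = random.choice(actions)       # explore
                else:
                    a = max(actions, key=lambda a_: Q[(s, a_)])   # exploit
                s2, r, done = env.step(a)
                # Off-policy TD target: bootstrap from the greedy action at s2.
                target = r + (0.0 if done else
                              gamma * max(Q[(s2, a_)] for a_ in actions))
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s2
        return Q

SARSA differs only in the target: it bootstraps from the action actually chosen at s2 rather than the greedy one.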
Week 7 (3/2): Convergence of Q-learning; function approximation
- B+T 4.3, 5.6
This is the easiest of the formal convergence proofs for the TD algorithms in the tabular case. It's worth the effort to try to follow it.
- S+B 8.1-8.4, 11.1.
Introduces the gradient-descent update and several families of function approximators. 11.1 describes an application to backgammon. (A linear semi-gradient TD(0) sketch follows this list.)
- B+T 8.3.
Describes an application to Tetris.
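For a linear approximator V(s) ≈ w·φ(s), the gradient-descent update of S+B 8 reduces to a one-line weight change. A minimal semi-gradient TD(0) sketch, with the feature map phi and the transition format assumed for illustration:

    import numpy as np

    # Semi-gradient TD(0) policy evaluation with a linear approximator
    # V(s) ~ w . phi(s). `phi` maps a state to a fixed-length numpy vector and
    # `transitions` is an iterable of (s, r, s2, done) tuples generated by the
    # policy being evaluated -- both illustrative assumptions.
    def linear_td0(transitions, phi, n_features, alpha=0.01, gamma=0.99):
        w = np.zeros(n_features)
        for s, r, s2, done in transitions:
            v = w @ phi(s)
            v_next = 0.0 if done else w @ phi(s2)
            td_error = r + gamma * v_next - v
            # Semi-gradient step: the target's dependence on w is ignored,
            # and the gradient of V(s) with respect to w is just phi(s).
            w += alpha * td_error * phi(s)
        return w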
Project proposals due.
Week 8 (3/9): Function approximation: convergence properties and proofs
- Gordon, G. (1995).
"Stable Function Approximation in Dynamic Programming." In Proc. ICML-95.
A relatively gentle introduction to DP with function approximation.
- Papavassiliou, V. and Russell, S. (1999).
"Convergence of reinforcement learning with general function approximators." In Proc. IJCAI-99.
Presents a rather different style of RL algorithm that works with any function approximator.
- Maei, H. R., Szepesvari, Cs., Bhatnagar, S., Precup, D., Silver, D., Sutton, R. S. (2010).
"Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation."
In Advances in Neural Information Processing Systems 22.
A more conventional TD-style algorithm that finds a local optimum with a smooth function approximator.
- B+T 6.2, 6.3
Policy iteration with approximate policy evaluation. Mostly heavy going but worth skimming for extra background.
Week 9 (3/16): LSTD/LSPI; policy search methods
Week 10 (3/23):
Spring Break
Week 11 (3/30): Factored MDPs and symbolic dynamic programming
- Daphne Koller and Ronald Parr (2000).
``Policy Iteration for Factored MDPs.''
In Proc. UAI-00.
Couples linear function approximation with DBN transition models for large MDPs.
- Optional Carlos Guestrin, Daphne Koller, Ronald Parr and Shobha Venkataraman (2003).
``Efficient Solution Algorithms for Factored MDPs.''
JAIR, 19, 399-468.
Very long version of Koller and Parr (2000), with improved LP formulation and convergence results.
- Craig Boutilier, Richard Dearden and Moises Goldszmidt (1995).
``Exploiting Structure in Policy Construction.''
In Proc. IJCAI-95.
Introduces the basic idea of symbolic dynamic programming: large MDPs with symbolic representations for T and R, and DP algorithms that
directly manipulate symbolic representations of V, Q, and/or π.
- Optional Craig Boutilier, Richard Dearden and Moises Goldszmidt (2000).
``Stochastic Dynamic Programming with Factored Representations.''
AIJ, 121(1), 49-107.
Very long version of Boutilier et al. (1995), with many algorithmic and representational improvements.
Week 12 (4/6): First-order MDPs and relational RL
- Scott Sanner and Kristian Kersting (2007).
``Symbolic dynamic programming.''
In C. Sammut, editor, Encyclopedia of Machine Learning. Berlin: Springer-Verlag.
A clear and elegant description of dynamic programming with first-order logical expressions. Doesn't get into implementation detail
or mention the problem of bounding the expression size.
- Optional Scott Sanner and Craig Boutilier (2009).
``Practical Solution Techniques for First-Order MDPs.''
Artificial Intelligence, 173, 748-788.
Very long paper giving full details of how it's done. The first three sections cover more or less
the same material as the short article; you could start at Section 4 to see where things start to get hairy.
- Carlos Guestrin, Daphne Koller, Chris Gearhart, and Neal Kanodia (2003).
``Generalizing Plans to New Environments in Relational MDPs.''
In Proc. IJCAI-03.
A nice example of how to learn value functions for a relational MDP (FreeCraft). Sections 3 and 5 can be skipped on first reading.
Week 13 (4/13): Hierarchical RL
- Ronald Parr and Stuart Russell (1998).
"Reinforcement learning with
hierarchies of machines."
In Advances in Neural Information Processing Systems 10,
MIT Press, 1998.
This paper marked a shift within hierarchical RL from state-abstraction hierarchies to hierarchies based on temporal abstraction and high-level actions.
- Dietterich, T. G. (2000). "Hierarchical reinforcement learning with the MAXQ value function decomposition."
Journal of Artificial Intelligence Research, 13, 227-303.
Too long, but has a lot of good stuff - particularly the additive temporal decomposition of value functions. Focus on Sections 1, 3, and 4.1-2.
- Bhaskara Marthi, Stuart Russell, David Latham, and Carlos Guestrin (2005).
``Concurrent hierarchical reinforcement learning.''
In Proc. IJCAI-05.
Illustrates the ideas and applications of partial programs in RL, as well as functional (multi-threaded) decomposition of value functions and distributed RL.
Week 14 (4/20): Exploration, bandits, and metalevel RL
- P. Auer, N. Cesa-Bianchi, and P. Fischer (2002).
``Finite-time analysis of the multiarmed bandit problem.''
Machine Learning, 47(2-3), 235-256.
Describes and analyzes simple but provably effective algorithms for choosing actions in bandit problems. The proof details are less important than the nature of the results and the algorithms themselves. (A few-line UCB1 sketch appears after this list.)
- Levente Kocsis and Csaba Szepesvari (2006).
``Bandit Based Monte-Carlo Planning.''
In Proc. ECML-06.
Applies Auer's UCB bandit allocation algorithm to the problem of Monte Carlo search in game trees. Later, this algorithm became part of the world-champion Go program, MoGo. (See paper distributed by email.)
- Optional Gittins, J. C. (1979).
``Bandit processes and dynamic allocation indices.''
J. Roy. Statist. Soc., B41, 148-177.
Classic paper showing that each arm can be assigned an index, depending only on that arm's own properties, such that the optimal strategy in a bandit problem is always to pick the arm with the highest index.
- Optional J. Tsitsiklis (1994).
``A short proof of the Gittins index theorem.''
The Annals of Applied Probability, 4(1), 194-199.
An "elementary" proof of Gittins's theorem, which some may find preferable to the original.
Week 15 (4/27): Inverse RL
- Andrew Y. Ng and Stuart Russell,
``Algorithms for inverse reinforcement learning.''
In Proc. ICML-00.
Introduces the inverse RL problem, describes motivation and difficulties (especially reward degeneracy), and gives an LP formulation for one particular approach to resolving degeneracy.
- Pieter Abbeel and Andrew Y. Ng,
``Apprenticeship learning via inverse reinforcement learning.''
In Proc. ICML-04.
A more sophisticated analysis including theorems demonstrating convergence to a policy nearly as good as the expert's.
Week 16:
Reading/Review/Recitation
Week 17 (TBD):
Project Presentations