CS 294-67, Spring 2011: Sequential Decisions: Planning and Reinforcement Learning
Reading list
This list is still under construction. An empty bullet item indicates more readings to come for that week.
Books
- S+B: Sutton and Barto, Reinforcement Learning: An Introduction.
- B+T: Bertsekas and Tsitsiklis, Neuro-Dynamic Programming.
- R+N: Russell and Norvig, Artificial Intelligence: A Modern Approach.
Week 1 (1/19): Agents, environments, Markov decision processes
- S+B 1, 3
Ch.1 gives a nontechnical motivation for studying RL and an extensive history. Ch.3 introduces the basic MDP definitions and Bellman equations
at some length, including some excellent examples (golf in particular). (The Bellman optimality equation is written out after this list for reference.)
- B+T 1, 2.1, 2.4
Ch.1 concisely summarizes the whole book, stressing the approach of sample-based function approximation. 2.1 and 2.4 define MDPs and give examples of MDP formulation.
- R+N 2, 17.1
Ch.2 explores the range of environment types, agent types, and agent-environment relationships. 17.1 defines MDPs less technically than B+T
and more concisely than S+B.
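For reference, the key equation that Ch.3 builds toward is the Bellman optimality equation for a discounted MDP; in one common notation (S+B write the transition and expected-reward terms slightly differently):

    V^*(s) = \max_a \sum_{s'} P(s' \mid s, a) \big[ R(s, a, s') + \gamma V^*(s') \big]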
Week 2 (1/26): Dynamic programming
- S+B 4
A nice introduction to policy evaluation, policy improvement, policy iteration, and value iteration, with some helpful diagrams. (A bare-bones value iteration sketch follows this list.)
- B+T 2.2, 2.3.
Technical introduction emphasizing operator formulations and convergence proofs. 2.2 covers episodic (stochastic
shortest-path) problems, while 2.3 covers continuing, discounted,
infinite-horizon problems. These sections are quite similar to each
other, but 2.2 is more complicated. Focus primarily on 2.3, but refer
back to 2.2 as needed for some of the lemmas.
- R+N 17.2-3
Brief analysis of value and policy iteration with simplified proofs where possible.
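To make these operations concrete, here is a minimal value iteration sketch (with greedy policy extraction) for a tabular, discounted MDP. The dictionary-based representation and the names P, R, gamma, tol are illustrative choices, not anything from the readings.

    # Minimal value iteration for a tabular, discounted MDP.
    # P[s][a] is a list of (prob, next_state) pairs, R[s][a] is the expected
    # immediate reward, gamma is the discount factor -- illustrative conventions.
    def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                q = [R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                     for a in actions]
                best = max(q)
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                return V

    def greedy_policy(states, actions, P, R, V, gamma=0.9):
        # Policy improvement step: act greedily with respect to V.
        return {s: max(actions,
                       key=lambda a: R[s][a] + gamma * sum(p * V[s2]
                                                           for p, s2 in P[s][a]))
                for s in states}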
Week 3 (2/2): Dynamic programming contd.
- Readings as for Week 2
- Optional Yinyu Ye, ``The Simplex and Policy-Iteration Methods are Strongly
Polynomial for the Markov Decision Problem with a Fixed Discount Rate.''
Unpublished manuscript, 2010.
[pdf]
Precise analysis of the runtime of policy iteration and
its connection to linear programming; good summary of the literature on this topic.
Week 4 (2/9): Partially observable MDPs
- R+N 17.4.
Explains how to reduce POMDPs to belief-state MDPs and describes a simple value iteration algorithm. (The underlying belief-update step is sketched after this list.)
- Eric Hansen, ``Solving POMDPs by searching in policy space.''
In Proc. UAI-98.
[pdf]
Introduces a direct method of solving POMDPs by building finite
automata that simultaneously represent the policy and maintain the belief state.
- Joelle Pineau, Geoff Gordon, and Sebastian Thrun, ``Point-based value iteration: An anytime algorithm for POMDPs.''
In Proc. IJCAI-03.
[pdf]
Introduces an approximate POMDP solver that generates strategies for selected points in belief-state space.
- Optional Marc Toussaint, Stefan Harmeling, and Amos Storkey, ``Probabilistic inference for solving (PO)MDPs.''
Research Report EDI-INF-RR-0934,
University of Edinburgh, School of Informatics, 2006.
[pdf]
Shows how to reduce MDP and POMDP solving to a combination of probabilistic inference and expectation-maximization (EM).
Although the basic concept is old (Shachter and Peot, UAI 1992), this instantiation is effective given the availability
of powerful inference algorithms for large DBNs.
- Optional S. Ross, J. Pineau, S. Paquet, B. Chaib-draa. ``Online Planning Algorithms for POMDPs.'' JAIR, 32, 663-704, 2008.
[pdf]
A comprehensive survey of online planning algorithms for POMDPs.
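The reduction in R+N 17.4 rests on a single Bayes-filter step that maps a belief state, action, and observation to a new belief state. A minimal sketch, assuming discrete states and illustrative dictionary names T (transition probabilities) and O (observation probabilities):

    # Belief update b'(s') = alpha * O(e | s', a) * sum_s T(s' | s, a) * b(s),
    # the step that turns a POMDP into an MDP over belief states.
    # T[s][a][s2] and O[s2][a][e] are probabilities; the names are illustrative.
    def belief_update(b, a, e, states, T, O):
        new_b = {}
        for s2 in states:
            pred = sum(T[s][a][s2] * b[s] for s in states)   # prediction step
            new_b[s2] = O[s2][a][e] * pred                   # correction step
        norm = sum(new_b.values())
        return {s2: p / norm for s2, p in new_b.items()}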
Week 5 (2/16): Early history, Monte Carlo RL
- Samuel, Arthur L., "Some studies in machine learning using the game of
checkers." IBM Journal of Research and Development 3(3), 210-229, 1959.
[pdf]
The paper that started it all. Lacks technical foundations but has most of the important ideas.
Many of the implementation details are fascinating, even if not germane to the course.
- S+B 5
Shows how to use multiple trials to estimate a value function and optimize a policy. (A first-visit Monte Carlo evaluation sketch follows this list.)
- B+T 5.1-5.2
Gives a much more careful analysis of the statistical basis for Monte Carlo policy evaluation.
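To fix ideas before reading S+B 5, here is a minimal first-visit Monte Carlo policy evaluation sketch. The episode format, a list of (state, reward) pairs produced by following the policy being evaluated, is an assumption made for illustration.

    from collections import defaultdict

    # First-visit Monte Carlo policy evaluation. Each episode is a list of
    # (state, reward) pairs, where reward is received on leaving that state;
    # this format is an illustrative assumption.
    def mc_policy_evaluation(episodes, gamma=1.0):
        returns_sum = defaultdict(float)
        returns_count = defaultdict(int)
        for episode in episodes:
            first_visit = {}
            for t, (s, _) in enumerate(episode):
                if s not in first_visit:
                    first_visit[s] = t
            # Work backwards, accumulating the return G from each time step.
            G = 0.0
            returns = [0.0] * len(episode)
            for t in reversed(range(len(episode))):
                G = episode[t][1] + gamma * G
                returns[t] = G
            for s, t in first_visit.items():
                returns_sum[s] += returns[t]
                returns_count[s] += 1
        return {s: returns_sum[s] / returns_count[s] for s in returns_sum}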
Week 6 (2/23): Basic RL algorithms: TD, Q-learning, SARSA
- B+T 4.1
Brief technical introduction to running averages, Robbins-Monro algorithms, and their convergence.
This will give you a better idea of the mathematical processes we are dealing with in TD methods.
- S+B 6.1-6.5
Introduces and motivates the basic temporal-difference algorithms: TD(0) for learning V(s)
and SARSA and Q-learning for learning Q(s,a). (A tabular Q-learning sketch appears after this list.)
- B+T 5.3
5.3 covers TD(λ), connecting it
to Monte Carlo and Robbins-Monro. We will focus largely on
λ=0. Subsections 5.3.3-5.3.7 are quite technical and can
be skipped on first reading (and on subsequent readings).
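As a concrete reference point for the readings, here is a minimal tabular Q-learning sketch with epsilon-greedy exploration. The environment interface (reset() returning a state, step(action) returning (next_state, reward, done)) is an illustrative assumption, not something taken from S+B or B+T.

    import random
    from collections import defaultdict

    # Tabular Q-learning with epsilon-greedy exploration.
    # `env` is assumed to expose reset() -> state and
    # step(action) -> (next_state, reward, done); this interface is an
    # illustrative assumption.
    def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
        Q = defaultdict(float)                       # Q[(state, action)]
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                if random.random() < eps:
                    a = random.choice(actions)       # explore
                else:
                    a = max(actions, key=lambda a_: Q[(s, a_)])   # exploit
                s2, r, done = env.step(a)
                # Off-policy TD target: bootstrap from the greedy action at s2.
                target = r + (0.0 if done else
                              gamma * max(Q[(s2, a_)] for a_ in actions))
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s2
        return Q

SARSA differs only in the target: it bootstraps from the action actually chosen at s2 rather than the greedy one.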
Week 7 (3/2): Convergence of Q-learning; function approximation
- B+T 4.3, 5.6
This is the easiest of the formal convergence proofs for the TD algorithms in the tabular case. It's worth the effort to try to follow it.
- S+B 8.1-8.4, 11.1.
Introduces the gradient-descent update and several families of function approximators. 11.1 describes an application to backgammon. (A linear semi-gradient TD(0) sketch follows this list.)
- B+T 8.3.
Describes an application to Tetris.
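For a linear approximator V(s) ≈ w·φ(s), the gradient-descent update of S+B 8 reduces to a one-line weight change. A minimal semi-gradient TD(0) sketch, with the feature map phi and the transition format assumed for illustration:

    import numpy as np

    # Semi-gradient TD(0) policy evaluation with a linear approximator
    # V(s) ~ w . phi(s). `phi` maps a state to a fixed-length numpy vector and
    # `transitions` is an iterable of (s, r, s2, done) tuples generated by the
    # policy being evaluated -- both illustrative assumptions.
    def linear_td0(transitions, phi, n_features, alpha=0.01, gamma=0.99):
        w = np.zeros(n_features)
        for s, r, s2, done in transitions:
            v = w @ phi(s)
            v_next = 0.0 if done else w @ phi(s2)
            td_error = r + gamma * v_next - v
            # Semi-gradient step: the target's dependence on w is ignored,
            # and the gradient of V(s) with respect to w is just phi(s).
            w += alpha * td_error * phi(s)
        return w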
Project proposals due.
Week 8 (3/9): Function approximation: convergence properties and proofs
- Gordon, G. (1995).
"Stable Function Approximation in Dynamic Programming." In Proc. ICML-95.
A relatively gentle introduction to DP with function approximation.
- Papavassiliou, V. and Russell, S. (1999).
"Convergence of reinforcement learning with general function approximators." In Proc. IJCAI-99.
Presents a rather different style of RL algorithm that works with any function approximator.
- Maei, H. R., Szepesvari, Cs., Bhatnagar, S., Precup, D., Silver, D., Sutton, R. S. (2010).
"Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation."
In Advances in Neural Information Processing Systems 22.
A more conventional TD-style algorithm that finds a local optimum with a smooth function approximator.
- B+T 6.2, 6.3
Policy iteration with approximate policy evaluation. Mostly heavy going but worth skimming for extra background.
Week 9 (3/16): LSTD/LSPI; policy search methods
Week 10 (3/23):
Spring Break
Week 11 (3/30): Factored MDPs and symbolic dynamic programming
- Daphne Koller and Ronald Parr (2000).
``Policy Iteration for Factored MDPs.''
In Proc. UAI-00.
Couples linear function approximation with DBN transition models for large MDPs.
- Optional Carlos Guestrin, Daphne Koller, Ronald Parr and Shobha Venkataraman (2003).
``Efficient Solution Algorithms for Factored MDPs.''
JAIR, 19, 399-468.
Very long version of Koller and Parr (2000), with improved LP formulation and convergence results.
- Craig Boutilier, Richard Dearden and Moises Goldszmidt (1995).
``Exploiting Structure in Policy Construction.''
In Proc. IJCAI-95.
Introduces the basic idea of symbolic dynamic programming: large MDPs with symbolic representations for T and R, and DP algorithms that
directly manipulate symbolic representations of V, Q, and/or π.
- Optional Craig Boutilier, Richard Dearden and Moises Goldszmidt (2000).
``Stochastic Dynamic Programming with Factored Representations.''
AIJ, 121(1), 49-107.
Very long version of Boutilier et al. (1995), with many algorithmic and representational improvements.
Week 12 (4/6): First-order MDPs and relational RL
- Scott Sanner and Kristian Kersting (2007).
``Symbolic dynamic programming.''
In C. Sammut, editor, Encyclopedia of Machine Learning. Berlin: Springer-Verlag.
A clear and elegant description of dynamic programming with first-order logical expressions. Doesn't get into implementation detail
or mention the problem of bounding the expression size.
- Optional Scott Sanner and Craig Boutilier (2009).
``Practical Solution Techniques for First-Order MDPs.''
Artificial Intelligence, 173, 748-788.
Very long paper giving full details of how it's done. The first three sections cover more or less
the same material as the short article; you could start at Section 4 to see where things start to get hairy.
- Carlos Guestrin, Daphne Koller, Chris Gearhart, and Neal Kanodia (2003).
``Generalizing Plans to New Environments in Relational MDPs.''
In Proc. IJCAI-03.
A nice example of how to learn value functions for a relational MDP (FreeCraft). Sections 3 and 5 can be skipped on first reading.
Week 13 (4/13): Hierarchical RL
- Ronald Parr and Stuart Russell (1998).
"Reinforcement learning with
hierarchies of machines."
In Advances in Neural Information Processing Systems 10,
MIT Press, 1998.
This paper marked a shift within hierarchical RL from state-abstraction hierarchies to hierarchies based on temporal abstraction and high-level actions.
- Dietterich, T. G. (2000). "Hierarchical reinforcement learning with the MAXQ value function decomposition."
Journal of Artificial Intelligence Research, 13, 227-303.
Too long, but has a lot of good stuff - particularly the additive temporal decomposition of value functions. Focus on Sections 1, 3, and 4.1-2.
- Bhaskara Marthi, Stuart Russell, David Latham, and Carlos Guestrin (2005).
``Concurrent hierarchical reinforcement learning.''
In Proc. IJCAI-05.
Illustrates the ideas and applications of partial programs in RL, as well as functional (multi-threaded) decomposition of value functions and distributed RL.
Week 14 (4/20): Exploration, bandits, and metalevel RL
- P. Auer, N. Cesa-Bianchi, and P. Fischer (2002).
``Finite-time analysis of the multiarmed bandit problem.''
Machine Learning, 47(2-3), 235-256.
Describes and analyzes simple but provably effective algorithms for choosing actions in bandit problems. The proof details are less important than the nature of the results and the algorithms themselves. (A few-line UCB1 sketch appears after this list.)
- Levente Kocsis and Csaba Szepesvari (2006).
``Bandit Based Monte-Carlo Planning.''
In Proc. ECML-06.
Applies Auer's UCB bandit allocation algorithm to the problem of Monte Carlo search in game trees. Later, this algorithm became part of the world-champion Go program, MoGo. (See paper distributed by email.)
- Optional Gittins, J. C. (1979).
``Bandit processes and dynamic allocation indices.''
J. Roy. Statist. Soc., B41, 148-177.
Classic paper showing that each arm can be assigned an index, depending only on that arm's own properties, such that the optimal strategy in a bandit problem is always to pick the arm with the highest index.
- Optional J. Tsitsiklis (1994).
``A short proof of the Gittins index theorem.''
The Annals of Applied Probability, 4(1), 194-199.
An "elementary" proof of Gittins's theorem, which some may find preferable to the original.
Week 15 (4/27): Inverse RL
- Andrew Y. Ng and Stuart Russell,
``Algorithms for inverse reinforcement learning.''
In Proc. ICML-00.
Introduces the inverse RL problem, describes motivation and difficulties (especially reward degeneracy), and gives an LP formulation for one particular approach to resolving degeneracy.
- Pieter Abbeel and Andrew Y. Ng,
``Apprenticeship learning via inverse reinforcement learning.''
In Proc. ICML-04.
A more sophisticated analysis including theorems demonstrating convergence to a policy nearly as good as the expert's.
Week 16:
Reading/Review/Recitation
Week 17 (TBD):
Project Presentations