This assignment comes in two parts. The first part is worth 50 points out of 100 and is mainly intended to help you become familiar with the basics of MDP representations, algorithms, and agents and with Spider solitaire. It does not involve writing much new code. The second part, to be posted shortly, deals with reinforcement learning.
The first thing you need to do is load the CS188 AIMA code in the usual way: load aima.lisp and then do (aima-load 'search) and (aima-load 'mdps). You should also copy, compile, and load all the lisp files in this directory.
Be sure to use the latest version from ~cs188. Several things have changed and new code has been added. As always, remember to compile the code.
The AIMA code includes a general facility for defining and using MDPs. Code defining MDPs and all the basic operations on them (including I/O) appears in the mdps subdirectory.
The main methods defined on MDPs are illustrated below. The function value-iteration computes the utility of every state:

>> (hprint (value-iteration *4x3-mdp*))
(1 1): 0.70530814
(2 1): 0.655308
(3 2): 0.660274
(1 3): 0.81155825
(2 3): 0.8678082
(4 1): 0.38792402
(4 3): 1.0
(3 1): 0.6114151
(1 2): 0.7615582
(4 2): -1.0
(3 3): 0.91780823

The function value-iteration-policy does value iteration and converts the result into an optimal policy by one-step lookahead:

>> (hprint (value-iteration-policy *4x3-mdp*))
(1 1): UP
(2 1): LEFT
(3 2): UP
(1 3): RIGHT
(2 3): RIGHT
(4 1): LEFT
(4 3): NIL
(3 1): LEFT
(1 2): UP
(4 2): NIL
(3 3): RIGHT
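The one-step lookahead that converts utilities into a policy just picks, in each state, the action with the highest expected utility of its outcomes. Here is a generic sketch of that idea; the transitions and utility accessors are hypothetical stand-ins, not the AIMA names:

```lisp
;; Sketch of one-step lookahead: choose the action that maximizes the
;; expected utility of the outcome states. Hypothetical accessors:
;;   transitions: state x action -> list of (probability . next-state)
;;   utility:     state -> U(state), e.g. as computed by value iteration
(defun greedy-action (state actions transitions utility)
  (let ((best nil)
        (best-value most-negative-single-float))
    (dolist (a actions best)
      (let ((q (loop for (p . s2) in (funcall transitions state a)
                     sum (* p (funcall utility s2)))))
        (when (> q best-value)
          (setf best-value q
                best a))))))
```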
Any MDP can be converted into an environment using the mdp->environment function.
This function needs a list of one agent to run in the environment. By default, it uses an agent constructed by new-simple-mdp-solving-agent. Such an agent computes a policy for the MDP (e.g., by the value-iteration-policy algorithm; see mdps/algorithms/dp.lisp) and then executes that policy.

Question 1 (5 pts). Use the agent-trial function to measure the average score of the simple-mdp-solving-agent in *4x3-mdp* over 1000 trials. Your result should be close to the true utility of the initial state (1 1), as shown on AIMA2e p. 619.
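In outline, Question 1 is a single call. The exact argument conventions of agent-trial (and of mdp->environment) are defined in the AIMA code, so check them there; the call below is only a hedged sketch, not the verified signature:

```lisp
;; Hypothetical sketch -- check the real signature of agent-trial in
;; the AIMA code before running. The intent: run the default
;; simple-mdp-solving-agent in the 4x3 environment for 1000 trials and
;; report its average score.
(agent-trial #'(lambda () (mdp->environment *4x3-mdp*)) 1000)
```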
Question 2 (5 pts). Now let's consider an agent that makes decisions using an approximate utility function and a lookahead search. (While this is unnecessary for the 4x3 world, it is essential for Spider.) Because an MDP has actions with uncertain outcomes but just one agent, the search we need is an expectimax search that alternates between choosing maximum-utility actions and calculating expected outcome values.
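The alternation just described can be sketched as a depth-limited recursion: MAX over actions at each decision, an expectation over outcome states at each chance node, and the approximate utility function applied at the depth cutoff. This is a generic sketch, not the AIMA implementation; actions-fn, outcomes-fn, and eval-fn are hypothetical accessors:

```lisp
;; Depth-limited expectimax over an MDP (generic sketch).
;;   actions-fn:  state -> list of legal actions
;;   outcomes-fn: state x action -> list of (probability . next-state)
;;   eval-fn:     state -> approximate utility, used at the cutoff
(defun expectimax-value (state depth actions-fn outcomes-fn eval-fn)
  (let ((acts (funcall actions-fn state)))
    (if (or (zerop depth) (null acts))
        (funcall eval-fn state)             ; cutoff or terminal state
        (loop for a in acts                 ; MAX over actions ...
              maximize
              ;; ... of the expected value over outcome states.
              (loop for (p . s2) in (funcall outcomes-fn state a)
                    sum (* p (expectimax-value s2 (1- depth)
                                               actions-fn outcomes-fn
                                               eval-fn)))))))
```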
The expectimax algorithm and an agent that uses it are provided with the assignment code. Using the approximate utility function in 4x3-eval.lisp, evaluate depth-1 and depth-2 expectimax agents over 1000 trials on *4x3-mdp*.

Question 3 (15 pts). The expectimax algorithm (and indeed any algorithm using Bellman backups) computes expected values by summing over all possible outcome states. This will not be possible in Spider, where the number of outcomes of a single action can exceed 32 quintillion. Instead of summing over all outcomes, we will have to sum over a small sample. First, write the following methods for enumerated MDPs (one line each): num-results, which returns the number of possible outcome states of an action, and random-result, which returns one outcome drawn at random. (These are the same generic functions that spider-mdp.lisp defines for the Spider MDP.)
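With those methods in hand, a chance node's value can be estimated by averaging over a few calls to random-result instead of enumerating every outcome. The sketch below assumes a call order of (random-result action state mdp) and a value-fn utility estimate; both are guesses to be checked against the actual code:

```lisp
;; Sampling estimate of E[U(result(s,a))]: average value-fn over n
;; randomly sampled outcomes rather than summing over all of them.
;; The argument order of random-result is an assumption.
(defun sampled-expected-value (mdp state action n value-fn)
  (/ (loop repeat n
           sum (funcall value-fn (random-result action state mdp)))
     n))
```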
We can make an easy instance of Spider as follows:

>> (setq ps-easy (make-spider-problem :num-packs 1 :num-suits 1 :num-stacks 10 :num-hidden-rows 2))

>> (setq s0-easy (problem-initial-state ps-easy))
0 ??? ??? ??? AH
1 ??? ??? ??? KH
2 ??? ??? 2H
3 ??? ??? 5H
4 ??? ??? AH
5 ??? ??? JH
6 ??? ??? 6H
7 ??? ??? 9H
8 ??? ??? 2H
9 ??? ??? 4H
Reserve: ....................
Completed:

Notice that the stacks are numbered from 0 to 9 and that the "top-to-bottom" orientation of stacks in the Windows implementation is replaced here by a "left-to-right" ordering. Notice also that this is the state of the Spider poproblem, not the percept. The percept looks the same to the naked eye:

>> (setq s0-percept (get-percept ps-easy s0-easy))
0 ??? ??? ??? AH
1 ??? ??? ??? KH
2 ??? ??? 2H
3 ??? ??? 5H
4 ??? ??? AH
5 ??? ??? JH
6 ??? ??? 6H
7 ??? ??? 9H
8 ??? ??? 2H
9 ??? ??? 4H
Reserve: ....................
Completed:

but in the percept the hidden cards really are hidden:

>> (card-number (second (aref (spider-state-stacks s0-easy) 9)))
10
>> (card-number (second (aref (spider-state-stacks s0-percept) 9)))
NIL
Spider moves specify the number of cards to be moved, the origin stack, and the destination stack:
>> (pprint (actions ps-easy s0-easy))
(NEW-ROW
 #S(SPIDER-MOVE :K 1 :FROM 4 :TO 8)
 #S(SPIDER-MOVE :K 1 :FROM 0 :TO 8)
 #S(SPIDER-MOVE :K 1 :FROM 3 :TO 6)
 #S(SPIDER-MOVE :K 1 :FROM 9 :TO 3)
 #S(SPIDER-MOVE :K 1 :FROM 4 :TO 2)
 #S(SPIDER-MOVE :K 1 :FROM 0 :TO 2))

The new-row action deals out a new row of cards from the reserve. The outcome of a Spider action is defined by the result method:

>> (result ps-easy #S(SPIDER-MOVE :K 1 :FROM 9 :TO 3) s0-easy)
0 ??? ??? ??? AH
1 ??? ??? ??? KH
2 ??? ??? 2H
3 ??? ??? 5H 4H
4 ??? ??? AH
5 ??? ??? JH
6 ??? ??? 6H
7 ??? ??? 9H
8 ??? ??? 2H
9 ??? 10H
Reserve: ....................
Completed:

See spider.lisp for a complete explanation of exactly which moves are allowed. We have eliminated some redundant moves to make the game a little easier for the computer to play. You should also look at the goal-test, step-cost, and get-percept methods.
A Spider poproblem instance can be converted into an environment as follows:
>> (setq es-easy (poproblem->environment ps-easy :agents (list (new-random-spider-agent :problem ps-easy))))

You can now type (run-environment es-easy) and watch the random agent playing. It usually wins, which shows that this is an easy class of Spider problems. Look in spider-agents.lisp to see the definition of new-random-spider-agent. This file also contains an expectimax agent specifically for Spider.
The Spider MDP is defined in spider-mdp.lisp. Like any MDP, it has a results method, but that method should be avoided (especially for the new-row action, whose set of outcomes is astronomically large). Instead, the file defines num-results and random-result methods for use with sampling expectimax. We can make a Spider MDP as follows:
>> (setq smdp (make-spider-mdp :problem ps-easy :initial-state s0-easy))
Question 4 (5 pts). Evaluate the random spider agent over 1000 instances of the "easy" Spider game. (Be sure that you generate a new instance each time!)
Question 5 (10 pts). Write a function new-random+history-spider-agent that, like new-random-spider-agent, returns an agent that selects moves randomly; however, this agent should keep a history of all visited states in a hash table (don't forget to use state-hash-key!) and should never execute a move that leads to a state it has already visited. [Hint: you need only check the outcome of moves that have exactly one outcome.] Does this make your agent better or worse on the easy Spider instances?
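The bookkeeping in Question 5 is the standard visited-set idiom: record the hash key of every state you reach, and test candidate successor states against the table. A generic sketch follows; state-hash-key comes from the provided code, and everything else here is illustrative:

```lisp
;; Visited-state bookkeeping keyed by state-hash-key (illustrative).
;; The closure over `visited` gives both functions access to one table.
(let ((visited (make-hash-table :test #'equal)))
  (defun note-visited (state)
    (setf (gethash (state-hash-key state) visited) t))
  (defun visited-p (state)
    (gethash (state-hash-key state) visited)))
```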
Question 6 (5 pts). The file spider-eval.lisp contains an approximate utility function for Spider. Explain why the function includes the spider-suits-completed feature. [Hint: this is not a completely trivial question.]
Question 7 (5 pts). Evaluate a depth-1 sampling expectimax agent that uses 5 samples on both the "easy" instances and on instances with 2 packs, 1 suit, and 4 hidden rows. (Do as many trials as you can in a reasonable time, but no more than 1000 in any case.) If cycles are available, evaluate a depth-2 agent as well.