- Q1.1 What does "...a proper Bellman equation..." mean?
What I meant: write out a dynamic programming formulation that captures the way the game is played
and that contains only V(b) (no value function that depends on the entire state).
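For concreteness, one way such an equation could look (a sketch, with assumed notation: b the board, p the randomly drawn next piece, A(b, p) the legal placements of p on b, r the one-step reward, f the board-update function, and gamma the discount factor):

    V(b) = \mathbb{E}_p\Big[ \max_{a \in A(b,p)} \big( r(b, p, a) + \gamma \, V(f(b, p, a)) \big) \Big]

The max sits inside the expectation because the piece is revealed before the placement is chosen; that is the "way the game is played" part.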
- How many blocks were placed by some of the instructor's implementations for the first 2 questions?
Below are means and standard errors of the number of blocks placed over 20 runs, with a discount factor of 0.9. The reward used for learning was the distance from the highest filled square to the top of the board (see the sketch after the list below).
ALP (using every 10th board): 251 +/- 34.25
ALP (using every 5th board): 440 +/- 64
VI (using every 10th board): 30.20 +/- 0.51
VI (using every 5th board): 54.85 +/- 4.60
Imitation learning (various amounts of regularization, every 10th board): 357 +/- 57, 267 +/- 29, 292 +/- 50
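As a minimal sketch of that reward, assuming boards are stored as lists of rows (top row first) with truthy entries marking filled squares (the representation in your own code may differ):

    def reward(board):
        """Distance from the highest filled square to the top of the board."""
        for i, row in enumerate(board):
            if any(row):
                return i           # number of empty rows above the highest filled square
        return len(board)          # convention here: an empty board scores the full height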
- Q1 and Q2: Do I use the same tetris_game_log.mat for questions 1 and 2?
I suggest you ignore the tetris_game_log.mat you were given entirely.
For question 1, I suggest coding up a simple heuristic policy to collect samples from the state space,
and then using those samples as your state samples for ALP (see the sketch below). For question 2,
I suggest playing the game yourself and having the agent learn to play from your own demonstrations.
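Here is a minimal sketch of that sampling loop for question 1. Everything game-specific (draw_piece, legal_placements, apply_placement) is a hypothetical stand-in for your own game code, and the height-minimizing heuristic is just one simple choice:

    def stack_height(board):
        """Stack height, for a board stored as a list of rows, top row first."""
        for i, row in enumerate(board):
            if any(row):
                return len(board) - i
        return 0

    def heuristic_policy(board, piece, legal_placements, apply_placement):
        """Greedy heuristic: choose the placement that keeps the stack lowest."""
        return min(legal_placements(board, piece),
                   key=lambda a: stack_height(apply_placement(board, piece, a)))

    def collect_samples(board, draw_piece, legal_placements, apply_placement,
                        n_steps, keep_every=10):
        """Play the heuristic and keep every keep_every-th board as an ALP state sample."""
        samples = []
        for t in range(n_steps):
            piece = draw_piece()
            action = heuristic_policy(board, piece, legal_placements, apply_placement)
            board = apply_placement(board, piece, action)
            if t % keep_every == 0:
                samples.append(board)
        return samples

Keeping only every 10th (or 5th) board, as in the numbers above, keeps the ALP sample set small while still covering the reachable part of the state space.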