- Q1.1 What does "...a proper Bellman equation..." mean?
What I meant: write out a dynamic programming formulation that captures the way the game is played
and that contains only V(b) (no value function that depends on the entire state).
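For concreteness, one way such an equation could look (a sketch, with assumed notation: b the board, p the randomly drawn next piece, A(b, p) the legal placements of p on b, r the one-step reward, f the board-update function, and gamma the discount factor):

    V(b) = \mathbb{E}_p\Big[ \max_{a \in A(b,p)} \big( r(b, p, a) + \gamma \, V(f(b, p, a)) \big) \Big]

The max sits inside the expectation because the piece is revealed before the placement is chosen; that is the "way the game is played" part.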
- How many blocks were placed by some of the instructor's implementations for the first 2 questions?
Below are means and standard errors of the number of blocks placed over 20 runs, with a discount factor of 0.9. The reward used for learning was the distance from the highest filled square to the top of the board (see the sketch after the list below).
ALP (using every 10th board): 251 +/- 34.25
ALP (using every 5th board): 440 +/- 64
VI (using every 10th board): 30.20 +/- 0.51
VI (using every 5th board): 54.85 +/- 4.60
Imitation learning (various amounts of regularization, every 10th board): 357 +/- 57, 267 +/- 29, 292 +/- 50
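As a minimal sketch of that reward, assuming boards are stored as lists of rows (top row first) with truthy entries marking filled squares (the representation in your own code may differ):

    def reward(board):
        """Distance from the highest filled square to the top of the board."""
        for i, row in enumerate(board):
            if any(row):
                return i           # number of empty rows above the highest filled square
        return len(board)          # convention here: an empty board scores the full height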
- Q1 and Q2: Do I use the same tetris_game_log.mat for questions 1 and 2?
I suggest you ignore the tetris_game_log.mat you were given entirely.
For question 1, I suggest coding up a simple heuristic policy to collect samples from the state space,
and then using those samples as your state samples for ALP (see the sketch below). For question 2,
I suggest playing the game yourself and having the agent learn to play from your own demonstrations.
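Here is a minimal sketch of that sampling loop for question 1. Everything game-specific (draw_piece, legal_placements, apply_placement) is a hypothetical stand-in for your own game code, and the height-minimizing heuristic is just one simple choice:

    def stack_height(board):
        """Stack height, for a board stored as a list of rows, top row first."""
        for i, row in enumerate(board):
            if any(row):
                return len(board) - i
        return 0

    def heuristic_policy(board, piece, legal_placements, apply_placement):
        """Greedy heuristic: choose the placement that keeps the stack lowest."""
        return min(legal_placements(board, piece),
                   key=lambda a: stack_height(apply_placement(board, piece, a)))

    def collect_samples(board, draw_piece, legal_placements, apply_placement,
                        n_steps, keep_every=10):
        """Play the heuristic and keep every keep_every-th board as an ALP state sample."""
        samples = []
        for t in range(n_steps):
            piece = draw_piece()
            action = heuristic_policy(board, piece, legal_placements, apply_placement)
            board = apply_placement(board, piece, action)
            if t % keep_every == 0:
                samples.append(board)
        return samples

Keeping only every 10th (or 5th) board, as in the numbers above, keeps the ALP sample set small while still covering the reachable part of the state space.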