**mdp-agent.lisp** A simple policy agent for Markov decision processes (MDPs).

**4x3-mdp.lisp** Stochastic active 4x3 world for chapters 17, 20.

**mdp.lisp** Definitions for Markov Decision Problems and Reinforcement Learning.

**dp.lisp** Basic dynamic programming routines for MDPs (Markov decision processes).

**stats.lisp** Code for performance assessment of DP and RL algorithms.

**make-mdp-agent** *function* (&key body name mdp program algorithm)

An MDP agent constructs a policy from the MDP once, and then uses that policy repeatedly to choose its actions. The ALGORITHM keyword specifies the algorithm used to create the policy; don't confuse it with the PROGRAM keyword, which determines what action the agent takes at each step.
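A hedged usage sketch: the keyword names come from the signature above, but the particular value passed to :ALGORITHM is an assumption; any function that maps an MDP to a policy should fit.

```lisp
;; Build an agent for the 4x3 world whose policy is computed once by
;; value iteration. Passing VALUE-ITERATION-POLICY here is an assumption
;; about what :ALGORITHM accepts.
(make-mdp-agent :name "VI-agent"
                :mdp *4x3-mdp*
                :algorithm 'value-iteration-policy)
```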

**mdp-agent** *type* (total-reward)

**mdp** *type* (initial-state model rewards terminal-states hash-key name)

**mdp-action-model** *type* (transitions times-executed)

**transition** *type* (destination probability times-achieved)

**action-model** *function* (a s m)

**transitions** *function* (a s m)

Returns the transitions resulting from executing action A in state S according to model M.

**actions** *function* (s m)

Returns the list of actions feasible in state S according to model M.
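A hedged query sketch, assuming MDP-MODEL and MDP-INITIAL-STATE are the usual defstruct accessors for the mdp type's model and initial-state slots:

```lisp
;; Inspect the 4x3 world's model: the feasible actions in the initial
;; state, then the transitions for the first of those actions.
(let* ((m (mdp-model *4x3-mdp*))
       (s (mdp-initial-state *4x3-mdp*))
       (as (actions s m)))
  (list as (transitions (first as) s m)))
```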

***4x3-mdp*** *variable*

***4x3-m-data*** *variable*

***4x3-r-data*** *variable*

*To make an MDP into an environment, we basically just keep track of the current state, and then ask the MDP model to determine the new state. This makes sense for the case of a single agent in the environment.*

**mdp-environment** *type* (mdp epochs-left)

An MDP-environment is driven by an MDP (Markov Decision Process), which (probabilistically) says what state to transition to for each action.

**mdp-percept** *type* (state reward terminalp)

A percept gives the current state, the reward received, and whether it is a terminal state.
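Putting the pieces together, a hedged driver sketch; RUN-ENVIRONMENT, the :AGENTS keyword, and MAKE-MDP-ENVIRONMENT's exact argument conventions are assumptions about the surrounding AIMA agents framework, not facts from this file.

```lisp
;; Wrap the 4x3 MDP in an environment and run a policy agent in it.
;; Each step the agent receives an MDP-PERCEPT, and the environment
;; restarts from the initial state whenever a terminal state is hit.
(run-environment
 (make-mdp-environment :mdp *4x3-mdp*
                       :epochs-left 10
                       :agents (list (make-mdp-agent :mdp *4x3-mdp*))))
```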

**initialize** *method* ((env mdp-environment))

**get-percept** *method* ((env mdp-environment) agent)

The percept is the current state, the reward, and whether this is terminal.

**update-fn** *method* ((env mdp-environment))

We update by transitioning to a new state. When we hit a terminal state, we restart in the initial state (until there are no more epochs left).

**performance-measure** *method* ((env mdp-environment) agent)

Return a number saying how well this agent is doing.

**termination?** *method* ((env mdp-environment))

**mdp-next-state** *function* (action state mdp)

**mdp-transitions** *function* (action state-model)

**random-transition** *function* (transitions)
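A minimal sketch of the weighted sampling this function performs, assuming each transition is reduced to a (destination . probability) cons rather than the transition structure above:

```lisp
;; Hypothetical SAMPLE-TRANSITION: pick a destination at random,
;; weighted by transition probability.
(defun sample-transition (transitions)
  (let ((roll (random 1.0)))
    (dolist (tr transitions (car (first (last transitions))))
      (decf roll (cdr tr))
      (when (<= roll 0.0)
        (return (car tr))))))

;; (sample-transition '((a . 0.8) (b . 0.2)))  ; => A about 80% of the time
```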

**value-iteration-policy** *function* (mdp)

*Given an environment model M, value iteration determines the utilities of states, U. The basic equation is U(i) <- r(i) + max_a sum_j M(a,i,j) U(j), where U(j) must be the old value, not the new one.*

**value-iteration** *function* (mdp &optional uold &key epsilon)
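To make the update concrete, here is a minimal, self-contained sketch over a toy representation, not the dp.lisp internals: the model M is assumed to be a hash table mapping each state to an alist of (action . transitions) entries, each transition a (destination . probability) cons, and R a hash table of rewards. VI-Q and VI-VALUE-ITERATION are hypothetical names.

```lisp
(defun vi-q (action state u m)
  "Expected utility of the outcomes of ACTION in STATE under utilities U."
  (loop for (dest . prob) in (cdr (assoc action (gethash state m)))
        sum (* prob (gethash dest u 0.0))))

(defun vi-value-iteration (m r &key (epsilon 0.001))
  "Iterate U(i) <- R(i) + max_a sum_j M(a,i,j) U(j), reading only the old
utilities in each sweep, until the largest change falls below EPSILON."
  (let ((u (make-hash-table :test #'equal)))
    (loop
      (let ((unew (make-hash-table :test #'equal))
            (delta 0.0))
        (loop for state being the hash-keys of m
              do (let* ((acts (gethash state m))
                        (best (if acts
                                  (loop for (a . nil) in acts
                                        maximize (vi-q a state u m))
                                  0.0)) ; terminal states: reward only
                        (val (+ (gethash state r 0.0) best)))
                   (setf (gethash state unew) val)
                   (setf delta (max delta (abs (- val (gethash state u 0.0)))))))
        (setf u unew)
        (when (< delta epsilon)
          (return u))))))
```

For example, a two-state chain in which A reaches the terminal state B with certainty, with R(A) = -0.04 and R(B) = 1.0, converges to U(B) = 1.0 and U(A) = 0.96.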

*A state is a sink if there are no actions that can lead to another state. Sinks can arise by accident during reinforcement learning of an environment model. Because they cause infinite loops, they must be detected.*

**sink?** *function* (s m)
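A hedged sketch of the test over the same toy hash-table model used in the value-iteration sketch above:

```lisp
;; Hypothetical SINK-P: a state is a sink when every transition of every
;; action is a self-loop. (A state with no actions at all also passes,
;; but such states are terminal rather than sinks.)
(defun sink-p (state m)
  (loop for (nil . transitions) in (gethash state m)
        always (loop for (dest . nil) in transitions
                     always (equal dest state))))
```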

*Given an initial policy P and initial utilities U, calculate the optimal policy. Do this by value determination alternating with policy update.*

**policy-iteration** *function* (mdp &optional u)
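As a concrete illustration of the alternation, a hedged sketch reusing the hypothetical VI-Q helper from the value-iteration sketch above and the PI-VALUE-DETERMINATION helper sketched under value-determination below; none of these names come from dp.lisp.

```lisp
(defun pi-policy-iteration (m r)
  "Alternate value determination with a greedy policy update until the
policy stops changing; return the policy and its utilities."
  (let ((u (make-hash-table :test #'equal))
        (p (make-hash-table :test #'equal)))
    ;; Start from an arbitrary policy: the first feasible action anywhere.
    (loop for state being the hash-keys of m
          do (setf (gethash state p) (car (first (gethash state m)))))
    (loop
      (setf u (pi-value-determination p u m r))
      (let ((changed nil))
        (loop for state being the hash-keys of m
              do (let* ((best (gethash state p))
                        (bestq (vi-q best state u m)))
                   ;; Switch only on a strict improvement, so ties do not
                   ;; make the loop oscillate forever.
                   (loop for (a . nil) in (gethash state m)
                         do (let ((q (vi-q a state u m)))
                              (when (> q bestq)
                                (setf bestq q best a))))
                   (unless (eql best (gethash state p))
                     (setf (gethash state p) best
                           changed t))))
        (unless changed
          (return (values p u)))))))
```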

*Given a fixed policy and a model, calculate the value of each state. This version does it by an iterative process similar to value iteration. The basic equation is U(i) <- r(i) + sum_j M(P(i),i,j) U(j), where U(j) must be the old value, not the new one. A better alternative is to set up the value equations and solve them using matrix methods.*

**value-determination** *function* (p uold m r &key epsilon)
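A hedged sketch of the iterative version over the same toy representation, reusing VI-Q; each sweep writes into a fresh table so that U(j) is always read from the old values, as the comment requires.

```lisp
;; Hypothetical PI-VALUE-DETERMINATION, not the dp.lisp function.
;; P maps states to actions; UOLD, M, and R are hash tables.
(defun pi-value-determination (p uold m r &key (epsilon 0.001))
  "Iterate U(i) <- R(i) + sum_j M(P(i),i,j) U(j) for the fixed policy P."
  (loop
    (let ((unew (make-hash-table :test #'equal))
          (delta 0.0))
      (loop for state being the hash-keys of m
            do (let ((val (+ (gethash state r 0.0)
                             ;; Terminal states have no policy action, so
                             ;; VI-Q sums over no transitions and yields 0.
                             (vi-q (gethash state p) state uold m))))
                 (setf (gethash state unew) val)
                 (setf delta (max delta (abs (- val (gethash state uold 0.0)))))))
      (setf uold unew)
      (when (< delta epsilon)
        (return uold)))))
```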

*Compute the optimal policy given U and M.*

**optimal-policy** *function* (u m r)
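A hedged sketch of the extraction step, again reusing VI-Q and the toy representation. Since R(s) is the same for every action taken in s, it does not affect the argmax, so the sketch omits it.

```lisp
;; Hypothetical GREEDY-POLICY, not the dp.lisp optimal-policy.
(defun greedy-policy (u m)
  "Map each state to an action maximizing expected utility under U."
  (let ((p (make-hash-table :test #'equal)))
    (loop for state being the hash-keys of m
          do (let ((best nil)
                   (bestq nil))
               (loop for (a . nil) in (gethash state m)
                     do (let ((q (vi-q a state u m)))
                          (when (or (null bestq) (> q bestq))
                            (setf bestq q best a))))
               (setf (gethash state p) best)))
    p))
```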

*The following functions select actions in particular states.*

**policy-choice** *function* (state p)

*Pick a random action.*

**random-choice** *function* (state u m r)

*Pick the currently best action, with tie-breaking.*

**max-choice** *function* (state u m r)

*Simply pick a currently best action deterministically.*

**dmax-choice** *function* (state u m r)

*Q(a,s) is the value of doing action a in state s, calculated by averaging over the utilities of the possible outcomes. Used in several update equations.*

**q-value** *function* (action state u m r)
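A worked example with hypothetical numbers: if action A in state S reaches J1 with probability 0.8 and J2 with probability 0.2, and U(J1) = 1.0 while U(J2) = 0.0, averaging over the outcomes gives:

```lisp
;; Q(A,S) = 0.8 * U(J1) + 0.2 * U(J2)
(+ (* 0.8 1.0) (* 0.2 0.0)) ; => 0.8
```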

***policy-fn*** *variable*

*The policy used by the agent in acting.*

***correct-u*** *variable*

***correct-m*** *variable*

***correct-r*** *variable*

**u-rms-error** *function* (u1 u2)
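A hedged sketch of the comparison, assuming both utility functions are hash tables keyed by state; RMS-ERROR is a hypothetical name:

```lisp
;; Root-mean-square difference between two utility tables, taken over
;; the states of U1 (missing entries in U2 default to 0.0).
(defun rms-error (u1 u2)
  (loop for state being the hash-keys of u1
        sum (expt (- (gethash state u1 0.0) (gethash state u2 0.0)) 2)
          into total
        count t into n
        finally (return (sqrt (/ total (max n 1))))))
```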

*The policy loss of a utility function U for an MDP is defined as the difference in utility between the corresponding policy and the optimal policy, for the agent's current state. Calculate it using value determination with respect to the current policy.*

**loss** *function* (mdp u)
