# Uncertainty (Subsystem of AIMA Code)

We provide code for working with Markov Decision Processes (MDPs), but the rest of the code remains to be written.

uncertainty/: uncertainty/agents/:
• mdp-agent.lisp A simple policy agent for Markov decision processes (MDPs).
uncertainty/domains/:
uncertainty/environments/:
• mdp.lisp Definitions for Markov Decision Problems and Reinforcement Learning
uncertainty/algorithms/:
• dp.lisp Basic dynamic programming routines for MDPs (Markov decision processes)
• stats.lisp Code for performance assessment of DP and RL algorithms.

## A simple policy agent for Markov decision processes (MDPs).

make-mdp-agent function (&key body name mdp program algorithm)

An MDP agent constructs a policy from the MDP once, and then uses that policy repeatedly to take action. The ALGORITHM keyword specifies the algorithm that is used to create the policy; don't confuse it with the PROGRAM keyword, which decides what actions to take.

mdp-agent type (total-reward)

## File uncertainty/domains/mdp.lisp

Definitions for Markov decision processes (MDPs). An MDP is defined by initial state, transition model, rewards, and distinguished terminal states. Model and rewards are hash tables index by state (after application of hash-key function). The entries in the model are alists keyed by action; each action is associated with an action model: basically a list of transitions. Markov chains (i.e., stochastic processes with no distinct agent) can be defined by allowing only a no-op action in the MDP.

mdp type (initial-state model rewards terminal-states hash-key name)

mdp-action-model type (transitions times-executed)

transition type (destination probability times-achieved)

action-model function (a s m)

transitions function (a s m)

Returns the transitions resulting from executing action a in state s according to model M.

actions function (s m)

Returns the list of actions feasible in state s according to model M.

## Stochastic active 4x3 world for chapters 17, 20.

Each action achieves the intended effect with probability 0.8, but the rest of the time, the action moves the agent at right angles to the intended direction. For example, from the start square (1,1), the action North moves the agent to (1,2) with probability 0.8, but with probability 0.1, it moves East to (2,1), and with probability 0.1, it moves West, bumps into the wall, and stays in (1,1).

*4x3-mdp* variable

*4x3-m-data* variable

*4x3-r-data* variable

## Definitions for Markov Decision Problems and Reinforcement Learning

mdp-environment type ( to make an mdp into an environment, we basically just keep track of the current state, and then ask the mdp model to determine the new state. this makes sense for the case of a single agent in the environment. mdp epochs-left)

An MDP-environment is driven by an MDP (Markov Decision Process), which (probabilistically) says what state to transition to for each action.

mdp-percept type (state reward terminalp)

A percept gives the current state, the reward received, and whether it is a terminal state.

## Generic Functions for MDP-Environments

initialize method ((env mdp-environment))

get-percept method ((env mdp-environment) agent)

The percept is the current state, the reward, and whether this is terminal.

update-fn method ((env mdp-environment))

We update by transitioning to a new state. When we hit a terminal state, we restart in the initial state (until there are no more epochs left).

performance-measure method ((env mdp-environment) agent)

Return a number saying how well this agent is doing.

termination? method ((env mdp-environment))

## Utility Functions

mdp-next-state function (action state mdp)

mdp-transitions function (action state-model)

random-transition function (transitions)

## Basic dynamic programming routines for MDPs (Markov decision processes)

Value iteration, value determination, and policy iteration. MDP agents pass in an mdp and expect a policy in return.

value-iteration-policy function (mdp)

Given an environment model M, value iteration determine the values of states U. Basic equation is U(i) <- r(i) + max_a sum_j M(a,i,j)U(j) where U(j) MUST be the old value not the new.

value-iteration function (mdp &optional uold &key epsilon)

A state is a sink if there are no actions that can lead to another state. Sinks can arise by accident during reinforcement learning of an environment model. Because they cause infinite loops, they must be detected.

sink? function (s m)

Given an initial policy P and initial utilities U, calculate the optimal policy. Do this by value determination alternating with policy update.

policy-iteration function (mdp &optional u)

Given a fixed policy and a model, calculate the value of each state. This version does it by an iterative process similar to value iteration. Basic equation is U(i) <- r(i) + sum_j M(P(i),i,j)U(j) where U(j) MUST be the old value not the new. A better alternative is to set up the value equations and solve them using matrix methods.

value-determination function (p uold m r &key epsilon)

Compute optimal policy given U and M

optimal-policy function (u m r)

The following functions select actions in particular states Pick a random action

policy-choice function (state p)

random-choice function (state u m r)

Pick the currently best action with tie-breaking

max-choice function (state u m r)

Simply pick a currently best action deterministically

dmax-choice function (state u m r)

Q(a,s) is the value of doing a in s, calculated by averaging over the utilities of the possible outcomes. Used in several update equations.

q-value function (action state u m r)

## Code for performance assessment of DP and RL algorithms.

Makes extensive use of global variables to minimize interference with the algorithms themselves.

*policy-fn* variable

the policy used by the agent in acting

*correct-u* variable

*correct-m* variable

*correct-r* variable

## assume U1, U2 have the same states

u-rms-error function (u1 u2)

The policy loss of a utility function U for an mdp is defined as the difference in utility between the corresponding policy and the optimal policy, for the agent's current state. Calculate using value determination wrt the current policy

loss function (mdp u)

 AIMA Home Authors Lisp Code AI Programming Instructors Pages