\documentstyle[fleqn,epsf,aima-slides]{article}
\begin{document}
\begin{huge}
\titleslide{Complex decisions}{Chapter 17, Sections 1--3}
\sf
%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\heading{Outline}
\blob Decision problems
\blob Value iteration
\blob Policy iteration
%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\heading{Sequential decision problems}
\vspace*{0.3in}
\epsfxsize=1.0\textwidth
\fig{\file{figures}{decision-problems.ps}}
%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\heading{Example MDP}
\vspace*{0.3in}
\centerline{%
\epsfxsize=0.4\textwidth
\epsffile{\file{figures}{sequential-decision-world.ps}}\hspace*{1in}%
\epsfxsize=0.2\textwidth
\epsffile{\file{figures}{sequential-decision-model.ps}}}
Model $M_{ij}^a \equiv P(j|i,a)$ = probability that doing $a$ in $i$ leads to $j$
Each state has a {\em reward} $R(i)$\nl
= $-0.04$ (small penalty) for nonterminal states\nl
= $\pm 1$ for terminal states
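The transition model and rewards can be made concrete in code. Below is a minimal Python sketch (illustrative, not the book's code) of the 4$\times$3 grid world from the figure: the intended move succeeds with probability 0.8, and the agent slips to each perpendicular direction with probability 0.1; bumping into the wall at (2,2) or the boundary leaves the state unchanged.

```python
# 4x3 grid world: states are (column, row) pairs, column 1..4, row 1..3.
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}

WALL = {(2, 2)}                          # obstacle square
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}  # terminal rewards +1 / -1
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4)
          if (x, y) not in WALL]

def reward(s):
    """R(s): -0.04 for nonterminal states, +-1 for terminals."""
    return TERMINALS.get(s, -0.04)

def move(s, a):
    """Deterministic effect of action a; blocked moves stay put."""
    n = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
    return n if n in STATES else s

def transition(s, a):
    """M_ij^a as a dict {j: P(j|i,a)}: 0.8 intended, 0.1 each slip."""
    if s in TERMINALS:        # terminal states end the episode
        return {}
    p = {}
    for prob, act in [(0.8, a), (0.1, PERP[a][0]), (0.1, PERP[a][1])]:
        j = move(s, act)
        p[j] = p.get(j, 0.0) + prob
    return p
```

For example, `transition((1, 1), "up")` yields 0.8 for (1,2), 0.1 for (2,1), and 0.1 for staying at (1,1), since the leftward slip is blocked by the boundary.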
%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\heading{Solving MDPs}
In search problems, aim is to find an optimal {\em sequence}
In MDPs, aim is to find an optimal {\em policy}\nl
i.e., best action for every possible state\nl
(because can't predict where one will end up)
Optimal policy and state values for the given $R(i)$:
\vspace*{0.2in}
\twofig{\file{figures}{sequential-decision-policy.ps}}%
{\file{figures}{sequential-decision-values.ps}}
%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\heading{Utility}
In {\em sequential} decision problems, preferences are expressed\\
between {\em sequences} of states
Usually use an {\em additive} utility function:\nl
$U([s_1,s_2,s_3,\ldots,s_n]) = R(s_1) + R(s_2) + R(s_3) + \cdots + R(s_n)$\\
(cf. path cost in search problems)
Utility of a {\em state} (a.k.a. its {\em value}) is defined to be\nl
$U(s_i) = {}$ \u{expected sum of rewards until termination}\nl
\phantom{$U(s_i) = {}$} \u{assuming optimal actions}
Given the utilities of the states, choosing the
best action is just MEU: choose the action such that the
expected utility of the immediate successors is highest.
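The MEU rule amounts to a one-line argmax. A Python sketch with a hypothetical two-action example (the transition probabilities and utilities below are made up for illustration):

```python
def best_action(state, actions, P, U):
    """MEU: argmax_a of sum_j P[state][a][j] * U[j]."""
    def expected_utility(a):
        return sum(p * U[j] for j, p in P[state][a].items())
    return max(actions, key=expected_utility)

# Toy example: from state "s", action "a2" reaches the good state "g"
# more often, so it has the higher expected utility.
P = {"s": {"a1": {"g": 0.5, "b": 0.5}, "a2": {"g": 0.9, "b": 0.1}}}
U = {"g": 1.0, "b": -1.0}
best_action("s", ["a1", "a2"], P, U)  # -> "a2"
```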
%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\heading{Bellman equation}
Definition of utility of states leads to a simple relationship among
utilities of neighboring states:
\u{expected sum of rewards}\al
= \u{current reward}\nl
+ \u{expected sum of rewards after taking best action}
Bellman equation (1957):
\[ U(i) = R(i) + \max_a \mysum_j U(j) M_{ij}^a\]
$U(1,1) = -0.04$\\
\tab + $\max\{ 0.8 U(1,2) + 0.1 U(2,1) + 0.1 U(1,1),$\hfill{\em up}\\
\tab \phantom{+ \mbox{$\max\{$}}$0.9 U(1,1) + 0.1 U(1,2) $\hfill{\em left}\\
\tab \phantom{+ \mbox{$\max\{$}}$0.9 U(1,1) + 0.1 U(2,1) $\hfill{\em down}\\
\tab \phantom{+ \mbox{$\max\{$}}$0.8 U(2,1) + 0.1 U(1,2) + 0.1 U(1,1) \}$\hfill{\em right}
One equation per state = $n$ \u{nonlinear} equations in $n$ unknowns
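The expansion of $U(1,1)$ above can be checked numerically. A Python sketch, plugging in utility values close to the converged solution for this world (approximate values, for illustration only):

```python
# Bellman backup for state (1,1), mirroring the four branches on the slide.
# The neighbor utilities are approximately the converged values for the
# 4x3 world with R = -0.04; at the fixed point the backup should
# reproduce U(1,1) itself.
R = -0.04
U = {(1, 1): 0.705, (1, 2): 0.762, (2, 1): 0.655}

branches = {
    "up":    0.8 * U[(1, 2)] + 0.1 * U[(2, 1)] + 0.1 * U[(1, 1)],
    "left":  0.9 * U[(1, 1)] + 0.1 * U[(1, 2)],
    "down":  0.9 * U[(1, 1)] + 0.1 * U[(2, 1)],
    "right": 0.8 * U[(2, 1)] + 0.1 * U[(1, 2)] + 0.1 * U[(1, 1)],
}
u11 = R + max(branches.values())  # the best action turns out to be "up"
```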
%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\heading{Value iteration algorithm}
\u{Idea}: Start with arbitrary utility values\nl
Update to make them \u{locally consistent} with Bellman eqn.\nl
Everywhere locally consistent $\Rightarrow$ global optimality
repeat until ``no change''
\[ U(i) \leftarrow R(i) + \max_a \mysum_j U(j) M_{ij}^a \qquad \mbox{for all } i\]
\epsfxsize=0.6\textwidth
\fig{\file{graphs}{4x3-vi-curve.ps}}
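The update rule translates directly into code. A self-contained Python sketch of value iteration on the 4$\times$3 world (a reconstruction for illustration, not the book's code; it uses in-place updates and stops when the largest change falls below a threshold):

```python
# 4x3 world setup (wall at (2,2), terminals at (4,3)=+1 and (4,2)=-1).
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}
WALL = {(2, 2)}
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4)
          if (x, y) not in WALL]

def reward(s):
    return TERMINALS.get(s, -0.04)

def move(s, a):
    n = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
    return n if n in STATES else s

def transition(s, a):
    p = {}
    for prob, act in [(0.8, a), (0.1, PERP[a][0]), (0.1, PERP[a][1])]:
        j = move(s, act)
        p[j] = p.get(j, 0.0) + prob
    return p

def value_iteration(eps=1e-6):
    """Repeat U(i) <- R(i) + max_a sum_j P(j|i,a) U(j) until 'no change'."""
    U = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s in TERMINALS:
                new = reward(s)          # terminal utility = its reward
            else:
                new = reward(s) + max(
                    sum(p * U[j] for j, p in transition(s, a).items())
                    for a in ACTIONS)
            delta = max(delta, abs(new - U[s]))
            U[s] = new
        if delta < eps:
            return U
```

Running this reproduces the utilities shown in the earlier figure, e.g. $U(1,1) \approx 0.705$.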
%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\heading{Policy iteration (Howard, 1960)}
Idea: search for optimal policy and utility values simultaneously
Algorithm:\al
$\pi \leftarrow {}$ an arbitrary initial policy\al
repeat until no change in $\pi$\nl
compute utilities given $\pi$\nl
update $\pi$ as if utilities were correct (i.e., local MEU)
To compute utilities given a fixed $\pi$:
\[ U(i) = R(i) + \mysum_j U(j) M_{ij}^{\pi(i)}\qquad \mbox{for all } i \]
i.e., $n$ simultaneous \u{linear} equations in $n$ unknowns,
solve in $O(n^3)$
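The two alternating steps can be sketched in Python on a hypothetical 3-state chain (a toy MDP invented for illustration, not the book's grid): policy evaluation solves the linear system $(I - M^{\pi})U = b$ exactly by Gaussian elimination, and policy improvement takes the local MEU action.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting (O(n^3)) for small systems."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
        # back-substitution below recovers the solution
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

# Toy chain: states 0 and 1 nonterminal (R = -0.1), state 2 terminal (+1).
# Each action goes in the intended direction w.p. 0.9, opposite w.p. 0.1.
STATES = [0, 1]
TERM_U = {2: 1.0}
R = -0.1
ACTIONS = ["left", "right"]
P = {
    0: {"left": {0: 0.9, 1: 0.1}, "right": {1: 0.9, 0: 0.1}},
    1: {"left": {0: 0.9, 2: 0.1}, "right": {2: 0.9, 0: 0.1}},
}

def evaluate(pi):
    """Solve U(i) = R(i) + sum_j P(j|i,pi(i)) U(j) as a linear system."""
    n = len(STATES)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    b = [R] * n
    for i, s in enumerate(STATES):
        for j, p in P[s][pi[s]].items():
            if j in TERM_U:
                b[i] += p * TERM_U[j]   # known terminal utility -> RHS
            else:
                A[i][STATES.index(j)] -= p
    U = dict(zip(STATES, solve(A, b)))
    U.update(TERM_U)
    return U

def policy_iteration():
    pi = {s: "left" for s in STATES}    # arbitrary initial policy
    while True:
        U = evaluate(pi)                # compute utilities given pi
        new = {s: max(ACTIONS,          # update pi by local MEU
                      key=lambda a: sum(p * U[j] for j, p in P[s][a].items()))
               for s in STATES}
        if new == pi:
            return pi, U
        pi = new
```

Here the initial all-"left" policy is evaluated, found poor, and replaced by the all-"right" policy, which is then stable.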
%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\heading{What if I live forever? (digression)}
Using the additive definition of utility, the values $U(i)$ are infinite!\\
Moreover, value iteration fails to terminate\\
How should we compare two infinite lifetimes?
1) Discounting: future rewards are discounted at rate $\gamma < 1$
\[ U([s_0,\ldots s_{\infty}]) = \mysum_{t=0}^{\infty} \gamma^t R(s_t)\]
Maximum utility bounded above by $R_{{\rm max}}/(1-\gamma)$\\
Smaller $\gamma \Rightarrow {}$ shorter horizon
2) Maximize \u{system gain} = average reward per time step\\
Theorem: optimal policy has constant gain after initial transient\\
E.g., a taxi driver's daily routine of cruising for passengers
\end{huge}
\end{document}