## \(\gamma\)-Models: Generative Temporal Difference Learning for Infinite-Horizon Prediction


**Summary**
We train predictive models of environment dynamics with infinite probabilistic horizons using a generative adaptation of temporal difference learning.
The resulting \(\gamma\)-model is a hybrid between model-free and model-based mechanisms.
Like a value function, it contains information about the long-term future; like a standard predictive model, it is independent of reward.
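
To make the temporal difference idea concrete, here is a minimal tabular sketch. It assumes a small Markov chain with the policy already folded into a hypothetical transition matrix `P`, and it assumes the bootstrapped target is the mixture "with probability \(1-\gamma\) the next state, otherwise a sample from the model evaluated at the next state". The function names and the example chain are illustrative, not taken from the original work.

```python
def td_backup(P, mu, gamma):
    """One generative TD backup on a tabular model mu[s][e] ~ mu(e | s).

    Assumed target: (1 - gamma) * P(e | s) + gamma * sum_s' P(s' | s) * mu(e | s').
    """
    n = len(P)
    return [
        [
            (1 - gamma) * P[s][e]
            + gamma * sum(P[s][s2] * mu[s2][e] for s2 in range(n))
            for e in range(n)
        ]
        for s in range(n)
    ]


def discounted_occupancy(P, gamma, horizon=500):
    """Reference: (1 - gamma) * sum_{k>=1} gamma^(k-1) * P^k, truncated at `horizon`."""
    n = len(P)
    occupancy = [[0.0] * n for _ in range(n)]
    P_k = [row[:] for row in P]  # k-step transition matrix, starting at k = 1
    weight = 1 - gamma           # (1 - gamma) * gamma^(k-1)
    for _ in range(horizon):
        for s in range(n):
            for e in range(n):
                occupancy[s][e] += weight * P_k[s][e]
        P_k = [
            [sum(P_k[s][m] * P[m][e] for m in range(n)) for e in range(n)]
            for s in range(n)
        ]
        weight *= gamma
    return occupancy


# Hypothetical 3-state chain; each row is a distribution over next states.
P = [[0.9, 0.1, 0.0],
     [0.0, 0.8, 0.2],
     [0.3, 0.0, 0.7]]
gamma = 0.9

# Start from a uniform model and apply the backup repeatedly.
mu = [[1.0 / 3] * 3 for _ in range(3)]
for _ in range(500):
    mu = td_backup(P, mu, gamma)

reference = discounted_occupancy(P, gamma)
max_gap = max(abs(mu[s][e] - reference[s][e]) for s in range(3) for e in range(3))
print(f"largest gap to the reference occupancy: {max_gap:.2e}")
```

In this tabular setting the backup is a contraction with factor \(\gamma\), so the iterates approach the discounted distribution over future states without ever using a reward signal.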

**\(\gamma\)-Model rollouts**
Replacing single-step models with \(\gamma\)-models leads to generalizations of the procedures that form the foundation of model-based control.
In a generalized rollout, each model step advances time by a random amount, so the total time elapsed after \(n\) model steps follows a negative binomial distribution.
The first step is geometrically distributed, which is the special case \(\text{NegBinom}(1,p)=\text{Geom}(1-p)\).
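
The relationship between model steps and elapsed time can be checked numerically. The sketch below assumes each model step advances time by \(\Delta t \in \{1, 2, \dots\}\) with probability \((1-\gamma)\gamma^{\Delta t - 1}\), and that the negative binomial is parameterized over total elapsed time (so its support starts at the number of model steps). Both conventions, and all function names, are assumptions made for illustration.

```python
from math import comb


def geometric_pmf(k, gamma):
    """P(dt = k) for dt in {1, 2, ...} with success probability 1 - gamma."""
    return (1 - gamma) * gamma ** (k - 1) if k >= 1 else 0.0


def rollout_time_pmf(n_steps, gamma, max_t):
    """Exact pmf of the total time after n_steps model steps, for t = 0..max_t.

    Computed by convolving the per-step geometric distribution n_steps times.
    """
    pmf = [1.0] + [0.0] * max_t  # before any step, elapsed time is 0
    for _ in range(n_steps):
        new_pmf = [0.0] * (max_t + 1)
        for t, mass in enumerate(pmf):
            if mass == 0.0:
                continue
            for k in range(1, max_t - t + 1):
                new_pmf[t + k] += mass * geometric_pmf(k, gamma)
        pmf = new_pmf
    return pmf


def negative_binomial_pmf(t, n_steps, gamma):
    """Closed form for the total time t of n_steps geometric increments."""
    if t < n_steps:
        return 0.0
    return comb(t - 1, n_steps - 1) * (1 - gamma) ** n_steps * gamma ** (t - n_steps)


gamma, n_steps, max_t = 0.8, 3, 40
by_convolution = rollout_time_pmf(n_steps, gamma, max_t)
largest_gap = max(
    abs(by_convolution[t] - negative_binomial_pmf(t, n_steps, gamma))
    for t in range(max_t + 1)
)
print(f"largest gap between convolution and closed form: {largest_gap:.2e}")
```

Under these assumptions the mean elapsed time after \(n\) model steps is \(n/(1-\gamma)\), so with \(\gamma = 0.8\) each model step covers five environment steps on average.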

**Value estimation**
Single-step models estimate values using long model-based rollouts, often between tens and hundreds of steps long. In contrast, values are expectations over a **single feedforward pass** of a \(\gamma\)-model: \(V(\mathbf{s}_t, \mathbf{a}_t) = \mathbb{E}_{\mathbf{s}_e \sim \mu_\theta(\cdot \mid \mathbf{s}_t, \mathbf{a}_t)}[r(\mathbf{s}_e)]\).
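
A small Monte Carlo sketch illustrates this estimator. It makes several assumptions: the policy is folded into a hypothetical tabular transition matrix `P`, so the action argument is dropped; `sample_gamma_model` is a stand-in for a trained \(\mu_\theta\) that simulates the true discounted distribution with the known dynamics; and the estimate is scaled by \(1/(1-\gamma)\) so that it matches the unnormalized discounted sum of future rewards (drop that factor for a normalized value).

```python
import random


def sample_gamma_model(P, state, gamma, rng):
    """Stand-in for one sample s_e ~ mu(. | state) from a trained gamma-model.

    Simulated exactly here: step the true chain, stopping after each step
    with probability 1 - gamma, so the stopping time is geometric.
    """
    while True:
        state = rng.choices(range(len(P)), weights=P[state])[0]
        if rng.random() < 1 - gamma:
            return state


def estimate_value(P, reward, state, gamma, n_samples, rng):
    """Value as an average of rewards at sampled states, scaled by 1 / (1 - gamma)."""
    total = sum(
        reward[sample_gamma_model(P, state, gamma, rng)] for _ in range(n_samples)
    )
    return total / n_samples / (1 - gamma)


def rollout_value(P, reward, gamma, horizon=1000):
    """Reference values V = sum_{k>=1} gamma^(k-1) * E[r(s_{t+k})].

    Computed by iterating V <- P (r + gamma * V) for `horizon` sweeps.
    """
    n = len(P)
    V = [0.0] * n
    for _ in range(horizon):
        V = [
            sum(P[s][s2] * (reward[s2] + gamma * V[s2]) for s2 in range(n))
            for s in range(n)
        ]
    return V


# Hypothetical 3-state chain and per-state rewards.
P = [[0.9, 0.1, 0.0],
     [0.0, 0.8, 0.2],
     [0.3, 0.0, 0.7]]
reward = [0.0, 1.0, 0.5]
gamma = 0.9

rng = random.Random(0)
estimates = [estimate_value(P, reward, s, gamma, 20000, rng) for s in range(3)]
reference = rollout_value(P, reward, gamma)
for s in range(3):
    print(f"state {s}: gamma-model estimate {estimates[s]:.3f}, reference {reference[s]:.3f}")
```

With a learned model, each call to `sample_gamma_model` would be a single forward pass, so the estimator itself never unrolls dynamics over many steps; only the stand-in sampler and the reference computation do so here.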
