Who wants to be average? Or rather, who wants to learn to predict an average? This was an idea explored in a 2017 paper called A Distributional Perspective on Reinforcement Learning. It proposed that instead of just learning to predict the expected value of state-action pairs, RL agents should learn to predict the probability distribution of the random variable corresponding to the return. This idea was justified by a proof of convergence for a distributional analogue of the Bellman operator, and by empirical results on Atari showing that the C51 algorithm, which predicts distributions, outperformed DQN on most tasks.
But the empirical evidence was far from conclusive, and theoretical results justifying the claimed performance improvements were scarce. Over the following year, more analysis followed, leading to improved empirical performance and a richer theoretical understanding of distributional RL algorithms. However, one question loomed ominously in the distance.
Why does distributional RL work?
There were a few hypotheses floating around. Predicting many things (i.e. auxiliary tasks) often leads to better performance, so it seems reasonable that distributional RL benefits from the many quantiles/atoms in its approximate representation of the return distribution. Perhaps probability distributions induce a nicer optimization landscape than expected values. Or maybe modeling a probability distribution leads to less jitter in the learning target as the agent learns.
Before trying to tackle the question of why predicting a distribution is helpful, let’s ask a simpler question. The deep RL setting has a lot of moving parts, after all, and it can be difficult to disentangle the different factors that drive behaviour.
When is distributional RL different from expected RL?
This question lends itself much more easily to analysis. Hopefully, answering it will also hint at what makes the distributional perspective perform better: wherever the answer is that the two coincide, the distributional update itself can be ruled out as the explanation.
We’ll assume the usual notation for reinforcement learning. That is, we let \(M = (\mathcal{X}, \mathcal{A}, R, P, \gamma)\) denote an MDP with state space \(\mathcal{X}\) and action space \(\mathcal{A}\), and let \(\pi\) be a policy. For the value functions, we’ll use \(Q: \mathcal{X} \times \mathcal{A} \rightarrow \mathbb{R}\) and \(V: \mathcal{X} \rightarrow \mathbb{R}\) for the Q-function and value function respectively.
At the core of distributional reinforcement learning is what’s called the distributional Bellman equation. The usual Bellman operator looks like this:
\(T^{\pi} Q(x,a) = \mathbb{E}[R(x,a)] + \gamma\, \mathbb{E}_{x', a' \sim P^\pi(\cdot \mid x,a)}[Q(x', a')]\)
Notice the expected values? We’re going to lift this operator to act on functions of the form \(Z: \mathcal{X} \times \mathcal{A} \rightarrow \mathcal{P}(\mathbb{R})\) (where \(\mathcal{P}(\mathbb{R})\) refers to the set of probability distributions on \(\mathbb{R}\)) by getting rid of them and looking at the laws of the underlying random variables. We then get \(T^{\pi}_D Z(x,a) = R(x,a) + \gamma Z(X', A')\), where \(X', A' \sim P^\pi(\cdot \mid x, a)\) and the right-hand side is understood as the law of the corresponding random variable.
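To make the lift concrete, here is a minimal tabular sketch in Python (my own illustration with made-up array conventions, not code from the paper), assuming deterministic rewards and finitely supported return distributions. It applies both operators and checks that taking expectations of the distributional backup recovers the usual Bellman backup.

```python
import numpy as np

def expected_bellman(Q, R, P, pi, gamma):
    """T^pi applied to a tabular Q-function.
    R: (S, A) rewards, P: (S, A, S) transitions, pi: (S, A) policy."""
    V = (pi * Q).sum(axis=1)                          # V^pi(x') = sum_a' pi(a'|x') Q(x', a')
    return R + gamma * np.einsum('sat,t->sa', P, V)

def distributional_bellman(Z, R, P, pi, gamma):
    """T^pi_D applied to a tabular distributional value function.
    Z[x][a] is a dict {return value: probability}, i.e. the law of the return.
    The (x, a) output is the law of R(x, a) + gamma * Z(X', A')."""
    S, A = R.shape
    Z_new = [[{} for _ in range(A)] for _ in range(S)]
    for x in range(S):
        for a in range(A):
            for x2 in range(S):
                for a2 in range(A):
                    w = P[x, a, x2] * pi[x2, a2]       # P^pi(x', a' | x, a)
                    if w == 0.0:
                        continue
                    for v, p in Z[x2][a2].items():
                        atom = R[x, a] + gamma * v     # pushforward of each atom
                        Z_new[x][a][atom] = Z_new[x][a].get(atom, 0.0) + w * p
    return Z_new

def means(Z):
    """Expected value of each predicted return distribution."""
    return np.array([[sum(v * p for v, p in d.items()) for d in row] for row in Z])

# A tiny random MDP: if E[Z] == Q, then E[T^pi_D Z] == T^pi Q.
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
R = rng.normal(size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)
Z = [[{rng.normal(): 0.5, rng.normal(): 0.5} for _ in range(A)] for _ in range(S)]
assert np.allclose(means(distributional_bellman(Z, R, P, pi, gamma)),
                   expected_bellman(means(Z), R, P, pi, gamma))
```

That final assertion is just linearity of expectation at work, and it previews the tabular equivalence results discussed below.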
Let’s consider two RL agents: Distributional Dobby predicts probability distributions, and Expected Elsa predicts expectations. Baby Dobby and baby Elsa start off in the world with their predictions initialized such that the distributions that Dobby predicts for each state-action pair have the same expected value as the values that Elsa initially predicts. Now, we’ll let Dobby and Elsa interact with the world by picking some action, then seeing a reward and transitioning to a new state.
In our thought experiment, we couple the trajectories of Dobby and Elsa so that if they agree on the same optimal policy, they’ll take the same trajectory through the MDP. For example, if both are doing epsilon-greedy action selection, then they’ll always either both pick the greedy action or both pick the same exploratory action. Assuming they pick the same action, they’ll also transition to the same next state and see the same reward. They then update their predictions according to some distributional and expected update rule respectively, and repeat this ad infinitum. In this setup, given equivalent initializations, if the two agents always agree on the optimal policy after every update, they will experience the same trajectory in the MDP, and in this sense be behaviourally equivalent.
What determines whether the two agents agree on the optimal policy is how they update their predictions after each step in the MDP. We’re interested in analogous update rules: for example, if Elsa uses the Bellman update, Dobby will use the distributional Bellman update. If, for a given distributional update rule, we can find an expectation-based update rule that induces behavioural equivalence, then we can say that distributional algorithms using that update rule should behave like traditional RL algorithms, with any observed empirical differences coming down to random seeds and initialization variability. It’s common in RL for the same algorithm to attain wildly different results across random seeds, so proving mathematically that two algorithms are equivalent is a pretty powerful statement.
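Here’s a rough sketch of one step of that coupled process in Python. The `env.step`, `elsa_update`, and `dobby_update` names are hypothetical placeholders rather than any API from the paper; the point is only that both agents consume the same exploration draw, reward, and next state, so any divergence in behaviour has to come from their greedy actions drifting apart.

```python
import numpy as np

rng = np.random.default_rng(0)

def coupled_step(x, Q_elsa, Z_dobby, env, elsa_update, dobby_update, eps=0.1):
    """One epsilon-greedy step of the coupled thought experiment.

    Q_elsa[x] is a vector of action values; Z_dobby[x][a] is a dict
    {return value: probability} as in the sketch above.
    """
    num_actions = len(Q_elsa[x])
    dobby_means = np.array([sum(v * p for v, p in Z_dobby[x][a].items())
                            for a in range(num_actions)])
    greedy_elsa = int(np.argmax(Q_elsa[x]))
    greedy_dobby = int(np.argmax(dobby_means))
    # The question under study: do the agents keep agreeing after every update?
    assert greedy_elsa == greedy_dobby

    if rng.random() < eps:                  # shared exploration coin flip
        a = int(rng.integers(num_actions))  # same random action for both agents
    else:
        a = greedy_elsa                     # equal to greedy_dobby by the check above

    x_next, r = env.step(x, a)              # shared transition and reward
    elsa_update(Q_elsa, x, a, r, x_next)    # expectation-based update rule
    dobby_update(Z_dobby, x, a, r, x_next)  # the analogous distributional update
    return x_next
```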
We considered a few settings: tabular MDPs with both operator and stochastic updates, as well as the function approximation setting. The quick summary of our findings is that state representation matters, and so does the way we define distance between distributions.
In the tabular setting, we find that (distributional) Bellman updates, TD-mixture updates, and updates which minimize the Cramér distance between a distribution and its TD target (w.r.t. a categorical distribution’s CDF) are all equivalent to expected algorithms. Interestingly, distributional algorithms that minimize other distances, or the same distance under a different parametrization (for example, the Cramér distance w.r.t. the PMF of a categorical distribution), may lead to different behaviour from expected algorithms even in the tabular setting.
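To see why the parametrization matters, here is a small illustration (mine, not code from the paper). The squared Cramér distance between two categorical distributions on a shared, evenly spaced support depends only on their CDFs, but the gradient one descends differs depending on whether the free parameters are the CDF values or the PMF values.

```python
import numpy as np

def cramer_sq(p, q, delta):
    """Squared Cramer distance between categorical distributions p and q
    supported on the same evenly spaced atoms with spacing `delta`."""
    diff = np.cumsum(p) - np.cumsum(q)        # difference of the two CDFs at each atom
    return delta * np.sum(diff ** 2)

def grad_wrt_cdf(p, q, delta):
    """Gradient when the free parameters are the CDF values of p."""
    diff = np.cumsum(p) - np.cumsum(q)
    return 2.0 * delta * diff

def grad_wrt_pmf(p, q, delta):
    """Gradient when the free parameters are the PMF values of p:
    each mass p_i shifts every CDF value at or after atom i."""
    diff = np.cumsum(p) - np.cumsum(q)
    return 2.0 * delta * np.cumsum(diff[::-1])[::-1]

p, q, delta = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.4, 0.5]), 1.0
print(cramer_sq(p, q, delta))     # the distance itself is the same number either way,
print(grad_wrt_cdf(p, q, delta))  # but the two parametrizations produce
print(grad_wrt_pmf(p, q, delta))  # different update directions
```

The distance is a single number, but gradient-based updates move in different directions under the two parametrizations, which is roughly where the difference in behaviour described above comes from.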
In the function approximation setting, we find that the distributional perspective makes a big difference, particularly when non-linear function approximation is used. With linear function approximation, just as in the tabular setting, minimizing the squared Cramér distance between the predicted distribution and a distributional version of the linear TD target is equivalent to vanilla linear TD; with non-linear function approximation, even this equivalence fails to hold.
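As a rough piece of intuition for why linearity is special (a toy identity of my own, not the paper’s argument): if each atom’s probability mass is a linear function of the features, then the implied mean is linear in those same features, so the distributional predictor carries an ordinary linear Q-function inside it.

```python
import numpy as np

rng = np.random.default_rng(1)
num_features, num_atoms = 4, 11

z = np.linspace(-1.0, 1.0, num_atoms)            # fixed support atoms
W = rng.normal(size=(num_features, num_atoms))   # one weight column per atom
phi = rng.normal(size=num_features)              # features of some (x, a) pair

# Linearly parameterized atom masses (ignoring positivity and normalization,
# which doesn't affect the identity below).
p = phi @ W
mean_of_distribution = p @ z    # expected return implied by the predicted masses
w_q = W @ z                     # collapse the atom dimension of the weights first...
mean_via_linear_q = phi @ w_q   # ...and a plain linear Q-function appears

assert np.isclose(mean_of_distribution, mean_via_linear_q)
```

With a non-linear network in place of `phi @ W`, this factorization no longer goes through, which is consistent with the equivalence breaking down in that setting.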
We then took a closer look at the differences between linear and non-linear function approximation.