Ladies and gentlemen, welcome aboard the RL Express! Today we will be taking you on an exciting journey through the world of RL objectives, making stops in Temporal Difference Learning, Actor-Critic Methods, and Preference Optimization, with time for an optional side-trip to World Modelling. Before we begin our journey, please familiarize yourself with the safety features on this vessel, in particular dynamic programming, gradient-based learning algorithms, Markov Decision Processes, and deep neural networks. With that, let’s begin!
To our starboard side, you can now see the world-famous Great Pyramid of Temporal Difference (TD) Learning. TD is one of the oldest civilizations in this region, with deep cultural and economic ties to the neighbouring realms of Neuroscience, Psychology, Control Theory, and Operations Research.
TD learning is an algorithm for an interactive world. In most other areas of machine learning we assume some fixed and inalterable Data Generating Distribution \(\mathcal{D}\) and differentiable loss function \(\ell\), and learning is done by drawing iid samples from \(\mathcal{D}\) and then updating the learner’s parameters based on the gradient of \(\ell\). TD learning instead assumes an agent embedded in an interactive world, taking actions \(A\), observing the world-state \(S\), and receiving rewards \(R\). Importantly, whereas in supervised learning we assume the learner is essentially trapped in a dark room and passively shown data, the reinforcement learner takes an active role in the data collection process because its output action affects the next observation it sees. The natural description of data in reinforcement learning is therefore a trajectory indexed by time: \(S_0, A_0, R_0, S_1, A_1, R_1, \dots\).
The agent’s goal is to maximize the expected cumulative (possibly discounted by some factor \(\gamma\)) reward along a trajectory, also known as the value \(V\), by finding a policy (a mapping from observations to distributions over actions) that selects good actions (see Sutton & Barto for a more thorough introduction). One straightforward way to identify an optimal policy is to learn the expected return of every state-action pair under your current policy, then change your policy to always take the action with the highest expected value. However, to do this you first need a good way of estimating the values of states and state-action pairs, which is where TD learning comes in.
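In code, the “act greedily with respect to your action-value estimates” step is almost trivial once you have the table of estimates. Here is a minimal sketch in Python, where the table sizes and the random values standing in for learned estimates are purely illustrative:

```python
import numpy as np

n_states, n_actions = 10, 4
Q = np.random.rand(n_states, n_actions)  # stand-in for learned action-value estimates
greedy_policy = Q.argmax(axis=1)         # for each state, the index of the highest-valued action
```

The hard part, of course, is filling in \(Q\) in the first place.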
In TD learning, the agent is constantly making predictions about the world (in particular, the expected cumulative reward it will get from its current state) and updating those predictions based on an error signal \(\delta_t\), which is just the difference between its prediction and the cumulative reward it actually receives. If the agent has a fixed lifetime of \(H\) steps, this could look something like
\[\delta_t = V(S_t) - \sum_{k=t}^H R_k\]
or in the discounted case
\[\delta_t = V(S_t) - \sum_{k=t}^\infty \gamma^{k-t}R_k\;.\]
Ultimately, we want to use this error as a learning signal while the agent is still acting, which makes the errors above utterly useless: they can only be computed after either the agent is dead or the infinite expanse of time has ended in a singularity. Instead, we use some estimator \(G_t\) as a proxy for the outcome from the state, resulting in an error signal of the form
\[\delta_t = V(S_t) - G_t\]
where \(G_t\) can be defined in a variety of ways. The most popular approach, used in algorithms like DQN, is to combine the immediate reward with the predicted value of the next state, giving the error
\[\delta_t = V(S_t) - [R_t + \gamma V(S_{t+1})]\]
however, more general \(n\)-step methods instead use the partial sum \(G_t^{(n)} = \sum_{k=t}^{t+n-1} \gamma^{k-t} R_k + \gamma^n V(S_{t+n})\) as the estimator. If one is indecisive, it’s also permissible to mix over the different \(n\)-step returns, \(G_t^\lambda = (1-\lambda)\sum_{n=1}^{N-1} \lambda^{n-1}G_t^{(n)} + \lambda^{N-1}G_t^{(N)}\;.\)
These mixed returns can also be computed incrementally, without waiting \(n\) steps into the future, by using eligibility traces.
Notice here that we are using the agent’s predictions about the state it ended up in to update its predictions about the state it started in. This means that \(n\)-step TD methods occupy an interesting bias-variance spectrum: low-variance but high-bias feedback about the result of an action on one end (small \(n\)), and higher-variance but lower-bias error signals on the other (large \(n\)). The choice of \(n\) is therefore typically a function of how much you trust the learner’s predictions to be reliable enough to learn from.
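For travellers who prefer to see the machinery up close, here is a minimal tabular sketch of the bootstrapped one-step update in Python. The step size, discount, and five-state value table are illustrative assumptions rather than anything mandated by the theory above:

```python
import numpy as np

def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """One-step TD(0) update on a tabular value estimate V (a numpy array),
    following the convention above: delta_t = V(S_t) - [R_t + gamma * V(S_{t+1})]."""
    bootstrap = 0.0 if done else gamma * V[s_next]  # trust our own prediction of the next state
    delta = V[s] - (r + bootstrap)                  # prediction minus bootstrapped target
    V[s] -= alpha * delta                           # nudge the prediction toward the target
    return delta

V = np.zeros(5)                                  # five states, values initialised to zero
td0_update(V, s=0, r=1.0, s_next=1, done=False)  # V[0] moves toward R_t + gamma * V[1]
```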
With the emergence of bootstrapping, we now find our vessel entering more turbulent waters.
The challenge with TD-style objectives is that the bootstrapped error signal is not the gradient of any fixed objective: the target depends on the very predictions being updated, and once bootstrapping is combined with function approximation the usual convergence guarantees of gradient-based learning no longer apply, so training can become unstable or even diverge.
Having successfully navigated the tumultuous waters of bootstrapping with function approximation, you can now see ahead of us the gentle rolling hills of Policy Gradients, a land of plentiful acronyms whose citizens do not believe in doing unnecessary computational work, particularly as it pertains to estimating action-value functions.
It is said that the founder of Policy Gradients was an émigré from the realm of Gradient-Based Optimization who, while passing through the foothills of TD learning during a particularly chaotic revolution, decided that the source of the region’s strife was that the error signal \(V(S_t) - [R_t + \gamma V(S_{t+1})]\) is not the gradient of any function, and sought to found a new city on more differentiable foundations.
The result was a great civilization founded on an extremely simple principle: if we want to maximize the expected return of some policy \(\pi_\theta\), \(J_\theta = \mathbb{E}_{\pi_{\theta}}\big[\sum_{t=1}^H R_t\big]\), why not simply compute the gradient of this objective directly and then apply standard gradient ascent methods? Supposing that we can find some decent estimator \(\widehat{\nabla J}_\theta\), this results in the update
\[\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J}_\theta\; .\]
Now the question arises: how on earth do we compute this \(\widehat{\nabla J}_\theta\) quantity? The answer will be familiar to anyone who has had to prove the Cramér-Rao lower bound in a statistics class or implement a variational auto-encoder: re-express the objective as an expectation and then make use of the identity
\[\frac{\partial}{\partial \theta} \ln f(\theta) = \frac{1}{f(\theta)} \frac{\partial}{\partial \theta} f(\theta)\]
and we get that for each \(s \in \mathcal{S}\), \(\nabla v_\pi (s) = \nabla \big [ \sum_a \pi(a|s) q_\pi(s, a) \big ]\) which, by a derivation that can be found on p. 325 of Sutton & Barto, amounts to
\[\begin{align} \nabla J_\theta &\propto \sum_{s} \mu(s) \sum_a q_\pi(s, a) \nabla \pi(a|s, \theta) \\ &= \mathbb{E}_{S_t \sim \mu_{\pi}}\bigg[\sum_a \pi(a|S_t, \theta)\, q_\pi(S_t, a)\, \frac{\nabla \pi(a|S_t, \theta)}{\pi(a|S_t, \theta)}\bigg] \\ &= \mathbb{E}_{S_t, A_t \sim \pi}\bigg[q_\pi(S_t, A_t)\, \frac{\nabla \pi(A_t|S_t, \theta)}{\pi(A_t|S_t, \theta)}\bigg] \\ &= \mathbb{E}_{S_t, A_t \sim \pi}\big[G_t\, \nabla \log \pi(A_t|S_t, \theta)\big] \end{align}\]
For those of you who skip over derivations, the intuition is simple: the REINFORCE algorithm updates the policy so that, in expectation, actions yielding larger returns \(G_t\) are made more probable, while actions yielding smaller returns \(G_t\) are made less probable.
This becomes even easier to see if we center \(G_t\) by its expected value at the state \(S_t\) (note that since \(\sum_{a} \nabla \pi(a|s) = \nabla 1 = 0\) for any distribution, we can add and subtract per-state constants from the estimator without affecting its expected value). Without centering, the effect of the estimator on increasing the probability of good actions and decreasing the probability of bad actions might have to be done indirectly (i.e. decrease the probability of a bad action by a lot when you take it, which indirectly increases the probabilities of the good actions because probability mass has to go somewhere, and then only decrease the probability of the good action by a little when it is taken.)
With centering, the intuition of increasing the probability of actions that are better than you expected and decreasing the probability of actions that are worse than you expected is a lot clearer. It’s also helpful for reducing the variance of your estimator. However, any gradient descent method, no matter how accurate its gradient estimator, requires careful tuning of the update norm to ensure that progress is stable.
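To make the centered estimator concrete, here is a minimal sketch of one REINFORCE update for a linear softmax policy in Python. The feature-based parameterization, the per-trajectory mean baseline, and the step size are all illustrative assumptions, not part of the algorithm’s definition:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_update(theta, feats, actions, returns, alpha=0.01):
    """One centered-REINFORCE update for a linear softmax policy.

    theta:   (n_features, n_actions) policy parameters
    feats:   list of feature vectors phi(S_t), each of shape (n_features,)
    actions: list of sampled action indices A_t
    returns: list of observed returns G_t
    """
    baseline = np.mean(returns)            # a crude per-trajectory baseline
    grad = np.zeros_like(theta)
    for phi, a, G in zip(feats, actions, returns):
        probs = softmax(phi @ theta)       # pi(. | S_t, theta)
        score = np.outer(phi, -probs)      # grad log pi(a|s) = outer(phi, one_hot(a) - probs)
        score[:, a] += phi
        grad += (G - baseline) * score     # centered return weights the score function
    theta += alpha * grad / len(feats)     # gradient *ascent* on J(theta)
    return theta
```

Swapping the mean-return baseline for a learned state-value estimate \(V(S_t)\), and the Monte Carlo return for a bootstrapped one, is the step toward the actor-critic methods promised in our itinerary.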
As you can see on our port side, the great ruins of Reinforcius present a cautionary tale of the perils of overzealous optimization. Once the jewel of the continent, this city-state experienced rapid development in the previous millennium but, without sufficient regularization, eventually overstepped the stability threshold of the local landscape and collapsed. If you look closely, you can see ancient hieroglyphs depicting the fall of the society as its citizens’ behaviours collapsed into trivial actions such as “always move left”.
We are now arriving at the twin walled cities of Trpopolis and Ppopolis.
Both cities organize their affairs around iterative updates of the form \(\theta_{k+1} = \mathrm{argmax}_{\theta} \mathbb{E}_{s, a \sim \pi_{\theta_k}}[L(s, a, \theta_k, \theta)]\), where we recall the policy-gradient objective \(L^{PG} = \mathbb{E}[\log \pi_\theta(a_t|s_t) \widehat{A}_t]\).
TRPO proposed the constraint \(\mathbb{E}_t[KL[\pi_{\theta_{\mathrm{old}}}(\cdot | s_t), \pi_{\theta}(\cdot|s_t)]] \leq \delta\)
which it then proposes to enforce via the penalized objective
\(\mathbb{E}[\log \pi_\theta(a_t|s_t) \widehat{A}_t] - \beta\, \mathbb{E}_t[KL[\pi_{\theta_{\mathrm{old}}}(\cdot | s_t), \pi_{\theta}(\cdot|s_t)]].\)
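For concreteness, here is a minimal sketch of that penalized surrogate for a discrete action space; the fixed penalty coefficient \(\beta\) and the way the distributions are passed around are illustrative assumptions:

```python
import numpy as np

def kl_penalized_objective(logp_new, adv, probs_old, probs_new, beta=0.01):
    """Policy-gradient surrogate minus a KL(pi_old || pi_theta) penalty.

    logp_new:  log pi_theta(a_t|s_t) for the sampled actions, shape (T,)
    adv:       advantage estimates, shape (T,)
    probs_old: old policy's action distributions at each s_t, shape (T, n_actions)
    probs_new: new policy's action distributions at each s_t, shape (T, n_actions)
    """
    surrogate = np.mean(logp_new * adv)
    # per-state KL(pi_old || pi_new); assumes strictly positive probabilities
    per_state_kl = np.sum(probs_old * (np.log(probs_old) - np.log(probs_new)), axis=1)
    return surrogate - beta * np.mean(per_state_kl)  # the penalty discourages drifting far from pi_old
```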
The city was founded by the monastic cult of the great god Shul, but suffered a schism shortly after its founding based on a disagreement over the ability of regularization terms to induce convergence to a local optimum of a constrained optimization problem. Those who believed they did not left to found the city of Ppopolis, while those who believed they did remained in Trpopolis. Their descendants continue to follow the religious rites of the founders, which involve the chanting of the phrase “Lagrange multipliers” until one is delirious enough to believe that a regularization penalty can enforce a constraint.
The PPO objective proposes to follow the policy gradient so long as the likelihood ratio between the new and old policy on the sample stays within a bounded range; outside that range there is no penalty for diverging further from the old policy, but the objective does not improve either.
\(\mathcal{L}^{\mathrm{clip}}(\theta) = \mathbb{E}_t\bigg [ \min \big ( r_t(\theta) \widehat{A}_t, \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\widehat{A}_t \big) \bigg ], \quad \text{where } r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t|s_t)}\;.\)
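Translated into a minimal Python sketch, operating on precomputed log-probabilities and advantage estimates (the batch layout and the choice of \(\epsilon\) are illustrative assumptions):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, epsilon=0.2):
    """Clipped surrogate L^clip averaged over a batch of (s_t, a_t) samples.

    logp_new: log pi_theta(a_t|s_t) under the policy being optimized
    logp_old: log pi_theta_old(a_t|s_t) under the policy that collected the data
    adv:      advantage estimates
    """
    ratio = np.exp(logp_new - logp_old)                     # r_t(theta)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)  # keep the ratio inside [1-eps, 1+eps]
    # taking the min means there is no extra credit for pushing the ratio outside the band
    return np.mean(np.minimum(ratio * adv, clipped * adv))
```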
Despite not using any explicit regularization toward the reference policy, this approach is highly effective. It is the basis on which Ppopolis has become an international economic and cultural player, exporting a number of useful algorithmic and engineering ideas which have taken root elsewhere. The generality of the policy gradient on which the approach is based (it can be applied to both continuous and discrete action spaces), together with the stability of the clipping trick, has made the objective an extremely popular baseline. Indeed, despite the many claims of improved algorithms which supposedly outperform it, PPO continues to be the classic reference algorithm against which other methods are evaluated.
While
On the border between