Offline reinforcement learning – the process of trying to learn how to achieve a goal without ever getting to interact with the ‘real world’ – is hard. Readers who have tried to train RL agents from offline data will know this well and can skip this paragraph. For readers without experience in offline RL, consider the contrast between watching your parents drive you as a child, and learning to drive a car for the first time as a young adult. Even though they might have many years of “passive” driving experience, a new driver is unlikely to perfectly replicate the policy their parents would follow the first time they get behind the wheel. As an even greater challenge, minor inaccuracies or differences in a new driver’s policy can result in them facing new situations that they’ve never seen an expert navigate. For example, a new driver may avoid exceeding the posted speed limit while on a learner’s permit, which on narrow roads could lead to a long line of cars and their irritated drivers accumulating behind them. To navigate this situation, the driver needs to either a) correctly generalize to this new situation and pull over at the first safe opportunity to let the cars pass, or b) ask a supervisor for guidance. Identifying option a) requires skill and creativity typically unavailable to new drivers, and so most jurisdictions institute graduated licensing programs where the learner is required by law to have an expert in the car that they can query at all times for their first year or so of driving.
Learning to drive requires generalization to navigate situations that an expert would have avoided, such as accumulating a long line of angry drivers behind you.
Non-human RL agents face similar challenges when they are trained using offline data, without the benefit of a human supervisor and a prescribed number of hours of real-world interaction. Concretely, there are two major technical challenges in offline RL: distribution shift in the state visitation distribution, and lack of coverage of the action set by the data-generating policy. Distribution shift occurs because even a near-optimal policy will nonetheless make mistakes, and these mistakes can result in the agent encountering novel situations that it never saw the expert navigate when it is finally deployed in the real world. But a potentially bigger problem is the lack of action coverage: even if the state visitation distribution is the same, the agent won’t have the data necessary to identify when it is overestimating the advantage of a particular action. Agents which can interact with their environment will quickly see the effects of taking actions they incorrectly predict to be optimal, and so collect the data they need to correct overly-optimistic estimates. Agents trained with offline data don’t have this luxury.
This problem has been widely studied in the offline RL literature. Most of the discussion here has focused on the overestimation bias problem, where an agent “hallucinates” a high value for an action, and this error then gets exacerbated via bootstrapping over the course of training because the agent never sees data revealing its mistake. Offline RL approaches like Conservative Q-Learning try to get around this by incorporating pessimism into their predictions, so that an action which isn’t taken in the dataset is assumed to have a lower value than the actions which were taken. Even with this kind of work-around, letting an agent go out into the world and make its own mistakes tends to be much more effective at improving performance (though also more dangerous for the agent).
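To make “pessimism” a little more concrete, the Conservative Q-Learning objective (roughly, and from memory rather than the exact equation in the paper) augments the usual TD regression with a term that pushes Q-values down under some action distribution \(\mu\) while pushing them up on the state-action pairs that actually appear in the dataset \(\mathcal{D}\):

\[ \min_\theta \; \alpha \Big( \mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(\cdot \mid s)}\big[Q_\theta(s,a)\big] - \mathbb{E}_{(s,a) \sim \mathcal{D}}\big[Q_\theta(s,a)\big] \Big) + \tfrac{1}{2}\, \mathbb{E}_{(s,a,s') \sim \mathcal{D}}\Big[ \big( Q_\theta(s,a) - \hat{\mathcal{B}}^{\pi} \hat{Q}(s,a) \big)^2 \Big] \]

where \(\hat{\mathcal{B}}^{\pi} \hat{Q}\) is the usual bootstrapped target. The first term is what makes actions absent from the dataset look worse than the ones the behaviour policy actually took.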
This blog post will discuss insights from two papers that explore why letting agents make their own mistakes is so crucial for obtaining performant policies in value-based RL. The first of these papers, On the Difficulty of Passive Learning in Deep Reinforcement Learning, identifies a neat phenomenon called the Tandem Effect, whereby agents trained on data collected by a greedy policy with respect to their predictions obtain much better performance than agents with a different network initialization which are otherwise trained on exactly the same stream of data. (I’ll abbreviate this paper to Tandem RL due to the framework used to study this phenomenon.) The second paper, DR3: Value Based Deep RL Requires Explicit Regularization (which I’ll abbreviate to DR3), looks at how the dynamics of stochastic gradient descent influence agents’ representations, and proposes a mechanism for a failure mode of offline RL known as implicit underparameterization. Taken together, they suggest that data collection policies which don’t reveal when an agent has over-estimated the relative value of an action are harmful to performance even in the absence of bootstrapping, but that when coupled with bootstrap updates they can result in particularly pathological representation learning dynamics and completely derail the learning process.
Side bar: overestimation bias and approximation error.
Overestimation in offline RL can take two forms: overestimation of the value of a state, and overestimation of the advantage of an action at a state. The former materializes as overestimation bias, whereby overly optimistic estimates of an action that the behaviour policy didn’t take lead to runaway bootstrap updates that dramatically overestimate the values of state-action pairs across the entire MDP. The latter type of overestimation, which is the one I focus on in this post, doesn’t necessarily result in runaway predictions, but can result in sub-optimal policies. Essentially, if the values of all the actions at a given state are similar, even a tiny amount of approximation error can drastically change the greedy policy associated with a learned value function.
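As a throwaway numerical illustration of this second failure mode (the numbers here are entirely made up): three actions whose true values are close together, and a learned estimate that is excellent in an \(\ell_2\) sense but overestimates a rarely-taken action by a small \(\epsilon\).

```python
import numpy as np

# True action values at some state: action 0 is best, but only barely.
q_true = np.array([1.00, 0.98, 0.60])

# The learned estimate is accurate overall, but action 1 (rarely taken by the
# behaviour policy, so never corrected) is overestimated by a small epsilon.
eps = 0.05
q_hat = q_true.copy()
q_hat[1] += eps

print(np.argmax(q_true))               # 0 -- the truly greedy action
print(np.argmax(q_hat))                # 1 -- the greedy policy flips
print(np.linalg.norm(q_hat - q_true))  # 0.05 -- despite a tiny L2 error
```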
The study of the tandem effect is inspired by an experiment done with kittens in the 1960s by psychologists studying how the visual system develops. The idea is to pair up two kittens using an elaborate harness contraption, so that one kitten (the active kitten) can move about freely and look at things, and the other (the passive kitten) can only look at things determined by the first kitten. After a couple of weeks of this, the active kitten has a normally-functioning visual system, while the passive kitten’s vision is under-developed in comparison (the passive kitten is then set free to roam about and do kitten things and catches up after). This was one of the first studies to investigate the importance of having control over the data your brain is exposed to in learning and development.
Kittens in the tandem learning experiment from the 1960s.
The tandem RL framework tries to replicate this experiment in RL agents. It considers two agents: an active agent, and a passive agent. As with the kittens, the active agent gets to interact with the environment and learn as usual. The passive agent has access to the active agent’s replay buffer, but can’t generate its own data. Both agents are trained using the same update rule (for the most part, standard TD learning), and sample the same minibatches from replay for every update. Thus the only difference between the training procedure for the two agents is their predicted values at initialization.
One important detail here is that unlike in offline RL, the passive agent sees a constantly evolving and improving data distribution. This is a big difference from using a static, fixed dataset. For starters, it means that the agent can see a greater diversity of states and actions over the course of training. It also means that the passive agent gets to see the exact same data as the active agent, so any difference in performance between the two can’t be attributed to properties of using a fixed dataset. The only difference between the active and the passive agent’s training procedures is that their neural networks are initialized using different random seeds. As a result, the actions sampled from the replay buffer will usually be ones that the active agent, but not necessarily the passive agent, thinks are optimal.
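As a rough sketch of what one tandem update might look like (hypothetical PyTorch-style code, not the paper’s actual implementation): both networks see exactly the same minibatch, and the only asymmetry is that the replay buffer being sampled from was filled by acting greedily with respect to `active_q`.

```python
import torch
import torch.nn.functional as F

def tandem_update(batch, active_q, active_target, passive_q, passive_target,
                  opt_active, opt_passive, gamma=0.99):
    """One gradient step for both agents on the *same* replay minibatch.

    The buffer this batch comes from is only ever filled by the active agent
    acting (epsilon-)greedily w.r.t. active_q; the passive agent never acts.
    """
    s, a, r, s_next, done = batch  # tensors sampled from the shared buffer

    for q_net, target_net, opt in [(active_q, active_target, opt_active),
                                   (passive_q, passive_target, opt_passive)]:
        with torch.no_grad():
            # Each agent bootstraps from its own target network; the data
            # stream is the only thing the two agents share.
            next_q = target_net(s_next).max(dim=1).values
            target = r + gamma * (1.0 - done) * next_q
        pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = F.mse_loss(pred, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
```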
So what happens when we set these two agents off to learn about the world? Initially, things usually look ok: both agents will start to improve their performance. However, the passive agent’s performance inevitably levels off and in most cases declines as training progresses. This decline in performance in the passive agent but not the active agent is what the paper terms the tandem effect.
Evaluation of active and passive agents in the Tandem RL framework.
Does this difference in performance occur because differently initialized agents are somehow incompatible? Would the tandem effect go away if the initial policy were greedy with respect to the passive agent’s value function? To answer this, the authors present a “forked tandem” paradigm. Here, the active agent’s policy is frozen and its weights are copied to the passive agent, which then continues training on the data generated by that frozen policy. Note that, at least initially, the “active” policy will always pick the action that the passive agent itself considers greedy, since the two start from the same parameters. After running this training procedure for a while (presumably until the frozen active policy is sufficiently not-greedy with respect to the passive agent’s predictions), the passive agent’s performance again starts to decline.
The “forked tandem” setup.
The authors study three properties that could plausibly generate this effect: bootstrapping (B), the data distribution (D), and function approximation (F). Intriguingly, while bootstrapping exacerbates the tandem effect, it doesn’t seem to play as crucial a role as the other two factors. To show this, the authors train the passive agents on the same TD targets as the active agent is using (i.e. using the active agent’s target parameters and policy), and still see a tandem effect. In some cases using the active agent’s targets produces a smaller effect, but the phenomenon doesn’t go away in any of the environments.
Bootstrapping exacerbates the tandem effect, but isn’t a necessary condition for it.
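Concretely, that ablation amounts to a small change to the update sketched earlier (hypothetical names again, reusing the imports from that sketch): the passive network’s regression target is built entirely from the active agent’s target network and greedy action choice, so the bootstrap itself is identical for the two agents.

```python
def passive_update_with_active_targets(batch, active_target, passive_q,
                                       opt_passive, gamma=0.99):
    """Ablation: the passive agent regresses onto the *active* agent's
    bootstrap target, i.e. the active target network evaluated at the
    active agent's greedy next action."""
    s, a, r, s_next, done = batch
    with torch.no_grad():
        a_next = active_target(s_next).argmax(dim=1, keepdim=True)
        next_q = active_target(s_next).gather(1, a_next).squeeze(1)
        target = r + gamma * (1.0 - done) * next_q
    pred = passive_q(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(pred, target)
    opt_passive.zero_grad()
    loss.backward()
    opt_passive.step()
```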
This suggests that whatever’s driving the tandem effect, it’s not solely a product of bootstrapping off of a sub-optimal data-collection distribution. Minus one point for bootstrapping. Instead, it seems to result from the interplay between function approximation and insufficient action coverage. I would conjecture that this probably has something to do with the sidebar above: value-based RL agents following greedy policies will be highly dependent on getting the relative ordering of actions with relatively similar values correct in order to obtain good performance. Even a small approximation error as a result of overestimating an action by \(\epsilon\) can lead to catastrophic performance reductions if it isn’t corrected, and the passive agents are unlikely to get the data they need to make this relatively minor correction.
This argument explains why we would expect to see large performance gaps between value functions that are close in an \(\ell_2\) sense, but doesn’t explain where the difference in the value functions arises from. The authors provide some evidence that this phenomenon is likely due to poor generalization of the passive agent.
Increasing the number of passive learner optimization steps per active learner step exacerbates the gap, so the tandem effect isn’t driven by the passive agent underfitting relative to the active agent. (+1 for D)
The tandem effect isn’t completely eliminated by regressing the passive agent on the active Q-values for all actions (though it is significantly reduced). This suggests that poor advantage approximation drives a significant part of the tandem effect, but can’t totally explain it. (+1 for F and +1 for D)
Increasing network width reduces the tandem effect, but increasing network depth exacerbates it. Honestly, I’m not 100% sure how to interpret this result, so I’m just leaving it here for now. (+1 for F maybe??)
Tying the weights of early layers of the active and passive network (so that the passive network in the extreme case only performs gradient updates on its final layer) reduces the gap between the active and passive agent. (+1 for F)
Conclusions: what we get from all of this is that agents need to see the effects of the actions that they think are best in order to perform well at the task they’re learning. This is true even when the agent is trained without bootstrapping, although the effect is stronger when bootstrapped targets are used. We can conclude that there’s definitely something really interesting going on in the interaction between the generalization induced by function approximation and the predicted optimality of the set of actions that the agent gets to update during training, but I’m still not entirely sure what that something is.
At the Deep RL workshop, Kumar et al. presented a separate paper studying another peculiar phenomenon in offline RL: implicit underparameterization. A previous ICML paper had revealed a weird phenomenon in offline RL where agents’ representations (i.e. penultimate layer feature activations) tend to be dominated by a couple of dimensions. This paper proposed an explanation for that phenomenon. Unlike the previous paper, which focused specifically on agent performance and studied the effects of bootstrapping, the data-generating distribution, and function approximation, this paper studied a potential mechanism by which bootstrapping in offline RL can lead to pathological learning dynamics, resulting in the downstream effect of poor performance. Kumar et al.’s focus on the mechanisms by which poor performance occurs results in a rich analysis that yields a lot of interesting insights into what’s going on in offline RL agents’ features when they perform updates using bootstrapped actions that aren’t the ones followed by the behaviour policy.
The short version of this paper is that, by following a similar analysis to prior work that studied stochastic gradient descent on regression objectives, we can characterize the implicit regularization induced by stochastic gradient descent on the temporal difference learning objective when the agent uses bootstrapping. I’m less familiar with the papers the authors cited, but similar work essentially finds that gradient descent with discrete step sizes induces a bias towards flat minima compared to what you would expect by following a continuous-time gradient flow. We’re generally interested in discrete systems of the form
\[ f_{t+1}(x) = f_t(x) - \alpha \nabla \ell(f_t(x)) \; .\]
A first-order continuous approximation of this system looks as follows
\[ \partial_t f_t(x) = - \alpha \nabla \ell(f_t(x)) \]
but a faithful approximation also needs to take into account that taking a discrete gradient step results in different dynamics than following infinitesimal gradients. To account for this, we can introduce a correction term into the continuous-time system. The result is that rather than following the gradient of the loss, we end up following the gradient of the loss plus a scaled penalty on the gradient norm.
\[ \partial_t f_t(x) = -\alpha [ \nabla \ell (f_t(x)) + \lambda \nabla \| \nabla \ell (f_t(x))\|^2 ] = -\alpha \nabla \ell (f_t(x)) - \beta \nabla R(f_t(x)) \]
As a result, we say that running discrete-time GD induces implicit regularization towards solutions with lower gradient norm in their vicinity – in other words, towards solutions that are flatter.
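To sketch where that correction comes from (this is my loose reconstruction of the backward-error-analysis style of argument, with constants that should be taken as indicative rather than exact): view the discrete update as a unit-time Euler step, Taylor-expand the flow of a modified loss \(\tilde{\ell}\), and choose the modification so that the second-order terms cancel.

\[
\begin{aligned}
\text{discrete step:}\quad & f_{t+1} = f_t - \alpha \nabla \ell(f_t)\,,\\
\text{unit-time flow of } \tilde{\ell}:\quad & f(t+1) \approx f(t) - \alpha \nabla \tilde{\ell}(f(t)) + \tfrac{\alpha^2}{2} \nabla^2 \tilde{\ell}(f(t))\, \nabla \tilde{\ell}(f(t))\,,\\
\text{with } \tilde{\ell} = \ell + \tfrac{\alpha}{4} \lVert \nabla \ell \rVert^2:\quad & \nabla \tilde{\ell} = \nabla \ell + \tfrac{\alpha}{2} \nabla^2 \ell\, \nabla \ell \;\Rightarrow\; f(t+1) \approx f(t) - \alpha \nabla \ell(f(t)) + O(\alpha^3)\,.
\end{aligned}
\]

So, up to the constants I’ve been sloppy about, following the discrete updates is (to second order) the same as flowing down a loss with an extra gradient-norm penalty, matching the \(\lambda \nabla \| \nabla \ell \|^2\) term above with \(\lambda\) on the order of \(\alpha/4\).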
Since TD updates don’t correspond to the gradient of any function, the dynamics for TD learning induce slightly different implicit regularization. We end up getting
\[ R(\theta) = \sum_{(s,a,s')} \Big[ {\color{blue}{\|\nabla_\theta Q_\theta(s,a)\|^2}} - {\color{red}{\gamma\, \nabla_\theta Q_\theta(s,a)^\top [[\, \nabla_\theta Q_\theta(s', a') \,]]}} \Big] \]
(where [[\(\cdot\)]] denotes the stop-gradient function). This has a gradient penalty term as before, but now it also has a weird dot product term that measures how aligned the gradients are at adjacent state-action pairs in the MDP. What’s more, this gradient-alignment term enters with a negative sign, which means the dynamics are encouraged to maximize it. One interpretation is that it says our model should update adjacent states similarly. However, because this is an unnormalized dot product, one way to maximize it is to have all states map to the same feature vector and to make that shared feature vector large.
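One way to see why this is really a statement about representations (my own gloss, looking only at the last layer of the network): if we write the Q-function as a linear readout on top of penultimate features, \(Q_\theta(s,a) = w^\top \phi(s,a)\), then the gradient with respect to \(w\) is just the feature vector, and the dot-product term restricted to the last layer reduces to a dot product between features at consecutive state-action pairs:

\[ \nabla_w Q_\theta(s,a)^\top \nabla_w Q_\theta(s',a') = \phi(s,a)^\top \phi(s',a') \; . \]

Pushing that quantity up means making the penultimate features at successive state-action pairs large and aligned, which is exactly the “feature co-adaptation” plotted below.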
Kumar et al. then offer a neat insight. In the online RL setting, the dot product maximization can’t get out of hand without also increasing the gradient norms, which are controlled by the blue term. Since most behaviour policies are relatively greedy, the bootstrap state-action pair is likely to be visited shortly afterwards, and so the gradient norm penalty will be able to keep the norm in check. However, if a particular state-action pair used in the bootstrap update isn’t seen very often by the behaviour policy, as may be the case in offline RL, then it only appears in the gradient alignment term. If this occurs often enough, it may result in runaway gradient norms over the course of optimization.
And that’s indeed what the paper then observes.
Feature co-adaptation as observed by Kumar et al.
They also do an ablation to study the effect of using out-of-sample (i.e. out-of-distribution w.r.t. the empirical distribution in the offline dataset) actions, confirming that the feature norms blow up only when the bootstrap target actions aren’t ones taken by the behaviour policy; bootstrapping on in-sample actions doesn’t produce the same blow-up. The paper then introduces an explicit regularizer to counteract the implicit regularization induced by discrete step sizes, which does seem to mitigate implicit under-parameterization. This part of the paper is worth a read but doesn’t pertain directly to the focus of this post, so I won’t go over it here. I recommend reading the paper if you’re interested.
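If you want to eyeball this in your own agent, a minimal diagnostic might look something like the sketch below (the `features` method and the argument layout are hypothetical conveniences; this isn’t code from the paper):

```python
import torch

@torch.no_grad()
def feature_coadaptation(q_net, policy, batch):
    """Average dot product between penultimate features at (s, a) and at the
    bootstrap pair (s', a'), where a' is whatever the learned policy picks.
    Large and growing values are the signature discussed above.

    Assumes a (hypothetical) state-action network exposing its penultimate
    features via q_net.features(s, a) -> (batch, d).
    """
    s, a, _, s_next, _ = batch
    phi = q_net.features(s, a)
    a_next = policy(s_next)          # e.g. greedy actions w.r.t. q_net
    phi_next = q_net.features(s_next, a_next)
    return (phi * phi_next).sum(dim=1).mean().item()
```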
The main take-away from these papers is that reinforcement learning agents need to see the effects of the actions they predict are optimal in order to find good policies. Tandem RL shows that even if they’re not using bootstrapping, agents’ performance still suffers when the actions used to generate their training data are different from the ones they predict are optimal, though this effect is exacerbated by bootstrapping. DR3 shows that in agents trained on offline datasets with bootstrapping, poor action coverage can lead to exploding feature vectors.
There are a lot of hanging questions I had after reading these two papers that I’m hoping the community will shed light on in the coming years. The biggest one is to understand how the mechanisms these two papers study relate to each other: can the exacerbating effect of bootstrapping be attributed uniquely to the mechanism proposed by Kumar et al., or are there additional factors at play? I also suspect that applying an analysis similar to the one motivating DR3 to the phenomena studied by Ostrovski et al. could help get at related questions: what is the mechanism underlying the tandem effect when the agents don’t use bootstrapping? Is there something about updating the action that the agent predicts to be optimal that affects generalization in some important way? Are there alternative forms of regularization that might allow agents to obtain good performance from passive learning?
I don’t have the answers to these questions, nor am I actively working on them right now (I’ve been going down a few too many LaTeX rabbit holes whilst writing my thesis). However, I am very curious about the answers and if you, dear reader, manage to figure them out then please let me know. :)