Everyone's a little bit Bayesian
Last week I found myself doing two things I rarely do these days: traveling to a warm sunny place and talking about Bayesian inference.
The occasion was a workshop organized at MBZUAI on the topic “Rethinking the Role of Bayesianism in the Age of Modern AI”, the successor to a Dagstuhl workshop of the same name that I’d unfortunately had to miss last year. Although I did work on model selection for a couple of years during my PhD, it’s been a while since I wrote a paper that was really “Bayesian” (excepting a sprinkling of posterior estimation for continual learning problems) or kept up with the literature, and going into the workshop I wasn’t sure what to expect.
It turns out that, much like RL, Bayesian deep learning hasn’t had the easiest few years in the wake of LLMs. While there was once a day when machine learning was nearly synonymous with probabilistic methods (I am told this was sometime around 2004), that day has long since passed. Probabilistic graphical models have gone the way of the gramophone — still appreciated by connoisseurs, but not the mass market medium of the information age. The mid-2010s had a few contraptions that could honestly be said to be Bayesian, like variational autoencoders and the various approximation methods for Bayesian neural network training. And if all else failed, there was always the maxim to fall back on that “everything that works, works because it’s Bayesian”.
But in the era of language models, the good Bayesian is left with little to point at beyond the trivial observation that any sufficiently well-trained next-token predictor that optimizes a proper scoring rule will have to at least implicitly perform some form of Bayesian update in token-space. While RL has had a small renaissance in the wake of reasoning models, the same can’t quite be said for Bayesian methods. The mood at the start of the workshop felt a bit like the mood at RL workshops in the immediate aftermath of ChatGPT: there was a lot of soul searching about which of the problems the field was trying to solve were still relevant to pushing the frontier of ML/AI. Despite this, I think most of us came away with some optimism that Bayesian principles will be more relevant than ever in the medium-term future.
This blog post is meant both to organize my take-aways from the workshop and to pre-register predictions on whether Bayesianism will have been sufficiently re-thought by this time next year that they’ll have to give the event a different name.
When used as an adjective, “Bayesian” can refer to many different things, but loosely speaking these different usages share a common theme of computing probability distributions conditioned on evidence. Bayesian epistemology, for example, says that your beliefs should be expressed as probability distributions, and that updating your beliefs in response to evidence should look like computing a conditional distribution. Bayesian inference comprises a grab-bag of techniques for estimating conditional and marginal distributions that involve taking some prior over hypotheses \(P(H)\) and evidence \(\mathcal{D}\) and computing things like \(P(H \mid \mathcal{D})\), \(P(\mathcal{D})\), etc. This turns out to be hard because integration is hard, and it does in fact require an entire field with multiple conferences. Bayesian decision theory goes one step further and gives a means of combining Bayesian beliefs with goals to induce optimal behaviour.
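Concretely, the central computation is just Bayes’ rule, and the difficulty hides in the normalizing integral over hypotheses:

\[
P(H \mid \mathcal{D}) \;=\; \frac{P(\mathcal{D} \mid H)\, P(H)}{P(\mathcal{D})},
\qquad
P(\mathcal{D}) \;=\; \int P(\mathcal{D} \mid H)\, P(H)\, \mathrm{d}H .
\]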
This diversity was also reflected at the workshop, where the talks spanned a huge range of topics.
Back when I was doing my PhD, if you said you were doing Bayesian ML (or especially Bayesian deep learning), it usually meant you were trying to get some sort of posterior distribution out of a neural-network-like object. The spread of topics covered last week expanded this definition significantly: priors no longer have to correspond to subjective beliefs (e.g. in PAC-Bayes bounds and compression), and they don’t have to be placed on network parameters (they can instead be placed over representations in the network that are updated via in-context information).
One common theme that came up during the workshop was that there haven’t been many examples of exciting new methods or discoveries that are explicitly or uniquely “Bayesian”. Part of this, I think, is because updating probability distributions based on evidence is not a problem that lends itself easily to flashy headlines. Things that capture the public’s attention tend to involve either doing something impressive, like controlling a robot or beating people at games, or making something impressive, like a new mathematical discovery. RL problems tend to be innately more attention-grabbing and (though this might be controversial) impactful, because maximizing a reward signal in an environment is a very generic framework for getting AI systems to do useful things. Rationally updating one’s beliefs does not by itself allow a system to suddenly achieve remarkable feats, though it can be a useful component of such a system. Even worse, operations that look like rational-ish belief updates often emerge as a natural consequence of whatever non-Bayesian learning paradigm manages to solve a task that involves updating beliefs based on evidence, which means that the Bayesianists can’t even take credit for methods that do successfully implement Bayesian reasoning.
Another reason for the lack of famous Bayesian successes is that many of the problems that currently capture the public imagination don’t really involve explicit Bayesian updating. Having well-calibrated uncertainties that can be updated accurately and efficiently in light of new evidence is most useful in settings with limited observability, for example when working with other agents like humans or in environments with hidden state that can change over time like in most digital assistant tasks. It’s probably useful, though not essential, for learning problems like RL where even if the environment is fixed, the learner’s ability to manipulate the world changes over time.
There are already some examples of successful applications of Bayesian methods to these types of problems, SLAM in particular. And there are lots of things in RL that look a bit like tempered Gibbs posteriors if you squint hard enough. But I don’t think many people think of SLAM or MPO when they think of Bayesian methods, nor do they consider uncertainty quantification in neural networks to be necessarily “Bayesian”. So perhaps part of the problem is also a marketing issue, and next year all of the workshop attendees will have learned their lesson and put “Bayesian” or “variational” in all of their paper titles.
I came away with a few conclusions that I didn’t have going in, so in the final section of this post I’ll give a rapid-fire list of lessons that were probably already obvious to most Bayesian ML people but which only crystallized for me last week.
Uncertainty quantification in LLMs is already kind of useful
This claim has two parts that aren’t immediately obvious: that uncertainty quantification in LLMs is reliable enough to give decent estimates of uncertainty, and that these estimated uncertainties correlate reasonably well with response truthfulness and quality. That said, my interactions with the current generation of language models have made it abundantly clear that hallucinations and calibration are still far from solved, and uncertainty estimation methods still exhibit a fairly strict trade-off between quality and compute.
Current approaches for estimating uncertainty all have problems. Ensembling over forward passes is expensive, and how to induce randomness in the ensemble is largely a matter of personal preference, which makes it hard to say in absolute terms what a given level of variance means. Asking the language model if it is sure is a great way to reinforce a hallucination. Using token-level entropies will lead to a lot of spuriously high estimates of uncertainty when there are many ways of phrasing something. Evaluating the quality of an uncertainty estimate is also a challenge. The current standard is to look at whether filtering responses by uncertainty improves some quality metric like accuracy or ROUGE-L, which is a reasonable start but doesn’t give a great absolute measure of how well-calibrated the model’s uncertainties are. The fact that this type of filtering actually does tend to produce better answers is a promising sign, though, and I could easily imagine a world where the baseline uncertainty quantification methods performed a lot worse than the ones we currently have.
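To make the trade-offs concrete, here is a minimal sketch (not any particular paper’s method) of two of the cheap baselines mentioned above: mean token-level entropy from a single forward pass, and disagreement across repeated samples. The function names and data formats are hypothetical placeholders rather than any real API.

```python
import math
from collections import Counter

def mean_token_entropy(token_logprob_dists):
    """Average entropy (in nats) of the model's next-token distributions.

    `token_logprob_dists` is a list of dicts mapping candidate tokens to
    log-probabilities, one dict per generated token (a made-up format;
    real APIs expose top-k logprobs in their own schema).
    """
    entropies = []
    for dist in token_logprob_dists:
        probs = [math.exp(lp) for lp in dist.values()]
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
    return sum(entropies) / max(len(entropies), 1)

def sample_disagreement(sampled_answers):
    """Fraction of sampled answers that disagree with the majority answer.

    A crude ensemble-style proxy: sample the model several times at nonzero
    temperature and check how often it commits to the same (normalized) answer.
    """
    counts = Counter(a.strip().lower() for a in sampled_answers)
    majority_count = counts.most_common(1)[0][1]
    return 1.0 - majority_count / len(sampled_answers)

def filter_by_uncertainty(responses, uncertainties, threshold=0.5):
    """Keep only responses whose estimated uncertainty falls below a threshold,
    mirroring the filtering-style evaluations described above."""
    return [r for r, u in zip(responses, uncertainties) if u < threshold]
```

The failure mode mentioned above is baked in here: `mean_token_entropy` will be high whenever many phrasings are acceptable, even if the model isn’t in any doubt about the answer.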
You can have a prior without being Bayesian
Priors aren’t just useful for Bayesians — they can also give you a (frequentist) PAC-Bayes bound, or a baseline for evaluating compression rates. I’ve written plenty previously on PAC-Bayes bounds, which, despite their name, turn out not to be actually Bayesian, so I won’t go into a lot of detail on this one. Two lines of work mentioned at the workshop gave interesting updates on learning theory: one was Pierre Alquier’s work on PAC-Bayes bounds for offline and online learning (see here for a recent overview of that area), and the other was an effort to get PAC-Bayesian bounds that “explain” deep learning. As I’ve written previously, I’m a bit skeptical of the ability of generalization bounds to give a causal explanation of generalization, but I was pleasantly surprised to find that current bounds are at least a lot tighter than they were in the past and are a bit closer to capturing phenomena like double descent. One particularly exciting direction that’s seen recent progress is a shift towards notions of complexity that actually seem to align with generalization in the real world.
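For concreteness, the flavour of statement involved is the classical McAllester-style bound: for a loss bounded in \([0,1]\), with probability at least \(1-\delta\) over an i.i.d. sample of size \(n\), simultaneously for all distributions \(Q\) over hypotheses,

\[
\mathbb{E}_{\theta \sim Q}\big[L(\theta)\big]
\;\le\;
\mathbb{E}_{\theta \sim Q}\big[\hat{L}_n(\theta)\big]
+ \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}},
\]

where \(L\) and \(\hat{L}_n\) are the population and empirical risks, and the “prior” \(P\) only needs to be fixed before seeing the data; it doesn’t have to encode anyone’s subjective beliefs (the exact constants vary between versions of the bound).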
The forward pass of a well-trained LLM is a little bit Bayesian
This idea has been around for a while, but I think there are some important nuances in what I mean by the forward pass being a little bit Bayesian, because there are multiple levels at which one can view the forward pass and outputs of a sequence model. At the token level, a sequence model can be said to infer the next-token likelihoods under its training distribution. This type of implicit conditioning, while it does depend on the context by definition, is different from the kind of general in-context learning that allows LLMs to perform gradient descent or linear regression on new datasets.
Showing that a language model can do explicit Bayesian inference, for example by telling it “Assume a Gaussian prior and suppose that you have seen the following data points — what is your posterior?”, is a different problem. It is also fairly trivial to show that exact Bayesian inference will be out of reach for a single forward pass due to computational considerations (unless the model is provided a scratchpad). Approximate Bayesian inference is still fair game, however, and it turns out that the Bayesianists have been hard at work over the past couple of years demonstrating that forward passes can perform more generic in-context Bayesian reasoning than what you would expect from learning a conditional distribution over the internet-scale pretraining distribution. For example, it’s possible to use in-context computation to get reasonable uncertainty decompositions out of LLMs.
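As a point of reference for the kind of prompt above, the simplest conjugate case has a closed-form answer that the model’s output can be checked against. Here is a minimal sketch for a Gaussian prior on an unknown mean with known observation noise (the function name and example numbers are my own, purely illustrative):

```python
def gaussian_posterior(prior_mean, prior_var, obs, noise_var):
    """Exact posterior over an unknown mean mu under a conjugate Gaussian prior.

    Model: mu ~ N(prior_mean, prior_var), each x_i ~ N(mu, noise_var).
    Returns the posterior mean and variance of mu given the observations.
    """
    n = len(obs)
    post_precision = 1.0 / prior_var + n / noise_var     # precisions add
    post_var = 1.0 / post_precision
    post_mean = post_var * (prior_mean / prior_var + sum(obs) / noise_var)
    return post_mean, post_var

# A N(0, 1) prior, unit observation noise, and three data points:
mean, var = gaussian_posterior(0.0, 1.0, [1.2, 0.8, 1.0], 1.0)
print(f"posterior: N({mean:.3f}, {var:.3f})")  # -> posterior: N(0.750, 0.250)
```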
Bayes is about adaptation
One interesting point that came up a few times during the workshop is the idea that Bayes’ theorem should be viewed as a means of characterizing “optimal” adaptation, and that Bayesians should be focusing their efforts towards more adaptive systems. This perspective has been elaborated on extensively by Emtiyaz Khan, for example in his CoLLAs talk. From my experience in continual/reinforcement learning, we still don’t have a great intermediate update between in-context and in-weights learning that allows for expressive but rapid adaptation that doesn’t conflict with prior knowledge, but I agree that the properties exhibited by Bayesian posteriors are pretty much exactly what you want from a continual learning system.
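The cleanest way to see the connection is the recursive form of Bayes’ rule, in which yesterday’s posterior becomes today’s prior:

\[
P(\theta \mid \mathcal{D}_{1:t}) \;\propto\; P(\mathcal{D}_t \mid \theta)\, P(\theta \mid \mathcal{D}_{1:t-1}),
\]

so new evidence is absorbed without revisiting old data, and beliefs only move where the new likelihood actually has something to say; this is more or less the wish list for a continual learning update.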
The aleatoric-epistemic dichotomy isn’t the only way to view uncertainty
Finally, one really interesting discussion point that came up during the workshop was the weakness of the epistemic-aleatoric decomposition of uncertainty. Although in theory this decomposition falls out nicely from the conditional entropy, in practice it conflates semantically distinct sources of uncertainty. It can also be problematic in cases where defining uncertainty with respect to the variance of a distribution gives different properties than defining it with respect to its entropy (particularly in the case of continuous random variables). This distinction is important if we are, for example, trying to model uncertainty in a regression problem.
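For reference, the decomposition in question is the usual split of the predictive entropy into an expected (aleatoric) term and a mutual-information (epistemic) term:

\[
\underbrace{\mathcal{H}\!\left[\,\mathbb{E}_{P(\theta \mid \mathcal{D})}\, P(y \mid x, \theta)\,\right]}_{\text{total}}
\;=\;
\underbrace{\mathbb{E}_{P(\theta \mid \mathcal{D})}\, \mathcal{H}\!\left[P(y \mid x, \theta)\right]}_{\text{aleatoric}}
\;+\;
\underbrace{I(y; \theta \mid x, \mathcal{D})}_{\text{epistemic}} .
\]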
Epistemic uncertainty is quite vague, and one thing papers like this one have convinced me of is that it is a wide umbrella term for a number of concepts that are useful to distinguish. While we normally think of epistemic uncertainty as the entropy of the belief over model parameters \(P(\theta)\), there are a number of different reasons why one might be uncertain about these parameters: the model class might be misspecified (for example, you’re trying to fit a quadratic function with a linear family), your learning procedure might not have found the best fit to the data within the model class, and the data you are currently showing the model might not look much like the data you used for fitting. A model which is uncertain because the data looks out-of-distribution is uncertain in a qualitatively different way than one which is uncertain because its prediction on the datapoint is contingent on randomness in its training process, and although these two notions of uncertainty might correlate (as is the hope of people using ensembles for OOD detection), it is important to know which one your method is actually measuring in cases where they diverge.
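To make the “which one is your method actually measuring” point concrete, here is a minimal sketch of the standard ensemble-based estimate of the decomposition above; note that it only ever sees disagreement between ensemble members, so all of the distinct reasons for parameter uncertainty listed above get lumped into a single “epistemic” number.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy (in nats) along the class axis."""
    return -np.sum(p * np.log(p + eps), axis=axis)

def uncertainty_decomposition(member_probs):
    """Split an ensemble's predictive uncertainty into aleatoric and epistemic parts.

    `member_probs` has shape (n_members, n_classes): each row is one ensemble
    member's predictive distribution for a single input.
    """
    mean_probs = member_probs.mean(axis=0)              # ensemble predictive distribution
    total = entropy(mean_probs)                         # entropy of the averaged prediction
    aleatoric = entropy(member_probs, axis=-1).mean()   # average per-member entropy
    epistemic = total - aleatoric                       # mutual information I(y; theta)
    return total, aleatoric, epistemic

# Members that agree on an uninformative prediction: high aleatoric, ~zero epistemic.
agree = np.array([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
# Members that are individually confident but disagree: low aleatoric, high epistemic.
disagree = np.array([[0.99, 0.01], [0.01, 0.99], [0.99, 0.01]])
print(uncertainty_decomposition(agree))
print(uncertainty_decomposition(disagree))
```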