AI epistemics mega-series, part 2
This is part 2 of my “unpack why AI conference reviewing is terrible” mega-series. In my first post, I put forward the conjecture that a major cause of dissatisfaction with AI as a field is that it’s a huge umbrella covering several subfields that are almost as different from each other as physics is from biology.
In this post, I’m going to flesh out exactly what these differing epistemic standards can look like, based mostly on my personal experience working in a few different branches of machine learning.
The tools that a scientific field develops are a function of the phenomena that the discipline studies. So too are its epistemics\(^1\). Astronomers use telescopes and observational data; medical researchers use centrifuges and randomized controlled trials. In medicine, a drug might be considered effective even if it didn’t cure every patient who received it; a law of gravity that only applied to people with a certain gene would be viewed with great skepticism by physicists.
Because the phenomena that different fields study admit different degrees of manipulation, fields that study things less amenable to intervention must necessarily have lower standards of evidence when determining whether a work is publishable. Roughly speaking, in terms of producing reliable knowledge of how the object of study works, building something from scratch yields stronger evidence than perturbing an existing system, which in turn yields stronger evidence than staring at something for a very long time. This hierarchy of epistemic standards generally corresponds to the popular notion of how rigorous different fields are. Physics and chemistry (at least historically) have been able to manipulate their objects of study in exquisite detail, so findings in these fields tend to be highly replicable and carry great predictive power, because the experimenter can eliminate essentially all sources of confounding. In contrast, econometricians only have access to observational data and must make do with what they have, with the consequence that when an economist says some policy will hurt the economy, their opinion is weighted less heavily than that of a chemist who says that chlorofluorocarbons can cause chain reactions that deplete the ozone layer.
As Karl Popper was wont to say, scientists never “prove” their hypotheses – they just fail to disprove them, and the credence assigned to a claim is a function of the rigor and quantity of falsification attempts that the claim has withstood. Econometrics carries less predictive power than chemistry not because econometricians are uniformly less capable than chemists, but because economies are intrinsically more complex and intractable systems than chemical bonds and do not lend themselves to the types of interventions that would allow a scientist to rigorously attempt to falsify a hypothesis. It is difficult – not to mention often immoral – to construct rigorous tests for an econometric theory, and as a result virtually any widely accepted theory about the economy has withstood many fewer and much less rigorous falsification attempts than the typical widely accepted theory about electron orbitals.
Nonetheless, economies are quite important to people’s well-being, and we shouldn’t abandon all attempts to study them just because we can’t apply epistemic standards as rigorous as those we apply to theories about the structure of the atom. The fact of the matter is that there are a lot of important phenomena worthy of study in the world which we can’t arbitrarily manipulate. Precisely because we can’t control these things, it’s arguably all the more important to study them and understand the ways we can change them, even though doing so is much harder. If we held such a field to the same standard as we hold nuclear physics, nobody would ever be able to publish anything and it would be impossible to make progress. So it’s important to give breathing room to hypotheses that are hard to falsify due to the intractability of the domain they’re applied to, to make room for discoveries like evolution and plate tectonics.
The same principle applies to AI, too.
At one extreme, researchers studying small neural networks and decomposing their internal processes to identify how the network learns specific skills and concepts are as close as our field gets to 19th-century physicists trying to understand the 18th-century technology of the steam engine.
At the other extreme, there is an emerging field of what can perhaps best be referred to as “LLM Psychology”, which tries to characterize the higher-level phenomena that emerge in larger-scale models. Since only a handful of models at any given time are considered interesting enough to study, this is, I suppose, analogous to something like “Pro Korfball Psychology” or “Psychology of individuals with 11.5 fingers”. That is to say, you can certainly discover things about a particular model (and given how little we know about frontier models with respect to latent capabilities, security vulnerabilities, etc., this can still be a useful thing to do), but you won’t know whether what you discovered will generalize to frontier models that haven’t been released yet, except in trivial and uninteresting ways like “GPT-4 could ace the LSAT, so we predict that GPT-5 will too”.
But this blog post has so far been a lot of philosophizing, so I want to give a couple of concrete examples of how different epistemic norms play out in research by referring back to my own projects.
When I was young and naive, my PhD advisor warned me off of RL. His concerns, which in hindsight were entirely accurate, could be roughly summarized as “The papers are impossible to reproduce and the variance of your experiments is so high that random seed tuning has a bigger effect size than most algorithmic changes.” If you can’t reproduce other people’s results and shifting your random seed by one permutes the rankings of the algorithms you’re comparing against, it is hard to do good science.
This isn’t a new observation. Papers talking about how bad RL’s epistemic standards are frequently win best paper awards at major ML conferences. People have come to a consensus that three random seeds on a single task is not enough to show that your method is meaningfully better than the baseline. But there are limits to how far just “running more seeds” can take you as a field. RL performance measures are discrete and noisy, because the system whose performance they measure is almost adversarially chaotic. If your agent got lucky and stumbled into a large reward early in training, it will get a giant score at the end of training, whereas the exact same agent with an unlucky seed that didn’t stumble into the reward early enough might get a score of zero. Given how long it takes to run a single “trial” of an RL experiment (for context, one seed on one environment of the ALE takes about 5 GPU-days to collect the standard 200M frames of experience on a V100 GPU, and we typically train 3 seeds x 57 environments), it’s often impossible to run enough seeds within your compute budget to say with statistical significance whether one method is better than another. And reviewers often don’t ask for a specific number of seeds that they think would provide more significance; they just ask for more. Worst of all, RL benchmarks often include a number of different environments, each of which is evaluated with a small number of seeds. This makes comparisons across algorithms even harder: one has to figure out how to aggregate across non-IID scores, and to the best of my knowledge there is no established best practice for this decision.
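To make the aggregation problem concrete, here’s a minimal sketch of one reasonable approach (an illustration, not the procedure from any particular paper): summarize a seeds-by-environments score matrix with an interquartile mean, and put a stratified bootstrap confidence interval around it, resampling seeds within each environment so that the non-IID structure across environments is respected. The array shapes and the fake gamma-distributed scores are assumptions purely for illustration.

```python
import numpy as np

def iqm(scores: np.ndarray) -> float:
    """Interquartile mean: the average of the middle 50% of scores."""
    flat = np.sort(scores.ravel())
    n = len(flat)
    return float(flat[n // 4 : n - n // 4].mean())

def stratified_bootstrap_iqm(scores: np.ndarray, n_boot: int = 2000,
                             alpha: float = 0.05, seed: int = 0):
    """IQM and bootstrap CI for an (n_envs, n_seeds) score matrix.

    Seeds are resampled within each environment (stratified bootstrap),
    since scores from different environments are not identically distributed.
    """
    rng = np.random.default_rng(seed)
    n_envs, n_seeds = scores.shape
    boot_stats = []
    for _ in range(n_boot):
        resampled = np.stack([
            env_scores[rng.integers(0, n_seeds, size=n_seeds)]
            for env_scores in scores
        ])
        boot_stats.append(iqm(resampled))
    lo, hi = np.quantile(boot_stats, [alpha / 2, 1 - alpha / 2])
    return iqm(scores), (float(lo), float(hi))

# Hypothetical example: 57 environments x 3 seeds of normalized scores.
scores_a = np.random.default_rng(1).gamma(2.0, 1.0, size=(57, 3))
scores_b = np.random.default_rng(2).gamma(2.2, 1.0, size=(57, 3))
print(stratified_bootstrap_iqm(scores_a))
print(stratified_bootstrap_iqm(scores_b))
```

A procedure like this doesn’t settle how many seeds are “enough”, but it at least makes the uncertainty in the aggregate score visible instead of hiding it behind a single mean.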
Someone coming from, say, on-device computer vision would likely recommend rejecting most RL papers out of hand due to the lack of statistical significance. Indeed, even within RL, standards for what counts as “enough” seeds or “enough” environments fluctuate significantly between domains. In small-scale tasks or highly parallelizable environments, it’s common to run 20 or more random seeds; in Atari, one usually runs 3-5 seeds of the algorithm on 57 environments and then aggregates across environments. The paper proposing SAC, one of the most widely used off-policy RL algorithms today thanks to its simplicity and consistency, ran five random seeds on six environments in its main evaluation experiment, i.e. a total of 30 samples of the algorithm.
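For a sense of why a statistician might balk at numbers like these, here is a toy sketch, with entirely made-up returns, of the kind of significance test a reviewer might have in mind for a single environment. With only five seeds per method, even a sizable average improvement can easily fail to clear p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Entirely made-up final returns for two algorithms on one environment,
# five seeds each. The "new" method is better on average, but the
# seed-to-seed variance is large relative to the gap.
baseline_returns = rng.normal(loc=3000, scale=1200, size=5)
new_returns      = rng.normal(loc=3800, scale=1200, size=5)

# Welch's t-test (does not assume equal variances). With n=5 per group the
# test is badly underpowered, so a real improvement of this size will often
# fail to reach conventional significance.
t_stat, p_value = stats.ttest_ind(new_returns, baseline_returns, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```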
Would a statistician have been right to reject the paper due to a lack of statistical significance? Possibly – trying to do statistical analysis on a learning curve is a messy undertaking, and it’s possible that the six environments only show that SAC sometimes outperforms other methods on some tasks. But ultimately the reason why people use SAC today is not because they read the paper and decided it demonstrated a statistically significant improvement – they use it because a thousand other people got it to work, and this is a good signal that it’s a reliable, easy-to-implement algorithm.
I think this is true in many fields, but it’s especially true in RL, where the cycle of science proceeds as follows: various papers of unknown scientific quality are published each year, and then thousands of PhD students attempt to replicate their results to get baselines for their own papers. The methods which are easy to replicate and produce reliable results get included in those papers and thus cited, while the methods which don’t are left as a sad remnant in someone’s commit history and tend to be forgotten. As a result, a high citation count is often a useful signal that an RL algorithm will reliably work if you clone the corresponding GitHub repo, and in this way the field is able to progress.
Although wasteful, this status quo to some extent works. It has many obvious flaws: many grad students waste their time and compute budgets attempting to replicate unreplicable papers, and some potentially good methods get overlooked because they fail to gain traction in the collective psyche of the RL community. But if you look at the most commonly used RL algorithms in papers, they tend to be ones with reliable implementations which are easy to reproduce and which produce decent results.
Unfortunately, the peer review process is highly stochastic, and reviewers tend not to apply uniform standards for what counts as sufficiently strong evidence to publish a method and let trial-by-replication dynamics determine whether it gets picked up. The result is that papers often go through multiple rounds of peer review, accumulating more evaluations with each cycle, until, through some combination of luckily drawing reviewers with lower statistical standards and improving the quality of the supporting evidence, the paper gets published. This process is doubly wasteful: inconsistent reviews mean that publication is not a sufficient signal of quality, yet the barrier to publication is still high enough that many researcher-hours are wasted reviewing and revising papers that are ready to be battle-tested by replication.
It is an unfortunate fact that mathematically proving almost anything nontrivial about a deep neural network is disgustingly hard. Because of this, any time I propose a new algorithm, I try to prove that it does something reasonable in the linear function approximation setting to justify it to myself. Then I include the proof in the eventual paper, because even though it doesn’t say anything useful about the version of the method I use in the experiments section, it makes the paper look more legit and reduces the risk that a reviewer will call it unrigorous.
I’m not alone in doing this. Especially in reinforcement learning, it is very common to do theory on linear systems trained with infinitesimal step sizes. As I’ve mentioned in previous posts, the narrative of “here’s a linear system where our method does something vaguely reasonable, now please trust our results on neural networks where we changed the method beyond recognition to stop our Q-value estimates from diverging” is generously sprinkled throughout many of my papers.
This paper mentioned in the classic Troubling Trends in Machine Learning Scholarship is one example, and I’ve also had several papers where there is a large gap between a theoretical statement and the method it is used to justify (for example, in this paper we use an argument about linear regression to justify a performance estimator for neural architecture search on very-much-not-linear functions). People have complained about this so-called “mathiness” for a long time, so I won’t moralize about it too much. Yes, some of the motivation behind adding lots of equations to a paper is sociological (i.e. to intimidate reviewers into a state of awed confusion). But it’s also worth viewing it from the perspective of how knowledge is demonstrated. The devil’s advocate case for mathiness is that math is an expressive language for describing phenomena rigorously, and it can actually be scientifically useful to include an analysis in an idealized setting to build intuition for your method, or to decompose its mechanism of action into pieces that you can understand individually.
Suppose you’ve come up with a new algorithm, and want to show its empirical superiority over the existing state of the art. You run the algorithm on the benchmark of your choice, but are shocked and disappointed to discover that your results are mediocre at best. This could be happening for one of two reasons. First, your method might be doing the thing you wanted it to do, but that thing doesn’t make performance go up. Second, your method might not actually be doing the thing you thought it was.
In the first case, if “the thing you want it to do” is hard to measure or demonstrate in general, adding some math can be really useful, because a proof holds for everything that satisfies its assumptions. Maybe your method is an approximation of a function that provably bounds the spectral norm of your weight matrices. You don’t have to deal with reviewers saying “oh but you only tested on convolutional networks” or “but will this also apply to larger models?” because, assuming those cases are covered by your theorem statement, they hold by construction. If your method doesn’t improve performance despite provably doing the thing it was supposed to do, then you need to rethink what makes a model do well on your benchmark. This is annoying, but at least it’s a smaller search space of problems.
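To make the spectral-norm hypothetical concrete, here is a minimal sketch (an illustration, not any particular published method) of a constraint that holds by construction: estimate the spectral norm of a weight matrix via power iteration and rescale the matrix so the norm never exceeds a chosen bound, regardless of the architecture or scale the matrix came from.

```python
import numpy as np

def spectral_norm(w: np.ndarray, n_iters: int = 50) -> float:
    """Estimate the largest singular value of w via power iteration."""
    rng = np.random.default_rng(0)
    v = rng.normal(size=w.shape[1])
    for _ in range(n_iters):
        u = w @ v
        u /= np.linalg.norm(u)
        v = w.T @ u
        v /= np.linalg.norm(v)
    return float(u @ (w @ v))

def clip_spectral_norm(w: np.ndarray, max_norm: float = 1.0) -> np.ndarray:
    """Rescale w so that its spectral norm is at most max_norm.

    The returned matrix satisfies the bound for any input matrix, which is
    the kind of guarantee a theorem can give you and an empirical sweep over
    architectures cannot.
    """
    sigma = spectral_norm(w)
    return w if sigma <= max_norm else w * (max_norm / sigma)

w = np.random.default_rng(1).normal(size=(256, 128))
w_clipped = clip_spectral_norm(w, max_norm=1.0)
print(spectral_norm(w), spectral_norm(w_clipped))  # second value <= ~1.0
```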
For example, I worked on a paper where we tried to develop a method to keep the effective learning rate in a network roughly constant, but this approach resulted in worse performance in RL. Because we knew that the method was indeed keeping the effective learning rate constant, we were able to conclude that a constant effective learning rate is bad for DQN agents in Atari games. This then opened the door for us to develop some interesting insights into the role of the effective learning rate in deep RL and to explore several different schedules for it, eventually finding one that did beat the baseline DQN agent.
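As a simplified illustration of what “checking that the method does the thing” can look like: one common notion of the effective learning rate is the size of each update relative to the size of the weights, and you can simply log that quantity during training to verify it stays roughly constant. This is a sketch under that assumed definition, not the instrumentation from the paper itself.

```python
import numpy as np

def effective_learning_rate(weights: np.ndarray, update: np.ndarray) -> float:
    """Relative update size ||delta_w|| / ||w||, one common proxy for the
    effective learning rate (an assumed definition, for illustration)."""
    return float(np.linalg.norm(update) / np.linalg.norm(weights))

# Hypothetical monitoring loop: log the effective LR of a layer every step,
# so a claim like "our method keeps it roughly constant" is checkable.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
for step in range(5):
    grad = rng.normal(size=w.shape) * 0.01   # stand-in for a real gradient
    delta = -0.1 * grad                      # plain SGD update, lr = 0.1
    print(step, effective_learning_rate(w, delta))
    w = w + delta
```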
At the end of the day, when you’re reading someone else’s results, the questions that go through your head are “Is this important?” and “Is this true?”. The “is this important” question is fairly subjective, but the “is this true” question is pretty foundational. Important but wrong papers can throw spanners in the works of intellectual progress that can take years or even decades to recover from, and research communities need to calibrate their norms so that slightly sketchy but promising ideas are fostered without being over-indexed on. This is especially challenging for AI because the types of phenomena we study vary in the degree to which they can be manipulated for falsification experiments and in how much noise can be expected in their measurement, and this variability makes it difficult to decide whether a particular paper is trustworthy or not. There’s definitely been a push in recent years for good statistical and scientific practice standards, such as reproducibility checklists and standardized statistical tests, but it’s clear the field still has a ways to go to fully converge on a set of epistemic norms.
1. The means by which the field decides whether a new result counts as knowledge, for readers who didn’t hang out with philosophy majors in college.