Part 1: branches in the tree of science
Machine learning researchers love to complain about peer review. Other scientific fields also like to complain about peer review, but our habit of all submitting papers to the same three conferences means that everyone’s terrible reviews come out at the exact same time, so our complaints tend to get amplified much more by The Algorithms. As a result, every four months one gets the impression that the field has been overrun by sophons hell-bent on fighting the scientific process. This synchronization allows for a remarkable sense of community during the dark days of the rebuttal period, and a battlefield camaraderie that persists even after. I have my suspicions that part of the reason why the peer review system has been allowed to continue in its current state is that otherwise we would have nothing to talk about at conference parties.
In addition to my personal favourite of providing conversation fodder, a number of more plausible scapegoats (arms race dynamics among PhD applicants, poor incentives for high-quality reviews, trendiness attracting work from at-best-tangentially related fields whose members are unfamiliar with our norms) are called out every conference cycle. While I don’t dispute any one of these, I have a difficult time believing that there are enough clueless undergrads and anti-social Scrooges in the reviewer and author pools to explain the field’s animus on this topic. An incompetent or confused reviewer, while frustrating, is usually straightforward to either refute or guide towards a more accurate assessment of your work (especially if said reviewer is threatened with cruel and unusual punishment for not reading your rebuttal).
Much more frustrating is the review which was clearly written by someone who knows a lot about machine learning, but who just didn’t get your paper. This is the extensive, vituperative review which so grossly pillories your paper’s contribution that you have to question the mental health of its author. The review which denies the validity of not only your research agenda, but the entire body of work on which you have based your contribution. The review which states “Sure you say you got state of the art results, but how do we know for sure that you didn’t hyper-tune your way to sota unless you also have a mathematical proof?” or “The math looks correct (though I didn’t read it in detail), but I won’t believe it unless you run an experiment with at least a 2-billion-parameter transformer.” In these cases, the driving force behind the conflict isn’t ineptitude on the part of authors or reviewers, but rather a fundamental disagreement on what counts as good AI research. Put concretely:
Machine learning peer review sucks because we don’t have a unifying set of epistemic and scientific norms in our field.
Scientists will always, no matter the field, disagree on what counts as good research. In a stable field, these disagreements are usually fairly minor, such as what statistical test you should use for a particular type of data generating process, or what variant of genetically identical mouse you should test your drug candidate on. When these disagreements become irreconcilable, the field branches. Natural philosophy split off from the rest of philosophy because understanding things you can interact with via your senses requires a very different methodology than understanding things that only exist in thought-space. Later on, natural philosophy split again into, according to an only-slightly-dubious\(^1\) attribution to Ernest Rutherford, “physics or stamp collecting”.
If your field clusters tightly in the class of problem it considers, it’s easy for groups of people to get on with the grunt work of science without falling into meta-scientific rabbit holes. Consider the physicist-naturalist divide in eighteenth-century Europe. A physicist might study balls rolling down hills and get on very well with other physicists who studied balls falling from towers, but would not get on at all (scientifically at least) with a naturalist who studied the birds living on a particular hill, even if those birds occasionally slipped and rolled down it. The physicist might even go so far as to call the poor naturalist a “stamp collector”. Fortunately, in a remarkable show of restraint, physicists like Rutherford generally left their criticism of biology at snide remarks made to other physicists, and the occasional occurrence of a bird rolling down a hill did not interfere with the independent functioning of the two fields. The physicists critiqued and improved each other’s math, and the naturalists critiqued and improved each other’s classifications of birds. Importantly, the physicists did not waste their time critiquing the lack of quadratic equations in the biologists’ taxonomies, and the biologists didn’t waste their time debating whether it is reasonable to assume a bird is spherical for the purpose of a derivation.
Despite being slightly offensive, the physics-stamp-collecting dichotomy nonetheless gestures at an important spectrum on which different scientific fields can be delineated: the extent to which the field constructs general rules to predict the phenomena it studies vs catalogues and describes existing phenomena. And this isn’t the only spectrum one might consider. For example, as biologists developed a more mechanistic understanding of the inner workings of the organism, the field expanded along the explain-vs-document and the build-an-artifact axes, culminating in artificial cells and gene editing, and split into separate fields: the mechanistic study of the cell begot cell biology and biochemistry, the build-an-artifact direction led to synthetic biology, and the more observational quadrant became ecology. As the field branched, each of its new branches adopted different epistemic standards based on the degree to which the object of study could be manipulated.
Now the ecologists have their own journals where they propose mathematical models and collect data on frog populations, the cell biologists write articles about which signaling pathway a new drug is acting on, and the biomechanists confirm that, yes, the uncomfortable-looking posture that Tour de France cyclists adopt is aerodynamically optimal based on their fluid dynamics simulator. Major discoveries will usually find their way into broader-interest publications like Nature or Science so that people in other fields eventually hear about them, but for the purposes of reviewing and publication the subfields mostly operate independently. After all, why would someone who spent ten years developing world-class expertise in the mTOR pathway be able to evaluate someone else’s fluid simulation of elite swimmer hydrodynamics?
Why indeed. Anyone who has reviewed for a machine learning conference understands why this would not be a desirable situation, because it is the situation we often find ourselves in when paper assignments come around. I personally have had to review papers on topics spanning optimal transport (for which I was woefully unprepared as a first-year PhD student), causality, probabilistic verification, neural collapse, biologically-inspired neural network architectures, and reinforcement learning. These days I have a large enough body of work that the TPMS usually does a good job filtering relevant papers to bid on, but god forbid I ever miss a bidding deadline, or I’ll end up reviewing papers with fifty-page appendices of optimal transport proofs or a new biologically inspired update rule that’s really going to be scalable this time guys, promise.
If we define AI as “The field which consists of anything considered on-topic at the set union of \(\{\)NeurIPS, ICML, ICLR\(\}\)”, it’s safe to say that AI is not a stable field (per the definition above – if you’re not already convinced of this I recommend that you spend a few moments perusing some of history’s more involved OpenReview threads).
Walking around the poster session at any AI conference, you’ll notice a pretty wide range of epistemic norms. We have some papers which look a lot like physics in that they use math to make predictions about learning systems, some which look like engineering in that they say “here’s a system that does something new/cool/important”, and others which look a lot like “stamp-collecting” (in a non-pejorative sense) in that they identify interesting phenomena in particular learning systems without giving a mechanistic explanation or using the observation to improve on some benchmark. As a result, outside of the physics Nobel committee, there remains a lot of uncertainty among researchers about where AI as a field falls (or should fall) on this spectrum – or indeed any spectrum. There is enough variability across subfields that given pretty much any spectrum, you can find some branch of AI that lives on either extreme. The divide characterized by the Ernest Rutherford quote splits along the “predicting vs documenting” axis (see e.g. “physics of deep learning” papers vs the original adversarial examples paper), but there are plenty of others, such as the “can-you-manipulate-the-thing-you’re-studying” axis (where academic labs studying leviathan language models are much closer to the “no” extreme and training a small transformer on a synthetic dataset is on the “yes” end) and the “did-you-make-the-thing-you’re-studying” axis (contrast papers studying models they had no hand in building with neural architecture search).
Impactful research can happen at any point on these axes. The challenge is that the properties that make research impactful do depend on where you are in the space of scientific problems. This becomes a problem for machine learning, where papers are quite diffusely distributed across the different axes, as opposed to clustering tightly at one vertex of the meta-scientific hypercube.
I’ll give a more detailed account of this mismatch in later posts, but here are a few archetypes of what I mean.
Recalling the physics-stamp-collecting dichotomy, the\(^2\) problem that AI faces now is that it is (metaphorically) a field that studies birds rolling down hills, and the physicists who assume the birds are spherical are reviewing papers from biologists who want to characterize the lifecycle and social structures of the bird colonies.
Unfortunately, for now we are in an awkward position where the field is big and diverse enough to make peer review unpleasant but not quite big enough that the subfields have hit critical mass to stand on their own. While there are small conferences dedicated to just reinforcement learning or just language models, these haven’t displaced NeurIPS/ICLR/ICML as the gold standard for publication. If you have a big result, you’re going to submit it to NeurIPS and hope that the people who review it and see it at the poster session are in the 5% of attendees who care about your subfield. This standard means that the last NeurIPS I attended had as many participants as a medium-sized university. It also means that communities with very different epistemic standards are brought into contact with each other at least three times every year.
In my next post, I’ll give a more concrete overview of my stamp collection of the epistemic norms that have formed in the different subfields of AI that I’ve had the privilege of reviewing for.
1. The quote is often attributed to Ernest Rutherford, but this is a bit of hearsay from his colleague John Bernal.
2. Well, one.