A Bayesian Perspective on Training Speed and Model Selection
TL;DR: Models that train faster (with respect to the number of data points they need to fit the dataset) have a higher marginal likelihood. We leverage this to get estimators of the (log) marginal likelihood in linear models that depend on the sum of training losses obtained in an iterative updating procedure. We show an intriguing connection between these estimators and the weight assigned to features in linear regression, and find that the intuition driving these estimators also seems to hold in deep neural networks.
#Introduction
Suppose that you’re trying to build a hotdog classifier, and you have a choice between two neural networks: network A has a cross-entropy loss of 0.01 on the training set, and it reached that loss extremely quickly. Network B obtains a cross-entropy loss of 0.0001 on the training set, but it took a lot longer to get below 0.01 than Network A. You want to use the model that will be the most accurate on new, unseen possible-hotdogs: which do you pick?
A difficult model selection problem.
If you had held out some training data as a validation set, you could pick the network with better performance on that unseen data and be done. But suppose you don’t have access to a validation set. Is there still a way of justifying the intuition that the model that got a pretty low loss quickly is less likely to have overfit than the model that got an extremely low loss slowly? It turns out the answer is yes – if you’re willing to be Bayesian.
Background on Model Selection
TL;DR:

- Model selection and generalization are deeply connected
- Generalization in neural networks is not well understood
- There are some generalization bounds that are related to training speed
- A Bayesian approach to model selection is to pick the model with the highest marginal likelihood
###Model selection in neural networks is hard.
Before digging into the Bayesian perspective, we’re going to give some context for the model selection problem in deep learning. In most machine learning applications, our goal is to produce some function approximator that fits a target function on some unknown data-generating distribution, from which we have samples in the form of a training set. Typically, some points from this training set are hidden from the model during training, and an estimate of the model’s expected risk (its average error on the data-generating distribution) is obtained by computing the validation loss on this held-out data. The model with the lowest validation loss is assumed to also have the lowest expected risk. While the validation loss is great as an estimator of expected risk, it’s not ideal for hyperparameter tuning because a) it’s expensive to optimize and b) it doesn’t explain why a model will generalize well.
Generalization bounds can be used to address both of these issues. A generalization bound gives a high-probability guarantee that the test set error will not exceed a certain value, and can typically be decomposed into an empirical risk estimate (i.e. the training loss) and a model complexity penalty. The model complexity penalty can be interpreted as giving a causal hypothesis for what makes a model generalize well, addressing point b) above. When both of these terms are differentiable, the bound can be optimized directly, addressing point a) above. While generalization bounds can attain non-vacuous values on neural networks, they’re generally not tight enough to be useful for model selection. We conjecture that part of the reason these bounds struggle in the deep learning regime is that the model complexity term typically depends only on the final value of the model’s parameters after training, rather than taking into account properties of the training scheme. Some recent work has developed PAC-Bayesian generalization bounds based on a similar idea, but there’s likely much more that could be explored.
The idea that models which train faster should generalize better is not a new one: generalization bounds based on the stability of gradient descent have existed for years. These bounds explicitly depend on the number of training steps taken to reach a minimum. However, they also require assumptions that aren’t easy to verify in the generic deep learning setting. There have also been some more oblique connections in the literature. For example, Arora et al. propose a generalization bound with a data complexity term that can be related to an upper bound on the convergence rate of gradient descent on a convex function.
Instead of trying to directly solve the model selection problem for neural networks, we’re going to take a digression into an easier setting, Bayesian models, and develop some theoretically grounded intuition on the model selection there. Don’t forget about this section though – we’ll come back to it in a bit.
###Bayesian Inference in 2 Minutes
Being a Bayesian is all about using Bayes’ rule to update prior beliefs based on evidence. For example, given a model \(M\) and some data \(D\), our belief that the model generated the data can be expressed in terms of the likelihood the model assigns to the data, our prior belief in the model, and the marginal probability of the data.
\[P(M|D) = \frac{P(D|M)P(M)}{P(D)}\]
To quantify terms like ‘evidence’ and ‘prior belief’, we translate some standard quantities from the risk minimization framework into probabilistic analogues. Instead of optimizing the parameters of a function using its loss on the data, we have a model \(\mathcal{M}\) which defines a probability distribution over the data \(P(\mathcal{D}|\theta)\) given some parameters \(\theta \in \Theta\), along with a probability distribution over the parameters \(P(\theta)\).
One quantity of interest to Bayesians is the posterior distribution over parameters \(P(\theta|\mathcal{D})\): finding this distribution is known as parameter inference. While this computation is straightforward in some models like Gaussian processes, it’s often not tractable to compute exactly for complex models like Bayesian neural networks. However, it’s still possible to get samples from this distribution via techniques such as ensemble methods, where we draw a number of parameter samples from the prior and then optimize each sample independently to obtain a sample from the posterior.
The quantity that we will be particularly interested in is the marginal likelihood \(P(\mathcal{D})\) (we’ll abbreviate this as ML from here on out). This tells us how likely the data is under the model, and can be written as follows.

\[P(\mathcal{D}) = \int_\Theta P(\mathcal{D}|\theta) P(\theta) \, d\theta\]
If we have a collection of models, picking the one with the highest marginal likelihood is what MacKay calls Type II maximum likelihood estimation and is a popular approach to Bayesian model selection. Maximizing the marginal likelihood is less prone to overfitting than maximum likelihood estimation at the parameter level, since the marginal likelihood comes with a built-in model complexity penalty.
#Estimating the Marginal Likelihood
Decomposing the log ML
OK, so if we want to perform model selection by maximizing the marginal likelihood, we’ll need to compute the marginal likelihood first. Observe that we can write the (log) ML as follows.

\[\log P(\mathcal{D}) = \sum_{i=1}^{n} \log P(\mathcal{D}_i|\mathcal{D}_{<i})\]
In other words: the marginal likelihood measures how well a posterior update based on one subset of the data \(\mathcal{D}_{<i}\) is able to predict the next data point \(\mathcal{D}_i\). We can visualize the log ML as computing the ‘area under the curve’ of posterior predictive probabilities.
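To make this decomposition concrete, here is a minimal sketch (our own illustration, not code from the paper) of accumulating the one-step-ahead log posterior predictive probabilities for a conjugate Bayesian linear regression model with a Gaussian prior and known noise variance; the function name `log_ml_decomposed` and the hyperparameter values are placeholders.

```python
import numpy as np

def log_ml_decomposed(X, y, prior_var=1.0, noise_var=0.1):
    """Accumulate log P(D) = sum_i log P(D_i | D_{<i}) for conjugate Bayesian
    linear regression: theta ~ N(0, prior_var * I), y_i ~ N(x_i^T theta, noise_var)."""
    d = X.shape[1]
    mean, cov = np.zeros(d), prior_var * np.eye(d)   # current posterior P(theta | D_{<i})
    log_ml = 0.0
    for x_i, y_i in zip(X, y):
        # One-step-ahead posterior predictive for the next point is a 1-D Gaussian.
        pred_mean = x_i @ mean
        pred_var = x_i @ cov @ x_i + noise_var
        log_ml += -0.5 * (np.log(2 * np.pi * pred_var) + (y_i - pred_mean) ** 2 / pred_var)
        # Conjugate update to incorporate (x_i, y_i).
        gain = cov @ x_i / pred_var
        mean = mean + gain * (y_i - pred_mean)
        cov = cov - np.outer(gain, x_i @ cov)
    return log_ml
```

Each term added to the running sum is one of the \(\log P(\mathcal{D}_i|\mathcal{D}_{<i})\) terms above, so a model that assigns high likelihood to new points early in the ordering accumulates a higher total.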
Estimator Zoo
The value \(\log P(\mathcal{D}_i|\mathcal{D}_{<i}) = \log \mathbb{E}_{\theta \sim P(\theta|\mathcal{D}_{<i})}[P(\mathcal{D}_i|\theta)]\) isn’t always possible to compute exactly. However, given parameter samples from the posterior, it’s straightforward to estimate a lower bound:

\[\log \mathbb{E}_{\theta \sim P(\theta|\mathcal{D}_{<i})}[P(\mathcal{D}_i|\theta)] \geq \mathbb{E}_{\theta \sim P(\theta|\mathcal{D}_{<i})}[\log P(\mathcal{D}_i|\theta)],\]

where the inequality follows from Jensen’s inequality. Summing these lower bounds over data points gives our first estimator, \(L(D)\). We can make this inequality tighter by averaging over \(k\) parameter samples before applying the logarithm, yielding another estimator that we’ll call \(L_k\):

\[L_k(D) = \sum_i \log \frac{1}{k} \sum_{j=1}^{k} P(\mathcal{D}_i|\theta_j), \qquad \theta_j \sim P(\theta|\mathcal{D}_{<i}).\]
Finally, if we have samples from the posterior predictive distribution \(P(\hat{D_i} | D_{<i})\) and this distribution is a Gaussian, then we can use our posterior samples to estimate the parameters \(\mu, \sigma^2\) of \(P(\cdot | D_{<i})\) and then use those estimated parameters in the log likelihood term to obtain yet another estimator \(L_S\).
Each of these estimators has its own pros and cons: \(L\) has an intriguing interpretation from the minimum description length framework, but can have a large bias term when used to estimate the ML. \(L_k\) reduces this bias, but neither \(L_k\) nor \(L\) can be applied to models whose likelihood is a Dirac delta distribution (i.e. models with zero observation noise). \(L_S\) works even for models with zero observation noise, but only yields an unbiased estimate of a lower bound when the posterior predictive is Gaussian (otherwise it may no longer be a lower bound).
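As a rough illustration of how \(L\) and \(L_k\) would be computed from samples (our own sketch, not code from the paper; it assumes we can evaluate \(\log P(\mathcal{D}_i|\theta)\) for each posterior sample, and the array layout is hypothetical):

```python
import numpy as np

def lower_bound_estimators(log_liks):
    """log_liks[i, j] = log P(D_i | theta_j), with theta_j ~ P(theta | D_{<i}).
    Returns (L, L_k): L averages log-likelihoods over samples before summing over i;
    L_k averages likelihoods (log-mean-exp) before summing, giving a tighter bound."""
    log_liks = np.asarray(log_liks)                              # shape: (n_points, k_samples)
    L = log_liks.mean(axis=1).sum()
    k = log_liks.shape[1]
    L_k = (np.logaddexp.reduce(log_liks, axis=1) - np.log(k)).sum()
    return L, L_k
```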
Estimating the log ML via Gradient Descent
TL;DR: In linear models, we can sample parameters from a prior and then run gradient descent to get a posterior sample to feed into the estimators from the last section.
In linear models, we can get the posterior samples that these estimators need by the ensemble methods described previously. Specifically: in Bayesian linear regression with noiseless Gaussian likelihoods, we can sample parameters from the prior and then optimize the \(\ell_2\) regression loss. The minimum of this optimization problem will correspond to a sample from the posterior of the model conditioned on the data used for optimization. This gives us a way of translating Bayesian updating to gradient optimization with ensembles, which we can adapt to models with noisy likelihoods (assuming we know the ground-truth targets) by adding noise to the targets and optimizing a penalized regression loss which encourages parameters to stay close to their initialized values.
Optimizing a linear regression objective lets us sample from the posterior.
1. Randomly initialize some collection of parameters \((\theta_0^k)\).
2. For each \(D_i \in D\), run GD on \(D_{<i}\) to convergence, starting from each parameter \(\theta^k_{i-1}\).
3. Take the resulting parameter \(\theta^k_i \sim p(\theta|D_{<i})\) as a posterior sample to estimate \(\log P(D_i|D_{<i})\).
4. Add this estimate to a running sum.
In the worst case, this procedure can be enormously expensive, requiring us to run gradient descent to convergence \(n\) times. In practice we found that this wasn’t as expensive as it might at first seem: if the model fits the data well, then the optimal parameters for \(D_{\le i}\) should be similar to those for \(D_{<i}\), so it won’t take gradient descent long to reach them.
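Putting the pieces together, here is a toy sketch of the whole procedure for a noiseless linear-Gaussian model, using warm-started gradient descent and the \(L_S\) (Gaussian-fit) estimator. The learning rate, step count, and function name are illustrative choices, and convergence is only approximate:

```python
import numpy as np

def estimate_log_ml_sgd(X, y, k=20, prior_var=1.0, lr=0.01, steps=5000, eps=1e-6):
    """Sketch: estimate log P(D) for a noiseless Bayesian linear model by warm-started
    gradient descent on growing data prefixes, plus a Gaussian fit to predictive samples."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    thetas = rng.normal(size=(k, d)) * np.sqrt(prior_var)      # k samples from the prior
    log_ml = 0.0
    for i in range(n):
        if i > 0:
            X_prev, y_prev = X[:i], y[:i]
            for _ in range(steps):                              # GD (approximately) to convergence on D_{<i}
                residual = thetas @ X_prev.T - y_prev           # shape (k, i)
                thetas -= lr * residual @ X_prev / i            # gradient of the mean squared error
        preds = thetas @ X[i]                                   # predictive samples for the next point
        mu, var = preds.mean(), preds.var() + eps               # Gaussian fit (eps avoids zero variance)
        log_ml += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mu) ** 2 / var)
    return log_ml
```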
Linear model combinations and \(L(D)\)
TL;DR: if you train a linear regressor on top of predictions from Bayesian models while those models are being updated, then the model assigned the highest weight is the one with the highest \(L(D)\), not the one whose final posterior is the best fit.
We can use gradient descent to estimate \(L(D)\), but it also turns out that \(L(D)\) can tell us something about why gradient descent might prefer models with a high (lower bound on their) marginal likelihood. We’ll focus on the setting where we train linear combinations of models while these models are themselves performing Bayesian updating, in a loose sense mimicking what happens when we optimize a hierarchical model like a neural network.
We’ll assume some collection of Bayesian linear regression models \(M_1, \dots, M_k\) and a regression dataset of the form \(D = (X_i, Y_i)_{i=1}^N\); each model \(M_j\) outputs predictions \(\hat{Y}^j_i\) sampled from its posterior predictive \(P(Y_i|D_{<i}, X_i)\). On top of these models we train a linear predictor \(w\) to minimize the objective \(\|w^\top \widehat{Y}_i - Y_i\|^2\), where \(\widehat{Y}_i = (\hat{Y}^1_i, \dots, \hat{Y}^k_i)\). We’ll assume that these models make predictions of a similar scale (i.e. there isn’t a model that consistently outputs \(0.0000001 \times Y_i\) and whose optimal weight is then \(10000000\), while another model which consistently outputs \(Y_i\) gets weight 1), and that they don’t have complementary errors (i.e. it’s not the case that model 1 consistently underestimates the target \(Y\) while model 2 consistently overestimates it in such a way that the optimal solution assigns a high weight to both so that the errors cancel out). Under these two assumptions, we can show that for the optimization problem

\[w^* = \arg\min_w \sum_{i=1}^N \left(w^\top \widehat{Y}_i - Y_i\right)^2, \qquad \hat{Y}^j_i \sim P(Y_i|D_{<i}, X_i),\]

the optimal solution \(w^*\) will assign the highest weight to the model with the highest \(L(D)\). In other words, running this concurrent optimization procedure and then doing magnitude pruning on the linear combination is going to be equivalent (in expectation) to picking the model with the highest lower bound on its marginal likelihood.
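To make "concurrent" concrete, the following hypothetical sketch fits the combination weights online, one data point at a time, on predictive samples drawn while the underlying models are still being updated (the sampling of \(\hat{Y}^j_i\) is assumed to happen elsewhere, e.g. via the gradient-descent procedure above):

```python
import numpy as np

def concurrent_combination_weights(pred_samples, y, lr=0.01):
    """pred_samples[i, j]: a sample from model j's posterior predictive for point i,
    drawn while model j has only seen D_{<i}. Returns the learned combination weights."""
    n, k = pred_samples.shape
    w = np.zeros(k)
    for i in range(n):
        y_hat = pred_samples[i]
        w -= lr * 2.0 * (w @ y_hat - y[i]) * y_hat   # SGD step on (w^T Y_hat_i - y_i)^2
    return w
```

Magnitude pruning then amounts to keeping the entries of \(w\) with the largest absolute value.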
Empirical Results
TL;DR: our estimators agree with the marginal likelihood on what the ‘best’ model is in a number of different settings.
Estimated log P(D) and weight assigned by a linear model combination for various model selection problems. Our estimators give models similar rankings to the log marginal likelihood.
We compared all three estimators on a feature selection task and found that they provided a similar model ranking to that given by the log ML. This is a nice existence proof that the lower bounds give sensible model rankings, and we provide a few more example problems in the paper. The results shown here are for the task of feature selection – i.e., how many (and which) features should we use in our model? We simulate a dataset of the form \((\textbf{X}, \textbf{y})\), where \(x_i = (y_i + \epsilon_1, \dots, y_i + \epsilon_{15}, \epsilon_{16}, \dots, \epsilon_{30})\), and consider a set of models \(\{\mathcal{M}_k\}\) with feature embeddings \(\phi_k(x_i) = x_i[1, \dots, k]\). The optimal model selects the first 15 features. This example is typical of a range of model selection tasks that we looked at, where sweeping the hyperparameter produced a nice “Occam’s Hill”. The estimators tended to agree around the optimal hyperparameter/model, with \(L(D)\) penalizing models that didn’t fit the data well and had overly peaked posteriors more heavily than the log ML did.
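For concreteness, here is one way such a dataset and model family could be generated (a hypothetical sketch; the noise scales and dataset size are our own choices, not necessarily those used in the paper):

```python
import numpy as np

def feature_selection_dataset(n=100, noise=0.1, seed=0):
    """Toy dataset: only the first 15 of 30 features carry information about y."""
    rng = np.random.default_rng(seed)
    y = rng.normal(size=n)
    informative = y[:, None] + noise * rng.normal(size=(n, 15))  # y_i + eps_1, ..., y_i + eps_15
    distractors = rng.normal(size=(n, 15))                       # eps_16, ..., eps_30
    return np.hstack([informative, distractors]), y

# Model M_k uses the feature embedding phi_k(x) = x[1, ..., k].
X, y = feature_selection_dataset()
feature_subsets = {k: X[:, :k] for k in range(1, 31)}
```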
We find similar results when we compare the weight assigned to a Bayesian model in a linear model combination trained online as the Bayesian model is being iteratively updated, as in the procedure described in the previous section. We compare against two other linear model combination approaches: one where we train the linear model combination on samples from the already-fitted posterior \(P(D_i|D)\), and another where we fit the linear model combination on samples from the prior \(P(D_i|\emptyset)\). We again see that \(L(D)\) and \(\log P(D)\) agree close to the optimum. We also see a similar ranking of models between \(L(D)\) and the linear model combination weights given by concurrent sampling – the ‘posterior sampling’ and ‘prior sampling’ approaches do not exhibit this trend, as expected.
Connection to Deep Learning
Recall that the ML measures how well updates based on one subset of the data generalize to unseen data points. This gives us an analogy between training speed, as measured by the sum of log posterior predictive likelihoods, and Bayesian model selection. Training speed in a Bayesian updating procedure is defined with respect to the number of data points needed to assign high likelihood to the rest of the training data. However, in many settings where we might want to do model selection, we don’t necessarily have exact posterior samples from a Bayesian model conditioned on increasing subsets of the data. Standard gradient-based optimization of deep neural networks (DNNs) involves iterating through the entire dataset repeatedly for several epochs, so we’ll need to consider training speed with respect to the number of gradient steps taken, rather than the number of data points seen. In this section, we’re going to take a first stab at the following question:
Does there exist a connection between a notion of training speed that captures within-training-set generalization in DNNs, and their test set generalization performance?
To apply the Bayesian intuition to neural networks requires us to map some of the concepts discussed earlier in the Bayesian setting to the risk minimization setting.
- Bayesian model \(\equiv\) function approximator
- Posterior \(P(\theta|D)\) \(\equiv\) point estimate \(\theta^*\)
While there are many interesting lines of work framing stochastic gradient optimization as performing posterior sampling, in the analysis that follows we won’t be using this interpretation. Nor will we try to cast classifiers as computing unnormalized probability distributions. Instead, we’re going to focus on using minibatch SGD to measure how well model updates based on one subset of the data generalize to other subsets.
###Motivation: Minibatch Loss as a Risk Estimator
TL;DR: when you compute a minibatch gradient update, your loss on the next, disjoint minibatch gives you an idea of how well that gradient update will generalize to your test set.
How quickly a model’s training loss decreases while training depends on a number of factors including the optimizer, hyperparameters, model architecture, and dataset. However, all else being equal, a model’s loss will decrease faster if the gradient update based on minibatch \(i\) of the dataset also decreases the loss on minibatch \(i+1\). This relates to the notion of stiffness discussed by Fort et al., who argue that models for which gradients from minibatches all look similar should generalize better than those whose minibatch updates are orthogonal. Summing over the minibatch training losses then gives us a straightforward way to measure how well the gradients are generalizing. Models which attain a low loss with fewer training steps will have a lower sum over training losses (SOTL).
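As a concrete reference point, here is a minimal sketch of how SOTL can be accumulated during an otherwise standard training loop (PyTorch-style, with hypothetical names; the exact protocol used in the paper, e.g. whether early epochs are discarded, may differ):

```python
import torch

def train_with_sotl(model, loader, epochs=10, lr=0.1):
    """Standard SGD training loop that also accumulates the sum over training losses (SOTL)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    sotl = 0.0
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
            sotl += loss.item()   # minibatch loss evaluated before the update it triggers
    return sotl
```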
To illustrate how SOTL might measure within-training-set gradient generalization, consider a first-order Taylor approximation of the training loss. The change in the loss on minibatch \(i\) after taking a gradient step \(g_t = \nabla_\theta \ell(D_i; \theta_t)\) computed on that minibatch looks as follows:

\[\ell(D_i; \theta_{t+1}) = \ell(D_i; \theta_t - \alpha g_t) \approx \ell(D_i; \theta_t) - \alpha \|\nabla_\theta \ell(D_i; \theta_t)\|^2\]

Meanwhile, the change in the loss on a disjoint minibatch \(D_j\) due to the same gradient step \(g_t\) can be written

\[\begin{align}
\ell(D_j; \theta_{t+1}) = \ell(D_j; \theta_t - \alpha g_t) &\approx \ell(D_j; \theta_t) - \alpha \nabla_\theta \ell (D_j; \theta_t)^\top g_t \\
&= \ell(D_j; \theta_t) - \alpha \nabla_\theta \ell (D_j; \theta_t)^\top \nabla_\theta \ell(D_i; \theta_t)
\end{align}\]
This dot product term measures how correlated gradients based on disjoint minibatches are. For the first epoch of training, this gives an unbiased estimate of the improvement to the expected risk obtained by the gradient step. After the first epoch, because the parameters will depend on all data points in the training dataset, we can’t get nice guarantees on this being a good estimate of the change in the test set loss. Nonetheless, for a large training set we can argue that because the parameters depend minimally on any particular minibatch, we should still get a pretty good estimate of the change in the test set error.
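This dot product can be measured directly during training; a rough sketch (the names and structure are our own, not the paper’s code):

```python
import torch

def minibatch_grad_alignment(model, loss_fn, batch_i, batch_j):
    """Dot product between gradients from two disjoint minibatches: the term that governs
    how a step on D_i changes the loss on D_j in the Taylor expansion above."""
    def flat_grad(batch):
        x, y = batch
        model.zero_grad()
        loss_fn(model(x), y).backward()
        return torch.cat([p.grad.reshape(-1) for p in model.parameters() if p.grad is not None])
    return torch.dot(flat_grad(batch_i), flat_grad(batch_j))
```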
Concretely, we hypothesize that under fixed-step-size SGD, models that obtain a lower sum over training losses should get the best test set performance. We evaluated a handful of different neural network architectures trained on image datasets and found that this correlation was fairly consistent. A more in-depth empirical analysis in the setting of neural architecture search was performed by Ru et al.
Empirical Results
Models which train faster (as measured by their sum over training losses) generalize better.
We now ask whether the connection between \(L(D)\) and the weight assigned to a model in a linear model combination might also apply, in some form, to the deep learning setting. The lemma doing the heavy lifting in the proof of that result basically says that the weight assigned to a feature in a linear regression problem depends, to some extent, on how predictive that feature is of the target (modulo a bunch of assumptions). We make a similar conjecture for the weight assigned to activations in the penultimate layer of a neural network: features which are more predictive of the targets will be assigned higher weight. Because the final-layer weights and activations are trained concurrently, we expect that the notion of how predictive a node is will be more closely aligned with the SOTL definition than with its final ‘training loss’.
Layer activations which train faster (as measured by their sum over ‘information losses’) are assigned higher weight by the final layer of a neural network.
To investigate this claim, we train a linear combination of models, and see whether the test loss correlates with the assigned model weight. We observe that SGD tends to upweight models that generalize better. This suggests that SGD may implicitly perform model selection. Interestingly, we see that this phenomenon also holds for subnetworks within a network. The above figure shows that SGD upweights sub-networks with lower SOTL. These results are far from the final word on the topic, and we believe they present an interesting direction for future work.
Recap
The goal of this paper was to analyze the connection between Bayesian model selection and training speed in order to gain insight into how models generalize. We proposed a family of estimators of the marginal likelihood that measure a notion of training speed, and which depend only on samples from the posteriors of Bayesian models. We further showed that this quantity is equivalent to measuring how well updates based on one subset of the data generalize to other subsets, and that an analogous measure on deep neural networks also appears to empirically correlate with generalization error. Our results thus point both to a way of doing principled Bayesian model selection on families of models from which we can obtain posterior samples, and also to a direction of inquiry towards understanding generalization in complex function approximators based on optimization trajectories rather than only post-training quantities.