Posted on April 9, 2019

An imPACtful, BAYESic result

The applications of probably approximately correct (PAC) learning results to deep networks have historically been about as interesting as they sound. For neural networks of the scale used in practical applications, bounds involving concepts like VC dimension conclude that the algorithm will have no more than a certain error rate on the test set with probability at least zero. Recently, some work by Dziugaite and Roy, along with some folks from Columbia has managed to obtain non-vacuous generalization bounds for more realistic problems using a concept introduced by McAllester (1999) called PAC Bayes bounds.

The deterministic case

I’ll recall quickly the setup for a PAC bound. We assume we have some hypothesis space \(\mathcal{H}\), data \(\mathcal{D}\), and algorithm \(\mathcal{A} : \mathcal{P}(\mathcal{D}) \rightarrow \mathcal{H}\) which takes as input some subset of our training data and outputs a hypothesis function. We then want to say that, given a sufficient number of training samples, a hypothesis that matches the training data will with high probability attain low error on the data it hasn’t seen yet. The error on the data the algorithm sees is called the empirical risk and is denoted \(\hat{r}_S(h).\) The true risk is the expected error over the entire dataset, and is denoted \(r(h)\). This setup can be formalized in a number of ways (e.g. depending on whether the hypothesis is assumed to attain zero training loss), but they typically look something like this: given error parameters, \(\epsilon\) and \(\delta\), along with sample size \(|S| > m(\delta, \epsilon)\), then with probability \(1 - \delta\):

\(\forall h \quad |r(h) - \hat{r}_S(h)| \leq \epsilon\)

A simple example of a bound that looks like this is called the Occam bound. This bound assumes we have some prefix coding of our countable set of hypotheses. We use the notation \(|h|\) to describe the length of the encoding for the hypothesis \(h\). We assume that our loss function is the zero-one loss function \(L_{01}\), defined as follows.

\(L_{01}(h) \equiv P_{x,y \sim \mathcal{D}}[h(x) \neq y]\).

We then get the bound that given \(N\) samples from our dataset, with probability at least \(1 - \delta\) we will have

\(\forall h \in \mathcal{H}, L_{01}(h) \leq \hat{L_{01}}(h) + \sqrt{\frac{(\ln 2)|h| + \ln \frac{1}{\delta}}{2N}}\)

The proof of this is (compared to a lot of generalization bounds) relatively simple: you need a Chernoff bound on the sample mean of iid random variables that looks like this: \(P(\mathbb{E}(X) > \sum X_i/N + \epsilon) \leq e^{2N\epsilon^2}\), the union bound (i.e. \(P(A \lor B) \leq P(A) + P(B)\)), and Kraft’s inequality, which says that the prefix coding defines a distribution that sums to 1 over all hypotheses. I’ll leave it as an exercise because the Bayesian approach below is (in my opinion) much cooler.

Let’s go bayesian

The above discussion assumes a deterministic output, but what we’re really interested in is probabilistic methods. In particular, we’ll be looking at Gibbs classifiers. A Gibbs classifier assumes that you have a distribution over possible hypotheses, and its output for a given input is sampled from this distribution. Notice that in the Occam bound, we can interpret \(|h|\) as defining a probability distribution over hypotheses given by \(p(h) = \frac{1}{2^{|h|}}\). We could obtain a Gibbs classifier from this distribution by sampling hypotheses according to their probability given by their compression length. McAllester’s theorem gives a bound for the true risk of any Gibbs classifier based on its empirical risk attained on a sample. The version of the theorem and following proof is taken from a tutorial by Seeger that I link to at the end of this post.

Theorem (McAllester, 1999): Let \(Q\) be a probability distribution over a hypothesis space \(\mathcal{H}\). Let \(l(c, x)\) be a loss function with \(l = \mathbb{E}_{x \sim \mathcal{D}} l(c, x)\), and \(\phi: \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}\) such that \(\phi(q,l)\) is convex in \(q\) and \(l\), and nondecreasing in \(|q - l|\). Let \(\mathcal{D}\) be our input dataset. \(\phi\) is essentially some function describing how far our empirical risk is from its true value. We also need the distribution of \(\phi(\hat{l}, l)\) to satisfy a large deviation bound. Assume that \(l\) and \(\mathcal{D}\) are such that there exist constants \(C, \beta\) with \(P(\hat{l} \geq q) \leq C\exp(-\beta \phi(q, l))\) for \(q > l\) (and similarly for \(q < l\)). In most probabilistic bounds, \(\beta\) will be the number of samples we’ve seen so far or closely related to this number, and so the bound will use a different letter like \(m\) or \(N\).

Given these constraints, we now let \(P\) be some fixed reference distribution. Then

\(P_S \bigg \{ \phi(\mathbb{E}_{c \sim Q}\) \([\hat{l} (c)], \mathbb{E}_{c \sim Q} [l(c)]) \geq \frac{KL(Q || P) + \log 2C\beta + \log \frac{1}{\delta}}{\beta - 1} \bigg \} < \delta\)

I’ll go over a sketch of the proof, because I think it provides some insight into what’s going on in the bound. Let \(\Delta(c)\) denote \(\phi(l(c), \hat{l}(c))\) – that is, \(\Delta(c)\) is the error between the empirical risk and the true risk for a given concept/hypothesis. We’ll define a Gibbs measure \(P_G\) based on this error with respect to our prior \(P\) as follows:

\(dP_G(c) = \frac{e^{\alpha \Delta(c)}}{\mathbb{E}_{c \sim P}[e^{\alpha \Delta(c)}]}dP(c)\)

i.e. the Radon-Nikodym derivative of \(P_G\) with respect to \(P\) is just an exponential of the error term multiplied by a constant that will be chosen carefully to make the bound work out nicely.

Then we observe that since the KL divergence of any two probability measures (when it’s defined) is nonnegative, we can come up with a bound.

\(\begin{align*} 0 &\leq KL[Q||P_G] = \int \log \bigg ( \frac{\mathbb{E}_{c \sim P}[e^{\alpha \delta(c)}]}{e^{\alpha \Delta(c)}} \frac{dQ(c)}{dP(c)} \bigg )\\ &= KL[Q||P] + \log \mathbb{E}_{c \sim P} [e^{\alpha \Delta(C)}] - \mathbb{E}_{c \sim Q}[\alpha \Delta(c)]\\ &\implies \mathbb{E}_{c \sim Q}[\alpha \Delta(c)] \leq KL[Q||P] + \log \mathbb{E}_{c \sim P} [e^{\alpha \Delta(C)}] \end{align*}\)

If we choose \(\alpha\) carefully, we can bound the \(\mathbb{E}_P e^{\alpha \Delta(c)}\) term via a technical lemma that gives

\(Pr_S \bigg \{ \mathbb{E}_P e^{(\beta - 1) \Delta(c)} \geq \frac{2C\beta}{\delta} \bigg \} < \delta.\)

Thus, by picking \(\alpha = \beta - 1\), we get with probability at least \(1-\delta\),

\(\log \mathbb{E}_{c \sim P} [e^{\alpha \Delta(C)}] \leq \log 2C\beta\delta^{-1}\)

and so

\(\phi(\mathbb{E}_{c \sim Q}\) \([\hat{l} (c)], \mathbb{E}_{c \sim Q} [l(c)]) \geq \frac{KL(Q || P) + \log 2C\beta + \log \frac{1}{\delta}}{\beta - 1}\)

Why can’t I set P = Q?

Since the KL divergence between two distributions is zero when they’re equal, it seems obvious to set our prior to be \(P = Q\) when we compute our bound. However, since we pick \(Q\) after we’ve seen the sample \(S\), \(Q\) is not independent of \(S\) and this induces a subtle bug into the ‘technical lemma’ in the proof that I mentioned in the last section. The technical lemma states that for all concept classes \(c\), we have

\(\mathbb{E}_s [e^{(\beta - 1)\Delta(c)}] \leq 2C\beta.\)

Then we have that

\(P_S \bigg \{ \mathbb{E}_{c \sim P}[e^{(\beta - 1)\Delta(c)}] \geq \frac{ \mathbb{E}_S \mathbb{E}_P [e^{(\beta - 1)\Delta(c)}]}{\delta} \bigg \} < \delta.\)

Then since \(P\) and \(S\) are independent of each other, we can swap the order of the expectation to get \(\mathbb{E}_S \mathbb{E}_P [e^{(\beta - 1)\Delta(c)}] = \mathbb{E}_P \mathbb{E}_S [e^{(\beta - 1)\Delta(c)}]\). Since the inner expectation is bounded by \(2C\beta\) for all \(c\), we get that the whole thing is bounded by \(2C\beta\).

But we need P and S to be independent!

If P had been a function of \(S\), we couldn’t have swapped the order of the expectations. Since \(Q\) depends on \(S\), it won’t be a suitable distribution to use in our bound.

Further Reading

Hopefully this blog post has been a fairly low-overhead introduction to the mechanics of PAC-Bayes bounds. I was heavily inspired by the following papers, which I recommend reading for a more thorough explanation of the proofs in McAllester’s paper:

[1] The Proof of McAllester’s PAC Bayesian Theorem, Matthias Seeger.

[2] PAC-Bayesian Model Averaging , David A. McAllester (1999)