Posted on May 17, 2025

A post-rigorous derivation of the Gauss-Newton matrix

Why take a Hessian when a gradient will do?

Terry Tao has a wonderful blog post on the three phases of a mathematician’s career, dividing roughly into pre-, contemporaneous, and post-rigor. Having abandoned my pursuit of “real” mathematics at the age of 22, I can’t claim to have arrived at the enlightened post-rigor phase of my career through the honourable approach of a decade’s practical experience wielding proof techniques. Instead, having lost much of the patience I used to have for working out finicky details of proofs, I’ve arrived at a kind of lazy post-rigor state which seems to be a good fit for blog posts.

We begin, as all stories about neural network optimization begin, with the gradient.

Aside: what is the difference between a gradient and a derivative?

This is a minor and pedantic point which almost doesn’t belong in a post-rigorous derivation of a second-order optimization method. Nonetheless, I get the sense that a lot of machine learning people don’t know the difference between a derivative and a gradient, and I feel a moral obligation to clarify it.

If you remember your multivariable calculus, you’ll recall that a function \(f : \mathbb{R}^n \rightarrow \mathbb{R}^m\) is differentiable if there’s some linear function \(\lambda : \mathbb{R}^n \rightarrow \mathbb{R}^m\) such that

\[ \lim_{h \rightarrow 0} \frac{ | f(a + h) - f(a) - \lambda(h)|}{|h|} = 0 \]

where the limit is taken as \(h \in \mathbb{R}^n\) goes to zero. If such a \(\lambda\) exists, then \(f\) is differentiable at \(a\) with derivative \(\lambda\). Importantly, \(\lambda : \mathbb{R}^n \rightarrow \mathbb{R}^m\) is a function, not a real number. When students are first introduced to calculus, they are usually taught that the derivative is the “rate of change” of a function. This is fine in 1-D because there’s only one direction in which we can evaluate the linear function \(\lambda\), so there’s a natural isomorphism between the derivative function \(\lambda\) and its evaluation on a particular direction vector. For vector-valued functions, how fast and in what direction the function’s output is changing depends on the direction you perturb it in, so we need the derivative to be a function.

For a function \(f : \mathbb{R}^n \rightarrow \mathbb{R}\), the derivative will be a function of the form \(\lambda(h) = \langle v, h \rangle\) where \(v, h \in \mathbb{R}^n\), so we can identify the derivative with this vector \(v\). The gradient is this vector \(v\). For a pedantic mathematician, this distinction is critical for gradient descent to be well-defined: you can’t add a function to a vector, but you can add two vectors together. So adding a scaled gradient to the current parameters to get an updated parameter set is totally kosher. In particular, when the function we care about is a loss \(\ell\) evaluated at a parameter vector \(\theta\), we recover the familiar update

\[ \theta \gets \theta - \eta \frac{\partial}{\partial \theta} \ell(\theta) \]
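To make this concrete, here is a minimal numpy sketch (the test function, perturbation, and step size are all made-up illustrative choices) checking that the derivative acts as a linear map whose action on a direction \(h\) is the inner product \(\langle \nabla f(a), h \rangle\), before taking one gradient descent step:

```python
import numpy as np

# Toy scalar-valued function f : R^3 -> R and its analytic gradient.
# (Purely illustrative; any smooth function would do.)
def f(x):
    return np.sin(x[0]) + x[1] ** 2 + x[0] * x[2]

def grad_f(x):
    return np.array([np.cos(x[0]) + x[2], 2 * x[1], x[0]])

a = np.array([0.3, -1.2, 0.7])
h = np.array([0.1, 0.05, -0.2])     # an arbitrary perturbation direction
eps = 1e-6

# The derivative at a is the linear map lambda(h) ~ (f(a + eps*h) - f(a)) / eps;
# the gradient is the vector v with lambda(h) = <v, h>.
finite_diff = (f(a + eps * h) - f(a)) / eps
inner_prod = grad_f(a) @ h
print(finite_diff, inner_prod)      # agree to roughly 1e-6

# Because the gradient lives in the same space as the parameters,
# the update theta <- theta - eta * grad is well-defined.
eta = 0.1
a_next = a - eta * grad_f(a)
```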

Take two (derivatives)

Gradient descent is the workhorse of neural network optimization. However, the standard “naive” gradient descent algorithm is typically only used in undergraduate projects and theory papers. In practice, some sort of rescaling is applied to the gradients of a model before updates are made, to buffer against the gradients being too large or too small to make meaningful changes to the parameters. For example, Sign SGD updates parameters based only on the sign of the derivative of the loss with respect to each parameter, meaning that the magnitude of the updates is completely independent of the sensitivity of the objective function to those updates; RMSProp applies a similar philosophy to stochastic estimates of the gradient. Letting \(\mathbf{g}(t) = \frac{\partial \ell(\theta(t))}{\partial \theta}\), the Sign SGD update can be written element-wise as

\[\theta_{ij}(t) = \theta_{ij}(t - 1) - \frac{\eta \mathbf{g}_{ij}(t)}{|\mathbf{g}_{ij}(t)|}\]

Alternatively, we could write this using matrix notation as

\[\theta(t) = \theta(t - 1) - \frac{\eta \mathbf{g}(t)}{|\mathbf{g}(t)|}\]

where the fraction is used to indicate that we are performing element-wise division. The same operation with slightly less notational abuse becomes

\[\theta(t) = \theta(t - 1) - \eta D_{t}\mathbf{g}(t)\]

where the matrix \(D_{t}\) is diagonal with entries \(D_t[i,i]\) equal to \(\frac{1}{|\mathbf{g}_{i}(t)|}\).
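As a sanity check, here is a small numpy sketch (the gradient values and learning rate are made up for illustration) confirming that the element-wise update and the diagonal-preconditioner form \(D_t \mathbf{g}(t)\) are the same operation:

```python
import numpy as np

eta = 0.01
g = np.array([2.0, -0.5, 1e-4])       # a made-up gradient

# Element-wise form: divide each gradient entry by its magnitude (sign SGD).
step_elementwise = eta * g / np.abs(g)

# Equivalent matrix form: precondition with D_t = diag(1 / |g_i|).
D = np.diag(1.0 / np.abs(g))
step_preconditioned = eta * D @ g

print(step_elementwise)      # [ 0.01 -0.01  0.01]
print(step_preconditioned)   # identical

theta = np.zeros(3)
theta -= step_elementwise    # one sign-SGD update
```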

This idea of applying some sort of linear transformation to the gradients before computing the update step is common in optimization algorithms. This is because always moving in the direction that locally decreases your loss fastest is not, somewhat counterintuitively, the best way to reduce your loss quickly. On a moment’s reflection, this makes perfect sense. The gradient is telling you to make large updates to your weights in directions where the function is changing rapidly, and small updates in directions where it is changing slowly. If the gradient’s first-order approximation of the function is accurate, then this is a very sensible strategy. If it’s not, then you’re likely to overshoot a minimum in the directions where the gradient is large, and crawl at a snail’s pace in directions where the gradient is small.

To quantify the degree to which a function is changing non-linearly, we use the second derivative, which shows up in the second-order Taylor expansion

\[\ell(x + u) \approx \ell(x) + \frac{\partial \ell }{\partial x}(x) ^\top u + \frac{1}{2} u^\top \mathbf{H} u\]

where \(\mathbf{H} = \frac{\partial^2 \ell }{\partial x^2}(x)\). The eigendecomposition of \(\mathbf{H}\) describes directions where gradients are changing quickly (high curvature) and directions where the gradients are changing more slowly (low curvature). It is unfortunately the case that directions where the gradients are changing rapidly (i.e. where the linear approximation given by the gradient is the least accurate) tend also to be directions where the gradients are large. If you recall from my edge of stability blog post, in order for a gradient descent step to be guaranteed to reduce the loss in this quadratic approximation, we need the step size to satisfy \(\eta < \frac{2}{\lambda_{\max}(\mathbf{H})}\). Strictly speaking, between the fact that gradients usually aren’t perfectly aligned with the direction of maximum curvature and the higher-order terms in the loss’s Taylor expansion, we usually don’t see a neat boundary where step sizes greater than \(\frac{2}{\lambda_{\max}(\mathbf{H})}\) strictly diverge, but it’s a useful heuristic.
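Here is a quick numpy sketch of that heuristic on a pure quadratic, where the \(2/\lambda_{\max}\) threshold is exact; the curvature matrix and step sizes are arbitrary illustrative choices:

```python
import numpy as np

# Quadratic loss l(x) = 0.5 * x^T A x, so H = A and the gradient is A x.
A = np.diag([100.0, 1.0])
lam_max = np.max(np.linalg.eigvalsh(A))
threshold = 2.0 / lam_max               # = 0.02

def run_gd(eta, steps=100):
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x = x - eta * (A @ x)
    return 0.5 * x @ A @ x

print(run_gd(0.9 * threshold))   # loss shrinks towards zero
print(run_gd(1.1 * threshold))   # loss blows up along the high-curvature direction
```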

What this means for optimization, however, is that the step size required to maintain stability in the high-curvature directions will potentially be orders of magnitude too small for the optimizer to make any progress along lower-curvature subspaces. It’s easy to construct examples where this is the case – consider a 2-D quadratic function of the form \(f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^\top \begin{bmatrix}100 & 0 \\0 & 0.01 \end{bmatrix} \mathbf{x} + \begin{bmatrix}1 \\ 2 \end{bmatrix}^\top \mathbf{x} - 2.\) This function has gradient \(\begin{bmatrix}100 \mathbf{x}_1 + 1 \\ 0.01 \mathbf{x}_2 + 2\end{bmatrix}\). Its optimum is the vector \([-0.01, -200]\).

Assuming some isotropic Gaussian initialization, the component \(\mathbf{x}_2\) needs to make its way to a value in the hundreds (in magnitude). However, its gradient scales its value by only \(0.01\), meaning that this coordinate takes vanishingly small steps relative to \(\mathbf{x}_1\), which in turn requires a small learning rate (\(\eta < 0.02\)) to avoid diverging. The result is that optimization to convergence can take tens of thousands of steps even on this toy problem, and far more as the conditioning worsens. In convex optimization you can derive a precise bound on the number of steps to convergence in terms of the condition number of the Hessian matrix, which is just the ratio \(\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}\), yielding for example the following well-known result (see this helpful set of course notes for more details)

\[ \ell(\theta_T) - \ell^* \leq \exp\bigg( \frac{-T}{\kappa} \bigg) (\ell(\theta_0) - \ell^*).\]
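Here is a small numpy sketch of the toy problem above (the seed, initialization, and learning rate are arbitrary illustrative choices), showing how slowly the low-curvature coordinate converges even close to the largest stable step size:

```python
import numpy as np

# The ill-conditioned quadratic from above: grad f(x) = A x + b.
A = np.diag([100.0, 0.01])
b = np.array([1.0, 2.0])
x_star = -np.linalg.solve(A, b)               # [-0.01, -200]

eta = 0.019                                   # just under the stability limit 2/100 = 0.02
x = np.random.default_rng(0).normal(size=2)   # isotropic Gaussian init

for t in range(10_000):
    x = x - eta * (A @ x + b)

print(x, x_star)   # x_1 snaps to -0.01 within a few dozen steps, while x_2
                   # is still well short of -200 after ten thousand steps
```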

The solution to this problem is obvious: don’t take giant steps in directions where you don’t trust your gradient to be an accurate model of the function you’re trying to optimize. The natural way to do this is to rescale your gradients in inverse proportion to the local curvature, i.e. replacing the diagonal matrix \(D_t\) we saw earlier with \(\mathbf{H}^{-1}\).

This method is known as preconditioned gradient descent. You probably learned about the 1D version of this, Newton’s method, in high school or undergrad. And for the same reasons that Newton’s method results in much faster convergence in root-finding problems, preconditioned gradient descent tends to find local minima much faster in optimization problems.
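And a sketch of the fix on the same toy problem: preconditioning the gradient with \(\mathbf{H}^{-1}\). On a quadratic this is exactly a Newton step, so a single full-step update lands on the optimum (the starting point below is an arbitrary illustrative choice):

```python
import numpy as np

A = np.diag([100.0, 0.01])      # Hessian of the quadratic above
b = np.array([1.0, 2.0])

x = np.array([0.5, -0.3])       # arbitrary starting point
grad = A @ x + b

# Preconditioned update: theta <- theta - eta * H^{-1} g, with eta = 1.
x_new = x - np.linalg.solve(A, grad)

print(x_new)                    # [-0.01, -200.], the optimum in one step
```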

So that was a long-winded way of saying that pre-conditioners are great, if you can compute them. The challenge in machine learning is that our function inputs don’t have two dimensions, but billions of dimensions, which means any hypothetical Hessian-based pre-conditioner would have billions of billions of entries. Even with Moore’s law going apace, we would have to wait several decades to have computers that can process objects of that size in any reasonable amount of time.

Instead, we use approximations.

The Gauss-Newton Approximation

One of the most popular ways of approximating the Hessian is by decomposing it into two matrices, one of which is easy to compute and one of which is hard, then tactfully misplacing the latter. The decomposition can be done in a few lines if you’re willing to play fast and loose with tensor multiplication.

Laying out notation, we assume we have a loss function \(\ell: \Theta \rightarrow \mathbb{R}\), where \(\Theta\) denotes the parameter space of a neural network. We further assume that \(\ell\) permits the following decomposition: \(\ell(\theta) = \ell_f (f(\theta))\), where \(f: \Theta \rightarrow \mathbb{R}^{n \times d}\) denotes the neural network output on some fixed dataset of size \(n\) and \(\ell_f:\mathbb{R}^{n \times d} \rightarrow \mathbb{R}\) is the loss function defined with respect to the neural network outputs, for example \(\ell_f(f(\theta)) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2} \|f(\theta)_i - \mathbf{y}_i\|^2\) for regression or \(\ell_f(f(\theta)) = -\frac{1}{n} \sum_{i=1}^n \sum_{j=1}^d p_{i, j} \log \frac{\exp{f(\theta)_{i, j}}}{\sum_{k=1}^d \exp{f(\theta)_{i, k}}}\) for cross-entropy loss.

With this decomposition, we can rewrite the Hessian as follows. \[\begin{align} \mathbf{H} &= \frac{\partial^2}{\partial \theta^2}\ell(\theta) = \frac{\partial}{\partial \theta} \frac{\partial}{\partial \theta}\ell(\theta) \\ &= \frac{\partial}{\partial \theta} \left[ \frac{\partial \ell_f}{\partial f}(f(\theta)) \cdot_{n, d} \frac{\partial f}{\partial \theta} \right] \\ &= \frac{\partial}{\partial \theta} \left[ \delta_\ell \cdot_{n, d}\mathbf{J}_\theta \right] \\ \end{align} \] Here \(\mathbf{J}_\theta \in \mathbb{R}^{ n \times d\times |\theta|}\) is the network Jacobian, i.e. the array of partial derivatives of each output dimension with respect to the parameters, and \(\delta_\ell \in \mathbb{R}^{n \times d}\) denotes the derivative of the loss with respect to the network outputs. Finally, we use the magical Einstein notation operator \(\cdot_{n, d}\) to indicate that we multiply together and then sum over all of the entries in the \(n\times d\) output dimensions.
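Before continuing, here is a tiny numpy example unpacking the \(\cdot_{n,d}\) notation on a linear “network” \(f(\theta) = X\theta\) (so \(d = 1\) and the Jacobian is just \(X\); the data and sizes are made up): contracting \(\delta_\ell\) with \(\mathbf{J}_\theta\) over the output dimensions recovers the gradient of the loss with respect to \(\theta\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
theta = rng.normal(size=p)

f = X @ theta                        # network outputs, shape (n,)  (d = 1)
delta = (f - y) / n                  # d l_f / d f for the mean squared error
J = X                                # d f / d theta, shape (n, p)

# The contraction  delta ._{n,d} J : multiply and sum over the output dims.
grad_via_contraction = np.einsum('n,np->p', delta, J)

# Finite-difference check of d l / d theta.
def loss(th):
    return 0.5 * np.mean((X @ th - y) ** 2)

eps = 1e-6
grad_fd = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                    for e in np.eye(p)])
print(np.allclose(grad_via_contraction, grad_fd, atol=1e-6))   # True
```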

Expanding the outer derivative with the product rule gives \[ \begin{align} \frac{\partial}{\partial \theta} (\delta_\ell \cdot_{n,d}\mathbf{J}_\theta) &= \left(\frac{\partial}{\partial \theta} \delta_\ell\right) \cdot_{n,d}\mathbf{J}_\theta + \delta_\ell \cdot_{n,d} \frac{\partial}{\partial \theta} \mathbf{J}_\theta. \end{align} \] Writing \(\delta_\ell = \frac{\partial \ell_f}{\partial f}\) and \(\frac{\partial}{\partial \theta}\mathbf{J}_\theta = \mathbf{H}_\theta\), the second derivative of the network outputs with respect to the parameters, this becomes \[ = \left(\frac{\partial }{\partial \theta}\frac{\partial \ell_f}{\partial f}\right) \cdot_{n,d} \mathbf{J}_\theta + \delta_\ell \cdot_{n,d} \mathbf{H}_\theta \] which we can then decompose with a non-controversial application of the chain rule, \(\frac{\partial}{\partial \theta} = \frac{\partial}{\partial f}\frac{\partial f}{\partial \theta}\):

\[ = \frac{\partial}{\partial f}\frac{\partial f}{\partial \theta}\frac{\partial \ell_f}{\partial f}\cdot_{n,d} \mathbf{J}_\theta + \delta_\ell \cdot_{n,d} \mathbf{H}_\theta \] since for “nice” functions, you’re allowed to swap the order of the derivative operators, we liberally do so to obtain \[ = \frac{\partial f}{\partial \theta}\frac{\partial}{\partial f}\frac{\partial \ell_f}{\partial f}\cdot_{n,d} \mathbf{J}_\theta + \delta_\ell \cdot_{n,d} \mathbf{H}_\theta \] And now we just note that \(\frac{\partial {f}}{\partial \theta}\) is exactly the Jacobian \(\mathbf{J}_\theta\), and \(\frac{\partial^2 \ell_f}{\partial f^2} = \mathbf{H}_{f}\), to get the final decomposition \[ =\mathbf{J}_\theta \cdot_{n,d} \mathbf{H}_f \cdot_{n,d} \mathbf{J}_\theta + \delta_\ell \cdot_{n,d} \mathbf{H}_\theta \; . \]
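Here is a sketch verifying the decomposition numerically on a deliberately tiny regression network, using PyTorch’s torch.func transforms (requires PyTorch ≥ 2.0; the architecture, sizes, and variable names are illustrative choices, not anything canonical):

```python
import torch
from torch.func import hessian, jacrev   # requires PyTorch >= 2.0

torch.manual_seed(0)

# Tiny regression setup: n = 5 data points, d = 1 output, and a one-hidden-layer
# tanh network whose 13 parameters are packed into a single flat vector so that
# Jacobians and Hessians come out as ordinary 2-D arrays.
x = torch.randn(5, 1)
y = torch.randn(5, 1)

def net(theta):
    W1, b1 = theta[0:4].reshape(1, 4), theta[4:8]
    W2, b2 = theta[8:12].reshape(4, 1), theta[12:13]
    return torch.tanh(x @ W1 + b1) @ W2 + b2        # outputs, shape (5, 1)

def loss_f(f):                                      # loss as a function of outputs
    return 0.5 * ((f - y) ** 2).mean()

def loss(theta):                                    # loss as a function of parameters
    return loss_f(net(theta))

theta = torch.randn(13)
f = net(theta)

H = hessian(loss)(theta)                            # full Hessian, (13, 13)
J = jacrev(net)(theta).reshape(-1, 13)              # J_theta, (n*d, 13)
Hf = hessian(loss_f)(f).reshape(5, 5)               # H_f, Hessian of loss wrt outputs
delta = jacrev(loss_f)(f).reshape(-1)               # delta_l, (n*d,)
Htheta = hessian(net)(theta).reshape(5, 13, 13)     # per-output Hessians wrt theta

gauss_newton = J.T @ Hf @ J                         # J_theta . H_f . J_theta
residual = torch.einsum('i,ijk->jk', delta, Htheta) # delta_l . H_theta

print(torch.allclose(H, gauss_newton + residual, atol=1e-4))  # True, up to float32 numerics
```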

So we have decomposed the Hessian into two terms. The first of these, containing the \(\mathbf{H}_f\) term, is positive semi-definite whenever \(\ell_f\) is convex in the network outputs (as it is for squared error and cross-entropy) and is (comparatively) easy to compute, since it doesn’t contain any second derivatives with respect to network parameters. The second matrix, by contrast, requires computing the Hessian of the network outputs with respect to the parameters, \(\mathbf{H}_\theta\), and has the impudence to also contain negative eigenvalues.


The two components of our Hessian decomposition exhibit varying degrees of “fitting convex optimization experts’ ideas of what a loss landscape should look like” and “being possible to physically compute on modern computers”. Much like Elphaba, the unnamed matrix which isn’t the Gauss-Newton matrix is frequently overlooked and misunderstood despite playing an integral role in the training dynamics of… Oz (there are limits to how far one can take a metaphor relating popular culture to second-order optimization methods, unfortunately).

The first matrix is also known as the Gauss-Newton matrix, and has a lot of interesting connections to information geometry via the Fisher information matrix, which I might write up as an addendum to this blog post at some point. Much like Elphaba to Galinda, the second matrix is powerful and misunderstood – while it is inconvenient for convex approximations of the local loss landscape, it contains all of the non-convexity of the neural network and essentially describes the feature-learning component of the dynamics. As the field is beginning to learn, feature learning is actually quite an important component of neural network training.

So that about wraps up this post, which turned into a bit of a scattered collection of observations about first and second derivatives. I hope it was vaguely useful in giving some intuition for why a lot of approximate second-order methods end up computing an outer product of gradients instead of the second derivative; perhaps one day in the future I’ll come back to this and write an actually rigorous version of the derivation.