More gradient descent

Date: 2024-04-07 | Author: Jason Eveleth


Issues with my last post

Even after I finished my last post, I didn't feel like I had fully addressed the overall meaning and motivation. The first thing that bothered me was that it felt like I was using the transpose of the derivative as an "inverse" map. But the transpose of the derivative obviously isn't an inverse: it doesn't let us map from the tangent space at $f(\bm{x})$ back to the tangent space at $\bm{x}$. In the last post, I showed that the types work out (all the dual spaces line up), but that doesn't explain why the result is useful. I was missing the fact that we're not actually mapping some "target" value from the loss function's tangent space into the input's tangent space. Instead, it turns out the transpose of the derivative is directly the value we want, and we're just computing it efficiently.

I'd like to thank Aidan Hennessey for proof-reading and revealing the hidden secrets of manifolds. Now, I will prove that the transpose is indeed the "right value" and explain what the "right value" even means.

What is the right thing

Figure 1: On the left, we see the input space $V$ and the tangent space at $\bm{x}$, which contains $\bm{h}$. On the right, we have our output space $W$. Note that $Df(\bm{x})\bm{h}$ is in the tangent space at $f(\bm{x})$. There is an interactive version of this plot here.

For gradient ascent (descent is just the opposite direction, since locally we're on a plane), our goal is to maximize $f$ by finding the locally best direction. That is, we want the $\delta$-length vector $\bm{h}$ (in the tangent space of the input) that maximizes $f$:

$$\underset{\|h\|= \delta}{\operatorname*{argmax}\,} f(\bm{x} + \bm{h})$$

where $\delta$ is a small constant. The rest of this post is about why the argmax happens to point in the direction of $\bm{h} = Df(\bm{x})^{\intercal}$.

Transpose maps

Let $f$ be the function we care about, $f: V \to \mathbb{R}$, from a vector space to the real numbers (we could actually choose any field we want, but most loss functions are of this form). Notice that the left side of this equation is a scalar:

$$f(\bm{x} + \bm{h}) - f(\bm{x}) \approx Df(\bm{x})\bm{h}.$$

Since $Df(\bm{x})$ turns a vector into a scalar (linearly), it must be a linear form. So $Df(\bm{x})$ is a member of the dual space of $V$. Now, since the derivative is in the dual space, we can rewrite that expression as an inner product with a vector (the transpose of the derivative). To make that rigorous, take a basis $B = \{\bm{e^1}, \dots ,\bm{e^n}\}$ of $V$: for every $\bm{v} \in V$, there are unique $\alpha_{i}$ such that $\bm{v} = \alpha_1\bm{e^1} + \dots + \alpha_n \bm{e^n}$ (by the definition of a basis). Then we can define basis elements of $V^*$ using

$$\begin{align*} e_{i}: V &\to \mathbb{R}\\ \bm{v} &\mapsto \alpha_i \end{align*}$$

Then $B^* = \{e_1, \dots , e_{n}\}$ is a basis of $V^*$, the dual space (the space of linear forms, also called covectors). Given any element of the dual space, we can write it in coordinate form (writing $v_i$ for the coordinates of $\bm{v}$):

$$\left(\sum_i \alpha_i e_i\right)(\bm{v}) = \sum_i \alpha_i e_i(\bm{v}) = \sum_i \alpha_i v_i = \langle \alpha, \bm{v} \rangle$$
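As a quick concrete example (my own, in $\mathbb{R}^2$ with the standard basis), the covector $3e_1 + 2e_2$ acts on $\bm{v} = v_1\bm{e^1} + v_2\bm{e^2}$ by

$$(3e_1 + 2e_2)(\bm{v}) = 3v_1 + 2v_2 = \left\langle (3, 2), \bm{v} \right\rangle,$$

so the covector's coefficients are exactly the entries of the vector we pair $\bm{v}$ with.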

Let

$$\begin{align*} \phi:\ &V^* \to V\\ &e_{i} \mapsto \bm{e^i} \end{align*}$$

be the transpose map (extended linearly). All $\phi$ does is preserve the coefficients relative to the basis. It should be clear now that

$$Df(\bm{x})\bm{h} = \left\langle \phi(Df(\bm{x})), \bm{h} \right\rangle = \left\langle Df(\bm{x})^{\intercal}, \bm{h} \right\rangle.$$
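As a numerical sanity check (a sketch of my own, not from the post, using a concrete quadratic $f(\bm{x}) = \bm{x}^{\intercal} A \bm{x}$ whose derivative is the row vector $\bm{x}^{\intercal}(A + A^{\intercal})$): applying the linear form $Df(\bm{x})$ to $\bm{h}$ gives the same number as the inner product with its transpose, and both match the first-order change in $f$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))

def f(v):
    return v @ A @ v            # a concrete quadratic form, chosen for illustration

def Df(x):
    return x @ (A + A.T)        # its derivative at x, written as a row vector

x = rng.standard_normal(3)
h = rng.standard_normal(3)
eps = 1e-6

print(Df(x) @ (eps * h))            # the linear form Df(x) applied to (eps * h)
print(np.dot(Df(x).T, eps * h))     # <Df(x)^T, eps * h>: the same number
print(f(x + eps * h) - f(x))        # and both approximate the actual change in f
```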

Why is the "steepest ascent" = the transpose?

By unrolling the definition of the derivative, we have

$$\begin{align*} \underset{\|h\|= \delta}{\operatorname*{argmax}\,} f(\bm{x} + \bm{h}) &\approx \underset{\|h\|= \delta}{\operatorname*{argmax}\,} \left( f(\bm{x}) + Df(\bm{x})\bm{h} \right)\\ &= \underset{\|h\|= \delta}{\operatorname*{argmax}\,} Df(\bm{x})\bm{h} \end{align*}$$

Let $\bm{c}=Df(\bm{x})^\intercal$ for clarity. Then, using the dual space equation from earlier, we find:

$$\underset{\|h\|= \delta}{\operatorname*{argmax}\,} Df(\bm{x})\bm{h} = \underset{\|h\|= \delta}{\operatorname*{argmax}\,} \langle Df(\bm{x})^\intercal, \bm{h}\rangle = \underset{\|h\|= \delta}{\operatorname*{argmax}\,} \langle \bm{c}, \bm{h}\rangle$$

Now recall the Cauchy-Schwarz inequality: $\langle \bm{a},\bm{b} \rangle \leq \|\bm{a}\|\|\bm{b}\|$. Since $\|\bm{h}\|=\delta$, we have that

$$\langle \bm{c}, \bm{h}\rangle \leq \|\bm{c}\|\delta.$$

Consider $\bm{h} = \frac{\delta}{\|\bm{c}\|}\bm{c}$. Then

$$\langle \bm{c}, \bm{h}\rangle = \left\langle \bm{c}, \frac{\delta}{\|\bm{c}\|}\bm{c}\right\rangle = \frac{\delta}{\|\bm{c}\|}\|\bm{c}\|^2 = \|\bm{c}\|\delta.$$

Thus $\bm{h} = \frac{\delta}{\|\bm{c}\|}\bm{c}$ achieves the upper bound, so it must be the argmax. Therefore, $\bm{h} \propto Df(\bm{x})^{\intercal}$ gives the direction of steepest ascent.
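To see this numerically (again a sketch of my own, with a concrete quadratic $f$ chosen for illustration), we can compare the first-order gain $\langle \bm{c}, \bm{h} \rangle$ of the scaled gradient against random directions of the same length $\delta$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))

def f(v):
    return v @ A @ v                          # a concrete quadratic, for illustration

def grad(x):
    return (A + A.T) @ x                      # c = Df(x)^T for this f

x = rng.standard_normal(4)
delta = 1e-3

c = grad(x)
h_star = (delta / np.linalg.norm(c)) * c      # the claimed argmax, (delta/||c||) c

# Random directions of the same length delta.
hs = rng.standard_normal((10_000, 4))
hs *= delta / np.linalg.norm(hs, axis=1, keepdims=True)

print(np.dot(c, h_star) >= (hs @ c).max())        # True, by Cauchy-Schwarz
print(np.dot(c, h_star), np.linalg.norm(c) * delta)  # both equal ||c|| * delta
print(f(x + h_star) - f(x))                       # ~ delta * ||c||, up to O(delta^2)
```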

Some real-world examples

Now that we believe the transpose of the derivative is actually the element of the tangent space that we want, let's try a nontrivial example. Consider this neural net:

Figure 2: Consider a neural net with activation function $\sigma$ and 2 hidden layers. $\ell$ is the loss function, $x$ is the input vector, and $y$ is the expected output. Each $o_{i}$ is the output vector of a layer, and each $a_{i} = W_{i}o_{i-1} + b_{i}$ is the input to the activation function.

We can write this neural net as a function:

$$N(x; W_{i}, b_{i}) = o_{3} = \sigma(W_{3}(\sigma(W_{2}(\sigma(W_{1}x + b_{1})) + b_{2})) + b_{3})$$

and the loss as a function of the parameters:

$$L(x, y; W_{i}, b_{i}) = \ell(N(x; W_{i}, b_{i}), y),$$

which is insane to expand out entirely. We're interested in the gradient of the loss with respect to $W_{1}$. I'm going to define some helper functions ($h_{i}$) so our $f$ looks cleaner. So let's define

$$\begin{align*} h_{1}(o) &= \ell(o,y)\\ h_{2}(o) &= W_{3}o + b_{3}\\ h_{3}(o) &= W_{2}o + b_{2}\\ h_{4}(W) &= Wx + b_{1}\\ f(W) &= h_{1}(\sigma(h_{2}(\sigma(h_{3}(\sigma(h_{4}(W))))))), \end{align*}$$

where the input $x$, the expected output $y$, and the other weights and biases are now treated as constants.
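To make the decomposition concrete, here is a minimal sketch of $f$ in code. The sigmoid activation, squared-error loss, and layer sizes are my own assumptions for illustration; the post doesn't pin them down.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))      # assumed elementwise activation

# Hypothetical layer sizes (4 -> 5 -> 5 -> 3); x, y, and everything but W_1 are constants.
x  = rng.standard_normal(4)                      # input
y  = rng.standard_normal(3)                      # expected output
b1 = rng.standard_normal(5)
W2, b2 = rng.standard_normal((5, 5)), rng.standard_normal(5)
W3, b3 = rng.standard_normal((3, 5)), rng.standard_normal(3)

ell = lambda o, y: 0.5 * np.sum((o - y) ** 2)    # assumed squared-error loss

# The helper functions h_i from above.
h1 = lambda o: ell(o, y)
h2 = lambda o: W3 @ o + b3
h3 = lambda o: W2 @ o + b2
h4 = lambda W: W @ x + b1

def f(W1):
    """f(W_1) = ell(o_3, y): the loss as a function of W_1 alone."""
    return h1(sigma(h2(sigma(h3(sigma(h4(W1)))))))

W1 = rng.standard_normal((5, 4))
print(f(W1))    # a scalar loss value
```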

We should think of $f$ as the function that takes in $W_{1}$ and outputs $\ell(o_{3}, y)$, but I've written it as a composition of smaller functions so that we can apply the chain rule. Recall the chain rule (better explained in my last post):

$$\begin{align*} Df(W) &= D(h_{1}\circ \sigma \circ h_{2}\circ \sigma\circ h_{3} \circ \sigma\circ h_{4})(W)\\ &= Dh_{1}(o_{3})\,D\sigma(a_{3})\,Dh_{2}(o_{2})\,D\sigma(a_{2})\,Dh_{3}(o_{1})\,D\sigma(a_{1})\,Dh_{4}(W) \end{align*}$$

Since $Dh_{1}(o_{3})$ is a matrix with a single row (the output is a scalar), it is most efficient to do the multiplication from left to right. This is backward-mode autograd. And this finally brings us to the Jacobian-vector products of my last post. For example, let's say we've accumulated the leftmost factors into a row matrix $g^{\intercal}$, so it looks like

$$\underbrace{Dh_{1}(o_{3})\,D\sigma(a_{3})\,Dh_{2}(o_{2})}_{g^{\intercal}}\,D\sigma(a_{2})\,Dh_{3}(o_{1})\,D\sigma(a_{1})\,Dh_{4}(W)$$

It is potentially much faster to call $J^{\intercal}\sigma(a_{2}, g^{\intercal})$ (how cheap depends on $\sigma$; for an elementwise $\sigma$ it is only $O(n)$) than to materialize $D\sigma(a_{2})$ and do the matrix multiplication, which would be $O(n^{2})$ since $D\sigma(a_{2})$ is an $n \times n$ matrix (recall that $g^{\intercal}$ is a matrix with a single row):

$$J^{\intercal}\sigma\left( a_{2}, g^{\intercal} \right) = g^{\intercal}D\sigma(a_{2}).$$
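Here is a rough sketch of that left-to-right accumulation (again my own code, continuing the hypothetical sigmoid/squared-error setup from above, not the post's implementation). Note that multiplying by $D\sigma(a)$ for an elementwise $\sigma$ is just an elementwise multiply, so the $n \times n$ Jacobian is never built:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma  = lambda z: 1.0 / (1.0 + np.exp(-z))      # assumed elementwise activation
dsigma = lambda z: sigma(z) * (1.0 - sigma(z))   # its derivative, also elementwise

# Hypothetical shapes and constants (4 -> 5 -> 5 -> 3), same spirit as the sketch above.
x  = rng.standard_normal(4)
y  = rng.standard_normal(3)
W1 = rng.standard_normal((5, 4)); b1 = rng.standard_normal(5)
W2 = rng.standard_normal((5, 5)); b2 = rng.standard_normal(5)
W3 = rng.standard_normal((3, 5)); b3 = rng.standard_normal(3)

def loss(W1):
    """Forward pass; returns the scalar loss and the intermediates a_i, o_3."""
    a1 = W1 @ x + b1
    a2 = W2 @ sigma(a1) + b2
    a3 = W3 @ sigma(a2) + b3
    o3 = sigma(a3)
    return 0.5 * np.sum((o3 - y) ** 2), (a1, a2, a3, o3)

# Backward pass: accumulate g^T = Dh1(o3) Dsigma(a3) Dh2(o2) ... from left to right.
_, (a1, a2, a3, o3) = loss(W1)
g = o3 - y                 # Dh1(o3) for the assumed squared-error loss
g = g * dsigma(a3)         # g^T Dsigma(a3): elementwise, no n x n matrix
g = W3.T @ g               # g^T Dh2(o2) = g^T W3
g = g * dsigma(a2)         # g^T Dsigma(a2)
g = W2.T @ g               # g^T Dh3(o1) = g^T W2
g = g * dsigma(a1)         # g^T Dsigma(a1)
grad_W1 = np.outer(g, x)   # g^T Dh4(W), reshaped to the shape of W1

# Spot-check one entry against a finite difference.
E = np.zeros_like(W1); E[2, 1] = 1e-6
print(grad_W1[2, 1], (loss(W1 + E)[0] - loss(W1)[0]) / 1e-6)   # approximately equal
```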

Wrapping Up

As we've seen, the transpose of the derivative happens to be the direction of steepest ascent. It is not an "inverse map" (the mistake I made in my last post). By using the Jacobian-vector products I introduced in my last post, we can calculate the gradient efficiently, which is critical for practical applications (like neural networks). To put it into practice, I presented an example derivation of the gradient, which you could use for gradient descent.

Interactive plot

The left side is the input space, and the right side is the output space. The arrow on the left is a small increment to the input. The arrow on the right is the approximate change in the output using the first-order approximation, and the red $\times$ is the true function value.

Right now, it's a linear function: $f(\bm{x}) = A\bm{x} + \bm{b}$. Both the input and output are vectors. You can adjust the $\bm{h}$ value and see how accurate the first-order approximation (the arrow on the right graph) is. For the linear function it should be perfect, since a linear function is its own first-order approximation.

Try the quadratic function (vector to scalar) and the sine function (vector to vector). Note: I don't really have the time to animate multi-dimensional graphs, so we represent a scalar value $r$ as $(r, 0)$. You can find the code here.


