More gradient descent

Date: 2024-04-07 | Author: Jason Eveleth


Issues with my last post

Even after I finished my last post, I didn't feel like I had fully addressed the overall meaning and motivation. The first thing that bothered me was that it felt like I was using the transpose of the derivative as an "inverse" map. But the transpose of the derivative obviously isn't an inverse: it doesn't let us map from the tangent space at $f(\bm{x})$ back to the tangent space at $\bm{x}$. In the last post, I showed that the types work out (all the dual spaces line up), but that doesn't explain why the result is useful. I was missing the fact that we're not actually mapping some "target" value from the loss function's tangent space into the input's tangent space. Instead, it turns out the transpose of the derivative is directly the value we want, and we're just computing it efficiently.

I'd like to thank Aidan Hennessey for proof-reading and revealing the hidden secrets of manifolds. Now, I will prove that the transpose is indeed the "right value" and explain what the "right value" even means.

What is the right thing

Figure 1: On the left, we see the input space $V$ and the tangent space at $\bm{x}$, which contains $\bm{h}$. On the right, we have our output space $W$. Note that $Df(\bm{x})\bm{h}$ is in the tangent space at $f(\bm{x})$. There is an interactive version of this plot here.

For gradient ascent (descent is just the opposite direction, since locally we're on a plane), our goal is to maximize $f$ by finding the locally best direction. That is, we want the $\delta$-length vector $\bm{h}$ (in the tangent space of the input) that maximizes $f$:

$$\underset{\|h\|= \delta}{\operatorname*{argmax}\,} f(\bm{x} + \bm{h})$$

where $\delta$ is a small constant. The rest of this post is about why the argmax happens to point in the direction of $\bm{h} = Df(\bm{x})^{\intercal}$.

Transpose maps

Let $f$ be the function we care about, $f: V \to \mathbb{R}$, from a vector space to the real numbers (we could actually choose any field we want, but most loss functions are of this form). Notice that the left side of this equation is a scalar:

$$f(\bm{x} + \bm{h}) - f(\bm{x}) \approx Df(\bm{x})\bm{h}.$$

Since $Df(\bm{x})$ turns a vector into a scalar (linearly), it must be a linear form. So $Df(\bm{x})$ is a member of the dual space of $V$. Now, since the derivative is in the dual space, we can rewrite that expression as an inner product with a vector (the transpose of the derivative). To make that rigorous, take a basis $B = \{\bm{e^1}, \dots ,\bm{e^n}\}$ of $V$: for every $\bm{v} \in V$, there are unique $\alpha_{i}$ such that $\bm{v} = \alpha_1\bm{e^1} + \dots + \alpha_n \bm{e^n}$ (by the definition of a basis). Then we can define basis elements of $V^*$ using

$$\begin{align*} e_{i}: V &\to \mathbb{R}\\ \bm{v} &\mapsto \alpha_i \end{align*}$$

Then $B^* = \{e_1, \dots , e_{n}\}$ is a basis of $V^*$, the dual space (the space of linear forms, also called covectors). Given any element of the dual space, we can write it in coordinate form (writing $v_i$ for the coordinates of $\bm{v}$):

$$\left(\sum_i \alpha_i e_i\right)(\bm{v}) = \sum_i \alpha_i e_i(\bm{v}) = \sum_i \alpha_i v_i = \langle \alpha, \bm{v} \rangle$$
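As a quick concrete example (my own, in $\mathbb{R}^2$ with the standard basis), the covector $3e_1 + 2e_2$ acts on $\bm{v} = v_1\bm{e^1} + v_2\bm{e^2}$ by

$$(3e_1 + 2e_2)(\bm{v}) = 3v_1 + 2v_2 = \left\langle (3, 2), \bm{v} \right\rangle,$$

so the covector's coefficients are exactly the entries of the vector we pair $\bm{v}$ with.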

Let

$$\begin{align*} \phi:\ &V^* \to V\\ &e_{i} \mapsto \bm{e^i} \end{align*}$$

be the transpose map (extended linearly). All $\phi$ does is preserve the coefficients relative to the basis. It should be clear now that

$$Df(\bm{x})\bm{h} = \left\langle \phi(Df(\bm{x})), \bm{h} \right\rangle = \left\langle Df(\bm{x})^{\intercal}, \bm{h} \right\rangle.$$
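As a numerical sanity check (a sketch of my own, not from the post, using a concrete quadratic $f(\bm{x}) = \bm{x}^{\intercal} A \bm{x}$ whose derivative is the row vector $\bm{x}^{\intercal}(A + A^{\intercal})$): applying the linear form $Df(\bm{x})$ to $\bm{h}$ gives the same number as the inner product with its transpose, and both match the first-order change in $f$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))

def f(v):
    return v @ A @ v            # a concrete quadratic form, chosen for illustration

def Df(x):
    return x @ (A + A.T)        # its derivative at x, written as a row vector

x = rng.standard_normal(3)
h = rng.standard_normal(3)
eps = 1e-6

print(Df(x) @ (eps * h))            # the linear form Df(x) applied to (eps * h)
print(np.dot(Df(x).T, eps * h))     # <Df(x)^T, eps * h>: the same number
print(f(x + eps * h) - f(x))        # and both approximate the actual change in f
```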

Why is the "steepest ascent" = the transpose?

By unrolling the definition of the derivative, we have

$$\begin{align*} \underset{\|h\|= \delta}{\operatorname*{argmax}\,} f(\bm{x} + \bm{h}) &\approx \underset{\|h\|= \delta}{\operatorname*{argmax}\,} \left( f(\bm{x}) + Df(\bm{x})\bm{h} \right)\\ &= \underset{\|h\|= \delta}{\operatorname*{argmax}\,} Df(\bm{x})\bm{h} \end{align*}$$

Let $\bm{c}=Df(\bm{x})^\intercal$ for clarity. Then, using the dual space equation from earlier, we find:

$$\underset{\|h\|= \delta}{\operatorname*{argmax}\,} Df(\bm{x})\bm{h} = \underset{\|h\|= \delta}{\operatorname*{argmax}\,} \langle Df(\bm{x})^\intercal, \bm{h}\rangle = \underset{\|h\|= \delta}{\operatorname*{argmax}\,} \langle \bm{c}, \bm{h}\rangle$$

Now recall the Cauchy-Schwarz inequality: $\langle \bm{a},\bm{b} \rangle \leq \|\bm{a}\|\|\bm{b}\|$. Since $\|\bm{h}\|=\delta$, we have that

$$\langle \bm{c}, \bm{h}\rangle \leq \|\bm{c}\|\delta.$$

Consider $\bm{h} = \frac{\delta}{\|\bm{c}\|}\bm{c}$. Then

$$\langle \bm{c}, \bm{h}\rangle = \left\langle \bm{c}, \frac{\delta}{\|\bm{c}\|}\bm{c}\right\rangle = \frac{\delta}{\|\bm{c}\|}\|\bm{c}\|^2 = \|\bm{c}\|\delta.$$

Thus $\bm{h} = \frac{\delta}{\|\bm{c}\|}\bm{c}$ achieves the upper bound, so it must be the argmax. Therefore, $\bm{h} \propto Df(\bm{x})^{\intercal}$ gives the direction of steepest ascent.
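To see this numerically (again a sketch of my own, with a concrete quadratic $f$ chosen for illustration), we can compare the first-order gain $\langle \bm{c}, \bm{h} \rangle$ of the scaled gradient against random directions of the same length $\delta$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))

def f(v):
    return v @ A @ v                          # a concrete quadratic, for illustration

def grad(x):
    return (A + A.T) @ x                      # c = Df(x)^T for this f

x = rng.standard_normal(4)
delta = 1e-3

c = grad(x)
h_star = (delta / np.linalg.norm(c)) * c      # the claimed argmax, (delta/||c||) c

# Random directions of the same length delta.
hs = rng.standard_normal((10_000, 4))
hs *= delta / np.linalg.norm(hs, axis=1, keepdims=True)

print(np.dot(c, h_star) >= (hs @ c).max())        # True, by Cauchy-Schwarz
print(np.dot(c, h_star), np.linalg.norm(c) * delta)  # both equal ||c|| * delta
print(f(x + h_star) - f(x))                       # ~ delta * ||c||, up to O(delta^2)
```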

Some real-world examples

Now that we believe the transpose of the derivative is actually the element of the tangent space that we want, let's try a nontrivial example. Consider this neural net:

Figure 2: Consider a neural net with activation function $\sigma$ and 2 hidden layers. $\ell$ is the loss function, $x$ is the input vector, and $y$ is the expected output. Each $o_{i}$ is the output vector of a layer, and each $a_{i} = W_{i}o_{i-1} + b_{i}$ is the input to the activation function.

We can write this neural net as a function:

$$N(x; W_{i}, b_{i}) = o_{3} = \sigma(W_{3}(\sigma(W_{2}(\sigma(W_{1}x + b_{1})) + b_{2})) + b_{3})$$

and the loss as a function of the parameters:

$$L(x, y; W_{i}, b_{i}) = \ell(N(x; W_{i}, b_{i}), y),$$

which is insane to expand out entirely. We're interested in the gradient of the loss with respect to $W_{1}$. I'm going to define some helper functions ($h_{i}$) so our $f$ looks cleaner. So let's define

$$\begin{align*} h_{1}(o) &= \ell(o,y)\\ h_{2}(o) &= W_{3}o + b_{3}\\ h_{3}(o) &= W_{2}o + b_{2}\\ h_{4}(W) &= Wx + b_{1}\\ f(W) &= h_{1}(\sigma(h_{2}(\sigma(h_{3}(\sigma(h_{4}(W))))))), \end{align*}$$

where the input $x$, the expected output $y$, and the other weights and biases are now treated as constants.
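To make the decomposition concrete, here is a minimal sketch of $f$ in code. The sigmoid activation, squared-error loss, and layer sizes are my own assumptions for illustration; the post doesn't pin them down.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))      # assumed elementwise activation

# Hypothetical layer sizes (4 -> 5 -> 5 -> 3); x, y, and everything but W_1 are constants.
x  = rng.standard_normal(4)                      # input
y  = rng.standard_normal(3)                      # expected output
b1 = rng.standard_normal(5)
W2, b2 = rng.standard_normal((5, 5)), rng.standard_normal(5)
W3, b3 = rng.standard_normal((3, 5)), rng.standard_normal(3)

ell = lambda o, y: 0.5 * np.sum((o - y) ** 2)    # assumed squared-error loss

# The helper functions h_i from above.
h1 = lambda o: ell(o, y)
h2 = lambda o: W3 @ o + b3
h3 = lambda o: W2 @ o + b2
h4 = lambda W: W @ x + b1

def f(W1):
    """f(W_1) = ell(o_3, y): the loss as a function of W_1 alone."""
    return h1(sigma(h2(sigma(h3(sigma(h4(W1)))))))

W1 = rng.standard_normal((5, 4))
print(f(W1))    # a scalar loss value
```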

We should think of $f$ as the function that takes in $W_{1}$ and outputs $\ell(o_{3}, y)$, but I've written it as a composition of smaller functions so that we can apply the chain rule. Recall the chain rule (better explained in my last post):

$$\begin{align*} Df(W) &= D(h_{1}\circ \sigma \circ h_{2}\circ \sigma\circ h_{3} \circ \sigma\circ h_{4})(W)\\ &= Dh_{1}(o_{3})\,D\sigma(a_{3})\,Dh_{2}(o_{2})\,D\sigma(a_{2})\,Dh_{3}(o_{1})\,D\sigma(a_{1})\,Dh_{4}(W) \end{align*}$$

Since $Dh_{1}(o_{3})$ is a matrix with a single row (the output is a scalar), it is most efficient to do the multiplication from left to right. This is backward-mode autograd. And this finally brings us to the Jacobian-vector products of my last post. For example, let's say we've accumulated the leftmost factors into a row matrix $g^{\intercal}$, so it looks like

$$\underbrace{Dh_{1}(o_{3})\,D\sigma(a_{3})\,Dh_{2}(o_{2})}_{g^{\intercal}}\,D\sigma(a_{2})\,Dh_{3}(o_{1})\,D\sigma(a_{1})\,Dh_{4}(W)$$

It is potentially much faster to call $J^{\intercal}\sigma(a_{2}, g^{\intercal})$ (how cheap depends on $\sigma$; for an elementwise $\sigma$ it is only $O(n)$) than to materialize $D\sigma(a_{2})$ and do the matrix multiplication, which would be $O(n^{2})$ since $D\sigma(a_{2})$ is an $n \times n$ matrix (recall that $g^{\intercal}$ is a matrix with a single row):

$$J^{\intercal}\sigma\left( a_{2}, g^{\intercal} \right) = g^{\intercal}D\sigma(a_{2}).$$
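Here is a rough sketch of that left-to-right accumulation (again my own code, continuing the hypothetical sigmoid/squared-error setup from above, not the post's implementation). Note that multiplying by $D\sigma(a)$ for an elementwise $\sigma$ is just an elementwise multiply, so the $n \times n$ Jacobian is never built:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma  = lambda z: 1.0 / (1.0 + np.exp(-z))      # assumed elementwise activation
dsigma = lambda z: sigma(z) * (1.0 - sigma(z))   # its derivative, also elementwise

# Hypothetical shapes and constants (4 -> 5 -> 5 -> 3), same spirit as the sketch above.
x  = rng.standard_normal(4)
y  = rng.standard_normal(3)
W1 = rng.standard_normal((5, 4)); b1 = rng.standard_normal(5)
W2 = rng.standard_normal((5, 5)); b2 = rng.standard_normal(5)
W3 = rng.standard_normal((3, 5)); b3 = rng.standard_normal(3)

def loss(W1):
    """Forward pass; returns the scalar loss and the intermediates a_i, o_3."""
    a1 = W1 @ x + b1
    a2 = W2 @ sigma(a1) + b2
    a3 = W3 @ sigma(a2) + b3
    o3 = sigma(a3)
    return 0.5 * np.sum((o3 - y) ** 2), (a1, a2, a3, o3)

# Backward pass: accumulate g^T = Dh1(o3) Dsigma(a3) Dh2(o2) ... from left to right.
_, (a1, a2, a3, o3) = loss(W1)
g = o3 - y                 # Dh1(o3) for the assumed squared-error loss
g = g * dsigma(a3)         # g^T Dsigma(a3): elementwise, no n x n matrix
g = W3.T @ g               # g^T Dh2(o2) = g^T W3
g = g * dsigma(a2)         # g^T Dsigma(a2)
g = W2.T @ g               # g^T Dh3(o1) = g^T W2
g = g * dsigma(a1)         # g^T Dsigma(a1)
grad_W1 = np.outer(g, x)   # g^T Dh4(W), reshaped to the shape of W1

# Spot-check one entry against a finite difference.
E = np.zeros_like(W1); E[2, 1] = 1e-6
print(grad_W1[2, 1], (loss(W1 + E)[0] - loss(W1)[0]) / 1e-6)   # approximately equal
```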

Wrapping Up

As we've seen, the transpose of the derivative happens to be the direction of steepest ascent. It is not an "inverse map" (the mistake I made in my last post). By using the Jacobian-vector products I introduced in my last post, we can calculate the gradient efficiently, which is critical for practical applications (like neural networks). To put it into practice, I presented an example derivation of the gradient, which you could use for gradient descent.

Interactive plot

The left side is the input space, and the right side is the output space. The arrow on the left is a small increment to the input. The arrow on the right is the approximate change in the output using the first-order approximation, and the red $\times$ is the true function value.

Right now, it's a linear function: $f(\bm{x}) = A\bm{x} + \bm{b}$. Both the input and output are vectors. You can adjust the $\bm{h}$ value and see how accurate the first-order approximation (the arrow on the right graph) is. For the linear function it should be perfect, since a linear function is its own first-order approximation.

Try the quadratic function (vector to scalar) and the sine function (vector to vector). Note: I don't really have the time to animate multi-dimensional graphs, so we represent a scalar value $r$ as $(r, 0)$. You can find the code here.


