We are trying to find the line of best fit given a collection of coordinates. For example, $\{(x_1, y_1), (x_2, y_2), (x_3, y_3), \dots\}$. We would like to find $m, b$ such that $y = mx + b$ minimizes the sum of the squared residuals. By that I mean we want to minimize the function $L(\alpha) = \sum_i (y_i - (mx_i + b))^2$ by finding the best $m$ and $b$.
We will use the notation from here. Here are some fun facts we'll need to know:
1. $\sum_i x_i^2 = \lVert x \rVert^2$
2. $\lVert x \rVert^2 = x^Tx$
3. $(A + B)^T = A^T + B^T$ and $(AB)^T = B^TA^T$
4. $D(u^Tx) = u^T$, $D(x^Tu) = u^T$, and $D(x^Tx) = D(\bar{x}^Tx) + D(x^T\bar{x})$.[1]
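If you'd like to convince yourself of facts 1–3 numerically before trusting them, here is a quick throwaway check in Julia (the vectors and matrices are random stand-ins of my own, not anything from the derivation):

```julia
using LinearAlgebra   # for norm

x = randn(5)
A, B = randn(3, 4), randn(3, 4)   # same shape, for the sum rule
C = randn(4, 2)                   # different shape, for the product rule

# Fun facts 1 and 2: a sum of squares is a squared norm, which equals xᵀx.
@assert sum(xi^2 for xi in x) ≈ norm(x)^2
@assert norm(x)^2 ≈ x' * x

# Fun fact 3: the transpose distributes over sums and reverses products.
@assert (A + B)' ≈ A' + B'
@assert (A * C)' ≈ C' * A'
```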
We need to set up some definitions. Let
$$y = \begin{bmatrix}y_1\\ y_2\\ y_3\\ y_4\\ \vdots\end{bmatrix}, \quad
X = \begin{bmatrix}x_1 & 1\\ x_2 & 1\\ x_3 & 1\\ x_4 & 1\\ \vdots & \vdots\end{bmatrix}, \quad
\alpha = \begin{bmatrix}m\\ b\end{bmatrix}.$$
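In code, building $X$ just means gluing a column of ones onto the $x$ data. A small sketch in Julia, using the coordinates from the worked example further down (variable names are mine):

```julia
x = [1.3, 3.2, 5.6, 8.5]   # the xᵢ from the example below
y = [0.8, 3.5, 6.4, 7.7]   # the yᵢ

# Design matrix: first column holds the xᵢ, second column is all ones,
# so that X * [m, b] works out to m .* x .+ b.
X = [x ones(length(x))]
```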
So our goal is:
$$\underset{\alpha}{\text{argmin}}\, L(\alpha)$$
$$=\underset{\alpha}{\text{argmin}}\, \sum_i (y_i - (m x_i + b))^2$$
Apply fun fact #1:
$$=\underset{\alpha}{\text{argmin}}\, \left\lVert \begin{bmatrix}y_1 - (m x_1 + b)\\ y_2 - (m x_2 + b)\\ y_3 - (m x_3 + b)\\ y_4 - (m x_4 + b)\\ \vdots\end{bmatrix}\right\rVert^2$$
$$=\underset{\alpha}{\text{argmin}}\, \lVert y - X\alpha\rVert^2$$
Apply fun fact #2:
$$=\underset{\alpha}{\text{argmin}}\, (y-X\alpha)^T(y - X\alpha)$$
Apply fun fact #3:
$$=\underset{\alpha}{\text{argmin}}\, (y^T - \alpha^TX^T)(y - X\alpha)$$
$$=\underset{\alpha}{\text{argmin}}\, y^Ty - \alpha^TX^Ty - y^TX\alpha + \alpha^TX^TX\alpha$$
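If that expansion looks suspicious, it's easy to check numerically with throwaway values (again just a sanity check, not part of the argument):

```julia
y = randn(6)
X = [randn(6) ones(6)]
α = randn(2)

lhs = (y - X * α)' * (y - X * α)                            # (y − Xα)ᵀ(y − Xα)
rhs = y' * y - α' * X' * y - y' * X * α + α' * X' * X * α   # expanded form
@assert lhs ≈ rhs
```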
To minimize this, we find the critical points of the function. We do this by finding where the derivative of $L$ with respect to $\alpha$ is $0$.
$$\begin{align*}
0 &= L'(\alpha)\\
&= D(y^Ty - \alpha^TX^Ty - y^TX\alpha + \alpha^TX^TX\alpha)\\
&= D(y^Ty) - D(\alpha^TX^Ty) - D(y^TX\alpha) + D(\alpha^TX^TX\alpha)
\end{align*}$$
Apply fun fact #4:
$$\begin{align*}
&= 0 - (X^Ty)^T - y^TX + D(\alpha^TX^TX\bar{\alpha}) + D(\bar{\alpha}^TX^TX\alpha)\\
&= -y^TX - y^TX + (X^TX\alpha)^T + \alpha^TX^TX\\
&= -2y^TX + \alpha^T(X^TX)^T + \alpha^TX^TX\\
&= -2y^TX + \alpha^TX^TX + \alpha^TX^TX\\
&= -2y^TX + 2\alpha^TX^TX
\end{align*}$$
And now for the glorious part (who am I kidding, this whole thing has been glorious),
$$2y^TX = 2\alpha^TX^TX$$

Dividing by $2$ and transposing both sides:

$$X^Ty = (X^TX)\alpha$$
$$(X^TX)^{-1}X^Ty = \alpha = \begin{bmatrix}m\\ b\end{bmatrix}$$

(assuming $X^TX$ is invertible, which holds as long as the $x_i$ are not all identical).
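If the matrix calculus feels like sleight of hand, you can also spot-check the derivative $L'(\alpha) = -2y^TX + 2\alpha^TX^TX$ against a finite-difference approximation before trusting the solution. A rough Julia sketch with made-up data (since $L$ is quadratic, central differences agree with the exact derivative up to rounding):

```julia
y = randn(6)
X = [randn(6) ones(6)]
α = randn(2)

L(a) = sum((y - X * a).^2)                      # the loss as a function of α
analytic = -2 * y' * X + 2 * α' * (X' * X)      # the row vector derived above

# Central-difference approximation of each partial derivative ∂L/∂αⱼ.
h = 1e-6
e(j) = [i == j ? 1.0 : 0.0 for i in 1:2]        # j-th standard basis vector
numeric = [(L(α + h * e(j)) - L(α - h * e(j))) / (2h) for j in 1:2]

analytic ≈ numeric'                             # true, up to floating-point noise
```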
Say we are given this set of coordinates: $\{(1.3, 0.8), (3.2, 3.5), (5.6, 6.4), (8.5, 7.7)\}$. Then we can either do the arithmetic by hand (from the last equation) or open julia and give it the vectors (some output omitted):
```julia
julia> y = [0.8; 3.5; 6.4; 7.7]

julia> X = [1.3 1; 3.2 1; 5.6 1; 8.5 1]

julia> inv(X' * X) * X' * y
2-element Vector{Float64}:
 0.9628
 0.1228
```
Which gives us our desired line of best fit: $y = 0.962823x + 0.122874$.
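As an aside, Julia's backslash operator solves the same least-squares problem for a tall matrix directly (via a QR factorization, without forming $X^TX$ or inverting anything), so it should reproduce the same $m$ and $b$:

```julia
julia> X \ y   # same least-squares solution as inv(X' * X) * X' * y, via QR
```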
[1] The bar in $\bar{\alpha}$ or $\bar{x}$ means treat it as a constant, in the same way that you would if you were doing the product rule with single-variable functions: $D(f(x)g(x)) = D(f(x)\bar{g}(x)) + D(\bar{f}(x)g(x)) = (Df(x))g(x) + (Dg(x))f(x)$.