Gradient Descent

What is this about?

We have an initial network and we make a prediction using the inputs. Using the output we can calculate the error. Now that we have the error, we need to update our weights. How do we do this? With Gradient Descent, a method that pursues a downward trajectory along the slope of our error.

How does Gradient Descent Work?

We start by making an initial prediction.

\[ \hat{y} = \sigma(Wx+b) \]

which it turns out is not accurate. We then subtract the gradient of the error (the partial derivative \(\frac{\partial E}{\partial W}\)) multiplied by some learning rate \(\alpha\) that governs how far we are willing to move at each step down the hill.

\[ w_i' \gets w_i - \alpha \frac{\partial E}{\partial w_i}\\ b' \gets b - \alpha \frac{\partial E}{\partial b}\\ \hat{y}' = \sigma(W'x + b') \]
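
Here is a minimal sketch of that update in code (NumPy; the names `step` and the default `alpha` are mine), using the closed-form gradients derived further down in these notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(W, b, x, y, alpha=0.1):
    """One gradient descent step for a single training point (x, y)."""
    y_hat = sigmoid(np.dot(W, x) + b)  # current prediction
    dE_dW = -(y - y_hat) * x           # dE/dw_i, derived below
    dE_db = -(y - y_hat)               # dE/db, derived below
    W = W - alpha * dE_dW              # step down the slope
    b = b - alpha * dE_db
    return W, b
```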

Okay, but what again?

To find our gradient we need to take some derivatives. Let's start with the derivative of our sigmoid.

\[ \sigma'(x) = \sigma(x) (1 - \sigma(x)) \] The lecturer shows the derivation, but take my word for it: this is what it is.
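
If you don't want to take my word for it, a quick finite-difference check (a throwaway snippet, not from the lecture) confirms the identity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x, h = 0.5, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # finite difference
identity = sigmoid(x) * (1 - sigmoid(x))               # closed form
print(numeric, identity)  # both ~0.2350
```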

Our error is:

\[ E = -y \ln(\hat{y}) - (1 - y)\ln(1 - \hat{y}) \]
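
In code (the example values are mine, chosen to show the behavior), this error stays small when the prediction agrees with the label and blows up when it confidently disagrees:

```python
import numpy as np

def cross_entropy(y, y_hat):
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

print(cross_entropy(1, 0.9))  # ~0.105: confident and correct
print(cross_entropy(1, 0.1))  # ~2.303: confident and wrong
```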

and the derivative of our prediction \(\hat{y}\) with respect to each weight:

\[ \frac{\partial \hat{y}}{\partial w_j} = \hat{y}(1 - \hat{y}) \cdot x_j \]

Trust me on this one as well. Chaining the two together, the derivative of our error becomes:

\[ \frac{\partial E}{\partial w_j} = -(y - \hat{y})x_j \]

and for the bias term we get:

\[ \frac{\partial E}{\partial b} = -(y - \hat{y}) \]

And our overall gradient can be written as:

\[ \nabla E = -(y - \hat{y})(x_1, \ldots, x_n, 1) \]

So our gradient is just the coordinates of the point scaled by the error. This means the closer our prediction is to the true value, the smaller the gradient will be, and vice versa, much like the Perceptron Trick we learned earlier.
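
As a sketch (input values made up for illustration), the whole gradient is one vector operation:

```python
import numpy as np

x = np.array([0.4, -1.2, 2.0])             # one training point
y, y_hat = 1.0, 0.6                        # label and current prediction
grad_E = -(y - y_hat) * np.append(x, 1.0)  # (dE/dw_1..dw_n, dE/db)
print(grad_E)  # small error -> small gradient
```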

So, how do you put it all together?

Okay, this is how you update your weights. First, scale \(\alpha\) to your data set by dividing by the number of rows \(m\): \[ \alpha \gets \frac{1}{m}\alpha \]

Now calculate your new weights.

\[ w_i' \gets w_i + \alpha(y - \hat{y})x_i \]

And the new bias.

\[ b' \gets b + \alpha(y - \hat{y}) \]
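
Putting all of it into one update over a whole data set looks roughly like this (a sketch with made-up data; `X` holds one training row per line):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0]])  # m = 3 rows
y = np.array([1.0, 0.0, 1.0])                         # labels
W, b, alpha = np.zeros(2), 0.0, 0.1
m = len(X)

y_hat = sigmoid(X @ W + b)               # predictions for every row
W = W + (alpha / m) * ((y - y_hat) @ X)  # w_i' <- w_i + alpha(y - y_hat)x_i
b = b + (alpha / m) * np.sum(y - y_hat)  # b'  <- b  + alpha(y - y_hat)
print(W, b)
```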