Backpropagation
Background
We're going to extend backpropagation from a single-layer network to one with a hidden layer. The amount of change at each layer (the delta) uses the same equation no matter how many layers you have.
\[ \Delta w_{pq} = \eta \delta_{output} X_{in} \]
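As a quick sketch of where that rule comes from (assuming a squared-error loss and applying the chain rule, with \(h\) the input to a node and \(f\) its activation):

\begin{align}
E &= \frac{1}{2}(y - \hat{y})^2\\
\Delta w_{pq} &= -\eta \frac{\partial E}{\partial w_{pq}} = \eta (y - \hat{y}) f'(h) X_{in} = \eta \delta_{output} X_{in}
\end{align}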
Imports
We'll be sticking with our numpy-based implementation of a neural network.
import numpy
The Sigmoid
This is our familiar activation function.
def sigmoid(x: numpy.ndarray) -> numpy.ndarray:
    """
    Calculate the sigmoid of x
    """
    return 1 / (1 + numpy.exp(-x))
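The backward pass also needs the derivative of the sigmoid, which we'll compute inline as output * (1 - output). If you'd rather have it as a named helper, a small sketch (not part of the original code):

def sigmoid_prime(x: numpy.ndarray) -> numpy.ndarray:
    """
    Calculate the derivative of the sigmoid

    sigma'(x) = sigma(x) * (1 - sigma(x))
    """
    s = sigmoid(x)
    return s * (1 - s)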
Initial Values
We're going to do a single forward pass followed by backpropagation, so I'll make the values random since we're not really going to validate them.
numpy.random.seed(18)
x = numpy.random.randn(3)                                # the input values
target = numpy.random.random()                           # y
learning_rate = numpy.random.random()                    # eta
weights_input_to_hidden = numpy.random.random((3, 2))    # 3 input nodes x 2 hidden nodes
weights_hidden_to_output = numpy.random.random((2, 1))   # 2 hidden nodes x 1 output node
Variable | Value |
---|---|
x | [ 0.08 2.19 -0.13] |
y | 0.85 |
eta | 0.75 |
Input to Hidden Weights | [[0.67 0.99] [0.26 0.03] [0.64 0.85]] |
Hidden to Output Weights | [[0.74] [0.02]] |
The input has 3 nodes and the hidden layer has 2, so our weights from the input layer to the hidden layer have shape (3, 2): 3 rows and 2 columns. The output has one node, so the weights from the hidden layer to the output layer have 2 rows (to match the hidden layer) and 1 column (to match the output layer). In the lecture they use a vector with 2 entries instead; as far as I can tell it works the same either way.
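If you want to convince yourself that the shapes line up, a quick sanity check (just prints, not part of the lecture code):

print(weights_input_to_hidden.shape)   # (3, 2): input nodes x hidden nodes
print(weights_hidden_to_output.shape)  # (2, 1): hidden nodes x output node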
Forward Pass
hidden_layer_input = x.dot(weights_input_to_hidden)                   # (3,) . (3, 2) -> (2,)
hidden_layer_output = sigmoid(hidden_layer_input)                     # hidden activations
output_layer_in = hidden_layer_output.dot(weights_hidden_to_output)   # (2,) . (2, 1) -> (1,)
output = sigmoid(output_layer_in)                                     # y-hat
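To make the matrix multiplications concrete, here's the same forward pass written out as explicit sums over the nodes (a check against the dot-product version, not part of the original):

# hidden unit j gets the weighted sum of the inputs
hidden_check = numpy.array(
    [sum(x[i] * weights_input_to_hidden[i, j] for i in range(3))
     for j in range(2)])
assert numpy.allclose(sigmoid(hidden_check), hidden_layer_output)

# the output unit gets the weighted sum of the hidden activations
output_check = sum(hidden_layer_output[j] * weights_hidden_to_output[j, 0]
                   for j in range(2))
assert numpy.allclose(sigmoid(output_check), output)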
Backward Pass
The Output Error
Our error is \(y - \hat{y}\).
error = target - output
Output Error Term
Our output error term:
\begin{align}
\textit{output error term} &= (y - \hat{y}) \times \sigma'(h_o)\\
&= error \times \hat{y} \times (1 - \hat{y})
\end{align}

where \(h_o\) is the input to the output node.

output_error_term = error * output * (1 - output)
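As a sanity check, this is the same as multiplying the error by the sigmoid derivative evaluated at the output node's input (using the sigmoid_prime helper sketched earlier, if you defined it):

assert numpy.allclose(output_error_term, error * sigmoid_prime(output_layer_in))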
The Hidden Layer Error Term
The hidden layer error term is the output error term, scaled by the weight between the hidden and output layers, times the derivative of the activation function.
\[ \delta^h = W \delta^o f'(h) \]
hidden_error_term = (weights_hidden_to_output.T
* output_error_term
* hidden_layer_output * (1 - hidden_layer_output))
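Written out per hidden unit, the broadcasting above is doing this (an equivalent loop version, just for illustration):

hidden_error_check = numpy.array(
    [weights_hidden_to_output[j, 0] * output_error_term[0]
     * hidden_layer_output[j] * (1 - hidden_layer_output[j])
     for j in range(2)])
assert numpy.allclose(hidden_error_term, hidden_error_check)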
The Hidden To Output Weight Update
\[ \Delta W = \eta \delta^o a \]
where \(a\) is the output of the hidden layer.
delta_w_h_o = learning_rate * output_error_term * hidden_layer_output
The Input To Hidden Weight Update
\[ \Delta w_i = \eta \delta^h x_i \]
The update is the learning rate times the hidden unit error times the input values.
delta_w_i_h = learning_rate * hidden_error_term * x[:, None]
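This is the outer product of the inputs and the hidden error terms, scaled by the learning rate, so an equivalent way to write it (just a check, not part of the original):

assert numpy.allclose(delta_w_i_h, learning_rate * numpy.outer(x, hidden_error_term))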
print('Change in weights for hidden layer to output layer:')
print(delta_w_h_o)
print('Change in weights for input layer to hidden layer:')
print(delta_w_i_h)
Change in weights for hidden layer to output layer:
[0.02634231 0.02119776]
Change in weights for input layer to hidden layer:
[[ 5.70726224e-04  1.72873580e-05]
 [ 1.57375099e-02  4.76690849e-04]
 [-9.69255871e-04 -2.93588634e-05]]
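To actually take the gradient-descent step you would add these deltas to the weights. One wrinkle: delta_w_h_o comes out with shape (2,) while weights_hidden_to_output has shape (2, 1), so it needs an extra axis (a sketch of the step, not in the original):

weights_hidden_to_output = weights_hidden_to_output + delta_w_h_o[:, None]
weights_input_to_hidden = weights_input_to_hidden + delta_w_i_h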