Training the CBOW Model

Beginning

Previously we looked at preparing the data and setting up the CBOW model; now we'll look at training the model.

Imports

# python
import math

# from pypi
from expects import (
    be_true,
    equal,
    expect,
)
import numpy

Functions from Previous Posts

Data Preparation Functions

These were previously defined in Word Embeddings: Data Preparation post.

def window_generator(words: list, half_window: int):
    """Generates windows of words

    Args:
     words: cleaned tokens
     half_window: number of words in the half-window

    Yields:
     the next window
    """
    for center_index in range(half_window, len(words) - half_window):
        center_word = words[center_index]
        context_words = (words[(center_index - half_window) : center_index]
                         + words[(center_index + 1):(center_index + half_window + 1)])
        yield context_words, center_word
    return
def index_word_maps(data: list) -> tuple:
    """Creates index to word mappings

    The index is based on sorted unique tokens in the data

    Args:
       data: the data you want to pull from

    Returns:
       word_to_index: dictionary mapping each word to its index
       index_to_word: dictionary mapping each index to its word
    """
    words = sorted(list(set(data)))

    word_to_index = {word: index for index, word in enumerate(words)}
    index_to_word = {index: word for index, word in enumerate(words)}
    return word_to_index, index_to_word
def word_to_one_hot_vector(word: str, word_to_index: dict, vocabulary_size: int) -> numpy.ndarray:
    """Create a one-hot-encoded vector

    Args:
     word: the word from the corpus that we're encoding
     word_to_index: map of the word to the index
     vocabulary_size: the size of the vocabulary

    Returns:
     vector with all zeros except where the word is
    """
    one_hot_vector = numpy.zeros(vocabulary_size)
    one_hot_vector[word_to_index[word]] = 1
    return one_hot_vector
ROWS = 0
def context_words_to_vector(context_words: list,
                            word_to_index: dict) -> numpy.ndarray:
    """Create vector with the mean of the one-hot-vectors

    Args:
     context_words: words to convert to one-hot vectors
     word_to_index: dict mapping word to index

    Returns:
     mean of the one-hot vectors for the context words
    """
    vocabulary_size = len(word_to_index)
    context_words_vectors = [
        word_to_one_hot_vector(word, word_to_index, vocabulary_size)
        for word in context_words]
    return numpy.mean(context_words_vectors, axis=ROWS)
def training_example_generator(words: list, half_window: int, word_to_index: dict):
    """generates training examples

    Args:
     words: source of words
     half_window: half the window size
     word_to_index: dict with word to index mapping

    Yields:
     (mean context-words vector, one-hot center-word vector)
    """
    vocabulary_size = len(word_to_index)
    for context_words, center_word in window_generator(words, half_window):
        yield (context_words_to_vector(context_words, word_to_index),
               word_to_one_hot_vector(
                   center_word, word_to_index, vocabulary_size))
    return

Activation Functions

These functions were defined in the Introducing the CBOW Model post.

def relu(z: numpy.ndarray) -> numpy.ndarray:
    """Get the ReLU for the input array

    Args:
     z: an array of numbers

    Returns:
     ReLU of z
    """
    result = z.copy()
    result[result < 0] = 0
    return result
def softmax(z: numpy.ndarray) -> numpy.ndarray:
    """Calculate Softmax for the input

    Args:
     z: array of values

    Returns:
     array of probabilities
    """
    e_z = numpy.exp(z)
    sum_e_z = numpy.sum(e_z)
    return e_z / sum_e_z

Word Embeddings: Training the CBOW model

In previous lecture notebooks you saw how to prepare data before feeding it to a continuous bag-of-words (CBOW) model, as well as the model itself: its architecture and activation functions. This notebook will walk you through:

  • Forward propagation.
  • Cross-entropy loss.
  • Backpropagation.
  • Gradient descent.

These are the concepts you need in order to understand how training the model works.

Neural Network Initialization

Let's dive into the neural network itself, which is shown below with all the dimensions and formulas you'll need.

Set N equal to 3. Remember that N is a hyperparameter of the CBOW model that represents the size of the word embedding vectors, as well as the size of the hidden layer.

Also set V equal to 5, which is the size of the vocabulary we have used so far.

# Define the size of the word embedding vectors and save it in the variable 'N'
N = 3

# Define V. Remember this was the size of the vocabulary in the previous lecture notebooks
V = 5

Initialization of the weights and biases

Before you start training the neural network, you need to initialize the weight matrices and bias vectors with random values.

In the assignment you will implement a function to do this yourself using numpy.random.rand. In this notebook, we've pre-populated these matrices and vectors for you.
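
As a rough sketch of what that kind of initialization might look like (illustrative only: the function name, seed handling, and uniform scaling here are our own, not the assignment's solution):

def initialize_model(N: int, V: int, random_seed: int=1) -> tuple:
    """Initialize the weight matrices and bias vectors with random values

    Note: an illustrative sketch, not the assignment's implementation

    Args:
     N: size of the word embeddings / hidden layer
     V: size of the vocabulary
     random_seed: seed so the initialization is reproducible

    Returns:
     W1 (N x V), W2 (V x N), b1 (N x 1), b2 (V x 1)
    """
    numpy.random.seed(random_seed)
    W1 = numpy.random.rand(N, V)
    W2 = numpy.random.rand(V, N)
    b1 = numpy.random.rand(N, 1)
    b2 = numpy.random.rand(V, 1)
    return W1, W2, b1, b2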

Define the first matrix of weights

W1 = numpy.array([
    [ 0.41687358,  0.08854191, -0.23495225,  0.28320538,  0.41800106],
    [ 0.32735501,  0.22795148, -0.23951958,  0.4117634 , -0.23924344],
    [ 0.26637602, -0.23846886, -0.37770863, -0.11399446,  0.34008124]])

Define the second matrix of weights

W2 = numpy.array([[-0.22182064, -0.43008631,  0.13310965],
                  [ 0.08476603,  0.08123194,  0.1772054 ],
                  [ 0.1871551 , -0.06107263, -0.1790735 ],
                  [ 0.07055222, -0.02015138,  0.36107434],
                  [ 0.33480474, -0.39423389, -0.43959196]])

Define the first vector of biases

b1 = numpy.array([[ 0.09688219],
                  [ 0.29239497],
                  [-0.27364426]])

Define the second vector of biases

b2 = numpy.array([[ 0.0352008 ],
                  [-0.36393384],
                  [-0.12775555],
                  [-0.34802326],
                  [-0.07017815]])

Check that the dimensions of these matrices are correct.

print(f'V (vocabulary size): {V}')
print(f'N (embedding size / size of the hidden layer): {N}')

print(f'size of W1: {W1.shape} (NxV)')
print(f'size of b1: {b1.shape} (Nx1)')
print(f'size of W2: {W2.shape} (VxN)')
print(f'size of b2: {b2.shape} (Vx1)')

expect(W1.shape).to(equal((N, V)))
expect(b1.shape).to(equal((N, 1)))
expect(W2.shape).to(equal((V, N)))
expect(b2.shape).to(equal((V, 1)))
V (vocabulary size): 5
N (embedding size / size of the hidden layer): 3
size of W1: (3, 5) (NxV)
size of b1: (3, 1) (Nx1)
size of W2: (5, 3) (VxN)
size of b2: (5, 1) (Vx1)

Before moving forward, you will need some functions and variables defined in previous notebooks. They can be found next. Be sure you understand everything that is going on in the next cell; if not, consider revisiting the first lecture notebook.

Define the tokenized version of the corpus

words = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning']

Get the 'word_to_index' and 'index_to_word' dictionaries for the tokenized corpus

word_to_index, index_to_word = index_word_maps(words)

The First Training Example

Run the next cells to get the first training example, consisting of the vector representing the context words "i am because i" and the target, which is the one-hot vector representing the center word "happy".

training_examples = training_example_generator(words, 2, word_to_index)
x_array, y_array = next(training_examples)

In this notebook, next is used because you will only be performing one iteration of training. In this week's assignment, with full training over several iterations, you'll use regular for loops with the iterator that supplies the training examples.
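
For reference, looping over every training example in the toy corpus might look something like this sketch (illustrative only; it just prints each pair):

for context_vector, center_vector in training_example_generator(
        words, 2, word_to_index):
    print(f"context: {context_vector}, center: {center_vector}")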

The vector representing the context words, which will be fed into the neural network, is:

print(x_array)
[0.25 0.25 0.   0.5  0.  ]

The one-hot vector representing the center word to be predicted is:

print(y_array)
[0. 0. 1. 0. 0.]

Now convert these vectors into matrices (or 2D arrays) to be able to perform matrix multiplication on the right types of objects, as explained in a previous notebook.

# Copy vector
x = x_array.copy()

# Reshape it
x.shape = (V, 1)

# Print it
print(f'x:\n{x}\n')

# Copy vector
y = y_array.copy()

# Reshape it
y.shape = (V, 1)

# Print it
print(f'y:\n{y}')
x:
[[0.25]
 [0.25]
 [0.  ]
 [0.5 ]
 [0.  ]]

y:
[[0.]
 [0.]
 [1.]
 [0.]
 [0.]]

Forward Propagation

The Hidden Layer

Now that you have initialized all the variables that you need for forward propagation, you can calculate the values of the hidden layer using the following formulas:

\begin{align} \mathbf{z_1} &= \mathbf{W_1}\mathbf{x} + \mathbf{b_1} \tag{1} \\ \mathbf{h} &= \mathrm{ReLU}(\mathbf{z_1}) \tag{2} \\ \end{align}

First, you can calculate the value of \(\mathbf{z_1}\).

Compute z1 (values of first hidden layer before applying the ReLU function)

z1 = numpy.dot(W1, x) + b1

As expected, you get an \(N\) by 1 matrix, or column vector with \(N\) elements, where \(N\) is the embedding size, which is 3 in this example.

print(z1)
[[ 0.36483875]
 [ 0.63710329]
 [-0.3236647 ]]
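
If you want to confirm the shape programmatically, here is a quick check in the style of the earlier assertions (not in the original notebook):

expect(z1.shape).to(equal((N, 1)))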

You can now take the ReLU of \(\mathbf{z_1}\) to get \(\mathbf{h}\), the vector with the values of the hidden layer.

Compute h (z1 after applying ReLU function)

h = relu(z1)
print(h)
[[0.36483875]
 [0.63710329]
 [0.        ]]

Applying ReLU means that the negative element of \(\mathbf{z_1}\) has been replaced with a zero.

The Output Layer

Here are the formulas you need to calculate the values of the output layer, represented by the vector \(\mathbf{\hat y}\):

\begin{align} \mathbf{z_2} &= \mathbf{W_2}\mathbf{h} + \mathbf{b_2} \tag{3} \\ \mathbf{\hat y} &= \mathrm{softmax}(\mathbf{z_2}) \tag{4} \\ \end{align}

First, calculate \(\mathbf{z_2}\).

Compute z2 (values of the output layer before applying the softmax function)

z2 = numpy.dot(W2, h) + b2
print(z2)
expected = numpy.array([
    [-0.31973737],
    [-0.28125477],
    [-0.09838369],
    [-0.33512159],
    [-0.19919612]])
expect(numpy.allclose(z2, expected)).to(be_true)
[[-0.31973737]
 [-0.28125477]
 [-0.09838369]
 [-0.33512159]
 [-0.19919612]]

This is a V by 1 matrix, where V is the size of the vocabulary, which is 5 in this example.

Now calculate the value of \(\mathbf{\hat y}\).

Compute y_hat (z2 after applying softmax function)

y_hat = softmax(z2)
print(y_hat)
expected = numpy.array([
    [0.18519074],
    [0.19245626],
    [0.23107446],
    [0.18236353],
    [0.20891502]])
expect(numpy.allclose(expected, y_hat)).to(be_true)
[[0.18519074]
 [0.19245626]
 [0.23107446]
 [0.18236353]
 [0.20891502]]

As you've performed the calculations with random matrices and vectors (apart from the input vector), the output of the neural network is essentially random at this point. The learning process will adjust the weights and biases to match the actual targets better.

That being said, what word did the neural network predict?

prediction = numpy.argmax(y_hat)
print(f"The predicted word at index {prediction} is '{index_to_word[prediction]}'.")
The predicted word at index 2 is 'happy'.

The neural network predicted the word "happy": the largest element of \(\mathbf{\hat y}\) is the third one, and the third word of the vocabulary is "happy".

Cross-Entropy Loss

Now that you have the network's prediction, you can calculate the cross-entropy loss to determine how accurate the prediction was compared to the actual target.

Remember that you are working on a single training example, not on a batch of examples, which is why you are computing the loss rather than the cost (the cost generalizes the loss to a batch of examples).
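
To make that distinction concrete, here's a minimal sketch (not part of the original notebook) of how the cost over a batch could be computed as the mean of the per-example cross-entropy losses, assuming the examples are stacked as columns:

def cross_entropy_cost(y_predicted_batch: numpy.ndarray,
                       y_actual_batch: numpy.ndarray) -> float:
    """Sketch: mean cross-entropy loss over a batch (one example per column)"""
    batch_size = y_actual_batch.shape[1]
    # sum over the vocabulary axis for each column, then average over the batch
    losses = -numpy.sum(y_actual_batch * numpy.log(y_predicted_batch), axis=0)
    return numpy.mean(losses)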

First let's recall what the prediction was.

print(y_hat)
[[0.18519074]
 [0.19245626]
 [0.23107446]
 [0.18236353]
 [0.20891502]]

And the actual target value is:

print(y)
[[0.]
 [0.]
 [1.]
 [0.]
 [0.]]

The formula for cross-entropy loss is:

\[ J=-\sum\limits_{k=1}^{V}y_k\log{\hat{y}_k} \tag{6} \]

Try implementing the cross-entropy loss function so you get more familiar working with numpy.

def cross_entropy_loss(y_predicted: numpy.ndarray,
                       y_actual: numpy.ndarray) -> float:
    """Calculate the cross-entropy loss for the prediction

    Args:
     y_predicted: what our model predicted
     y_actual: the known labels

    Returns:
     cross-entropy loss for y_predicted
    """
    loss = -numpy.sum(y_actual * numpy.log(y_predicted))
    return loss

Hint 1:

To multiply two numpy matrices (such as y and y_hat) element-wise, you can simply use the * operator.

Hint 2:

Once you have a vector equal to the element-wise multiplication of y and y_hat, you can use numpy.sum to calculate the sum of the elements of this vector.

Now use this function to calculate the loss with the actual values of \(\mathbf{y}\) and \(\mathbf{\hat y}\).

loss = cross_entropy_loss(y_hat, y)
print(f"{loss:0.3f}")
expected = 1.4650152923611106
expect(math.isclose(loss, expected)).to(be_true)
1.465

This value is neither good nor bad, which is expected as the neural network hasn't learned anything yet.
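
As a rough point of reference (not from the original notebook): a uniform guess over the five words would give a loss of \(-\log(1/5) \approx 1.609\), while a perfect prediction would give a loss of 0, so a loss around 1.465 is about what you'd expect from random weights.

uniform_guess = numpy.full((V, 1), 1 / V)
print(f"loss for a uniform guess: {cross_entropy_loss(uniform_guess, y):0.3f}")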

The actual learning will start during the next phase: backpropagation.

Backpropagation

The formulas that you will implement for backpropagation are the following.

\begin{align} \frac{\partial J}{\partial \mathbf{W_1}} &= \rm{ReLU}\left ( \mathbf{W_2^\top} (\mathbf{\hat{y}} - \mathbf{y})\right )\mathbf{x}^\top \tag{7}\\ \frac{\partial J}{\partial \mathbf{W_2}} &= (\mathbf{\hat{y}} - \mathbf{y})\mathbf{h^\top} \tag{8}\\ \frac{\partial J}{\partial \mathbf{b_1}} &= \rm{ReLU}\left ( \mathbf{W_2^\top} (\mathbf{\hat{y}} - \mathbf{y})\right ) \tag{9}\\ \frac{\partial J}{\partial \mathbf{b_2}} &= \mathbf{\hat{y}} - \mathbf{y} \tag{10} \end{align}

Note: these formulas are slightly simplified compared to the ones in the lecture, as you're working on a single training example, whereas the lecture provided the formulas for a batch of examples. In the assignment you'll be implementing the latter.
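
As a preview, here's a sketch of what those batch versions might look like, assuming a batch of examples stacked as the columns of the inputs (an illustration, not the assignment's implementation); the single-example walkthrough continues below.

def back_prop_batch(x_batch: numpy.ndarray, y_batch: numpy.ndarray,
                    y_hat_batch: numpy.ndarray, h_batch: numpy.ndarray,
                    W2: numpy.ndarray) -> tuple:
    """Sketch of the batch gradients (each argument holds one example per column)"""
    batch_size = x_batch.shape[1]
    l1 = relu(numpy.dot(W2.T, y_hat_batch - y_batch))
    grad_W1 = numpy.dot(l1, x_batch.T) / batch_size
    grad_W2 = numpy.dot(y_hat_batch - y_batch, h_batch.T) / batch_size
    grad_b1 = numpy.sum(l1, axis=1, keepdims=True) / batch_size
    grad_b2 = numpy.sum(y_hat_batch - y_batch, axis=1, keepdims=True) / batch_size
    return grad_W1, grad_W2, grad_b1, grad_b2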

Let's start with an easy one.

Calculate the partial derivative of the loss function with respect to \(\mathbf{b_2}\), and store the result in grad_b2.

\[ \frac{\partial J}{\partial \mathbf{b_2}} = \mathbf{\hat{y}} - \mathbf{y} \tag{10} \]

Compute vector with partial derivatives of loss function with respect to b2

grad_b2 = y_hat - y
print(grad_b2)
expected = numpy.array([
    [ 0.18519074],
    [ 0.19245626],
    [-0.76892554],
    [ 0.18236353],
    [ 0.20891502]])
expect(numpy.allclose(grad_b2, expected)).to(be_true)
[[ 0.18519074]
 [ 0.19245626]
 [-0.76892554]
 [ 0.18236353]
 [ 0.20891502]]

Next, calculate the partial derivative of the loss function with respect to \(\mathbf{W_2}\), and store the result in grad_W2.

\[ \frac{\partial J}{\partial \mathbf{W_2}} = (\mathbf{\hat{y}} - \mathbf{y})\mathbf{h^\top} \tag{8} \]

Hint: use .T to get a transposed matrix, e.g. h.T returns \(\mathbf{h^\top}\).

Compute matrix with partial derivatives of loss function with respect to W2.

grad_W2 = numpy.dot(y_hat - y, h.T)
print(grad_W2)
expected = numpy.array([
    [0.06756476,  0.11798563,  0.        ],
    [ 0.0702155 ,  0.12261452,  0.        ],
    [-0.28053384, -0.48988499,  0.        ],
    [ 0.06653328,  0.1161844 ,  0.        ],
    [ 0.07622029,  0.13310045,  0.        ]])

expect(numpy.allclose(grad_W2, expected)).to(be_true)
[[ 0.06756476  0.11798563  0.        ]
 [ 0.0702155   0.12261452  0.        ]
 [-0.28053384 -0.48988499  0.        ]
 [ 0.06653328  0.1161844   0.        ]
 [ 0.07622029  0.13310045  0.        ]]

Now calculate the partial derivative with respect to \(\mathbf{b_1}\) and store the result in grad_b1.

\[ \frac{\partial J}{\partial \mathbf{b_1}} = \rm{ReLU}\left ( \mathbf{W_2^\top} (\mathbf{\hat{y}} - \mathbf{y})\right ) \tag{9} \]

Compute vector with partial derivatives of loss function with respect to b1.

grad_b1 = relu(numpy.dot(W2.T, y_hat - y))
print(grad_b1)
expected = numpy.array([
    [0.        ],
    [0.        ],
    [0.17045858]])
expect(numpy.allclose(grad_b1, expected)).to(be_true)
[[0.        ]
 [0.        ]
 [0.17045858]]

Finally, calculate the partial derivative of the loss with respect to \(\mathbf{W_1}\), and store it in grad_W1.

\[ \frac{\partial J}{\partial \mathbf{W_1}} = \rm{ReLU}\left ( \mathbf{W_2^\top} (\mathbf{\hat{y}} - \mathbf{y})\right )\mathbf{x}^\top \tag{7} \]

Compute matrix with partial derivatives of loss function with respect to W1.

grad_W1 = numpy.dot(relu(numpy.dot(W2.T, y_hat - y)), x.T)
print(grad_W1)
expected = numpy.array([
    [0.        , 0.        , 0.        , 0.        , 0.        ],
    [0.        , 0.        , 0.        , 0.        , 0.        ],
    [0.04261464, 0.04261464, 0.        , 0.08522929, 0.        ]])

expect(numpy.allclose(grad_W1, expected)).to(be_true)
[[0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.        ]
 [0.04261464 0.04261464 0.         0.08522929 0.        ]]

Before moving on to gradient descent, double-check that all the matrices have the expected dimensions.

print(f'V (vocabulary size): {V}')
print(f'N (embedding size / size of the hidden layer): {N}')
print(f'size of grad_W1: {grad_W1.shape} (NxV)')
print(f'size of grad_b1: {grad_b1.shape} (Nx1)')
print(f'size of grad_W2: {grad_W2.shape} (VxN)')
print(f'size of grad_b2: {grad_b2.shape} (Vx1)')

expect(grad_W1.shape).to(equal((N, V)))
expect(grad_b1.shape).to(equal((N, 1)))
expect(grad_W2.shape).to(equal((V, N)))
expect(grad_b2.shape).to(equal((V, 1)))
V (vocabulary size): 5
N (embedding size / size of the hidden layer): 3
size of grad_W1: (3, 5) (NxV)
size of grad_b1: (3, 1) (Nx1)
size of grad_W2: (5, 3) (VxN)
size of grad_b2: (5, 1) (Vx1)

Gradient descent

During the gradient descent phase, you will update the weights and biases by subtracting \(\alpha\) times the gradient from the original matrices and vectors, using the following formulas.

\begin{align} \mathbf{W_1} &\gets \mathbf{W_1} - \alpha \frac{\partial J}{\partial \mathbf{W_1}} \tag{11}\\ \mathbf{W_2} &\gets \mathbf{W_2} - \alpha \frac{\partial J}{\partial \mathbf{W_2}} \tag{12}\\ \mathbf{b_1} &\gets \mathbf{b_1} - \alpha \frac{\partial J}{\partial \mathbf{b_1}} \tag{13}\\ \mathbf{b_2} &\gets \mathbf{b_2} - \alpha \frac{\partial J}{\partial \mathbf{b_2}} \tag{14}\\ \end{align}

First, let's set a value for \(\alpha\).

alpha = 0.03

The updated weight matrix \(\mathbf{W_1}\) will be:

W1_new = W1 - alpha * grad_W1

Let's compare the previous and new values of \(\mathbf{W_1}\):

print('old value of W1:')
print(W1)
print()
print('new value of W1:')
print(W1_new)
old value of W1:
[[ 0.41687358  0.08854191 -0.23495225  0.28320538  0.41800106]
 [ 0.32735501  0.22795148 -0.23951958  0.4117634  -0.23924344]
 [ 0.26637602 -0.23846886 -0.37770863 -0.11399446  0.34008124]]

new value of W1:
[[ 0.41687358  0.08854191 -0.23495225  0.28320538  0.41800106]
 [ 0.32735501  0.22795148 -0.23951958  0.4117634  -0.23924344]
 [ 0.26509758 -0.2397473  -0.37770863 -0.11655134  0.34008124]]

The difference is very subtle (hint: take a closer look at the last row), which is why it takes a fair number of iterations to train the neural network until it reaches optimal weights and biases, starting from random values.
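
One way to see exactly what changed is to look at the difference directly; it should equal \(-\alpha\) times the gradient (a quick check, not in the original notebook):

print(W1_new - W1)
expect(numpy.allclose(W1_new - W1, -alpha * grad_W1)).to(be_true)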

Now calculate the new values of \(\mathbf{W_2}\) (to be stored in W2_new), \(\mathbf{b_1}\) (in b1_new), and \(\mathbf{b_2}\) (in b2_new).

\begin{align} \mathbf{W_2} &\gets \mathbf{W_2} - \alpha \frac{\partial J}{\partial \mathbf{W_2}} \tag{12}\\ \mathbf{b_1} &\gets \mathbf{b_1} - \alpha \frac{\partial J}{\partial \mathbf{b_1}} \tag{13}\\ \mathbf{b_2} &\gets \mathbf{b_2} - \alpha \frac{\partial J}{\partial \mathbf{b_2}} \tag{14}\\ \end{align}

Compute updated W2.

W2_new = W2 - alpha * grad_W2

Compute updated b1.

b1_new = b1 - alpha * grad_b1

Compute updated b2.

b2_new = b2 - alpha * grad_b2
print('W2_new')
print(W2_new)
print()
print('b1_new')
print(b1_new)
print()
print('b2_new')
print(b2_new)

w2_expected = numpy.array(
   [[-0.22384758, -0.43362588,  0.13310965],
    [ 0.08265956,  0.0775535 ,  0.1772054 ],
    [ 0.19557112, -0.04637608, -0.1790735 ],
    [ 0.06855622, -0.02363691,  0.36107434],
    [ 0.33251813, -0.3982269 , -0.43959196]])

b1_expected = numpy.array(
   [[ 0.09688219],
    [ 0.29239497],
    [-0.27875802]])

b2_expected = numpy.array(
   [[ 0.02964508],
    [-0.36970753],
    [-0.10468778],
    [-0.35349417],
    [-0.0764456 ]]
)

for actual, expected in zip((W2_new, b1_new, b2_new), (w2_expected, b1_expected, b2_expected)):
    expect(numpy.allclose(actual, expected)).to(be_true)
W2_new
[[-0.22384758 -0.43362588  0.13310965]
 [ 0.08265956  0.0775535   0.1772054 ]
 [ 0.19557112 -0.04637608 -0.1790735 ]
 [ 0.06855622 -0.02363691  0.36107434]
 [ 0.33251813 -0.3982269  -0.43959196]]

b1_new
[[ 0.09688219]
 [ 0.29239497]
 [-0.27875802]]

b2_new
[[ 0.02964508]
 [-0.36970753]
 [-0.10468778]
 [-0.35349417]
 [-0.0764456 ]]

Congratulations, you have completed one iteration of training using one training example!

You'll need many more iterations to fully train the neural network, and you can optimize the learning process by training on batches of examples, as described in the lecture. You will get to do this during this week's assignment.
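
To give a sense of how the pieces fit together, here's a minimal sketch of a training loop over the toy corpus (illustrative only: the function and its defaults are ours, and the assignment's version trains on batches of examples):

def train(words: list, word_to_index: dict,
          W1: numpy.ndarray, W2: numpy.ndarray,
          b1: numpy.ndarray, b2: numpy.ndarray,
          half_window: int=2, alpha: float=0.03, epochs: int=50) -> tuple:
    """One possible training loop: forward prop, backprop, gradient descent"""
    vocabulary_size = len(word_to_index)
    for epoch in range(epochs):
        for x_array, y_array in training_example_generator(
                words, half_window, word_to_index):
            x = x_array.reshape(vocabulary_size, 1)
            y = y_array.reshape(vocabulary_size, 1)
            # forward propagation
            h = relu(numpy.dot(W1, x) + b1)
            y_hat = softmax(numpy.dot(W2, h) + b2)
            # backpropagation (single-example gradients)
            l1 = relu(numpy.dot(W2.T, y_hat - y))
            grad_W1 = numpy.dot(l1, x.T)
            grad_W2 = numpy.dot(y_hat - y, h.T)
            grad_b1 = l1
            grad_b2 = y_hat - y
            # gradient descent
            W1, W2 = W1 - alpha * grad_W1, W2 - alpha * grad_W2
            b1, b2 = b1 - alpha * grad_b1, b2 - alpha * grad_b2
    return W1, W2, b1, b2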

End

Now that we know how to train the CBOW model, we'll move on to extracting word embeddings from the model. This is part of a series of posts looking at some preliminaries for creating word embeddings. There is a table-of-contents post here.