Word Embeddings: Training the Model

Building and Training the Model

In the previous post we did some preliminary setup and data pre-processing. Now we're going to build and train a Continuous Bag of Words (CBOW) model.

Imports

# python
from argparse import Namespace
from collections import Counter
from enum import Enum, unique
from functools import partial

import math
import random

# pypi
from expects import be_true, contain_exactly, equal, expect

import holoviews
import hvplot.pandas
import numpy
import pandas

# this project
from neurotic.nlp.word_embeddings import DataCleaner, MetaData

# my other stuff
from graeae import EmbedHoloviews, Timer

Set Up

Code from the previous post.

cleaner = DataCleaner()
data = cleaner.processed
meta = MetaData(data)
TIMER = Timer(speak=False)
Embed = partial(EmbedHoloviews, folder_path="files/posts/nlp/word-embeddings-training-the-model")
Plot = Namespace(
    width=990,
    height=780,
    fontscale=2,
    tan="#ddb377",
    blue="#4687b7",
    red="#ce7b6d",
 )

Something to help remember what the numpy axis argument is.

@unique
class Axis(Enum):
    ROWS = 0
    COLUMNS = 1
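
As a quick sanity check of what the two values mean (a toy array, not part of the assignment):

grid = numpy.array([[1, 2, 3],
                    [4, 5, 6]])
# axis=ROWS collapses the rows, giving one sum per column
print(numpy.sum(grid, axis=Axis.ROWS.value))
# axis=COLUMNS collapses the columns, giving one sum per row
print(numpy.sum(grid, axis=Axis.COLUMNS.value))
[5 7 9]
[ 6 15]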

Middle

Initializing the model

You will now initialize two matrices and two vectors.

  • The first matrix (\(W_1\)) is of dimension \(N \times V\), where V is the number of words in your vocabulary and N is the dimension of your word vector.
  • The second matrix (\(W_2\)) is of dimension \(V \times N\).
  • Vector \(b_1\) has dimensions \(N\times 1\)
  • Vector \(b_2\) has dimensions \(V\times 1\).
  • \(b_1\) and \(b_2\) are the bias vectors of the linear layers from matrices \(W_1\) and \(W_2\).

At this stage we are just initializing the parameters.

Please use numpy.random.rand to generate matrices that are initialized with random values from a uniform distribution, ranging between 0 and 1.

# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: initialize_model
def initialize_model(N: int,V: int, random_seed: int=1) -> tuple:
    """Initialize the matrices with random values

    Args: 
       N:  dimension of hidden vector 
       V:  dimension of vocabulary
       random_seed: random seed for consistent results in the unit tests
     Returns: 
       W1, W2, b1, b2: initialized weights and biases
    """

    numpy.random.seed(random_seed)

    ### START CODE HERE (Replace instances of 'None' with your code) ###
    # W1 has shape (N,V)
    W1 = numpy.random.rand(N, V)
    # W2 has shape (V,N)
    W2 = numpy.random.rand(V, N)
    # b1 has shape (N,1)
    b1 = numpy.random.rand(N, 1)
    # b2 has shape (V,1)
    b2 = numpy.random.rand(V, 1)
    ### END CODE HERE ###

    return W1, W2, b1, b2

Test your function.

tmp_N = 4
tmp_V = 10
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)
expect(tmp_W1.shape).to(equal((tmp_N,tmp_V)))
expect(tmp_W2.shape).to(equal((tmp_V,tmp_N)))
expect(tmp_b1.shape).to(equal((tmp_N, 1)))
expect(tmp_b2.shape).to(equal((tmp_V, 1)))
print(f"tmp_W1.shape: {tmp_W1.shape}")
print(f"tmp_W2.shape: {tmp_W2.shape}")
print(f"tmp_b1.shape: {tmp_b1.shape}")
print(f"tmp_b2.shape: {tmp_b2.shape}")
tmp_W1.shape: (4, 10)
tmp_W2.shape: (10, 4)
tmp_b1.shape: (4, 1)
tmp_b2.shape: (10, 1)

Softmax

Before we can start training the model, we need to implement the softmax function as defined in equation 5:

\[ \text{softmax}(z_i) = \frac{e^{z_i} }{\sum_{j=0}^{V-1} e^{z_j} } \tag{5} \]

  • Array indexing in code starts at 0.
  • V is the number of words in the vocabulary (which is also the number of rows of z).
  • i goes from 0 to |V| - 1.

The Implementation

  • Assume that the input z to softmax is a 2D array
  • Each training example is represented by a column of shape (V, 1) in this 2D array.
  • There may be more than one column in the 2D array, because you can put in a batch of examples to increase efficiency. Let's call the batch size lowercase m, so the z array has shape (V, m).
  • When taking the sum over \(j = 0 \cdots V-1\), take the sum for each column (each example) separately.

Please use numpy.exp and numpy.sum, summing along axis 0 (Axis.ROWS) so that each column (each example) is normalized separately.

# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: softmax
def softmax(z: numpy.ndarray) -> numpy.ndarray:
    """Calculate the softmax

    Args: 
       z: output scores from the hidden layer
    Returns: 
       yhat: prediction (estimate of y)
    """

    ### START CODE HERE (Replace instances of 'None' with your own code) ###

    # Calculate yhat (softmax)
    yhat = numpy.exp(z)/numpy.sum(numpy.exp(z), axis=Axis.ROWS.value)

    ### END CODE HERE ###

    return yhat
# Test the function
tmp = numpy.array([[1,2,3],
                   [1,1,1]
                   ])
tmp_sm = softmax(tmp)
print(tmp_sm)
expected =  numpy.array([[0.5, 0.73105858, 0.88079708],
                         [0.5, 0.26894142, 0.11920292]])


expect(numpy.allclose(tmp_sm, expected)).to(be_true)
[[0.5        0.73105858 0.88079708]
 [0.5        0.26894142 0.11920292]]
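
An aside, not part of the graded function: numpy.exp can overflow when the scores get large, so a common variant subtracts each column's maximum before exponentiating (the result is mathematically identical). A minimal sketch, checked against the output above:

def stable_softmax(z: numpy.ndarray) -> numpy.ndarray:
    """Softmax that shifts each column by its maximum to avoid overflow"""
    shifted = z - numpy.max(z, axis=Axis.ROWS.value, keepdims=True)
    exponentials = numpy.exp(shifted)
    return exponentials/numpy.sum(exponentials, axis=Axis.ROWS.value, keepdims=True)

expect(numpy.allclose(stable_softmax(tmp), tmp_sm)).to(be_true)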

Forward propagation

We're going to implement forward propagation to compute z according to equations (1) to (3).

\begin{align} h &= W_1 \ X + b_1 \tag{1} \\ a &= ReLU(h) \tag{2} \\ z &= W_2 \ a + b_2 \tag{3} \\ \end{align}

For the activation you will use the Rectified Linear Unit (ReLU), given by:

\[ f(h)=\max (0,h) \tag{6} \]
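
In numpy this is just an elementwise maximum with zero; for example (a toy column vector, not part of the assignment):

print(numpy.maximum(numpy.array([[-1.0], [2.0]]), 0))
[[0.]
 [2.]]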

Hint: numpy.dot and numpy.maximum will do most of the work here.

# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: forward_prop
def forward_prop(x: numpy.ndarray,
                 W1: numpy.ndarray, W2: numpy.ndarray,
                 b1: numpy.ndarray, b2: numpy.ndarray) -> tuple:
    """Pass the data through the network

    Args: 
       x:  average one hot vector for the context 
       W1, W2, b1, b2:  matrices and biases to be learned
    Returns: 
       z:  output score vector
    """

    ### START CODE HERE (Replace instances of 'None' with your own code) ###

    # Calculate h
    h = numpy.dot(W1, x) + b1

    # Apply the relu on h (store result in h)
    h = numpy.maximum(h, 0)

    # Calculate z
    z = numpy.dot(W2, h) + b2

    ### END CODE HERE ###

    return z, h

Test the function

tmp_N = 2
tmp_V = 3
tmp_x = numpy.array([[0,1,0]]).T

tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(N=tmp_N,V=tmp_V, random_seed=1)

print(f"x has shape {tmp_x.shape}")
print(f"N is {tmp_N} and vocabulary size V is {tmp_V}")

tmp_z, tmp_h = forward_prop(tmp_x, tmp_W1, tmp_W2, tmp_b1, tmp_b2)

print("call forward_prop")
print()

print(f"z has shape {tmp_z.shape}")
print("z has values:")
print(tmp_z)

print()

print(f"h has shape {tmp_h.shape}")
print("h has values:")
print(tmp_h)

expect(tmp_x.shape).to(equal((3, 1)))
expect(tmp_z.shape).to(equal((3, 1)))
expected = numpy.array(
    [[0.55379268],
     [1.58960774],
     [1.50722933]]
)
expect(numpy.allclose(tmp_z, expected)).to(be_true)
expect(tmp_h.shape).to(equal((2, 1)))
expected = numpy.array(
    [[0.92477674],
     [1.02487333]]
)

expect(numpy.allclose(tmp_h, expected)).to(be_true)
x has shape (3, 1)
N is 2 and vocabulary size V is 3
call forward_prop

z has shape (3, 1)
z has values:
[[0.55379268]
 [1.58960774]
 [1.50722933]]

h has shape (2, 1)
h has values:
[[0.92477674]
 [1.02487333]]

Pack Index with Frequency

def index_with_frequency(context_words: list,
                         word_to_index: dict) -> list:
    """Combines word indices with their frequency counts

    Args:
     context_words: words to get the indices for
     word_to_index: mapping of word to index

    Returns:
     list of (word-index, word-count) tuples built from context_words
    """
    frequency_dict = Counter(context_words)
    indices = [word_to_index[word] for word in context_words]
    packed = []
    for index in range(len(indices)):
        word_index = indices[index]
        frequency = frequency_dict[context_words[index]]
        packed.append((word_index, frequency))
    return packed
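
A quick toy check (the tiny vocabulary and context here are made up just for illustration):

toy_vocabulary = {"i": 0, "like": 1, "cheese": 2}
print(index_with_frequency(["i", "like", "i", "cheese"], toy_vocabulary))
[(0, 2), (1, 1), (0, 2), (2, 1)]

Repeated words keep their full count ("i" appears twice, so both of its entries carry a count of 2).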

Vector Generator

def vectors(data: numpy.ndarray, word_to_index: dict, half_window: int):
    """Generates vectors of fraction of context words each word represents

    Args:
     data: source of the vectors
     word_to_index: mapping of word to index in the vocabulary
     half_window: number of tokens on either side of the word to keep

    Yields:
     tuple of x, y 
    """
    location = half_window
    vocabulary_size = len(word_to_index)
    while True:
        y = numpy.zeros(vocabulary_size)
        x = numpy.zeros(vocabulary_size)
        center_word = data[location]
        y[word_to_index[center_word]] = 1
        context_words = (data[(location - half_window): location]
                         + data[(location + 1) : (location + half_window + 1)])

        for word_index, frequency in index_with_frequency(context_words, word_to_index):
            x[word_index] = frequency/len(context_words)
        yield x, y
        location += 1
        if location >= len(data):
            print("location in data is being set to 0")
            location = 0
    return
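
A toy run, re-using the made-up vocabulary from above, to show what a single (x, y) pair looks like with a half_window of 1:

toy_x, toy_y = next(vectors(["i", "like", "cheese"], toy_vocabulary, half_window=1))
print(toy_x)
print(toy_y)
[0.5 0.  0.5]
[0. 1. 0.]

The center word ("like") gets the 1 in y, and each of the two context words gets half the weight in x.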

Batch Generator

This uses a not-so-common form of the while loop: if the loop runs to completion (i.e. you didn't break out of it), then the else clause runs.

def batch_generator(data: numpy.ndarray, word_to_index: dict,
                    half_window: int, batch_size: int, original: bool=True):
    """Generate batches of vectors

    Args:
     data: the training data
     word_to_index: map of word to vocabulary index
     half_window: number of tokens to take from either side of word
     batch_size: Number of vectors to put in each training batch
     original: run the original buggy code

    Yields:
     tuple of X, Y batches
    """
    vocabulary_size = len(word_to_index)
    batch_x = []
    batch_y = []
    for x, y in vectors(data,
                        word_to_index,
                        half_window):
        if original:
            while len(batch_x) < batch_size:
                batch_x.append(x)
                batch_y.append(y)

            else:
                yield numpy.array(batch_x).T, numpy.array(batch_y).T
        else:
            if len(batch_x) < batch_size:
                batch_x.append(x)
                batch_y.append(y)

            else:
                yield numpy.array(batch_x).T, numpy.array(batch_y).T
                batch_x = []
                batch_y = []
    return

So every time batch_x reaches the batch_size it yields the tuple and then creates a new batch before continuing the outer for-loop.
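
To make the while/else behavior mentioned above concrete, here's a tiny standalone example (nothing to do with the model):

attempts = 0
while attempts < 3:
    attempts += 1
else:
    print("the loop condition became false without a break, so the else clause ran")
the loop condition became false without a break, so the else clause ran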

Cost function

The cross-entropy loss function.
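Matching the compute_cost implementation below (with m as the batch size), the loss being computed is:

\[ J = -\frac{1}{m}\sum_{j=1}^{m}\sum_{i=1}^{V}\left[ y_{ij}\log \hat{y}_{ij} + \left(1 - y_{ij}\right)\log\left(1 - \hat{y}_{ij}\right)\right] \]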

def compute_cost(y: numpy.ndarray, y_hat: numpy.ndarray,
                 batch_size: int) -> numpy.ndarray:
    """Calculates the cross-entropy loss

    Args:
     y: array with the actual words labeled
     y_hat: our model's guesses for the words
     batch_size: the number of examples per training run
    """
    log_probabilities = (numpy.multiply(numpy.log(y_hat), y)
                         + numpy.multiply(numpy.log(1 - y_hat), 1 - y))
    cost = -numpy.sum(log_probabilities)/batch_size
    cost = numpy.squeeze(cost)
    return cost

Test the function

tmp_C = 2
tmp_N = 50
tmp_batch_size = 4

tmp_word2Ind, tmp_Ind2word = meta.word_to_index, meta.vocabulary
tmp_V = len(meta.vocabulary)

tmp_x, tmp_y = next(batch_generator(data, tmp_word2Ind, tmp_C, tmp_batch_size))

print(f"tmp_x.shape {tmp_x.shape}")
print(f"tmp_y.shape {tmp_y.shape}")

tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)

print(f"tmp_W1.shape {tmp_W1.shape}")
print(f"tmp_W2.shape {tmp_W2.shape}")
print(f"tmp_b1.shape {tmp_b1.shape}")
print(f"tmp_b2.shape {tmp_b2.shape}")

tmp_z, tmp_h = forward_prop(tmp_x, tmp_W1, tmp_W2, tmp_b1, tmp_b2)
print(f"tmp_z.shape: {tmp_z.shape}")
print(f"tmp_h.shape: {tmp_h.shape}")

tmp_yhat = softmax(tmp_z)
print(f"tmp_yhat.shape: {tmp_yhat.shape}")

tmp_cost = compute_cost(tmp_y, tmp_yhat, tmp_batch_size)
print("call compute_cost")
print(f"tmp_cost {tmp_cost:.4f}")

expect(tmp_x.shape).to(equal((5778, 4)))
expect(tmp_y.shape).to(equal((5778, 4)))
expect(tmp_W1.shape).to(equal((50, 5778)))
expect(tmp_W2.shape).to(equal((5778, 50)))
expect(tmp_b1.shape).to(equal((50, 1)))
expect(tmp_b2.shape).to(equal((5778, 1)))
expect(tmp_z.shape).to(equal((5778, 4)))
expect(tmp_h.shape).to(equal((50, 4)))
expect(tmp_yhat.shape).to(equal((5778, 4)))
expect(math.isclose(tmp_cost, 9.9560, abs_tol=1e-4)).to(be_true)
tmp_x.shape (5778, 4)
tmp_y.shape (5778, 4)
tmp_W1.shape (50, 5778)
tmp_W2.shape (5778, 50)
tmp_b1.shape (50, 1)
tmp_b2.shape (5778, 1)
tmp_z.shape: (5778, 4)
tmp_h.shape: (50, 4)
tmp_yhat.shape: (5778, 4)
call compute_cost
tmp_cost 9.9560

Training the Model - Backpropagation

Now that you understand how the CBOW model works, you will train it. You already created a function for forward propagation; now you will implement a function that computes the gradients needed to backpropagate the errors.
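
For reference, these are the gradients the function below computes, with m as the batch size and \(\mathbf{1}_m\) a column vector of ones standing in for the column-wise sums:

\begin{align} l_1 &= ReLU\left(W_2^\top (\hat{Y} - Y)\right) \\ \frac{\partial J}{\partial W_1} &= \frac{1}{m} l_1 X^\top \\ \frac{\partial J}{\partial W_2} &= \frac{1}{m} (\hat{Y} - Y) H^\top \\ \frac{\partial J}{\partial b_1} &= \frac{1}{m} l_1 \mathbf{1}_m \\ \frac{\partial J}{\partial b_2} &= \frac{1}{m} (\hat{Y} - Y) \mathbf{1}_m \\ \end{align}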

# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: back_prop
def back_prop(x: numpy.ndarray,
              yhat: numpy.ndarray,
              y: numpy.ndarray,
              h: numpy.ndarray,
              W1: numpy.ndarray,
              W2: numpy.ndarray,
              b1: numpy.ndarray,
              b2: numpy.ndarray,
              batch_size: int) -> tuple:
    """Calculates the gradients

    Args: 
       x:  average one hot vector for the context 
       yhat: prediction (estimate of y)
       y:  target vector
       h:  hidden vector (see eq. 1)
       W1, W2, b1, b2:  matrices and biases  
       batch_size: batch size 

     Returns: 
       grad_W1, grad_W2, grad_b1, grad_b2:  gradients of matrices and biases   
    """
    ### START CODE HERE (Replace instances of 'None' with your code) ###

    # Compute l1 as W2^T (Yhat - Y)
    # Re-use it whenever you see W2^T (Yhat - Y) used to compute a gradient
    l1 = numpy.dot(W2.T, yhat - y)
    # Apply relu to l1
    l1 = numpy.maximum(l1, 0)
    # Compute the gradient of W1
    grad_W1 = numpy.dot(l1, x.T)/batch_size
    # Compute the gradient of W2
    grad_W2 = numpy.dot(yhat - y, h.T)/batch_size
    # Compute the gradient of b1
    grad_b1 = numpy.sum(l1, axis=Axis.COLUMNS.value, keepdims=True)/batch_size
    # Compute the gradient of b2
    grad_b2 = numpy.sum(yhat - y, axis=Axis.COLUMNS.value, keepdims=True)/batch_size
    ### END CODE HERE ###

    return grad_W1, grad_W2, grad_b1, grad_b2

Test the function

tmp_C = 2
tmp_N = 50
tmp_batch_size = 4
tmp_word2Ind, tmp_Ind2word = meta.word_to_index, meta.vocabulary
tmp_V = len(meta.vocabulary)

# get a batch of data
tmp_x, tmp_y = next(batch_generator(data, tmp_word2Ind, tmp_C, tmp_batch_size))

print("get a batch of data")
print(f"tmp_x.shape {tmp_x.shape}")
print(f"tmp_y.shape {tmp_y.shape}")

print()
print("Initialize weights and biases")
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)

print(f"tmp_W1.shape {tmp_W1.shape}")
print(f"tmp_W2.shape {tmp_W2.shape}")
print(f"tmp_b1.shape {tmp_b1.shape}")
print(f"tmp_b2.shape {tmp_b2.shape}")

print()
print("Forwad prop to get z and h")
tmp_z, tmp_h = forward_prop(tmp_x, tmp_W1, tmp_W2, tmp_b1, tmp_b2)
print(f"tmp_z.shape: {tmp_z.shape}")
print(f"tmp_h.shape: {tmp_h.shape}")

print()
print("Get yhat by calling softmax")
tmp_yhat = softmax(tmp_z)
print(f"tmp_yhat.shape: {tmp_yhat.shape}")

tmp_m = (2*tmp_C)
tmp_grad_W1, tmp_grad_W2, tmp_grad_b1, tmp_grad_b2 = back_prop(tmp_x, tmp_yhat, tmp_y, tmp_h, tmp_W1, tmp_W2, tmp_b1, tmp_b2, tmp_batch_size)

print()
print("call back_prop")
print(f"tmp_grad_W1.shape {tmp_grad_W1.shape}")
print(f"tmp_grad_W2.shape {tmp_grad_W2.shape}")
print(f"tmp_grad_b1.shape {tmp_grad_b1.shape}")
print(f"tmp_grad_b2.shape {tmp_grad_b2.shape}")


expect(tmp_x.shape).to(equal((5778, 4)))
expect(tmp_y.shape).to(equal((5778, 4)))
expect(tmp_W1.shape).to(equal((50, 5778)))
expect(tmp_W2.shape).to(equal((5778, 50)))
expect(tmp_b1.shape).to(equal((50, 1)))
expect(tmp_b2.shape).to(equal((5778, 1)))
expect(tmp_z.shape).to(equal((5778, 4)))
expect(tmp_h.shape).to(equal((50, 4)))
expect(tmp_yhat.shape).to(equal((5778, 4)))
expect(tmp_grad_W1.shape).to(equal((50, 5778)))
expect(tmp_grad_W2.shape).to(equal((5778, 50)))
expect(tmp_grad_b1.shape).to(equal((50, 1)))
expect(tmp_grad_b2.shape).to(equal((5778, 1)))
get a batch of data
tmp_x.shape (5778, 4)
tmp_y.shape (5778, 4)

Initialize weights and biases
tmp_W1.shape (50, 5778)
tmp_W2.shape (5778, 50)
tmp_b1.shape (50, 1)
tmp_b2.shape (5778, 1)

Forward prop to get z and h
tmp_z.shape: (5778, 4)
tmp_h.shape: (50, 4)

Get yhat by calling softmax
tmp_yhat.shape: (5778, 4)

call back_prop
tmp_grad_W1.shape (50, 5778)
tmp_grad_W2.shape (5778, 50)
tmp_grad_b1.shape (50, 1)
tmp_grad_b2.shape (5778, 1)

Gradient Descent

Now that you have implemented a function to compute the gradients, you will implement batch gradient descent over your training set.

Hint: For that, you will use the initialize_model and back_prop functions you just created (along with compute_cost). You can also use the batch_generator function defined above.

Also: print the cost every ten batches (use a batch size of 128).
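
Each batch then performs one standard gradient descent update with learning rate \(\alpha\):

\begin{align} W_1 &\leftarrow W_1 - \alpha \frac{\partial J}{\partial W_1} \\ W_2 &\leftarrow W_2 - \alpha \frac{\partial J}{\partial W_2} \\ b_1 &\leftarrow b_1 - \alpha \frac{\partial J}{\partial b_1} \\ b_2 &\leftarrow b_2 - \alpha \frac{\partial J}{\partial b_2} \\ \end{align}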

# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: gradient_descent
def gradient_descent(data: numpy.ndarray, word2Ind: dict, N: int, V: int ,
                     num_iters: int, alpha: float=0.03):
    """
    This is the gradient_descent function

    Args: 
       data:      text
       word2Ind:  words to Indices
       N:         dimension of hidden vector  
       V:         dimension of vocabulary 
       num_iters: number of iterations
       alpha:     the learning rate

    Returns: 
       W1, W2, b1, b2:  updated matrices and biases   
    """
    W1, W2, b1, b2 = initialize_model(N,V, random_seed=282)
    batch_size = 128
    iters = 0
    C = 2
    for x, y in batch_generator(data, word2Ind, C, batch_size):
        ### START CODE HERE (Replace instances of 'None' with your own code) ###
        # Get z and h
        z, h = forward_prop(x, W1, W2, b1, b2)
        # Get yhat
        yhat = softmax(z)
        # Get cost
        cost = compute_cost(y, yhat, batch_size)
        if ( (iters+1) % 10 == 0):
            print(f"iters: {iters + 1} cost: {cost:.6f}")
        # Get gradients
        grad_W1, grad_W2, grad_b1, grad_b2 = back_prop(x,
                                                       yhat,
                                                       y,
                                                       h,
                                                       W1,
                                                       W2,
                                                       b1,
                                                       b2,
                                                       batch_size)

        # Update weights and biases
        W1 = W1 - alpha * grad_W1
        W2 = W2 - alpha * grad_W2
        b1 = b1 - alpha * grad_b1
        b2 = b2 - alpha * grad_b2

        ### END CODE HERE ###

        iters += 1 
        if iters == num_iters: 
            break
        if iters % 100 == 0:
            alpha *= 0.66

    return W1, W2, b1, b2

Test Your Function

C = 2
N = 50
V = len(meta.vocabulary)
num_iters = 150
print("Call gradient_descent")
W1, W2, b1, b2 = gradient_descent(data, meta.word_to_index, N, V, num_iters)
Call gradient_descent
iters: 10 cost: 0.789141
iters: 20 cost: 0.105543
iters: 30 cost: 0.056008
iters: 40 cost: 0.038101
iters: 50 cost: 0.028868
iters: 60 cost: 0.023237
iters: 70 cost: 0.019444
iters: 80 cost: 0.016716
iters: 90 cost: 0.014660
iters: 100 cost: 0.013054
iters: 110 cost: 0.012133
iters: 120 cost: 0.011370
iters: 130 cost: 0.010698
iters: 140 cost: 0.010100
iters: 150 cost: 0.009566

End

The next post is one on extracting and visualizing the embeddings using Principal Component Analysis.

Bundling It Up

Imports

# python
from collections import Counter, namedtuple
from enum import Enum, unique

# pypi
import attr
import numpy

Enum Setup

@unique
class Axis(Enum):
    ROWS = 0
    COLUMNS = 1

Named Tuples

Gradients = namedtuple("Gradients", ["input_weights", "hidden_weights", "input_bias", "hidden_bias"])

Weights = namedtuple("Weights", ["input_weights", "hidden_weights", "input_bias", "hidden_bias"])

The CBOW Model

@attr.s(auto_attribs=True)
class CBOW:
    """A continuous bag of words model builder

    Args:
     hidden: number of units in the hidden layer
     vocabulary_size: number of tokens in the vocabulary
     learning_rate: learning rate for back-propagation updates
     random_seed: seed for the random number generator
    """
    hidden: int
    vocabulary_size: int
    learning_rate: float=0.03
    random_seed: int=1    
    _random_generator: numpy.random.Generator=None

    # layer one
    _input_weights: numpy.ndarray=None
    _input_bias: numpy.ndarray=None

    # hidden layer
    _hidden_weights: numpy.ndarray=None
    _hidden_bias: numpy.ndarray=None
  • The Random Generator
    @property
    def random_generator(self) -> numpy.random.Generator:
        """The random number generator"""
        if self._random_generator is None:
            self._random_generator = numpy.random.default_rng(self.random_seed)
        return self._random_generator
    
  • First Layer Weights

    These are initialized using numpy's new Generator. I originally used its standard-normal method by mistake and the model did horribly. Generator.random gives you a uniform distribution, which seems to be what you're supposed to use.

    @property
    def input_weights(self) -> numpy.ndarray:
        """Weights for the first layer"""
        if self._input_weights is None:
            self._input_weights = self.random_generator.random(
                (self.hidden, self.vocabulary_size))
        return self._input_weights
    
  • First Layer Bias
    @property
    def input_bias(self) -> numpy.ndarray:
        """Bias for the input layer"""
        if self._input_bias is None:
            self._input_bias = self.random_generator.random(
                (self.hidden, 1)
            )
        return self._input_bias
    
  • Hidden Layer Weights
    @property
    def hidden_weights(self) -> numpy.ndarray:
        """The weights for the hidden layer"""
        if self._hidden_weights is None:
            self._hidden_weights = self.random_generator.random(
                (self.vocabulary_size, self.hidden)
            )
        return self._hidden_weights
    
  • Hidden Layer Bias
    @property
    def hidden_bias(self) -> numpy.ndarray:
        """Bias for the hidden layer"""
        if self._hidden_bias is None:
            self._hidden_bias = self.random_generator.random(
                (self.vocabulary_size, 1)
            )
        return self._hidden_bias
    
  • Softmax
    def softmax(self, scores: numpy.ndarray) -> numpy.ndarray:
        """Calculate the softmax
    
        Args: 
           scores: output scores from the hidden layer
        Returns: 
           yhat: prediction (estimate of y)"""
        return numpy.exp(scores)/numpy.sum(numpy.exp(scores), axis=Axis.ROWS.value)
    
  • Forward Propagation
    def forward(self, data: numpy.ndarray) -> tuple:
        """makes a model prediction
    
        Args:
         data: x-values to train on
    
        Returns:
         output, first-layer output
        """
        first_layer_output = numpy.maximum(numpy.dot(self.input_weights, data)
                                      + self.input_bias, 0)
        second_layer_output = (numpy.dot(self.hidden_weights, first_layer_output)
                       + self.hidden_bias)
        return second_layer_output, first_layer_output
    
  • Gradients
    def gradients(self, data: numpy.ndarray,
                  predicted: numpy.ndarray,
                  actual: numpy.ndarray,
                  hidden_input: numpy.ndarray) -> Gradients:
        """does the gradient calculation for back-propagation
    
        This is broken out to be able to troubleshoot/compare it
    
       Args:
         data: the input x value
         predicted: what our model predicted the labels for the data should be
         actual: what the actual labels should have been
         hidden_input: the input to the hidden layer
        Returns:
         Gradients for input_weight, hidden_weight, input_bias, hidden_bias
        """
        difference = predicted - actual
        batch_size = difference.shape[1]
        l1 = numpy.maximum(numpy.dot(self.hidden_weights.T, difference), 0)
    
        input_weights_gradient = numpy.dot(l1, data.T)/batch_size
        hidden_weights_gradient = numpy.dot(difference, hidden_input.T)/batch_size
        input_bias_gradient = numpy.sum(l1,
                                        axis=Axis.COLUMNS.value,
                                        keepdims=True)/batch_size
        hidden_bias_gradient = numpy.sum(difference,
                                         axis=Axis.COLUMNS.value,
                                         keepdims=True)/batch_size
        return Gradients(input_weights=input_weights_gradient,
                         hidden_weights=hidden_weights_gradient,
                         input_bias=input_bias_gradient,
                         hidden_bias=hidden_bias_gradient)
    
  • Backward Propagation
    def backward(self, data: numpy.ndarray,
                 predicted: numpy.ndarray,
                 actual: numpy.ndarray,
                 hidden_input: numpy.ndarray) -> None:
        """Does back-propagation to update the weights
    
       Args:
         data: the input x value
         predicted: what our model predicted the labels for the data should be
         actual: what the actual labels should have been
         hidden_input: the input to the hidden layer
        """
        gradients = self.gradients(data=data,
                                   predicted=predicted,
                                   actual=actual,
                                   hidden_input=hidden_input)
        # I don't have setters for the properties so use the private variables
        self._input_weights -= self.learning_rate * gradients.input_weights
        self._hidden_weights -= self.learning_rate * gradients.hidden_weights
        self._input_bias -= self.learning_rate * gradients.input_bias
        self._hidden_bias -= self.learning_rate * gradients.hidden_bias
        return
    
  • Call
    def __call__(self, data: numpy.ndarray) -> numpy.ndarray:
        """makes a prediction on the data
    
        Args:
         data: input data for the prediction
    
        Returns:
         softmax of model output
        """
        output, _ = self.forward(data)
        return self.softmax(output)
    

Batch Generator

@attr.s(auto_attribs=True)
class Batches:
    """Generates batches of data

    Args:
     data: the source of the data to generate (training data)
     word_to_index: dict mapping the word to the vocabulary index
     half_window: number of tokens on either side of word to grab
     batch_size: the number of entries per batch
     batches: number of batches to generate before quitting
     verbose: whether to emit messages
    """
    data: numpy.ndarray
    word_to_index: dict
    half_window: int
    batch_size: int
    batches: int
    repetitions: int=0
    verbose: bool=False    
    _vocabulary_size: int=None
    _vectors: object=None
  • Vocabulary Size
    @property
    def vocabulary_size(self) -> int:
        """Number of tokens in the vocabulary"""
        if self._vocabulary_size is None:
            self._vocabulary_size = len(self.word_to_index)
        return self._vocabulary_size
    
  • Vectors
    @property
    def vectors(self):
        """our vector-generator started up"""
        if self._vectors is None:
            self._vectors = self.vector_generator()
        return self._vectors
    
  • Indices and Frequencies
    def indices_and_frequencies(self, context_words: list) -> list:
        """combines word-indexes and frequency counts-dict
    
        Args:
         context_words: words to get the indices for
    
        Returns:
         list of (word-index, word-count) tuples built from context_words
        """
        frequencies = Counter(context_words)
        indices = [self.word_to_index[word] for word in context_words]
        return [(indices[index], frequencies[context_words[index]])
                for index in range(len(indices))]
    
  • Vectors
    def vector_generator(self):
        """Generates vectors infinitely
    
        x: fraction of the context window each vocabulary word accounts for
        y: one-hot vector with a 1 at the center word's vocabulary index
    
        Yields:
         tuple of x, y 
        """
        location = self.half_window
        while True:
            y = numpy.zeros(self.vocabulary_size)
            x = numpy.zeros(self.vocabulary_size)
            center_word = self.data[location]
            y[self.word_to_index[center_word]] = 1
            context_words = (
                self.data[(location - self.half_window): location]
                + self.data[(location + 1) : (location + self.half_window + 1)])
    
            for word_index, frequency in self.indices_and_frequencies(context_words):
                x[word_index] = frequency/len(context_words)
            yield x, y
            location += 1
            if location >= len(self.data):
                if self.verbose:
                    print("location in data is being set to 0")
                location = 0
        return
    
  • Iterator Method
    def __iter__(self):
        """makes this into an iterator"""
        return self
    
  • Next Method
    def __next__(self) -> tuple:
        """Creates the batches and returns them
    
        Returns:
         x, y batches
        """
        batch_x = []
        batch_y = []
    
        if self.repetitions == self.batches:
            raise StopIteration()
        self.repetitions += 1    
        for x, y in self.vectors:
            if len(batch_x) < self.batch_size:
                batch_x.append(x)
                batch_y.append(y)
            else:
                return numpy.array(batch_x).T, numpy.array(batch_y).T
        return
    

The Trainer

@attr.s(auto_attribs=True)
class TheTrainer:
    """Something to train the model

    Args:
     model: thing to train
     batches: batch generator
     learning_impairment: rate to slow the model's learning
     impairment_point: how frequently to impair the learner
     emit_point: how frequently to emit messages
     verbose: whether to emit messages
    """
    model: CBOW
    batches: Batches
    learning_impairment: float=0.66
    impairment_point: int=100
    emit_point: int=10
    verbose: bool=False
    _losses: list=None
  • Losses
    @property
    def losses(self) -> list:
        """Holder for the training losses"""
        if self._losses is None:
            self._losses = []
        return self._losses
    
  • Gradient Descent
    def __call__(self):    
        """Trains the model using gradient descent
        """
        self.best_loss = float("inf")
        for repetitions, x_y in enumerate(self.batches):
            x, y = x_y
            output, hidden_input = self.model.forward(x)
            predictions = self.model.softmax(output)
    
            loss = self.cross_entropy_loss(predicted=predictions, actual=y)
            if loss < self.best_loss:
                self.best_loss = loss
                self.best_weights = Weights(
                    self.model.input_weights.copy(),
                    self.model.hidden_weights.copy(),
                    self.model.input_bias.copy(),
                    self.model.hidden_bias.copy(),
                )
            self.losses.append(loss)
            self.model.backward(data=x, predicted=predictions, actual=y,
                                hidden_input=hidden_input)
            if ((repetitions + 1) % self.impairment_point) == 0:
                self.model.learning_rate *= self.learning_impairment
                if self.verbose:
                    print(f"new learning rate: {self.model.learning_rate}")
            if self.verbose and ((repetitions + 1) % self.emit_point == 0):
                print(f"{repetitions + 1}: loss={self.losses[repetitions]}")
        return 
    
  • Cross-Entropy-Loss
    def cross_entropy_loss(self, predicted: numpy.ndarray,
                           actual: numpy.ndarray) -> numpy.ndarray:
        """Calculates the cross-entropy loss
    
        Args:
         predicted: array with the model's guesses
         actual: array with the actual labels
    
        Returns:
         the cross-entropy loss
        """
        log_probabilities = (numpy.multiply(numpy.log(predicted), actual)
                             + numpy.multiply(numpy.log(1 - predicted), 1 - actual))
        cost = -numpy.sum(log_probabilities)/self.batches.batch_size
        return numpy.squeeze(cost)
    

Testing It

from neurotic.nlp.word_embeddings import Batches, CBOW, TheTrainer

N = 4
V = len(meta.vocabulary)
model = CBOW(hidden=N, vocabulary_size=V)


expect(model.vocabulary_size).to(equal(V))
expect(model.input_weights.shape).to(equal((N, V)))
expect(model.hidden_weights.shape).to(equal((V, N)))
expect(model.input_bias.shape).to(equal((N, 1)))
expect(model.hidden_bias.shape).to(equal((V, 1)))

tmp = numpy.array([[1,2,3],
                   [1,1,1]
                   ])
tmp_sm = model.softmax(tmp)
expected =  numpy.array([[0.5, 0.73105858, 0.88079708],
                         [0.5, 0.26894142, 0.11920292]])


expect(numpy.allclose(tmp_sm, expected)).to(be_true)

Forward Propagation

tmp_N = 2
tmp_V = 3
tmp_x = numpy.array([[0,1,0]]).T

model = CBOW(hidden=tmp_N, vocabulary_size=tmp_V)
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(N=tmp_N,V=tmp_V, random_seed=1)

model._input_weights = tmp_W1
model._hidden_weights = tmp_W2
model._input_bias = tmp_b1
model._hidden_bias = tmp_b2

tmp_z, tmp_h = model.forward(tmp_x)

expect(tmp_x.shape).to(equal((3, 1)))
expect(tmp_z.shape).to(equal((3, 1)))
expect(tmp_h.shape).to(equal((2, 1)))

expected = numpy.array(
    [[0.55379268],
     [1.58960774],
     [1.50722933]]
)
expect(numpy.allclose(tmp_z, expected)).to(be_true)

expected = numpy.array(
    [[0.92477674],
     [1.02487333]]
)

expect(numpy.allclose(tmp_h, expected)).to(be_true)

Cross Entropy Loss

tmp_C = 2
tmp_N = 50
tmp_batch_size = 4
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  batches=15,
                  half_window=tmp_C, batch_size=tmp_batch_size)

tmp_V = len(meta.vocabulary)

tmp_x, tmp_y = next(batch_generator(data, meta.word_to_index, tmp_C, tmp_batch_size))
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)

model = CBOW(hidden=tmp_N, vocabulary_size=tmp_V)
model._input_weights = tmp_W1
model._hidden_weights = tmp_W2
model._input_bias = tmp_b1
model._hidden_bias = tmp_b2

tmp_z, tmp_h = model.forward(tmp_x)

tmp_yhat = model.softmax(tmp_z)

train = TheTrainer(model=model, batches=batches, verbose=True)
tmp_cost = train.cross_entropy_loss(actual=tmp_y, predicted=tmp_yhat)

expect(tmp_x.shape).to(equal((5778, 4)))
expect(tmp_y.shape).to(equal((5778, 4)))
expect(tmp_W1.shape).to(equal((50, 5778)))
expect(tmp_W2.shape).to(equal((5778, 50)))
expect(tmp_b1.shape).to(equal((50, 1)))
expect(tmp_b2.shape).to(equal((5778, 1)))
expect(tmp_z.shape).to(equal((5778, 4)))
expect(tmp_h.shape).to(equal((50, 4)))
expect(tmp_yhat.shape).to(equal((5778, 4)))
expect(math.isclose(tmp_cost, 9.9560, abs_tol=1e-4)).to(be_true)

Back Propagation

tmp_C = 2
tmp_N = 50
tmp_batch_size = 4

# get a batch of data
tmp_x, tmp_y = next(batch_generator(data, meta.word_to_index, tmp_C, tmp_batch_size))
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)
model = CBOW(hidden=tmp_N, vocabulary_size=tmp_V)
model._input_weights = tmp_W1
model._hidden_weights = tmp_W2
model._input_bias = tmp_b1
model._hidden_bias = tmp_b2
tmp_z, tmp_h = model.forward(tmp_x)
tmp_yhat = model.softmax(tmp_z)
print(f"tmp_yhat.shape: {tmp_yhat.shape}")

gradients = model.gradients(data=tmp_x, predicted=tmp_yhat, actual=tmp_y, hidden_input=tmp_h)
tmp_grad_W1, tmp_grad_W2, tmp_grad_b1, tmp_grad_b2 = back_prop(tmp_x, tmp_yhat, tmp_y, tmp_h, tmp_W1, tmp_W2, tmp_b1, tmp_b2, tmp_batch_size)

expect(numpy.allclose(gradients.input_weights, tmp_grad_W1)).to(be_true)
expect(numpy.allclose(gradients.hidden_weights, tmp_grad_W2)).to(be_true)
expect(numpy.allclose(gradients.input_bias, tmp_grad_b1)).to(be_true)
expect(numpy.allclose(gradients.hidden_bias, tmp_grad_b2)).to(be_true)

expect(tmp_x.shape).to(equal((5778, 4)))
expect(tmp_y.shape).to(equal((5778, 4)))
expect(tmp_W1.shape).to(equal((50, 5778)))
expect(tmp_W2.shape).to(equal((5778, 50)))
expect(tmp_b1.shape).to(equal((50, 1)))
expect(tmp_b2.shape).to(equal((5778, 1)))
expect(tmp_z.shape).to(equal((5778, 4)))
expect(tmp_h.shape).to(equal((50, 4)))
expect(tmp_yhat.shape).to(equal((5778, 4)))
expect(tmp_grad_W1.shape).to(equal((50, 5778)))
expect(tmp_grad_W2.shape).to(equal((5778, 50)))
expect(tmp_grad_b1.shape).to(equal((50, 1)))
expect(tmp_grad_b2.shape).to(equal((5778, 1)))

Putting Some Stuff Together

tmp_C = 2
tmp_N = 50
tmp_batch_size = 4
hidden_layers = 50

batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  batches=15,
                  half_window=tmp_C, batch_size=tmp_batch_size)
tmp_x, tmp_y = next(batches)
model = CBOW(hidden=hidden_layers, vocabulary_size=len(meta.vocabulary))
prediction = model(tmp_x)

train = TheTrainer(model=model, batches=batches, verbose=True)
print(train.cross_entropy_loss(predicted=prediction, actual=tmp_y))
print(compute_cost(tmp_y, prediction, tmp_batch_size))

# using their initial weights
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)

model = CBOW(hidden=tmp_N, vocabulary_size=tmp_V)
expect(model.input_weights.shape).to(equal(tmp_W1.shape))
expect(model.hidden_weights.shape).to(equal(tmp_W2.shape))
expect(model.input_bias.shape).to(equal(tmp_b1.shape))
expect(model.hidden_bias.shape).to(equal(tmp_b2.shape))

model._input_weights = tmp_W1
model._hidden_weights = tmp_W2
model._input_bias = tmp_b1
model._hidden_bias = tmp_b2

tmp_x, tmp_y = next(batch_generator(data, meta.word_to_index, tmp_C, tmp_batch_size))
prediction = model(tmp_x)

train = TheTrainer(model=model, batches=batches, verbose=True)
print(train.cross_entropy_loss(predicted=prediction, actual=tmp_y))
print(compute_cost(tmp_y, prediction, tmp_batch_size))
11.871189103548419
11.871189103548419
9.956016099656951
9.956016099656951

I changed the weights to use the uniform distribution, which seems to work better, but weirdly it still does a little worse initially. The random seed behaves differently for the legacy numpy.random functions and the new Generator, so the initial weights (and therefore the initial cost) don't match exactly.
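
Here's a quick check (toy sizes, just for illustration) that the two random streams really do differ even when given the same seed:

numpy.random.seed(1)
legacy_sample = numpy.random.rand(3)
new_sample = numpy.random.default_rng(1).random(3)
# same seed, different algorithms, so the sequences don't line up
expect(not numpy.allclose(legacy_sample, new_sample)).to(be_true)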

The Batches

The original batch-generator had a couple of bugs in it. To avoid them pass in original=False.

tmp_C = 2
tmp_N = 50
tmp_batch_size = 4

batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  batches=5,
                  half_window=tmp_C, batch_size=tmp_batch_size)


old_generator = batch_generator(cleaner.processed, meta.word_to_index, tmp_C,
                                tmp_batch_size, original=False)


old_x, old_y = next(old_generator)
tmp_x, tmp_y = next(batches)
expect(numpy.allclose(tmp_x, old_x)).to(be_true)
expect(numpy.allclose(tmp_y, old_y)).to(be_true)


old_x, old_y = next(old_generator)
tmp_x, tmp_y = next(batches)
#expect(numpy.allclose(tmp_x, old_x)).to(be_true)
#expect(numpy.allclose(tmp_y, old_y)).to(be_true)

old_x, old_y = next(old_generator)
tmp_x, tmp_y = next(batches)

Gradient Descent

hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 150

model = CBOW(hidden=hidden_layers, vocabulary_size=len(meta.vocabulary))
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=batch_size, batches=repetitions)
train = TheTrainer(model=model, batches=batches, verbose=True)
train()
10: loss=12.949165499168524
20: loss=7.1739091478289225
30: loss=13.431976455238479
40: loss=4.0062314323745545
50: loss=11.595407087927406
60: loss=10.41983077447342
70: loss=7.843047289924249
80: loss=12.529314536141994
90: loss=14.122707806423126
new learning rate: 0.0198
100: loss=10.80530164111974
110: loss=4.624869443165228
120: loss=5.552813055551899
130: loss=8.483428176366933
140: loss=9.047299388851195
150: loss=4.841072955589429

Gradient Re-do

Something's wrong with the trainer's gradient descent, so I'm going to adapt the original gradient_descent function to drive the CBOW model and see where things diverge.

def grady_the_ent(model: CBOW, data: numpy.ndarray,
                  num_iters: int, batches: Batches, alpha: float=0.03):
    """The gradient_descent function re-worked to use the CBOW model

    Args: 
       model:     the CBOW model to train
       data:      text (kept for parity with the original signature, unused here)
       num_iters: number of iterations
       batches:   generator of training batches
       alpha:     the learning rate

    Returns: 
       None (the model's weights and biases are updated in place)
    """
    batch_size = 128
    iters = 0
    C = 2
    for x, y in batches:
        z, h = model.forward(x)
        # Get yhat
        yhat = model.softmax(z)
        # Get cost
        cost = compute_cost(y, yhat, batch_size)
        if ((iters+1) % 10 == 0):
            print(f"iters: {iters + 1} cost: {cost:.6f}")
        grad_W1, grad_W2, grad_b1, grad_b2 = model.gradients(x,
                                                             yhat,
                                                             y,
                                                             h)

        # Update weights and biases
        model._input_weights -= alpha * grad_W1
        model._hidden_weights -= alpha * grad_W2
        model._input_bias -=  alpha * grad_b1
        model._hidden_bias -=  alpha * grad_b2

        iters += 1 
        if iters == num_iters: 
            break
        if iters % 100 == 0:
            alpha *= 0.66

    return
hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 150
vocabulary_size = len(meta.vocabulary)

model = CBOW(hidden=hidden_layers, vocabulary_size=vocabulary_size)
# batch_generator(data, word2Ind, C, batch_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=batch_size, batches=repetitions)

grady_the_ent(model, cleaner.processed, repetitions, batches=batches)
iters: 10 cost: 12.949165
iters: 20 cost: 7.173909
iters: 30 cost: 13.431976
iters: 40 cost: 4.006231
iters: 50 cost: 11.595407
iters: 60 cost: 10.419831
iters: 70 cost: 7.843047
iters: 80 cost: 12.529315
iters: 90 cost: 14.122708
iters: 100 cost: 10.805302
iters: 110 cost: 4.624869
iters: 120 cost: 5.552813
iters: 130 cost: 8.483428
iters: 140 cost: 9.047299
iters: 150 cost: 4.841073

So, something's wrong with the gradient descent.

hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 150
vocabulary_size = len(meta.vocabulary)

model = CBOW(hidden=hidden_layers, vocabulary_size=vocabulary_size)
batches = batch_generator(data, meta.word_to_index, C, batch_size)
#batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
#                  half_window=half_window, batch_size=batch_size, batches=repetitions)

grady_the_ent(model, cleaner.processed, repetitions, batches=batches)
iters: 10 cost: 0.407862
iters: 20 cost: 0.090807
iters: 30 cost: 0.050924
iters: 40 cost: 0.035379
iters: 50 cost: 0.027105
iters: 60 cost: 0.021969
iters: 70 cost: 0.018470
iters: 80 cost: 0.015932
iters: 90 cost: 0.014008
iters: 100 cost: 0.012499
iters: 110 cost: 0.011631
iters: 120 cost: 0.010911
iters: 130 cost: 0.010274
iters: 140 cost: 0.009708
iters: 150 cost: 0.009201

It looks like it's the batches.

Troubleshooting the Batches

half_window = 2
batch_size = 128
repetitions = 150

batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=batch_size, batches=repetitions)

start = random.randint(0, 100)
context = cleaner.processed[start: start + half_window] + cleaner.processed[start + half_window + 1: start + 2 * half_window + 1]
packed_1 = index_with_frequency(context, meta.word_to_index)
packed_2 = batches.indices_and_frequencies(context)
expect(packed_1).to(contain_exactly(*packed_2))

So the indices-and-frequencies method is okay.

half_window = 2

v = vectors(cleaner.processed, meta.word_to_index, half_window)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=batch_size, batches=repetitions)
repetition = 0
for old, new in zip(v, batches.vectors):
    expect((old[0] == new[0]).all()).to(equal(True))
    expect((old[1] == new[1]).all()).to(equal(True))
    repetition += 1
    if repetition == repetitions:
        break

And the vectors look okay.

old_generator = batch_generator(cleaner.processed, meta.word_to_index, tmp_C, tmp_batch_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=tmp_batch_size, batches=repetitions)
repetitions = 150
repetition = 0
# batch = next(batches)
for old in old_generator:
    batch_x = []
    batch_y = []
    for x, y in batches.vectors:
        while len(batch_x) < batches.batch_size:
            batch_x.append(x)
            batch_y.append(y)
        else:
            newx, newy = numpy.array(batch_x).T, numpy.array(batch_y).T
            expect((old[0]==newx).all()).to(equal(True))
            repetition += 1
            if repetition == repetitions:
                break
    else:
        continue
    break

So, weirdly, rolling the __next__ method by hand seems to work.

old_generator = batch_generator(cleaner.processed, meta.word_to_index, tmp_C, tmp_batch_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=tmp_batch_size, batches=repetitions)

repetition, repetitions = 0, 150
for old, new in zip(old_generator, batches):
    try:
        expect((old[0] == new[0]).all()).to(equal(True))
        expect((old[1] == new[1]).all()).to(equal(True))
    except AssertionError:
        print(repetition)
        break
    repetition += 1
    if repetition == repetitions:
        break
1

But not the batches.

old_generator = batch_generator(cleaner.processed, meta.word_to_index, tmp_C, tmp_batch_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=tmp_batch_size, batches=repetitions)

repetition, repetitions = 0, 150
for old in old_generator:
    new = next(batches)
    expect(old[0].shape).to(equal(new[0].shape))
    try:
        expect((old[0] == new[0]).all()).to(equal(True))
        expect((old[1] == new[1]).all()).to(equal(True))
    except AssertionError:
        print(repetition)
        break
    repetition += 1
    if repetition == repetitions:
        break

Actually, it looks like the old generator might be broken.

hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 150
vocabulary_size = len(meta.vocabulary)

model = CBOW(hidden=hidden_layers, vocabulary_size=vocabulary_size)
batches = batch_generator(data, meta.word_to_index, C, batch_size)
#batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
#                  half_window=half_window, batch_size=batch_size, batches=repetitions)

grady_the_ent(model, cleaner.processed, repetitions, batches=batches)
iters: 10 cost: 12.949165
iters: 20 cost: 7.173909
iters: 30 cost: 13.431976
iters: 40 cost: 4.006231
iters: 50 cost: 11.595407
iters: 60 cost: 10.419831
iters: 70 cost: 7.843047
iters: 80 cost: 12.529315
iters: 90 cost: 14.122708
iters: 100 cost: 10.805302
iters: 110 cost: 4.624869
iters: 120 cost: 5.552813
iters: 130 cost: 8.483428
iters: 140 cost: 9.047299
iters: 150 cost: 4.841073

The old generator wasn't creating new lists each time, so it was just fitting the same batch of data over and over. In fact it had a while loop instead of a conditional, so it built a single batch with the same x and y repeated again and again; that should really have given the worst performance, not the very good numbers the original generator appeared to produce. I didn't re-run the cells above, but this next set is run after fixing my implementation.

hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 300
vocabulary_size = len(meta.vocabulary)

model = CBOW(hidden=hidden_layers, vocabulary_size=vocabulary_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=batch_size, batches=repetitions)
trainer = TheTrainer(model, batches, emit_point=50)
with TIMER:
    trainer()
2020-12-16 14:15:54,530 graeae.timers.timer start: Started: 2020-12-16 14:15:54.530779
2020-12-16 14:16:18,600 graeae.timers.timer end: Ended: 2020-12-16 14:16:18.600880
2020-12-16 14:16:18,602 graeae.timers.timer end: Elapsed: 0:00:24.070101
print(trainer.losses[0], trainer.losses[-1])
11.99601105791401 8.827228045367379

Not a huge improvement, but it didn't run for a long time either.

hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 1000
vocabulary_size = len(meta.vocabulary)

model = CBOW(hidden=hidden_layers, vocabulary_size=vocabulary_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=batch_size, batches=repetitions)

trainer = TheTrainer(model, batches, emit_point=100, verbose=True)
with TIMER:
    trainer()
2020-12-16 14:40:13,275 graeae.timers.timer start: Started: 2020-12-16 14:40:13.275964
new learning rate: 0.0198
100: loss=9.138356897918037
new learning rate: 0.013068000000000001
200: loss=9.077599951734605
new learning rate: 0.008624880000000001
300: loss=8.827228045367379
new learning rate: 0.005692420800000001
400: loss=8.556788482755191
new learning rate: 0.003756997728000001
500: loss=8.92744766914796
new learning rate: 0.002479618500480001
600: loss=9.052677036205138
new learning rate: 0.0016365482103168007
700: loss=8.914532962726918
new learning rate: 0.0010801218188090885
800: loss=8.885698480310062
new learning rate: 0.0007128804004139984
900: loss=9.042620463323736
2020-12-16 14:41:33,457 graeae.timers.timer end: Ended: 2020-12-16 14:41:33.457065
2020-12-16 14:41:33,458 graeae.timers.timer end: Elapsed: 0:01:20.181101
new learning rate: 0.000470501064273239
1000: loss=9.239992952104755

Hmm… doesn't seem to be improving.

losses = pandas.Series(trainer.losses)
line = holoviews.VLine(losses.idxmin()).opts(color=Plot.blue)
time_series = losses.hvplot().opts(title="Loss per Repetition",
                                   width=Plot.width, height=Plot.height,
                                   color=Plot.tan)

plot = time_series * line
output = Embed(plot=plot, file_name="training_1000")()
print(output)

Figure Missing

Since the losses are in a Series we can use its idxmin method to see when the losses bottomed out.

print(losses.idxmin())
247
print(losses.loc[247], losses.iloc[-1])
8.186490214727549 9.239992952104755

So it did the best at 247 and then got a little worse as we went along.

print(len(meta.word_to_index)/batch_size)
45.140625

So the vocabulary only covers about 45 batches' worth of words; I guess after that the model starts overfitting.