Word Embeddings: Training the Model

Building and Training the Model

In the previous post we did some preliminary setup and data pre-processing. Now we're going to build and train a Continuous Bag of Words (CBOW) model.

Imports

# python
from argparse import Namespace
from collections import Counter
from enum import Enum, unique
from functools import partial

import math
import random

# pypi
from expects import be_true, contain_exactly, equal, expect

import holoviews
import hvplot.pandas
import numpy
import pandas

# this project
from neurotic.nlp.word_embeddings import DataCleaner, MetaData

# my other stuff
from graeae import EmbedHoloviews, Timer

Set Up

Code from the previous post.

cleaner = DataCleaner()
data = cleaner.processed
meta = MetaData(data)
TIMER = Timer(speak=False)
Embed = partial(EmbedHoloviews, folder_path="files/posts/nlp/word-embeddings-training-the-model")
Plot = Namespace(
    width=990,
    height=780,
    fontscale=2,
    tan="#ddb377",
    blue="#4687b7",
    red="#ce7b6d",
 )

Something to help remember what the numpy axis argument is.

@unique
class Axis(Enum):
    ROWS = 0
    COLUMNS = 1
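
As a quick sanity check of what the two values mean (a toy array, not part of the assignment):

grid = numpy.array([[1, 2, 3],
                    [4, 5, 6]])
# axis=ROWS collapses the rows, giving one sum per column
print(numpy.sum(grid, axis=Axis.ROWS.value))
# axis=COLUMNS collapses the columns, giving one sum per row
print(numpy.sum(grid, axis=Axis.COLUMNS.value))
[5 7 9]
[ 6 15]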

Middle

Initializing the model

You will now initialize two matrices and two vectors.

  • The first matrix (\(W_1\)) is of dimension \(N \times V\), where V is the number of words in your vocabulary and N is the dimension of your word vector.
  • The second matrix (\(W_2\)) is of dimension \(V \times N\).
  • Vector \(b_1\) has dimensions \(N\times 1\)
  • Vector \(b_2\) has dimensions \(V\times 1\).
  • \(b_1\) and \(b_2\) are the bias vectors of the linear layers from matrices \(W_1\) and \(W_2\).

At this stage we are just initializing the parameters.

Please use numpy.random.rand to generate matrices that are initialized with random values from a uniform distribution, ranging between 0 and 1.

# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: initialize_model
def initialize_model(N: int,V: int, random_seed: int=1) -> tuple:
    """Initialize the matrices with random values

    Args: 
       N:  dimension of hidden vector 
       V:  dimension of vocabulary
       random_seed: random seed for consistent results in the unit tests
     Returns: 
       W1, W2, b1, b2: initialized weights and biases
    """

    numpy.random.seed(random_seed)

    ### START CODE HERE (Replace instances of 'None' with your code) ###
    # W1 has shape (N,V)
    W1 = numpy.random.rand(N, V)
    # W2 has shape (V,N)
    W2 = numpy.random.rand(V, N)
    # b1 has shape (N,1)
    b1 = numpy.random.rand(N, 1)
    # b2 has shape (V,1)
    b2 = numpy.random.rand(V, 1)
    ### END CODE HERE ###

    return W1, W2, b1, b2

Test your function.

tmp_N = 4
tmp_V = 10
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)
expect(tmp_W1.shape).to(equal((tmp_N,tmp_V)))
expect(tmp_W2.shape).to(equal((tmp_V,tmp_N)))
expect(tmp_b1.shape).to(equal((tmp_N, 1)))
expect(tmp_b2.shape).to(equal((tmp_V, 1)))
print(f"tmp_W1.shape: {tmp_W1.shape}")
print(f"tmp_W2.shape: {tmp_W2.shape}")
print(f"tmp_b1.shape: {tmp_b1.shape}")
print(f"tmp_b2.shape: {tmp_b2.shape}")
tmp_W1.shape: (4, 10)
tmp_W2.shape: (10, 4)
tmp_b1.shape: (4, 1)
tmp_b2.shape: (10, 1)

Softmax

Before we can start training the model, we need to implement the softmax function as defined in equation 5:

\[ \text{softmax}(z_i) = \frac{e^{z_i} }{\sum_{j=0}^{V-1} e^{z_j} } \tag{5} \]

  • Array indexing in code starts at 0.
  • V is the number of words in the vocabulary (which is also the number of rows of z).
  • i goes from 0 to |V| - 1.

The Implementation

  • Assume that the input z to softmax is a 2D array
  • Each training example is represented by a column of shape (V, 1) in this 2D array.
  • There may be more than one column in the 2D array, because you can put in a batch of examples to increase efficiency. Let's call the batch size lowercase m, so the z array has shape (V, m).
  • When taking the sum over \(j = 0 \cdots V-1\), take the sum for each column (each example) separately.

Please use numpy.exp and numpy.sum, summing along axis 0 (Axis.ROWS) so that each column (each example) is normalized separately.

# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: softmax
def softmax(z: numpy.ndarray) -> numpy.ndarray:
    """Calculate the softmax

    Args: 
       z: output scores from the hidden layer
    Returns: 
       yhat: prediction (estimate of y)
    """

    ### START CODE HERE (Replace instances of 'None' with your own code) ###

    # Calculate yhat (softmax)
    yhat = numpy.exp(z)/numpy.sum(numpy.exp(z), axis=Axis.ROWS.value)

    ### END CODE HERE ###

    return yhat
# Test the function
tmp = numpy.array([[1,2,3],
                   [1,1,1]
                   ])
tmp_sm = softmax(tmp)
print(tmp_sm)
expected =  numpy.array([[0.5, 0.73105858, 0.88079708],
                         [0.5, 0.26894142, 0.11920292]])


expect(numpy.allclose(tmp_sm, expected)).to(be_true)
[[0.5        0.73105858 0.88079708]
 [0.5        0.26894142 0.11920292]]
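
An aside, not part of the graded function: numpy.exp can overflow when the scores get large, so a common variant subtracts each column's maximum before exponentiating (the result is mathematically identical). A minimal sketch, checked against the output above:

def stable_softmax(z: numpy.ndarray) -> numpy.ndarray:
    """Softmax that shifts each column by its maximum to avoid overflow"""
    shifted = z - numpy.max(z, axis=Axis.ROWS.value, keepdims=True)
    exponentials = numpy.exp(shifted)
    return exponentials/numpy.sum(exponentials, axis=Axis.ROWS.value, keepdims=True)

expect(numpy.allclose(stable_softmax(tmp), tmp_sm)).to(be_true)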

Forward propagation

We're going to implement forward propagation to compute z according to equations (1) to (3).

\begin{align} h &= W_1 \ X + b_1 \tag{1} \\ a &= ReLU(h) \tag{2} \\ z &= W_2 \ a + b_2 \tag{3} \\ \end{align}

For the activation you will use the Rectified Linear Unit (ReLU), given by:

\[ f(h)=\max (0,h) \tag{6} \]
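
In numpy this is just an elementwise maximum with zero; for example (a toy column vector, not part of the assignment):

print(numpy.maximum(numpy.array([[-1.0], [2.0]]), 0))
[[0.]
 [2.]]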

Hint: numpy.dot and numpy.maximum will do most of the work here.

# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: forward_prop
def forward_prop(x: numpy.ndarray,
                 W1: numpy.ndarray, W2: numpy.ndarray,
                 b1: numpy.ndarray, b2: numpy.ndarray) -> tuple:
    """Pass the data through the network

    Args: 
       x:  average one hot vector for the context 
       W1, W2, b1, b2:  matrices and biases to be learned
    Returns: 
       z:  output score vector
    """

    ### START CODE HERE (Replace instances of 'None' with your own code) ###

    # Calculate h
    h = numpy.dot(W1, x) + b1

    # Apply the relu on h (store result in h)
    h = numpy.maximum(h, 0)

    # Calculate z
    z = numpy.dot(W2, h) + b2

    ### END CODE HERE ###

    return z, h

Test the function

tmp_N = 2
tmp_V = 3
tmp_x = numpy.array([[0,1,0]]).T

tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(N=tmp_N,V=tmp_V, random_seed=1)

print(f"x has shape {tmp_x.shape}")
print(f"N is {tmp_N} and vocabulary size V is {tmp_V}")

tmp_z, tmp_h = forward_prop(tmp_x, tmp_W1, tmp_W2, tmp_b1, tmp_b2)

print("call forward_prop")
print()

print(f"z has shape {tmp_z.shape}")
print("z has values:")
print(tmp_z)

print()

print(f"h has shape {tmp_h.shape}")
print("h has values:")
print(tmp_h)

expect(tmp_x.shape).to(equal((3, 1)))
expect(tmp_z.shape).to(equal((3, 1)))
expected = numpy.array(
    [[0.55379268],
     [1.58960774],
     [1.50722933]]
)
expect(numpy.allclose(tmp_z, expected)).to(be_true)
expect(tmp_h.shape).to(equal((2, 1)))
expected = numpy.array(
    [[0.92477674],
     [1.02487333]]
)

expect(numpy.allclose(tmp_h, expected)).to(be_true)
x has shape (3, 1)
N is 2 and vocabulary size V is 3
call forward_prop

z has shape (3, 1)
z has values:
[[0.55379268]
 [1.58960774]
 [1.50722933]]

h has shape (2, 1)
h has values:
[[0.92477674]
 [1.02487333]]

Pack Index with Frequency

def index_with_frequency(context_words: list,
                         word_to_index: dict) -> list:
    """Combines word indices with their frequency counts

    Args:
     context_words: words to get the indices for
     word_to_index: mapping of word to index

    Returns:
     list of (word-index, word-count) tuples built from context_words
    """
    frequency_dict = Counter(context_words)
    indices = [word_to_index[word] for word in context_words]
    packed = []
    for index in range(len(indices)):
        word_index = indices[index]
        frequency = frequency_dict[context_words[index]]
        packed.append((word_index, frequency))
    return packed
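
A quick toy check (the tiny vocabulary and context here are made up just for illustration):

toy_vocabulary = {"i": 0, "like": 1, "cheese": 2}
print(index_with_frequency(["i", "like", "i", "cheese"], toy_vocabulary))
[(0, 2), (1, 1), (0, 2), (2, 1)]

Repeated words keep their full count ("i" appears twice, so both of its entries carry a count of 2).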

Vector Generator

def vectors(data: numpy.ndarray, word_to_index: dict, half_window: int):
    """Generates vectors of fraction of context words each word represents

    Args:
     data: source of the vectors
     word_to_index: mapping of word to index in the vocabulary
     half_window: number of tokens on either side of the word to keep

    Yields:
     tuple of x, y 
    """
    location = half_window
    vocabulary_size = len(word_to_index)
    while True:
        y = numpy.zeros(vocabulary_size)
        x = numpy.zeros(vocabulary_size)
        center_word = data[location]
        y[word_to_index[center_word]] = 1
        context_words = (data[(location - half_window): location]
                         + data[(location + 1) : (location + half_window + 1)])

        for word_index, frequency in index_with_frequency(context_words, word_to_index):
            x[word_index] = frequency/len(context_words)
        yield x, y
        location += 1
        if location >= len(data):
            print("location in data is being set to 0")
            location = 0
    return
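
A toy run, re-using the made-up vocabulary from above, to show what a single (x, y) pair looks like with a half_window of 1:

toy_x, toy_y = next(vectors(["i", "like", "cheese"], toy_vocabulary, half_window=1))
print(toy_x)
print(toy_y)
[0.5 0.  0.5]
[0. 1. 0.]

The center word ("like") gets the 1 in y, and each of the two context words gets half the weight in x.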

Batch Generator

This uses a not-so-common form of the while loop: if the loop runs to completion (i.e. you didn't break out of it), then the else clause runs.

def batch_generator(data: numpy.ndarray, word_to_index: dict,
                    half_window: int, batch_size: int, original: bool=True):
    """Generate batches of vectors

    Args:
     data: the training data
     word_to_index: map of word to vocabulary index
     half_window: number of tokens to take from either side of word
     batch_size: Number of vectors to put in each training batch
     original: run the original buggy code

    Yields:
     tuple of X, Y batches
    """
    vocabulary_size = len(word_to_index)
    batch_x = []
    batch_y = []
    for x, y in vectors(data,
                        word_to_index,
                        half_window):
        if original:
            while len(batch_x) < batch_size:
                batch_x.append(x)
                batch_y.append(y)

            else:
                yield numpy.array(batch_x).T, numpy.array(batch_y).T
        else:
            if len(batch_x) < batch_size:
                batch_x.append(x)
                batch_y.append(y)

            else:
                yield numpy.array(batch_x).T, numpy.array(batch_y).T
                batch_x = []
                batch_y = []
    return

So every time batch_x reaches the batch_size it yields the tuple and then creates a new batch before continuing the outer for-loop.
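
To make the while/else behavior mentioned above concrete, here's a tiny standalone example (nothing to do with the model):

attempts = 0
while attempts < 3:
    attempts += 1
else:
    print("the loop condition became false without a break, so the else clause ran")
the loop condition became false without a break, so the else clause ran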

Cost function

The cross-entropy loss function.
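Matching the compute_cost implementation below (with m as the batch size), the loss being computed is:

\[ J = -\frac{1}{m}\sum_{j=1}^{m}\sum_{i=1}^{V}\left[ y_{ij}\log \hat{y}_{ij} + \left(1 - y_{ij}\right)\log\left(1 - \hat{y}_{ij}\right)\right] \]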

def compute_cost(y: numpy.ndarray, y_hat: numpy.ndarray,
                 batch_size: int) -> numpy.ndarray:
    """Calculates the cross-entropy loss

    Args:
     y: array with the actual words labeled
     y_hat: our model's guesses for the words
     batch_size: the number of examples per training run
    """
    log_probabilities = (numpy.multiply(numpy.log(y_hat), y)
                         + numpy.multiply(numpy.log(1 - y_hat), 1 - y))
    cost = -numpy.sum(log_probabilities)/batch_size
    cost = numpy.squeeze(cost)
    return cost

Test the function

tmp_C = 2
tmp_N = 50
tmp_batch_size = 4

tmp_word2Ind, tmp_Ind2word = meta.word_to_index, meta.vocabulary
tmp_V = len(meta.vocabulary)

tmp_x, tmp_y = next(batch_generator(data, tmp_word2Ind, tmp_C, tmp_batch_size))

print(f"tmp_x.shape {tmp_x.shape}")
print(f"tmp_y.shape {tmp_y.shape}")

tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)

print(f"tmp_W1.shape {tmp_W1.shape}")
print(f"tmp_W2.shape {tmp_W2.shape}")
print(f"tmp_b1.shape {tmp_b1.shape}")
print(f"tmp_b2.shape {tmp_b2.shape}")

tmp_z, tmp_h = forward_prop(tmp_x, tmp_W1, tmp_W2, tmp_b1, tmp_b2)
print(f"tmp_z.shape: {tmp_z.shape}")
print(f"tmp_h.shape: {tmp_h.shape}")

tmp_yhat = softmax(tmp_z)
print(f"tmp_yhat.shape: {tmp_yhat.shape}")

tmp_cost = compute_cost(tmp_y, tmp_yhat, tmp_batch_size)
print("call compute_cost")
print(f"tmp_cost {tmp_cost:.4f}")

expect(tmp_x.shape).to(equal((5778, 4)))
expect(tmp_y.shape).to(equal((5778, 4)))
expect(tmp_W1.shape).to(equal((50, 5778)))
expect(tmp_W2.shape).to(equal((5778, 50)))
expect(tmp_b1.shape).to(equal((50, 1)))
expect(tmp_b2.shape).to(equal((5778, 1)))
expect(tmp_z.shape).to(equal((5778, 4)))
expect(tmp_h.shape).to(equal((50, 4)))
expect(tmp_yhat.shape).to(equal((5778, 4)))
expect(math.isclose(tmp_cost, 9.9560, abs_tol=1e-4)).to(be_true)
tmp_x.shape (5778, 4)
tmp_y.shape (5778, 4)
tmp_W1.shape (50, 5778)
tmp_W2.shape (5778, 50)
tmp_b1.shape (50, 1)
tmp_b2.shape (5778, 1)
tmp_z.shape: (5778, 4)
tmp_h.shape: (50, 4)
tmp_yhat.shape: (5778, 4)
call compute_cost
tmp_cost 9.9560

Training the Model - Backpropagation

Now that you understand how the CBOW model works, you will train it. You already created a function for forward propagation; now you will implement a function that computes the gradients needed to backpropagate the errors.
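
For reference, these are the gradients the function below computes, with m as the batch size and \(\mathbf{1}_m\) a column vector of ones standing in for the column-wise sums:

\begin{align} l_1 &= ReLU\left(W_2^\top (\hat{Y} - Y)\right) \\ \frac{\partial J}{\partial W_1} &= \frac{1}{m} l_1 X^\top \\ \frac{\partial J}{\partial W_2} &= \frac{1}{m} (\hat{Y} - Y) H^\top \\ \frac{\partial J}{\partial b_1} &= \frac{1}{m} l_1 \mathbf{1}_m \\ \frac{\partial J}{\partial b_2} &= \frac{1}{m} (\hat{Y} - Y) \mathbf{1}_m \\ \end{align}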

# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: back_prop
def back_prop(x: numpy.ndarray,
              yhat: numpy.ndarray,
              y: numpy.ndarray,
              h: numpy.ndarray,
              W1: numpy.ndarray,
              W2: numpy.ndarray,
              b1: numpy.ndarray,
              b2: numpy.ndarray,
              batch_size: int) -> tuple:
    """Calculates the gradients

    Args: 
       x:  average one hot vector for the context 
       yhat: prediction (estimate of y)
       y:  target vector
       h:  hidden vector (see eq. 1)
       W1, W2, b1, b2:  matrices and biases  
       batch_size: batch size 

     Returns: 
       grad_W1, grad_W2, grad_b1, grad_b2:  gradients of matrices and biases   
    """
    ### START CODE HERE (Replace instances of 'None' with your code) ###

    # Compute l1 as W2^T (Yhat - Y)
    # Re-use it whenever you see W2^T (Yhat - Y) used to compute a gradient
    l1 = numpy.dot(W2.T, yhat - y)
    # Apply relu to l1
    l1 = numpy.maximum(l1, 0)
    # Compute the gradient of W1
    grad_W1 = numpy.dot(l1, x.T)/batch_size
    # Compute the gradient of W2
    grad_W2 = numpy.dot(yhat - y, h.T)/batch_size
    # Compute the gradient of b1
    grad_b1 = numpy.sum(l1, axis=Axis.COLUMNS.value, keepdims=True)/batch_size
    # Compute the gradient of b2
    grad_b2 = numpy.sum(yhat - y, axis=Axis.COLUMNS.value, keepdims=True)/batch_size
    ### END CODE HERE ###

    return grad_W1, grad_W2, grad_b1, grad_b2

Test the function

tmp_C = 2
tmp_N = 50
tmp_batch_size = 4
tmp_word2Ind, tmp_Ind2word = meta.word_to_index, meta.vocabulary
tmp_V = len(meta.vocabulary)

# get a batch of data
tmp_x, tmp_y = next(batch_generator(data, tmp_word2Ind, tmp_C, tmp_batch_size))

print("get a batch of data")
print(f"tmp_x.shape {tmp_x.shape}")
print(f"tmp_y.shape {tmp_y.shape}")

print()
print("Initialize weights and biases")
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)

print(f"tmp_W1.shape {tmp_W1.shape}")
print(f"tmp_W2.shape {tmp_W2.shape}")
print(f"tmp_b1.shape {tmp_b1.shape}")
print(f"tmp_b2.shape {tmp_b2.shape}")

print()
print("Forwad prop to get z and h")
tmp_z, tmp_h = forward_prop(tmp_x, tmp_W1, tmp_W2, tmp_b1, tmp_b2)
print(f"tmp_z.shape: {tmp_z.shape}")
print(f"tmp_h.shape: {tmp_h.shape}")

print()
print("Get yhat by calling softmax")
tmp_yhat = softmax(tmp_z)
print(f"tmp_yhat.shape: {tmp_yhat.shape}")

tmp_m = (2*tmp_C)
tmp_grad_W1, tmp_grad_W2, tmp_grad_b1, tmp_grad_b2 = back_prop(tmp_x, tmp_yhat, tmp_y, tmp_h, tmp_W1, tmp_W2, tmp_b1, tmp_b2, tmp_batch_size)

print()
print("call back_prop")
print(f"tmp_grad_W1.shape {tmp_grad_W1.shape}")
print(f"tmp_grad_W2.shape {tmp_grad_W2.shape}")
print(f"tmp_grad_b1.shape {tmp_grad_b1.shape}")
print(f"tmp_grad_b2.shape {tmp_grad_b2.shape}")


expect(tmp_x.shape).to(equal((5778, 4)))
expect(tmp_y.shape).to(equal((5778, 4)))
expect(tmp_W1.shape).to(equal((50, 5778)))
expect(tmp_W2.shape).to(equal((5778, 50)))
expect(tmp_b1.shape).to(equal((50, 1)))
expect(tmp_b2.shape).to(equal((5778, 1)))
expect(tmp_z.shape).to(equal((5778, 4)))
expect(tmp_h.shape).to(equal((50, 4)))
expect(tmp_yhat.shape).to(equal((5778, 4)))
expect(tmp_grad_W1.shape).to(equal((50, 5778)))
expect(tmp_grad_W2.shape).to(equal((5778, 50)))
expect(tmp_grad_b1.shape).to(equal((50, 1)))
expect(tmp_grad_b2.shape).to(equal((5778, 1)))
get a batch of data
tmp_x.shape (5778, 4)
tmp_y.shape (5778, 4)

Initialize weights and biases
tmp_W1.shape (50, 5778)
tmp_W2.shape (5778, 50)
tmp_b1.shape (50, 1)
tmp_b2.shape (5778, 1)

Forward prop to get z and h
tmp_z.shape: (5778, 4)
tmp_h.shape: (50, 4)

Get yhat by calling softmax
tmp_yhat.shape: (5778, 4)

call back_prop
tmp_grad_W1.shape (50, 5778)
tmp_grad_W2.shape (5778, 50)
tmp_grad_b1.shape (50, 1)
tmp_grad_b2.shape (5778, 1)

Gradient Descent

Now that you have implemented a function to compute the gradients, you will implement batch gradient descent over your training set.

Hint: For that, you will use the initialize_model and back_prop functions you just created (along with compute_cost). You can also use the batch_generator function defined above.

Also: print the cost every ten batches (use a batch size of 128).
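
Each batch then performs one standard gradient descent update with learning rate \(\alpha\):

\begin{align} W_1 &\leftarrow W_1 - \alpha \frac{\partial J}{\partial W_1} \\ W_2 &\leftarrow W_2 - \alpha \frac{\partial J}{\partial W_2} \\ b_1 &\leftarrow b_1 - \alpha \frac{\partial J}{\partial b_1} \\ b_2 &\leftarrow b_2 - \alpha \frac{\partial J}{\partial b_2} \\ \end{align}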

# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: gradient_descent
def gradient_descent(data: numpy.ndarray, word2Ind: dict, N: int, V: int ,
                     num_iters: int, alpha: float=0.03):
    """
    This is the gradient_descent function

    Args: 
       data:      text
       word2Ind:  words to Indices
       N:         dimension of hidden vector  
       V:         dimension of vocabulary 
       num_iters: number of iterations
       alpha:     the learning rate

    Returns: 
       W1, W2, b1, b2:  updated matrices and biases   
    """
    W1, W2, b1, b2 = initialize_model(N,V, random_seed=282)
    batch_size = 128
    iters = 0
    C = 2
    for x, y in batch_generator(data, word2Ind, C, batch_size):
        ### START CODE HERE (Replace instances of 'None' with your own code) ###
        # Get z and h
        z, h = forward_prop(x, W1, W2, b1, b2)
        # Get yhat
        yhat = softmax(z)
        # Get cost
        cost = compute_cost(y, yhat, batch_size)
        if ( (iters+1) % 10 == 0):
            print(f"iters: {iters + 1} cost: {cost:.6f}")
        # Get gradients
        grad_W1, grad_W2, grad_b1, grad_b2 = back_prop(x,
                                                       yhat,
                                                       y,
                                                       h,
                                                       W1,
                                                       W2,
                                                       b1,
                                                       b2,
                                                       batch_size)

        # Update weights and biases
        W1 = W1 - alpha * grad_W1
        W2 = W2 - alpha * grad_W2
        b1 = b1 - alpha * grad_b1
        b2 = b2 - alpha * grad_b2

        ### END CODE HERE ###

        iters += 1 
        if iters == num_iters: 
            break
        if iters % 100 == 0:
            alpha *= 0.66

    return W1, W2, b1, b2

Test Your Function

C = 2
N = 50
V = len(meta.vocabulary)
num_iters = 150
print("Call gradient_descent")
W1, W2, b1, b2 = gradient_descent(data, meta.word_to_index, N, V, num_iters)
Call gradient_descent
iters: 10 cost: 0.789141
iters: 20 cost: 0.105543
iters: 30 cost: 0.056008
iters: 40 cost: 0.038101
iters: 50 cost: 0.028868
iters: 60 cost: 0.023237
iters: 70 cost: 0.019444
iters: 80 cost: 0.016716
iters: 90 cost: 0.014660
iters: 100 cost: 0.013054
iters: 110 cost: 0.012133
iters: 120 cost: 0.011370
iters: 130 cost: 0.010698
iters: 140 cost: 0.010100
iters: 150 cost: 0.009566

End

The next post is one on extracting and visualizing the embeddings using Principal Component Analysis.

Bundling It Up

Imports

# python
from collections import Counter, namedtuple
from enum import Enum, unique

# pypi
import attr
import numpy

Enum Setup

@unique
class Axis(Enum):
    ROWS = 0
    COLUMNS = 1

Named Tuples

Gradients = namedtuple("Gradients", ["input_weights", "hidden_weights", "input_bias", "hidden_bias"])

Weights = namedtuple("Weights", ["input_weights", "hidden_weights", "input_bias", "hidden_bias"])

The CBOW Model

@attr.s(auto_attribs=True)
class CBOW:
    """A continuous bag of words model builder

    Args:
     hidden: number of units in the hidden layer
     vocabulary_size: number of tokens in the vocabulary
     learning_rate: learning rate for back-propagation updates
     random_seed: seed for the random number generator
    """
    hidden: int
    vocabulary_size: int
    learning_rate: float=0.03
    random_seed: int=1    
    _random_generator: numpy.random.Generator=None

    # layer one
    _input_weights: numpy.ndarray=None
    _input_bias: numpy.ndarray=None

    # hidden layer
    _hidden_weights: numpy.ndarray=None
    _hidden_bias: numpy.ndarray=None
  • The Random Generator
    @property
    def random_generator(self) -> numpy.random.Generator:
        """The random number generator"""
        if self._random_generator is None:
            self._random_generator = numpy.random.default_rng(self.random_seed)
        return self._random_generator
    
  • First Layer Weights

    These are initialized using numpy's new Generator. I originally used its standard-normal method by mistake and the model did horribly. Generator.random gives you a uniform distribution, which seems to be what you're supposed to use.

    @property
    def input_weights(self) -> numpy.ndarray:
        """Weights for the first layer"""
        if self._input_weights is None:
            self._input_weights = self.random_generator.random(
                (self.hidden, self.vocabulary_size))
        return self._input_weights
    
  • First Layer Bias
    @property
    def input_bias(self) -> numpy.ndarray:
        """Bias for the input layer"""
        if self._input_bias is None:
            self._input_bias = self.random_generator.random(
                (self.hidden, 1)
            )
        return self._input_bias
    
  • Hidden Layer Weights
    @property
    def hidden_weights(self) -> numpy.ndarray:
        """The weights for the hidden layer"""
        if self._hidden_weights is None:
            self._hidden_weights = self.random_generator.random(
                (self.vocabulary_size, self.hidden)
            )
        return self._hidden_weights
    
  • Hidden Layer Bias
    @property
    def hidden_bias(self) -> numpy.ndarray:
        """Bias for the hidden layer"""
        if self._hidden_bias is None:
            self._hidden_bias = self.random_generator.random(
                (self.vocabulary_size, 1)
            )
        return self._hidden_bias
    
  • Softmax
    def softmax(self, scores: numpy.ndarray) -> numpy.ndarray:
        """Calculate the softmax
    
        Args: 
           scores: output scores from the hidden layer
        Returns: 
           yhat: prediction (estimate of y)"""
        return numpy.exp(scores)/numpy.sum(numpy.exp(scores), axis=Axis.ROWS.value)
    
  • Forward Propagation
    def forward(self, data: numpy.ndarray) -> tuple:
        """makes a model prediction
    
        Args:
         data: x-values to train on
    
        Returns:
         output, first-layer output
        """
        first_layer_output = numpy.maximum(numpy.dot(self.input_weights, data)
                                      + self.input_bias, 0)
        second_layer_output = (numpy.dot(self.hidden_weights, first_layer_output)
                       + self.hidden_bias)
        return second_layer_output, first_layer_output
    
  • Gradients
    def gradients(self, data: numpy.ndarray,
                  predicted: numpy.ndarray,
                  actual: numpy.ndarray,
                  hidden_input: numpy.ndarray) -> Gradients:
        """does the gradient calculation for back-propagation
    
        This is broken out to be able to troubleshoot/compare it
    
       Args:
         data: the input x value
         predicted: what our model predicted the labels for the data should be
         actual: what the actual labels should have been
         hidden_input: the input to the hidden layer
        Returns:
         Gradients for input_weight, hidden_weight, input_bias, hidden_bias
        """
        difference = predicted - actual
        batch_size = difference.shape[1]
        l1 = numpy.maximum(numpy.dot(self.hidden_weights.T, difference), 0)
    
        input_weights_gradient = numpy.dot(l1, data.T)/batch_size
        hidden_weights_gradient = numpy.dot(difference, hidden_input.T)/batch_size
        input_bias_gradient = numpy.sum(l1,
                                        axis=Axis.COLUMNS.value,
                                        keepdims=True)/batch_size
        hidden_bias_gradient = numpy.sum(difference,
                                         axis=Axis.COLUMNS.value,
                                         keepdims=True)/batch_size
        return Gradients(input_weights=input_weights_gradient,
                         hidden_weights=hidden_weights_gradient,
                         input_bias=input_bias_gradient,
                         hidden_bias=hidden_bias_gradient)
    
  • Backward Propagation
    def backward(self, data: numpy.ndarray,
                 predicted: numpy.ndarray,
                 actual: numpy.ndarray,
                 hidden_input: numpy.ndarray) -> None:
        """Does back-propagation to update the weights
    
       Args:
         data: the input x value
         predicted: what our model predicted the labels for the data should be
         actual: what the actual labels should have been
         hidden_input: the input to the hidden layer
        """
        gradients = self.gradients(data=data,
                                   predicted=predicted,
                                   actual=actual,
                                   hidden_input=hidden_input)
        # I don't have setters for the properties so use the private variables
        self._input_weights -= self.learning_rate * gradients.input_weights
        self._hidden_weights -= self.learning_rate * gradients.hidden_weights
        self._input_bias -= self.learning_rate * gradients.input_bias
        self._hidden_bias -= self.learning_rate * gradients.hidden_bias
        return
    
  • Call
    def __call__(self, data: numpy.ndarray) -> numpy.ndarray:
        """makes a prediction on the data
    
        Args:
         data: input data for the prediction
    
        Returns:
         softmax of model output
        """
        output, _ = self.forward(data)
        return self.softmax(output)
    

Batch Generator

@attr.s(auto_attribs=True)
class Batches:
    """Generates batches of data

    Args:
     data: the source of the data to generate (training data)
     word_to_index: dict mapping the word to the vocabulary index
     half_window: number of tokens on either side of word to grab
     batch_size: the number of entries per batch
     batches: number of batches to generate before quitting
     verbose: whether to emit messages
    """
    data: numpy.ndarray
    word_to_index: dict
    half_window: int
    batch_size: int
    batches: int
    repetitions: int=0
    verbose: bool=False    
    _vocabulary_size: int=None
    _vectors: object=None
  • Vocabulary Size
    @property
    def vocabulary_size(self) -> int:
        """Number of tokens in the vocabulary"""
        if self._vocabulary_size is None:
            self._vocabulary_size = len(self.word_to_index)
        return self._vocabulary_size
    
  • Vectors
    @property
    def vectors(self):
        """our vector-generator started up"""
        if self._vectors is None:
            self._vectors = self.vector_generator()
        return self._vectors
    
  • Indices and Frequencies
    def indices_and_frequencies(self, context_words: list) -> list:
        """combines word-indexes and frequency counts-dict
    
        Args:
         context_words: words to get the indices for
    
        Returns:
         list of (word-index, word-count) tuples built from context_words
        """
        frequencies = Counter(context_words)
        indices = [self.word_to_index[word] for word in context_words]
        return [(indices[index], frequencies[context_words[index]])
                for index in range(len(indices))]
    
  • Vectors
    def vector_generator(self):
        """Generates vectors infinitely
    
        x: fraction of the context window each vocabulary word accounts for
        y: one-hot vector with a 1 at the center word's vocabulary index
    
        Yields:
         tuple of x, y 
        """
        location = self.half_window
        while True:
            y = numpy.zeros(self.vocabulary_size)
            x = numpy.zeros(self.vocabulary_size)
            center_word = self.data[location]
            y[self.word_to_index[center_word]] = 1
            context_words = (
                self.data[(location - self.half_window): location]
                + self.data[(location + 1) : (location + self.half_window + 1)])
    
            for word_index, frequency in self.indices_and_frequencies(context_words):
                x[word_index] = frequency/len(context_words)
            yield x, y
            location += 1
            if location >= len(self.data):
                if self.verbose:
                    print("location in data is being set to 0")
                location = 0
        return
    
  • Iterator Method
    def __iter__(self):
        """makes this into an iterator"""
        return self
    
  • Next Method
    def __next__(self) -> tuple:
        """Creates the batches and returns them
    
        Returns:
         x, y batches
        """
        batch_x = []
        batch_y = []
    
        if self.repetitions == self.batches:
            raise StopIteration()
        self.repetitions += 1    
        for x, y in self.vectors:
            if len(batch_x) < self.batch_size:
                batch_x.append(x)
                batch_y.append(y)
            else:
                return numpy.array(batch_x).T, numpy.array(batch_y).T
        return
    

The Trainer

@attr.s(auto_attribs=True)
class TheTrainer:
    """Something to train the model

    Args:
     model: thing to train
     batches: batch generator
     learning_impairment: rate to slow the model's learning
     impairment_point: how frequently to impair the learner
     emit_point: how frequently to emit messages
     verbose: whether to emit messages
    """
    model: CBOW
    batches: Batches
    learning_impairment: float=0.66
    impairment_point: int=100
    emit_point: int=10
    verbose: bool=False
    _losses: list=None
  • Losses
    @property
    def losses(self) -> list:
        """Holder for the training losses"""
        if self._losses is None:
            self._losses = []
        return self._losses
    
  • Gradient Descent
    def __call__(self):    
        """Trains the model using gradient descent
        """
        self.best_loss = float("inf")
        for repetitions, x_y in enumerate(self.batches):
            x, y = x_y
            output, hidden_input = self.model.forward(x)
            predictions = self.model.softmax(output)
    
            loss = self.cross_entropy_loss(predicted=predictions, actual=y)
            if loss < self.best_loss:
                self.best_loss = loss
                self.best_weights = Weights(
                    self.model.input_weights.copy(),
                    self.model.hidden_weights.copy(),
                    self.model.input_bias.copy(),
                    self.model.hidden_bias.copy(),
                )
            self.losses.append(loss)
            self.model.backward(data=x, predicted=predictions, actual=y,
                                hidden_input=hidden_input)
            if ((repetitions + 1) % self.impairment_point) == 0:
                self.model.learning_rate *= self.learning_impairment
                if self.verbose:
                    print(f"new learning rate: {self.model.learning_rate}")
            if self.verbose and ((repetitions + 1) % self.emit_point == 0):
                print(f"{repetitions + 1}: loss={self.losses[repetitions]}")
        return 
    
  • Cross-Entropy-Loss
    def cross_entropy_loss(self, predicted: numpy.ndarray,
                           actual: numpy.ndarray) -> numpy.ndarray:
        """Calculates the cross-entropy loss
    
        Args:
         predicted: array with the model's guesses
         actual: array with the actual labels
    
        Returns:
         the cross-entropy loss
        """
        log_probabilities = (numpy.multiply(numpy.log(predicted), actual)
                             + numpy.multiply(numpy.log(1 - predicted), 1 - actual))
        cost = -numpy.sum(log_probabilities)/self.batches.batch_size
        return numpy.squeeze(cost)
    

Testing It

from neurotic.nlp.word_embeddings import Batches, CBOW, TheTrainer

N = 4
V = len(meta.vocabulary)
model = CBOW(hidden=N, vocabulary_size=V)


expect(model.vocabulary_size).to(equal(V))
expect(model.input_weights.shape).to(equal((N, V)))
expect(model.hidden_weights.shape).to(equal((V, N)))
expect(model.input_bias.shape).to(equal((N, 1)))
expect(model.hidden_bias.shape).to(equal((V, 1)))

tmp = numpy.array([[1,2,3],
                   [1,1,1]
                   ])
tmp_sm = model.softmax(tmp)
expected =  numpy.array([[0.5, 0.73105858, 0.88079708],
                         [0.5, 0.26894142, 0.11920292]])


expect(numpy.allclose(tmp_sm, expected)).to(be_true)

Forward Propagation

tmp_N = 2
tmp_V = 3
tmp_x = numpy.array([[0,1,0]]).T

model = CBOW(hidden=tmp_N, vocabulary_size=tmp_V)
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(N=tmp_N,V=tmp_V, random_seed=1)

model._input_weights = tmp_W1
model._hidden_weights = tmp_W2
model._input_bias = tmp_b1
model._hidden_bias = tmp_b2

tmp_z, tmp_h = model.forward(tmp_x)

expect(tmp_x.shape).to(equal((3, 1)))
expect(tmp_z.shape).to(equal((3, 1)))
expect(tmp_h.shape).to(equal((2, 1)))

expected = numpy.array(
    [[0.55379268],
     [1.58960774],
     [1.50722933]]
)
expect(numpy.allclose(tmp_z, expected)).to(be_true)

expected = numpy.array(
    [[0.92477674],
     [1.02487333]]
)

expect(numpy.allclose(tmp_h, expected)).to(be_true)

Cross Entropy Loss

tmp_C = 2
tmp_N = 50
tmp_batch_size = 4
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  batches=15,
                  half_window=tmp_C, batch_size=tmp_batch_size)

tmp_V = len(meta.vocabulary)

tmp_x, tmp_y = next(batch_generator(data, meta.word_to_index, tmp_C, tmp_batch_size))
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)

model = CBOW(hidden=tmp_N, vocabulary_size=tmp_V)
model._input_weights = tmp_W1
model._hidden_weights = tmp_W2
model._input_bias = tmp_b1
model._hidden_bias = tmp_b2

tmp_z, tmp_h = model.forward(tmp_x)

tmp_yhat = model.softmax(tmp_z)

train = TheTrainer(model=model, batches=batches, verbose=True)
tmp_cost = train.cross_entropy_loss(actual=tmp_y, predicted=tmp_yhat)

expect(tmp_x.shape).to(equal((5778, 4)))
expect(tmp_y.shape).to(equal((5778, 4)))
expect(tmp_W1.shape).to(equal((50, 5778)))
expect(tmp_W2.shape).to(equal((5778, 50)))
expect(tmp_b1.shape).to(equal((50, 1)))
expect(tmp_b2.shape).to(equal((5778, 1)))
expect(tmp_z.shape).to(equal((5778, 4)))
expect(tmp_h.shape).to(equal((50, 4)))
expect(tmp_yhat.shape).to(equal((5778, 4)))
expect(math.isclose(tmp_cost, 9.9560, abs_tol=1e-4)).to(be_true)

Back Propagation

tmp_C = 2
tmp_N = 50
tmp_batch_size = 4

# get a batch of data
tmp_x, tmp_y = next(batch_generator(data, meta.word_to_index, tmp_C, tmp_batch_size))
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)
model = CBOW(hidden=tmp_N, vocabulary_size=tmp_V)
model._input_weights = tmp_W1
model._hidden_weights = tmp_W2
model._input_bias = tmp_b1
model._hidden_bias = tmp_b2
tmp_z, tmp_h = model.forward(tmp_x)
tmp_yhat = model.softmax(tmp_z)
print(f"tmp_yhat.shape: {tmp_yhat.shape}")

gradients = model.gradients(data=tmp_x, predicted=tmp_yhat, actual=tmp_y, hidden_input=tmp_h)
tmp_grad_W1, tmp_grad_W2, tmp_grad_b1, tmp_grad_b2 = back_prop(tmp_x, tmp_yhat, tmp_y, tmp_h, tmp_W1, tmp_W2, tmp_b1, tmp_b2, tmp_batch_size)

expect(numpy.allclose(gradients.input_weights, tmp_grad_W1)).to(be_true)
expect(numpy.allclose(gradients.hidden_weights, tmp_grad_W2)).to(be_true)
expect(numpy.allclose(gradients.input_bias, tmp_grad_b1)).to(be_true)
expect(numpy.allclose(gradients.hidden_bias, tmp_grad_b2)).to(be_true)

expect(tmp_x.shape).to(equal((5778, 4)))
expect(tmp_y.shape).to(equal((5778, 4)))
expect(tmp_W1.shape).to(equal((50, 5778)))
expect(tmp_W2.shape).to(equal((5778, 50)))
expect(tmp_b1.shape).to(equal((50, 1)))
expect(tmp_b2.shape).to(equal((5778, 1)))
expect(tmp_z.shape).to(equal((5778, 4)))
expect(tmp_h.shape).to(equal((50, 4)))
expect(tmp_yhat.shape).to(equal((5778, 4)))
expect(tmp_grad_W1.shape).to(equal((50, 5778)))
expect(tmp_grad_W2.shape).to(equal((5778, 50)))
expect(tmp_grad_b1.shape).to(equal((50, 1)))
expect(tmp_grad_b2.shape).to(equal((5778, 1)))

Putting Some Stuff Together

tmp_C = 2
tmp_N = 50
tmp_batch_size = 4
hidden_layers = 50

batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  batches=15,
                  half_window=tmp_C, batch_size=tmp_batch_size)
tmp_x, tmp_y = next(batches)
model = CBOW(hidden=hidden_layers, vocabulary_size=len(meta.vocabulary))
prediction = model(tmp_x)

train = TheTrainer(model=model, batches=batches, verbose=True)
print(train.cross_entropy_loss(predicted=prediction, actual=tmp_y))
print(compute_cost(tmp_y, prediction, tmp_batch_size))

# using their initial weights
tmp_W1, tmp_W2, tmp_b1, tmp_b2 = initialize_model(tmp_N,tmp_V)

model = CBOW(hidden=tmp_N, vocabulary_size=tmp_V)
expect(model.input_weights.shape).to(equal(tmp_W1.shape))
expect(model.hidden_weights.shape).to(equal(tmp_W2.shape))
expect(model.input_bias.shape).to(equal(tmp_b1.shape))
expect(model.hidden_bias.shape).to(equal(tmp_b2.shape))

model._input_weights = tmp_W1
model._hidden_weights = tmp_W2
model._input_bias = tmp_b1
model._hidden_bias = tmp_b2

tmp_x, tmp_y = next(batch_generator(data, meta.word_to_index, tmp_C, tmp_batch_size))
prediction = model(tmp_x)

train = TheTrainer(model=model, batches=batches, verbose=True)
print(train.cross_entropy_loss(predicted=prediction, actual=tmp_y))
print(compute_cost(tmp_y, prediction, tmp_batch_size))
11.871189103548419
11.871189103548419
9.956016099656951
9.956016099656951

I changed the weights to use the uniform distribution, which seems to work better, but weirdly it still does a little worse initially. The random seed behaves differently for the legacy numpy.random functions and the new Generator, so the initial weights (and therefore the initial cost) don't match exactly.
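
Here's a quick check (toy sizes, just for illustration) that the two random streams really do differ even when given the same seed:

numpy.random.seed(1)
legacy_sample = numpy.random.rand(3)
new_sample = numpy.random.default_rng(1).random(3)
# same seed, different algorithms, so the sequences don't line up
expect(not numpy.allclose(legacy_sample, new_sample)).to(be_true)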

The Batches

The original batch-generator had a couple of bugs in it. To avoid them pass in original=False.

tmp_C = 2
tmp_N = 50
tmp_batch_size = 4

batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  batches=5,
                  half_window=tmp_C, batch_size=tmp_batch_size)


old_generator = batch_generator(cleaner.processed, meta.word_to_index, tmp_C,
                                tmp_batch_size, original=False)


old_x, old_y = next(old_generator)
tmp_x, tmp_y = next(batches)
expect(numpy.allclose(tmp_x, old_x)).to(be_true)
expect(numpy.allclose(tmp_y, old_y)).to(be_true)


old_x, old_y = next(old_generator)
tmp_x, tmp_y = next(batches)
#expect(numpy.allclose(tmp_x, old_x)).to(be_true)
#expect(numpy.allclose(tmp_y, old_y)).to(be_true)

old_x, old_y = next(old_generator)
tmp_x, tmp_y = next(batches)

Gradient Descent

hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 150

model = CBOW(hidden=hidden_layers, vocabulary_size=len(meta.vocabulary))
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=batch_size, batches=repetitions)
train = TheTrainer(model=model, batches=batches, verbose=True)
train()
10: loss=12.949165499168524
20: loss=7.1739091478289225
30: loss=13.431976455238479
40: loss=4.0062314323745545
50: loss=11.595407087927406
60: loss=10.41983077447342
70: loss=7.843047289924249
80: loss=12.529314536141994
90: loss=14.122707806423126
new learning rate: 0.0198
100: loss=10.80530164111974
110: loss=4.624869443165228
120: loss=5.552813055551899
130: loss=8.483428176366933
140: loss=9.047299388851195
150: loss=4.841072955589429

Gradient Re-do

Something's wrong with the trainer's gradient descent, so I'm going to adapt the original gradient_descent function to drive the CBOW model and see where things diverge.

def grady_the_ent(model: CBOW, data: numpy.ndarray,
                  num_iters: int, batches: Batches, alpha: float=0.03):
    """The gradient_descent function re-worked to use the CBOW model

    Args: 
       model:     the CBOW model to train
       data:      text (kept for parity with the original signature, unused here)
       num_iters: number of iterations
       batches:   generator of training batches
       alpha:     the learning rate

    Returns: 
       None (the model's weights and biases are updated in place)
    """
    batch_size = 128
    iters = 0
    C = 2
    for x, y in batches:
        z, h = model.forward(x)
        # Get yhat
        yhat = model.softmax(z)
        # Get cost
        cost = compute_cost(y, yhat, batch_size)
        if ((iters+1) % 10 == 0):
            print(f"iters: {iters + 1} cost: {cost:.6f}")
        grad_W1, grad_W2, grad_b1, grad_b2 = model.gradients(x,
                                                             yhat,
                                                             y,
                                                             h)

        # Update weights and biases
        model._input_weights -= alpha * grad_W1
        model._hidden_weights -= alpha * grad_W2
        model._input_bias -=  alpha * grad_b1
        model._hidden_bias -=  alpha * grad_b2

        iters += 1 
        if iters == num_iters: 
            break
        if iters % 100 == 0:
            alpha *= 0.66

    return
hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 150
vocabulary_size = len(meta.vocabulary)

model = CBOW(hidden=hidden_layers, vocabulary_size=vocabulary_size)
# batch_generator(data, word2Ind, C, batch_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=batch_size, batches=repetitions)

grady_the_ent(model, cleaner.processed, repetitions, batches=batches)
iters: 10 cost: 12.949165
iters: 20 cost: 7.173909
iters: 30 cost: 13.431976
iters: 40 cost: 4.006231
iters: 50 cost: 11.595407
iters: 60 cost: 10.419831
iters: 70 cost: 7.843047
iters: 80 cost: 12.529315
iters: 90 cost: 14.122708
iters: 100 cost: 10.805302
iters: 110 cost: 4.624869
iters: 120 cost: 5.552813
iters: 130 cost: 8.483428
iters: 140 cost: 9.047299
iters: 150 cost: 4.841073

So, something's wrong with the gradient descent.

hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 150
vocabulary_size = len(meta.vocabulary)

model = CBOW(hidden=hidden_layers, vocabulary_size=vocabulary_size)
batches = batch_generator(data, meta.word_to_index, C, batch_size)
#batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
#                  half_window=half_window, batch_size=batch_size, batches=repetitions)

grady_the_ent(model, cleaner.processed, repetitions, batches=batches)
iters: 10 cost: 0.407862
iters: 20 cost: 0.090807
iters: 30 cost: 0.050924
iters: 40 cost: 0.035379
iters: 50 cost: 0.027105
iters: 60 cost: 0.021969
iters: 70 cost: 0.018470
iters: 80 cost: 0.015932
iters: 90 cost: 0.014008
iters: 100 cost: 0.012499
iters: 110 cost: 0.011631
iters: 120 cost: 0.010911
iters: 130 cost: 0.010274
iters: 140 cost: 0.009708
iters: 150 cost: 0.009201

It looks like it's the batches.

Troubleshooting the Batches

half_window = 2
batch_size = 128
repetitions = 150

batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=batch_size, batches=repetitions)

start = random.randint(0, 100)
context = cleaner.processed[start: start + half_window] + cleaner.processed[start + half_window + 1: start + 2 * half_window + 1]
packed_1 = index_with_frequency(context, meta.word_to_index)
packed_2 = batches.indices_and_frequencies(context)
expect(packed_1).to(contain_exactly(*packed_2))

So the indices-and-frequencies method is okay.

half_window = 2

v = vectors(cleaner.processed, meta.word_to_index, half_window)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=batch_size, batches=repetitions)
repetition = 0
for old, new in zip(v, batches.vectors):
    expect((old[0] == new[0]).all()).to(equal(True))
    expect((old[1] == new[1]).all()).to(equal(True))
    repetition += 1
    if repetition == repetitions:
        break

And the vectors look okay.

old_generator = batch_generator(cleaner.processed, meta.word_to_index, tmp_C, tmp_batch_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=tmp_batch_size, batches=repetitions)
repetitions = 150
repetition = 0
# batch = next(batches)
for old in old_generator:
    batch_x = []
    batch_y = []
    for x, y in batches.vectors:
        while len(batch_x) < batches.batch_size:
            batch_x.append(x)
            batch_y.append(y)
        else:
            newx, newy = numpy.array(batch_x).T, numpy.array(batch_y).T
            expect((old[0]==newx).all()).to(equal(True))
            repetition += 1
            if repetition == repetitions:
                break
    else:
        continue
    break

So, weirdly, rolling the __next__ method by hand seems to work.

old_generator = batch_generator(cleaner.processed, meta.word_to_index, tmp_C, tmp_batch_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=tmp_batch_size, batches=repetitions)

repetition, repetitions = 0, 150
for old, new in zip(old_generator, batches):
    try:
        expect((old[0] == new[0]).all()).to(equal(True))
        expect((old[1] == new[1]).all()).to(equal(True))
    except AssertionError:
        print(repetition)
        break
    repetition += 1
    if repetition == repetitions:
        break
1

But not the batches.

old_generator = batch_generator(cleaner.processed, meta.word_to_index, tmp_C, tmp_batch_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=tmp_batch_size, batches=repetitions)

repetition, repetitions = 0, 150
for old in old_generator:
    new = next(batches)
    expect(old[0].shape).to(equal(new[0].shape))
    try:
        expect((old[0] == new[0]).all()).to(equal(True))
        expect((old[1] == new[1]).all()).to(equal(True))
    except AssertionError:
        print(repetition)
        break
    repetition += 1
    if repetition == repetitions:
        break

Actually, it looks like the old generator might be broken.

hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 150
vocabulary_size = len(meta.vocabulary)

model = CBOW(hidden=hidden_layers, vocabulary_size=vocabulary_size)
batches = batch_generator(data, meta.word_to_index, C, batch_size)
#batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
#                  half_window=half_window, batch_size=batch_size, batches=repetitions)

grady_the_ent(model, cleaner.processed, repetitions, batches=batches)
iters: 10 cost: 12.949165
iters: 20 cost: 7.173909
iters: 30 cost: 13.431976
iters: 40 cost: 4.006231
iters: 50 cost: 11.595407
iters: 60 cost: 10.419831
iters: 70 cost: 7.843047
iters: 80 cost: 12.529315
iters: 90 cost: 14.122708
iters: 100 cost: 10.805302
iters: 110 cost: 4.624869
iters: 120 cost: 5.552813
iters: 130 cost: 8.483428
iters: 140 cost: 9.047299
iters: 150 cost: 4.841073

The old generator wasn't creating new lists each time, so it was just fitting the same batch of data over and over. In fact it had a while loop instead of a conditional, so it built a single batch with the same x and y repeated again and again; that should really have given the worst performance, not the very good numbers the original generator appeared to produce. I didn't re-run the cells above, but this next set is run after fixing my implementation.

hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 300
vocabulary_size = len(meta.vocabulary)

model = CBOW(hidden=hidden_layers, vocabulary_size=vocabulary_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=batch_size, batches=repetitions)
trainer = TheTrainer(model, batches, emit_point=50)
with TIMER:
    trainer()
2020-12-16 14:15:54,530 graeae.timers.timer start: Started: 2020-12-16 14:15:54.530779
2020-12-16 14:16:18,600 graeae.timers.timer end: Ended: 2020-12-16 14:16:18.600880
2020-12-16 14:16:18,602 graeae.timers.timer end: Elapsed: 0:00:24.070101
print(trainer.losses[0], trainer.losses[-1])
11.99601105791401 8.827228045367379

Not a huge improvement, but it didn't run for a long time either.

hidden_layers = 50
half_window = 2
batch_size = 128
repetitions = 1000
vocabulary_size = len(meta.vocabulary)

model = CBOW(hidden=hidden_layers, vocabulary_size=vocabulary_size)
batches = Batches(data=cleaner.processed, word_to_index=meta.word_to_index,
                  half_window=half_window, batch_size=batch_size, batches=repetitions)

trainer = TheTrainer(model, batches, emit_point=100, verbose=True)
with TIMER:
    trainer()
2020-12-16 14:40:13,275 graeae.timers.timer start: Started: 2020-12-16 14:40:13.275964
new learning rate: 0.0198
100: loss=9.138356897918037
new learning rate: 0.013068000000000001
200: loss=9.077599951734605
new learning rate: 0.008624880000000001
300: loss=8.827228045367379
new learning rate: 0.005692420800000001
400: loss=8.556788482755191
new learning rate: 0.003756997728000001
500: loss=8.92744766914796
new learning rate: 0.002479618500480001
600: loss=9.052677036205138
new learning rate: 0.0016365482103168007
700: loss=8.914532962726918
new learning rate: 0.0010801218188090885
800: loss=8.885698480310062
new learning rate: 0.0007128804004139984
900: loss=9.042620463323736
2020-12-16 14:41:33,457 graeae.timers.timer end: Ended: 2020-12-16 14:41:33.457065
2020-12-16 14:41:33,458 graeae.timers.timer end: Elapsed: 0:01:20.181101
new learning rate: 0.000470501064273239
1000: loss=9.239992952104755

Hmm… doesn't seem to be improving.

losses = pandas.Series(trainer.losses)
line = holoviews.VLine(losses.idxmin()).opts(color=Plot.blue)
time_series = losses.hvplot().opts(title="Loss per Repetition",
                                   width=Plot.width, height=Plot.height,
                                   color=Plot.tan)

plot = time_series * line
output = Embed(plot=plot, file_name="training_1000")()
print(output)

Figure Missing

Since the losses are in a Series we can use its idxmin method to see when the losses bottomed out.

print(losses.idxmin())
247
print(losses.loc[247], losses.iloc[-1])
8.186490214727549 9.239992952104755

So it did the best at 247 and then got a little worse as we went along.

print(len(meta.word_to_index)/batch_size)
45.140625

So the vocabulary only covers about 45 batches' worth of words; I guess after that the model starts overfitting.